noaa significant storms noaa significant storms

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/datasets/noaa-significant-storms/noaa_significant_storms.json

Saturn profiled 14,770 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
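# Install the profiler (IPython shell escape; run this cell inside a notebook).
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source file and write the findings JSON.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/datasets/noaa-significant-storms/noaa_significant_storms.json",
    "--findings", "noaa-significant-storms-noaa_significant_storms.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # check=True raises CalledProcessError on a non-zero exit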

Summary confidence: high

This dataset contains 14,770 significant US storm events from the NOAA Storm Events Database, with 14 columns covering event type, location, date, magnitude, casualties, and property damage. Tornadoes dominate at 6,334 records (about 43% of rows), followed by Flash Flood, Thunderstorm Wind, and Flood — worth focusing on first since event_type drives most other fields. Geographically the events skew heavily to the central/southern US, with Texas alone accounting for 1,450 records and a long tail across 65 state values. Fatalities and injuries are highly zero-inflated (around 69% and 68% zeros respectively), so any casualty analysis should treat the non-zero tail separately. Note also that magnitude is missing for 51.8% of rows and damage_property is stored as text codes like '2.5M' and '1.00M' rather than numbers, which will need parsing before quantitative use.

citing: row_count · column_count · event_type.top_values · event_type.top_rate · state.top_values · state.n_unique · fatalities.top_rate · injuries.top_rate · magnitude.null_rate · damage_property.top_values · country.top_value · source.top_value

Out[4]:

saturn.schema() · 14 columns

column           kind         n       null%  unique  alerts
latitude         numeric      14,770  0.0%   7,810   -
longitude        numeric      14,770  0.0%   8,828   -
name             text         14,770  0.0%   6,660   multilingual, duplicates
description      text         14,770  0.0%   5,796   multilingual, duplicates
category         categorical  14,770  0.0%   1       imbalance
date             text         14,770  0.0%   5,058   one_word, allcaps, short_text, duplicates
country          categorical  14,770  0.0%   1       imbalance
event_type       categorical  14,770  0.0%   17      -
state            categorical  14,770  0.0%   65      -
magnitude        categorical  14,770  51.8%  170     null_rate
injuries         categorical  14,770  0.0%   178     -
fatalities       categorical  14,770  0.0%   49      -
damage_property  text         14,770  0.0%   1,014   one_word, allcaps, short_text, duplicates
source           categorical  14,770  0.0%   1       imbalance
Fig 1.
event_type · See how Tornado dominates the mix and how quickly the long tail drops off.
Top values for event_type (17 unique shown, of 17 total).
value                     count  share
Tornado                   6334   42.9%
Flash Flood               2358   16.0%
Thunderstorm Wind         2257   15.3%
Flood                     1777   12.0%
Hail                      1246   8.4%
Lightning                 574    3.9%
Heavy Rain                99     0.7%
Marine Strong Wind        43     0.3%
Debris Flow               43     0.3%
Marine Thunderstorm Wind  25     0.2%
Marine High Wind          5      0.0%
Dust Devil                3      0.0%
Waterspout                2      0.0%
Tropical Storm            1      0.0%
High Wind                 1      0.0%
Heat                      1      0.0%
Marine Lightning          1      0.0%
Fig 2.
state · Check the geographic concentration — Texas leads, followed by a cluster of southern and midwestern states.
Top values for state (20 unique shown, of 65 total).
value           count  share
TEXAS           1450   9.8%
MISSOURI        648    4.4%
ARKANSAS        602    4.1%
MISSISSIPPI     570    3.9%
GEORGIA         562    3.8%
ILLINOIS        560    3.8%
IOWA            527    3.6%
LOUISIANA       507    3.4%
TENNESSEE       499    3.4%
FLORIDA         498    3.4%
OKLAHOMA        490    3.3%
NEBRASKA        486    3.3%
ALABAMA         469    3.2%
WISCONSIN       463    3.1%
OHIO            441    3.0%
MICHIGAN        426    2.9%
NORTH CAROLINA  422    2.9%
KANSAS          418    2.8%
INDIANA         408    2.8%
KENTUCKY        383    2.6%
Fig 3.
fatalities · Notice the heavy zero-inflation; most events report no fatalities, with a thin but important tail.
Top values for fatalities (20 unique shown, of 49 total).
value  count  share
0      10209  69.1%
1      3208   21.7%
2      649    4.4%
3      222    1.5%
4      112    0.8%
5      74     0.5%
6      66     0.4%
7      38     0.3%
9      25     0.2%
10     24     0.2%
8      21     0.1%
11     20     0.1%
13     11     0.1%
16     10     0.1%
12     9      0.1%
14     8      0.1%
17     6      0.0%
20     6      0.0%
25     4      0.0%
23     3      0.0%
Fig 4.
damage_property · Watch for the recurring rounded codes like 2.5M and 1.00M that hint at estimation rather than measurement.
Character-length distribution for damage_property (mean: 4.381). Lengths are whole numbers, so the empty fractional bins of the original 40-bin histogram are omitted.
chars  count
0      368
1      264
2      1252
3      1172
4      3414
5      6075
6      1450
7      514
8      261
Fig 5.
latitude · Confirm the bulk of events sit in mid-US latitudes, with a few far-flung outliers worth inspecting.
Histogram bins for latitude (median: 37.12). Runs of consecutive empty bins are merged into one row.
bin              count
-14.32 – -12.21  3
-12.21 – 6.789   0 (9 empty bins)
6.789 – 8.9      2
8.9 – 13.12      0 (2 empty bins)
13.12 – 15.23    2
15.23 – 17.35    0
17.35 – 19.46    75
19.46 – 21.57    19
21.57 – 23.68    10
23.68 – 25.79    22
25.79 – 27.9     270
27.9 – 30.01     522
30.01 – 32.12    1240
32.12 – 34.24    2165
34.24 – 36.35    2333
36.35 – 38.46    1803
38.46 – 40.57    1901
40.57 – 42.68    2226
42.68 – 44.79    1382
44.79 – 46.9     515
46.9 – 49.01     232
49.01 – 55.35    0 (3 empty bins)
55.35 – 57.46    5
57.46 – 59.57    6
59.57 – 61.68    15
61.68 – 63.79    11
63.79 – 65.9     8
65.9 – 68.02     2
68.02 – 70.13    1
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column           kind         null %
latitude         numeric      0.0%
longitude        numeric      0.0%
name             text         0.0%
description      text         0.0%
category         categorical  0.0%
date             text         0.0%
country          categorical  0.0%
event_type       categorical  0.0%
state            categorical  0.0%
magnitude        categorical  51.8%
injuries         categorical  0.0%
fatalities       categorical  0.0%
damage_property  text         0.0%
source           categorical  0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 10,000 detected strings).
lang  count  share
en    9780   97.8%
es    134    1.3%
de    25     0.2%
ja    23     0.2%
no    10     0.1%
id    6      0.1%
fr    6      0.1%
it    5      0.1%
pt    4      0.0%
sr    2      0.0%
ru    2      0.0%
eu    2      0.0%
zh    1      0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
           latitude  longitude
latitude   +1.00     -0.31
longitude  -0.31     +1.00

latitude numeric feature

Geographic latitude coordinates spanning -14.3236 to 70.1269, with a mean of 37.28 and a median of 37.12, so most observations cluster in the northern mid-latitudes. The tight IQR (33.63 to 41.13) indicates a heavy concentration in temperate Northern Hemisphere regions; the 159 outliers (1.08%) sit in the sparse tails, from a handful of points south of the equator to far-northern points near 70°. Distribution is nearly symmetric (skew -0.18) but moderately peaked (kurtosis 3.34).

Treatment: Pair with longitude for geospatial features; consider binning or clustering rather than using raw value in linear models.

anthropic:claude-opus-4-7 · confidence high
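The outlier count can be sanity-checked from the reported quartiles. The sketch below assumes outliers are flagged with Tukey's 1.5 × IQR rule, which this report does not state; `tukey_fences` is an illustrative helper, not part of saturn-dissect.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for latitude in the stats table.
lower, upper = tukey_fences(33.63, 41.13)
print(f"latitude fences: {lower:.2f} to {upper:.2f}")
```

Under that assumption the fences fall at about 22.38 and 52.38. The latitude histogram bins that lie wholly outside those fences hold 149 rows, and the bin straddling the lower fence holds 10 more, so the reported 159 outliers is consistent with this rule without proving it.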
Out[14]:

saturn.columns["latitude"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 7,810
min -14.32
max 70.13
mean 37.28
median 37.12
std 5.247
q1 33.63
q3 41.13
iqr 7.499
skew -0.1787
kurtosis 3.341
n_outliers 159
outlier_rate 0.01077
zero_rate 0
Fig 9.
Distribution of latitude. Vertical dash marks the median.
Histogram bins for latitude (median: 37.12). Runs of consecutive empty bins are merged into one row.
bin              count
-14.32 – -12.21  3
-12.21 – 6.789   0 (9 empty bins)
6.789 – 8.9      2
8.9 – 13.12      0 (2 empty bins)
13.12 – 15.23    2
15.23 – 17.35    0
17.35 – 19.46    75
19.46 – 21.57    19
21.57 – 23.68    10
23.68 – 25.79    22
25.79 – 27.9     270
27.9 – 30.01     522
30.01 – 32.12    1240
32.12 – 34.24    2165
34.24 – 36.35    2333
36.35 – 38.46    1803
38.46 – 40.57    1901
40.57 – 42.68    2226
42.68 – 44.79    1382
44.79 – 46.9     515
46.9 – 49.01     232
49.01 – 55.35    0 (3 empty bins)
55.35 – 57.46    5
57.46 – 59.57    6
59.57 – 61.68    15
61.68 – 63.79    11
63.79 – 65.9     8
65.9 – 68.02     2
68.02 – 70.13    1

longitude numeric feature

Geographic longitude coordinates spanning -170.73 to 171.47, with values concentrated in the Western Hemisphere (median -90.22, IQR -96.4 to -84.23), consistent with North American locations. The distribution is heavy-tailed (kurtosis 55.6, skew 1.29), with 623 outliers (4.2%) lying outside the dominant cluster. There are no nulls or zeros, and 8,828 unique values across 14,770 rows indicate that some locations repeat.

Treatment: Pair with latitude for geospatial features; consider clustering or binning by region rather than using raw values linearly.

anthropic:claude-opus-4-7 · confidence high
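One simple way to follow the treatment above is to pair the two coordinates and snap each pair to a fixed grid cell. This is a minimal sketch: the 2-degree cell size is an arbitrary assumption, `grid_cell` is a hypothetical helper, and the sample points are made up rather than taken from the dataset.

```python
import math
from collections import Counter


def grid_cell(lat: float, lon: float, size_deg: float = 2.0) -> tuple[int, int]:
    """Map a coordinate pair to an integer (row, col) cell of size_deg degrees."""
    return math.floor(lat / size_deg), math.floor(lon / size_deg)


# Hypothetical coordinates, not rows from the dataset.
points = [(32.75, -97.33), (32.90, -97.10), (61.20, -149.90), (-14.30, -170.70)]
cells = Counter(grid_cell(lat, lon) for lat, lon in points)
print(cells.most_common(1))
```

The resulting cell identifier can be used as a single categorical feature, which avoids treating raw latitude and longitude as independent linear inputs.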
Out[17]:

saturn.columns["longitude"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 8,828
min -170.7
max 171.5
mean -90.94
median -90.22
std 11.7
q1 -96.4
q3 -84.23
iqr 12.17
skew 1.286
kurtosis 55.61
n_outliers 623
outlier_rate 0.04218
zero_rate 0
Fig 10.
Distribution of longitude. Vertical dash marks the median.
Histogram bins for longitude (median: -90.22). Runs of consecutive empty bins are merged into one row.
bin              count
-170.7 – -162.2  4
-162.2 – -153.6  33
-153.6 – -145.1  28
-145.1 – -136.5  4
-136.5 – -128    11
-128 – -119.4    305
-119.4 – -110.8  466
-110.8 – -102.3  544
-102.3 – -93.74  3693
-93.74 – -85.18  5377
-85.18 – -76.63  3281
-76.63 – -68.07  941
-68.07 – -59.52  79
-59.52 – 137.2   0 (23 empty bins)
137.2 – 145.8    2
145.8 – 154.4    1
154.4 – 162.9    0
162.9 – 171.5    1

name text label

Templated event labels of the form '{event_type} in {STATE}, {COUNTY}' describing US severe weather incidents (tornado, flood, hail, and thunderstorm wind dominate top_words). With 14,770 rows but only 6,660 unique values and a 54.9% duplicate rate, the same state/county/event combinations recur heavily; 'Hail in TEXAS, TARRANT' alone appears 59 times. The 'multilingual' alert is misleading: 4,796 strings tag as English against tiny counts in other languages, almost certainly false positives from the proper-noun template.

Treatment: Parse into separate event_type, state, and county fields rather than using the raw string.

anthropic:claude-opus-4-7 · confidence high
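The parsing step suggested above could look like the sketch below. The pattern is inferred from the single example 'Hail in TEXAS, TARRANT'; the longer names in the tail (up to 134 characters) may not fit it, so the hypothetical `parse_name` helper returns None rather than guessing.

```python
import re
from typing import Optional

# Assumed template: "<event type> in <STATE>, <COUNTY>".
NAME_RE = re.compile(r"^(?P<event_type>.+?) in (?P<state>[^,]+), (?P<county>.+)$")


def parse_name(name: str) -> Optional[dict]:
    """Split a templated name into its parts, or return None if it does not fit."""
    match = NAME_RE.match(name.strip())
    return match.groupdict() if match else None


print(parse_name("Hail in TEXAS, TARRANT"))
print(parse_name("not a templated label"))
```

Counting how many rows return None is a cheap way to measure how well the assumed template covers the column before relying on the parsed fields.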
Out[20]:

saturn.columns["name"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 6,660
len_min 17
len_max 134
len_mean 30.22
len_median 29
len_p95 41
word_mean 4.588
word_median 4
n_empty 0
n_duplicates 8,110
duplicate_rate 0.5491
vocab_size 1,980
readability_flesch_mean 31.16
emoji_rate 0
url_rate 0
one_word_rate 0
allcaps_rate 0
boilerplate_rate 0
alert: multilingual · 13 languages detected in sample
alert: duplicates · 54.9% duplicate strings
Fig 11.
Character-length distribution for name.
Character-length distribution for name (mean: 30.22). Runs of consecutive empty bins are merged into one row.
chars      count
17 – 20    50
20 – 23    793
23 – 26    2165
26 – 29    3842
29 – 32    2969
32 – 35    1813
35 – 37    1442
37 – 40    915
40 – 43    493
43 – 46    153
46 – 49    45
49 – 52    4
52 – 55    3
55 – 58    4
58 – 61    6
61 – 64    9
64 – 67    5
67 – 70    7
70 – 73    3
73 – 76    10
76 – 78    7
78 – 81    5
81 – 84    6
84 – 87    3
87 – 90    3
90 – 93    3
93 – 96    1
96 – 99    5
99 – 102   3
102 – 116  0 (5 empty bins)
116 – 119  1
119 – 128  0 (3 empty bins)
128 – 131  1
131 – 134  1

description text metadata

This appears to be a templated event-summary field describing storm or disaster impacts (magnitude, injuries, fatalities, property damage in dollars), not free-form prose. Despite 14,770 rows, only 5,796 are unique and 60.8% are duplicates — the top value alone repeats 1,055 times — so the field carries far less information than its size suggests. The 'multilingual' alert is misleading: 4,984 rows tag as English against only 16 in other languages, likely noise from short numeric strings. Low Flesch (29.86) and a 7.4-word mean confirm terse, formulaic content rather than narrative text.

Treatment: Parse out structured fields (magnitude, injuries, fatalities, damage_usd) with regex rather than embedding the raw string.

anthropic:claude-opus-4-7 · confidence high
Out[23]:

saturn.columns["description"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 5,796
len_min 3
len_max 259
len_mean 50.09
len_median 36
len_p95 166
word_mean 7.393
word_median 5
n_empty 0
n_duplicates 8,974
duplicate_rate 0.6076
vocab_size 4,289
readability_flesch_mean 29.86
emoji_rate 0
url_rate 0
one_word_rate 0.0002708
allcaps_rate 0.0002708
boilerplate_rate 0
alert: multilingual · 5 languages detected in sample
alert: duplicates · 60.8% duplicate strings
Fig 12.
Character-length distribution for description.
Character-length distribution for description (mean: 50.09).
chars      count
3 – 9      4
9 – 16     13
16 – 22    3101
22 – 29    1689
29 – 35    2230
35 – 41    1186
41 – 48    572
48 – 54    1589
54 – 61    2168
61 – 67    902
67 – 73    7
73 – 80    6
80 – 86    14
86 – 93    10
93 – 99    15
99 – 105   25
105 – 112  30
112 – 118  30
118 – 125  34
125 – 131  47
131 – 137  71
137 – 144  64
144 – 150  52
150 – 157  77
157 – 163  60
163 – 169  74
169 – 176  66
176 – 182  65
182 – 189  54
189 – 195  69
195 – 201  82
201 – 208  70
208 – 214  72
214 – 221  78
221 – 227  64
227 – 233  29
233 – 240  20
240 – 246  11
246 – 253  12
253 – 259  8

category categorical metadata

This column is a constant tag identifying the dataset partition: every one of the 14,770 rows holds the single value "significant_us_storms". Cardinality is 1 with entropy 0.0 and a top_rate of 1.0, so it carries no information for modelling.

Treatment: Drop before modelling; retain only as a provenance label.

anthropic:claude-opus-4-7 · confidence high
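Constant columns like this one can be detected mechanically rather than dropped by name. The following is a minimal sketch over a list of row dictionaries; `constant_columns` is a hypothetical helper, and the two sample rows are invented using column names and constant values quoted in this report.

```python
def constant_columns(rows: list[dict]) -> list[str]:
    """Return the keys whose value is identical in every row."""
    if not rows:
        return []
    first = rows[0]
    return [key for key in first if all(row.get(key) == first[key] for row in rows)]


# Hypothetical two-row sample, not actual records.
rows = [
    {"category": "significant_us_storms", "country": "USA", "state": "TEXAS"},
    {"category": "significant_us_storms", "country": "USA", "state": "IOWA"},
]
drop = constant_columns(rows)
slim = [{k: v for k, v in row.items() if k not in drop} for row in rows]
print(drop)
```

Keeping the dropped values in a separate provenance record preserves the label without carrying a zero-entropy feature into modelling.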
Out[26]:

saturn.columns["category"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 1
top_value significant_us_storms
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 13.
Top values for category.
Top values for category (1 unique shown, of 1 total).
value                  count  share
significant_us_storms  14770  100.0%

date text timestamp

This is a date column stored as ISO-formatted text (YYYY-MM-DD), with every value exactly 10 characters long across 14,770 rows and no nulls. Values span at least 1965 to 2021, but heavy clustering — 9,712 duplicates (65.8%) and spikes like 1974-04-03 (126 rows) and 2011-04-27 (105 rows) — suggests events grouped on shared dates rather than unique daily records. Only 5,058 distinct dates appear, so this won't act as a row identifier.

Treatment: Parse to datetime and use for temporal joins or time-based features.

anthropic:claude-opus-4-7 · confidence high
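Because every value is a 10-character YYYY-MM-DD string, the standard library can parse the column directly. The sketch below uses the two spike dates mentioned above; `parse_event_date` is an illustrative wrapper, and the derived fields are just examples of time-based features.

```python
from datetime import date


def parse_event_date(text: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError on malformed input."""
    return date.fromisoformat(text)


# The two spike dates named in the column summary.
for raw in ("1974-04-03", "2011-04-27"):
    d = parse_event_date(raw)
    print(d.year, d.month, d.strftime("%A"))
```

Letting malformed values raise, instead of silently coercing them, makes any deviation from the ISO format visible early.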
Out[29]:

saturn.columns["date"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 5,058
len_min 10
len_max 10
len_mean 10
len_median 10
len_p95 10
word_mean 1
word_median 1
n_empty 0
n_duplicates 9,712
duplicate_rate 0.6575
vocab_size 5,058
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 65.8% duplicate strings
Fig 14.
Character-length distribution for date.
Character-length distribution for date (mean: 10.0). All 14,770 values are exactly 10 characters long, so the histogram has a single non-empty bin.
chars  count
10     14770

country categorical metadata

This column records country of origin but contains a single value, "USA", across all 14,770 rows. With cardinality of 1, entropy of 0, and a top_rate of 1.0, it carries no information. The imbalance alert is effectively a constant-column flag.

Treatment: Drop; constant column with no predictive signal.

anthropic:claude-opus-4-7 · confidence high
Out[32]:

saturn.columns["country"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 1
top_value USA
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 15.
Top values for country.
Top values for country (1 unique shown, of 1 total).
value  count  share
USA    14770  100.0%

event_type categorical label

Categorical label of severe weather event types across 14,770 rows with no nulls and only 17 distinct categories. Tornado dominates at 42.9% (6,334 records), followed by Flash Flood, Thunderstorm Wind, Flood, and Hail; tail categories like Marine Thunderstorm Wind have just 25 records. Entropy ratio of 0.57 confirms the distribution is heavily skewed toward a few classes.

Treatment: One-hot or target-encode; consider grouping rare marine categories before modelling.

anthropic:claude-opus-4-7 · confidence high
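Grouping rare categories before encoding, as the treatment suggests, might be sketched as follows. The counts are copied from the event_type table in this report; the 100-row threshold and the `group_rare` helper are assumptions chosen for illustration.

```python
# Counts copied from the event_type table in this report.
EVENT_COUNTS = {
    "Tornado": 6334, "Flash Flood": 2358, "Thunderstorm Wind": 2257,
    "Flood": 1777, "Hail": 1246, "Lightning": 574, "Heavy Rain": 99,
    "Marine Strong Wind": 43, "Debris Flow": 43,
    "Marine Thunderstorm Wind": 25, "Marine High Wind": 5, "Dust Devil": 3,
    "Waterspout": 2, "Tropical Storm": 1, "High Wind": 1, "Heat": 1,
    "Marine Lightning": 1,
}


def group_rare(counts: dict[str, int], min_count: int = 100) -> dict[str, str]:
    """Map each category to itself, or to 'Other' when it has under min_count rows."""
    return {k: (k if n >= min_count else "Other") for k, n in counts.items()}


mapping = group_rare(EVENT_COUNTS)
other_rows = sum(n for k, n in EVENT_COUNTS.items() if mapping[k] == "Other")
print(sorted(set(mapping.values())), other_rows)
```

With this illustrative threshold, 11 of the 17 categories (224 rows in total) collapse into 'Other', leaving seven levels to one-hot encode.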
Out[35]:

saturn.columns["event_type"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 17
top_value Tornado
top_rate 0.4288
cardinality 17
entropy 2.336
entropy_ratio 0.5715
Fig 16.
Top values for event_type.
Top values for event_type (17 unique shown, of 17 total).
value                     count  share
Tornado                   6334   42.9%
Flash Flood               2358   16.0%
Thunderstorm Wind         2257   15.3%
Flood                     1777   12.0%
Hail                      1246   8.4%
Lightning                 574    3.9%
Heavy Rain                99     0.7%
Marine Strong Wind        43     0.3%
Debris Flow               43     0.3%
Marine Thunderstorm Wind  25     0.2%
Marine High Wind          5      0.0%
Dust Devil                3      0.0%
Waterspout                2      0.0%
Tropical Storm            1      0.0%
High Wind                 1      0.0%
Heat                      1      0.0%
Marine Lightning          1      0.0%

state categorical feature

U.S. state names stored as uppercase strings, fully populated across 14,770 rows with no nulls. Cardinality is 65, well above the 50 states, suggesting territories, districts, or non-state codes are mixed in. Distribution is broad (entropy ratio 0.86) with Texas leading at 9.8% (1,450 rows), followed by Missouri, Arkansas, Mississippi, and Georgia.

Treatment: Normalize to a known state/territory code list, then one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
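The entropy ratio quoted above can be reproduced from the other two reported figures. The sketch assumes the ratio is Shannon entropy in bits divided by log2(cardinality); the report does not define it, but the reported values are consistent with that reading. `entropy_ratio` is an illustrative helper.

```python
import math


def entropy_ratio(entropy_bits: float, cardinality: int) -> float:
    """Normalise Shannon entropy (bits) by its maximum, log2(cardinality)."""
    if cardinality < 2:
        return 0.0  # a constant column has no spread
    return entropy_bits / math.log2(cardinality)


# (entropy, cardinality) pairs as reported in the stats tables.
reported = {"state": (5.182, 65), "event_type": (2.336, 17), "magnitude": (3.586, 170)}
for column, (h, k) in reported.items():
    print(column, round(entropy_ratio(h, k), 4))
```

A ratio near 1 means the rows are spread almost evenly across the categories, while a ratio near 0 means a few values dominate.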
Out[38]:

saturn.columns["state"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 65
top_value TEXAS
top_rate 0.09817
cardinality 65
entropy 5.182
entropy_ratio 0.8605
Fig 17.
Top values for state.
Top values for state (20 unique shown, of 65 total).
value           count  share
TEXAS           1450   9.8%
MISSOURI        648    4.4%
ARKANSAS        602    4.1%
MISSISSIPPI     570    3.9%
GEORGIA         562    3.8%
ILLINOIS        560    3.8%
IOWA            527    3.6%
LOUISIANA       507    3.4%
TENNESSEE       499    3.4%
FLORIDA         498    3.4%
OKLAHOMA        490    3.3%
NEBRASKA        486    3.3%
ALABAMA         469    3.2%
WISCONSIN       463    3.1%
OHIO            441    3.0%
MICHIGAN        426    2.9%
NORTH CAROLINA  422    2.9%
KANSAS          418    2.8%
INDIANA         408    2.8%
KENTUCKY        383    2.6%

magnitude categorical feature

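Likely a magnitude or measurement value stored as text rather than numeric, with 170 distinct string values dominated by "0" (3,863 rows, which is 54.2% of non-null values and 26.2% of all rows). More than half the column is missing (null_rate 0.5178). The remaining values mix small decimals like "1.75" and "2.75" with much larger ones like "70.00" and "61.00", suggesting either heterogeneous units or a compressed scale; the former would be consistent with the NOAA Storm Events convention of recording hail size in inches and wind speed in knots, but this should be confirmed against event_type. The same number also appears under different spellings ("70.00" and "70", "50.00" and "50"), so the 170 distinct strings overstate the number of distinct values. Entropy ratio of 0.48 confirms heavy concentration on the zero bucket.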

Treatment: Cast to numeric, treat "0" and nulls explicitly, and investigate whether the large vs small values reflect different units before modelling.

anthropic:claude-opus-4-7 · confidence medium
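A minimal sketch of the numeric cast follows. It assumes nulls arrive as None or blank strings and treats unparseable text as missing; both choices, and the `parse_magnitude` name, are illustrative rather than anything the report prescribes.

```python
from typing import Optional


def parse_magnitude(raw: Optional[str]) -> Optional[float]:
    """Convert a magnitude string to float; None or blank input stays None."""
    if raw is None or not raw.strip():
        return None
    try:
        return float(raw)
    except ValueError:
        return None  # unparseable text is treated as missing in this sketch


samples = ["0", "1.75", "70.00", "70", None]
print([parse_magnitude(s) for s in samples])
```

After the cast, "70.00" and "70" become the same value, and a literal zero stays distinct from a missing measurement.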
Out[41]:

saturn.columns["magnitude"].stats

stat value
n 14,770
nulls 7,648 (51.8%)
unique 170
top_value 0
top_rate 0.5424
cardinality 170
entropy 3.586
entropy_ratio 0.484
alert: null_rate · 51.8% null
Fig 18.
Top values for magnitude.
Top values for magnitude (20 unique shown, of 170 total).
value  count  share
0      3863   26.2%
1.75   383    2.6%
2.75   220    1.5%
70.00  162    1.1%
50.00  151    1.0%
2.00   150    1.0%
2.50   123    0.8%
61.00  122    0.8%
65.00  104    0.7%
52.00  95     0.6%
78.00  80     0.5%
70     79     0.5%
3.00   77     0.5%
56.00  76     0.5%
87.00  65     0.4%
60.00  63     0.4%
50     59     0.4%
60     54     0.4%
1.50   50     0.3%
61     47     0.3%

injuries categorical feature

This is an injury count stored as strings, with 178 distinct values dominated by '0' (68.1% of 14,770 rows). The next most common values ('1' through '7', plus '10' and '12') are clearly numeric, suggesting the column should be cast to integer rather than treated as categorical. Low entropy ratio (0.33) reflects the heavy zero-mass.

Treatment: Cast to integer and consider log1p or zero-inflated treatment given the zero-heavy distribution.

anthropic:claude-opus-4-7 · confidence high
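The cast-then-log1p treatment might be sketched as below. `log1p_count` is a hypothetical helper; rejecting negative counts is an added assumption, since an injury count should never be negative.

```python
import math


def log1p_count(raw: str) -> float:
    """Cast a count string to int, then apply log(1 + x); zero maps to 0.0."""
    count = int(raw)
    if count < 0:
        raise ValueError(f"negative count: {raw!r}")
    return math.log1p(count)


for raw in ("0", "1", "12", "30"):
    print(raw, round(log1p_count(raw), 3))
```

log1p keeps the dominant zero bucket at exactly 0 while compressing the long right tail, which is why it suits a zero-heavy count better than a plain logarithm.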
Out[44]:

saturn.columns["injuries"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 178
top_value 0
top_rate 0.6814
cardinality 178
entropy 2.468
entropy_ratio 0.3301
Fig 19.
Top values for injuries.
Top values for injuries (20 unique shown, of 178 total).
value  count  share
0      10064  68.1%
1      893    6.0%
2      552    3.7%
3      343    2.3%
4      236    1.6%
5      234    1.6%
10     219    1.5%
6      196    1.3%
12     158    1.1%
7      134    0.9%
8      121    0.8%
20     114    0.8%
15     111    0.8%
11     90     0.6%
9      85     0.6%
13     70     0.5%
14     69     0.5%
30     68     0.5%
25     56     0.4%
16     48     0.3%

fatalities categorical numeric_target

Counts of fatalities per event, stored as strings but numeric in content, with 49 distinct values. Heavily zero-inflated: 69.1% of 14,770 rows are "0" and the next bucket, "1", covers 3,208 more (21.7%), leaving a long thin tail in which each count of 5 or more appears in fewer than 100 rows. Low entropy ratio (0.25) confirms the distribution is dominated by a single value.

Treatment: Cast to integer and model as a zero-inflated count (e.g., zero-inflated Poisson or negative binomial) or binarise to fatal/non-fatal.

anthropic:claude-opus-4-7 · confidence high
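The binarise option is the simplest of the treatments named above. The sketch below uses the row total and zero-bucket count from this report; `is_fatal` is an illustrative helper name.

```python
# Row total and zero-bucket count, copied from the fatalities table.
TOTAL_ROWS = 14770
ZERO_ROWS = 10209


def is_fatal(raw: str) -> bool:
    """Binarise a fatalities string: True when at least one death was recorded."""
    return int(raw) > 0


zero_rate = ZERO_ROWS / TOTAL_ROWS
fatal_rows = TOTAL_ROWS - ZERO_ROWS
print(f"zero rate {zero_rate:.4f}, fatal events {fatal_rows}")
```

This prints a zero rate of 0.6912, matching the reported top_rate, and leaves 4561 events in the positive class.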
Out[47]:

saturn.columns["fatalities"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 49
top_value 0
top_rate 0.6912
cardinality 49
entropy 1.423
entropy_ratio 0.2535
Fig 20.
Top values for fatalities.
Top values for fatalities (20 unique shown, of 49 total).
value  count  share
0      10209  69.1%
1      3208   21.7%
2      649    4.4%
3      222    1.5%
4      112    0.8%
5      74     0.5%
6      66     0.4%
7      38     0.3%
9      25     0.2%
10     24     0.2%
8      21     0.1%
11     20     0.1%
13     11     0.1%
16     10     0.1%
12     9      0.1%
14     8      0.1%
17     6      0.0%
20     6      0.0%
25     4      0.0%
23     3      0.0%

damage_property text feature

This column encodes property damage estimates as short magnitude-suffixed strings (e.g. '2.5M', '250K', '0.00K'); every non-empty value is a single token of at most 8 characters. The format is inconsistent: some values use two decimals ('1.00M') while others do not ('1M', '25M'). In addition, 368 rows are empty strings rather than nulls, and the literal '0.00K' appears 1,229 times to denote zero damage. Duplication is extreme (93.1%) because the underlying domain is a small set of round-number estimates, yielding only 1,014 distinct values.

Treatment: Parse the K/M/B suffix into a numeric float, treat empty strings and '0.00K' explicitly, then log-transform before modelling.

anthropic:claude-opus-4-7 · confidence high
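The suffix parsing described above might be sketched as follows. The multipliers assume K, M and B mean thousand, million and billion dollars; treating a bare number as dollars and mapping empty strings to None are further assumptions, and `parse_damage` is a hypothetical helper.

```python
from typing import Optional

SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_damage(raw: str) -> Optional[float]:
    """Convert strings like '2.5M' to dollars; empty strings become None."""
    text = raw.strip().upper()
    if not text:
        return None  # empty string: damage not recorded
    multiplier = SUFFIX.get(text[-1])
    if multiplier is None:
        return float(text)  # assumption: a bare number is already in dollars
    return float(text[:-1]) * multiplier


for raw in ("2.5M", "250K", "0.00K", "1M", ""):
    print(repr(raw), parse_damage(raw))
```

Keeping None for empty strings separates "not recorded" from the explicit zero encoded as '0.00K'; a log1p transform can then be applied to the non-missing values.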
Out[50]:

saturn.columns["damage_property"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 1,014
len_min 0
len_max 8
len_mean 4.381
len_median 5
len_p95 7
word_mean 1
word_median 1
n_empty 368
n_duplicates 13,756
duplicate_rate 0.9313
vocab_size 1,013
readability_flesch_mean 117
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.8724
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 87.2% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 93.1% duplicate strings
Fig 21.
Character-length distribution for damage_property.
Character-length distribution for damage_property (mean: 4.381). Lengths are whole numbers, so the empty fractional bins of the original 40-bin histogram are omitted.
chars  count
0      368
1      264
2      1252
3      1172
4      3414
5      6075
6      1450
7      514
8      261

source categorical metadata

This column records the dataset's provenance, with every one of the 14,770 rows tagged 'NOAA Storm Events Database'. Cardinality is 1 and entropy is 0, so it carries no discriminative signal whatsoever.

Treatment: Drop before modelling; retain only as dataset-level provenance.

anthropic:claude-opus-4-7 · confidence high
Out[53]:

saturn.columns["source"].stats

stat value
n 14,770
nulls 0 (0.0%)
unique 1
top_value NOAA Storm Events Database
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 22.
Top values for source.
Top values for source (1 unique shown, of 1 total).
value                       count  share
NOAA Storm Events Database  14770  100.0%

How to cite

BibTeX
@misc{saturn-noaa-significant-storms-noaa-significant-storms-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: noaa significant storms noaa significant storms},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/noaa-significant-storms-noaa_significant_storms}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: noaa significant storms noaa significant storms. Source: /home/coolhand/datasets/noaa-significant-storms/noaa_significant_storms.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/noaa-significant-storms-noaa_significant_storms