noaa-significant-storms-noaa_significant_storms

Overview

Source: /home/coolhand/datasets/noaa-significant-storms/noaa_significant_storms.json

Saturn profiled 14,770 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

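# Install the profiler, then run it against the source file.
!pip install saturn-dissect
import subprocess

# --llm opts in to the language-model prose; the stats themselves are deterministic.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/datasets/noaa-significant-storms/noaa_significant_storms.json",
    "--findings", "noaa-significant-storms-noaa_significant_storms.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise if the CLI exits with a non-zero status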

Summary confidence: high

This dataset contains 14,770 significant US storm events from the NOAA Storm Events Database, with 14 columns covering event type, location, date, magnitude, casualties, and property damage. Tornadoes dominate at 6,334 records (about 43% of rows), followed by Flash Flood, Thunderstorm Wind, and Flood; event_type is worth examining first because the interpretation of other fields, magnitude in particular, likely depends on it. Geographically the events skew heavily to the central and southern US, with Texas alone accounting for 1,450 records and a long tail across 65 distinct state values. Fatalities and injuries are highly zero-inflated (around 69% and 68% zeros respectively), so any casualty analysis should treat the non-zero tail separately. Note also that magnitude is missing for 51.8% of rows and damage_property is stored as text codes like '2.5M' and '1.00M' rather than numbers, so it needs parsing before quantitative use.

citing: row_count · column_count · event_type.top_values · event_type.top_rate · state.top_values · state.n_unique · fatalities.top_rate · injuries.top_rate · magnitude.null_rate · damage_property.top_values · country.top_value · source.top_value

Out[4]:

saturn.schema() · 14 columns

column	kind	n	null%	unique	alerts
latitude	numeric	14,770	0.0%	7,810
longitude	numeric	14,770	0.0%	8,828
name	text	14,770	0.0%	6,660	multilingual duplicates
description	text	14,770	0.0%	5,796	multilingual duplicates
category	categorical	14,770	0.0%	1	imbalance
date	text	14,770	0.0%	5,058	one_word allcaps short_text duplicates
country	categorical	14,770	0.0%	1	imbalance
event_type	categorical	14,770	0.0%	17
state	categorical	14,770	0.0%	65
magnitude	categorical	14,770	51.8%	170	null_rate
injuries	categorical	14,770	0.0%	178
fatalities	categorical	14,770	0.0%	49
damage_property	text	14,770	0.0%	1,014	one_word allcaps short_text duplicates
source	categorical	14,770	0.0%	1	imbalance

Fig 1.

event_type · See how Tornado dominates the mix and how quickly the long tail drops off.

Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	42.9%
Flash Flood	2358	16.0%
Thunderstorm Wind	2257	15.3%
Flood	1777	12.0%
Hail	1246	8.4%
Lightning	574	3.9%
Heavy Rain	99	0.7%
Marine Strong Wind	43	0.3%
Debris Flow	43	0.3%
Marine Thunderstorm Wind	25	0.2%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

Fig 2.

state · Check the geographic concentration — Texas leads, followed by a cluster of southern and midwestern states.

Top values for state (20 unique shown, of 65 total).
value	count	share
TEXAS	1450	9.8%
MISSOURI	648	4.4%
ARKANSAS	602	4.1%
MISSISSIPPI	570	3.9%
GEORGIA	562	3.8%
ILLINOIS	560	3.8%
IOWA	527	3.6%
LOUISIANA	507	3.4%
TENNESSEE	499	3.4%
FLORIDA	498	3.4%
OKLAHOMA	490	3.3%
NEBRASKA	486	3.3%
ALABAMA	469	3.2%
WISCONSIN	463	3.1%
OHIO	441	3.0%
MICHIGAN	426	2.9%
NORTH CAROLINA	422	2.9%
KANSAS	418	2.8%
INDIANA	408	2.8%
KENTUCKY	383	2.6%

Fig 3.

fatalities · Notice the heavy zero-inflation; most events report no fatalities, with a thin but important tail.

Top values for fatalities (20 unique shown, of 49 total).
value	count	share
0	10209	69.1%
1	3208	21.7%
2	649	4.4%
3	222	1.5%
4	112	0.8%
5	74	0.5%
6	66	0.4%
7	38	0.3%
9	25	0.2%
10	24	0.2%
8	21	0.1%
11	20	0.1%
13	11	0.1%
16	10	0.1%
12	9	0.1%
14	8	0.1%
17	6	0.0%
20	6	0.0%
25	4	0.0%
23	3	0.0%

Fig 4.

damage_property · Watch for the recurring rounded codes like 2.5M and 1.00M that hint at estimation rather than measurement.

Character-length distribution for damage_property (mean: 4.381).
chars	count
0	368
1	264
2	1252
3	1172
4	3414
5	6075
6	1450
7	514
8	261

Fig 5.

latitude · Confirm the bulk of events sit in mid-US latitudes, with a few far-flung outliers worth inspecting.

Histogram bins for latitude (median: 37.12).
bin	count
-14.32 – -12.21	3
-12.21 – -10.1	0
-10.1 – -7.99	0
-7.99 – -5.879	0
-5.879 – -3.767	0
-3.767 – -1.656	0
-1.656 – 0.4552	0
0.4552 – 2.566	0
2.566 – 4.678	0
4.678 – 6.789	0
6.789 – 8.9	2
8.9 – 11.01	0
11.01 – 13.12	0
13.12 – 15.23	2
15.23 – 17.35	0
17.35 – 19.46	75
19.46 – 21.57	19
21.57 – 23.68	10
23.68 – 25.79	22
25.79 – 27.9	270
27.9 – 30.01	522
30.01 – 32.12	1240
32.12 – 34.24	2165
34.24 – 36.35	2333
36.35 – 38.46	1803
38.46 – 40.57	1901
40.57 – 42.68	2226
42.68 – 44.79	1382
44.79 – 46.9	515
46.9 – 49.01	232
49.01 – 51.13	0
51.13 – 53.24	0
53.24 – 55.35	0
55.35 – 57.46	5
57.46 – 59.57	6
59.57 – 61.68	15
61.68 – 63.79	11
63.79 – 65.9	8
65.9 – 68.02	2
68.02 – 70.13	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
latitude	numeric	0.0%
longitude	numeric	0.0%
name	text	0.0%
description	text	0.0%
category	categorical	0.0%
date	text	0.0%
country	categorical	0.0%
event_type	categorical	0.0%
state	categorical	0.0%
magnitude	categorical	51.8%
injuries	categorical	0.0%
fatalities	categorical	0.0%
damage_property	text	0.0%
source	categorical	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 10,000 detected strings).
lang	count	share
en	9780	97.8%
es	134	1.3%
de	25	0.2%
ja	23	0.2%
no	10	0.1%
id	6	0.1%
fr	6	0.1%
it	5	0.1%
pt	4	0.0%
sr	2	0.0%
ru	2	0.0%
eu	2	0.0%
zh	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	latitude	longitude
latitude	+1.00	-0.31
longitude	-0.31	+1.00

latitude numeric feature

Geographic latitude coordinates spanning -14.3236 to 70.1269, with a mean of 37.28 and median of 37.12, so most observations cluster in the northern mid-latitudes. The tight interquartile range (33.63 to 41.13) indicates a heavy concentration in temperate Northern Hemisphere regions; the 159 outliers (1.08%) likely represent low-latitude points (including three south of the equator) and far-northern points above 55°. The distribution is nearly symmetric (skew -0.18) and moderately peaked (kurtosis 3.34).

Treatment: Pair with longitude for geospatial features; consider binning or clustering rather than using raw value in linear models.

anthropic:claude-opus-4-7 · confidence high
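
One way to act on the binning suggestion is to snap each coordinate pair to a coarse grid cell and use the cell as a categorical feature. The sketch below is illustrative only: the helper name `grid_cell` and the 2-degree cell size are assumptions, not part of Saturn or the dataset.

```python
import math

def grid_cell(lat: float, lon: float, cell_deg: float = 2.0) -> tuple[int, int]:
    """Return integer (row, col) indices of the grid cell containing a point.

    Hypothetical helper; cell_deg is an arbitrary illustrative choice.
    """
    if cell_deg <= 0:
        raise ValueError("cell_deg must be positive")
    # math.floor (not int()) so negative coordinates bin consistently.
    return math.floor(lat / cell_deg), math.floor(lon / cell_deg)

# The median point reported for this dataset, (37.12, -90.22):
print(grid_cell(37.12, -90.22))  # (18, -46)
```

Smaller cells preserve more spatial detail but produce more sparse categories, so the cell size is worth tuning against the row count.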

Out[14]:

saturn.columns["latitude"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	7,810
min	-14.32
max	70.13
mean	37.28
median	37.12
std	5.247
q1	33.63
q3	41.13
iqr	7.499
skew	-0.1787
kurtosis	3.341
n_outliers	159
outlier_rate	0.01077
zero_rate	0

Fig 9.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 37.12).
bin	count
-14.32 – -12.21	3
-12.21 – -10.1	0
-10.1 – -7.99	0
-7.99 – -5.879	0
-5.879 – -3.767	0
-3.767 – -1.656	0
-1.656 – 0.4552	0
0.4552 – 2.566	0
2.566 – 4.678	0
4.678 – 6.789	0
6.789 – 8.9	2
8.9 – 11.01	0
11.01 – 13.12	0
13.12 – 15.23	2
15.23 – 17.35	0
17.35 – 19.46	75
19.46 – 21.57	19
21.57 – 23.68	10
23.68 – 25.79	22
25.79 – 27.9	270
27.9 – 30.01	522
30.01 – 32.12	1240
32.12 – 34.24	2165
34.24 – 36.35	2333
36.35 – 38.46	1803
38.46 – 40.57	1901
40.57 – 42.68	2226
42.68 – 44.79	1382
44.79 – 46.9	515
46.9 – 49.01	232
49.01 – 51.13	0
51.13 – 53.24	0
53.24 – 55.35	0
55.35 – 57.46	5
57.46 – 59.57	6
59.57 – 61.68	15
61.68 – 63.79	11
63.79 – 65.9	8
65.9 – 68.02	2
68.02 – 70.13	1

longitude numeric feature

Geographic longitude coordinates spanning -170.73 to 171.47, with values concentrated in the Western Hemisphere (median -90.22, quartiles -96.4 to -84.23), consistent with North American locations. The distribution is heavy-tailed (kurtosis 55.6, skew 1.29), with 623 outliers (4.2%) outside the dominant cluster, including four points at positive (Eastern Hemisphere) longitudes. There are no nulls or zeros, and 8,828 unique values across 14,770 rows indicate repeated locations.

Treatment: Pair with latitude for geospatial features; consider clustering or binning by region rather than using raw values linearly.

anthropic:claude-opus-4-7 · confidence high
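
To inspect the points outside the dominant cluster, one common convention is Tukey's 1.5 × IQR fences. The report does not say which outlier rule Saturn applies, so the sketch below is an assumption and may not reproduce the 623 outliers exactly; `tukey_fences` and `is_outlier` are hypothetical helpers.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value: float, low: float, high: float) -> bool:
    """True if the value falls outside the closed interval [low, high]."""
    return value < low or value > high

# Quartiles for longitude as reported in this section (q1 = -96.4, q3 = -84.23).
low, high = tukey_fences(-96.4, -84.23)
print(round(low, 3), round(high, 3))
print(is_outlier(171.47, low, high), is_outlier(-90.22, low, high))  # True False
```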

Out[17]:

saturn.columns["longitude"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	8,828
min	-170.7
max	171.5
mean	-90.94
median	-90.22
std	11.7
q1	-96.4
q3	-84.23
iqr	12.17
skew	1.286
kurtosis	55.61
n_outliers	623
outlier_rate	0.04218
zero_rate	0

Fig 10.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: -90.22).
bin	count
-170.7 – -162.2	4
-162.2 – -153.6	33
-153.6 – -145.1	28
-145.1 – -136.5	4
-136.5 – -128	11
-128 – -119.4	305
-119.4 – -110.8	466
-110.8 – -102.3	544
-102.3 – -93.74	3693
-93.74 – -85.18	5377
-85.18 – -76.63	3281
-76.63 – -68.07	941
-68.07 – -59.52	79
-59.52 – -50.96	0
-50.96 – -42.41	0
-42.41 – -33.85	0
-33.85 – -25.3	0
-25.3 – -16.74	0
-16.74 – -8.186	0
-8.186 – 0.3687	0
0.3687 – 8.924	0
8.924 – 17.48	0
17.48 – 26.03	0
26.03 – 34.59	0
34.59 – 43.14	0
43.14 – 51.7	0
51.7 – 60.25	0
60.25 – 68.81	0
68.81 – 77.36	0
77.36 – 85.92	0
85.92 – 94.47	0
94.47 – 103	0
103 – 111.6	0
111.6 – 120.1	0
120.1 – 128.7	0
128.7 – 137.2	0
137.2 – 145.8	2
145.8 – 154.4	1
154.4 – 162.9	0
162.9 – 171.5	1

name text label

Templated event labels of the form '{event_type} in {STATE}, {COUNTY}' describing US severe weather incidents (tornado, flood, hail, and thunderstorm wind dominate top_words). With 14,770 rows but only 6,660 unique values and a 54.9% duplicate rate, the same state/county/event combinations recur heavily; 'Hail in TEXAS, TARRANT' alone appears 59 times. The 'multilingual' alert is misleading: 4,796 strings tag as English against tiny counts in other languages, almost certainly false positives from the proper-noun template.

Treatment: Parse into separate event_type, state, and county fields rather than using the raw string.

anthropic:claude-opus-4-7 · confidence high
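
A minimal sketch of the suggested parsing, assuming every label follows the pattern seen in the one example quoted above ('Hail in TEXAS, TARRANT'). The regular expression and the helper name `parse_name` are illustrative; labels that deviate from the template return None so they can be reviewed by hand.

```python
import re
from typing import Optional

# Pattern inferred from a single example; the real column may contain
# variants (the longest label is 134 characters) that this does not cover.
NAME_RE = re.compile(r"^(?P<event_type>.+?) in (?P<state>[^,]+), (?P<county>.+)$")

def parse_name(name: str) -> Optional[dict]:
    """Split a templated label into its parts, or return None if it does not match."""
    m = NAME_RE.match(name.strip())
    return m.groupdict() if m else None

print(parse_name("Hail in TEXAS, TARRANT"))
# {'event_type': 'Hail', 'state': 'TEXAS', 'county': 'TARRANT'}
```

Since event_type and state already exist as their own columns, the parsed values can double as a consistency check against them.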

Out[20]:

saturn.columns["name"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	6,660
len_min	17
len_max	134
len_mean	30.22
len_median	29
len_p95	41
word_mean	4.588
word_median	4
n_empty	0
n_duplicates	8,110
duplicate_rate	0.5491
vocab_size	1,980
readability_flesch_mean	31.16
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	13 languages detected in sample
alert: duplicates	54.9% duplicate strings
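
The duplicate figures in this table are consistent with a simple relationship: n_duplicates = n - unique, and duplicate_rate = n_duplicates / n. That relationship is inferred from the reported numbers rather than taken from Saturn's documentation; a quick check:

```python
def duplicate_stats(n: int, unique: int) -> tuple[int, float]:
    """Rows beyond the first occurrence of each value, and their share of n."""
    n_duplicates = n - unique
    return n_duplicates, n_duplicates / n

# n and unique as reported for the name column.
n_dup, rate = duplicate_stats(14_770, 6_660)
print(n_dup, round(rate, 4))  # 8110 0.5491
```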

Fig 11.

Character-length distribution for name.

Character-length distribution for name (mean: 30.22).
chars	count
17 – 20	50
20 – 23	793
23 – 26	2165
26 – 29	3842
29 – 32	2969
32 – 35	1813
35 – 37	1442
37 – 40	915
40 – 43	493
43 – 46	153
46 – 49	45
49 – 52	4
52 – 55	3
55 – 58	4
58 – 61	6
61 – 64	9
64 – 67	5
67 – 70	7
70 – 73	3
73 – 76	10
76 – 78	7
78 – 81	5
81 – 84	6
84 – 87	3
87 – 90	3
90 – 93	3
93 – 96	1
96 – 99	5
99 – 102	3
102 – 105	0
105 – 108	0
108 – 111	0
111 – 114	0
114 – 116	0
116 – 119	1
119 – 122	0
122 – 125	0
125 – 128	0
128 – 131	1
131 – 134	1

description text metadata

This appears to be a templated event-summary field describing storm or disaster impacts (magnitude, injuries, fatalities, property damage in dollars), not free-form prose. Despite 14,770 rows, only 5,796 are unique and 60.8% are duplicates — the top value alone repeats 1,055 times — so the field carries far less information than its size suggests. The 'multilingual' alert is misleading: 4,984 rows tag as English against only 16 in other languages, likely noise from short numeric strings. Low Flesch (29.86) and a 7.4-word mean confirm terse, formulaic content rather than narrative text.

Treatment: Parse out structured fields (magnitude, injuries, fatalities, damage_usd) with regex rather than embedding the raw string.

anthropic:claude-opus-4-7 · confidence high

Out[23]:

saturn.columns["description"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	5,796
len_min	3
len_max	259
len_mean	50.09
len_median	36
len_p95	166
word_mean	7.393
word_median	5
n_empty	0
n_duplicates	8,974
duplicate_rate	0.6076
vocab_size	4,289
readability_flesch_mean	29.86
emoji_rate	0
url_rate	0
one_word_rate	0.0002708
allcaps_rate	0.0002708
boilerplate_rate	0
alert: multilingual	5 languages detected in sample
alert: duplicates	60.8% duplicate strings

Fig 12.

Character-length distribution for description.

Character-length distribution for description (mean: 50.09).
chars	count
3 – 9	4
9 – 16	13
16 – 22	3101
22 – 29	1689
29 – 35	2230
35 – 41	1186
41 – 48	572
48 – 54	1589
54 – 61	2168
61 – 67	902
67 – 73	7
73 – 80	6
80 – 86	14
86 – 93	10
93 – 99	15
99 – 105	25
105 – 112	30
112 – 118	30
118 – 125	34
125 – 131	47
131 – 137	71
137 – 144	64
144 – 150	52
150 – 157	77
157 – 163	60
163 – 169	74
169 – 176	66
176 – 182	65
182 – 189	54
189 – 195	69
195 – 201	82
201 – 208	70
208 – 214	72
214 – 221	78
221 – 227	64
227 – 233	29
233 – 240	20
240 – 246	11
246 – 253	12
253 – 259	8

category categorical metadata

This column is a constant tag identifying the dataset partition: every one of the 14,770 rows holds the single value "significant_us_storms". Cardinality is 1 with entropy 0.0 and a top_rate of 1.0, so it carries no information for modelling.

Treatment: Drop before modelling; retain only as a provenance label.

anthropic:claude-opus-4-7 · confidence high
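
Constant columns like this one can be found and dropped mechanically. The sketch below assumes the records are loaded as a list of dicts (the layout of the source JSON is not shown in this report); `constant_columns` and `drop_columns` are hypothetical helpers.

```python
def constant_columns(rows: list[dict]) -> list[str]:
    """Names of columns whose value is identical in every row."""
    if not rows:
        return []
    first = rows[0]
    return [col for col in first if all(row.get(col) == first[col] for row in rows)]

def drop_columns(rows: list[dict], cols: list[str]) -> list[dict]:
    """Return copies of the rows without the named columns."""
    drop = set(cols)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

# Tiny made-up sample shaped like the dataset; values are illustrative.
sample = [
    {"category": "significant_us_storms", "event_type": "Tornado"},
    {"category": "significant_us_storms", "event_type": "Hail"},
]
const = constant_columns(sample)
print(const)                        # ['category']
print(drop_columns(sample, const))  # [{'event_type': 'Tornado'}, {'event_type': 'Hail'}]
```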

Out[26]:

saturn.columns["category"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1
top_value	significant_us_storms
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 13.

Top values for category.

Top values for category (1 unique shown, of 1 total).
value	count	share
significant_us_storms	14770	100.0%

date text timestamp

This is a date column stored as ISO-formatted text (YYYY-MM-DD), with every value exactly 10 characters long across 14,770 rows and no nulls. Values span at least 1965 to 2021, but heavy clustering (9,712 duplicates, or 65.8%, with spikes like 1974-04-03 at 126 rows and 2011-04-27 at 105 rows) shows that many events share a date. Only 5,058 distinct dates appear, so this won't act as a row identifier. The one_word, allcaps, and short_text alerts are likely side effects of profiling a date as text rather than data-quality problems.

Treatment: parse to datetime and use for temporal joins or time-based features.

anthropic:claude-opus-4-7 · confidence high
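
A minimal sketch of the parsing step using only the standard library. The sample values are illustrative (two of the dates are the spikes named above); `parse_iso` is a hypothetical helper.

```python
from collections import Counter
from datetime import date

def parse_iso(text: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError on malformed input."""
    return date.fromisoformat(text)

# Illustrative values only, not real rows.
raw = ["1974-04-03", "1974-04-03", "2011-04-27"]
parsed = [parse_iso(s) for s in raw]

# Example time-based feature: number of events per year.
per_year = Counter(d.year for d in parsed)
print(per_year[1974], per_year[2011])  # 2 1
```

Once the values are real dates, features such as month or day of year follow directly from the `date` attributes.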

Out[29]:

saturn.columns["date"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	5,058
len_min	10
len_max	10
len_mean	10
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	9,712
duplicate_rate	0.6575
vocab_size	5,058
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	65.8% duplicate strings

Fig 14.

Character-length distribution for date.

Character-length distribution for date (mean: 10.0).
chars	count
10	14770

country categorical metadata

This column records country of origin but contains a single value, "USA", across all 14,770 rows. With cardinality of 1, entropy of 0, and a top_rate of 1.0, it carries no information. The imbalance alert is effectively a constant-column flag.

Treatment: Drop; constant column with no predictive signal.

anthropic:claude-opus-4-7 · confidence high

Out[32]:

saturn.columns["country"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1
top_value	USA
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 15.

Top values for country.

Top values for country (1 unique shown, of 1 total).
value	count	share
USA	14770	100.0%

event_type categorical label

Categorical label of severe weather event types across 14,770 rows with no nulls and only 17 distinct categories. Tornado dominates at 42.9% (6,334 records), followed by Flash Flood, Thunderstorm Wind, Flood, and Hail; tail categories like Marine Thunderstorm Wind have just 25 records. Entropy ratio of 0.57 confirms the distribution is heavily skewed toward a few classes.

Treatment: One-hot or target-encode; consider grouping rare marine categories before modelling.

anthropic:claude-opus-4-7 · confidence high
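
Grouping rare categories before encoding could look like the sketch below. The helper `group_rare`, the "Other" label, and the threshold are assumptions chosen for illustration; a real threshold should come from the modelling context.

```python
from collections import Counter

def group_rare(values: list[str], min_count: int, other: str = "Other") -> list[str]:
    """Replace categories seen fewer than min_count times with a single label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Made-up miniature sample; the threshold is an arbitrary illustrative choice.
sample = ["Tornado"] * 5 + ["Hail"] * 3 + ["Waterspout", "Heat"]
grouped = group_rare(sample, min_count=2)
print(Counter(grouped))
# Counter({'Tornado': 5, 'Hail': 3, 'Other': 2})
```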

Out[35]:

saturn.columns["event_type"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	17
top_value	Tornado
top_rate	0.4288
cardinality	17
entropy	2.336
entropy_ratio	0.5715
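
The entropy figures can be reproduced from the value counts. The sketch below assumes Shannon entropy in bits, normalised by log2 of the number of categories; Saturn's exact definition is not stated in this report, so treat the formula as an assumption that happens to agree with the table at the precision shown.

```python
import math

def entropy_bits(counts: list[int]) -> float:
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts: list[int]) -> float:
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# event_type counts as listed in the report's value table.
event_counts = [6334, 2358, 2257, 1777, 1246, 574, 99, 43, 43, 25, 5, 3, 2, 1, 1, 1, 1]
print(round(entropy_bits(event_counts), 3), round(entropy_ratio(event_counts), 4))
```

A ratio near 1 means the categories are close to evenly used; a single-valued column such as category or country gives 0.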

Fig 16.

Top values for event_type.

Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	42.9%
Flash Flood	2358	16.0%
Thunderstorm Wind	2257	15.3%
Flood	1777	12.0%
Hail	1246	8.4%
Lightning	574	3.9%
Heavy Rain	99	0.7%
Marine Strong Wind	43	0.3%
Debris Flow	43	0.3%
Marine Thunderstorm Wind	25	0.2%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

state categorical feature

U.S. state names stored as uppercase strings, fully populated across 14,770 rows with no nulls. Cardinality is 65, well above the 50 states, suggesting territories, districts, or non-state codes are mixed in. Distribution is broad (entropy ratio 0.86) with Texas leading at 9.8% (1,450 rows), followed by Missouri, Arkansas, Mississippi, and Georgia.

Treatment: Normalize to a known state/territory code list, then one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
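
A first step toward normalisation is to separate the 50 state names from everything else, so the remaining values can be reviewed and mapped by hand. The helper `split_known` is hypothetical, and 'GUAM' in the usage line is a made-up example of a non-state value, not one taken from the data.

```python
# The 50 US state names, uppercased to match the column's convention.
US_STATES = frozenset(
    name.strip() for name in """
    ALABAMA, ALASKA, ARIZONA, ARKANSAS, CALIFORNIA, COLORADO, CONNECTICUT,
    DELAWARE, FLORIDA, GEORGIA, HAWAII, IDAHO, ILLINOIS, INDIANA, IOWA, KANSAS,
    KENTUCKY, LOUISIANA, MAINE, MARYLAND, MASSACHUSETTS, MICHIGAN, MINNESOTA,
    MISSISSIPPI, MISSOURI, MONTANA, NEBRASKA, NEVADA, NEW HAMPSHIRE, NEW JERSEY,
    NEW MEXICO, NEW YORK, NORTH CAROLINA, NORTH DAKOTA, OHIO, OKLAHOMA, OREGON,
    PENNSYLVANIA, RHODE ISLAND, SOUTH CAROLINA, SOUTH DAKOTA, TENNESSEE, TEXAS,
    UTAH, VERMONT, VIRGINIA, WASHINGTON, WEST VIRGINIA, WISCONSIN, WYOMING
    """.split(",")
)

def split_known(values: list[str]) -> tuple[list[str], list[str]]:
    """Partition distinct values into recognised states and everything else."""
    distinct = {v.strip().upper() for v in values}
    return sorted(distinct & US_STATES), sorted(distinct - US_STATES)

known, other = split_known(["TEXAS", "texas ", "NORTH CAROLINA", "GUAM"])
print(known, other)  # ['NORTH CAROLINA', 'TEXAS'] ['GUAM']
```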

Out[38]:

saturn.columns["state"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	65
top_value	TEXAS
top_rate	0.09817
cardinality	65
entropy	5.182
entropy_ratio	0.8605

Fig 17.

Top values for state.

Top values for state (20 unique shown, of 65 total).
value	count	share
TEXAS	1450	9.8%
MISSOURI	648	4.4%
ARKANSAS	602	4.1%
MISSISSIPPI	570	3.9%
GEORGIA	562	3.8%
ILLINOIS	560	3.8%
IOWA	527	3.6%
LOUISIANA	507	3.4%
TENNESSEE	499	3.4%
FLORIDA	498	3.4%
OKLAHOMA	490	3.3%
NEBRASKA	486	3.3%
ALABAMA	469	3.2%
WISCONSIN	463	3.1%
OHIO	441	3.0%
MICHIGAN	426	2.9%
NORTH CAROLINA	422	2.9%
KANSAS	418	2.8%
INDIANA	408	2.8%
KENTUCKY	383	2.6%

magnitude categorical feature

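Likely a magnitude or measurement value stored as text rather than numeric, with 170 distinct string values dominated by "0" (3,863 rows, 54.2% of non-nulls). More than half the column is missing (null_rate 0.5178), and the remaining values mix small decimals like "1.75" and "2.75" with much larger ones like "70.00" and "61.00", suggesting heterogeneous units. In the NOAA Storm Events Database, magnitude holds hail size in inches or wind speed in knots depending on the event type, which would likely explain the two ranges. The same number also appears in more than one string form ("70.00" and "70", "50.00" and "50"), which inflates the distinct count. Entropy ratio of 0.48 confirms heavy concentration on the zero bucket.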

Treatment: Cast to numeric, treat "0" and nulls explicitly, and investigate whether the large vs small values reflect different units before modelling.

anthropic:claude-opus-4-7 · confidence medium
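
A minimal sketch of the numeric cast, keeping nulls distinct from zero. `to_float_or_none` is a hypothetical helper; how "0" should be interpreted (a true zero or a placeholder) is left open, as the Treatment line suggests.

```python
from typing import Optional

def to_float_or_none(raw: Optional[str]) -> Optional[float]:
    """Convert a magnitude string to float; None for nulls, blanks, or junk."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None

# "70.00" and "70" are separate strings in the value table but one number.
print(to_float_or_none("70.00") == to_float_or_none("70"))  # True
print(to_float_or_none(None), to_float_or_none("1.75"))     # None 1.75
```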

Out[41]:

saturn.columns["magnitude"].stats

stat	value
n	14,770
nulls	7,648 (51.8%)
unique	170
top_value	0
top_rate	0.5424
cardinality	170
entropy	3.586
entropy_ratio	0.484
alert: null_rate	51.8% null

Fig 18.

Top values for magnitude.

Top values for magnitude (20 unique shown, of 170 total).
value	count	share
0	3863	26.2%
1.75	383	2.6%
2.75	220	1.5%
70.00	162	1.1%
50.00	151	1.0%
2.00	150	1.0%
2.50	123	0.8%
61.00	122	0.8%
65.00	104	0.7%
52.00	95	0.6%
78.00	80	0.5%
70	79	0.5%
3.00	77	0.5%
56.00	76	0.5%
87.00	65	0.4%
60.00	63	0.4%
50	59	0.4%
60	54	0.4%
1.50	50	0.3%
61	47	0.3%

injuries categorical feature

This is an injury count stored as strings, with 178 distinct values dominated by '0' (68.1% of 14,770 rows). The next most common values ('1' through '7', plus '10' and '12') are clearly numeric, suggesting the column should be cast to integer rather than treated as categorical. Low entropy ratio (0.33) reflects the heavy zero-mass.

Treatment: Cast to integer and consider log1p or zero-inflated treatment given the zero-heavy distribution.

anthropic:claude-opus-4-7 · confidence high
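
The cast and log1p transform suggested above can be sketched as follows. `injuries_feature` is a hypothetical helper, and the inputs are illustrative count strings rather than real rows.

```python
import math

def injuries_feature(raw: str) -> float:
    """Cast a count string to int, then compress the tail with log1p."""
    count = int(raw)
    if count < 0:
        raise ValueError(f"negative count: {raw!r}")
    return math.log1p(count)  # log(1 + count); maps 0 to exactly 0.0

for raw in ["0", "1", "12", "30"]:
    print(raw, round(injuries_feature(raw), 3))
```

log1p keeps the zero rows at zero while shrinking the gap between large counts, which suits a column where 68.1% of rows are "0".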

Out[44]:

saturn.columns["injuries"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	178
top_value	0
top_rate	0.6814
cardinality	178
entropy	2.468
entropy_ratio	0.3301

Fig 19.

Top values for injuries.

Top values for injuries (20 unique shown, of 178 total).
value	count	share
0	10064	68.1%
1	893	6.0%
2	552	3.7%
3	343	2.3%
4	236	1.6%
5	234	1.6%
10	219	1.5%
6	196	1.3%
12	158	1.1%
7	134	0.9%
8	121	0.8%
20	114	0.8%
15	111	0.8%
11	90	0.6%
9	85	0.6%
13	70	0.5%
14	69	0.5%
30	68	0.5%
25	56	0.4%
16	48	0.3%

fatalities categorical numeric_target

Counts of fatalities per event, stored as strings but numeric in content, with 49 distinct values. The column is heavily zero-inflated: 69.1% of 14,770 rows are "0" and the next bucket, "1", covers 3,208 more, leaving a long thin tail in which each count of 5 or more appears in fewer than 100 rows. The low entropy ratio (0.25) confirms the distribution is dominated by a single value.

Treatment: Cast to integer and model as a zero-inflated count (e.g., zero-inflated Poisson or negative binomial) or binarise to fatal/non-fatal.

anthropic:claude-opus-4-7 · confidence high
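
The binarised variant is the simpler of the two options and can be sketched as below. `is_fatal` and `zero_rate` are hypothetical helpers, and the sample values are illustrative.

```python
def is_fatal(raw: str) -> int:
    """1 if the event reports at least one fatality, else 0."""
    return 1 if int(raw) > 0 else 0

def zero_rate(labels: list[int]) -> float:
    """Share of events with no fatalities."""
    return labels.count(0) / len(labels)

# Illustrative values, not real rows.
labels = [is_fatal(v) for v in ["0", "0", "1", "3", "0", "25"]]
print(labels, zero_rate(labels))  # [0, 0, 1, 1, 0, 1] 0.5

# Check on the reported counts: 10,209 zero rows out of 14,770.
print(round(10_209 / 14_770, 4))  # 0.6912
```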

Out[47]:

saturn.columns["fatalities"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	49
top_value	0
top_rate	0.6912
cardinality	49
entropy	1.423
entropy_ratio	0.2535

Fig 20.

Top values for fatalities.

Top values for fatalities (20 unique shown, of 49 total).
value	count	share
0	10209	69.1%
1	3208	21.7%
2	649	4.4%
3	222	1.5%
4	112	0.8%
5	74	0.5%
6	66	0.4%
7	38	0.3%
9	25	0.2%
10	24	0.2%
8	21	0.1%
11	20	0.1%
13	11	0.1%
16	10	0.1%
12	9	0.1%
14	8	0.1%
17	6	0.0%
20	6	0.0%
25	4	0.0%
23	3	0.0%

damage_property text feature

This column encodes property damage estimates as short magnitude-suffixed strings (e.g. '2.5M', '250K', '0.00K'), with every non-empty value being a single token of at most 8 characters. The format is inconsistent: some values use two decimals ('1.00M') while others use none ('1M', '25M'). In addition, 368 rows are empty strings rather than nulls, and the literal '0.00K' appears 1,229 times to denote zero damage. Duplication is extreme (93.1%) because the underlying domain is a small set of round-number estimates, yielding only 1,014 distinct values.

Treatment: Parse the K/M/B suffix into a numeric float, treat empty strings and '0.00K' explicitly, then log-transform before modelling.

anthropic:claude-opus-4-7 · confidence high
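
A minimal sketch of the suffix parsing, under two assumptions that should be checked against the data: a value with no suffix is taken to be plain dollars, and an empty string is treated as missing rather than zero. `parse_damage` is a hypothetical helper.

```python
from typing import Optional

# Suffix multipliers; 'B' follows the Treatment line even though no 'B'
# example appears in this report.
_MULT = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_damage(raw: Optional[str]) -> Optional[float]:
    """Convert strings like '2.5M' or '0.00K' to dollars; None if empty or unparseable."""
    if raw is None:
        return None
    text = raw.strip().upper()
    if not text:
        return None  # empty string: treated as missing, not as zero
    mult = 1.0       # assumption: no suffix means plain dollars
    if text[-1] in _MULT:
        mult = _MULT[text[-1]]
        text = text[:-1]
    try:
        return float(text) * mult
    except ValueError:
        return None

for raw in ["2.5M", "250K", "0.00K", "1M", ""]:
    print(repr(raw), parse_damage(raw))
```

For the log transform, `math.log1p` applied to the parsed value keeps the '0.00K' rows at zero without special-casing them.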

Out[50]:

saturn.columns["damage_property"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1,014
len_min	0
len_max	8
len_mean	4.381
len_median	5
len_p95	7
word_mean	1
word_median	1
n_empty	368
n_duplicates	13,756
duplicate_rate	0.9313
vocab_size	1,013
readability_flesch_mean	117
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.8724
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	87.2% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	93.1% duplicate strings

Fig 21.

Character-length distribution for damage_property.

Character-length distribution for damage_property (mean: 4.381).
chars	count
0	368
1	264
2	1252
3	1172
4	3414
5	6075
6	1450
7	514
8	261

source categorical metadata

This column records the dataset's provenance, with every one of the 14,770 rows tagged 'NOAA Storm Events Database'. Cardinality is 1 and entropy is 0, so it carries no discriminative signal whatsoever.

Treatment: Drop before modelling; retain only as dataset-level provenance.

anthropic:claude-opus-4-7 · confidence high

Out[53]:

saturn.columns["source"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1
top_value	NOAA Storm Events Database
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 22.

Top values for source.

Top values for source (1 unique shown, of 1 total).
value	count	share
NOAA Storm Events Database	14770	100.0%
