natural_hazards-storms · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/natural_hazards/storms.json

Saturn profiled 14,770 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/natural_hazards/storms.json",
    "--findings", "natural_hazards-storms.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset contains 14,770 records of significant U.S. storm events sourced entirely from the NOAA Storm Events Database, with each row describing a weather incident's location, type, magnitude, and damages. Tornadoes dominate the event mix at roughly 43% of records, followed by Flash Floods and Thunderstorm Winds, so the event_type distribution is the first thing to inspect. Geographically the data skews toward Texas (1,450 events) and other tornado-belt states like Missouri, Arkansas, and Mississippi, which is worth confirming on the latitude/longitude spread. Two caveats deserve attention: the magnitude field is missing for 51.8% of rows, and category/country/source are constants (single value) so they carry no analytical signal. Fatalities and injuries are heavily zero-inflated (about 69% and 68% zeros respectively), meaning summary stats will be driven by a small tail of severe events.

citing: row_count · column_count · columns.event_type.top_values · columns.event_type.stats.top_rate · columns.state.top_values · columns.magnitude.null_rate · columns.fatalities.stats.top_rate · columns.injuries.stats.top_rate · columns.country.stats.top_value · columns.source.stats.top_value · columns.latitude.stats · columns.longitude.stats

Out[4]:

saturn.schema() · 14 columns

column	kind	n	null%	unique	alerts
latitude	numeric	14,770	0.0%	7,810
longitude	numeric	14,770	0.0%	8,828
name	text	14,770	0.0%	6,660	multilingual duplicates
description	text	14,770	0.0%	5,796	multilingual duplicates
category	categorical	14,770	0.0%	1	imbalance
date	text	14,770	0.0%	5,058	one_word allcaps short_text duplicates
country	categorical	14,770	0.0%	1	imbalance
event_type	categorical	14,770	0.0%	17
state	categorical	14,770	0.0%	65
magnitude	categorical	14,770	51.8%	170	null_rate
injuries	categorical	14,770	0.0%	178
fatalities	categorical	14,770	0.0%	49
damage_property	text	14,770	0.0%	1,014	one_word allcaps short_text duplicates
source	categorical	14,770	0.0%	1	imbalance

Fig 1.

event_type · Tornadoes account for ~43% of events; check how steeply the distribution drops to Flash Flood and Thunderstorm Wind.

Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	42.9%
Flash Flood	2358	16.0%
Thunderstorm Wind	2257	15.3%
Flood	1777	12.0%
Hail	1246	8.4%
Lightning	574	3.9%
Heavy Rain	99	0.7%
Marine Strong Wind	43	0.3%
Debris Flow	43	0.3%
Marine Thunderstorm Wind	25	0.2%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

Fig 2.

state · Texas leads at nearly 10% of records — see how concentrated events are in the southern/midwestern tornado belt.

Top values for state (20 unique shown, of 65 total).
value	count	share
TEXAS	1450	9.8%
MISSOURI	648	4.4%
ARKANSAS	602	4.1%
MISSISSIPPI	570	3.9%
GEORGIA	562	3.8%
ILLINOIS	560	3.8%
IOWA	527	3.6%
LOUISIANA	507	3.4%
TENNESSEE	499	3.4%
FLORIDA	498	3.4%
OKLAHOMA	490	3.3%
NEBRASKA	486	3.3%
ALABAMA	469	3.2%
WISCONSIN	463	3.1%
OHIO	441	3.0%
MICHIGAN	426	2.9%
NORTH CAROLINA	422	2.9%
KANSAS	418	2.8%
INDIANA	408	2.8%
KENTUCKY	383	2.6%

Fig 3.

fatalities · Heavy zero-inflation (69% are 0 fatalities) means the tail of deadly events is what matters analytically.

Top values for fatalities (20 unique shown, of 49 total).
value	count	share
0	10209	69.1%
1	3208	21.7%
2	649	4.4%
3	222	1.5%
4	112	0.8%
5	74	0.5%
6	66	0.4%
7	38	0.3%
9	25	0.2%
10	24	0.2%
8	21	0.1%
11	20	0.1%
13	11	0.1%
16	10	0.1%
12	9	0.1%
14	8	0.1%
17	6	0.0%
20	6	0.0%
25	4	0.0%
23	3	0.0%

Fig 4.

damage_property · Damage values cluster on a few round figures like 2.5M and 1.00M, hinting at coded or estimated reporting rather than precise amounts.

Character-length distribution for damage_property (mean: 4.381).
chars	count
0	368
1	264
2	1252
3	1172
4	3414
5	6075
6	1450
7	514
8	261

Fig 5.

latitude · The middle half of events falls between about 33.6° and 41.1° N (Q1 to Q3), with a few outliers extending north toward Alaska (up to 70° N) and south through the tropics (down to 14° S).

Histogram bins for latitude (median: 37.12).
bin	count
-14.32 – -12.21	3
-12.21 – -10.1	0
-10.1 – -7.99	0
-7.99 – -5.879	0
-5.879 – -3.767	0
-3.767 – -1.656	0
-1.656 – 0.4552	0
0.4552 – 2.566	0
2.566 – 4.678	0
4.678 – 6.789	0
6.789 – 8.9	2
8.9 – 11.01	0
11.01 – 13.12	0
13.12 – 15.23	2
15.23 – 17.35	0
17.35 – 19.46	75
19.46 – 21.57	19
21.57 – 23.68	10
23.68 – 25.79	22
25.79 – 27.9	270
27.9 – 30.01	522
30.01 – 32.12	1240
32.12 – 34.24	2165
34.24 – 36.35	2333
36.35 – 38.46	1803
38.46 – 40.57	1901
40.57 – 42.68	2226
42.68 – 44.79	1382
44.79 – 46.9	515
46.9 – 49.01	232
49.01 – 51.13	0
51.13 – 53.24	0
53.24 – 55.35	0
55.35 – 57.46	5
57.46 – 59.57	6
59.57 – 61.68	15
61.68 – 63.79	11
63.79 – 65.9	8
65.9 – 68.02	2
68.02 – 70.13	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
latitude	numeric	0.0%
longitude	numeric	0.0%
name	text	0.0%
description	text	0.0%
category	categorical	0.0%
date	text	0.0%
country	categorical	0.0%
event_type	categorical	0.0%
state	categorical	0.0%
magnitude	categorical	51.8%
injuries	categorical	0.0%
fatalities	categorical	0.0%
damage_property	text	0.0%
source	categorical	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 10,000 detected strings).
lang	count	share
en	9780	97.8%
es	134	1.3%
de	25	0.2%
ja	23	0.2%
no	10	0.1%
id	6	0.1%
fr	6	0.1%
it	5	0.1%
pt	4	0.0%
sr	2	0.0%
ru	2	0.0%
eu	2	0.0%
zh	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	latitude	longitude
latitude	+1.00	-0.31
longitude	-0.31	+1.00

latitude numeric feature

This column holds geographic latitudes, ranging from -14.3236 to 70.1269 with a mean of 37.28 and median of 37.12, consistent with degrees north/south. The distribution is tightly clustered (IQR 7.50, std 5.25) around the mid-30s to low-40s, so most records sit in northern temperate latitudes. Skew is mild (-0.18), but a kurtosis of 3.34 and 159 outliers (1.08%) point to thin tails on both sides: a southern tail that reaches the southern hemisphere (3 rows near -14°) and a northern tail of 48 rows between roughly 55° and 70° N.

Treatment: Pair with longitude as a geospatial feature; consider binning or projecting rather than treating as a plain scalar.

anthropic:claude-opus-4-7 · confidence high
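
One way to act on the binning suggestion is to snap each coordinate pair to a fixed-size grid cell. This is a minimal sketch, not part of saturn: the `grid_cell` helper and the cell size are hypothetical, and the sample coordinates are illustrative rather than rows from the dataset.

```python
import math

def grid_cell(lat: float, lon: float, size_deg: float = 1.0) -> tuple[int, int]:
    """Map a coordinate to the integer index of its size_deg x size_deg cell."""
    return math.floor(lat / size_deg), math.floor(lon / size_deg)

# Illustrative coordinates (hypothetical, not taken from the raw rows).
print(grid_cell(37.12, -90.22))         # (37, -91)
print(grid_cell(-14.32, -170.7, 2.0))   # (-8, -86)
```

Using `math.floor` rather than `int()` keeps negative longitudes in the correct cell, since `int()` truncates toward zero.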

Out[14]:

saturn.columns["latitude"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	7,810
min	-14.32
max	70.13
mean	37.28
median	37.12
std	5.247
q1	33.63
q3	41.13
iqr	7.499
skew	-0.1787
kurtosis	3.341
n_outliers	159
outlier_rate	0.01077
zero_rate	0
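
The outlier count above is consistent with the 1.5 × IQR (Tukey) rule, although the notebook does not say which rule saturn applies, so treat that as an assumption. A minimal sketch that derives the fences from the quartiles in this table:

```python
# Quartiles copied from the latitude stats table.
Q1, Q3 = 33.63, 41.13

def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences; values outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = tukey_fences(Q1, Q3)
print(f"flag latitudes below {lower:.2f} or above {upper:.2f}")
```

Under this assumption the flagged rows would be those south of about 22.4° N and north of about 52.4° N.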

Fig 9.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 37.12).
bin	count
-14.32 – -12.21	3
-12.21 – -10.1	0
-10.1 – -7.99	0
-7.99 – -5.879	0
-5.879 – -3.767	0
-3.767 – -1.656	0
-1.656 – 0.4552	0
0.4552 – 2.566	0
2.566 – 4.678	0
4.678 – 6.789	0
6.789 – 8.9	2
8.9 – 11.01	0
11.01 – 13.12	0
13.12 – 15.23	2
15.23 – 17.35	0
17.35 – 19.46	75
19.46 – 21.57	19
21.57 – 23.68	10
23.68 – 25.79	22
25.79 – 27.9	270
27.9 – 30.01	522
30.01 – 32.12	1240
32.12 – 34.24	2165
34.24 – 36.35	2333
36.35 – 38.46	1803
38.46 – 40.57	1901
40.57 – 42.68	2226
42.68 – 44.79	1382
44.79 – 46.9	515
46.9 – 49.01	232
49.01 – 51.13	0
51.13 – 53.24	0
53.24 – 55.35	0
55.35 – 57.46	5
57.46 – 59.57	6
59.57 – 61.68	15
61.68 – 63.79	11
63.79 – 65.9	8
65.9 – 68.02	2
68.02 – 70.13	1

longitude numeric feature

Geographic longitude coordinates, with 8,828 unique values across 14,770 rows and no nulls. Values span -170.73 to 171.47, nearly the full ±180° range, but the distribution is tightly concentrated around a median of -90.22 with an IQR of just 12.17, so most records fall in the central and eastern United States. Heavy kurtosis (55.6) and 623 outliers (4.2%) reflect a small set of points scattered far from this core, including 4 rows at positive (eastern-hemisphere) longitudes.

Treatment: Pair with latitude for geospatial features; do not standardize as a plain scalar.

anthropic:claude-opus-4-7 · confidence high

Out[17]:

saturn.columns["longitude"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	8,828
min	-170.7
max	171.5
mean	-90.94
median	-90.22
std	11.7
q1	-96.4
q3	-84.23
iqr	12.17
skew	1.286
kurtosis	55.61
n_outliers	623
outlier_rate	0.04218
zero_rate	0

Fig 10.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: -90.22).
bin	count
-170.7 – -162.2	4
-162.2 – -153.6	33
-153.6 – -145.1	28
-145.1 – -136.5	4
-136.5 – -128	11
-128 – -119.4	305
-119.4 – -110.8	466
-110.8 – -102.3	544
-102.3 – -93.74	3693
-93.74 – -85.18	5377
-85.18 – -76.63	3281
-76.63 – -68.07	941
-68.07 – -59.52	79
-59.52 – -50.96	0
-50.96 – -42.41	0
-42.41 – -33.85	0
-33.85 – -25.3	0
-25.3 – -16.74	0
-16.74 – -8.186	0
-8.186 – 0.3687	0
0.3687 – 8.924	0
8.924 – 17.48	0
17.48 – 26.03	0
26.03 – 34.59	0
34.59 – 43.14	0
43.14 – 51.7	0
51.7 – 60.25	0
60.25 – 68.81	0
68.81 – 77.36	0
77.36 – 85.92	0
85.92 – 94.47	0
94.47 – 103	0
103 – 111.6	0
111.6 – 120.1	0
120.1 – 128.7	0
128.7 – 137.2	0
137.2 – 145.8	2
145.8 – 154.4	1
154.4 – 162.9	0
162.9 – 171.5	1

name text label

This column holds short structured event labels of the form '{event_type} in {STATE}, {COUNTY}' describing US severe weather incidents (tornadoes, floods, hail, thunderstorm winds). It is highly repetitive: 8,110 of 14,770 rows are duplicates (54.9% duplicate_rate) with only 6,660 unique values, and 'Hail in TEXAS, TARRANT' alone appears 59 times. The 'multilingual' alert is largely a false positive driven by place names: 4,796 sampled strings classify as English versus tiny counts like 134 Spanish and 25 German.

Treatment: Parse into structured fields (event_type, state, county) rather than treating as free text.

anthropic:claude-opus-4-7 · confidence high
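
The parsing treatment could look like the sketch below. It assumes every label follows the shape seen in 'Hail in TEXAS, TARRANT' (event type, then state, then county); the regex and the `parse_name` helper are illustrative, and labels that do not fit (for example the few very long ones) simply return `None`.

```python
import re
from typing import Optional

# Assumed template: "<event type> in <STATE>, <COUNTY>".
NAME_RE = re.compile(r"^(?P<event_type>.+?) in (?P<state>[^,]+), (?P<county>.+)$")

def parse_name(name: str) -> Optional[dict]:
    """Split a label into event_type, state, and county, or None if it does not match."""
    match = NAME_RE.match(name.strip())
    return match.groupdict() if match else None

print(parse_name("Hail in TEXAS, TARRANT"))
# {'event_type': 'Hail', 'state': 'TEXAS', 'county': 'TARRANT'}
```

Comparing the parsed event_type and state against the dedicated columns would be a cheap consistency check before discarding the label.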

Out[20]:

saturn.columns["name"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	6,660
len_min	17
len_max	134
len_mean	30.22
len_median	29
len_p95	41
word_mean	4.588
word_median	4
n_empty	0
n_duplicates	8,110
duplicate_rate	0.5491
vocab_size	1,980
readability_flesch_mean	31.16
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	13 languages detected in sample
alert: duplicates	54.9% duplicate strings
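
The duplicate figures in this table are consistent with n_duplicates = n - unique and duplicate_rate = n_duplicates / n. That relationship is inferred from the numbers rather than taken from saturn's documentation; a quick check:

```python
# Values copied from the name stats table.
n, unique = 14_770, 6_660

n_duplicates = n - unique          # rows beyond the first occurrence of each value
duplicate_rate = n_duplicates / n

print(n_duplicates, round(duplicate_rate, 4))  # 8110 0.5491
```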

Fig 11.

Character-length distribution for name.

Character-length distribution for name (mean: 30.219160460392686).
chars	count
17 – 20	50
20 – 23	793
23 – 26	2165
26 – 29	3842
29 – 32	2969
32 – 35	1813
35 – 37	1442
37 – 40	915
40 – 43	493
43 – 46	153
46 – 49	45
49 – 52	4
52 – 55	3
55 – 58	4
58 – 61	6
61 – 64	9
64 – 67	5
67 – 70	7
70 – 73	3
73 – 76	10
76 – 78	7
78 – 81	5
81 – 84	6
84 – 87	3
87 – 90	3
90 – 93	3
93 – 96	1
96 – 99	5
99 – 102	3
102 – 105	0
105 – 108	0
108 – 111	0
111 – 114	0
114 – 116	0
116 – 119	1
119 – 122	0
122 – 125	0
125 – 128	0
128 – 131	1
131 – 134	1

description text metadata

This is a templated event-summary field describing storm or disaster impacts (magnitude, injuries, fatalities, property damage in dollars), not free-form prose. With 14,770 rows but only 5,796 unique values and a 60.8% duplicate rate, the text is highly formulaic — the single string 'Magnitude 0; $2.5M property damage' alone appears 1,055 times. Language detection flags 5 French, 1 Japanese, and 10 Norwegian rows against 4,984 English, almost certainly false positives from numeric/currency tokens rather than real multilingual content; mean Flesch of 29.9 reflects the terse template, not difficult prose.

Treatment: Parse with regex into structured fields (magnitude, injuries, fatalities, damage_usd) rather than treating as free text.

anthropic:claude-opus-4-7 · confidence high
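
A regex-based parse might start like the sketch below. Only the string 'Magnitude 0; $2.5M property damage' is confirmed by the profile, so the patterns cover just the magnitude and damage parts; how injuries and fatalities are worded in the template is not shown here and is deliberately left out. The helper name and output keys are hypothetical.

```python
import re
from typing import Optional

MAGNITUDE_RE = re.compile(r"Magnitude\s+(?P<value>\d+(?:\.\d+)?)")
DAMAGE_RE = re.compile(r"\$(?P<amount>\d+(?:\.\d+)?)(?P<unit>[KM]?)\s+property damage")
UNIT = {"": 1, "K": 1_000, "M": 1_000_000}

def parse_description(text: str) -> dict:
    """Extract magnitude and property damage (USD); missing parts stay None."""
    out: dict[str, Optional[float]] = {"magnitude": None, "damage_usd": None}
    mag = MAGNITUDE_RE.search(text)
    if mag:
        out["magnitude"] = float(mag.group("value"))
    dmg = DAMAGE_RE.search(text)
    if dmg:
        out["damage_usd"] = float(dmg.group("amount")) * UNIT[dmg.group("unit")]
    return out

print(parse_description("Magnitude 0; $2.5M property damage"))
# {'magnitude': 0.0, 'damage_usd': 2500000.0}
```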

Out[23]:

saturn.columns["description"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	5,796
len_min	3
len_max	259
len_mean	50.09
len_median	36
len_p95	166
word_mean	7.393
word_median	5
n_empty	0
n_duplicates	8,974
duplicate_rate	0.6076
vocab_size	4,289
readability_flesch_mean	29.86
emoji_rate	0
url_rate	0
one_word_rate	0.0002708
allcaps_rate	0.0002708
boilerplate_rate	0
alert: multilingual	5 languages detected in sample
alert: duplicates	60.8% duplicate strings

Fig 12.

Character-length distribution for description.

Character-length distribution for description (mean: 50.085240352065).
chars	count
3 – 9	4
9 – 16	13
16 – 22	3101
22 – 29	1689
29 – 35	2230
35 – 41	1186
41 – 48	572
48 – 54	1589
54 – 61	2168
61 – 67	902
67 – 73	7
73 – 80	6
80 – 86	14
86 – 93	10
93 – 99	15
99 – 105	25
105 – 112	30
112 – 118	30
118 – 125	34
125 – 131	47
131 – 137	71
137 – 144	64
144 – 150	52
150 – 157	77
157 – 163	60
163 – 169	74
169 – 176	66
176 – 182	65
182 – 189	54
189 – 195	69
195 – 201	82
201 – 208	70
208 – 214	72
214 – 221	78
221 – 227	64
227 – 233	29
233 – 240	20
240 – 246	11
246 – 253	12
253 – 259	8

category categorical metadata

This column is a constant categorical tag identifying the dataset source: every one of the 14770 rows holds the single value "significant_us_storms". With cardinality 1, entropy 0, and top_rate 1.0, it carries no information for modelling and only serves as a provenance marker.

Treatment: Drop before modelling; retain only as a source tag if merging with other datasets.

anthropic:claude-opus-4-7 · confidence high
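
Constant columns such as this one can be detected and dropped generically. The sketch below assumes the records are available as a list of dicts; the `constant_columns` helper and the two sample rows are hypothetical.

```python
def constant_columns(rows: list[dict]) -> list[str]:
    """Return the column names whose value is identical in every row."""
    if not rows:
        return []
    return [col for col in rows[0] if len({row.get(col) for row in rows}) == 1]

# Hypothetical sample rows using values reported in this notebook.
rows = [
    {"category": "significant_us_storms", "country": "USA", "event_type": "Tornado"},
    {"category": "significant_us_storms", "country": "USA", "event_type": "Hail"},
]

drop = constant_columns(rows)
cleaned = [{k: v for k, v in row.items() if k not in drop} for row in rows]
print(drop)  # ['category', 'country']
```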

Out[26]:

saturn.columns["category"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1
top_value	significant_us_storms
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 13.

Top values for category.

Top values for category (1 unique shown, of 1 total).
value	count	share
significant_us_storms	14770	100.0%

date text timestamp

This is a date column stored as ISO-formatted text (YYYY-MM-DD), with every one of 14770 rows exactly 10 characters and a single token. Values span at least 1965 to 2021, and duplicates are heavy: 9712 rows (65.8%) repeat, with 1974-04-03 appearing 126 times and 2011-04-27 appearing 105 times, suggesting clustering around specific event dates rather than a uniform timeline. The 'allcaps' flag is a false positive since digits and hyphens trigger it.

Treatment: Parse to a date dtype and use for temporal grouping or joins.

anthropic:claude-opus-4-7 · confidence high
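
Because the strings are already ISO-formatted, the standard library can parse them directly. A minimal sketch using the two heavily repeated dates mentioned above (the three-element sample list itself is illustrative):

```python
from collections import Counter
from datetime import date

# Illustrative sample; both dates appear in the column per the summary above.
raw_dates = ["1974-04-03", "1974-04-03", "2011-04-27"]

parsed = [date.fromisoformat(text) for text in raw_dates]
events_per_year = Counter(d.year for d in parsed)

print(events_per_year[1974], events_per_year[2011])  # 2 1
```

`date.fromisoformat` raises `ValueError` on malformed input, which makes it a convenient validity check while converting.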

Out[29]:

saturn.columns["date"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	5,058
len_min	10
len_max	10
len_mean	10
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	9,712
duplicate_rate	0.6575
vocab_size	5,058
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	65.8% duplicate strings

Fig 14.

Character-length distribution for date.

Character-length distribution for date (mean: 10.0).
chars	count
10	14770

country categorical metadata

This column records the country of origin for each record, but every one of the 14770 rows holds the value "USA". Cardinality is 1 and entropy is 0, so the field carries no information.

Treatment: Drop; constant column with zero variance.

anthropic:claude-opus-4-7 · confidence high

Out[32]:

saturn.columns["country"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1
top_value	USA
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 15.

Top values for country.

Top values for country (1 unique shown, of 1 total).
value	count	share
USA	14770	100.0%

event_type categorical label

Categorical label of severe-weather event types across 14,770 rows with no nulls and 17 distinct categories. Tornado dominates at 42.9% (6,334 rows), followed by Flash Flood, Thunderstorm Wind, Flood, and Hail; entropy ratio of 0.57 confirms the distribution is concentrated in a few classes. The long tail (Heavy Rain, Marine Strong Wind, Debris Flow, Marine Thunderstorm Wind) has very thin support, which will hurt per-class modelling.

Treatment: Use as categorical target or feature; consider grouping rare classes given the heavy Tornado skew.

anthropic:claude-opus-4-7 · confidence high
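
Grouping rare classes could be sketched as below, using the event_type counts reported in this notebook. The 1% threshold and the "Other" label are arbitrary illustrative choices, not recommendations from the profile.

```python
# Counts copied from the event_type table in this notebook.
counts = {
    "Tornado": 6334, "Flash Flood": 2358, "Thunderstorm Wind": 2257,
    "Flood": 1777, "Hail": 1246, "Lightning": 574, "Heavy Rain": 99,
    "Marine Strong Wind": 43, "Debris Flow": 43,
    "Marine Thunderstorm Wind": 25, "Marine High Wind": 5, "Dust Devil": 3,
    "Waterspout": 2, "Tropical Storm": 1, "High Wind": 1, "Heat": 1,
    "Marine Lightning": 1,
}

def group_rare(counts: dict[str, int], min_share: float = 0.01,
               other: str = "Other") -> dict[str, int]:
    """Merge every class below min_share of the total into one bucket."""
    total = sum(counts.values())
    grouped: dict[str, int] = {}
    for label, n in counts.items():
        key = label if n / total >= min_share else other
        grouped[key] = grouped.get(key, 0) + n
    return grouped

grouped = group_rare(counts)
print(len(grouped), grouped["Other"])  # 7 224
```

With this threshold the 17 classes collapse to six named classes plus an "Other" bucket of 224 rows.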

Out[35]:

saturn.columns["event_type"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	17
top_value	Tornado
top_rate	0.4288
cardinality	17
entropy	2.336
entropy_ratio	0.5715
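
The entropy figures in this table look consistent with Shannon entropy in bits, and entropy_ratio with that entropy divided by log2(cardinality). This reading is inferred from the numbers, not from saturn's documentation. A minimal sketch that recomputes both from the 17 counts reported for event_type:

```python
import math

# Counts for all 17 event_type values, as listed in this notebook.
counts = [6334, 2358, 2257, 1777, 1246, 574, 99, 43, 43, 25, 5, 3, 2, 1, 1, 1, 1]

def shannon_entropy_bits(counts: list[int]) -> float:
    """Shannon entropy (base 2) of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

entropy = shannon_entropy_bits(counts)
entropy_ratio = entropy / math.log2(len(counts))  # 1.0 would mean a uniform distribution

print(round(entropy, 3), round(entropy_ratio, 4))
```

If the assumption holds, the printed values should land close to the reported 2.336 and 0.5715.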

Fig 16.

Top values for event_type.

Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	42.9%
Flash Flood	2358	16.0%
Thunderstorm Wind	2257	15.3%
Flood	1777	12.0%
Hail	1246	8.4%
Lightning	574	3.9%
Heavy Rain	99	0.7%
Marine Strong Wind	43	0.3%
Debris Flow	43	0.3%
Marine Thunderstorm Wind	25	0.2%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

state categorical feature

US state names in uppercase, with 65 distinct values across 14,770 complete rows — more than the 50 states, suggesting territories, military codes, or 'UNKNOWN'-style entries are mixed in. Distribution is broad (entropy ratio 0.86) with Texas leading at 9.8% (1,450 rows), followed by Missouri, Arkansas, Mississippi, and Georgia, indicating a southern/central US tilt.

Treatment: Normalize to standard state codes and one-hot or target-encode; investigate the non-state categories (at least 15, since 65 distinct values exceed the 50 states).

anthropic:claude-opus-4-7 · confidence high
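
To find the extra categories, the observed values can be compared against the 50 state names. In the sketch below the non-state inputs ('PUERTO RICO', 'GULF OF MEXICO') are hypothetical examples; the profile only shows the top 20 values, so the actual extras are unknown.

```python
US_STATES = set("""
ALABAMA,ALASKA,ARIZONA,ARKANSAS,CALIFORNIA,COLORADO,CONNECTICUT,DELAWARE,
FLORIDA,GEORGIA,HAWAII,IDAHO,ILLINOIS,INDIANA,IOWA,KANSAS,KENTUCKY,LOUISIANA,
MAINE,MARYLAND,MASSACHUSETTS,MICHIGAN,MINNESOTA,MISSISSIPPI,MISSOURI,MONTANA,
NEBRASKA,NEVADA,NEW HAMPSHIRE,NEW JERSEY,NEW MEXICO,NEW YORK,NORTH CAROLINA,
NORTH DAKOTA,OHIO,OKLAHOMA,OREGON,PENNSYLVANIA,RHODE ISLAND,SOUTH CAROLINA,
SOUTH DAKOTA,TENNESSEE,TEXAS,UTAH,VERMONT,VIRGINIA,WASHINGTON,WEST VIRGINIA,
WISCONSIN,WYOMING
""".replace("\n", "").split(","))

def non_state_values(values: list[str]) -> list[str]:
    """Return the distinct values (upper-cased, trimmed) that are not one of the 50 states."""
    return sorted({v.strip().upper() for v in values} - US_STATES)

# Hypothetical input: one real state plus two made-up non-state entries.
print(non_state_values(["TEXAS", "PUERTO RICO", "GULF OF MEXICO"]))
# ['GULF OF MEXICO', 'PUERTO RICO']
```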

Out[38]:

saturn.columns["state"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	65
top_value	TEXAS
top_rate	0.09817
cardinality	65
entropy	5.182
entropy_ratio	0.8605

Fig 17.

Top values for state.

Top values for state (20 unique shown, of 65 total).
value	count	share
TEXAS	1450	9.8%
MISSOURI	648	4.4%
ARKANSAS	602	4.1%
MISSISSIPPI	570	3.9%
GEORGIA	562	3.8%
ILLINOIS	560	3.8%
IOWA	527	3.6%
LOUISIANA	507	3.4%
TENNESSEE	499	3.4%
FLORIDA	498	3.4%
OKLAHOMA	490	3.3%
NEBRASKA	486	3.3%
ALABAMA	469	3.2%
WISCONSIN	463	3.1%
OHIO	441	3.0%
MICHIGAN	426	2.9%
NORTH CAROLINA	422	2.9%
KANSAS	418	2.8%
INDIANA	408	2.8%
KENTUCKY	383	2.6%

magnitude categorical feature

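Numeric magnitudes stored as strings, with 170 distinct values ranging from '0' to two-decimal figures like '70.00'. Formatting is inconsistent ('70' and '70.00' are counted as separate values), so the number of distinct numeric magnitudes is lower than 170. Over half the rows (51.78%) are null, and among the non-nulls the value '0' dominates at 54.24% (3,863 rows, or 26.2% of all rows), leaving real magnitudes in a minority of records. An entropy ratio of 0.48 confirms that most of the mass is concentrated in a few buckets.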

Treatment: Cast to numeric, treat '0' and nulls as likely missing/sentinel, and consider a presence flag before modelling.

anthropic:claude-opus-4-7 · confidence high
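
The cast-plus-presence-flag treatment might be sketched as follows. Treating '0' as "not reported" is the assumption suggested above, not something the data confirms, and the `parse_magnitude` helper is hypothetical.

```python
from typing import Optional

def parse_magnitude(raw: Optional[str]) -> tuple[Optional[float], bool]:
    """Return (value, is_present); null, empty, and zero count as not reported (assumption)."""
    if raw is None or raw.strip() == "":
        return None, False
    value = float(raw)
    if value == 0:
        return None, False
    return value, True

print(parse_magnitude("70.00"))  # (70.0, True)
print(parse_magnitude("70"))     # (70.0, True)
print(parse_magnitude("0"))      # (None, False)
print(parse_magnitude(None))     # (None, False)
```

Casting also merges string variants such as '70' and '70.00' into one numeric value.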

Out[41]:

saturn.columns["magnitude"].stats

stat	value
n	14,770
nulls	7,648 (51.8%)
unique	170
top_value	0
top_rate	0.5424
cardinality	170
entropy	3.586
entropy_ratio	0.484
alert: null_rate	51.8% null

Fig 18.

Top values for magnitude.

Top values for magnitude (20 unique shown, of 170 total).
value	count	share
0	3863	26.2%
1.75	383	2.6%
2.75	220	1.5%
70.00	162	1.1%
50.00	151	1.0%
2.00	150	1.0%
2.50	123	0.8%
61.00	122	0.8%
65.00	104	0.7%
52.00	95	0.6%
78.00	80	0.5%
70	79	0.5%
3.00	77	0.5%
56.00	76	0.5%
87.00	65	0.4%
60.00	63	0.4%
50	59	0.4%
60	54	0.4%
1.50	50	0.3%
61	47	0.3%

injuries categorical feature

This is an injury count per record, stored as strings but numeric in nature (the most frequent values are small integers such as '0', '1', and '2'). The distribution is dominated by zeros: 68.1% of 14,770 rows report '0' injuries, with 893 at '1' and a long tail spanning 178 distinct values. An entropy ratio of 0.33 confirms heavy concentration at the low end.

Treatment: Cast to integer and consider log1p or zero-inflated modelling given the heavy zero mass.

anthropic:claude-opus-4-7 · confidence high
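
A minimal sketch of the integer cast followed by a log1p transform, which compresses the long right tail while mapping zero to zero. The sample strings are illustrative values that appear among the top values for this column.

```python
import math

raw_injuries = ["0", "1", "12", "30"]   # illustrative sample of string-encoded counts

injuries = [int(text) for text in raw_injuries]
log_injuries = [math.log1p(n) for n in injuries]   # log(1 + n); defined at n = 0

# expm1 inverts the transform when predictions need to be mapped back to counts.
restored = [round(math.expm1(x)) for x in log_injuries]
print(restored)  # [0, 1, 12, 30]
```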

Out[44]:

saturn.columns["injuries"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	178
top_value	0
top_rate	0.6814
cardinality	178
entropy	2.468
entropy_ratio	0.3301

Fig 19.

Top values for injuries.

Top values for injuries (20 unique shown, of 178 total).
value	count	share
0	10064	68.1%
1	893	6.0%
2	552	3.7%
3	343	2.3%
4	236	1.6%
5	234	1.6%
10	219	1.5%
6	196	1.3%
12	158	1.1%
7	134	0.9%
8	121	0.8%
20	114	0.8%
15	111	0.8%
11	90	0.6%
9	85	0.6%
13	70	0.5%
14	69	0.5%
30	68	0.5%
25	56	0.4%
16	48	0.3%

fatalities categorical numeric_target

Counts of fatalities per event, stored as strings with 49 distinct values across 14,770 rows and no nulls. The distribution is heavily zero-inflated: 69.1% of records are "0" and the next bucket "1" covers 3,208 rows, leaving a long thin tail (e.g., 25 rows at "9", 24 at "10"). The low entropy ratio (0.25) is consistent with that: the "0" and "1" buckets together hold 90.8% of rows.

Treatment: Cast to integer and consider modelling as a count (Poisson/negative binomial) or binarise to any-fatality given the zero inflation.

anthropic:claude-opus-4-7 · confidence high
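
The binarise option can be sketched as below. The `any_fatality` helper is hypothetical and the four sample strings are illustrative; the row and zero counts are taken from this notebook.

```python
def any_fatality(raw: str) -> int:
    """Return 1 if the string-encoded count is above zero, else 0."""
    return int(int(raw) > 0)

labels = [any_fatality(value) for value in ["0", "1", "9", "0"]]

# Class balance implied by the reported counts: 10,209 of 14,770 rows are "0".
n_rows, n_zero = 14_770, 10_209
positive_rate = (n_rows - n_zero) / n_rows

print(labels, round(positive_rate, 4))  # [0, 1, 1, 0] 0.3088
```

A positive class of roughly 31% is imbalanced but workable; a count model keeps the information in the tail that binarising discards.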

Out[47]:

saturn.columns["fatalities"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	49
top_value	0
top_rate	0.6912
cardinality	49
entropy	1.423
entropy_ratio	0.2535

Fig 20.

Top values for fatalities.

Top values for fatalities (20 unique shown, of 49 total).
value	count	share
0	10209	69.1%
1	3208	21.7%
2	649	4.4%
3	222	1.5%
4	112	0.8%
5	74	0.5%
6	66	0.4%
7	38	0.3%
9	25	0.2%
10	24	0.2%
8	21	0.1%
11	20	0.1%
13	11	0.1%
16	10	0.1%
12	9	0.1%
14	8	0.1%
17	6	0.0%
20	6	0.0%
25	4	0.0%
23	3	0.0%

damage_property text feature

This column encodes property damage as short magnitude strings with a K/M suffix (e.g., '2.5M', '250K', '0.00K'), not free text: every value is one word and max length is 8. Formatting is inconsistent: '1M' and '1.00M' coexist, and '0.00K' appears 1,229 times alongside 368 empty entries (which are not counted as nulls), conflating true zeros with missingness. With a 93.1% duplicate rate across only 1,014 unique values (presumably 1,013 non-empty tokens plus the empty string), this is a coarse categorical-looking encoding of a numeric quantity.

Treatment: Parse the K/M suffix into a numeric dollar amount and treat empty strings as missing (distinct from 0.00K).

anthropic:claude-opus-4-7 · confidence high
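
A minimal sketch of the suffix parse, assuming K means thousands and M means millions of dollars and that a bare number is already in dollars. The `parse_damage` helper is hypothetical, and any token outside these shapes would raise `ValueError` rather than being guessed at.

```python
from typing import Optional

MULTIPLIER = {"K": 1_000, "M": 1_000_000}

def parse_damage(raw: str) -> Optional[float]:
    """Convert '2.5M', '250K', or '0.00K' to dollars; empty string means missing."""
    text = raw.strip().upper()
    if not text:
        return None                      # missing, kept distinct from a true zero
    if text[-1] in MULTIPLIER:
        return float(text[:-1]) * MULTIPLIER[text[-1]]
    return float(text)                   # assumption: bare numbers are plain dollars

print(parse_damage("2.5M"))   # 2500000.0
print(parse_damage("250K"))   # 250000.0
print(parse_damage("0.00K"))  # 0.0
print(parse_damage(""))       # None
```

Parsing also collapses formatting variants, so '1M' and '1.00M' map to the same amount.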

Out[50]:

saturn.columns["damage_property"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1,014
len_min	0
len_max	8
len_mean	4.381
len_median	5
len_p95	7
word_mean	1
word_median	1
n_empty	368
n_duplicates	13,756
duplicate_rate	0.9313
vocab_size	1,013
readability_flesch_mean	117
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.8724
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	87.2% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	93.1% duplicate strings

Fig 21.

Character-length distribution for damage_property.

Character-length distribution for damage_property (mean: 4.381).
chars	count
0	368
1	264
2	1252
3	1172
4	3414
5	6075
6	1450
7	514
8	261

source categorical metadata

This column records the data provenance, holding the constant string "NOAA Storm Events Database" for all 14770 rows. With cardinality of 1 and entropy of 0.0, it carries no information for modelling and only serves as a dataset-level annotation.

Treatment: Drop before modelling; retain in documentation as a provenance tag.

anthropic:claude-opus-4-7 · confidence high

Out[53]:

saturn.columns["source"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1
top_value	NOAA Storm Events Database
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 22.

Top values for source.

Top values for source (1 unique shown, of 1 total).
value	count	share
NOAA Storm Events Database	14770	100.0%