data-trove-noaa-significant-storms

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/weather/noaa_significant_storms.json

Saturn profiled 14,770 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/weather/noaa_significant_storms.json",
    "--findings", "data-trove-noaa-significant-storms.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This dataset contains 14,770 records of significant US storms sourced from the NOAA Storm Events Database, spanning 65 distinct state labels, with dates, locations, event types, casualties, and property damage estimates. The most striking pattern is the dominance of tornadoes (6,334 events, 43% of all records), far outnumbering the next categories, Flash Flood (2,358) and Thunderstorm Wind (2,257). Two dates worth flagging immediately are 1974-04-03 (126 events, the 1974 Super Outbreak) and 2011-04-27 (105 events, the 2011 Super Outbreak), suggesting that landmark multi-tornado outbreaks are heavily represented. Property damage skews toward million-dollar figures, with '2.5M' the single most common damage value (2,278 occurrences), hinting at rounding or a threshold-based inclusion criterion. Texas leads all states with 1,450 events, more than double the next state (Missouri at 648), reflecting both its geographic size and its exposure to severe weather corridors.

citing: row_count · column_count · event_type.top_values · date.top_values · damage_property.top_values · state.top_values · fatalities.top_values · injuries.top_values

Out[4]:

saturn.schema() · 14 columns

column	kind	n	null%	unique	alerts
latitude	numeric	14,770	0.0%	7,810
longitude	numeric	14,770	0.0%	8,828
name	text	14,770	0.0%	6,660	multilingual duplicates
description	text	14,770	0.0%	5,796	multilingual duplicates
category	categorical	14,770	0.0%	1	imbalance
date	text	14,770	0.0%	5,058	one_word allcaps short_text duplicates
country	categorical	14,770	0.0%	1	imbalance
event_type	categorical	14,770	0.0%	17
state	categorical	14,770	0.0%	65
magnitude	categorical	14,770	51.8%	170	null_rate
injuries	categorical	14,770	0.0%	178
fatalities	categorical	14,770	0.0%	49
damage_property	text	14,770	0.0%	1,014	one_word allcaps short_text duplicates
source	categorical	14,770	0.0%	1	imbalance

Fig 1.

event_type · Look for the outsized dominance of Tornado versus all other storm types — it accounts for 43% of all records.

Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	42.9%
Flash Flood	2358	16.0%
Thunderstorm Wind	2257	15.3%
Flood	1777	12.0%
Hail	1246	8.4%
Lightning	574	3.9%
Heavy Rain	99	0.7%
Marine Strong Wind	43	0.3%
Debris Flow	43	0.3%
Marine Thunderstorm Wind	25	0.2%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

Fig 2.

state · Texas leads by a wide margin; compare the long tail of less-affected states to spot the core tornado-alley concentration.

Top values for state (20 unique shown, of 65 total).
value	count	share
TEXAS	1450	9.8%
MISSOURI	648	4.4%
ARKANSAS	602	4.1%
MISSISSIPPI	570	3.9%
GEORGIA	562	3.8%
ILLINOIS	560	3.8%
IOWA	527	3.6%
LOUISIANA	507	3.4%
TENNESSEE	499	3.4%
FLORIDA	498	3.4%
OKLAHOMA	490	3.3%
NEBRASKA	486	3.3%
ALABAMA	469	3.2%
WISCONSIN	463	3.1%
OHIO	441	3.0%
MICHIGAN	426	2.9%
NORTH CAROLINA	422	2.9%
KANSAS	418	2.8%
INDIANA	408	2.8%
KENTUCKY	383	2.6%

Fig 3.

fatalities · The distribution is heavily right-skewed — most events have zero fatalities, but look for the rare high-casualty outliers.

Top values for fatalities (20 unique shown, of 49 total).
value	count	share
0	10209	69.1%
1	3208	21.7%
2	649	4.4%
3	222	1.5%
4	112	0.8%
5	74	0.5%
6	66	0.4%
7	38	0.3%
9	25	0.2%
10	24	0.2%
8	21	0.1%
11	20	0.1%
13	11	0.1%
16	10	0.1%
12	9	0.1%
14	8	0.1%
17	6	0.0%
20	6	0.0%
25	4	0.0%
23	3	0.0%

Fig 4.

damage_property · Check whether damage clusters around round million-dollar values, which may indicate reporting thresholds or rounding conventions.

Character-length distribution for damage_property (mean: 4.381).
chars	count
0	368
1	264
2	1252
3	1172
4	3414
5	6075
6	1450
7	514
8	261

Fig 5.

injuries · Like fatalities, injuries are zero for most events — scan the tail to identify the handful of mass-casualty storm incidents.

Top values for injuries (20 unique shown, of 178 total).
value	count	share
0	10064	68.1%
1	893	6.0%
2	552	3.7%
3	343	2.3%
4	236	1.6%
5	234	1.6%
10	219	1.5%
6	196	1.3%
12	158	1.1%
7	134	0.9%
8	121	0.8%
20	114	0.8%
15	111	0.8%
11	90	0.6%
9	85	0.6%
13	70	0.5%
14	69	0.5%
30	68	0.5%
25	56	0.4%
16	48	0.3%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
latitude	numeric	0.0%
longitude	numeric	0.0%
name	text	0.0%
description	text	0.0%
category	categorical	0.0%
date	text	0.0%
country	categorical	0.0%
event_type	categorical	0.0%
state	categorical	0.0%
magnitude	categorical	51.8%
injuries	categorical	0.0%
fatalities	categorical	0.0%
damage_property	text	0.0%
source	categorical	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 10,000 detected strings).
lang	count	share
en	9780	97.8%
es	134	1.3%
de	25	0.2%
ja	23	0.2%
no	10	0.1%
id	6	0.1%
fr	6	0.1%
it	5	0.1%
pt	4	0.0%
sr	2	0.0%
ru	2	0.0%
eu	2	0.0%
zh	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	latitude	longitude
latitude	+1.00	-0.31
longitude	-0.31	+1.00

latitude numeric feature

This column contains geographic latitude in decimal degrees, spanning -14.3236 to 70.1269. Because country is constant at 'USA', that range is better read as the United States plus its outlying areas than as worldwide data: the southern extreme is consistent with American Samoa and the northern extreme with northern Alaska. The distribution is tightly clustered between Q1=33.63 and Q3=41.13 (IQR ~7.5), which places the bulk of records in the contiguous US, and the near-identical mean (37.28) and median (37.12) indicate only mild skew (-0.18). The kurtosis of 3.34 and the 159 outliers (~1.1%) reflect small tails at low and high latitudes; an analyst should confirm these are legitimate territory, Hawaii, or Alaska records and not geocoding errors.

Treatment: Use as-is or pair with longitude for spatial modelling; consider binning into regions or projecting to avoid Euclidean distance distortion.

anthropic:default · confidence high
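
The treatment note above warns about Euclidean distance on raw degrees. A common alternative is the haversine great-circle distance, sketched below. The function name and the mean Earth radius of 6371 km are assumptions of this sketch, not part of Saturn.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

One degree of longitude covers less ground at the median latitude (37.12) than at the equator, which is why distances computed directly on degree values are distorted.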

Out[14]:

saturn.columns["latitude"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	7,810
min	-14.32
max	70.13
mean	37.28
median	37.12
std	5.247
q1	33.63
q3	41.13
iqr	7.499
skew	-0.1787
kurtosis	3.341
n_outliers	159
outlier_rate	0.01077
zero_rate	0

Fig 9.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 37.12).
bin	count
-14.32 – -12.21	3
-12.21 – -10.1	0
-10.1 – -7.99	0
-7.99 – -5.879	0
-5.879 – -3.767	0
-3.767 – -1.656	0
-1.656 – 0.4552	0
0.4552 – 2.566	0
2.566 – 4.678	0
4.678 – 6.789	0
6.789 – 8.9	2
8.9 – 11.01	0
11.01 – 13.12	0
13.12 – 15.23	2
15.23 – 17.35	0
17.35 – 19.46	75
19.46 – 21.57	19
21.57 – 23.68	10
23.68 – 25.79	22
25.79 – 27.9	270
27.9 – 30.01	522
30.01 – 32.12	1240
32.12 – 34.24	2165
34.24 – 36.35	2333
36.35 – 38.46	1803
38.46 – 40.57	1901
40.57 – 42.68	2226
42.68 – 44.79	1382
44.79 – 46.9	515
46.9 – 49.01	232
49.01 – 51.13	0
51.13 – 53.24	0
53.24 – 55.35	0
55.35 – 57.46	5
57.46 – 59.57	6
59.57 – 61.68	15
61.68 – 63.79	11
63.79 – 65.9	8
65.9 – 68.02	2
68.02 – 70.13	1

longitude numeric feature

This column represents geographic longitude in decimal degrees, spanning -170.7316 to 171.4689. The bulk of observations fall in the central and eastern United States (mean -90.94, IQR -96.4 to -84.23), but the extreme kurtosis of 55.6 and 623 outliers (4.2%) indicate a heavy-tailed distribution with a minority of records far outside this core. The western tail down to about -170 is consistent with Alaska, Hawaii, and American Samoa; the four records with positive longitude (between roughly +137 and +171) could be western Pacific locations such as Guam, or sign errors. The positive skew (1.29) reflects those few eastern-hemisphere values stretching the right tail.

Treatment: Retain as-is for geospatial modelling; before clustering or bounding-box filtering, check whether the 623 outliers are data-entry errors (such as a flipped sign) or legitimate records from Alaska, Hawaii, and US territories.

anthropic:default · confidence high
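
The report does not say how Saturn counts outliers. A minimal sketch of one common convention, Tukey's 1.5 × IQR fences, is shown here so the flagged range can be inspected by hand; that Saturn uses this rule is an assumption, and the function names are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# Quartiles taken from the longitude stats table in this report.
lon_lower, lon_upper = tukey_fences(-96.4, -84.23)
```

Under this rule the longitude fences sit near -114.7 and -66.0, so ordinary West Coast records would be flagged alongside the Pacific ones; an outlier flag here is a prompt to look, not evidence of an error.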

Out[17]:

saturn.columns["longitude"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	8,828
min	-170.7
max	171.5
mean	-90.94
median	-90.22
std	11.7
q1	-96.4
q3	-84.23
iqr	12.17
skew	1.286
kurtosis	55.61
n_outliers	623
outlier_rate	0.04218
zero_rate	0

Fig 10.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: -90.22).
bin	count
-170.7 – -162.2	4
-162.2 – -153.6	33
-153.6 – -145.1	28
-145.1 – -136.5	4
-136.5 – -128	11
-128 – -119.4	305
-119.4 – -110.8	466
-110.8 – -102.3	544
-102.3 – -93.74	3693
-93.74 – -85.18	5377
-85.18 – -76.63	3281
-76.63 – -68.07	941
-68.07 – -59.52	79
-59.52 – -50.96	0
-50.96 – -42.41	0
-42.41 – -33.85	0
-33.85 – -25.3	0
-25.3 – -16.74	0
-16.74 – -8.186	0
-8.186 – 0.3687	0
0.3687 – 8.924	0
8.924 – 17.48	0
17.48 – 26.03	0
26.03 – 34.59	0
34.59 – 43.14	0
43.14 – 51.7	0
51.7 – 60.25	0
60.25 – 68.81	0
68.81 – 77.36	0
77.36 – 85.92	0
85.92 – 94.47	0
94.47 – 103	0
103 – 111.6	0
111.6 – 120.1	0
120.1 – 128.7	0
128.7 – 137.2	0
137.2 – 145.8	2
145.8 – 154.4	1
154.4 – 162.9	0
162.9 – 171.5	1

name text label

This column contains structured event description labels of the form '[Weather Event Type] in [STATE, COUNTY]', effectively serving as a composite label combining event type and geographic location. The duplicate rate is strikingly high at 54.9%, with 8,110 duplicates across 14,770 rows and only 6,660 unique values, indicating that the same event type/location combinations recur frequently — consistent with repeated weather incidents in the same areas. The multilingual alert is almost certainly a false positive from language detection mis-classifying US place names and weather terminology as non-English; dominant language is English (4,796 of sampled values) and top values are entirely English-structured strings. Vocabulary size of 1,980 across ~14k rows and a mean of ~4.6 words per entry confirm the formulaic, low-variety nature of the text.

Treatment: Parse into two structured features (event_type, state_county) via regex split on ' in ' before modelling; do not embed as raw text.

anthropic:default · confidence high
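
The split suggested in the treatment can be sketched with plain string partitioning. This assumes every name follows the '[Weather Event Type] in [STATE, COUNTY]' template described above; the function name and the sample string in the usage note are hypothetical, and names that do not fit the template return None so they can be reviewed separately.

```python
def split_name(name):
    """Split '<event type> in <STATE, COUNTY>' into parts; return None if ' in ' is absent."""
    # partition splits on the first ' in ' only, so later occurrences stay in the place part.
    event_type, sep, place = name.partition(" in ")
    if not sep:
        return None
    state, _, county = place.partition(", ")
    return {
        "event_type": event_type.strip(),
        "state": state.strip(),
        "county": county.strip() or None,  # None when no county follows the state
    }
```

For example, `split_name("Flash Flood in TEXAS, DALLAS")` yields the three parts as a dictionary. The handful of names longer than about 50 characters may not fit the template and are worth checking by hand.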

Out[20]:

saturn.columns["name"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	6,660
len_min	17
len_max	134
len_mean	30.22
len_median	29
len_p95	41
word_mean	4.588
word_median	4
n_empty	0
n_duplicates	8,110
duplicate_rate	0.5491
vocab_size	1,980
readability_flesch_mean	31.16
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	13 languages detected in sample
alert: duplicates	54.9% duplicate strings

Fig 11.

Character-length distribution for name.

Character-length distribution for name (mean: 30.22).
chars	count
17 – 20	50
20 – 23	793
23 – 26	2165
26 – 29	3842
29 – 32	2969
32 – 35	1813
35 – 37	1442
37 – 40	915
40 – 43	493
43 – 46	153
46 – 49	45
49 – 52	4
52 – 55	3
55 – 58	4
58 – 61	6
61 – 64	9
64 – 67	5
67 – 70	7
70 – 73	3
73 – 76	10
76 – 78	7
78 – 81	5
81 – 84	6
84 – 87	3
87 – 90	3
90 – 93	3
93 – 96	1
96 – 99	5
99 – 102	3
102 – 105	0
105 – 108	0
108 – 111	0
111 – 114	0
114 – 116	0
116 – 119	1
119 – 122	0
122 – 125	0
125 – 128	0
128 – 131	1
131 – 134	1

description text label

This column contains templated summaries of storm outcomes: property damage amounts, injury counts, fatalities, and event magnitude (e.g., 'Magnitude 0; $2.5M property damage'). The magnitude here is the storm-event magnitude also held in the magnitude column, not a seismic measure. The duplicate rate is strikingly high at 60.76%, with 8,974 duplicates across 14,770 rows and only 5,796 unique values, indicating templated strings generated from a small set of outcome combinations; the long tail of entries between roughly 100 and 259 characters suggests a minority also carry free-form narrative. The Flesch readability mean of 29.86 reflects the dense, numeric, shorthand content. The multilingual signal (10 Norwegian, 5 French, 1 Japanese entries) is most likely a language-detection false positive on short, number-heavy strings, as with name, since source is constant across all rows.

Treatment: Parse structured fields (damage amount, injuries, fatalities, magnitude) via regex into separate numeric columns rather than embedding as text.

anthropic:default · confidence high
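
A minimal sketch of the regex parsing suggested in the treatment follows. Only one template is quoted in this report ('Magnitude 0; $2.5M property damage'), so the patterns cover just the magnitude and damage fragments; the wording of any injury or fatality fragments is not shown here and is deliberately left out. The names `MAGNITUDE_RE`, `DAMAGE_RE`, and `parse_description` are hypothetical.

```python
import re

# Patterns inferred from the single quoted example; other templates may exist.
MAGNITUDE_RE = re.compile(r"Magnitude\s+(\d+(?:\.\d+)?)")
DAMAGE_RE = re.compile(r"\$(\d+(?:\.\d+)?[KMB]?)\s+property damage")


def parse_description(text):
    """Extract magnitude (float) and the raw damage token from a description string."""
    mag = MAGNITUDE_RE.search(text)
    dmg = DAMAGE_RE.search(text)
    return {
        "magnitude": float(mag.group(1)) if mag else None,
        "damage_property": dmg.group(1) if dmg else None,
    }
```

Fields that are absent come back as None, which keeps free-form descriptions from raising errors during a bulk pass.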

Out[23]:

saturn.columns["description"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	5,796
len_min	3
len_max	259
len_mean	50.09
len_median	36
len_p95	166
word_mean	7.393
word_median	5
n_empty	0
n_duplicates	8,974
duplicate_rate	0.6076
vocab_size	4,289
readability_flesch_mean	29.86
emoji_rate	0
url_rate	0
one_word_rate	0.0002708
allcaps_rate	0.0002708
boilerplate_rate	0
alert: multilingual	5 languages detected in sample
alert: duplicates	60.8% duplicate strings

Fig 12.

Character-length distribution for description.

Character-length distribution for description (mean: 50.09).
chars	count
3 – 9	4
9 – 16	13
16 – 22	3101
22 – 29	1689
29 – 35	2230
35 – 41	1186
41 – 48	572
48 – 54	1589
54 – 61	2168
61 – 67	902
67 – 73	7
73 – 80	6
80 – 86	14
86 – 93	10
93 – 99	15
99 – 105	25
105 – 112	30
112 – 118	30
118 – 125	34
125 – 131	47
131 – 137	71
137 – 144	64
144 – 150	52
150 – 157	77
157 – 163	60
163 – 169	74
169 – 176	66
176 – 182	65
182 – 189	54
189 – 195	69
195 – 201	82
201 – 208	70
208 – 214	72
214 – 221	78
221 – 227	64
227 – 233	29
233 – 240	20
240 – 246	11
246 – 253	12
253 – 259	8

category categorical metadata

This column is a dataset category tag, holding a single constant value 'significant_us_storms' across all 14,770 rows with no nulls. It carries zero information entropy (entropy = 0.0) and a top_rate of 1.0, meaning it is entirely invariant. This is a metadata label describing the dataset itself, not a feature with predictive or analytical value.

Treatment: Drop before modelling; constant column adds no signal and will cause issues with variance-based methods.

anthropic:default · confidence high
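
Constant columns such as category, country, and source can be detected programmatically before they are dropped. The sketch below assumes the JSON file has been loaded as a list of dictionaries, one per record; that layout and the function name are assumptions, not something this report confirms.

```python
def constant_columns(rows):
    """Return the sorted names of columns that hold one distinct value across all rows."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            seen.setdefault(column, set()).add(value)
    return sorted(column for column, values in seen.items() if len(values) == 1)


# Tiny illustrative sample, not real records from the dataset.
sample = [
    {"category": "significant_us_storms", "country": "USA", "state": "TEXAS"},
    {"category": "significant_us_storms", "country": "USA", "state": "IOWA"},
]
```

Calling `constant_columns(sample)` returns the two invariant columns and leaves state alone.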

Out[26]:

saturn.columns["category"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1
top_value	significant_us_storms
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 13.

Top values for category.

Top values for category (1 unique shown, of 1 total).
value	count	share
significant_us_storms	14770	100.0%

date text timestamp

This column contains ISO-8601 date strings (YYYY-MM-DD), stored as text rather than a native date type; all 14,770 values are exactly 10 characters with zero nulls. The duplicate rate of 65.75% (9,712 duplicates across 5,058 unique dates) is expected for an event date, since a single outbreak produces many records on the same day, so the column is a grouping attribute and not a unique record timestamp. The top date, 1974-04-03, appears 126 times, and several 2011 dates cluster heavily, which may reflect significant event concentrations worth investigating. The one_word, allcaps, and short_text alerts are side effects of profiling a date as text and can be ignored.

Treatment: Parse to native date type, then use as a grouping/join key or engineer calendar features (year, month, day-of-week) for modelling.

anthropic:default · confidence high
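
The parsing and calendar-feature step in the treatment can be sketched with the standard library alone. The function name and the choice of features are illustrative assumptions.

```python
from datetime import date


def calendar_features(iso_date):
    """Parse a YYYY-MM-DD string and derive simple calendar features."""
    d = date.fromisoformat(iso_date)  # raises ValueError on malformed input
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),  # Monday is 0, Sunday is 6
        "day_of_year": d.timetuple().tm_yday,
    }
```

For the most frequent date in the column, `calendar_features("1974-04-03")` reports month 4 and day_of_week 2 (a Wednesday).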

Out[29]:

saturn.columns["date"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	5,058
len_min	10
len_max	10
len_mean	10
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	9,712
duplicate_rate	0.6575
vocab_size	5,058
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	65.8% duplicate strings

Fig 14.

Character-length distribution for date.

Character-length distribution for date (mean: 10.0).
chars	count
10	14770

country categorical metadata

This column represents the country of origin or scope for all records in the dataset, and every single one of the 14,770 rows contains the value 'USA' — making it a zero-entropy constant. The column carries no discriminative information whatsoever and will contribute nothing to any model or analysis. Its uniformity may also indicate the dataset is intentionally scoped to a single country, which is worth confirming before joining with broader datasets.

Treatment: Drop before modelling; constant column with zero variance and entropy of 0.0.

anthropic:default · confidence high

Out[32]:

saturn.columns["country"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1
top_value	USA
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 15.

Top values for country.

Top values for country (1 unique shown, of 1 total).
value	count	share
USA	14770	100.0%

event_type categorical label

This column contains categorical labels for natural weather/disaster event types across 14,770 records, with 17 distinct categories and no nulls. The dominant class is 'Tornado' at 42.9% (6,334 occurrences), creating notable class imbalance — the top 5 categories ('Tornado', 'Flash Flood', 'Thunderstorm Wind', 'Flood', 'Hail') account for the vast majority of records, while tail categories like 'Marine Thunderstorm Wind' (25) and 'Debris Flow' (43) are sparsely represented. The entropy ratio of 0.572 confirms moderate but uneven spread across classes, which will challenge classifiers without resampling or class-weight adjustment.

Treatment: Encode as nominal category; apply class weights or oversample minority classes (e.g., 'Marine Thunderstorm Wind' n=25) before classification modelling.

anthropic:default · confidence high
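
One common way to derive the class weights mentioned in the treatment is the "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight="balanced"`. The sketch below applies it to the counts from the event_type table; the function name is hypothetical and this is only one of several reasonable weighting schemes.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Counts copied from the event_type top-values table in this report.
event_type_counts = {
    "Tornado": 6334, "Flash Flood": 2358, "Thunderstorm Wind": 2257, "Flood": 1777,
    "Hail": 1246, "Lightning": 574, "Heavy Rain": 99, "Marine Strong Wind": 43,
    "Debris Flow": 43, "Marine Thunderstorm Wind": 25, "Marine High Wind": 5,
    "Dust Devil": 3, "Waterspout": 2, "Tropical Storm": 1, "High Wind": 1,
    "Heat": 1, "Marine Lightning": 1,
}
weights = balanced_class_weights(event_type_counts)
```

Tornado ends up with a weight of about 0.14 and Marine Thunderstorm Wind with about 34.8. The four single-record classes receive weights near 869, which is a sign they are better merged into an "other" bucket or excluded than reweighted.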

Out[35]:

saturn.columns["event_type"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	17
top_value	Tornado
top_rate	0.4288
cardinality	17
entropy	2.336
entropy_ratio	0.5715
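
The entropy figures in this table are consistent with Shannon entropy in bits and a ratio normalised by log2 of the cardinality. That reading is inferred from the numbers in this report, not taken from Saturn's documentation; a minimal sketch of the check:

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts copied from the event_type top-values table in this report.
event_type_counts = [6334, 2358, 2257, 1777, 1246, 574, 99, 43, 43, 25, 5, 3, 2, 1, 1, 1, 1]
entropy = shannon_entropy_bits(event_type_counts)
entropy_ratio = entropy / math.log2(len(event_type_counts))  # 1.0 would be a uniform spread
```

The computed values agree with the reported entropy of 2.336 and entropy_ratio of 0.5715 to the precision shown.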

Fig 16.

Top values for event_type.

Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	42.9%
Flash Flood	2358	16.0%
Thunderstorm Wind	2257	15.3%
Flood	1777	12.0%
Hail	1246	8.4%
Lightning	574	3.9%
Heavy Rain	99	0.7%
Marine Strong Wind	43	0.3%
Debris Flow	43	0.3%
Marine Thunderstorm Wind	25	0.2%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

state categorical label

This column represents the US state associated with each record, stored as full uppercase state names. With 65 unique values against the expected 50 US states, there are likely extra entries such as territories (e.g., Puerto Rico, Guam), non-standard labels, or minor data quality issues worth auditing. Texas leads at 9.8% of records (1,450), more than double second-place Missouri (648), and the top-10 states are heavily weighted toward the South and Midwest. The high entropy ratio of 0.86 indicates a relatively even spread across categories, with Texas the one clear outlier.

Treatment: Standardize to a canonical list (resolve the 65→50+ mapping), then one-hot encode or use target encoding for modelling.

anthropic:default · confidence high
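
A first step toward the canonical mapping is to list which labels are not one of the 50 states. The sketch below does only that audit; how the remaining labels should be mapped is left open, because the report does not show what the 15 extra values are. The function name and the sample labels in the usage note are hypothetical.

```python
US_STATES = frozenset({
    "ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO", "CONNECTICUT",
    "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO", "ILLINOIS", "INDIANA", "IOWA",
    "KANSAS", "KENTUCKY", "LOUISIANA", "MAINE", "MARYLAND", "MASSACHUSETTS", "MICHIGAN",
    "MINNESOTA", "MISSISSIPPI", "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA",
    "NEW HAMPSHIRE", "NEW JERSEY", "NEW MEXICO", "NEW YORK", "NORTH CAROLINA",
    "NORTH DAKOTA", "OHIO", "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND",
    "SOUTH CAROLINA", "SOUTH DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT",
    "VIRGINIA", "WASHINGTON", "WEST VIRGINIA", "WISCONSIN", "WYOMING",
})


def non_state_labels(values):
    """Return the sorted distinct labels that are not one of the 50 US states."""
    return sorted({v for v in values if v.strip().upper() not in US_STATES})
```

For example, `non_state_labels(["TEXAS", "GUAM", "PUERTO RICO"])` returns the two territory labels, which can then be kept, merged, or dropped deliberately.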

Out[38]:

saturn.columns["state"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	65
top_value	TEXAS
top_rate	0.09817
cardinality	65
entropy	5.182
entropy_ratio	0.8605

Fig 17.

Top values for state.

Top values for state (20 unique shown, of 65 total).
value	count	share
TEXAS	1450	9.8%
MISSOURI	648	4.4%
ARKANSAS	602	4.1%
MISSISSIPPI	570	3.9%
GEORGIA	562	3.8%
ILLINOIS	560	3.8%
IOWA	527	3.6%
LOUISIANA	507	3.4%
TENNESSEE	499	3.4%
FLORIDA	498	3.4%
OKLAHOMA	490	3.3%
NEBRASKA	486	3.3%
ALABAMA	469	3.2%
WISCONSIN	463	3.1%
OHIO	441	3.0%
MICHIGAN	426	2.9%
NORTH CAROLINA	422	2.9%
KANSAS	418	2.8%
INDIANA	408	2.8%
KENTUCKY	383	2.6%

magnitude categorical feature

This column holds the storm magnitude, stored as a categorical type despite containing numeric-looking values spanning a wide range (e.g., 1.75, 2.75, 50.00, 70.00). In the NOAA Storm Events Database, magnitude is recorded only for hail (size in inches) and wind events (speed in knots), so it is not a seismic or stellar scale. Three points stand out. First, 51.78% of rows are null, triggering an alert. Second, the dominant value '0' accounts for 54.24% of non-null rows (3,863 of 7,122), suggesting zero may encode 'none' or 'unknown' rather than a true zero magnitude. Third, the mix of small decimals (1.75, 2.00, 2.50) and large round values (50.00, 61.00, 65.00, 70.00) is consistent with hail sizes and wind speeds sharing one column, and the same number appears under different string spellings (for example '70.00' with 162 rows and '70' with 79), which inflates the count of 170 unique values.

Treatment: Investigate zero sentinel vs. true zero, impute or drop nulls based on missingness mechanism, cast to float, then assess whether log-transform or binning is appropriate before modelling.

anthropic:default · confidence medium
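
Casting to float, as the treatment suggests, also merges spellings such as '70' and '70.00' that the categorical profile counts separately. A minimal sketch, assuming nulls arrive as None or empty strings (the function name and sample values are illustrative):

```python
from collections import Counter


def to_float(value):
    """Convert a magnitude string to float; return None for nulls and unparseable text."""
    if value is None or str(value).strip() == "":
        return None
    try:
        return float(value)
    except ValueError:
        return None


# Illustrative values in the spellings seen in the top-values table.
raw = ["70.00", "70", "1.75", "0", None]
merged = Counter(x for x in map(to_float, raw) if x is not None)
```

After the cast, '70.00' and '70' fall under the single key 70.0. Whether the zeros are genuine measurements still has to be decided separately, ideally per event type.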

Out[41]:

saturn.columns["magnitude"].stats

stat	value
n	14,770
nulls	7,648 (51.8%)
unique	170
top_value	0
top_rate	0.5424
cardinality	170
entropy	3.586
entropy_ratio	0.484
alert: null_rate	51.8% null

Fig 18.

Top values for magnitude.

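Top values for magnitude (20 unique shown, of 170 total). Share is of all 14,770 rows, nulls included, so '0' shows 26.2% here versus a top_rate of 0.5424 among non-null rows.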
value	count	share
0	3863	26.2%
1.75	383	2.6%
2.75	220	1.5%
70.00	162	1.1%
50.00	151	1.0%
2.00	150	1.0%
2.50	123	0.8%
61.00	122	0.8%
65.00	104	0.7%
52.00	95	0.6%
78.00	80	0.5%
70	79	0.5%
3.00	77	0.5%
56.00	76	0.5%
87.00	65	0.4%
60.00	63	0.4%
50	59	0.4%
60	54	0.4%
1.50	50	0.3%
61	47	0.3%

injuries categorical feature

This column represents a count of injuries per record, stored as a categorical type despite being fundamentally numeric. The dominant value is '0', appearing in 68.1% of rows (10,064 of 14,770), so most records involve no injuries. With 178 unique values and counts that drop off steeply after zero, the distribution looks zero-inflated, and the categorical encoding is likely a data-type artifact: the values are non-negative integer counts and should be treated as numeric.

Treatment: Cast to integer, then model with zero-inflated Poisson or apply log1p transform before regression given heavy zero inflation.

anthropic:default · confidence high
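
The cast and log1p transform from the treatment can be sketched as follows; the zero-inflated Poisson option needs a statistics library and is not shown. The function names are hypothetical, and the sketch assumes every value is a clean non-negative integer string, which should be verified against the full column.

```python
import math


def to_count(value):
    """Convert a count stored as text to a non-negative int; raise ValueError otherwise."""
    count = int(str(value).strip())
    if count < 0:
        raise ValueError(f"negative count: {value!r}")
    return count


def log1p_counts(values):
    """Apply log(1 + x) to each count, which maps 0 to 0 and compresses the long tail."""
    return [math.log1p(to_count(v)) for v in values]
```

Because log1p(0) is 0, the 68.1% of rows with no injuries stay at zero while large counts are pulled in toward the rest of the distribution.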

Out[44]:

saturn.columns["injuries"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	178
top_value	0
top_rate	0.6814
cardinality	178
entropy	2.468
entropy_ratio	0.3301

Fig 19.

Top values for injuries.

Top values for injuries (20 unique shown, of 178 total).
value	count	share
0	10064	68.1%
1	893	6.0%
2	552	3.7%
3	343	2.3%
4	236	1.6%
5	234	1.6%
10	219	1.5%
6	196	1.3%
12	158	1.1%
7	134	0.9%
8	121	0.8%
20	114	0.8%
15	111	0.8%
11	90	0.6%
9	85	0.6%
13	70	0.5%
14	69	0.5%
30	68	0.5%
25	56	0.4%
16	48	0.3%

fatalities categorical feature

This column records fatality counts per incident, stored as strings but representing non-negative integers ranging from 0 to at least 25 across 49 distinct values. The dominant value is '0' at 69.1% of rows (10,209 of 14,770), indicating most incidents involve no fatalities. The distribution is heavily right-skewed, with counts dropping sharply: 1 fatality appears 3,208 times, 2 appears 649 times, and values thin out rapidly beyond that. Since only 20 of the 49 unique values are shown, higher-count outliers exist beyond the listed range. Low entropy (1.42, ratio 0.25) confirms the extreme concentration on zero.

Treatment: Cast to integer, treat as count variable; consider zero-inflated modelling or log1p-transform given severe right skew and 69.1% zero mass.

anthropic:default · confidence high

Out[47]:

saturn.columns["fatalities"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	49
top_value	0
top_rate	0.6912
cardinality	49
entropy	1.423
entropy_ratio	0.2535

Fig 20.

Top values for fatalities.

Top values for fatalities (20 unique shown, of 49 total).
value	count	share
0	10209	69.1%
1	3208	21.7%
2	649	4.4%
3	222	1.5%
4	112	0.8%
5	74	0.5%
6	66	0.4%
7	38	0.3%
9	25	0.2%
10	24	0.2%
8	21	0.1%
11	20	0.1%
13	11	0.1%
16	10	0.1%
12	9	0.1%
14	8	0.1%
17	6	0.0%
20	6	0.0%
25	4	0.0%
23	3	0.0%

damage_property text feature

This column represents property damage amounts stored as suffixed strings (e.g., '2.5M', '250K', '0.00K'), the K/M notation used by the NOAA Storm Events Database. With only 1,014 unique values across 14,770 rows, the duplicate rate is extremely high at 93.1%, reflecting heavy rounding of damage estimates rather than precise measurements. All values are single tokens (one_word_rate: 1.0) and 87.2% are flagged all-caps, which is a side effect of the uppercase suffix and not a sign of categorical coding. There are 368 empty strings (nulls are reported as 0.0% but n_empty=368), which should be treated as missing values.

Treatment: Parse suffix notation (K=thousands, M=millions) to convert to numeric float, treat empty strings as null, then log-transform before modelling.

anthropic:default · confidence high
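
The suffix parsing described in the treatment can be sketched as below. K and M come from the examples in this report; the B (billions) multiplier, case-insensitive matching, and support for suffix-less numbers are assumptions added for robustness, and the function name is hypothetical.

```python
import re

# "K" and "M" appear in the report's examples; "B" and "" (no suffix) are assumptions.
_MULTIPLIERS = {"": 1.0, "K": 1e3, "M": 1e6, "B": 1e9}
_DAMAGE_RE = re.compile(r"^\s*(\d+(?:\.\d*)?|\.\d+)\s*([KMB]?)\s*$", re.IGNORECASE)


def parse_damage(text):
    """Convert strings like '2.5M' to dollars; return None for empty or unparseable input."""
    if text is None or not text.strip():
        return None  # the 368 empty strings become missing values
    match = _DAMAGE_RE.match(text)
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * _MULTIPLIERS[suffix.upper()]
```

The parsed values can then be passed through `math.log1p` before modelling, which keeps the zero-damage rows defined. Values that return None for a reason other than being empty are worth listing, since they point to formats this sketch does not anticipate.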

Out[50]:

saturn.columns["damage_property"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1,014
len_min	0
len_max	8
len_mean	4.381
len_median	5
len_p95	7
word_mean	1
word_median	1
n_empty	368
n_duplicates	13,756
duplicate_rate	0.9313
vocab_size	1,013
readability_flesch_mean	117
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.8724
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	87.2% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	93.1% duplicate strings

Fig 21.

Character-length distribution for damage_property.

Character-length distribution for damage_property (mean: 4.381).
chars	count
0	368
1	264
2	1252
3	1172
4	3414
5	6075
6	1450
7	514
8	261

source categorical metadata

This column identifies the data source, and every single one of the 14,770 rows carries the identical value 'NOAA Storm Events Database' — cardinality of 1 with top_rate of 1.0 and entropy of 0.0. It is a constant metadata field, almost certainly a provenance tag added during ingestion. It carries zero predictive or analytical signal.

Treatment: Drop before modelling; constant column adds no variance.

anthropic:default · confidence high

Out[53]:

saturn.columns["source"].stats

stat	value
n	14,770
nulls	0 (0.0%)
unique	1
top_value	NOAA Storm Events Database
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 22.

Top values for source.

Top values for source (1 unique shown, of 1 total).
value	count	share
NOAA Storm Events Database	14770	100.0%
