natural hazards storms

source /home/coolhand/html/datavis/data_trove/data/natural_hazards/storms.json · 14,770 rows · 14 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 14,770 records of significant U.S. storm events sourced entirely from the NOAA Storm Events Database, with each row describing a weather incident's location, type, magnitude, and damages. Tornadoes dominate the event mix at roughly 43% of records, followed by Flash Floods and Thunderstorm Winds, so the event_type distribution is the first thing to inspect. Geographically the data skews toward Texas (1,450 events) and other tornado-belt states like Missouri, Arkansas, and Mississippi, which is worth confirming on the latitude/longitude spread. Two caveats deserve attention: the magnitude field is missing for 51.8% of rows, and category/country/source are constants (single value) so they carry no analytical signal. Fatalities and injuries are heavily zero-inflated (about 69% and 68% zeros respectively), meaning summary stats will be driven by a small tail of severe events.

citing: row_count · column_count · columns.event_type.top_values · columns.event_type.stats.top_rate · columns.state.top_values · columns.magnitude.null_rate · columns.fatalities.stats.top_rate · columns.injuries.stats.top_rate · columns.country.stats.top_value · columns.source.stats.top_value · columns.latitude.stats · columns.longitude.stats

Charts the summary said to look at first

event_type · Tornadoes account for ~43% of events; check how steeply the distribution drops to Flash Flood and Thunderstorm Wind.

Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	42.9%
Flash Flood	2358	16.0%
Thunderstorm Wind	2257	15.3%
Flood	1777	12.0%
Hail	1246	8.4%
Lightning	574	3.9%
Heavy Rain	99	0.7%
Marine Strong Wind	43	0.3%
Debris Flow	43	0.3%
Marine Thunderstorm Wind	25	0.2%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

state · Texas leads at nearly 10% of records — see how concentrated events are in the southern/midwestern tornado belt.

Top values for state (20 unique shown, of 65 total).
value	count	share
TEXAS	1450	9.8%
MISSOURI	648	4.4%
ARKANSAS	602	4.1%
MISSISSIPPI	570	3.9%
GEORGIA	562	3.8%
ILLINOIS	560	3.8%
IOWA	527	3.6%
LOUISIANA	507	3.4%
TENNESSEE	499	3.4%
FLORIDA	498	3.4%
OKLAHOMA	490	3.3%
NEBRASKA	486	3.3%
ALABAMA	469	3.2%
WISCONSIN	463	3.1%
OHIO	441	3.0%
MICHIGAN	426	2.9%
NORTH CAROLINA	422	2.9%
KANSAS	418	2.8%
INDIANA	408	2.8%
KENTUCKY	383	2.6%

fatalities · Heavy zero-inflation (69% are 0 fatalities) means the tail of deadly events is what matters analytically.

Top values for fatalities (20 unique shown, of 49 total).
value	count	share
0	10209	69.1%
1	3208	21.7%
2	649	4.4%
3	222	1.5%
4	112	0.8%
5	74	0.5%
6	66	0.4%
7	38	0.3%
9	25	0.2%
10	24	0.2%
8	21	0.1%
11	20	0.1%
13	11	0.1%
16	10	0.1%
12	9	0.1%
14	8	0.1%
17	6	0.0%
20	6	0.0%
25	4	0.0%
23	3	0.0%

damage_property · Damage values cluster on a few round figures like 2.5M and 1.00M, hinting at coded or estimated reporting rather than precise amounts.

Character-length distribution for damage_property (mean: 4.381).
chars	count
0	368
1	264
2	1252
3	1172
4	3414
5	6075
6	1450
7	514
8	261

latitude · Most events fall between 33° and 41° N, with a few outliers extending toward Alaska and the tropics.

Histogram bins for latitude (median: 37.12).
bin	count
-14.32 – -12.21	3
-12.21 – -10.1	0
-10.1 – -7.99	0
-7.99 – -5.879	0
-5.879 – -3.767	0
-3.767 – -1.656	0
-1.656 – 0.4552	0
0.4552 – 2.566	0
2.566 – 4.678	0
4.678 – 6.789	0
6.789 – 8.9	2
8.9 – 11.01	0
11.01 – 13.12	0
13.12 – 15.23	2
15.23 – 17.35	0
17.35 – 19.46	75
19.46 – 21.57	19
21.57 – 23.68	10
23.68 – 25.79	22
25.79 – 27.9	270
27.9 – 30.01	522
30.01 – 32.12	1240
32.12 – 34.24	2165
34.24 – 36.35	2333
36.35 – 38.46	1803
38.46 – 40.57	1901
40.57 – 42.68	2226
42.68 – 44.79	1382
44.79 – 46.9	515
46.9 – 49.01	232
49.01 – 51.13	0
51.13 – 53.24	0
53.24 – 55.35	0
55.35 – 57.46	5
57.46 – 59.57	6
59.57 – 61.68	15
61.68 – 63.79	11
63.79 – 65.9	8
65.9 – 68.02	2
68.02 – 70.13	1

Schema

14 columns

Per-column summary.
column	type	null rate	unique	alerts
latitude	numeric	0.0%	7,810
longitude	numeric	0.0%	8,828
name	text	0.0%	6,660	multilingual duplicates
description	text	0.0%	5,796	multilingual duplicates
category	categorical	0.0%	1	imbalance
date	text	0.0%	5,058	one_word allcaps short_text duplicates
country	categorical	0.0%	1	imbalance
event_type	categorical	0.0%	17
state	categorical	0.0%	65
magnitude	categorical	51.8%	170	null_rate
injuries	categorical	0.0%	178
fatalities	categorical	0.0%	49
damage_property	text	0.0%	1,014	one_word allcaps short_text duplicates
source	categorical	0.0%	1	imbalance

latitude

numeric feature

This column holds geographic latitudes, ranging from -14.3236 to 70.1269 with a mean of 37.28 and median of 37.12, consistent with degrees north/south. The distribution is tightly clustered (IQR 7.50, std 5.25) around the mid-30s to low-40s, so most records sit in northern temperate latitudes. Skew is mild (-0.18), but kurtosis of 3.34 and 159 outliers (1.08%) point to thin tails on both sides: a southern one that reaches into the southern hemisphere and a smaller northern one above 55° N. Treatment: Pair with longitude as a geospatial feature; consider binning or projecting rather than treating as a plain scalar. high · anthropic:claude-opus-4-7

n: 14,770
nulls: 0 (0.0%)
unique: 7,810
min: -14.32
max: 70.13
mean: 37.28
median: 37.12
std: 5.247
q1: 33.63
q3: 41.13
iqr: 7.499
skew: -0.1787
kurtosis: 3.341
n_outliers: 159
outlier_rate: 0.01077
zero_rate: 0
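
The profile does not say how the 159 outliers were identified. A minimal sketch, assuming the common Tukey rule (1.5 × IQR beyond the quartiles), shows where the fences would fall for the quartiles reported above; the function name is illustrative.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for latitude in this profile.
lower, upper = tukey_fences(33.63, 41.13)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")
```

Under that assumption, latitudes below about 22.4° or above about 52.4° would count as outliers, which covers both the sparse southern bins and the bins above 55° N in the histogram.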

longitude

numeric feature

Geographic longitude coordinates, with 8,828 unique values across 14,770 rows and no nulls. Values span -170.73 to 171.47 (nearly the full -180 to 180 range), but the distribution is tightly concentrated around a median of -90.22 with an IQR of just 12.17, indicating most records cluster in the Americas. Heavy kurtosis (55.6) and 623 outliers (4.2%) reflect a small set of points scattered far from this North/Central American core. Treatment: Pair with latitude for geospatial features; do not standardize as a plain scalar. high · anthropic:claude-opus-4-7

n: 14,770
nulls: 0 (0.0%)
unique: 8,828
min: -170.7
max: 171.5
mean: -90.94
median: -90.22
std: 11.7
q1: -96.4
q3: -84.23
iqr: 12.17
skew: 1.286
kurtosis: 55.61
n_outliers: 623
outlier_rate: 0.04218
zero_rate: 0
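
One way to use latitude and longitude as a single geospatial feature is to snap each point to a grid cell. The sketch below assumes a 1-degree grid and uses hypothetical sample coordinates; the cell size is a tuning choice, not something the profile prescribes.

```python
import math
from collections import Counter

def grid_cell(lat: float, lon: float, size_deg: float = 1.0) -> tuple[int, int]:
    """Map a coordinate pair to integer grid indices (floor of value / cell size)."""
    return math.floor(lat / size_deg), math.floor(lon / size_deg)

# Hypothetical points for illustration only.
points = [(32.75, -97.33), (32.10, -97.90), (37.12, -90.22)]
cells = Counter(grid_cell(lat, lon) for lat, lon in points)
print(cells.most_common(1))
```

Cells of equal angular size cover less ground at high latitudes, so a projected grid may be preferable if area comparisons matter.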

name

text label multilingual duplicates

This column holds short structured event labels of the form '{event_type} in {STATE}, {COUNTY}' describing US severe weather incidents (tornadoes, floods, hail, thunderstorm winds). It is highly repetitive: 8,110 of 14,770 rows are duplicates (54.9% duplicate rate) with only 6,660 unique values, and 'Hail in TEXAS, TARRANT' alone appears 59 times. The 'multilingual' alert is largely a false positive driven by place names: 4,796 rows classify as English versus tiny counts like 134 Spanish and 25 German. Treatment: Parse into structured fields (event_type, state, county) rather than treating as free text. high · anthropic:claude-opus-4-7
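
The parsing step could be sketched as below. The pattern is inferred from the single example 'Hail in TEXAS, TARRANT', so it is an assumption about the template; labels that do not fit it return None and should be inspected by hand.

```python
import re

# Assumed template: event type, the literal " in ", a state, a comma, then a county.
NAME_RE = re.compile(r"^(?P<event_type>.+?) in (?P<state>[^,]+), (?P<county>.+)$")

def parse_name(label: str):
    """Split a label into its parts, or return None if it does not match."""
    m = NAME_RE.match(label)
    return m.groupdict() if m else None

print(parse_name("Hail in TEXAS, TARRANT"))
# {'event_type': 'Hail', 'state': 'TEXAS', 'county': 'TARRANT'}
```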

n: 14,770
nulls: 0 (0.0%)
unique: 6,660
len_min: 17
len_max: 134
len_mean: 30.22
len_median: 29
len_p95: 41
word_mean: 4.588
word_median: 4
n_empty: 0
n_duplicates: 8,110
duplicate_rate: 0.5491
vocab_size: 1,980
readability_flesch_mean: 31.16
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0

description

text metadata multilingual duplicates

This is a templated event-summary field describing storm or disaster impacts (magnitude, injuries, fatalities, property damage in dollars), not free-form prose. With 14,770 rows but only 5,796 unique values and a 60.8% duplicate rate, the text is highly formulaic — the single string 'Magnitude 0; $2.5M property damage' alone appears 1,055 times. Language detection flags 5 French, 1 Japanese, and 10 Norwegian rows against 4,984 English, almost certainly false positives from numeric/currency tokens rather than real multilingual content; mean Flesch of 29.9 reflects the terse template, not difficult prose. Treatment: Parse with regex into structured fields (magnitude, injuries, fatalities, damage_usd) rather than treating as free text. high · anthropic:claude-opus-4-7
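
A hedged sketch of the regex approach follows. Only the string 'Magnitude 0; $2.5M property damage' is quoted in this profile, so the two patterns cover just the magnitude and damage parts; the wording used for injuries and fatalities is not shown here and is deliberately left out rather than guessed.

```python
import re

# Patterns assumed from the one template string quoted in the profile.
MAG_RE = re.compile(r"Magnitude\s+(?P<magnitude>\d+(?:\.\d+)?)")
DMG_RE = re.compile(r"\$(?P<amount>\d+(?:\.\d+)?)(?P<unit>[KM]?)\s+property damage")

def parse_description(text: str) -> dict:
    """Extract the fields that match; absent fields are simply omitted."""
    out = {}
    if (m := MAG_RE.search(text)):
        out["magnitude"] = float(m["magnitude"])
    if (d := DMG_RE.search(text)):
        out["damage"] = d["amount"] + d["unit"]  # raw token, e.g. "2.5M"
    return out

print(parse_description("Magnitude 0; $2.5M property damage"))
# {'magnitude': 0.0, 'damage': '2.5M'}
```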

n: 14,770
nulls: 0 (0.0%)
unique: 5,796
len_min: 3
len_max: 259
len_mean: 50.09
len_median: 36
len_p95: 166
word_mean: 7.393
word_median: 5
n_empty: 0
n_duplicates: 8,974
duplicate_rate: 0.6076
vocab_size: 4,289
readability_flesch_mean: 29.86
emoji_rate: 0
url_rate: 0
one_word_rate: 0.0002708
allcaps_rate: 0.0002708
boilerplate_rate: 0

date

text timestamp one_word allcaps short_text duplicates

This is a date column stored as ISO-formatted text (YYYY-MM-DD), with every one of 14770 rows exactly 10 characters and a single token. Values span at least 1965 to 2021, and duplicates are heavy: 9712 rows (65.8%) repeat, with 1974-04-03 appearing 126 times and 2011-04-27 appearing 105 times, suggesting clustering around specific event dates rather than a uniform timeline. The 'allcaps' flag is a false positive since digits and hyphens trigger it. Treatment: parse to date dtype and use for temporal grouping or joins. high · anthropic:claude-opus-4-7
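
Because every value is a 10-character ISO string, the standard library can parse it directly. The sketch below uses a tiny sample built from the two most frequent dates cited above; the variable names are illustrative.

```python
from collections import Counter
from datetime import date

# Small sample; in practice this would be the full date column.
raw = ["1974-04-03", "2011-04-27", "1974-04-03"]

# date.fromisoformat raises ValueError on malformed input, which surfaces bad rows early.
parsed = [date.fromisoformat(s) for s in raw]
by_year = Counter(d.year for d in parsed)
print(by_year)
```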

n: 14,770
nulls: 0 (0.0%)
unique: 5,058
len_min: 10
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 9,712
duplicate_rate: 0.6575
vocab_size: 5,058
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

country

categorical metadata imbalance

This column records the country of origin for each record, but every one of the 14770 rows holds the value "USA". Cardinality is 1 and entropy is 0, so the field carries no information. Treatment: Drop; constant column with zero variance. high · anthropic:claude-opus-4-7
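
Constant columns such as this one can be detected mechanically before modelling. A minimal sketch, assuming the records are loaded as a list of dicts (the helper name and sample rows are hypothetical):

```python
def constant_columns(rows: list[dict]) -> list[str]:
    """Return the column names whose value is identical in every row."""
    if not rows:
        return []
    first = rows[0]
    return [k for k in first if all(r.get(k) == first[k] for r in rows)]

# Hypothetical two-row sample.
rows = [
    {"country": "USA", "state": "TEXAS"},
    {"country": "USA", "state": "IOWA"},
]
print(constant_columns(rows))  # ['country']
```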

n: 14,770
nulls: 0 (0.0%)
unique: 1
top_value: USA
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

event_type

categorical label

Categorical label of severe-weather event types across 14,770 rows with no nulls and 17 distinct categories. Tornado dominates at 42.9% (6,334 rows), followed by Flash Flood, Thunderstorm Wind, Flood, and Hail; entropy ratio of 0.57 confirms the distribution is concentrated in a few classes. The long tail (Heavy Rain, Marine Strong Wind, Debris Flow, Marine Thunderstorm Wind) has very thin support, which will hurt per-class modelling. Treatment: Use as categorical target or feature; consider grouping rare classes given the heavy Tornado skew. high · anthropic:claude-opus-4-7
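
Grouping the rare classes might look like the sketch below. It uses the event_type counts from the table earlier in this report; the 1% threshold and the "Other" label are assumptions chosen for illustration.

```python
# Counts copied from the event_type table in this report.
counts = {
    "Tornado": 6334, "Flash Flood": 2358, "Thunderstorm Wind": 2257,
    "Flood": 1777, "Hail": 1246, "Lightning": 574, "Heavy Rain": 99,
    "Marine Strong Wind": 43, "Debris Flow": 43, "Marine Thunderstorm Wind": 25,
    "Marine High Wind": 5, "Dust Devil": 3, "Waterspout": 2,
    "Tropical Storm": 1, "High Wind": 1, "Heat": 1, "Marine Lightning": 1,
}

def group_rare(counts: dict, min_share: float = 0.01, other: str = "Other") -> dict:
    """Fold every class below min_share of the total into a single bucket."""
    total = sum(counts.values())
    grouped: dict = {}
    for label, n in counts.items():
        key = label if n / total >= min_share else other
        grouped[key] = grouped.get(key, 0) + n
    return grouped

grouped = group_rare(counts)
print(grouped["Other"])  # 224
```

At this threshold six classes survive and the remaining eleven collapse into one bucket of 224 rows.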

n: 14,770
nulls: 0 (0.0%)
unique: 17
top_value: Tornado
top_rate: 0.4288
cardinality: 17
entropy: 2.336
entropy_ratio: 0.5715
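
The profile does not define its entropy figures, but they are consistent with Shannon entropy in bits divided by log2 of the cardinality (2.336 / log2(17) ≈ 0.5715). The sketch below recomputes both from the event_type counts under that assumption.

```python
import math

# event_type counts from the table in this report, largest first.
counts = [6334, 2358, 2257, 1777, 1246, 574, 99, 43, 43, 25, 5, 3, 2, 1, 1, 1, 1]

def entropy_bits(counts: list[int]) -> float:
    """Shannon entropy in bits of the distribution implied by the counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n)

h = entropy_bits(counts)
ratio = h / math.log2(len(counts))  # 1.0 would mean a perfectly uniform split
print(f"entropy: {h:.3f} bits, ratio: {ratio:.4f}")
```

The results should land close to the reported 2.336 and 0.5715.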

state

categorical feature

US state names in uppercase, with 65 distinct values across 14,770 complete rows — more than the 50 states, suggesting territories, military codes, or 'UNKNOWN'-style entries are mixed in. Distribution is broad (entropy ratio 0.86) with Texas leading at 9.8% (1,450 rows), followed by Missouri, Arkansas, Mississippi, and Georgia, indicating a southern/central US tilt. Treatment: Normalize to standard state codes and one-hot or target-encode; investigate the 15 extra categories beyond 50 states. high · anthropic:claude-opus-4-7
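
To investigate the extra categories, the distinct values can be compared against the 50 state names. The sketch below is one way to do that; the sample values are hypothetical, since this profile lists only the top 20 states.

```python
US_STATES = {
    "ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO",
    "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO",
    "ILLINOIS", "INDIANA", "IOWA", "KANSAS", "KENTUCKY", "LOUISIANA", "MAINE",
    "MARYLAND", "MASSACHUSETTS", "MICHIGAN", "MINNESOTA", "MISSISSIPPI",
    "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA", "NEW HAMPSHIRE", "NEW JERSEY",
    "NEW MEXICO", "NEW YORK", "NORTH CAROLINA", "NORTH DAKOTA", "OHIO",
    "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND", "SOUTH CAROLINA",
    "SOUTH DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT", "VIRGINIA",
    "WASHINGTON", "WEST VIRGINIA", "WISCONSIN", "WYOMING",
}

def non_state_values(values) -> list[str]:
    """Return the distinct values that are not one of the 50 state names."""
    return sorted({v.strip().upper() for v in values} - US_STATES)

# Hypothetical sample; the real extras must be read from the data.
sample = ["TEXAS", "PUERTO RICO", "IOWA", "GULF OF MEXICO"]
print(non_state_values(sample))  # ['GULF OF MEXICO', 'PUERTO RICO']
```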

n: 14,770
nulls: 0 (0.0%)
unique: 65
top_value: TEXAS
top_rate: 0.09817
cardinality: 65
entropy: 5.182
entropy_ratio: 0.8605

magnitude

categorical feature null_rate

Numeric magnitudes stored as strings, with 170 distinct values ranging from '0' to two-decimal figures like '70.00'. Over half the rows (51.78%) are null, and of the non-nulls the value '0' dominates at 54.24%, leaving real magnitudes in a minority of records. Entropy ratio of 0.48 confirms most signal is concentrated in a few buckets. Treatment: Cast to numeric, treat '0' and nulls as likely missing/sentinel, and consider a presence flag before modelling. high · anthropic:claude-opus-4-7
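
The cast-plus-flag treatment could be sketched as follows. Treating '0' as missing follows the suggestion above but remains an assumption: a zero could also be a genuine reading (for example a zero rating on a scale), so it is worth checking per event type before applying this.

```python
from typing import Optional

def clean_magnitude(raw: Optional[str]) -> tuple[Optional[float], bool]:
    """Return (value, has_magnitude); nulls, blanks and zero are treated as missing."""
    if raw is None or raw.strip() == "":
        return None, False
    value = float(raw)
    if value == 0:  # assumption: zero is a sentinel, not a measurement
        return None, False
    return value, True

print(clean_magnitude("70.00"))  # (70.0, True)
print(clean_magnitude("0"))      # (None, False)
```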

n: 14,770
nulls: 7,648 (51.8%)
unique: 170
top_value: 0
top_rate: 0.5424
cardinality: 170
entropy: 3.586
entropy_ratio: 0.484

injuries

categorical feature

This is an injury count per record, stored as strings but numeric in nature (top values are '0' through '12'). The distribution is dominated by zeros: 68.1% of 14,770 rows report '0' injuries, with 893 at '1' and a long tail spanning 178 distinct values. Entropy ratio of 0.33 confirms heavy concentration at the low end. Treatment: Cast to integer and consider log1p or zero-inflated modelling given the heavy zero mass. high · anthropic:claude-opus-4-7
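
A minimal sketch of the cast and log1p transform, using a few illustrative string values (the function name is hypothetical):

```python
import math

def injuries_feature(raw: str) -> float:
    """Cast a string count to int, then apply log1p so that 0 maps to 0.0."""
    return math.log1p(int(raw))

features = [injuries_feature(v) for v in ["0", "1", "12"]]
print(features)
```

log1p compresses the long tail while keeping the zero mass at exactly zero, which a plain log cannot do.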

n: 14,770
nulls: 0 (0.0%)
unique: 178
top_value: 0
top_rate: 0.6814
cardinality: 178
entropy: 2.468
entropy_ratio: 0.3301

fatalities

categorical numeric_target

Counts of fatalities per event, stored as strings with 49 distinct values across 14,770 rows and no nulls. The distribution is heavily zero-inflated: 69.1% of records are "0" and the next bucket "1" covers 3,208 rows, leaving a long thin tail (e.g., 25 rows at "9", 24 at "10"). Low entropy ratio (0.25) confirms most variance lives in the 0/1 split. Treatment: Cast to integer and consider modelling as a count (Poisson/negative binomial) or binarise to any-fatality given the zero inflation. high · anthropic:claude-opus-4-7

n: 14,770
nulls: 0 (0.0%)
unique: 49
top_value: 0
top_rate: 0.6912
cardinality: 49
entropy: 1.423
entropy_ratio: 0.2535

damage_property

text feature one_word allcaps short_text duplicates

This column encodes property damage as short magnitude strings with a K/M suffix (e.g., '2.5M', '250K', '0.00K'), not free text: every value is one word and the maximum length is 8. Formatting is inconsistent: '1M' and '1.00M' coexist, and '0.00K' appears 1,229 times alongside 368 empty entries, so true zeros and missing values are easy to conflate. With a 93.1% duplicate rate across only 1,014 unique values (1,013 distinct tokens plus the empty string), this is a coarse categorical-looking encoding of a numeric quantity. Treatment: Parse the K/M suffix into a numeric dollar amount and treat empty strings as missing (distinct from 0.00K). high · anthropic:claude-opus-4-7
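
The suffix parsing might be sketched as below. Only the K and M suffixes are mentioned in this profile, so other suffixes are not handled, and treating a bare number as dollars is an assumption.

```python
from typing import Optional

MULTIPLIERS = {"K": 1_000, "M": 1_000_000}

def parse_damage(token: str) -> Optional[float]:
    """Convert '2.5M', '250K' or '0.00K' to dollars; an empty string means missing."""
    token = token.strip().upper()
    if not token:
        return None  # missing, kept distinct from a reported zero
    if token[-1] in MULTIPLIERS:
        return float(token[:-1]) * MULTIPLIERS[token[-1]]
    return float(token)  # assumption: a bare number is already in dollars

print(parse_damage("2.5M"))   # 2500000.0
print(parse_damage("0.00K"))  # 0.0
print(parse_damage(""))       # None
```

Parsing also removes the formatting inconsistency, since '1M' and '1.00M' both become 1000000.0.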

n: 14,770
nulls: 0 (0.0%)
unique: 1,014
len_min: 0
len_max: 8
len_mean: 4.381
len_median: 5
len_p95: 7
word_mean: 1
word_median: 1
n_empty: 368
n_duplicates: 13,756
duplicate_rate: 0.9313
vocab_size: 1,013
readability_flesch_mean: 117
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.8724
boilerplate_rate: 0

source

categorical metadata imbalance

This column records the data provenance, holding the constant string "NOAA Storm Events Database" for all 14770 rows. With cardinality of 1 and entropy of 0.0, it carries no information for modelling and only serves as a dataset-level annotation. Treatment: Drop before modelling; retain in documentation as a provenance tag. high · anthropic:claude-opus-4-7

n: 14,770
nulls: 0 (0.0%)
unique: 1
top_value: NOAA Storm Events Database
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0