
natural hazards storms

source /home/coolhand/html/datavis/data_trove/data/natural_hazards/storms.json · 14,770 rows · 14 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 14,770 records of significant U.S. storm events sourced entirely from the NOAA Storm Events Database, with each row describing a weather incident's location, type, magnitude, and damages. Tornadoes dominate the event mix at roughly 43% of records, followed by Flash Floods and Thunderstorm Winds, so the event_type distribution is the first thing to inspect. Geographically the data skews toward Texas (1,450 events) and other tornado-belt states like Missouri, Arkansas, and Mississippi, which is worth confirming on the latitude/longitude spread. Two caveats deserve attention: the magnitude field is missing for 51.8% of rows, and category/country/source are constants (single value) so they carry no analytical signal. Fatalities and injuries are heavily zero-inflated (about 69% and 68% zeros respectively), meaning summary stats will be driven by a small tail of severe events.

citing: row_count · column_count · columns.event_type.top_values · columns.event_type.stats.top_rate · columns.state.top_values · columns.magnitude.null_rate · columns.fatalities.stats.top_rate · columns.injuries.stats.top_rate · columns.country.stats.top_value · columns.source.stats.top_value · columns.latitude.stats · columns.longitude.stats

Schema

14 columns
Per-column summary.

column           type         nulls  unique  alerts
latitude         numeric      0.0%   7,810
longitude        numeric      0.0%   8,828
name             text         0.0%   6,660   multilingual, duplicates
description      text         0.0%   5,796   multilingual, duplicates
category         categorical  0.0%   1       imbalance
date             text         0.0%   5,058   one_word, allcaps, short_text, duplicates
country          categorical  0.0%   1       imbalance
event_type       categorical  0.0%   17
state            categorical  0.0%   65
magnitude        categorical  51.8%  170     null_rate
injuries         categorical  0.0%   178
fatalities       categorical  0.0%   49
damage_property  text         0.0%   1,014   one_word, allcaps, short_text, duplicates
source           categorical  0.0%   1       imbalance

latitude

numeric feature
This column holds geographic latitudes in decimal degrees, ranging from -14.3236 to 70.1269 with a mean of 37.28 and median of 37.12. The distribution is tightly clustered (IQR 7.50, std 5.25) around the mid-30s to low-40s, so most records sit in the northern mid-latitudes. Skew is mild (-0.18), but a kurtosis of 3.34 and 159 outliers (1.08%) point to thin tails, with the minimum reaching into the southern hemisphere. Treatment: Pair with longitude as a geospatial feature; consider binning or projecting rather than treating as a plain scalar. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 7,810
min: -14.32
max: 70.13
mean: 37.28
median: 37.12
std: 5.247
q1: 33.63
q3: 41.13
iqr: 7.499
skew: -0.1787
kurtosis: 3.341
n_outliers: 159
outlier_rate: 0.01077
zero_rate: 0

longitude

numeric feature
Geographic longitude coordinates, with 8,828 unique values across 14,770 rows and no nulls. Values span -170.73 to 171.47 (nearly the full global range), but the distribution is tightly concentrated around a median of -90.22 with an IQR of just 12.17, so most records fall within central U.S. longitudes. Heavy kurtosis (55.6) and 623 outliers (4.2%) reflect a small set of points scattered far from this core; the positive (eastern-hemisphere) values are worth checking against the state field. Treatment: Pair with latitude for geospatial features; do not standardize as a plain scalar. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 8,828
min: -170.7
max: 171.5
mean: -90.94
median: -90.22
std: 11.7
q1: -96.4
q3: -84.23
iqr: 12.17
skew: 1.286
kurtosis: 55.61
n_outliers: 623
outlier_rate: 0.04218
zero_rate: 0
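
One way to act on the "pair and bin" treatment for latitude and longitude is to map each coordinate pair to a grid cell, so the two columns are always handled together. This is a minimal sketch: the function name `grid_cell` and the 1-degree default cell size are illustrative assumptions, not part of the profiled pipeline.

```python
import math


def grid_cell(lat: float, lon: float, size_deg: float = 1.0) -> tuple[int, int]:
    """Return the integer (row, col) index of the size_deg x size_deg cell
    containing the point. The cell size is an assumed tuning parameter."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    # floor (not int) so that negative coordinates bin consistently
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


# Cell containing the reported median latitude and longitude.
median_cell = grid_cell(37.12, -90.22)
```

Counting rows per cell then gives a coarse density map, and cells far from the bulk of the data are a quick way to list the coordinate outliers for review.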

name

text label multilingual duplicates
This column holds short structured event labels of the form '<event_type> in <STATE>, <COUNTY>' describing US severe weather incidents (tornadoes, floods, hail, thunderstorm winds). It is highly repetitive: 8,110 of 14,770 rows are duplicates (54.9% duplicate_rate) with only 6,660 unique values, and 'Hail in TEXAS, TARRANT' alone appears 59 times. The 'multilingual' alert is largely a false positive driven by place names: 4,796 rows classify as English versus tiny counts like 134 Spanish and 25 German. Treatment: Parse into structured fields (event_type, state, county) rather than treating as free text. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 6,660
len_min: 17
len_max: 134
len_mean: 30.22
len_median: 29
len_p95: 41
word_mean: 4.588
word_median: 4
n_empty: 0
n_duplicates: 8,110
duplicate_rate: 0.5491
vocab_size: 1,980
readability_flesch_mean: 31.16
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
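
The parsing treatment for name can be sketched with a single regular expression. The pattern below is an assumption inferred from the one example quoted in this report ('Hail in TEXAS, TARRANT'); `NAME_RE` and `parse_name` are hypothetical names, and labels that do not fit the pattern return `None` so they can be inspected separately.

```python
import re

# Assumed layout: "<event_type> in <STATE>, <COUNTY>".
# The non-greedy first group stops at the first " in " separator.
NAME_RE = re.compile(r"^(?P<event_type>.+?) in (?P<state>[^,]+), (?P<county>.+)$")


def parse_name(name: str):
    """Split a label into event_type/state/county, or return None if it
    does not follow the assumed layout."""
    match = NAME_RE.match(name)
    return match.groupdict() if match else None


parsed = parse_name("Hail in TEXAS, TARRANT")
```

Comparing the parsed event_type and state against the dedicated event_type and state columns is a cheap consistency check before dropping the name column.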

description

text metadata multilingual duplicates
This is a templated event-summary field describing storm or disaster impacts (magnitude, injuries, fatalities, property damage in dollars), not free-form prose. With 14,770 rows but only 5,796 unique values and a 60.8% duplicate rate, the text is highly formulaic — the single string 'Magnitude 0; $2.5M property damage' alone appears 1,055 times. Language detection flags 5 French, 1 Japanese, and 10 Norwegian rows against 4,984 English, almost certainly false positives from numeric/currency tokens rather than real multilingual content; mean Flesch of 29.9 reflects the terse template, not difficult prose. Treatment: Parse with regex into structured fields (magnitude, injuries, fatalities, damage_usd) rather than treating as free text. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 5,796
len_min: 3
len_max: 259
len_mean: 50.09
len_median: 36
len_p95: 166
word_mean: 7.393
word_median: 5
n_empty: 0
n_duplicates: 8,974
duplicate_rate: 0.6076
vocab_size: 4,289
readability_flesch_mean: 29.86
emoji_rate: 0
url_rate: 0
one_word_rate: 0.0002708
allcaps_rate: 0.0002708
boilerplate_rate: 0
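
A regex pass over description might start like the sketch below. Only the two fragments visible in the quoted example ('Magnitude 0; $2.5M property damage') are covered; the wording of the injuries and fatalities fragments is not shown in this report, so patterns for them would need to be written against the actual data. `parse_description` and both pattern names are hypothetical.

```python
import re

# Patterns assumed from the single template example quoted in the report.
MAGNITUDE_RE = re.compile(r"Magnitude\s+(\d+(?:\.\d+)?)")
DAMAGE_RE = re.compile(r"\$(\d+(?:\.\d+)?[KM]?)\s+property damage")


def parse_description(text: str) -> dict:
    """Extract the magnitude (as float) and the raw damage token (e.g. '2.5M').
    Fields that are absent from the text come back as None."""
    mag = MAGNITUDE_RE.search(text)
    dmg = DAMAGE_RE.search(text)
    return {
        "magnitude": float(mag.group(1)) if mag else None,
        "damage_token": dmg.group(1) if dmg else None,
    }


fields = parse_description("Magnitude 0; $2.5M property damage")
```

Because description appears to restate values held in other columns, the extracted fields are probably most useful for cross-checking magnitude and damage_property rather than as new features.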

category

categorical metadata imbalance
This column is a constant categorical tag identifying the dataset source: every one of the 14770 rows holds the single value "significant_us_storms". With cardinality 1, entropy 0, and top_rate 1.0, it carries no information for modelling and only serves as a provenance marker. Treatment: Drop before modelling; retain only as a source tag if merging with other datasets. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 1
top_value: significant_us_storms
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

date

text timestamp one_word allcaps short_text duplicates
This is a date column stored as ISO-formatted text (YYYY-MM-DD), with every one of 14770 rows exactly 10 characters and a single token. Values span at least 1965 to 2021, and duplicates are heavy: 9712 rows (65.8%) repeat, with 1974-04-03 appearing 126 times and 2011-04-27 appearing 105 times, suggesting clustering around specific event dates rather than a uniform timeline. The 'allcaps' flag is a false positive since digits and hyphens trigger it. Treatment: parse to date dtype and use for temporal grouping or joins. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 5,058
len_min: 10
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 9,712
duplicate_rate: 0.6575
vocab_size: 5,058
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
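
Since every date value is a 10-character ISO string, the standard library can parse it directly. The sketch below counts events per year as one example of temporal grouping; `events_per_year` is a hypothetical helper name.

```python
from collections import Counter
from datetime import date


def events_per_year(date_strings):
    """Parse YYYY-MM-DD strings and count events per calendar year.
    date.fromisoformat raises ValueError on malformed input, which surfaces
    bad rows instead of silently skipping them."""
    return Counter(date.fromisoformat(s).year for s in date_strings)


# Two of the heavily repeated dates mentioned in the report.
counts = events_per_year(["1974-04-03", "1974-04-03", "2011-04-27"])
```

Grouping on the full date instead of the year would reproduce the per-day clustering noted in the column summary.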

country

categorical metadata imbalance
This column records the country of origin for each record, but every one of the 14770 rows holds the value "USA". Cardinality is 1 and entropy is 0, so the field carries no information. Treatment: Drop; constant column with zero variance. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 1
top_value: USA
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

event_type

categorical label
Categorical label of severe-weather event types across 14,770 rows with no nulls and 17 distinct categories. Tornado dominates at 42.9% (6,334 rows), followed by Flash Flood, Thunderstorm Wind, Flood, and Hail; entropy ratio of 0.57 confirms the distribution is concentrated in a few classes. The long tail (Heavy Rain, Marine Strong Wind, Debris Flow, Marine Thunderstorm Wind) has very thin support, which will hurt per-class modelling. Treatment: Use as categorical target or feature; consider grouping rare classes given the heavy Tornado skew. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 17
top_value: Tornado
top_rate: 0.4288
cardinality: 17
entropy: 2.336
entropy_ratio: 0.5715
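
Grouping the thin tail of event_type classes can be sketched as a count threshold. Both the threshold and the 'Other' label are assumptions to be tuned; the report does not prescribe a cutoff.

```python
from collections import Counter


def group_rare(labels, min_count, other="Other"):
    """Replace every label seen fewer than min_count times with `other`.
    min_count and the `other` label are illustrative choices."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


sample = ["Tornado", "Tornado", "Tornado", "Hail", "Hail", "Debris Flow"]
grouped = group_rare(sample, min_count=2)
```

If the grouping feeds a model, the set of kept classes should be decided on training data only and then applied unchanged to any held-out rows.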

state

categorical feature
US state names in uppercase, with 65 distinct values across 14,770 complete rows — more than the 50 states, suggesting territories, military codes, or 'UNKNOWN'-style entries are mixed in. Distribution is broad (entropy ratio 0.86) with Texas leading at 9.8% (1,450 rows), followed by Missouri, Arkansas, Mississippi, and Georgia, indicating a southern/central US tilt. Treatment: Normalize to standard state codes and one-hot or target-encode; investigate the 15 extra categories beyond 50 states. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 65
top_value: TEXAS
top_rate: 0.09817
cardinality: 65
entropy: 5.182
entropy_ratio: 0.8605
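
To investigate the 15 categories beyond the 50 states, one option is to count every state value that is not an uppercase state name. This is a sketch; `extra_state_values` is a hypothetical helper, and it assumes the column uses full uppercase names as the summary describes.

```python
from collections import Counter

# The 50 U.S. states in uppercase, matching the casing reported for this column.
US_STATES = frozenset("""
ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE
FLORIDA GEORGIA HAWAII IDAHO ILLINOIS INDIANA IOWA KANSAS KENTUCKY LOUISIANA
MAINE MARYLAND MASSACHUSETTS MICHIGAN MINNESOTA MISSISSIPPI MISSOURI MONTANA
NEBRASKA NEVADA NEW_HAMPSHIRE NEW_JERSEY NEW_MEXICO NEW_YORK NORTH_CAROLINA
NORTH_DAKOTA OHIO OKLAHOMA OREGON PENNSYLVANIA RHODE_ISLAND SOUTH_CAROLINA
SOUTH_DAKOTA TENNESSEE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST_VIRGINIA
WISCONSIN WYOMING
""".replace("_", "\x00").split())
# Underscores stood in for spaces so that split() keeps two-word names whole.
US_STATES = frozenset(name.replace("\x00", " ") for name in US_STATES)


def extra_state_values(values):
    """Count values that are not one of the 50 state names (after trimming
    whitespace and uppercasing)."""
    cleaned = (v.strip().upper() for v in values)
    return Counter(v for v in cleaned if v not in US_STATES)
```

The resulting counter lists each unexpected value with its row count, which is usually enough to decide whether it is a territory, a marine zone, or a data-entry variant.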

magnitude

categorical feature null_rate
Numeric magnitudes stored as strings (hence the categorical type label), with 170 distinct values ranging from '0' to two-decimal figures like '70.00'. Over half the rows (51.78%) are null, and of the non-nulls the value '0' dominates at 54.24%, so only about 22% of all rows carry a nonzero magnitude. Entropy ratio of 0.48 confirms most signal is concentrated in a few buckets. Treatment: Cast to numeric, treat '0' and nulls as likely missing/sentinel, and consider a presence flag before modelling. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 7,648 (51.8%)
unique: 170
top_value: 0
top_rate: 0.5424
cardinality: 170
entropy: 3.586
entropy_ratio: 0.484
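
The magnitude treatment (cast, sentinel handling, presence flag) might look like the sketch below. Treating '0' as missing is the report's suggestion rather than a confirmed property of the source data, so it is kept as an explicit step; `clean_magnitude` is a hypothetical name.

```python
def clean_magnitude(raw):
    """Return (value, has_magnitude).

    Nulls and empty strings are missing. A parsed value of 0 is ALSO treated
    as missing, following the report's assumption that '0' is a sentinel.
    Non-numeric strings raise ValueError so they are noticed."""
    if raw is None or str(raw).strip() == "":
        return None, False
    value = float(raw)
    if value == 0.0:
        return None, False
    return value, True


value, present = clean_magnitude("70.00")
```

Keeping the flag as its own feature lets a model use the fact that a magnitude was recorded at all, which may itself depend on event type.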

injuries

categorical feature
This is an injury count per record, stored as strings but numeric in nature (top values are '0' through '12'). The distribution is dominated by zeros: 68.1% of 14,770 rows report '0' injuries, with 893 at '1' and a long tail spanning 178 distinct values. Entropy ratio of 0.33 confirms heavy concentration at the low end. Treatment: Cast to integer and consider log1p or zero-inflated modelling given the heavy zero mass. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 178
top_value: 0
top_rate: 0.6814
cardinality: 178
entropy: 2.468
entropy_ratio: 0.3301
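
For injuries, the cast-then-log1p treatment can be sketched in a few lines. `injuries_log1p` is a hypothetical helper; it assumes every value is a non-negative integer string, which the profile suggests but does not prove.

```python
import math


def injuries_log1p(raw: str) -> float:
    """Cast a string count to int and apply log(1 + x), which keeps zeros at 0
    while compressing the long right tail."""
    count = int(raw)
    if count < 0:
        raise ValueError(f"negative injury count: {raw!r}")
    return math.log1p(count)


transformed = [injuries_log1p(v) for v in ["0", "1", "12"]]
```

The transform is monotonic, so rank-based summaries are unchanged; it mainly helps models and plots that are sensitive to the scale of the tail.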

fatalities

categorical numeric_target
Counts of fatalities per event, stored as strings with 49 distinct values across 14,770 rows and no nulls. The distribution is heavily zero-inflated: 69.1% of records are "0" and the next bucket "1" covers 3,208 rows (about 21.7%), leaving a long thin tail (e.g., 25 rows at "9", 24 at "10"). Low entropy ratio (0.25) confirms most variance lives in the 0/1 split, which together accounts for roughly 91% of rows. Treatment: Cast to integer and consider modelling as a count (Poisson/negative binomial) or binarise to any-fatality given the zero inflation. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 49
top_value: 0
top_rate: 0.6912
cardinality: 49
entropy: 1.423
entropy_ratio: 0.2535
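
The binarise option for fatalities reduces to a one-line indicator. The sketch below also reports the share of positive rows, which is worth checking before choosing between a binary target and a count model; both function names are hypothetical.

```python
def any_fatality(raw: str) -> int:
    """1 if the event recorded at least one fatality, else 0."""
    return 1 if int(raw) > 0 else 0


def positive_rate(values) -> float:
    """Share of rows with at least one fatality (requires a non-empty input)."""
    flags = [any_fatality(v) for v in values]
    return sum(flags) / len(flags)


rate = positive_rate(["0", "0", "1", "9"])
```

On the full column, the complement of this rate should match the reported zero share of about 69%.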

damage_property

text feature one_word allcaps short_text duplicates
This column encodes property damage as short magnitude strings with a K/M suffix (e.g., '2.5M', '250K', '0.00K'), not free text: every non-empty value is one word and max length is 8. Formatting is inconsistent: '1M' and '1.00M' coexist, and '0.00K' appears 1,229 times alongside 368 empty entries, so true zeros and missing values must be kept apart. With a 93.1% duplicate rate across only 1,014 unique values (1,013 non-empty tokens plus the empty string), this is a coarse categorical-looking encoding of a numeric quantity. Treatment: Parse the K/M suffix into a numeric dollar amount and treat empty strings as missing (distinct from 0.00K). high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 1,014
len_min: 0
len_max: 8
len_mean: 4.381
len_median: 5
len_p95: 7
word_mean: 1
word_median: 1
n_empty: 368
n_duplicates: 13,756
duplicate_rate: 0.9313
vocab_size: 1,013
readability_flesch_mean: 117
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.8724
boilerplate_rate: 0
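
Parsing damage_property into dollars can be sketched as follows. Only the K and M suffixes named in this report are handled; accepting lowercase suffixes and bare numbers (read as dollars) is an assumption, and any other format raises an error so that it is noticed rather than silently coerced. `parse_damage` is a hypothetical name.

```python
import re

# Number, then an optional K/M suffix. Lowercase and no-suffix forms are
# assumed possibilities, not formats confirmed by the report.
DAMAGE_RE = re.compile(r"^(\d+(?:\.\d+)?)([KkMm]?)$")
MULTIPLIER = {"": 1.0, "K": 1e3, "M": 1e6}


def parse_damage(raw: str):
    """Return damage in dollars, or None for an empty string (missing).
    '0.00K' parses to 0.0, which keeps true zeros distinct from missing."""
    token = raw.strip()
    if token == "":
        return None
    match = DAMAGE_RE.match(token)
    if match is None:
        raise ValueError(f"unrecognised damage format: {raw!r}")
    number, suffix = match.groups()
    return float(number) * MULTIPLIER[suffix.upper()]


amounts = [parse_damage(v) for v in ["2.5M", "250K", "0.00K", ""]]
```

Because '1M' and '1.00M' parse to the same number, the cast also removes the formatting inconsistency noted above.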

source

categorical metadata imbalance
This column records the data provenance, holding the constant string "NOAA Storm Events Database" for all 14770 rows. With cardinality of 1 and entropy of 0.0, it carries no information for modelling and only serves as a dataset-level annotation. Treatment: Drop before modelling; retain in documentation as a provenance tag. high · anthropic:claude-opus-4-7
n: 14,770
nulls: 0 (0.0%)
unique: 1
top_value: NOAA Storm Events Database
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
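
The three constant columns flagged in this report (category, country, source) can be detected and dropped generically rather than by name. The sketch below assumes the rows are available as a list of dicts, as `json.load` would return if the file is a JSON array of objects (an assumption about the file layout); both helper names are hypothetical.

```python
def constant_columns(rows):
    """Names of columns that hold a single distinct value across all rows.
    Assumes values are hashable (strings, numbers, None)."""
    if not rows:
        return []
    return [col for col in rows[0] if len({row.get(col) for row in rows}) == 1]


def drop_columns(rows, cols):
    """Return copies of the rows without the given columns."""
    drop = set(cols)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


rows = [
    {"event_type": "Tornado", "country": "USA", "source": "NOAA Storm Events Database"},
    {"event_type": "Hail", "country": "USA", "source": "NOAA Storm Events Database"},
]
constants = constant_columns(rows)
slim = drop_columns(rows, constants)
```

Recording the dropped values once in the dataset documentation preserves the provenance that these columns carried.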