natural hazards storms
Reading
This dataset contains 14,770 records of significant U.S. storm events sourced entirely from the NOAA Storm Events Database, with each row describing a weather incident's location, type, magnitude, and damages. Tornadoes dominate the event mix at roughly 43% of records, followed by Flash Floods and Thunderstorm Winds, so the event_type distribution is the first thing to inspect. Geographically the data skews toward Texas (1,450 events) and other tornado-belt states like Missouri, Arkansas, and Mississippi, which is worth confirming on the latitude/longitude spread. Two caveats deserve attention: the magnitude field is missing for 51.8% of rows, and category/country/source are constants (single value) so they carry no analytical signal. Fatalities and injuries are heavily zero-inflated (about 69% and 68% zeros respectively), meaning summary stats will be driven by a small tail of severe events.
citing: row_count · column_count · columns.event_type.top_values · columns.event_type.stats.top_rate · columns.state.top_values · columns.magnitude.null_rate · columns.fatalities.stats.top_rate · columns.injuries.stats.top_rate · columns.country.stats.top_value · columns.source.stats.top_value · columns.latitude.stats · columns.longitude.stats
Charts the summary said to look at first
event_type: top values
| value | count | share |
|---|---|---|
| Tornado | 6334 | 42.9% |
| Flash Flood | 2358 | 16.0% |
| Thunderstorm Wind | 2257 | 15.3% |
| Flood | 1777 | 12.0% |
| Hail | 1246 | 8.4% |
| Lightning | 574 | 3.9% |
| Heavy Rain | 99 | 0.7% |
| Marine Strong Wind | 43 | 0.3% |
| Debris Flow | 43 | 0.3% |
| Marine Thunderstorm Wind | 25 | 0.2% |
| Marine High Wind | 5 | 0.0% |
| Dust Devil | 3 | 0.0% |
| Waterspout | 2 | 0.0% |
| Tropical Storm | 1 | 0.0% |
| High Wind | 1 | 0.0% |
| Heat | 1 | 0.0% |
| Marine Lightning | 1 | 0.0% |
state: top values
| value | count | share |
|---|---|---|
| TEXAS | 1450 | 9.8% |
| MISSOURI | 648 | 4.4% |
| ARKANSAS | 602 | 4.1% |
| MISSISSIPPI | 570 | 3.9% |
| GEORGIA | 562 | 3.8% |
| ILLINOIS | 560 | 3.8% |
| IOWA | 527 | 3.6% |
| LOUISIANA | 507 | 3.4% |
| TENNESSEE | 499 | 3.4% |
| FLORIDA | 498 | 3.4% |
| OKLAHOMA | 490 | 3.3% |
| NEBRASKA | 486 | 3.3% |
| ALABAMA | 469 | 3.2% |
| WISCONSIN | 463 | 3.1% |
| OHIO | 441 | 3.0% |
| MICHIGAN | 426 | 2.9% |
| NORTH CAROLINA | 422 | 2.9% |
| KANSAS | 418 | 2.8% |
| INDIANA | 408 | 2.8% |
| KENTUCKY | 383 | 2.6% |
fatalities: top values
| value | count | share |
|---|---|---|
| 0 | 10209 | 69.1% |
| 1 | 3208 | 21.7% |
| 2 | 649 | 4.4% |
| 3 | 222 | 1.5% |
| 4 | 112 | 0.8% |
| 5 | 74 | 0.5% |
| 6 | 66 | 0.4% |
| 7 | 38 | 0.3% |
| 9 | 25 | 0.2% |
| 10 | 24 | 0.2% |
| 8 | 21 | 0.1% |
| 11 | 20 | 0.1% |
| 13 | 11 | 0.1% |
| 16 | 10 | 0.1% |
| 12 | 9 | 0.1% |
| 14 | 8 | 0.1% |
| 17 | 6 | 0.0% |
| 20 | 6 | 0.0% |
| 25 | 4 | 0.0% |
| 23 | 3 | 0.0% |
damage_property: value length in characters
| chars | count |
|---|---|
| 0 | 368 |
| 1 | 264 |
| 2 | 1252 |
| 3 | 1172 |
| 4 | 3414 |
| 5 | 6075 |
| 6 | 1450 |
| 7 | 514 |
| 8 | 261 |
latitude: histogram
| bin | count |
|---|---|
| -14.32 – -12.21 | 3 |
| -12.21 – -10.1 | 0 |
| -10.1 – -7.99 | 0 |
| -7.99 – -5.879 | 0 |
| -5.879 – -3.767 | 0 |
| -3.767 – -1.656 | 0 |
| -1.656 – 0.4552 | 0 |
| 0.4552 – 2.566 | 0 |
| 2.566 – 4.678 | 0 |
| 4.678 – 6.789 | 0 |
| 6.789 – 8.9 | 2 |
| 8.9 – 11.01 | 0 |
| 11.01 – 13.12 | 0 |
| 13.12 – 15.23 | 2 |
| 15.23 – 17.35 | 0 |
| 17.35 – 19.46 | 75 |
| 19.46 – 21.57 | 19 |
| 21.57 – 23.68 | 10 |
| 23.68 – 25.79 | 22 |
| 25.79 – 27.9 | 270 |
| 27.9 – 30.01 | 522 |
| 30.01 – 32.12 | 1240 |
| 32.12 – 34.24 | 2165 |
| 34.24 – 36.35 | 2333 |
| 36.35 – 38.46 | 1803 |
| 38.46 – 40.57 | 1901 |
| 40.57 – 42.68 | 2226 |
| 42.68 – 44.79 | 1382 |
| 44.79 – 46.9 | 515 |
| 46.9 – 49.01 | 232 |
| 49.01 – 51.13 | 0 |
| 51.13 – 53.24 | 0 |
| 53.24 – 55.35 | 0 |
| 55.35 – 57.46 | 5 |
| 57.46 – 59.57 | 6 |
| 59.57 – 61.68 | 15 |
| 61.68 – 63.79 | 11 |
| 63.79 – 65.9 | 8 |
| 65.9 – 68.02 | 2 |
| 68.02 – 70.13 | 1 |
Schema
14 columns
| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| latitude | numeric | 0.0% | 7,810 | |
| longitude | numeric | 0.0% | 8,828 | |
| name | text | 0.0% | 6,660 | multilingual, duplicates |
| description | text | 0.0% | 5,796 | multilingual, duplicates |
| category | categorical | 0.0% | 1 | imbalance |
| date | text | 0.0% | 5,058 | one_word, allcaps, short_text, duplicates |
| country | categorical | 0.0% | 1 | imbalance |
| event_type | categorical | 0.0% | 17 | |
| state | categorical | 0.0% | 65 | |
| magnitude | categorical | 51.8% | 170 | null_rate |
| injuries | categorical | 0.0% | 178 | |
| fatalities | categorical | 0.0% | 49 | |
| damage_property | text | 0.0% | 1,014 | one_word, allcaps, short_text, duplicates |
| source | categorical | 0.0% | 1 | imbalance |
latitude
numeric · feature
This column holds geographic latitudes in decimal degrees, ranging from -14.3236 to 70.1269 with a mean of 37.28 and a median of 37.12. The distribution is tightly clustered (IQR 7.50, std 5.25) around the mid-30s to low-40s, so most records sit in the northern temperate band. Skew is mild (-0.18); the kurtosis of 3.34 and the 159 outliers (1.08%) come from both ends of the range: 48 records north of 55°N and a larger group below about 23°N, of which only 3 fall in the southern hemisphere.
Treatment: Pair with longitude as a geospatial feature; consider binning or projecting rather than treating as a plain scalar.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 7,810
- min: -14.32
- max: 70.13
- mean: 37.28
- median: 37.12
- std: 5.247
- q1: 33.63
- q3: 41.13
- iqr: 7.499
- skew: -0.1787
- kurtosis: 3.341
- n_outliers: 159
- outlier_rate: 0.01077
- zero_rate: 0
longitude
numeric · feature
Geographic longitude coordinates, with 8,828 unique values across 14,770 rows and no nulls. Values span -170.73 to 171.47, nearly the full global range, but the distribution is tightly concentrated around a median of -90.22 with an IQR of just 12.17, which places most records in the central United States. Heavy kurtosis (55.6) and 623 outliers (4.2%) reflect a small set of points scattered far from this core; positive longitudes in particular are worth checking, since they could be either U.S. Pacific territories or sign errors.
Treatment: Pair with latitude for geospatial features; do not standardize as a plain scalar.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 8,828
- min: -170.7
- max: 171.5
- mean: -90.94
- median: -90.22
- std: 11.7
- q1: -96.4
- q3: -84.23
- iqr: 12.17
- skew: 1.286
- kurtosis: 55.61
- n_outliers: 623
- outlier_rate: 0.04218
- zero_rate: 0
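One way to act on the latitude and longitude outlier counts is to flag coordinates that fall outside a rough bounding box for the contiguous United States. The sketch below does only that; the box limits and the sample points are assumptions chosen for illustration, not values taken from the profile.

```python
from typing import List, Tuple

# Rough bounding box for the contiguous United States.
# These limits are an assumption for illustration; tune them as needed.
LAT_RANGE = (24.0, 50.0)
LON_RANGE = (-125.0, -66.0)

Point = Tuple[float, float]  # (latitude, longitude)


def outside_box(points: List[Point]) -> List[Point]:
    """Return the points whose latitude or longitude lies outside the box."""
    flagged = []
    for lat, lon in points:
        lat_ok = LAT_RANGE[0] <= lat <= LAT_RANGE[1]
        lon_ok = LON_RANGE[0] <= lon <= LON_RANGE[1]
        if not (lat_ok and lon_ok):
            flagged.append((lat, lon))
    return flagged


# Hypothetical sample: one mid-continent point and two far-flung ones.
sample = [(37.12, -90.22), (-14.32, -170.7), (64.8, -147.7)]
flagged = outside_box(sample)
```

Flagged rows are candidates for review rather than automatic removal, since the 65 distinct state values suggest that some records may legitimately lie outside the contiguous states.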
name
text · label · multilingual · duplicates
This column holds short structured event labels of the form 'EVENT_TYPE in STATE, COUNTY' (for example 'Hail in TEXAS, TARRANT') describing US severe weather incidents such as tornadoes, floods, hail, and thunderstorm winds. It is highly repetitive: 8,110 of 14,770 rows are duplicates (54.9% duplicate rate) with only 6,660 unique values, and 'Hail in TEXAS, TARRANT' alone appears 59 times. The 'multilingual' alert is largely a false positive driven by place names: 4,796 rows classify as English versus small counts such as 134 Spanish and 25 German.
Treatment: Parse into structured fields (event_type, state, county) rather than treating as free text.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 6,660
- len_min: 17
- len_max: 134
- len_mean: 30.22
- len_median: 29
- len_p95: 41
- word_mean: 4.588
- word_median: 4
- n_empty: 0
- n_duplicates: 8,110
- duplicate_rate: 0.5491
- vocab_size: 1,980
- readability_flesch_mean: 31.16
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
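The suggested parsing of name can be sketched with a single regular expression. This assumes every label follows the 'EVENT_TYPE in STATE, COUNTY' layout seen in 'Hail in TEXAS, TARRANT'; the helper name `parse_name` is hypothetical, and labels that do not fit are returned as `None` so they can be inspected separately.

```python
import re
from typing import Dict, Optional

# Assumed layout: "EVENT_TYPE in STATE, COUNTY".
NAME_PATTERN = re.compile(
    r"^(?P<event_type>.+?) in (?P<state>[^,]+), (?P<county>.+)$"
)


def parse_name(name: str) -> Optional[Dict[str, str]]:
    """Split a label into its three parts, or return None if it does not fit."""
    match = NAME_PATTERN.match(name.strip())
    return match.groupdict() if match else None


parsed = parse_name("Hail in TEXAS, TARRANT")
```

Because event_type and state already exist as columns, the main new field this yields is the county; the other two parts are useful mostly as a consistency check.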
description
text · metadata · multilingual · duplicates
This is a templated event-summary field describing storm or disaster impacts (magnitude, injuries, fatalities, property damage in dollars), not free-form prose. With 14,770 rows but only 5,796 unique values and a 60.8% duplicate rate, the text is highly formulaic: the single string 'Magnitude 0; $2.5M property damage' alone appears 1,055 times. Language detection flags 5 French, 1 Japanese, and 10 Norwegian rows against 4,984 English, almost certainly false positives from numeric/currency tokens rather than real multilingual content; the mean Flesch score of 29.9 reflects the terse template, not difficult prose.
Treatment: Parse with regex into structured fields (magnitude, injuries, fatalities, damage_usd) rather than treating as free text.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 5,796
- len_min: 3
- len_max: 259
- len_mean: 50.09
- len_median: 36
- len_p95: 166
- word_mean: 7.393
- word_median: 5
- n_empty: 0
- n_duplicates: 8,974
- duplicate_rate: 0.6076
- vocab_size: 4,289
- readability_flesch_mean: 29.86
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.0002708
- allcaps_rate: 0.0002708
- boilerplate_rate: 0
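A partial sketch of the regex approach follows. It covers only the two clauses visible in the quoted example 'Magnitude 0; $2.5M property damage'; the wording of the injuries and fatalities clauses is not shown in this profile, so they are deliberately left out. The patterns and the helper name `parse_description` are assumptions.

```python
import re
from typing import Dict, Optional

# Patterns inferred from the single example string quoted in the profile.
MAGNITUDE_RE = re.compile(r"Magnitude\s+(\d+(?:\.\d+)?)")
DAMAGE_RE = re.compile(r"\$(\d+(?:\.\d+)?[KMB]?)\s+property damage")


def parse_description(text: str) -> Dict[str, Optional[object]]:
    """Extract the magnitude (float) and the raw damage token, if present."""
    mag = MAGNITUDE_RE.search(text)
    dmg = DAMAGE_RE.search(text)
    return {
        "magnitude": float(mag.group(1)) if mag else None,
        "damage_token": dmg.group(1) if dmg else None,
    }


fields = parse_description("Magnitude 0; $2.5M property damage")
```

The damage token is kept as a string here so that one shared routine can convert both this field and damage_property to dollars.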
category
categorical · metadata · imbalance
This column is a constant categorical tag identifying the dataset source: every one of the 14,770 rows holds the single value "significant_us_storms". With cardinality 1, entropy 0, and top_rate 1.0, it carries no information for modelling and only serves as a provenance marker.
Treatment: Drop before modelling; retain only as a source tag if merging with other datasets.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 1
- top_value: significant_us_storms
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
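The entropy and entropy_ratio figures reported for the categorical columns can be reproduced with a few lines. The sketch assumes Shannon entropy in bits and a ratio obtained by dividing by log2 of the number of classes; that is an assumption about how the profiler defines them, although it agrees with the values listed in this report.

```python
import math
from typing import Dict, Tuple


def entropy_stats(counts: Dict[str, int]) -> Tuple[float, float]:
    """Return (entropy in bits, entropy / log2(number of classes))."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    entropy = sum(-p * math.log2(p) for p in probs)
    k = len(probs)
    # With a single class log2(k) is 0, so report a ratio of 0 instead.
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio


constant = entropy_stats({"significant_us_storms": 14770})
balanced = entropy_stats({"a": 50, "b": 50})  # hypothetical two-class column
```

A constant column yields zero for both values, which is why category, country, and source can be dropped without losing signal.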
date
text · timestamp · one_word · allcaps · short_text · duplicates
This is a date column stored as ISO-formatted text (YYYY-MM-DD), with every one of 14,770 rows exactly 10 characters and a single token. Values span at least 1965 to 2021, and duplicates are heavy: 9,712 rows (65.8%) repeat, with 1974-04-03 appearing 126 times and 2011-04-27 appearing 105 times, suggesting clustering around specific event dates rather than a uniform timeline. The 'allcaps' flag is a false positive since digits and hyphens trigger it.
Treatment: Parse to date dtype and use for temporal grouping or joins.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 5,058
- len_min: 10
- len_max: 10
- len_mean: 10
- len_median: 10
- len_p95: 10
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 9,712
- duplicate_rate: 0.6575
- vocab_size: 5,058
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
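Parsing and grouping the dates needs nothing beyond the standard library. The sketch below assumes every value is a valid YYYY-MM-DD string, as the profile reports; the three-row sample is hypothetical and simply reuses the two dates quoted in the description.

```python
from collections import Counter
from datetime import date
from typing import List


def parse_dates(values: List[str]) -> List[date]:
    """Convert ISO 'YYYY-MM-DD' strings to date objects."""
    return [date.fromisoformat(v) for v in values]


# Hypothetical sample built from the two dates quoted in the description.
sample = ["1974-04-03", "1974-04-03", "2011-04-27"]
parsed = parse_dates(sample)

# Temporal grouping: events per year, and the single busiest day.
events_per_year = Counter(d.year for d in parsed)
busiest_day, busiest_count = Counter(parsed).most_common(1)[0]
```

`date.fromisoformat` raises `ValueError` on malformed input, which is a convenient way to surface any rows that break the expected format.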
country
categorical · metadata · imbalance
This column records the country of origin for each record, but every one of the 14,770 rows holds the value "USA". Cardinality is 1 and entropy is 0, so the field carries no information.
Treatment: Drop; constant column with zero variance.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 1
- top_value: USA
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
event_type
categorical · label
Categorical label of severe-weather event types across 14,770 rows with no nulls and 17 distinct categories. Tornado dominates at 42.9% (6,334 rows), followed by Flash Flood, Thunderstorm Wind, Flood, and Hail; the entropy ratio of 0.57 confirms the distribution is concentrated in a few classes. The long tail (Heavy Rain, Marine Strong Wind, Debris Flow, Marine Thunderstorm Wind) has very thin support, which will hurt per-class modelling.
Treatment: Use as categorical target or feature; consider grouping rare classes given the heavy Tornado skew.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 17
- top_value: Tornado
- top_rate: 0.4288
- cardinality: 17
- entropy: 2.336
- entropy_ratio: 0.5715
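Grouping rare classes can be sketched directly on the value counts reported for event_type. The threshold of 100 rows and the label "Other" are assumptions picked for illustration; a different cut-off may suit a given model better.

```python
from typing import Dict

# Counts copied from the event_type value table in this report.
EVENT_COUNTS: Dict[str, int] = {
    "Tornado": 6334, "Flash Flood": 2358, "Thunderstorm Wind": 2257,
    "Flood": 1777, "Hail": 1246, "Lightning": 574, "Heavy Rain": 99,
    "Marine Strong Wind": 43, "Debris Flow": 43,
    "Marine Thunderstorm Wind": 25, "Marine High Wind": 5, "Dust Devil": 3,
    "Waterspout": 2, "Tropical Storm": 1, "High Wind": 1, "Heat": 1,
    "Marine Lightning": 1,
}


def group_rare(counts: Dict[str, int], min_count: int,
               other_label: str = "Other") -> Dict[str, int]:
    """Merge every class with fewer than min_count rows into other_label."""
    grouped: Dict[str, int] = {}
    for label, count in counts.items():
        key = label if count >= min_count else other_label
        grouped[key] = grouped.get(key, 0) + count
    return grouped


grouped = group_rare(EVENT_COUNTS, min_count=100)  # threshold is an assumption
```

With this threshold the 17 classes collapse to six named classes plus one merged bucket, and the row total is unchanged.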
state
categorical · feature
US state names in uppercase, with 65 distinct values across 14,770 complete rows. That is more than the 50 states, suggesting territories, military codes, or 'UNKNOWN'-style entries are mixed in. The distribution is broad (entropy ratio 0.86), with Texas leading at 9.8% (1,450 rows), followed by Missouri, Arkansas, Mississippi, and Georgia, indicating a southern/central US tilt.
Treatment: Normalize to standard state codes and one-hot or target-encode; investigate the 15 extra categories beyond 50 states.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 65
- top_value: TEXAS
- top_rate: 0.09817
- cardinality: 65
- entropy: 5.182
- entropy_ratio: 0.8605
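To investigate the extra categories, one option is to compare the observed labels against the 50 state names. In the sketch below the sample labels are hypothetical, because this profile does not list which non-state values actually occur.

```python
from typing import Iterable, List

# The 50 U.S. states in the uppercase form used by the column.
US_STATES = {
    "ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", "COLORADO",
    "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", "HAWAII", "IDAHO",
    "ILLINOIS", "INDIANA", "IOWA", "KANSAS", "KENTUCKY", "LOUISIANA",
    "MAINE", "MARYLAND", "MASSACHUSETTS", "MICHIGAN", "MINNESOTA",
    "MISSISSIPPI", "MISSOURI", "MONTANA", "NEBRASKA", "NEVADA",
    "NEW HAMPSHIRE", "NEW JERSEY", "NEW MEXICO", "NEW YORK",
    "NORTH CAROLINA", "NORTH DAKOTA", "OHIO", "OKLAHOMA", "OREGON",
    "PENNSYLVANIA", "RHODE ISLAND", "SOUTH CAROLINA", "SOUTH DAKOTA",
    "TENNESSEE", "TEXAS", "UTAH", "VERMONT", "VIRGINIA", "WASHINGTON",
    "WEST VIRGINIA", "WISCONSIN", "WYOMING",
}


def non_state_labels(values: Iterable[str]) -> List[str]:
    """Return the sorted distinct labels that are not one of the 50 states."""
    return sorted({v.strip().upper() for v in values} - US_STATES)


# Hypothetical sample; the real extra labels are not shown in this report.
extras = non_state_labels(["TEXAS", "PUERTO RICO", "GUAM", "OHIO"])
```

Running this on the full column should list exactly the labels that need a decision: map to a territory code, merge, or drop.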
magnitude
categorical · feature · null_rate
Numeric magnitudes stored as strings, with 170 distinct values ranging from '0' to two-decimal figures like '70.00'. Over half the rows (51.78%) are null, and among the non-null rows the value '0' dominates at 54.24%, so non-zero magnitudes remain for only about 22% of all rows. The entropy ratio of 0.48 confirms most signal is concentrated in a few buckets.
Treatment: Cast to numeric, treat '0' and nulls as likely missing/sentinel, and consider a presence flag before modelling.
- n: 14,770
- nulls: 7,648 (51.8%)
- unique: 170
- top_value: 0
- top_rate: 0.5424
- cardinality: 170
- entropy: 3.586
- entropy_ratio: 0.484
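A minimal sketch of that treatment is shown below. It assumes nulls arrive as `None` or empty strings and that a magnitude of 0 is a sentinel rather than a real measurement; the second assumption follows the recommendation above but should be checked per event type. The helper name `parse_magnitude` is hypothetical.

```python
from typing import List, Optional


def parse_magnitude(raw: Optional[str],
                    zero_is_missing: bool = True) -> Optional[float]:
    """Cast a magnitude string to float, mapping nulls (and optionally 0) to None."""
    if raw is None or raw.strip() == "":
        return None
    value = float(raw)
    if zero_is_missing and value == 0:
        return None
    return value


raw_values: List[Optional[str]] = ["70.00", "0", None, "1.75"]  # hypothetical rows
magnitudes = [parse_magnitude(v) for v in raw_values]

# Presence flag: 1 where a usable magnitude exists, 0 otherwise.
has_magnitude = [int(m is not None) for m in magnitudes]
```

Keeping the flag alongside the numeric value lets a model distinguish "no magnitude recorded" from a genuinely small magnitude.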
injuries
categorical · feature
This is an injury count per record, stored as strings but numeric in nature (top values are '0' through '12'). The distribution is dominated by zeros: 68.1% of 14,770 rows report '0' injuries, with 893 at '1' and a long tail spanning 178 distinct values. The entropy ratio of 0.33 confirms heavy concentration at the low end.
Treatment: Cast to integer and consider log1p or zero-inflated modelling given the heavy zero mass.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 178
- top_value: 0
- top_rate: 0.6814
- cardinality: 178
- entropy: 2.468
- entropy_ratio: 0.3301
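The cast-then-log1p step can be sketched as follows, assuming every value is a non-negative integer string as the profile suggests. The sample values are hypothetical.

```python
import math
from typing import List


def to_log1p(raw_counts: List[str]) -> List[float]:
    """Cast count strings to int and apply log(1 + x), which keeps 0 at 0."""
    return [math.log1p(int(v)) for v in raw_counts]


transformed = to_log1p(["0", "1", "12"])  # hypothetical sample
```

log1p compresses the long tail while leaving the large block of zeros at exactly zero, so the zero mass stays easy to identify downstream.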
fatalities
categorical · numeric_target
Counts of fatalities per event, stored as strings with 49 distinct values across 14,770 rows and no nulls. The distribution is heavily zero-inflated: 69.1% of records are "0" and the next bucket "1" covers 3,208 rows, leaving a long thin tail (e.g., 25 rows at "9", 24 at "10"). The low entropy ratio (0.25) confirms most variance lives in the 0/1 split.
Treatment: Cast to integer and consider modelling as a count (Poisson/negative binomial) or binarise to any-fatality given the zero inflation.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 49
- top_value: 0
- top_rate: 0.6912
- cardinality: 49
- entropy: 1.423
- entropy_ratio: 0.2535
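The binarised any-fatality target is the simpler of the two options and can be sketched in a few lines. It assumes all values parse as non-negative integers; the sample rows are hypothetical.

```python
from typing import List


def any_fatality(raw_counts: List[str]) -> List[int]:
    """Cast count strings to int and return 1 where at least one death occurred."""
    return [1 if int(v) > 0 else 0 for v in raw_counts]


flags = any_fatality(["0", "1", "9", "0"])  # hypothetical sample
positive_rate = sum(flags) / len(flags)
```

On the full column the positive rate should come out near 31%, the complement of the 69.1% zero share reported above.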
damage_property
text · feature · one_word · allcaps · short_text · duplicates
This column encodes property damage as short magnitude strings with a K/M suffix (e.g., '2.5M', '250K', '0.00K'), not free text: every non-empty value is one word and the maximum length is 8. Formatting is inconsistent: '1M' and '1.00M' coexist, and '0.00K' appears 1,229 times alongside 368 empty entries, conflating true zeros with missingness. With a 93.1% duplicate rate across only 1,014 unique values (1,013 distinct tokens plus the empty string), this is a coarse categorical-looking encoding of a numeric quantity.
Treatment: Parse the K/M suffix into a numeric dollar amount and treat empty strings as missing (distinct from 0.00K).
- n: 14,770
- nulls: 0 (0.0%)
- unique: 1,014
- len_min: 0
- len_max: 8
- len_mean: 4.381
- len_median: 5
- len_p95: 7
- word_mean: 1
- word_median: 1
- n_empty: 368
- n_duplicates: 13,756
- duplicate_rate: 0.9313
- vocab_size: 1,013
- readability_flesch_mean: 117
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.8724
- boilerplate_rate: 0
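The suffix parsing can be sketched as below. The K and M multipliers come from the examples in the description; the B (billions) entry is an assumption added in case such values exist, and unparseable tokens are returned as `None` rather than raising. The helper name `parse_damage` is hypothetical.

```python
from typing import Optional

# K and M appear in the profile; B is an assumption for completeness.
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_damage(raw: Optional[str]) -> Optional[float]:
    """Convert strings like '2.5M' or '250K' to dollars; empty or invalid -> None."""
    token = raw.strip().upper() if raw else ""
    if not token:
        return None  # empty string means missing, not zero
    number, factor = token, 1
    if token[-1] in MULTIPLIERS:
        number, factor = token[:-1], MULTIPLIERS[token[-1]]
    try:
        return float(number) * factor
    except ValueError:
        return None  # e.g. a bare suffix with no numeric part


damage = [parse_damage(v) for v in ["2.5M", "250K", "0.00K", "", "1M", "1.00M"]]
```

After parsing, '1M' and '1.00M' map to the same number, and '0.00K' becomes a real zero that stays distinct from the missing values.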
source
categorical · metadata · imbalance
This column records the data provenance, holding the constant string "NOAA Storm Events Database" for all 14,770 rows. With a cardinality of 1 and entropy of 0.0, it carries no information for modelling and only serves as a dataset-level annotation.
Treatment: Drop before modelling; retain in documentation as a provenance tag.
- n: 14,770
- nulls: 0 (0.0%)
- unique: 1
- top_value: NOAA Storm Events Database
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0