data trove tornadoes noaa spc
Reading
This dataset contains 70,022 tornado records across the United States, with attributes covering location, timing, magnitude, path dimensions, and human impact. Texas leads with 9,345 events (13.3%), and the classic 'Tornado Alley' states (TX, KS, OK, NE, IA) together account for 23,983 records, about 34% of the total. Magnitude is worth close inspection: 46.0% of tornadoes are rated 0 (the weakest rating on the F/EF scale) and only 59 reach magnitude 5, a steep severity distribution. Another 1,024 records carry the value -9, which looks like an unknown-rating sentinel rather than a real magnitude. Human cost is highly skewed: 97.7% of events report zero fatalities, but a long tail of deadly events remains. The most frequent date, April 27, 2011 (207 records), points to a handful of catastrophic outbreak days that deserve focused analysis.
citing: row_count · column_count · state.top_values · mag.top_values · fatalities.top_rate · fatalities.top_value · date.top_values · injuries.top_values · loss.top_values · len.top_values
Charts the summary said to look at first
state: top values
| value | count | share |
|---|---|---|
| TX | 9345 | 13.3% |
| KS | 4474 | 6.4% |
| OK | 4221 | 6.0% |
| FL | 3620 | 5.2% |
| NE | 3056 | 4.4% |
| IA | 2887 | 4.1% |
| IL | 2835 | 4.0% |
| MS | 2657 | 3.8% |
| AL | 2529 | 3.6% |
| MO | 2462 | 3.5% |
| CO | 2425 | 3.5% |
| LA | 2305 | 3.3% |
| MN | 2118 | 3.0% |
| AR | 1981 | 2.8% |
| SD | 1917 | 2.7% |
| GA | 1898 | 2.7% |
| ND | 1640 | 2.3% |
| IN | 1610 | 2.3% |
| WI | 1515 | 2.2% |
| NC | 1472 | 2.1% |
mag: top values
| value | count | share |
|---|---|---|
| 0 | 32218 | 46.0% |
| 1 | 23782 | 34.0% |
| 2 | 9767 | 13.9% |
| 3 | 2585 | 3.7% |
| -9 | 1024 | 1.5% |
| 4 | 587 | 0.8% |
| 5 | 59 | 0.1% |
fatalities: top values
| value | count | share |
|---|---|---|
| 0 | 68423 | 97.7% |
| 1 | 830 | 1.2% |
| 2 | 277 | 0.4% |
| 3 | 134 | 0.2% |
| 4 | 77 | 0.1% |
| 5 | 46 | 0.1% |
| 6 | 45 | 0.1% |
| 7 | 32 | 0.0% |
| 9 | 15 | 0.0% |
| 10 | 15 | 0.0% |
| 11 | 14 | 0.0% |
| 8 | 13 | 0.0% |
| 16 | 12 | 0.0% |
| 13 | 8 | 0.0% |
| 17 | 7 | 0.0% |
| 18 | 6 | 0.0% |
| 21 | 6 | 0.0% |
| 12 | 6 | 0.0% |
| 22 | 5 | 0.0% |
| 25 | 4 | 0.0% |
len: string-length histogram (empty bins omitted)
| chars | count |
|---|---|
| 3 | 47630 |
| 4 | 11687 |
| 5 | 795 |
| 6 | 9118 |
| 7 | 789 |
| 8 | 3 |
injuries: top values
| value | count | share |
|---|---|---|
| 0 | 62177 | 88.8% |
| 1 | 2480 | 3.5% |
| 2 | 1388 | 2.0% |
| 3 | 770 | 1.1% |
| 4 | 484 | 0.7% |
| 5 | 385 | 0.5% |
| 6 | 300 | 0.4% |
| 7 | 194 | 0.3% |
| 8 | 171 | 0.2% |
| 10 | 141 | 0.2% |
| 12 | 120 | 0.2% |
| 9 | 117 | 0.2% |
| 11 | 80 | 0.1% |
| 20 | 71 | 0.1% |
| 15 | 69 | 0.1% |
| 13 | 66 | 0.1% |
| 14 | 55 | 0.1% |
| 30 | 46 | 0.1% |
| 25 | 44 | 0.1% |
| 16 | 43 | 0.1% |
Schema
13 columns
| column | type | nulls | unique | Alerts |
|---|---|---|---|---|
| date | text | 0.0% | 12,639 | one_word, allcaps, short_text, duplicates |
| time | text | 0.0% | 1,438 | one_word, allcaps, short_text, duplicates |
| state | categorical | 0.0% | 53 | |
| mag | categorical | 0.0% | 7 | |
| injuries | categorical | 0.0% | 209 | |
| fatalities | categorical | 0.0% | 50 | imbalance |
| loss | text | 0.0% | 1,019 | one_word, allcaps, short_text, duplicates |
| slat | numeric | 0.0% | 16,016 | |
| slon | numeric | 0.0% | 17,912 | |
| elat | numeric | 37.6% | 16,965 | null_rate |
| elon | numeric | 37.6% | 18,586 | null_rate |
| len | text | 0.0% | 3,663 | one_word, allcaps, short_text, duplicates |
| wid | categorical | 0.0% | 419 | |
date
text timestamp one_word allcaps short_text duplicates
This column contains ISO-8601 calendar dates stored as text strings, all exactly 10 characters long (YYYY-MM-DD format) with zero nulls across 70,022 rows. The 'allcaps' alert is a quirk of the profiler, which classifies these digit-and-hyphen tokens as uppercase-only; it is not a real data issue. The duplicate rate is high at 81.9% (57,383 duplicates) across only 12,639 unique dates, which is expected here: this is the tornado event date, and many tornadoes occur on the same day. The top value '2011-04-27' appears 207 times, so records cluster on specific outbreak days. Treatment: Parse to a datetime dtype, then use as a temporal feature (day-of-week, month, year, cyclical encoding) or as a grouping key for per-day aggregations.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 12,639
- len_min: 10
- len_max: 10
- len_mean: 10
- len_median: 10
- len_p95: 10
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 57,383
- duplicate_rate: 0.8195
- vocab_size: 7,831
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
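The date treatment above can be sketched with the standard library alone. This is a minimal illustration, not the profiler's own code: the `raw_dates` sample and the helper names are hypothetical, and only the YYYY-MM-DD format is taken from the column description.

```python
from collections import Counter
from datetime import date

# Hypothetical sample of the `date` column (text, YYYY-MM-DD).
raw_dates = ["2011-04-27", "2011-04-27", "1999-05-03", "2011-04-27", "1974-04-03"]


def parse_dates(values):
    """Parse ISO-8601 date strings into datetime.date objects."""
    return [date.fromisoformat(v) for v in values]


def records_per_day(dates):
    """Count how many records share each calendar day, busiest day first."""
    return Counter(dates).most_common()


def calendar_features(d):
    """Derive simple temporal features from one date (weekday: Monday = 0)."""
    return {"year": d.year, "month": d.month, "weekday": d.weekday()}


parsed = parse_dates(raw_dates)
busiest_day, n_records = records_per_day(parsed)[0]
```

Grouping by the parsed date is what surfaces outbreak days: on the full column, the busiest day would be the one the profile reports as the top value.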
time
text timestamp one_word allcaps short_text duplicates
This column contains wall-clock time strings in HH:MM:SS format (all values are exactly 8 characters), representing the time of day of each tornado record. With only 1,438 unique values across 70,022 rows, the duplicate rate is very high at 97.9%, which is expected for a time-of-day field. The top values cluster around afternoon and early-evening hours (14:00–19:00), a strong diurnal pattern rather than a business or scheduling artefact. The profile does not state a time zone, so confirm it before comparing times across states. The column is stored as text despite being a structured time value, so parse it to a proper time type for downstream use. Treatment: Parse to a time/datetime type and use as a cyclical feature (e.g., sine/cosine encoding of hour) or as a key for time-based aggregations.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 1,438
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 68,584
- duplicate_rate: 0.9795
- vocab_size: 1,352
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
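The sine/cosine encoding mentioned in the treatment can be sketched as follows. The function names are hypothetical; the only thing taken from the profile is the HH:MM:SS format. The point of the encoding is that 23:59 and 00:01 end up close together, which a raw hour number would not achieve.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_since_midnight(hms):
    """Convert an 'HH:MM:SS' string to seconds elapsed since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cyclical_time(hms):
    """Map a time of day onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * seconds_since_midnight(hms) / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)


late_sin, late_cos = cyclical_time("23:59:00")
early_sin, early_cos = cyclical_time("00:01:00")
# Distance between the two encodings; small because the times are 2 minutes apart.
wraparound_gap = math.hypot(late_sin - early_sin, late_cos - early_cos)
```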
state
categorical feature
This column contains US state abbreviations, with 53 distinct values (the 50 states plus three other codes, likely DC, Puerto Rico, and one other territory) and zero nulls across 70,022 rows. Texas dominates at 13.3% (9,345 rows), and the top 10 lean heavily toward Great Plains and Southern states (KS, OK, NE, IA, MS, AL, MO). For tornado records this concentration is expected: it reflects where tornadoes are reported, not a non-representative sample. The entropy ratio of 0.847 indicates reasonably broad coverage across states. Treatment: One-hot encode or target-encode depending on model type; consider grouping low-frequency states if using one-hot.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 53
- top_value: TX
- top_rate: 0.1335
- cardinality: 53
- entropy: 4.851
- entropy_ratio: 0.8468
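The profile does not define `entropy_ratio`. The reported figures are consistent with Shannon entropy in bits divided by its maximum, log2 of the cardinality, so the sketch below assumes that definition. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy of the value distribution, in bits."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    distinct = len(set(values))
    if distinct < 2:
        return 0.0  # a constant column has no spread to normalise
    return entropy_bits(values) / math.log2(distinct)


# Check the assumed definition against the figures reported for `state`:
# entropy 4.851 with cardinality 53.
reported_ratio = 4.851 / math.log2(53)
```

Under this assumption `reported_ratio` agrees with the listed 0.8468 to three decimal places, and the same check works for `mag` (1.772 with cardinality 7).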
mag
categorical feature
This column holds the tornado magnitude rating (F/EF scale) encoded as a small integer, with 7 distinct values spanning -9 to 5. The dominant value is '0' (46.0% of rows, 32,218 records), followed by '1' and '2', giving a right-skewed ordinal distribution. The value '-9', which appears 1,024 times (1.5%), is almost certainly a sentinel for a missing or unknown rating rather than a true negative magnitude, so it would distort any analysis that treats the column as a clean ordinal scale. The column is stored as categorical despite being numerically interpretable. Treatment: Recode '-9' as missing, then treat as an ordinal integer feature or one-hot encode depending on model assumptions.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 7
- top_value: 0
- top_rate: 0.4601
- cardinality: 7
- entropy: 1.772
- entropy_ratio: 0.6312
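A minimal sketch of the recoding step, using the counts from the magnitude table in this report. Treating '-9' as "rating unknown" is an assumption carried over from the description above, and the helper names are hypothetical.

```python
# Counts copied from the magnitude table in this report.
mag_counts = {"0": 32218, "1": 23782, "2": 9767, "3": 2585, "-9": 1024, "4": 587, "5": 59}

MISSING_SENTINEL = "-9"  # assumed to mean "rating unknown"


def clean_mag(raw):
    """Return the rating as an int, or None for the assumed missing sentinel."""
    raw = raw.strip()
    return None if raw == MISSING_SENTINEL else int(raw)


def share_among_rated(counts, rating):
    """Share of one rating once sentinel rows are excluded from the denominator."""
    rated_total = sum(n for value, n in counts.items() if clean_mag(value) is not None)
    return counts[rating] / rated_total


share_of_zero = share_among_rated(mag_counts, "0")
```

Excluding the sentinel rows moves the share of magnitude 0 from 46.0% of all records to about 46.7% of rated records, a small shift, but leaving -9 in an ordinal feature would also drag any mean or rank statistic downward.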
injuries
categorical feature
This column is a count of injuries per tornado, stored as a categorical type despite being fundamentally numeric. The distribution is extremely right-skewed: 88.8% of the 70,022 records have zero injuries and counts drop off sharply thereafter, yet the column has 209 unique values, so some very high injury counts exist in the tail. The low entropy ratio (0.123) confirms the heavy concentration on '0'. Round values such as '10', '12', and '20' rank above smaller neighbours like '9' and '11' in the top-values table, which may reflect rounding in reported counts. Treatment: Cast to integer, then consider zero-inflated modelling or a binary 'any_injury' flag plus a separate log-transformed count for the non-zero subset.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 209
- top_value: 0
- top_rate: 0.888
- cardinality: 209
- entropy: 0.9454
- entropy_ratio: 0.1227
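The two-part treatment (a binary flag plus a log-transformed count for the non-zero subset) might look like the sketch below. The feature names `any_injury` and `log_injuries` and the function itself are illustrative assumptions; the same shape would apply to the `fatalities` column.

```python
import math


def injury_features(raw):
    """Split a raw count string into a presence flag and a log count.

    The log count is only defined for the non-zero subset; zero-injury
    rows get None so they cannot be mistaken for log(1) = 0.
    """
    count = int(raw)
    if count < 0:
        raise ValueError(f"negative injury count: {raw!r}")
    return {
        "any_injury": count > 0,
        "log_injuries": math.log(count) if count > 0 else None,
    }


# Hypothetical raw values as they would appear in the categorical column.
features = [injury_features(v) for v in ["0", "1", "12", "0"]]
share_with_injury = sum(f["any_injury"] for f in features) / len(features)
```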
fatalities
categorical feature imbalance
This column is a count of fatalities per tornado, stored as a categorical type despite being numeric; it should be treated as an integer count. The dominant value is '0' (68,423 of 70,022 rows, or 97.7%), making the distribution severely right-skewed and triggering the imbalance alert. Cardinality reaches 50 distinct values with extremely low entropy (0.217, entropy ratio 0.039), confirming that the non-zero tail is long and sparse. Analysts modelling fatal events should note that the positive class is only 1,599 records, about 2.3% of the total. Treatment: Cast to integer, apply a zero-inflated or rare-event modelling strategy, or binarise into a fatal/non-fatal indicator before classification.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 50
- top_value: 0
- top_rate: 0.9772
- cardinality: 50
- entropy: 0.2174
- entropy_ratio: 0.03852
loss
text numeric_target one_word allcaps short_text duplicates
This column contains numeric loss values stored as text strings: all single-token entries (one_word_rate: 1.0) with a mean length of ~3.18 characters, dominated by small non-negative numbers like '0.0', '4.0', '5.0'. Given the dataset's human-impact attributes, it is most plausibly a property-loss figure, and the 'numeric_target' tag is the profiler's guess rather than a modelling requirement. The 92.5% allcaps rate is a misleading artefact of how saturn classifies short numeric strings. Notably, '0.0' and '0' appear as separate tokens (22,764 and 5,248 occurrences respectively), an inconsistent serialization of the same zero value that should be consolidated before use. The duplicate rate is 98.5%, reflecting only 1,019 unique string representations across 70,022 rows. Treatment: Cast to float (unifying '0' and '0.0'), then check whether the large spike at zero (28,012 rows, 40.0%) represents true zero loss or missing/default values, and confirm the units and coding scheme against the source documentation before treating the values as amounts.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 1,019
- len_min: 1
- len_max: 10
- len_mean: 3.181
- len_median: 3
- len_p95: 5
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 69,003
- duplicate_rate: 0.9854
- vocab_size: 503
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.9251
- boilerplate_rate: 0
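The '0' versus '0.0' inconsistency disappears once the strings are cast to float. The sketch below shows the effect on a small hypothetical sample; `parse_loss` is an illustrative helper, not part of any library.

```python
from collections import Counter


def parse_loss(raw):
    """Cast a loss string to float so '0' and '0.0' become the same value."""
    return float(raw.strip())


# Hypothetical raw values mixing both serializations of zero.
raw_loss = ["0.0", "0", "4.0", "5.0", "0"]

text_counts = Counter(raw_loss)                            # counts distinct strings
numeric_counts = Counter(parse_loss(v) for v in raw_loss)  # counts distinct numbers
zero_rows = numeric_counts[0.0]
```

Counting after the cast is what gives the true size of the zero spike; counting the text tokens splits it across two keys.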
slat
numeric feature
This column is a geographic latitude in decimal degrees, most plausibly the starting latitude of the tornado path (paired with slon, with elat/elon as the end point). Values range from 17.72° to 61.02°, spanning Caribbean latitudes up to latitudes consistent with Alaska, so the data covers the contiguous United States and beyond. The distribution is nearly symmetric (skew 0.038, kurtosis -0.582) and clustered around a mean of 37.14° with an IQR of 7.74°, as expected for a dataset dominated by mid-latitude U.S. locations. Only 70 outliers (0.1%) exist, likely the extreme northern or southern observations, and there are no nulls. Treatment: Use directly as a geospatial feature; consider pairing with longitude and engineering distance or region-based features rather than treating it as a raw numeric.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 16,016
- min: 17.72
- max: 61.02
- mean: 37.14
- median: 37.03
- std: 5.09
- q1: 33.19
- q3: 40.93
- iqr: 7.74
- skew: 0.03792
- kurtosis: -0.5825
- n_outliers: 70
- outlier_rate: 0.0009997
- zero_rate: 0
slon
numeric feature
This column contains geographic longitude values, most plausibly the starting longitude of the tornado path (the counterpart of slat). All values are negative, ranging from -163.53 to -64.72, which places every observation in the Western Hemisphere, from longitudes consistent with western Alaska to the eastern Caribbean. The mean of -92.74 and median of -93.50 indicate a concentration in the central United States. With 17,912 unique values across 70,022 rows and zero nulls, this is a continuous coordinate with mild repetition, and the 951 outliers (~1.36%) are probably far-western or far-eastern locations rather than errors, though they are worth checking. Treatment: Use as-is for spatial modeling or map directly to geographic coordinates; inspect the 951 outliers for plausibility against known geographic bounds.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 17,912
- min: -163.5
- max: -64.72
- mean: -92.74
- median: -93.5
- std: 8.677
- q1: -98.4
- q3: -86.69
- iqr: 11.71
- skew: -0.3229
- kurtosis: 2.156
- n_outliers: 951
- outlier_rate: 0.01358
- zero_rate: 0
elat
numeric feature null_rate
This column is a geographic latitude in decimal degrees, most plausibly the ending latitude of the tornado path. Values range from 17.72° to 61.02°, the same span as slat (Caribbean latitudes up to latitudes consistent with Alaska). The distribution is nearly symmetric (skew 0.034, kurtosis -0.41) and clustered around a mean of 37.26° with an IQR of 7.42°. The main concern is the 37.65% null rate (26,363 rows), flagged as an alert: more than a third of records lack an end point, and elon is null for the same number of rows. Only 78 outliers (0.18% of non-null values) lie at the extremes of the range. Treatment: Investigate the source of the nulls before use; pair with longitude for spatial features or geohash encoding; impute or filter nulls depending on the missingness mechanism.
- n: 70,022
- nulls: 26,363 (37.6%)
- unique: 16,965
- min: 17.72
- max: 61.02
- mean: 37.26
- median: 37.13
- std: 4.942
- q1: 33.49
- q3: 40.91
- iqr: 7.42
- skew: 0.03404
- kurtosis: -0.4085
- n_outliers: 78
- outlier_rate: 0.001787
- zero_rate: 0
elon
numeric feature null_rate
This column is a signed longitude, most plausibly the ending longitude of the tornado path; the leading 'e' pairs it with elat and does not mean eastern longitude, since every value is negative. Values range from -163.53 to -64.72, entirely in the Western Hemisphere. The mean of -92.19 and median of -92.47 indicate a concentration in the central United States. Two points deserve attention: the null rate is high at 37.65% (26,363 rows, the same count as elat), triggering an alert, and the spread is wide, from a minimum of -163.53 (longitudes consistent with western Alaska) to a maximum of -64.72 (the eastern Caribbean). Treatment: Investigate the nulls, then impute or exclude them before use; pair with elat for geospatial modelling or clustering.
- n: 70,022
- nulls: 26,363 (37.6%)
- unique: 18,586
- min: -163.5
- max: -64.72
- mean: -92.19
- median: -92.47
- std: 8.545
- q1: -97.73
- q3: -86.47
- iqr: 11.26
- skew: -0.5954
- kurtosis: 3.766
- n_outliers: 647
- outlier_rate: 0.01482
- zero_rate: 0
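One way to use the four coordinate columns together is a start-to-end great-circle distance that stays missing when the end point is null. The sketch below assumes slat/slon and elat/elon are the start and end of a path, uses the haversine formula with a mean Earth radius of 6371 km, and returns kilometres; all names are hypothetical, and the result is a straight-line distance that need not match the `len` column or its unit.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def track_distance_km(slat, slon, elat, elon):
    """Start-to-end distance, or None when the end point is missing."""
    if elat is None or elon is None:
        return None  # propagate missingness instead of inventing a distance
    return haversine_km(slat, slon, elat, elon)


with_end = track_distance_km(35.0, -97.0, 35.1, -96.9)
without_end = track_distance_km(35.0, -97.0, None, None)
```

Returning None for the roughly 37.6% of rows without an end point keeps the missingness visible, so the decision to impute or filter can be made explicitly downstream.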
len
text feature one_word allcaps short_text duplicates
This column stores numeric measurements encoded as text strings. Given the dataset's path-dimension attributes, it is most plausibly the tornado path length, stored in the wrong dtype. All 70,022 values are single tokens (which the profiler also flags as all-caps), with values like '0.1', '0.5', '1.0', '2.0' dominating the top entries and a character length range of 3–8. The duplicate rate is very high at 94.8% (66,359 duplicates across only 3,663 unique values), which is expected for a bounded numeric measure and confirms that the column should be cast to float and treated as a continuous feature rather than a categorical label. Treatment: Cast to float64 and use as a numeric feature; confirm the unit against the source documentation and check for unit consistency across the value range.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 3,663
- len_min: 3
- len_max: 8
- len_mean: 3.626
- len_median: 3
- len_p95: 6
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 66,359
- duplicate_rate: 0.9477
- vocab_size: 2,204
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
wid
categorical feature
This column ('wid') is a numeric measurement encoded as a categorical, most plausibly the tornado path width given the dataset's path-dimension attributes, with only 419 distinct values across 70,022 rows. The top values are all round numbers (10, 50, 100, 30, 20, 200, 25, 150, 40, 75), which suggests widths are estimated and rounded rather than precisely measured. The distribution is notably skewed: value '10' alone accounts for 20.6% of all rows (14,417 occurrences), with a steep drop-off thereafter. Treatment: Cast to a numeric integer and treat as a continuous feature; consider a log-transform if using it in regression, given the heavy skew toward small values.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 419
- top_value: 10
- top_rate: 0.2059
- cardinality: 419
- entropy: 4.463
- entropy_ratio: 0.5124