quirky tornadoes
Reading
This is a tornado event log with 70,022 rows and 13 columns covering dates, times, start/end coordinates, magnitudes, widths, fatalities, injuries, and U.S. state. Geographically it is a U.S.-centered dataset: starting longitudes average around -92.7 and latitudes around 37.1, with Texas (13.3% of records), Kansas, and Oklahoma leading the state counts. The severity fields are highly imbalanced (fatalities are 0 in 97.7% of events and injuries are 0 in 88.8%), so any analysis of harm should focus on the rare non-zero tail. Magnitude (mag) is a more usable categorical signal with 7 levels, dominated by 0 (46%) and 1 (34%). Note that the end-coordinate columns (elat, elon) are null in ~37.6% of rows, which matters if you plan to draw tornado tracks rather than just start points.
citing: row_count · column_count · columns.state.top_values · columns.state.top_rate · columns.fatalities.top_rate · columns.injuries.top_rate · columns.mag.top_values · columns.mag.top_rate · columns.elat.null_rate · columns.elon.null_rate · columns.slat.stats · columns.slon.stats
Charts the summary said to look at first
state: top 20 values

| value | count | share |
|---|---|---|
| TX | 9345 | 13.3% |
| KS | 4474 | 6.4% |
| OK | 4221 | 6.0% |
| FL | 3620 | 5.2% |
| NE | 3056 | 4.4% |
| IA | 2887 | 4.1% |
| IL | 2835 | 4.0% |
| MS | 2657 | 3.8% |
| AL | 2529 | 3.6% |
| MO | 2462 | 3.5% |
| CO | 2425 | 3.5% |
| LA | 2305 | 3.3% |
| MN | 2118 | 3.0% |
| AR | 1981 | 2.8% |
| SD | 1917 | 2.7% |
| GA | 1898 | 2.7% |
| ND | 1640 | 2.3% |
| IN | 1610 | 2.3% |
| WI | 1515 | 2.2% |
| NC | 1472 | 2.1% |

mag: value counts

| value | count | share |
|---|---|---|
| 0 | 32218 | 46.0% |
| 1 | 23782 | 34.0% |
| 2 | 9767 | 13.9% |
| 3 | 2585 | 3.7% |
| -9 | 1024 | 1.5% |
| 4 | 587 | 0.8% |
| 5 | 59 | 0.1% |

fatalities: top 20 values

| value | count | share |
|---|---|---|
| 0 | 68423 | 97.7% |
| 1 | 830 | 1.2% |
| 2 | 277 | 0.4% |
| 3 | 134 | 0.2% |
| 4 | 77 | 0.1% |
| 5 | 46 | 0.1% |
| 6 | 45 | 0.1% |
| 7 | 32 | 0.0% |
| 9 | 15 | 0.0% |
| 10 | 15 | 0.0% |
| 11 | 14 | 0.0% |
| 8 | 13 | 0.0% |
| 16 | 12 | 0.0% |
| 13 | 8 | 0.0% |
| 17 | 7 | 0.0% |
| 18 | 6 | 0.0% |
| 21 | 6 | 0.0% |
| 12 | 6 | 0.0% |
| 22 | 5 | 0.0% |
| 25 | 4 | 0.0% |

slat: histogram

| bin | count |
|---|---|
| 17.72 – 18.8 | 33 |
| 18.8 – 19.89 | 6 |
| 19.89 – 20.97 | 8 |
| 20.97 – 22.05 | 26 |
| 22.05 – 23.13 | 1 |
| 23.13 – 24.22 | 0 |
| 24.22 – 25.3 | 76 |
| 25.3 – 26.38 | 487 |
| 26.38 – 27.46 | 748 |
| 27.46 – 28.55 | 1225 |
| 28.55 – 29.63 | 1468 |
| 29.63 – 30.71 | 3607 |
| 30.71 – 31.79 | 3441 |
| 31.79 – 32.88 | 5003 |
| 32.88 – 33.96 | 4775 |
| 33.96 – 35.04 | 5392 |
| 35.04 – 36.12 | 5164 |
| 36.12 – 37.21 | 4268 |
| 37.21 – 38.29 | 4028 |
| 38.29 – 39.37 | 4807 |
| 39.37 – 40.45 | 5498 |
| 40.45 – 41.54 | 5313 |
| 41.54 – 42.62 | 3995 |
| 42.62 – 43.7 | 3396 |
| 43.7 – 44.78 | 2407 |
| 44.78 – 45.87 | 1720 |
| 45.87 – 46.95 | 1257 |
| 46.95 – 48.03 | 1114 |
| 48.03 – 49.11 | 754 |
| 49.11 – 50.2 | 1 |
| 50.2 – 51.28 | 0 |
| 51.28 – 52.36 | 0 |
| 52.36 – 53.44 | 0 |
| 53.44 – 54.53 | 0 |
| 54.53 – 55.61 | 1 |
| 55.61 – 56.69 | 0 |
| 56.69 – 57.77 | 0 |
| 57.77 – 58.86 | 0 |
| 58.86 – 59.94 | 1 |
| 59.94 – 61.02 | 2 |

slon: histogram

| bin | count |
|---|---|
| -163.5 – -161.1 | 2 |
| -161.1 – -158.6 | 9 |
| -158.6 – -156.1 | 25 |
| -156.1 – -153.6 | 8 |
| -153.6 – -151.2 | 0 |
| -151.2 – -148.7 | 0 |
| -148.7 – -146.2 | 0 |
| -146.2 – -143.8 | 1 |
| -143.8 – -141.3 | 0 |
| -141.3 – -138.8 | 0 |
| -138.8 – -136.4 | 0 |
| -136.4 – -133.9 | 0 |
| -133.9 – -131.4 | 0 |
| -131.4 – -128.9 | 0 |
| -128.9 – -126.5 | 0 |
| -126.5 – -124 | 15 |
| -124 – -121.5 | 231 |
| -121.5 – -119.1 | 241 |
| -119.1 – -116.6 | 282 |
| -116.6 – -114.1 | 174 |
| -114.1 – -111.7 | 399 |
| -111.7 – -109.2 | 323 |
| -109.2 – -106.7 | 325 |
| -106.7 – -104.2 | 1782 |
| -104.2 – -101.8 | 4689 |
| -101.8 – -99.3 | 6094 |
| -99.3 – -96.83 | 9665 |
| -96.83 – -94.36 | 8344 |
| -94.36 – -91.89 | 6626 |
| -91.89 – -89.42 | 6618 |
| -89.42 – -86.95 | 6086 |
| -86.95 – -84.48 | 5361 |
| -84.48 – -82.01 | 4363 |
| -82.01 – -79.54 | 3931 |
| -79.54 – -77.07 | 1885 |
| -77.07 – -74.6 | 1618 |
| -74.6 – -72.13 | 541 |
| -72.13 – -69.66 | 285 |
| -69.66 – -67.19 | 71 |
| -67.19 – -64.72 | 28 |
Schema
13 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| date | text | 0.0% | 12,639 | one_word, allcaps, short_text, duplicates |
| time | text | 0.0% | 1,438 | one_word, allcaps, short_text, duplicates |
| state | categorical | 0.0% | 53 | |
| mag | categorical | 0.0% | 7 | |
| injuries | categorical | 0.0% | 209 | |
| fatalities | categorical | 0.0% | 50 | imbalance |
| loss | text | 0.0% | 1,019 | one_word, allcaps, short_text, duplicates |
| slat | numeric | 0.0% | 16,016 | |
| slon | numeric | 0.0% | 17,912 | |
| elat | numeric | 37.6% | 16,965 | null_rate |
| elon | numeric | 37.6% | 18,586 | null_rate |
| len | text | 0.0% | 3,663 | one_word, allcaps, short_text, duplicates |
| wid | categorical | 0.0% | 419 | |

date
text timestamp one_word allcaps short_text duplicates
This is a date column stored as ISO-formatted text (YYYY-MM-DD); every value is exactly 10 characters long and a single token. Across 70,022 rows there are only 12,639 unique dates (an 81.9% duplicate rate), so many records share a date; the top value, 2011-04-27, appears 207 times. The values span at least 1974-04-03 to 2023-03-31. The uniform ISO format indicates the column was misclassified as text rather than a date type. Treatment: Cast to a proper date type and use for temporal joins or time-based features.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 12,639
- len_min: 10
- len_max: 10
- len_mean: 10
- len_median: 10
- len_p95: 10
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 57,383
- duplicate_rate: 0.8195
- vocab_size: 7,831
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
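
As a minimal sketch of the suggested cast, the snippet below parses YYYY-MM-DD strings with the Python standard library and derives simple time-based features. The helper name `parse_iso_date` is hypothetical, and the three sample strings are the dates quoted in the description above, not a slice of the real file.

```python
from datetime import date


def parse_iso_date(text: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError on malformed input."""
    return date.fromisoformat(text)


# Sample values taken from the column description (not the full dataset).
raw = ["1974-04-03", "2011-04-27", "2023-03-31"]
parsed = [parse_iso_date(s) for s in raw]

# Once the values are real dates, temporal features are attribute lookups.
years = [d.year for d in parsed]
months = [d.month for d in parsed]
span_days = (max(parsed) - min(parsed)).days
```

Letting the parser raise on bad input, rather than silently coercing, makes any non-ISO stragglers visible before they reach a join.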
time
text timestamp one_word allcaps short_text duplicates
This column holds clock times stored as 8-character HH:MM:SS strings; all 70,022 rows are non-null and of uniform length (len_min = len_max = 8). Only 1,438 distinct values appear and 97.95% of rows are duplicates. Afternoon and early-evening slots dominate, led by 16:00:00 (978), 17:00:00 (971), and 15:00:00 (959), which is consistent with event start times rather than free text. The column is mistyped as text and should be parsed before use. Treatment: Cast to a time type and bucket by hour for modelling.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 1,438
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 68,584
- duplicate_rate: 0.9795
- vocab_size: 1,352
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
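
The hour bucketing suggested above could look roughly like this. It is a sketch using only the standard library; `hour_of` is a hypothetical helper, and the sample list mixes the three top values from the description with two made-up times for illustration.

```python
from collections import Counter
from datetime import time


def hour_of(text: str) -> int:
    """Parse an HH:MM:SS string and return its hour (0-23)."""
    return time.fromisoformat(text).hour


# First three values appear in the profile; the last two are illustrative only.
raw = ["16:00:00", "17:00:00", "15:00:00", "16:30:00", "02:15:00"]

# Count events per hour-of-day bucket.
hour_counts = Counter(hour_of(s) for s in raw)
```

Hour buckets reduce the 1,438 distinct strings to at most 24 levels, which is usually easier to model than raw clock times.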
state
categorical feature
This is a U.S. state code column with 53 distinct values, slightly more than the 50 states (likely including DC and territories such as PR). TX dominates at 13.3% of 70,022 rows, followed by KS (4,474) and OK (4,221), giving a clear southern/plains skew rather than a uniform national distribution. An entropy ratio of 0.847 indicates the distribution is fairly spread but not flat, and there are no nulls. Treatment: One-hot or target-encode; consider grouping low-frequency states into an 'Other' bucket.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 53
- top_value: TX
- top_rate: 0.1335
- cardinality: 53
- entropy: 4.851
- entropy_ratio: 0.8468
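
One way to group low-frequency states into an 'Other' bucket is sketched below. The function name `lump_rare`, the `min_share` threshold, and the toy sample are all assumptions for illustration; the profile does not prescribe a cut-off.

```python
from collections import Counter


def lump_rare(values, min_share=0.02, other="Other"):
    """Replace categories whose share of rows is below min_share with `other`."""
    counts = Counter(values)
    total = len(values)
    keep = {v for v, c in counts.items() if c / total >= min_share}
    return [v if v in keep else other for v in values]


# Toy sample: TX 60%, KS 30%, PR 10% (not the real proportions).
sample = ["TX"] * 6 + ["KS"] * 3 + ["PR"]
lumped = lump_rare(sample, min_share=0.2)
```

The threshold should be fitted on training data only and then reused at inference time, so that the set of retained states does not drift between the two.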
mag
categorical feature
`mag` is a low-cardinality categorical with 7 distinct values, dominated by '0' (46% of 70,022 rows), with counts decreasing through '1', '2', '3', '4', and '5'. The ordered integer levels suggest a magnitude or severity code rather than a free category. The standout signal is '-9' (1,024 rows), almost certainly a sentinel for missing/unknown that is not being counted as null. Treatment: Recode '-9' as missing, then treat as an ordinal feature.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 7
- top_value: 0
- top_rate: 0.4601
- cardinality: 7
- entropy: 1.772
- entropy_ratio: 0.6312
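
The recode step can be sketched as follows, assuming (as the description argues, but the data does not confirm) that '-9' means missing. `recode_mag` and `MISSING_SENTINEL` are hypothetical names.

```python
from typing import List, Optional

# Assumption: '-9' is a missing-value sentinel, per the profile's reading.
MISSING_SENTINEL = "-9"


def recode_mag(token: str) -> Optional[int]:
    """Map the sentinel to None and every other token to an ordinal integer."""
    token = token.strip()
    if token == MISSING_SENTINEL:
        return None
    return int(token)


raw = ["0", "1", "-9", "3", "5"]
mag: List[Optional[int]] = [recode_mag(t) for t in raw]
n_missing = sum(m is None for m in mag)
```

Recoding before any averaging matters: left in place, a -9 would pull a mean or an ordinal model toward a level that does not exist.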
injuries
categorical numeric_target
Counts of injuries stored as strings, with 209 distinct values across 70,022 rows and no nulls. The distribution is severely zero-inflated: '0' accounts for 88.8% of records, and the entropy ratio is just 0.123. Values '1' through '10' decay quickly, but 209 unique tokens suggest either a very long tail or non-integer entries, which is worth inspecting. Treatment: Cast to integer and model as a zero-inflated count (e.g., a hurdle or zero-inflated Poisson (ZIP) regression).
- n: 70,022
- nulls: 0 (0.0%)
- unique: 209
- top_value: 0
- top_rate: 0.888
- cardinality: 209
- entropy: 0.9454
- entropy_ratio: 0.1227
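
Before fitting a hurdle-style model it helps to confirm that every token really is a non-negative integer and to split the data into the two parts such a model uses. The sketch below does only that preparation, not the regression itself. `to_count` is a hypothetical helper, and the tokens '1.5' and 'n/a' are invented to show what a suspect entry would look like; the profile does not report any.

```python
from typing import Optional


def to_count(token: str) -> Optional[int]:
    """Return a non-negative integer count, or None if the token is not one."""
    try:
        value = int(token.strip())
    except ValueError:
        return None
    return value if value >= 0 else None


# Illustrative tokens; '1.5' and 'n/a' are made up, not observed in the data.
raw = ["0", "0", "3", "1.5", "0", "12", "n/a"]
counts = [to_count(t) for t in raw]

# Tokens that failed the cast and deserve a manual look.
suspect = [t for t, c in zip(raw, counts) if c is None]

# Hurdle-style split: a zero/non-zero indicator, plus the positive counts.
any_injury = [c > 0 for c in counts if c is not None]
positive_counts = [c for c in counts if c is not None and c > 0]
```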
fatalities
categorical numeric_target imbalance
This is a fatality count per event, stored as a categorical/string field with 50 distinct integer values and no nulls across 70,022 rows. The distribution is extremely imbalanced: '0' accounts for 97.72% of records, giving an entropy ratio of just 0.039, with '1' at 830 rows and a long thin tail ('2' has 277 rows, '3' has 134, and higher values fall to single-digit counts). Despite being typed as categorical, the values are numeric and ordered, so the current encoding is likely a load artifact. Treatment: Cast to integer and model as a zero-inflated count rather than a categorical.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 50
- top_value: 0
- top_rate: 0.9772
- cardinality: 50
- entropy: 0.2174
- entropy_ratio: 0.03852
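
The report does not state how entropy_ratio is computed. The figures listed above are consistent with Shannon entropy in bits divided by log2 of the cardinality, so the sketch below assumes that definition. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(values) -> float:
    """Shannon entropy of the empirical distribution, in bits."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_ratio(entropy: float, cardinality: int) -> float:
    """Entropy as a fraction of its maximum, log2(cardinality) (assumed definition)."""
    return entropy / math.log2(cardinality) if cardinality > 1 else 0.0


# Plugging in the listed values for fatalities: entropy 0.2174, cardinality 50.
reported = entropy_ratio(0.2174, 50)  # about 0.0385, matching the listed 0.03852

# Sanity check on a toy input: two equally likely values carry exactly 1 bit.
coin = entropy_bits(["a", "b"])
```

A ratio near 0 means one value dominates; a ratio near 1 means the values are close to uniformly spread.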
loss
text feature one_word allcaps short_text duplicates
Numeric loss values stored as text strings; all 70,022 entries are single tokens (one_word_rate 1.0) with lengths of 1 to 10 characters. The distribution is heavily concentrated: '0.0' alone accounts for 22,764 rows, and the duplicate_rate is 0.985 across only 1,019 unique values. Mixed formatting is a hazard: '0' and '0.0' appear as separate tokens, so a numeric cast will merge them but string-based grouping will not. Treatment: Cast to float (normalising '0' vs '0.0') before any modelling or aggregation.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 1,019
- len_min: 1
- len_max: 10
- len_mean: 3.181
- len_median: 3
- len_p95: 5
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 69,003
- duplicate_rate: 0.9854
- vocab_size: 503
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.9251
- boilerplate_rate: 0
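
The '0' versus '0.0' hazard is easy to demonstrate. The sketch below groups a small illustrative sample once by raw string and once by float value; the sample tokens other than '0' and '0.0' are made up.

```python
from collections import Counter

# Illustrative tokens; only '0' and '0.0' are named in the profile.
raw = ["0", "0.0", "0.05", "2.5", "0.0"]

# Grouping on the raw strings keeps '0' and '0.0' apart.
by_string = Counter(raw)

# Grouping after a float cast merges them into a single zero bucket.
by_value = Counter(float(t) for t in raw)
```

Here `by_string` has four keys while `by_value` has three, which is the kind of discrepancy to expect between the 1,019 unique strings reported above and the number of distinct numeric losses.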
slat
numeric feature
Values range from 17.72 to 61.02 with a mean of 37.14 and median of 37.03, consistent with a starting latitude (slat) field in decimal degrees. Most records fall within the latitude band of the contiguous US, though the extremes at both ends lie outside it. The distribution is essentially symmetric (skew 0.04, kurtosis -0.58) with a tight IQR of 7.74 and only 70 outliers (0.10%). There are no nulls and no zeros, and 16,016 unique values across 70,022 rows suggest repeated coordinates rather than free-form noise. Treatment: Use as-is for geospatial features; optionally pair with a longitude column for distance or region encoding.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 16,016
- min: 17.72
- max: 61.02
- mean: 37.14
- median: 37.03
- std: 5.09
- q1: 33.19
- q3: 40.93
- iqr: 7.74
- skew: 0.03792
- kurtosis: -0.5825
- n_outliers: 70
- outlier_rate: 0.0009997
- zero_rate: 0
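
The report does not say which rule produces n_outliers. A common choice is Tukey's 1.5 × IQR fences, and the sketch below assumes that rule purely for illustration. With the listed q1 and q3 it gives fences of roughly 21.58 and 52.54 degrees.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as listed for slat; the 1.5 multiplier is an assumption.
low, high = tukey_fences(33.19, 40.93)


def is_outlier(value: float) -> bool:
    """True if the value falls outside the assumed fences."""
    return value < low or value > high
```

Under this assumption the flagged latitudes would be the far-southern and far-northern records, which are plausibly valid locations rather than errors, so dropping them blindly is not advisable.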
slon
numeric feature
Values are negative decimal degrees ranging from -163.53 to -64.7151 with a median of -93.5, consistent with western-hemisphere longitudes (the name 'slon' suggests starting longitude). The distribution is mildly left-skewed (-0.32), with the middle half of values spanning an IQR of ~11.7 degrees (-98.4 to -86.69) in the central US longitude band. Only 951 values (1.36%) are flagged as outliers, and there are no nulls or zeros. Treatment: Use as a geographic coordinate; pair with the matching latitude column for spatial features rather than treating as a standalone scalar.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 17,912
- min: -163.5
- max: -64.72
- mean: -92.74
- median: -93.5
- std: 8.677
- q1: -98.4
- q3: -86.69
- iqr: 11.71
- skew: -0.3229
- kurtosis: 2.156
- n_outliers: 951
- outlier_rate: 0.01358
- zero_rate: 0
elat
numeric feature null_rate
Almost certainly the ending latitude in decimal degrees (the counterpart of slat), with values spanning 17.72 to 61.02, consistent with locations across North America. The distribution is roughly symmetric (skew 0.03, kurtosis -0.41) and centered near 37.26, with only 78 outliers. The standout concern is that 37.65% of rows are null, so coverage is partial. Treatment: Pair with the matching longitude (elon) and impute or filter the 37.65% nulls before any spatial modelling.
- n: 70,022
- nulls: 26,363 (37.6%)
- unique: 16,965
- min: 17.72
- max: 61.02
- mean: 37.26
- median: 37.13
- std: 4.942
- q1: 33.49
- q3: 40.91
- iqr: 7.42
- skew: 0.03404
- kurtosis: -0.4085
- n_outliers: 78
- outlier_rate: 0.001787
- zero_rate: 0
elon
numeric feature null_rate
This is almost certainly the ending longitude (the counterpart of slon), with values bounded between -163.53 and -64.7151 and a median of -92.47, consistent with points across North America. The distribution is moderately left-skewed (-0.60) with kurtosis 3.77 and a tight IQR of 11.26 around the median. Notably, 37.65% of rows are null, which will materially shrink any analysis that needs end points. Treatment: Pair with the matching latitude column (elat) and filter out the ~37.65% null rows before spatial analysis.
- n: 70,022
- nulls: 26,363 (37.6%)
- unique: 18,586
- min: -163.5
- max: -64.72
- mean: -92.19
- median: -92.47
- std: 8.545
- q1: -97.73
- q3: -86.47
- iqr: 11.26
- skew: -0.5954
- kurtosis: 3.766
- n_outliers: 647
- outlier_rate: 0.01482
- zero_rate: 0
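
To illustrate pairing start and end coordinates while respecting the nulls, here is a sketch that computes a straight-line (great-circle) track length with the haversine formula and returns None when the end point is missing. The function names, the mean Earth radius of 6371 km, and the two sample rows are assumptions for illustration; real tornado paths need not be straight.

```python
import math
from typing import Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp guards against tiny floating-point overshoot above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def track_km(slat: float, slon: float,
             elat: Optional[float], elon: Optional[float]) -> Optional[float]:
    """Start-to-end distance, or None when either end coordinate is null."""
    if elat is None or elon is None:
        return None
    return haversine_km(slat, slon, elat, elon)


# Made-up rows: one with an end point, one with null end coordinates.
rows = [(35.0, -97.0, 35.0, -96.0), (36.0, -98.0, None, None)]
tracks = [track_km(*r) for r in rows]
```

Returning None instead of 0 keeps rows without end points distinguishable from genuinely short tracks.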
len
text feature one_word allcaps short_text duplicates
Despite being typed as text, `len` holds short numeric tokens (3 to 8 characters, one word each) such as '0.1', '0.5', and '1.0', almost certainly a length or size measurement stored as strings. Values are highly concentrated: '0.1' alone covers 15,456 of 70,022 rows, and the duplicate rate is 94.8% across only 3,663 unique tokens. The allcaps and one_word alerts are artefacts of numeric strings rather than real text signal. Treatment: Cast to float and treat as a numeric feature rather than text.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 3,663
- len_min: 3
- len_max: 8
- len_mean: 3.626
- len_median: 3
- len_p95: 6
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 66,359
- duplicate_rate: 0.9477
- vocab_size: 2,204
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
wid
categorical feature
`wid` is a categorical column with 419 distinct values, all numeric-looking strings such as '10', '50', '100', and '30', almost certainly a width measurement stored as text rather than a true category. The distribution is heavily concentrated on round numbers: '10' alone covers 20.6% of 70,022 rows and the top ten values are all multiples of 5, giving an entropy ratio of 0.51. There are no nulls, but the string encoding of clearly numeric quantities is the surprise. Treatment: Cast to numeric and consider binning or a log transform given the round-number concentration.
- n: 70,022
- nulls: 0 (0.0%)
- unique: 419
- top_value: 10
- top_rate: 0.2059
- cardinality: 419
- entropy: 4.463
- entropy_ratio: 0.5124
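
A sketch of the two suggested treatments, a log transform and binning, is shown below. The bin edges in `BIN_EDGES` are hypothetical and chosen only to line up with round values; the profile does not give units or recommend specific cut points. The sample includes a made-up '0' to show that `log1p` handles zero widths.

```python
import bisect
import math

# First four tokens are quoted in the profile; '0' is added for illustration.
raw = ["10", "50", "100", "30", "0"]
widths = [float(t) for t in raw]

# log1p(x) = log(1 + x) compresses the long right tail and is defined at 0.
log_widths = [math.log1p(w) for w in widths]

# Hypothetical left-closed bins: [0,10), [10,50), [50,100), [100,500), [500,inf).
BIN_EDGES = [0, 10, 50, 100, 500]


def width_bin(w: float) -> int:
    """Index of the bin whose left edge is the largest edge <= w."""
    return bisect.bisect_right(BIN_EDGES, w) - 1


bins = [width_bin(w) for w in widths]
```

Placing edges on the round values themselves keeps each spike (10, 50, 100) at the start of its own bin instead of splitting near-identical reports across two bins.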