quirky tornadoes

source /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json · 70,022 rows · 13 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a tornado event log with 70,022 rows and 13 columns covering dates, times, start/end coordinates, magnitudes, widths, fatalities, injuries, and U.S. state. Geographically it is a U.S.-centered dataset: starting longitudes average around -92.7 and latitudes around 37.1, with Texas (13.3% of records), Kansas, and Oklahoma leading the state counts. The severity fields are highly imbalanced (fatalities are 0 in 97.7% of events and injuries are 0 in 88.8%), so any analysis of harm should focus on the rare non-zero tail. Magnitude (mag) is a more usable categorical signal with 7 levels, dominated by 0 (46%) and 1 (34%). Note that the end-coordinate columns (elat, elon) are null in ~37.6% of rows (26,363 of 70,022), which matters if you plan to draw tornado tracks rather than just start points.

citing: row_count · column_count · columns.state.top_values · columns.state.top_rate · columns.fatalities.top_rate · columns.injuries.top_rate · columns.mag.top_values · columns.mag.top_rate · columns.elat.null_rate · columns.elon.null_rate · columns.slat.stats · columns.slon.stats

Charts the summary said to look at first

state · Top tornado states — Texas, Kansas, and Oklahoma dominate the count.

Top values for state (20 unique shown, of 53 total).
value	count	share
TX	9345	13.3%
KS	4474	6.4%
OK	4221	6.0%
FL	3620	5.2%
NE	3056	4.4%
IA	2887	4.1%
IL	2835	4.0%
MS	2657	3.8%
AL	2529	3.6%
MO	2462	3.5%
CO	2425	3.5%
LA	2305	3.3%
MN	2118	3.0%
AR	1981	2.8%
SD	1917	2.7%
GA	1898	2.7%
ND	1640	2.3%
IN	1610	2.3%
WI	1515	2.2%
NC	1472	2.1%

mag · Magnitude distribution skews heavily toward weak (0–1) tornadoes.

Top values for mag (7 unique shown, of 7 total).
value	count	share
0	32218	46.0%
1	23782	34.0%
2	9767	13.9%
3	2585	3.7%
-9	1024	1.5%
4	587	0.8%
5	59	0.1%

fatalities · Fatalities are 0 in roughly 98% of events; the deadly tail is tiny but important.

Top values for fatalities (20 unique shown, of 50 total).
value	count	share
0	68423	97.7%
1	830	1.2%
2	277	0.4%
3	134	0.2%
4	77	0.1%
5	46	0.1%
6	45	0.1%
7	32	0.0%
9	15	0.0%
10	15	0.0%
11	14	0.0%
8	13	0.0%
16	12	0.0%
13	8	0.0%
17	7	0.0%
18	6	0.0%
21	6	0.0%
12	6	0.0%
22	5	0.0%
25	4	0.0%

slat · Starting latitudes cluster around 37°N, consistent with the U.S. tornado belt.

Histogram bins for slat (median: 37.02505).
bin	count
17.72 – 18.8	33
18.8 – 19.89	6
19.89 – 20.97	8
20.97 – 22.05	26
22.05 – 23.13	1
23.13 – 24.22	0
24.22 – 25.3	76
25.3 – 26.38	487
26.38 – 27.46	748
27.46 – 28.55	1225
28.55 – 29.63	1468
29.63 – 30.71	3607
30.71 – 31.79	3441
31.79 – 32.88	5003
32.88 – 33.96	4775
33.96 – 35.04	5392
35.04 – 36.12	5164
36.12 – 37.21	4268
37.21 – 38.29	4028
38.29 – 39.37	4807
39.37 – 40.45	5498
40.45 – 41.54	5313
41.54 – 42.62	3995
42.62 – 43.7	3396
43.7 – 44.78	2407
44.78 – 45.87	1720
45.87 – 46.95	1257
46.95 – 48.03	1114
48.03 – 49.11	754
49.11 – 50.2	1
50.2 – 51.28	0
51.28 – 52.36	0
52.36 – 53.44	0
53.44 – 54.53	0
54.53 – 55.61	1
55.61 – 56.69	0
56.69 – 57.77	0
57.77 – 58.86	0
58.86 – 59.94	1
59.94 – 61.02	2

slon · Starting longitudes concentrate around -93°, reinforcing the central-U.S. footprint.

Histogram bins for slon (median: -93.5).
bin	count
-163.5 – -161.1	2
-161.1 – -158.6	9
-158.6 – -156.1	25
-156.1 – -153.6	8
-153.6 – -151.2	0
-151.2 – -148.7	0
-148.7 – -146.2	0
-146.2 – -143.8	1
-143.8 – -141.3	0
-141.3 – -138.8	0
-138.8 – -136.4	0
-136.4 – -133.9	0
-133.9 – -131.4	0
-131.4 – -128.9	0
-128.9 – -126.5	0
-126.5 – -124	15
-124 – -121.5	231
-121.5 – -119.1	241
-119.1 – -116.6	282
-116.6 – -114.1	174
-114.1 – -111.7	399
-111.7 – -109.2	323
-109.2 – -106.7	325
-106.7 – -104.2	1782
-104.2 – -101.8	4689
-101.8 – -99.3	6094
-99.3 – -96.83	9665
-96.83 – -94.36	8344
-94.36 – -91.89	6626
-91.89 – -89.42	6618
-89.42 – -86.95	6086
-86.95 – -84.48	5361
-84.48 – -82.01	4363
-82.01 – -79.54	3931
-79.54 – -77.07	1885
-77.07 – -74.6	1618
-74.6 – -72.13	541
-72.13 – -69.66	285
-69.66 – -67.19	71
-67.19 – -64.72	28

Schema

13 columns

Per-column summary.
column	type	nulls	unique	alerts
date	text	0.0%	12,639	one_word allcaps short_text duplicates
time	text	0.0%	1,438	one_word allcaps short_text duplicates
state	categorical	0.0%	53
mag	categorical	0.0%	7
injuries	categorical	0.0%	209
fatalities	categorical	0.0%	50	imbalance
loss	text	0.0%	1,019	one_word allcaps short_text duplicates
slat	numeric	0.0%	16,016
slon	numeric	0.0%	17,912
elat	numeric	37.6%	16,965	null_rate
elon	numeric	37.6%	18,586	null_rate
len	text	0.0%	3,663	one_word allcaps short_text duplicates
wid	categorical	0.0%	419

date

text timestamp one_word allcaps short_text duplicates

This is a date column stored as ISO-formatted text (YYYY-MM-DD); every value is exactly 10 characters long and a single token. Across 70,022 rows there are only 12,639 unique dates (duplicate rate 81.95%), so many records share a date; the top value, 2011-04-27, appears 207 times. Values span at least 1974-04-03 to 2023-03-31. The uniform format suggests the column was misclassified as text and should be a date type. Treatment: Cast to a proper date type and use for temporal joins or time-based features. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 12,639
len_min: 10
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 57,383
duplicate_rate: 0.8195
vocab_size: 7,831
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
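
The cast recommended for `date` can be sketched with the standard library alone. This is a minimal illustration, not the profiler's own code: `raw_dates` is a hypothetical sample built from dates quoted in this report, and `date.fromisoformat` is used because every value is reported to be a 10-character YYYY-MM-DD string.

```python
from datetime import date

# Hypothetical sample; the three distinct dates are ones quoted in this report.
raw_dates = ["1974-04-03", "2011-04-27", "2011-04-27", "2023-03-31"]

# fromisoformat raises ValueError on malformed input, so bad rows surface early.
parsed = [date.fromisoformat(s) for s in raw_dates]

earliest, latest = min(parsed), max(parsed)
span_years = latest.year - earliest.year  # coarse span, ignoring month and day
```

Once the values are real dates, ordering, range checks, and grouping by year or month no longer depend on string comparison.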

time

text timestamp one_word allcaps short_text duplicates

This column holds clock times stored as 8-character HH:MM:SS strings, with all 70,022 rows non-null and uniform length (len_min=len_max=8). Only 1,438 distinct values appear and 97.95% are duplicates, with afternoon/evening slots like 16:00:00 (978), 17:00:00 (971) and 15:00:00 (959) dominating — consistent with event start times rather than free text. It's mistyped as text: parse to a proper time type before use. Treatment: Cast to a time type and bucket by hour for modelling. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 1,438
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 68,584
duplicate_rate: 0.9795
vocab_size: 1,352
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
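
A small sketch of the suggested treatment for `time`: parse the HH:MM:SS strings and bucket them by hour. The sample list `raw_times` is hypothetical (three of its values are the top slots quoted above); only the standard library is assumed.

```python
from collections import Counter
from datetime import time

# Hypothetical sample of HH:MM:SS strings.
raw_times = ["16:00:00", "17:00:00", "15:00:00", "16:30:00", "03:15:00"]

# time.fromisoformat accepts the 8-character HH:MM:SS form used in this column.
parsed = [time.fromisoformat(s) for s in raw_times]

# Bucket by hour of day (0-23) to get a low-cardinality feature.
by_hour = Counter(t.hour for t in parsed)
```

Hour buckets reduce the 1,438 distinct strings to at most 24 levels, which is usually easier to model than the raw values.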

state

categorical feature

This is a US state code column with 53 distinct values, slightly more than the 50 states (likely including DC and territories like PR). TX dominates at 13.3% of 70,022 rows, followed by KS (4,474) and OK (4,221), giving a clear southern/plains skew rather than a uniform national distribution. Entropy ratio of 0.847 confirms the distribution is fairly spread but not flat, and there are no nulls. Treatment: One-hot or target-encode; consider grouping low-frequency states into an 'Other' bucket. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 53
top_value: TX
top_rate: 0.1335
cardinality: 53
entropy: 4.851
entropy_ratio: 0.8468
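
The "Other" bucket suggested for `state` could look like the sketch below. The counts are copied from the top of the state table earlier in this report; the 5% cutoff `MIN_SHARE` and the helper `bucket_state` are hypothetical choices for illustration, not something the report prescribes.

```python
# Counts copied from the state table in this report (top six only, for brevity).
state_counts = {"TX": 9345, "KS": 4474, "OK": 4221, "FL": 3620, "NE": 3056, "IA": 2887}
TOTAL_ROWS = 70022
MIN_SHARE = 0.05  # hypothetical threshold; tune for the task at hand


def bucket_state(code, counts=state_counts, total=TOTAL_ROWS, min_share=MIN_SHARE):
    """Keep a state code if its share of rows meets the threshold, else 'Other'."""
    share = counts.get(code, 0) / total
    return code if share >= min_share else "Other"


bucketed = [bucket_state(c) for c in ["TX", "FL", "NE", "PR"]]
```

With this cutoff only TX, KS, OK, and FL keep their own level; codes missing from the lookup fall into "Other" as well.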

mag

categorical feature

`mag` is a low-cardinality categorical with 7 distinct values, dominated by '0' (46% of 70,022 rows), with counts decreasing through '1', '2', '3', '4', and '5'. The ordered integer levels suggest a magnitude or severity code rather than a free category. The standout signal is '-9' (1,024 rows), almost certainly a sentinel for missing/unknown that is not being counted as null. Treatment: Recode '-9' as missing, then treat as an ordinal feature. high · anthropic:claude-opus-4-7
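
A minimal sketch of the recode, assuming '-9' really is a missing-value sentinel (the report infers this but does not confirm it). The helper name `clean_mag` and the sample values are hypothetical.

```python
MISSING_SENTINEL = -9  # assumption: -9 means "unknown magnitude"


def clean_mag(value):
    """Return the magnitude as an int, or None when it is the sentinel."""
    mag = int(value)
    return None if mag == MISSING_SENTINEL else mag


raw = ["0", "1", "-9", "3", "5"]           # hypothetical sample
cleaned = [clean_mag(v) for v in raw]      # None marks missing
known = [m for m in cleaned if m is not None]
```

Leaving -9 in place would make it sort below 0 and distort any ordinal or numeric use of the column.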

n: 70,022
nulls: 0 (0.0%)
unique: 7
top_value: 0
top_rate: 0.4601
cardinality: 7
entropy: 1.772
entropy_ratio: 0.6312
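
The `entropy` and `entropy_ratio` figures can be checked against the `mag` counts in the table earlier in this report. The sketch assumes base-2 Shannon entropy, normalised by log2 of the cardinality; the report does not state its formula, so treat that as an assumption that happens to match the displayed values.

```python
import math

# Counts copied from the mag table in this report.
mag_counts = {"0": 32218, "1": 23782, "2": 9767, "3": 2585, "-9": 1024, "4": 587, "5": 59}


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


entropy = shannon_entropy_bits(mag_counts.values())
# Ratio of observed entropy to the maximum possible for this many levels.
entropy_ratio = entropy / math.log2(len(mag_counts))
```

A ratio of 1 would mean all levels are equally common; values near 0 mean one level dominates, as with `fatalities`.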

injuries

categorical numeric_target

Counts of injuries stored as strings, with 209 distinct values across 70,022 rows and no nulls. The distribution is severely zero-inflated: '0' accounts for 88.8% of records, and the entropy ratio is just 0.123. Counts for values '1' through '10' decay quickly, but 209 unique tokens suggest a very long tail or non-integer entries worth inspecting. Treatment: Cast to integer and model as a zero-inflated count (e.g., hurdle or ZIP regression). high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 209
top_value: 0
top_rate: 0.888
cardinality: 209
entropy: 0.9454
entropy_ratio: 0.1227

fatalities

categorical numeric_target imbalance

This is a fatality count per event, stored as a categorical/string field with 50 distinct integer values and no nulls across 70,022 rows. The distribution is extremely imbalanced: '0' accounts for 97.72% of records, giving an entropy ratio of just 0.039, with '1' at 830 rows and a long thin tail (2 → 277, 3 → 134, down to single-digit counts at higher values). Despite being typed as categorical, the values are numeric and ordered, so the current encoding is likely a load artifact. Treatment: Cast to integer and model as a zero-inflated count rather than a categorical. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 50
top_value: 0
top_rate: 0.9772
cardinality: 50
entropy: 0.2174
entropy_ratio: 0.03852
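
Before fitting a hurdle or zero-inflated model to `injuries` or `fatalities`, it can help to look at the two parts such a model separates: whether the count is zero, and how large it is when it is not. The sketch below only computes those two summaries on a hypothetical sample; it does not fit any model.

```python
# Hypothetical sample of string-encoded counts, as the column is currently stored.
raw_fatalities = ["0", "0", "1", "0", "3", "0", "0", "25"]

# int() raises ValueError on non-integer tokens, which flags rows to inspect.
counts = [int(v) for v in raw_fatalities]

# Part 1: how often is the count exactly zero?
zero_rate = sum(1 for c in counts if c == 0) / len(counts)

# Part 2: the distribution of the positive tail only.
positive = [c for c in counts if c > 0]
mean_given_positive = sum(positive) / len(positive)
```

Reporting the tail separately keeps the rare harmful events from being averaged away by the zeros.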

loss

text feature one_word allcaps short_text duplicates

Numeric loss values stored as text strings, with all 70,022 entries being single tokens (one_word_rate 1.0) and lengths of 1-10 characters. The distribution is heavily concentrated: '0.0' alone accounts for 22,764 rows and the duplicate_rate is 0.985 across only 1,019 unique values. Mixed formatting is a hazard — '0' and '0.0' appear as separate tokens, so a naive cast will collapse them but string-based grouping won't. Treatment: Cast to float (normalising '0' vs '0.0') before any modelling or aggregation. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 1,019
len_min: 1
len_max: 10
len_mean: 3.181
len_median: 3
len_p95: 5
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,003
duplicate_rate: 0.9854
vocab_size: 503
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.9251
boilerplate_rate: 0
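
The '0' versus '0.0' hazard in `loss` is easy to demonstrate: grouping on the raw strings keeps them apart, while grouping after a float cast merges them. The sample values are hypothetical.

```python
from collections import Counter

# Hypothetical sample mixing integer-style and float-style tokens.
raw_loss = ["0", "0.0", "0.0", "5.0", "5", "250000.0"]

# String grouping treats "0" and "0.0" as different categories.
by_string = Counter(raw_loss)

# Casting first collapses tokens that denote the same number.
by_value = Counter(float(v) for v in raw_loss)
```

Any count or share computed on the string form will understate the true frequency of a value that appears in more than one spelling.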

slat

numeric feature

Values range from 17.72 to 61.02 with a mean of 37.14 and median of 37.03, consistent with a starting latitude (slat) field in decimal degrees. Most values fall within the contiguous US, but the extremes lie outside it (below about 24°N and above about 49°N), so a small number of events are probably in territories such as PR or in Hawaii and Alaska. The distribution is essentially symmetric (skew 0.04, kurtosis -0.58) with an IQR of 7.74 and only 70 outliers (0.10%). There are no nulls and no zeros, and 16,016 unique values across 70,022 rows indicate that many coordinates repeat. Treatment: Use as-is for geospatial features; optionally pair with a longitude column for distance or region encoding. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 16,016
min: 17.72
max: 61.02
mean: 37.14
median: 37.03
std: 5.09
q1: 33.19
q3: 40.93
iqr: 7.74
skew: 0.03792
kurtosis: -0.5825
n_outliers: 70
outlier_rate: 0.0009997
zero_rate: 0

slon

numeric feature

Values are negative decimal degrees ranging from -163.53 to -64.7151 with a median of -93.5, consistent with western-hemisphere longitudes (the 'slon' name suggests starting longitude). The distribution is mildly left-skewed (-0.32), and the middle half of the values spans an IQR of ~11.7 degrees (-98.4 to -86.69) in the central US longitude band. 951 values (1.36%) are flagged as outliers, and there are no nulls or zeros. Treatment: Use as a geographic coordinate; pair with the matching latitude column for spatial features rather than treating as a standalone scalar. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 17,912
min: -163.5
max: -64.72
mean: -92.74
median: -93.5
std: 8.677
q1: -98.4
q3: -86.69
iqr: 11.71
skew: -0.3229
kurtosis: 2.156
n_outliers: 951
outlier_rate: 0.01358
zero_rate: 0
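
The report does not say how `n_outliers` is defined. A common convention is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles; the sketch below applies that rule to the `slon` quartiles listed above, purely as an assumption about what the profiler may be doing.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under Tukey's k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x, low, high):
    return x < low or x > high


# Quartiles taken from the slon statistics in this report.
slon_low, slon_high = tukey_fences(-98.4, -86.69)

flags = [is_outlier(x, slon_low, slon_high) for x in (-163.5, -93.5, -64.72)]
```

Under this assumption the fences sit near -116 and -69, so the flagged points would be mostly far-western starts plus a few far-eastern ones.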

elat

numeric feature null_rate

Almost certainly the ending latitude of each event in decimal degrees (the counterpart to slat), with values spanning 17.72 to 61.02, consistent with locations across North America. The distribution is roughly symmetric (skew 0.03, kurtosis -0.41) and centered near 37.26, with only 78 outliers among the non-null values. The standout concern is that 37.65% of rows are null, so coverage is partial. Treatment: Pair with the matching longitude and impute or filter the 37.65% nulls before any spatial modelling. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 26,363 (37.6%)
unique: 16,965
min: 17.72
max: 61.02
mean: 37.26
median: 37.13
std: 4.942
q1: 33.49
q3: 40.91
iqr: 7.42
skew: 0.03404
kurtosis: -0.4085
n_outliers: 78
outlier_rate: 0.001787
zero_rate: 0

elon

numeric feature null_rate

This is almost certainly the ending longitude (the counterpart to slon, not an "east" coordinate; every value is negative, i.e. in the western hemisphere), with values bounded between -163.53 and -64.7151 and a median of -92.47, consistent with points across North America. The distribution is moderately left-skewed (-0.60) with kurtosis 3.77 and an IQR of 11.26. Notably, 37.65% of rows are null, which will materially shrink any geo-based analysis. Treatment: Pair with the matching latitude column and filter out the ~38% null rows before spatial analysis. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 26,363 (37.6%)
unique: 18,586
min: -163.5
max: -64.72
mean: -92.19
median: -92.47
std: 8.545
q1: -97.73
q3: -86.47
iqr: 11.26
skew: -0.5954
kurtosis: 3.766
n_outliers: 647
outlier_rate: 0.01482
zero_rate: 0
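
Filtering on the end coordinates before drawing tracks might look like the sketch below. The rows are hypothetical; nulls are represented as `None`, which is what Python's `json` module produces for JSON `null`.

```python
# Hypothetical rows with the four coordinate fields used in this dataset.
rows = [
    {"slat": 35.2, "slon": -97.4, "elat": 35.3, "elon": -97.2},
    {"slat": 32.8, "slon": -96.8, "elat": None, "elon": None},
    {"slat": 39.1, "slon": -94.6, "elat": 39.2, "elon": None},
]


def has_track(row):
    """A track needs both end coordinates; one alone is not enough."""
    return row["elat"] is not None and row["elon"] is not None


# (start, end) coordinate pairs for rows that can be drawn as line segments.
tracks = [((r["slat"], r["slon"]), (r["elat"], r["elon"])) for r in rows if has_track(r)]

# Rows that can still be plotted as start points only.
points_only = [r for r in rows if not has_track(r)]
```

Keeping the point-only rows in a separate list avoids silently dropping roughly a third of the events from start-point maps.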

len

text feature one_word allcaps short_text duplicates

Despite being typed as text, `len` holds short numeric tokens (length 3-8, one word each) like '0.1', '0.5', '1.0' — almost certainly a length or size measurement stored as strings. Values are highly concentrated: '0.1' alone covers 15,456 of 70,022 rows and the duplicate rate is 94.8% across only 3,663 unique tokens. The allcaps and one_word alerts are artefacts of numeric strings rather than real text signal. Treatment: Cast to float and treat as a numeric feature rather than text. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 3,663
len_min: 3
len_max: 8
len_mean: 3.626
len_median: 3
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 66,359
duplicate_rate: 0.9477
vocab_size: 2,204
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

wid

categorical feature

wid is a categorical column with 419 distinct values, all numeric-looking strings like '10', '50', '100', '30'. It is almost certainly a width measurement stored as text rather than a true category. The distribution is heavily concentrated on round numbers: '10' alone covers 20.6% of 70,022 rows and the top ten values are all multiples of 5, giving an entropy ratio of 0.51. There are no nulls; the surprise is the string encoding of clearly numeric values. Treatment: cast to numeric and consider binning or log-transform given the round-number concentration. high · anthropic:claude-opus-4-7

n: 70,022
nulls: 0 (0.0%)
unique: 419
top_value: 10
top_rate: 0.2059
cardinality: 419
entropy: 4.463
entropy_ratio: 0.5124
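
A sketch of the cast, log-transform, and binning suggested for `wid`. The sample values, the bin edges, and the labels in `width_bin` are all hypothetical, and the report does not state the unit of measurement.

```python
import math

# Hypothetical sample of string-encoded widths.
raw_wid = ["10", "50", "100", "30", "880"]

widths = [float(v) for v in raw_wid]

# log1p compresses the long right tail and is defined at zero.
log_widths = [math.log1p(w) for w in widths]


def width_bin(w):
    """Coarse bins with hypothetical edges; adjust to the real distribution."""
    if w < 50:
        return "narrow"
    if w < 200:
        return "medium"
    return "wide"


bins = [width_bin(w) for w in widths]
```

Binning absorbs the heaping on round numbers, while the log transform keeps the feature continuous; which is preferable depends on the downstream model.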