saturn·

quirky tornadoes

source /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json · 70,022 rows · 13 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a tornado event log with 70,022 rows and 13 columns covering dates, times, start and end coordinates, magnitude, loss, length, width, fatalities, injuries, and U.S. state. Geographically it is U.S.-centered: starting longitudes average around -92.7 and latitudes around 37.1, with Texas (13.3% of records), Kansas, and Oklahoma leading the state counts. The severity fields are highly imbalanced: fatalities are 0 in 97.7% of events and injuries are 0 in 88.8%, so any analysis of harm should focus on the rare non-zero tail. Magnitude (mag) is a more usable categorical signal with 7 levels, dominated by 0 (46%) and 1 (34%); one of the 7 levels is -9, which looks like a missing-value sentinel rather than a real magnitude. Note that the end-coordinate columns (elat, elon) are null in about 37.6% of rows (26,363 of 70,022), which matters if you plan to draw tornado tracks rather than just start points.

citing: row_count · column_count · columns.state.top_values · columns.state.top_rate · columns.fatalities.top_rate · columns.injuries.top_rate · columns.mag.top_values · columns.mag.top_rate · columns.elat.null_rate · columns.elon.null_rate · columns.slat.stats · columns.slon.stats

Schema

13 columns
Per-column summary.

column      type         nulls   unique  alerts
date        text         0.0%    12,639  one_word allcaps short_text duplicates
time        text         0.0%    1,438   one_word allcaps short_text duplicates
state       categorical  0.0%    53
mag         categorical  0.0%    7
injuries    categorical  0.0%    209
fatalities  categorical  0.0%    50      imbalance
loss        text         0.0%    1,019   one_word allcaps short_text duplicates
slat        numeric      0.0%    16,016
slon        numeric      0.0%    17,912
elat        numeric      37.6%   16,965  null_rate
elon        numeric      37.6%   18,586  null_rate
len         text         0.0%    3,663   one_word allcaps short_text duplicates
wid         categorical  0.0%    419

date

text timestamp one_word allcaps short_text duplicates
This is a date column stored as ISO-formatted text (YYYY-MM-DD): every value is exactly 10 characters long and a single token. Across 70,022 rows there are only 12,639 unique dates (an 81.9% duplicate rate), so many records share a date; the top value, 2011-04-27, appears 207 times. Observed values span at least 1974-04-03 to 2023-03-31. The uniform format indicates the column was misclassified as text rather than a date type. Treatment: Cast to a proper date type and use for temporal joins or time-based features.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 12,639
len_min: 10
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 57,383
duplicate_rate: 0.8195
vocab_size: 7,831
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
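The cast recommended for date can be sketched with the standard library alone. This is a minimal illustration, not the profiler's own code: the helper name `parse_iso_date` and the sample list are made up for the example, and only the three date strings quoted in the description come from the report.

```python
from collections import Counter
from datetime import date

def parse_iso_date(value):
    """Return a datetime.date for a 'YYYY-MM-DD' string, or None if it does not parse."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None

# Illustrative sample: the three dates quoted in the profile, one of them repeated.
sample = ["1974-04-03", "2011-04-27", "2023-03-31", "2011-04-27"]
parsed = [parse_iso_date(v) for v in sample]

# Once the values are real dates, range and per-year counts are one-liners.
span_days = (max(parsed) - min(parsed)).days
per_year = Counter(d.year for d in parsed)
```

Returning None instead of raising keeps a single malformed value from aborting the whole cast; counting the None results afterwards shows how many rows failed to parse.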

time

text timestamp one_word allcaps short_text duplicates
This column holds clock times stored as 8-character HH:MM:SS strings, with all 70,022 rows non-null and uniform length (len_min = len_max = 8). Only 1,438 distinct values appear, so 97.95% of rows duplicate another value. The most frequent values are round afternoon hours, 16:00:00 (978 rows), 17:00:00 (971) and 15:00:00 (959), each about 1.4% of rows, which is consistent with event start times rather than free text. The column is mistyped as text: parse it to a proper time type before use. Treatment: Cast to a time type and bucket by hour for modelling.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 1,438
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 68,584
duplicate_rate: 0.9795
vocab_size: 1,352
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
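Bucketing time by hour, as the treatment suggests, might look like the sketch below. The function name `hour_of` and the sample values are illustrative assumptions; only the HH:MM:SS format and the three top values are taken from the report.

```python
from collections import Counter
from datetime import time

def hour_of(value):
    """Parse an 'HH:MM:SS' string and return its hour of day (0-23)."""
    return time.fromisoformat(value).hour

# Illustrative sample: the three top values from the profile plus two made-up times.
sample = ["16:00:00", "17:00:00", "15:00:00", "16:45:00", "03:12:00"]

# Count events per hour bucket.
by_hour = Counter(hour_of(v) for v in sample)
```

An hour bucket gives 24 levels instead of 1,438 distinct strings, which is usually a more manageable feature for modelling.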

state

categorical feature
This is a U.S. state code column with 53 distinct values, slightly more than the 50 states (likely including DC and territories such as PR). TX leads with 13.3% of the 70,022 rows, followed by KS (4,474) and OK (4,221), giving a clear southern/plains skew rather than a uniform national distribution. An entropy ratio of 0.847 indicates the distribution is fairly spread but not flat, and there are no nulls. Treatment: One-hot or target-encode; consider grouping low-frequency states into an 'Other' bucket.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 53
top_value: TX
top_rate: 0.1335
cardinality: 53
entropy: 4.851
entropy_ratio: 0.8468
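The report does not define entropy_ratio, but the figures above are consistent with Shannon entropy in bits divided by its maximum, log2 of the cardinality. The sketch below works under that assumption; the function name `entropy_ratio` is chosen for the example and is not taken from the profiler.

```python
import math
from collections import Counter

def entropy_ratio(values):
    """Shannon entropy (base 2) of the value distribution, divided by log2(cardinality)."""
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # a constant column carries no information
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(k)

# Cross-check against the reported state figures: entropy 4.851, cardinality 53.
state_ratio_from_report = 4.851 / math.log2(53)
```

Under this reading, a ratio of 1.0 means every category is equally frequent and a ratio near 0 means one category dominates, which matches how the report uses the number for state, mag, injuries, fatalities and wid.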

mag

categorical feature
`mag` is a low-cardinality categorical with 7 distinct values, led by '0' (46% of 70,022 rows) with counts decreasing through '1', '2', '3', '4' and '5'. The ordered integer levels suggest a magnitude or severity code rather than a free category. The standout signal is the seventh value, '-9' (1,024 rows, about 1.5%): almost certainly a sentinel for missing or unknown that is not being counted as null. Treatment: Recode '-9' as missing, then treat as an ordinal feature.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 7
top_value: 0
top_rate: 0.4601
cardinality: 7
entropy: 1.772
entropy_ratio: 0.6312
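Recoding the sentinel before treating mag as ordinal could be done as in this sketch. Reading '-9' as "missing" is the report's inference, not a documented fact, and the names `MISSING_SENTINEL` and `recode_mag` are invented for the example.

```python
# Assumption from the profile: '-9' marks an unknown magnitude, not a real level.
MISSING_SENTINEL = "-9"

def recode_mag(value):
    """Map the sentinel to None and every other level to an int (0-5)."""
    token = str(value).strip()
    if token == MISSING_SENTINEL:
        return None
    return int(token)

# Illustrative sample covering ordinary levels and the sentinel.
sample = ["0", "1", "-9", "3", "5"]
recoded = [recode_mag(v) for v in sample]

# Share of rows that become missing after the recode.
missing_rate = recoded.count(None) / len(recoded)
```

Leaving the sentinel in place would put -9 at the bottom of the ordinal scale, below magnitude 0, and would pull any mean or rank statistic downward.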

injuries

categorical numeric_target
Counts of injuries stored as strings, with 209 distinct values across 70,022 rows and no nulls. The distribution is severely zero-inflated: '0' accounts for 88.8% of records, and the entropy ratio is just 0.123. Counts for '1' through '10' decay quickly, but 209 unique tokens suggest either a very long tail or non-integer entries, which is worth inspecting before casting. Treatment: Cast to integer and model as a zero-inflated count (e.g., hurdle or zero-inflated Poisson (ZIP) regression).
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 209
top_value: 0
top_rate: 0.888
cardinality: 209
entropy: 0.9454
entropy_ratio: 0.1227
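Before casting injuries to integer, it may help to check whether any of the 209 tokens are not whole numbers. The sketch below separates the two groups; `split_count_tokens` and the sample tokens are hypothetical and do not reflect actual values in the file.

```python
def split_count_tokens(tokens):
    """Return (counts, rejected): tokens that parse as non-negative ints, and the rest."""
    counts, rejected = [], []
    for token in tokens:
        try:
            n = int(str(token).strip())
        except ValueError:
            rejected.append(token)
            continue
        if n < 0:
            rejected.append(token)  # negative counts are treated as invalid
        else:
            counts.append(n)
    return counts, rejected

# Made-up tokens, chosen only to exercise each branch.
sample = ["0", "1", "10", "1.5", "", "-9", "250"]
counts, rejected = split_count_tokens(sample)
```

If `rejected` comes back empty on the real column, the 209 distinct values are simply a long tail of large counts and a plain integer cast is safe.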

fatalities

categorical numeric_target imbalance
This is a fatality count per event, stored as a categorical/string field with 50 distinct integer values and no nulls across 70,022 rows. The distribution is extremely imbalanced: '0' accounts for 97.72% of records, giving an entropy ratio of just 0.039, with '1' at 830 rows and a long thin tail ('2' at 277 rows, '3' at 134, down to single-digit row counts at higher values). Despite being typed as categorical, the values are numeric and ordered, so the current encoding is likely a load artifact. Treatment: Cast to integer and model as a zero-inflated count rather than a categorical.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 50
top_value: 0
top_rate: 0.9772
cardinality: 50
entropy: 0.2174
entropy_ratio: 0.03852
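A zero-inflated treatment splits the question in two: whether an event had any fatalities at all, and how many it had given at least one. The sketch below computes only the descriptive version of those two parts, as a first look before fitting any hurdle or zero-inflated model; `hurdle_summary` and the sample tokens are illustrative assumptions.

```python
def hurdle_summary(counts):
    """Describe a count variable as P(count > 0) and the mean of the positive part."""
    if not counts:
        raise ValueError("counts must not be empty")
    positive = [c for c in counts if c > 0]
    return {
        "p_nonzero": len(positive) / len(counts),
        "mean_if_nonzero": sum(positive) / len(positive) if positive else None,
    }

# Made-up tokens with the same flavour as the column: mostly zeros, a thin tail.
sample_tokens = ["0", "0", "0", "0", "0", "0", "1", "0", "3", "0"]
sample_counts = [int(t) for t in sample_tokens]
summary = hurdle_summary(sample_counts)
```

With 97.72% zeros in the real column, the overall mean is dominated by the zero mass; reporting the two parts separately keeps the rare fatal events visible.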

loss

text feature one_word allcaps short_text duplicates
Numeric loss values stored as text strings, with all 70,022 entries being single tokens (one_word_rate 1.0) and lengths of 1 to 10 characters. The distribution is heavily concentrated: '0.0' alone accounts for 22,764 rows (32.5%) and the duplicate_rate is 0.985 across only 1,019 unique values. Mixed formatting is a hazard: '0' and '0.0' appear as separate tokens, so string-based grouping counts them separately while a numeric cast merges them. Treatment: Cast to float (normalising '0' vs '0.0') before any modelling or aggregation.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 1,019
len_min: 1
len_max: 10
len_mean: 3.181
len_median: 3
len_p95: 5
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,003
duplicate_rate: 0.9854
vocab_size: 503
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.9251
boilerplate_rate: 0
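The '0' versus '0.0' hazard is easy to demonstrate: grouping the raw strings and grouping the cast values give different counts. In this sketch `to_float` is a hypothetical helper, and every sample token other than '0' and '0.0' is made up for illustration.

```python
import math
from collections import Counter

def to_float(token):
    """Cast a token to a finite float, or return None if it does not parse."""
    try:
        x = float(str(token).strip())
    except ValueError:
        return None
    return x if math.isfinite(x) else None

# '0' and '0.0' come from the profile; the other tokens are illustrative.
sample = ["0", "0.0", "2.5", "0.03", "unknown"]

as_text = Counter(sample)                           # '0' and '0.0' stay separate
as_number = Counter(to_float(t) for t in sample)    # both collapse onto 0.0
```

Counting the None results on the real column also shows whether any tokens fail to parse, which is worth checking given that allcaps_rate is 0.9251 here rather than the 1 reported for the other numeric-as-text columns.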

slat

numeric feature
Values range from 17.72 to 61.02 with a mean of 37.14 and median of 37.03, consistent with a starting latitude (slat) field in decimal degrees. The middle half of the values (33.19 to 40.93) lies within the contiguous US, but the extremes reach beyond it, to latitudes consistent with the Caribbean territories in the south and Alaska in the north. The distribution is essentially symmetric (skew 0.04, kurtosis -0.58) with an IQR of 7.74 and only 70 outliers (0.10%). There are no nulls and no zeros, and 16,016 unique values across 70,022 rows suggest repeated coordinates rather than free-form noise. Treatment: Use as-is for geospatial features; optionally pair with a longitude column for distance or region encoding.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 16,016
min: 17.72
max: 61.02
mean: 37.14
median: 37.03
std: 5.09
q1: 33.19
q3: 40.93
iqr: 7.74
skew: 0.03792
kurtosis: -0.5825
n_outliers: 70
outlier_rate: 0.0009997
zero_rate: 0

slon

numeric feature
Values are negative decimal degrees ranging from -163.53 to -64.7151 with a median of -93.5, consistent with western-hemisphere longitudes (the name slon suggests starting longitude). The distribution is mildly left-skewed (-0.32), with the middle half of the values between -98.4 and -86.69 (an IQR of about 11.7 degrees), the central US longitude band. Only 951 values (1.36%) are flagged as outliers, and there are no nulls or zeros. Treatment: Use as a geographic coordinate; pair with the matching latitude column for spatial features rather than treating as a standalone scalar.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 17,912
min: -163.5
max: -64.72
mean: -92.74
median: -93.5
std: 8.677
q1: -98.4
q3: -86.69
iqr: 11.71
skew: -0.3229
kurtosis: 2.156
n_outliers: 951
outlier_rate: 0.01358
zero_rate: 0
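The report does not say how n_outliers is computed for slat and slon. If the profiler uses the common Tukey rule (values more than 1.5 × IQR beyond the quartiles), which is an assumption, the cut-offs follow directly from the quartiles listed above. The function name `tukey_fences` is chosen for this example.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences: k * IQR below q1 and above q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as reported in the slat and slon panels.
slat_low, slat_high = tukey_fences(33.19, 40.93)
slon_low, slon_high = tukey_fences(-98.4, -86.69)

def is_outlier(value, low, high):
    """True when a value falls outside the fences."""
    return value < low or value > high
```

Under that assumption the slat fences sit at roughly 21.58 and 52.54 degrees and the slon fences at roughly -115.97 and -69.13 degrees, so the reported minimum and maximum of both columns fall outside them. Such points are geographically plausible locations, not necessarily errors, so they should be inspected rather than dropped automatically.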

elat

numeric feature null_rate
Almost certainly the ending latitude of each event in decimal degrees (the counterpart of slat), with values spanning 17.72 to 61.02, the same range as slat. The distribution is roughly symmetric (skew 0.03, kurtosis -0.41) and centered near 37.26, with only 78 outliers (0.18% of the 43,659 non-null values). The standout concern is that 37.65% of rows (26,363) are null, so coverage is partial. Treatment: Pair with the matching longitude and impute or filter the 37.65% nulls before any spatial modelling.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 26,363 (37.6%)
unique: 16,965
min: 17.72
max: 61.02
mean: 37.26
median: 37.13
std: 4.942
q1: 33.49
q3: 40.91
iqr: 7.42
skew: 0.03404
kurtosis: -0.4085
n_outliers: 78
outlier_rate: 0.001787
zero_rate: 0

elon

numeric feature null_rate
This is almost certainly the ending longitude (the counterpart of slon; the "e" prefix reads as "end", not "east", since every value is negative and therefore west of the prime meridian), with values bounded between -163.53 and -64.7151 and a median of -92.47, consistent with points across North America. The distribution is moderately left-skewed (-0.60) with kurtosis 3.77 and an IQR of 11.26. Notably, 37.65% of rows are null, which will materially shrink any analysis that needs end points. Treatment: Pair with the matching latitude column and filter out the ~38% null rows before spatial analysis.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 26,363 (37.6%)
unique: 18,586
min: -163.5
max: -64.72
mean: -92.19
median: -92.47
std: 8.545
q1: -97.73
q3: -86.47
iqr: 11.26
skew: -0.5954
kurtosis: 3.766
n_outliers: 647
outlier_rate: 0.01482
zero_rate: 0
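For track-based work, rows without an end point have to be set aside before start and end coordinates can be paired. The sketch below assumes each record is a dict keyed by the column names in this report, with JSON nulls loaded as None; the helper names and both sample rows are invented for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def track_km(row):
    """Straight-line start-to-end distance, or None when the end point is missing."""
    if row.get("elat") is None or row.get("elon") is None:
        return None
    return haversine_km(row["slat"], row["slon"], row["elat"], row["elon"])

# Two made-up records: one with a complete track, one with a null end point.
rows = [
    {"slat": 35.0, "slon": -97.0, "elat": 35.2, "elon": -96.8},
    {"slat": 33.1, "slon": -98.4, "elat": None, "elon": None},
]
tracks = [track_km(r) for r in rows]
n_with_track = sum(t is not None for t in tracks)
```

Filtering in this way keeps the start points of all events available for point maps while restricting track drawing to the rows that have both ends.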

len

text feature one_word allcaps short_text duplicates
Despite being typed as text, `len` holds short numeric tokens (3 to 8 characters, one word each) such as '0.1', '0.5' and '1.0', almost certainly a length measurement stored as strings. Values are highly concentrated: '0.1' alone covers 15,456 of 70,022 rows (22.1%) and the duplicate rate is 94.8% across only 3,663 unique tokens. The allcaps and one_word alerts are artefacts of numeric strings rather than real text signal. Treatment: Cast to float and treat as a numeric feature rather than text.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 3,663
len_min: 3
len_max: 8
len_mean: 3.626
len_median: 3
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 66,359
duplicate_rate: 0.9477
vocab_size: 2,204
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

wid

categorical feature
wid is a categorical column with 419 distinct values, all numeric-looking strings such as '10', '50', '100' and '30': almost certainly a width measurement stored as text rather than a true category. The distribution is heavily concentrated on round numbers: '10' alone covers 20.6% of 70,022 rows and the top ten values are all multiples of 5, giving an entropy ratio of 0.51. There are no nulls, but the string encoding of clearly numeric quantities is the surprise. Treatment: Cast to numeric and consider binning or a log transform given the round-number concentration.
high · anthropic:claude-opus-4-7
n: 70,022
nulls: 0 (0.0%)
unique: 419
top_value: 10
top_rate: 0.2059
cardinality: 419
entropy: 4.463
entropy_ratio: 0.5124
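Both options named in the treatment, a log transform and binning, are sketched below. The bin edges are arbitrary placeholders, not values derived from the data, and the report does not state the unit of wid; the first four sample tokens are the ones quoted in the description and the last is made up.

```python
import math
from bisect import bisect_right

# Hypothetical bin edges, chosen only for illustration.
BIN_EDGES = [10, 50, 100, 500]

def width_bin(width):
    """Return the index of the bin a width falls in (0 = below the first edge)."""
    return bisect_right(BIN_EDGES, width)

# '10', '50', '100', '30' are quoted in the profile; '880' is made up.
sample = ["10", "50", "100", "30", "880"]
widths = [float(t) for t in sample]

# log1p is defined at zero, which matters if any width is recorded as 0.
log_widths = [math.log1p(w) for w in widths]
bins = [width_bin(w) for w in widths]
```

Because reported widths cluster on round numbers, bins whose edges fall on those numbers should state clearly which side the edge value belongs to; here `bisect_right` places a value equal to an edge in the higher bin.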