data trove tornadoes noaa spc

source /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json · 70,022 rows · 13 columns · profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This dataset contains 70,022 tornado records across the United States, with attributes covering location, timing, magnitude, path dimensions, and human impact. Texas dominates with 9,345 events, and the classic 'Tornado Alley' states (TX, KS, OK, NE, IA) together account for 23,983 records, about 34% of the total. Magnitude is worth close inspection: nearly half of all tornadoes (46.0%) are rated 0, the weakest rating on the F/EF scale, and only 59 reach magnitude 5, a steep severity distribution. Human cost is highly skewed: 97.7% of events report zero fatalities, but there is a long tail of deadly events, and the most frequent date, April 27, 2011 (207 records), points to a handful of catastrophic outbreak days that deserve focused analysis.

citing: row_count · column_count · state.top_values · mag.top_values · fatalities.top_rate · fatalities.top_value · date.top_values · injuries.top_values · loss.top_values · len.top_values

Charts the summary said to look at first

state · Look for the dominance of Tornado Alley states — TX, KS, OK — and how sharply activity drops beyond the top 10.

Top values for state (20 unique shown, of 53 total).
value	count	share
TX	9345	13.3%
KS	4474	6.4%
OK	4221	6.0%
FL	3620	5.2%
NE	3056	4.4%
IA	2887	4.1%
IL	2835	4.0%
MS	2657	3.8%
AL	2529	3.6%
MO	2462	3.5%
CO	2425	3.5%
LA	2305	3.3%
MN	2118	3.0%
AR	1981	2.8%
SD	1917	2.7%
GA	1898	2.7%
ND	1640	2.3%
IN	1610	2.3%
WI	1515	2.2%
NC	1472	2.1%
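
The combined share of the five 'Tornado Alley' states named in the summary can be checked directly from the counts in the table above. This is a minimal arithmetic sketch; the variable names are illustrative and not part of the profiler.

```python
# Counts copied from the state table above (TX, KS, OK, NE, IA).
alley_counts = {"TX": 9345, "KS": 4474, "OK": 4221, "NE": 3056, "IA": 2887}
TOTAL_ROWS = 70022  # row count reported for the dataset

alley_total = sum(alley_counts.values())
alley_share = alley_total / TOTAL_ROWS

print(f"{alley_total} records, {alley_share:.1%} of all rows")
```

Roughly one record in three comes from these five states, which is what the state chart should make visible.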

mag · Notice the steep drop-off from magnitude 0 to 5, confirming that the vast majority of tornadoes are weak, with violent twisters being rare.

Top values for mag (7 unique shown, of 7 total).
value	count	share
0	32218	46.0%
1	23782	34.0%
2	9767	13.9%
3	2585	3.7%
-9	1024	1.5%
4	587	0.8%
5	59	0.1%

fatalities · The near-total dominance of zero-fatality events (97.7%) contrasts sharply with the small but deadly tail of high-casualty tornadoes.

Top values for fatalities (20 unique shown, of 50 total).
value	count	share
0	68423	97.7%
1	830	1.2%
2	277	0.4%
3	134	0.2%
4	77	0.1%
5	46	0.1%
6	45	0.1%
7	32	0.0%
9	15	0.0%
10	15	0.0%
11	14	0.0%
8	13	0.0%
16	12	0.0%
13	8	0.0%
17	7	0.0%
18	6	0.0%
21	6	0.0%
12	6	0.0%
22	5	0.0%
25	4	0.0%

len · Path length is heavily concentrated at 0.1 miles, but watch for a long right tail of tornadoes that traveled many miles.

Character-length distribution for len (mean: 3.626 characters). Counts are by string length of the text value, not by miles; empty histogram bins are omitted.
chars	count
3	47630
4	11687
5	795
6	9118
7	789
8	3

injuries · Like fatalities, injuries are overwhelmingly zero, but the distribution of non-zero counts reveals which events caused mass casualties.

Top values for injuries (20 unique shown, of 209 total).
value	count	share
0	62177	88.8%
1	2480	3.5%
2	1388	2.0%
3	770	1.1%
4	484	0.7%
5	385	0.5%
6	300	0.4%
7	194	0.3%
8	171	0.2%
10	141	0.2%
12	120	0.2%
9	117	0.2%
11	80	0.1%
20	71	0.1%
15	69	0.1%
13	66	0.1%
14	55	0.1%
30	46	0.1%
25	44	0.1%
16	43	0.1%

Schema

13 columns

Per-column summary.
column	type	nulls	unique	alerts
date	text	0.0%	12,639	one_word allcaps short_text duplicates
time	text	0.0%	1,438	one_word allcaps short_text duplicates
state	categorical	0.0%	53
mag	categorical	0.0%	7
injuries	categorical	0.0%	209
fatalities	categorical	0.0%	50	imbalance
loss	text	0.0%	1,019	one_word allcaps short_text duplicates
slat	numeric	0.0%	16,016
slon	numeric	0.0%	17,912
elat	numeric	37.6%	16,965	null_rate
elon	numeric	37.6%	18,586	null_rate
len	text	0.0%	3,663	one_word allcaps short_text duplicates
wid	categorical	0.0%	419

date

text timestamp one_word allcaps short_text duplicates

This column contains ISO-8601 calendar dates stored as text strings, all exactly 10 characters long (YYYY-MM-DD format) with zero nulls across 70,022 rows. The 'allcaps' alert is a quirk of the profiler treating hyphenated tokens as uppercase-only, not a real data issue. What is notable is the high duplicate rate of 81.9% (57,383 duplicates) across only 12,639 unique dates, meaning many records share the same date. The top value, '2011-04-27', appears 207 times, which fits a tornado event date that clusters on outbreak days rather than a unique record timestamp. Treatment: Parse to datetime dtype, then use as a temporal feature (day-of-week, month, year, cyclical encoding) or as a grouping key for per-day aggregations. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 12,639
len_min: 10
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 57,383
duplicate_rate: 0.8195
vocab_size: 7,831
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
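
The treatment above (parse, then group or encode) might look like the following sketch. It uses only the standard library; the function names and the sample list are hypothetical, and only '2011-04-27' is a value taken from this report.

```python
from collections import Counter
from datetime import date
import math


def busiest_days(date_strings, n=3):
    """Parse YYYY-MM-DD strings and return the n most frequent dates."""
    parsed = [date.fromisoformat(s) for s in date_strings]
    return Counter(parsed).most_common(n)


def month_cyclical(d):
    """Place the month on a circle so December sits next to January."""
    angle = 2 * math.pi * (d.month - 1) / 12
    return math.sin(angle), math.cos(angle)


# Illustrative sample, not real rows from the dataset.
sample = ["2011-04-27", "2000-01-15", "2011-04-27", "2011-04-27", "2000-01-15"]
top = busiest_days(sample)
```

Grouping by the parsed date is the step that would surface outbreak days such as the 207-record date noted above.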

time

text timestamp one_word allcaps short_text duplicates

This column contains wall-clock time strings in HH:MM:SS format (all values are exactly 8 characters), giving the time of day of each tornado record. With only 1,438 unique values across 70,022 rows, the duplicate rate is extremely high at 97.9%, so times are heavily reused. The top values cluster around afternoon and evening hours (14:00–19:00), which is consistent with tornado activity peaking late in the day. The column is stored as text despite being a structured time value, so it should be parsed to a proper time type for any downstream use. Treatment: Parse to time/datetime type and use as a cyclical feature (e.g., sine/cosine encoding of hour) or as a grouping key for time-of-day aggregations. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 1,438
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 68,584
duplicate_rate: 0.9795
vocab_size: 1,352
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
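
A sine/cosine encoding of the time of day, as suggested in the treatment, could be sketched as follows. The function name is hypothetical, and the code assumes every value is a well-formed HH:MM:SS string, as the length statistics above indicate.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def time_to_cyclical(hhmmss):
    """Map an HH:MM:SS string to (sin, cos) on a 24-hour circle."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    elapsed = hours * 3600 + minutes * 60 + seconds
    angle = 2 * math.pi * elapsed / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)


late = time_to_cyclical("23:59:00")
early = time_to_cyclical("00:01:00")
# Euclidean distance between the two encodings is small, unlike the raw hour gap.
gap = math.dist(late, early)
```

The point of the encoding is that 23:59 and 00:01 end up close together, which a plain hour-of-day integer would not capture.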

state

categorical feature

This column contains US state abbreviations, with 53 distinct values (50 states plus likely DC, Puerto Rico, and one other territory) and zero nulls across 70,022 rows. Texas dominates at 13.3% (9,345 rows), and the top 10 are heavily skewed toward Great Plains and Southern states (KS, OK, NE, IA, MS, AL), with Florida fourth at 5.2%. For a tornado dataset this concentration is expected: it reflects where tornadoes are reported, not a sampling problem. The entropy ratio of 0.847 indicates reasonably broad coverage despite the weight of TX/KS/OK/NE/IA. Treatment: One-hot encode or target-encode depending on model type; consider grouping low-frequency states if using one-hot. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 53
top_value: TX
top_rate: 0.1335
cardinality: 53
entropy: 4.851
entropy_ratio: 0.8468

mag

categorical feature

This column holds the tornado magnitude rating (F/EF scale) encoded as a small integer, with 7 distinct values spanning -9 to 5. The dominant value is '0' (46% of rows, 32,218 records), followed by '1' and '2', giving a right-skewed ordinal distribution. The value '-9', which appears 1,024 times, is a sentinel for a missing or unknown rating rather than a true negative magnitude, so the scale is not clean as stored. The column is stored as categorical despite being numerically interpretable. Treatment: Recode '-9' as missing, then treat as an ordinal integer feature or one-hot encode depending on model assumptions. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 7
top_value: 0
top_rate: 0.4601
cardinality: 7
entropy: 1.772
entropy_ratio: 0.6312
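
The entropy figures above can be reproduced from the mag value counts, assuming the profiler reports Shannon entropy in bits and divides by log2 of the cardinality to get the ratio (the report does not state this, but the numbers agree). The sketch also shows one way to recode the '-9' sentinel; both function names are hypothetical.

```python
import math

# Value counts copied from the mag table in this report.
mag_counts = {"0": 32218, "1": 23782, "2": 9767, "3": 2585,
              "-9": 1024, "4": 587, "5": 59}


def entropy_bits(counts):
    """Shannon entropy in bits for an iterable of category counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def recode_mag(value):
    """Return the rating as an int, or None for the -9 sentinel."""
    rating = int(value)
    return None if rating == -9 else rating


entropy = entropy_bits(mag_counts.values())
entropy_ratio = entropy / math.log2(len(mag_counts))
print(round(entropy, 3), round(entropy_ratio, 3))
```

Note that the reported entropy counts '-9' as its own category; after recoding it as missing, the cardinality drops to 6 and both figures would change.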

injuries

categorical feature

This column represents a count of injuries per record, stored as a categorical type despite being fundamentally numeric. The distribution is extremely right-skewed: 88.8% of the 70,022 records have zero injuries, and counts drop off sharply thereafter, yet the column has 209 unique values, so some very high injury counts exist in the tail. The low entropy ratio (0.123) confirms the heavy concentration on '0'. Frequencies are also not monotone in the count: '10' is more frequent than '9', and round values such as 20, 25, and 30 outrank their neighbours, which may reflect rounded estimates and hints at a long, sparse tail. Treatment: Cast to integer, then consider zero-inflated modelling or a binary 'any_injury' flag plus a separate log-transformed count for the non-zero subset. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 209
top_value: 0
top_rate: 0.888
cardinality: 209
entropy: 0.9454
entropy_ratio: 0.1227
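
The two-part treatment above (a binary flag plus a log-transformed count for the non-zero subset) could be sketched like this. The function name and the returned tuple layout are illustrative choices, not part of the profiler.

```python
import math


def injury_features(raw):
    """Split an injury count into (any_injury flag, log count or None).

    The log is only defined for the non-zero subset, so zero-injury
    records get None for the second feature.
    """
    count = int(raw)
    any_injury = 1 if count > 0 else 0
    log_count = math.log(count) if count > 0 else None
    return any_injury, log_count


features = [injury_features(v) for v in ["0", "1", "20"]]
```

The same split applies to the fatalities column, where the non-zero subset is smaller still.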

fatalities

categorical feature imbalance

This column represents a count of fatalities per incident, stored as a categorical type despite being numeric in nature; it should be treated as an ordinal or integer feature. The dominant value is '0' (68,423 out of 70,022 rows, or 97.7%), making this severely right-skewed and triggering an imbalance alert. Cardinality reaches 50 distinct values with extremely low entropy (0.217, entropy ratio 0.039), confirming that the non-zero tail is sparse and long. Analysts modelling rare fatal events should be aware that the positive class is only 1,599 records, about 2.3% of the total. Treatment: Cast to integer, apply a zero-inflated or rare-event modelling strategy, or binarise into a fatal/non-fatal indicator before classification. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 50
top_value: 0
top_rate: 0.9772
cardinality: 50
entropy: 0.2174
entropy_ratio: 0.03852

loss

text numeric_target one_word allcaps short_text duplicates

This column contains numeric loss values stored as text strings: all single-token entries (one_word_rate: 1.0) with a mean length of ~3.18 characters, dominated by small non-negative numbers like '0.0', '4.0', '5.0'. The column name and context suggest the loss attributed to each tornado; the unit is not stated in this profile. The 92.5% allcaps rate is a misleading artefact of how saturn classifies short numeric strings. Notably, '0.0' and '0' appear as separate tokens (22,764 and 5,248 occurrences respectively), indicating inconsistent serialization of the same underlying zero value, which an analyst should consolidate before use. The duplicate rate is 98.5%, reflecting a low-cardinality numeric range across 70,022 rows with only 1,019 unique string representations. Treatment: Cast to float (unifying '0' and '0.0'), then use as a regression target or damage measure; check whether the spike at 0 represents true zero loss or missing/default values. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 1,019
len_min: 1
len_max: 10
len_mean: 3.181
len_median: 3
len_p95: 5
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,003
duplicate_rate: 0.9854
vocab_size: 503
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.9251
boilerplate_rate: 0
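
Casting to float is enough to merge the two spellings of zero, as the short sketch below illustrates. The sample list is made up for the example; only the tokens '0', '0.0', '4.0', and '5.0' are mentioned in this report.

```python
from collections import Counter

# Illustrative raw values; '0' and '0.0' are distinct as strings.
raw_loss = ["0.0", "0", "4.0", "5.0", "0"]

as_text = Counter(raw_loss)
as_float = Counter(float(value) for value in raw_loss)

print(len(as_text), "distinct strings ->", len(as_float), "distinct floats")
```

After the cast, the unique count for the column would be expected to fall below the 1,019 string representations reported here.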

slat

numeric feature

This column almost certainly represents the starting latitude of each tornado track in decimal degrees (the 's' prefix pairs it with 'slon', as 'e' does for 'elat' and 'elon'). Values range from 17.72° to 61.02°, a span that goes beyond the contiguous United States; the extremes are consistent with Puerto Rico in the south and Alaska in the north. The distribution is close to symmetric (skew 0.038, kurtosis -0.582) and clustered around a mean of 37.14° with an IQR of 7.74°, so the dataset is dominated by mid-latitude U.S. locations. Only 70 outliers (0.1%) exist, likely the extreme northern or southern observations, and there are no nulls. Treatment: Use directly as a geospatial feature; consider pairing with longitude and engineering distance or region-based features rather than treating as a raw numeric. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 16,016
min: 17.72
max: 61.02
mean: 37.14
median: 37.03
std: 5.09
q1: 33.19
q3: 40.93
iqr: 7.74
skew: 0.03792
kurtosis: -0.5825
n_outliers: 70
outlier_rate: 0.0009997
zero_rate: 0
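
To see which latitudes would count as outliers, the quartiles above can be turned into fences. This sketch assumes the profiler uses Tukey's 1.5 × IQR rule, which the report does not state; the function name is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 copied from the slat statistics above.
lower, upper = tukey_fences(33.19, 40.93)
print(f"outliers fall below {lower:.2f} or above {upper:.2f} degrees")
```

Under that assumption, the flagged rows would be those south of roughly 21.6° or north of roughly 52.5°, which is worth confirming against the state column before dropping anything.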

slon

numeric feature

This column contains geographic longitude values, almost certainly the starting longitude of each tornado track (the 's' prefix pairs it with 'slat', as 'e' does for 'elat' and 'elon'). All values are negative, ranging from -163.53 to -64.72, which places every observation in the Western Hemisphere. The mean of -92.74 and median of -93.50 sit in the central United States, consistent with the Great Plains and Southern concentration seen in the state column. With 17,912 unique values across 70,022 rows and zero nulls, this is a continuous geographic coordinate with mild repetition, and the 951 outliers (~1.36%) are likely far-western or far-eastern locations, though some may be data entry anomalies worth inspecting. Treatment: Use as-is for spatial modeling or map directly to geographic coordinates; inspect the 951 outliers for plausibility against the state column and known geographic bounds. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 17,912
min: -163.5
max: -64.72
mean: -92.74
median: -93.5
std: 8.677
q1: -98.4
q3: -86.69
iqr: 11.71
skew: -0.3229
kurtosis: 2.156
n_outliers: 951
outlier_rate: 0.01358
zero_rate: 0

elat

numeric feature null_rate

This column almost certainly represents the ending latitude of each tornado track in decimal degrees, with values ranging from 17.72° to 61.02°, the same span as 'slat'. The distribution is close to symmetric (skew 0.034, kurtosis -0.41) and clustered around a mean of 37.26° with an IQR of 7.42°. The main concern is the 37.65% null rate (26,363 rows), flagged as an alert: over a third of records have no end point. Only 78 outliers (0.18% of non-null values) exist at the extremes of the range. Treatment: Investigate the source of the 37.65% nulls before use; pair with 'elon' for spatial features or geohash encoding; impute or filter nulls depending on the missingness mechanism. high · anthropic:default

n: 70,022
nulls: 26,363 (37.6%)
unique: 16,965
min: 17.72
max: 61.02
mean: 37.26
median: 37.13
std: 4.942
q1: 33.49
q3: 40.91
iqr: 7.42
skew: 0.03404
kurtosis: -0.4085
n_outliers: 78
outlier_rate: 0.001787
zero_rate: 0

elon

numeric feature null_rate

This column almost certainly represents the ending longitude of each tornado track (a signed longitude, paired with 'elat'), given the name 'elon' and values ranging from -163.53 to -64.72, a range that lies entirely in the Western Hemisphere. The mean of -92.19 and median of -92.47 put the central tendency in the central United States. Two points for analysts: the null rate is high at 37.65% (the same 26,363 rows as 'elat' by count), triggering an alert, and the geographic spread is wide, from -163.53 in the far west to -64.72 at the easternmost point. Treatment: Investigate and impute or exclude the 37.65% nulls before use; pair with 'elat' for geospatial modelling or clustering. high · anthropic:default

n: 70,022
nulls: 26,363 (37.6%)
unique: 18,586
min: -163.5
max: -64.72
mean: -92.19
median: -92.47
std: 8.545
q1: -97.73
q3: -86.47
iqr: 11.26
skew: -0.5954
kurtosis: 3.766
n_outliers: 647
outlier_rate: 0.01482
zero_rate: 0
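
One spatial feature that pairs the start and end coordinates is the great-circle distance between them. The sketch below assumes slat/slon are start points and elat/elon are end points, uses the haversine formula with a mean Earth radius of 3,958.8 miles, and returns None when the end point is null. The function name is hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def path_distance_miles(slat, slon, elat, elon):
    """Haversine distance from start to end point, or None if end is missing."""
    if elat is None or elon is None:
        return None
    phi1, phi2 = math.radians(slat), math.radians(elat)
    dphi = phi2 - phi1
    dlambda = math.radians(elon - slon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Made-up coordinates for illustration: one degree of latitude apart.
one_degree = path_distance_miles(35.0, -97.0, 36.0, -97.0)
missing_end = path_distance_miles(35.0, -97.0, None, None)
```

Comparing this straight-line distance with the 'len' column, once that is cast to float, is one way to sanity-check both fields on the rows where end coordinates exist.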

len

text feature one_word allcaps short_text duplicates

This column named 'len' stores tornado path length as text strings, a numeric measurement held in the wrong dtype; the chart note for this column gives the unit as miles. All 70,022 values are single tokens flagged as all-caps (an artefact of numeric strings), with values like '0.1', '0.5', '1.0', '2.0' dominating the top entries and a character length range of 3–8. The duplicate rate is extremely high at 94.8% (66,359 duplicates across only 3,663 unique values), which is expected for a bounded numeric measure and confirms this should be cast to float and treated as a continuous feature rather than a categorical label. Treatment: Cast to float64 and use as a numeric feature; check for unit consistency across the value range. high · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 3,663
len_min: 3
len_max: 8
len_mean: 3.626
len_median: 3
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 66,359
duplicate_rate: 0.9477
vocab_size: 2,204
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

wid

categorical feature

This column ('wid') appears to be tornado path width stored as a categorical, with only 419 distinct values across 70,022 rows. The unit is not stated in this profile (SPC tornado files record width in yards), so confirm it against the source documentation. All observed top values are round numbers (10, 50, 100, 30, 20, 200, 25, 150, 40, 75), which suggests widths are estimated and rounded rather than measured precisely. The distribution is notably skewed: value '10' alone accounts for 20.6% of all rows (14,417 occurrences), with a steep drop-off thereafter, indicating heavy concentration at small values. Treatment: Cast to numeric integer and treat as a continuous feature; consider log-transform if using in regression given the heavy skew toward small values. medium · anthropic:default

n: 70,022
nulls: 0 (0.0%)
unique: 419
top_value: 10
top_rate: 0.2059
cardinality: 419
entropy: 4.463
entropy_ratio: 0.5124