saturn·

data trove tornadoes noaa spc

source /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json · 70,022 rows · 13 columns · profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This dataset contains 70,022 tornado records across the United States, with attributes covering location, timing, magnitude, path dimensions, and human impact. Texas leads with 9,345 events, and the classic 'Tornado Alley' states (TX, KS, OK, NE, IA) together account for a large share of all records. Magnitude is worth close inspection: nearly half of all tornadoes are rated 0 (the weakest rating on the F/EF scale) and only 59 reach magnitude 5, a steeply decaying severity distribution. Human cost is highly skewed: 97.7% of events report zero fatalities, but the long tail of deadly events, together with April 27, 2011 being the most frequent date (207 records), points to a handful of catastrophic outbreak days that deserve focused analysis.

citing: row_count · column_count · state.top_values · mag.top_values · fatalities.top_rate · fatalities.top_value · date.top_values · injuries.top_values · loss.top_values · len.top_values

Schema

13 columns
Per-column summary.
column      type          nulls   unique  alerts
date        text           0.0%   12,639  one_word allcaps short_text duplicates
time        text           0.0%    1,438  one_word allcaps short_text duplicates
state       categorical    0.0%       53
mag         categorical    0.0%        7
injuries    categorical    0.0%      209
fatalities  categorical    0.0%       50  imbalance
loss        text           0.0%    1,019  one_word allcaps short_text duplicates
slat        numeric        0.0%   16,016
slon        numeric        0.0%   17,912
elat        numeric       37.6%   16,965  null_rate
elon        numeric       37.6%   18,586  null_rate
len         text           0.0%    3,663  one_word allcaps short_text duplicates
wid         categorical    0.0%      419

date

text timestamp one_word allcaps short_text duplicates
This column contains ISO-8601 calendar dates stored as text, all exactly 10 characters long (YYYY-MM-DD), with zero nulls across 70,022 rows. The 'allcaps' alert is most likely an artefact of the profiler applying text heuristics to strings that contain no letters, not a real data issue. The duplicate rate is high at 81.9% (57,383 duplicates across 12,639 unique dates): many tornadoes share a date, and the top value '2011-04-27' appears 207 times, so this is an event date that clusters on outbreak days rather than a unique record timestamp. Treatment: Parse to datetime dtype, then use as a temporal feature (day-of-week, month, year, cyclical encoding) or grouping key for aggregations. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 12,639
len_min: 10
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 57,383
duplicate_rate: 0.8195
vocab_size: 7,831
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
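
The treatment suggested for the date column can be sketched with the standard library alone. This is a minimal illustration, not the profiler's own code: the function name date_features and the choice to encode month (rather than day-of-year) cyclically are assumptions made for the example.

```python
import math
from datetime import date


def date_features(iso_date: str) -> dict:
    """Derive simple calendar features from a YYYY-MM-DD string."""
    d = date.fromisoformat(iso_date)
    # Map month 1..12 onto a circle so December and January end up adjacent.
    angle = 2 * math.pi * (d.month - 1) / 12
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),  # Monday = 0, Sunday = 6
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
    }


# The most frequent date reported for this column.
features = date_features("2011-04-27")
```

The parsed date can also serve directly as a grouping key, for example to count records per outbreak day.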

time

text timestamp one_word allcaps short_text duplicates
This column contains wall-clock time strings in HH:MM:SS format (all values are exactly 8 characters), giving the time of day at which each tornado was recorded. With only 1,438 unique values across 70,022 rows, the duplicate rate is extremely high at 97.9%; the unique count is close to the 1,440 minutes in a day, which suggests minute-level resolution. The top values cluster around afternoon and evening hours (14:00 to 19:00), consistent with the late-afternoon peak in tornado activity. The column is stored as text despite being a structured time value, so it should be parsed to a proper time type for any downstream use. Treatment: Parse to time/datetime type and use as a cyclical feature (e.g., sine/cosine encoding of hour) or join key for time-based aggregations. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 1,438
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 68,584
duplicate_rate: 0.9795
vocab_size: 1,352
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
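
The sine/cosine encoding mentioned in the treatment could look like the sketch below. The helper name time_features is hypothetical, and encoding seconds-of-day (rather than the whole hour only) is a choice made for this example.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def time_features(hms: str) -> dict:
    """Encode an HH:MM:SS string as a point on the 24-hour circle."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    seconds_of_day = hours * 3600 + minutes * 60 + seconds
    angle = 2 * math.pi * seconds_of_day / SECONDS_PER_DAY
    return {
        "hour": hours,
        "tod_sin": math.sin(angle),
        "tod_cos": math.cos(angle),
    }


evening = time_features("18:00:00")
just_before_midnight = time_features("23:59:00")
midnight = time_features("00:00:00")
```

With this encoding, 23:59 and 00:00 land next to each other, which a raw hour number would not capture.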

state

categorical feature
This column contains US state abbreviations, with 53 distinct values (50 states plus likely DC, Puerto Rico, and one other territory) and zero nulls across 70,022 rows. Texas leads at 13.3% (9,345 rows), and the top 10 are heavily skewed toward Great Plains and Southern states (KS, OK, NE, IA, MS, AL), which matches the known geography of US tornado activity. The entropy ratio of 0.847 indicates reasonably broad coverage; the concentration in TX/KS/OK/NE/IA reflects where tornadoes occur rather than a sampling problem, but it does mean low-frequency states contribute few rows. Treatment: One-hot encode or target-encode depending on model type; consider grouping low-frequency states if using one-hot. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 53
top_value: TX
top_rate: 0.1335
cardinality: 53
entropy: 4.851
entropy_ratio: 0.8468
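
The entropy figures above are consistent with entropy_ratio = entropy / log2(cardinality), since 4.851 / log2(53) is about 0.847. That definition is inferred from the reported numbers, not confirmed from the profiler's source. Under that assumption, a minimal sketch (the function name and the toy state lists are illustrative only):

```python
import math
from collections import Counter


def entropy_and_ratio(values):
    """Return (Shannon entropy in bits, entropy / log2(number of categories))."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # A single category has zero entropy and no meaningful ratio.
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    ratio = entropy / max_entropy if max_entropy else 0.0
    return entropy, ratio


# Toy inputs, not taken from the dataset.
uniform = entropy_and_ratio(["TX", "KS", "OK", "NE"])
skewed = entropy_and_ratio(["TX"] * 7 + ["KS"])

# Cross-check against the figures reported for the state column.
reported_ratio = 4.851 / math.log2(53)
```

A ratio near 1 means rows are spread evenly across categories; a ratio near 0 means one category dominates.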

mag

categorical feature
This column holds the tornado magnitude rating encoded as a small integer, with 7 distinct values: -9 and 0 through 5. The dominant value is '0' (46% of rows, 32,218 records), followed by '1' and '2', giving a right-skewed ordinal distribution. The value '-9', which appears 1,024 times, is a sentinel for missing or unknown magnitude rather than a true negative rating, which would surprise an analyst expecting a clean ordinal scale. The column is stored as categorical despite being numerically interpretable. Treatment: Recode '-9' as missing, then treat as ordinal integer feature or one-hot encode depending on model assumptions. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 7
top_value: 0
top_rate: 0.4601
cardinality: 7
entropy: 1.772
entropy_ratio: 0.6312
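
Recoding the sentinel before treating mag as ordinal might look like this. It is a sketch under the report's reading that -9 means unknown; the names clean_mag and MISSING_SENTINEL and the sample values are made up for illustration.

```python
from typing import Optional

MISSING_SENTINEL = -9  # value the report interprets as unknown magnitude


def clean_mag(raw) -> Optional[int]:
    """Convert a raw magnitude to an int, mapping the sentinel to None."""
    value = int(raw)
    return None if value == MISSING_SENTINEL else value


# Illustrative raw values; the column may arrive as strings or ints.
raw_mags = ["0", "1", "-9", "5", 2]
cleaned = [clean_mag(m) for m in raw_mags]
known = [m for m in cleaned if m is not None]
```

Leaving -9 in place would drag down any mean or ordinal model fitted on the column, so the recode should happen before any numeric use.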

injuries

categorical feature
This column represents the count of injuries per tornado, stored as a categorical type despite being fundamentally numeric. The distribution is extremely right-skewed: 88.8% of the 70,022 records have zero injuries and counts drop off sharply thereafter, yet the column has 209 unique values, so some very high injury counts exist in the tail. The low entropy ratio (0.123) confirms the near-degenerate concentration on '0', and the top values are not contiguous (for example, '10' ranks in the top 10 ahead of some smaller counts), which hints at a long, sparse tail. Treatment: Cast to integer, then consider zero-inflated modelling or a binary 'any_injury' flag plus a separate log-transformed count for the non-zero subset. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 209
top_value: 0
top_rate: 0.888
cardinality: 209
entropy: 0.9454
entropy_ratio: 0.1227
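
The flag-plus-log treatment can be sketched as follows. The function name injury_features is hypothetical, and log1p is one reasonable choice of log transform; the report does not specify which to use.

```python
import math


def injury_features(raw_count) -> dict:
    """Split a count into a binary flag and a log-scaled magnitude."""
    count = int(raw_count)
    return {
        "any_injury": int(count > 0),
        # Only defined for the non-zero subset, as the treatment suggests.
        "log_injuries": math.log1p(count) if count > 0 else None,
    }


no_injuries = injury_features("0")
ten_injuries = injury_features("10")
```

The same split applies to any zero-dominated count column, with the flag feeding a classifier and the log count feeding a regression on the non-zero rows.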

fatalities

categorical feature imbalance
This column represents the count of fatalities per tornado, stored as a categorical type despite being numeric in nature; it should be treated as an integer feature. The dominant value is '0' (68,423 out of 70,022 rows, or 97.7%), making the distribution severely right-skewed and triggering the imbalance alert. Cardinality reaches 50 distinct values with extremely low entropy (0.217, entropy ratio 0.039), confirming that the non-zero tail is sparse and long. Analysts modelling rare fatal events should be aware that the positive class covers only 1,599 records, fewer than 2.3% of the total. Treatment: Cast to integer, apply zero-inflated or rare-event modelling strategy, or binarise into fatal/non-fatal indicator before classification. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 50
top_value: 0
top_rate: 0.9772
cardinality: 50
entropy: 0.2174
entropy_ratio: 0.03852

loss

text numeric_target one_word allcaps short_text duplicates
This column contains numeric loss values stored as text strings: all entries are single tokens (one_word_rate: 1.0) with a mean length of about 3.18 characters, dominated by small non-negative numbers like '0.0', '4.0', '5.0'. The 92.5% allcaps rate is a misleading artefact of how saturn classifies short numeric strings. Notably, '0.0' and '0' appear as separate tokens (22,764 and 5,248 occurrences respectively), indicating inconsistent serialization of the same underlying zero value; an analyst should consolidate these before use. The duplicate rate is 98.5%, reflecting only 1,019 unique string representations across 70,022 rows. Treatment: Cast to float (unifying '0' and '0.0'), then use as a numeric measure of tornado damage; check whether the spike at 0 represents true zero loss or missing/default values, and check the NOAA SPC field documentation for units before comparing values across years. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 1,019
len_min: 1
len_max: 10
len_mean: 3.181
len_median: 3
len_p95: 5
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 69,003
duplicate_rate: 0.9854
vocab_size: 503
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.9251
boilerplate_rate: 0
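
Casting to float is enough to merge the two spellings of zero. The snippet below is a small sketch on made-up sample strings; parse_loss is a hypothetical helper name.

```python
from collections import Counter


def parse_loss(raw: str) -> float:
    """Parse a loss string; '0' and '0.0' both become the float 0.0."""
    return float(raw.strip())


# Illustrative sample mixing both serializations of zero.
raw_losses = ["0.0", "0", "4.0", "5.0", "0"]

counts_as_text = Counter(raw_losses)
counts_as_float = Counter(parse_loss(v) for v in raw_losses)
```

Comparing the two counters before and after the cast is a quick way to confirm that no other duplicate spellings (such as '4' and '4.0') remain.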

slat

numeric feature
This column almost certainly represents geographic latitude in decimal degrees, with values ranging from 17.72° to 61.02°, consistent with US locations from the Caribbean territories to Alaska. Given the 's' prefix and the matching 'e' columns (elat, elon), it is most likely the latitude at the start of the tornado path. The distribution is remarkably symmetric (skew 0.038, kurtosis -0.582) and tightly clustered around a mean of 37.14° with an IQR of 7.74°, so the data is dominated by mid-latitude locations in the contiguous United States. Only 70 outliers (0.1%) exist, likely extreme northern or southern observations, and there are no nulls. Treatment: Use directly as a geospatial feature; consider pairing with longitude and engineering distance or region-based features rather than treating as a raw numeric. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 16,016
min: 17.72
max: 61.02
mean: 37.14
median: 37.03
std: 5.09
q1: 33.19
q3: 40.93
iqr: 7.74
skew: 0.03792
kurtosis: -0.5825
n_outliers: 70
outlier_rate: 0.0009997
zero_rate: 0

slon

numeric feature
This column contains geographic longitude in decimal degrees, most likely the longitude at the start of the tornado path (the 's' prefix pairs with the 'e' prefix on elat and elon). All values are negative, ranging from -163.53 to -64.72, which places every observation in the Western Hemisphere. The mean of -92.74 and median of -93.50 indicate a concentration in the central United States. With 17,912 unique values across 70,022 rows and zero nulls, this is a continuous geographic coordinate with mild repetition, and the 951 outliers (about 1.36%) are probably records far from the contiguous US, such as Alaska or the Caribbean territories, though data-entry anomalies are also worth ruling out. Treatment: Use as-is for spatial modeling or map directly to geographic coordinates; inspect the 951 outliers for plausibility against known geographic bounds. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 17,912
min: -163.5
max: -64.72
mean: -92.74
median: -93.5
std: 8.677
q1: -98.4
q3: -86.69
iqr: 11.71
skew: -0.3229
kurtosis: 2.156
n_outliers: 951
outlier_rate: 0.01358
zero_rate: 0

elat

numeric feature null_rate
This column almost certainly represents geographic latitude in decimal degrees, with values ranging from 17.72° to 61.02°, the same range as slat and consistent with US locations from the Caribbean territories to Alaska. Given the 'e' prefix, it is most likely the latitude at the end of the tornado path. The distribution is strikingly symmetric (skew 0.034, kurtosis -0.41) and tightly clustered around a mean of 37.26° with an IQR of 7.42°. The most notable concern is a 37.65% null rate, flagged as an alert: 26,363 records, over a third of the total, lack an end coordinate. Only 78 outliers (0.18% of non-null values) exist at the extremes of the range. Treatment: Investigate source of 37.65% nulls before use; pair with longitude for spatial features or geohash encoding; impute or filter nulls depending on missingness mechanism. high · anthropic:default
n: 70,022
nulls: 26,363 (37.6%)
unique: 16,965
min: 17.72
max: 61.02
mean: 37.26
median: 37.13
std: 4.942
q1: 33.49
q3: 40.91
iqr: 7.42
skew: 0.03404
kurtosis: -0.4085
n_outliers: 78
outlier_rate: 0.001787
zero_rate: 0

elon

numeric feature null_rate
This column almost certainly represents longitude in signed decimal degrees, most likely the longitude at the end of the tornado path given the 'e' prefix and the matching slon column. Values range from -163.53 to -64.72, all in the Western Hemisphere. The mean of -92.19 and median of -92.47 indicate a central tendency in the central United States. Two points deserve attention: the null rate is high at 37.65% (26,363 records, the same count as elat), triggering an alert, and the range runs from -163.53 (around western Alaska) to -64.72 (around the Caribbean territories), indicating wide geographic spread. Treatment: Investigate and impute or exclude the 37.65% nulls before use; pair with a latitude column for geospatial modelling or clustering. high · anthropic:default
n: 70,022
nulls: 26,363 (37.6%)
unique: 18,586
min: -163.5
max: -64.72
mean: -92.19
median: -92.47
std: 8.545
q1: -97.73
q3: -86.47
iqr: 11.26
skew: -0.5954
kurtosis: 3.766
n_outliers: 647
outlier_rate: 0.01482
zero_rate: 0
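
Pairing the start and end coordinates into a distance feature could be sketched as below. This assumes slat/slon and elat/elon are the start and end of each path, which the report infers from the column names, and it uses the haversine great-circle formula as one common choice. The function name and the sample coordinates are illustrative.

```python
import math
from typing import Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def path_distance_km(
    slat: float,
    slon: float,
    elat: Optional[float],
    elon: Optional[float],
) -> Optional[float]:
    """Great-circle distance from start to end point, or None if the end is missing."""
    if elat is None or elon is None:
        return None  # about 37.6% of rows have no end coordinates
    phi1, phi2 = math.radians(slat), math.radians(elat)
    dphi = phi2 - phi1
    dlambda = math.radians(elon - slon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Made-up coordinates near the column means.
missing_end = path_distance_km(37.0, -93.0, None, None)
zero_length = path_distance_km(37.0, -93.0, 37.0, -93.0)
one_degree_north = path_distance_km(37.0, -93.0, 38.0, -93.0)
```

Returning None for rows without an end point keeps the missingness visible instead of silently turning it into a zero-length path.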

len

text feature one_word allcaps short_text duplicates
This column named 'len' stores numeric measurements encoded as text strings, almost certainly the tornado path length (the NOAA SPC files record it in miles; confirm against the source documentation). All 70,022 values are single tokens, and the allcaps flag is an artefact of classifying numeric strings as text. Values like '0.1', '0.5', '1.0', '2.0' dominate the top entries, with string lengths of 3 to 8 characters. The duplicate rate is extremely high at 94.8% (66,359 duplicates across only 3,663 unique values), which is expected for a bounded numeric measure and confirms this should be cast to float and treated as a continuous feature rather than a categorical label. Treatment: Cast to float64 and use as a numeric feature; check for unit consistency across the value range. high · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 3,663
len_min: 3
len_max: 8
len_mean: 3.626
len_median: 3
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 66,359
duplicate_rate: 0.9477
vocab_size: 2,204
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

wid

categorical feature
This column ('wid') is most likely the tornado path width, a numeric measurement encoded as a categorical, with only 419 distinct values across 70,022 rows (the NOAA SPC files record width in yards; confirm against the source documentation). All observed top values are round numbers (10, 50, 100, 30, 20, 200, 25, 150, 40, 75), which suggests widths are estimated rather than precisely measured. The distribution is notably skewed: value '10' alone accounts for 20.6% of all rows (14,417 occurrences), with a steep drop-off thereafter, indicating heavy concentration at small widths. Treatment: Cast to numeric integer and treat as an ordinal or continuous feature; consider log-transform if using in regression given the heavy skew toward small values. medium · anthropic:default
n: 70,022
nulls: 0 (0.0%)
unique: 419
top_value: 10
top_rate: 0.2059
cardinality: 419
entropy: 4.463
entropy_ratio: 0.5124