This is a tornado event log with 70,022 rows and 13 columns covering dates, times, start/end coordinates, magnitudes, lengths, widths, fatalities, injuries, losses, and U.S. state. Geographically it is a U.S.-centered dataset: starting longitudes average around -92.7 and latitudes around 37.1, with Texas (13.3% of records), Kansas, and Oklahoma leading the state counts. The severity fields are highly imbalanced: fatalities are 0 in 97.7% of events and injuries are 0 in 88.8%, so any analysis of harm should focus on the rare non-zero tail. Magnitude (mag) is a more usable categorical signal with 7 levels, dominated by 0 (46%) and 1 (34%), although one of those levels is -9 (1,024 rows), which looks like a missing-value sentinel rather than a real magnitude. Note that the end-coordinate columns (elat, elon) are null in ~37.7% of rows, which matters if you plan to draw tornado tracks rather than just start points.
saturn
/home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json · 70,022 rows · sample n=70,022 · seed 42 · 2026-05-01T23:12:11+00:00
Overview
| Source | /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json |
| Total rows | 70,022 |
| Profiled sample | 70,022 |
| Columns | 13 |
| Generated | 2026-05-01T23:12:11+00:00 |
Insights opt-in
Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.
`date` is a date column stored as ISO-formatted text (YYYY-MM-DD); every value is exactly 10 characters long and a single token. Across 70,022 rows there are only 12,639 unique dates (an 81.9% duplicate rate), so many records share a date; the most common, 2011-04-27, appears 207 times. The dates span at least 1950-05-16 (which appears in the sample values) to 2023-03-31. The column is typed as text and should be parsed to a date type before any time-based analysis.
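As a minimal sketch of that parsing step, the snippet below converts the ten sample values listed under "date text" into `datetime.date` objects and reports the earliest and latest. The helper name `parse_iso_dates` is hypothetical, and the sample is only ten rows, so its range is narrower than the full column's.

```python
from datetime import date

# Sample values copied from the "date" listing in this report.
raw_dates = [
    "1950-05-16", "2009-06-27", "2016-05-26", "2013-06-21", "1990-09-09",
    "2022-04-05", "1991-08-15", "2018-04-15", "2013-07-01", "1956-04-03",
]


def parse_iso_dates(values):
    """Parse YYYY-MM-DD strings; return (parsed dates, rejected strings)."""
    parsed, rejected = [], []
    for text in values:
        try:
            parsed.append(date.fromisoformat(text))
        except ValueError:
            # Keep the raw string so malformed rows can be inspected later.
            rejected.append(text)
    return parsed, rejected


dates, bad = parse_iso_dates(raw_dates)
earliest, latest = min(dates), max(dates)
print(earliest, latest, len(bad))
```

Collecting rejected strings instead of raising keeps one malformed row from aborting the whole conversion.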
`time` holds clock times stored as 8-character HH:MM:SS strings; all 70,022 rows are non-null and of uniform length (len_min=len_max=8). Only 1,438 distinct values appear, a 97.95% duplicate rate, and on-the-hour afternoon slots dominate: 16:00:00 (978), 17:00:00 (971), and 15:00:00 (959). That pattern is consistent with event start times rather than free text. The column is mistyped as text; parse it to a proper time type before use.
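A short sketch of the time parsing, using the ten sample values listed under "time text". Bucketing by hour is an illustrative choice for looking at the time-of-day pattern, not something the profiler itself reports.

```python
from collections import Counter
from datetime import time

# Sample values copied from the "time" listing in this report.
raw_times = [
    "18:00:00", "13:55:00", "19:01:00", "12:58:00", "14:45:00",
    "07:14:00", "15:35:00", "16:56:00", "18:41:00", "19:45:00",
]

# time.fromisoformat accepts HH:MM:SS and raises ValueError on bad input.
parsed_times = [time.fromisoformat(text) for text in raw_times]

# Count events per hour of day to expose the afternoon concentration.
by_hour = Counter(t.hour for t in parsed_times)
for hour in sorted(by_hour):
    print(f"{hour:02d}:00  {by_hour[hour]}")
```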
`state` is a US state code column with 53 distinct values, slightly more than the 50 states (likely including DC and territories like PR). TX dominates with 9,345 of 70,022 rows (13.3%), followed by KS (4,474) and OK (4,221), giving a clear southern/plains skew rather than a uniform national distribution. The entropy ratio of 0.847 indicates a distribution that is fairly spread but not flat, and there are no nulls.
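The report does not define "entropy ratio". One common reading, assumed here, is Shannon entropy divided by its maximum for the same number of distinct values, so 1.0 means perfectly uniform and 0.0 means a single value. The state listing in this report is truncated to the top 20, so the sketch below applies the hypothetical `entropy_ratio` helper to the `mag` counts, which are listed in full.

```python
import math


def entropy_ratio(counts):
    """Shannon entropy of the counts divided by log(number of categories).

    Assumed definition; the profiler's exact formula is not stated.
    """
    total = sum(counts)
    k = len(counts)
    if k < 2 or total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(k)


# Complete value counts copied from the "mag" listing in this report.
mag_counts = {"0": 32218, "1": 23782, "2": 9767, "3": 2585,
              "-9": 1024, "4": 587, "5": 59}

ratio = entropy_ratio(list(mag_counts.values()))
print(f"mag entropy ratio: {ratio:.3f}")
```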
`mag` is a low-cardinality categorical with 7 distinct values, dominated by '0' (46% of 70,022 rows) with counts decreasing through '1', '2', '3', '4', and '5'. The ordered integer levels suggest a magnitude or severity code rather than a free category. The standout signal is '-9' (1,024 rows), almost certainly a sentinel for missing/unknown that is not being counted as null.
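A minimal sketch of treating '-9' as missing before computing shares, using the counts from the "mag" listing. That '-9' means missing is an assumption taken from the narrative above, and `to_magnitude` is a hypothetical helper name.

```python
MISSING_SENTINEL = "-9"  # assumption: '-9' encodes missing/unknown magnitude

# Value counts copied from the "mag" listing in this report.
mag_counts = {"0": 32218, "1": 23782, "2": 9767, "3": 2585,
              "-9": 1024, "4": 587, "5": 59}


def to_magnitude(token):
    """Map a raw token to an int magnitude, or None for the sentinel."""
    return None if token == MISSING_SENTINEL else int(token)


# Keep only real magnitudes; count sentinel rows separately.
known = {to_magnitude(tok): n for tok, n in mag_counts.items()
         if to_magnitude(tok) is not None}
missing_rows = mag_counts[MISSING_SENTINEL]
known_total = sum(known.values())

# Share of magnitude 0 among rows with a known magnitude.
share_zero = known[0] / known_total
print(missing_rows, known_total, round(share_zero, 4))
```

Excluding the sentinel changes the denominator, so shares computed over known rows are slightly higher than those quoted over all 70,022 rows.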
`injuries` holds injury counts stored as strings, with 209 distinct values across 70,022 rows and no nulls. The distribution is severely zero-inflated: '0' accounts for 88.8% of records, and the entropy ratio is just 0.123. Values from '1' through '10' decay quickly, but 209 unique tokens suggest either a very long tail or non-integer entries, which is worth inspecting before casting.
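One way to do that inspection is to separate tokens that are plain non-negative integers from everything else before casting. In the sketch below, the first six tokens come from the "injuries" listing; the last three are hypothetical malformed examples added only to exercise the check, and `split_count_tokens` is a hypothetical helper.

```python
def split_count_tokens(tokens):
    """Separate plain non-negative integer tokens from anything else."""
    valid, suspect = [], []
    for tok in tokens:
        # isascii() guards against non-ASCII digit characters that
        # str.isdigit() would otherwise accept.
        if tok.isascii() and tok.isdigit():
            valid.append(tok)
        else:
            suspect.append(tok)
    return valid, suspect


tokens = [
    "0", "1", "2", "10", "12", "30",  # from the "injuries" listing
    "3.0", " 4", "",                  # hypothetical malformed tokens
]

valid, suspect = split_count_tokens(tokens)
injury_counts = [int(tok) for tok in valid]
print(injury_counts, suspect)
```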
This is a fatality count per event, stored as a categorical/string field with 50 distinct integer values and no nulls across 70,022 rows. The distribution is extremely imbalanced: '0' accounts for 97.72% of records, giving an entropy ratio of just 0.039, with '1' at 830 rows and a long thin tail (2 → 277, 3 → 134, down to single-digit counts at higher values). Despite being typed as categorical, the values are numeric and ordered, so the current encoding is likely a load artifact.
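The imbalance can be re-derived from the "fatalities" listing. The sketch below recomputes the zero share and the size of the non-zero tail; because the listing shows only the top 20 of 50 distinct values, it also reports how many rows fall in the values that are not shown.

```python
TOTAL_ROWS = 70_022  # total rows stated in the overview

# Top-20 value counts copied from the "fatalities" listing (truncated list).
fatality_counts = {
    0: 68423, 1: 830, 2: 277, 3: 134, 4: 77, 5: 46, 6: 45, 7: 32,
    9: 15, 10: 15, 11: 14, 8: 13, 16: 12, 13: 8, 17: 7, 18: 6,
    21: 6, 12: 6, 22: 5, 25: 4,
}

zero_share = fatality_counts[0] / TOTAL_ROWS
nonzero_rows = TOTAL_ROWS - fatality_counts[0]

# Rows whose fatality value is outside the top-20 listing.
unlisted_rows = TOTAL_ROWS - sum(fatality_counts.values())

# Share of fatal events that involved exactly one death.
single_death_share = fatality_counts[1] / nonzero_rows

print(f"zero share: {zero_share:.2%}")
print(f"events with at least one fatality: {nonzero_rows}")
print(f"rows in unlisted values: {unlisted_rows}")
print(f"single-death share of fatal events: {single_death_share:.1%}")
```

With so few non-zero rows, summaries of harm are better computed on the non-zero subset than on the full column.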
Numeric loss values stored as text strings, with all 70,022 entries being single tokens (one_word_rate 1.0) and lengths of 1-10 characters. The distribution is heavily concentrated: '0.0' alone accounts for 22,764 rows and the duplicate_rate is 0.985 across only 1,019 unique values. Mixed formatting is a hazard — '0' and '0.0' appear as separate tokens, so a naive cast will collapse them but string-based grouping won't.
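The '0' versus '0.0' hazard shows up even in the ten sample values listed under "loss text". The sketch below groups the same values once as strings and once after a float cast. It assumes every token is a valid float literal, which holds for the sample but is not verified for the full column.

```python
from collections import Counter

# Sample values copied from the "loss" listing in this report.
raw_loss = ["3.0", "0.0", "0", "0.0", "3.0", "0", "5.0", "100000", "0.0", "4.0"]

# Grouping on the raw strings keeps '0' and '0.0' apart.
by_token = Counter(raw_loss)

# Casting first merges tokens that denote the same number.
by_value = Counter(float(tok) for tok in raw_loss)

print(len(by_token), "distinct tokens;", len(by_value), "distinct values")
print("zero-loss rows:", by_value[0.0])
```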
Values range from 17.72 to 61.02 with a mean of 37.14 and median of 37.03, consistent with a starting latitude (slat) field in decimal degrees. The bulk of the values sit within contiguous-US latitudes, but the extremes fall outside them, which is consistent with records from Alaska and from Caribbean territories such as Puerto Rico. The distribution is essentially symmetric (skew 0.04, kurtosis -0.58) with a tight IQR of 7.74 and only 70 outliers (0.10%). There are no nulls and no zeros, and 16,016 unique values across 70,022 rows suggest repeated coordinates rather than free-form noise.
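As a first-pass check on which starting latitudes lie outside the contiguous US, a simple band test is enough. The bounds below (24.5 to 49.4 degrees north) are approximate values assumed for illustration, and `in_conus_lat_band` is a hypothetical helper. The test values are the minimum, mean, and maximum quoted above.

```python
# Approximate latitude band of the contiguous US (assumed, illustrative).
CONUS_LAT_MIN = 24.5
CONUS_LAT_MAX = 49.4


def in_conus_lat_band(lat):
    """True if the latitude lies within the approximate contiguous-US band."""
    return CONUS_LAT_MIN <= lat <= CONUS_LAT_MAX


# Minimum, mean, and maximum slat values quoted in the narrative.
for lat in (17.72, 37.14, 61.02):
    label = "inside" if in_conus_lat_band(lat) else "outside"
    print(f"{lat:6.2f}  {label} the band")
```

A latitude band alone cannot separate, say, southern Canada from the northern US; a real filter would also use longitude or the `state` column.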
Values are negative decimal degrees ranging from -163.53 to -64.7151 with a median of -93.5, consistent with western-hemisphere longitudes (the 'slon' name suggests starting longitude). The distribution is mildly left-skewed (-0.32), with an IQR of ~11.7 degrees spanning the central US longitude band. Only 951 values (1.36%) are flagged as outliers relative to that spread, and there are no nulls or zeros.
Almost certainly the ending latitude (elat) in decimal degrees, with values spanning 17.72 to 61.02, the same range as the starting latitude. The distribution is roughly symmetric (skew 0.03, kurtosis -0.41) and centered near 37.26, with only 78 mild outliers. The standout concern is that 37.65% of rows are null, so coverage is partial.
This is almost certainly the ending longitude (elon), with values bounded between -163.53 and -64.7151 and a median of -92.47, the same bounds as the starting longitude. The distribution is moderately left-skewed (-0.60) with kurtosis 3.77 and an IQR of 11.26 around the median. Notably, 37.65% of rows are null, which will materially shrink any analysis that needs both endpoints.
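Because the end coordinates are often null, track drawing needs a guard that falls back to the start point. The sketch below assumes records are loaded with `None` for null coordinates. The `Event` type, the `has_track` helper, and the three rows are hypothetical illustrations, not rows taken from the dataset.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    slat: float
    slon: float
    elat: Optional[float]
    elon: Optional[float]


def has_track(ev):
    """A track needs both end coordinates; otherwise only a start point exists."""
    return ev.elat is not None and ev.elon is not None


# Hypothetical rows for illustration only.
events = [
    Event(37.10, -93.50, 37.26, -92.47),  # complete track
    Event(35.20, -97.40, None, None),     # start point only
    Event(33.00, -96.80, 33.10, None),    # partial end: treated as no track
]

tracks = [ev for ev in events if has_track(ev)]
points = [ev for ev in events if not has_track(ev)]
print(len(tracks), "tracks;", len(points), "start-only points")
```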
Despite being typed as text, `len` holds short numeric tokens (length 3-8, one word each) like '0.1', '0.5', '1.0', almost certainly a length measurement stored as strings. Values are highly concentrated: '0.1' alone covers 15,456 of 70,022 rows and the duplicate rate is 94.8% across only 3,663 unique tokens. Formatting is inconsistent (the sample values include both '0.22' and '5.4400'), so cast to a numeric type before grouping. The allcaps and one_word alerts are artefacts of numeric strings rather than real text signal.
`wid` is a categorical column with 419 distinct values, all numeric-looking strings like '10', '50', '100', '30', almost certainly a width measurement stored as text rather than a true category. The distribution is heavily concentrated on round numbers: '10' alone covers 20.6% of 70,022 rows and the top ten values are all multiples of 5, giving an entropy ratio of 0.51. There are no nulls; the surprise is that clearly numeric values are encoded as strings.
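The round-number concentration can be quantified from the "wid" listing. The sketch below casts the top-20 tokens to integers and measures how many of those rows sit on a multiple of 5. It covers only the 20 listed values, not all 419, so the resulting share describes the listed rows only.

```python
# Top-20 value counts copied from the "wid" listing (truncated list).
wid_counts = {
    "10": 14417, "50": 10366, "100": 7067, "30": 4772, "20": 4368,
    "200": 2946, "25": 2452, "150": 2101, "40": 1967, "75": 1906,
    "300": 1430, "33": 1160, "17": 1037, "400": 944, "23": 812,
    "250": 765, "60": 737, "440": 677, "500": 636, "80": 573,
}

listed_rows = sum(wid_counts.values())

# Rows whose width is not a multiple of 5, after casting the token to int.
off_grid = {tok: n for tok, n in wid_counts.items() if int(tok) % 5 != 0}
off_grid_rows = sum(off_grid.values())
round_share = (listed_rows - off_grid_rows) / listed_rows

print(f"rows on multiples of 5: {round_share:.1%} of listed rows")
print("values off the grid:", sorted(off_grid, key=int))
```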
Numeric correlation
date text
Sample values (first 10)
- 1950-05-16
- 2009-06-27
- 2016-05-26
- 2013-06-21
- 1990-09-09
- 2022-04-05
- 1991-08-15
- 2018-04-15
- 2013-07-01
- 1956-04-03
time text
Sample values (first 10)
- 18:00:00
- 13:55:00
- 19:01:00
- 12:58:00
- 14:45:00
- 07:14:00
- 15:35:00
- 16:56:00
- 18:41:00
- 19:45:00
state categorical
Top values (rank 1–20)
- TX — 9,345
- KS — 4,474
- OK — 4,221
- FL — 3,620
- NE — 3,056
- IA — 2,887
- IL — 2,835
- MS — 2,657
- AL — 2,529
- MO — 2,462
- CO — 2,425
- LA — 2,305
- MN — 2,118
- AR — 1,981
- SD — 1,917
- GA — 1,898
- ND — 1,640
- IN — 1,610
- WI — 1,515
- NC — 1,472
mag categorical
Top values (rank 1–20)
- 0 — 32,218
- 1 — 23,782
- 2 — 9,767
- 3 — 2,585
- -9 — 1,024
- 4 — 587
- 5 — 59
injuries categorical
Top values (rank 1–20)
- 0 — 62,177
- 1 — 2,480
- 2 — 1,388
- 3 — 770
- 4 — 484
- 5 — 385
- 6 — 300
- 7 — 194
- 8 — 171
- 10 — 141
- 12 — 120
- 9 — 117
- 11 — 80
- 20 — 71
- 15 — 69
- 13 — 66
- 14 — 55
- 30 — 46
- 25 — 44
- 16 — 43
fatalities categorical
Top values (rank 1–20)
- 0 — 68,423
- 1 — 830
- 2 — 277
- 3 — 134
- 4 — 77
- 5 — 46
- 6 — 45
- 7 — 32
- 9 — 15
- 10 — 15
- 11 — 14
- 8 — 13
- 16 — 12
- 13 — 8
- 17 — 7
- 18 — 6
- 21 — 6
- 12 — 6
- 22 — 5
- 25 — 4
loss text
Sample values (first 10)
- 3.0
- 0.0
- 0
- 0.0
- 3.0
- 0
- 5.0
- 100000
- 0.0
- 4.0
slat numeric
slon numeric
elat numeric
elon numeric
len text
Sample values (first 10)
- 0.2
- 0.22
- 5.4400
- 0.21
- 0.3
- 2.2100
- 0.8
- 1.7000
- 0.8
- 0.2
wid categorical
Top values (rank 1–20)
- 10 — 14,417
- 50 — 10,366
- 100 — 7,067
- 30 — 4,772
- 20 — 4,368
- 200 — 2,946
- 25 — 2,452
- 150 — 2,101
- 40 — 1,967
- 75 — 1,906
- 300 — 1,430
- 33 — 1,160
- 17 — 1,037
- 400 — 944
- 23 — 812
- 250 — 765
- 60 — 737
- 440 — 677
- 500 — 636
- 80 — 573