saturn

/home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json 70,022 rows sample n=70,022 seed 42 2026-05-01T23:12:11+00:00

Overview

Source	/home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json
Total rows	70,022
Profiled sample	70,022
Columns	13
Generated	2026-05-01T23:12:11+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This is a tornado event log with 70,022 rows and 13 columns covering dates, times, U.S. state, magnitude, injuries, fatalities, loss, start/end coordinates, length, and width. Geographically it is a U.S.-centered dataset: starting longitudes average around -92.7 and latitudes around 37.1, with Texas (13.3% of records), Kansas, and Oklahoma leading the state counts. The severity fields are highly imbalanced (fatalities are 0 in 97.7% of events and injuries are 0 in 88.8%), so any analysis of harm should focus on the rare non-zero tail. Magnitude (mag) is a more usable categorical signal with 7 levels, dominated by 0 (46%) and 1 (34%), although one level, -9 (1,024 rows), looks like a missing-value sentinel rather than a real magnitude. Note that the end-coordinate columns (elat, elon) are null in ~37.6% of rows, which matters if you plan to draw tornado tracks rather than just start points.

date high anthropic:claude-opus-4-7

This is a date column stored as ISO-formatted text (YYYY-MM-DD), with every value exactly 10 characters long and one token. Across 70,022 rows there are only 12,639 unique dates and an 81.9% duplicate rate, so many records share a date; the top value, 2011-04-27, appears 207 times. The sample values alone run from 1950-05-16 to 2022-04-05, and values as late as 2023-03-31 appear, so the column covers at least 1950 to 2023. The uniform ISO layout indicates a date that was typed as text: parse it to a proper date type before use.
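
Because every value is a fixed-width ISO string, the standard library can parse it directly. The following is a minimal sketch using the ten sample values listed for this column in the report; the variable names are illustrative and not part of saturn.

```python
from datetime import date

# The ten sample values shown for the `date` column in this report.
samples = [
    "1950-05-16", "2009-06-27", "2016-05-26", "2013-06-21", "1990-09-09",
    "2022-04-05", "1991-08-15", "2018-04-15", "2013-07-01", "1956-04-03",
]

# date.fromisoformat raises ValueError on anything that is not a valid
# YYYY-MM-DD date, so a clean pass also validates the format.
parsed = [date.fromisoformat(s) for s in samples]

earliest, latest = min(parsed), max(parsed)
print(earliest, latest)
```

Running the same parse over the full column would give the true date range rather than the sample-based lower bound.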

time high anthropic:claude-opus-4-7

This column holds clock times stored as 8-character HH:MM:SS strings, with all 70,022 rows non-null and uniform length (len_min=len_max=8). Only 1,438 distinct values appear and 97.95% are duplicates, with afternoon/evening slots like 16:00:00 (978), 17:00:00 (971) and 15:00:00 (959) dominating — consistent with event start times rather than free text. It's mistyped as text: parse to a proper time type before use.
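
A small sketch of the suggested parse, again limited to the ten sample values shown for this column. It converts the strings to `datetime.time` objects and buckets them by hour, which is one way to check the afternoon/evening concentration on the full data. The names `samples` and `by_hour` are illustrative.

```python
from collections import Counter
from datetime import time

# The ten sample values shown for the `time` column in this report.
samples = [
    "18:00:00", "13:55:00", "19:01:00", "12:58:00", "14:45:00",
    "07:14:00", "15:35:00", "16:56:00", "18:41:00", "19:45:00",
]

# time.fromisoformat accepts HH:MM:SS and raises ValueError otherwise.
parsed = [time.fromisoformat(s) for s in samples]

# Count events per hour of day (0-23).
by_hour = Counter(t.hour for t in parsed)
print(sorted(by_hour.items()))
```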

state high anthropic:claude-opus-4-7

This is a US state code column with 53 distinct values, slightly more than the 50 states (likely including DC and territories like PR). TX dominates at 13.3% of 70,022 rows, followed by KS (4,474) and OK (4,221), giving a clear southern/plains skew rather than a uniform national distribution. Entropy ratio of 0.847 confirms the distribution is fairly spread but not flat, and there are no nulls.

mag high anthropic:claude-opus-4-7

`mag` is a low-cardinality categorical with 7 distinct values dominated by '0' (46% of 70,022 rows), with counts decreasing through '1', '2', '3', '4', '5'. The ordered integer levels suggest a magnitude or severity code rather than a free category. The standout signal is '-9' (1,024 rows, about 1.5%): almost certainly a sentinel for missing/unknown that is not being counted as null, so the reported 0.0% null rate understates missingness.
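
One way to handle the suspected sentinel is to map it to a real missing value at load time. This is a sketch under the assumption that '-9' means "unknown", which the report infers but does not confirm; `parse_mag` and `MISSING_SENTINEL` are hypothetical names.

```python
from typing import Optional

# Assumption: '-9' encodes an unknown magnitude (inferred, not documented).
MISSING_SENTINEL = "-9"

def parse_mag(raw: str) -> Optional[int]:
    """Return the magnitude as an int, or None for the assumed sentinel."""
    if raw == MISSING_SENTINEL:
        return None
    return int(raw)

raw_values = ["0", "1", "-9", "3", "5"]
clean = [parse_mag(v) for v in raw_values]

# Downstream statistics should use only the known magnitudes.
known = [m for m in clean if m is not None]
print(clean, max(known))
```

Keeping the sentinel out of numeric summaries matters here because -9 would otherwise drag down any mean or minimum computed on the column.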

injuries high anthropic:claude-opus-4-7

Counts of injuries stored as strings, with 209 distinct values across 70,022 rows and no nulls. The distribution is severely zero-inflated: '0' accounts for 88.8% of records, and the entropy ratio is just 0.123. Counts for '1' through '10' decay quickly, but 209 unique tokens suggest a very long tail or non-integer entries worth inspecting.

fatalities high anthropic:claude-opus-4-7

This is a fatality count per event, stored as a categorical/string field with 50 distinct integer values and no nulls across 70,022 rows. The distribution is extremely imbalanced: '0' accounts for 97.72% of records, giving an entropy ratio of just 0.039, with '1' at 830 rows and a long thin tail (2 → 277, 3 → 134, down to single-digit counts at higher values). Despite being typed as categorical, the values are numeric and ordered, so the current encoding is likely a load artifact.
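
To make the imbalance concrete, the sketch below derives the size of the non-zero tail from the counts in this report and shows how that tail is split among the most common values. It only rearranges numbers already listed; the variable names are illustrative.

```python
# Counts taken from this report's fatalities section.
TOTAL_ROWS = 70_022
ZERO_FATALITY_ROWS = 68_423

# Events with at least one fatality.
nonzero_rows = TOTAL_ROWS - ZERO_FATALITY_ROWS
nonzero_share = nonzero_rows / TOTAL_ROWS

# Reported counts for the most common non-zero values.
top_counts = {1: 830, 2: 277, 3: 134, 4: 77, 5: 46, 6: 45, 7: 32}

# Share of the fatal-event tail taken by each value.
share_within_tail = {k: v / nonzero_rows for k, v in top_counts.items()}

print(nonzero_rows, f"{nonzero_share:.2%}")
print({k: round(v, 3) for k, v in share_within_tail.items()})
```

Any model of harm is effectively fitted on these roughly 1,600 events, not on the full 70,022 rows.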

loss high anthropic:claude-opus-4-7

Numeric loss values stored as text strings, with all 70,022 entries being single tokens (one_word_rate 1.0) and lengths of 1-10 characters. The distribution is heavily concentrated: '0.0' alone accounts for 22,764 rows and the duplicate_rate is 0.985 across only 1,019 unique values. Mixed formatting is a hazard — '0' and '0.0' appear as separate tokens, so a naive cast will collapse them but string-based grouping won't.
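
The formatting hazard can be seen on the ten sample values shown for this column. The sketch groups them once as strings and once after a `float` cast; the names are illustrative.

```python
from collections import Counter

# The ten sample values shown for the `loss` column in this report.
samples = ["3.0", "0.0", "0", "0.0", "3.0", "0", "5.0", "100000", "0.0", "4.0"]

# Grouping on the raw text keeps '0' and '0.0' apart.
as_text = Counter(samples)

# Casting first merges them into a single numeric zero.
as_number = Counter(float(s) for s in samples)

print(as_text["0"], as_text["0.0"], as_number[0.0])
print(len(as_text), len(as_number))
```

Whichever behaviour is wanted, it is worth choosing deliberately: the cast changes group counts, and the string view hides that two tokens mean the same amount.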

slat high anthropic:claude-opus-4-7

Values range from 17.72 to 61.02 with a mean of 37.14 and median of 37.03, consistent with a starting latitude (slat) field in decimal degrees. Most of the mass sits at contiguous-US latitudes (q1 33.19, q3 40.93), but the extremes reach beyond them: 17.72 is consistent with the U.S. Caribbean territories and 61.02 with Alaska. The distribution is essentially symmetric (skew 0.04, kurtosis -0.58) with a tight IQR of 7.74 and only 70 outliers (0.10%). No nulls and no zeros, and 16,016 unique values across 70,022 rows suggest repeated coordinates rather than free-form noise.
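
The report does not say how outliers are defined. As an illustration only, the sketch below applies the common Tukey rule (1.5 times the IQR beyond the quartiles) to the quartiles reported for slat; whether saturn uses this rule is an assumption, and `is_outlier` is a hypothetical helper.

```python
# Quartiles reported for slat in this report.
Q1, Q3 = 33.190, 40.930

# Assumption: Tukey fences at 1.5 * IQR; saturn's actual rule is not stated.
iqr = Q3 - Q1
lower_fence = Q1 - 1.5 * iqr
upper_fence = Q3 + 1.5 * iqr

def is_outlier(lat: float) -> bool:
    """True if the latitude falls outside the Tukey fences."""
    return lat < lower_fence or lat > upper_fence

print(round(lower_fence, 2), round(upper_fence, 2))
print(is_outlier(17.721), is_outlier(61.020), is_outlier(37.025))
```

Under this rule both the reported minimum and maximum fall outside the fences, while the median does not.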

slon high anthropic:claude-opus-4-7

Values are negative decimal degrees ranging from -163.53 to -64.72 with a median of -93.5, consistent with western-hemisphere longitudes (the 'slon' name suggests starting longitude). The distribution is mildly left-skewed (-0.32) and concentrated within an IQR of ~11.7 degrees (q1 -98.4, q3 -86.7), the central-US longitude band. Only 951 values (1.36%) are flagged as outliers, and there are no nulls or zeros.

elat high anthropic:claude-opus-4-7

Almost certainly the ending latitude in decimal degrees (the counterpart of slat), with values spanning 17.72 to 61.02, the same range as the start latitudes. The distribution is roughly symmetric (skew 0.03, kurtosis -0.41) and centered near 37.26, with only 78 outliers among the non-null values. The standout concern is that 37.65% of rows (26,363) are null, so coverage is partial.

elon high anthropic:claude-opus-4-7

This is almost certainly the ending longitude (the 'e' pairs with the 's' of slon, so it reads as end, not east), with values bounded between -163.53 and -64.72 and a median of -92.47, consistent with points across North America. The distribution is moderately left-skewed (-0.60) with kurtosis 3.77 and an IQR of 11.26 around the median. Notably, 37.65% of rows are null, the same null count as elat (26,363), which will materially shrink any track-based analysis.
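
For track-based work, rows without end coordinates have to be skipped or handled explicitly. The sketch below computes a straight-line (great-circle) distance with the haversine formula and returns `None` when either end coordinate is missing. `track_km` is a hypothetical helper, the coordinates in the usage lines are made up for illustration, and a great-circle distance is only an approximation of a real tornado path.

```python
import math
from typing import Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def track_km(
    slat: float,
    slon: float,
    elat: Optional[float],
    elon: Optional[float],
) -> Optional[float]:
    """Great-circle distance in km, or None if the end point is missing."""
    if elat is None or elon is None:
        return None
    phi1, phi2 = math.radians(slat), math.radians(elat)
    dphi = phi2 - phi1
    dlmb = math.radians(elon - slon)
    # Haversine formula.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Illustrative points only: one complete track, one with no end point.
print(track_km(37.0, -93.0, 38.0, -93.0))
print(track_km(37.0, -93.0, None, None))
```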

len high anthropic:claude-opus-4-7

Despite being typed as text, `len` holds short numeric tokens (length 3-8, one word each) like '0.1', '0.5', '1.0' — almost certainly a length or size measurement stored as strings. Values are highly concentrated: '0.1' alone covers 15,456 of 70,022 rows and the duplicate rate is 94.8% across only 3,663 unique tokens. The allcaps and one_word alerts are artefacts of numeric strings rather than real text signal.

wid high anthropic:claude-opus-4-7

wid is a categorical column with 419 distinct values, all numeric-looking strings like '10', '50', '100', '30'. It is almost certainly a width measurement stored as text rather than a true category. The distribution is heavily concentrated on round numbers: '10' alone covers 20.6% of 70,022 rows and the top ten values are all multiples of 5, giving an entropy ratio of 0.51. There are no nulls; the surprise is the string encoding of clearly numeric values.

Numeric correlation

date text

100.0% rows are a single word
100.0% rows are all-caps
95th-percentile length under 20 chars
81.9% duplicate strings

rows	70,022
null	0 (0.0%)
unique	12,639
len_min	10
len_max	10
len_mean	10.000
len_median	10.000
len_p95	10.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	57,383
duplicate_rate	0.819
vocab_size	7,831
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Sample values (first 10)

1950-05-16
2009-06-27
2016-05-26
2013-06-21
1990-09-09
2022-04-05
1991-08-15
2018-04-15
2013-07-01
1956-04-03

time text

100.0% rows are a single word
100.0% rows are all-caps
95th-percentile length under 20 chars
97.9% duplicate strings

rows	70,022
null	0 (0.0%)
unique	1,438
len_min	8
len_max	8
len_mean	8.000
len_median	8.000
len_p95	8.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	68,584
duplicate_rate	0.979
vocab_size	1,352
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Sample values (first 10)

18:00:00
13:55:00
19:01:00
12:58:00
14:45:00
07:14:00
15:35:00
16:56:00
18:41:00
19:45:00

state categorical

rows	70,022
null	0 (0.0%)
unique	53
top_value	TX
top_rate	0.133
cardinality	53
entropy	4.851
entropy_ratio	0.847

Top values (rank 1–20)

TX — 9,345
KS — 4,474
OK — 4,221
FL — 3,620
NE — 3,056
IA — 2,887
IL — 2,835
MS — 2,657
AL — 2,529
MO — 2,462
CO — 2,425
LA — 2,305
MN — 2,118
AR — 1,981
SD — 1,917
GA — 1,898
ND — 1,640
IN — 1,610
WI — 1,515
NC — 1,472

mag categorical

rows	70,022
null	0 (0.0%)
unique	7
top_value	0
top_rate	0.460
cardinality	7
entropy	1.772
entropy_ratio	0.631

Top values (rank 1–20)

0 — 32,218
1 — 23,782
2 — 9,767
3 — 2,585
-9 — 1,024
4 — 587
5 — 59
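
Since all seven mag levels are listed, the entropy figures can be recomputed from the counts. The sketch assumes that saturn reports Shannon entropy in bits and that entropy_ratio is entropy divided by log2 of the cardinality; neither definition is stated in the report, but the results land close to the reported 1.772 and 0.631.

```python
import math

# Counts for every mag level, as listed in this report.
mag_counts = {
    "0": 32218, "1": 23782, "2": 9767, "3": 2585,
    "-9": 1024, "4": 587, "5": 59,
}

total = sum(mag_counts.values())

# Shannon entropy in bits: -sum(p * log2(p)).
entropy = -sum((c / total) * math.log2(c / total) for c in mag_counts.values())

# Assumed normalisation: divide by the maximum entropy for 7 categories.
entropy_ratio = entropy / math.log2(len(mag_counts))

print(total, round(entropy, 3), round(entropy_ratio, 3))
```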

injuries categorical

rows	70,022
null	0 (0.0%)
unique	209
top_value	0
top_rate	0.888
cardinality	209
entropy	0.945
entropy_ratio	0.123

Top values (rank 1–20)

0 — 62,177
1 — 2,480
2 — 1,388
3 — 770
4 — 484
5 — 385
6 — 300
7 — 194
8 — 171
10 — 141
12 — 120
9 — 117
11 — 80
20 — 71
15 — 69
13 — 66
14 — 55
30 — 46
25 — 44
16 — 43

fatalities categorical

top value is 97.7% of rows

rows	70,022
null	0 (0.0%)
unique	50
top_value	0
top_rate	0.977
cardinality	50
entropy	0.217
entropy_ratio	0.039

Top values (rank 1–20)

0 — 68,423
1 — 830
2 — 277
3 — 134
4 — 77
5 — 46
6 — 45
7 — 32
9 — 15
10 — 15
11 — 14
8 — 13
16 — 12
13 — 8
17 — 7
18 — 6
21 — 6
12 — 6
22 — 5
25 — 4

loss text

100.0% rows are a single word
92.5% rows are all-caps
95th-percentile length under 20 chars
98.5% duplicate strings

rows	70,022
null	0 (0.0%)
unique	1,019
len_min	1
len_max	10
len_mean	3.181
len_median	3.000
len_p95	5.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	69,003
duplicate_rate	0.985
vocab_size	503
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.925
boilerplate_rate	0.000

Sample values (first 10)

3.0
0.0
0
0.0
3.0
0
5.0
100000
0.0
4.0

slat numeric

rows	70,022
null	0 (0.0%)
unique	16,016
min	17.721
max	61.020
mean	37.137
median	37.025
std	5.090
q1	33.190
q3	40.930
iqr	7.740
skew	0.038
kurtosis	-0.582
n_outliers	70
outlier_rate	1.00e-03
zero_rate	0.000

slon numeric

rows	70,022
null	0 (0.0%)
unique	17,912
min	-163.530
max	-64.715
mean	-92.738
median	-93.500
std	8.677
q1	-98.400
q3	-86.691
iqr	11.709
skew	-0.323
kurtosis	2.156
n_outliers	951
outlier_rate	0.014
zero_rate	0.000

elat numeric

37.6% null

rows	70,022
null	26,363 (37.6%)
unique	16,965
min	17.721
max	61.020
mean	37.262
median	37.131
std	4.942
q1	33.490
q3	40.910
iqr	7.420
skew	0.034
kurtosis	-0.409
n_outliers	78
outlier_rate	1.79e-03
zero_rate	0.000

elon numeric

37.6% null

rows	70,022
null	26,363 (37.6%)
unique	18,586
min	-163.530
max	-64.715
mean	-92.193
median	-92.470
std	8.545
q1	-97.730
q3	-86.470
iqr	11.260
skew	-0.595
kurtosis	3.766
n_outliers	647
outlier_rate	0.015
zero_rate	0.000

len text

100.0% rows are a single word
100.0% rows are all-caps
95th-percentile length under 20 chars
94.8% duplicate strings

rows	70,022
null	0 (0.0%)
unique	3,663
len_min	3
len_max	8
len_mean	3.626
len_median	3.000
len_p95	6.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	66,359
duplicate_rate	0.948
vocab_size	2,204
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Sample values (first 10)

0.2
0.22
5.4400
0.21
0.3
2.2100
0.8
1.7000
0.8
0.2

wid categorical

rows	70,022
null	0 (0.0%)
unique	419
top_value	10
top_rate	0.206
cardinality	419
entropy	4.463
entropy_ratio	0.512

Top values (rank 1–20)

10 — 14,417
50 — 10,366
100 — 7,067
30 — 4,772
20 — 4,368
200 — 2,946
25 — 2,452
150 — 2,101
40 — 1,967
75 — 1,906
300 — 1,430
33 — 1,160
17 — 1,037
400 — 944
23 — 812
250 — 765
60 — 737
440 — 677
500 — 636
80 — 573
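
The round-number pattern in wid can be checked against the top-20 list above. The sketch casts the listed values to integers and picks out those that are not multiples of 5; it uses only the counts shown in this report, and the names are illustrative.

```python
# Top-20 wid values and their counts, as listed in this report.
top_wid_counts = {
    10: 14417, 50: 10366, 100: 7067, 30: 4772, 20: 4368,
    200: 2946, 25: 2452, 150: 2101, 40: 1967, 75: 1906,
    300: 1430, 33: 1160, 17: 1037, 400: 944, 23: 812,
    250: 765, 60: 737, 440: 677, 500: 636, 80: 573,
}

# Values that break the multiples-of-5 pattern, in rank order.
off_grid = [w for w in top_wid_counts if w % 5 != 0]
off_grid_rows = sum(top_wid_counts[w] for w in off_grid)

print(off_grid, off_grid_rows)
```

The top ten are all multiples of 5, but ranks 12, 13 and 15 are not, so the rounding pattern is strong rather than universal and those values are worth a closer look.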