quirky tornadoes

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json

Saturn profiled 70,022 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
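# Install the profiler (IPython shell escape; run once per environment).
!pip install saturn-dissect
import subprocess

# Run the analysis on the source file with the flags used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json",
    "--findings", "quirky-tornadoes.json",
    "--llm", "anthropic:claude-opus-4-7",
])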

Summary confidence: high

This is a tornado event log with 70,022 rows and 13 columns covering dates, times, start/end coordinates, magnitudes, widths, fatalities, injuries, and U.S. state. Geographically it is a U.S.-centered dataset: starting longitudes average around -92.7 and latitudes around 37.1, with Texas (13.3% of records), Kansas, and Oklahoma leading the state counts. The severity fields are highly imbalanced (fatalities are 0 in 97.7% of events and injuries are 0 in 88.8%), so any analysis of harm should focus on the rare non-zero tail. Magnitude (mag) is a more usable categorical signal with 7 levels, dominated by 0 (46%) and 1 (34%); one of the 7 levels is -9 (1.5%), which looks like a missing-value code rather than a magnitude. Note that the end-coordinate columns (elat, elon) are null in ~37.6% of rows, which matters if you plan to draw tornado tracks rather than just start points.

citing: row_count · column_count · columns.state.top_values · columns.state.top_rate · columns.fatalities.top_rate · columns.injuries.top_rate · columns.mag.top_values · columns.mag.top_rate · columns.elat.null_rate · columns.elon.null_rate · columns.slat.stats · columns.slon.stats
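
The `citing:` line lists dotted paths into the stats that the summary draws on. The sketch below shows one way to resolve such paths, assuming the findings file written by `--findings` is nested JSON whose keys mirror those paths. That layout is an assumption and is not confirmed by this report; `resolve` is a hypothetical helper, and the inline dict stands in for the real file.

```python
from functools import reduce

def resolve(findings: dict, path: str):
    """Follow a dotted path such as 'columns.state.top_rate' through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), findings)

# Stand-in for the parsed findings file; the nesting is assumed, the numbers
# are the ones quoted in this report.
findings = {
    "row_count": 70022,
    "column_count": 13,
    "columns": {
        "state": {"top_rate": 0.1335},
        "fatalities": {"top_rate": 0.9772},
    },
}

for path in ("row_count", "columns.state.top_rate", "columns.fatalities.top_rate"):
    print(path, "=", resolve(findings, path))
```

A missing key raises `KeyError`, which is usually what you want when checking that a cited stat actually exists.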

Out[4]:

saturn.schema() · 13 columns

column kind n null% unique alerts
date text 70,022 0.0% 12,639 one_word allcaps short_text duplicates
time text 70,022 0.0% 1,438 one_word allcaps short_text duplicates
state categorical 70,022 0.0% 53
mag categorical 70,022 0.0% 7
injuries categorical 70,022 0.0% 209
fatalities categorical 70,022 0.0% 50 imbalance
loss text 70,022 0.0% 1,019 one_word allcaps short_text duplicates
slat numeric 70,022 0.0% 16,016
slon numeric 70,022 0.0% 17,912
elat numeric 70,022 37.6% 16,965 null_rate
elon numeric 70,022 37.6% 18,586 null_rate
len text 70,022 0.0% 3,663 one_word allcaps short_text duplicates
wid categorical 70,022 0.0% 419
Fig 1.
state · Top tornado states — Texas, Kansas, and Oklahoma dominate the count.
Top values for state (20 unique shown, of 53 total).
value  count  share
TX     9345   13.3%
KS     4474   6.4%
OK     4221   6.0%
FL     3620   5.2%
NE     3056   4.4%
IA     2887   4.1%
IL     2835   4.0%
MS     2657   3.8%
AL     2529   3.6%
MO     2462   3.5%
CO     2425   3.5%
LA     2305   3.3%
MN     2118   3.0%
AR     1981   2.8%
SD     1917   2.7%
GA     1898   2.7%
ND     1640   2.3%
IN     1610   2.3%
WI     1515   2.2%
NC     1472   2.1%
Fig 2.
mag · Magnitude distribution skews heavily toward weak (0–1) tornadoes.
Top values for mag (7 unique shown, of 7 total).
value  count  share
0      32218  46.0%
1      23782  34.0%
2      9767   13.9%
3      2585   3.7%
-9     1024   1.5%
4      587    0.8%
5      59     0.1%
Fig 3.
fatalities · Fatalities are 0 in roughly 98% of events; the deadly tail is tiny but important.
Top values for fatalities (20 unique shown, of 50 total).
value  count  share
0      68423  97.7%
1      830    1.2%
2      277    0.4%
3      134    0.2%
4      77     0.1%
5      46     0.1%
6      45     0.1%
7      32     0.0%
9      15     0.0%
10     15     0.0%
11     14     0.0%
8      13     0.0%
16     12     0.0%
13     8      0.0%
17     7      0.0%
18     6      0.0%
21     6      0.0%
12     6      0.0%
22     5      0.0%
25     4      0.0%
Fig 4.
slat · Starting latitudes cluster around 37°N, consistent with the U.S. tornado belt.
Histogram bins for slat (median: 37.02505).
bin              count
17.72 – 18.8     33
18.8 – 19.89     6
19.89 – 20.97    8
20.97 – 22.05    26
22.05 – 23.13    1
23.13 – 24.22    0
24.22 – 25.3     76
25.3 – 26.38     487
26.38 – 27.46    748
27.46 – 28.55    1225
28.55 – 29.63    1468
29.63 – 30.71    3607
30.71 – 31.79    3441
31.79 – 32.88    5003
32.88 – 33.96    4775
33.96 – 35.04    5392
35.04 – 36.12    5164
36.12 – 37.21    4268
37.21 – 38.29    4028
38.29 – 39.37    4807
39.37 – 40.45    5498
40.45 – 41.54    5313
41.54 – 42.62    3995
42.62 – 43.7     3396
43.7 – 44.78     2407
44.78 – 45.87    1720
45.87 – 46.95    1257
46.95 – 48.03    1114
48.03 – 49.11    754
49.11 – 50.2     1
50.2 – 51.28     0
51.28 – 52.36    0
52.36 – 53.44    0
53.44 – 54.53    0
54.53 – 55.61    1
55.61 – 56.69    0
56.69 – 57.77    0
57.77 – 58.86    0
58.86 – 59.94    1
59.94 – 61.02    2
Fig 5.
slon · Starting longitudes concentrate around -93°, reinforcing the central-U.S. footprint.
Histogram bins for slon (median: -93.5).
bin                count
-163.5 – -161.1    2
-161.1 – -158.6    9
-158.6 – -156.1    25
-156.1 – -153.6    8
-153.6 – -151.2    0
-151.2 – -148.7    0
-148.7 – -146.2    0
-146.2 – -143.8    1
-143.8 – -141.3    0
-141.3 – -138.8    0
-138.8 – -136.4    0
-136.4 – -133.9    0
-133.9 – -131.4    0
-131.4 – -128.9    0
-128.9 – -126.5    0
-126.5 – -124      15
-124 – -121.5      231
-121.5 – -119.1    241
-119.1 – -116.6    282
-116.6 – -114.1    174
-114.1 – -111.7    399
-111.7 – -109.2    323
-109.2 – -106.7    325
-106.7 – -104.2    1782
-104.2 – -101.8    4689
-101.8 – -99.3     6094
-99.3 – -96.83     9665
-96.83 – -94.36    8344
-94.36 – -91.89    6626
-91.89 – -89.42    6618
-89.42 – -86.95    6086
-86.95 – -84.48    5361
-84.48 – -82.01    4363
-82.01 – -79.54    3931
-79.54 – -77.07    1885
-77.07 – -74.6     1618
-74.6 – -72.13     541
-72.13 – -69.66    285
-69.66 – -67.19    71
-67.19 – -64.72    28
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column      kind         null %
date        text         0.0%
time        text         0.0%
state       categorical  0.0%
mag         categorical  0.0%
injuries    categorical  0.0%
fatalities  categorical  0.0%
loss        text         0.0%
slat        numeric      0.0%
slon        numeric      0.0%
elat        numeric      37.6%
elon        numeric      37.6%
len         text         0.0%
wid         categorical  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
      slat   slon   elat   elon
slat  +1.00  -0.17  -0.02  -0.03
slon  -0.17  +1.00  -0.04  -0.04
elat  -0.02  -0.04  +1.00  -0.16
elon  -0.03  -0.04  -0.16  +1.00

date text timestamp

This is a date column stored as ISO-formatted text (YYYY-MM-DD), with every value exactly 10 characters long and one token. Across 70,022 rows there are only 12,639 unique dates and an 81.9% duplicate rate, so many records share a date; the most frequent, 2011-04-27, appears 207 times. The observed values span at least 1974-04-03 to 2023-03-31. The uniform ISO format indicates a date field that was typed as text rather than as a date.

Treatment: Cast to a proper date type and use for temporal joins or time-based features.
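
A minimal sketch of the cast, using only the standard library and assuming every value really is a 10-character YYYY-MM-DD string as the stats indicate. The sample list reuses the three dates quoted in the note; `parse_date` is a hypothetical helper name.

```python
from datetime import date

def parse_date(text: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError on malformed input."""
    return date.fromisoformat(text)

sample = ["1974-04-03", "2011-04-27", "2023-03-31"]  # dates quoted in the note
parsed = [parse_date(s) for s in sample]

# Once typed, dates support calendar features and arithmetic directly.
print(parsed[1].year, parsed[1].month)
print((parsed[-1] - parsed[0]).days, "days between first and last sample")
```

Letting `ValueError` propagate is deliberate here: with a column reported as 100% uniform, any parse failure is worth seeing rather than silently coercing.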

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["date"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 12,639
len_min 10
len_max 10
len_mean 10
len_median 10
len_p95 10
word_mean 1
word_median 1
n_empty 0
n_duplicates 57,383
duplicate_rate 0.8195
vocab_size 7,831
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 81.9% duplicate strings
Fig 8.
Character-length distribution for date.
Character-length distribution for date (mean: 10.0).
chars  count
10     70022
(all other histogram bins are empty)

time text timestamp

This column holds clock times stored as 8-character HH:MM:SS strings, with all 70,022 rows non-null and of uniform length (len_min = len_max = 8). Only 1,438 distinct values appear and 97.95% of rows are duplicates, with afternoon slots such as 16:00:00 (978), 17:00:00 (971) and 15:00:00 (959) dominating, which is consistent with event start times rather than free text. The column is mistyped as text: parse it to a proper time type before use.

Treatment: Cast to a time type and bucket by hour for modelling.
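
The cast-and-bucket step might look like the following, a standard-library sketch that assumes well-formed HH:MM:SS strings. The sample values are illustrative, and `hour_of` is a hypothetical helper.

```python
from collections import Counter
from datetime import time

def hour_of(text: str) -> int:
    """Parse an HH:MM:SS string and return its hour (0-23)."""
    return time.fromisoformat(text).hour

sample = ["16:00:00", "17:00:00", "15:00:00", "16:45:00"]  # illustrative values
by_hour = Counter(hour_of(s) for s in sample)

print(sorted(by_hour.items()))
```

Bucketing to 24 hourly levels also removes most of the 1,438-value cardinality, which is the point of the treatment.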

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["time"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 1,438
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 68,584
duplicate_rate 0.9795
vocab_size 1,352
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.9% duplicate strings
Fig 9.
Character-length distribution for time.
Character-length distribution for time (mean: 8.0).
chars  count
8      70022
(all other histogram bins are empty)

state categorical feature

This is a US state code column with 53 distinct values, slightly more than the 50 states (likely including DC and territories like PR). TX dominates at 13.3% of 70,022 rows, followed by KS (4,474) and OK (4,221), giving a clear southern/plains skew rather than a uniform national distribution. Entropy ratio of 0.847 confirms the distribution is fairly spread but not flat, and there are no nulls.

Treatment: One-hot or target-encode; consider grouping low-frequency states into an 'Other' bucket.
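
As a rough illustration of the grouping and one-hot steps, here is a dependency-free sketch. The 20% threshold and the toy list are chosen only so the effect is visible; `group_rare` and `one_hot` are hypothetical helpers, not part of saturn.

```python
from collections import Counter

def group_rare(values, min_share=0.02, other="Other"):
    """Replace categories whose share of rows is below min_share with `other`."""
    counts = Counter(values)
    total = len(values)
    keep = {v for v, c in counts.items() if c / total >= min_share}
    return [v if v in keep else other for v in values]

def one_hot(values):
    """Return (sorted category list, one 0/1 indicator row per input value)."""
    cats = sorted(set(values))
    return cats, [[int(v == c) for c in cats] for v in values]

# Toy sample, not real rows: 'VT' and 'RI' are rare here by construction.
states = ["TX"] * 6 + ["KS"] * 3 + ["VT", "RI"]
grouped = group_rare(states, min_share=0.2)
cats, rows = one_hot(grouped)

print(cats)
print(rows[0], rows[-1])
```

On the real column a much lower threshold would be appropriate, since even the 20th most common state holds about 2% of rows.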

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["state"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 53
top_value TX
top_rate 0.1335
cardinality 53
entropy 4.851
entropy_ratio 0.8468
Fig 10.
Top values for state.
Data table: identical to the table under Fig 1 (top 20 of 53 state values).

mag categorical feature

`mag` is a low-cardinality categorical with 7 distinct values, dominated by '0' (46% of 70,022 rows), with counts decreasing through '1', '2', '3', '4', '5'. The ordered integer levels suggest a magnitude or severity code rather than a free category. The standout signal is '-9' (1,024 rows), almost certainly a sentinel for missing or unknown magnitude that is not being counted as null.

Treatment: Recode '-9' as missing, then treat as an ordinal feature.
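
One way to apply the recode, sketched with plain Python. Treating '-9' as "unknown" follows the interpretation in the note and is not confirmed by the data itself; `recode_mag` is a hypothetical helper and the sample values are illustrative.

```python
from typing import Optional

SENTINEL = "-9"  # assumed to mean "unknown magnitude", per the note

def recode_mag(raw: str) -> Optional[int]:
    """Map the sentinel to None and every other level to its integer rank."""
    return None if raw == SENTINEL else int(raw)

sample = ["0", "1", "-9", "3", "5"]  # illustrative values
mags = [recode_mag(m) for m in sample]
known = [m for m in mags if m is not None]

print(mags)
print(max(known), sum(known) / len(known))  # ordinal stats now skip the sentinel
```

Without the recode, -9 would sort below 0 and drag down any mean or rank-based statistic.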

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["mag"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 7
top_value 0
top_rate 0.4601
cardinality 7
entropy 1.772
entropy_ratio 0.6312
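
The entropy and entropy_ratio stats can be checked against the mag value counts listed under Fig 2. The sketch assumes entropy is Shannon entropy in bits and that the ratio divides by log2 of the cardinality. Both are assumptions about how saturn defines these stats, though they should land close to the reported 1.772 and 0.6312.

```python
import math

# Value counts for mag as reported under Fig 2.
counts = {"0": 32218, "1": 23782, "2": 9767, "3": 2585, "-9": 1024, "4": 587, "5": 59}

total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
entropy_ratio = entropy / math.log2(len(counts))  # 1.0 would mean a uniform distribution

print(round(entropy, 3), round(entropy_ratio, 4))
```

The same calculation is not reproducible here for state, injuries, or fatalities, because only the top 20 values of those columns are listed.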
Fig 11.
Top values for mag.
Data table: identical to the table under Fig 2 (all 7 mag values).

injuries categorical numeric_target

Counts of injuries stored as strings, with 209 distinct values across 70,022 rows and no nulls. The distribution is severely zero-inflated: '0' accounts for 88.8% of records, and the entropy ratio is just 0.123. Counts for '1' through '10' decay quickly, but 209 unique tokens suggest a very long tail or non-integer entries worth inspecting.

Treatment: Cast to integer and model as a zero-inflated count (e.g., hurdle or ZIP regression).
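
A hurdle model separates "was anyone injured?" from "how many, given at least one". The sketch below shows only that descriptive split, not a fitted regression, on toy values; `hurdle_summary` is a hypothetical helper.

```python
def hurdle_summary(raw_counts):
    """Split a count column into P(count > 0) and the mean of the positive counts."""
    counts = [int(x) for x in raw_counts]  # strings -> integers
    positive = [c for c in counts if c > 0]
    p_nonzero = len(positive) / len(counts)
    mean_if_nonzero = sum(positive) / len(positive) if positive else 0.0
    return p_nonzero, mean_if_nonzero

sample = ["0"] * 8 + ["1", "3"]  # toy values, zero-heavy like the real column
p, m = hurdle_summary(sample)

print(p, m)
```

`int()` raises `ValueError` on any non-integer token, which is a cheap way to surface the odd entries the note suggests inspecting.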

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["injuries"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 209
top_value 0
top_rate 0.888
cardinality 209
entropy 0.9454
entropy_ratio 0.1227
Fig 12.
Top values for injuries.
Top values for injuries (20 unique shown, of 209 total).
value  count  share
0      62177  88.8%
1      2480   3.5%
2      1388   2.0%
3      770    1.1%
4      484    0.7%
5      385    0.5%
6      300    0.4%
7      194    0.3%
8      171    0.2%
10     141    0.2%
12     120    0.2%
9      117    0.2%
11     80     0.1%
20     71     0.1%
15     69     0.1%
13     66     0.1%
14     55     0.1%
30     46     0.1%
25     44     0.1%
16     43     0.1%

fatalities categorical numeric_target

This is a fatality count per event, stored as a categorical/string field with 50 distinct integer values and no nulls across 70,022 rows. The distribution is extremely imbalanced: '0' accounts for 97.72% of records, giving an entropy ratio of just 0.039, with '1' at 830 rows and a long thin tail (2 → 277, 3 → 134, down to single-digit counts at higher values). Despite being typed as categorical, the values are numeric and ordered, so the current encoding is likely a load artifact.

Treatment: Cast to integer and model as a zero-inflated count rather than a categorical.

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["fatalities"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 50
top_value 0
top_rate 0.9772
cardinality 50
entropy 0.2174
entropy_ratio 0.03852
alert: imbalance · top value is 97.7% of rows
Fig 13.
Top values for fatalities.
Data table: identical to the table under Fig 3 (top 20 of 50 fatalities values).

loss text feature

Numeric loss values stored as text strings, with all 70,022 entries being single tokens (one_word_rate 1.0) and lengths of 1-10 characters. The distribution is heavily concentrated: '0.0' alone accounts for 22,764 rows and the duplicate_rate is 0.985 across only 1,019 unique values. Mixed formatting is a hazard: '0' and '0.0' appear as separate tokens, so string-based grouping counts them separately even though a numeric cast maps both to the same value.

Treatment: Cast to float (normalising '0' vs '0.0') before any modelling or aggregation.
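
The '0' versus '0.0' hazard is easy to demonstrate. This small sketch uses toy strings in the mixed formats described in the note, not real rows.

```python
from collections import Counter

sample = ["0", "0.0", "2.5", "0.0", "50000.0"]  # toy strings in mixed formats

by_string = Counter(sample)                   # '0' and '0.0' are distinct keys
by_value = Counter(float(s) for s in sample)  # both become 0.0 after the cast

print(by_string["0"], by_string["0.0"])
print(by_value[0.0])
```

Any group-by or distinct count should therefore run after the cast, not before it.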

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["loss"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 1,019
len_min 1
len_max 10
len_mean 3.181
len_median 3
len_p95 5
word_mean 1
word_median 1
n_empty 0
n_duplicates 69,003
duplicate_rate 0.9854
vocab_size 503
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.9251
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 92.5% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 98.5% duplicate strings
Fig 14.
Character-length distribution for loss.
Character-length distribution for loss (mean: 3.181).
chars  count
1      5248
2      0
3      50873
4      7040
5      4912
6      1581
7      292
8      58
9      16
10     2

slat numeric feature

Values range from 17.72 to 61.02 with a mean of 37.14 and median of 37.03, consistent with a starting latitude (slat) field in decimal degrees. Most values fall within the latitudes of the contiguous US, while the extremes reach beyond it: per the histogram, fewer than 100 rows lie below 24° or above 50°. The distribution is essentially symmetric (skew 0.04, kurtosis -0.58) with a tight IQR of 7.74 and only 70 outliers (0.10%). No nulls and no zeros, and 16,016 unique values across 70,022 rows suggest repeated coordinates rather than free-form noise.

Treatment: Use as-is for geospatial features; optionally pair with a longitude column for distance or region encoding.

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["slat"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 16,016
min 17.72
max 61.02
mean 37.14
median 37.03
std 5.09
q1 33.19
q3 40.93
iqr 7.74
skew 0.03792
kurtosis -0.5825
n_outliers 70
outlier_rate 0.0009997
zero_rate 0
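
The report does not state how n_outliers is defined. A common convention is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule to the reported slat quartiles. The choice of rule and of k = 1.5 is an assumption, so the resulting fences are illustrative rather than saturn's actual cut-offs.

```python
# Quartiles for slat as reported in the stats table.
q1, q3 = 33.19, 40.93
iqr = q3 - q1

# Tukey's rule with k = 1.5 is an assumption, not saturn's documented definition.
k = 1.5
lower_fence = q1 - k * iqr
upper_fence = q3 + k * iqr

def is_outlier(value: float) -> bool:
    """True when a latitude falls outside the [lower_fence, upper_fence] band."""
    return value < lower_fence or value > upper_fence

print(round(lower_fence, 2), round(upper_fence, 2))
print(is_outlier(17.72), is_outlier(37.03), is_outlier(61.02))
```

Under this rule the reported minimum and maximum would both be flagged, while the median would not.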
Fig 15.
Distribution of slat. Vertical dash marks the median.
Data table: identical to the histogram table under Fig 4 (40 bins, median 37.02505).

slon numeric feature

Values are negative decimal degrees ranging from -163.53 to -64.7151 with a median of -93.5, consistent with western-hemisphere longitudes (the 'slon' name suggests starting longitude). The distribution is mildly left-skewed (-0.32), and the middle half of the values spans an IQR of ~11.7 degrees (-98.4 to -86.69), the central US longitude band. Only 951 values (1.36%) are flagged as outliers, and there are no nulls or zeros.

Treatment: Use as a geographic coordinate; pair with the matching latitude column for spatial features rather than treating as a standalone scalar.
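
One concrete spatial feature that needs latitude and longitude together is great-circle distance. The following is a sketch of the haversine formula, assuming a spherical Earth with a mean radius of 6371 km. The two points are hypothetical, placed near the reported medians, and are not a real row.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical start and end points near the reported medians, not a real row.
print(haversine_km(37.03, -93.50, 37.13, -92.47))
```

A degree of longitude shrinks with latitude, which is why the raw slon difference alone is a poor distance proxy.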

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["slon"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 17,912
min -163.5
max -64.72
mean -92.74
median -93.5
std 8.677
q1 -98.4
q3 -86.69
iqr 11.71
skew -0.3229
kurtosis 2.156
n_outliers 951
outlier_rate 0.01358
zero_rate 0
Fig 16.
Distribution of slon. Vertical dash marks the median.
Data table: identical to the histogram table under Fig 5 (40 bins, median -93.5).

elat numeric feature

Almost certainly the ending latitude of each event in decimal degrees (the counterpart of slat), with values spanning 17.72 to 61.02, consistent with locations across North America. The distribution is roughly symmetric (skew 0.03, kurtosis -0.41) and centered near 37.26, with only 78 outliers (0.18% of non-null values). The standout concern is that 37.65% of rows are null, so coverage is partial.

Treatment: Pair with the matching longitude and impute or filter the 37.65% nulls before any spatial modelling.

anthropic:claude-opus-4-7 · confidence high
Out[40]:

saturn.columns["elat"].stats

stat value
n 70,022
nulls 26,363 (37.6%)
unique 16,965
min 17.72
max 61.02
mean 37.26
median 37.13
std 4.942
q1 33.49
q3 40.91
iqr 7.42
skew 0.03404
kurtosis -0.4085
n_outliers 78
outlier_rate 0.001787
zero_rate 0
alert: null_rate · 37.6% null
Fig 17.
Distribution of elat. Vertical dash marks the median.
Histogram bins for elat (median: 37.1309).
bin              count
17.72 – 18.8     33
18.8 – 19.89     6
19.89 – 20.97    8
20.97 – 22.05    26
22.05 – 23.13    1
23.13 – 24.22    0
24.22 – 25.3     41
25.3 – 26.38     262
26.38 – 27.46    369
27.46 – 28.55    534
28.55 – 29.63    668
29.63 – 30.71    1952
30.71 – 31.79    2190
31.79 – 32.88    3083
32.88 – 33.96    3018
33.96 – 35.04    3496
35.04 – 36.12    3427
36.12 – 37.21    2899
37.21 – 38.29    2790
38.29 – 39.37    3133
39.37 – 40.45    3414
40.45 – 41.54    3272
41.54 – 42.62    2540
42.62 – 43.7     2078
43.7 – 44.78     1428
44.78 – 45.87    1122
45.87 – 46.95    745
46.95 – 48.03    676
48.03 – 49.11    443
49.11 – 50.2     1
50.2 – 51.28     0
51.28 – 52.36    0
52.36 – 53.44    0
53.44 – 54.53    0
54.53 – 55.61    1
55.61 – 56.69    0
56.69 – 57.77    0
57.77 – 58.86    0
58.86 – 59.94    1
59.94 – 61.02    2

elon numeric feature

This is almost certainly the ending longitude (the counterpart of slon), with values bounded between -163.53 and -64.7151 and a median of -92.47. All values are negative, i.e. west of the prime meridian, consistent with points across North America. The distribution is moderately left-skewed (-0.60) with kurtosis 3.77 and an IQR of 11.26 around the median. Notably, 37.65% of rows are null, which will materially shrink any analysis that needs end points.

Treatment: Pair with the matching latitude column and filter out the ~38% null rows before spatial analysis.
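
A sketch of the filtering step on toy rows (not real data), where `None` stands for a null end coordinate. Rows without an end point can still be kept as start-only points rather than discarded. `has_track` is a hypothetical helper.

```python
# Toy rows, not real data; None stands for a null end coordinate.
rows = [
    {"slat": 35.2, "slon": -97.4, "elat": 35.3, "elon": -97.2},
    {"slat": 32.8, "slon": -96.8, "elat": None, "elon": None},
    {"slat": 38.9, "slon": -95.3, "elat": 39.0, "elon": -95.1},
]

def has_track(row) -> bool:
    """A track needs both end coordinates to be present."""
    return row["elat"] is not None and row["elon"] is not None

tracks = [((r["slat"], r["slon"]), (r["elat"], r["elon"])) for r in rows if has_track(r)]
points_only = [(r["slat"], r["slon"]) for r in rows if not has_track(r)]

print(len(tracks), len(points_only))
```

Since elat and elon report the same null count (26,363), the two columns are probably null together, but checking both costs nothing.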

anthropic:claude-opus-4-7 · confidence high
Out[43]:

saturn.columns["elon"].stats

stat value
n 70,022
nulls 26,363 (37.6%)
unique 18,586
min -163.5
max -64.72
mean -92.19
median -92.47
std 8.545
q1 -97.73
q3 -86.47
iqr 11.26
skew -0.5954
kurtosis 3.766
n_outliers 647
outlier_rate 0.01482
zero_rate 0
alert: null_rate · 37.6% null
Fig 18.
Distribution of elon. Vertical dash marks the median.
Histogram bins for elon (median: -92.47).
bin                count
-163.5 – -161.1    2
-161.1 – -158.6    9
-158.6 – -156.1    25
-156.1 – -153.6    8
-153.6 – -151.2    0
-151.2 – -148.7    0
-148.7 – -146.2    0
-146.2 – -143.8    1
-143.8 – -141.3    0
-141.3 – -138.8    0
-138.8 – -136.4    0
-136.4 – -133.9    0
-133.9 – -131.4    0
-131.4 – -128.9    0
-128.9 – -126.5    0
-126.5 – -124      8
-124 – -121.5      157
-121.5 – -119.1    158
-119.1 – -116.6    142
-116.6 – -114.1    82
-114.1 – -111.7    173
-111.7 – -109.2    176
-109.2 – -106.7    177
-106.7 – -104.2    774
-104.2 – -101.8    2464
-101.8 – -99.3     3415
-99.3 – -96.83     5507
-96.83 – -94.36    5157
-94.36 – -91.89    4463
-91.89 – -89.42    4668
-89.42 – -86.95    4417
-86.95 – -84.48    3706
-84.48 – -82.01    2739
-82.01 – -79.54    2302
-79.54 – -77.07    1344
-77.07 – -74.6     1056
-74.6 – -72.13     303
-72.13 – -69.66    152
-69.66 – -67.19    47
-67.19 – -64.72    27

len text feature

Despite being typed as text, `len` holds short numeric tokens (3-8 characters, one word each) such as '0.1', '0.5', '1.0', most plausibly a path-length measurement stored as strings. Values are highly concentrated: '0.1' alone covers 15,456 of 70,022 rows and the duplicate rate is 94.8% across only 3,663 unique tokens. The allcaps and one_word alerts are artefacts of numeric strings rather than real text signal.

Treatment: Cast to float and treat as a numeric feature rather than text.

anthropic:claude-opus-4-7 · confidence high
Out[46]:

saturn.columns["len"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 3,663
len_min 3
len_max 8
len_mean 3.626
len_median 3
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 66,359
duplicate_rate 0.9477
vocab_size 2,204
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 94.8% duplicate strings
Fig 19.
Character-length distribution for len.
Character-length distribution for len (mean: 3.626).
chars  count
3      47630
4      11687
5      795
6      9118
7      789
8      3

wid categorical feature

wid is a categorical column with 419 distinct values, all numeric-looking strings like '10', '50', '100', '30', almost certainly a width measurement stored as text rather than a true category. The distribution is heavily concentrated on round numbers: '10' alone covers 20.6% of 70,022 rows and the top ten values are all multiples of 5, giving an entropy ratio of 0.51. There are no nulls; the surprise is the string encoding of clearly numeric quantities.

Treatment: Cast to numeric and consider binning or log-transform given the round-number concentration.
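
A sketch of the cast plus both suggested transforms. The bin edges and labels are hypothetical, chosen only for illustration, and `width_features` is a hypothetical helper. `log1p` is used rather than `log` so that a width of 0 stays defined.

```python
import math
from bisect import bisect_right

BIN_EDGES = [25, 100, 400]  # hypothetical cut points, for illustration only
BIN_LABELS = ["narrow", "medium", "wide", "very wide"]

def width_features(raw: str):
    """Cast a width string to float and derive a log-scaled value and a coarse bin."""
    width = float(raw)
    return width, math.log1p(width), BIN_LABELS[bisect_right(BIN_EDGES, width)]

for raw in ["10", "50", "100", "440"]:  # values taken from the top-values table
    print(raw, width_features(raw))
```

With `bisect_right`, a value equal to an edge falls into the upper bin, so '100' is labelled "wide" here.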

anthropic:claude-opus-4-7 · confidence high
Out[49]:

saturn.columns["wid"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 419
top_value 10
top_rate 0.2059
cardinality 419
entropy 4.463
entropy_ratio 0.5124
Fig 20.
Top values for wid.
Top values for wid (20 unique shown, of 419 total).
value  count  share
10     14417  20.6%
50     10366  14.8%
100    7067   10.1%
30     4772   6.8%
20     4368   6.2%
200    2946   4.2%
25     2452   3.5%
150    2101   3.0%
40     1967   2.8%
75     1906   2.7%
300    1430   2.0%
33     1160   1.7%
17     1037   1.5%
400    944    1.3%
23     812    1.2%
250    765    1.1%
60     737    1.1%
440    677    1.0%
500    636    0.9%
80     573    0.8%

How to cite

BibTeX
@misc{saturn-quirky-tornadoes-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky tornadoes},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-tornadoes}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky tornadoes. Source: /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-tornadoes