quirky-tornadoes · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json

Saturn profiled 70,022 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

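# Install the saturn-dissect package (IPython shell escape; run inside a notebook).
!pip install saturn-dissect
import subprocess

# Profile the source file, write the findings to JSON, and request LLM-written prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json",
    "--findings", "quirky-tornadoes.json",
    "--llm", "anthropic:claude-opus-4-7",
])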

Summary confidence: high

This is a tornado event log with 70,022 rows and 13 columns covering dates, times, start/end coordinates, magnitudes, widths, fatalities, injuries, and U.S. state. Geographically it is a U.S.-centered dataset: starting longitudes average around -92.7 and latitudes around 37.1, with Texas (13.3% of records), Kansas, and Oklahoma leading the state counts. The severity fields are highly imbalanced (fatalities are 0 in 97.7% of events and injuries are 0 in 88.8%), so any analysis of harm should focus on the rare non-zero tail. Magnitude (mag) is a more usable categorical signal with 7 levels, dominated by 0 (46%) and 1 (34%); one of those levels is -9 (1.5% of rows), which looks like a missing-value code rather than a real magnitude. Note that the end-coordinate columns (elat, elon) are null in ~37.6% of rows, which matters if you plan to draw tornado tracks rather than just start points.

citing: row_count · column_count · columns.state.top_values · columns.state.top_rate · columns.fatalities.top_rate · columns.injuries.top_rate · columns.mag.top_values · columns.mag.top_rate · columns.elat.null_rate · columns.elon.null_rate · columns.slat.stats · columns.slon.stats

Out[4]:

saturn.schema() · 13 columns

column	kind	n	null%	unique	alerts
date	text	70,022	0.0%	12,639	one_word allcaps short_text duplicates
time	text	70,022	0.0%	1,438	one_word allcaps short_text duplicates
state	categorical	70,022	0.0%	53
mag	categorical	70,022	0.0%	7
injuries	categorical	70,022	0.0%	209
fatalities	categorical	70,022	0.0%	50	imbalance
loss	text	70,022	0.0%	1,019	one_word allcaps short_text duplicates
slat	numeric	70,022	0.0%	16,016
slon	numeric	70,022	0.0%	17,912
elat	numeric	70,022	37.6%	16,965	null_rate
elon	numeric	70,022	37.6%	18,586	null_rate
len	text	70,022	0.0%	3,663	one_word allcaps short_text duplicates
wid	categorical	70,022	0.0%	419

Fig 1.

state · Top tornado states — Texas, Kansas, and Oklahoma dominate the count.

Top values for state (20 unique shown, of 53 total).
value	count	share
TX	9345	13.3%
KS	4474	6.4%
OK	4221	6.0%
FL	3620	5.2%
NE	3056	4.4%
IA	2887	4.1%
IL	2835	4.0%
MS	2657	3.8%
AL	2529	3.6%
MO	2462	3.5%
CO	2425	3.5%
LA	2305	3.3%
MN	2118	3.0%
AR	1981	2.8%
SD	1917	2.7%
GA	1898	2.7%
ND	1640	2.3%
IN	1610	2.3%
WI	1515	2.2%
NC	1472	2.1%

Fig 2.

mag · Magnitude distribution skews heavily toward weak (0–1) tornadoes.

Top values for mag (7 unique shown, of 7 total).
value	count	share
0	32218	46.0%
1	23782	34.0%
2	9767	13.9%
3	2585	3.7%
-9	1024	1.5%
4	587	0.8%
5	59	0.1%

Fig 3.

fatalities · Fatalities are 0 in roughly 98% of events; the deadly tail is tiny but important.

Top values for fatalities (20 unique shown, of 50 total).
value	count	share
0	68423	97.7%
1	830	1.2%
2	277	0.4%
3	134	0.2%
4	77	0.1%
5	46	0.1%
6	45	0.1%
7	32	0.0%
9	15	0.0%
10	15	0.0%
11	14	0.0%
8	13	0.0%
16	12	0.0%
13	8	0.0%
17	7	0.0%
18	6	0.0%
21	6	0.0%
12	6	0.0%
22	5	0.0%
25	4	0.0%

Fig 4.

slat · Starting latitudes cluster around 37°N, consistent with the U.S. tornado belt.

Histogram bins for slat (median: 37.02505).
bin	count
17.72 – 18.8	33
18.8 – 19.89	6
19.89 – 20.97	8
20.97 – 22.05	26
22.05 – 23.13	1
23.13 – 24.22	0
24.22 – 25.3	76
25.3 – 26.38	487
26.38 – 27.46	748
27.46 – 28.55	1225
28.55 – 29.63	1468
29.63 – 30.71	3607
30.71 – 31.79	3441
31.79 – 32.88	5003
32.88 – 33.96	4775
33.96 – 35.04	5392
35.04 – 36.12	5164
36.12 – 37.21	4268
37.21 – 38.29	4028
38.29 – 39.37	4807
39.37 – 40.45	5498
40.45 – 41.54	5313
41.54 – 42.62	3995
42.62 – 43.7	3396
43.7 – 44.78	2407
44.78 – 45.87	1720
45.87 – 46.95	1257
46.95 – 48.03	1114
48.03 – 49.11	754
49.11 – 50.2	1
50.2 – 51.28	0
51.28 – 52.36	0
52.36 – 53.44	0
53.44 – 54.53	0
54.53 – 55.61	1
55.61 – 56.69	0
56.69 – 57.77	0
57.77 – 58.86	0
58.86 – 59.94	1
59.94 – 61.02	2

Fig 5.

slon · Starting longitudes concentrate around -93°, reinforcing the central-U.S. footprint.

Histogram bins for slon (median: -93.5).
bin	count
-163.5 – -161.1	2
-161.1 – -158.6	9
-158.6 – -156.1	25
-156.1 – -153.6	8
-153.6 – -151.2	0
-151.2 – -148.7	0
-148.7 – -146.2	0
-146.2 – -143.8	1
-143.8 – -141.3	0
-141.3 – -138.8	0
-138.8 – -136.4	0
-136.4 – -133.9	0
-133.9 – -131.4	0
-131.4 – -128.9	0
-128.9 – -126.5	0
-126.5 – -124	15
-124 – -121.5	231
-121.5 – -119.1	241
-119.1 – -116.6	282
-116.6 – -114.1	174
-114.1 – -111.7	399
-111.7 – -109.2	323
-109.2 – -106.7	325
-106.7 – -104.2	1782
-104.2 – -101.8	4689
-101.8 – -99.3	6094
-99.3 – -96.83	9665
-96.83 – -94.36	8344
-94.36 – -91.89	6626
-91.89 – -89.42	6618
-89.42 – -86.95	6086
-86.95 – -84.48	5361
-84.48 – -82.01	4363
-82.01 – -79.54	3931
-79.54 – -77.07	1885
-77.07 – -74.6	1618
-74.6 – -72.13	541
-72.13 – -69.66	285
-69.66 – -67.19	71
-67.19 – -64.72	28

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
date	text	0.0%
time	text	0.0%
state	categorical	0.0%
mag	categorical	0.0%
injuries	categorical	0.0%
fatalities	categorical	0.0%
loss	text	0.0%
slat	numeric	0.0%
slon	numeric	0.0%
elat	numeric	37.6%
elon	numeric	37.6%
len	text	0.0%
wid	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
	slat	slon	elat	elon
slat	+1.00	-0.17	-0.02	-0.03
slon	-0.17	+1.00	-0.04	-0.04
elat	-0.02	-0.04	+1.00	-0.16
elon	-0.03	-0.04	-0.16	+1.00

date text timestamp

This is a date column stored as ISO-formatted text (YYYY-MM-DD): every value is exactly 10 characters long and a single token. Across 70,022 rows there are only 12,639 unique dates, an 81.9% duplicate rate, so many records share a date; the most frequent value, 2011-04-27, appears 207 times. Values span at least 1974-04-03 to 2023-03-31. The uniform format indicates the column was loaded as text rather than as a date type, and the one_word and allcaps alerts are artefacts of that typing rather than real text signal.

Treatment: Cast to a proper date type and use for temporal joins or time-based features.

anthropic:claude-opus-4-7 · confidence high
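
A minimal sketch of the suggested cast, using only the Python standard library and assuming every value follows YYYY-MM-DD as the stats indicate (len_min = len_max = 10). The sample values are the three dates quoted in the summary above; `parse_date` is a hypothetical helper name, not part of saturn.

```python
from datetime import date

def parse_date(text: str) -> date:
    """Parse an ISO YYYY-MM-DD string; raises ValueError on malformed input."""
    return date.fromisoformat(text)

# Dates quoted in the column summary (earliest, most frequent, latest).
samples = ["1974-04-03", "2011-04-27", "2023-03-31"]
parsed = [parse_date(s) for s in samples]

# Once typed, dates support arithmetic and calendar features.
span_days = (parsed[-1] - parsed[0]).days
years = [d.year for d in parsed]
months = [d.month for d in parsed]
```

Letting `ValueError` propagate, rather than silently coercing, is a deliberate choice here: any row that breaks the expected format is worth seeing.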

Out[13]:

saturn.columns["date"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	12,639
len_min	10
len_max	10
len_mean	10
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	57,383
duplicate_rate	0.8195
vocab_size	7,831
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	81.9% duplicate strings
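
The duplicate figures in this table are consistent with n_duplicates = n - unique and duplicate_rate = n_duplicates / n. That definition is inferred from the numbers shown here, not taken from saturn's documentation; `duplicate_stats` is a hypothetical helper.

```python
def duplicate_stats(n: int, unique: int) -> tuple[int, float]:
    """Return (n_duplicates, duplicate_rate).

    Assumes a duplicate is any row beyond the first occurrence
    of each distinct value.
    """
    n_duplicates = n - unique
    return n_duplicates, n_duplicates / n

# n and unique as reported for the date column.
n_dup, rate = duplicate_stats(70_022, 12_639)
```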

Fig 8.

Character-length distribution for date.

Character-length distribution for date (mean: 10.0).
chars	count
10	70022

time text timestamp

This column holds clock times stored as 8-character HH:MM:SS strings, with all 70,022 rows non-null and uniform length (len_min=len_max=8). Only 1,438 distinct values appear and 97.95% are duplicates, with afternoon/evening slots like 16:00:00 (978), 17:00:00 (971) and 15:00:00 (959) dominating — consistent with event start times rather than free text. It's mistyped as text: parse to a proper time type before use.

Treatment: Cast to a time type and bucket by hour for modelling.

anthropic:claude-opus-4-7 · confidence high
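
A minimal sketch of the hour bucketing, assuming every value is a valid HH:MM:SS string as the uniform length suggests. The first three samples are the top slots quoted above; the other two are made up for illustration, and `hour_of` is a hypothetical helper name.

```python
from collections import Counter
from datetime import time

def hour_of(text: str) -> int:
    """Parse an HH:MM:SS string and return its hour (0-23)."""
    return time.fromisoformat(text).hour

# Illustrative values only, not a sample of the real column.
samples = ["16:00:00", "17:00:00", "15:00:00", "16:45:00", "02:30:00"]
by_hour = Counter(hour_of(s) for s in samples)
```

The report does not say which time zone the values use, so hour buckets are only comparable across states once that is established.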

Out[16]:

saturn.columns["time"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	1,438
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	68,584
duplicate_rate	0.9795
vocab_size	1,352
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.9% duplicate strings

Fig 9.

Character-length distribution for time.

Character-length distribution for time (mean: 8.0).
chars	count
8	70022

state categorical feature

This is a US state code column with 53 distinct values, slightly more than the 50 states (likely including DC and territories like PR). TX dominates at 13.3% of 70,022 rows, followed by KS (4,474) and OK (4,221), giving a clear southern/plains skew rather than a uniform national distribution. Entropy ratio of 0.847 confirms the distribution is fairly spread but not flat, and there are no nulls.

Treatment: One-hot or target-encode; consider grouping low-frequency states into an 'Other' bucket.

anthropic:claude-opus-4-7 · confidence high
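
The treatment above can be sketched with the standard library as follows. The 2% threshold, the helper names `group_rare` and `one_hot`, and the toy sample are all assumptions for illustration; a real pipeline would more likely use a dataframe or ML library for the encoding.

```python
from collections import Counter

def group_rare(values, min_share=0.02, other="Other"):
    """Replace categories whose share is below min_share with one bucket."""
    counts = Counter(values)
    total = len(values)
    keep = {v for v, c in counts.items() if c / total >= min_share}
    return [v if v in keep else other for v in values]

def one_hot(values):
    """Return (sorted category list, list of 0/1 indicator rows)."""
    cats = sorted(set(values))
    index = {c: i for i, c in enumerate(cats)}
    rows = []
    for v in values:
        row = [0] * len(cats)
        row[index[v]] = 1
        rows.append(row)
    return cats, rows

# Toy sample, not the real data: two frequent codes and two rare ones.
states = ["TX"] * 50 + ["KS"] * 48 + ["VT"] + ["RI"]
grouped = group_rare(states, min_share=0.02)
cats, rows = one_hot(grouped)
```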

Out[19]:

saturn.columns["state"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	53
top_value	TX
top_rate	0.1335
cardinality	53
entropy	4.851
entropy_ratio	0.8468

Fig 10.

Top values for state.

Top values for state (20 unique shown, of 53 total).
value	count	share
TX	9345	13.3%
KS	4474	6.4%
OK	4221	6.0%
FL	3620	5.2%
NE	3056	4.4%
IA	2887	4.1%
IL	2835	4.0%
MS	2657	3.8%
AL	2529	3.6%
MO	2462	3.5%
CO	2425	3.5%
LA	2305	3.3%
MN	2118	3.0%
AR	1981	2.8%
SD	1917	2.7%
GA	1898	2.7%
ND	1640	2.3%
IN	1610	2.3%
WI	1515	2.2%
NC	1472	2.1%

mag categorical feature

`mag` is a low-cardinality categorical with 7 distinct values dominated by '0' (46% of 70022 rows) and decreasing counts through '1','2','3','4','5'. The ordered integer levels suggest a magnitude or severity code rather than a free category, and the presence of '-9' (1024 rows) is the standout signal — almost certainly a sentinel for missing/unknown that is not being counted as null.

Treatment: Recode '-9' as missing, then treat as an ordinal feature.

anthropic:claude-opus-4-7 · confidence high
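
A minimal sketch of the recode, assuming '-9' is indeed a missing-value sentinel (the report infers this; it is not confirmed by the source data). `recode_mag` is a hypothetical helper.

```python
from typing import Optional

def recode_mag(raw: str, sentinel: str = "-9") -> Optional[int]:
    """Map the sentinel to None (missing); otherwise return the integer level."""
    return None if raw == sentinel else int(raw)

# Illustrative raw values covering the sentinel and some real levels.
raw_levels = ["0", "1", "-9", "3", "5"]
levels = [recode_mag(v) for v in raw_levels]

# Ordinal comparisons are only meaningful on the known levels.
known = [v for v in levels if v is not None]
```

Recoding before casting matters: a plain integer cast would keep -9 as the lowest "magnitude" and distort any ordinal statistic.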

Out[22]:

saturn.columns["mag"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	7
top_value	0
top_rate	0.4601
cardinality	7
entropy	1.772
entropy_ratio	0.6312

Fig 11.

Top values for mag.

Top values for mag (7 unique shown, of 7 total).
value	count	share
0	32218	46.0%
1	23782	34.0%
2	9767	13.9%
3	2585	3.7%
-9	1024	1.5%
4	587	0.8%
5	59	0.1%
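
Because all 7 levels are listed, the entropy stats for mag can be rechecked by hand. The reported entropy (1.772) and entropy_ratio (0.6312) are consistent with Shannon entropy in bits divided by log2 of the cardinality; that definition is inferred from the numbers, not taken from saturn's documentation.

```python
import math

# Counts copied from the mag table above.
mag_counts = {"0": 32218, "1": 23782, "2": 9767, "3": 2585,
              "-9": 1024, "4": 587, "5": 59}

def entropy_bits(counts):
    """Shannon entropy in bits of a {value: count} mapping."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)

h = entropy_bits(mag_counts)
# Assumed normalisation: divide by the maximum entropy for this cardinality.
ratio = h / math.log2(len(mag_counts))
```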

injuries categorical numeric_target

Counts of injuries stored as strings, with 209 distinct values across 70,022 rows and no nulls. The distribution is severely zero-inflated: '0' accounts for 88.8% of records, and the entropy ratio is just 0.123. Counts for values '1' through '10' decay quickly, but 209 unique tokens suggest a very long tail, or possibly non-integer entries, so the remaining values are worth inspecting.

Treatment: Cast to integer and model as a zero-inflated count (e.g., hurdle or ZIP regression).

anthropic:claude-opus-4-7 · confidence high
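
A hurdle model splits the problem in two: a binary stage (did the event cause any injuries?) and a count stage fitted only on the positive rows. The sketch below shows just the data preparation for those two stages, on made-up values; `hurdle_targets` is a hypothetical helper and no model is fitted.

```python
def hurdle_targets(raw_counts):
    """Split string counts into (any-event flags, positive counts only)."""
    counts = [int(v) for v in raw_counts]  # raises ValueError on non-integers
    any_event = [1 if c > 0 else 0 for c in counts]
    positives = [c for c in counts if c > 0]
    return any_event, positives

# Illustrative values only, not a sample of the real column.
raw = ["0", "0", "3", "0", "1", "12"]
any_injury, positive_counts = hurdle_targets(raw)
zero_share = any_injury.count(0) / len(any_injury)
```

Because `int()` fails loudly on anything that is not an integer string, the cast also surfaces any non-integer entries hiding among the 209 unique tokens. The same preparation would apply to the fatalities column.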

Out[25]:

saturn.columns["injuries"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	209
top_value	0
top_rate	0.888
cardinality	209
entropy	0.9454
entropy_ratio	0.1227

Fig 12.

Top values for injuries.

Top values for injuries (20 unique shown, of 209 total).
value	count	share
0	62177	88.8%
1	2480	3.5%
2	1388	2.0%
3	770	1.1%
4	484	0.7%
5	385	0.5%
6	300	0.4%
7	194	0.3%
8	171	0.2%
10	141	0.2%
12	120	0.2%
9	117	0.2%
11	80	0.1%
20	71	0.1%
15	69	0.1%
13	66	0.1%
14	55	0.1%
30	46	0.1%
25	44	0.1%
16	43	0.1%

fatalities categorical numeric_target

This is a fatality count per event, stored as a categorical/string field with 50 distinct integer values and no nulls across 70,022 rows. The distribution is extremely imbalanced: '0' accounts for 97.72% of records, giving an entropy ratio of just 0.039, with '1' at 830 rows and a long thin tail (2 → 277, 3 → 134, down to single-digit counts at higher values). Despite being typed as categorical, the values are numeric and ordered, so the current encoding is likely a load artifact.

Treatment: Cast to integer and model as a zero-inflated count rather than a categorical.

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["fatalities"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	50
top_value	0
top_rate	0.9772
cardinality	50
entropy	0.2174
entropy_ratio	0.03852
alert: imbalance	top value is 97.7% of rows

Fig 13.

Top values for fatalities.

Top values for fatalities (20 unique shown, of 50 total).
value	count	share
0	68423	97.7%
1	830	1.2%
2	277	0.4%
3	134	0.2%
4	77	0.1%
5	46	0.1%
6	45	0.1%
7	32	0.0%
9	15	0.0%
10	15	0.0%
11	14	0.0%
8	13	0.0%
16	12	0.0%
13	8	0.0%
17	7	0.0%
18	6	0.0%
21	6	0.0%
12	6	0.0%
22	5	0.0%
25	4	0.0%

loss text feature

Numeric loss values stored as text strings, with all 70,022 entries being single tokens (one_word_rate 1.0) and lengths of 1-10 characters. The distribution is heavily concentrated: '0.0' alone accounts for 22,764 rows and the duplicate_rate is 0.985 across only 1,019 unique values. Mixed formatting is a hazard — '0' and '0.0' appear as separate tokens, so a naive cast will collapse them but string-based grouping won't.

Treatment: Cast to float (normalising '0' vs '0.0') before any modelling or aggregation.

anthropic:claude-opus-4-7 · confidence high
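
The '0' versus '0.0' hazard is easy to demonstrate. In this sketch the sample values are made up for illustration; only the two zero spellings come from the summary above.

```python
from collections import Counter

# Illustrative values only; '0' and '0.0' are the two spellings noted above.
raw = ["0", "0.0", "0.0", "2.5", "0", "1000000"]

# Grouping on the raw strings keeps the two zero spellings apart...
by_string = Counter(raw)

# ...while casting to float first merges them into a single key.
by_value = Counter(float(v) for v in raw)
```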

Out[31]:

saturn.columns["loss"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	1,019
len_min	1
len_max	10
len_mean	3.181
len_median	3
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	69,003
duplicate_rate	0.9854
vocab_size	503
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.9251
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	92.5% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	98.5% duplicate strings

Fig 14.

Character-length distribution for loss.

Character-length distribution for loss (mean: 3.181).
chars	count
1	5248
2	0
3	50873
4	7040
5	4912
6	1581
7	292
8	58
9	16
10	2

slat numeric feature

Values range from 17.72 to 61.02 with a mean of 37.14 and median of 37.03, consistent with a starting latitude (slat) field in decimal degrees. The bulk of the records lie between roughly 24°N and 49°N, the latitude band of the contiguous US, but a small number fall outside it at both ends, so coverage is not strictly limited to the lower 48. The distribution is essentially symmetric (skew 0.04, kurtosis -0.58) with a tight IQR of 7.74 and only 70 outliers (0.10%). There are no nulls and no zeros, and 16,016 unique values across 70,022 rows suggest repeated coordinates rather than free-form noise.

Treatment: Use as-is for geospatial features; optionally pair with a longitude column for distance or region encoding.

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["slat"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	16,016
min	17.72
max	61.02
mean	37.14
median	37.03
std	5.09
q1	33.19
q3	40.93
iqr	7.74
skew	0.03792
kurtosis	-0.5825
n_outliers	70
outlier_rate	0.0009997
zero_rate	0
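
The report does not state how n_outliers is computed. One common convention is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles; the sketch below applies it to the slat quartiles as an assumption, not as saturn's documented method. `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# q1 and q3 as reported for slat.
lower, upper = tukey_fences(33.19, 40.93)
```

Under this assumption the fences sit near 21.58 and 52.54 degrees, so only the sparse records at the far southern and far northern ends of the range would be flagged.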

Fig 15.

Distribution of slat. Vertical dash marks the median.

Histogram bins for slat (median: 37.02505).
bin	count
17.72 – 18.8	33
18.8 – 19.89	6
19.89 – 20.97	8
20.97 – 22.05	26
22.05 – 23.13	1
23.13 – 24.22	0
24.22 – 25.3	76
25.3 – 26.38	487
26.38 – 27.46	748
27.46 – 28.55	1225
28.55 – 29.63	1468
29.63 – 30.71	3607
30.71 – 31.79	3441
31.79 – 32.88	5003
32.88 – 33.96	4775
33.96 – 35.04	5392
35.04 – 36.12	5164
36.12 – 37.21	4268
37.21 – 38.29	4028
38.29 – 39.37	4807
39.37 – 40.45	5498
40.45 – 41.54	5313
41.54 – 42.62	3995
42.62 – 43.7	3396
43.7 – 44.78	2407
44.78 – 45.87	1720
45.87 – 46.95	1257
46.95 – 48.03	1114
48.03 – 49.11	754
49.11 – 50.2	1
50.2 – 51.28	0
51.28 – 52.36	0
52.36 – 53.44	0
53.44 – 54.53	0
54.53 – 55.61	1
55.61 – 56.69	0
56.69 – 57.77	0
57.77 – 58.86	0
58.86 – 59.94	1
59.94 – 61.02	2

slon numeric feature

Values are negative decimal degrees ranging from -163.53 to -64.7151 with a median of -93.5, consistent with western-hemisphere longitudes (the 'slon' name suggests starting longitude). The distribution is mildly left-skewed (-0.32) and concentrated within an IQR of ~11.7 degrees around the central US longitude band. Only 951 values (1.36%) are flagged as outliers, and there are no nulls or zeros.

Treatment: Use as a geographic coordinate; pair with the matching latitude column for spatial features rather than treating as a standalone scalar.

anthropic:claude-opus-4-7 · confidence high
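
One way to pair latitude and longitude into a spatial feature is a great-circle distance. The sketch below uses the haversine formula with a mean Earth radius of 6371 km (an approximation); `haversine_km` is a hypothetical helper, and the example point is the median start coordinate from the stats, not a real record.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Distance from the median start point to a point one degree further north.
d = haversine_km(37.03, -93.5, 38.03, -93.5)
```

One degree of latitude comes out at roughly 111 km, which is a quick sanity check on the implementation.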

Out[37]:

saturn.columns["slon"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	17,912
min	-163.5
max	-64.72
mean	-92.74
median	-93.5
std	8.677
q1	-98.4
q3	-86.69
iqr	11.71
skew	-0.3229
kurtosis	2.156
n_outliers	951
outlier_rate	0.01358
zero_rate	0

Fig 16.

Distribution of slon. Vertical dash marks the median.

Histogram bins for slon (median: -93.5).
bin	count
-163.5 – -161.1	2
-161.1 – -158.6	9
-158.6 – -156.1	25
-156.1 – -153.6	8
-153.6 – -151.2	0
-151.2 – -148.7	0
-148.7 – -146.2	0
-146.2 – -143.8	1
-143.8 – -141.3	0
-141.3 – -138.8	0
-138.8 – -136.4	0
-136.4 – -133.9	0
-133.9 – -131.4	0
-131.4 – -128.9	0
-128.9 – -126.5	0
-126.5 – -124	15
-124 – -121.5	231
-121.5 – -119.1	241
-119.1 – -116.6	282
-116.6 – -114.1	174
-114.1 – -111.7	399
-111.7 – -109.2	323
-109.2 – -106.7	325
-106.7 – -104.2	1782
-104.2 – -101.8	4689
-101.8 – -99.3	6094
-99.3 – -96.83	9665
-96.83 – -94.36	8344
-94.36 – -91.89	6626
-91.89 – -89.42	6618
-89.42 – -86.95	6086
-86.95 – -84.48	5361
-84.48 – -82.01	4363
-82.01 – -79.54	3931
-79.54 – -77.07	1885
-77.07 – -74.6	1618
-74.6 – -72.13	541
-72.13 – -69.66	285
-69.66 – -67.19	71
-67.19 – -64.72	28

elat numeric feature

Almost certainly the ending latitude of the event in decimal degrees (the counterpart of slat), with values spanning 17.72 to 61.02, consistent with locations across North America. The distribution is roughly symmetric (skew 0.03, kurtosis -0.41) and centered near 37.26, with only 78 outliers (0.18% of non-null values). The standout concern is that 37.65% of rows are null, so coverage is partial.

Treatment: Pair with the matching longitude and impute or filter the 37.65% nulls before any spatial modelling.

anthropic:claude-opus-4-7 · confidence high
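
A minimal sketch of the filtering option, on made-up records. It treats a row as having a usable track only when both end coordinates are present; the record layout and the `has_track` helper are assumptions for illustration.

```python
# Toy records, not real data: one full track, two with missing end points.
records = [
    {"slat": 35.1, "slon": -97.5, "elat": 35.3, "elon": -97.2},
    {"slat": 33.0, "slon": -96.0, "elat": None, "elon": None},
    {"slat": 41.2, "slon": -95.9, "elat": 41.2, "elon": None},
]

def has_track(rec):
    """True when both end coordinates are present."""
    return rec["elat"] is not None and rec["elon"] is not None

tracks = [r for r in records if has_track(r)]          # usable for track plots
points_only = [r for r in records if not has_track(r)]  # start point only
coverage = len(tracks) / len(records)
```

Keeping the start-only rows in a separate list, rather than dropping them, lets point-based maps still use the full dataset.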

Out[40]:

saturn.columns["elat"].stats

stat	value
n	70,022
nulls	26,363 (37.6%)
unique	16,965
min	17.72
max	61.02
mean	37.26
median	37.13
std	4.942
q1	33.49
q3	40.91
iqr	7.42
skew	0.03404
kurtosis	-0.4085
n_outliers	78
outlier_rate	0.001787
zero_rate	0
alert: null_rate	37.6% null

Fig 17.

Distribution of elat. Vertical dash marks the median.

Histogram bins for elat (median: 37.1309).
bin	count
17.72 – 18.8	33
18.8 – 19.89	6
19.89 – 20.97	8
20.97 – 22.05	26
22.05 – 23.13	1
23.13 – 24.22	0
24.22 – 25.3	41
25.3 – 26.38	262
26.38 – 27.46	369
27.46 – 28.55	534
28.55 – 29.63	668
29.63 – 30.71	1952
30.71 – 31.79	2190
31.79 – 32.88	3083
32.88 – 33.96	3018
33.96 – 35.04	3496
35.04 – 36.12	3427
36.12 – 37.21	2899
37.21 – 38.29	2790
38.29 – 39.37	3133
39.37 – 40.45	3414
40.45 – 41.54	3272
41.54 – 42.62	2540
42.62 – 43.7	2078
43.7 – 44.78	1428
44.78 – 45.87	1122
45.87 – 46.95	745
46.95 – 48.03	676
48.03 – 49.11	443
49.11 – 50.2	1
50.2 – 51.28	0
51.28 – 52.36	0
52.36 – 53.44	0
53.44 – 54.53	0
54.53 – 55.61	1
55.61 – 56.69	0
56.69 – 57.77	0
57.77 – 58.86	0
58.86 – 59.94	1
59.94 – 61.02	2

elon numeric feature

This is almost certainly the ending longitude (the counterpart of slon, not an east coordinate), with values bounded between -163.53 and -64.7151 and a median of -92.47; the negative values are western-hemisphere longitudes, consistent with points across North America. The distribution is moderately left-skewed (-0.60) with kurtosis 3.77 and a tight IQR of 11.26 around the median. Notably, 37.65% of rows are null, which will materially shrink any geo-based analysis.

Treatment: Pair with the matching latitude column and filter out the ~38% null rows before spatial analysis.

anthropic:claude-opus-4-7 · confidence high

Out[43]:

saturn.columns["elon"].stats

stat	value
n	70,022
nulls	26,363 (37.6%)
unique	18,586
min	-163.5
max	-64.72
mean	-92.19
median	-92.47
std	8.545
q1	-97.73
q3	-86.47
iqr	11.26
skew	-0.5954
kurtosis	3.766
n_outliers	647
outlier_rate	0.01482
zero_rate	0
alert: null_rate	37.6% null

Fig 18.

Distribution of elon. Vertical dash marks the median.

Histogram bins for elon (median: -92.47).
bin	count
-163.5 – -161.1	2
-161.1 – -158.6	9
-158.6 – -156.1	25
-156.1 – -153.6	8
-153.6 – -151.2	0
-151.2 – -148.7	0
-148.7 – -146.2	0
-146.2 – -143.8	1
-143.8 – -141.3	0
-141.3 – -138.8	0
-138.8 – -136.4	0
-136.4 – -133.9	0
-133.9 – -131.4	0
-131.4 – -128.9	0
-128.9 – -126.5	0
-126.5 – -124	8
-124 – -121.5	157
-121.5 – -119.1	158
-119.1 – -116.6	142
-116.6 – -114.1	82
-114.1 – -111.7	173
-111.7 – -109.2	176
-109.2 – -106.7	177
-106.7 – -104.2	774
-104.2 – -101.8	2464
-101.8 – -99.3	3415
-99.3 – -96.83	5507
-96.83 – -94.36	5157
-94.36 – -91.89	4463
-91.89 – -89.42	4668
-89.42 – -86.95	4417
-86.95 – -84.48	3706
-84.48 – -82.01	2739
-82.01 – -79.54	2302
-79.54 – -77.07	1344
-77.07 – -74.6	1056
-74.6 – -72.13	303
-72.13 – -69.66	152
-69.66 – -67.19	47
-67.19 – -64.72	27

len text feature

Despite being typed as text, `len` holds short numeric tokens (length 3-8, one word each) like '0.1', '0.5', '1.0' — almost certainly a length or size measurement stored as strings. Values are highly concentrated: '0.1' alone covers 15,456 of 70,022 rows and the duplicate rate is 94.8% across only 3,663 unique tokens. The allcaps and one_word alerts are artefacts of numeric strings rather than real text signal.

Treatment: Cast to float and treat as a numeric feature rather than text.

anthropic:claude-opus-4-7 · confidence high

Out[46]:

saturn.columns["len"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	3,663
len_min	3
len_max	8
len_mean	3.626
len_median	3
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	66,359
duplicate_rate	0.9477
vocab_size	2,204
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	94.8% duplicate strings

Fig 19.

Character-length distribution for len.

Character-length distribution for len (mean: 3.626).
chars	count
3	47630
4	11687
5	795
6	9118
7	789
8	3

wid categorical feature

wid is a categorical column with 419 distinct values, all numeric-looking strings like '10', '50', '100', '30'. Given the dataset summary, it is almost certainly a tornado width measurement stored as text rather than a true category. The distribution is heavily concentrated on round numbers: '10' alone covers 20.6% of 70,022 rows and the ten most frequent values are all multiples of 5, giving an entropy ratio of 0.51. There are no nulls; the surprise is that clearly numeric quantities are encoded as strings.

Treatment: Cast to numeric and consider binning or log-transform given the round-number concentration.

anthropic:claude-opus-4-7 · confidence high
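
A minimal sketch of both options, a log transform and a coarse binning. The bin edges and labels are hypothetical, chosen only to illustrate the technique; the report does not state the unit of wid, so no physical interpretation is implied.

```python
import math
from bisect import bisect_right

def log_width(raw: str) -> float:
    """Cast to float and apply log(1 + x), which is defined at zero."""
    return math.log1p(float(raw))

# Hypothetical bin edges and labels, for illustration only.
EDGES = [25, 100, 400]
LABELS = ["narrow", "medium", "wide", "very wide"]

def width_bin(raw: str) -> str:
    """Assign a label; a value equal to an edge goes to the upper bin."""
    return LABELS[bisect_right(EDGES, float(raw))]

# Some of the most frequent values from the table of top values.
binned = {v: width_bin(v) for v in ["10", "50", "100", "440"]}
```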

Out[49]:

saturn.columns["wid"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	419
top_value	10
top_rate	0.2059
cardinality	419
entropy	4.463
entropy_ratio	0.5124

Fig 20.

Top values for wid.

Top values for wid (20 unique shown, of 419 total).
value	count	share
10	14417	20.6%
50	10366	14.8%
100	7067	10.1%
30	4772	6.8%
20	4368	6.2%
200	2946	4.2%
25	2452	3.5%
150	2101	3.0%
40	1967	2.8%
75	1906	2.7%
300	1430	2.0%
33	1160	1.7%
17	1037	1.5%
400	944	1.3%
23	812	1.2%
250	765	1.1%
60	737	1.1%
440	677	1.0%
500	636	0.9%
80	573	0.8%
