data-trove-tornadoes-noaa-spc

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json

Saturn profiled 70,022 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

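# Install the profiler (IPython shell magic; valid only inside a notebook cell).
!pip install saturn-dissect

import subprocess

# Run the saturn CLI on the source file with the flags used for this report.
# check=True raises CalledProcessError if the command exits with a non-zero status.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json",
    "--findings", "data-trove-tornadoes-noaa-spc.json",
    "--llm", "anthropic:default",
], check=True)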

Summary confidence: high

This dataset contains 70,022 tornado records across the United States, with attributes covering location, timing, magnitude, path dimensions, and human impact. Texas leads with 9,345 events (13.3%), and the classic 'Tornado Alley' states (TX, KS, OK, NE, IA) together account for 23,983 records, about 34% of the total. Magnitude is worth close inspection: 46.0% of tornadoes are rated 0, the weakest rating on the F/EF scale, and only 59 reach magnitude 5, so severity falls off steeply. Human cost is highly skewed: 97.7% of events report zero fatalities, while a long tail of deadly events remains. The most frequent date, April 27, 2011 (207 records), points to a handful of catastrophic outbreak days that deserve focused analysis.

citing: row_count · column_count · state.top_values · mag.top_values · fatalities.top_rate · fatalities.top_value · date.top_values · injuries.top_values · loss.top_values · len.top_values

Out[4]:

saturn.schema() · 13 columns

column	kind	n	null%	unique	alerts
date	text	70,022	0.0%	12,639	one_word allcaps short_text duplicates
time	text	70,022	0.0%	1,438	one_word allcaps short_text duplicates
state	categorical	70,022	0.0%	53
mag	categorical	70,022	0.0%	7
injuries	categorical	70,022	0.0%	209
fatalities	categorical	70,022	0.0%	50	imbalance
loss	text	70,022	0.0%	1,019	one_word allcaps short_text duplicates
slat	numeric	70,022	0.0%	16,016
slon	numeric	70,022	0.0%	17,912
elat	numeric	70,022	37.6%	16,965	null_rate
elon	numeric	70,022	37.6%	18,586	null_rate
len	text	70,022	0.0%	3,663	one_word allcaps short_text duplicates
wid	categorical	70,022	0.0%	419

Fig 1.

state · Look for the dominance of Tornado Alley states — TX, KS, OK — and how sharply activity drops beyond the top 10.

Top values for state (20 unique shown, of 53 total).
value	count	share
TX	9345	13.3%
KS	4474	6.4%
OK	4221	6.0%
FL	3620	5.2%
NE	3056	4.4%
IA	2887	4.1%
IL	2835	4.0%
MS	2657	3.8%
AL	2529	3.6%
MO	2462	3.5%
CO	2425	3.5%
LA	2305	3.3%
MN	2118	3.0%
AR	1981	2.8%
SD	1917	2.7%
GA	1898	2.7%
ND	1640	2.3%
IN	1610	2.3%
WI	1515	2.2%
NC	1472	2.1%

Fig 2.

mag · Notice the steep drop-off from magnitude 0 to 5, confirming that the vast majority of tornadoes are weak, with violent twisters being rare.

Top values for mag (7 unique shown, of 7 total).
value	count	share
0	32218	46.0%
1	23782	34.0%
2	9767	13.9%
3	2585	3.7%
-9	1024	1.5%
4	587	0.8%
5	59	0.1%

Fig 3.

fatalities · The near-total dominance of zero-fatality events (97.7%) contrasts sharply with the small but deadly tail of high-casualty tornadoes.

Top values for fatalities (20 unique shown, of 50 total).
value	count	share
0	68423	97.7%
1	830	1.2%
2	277	0.4%
3	134	0.2%
4	77	0.1%
5	46	0.1%
6	45	0.1%
7	32	0.0%
9	15	0.0%
10	15	0.0%
11	14	0.0%
8	13	0.0%
16	12	0.0%
13	8	0.0%
17	7	0.0%
18	6	0.0%
21	6	0.0%
12	6	0.0%
22	5	0.0%
25	4	0.0%

Fig 4.

len · Path length is heavily concentrated at 0.1 miles, but watch for a long right tail of tornadoes that traveled many miles.

Character-length distribution for len (mean: 3.626).
chars	count
3	47630
4	11687
5	795
6	9118
7	789
8	3

Fig 5.

injuries · Like fatalities, injuries are overwhelmingly zero, but the distribution of non-zero counts reveals which events caused mass casualties.

Top values for injuries (20 unique shown, of 209 total).
value	count	share
0	62177	88.8%
1	2480	3.5%
2	1388	2.0%
3	770	1.1%
4	484	0.7%
5	385	0.5%
6	300	0.4%
7	194	0.3%
8	171	0.2%
10	141	0.2%
12	120	0.2%
9	117	0.2%
11	80	0.1%
20	71	0.1%
15	69	0.1%
13	66	0.1%
14	55	0.1%
30	46	0.1%
25	44	0.1%
16	43	0.1%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
date	text	0.0%
time	text	0.0%
state	categorical	0.0%
mag	categorical	0.0%
injuries	categorical	0.0%
fatalities	categorical	0.0%
loss	text	0.0%
slat	numeric	0.0%
slon	numeric	0.0%
elat	numeric	37.6%
elon	numeric	37.6%
len	text	0.0%
wid	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
	slat	slon	elat	elon
slat	+1.00	-0.17	-0.02	-0.03
slon	-0.17	+1.00	-0.04	-0.04
elat	-0.02	-0.04	+1.00	-0.16
elon	-0.03	-0.04	-0.16	+1.00
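
Start and end coordinates of the same track would normally be expected to correlate strongly, so the near-zero slat/elat and slon/elon values may reflect how the profiler samples rows or handles the 37.6% nulls. That is an assumption, not something the report states. A minimal sketch for recomputing Pearson correlation on complete pairs only, using a hypothetical helper and toy values:

```python
import math

def pearson_complete_pairs(xs, ys):
    """Pearson r over positions where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Toy values for illustration only; they are not rows from the dataset.
slat = [35.0, 36.5, 40.1, 33.2]
elat = [35.1, None, 40.3, 33.3]
r = pearson_complete_pairs(slat, elat)
print(f"r on complete pairs: {r:.3f}")
```

If the recomputed value on the real rows is close to 1, the figures in the table above are more likely a sampling or alignment artefact than a property of the data.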

date text timestamp

This column contains ISO-8601 calendar dates stored as text, all exactly 10 characters long (YYYY-MM-DD) with zero nulls across 70,022 rows. The 'allcaps' alert is a quirk of how the profiler classifies strings made of digits and hyphens, not a real data issue. The notable feature is the 81.9% duplicate rate (57,383 duplicates across 12,639 unique dates): many tornadoes share a date, and the top value '2011-04-27' appears 207 times. The column is therefore an event date that clusters on outbreak days, not a unique record timestamp.

Treatment: Parse to datetime dtype, then use as a temporal feature (day-of-week, month, year, cyclical encoding) or grouping key for aggregations.

anthropic:default · confidence high
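
The treatment above can be sketched with the standard library alone. The function name and the choice of a month-based cyclical encoding are illustrative assumptions; day-of-year would work the same way.

```python
import math
from datetime import date

def date_features(iso_text):
    """Derive calendar features from a YYYY-MM-DD string."""
    d = date.fromisoformat(iso_text)
    # Place the month on a circle so that December and January end up adjacent.
    angle = 2 * math.pi * (d.month - 1) / 12
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),  # Monday == 0
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
    }

features = date_features("2011-04-27")
print(features)
```

Grouping records by the parsed date (or by year and month) then gives the per-outbreak aggregations the treatment mentions.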

Out[13]:

saturn.columns["date"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	12,639
len_min	10
len_max	10
len_mean	10
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	57,383
duplicate_rate	0.8195
vocab_size	7,831
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	81.9% duplicate strings

Fig 8.

Character-length distribution for date.

Character-length distribution for date (mean: 10.0).
chars	count
10	70022

time text timestamp

This column contains wall-clock times in HH:MM:SS format (every value is exactly 8 characters), giving the time of day of each tornado record. With only 1,438 unique values across 70,022 rows, the duplicate rate is 97.9%, so times are heavily reused. The top values cluster in the afternoon and early evening (14:00–19:00), which is consistent with the well-known late-afternoon peak in tornado activity. The column is stored as text even though it is a structured time value, so parse it to a proper time type before any downstream use.

Treatment: Parse to time/datetime type and use as a cyclical feature (e.g., sine/cosine encoding of hour) or join key for time-based aggregations.

anthropic:default · confidence high
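
A minimal sketch of the sine/cosine encoding mentioned in the treatment, assuming the HH:MM:SS strings described above. The function name is hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def time_of_day_features(hms):
    """Map an HH:MM:SS string to (sin, cos) coordinates on a 24-hour circle."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    elapsed = hours * 3600 + minutes * 60 + seconds
    angle = 2 * math.pi * elapsed / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)

late = time_of_day_features("23:59:00")
early = time_of_day_features("00:01:00")
# Times on either side of midnight stay close together in the encoded space.
gap = math.dist(late, early)
print(gap)
```

The point of the encoding is that 23:59 and 00:01 end up as neighbours, which a raw hour number would not achieve.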

Out[16]:

saturn.columns["time"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	1,438
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	68,584
duplicate_rate	0.9795
vocab_size	1,352
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.9% duplicate strings

Fig 9.

Character-length distribution for time.

Character-length distribution for time (mean: 8.0).
chars	count
8	70022

state categorical feature

This column contains US state abbreviations, with 53 distinct values (the 50 states plus three others, likely DC, Puerto Rico, and one other territory) and zero nulls across 70,022 rows. Texas dominates at 13.3% (9,345 rows), and the rest of the top 10 is weighted toward Great Plains, Midwestern, and Southern states (KS, OK, FL, NE, IA, IL, MS, AL, MO). For a tornado dataset this concentration is expected and most likely reflects where tornadoes occur. The entropy ratio of 0.847 indicates reasonably broad coverage, but the distribution is geographically concentrated, not uniform.

Treatment: One-hot encode or target-encode depending on model type; consider grouping low-frequency states if using one-hot.

anthropic:default · confidence high
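
A small sketch of grouping low-frequency states before one-hot encoding. The helper names, the "OTHER" label, and the share threshold are illustrative assumptions, and the sample list is toy data.

```python
from collections import Counter

def group_rare(values, min_share=0.02, other="OTHER"):
    """Replace categories whose share of rows is below min_share with `other`."""
    counts = Counter(values)
    total = len(values)
    keep = {value for value, count in counts.items() if count / total >= min_share}
    return [value if value in keep else other for value in values]

def one_hot(values):
    """Return the sorted category list and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[int(value == category) for category in categories] for value in values]
    return categories, rows

# Toy data for illustration only; not rows from the dataset.
sample = ["TX"] * 6 + ["KS"] * 3 + ["VT"]
grouped = group_rare(sample, min_share=0.2)
categories, rows = one_hot(grouped)
print(categories)
```

With 53 states, a threshold around 1 to 2 percent would fold the sparsest states into one column while keeping the high-count states separate.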

Out[19]:

saturn.columns["state"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	53
top_value	TX
top_rate	0.1335
cardinality	53
entropy	4.851
entropy_ratio	0.8468

Fig 10.

Top values for state.

Top values for state (20 unique shown, of 53 total).
value	count	share
TX	9345	13.3%
KS	4474	6.4%
OK	4221	6.0%
FL	3620	5.2%
NE	3056	4.4%
IA	2887	4.1%
IL	2835	4.0%
MS	2657	3.8%
AL	2529	3.6%
MO	2462	3.5%
CO	2425	3.5%
LA	2305	3.3%
MN	2118	3.0%
AR	1981	2.8%
SD	1917	2.7%
GA	1898	2.7%
ND	1640	2.3%
IN	1610	2.3%
WI	1515	2.2%
NC	1472	2.1%

mag categorical feature

This column holds the tornado's magnitude rating (F/EF scale) encoded as a small integer, with 7 distinct values: 0 through 5 plus -9. The dominant value is '0' (46.0% of rows, 32,218 records), followed by '1' (34.0%) and '2' (13.9%), so counts fall off steeply as the rating increases. The value '-9', which appears 1,024 times (1.5%), is a sentinel for a missing or unknown rating, not a true negative magnitude; an analyst expecting a clean ordinal scale must handle it explicitly. The column is stored as categorical even though it is numerically interpretable.

Treatment: Recode '-9' as missing, then treat as ordinal integer feature or one-hot encode depending on model assumptions.

anthropic:default · confidence high
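
A minimal sketch of the recoding step, assuming the values arrive as strings. The function name and the range check are illustrative additions.

```python
def clean_mag(raw):
    """Convert a raw mag value to an int in 0..5, or None for the -9 sentinel."""
    value = int(raw)
    if value == -9:
        return None  # sentinel for missing/unknown rating
    if not 0 <= value <= 5:
        raise ValueError(f"unexpected magnitude: {raw!r}")
    return value

cleaned = [clean_mag(v) for v in ["0", "3", "-9", "5"]]
print(cleaned)
```

Keeping the missing ratings as None (instead of dropping the rows) preserves the other columns for those 1,024 records.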

Out[22]:

saturn.columns["mag"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	7
top_value	0
top_rate	0.4601
cardinality	7
entropy	1.772
entropy_ratio	0.6312
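
The entropy figures appear consistent with Shannon entropy in bits, with the ratio taken against log2 of the cardinality. This is an inference from the numbers, not documented saturn behaviour. The sketch below recomputes both from the value counts listed for mag:

```python
import math

# Value counts copied from the "Top values for mag" table.
mag_counts = {"0": 32218, "1": 23782, "2": 9767, "3": 2585, "-9": 1024, "4": 587, "5": 59}

def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution given by a count mapping."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

entropy = shannon_entropy_bits(mag_counts)
# Normalise by the maximum possible entropy for this many categories.
ratio = entropy / math.log2(len(mag_counts))
print(round(entropy, 3), round(ratio, 4))
```

Under that assumption the results should land close to the reported 1.772 and 0.6312.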

Fig 11.

Top values for mag.

Top values for mag (7 unique shown, of 7 total).
value	count	share
0	32218	46.0%
1	23782	34.0%
2	9767	13.9%
3	2585	3.7%
-9	1024	1.5%
4	587	0.8%
5	59	0.1%

injuries categorical feature

This column is a count of injuries per record, stored as categorical although it is fundamentally numeric. The distribution is extremely right-skewed: 88.8% of the 70,022 records have zero injuries and counts drop off sharply thereafter, yet the column has 209 unique values, so some very high injury counts sit in the tail. The low entropy ratio (0.123) confirms the concentration on '0'. The ranking is not monotonic in the count: '10' (141 records) outranks '9' (117), and '20' (71) outranks every count from 13 to 19, which may reflect rounded estimates and hints at a long, sparse tail.

Treatment: Cast to integer, then consider zero-inflated modelling or a binary 'any_injury' flag plus a separate log-transformed count for the non-zero subset.

anthropic:default · confidence high
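
One way to sketch the two-part feature described in the treatment, assuming string input. The function and key names are hypothetical.

```python
import math

def injury_features(raw):
    """Split an injury count into a binary flag and a log count for non-zero rows."""
    count = int(raw)
    return {
        "any_injury": int(count > 0),
        # The log count is defined only for the non-zero subset.
        "log_injuries": math.log(count) if count > 0 else None,
    }

examples = [injury_features(v) for v in ["0", "1", "30"]]
print(examples)
```

The flag can feed a classifier and the log count a separate regression fitted only on rows where the flag is 1.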

Out[25]:

saturn.columns["injuries"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	209
top_value	0
top_rate	0.888
cardinality	209
entropy	0.9454
entropy_ratio	0.1227

Fig 12.

Top values for injuries.

Top values for injuries (20 unique shown, of 209 total).
value	count	share
0	62177	88.8%
1	2480	3.5%
2	1388	2.0%
3	770	1.1%
4	484	0.7%
5	385	0.5%
6	300	0.4%
7	194	0.3%
8	171	0.2%
10	141	0.2%
12	120	0.2%
9	117	0.2%
11	80	0.1%
20	71	0.1%
15	69	0.1%
13	66	0.1%
14	55	0.1%
30	46	0.1%
25	44	0.1%
16	43	0.1%

fatalities categorical feature

This column is a count of fatalities per tornado, stored as categorical although it is numeric; treat it as an integer feature. The dominant value is '0' (68,423 of 70,022 rows, or 97.7%), which makes the distribution severely right-skewed and triggers the imbalance alert. Cardinality reaches 50 distinct values with very low entropy (0.217, entropy ratio 0.039), confirming that the non-zero tail is long and sparse. Analysts modelling fatal events should note that the positive class is only 1,599 records, about 2.3% of the total.

Treatment: Cast to integer, apply zero-inflated or rare-event modelling strategy, or binarise into fatal/non-fatal indicator before classification.

anthropic:default · confidence high
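
A short sketch of the binarisation, together with inverse-frequency class weights computed from the counts reported above. The weighting formula is a common heuristic chosen for illustration; the report does not prescribe it.

```python
n_total = 70_022      # row count reported above
n_nonfatal = 68_423   # rows with fatalities == 0
n_fatal = n_total - n_nonfatal

def is_fatal(raw):
    """Binary indicator: 1 if the record reports at least one fatality."""
    return int(int(raw) > 0)

# Inverse-frequency ("balanced") weights: n_total / (n_classes * class_count).
class_weight = {
    0: n_total / (2 * n_nonfatal),
    1: n_total / (2 * n_fatal),
}
print(n_fatal, class_weight)
```

With weights like these, each fatal record counts roughly as much as 43 non-fatal ones during training, which offsets the imbalance.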

Out[28]:

saturn.columns["fatalities"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	50
top_value	0
top_rate	0.9772
cardinality	50
entropy	0.2174
entropy_ratio	0.03852
alert: imbalance	top value is 97.7% of rows

Fig 13.

Top values for fatalities.

Top values for fatalities (20 unique shown, of 50 total).
value	count	share
0	68423	97.7%
1	830	1.2%
2	277	0.4%
3	134	0.2%
4	77	0.1%
5	46	0.1%
6	45	0.1%
7	32	0.0%
9	15	0.0%
10	15	0.0%
11	14	0.0%
8	13	0.0%
16	12	0.0%
13	8	0.0%
17	7	0.0%
18	6	0.0%
21	6	0.0%
12	6	0.0%
22	5	0.0%
25	4	0.0%

loss text numeric_target

This column contains numeric loss values stored as text: every entry is a single token (one_word_rate: 1.0) with a mean length of about 3.18 characters, dominated by small non-negative numbers such as '0.0', '4.0', and '5.0'. The 92.5% allcaps rate is an artefact of how saturn classifies short numeric strings. Notably, '0.0' and '0' appear as separate tokens (22,764 and 5,248 occurrences respectively), which indicates inconsistent serialization of the same zero value; consolidate them before use. The duplicate rate is 98.5%, with only 1,019 unique strings across 70,022 rows.

Treatment: Cast to float (unifying '0' and '0.0'), then use as a regression target or loss metric; check whether the spike at zero represents true zero loss or a missing/default value.

anthropic:default · confidence high
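
A minimal sketch of the cast, showing how '0' and '0.0' collapse to one value once parsed. The sample list is toy data and the helper name is hypothetical.

```python
from collections import Counter

def parse_loss(raw):
    """Parse a loss string to float; '0' and '0.0' both become 0.0."""
    return float(raw.strip())

# Toy values for illustration only; not rows from the dataset.
raw_values = ["0", "0.0", "4.0", "5.0", "0"]
as_text = Counter(raw_values)
as_float = Counter(parse_loss(v) for v in raw_values)
print(len(as_text), len(as_float))
```

Comparing the number of distinct keys before and after the cast is a quick check on how much of the 1,019-string cardinality is serialization noise.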

Out[31]:

saturn.columns["loss"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	1,019
len_min	1
len_max	10
len_mean	3.181
len_median	3
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	69,003
duplicate_rate	0.9854
vocab_size	503
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.9251
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	92.5% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	98.5% duplicate strings

Fig 14.

Character-length distribution for loss.

Character-length distribution for loss (mean: 3.181).
chars	count
1	5248
3	50873
4	7040
5	4912
6	1581
7	292
8	58
9	16
10	2

slat numeric feature

This column is the latitude, in decimal degrees, of the tornado's starting point (it pairs with slon, while elat and elon give the end point). Values range from 17.72° to 61.02°, roughly from the latitude of Puerto Rico to that of southern Alaska, although nearly all records fall between about 24° and 49°, the extent of the contiguous United States. The distribution is close to symmetric (skew 0.038, kurtosis -0.582) around a mean of 37.14° with an IQR of 7.74°. Only 70 outliers (0.1%) exist, at the northern and southern extremes, and there are no nulls.

Treatment: Use directly as a geospatial feature; consider pairing with longitude and engineering distance or region-based features rather than treating as a raw numeric.

anthropic:default · confidence high
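
As one example of an engineered distance feature, the great-circle distance between the start (slat, slon) and end (elat, elon) coordinates can be sketched with the haversine formula. The function names and the mean Earth radius constant are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def track_distance_km(slat, slon, elat, elon):
    """Start-to-end distance, or None when the end coordinates are missing."""
    if elat is None or elon is None:
        return None
    return haversine_km(slat, slon, elat, elon)

one_degree_north = track_distance_km(37.0, -93.0, 38.0, -93.0)
print(one_degree_north)
```

Because 37.6% of end coordinates are null, the helper returns None for those rows instead of inventing a distance.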

Out[34]:

saturn.columns["slat"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	16,016
min	17.72
max	61.02
mean	37.14
median	37.03
std	5.09
q1	33.19
q3	40.93
iqr	7.74
skew	0.03792
kurtosis	-0.5825
n_outliers	70
outlier_rate	0.0009997
zero_rate	0

Fig 15.

Distribution of slat. Vertical dash marks the median.

Histogram bins for slat (median: 37.02505).
bin	count
17.72 – 18.8	33
18.8 – 19.89	6
19.89 – 20.97	8
20.97 – 22.05	26
22.05 – 23.13	1
23.13 – 24.22	0
24.22 – 25.3	76
25.3 – 26.38	487
26.38 – 27.46	748
27.46 – 28.55	1225
28.55 – 29.63	1468
29.63 – 30.71	3607
30.71 – 31.79	3441
31.79 – 32.88	5003
32.88 – 33.96	4775
33.96 – 35.04	5392
35.04 – 36.12	5164
36.12 – 37.21	4268
37.21 – 38.29	4028
38.29 – 39.37	4807
39.37 – 40.45	5498
40.45 – 41.54	5313
41.54 – 42.62	3995
42.62 – 43.7	3396
43.7 – 44.78	2407
44.78 – 45.87	1720
45.87 – 46.95	1257
46.95 – 48.03	1114
48.03 – 49.11	754
49.11 – 50.2	1
50.2 – 51.28	0
51.28 – 52.36	0
52.36 – 53.44	0
53.44 – 54.53	0
54.53 – 55.61	1
55.61 – 56.69	0
56.69 – 57.77	0
57.77 – 58.86	0
58.86 – 59.94	1
59.94 – 61.02	2

slon numeric feature

This column is the longitude, in decimal degrees, of the tornado's starting point. All values are negative, ranging from -163.53 to -64.72, which places every observation in the Western Hemisphere, roughly from Alaska and Hawaii in the west to Puerto Rico in the east. The mean of -92.74 and median of -93.50 put the centre of the distribution in the central United States. With 17,912 unique values across 70,022 rows and zero nulls, this is a continuous geographic coordinate with moderate repetition. The 951 outliers (about 1.36%) may be far-western or far-eastern locations or data-entry anomalies, and are worth inspecting.

Treatment: Use as-is for spatial modeling or map directly to geographic coordinates; inspect the 951 outliers for plausibility against known geographic bounds.

anthropic:default · confidence high

Out[37]:

saturn.columns["slon"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	17,912
min	-163.5
max	-64.72
mean	-92.74
median	-93.5
std	8.677
q1	-98.4
q3	-86.69
iqr	11.71
skew	-0.3229
kurtosis	2.156
n_outliers	951
outlier_rate	0.01358
zero_rate	0

Fig 16.

Distribution of slon. Vertical dash marks the median.

Histogram bins for slon (median: -93.5).
bin	count
-163.5 – -161.1	2
-161.1 – -158.6	9
-158.6 – -156.1	25
-156.1 – -153.6	8
-153.6 – -151.2	0
-151.2 – -148.7	0
-148.7 – -146.2	0
-146.2 – -143.8	1
-143.8 – -141.3	0
-141.3 – -138.8	0
-138.8 – -136.4	0
-136.4 – -133.9	0
-133.9 – -131.4	0
-131.4 – -128.9	0
-128.9 – -126.5	0
-126.5 – -124	15
-124 – -121.5	231
-121.5 – -119.1	241
-119.1 – -116.6	282
-116.6 – -114.1	174
-114.1 – -111.7	399
-111.7 – -109.2	323
-109.2 – -106.7	325
-106.7 – -104.2	1782
-104.2 – -101.8	4689
-101.8 – -99.3	6094
-99.3 – -96.83	9665
-96.83 – -94.36	8344
-94.36 – -91.89	6626
-91.89 – -89.42	6618
-89.42 – -86.95	6086
-86.95 – -84.48	5361
-84.48 – -82.01	4363
-82.01 – -79.54	3931
-79.54 – -77.07	1885
-77.07 – -74.6	1618
-74.6 – -72.13	541
-72.13 – -69.66	285
-69.66 – -67.19	71
-67.19 – -64.72	28

elat numeric feature

This column is the latitude, in decimal degrees, of the tornado's end point, with values ranging from 17.72° to 61.02°, the same span as slat. The distribution is close to symmetric (skew 0.034, kurtosis -0.41) around a mean of 37.26° with an IQR of 7.42°. The main concern is the 37.65% null rate (26,363 rows), flagged as an alert: over a third of records have no end coordinate. Only 78 outliers (0.18% of non-null values) exist at the extremes of the range.

Treatment: Investigate source of 37.65% nulls before use; pair with longitude for spatial features or geohash encoding; impute or filter nulls depending on missingness mechanism.

anthropic:default · confidence high
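
Before imputing or filtering, it can help to check whether the nulls concentrate in particular groups. The sketch below adds a missingness flag and computes the null rate per group; the record layout, the derived "year" key, and the helper name are assumptions, and the records are toy data.

```python
from collections import defaultdict

def null_rate_by_group(records, group_key, field):
    """Share of records in each group whose `field` is None."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Toy records for illustration only; "year" is assumed to be derived from the date column.
records = [
    {"year": 1960, "elat": None},
    {"year": 1960, "elat": None},
    {"year": 2011, "elat": 33.2},
    {"year": 2011, "elat": None},
]
for rec in records:
    rec["has_end_coords"] = int(rec["elat"] is not None)

rates = null_rate_by_group(records, "year", "elat")
print(rates)
```

If the null rate varies strongly by year or state, the values are not missing at random and simple imputation would be hard to justify.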

Out[40]:

saturn.columns["elat"].stats

stat	value
n	70,022
nulls	26,363 (37.6%)
unique	16,965
min	17.72
max	61.02
mean	37.26
median	37.13
std	4.942
q1	33.49
q3	40.91
iqr	7.42
skew	0.03404
kurtosis	-0.4085
n_outliers	78
outlier_rate	0.001787
zero_rate	0
alert: null_rate	37.6% null

Fig 17.

Distribution of elat. Vertical dash marks the median.

Histogram bins for elat (median: 37.1309).
bin	count
17.72 – 18.8	33
18.8 – 19.89	6
19.89 – 20.97	8
20.97 – 22.05	26
22.05 – 23.13	1
23.13 – 24.22	0
24.22 – 25.3	41
25.3 – 26.38	262
26.38 – 27.46	369
27.46 – 28.55	534
28.55 – 29.63	668
29.63 – 30.71	1952
30.71 – 31.79	2190
31.79 – 32.88	3083
32.88 – 33.96	3018
33.96 – 35.04	3496
35.04 – 36.12	3427
36.12 – 37.21	2899
37.21 – 38.29	2790
38.29 – 39.37	3133
39.37 – 40.45	3414
40.45 – 41.54	3272
41.54 – 42.62	2540
42.62 – 43.7	2078
43.7 – 44.78	1428
44.78 – 45.87	1122
45.87 – 46.95	745
46.95 – 48.03	676
48.03 – 49.11	443
49.11 – 50.2	1
50.2 – 51.28	0
51.28 – 52.36	0
52.36 – 53.44	0
53.44 – 54.53	0
54.53 – 55.61	1
55.61 – 56.69	0
56.69 – 57.77	0
57.77 – 58.86	0
58.86 – 59.94	1
59.94 – 61.02	2

elon numeric feature

This column is the longitude, in decimal degrees, of the tornado's end point (the 'e' pairs it with elat, in contrast to the start coordinates slat and slon; it does not mean 'eastern'). Values range from -163.53 to -64.72, all in the Western Hemisphere and the same span as slon. The mean of -92.19 and median of -92.47 put the centre of the distribution in the central United States. Two points deserve attention: the null rate is 37.65% (26,363 rows, identical to elat), which triggers an alert, and the range runs roughly from Alaska and Hawaii at the western end to Puerto Rico at the eastern end, so the geographic spread is wide.

Treatment: Investigate and impute or exclude the 37.65% nulls before use; pair with a latitude column for geospatial modelling or clustering.

anthropic:default · confidence high

Out[43]:

saturn.columns["elon"].stats

stat	value
n	70,022
nulls	26,363 (37.6%)
unique	18,586
min	-163.5
max	-64.72
mean	-92.19
median	-92.47
std	8.545
q1	-97.73
q3	-86.47
iqr	11.26
skew	-0.5954
kurtosis	3.766
n_outliers	647
outlier_rate	0.01482
zero_rate	0
alert: null_rate	37.6% null

Fig 18.

Distribution of elon. Vertical dash marks the median.

Histogram bins for elon (median: -92.47).
bin	count
-163.5 – -161.1	2
-161.1 – -158.6	9
-158.6 – -156.1	25
-156.1 – -153.6	8
-153.6 – -151.2	0
-151.2 – -148.7	0
-148.7 – -146.2	0
-146.2 – -143.8	1
-143.8 – -141.3	0
-141.3 – -138.8	0
-138.8 – -136.4	0
-136.4 – -133.9	0
-133.9 – -131.4	0
-131.4 – -128.9	0
-128.9 – -126.5	0
-126.5 – -124	8
-124 – -121.5	157
-121.5 – -119.1	158
-119.1 – -116.6	142
-116.6 – -114.1	82
-114.1 – -111.7	173
-111.7 – -109.2	176
-109.2 – -106.7	177
-106.7 – -104.2	774
-104.2 – -101.8	2464
-101.8 – -99.3	3415
-99.3 – -96.83	5507
-96.83 – -94.36	5157
-94.36 – -91.89	4463
-91.89 – -89.42	4668
-89.42 – -86.95	4417
-86.95 – -84.48	3706
-84.48 – -82.01	2739
-82.01 – -79.54	2302
-79.54 – -77.07	1344
-77.07 – -74.6	1056
-74.6 – -72.13	303
-72.13 – -69.66	152
-69.66 – -67.19	47
-67.19 – -64.72	27

len text feature

This column stores the tornado's path length (in miles, per the Fig 4 caption) as text: numeric measurements that belong in a float dtype. All 70,022 values are single tokens that the profiler classifies as all-caps, with values such as '0.1', '0.5', '1.0', and '2.0' dominating the top entries and string lengths of 3 to 8 characters. The duplicate rate is 94.8% (66,359 duplicates across 3,663 unique values), which is expected for a bounded numeric measure and confirms that the column should be cast to float and treated as a continuous feature, not a categorical label.

Treatment: Cast to float64 and use as a numeric feature; check for unit consistency across the value range.

anthropic:default · confidence high

Out[46]:

saturn.columns["len"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	3,663
len_min	3
len_max	8
len_mean	3.626
len_median	3
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	66,359
duplicate_rate	0.9477
vocab_size	2,204
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	94.8% duplicate strings

Fig 19.

Character-length distribution for len.

Character-length distribution for len (mean: 3.626).
chars	count
3	47630
4	11687
5	795
6	9118
7	789
8	3

wid categorical feature

This column ('wid') is the tornado's path width, a numeric measurement encoded as categorical, with 419 distinct values across 70,022 rows. The SPC convention records width in yards, which fits the frequent value 440 (a quarter mile). The ten most common values are all round numbers (10, 50, 100, 30, 20, 200, 25, 150, 40, 75), which suggests that widths are estimated and rounded, not measured precisely. The distribution is skewed: '10' alone accounts for 20.6% of rows (14,417 occurrences), followed by '50' (14.8%) and '100' (10.1%), with a long tail of larger widths.

Treatment: Cast to numeric integer and treat as an ordinal or continuous feature; consider log-transform if using in regression given the heavy skew toward small values.

anthropic:default · confidence medium

Out[49]:

saturn.columns["wid"].stats

stat	value
n	70,022
nulls	0 (0.0%)
unique	419
top_value	10
top_rate	0.2059
cardinality	419
entropy	4.463
entropy_ratio	0.5124

Fig 20.

Top values for wid.

Top values for wid (20 unique shown, of 419 total).
value	count	share
10	14417	20.6%
50	10366	14.8%
100	7067	10.1%
30	4772	6.8%
20	4368	6.2%
200	2946	4.2%
25	2452	3.5%
150	2101	3.0%
40	1967	2.8%
75	1906	2.7%
300	1430	2.0%
33	1160	1.7%
17	1037	1.5%
400	944	1.3%
23	812	1.2%
250	765	1.1%
60	737	1.1%
440	677	1.0%
500	636	0.9%
80	573	0.8%
