saturn·

data trove tornadoes noaa spc

saturn notebook · generated 2026-06-22

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json

Saturn profiled 70,022 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
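# Install the profiler (notebook shell escape), then run it on the source file.
!pip install saturn-dissect
import subprocess

subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json",
        "--findings", "data-trove-tornadoes-noaa-spc.json",
        "--llm", "anthropic:default",  # model used for the prose interpretation
    ],
    check=True,  # raise CalledProcessError if saturn exits with a non-zero status
)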

Summary confidence: high

This dataset contains 70,022 tornado records across the United States, with attributes covering location, timing, magnitude, path dimensions, and human impact. Texas leads with 9,345 events (13.3%), and the classic 'Tornado Alley' states (TX, KS, OK, NE, IA) together account for about 34% of all records. Magnitude is worth close inspection: 46.0% of tornadoes are rated 0 (the weakest rating on the F/EF scale), only 59 reach magnitude 5, and 1,024 records carry the code -9, which looks like a placeholder for an unknown rating rather than a real magnitude. Human cost is highly skewed: 97.7% of events report zero fatalities, but the long tail of deadly events (including multi-fatality outbreaks) and the concentration of records on single dates (2011-04-27 is the most frequent, with 207 records) point to a handful of catastrophic outbreak days that deserve focused analysis.

citing: row_count · column_count · state.top_values · mag.top_values · fatalities.top_rate · fatalities.top_value · date.top_values · injuries.top_values · loss.top_values · len.top_values

Out[4]:

saturn.schema() · 13 columns

column kind n null% unique alerts
date text 70,022 0.0% 12,639 one_word allcaps short_text duplicates
time text 70,022 0.0% 1,438 one_word allcaps short_text duplicates
state categorical 70,022 0.0% 53
mag categorical 70,022 0.0% 7
injuries categorical 70,022 0.0% 209
fatalities categorical 70,022 0.0% 50 imbalance
loss text 70,022 0.0% 1,019 one_word allcaps short_text duplicates
slat numeric 70,022 0.0% 16,016
slon numeric 70,022 0.0% 17,912
elat numeric 70,022 37.6% 16,965 null_rate
elon numeric 70,022 37.6% 18,586 null_rate
len text 70,022 0.0% 3,663 one_word allcaps short_text duplicates
wid categorical 70,022 0.0% 419
Fig 1.
state · Look for the dominance of Tornado Alley states — TX, KS, OK — and how sharply activity drops beyond the top 10.
Top values for state (20 unique shown, of 53 total).
value count share
TX 9345 13.3%
KS 4474 6.4%
OK 4221 6.0%
FL 3620 5.2%
NE 3056 4.4%
IA 2887 4.1%
IL 2835 4.0%
MS 2657 3.8%
AL 2529 3.6%
MO 2462 3.5%
CO 2425 3.5%
LA 2305 3.3%
MN 2118 3.0%
AR 1981 2.8%
SD 1917 2.7%
GA 1898 2.7%
ND 1640 2.3%
IN 1610 2.3%
WI 1515 2.2%
NC 1472 2.1%
Fig 2.
mag · Notice the steep drop-off from magnitude 0 to 5, confirming that the vast majority of tornadoes are weak, with violent twisters being rare.
Top values for mag (7 unique shown, of 7 total).
value count share
0 32218 46.0%
1 23782 34.0%
2 9767 13.9%
3 2585 3.7%
-9 1024 1.5%
4 587 0.8%
5 59 0.1%
Fig 3.
fatalities · The near-total dominance of zero-fatality events (97.7%) contrasts sharply with the small but deadly tail of high-casualty tornadoes.
Top values for fatalities (20 unique shown, of 50 total).
value count share
0 68423 97.7%
1 830 1.2%
2 277 0.4%
3 134 0.2%
4 77 0.1%
5 46 0.1%
6 45 0.1%
7 32 0.0%
9 15 0.0%
10 15 0.0%
11 14 0.0%
8 13 0.0%
16 12 0.0%
13 8 0.0%
17 7 0.0%
18 6 0.0%
21 6 0.0%
12 6 0.0%
22 5 0.0%
25 4 0.0%
Fig 4.
len · Path length is heavily concentrated at 0.1 miles, but watch for a long right tail of tornadoes that traveled many miles.
Character-length distribution for len (mean: 3.626). Empty bins omitted.
chars count
3 47630
4 11687
5 795
6 9118
7 789
8 3
Fig 5.
injuries · Like fatalities, injuries are overwhelmingly zero, but the distribution of non-zero counts reveals which events caused mass casualties.
Top values for injuries (20 unique shown, of 209 total).
value count share
0 62177 88.8%
1 2480 3.5%
2 1388 2.0%
3 770 1.1%
4 484 0.7%
5 385 0.5%
6 300 0.4%
7 194 0.3%
8 171 0.2%
10 141 0.2%
12 120 0.2%
9 117 0.2%
11 80 0.1%
20 71 0.1%
15 69 0.1%
13 66 0.1%
14 55 0.1%
30 46 0.1%
25 44 0.1%
16 43 0.1%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column kind null %
date text 0.0%
time text 0.0%
state categorical 0.0%
mag categorical 0.0%
injuries categorical 0.0%
fatalities categorical 0.0%
loss text 0.0%
slat numeric 0.0%
slon numeric 0.0%
elat numeric 37.6%
elon numeric 37.6%
len text 0.0%
wid categorical 0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
column slat slon elat elon
slat +1.00 -0.17 -0.02 -0.03
slon -0.17 +1.00 -0.04 -0.04
elat -0.02 -0.04 +1.00 -0.16
elon -0.03 -0.04 -0.16 +1.00

date text timestamp

This column contains ISO-8601 calendar dates stored as text strings, all exactly 10 characters long (YYYY-MM-DD format) with zero nulls across 70,022 rows. The 'allcaps' alert is a profiler artefact, not a real data issue, most likely because strings made only of digits and hyphens contain no lowercase letters. What is notable is the duplicate rate of 81.9% (57,383 duplicates, i.e. 70,022 rows minus 12,639 unique dates), meaning many records share the same date. The top value '2011-04-27' appears 207 times, which fits an event date that clusters on outbreak days rather than a unique record timestamp.

Treatment: Parse to datetime dtype, then use as a temporal feature (day-of-week, month, year, cyclical encoding) or grouping key for aggregations.
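The treatment above can be sketched with the Python standard library alone. This is a minimal illustration, not Saturn output: the function name and feature names are assumptions, and only the sample value '2011-04-27' comes from the profile.

```python
import math
from datetime import date


def date_features(text: str) -> dict:
    """Parse a YYYY-MM-DD string and derive simple temporal features."""
    d = date.fromisoformat(text)  # raises ValueError on malformed input
    angle = 2 * math.pi * (d.month - 1) / 12
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),    # Monday = 0 ... Sunday = 6
        "month_sin": math.sin(angle),  # cyclical encoding keeps December next to January
        "month_cos": math.cos(angle),
    }


features = date_features("2011-04-27")
```

The sine/cosine pair is one common way to keep month 12 adjacent to month 1; a plain integer month would place them eleven units apart.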

anthropic:default · confidence high
Out[13]:

saturn.columns["date"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 12,639
len_min 10
len_max 10
len_mean 10
len_median 10
len_p95 10
word_mean 1
word_median 1
n_empty 0
n_duplicates 57,383
duplicate_rate 0.8195
vocab_size 7,831
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% of rows are a single word
alert: allcaps · 100.0% of rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 81.9% duplicate strings
Fig 8.
Character-length distribution for date.
Character-length distribution for date (mean: 10.0). Empty bins omitted.
chars count
10 70022

time text timestamp

This column contains wall-clock time strings in HH:MM:SS format (all values are exactly 8 characters), representing the time of day recorded for each tornado. With only 1,438 unique values across 70,022 rows, the duplicate rate is very high at 97.9%. That is expected rather than alarming: the count sits just under the 1,440 minutes in a day, which suggests times are recorded to the minute and must therefore repeat. The top values cluster in the afternoon and early evening (14:00–19:00), consistent with the late-day peak in tornado activity. The column is stored as text despite being a structured time value, so it should be parsed to a proper time type for any downstream use.

Treatment: Parse to time/datetime type and use as a cyclical feature (e.g., sine/cosine encoding of hour) or join key for time-based aggregations.
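A minimal sketch of the cyclical encoding mentioned above, using only the standard library. The function and key names are illustrative assumptions, and the sample times are made up rather than drawn from the data.

```python
import math
from datetime import time

SECONDS_PER_DAY = 24 * 60 * 60


def time_features(text: str) -> dict:
    """Parse HH:MM:SS and map the time of day onto the unit circle."""
    t = time.fromisoformat(text)
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    angle = 2 * math.pi * seconds / SECONDS_PER_DAY
    return {"hour": t.hour, "tod_sin": math.sin(angle), "tod_cos": math.cos(angle)}


# Times either side of midnight end up close together in the encoded space.
early = time_features("00:05:00")
late = time_features("23:55:00")
gap = math.hypot(early["tod_sin"] - late["tod_sin"], early["tod_cos"] - late["tod_cos"])
```

Encoding the hour as a raw integer would put 23:55 and 00:05 almost a full day apart, which is the problem the circular mapping avoids.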

anthropic:default · confidence high
Out[16]:

saturn.columns["time"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 1,438
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 68,584
duplicate_rate 0.9795
vocab_size 1,352
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% of rows are a single word
alert: allcaps · 100.0% of rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.9% duplicate strings
Fig 9.
Character-length distribution for time.
Character-length distribution for time (mean: 8.0). Empty bins omitted.
chars count
8 70022

state categorical feature

This column contains US state abbreviations, with 53 distinct values (50 states plus likely DC, Puerto Rico, and one other territory) and zero nulls across 70,022 rows. Texas leads at 13.3% (9,345 rows), and the top 10 are weighted toward Great Plains and Southern states (KS, OK, NE, IA, MS, AL). For a tornado dataset this is expected rather than a sampling problem: it reflects where tornadoes are most often reported, not where people live. The entropy ratio of 0.847 indicates reasonably broad coverage across states despite that concentration.

Treatment: One-hot encode or target-encode depending on model type; consider grouping low-frequency states if using one-hot.
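One way to group low-frequency states before one-hot encoding is sketched below. The threshold, the "OTHER" label, and the toy sample are all assumptions chosen for illustration; they are not taken from Saturn or from the real distribution.

```python
from collections import Counter


def one_hot_with_other(values, min_share=0.02):
    """One-hot encode values, pooling categories below min_share into 'OTHER'."""
    counts = Counter(values)
    total = len(values)
    kept = sorted(v for v, c in counts.items() if c / total >= min_share)
    kept_set = set(kept)
    columns = kept + ["OTHER"]
    rows = []
    for v in values:
        label = v if v in kept_set else "OTHER"
        rows.append([1 if label == col else 0 for col in columns])
    return columns, rows


sample = ["TX"] * 6 + ["KS"] * 3 + ["VT"]  # toy data, not the real distribution
columns, rows = one_hot_with_other(sample, min_share=0.2)
```

With 53 states, pooling the rare ones keeps the encoded width manageable; target encoding is the alternative when the model copes poorly with many sparse columns.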

anthropic:default · confidence high
Out[19]:

saturn.columns["state"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 53
top_value TX
top_rate 0.1335
cardinality 53
entropy 4.851
entropy_ratio 0.8468
Fig 10.
Top values for state.
Top values for state (20 unique shown, of 53 total).
value count share
TX 9345 13.3%
KS 4474 6.4%
OK 4221 6.0%
FL 3620 5.2%
NE 3056 4.4%
IA 2887 4.1%
IL 2835 4.0%
MS 2657 3.8%
AL 2529 3.6%
MO 2462 3.5%
CO 2425 3.5%
LA 2305 3.3%
MN 2118 3.0%
AR 1981 2.8%
SD 1917 2.7%
GA 1898 2.7%
ND 1640 2.3%
IN 1610 2.3%
WI 1515 2.2%
NC 1472 2.1%

mag categorical feature

This column holds the tornado's magnitude rating (F/EF scale) encoded as a small integer, with 7 distinct values: 0 through 5 plus -9. The dominant value is '0' (46.0% of rows, 32,218 records), followed by '1' and '2', so counts fall steeply as the rating rises. The value '-9', which appears 1,024 times, is almost certainly a sentinel for a missing or unknown rating rather than a true negative magnitude, and it will distort any analysis that treats the column as a clean ordinal scale. The column is stored as categorical despite being numerically interpretable.

Treatment: Recode '-9' as missing, then treat as ordinal integer feature or one-hot encode depending on model assumptions.
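A minimal sketch of the recoding step, assuming -9 is the only sentinel in the column (the profile shows no other out-of-scale value). The function name and sample list are illustrative.

```python
from typing import Optional

UNKNOWN_MAG = -9  # sentinel observed in the profile; assumed to mean "unknown"


def parse_mag(raw: str) -> Optional[int]:
    """Convert a raw magnitude string to an int, mapping the sentinel to None."""
    value = int(raw)
    return None if value == UNKNOWN_MAG else value


raw = ["0", "1", "-9", "5", "2"]  # illustrative values
mags = [parse_mag(r) for r in raw]
known = [m for m in mags if m is not None]
```

Leaving -9 in place would drag any mean or ordinal model below the weakest real rating, so the recode should happen before any numeric use.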

anthropic:default · confidence high
Out[22]:

saturn.columns["mag"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 7
top_value 0
top_rate 0.4601
cardinality 7
entropy 1.772
entropy_ratio 0.6312
Fig 11.
Top values for mag.
Top values for mag (7 unique shown, of 7 total).
value count share
0 32218 46.0%
1 23782 34.0%
2 9767 13.9%
3 2585 3.7%
-9 1024 1.5%
4 587 0.8%
5 59 0.1%
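Because all seven mag values are listed, the entropy stats can be checked by hand. The sketch below assumes that entropy is Shannon entropy in bits and that entropy_ratio divides it by log2 of the cardinality; this is inferred from the reported numbers, not from Saturn's documentation.

```python
import math

# Counts for mag as reported in the profile's top-values table.
mag_counts = {"0": 32218, "1": 23782, "2": 9767, "3": 2585, "-9": 1024, "4": 587, "5": 59}


def shannon_entropy_bits(counts: dict) -> float:
    """Shannon entropy in bits of the distribution implied by raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


entropy = shannon_entropy_bits(mag_counts)
# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(mag_counts))
```

Under those assumptions the results land on the reported entropy of 1.772 and entropy_ratio of 0.6312 to the precision shown.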

injuries categorical feature

This column represents a count of injuries per record, stored as a categorical type despite being fundamentally numeric. The distribution is extremely right-skewed: 88.8% of the 70,022 records have zero injuries, and counts drop off sharply thereafter, yet the column has 209 unique values, so some very high injury counts exist in the tail. The low entropy ratio (0.123) confirms the heavy concentration on '0'. Frequency does not fall strictly with the count (for example, '10' is more common than '9', and '20' more common than '13'), which may reflect rounding in the reports as well as a long, sparse tail.

Treatment: Cast to integer, then consider zero-inflated modelling or a binary 'any_injury' flag plus a separate log-transformed count for the non-zero subset.
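The two-part feature suggested above can be sketched as follows. The feature names are assumptions, and the sample values are illustrative rather than rows from the dataset.

```python
import math


def injury_features(raw: str) -> dict:
    """Split an injury count into a binary flag and a log-scaled magnitude."""
    count = int(raw)
    return {
        "any_injury": int(count > 0),
        # log1p is only meaningful for the non-zero subset; zeros stay None.
        "log_injuries": math.log1p(count) if count > 0 else None,
    }


feats = [injury_features(r) for r in ["0", "0", "3", "120"]]
```

Separating "did anyone get hurt" from "how many" lets each part be modelled on its own scale, which is the idea behind zero-inflated approaches.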

anthropic:default · confidence high
Out[25]:

saturn.columns["injuries"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 209
top_value 0
top_rate 0.888
cardinality 209
entropy 0.9454
entropy_ratio 0.1227
Fig 12.
Top values for injuries.
Top values for injuries (20 unique shown, of 209 total).
value count share
0 62177 88.8%
1 2480 3.5%
2 1388 2.0%
3 770 1.1%
4 484 0.7%
5 385 0.5%
6 300 0.4%
7 194 0.3%
8 171 0.2%
10 141 0.2%
12 120 0.2%
9 117 0.2%
11 80 0.1%
20 71 0.1%
15 69 0.1%
13 66 0.1%
14 55 0.1%
30 46 0.1%
25 44 0.1%
16 43 0.1%

fatalities categorical feature

This column represents a count of fatalities per tornado, stored as a categorical type despite being numeric in nature; it should be treated as an integer count. The dominant value is '0' (68,423 of 70,022 rows, or 97.7%), which makes the distribution severely right-skewed and triggers the imbalance alert. Cardinality reaches 50 distinct values with extremely low entropy (0.217, entropy ratio 0.039), confirming that the non-zero tail is sparse and long. Analysts modelling rare fatal events should be aware that the positive class is only 1,599 records, about 2.3% of the total.

Treatment: Cast to integer, apply zero-inflated or rare-event modelling strategy, or binarise into fatal/non-fatal indicator before classification.
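To make the imbalance concrete, the sketch below binarises the count and derives class weights from the two figures the profile reports (70,022 rows, 68,423 of them zero). The weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count); choosing it here is an assumption, not something the profile prescribes.

```python
n_rows = 70_022   # from the profile
n_zero = 68_423   # rows with zero fatalities, from the profile

n_fatal = n_rows - n_zero
positive_rate = n_fatal / n_rows

# "Balanced" class weights: rarer classes get proportionally larger weights.
weights = {
    0: n_rows / (2 * n_zero),
    1: n_rows / (2 * n_fatal),
}


def is_fatal(raw: str) -> int:
    """Binarise a raw fatality count into a fatal / non-fatal indicator."""
    return int(int(raw) > 0)
```

With a positive class this small, accuracy alone is a poor metric: a model that always predicts "non-fatal" would already score 97.7%.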

anthropic:default · confidence high
Out[28]:

saturn.columns["fatalities"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 50
top_value 0
top_rate 0.9772
cardinality 50
entropy 0.2174
entropy_ratio 0.03852
alert: imbalance · top value is 97.7% of rows
Fig 13.
Top values for fatalities.
Top values for fatalities (20 unique shown, of 50 total).
value count share
0 68423 97.7%
1 830 1.2%
2 277 0.4%
3 134 0.2%
4 77 0.1%
5 46 0.1%
6 45 0.1%
7 32 0.0%
9 15 0.0%
10 15 0.0%
11 14 0.0%
8 13 0.0%
16 12 0.0%
13 8 0.0%
17 7 0.0%
18 6 0.0%
21 6 0.0%
12 6 0.0%
22 5 0.0%
25 4 0.0%

loss text numeric_target

This column contains numeric loss values stored as text strings: all single-token entries (one_word_rate: 1.0) with a mean length of about 3.18 characters, dominated by small non-negative numbers like '0.0', '4.0', '5.0'. The 92.5% allcaps rate is a misleading artefact of how saturn classifies short numeric strings. Notably, '0.0' and '0' appear as separate tokens (22,764 and 5,248 occurrences respectively), indicating inconsistent serialization of the same underlying zero value; an analyst should consolidate these before use. The duplicate rate is 98.5%, reflecting heavy reuse of a small set of values: 70,022 rows share only 1,019 unique string representations.

Treatment: Cast to float (unifying '0' and '0.0'), then use as a regression target or loss metric; check whether the spike at zero represents true zero-loss or missing/default values.
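A small sketch of why the cast matters: '0' and '0.0' are different strings but the same number. The sample list is illustrative, not a slice of the real column.

```python
raw = ["0.0", "0", "4.0", "5.0", "0", "0.05"]  # illustrative strings

values = [float(s) for s in raw]

unique_before = len(set(raw))     # '0' and '0.0' counted separately
unique_after = len(set(values))   # both collapse to 0.0 after the cast
zero_share = sum(v == 0.0 for v in values) / len(values)
```

On the real column the same comparison would show how many of the 1,019 unique strings are duplicate spellings of one value, and `zero_share` gives the size of the zero spike the treatment asks you to investigate.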

anthropic:default · confidence high
Out[31]:

saturn.columns["loss"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 1,019
len_min 1
len_max 10
len_mean 3.181
len_median 3
len_p95 5
word_mean 1
word_median 1
n_empty 0
n_duplicates 69,003
duplicate_rate 0.9854
vocab_size 503
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.9251
boilerplate_rate 0
alert: one_word · 100.0% of rows are a single word
alert: allcaps · 92.5% of rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 98.5% duplicate strings
Fig 14.
Character-length distribution for loss.
Character-length distribution for loss (mean: 3.181). Empty bins omitted.
chars count
1 5248
3 50873
4 7040
5 4912
6 1581
7 292
8 58
9 16
10 2

slat numeric feature

This column almost certainly represents geographic latitude in decimal degrees, most likely the starting latitude of each tornado track (paired with slon, and with elat/elon for the end point). Values range from 17.72° to 61.02°, which spans roughly from the latitude of Puerto Rico to that of southern Alaska, with the contiguous United States in between. The distribution is close to symmetric (skew 0.038, kurtosis -0.582) and centred on a mean of 37.14° with an IQR of 7.74°, so the bulk of records sit at mid-latitude U.S. locations. Only 70 outliers (0.1%) exist, likely the extreme northern or southern observations, and there are no nulls.

Treatment: Use directly as a geospatial feature; consider pairing with longitude and engineering distance or region-based features rather than treating as a raw numeric.
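The profile does not say how outliers are defined. A plausible reading, sketched below as an assumption, is Tukey's rule: values more than 1.5 × IQR beyond the quartiles. The quartiles are the ones reported for slat.

```python
# Quartiles for slat as reported in the profile.
q1, q3 = 33.19, 40.93

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # assumed rule: Tukey's 1.5 x IQR fences
upper_fence = q3 + 1.5 * iqr


def is_outlier(latitude: float) -> bool:
    """True when a latitude falls outside the assumed Tukey fences."""
    return latitude < lower_fence or latitude > upper_fence
```

If this is indeed the rule in use, the flagged rows are those south of about 21.6° or north of about 52.5°, which would be territory and Alaska records rather than data errors.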

anthropic:default · confidence high
Out[34]:

saturn.columns["slat"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 16,016
min 17.72
max 61.02
mean 37.14
median 37.03
std 5.09
q1 33.19
q3 40.93
iqr 7.74
skew 0.03792
kurtosis -0.5825
n_outliers 70
outlier_rate 0.0009997
zero_rate 0
Fig 15.
Distribution of slat. Vertical dash marks the median.
Histogram bins for slat (median: 37.02505).
bin count
17.72 – 18.8 33
18.8 – 19.89 6
19.89 – 20.97 8
20.97 – 22.05 26
22.05 – 23.13 1
23.13 – 24.22 0
24.22 – 25.3 76
25.3 – 26.38 487
26.38 – 27.46 748
27.46 – 28.55 1225
28.55 – 29.63 1468
29.63 – 30.71 3607
30.71 – 31.79 3441
31.79 – 32.88 5003
32.88 – 33.96 4775
33.96 – 35.04 5392
35.04 – 36.12 5164
36.12 – 37.21 4268
37.21 – 38.29 4028
38.29 – 39.37 4807
39.37 – 40.45 5498
40.45 – 41.54 5313
41.54 – 42.62 3995
42.62 – 43.7 3396
43.7 – 44.78 2407
44.78 – 45.87 1720
45.87 – 46.95 1257
46.95 – 48.03 1114
48.03 – 49.11 754
49.11 – 50.2 1
50.2 – 51.28 0
51.28 – 52.36 0
52.36 – 53.44 0
53.44 – 54.53 0
54.53 – 55.61 1
55.61 – 56.69 0
56.69 – 57.77 0
57.77 – 58.86 0
58.86 – 59.94 1
59.94 – 61.02 2

slon numeric feature

This column contains geographic longitude values in decimal degrees, most likely the starting longitude of each tornado track (the counterpart of slat). All values are negative, ranging from -163.53 to -64.72, which places every observation in the Western Hemisphere. The mean of -92.74 and median of -93.50, read together with the slat centre near 37°, put the bulk of records in the central United States. With 17,912 unique values across 70,022 rows and zero nulls, this is a continuous geographic coordinate with mild repetition, and the 951 outliers (about 1.36%) may be far-western or far-eastern events or data entry anomalies worth inspecting.

Treatment: Use as-is for spatial modeling or map directly to geographic coordinates; inspect the 951 outliers for plausibility against known geographic bounds.

anthropic:default · confidence high
Out[37]:

saturn.columns["slon"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 17,912
min -163.5
max -64.72
mean -92.74
median -93.5
std 8.677
q1 -98.4
q3 -86.69
iqr 11.71
skew -0.3229
kurtosis 2.156
n_outliers 951
outlier_rate 0.01358
zero_rate 0
Fig 16.
Distribution of slon. Vertical dash marks the median.
Histogram bins for slon (median: -93.5).
bin count
-163.5 – -161.1 2
-161.1 – -158.6 9
-158.6 – -156.1 25
-156.1 – -153.6 8
-153.6 – -151.2 0
-151.2 – -148.7 0
-148.7 – -146.2 0
-146.2 – -143.8 1
-143.8 – -141.3 0
-141.3 – -138.8 0
-138.8 – -136.4 0
-136.4 – -133.9 0
-133.9 – -131.4 0
-131.4 – -128.9 0
-128.9 – -126.5 0
-126.5 – -124 15
-124 – -121.5 231
-121.5 – -119.1 241
-119.1 – -116.6 282
-116.6 – -114.1 174
-114.1 – -111.7 399
-111.7 – -109.2 323
-109.2 – -106.7 325
-106.7 – -104.2 1782
-104.2 – -101.8 4689
-101.8 – -99.3 6094
-99.3 – -96.83 9665
-96.83 – -94.36 8344
-94.36 – -91.89 6626
-91.89 – -89.42 6618
-89.42 – -86.95 6086
-86.95 – -84.48 5361
-84.48 – -82.01 4363
-82.01 – -79.54 3931
-79.54 – -77.07 1885
-77.07 – -74.6 1618
-74.6 – -72.13 541
-72.13 – -69.66 285
-69.66 – -67.19 71
-67.19 – -64.72 28

elat numeric feature

This column almost certainly represents geographic latitude in decimal degrees, most likely the ending latitude of each tornado track (the counterpart of slat). Values range from 17.72° to 61.02°, the same span as slat. The distribution is close to symmetric (skew 0.034, kurtosis -0.41) and centred on a mean of 37.26° with an IQR of 7.42°. The most notable concern is a 37.65% null rate (26,363 rows), flagged as an alert, meaning over a third of records lack an end-point latitude. Only 78 outliers (0.18% of non-null values) exist at the extremes of the range.

Treatment: Investigate source of 37.65% nulls before use; pair with longitude for spatial features or geohash encoding; impute or filter nulls depending on missingness mechanism.
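elat and elon report the same null count (26,363), which suggests, but does not prove, that they go missing together. A quick row-level check along these lines would confirm it; the rows and the status labels below are made up for illustration.

```python
rows = [  # illustrative records, not taken from the dataset
    {"slat": 33.2, "slon": -87.5, "elat": 33.6, "elon": -86.9},
    {"slat": 35.1, "slon": -97.4, "elat": None, "elon": None},
    {"slat": 41.0, "slon": -95.9, "elat": 41.1, "elon": None},
]


def end_point_status(row: dict) -> str:
    """Classify a record by whether its end coordinates are present."""
    has_lat = row["elat"] is not None
    has_lon = row["elon"] is not None
    if has_lat and has_lon:
        return "complete"
    if not has_lat and not has_lon:
        return "missing"
    return "partial"  # only one coordinate present: worth a closer look


status = [end_point_status(r) for r in rows]
missing_rate = status.count("missing") / len(rows)
```

If every null row turns out to be "missing" rather than "partial", a single `has_end_point` indicator captures the missingness for both columns.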

anthropic:default · confidence high
Out[40]:

saturn.columns["elat"].stats

stat value
n 70,022
nulls 26,363 (37.6%)
unique 16,965
min 17.72
max 61.02
mean 37.26
median 37.13
std 4.942
q1 33.49
q3 40.91
iqr 7.42
skew 0.03404
kurtosis -0.4085
n_outliers 78
outlier_rate 0.001787
zero_rate 0
alert: null_rate · 37.6% null
Fig 17.
Distribution of elat. Vertical dash marks the median.
Histogram bins for elat (median: 37.1309).
bin count
17.72 – 18.8 33
18.8 – 19.89 6
19.89 – 20.97 8
20.97 – 22.05 26
22.05 – 23.13 1
23.13 – 24.22 0
24.22 – 25.3 41
25.3 – 26.38 262
26.38 – 27.46 369
27.46 – 28.55 534
28.55 – 29.63 668
29.63 – 30.71 1952
30.71 – 31.79 2190
31.79 – 32.88 3083
32.88 – 33.96 3018
33.96 – 35.04 3496
35.04 – 36.12 3427
36.12 – 37.21 2899
37.21 – 38.29 2790
38.29 – 39.37 3133
39.37 – 40.45 3414
40.45 – 41.54 3272
41.54 – 42.62 2540
42.62 – 43.7 2078
43.7 – 44.78 1428
44.78 – 45.87 1122
45.87 – 46.95 745
46.95 – 48.03 676
48.03 – 49.11 443
49.11 – 50.2 1
50.2 – 51.28 0
51.28 – 52.36 0
52.36 – 53.44 0
53.44 – 54.53 0
54.53 – 55.61 1
55.61 – 56.69 0
56.69 – 57.77 0
57.77 – 58.86 0
58.86 – 59.94 1
59.94 – 61.02 2

elon numeric feature

This column almost certainly represents longitude in decimal degrees, most likely the ending longitude of each tornado track: the 'e' prefix pairs it with elat, mirroring slat/slon, and does not mean eastern longitude. Values range from -163.53 to -64.72, the same Western Hemisphere span as slon. The mean of -92.19 and median of -92.47 place the centre of the distribution in the central United States. Two points deserve attention: the null rate is high at 37.65% (26,363 rows, the same count as elat), triggering an alert, and the range is wide, with the minimum of -163.53 lying at Alaskan longitudes and the maximum of -64.72 at the easternmost extreme.

Treatment: Investigate and impute or exclude the 37.65% nulls before use; pair with a latitude column for geospatial modelling or clustering.
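Pairing the coordinates makes it possible to derive a straight-line track distance. The sketch below uses the haversine formula and returns None when the end point is missing; the function name is an assumption, and treating slat/slon as start and elat/elon as end is the interpretation given above, not something the profile confirms.

```python
import math
from typing import Optional

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def track_km(slat: float, slon: float,
             elat: Optional[float], elon: Optional[float]) -> Optional[float]:
    """Great-circle distance in km between start and end, or None if the end is null."""
    if elat is None or elon is None:
        return None
    phi1, phi2 = math.radians(slat), math.radians(elat)
    dphi = phi2 - phi1
    dlam = math.radians(elon - slon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


one_degree_north = track_km(35.0, -97.0, 36.0, -97.0)
no_end_point = track_km(35.0, -97.0, None, None)
```

Comparing this derived distance with the len column is one way to sanity-check both fields, bearing in mind that a real track is rarely a straight line.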

anthropic:default · confidence high
Out[43]:

saturn.columns["elon"].stats

stat value
n 70,022
nulls 26,363 (37.6%)
unique 18,586
min -163.5
max -64.72
mean -92.19
median -92.47
std 8.545
q1 -97.73
q3 -86.47
iqr 11.26
skew -0.5954
kurtosis 3.766
n_outliers 647
outlier_rate 0.01482
zero_rate 0
alert: null_rate · 37.6% null
Fig 18.
Distribution of elon. Vertical dash marks the median.
Histogram bins for elon (median: -92.47).
bin count
-163.5 – -161.1 2
-161.1 – -158.6 9
-158.6 – -156.1 25
-156.1 – -153.6 8
-153.6 – -151.2 0
-151.2 – -148.7 0
-148.7 – -146.2 0
-146.2 – -143.8 1
-143.8 – -141.3 0
-141.3 – -138.8 0
-138.8 – -136.4 0
-136.4 – -133.9 0
-133.9 – -131.4 0
-131.4 – -128.9 0
-128.9 – -126.5 0
-126.5 – -124 8
-124 – -121.5 157
-121.5 – -119.1 158
-119.1 – -116.6 142
-116.6 – -114.1 82
-114.1 – -111.7 173
-111.7 – -109.2 176
-109.2 – -106.7 177
-106.7 – -104.2 774
-104.2 – -101.8 2464
-101.8 – -99.3 3415
-99.3 – -96.83 5507
-96.83 – -94.36 5157
-94.36 – -91.89 4463
-91.89 – -89.42 4668
-89.42 – -86.95 4417
-86.95 – -84.48 3706
-84.48 – -82.01 2739
-82.01 – -79.54 2302
-79.54 – -77.07 1344
-77.07 – -74.6 1056
-74.6 – -72.13 303
-72.13 – -69.66 152
-69.66 – -67.19 47
-67.19 – -64.72 27

len text feature

This column stores numeric measurements encoded as text strings, almost certainly tornado path length (the figure caption for this column reads the values as miles), held in the wrong dtype. All 70,022 values are single tokens that trip the all-caps check, with values like '0.1', '0.5', '1.0', '2.0' dominating the top entries and a character length range of 3–8. The duplicate rate is very high at 94.8% (66,359 duplicates across only 3,663 unique values), which is expected for a bounded numeric measure and confirms this should be cast to float and treated as a continuous feature rather than a categorical label.

Treatment: Cast to float64 and use as a numeric feature; check for unit consistency across the value range.
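A defensive version of the cast is sketched below: anything that does not parse becomes None and is counted, so bad strings surface instead of raising midway. The helper name is an assumption, and the 'n/a' entry is a hypothetical bad value added to show the failure path; the profile gives no sign that such values exist.

```python
import statistics
from typing import Optional


def to_float_or_none(raw) -> Optional[float]:
    """Cast to float, returning None for anything that does not parse."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


raw = ["0.1", "0.5", "1.0", "2.0", "n/a"]  # last entry is a hypothetical bad value
lengths = [to_float_or_none(r) for r in raw]
n_failed = sum(v is None for v in lengths)

parsed = [v for v in lengths if v is not None]
median_len = statistics.median(parsed)
longest = max(parsed)
```

Comparing the median with the maximum after the cast gives a first read on the right tail and on whether any values look like they are in a different unit.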

anthropic:default · confidence high
Out[46]:

saturn.columns["len"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 3,663
len_min 3
len_max 8
len_mean 3.626
len_median 3
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 66,359
duplicate_rate 0.9477
vocab_size 2,204
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% of rows are a single word
alert: allcaps · 100.0% of rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 94.8% duplicate strings
Fig 19.
Character-length distribution for len.
Character-length distribution for len (mean: 3.626). Empty bins omitted.
chars count
3 47630
4 11687
5 795
6 9118
7 789
8 3

wid categorical feature

This column ('wid') is most likely tornado path width stored as a categorical, with only 419 distinct values across 70,022 rows; the profile does not state the unit. The ten most common values are all round numbers (10, 50, 100, 30, 20, 200, 25, 150, 40, 75), which suggests widths are estimated and rounded rather than measured precisely. The distribution is notably skewed: value '10' alone accounts for 20.6% of all rows (14,417 occurrences), with a steep drop-off thereafter, indicating heavy concentration at small values.

Treatment: Cast to numeric integer and treat as an ordinal or continuous feature; consider log-transform if using in regression given the heavy skew toward small values.
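A minimal sketch of the cast and log transform, using the six most common wid values from the profile as sample input. Using log1p rather than log is a choice made here so that a width of zero, if one exists, would not raise an error.

```python
import math

raw = ["10", "50", "100", "30", "20", "200"]  # six most common values in the profile

widths = [int(s) for s in raw]
log_widths = [math.log1p(w) for w in widths]

# The transform compresses the spread between narrow and wide tracks.
raw_spread = max(widths) / min(widths)
log_spread = max(log_widths) / min(log_widths)
```

On this sample the ratio between the widest and narrowest value shrinks from 20 to a little over 2, which is the effect the treatment is after for regression.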

anthropic:default · confidence medium
Out[49]:

saturn.columns["wid"].stats

stat value
n 70,022
nulls 0 (0.0%)
unique 419
top_value 10
top_rate 0.2059
cardinality 419
entropy 4.463
entropy_ratio 0.5124
Fig 20.
Top values for wid.
Top values for wid (20 unique shown, of 419 total).
value count share
10 14417 20.6%
50 10366 14.8%
100 7067 10.1%
30 4772 6.8%
20 4368 6.2%
200 2946 4.2%
25 2452 3.5%
150 2101 3.0%
40 1967 2.8%
75 1906 2.7%
300 1430 2.0%
33 1160 1.7%
17 1037 1.5%
400 944 1.3%
23 812 1.2%
250 765 1.1%
60 737 1.1%
440 677 1.0%
500 636 0.9%
80 573 0.8%

How to cite

BibTeX
@misc{saturn-data-trove-tornadoes-noaa-spc-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove tornadoes noaa spc},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-tornadoes-noaa-spc}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove tornadoes noaa spc. Source: /home/coolhand/html/datavis/data_trove/data/quirky/tornadoes.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-tornadoes-noaa-spc