saturn·

quirky fossils

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/fossils.json

Saturn profiled 22,043 rows across 21 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

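[2]:
# Install the profiler into the notebook kernel (IPython shell escape).
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source file.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/fossils.json",
    "--findings", "quirky-fossils.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if the CLI exits non-zero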

Summary confidence: high

This dataset contains 22,043 fossil occurrence records with 21 columns covering taxonomy (phylum, class, order, family, genus, name, rank), geography (country, state, lat/lon, paleolat/paleolng), and geologic age (early_age_mya, late_age_mya, period, late_interval). Taxonomy is dominated by Chordata (about 82% of rows), with Mammalia the leading class (~32%) followed by the dinosaur clades Saurischia (~25%) and Ornithischia (~13%), so the vertebrate and dinosaur emphasis is worth examining first. Geographically the data skews heavily to the US (~51%), with Wyoming, Montana, and New Mexico the most frequent named states, so any spatial analysis should account for this North American concentration. The age columns (early_age_mya, late_age_mya) are right-skewed with medians of 110.1 and 93.9 Mya and about 11.5% of rows flagged as outliers, reflecting a long tail of very old records. 'collection' and 'formation' contain only empty strings and can be ignored.

citing: row_count · column_count · phylum.top_values · class.top_values · country.top_values · state.top_values · early_age_mya.stats · late_age_mya.stats · rank.top_values · collection.stats · formation.stats

Out[4]:

saturn.schema() · 21 columns

column kind n null% unique alerts
name text 22,043 0.0% 4,660 one_word duplicates
rank categorical 22,043 0.0% 18
lat numeric 22,043 0.0% 4,095 high_skew outliers
lon numeric 22,043 0.0% 4,259
early_age_mya numeric 22,043 0.0% 164 outliers
late_age_mya numeric 22,043 0.0% 156 outliers
period categorical 22,043 0.0% 298
late_interval categorical 22,043 0.0% 138
phylum categorical 22,043 0.0% 4
class categorical 22,043 0.0% 19
order categorical 22,043 0.0% 99
family categorical 22,043 0.0% 528
genus text 22,043 0.0% 2,608 one_word short_text duplicates
country categorical 22,043 0.0% 93
state categorical 22,043 0.0% 519
formation categorical 22,043 0.0% 1 imbalance
collection categorical 22,043 0.0% 1 imbalance
paleolat numeric 22,043 2.2% 3,214 outliers
paleolng numeric 22,043 2.2% 3,715
reference_no text 22,043 0.0% 3,725 one_word allcaps short_text duplicates
occurrence_no text 22,043 0.0% 22,043 near_unique one_word allcaps short_text
Fig 1.
class · Mammalia, Saurischia, and Ornithischia dominate; check how unevenly taxonomic classes are represented.
Top values for class (19 unique shown, of 19 total).
value               count  share
Mammalia             7015  31.8%
Saurischia           5507  25.0%
Ornithischia         2811  12.8%
Cephalopoda          2000   9.1%
Trilobita            2000   9.1%
Conodonta            1883   8.5%
Reptilia              568   2.6%
Aves                   92   0.4%
(empty string)         60   0.3%
NO_CLASS_SPECIFIED     26   0.1%
Pteraspidomorpha       24   0.1%
Placodermi             17   0.1%
Acanthodii             15   0.1%
Osteichthyes           11   0.0%
Thelodonti              4   0.0%
Osteostraci             4   0.0%
Chondrichthyes          4   0.0%
Actinopterygii          1   0.0%
Galeaspidomorphi        1   0.0%
Fig 2.
phylum · Chordata accounts for roughly 82% of records — see how heavily vertebrate-skewed the dataset is.
Top values for phylum (4 unique shown, of 4 total).
value           count  share
Chordata        17993  81.6%
Mollusca         2000   9.1%
Arthropoda       2000   9.1%
(empty string)     50   0.2%
Fig 3.
country · Look for the strong US concentration (~51%) before drawing global conclusions.
Top values for country (20 unique shown, of 93 total).
value  count  share
US     11218  50.9%
CA      1830   8.3%
CN      1661   7.5%
UK       983   4.5%
ES       841   3.8%
FR       390   1.8%
MA       303   1.4%
AR       292   1.3%
CZ       288   1.3%
AU       247   1.1%
TZ       218   1.0%
UZ       184   0.8%
MX       175   0.8%
KR       175   0.8%
SE       170   0.8%
CH       166   0.8%
MN       162   0.7%
ZA       159   0.7%
RU       156   0.7%
DE       152   0.7%
Fig 4.
early_age_mya · Right-skewed distribution from recent fossils to ~539 Mya; watch for the long tail and outliers.
Histogram bins for early_age_mya (median: 110.1).
bin             count
0.0117 – 13.48   2334
13.48 – 26.95    1665
26.95 – 40.42      85
40.42 – 53.89      12
53.89 – 67.36    2904
67.36 – 80.83    1302
80.83 – 94.3     2239
94.3 – 107.8      454
107.8 – 121.2     796
121.2 – 134.7    1645
134.7 – 148.2     410
148.2 – 161.6    1650
161.6 – 175.1     421
175.1 – 188.6      36
188.6 – 202.1     975
202.1 – 215.5     193
215.5 – 229       530
229 – 242.5       216
242.5 – 255.9      61
255.9 – 269.4       0
269.4 – 282.9       0
282.9 – 296.3       0
296.3 – 309.8      25
309.8 – 323.3      17
323.3 – 336.8      21
336.8 – 350.2      37
350.2 – 363.7      58
363.7 – 377.2     770
377.2 – 390.6     319
390.6 – 404.1     319
404.1 – 417.6     176
417.6 – 431       602
431 – 444.5       265
444.5 – 458       382
458 – 471.5       427
471.5 – 484.9      81
484.9 – 498.4     334
498.4 – 511.9     180
511.9 – 525.3      62
525.3 – 538.8      40
Fig 5.
rank · Most records are identified to species or genus level — useful context for taxonomic resolution.
Top values for rank (18 unique shown, of 18 total).
value           count  share
species          9082  41.2%
genus            7342  33.3%
unranked clade   2828  12.8%
family           1716   7.8%
subfamily         272   1.2%
subclass          205   0.9%
class             134   0.6%
order             115   0.5%
infraorder         97   0.4%
superfamily        75   0.3%
subgenus           51   0.2%
kingdom            50   0.2%
suborder           29   0.1%
subspecies         23   0.1%
tribe              12   0.1%
subphylum           9   0.0%
superorder          2   0.0%
superclass          1   0.0%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column         kind         null %
name           text         0.0%
rank           categorical  0.0%
lat            numeric      0.0%
lon            numeric      0.0%
early_age_mya  numeric      0.0%
late_age_mya   numeric      0.0%
period         categorical  0.0%
late_interval  categorical  0.0%
phylum         categorical  0.0%
class          categorical  0.0%
order          categorical  0.0%
family         categorical  0.0%
genus          text         0.0%
country        categorical  0.0%
state          categorical  0.0%
formation      categorical  0.0%
collection     categorical  0.0%
paleolat       numeric      2.2%
paleolng       numeric      2.2%
reference_no   text         0.0%
occurrence_no  text         0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 6 numeric columns (values clipped to 2 decimals).
               lat    lon    early_age_mya  late_age_mya  paleolat  paleolng
lat            +1.00  -0.27  +0.04          +0.05         +0.12     -0.17
lon            -0.27  +1.00  +0.17          +0.16         -0.22     +0.45
early_age_mya  +0.04  +0.17  +1.00          +1.00         -0.61     +0.03
late_age_mya   +0.05  +0.16  +1.00          +1.00         -0.61     +0.03
paleolat       +0.12  -0.22  -0.61          -0.61         +1.00     -0.21
paleolng       -0.17  +0.45  +0.03          +0.03         -0.21     +1.00

name text label

This column holds taxonomic names of fossil organisms, dominated by dinosaur and conodont genera and clades such as Theropoda (768), Dinosauria (512), and Palmatolepis (376). Values are short (mean length 15 characters, 58% single words), so the Flesch readability score of -4.13 is an artifact of applying an English-prose metric to scientific Latin and carries no meaning here. With 4,660 unique values across 22,043 rows and a 78.9% duplicate rate, the column behaves as a categorical taxon label rather than a free-text field.

Treatment: Treat as a high-cardinality categorical; group rare taxa or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["name"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 4,660
len_min 3
len_max 47
len_mean 15.09
len_median 14
len_p95 26
word_mean 1.425
word_median 1
n_empty 0
n_duplicates 17,383
duplicate_rate 0.7886
vocab_size 5,140
readability_flesch_mean -4.127
emoji_rate 0
url_rate 0
one_word_rate 0.5846
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 58.5% rows are a single word
alert: duplicates · 78.9% duplicate strings
Fig 8.
Character-length distribution for name.
Character-length distribution for name (mean: 15.09).
chars    count
3 – 4       72
4 – 5      190
5 – 6      110
6 – 7      466
7 – 8      942
8 – 10    2265
10 – 11   2027
11 – 12   1560
12 – 13   1821
13 – 14   1566
14 – 15   2019
15 – 16    626
16 – 17    923
17 – 18    808
18 – 20   1219
20 – 21   1117
21 – 22    747
22 – 23    889
23 – 24    549
24 – 25    531
25 – 26    719
26 – 27    307
27 – 28    119
28 – 29    178
29 – 31     63
31 – 32     52
32 – 33     22
33 – 34      8
34 – 35      2
35 – 36     17
36 – 37     19
37 – 38     11
38 – 39     50
39 – 40     13
40 – 42      9
42 – 43      2
43 – 44      1
44 – 45      3
45 – 46      0
46 – 47      1

rank categorical feature

This column records the taxonomic rank of each record, with 18 distinct values and no nulls across 22,043 rows. 'species' is the most frequent at 41.2% (9,082 rows), followed by 'genus' (7,342), 'unranked clade' (2,828), and 'family' (1,716); every other rank is rare, with the next most common, 'subfamily', at only 272 rows. 'unranked clade' sits alongside the formal Linnaean ranks and has no fixed position in the hierarchy, which matters for any depth-based encoding.

Treatment: One-hot or ordinal-encode by taxonomic depth before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["rank"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 18
top_value species
top_rate 0.412
cardinality 18
entropy 2.085
entropy_ratio 0.5001
Fig 9.
Top values for rank.
Top values for rank (18 unique shown, of 18 total).
value           count  share
species          9082  41.2%
genus            7342  33.3%
unranked clade   2828  12.8%
family           1716   7.8%
subfamily         272   1.2%
subclass          205   0.9%
class             134   0.6%
order             115   0.5%
infraorder         97   0.4%
superfamily        75   0.3%
subgenus           51   0.2%
kingdom            50   0.2%
suborder           29   0.1%
subspecies         23   0.1%
tribe              12   0.1%
subphylum           9   0.0%
superorder          2   0.0%
superclass          1   0.0%
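
The ordinal encoding suggested in the treatment for rank could look roughly like the sketch below. The broad-to-fine ordering of the 17 formal ranks is an assumption based on conventional Linnaean hierarchy, not something saturn reports, and the names RANK_ORDER and encode_rank are hypothetical. 'unranked clade' is left unencoded because it has no fixed depth.

```python
# Ranks observed in the rank column, ordered from broadest to finest.
# This ordering is an assumption of the sketch (conventional hierarchy).
RANK_ORDER = [
    "kingdom", "subphylum", "superclass", "class", "subclass",
    "superorder", "order", "suborder", "infraorder", "superfamily",
    "family", "subfamily", "tribe", "genus", "subgenus",
    "species", "subspecies",
]
RANK_DEPTH = {rank: depth for depth, rank in enumerate(RANK_ORDER)}


def encode_rank(rank):
    """Return an integer depth for a rank, or None when it has no fixed depth."""
    # "unranked clade" is absent from RANK_DEPTH, so it maps to None
    # instead of being forced into the ordering.
    return RANK_DEPTH.get(rank.strip().lower())


print(encode_rank("species"))         # 15
print(encode_rank("family"))          # 10
print(encode_rank("unranked clade"))  # None
```

Rows that come back as None can then be handled with a separate indicator feature or a one-hot level, depending on the model.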

lat numeric feature

This is a latitude coordinate in degrees, ranging from -84.33 to 79.75 with a median of 41.70 and IQR of 11.61. The strong negative skew (-2.44) and high kurtosis (7.05) reflect a tight cluster at northern mid-latitudes with a long tail toward the south. Because the IQR is so narrow, the 1.5 × IQR fences fall at roughly 17.6° and 64.0°, so the 9.16% of rows flagged as outliers include tropical and Arctic sites as well as the southern hemisphere. No nulls and 4,095 distinct values across 22,043 rows indicate repeated locations rather than per-row unique geocoding.

Treatment: Pair with a longitude column for spatial features; avoid log-transform and keep raw degrees.

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["lat"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 4,095
min -84.33
max 79.75
mean 37.12
median 41.7
std 19.37
q1 35
q3 46.61
iqr 11.61
skew -2.442
kurtosis 7.054
n_outliers 2,019
outlier_rate 0.09159
zero_rate 0
alert: high_skew · skew=-2.44
alert: outliers · 9.2% rows beyond 1.5 IQR
Fig 10.
Distribution of lat. Vertical dash marks the median.
Histogram bins for lat (median: 41.7).
bin              count
-84.33 – -80.23      4
-80.23 – -76.13      0
-76.13 – -72.03      0
-72.03 – -67.93     12
-67.93 – -63.82      7
-63.82 – -59.72      0
-59.72 – -55.62      0
-55.62 – -51.52      0
-51.52 – -47.41     16
-47.41 – -43.31     60
-43.31 – -39.21    111
-39.21 – -35.11    127
-35.11 – -31.01    166
-31.01 – -26.9     347
-26.9 – -22.8       40
-22.8 – -18.7       99
-18.7 – -14.6      138
-14.6 – -10.5        6
-10.5 – -6.394     255
-6.394 – -2.292     21
-2.292 – 1.81        0
1.81 – 5.912        32
5.912 – 10.01       22
10.01 – 14.12       12
14.12 – 18.22      168
18.22 – 22.32      137
22.32 – 26.42     1224
26.42 – 30.52      669
30.52 – 34.63     1615
34.63 – 38.73     3128
38.73 – 42.83     4830
42.83 – 46.93     3520
46.93 – 51.04     2987
51.04 – 55.14     1322
55.14 – 59.24      253
59.24 – 63.34      303
63.34 – 67.44       99
67.44 – 71.55      118
71.55 – 75.65      180
75.65 – 79.75       15
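
To see which latitudes the outlier alert for lat actually covers, the fences can be recomputed from the quartiles in the stats table. This is a small sketch assuming the "beyond 1.5 IQR" alert uses standard Tukey fences (q1 - 1.5·IQR and q3 + 1.5·IQR); the function name iqr_fences is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles for lat taken from the stats table (q1 = 35, q3 = 46.61).
lower, upper = iqr_fences(35.0, 46.61)
print(round(lower, 1), round(upper, 1))  # 17.6 64.0
```

Under that assumption, any record south of about 17.6°N or north of about 64.0°N is flagged, which is why the flagged share is as high as 9.2% for a column with no obvious data errors.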

lon numeric feature

This column captures longitude coordinates, with values spanning -176.67 to 177.07, consistent with the full global range. The distribution is right-skewed (skew 0.93) and centered on a median of -98.25, suggesting a heavy concentration of records in the Western Hemisphere (likely the Americas). Only 4,259 unique values across 22,043 rows indicate repeated location points, and just 3 outliers were flagged.

Treatment: Pair with latitude for geospatial features; consider binning or projecting rather than using raw degrees in linear models.

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["lon"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 4,259
min -176.7
max 177.1
mean -47.21
median -98.25
std 79.13
q1 -108.2
q3 5.873
iqr 114
skew 0.9275
kurtosis -0.4932
n_outliers 3
outlier_rate 0.0001361
zero_rate 0.0002268
Fig 11.
Distribution of lon. Vertical dash marks the median.
Histogram bins for lon (median: -98.25).
bin              count
-176.7 – -167.8      4
-167.8 – -159       13
-159 – -150.1       18
-150.1 – -141.3     32
-141.3 – -132.4     91
-132.4 – -123.6    163
-123.6 – -114.8   1447
-114.8 – -105.9   5659
-105.9 – -97.08   3904
-97.08 – -88.23    360
-88.23 – -79.39    559
-79.39 – -70.55    962
-70.55 – -61.7     409
-61.7 – -52.86      76
-52.86 – -44.02     52
-44.02 – -35.17     13
-35.17 – -26.33      0
-26.33 – -17.48     93
-17.48 – -8.642    160
-8.642 – 0.2019   1918
0.2019 – 9.045     932
9.045 – 17.89      817
17.89 – 26.73      253
26.73 – 35.58      447
35.58 – 44.42      311
44.42 – 53.26       71
53.26 – 62.11       90
62.11 – 70.95      315
70.95 – 79.79      306
79.79 – 88.64      107
88.64 – 97.48       78
97.48 – 106.3      553
106.3 – 115.2     1162
115.2 – 124        150
124 – 132.9        206
132.9 – 141.7       50
141.7 – 150.5      238
150.5 – 159.4       11
159.4 – 168.2        4
168.2 – 177.1        9

early_age_mya numeric feature

Numeric column representing the older (early) bound of an age estimate in millions of years (Mya), spanning 0.0117 to 538.8 across 22,043 rows with no nulls. The distribution is right-skewed (skew 1.13) with median 110.1 well below the mean of 154.7, and saturn flags 2,549 outlier rows (11.6%), consistent with a long Paleozoic tail beyond the upper 1.5 × IQR fence of about 408 Mya (Q3 201.4 + 1.5 × 138). Only 164 unique values suggest ages are snapped to standard stratigraphic boundaries rather than measured continuously.

Treatment: Consider log-transform or binning to stratigraphic periods before modelling given the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["early_age_mya"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 164
min 0.0117
max 538.8
mean 154.7
median 110.1
std 143.1
q1 63.4
q3 201.4
iqr 138
skew 1.131
kurtosis 0.07677
n_outliers 2,549
outlier_rate 0.1156
zero_rate 0
alert: outliers · 11.6% rows beyond 1.5 IQR
Fig 12.
Distribution of early_age_mya. Vertical dash marks the median.
Histogram bins for early_age_mya (median: 110.1).
bin             count
0.0117 – 13.48   2334
13.48 – 26.95    1665
26.95 – 40.42      85
40.42 – 53.89      12
53.89 – 67.36    2904
67.36 – 80.83    1302
80.83 – 94.3     2239
94.3 – 107.8      454
107.8 – 121.2     796
121.2 – 134.7    1645
134.7 – 148.2     410
148.2 – 161.6    1650
161.6 – 175.1     421
175.1 – 188.6      36
188.6 – 202.1     975
202.1 – 215.5     193
215.5 – 229       530
229 – 242.5       216
242.5 – 255.9      61
255.9 – 269.4       0
269.4 – 282.9       0
282.9 – 296.3       0
296.3 – 309.8      25
309.8 – 323.3      17
323.3 – 336.8      21
336.8 – 350.2      37
350.2 – 363.7      58
363.7 – 377.2     770
377.2 – 390.6     319
390.6 – 404.1     319
404.1 – 417.6     176
417.6 – 431       602
431 – 444.5       265
444.5 – 458       382
458 – 471.5       427
471.5 – 484.9      81
484.9 – 498.4     334
498.4 – 511.9     180
511.9 – 525.3      62
525.3 – 538.8      40

late_age_mya numeric feature

Numeric column representing the younger bound of a geological age in millions of years (Mya), with 156 distinct values across 22,043 rows and no nulls. Distribution is right-skewed (skew 1.17) with median 93.9 well below the mean of 147.5, ranging from 0 to 521 Mya, and 11.5% of values (2,535) flagged as outliers on the high end. The bounded discrete value set suggests entries snap to standardized stratigraphic stage boundaries rather than continuous measurements.

Treatment: Keep as-is or pair with early_age_mya to derive a midpoint; before regression, consider a sqrt or log1p transform to tame the right skew (a plain log is undefined at the column minimum of 0).

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["late_age_mya"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 156
min 0
max 521
mean 147.5
median 93.9
std 141.7
q1 60.9
q3 192.9
iqr 132
skew 1.169
kurtosis 0.1231
n_outliers 2,535
outlier_rate 0.115
zero_rate 0.001134
alert: outliers · 11.5% rows beyond 1.5 IQR
Fig 13.
Distribution of late_age_mya. Vertical dash marks the median.
Histogram bins for late_age_mya (median: 93.9).
bin            count
0 – 13.03       2732
13.03 – 26.05   1291
26.05 – 39.08     69
39.08 – 52.1       7
52.1 – 65.12    2911
65.12 – 78.15   3279
78.15 – 91.17    374
91.17 – 104.2    949
104.2 – 117.2    922
117.2 – 130.2   1043
130.2 – 143.3   1317
143.3 – 156.3    671
156.3 – 169.3    368
169.3 – 182.3    135
182.3 – 195.4    634
195.4 – 208.4   1003
208.4 – 221.4     35
221.4 – 234.5    123
234.5 – 247.5     63
247.5 – 260.5      2
260.5 – 273.5      0
273.5 – 286.6      0
286.6 – 299.6     12
299.6 – 312.6     34
312.6 – 325.6     19
325.6 – 338.7      4
338.7 – 351.7     83
351.7 – 364.7    133
364.7 – 377.7    828
377.7 – 390.8    467
390.8 – 403.8    190
403.8 – 416.8    530
416.8 – 429.8    170
429.8 – 442.9    141
442.9 – 455.9    529
455.9 – 468.9    299
468.9 – 481.9    111
481.9 – 494.9    310
494.9 – 508      191
508 – 521         64
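
Since early_age_mya and late_age_mya correlate at +1.00 (Fig 7), the midpoint suggested in the treatment carries nearly all of the shared signal, and the age span is the remaining component. A minimal sketch follows; the function and feature names (age_features, mid_mya, span_mya, log1p_mid) are hypothetical, and it assumes early_age_mya is never smaller than late_age_mya.

```python
import math


def age_features(early_mya, late_mya):
    """Derive midpoint, span, and a log-scaled midpoint from an age range."""
    midpoint = (early_mya + late_mya) / 2.0
    span = early_mya - late_mya
    return {
        "mid_mya": midpoint,
        "span_mya": span,
        # log1p stays defined when the midpoint is 0, unlike a plain log.
        "log1p_mid": math.log1p(midpoint),
    }


# Youngest possible range in this dataset: min early bound, zero late bound.
print(age_features(0.0117, 0.0))
```

Using log1p rather than log is a choice made here because late_age_mya has a minimum of 0.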

period categorical feature

Despite its name, this column records geologic stage or interval names rather than periods, with 298 distinct values across 22,043 rows and no nulls. The distribution is moderately spread (entropy ratio 0.78): 'Irvingtonian' is the most frequent value at only 7.8% (1,723 rows), followed by 'Late Campanian' and various Paleocene and Mesozoic stages. The vocabulary mixes whole stages with subdivided ones (e.g., 'Aptian' vs. 'Late Maastrichtian'), so granularity is uneven.

Treatment: Group rare stages or map to coarser epochs before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["period"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 298
top_value Irvingtonian
top_rate 0.07817
cardinality 298
entropy 6.422
entropy_ratio 0.7813
Fig 14.
Top values for period.
Top values for period (20 unique shown, of 298 total).
value               count  share
Irvingtonian         1723   7.8%
Late Campanian       1088   4.9%
Torrejonian           935   4.2%
Tiffanian             923   4.2%
Puercan               778   3.5%
Kimmeridgian          636   2.9%
Hettangian            607   2.8%
Aptian                600   2.7%
Harrisonian           592   2.7%
Late Maastrichtian    544   2.5%
Norian                516   2.3%
Lochkovian            460   2.1%
Early Barremian       449   2.0%
Hemingfordian         441   2.0%
Tithonian             408   1.9%
Middle Campanian      359   1.6%
Early Famennian       346   1.6%
Early Albian          327   1.5%
Lancian               320   1.5%
Maastrichtian         314   1.4%

late_interval categorical feature

Categorical geological stage marking the late end of a fossil/sample's age range, with 138 distinct stage names like Tithonian, Sinemurian, and Late Campanian. The column is dominated by empty strings (83.1% of 22,043 rows), leaving only ~3,724 records with an actual stage; entropy ratio of 0.22 reflects this sparsity. Among populated values, Tithonian (548) and Sinemurian (430) lead, suggesting Mesozoic specimens are over-represented.

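Treatment: Treat the empty string as missing, then either impute from a paired early-interval field or keep the column as a sparse categorical with an explicit 'unknown' level. No early_interval column appears in this schema; period may serve that role, but that should be verified against the source.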

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["late_interval"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 138
top_value (empty string)
top_rate 0.8311
cardinality 138
entropy 1.591
entropy_ratio 0.2239
Fig 15.
Top values for late_interval.
Top values for late_interval (20 unique shown, of 138 total).
value                count  share
(empty string)       18319  83.1%
Tithonian              548   2.5%
Sinemurian             430   2.0%
Late Campanian         183   0.8%
Early Cenomanian       132   0.6%
Albian                 129   0.6%
Rhaetian               119   0.5%
Early Maastrichtian    111   0.5%
Early Tithonian        102   0.5%
Late Turonian           92   0.4%
Maastrichtian           72   0.3%
Harnagian               62   0.3%
Santonian               61   0.3%
Early Aptian            57   0.3%
Tiffanian               57   0.3%
Early Albian            56   0.3%
Barremian               54   0.2%
Pliensbachian           50   0.2%
Toarcian                50   0.2%
Cenomanian              45   0.2%

phylum categorical label

Taxonomic phylum label with only 4 distinct values across 22,043 rows. Chordata dominates at 81.6% (17,993 rows), with Mollusca and Arthropoda each at exactly 2,000, which suggests deliberate sampling caps on the non-Chordata phyla. 50 rows carry an empty-string phylum that is not counted as null.

Treatment: Recode the 50 empty strings to null and treat as a low-cardinality categorical; expect class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["phylum"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 4
top_value Chordata
top_rate 0.8163
cardinality 4
entropy 0.8873
entropy_ratio 0.4436
Fig 16.
Top values for phylum.
Top values for phylum (4 unique shown, of 4 total).
value           count  share
Chordata        17993  81.6%
Mollusca         2000   9.1%
Arthropoda       2000   9.1%
(empty string)     50   0.2%
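
The entropy and entropy_ratio figures for phylum can be reproduced from the four counts above. The sketch below assumes Shannon entropy in bits, with the ratio taken as entropy divided by log2(cardinality); that definition is inferred from the numbers matching, not from saturn's documentation, and entropy_bits is a hypothetical helper name.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Chordata, Mollusca, Arthropoda, empty string (counts from the table above).
phylum_counts = [17993, 2000, 2000, 50]
h = entropy_bits(phylum_counts)
ratio = h / math.log2(len(phylum_counts))
print(round(h, 4), round(ratio, 4))  # approximately 0.8873 0.4436
```

A ratio near 1 would mean the values are evenly spread; 0.44 here reflects how much of the column a single value (Chordata) takes up.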

class categorical label

Taxonomic class label for what appears to be a paleontological/zoological occurrence dataset, with 19 distinct values across 22,043 rows and no nulls. Mammalia dominates at 31.8% (7,015), followed by the dinosaur clades Saurischia (5,507) and Ornithischia (2,811), suggesting a fossil-heavy sample. Note two sentinel-style entries that should be cleaned: 60 empty strings and 26 'NO_CLASS_SPECIFIED' rows.

Treatment: Normalize the empty strings and 'NO_CLASS_SPECIFIED' to a single missing token, then one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[40]:

saturn.columns["class"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 19
top_value Mammalia
top_rate 0.3182
cardinality 19
entropy 2.579
entropy_ratio 0.6071
Fig 17.
Top values for class.
Top values for class (19 unique shown, of 19 total).
value               count  share
Mammalia             7015  31.8%
Saurischia           5507  25.0%
Ornithischia         2811  12.8%
Cephalopoda          2000   9.1%
Trilobita            2000   9.1%
Conodonta            1883   8.5%
Reptilia              568   2.6%
Aves                   92   0.4%
(empty string)         60   0.3%
NO_CLASS_SPECIFIED     26   0.1%
Pteraspidomorpha       24   0.1%
Placodermi             17   0.1%
Acanthodii             15   0.1%
Osteichthyes           11   0.0%
Thelodonti              4   0.0%
Osteostraci             4   0.0%
Chondrichthyes          4   0.0%
Actinopterygii          1   0.0%
Galeaspidomorphi        1   0.0%
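
The normalization suggested for class (and, with the same pattern, for the order and family sentinels) could be sketched as below. The regular expression covers the NO_<LEVEL>_SPECIFIED spellings that appear in this report; whether other spellings exist in the raw data is not known, and normalize_missing is a hypothetical helper name.

```python
import re

# Matches the sentinel spellings seen in this report, e.g. NO_CLASS_SPECIFIED.
_SENTINEL = re.compile(r"NO_[A-Z]+_SPECIFIED")


def normalize_missing(value):
    """Return None for empty strings and NO_*_SPECIFIED sentinels, else the trimmed value."""
    if value is None:
        return None
    text = value.strip()
    if text == "" or _SENTINEL.fullmatch(text):
        return None
    return text


rows = ["Mammalia", "", "NO_CLASS_SPECIFIED", " Aves "]
print([normalize_missing(v) for v in rows])
# ['Mammalia', None, None, 'Aves']
```

Applying this before profiling would let the null rate reflect the real missingness instead of the 0.0% reported for these columns.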

order categorical feature

Taxonomic order assignments for what appears to be a paleontology/biology specimen dataset, mixing extinct groups (Ammonitida, Ozarkodinida, Multituberculata, Phacopida) with extant mammalian orders (Rodentia, Artiodactyla, Carnivora). Coverage is poor: 32.3% of rows hold the sentinel 'NO_ORDER_SPECIFIED' and another 3,019 are empty strings, so roughly 46% of records lack a real order. Across 22,043 rows there are 99 distinct values with an entropy ratio of 0.586, meaning a few values account for most rows and the rest form a long tail.

Treatment: Normalize empty strings and 'NO_ORDER_SPECIFIED' into a single missing category before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
Out[43]:

saturn.columns["order"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 99
top_value NO_ORDER_SPECIFIED
top_rate 0.3229
cardinality 99
entropy 3.887
entropy_ratio 0.5863
Fig 18.
Top values for order.
Top values for order (20 unique shown, of 99 total).
value               count  share
NO_ORDER_SPECIFIED   7117  32.3%
(empty string)       3019  13.7%
Ammonitida           1572   7.1%
Ozarkodinida         1341   6.1%
Rodentia             1109   5.0%
Artiodactyla          951   4.3%
Carnivora             744   3.4%
Multituberculata      553   2.5%
Perissodactyla        517   2.3%
Phacopida             507   2.3%
Procreodi             503   2.3%
Prioniodontida        348   1.6%
Primates              315   1.4%
Asaphida              304   1.4%
Corynexochida         252   1.1%
Ammonoidea            246   1.1%
Ptychopariida         238   1.1%
Proetida              219   1.0%
Cimolesta             218   1.0%
Lagomorpha            187   0.8%

family categorical feature

Taxonomic family classification for biological specimens, with 528 distinct values across 22,043 rows and no nulls. The top value is the empty string at 15.5% (3,418 rows), and 'NO_FAMILY_SPECIFIED' adds another 1,996 rows; together roughly a quarter of records lack a real family assignment. Among populated values, Hadrosauridae (689), Grallatoridae (593), and Palmatolepidae (586) lead, consistent with a paleontological dataset.

Treatment: Normalize empty strings and 'NO_FAMILY_SPECIFIED' to a single missing category, then target- or frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[46]:

saturn.columns["family"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 528
top_value (empty string)
top_rate 0.1551
cardinality 528
entropy 6.566
entropy_ratio 0.726
Fig 19.
Top values for family.
Top values for family (20 unique shown, of 528 total).
value                count  share
(empty string)        3418  15.5%
NO_FAMILY_SPECIFIED   1996   9.1%
Hadrosauridae          689   3.1%
Grallatoridae          593   2.7%
Palmatolepidae         586   2.7%
Arctocyonidae          503   2.3%
Polygnathidae          459   2.1%
Cricetidae             407   1.8%
Equidae                360   1.6%
Canidae                358   1.6%
Ceratopsidae           336   1.5%
Dromaeosauridae        335   1.5%
Icriodontidae          272   1.2%
Periptychidae          249   1.1%
Neoplagiaulacidae      234   1.1%
Merycoidodontidae      231   1.0%
Camelidae              216   1.0%
Tyrannosauridae        184   0.8%
Diplodocidae           181   0.8%
Asaphidae              172   0.8%

genus text feature

This column holds taxonomic genus names for fossil records: single-word Latinate identifiers like Palmatolepis, Polygnathus, and Grallator dominate, with one_word_rate at 0.989. Of 22,043 rows, 5,545 (about 25%) are empty strings, and duplicate_rate is 0.88 across 2,608 unique values, so the field is heavily repeated and a quarter of rows carry no genus at all. The empties also pull len_mean (8.2) below len_median (10). The maximum length of 33 and a vocab of 2,525 are consistent with controlled scientific nomenclature rather than free text.

Treatment: Treat as a high-cardinality categorical: normalize case, encode empties as missing, and target/frequency-encode rather than one-hot.

anthropic:claude-opus-4-7 · confidence high
Out[49]:

saturn.columns["genus"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 2,608
len_min 0
len_max 33
len_mean 8.188
len_median 10
len_p95 15
word_mean 1.011
word_median 1
n_empty 5,545
n_duplicates 19,435
duplicate_rate 0.8817
vocab_size 2,525
readability_flesch_mean -4.827
emoji_rate 0
url_rate 0
one_word_rate 0.9894
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 98.9% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 88.2% duplicate strings
Fig 20.
Character-length distribution for genus.
Character-length distribution for genus (mean: 8.188).
chars    count
0 – 1     5545
1 – 2        0
2 – 2        0
2 – 3       13
3 – 4       31
4 – 5        0
5 – 6      372
6 – 7      188
7 – 7      704
7 – 8     1676
8 – 9     2307
9 – 10       0
10 – 11   2405
11 – 12   2504
12 – 12   2230
12 – 13   1696
13 – 14   1017
14 – 15      0
15 – 16    522
16 – 16    324
16 – 17    191
17 – 18     50
18 – 19      0
19 – 20     51
20 – 21     21
21 – 21     19
21 – 22     36
22 – 23     20
23 – 24      0
24 – 25      5
25 – 26      7
26 – 26      3
26 – 27      4
27 – 28     40
28 – 29      0
29 – 30     57
30 – 31      4
31 – 31      0
31 – 32      0
32 – 33      1
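
The frequency encoding suggested for genus can be sketched as follows, with empty strings treated as missing so that the 5,545 blank rows do not become the single most "frequent" genus. The helper name frequency_encode is hypothetical, and the relative frequency is computed over non-empty values only, which is one of several reasonable choices.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each non-empty string with its relative frequency; empties become None."""
    cleaned = [v.strip() if v and v.strip() else None for v in values]
    counts = Counter(v for v in cleaned if v is not None)
    total = sum(counts.values())
    return [counts[v] / total if v is not None else None for v in cleaned]


sample = ["Palmatolepis", "", "Grallator", "Palmatolepis"]
print(frequency_encode(sample))
# Palmatolepis -> 2/3, the empty string -> None, Grallator -> 1/3
```

Compared with one-hot encoding, this keeps the feature to a single column regardless of how many of the 2,608 distinct values are present.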

country categorical feature

Two-letter country codes across 93 distinct values with no nulls in 22,043 rows. The distribution is heavily US-dominated at 50.9% (11,218 rows), followed by CA, CN, UK, and ES and then a long tail; the entropy ratio of 0.50 confirms the concentration. Note that 'UK' is used rather than the ISO 3166-1 alpha-2 code 'GB', so the codes are not strictly ISO and joins against canonical country tables may miss those rows.

Treatment: Normalize codes (e.g., UK→GB) and group the long tail before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
Out[52]:

saturn.columns["country"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 93
top_value US
top_rate 0.5089
cardinality 93
entropy 3.293
entropy_ratio 0.5035
Fig 21.
Top values for country.
Top values for country (20 unique shown, of 93 total).
value  count  share
US     11218  50.9%
CA      1830   8.3%
CN      1661   7.5%
UK       983   4.5%
ES       841   3.8%
FR       390   1.8%
MA       303   1.4%
AR       292   1.3%
CZ       288   1.3%
AU       247   1.1%
TZ       218   1.0%
UZ       184   0.8%
MX       175   0.8%
KR       175   0.8%
SE       170   0.8%
CH       166   0.8%
MN       162   0.7%
ZA       159   0.7%
RU       156   0.7%
DE       152   0.7%
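
The country treatment (fix non-standard codes, then group the long tail) might be sketched like this. The CODE_FIXES table holds only the UK to GB correction noted above; the min_count threshold and the "OTHER" label are illustrative assumptions, as is the helper name normalize_and_group.

```python
from collections import Counter

# Known non-ISO spellings in this dataset mapped to ISO 3166-1 alpha-2 codes.
CODE_FIXES = {"UK": "GB"}


def normalize_and_group(codes, min_count, other="OTHER"):
    """Upper-case and correct codes, then replace codes rarer than min_count."""
    cleaned = [c.strip().upper() for c in codes]
    fixed = [CODE_FIXES.get(c, c) for c in cleaned]
    counts = Counter(fixed)
    return [c if counts[c] >= min_count else other for c in fixed]


print(normalize_and_group(["US", "US", "UK", "uk", "TZ"], min_count=2))
# ['US', 'US', 'GB', 'GB', 'OTHER']
```

Counting happens after the code fix, so the two spellings of the same country are pooled before the rarity threshold is applied.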

state categorical feature

Geographic subdivision (state/province/region) with 519 distinct values spanning US states, Canadian provinces, English regions, and Chinese provinces, which indicates an international dataset rather than a US-only one. Wyoming leads at 8.6% (1,903 rows), followed by Montana (1,394) and then the empty string with 1,082 rows, which the 0.0% null rate does not capture. The entropy ratio of 0.70 shows a fairly even spread across the long tail.

Treatment: Treat empty strings as missing, then group-encode or target-encode given high cardinality.

anthropic:claude-opus-4-7 · confidence high
Out[55]:

saturn.columns["state"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 519
top_value Wyoming
top_rate 0.08633
cardinality 519
entropy 6.288
entropy_ratio 0.6971
Fig 22.
Top values for state.
Top values for state (20 unique shown, of 519 total).
value                  count  share
Wyoming                 1903   8.6%
Montana                 1394   6.3%
(empty string)          1082   4.9%
New Mexico              1048   4.8%
Alberta                 1009   4.6%
Nebraska                 950   4.3%
England                  907   4.1%
Guangxi                  861   3.9%
California               837   3.8%
Colorado                 540   2.4%
Texas                    530   2.4%
Utah                     489   2.2%
Nevada                   361   1.6%
Murcia                   333   1.5%
North Dakota             325   1.5%
South Dakota             316   1.4%
Massachusetts            278   1.3%
Kansas                   273   1.2%
Northwest Territories    246   1.1%
Arizona                  226   1.0%

formation categorical other

The `formation` column is constant: all 22043 rows hold the empty string, giving cardinality 1 and entropy 0.0. It carries no information and the top_value being "" suggests the field was never populated rather than genuinely categorical.

Treatment: Drop; single constant value provides no signal.

anthropic:claude-opus-4-7 · confidence high
Out[58]:

saturn.columns["formation"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 1
top_value (empty string)
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 23.
Top values for formation.
Top values for formation (1 unique shown, of 1 total).
value           count   share
(empty string)  22043  100.0%

collection categorical metadata

The 'collection' column contains a single value across all 22043 rows, and that value is the empty string. Cardinality is 1, entropy is 0, and null_rate is 0.0, meaning the field is technically populated but carries no information. This is likely a vestigial schema field that was never filled in.

Treatment: Drop; constant empty value provides no signal.

anthropic:claude-opus-4-7 · confidence high
Out[61]:

saturn.columns["collection"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 1
top_value (empty string)
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 24.
Top values for collection.
Top values for collection (1 unique shown, of 1 total).
value           count   share
(empty string)  22043  100.0%

paleolat numeric feature

Paleolatitude in degrees, ranging from -86.16 to 89.2 with a median of 34.89, consistent with reconstructed latitudes of fossil or geological samples. The distribution is left-skewed (skew -1.08) and 6.97% of non-null values (1,503 records) are flagged as outliers, all on the southern side, because the upper 1.5 × IQR fence (about 92.9°) lies beyond the maximum while the lower fence sits near -29.6°. The null rate is low at 2.23% (491 rows), and 3,214 unique values across 22,043 rows indicate substantial repetition, likely from shared site coordinates.

Treatment: Use as-is for geographic modelling; consider binning by hemisphere or absolute latitude given the left skew.

anthropic:claude-opus-4-7 · confidence high
Out[64]:

saturn.columns["paleolat"].stats

stat value
n 22,043
nulls 491 (2.2%)
unique 3,214
min -86.16
max 89.2
mean 26.46
median 34.89
std 29.54
q1 16.34
q3 46.98
iqr 30.64
skew -1.08
kurtosis 0.2646
n_outliers 1,503
outlier_rate 0.06974
zero_rate 0
alert: outliers · 7.0% rows beyond 1.5 IQR
Fig 25.
Distribution of paleolat. Vertical dash marks the median.
Histogram bins for paleolat (median: 34.89).
bin              count
-86.16 – -81.78     13
-81.78 – -77.39     17
-77.39 – -73.01     17
-73.01 – -68.62     11
-68.62 – -64.24     12
-64.24 – -59.86     19
-59.86 – -55.47     13
-55.47 – -51.09     87
-51.09 – -46.7      82
-46.7 – -42.32     185
-42.32 – -37.94    454
-37.94 – -33.55    285
-33.55 – -29.17    350
-29.17 – -24.78    508
-24.78 – -20.4     462
-20.4 – -16.02     669
-16.02 – -11.63    554
-11.63 – -7.248    173
-7.248 – -2.864    144
-2.864 – 1.52      184
1.52 – 5.904       345
5.904 – 10.29      212
10.29 – 14.67      504
14.67 – 19.06      275
19.06 – 23.44      888
23.44 – 27.82     1581
27.82 – 32.21     1491
32.21 – 36.59     2167
36.59 – 40.98     1711
40.98 – 45.36      787
45.36 – 49.74     2851
49.74 – 54.13     1762
54.13 – 58.51     1421
58.51 – 62.9      1132
62.9 – 67.28       152
67.28 – 71.66        8
71.66 – 76.05        4
76.05 – 80.43        6
80.43 – 84.82        3
84.82 – 89.2        13

paleolng numeric feature

This is paleolongitude: the reconstructed longitude of fossil/sample locations on ancient continental configurations. Values span nearly the full longitudinal range (-177.6 to 168.7) with a mean of -28.58 and median of -62.15, indicating a concentration at western (negative) longitudes and moderate positive skew (0.74). Null rate is 2.23% and only 3 outliers are flagged, so the distribution is well-behaved within plausible geographic bounds.

Treatment: Use as a geographic coordinate feature; pair with paleolat and consider cyclic encoding since longitude wraps at ±180.

anthropic:claude-opus-4-7 · confidence high
Out[67]:

saturn.columns["paleolng"].stats

stat value
n 22,043
nulls 491 (2.2%)
unique 3,715
min -177.6
max 168.7
mean -28.58
median -62.15
std 68.91
q1 -77.16
q3 20.19
iqr 97.35
skew 0.7372
kurtosis -0.4811
n_outliers 3
outlier_rate 0.0001392
zero_rate 0
Fig 26.
Distribution of paleolng. Vertical dash marks the median.
Histogram bins for paleolng (median: -62.15).
bin              count
-177.6 – -168.9      3
-168.9 – -160.3     12
-160.3 – -151.6      4
-151.6 – -143       31
-143 – -134.3       31
-134.3 – -125.7    121
-125.7 – -117      345
-117 – -108.3      968
-108.3 – -99.68    726
-99.68 – -91.03   1791
-91.03 – -82.37    244
-82.37 – -73.71   2539
-73.71 – -65.05   3233
-65.05 – -56.4     949
-56.4 – -47.74     404
-47.74 – -39.08   1084
-39.08 – -30.42    478
-30.42 – -21.77    154
-21.77 – -13.11     57
-13.11 – -4.45     626
-4.45 – 4.207      210
4.207 – 12.86     1004
12.86 – 21.52     1648
21.52 – 30.18     1074
30.18 – 38.84      534
38.84 – 47.49      129
47.49 – 56.15       52
56.15 – 64.81       82
64.81 – 73.47      284
73.47 – 82.12      257
82.12 – 90.78      663
90.78 – 99.44      486
99.44 – 108.1      156
108.1 – 116.8      315
116.8 – 125.4      434
125.4 – 134.1      182
134.1 – 142.7      192
142.7 – 151.4       40
151.4 – 160          7
160 – 168.7          3

reference_no text foreign_key

Short numeric reference codes (1 to 5 characters, mean 4.17, all single-word) stored as text. Values repeat heavily, as expected for a foreign key: 22,043 rows collapse to 3,725 distinct values (an 83.1% duplicate rate), and the top code '4245' alone appears 794 times. The 89.8% allcaps_rate is a quirk of the detector treating digit-only strings as uppercase.

Treatment: Treat as a categorical foreign key and left-join to the reference dimension; do not assume uniqueness.
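
A many-to-one left join of this kind can be sketched in plain Python. The example keeps every occurrence row, requires the key to be unique on the reference side only, and leaves unmatched rows without reference fields. The reference table, its ref_title column, and the function name are hypothetical; only reference_no and occurrence_no come from this report.

```python
from collections import Counter

def left_join_many_to_one(facts, dimension, key):
    """Left-join fact rows to a dimension table whose key must be unique."""
    counts = Counter(row[key] for row in dimension)
    repeated = [k for k, c in counts.items() if c > 1]
    if repeated:
        raise ValueError(f"dimension key is not unique: {repeated}")
    lookup = {row[key]: row for row in dimension}
    joined = []
    for row in facts:
        match = lookup.get(row[key], {})   # no match: keep the fact row as-is
        joined.append({**match, **row})
    return joined

# Hypothetical rows. Keys stay as text, exactly as stored in the dataset.
occurrences = [
    {"occurrence_no": "1", "reference_no": "4245"},
    {"occurrence_no": "2", "reference_no": "4245"},
    {"occurrence_no": "3", "reference_no": "17"},
]
references = [{"reference_no": "4245", "ref_title": "placeholder title"}]

joined = left_join_many_to_one(occurrences, references, "reference_no")
print(len(joined))                     # same number of rows as occurrences
print(joined[2].get("ref_title"))      # unmatched reference stays empty
```

The uniqueness check runs on the dimension side because duplicates there would multiply occurrence rows. In pandas, `DataFrame.merge(..., how="left", validate="many_to_one")` enforces the same constraint.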

anthropic:claude-opus-4-7 · confidence high
Out[70]:

saturn.columns["reference_no"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 3,725
len_min 1
len_max 5
len_mean 4.172
len_median 4
len_p95 5
word_mean 1
word_median 1
n_empty 0
n_duplicates 18,318
duplicate_rate 0.831
vocab_size 3,547
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.8977
boilerplate_rate 0
alert: one_word 100.0% of rows are a single word
alert: allcaps 89.8% of rows are all-caps
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 83.1% duplicate strings
Fig 27.
Character-length distribution for reference_no.
Character-length distribution for reference_no (mean: 4.172).
chars count
1 23
2 2233
3 1274
4 8908
5 9605

occurrence_no text identifier

This is almost certainly a primary-key style identifier: every one of the 22,043 rows holds a unique single-token value (n_unique == n, one_word_rate 1.0), with no nulls or duplicates. Lengths range from 1 to 7 characters and the most frequent tokens are all numeric strings like '164260' and '1439335', so the field is stored as text but contains integer occurrence numbers. The 'allcaps' alert is a quirk of the detector treating digit-only strings as uppercase and can be ignored.

Treatment: Use as a row key for joins; do not feed into modelling.
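
Before relying on the column as a row key, the uniqueness claim is cheap to re-verify, and the key can be split off so that it never reaches a model. A minimal sketch with hypothetical helper names and made-up rows (the two IDs are the example values quoted above):

```python
def assert_row_key(rows, key):
    """Raise if `key` is missing, empty, or repeated in any row."""
    seen = set()
    for i, row in enumerate(rows):
        value = row.get(key)
        if value is None or value == "":
            raise ValueError(f"row {i}: missing {key}")
        if value in seen:
            raise ValueError(f"row {i}: duplicate {key} {value!r}")
        seen.add(value)

def split_key_and_features(row, key):
    """Separate the identifier from the columns a model may see."""
    features = {k: v for k, v in row.items() if k != key}
    return row[key], features

# Hypothetical rows for illustration only.
rows = [
    {"occurrence_no": "164260", "paleolng": -62.15},
    {"occurrence_no": "1439335", "paleolng": 20.19},
]
assert_row_key(rows, "occurrence_no")
row_id, features = split_key_and_features(rows[0], "occurrence_no")
print(row_id, features)
```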

anthropic:claude-opus-4-7 · confidence high
Out[73]:

saturn.columns["occurrence_no"].stats

stat value
n 22,043
nulls 0 (0.0%)
unique 22,043
len_min 1
len_max 7
len_mean 5.762
len_median 6
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 20,000
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.999
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 100.0% of rows are a single word
alert: allcaps 99.9% of rows are all-caps
alert: short_text 95th-percentile length under 20 chars
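
The allcaps rates for reference_no (0.8977) and occurrence_no (0.999) can be reproduced from the character-length counts alone. The sketch below assumes a detector that flags a string when upper-casing leaves it unchanged and it has at least three characters. That rule is a guess that happens to match the reported numbers, not saturn's documented implementation; the length counts are read from the character-length tables for the two columns.

```python
# Length -> row count, read from the character-length tables in this report.
reference_no_lengths = {1: 23, 2: 2233, 3: 1274, 4: 8908, 5: 9605}
occurrence_no_lengths = {1: 8, 2: 15, 3: 26, 4: 1359, 5: 3185, 6: 16620, 7: 830}

def looks_allcaps(s, min_len=3):
    # Hypothetical rule, not saturn's code: unchanged by upper(), long enough.
    return len(s) >= min_len and s == s.upper()

def share_at_least(lengths, min_len=3):
    """Fraction of rows whose string length is at least min_len."""
    total = sum(lengths.values())
    return sum(n for length, n in lengths.items() if length >= min_len) / total

# Digit-only strings pass the comparison test even though str.isupper()
# rejects them, because they contain no cased characters.
print(looks_allcaps("4245"), looks_allcaps("42"), "4245".isupper())
print(round(share_at_least(reference_no_lengths), 4))
print(round(share_at_least(occurrence_no_lengths), 4))
```

If the guess is right, the rate measures string length in these columns, not letter case, which supports ignoring the alert for numeric identifiers.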
Fig 28.
Character-length distribution for occurrence_no.
Character-length distribution for occurrence_no (mean: 5.762).
chars count
1 8
2 15
3 26
4 1359
5 3185
6 16620
7 830

How to cite


BibTeX
@misc{saturn-quirky-fossils-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky fossils},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-fossils}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky fossils. Source: /home/coolhand/html/datavis/data_trove/data/quirky/fossils.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-fossils