quirky-fossils · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/fossils.json

Saturn profiled 22,043 rows across 21 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

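# Install the profiler (IPython shell escape; run this cell inside a notebook).
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source file; check=True raises if the command exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/fossils.json",
    "--findings", "quirky-fossils.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)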

Summary confidence: high

This dataset contains 22,043 fossil occurrence records with 21 columns spanning taxonomy (phylum, class, order, family, genus, name, rank), geography (country, state, lat/lon, paleolat/paleolng), and geologic age (early_age_mya, late_age_mya, period, late_interval). Taxonomy is dominated by Chordata (about 82% of rows), with Mammalia as the leading class (~32%) followed by Saurischia (~25%) and Ornithischia (~13%), so the data has a strong vertebrate and dinosaur emphasis worth examining first. Geographically the data skews heavily to the US (~51%), with Wyoming, Montana, and New Mexico the most common named states, so any spatial analysis should account for this North American concentration. The age columns (early_age_mya, late_age_mya) are right-skewed with medians of 110.1 and 93.9 Mya and about 12% of rows flagged as outliers, reflecting a long tail of very old records. Note that 'collection' and 'formation' contain only empty strings and should be ignored.

citing: row_count · column_count · phylum.top_values · class.top_values · country.top_values · state.top_values · early_age_mya.stats · late_age_mya.stats · rank.top_values · collection.stats · formation.stats

Out[4]:

saturn.schema() · 21 columns

column	kind	n	null%	unique	alerts
name	text	22,043	0.0%	4,660	one_word duplicates
rank	categorical	22,043	0.0%	18
lat	numeric	22,043	0.0%	4,095	high_skew outliers
lon	numeric	22,043	0.0%	4,259
early_age_mya	numeric	22,043	0.0%	164	outliers
late_age_mya	numeric	22,043	0.0%	156	outliers
period	categorical	22,043	0.0%	298
late_interval	categorical	22,043	0.0%	138
phylum	categorical	22,043	0.0%	4
class	categorical	22,043	0.0%	19
order	categorical	22,043	0.0%	99
family	categorical	22,043	0.0%	528
genus	text	22,043	0.0%	2,608	one_word short_text duplicates
country	categorical	22,043	0.0%	93
state	categorical	22,043	0.0%	519
formation	categorical	22,043	0.0%	1	imbalance
collection	categorical	22,043	0.0%	1	imbalance
paleolat	numeric	22,043	2.2%	3,214	outliers
paleolng	numeric	22,043	2.2%	3,715
reference_no	text	22,043	0.0%	3,725	one_word allcaps short_text duplicates
occurrence_no	text	22,043	0.0%	22,043	near_unique one_word allcaps short_text

Fig 1.

class · Mammalia, Saurischia, and Ornithischia dominate; check how unevenly taxonomic classes are represented.

Top values for class (19 unique shown, of 19 total).
value	count	share
Mammalia	7015	31.8%
Saurischia	5507	25.0%
Ornithischia	2811	12.8%
Cephalopoda	2000	9.1%
Trilobita	2000	9.1%
Conodonta	1883	8.5%
Reptilia	568	2.6%
Aves	92	0.4%
	60	0.3%
NO_CLASS_SPECIFIED	26	0.1%
Pteraspidomorpha	24	0.1%
Placodermi	17	0.1%
Acanthodii	15	0.1%
Osteichthyes	11	0.0%
Thelodonti	4	0.0%
Osteostraci	4	0.0%
Chondrichthyes	4	0.0%
Actinopterygii	1	0.0%
Galeaspidomorphi	1	0.0%

Fig 2.

phylum · Chordata accounts for roughly 82% of records — see how heavily vertebrate-skewed the dataset is.

Top values for phylum (4 unique shown, of 4 total).
value	count	share
Chordata	17993	81.6%
Mollusca	2000	9.1%
Arthropoda	2000	9.1%
	50	0.2%

Fig 3.

country · Look for the strong US concentration (~51%) before drawing global conclusions.

Top values for country (20 unique shown, of 93 total).
value	count	share
US	11218	50.9%
CA	1830	8.3%
CN	1661	7.5%
UK	983	4.5%
ES	841	3.8%
FR	390	1.8%
MA	303	1.4%
AR	292	1.3%
CZ	288	1.3%
AU	247	1.1%
TZ	218	1.0%
UZ	184	0.8%
MX	175	0.8%
KR	175	0.8%
SE	170	0.8%
CH	166	0.8%
MN	162	0.7%
ZA	159	0.7%
RU	156	0.7%
DE	152	0.7%

Fig 4.

early_age_mya · Right-skewed distribution from recent fossils to ~539 Mya; watch for the long tail and outliers.

Histogram bins for early_age_mya (median: 110.1).
bin	count
0.0117 – 13.48	2334
13.48 – 26.95	1665
26.95 – 40.42	85
40.42 – 53.89	12
53.89 – 67.36	2904
67.36 – 80.83	1302
80.83 – 94.3	2239
94.3 – 107.8	454
107.8 – 121.2	796
121.2 – 134.7	1645
134.7 – 148.2	410
148.2 – 161.6	1650
161.6 – 175.1	421
175.1 – 188.6	36
188.6 – 202.1	975
202.1 – 215.5	193
215.5 – 229	530
229 – 242.5	216
242.5 – 255.9	61
255.9 – 269.4	0
269.4 – 282.9	0
282.9 – 296.3	0
296.3 – 309.8	25
309.8 – 323.3	17
323.3 – 336.8	21
336.8 – 350.2	37
350.2 – 363.7	58
363.7 – 377.2	770
377.2 – 390.6	319
390.6 – 404.1	319
404.1 – 417.6	176
417.6 – 431	602
431 – 444.5	265
444.5 – 458	382
458 – 471.5	427
471.5 – 484.9	81
484.9 – 498.4	334
498.4 – 511.9	180
511.9 – 525.3	62
525.3 – 538.8	40

Fig 5.

rank · Most records are identified to species or genus level — useful context for taxonomic resolution.

Top values for rank (18 unique shown, of 18 total).
value	count	share
species	9082	41.2%
genus	7342	33.3%
unranked clade	2828	12.8%
family	1716	7.8%
subfamily	272	1.2%
subclass	205	0.9%
class	134	0.6%
order	115	0.5%
infraorder	97	0.4%
superfamily	75	0.3%
subgenus	51	0.2%
kingdom	50	0.2%
suborder	29	0.1%
subspecies	23	0.1%
tribe	12	0.1%
subphylum	9	0.0%
superorder	2	0.0%
superclass	1	0.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
name	text	0.0%
rank	categorical	0.0%
lat	numeric	0.0%
lon	numeric	0.0%
early_age_mya	numeric	0.0%
late_age_mya	numeric	0.0%
period	categorical	0.0%
late_interval	categorical	0.0%
phylum	categorical	0.0%
class	categorical	0.0%
order	categorical	0.0%
family	categorical	0.0%
genus	text	0.0%
country	categorical	0.0%
state	categorical	0.0%
formation	categorical	0.0%
collection	categorical	0.0%
paleolat	numeric	2.2%
paleolng	numeric	2.2%
reference_no	text	0.0%
occurrence_no	text	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 6 numeric columns (values clipped to 2 decimals).
	lat	lon	early_age_mya	late_age_mya	paleolat	paleolng
lat	+1.00	-0.27	+0.04	+0.05	+0.12	-0.17
lon	-0.27	+1.00	+0.17	+0.16	-0.22	+0.45
early_age_mya	+0.04	+0.17	+1.00	+1.00	-0.61	+0.03
late_age_mya	+0.05	+0.16	+1.00	+1.00	-0.61	+0.03
paleolat	+0.12	-0.22	-0.61	-0.61	+1.00	-0.21
paleolng	-0.17	+0.45	+0.03	+0.03	-0.21	+1.00

name text label

This column holds taxonomic names of fossil organisms, dominated by dinosaur and conodont genera/clades like Theropoda (768), Dinosauria (512), and Palmatolepis (376). Values are short—mean length 15 characters and 58% are single words—so the Flesch readability score of -4.13 is a meaningless artifact of scientific Latin. With 4,660 uniques across 22,043 rows and a 78.9% duplicate rate, this behaves as a categorical taxon label rather than a free-text field.

Treatment: Treat as a high-cardinality categorical; group rare taxa or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["name"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	4,660
len_min	3
len_max	47
len_mean	15.09
len_median	14
len_p95	26
word_mean	1.425
word_median	1
n_empty	0
n_duplicates	17,383
duplicate_rate	0.7886
vocab_size	5,140
readability_flesch_mean	-4.127
emoji_rate	0
url_rate	0
one_word_rate	0.5846
allcaps_rate	0
boilerplate_rate	0
alert: one_word	58.5% rows are a single word
alert: duplicates	78.9% duplicate strings
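
The n_duplicates and duplicate_rate rows appear to follow from n and unique alone. A minimal sketch, assuming saturn counts every row beyond the first occurrence of each distinct string as a duplicate; the helper name `duplicate_stats` is hypothetical and not part of saturn:

```python
def duplicate_stats(n_rows: int, n_unique: int) -> tuple[int, float]:
    """Return (n_duplicates, duplicate_rate), assuming duplicates = rows - distinct values."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, n_duplicates / n_rows


# Values copied from the name stats table.
name_dupes, name_rate = duplicate_stats(22_043, 4_660)
print(name_dupes, round(name_rate, 4))
```

Under that assumption the result reproduces the reported 17,383 duplicates and the 0.7886 rate, and the same arithmetic with 2,608 uniques gives the genus figures.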

Fig 8.

Character-length distribution for name.

Character-length distribution for name (mean: 15.09).
chars	count
3 – 4	72
4 – 5	190
5 – 6	110
6 – 7	466
7 – 8	942
8 – 10	2265
10 – 11	2027
11 – 12	1560
12 – 13	1821
13 – 14	1566
14 – 15	2019
15 – 16	626
16 – 17	923
17 – 18	808
18 – 20	1219
20 – 21	1117
21 – 22	747
22 – 23	889
23 – 24	549
24 – 25	531
25 – 26	719
26 – 27	307
27 – 28	119
28 – 29	178
29 – 31	63
31 – 32	52
32 – 33	22
33 – 34	8
34 – 35	2
35 – 36	17
36 – 37	19
37 – 38	11
38 – 39	50
39 – 40	13
40 – 42	9
42 – 43	2
43 – 44	1
44 – 45	3
45 – 46	0
46 – 47	1

rank categorical feature

This column records the taxonomic rank of each record, with 18 distinct values and no nulls across 22,043 rows. 'species' leads at 41.2% (9,082 rows), followed by 'genus' (7,342), 'unranked clade' (2,828), and 'family' (1,716); every other rank is rare, with 'subfamily' next at only 272. The presence of 'unranked clade' alongside formal Linnaean ranks is worth noting as a non-standard category.

Treatment: One-hot or ordinal-encode by taxonomic depth before modelling; 'unranked clade' has no fixed depth, so give it its own level or flag if you ordinal-encode.

anthropic:claude-opus-4-7 · confidence high
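
An ordinal encoding by taxonomic depth could look like the sketch below. The ordering follows the conventional broad-to-narrow sequence and lists only ranks that occur in this column; `RANK_ORDER` and `encode_rank` are illustrative names, and returning None for 'unranked clade' is one possible choice, not something saturn prescribes.

```python
from typing import Optional

# Conventional ordering from broadest to narrowest, restricted to ranks seen in this column.
RANK_ORDER = [
    "kingdom", "subphylum", "superclass", "class", "subclass", "superorder",
    "order", "suborder", "infraorder", "superfamily", "family", "subfamily",
    "tribe", "genus", "subgenus", "species", "subspecies",
]
RANK_DEPTH = {rank: depth for depth, rank in enumerate(RANK_ORDER)}


def encode_rank(rank: str) -> Optional[int]:
    """Map a rank to its depth index; 'unranked clade' and unknown values map to None."""
    return RANK_DEPTH.get(rank)


encoded = [encode_rank(r) for r in ["species", "genus", "unranked clade"]]
print(encoded)
```

Keeping the None rows as a separate indicator feature avoids forcing 'unranked clade' into an arbitrary position on the scale.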

Out[16]:

saturn.columns["rank"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	18
top_value	species
top_rate	0.412
cardinality	18
entropy	2.085
entropy_ratio	0.5001

Fig 9.

Top values for rank.

Top values for rank (18 unique shown, of 18 total).
value	count	share
species	9082	41.2%
genus	7342	33.3%
unranked clade	2828	12.8%
family	1716	7.8%
subfamily	272	1.2%
subclass	205	0.9%
class	134	0.6%
order	115	0.5%
infraorder	97	0.4%
superfamily	75	0.3%
subgenus	51	0.2%
kingdom	50	0.2%
suborder	29	0.1%
subspecies	23	0.1%
tribe	12	0.1%
subphylum	9	0.0%
superorder	2	0.0%
superclass	1	0.0%

lat numeric feature

This is a latitude coordinate in degrees, ranging from -84.33 to 79.75 with a median of 41.70 and IQR of 11.61. The strong negative skew (-2.44) and kurtosis (7.05) suggest most points cluster in the northern hemisphere with a long tail of southern-hemisphere outliers (9.16% flagged). No nulls and 4,095 distinct values across 22,043 rows indicate repeated locations rather than per-row unique geocoding.

Treatment: Pair with a longitude column for spatial features; avoid log-transform and keep raw degrees.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["lat"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	4,095
min	-84.33
max	79.75
mean	37.12
median	41.7
std	19.37
q1	35
q3	46.61
iqr	11.61
skew	-2.442
kurtosis	7.054
n_outliers	2,019
outlier_rate	0.09159
zero_rate	0
alert: high_skew	skew=-2.44
alert: outliers	9.2% rows beyond 1.5 IQR

Fig 10.

Distribution of lat. Vertical dash marks the median.

Histogram bins for lat (median: 41.7).
bin	count
-84.33 – -80.23	4
-80.23 – -76.13	0
-76.13 – -72.03	0
-72.03 – -67.93	12
-67.93 – -63.82	7
-63.82 – -59.72	0
-59.72 – -55.62	0
-55.62 – -51.52	0
-51.52 – -47.41	16
-47.41 – -43.31	60
-43.31 – -39.21	111
-39.21 – -35.11	127
-35.11 – -31.01	166
-31.01 – -26.9	347
-26.9 – -22.8	40
-22.8 – -18.7	99
-18.7 – -14.6	138
-14.6 – -10.5	6
-10.5 – -6.394	255
-6.394 – -2.292	21
-2.292 – 1.81	0
1.81 – 5.912	32
5.912 – 10.01	22
10.01 – 14.12	12
14.12 – 18.22	168
18.22 – 22.32	137
22.32 – 26.42	1224
26.42 – 30.52	669
30.52 – 34.63	1615
34.63 – 38.73	3128
38.73 – 42.83	4830
42.83 – 46.93	3520
46.93 – 51.04	2987
51.04 – 55.14	1322
55.14 – 59.24	253
59.24 – 63.34	303
63.34 – 67.44	99
67.44 – 71.55	118
71.55 – 75.65	180
75.65 – 79.75	15

lon numeric feature

This column captures longitude coordinates, with values spanning -176.67 to 177.07, consistent with the full global range. The distribution is right-skewed (skew 0.93) and centered on a median of -98.25, suggesting a heavy concentration of records in the Western Hemisphere (likely the Americas). Only 4,259 unique values across 22,043 rows indicate repeated location points, and just 3 outliers were flagged.

Treatment: Pair with latitude for geospatial features; consider binning or projecting rather than using raw degrees in linear models.

anthropic:claude-opus-4-7 · confidence high
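
One way to project rather than use raw degrees is to map each lat/lon pair onto the unit sphere, which removes the artificial jump at ±180° longitude. This is a generic sketch, not part of saturn; the function name is hypothetical.

```python
import math


def to_unit_vector(lat_deg: float, lon_deg: float) -> tuple[float, float, float]:
    """Project a latitude/longitude pair (degrees) onto the unit sphere as (x, y, z)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Column medians from the lat and lon stats tables, used here only as a sample point.
x, y, z = to_unit_vector(41.7, -98.25)
print(round(x, 3), round(y, 3), round(z, 3))
```

The three components can be fed to a linear model directly, and points that are neighbours across the antimeridian stay close in feature space.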

Out[22]:

saturn.columns["lon"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	4,259
min	-176.7
max	177.1
mean	-47.21
median	-98.25
std	79.13
q1	-108.2
q3	5.873
iqr	114
skew	0.9275
kurtosis	-0.4932
n_outliers	3
outlier_rate	0.0001361
zero_rate	0.0002268

Fig 11.

Distribution of lon. Vertical dash marks the median.

Histogram bins for lon (median: -98.25).
bin	count
-176.7 – -167.8	4
-167.8 – -159	13
-159 – -150.1	18
-150.1 – -141.3	32
-141.3 – -132.4	91
-132.4 – -123.6	163
-123.6 – -114.8	1447
-114.8 – -105.9	5659
-105.9 – -97.08	3904
-97.08 – -88.23	360
-88.23 – -79.39	559
-79.39 – -70.55	962
-70.55 – -61.7	409
-61.7 – -52.86	76
-52.86 – -44.02	52
-44.02 – -35.17	13
-35.17 – -26.33	0
-26.33 – -17.48	93
-17.48 – -8.642	160
-8.642 – 0.2019	1918
0.2019 – 9.045	932
9.045 – 17.89	817
17.89 – 26.73	253
26.73 – 35.58	447
35.58 – 44.42	311
44.42 – 53.26	71
53.26 – 62.11	90
62.11 – 70.95	315
70.95 – 79.79	306
79.79 – 88.64	107
88.64 – 97.48	78
97.48 – 106.3	553
106.3 – 115.2	1162
115.2 – 124	150
124 – 132.9	206
132.9 – 141.7	50
141.7 – 150.5	238
150.5 – 159.4	11
159.4 – 168.2	4
168.2 – 177.1	9

early_age_mya numeric feature

Numeric column representing the early (older) bound of an age estimate in millions of years (Mya), spanning 0.0117 to 538.8 across 22,043 rows with no nulls. The distribution is right-skewed (skew 1.13) with median 110.1 well below the mean of 154.7, and saturn flags 2,549 outlier rows (11.6%). With Q3 at 201.4 and an IQR of 138, the upper 1.5 IQR fence sits near 408 Mya, so the flagged rows are the long Paleozoic tail beyond that fence rather than everything above Q3. Only 164 unique values suggest ages are snapped to standard stratigraphic boundaries rather than measured continuously.

Treatment: Consider log-transform or binning to stratigraphic periods before modelling given the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
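
The outlier alert refers to rows "beyond 1.5 IQR". A small sketch of that rule, assuming saturn applies the standard Tukey fences to the quartiles shown in the stats table (the function name is hypothetical):

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the early_age_mya stats table (already rounded there).
lower, upper = tukey_fences(63.4, 201.4)
print(round(lower, 1), round(upper, 1))
```

The lower fence is negative, so no low-side outliers are possible for an age column; everything flagged lies above the upper fence of roughly 408 Mya.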

Out[25]:

saturn.columns["early_age_mya"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	164
min	0.0117
max	538.8
mean	154.7
median	110.1
std	143.1
q1	63.4
q3	201.4
iqr	138
skew	1.131
kurtosis	0.07677
n_outliers	2,549
outlier_rate	0.1156
zero_rate	0
alert: outliers	11.6% rows beyond 1.5 IQR

Fig 12.

Distribution of early_age_mya. Vertical dash marks the median.

Histogram bins for early_age_mya (median: 110.1).
bin	count
0.0117 – 13.48	2334
13.48 – 26.95	1665
26.95 – 40.42	85
40.42 – 53.89	12
53.89 – 67.36	2904
67.36 – 80.83	1302
80.83 – 94.3	2239
94.3 – 107.8	454
107.8 – 121.2	796
121.2 – 134.7	1645
134.7 – 148.2	410
148.2 – 161.6	1650
161.6 – 175.1	421
175.1 – 188.6	36
188.6 – 202.1	975
202.1 – 215.5	193
215.5 – 229	530
229 – 242.5	216
242.5 – 255.9	61
255.9 – 269.4	0
269.4 – 282.9	0
282.9 – 296.3	0
296.3 – 309.8	25
309.8 – 323.3	17
323.3 – 336.8	21
336.8 – 350.2	37
350.2 – 363.7	58
363.7 – 377.2	770
377.2 – 390.6	319
390.6 – 404.1	319
404.1 – 417.6	176
417.6 – 431	602
431 – 444.5	265
444.5 – 458	382
458 – 471.5	427
471.5 – 484.9	81
484.9 – 498.4	334
498.4 – 511.9	180
511.9 – 525.3	62
525.3 – 538.8	40

late_age_mya numeric feature

Numeric column representing the younger bound of a geological age in millions of years (Mya), with 156 distinct values across 22,043 rows and no nulls. Distribution is right-skewed (skew 1.17) with median 93.9 well below the mean of 147.5, ranging from 0 to 521 Mya, and 11.5% of values (2,535) flagged as outliers on the high end. The bounded discrete value set suggests entries snap to standardized stratigraphic stage boundaries rather than continuous measurements.

Treatment: Keep as-is or pair with early_age_mya to derive a midpoint; consider a log1p or sqrt transform before regression given the right skew (a plain log fails on the roughly 25 rows where the value is 0, per zero_rate 0.0011).

anthropic:claude-opus-4-7 · confidence high
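
The midpoint and transform suggested above might be derived as follows. This is a sketch with hypothetical feature names; it uses log1p so that records whose late bound is 0 Mya remain valid.

```python
import math


def age_features(early_mya: float, late_mya: float) -> dict[str, float]:
    """Derive midpoint, span, and a log1p-scaled midpoint from an age range in Mya."""
    midpoint = (early_mya + late_mya) / 2
    return {
        "mid_age_mya": midpoint,
        "age_span_myr": early_mya - late_mya,
        "log1p_mid_age": math.log1p(midpoint),
    }


# The column minima from the two stats tables, used as a sample record.
youngest = age_features(0.0117, 0.0)
print(youngest)
```

Because early_age_mya and late_age_mya correlate at +1.00 in the correlation table, a midpoint plus a span is likely a less redundant pair of features than the two raw bounds.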

Out[28]:

saturn.columns["late_age_mya"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	156
min	0
max	521
mean	147.5
median	93.9
std	141.7
q1	60.9
q3	192.9
iqr	132
skew	1.169
kurtosis	0.1231
n_outliers	2,535
outlier_rate	0.115
zero_rate	0.001134
alert: outliers	11.5% rows beyond 1.5 IQR

Fig 13.

Distribution of late_age_mya. Vertical dash marks the median.

Histogram bins for late_age_mya (median: 93.9).
bin	count
0 – 13.03	2732
13.03 – 26.05	1291
26.05 – 39.08	69
39.08 – 52.1	7
52.1 – 65.12	2911
65.12 – 78.15	3279
78.15 – 91.17	374
91.17 – 104.2	949
104.2 – 117.2	922
117.2 – 130.2	1043
130.2 – 143.3	1317
143.3 – 156.3	671
156.3 – 169.3	368
169.3 – 182.3	135
182.3 – 195.4	634
195.4 – 208.4	1003
208.4 – 221.4	35
221.4 – 234.5	123
234.5 – 247.5	63
247.5 – 260.5	2
260.5 – 273.5	0
273.5 – 286.6	0
286.6 – 299.6	12
299.6 – 312.6	34
312.6 – 325.6	19
325.6 – 338.7	4
338.7 – 351.7	83
351.7 – 364.7	133
364.7 – 377.7	828
377.7 – 390.8	467
390.8 – 403.8	190
403.8 – 416.8	530
416.8 – 429.8	170
429.8 – 442.9	141
442.9 – 455.9	529
455.9 – 468.9	299
468.9 – 481.9	111
481.9 – 494.9	310
494.9 – 508	191
508 – 521	64

period categorical feature

Despite its name, this column records stages and North American land-mammal ages rather than geologic periods, with 298 distinct values across 22,043 rows and no nulls. The distribution is fairly spread out (entropy ratio 0.78): 'Irvingtonian' is the most common value at only 7.8% (1,723 rows), followed by 'Late Campanian' (1,088) and the Paleocene land-mammal ages 'Torrejonian', 'Tiffanian', and 'Puercan'. The vocabulary mixes whole stages with their subdivisions (e.g., 'Maastrichtian' vs. 'Late Maastrichtian'), so granularity is uneven.

Treatment: Group rare stages or map to coarser epochs before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
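
A first step toward coarser groups is to fold subdivided labels into their parent stage. The sketch below assumes that a leading Early/Middle/Late qualifier is the only form of subdivision, which has not been checked against all 298 values; `coarsen_stage` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches a leading Early/Middle/Late qualifier followed by whitespace.
_QUALIFIER = re.compile(r"^(?:Early|Middle|Late)\s+", flags=re.IGNORECASE)


def coarsen_stage(label: str) -> str:
    """Drop a leading Early/Middle/Late qualifier, e.g. 'Late Campanian' -> 'Campanian'."""
    return _QUALIFIER.sub("", label.strip())


# Counts for four labels taken from the top-values table for period.
top_counts = {
    "Late Campanian": 1088,
    "Middle Campanian": 359,
    "Late Maastrichtian": 544,
    "Maastrichtian": 314,
}
merged = Counter()
for label, count in top_counts.items():
    merged[coarsen_stage(label)] += count
print(dict(merged))
```

Labels that qualify an epoch or period rather than a stage would also be stripped by this rule, so inspect the merged vocabulary before encoding.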

Out[31]:

saturn.columns["period"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	298
top_value	Irvingtonian
top_rate	0.07817
cardinality	298
entropy	6.422
entropy_ratio	0.7813

Fig 14.

Top values for period.

Top values for period (20 unique shown, of 298 total).
value	count	share
Irvingtonian	1723	7.8%
Late Campanian	1088	4.9%
Torrejonian	935	4.2%
Tiffanian	923	4.2%
Puercan	778	3.5%
Kimmeridgian	636	2.9%
Hettangian	607	2.8%
Aptian	600	2.7%
Harrisonian	592	2.7%
Late Maastrichtian	544	2.5%
Norian	516	2.3%
Lochkovian	460	2.1%
Early Barremian	449	2.0%
Hemingfordian	441	2.0%
Tithonian	408	1.9%
Middle Campanian	359	1.6%
Early Famennian	346	1.6%
Early Albian	327	1.5%
Lancian	320	1.5%
Maastrichtian	314	1.4%

late_interval categorical feature

Categorical geological stage marking the late end of a record's age range, with 138 distinct values including stage names like Tithonian, Sinemurian, and Late Campanian. The column is dominated by empty strings (83.1% of 22,043 rows, or 18,319 records), leaving only 3,724 records with an actual stage; the entropy ratio of 0.22 reflects this sparsity. Among populated values, Tithonian (548) and Sinemurian (430) lead, suggesting Mesozoic specimens are over-represented.

Treatment: Treat the empty string as missing and either impute from the paired early interval (this schema has no early_interval column; `period` appears to fill that role) or use as a sparse categorical with an explicit 'unknown' level.

anthropic:claude-opus-4-7 · confidence high
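
The fallback described in the treatment can be sketched as below. It rests on two unverified assumptions: that `period` holds the early interval, and that an empty late_interval means the record falls within that single interval. The function name is hypothetical.

```python
def resolve_late_interval(late_interval: str, period: str, unknown: str = "unknown") -> str:
    """Return late_interval if populated, else fall back to period, else an explicit unknown level."""
    late = late_interval.strip()
    if late:
        return late
    early = period.strip()
    return early if early else unknown


print(resolve_late_interval("", "Irvingtonian"))
print(resolve_late_interval("Tithonian", "Kimmeridgian"))
print(resolve_late_interval("", ""))
```

If the assumption about `period` turns out to be wrong, passing an empty string for it reduces the helper to the plain 'unknown' level.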

Out[34]:

saturn.columns["late_interval"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	138
top_value	""
top_rate	0.8311
cardinality	138
entropy	1.591
entropy_ratio	0.2239

Fig 15.

Top values for late_interval.

Top values for late_interval (20 unique shown, of 138 total).
value	count	share
	18319	83.1%
Tithonian	548	2.5%
Sinemurian	430	2.0%
Late Campanian	183	0.8%
Early Cenomanian	132	0.6%
Albian	129	0.6%
Rhaetian	119	0.5%
Early Maastrichtian	111	0.5%
Early Tithonian	102	0.5%
Late Turonian	92	0.4%
Maastrichtian	72	0.3%
Harnagian	62	0.3%
Santonian	61	0.3%
Early Aptian	57	0.3%
Tiffanian	57	0.3%
Early Albian	56	0.3%
Barremian	54	0.2%
Pliensbachian	50	0.2%
Toarcian	50	0.2%
Cenomanian	45	0.2%

phylum categorical label

Taxonomic phylum label with only 4 distinct values across 22,043 rows. Chordata dominates at 81.6% (17,993 rows), with Mollusca and Arthropoda each at exactly 2,000, suggesting deliberate sampling caps on the non-Chordata phyla. 50 rows carry an empty-string phylum that is not counted as null.

Treatment: Recode the 50 empty strings to null and treat as a low-cardinality categorical; expect class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["phylum"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	4
top_value	Chordata
top_rate	0.8163
cardinality	4
entropy	0.8873
entropy_ratio	0.4436
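
The entropy rows can be checked against the four phylum counts. A sketch, assuming saturn reports base-2 Shannon entropy and divides by log2 of the cardinality to get entropy_ratio; neither convention is stated in the notebook.

```python
import math


def shannon_entropy_bits(counts: list[int]) -> float:
    """Shannon entropy in bits of the distribution implied by a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts from the top-values table for phylum (Chordata, Mollusca, Arthropoda, empty string).
phylum_counts = [17_993, 2_000, 2_000, 50]
entropy = shannon_entropy_bits(phylum_counts)
entropy_ratio = entropy / math.log2(len(phylum_counts))
print(round(entropy, 3), round(entropy_ratio, 3))
```

Under those assumptions the results agree with the reported 0.8873 and 0.4436 to about three decimal places, which supports reading entropy_ratio as entropy relative to a uniform distribution over the observed categories.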

Fig 16.

Top values for phylum.

Top values for phylum (4 unique shown, of 4 total).
value	count	share
Chordata	17993	81.6%
Mollusca	2000	9.1%
Arthropoda	2000	9.1%
	50	0.2%

class categorical label

Taxonomic class label for what appears to be a paleontological/zoological occurrence dataset, with 19 distinct values across 22,043 rows and no nulls. Mammalia dominates at 31.8% (7,015), followed by the dinosaur clades Saurischia (5,507) and Ornithischia (2,811), suggesting a fossil-heavy sample. Note two sentinel-style entries that should be cleaned: 60 empty strings and 26 'NO_CLASS_SPECIFIED' rows.

Treatment: Normalize the empty strings and 'NO_CLASS_SPECIFIED' to a single missing token, then one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[40]:

saturn.columns["class"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	19
top_value	Mammalia
top_rate	0.3182
cardinality	19
entropy	2.579
entropy_ratio	0.6071

Fig 17.

Top values for class.

Top values for class (19 unique shown, of 19 total).
value	count	share
Mammalia	7015	31.8%
Saurischia	5507	25.0%
Ornithischia	2811	12.8%
Cephalopoda	2000	9.1%
Trilobita	2000	9.1%
Conodonta	1883	8.5%
Reptilia	568	2.6%
Aves	92	0.4%
	60	0.3%
NO_CLASS_SPECIFIED	26	0.1%
Pteraspidomorpha	24	0.1%
Placodermi	17	0.1%
Acanthodii	15	0.1%
Osteichthyes	11	0.0%
Thelodonti	4	0.0%
Osteostraci	4	0.0%
Chondrichthyes	4	0.0%
Actinopterygii	1	0.0%
Galeaspidomorphi	1	0.0%

order categorical feature

Taxonomic order assignments mixing extinct groups (Ammonitida, Ozarkodinida, Multituberculata, Phacopida) with extant mammalian orders (Rodentia, Artiodactyla, Carnivora). Coverage is poor: 32.3% of rows (7,117) are the sentinel 'NO_ORDER_SPECIFIED' and another 3,019 are empty strings, so roughly 46% of records lack a real order. Across 22,043 rows there are 99 distinct values with an entropy ratio of 0.586, so a handful of values account for most rows while the rest form a long tail.

Treatment: Normalize empty strings and 'NO_ORDER_SPECIFIED' into a single missing category before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
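
Collapsing the two sentinels into one missing value, and the arithmetic behind the "roughly 46%" figure, can be sketched as follows. `MISSING_TOKENS` and `normalize_order` are illustrative names; the same pattern would apply to class and family with their own sentinel strings.

```python
from typing import Optional

# Strings that stand in for "no order recorded" in this column.
MISSING_TOKENS = {"", "NO_ORDER_SPECIFIED"}


def normalize_order(value: str) -> Optional[str]:
    """Return None for sentinel values, otherwise the trimmed order name."""
    cleaned = value.strip()
    return None if cleaned in MISSING_TOKENS else cleaned


# Sentinel counts from the top-values table for order.
n_rows = 22_043
n_missing = 7_117 + 3_019
missing_share = n_missing / n_rows
print(n_missing, round(missing_share, 3))
```

That is 10,136 rows, just under 46% of the table, with no real order once both sentinels are treated as missing.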

Out[43]:

saturn.columns["order"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	99
top_value	NO_ORDER_SPECIFIED
top_rate	0.3229
cardinality	99
entropy	3.887
entropy_ratio	0.5863

Fig 18.

Top values for order.

Top values for order (20 unique shown, of 99 total).
value	count	share
NO_ORDER_SPECIFIED	7117	32.3%
	3019	13.7%
Ammonitida	1572	7.1%
Ozarkodinida	1341	6.1%
Rodentia	1109	5.0%
Artiodactyla	951	4.3%
Carnivora	744	3.4%
Multituberculata	553	2.5%
Perissodactyla	517	2.3%
Phacopida	507	2.3%
Procreodi	503	2.3%
Prioniodontida	348	1.6%
Primates	315	1.4%
Asaphida	304	1.4%
Corynexochida	252	1.1%
Ammonoidea	246	1.1%
Ptychopariida	238	1.1%
Proetida	219	1.0%
Cimolesta	218	1.0%
Lagomorpha	187	0.8%

family categorical feature

Taxonomic family classification for biological specimens, with 528 distinct values across 22,043 rows and no nulls. The top value is the empty string at 15.5% (3,418 rows), and 'NO_FAMILY_SPECIFIED' adds another 1,996 rows; together roughly a quarter of records lack a real family assignment. Among populated values, Hadrosauridae (689), Grallatoridae (593), and Palmatolepidae (586) lead, consistent with a paleontological dataset.

Treatment: Normalize empty strings and 'NO_FAMILY_SPECIFIED' to a single missing category, then target- or frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
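
Frequency encoding after sentinel clean-up might look like the sketch below. It is a plain-Python illustration with a hypothetical function name, and it computes frequencies over known values only, which is one of several reasonable conventions.

```python
from collections import Counter
from typing import Optional

FAMILY_MISSING = frozenset({"", "NO_FAMILY_SPECIFIED"})


def frequency_encode(values: list[str]) -> list[Optional[float]]:
    """Replace each family with its share among known values; sentinels become None."""
    cleaned = [None if v.strip() in FAMILY_MISSING else v.strip() for v in values]
    counts = Counter(v for v in cleaned if v is not None)
    n_known = sum(counts.values())
    return [counts[v] / n_known if v is not None else None for v in cleaned]


sample = ["Hadrosauridae", "", "Hadrosauridae", "Equidae", "NO_FAMILY_SPECIFIED"]
encoded_family = frequency_encode(sample)
print(encoded_family)
```

In a modelling pipeline the frequencies should be fitted on the training split only and then applied to held-out rows.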

Out[46]:

saturn.columns["family"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	528
top_value	""
top_rate	0.1551
cardinality	528
entropy	6.566
entropy_ratio	0.726

Fig 19.

Top values for family.

Top values for family (20 unique shown, of 528 total).
value	count	share
	3418	15.5%
NO_FAMILY_SPECIFIED	1996	9.1%
Hadrosauridae	689	3.1%
Grallatoridae	593	2.7%
Palmatolepidae	586	2.7%
Arctocyonidae	503	2.3%
Polygnathidae	459	2.1%
Cricetidae	407	1.8%
Equidae	360	1.6%
Canidae	358	1.6%
Ceratopsidae	336	1.5%
Dromaeosauridae	335	1.5%
Icriodontidae	272	1.2%
Periptychidae	249	1.1%
Neoplagiaulacidae	234	1.1%
Merycoidodontidae	231	1.0%
Camelidae	216	1.0%
Tyrannosauridae	184	0.8%
Diplodocidae	181	0.8%
Asaphidae	172	0.8%

genus text feature

This column holds taxonomic genus names for fossil records: single-word Latinate identifiers like Palmatolepis, Polygnathus, and Grallator dominate, with one_word_rate at 0.989. Of 22,043 rows, 5,545 are empty strings and duplicate_rate is 0.88 across 2,608 unique values, so the field is heavily repeated and a quarter of rows carry no genus at all. Length stats (mean 8.2, median 10, max 33; the mean is pulled down by the zero-length empties) and a vocab of 2,525 are consistent with controlled scientific nomenclature rather than free text.

Treatment: Treat as a high-cardinality categorical: normalize case, encode empties as missing, and target/frequency-encode rather than one-hot.

anthropic:claude-opus-4-7 · confidence high

Out[49]:

saturn.columns["genus"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	2,608
len_min	0
len_max	33
len_mean	8.188
len_median	10
len_p95	15
word_mean	1.011
word_median	1
n_empty	5,545
n_duplicates	19,435
duplicate_rate	0.8817
vocab_size	2,525
readability_flesch_mean	-4.827
emoji_rate	0
url_rate	0
one_word_rate	0.9894
allcaps_rate	0
boilerplate_rate	0
alert: one_word	98.9% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	88.2% duplicate strings

Fig 20.

Character-length distribution for genus.

Character-length distribution for genus (mean: 8.188).
chars	count
0 – 1	5545
1 – 2	0
2 – 2	0
2 – 3	13
3 – 4	31
4 – 5	0
5 – 6	372
6 – 7	188
7 – 7	704
7 – 8	1676
8 – 9	2307
9 – 10	0
10 – 11	2405
11 – 12	2504
12 – 12	2230
12 – 13	1696
13 – 14	1017
14 – 15	0
15 – 16	522
16 – 16	324
16 – 17	191
17 – 18	50
18 – 19	0
19 – 20	51
20 – 21	21
21 – 21	19
21 – 22	36
22 – 23	20
23 – 24	0
24 – 25	5
25 – 26	7
26 – 26	3
26 – 27	4
27 – 28	40
28 – 29	0
29 – 30	57
30 – 31	4
31 – 31	0
31 – 32	0
32 – 33	1

country categorical feature

Two-letter, ISO-style country codes across 93 distinct values with no nulls in 22,043 rows. The distribution is heavily US-dominated at 50.9% (11,218 rows), followed by CA, CN, UK, and ES ahead of a long tail; the entropy ratio of 0.50 confirms the concentration. Worth noting 'UK' is used rather than the ISO-standard 'GB', which may complicate joins against canonical country tables.

Treatment: Normalize codes (e.g., UK→GB) and group the long tail before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
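
The two steps in the treatment, code normalization and tail grouping, can be sketched like this. The alias table contains only the UK→GB case noted above; other non-standard codes may exist among the 93 values and would need adding. All names are illustrative.

```python
from collections import Counter

# Known non-ISO codes mapped to ISO 3166-1 alpha-2; extend as further cases are found.
CODE_ALIASES = {"UK": "GB"}


def normalize_country(code: str) -> str:
    """Upper-case, trim, and map known aliases to their ISO code."""
    code = code.strip().upper()
    return CODE_ALIASES.get(code, code)


def lump_tail(codes: list[str], keep: int = 10, other: str = "OTHER") -> list[str]:
    """Keep the `keep` most frequent codes and replace everything else with `other`."""
    counts = Counter(codes)
    kept = {code for code, _ in counts.most_common(keep)}
    return [code if code in kept else other for code in codes]


sample = [normalize_country(c) for c in ["US", "US", "UK", "uk", "MA"]]
lumped = lump_tail(sample, keep=2)
print(lumped)
```

Normalizing before lumping matters here, since 'UK' and 'GB' would otherwise be counted as separate codes.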

Out[52]:

saturn.columns["country"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	93
top_value	US
top_rate	0.5089
cardinality	93
entropy	3.293
entropy_ratio	0.5035

Fig 21.

Top values for country.

Top values for country (20 unique shown, of 93 total).
value	count	share
US	11218	50.9%
CA	1830	8.3%
CN	1661	7.5%
UK	983	4.5%
ES	841	3.8%
FR	390	1.8%
MA	303	1.4%
AR	292	1.3%
CZ	288	1.3%
AU	247	1.1%
TZ	218	1.0%
UZ	184	0.8%
MX	175	0.8%
KR	175	0.8%
SE	170	0.8%
CH	166	0.8%
MN	162	0.7%
ZA	159	0.7%
RU	156	0.7%
DE	152	0.7%

state categorical feature

Geographic subdivision (state/province/region) with 519 distinct values spanning US states, Canadian provinces, English regions, and Chinese provinces, indicating an international dataset rather than a US-only one. Wyoming leads at 8.6% (1,903 rows), followed by Montana (1,394) and 1,082 empty strings, which the 0% null rate does not capture. The entropy ratio of 0.70 shows a fairly even spread across the long tail.

Treatment: Treat empty strings as missing, then group-encode or target-encode given high cardinality.

anthropic:claude-opus-4-7 · confidence high

Out[55]:

saturn.columns["state"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	519
top_value	Wyoming
top_rate	0.08633
cardinality	519
entropy	6.288
entropy_ratio	0.6971

Fig 22.

Top values for state.

Top values for state (20 unique shown, of 519 total).
value	count	share
Wyoming	1903	8.6%
Montana	1394	6.3%
	1082	4.9%
New Mexico	1048	4.8%
Alberta	1009	4.6%
Nebraska	950	4.3%
England	907	4.1%
Guangxi	861	3.9%
California	837	3.8%
Colorado	540	2.4%
Texas	530	2.4%
Utah	489	2.2%
Nevada	361	1.6%
Murcia	333	1.5%
North Dakota	325	1.5%
South Dakota	316	1.4%
Massachusetts	278	1.3%
Kansas	273	1.2%
Northwest Territories	246	1.1%
Arizona	226	1.0%

formation categorical other

The `formation` column is constant: all 22043 rows hold the empty string, giving cardinality 1 and entropy 0.0. It carries no information and the top_value being "" suggests the field was never populated rather than genuinely categorical.

Treatment: Drop; single constant value provides no signal.

anthropic:claude-opus-4-7 · confidence high
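
Constant columns such as this one can be found programmatically before dropping them. A minimal sketch over rows represented as dicts; the helper name and the two-row sample are illustrative only.

```python
def constant_columns(rows: list[dict]) -> list[str]:
    """Return the names of columns that hold at most one distinct value across all rows."""
    distinct: dict[str, set] = {}
    for row in rows:
        for column, value in row.items():
            distinct.setdefault(column, set()).add(value)
    return sorted(name for name, values in distinct.items() if len(values) <= 1)


# Tiny illustrative sample; the real file has 22,043 rows.
sample = [
    {"genus": "Grallator", "formation": "", "collection": ""},
    {"genus": "Palmatolepis", "formation": "", "collection": ""},
]
to_drop = constant_columns(sample)
print(to_drop)
```

On the full dataset this check should flag exactly the columns that carry the imbalance alert with a 100% top value.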

Out[58]:

saturn.columns["formation"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	1
top_value	""
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 23.

Top values for formation.

Top values for formation (1 unique shown, of 1 total).
value	count	share
	22043	100.0%

collection categorical metadata

The 'collection' column contains a single value across all 22043 rows, and that value is the empty string. Cardinality is 1, entropy is 0, and null_rate is 0.0, meaning the field is technically populated but carries no information. This is likely a vestigial schema field that was never filled in.

Treatment: Drop; constant empty value provides no signal.

anthropic:claude-opus-4-7 · confidence high

Out[61]:

saturn.columns["collection"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	1
top_value	""
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 24.

Top values for collection.

Top values for collection (1 unique shown, of 1 total).
value	count	share
	22043	100.0%

paleolat numeric feature

Paleolatitude in degrees, ranging from -86.16 to 89.2 with a median of 34.89, consistent with reconstructed latitudes of fossil localities. The distribution is left-skewed (skew -1.08) and 6.97% of non-null values are flagged as outliers (1,503 records). With Q1 at 16.34 and an IQR of 30.64, the lower 1.5 IQR fence is about -29.6°, and the upper fence (about 92.9°) lies beyond the maximum, so every flagged record sits in the southern-hemisphere tail against a northern-hemisphere mode. The null rate is low at 2.23% (491 rows), and 3,214 unique values across 22,043 rows indicate substantial repetition, likely from shared site coordinates.

Treatment: Use as-is for geographic modelling; consider binning by hemisphere or absolute latitude given the left skew.

anthropic:claude-opus-4-7 · confidence high
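
Binning by hemisphere and absolute latitude, with the 491 null rows passed through as missing, could be sketched as follows. The 30° band width and the feature names are arbitrary choices for illustration.

```python
from typing import Optional


def paleolat_features(paleolat: Optional[float], band_width: float = 30.0) -> dict:
    """Derive hemisphere, absolute latitude, and an absolute-latitude band label."""
    if paleolat is None:
        return {"hemisphere": None, "abs_lat": None, "abs_lat_band": None}
    abs_lat = abs(paleolat)
    # Clamp so that exactly 90 degrees falls in the top band instead of opening a new one.
    band_start = min(int(abs_lat // band_width) * band_width, 90.0 - band_width)
    hemisphere = "N" if paleolat > 0 else "S" if paleolat < 0 else "equator"
    return {
        "hemisphere": hemisphere,
        "abs_lat": abs_lat,
        "abs_lat_band": f"{band_start:g}-{band_start + band_width:g}",
    }


print(paleolat_features(34.89))
print(paleolat_features(-86.16))
print(paleolat_features(None))
```

Splitting sign from magnitude lets a model treat the southern tail as distance from the equator rather than as extreme negative values.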

Out[64]:

saturn.columns["paleolat"].stats

stat	value
n	22,043
nulls	491 (2.2%)
unique	3,214
min	-86.16
max	89.2
mean	26.46
median	34.89
std	29.54
q1	16.34
q3	46.98
iqr	30.64
skew	-1.08
kurtosis	0.2646
n_outliers	1,503
outlier_rate	0.06974
zero_rate	0
alert: outliers	7.0% rows beyond 1.5 IQR

Fig 25.

Distribution of paleolat. Vertical dash marks the median.

Histogram bins for paleolat (median: 34.89).
bin	count
-86.16 – -81.78	13
-81.78 – -77.39	17
-77.39 – -73.01	17
-73.01 – -68.62	11
-68.62 – -64.24	12
-64.24 – -59.86	19
-59.86 – -55.47	13
-55.47 – -51.09	87
-51.09 – -46.7	82
-46.7 – -42.32	185
-42.32 – -37.94	454
-37.94 – -33.55	285
-33.55 – -29.17	350
-29.17 – -24.78	508
-24.78 – -20.4	462
-20.4 – -16.02	669
-16.02 – -11.63	554
-11.63 – -7.248	173
-7.248 – -2.864	144
-2.864 – 1.52	184
1.52 – 5.904	345
5.904 – 10.29	212
10.29 – 14.67	504
14.67 – 19.06	275
19.06 – 23.44	888
23.44 – 27.82	1581
27.82 – 32.21	1491
32.21 – 36.59	2167
36.59 – 40.98	1711
40.98 – 45.36	787
45.36 – 49.74	2851
49.74 – 54.13	1762
54.13 – 58.51	1421
58.51 – 62.9	1132
62.9 – 67.28	152
67.28 – 71.66	8
71.66 – 76.05	4
76.05 – 80.43	6
80.43 – 84.82	3
84.82 – 89.2	13

paleolng numeric feature

This is paleolongitude: reconstructed longitudes of fossil localities on ancient continental configurations. Values span nearly the full longitudinal range (-177.6 to 168.7) with a mean of -28.58 and median of -62.15, indicating a concentration at western longitudes and moderate positive skew (0.74). The null rate is 2.23% and only 3 outliers are flagged, so the distribution is well-behaved within plausible geographic bounds.

Treatment: Use as a geographic coordinate feature; pair with paleolat and consider cyclic encoding since longitude wraps at ±180.

anthropic:claude-opus-4-7 · confidence high
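
A common way to do the cyclic encoding mentioned in the treatment note is to map each longitude to a sine and cosine pair, so that values just either side of ±180 end up close together. The sketch below is one such encoding; the function name `encode_longitude` is an assumption for illustration, and missing values are returned as `None`.

```python
import math

def encode_longitude(lon_deg):
    """Map a longitude in degrees to a (sin, cos) pair on the unit circle."""
    if lon_deg is None:
        return None
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)

def encoded_distance(a_deg, b_deg):
    """Euclidean distance between two longitudes after encoding."""
    (sa, ca), (sb, cb) = encode_longitude(a_deg), encode_longitude(b_deg)
    return math.hypot(sa - sb, ca - cb)

# Raw values differ by 359.8 degrees, but the encoded points are neighbours.
print(encoded_distance(-179.9, 179.9))
# Opposite sides of the globe stay far apart.
print(encoded_distance(-90.0, 90.0))
```

The encoding treats longitude on its own; when paleolat is used alongside it, converting both to 3D unit-sphere coordinates is an alternative worth considering.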

Out[67]:

saturn.columns["paleolng"].stats

stat	value
n	22,043
nulls	491 (2.2%)
unique	3,715
min	-177.6
max	168.7
mean	-28.58
median	-62.15
std	68.91
q1	-77.16
q3	20.19
iqr	97.35
skew	0.7372
kurtosis	-0.4811
n_outliers	3
outlier_rate	0.0001392
zero_rate	0

Fig 26.

Distribution of paleolng. Vertical dash marks the median.

Histogram bins for paleolng (median: -62.15).
bin	count
-177.6 – -168.9	3
-168.9 – -160.3	12
-160.3 – -151.6	4
-151.6 – -143	31
-143 – -134.3	31
-134.3 – -125.7	121
-125.7 – -117	345
-117 – -108.3	968
-108.3 – -99.68	726
-99.68 – -91.03	1791
-91.03 – -82.37	244
-82.37 – -73.71	2539
-73.71 – -65.05	3233
-65.05 – -56.4	949
-56.4 – -47.74	404
-47.74 – -39.08	1084
-39.08 – -30.42	478
-30.42 – -21.77	154
-21.77 – -13.11	57
-13.11 – -4.45	626
-4.45 – 4.207	210
4.207 – 12.86	1004
12.86 – 21.52	1648
21.52 – 30.18	1074
30.18 – 38.84	534
38.84 – 47.49	129
47.49 – 56.15	52
56.15 – 64.81	82
64.81 – 73.47	284
73.47 – 82.12	257
82.12 – 90.78	663
90.78 – 99.44	486
99.44 – 108.1	156
108.1 – 116.8	315
116.8 – 125.4	434
125.4 – 134.1	182
134.1 – 142.7	192
142.7 – 151.4	40
151.4 – 160	7
160 – 168.7	3

reference_no text foreign_key

Short numeric reference codes (length 1-5, mean 4.17, all single-word) stored as text. Despite the name 'reference_no', it is far from unique: 22,043 rows collapse to 3,725 distinct values with an 83.1% duplicate rate, and the top code '4245' alone appears 794 times. The 89.8% allcaps_rate is a quirk of the detector treating digit-only strings as uppercase.

Treatment: Treat as a categorical foreign key and left-join to the reference dimension; do not assume uniqueness.

anthropic:claude-opus-4-7 · confidence high
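
The left join described in the treatment note can be sketched with plain dictionaries. Everything in the example data is invented for illustration: the reference dimension, its `title` field, and the pairing of occurrence numbers with reference codes are assumptions, since the profile does not describe a reference table.

```python
from collections import Counter

# Toy occurrence rows; the pairings are invented for the example.
occurrences = [
    {"occurrence_no": "164260", "reference_no": "4245"},
    {"occurrence_no": "1439335", "reference_no": "4245"},
    {"occurrence_no": "100", "reference_no": "99999"},
]

# Hypothetical reference dimension keyed by reference_no.
references = {"4245": {"title": "hypothetical reference title"}}

def left_join(rows, dim, key="reference_no"):
    """Keep every row; attach the matching title, or None when there is no match."""
    joined = []
    for row in rows:
        match = dim.get(row[key])
        joined.append({**row, "ref_title": match["title"] if match else None})
    return joined

joined = left_join(occurrences, references)

# Many rows share one reference, so the key repeats on the occurrence side.
counts = Counter(row["reference_no"] for row in occurrences)
print(len(joined), counts.most_common(1))
```

Because the join is many-to-one, the row count is preserved only if the reference dimension itself has unique keys; that is worth checking before joining.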

Out[70]:

saturn.columns["reference_no"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	3,725
len_min	1
len_max	5
len_mean	4.172
len_median	4
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	18,318
duplicate_rate	0.831
vocab_size	3,547
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.8977
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	89.8% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	83.1% duplicate strings
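
The allcaps rate is below 100% even though the codes are described as numeric. One reading that fits the reported numbers, offered here as a hypothesis and not as Saturn's documented behaviour, is that the detector flags any string of at least three characters that equals its own uppercase form. The length counts below come from the character-length distribution for this column.

```python
n_rows = 22043
# Rows per string length, from the reference_no character-length distribution.
length_counts = {1: 23, 2: 2233, 3: 1274, 4: 8908, 5: 9605}

def looks_allcaps(text, min_len=3):
    """Hypothetical detector rule: long enough and unchanged by upper()."""
    return len(text) >= min_len and text == text.upper()

# Digit-only strings pass this rule, although str.isupper() rejects them.
print(looks_allcaps("4245"), "4245".isupper())

# Share of rows with at least three characters.
eligible = sum(count for length, count in length_counts.items() if length >= 3)
implied_rate = eligible / n_rows
print(f"implied allcaps rate: {implied_rate:.4f}")
```

The implied rate is within rounding of the reported 0.8977. The same reading gives 22,020 of 22,043 rows (about 0.999) for occurrence_no, where only 23 values are shorter than three characters.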

Fig 27.

Character-length distribution for reference_no.

Character-length distribution for reference_no (mean: 4.172).
chars	count
1	23
2	2233
3	1274
4	8908
5	9605

occurrence_no text identifier

This is almost certainly a primary-key style identifier: every one of the 22043 rows holds a unique single-token value (n_unique == n, one_word_rate 1.0), with no nulls or duplicates. Lengths range from 1 to 7 characters and the top words are all numeric strings like '164260' and '1439335', so the field is stored as text but contains integer occurrence numbers. The 'allcaps' alert is a quirk of the detector treating digit-only strings as uppercase and can be ignored.

Treatment: Use as a row key for joins; do not feed into modelling.

anthropic:claude-opus-4-7 · confidence high
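
Before relying on the column as a row key, it may be worth confirming the two properties the profile implies: uniqueness and digit-only content. The sketch below is a minimal check under those assumptions; the function name `validate_key` is hypothetical, and the profile only shows that the top values are numeric, not that every value is.

```python
def validate_key(values):
    """Check that values are unique ASCII digit strings and return them as ints.

    Raises ValueError on duplicates or on any value that is not digit-only.
    """
    if len(set(values)) != len(values):
        raise ValueError("duplicate keys found")
    bad = [v for v in values if not (v.isascii() and v.isdigit())]
    if bad:
        # Show at most five offending values to keep the message short.
        raise ValueError(f"non-numeric keys: {bad[:5]}")
    return [int(v) for v in values]

# Two values quoted in the profile plus one short toy value.
print(validate_key(["164260", "1439335", "7"]))
```

Casting to integers is optional; keeping the text form avoids surprises if any key carries leading zeros.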

Out[73]:

saturn.columns["occurrence_no"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	22,043
len_min	1
len_max	7
len_mean	5.762
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.999
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 28.

Character-length distribution for occurrence_no.

Character-length distribution for occurrence_no (mean: 5.762).
chars	count
1	8
2	15
3	26
4	1359
5	3185
6	16620
7	830

How to cite