data-trove-fossils-pbdb · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/fossils.json

Saturn profiled 22,043 rows across 21 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/fossils.json",
    "--findings", "data-trove-fossils-pbdb.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This is a fossil occurrence dataset containing 22,043 records spanning taxonomic classifications, geographic coordinates, and geological time ranges for paleontological finds. The taxonomic breakdown is dominated by Chordata (81.6%) with Mammalia, Saurischia, and Ornithischia as the leading classes, while over half of all occurrences (50.9%) come from the United States — worth examining for geographic bias. The geological age columns (early_age_mya and late_age_mya) span from near-present to over 500 million years ago with high spread and outliers, suggesting the dataset mixes very different eras of life. Taxonomic rank is split between species (41%) and genus (33%), meaning precision of identification varies considerably across records and may affect comparative analyses.

citing: phylum.top_values · class.top_values · country.top_rate · country.top_value · early_age_mya.max · early_age_mya.mean · early_age_mya.n_outliers · rank.top_values · lat.skew · lat.n_outliers · row_count

Out[4]:

saturn.schema() · 21 columns

column	kind	n	null%	unique	alerts
name	text	22,043	0.0%	4,660	one_word duplicates
rank	categorical	22,043	0.0%	18
lat	numeric	22,043	0.0%	4,095	high_skew outliers
lon	numeric	22,043	0.0%	4,259
early_age_mya	numeric	22,043	0.0%	164	outliers
late_age_mya	numeric	22,043	0.0%	156	outliers
period	categorical	22,043	0.0%	298
late_interval	categorical	22,043	0.0%	138
phylum	categorical	22,043	0.0%	4
class	categorical	22,043	0.0%	19
order	categorical	22,043	0.0%	99
family	categorical	22,043	0.0%	528
genus	text	22,043	0.0%	2,608	one_word short_text duplicates
country	categorical	22,043	0.0%	93
state	categorical	22,043	0.0%	519
formation	categorical	22,043	0.0%	1	imbalance
collection	categorical	22,043	0.0%	1	imbalance
paleolat	numeric	22,043	2.2%	3,214	outliers
paleolng	numeric	22,043	2.2%	3,715
reference_no	text	22,043	0.0%	3,725	one_word allcaps short_text duplicates
occurrence_no	text	22,043	0.0%	22,043	near_unique one_word allcaps short_text

Fig 1.

class · Look for how heavily Mammalia and the dinosaur classes (Saurischia, Ornithischia) dominate over marine and invertebrate groups like Cephalopoda and Trilobita.

Top values for class (19 unique shown, of 19 total).
value	count	share
Mammalia	7015	31.8%
Saurischia	5507	25.0%
Ornithischia	2811	12.8%
Cephalopoda	2000	9.1%
Trilobita	2000	9.1%
Conodonta	1883	8.5%
Reptilia	568	2.6%
Aves	92	0.4%
	60	0.3%
NO_CLASS_SPECIFIED	26	0.1%
Pteraspidomorpha	24	0.1%
Placodermi	17	0.1%
Acanthodii	15	0.1%
Osteichthyes	11	0.0%
Thelodonti	4	0.0%
Osteostraci	4	0.0%
Chondrichthyes	4	0.0%
Actinopterygii	1	0.0%
Galeaspidomorphi	1	0.0%

Fig 2.

country · Notice the strong US bias (50.9% of records) versus all other countries, which may reflect collection effort rather than true fossil distribution.

Top values for country (20 unique shown, of 93 total).
value	count	share
US	11218	50.9%
CA	1830	8.3%
CN	1661	7.5%
UK	983	4.5%
ES	841	3.8%
FR	390	1.8%
MA	303	1.4%
AR	292	1.3%
CZ	288	1.3%
AU	247	1.1%
TZ	218	1.0%
UZ	184	0.8%
MX	175	0.8%
KR	175	0.8%
SE	170	0.8%
CH	166	0.8%
MN	162	0.7%
ZA	159	0.7%
RU	156	0.7%
DE	152	0.7%

Fig 3.

early_age_mya · Look for the wide spread from near-zero to about 539 million years ago, the empty bins between roughly 256 and 296 Mya, and the secondary cluster of early Paleozoic occurrences (older than about 360 Mya) that stretches the right tail.

Histogram bins for early_age_mya (median: 110.1).
bin	count
0.0117 – 13.48	2334
13.48 – 26.95	1665
26.95 – 40.42	85
40.42 – 53.89	12
53.89 – 67.36	2904
67.36 – 80.83	1302
80.83 – 94.3	2239
94.3 – 107.8	454
107.8 – 121.2	796
121.2 – 134.7	1645
134.7 – 148.2	410
148.2 – 161.6	1650
161.6 – 175.1	421
175.1 – 188.6	36
188.6 – 202.1	975
202.1 – 215.5	193
215.5 – 229	530
229 – 242.5	216
242.5 – 255.9	61
255.9 – 269.4	0
269.4 – 282.9	0
282.9 – 296.3	0
296.3 – 309.8	25
309.8 – 323.3	17
323.3 – 336.8	21
336.8 – 350.2	37
350.2 – 363.7	58
363.7 – 377.2	770
377.2 – 390.6	319
390.6 – 404.1	319
404.1 – 417.6	176
417.6 – 431	602
431 – 444.5	265
444.5 – 458	382
458 – 471.5	427
471.5 – 484.9	81
484.9 – 498.4	334
498.4 – 511.9	180
511.9 – 525.3	62
525.3 – 538.8	40

Fig 4.

rank · Check the split between species- and genus-level identifications, as this reveals how precisely occurrences have been classified taxonomically.

Top values for rank (18 unique shown, of 18 total).
value	count	share
species	9082	41.2%
genus	7342	33.3%
unranked clade	2828	12.8%
family	1716	7.8%
subfamily	272	1.2%
subclass	205	0.9%
class	134	0.6%
order	115	0.5%
infraorder	97	0.4%
superfamily	75	0.3%
subgenus	51	0.2%
kingdom	50	0.2%
suborder	29	0.1%
subspecies	23	0.1%
tribe	12	0.1%
subphylum	9	0.0%
superorder	2	0.0%
superclass	1	0.0%

Fig 5.

phylum · Chordata overwhelmingly dominates at 81.6%, with Mollusca and Arthropoda each contributing exactly 2,000 records — a suspiciously round number worth investigating.

Top values for phylum (4 unique shown, of 4 total).
value	count	share
Chordata	17993	81.6%
Mollusca	2000	9.1%
Arthropoda	2000	9.1%
	50	0.2%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
name	text	0.0%
rank	categorical	0.0%
lat	numeric	0.0%
lon	numeric	0.0%
early_age_mya	numeric	0.0%
late_age_mya	numeric	0.0%
period	categorical	0.0%
late_interval	categorical	0.0%
phylum	categorical	0.0%
class	categorical	0.0%
order	categorical	0.0%
family	categorical	0.0%
genus	text	0.0%
country	categorical	0.0%
state	categorical	0.0%
formation	categorical	0.0%
collection	categorical	0.0%
paleolat	numeric	2.2%
paleolng	numeric	2.2%
reference_no	text	0.0%
occurrence_no	text	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 6 numeric columns (values clipped to 2 decimals).
	lat	lon	early_age_mya	late_age_mya	paleolat	paleolng
lat	+1.00	-0.27	+0.04	+0.05	+0.12	-0.17
lon	-0.27	+1.00	+0.17	+0.16	-0.22	+0.45
early_age_mya	+0.04	+0.17	+1.00	+1.00	-0.61	+0.03
late_age_mya	+0.05	+0.16	+1.00	+1.00	-0.61	+0.03
paleolat	+0.12	-0.22	-0.61	-0.61	+1.00	-0.21
paleolng	-0.17	+0.45	+0.03	+0.03	-0.21	+1.00

name text label

This column contains taxonomic names of fossil organisms — dinosaurs (Theropoda, Sauropoda, Hadrosauridae), conodonts (Palmatolepis, Polygnathus, Icriodus), and other paleontological taxa — making it a biological classification label rather than a unique specimen identifier. The duplicate rate is extremely high at 78.9%, with only 4,660 unique values across 22,043 rows, reflecting that many specimens share the same taxon name. Over half of values (58.5%) are single words (genus or clade names), with a mean word count of 1.42, consistent with Linnaean binomial or higher-rank nomenclature. The top value 'Theropoda' alone appears 768 times, confirming this is a categorical grouping field, not a unique label.

Treatment: Encode as a categorical feature using frequency or target encoding (the 4,660 values are too many for one-hot and have no natural order); consider hierarchical grouping by taxonomic rank for modelling.

anthropic:default · confidence high

Out[13]:

saturn.columns["name"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	4,660
len_min	3
len_max	47
len_mean	15.09
len_median	14
len_p95	26
word_mean	1.425
word_median	1
n_empty	0
n_duplicates	17,383
duplicate_rate	0.7886
vocab_size	5,140
readability_flesch_mean	-4.127
emoji_rate	0
url_rate	0
one_word_rate	0.5846
allcaps_rate	0
boilerplate_rate	0
alert: one_word	58.5% rows are a single word
alert: duplicates	78.9% duplicate strings
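
The duplicate figures above are consistent with a simple definition: every row beyond the first occurrence of each distinct value counts as a duplicate. That definition is inferred from the reported numbers rather than taken from saturn's documentation, and the function name below is illustrative. A minimal sketch of the arithmetic:

```python
def duplicate_stats(n_rows, n_unique):
    """Rows that repeat an already-seen value, and their share of all rows."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, n_duplicates / n_rows


# n and unique as reported for the name column.
n_dup, rate = duplicate_stats(n_rows=22_043, n_unique=4_660)
print(n_dup, round(rate, 4))
```

Under this reading, the inputs reproduce the reported n_duplicates of 17,383 and duplicate_rate of 0.7886, so a high rate here says only that taxon names recur, not that rows are redundant.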

Fig 8.

Character-length distribution for name.

Character-length distribution for name (mean: 15.094769314521617).
chars	count
3 – 4	72
4 – 5	190
5 – 6	110
6 – 7	466
7 – 8	942
8 – 10	2265
10 – 11	2027
11 – 12	1560
12 – 13	1821
13 – 14	1566
14 – 15	2019
15 – 16	626
16 – 17	923
17 – 18	808
18 – 20	1219
20 – 21	1117
21 – 22	747
22 – 23	889
23 – 24	549
24 – 25	531
25 – 26	719
26 – 27	307
27 – 28	119
28 – 29	178
29 – 31	63
31 – 32	52
32 – 33	22
33 – 34	8
34 – 35	2
35 – 36	17
36 – 37	19
37 – 38	11
38 – 39	50
39 – 40	13
40 – 42	9
42 – 43	2
43 – 44	1
44 – 45	3
45 – 46	0
46 – 47	1

rank categorical label

This column encodes taxonomic rank in a biological classification dataset, with 18 distinct levels spanning the Linnaean hierarchy from species up through class and beyond. 'Species' dominates at 41.2% (9,082 rows) and 'genus' follows at 7,342 rows, which is the expected shape for a taxonomy tree where leaf nodes vastly outnumber higher groupings. Notably, 'unranked clade' appears 2,828 times—making it the third-largest category—indicating a substantial portion of entries reflect modern phylogenetic classifications that don't fit traditional Linnaean ranks. Entropy ratio of 0.50 signals moderate concentration, not uniform distribution.

Treatment: Ordinal-encode with a defined taxonomic hierarchy or one-hot encode for modelling; treat 'unranked clade' as a separate nominal category since it has no fixed position in the hierarchy.

anthropic:default · confidence high

Out[16]:

saturn.columns["rank"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	18
top_value	species
top_rate	0.412
cardinality	18
entropy	2.085
entropy_ratio	0.5001
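
The entropy and entropy_ratio values in this report are consistent with Shannon entropy in bits, divided by log2 of the cardinality. That reading is inferred from the numbers, not from saturn's documentation. A short sketch that recomputes both from the rank counts in the top-values table:

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts for rank, copied from the top-values table (18 categories).
rank_counts = [9082, 7342, 2828, 1716, 272, 205, 134, 115, 97,
               75, 51, 50, 29, 23, 12, 9, 2, 1]

h = entropy_bits(rank_counts)
ratio = h / math.log2(len(rank_counts))  # 1.0 would mean a uniform spread
print(f"entropy = {h:.3f} bits, entropy_ratio = {ratio:.4f}")
```

A ratio near 0.5 means the column carries about half the entropy that 18 equally likely ranks would, which matches the dominance of species and genus.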

Fig 9.

Top values for rank.

Top values for rank (18 unique shown, of 18 total).
value	count	share
species	9082	41.2%
genus	7342	33.3%
unranked clade	2828	12.8%
family	1716	7.8%
subfamily	272	1.2%
subclass	205	0.9%
class	134	0.6%
order	115	0.5%
infraorder	97	0.4%
superfamily	75	0.3%
subgenus	51	0.2%
kingdom	50	0.2%
suborder	29	0.1%
subspecies	23	0.1%
tribe	12	0.1%
subphylum	9	0.0%
superorder	2	0.0%
superclass	1	0.0%

lat numeric feature

This column contains geographic latitude values, ranging from -84.33° to 79.75° with a median of 41.70°, consistent with mid-northern-hemisphere locations (e.g., Europe or the northern US). The distribution is strongly left-skewed (skew = -2.44) with high kurtosis (7.05), reflecting a long tail of tropical and southern-hemisphere values; 9.2% of rows (2,019 records) are flagged as outliers. The -84.33° minimum lies far south of the Antarctic Circle (about -66.6°), deep in the Antarctic interior. Only 4 rows fall below -80° and another 19 lie between roughly -72° and -64°, so these may be genuine Antarctic localities rather than encoding errors, but they are worth checking against the country and state fields.

Treatment: Cross-check extreme values (especially the rows near -84.33°) against country and state before treating them as errors; otherwise use as-is or pair with longitude for geospatial features.

anthropic:default · confidence high

Out[19]:

saturn.columns["lat"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	4,095
min	-84.33
max	79.75
mean	37.12
median	41.7
std	19.37
q1	35
q3	46.61
iqr	11.61
skew	-2.442
kurtosis	7.054
n_outliers	2,019
outlier_rate	0.09159
zero_rate	0
alert: high_skew	skew=-2.44
alert: outliers	9.2% rows beyond 1.5 IQR
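
The outlier alert above uses the 1.5 × IQR rule. A minimal sketch of that rule applied to the reported quartiles shows where the fences fall (the function name is illustrative and not part of saturn):

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for lat in the stats table above.
lower, upper = tukey_fences(q1=35.0, q3=46.61)
print(f"lower fence: {lower:.1f}, upper fence: {upper:.1f}")
```

With these inputs the fences land near 17.6° and 64.0°, so every tropical and southern-hemisphere record, along with the high-Arctic ones, counts as an outlier under this rule. The flag therefore marks geographic spread away from the mid-northern bulk, not necessarily bad data.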

Fig 10.

Distribution of lat. Vertical dash marks the median.

Histogram bins for lat (median: 41.700001).
bin	count
-84.33 – -80.23	4
-80.23 – -76.13	0
-76.13 – -72.03	0
-72.03 – -67.93	12
-67.93 – -63.82	7
-63.82 – -59.72	0
-59.72 – -55.62	0
-55.62 – -51.52	0
-51.52 – -47.41	16
-47.41 – -43.31	60
-43.31 – -39.21	111
-39.21 – -35.11	127
-35.11 – -31.01	166
-31.01 – -26.9	347
-26.9 – -22.8	40
-22.8 – -18.7	99
-18.7 – -14.6	138
-14.6 – -10.5	6
-10.5 – -6.394	255
-6.394 – -2.292	21
-2.292 – 1.81	0
1.81 – 5.912	32
5.912 – 10.01	22
10.01 – 14.12	12
14.12 – 18.22	168
18.22 – 22.32	137
22.32 – 26.42	1224
26.42 – 30.52	669
30.52 – 34.63	1615
34.63 – 38.73	3128
38.73 – 42.83	4830
42.83 – 46.93	3520
46.93 – 51.04	2987
51.04 – 55.14	1322
55.14 – 59.24	253
59.24 – 63.34	303
63.34 – 67.44	99
67.44 – 71.55	118
71.55 – 75.65	180
75.65 – 79.75	15

lon numeric feature

This column contains geographic longitude values, spanning from -176.7° to 177.1°, consistent with near-global coverage. The median of -98.25° and mean of -47.2° indicate a strong concentration of records in North America (the bins between about -124° and -97° hold roughly half of all rows), which explains the positive skew of 0.93: the tail extends eastward through European, African, and Asian longitudes. The 4,259 unique values across 22,043 rows show that many occurrences share coordinates, which could reflect several taxa recorded at the same locality or coordinates rounded to a coarse grid; the stats alone cannot distinguish the two.

Treatment: Use as-is or pair with latitude for geospatial modelling; consider binning into regions if cardinality reduction is needed.

anthropic:default · confidence high

Out[22]:

saturn.columns["lon"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	4,259
min	-176.7
max	177.1
mean	-47.21
median	-98.25
std	79.13
q1	-108.2
q3	5.873
iqr	114
skew	0.9275
kurtosis	-0.4932
n_outliers	3
outlier_rate	0.0001361
zero_rate	0.0002268

Fig 11.

Distribution of lon. Vertical dash marks the median.

Histogram bins for lon (median: -98.25).
bin	count
-176.7 – -167.8	4
-167.8 – -159	13
-159 – -150.1	18
-150.1 – -141.3	32
-141.3 – -132.4	91
-132.4 – -123.6	163
-123.6 – -114.8	1447
-114.8 – -105.9	5659
-105.9 – -97.08	3904
-97.08 – -88.23	360
-88.23 – -79.39	559
-79.39 – -70.55	962
-70.55 – -61.7	409
-61.7 – -52.86	76
-52.86 – -44.02	52
-44.02 – -35.17	13
-35.17 – -26.33	0
-26.33 – -17.48	93
-17.48 – -8.642	160
-8.642 – 0.2019	1918
0.2019 – 9.045	932
9.045 – 17.89	817
17.89 – 26.73	253
26.73 – 35.58	447
35.58 – 44.42	311
44.42 – 53.26	71
53.26 – 62.11	90
62.11 – 70.95	315
70.95 – 79.79	306
79.79 – 88.64	107
88.64 – 97.48	78
97.48 – 106.3	553
106.3 – 115.2	1162
115.2 – 124	150
124 – 132.9	206
132.9 – 141.7	50
141.7 – 150.5	238
150.5 – 159.4	11
159.4 – 168.2	4
168.2 – 177.1	9

early_age_mya numeric feature

This column represents the early (older) age boundary of a geological time range in millions of years ago (Mya), likely for fossil taxa or stratigraphic intervals. With only 164 unique values across 22,043 rows, most records share standardized geological stage boundaries rather than continuous age estimates. The distribution is right-skewed (skew 1.13) with a mean of ~154.7 Mya but a median of only 110.1 Mya, and 11.6% of values (2,549 rows) are flagged as outliers. The upper fence, q3 + 1.5 × IQR = 201.4 + 207 ≈ 408 Mya, places those rows in the early Paleozoic (Cambrian through Early Devonian), in a tail that ends at 538.8 Mya, the base of the Cambrian, against a bulk of Mesozoic and Cenozoic records.

Treatment: Use as-is or bin into geological periods; log-transform if used in regression due to right skew and wide range (0.0117–538.8).

anthropic:default · confidence high

Out[25]:

saturn.columns["early_age_mya"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	164
min	0.0117
max	538.8
mean	154.7
median	110.1
std	143.1
q1	63.4
q3	201.4
iqr	138
skew	1.131
kurtosis	0.07677
n_outliers	2,549
outlier_rate	0.1156
zero_rate	0
alert: outliers	11.6% rows beyond 1.5 IQR
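
Binning into geological periods, as the treatment suggests, can be sketched with a sorted list of period bases. The boundary ages below are approximate values supplied for illustration; they are not part of the report and should be checked against the current international chronostratigraphic chart. The names `PERIOD_BASES` and `period_of` are hypothetical.

```python
from bisect import bisect_left

# Approximate period bases in Mya (older edge of each period), youngest first.
# Assumed values: verify against the current ICS chart before relying on them.
PERIOD_BASES = [
    (2.58, "Quaternary"), (23.03, "Neogene"), (66.0, "Paleogene"),
    (145.0, "Cretaceous"), (201.4, "Jurassic"), (251.9, "Triassic"),
    (298.9, "Permian"), (358.9, "Carboniferous"), (419.2, "Devonian"),
    (443.8, "Silurian"), (485.4, "Ordovician"), (538.8, "Cambrian"),
]
_BASES = [base for base, _ in PERIOD_BASES]


def period_of(early_age_mya):
    """Map an early (older) age bound to the period whose span contains it."""
    i = bisect_left(_BASES, early_age_mya)
    return PERIOD_BASES[i][1] if i < len(PERIOD_BASES) else "Precambrian"


for age in (0.0117, 110.1, 538.8):  # min, median, and max from the stats table
    print(age, period_of(age))
```

An age that equals a period base is assigned to that period, which suits a column holding the older bound of each range. An occurrence whose range crosses a boundary would still need late_age_mya to be binned separately.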

Fig 12.

Distribution of early_age_mya. Vertical dash marks the median.

Histogram bins for early_age_mya (median: 110.1).
bin	count
0.0117 – 13.48	2334
13.48 – 26.95	1665
26.95 – 40.42	85
40.42 – 53.89	12
53.89 – 67.36	2904
67.36 – 80.83	1302
80.83 – 94.3	2239
94.3 – 107.8	454
107.8 – 121.2	796
121.2 – 134.7	1645
134.7 – 148.2	410
148.2 – 161.6	1650
161.6 – 175.1	421
175.1 – 188.6	36
188.6 – 202.1	975
202.1 – 215.5	193
215.5 – 229	530
229 – 242.5	216
242.5 – 255.9	61
255.9 – 269.4	0
269.4 – 282.9	0
282.9 – 296.3	0
296.3 – 309.8	25
309.8 – 323.3	17
323.3 – 336.8	21
336.8 – 350.2	37
350.2 – 363.7	58
363.7 – 377.2	770
377.2 – 390.6	319
390.6 – 404.1	319
404.1 – 417.6	176
417.6 – 431	602
431 – 444.5	265
444.5 – 458	382
458 – 471.5	427
471.5 – 484.9	81
484.9 – 498.4	334
498.4 – 511.9	180
511.9 – 525.3	62
525.3 – 538.8	40

late_age_mya numeric feature

This column records the late (younger) age boundary of each occurrence's age range in millions of years ago (Mya), a standard field in paleontological occurrence databases. With only 156 unique values across 22,043 rows, ages are drawn from a discrete set of stage/interval boundaries rather than continuous measurements. The distribution is right-skewed (skew 1.17, mean 147.5 vs. median 93.9) and 11.5% of rows (2,535) are flagged as outliers. The upper fence, q3 + 1.5 × IQR = 192.9 + 198 ≈ 391 Mya, means the flagged rows are occurrences older than about 391 Mya (roughly Early Devonian and older), in a tail extending to 521 Mya. The minimum is exactly 0 (about 0.1% of rows, roughly 25 records), marking age ranges that extend to the present.

Treatment: Use as-is or bin into geological periods; if transforming, use log(1 + x) because the minimum is 0, or pair with early_age_mya to compute range duration before modelling. The two age columns correlate at +1.00 (Fig 7), so including both in a linear model adds little.

anthropic:default · confidence high

Out[28]:

saturn.columns["late_age_mya"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	156
min	0
max	521
mean	147.5
median	93.9
std	141.7
q1	60.9
q3	192.9
iqr	132
skew	1.169
kurtosis	0.1231
n_outliers	2,535
outlier_rate	0.115
zero_rate	0.001134
alert: outliers	11.5% rows beyond 1.5 IQR
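
Two transformations are natural for this column: a log transform that tolerates the zero minimum, and a range duration computed together with early_age_mya. A minimal sketch, with hypothetical helper names and a made-up example occurrence:

```python
import math


def range_duration(early_age_mya, late_age_mya):
    """Length of an occurrence's age range in millions of years."""
    if early_age_mya < late_age_mya:
        raise ValueError("early (older) bound must be >= late (younger) bound")
    return early_age_mya - late_age_mya


def log_age(age_mya):
    """log(1 + age): defined at age 0, unlike a plain logarithm."""
    return math.log1p(age_mya)


# Hypothetical occurrence whose range runs from 0.0117 Mya to the present.
print(range_duration(0.0117, 0.0))
print(log_age(0.0), log_age(521.0))
```

The guard in `range_duration` doubles as a cheap consistency check: any row where the early bound is younger than the late bound points to swapped or corrupted ages.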

Fig 13.

Distribution of late_age_mya. Vertical dash marks the median.

Histogram bins for late_age_mya (median: 93.9).
bin	count
0 – 13.03	2732
13.03 – 26.05	1291
26.05 – 39.08	69
39.08 – 52.1	7
52.1 – 65.12	2911
65.12 – 78.15	3279
78.15 – 91.17	374
91.17 – 104.2	949
104.2 – 117.2	922
117.2 – 130.2	1043
130.2 – 143.3	1317
143.3 – 156.3	671
156.3 – 169.3	368
169.3 – 182.3	135
182.3 – 195.4	634
195.4 – 208.4	1003
208.4 – 221.4	35
221.4 – 234.5	123
234.5 – 247.5	63
247.5 – 260.5	2
260.5 – 273.5	0
273.5 – 286.6	0
286.6 – 299.6	12
299.6 – 312.6	34
312.6 – 325.6	19
325.6 – 338.7	4
338.7 – 351.7	83
351.7 – 364.7	133
364.7 – 377.7	828
377.7 – 390.8	467
390.8 – 403.8	190
403.8 – 416.8	530
416.8 – 429.8	170
429.8 – 442.9	141
442.9 – 455.9	529
455.9 – 468.9	299
468.9 – 481.9	111
481.9 – 494.9	310
494.9 – 508	191
508 – 521	64

period categorical label

This column holds named geological time intervals: international stages (e.g., 'Kimmeridgian', 'Aptian', 'Hettangian') alongside North American Land Mammal Ages (NALMAs) such as 'Irvingtonian', 'Torrejonian', 'Tiffanian', and 'Puercan'. Despite the column name, most of these are finer than geological periods. With 298 unique values across 22,043 rows and zero nulls, the distribution is fairly dispersed: the top value 'Irvingtonian' accounts for only ~7.8% of rows and the entropy ratio is 0.78. Because multiple classification schemes coexist in this column, grouping or ordering the values requires a lookup table.

Treatment: Map to a standardized chronostratigraphic hierarchy or numeric age range using a geological timescale reference before modelling or time-series analysis.

anthropic:default · confidence high

Out[31]:

saturn.columns["period"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	298
top_value	Irvingtonian
top_rate	0.07817
cardinality	298
entropy	6.422
entropy_ratio	0.7813

Fig 14.

Top values for period.

Top values for period (20 unique shown, of 298 total).
value	count	share
Irvingtonian	1723	7.8%
Late Campanian	1088	4.9%
Torrejonian	935	4.2%
Tiffanian	923	4.2%
Puercan	778	3.5%
Kimmeridgian	636	2.9%
Hettangian	607	2.8%
Aptian	600	2.7%
Harrisonian	592	2.7%
Late Maastrichtian	544	2.5%
Norian	516	2.3%
Lochkovian	460	2.1%
Early Barremian	449	2.0%
Hemingfordian	441	2.0%
Tithonian	408	1.9%
Middle Campanian	359	1.6%
Early Famennian	346	1.6%
Early Albian	327	1.5%
Lancian	320	1.5%
Maastrichtian	314	1.4%

late_interval categorical feature

This column encodes the geological time interval at the late (upper) bound of a fossil or stratigraphic occurrence, drawn mostly from chronostratigraphic stage names (e.g., 'Tithonian', 'Sinemurian', 'Albian'). The most striking signal is that the top value is an empty string, carried by 83.1% of the 22,043 rows (18,319), so the field is heavily sparse despite a zero null rate. If this export follows the Paleobiology Database convention, where the late interval is filled only when it differs from the early interval, an empty value means the occurrence sits within a single interval rather than that the bound is unknown; that reading is an assumption worth confirming against the source. The remaining 137 distinct values each have modest frequency (Tithonian and Sinemurian together account for ~4.4% of all rows), and those listed are mostly Mesozoic stages.

Treatment: Decide first whether empty strings mean "missing" or "same as the early interval"; then encode valid stage names ordinally by geological age order or map them to numeric Ma midpoints before modelling.

anthropic:default · confidence high

Out[34]:

saturn.columns["late_interval"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	138
top_value
top_rate	0.8311
cardinality	138
entropy	1.591
entropy_ratio	0.2239
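
If an empty late_interval means "same interval as the early bound", a fallback is more faithful than treating it as missing. The sketch below rests on two assumptions that the stats cannot confirm: that the export follows that convention, and that the `period` column holds the early interval. The function name is hypothetical and the rows are made up.

```python
def effective_late_interval(row):
    """Use late_interval when present, else fall back to the interval in `period`."""
    late = (row.get("late_interval") or "").strip()
    if late:
        return late
    # Assumption: an empty late_interval means the range is a single interval.
    return (row.get("period") or "").strip() or None


# Hypothetical rows shaped like this dataset's columns.
rows = [
    {"period": "Kimmeridgian", "late_interval": "Tithonian"},
    {"period": "Irvingtonian", "late_interval": ""},
]
print([effective_late_interval(r) for r in rows])
```

If the assumption turns out to be wrong, the same function with the fallback removed reduces to plain empty-string-to-None normalization.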

Fig 15.

Top values for late_interval.

Top values for late_interval (20 unique shown, of 138 total).
value	count	share
	18319	83.1%
Tithonian	548	2.5%
Sinemurian	430	2.0%
Late Campanian	183	0.8%
Early Cenomanian	132	0.6%
Albian	129	0.6%
Rhaetian	119	0.5%
Early Maastrichtian	111	0.5%
Early Tithonian	102	0.5%
Late Turonian	92	0.4%
Maastrichtian	72	0.3%
Harnagian	62	0.3%
Santonian	61	0.3%
Early Aptian	57	0.3%
Tiffanian	57	0.3%
Early Albian	56	0.3%
Barremian	54	0.2%
Pliensbachian	50	0.2%
Toarcian	50	0.2%
Cenomanian	45	0.2%

phylum categorical label

This column contains biological phylum classifications with 4 distinct values across 22,043 records: three phyla plus an empty string, with no nulls. It is heavily dominated by 'Chordata' at 81.6% (17,993 rows), while 'Mollusca' and 'Arthropoda' each account for exactly 2,000 records, a suspiciously round number suggesting deliberate sampling, stratification, or a per-query row cap (the class table shows the same 2,000 for Cephalopoda and Trilobita). The 50 records with an empty string act as de-facto nulls. Their count matches the 50 rows whose rank is 'kingdom', which would have no phylum to report, though the stats alone cannot confirm these are the same rows.

Treatment: One-hot encode (the values are nominal, so ordinal codes would impose a spurious order); replace the 50 empty-string records with NaN and either exclude them or keep them as an explicit unknown level before modelling.

anthropic:default · confidence high

Out[37]:

saturn.columns["phylum"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	4
top_value	Chordata
top_rate	0.8163
cardinality	4
entropy	0.8873
entropy_ratio	0.4436

Fig 16.

Top values for phylum.

Top values for phylum (4 unique shown, of 4 total).
value	count	share
Chordata	17993	81.6%
Mollusca	2000	9.1%
Arthropoda	2000	9.1%
	50	0.2%

class categorical label

This column contains the biological/taxonomic class of fossil or paleontological specimens, with 19 distinct values across 22,043 records and no nulls. The top value 'Mammalia' accounts for 31.8% of records, followed by dinosaur orders 'Saurischia' and 'Ornithischia' — notably these are clades, not true classes, mixed in with proper classes like Mammalia and Reptilia, suggesting taxonomic rank inconsistency. Two sentinel-style values ('NO_CLASS_SPECIFIED' with 26 occurrences and 60 empty-string entries) represent ~0.4% of records and should be treated as missing. Entropy ratio of 0.61 indicates moderate concentration rather than a uniform spread.

Treatment: Unify empty strings and 'NO_CLASS_SPECIFIED' into a single missing category, then encode as a categorical feature or target depending on modelling objective.

anthropic:default · confidence high

Out[40]:

saturn.columns["class"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	19
top_value	Mammalia
top_rate	0.3182
cardinality	19
entropy	2.579
entropy_ratio	0.6071
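
The unification step in the treatment can be sketched as a small normalizer. The regular expression generalizes from the placeholder spellings visible in this report (NO_CLASS_SPECIFIED, NO_ORDER_SPECIFIED, NO_FAMILY_SPECIFIED); whether other spellings occur in the raw rows is not known, and the function name is illustrative.

```python
import re

# Placeholder pattern seen in this report: NO_CLASS_SPECIFIED, NO_ORDER_SPECIFIED, ...
_SENTINEL = re.compile(r"NO_[A-Z]+_SPECIFIED")


def normalize_missing(value):
    """Return None for empty strings and NO_*_SPECIFIED placeholders."""
    if value is None:
        return None
    text = value.strip()
    if not text or _SENTINEL.fullmatch(text):
        return None
    return text


# Toy values, not rows from the dataset.
sample = ["Mammalia", "", "NO_CLASS_SPECIFIED", "Aves", "  "]
print([normalize_missing(v) for v in sample])
```

Running this before profiling would let the null rate reflect real missingness instead of reporting 0.0% for columns that are partly blank.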

Fig 17.

Top values for class.

Top values for class (19 unique shown, of 19 total).
value	count	share
Mammalia	7015	31.8%
Saurischia	5507	25.0%
Ornithischia	2811	12.8%
Cephalopoda	2000	9.1%
Trilobita	2000	9.1%
Conodonta	1883	8.5%
Reptilia	568	2.6%
Aves	92	0.4%
	60	0.3%
NO_CLASS_SPECIFIED	26	0.1%
Pteraspidomorpha	24	0.1%
Placodermi	17	0.1%
Acanthodii	15	0.1%
Osteichthyes	11	0.0%
Thelodonti	4	0.0%
Osteostraci	4	0.0%
Chondrichthyes	4	0.0%
Actinopterygii	1	0.0%
Galeaspidomorphi	1	0.0%

order categorical label

This column contains biological taxonomic order classifications (e.g., Rodentia, Carnivora, Artiodactyla), likely from a paleontological or natural history specimen dataset. Two sentinel/missing-value patterns dominate: 'NO_ORDER_SPECIFIED' accounts for 32.3% of rows (7,117) and an empty string accounts for a further 3,019 rows (~13.7%), meaning roughly 46% of records lack a meaningful order assignment. Despite 99 unique values and moderate entropy (3.89), the effective signal is skewed toward these two non-informative categories.

Treatment: Consolidate 'NO_ORDER_SPECIFIED' and empty string into a single null/unknown category, then encode the remaining values as a nominal categorical feature (e.g., target encoding, or one-hot encoding of the most frequent orders) before modelling.

anthropic:default · confidence high

Out[43]:

saturn.columns["order"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	99
top_value	NO_ORDER_SPECIFIED
top_rate	0.3229
cardinality	99
entropy	3.887
entropy_ratio	0.5863
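
Because the null rate is 0.0% for every categorical column here, the real missingness has to be read off the placeholder counts. A worked sketch using counts copied from this report's tables (the dictionary layout is just one convenient way to hold them):

```python
N_ROWS = 22_043

# (empty-string rows, NO_*_SPECIFIED rows), read off the top-values tables and
# the n_empty stat. A 0 means no such placeholder appears among the listed values.
placeholders = {
    "late_interval": (18_319, 0),
    "order": (3_019, 7_117),
    "genus": (5_545, 0),
    "family": (3_418, 1_996),
    "state": (1_082, 0),
    "class": (60, 26),
    "phylum": (50, 0),
}

effective_missing = {
    col: (empty + sentinel) / N_ROWS
    for col, (empty, sentinel) in placeholders.items()
}
for col, share in sorted(effective_missing.items(), key=lambda kv: -kv[1]):
    print(f"{col:<14}{share:6.1%}")
```

For columns whose tables list only the top 20 values, a placeholder could still sit further down the list, so these shares are lower bounds.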

Fig 18.

Top values for order.

Top values for order (20 unique shown, of 99 total).
value	count	share
NO_ORDER_SPECIFIED	7117	32.3%
	3019	13.7%
Ammonitida	1572	7.1%
Ozarkodinida	1341	6.1%
Rodentia	1109	5.0%
Artiodactyla	951	4.3%
Carnivora	744	3.4%
Multituberculata	553	2.5%
Perissodactyla	517	2.3%
Phacopida	507	2.3%
Procreodi	503	2.3%
Prioniodontida	348	1.6%
Primates	315	1.4%
Asaphida	304	1.4%
Corynexochida	252	1.1%
Ammonoidea	246	1.1%
Ptychopariida	238	1.1%
Proetida	219	1.0%
Cimolesta	218	1.0%
Lagomorpha	187	0.8%

family categorical label

This column contains biological family-level taxonomic classifications for fossil or specimen records, with 528 distinct family names across 22,043 rows. The most surprising signal is that the top value is an empty string (3,418 occurrences, 15.5% of rows), and the second most frequent value is the sentinel 'NO_FAMILY_SPECIFIED' (1,996 occurrences) — together these two non-informative values account for roughly 24.6% of the dataset, indicating pervasive missing family assignments. Substantive values include well-known paleontological families such as Hadrosauridae, Grallatoridae, and Palmatolepidae, confirming a paleobiology or fossil-occurrence context.

Treatment: Unify missing indicators by mapping empty string and 'NO_FAMILY_SPECIFIED' to a single null/unknown category, then encode as a categorical feature or use as a grouping label.

anthropic:default · confidence high

Out[46]:

saturn.columns["family"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	528
top_value
top_rate	0.1551
cardinality	528
entropy	6.566
entropy_ratio	0.726

Fig 19.

Top values for family.

Top values for family (20 unique shown, of 528 total).
value	count	share
	3418	15.5%
NO_FAMILY_SPECIFIED	1996	9.1%
Hadrosauridae	689	3.1%
Grallatoridae	593	2.7%
Palmatolepidae	586	2.7%
Arctocyonidae	503	2.3%
Polygnathidae	459	2.1%
Cricetidae	407	1.8%
Equidae	360	1.6%
Canidae	358	1.6%
Ceratopsidae	336	1.5%
Dromaeosauridae	335	1.5%
Icriodontidae	272	1.2%
Periptychidae	249	1.1%
Neoplagiaulacidae	234	1.1%
Merycoidodontidae	231	1.0%
Camelidae	216	1.0%
Tyrannosauridae	184	0.8%
Diplodocidae	181	0.8%
Asaphidae	172	0.8%

genus text label

This column contains biological genus names from a paleontological or natural history dataset, as evidenced by taxa such as Palmatolepis (conodonts), Grallator/Eubrontes (dinosaur tracks), Baculites (ammonites), and Equus (horses). The duplicate rate is extremely high at 88.2% (19,435 of 22,043 rows repeat a value already seen in another row), which is expected for a categorical taxonomic label with only 2,608 unique genera. Notably, 5,545 rows (25.2% of the dataset) have an empty string rather than a null, a data quality issue that should be treated as missing; these empties presumably count toward the duplicate total as well. The vocabulary of 2,525 tokens is close to the 2,608 unique values and 98.9% of rows are a single word, so most entries are single-word Latin genus names, with a small share of multi-word exceptions (len_max is 33 and word_mean is 1.011).

Treatment: Replace empty strings with NaN, then encode as a categorical feature (e.g. label or target encode) or use as a grouping key for taxonomic analysis.

anthropic:default · confidence high

Out[49]:

saturn.columns["genus"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	2,608
len_min	0
len_max	33
len_mean	8.188
len_median	10
len_p95	15
word_mean	1.011
word_median	1
n_empty	5,545
n_duplicates	19,435
duplicate_rate	0.8817
vocab_size	2,525
readability_flesch_mean	-4.827
emoji_rate	0
url_rate	0
one_word_rate	0.9894
allcaps_rate	0
boilerplate_rate	0
alert: one_word	98.9% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	88.2% duplicate strings

Fig 20.

Character-length distribution for genus.

Character-length distribution for genus (mean: 8.188449848024316).
chars	count
0 – 1	5545
1 – 2	0
2 – 2	0
2 – 3	13
3 – 4	31
4 – 5	0
5 – 6	372
6 – 7	188
7 – 7	704
7 – 8	1676
8 – 9	2307
9 – 10	0
10 – 11	2405
11 – 12	2504
12 – 12	2230
12 – 13	1696
13 – 14	1017
14 – 15	0
15 – 16	522
16 – 16	324
16 – 17	191
17 – 18	50
18 – 19	0
19 – 20	51
20 – 21	21
21 – 21	19
21 – 22	36
22 – 23	20
23 – 24	0
24 – 25	5
25 – 26	7
26 – 26	3
26 – 27	4
27 – 28	40
28 – 29	0
29 – 30	57
30 – 31	4
31 – 31	0
31 – 32	0
32 – 33	1

country categorical feature

This column contains two-letter country codes for 93 distinct countries, with zero nulls across 22,043 rows. The codes are ISO-style but not strictly ISO 3166-1 alpha-2: the United Kingdom appears as 'UK' rather than 'GB'. The distribution is heavily US-dominated: 'US' alone accounts for 50.9% of all records (11,218 rows), while the next largest country 'CA' has only 1,830, a roughly 6× drop-off. The entropy ratio of 0.50 is consistent with this imbalance, and the 83 countries outside the top 10 each hold about 1% of rows or less, so per-country estimates for the tail will rest on few records.

Treatment: One-hot encode the top countries and bin the long tail into an 'Other' category, or target-encode, given the severe US skew.

anthropic:default · confidence high

Out[52]:

saturn.columns["country"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	93
top_value	US
top_rate	0.5089
cardinality	93
entropy	3.293
entropy_ratio	0.5035
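
Binning the long tail into an 'Other' level can be sketched in a few lines. The helper name and the cutoff are illustrative choices, and the labels are toy values rather than rows from the dataset:

```python
from collections import Counter


def lump_rare(values, top_n, other="Other"):
    """Keep the top_n most frequent labels and replace the rest with `other`."""
    keep = {label for label, _ in Counter(values).most_common(top_n)}
    return [v if v in keep else other for v in values]


# Toy labels, not rows from the dataset.
codes = ["US", "US", "US", "CA", "CA", "CN", "UK", "ES"]
print(lump_rare(codes, top_n=2))
```

On the real column, a cutoff of 5 would keep US, CA, CN, UK, and ES and fold about a quarter of the rows into 'Other', based on the counts in the top-values table.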

Fig 21.

Top values for country.

Top values for country (20 unique shown, of 93 total).
value	count	share
US	11218	50.9%
CA	1830	8.3%
CN	1661	7.5%
UK	983	4.5%
ES	841	3.8%
FR	390	1.8%
MA	303	1.4%
AR	292	1.3%
CZ	288	1.3%
AU	247	1.1%
TZ	218	1.0%
UZ	184	0.8%
MX	175	0.8%
KR	175	0.8%
SE	170	0.8%
CH	166	0.8%
MN	162	0.7%
ZA	159	0.7%
RU	156	0.7%
DE	152	0.7%

state categorical feature

This column holds a first-level administrative division, and its 519 unique values far exceed the 50 US states: it mixes US states, Canadian provinces and territories (Alberta, Northwest Territories), UK constituent countries (England), Chinese regions (Guangxi), and Spanish regions (Murcia), indicating international scope. The top value is 'Wyoming' at 8.6% of rows, followed by Montana at 6.3%; this concentration in the western US more plausibly reflects collecting effort in fossil-rich areas than anything about the states' size or population. Notably, 1,082 rows (roughly 4.9%) contain an empty string rather than a null, meaning the null rate of 0.0% understates true missingness. The entropy ratio of 0.70 indicates a fairly broad spread across categories.

Treatment: Standardize empty strings to null, map to ISO 3166-2 subdivision codes where possible, and consider grouping by country before encoding.

anthropic:default · confidence high

Out[55]:

saturn.columns["state"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	519
top_value	Wyoming
top_rate	0.08633
cardinality	519
entropy	6.288
entropy_ratio	0.6971

Fig 22.

Top values for state.

Top values for state (20 unique shown, of 519 total).
value	count	share
Wyoming	1903	8.6%
Montana	1394	6.3%
	1082	4.9%
New Mexico	1048	4.8%
Alberta	1009	4.6%
Nebraska	950	4.3%
England	907	4.1%
Guangxi	861	3.9%
California	837	3.8%
Colorado	540	2.4%
Texas	530	2.4%
Utah	489	2.2%
Nevada	361	1.6%
Murcia	333	1.5%
North Dakota	325	1.5%
South Dakota	316	1.4%
Massachusetts	278	1.3%
Kansas	273	1.2%
Northwest Territories	246	1.1%
Arizona	226	1.0%

formation categorical other

This column, 'formation', is a categorical field that is entirely empty: all 22,043 rows contain a blank string, giving it a cardinality of 1 and an entropy of 0. The null rate is 0% only because the field was populated with empty strings rather than true nulls. It carries no information in this export and should be dropped.

Treatment: Drop entirely; column contains only empty strings across all 22,043 rows and provides no analytical value.

anthropic:default · confidence high

Out[58]:

saturn.columns["formation"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 23.

Top values for formation.

Top values for formation (1 unique shown, of 1 total).
value	count	share
(empty string)	22043	100.0%

collection categorical other

This column, 'collection', is a categorical field that contains only empty strings across all 22,043 rows: it has a cardinality of 1, a top_value of '' and a top_rate of 1.0. The null count is zero because the field was populated with blank strings rather than left absent. The column carries no information (entropy = 0.0) and cannot support any analysis in its current state.

Treatment: Drop this column; it is a constant empty string with no analytical value.
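
Constant columns such as collection and formation can be detected and dropped in one pass. The following is a minimal sketch over a list of dict records; `constant_columns` and `drop_columns` are hypothetical helpers, and the records are made up for illustration.

```python
from typing import Any, Dict, Iterable, List


def constant_columns(rows: List[Dict[str, Any]]) -> List[str]:
    """Return the columns whose value is identical in every row."""
    if not rows:
        return []
    first = rows[0]
    return [col for col in first if all(r.get(col) == first[col] for r in rows)]


def drop_columns(rows: List[Dict[str, Any]], cols: Iterable[str]) -> List[Dict[str, Any]]:
    """Return copies of the rows without the named columns."""
    drop = set(cols)
    return [{k: v for k, v in r.items() if k not in drop} for r in rows]


# Illustrative records: field names follow the schema, values are made up.
records = [
    {"state": "Wyoming", "formation": "", "collection": ""},
    {"state": "Montana", "formation": "", "collection": ""},
]
to_drop = constant_columns(records)
slim = drop_columns(records, to_drop)
print(to_drop)
print(slim)
```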

anthropic:default · confidence high

Out[61]:

saturn.columns["collection"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 24.

Top values for collection.

Top values for collection (1 unique shown, of 1 total).
value	count	share
(empty string)	22043	100.0%

paleolat numeric feature

This column represents paleolatitude: the reconstructed latitude of a fossil locality at the time of deposition, ranging from -86.16° (near the South Pole) to 89.2° (near the North Pole). The distribution is left-skewed (skew = -1.08), with a mean of 26.46° well below the median of 34.89°, so the long tail extends toward equatorial and southern values. Notably, 6.97% of non-null records (1,503) are flagged as outliers. Assuming the standard 1.5 × IQR rule named in the alert, q1 = 16.34 and iqr = 30.64 put the lower fence at about -29.62° and the upper fence at 92.94°, which is beyond the valid range, so every flagged record is a southern-hemisphere locality below roughly -30° and not specifically a polar one. These are plausible paleolatitudes and more likely reflect sparse southern sampling than data errors. The 3,214 unique values across 22,043 records suggest that many occurrences share a locality or that coordinates are rounded.

Treatment: Retain as-is for spatial or paleogeographic modelling; consider binning into latitudinal zones (e.g., polar, temperate, tropical) if used as a categorical feature.
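
The binning suggested in the treatment could look like the sketch below. The cut-offs are assumptions, taken from the present-day tropics (about 23.44°) and polar circles (about 66.56°); whether those boundaries suit deep-time climate zones is a modelling choice. `lat_zone` is a hypothetical helper.

```python
from typing import Optional

TROPIC_DEG = 23.44  # assumed cut-off: present-day Tropics of Cancer/Capricorn
POLAR_DEG = 66.56   # assumed cut-off: present-day polar circles


def lat_zone(lat: Optional[float]) -> Optional[str]:
    """Classify a latitude in degrees by absolute value; None stays None."""
    if lat is None:
        return None
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    a = abs(lat)
    if a < TROPIC_DEG:
        return "tropical"
    if a < POLAR_DEG:
        return "temperate"
    return "polar"


# min, median, and max from the stats table, plus a missing value
zones = [lat_zone(x) for x in (-86.16, 34.89, 89.2, None)]
print(zones)
```

Keeping the sign (for example "southern temperate") would preserve the hemisphere imbalance visible in the histogram.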

anthropic:default · confidence high

Out[64]:

saturn.columns["paleolat"].stats

stat	value
n	22,043
nulls	491 (2.2%)
unique	3,214
min	-86.16
max	89.2
mean	26.46
median	34.89
std	29.54
q1	16.34
q3	46.98
iqr	30.64
skew	-1.08
kurtosis	0.2646
n_outliers	1,503
outlier_rate	0.06974
zero_rate	0
alert: outliers	7.0% rows beyond 1.5 IQR
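
The alert refers to rows beyond 1.5 IQR. Assuming that means the standard Tukey rule (an assumption about saturn's method), the fences can be recomputed from the quartiles in the table; `tukey_fences` is a hypothetical helper.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 as reported for paleolat
lower, upper = tukey_fences(16.34, 46.98)
print(round(lower, 2), round(upper, 2))

# The reported outlier_rate matches 1,503 over the non-null rows (22,043 - 491).
rate = 1503 / (22043 - 491)
print(round(rate, 5))
```

Because the upper fence falls above 90°, no valid latitude can exceed it, so under this rule all flagged rows sit below the lower fence.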

Fig 25.

Distribution of paleolat. Vertical dash marks the median.

Histogram bins for paleolat (median: 34.89).
bin	count
-86.16 – -81.78	13
-81.78 – -77.39	17
-77.39 – -73.01	17
-73.01 – -68.62	11
-68.62 – -64.24	12
-64.24 – -59.86	19
-59.86 – -55.47	13
-55.47 – -51.09	87
-51.09 – -46.7	82
-46.7 – -42.32	185
-42.32 – -37.94	454
-37.94 – -33.55	285
-33.55 – -29.17	350
-29.17 – -24.78	508
-24.78 – -20.4	462
-20.4 – -16.02	669
-16.02 – -11.63	554
-11.63 – -7.248	173
-7.248 – -2.864	144
-2.864 – 1.52	184
1.52 – 5.904	345
5.904 – 10.29	212
10.29 – 14.67	504
14.67 – 19.06	275
19.06 – 23.44	888
23.44 – 27.82	1581
27.82 – 32.21	1491
32.21 – 36.59	2167
36.59 – 40.98	1711
40.98 – 45.36	787
45.36 – 49.74	2851
49.74 – 54.13	1762
54.13 – 58.51	1421
58.51 – 62.9	1132
62.9 – 67.28	152
67.28 – 71.66	8
71.66 – 76.05	4
76.05 – 80.43	6
80.43 – 84.82	3
84.82 – 89.2	13

paleolng numeric feature

This column represents paleogeographic longitude: the reconstructed east-west position of a fossil locality at the time of deposition, on a scale from −180° to +180°. The values span −177.6 to 168.7, consistent with valid global longitude, and the 3,715 unique values across 22,043 records suggest discretized but reasonably fine-grained spatial resolution. The distribution is moderately right-skewed (skew 0.737) with a wide IQR of 97.35°: the median of −62.15 is well below the mean of −28.58, and the histogram is multimodal, with its largest peak between about −82° and −65°. The null rate of 2.23% is minor but worth flagging for paleogeographic reconstructions where completeness matters.

Treatment: Use as-is for spatial modelling; consider sin/cos encoding if used in circular/angular context to handle the −180°/+180° boundary.
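
The sin/cos encoding mentioned in the treatment can be sketched as follows. It maps each longitude to a point on the unit circle so that −179° and +179° end up close together. `encode_longitude` is a hypothetical helper, and the three test longitudes are illustrative.

```python
import math
from typing import Tuple


def encode_longitude(lon_deg: float) -> Tuple[float, float]:
    """Map a longitude in degrees to (sin, cos) on the unit circle."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)


west = encode_longitude(-179.0)
east = encode_longitude(179.0)
origin = encode_longitude(0.0)

# Raw values differ by 358 degrees, but the encoded points are neighbours.
near = math.dist(west, east)
far = math.dist(west, origin)
print(near < far)
```

The encoding costs one extra feature per longitude and discards nothing, since `math.atan2(sin, cos)` recovers the original angle.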

anthropic:default · confidence high

Out[67]:

saturn.columns["paleolng"].stats

stat	value
n	22,043
nulls	491 (2.2%)
unique	3,715
min	-177.6
max	168.7
mean	-28.58
median	-62.15
std	68.91
q1	-77.16
q3	20.19
iqr	97.35
skew	0.7372
kurtosis	-0.4811
n_outliers	3
outlier_rate	0.0001392
zero_rate	0

Fig 26.

Distribution of paleolng. Vertical dash marks the median.

Histogram bins for paleolng (median: -62.15).
bin	count
-177.6 – -168.9	3
-168.9 – -160.3	12
-160.3 – -151.6	4
-151.6 – -143	31
-143 – -134.3	31
-134.3 – -125.7	121
-125.7 – -117	345
-117 – -108.3	968
-108.3 – -99.68	726
-99.68 – -91.03	1791
-91.03 – -82.37	244
-82.37 – -73.71	2539
-73.71 – -65.05	3233
-65.05 – -56.4	949
-56.4 – -47.74	404
-47.74 – -39.08	1084
-39.08 – -30.42	478
-30.42 – -21.77	154
-21.77 – -13.11	57
-13.11 – -4.45	626
-4.45 – 4.207	210
4.207 – 12.86	1004
12.86 – 21.52	1648
21.52 – 30.18	1074
30.18 – 38.84	534
38.84 – 47.49	129
47.49 – 56.15	52
56.15 – 64.81	82
64.81 – 73.47	284
73.47 – 82.12	257
82.12 – 90.78	663
90.78 – 99.44	486
99.44 – 108.1	156
108.1 – 116.8	315
116.8 – 125.4	434
125.4 – 134.1	182
134.1 – 142.7	192
142.7 – 151.4	40
151.4 – 160	7
160 – 168.7	3

reference_no text foreign_key

This column is a reference number: a short numeric code (mean length 4.17 characters, max 5) that, in a PBDB export, most plausibly identifies the bibliographic reference from which each occurrence was entered. The duplicate rate is high at 83.1%, with only 3,725 unique values across 22,043 rows, so the same reference numbers recur many times (e.g., '4245' appears 794 times, '6294' 743 times), as expected when one source reports many occurrences. The allcaps alert (89.8%) is likely a false positive triggered by numeric-only strings. This column is a grouping or foreign-key identifier and not a row-level unique reference.

Treatment: Use as a grouping key or left-join foreign key; do not treat as a unique row identifier.
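
Whether a column can serve as a row identifier or only as a grouping key can be checked with a simple count. This is a minimal sketch in which `key_profile` is a hypothetical helper and the sample keys are illustrative; the final line recomputes the reported duplicate_rate from n and unique.

```python
from collections import Counter
from typing import Dict, List, Union


def key_profile(keys: List[str]) -> Dict[str, Union[int, float, bool]]:
    """Summarize how a candidate key repeats across rows."""
    counts = Counter(keys)
    n, unique = len(keys), len(counts)
    return {
        "n": n,
        "unique": unique,
        "duplicate_rate": (n - unique) / n if n else 0.0,
        "is_unique_key": n == unique,
        "largest_group": max(counts.values()) if counts else 0,
    }


# Illustrative keys only, not the real column.
profile = key_profile(["4245", "4245", "6294", "17"])
print(profile)

# Reported stats: 22,043 rows and 3,725 unique values give 18,318 duplicates.
reported_rate = (22043 - 3725) / 22043
print(round(reported_rate, 3))
```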

anthropic:default · confidence high

Out[70]:

saturn.columns["reference_no"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	3,725
len_min	1
len_max	5
len_mean	4.172
len_median	4
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	18,318
duplicate_rate	0.831
vocab_size	3,547
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.8977
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	89.8% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	83.1% duplicate strings

Fig 27.

Character-length distribution for reference_no.

Character-length distribution for reference_no (mean: 4.172).
chars	count
1	23
2	2233
3	1274
4	8908
5	9605

occurrence_no text identifier

This column is a unique occurrence identifier: a numeric code stored as text, with a mean length of about 5.76 characters. The all-caps alert most likely reflects digit-only strings and not uppercase text. All 22,043 values are non-null and distinct, and every value is a single token, which confirms it as a primary-key-style field. Lengths range from 1 to 7 characters, and the values appear to be plain integers (e.g., '164260', '1439335') with no structured prefix scheme.

Treatment: Retain as a row identifier for joins or traceability; drop from any model feature matrix.

anthropic:default · confidence high

Out[73]:

saturn.columns["occurrence_no"].stats

stat	value
n	22,043
nulls	0 (0.0%)
unique	22,043
len_min	1
len_max	7
len_mean	5.762
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.999
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 28.

Character-length distribution for occurrence_no.

Character-length distribution for occurrence_no (mean: 5.762).
chars	count
1	8
2	15
3	26
4	1359
5	3185
6	16620
7	830

How to cite