saturn

/home/coolhand/html/datavis/data_trove/data/quirky/fossils.json 22,043 rows sample n=22,043 seed 42 2026-05-01T23:08:24+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/fossils.json
Total rows: 22,043
Profiled sample: 22,043
Columns: 21
Generated: 2026-05-01T23:08:24+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset contains 22,043 fossil occurrence records with 21 columns spanning taxonomy (phylum, class, order, family, genus, name, rank), geography (country, state, lat/lon, paleolat/paleolng), and geologic age (early_age_mya, late_age_mya, period, late_interval). Taxonomy is dominated by Chordata (about 82% of rows) with Mammalia as the leading class (~32%) followed by Saurischia and Ornithischia, suggesting a strong vertebrate and dinosaur emphasis worth examining first. Geographically the data skews heavily to the US (~51%), with Wyoming, Montana, and New Mexico topping the state list, so any spatial analysis should account for this North American concentration. Age columns (early_age_mya, late_age_mya) are right-skewed with medians around 100 Mya and ~11% flagged as outliers, hinting at a long tail of very old records. Note that 'collection' and 'formation' are entirely empty and should be ignored.

name high anthropic:claude-opus-4-7

This column holds taxonomic names of fossil organisms, dominated by dinosaur and conodont genera/clades like Theropoda (768), Dinosauria (512), and Palmatolepis (376). Values are short—mean length 15 characters and 58% are single words—so the Flesch readability score of -4.13 is a meaningless artifact of scientific Latin. With 4,660 uniques across 22,043 rows and a 78.9% duplicate rate, this behaves as a categorical taxon label rather than a free-text field.

rank high anthropic:claude-opus-4-7

This column records the taxonomic rank of each record, with 18 distinct values and no nulls across 22,043 rows. 'Species' dominates at 41.2% (9,082 rows), followed by 'genus' (7,342) and 'unranked clade' (2,828); ranks below family drop off sharply, with 'subfamily' already at only 272. The presence of 'unranked clade' alongside formal Linnaean ranks is worth noting as a non-standard category.

lat high anthropic:claude-opus-4-7

This is a latitude coordinate in degrees, ranging from -84.33 to 79.75 with a median of 41.70 and IQR of 11.61. The strong negative skew (-2.44) and kurtosis (7.05) suggest most points cluster in the northern hemisphere with a long tail of southern-hemisphere outliers (9.16% flagged). No nulls and 4,095 distinct values across 22,043 rows indicate repeated locations rather than per-row unique geocoding.

lon high anthropic:claude-opus-4-7

This column captures longitude coordinates, with values spanning -176.67 to 177.07, consistent with the full global range. The distribution is right-skewed (skew 0.93) and centered on a median of -98.25, suggesting a heavy concentration of records in the Western Hemisphere (likely the Americas). Only 4,259 unique values across 22,043 rows indicate repeated location points, and just 3 outliers were flagged.

early_age_mya high anthropic:claude-opus-4-7

Numeric column representing the early (older) bound of an age estimate in millions of years (Mya), spanning 0.0117 to 538.8 across 22,043 rows with no nulls. The distribution is right-skewed (skew 1.13) with median 110.1 well below mean 154.67, and saturn flags 2,549 outlier rows (11.6%), consistent with a long Paleozoic tail. If saturn applies the usual 1.5 × IQR fences to these quartiles, the upper fence is 201.4 + 1.5 × 138.0 = 408.4 Mya, so the flagged rows would be older than that. Only 164 unique values suggest ages are bucketed to standard stratigraphic boundaries rather than continuous measurements.

late_age_mya high anthropic:claude-opus-4-7

Numeric column representing the younger bound of a geological age in millions of years (Mya), with 156 distinct values across 22,043 rows and no nulls. Distribution is right-skewed (skew 1.17) with median 93.9 well below the mean of 147.5, ranging from 0 to 521 Mya, and 11.5% of values (2,535) flagged as outliers on the high end. The bounded discrete value set suggests entries snap to standardized stratigraphic stage boundaries rather than continuous measurements.

period high anthropic:claude-opus-4-7

This column records geologic time periods or stages, with 298 distinct values across 22,043 rows and no nulls. The distribution is moderately spread (entropy ratio 0.78), but 'Irvingtonian' dominates at 7.8% (1,723 rows), followed by 'Late Campanian' and various Paleocene/Mesozoic stages. The vocabulary mixes broad and finely-subdivided stage names (e.g., 'Late Maastrichtian' vs. 'Aptian'), so granularity is uneven.

late_interval high anthropic:claude-opus-4-7

Categorical geological stage marking the late end of a fossil/sample's age range, with 138 distinct values: 137 stage names such as Tithonian, Sinemurian, and Late Campanian, plus the empty string. The column is dominated by empty strings (83.1% of 22,043 rows), leaving only 3,724 records with an actual stage; the entropy ratio of 0.22 reflects this sparsity. Among populated values, Tithonian (548) and Sinemurian (430) lead, suggesting Mesozoic specimens are over-represented.

phylum high anthropic:claude-opus-4-7

Taxonomic phylum label with only 4 distinct values across 22,043 rows, one of which is the empty string. Chordata dominates at 81.6% (17,993 rows), with Mollusca and Arthropoda each at exactly 2,000, suggesting deliberate sampling caps on the non-Chordata phyla. The 50 rows with an empty-string phylum are not counted as null.

class high anthropic:claude-opus-4-7

Taxonomic class label for what appears to be a paleontological/zoological occurrence dataset, with 19 distinct values across 22,043 rows and no nulls. Mammalia dominates at 31.8% (7,015), followed by the dinosaur clades Saurischia (5,507) and Ornithischia (2,811), suggesting a fossil-heavy sample. Note two sentinel-style entries that should be cleaned: 60 empty strings and 26 'NO_CLASS_SPECIFIED' rows.

order high anthropic:claude-opus-4-7

Taxonomic order assignments for what appears to be a paleontology/biology specimen dataset, mixing extinct groups (Ammonitida, Ozarkodinida, Multituberculata, Phacopida) with extant mammalian orders (Rodentia, Artiodactyla, Carnivora). Coverage is poor: 32.3% are the sentinel 'NO_ORDER_SPECIFIED' and another 3,019 rows are empty strings, so roughly 46% of records lack a real order. Across 22,043 rows there are 99 distinct values with entropy ratio 0.586, indicating that a few orders account for most populated rows while the rest form a long tail.
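The missing-order share quoted above can be reproduced from the top-value counts, and the same placeholders can be folded into one missing marker before analysis. This is a minimal sketch, assuming the values are available as Python strings; the helper name `normalize_taxon` is hypothetical, and the `NO_..._SPECIFIED` pattern is inferred from the sentinels seen in this report rather than from any documented schema.

```python
import re
from typing import Optional

# Sentinel pattern inferred from values such as NO_ORDER_SPECIFIED (assumption).
_SENTINEL = re.compile(r"^NO_[A-Z]+_SPECIFIED$")

def normalize_taxon(value: Optional[str]) -> Optional[str]:
    """Return None for empty strings and NO_*_SPECIFIED sentinels (hypothetical helper)."""
    if value is None:
        return None
    stripped = value.strip()
    if not stripped or _SENTINEL.match(stripped):
        return None
    return stripped

# Counts reported for the order column in this profile.
rows = 22_043
sentinel_rows = 7_117  # NO_ORDER_SPECIFIED
empty_rows = 3_019     # empty string
missing_share = (sentinel_rows + empty_rows) / rows
print(f"{missing_share:.1%}")
```

The same helper would apply to `class` and `family`, which carry the analogous sentinels.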

family high anthropic:claude-opus-4-7

Taxonomic family classification for biological specimens, with 528 distinct values across 22,043 rows and no nulls. The top value is the empty string at 15.5% (3,418 rows), and 'NO_FAMILY_SPECIFIED' adds another 1,996 rows; together, roughly a quarter of records lack a real family assignment. Among populated values, families like Hadrosauridae (689), Grallatoridae (593), and Palmatolepidae (586) dominate, suggesting a paleontological dataset.

genus high anthropic:claude-opus-4-7

This column holds taxonomic genus names for fossil records — single-word Latinate identifiers like Palmatolepis, Polygnathus, and Grallator dominate, with one_word_rate at 0.989. Of 22,043 rows, 5,545 are empty strings and duplicate_rate is 0.88 across 2,608 unique values, so the field is heavily repeated and a quarter of rows carry no genus at all. Length stats (mean 8.2, max 33) and a vocab of 2,525 are consistent with controlled scientific nomenclature rather than free text.

country high anthropic:claude-opus-4-7

Two-letter country codes in the style of ISO 3166-1 alpha-2, with 93 distinct values and no nulls in 22,043 rows. The distribution is heavily US-dominated at 50.9% (11,218 rows), with CA, CN, UK, and ES leading a long tail; the entropy ratio of 0.50 confirms the concentration. Note that 'UK' is used rather than the ISO-standard 'GB', which may complicate joins against canonical country tables.
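One way to handle the 'UK' code before joining is a small override table applied ahead of the join. The sketch below is illustrative only: `NON_ISO_FIXES` and `to_iso_alpha2` are hypothetical names, and 'UK' is the only non-standard code this report identifies, so other codes pass through unchanged.

```python
# Overrides from codes seen in this report to ISO 3166-1 alpha-2.
# Only 'UK' was flagged; extend the table if other mismatches turn up.
NON_ISO_FIXES = {"UK": "GB"}

def to_iso_alpha2(code: str) -> str:
    """Normalize a two-letter country code before joining (hypothetical helper)."""
    cleaned = code.strip().upper()
    return NON_ISO_FIXES.get(cleaned, cleaned)

codes = ["US", "CA", "UK", "ES"]
normalized = [to_iso_alpha2(c) for c in codes]
```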

state high anthropic:claude-opus-4-7

Geographic subdivision (state/province/region) with 519 distinct values spanning US states, Canadian provinces, English regions, and Chinese provinces, indicating an international dataset rather than a US-only one. Wyoming leads at 8.6% (1,903 rows), followed by Montana (1,394). The third most common value is the empty string (1,082 rows), which the null rate of 0 does not capture. Entropy ratio of 0.70 shows a fairly even spread across the long tail.

formation high anthropic:claude-opus-4-7

The `formation` column is constant: all 22043 rows hold the empty string, giving cardinality 1 and entropy 0.0. It carries no information and the top_value being "" suggests the field was never populated rather than genuinely categorical.

collection high anthropic:claude-opus-4-7

The 'collection' column contains a single value across all 22043 rows, and that value is the empty string. Cardinality is 1, entropy is 0, and null_rate is 0.0, meaning the field is technically populated but carries no information. This is likely a vestigial schema field that was never filled in.

paleolat high anthropic:claude-opus-4-7

Paleolatitude in degrees, ranging from -86.16 to 89.2 with a median of 34.89 — consistent with reconstructed latitudes of fossil or geological samples. The distribution is left-skewed (skew -1.08) and ~6.97% of values flag as outliers (1503 records), suggesting a heavy southern-hemisphere tail against a northern-hemisphere mode. Null rate is low at 2.23% and 3214 unique values across 22043 rows indicate substantial repetition, likely from shared site coordinates.

paleolng high anthropic:claude-opus-4-7

This is paleolongitude — reconstructed longitudinal coordinates of fossil/sample locations on ancient continental configurations. Values span the full longitudinal range (-177.6 to 168.7) with a mean of -28.58 and median of -62.15, indicating a leftward (western) concentration and moderate positive skew (0.74). Null rate is 2.23% and only 3 outliers are flagged, so the distribution is well-behaved within plausible geographic bounds.

reference_no high anthropic:claude-opus-4-7

Short numeric reference codes (length 1-5, mean 4.17, all single-word) stored as text. Despite the name 'reference_no', it is far from unique: 22,043 rows collapse to 3,725 distinct values with an 83.1% duplicate rate, and the top code '4245' alone appears 794 times. The 89.8% allcaps_rate is a quirk of the detector treating digit-only strings as uppercase.
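How saturn computes `allcaps_rate` is not shown in this report. One way a detector can end up counting digit-only strings as uppercase is by comparing a string with its uppercased form. The sketch below contrasts that check with Python's `str.isupper`, which requires at least one cased character. Both function names are hypothetical.

```python
def looks_allcaps_naive(s: str) -> bool:
    # Digits are unchanged by upper(), so a string like "4245" passes this test.
    return s == s.upper()

def looks_allcaps_strict(s: str) -> bool:
    # str.isupper() is False when the string contains no cased characters.
    return s.isupper()

samples = ["4245", "8880", "USA", "Theropoda"]
naive = [looks_allcaps_naive(s) for s in samples]
strict = [looks_allcaps_strict(s) for s in samples]
```

With the strict check, digit-only codes no longer register as all-caps, while genuinely uppercase text still does.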

occurrence_no high anthropic:claude-opus-4-7

This is almost certainly a primary-key style identifier: every one of the 22043 rows holds a unique single-token value (n_unique == n, one_word_rate 1.0), with no nulls or duplicates. Lengths range from 1 to 7 characters and the top words are all numeric strings like '164260' and '1439335', so the field is stored as text but contains integer occurrence numbers. The 'allcaps' alert is a quirk of the detector treating digit-only strings as uppercase and can be ignored.
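Before treating the column as an integer key, it may be worth confirming both properties described above: every value is digit-only and no value repeats. A minimal sketch, using a few of the sample values listed in this report; `is_integer_key` is a hypothetical helper, not part of saturn.

```python
def is_integer_key(values: list[str]) -> bool:
    """True when every value is a digit-only string and no value repeats."""
    return all(v.isdigit() for v in values) and len(set(values)) == len(values)

# A few sample values listed for occurrence_no in this report.
sample = ["361526", "196124", "27440", "23237", "519811"]

# Cast to int only after validation; otherwise leave the list empty.
keys = [int(v) for v in sample] if is_integer_key(sample) else []
```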

Numeric correlation

name text

58.5% rows are a single word; 78.9% duplicate strings
rows: 22,043
null: 0 (0.0%)
unique: 4,660
len_min: 3
len_max: 47
len_mean: 15.095
len_median: 14.000
len_p95: 26.000
word_mean: 1.425
word_median: 1.000
n_empty: 0
n_duplicates: 17,383
duplicate_rate: 0.789
vocab_size: 5,140
readability_flesch_mean: -4.127
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.585
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. Animalia
  2. Microtus ochrogaster
  3. Acanthohoplites
  4. Ammonoidea
  5. Theropoda
  6. Synphoroides
  7. Hadrosauropodus leonardii
  8. Moutoniceras moutonianum
  9. Hoploscaphites
  10. Palmatolepis
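
The reported `n_duplicates` and `duplicate_rate` for `name` are consistent with counting every row beyond the first occurrence of each distinct string. That reading is inferred from the numbers in this report, not from saturn's source; the arithmetic is:

```python
rows = 22_043
unique = 4_660

# Rows beyond the first occurrence of each distinct string.
n_duplicates = rows - unique
duplicate_rate = n_duplicates / rows
```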

rank categorical

rows: 22,043
null: 0 (0.0%)
unique: 18
top_value: species
top_rate: 0.412
cardinality: 18
entropy: 2.085
entropy_ratio: 0.500
Top values (rank 1–20)
  1. species — 9,082
  2. genus — 7,342
  3. unranked clade — 2,828
  4. family — 1,716
  5. subfamily — 272
  6. subclass — 205
  7. class — 134
  8. order — 115
  9. infraorder — 97
  10. superfamily — 75
  11. subgenus — 51
  12. kingdom — 50
  13. suborder — 29
  14. subspecies — 23
  15. tribe — 12
  16. subphylum — 9
  17. superorder — 2
  18. superclass — 1

lat numeric

skew=-2.44; 9.2% rows beyond 1.5 IQR
rows: 22,043
null: 0 (0.0%)
unique: 4,095
min: -84.333
max: 79.750
mean: 37.120
median: 41.700
std: 19.372
q1: 35.000
q3: 46.609
iqr: 11.609
skew: -2.442
kurtosis: 7.054
n_outliers: 2,019
outlier_rate: 0.092
zero_rate: 0.000

lon numeric

rows: 22,043
null: 0 (0.0%)
unique: 4,259
min: -176.667
max: 177.071
mean: -47.212
median: -98.250
std: 79.135
q1: -108.167
q3: 5.873
iqr: 114.040
skew: 0.928
kurtosis: -0.493
n_outliers: 3
outlier_rate: 1.36e-04
zero_rate: 2.27e-04

early_age_mya numeric

11.6% rows beyond 1.5 IQR
rows: 22,043
null: 0 (0.0%)
unique: 164
min: 0.012
max: 538.800
mean: 154.673
median: 110.100
std: 143.088
q1: 63.400
q3: 201.400
iqr: 138.000
skew: 1.131
kurtosis: 0.077
n_outliers: 2,549
outlier_rate: 0.116
zero_rate: 0.000
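
The "beyond 1.5 IQR" flag suggests the conventional Tukey fences, though the report does not spell out saturn's exact rule. Assuming that rule, the fences for `early_age_mya` follow from the quartiles above; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper outlier fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for early_age_mya.
lower, upper = tukey_fences(63.4, 201.4)
```

The lower fence is negative, below the column minimum of 0.012, so under this assumption every flagged row sits on the old side of the distribution.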

late_age_mya numeric

11.5% rows beyond 1.5 IQR
rows: 22,043
null: 0 (0.0%)
unique: 156
min: 0.000
max: 521.000
mean: 147.523
median: 93.900
std: 141.724
q1: 60.900
q3: 192.900
iqr: 132.000
skew: 1.169
kurtosis: 0.123
n_outliers: 2,535
outlier_rate: 0.115
zero_rate: 1.13e-03

period categorical

rows: 22,043
null: 0 (0.0%)
unique: 298
top_value: Irvingtonian
top_rate: 0.078
cardinality: 298
entropy: 6.422
entropy_ratio: 0.781
Top values (rank 1–20)
  1. Irvingtonian — 1,723
  2. Late Campanian — 1,088
  3. Torrejonian — 935
  4. Tiffanian — 923
  5. Puercan — 778
  6. Kimmeridgian — 636
  7. Hettangian — 607
  8. Aptian — 600
  9. Harrisonian — 592
  10. Late Maastrichtian — 544
  11. Norian — 516
  12. Lochkovian — 460
  13. Early Barremian — 449
  14. Hemingfordian — 441
  15. Tithonian — 408
  16. Middle Campanian — 359
  17. Early Famennian — 346
  18. Early Albian — 327
  19. Lancian — 320
  20. Maastrichtian — 314

late_interval categorical

rows: 22,043
null: 0 (0.0%)
unique: 138
top_value: "" (empty string)
top_rate: 0.831
cardinality: 138
entropy: 1.591
entropy_ratio: 0.224
Top values (rank 1–20)
  1. — 18,319
  2. Tithonian — 548
  3. Sinemurian — 430
  4. Late Campanian — 183
  5. Early Cenomanian — 132
  6. Albian — 129
  7. Rhaetian — 119
  8. Early Maastrichtian — 111
  9. Early Tithonian — 102
  10. Late Turonian — 92
  11. Maastrichtian — 72
  12. Harnagian — 62
  13. Santonian — 61
  14. Early Aptian — 57
  15. Tiffanian — 57
  16. Early Albian — 56
  17. Barremian — 54
  18. Pliensbachian — 50
  19. Toarcian — 50
  20. Cenomanian — 45

phylum categorical

rows: 22,043
null: 0 (0.0%)
unique: 4
top_value: Chordata
top_rate: 0.816
cardinality: 4
entropy: 0.887
entropy_ratio: 0.444
Top values (rank 1–20)
  1. Chordata — 17,993
  2. Mollusca — 2,000
  3. Arthropoda — 2,000
  4. — 50
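
The `entropy` and `entropy_ratio` values for `phylum` match Shannon entropy in bits divided by the log2 of the cardinality. That formula is inferred from the reported numbers rather than from saturn's documentation. A short check using the four counts above, with the empty string kept as its own category:

```python
import math

# Top-value counts reported for phylum; "" is the empty-string category.
counts = {"Chordata": 17_993, "Mollusca": 2_000, "Arthropoda": 2_000, "": 50}

total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

# Normalize by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(counts))
```

A ratio near 1 would mean the categories are close to evenly used; the low value here reflects Chordata's dominance.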

class categorical

rows: 22,043
null: 0 (0.0%)
unique: 19
top_value: Mammalia
top_rate: 0.318
cardinality: 19
entropy: 2.579
entropy_ratio: 0.607
Top values (rank 1–20)
  1. Mammalia — 7,015
  2. Saurischia — 5,507
  3. Ornithischia — 2,811
  4. Cephalopoda — 2,000
  5. Trilobita — 2,000
  6. Conodonta — 1,883
  7. Reptilia — 568
  8. Aves — 92
  9. — 60
  10. NO_CLASS_SPECIFIED — 26
  11. Pteraspidomorpha — 24
  12. Placodermi — 17
  13. Acanthodii — 15
  14. Osteichthyes — 11
  15. Thelodonti — 4
  16. Osteostraci — 4
  17. Chondrichthyes — 4
  18. Actinopterygii — 1
  19. Galeaspidomorphi — 1

order categorical

rows: 22,043
null: 0 (0.0%)
unique: 99
top_value: NO_ORDER_SPECIFIED
top_rate: 0.323
cardinality: 99
entropy: 3.887
entropy_ratio: 0.586
Top values (rank 1–20)
  1. NO_ORDER_SPECIFIED — 7,117
  2. — 3,019
  3. Ammonitida — 1,572
  4. Ozarkodinida — 1,341
  5. Rodentia — 1,109
  6. Artiodactyla — 951
  7. Carnivora — 744
  8. Multituberculata — 553
  9. Perissodactyla — 517
  10. Phacopida — 507
  11. Procreodi — 503
  12. Prioniodontida — 348
  13. Primates — 315
  14. Asaphida — 304
  15. Corynexochida — 252
  16. Ammonoidea — 246
  17. Ptychopariida — 238
  18. Proetida — 219
  19. Cimolesta — 218
  20. Lagomorpha — 187

family categorical

rows: 22,043
null: 0 (0.0%)
unique: 528
top_value: "" (empty string)
top_rate: 0.155
cardinality: 528
entropy: 6.566
entropy_ratio: 0.726
Top values (rank 1–20)
  1. — 3,418
  2. NO_FAMILY_SPECIFIED — 1,996
  3. Hadrosauridae — 689
  4. Grallatoridae — 593
  5. Palmatolepidae — 586
  6. Arctocyonidae — 503
  7. Polygnathidae — 459
  8. Cricetidae — 407
  9. Equidae — 360
  10. Canidae — 358
  11. Ceratopsidae — 336
  12. Dromaeosauridae — 335
  13. Icriodontidae — 272
  14. Periptychidae — 249
  15. Neoplagiaulacidae — 234
  16. Merycoidodontidae — 231
  17. Camelidae — 216
  18. Tyrannosauridae — 184
  19. Diplodocidae — 181
  20. Asaphidae — 172

genus text

98.9% rows are a single word; 95th-percentile length under 20 chars; 88.2% duplicate strings
rows: 22,043
null: 0 (0.0%)
unique: 2,608
len_min: 0
len_max: 33
len_mean: 8.188
len_median: 10.000
len_p95: 15.000
word_mean: 1.011
word_median: 1.000
n_empty: 5,545
n_duplicates: 19,435
duplicate_rate: 0.882
vocab_size: 2,525
readability_flesch_mean: -4.827
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.989
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. Microtus
  2. Acanthohoplites
  3. Synphoroides
  4. Hadrosauropodus
  5. Hemibaculites
  6. Hoploscaphites
  7. Palmatolepis

country categorical

rows: 22,043
null: 0 (0.0%)
unique: 93
top_value: US
top_rate: 0.509
cardinality: 93
entropy: 3.293
entropy_ratio: 0.504
Top values (rank 1–20)
  1. US — 11,218
  2. CA — 1,830
  3. CN — 1,661
  4. UK — 983
  5. ES — 841
  6. FR — 390
  7. MA — 303
  8. AR — 292
  9. CZ — 288
  10. AU — 247
  11. TZ — 218
  12. UZ — 184
  13. MX — 175
  14. KR — 175
  15. SE — 170
  16. CH — 166
  17. MN — 162
  18. ZA — 159
  19. RU — 156
  20. DE — 152

state categorical

rows: 22,043
null: 0 (0.0%)
unique: 519
top_value: Wyoming
top_rate: 0.086
cardinality: 519
entropy: 6.288
entropy_ratio: 0.697
Top values (rank 1–20)
  1. Wyoming — 1,903
  2. Montana — 1,394
  3. — 1,082
  4. New Mexico — 1,048
  5. Alberta — 1,009
  6. Nebraska — 950
  7. England — 907
  8. Guangxi — 861
  9. California — 837
  10. Colorado — 540
  11. Texas — 530
  12. Utah — 489
  13. Nevada — 361
  14. Murcia — 333
  15. North Dakota — 325
  16. South Dakota — 316
  17. Massachusetts — 278
  18. Kansas — 273
  19. Northwest Territories — 246
  20. Arizona — 226

formation categorical

top value is 100.0% of rows
rows: 22,043
null: 0 (0.0%)
unique: 1
top_value: "" (empty string)
top_rate: 1.000
cardinality: 1
entropy: -0.000
entropy_ratio: 0.000
Top values (rank 1–20)
  1. — 22,043

collection categorical

top value is 100.0% of rows
rows: 22,043
null: 0 (0.0%)
unique: 1
top_value: "" (empty string)
top_rate: 1.000
cardinality: 1
entropy: -0.000
entropy_ratio: 0.000
Top values (rank 1–20)
  1. — 22,043

paleolat numeric

7.0% rows beyond 1.5 IQR
rows: 22,043
null: 491 (2.2%)
unique: 3,214
min: -86.160
max: 89.200
mean: 26.460
median: 34.890
std: 29.543
q1: 16.340
q3: 46.980
iqr: 30.640
skew: -1.080
kurtosis: 0.265
n_outliers: 1,503
outlier_rate: 0.070
zero_rate: 0.000
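
For columns with nulls, the reported `outlier_rate` appears to use non-null rows as the denominator: dividing by all rows gives a visibly different figure. This is inferred from the numbers above, not from saturn's documentation.

```python
rows = 22_043
nulls = 491
n_outliers = 1_503

rate_all_rows = n_outliers / rows            # denominator includes null rows
rate_non_null = n_outliers / (rows - nulls)  # denominator excludes null rows
```

Only the non-null version rounds to the reported 0.070, which also matches the 6.97% quoted in the `paleolat` insight.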

paleolng numeric

rows: 22,043
null: 491 (2.2%)
unique: 3,715
min: -177.600
max: 168.700
mean: -28.577
median: -62.150
std: 68.911
q1: -77.160
q3: 20.190
iqr: 97.350
skew: 0.737
kurtosis: -0.481
n_outliers: 3
outlier_rate: 1.39e-04
zero_rate: 0.000

reference_no text

100.0% rows are a single word; 89.8% rows are all-caps; 95th-percentile length under 20 chars; 83.1% duplicate strings
rows: 22,043
null: 0 (0.0%)
unique: 3,725
len_min: 1
len_max: 5
len_mean: 4.172
len_median: 4.000
len_p95: 5.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 18,318
duplicate_rate: 0.831
vocab_size: 3,547
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 0.898
boilerplate_rate: 0.000
Sample values (first 10)
  1. 8880
  2. 3211
  3. 57
  4. 41
  5. 13037
  6. 36816
  7. 14666
  8. 70
  9. 45
  10. 4233

occurrence_no text

100.0% of rows are unique strings; 100.0% rows are a single word; 99.9% rows are all-caps; 95th-percentile length under 20 chars
rows: 22,043
null: 0 (0.0%)
unique: 22,043
len_min: 1
len_max: 7
len_mean: 5.762
len_median: 6.000
len_p95: 6.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 0
duplicate_rate: 0.000
vocab_size: 20,000
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 0.999
boilerplate_rate: 0.000
Sample values (first 10)
  1. 361526
  2. 196124
  3. 27440
  4. 23237
  5. 519811
  6. 10365
  7. 533398
  8. 31658
  9. 23849
  10. 142746