data-trove-deep-sea-specimens

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json

Saturn profiled 200,000 rows across 12 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

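!pip install saturn-dissect
import subprocess

# Profile the source file, write the findings JSON, and request the opt-in LLM prose.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json",
        "--findings", "data-trove-deep-sea-specimens.json",
        "--llm", "anthropic:default",
    ],
    check=True,  # raise CalledProcessError if the CLI exits non-zero
)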

Summary confidence: high

This dataset contains 200,000 deep-sea biodiversity occurrence records spanning taxonomic classification, geographic coordinates, ocean depth, and collection year. The most striking feature is the dominance of blank values across taxonomy columns — 55% of genus, 40% of family, and 73% of species entries are empty strings, suggesting many records are identified only at higher taxonomic levels. Proteobacteria, Cnidaria, and Chordata are the best-represented phyla, while Australia accounts for the vast majority of records with a named country (~79k of ~96k non-blank entries). Depth ranges from 1,000 to 11,000 metres with a mean around 2,400 m, and the year column is heavily left-skewed with over 12,000 outlier records dating back as far as 1875, versus a median of 2016.

citing: phylum.top_values · species.stats.top_rate · genus.stats.top_rate · family.stats.top_rate · country.top_values · depth.stats.mean · depth.stats.max · depth.stats.min · year.stats.median · year.stats.min · year.stats.n_outliers · year.alerts

Out[4]:

saturn.schema() · 12 columns

column	kind	n	null%	unique	alerts
scientificName	categorical	200,000	0.0%	1,478
species	categorical	200,000	0.0%	678
genus	categorical	200,000	0.0%	841
family	categorical	200,000	0.0%	606
order	categorical	200,000	0.0%	310
class	categorical	200,000	0.0%	138
phylum	categorical	200,000	0.0%	65
latitude	numeric	200,000	0.0%	2,617
longitude	numeric	200,000	0.0%	2,654
depth	numeric	200,000	0.0%	1,938
year	numeric	200,000	3.6%	98	high_skew outliers
country	categorical	200,000	0.0%	57

Fig 1.

phylum · Look for the dominance of Proteobacteria and which marine phyla (Cnidaria, Chordata, Echinodermata) make up the bulk of identified records.

Top values for phylum (20 unique shown, of 65 total).
value	count	share
Proteobacteria	35480	17.7%
Cnidaria	25520	12.8%
Chordata	23920	12.0%
Echinodermata	14000	7.0%
	13200	6.6%
Arthropoda	13200	6.6%
Myzozoa	11280	5.6%
Thaumarchaeota	10920	5.5%
Porifera	10360	5.2%
Foraminifera	4720	2.4%
Annelida	4240	2.1%
Radiozoa	4000	2.0%
Mollusca	3840	1.9%
Bacteroidetes	3440	1.7%
Euryarchaeota	2440	1.2%
Planctomycetes	2320	1.2%
Heterokontophyta	1680	0.8%
Verrucomicrobia	1520	0.8%
Brachiopoda	1520	0.8%
Nematoda	1160	0.6%

Fig 2.

depth · Look for the right-skewed spread of sampling depths, with most records between 1,000–3,300 m and a long tail extending to 11,000 m.

Histogram bins for depth (median: 1961.695).
bin	count
1000 – 1250	58400
1250 – 1500	15280
1500 – 1750	10920
1750 – 2000	20200
2000 – 2250	23560
2250 – 2500	5720
2500 – 2750	5320
2750 – 3000	4360
3000 – 3250	5080
3250 – 3500	3520
3500 – 3750	3520
3750 – 4000	3360
4000 – 4250	8000
4250 – 4500	4680
4500 – 4750	1880
4750 – 5000	11400
5000 – 5250	3240
5250 – 5500	3360
5500 – 5750	5520
5750 – 6000	1760
6000 – 6250	280
6250 – 6500	40
6500 – 6750	40
6750 – 7000	0
7000 – 7250	0
7250 – 7500	80
7500 – 7750	80
7750 – 8000	0
8000 – 8250	0
8250 – 8500	0
8500 – 8750	80
8750 – 9000	80
9000 – 9250	40
9250 – 9500	0
9500 – 9750	40
9750 – 10000	0
10000 – 10250	120
10250 – 10500	0
10500 – 10750	0
10750 – 11000	40

Fig 3.

year · Look for the sharp concentration of records after 2004 and the small but notable cluster of historical outlier observations dating back to 1875.

Histogram bins for year (median: 2016.0).
bin	count
1875 – 1879	40
1879 – 1882	120
1882 – 1886	320
1886 – 1890	80
1890 – 1894	120
1894 – 1897	0
1897 – 1901	0
1901 – 1905	80
1905 – 1909	40
1909 – 1912	120
1912 – 1916	0
1916 – 1920	0
1920 – 1923	0
1923 – 1927	160
1927 – 1931	400
1931 – 1935	240
1935 – 1938	80
1938 – 1942	0
1942 – 1946	0
1946 – 1950	40
1950 – 1953	80
1953 – 1957	40
1957 – 1961	120
1961 – 1964	1680
1964 – 1968	840
1968 – 1972	1520
1972 – 1976	1840
1976 – 1979	1360
1979 – 1983	960
1983 – 1987	2160
1987 – 1990	2880
1990 – 1994	5200
1994 – 1998	8520
1998 – 2002	13760
2002 – 2005	10040
2005 – 2009	7360
2009 – 2013	7280
2013 – 2017	98920
2017 – 2020	14120
2020 – 2024	12240

Fig 4.

country · Look for Australia's overwhelming share of named-country records versus all other nations combined.

Top values for country (20 unique shown, of 57 total).
value	count	share
	103840	51.9%
Australia	79320	39.7%
United States	8160	4.1%
New Zealand	1320	0.7%
USA	680	0.3%
Antarctica	680	0.3%
Colombia	640	0.3%
Chile	520	0.3%
Bermuda	400	0.2%
Portugal	320	0.2%
UNITED STATES	320	0.2%
Ross Dependency	240	0.1%
Russia	240	0.1%
United States of America	240	0.1%
GREAT BRITAIN	200	0.1%
Ecuador	160	0.1%
Bahamas	160	0.1%
Italy	160	0.1%
CO	160	0.1%
Discovery Deep, Red Sea	160	0.1%

Fig 5.

class · Look for Alphaproteobacteria and Teleostei at the top, and note the large blank category indicating records unclassified at class level.

Top values for class (20 unique shown, of 138 total).
value	count	share
Alphaproteobacteria	22840	11.4%
	21920	11.0%
Teleostei	19120	9.6%
Octocorallia	17880	8.9%
Thaumarchaeota incertae sedis	10840	5.4%
Dinophyceae	10600	5.3%
Gammaproteobacteria	9440	4.7%
Holothuroidea	7760	3.9%
Malacostraca	7440	3.7%
Hexactinellida	6320	3.2%
Hexacorallia	5640	2.8%
Copepoda	4080	2.0%
Ophiuroidea	3320	1.7%
Polychaeta	3320	1.7%
Polycystina	2800	1.4%
Elasmobranchii	2640	1.3%
Globothalamea	2520	1.3%
Deltaproteobacteria	2440	1.2%
Thermoplasmata	2360	1.2%
Flavobacteria	2320	1.2%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
scientificName	categorical	0.0%
species	categorical	0.0%
genus	categorical	0.0%
family	categorical	0.0%
order	categorical	0.0%
class	categorical	0.0%
phylum	categorical	0.0%
latitude	numeric	0.0%
longitude	numeric	0.0%
depth	numeric	0.0%
year	numeric	3.6%
country	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
	latitude	longitude	depth	year
latitude	+1.00	-0.23	+0.02	-0.05
longitude	-0.23	+1.00	-0.01	+0.02
depth	+0.02	-0.01	+1.00	+0.02
year	-0.05	+0.02	+0.02	+1.00
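
The report does not say how the correlation sample is drawn or bounded, so the following is only a minimal sketch of the coefficient itself, written in plain Python. The helper names `drop_missing_pairs` and `pearson` are hypothetical. Dropping pairs with a missing value matters here because year has nulls.

```python
import math

def drop_missing_pairs(xs, ys):
    """Keep only positions where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only (not taken from the dataset).
depth = [1000, 1500, 2000, 2500, 3000]
year = [2016, None, 2014, 2012, 2010]
d, y = drop_missing_pairs(depth, year)
r = pearson(d, y)
```

Every off-diagonal value in the table above has magnitude 0.23 or less, so none of the numeric columns shows more than a weak linear relationship with another.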

scientificName categorical label

This column contains biological taxonomic names (scientific names) spanning multiple ranks — from broad groups like 'Bacteria' and 'Alphaproteobacteria' down to species-level binomials like 'Amperima rosea' and 'Xiphias gladius'. With 1,478 unique values across 200,000 rows and zero nulls, it is a well-populated label field, though the top value 'Alphaproteobacteria' accounts for 8.82% of all rows, indicating moderate concentration at a handful of higher-rank taxa. The high entropy ratio of 0.796 confirms substantial spread across many taxa, and the mix of ranks (class, order, family, genus, species) within a single column is a structural issue that may require rank-disambiguation before analysis.

Treatment: Normalise taxonomic rank before grouping or modelling; consider splitting into rank-specific columns or encoding hierarchy via a taxonomy backbone.
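
One way to approach rank disambiguation is to read the rank off the other taxonomy columns: the most specific column with a non-blank value indicates how far a record was identified. The sketch below assumes records are available as dictionaries keyed by the schema's column names. The function name, the `above_phylum` label, and the sample records are hypothetical and hand-written for illustration.

```python
# Taxonomic rank columns from the schema, ordered from most to least specific.
RANK_COLUMNS = ["species", "genus", "family", "order", "class", "phylum"]

def identified_rank(record):
    """Return the most specific rank column that holds a non-blank value."""
    for rank in RANK_COLUMNS:
        if (record.get(rank) or "").strip():
            return rank
    # Hypothetical label for rows where every rank column is blank.
    return "above_phylum"

# Illustrative records, hand-written rather than copied from the raw file.
records = [
    {"scientificName": "Amperima rosea", "species": "Amperima rosea",
     "genus": "Amperima", "family": "Elpidiidae", "order": "Elasipodida",
     "class": "Holothuroidea", "phylum": "Echinodermata"},
    {"scientificName": "Alphaproteobacteria", "species": "", "genus": "",
     "family": "", "order": "", "class": "Alphaproteobacteria",
     "phylum": "Proteobacteria"},
    {"scientificName": "Bacteria", "species": "", "genus": "", "family": "",
     "order": "", "class": "", "phylum": ""},
]

ranks = [identified_rank(r) for r in records]
```

Grouping by the derived rank before aggregating keeps species-level binomials from being counted alongside class-level or kingdom-level labels.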

anthropic:default · confidence high

Out[13]:

saturn.columns["scientificName"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	1,478
top_value	Alphaproteobacteria
top_rate	0.0882
cardinality	1,478
entropy	8.378
entropy_ratio	0.7956
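
The reported figures are consistent with entropy_ratio being the Shannon entropy in bits divided by its maximum, log2 of the cardinality. That reading is inferred from the numbers, not from Saturn's documentation. A minimal sketch under that assumption, with hypothetical function names:

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_ratio(values):
    """Entropy divided by log2(number of distinct values); 0.0 if constant."""
    k = len(set(values))
    if k <= 1:
        return 0.0
    return entropy_bits(values) / math.log2(k)

# Cross-check against the reported scientificName stats:
# entropy 8.378 over 1,478 distinct values.
reported_ratio = 8.378 / math.log2(1478)
```

A ratio near 1 means the values are spread almost evenly; a ratio near 0 means a few values account for most rows.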

Fig 8.

Top values for scientificName.

Top values for scientificName (20 unique shown, of 1478 total).
value	count	share
Alphaproteobacteria	17640	8.8%
Bacteria	12760	6.4%
Nitrosopumilaceae	10840	5.4%
Syndiniales	7280	3.6%
Amperima rosea	4520	2.3%
Porifera	2400	1.2%
Thermoplasmata	2360	1.2%
Keratoisididae	2320	1.2%
Xiphias gladius	2000	1.0%
Pseudomonadales	1920	1.0%
Gammaproteobacteria	1920	1.0%
Monothalamea	1640	0.8%
Rhodospirillales	1640	0.8%
Scomber scombrus	1520	0.8%
Retaria	1520	0.8%
Dinophyceae	1320	0.7%
Rickettsiales	1200	0.6%
Chrysogorgia	1160	0.6%
Hexactinellida	1080	0.5%
Prionace glauca	1080	0.5%

species categorical label

This column contains biological species names (binomial Latin nomenclature), likely from a marine/oceanographic observation dataset given taxa such as swordfish (Xiphias gladius), Atlantic mackerel (Scomber scombrus), and deep-sea holothurians (Amperima rosea). The dominant 'value' is an empty string, accounting for 73.2% of all 200,000 rows (146,400 records), which is a critical data quality issue: species was not recorded for nearly three-quarters of observations. The 678 distinct values comprise the blank plus 677 species names, and the entropy ratio of only 0.33, computed over all 678 values, reflects heavy concentration in the blank and a handful of species.

Treatment: Treat empty string as missing/unknown; impute or filter before modelling, then encode remaining 677 species (e.g. target-encode or embed taxonomic hierarchy) given high cardinality.

anthropic:default · confidence high

Out[16]:

saturn.columns["species"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	678
top_value
top_rate	0.732
cardinality	678
entropy	3.077
entropy_ratio	0.3272

Fig 9.

Top values for species.

Top values for species (20 unique shown, of 678 total).
value	count	share
	146400	73.2%
Amperima rosea	4520	2.3%
Xiphias gladius	2000	1.0%
Scomber scombrus	1520	0.8%
Prionace glauca	1080	0.5%
Oneirophanta mutabilis	840	0.4%
Thunnus albacares	760	0.4%
Farrea occa	680	0.3%
Trissopathes pseudotristicha	640	0.3%
Hoplostethus atlanticus	520	0.3%
Trachurus trachurus	480	0.2%
Florometra serratissima	440	0.2%
Heteropolypus ritteri	400	0.2%
Desmophyllum dianthus	400	0.2%
Psychropotes longicauda	360	0.2%
Thunnus obesus	320	0.2%
Solenosmilia variabilis	320	0.2%
Etmopterus granulosus	320	0.2%
Molpadiodemas villosus	280	0.1%
Paragorgia arborea	280	0.1%

genus categorical label

This column contains biological genus names, likely from a marine species observation or biodiversity dataset given genera such as Xiphias (swordfish), Thunnus (tuna), Prionace (blue shark), and deep-sea taxa like Amperima and Chrysogorgia. The most striking issue is that 54.9% of rows (109,800 of 200,000) carry an empty string rather than a null, meaning missingness is systematically masked and will not be caught by standard null checks. The remaining 840 distinct genus values show moderate entropy (4.90, entropy_ratio 0.50), indicating a reasonably spread but skewed distribution.

Treatment: Replace empty-string values with NaN to expose true missingness (~55%), then use as a categorical label or grouping key; consider hierarchical encoding with higher taxonomic ranks if available.
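
The masking effect described above can be shown with a minimal sketch in plain Python, using `None` as the missing marker. The helper names and the eight sample values are hypothetical. In pandas the same idea is usually expressed by replacing `""` with a missing value before calling `isna()`.

```python
def blank_to_none(value):
    """Map empty or whitespace-only strings to None; leave other values alone."""
    if value is None:
        return None
    return value if value.strip() else None

def missing_rate(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)

# Illustrative sample only (not taken from the dataset).
genus = ["Amperima", "", "Xiphias", "", " ", "Thunnus", "", "Retaria"]

rate_before = missing_rate(genus)    # blanks are strings, so nothing looks missing
cleaned = [blank_to_none(v) for v in genus]
rate_after = missing_rate(cleaned)   # blanks now count as missing
```

A null check reports no missing values until the blanks are converted, which is the same pattern as the 0.0% null rate sitting next to a 54.9% blank share for genus.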

anthropic:default · confidence high

Out[19]:

saturn.columns["genus"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	841
top_value
top_rate	0.549
cardinality	841
entropy	4.896
entropy_ratio	0.5039

Fig 10.

Top values for genus.

Top values for genus (20 unique shown, of 841 total).
value	count	share
	109800	54.9%
Amperima	4520	2.3%
Xiphias	2000	1.0%
Scomber	1520	0.8%
Retaria	1520	0.8%
Farrea	1400	0.7%
Thunnus	1360	0.7%
Chrysogorgia	1360	0.7%
Coryphaenoides	1240	0.6%
Prionace	1080	0.5%
Hemicorallium	1000	0.5%
Alteromonas	960	0.5%
Paragorgia	880	0.4%
Oneirophanta	840	0.4%
Lepidisis	800	0.4%
Trissopathes	800	0.4%
Alepisaurus	760	0.4%
Keratoisis	720	0.4%
Pennatula	600	0.3%
Hoplostethus	600	0.3%

family categorical label

This column contains biological family-level taxonomic names (e.g., Nitrosopumilaceae, Elpidiidae, Scombridae), consistent with a marine species or biodiversity dataset. The most striking issue is that the top value is an empty string, accounting for 40.18% of all 200,000 rows (80,360 records) — a substantial proportion of missing taxonomy that is masked by a null_rate of 0.0, meaning blanks were stored as empty strings rather than true nulls. With 606 unique values and moderate entropy (5.50), the remaining distribution is fairly spread but dominated by a handful of families.

Treatment: Replace empty strings with NaN, then use as a categorical grouping variable or encode (e.g., target/ordinal encoding) for modelling; investigate whether blank family records can be imputed from genus/species columns.
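
A cautious way to investigate that imputation is to learn a genus-to-family lookup from rows where both fields are present and fill blanks only when a genus maps to exactly one family. The sketch below assumes rows are dictionaries with `genus` and `family` keys. The function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def build_genus_to_family(rows):
    """Map each genus to its family, keeping only unambiguous pairings."""
    votes = defaultdict(Counter)
    for row in rows:
        genus, family = row["genus"], row["family"]
        if genus and family:
            votes[genus][family] += 1
    return {g: next(iter(c)) for g, c in votes.items() if len(c) == 1}

def impute_family(rows):
    """Return copies of rows with blank family filled from the genus lookup."""
    lookup = build_genus_to_family(rows)
    filled = []
    for row in rows:
        new_row = dict(row)
        if not new_row["family"] and new_row["genus"] in lookup:
            new_row["family"] = lookup[new_row["genus"]]
        filled.append(new_row)
    return filled

# Illustrative rows only (not copied from the raw file).
rows = [
    {"genus": "Amperima", "family": "Elpidiidae"},
    {"genus": "Amperima", "family": ""},
    {"genus": "", "family": ""},
    {"genus": "Thunnus", "family": "Scombridae"},
]
result = impute_family(rows)
```

Rows with a blank genus stay blank, so this can recover only a fraction of the missing family values; because more than half of genus entries are themselves blank, that fraction may be small.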

anthropic:default · confidence high

Out[22]:

saturn.columns["family"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	606
top_value
top_rate	0.4018
cardinality	606
entropy	5.5
entropy_ratio	0.595

Fig 11.

Top values for family.

Top values for family (20 unique shown, of 606 total).
value	count	share
	80360	40.2%
Nitrosopumilaceae	10840	5.4%
Elpidiidae	4920	2.5%
Keratoisididae	4360	2.2%
Coralliidae	4040	2.0%
Macrouridae	3120	1.6%
Scombridae	3040	1.5%
Primnoidae	2440	1.2%
Chrysogorgiidae	2320	1.2%
Xiphiidae	2000	1.0%
Retariidae	1520	0.8%
Farreidae	1520	0.8%
Alteromonadaceae	1440	0.7%
Euplectellidae	1360	0.7%
Flavobacteriaceae	1320	0.7%
Schizopathidae	1280	0.6%
Caryophylliidae	1080	0.5%
Carcharhinidae	1080	0.5%
Acanthogorgiidae	1000	0.5%
Nitrospinaceae	1000	0.5%

order categorical label

This column represents the biological taxonomic rank 'Order' for marine organisms, containing 310 distinct taxonomic order names drawn from bacteria, archaea, protists, invertebrates, and fish. The most striking signal is that the top value is an empty string, accounting for 28.16% of all 200,000 rows (56,320 records) — suggesting a substantial proportion of specimens could not be classified at this rank, which is common in metagenomic or environmental sampling datasets. The entropy ratio of 0.659 indicates moderate diversity across the 310 categories, with a long tail of rarer orders beneath the dominant few.

Treatment: Treat empty string as a distinct 'unclassified' category or null before encoding; encode remaining values as nominal categories (one-hot or target encoding depending on model).

anthropic:default · confidence high

Out[25]:

saturn.columns["order"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	310
top_value
top_rate	0.2816
cardinality	310
entropy	5.45
entropy_ratio	0.6585

Fig 12.

Top values for order.

Top values for order (20 unique shown, of 310 total).
value	count	share
	56320	28.2%
Scleralcyonacea	15720	7.9%
Nitrosopumilales	10840	5.4%
Syndiniales	7960	4.0%
Elasipodida	5680	2.8%
Gadiformes	4040	2.0%
Scombriformes	3360	1.7%
Carangiformes	3240	1.6%
Calanoida	2840	1.4%
Alteromonadales	2720	1.4%
Decapoda	2720	1.4%
Antipatharia	2560	1.3%
Sceptrulophora	2480	1.2%
Pseudomonadales	2400	1.2%
Flavobacteriales	2320	1.2%
Scleractinia	2240	1.1%
Malacalcyonacea	2000	1.0%
Lyssacinosida	2000	1.0%
Rotaliida	2000	1.0%
Amphipoda	1880	0.9%

class categorical label

This column contains biological taxonomic class-level classifications (e.g., Alphaproteobacteria, Teleostei, Octocorallia), consistent with a marine or environmental biodiversity dataset. With 138 unique values across 200,000 rows, cardinality is moderate and entropy is reasonably high (4.89, ratio 0.69), indicating a fairly broad but not flat distribution. The top value 'Alphaproteobacteria' accounts for 11.42% of records, suggesting mild concentration at the top. A notable concern is the second-most-frequent entry being an empty string ('') with 21,920 occurrences (≈11% of rows), which likely represents missing or unclassified taxa rather than a true category and should be treated as null.

Treatment: Replace empty-string entries with null, then encode as a nominal categorical feature (e.g., target-encode or embed) for modelling.

anthropic:default · confidence high

Out[28]:

saturn.columns["class"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	138
top_value	Alphaproteobacteria
top_rate	0.1142
cardinality	138
entropy	4.892
entropy_ratio	0.6881

Fig 13.

Top values for class.

Top values for class (20 unique shown, of 138 total).
value	count	share
Alphaproteobacteria	22840	11.4%
	21920	11.0%
Teleostei	19120	9.6%
Octocorallia	17880	8.9%
Thaumarchaeota incertae sedis	10840	5.4%
Dinophyceae	10600	5.3%
Gammaproteobacteria	9440	4.7%
Holothuroidea	7760	3.9%
Malacostraca	7440	3.7%
Hexactinellida	6320	3.2%
Hexacorallia	5640	2.8%
Copepoda	4080	2.0%
Ophiuroidea	3320	1.7%
Polychaeta	3320	1.7%
Polycystina	2800	1.4%
Elasmobranchii	2640	1.3%
Globothalamea	2520	1.3%
Deltaproteobacteria	2440	1.2%
Thermoplasmata	2360	1.2%
Flavobacteria	2320	1.2%

phylum categorical label

This column contains biological phylum classifications, covering 65 distinct phyla across 200,000 rows with no nulls. The dominant value is Proteobacteria (17.74%), followed by Cnidaria and Chordata, suggesting a marine or mixed ecological dataset spanning bacteria, invertebrates, and vertebrates. Notably, empty strings appear as the 5th most frequent 'value' with 13,200 occurrences (6.6%, tied with Arthropoda), which are functionally missing values masked as non-null entries. The entropy ratio of 0.68 indicates moderate concentration: a few phyla dominate, and the remaining categories form a long, uneven tail.

Treatment: Replace empty-string entries (13,200 rows) with explicit nulls, then encode as a categorical feature (e.g., ordinal or one-hot for low-cardinality models).

anthropic:default · confidence high

Out[31]:

saturn.columns["phylum"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	65
top_value	Proteobacteria
top_rate	0.1774
cardinality	65
entropy	4.095
entropy_ratio	0.6799

Fig 14.

Top values for phylum.

Top values for phylum (20 unique shown, of 65 total).
value	count	share
Proteobacteria	35480	17.7%
Cnidaria	25520	12.8%
Chordata	23920	12.0%
Echinodermata	14000	7.0%
	13200	6.6%
Arthropoda	13200	6.6%
Myzozoa	11280	5.6%
Thaumarchaeota	10920	5.5%
Porifera	10360	5.2%
Foraminifera	4720	2.4%
Annelida	4240	2.1%
Radiozoa	4000	2.0%
Mollusca	3840	1.9%
Bacteroidetes	3440	1.7%
Euryarchaeota	2440	1.2%
Planctomycetes	2320	1.2%
Heterokontophyta	1680	0.8%
Verrucomicrobia	1520	0.8%
Brachiopoda	1520	0.8%
Nematoda	1160	0.6%

latitude numeric feature

This column contains geographic latitude values, ranging from -75.0 to 89.06 degrees, consistent with a near-global spatial dataset. With only 2,617 unique values across 200,000 rows, each distinct latitude is reused on average ~76 times, suggesting coordinates are discretized or snapped to a coarse grid rather than recorded at full precision. The distribution is nearly symmetric (skew 0.12) with a platykurtic shape (kurtosis -1.22) and an IQR of ~72 degrees, indicating broad global coverage with no strong concentration in any hemisphere. No nulls, no outliers, and zero_rate of 0.0 are all clean signals.

Treatment: Pair with longitude for spatial joins or clustering; investigate grid resolution given only 2,617 unique values across 200,000 rows before treating as continuous.
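
One quick probe of grid resolution is to tally how many decimal places the coordinates carry, since values snapped to a grid tend to share a small number of precisions. This is a minimal sketch. The helper name and the tolerance are assumptions, and the sample values are taken from the reported summary statistics, not from raw rows.

```python
from collections import Counter

def decimal_places(x, max_places=6, tol=1e-9):
    """Smallest number of decimals at which rounding leaves x unchanged."""
    for places in range(max_places + 1):
        if abs(x - round(x, places)) < tol:
            return places
    return max_places

# Values from the reported stats (min, max, median, q1, q3), used as stand-ins.
latitudes = [-75.0, 89.06, -4.99815, -36.25, 35.72]
precision_counts = Counter(decimal_places(v) for v in latitudes)
```

If most raw coordinates turn out to share one or two precisions, binning them to that resolution may be more honest than treating latitude as continuous.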

anthropic:default · confidence high

Out[34]:

saturn.columns["latitude"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	2,617
min	-75
max	89.06
mean	-1.581
median	-4.998
std	39.48
q1	-36.25
q3	35.72
iqr	71.97
skew	0.1182
kurtosis	-1.223
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 15.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: -4.99815).
bin	count
-75 – -70.9	640
-70.9 – -66.8	440
-66.8 – -62.7	6000
-62.7 – -58.59	5160
-58.59 – -54.49	4960
-54.49 – -50.39	4360
-50.39 – -46.29	4360
-46.29 – -42.19	8560
-42.19 – -38.09	5560
-38.09 – -33.99	16440
-33.99 – -29.88	13320
-29.88 – -25.78	10400
-25.78 – -21.68	3480
-21.68 – -17.58	2520
-17.58 – -13.48	4280
-13.48 – -9.376	4280
-9.376 – -5.275	4440
-5.275 – -1.173	6080
-1.173 – 2.928	2160
2.928 – 7.03	1080
7.03 – 11.13	5280
11.13 – 15.23	5680
15.23 – 19.33	1960
19.33 – 23.44	5960
23.44 – 27.54	9160
27.54 – 31.64	7680
31.64 – 35.74	7480
35.74 – 39.84	12840
39.84 – 43.94	7000
43.94 – 48.04	5880
48.04 – 52.15	11360
52.15 – 56.25	2480
56.25 – 60.35	920
60.35 – 64.45	1160
64.45 – 68.55	1200
68.55 – 72.65	920
72.65 – 76.76	920
76.76 – 80.86	1600
80.86 – 84.96	1200
84.96 – 89.06	800

longitude numeric feature

This column represents geographic longitude, with values spanning the full valid range from approximately -180 to +180 degrees and a mean of -51.58, suggesting a dataset skewed toward the Western Hemisphere (median -94.29, well west of the prime meridian). The IQR of 233.75 and near-flat kurtosis (-1.12) indicate values are broadly spread across the globe with no sharp central peak — consistent with global or multi-continental coverage. Surprisingly, only 2,654 unique values exist across 200,000 rows, implying heavy coordinate quantization or snapping to a fixed grid rather than continuous GPS precision. The zero_rate of 0.0004 is negligible but worth monitoring as a potential null-proxy sentinel.

Treatment: Use as-is for spatial joins or map projections; investigate coordinate quantization (2,654 unique values over 200,000 rows) before using as a continuous feature in regression.
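
A zero_rate of 0.0004 corresponds to 80 of 200,000 rows. A zero longitude is a plausible sentinel mainly when latitude is zero as well, so a simple check is to count both cases. The sketch below assumes coordinates are available as (latitude, longitude) pairs. The function name and sample points are hypothetical.

```python
def zero_coordinate_report(points):
    """Count points with zero longitude and points at exactly (0, 0)."""
    lon_zero = sum(1 for lat, lon in points if lon == 0)
    both_zero = sum(1 for lat, lon in points if lat == 0 and lon == 0)
    return {"n": len(points), "lon_zero": lon_zero, "both_zero": both_zero}

# Rows implied by the reported longitude zero_rate.
rows_with_zero_longitude = round(0.0004 * 200_000)

# Illustrative points only (not taken from the dataset).
points = [(-36.25, 0.0), (0.0, 0.0), (35.72, -94.29), (-4.99, 63.76)]
report = zero_coordinate_report(points)
```

The latitude stats report a zero_rate of 0, which suggests that no row in this dataset sits at exactly (0, 0) and that the zero longitudes are more likely genuine prime-meridian positions.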

anthropic:default · confidence high

Out[37]:

saturn.columns["longitude"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	2,654
min	-180
max	180
mean	-51.58
median	-94.29
std	127.1
q1	-170
q3	63.76
iqr	233.8
skew	0.6619
kurtosis	-1.123
n_outliers	0
outlier_rate	0
zero_rate	0.0004

Fig 16.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: -94.29).
bin	count
-180 – -171	17200
-171 – -162	53840
-162 – -153	2320
-153 – -144	680
-144 – -135	520
-135 – -126	3240
-126 – -117	17080
-117 – -108	4360
-108 – -98.99	320
-98.99 – -89.99	3760
-89.99 – -80.99	4040
-80.99 – -71.99	4120
-71.99 – -62.99	5680
-62.99 – -53.99	1840
-53.99 – -44.99	3280
-44.99 – -35.99	2240
-35.99 – -26.99	1280
-26.99 – -17.99	2360
-17.99 – -8.994	12680
-8.994 – 0.00565	2440
0.00565 – 9.005	2760
9.005 – 18	640
18 – 27	560
27 – 36	680
36 – 45	560
45 – 54	520
54 – 63	880
63 – 72	920
72 – 81	480
81 – 90	3320
90 – 99	680
99 – 108	160
108 – 117	1160
117 – 126	1200
126 – 135	1080
135 – 144	1000
144 – 153	14840
153 – 162	19120
162 – 171	1520
171 – 180	4640

depth numeric feature

This column almost certainly represents depth measurements in metres (in this dataset, ocean depth), ranging from 1,000 m to 11,000 m with a mean of ~2,406 m and median of ~1,962 m. The distribution is right-skewed (skew ≈ 1.09) with a wide IQR of 2,174 m, indicating that most observations cluster at shallower depths while a tail extends toward very deep values; the maximum of 11,000 m is consistent with ocean trench depths. Only 1,938 unique values across 200,000 rows suggest that depths are rounded or binned rather than continuous measurements, which matters for precision-sensitive analyses. The outlier count is modest (560, 0.28%) and no nulls or zeros are present.

Treatment: Apply log-transform or quantile transformation before regression/modelling to reduce right skew; verify whether discretisation into 1,938 unique values is intentional.
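
The effect of a log transform on right skew can be illustrated with a minimal sketch. It uses the population (moment-based) skewness formula, which may differ from the estimator Saturn uses, and a small made-up sample spanning the reported 1,000 to 11,000 m range.

```python
import math

def skewness(xs):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Illustrative depths in metres (not taken from the dataset).
depths = [1000, 1100, 1200, 1500, 2000, 2200, 3300, 5000, 11000]

raw_skew = skewness(depths)
log_skew = skewness([math.log(d) for d in depths])
```

Because the minimum depth is 1,000 m and there are no zeros, the logarithm is defined for every row without an offset.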

anthropic:default · confidence high

Out[40]:

saturn.columns["depth"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	1,938
min	1,000
max	11,000
mean	2,406
median	1,962
std	1,477
q1	1,149
q3	3,323
iqr	2,174
skew	1.091
kurtosis	0.5022
n_outliers	560
outlier_rate	0.0028
zero_rate	0

Fig 17.

Distribution of depth. Vertical dash marks the median.

Histogram bins for depth (median: 1961.695).
bin	count
1000 – 1250	58400
1250 – 1500	15280
1500 – 1750	10920
1750 – 2000	20200
2000 – 2250	23560
2250 – 2500	5720
2500 – 2750	5320
2750 – 3000	4360
3000 – 3250	5080
3250 – 3500	3520
3500 – 3750	3520
3750 – 4000	3360
4000 – 4250	8000
4250 – 4500	4680
4500 – 4750	1880
4750 – 5000	11400
5000 – 5250	3240
5250 – 5500	3360
5500 – 5750	5520
5750 – 6000	1760
6000 – 6250	280
6250 – 6500	40
6500 – 6750	40
6750 – 7000	0
7000 – 7250	0
7250 – 7500	80
7500 – 7750	80
7750 – 8000	0
8000 – 8250	0
8250 – 8500	0
8500 – 8750	80
8750 – 9000	80
9000 – 9250	40
9250 – 9500	0
9500 – 9750	40
9750 – 10000	0
10000 – 10250	120
10250 – 10500	0
10500 – 10750	0
10750 – 11000	40

year numeric feature

This column represents a calendar year, spanning from 1875 to 2024 with 98 distinct values across 200,000 rows, of which 7,240 (3.6%) are null. The bulk of records cluster tightly between 2004 and 2016 (IQR of 12 years, median 2016), but the distribution is heavily left-skewed (skew = -3.57, kurtosis = 19.65), driven by a long tail of older entries stretching back to 1875. Roughly 6.3% of non-null rows (12,080 of 192,760) are flagged as outliers. Under the usual 1.5 × IQR rule the lower fence is 2004 − 18 = 1986, so the flagged records are those dated before 1986; the histogram places most of them after 1960, with fewer than 2,000 in the bins up to 1950. Analysts should verify whether the oldest years are legitimate data or encoding errors.

Treatment: Inspect the pre-1986 records flagged as outliers and consider capping or binning the sparse pre-1950 tail; use as an ordinal/numeric feature or derive 'age relative to reference year' before modelling.

anthropic:default · confidence high

Out[43]:

saturn.columns["year"].stats

stat	value
n	200,000
nulls	7,240 (3.6%)
unique	98
min	1875
max	2024
mean	2009
median	2016
std	15.43
q1	2004
q3	2016
iqr	12
skew	-3.574
kurtosis	19.65
n_outliers	12,080
outlier_rate	0.06267
zero_rate	0
alert: high_skew	skew=-3.57
alert: outliers	6.3% rows beyond 1.5 IQR
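
Assuming the alert's "beyond 1.5 IQR" refers to the standard Tukey fences (an assumption; the report does not spell out the rule), the cut-offs follow directly from the reported q1 and q3. The function name is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Reported quartiles for year.
lower, upper = tukey_fences(2004, 2016)

# The outlier rate matches a non-null denominator (200,000 - 7,240 rows).
rate_non_null = 12_080 / (200_000 - 7_240)
```

With a lower fence of 1986 and an upper fence of 2034, every flagged row lies on the old side of the distribution, since the maximum year is 2024.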

Fig 18.

Distribution of year. Vertical dash marks the median.

Histogram bins for year (median: 2016.0).
bin	count
1875 – 1879	40
1879 – 1882	120
1882 – 1886	320
1886 – 1890	80
1890 – 1894	120
1894 – 1897	0
1897 – 1901	0
1901 – 1905	80
1905 – 1909	40
1909 – 1912	120
1912 – 1916	0
1916 – 1920	0
1920 – 1923	0
1923 – 1927	160
1927 – 1931	400
1931 – 1935	240
1935 – 1938	80
1938 – 1942	0
1942 – 1946	0
1946 – 1950	40
1950 – 1953	80
1953 – 1957	40
1957 – 1961	120
1961 – 1964	1680
1964 – 1968	840
1968 – 1972	1520
1972 – 1976	1840
1976 – 1979	1360
1979 – 1983	960
1983 – 1987	2160
1987 – 1990	2880
1990 – 1994	5200
1994 – 1998	8520
1998 – 2002	13760
2002 – 2005	10040
2005 – 2009	7360
2009 – 2013	7280
2013 – 2017	98920
2017 – 2020	14120
2020 – 2024	12240

country categorical feature

This column represents the country of origin or registration for records in the dataset, with 57 distinct values across 200,000 rows. The dominant signal is that 51.92% of rows (103,840) have an empty string, effectively a missing value masked as a non-null entry, which would go undetected by null checks. Among populated values, 'Australia' accounts for 79,320 rows (39.7% of all rows, about 82% of the 96,160 non-blank rows), making the named-country subset heavily Australia-centric. A further data quality concern is inconsistent naming: the United States appears as 'United States' (8,160), 'USA' (680), 'UNITED STATES' (320), and 'United States of America' (240), and some values, such as 'Discovery Deep, Red Sea', are not countries at all.

Treatment: Replace empty strings with null, then standardise country name variants (e.g. 'USA' → 'United States') before encoding or aggregating.
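
A minimal sketch of that clean-up, using a small alias table keyed on the case-folded value. The alias table covers only the United States variants visible in the top-20 list; ambiguous values such as 'CO' are deliberately left untouched. The names `COUNTRY_ALIASES` and `normalise_country` are hypothetical.

```python
from collections import Counter

# Case-folded variant -> canonical name. Extend after reviewing all 57 values.
COUNTRY_ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}

def normalise_country(value):
    """Return None for blanks, the canonical name for known aliases, else the trimmed value."""
    if value is None or not value.strip():
        return None
    trimmed = value.strip()
    return COUNTRY_ALIASES.get(trimmed.casefold(), trimmed)

# Counts as reported in the top-values table for country.
reported = {
    "": 103840, "Australia": 79320, "United States": 8160, "USA": 680,
    "UNITED STATES": 320, "United States of America": 240,
}

merged = Counter()
for name, count in reported.items():
    merged[normalise_country(name)] += count
```

After merging, the four United States spellings in the table combine to 9,400 rows, still far below Australia's 79,320.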

anthropic:default · confidence high

Out[46]:

saturn.columns["country"].stats

stat	value
n	200,000
nulls	0 (0.0%)
unique	57
top_value
top_rate	0.5192
cardinality	57
entropy	1.616
entropy_ratio	0.277

Fig 19.

Top values for country.

Top values for country (20 unique shown, of 57 total).
value	count	share
	103840	51.9%
Australia	79320	39.7%
United States	8160	4.1%
New Zealand	1320	0.7%
USA	680	0.3%
Antarctica	680	0.3%
Colombia	640	0.3%
Chile	520	0.3%
Bermuda	400	0.2%
Portugal	320	0.2%
UNITED STATES	320	0.2%
Ross Dependency	240	0.1%
Russia	240	0.1%
United States of America	240	0.1%
GREAT BRITAIN	200	0.1%
Ecuador	160	0.1%
Bahamas	160	0.1%
Italy	160	0.1%
CO	160	0.1%
Discovery Deep, Red Sea	160	0.1%
