quirky deep sea

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json

Saturn profiled 200,000 rows across 12 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

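[2]:
# Install the profiler (IPython shell escape; outside a notebook, run `pip install saturn-dissect` in a shell)
!pip install saturn-dissect
import subprocess

# Profile the source file, write the findings to JSON, and request the opt-in LLM prose
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json",
    "--findings", "quirky-deep_sea.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if the CLI exits non-zero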

Summary confidence: high

This is a 200,000-row deep-sea biodiversity dataset with 12 columns covering taxonomy (phylum, class, order, family, genus, species, scientificName), geography (country, latitude, longitude), depth, and observation year. Two things stand out: the taxonomic hierarchy is heavily incomplete at lower ranks — species is blank in 73.2% of rows and genus in 54.9% — so most records can only be analyzed at higher ranks like phylum (top: Proteobacteria at 17.7%) or class (top: Alphaproteobacteria at 11.4%). Country is also mostly missing (51.9% blank) with Australia dominating the populated entries at 79,320 records, suggesting a strong sampling bias. Year is left-skewed (skew -3.57) toward recent records with a long tail back to 1875, while depth ranges from 1,000 to 11,000 m with a median near 1,962 m. Start by checking the missingness in species/country and the geographic concentration before any biodiversity analysis.

citing: row_count · column_count · species.top_rate · genus.top_rate · country.top_rate · country.top_values · phylum.top_value · phylum.top_rate · class.top_value · class.top_rate · year.skew · year.min · year.max · depth.min · depth.max · depth.median
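
The blank-string missingness described in the summary does not appear in the null% column, so a first check is to count blanks together with true nulls. The following is a minimal sketch, assuming the records have been loaded as a list of dicts; the `rows` sample and the helper name `effective_missing_rate` are made up for illustration and are not part of saturn.

```python
def effective_missing_rate(rows, column):
    """Share of rows where `column` is absent, None, or a blank string."""
    if not rows:
        return 0.0
    missing = 0
    for row in rows:
        value = row.get(column)
        if value is None or (isinstance(value, str) and value.strip() == ""):
            missing += 1
    return missing / len(rows)


# Hypothetical sample records, not taken from deep_sea.json
rows = [
    {"species": "Amperima rosea", "country": "Australia"},
    {"species": "", "country": ""},
    {"species": "", "country": "Australia"},
    {"species": None, "country": " "},
]

species_rate = effective_missing_rate(rows, "species")
country_rate = effective_missing_rate(rows, "country")
print(species_rate, country_rate)
```

Run against the full file, a check of this kind should reproduce the blank shares reported as top_rate for species and country.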

Out[4]:

saturn.schema() · 12 columns

column kind n null% unique alerts
scientificName categorical 200,000 0.0% 1,478
species categorical 200,000 0.0% 678
genus categorical 200,000 0.0% 841
family categorical 200,000 0.0% 606
order categorical 200,000 0.0% 310
class categorical 200,000 0.0% 138
phylum categorical 200,000 0.0% 65
latitude numeric 200,000 0.0% 2,617
longitude numeric 200,000 0.0% 2,654
depth numeric 200,000 0.0% 1,938
year numeric 200,000 3.6% 98 high_skew outliers
country categorical 200,000 0.0% 57
Fig 1.
phylum · Shows the dominant taxonomic groups, led by Proteobacteria, Cnidaria, and Chordata.
Top values for phylum (20 unique shown, of 65 total).
value count share
Proteobacteria 35480 17.7%
Cnidaria 25520 12.8%
Chordata 23920 12.0%
Echinodermata 14000 7.0%
(empty string) 13200 6.6%
Arthropoda 13200 6.6%
Myzozoa 11280 5.6%
Thaumarchaeota 10920 5.5%
Porifera 10360 5.2%
Foraminifera 4720 2.4%
Annelida 4240 2.1%
Radiozoa 4000 2.0%
Mollusca 3840 1.9%
Bacteroidetes 3440 1.7%
Euryarchaeota 2440 1.2%
Planctomycetes 2320 1.2%
Heterokontophyta 1680 0.8%
Verrucomicrobia 1520 0.8%
Brachiopoda 1520 0.8%
Nematoda 1160 0.6%
Fig 2.
depth · Reveals the distribution of sampling depths from 1,000 m down to the 11,000 m hadal zone.
Histogram bins for depth (median: 1961.695).
bin count
1000 – 1250 58400
1250 – 1500 15280
1500 – 1750 10920
1750 – 2000 20200
2000 – 2250 23560
2250 – 2500 5720
2500 – 2750 5320
2750 – 3000 4360
3000 – 3250 5080
3250 – 3500 3520
3500 – 3750 3520
3750 – 4000 3360
4000 – 4250 8000
4250 – 4500 4680
4500 – 4750 1880
4750 – 5000 11400
5000 – 5250 3240
5250 – 5500 3360
5500 – 5750 5520
5750 – 6000 1760
6000 – 6250 280
6250 – 6500 40
6500 – 6750 40
6750 – 7000 0
7000 – 7250 0
7250 – 7500 80
7500 – 7750 80
7750 – 8000 0
8000 – 8250 0
8250 – 8500 0
8500 – 8750 80
8750 – 9000 80
9000 – 9250 40
9250 – 9500 0
9500 – 9750 40
9750 – 10000 0
10000 – 10250 120
10250 – 10500 0
10500 – 10750 0
10750 – 11000 40
Fig 3.
year · Highlights the strong left skew toward recent years with a long historical tail back to 1875.
Histogram bins for year (median: 2016).
bin count
1875 – 1879 40
1879 – 1882 120
1882 – 1886 320
1886 – 1890 80
1890 – 1894 120
1894 – 1897 0
1897 – 1901 0
1901 – 1905 80
1905 – 1909 40
1909 – 1912 120
1912 – 1916 0
1916 – 1920 0
1920 – 1923 0
1923 – 1927 160
1927 – 1931 400
1931 – 1935 240
1935 – 1938 80
1938 – 1942 0
1942 – 1946 0
1946 – 1950 40
1950 – 1953 80
1953 – 1957 40
1957 – 1961 120
1961 – 1964 1680
1964 – 1968 840
1968 – 1972 1520
1972 – 1976 1840
1976 – 1979 1360
1979 – 1983 960
1983 – 1987 2160
1987 – 1990 2880
1990 – 1994 5200
1994 – 1998 8520
1998 – 2002 13760
2002 – 2005 10040
2005 – 2009 7360
2009 – 2013 7280
2013 – 2017 98920
2017 – 2020 14120
2020 – 2024 12240
Fig 4.
country · Exposes the heavy Australia bias and the large share of records with no country recorded.
Top values for country (20 unique shown, of 57 total).
value count share
(empty string) 103840 51.9%
Australia 79320 39.7%
United States 8160 4.1%
New Zealand 1320 0.7%
USA 680 0.3%
Antarctica 680 0.3%
Colombia 640 0.3%
Chile 520 0.3%
Bermuda 400 0.2%
Portugal 320 0.2%
UNITED STATES 320 0.2%
Ross Dependency 240 0.1%
Russia 240 0.1%
United States of America 240 0.1%
GREAT BRITAIN 200 0.1%
Ecuador 160 0.1%
Bahamas 160 0.1%
Italy 160 0.1%
CO 160 0.1%
Discovery Deep, Red Sea 160 0.1%
Fig 5.
species · Illustrates how 73% of rows have no species assignment, with only a handful of named species dominating the rest.
Top values for species (20 unique shown, of 678 total).
value count share
(empty string) 146400 73.2%
Amperima rosea 4520 2.3%
Xiphias gladius 2000 1.0%
Scomber scombrus 1520 0.8%
Prionace glauca 1080 0.5%
Oneirophanta mutabilis 840 0.4%
Thunnus albacares 760 0.4%
Farrea occa 680 0.3%
Trissopathes pseudotristicha 640 0.3%
Hoplostethus atlanticus 520 0.3%
Trachurus trachurus 480 0.2%
Florometra serratissima 440 0.2%
Heteropolypus ritteri 400 0.2%
Desmophyllum dianthus 400 0.2%
Psychropotes longicauda 360 0.2%
Thunnus obesus 320 0.2%
Solenosmilia variabilis 320 0.2%
Etmopterus granulosus 320 0.2%
Molpadiodemas villosus 280 0.1%
Paragorgia arborea 280 0.1%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column kind null %
scientificName categorical 0.0%
species categorical 0.0%
genus categorical 0.0%
family categorical 0.0%
order categorical 0.0%
class categorical 0.0%
phylum categorical 0.0%
latitude numeric 0.0%
longitude numeric 0.0%
depth numeric 0.0%
year numeric 3.6%
country categorical 0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
column latitude longitude depth year
latitude +1.00 -0.23 +0.02 -0.05
longitude -0.23 +1.00 -0.01 +0.02
depth +0.02 -0.01 +1.00 +0.02
year -0.05 +0.02 +0.02 +1.00

scientificName categorical label

Taxonomic name field mixing ranks from class (Alphaproteobacteria, 8.82% of rows) and domain (Bacteria) down to species (Xiphias gladius, Amperima rosea), across 1,478 distinct values with no nulls. The rank inconsistency is the headline issue: aggregating or joining on this column will conflate broad clades with individual species. Entropy ratio of 0.80 shows the distribution is fairly diffuse despite the dominant top value.

Treatment: Normalise to a consistent taxonomic rank (or join to a taxonomy table) before grouping or modelling.

anthropic:claude-opus-4-7 · confidence high
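
One way to approach the rank normalisation suggested in the treatment is to label each record with the rank its scientificName sits at, by comparing the name with the rank columns of the same row. This is only a sketch: it assumes the name matches one of those columns verbatim, which has not been verified against the file, and `infer_name_rank` and the sample rows are hypothetical.

```python
# Rank columns checked from most to least specific
RANK_COLUMNS = ["species", "genus", "family", "order", "class", "phylum"]


def infer_name_rank(row):
    """Return the rank column whose value equals scientificName, else 'unresolved'."""
    name = (row.get("scientificName") or "").strip()
    if not name:
        return "unresolved"
    for rank in RANK_COLUMNS:
        if (row.get(rank) or "").strip() == name:
            return rank
    # Names above phylum (e.g. a domain) or with no matching column end up here
    return "unresolved"


# Hypothetical rows for illustration only
rows = [
    {"scientificName": "Xiphias gladius", "species": "Xiphias gladius", "genus": "Xiphias"},
    {"scientificName": "Alphaproteobacteria", "species": "", "class": "Alphaproteobacteria"},
    {"scientificName": "Bacteria", "species": "", "phylum": ""},
]

ranks = [infer_name_rank(r) for r in rows]
print(ranks)  # ['species', 'class', 'unresolved']
```

Grouping on the inferred rank first, and only then on the name, avoids conflating a class-level record with a species-level one.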
Out[13]:

saturn.columns["scientificName"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 1,478
top_value Alphaproteobacteria
top_rate 0.0882
cardinality 1,478
entropy 8.378
entropy_ratio 0.7956
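
The entropy figures in this table are consistent with Shannon entropy in bits divided by its maximum, log2 of the number of distinct values: 8.378 / log2(1478) ≈ 0.7956. The sketch below computes that ratio for an arbitrary list of values. It is an assumption about how the statistic is defined, inferred from the reported numbers, and not saturn's own code.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy (bits) of `values`, normalised by log2(number of distinct values)."""
    counts = Counter(values)
    total = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 0.0  # a constant (or empty) column carries no information
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(k)


# Consistency check against the reported stats for scientificName
reported_ratio = 8.378 / math.log2(1478)
print(round(reported_ratio, 4))

# A uniform distribution reaches the maximum ratio of 1.0
print(entropy_ratio(["a", "b", "c", "d"]))
```

A ratio near 1 means values are spread evenly across categories; a ratio near 0 means a few values dominate.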
Fig 8.
Top values for scientificName.
Top values for scientificName (20 unique shown, of 1478 total).
value count share
Alphaproteobacteria 17640 8.8%
Bacteria 12760 6.4%
Nitrosopumilaceae 10840 5.4%
Syndiniales 7280 3.6%
Amperima rosea 4520 2.3%
Porifera 2400 1.2%
Thermoplasmata 2360 1.2%
Keratoisididae 2320 1.2%
Xiphias gladius 2000 1.0%
Pseudomonadales 1920 1.0%
Gammaproteobacteria 1920 1.0%
Monothalamea 1640 0.8%
Rhodospirillales 1640 0.8%
Scomber scombrus 1520 0.8%
Retaria 1520 0.8%
Dinophyceae 1320 0.7%
Rickettsiales 1200 0.6%
Chrysogorgia 1160 0.6%
Hexactinellida 1080 0.5%
Prionace glauca 1080 0.5%

species categorical label

Categorical species identifier with 677 distinct binomial names (e.g., Amperima rosea, Xiphias gladius, Prionace glauca) covering marine taxa; the 678th distinct value is the empty string. That empty string is the dominant value at 73.2% of 200,000 rows (146,400), so species is unrecorded for nearly three quarters of observations despite a reported null_rate of 0.0. Among the 53,600 labelled rows the distribution is long-tailed: the largest species, Amperima rosea, has 4,520 occurrences (about 8.4% of labelled rows), and the overall entropy_ratio is 0.327.

Treatment: Treat empty strings as missing, then group rare categories or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
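
The first step of the treatment, recoding blanks as missing, might look like the sketch below. It assumes each record is a dict keyed by column name; `blanks_to_none`, `TAXON_COLUMNS`, and the sample record are illustrative names and data, not part of saturn or the source file.

```python
TAXON_COLUMNS = ["species", "genus", "family", "order", "class", "phylum"]


def blanks_to_none(row, columns=TAXON_COLUMNS):
    """Return a copy of `row` with blank strings in `columns` replaced by None."""
    cleaned = dict(row)  # shallow copy so the input record is left untouched
    for col in columns:
        value = cleaned.get(col)
        if isinstance(value, str) and not value.strip():
            cleaned[col] = None
    return cleaned


# Hypothetical record for illustration only
record = {"species": "", "genus": "Thunnus", "family": "Scombridae"}
cleaned = blanks_to_none(record)
print(cleaned)
```

Recoding before any grouping or encoding keeps the blank bucket from being treated as the most frequent "species".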
Out[16]:

saturn.columns["species"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 678
top_value (empty string)
top_rate 0.732
cardinality 678
entropy 3.077
entropy_ratio 0.3272
Fig 9.
Top values for species.
Top values for species (20 unique shown, of 678 total).
value count share
(empty string) 146400 73.2%
Amperima rosea 4520 2.3%
Xiphias gladius 2000 1.0%
Scomber scombrus 1520 0.8%
Prionace glauca 1080 0.5%
Oneirophanta mutabilis 840 0.4%
Thunnus albacares 760 0.4%
Farrea occa 680 0.3%
Trissopathes pseudotristicha 640 0.3%
Hoplostethus atlanticus 520 0.3%
Trachurus trachurus 480 0.2%
Florometra serratissima 440 0.2%
Heteropolypus ritteri 400 0.2%
Desmophyllum dianthus 400 0.2%
Psychropotes longicauda 360 0.2%
Thunnus obesus 320 0.2%
Solenosmilia variabilis 320 0.2%
Etmopterus granulosus 320 0.2%
Molpadiodemas villosus 280 0.1%
Paragorgia arborea 280 0.1%

genus categorical label

Taxonomic genus label for what appears to be a marine biology dataset (Amperima, Xiphias, Scomber, Thunnus, Prionace). The dominant signal is missingness encoded as empty string: 109,800 of 200,000 rows (54.9%) have no genus assigned, despite a stated null_rate of 0.0. Across the remaining records, 840 distinct genera spread fairly thin, with Amperima the largest non-empty bucket at 4,520.

Treatment: Recode empty strings as nulls, then group rare genera or roll up to a higher taxonomic rank before encoding.

anthropic:claude-opus-4-7 · confidence high
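
Rolling up to a higher rank, as the treatment suggests, can be done by taking the most specific rank that is actually populated in each row. A minimal sketch under the assumption that blanks mark missing ranks; `lowest_available_taxon` and the sample rows are hypothetical.

```python
# Ranks tried in order, from most to least specific
ROLLUP_ORDER = ["genus", "family", "order", "class", "phylum"]


def lowest_available_taxon(row):
    """Return (rank, name) for the most specific non-blank rank, or (None, None)."""
    for rank in ROLLUP_ORDER:
        value = (row.get(rank) or "").strip()
        if value:
            return rank, value
    return None, None


# Hypothetical rows for illustration only
with_genus = {"genus": "Amperima", "family": "Elpidiidae"}
family_only = {"genus": "", "family": "Nitrosopumilaceae", "order": "Nitrosopumilales"}
unassigned = {"genus": "", "family": "", "order": "", "class": "", "phylum": ""}

print(lowest_available_taxon(with_genus))
print(lowest_available_taxon(family_only))
print(lowest_available_taxon(unassigned))
```

Keeping the rank alongside the name lets later grouping distinguish a genus-level label from a family-level fallback.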
Out[19]:

saturn.columns["genus"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 841
top_value (empty string)
top_rate 0.549
cardinality 841
entropy 4.896
entropy_ratio 0.5039
Fig 10.
Top values for genus.
Top values for genus (20 unique shown, of 841 total).
value count share
(empty string) 109800 54.9%
Amperima 4520 2.3%
Xiphias 2000 1.0%
Scomber 1520 0.8%
Retaria 1520 0.8%
Farrea 1400 0.7%
Thunnus 1360 0.7%
Chrysogorgia 1360 0.7%
Coryphaenoides 1240 0.6%
Prionace 1080 0.5%
Hemicorallium 1000 0.5%
Alteromonas 960 0.5%
Paragorgia 880 0.4%
Oneirophanta 840 0.4%
Lepidisis 800 0.4%
Trissopathes 800 0.4%
Alepisaurus 760 0.4%
Keratoisis 720 0.4%
Pennatula 600 0.3%
Hoplostethus 600 0.3%

family categorical feature

Taxonomic family classification with 606 distinct values across 200,000 records and zero nulls: 605 named families plus the empty string. The empty string is the dominant value at 40.18% (80,360 rows), effectively a hidden missing-data category, while the largest named family, Nitrosopumilaceae, covers only 5.42% (10,840). The remaining tail spans marine taxa (Elpidiidae, Keratoisididae, Coralliidae, Macrouridae, Scombridae), consistent with a deep-sea or oceanographic biodiversity dataset.

Treatment: Recode empty strings as missing, then group rare families into an 'other' bucket before encoding.

anthropic:claude-opus-4-7 · confidence high
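
Grouping rare families into an 'other' bucket can be sketched as follows. The threshold, the helper name `lump_rare`, and the sample list are illustrative choices; missing values (None) are passed through rather than lumped, on the assumption that blanks were recoded beforehand.

```python
from collections import Counter


def lump_rare(values, min_count, other_label="other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other_label
        for v in values
    ]


# Hypothetical sample for illustration only
families = ["Elpidiidae"] * 3 + ["Macrouridae"] * 2 + ["Xiphiidae", None]
lumped = lump_rare(families, min_count=2)
print(lumped)
```

With several hundred named families, a count threshold chosen from the data keeps one-hot encodings to a manageable width.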
Out[22]:

saturn.columns["family"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 606
top_value (empty string)
top_rate 0.4018
cardinality 606
entropy 5.5
entropy_ratio 0.595
Fig 11.
Top values for family.
Top values for family (20 unique shown, of 606 total).
value count share
(empty string) 80360 40.2%
Nitrosopumilaceae 10840 5.4%
Elpidiidae 4920 2.5%
Keratoisididae 4360 2.2%
Coralliidae 4040 2.0%
Macrouridae 3120 1.6%
Scombridae 3040 1.5%
Primnoidae 2440 1.2%
Chrysogorgiidae 2320 1.2%
Xiphiidae 2000 1.0%
Retariidae 1520 0.8%
Farreidae 1520 0.8%
Alteromonadaceae 1440 0.7%
Euplectellidae 1360 0.7%
Flavobacteriaceae 1320 0.7%
Schizopathidae 1280 0.6%
Caryophylliidae 1080 0.5%
Carcharhinidae 1080 0.5%
Acanthogorgiidae 1000 0.5%
Nitrospinaceae 1000 0.5%

order categorical feature

Taxonomic order assignments for biological records, with 310 distinct values (309 named orders plus the empty string) spanning octocorals (Scleralcyonacea), archaea (Nitrosopumilales), dinoflagellates (Syndiniales), and various fish and crustacean groups. The dominant value is an empty string at 28.16% (56,320 rows), indicating a large block of unassigned orders rather than true nulls. An entropy ratio of 0.66 indicates a moderately diffuse distribution: the ten largest named orders in Fig 12 account for 59,120 rows, about 41% of the 143,680 labelled records.

Treatment: Recode empty strings as missing, then group rare orders before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["order"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 310
top_value (empty string)
top_rate 0.2816
cardinality 310
entropy 5.45
entropy_ratio 0.6585
Fig 12.
Top values for order.
Top values for order (20 unique shown, of 310 total).
value count share
(empty string) 56320 28.2%
Scleralcyonacea 15720 7.9%
Nitrosopumilales 10840 5.4%
Syndiniales 7960 4.0%
Elasipodida 5680 2.8%
Gadiformes 4040 2.0%
Scombriformes 3360 1.7%
Carangiformes 3240 1.6%
Calanoida 2840 1.4%
Alteromonadales 2720 1.4%
Decapoda 2720 1.4%
Antipatharia 2560 1.3%
Sceptrulophora 2480 1.2%
Pseudomonadales 2400 1.2%
Flavobacteriales 2320 1.2%
Scleractinia 2240 1.1%
Malacalcyonacea 2000 1.0%
Lyssacinosida 2000 1.0%
Rotaliida 2000 1.0%
Amphipoda 1880 0.9%

class categorical label

Taxonomic class assignments across 200,000 records, spanning 138 distinct values with moderate concentration (entropy ratio 0.69, top class 'Alphaproteobacteria' at 11.4%). The mix is biologically heterogeneous, blending bacteria, archaea, fish, corals, and sponges, suggesting a marine biodiversity catalogue rather than a single-domain dataset. Notably, the second most common value is an empty string at 21,920 rows (~11%), which null_rate=0.0 misses because blanks are encoded as strings rather than nulls.

Treatment: Recode empty strings to nulls, then group rare classes before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["class"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 138
top_value Alphaproteobacteria
top_rate 0.1142
cardinality 138
entropy 4.892
entropy_ratio 0.6881
Fig 13.
Top values for class.
Top values for class (20 unique shown, of 138 total).
value count share
Alphaproteobacteria 22840 11.4%
(empty string) 21920 11.0%
Teleostei 19120 9.6%
Octocorallia 17880 8.9%
Thaumarchaeota incertae sedis 10840 5.4%
Dinophyceae 10600 5.3%
Gammaproteobacteria 9440 4.7%
Holothuroidea 7760 3.9%
Malacostraca 7440 3.7%
Hexactinellida 6320 3.2%
Hexacorallia 5640 2.8%
Copepoda 4080 2.0%
Ophiuroidea 3320 1.7%
Polychaeta 3320 1.7%
Polycystina 2800 1.4%
Elasmobranchii 2640 1.3%
Globothalamea 2520 1.3%
Deltaproteobacteria 2440 1.2%
Thermoplasmata 2360 1.2%
Flavobacteria 2320 1.2%

phylum categorical feature

Taxonomic phylum labels across 200,000 records spanning 65 distinct values, led by Proteobacteria at 17.74% and followed by Cnidaria, Chordata, and Echinodermata. The mix of bacterial, archaeal, and animal phyla (e.g., Thaumarchaeota alongside Chordata) suggests a broad biodiversity or environmental-sequencing dataset rather than a single kingdom. Notably, 13,200 rows carry an empty-string phylum — a hidden null channel despite the reported null_rate of 0.0.

Treatment: Recode empty strings to null and group-encode (target or frequency encoding) given 65 categories.

anthropic:claude-opus-4-7 · confidence high
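
Frequency encoding, one of the options named in the treatment, replaces each category with its share among the non-missing rows. A minimal sketch, assuming blanks have already been recoded to None; `frequency_encode` and the sample list are illustrative.

```python
from collections import Counter


def frequency_encode(values):
    """Map each non-missing category to its relative frequency; keep None as None."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n = len(present)
    return [counts[v] / n if v is not None else None for v in values]


# Hypothetical sample for illustration only
phyla = ["Proteobacteria", "Proteobacteria", "Cnidaria", None, "Chordata"]
encoded = frequency_encode(phyla)
print(encoded)  # [0.5, 0.5, 0.25, None, 0.25]
```

Unlike one-hot encoding, this yields a single numeric column regardless of how many of the 65 categories are present.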
Out[31]:

saturn.columns["phylum"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 65
top_value Proteobacteria
top_rate 0.1774
cardinality 65
entropy 4.095
entropy_ratio 0.6799
Fig 14.
Top values for phylum.
Top values for phylum (20 unique shown, of 65 total).
value count share
Proteobacteria 35480 17.7%
Cnidaria 25520 12.8%
Chordata 23920 12.0%
Echinodermata 14000 7.0%
(empty string) 13200 6.6%
Arthropoda 13200 6.6%
Myzozoa 11280 5.6%
Thaumarchaeota 10920 5.5%
Porifera 10360 5.2%
Foraminifera 4720 2.4%
Annelida 4240 2.1%
Radiozoa 4000 2.0%
Mollusca 3840 1.9%
Bacteroidetes 3440 1.7%
Euryarchaeota 2440 1.2%
Planctomycetes 2320 1.2%
Heterokontophyta 1680 0.8%
Verrucomicrobia 1520 0.8%
Brachiopoda 1520 0.8%
Nematoda 1160 0.6%

latitude numeric feature

Geographic latitude in degrees, ranging from -75.0 to 89.06 across 200000 rows with no nulls and only 2617 unique values, suggesting coordinates snapped to a coarse grid. The distribution is nearly symmetric (skew 0.12) but platykurtic (kurtosis -1.22) with a wide IQR of 71.97, indicating fairly uniform global coverage rather than a concentration near populated bands. Mean (-1.58) and median (-4.998) sit just south of the equator.

Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than using raw degrees in linear models.

anthropic:claude-opus-4-7 · confidence high
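
One common way to project coordinates instead of feeding raw degrees to a model is to map each latitude/longitude pair onto the unit sphere. The sketch below shows that projection; it is one possible reading of "projecting" in the treatment, and `to_unit_sphere` is a hypothetical helper.

```python
import math


def to_unit_sphere(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# The equator at the prime meridian maps to the x axis
print(to_unit_sphere(0.0, 0.0))  # (1.0, 0.0, 0.0)

# The extreme coordinates reported for this dataset still land on the sphere
x, y, z = to_unit_sphere(-75.0, 179.9985)
print(x * x + y * y + z * z)
```

Euclidean distance between two such points grows monotonically with great-circle distance, which raw degree differences do not guarantee.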
Out[34]:

saturn.columns["latitude"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 2,617
min -75
max 89.06
mean -1.581
median -4.998
std 39.48
q1 -36.25
q3 35.72
iqr 71.97
skew 0.1182
kurtosis -1.223
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 15.
Distribution of latitude. Vertical dash marks the median.
Histogram bins for latitude (median: -4.99815).
bin count
-75 – -70.9 640
-70.9 – -66.8 440
-66.8 – -62.7 6000
-62.7 – -58.59 5160
-58.59 – -54.49 4960
-54.49 – -50.39 4360
-50.39 – -46.29 4360
-46.29 – -42.19 8560
-42.19 – -38.09 5560
-38.09 – -33.99 16440
-33.99 – -29.88 13320
-29.88 – -25.78 10400
-25.78 – -21.68 3480
-21.68 – -17.58 2520
-17.58 – -13.48 4280
-13.48 – -9.376 4280
-9.376 – -5.275 4440
-5.275 – -1.173 6080
-1.173 – 2.928 2160
2.928 – 7.03 1080
7.03 – 11.13 5280
11.13 – 15.23 5680
15.23 – 19.33 1960
19.33 – 23.44 5960
23.44 – 27.54 9160
27.54 – 31.64 7680
31.64 – 35.74 7480
35.74 – 39.84 12840
39.84 – 43.94 7000
43.94 – 48.04 5880
48.04 – 52.15 11360
52.15 – 56.25 2480
56.25 – 60.35 920
60.35 – 64.45 1160
64.45 – 68.55 1200
68.55 – 72.65 920
72.65 – 76.76 920
76.76 – 80.86 1600
80.86 – 84.96 1200
84.96 – 89.06 800

longitude numeric feature

This is a geographic longitude feature spanning the full valid range from -179.9872 to 179.9985 across 200000 rows with no nulls. The distribution is wide (std 127.09) and platykurtic (kurtosis -1.12) with a median of -94.29 and Q1 at -169.995, suggesting a heavy concentration in the western hemisphere and Pacific quadrant rather than a uniform global spread. Only 2654 unique values indicate the coordinates are quantised rather than raw floats.

Treatment: Pair with latitude for geospatial features; avoid treating as a plain scalar in models.

anthropic:claude-opus-4-7 · confidence high
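
Longitude wraps at ±180°, which is why it should not be treated as a plain scalar: the reported minimum (-179.9872) and maximum (179.9985) are numerically far apart but geographically adjacent. The sketch below illustrates this with a wrap-aware difference and a sine/cosine encoding; both helper names are hypothetical.

```python
import math


def lon_delta(a, b):
    """Smallest absolute difference in degrees between two longitudes."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def encode_longitude(lon_deg):
    """Cyclic (sin, cos) encoding so that -180 and +180 map to the same point."""
    r = math.radians(lon_deg)
    return math.sin(r), math.cos(r)


# The dataset's extreme longitudes are only about 0.0143 degrees apart
print(lon_delta(-179.9872, 179.9985))
print(encode_longitude(0.0))  # (0.0, 1.0)
```

The naive difference between those two extremes would be almost 360 degrees.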
Out[37]:

saturn.columns["longitude"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 2,654
min -180
max 180
mean -51.58
median -94.29
std 127.1
q1 -170
q3 63.76
iqr 233.8
skew 0.6619
kurtosis -1.123
n_outliers 0
outlier_rate 0
zero_rate 0.0004
Fig 16.
Distribution of longitude. Vertical dash marks the median.
Histogram bins for longitude (median: -94.29).
bin count
-180 – -171 17200
-171 – -162 53840
-162 – -153 2320
-153 – -144 680
-144 – -135 520
-135 – -126 3240
-126 – -117 17080
-117 – -108 4360
-108 – -98.99 320
-98.99 – -89.99 3760
-89.99 – -80.99 4040
-80.99 – -71.99 4120
-71.99 – -62.99 5680
-62.99 – -53.99 1840
-53.99 – -44.99 3280
-44.99 – -35.99 2240
-35.99 – -26.99 1280
-26.99 – -17.99 2360
-17.99 – -8.994 12680
-8.994 – 0.00565 2440
0.00565 – 9.005 2760
9.005 – 18 640
18 – 27 560
27 – 36 680
36 – 45 560
45 – 54 520
54 – 63 880
63 – 72 920
72 – 81 480
81 – 90 3320
90 – 99 680
99 – 108 160
108 – 117 1160
117 – 126 1200
126 – 135 1080
135 – 144 1000
144 – 153 14840
153 – 162 19120
162 – 171 1520
171 – 180 4640

depth numeric feature

Sampling depth in metres, ranging from 1,000 to 11,000 with mean 2,405.74 and median 1,961.70, i.e. from the bathyal zone down to hadal depths. The distribution is right-skewed (skew 1.09) with IQR 2,174.25 and a low 0.28% outlier rate (560 rows), and only 1,938 unique values across 200,000 rows point to discretised or rounded measurements. No nulls or zeros are present.

Treatment: Consider a log or sqrt transform before regression to tame the right skew.

anthropic:claude-opus-4-7 · confidence high
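
The effect of the suggested log transform can be checked by comparing skewness before and after. The sketch below uses the population (biased) moment estimator, which may differ from the one saturn reports, and a made-up list of depths within the dataset's 1,000 to 11,000 m range.

```python
import math


def skewness(xs):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed depths in metres, not taken from the file
depths = [1000, 1100, 1200, 1500, 2000, 2200, 4000, 5000, 11000]

raw_skew = skewness(depths)
log_skew = skewness([math.log(d) for d in depths])  # safe: the minimum depth is 1,000
print(raw_skew, log_skew)
```

Because the column has no zeros or negatives, the logarithm needs no offset here.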
Out[40]:

saturn.columns["depth"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 1,938
min 1,000
max 11,000
mean 2406
median 1962
std 1477
q1 1,149
q3 3323
iqr 2174
skew 1.091
kurtosis 0.5022
n_outliers 560
outlier_rate 0.0028
zero_rate 0
Fig 17.
Distribution of depth. Vertical dash marks the median.
Histogram bins for depth (median: 1961.695).
bin count
1000 – 1250 58400
1250 – 1500 15280
1500 – 1750 10920
1750 – 2000 20200
2000 – 2250 23560
2250 – 2500 5720
2500 – 2750 5320
2750 – 3000 4360
3000 – 3250 5080
3250 – 3500 3520
3500 – 3750 3520
3750 – 4000 3360
4000 – 4250 8000
4250 – 4500 4680
4500 – 4750 1880
4750 – 5000 11400
5000 – 5250 3240
5250 – 5500 3360
5500 – 5750 5520
5750 – 6000 1760
6000 – 6250 280
6250 – 6500 40
6500 – 6750 40
6750 – 7000 0
7000 – 7250 0
7250 – 7500 80
7500 – 7750 80
7750 – 8000 0
8000 – 8250 0
8250 – 8500 0
8500 – 8750 80
8750 – 9000 80
9000 – 9250 40
9250 – 9500 0
9500 – 9750 40
9750 – 10000 0
10000 – 10250 120
10250 – 10500 0
10500 – 10750 0
10750 – 11000 40

year numeric feature

This is the observation year of each record, spanning 1875 to 2024 with 98 distinct values. The distribution is heavily left-skewed (skew -3.57, kurtosis 19.6) with a median of 2016 but a mean pulled down to 2008.9, and 6.27% of non-null values (12,080) are flagged as outliers: a long tail of older entries against a modern-heavy mass. About 3.6% of rows (7,240) are null.

Treatment: Impute the 3.6% nulls and consider bucketing into decades or capping the pre-1980 tail before modelling.

anthropic:claude-opus-4-7 · confidence high
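
The three steps in the treatment (impute, cap the early tail, bucket into decades) can be combined as in the sketch below. Median imputation and the 1980 floor are illustrative choices taken from the treatment text, `prepare_years` is a hypothetical helper, and the sample list is made up.

```python
import statistics


def prepare_years(years, floor=1980):
    """Return (year, decade, was_imputed) tuples with nulls filled and old years capped."""
    known = [y for y in years if y is not None]
    fill = int(statistics.median(known))  # assumes at least one known year
    prepared = []
    for y in years:
        imputed = y is None
        value = fill if imputed else y
        value = max(value, floor)  # cap the long pre-`floor` tail
        prepared.append((value, value // 10 * 10, imputed))
    return prepared


# Hypothetical sample for illustration only
sample = [1875, 1999, 2016, None, 2016, 2021]
result = prepare_years(sample)
print(result)
```

Keeping the `was_imputed` flag lets a downstream model distinguish observed years from filled ones.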
Out[43]:

saturn.columns["year"].stats

stat value
n 200,000
nulls 7,240 (3.6%)
unique 98
min 1875
max 2024
mean 2009
median 2016
std 15.43
q1 2004
q3 2016
iqr 12
skew -3.574
kurtosis 19.65
n_outliers 12,080
outlier_rate 0.06267
zero_rate 0
alert: high_skew · skew=-3.57
alert: outliers · 6.3% rows beyond 1.5 IQR
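
Assuming the outlier alert uses the standard Tukey rule (values more than 1.5 × IQR outside the quartiles), the fences follow directly from the stats above: with q1 = 2004 and q3 = 2016 the IQR is 12, so any year before 1986 counts as an outlier. A short sketch of that arithmetic, with `tukey_fences` as a hypothetical helper:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = tukey_fences(2004, 2016)
print(lower, upper)  # 1986.0 2034.0
```

Since the maximum year is 2024, only the lower fence is ever crossed, which matches the long historical tail in the histogram.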
Fig 18.
Distribution of year. Vertical dash marks the median.
Histogram bins for year (median: 2016).
bin count
1875 – 1879 40
1879 – 1882 120
1882 – 1886 320
1886 – 1890 80
1890 – 1894 120
1894 – 1897 0
1897 – 1901 0
1901 – 1905 80
1905 – 1909 40
1909 – 1912 120
1912 – 1916 0
1916 – 1920 0
1920 – 1923 0
1923 – 1927 160
1927 – 1931 400
1931 – 1935 240
1935 – 1938 80
1938 – 1942 0
1942 – 1946 0
1946 – 1950 40
1950 – 1953 80
1953 – 1957 40
1957 – 1961 120
1961 – 1964 1680
1964 – 1968 840
1968 – 1972 1520
1972 – 1976 1840
1976 – 1979 1360
1979 – 1983 960
1983 – 1987 2160
1987 – 1990 2880
1990 – 1994 5200
1994 – 1998 8520
1998 – 2002 13760
2002 – 2005 10040
2005 – 2009 7360
2009 – 2013 7280
2013 – 2017 98920
2017 – 2020 14120
2020 – 2024 12240

country categorical feature

Country of origin as a free-form categorical with 57 distinct values, dominated by an empty string at 51.92% (103,840 rows) which functions as an undeclared missing token rather than a true null (null_rate is 0.0). Australia accounts for the bulk of populated values (79,320 of 96,160, about 82%), making the dataset overwhelmingly Australia-centric, and the entropy_ratio of 0.277 confirms heavy concentration. The United States is encoded inconsistently as 'United States' (8,160), 'USA' (680), 'UNITED STATES' (320), and 'United States of America' (240), and some values are codes or localities rather than country names ('CO', 'Discovery Deep, Red Sea').

Treatment: Recode empty strings as missing, canonicalise duplicates like 'USA'/'United States', then group rare countries before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
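
Canonicalising the duplicate spellings can be done with a small alias table applied after whitespace and case normalisation. The sketch below covers only the United States variants visible in the top-values table; the alias map and `canonical_country` are illustrative, and ambiguous entries such as 'CO' are deliberately left unchanged.

```python
# Aliases keyed by case-folded, whitespace-normalised spelling
COUNTRY_ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def canonical_country(raw):
    """Return a canonical country label, or None for missing or blank input."""
    if raw is None:
        return None
    tidy = " ".join(raw.split())  # trim and collapse internal whitespace
    if not tidy:
        return None
    return COUNTRY_ALIASES.get(tidy.casefold(), tidy)


for value in ["UNITED STATES", " usa ", "Australia", "CO", ""]:
    print(repr(value), "->", repr(canonical_country(value)))
```

Merging the four spellings listed in the table would give the United States 9,400 rows (8,160 + 680 + 320 + 240).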
Out[46]:

saturn.columns["country"].stats

stat value
n 200,000
nulls 0 (0.0%)
unique 57
top_value (empty string)
top_rate 0.5192
cardinality 57
entropy 1.616
entropy_ratio 0.277
Fig 19.
Top values for country.
Top values for country (20 unique shown, of 57 total).
value count share
(empty string) 103840 51.9%
Australia 79320 39.7%
United States 8160 4.1%
New Zealand 1320 0.7%
USA 680 0.3%
Antarctica 680 0.3%
Colombia 640 0.3%
Chile 520 0.3%
Bermuda 400 0.2%
Portugal 320 0.2%
UNITED STATES 320 0.2%
Ross Dependency 240 0.1%
Russia 240 0.1%
United States of America 240 0.1%
GREAT BRITAIN 200 0.1%
Ecuador 160 0.1%
Bahamas 160 0.1%
Italy 160 0.1%
CO 160 0.1%
Discovery Deep, Red Sea 160 0.1%

How to cite

BibTeX
@misc{saturn-quirky-deep-sea-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky deep sea},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-deep_sea}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky deep sea. Source: /home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-deep_sea