This is a 200,000-row deep-sea biodiversity dataset with 12 columns covering taxonomy (phylum, class, order, family, genus, species, scientificName), geography (country, latitude, longitude), depth, and observation year. Two things stand out. First, the taxonomic hierarchy is heavily incomplete at lower ranks: species is blank in 73.2% of rows and genus in 54.9%, so most records can only be analyzed at higher ranks such as phylum (top: Proteobacteria at 17.7%) or class (top: Alphaproteobacteria at 11.4%). Second, country is blank in 51.9% of rows, and Australia dominates the populated entries with 79,320 of the 96,160 labelled records (about 82%), suggesting a strong sampling bias. Year is left-skewed (skew -3.57): most records are recent, with a long tail back to 1875. Depth ranges from 1,000 to 11,000 m with a median near 1,962 m. Check the missingness in species and country and the geographic concentration before any biodiversity analysis.
saturn
/home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json 200,000 rows sample n=200,000 seed 42 2026-05-01T23:35:12+00:00
Overview
| Source | /home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json |
| Total rows | 200,000 |
| Profiled sample | 200,000 |
| Columns | 12 |
| Generated | 2026-05-01T23:35:12+00:00 |
Insights opt-in
Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.
Taxonomic name field mixing ranks from class (Alphaproteobacteria, 8.82% of rows) and domain (Bacteria) down to species (Xiphias gladius, Amperima rosea), across 1,478 distinct values with no nulls. The rank inconsistency is the headline issue: aggregating or joining on this column will conflate broad clades with individual species. Entropy ratio of 0.80 shows the distribution is fairly diffuse despite the dominant top value.
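One way to quantify the rank mixing is to label each row with the rank column whose value equals scientificName. The sketch below assumes rows are available as dictionaries keyed by the column names in this profile. `infer_name_rank` is an illustrative helper, not part of saturn, and the sample rows are hypothetical combinations of names from this report rather than actual records.

```python
from collections import Counter

# Ranks checked from most to least specific; names follow the profiled columns.
RANKS = ["species", "genus", "family", "order", "class", "phylum"]


def infer_name_rank(row):
    """Return the rank column whose value equals scientificName.

    Returns 'missing' for a blank name and 'unmatched' when no rank column
    carries the same value (for example a domain-level name such as Bacteria).
    """
    name = (row.get("scientificName") or "").strip()
    if not name:
        return "missing"
    for rank in RANKS:
        if (row.get(rank) or "").strip() == name:
            return rank
    return "unmatched"


# Hypothetical rows for illustration only.
rows = [
    {"scientificName": "Amperima rosea", "species": "Amperima rosea",
     "genus": "Amperima", "family": "Elpidiidae", "order": "Elasipodida",
     "class": "Holothuroidea", "phylum": "Echinodermata"},
    {"scientificName": "Alphaproteobacteria", "species": "", "genus": "",
     "family": "", "order": "", "class": "Alphaproteobacteria",
     "phylum": "Proteobacteria"},
    {"scientificName": "Bacteria", "species": "", "genus": "",
     "family": "", "order": "", "class": "", "phylum": ""},
]

rank_counts = Counter(infer_name_rank(r) for r in rows)
print(rank_counts)
```

Grouping by the inferred rank before aggregating or joining keeps clade-level names apart from species-level names.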
Categorical species identifier with 678 distinct binomial names (e.g., Amperima rosea, Xiphias gladius, Prionace glauca) covering marine taxa. The dominant value is an empty string at 73.2% of 200,000 rows, meaning species is unrecorded for nearly three quarters of observations despite a reported null_rate of 0.0. Among labelled rows, distribution is long-tailed with no single species exceeding 4,520 occurrences, and overall entropy_ratio is 0.327.
Taxonomic genus label for what appears to be a marine biology dataset (Amperima, Xiphias, Scomber, Thunnus, Prionace). The dominant signal is missingness encoded as an empty string: 109,800 of 200,000 rows (54.9%) have no genus assigned, despite a stated null_rate of 0.0. The remaining 90,200 records are spread thinly across 840 distinct genera, with Amperima the largest non-empty bucket at 4,520.
Taxonomic family classification, with 606 distinct families across 200,000 records and zero nulls. The dominant value is an empty string at 40.18% (80,360 rows), effectively a hidden missing-data category, while the largest named family, Nitrosopumilaceae, covers only 5.42% (10,840 rows). The remaining tail spans marine taxa (Elpidiidae, Keratoisididae, Coralliidae, Macrouridae, Scombridae), consistent with a deep-sea or oceanographic biodiversity dataset.
Taxonomic order assignments for biological records, with 310 distinct orders spanning corals (Scleralcyonacea), archaea (Nitrosopumilales), dinoflagellates (Syndiniales), and various fish and crustacean groups. The dominant value is an empty string at 28.16% (56,320 rows), indicating a large block of unassigned orders encoded as blanks rather than true nulls. Entropy ratio of 0.66 shows moderate concentration: the ten largest named orders account for 59,120 rows, about 41% of the 143,680 labelled records.
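The profile does not define entropy ratio. A common reading, assumed here, is Shannon entropy divided by its maximum, log(k) for k distinct values, so that 0 means a single value holds every row and 1 means a perfectly uniform spread. `entropy_ratio` below is an illustrative helper under that assumption, not saturn's own code.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy of the value distribution divided by log(k).

    Assumption: k is the number of distinct values, blanks included.
    Returns 0.0 when there is at most one distinct value.
    """
    counts = Counter(values)
    total = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)


# Toy inputs: a uniform column and a column dominated by blanks.
uniform = ["a", "b", "c", "d"]
blank_heavy = [""] * 7 + ["a", "b", "c"]

print(round(entropy_ratio(uniform), 3))
print(round(entropy_ratio(blank_heavy), 3))
```

Because blanks count as a value under this assumption, a large empty-string bucket lowers the ratio; recomputing it on labelled rows only gives a different picture of how concentrated the named taxa are.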
Taxonomic class assignments across 200,000 records, spanning 138 distinct values with moderate concentration (entropy ratio 0.69, top class 'Alphaproteobacteria' at 11.4%). The mix is biologically heterogeneous, blending bacteria, archaea, fish, corals, and sponges, suggesting a marine biodiversity catalogue rather than a single-domain dataset. Notably, the second most common value is an empty string at 21,920 rows (~11%), which null_rate=0.0 misses because blanks are encoded as strings rather than nulls.
Taxonomic phylum labels across 200,000 records spanning 65 distinct values, led by Proteobacteria at 17.74% and followed by Cnidaria, Chordata, and Echinodermata. The mix of bacterial, archaeal, and animal phyla (e.g., Thaumarchaeota alongside Chordata) suggests a broad biodiversity or environmental-sequencing dataset rather than a single kingdom. Notably, 13,200 rows carry an empty-string phylum — a hidden null channel despite the reported null_rate of 0.0.
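Because every taxonomy column reports null_rate 0.0 while carrying blanks, the effective missingness has to be computed by treating empty strings as missing. The sketch below does that in two ways: from the empty-string counts listed in this profile's top-value tables, and with a helper that could be applied to raw column values. The names `is_blank`, `blank_rate`, and `effective_missing` are illustrative.

```python
TOTAL_ROWS = 200_000

# Empty-string counts as reported in the top-value lists of this profile.
BLANK_COUNTS = {
    "species": 146_400,
    "genus": 109_800,
    "family": 80_360,
    "order": 56_320,
    "class": 21_920,
    "phylum": 13_200,
}


def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def blank_rate(values):
    """Share of values that are missing once blanks are counted."""
    values = list(values)
    if not values:
        return 0.0
    return sum(is_blank(v) for v in values) / len(values)


# Effective missing rate per rank, derived from the reported counts.
effective_missing = {col: n / TOTAL_ROWS for col, n in BLANK_COUNTS.items()}
for col, rate in effective_missing.items():
    print(f"{col:8s} {rate:.2%}")
```

The rates fall steadily from species to phylum, which matches the advice to analyse at higher ranks when row coverage matters.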
Geographic latitude in degrees, ranging from -75.0 to 89.06 across 200,000 rows with no nulls and only 2,617 unique values, suggesting coordinates snapped to a coarse grid. The distribution is nearly symmetric (skew 0.12) but platykurtic (kurtosis -1.22) with a wide IQR of 71.97, indicating broad coverage across latitudes rather than concentration in a narrow band. Mean (-1.58) and median (-4.998) sit just south of the equator.
Geographic longitude in degrees, spanning almost the full valid range from -179.9872 to 179.9985 across 200,000 rows with no nulls. The distribution is wide (std 127.09) and platykurtic (kurtosis -1.12). With a median of -94.29 and Q1 at -169.995, at least half of the rows lie in the western hemisphere and a quarter sit between -180 and about -170, close to the antimeridian in the Pacific, rather than spreading uniformly around the globe. Only 2,654 unique values indicate that the coordinates are quantised rather than raw floats.
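A quick way to probe the quantisation suggested by the low unique counts is to tally how many decimal places the coordinates carry. This is a minimal sketch, assuming the values are read as floats or numeric strings; `decimal_places` and `precision_profile` are illustrative names, and the sample uses only the reported latitude and longitude extremes.

```python
from collections import Counter
from decimal import Decimal


def decimal_places(value):
    """Number of significant decimal places in a coordinate.

    Trailing zeros are ignored, so -75.0 counts as 0 places.
    """
    exponent = Decimal(str(value)).normalize().as_tuple().exponent
    return max(0, -exponent)


def precision_profile(values):
    """Tally how many values carry each number of decimal places."""
    return Counter(decimal_places(v) for v in values)


# Reported extremes of latitude and longitude from this profile.
extremes = [-75.0, 89.06, -179.9872, 179.9985]
profile = precision_profile(extremes)
print(profile)
```

Run on the full columns, a profile dominated by zero or one decimal place would support the coarse-grid reading, while a mix of precisions would point to several sources with different conventions.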
Ocean depth in metres, ranging from 1,000.0 to 11,000.0 with mean 2,405.74 and median 1,961.70. The minimum of exactly 1,000 m is consistent with a deep-sea cutoff applied to the records. The distribution is right-skewed (skew 1.09) with IQR 2,174.25 and a low 0.28% outlier rate, and only 1,938 unique values across 200,000 rows point to discretized or rounded measurements. No nulls or zeros are present.
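The profile does not say which rule produces the outlier rate. The sketch below assumes the common Tukey rule, which flags values more than 1.5 × IQR outside the quartiles; the function name and the sample depths are hypothetical.

```python
import statistics


def iqr_outlier_rate(values, k=1.5):
    """Share of values outside the Tukey fences, plus the fences themselves.

    Assumption: quartiles come from statistics.quantiles with its default
    method; other quartile definitions shift the fences slightly.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < low or v > high]
    return len(flagged) / len(values), (low, high)


# Hypothetical depths in metres, with one hadal-zone value.
depths = [1000.0, 1200.0, 1500.0, 1800.0, 2000.0,
          2300.0, 2800.0, 3200.0, 3900.0, 11000.0]
rate, fences = iqr_outlier_rate(depths)
print(rate, fences)
```

For a right-skewed depth column the flagged values are the deepest records, which are real observations in a deep-sea dataset, so a flag here is a prompt to inspect rather than to drop.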
Observation year spanning 1875 to 2024 with 98 distinct values. The distribution is heavily left-skewed (skew -3.57, kurtosis 19.6) with a median of 2016 but a mean pulled down to 2008.9, and 6.27% of values are flagged as outliers: a long tail of older entries against a mass of recent ones. About 3.6% of rows are null.
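The link between a negative skew and a mean below the median can be seen in a small example. The sketch assumes the moment-based (Fisher-Pearson) skewness coefficient; the profile does not state which estimator saturn uses, and the sample years are hypothetical.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


# Hypothetical years: mostly recent, with two much older records.
years = [1875, 1950, 2010, 2014, 2016, 2016, 2018, 2020, 2022, 2024]
skew = sample_skewness(years)

print(round(skew, 2))
print(statistics.fmean(years), statistics.median(years))
```

A few very old records drag the mean well below the median while leaving the median untouched, which is the same pattern the profile reports for this column.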
Country as a free-form categorical with 57 distinct values, dominated by an empty string at 51.92% (103,840 rows), which functions as an undeclared missing token rather than a true null (null_rate is 0.0). Australia accounts for 79,320 of the 96,160 populated rows (about 82%), so the labelled records are overwhelmingly Australian, and the entropy_ratio of 0.277 confirms heavy concentration. The encoding is also inconsistent: the United States appears as 'United States' (8,160), 'USA' (680), 'UNITED STATES' (320), and 'United States of America' (240), and the top values include the two-letter code 'CO' (160) and a locality, 'Discovery Deep, Red Sea' (160), in place of country names.
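The country top-value list shows four spellings of the United States alongside the blank token. A minimal normalisation sketch follows, assuming a hand-maintained alias table; `ALIASES` and `normalize_country` are illustrative, and ambiguous entries such as 'CO' are deliberately left unmapped because the profile does not say what they denote.

```python
from collections import Counter

# Case-insensitive aliases mapped to one canonical label (assumed mapping).
ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def normalize_country(raw):
    """Return a canonical country label, or None for blank values."""
    text = (raw or "").strip()
    if not text:
        return None
    return ALIASES.get(text.casefold(), text)


# Counts copied from the country top-value list in this profile.
reported = {
    "": 103_840,
    "Australia": 79_320,
    "United States": 8_160,
    "USA": 680,
    "UNITED STATES": 320,
    "United States of America": 240,
    "CO": 160,
}

merged = Counter()
for label, count in reported.items():
    merged[normalize_country(label)] += count

print(merged.most_common())
```

After merging, the four United States spellings sum to 9,400 rows, still far behind Australia, so the geographic concentration holds once the encoding is cleaned.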
Numeric correlation
scientificName categorical
Top values (rank 1–20)
- Alphaproteobacteria — 17,640
- Bacteria — 12,760
- Nitrosopumilaceae — 10,840
- Syndiniales — 7,280
- Amperima rosea — 4,520
- Porifera — 2,400
- Thermoplasmata — 2,360
- Keratoisididae — 2,320
- Xiphias gladius — 2,000
- Pseudomonadales — 1,920
- Gammaproteobacteria — 1,920
- Monothalamea — 1,640
- Rhodospirillales — 1,640
- Scomber scombrus — 1,520
- Retaria — 1,520
- Dinophyceae — 1,320
- Rickettsiales — 1,200
- Chrysogorgia — 1,160
- Hexactinellida — 1,080
- Prionace glauca — 1,080
species categorical
Top values (rank 1–20)
- (empty string) — 146,400
- Amperima rosea — 4,520
- Xiphias gladius — 2,000
- Scomber scombrus — 1,520
- Prionace glauca — 1,080
- Oneirophanta mutabilis — 840
- Thunnus albacares — 760
- Farrea occa — 680
- Trissopathes pseudotristicha — 640
- Hoplostethus atlanticus — 520
- Trachurus trachurus — 480
- Florometra serratissima — 440
- Heteropolypus ritteri — 400
- Desmophyllum dianthus — 400
- Psychropotes longicauda — 360
- Thunnus obesus — 320
- Solenosmilia variabilis — 320
- Etmopterus granulosus — 320
- Molpadiodemas villosus — 280
- Paragorgia arborea — 280
genus categorical
Top values (rank 1–20)
- (empty string) — 109,800
- Amperima — 4,520
- Xiphias — 2,000
- Scomber — 1,520
- Retaria — 1,520
- Farrea — 1,400
- Thunnus — 1,360
- Chrysogorgia — 1,360
- Coryphaenoides — 1,240
- Prionace — 1,080
- Hemicorallium — 1,000
- Alteromonas — 960
- Paragorgia — 880
- Oneirophanta — 840
- Lepidisis — 800
- Trissopathes — 800
- Alepisaurus — 760
- Keratoisis — 720
- Pennatula — 600
- Hoplostethus — 600
family categorical
Top values (rank 1–20)
- (empty string) — 80,360
- Nitrosopumilaceae — 10,840
- Elpidiidae — 4,920
- Keratoisididae — 4,360
- Coralliidae — 4,040
- Macrouridae — 3,120
- Scombridae — 3,040
- Primnoidae — 2,440
- Chrysogorgiidae — 2,320
- Xiphiidae — 2,000
- Retariidae — 1,520
- Farreidae — 1,520
- Alteromonadaceae — 1,440
- Euplectellidae — 1,360
- Flavobacteriaceae — 1,320
- Schizopathidae — 1,280
- Caryophylliidae — 1,080
- Carcharhinidae — 1,080
- Acanthogorgiidae — 1,000
- Nitrospinaceae — 1,000
order categorical
Top values (rank 1–20)
- (empty string) — 56,320
- Scleralcyonacea — 15,720
- Nitrosopumilales — 10,840
- Syndiniales — 7,960
- Elasipodida — 5,680
- Gadiformes — 4,040
- Scombriformes — 3,360
- Carangiformes — 3,240
- Calanoida — 2,840
- Alteromonadales — 2,720
- Decapoda — 2,720
- Antipatharia — 2,560
- Sceptrulophora — 2,480
- Pseudomonadales — 2,400
- Flavobacteriales — 2,320
- Scleractinia — 2,240
- Malacalcyonacea — 2,000
- Lyssacinosida — 2,000
- Rotaliida — 2,000
- Amphipoda — 1,880
class categorical
Top values (rank 1–20)
- Alphaproteobacteria — 22,840
- (empty string) — 21,920
- Teleostei — 19,120
- Octocorallia — 17,880
- Thaumarchaeota incertae sedis — 10,840
- Dinophyceae — 10,600
- Gammaproteobacteria — 9,440
- Holothuroidea — 7,760
- Malacostraca — 7,440
- Hexactinellida — 6,320
- Hexacorallia — 5,640
- Copepoda — 4,080
- Ophiuroidea — 3,320
- Polychaeta — 3,320
- Polycystina — 2,800
- Elasmobranchii — 2,640
- Globothalamea — 2,520
- Deltaproteobacteria — 2,440
- Thermoplasmata — 2,360
- Flavobacteria — 2,320
phylum categorical
Top values (rank 1–20)
- Proteobacteria — 35,480
- Cnidaria — 25,520
- Chordata — 23,920
- Echinodermata — 14,000
- (empty string) — 13,200
- Arthropoda — 13,200
- Myzozoa — 11,280
- Thaumarchaeota — 10,920
- Porifera — 10,360
- Foraminifera — 4,720
- Annelida — 4,240
- Radiozoa — 4,000
- Mollusca — 3,840
- Bacteroidetes — 3,440
- Euryarchaeota — 2,440
- Planctomycetes — 2,320
- Heterokontophyta — 1,680
- Verrucomicrobia — 1,520
- Brachiopoda — 1,520
- Nematoda — 1,160
latitude numeric
longitude numeric
depth numeric
year numeric
country categorical
Top values (rank 1–20)
- (empty string) — 103,840
- Australia — 79,320
- United States — 8,160
- New Zealand — 1,320
- USA — 680
- Antarctica — 680
- Colombia — 640
- Chile — 520
- Bermuda — 400
- Portugal — 320
- UNITED STATES — 320
- Ross Dependency — 240
- Russia — 240
- United States of America — 240
- GREAT BRITAIN — 200
- Ecuador — 160
- Bahamas — 160
- Italy — 160
- CO — 160
- Discovery Deep, Red Sea — 160