quirky deep sea

source: /home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json · 200,000 rows · 12 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a 200,000-row deep-sea biodiversity dataset with 12 columns covering taxonomy (phylum, class, order, family, genus, species, scientificName), geography (country, latitude, longitude), depth, and observation year. Two things stand out. First, the taxonomic hierarchy is heavily incomplete at lower ranks: species is blank in 73.2% of rows and genus in 54.9%, so most records can only be analyzed at higher ranks such as phylum (top: Proteobacteria at 17.7%) or class (top: Alphaproteobacteria at 11.4%). Second, country is blank in 51.9% of rows, and Australia accounts for 79,320 of the 96,160 populated entries (82.5%), suggesting a strong sampling bias. These blanks are stored as empty strings, so the 0.0% null rates in the schema understate missingness. Year is left-skewed (skew -3.57): most records are recent, with a long tail back to 1875. Depth ranges from 1,000 to 11,000 m with a median near 1,962 m. Start by checking the missingness in species and country and the geographic concentration before any biodiversity analysis.

citing: row_count · column_count · species.top_rate · genus.top_rate · country.top_rate · country.top_values · phylum.top_value · phylum.top_rate · class.top_value · class.top_rate · year.skew · year.min · year.max · depth.min · depth.max · depth.median
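
The missingness check recommended in the summary can be sketched as follows. This is a minimal illustration using only the Python standard library; the sample rows and the helper name `missing_share` are hypothetical and stand in for the real 200,000-row file. It counts empty or whitespace-only strings as missing, since the report shows blanks stored as strings rather than nulls.

```python
# Hypothetical sample rows; the real file has 200,000 rows and 12 columns.
rows = [
    {"species": "Amperima rosea", "country": "Australia"},
    {"species": "", "country": ""},
    {"species": "", "country": "Australia"},
    {"species": "Xiphias gladius", "country": None},
]


def missing_share(rows, column):
    """Share of rows where `column` is None, absent, or a blank string."""

    def is_missing(value):
        return value is None or (isinstance(value, str) and value.strip() == "")

    return sum(is_missing(r.get(column)) for r in rows) / len(rows)


species_missing = missing_share(rows, "species")
country_missing = missing_share(rows, "country")
print(f"species missing: {species_missing:.1%}, country missing: {country_missing:.1%}")
```

Run against the full dataset, a check of this kind should reproduce the 73.2% and 51.9% blank rates reported above, whereas a plain null count would report 0.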

Charts the summary said to look at first

phylum · Shows the dominant taxonomic groups, led by Proteobacteria, Cnidaria, and Chordata.

Top values for phylum (20 unique shown, of 65 total).
value	count	share
Proteobacteria	35480	17.7%
Cnidaria	25520	12.8%
Chordata	23920	12.0%
Echinodermata	14000	7.0%
(empty string)	13200	6.6%
Arthropoda	13200	6.6%
Myzozoa	11280	5.6%
Thaumarchaeota	10920	5.5%
Porifera	10360	5.2%
Foraminifera	4720	2.4%
Annelida	4240	2.1%
Radiozoa	4000	2.0%
Mollusca	3840	1.9%
Bacteroidetes	3440	1.7%
Euryarchaeota	2440	1.2%
Planctomycetes	2320	1.2%
Heterokontophyta	1680	0.8%
Verrucomicrobia	1520	0.8%
Brachiopoda	1520	0.8%
Nematoda	1160	0.6%

depth · Reveals the distribution of sampling depths from 1,000 m down to the 11,000 m hadal zone.

Histogram bins for depth (median: 1961.70).
bin	count
1000 – 1250	58400
1250 – 1500	15280
1500 – 1750	10920
1750 – 2000	20200
2000 – 2250	23560
2250 – 2500	5720
2500 – 2750	5320
2750 – 3000	4360
3000 – 3250	5080
3250 – 3500	3520
3500 – 3750	3520
3750 – 4000	3360
4000 – 4250	8000
4250 – 4500	4680
4500 – 4750	1880
4750 – 5000	11400
5000 – 5250	3240
5250 – 5500	3360
5500 – 5750	5520
5750 – 6000	1760
6000 – 6250	280
6250 – 6500	40
6500 – 6750	40
6750 – 7000	0
7000 – 7250	0
7250 – 7500	80
7500 – 7750	80
7750 – 8000	0
8000 – 8250	0
8250 – 8500	0
8500 – 8750	80
8750 – 9000	80
9000 – 9250	40
9250 – 9500	0
9500 – 9750	40
9750 – 10000	0
10000 – 10250	120
10250 – 10500	0
10500 – 10750	0
10750 – 11000	40

year · Highlights the strong left skew toward recent years with a long historical tail back to 1875.

Histogram bins for year (median: 2016).
bin	count
1875 – 1879	40
1879 – 1882	120
1882 – 1886	320
1886 – 1890	80
1890 – 1894	120
1894 – 1897	0
1897 – 1901	0
1901 – 1905	80
1905 – 1909	40
1909 – 1912	120
1912 – 1916	0
1916 – 1920	0
1920 – 1923	0
1923 – 1927	160
1927 – 1931	400
1931 – 1935	240
1935 – 1938	80
1938 – 1942	0
1942 – 1946	0
1946 – 1950	40
1950 – 1953	80
1953 – 1957	40
1957 – 1961	120
1961 – 1964	1680
1964 – 1968	840
1968 – 1972	1520
1972 – 1976	1840
1976 – 1979	1360
1979 – 1983	960
1983 – 1987	2160
1987 – 1990	2880
1990 – 1994	5200
1994 – 1998	8520
1998 – 2002	13760
2002 – 2005	10040
2005 – 2009	7360
2009 – 2013	7280
2013 – 2017	98920
2017 – 2020	14120
2020 – 2024	12240

country · Exposes the heavy Australia bias and the large share of records with no country recorded.

Top values for country (20 unique shown, of 57 total).
value	count	share
(empty string)	103840	51.9%
Australia	79320	39.7%
United States	8160	4.1%
New Zealand	1320	0.7%
USA	680	0.3%
Antarctica	680	0.3%
Colombia	640	0.3%
Chile	520	0.3%
Bermuda	400	0.2%
Portugal	320	0.2%
UNITED STATES	320	0.2%
Ross Dependency	240	0.1%
Russia	240	0.1%
United States of America	240	0.1%
GREAT BRITAIN	200	0.1%
Ecuador	160	0.1%
Bahamas	160	0.1%
Italy	160	0.1%
CO	160	0.1%
Discovery Deep, Red Sea	160	0.1%

species · Illustrates how 73% of rows have no species assignment, with only a handful of named species dominating the rest.

Top values for species (20 unique shown, of 678 total).
value	count	share
(empty string)	146400	73.2%
Amperima rosea	4520	2.3%
Xiphias gladius	2000	1.0%
Scomber scombrus	1520	0.8%
Prionace glauca	1080	0.5%
Oneirophanta mutabilis	840	0.4%
Thunnus albacares	760	0.4%
Farrea occa	680	0.3%
Trissopathes pseudotristicha	640	0.3%
Hoplostethus atlanticus	520	0.3%
Trachurus trachurus	480	0.2%
Florometra serratissima	440	0.2%
Heteropolypus ritteri	400	0.2%
Desmophyllum dianthus	400	0.2%
Psychropotes longicauda	360	0.2%
Thunnus obesus	320	0.2%
Solenosmilia variabilis	320	0.2%
Etmopterus granulosus	320	0.2%
Molpadiodemas villosus	280	0.1%
Paragorgia arborea	280	0.1%

Schema

12 columns

Per-column summary.
Column	Type	Nulls	Unique	Alerts
scientificName	categorical	0.0%	1,478
species	categorical	0.0%	678
genus	categorical	0.0%	841
family	categorical	0.0%	606
order	categorical	0.0%	310
class	categorical	0.0%	138
phylum	categorical	0.0%	65
latitude	numeric	0.0%	2,617
longitude	numeric	0.0%	2,654
depth	numeric	0.0%	1,938
year	numeric	3.6%	98	high_skew outliers
country	categorical	0.0%	57

scientificName

categorical label

Taxonomic name field mixing ranks from class (Alphaproteobacteria, 8.82% of rows) and domain (Bacteria) down to species (Xiphias gladius, Amperima rosea), across 1,478 distinct values with no nulls. The rank inconsistency is the headline issue: aggregating or joining on this column will conflate broad clades with individual species. Entropy ratio of 0.80 shows the distribution is fairly diffuse despite the dominant top value. Treatment: Normalise to a consistent taxonomic rank (or join to a taxonomy table) before grouping or modelling. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 1,478
top_value: Alphaproteobacteria
top_rate: 0.0882
cardinality: 1,478
entropy: 8.378
entropy_ratio: 0.7956
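
One way to approach the rank normalisation suggested above is to label each record with the rank its scientificName actually refers to, by comparing the name against the rank columns in the same row. The sketch below assumes that scientificName exactly matches the value stored in its own rank column; that has not been verified for this dataset, and the helper `infer_rank` and the sample records are hypothetical. Names that match no column, such as the domain-level 'Bacteria', come back as None and would need a separate taxonomy lookup.

```python
# Rank columns from the schema, ordered from lowest to highest rank.
RANKS = ["species", "genus", "family", "order", "class", "phylum"]


def infer_rank(record):
    """Return the lowest rank column whose value equals scientificName, else None."""
    name = record.get("scientificName", "")
    for rank in RANKS:
        value = record.get(rank, "")
        if value and value == name:
            return rank
    return None


# Hypothetical records shaped like the dataset's rows (blank = empty string).
records = [
    {"scientificName": "Alphaproteobacteria", "phylum": "Proteobacteria",
     "class": "Alphaproteobacteria", "order": "", "family": "", "genus": "", "species": ""},
    {"scientificName": "Xiphias gladius", "phylum": "Chordata",
     "genus": "Xiphias", "species": "Xiphias gladius"},
    {"scientificName": "Bacteria", "phylum": "", "class": ""},
]

inferred = [infer_rank(r) for r in records]
print(inferred)
```

Grouping on the inferred rank first, and only then on the name, avoids counting a class-level record and a species-level record as comparable units.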

species

categorical label

Categorical species identifier with 678 distinct values: 677 binomial names (e.g., Amperima rosea, Xiphias gladius, Prionace glauca) covering marine taxa, plus the empty string. The dominant value is that empty string at 73.2% of 200,000 rows, meaning species is unrecorded for nearly three quarters of observations despite a reported null_rate of 0.0. Among labelled rows, the distribution is long-tailed with no single species exceeding 4,520 occurrences, and overall entropy_ratio is 0.327. Treatment: Treat empty strings as missing, then group rare categories or target-encode before modelling. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 678
top_value: (empty string)
top_rate: 0.732
cardinality: 678
entropy: 3.077
entropy_ratio: 0.3272
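
The report does not define entropy_ratio, but the figures in these stat blocks are consistent with Shannon entropy in bits divided by log2 of the cardinality. That reading is inferred from the numbers and is an assumption, not something the report states. A minimal sketch, with the hypothetical helpers `entropy_bits` and `entropy_ratio`:

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy of the value distribution, in bits."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    if k < 2:
        return 0.0  # a single category carries no uncertainty
    return entropy_bits(values) / math.log2(k)


# Cross-check against the species block: entropy 3.077, cardinality 678.
reported_ratio = 3.077 / math.log2(678)
print(round(reported_ratio, 4))
```

A ratio near 1 means values are spread almost evenly across categories; a ratio near 0 means a few categories (here, mostly the empty string) hold most of the rows.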

genus

categorical label

Taxonomic genus label for what appears to be a marine biology dataset (Amperima, Xiphias, Scomber, Thunnus, Prionace). The dominant signal is missingness encoded as empty string: 109,800 of 200,000 rows (54.9%) have no genus assigned, despite a stated null_rate of 0.0. Across the remaining records, 840 distinct genera spread fairly thin, with Amperima the largest non-empty bucket at 4,520. Treatment: Recode empty strings as nulls, then group rare genera or roll up to a higher taxonomic rank before encoding. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 841
top_value: (empty string)
top_rate: 0.549
cardinality: 841
entropy: 4.896
entropy_ratio: 0.5039

family

categorical feature

Taxonomic family classification, with 606 distinct values (including the empty string) across 200,000 records and no true nulls. The dominant value is an empty string at 40.18% (80,360 rows), effectively a hidden missing-data category, while the largest real family, Nitrosopumilaceae, covers only 5.42% (10,840). The remaining tail spans marine taxa (Elpidiidae, Keratoisididae, Coralliidae, Macrouridae, Scombridae), consistent with a deep-sea or oceanographic biodiversity dataset. Treatment: Recode empty strings as missing, then group rare families into an 'other' bucket before encoding. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 606
top_value: (empty string)
top_rate: 0.4018
cardinality: 606
entropy: 5.5
entropy_ratio: 0.595
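
The treatment described for family (recode blanks, then bucket rare values) might look like the sketch below. The function name `recode_and_group`, the `min_count` threshold, and the sample list are all hypothetical; a sensible threshold for the real data would need to be chosen by inspecting the family counts.

```python
from collections import Counter


def recode_and_group(values, min_count=2, other="other"):
    """Map blank strings to None, then replace categories rarer than min_count."""
    cleaned = [v if (v is not None and v.strip()) else None for v in values]
    counts = Counter(v for v in cleaned if v is not None)
    return [
        None if v is None else (v if counts[v] >= min_count else other)
        for v in cleaned
    ]


# Hypothetical sample using family names that appear in the report.
families = ["Nitrosopumilaceae", "", "Nitrosopumilaceae", "Elpidiidae", "", "Macrouridae"]
grouped = recode_and_group(families, min_count=2)
print(grouped)
```

Recoding before counting matters: otherwise the empty string, at 40.18% of rows, would survive as the most frequent "family".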

order

categorical feature

Taxonomic order assignments for biological records, with 310 distinct orders spanning corals (Scleralcyonacea), archaea (Nitrosopumilales), dinoflagellates (Syndiniales), and various fish and crustacean groups. The dominant value is an empty string at 28.16% (56,320 rows), indicating a large block of unassigned/missing orders rather than true nulls. Entropy ratio of 0.66 shows moderate concentration, with the top 10 orders accounting for the bulk of labeled records. Treatment: Recode empty strings as missing, then group rare orders before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 310
top_value: (empty string)
top_rate: 0.2816
cardinality: 310
entropy: 5.45
entropy_ratio: 0.6585

class

categorical label

Taxonomic class assignments across 200,000 records, spanning 138 distinct values with moderate concentration (entropy ratio 0.69, top class 'Alphaproteobacteria' at 11.4%). The mix is biologically heterogeneous, blending bacteria, archaea, fish, corals, and sponges, suggesting a marine biodiversity catalogue rather than a single-domain dataset. Notably, the second most common value is an empty string at 21,920 rows (~11%), which null_rate=0.0 misses because blanks are encoded as strings rather than nulls. Treatment: Recode empty strings to nulls, then group rare classes before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 138
top_value: Alphaproteobacteria
top_rate: 0.1142
cardinality: 138
entropy: 4.892
entropy_ratio: 0.6881

phylum

categorical feature

Taxonomic phylum labels across 200,000 records spanning 65 distinct values, led by Proteobacteria at 17.74% and followed by Cnidaria, Chordata, and Echinodermata. The mix of bacterial, archaeal, and animal phyla (e.g., Thaumarchaeota alongside Chordata) suggests a broad biodiversity or environmental-sequencing dataset rather than a single kingdom. Notably, 13,200 rows carry an empty-string phylum — a hidden null channel despite the reported null_rate of 0.0. Treatment: Recode empty strings to null and group-encode (target or frequency encoding) given 65 categories. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 65
top_value: Proteobacteria
top_rate: 0.1774
cardinality: 65
entropy: 4.095
entropy_ratio: 0.6799
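
Frequency encoding, one of the options named in the phylum treatment, replaces each label with its share of the rows. The sketch below is one possible version; computing shares over non-blank rows only is a design choice made here for illustration, and `frequency_encode` is a hypothetical helper.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each non-blank label with its share among non-blank rows."""
    present = [v for v in values if v]
    counts = Counter(present)
    n = len(present)
    return [counts[v] / n if v else None for v in values]


# Hypothetical sample; "" stands for the empty-string phylum seen in 13,200 rows.
phyla = ["Proteobacteria", "Cnidaria", "Proteobacteria", "", "Chordata"]
encoded = frequency_encode(phyla)
print(encoded)
```

With 65 categories this yields a single numeric column instead of 65 one-hot columns, at the cost of giving equally frequent phyla the same code.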

latitude

numeric feature

Geographic latitude in degrees, ranging from -75.0 to 89.06 across 200000 rows with no nulls and only 2617 unique values, suggesting coordinates snapped to a coarse grid. The distribution is nearly symmetric (skew 0.12) but platykurtic (kurtosis -1.22) with a wide IQR of 71.97, indicating fairly uniform global coverage rather than a concentration near populated bands. Mean (-1.58) and median (-4.998) sit just south of the equator. Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than using raw degrees in linear models. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 2,617
min: -75
max: 89.06
mean: -1.581
median: -4.998
std: 39.48
q1: -36.25
q3: 35.72
iqr: 71.97
skew: 0.1182
kurtosis: -1.223
n_outliers: 0
outlier_rate: 0
zero_rate: 0

longitude

numeric feature

This is a geographic longitude feature spanning the full valid range from -179.9872 to 179.9985 across 200000 rows with no nulls. The distribution is wide (std 127.09) and platykurtic (kurtosis -1.12) with a median of -94.29 and Q1 at -169.995, suggesting a heavy concentration in the western hemisphere and Pacific quadrant rather than a uniform global spread. Only 2654 unique values indicate the coordinates are quantised rather than raw floats. Treatment: Pair with latitude for geospatial features; avoid treating as a plain scalar in models. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 2,654
min: -180
max: 180
mean: -51.58
median: -94.29
std: 127.1
q1: -170
q3: 63.76
iqr: 233.8
skew: 0.6619
kurtosis: -1.123
n_outliers: 0
outlier_rate: 0
zero_rate: 0.0004
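
One reason not to treat longitude as a plain scalar is that -180 and 180 describe the same meridian yet sit at opposite ends of the numeric range. A common workaround, sketched here as an illustration rather than as the report's method, is to convert each latitude/longitude pair to a point on the unit sphere. The helper `to_unit_vector` is hypothetical; the two longitudes used are the column's reported minimum and maximum.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# The extreme longitudes differ by almost 360 degrees as scalars,
# but the corresponding points on the sphere are nearly identical.
west = to_unit_vector(0.0, -179.9872)
east = to_unit_vector(0.0, 179.9985)
gap = math.dist(west, east)
print(f"scalar difference: {179.9985 - (-179.9872):.4f} degrees, chord length: {gap:.6f}")
```

The three coordinates can be used together as features, which keeps neighbouring locations close regardless of which side of the antimeridian they fall on.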

depth

numeric feature

Sampling depth in metres, ranging from 1000.0 to 11000.0 with mean 2405.74 and median 1961.70. The dataset summary places these values between 1,000 m and the 11,000 m hadal zone, so this is ocean depth rather than a generic measurement. The distribution is right-skewed (skew 1.09) with IQR 2174.25 and a low 0.28% outlier rate (560 rows), and the 1,938 unique values across 200,000 rows point to discretized or rounded measurements. No nulls or zeros are present. Treatment: Consider a log or sqrt transform before regression to tame the right skew. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 1,938
min: 1,000
max: 11,000
mean: 2,406
median: 1,962
std: 1,477
q1: 1,149
q3: 3,323
iqr: 2,174
skew: 1.091
kurtosis: 0.5022
n_outliers: 560
outlier_rate: 0.0028
zero_rate: 0
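
The report does not say which rule produces n_outliers. The sketch below assumes the common 1.5 × IQR (Tukey) fences and applies them to the rounded quartiles listed above. Under that assumption the upper fence lands at 6,584 m, and the depth histogram bins that lie wholly above it sum to 560 rows, matching the reported count. The helper `tukey_fences` is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Rounded quartiles from the depth stat block.
lower, upper = tukey_fences(1149.0, 3323.0)

# Non-empty depth histogram bins starting at 7,250 m or deeper (bin start -> count).
tail_bins = {7250: 80, 7500: 80, 8500: 80, 8750: 80,
             9000: 40, 9500: 40, 10000: 120, 10750: 40}
tail_rows = sum(tail_bins.values())

print(lower, upper, tail_rows)
```

The lower fence is negative, so no shallow record can be flagged; every depth outlier under this rule is a deep (mostly hadal) sample, which argues for keeping those rows rather than dropping them.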

year

numeric feature high_skew outliers

This is a year column spanning 1875 to 2024 with 98 distinct values; per the dataset summary, it records the observation year of each record. The distribution is heavily left-skewed (skew -3.57, kurtosis 19.6) with a median of 2016 but a mean pulled down to 2008.9, and 6.27% of non-null values are flagged as outliers, reflecting a long tail of older entries against a modern-heavy mass. About 3.6% of rows (7,240) are null. Treatment: Impute the 3.6% nulls and consider bucketing into decades or capping the pre-1980 tail before modelling. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 7,240 (3.6%)
unique: 98
min: 1875
max: 2024
mean: 2009
median: 2016
std: 15.43
q1: 2004
q3: 2016
iqr: 12
skew: -3.574
kurtosis: 19.65
n_outliers: 12,080
outlier_rate: 0.06267
zero_rate: 0
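
The imputation and decade bucketing mentioned in the treatment could be combined as in the sketch below. Median imputation is only one option and is an assumption made here; the helper `impute_and_bucket` and the sample list are hypothetical.

```python
import statistics


def impute_and_bucket(years):
    """Fill missing years with the median of known years, then floor to the decade."""
    known = [y for y in years if y is not None]
    fill = statistics.median(known)
    filled = [fill if y is None else y for y in years]
    return [int(y // 10 * 10) for y in filled]


# Hypothetical sample; None marks a null year.
years = [1875, 1999, None, 2016, 2016, 2024]
decades = impute_and_bucket(years)
print(decades)
```

Because the median (2016) falls in the bin that already holds about half of the non-null rows, median imputation concentrates the distribution further; keeping a separate "unknown" bucket is a reasonable alternative.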

country

categorical feature

Country of origin as a free-form categorical with 57 distinct values, dominated by an empty string at 51.92% (103,840 rows) which functions as an undeclared missing token rather than a true null (null_rate is 0.0). Australia accounts for the bulk of populated values (79,320 of 96,160), making the dataset overwhelmingly Australia-centric, and the entropy_ratio of 0.277 confirms heavy concentration. The United States is encoded inconsistently: the top-values table shows 'United States' (8,160), 'USA' (680), 'UNITED STATES' (320), and 'United States of America' (240), 9,400 rows in total. The same table also contains entries that are not country names, such as 'CO' and 'Discovery Deep, Red Sea'. Treatment: Recode empty strings as missing, canonicalise duplicate spellings such as the United States variants, then group rare countries before one-hot encoding. high · anthropic:claude-opus-4-7

n: 200,000
nulls: 0 (0.0%)
unique: 57
top_value: (empty string)
top_rate: 0.5192
cardinality: 57
entropy: 1.616
entropy_ratio: 0.277
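
A small alias table is usually enough to merge duplicate country spellings. The sketch below is illustrative only: the `ALIASES` mapping and the `canonical_country` helper are hypothetical, cover just the United States variants visible in the top-values table, and leave every other label untouched. Ambiguous entries such as 'CO' or 'Discovery Deep, Red Sea' would need manual review rather than an automatic mapping.

```python
from collections import Counter

# Hypothetical alias table, keyed by case-folded spelling.
ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def canonical_country(raw):
    """Return None for blanks, a canonical name for known aliases, else the trimmed label."""
    if raw is None or not raw.strip():
        return None
    label = raw.strip()
    return ALIASES.get(label.casefold(), label)


# Counts taken from the country top-values table.
reported = {"": 103840, "Australia": 79320, "United States": 8160,
            "USA": 680, "UNITED STATES": 320, "United States of America": 240}

merged = Counter()
for label, count in reported.items():
    merged[canonical_country(label)] += count

print(merged["United States"], merged[None])
```

After merging, the United States rises from 8,160 to 9,400 rows, still far behind Australia, so the geographic-bias conclusion is unchanged.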