saturn·

quirky deep sea

source: /home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json · 200,000 rows · 12 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a 200,000-row deep-sea biodiversity dataset with 12 columns covering taxonomy (phylum, class, order, family, genus, species, scientificName), geography (country, latitude, longitude), depth, and observation year. Two issues stand out. First, the taxonomic hierarchy is heavily incomplete at lower ranks: species is blank in 73.2% of rows and genus in 54.9%, so most records can only be analyzed at higher ranks such as phylum (top: Proteobacteria at 17.7%) or class (top: Alphaproteobacteria at 11.4%). Second, country is blank in 51.9% of rows, and Australia accounts for 79,320 of the 96,160 populated entries (82.5%), which points to a strong sampling bias. Year is left-skewed (skew -3.57): records cluster in recent years, with a long tail back to 1875. Depth ranges from 1,000 to 11,000 m with a median near 1,962 m. Check the missingness in species and country, and the geographic concentration, before any biodiversity analysis.

citing: row_count · column_count · species.top_rate · genus.top_rate · country.top_rate · country.top_values · phylum.top_value · phylum.top_rate · class.top_value · class.top_rate · year.skew · year.min · year.max · depth.min · depth.max · depth.median
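The gap between the reported null rate and the true missingness can be audited directly. The following is a minimal sketch, assuming the data is loaded into a pandas DataFrame; the toy frame, its values, and the helper name `missingness_report` are illustrative and not part of the profiler.

```python
import pandas as pd

# Toy frame standing in for the real file; the values are illustrative only.
df = pd.DataFrame({
    "phylum":  ["Chordata", "", "Cnidaria", "Proteobacteria"],
    "species": ["Xiphias gladius", "", "", ""],
    "country": ["Australia", "", "USA", ""],
    "year":    [2016, None, 1875, 2004],
})

def missingness_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null, blank-string and combined missing rates per column."""
    rows = {}
    for col in frame.columns:
        null_rate = float(frame[col].isna().mean())
        if pd.api.types.is_numeric_dtype(frame[col]):
            blank_rate = 0.0  # numeric columns cannot hold blank strings
        else:
            # strip() also catches whitespace-only cells
            text = frame[col].dropna().astype(str).str.strip()
            blank_rate = float(text.eq("").sum()) / len(frame)
        rows[col] = {
            "null_rate": null_rate,
            "blank_rate": blank_rate,
            "missing_rate": null_rate + blank_rate,
        }
    return pd.DataFrame.from_dict(rows, orient="index")

report = missingness_report(df)
print(report)
```

Sorting the result by `missing_rate` gives a quick view of which columns the 0.0% null figures understate.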

Schema

12 columns
Per-column summary.
column          type         nulls  unique  alerts
scientificName  categorical  0.0%   1,478
species         categorical  0.0%   678
genus           categorical  0.0%   841
family          categorical  0.0%   606
order           categorical  0.0%   310
class           categorical  0.0%   138
phylum          categorical  0.0%   65
latitude        numeric      0.0%   2,617
longitude       numeric      0.0%   2,654
depth           numeric      0.0%   1,938
year            numeric      3.6%   98      high_skew, outliers
country         categorical  0.0%   57

scientificName

categorical label
Taxonomic name field mixing ranks from domain (Bacteria) and class (Alphaproteobacteria, 8.82% of rows) down to species (Xiphias gladius, Amperima rosea), across 1,478 distinct values with no nulls. The rank inconsistency is the headline issue: aggregating or joining on this column will conflate broad clades with individual species. An entropy ratio of 0.80 shows the distribution is fairly diffuse despite the dominant top value. Treatment: Normalise to a consistent taxonomic rank (or join to a taxonomy table) before grouping or modelling. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
1,478
top_value
Alphaproteobacteria
top_rate
0.0882
cardinality
1,478
entropy
8.378
entropy_ratio
0.7956
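The reported figures are consistent with entropy being Shannon entropy in bits and entropy_ratio being that entropy divided by log2 of the cardinality. This definition is inferred from the numbers, not documented by the profiler, so the sketch below should be read as an assumption; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy of the empirical distribution, in bits."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    if k < 2:
        return 0.0  # a constant column has no spread to normalise
    return entropy_bits(values) / math.log2(k)

# Consistency check against the reported scientificName statistics:
# entropy 8.378 over 1,478 distinct values.
implied_ratio = 8.378 / math.log2(1478)
print(round(implied_ratio, 3))
```

The implied ratio agrees with the reported 0.7956 to within rounding of the entropy value.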

species

categorical label
Categorical species identifier with 678 distinct values: 677 binomial names (e.g., Amperima rosea, Xiphias gladius, Prionace glauca) covering marine taxa, plus the empty string. The empty string is the dominant value at 73.2% of 200,000 rows (about 146,400), so species is unrecorded for nearly three quarters of observations despite a reported null_rate of 0.0. Among labelled rows the distribution is long-tailed, with no single species exceeding 4,520 occurrences; the overall entropy_ratio is 0.327. Treatment: Treat empty strings as missing, then group rare categories or target-encode before modelling. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
678
top_value
(empty string)
top_rate
0.732
cardinality
678
entropy
3.077
entropy_ratio
0.3272
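The species treatment (empty strings to missing, then rare categories grouped) can be sketched as follows, assuming pandas. The sample values, the `min_count` threshold, and the helper name `clean_and_group` are illustrative choices, not profiler output.

```python
import pandas as pd

# Illustrative sample; the real column has 200,000 rows.
species = pd.Series(
    ["Amperima rosea", "", "Amperima rosea", "Xiphias gladius",
     "", "", "Prionace glauca", "Amperima rosea"],
    name="species",
)

def clean_and_group(col: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Turn blank strings into missing values, then bucket rare labels."""
    cleaned = col.str.strip()
    cleaned = cleaned.mask(cleaned.eq(""))   # blanks become NaN
    counts = cleaned.value_counts()          # NaN is excluded by default
    rare = counts[counts < min_count].index
    # Rare labels collapse into one bucket; missing values stay missing.
    return cleaned.where(~cleaned.isin(rare), other)

grouped = clean_and_group(species, min_count=2)
print(grouped.value_counts(dropna=False))
```

Keeping missing values separate from the "other" bucket preserves the distinction between "unrecorded" and "recorded but rare".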

genus

categorical label
Taxonomic genus label for what appears to be a marine biology dataset (Amperima, Xiphias, Scomber, Thunnus, Prionace). The dominant signal is missingness encoded as empty string: 109,800 of 200,000 rows (54.9%) have no genus assigned, despite a stated null_rate of 0.0. Across the remaining records, 840 distinct genera spread fairly thin, with Amperima the largest non-empty bucket at 4,520. Treatment: Recode empty strings as nulls, then group rare genera or roll up to a higher taxonomic rank before encoding. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
841
top_value
(empty string)
top_rate
0.549
cardinality
841
entropy
4.896
entropy_ratio
0.5039
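One way to roll records up to a higher rank is to take the most specific rank that is actually populated in each row. This is a sketch under the assumption that blanks mark missing ranks in every taxonomic column; the toy rows and the helper name `lowest_available_rank` are hypothetical.

```python
import pandas as pd

# Ordered from most to least specific.
RANKS = ["species", "genus", "family", "order", "class", "phylum"]

# Illustrative rows only.
df = pd.DataFrame({
    "phylum":  ["Echinodermata", "Proteobacteria", "Chordata", ""],
    "class":   ["Holothuroidea", "Alphaproteobacteria", "", ""],
    "order":   ["Elasipodida", "", "", ""],
    "family":  ["Elpidiidae", "", "", ""],
    "genus":   ["Amperima", "", "", ""],
    "species": ["Amperima rosea", "", "", ""],
})

def lowest_available_rank(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the most specific populated taxon per row and its rank."""
    ranks = frame[RANKS].apply(lambda s: s.str.strip())
    ranks = ranks.mask(ranks.eq(""))          # blanks become NaN
    present = ranks.notna()
    # Back-filling across columns pulls the first populated rank into column 0.
    taxon = ranks.bfill(axis=1).iloc[:, 0]
    # idxmax returns the first True column; rows with no rank at all stay NaN.
    rank = present.idxmax(axis=1).where(present.any(axis=1))
    return pd.DataFrame({"taxon": taxon, "rank": rank})

rolled = lowest_available_rank(df)
print(rolled)
```

Grouping on `rank` first and `taxon` second avoids mixing clades of different breadth in a single aggregate.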

family

categorical feature
Taxonomic family classification, with 606 distinct values (605 named families plus the empty string) across 200,000 records and zero nulls. The dominant value is the empty string at 40.18% (80,360 rows), effectively a hidden missing-data category, while the largest named family, Nitrosopumilaceae, covers only 5.42% (10,840 rows). The remaining tail spans marine taxa (Elpidiidae, Keratoisididae, Coralliidae, Macrouridae, Scombridae), consistent with a deep-sea or oceanographic biodiversity dataset. Treatment: Recode empty strings as missing, then group rare families into an 'other' bucket before encoding. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
606
top_value
(empty string)
top_rate
0.4018
cardinality
606
entropy
5.5
entropy_ratio
0.595

order

categorical feature
Taxonomic order assignments for biological records, with 310 distinct values (309 named orders plus the empty string) spanning corals (Scleralcyonacea), archaea (Nitrosopumilales), dinoflagellates (Syndiniales), and various fish and crustacean groups. The dominant value is the empty string at 28.16% (56,320 rows), indicating a large block of unassigned or missing orders rather than true nulls. An entropy ratio of 0.66 shows moderate concentration, with the top 10 orders accounting for the bulk of labeled records. Treatment: Recode empty strings as missing, then group rare orders before one-hot or target encoding. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
310
top_value
(empty string)
top_rate
0.2816
cardinality
310
entropy
5.45
entropy_ratio
0.6585

class

categorical label
Taxonomic class assignments across 200,000 records, spanning 138 distinct values with moderate concentration (entropy ratio 0.69, top class 'Alphaproteobacteria' at 11.4%). The mix is biologically heterogeneous, blending bacteria, archaea, fish, corals, and sponges, suggesting a marine biodiversity catalogue rather than a single-domain dataset. Notably, the second most common value is an empty string at 21,920 rows (~11%), which null_rate=0.0 misses because blanks are encoded as strings rather than nulls. Treatment: Recode empty strings to nulls, then group rare classes before one-hot or target encoding. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
138
top_value
Alphaproteobacteria
top_rate
0.1142
cardinality
138
entropy
4.892
entropy_ratio
0.6881

phylum

categorical feature
Taxonomic phylum labels across 200,000 records spanning 65 distinct values, led by Proteobacteria at 17.74% and followed by Cnidaria, Chordata, and Echinodermata. The mix of bacterial, archaeal, and animal phyla (e.g., Thaumarchaeota alongside Chordata) suggests a broad biodiversity or environmental-sequencing dataset rather than a single kingdom. Notably, 13,200 rows carry an empty-string phylum — a hidden null channel despite the reported null_rate of 0.0. Treatment: Recode empty strings to null and group-encode (target or frequency encoding) given 65 categories. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
65
top_value
Proteobacteria
top_rate
0.1774
cardinality
65
entropy
4.095
entropy_ratio
0.6799
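Frequency encoding, one of the two options named in the phylum treatment, can be sketched as below. It assumes pandas and treats blank strings as missing first; the sample values and the helper name `frequency_encode` are illustrative.

```python
import pandas as pd

# Illustrative sample only.
phylum = pd.Series(
    ["Proteobacteria", "Cnidaria", "", "Proteobacteria",
     "Chordata", "Proteobacteria", "", "Cnidaria"],
    name="phylum",
)

def frequency_encode(col: pd.Series) -> pd.Series:
    """Replace each label with its share among non-missing rows."""
    cleaned = col.str.strip()
    cleaned = cleaned.mask(cleaned.eq(""))          # blanks become NaN
    freqs = cleaned.value_counts(normalize=True)    # NaN excluded from the shares
    return cleaned.map(freqs)                       # missing rows stay NaN

encoded = frequency_encode(phylum)
print(encoded)
```

Unlike target encoding, this needs no label column, so it cannot leak the target; the trade-off is that two phyla with equal frequency become indistinguishable.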

latitude

numeric feature
Geographic latitude in degrees, ranging from -75.0 to 89.06 across 200,000 rows with no nulls and only 2,617 unique values, suggesting coordinates snapped to a coarse grid. The distribution is nearly symmetric (skew 0.12) but platykurtic (kurtosis -1.22) with a wide IQR of 71.97, indicating fairly uniform global coverage rather than a concentration near populated bands. Mean (-1.58) and median (-4.998) sit just south of the equator. Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than using raw degrees in linear models. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
2,617
min
-75
max
89.06
mean
-1.581
median
-4.998
std
39.48
q1
-36.25
q3
35.72
iqr
71.97
skew
0.1182
kurtosis
-1.223
n_outliers
0
outlier_rate
0
zero_rate
0
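Binning latitude together with longitude can be sketched as a fixed-size grid index. The 5-degree cell size and the function name `grid_cell` are arbitrary assumptions for illustration, and equal-angle cells shrink in area toward the poles.

```python
import numpy as np

def grid_cell(lat, lon, cell_deg=5.0):
    """Map latitude/longitude in degrees to an integer id of an equal-angle grid cell."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    n_cols = int(round(360.0 / cell_deg))
    n_rows = int(round(180.0 / cell_deg))
    # Clip so that latitude +90 falls in the top row instead of a row of its own.
    row = np.clip(np.floor((lat + 90.0) / cell_deg).astype(int), 0, n_rows - 1)
    # Wrap so that longitude +180 shares a column with -180.
    col = np.floor((lon + 180.0) / cell_deg).astype(int) % n_cols
    return row * n_cols + col

# The reported median position (-4.998, -94.29) as an example input.
cells = grid_cell([-4.998, 0.0], [-94.29, 0.0])
print(cells)
```

The resulting ids can be used as a categorical feature or as a grouping key when checking geographic concentration.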

longitude

numeric feature
This is a geographic longitude feature spanning the full valid range from -179.9872 to 179.9985 across 200,000 rows with no nulls. The distribution is wide (std 127.09) and platykurtic (kurtosis -1.12) with a median of -94.29 and Q1 at -169.995, suggesting a heavy concentration in the western hemisphere and Pacific quadrant rather than a uniform global spread. Only 2,654 unique values indicate that the coordinates are quantised rather than raw floats. Treatment: Pair with latitude for geospatial features; avoid treating as a plain scalar in models. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
2,654
min
-180
max
180
mean
-51.58
median
-94.29
std
127.1
q1
-170
q3
63.76
iqr
233.8
skew
0.6619
kurtosis
-1.123
n_outliers
0
outlier_rate
0
zero_rate
0.0004
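Longitude wraps at ±180, so the column's minimum and maximum are neighbouring points even though they are almost 360 units apart as scalars. A common workaround, sketched here as an assumption rather than the profiler's method, is to project latitude and longitude onto the unit sphere; the function name `to_unit_sphere` is hypothetical.

```python
import numpy as np

def to_unit_sphere(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.column_stack([x, y, z])

# The two extreme longitudes from the column, both placed on the equator.
xyz = to_unit_sphere([0.0, 0.0], [-179.9872, 179.9985])
gap = np.linalg.norm(xyz[0] - xyz[1])
print(gap)
```

The straight-line gap between the two points is tiny, which is the behaviour a model needs near the antimeridian.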

depth

numeric feature
Depth of each observation, ranging from 1,000.0 to 11,000.0 with mean 2,405.74 and median 1,961.70; the dataset summary gives the unit as metres, and the 1,000 m floor is consistent with a deep-sea selection rather than a well or borehole measurement. The distribution is right-skewed (skew 1.09) with IQR 2,174.25 and a low 0.28% outlier rate, and only 1,938 unique values across 200,000 rows points to discretized or rounded measurements. No nulls or zeros are present. Treatment: Consider a log or sqrt transform before regression to tame the right skew. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
1,938
min
1,000
max
11,000
mean
2,406
median
1,962
std
1,477
q1
1,149
q3
3,323
iqr
2,174
skew
1.091
kurtosis
0.5022
n_outliers
560
outlier_rate
0.0028
zero_rate
0
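The effect of the suggested log transform can be checked by comparing skewness before and after. This is a minimal sketch assuming pandas and NumPy; the ten depth values are illustrative stand-ins chosen within the reported range, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed sample within the reported 1,000 to 11,000 range.
depth = pd.Series(
    [1000, 1050, 1149, 1300, 1600, 1962, 2500, 3323, 5000, 11000],
    dtype=float,
    name="depth",
)

# A plain log is safe here because the column has no zeros or negatives.
log_depth = np.log(depth)

skew_raw = depth.skew()
skew_log = log_depth.skew()
print(skew_raw, skew_log)
```

If the transformed column is used as a regression target, predictions need `np.exp` applied to return to the original scale.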

year

numeric feature high_skew outliers
Observation year spanning 1875 to 2024 with 98 distinct values. The distribution is heavily left-skewed (skew -3.57, kurtosis 19.6) with a median of 2016 and a mean pulled down to 2008.9; 12,080 values (6.27% of the 192,760 non-null rows) are flagged as outliers, reflecting a long tail of older entries against a modern-heavy mass. About 3.6% of rows (7,240) are null. Treatment: Impute the 3.6% nulls and consider bucketing into decades or capping the pre-1980 tail before modelling. high · anthropic:claude-opus-4-7
n
200,000
nulls
7,240 (3.6%)
unique
98
min
1875
max
2024
mean
2009
median
2016
std
15.43
q1
2004
q3
2016
iqr
12
skew
-3.574
kurtosis
19.65
n_outliers
12,080
outlier_rate
0.06267
zero_rate
0
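The year treatment (imputation, decade buckets, capping the old tail) can be sketched as follows, assuming pandas with its nullable `Int64` dtype. The sample values are illustrative, median imputation is one choice among several, and the 1980 cap comes from the treatment note above.

```python
import pandas as pd

# Illustrative sample with one missing year.
year = pd.Series([1875, 1962, 2004, 2016, None, 2016, 2024], dtype="Int64", name="year")

features = pd.DataFrame({
    # Keep a flag so the model can still see which rows were imputed.
    "year_missing": year.isna(),
    # Cap the long pre-1980 tail; missing values stay missing.
    "year_capped": year.clip(lower=1980),
})
# Median of the observed years, used as a simple imputation value.
fill_value = int(year.median())
features["year_imputed"] = features["year_capped"].fillna(fill_value)
# Decade bucket, e.g. 2016 -> 2010.
features["decade"] = (features["year_capped"] // 10) * 10

print(features)
```

Capping and decade bucketing are alternatives in the treatment note; they are shown side by side here only for comparison.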

country

categorical feature
Country of origin as a free-form categorical with 57 distinct values, dominated by an empty string at 51.92% (103,840 rows) which functions as an undeclared missing token rather than a true null (null_rate is 0.0). Australia accounts for the bulk of populated values (79,320), making the dataset overwhelmingly Australia-centric, and the entropy_ratio of 0.277 confirms heavy concentration. Note also the inconsistent encoding of the United States as both 'United States' (8,160) and 'USA' (680). Treatment: Recode empty strings as missing, canonicalise duplicates like 'USA'/'United States', then group rare countries before one-hot encoding. high · anthropic:claude-opus-4-7
n
200,000
nulls
0 (0.0%)
unique
57
top_value
(empty string)
top_rate
0.5192
cardinality
57
entropy
1.616
entropy_ratio
0.277
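The country clean-up can be sketched as below, assuming pandas. The alias map contains only the 'USA'/'United States' pair named in the description; any further aliases would need to be found by inspecting the value counts. The helper name `clean_country` is hypothetical.

```python
import pandas as pd

# Illustrative sample using labels mentioned in the column description.
country = pd.Series(
    ["Australia", "", "USA", "United States", " ", "Australia"],
    name="country",
)

# Known duplicate spellings mapped to one canonical label.
ALIASES = {"USA": "United States"}

def clean_country(col: pd.Series) -> pd.Series:
    """Turn blanks into missing values and merge known aliases."""
    cleaned = col.str.strip()
    cleaned = cleaned.mask(cleaned.eq(""))   # blank or whitespace-only becomes NaN
    return cleaned.replace(ALIASES)          # exact-match replacement of whole values

cleaned_country = clean_country(country)
print(cleaned_country.value_counts(dropna=False))
```

Running the clean-up before grouping rare countries keeps the 680 'USA' rows from being bucketed as a separate rare category.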