saturn

/home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json | 200,000 rows | sample n=200,000 | seed 42 | 2026-05-01T23:35:12+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/deep_sea.json
Total rows: 200,000
Profiled sample: 200,000
Columns: 12
Generated: 2026-05-01T23:35:12+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This is a 200,000-row deep-sea biodiversity dataset with 12 columns covering taxonomy (phylum, class, order, family, genus, species, scientificName), geography (country, latitude, longitude), depth, and observation year. Two things stand out. First, the taxonomic hierarchy is heavily incomplete at lower ranks: species is blank in 73.2% of rows and genus in 54.9%, so most records can only be analyzed at higher ranks such as phylum (top: Proteobacteria at 17.7%) or class (top: Alphaproteobacteria at 11.4%). Second, country is mostly missing (51.9% blank), and Australia dominates the populated entries with 79,320 of 96,160 records (82.5%), suggesting a strong sampling bias. Year is left-skewed (skew -3.57) toward recent records with a long tail back to 1875, while depth ranges from 1,000 to 11,000 m with a median near 1,962 m. Check the missingness in species and country and the geographic concentration before any biodiversity analysis.
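Because lower ranks are so often blank, one way to decide the level at which each record can be analyzed is to find its most specific populated rank. The following is a minimal sketch, assuming each record is a dict keyed by the column names listed above and that blanks are stored as empty strings; `lowest_rank` is a hypothetical helper, not part of saturn, and the sample records are illustrative rather than real rows.

```python
from collections import Counter

# Ranks ordered from most to least specific, using the column names in this report.
RANKS = ["species", "genus", "family", "order", "class", "phylum"]


def lowest_rank(record):
    """Return the most specific rank with a non-blank value, or None if all are blank."""
    for rank in RANKS:
        value = record.get(rank)
        if value is not None and str(value).strip():
            return rank
    return None


# Illustrative records only (hypothetical rows, not taken from the file).
records = [
    {"phylum": "Echinodermata", "class": "Holothuroidea", "order": "Elasipodida",
     "family": "Elpidiidae", "genus": "Amperima", "species": "Amperima rosea"},
    {"phylum": "Proteobacteria", "class": "Alphaproteobacteria", "order": "",
     "family": "", "genus": "", "species": ""},
    {"phylum": "", "class": "", "order": "", "family": "", "genus": "", "species": ""},
]

# Tally how many records resolve to each rank.
rank_counts = Counter(lowest_rank(r) for r in records)
print(rank_counts)
```

Grouping by this derived rank before aggregating keeps species-level and class-level records from being compared as if they were the same kind of unit.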

scientificName high anthropic:claude-opus-4-7

Taxonomic name field mixing ranks from domain (Bacteria) and class (Alphaproteobacteria, 8.82% of rows) down to species (Xiphias gladius, Amperima rosea), across 1,478 distinct values with no nulls. The rank inconsistency is the headline issue: aggregating or joining on this column will conflate broad clades with individual species. An entropy ratio of 0.80 shows the distribution is fairly diffuse despite the dominant top value.
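The entropy ratio quoted here appears to be Shannon entropy in bits divided by its maximum, log2 of the cardinality. That reading is inferred from the reported figures (8.378 / log2(1478) is about 0.796), not confirmed from saturn's source. A minimal sketch under that assumption, with hypothetical function names:

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# Cross-check against the scientificName stats in this report:
# entropy 8.378 with cardinality 1,478.
reported = 8.378 / math.log2(1478)
print(round(reported, 3))

# Toy distributions: an even split versus one dominated by a single value.
print(entropy_ratio([25, 25, 25, 25]))
print(round(entropy_ratio([97, 1, 1, 1]), 3))
```

A ratio near 1 means values are spread evenly across categories; a ratio near 0 means one or a few values dominate, as in the species and country columns.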

species high anthropic:claude-opus-4-7

Categorical species identifier with 678 distinct values: 677 non-empty names (e.g., Amperima rosea, Xiphias gladius, Prionace glauca) covering marine taxa, plus the empty string. The dominant value is the empty string at 73.2% of 200,000 rows (146,400), meaning species is unrecorded for nearly three quarters of observations despite a reported null_rate of 0.0. Among labelled rows, the distribution is long-tailed: the most common species, Amperima rosea, has 4,520 occurrences, and the overall entropy_ratio is 0.327.
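The gap between a null_rate of 0.0 and 73.2% blanks can be closed by treating empty or whitespace-only strings as missing before profiling. A minimal sketch, assuming the column is available as a plain Python list of strings; both helper names are hypothetical.

```python
def effective_missing_rate(values):
    """Share of values that are None, empty, or whitespace-only."""
    values = list(values)
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or not str(v).strip())
    return missing / len(values)


def blanks_to_none(values):
    """Replace empty or whitespace-only strings with None so null counts see them."""
    return [None if isinstance(v, str) and not v.strip() else v for v in values]


# Toy column: 3 of 4 entries are blank, but none is a true null.
species = ["", "Amperima rosea", " ", ""]
print(sum(v is None for v in species) / len(species))  # naive null rate
print(effective_missing_rate(species))                  # counts blanks too
print(blanks_to_none(species))

# Arithmetic check with the counts reported for species: 146,400 blanks in 200,000 rows.
print(146_400 / 200_000)
```

The same conversion applies to genus, family, order, class, phylum, and country, all of which report zero nulls while carrying an empty-string category.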

genus high anthropic:claude-opus-4-7

Taxonomic genus label for what appears to be a marine biology dataset (Amperima, Xiphias, Scomber, Thunnus, Prionace). The dominant signal is missingness encoded as empty string: 109,800 of 200,000 rows (54.9%) have no genus assigned, despite a stated null_rate of 0.0. Across the remaining records, 840 distinct genera spread fairly thin, with Amperima the largest non-empty bucket at 4,520.

family high anthropic:claude-opus-4-7

Taxonomic family classification, with 606 distinct values (605 named families plus the empty string) across 200,000 records and zero nulls. The dominant value is the empty string at 40.18% (80,360 rows), effectively a hidden missing-data category, while the largest real family, Nitrosopumilaceae, covers only 5.42% (10,840). The remaining tail spans marine taxa (Elpidiidae, Keratoisididae, Coralliidae, Macrouridae, Scombridae), consistent with a deep-sea or oceanographic biodiversity dataset.

order high anthropic:claude-opus-4-7

Taxonomic order assignments for biological records, with 310 distinct values (309 named orders plus the empty string) spanning corals (Scleralcyonacea), archaea (Nitrosopumilales), dinoflagellates (Syndiniales), and various fish and crustacean groups. The dominant value is the empty string at 28.16% (56,320 rows), indicating a large block of unassigned or missing orders rather than true nulls. An entropy ratio of 0.66 shows moderate concentration: the 10 most common named orders account for 59,120 of the 143,680 labeled records (about 41%).

class high anthropic:claude-opus-4-7

Taxonomic class assignments across 200,000 records, spanning 138 distinct values with moderate concentration (entropy ratio 0.69, top class 'Alphaproteobacteria' at 11.4%). The mix is biologically heterogeneous, blending bacteria, archaea, fish, corals, and sponges, suggesting a marine biodiversity catalogue rather than a single-domain dataset. Notably, the second most common value is an empty string at 21,920 rows (~11%), which null_rate=0.0 misses because blanks are encoded as strings rather than nulls.

phylum high anthropic:claude-opus-4-7

Taxonomic phylum labels across 200,000 records spanning 65 distinct values, led by Proteobacteria at 17.74% and followed by Cnidaria, Chordata, and Echinodermata. The mix of bacterial, archaeal, and animal phyla (e.g., Thaumarchaeota alongside Chordata) suggests a broad biodiversity or environmental-sequencing dataset rather than a single kingdom. Notably, 13,200 rows carry an empty-string phylum — a hidden null channel despite the reported null_rate of 0.0.

latitude high anthropic:claude-opus-4-7

Geographic latitude in degrees, ranging from -75.0 to 89.06 across 200000 rows with no nulls and only 2617 unique values, suggesting coordinates snapped to a coarse grid. The distribution is nearly symmetric (skew 0.12) but platykurtic (kurtosis -1.22) with a wide IQR of 71.97, indicating fairly uniform global coverage rather than a concentration near populated bands. Mean (-1.58) and median (-4.998) sit just south of the equator.
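For reference, a perfectly uniform distribution has an excess kurtosis of -1.2, so the reported -1.22 is close to a flat spread. The sketch below checks that reference value on an evenly spaced grid over the reported latitude range. It assumes saturn reports excess (Fisher) kurtosis, which the negative value implies, and uses a population estimator that may differ slightly from saturn's own.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


# Evenly spaced values over the reported latitude range stand in for a uniform spread.
lo, hi, steps = -75.0, 89.06, 10_000
grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
print(round(excess_kurtosis(grid), 3))
```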

longitude high anthropic:claude-opus-4-7

This is a geographic longitude feature spanning the full valid range from -179.9872 to 179.9985 across 200000 rows with no nulls. The distribution is wide (std 127.09) and platykurtic (kurtosis -1.12) with a median of -94.29 and Q1 at -169.995, suggesting a heavy concentration in the western hemisphere and Pacific quadrant rather than a uniform global spread. Only 2654 unique values indicate the coordinates are quantised rather than raw floats.

depth high anthropic:claude-opus-4-7

Numeric measurement of depth ranging from 1,000.0 to 11,000.0 with mean 2,405.74 and median 1,961.70. Given the source file (deep_sea.json) and the taxonomic columns, this is most likely ocean depth in meters. The distribution is right-skewed (skew 1.09) with IQR 2,174.25 and a low 0.28% outlier rate; only 1,938 unique values across 200,000 rows point to discretized or rounded measurements. No nulls or zeros are present.

year high anthropic:claude-opus-4-7

This is a year column spanning 1875 to 2024 with 98 distinct values, most plausibly the year each observation was recorded. The distribution is heavily left-skewed (skew -3.57, kurtosis 19.6) with a median of 2016 but the mean pulled down to 2008.9, and 6.27% of non-null values (12,080 rows) are flagged as outliers, i.e. a long tail of older entries against a modern-heavy mass. About 3.6% of rows (7,240) are null.
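The outlier figure can be reproduced from the reported quartiles with the usual 1.5 IQR rule. The sketch below is a worked check, not saturn's implementation. It suggests the rate is computed over non-null rows, since 12,080 / 192,760 matches 6.27% while 12,080 / 200,000 gives 6.04%.

```python
def iqr_fences(q1, q3, k=1.5):
    """Tukey fences: values outside [q1 - k*IQR, q3 + k*IQR] count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Figures reported for year in this report's stats.
rows, nulls, n_outliers = 200_000, 7_240, 12_080
lower, upper = iqr_fences(2004, 2016)

print(lower, upper)                           # fences in calendar years
print(round(n_outliers / rows, 4))            # share of all rows
print(round(n_outliers / (rows - nulls), 4))  # share of non-null rows
```

With a lower fence at 1986 and an upper fence beyond the maximum year, every flagged value is an early record, which matches the long left tail described above.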

country high anthropic:claude-opus-4-7

Country of origin as a free-form categorical with 57 distinct values, dominated by an empty string at 51.92% (103,840 rows), which functions as an undeclared missing token rather than a true null (null_rate is 0.0). Australia accounts for 79,320 of the 96,160 populated values (82.5%), making the populated subset overwhelmingly Australia-centric, and the entropy_ratio of 0.277 confirms heavy concentration. Encoding is also inconsistent: the United States appears as 'United States' (8,160), 'USA' (680), 'UNITED STATES' (320), and 'United States of America' (240), and the top 20 include ambiguous or non-country entries such as 'CO' (160) and 'Discovery Deep, Red Sea' (160).
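A small alias table is one way to merge the spelling variants and turn blanks into real nulls before counting. This is a minimal sketch: `ALIASES` and `normalize_country` are hypothetical names, the table covers only the United States variants visible in this report, and ambiguous entries such as 'CO' are deliberately left untouched.

```python
from collections import Counter

# Hypothetical alias table; extend it only after checking the raw values.
ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def normalize_country(value):
    """Map blanks to None and known spelling variants to one canonical label."""
    if value is None or not value.strip():
        return None
    cleaned = value.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


# Counts copied from the country top-values list in this report.
reported = {
    "": 103_840,
    "Australia": 79_320,
    "United States": 8_160,
    "USA": 680,
    "UNITED STATES": 320,
    "United States of America": 240,
}

# Re-aggregate the counts under the normalized labels.
merged = Counter()
for label, count in reported.items():
    merged[normalize_country(label)] += count

print(merged["United States"])
print(merged[None])
```

Matching on a lowercased key handles the casing variants without listing each one, while unknown labels pass through unchanged so nothing is silently rewritten.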

Numeric correlation

scientificName categorical

rows: 200,000
null: 0 (0.0%)
unique: 1,478
top_value: Alphaproteobacteria
top_rate: 0.088
cardinality: 1,478
entropy: 8.378
entropy_ratio: 0.796
Top values (rank 1–20)
  1. Alphaproteobacteria — 17,640
  2. Bacteria — 12,760
  3. Nitrosopumilaceae — 10,840
  4. Syndiniales — 7,280
  5. Amperima rosea — 4,520
  6. Porifera — 2,400
  7. Thermoplasmata — 2,360
  8. Keratoisididae — 2,320
  9. Xiphias gladius — 2,000
  10. Pseudomonadales — 1,920
  11. Gammaproteobacteria — 1,920
  12. Monothalamea — 1,640
  13. Rhodospirillales — 1,640
  14. Scomber scombrus — 1,520
  15. Retaria — 1,520
  16. Dinophyceae — 1,320
  17. Rickettsiales — 1,200
  18. Chrysogorgia — 1,160
  19. Hexactinellida — 1,080
  20. Prionace glauca — 1,080

species categorical

rows: 200,000
null: 0 (0.0%)
unique: 678
top_value: (empty string)
top_rate: 0.732
cardinality: 678
entropy: 3.077
entropy_ratio: 0.327
Top values (rank 1–20)
  1. (empty string) — 146,400
  2. Amperima rosea — 4,520
  3. Xiphias gladius — 2,000
  4. Scomber scombrus — 1,520
  5. Prionace glauca — 1,080
  6. Oneirophanta mutabilis — 840
  7. Thunnus albacares — 760
  8. Farrea occa — 680
  9. Trissopathes pseudotristicha — 640
  10. Hoplostethus atlanticus — 520
  11. Trachurus trachurus — 480
  12. Florometra serratissima — 440
  13. Heteropolypus ritteri — 400
  14. Desmophyllum dianthus — 400
  15. Psychropotes longicauda — 360
  16. Thunnus obesus — 320
  17. Solenosmilia variabilis — 320
  18. Etmopterus granulosus — 320
  19. Molpadiodemas villosus — 280
  20. Paragorgia arborea — 280

genus categorical

rows: 200,000
null: 0 (0.0%)
unique: 841
top_value: (empty string)
top_rate: 0.549
cardinality: 841
entropy: 4.896
entropy_ratio: 0.504
Top values (rank 1–20)
  1. (empty string) — 109,800
  2. Amperima — 4,520
  3. Xiphias — 2,000
  4. Scomber — 1,520
  5. Retaria — 1,520
  6. Farrea — 1,400
  7. Thunnus — 1,360
  8. Chrysogorgia — 1,360
  9. Coryphaenoides — 1,240
  10. Prionace — 1,080
  11. Hemicorallium — 1,000
  12. Alteromonas — 960
  13. Paragorgia — 880
  14. Oneirophanta — 840
  15. Lepidisis — 800
  16. Trissopathes — 800
  17. Alepisaurus — 760
  18. Keratoisis — 720
  19. Pennatula — 600
  20. Hoplostethus — 600

family categorical

rows: 200,000
null: 0 (0.0%)
unique: 606
top_value: (empty string)
top_rate: 0.402
cardinality: 606
entropy: 5.500
entropy_ratio: 0.595
Top values (rank 1–20)
  1. (empty string) — 80,360
  2. Nitrosopumilaceae — 10,840
  3. Elpidiidae — 4,920
  4. Keratoisididae — 4,360
  5. Coralliidae — 4,040
  6. Macrouridae — 3,120
  7. Scombridae — 3,040
  8. Primnoidae — 2,440
  9. Chrysogorgiidae — 2,320
  10. Xiphiidae — 2,000
  11. Retariidae — 1,520
  12. Farreidae — 1,520
  13. Alteromonadaceae — 1,440
  14. Euplectellidae — 1,360
  15. Flavobacteriaceae — 1,320
  16. Schizopathidae — 1,280
  17. Caryophylliidae — 1,080
  18. Carcharhinidae — 1,080
  19. Acanthogorgiidae — 1,000
  20. Nitrospinaceae — 1,000

order categorical

rows: 200,000
null: 0 (0.0%)
unique: 310
top_value: (empty string)
top_rate: 0.282
cardinality: 310
entropy: 5.450
entropy_ratio: 0.659
Top values (rank 1–20)
  1. (empty string) — 56,320
  2. Scleralcyonacea — 15,720
  3. Nitrosopumilales — 10,840
  4. Syndiniales — 7,960
  5. Elasipodida — 5,680
  6. Gadiformes — 4,040
  7. Scombriformes — 3,360
  8. Carangiformes — 3,240
  9. Calanoida — 2,840
  10. Alteromonadales — 2,720
  11. Decapoda — 2,720
  12. Antipatharia — 2,560
  13. Sceptrulophora — 2,480
  14. Pseudomonadales — 2,400
  15. Flavobacteriales — 2,320
  16. Scleractinia — 2,240
  17. Malacalcyonacea — 2,000
  18. Lyssacinosida — 2,000
  19. Rotaliida — 2,000
  20. Amphipoda — 1,880

class categorical

rows: 200,000
null: 0 (0.0%)
unique: 138
top_value: Alphaproteobacteria
top_rate: 0.114
cardinality: 138
entropy: 4.892
entropy_ratio: 0.688
Top values (rank 1–20)
  1. Alphaproteobacteria — 22,840
  2. (empty string) — 21,920
  3. Teleostei — 19,120
  4. Octocorallia — 17,880
  5. Thaumarchaeota incertae sedis — 10,840
  6. Dinophyceae — 10,600
  7. Gammaproteobacteria — 9,440
  8. Holothuroidea — 7,760
  9. Malacostraca — 7,440
  10. Hexactinellida — 6,320
  11. Hexacorallia — 5,640
  12. Copepoda — 4,080
  13. Ophiuroidea — 3,320
  14. Polychaeta — 3,320
  15. Polycystina — 2,800
  16. Elasmobranchii — 2,640
  17. Globothalamea — 2,520
  18. Deltaproteobacteria — 2,440
  19. Thermoplasmata — 2,360
  20. Flavobacteria — 2,320

phylum categorical

rows: 200,000
null: 0 (0.0%)
unique: 65
top_value: Proteobacteria
top_rate: 0.177
cardinality: 65
entropy: 4.095
entropy_ratio: 0.680
Top values (rank 1–20)
  1. Proteobacteria — 35,480
  2. Cnidaria — 25,520
  3. Chordata — 23,920
  4. Echinodermata — 14,000
  5. (empty string) — 13,200
  6. Arthropoda — 13,200
  7. Myzozoa — 11,280
  8. Thaumarchaeota — 10,920
  9. Porifera — 10,360
  10. Foraminifera — 4,720
  11. Annelida — 4,240
  12. Radiozoa — 4,000
  13. Mollusca — 3,840
  14. Bacteroidetes — 3,440
  15. Euryarchaeota — 2,440
  16. Planctomycetes — 2,320
  17. Heterokontophyta — 1,680
  18. Verrucomicrobia — 1,520
  19. Brachiopoda — 1,520
  20. Nematoda — 1,160

latitude numeric

rows: 200,000
null: 0 (0.0%)
unique: 2,617
min: -75.000
max: 89.060
mean: -1.581
median: -4.998
std: 39.477
q1: -36.251
q3: 35.723
iqr: 71.974
skew: 0.118
kurtosis: -1.223
n_outliers: 0
outlier_rate: 0.000
zero_rate: 0.000

longitude numeric

rows: 200,000
null: 0 (0.0%)
unique: 2,654
min: -179.987
max: 179.999
mean: -51.577
median: -94.290
std: 127.087
q1: -169.995
q3: 63.756
iqr: 233.751
skew: 0.662
kurtosis: -1.123
n_outliers: 0
outlier_rate: 0.000
zero_rate: 4.00e-04

depth numeric

rows: 200,000
null: 0 (0.0%)
unique: 1,938
min: 1,000
max: 11,000
mean: 2,406
median: 1,962
std: 1,477
q1: 1,149
q3: 3,323
iqr: 2,174
skew: 1.091
kurtosis: 0.502
n_outliers: 560
outlier_rate: 2.80e-03
zero_rate: 0.000

year numeric

skew=-3.57; 6.3% rows beyond 1.5 IQR
rows: 200,000
null: 7,240 (3.6%)
unique: 98
min: 1875
max: 2024
mean: 2009
median: 2016
std: 15.425
q1: 2004
q3: 2016
iqr: 12.000
skew: -3.574
kurtosis: 19.645
n_outliers: 12,080
outlier_rate: 0.063
zero_rate: 0.000

country categorical

rows: 200,000
null: 0 (0.0%)
unique: 57
top_value: (empty string)
top_rate: 0.519
cardinality: 57
entropy: 1.616
entropy_ratio: 0.277
Top values (rank 1–20)
  1. (empty string) — 103,840
  2. Australia — 79,320
  3. United States — 8,160
  4. New Zealand — 1,320
  5. USA — 680
  6. Antarctica — 680
  7. Colombia — 640
  8. Chile — 520
  9. Bermuda — 400
  10. Portugal — 320
  11. UNITED STATES — 320
  12. Ross Dependency — 240
  13. Russia — 240
  14. United States of America — 240
  15. GREAT BRITAIN — 200
  16. Ecuador — 160
  17. Bahamas — 160
  18. Italy — 160
  19. CO — 160
  20. Discovery Deep, Red Sea — 160