data-trove-bioluminescence · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/bioluminescence.json

Saturn profiled 43,060 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

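# IPython shell escape: run this cell inside a notebook.
!pip install saturn-dissect
import subprocess

# Flags are reproduced as given in the original run.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/bioluminescence.json",
    "--findings", "data-trove-bioluminescence.json",
    "--llm", "anthropic:default",
], check=True)  # raise CalledProcessError if the CLI exits non-zero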

Summary confidence: medium

This dataset contains 43,060 occurrence records of bioluminescent marine organisms, covering 26 named groups across 7 phyla (from dinoflagellates and jellyfish to krill and bacteria), with geographic coordinates, taxonomy, and sampling depth. The most notable issue is depth: it has a 24.75% null rate, extreme skew (max 10,000 m vs. median 52.5 m), and an outlier rate above 10%, so depth-based analysis needs careful filtering before any conclusions are drawn. A second area to investigate is geographic bias: over 63% of country values are blank, and Australia, the United States, Peru, and Canada dominate the named entries, suggesting strong regional over-representation in the sourced datasets. The year column also carries a 42% null rate, which limits time-trend analysis even though the dated records span at least 1962 to 2020.

citing: depth.null_rate · depth.stats.outlier_rate · depth.stats.median · depth.stats.max · depth.stats.skew · country.stats.top_rate · country.top_values · year.null_rate · bioluminescence_group.stats.cardinality · phylum.top_values · row_count

Out[4]:

saturn.schema() · 14 columns

column	kind	n	null%	unique	alerts
scientificName	categorical	43,060	0.0%	245
genus	categorical	43,060	0.0%	27
family	categorical	43,060	0.0%	22
phylum	categorical	43,060	0.0%	7
class	categorical	43,060	0.0%	13
order	categorical	43,060	0.0%	17
latitude	numeric	43,060	0.0%	14,146
longitude	numeric	43,060	0.0%	14,637
depth	numeric	43,060	24.8%	3,283	null_rate high_skew outliers
date	text	43,060	12.0%	12,338	one_word allcaps duplicates
year	categorical	43,060	42.2%	137	null_rate
country	categorical	43,060	0.0%	130
dataset	categorical	43,060	0.0%	214
bioluminescence_group	categorical	43,060	0.0%	26

Fig 1.

bioluminescence_group · Look for which bioluminescent groups dominate the record count: Dinoflagellate leads with 4,000 records, while 18 of the 20 groups shown have exactly 2,000 each, suggesting capped or structured sampling.

Top values for bioluminescence_group (20 unique shown, of 26 total).
value	count	share
Dinoflagellate	4000	9.3%
Sea sparkle dinoflagellate	2000	4.6%
Bioluminescent dinoflagellate	2000	4.6%
Crystal jelly (source of GFP)	2000	4.6%
Mauve stinger jellyfish	2000	4.6%
Warty comb jelly	2000	4.6%
Crown jellyfish (alarm jelly)	2000	4.6%
Helmet jellyfish	2000	4.6%
Comb jelly	2000	4.6%
Krill (many species bioluminescent)	2000	4.6%
Northern krill	2000	4.6%
Copepod (secretes luminous fluid)	2000	4.6%
Deep-sea shrimp (NanoLuc source)	2000	4.6%
Sea firefly ostracod	2000	4.6%
Bioluminescent ostracod	2000	4.6%
Cock-eyed squid	2000	4.6%
Bioluminescent marine bacteria	2000	4.6%
Marine luminous bacteria	2000	4.6%
Parchment tube worm	2000	4.6%
Boring clam (piddock)	928	2.2%

Fig 2.

phylum · Arthropoda and Cnidaria together account for nearly half of all records; check whether this reflects true ecological abundance or sampling bias.

Top values for phylum (7 unique shown, of 7 total).
value	count	share
Arthropoda	12297	28.6%
Cnidaria	8874	20.6%
Myzozoa	8000	18.6%
Ctenophora	4168	9.7%
Proteobacteria	4000	9.3%
Mollusca	3721	8.6%
Annelida	2000	4.6%

Fig 3.

depth · The distribution is heavily right-skewed with a median of 52.5 m but values reaching 10,000 m — watch for the long tail of deep-water outliers that may distort analyses.

Histogram bins for depth (median: 52.5).
bin	count
-53 – 198.3	21893
198.3 – 449.6	4443
449.6 – 701	1966
701 – 952.3	1504
952.3 – 1204	1070
1204 – 1455	303
1455 – 1706	226
1706 – 1958	182
1958 – 2209	255
2209 – 2460	95
2460 – 2712	111
2712 – 2963	57
2963 – 3214	55
3214 – 3466	56
3466 – 3717	42
3717 – 3968	24
3968 – 4220	31
4220 – 4471	20
4471 – 4722	6
4722 – 4974	14
4974 – 5225	12
5225 – 5476	14
5476 – 5727	6
5727 – 5979	2
5979 – 6230	4
6230 – 6481	2
6481 – 6733	0
6733 – 6984	0
6984 – 7235	0
7235 – 7487	0
7487 – 7738	3
7738 – 7989	0
7989 – 8241	0
8241 – 8492	0
8492 – 8743	2
8743 – 8995	0
8995 – 9246	0
9246 – 9497	0
9497 – 9749	0
9749 – 10000	4

Fig 4.

year · Year 2000 has by far the most records; look for uneven temporal coverage and note the 42% null rate means a large share of observations have no year assigned.

Top values for year (20 unique shown, of 137 total).
value	count	share
2000	1287	3.0%
2001	703	1.6%
2016	691	1.6%
2008	688	1.6%
2010	651	1.5%
2002	579	1.3%
2013	556	1.3%
2011	554	1.3%
1979	524	1.2%
2014	519	1.2%
2003	514	1.2%
2004	511	1.2%
2015	504	1.2%
2012	493	1.1%
2007	459	1.1%
2006	442	1.0%
2005	438	1.0%
1998	437	1.0%
2020	437	1.0%
2019	436	1.0%

Fig 5.

country · Over 63% of records have no country value, and among those that do, Australia dominates — flag this geographic gap before drawing any regional conclusions.

Top values for country (20 unique shown, of 130 total).
value	count	share
(empty string)	27422	63.7%
Australia	4573	10.6%
United States	1416	3.3%
PERU	1098	2.5%
Canada	976	2.3%
SOVIET UNION	634	1.5%
Israel	550	1.3%
GB	465	1.1%
Spain	370	0.9%
Sweden	340	0.8%
USA	323	0.8%
Ukraine	316	0.7%
Romania	310	0.7%
Antarctica	242	0.6%
Republic of Korea	225	0.5%
Colombia	214	0.5%
Italy	213	0.5%
New Zealand	212	0.5%
FR	210	0.5%
Brazil	179	0.4%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
scientificName	categorical	0.0%
genus	categorical	0.0%
family	categorical	0.0%
phylum	categorical	0.0%
class	categorical	0.0%
order	categorical	0.0%
latitude	numeric	0.0%
longitude	numeric	0.0%
depth	numeric	24.8%
date	text	12.0%
year	categorical	42.2%
country	categorical	0.0%
dataset	categorical	0.0%
bioluminescence_group	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	latitude	longitude	depth
latitude	+1.00	-0.33	-0.06
longitude	-0.33	+1.00	+0.00
depth	-0.06	+0.00	+1.00

scientificName categorical label

This column contains scientific (Latin) names of marine organisms, covering 245 distinct taxa across 43,060 records with no nulls. The values mix genus-only entries (e.g., 'Lingulodinium', 'Photobacterium', 'Vibrio') with full binomial species names, so taxonomic resolution is inconsistent across records. The top value 'Mnemiopsis leidyi' appears 2,000 times (~4.6% of rows), and the top 10 values together account for about 40% of records (17,430 rows), so the dataset is dominated by a relatively small set of taxa. The entropy ratio of 0.747 is consistent with moderate concentration for a 245-cardinality field.

Treatment: Standardise taxonomic resolution (genus vs. species) before grouping; one-hot or target-encode for modelling.
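One way to approach the standardisation step is to tag each name with its apparent rank and extract the genus token, so grouping can happen at a consistent level. This is a minimal sketch, assuming names are whitespace-separated and that a single token means a genus-only entry; `taxon_rank` and `genus_of` are hypothetical helper names, not part of saturn.

```python
def taxon_rank(name: str) -> str:
    """Classify a scientific name by its number of whitespace-separated parts."""
    parts = name.split()
    if len(parts) == 1:
        return "genus"
    if len(parts) == 2:
        return "species"
    return "other"  # empty strings, subspecies, authorship strings, etc.


def genus_of(name: str) -> str:
    """Return the first token, which is the genus for genus-only and binomial names."""
    parts = name.split()
    return parts[0] if parts else ""


# Sample values taken from the top-values table for scientificName.
names = ["Mnemiopsis leidyi", "Lingulodinium", "Vibrio", "Noctiluca scintillans"]
ranks = {n: taxon_rank(n) for n in names}
genera = sorted({genus_of(n) for n in names})
print(ranks)
print(genera)
```

Grouping on `genus_of(name)` gives a common level for every record, at the cost of discarding species detail where it exists.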

anthropic:default · confidence high

Out[13]:

saturn.columns["scientificName"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	245
top_value	Mnemiopsis leidyi
top_rate	0.04645
cardinality	245
entropy	5.928
entropy_ratio	0.747

Fig 8.

Top values for scientificName.

Top values for scientificName (20 unique shown, of 245 total).
value	count	share
Mnemiopsis leidyi	2000	4.6%
Lingulodinium	1976	4.6%
Meganyctiphanes norvegica	1928	4.5%
Photobacterium	1842	4.3%
Periphylla periphylla	1802	4.2%
Pelagia noctiluca	1768	4.1%
Noctiluca scintillans	1728	4.0%
Vibrio	1584	3.7%
Vargula norvegica	1482	3.4%
Cypridina dentata	1320	3.1%
Euphausia superba	1298	3.0%
Chaetopterus variopedatus	1222	2.8%
Beroe	1202	2.8%
Oplophorus spinosus	1170	2.7%
Histioteuthis	952	2.2%
Alexandrium	944	2.2%
Metridia lucens	872	2.0%
Aequorea	798	1.9%
Atolla wyvillei	756	1.8%
Pyrocystis pseudonoctiluca	742	1.7%

genus categorical label

This column contains biological genus names for marine organisms, including bioluminescent dinoflagellates (Noctiluca, Pyrocystis, Lingulodinium, Alexandrium), jellyfish (Pelagia, Atolla, Periphylla), and ctenophores (Mnemiopsis, Beroe), suggesting a marine biology or bioluminescence dataset. With 27 unique genera across 43,060 rows and no nulls, the distribution is remarkably flat among the top values: all 20 genera shown appear exactly 2,000 times (40,000 rows), leaving 3,060 rows for the other 7 genera, which implies a per-genus cap or deliberately stratified sampling. The entropy ratio of 0.9586 is very high for only 27 categories, confirming near-uniform representation across most genera. No skew or imbalance alerts were triggered.

Treatment: Use as a classification target or stratification key; one-hot encode or ordinal encode for modelling, noting that 20 of the 27 classes are exactly balanced while the remaining 7 are much smaller.

anthropic:default · confidence high

Out[16]:

saturn.columns["genus"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	27
top_value	Noctiluca
top_rate	0.04645
cardinality	27
entropy	4.558
entropy_ratio	0.9586
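The reported entropy_ratio values can be reproduced by dividing the entropy by log2 of the cardinality. The sketch below assumes that definition (Shannon entropy in bits over its maximum for k categories); it matches the figures in this report, but it is an inference from the numbers rather than documented saturn behaviour, and the function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits for a sequence of category counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# Reported genus stats: entropy 4.558 bits over 27 categories.
reported_ratio = 4.558 / math.log2(27)
print(round(reported_ratio, 4))

# A perfectly flat 27-way split reaches the maximum ratio of 1.
flat_ratio = entropy_ratio([2000] * 27)
print(round(flat_ratio, 4))
```

A ratio near 1 means the categories are close to equally frequent; a ratio near 0 means one category dominates.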

Fig 9.

Top values for genus.

Top values for genus (20 unique shown, of 27 total).
value	count	share
Noctiluca	2000	4.6%
Pyrocystis	2000	4.6%
Lingulodinium	2000	4.6%
Alexandrium	2000	4.6%
Aequorea	2000	4.6%
Pelagia	2000	4.6%
Mnemiopsis	2000	4.6%
Atolla	2000	4.6%
Periphylla	2000	4.6%
Beroe	2000	4.6%
Euphausia	2000	4.6%
Meganyctiphanes	2000	4.6%
Metridia	2000	4.6%
Oplophorus	2000	4.6%
Vargula	2000	4.6%
Cypridina	2000	4.6%
Histioteuthis	2000	4.6%
Vibrio	2000	4.6%
Photobacterium	2000	4.6%
Chaetopterus	2000	4.6%

family categorical label

This column contains biological family-level taxonomic classifications, covering 22 distinct families across 43,060 records with no nulls. The distribution is notably near-uniform for the top entries: four families (Pyrocystaceae, Euphausiidae, Cypridinidae, Vibrionaceae) each have exactly 4,000 records, strongly suggesting deliberate stratified sampling or synthetic balancing rather than natural occurrence frequencies. The high entropy ratio of 0.932 (close to maximum for 22 categories) confirms an unusually flat distribution. Families represented include bioluminescent marine organisms (dinoflagellates, krill, ostracods, bacteria), hinting this is a marine bioluminescence or plankton dataset.

Treatment: Use as a categorical grouping variable or classification target; encode with integer or one-hot encoding, noting the artificially balanced class distribution may not reflect real-world priors.

anthropic:default · confidence high

Out[19]:

saturn.columns["family"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	22
top_value	Pyrocystaceae
top_rate	0.09289
cardinality	22
entropy	4.157
entropy_ratio	0.9322

Fig 10.

Top values for family.

Top values for family (20 unique shown, of 22 total).
value	count	share
Pyrocystaceae	4000	9.3%
Euphausiidae	4000	9.3%
Cypridinidae	4000	9.3%
Vibrionaceae	4000	9.3%
Metridinidae	2297	5.3%
Noctilucaceae	2000	4.6%
Lingulodiniaceae	2000	4.6%
Aequoreidae	2000	4.6%
Pelagiidae	2000	4.6%
Bolinopsidae	2000	4.6%
Atollidae	2000	4.6%
Periphyllidae	2000	4.6%
Beroidae	2000	4.6%
Oplophoridae	2000	4.6%
Histioteuthidae	2000	4.6%
Chaetopteridae	2000	4.6%
Pholadidae	928	2.2%
Renillidae	874	2.0%
Vampyroteuthidae	484	1.1%
Thysanoteuthidae	209	0.5%

phylum categorical label

This column encodes biological phylum classifications across 43,060 records with exactly 7 distinct values and no nulls, making it a clean taxonomic label field. The distribution spans both animal (Arthropoda, Cnidaria, Ctenophora, Mollusca, Annelida) and non-animal (Myzozoa, Proteobacteria) kingdoms, suggesting the dataset covers a broad range of marine or environmental organisms. Arthropoda dominates at 28.6% (12,297 records), while the entropy ratio of 0.923 indicates a fairly well-spread distribution across categories. The presence of Proteobacteria (bacteria) and Myzozoa (protists) alongside metazoans may surprise analysts expecting a purely animal-focused dataset.

Treatment: One-hot encode or use as a stratification/grouping variable; verify whether cross-kingdom mixing is intentional before modelling.

anthropic:default · confidence high

Out[22]:

saturn.columns["phylum"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	7
top_value	Arthropoda
top_rate	0.2856
cardinality	7
entropy	2.593
entropy_ratio	0.9235

Fig 11.

Top values for phylum.

Top values for phylum (7 unique shown, of 7 total).
value	count	share
Arthropoda	12297	28.6%
Cnidaria	8874	20.6%
Myzozoa	8000	18.6%
Ctenophora	4168	9.7%
Proteobacteria	4000	9.3%
Mollusca	3721	8.6%
Annelida	2000	4.6%

class categorical label

This column contains biological taxonomic class labels for marine/aquatic organisms, with 13 distinct classes across 43,060 records and zero nulls. The entropy ratio of 0.927 indicates a fairly even distribution across classes, though imbalance exists: 'Dinophyceae' is the most frequent at 18.6% (8,000 records), 'Scyphozoa' and 'Malacostraca' each hold exactly 6,000 records, and the smallest class, 'Octocorallia', has 874. The round counts suggest some classes were sampled to fixed quotas. The mix spans protists (Dinophyceae), bacteria (Gammaproteobacteria), crustaceans (Malacostraca, Ostracoda, Copepoda), molluscs (Cephalopoda, Bivalvia), annelids (Polychaeta), and cnidarians/ctenophores (Scyphozoa, Hydrozoa, Octocorallia, Nuda, Tentaculata), pointing to a plankton or marine biodiversity dataset.

Treatment: Use as classification target or stratification key; check class imbalance before modelling and consider stratified splits given the ~9× spread between the largest (8,000) and smallest (874) classes.
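A stratified split keeps each class at roughly the same proportion in the train and test sets. The following is a standard-library sketch under stated assumptions: `stratified_split` is a hypothetical helper, the 20% test fraction is arbitrary, and the toy labels only mimic the imbalance between a large and a small class.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each label split at about the same rate."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # Keep at least one test row per class when the class has 2+ rows.
        n_test = max(1, round(len(idxs) * test_fraction)) if len(idxs) > 1 else 0
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: one large class and one small class.
labels = ["Dinophyceae"] * 80 + ["Octocorallia"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
test_labels = [labels[i] for i in test_idx]
print(test_labels.count("Dinophyceae"), test_labels.count("Octocorallia"))
```

With a purely random split, a class as small as Octocorallia could end up with very few test rows by chance; splitting per class avoids that.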

anthropic:default · confidence high

Out[25]:

saturn.columns["class"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	13
top_value	Dinophyceae
top_rate	0.1858
cardinality	13
entropy	3.43
entropy_ratio	0.9268

Fig 12.

Top values for class.

Top values for class (13 unique shown, of 13 total).
value	count	share
Dinophyceae	8000	18.6%
Scyphozoa	6000	13.9%
Malacostraca	6000	13.9%
Ostracoda	4000	9.3%
Gammaproteobacteria	4000	9.3%
Cephalopoda	2793	6.5%
Copepoda	2297	5.3%
Tentaculata	2168	5.0%
Hydrozoa	2000	4.6%
Nuda	2000	4.6%
Polychaeta	2000	4.6%
Bivalvia	928	2.2%
Octocorallia	874	2.0%

order categorical label

This column contains biological taxonomic order classifications for marine organisms, with 17 distinct values (16 named orders plus a blank) spanning bacteria (Vibrionales), dinoflagellates (Gonyaulacales, Noctilucales), jellyfish (Coronatae, Leptothecata), crustaceans (Euphausiacea, Calanoida), and cephalopods (Oegopsida) among others. The distribution is moderately uneven: Gonyaulacales leads at 13.9% (6,000 rows) while four orders share exactly 4,000 rows each, suggesting possible stratified or quota-based sampling rather than natural observation frequencies. The entropy ratio of 0.949 indicates near-uniform spread across the 17 categories, which is unusually high for a taxonomic label and reinforces the structured-sampling hypothesis. No nulls are reported, but 2,000 rows (4.6%) carry a blank order value and should be treated as missing.

Treatment: Use as a categorical target or stratification variable; one-hot encode or ordinal-encode for modelling, and verify whether the equal-count groupings (4,000 each) reflect intentional sampling design.

anthropic:default · confidence high

Out[28]:

saturn.columns["order"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	17
top_value	Gonyaulacales
top_rate	0.1393
cardinality	17
entropy	3.879
entropy_ratio	0.9491

Fig 13.

Top values for order.

Top values for order (17 unique shown, of 17 total).
value	count	share
Gonyaulacales	6000	13.9%
Coronatae	4000	9.3%
Euphausiacea	4000	9.3%
Myodocopida	4000	9.3%
Vibrionales	4000	9.3%
Oegopsida	2309	5.4%
Calanoida	2297	5.3%
Lobata	2168	5.0%
Noctilucales	2000	4.6%
Leptothecata	2000	4.6%
Semaeostomeae	2000	4.6%
Beroida	2000	4.6%
Decapoda	2000	4.6%
(blank)	2000	4.6%
Myida	928	2.2%
Scleralcyonacea	874	2.0%
Vampyromorpha	484	1.1%

latitude numeric feature

This column contains geographic latitude values, spanning from -76.619° (deep Southern Hemisphere) to 88.29° (near the North Pole), with 14,146 unique values across 43,060 rows, so many locations repeat. The mean (19.1°) is notably lower than the median (36.7°), and the IQR of 69.6° (q1 -19.3°, q3 50.3°) straddles the equator, indicating records are spread across both hemispheres rather than clustered in one region. The negative skew (-0.66) and zero outlier rate are consistent with a broad spread, but the histogram is not uniform: counts peak at 47° to 51° N (4,244 rows) with a secondary peak at 31° to 35° S (2,736 rows). The 0.046% zero-rate (≈20 rows) warrants inspection, as 0.0° latitude may represent missing/default values.

Treatment: Verify ~20 zero-latitude rows for data quality; use as-is or pair with longitude for spatial joins/clustering; consider binning into latitude bands if used as a categorical feature.
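The zero check and the latitude-band binning mentioned in the treatment could be sketched as follows. The 30° band width, the band label format, and both function names are assumptions chosen for illustration; exact zeros are only flagged for review, since some may be genuine equator or Prime Meridian positions.

```python
def latitude_band(lat: float, width: float = 30.0) -> str:
    """Label the band [lo, lo + width) containing lat; 90 falls in the top band."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    n_bands = int(180.0 / width)
    index = min(int((lat + 90.0) // width), n_bands - 1)
    lo = -90.0 + index * width
    return f"{lo:+.0f} to {lo + width:+.0f}"


def zero_flag(lat: float, lon: float) -> str:
    """Flag exact zeros, which are a common placeholder for missing coordinates."""
    if lat == 0.0 and lon == 0.0:
        return "both_zero"
    if lat == 0.0:
        return "lat_zero"
    if lon == 0.0:
        return "lon_zero"
    return "ok"


print(latitude_band(36.71))    # the reported median latitude
print(latitude_band(-76.62))   # the reported minimum latitude
print(zero_flag(0.0, 0.0), zero_flag(0.0, 151.2), zero_flag(36.71, 3.06))
```

Rows flagged `both_zero` are the strongest candidates for placeholder coordinates; `lat_zero` and `lon_zero` rows need a look at the surrounding fields before being dropped.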

anthropic:default · confidence high

Out[31]:

saturn.columns["latitude"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	14,146
min	-76.62
max	88.29
mean	19.1
median	36.71
std	40.27
q1	-19.31
q3	50.3
iqr	69.61
skew	-0.6614
kurtosis	-0.9355
n_outliers	0
outlier_rate	0
zero_rate	0.0004645

Fig 14.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 36.710105896).
bin	count
-76.62 – -72.5	34
-72.5 – -68.37	134
-68.37 – -64.25	770
-64.25 – -60.13	872
-60.13 – -56.01	500
-56.01 – -51.88	309
-51.88 – -47.76	279
-47.76 – -43.64	377
-43.64 – -39.51	1598
-39.51 – -35.39	900
-35.39 – -31.27	2736
-31.27 – -27.15	1218
-27.15 – -23.02	589
-23.02 – -18.9	615
-18.9 – -14.78	671
-14.78 – -10.66	768
-10.66 – -6.533	598
-6.533 – -2.41	504
-2.41 – 1.713	319
1.713 – 5.836	199
5.836 – 9.958	628
9.958 – 14.08	953
14.08 – 18.2	793
18.2 – 22.33	744
22.33 – 26.45	566
26.45 – 30.57	783
30.57 – 34.69	1840
34.69 – 38.82	2424
38.82 – 42.94	2931
42.94 – 47.06	3500
47.06 – 51.19	4244
51.19 – 55.31	3508
55.31 – 59.43	2052
59.43 – 63.55	1070
63.55 – 67.68	764
67.68 – 71.8	1560
71.8 – 75.92	532
75.92 – 80.04	94
80.04 – 84.17	52
84.17 – 88.29	32

longitude numeric feature

This column contains geographic longitude values, spanning the full valid range from -179.9987 to 179.99 degrees, confirming global coverage. The distribution is broad and near-symmetric (skew 0.138, kurtosis -0.646) with an IQR of 124.12 degrees, indicating records are spread across both the eastern and western hemispheres, though the histogram shows clear peaks around the Prime Meridian (-9° to 9°) and at 144° to 153° E. The mean (9.64) is notably higher than the median (3.06), hinting at a slight eastward bias in the dataset. A zero_rate of 0.11% (≈48 rows) warrants a check for null-substituted zeros masquerading as the Prime Meridian.

Treatment: Use as-is for geospatial joins or pair with latitude for coordinate-based features; verify ~48 zero-value rows are genuine Prime Meridian locations and not null substitutions.

anthropic:default · confidence high

Out[34]:

saturn.columns["longitude"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	14,637
min	-180
max	180
mean	9.64
median	3.057
std	88.61
q1	-60.19
q3	63.93
iqr	124.1
skew	0.1376
kurtosis	-0.6464
n_outliers	0
outlier_rate	0
zero_rate	0.001115

Fig 15.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: 3.05735505).
bin	count
-180 – -171	405
-171 – -162	653
-162 – -153	381
-153 – -144	284
-144 – -135	177
-135 – -126	485
-126 – -117	1914
-117 – -108	117
-108 – -99	151
-99 – -90	289
-90 – -81	785
-81 – -72	1676
-72 – -63	2314
-63 – -54	2269
-54 – -45	643
-45 – -36	680
-36 – -27	556
-27 – -18	530
-18 – -9.004	1463
-9.004 – -0.004333	3887
-0.004333 – 8.995	4520
8.995 – 18	2373
18 – 26.99	1671
26.99 – 35.99	2361
35.99 – 44.99	834
44.99 – 53.99	313
53.99 – 62.99	501
62.99 – 71.99	696
71.99 – 80.99	561
80.99 – 89.99	360
89.99 – 98.99	288
98.99 – 108	70
108 – 117	753
117 – 126	480
126 – 135	964
135 – 144	761
144 – 153	3177
153 – 162	1422
162 – 171	582
171 – 180	714

depth numeric feature

This column represents a physical depth measurement, ranging from -53.0 to 10,000.0 with a median of only 52.5 (the summary reads these as metres of sampling depth). Most records are shallow, but a long tail of deep measurements drives extreme skew (4.72) and very high kurtosis (35.89). The null rate of 24.75% and outlier rate of 10.63% (3,444 of the 32,402 non-null values) are both flagged as alerts, and 11.92% of values are exactly zero, suggesting possible default-fill or surface-level records that may need special handling. The negative minimum (-53.0) also deserves a check, since it may reflect a sign convention or an entry error. The IQR of 313.5 against a std of 570.18 is consistent with a heavy-tailed distribution driven by a minority of extreme deep values up to 10,000.0.

Treatment: Impute nulls carefully (median preferred over mean given skew), investigate zero and negative values for validity, then apply a signed-log or log1p-style transform before modelling to reduce the heavy right tail; a plain log is undefined for the zero and negative values present.
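Because the column contains zeros and at least one negative value (min -53), a plain logarithm cannot be applied directly. A signed log1p transform, sign(x) · log(1 + |x|), is one option that stays defined everywhere. The sketch below pairs it with median imputation and a missing-value flag; the helper names and the sample values are illustrative assumptions, with the sample drawn from the reported quartiles and extremes.

```python
import math
import statistics


def signed_log1p(x: float) -> float:
    """sign(x) * log(1 + |x|): defined for zero and negative values."""
    return math.copysign(math.log1p(abs(x)), x)


def impute_median(values):
    """Replace None with the median of observed values; also return a missing flag."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


# Illustrative sample only: reported min, q1, median, q3, max plus two nulls.
depths = [0.0, 7.5, None, 52.5, 321.0, -53.0, 10000.0, None]
filled, was_missing = impute_median(depths)
transformed = [signed_log1p(v) for v in filled]
print(filled)
print([round(t, 3) for t in transformed])
```

Keeping `was_missing` as a separate feature preserves the information that a quarter of the depths were absent, which imputation alone would hide.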

anthropic:default · confidence high

Out[37]:

saturn.columns["depth"].stats

stat	value
n	43,060
nulls	10,658 (24.8%)
unique	3,283
min	-53
max	10,000
mean	281.2
median	52.5
std	570.2
q1	7.5
q3	321
iqr	313.5
skew	4.724
kurtosis	35.89
n_outliers	3,444
outlier_rate	0.1063
zero_rate	0.1192
alert: null_rate	24.8% null
alert: high_skew	skew=+4.72
alert: outliers	10.6% rows beyond 1.5 IQR
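The outlier alert can be checked by hand from the reported quartiles. The sketch below assumes the usual Tukey fences (q1 − 1.5·IQR and q3 + 1.5·IQR), which is what the alert text suggests, and compares two possible denominators for the rate. That the reported 0.1063 uses non-null values is an inference from the arithmetic, not something the report states.

```python
q1, q3 = 7.5, 321.0          # reported depth quartiles
iqr = q3 - q1                # 313.5, matching the reported iqr
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

n_total, n_null, n_outliers = 43_060, 10_658, 3_444
n_non_null = n_total - n_null
rate_non_null = n_outliers / n_non_null
rate_all_rows = n_outliers / n_total

print(lower_fence, upper_fence)
print(round(rate_non_null, 4), round(rate_all_rows, 4))
```

Since the reported minimum (-53) lies above the lower fence, every flagged outlier under this definition sits on the deep side, beyond the upper fence.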

Fig 16.

Distribution of depth. Vertical dash marks the median.

Histogram bins for depth (median: 52.5).
bin	count
-53 – 198.3	21893
198.3 – 449.6	4443
449.6 – 701	1966
701 – 952.3	1504
952.3 – 1204	1070
1204 – 1455	303
1455 – 1706	226
1706 – 1958	182
1958 – 2209	255
2209 – 2460	95
2460 – 2712	111
2712 – 2963	57
2963 – 3214	55
3214 – 3466	56
3466 – 3717	42
3717 – 3968	24
3968 – 4220	31
4220 – 4471	20
4471 – 4722	6
4722 – 4974	14
4974 – 5225	12
5225 – 5476	14
5476 – 5727	6
5727 – 5979	2
5979 – 6230	4
6230 – 6481	2
6481 – 6733	0
6733 – 6984	0
6984 – 7235	0
7235 – 7487	0
7487 – 7738	3
7738 – 7989	0
7989 – 8241	0
8241 – 8492	0
8492 – 8743	2
8743 – 8995	0
8995 – 9246	0
9246 – 9497	0
9497 – 9749	0
9749 – 10000	4

date text timestamp

This column contains date strings stored as text, using multiple formats including year ranges ('1962/1964'), year-month ranges ('2010-05/2010-06'), year-month ('2013-08'), and full ISO dates ('2017-05-30'). The heterogeneous format mix and wide length range (min 4, max 51 chars) will require normalization before any temporal analysis. Among the 37,878 non-null values there are only 12,338 unique strings (a 67.4% duplicate rate), so many records share the same date, consistent with batch-recorded data. The 12.0% null rate is also notable and should be investigated for systematic missingness. The one_word and allcaps alerts are likely artifacts of applying text heuristics to digit-heavy date strings rather than real quality problems.

Treatment: Parse and normalize the mixed date formats (range, year-month, ISO) into structured start/end date fields before use in temporal analysis or modelling.
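A parser for the four formats quoted above might look like the following. It is a sketch under stated assumptions: ranges use '/' as the separator, each side is 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD', and any time-of-day suffix after 'T' or a space is discarded. Longer strings in the column may use formats not covered here, in which case the function raises `ValueError`. The function names are hypothetical.

```python
import calendar
from datetime import date


def _bounds(token: str):
    """Return (first_day, last_day) covered by 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'."""
    # Assumption: drop any time-of-day suffix such as 'T12:00:00Z'.
    token = token.strip().split("T")[0].split(" ")[0]
    parts = [int(p) for p in token.split("-")]
    if len(parts) == 1:
        return date(parts[0], 1, 1), date(parts[0], 12, 31)
    if len(parts) == 2:
        last = calendar.monthrange(parts[0], parts[1])[1]
        return date(parts[0], parts[1], 1), date(parts[0], parts[1], last)
    if len(parts) == 3:
        day = date(*parts)
        return day, day
    raise ValueError(f"unrecognised date token: {token!r}")


def parse_date_range(text: str):
    """Return (start, end): start of the first token, end of the last token."""
    tokens = text.split("/")
    if len(tokens) > 2:
        raise ValueError(f"unrecognised date range: {text!r}")
    start, _ = _bounds(tokens[0])
    _, end = _bounds(tokens[-1])
    return start, end


for sample in ["1962/1964", "2010-05/2010-06", "2013-08", "2017-05-30"]:
    print(sample, parse_date_range(sample))
```

Storing a start and an end date for every record keeps the original precision visible: a single-day record has equal bounds, while a year-only record spans twelve months.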

anthropic:default · confidence high

Out[40]:

saturn.columns["date"].stats

stat	value
n	43,060
nulls	5,182 (12.0%)
unique	12,338
len_min	4
len_max	51
len_mean	16.45
len_median	19
len_p95	39
word_mean	1.03
word_median	1
n_empty	0
n_duplicates	25,540
duplicate_rate	0.6743
vocab_size	10,135
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9705
allcaps_rate	1
boilerplate_rate	0
alert: one_word	97.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: duplicates	67.4% duplicate strings

Fig 17.

Character-length distribution for date.

Character-length distribution for date (mean: 16.4484133269972).
chars	count
4 – 5	276
5 – 6	11
6 – 8	1316
8 – 9	78
9 – 10	573
10 – 11	13982
11 – 12	0
12 – 13	4
13 – 15	0
15 – 16	1820
16 – 17	691
17 – 18	49
18 – 19	3297
19 – 20	11038
20 – 22	392
22 – 23	858
23 – 24	102
24 – 25	993
25 – 26	0
26 – 28	15
28 – 29	100
29 – 30	126
30 – 31	12
31 – 32	0
32 – 33	224
33 – 35	0
35 – 36	0
36 – 37	1
37 – 38	0
38 – 39	1632
39 – 40	2
40 – 42	226
42 – 43	0
43 – 44	0
44 – 45	18
45 – 46	0
46 – 47	0
47 – 49	0
49 – 50	0
50 – 51	42

year categorical feature

This column represents a calendar year associated with each record, stored as a categorical type despite being numeric in nature. The top value is '2000' with 1,287 occurrences (~5.2% of non-null values, 3.0% of all rows), and the 137 unique values include at least 1979 through 2020. Two signals stand out: the null rate is 42.18%, which is severe enough to warrant an alert, and the year 2000 is notably over-represented relative to adjacent years (e.g., 2001 has only 703), suggesting either a data collection artifact or a large batch of undated records defaulted to that year.

Treatment: Investigate the 2000 spike for default-value contamination, impute or flag the 42.18% nulls, then cast to integer for temporal modelling.

anthropic:default · confidence high

Out[43]:

saturn.columns["year"].stats

stat	value
n	43,060
nulls	18,164 (42.2%)
unique	137
top_value	2000
top_rate	0.0517
cardinality	137
entropy	6.142
entropy_ratio	0.8653
alert: null_rate	42.2% null
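The top_rate of 0.0517 and the 3.0% share shown in the top-values table describe the same 1,287 rows with different denominators. The arithmetic below reproduces both figures; reading top_rate as a share of non-null values is an inference from the numbers rather than documented behaviour.

```python
n_rows, n_null, top_count = 43_060, 18_164, 1_287

n_non_null = n_rows - n_null                 # rows that have a year
share_of_all = top_count / n_rows            # share column in the top-values table
share_of_non_null = top_count / n_non_null   # matches the reported top_rate

print(n_non_null)
print(round(share_of_all, 3), round(share_of_non_null, 4))
```

When comparing rates across columns, it helps to note which denominator applies, because columns with many nulls show a large gap between the two.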

Fig 18.

Top values for year.

Top values for year (20 unique shown, of 137 total).
value	count	share
2000	1287	3.0%
2001	703	1.6%
2016	691	1.6%
2008	688	1.6%
2010	651	1.5%
2002	579	1.3%
2013	556	1.3%
2011	554	1.3%
1979	524	1.2%
2014	519	1.2%
2003	514	1.2%
2004	511	1.2%
2015	504	1.2%
2012	493	1.1%
2007	459	1.1%
2006	442	1.0%
2005	438	1.0%
1998	437	1.0%
2020	437	1.0%
2019	436	1.0%

country categorical feature

This column captures the country associated with each record, with 130 distinct values across 43,060 rows. The dominant 'value' is an empty string, accounting for 63.7% of all records (27,422 rows), a critical data quality issue that functionally resembles a very high null rate even though the reported null rate is 0.0%. The remaining values show inconsistent normalisation: mixed casing ('PERU', 'SOVIET UNION' vs. 'Australia', 'Canada'), abbreviations and duplicate spellings ('GB', 'FR', and 'USA' alongside 'United States'), and anachronistic entities ('SOVIET UNION'), suggesting data was collected from heterogeneous or historical sources without standardisation.

Treatment: Treat empty strings as nulls, then standardise to ISO 3166-1 country codes before modelling or aggregation.
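A small normaliser illustrates the treatment: blanks become missing, known spellings map to ISO 3166-1 alpha-2 codes, and anything else is set aside for manual review. The alias table here is a hand-written, hypothetical subset covering only a few spellings from the top-values table; a real pipeline would need a complete mapping and a decision on historical entities such as 'SOVIET UNION'.

```python
# Hypothetical alias table: a few spellings from the top values only.
ALIASES = {
    "australia": "AU",
    "united states": "US",
    "usa": "US",
    "peru": "PE",
    "canada": "CA",
    "gb": "GB",
    "fr": "FR",
    "new zealand": "NZ",
}


def normalise_country(raw):
    """Return (code, status), where status is 'ok', 'blank', or 'unmapped'."""
    key = (raw or "").strip().casefold()
    if not key:
        return None, "blank"
    code = ALIASES.get(key)
    if code is None:
        return None, "unmapped"
    return code, "ok"


for value in ["", "PERU", "USA", "United States", "SOVIET UNION"]:
    print(repr(value), normalise_country(value))
```

Returning a status alongside the code keeps blank values distinct from values that exist but are not yet mapped, so the 63.7% blank share is not confused with mapping gaps.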

anthropic:default · confidence high

Out[46]:

saturn.columns["country"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	130
top_value	(empty string)
top_rate	0.6368
cardinality	130
entropy	2.569
entropy_ratio	0.3658

Fig 19.

Top values for country.

Top values for country (20 unique shown, of 130 total).
value	count	share
(empty string)	27422	63.7%
Australia	4573	10.6%
United States	1416	3.3%
PERU	1098	2.5%
Canada	976	2.3%
SOVIET UNION	634	1.5%
Israel	550	1.3%
GB	465	1.1%
Spain	370	0.9%
Sweden	340	0.8%
USA	323	0.8%
Ukraine	316	0.7%
Romania	310	0.7%
Antarctica	242	0.6%
Republic of Korea	225	0.5%
Colombia	214	0.5%
Italy	213	0.5%
New Zealand	212	0.5%
FR	210	0.5%
Brazil	179	0.4%

dataset categorical metadata

This column identifies the source dataset or survey program for each observation, with 214 distinct named sources across 43,060 rows. The dominant value is an empty string, accounting for 61.1% of all records (26,317 rows), meaning the majority of observations carry no dataset attribution — a significant data quality concern. The remaining records span named marine/environmental monitoring programs (trawl surveys, coastal monitoring, jellyfish sightings, etc.), with the largest named source ('Environmental Monitoring database (MOD) DNV') covering only ~4% of rows.

Treatment: Treat empty string as missing/unknown; encode named values as a categorical feature or use as a grouping/stratification variable, but investigate whether the 61.1% blank rate reflects a systematic data gap before modelling.

anthropic:default · confidence high

Out[49]:

saturn.columns["dataset"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	214
top_value	(empty string)
top_rate	0.6112
cardinality	214
entropy	3.19
entropy_ratio	0.4121

Fig 20.

Top values for dataset.

Top values for dataset (20 unique shown, of 214 total).
value	count	share
(empty string)	26317	61.1%
Environmental Monitoring database (MOD) DNV	1760	4.1%
Jellyfish sightings along the Italian coastline from 2009 to 2017	1024	2.4%
QUADRIGE - Coastal monitoring database and products, 1974 onwards. (6064)	978	2.3%
MBIS research trawl surveys	714	1.7%
Groundfish Survey Invertebrate Data	674	1.6%
DFO Quebec Region Ecosystemic bottom trawl surveys	650	1.5%
Marine Recorder Snapshot extract of surveys entered by SeaSearch	643	1.5%
CPR	604	1.4%
DATRAS: ICES Database of trawl surveys	591	1.4%
Citizen Science based jellyfish observations along the Israeli Mediterranean coast in 2011-2025	546	1.3%
BioChem: Sameoto zooplankton collection	516	1.2%
Marine Recorder Snapshot extract of surveys entered by JNCC	396	0.9%
Atlantic Reference Centre	383	0.9%
DFO Central and Arctic Multi-species Stock Assessment Surveys	364	0.8%
MEDITS-Spain: Demersal and mega-benthic species from the MEDITS (Mediterranean International Trawl Survey) project on the Spanish continental shelf between 1994 and 2010	277	0.6%
NIWA Invertebrate Collection	267	0.6%
ANEMOON Beach washup monitoring (SMP) data along the Dutch coastline collected through citizen science	240	0.6%
Phytoplankton abundance and composition in the Ebro delta embayments (Alfacs Bay and Fangar Bay, North Western Mediterranean) during 1990-2019	198	0.5%
Romanian Black Sea Zooplankton data from 1981 to 2000	196	0.5%

bioluminescence_group categorical label

This column classifies observations into one of 26 named bioluminescent organism groups (e.g., 'Dinoflagellate', 'Crystal jelly (source of GFP)', 'Krill (many species bioluminescent)'), covering marine taxa from dinoflagellates to jellyfish and crustaceans. The distribution is close to uniform among the larger groups: the top category 'Dinoflagellate' holds only 9.3% of rows, and the entropy ratio of 0.95 (near-maximum for 26 categories) indicates near-flat class balance. With no nulls across 43,060 rows and exactly 2,000 records for 18 of the 20 visible categories, the dataset appears deliberately capped, balanced, or synthetically constructed. The tail is thinner: 'Boring clam (piddock)' has 928 records, and the six groups not shown share the remaining 2,132 rows.

Treatment: Use as classification target or stratification variable; the larger groups need no resampling, but check the small tail groups before modelling.
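Where groups differ in size, as with 'Boring clam (piddock)' at 928 rows against the 2,000-row groups, inverse-frequency class weights are one alternative to resampling. The sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes × class_count); the function name is hypothetical, and the three counts are a toy subset of the top-values table rather than the full column.

```python
def balanced_class_weights(counts):
    """weight = n_samples / (n_classes * class_count), so rarer classes weigh more."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Toy subset: three groups from the top-values table.
counts = {
    "Dinoflagellate": 4000,
    "Northern krill": 2000,
    "Boring clam (piddock)": 928,
}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

A class with the average count gets a weight of 1, so the weights read directly as "how much more or less than average" each class should count.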

anthropic:default · confidence high

Out[52]:

saturn.columns["bioluminescence_group"].stats

stat	value
n	43,060
nulls	0 (0.0%)
unique	26
top_value	Dinoflagellate
top_rate	0.09289
cardinality	26
entropy	4.465
entropy_ratio	0.95

Fig 21.

Top values for bioluminescence_group.

Top values for bioluminescence_group (20 unique shown, of 26 total).
value	count	share
Dinoflagellate	4000	9.3%
Sea sparkle dinoflagellate	2000	4.6%
Bioluminescent dinoflagellate	2000	4.6%
Crystal jelly (source of GFP)	2000	4.6%
Mauve stinger jellyfish	2000	4.6%
Warty comb jelly	2000	4.6%
Crown jellyfish (alarm jelly)	2000	4.6%
Helmet jellyfish	2000	4.6%
Comb jelly	2000	4.6%
Krill (many species bioluminescent)	2000	4.6%
Northern krill	2000	4.6%
Copepod (secretes luminous fluid)	2000	4.6%
Deep-sea shrimp (NanoLuc source)	2000	4.6%
Sea firefly ostracod	2000	4.6%
Bioluminescent ostracod	2000	4.6%
Cock-eyed squid	2000	4.6%
Bioluminescent marine bacteria	2000	4.6%
Marine luminous bacteria	2000	4.6%
Parchment tube worm	2000	4.6%
Boring clam (piddock)	928	2.2%
