quirky bioluminescence

source /home/coolhand/html/datavis/data_trove/data/quirky/bioluminescence.json · 43,060 rows · 14 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogues 43,060 records of bioluminescent marine organisms, with taxonomic fields (phylum, class, order, family, genus, scientificName), a bioluminescence_group label, geographic coordinates, depth, country, source dataset, and date/year. The taxonomy is dominated by Arthropoda (12,297) and Cnidaria (8,874) within 7 phyla, while bioluminescence_group is fairly evenly distributed across 26 categories led by Dinoflagellate (4,000). Two things deserve a closer look first: the depth column is highly skewed (skew 4.72, max 10,000m vs median 52.5m) with a 24.75% null rate and ~10.6% outliers, and the country field is 63.7% empty, limiting any geographic breakdown by nation. The year field is also 42% null, so temporal analysis will be partial.

citing: depth · phylum · bioluminescence_group · class · country · year · latitude · longitude · scientificName

Charts the summary said to look at first

phylum · Shows the taxonomic backbone of the dataset, with Arthropoda (28.6%) and Cnidaria (20.6%) together accounting for just under half of all records.

Top values for phylum (7 unique shown, of 7 total).
value	count	share
Arthropoda	12297	28.6%
Cnidaria	8874	20.6%
Myzozoa	8000	18.6%
Ctenophora	4168	9.7%
Proteobacteria	4000	9.3%
Mollusca	3721	8.6%
Annelida	2000	4.6%

bioluminescence_group · Reveals the relatively even spread across 26 bioluminescent organism groups, led by Dinoflagellate.

Top values for bioluminescence_group (20 unique shown, of 26 total).
value	count	share
Dinoflagellate	4000	9.3%
Sea sparkle dinoflagellate	2000	4.6%
Bioluminescent dinoflagellate	2000	4.6%
Crystal jelly (source of GFP)	2000	4.6%
Mauve stinger jellyfish	2000	4.6%
Warty comb jelly	2000	4.6%
Crown jellyfish (alarm jelly)	2000	4.6%
Helmet jellyfish	2000	4.6%
Comb jelly	2000	4.6%
Krill (many species bioluminescent)	2000	4.6%
Northern krill	2000	4.6%
Copepod (secretes luminous fluid)	2000	4.6%
Deep-sea shrimp (NanoLuc source)	2000	4.6%
Sea firefly ostracod	2000	4.6%
Bioluminescent ostracod	2000	4.6%
Cock-eyed squid	2000	4.6%
Bioluminescent marine bacteria	2000	4.6%
Marine luminous bacteria	2000	4.6%
Parchment tube worm	2000	4.6%
Boring clam (piddock)	928	2.2%

depth · Look for the heavy right skew and extreme values up to 10,000m — most observations sit shallow (median 52.5m) but a long tail of deep-sea records pulls the mean to 281m.

Histogram bins for depth (median: 52.5).
bin	count
-53 – 198.3	21893
198.3 – 449.6	4443
449.6 – 701	1966
701 – 952.3	1504
952.3 – 1204	1070
1204 – 1455	303
1455 – 1706	226
1706 – 1958	182
1958 – 2209	255
2209 – 2460	95
2460 – 2712	111
2712 – 2963	57
2963 – 3214	55
3214 – 3466	56
3466 – 3717	42
3717 – 3968	24
3968 – 4220	31
4220 – 4471	20
4471 – 4722	6
4722 – 4974	14
4974 – 5225	12
5225 – 5476	14
5476 – 5727	6
5727 – 5979	2
5979 – 6230	4
6230 – 6481	2
6481 – 6733	0
6733 – 6984	0
6984 – 7235	0
7235 – 7487	0
7487 – 7738	3
7738 – 7989	0
7989 – 8241	0
8241 – 8492	0
8492 – 8743	2
8743 – 8995	0
8995 – 9246	0
9246 – 9497	0
9497 – 9749	0
9749 – 10000	4

class · Breaks the phylum view into 13 classes, with Dinophyceae, Scyphozoa, and Malacostraca leading.

Top values for class (13 unique shown, of 13 total).
value	count	share
Dinophyceae	8000	18.6%
Scyphozoa	6000	13.9%
Malacostraca	6000	13.9%
Ostracoda	4000	9.3%
Gammaproteobacteria	4000	9.3%
Cephalopoda	2793	6.5%
Copepoda	2297	5.3%
Tentaculata	2168	5.0%
Hydrozoa	2000	4.6%
Nuda	2000	4.6%
Polychaeta	2000	4.6%
Bivalvia	928	2.2%
Octocorallia	874	2.0%

latitude · Check the global latitudinal coverage; the distribution leans toward northern latitudes (median ~36.7°).

Histogram bins for latitude (median: 36.710105896).
bin	count
-76.62 – -72.5	34
-72.5 – -68.37	134
-68.37 – -64.25	770
-64.25 – -60.13	872
-60.13 – -56.01	500
-56.01 – -51.88	309
-51.88 – -47.76	279
-47.76 – -43.64	377
-43.64 – -39.51	1598
-39.51 – -35.39	900
-35.39 – -31.27	2736
-31.27 – -27.15	1218
-27.15 – -23.02	589
-23.02 – -18.9	615
-18.9 – -14.78	671
-14.78 – -10.66	768
-10.66 – -6.533	598
-6.533 – -2.41	504
-2.41 – 1.713	319
1.713 – 5.836	199
5.836 – 9.958	628
9.958 – 14.08	953
14.08 – 18.2	793
18.2 – 22.33	744
22.33 – 26.45	566
26.45 – 30.57	783
30.57 – 34.69	1840
34.69 – 38.82	2424
38.82 – 42.94	2931
42.94 – 47.06	3500
47.06 – 51.19	4244
51.19 – 55.31	3508
55.31 – 59.43	2052
59.43 – 63.55	1070
63.55 – 67.68	764
67.68 – 71.8	1560
71.8 – 75.92	532
75.92 – 80.04	94
80.04 – 84.17	52
84.17 – 88.29	32

Schema

14 columns

Per-column summary.
column	type	nulls	unique	alerts
scientificName	categorical	0.0%	245
genus	categorical	0.0%	27
family	categorical	0.0%	22
phylum	categorical	0.0%	7
class	categorical	0.0%	13
order	categorical	0.0%	17
latitude	numeric	0.0%	14,146
longitude	numeric	0.0%	14,637
depth	numeric	24.8%	3,283	null_rate high_skew outliers
date	text	12.0%	12,338	one_word allcaps duplicates
year	categorical	42.2%	137	null_rate
country	categorical	0.0%	130
dataset	categorical	0.0%	214
bioluminescence_group	categorical	0.0%	26

scientificName

categorical label

Taxonomic species/genus identifiers (Latin binomials like 'Mnemiopsis leidyi' and genera like 'Lingulodinium', 'Vibrio'). With 245 unique values across 43,060 rows and entropy ratio 0.747, the distribution is moderately spread — the top species accounts for only 4.6% of rows. The mix of binomial species names and bare genus names suggests inconsistent taxonomic resolution across records. Treatment: Treat as a categorical label; consider normalizing genus-only vs. species-level entries before grouping or modelling. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 245
top_value: Mnemiopsis leidyi
top_rate: 0.04645
cardinality: 245
entropy: 5.928
entropy_ratio: 0.747
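
The genus-only versus species-level split mentioned in the reading can be made explicit before grouping. The following is a minimal sketch, assuming a name with one token is a bare genus and anything longer is a species-level binomial; `name_rank` is a hypothetical helper, and qualifiers such as "sp." or subspecies epithets would need extra handling.

```python
def name_rank(scientific_name):
    """Split a scientificName into (genus, rank).

    Assumption: one token means a bare genus, two or more tokens
    mean a species-level name whose first token is the genus.
    """
    tokens = (scientific_name or "").split()
    if not tokens:
        return None, "missing"
    rank = "genus" if len(tokens) == 1 else "species"
    return tokens[0], rank


# Example values quoted in the column reading.
for name in ["Mnemiopsis leidyi", "Lingulodinium", "Vibrio"]:
    print(name, "->", name_rank(name))
```

Grouping on the returned genus, with the rank kept as a separate flag, avoids mixing the two resolutions in a single category.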

genus

categorical label

Categorical genus label across 27 bioluminescent marine genera (Noctiluca, Pyrocystis, Alexandrium, Aequorea, etc.). Distribution is essentially uniform: the top 10 genera each show exactly 2,000 rows and the top rate is 4.6%, giving an entropy ratio of 0.959, which signals a deliberately balanced sample rather than natural abundance. No nulls across 43,060 rows. Treatment: Use directly as the classification target; one-hot or label-encode for modelling. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 27
top_value: Noctiluca
top_rate: 0.04645
cardinality: 27
entropy: 4.558
entropy_ratio: 0.9586
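
Label encoding, as suggested in the treatment, can be sketched without any library. This is an illustrative helper (`label_encode` is a hypothetical name), assuming classes are ordered alphabetically so the mapping is reproducible between runs.

```python
def label_encode(values):
    """Map each distinct label to an integer code (alphabetical order).

    Returns (codes, classes), where classes[code] recovers the label.
    """
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes


# Genera named in the column reading, with one repeat.
sample = ["Noctiluca", "Pyrocystis", "Alexandrium", "Aequorea", "Noctiluca"]
codes, classes = label_encode(sample)
print(codes)    # integer code per row
print(classes)  # lookup table from code back to genus
```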

family

categorical label

Taxonomic family labels for what appears to be a catalogue of bioluminescent marine organisms, spanning 22 distinct families across 43,060 complete rows. The distribution is highly engineered rather than natural: four families (Pyrocystaceae, Euphausiidae, Cypridinidae, Vibrionaceae) each hit exactly 4,000 rows and several others land at exactly 2,000, suggesting deliberate per-class sampling or quota balancing. Entropy ratio of 0.93 confirms the near-uniform spread, and no nulls are present. Treatment: Use directly as a categorical class label; one-hot or integer-encode for modelling. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 22
top_value: Pyrocystaceae
top_rate: 0.09289
cardinality: 22
entropy: 4.157
entropy_ratio: 0.9322

phylum

categorical feature

Taxonomic phylum label across 43,060 records spanning just 7 distinct values with no nulls. Arthropoda leads at 28.6% (12,297 rows), followed by Cnidaria (8,874) and Myzozoa (8,000), with entropy ratio 0.92 indicating a fairly even spread across the seven categories. The column blends kingdoms: five animal phyla sit alongside Myzozoa (here the dinoflagellates, which are protists rather than animals) and Proteobacteria (a bacterial phylum). Treatment: One-hot encode directly given the low cardinality. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 7
top_value: Arthropoda
top_rate: 0.2856
cardinality: 7
entropy: 2.593
entropy_ratio: 0.9235
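
The entropy and entropy_ratio figures can be reproduced from the phylum counts listed earlier in this report. The sketch below assumes the profiler reports Shannon entropy in bits and divides it by log2 of the number of distinct values; that definition is inferred from the reported numbers, not documented in the report itself.

```python
import math

# Phylum counts copied from the "Top values for phylum" table.
phylum_counts = {
    "Arthropoda": 12297,
    "Cnidaria": 8874,
    "Myzozoa": 8000,
    "Ctenophora": 4168,
    "Proteobacteria": 4000,
    "Mollusca": 3721,
    "Annelida": 2000,
}


def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)


counts = list(phylum_counts.values())
entropy = shannon_entropy_bits(counts)
ratio = entropy_ratio(counts)
print(f"entropy={entropy:.3f} bits, ratio={ratio:.4f}")
```

A ratio of 1.0 would mean all seven phyla are equally frequent; values near 0 mean one phylum dominates.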

class

categorical label

Taxonomic class labels for marine organisms, spanning 13 distinct values across 43,060 rows with no nulls. Distribution is fairly balanced (entropy ratio 0.93) with Dinophyceae leading at only 18.6% — several classes show suspiciously round counts (8000, 6000, 4000, 2000) suggesting curated/sampled rather than naturally observed frequencies. Treatment: use directly as a multi-class classification target with label encoding. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 13
top_value: Dinophyceae
top_rate: 0.1858
cardinality: 13
entropy: 3.43
entropy_ratio: 0.9268

order

categorical feature

Taxonomic order names for marine organisms (Gonyaulacales, Euphausiacea, Calanoida, etc.), with 17 distinct values across 43,060 rows and no nulls. Distribution is fairly even (entropy ratio 0.949, with the top order, Gonyaulacales, at only 13.9%), though several categories sit at suspiciously round counts (4,000, 2,000), suggesting stratified sampling or quota construction rather than natural frequencies. Treatment: One-hot or target-encode; safe to use directly given low cardinality and no missing values. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 17
top_value: Gonyaulacales
top_rate: 0.1393
cardinality: 17
entropy: 3.879
entropy_ratio: 0.9491

latitude

numeric feature

Numeric column bounded between -76.619 and 88.29, consistent with WGS84 latitudes in degrees. The distribution is wide (std 40.27, IQR 69.61) and mildly left-skewed (-0.66) with a flat shape (kurtosis -0.94), indicating coverage across both hemispheres rather than a single region. No nulls and no outliers flagged across 43,060 rows with 14,146 distinct values. Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than using raw degrees in a linear model. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 14,146
min: -76.62
max: 88.29
mean: 19.1
median: 36.71
std: 40.27
q1: -19.31
q3: 50.3
iqr: 69.61
skew: -0.6614
kurtosis: -0.9355
n_outliers: 0
outlier_rate: 0
zero_rate: 0.0004645

longitude

numeric feature

Geographic longitude in decimal degrees, spanning the full -179.9987 to 179.99 range with 14,637 distinct values across 43,060 rows and zero nulls. The distribution is nearly symmetric (skew 0.14) with light tails (kurtosis -0.65) and a wide IQR of 124.12, indicating truly global coverage rather than a regional sample. No outliers flagged and only a 0.11% zero rate, consistent with clean coordinate data. Treatment: Pair with latitude for geospatial features; avoid treating as a linear scalar since ±180 wraps. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 14,637
min: -180
max: 180
mean: 9.64
median: 3.057
std: 88.61
q1: -60.19
q3: 63.93
iqr: 124.1
skew: 0.1376
kurtosis: -0.6464
n_outliers: 0
outlier_rate: 0
zero_rate: 0.001115
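
One common way to avoid the ±180 wrap is to convert each latitude/longitude pair to a point on the unit sphere, so that neighbouring locations on either side of the antimeridian get neighbouring feature values. This is a minimal sketch, assuming the coordinates are decimal degrees; `latlon_to_unit_vector` is a hypothetical helper name.

```python
import math


def latlon_to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# The two longitude extremes quoted in the reading are almost 360 degrees
# apart as raw numbers, yet nearly the same place on the globe.
east = latlon_to_unit_vector(0.0, 179.99)
west = latlon_to_unit_vector(0.0, -179.9987)
print(math.dist(east, west))  # small chord distance
```

The three components can be fed to a model in place of raw degrees; binning or a map projection, as suggested for latitude, remain alternatives.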

depth

numeric feature null_rate high_skew outliers

This is a numeric depth measurement (likely meters), with 24.75% nulls across 43,060 rows and only 3,283 distinct values. The distribution is heavily right-skewed (skew 4.72, kurtosis 35.89): the median is 52.5 but the mean is 281.2 and the max reaches 10,000, while 10.63% of values are flagged as outliers. Notably, the minimum is -53.0 (negative depths are suspect) and 11.92% of values are exactly zero. Treatment: Investigate negative and zero values, then log-transform (after shifting) before modelling. high · anthropic:claude-opus-4-7
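
The shifted log-transform in the treatment can be sketched as follows. It assumes the negative depths are kept and the shift equals the reported minimum of -53; if the negative values turn out to be errors and are dropped, the shift would change accordingly. `shifted_log_depth` is a hypothetical helper.

```python
import math

DEPTH_MIN = -53.0  # minimum reported in the depth statistics


def shifted_log_depth(depth, shift=-DEPTH_MIN):
    """Return log(1 + depth + shift), or None for a missing depth.

    With shift = 53 the smallest reported depth maps to 0.0.
    """
    if depth is None:
        return None
    return math.log1p(depth + shift)


for d in [-53.0, 0.0, 52.5, 10000.0, None]:
    print(d, "->", shifted_log_depth(d))
```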

n: 43,060
nulls: 10,658 (24.8%)
unique: 3,283
min: -53
max: 10,000
mean: 281.2
median: 52.5
std: 570.2
q1: 7.5
q3: 321
iqr: 313.5
skew: 4.724
kurtosis: 35.89
n_outliers: 3,444
outlier_rate: 0.1063
zero_rate: 0.1192
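
The report does not state how outliers were flagged. A common convention is the 1.5 × IQR rule, and the sketch below applies it to the reported quartiles as an assumption; the resulting upper fence of 791.25 is at least compatible with the histogram, where 2,596 records fall in bins starting at 952.3 or deeper and the rest of the 3,444 would come from the 701 to 952.3 bin.

```python
# Quartiles copied from the depth statistics above.
Q1 = 7.5
Q3 = 321.0

IQR = Q3 - Q1                  # 313.5, matches the reported iqr
LOWER_FENCE = Q1 - 1.5 * IQR   # -462.75
UPPER_FENCE = Q3 + 1.5 * IQR   # 791.25


def is_depth_outlier(depth):
    """Flag a depth outside the 1.5 x IQR fences; missing values are not flagged."""
    if depth is None:
        return False
    return depth < LOWER_FENCE or depth > UPPER_FENCE


print(LOWER_FENCE, UPPER_FENCE)
print(is_depth_outlier(52.5), is_depth_outlier(10000.0))
```

Because the lower fence sits below the reported minimum of -53, every flagged value under this rule is a deep record, not a negative one.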

date

text timestamp one_word allcaps duplicates

This is a date column stored as free text rather than a parsed timestamp, with values mixing single dates (e.g. '2017-05-30'), single months ('2013-08'), month ranges ('2010-05/2010-06') and even multi-year spans ('1962/1964'). The format heterogeneity is the main surprise: 97% of entries are one 'word', but length varies from 4 to 51 characters, and 67% are duplicates of another row. Nulls are also non-trivial at 12%. Treatment: Parse into structured start/end dates (handling year-only, month, and range formats) before any temporal analysis. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 5,182 (12.0%)
unique: 12,338
len_min: 4
len_max: 51
len_mean: 16.45
len_median: 19
len_p95: 39
word_mean: 1.03
word_median: 1
n_empty: 0
n_duplicates: 25,540
duplicate_rate: 0.6743
vocab_size: 10,135
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9705
allcaps_rate: 1
boilerplate_rate: 0
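
Parsing into start/end dates could look like the sketch below. It covers only the four shapes quoted in the reading (full date, year-month, year, and a "/"-separated range of any of these) and ignores any time-of-day suffix; longer values, up to the 51-character maximum, may follow other patterns that were not inspected here. `parse_date_field` is a hypothetical helper.

```python
import calendar
import re
from datetime import date

# YYYY, YYYY-MM or YYYY-MM-DD, optionally followed by a time part.
_FRAGMENT = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?(?=$|[T ])")


def _bounds(fragment):
    """Return (first_day, last_day) covered by one fragment, or None."""
    m = _FRAGMENT.match(fragment.strip())
    if not m:
        return None
    year = int(m.group(1))
    month = int(m.group(2)) if m.group(2) else None
    day = int(m.group(3)) if m.group(3) else None
    try:
        if month is None:
            return date(year, 1, 1), date(year, 12, 31)
        if day is None:
            last_day = calendar.monthrange(year, month)[1]
            return date(year, month, 1), date(year, month, last_day)
        exact = date(year, month, day)
        return exact, exact
    except ValueError:  # impossible month or day
        return None


def parse_date_field(value):
    """Parse a raw date string into (start, end), or None if unparseable."""
    if not value:
        return None
    parts = value.split("/")
    if len(parts) > 2:
        return None
    first, last = _bounds(parts[0]), _bounds(parts[-1])
    if first is None or last is None:
        return None
    return first[0], last[1]


for raw in ["2017-05-30", "2013-08", "2010-05/2010-06", "1962/1964"]:
    print(raw, "->", parse_date_field(raw))
```

Returning None for anything unrecognised keeps unparsed formats visible for review instead of silently coercing them.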

year

categorical timestamp null_rate

This is a year column stored categorically across 137 distinct values, suggesting coverage spanning over a century. The most common year is '2000' at 5.17% of non-null rows, with a high entropy ratio of 0.865 indicating values are spread fairly evenly across years. Notably, 42.18% of rows are null, which triggered a null_rate alert and limits usefulness without imputation or filtering. Treatment: Cast to integer year and decide whether to drop or impute the 42% missing rows before time-based analysis. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 18,164 (42.2%)
unique: 137
top_value: 2000
top_rate: 0.0517
cardinality: 137
entropy: 6.142
entropy_ratio: 0.8653
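
Casting to an integer year while keeping missing values explicit might be sketched like this. The raw encoding of the column is not shown in the report, so the helper (`to_year`, a hypothetical name) accepts strings, floats and blanks as an assumption.

```python
def to_year(value):
    """Convert a raw year value to int, or None when missing or unparseable."""
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    try:
        return int(float(text))
    except (ValueError, OverflowError):  # non-numeric text, NaN or infinity
        return None


raw_years = ["2000", 2000.0, "", None, "nan"]
years = [to_year(v) for v in raw_years]
print(years)

# Dropping the missing rows is one option; imputation is the other.
known_years = [y for y in years if y is not None]
print(known_years)
```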

country

categorical feature

Country of origin as a categorical label across 130 distinct values, but 63.7% of the 43,060 rows are empty strings rather than nulls, making the modal 'value' effectively missing. The remaining entries show inconsistent encoding: full names ('Australia', 'United States'), ISO codes ('GB'), and uppercase forms ('PERU', 'SOVIET UNION', the latter a defunct state), suggesting data was merged from heterogeneous sources without normalisation. Entropy ratio of 0.37 confirms the distribution is heavily concentrated in a few buckets. Treatment: Normalise to ISO country codes, treat empty string as missing, then one-hot or target-encode. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 130
top_value: '' (empty string)
top_rate: 0.6368
cardinality: 130
entropy: 2.569
entropy_ratio: 0.3658
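
The normalisation step could start from a small lookup like the one below. The table is illustrative only: it covers just the five example values quoted in the reading, not the 130 distinct values in the column, and mapping 'SOVIET UNION' to the historical code SU is an assumption about how defunct states should be handled. `normalise_country` and `COUNTRY_TO_ISO2` are hypothetical names.

```python
# Illustrative lookup keyed on case-folded input; extend for real use.
COUNTRY_TO_ISO2 = {
    "australia": "AU",
    "united states": "US",
    "gb": "GB",
    "peru": "PE",
    "soviet union": "SU",  # assumption: keep the historical code
}


def normalise_country(value):
    """Return an ISO-style code, None for missing, or 'UNMAPPED' if unknown."""
    if value is None:
        return None
    key = value.strip().casefold()
    if not key:
        return None  # empty string is treated as missing
    return COUNTRY_TO_ISO2.get(key, "UNMAPPED")


for raw in ["", "Australia", "GB", "PERU", "SOVIET UNION", "Atlantis"]:
    print(repr(raw), "->", normalise_country(raw))
```

Keeping 'UNMAPPED' distinct from None separates values that need a lookup entry from rows that never had a country.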

dataset

categorical metadata

This column names the source dataset each record was drawn from, with 214 distinct provenance strings across 43,060 rows. The dominant value is an empty string covering 61.1% of rows (26,317), meaning provenance is missing for the majority; named sources like 'Environmental Monitoring database (MOD) DNV' (1,760) and 'Jellyfish sightings along the Italian coastline from 2009 to 2017' (1,024) trail far behind. Entropy ratio of 0.41 confirms the distribution is heavily concentrated on that blank. Treatment: Treat empty string as missing and group rare sources before any per-dataset stratification. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 214
top_value: '' (empty string)
top_rate: 0.6112
cardinality: 214
entropy: 3.19
entropy_ratio: 0.4121
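
Treating blanks as missing and pooling rare sources can be sketched as below. The threshold of 100 rows and the "Other" label are arbitrary choices for illustration, and `group_rare_sources` is a hypothetical helper.

```python
from collections import Counter


def group_rare_sources(values, min_count=100, other_label="Other"):
    """Map blank sources to None and sources below min_count to other_label."""
    cleaned = []
    for v in values:
        text = v.strip() if isinstance(v, str) else ""
        cleaned.append(text or None)
    counts = Counter(v for v in cleaned if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other_label
        for v in cleaned
    ]


# Toy input: one frequent source, one rare source, two blanks.
sample = ["A"] * 5 + ["B"] * 2 + ["", " "]
print(group_rare_sources(sample, min_count=3))
```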

bioluminescence_group

categorical label

Categorical taxonomy label grouping records by bioluminescent organism type, with 26 distinct groups across 43,060 rows and no nulls. Distribution is remarkably flat (entropy ratio 0.95): 'Dinoflagellate' leads at only 9.3%, and the next 18 values are tied at exactly 2,000 rows each, suggesting a synthetic or quota-balanced sample rather than naturally observed frequencies. Treatment: One-hot or target-encode; the suspiciously uniform 2,000-per-class counts warrant a check for synthetic balancing before modelling. high · anthropic:claude-opus-4-7

n: 43,060
nulls: 0 (0.0%)
unique: 26
top_value: Dinoflagellate
top_rate: 0.09289
cardinality: 26
entropy: 4.465
entropy_ratio: 0.95
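
A quick check for quota balancing is to count how many groups share exactly the same row count. The sketch below uses the 20 group counts shown in the bioluminescence_group table of this report; the remaining 6 groups are not listed there and are left out.

```python
from collections import Counter

# Counts copied from the "Top values for bioluminescence_group" table.
group_counts = {
    "Dinoflagellate": 4000,
    "Sea sparkle dinoflagellate": 2000,
    "Bioluminescent dinoflagellate": 2000,
    "Crystal jelly (source of GFP)": 2000,
    "Mauve stinger jellyfish": 2000,
    "Warty comb jelly": 2000,
    "Crown jellyfish (alarm jelly)": 2000,
    "Helmet jellyfish": 2000,
    "Comb jelly": 2000,
    "Krill (many species bioluminescent)": 2000,
    "Northern krill": 2000,
    "Copepod (secretes luminous fluid)": 2000,
    "Deep-sea shrimp (NanoLuc source)": 2000,
    "Sea firefly ostracod": 2000,
    "Bioluminescent ostracod": 2000,
    "Cock-eyed squid": 2000,
    "Bioluminescent marine bacteria": 2000,
    "Marine luminous bacteria": 2000,
    "Parchment tube worm": 2000,
    "Boring clam (piddock)": 928,
}

TOTAL_ROWS = 43060

# How many groups land on each exact row count.
ties = Counter(group_counts.values())
tied_count, n_tied_groups = ties.most_common(1)[0]
tied_share = tied_count * n_tied_groups / TOTAL_ROWS

print(f"{n_tied_groups} groups have exactly {tied_count} rows "
      f"({tied_share:.1%} of all records)")
```

Many groups landing on one exact count is hard to explain by natural sampling, which is why the reading suggests confirming how the sample was assembled before modelling.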