
quirky bioluminescence

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/bioluminescence.json

Saturn profiled 43,060 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess

# Invoke the saturn CLI on the source file with the flags used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/bioluminescence.json",
    "--findings", "quirky-bioluminescence.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset catalogues 43,060 records of bioluminescent marine organisms, with taxonomic fields (phylum, class, order, family, genus, scientificName), a bioluminescence_group label, geographic coordinates, depth, country, source dataset, and date/year. The taxonomy spans 7 phyla and is dominated by Arthropoda (12,297) and Cnidaria (8,874), while bioluminescence_group is fairly evenly distributed across 26 categories led by Dinoflagellate (4,000). Two things deserve a closer look first: the depth column is highly skewed (skew 4.72, max 10,000 m vs. median 52.5 m) with a 24.75% null rate and ~10.6% outliers, and the country field is 63.7% empty strings (counted as 0.0% null in the schema), limiting any geographic breakdown by nation. The year field is also 42% null, so temporal analysis will be partial.

citing: depth · phylum · bioluminescence_group · class · country · year · latitude · longitude · scientificName

Out[4]:

saturn.schema() · 14 columns

column                 kind         n       null%  unique  alerts
scientificName         categorical  43,060  0.0%   245
genus                  categorical  43,060  0.0%   27
family                 categorical  43,060  0.0%   22
phylum                 categorical  43,060  0.0%   7
class                  categorical  43,060  0.0%   13
order                  categorical  43,060  0.0%   17
latitude               numeric      43,060  0.0%   14,146
longitude              numeric      43,060  0.0%   14,637
depth                  numeric      43,060  24.8%  3,283   null_rate high_skew outliers
date                   text         43,060  12.0%  12,338  one_word allcaps duplicates
year                   categorical  43,060  42.2%  137     null_rate
country                categorical  43,060  0.0%   130
dataset                categorical  43,060  0.0%   214
bioluminescence_group  categorical  43,060  0.0%   26
Fig 1.
phylum · Shows the taxonomic backbone of the dataset, with Arthropoda and Cnidaria making up the majority of records.
Top values for phylum (7 unique shown, of 7 total).
value           count  share
Arthropoda      12297  28.6%
Cnidaria        8874   20.6%
Myzozoa         8000   18.6%
Ctenophora      4168   9.7%
Proteobacteria  4000   9.3%
Mollusca        3721   8.6%
Annelida        2000   4.6%
Fig 2.
bioluminescence_group · Reveals the relatively even spread across 26 bioluminescent organism groups, led by Dinoflagellate.
Top values for bioluminescence_group (20 unique shown, of 26 total).
value                                count  share
Dinoflagellate                       4000   9.3%
Sea sparkle dinoflagellate           2000   4.6%
Bioluminescent dinoflagellate        2000   4.6%
Crystal jelly (source of GFP)        2000   4.6%
Mauve stinger jellyfish              2000   4.6%
Warty comb jelly                     2000   4.6%
Crown jellyfish (alarm jelly)        2000   4.6%
Helmet jellyfish                     2000   4.6%
Comb jelly                           2000   4.6%
Krill (many species bioluminescent)  2000   4.6%
Northern krill                       2000   4.6%
Copepod (secretes luminous fluid)    2000   4.6%
Deep-sea shrimp (NanoLuc source)     2000   4.6%
Sea firefly ostracod                 2000   4.6%
Bioluminescent ostracod              2000   4.6%
Cock-eyed squid                      2000   4.6%
Bioluminescent marine bacteria       2000   4.6%
Marine luminous bacteria             2000   4.6%
Parchment tube worm                  2000   4.6%
Boring clam (piddock)                928    2.2%
Fig 3.
depth · Look for the heavy right skew and extreme values up to 10,000m — most observations sit shallow (median 52.5m) but a long tail of deep-sea records pulls the mean to 281m.
Histogram bins for depth (median: 52.5).
bin             count
-53 – 198.3     21893
198.3 – 449.6   4443
449.6 – 701     1966
701 – 952.3     1504
952.3 – 1204    1070
1204 – 1455     303
1455 – 1706     226
1706 – 1958     182
1958 – 2209     255
2209 – 2460     95
2460 – 2712     111
2712 – 2963     57
2963 – 3214     55
3214 – 3466     56
3466 – 3717     42
3717 – 3968     24
3968 – 4220     31
4220 – 4471     20
4471 – 4722     6
4722 – 4974     14
4974 – 5225     12
5225 – 5476     14
5476 – 5727     6
5727 – 5979     2
5979 – 6230     4
6230 – 6481     2
6481 – 6733     0
6733 – 6984     0
6984 – 7235     0
7235 – 7487     0
7487 – 7738     3
7738 – 7989     0
7989 – 8241     0
8241 – 8492     0
8492 – 8743     2
8743 – 8995     0
8995 – 9246     0
9246 – 9497     0
9497 – 9749     0
9749 – 1e+04    4
Fig 4.
class · Breaks the phylum view into 13 classes, with Dinophyceae, Scyphozoa, and Malacostraca leading.
Top values for class (13 unique shown, of 13 total).
value                count  share
Dinophyceae          8000   18.6%
Scyphozoa            6000   13.9%
Malacostraca         6000   13.9%
Ostracoda            4000   9.3%
Gammaproteobacteria  4000   9.3%
Cephalopoda          2793   6.5%
Copepoda             2297   5.3%
Tentaculata          2168   5.0%
Hydrozoa             2000   4.6%
Nuda                 2000   4.6%
Polychaeta           2000   4.6%
Bivalvia             928    2.2%
Octocorallia         874    2.0%
Fig 5.
latitude · Check the global latitudinal coverage; the distribution leans toward northern latitudes (median ~36.7°).
Histogram bins for latitude (median: 36.710105896).
bin               count
-76.62 – -72.5    34
-72.5 – -68.37    134
-68.37 – -64.25   770
-64.25 – -60.13   872
-60.13 – -56.01   500
-56.01 – -51.88   309
-51.88 – -47.76   279
-47.76 – -43.64   377
-43.64 – -39.51   1598
-39.51 – -35.39   900
-35.39 – -31.27   2736
-31.27 – -27.15   1218
-27.15 – -23.02   589
-23.02 – -18.9    615
-18.9 – -14.78    671
-14.78 – -10.66   768
-10.66 – -6.533   598
-6.533 – -2.41    504
-2.41 – 1.713     319
1.713 – 5.836     199
5.836 – 9.958     628
9.958 – 14.08     953
14.08 – 18.2      793
18.2 – 22.33      744
22.33 – 26.45     566
26.45 – 30.57     783
30.57 – 34.69     1840
34.69 – 38.82     2424
38.82 – 42.94     2931
42.94 – 47.06     3500
47.06 – 51.19     4244
51.19 – 55.31     3508
55.31 – 59.43     2052
59.43 – 63.55     1070
63.55 – 67.68     764
67.68 – 71.8      1560
71.8 – 75.92      532
75.92 – 80.04     94
80.04 – 84.17     52
84.17 – 88.29     32
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column                 kind         null %
scientificName         categorical  0.0%
genus                  categorical  0.0%
family                 categorical  0.0%
phylum                 categorical  0.0%
class                  categorical  0.0%
order                  categorical  0.0%
latitude               numeric      0.0%
longitude              numeric      0.0%
depth                  numeric      24.8%
date                   text         12.0%
year                   categorical  42.2%
country                categorical  0.0%
dataset                categorical  0.0%
bioluminescence_group  categorical  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
           latitude  longitude  depth
latitude   +1.00     -0.33      -0.06
longitude  -0.33     +1.00      +0.00
depth      -0.06     +0.00      +1.00

scientificName categorical label

Taxonomic species/genus identifiers (Latin binomials like 'Mnemiopsis leidyi' and genera like 'Lingulodinium', 'Vibrio'). With 245 unique values across 43,060 rows and entropy ratio 0.747, the distribution is moderately spread — the top species accounts for only 4.6% of rows. The mix of binomial species names and bare genus names suggests inconsistent taxonomic resolution across records.

Treatment: Treat as a categorical label; consider normalizing genus-only vs. species-level entries before grouping or modelling.
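
The normalisation suggested above can be sketched as follows. This is a minimal illustration, not part of saturn: the helper names `genus_of` and `taxonomic_resolution` are hypothetical, and the rule (a second, lowercase token marks a species-level binomial) is an assumption that will not cover subspecies, authorship strings, or hybrids.

```python
def genus_of(name: str) -> str:
    """Return the first token of a scientific name (the genus), or '' if blank."""
    parts = name.strip().split()
    return parts[0] if parts else ""


def taxonomic_resolution(name: str) -> str:
    """Label a name 'species' when it has a lowercase epithet, else 'genus'."""
    parts = name.strip().split()
    if len(parts) >= 2 and parts[1].islower():
        return "species"
    return "genus"


# Example values taken from the top-values table for scientificName.
names = ["Mnemiopsis leidyi", "Lingulodinium", "Vibrio", "Periphylla periphylla"]
levels = {n: taxonomic_resolution(n) for n in names}
genera = sorted({genus_of(n) for n in names})
```

Grouping on `genus_of(name)` puts genus-only and species-level records on the same footing, while `levels` keeps track of which rows lost resolution.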

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["scientificName"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 245
top_value Mnemiopsis leidyi
top_rate 0.04645
cardinality 245
entropy 5.928
entropy_ratio 0.747
Fig 8.
Top values for scientificName.
Top values for scientificName (20 unique shown, of 245 total).
value                       count  share
Mnemiopsis leidyi           2000   4.6%
Lingulodinium               1976   4.6%
Meganyctiphanes norvegica   1928   4.5%
Photobacterium              1842   4.3%
Periphylla periphylla       1802   4.2%
Pelagia noctiluca           1768   4.1%
Noctiluca scintillans       1728   4.0%
Vibrio                      1584   3.7%
Vargula norvegica           1482   3.4%
Cypridina dentata           1320   3.1%
Euphausia superba           1298   3.0%
Chaetopterus variopedatus   1222   2.8%
Beroe                       1202   2.8%
Oplophorus spinosus         1170   2.7%
Histioteuthis               952    2.2%
Alexandrium                 944    2.2%
Metridia lucens             872    2.0%
Aequorea                    798    1.9%
Atolla wyvillei             756    1.8%
Pyrocystis pseudonoctiluca  742    1.7%

genus categorical label

Categorical genus label across 27 bioluminescent marine genera (Noctiluca, Pyrocystis, Alexandrium, Aequorea, etc.). Distribution is essentially uniform: the 20 most frequent genera each show exactly 2,000 rows (top rate 4.6%), giving an entropy ratio of 0.959, which signals a deliberately balanced sample rather than natural abundance. No nulls across 43,060 rows.

Treatment: use directly as the classification target; one-hot or label-encode for modelling.
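
The entropy ratio quoted for each categorical column can be reproduced with a few lines of standard-library Python. The definition used here (Shannon entropy in bits divided by log2 of the cardinality) is an inference from the reported numbers, not documented saturn behaviour: 5.928 / log2(245) ≈ 0.747 for scientificName and 4.558 / log2(27) ≈ 0.9586 for genus. The function name `entropy_ratio` is hypothetical.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy (bits) of the value counts, divided by log2(#distinct)."""
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # a single category carries no entropy
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(k)


# Toy data: a quota-balanced label versus a concentrated one.
balanced = ["a"] * 2000 + ["b"] * 2000 + ["c"] * 2000
skewed = ["a"] * 9000 + ["b"] * 500 + ["c"] * 500

ratio_balanced = entropy_ratio(balanced)
ratio_skewed = entropy_ratio(skewed)

# Consistency check against the stats reported for scientificName and genus.
reported_scientific_name = 5.928 / math.log2(245)
reported_genus = 4.558 / math.log2(27)
```

A ratio near 1 means the categories are close to equally frequent, which is why the balanced genus column scores far higher than country or dataset.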

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["genus"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 27
top_value Noctiluca
top_rate 0.04645
cardinality 27
entropy 4.558
entropy_ratio 0.9586
Fig 9.
Top values for genus.
Top values for genus (20 unique shown, of 27 total).
value            count  share
Noctiluca        2000   4.6%
Pyrocystis       2000   4.6%
Lingulodinium    2000   4.6%
Alexandrium      2000   4.6%
Aequorea         2000   4.6%
Pelagia          2000   4.6%
Mnemiopsis       2000   4.6%
Atolla           2000   4.6%
Periphylla       2000   4.6%
Beroe            2000   4.6%
Euphausia        2000   4.6%
Meganyctiphanes  2000   4.6%
Metridia         2000   4.6%
Oplophorus       2000   4.6%
Vargula          2000   4.6%
Cypridina        2000   4.6%
Histioteuthis    2000   4.6%
Vibrio           2000   4.6%
Photobacterium   2000   4.6%
Chaetopterus     2000   4.6%

family categorical label

Taxonomic family labels for what appears to be a catalogue of bioluminescent marine organisms, spanning 22 distinct families across 43,060 complete rows. The distribution is highly engineered rather than natural: four families (Pyrocystaceae, Euphausiidae, Cypridinidae, Vibrionaceae) each hit exactly 4,000 rows and several others land at exactly 2,000, suggesting deliberate per-class sampling or quota balancing. Entropy ratio of 0.93 confirms the near-uniform spread, and no nulls are present.

Treatment: Use directly as a categorical class label; one-hot or integer-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["family"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 22
top_value Pyrocystaceae
top_rate 0.09289
cardinality 22
entropy 4.157
entropy_ratio 0.9322
Fig 10.
Top values for family.
Top values for family (20 unique shown, of 22 total).
value             count  share
Pyrocystaceae     4000   9.3%
Euphausiidae      4000   9.3%
Cypridinidae      4000   9.3%
Vibrionaceae      4000   9.3%
Metridinidae      2297   5.3%
Noctilucaceae     2000   4.6%
Lingulodiniaceae  2000   4.6%
Aequoreidae       2000   4.6%
Pelagiidae        2000   4.6%
Bolinopsidae      2000   4.6%
Atollidae         2000   4.6%
Periphyllidae     2000   4.6%
Beroidae          2000   4.6%
Oplophoridae      2000   4.6%
Histioteuthidae   2000   4.6%
Chaetopteridae    2000   4.6%
Pholadidae        928    2.2%
Renillidae        874    2.0%
Vampyroteuthidae  484    1.1%
Thysanoteuthidae  209    0.5%

phylum categorical feature

Taxonomic phylum label across 43,060 records spanning just 7 distinct values with no nulls. Arthropoda leads at 28.6% (12,297 rows), followed by Cnidaria (8,874) and Myzozoa (8,000), with entropy ratio 0.92 indicating a fairly even spread across the seven categories. The column is not limited to animals: Myzozoa (the dinoflagellates) are protists and Proteobacteria is a bacterial phylum, so it blends kingdoms and even domains.

Treatment: one-hot encode directly given the low cardinality.
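
With only seven categories, the one-hot encoding is small enough to write by hand. The sketch below is illustrative only: the `one_hot` helper is hypothetical, the category order is simply the order of the top-values table, and unseen values are encoded as an all-zero vector by assumption.

```python
# The seven phyla reported for this column, in table order.
PHYLA = [
    "Arthropoda", "Cnidaria", "Myzozoa", "Ctenophora",
    "Proteobacteria", "Mollusca", "Annelida",
]


def one_hot(value, categories=PHYLA):
    """Return a 0/1 indicator list with a single 1 at the value's position."""
    return [1 if value == c else 0 for c in categories]


cnidaria_vector = one_hot("Cnidaria")
unknown_vector = one_hot("Chordata")  # not in the column: all zeros
```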

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["phylum"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 7
top_value Arthropoda
top_rate 0.2856
cardinality 7
entropy 2.593
entropy_ratio 0.9235
Fig 11.
Top values for phylum.
Top values for phylum (7 unique shown, of 7 total).
value           count  share
Arthropoda      12297  28.6%
Cnidaria        8874   20.6%
Myzozoa         8000   18.6%
Ctenophora      4168   9.7%
Proteobacteria  4000   9.3%
Mollusca        3721   8.6%
Annelida        2000   4.6%

class categorical label

Taxonomic class labels for marine organisms, spanning 13 distinct values across 43,060 rows with no nulls. Distribution is fairly balanced (entropy ratio 0.93) with Dinophyceae leading at only 18.6% — several classes show suspiciously round counts (8000, 6000, 4000, 2000) suggesting curated/sampled rather than naturally observed frequencies.

Treatment: use directly as a multi-class classification target with label encoding.

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["class"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 13
top_value Dinophyceae
top_rate 0.1858
cardinality 13
entropy 3.43
entropy_ratio 0.9268
Fig 12.
Top values for class.
Top values for class (13 unique shown, of 13 total).
value                count  share
Dinophyceae          8000   18.6%
Scyphozoa            6000   13.9%
Malacostraca         6000   13.9%
Ostracoda            4000   9.3%
Gammaproteobacteria  4000   9.3%
Cephalopoda          2793   6.5%
Copepoda             2297   5.3%
Tentaculata          2168   5.0%
Hydrozoa             2000   4.6%
Nuda                 2000   4.6%
Polychaeta           2000   4.6%
Bivalvia             928    2.2%
Octocorallia         874    2.0%

order categorical feature

Taxonomic order names for marine organisms (Gonyaulacales, Euphausiacea, Calanoida, etc.), with 17 distinct values across 43,060 rows and no nulls. Distribution is fairly even (entropy ratio 0.949, top order only 13.9%), though several categories sit at suspiciously round counts (6,000, 4,000, 2,000), suggesting stratified sampling or quota construction rather than natural frequencies. One of the 17 values appears to be blank in the top-values table (2,000 rows), so the zero null rate may overstate completeness.

Treatment: one-hot or target-encode; safe to use directly given low cardinality and no missing values.

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["order"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 17
top_value Gonyaulacales
top_rate 0.1393
cardinality 17
entropy 3.879
entropy_ratio 0.9491
Fig 13.
Top values for order.
Top values for order (17 unique shown, of 17 total).
value            count  share
Gonyaulacales    6000   13.9%
Coronatae        4000   9.3%
Euphausiacea     4000   9.3%
Myodocopida      4000   9.3%
Vibrionales      4000   9.3%
Oegopsida        2309   5.4%
Calanoida        2297   5.3%
Lobata           2168   5.0%
Noctilucales     2000   4.6%
Leptothecata     2000   4.6%
Semaeostomeae    2000   4.6%
Beroida          2000   4.6%
Decapoda         2000   4.6%
(blank)          2000   4.6%
Myida            928    2.2%
Scleralcyonacea  874    2.0%
Vampyromorpha    484    1.1%

latitude numeric feature

Numeric column bounded between -76.619 and 88.29, consistent with WGS84 latitudes in degrees. The distribution is wide (std 40.27, IQR 69.61) and mildly left-skewed (-0.66) with a flat shape (kurtosis -0.94), indicating coverage across both hemispheres rather than a single region. No nulls and no outliers flagged across 43,060 rows with 14,146 distinct values.

Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than using raw degrees in a linear model.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["latitude"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 14,146
min -76.62
max 88.29
mean 19.1
median 36.71
std 40.27
q1 -19.31
q3 50.3
iqr 69.61
skew -0.6614
kurtosis -0.9355
n_outliers 0
outlier_rate 0
zero_rate 0.0004645
Fig 14.
Distribution of latitude. Vertical dash marks the median.
Histogram bins for latitude (median: 36.710105896).
bin               count
-76.62 – -72.5    34
-72.5 – -68.37    134
-68.37 – -64.25   770
-64.25 – -60.13   872
-60.13 – -56.01   500
-56.01 – -51.88   309
-51.88 – -47.76   279
-47.76 – -43.64   377
-43.64 – -39.51   1598
-39.51 – -35.39   900
-35.39 – -31.27   2736
-31.27 – -27.15   1218
-27.15 – -23.02   589
-23.02 – -18.9    615
-18.9 – -14.78    671
-14.78 – -10.66   768
-10.66 – -6.533   598
-6.533 – -2.41    504
-2.41 – 1.713     319
1.713 – 5.836     199
5.836 – 9.958     628
9.958 – 14.08     953
14.08 – 18.2      793
18.2 – 22.33      744
22.33 – 26.45     566
26.45 – 30.57     783
30.57 – 34.69     1840
34.69 – 38.82     2424
38.82 – 42.94     2931
42.94 – 47.06     3500
47.06 – 51.19     4244
51.19 – 55.31     3508
55.31 – 59.43     2052
59.43 – 63.55     1070
63.55 – 67.68     764
67.68 – 71.8      1560
71.8 – 75.92      532
75.92 – 80.04     94
80.04 – 84.17     52
84.17 – 88.29     32

longitude numeric feature

Geographic longitude in decimal degrees, spanning the full -179.9987 to 179.99 range with 14,637 distinct values across 43,060 rows and zero nulls. The distribution is nearly symmetric (skew 0.14) with light tails (kurtosis -0.65) and a wide IQR of 124.12, indicating truly global coverage rather than a regional sample. No outliers flagged and only a 0.11% zero rate, consistent with clean coordinate data.

Treatment: Pair with latitude for geospatial features; avoid treating as a linear scalar since ±180 wraps.
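
One common way to avoid the ±180 wrap is to project each coordinate pair onto the unit sphere, so that points on either side of the antimeridian end up close together. The sketch below is one possible approach, not something the report prescribes; `to_unit_sphere` is a hypothetical helper and the Earth is treated as a perfect sphere.

```python
import math


def to_unit_sphere(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# The two longitude extremes quoted above, both placed on the equator.
west = to_unit_sphere(0.0, -179.9987)
east = to_unit_sphere(0.0, 179.99)

raw_gap_degrees = abs(179.99 - (-179.9987))  # looks like the whole globe apart
chord_gap = math.dist(west, east)            # tiny: the points are neighbours
```

As raw scalars the two longitudes differ by almost 360 degrees, while their projected positions are nearly identical, which is the behaviour a distance-based model needs.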

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["longitude"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 14,637
min -180
max 180
mean 9.64
median 3.057
std 88.61
q1 -60.19
q3 63.93
iqr 124.1
skew 0.1376
kurtosis -0.6464
n_outliers 0
outlier_rate 0
zero_rate 0.001115
Fig 15.
Distribution of longitude. Vertical dash marks the median.
Histogram bins for longitude (median: 3.05735505).
bin                 count
-180 – -171         405
-171 – -162         653
-162 – -153         381
-153 – -144         284
-144 – -135         177
-135 – -126         485
-126 – -117         1914
-117 – -108         117
-108 – -99          151
-99 – -90           289
-90 – -81           785
-81 – -72           1676
-72 – -63           2314
-63 – -54           2269
-54 – -45           643
-45 – -36           680
-36 – -27           556
-27 – -18           530
-18 – -9.004        1463
-9.004 – -0.004333  3887
-0.004333 – 8.995   4520
8.995 – 18          2373
18 – 26.99          1671
26.99 – 35.99       2361
35.99 – 44.99       834
44.99 – 53.99       313
53.99 – 62.99       501
62.99 – 71.99       696
71.99 – 80.99       561
80.99 – 89.99       360
89.99 – 98.99       288
98.99 – 108         70
108 – 117           753
117 – 126           480
126 – 135           964
135 – 144           761
144 – 153           3177
153 – 162           1422
162 – 171           582
171 – 180           714

depth numeric feature

This is a numeric depth measurement (likely meters), with 24.75% nulls across 43,060 rows and only 3,283 distinct values. The distribution is heavily right-skewed (skew 4.72, kurtosis 35.89): the median is 52.5 but the mean is 281.2 and the max reaches 10,000, while 10.63% of values are flagged as outliers. Notably, the minimum is -53.0 (negative depths are suspect) and 11.92% of values are exactly zero.

Treatment: Investigate negative and zero values, then log-transform (after shifting) before modelling.
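
A minimal sketch of that treatment, under stated assumptions: negative depths are treated as suspect and mapped to missing rather than shifted into range, and the shift is 1 so that a depth of exactly zero maps to zero. Both choices are illustrative, and `log_depth` is a hypothetical helper.

```python
import math


def log_depth(depth, shift=1.0):
    """Return log(depth + shift); null or negative depths become None."""
    if depth is None or depth < 0:
        return None  # assumption: negative depth is a data error, not a signal
    return math.log(depth + shift)


# Null, the reported minimum, zero, the reported median, the reported maximum.
sample = [None, -53.0, 0.0, 52.5, 10000.0]
transformed = [log_depth(d) for d in sample]
```

On the log scale the gap between the median and the maximum shrinks from a factor of roughly 190 to a factor of a little over 2, which tames the skew reported for this column.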

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["depth"].stats

stat value
n 43,060
nulls 10,658 (24.8%)
unique 3,283
min -53
max 10,000
mean 281.2
median 52.5
std 570.2
q1 7.5
q3 321
iqr 313.5
skew 4.724
kurtosis 35.89
n_outliers 3,444
outlier_rate 0.1063
zero_rate 0.1192
alert: null_rate 24.8% null
alert: high_skew skew=+4.72
alert: outliers 10.6% rows beyond 1.5 IQR
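
The outlier alert can be read against the reported quartiles. Assuming the alert uses the usual Tukey rule (values more than 1.5 × IQR outside the quartiles), which its wording suggests but the report does not confirm, the fences work out as follows; `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and minimum taken from the depth stats above.
lower, upper = tukey_fences(7.5, 321.0)
reported_min = -53.0
all_outliers_are_deep = reported_min > lower
```

Under that assumption the lower fence sits at -462.75, below the reported minimum of -53, so every flagged row would be a deep record beyond about 791 m and the negative depths would not be counted as outliers at all.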
Fig 16.
Distribution of depth. Vertical dash marks the median.
Histogram bins for depth (median: 52.5).
bin             count
-53 – 198.3     21893
198.3 – 449.6   4443
449.6 – 701     1966
701 – 952.3     1504
952.3 – 1204    1070
1204 – 1455     303
1455 – 1706     226
1706 – 1958     182
1958 – 2209     255
2209 – 2460     95
2460 – 2712     111
2712 – 2963     57
2963 – 3214     55
3214 – 3466     56
3466 – 3717     42
3717 – 3968     24
3968 – 4220     31
4220 – 4471     20
4471 – 4722     6
4722 – 4974     14
4974 – 5225     12
5225 – 5476     14
5476 – 5727     6
5727 – 5979     2
5979 – 6230     4
6230 – 6481     2
6481 – 6733     0
6733 – 6984     0
6984 – 7235     0
7235 – 7487     0
7487 – 7738     3
7738 – 7989     0
7989 – 8241     0
8241 – 8492     0
8492 – 8743     2
8743 – 8995     0
8995 – 9246     0
9246 – 9497     0
9497 – 9749     0
9749 – 1e+04    4

date text timestamp

This is a date column stored as free text rather than a parsed timestamp, with values mixing single dates (e.g. '2017-05-30'), single months ('2013-08'), month ranges ('2010-05/2010-06') and even multi-year spans ('1962/1964'). The format heterogeneity is the main surprise: 97% of entries are one 'word', but length varies from 4 to 51 characters, and 67% are duplicates of another row. Nulls are also non-trivial at 12%.

Treatment: Parse into structured start/end dates (handling year-only, month, and range formats) before any temporal analysis.
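
The parsing step might look like the sketch below. It covers only the four shapes quoted above (single date, single month, month range, year range) plus an optional time component after a `T`, which is an assumption about what the longer strings contain; anything else returns None. The names `_bounds` and `parse_interval` are hypothetical.

```python
import calendar
from datetime import date


def _bounds(token):
    """Return (first_day, last_day) for 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'."""
    token = token.strip().split("T")[0]  # drop any time component
    parts = [int(p) for p in token.split("-")]
    if len(parts) == 1:
        year = parts[0]
        return date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    year, month, day = parts[:3]
    return date(year, month, day), date(year, month, day)


def parse_interval(text):
    """Parse a date or 'start/end' range into (start, end); None if unparseable."""
    if not text:
        return None
    try:
        tokens = text.split("/")
        start = _bounds(tokens[0])[0]
        end = _bounds(tokens[-1])[1]
    except ValueError:
        return None
    return start, end


examples = ["2017-05-30", "2013-08", "2010-05/2010-06", "1962/1964"]
intervals = {text: parse_interval(text) for text in examples}
```

Returning an explicit start and end keeps the coarse records (a bare year or a multi-year span) usable for filtering without pretending they have day-level precision.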

anthropic:claude-opus-4-7 · confidence high
Out[40]:

saturn.columns["date"].stats

stat value
n 43,060
nulls 5,182 (12.0%)
unique 12,338
len_min 4
len_max 51
len_mean 16.45
len_median 19
len_p95 39
word_mean 1.03
word_median 1
n_empty 0
n_duplicates 25,540
duplicate_rate 0.6743
vocab_size 10,135
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9705
allcaps_rate 1
boilerplate_rate 0
alert: one_word 97.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: duplicates 67.4% duplicate strings
Fig 17.
Character-length distribution for date.
Character-length distribution for date (mean: 16.4484133269972).
chars    count
4 – 5    276
5 – 6    11
6 – 8    1316
8 – 9    78
9 – 10   573
10 – 11  13982
11 – 12  0
12 – 13  4
13 – 15  0
15 – 16  1820
16 – 17  691
17 – 18  49
18 – 19  3297
19 – 20  11038
20 – 22  392
22 – 23  858
23 – 24  102
24 – 25  993
25 – 26  0
26 – 28  15
28 – 29  100
29 – 30  126
30 – 31  12
31 – 32  0
32 – 33  224
33 – 35  0
35 – 36  0
36 – 37  1
37 – 38  0
38 – 39  1632
39 – 40  2
40 – 42  226
42 – 43  0
43 – 44  0
44 – 45  18
45 – 46  0
46 – 47  0
47 – 49  0
49 – 50  0
50 – 51  42

year categorical timestamp

This is a year column stored categorically across 137 distinct values, suggesting coverage spanning over a century. The most common year is '2000' at 5.17% of non-null rows (3.0% of all rows, which is the share shown in the top-values table), with a high entropy ratio of 0.865 indicating values are spread fairly evenly across years. Notably, 42.18% of rows are null, which triggered a null_rate alert and limits usefulness without imputation or filtering.

Treatment: Cast to integer year and decide whether to drop or impute the 42% missing rows before time-based analysis.
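
A small sketch of the cast, plus the arithmetic that reconciles the two shares quoted for the year 2000. The `to_year` helper is hypothetical, and it assumes the raw values are strings or numbers such as '2000' or 2000.0; anything it cannot read becomes None rather than being imputed.

```python
def to_year(value):
    """Cast a raw year value to int, returning None for missing or bad input."""
    if value is None:
        return None
    try:
        return int(float(str(value).strip()))
    except (ValueError, OverflowError):
        return None


# Counts taken from the year stats and top-values table.
n_rows, n_null, n_year_2000 = 43060, 18164, 1287
share_of_all_rows = n_year_2000 / n_rows            # share column in the table
share_of_non_null = n_year_2000 / (n_rows - n_null)  # top_rate in the stats

years = [to_year(v) for v in ["2000", 1979.0, "", None, "n/a"]]
```

The same 1,287 rows are 3.0% of all rows but 5.17% of the rows that actually carry a year, so any per-year rate should state which denominator it uses.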

anthropic:claude-opus-4-7 · confidence high
Out[43]:

saturn.columns["year"].stats

stat value
n 43,060
nulls 18,164 (42.2%)
unique 137
top_value 2000
top_rate 0.0517
cardinality 137
entropy 6.142
entropy_ratio 0.8653
alert: null_rate 42.2% null
Fig 18.
Top values for year.
Top values for year (20 unique shown, of 137 total).
value  count  share
2000   1287   3.0%
2001   703    1.6%
2016   691    1.6%
2008   688    1.6%
2010   651    1.5%
2002   579    1.3%
2013   556    1.3%
2011   554    1.3%
1979   524    1.2%
2014   519    1.2%
2003   514    1.2%
2004   511    1.2%
2015   504    1.2%
2012   493    1.1%
2007   459    1.1%
2006   442    1.0%
2005   438    1.0%
1998   437    1.0%
2020   437    1.0%
2019   436    1.0%

country categorical feature

Country of origin as a categorical label across 130 distinct values, but 63.7% of the 43,060 rows are empty strings rather than nulls, making the modal 'value' effectively missing. The remaining entries show inconsistent encoding: full names ('Australia', 'United States'), ISO codes ('GB', 'FR'), uppercase forms ('PERU', 'SOVIET UNION', the latter a defunct state), and duplicate spellings of one country ('United States' vs. 'USA'), suggesting data was merged from heterogeneous sources without normalisation. Entropy ratio of 0.37 confirms the distribution is heavily concentrated in a few buckets.

Treatment: Normalise to ISO country codes, treat empty string as missing, then one-hot or target-encode.
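
The normalisation could start from a hand-built alias table like the one sketched here. The table is deliberately tiny and purely illustrative (it covers a handful of the values listed for this column), the `UNMAPPED:` prefix is an invented convention for surfacing values that still need a decision, and 'SOVIET UNION' is left unmapped because it has no current ISO 3166-1 code.

```python
# Hypothetical alias table: lower-cased raw value -> ISO 3166-1 alpha-2 code.
ALIASES = {
    "australia": "AU",
    "united states": "US",
    "usa": "US",
    "peru": "PE",
    "canada": "CA",
    "gb": "GB",
    "fr": "FR",
}


def normalise_country(raw):
    """Map a raw country string to an ISO code, None if blank, or flag it."""
    key = (raw or "").strip().casefold()
    if not key:
        return None  # empty string is treated as missing
    return ALIASES.get(key, "UNMAPPED:" + raw.strip())


raw_values = ["", "PERU", "USA", "United States", "GB", "SOVIET UNION"]
normalised = [normalise_country(v) for v in raw_values]
```

Flagging unmapped values instead of dropping them makes it easy to see how much of the column the alias table covers before encoding it.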

anthropic:claude-opus-4-7 · confidence high
Out[46]:

saturn.columns["country"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 130
top_value (blank)
top_rate 0.6368
cardinality 130
entropy 2.569
entropy_ratio 0.3658
Fig 19.
Top values for country.
Top values for country (20 unique shown, of 130 total).
value              count  share
(blank)            27422  63.7%
Australia          4573   10.6%
United States      1416   3.3%
PERU               1098   2.5%
Canada             976    2.3%
SOVIET UNION       634    1.5%
Israel             550    1.3%
GB                 465    1.1%
Spain              370    0.9%
Sweden             340    0.8%
USA                323    0.8%
Ukraine            316    0.7%
Romania            310    0.7%
Antarctica         242    0.6%
Republic of Korea  225    0.5%
Colombia           214    0.5%
Italy              213    0.5%
New Zealand        212    0.5%
FR                 210    0.5%
Brazil             179    0.4%

dataset categorical metadata

This column names the source dataset each record was drawn from, with 214 distinct provenance strings across 43,060 rows. The dominant value is an empty string covering 61.1% of rows (26,317), meaning provenance is missing for the majority; named sources like 'Environmental Monitoring database (MOD) DNV' (1,760) and 'Jellyfish sightings along the Italian coastline from 2009 to 2017' (1,024) trail far behind. Entropy ratio of 0.41 confirms the distribution is heavily concentrated on that blank.

Treatment: Treat empty string as missing and group rare sources before any per-dataset stratification.
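
A possible shape for that clean-up, using only the standard library. The `group_rare` helper, the `Missing` and `Other` labels, and the threshold are all illustrative assumptions; the right cut-off depends on how many rows a stratum needs.

```python
from collections import Counter


def group_rare(values, min_count, other="Other", missing="Missing"):
    """Relabel blank values as missing and low-frequency values as 'Other'."""
    cleaned = [v.strip() if v and v.strip() else missing for v in values]
    counts = Counter(cleaned)
    return [
        v if v == missing or counts[v] >= min_count else other
        for v in cleaned
    ]


# Toy column: one blank, one frequent source, one rare source.
sources = ["", "CPR", "CPR", "CPR", "NIWA Invertebrate Collection"]
grouped = group_rare(sources, min_count=2)
```

Keeping the missing rows as their own label, rather than folding them into `Other`, preserves the fact that most records here have no stated provenance.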

anthropic:claude-opus-4-7 · confidence high
Out[49]:

saturn.columns["dataset"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 214
top_value (blank)
top_rate 0.6112
cardinality 214
entropy 3.19
entropy_ratio 0.4121
Fig 20.
Top values for dataset.
Top values for dataset (20 unique shown, of 214 total).
value  count  share
(blank)  26317  61.1%
Environmental Monitoring database (MOD) DNV  1760  4.1%
Jellyfish sightings along the Italian coastline from 2009 to 2017  1024  2.4%
QUADRIGE - Coastal monitoring database and products, 1974 onwards. (6064)  978  2.3%
MBIS research trawl surveys  714  1.7%
Groundfish Survey Invertebrate Data  674  1.6%
DFO Quebec Region Ecosystemic bottom trawl surveys  650  1.5%
Marine Recorder Snapshot extract of surveys entered by SeaSearch  643  1.5%
CPR  604  1.4%
DATRAS: ICES Database of trawl surveys  591  1.4%
Citizen Science based jellyfish observations along the Israeli Mediterranean coast in 2011-2025  546  1.3%
BioChem: Sameoto zooplankton collection  516  1.2%
Marine Recorder Snapshot extract of surveys entered by JNCC  396  0.9%
Atlantic Reference Centre  383  0.9%
DFO Central and Arctic Multi-species Stock Assessment Surveys  364  0.8%
MEDITS-Spain: Demersal and mega-benthic species from the MEDITS (Mediterranean International Trawl Survey) project on the Spanish continental shelf between 1994 and 2010  277  0.6%
NIWA Invertebrate Collection  267  0.6%
ANEMOON Beach washup monitoring (SMP) data along the Dutch coastline collected through citizen science  240  0.6%
Phytoplankton abundance and composition in the Ebro delta embayments (Alfacs Bay and Fangar Bay, North Western Mediterranean) during 1990-2019  198  0.5%
Romanian Black Sea Zooplankton data from 1981 to 2000  196  0.5%

bioluminescence_group categorical label

Categorical taxonomy label grouping records by bioluminescent organism type, with 26 distinct groups across 43,060 rows and no nulls. Distribution is remarkably flat (entropy ratio 0.95): 'Dinoflagellate' leads at only 9.3%, and the next 18 values are tied at exactly 2,000 rows each, suggesting a synthetic or quota-balanced sample rather than naturally observed frequencies.

Treatment: One-hot or target-encode; the suspiciously uniform 2,000-per-class counts warrant a check for synthetic balancing before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[52]:

saturn.columns["bioluminescence_group"].stats

stat value
n 43,060
nulls 0 (0.0%)
unique 26
top_value Dinoflagellate
top_rate 0.09289
cardinality 26
entropy 4.465
entropy_ratio 0.95
Fig 21.
Top values for bioluminescence_group.
Top values for bioluminescence_group (20 unique shown, of 26 total).
value                                count  share
Dinoflagellate                       4000   9.3%
Sea sparkle dinoflagellate           2000   4.6%
Bioluminescent dinoflagellate        2000   4.6%
Crystal jelly (source of GFP)        2000   4.6%
Mauve stinger jellyfish              2000   4.6%
Warty comb jelly                     2000   4.6%
Crown jellyfish (alarm jelly)        2000   4.6%
Helmet jellyfish                     2000   4.6%
Comb jelly                           2000   4.6%
Krill (many species bioluminescent)  2000   4.6%
Northern krill                       2000   4.6%
Copepod (secretes luminous fluid)    2000   4.6%
Deep-sea shrimp (NanoLuc source)     2000   4.6%
Sea firefly ostracod                 2000   4.6%
Bioluminescent ostracod              2000   4.6%
Cock-eyed squid                      2000   4.6%
Bioluminescent marine bacteria       2000   4.6%
Marine luminous bacteria             2000   4.6%
Parchment tube worm                  2000   4.6%
Boring clam (piddock)                928    2.2%

How to cite


BibTeX
@misc{saturn-quirky-bioluminescence-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky bioluminescence},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-bioluminescence}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky bioluminescence. Source: /home/coolhand/html/datavis/data_trove/data/quirky/bioluminescence.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-bioluminescence