
quirky bioluminescence

source /home/coolhand/html/datavis/data_trove/data/quirky/bioluminescence.json · 43,060 rows · 14 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogues 43,060 records of bioluminescent marine organisms, with taxonomic fields (phylum, class, order, family, genus, scientificName), a bioluminescence_group label, geographic coordinates, depth, country, source dataset, and date/year. The taxonomy spans 7 phyla and is dominated by Arthropoda (12,297) and Cnidaria (8,874), while bioluminescence_group is fairly evenly distributed across 26 categories led by Dinoflagellate (4,000). Two things deserve a closer look first: the depth column is highly skewed (skew 4.72, max 10,000 m vs. median 52.5 m) with a 24.75% null rate and about 10.6% of its non-null values flagged as outliers, and the country field is an empty string in 63.7% of rows, which limits any geographic breakdown by nation. The year field is also 42% null, so temporal analysis will be partial.

citing: depth · phylum · bioluminescence_group · class · country · year · latitude · longitude · scientificName

Schema

14 columns
Per-column summary.

column                 type         nulls   unique   alerts
scientificName         categorical  0.0%    245
genus                  categorical  0.0%    27
family                 categorical  0.0%    22
phylum                 categorical  0.0%    7
class                  categorical  0.0%    13
order                  categorical  0.0%    17
latitude               numeric      0.0%    14,146
longitude              numeric      0.0%    14,637
depth                  numeric      24.8%   3,283    null_rate, high_skew, outliers
date                   text         12.0%   12,338   one_word, allcaps, duplicates
year                   categorical  42.2%   137      null_rate
country                categorical  0.0%    130
dataset                categorical  0.0%    214
bioluminescence_group  categorical  0.0%    26

scientificName

categorical label
Taxonomic species/genus identifiers (Latin binomials like 'Mnemiopsis leidyi' and genera like 'Lingulodinium', 'Vibrio'). With 245 unique values across 43,060 rows and entropy ratio 0.747, the distribution is moderately spread — the top species accounts for only 4.6% of rows. The mix of binomial species names and bare genus names suggests inconsistent taxonomic resolution across records. Treatment: Treat as a categorical label; consider normalizing genus-only vs. species-level entries before grouping or modelling. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
245
top_value
Mnemiopsis leidyi
top_rate
0.04645
cardinality
245
entropy
5.928
entropy_ratio
0.747
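
The genus-only vs. species-level split suggested above could be checked with a small classifier like the one below. This is a minimal sketch, assuming a binomial is written as a capitalised genus followed by a lowercase epithet; the helper names `name_rank` and `genus_of` are hypothetical and not part of the profiler.

```python
import re

# Assumption: "Genus species" is a binomial, a single capitalised word is a bare genus.
# Anything else (subspecies, authorities, typos) is left for manual review.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z-]+$")
GENUS_ONLY = re.compile(r"^[A-Z][a-z]+$")


def name_rank(scientific_name: str) -> str:
    """Classify a scientificName value as 'species', 'genus', or 'other'."""
    name = " ".join(scientific_name.split())  # collapse stray whitespace
    if BINOMIAL.match(name):
        return "species"
    if GENUS_ONLY.match(name):
        return "genus"
    return "other"


def genus_of(scientific_name: str) -> str:
    """Return the first token, usable as a common grouping key for both ranks."""
    parts = scientific_name.split()
    return parts[0] if parts else ""
```

Grouping on `genus_of` puts 'Mnemiopsis leidyi' and a bare 'Mnemiopsis' entry in the same bucket, while `name_rank` keeps track of which resolution each record had.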

genus

categorical label
Categorical genus label across 27 bioluminescent marine genera (Noctiluca, Pyrocystis, Alexandrium, Aequorea, etc.). Distribution is close to uniform: the top 10 genera each show exactly 2,000 rows (a top rate of 4.6%), giving an entropy ratio of 0.959, which points to a deliberately balanced sample rather than natural abundance. No nulls across 43,060 rows. Treatment: use directly as the classification target; one-hot or label-encode for modelling. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
27
top_value
Noctiluca
top_rate
0.04645
cardinality
27
entropy
4.558
entropy_ratio
0.9586
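
The entropy and entropy_ratio figures can be reproduced with a few lines. The sketch below assumes the profiler uses Shannon entropy in bits, normalised by log2 of the number of distinct values; the report does not state its formula, but the numbers above are consistent with it (4.558 / log2(27) ≈ 0.9586). The function name `entropy_ratio` is hypothetical.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy (bits) divided by its maximum, log2(number of distinct values)."""
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # a constant (or empty) column carries no spread to measure
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(k)
```

A ratio near 1 means the categories are close to equally frequent; a ratio near 0 means one value dominates.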

family

categorical label
Taxonomic family labels for what appears to be a catalogue of bioluminescent marine organisms, spanning 22 distinct families across 43,060 complete rows. The distribution is highly engineered rather than natural: four families (Pyrocystaceae, Euphausiidae, Cypridinidae, Vibrionaceae) each hit exactly 4,000 rows and several others land at exactly 2,000, suggesting deliberate per-class sampling or quota balancing. Entropy ratio of 0.93 confirms the near-uniform spread, and no nulls are present. Treatment: Use directly as a categorical class label; one-hot or integer-encode for modelling. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
22
top_value
Pyrocystaceae
top_rate
0.09289
cardinality
22
entropy
4.157
entropy_ratio
0.9322

phylum

categorical feature
Taxonomic phylum label across 43,060 records spanning just 7 distinct values with no nulls. Arthropoda leads at 28.6% (12,297 rows), followed by Cnidaria (8,874) and Myzozoa (8,000), with entropy ratio 0.92 indicating a fairly even spread across the seven categories. The column blends kingdoms: animal phyla sit alongside Myzozoa (protists, including the dinoflagellates) and Proteobacteria (a bacterial phylum). Treatment: one-hot encode directly given the low cardinality. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
7
top_value
Arthropoda
top_rate
0.2856
cardinality
7
entropy
2.593
entropy_ratio
0.9235
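
For a column with this few categories, one-hot encoding is simple enough to sketch without a library. The helper below is illustrative only (the name `one_hot` is hypothetical); in practice an existing encoder from a data-frame or ML library would normally be used.

```python
def one_hot(values):
    """Return (sorted categories, one 0/1 row per input value)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is set per record
        rows.append(row)
    return categories, rows
```

With 7 phyla this yields 7 indicator columns, which stays manageable; the same approach on scientificName (245 values) would be far wider.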

class

categorical label
Taxonomic class labels for marine organisms, spanning 13 distinct values across 43,060 rows with no nulls. Distribution is fairly balanced (entropy ratio 0.93) with Dinophyceae leading at only 18.6% — several classes show suspiciously round counts (8000, 6000, 4000, 2000) suggesting curated/sampled rather than naturally observed frequencies. Treatment: use directly as a multi-class classification target with label encoding. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
13
top_value
Dinophyceae
top_rate
0.1858
cardinality
13
entropy
3.43
entropy_ratio
0.9268

order

categorical feature
Taxonomic order names for marine organisms (Gonyaulacales, Euphausiacea, Calanoida, etc.), with 17 distinct values across 43,060 rows and no nulls. Distribution is fairly even — entropy ratio 0.949 and top class only 13.9% — though several categories sit at suspiciously round counts (4000, 2000), suggesting stratified sampling or quota construction rather than natural frequencies. Treatment: one-hot or target-encode; safe to use directly given low cardinality and no missing values. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
17
top_value
Gonyaulacales
top_rate
0.1393
cardinality
17
entropy
3.879
entropy_ratio
0.9491

latitude

numeric feature
Numeric column bounded between -76.619 and 88.29, consistent with WGS84 latitudes in degrees. The distribution is wide (std 40.27, IQR 69.61) and mildly left-skewed (-0.66) with a flat shape (kurtosis -0.94), indicating coverage across both hemispheres rather than a single region. No nulls and no outliers flagged across 43,060 rows with 14,146 distinct values. Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than using raw degrees in a linear model. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
14,146
min
-76.62
max
88.29
mean
19.1
median
36.71
std
40.27
q1
-19.31
q3
50.3
iqr
69.61
skew
-0.6614
kurtosis
-0.9355
n_outliers
0
outlier_rate
0
zero_rate
0.0004645

longitude

numeric feature
Geographic longitude in decimal degrees, spanning the full -179.9987 to 179.99 range with 14,637 distinct values across 43,060 rows and zero nulls. The distribution is nearly symmetric (skew 0.14) with light tails (kurtosis -0.65) and a wide IQR of 124.12, indicating truly global coverage rather than a regional sample. No outliers flagged and only a 0.11% zero rate, consistent with clean coordinate data. Treatment: Pair with latitude for geospatial features; avoid treating as a linear scalar since ±180 wraps. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
14,637
min
-180
max
180
mean
9.64
median
3.057
std
88.61
q1
-60.19
q3
63.93
iqr
124.1
skew
0.1376
kurtosis
-0.6464
n_outliers
0
outlier_rate
0
zero_rate
0.001115
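
One way to pair latitude and longitude without the ±180 wrap problem is to project each point onto the unit sphere. The sketch below assumes decimal degrees and a spherical Earth, which is a simplification; the function name `to_unit_vector` is hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Map (latitude, longitude) in degrees to an (x, y, z) point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )
```

Longitudes of -180 and 180 map to the same point, so records on either side of the antimeridian end up close together in feature space.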

depth

numeric feature null_rate high_skew outliers
This is a numeric depth measurement (likely meters), with 24.75% nulls across 43,060 rows and only 3,283 distinct values. The distribution is heavily right-skewed (skew 4.72, kurtosis 35.89): the median is 52.5 but the mean is 281.2 and the max reaches 10,000, while 10.63% of non-null values (3,444 of 32,402) are flagged as outliers. Notably, the minimum is -53.0 (negative depths are suspect) and 11.92% of values are exactly zero. Treatment: Investigate negative and zero values, then log-transform (after shifting) before modelling. high · anthropic:claude-opus-4-7
n
43,060
nulls
10,658 (24.8%)
unique
3,283
min
-53
max
10,000
mean
281.2
median
52.5
std
570.2
q1
7.5
q3
321
iqr
313.5
skew
4.724
kurtosis
35.89
n_outliers
3,444
outlier_rate
0.1063
zero_rate
0.1192
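
The treatment above could start from something like the following. It is a sketch under stated assumptions: negative depths are set aside as missing (the report only says to investigate them), zeros are kept but flagged, and the shift is the +1 built into `log1p`. The names `clean_depth` and `log_depth` are hypothetical.

```python
import math


def clean_depth(depth):
    """Return (value, flag) so suspect records can be reviewed rather than silently dropped."""
    if depth is None:
        return None, "missing"
    if depth < 0:
        return None, "negative"  # assumption: treat as missing pending investigation
    if depth == 0:
        return 0.0, "zero"  # kept, but flagged: may be a surface record or a placeholder
    return float(depth), "ok"


def log_depth(depth):
    """log(1 + depth) for usable depths, None otherwise."""
    value, _flag = clean_depth(depth)
    return None if value is None else math.log1p(value)
```

Counting the flags returned by `clean_depth` gives a quick view of how many rows fall into each suspect category before any transform is applied.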

date

text timestamp one_word allcaps duplicates
This is a date column stored as free text rather than a parsed timestamp, with values mixing single dates (e.g. '2017-05-30'), single months ('2013-08'), month ranges ('2010-05/2010-06') and even multi-year spans ('1962/1964'). The format heterogeneity is the main surprise: 97% of entries are one 'word', but length varies from 4 to 51 characters. Of the 37,878 non-null entries, 25,540 (67%) repeat a value that already appears in another row, leaving 12,338 distinct values. Nulls are also non-trivial at 12%. Treatment: Parse into structured start/end dates (handling year-only, month, and range formats) before any temporal analysis. high · anthropic:claude-opus-4-7
n
43,060
nulls
5,182 (12.0%)
unique
12,338
len_min
4
len_max
51
len_mean
16.45
len_median
19
len_p95
39
word_mean
1.03
word_median
1
n_empty
0
n_duplicates
25,540
duplicate_rate
0.6743
vocab_size
10,135
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
0.9705
allcaps_rate
1
boilerplate_rate
0
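
Parsing these mixed formats into start/end dates might look like the sketch below. It covers only the four shapes quoted in the description (year, year-month, full date, and '/'-separated ranges) and, as an assumption, ignores anything after the date part, such as a time component. The function name `parse_date_range` is hypothetical.

```python
import calendar
import re
from datetime import date

# Year, optionally followed by -MM, optionally followed by -DD; trailing text is ignored.
PART = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def _bounds(part):
    """Return the (first day, last day) covered by one date fragment, or None."""
    match = PART.match(part.strip())
    if not match:
        return None
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) else None
    day = int(match.group(3)) if match.group(3) else None
    try:
        if month is None:
            return date(year, 1, 1), date(year, 12, 31)
        if day is None:
            last_day = calendar.monthrange(year, month)[1]
            return date(year, month, 1), date(year, month, last_day)
        return date(year, month, day), date(year, month, day)
    except ValueError:
        return None  # e.g. month 13 or day 32


def parse_date_range(text):
    """Return (start, end) dates for a single value or an 'A/B' range, or None."""
    if not text:
        return None
    parts = text.split("/")
    first, last = _bounds(parts[0]), _bounds(parts[-1])
    if first is None or last is None:
        return None
    return first[0], last[1]
```

Values that return None are the ones worth inspecting by hand, since they fall outside the formats listed above.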

year

categorical timestamp null_rate
This is a year column stored categorically across 137 distinct values, suggesting coverage spanning over a century. The most common year is '2000' at 5.17% of non-null rows, with a high entropy ratio of 0.865 indicating values are spread fairly evenly across years. Notably, 42.18% of rows are null, which triggered a null_rate alert and limits usefulness without imputation or filtering. Treatment: Cast to integer year and decide whether to drop or impute the 42% missing rows before time-based analysis. high · anthropic:claude-opus-4-7
n
43,060
nulls
18,164 (42.2%)
unique
137
top_value
2000
top_rate
0.0517
cardinality
137
entropy
6.142
entropy_ratio
0.8653
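
Casting the column to integer years could be done along these lines. This is a sketch: the plausibility bounds `lo` and `hi` are hypothetical defaults chosen for illustration, not values taken from the data, and the function name `to_year` is likewise an assumption.

```python
def to_year(value, lo=1700, hi=2026):
    """Convert a raw year value to int, returning None for blanks, junk, or implausible years."""
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    try:
        year = int(float(text))  # tolerates '2000' as well as '2000.0'
    except (ValueError, OverflowError):
        return None
    return year if lo <= year <= hi else None
```

Rows that come back as None can then be counted against the 42.2% null rate to see whether the cast introduced any additional missing values.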

country

categorical feature
Country of origin as a categorical label across 130 distinct values, but 63.7% of the 43,060 rows are empty strings rather than nulls, so the modal 'value' is effectively missing even though the null rate reads 0.0%. The remaining entries show inconsistent encoding: full names ('Australia', 'United States'), ISO codes ('GB'), and uppercase forms ('PERU', 'SOVIET UNION', the latter a defunct state), suggesting data was merged from heterogeneous sources without normalisation. Entropy ratio of 0.37 confirms the distribution is heavily concentrated in a few buckets. Treatment: Normalise to ISO country codes, treat empty string as missing, then one-hot or target-encode. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
130
top_value
(empty string)
top_rate
0.6368
cardinality
130
entropy
2.569
entropy_ratio
0.3658
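
A normalisation pass for this column could be sketched as follows. The alias table is hypothetical and lists only spellings quoted in the description above; a real mapping would need to cover all 130 values. Unrecognised names, including historical states such as 'SOVIET UNION', are deliberately left unmapped so that a policy decision can be made for them.

```python
# Hypothetical lookup table: extend with the remaining spellings found in the data.
ALIASES = {
    "australia": "AU",
    "united states": "US",
    "gb": "GB",
    "peru": "PE",
}


def normalise_country(raw):
    """Return an ISO 3166-1 alpha-2 code, None for missing, or an 'UNMAPPED:' marker."""
    if raw is None:
        return None
    key = " ".join(raw.split()).casefold()  # trim, collapse spaces, ignore case
    if not key:
        return None  # empty string is treated as missing
    return ALIASES.get(key, "UNMAPPED:" + key)
```

Counting the 'UNMAPPED:' results gives a worklist of spellings that still need an entry in the table.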

dataset

categorical metadata
This column names the source dataset each record was drawn from, with 214 distinct provenance strings across 43,060 rows. The dominant value is an empty string covering 61.1% of rows (26,317), meaning provenance is missing for the majority; named sources like 'Environmental Monitoring database (MOD) DNV' (1,760) and 'Jellyfish sightings along the Italian coastline from 2009 to 2017' (1,024) trail far behind. Entropy ratio of 0.41 confirms the distribution is heavily concentrated on that blank. Treatment: Treat empty string as missing and group rare sources before any per-dataset stratification. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
214
top_value
(empty string)
top_rate
0.6112
cardinality
214
entropy
3.19
entropy_ratio
0.4121
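
The suggested treatment, blank as missing plus grouping of rare sources, could look like this. It is a minimal sketch: the threshold `min_count` and the bucket label `other` are hypothetical parameters, and the function name `group_rare_sources` is an assumption.

```python
from collections import Counter


def group_rare_sources(values, min_count=100, other="OTHER"):
    """Map blank values to None and sources seen fewer than min_count times to one bucket."""
    cleaned = [
        v.strip() if isinstance(v, str) and v.strip() else None
        for v in values
    ]
    counts = Counter(v for v in cleaned if v is not None)
    return [
        None if v is None else (v if counts[v] >= min_count else other)
        for v in cleaned
    ]
```

Because the blank value covers most rows, it is excluded from the counts before the threshold is applied, so it cannot be mistaken for a dominant source.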

bioluminescence_group

categorical label
Categorical taxonomy label grouping records by bioluminescent organism type, with 26 distinct groups across 43,060 rows and no nulls. Distribution is remarkably flat (entropy ratio 0.95): 'Dinoflagellate' leads at only 9.3%, and the next nine values are tied at exactly 2,000 rows each, suggesting a synthetic or quota-balanced sample rather than naturally observed frequencies. Treatment: One-hot or target-encode; the suspiciously uniform 2,000-per-class counts warrant a check for synthetic balancing before modelling. high · anthropic:claude-opus-4-7
n
43,060
nulls
0 (0.0%)
unique
26
top_value
Dinoflagellate
top_rate
0.09289
cardinality
26
entropy
4.465
entropy_ratio
0.95
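
The check for synthetic balancing mentioned above can begin with a simple count of categories whose frequency is an exact multiple of a round number. The sketch below assumes a step of 2,000, matching the counts noted in the descriptions; the function name `round_count_report` is hypothetical.

```python
from collections import Counter


def round_count_report(values, step=2000):
    """Return ({category: count} for counts that are exact multiples of step, share of categories flagged)."""
    counts = Counter(values)
    flagged = {value: count for value, count in counts.items() if count % step == 0}
    share = len(flagged) / len(counts) if counts else 0.0
    return flagged, share
```

A high share across several taxonomic columns at once would support the quota-balancing reading; on its own it does not prove the records are synthetic.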