quirky bioluminescence
Reading
This dataset catalogues 43,060 records of bioluminescent marine organisms, with taxonomic fields (phylum, class, order, family, genus, scientificName), a bioluminescence_group label, geographic coordinates, depth, country, source dataset, and date/year. The taxonomy is led by Arthropoda (12,297) and Cnidaria (8,874) among 7 phyla, while bioluminescence_group is fairly evenly distributed across 26 categories, led by Dinoflagellate (4,000). Two things deserve a closer look first. The depth column is highly skewed (skew 4.72, max 10,000 m vs. median 52.5 m), with a 24.75% null rate and about 10.6% outliers. The country field is 63.7% empty, which limits any geographic breakdown by nation. The year field is also 42% null, so temporal analysis will be partial.
citing: depth · phylum · bioluminescence_group · class · country · year · latitude · longitude · scientificName
Charts the summary said to look at first
Data table: phylum
| value | count | share |
|---|---|---|
| Arthropoda | 12297 | 28.6% |
| Cnidaria | 8874 | 20.6% |
| Myzozoa | 8000 | 18.6% |
| Ctenophora | 4168 | 9.7% |
| Proteobacteria | 4000 | 9.3% |
| Mollusca | 3721 | 8.6% |
| Annelida | 2000 | 4.6% |
Data table: bioluminescence_group (top 20 of 26 values)
| value | count | share |
|---|---|---|
| Dinoflagellate | 4000 | 9.3% |
| Sea sparkle dinoflagellate | 2000 | 4.6% |
| Bioluminescent dinoflagellate | 2000 | 4.6% |
| Crystal jelly (source of GFP) | 2000 | 4.6% |
| Mauve stinger jellyfish | 2000 | 4.6% |
| Warty comb jelly | 2000 | 4.6% |
| Crown jellyfish (alarm jelly) | 2000 | 4.6% |
| Helmet jellyfish | 2000 | 4.6% |
| Comb jelly | 2000 | 4.6% |
| Krill (many species bioluminescent) | 2000 | 4.6% |
| Northern krill | 2000 | 4.6% |
| Copepod (secretes luminous fluid) | 2000 | 4.6% |
| Deep-sea shrimp (NanoLuc source) | 2000 | 4.6% |
| Sea firefly ostracod | 2000 | 4.6% |
| Bioluminescent ostracod | 2000 | 4.6% |
| Cock-eyed squid | 2000 | 4.6% |
| Bioluminescent marine bacteria | 2000 | 4.6% |
| Marine luminous bacteria | 2000 | 4.6% |
| Parchment tube worm | 2000 | 4.6% |
| Boring clam (piddock) | 928 | 2.2% |
Data table: depth (histogram bins)
| bin | count |
|---|---|
| -53 – 198.3 | 21893 |
| 198.3 – 449.6 | 4443 |
| 449.6 – 701 | 1966 |
| 701 – 952.3 | 1504 |
| 952.3 – 1204 | 1070 |
| 1204 – 1455 | 303 |
| 1455 – 1706 | 226 |
| 1706 – 1958 | 182 |
| 1958 – 2209 | 255 |
| 2209 – 2460 | 95 |
| 2460 – 2712 | 111 |
| 2712 – 2963 | 57 |
| 2963 – 3214 | 55 |
| 3214 – 3466 | 56 |
| 3466 – 3717 | 42 |
| 3717 – 3968 | 24 |
| 3968 – 4220 | 31 |
| 4220 – 4471 | 20 |
| 4471 – 4722 | 6 |
| 4722 – 4974 | 14 |
| 4974 – 5225 | 12 |
| 5225 – 5476 | 14 |
| 5476 – 5727 | 6 |
| 5727 – 5979 | 2 |
| 5979 – 6230 | 4 |
| 6230 – 6481 | 2 |
| 6481 – 6733 | 0 |
| 6733 – 6984 | 0 |
| 6984 – 7235 | 0 |
| 7235 – 7487 | 0 |
| 7487 – 7738 | 3 |
| 7738 – 7989 | 0 |
| 7989 – 8241 | 0 |
| 8241 – 8492 | 0 |
| 8492 – 8743 | 2 |
| 8743 – 8995 | 0 |
| 8995 – 9246 | 0 |
| 9246 – 9497 | 0 |
| 9497 – 9749 | 0 |
| 9749 – 10000 | 4 |
Data table: class
| value | count | share |
|---|---|---|
| Dinophyceae | 8000 | 18.6% |
| Scyphozoa | 6000 | 13.9% |
| Malacostraca | 6000 | 13.9% |
| Ostracoda | 4000 | 9.3% |
| Gammaproteobacteria | 4000 | 9.3% |
| Cephalopoda | 2793 | 6.5% |
| Copepoda | 2297 | 5.3% |
| Tentaculata | 2168 | 5.0% |
| Hydrozoa | 2000 | 4.6% |
| Nuda | 2000 | 4.6% |
| Polychaeta | 2000 | 4.6% |
| Bivalvia | 928 | 2.2% |
| Octocorallia | 874 | 2.0% |
Data table: latitude (histogram bins)
| bin | count |
|---|---|
| -76.62 – -72.5 | 34 |
| -72.5 – -68.37 | 134 |
| -68.37 – -64.25 | 770 |
| -64.25 – -60.13 | 872 |
| -60.13 – -56.01 | 500 |
| -56.01 – -51.88 | 309 |
| -51.88 – -47.76 | 279 |
| -47.76 – -43.64 | 377 |
| -43.64 – -39.51 | 1598 |
| -39.51 – -35.39 | 900 |
| -35.39 – -31.27 | 2736 |
| -31.27 – -27.15 | 1218 |
| -27.15 – -23.02 | 589 |
| -23.02 – -18.9 | 615 |
| -18.9 – -14.78 | 671 |
| -14.78 – -10.66 | 768 |
| -10.66 – -6.533 | 598 |
| -6.533 – -2.41 | 504 |
| -2.41 – 1.713 | 319 |
| 1.713 – 5.836 | 199 |
| 5.836 – 9.958 | 628 |
| 9.958 – 14.08 | 953 |
| 14.08 – 18.2 | 793 |
| 18.2 – 22.33 | 744 |
| 22.33 – 26.45 | 566 |
| 26.45 – 30.57 | 783 |
| 30.57 – 34.69 | 1840 |
| 34.69 – 38.82 | 2424 |
| 38.82 – 42.94 | 2931 |
| 42.94 – 47.06 | 3500 |
| 47.06 – 51.19 | 4244 |
| 51.19 – 55.31 | 3508 |
| 55.31 – 59.43 | 2052 |
| 59.43 – 63.55 | 1070 |
| 63.55 – 67.68 | 764 |
| 67.68 – 71.8 | 1560 |
| 71.8 – 75.92 | 532 |
| 75.92 – 80.04 | 94 |
| 80.04 – 84.17 | 52 |
| 84.17 – 88.29 | 32 |
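The bin edges in the latitude and depth tables look consistent with 40 equal-width bins between each column's minimum and maximum. That reading is an assumption inferred from the edge values, not something the report states. A minimal sketch of that binning, with hypothetical helper names `equal_width_edges` and `bin_index`:

```python
from typing import List


def equal_width_edges(lo: float, hi: float, n_bins: int) -> List[float]:
    """Return n_bins + 1 evenly spaced edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Index of the bin holding value; the maximum falls in the last bin."""
    if not lo <= value <= hi:
        raise ValueError("value lies outside the histogram range")
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)


# Latitude range taken from the column summary (min -76.619, max 88.29).
lat_edges = equal_width_edges(-76.619, 88.29, 40)
```

The second edge lands near -72.5, matching the first row of the latitude table after rounding.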
Schema
14 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| scientificName | categorical | 0.0% | 245 | |
| genus | categorical | 0.0% | 27 | |
| family | categorical | 0.0% | 22 | |
| phylum | categorical | 0.0% | 7 | |
| class | categorical | 0.0% | 13 | |
| order | categorical | 0.0% | 17 | |
| latitude | numeric | 0.0% | 14,146 | |
| longitude | numeric | 0.0% | 14,637 | |
| depth | numeric | 24.8% | 3,283 | null_rate, high_skew, outliers |
| date | text | 12.0% | 12,338 | one_word, allcaps, duplicates |
| year | categorical | 42.2% | 137 | null_rate |
| country | categorical | 0.0% | 130 | |
| dataset | categorical | 0.0% | 214 | |
| bioluminescence_group | categorical | 0.0% | 26 | |
scientificName
categorical · label
Taxonomic species/genus identifiers (Latin binomials like 'Mnemiopsis leidyi' and genera like 'Lingulodinium', 'Vibrio'). With 245 unique values across 43,060 rows and entropy ratio 0.747, the distribution is moderately spread: the top species accounts for only 4.6% of rows. The mix of binomial species names and bare genus names suggests inconsistent taxonomic resolution across records. Treatment: Treat as a categorical label; consider normalizing genus-only vs. species-level entries before grouping or modelling.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 245
- top_value: Mnemiopsis leidyi
- top_rate: 0.04645
- cardinality: 245
- entropy: 5.928
- entropy_ratio: 0.747
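One way to start normalizing genus-only vs. species-level entries is to tag each name by its apparent rank. The sketch below assumes a single-token name is a bare genus and anything longer is species-level or finer; `name_rank` and `genus_of` are hypothetical helpers, and real names with authorship strings or qualifiers would need more care.

```python
from typing import Optional


def name_rank(scientific_name: str) -> str:
    """Classify a name as 'genus' (one token), 'species' (two or more), or 'missing'."""
    tokens = scientific_name.split()
    if not tokens:
        return "missing"
    return "genus" if len(tokens) == 1 else "species"


def genus_of(scientific_name: str) -> Optional[str]:
    """Return the leading token, which is the genus in a Latin binomial."""
    tokens = scientific_name.split()
    return tokens[0] if tokens else None
```

Grouping on `genus_of` gives a consistent coarse key, while `name_rank` shows how many records lack species-level resolution.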
genus
categorical · label
Categorical genus label across 27 bioluminescent marine genera (Noctiluca, Pyrocystis, Alexandrium, Aequorea, etc.). Distribution is essentially uniform: the top 10 genera each show exactly 2,000 rows and the top rate is 4.64%, giving an entropy ratio of 0.959, which signals a deliberately balanced sample rather than natural abundance. No nulls across 43,060 rows. Treatment: Use directly as the classification target; one-hot or label-encode for modelling.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 27
- top_value: Noctiluca
- top_rate: 0.04645
- cardinality: 27
- entropy: 4.558
- entropy_ratio: 0.9586
family
categorical · label
Taxonomic family labels for what appears to be a catalogue of bioluminescent marine organisms, spanning 22 distinct families across 43,060 complete rows. The distribution is highly engineered rather than natural: four families (Pyrocystaceae, Euphausiidae, Cypridinidae, Vibrionaceae) each hit exactly 4,000 rows and several others land at exactly 2,000, suggesting deliberate per-class sampling or quota balancing. An entropy ratio of 0.93 confirms the near-uniform spread, and no nulls are present. Treatment: Use directly as a categorical class label; one-hot or integer-encode for modelling.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 22
- top_value: Pyrocystaceae
- top_rate: 0.09289
- cardinality: 22
- entropy: 4.157
- entropy_ratio: 0.9322
phylum
categorical · feature
Taxonomic phylum label across 43,060 records spanning just 7 distinct values with no nulls. Arthropoda leads at 28.6% (12,297 rows), followed by Cnidaria (8,874) and Myzozoa (8,000), with entropy ratio 0.92 indicating a fairly even spread across the seven categories. The mix of animal phyla alongside Proteobacteria (a bacterial phylum) is notable: this column blends kingdoms. Treatment: One-hot encode directly given the low cardinality.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 7
- top_value: Arthropoda
- top_rate: 0.2856
- cardinality: 7
- entropy: 2.593
- entropy_ratio: 0.9235
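The entropy and entropy_ratio figures appear consistent with Shannon entropy in bits, divided by log2 of the number of distinct values. That definition is an assumption inferred from the reported numbers, not stated by the report. A minimal sketch using the phylum counts from the data table:

```python
import math
from typing import Sequence


def entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: Sequence[int]) -> float:
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# Counts copied from the phylum data table.
phylum_counts = [12297, 8874, 8000, 4168, 4000, 3721, 2000]
phylum_entropy = entropy_bits(phylum_counts)
phylum_ratio = entropy_ratio(phylum_counts)
```

Under this definition a ratio near 1 means the categories are close to equally frequent, and a ratio near 0 means one value dominates.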
class
categorical · label
Taxonomic class labels for marine organisms, spanning 13 distinct values across 43,060 rows with no nulls. Distribution is fairly balanced (entropy ratio 0.93), with Dinophyceae leading at only 18.6%. Several classes show suspiciously round counts (8,000, 6,000, 4,000, 2,000), suggesting curated or sampled rather than naturally observed frequencies. Treatment: Use directly as a multi-class classification target with label encoding.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 13
- top_value: Dinophyceae
- top_rate: 0.1858
- cardinality: 13
- entropy: 3.43
- entropy_ratio: 0.9268
order
categorical · feature
Taxonomic order names for marine organisms (Gonyaulacales, Euphausiacea, Calanoida, etc.), with 17 distinct values across 43,060 rows and no nulls. Distribution is fairly even (entropy ratio 0.949, with the top order at only 13.9%), though several categories sit at suspiciously round counts (4,000, 2,000), suggesting stratified sampling or quota construction rather than natural frequencies. Treatment: One-hot or target-encode; safe to use directly given low cardinality and no missing values.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 17
- top_value: Gonyaulacales
- top_rate: 0.1393
- cardinality: 17
- entropy: 3.879
- entropy_ratio: 0.9491
latitude
numeric · feature
Numeric column bounded between -76.619 and 88.29, consistent with WGS84 latitudes in degrees. The distribution is wide (std 40.27, IQR 69.61) and mildly left-skewed (-0.66) with a flat shape (kurtosis -0.94), indicating coverage across both hemispheres rather than a single region. No nulls and no outliers flagged across 43,060 rows with 14,146 distinct values. Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than using raw degrees in a linear model.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 14,146
- min: -76.62
- max: 88.29
- mean: 19.1
- median: 36.71
- std: 40.27
- q1: -19.31
- q3: 50.3
- iqr: 69.61
- skew: -0.6614
- kurtosis: -0.9355
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.0004645
longitude
numeric · feature
Geographic longitude in decimal degrees, spanning the full -179.9987 to 179.99 range with 14,637 distinct values across 43,060 rows and zero nulls. The distribution is nearly symmetric (skew 0.14) with light tails (kurtosis -0.65) and a wide IQR of 124.12, indicating truly global coverage rather than a regional sample. No outliers flagged and only a 0.11% zero rate, consistent with clean coordinate data. Treatment: Pair with latitude for geospatial features; avoid treating as a linear scalar since ±180 wraps.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 14,637
- min: -180
- max: 180
- mean: 9.64
- median: 3.057
- std: 88.61
- q1: -60.19
- q3: 63.93
- iqr: 124.1
- skew: 0.1376
- kurtosis: -0.6464
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.001115
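Because longitude wraps at ±180, one common way to pair it with latitude is to map each coordinate to a point on the unit sphere, so that -180 and 180 end up at the same location. This is a sketch of one possible encoding, not something the report prescribes; `latlon_to_unit_vector` is a hypothetical helper name.

```python
import math
from typing import Tuple


def latlon_to_unit_vector(lat_deg: float, lon_deg: float) -> Tuple[float, float, float]:
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Points on either side of the antimeridian map to nearly the same vector.
west = latlon_to_unit_vector(0.0, -180.0)
east = latlon_to_unit_vector(0.0, 180.0)
```

The three components can be used as features directly; distances between them respect the wrap-around, unlike raw degrees.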
depth
numeric · feature · null_rate · high_skew · outliers
This is a numeric depth measurement (likely meters), with 24.75% nulls across 43,060 rows and only 3,283 distinct values. The distribution is heavily right-skewed (skew 4.72, kurtosis 35.89): the median is 52.5 but the mean is 281.2 and the max reaches 10,000, while 10.63% of values are flagged as outliers. Notably, the minimum is -53.0 (negative depths are suspect) and 11.92% of values are exactly zero. Treatment: Investigate negative and zero values, then log-transform (after shifting) before modelling.
- n: 43,060
- nulls: 10,658 (24.8%)
- unique: 3,283
- min: -53
- max: 10,000
- mean: 281.2
- median: 52.5
- std: 570.2
- q1: 7.5
- q3: 321
- iqr: 313.5
- skew: 4.724
- kurtosis: 35.89
- n_outliers: 3,444
- outlier_rate: 0.1063
- zero_rate: 0.1192
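A shifted log transform can be sketched as follows. It assumes the negative depths are kept and shifted so the minimum maps to zero; if investigation shows they are errors, setting them to missing first would be the alternative. `shifted_log_depth` is a hypothetical helper, and nulls are passed through as `None`.

```python
import math
from typing import List, Optional


def shifted_log_depth(depths: List[Optional[float]]) -> List[Optional[float]]:
    """Apply log1p after shifting so the smallest observed depth becomes 0."""
    present = [d for d in depths if d is not None]
    if not present:
        return [None] * len(depths)
    # Shift only when the minimum is negative; non-negative data is left as-is.
    shift = max(0.0, -min(present))
    return [None if d is None else math.log1p(d + shift) for d in depths]
```

Using `log1p` rather than `log` keeps the many exact-zero depths finite after the transform.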
date
text · timestamp · one_word · allcaps · duplicates
This is a date column stored as free text rather than a parsed timestamp, with values mixing single dates (e.g. '2017-05-30'), single months ('2013-08'), month ranges ('2010-05/2010-06') and even multi-year spans ('1962/1964'). The format heterogeneity is the main surprise: 97% of entries are one 'word', but length varies from 4 to 51 characters, and 67% are duplicates of another row. Nulls are also non-trivial at 12%. Treatment: Parse into structured start/end dates (handling year-only, month, and range formats) before any temporal analysis.
- n: 43,060
- nulls: 5,182 (12.0%)
- unique: 12,338
- len_min: 4
- len_max: 51
- len_mean: 16.45
- len_median: 19
- len_p95: 39
- word_mean: 1.03
- word_median: 1
- n_empty: 0
- n_duplicates: 25,540
- duplicate_rate: 0.6743
- vocab_size: 10,135
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9705
- allcaps_rate: 1
- boilerplate_rate: 0
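The start/end parsing could look roughly like this. The sketch covers only the four example shapes quoted above (year, year-month, full date, and `/`-separated ranges) and assumes any time-of-day part follows a `T` and can be dropped; `parse_date_span` is a hypothetical helper, and anything it cannot read comes back as `None` for manual review.

```python
import calendar
from datetime import date
from typing import Optional, Tuple


def _bounds(part: str) -> Tuple[date, date]:
    """First and last day covered by 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD[T...]'."""
    day_part = part.strip().split("T")[0]
    fields = [int(f) for f in day_part.split("-")]
    if len(fields) == 1:
        return date(fields[0], 1, 1), date(fields[0], 12, 31)
    if len(fields) == 2:
        last_day = calendar.monthrange(fields[0], fields[1])[1]
        return date(fields[0], fields[1], 1), date(fields[0], fields[1], last_day)
    if len(fields) == 3:
        single = date(fields[0], fields[1], fields[2])
        return single, single
    raise ValueError("unrecognised date shape: " + part)


def parse_date_span(text: Optional[str]) -> Optional[Tuple[date, date]]:
    """Return (start, end) for a date string, or None if missing or unparseable."""
    if text is None or not text.strip():
        return None
    parts = text.split("/")
    try:
        if len(parts) > 2:
            raise ValueError("more than one range separator")
        return _bounds(parts[0])[0], _bounds(parts[-1])[1]
    except ValueError:
        return None
```

For example, `parse_date_span("2010-05/2010-06")` spans 1 May 2010 to 30 June 2010, since a month-level end date is widened to the last day of that month.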
year
categorical · timestamp · null_rate
This is a year column stored categorically across 137 distinct values, suggesting coverage spanning over a century. The most common year is '2000' at 5.17% of non-null rows, with a high entropy ratio of 0.865 indicating values are spread fairly evenly across years. Notably, 42.18% of rows are null, which triggered a null_rate alert and limits usefulness without imputation or filtering. Treatment: Cast to integer year and decide whether to drop or impute the 42% missing rows before time-based analysis.
- n: 43,060
- nulls: 18,164 (42.2%)
- unique: 137
- top_value: 2000
- top_rate: 0.0517
- cardinality: 137
- entropy: 6.142
- entropy_ratio: 0.8653
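Casting to an integer year might be sketched like this. It assumes the values arrive as strings such as '2000', and possibly float-formatted strings such as '2000.0' (a guess about how the nulls affected storage, not something the report confirms). `to_year` is a hypothetical helper that maps anything missing or unreadable to `None`.

```python
from typing import Optional


def to_year(value: Optional[str]) -> Optional[int]:
    """Convert a year label to int, returning None for missing or invalid input."""
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    try:
        # float() first so that '2000.0' is accepted as well as '2000'.
        return int(float(text))
    except (ValueError, OverflowError):
        return None
```

Keeping missing years as `None` leaves the drop-or-impute decision explicit rather than hiding it inside the cast.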
country
categorical · feature
Country of origin as a categorical label across 130 distinct values, but 63.7% of the 43,060 rows are empty strings rather than nulls, making the modal 'value' effectively missing. The remaining entries show inconsistent encoding: full names ('Australia', 'United States'), ISO codes ('GB'), and uppercase forms ('PERU', 'SOVIET UNION', the latter a defunct state), suggesting data was merged from heterogeneous sources without normalisation. An entropy ratio of 0.37 confirms the distribution is heavily concentrated in a few buckets. Treatment: Normalise to ISO country codes, treat empty string as missing, then one-hot or target-encode.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 130
- top_value: (empty string)
- top_rate: 0.6368
- cardinality: 130
- entropy: 2.569
- entropy_ratio: 0.3658
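The normalisation step could be sketched as below. The lookup table is hypothetical and covers only the example labels quoted above; a real mapping would need all 130 values, and a decision on how to treat defunct states such as the Soviet Union, which is deliberately left unmapped here.

```python
from typing import Optional

# Hypothetical lookup keyed by case-folded label; extend for the real data.
COUNTRY_TO_ISO2 = {
    "australia": "AU",
    "united states": "US",
    "gb": "GB",
    "peru": "PE",
}

UNMAPPED = "UNMAPPED"  # sentinel for labels that need a manual decision


def normalise_country(raw: Optional[str]) -> Optional[str]:
    """Map a raw country label to an ISO 3166-1 alpha-2 code, or None if missing."""
    if raw is None:
        return None
    key = raw.strip().casefold()
    if not key:
        return None  # empty strings are treated as missing
    return COUNTRY_TO_ISO2.get(key, UNMAPPED)
```

Returning a sentinel for unknown labels, rather than `None`, keeps "missing" and "present but not yet mapped" distinguishable.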
dataset
categorical · metadata
This column names the source dataset each record was drawn from, with 214 distinct provenance strings across 43,060 rows. The dominant value is an empty string covering 61.1% of rows (26,317), meaning provenance is missing for the majority; named sources like 'Environmental Monitoring database (MOD) DNV' (1,760) and 'Jellyfish sightings along the Italian coastline from 2009 to 2017' (1,024) trail far behind. An entropy ratio of 0.41 confirms the distribution is heavily concentrated on that blank. Treatment: Treat empty string as missing and group rare sources before any per-dataset stratification.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 214
- top_value: (empty string)
- top_rate: 0.6112
- cardinality: 214
- entropy: 3.19
- entropy_ratio: 0.4121
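Grouping rare sources might look like the sketch below. The `min_count` threshold and the `"other"` label are illustrative choices, not values from the report, and `group_rare_sources` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Optional, Sequence


def group_rare_sources(
    values: Sequence[Optional[str]], min_count: int, other_label: str = "other"
) -> List[Optional[str]]:
    """Replace blanks with None and fold sources seen fewer than min_count times."""
    cleaned = [v.strip() if v and v.strip() else None for v in values]
    counts = Counter(v for v in cleaned if v is not None)
    return [
        None if v is None else (v if counts[v] >= min_count else other_label)
        for v in cleaned
    ]
```

Blank provenance stays missing instead of being folded into the rare bucket, so the 61% gap remains visible in any per-dataset breakdown.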
bioluminescence_group
categorical · label
Categorical taxonomy label grouping records by bioluminescent organism type, with 26 distinct groups across 43,060 rows and no nulls. Distribution is remarkably flat (entropy ratio 0.95): 'Dinoflagellate' leads at only 9.3%, and the next 18 values are tied at exactly 2,000 rows each, suggesting a synthetic or quota-balanced sample rather than naturally observed frequencies. Treatment: One-hot or target-encode; the suspiciously uniform 2,000-per-class counts warrant a check for synthetic balancing before modelling.
- n: 43,060
- nulls: 0 (0.0%)
- unique: 26
- top_value: Dinoflagellate
- top_rate: 0.09289
- cardinality: 26
- entropy: 4.465
- entropy_ratio: 0.95
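A quick check for quota balancing is to measure how many classes share the single most common row count. This is a rough heuristic sketch rather than a formal test, and `tied_count_share` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Tuple


def tied_count_share(class_counts: Dict[str, int]) -> Tuple[int, float]:
    """Return the most frequent row count and the fraction of classes that have it."""
    tally = Counter(class_counts.values())
    count, n_classes = tally.most_common(1)[0]
    return count, n_classes / len(class_counts)


# Toy input: two of four classes share a count of 2000.
example = {"a": 2000, "b": 2000, "c": 928, "d": 4000}
shared_count, share = tied_count_share(example)
```

Applied to the 20 groups listed in the bioluminescence_group data table, 18 share a count of 2,000, a share of 0.9, which naturally observed frequencies would be unlikely to produce.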