quirky carnivorous plants real

source /home/coolhand/html/datavis/data_trove/data/quirky/carnivorous_plants_real.json · 610 rows · 14 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset holds 610 GBIF biodiversity occurrence records across 14 columns, mixing taxonomy (family, genus, species), geography (country, stateProvince, latitude/longitude), and observation metadata (basisOfRecord, year, month, coordinateUncertainty). Despite the 'carnivorous_plants' filename, the taxonomy is dominated by two unrelated families, Hesperiidae (skipper butterflies) and Canellaceae, each with 300 records, plus a small Araceae tail of 10; this taxonomic split is the first thing worth investigating. Geographically, records skew to the Americas (USA 130, Mexico 73, Brazil 51) but span 35 countries, and 90% are HUMAN_OBSERVATION rather than preserved specimens. Watch coordinateUncertainty closely: it is highly skewed (skew 17.3) with a median of 35 m, a max of 766,917 m, and 22.6% nulls, so any spatial analysis should filter on an uncertainty threshold and decide explicitly how to treat the null rows. Years are confined to 2021–2026, indicating a recent-only snapshot.

citing: row_count · column_count · columns.family.top_values · columns.country.top_values · columns.basisOfRecord.top_values · columns.scientificName.top_values · columns.coordinateUncertainty.stats · columns.year.stats

Schema

14 columns
Per-column summary.

column                 type         nulls  unique  alerts
scientificName         categorical  0.0%   157     long_tail
species                categorical  0.0%   123
genus                  categorical  0.0%   94
family                 categorical  0.0%   3
latitude               numeric      0.0%   466
longitude              numeric      0.0%   467
country                categorical  0.0%   35
stateProvince          categorical  0.0%   108
locality               categorical  0.0%   29      long_tail
basisOfRecord          categorical  0.0%   2
year                   numeric      0.0%   6
month                  numeric      0.2%   12
coordinateUncertainty  numeric      22.6%  151     null_rate high_skew outliers
gbifID                 categorical  0.0%   610     long_tail

scientificName

categorical label long_tail
Taxonomic binomials with authorship — almost certainly biodiversity occurrence records keyed by Linnaean scientific name. The distribution is heavily concentrated: 157 distinct taxa across 610 rows, with Canella winterana alone claiming 28.5% (174 records) and a long tail flagged by the profiler. Notably the names mix plants (Canella, Warburgia, Cinnamodendron, Pinellia) with butterflies (Hylephila, Ocybadistes, Urbanus), so this column spans multiple kingdoms rather than a single clade. Treatment: Group rare taxa into an 'other' bucket or join to a taxonomy table before using as a categorical feature. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
157
top_value
Canella winterana (L.) Gaertn.
top_rate
0.2852
cardinality
157
entropy
5.517
entropy_ratio
0.7563
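
The suggested join to a taxonomy table needs a key without the authorship suffix. Below is a minimal sketch, assuming that taking the first two whitespace-separated tokens is good enough for plain binomials such as the top value above. The helper name `leading_binomial` is hypothetical, and the rule would mislabel hybrids, infraspecific names, and genus-only names that carry authorship, so treat it as a starting point only.

```python
def leading_binomial(scientific_name):
    """Return genus + specific epithet, i.e. the first two tokens of the name.

    Assumption: the name is a plain binomial followed by authorship.
    Single-token names (a bare genus or family) return None.
    """
    parts = scientific_name.split()
    if len(parts) < 2:
        return None
    return " ".join(parts[:2])


# The top value reported for this column, with authorship attached.
key = leading_binomial("Canella winterana (L.) Gaertn.")
print(key)
```

For the top value this yields the same string the species column reports as its own top value, which is what makes the two columns joinable.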

species

categorical label
Categorical taxonomic labels — mostly Linnaean binomials (e.g. Canella winterana, Warburgia salutaris) with a few family-level names mixed in (Droseraceae, Sarraceniaceae), suggesting inconsistent taxonomic granularity. One species, Canella winterana, dominates at 28.5% of 610 rows, yet 123 distinct values and an entropy ratio of 0.74 indicate a long tail. The mix of plant genera (Cinnamodendron, Cinnamosma) and butterfly/skipper species (Hylephila phyleus, Ocybadistes walkeri, Urbanus dorantes) is unusual for a single 'species' column. Treatment: Normalise to a consistent taxonomic rank before grouping; consider collapsing rare classes or target-encoding given the 123-way cardinality. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
123
top_value
Canella winterana
top_rate
0.2852
cardinality
123
entropy
5.144
entropy_ratio
0.7409

genus

categorical feature
Categorical genus name with 94 distinct values across 610 rows and no nulls. The distribution is heavy-tailed: 'Canella' alone accounts for 28.5% (174 records), and the top four values appear to be plant genera (Canella, Warburgia, Cinnamosma, Cinnamodendron) while subsequent entries (Urbanus, Hylephila, Burnsius, Pyrgus) are butterfly/skipper genera, suggesting the column mixes taxa from different kingdoms. Entropy ratio of 0.74 reflects moderate concentration around the dominant genus. Treatment: Group rare genera into an 'other' bucket and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
94
top_value
Canella
top_rate
0.2852
cardinality
94
entropy
4.84
entropy_ratio
0.7384
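
The 'other' bucket recommended above can be sketched as follows. This is an illustration on toy data, not the real column; the helper name `bucket_rare` and the `min_count=5` threshold are assumptions chosen for the example, and the right cut-off depends on the downstream model.

```python
from collections import Counter


def bucket_rare(values, min_count=5, other_label="other"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy sample using genus names mentioned in the report; counts are invented.
genera = ["Canella"] * 6 + ["Warburgia"] * 5 + ["Urbanus", "Pyrgus", "Burnsius"]
bucketed = bucket_rare(genera, min_count=5)
print(Counter(bucketed))
```

Bucketing before one-hot encoding keeps the feature width bounded by the number of frequent genera plus one.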

family

categorical label
Categorical column holding taxonomic family labels across 610 rows with only 3 distinct values and no nulls. The distribution is essentially bimodal — Hesperiidae and Canellaceae each appear 300 times (top_rate 0.492) while Araceae appears just 10 times — and notably mixes an animal family (Hesperiidae, skipper butterflies) with two plant families, which is an unusual cross-kingdom blend. Treatment: One-hot encode; consider merging or stratifying given the rare Araceae class (10/610). high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
3
top_value
Hesperiidae
top_rate
0.4918
cardinality
3
entropy
1.104
entropy_ratio
0.6967
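
The entropy figures can be checked directly from the three class counts. The sketch below assumes the profiler uses base-2 Shannon entropy and divides by log2 of the number of distinct values to get entropy_ratio; the reported numbers are consistent with that reading, but the profiler's source is not shown here.

```python
import math


def shannon_entropy(counts):
    """Base-2 Shannon entropy of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hesperiidae, Canellaceae, Araceae counts as stated in the description above.
family_counts = [300, 300, 10]
entropy = shannon_entropy(family_counts)
entropy_ratio = entropy / math.log2(len(family_counts))
print(round(entropy, 3), round(entropy_ratio, 4))
```

Under that assumption the result agrees with the reported entropy of 1.104 and entropy_ratio of 0.6967 to rounding.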

latitude

numeric feature
This column holds geographic latitudes in decimal degrees, ranging from -43.245933 to 46.704735 with a median of 17.008014. The wide IQR of 47.748 and bimodal-leaning kurtosis of -1.28 suggest observations are spread across both hemispheres rather than clustered in one region. With 466 unique values across 610 rows and no nulls or outliers, coverage is clean but globally dispersed. Treatment: Pair with longitude for geospatial features; avoid treating as a plain scalar in models. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
466
min
-43.25
max
46.7
mean
5.2
median
17.01
std
22.75
q1
-22.92
q3
24.83
iqr
47.75
skew
-0.6517
kurtosis
-1.283
n_outliers
0
outlier_rate
0
zero_rate
0

longitude

numeric feature
Geographic longitude in decimal degrees, spanning -115.04 to 153.39 across 610 rows with no nulls and 467 unique values. The distribution is right-skewed (1.18) with a median of -63.06 sitting well below the mean of -32.94, suggesting a concentration of points in the Western Hemisphere with a long tail reaching into the Eastern Hemisphere. No outliers flagged, consistent with valid lon bounds. Treatment: Pair with latitude for geospatial features; avoid treating as a standalone scalar in linear models. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
467
min
-115
max
153.4
mean
-32.94
median
-63.06
std
78.93
q1
-89.37
q3
30.84
iqr
120.2
skew
1.184
kurtosis
0.0844
n_outliers
0
outlier_rate
0
zero_rate
0
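
Both coordinate columns carry the advice to pair latitude with longitude rather than feed either to a model as a standalone scalar. One common way to do that, sketched here as an assumption rather than as the report's prescribed method, is to map each pair to a point on the unit sphere so that longitudes near -180 and 180 end up close together. The function name `latlon_to_unit_xyz` is hypothetical.

```python
import math


def latlon_to_unit_xyz(lat_deg, lon_deg):
    """Map decimal-degree latitude/longitude to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Median latitude and longitude from the report, used only as a sample input.
x, y, z = latlon_to_unit_xyz(17.008014, -63.06)
print(x, y, z)
```

The three outputs are bounded in [-1, 1] and vary smoothly across the antimeridian, which raw longitude does not.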

country

categorical feature
Country of origin or observation, with 35 distinct values across 610 complete rows. The distribution is moderately concentrated: United States of America leads at 21.3% (130 rows), followed by Mexico (73) and Brazil (51), and the entropy ratio of 0.77 indicates a fairly diverse but US-tilted mix. Notable is the prominence of small territories like Guadeloupe (48) and Puerto Rico (37) ranking above larger nations, suggesting a tropical/Americas sampling bias rather than a global population sample. Treatment: One-hot encode top values and bucket the long tail into 'Other' before modelling. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
35
top_value
United States of America
top_rate
0.2131
cardinality
35
entropy
3.961
entropy_ratio
0.7722

stateProvince

categorical feature
Holds state or province names for 610 records spanning 108 distinct values across multiple countries (Texas, Florida, Nayarit, Queensland, KwaZulu-Natal). The mix is uneven: Texas alone covers 13.3% of rows, and the categories blend US states, Mexican states, Brazilian states, and a city-level entry ('Pointe-à-Pitre', in Guadeloupe), suggesting inconsistent administrative granularity. 30 rows carry an empty-string value that null_rate=0 does not flag, and an explicit 'Other' bucket appears 11 times. Treatment: Normalise empty strings to null and group rare levels before one-hot or target encoding. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
108
top_value
Texas
top_rate
0.1328
cardinality
108
entropy
5.53
entropy_ratio
0.8187
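
Normalising empty strings to null, as the treatment suggests, might look like the sketch below. The helper name `blank_to_none` is hypothetical, and treating whitespace-only strings as missing is an added assumption that the report does not state. The same step applies to the locality column.

```python
def blank_to_none(value):
    """Return None for None, empty, or whitespace-only strings; else the stripped text."""
    if value is None:
        return None
    cleaned = value.strip()
    return cleaned or None


# Toy values, not the real column.
states = ["Texas", "", "  ", "Florida", None]
cleaned_states = [blank_to_none(s) for s in states]
missing = sum(v is None for v in cleaned_states)
print(cleaned_states, missing)
```

Running this before profiling would let the null rate reflect the blank rows that the current 0.0% figure hides.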

locality

categorical free_text long_tail
Free-text locality descriptions for specimen records, mostly in French with Malagasy place names (districts, communes, fokontany in Madagascar). 563 of 610 rows (top_rate 0.923) are empty strings, so the field is effectively blank for the vast majority of records; the other 47 rows spread across 28 distinct non-empty values (the reported 29 counts the empty string), which are long sentence-length descriptions rather than controlled vocabulary. Entropy ratio of 0.154 confirms the distribution is dominated by the empty value. Treatment: Treat empty string as missing and parse remaining entries with NER or regex to extract administrative units before use. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
29
top_value
(empty string)
top_rate
0.923
cardinality
29
entropy
0.7475
entropy_ratio
0.1539

basisOfRecord

categorical metadata
Categorical provenance flag from a biodiversity occurrence record (GBIF-style basisOfRecord), with only two values present out of the wider controlled vocabulary. HUMAN_OBSERVATION dominates at 550/610 (90.2%), with PRESERVED_SPECIMEN making up the remaining 60; no nulls. Entropy ratio 0.46 confirms the heavy imbalance. Treatment: Keep as a binary indicator (e.g., is_specimen) for stratification or filtering. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
2
top_value
HUMAN_OBSERVATION
top_rate
0.9016
cardinality
2
entropy
0.4638
entropy_ratio
0.4638
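
A binary indicator such as the is_specimen flag mentioned above is a one-line comparison. The sketch below uses toy values rather than the real column and only shows the shape of the flag; the two vocabulary strings are the ones reported for this dataset.

```python
from collections import Counter


def is_specimen(basis_of_record):
    """True for PRESERVED_SPECIMEN, False for any other basisOfRecord value."""
    return basis_of_record == "PRESERVED_SPECIMEN"


# Toy sample with roughly the reported 90/10 imbalance.
bases = ["HUMAN_OBSERVATION"] * 9 + ["PRESERVED_SPECIMEN"]
flags = [is_specimen(b) for b in bases]
print(Counter(flags))
```

Because the function returns False for anything that is not a preserved specimen, a third vocabulary value appearing in a later export would be silently grouped with observations.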

year

numeric timestamp
Calendar year of the record, spanning only 2021 to 2026 across 610 rows with 6 distinct values. The distribution is left-skewed (skew -0.80) and concentrated at the recent end: median and Q3 both sit at 2026, with Q1 at 2024. Treatment: Treat as an ordinal time bucket; consider one-hot or year-since-min rather than raw integer. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
6
min
2021
max
2026
mean
2025
median
2026
std
1.503
q1
2024
q3
2026
iqr
2
skew
-0.7969
kurtosis
-0.7929
n_outliers
0
outlier_rate
0
zero_rate
0

month

numeric feature
Integer values bounded between 1 and 12 with 12 unique levels strongly indicate a calendar month index. The distribution is heavily front-loaded: the median is 1.0 and Q3 is only 7.0, so at least half the rows fall in January and the skew of 1.00 confirms a long tail toward year-end months. Nulls are negligible (0.16%) and no outliers are flagged. Treatment: Treat as a cyclical categorical (one-hot or sin/cos encode) rather than a raw numeric. high · anthropic:claude-opus-4-7
n
610
nulls
1 (0.2%)
unique
12
min
1
max
12
mean
3.752
median
1
std
3.75
q1
1
q3
7
iqr
6
skew
1.002
kurtosis
-0.5078
n_outliers
0
outlier_rate
0
zero_rate
0
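
The sin/cos option named in the treatment can be sketched as follows. The function name `encode_month` is hypothetical, and passing the single null month through as None is an assumption about how missing values should be handled.

```python
import math


def encode_month(month):
    """Map a month index 1..12 to (sin, cos) on a 12-step circle; None stays None."""
    if month is None:
        return None
    angle = 2 * math.pi * (month - 1) / 12
    return (math.sin(angle), math.cos(angle))


def distance(p, q):
    """Euclidean distance between two encoded months."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


december_to_january = distance(encode_month(12), encode_month(1))
january_to_february = distance(encode_month(1), encode_month(2))
print(december_to_january, january_to_february)
```

In this encoding December sits as close to January as January does to February, which a raw 1 to 12 integer cannot express.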

coordinateUncertainty

numeric feature null_rate high_skew outliers
Numeric coordinate uncertainty values, almost certainly meters of GPS/locality error attached to occurrence records. The distribution is severely right-skewed (skew 17.3, kurtosis 335.7): the median is 35 but the mean is 6463 and the max reaches 766917, with 19.3% of non-null values (91 of 472) flagged as outliers. Roughly 22.6% of rows (138) are null, so coverage is partial. Treatment: Log-transform and impute missing values before using as a feature; when using it as a quality filter, threshold the raw values and decide explicitly whether null rows are kept. high · anthropic:claude-opus-4-7
n
610
nulls
138 (22.6%)
unique
151
min
1
max
766,917
mean
6463
median
35
std
3.814e+04
q1
5
q3
466.8
iqr
461.8
skew
17.3
kurtosis
335.7
n_outliers
91
outlier_rate
0.1928
zero_rate
0
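
The log-transform and imputation step might be sketched as below. Median imputation, the `log1p` transform, and the added missing-value flag are all assumptions made for illustration; the report only says to log-transform and impute. The values are toy data built from figures quoted above, not the real column.

```python
import math
from statistics import median


def prepare_uncertainty(values):
    """Return (log1p(value), was_missing) pairs, filling None with the observed median."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    prepared = []
    for v in values:
        was_missing = v is None
        prepared.append((math.log1p(fill if was_missing else v), was_missing))
    return prepared


# Toy sample: reported min, q1, median, one null, and the reported max.
sample = [1, 5, 35, None, 766917]
prepared = prepare_uncertainty(sample)
print(prepared)
```

Keeping the was_missing flag alongside the imputed value lets a model or a filter distinguish measured uncertainty from filled-in uncertainty, which matters when more than a fifth of the rows are null.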

gbifID

categorical identifier long_tail
This is the GBIF occurrence identifier: every one of the 610 rows carries a unique numeric ID (n_unique=610, top_rate=0.0016, entropy_ratio≈1.0) with no nulls. Because every ID occurs exactly once, the listed 'top values' are ties rather than genuinely frequent values; those shown fall in the 5937748304–5937748333 range, which hints at, but does not establish, that the records came from a single contiguous GBIF batch rather than being sampled across time. Treatment: Keep as a primary key for joins back to GBIF; drop from any model features. high · anthropic:claude-opus-4-7
n
610
nulls
0 (0.0%)
unique
610
top_value
5937748304
top_rate
0.001639
cardinality
610
entropy
9.253
entropy_ratio
1
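
Keeping gbifID as a join key while excluding it from model features can be sketched like this. The helper name `split_key` and the toy records are hypothetical; the uniqueness check mirrors the n_unique=610 property reported above.

```python
def split_key(records, key="gbifID"):
    """Verify the key is unique, then return (keys, feature_rows) with the key removed."""
    keys = [r[key] for r in records]
    if len(set(keys)) != len(keys):
        raise ValueError(f"duplicate values found in key column {key!r}")
    features = [{k: v for k, v in r.items() if k != key} for r in records]
    return keys, features


# Toy records with placeholder IDs, not real GBIF identifiers.
rows = [
    {"gbifID": "id-1", "family": "Canellaceae", "year": 2026},
    {"gbifID": "id-2", "family": "Hesperiidae", "year": 2024},
]
keys, features = split_key(rows)
print(keys, features)
```

The keys list stays aligned with the feature rows by position, so predictions can be joined back to GBIF records afterwards.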