quirky carnivorous plants real
Reading
This dataset holds 610 GBIF biodiversity occurrence records across 14 columns, mixing taxonomy (family, genus, species), geography (country, stateProvince, latitude/longitude), and observation metadata (basisOfRecord, year, month, coordinateUncertainty). Despite the 'carnivorous_plants' filename, the taxonomy is dominated by two unrelated families, Hesperiidae (skipper butterflies) and Canellaceae, each with 300 records, plus a small Araceae tail of 10 records; this taxonomic split is the first thing worth investigating. Geographically, records skew to the Americas (USA 130, Mexico 73, Brazil 51) but span 35 countries, and 90.2% are HUMAN_OBSERVATION rather than preserved specimens. Watch coordinateUncertainty closely: it is highly skewed (skew 17.3) with a max of 766,917 m and 22.6% nulls, so any spatial analysis should first filter out high-uncertainty records and decide how to handle the missing values. Years are tightly clustered in 2021–2026, indicating a recent-only snapshot.
citing: row_count · column_count · columns.family.top_values · columns.country.top_values · columns.basisOfRecord.top_values · columns.scientificName.top_values · columns.coordinateUncertainty.stats · columns.year.stats
Charts the summary said to look at first
family: top values
| value | count | share |
|---|---|---|
| Hesperiidae | 300 | 49.2% |
| Canellaceae | 300 | 49.2% |
| Araceae | 10 | 1.6% |
country: top values (first 20 of 35)
| value | count | share |
|---|---|---|
| United States of America | 130 | 21.3% |
| Mexico | 73 | 12.0% |
| Brazil | 51 | 8.4% |
| Guadeloupe | 48 | 7.9% |
| Australia | 47 | 7.7% |
| South Africa | 41 | 6.7% |
| Madagascar | 40 | 6.6% |
| Puerto Rico | 37 | 6.1% |
| Dominican Republic | 16 | 2.6% |
| Panama | 15 | 2.5% |
| Argentina | 14 | 2.3% |
| Singapore | 10 | 1.6% |
| Cayman Islands | 10 | 1.6% |
| Antigua and Barbuda | 10 | 1.6% |
| China | 10 | 1.6% |
| Virgin Islands (U.S.) | 8 | 1.3% |
| Kenya | 8 | 1.3% |
| Hong Kong | 6 | 1.0% |
| Costa Rica | 5 | 0.8% |
| Sint Maarten (Dutch part) | 4 | 0.7% |
basisOfRecord: top values
| value | count | share |
|---|---|---|
| HUMAN_OBSERVATION | 550 | 90.2% |
| PRESERVED_SPECIMEN | 60 | 9.8% |
coordinateUncertainty: histogram
| bin | count |
|---|---|
| 1 – 3.652e+04 | 467 |
| 3.652e+04 – 7.304e+04 | 2 |
| 7.304e+04 – 1.096e+05 | 1 |
| 1.096e+05 – 1.461e+05 | 0 |
| 1.461e+05 – 1.826e+05 | 0 |
| 1.826e+05 – 2.191e+05 | 0 |
| 2.191e+05 – 2.556e+05 | 1 |
| 2.556e+05 – 2.922e+05 | 0 |
| 2.922e+05 – 3.287e+05 | 0 |
| 3.287e+05 – 3.652e+05 | 0 |
| 3.652e+05 – 4.017e+05 | 0 |
| 4.017e+05 – 4.382e+05 | 0 |
| 4.382e+05 – 4.748e+05 | 0 |
| 4.748e+05 – 5.113e+05 | 0 |
| 5.113e+05 – 5.478e+05 | 0 |
| 5.478e+05 – 5.843e+05 | 0 |
| 5.843e+05 – 6.208e+05 | 0 |
| 6.208e+05 – 6.574e+05 | 0 |
| 6.574e+05 – 6.939e+05 | 0 |
| 6.939e+05 – 7.304e+05 | 0 |
| 7.304e+05 – 7.669e+05 | 1 |
month: histogram
| bin | count |
|---|---|
| 1 – 1.458 | 342 |
| 1.458 – 1.917 | 0 |
| 1.917 – 2.375 | 18 |
| 2.375 – 2.833 | 0 |
| 2.833 – 3.292 | 24 |
| 3.292 – 3.75 | 0 |
| 3.75 – 4.208 | 24 |
| 4.208 – 4.667 | 0 |
| 4.667 – 5.125 | 21 |
| 5.125 – 5.583 | 0 |
| 5.583 – 6.042 | 17 |
| 6.042 – 6.5 | 0 |
| 6.5 – 6.958 | 0 |
| 6.958 – 7.417 | 42 |
| 7.417 – 7.875 | 0 |
| 7.875 – 8.333 | 23 |
| 8.333 – 8.792 | 0 |
| 8.792 – 9.25 | 15 |
| 9.25 – 9.708 | 0 |
| 9.708 – 10.17 | 24 |
| 10.17 – 10.62 | 0 |
| 10.62 – 11.08 | 29 |
| 11.08 – 11.54 | 0 |
| 11.54 – 12 | 30 |
Schema
14 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| scientificName | categorical | 0.0% | 157 | long_tail |
| species | categorical | 0.0% | 123 | |
| genus | categorical | 0.0% | 94 | |
| family | categorical | 0.0% | 3 | |
| latitude | numeric | 0.0% | 466 | |
| longitude | numeric | 0.0% | 467 | |
| country | categorical | 0.0% | 35 | |
| stateProvince | categorical | 0.0% | 108 | |
| locality | categorical | 0.0% | 29 | long_tail |
| basisOfRecord | categorical | 0.0% | 2 | |
| year | numeric | 0.0% | 6 | |
| month | numeric | 0.2% | 12 | |
| coordinateUncertainty | numeric | 22.6% | 151 | null_rate, high_skew, outliers |
| gbifID | categorical | 0.0% | 610 | long_tail |
scientificName
categorical · label · long_tail
Taxonomic binomials with authorship, almost certainly biodiversity occurrence records keyed by Linnaean scientific name. The distribution is heavily concentrated: 157 distinct taxa across 610 rows, with Canella winterana alone claiming 28.5% (174 records) and a long tail flagged by the profiler. Notably, the names mix plants (Canella, Warburgia, Cinnamodendron, Pinellia) with butterflies (Hylephila, Ocybadistes, Urbanus), so this column spans multiple kingdoms rather than a single clade.
Treatment: Group rare taxa into an 'other' bucket or join to a taxonomy table before using as a categorical feature.
- n: 610
- nulls: 0 (0.0%)
- unique: 157
- top_value: Canella winterana (L.) Gaertn.
- top_rate: 0.2852
- cardinality: 157
- entropy: 5.517
- entropy_ratio: 0.7563
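The 'other' bucket treatment suggested for this column could look roughly like the sketch below. The function name `bucket_rare`, the `min_count` threshold of 5, and the toy counts are all assumptions for illustration; they are not taken from the profiler or the dataset.

```python
from collections import Counter

def bucket_rare(values, min_count=5, other_label="other"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy data: the names appear in the report, the counts are invented.
names = (
    ["Canella winterana"] * 6
    + ["Warburgia salutaris"] * 5
    + ["Hylephila phyleus"] * 2
)
bucketed = bucket_rare(names)
print(Counter(bucketed))
```

The threshold is a judgment call: a higher `min_count` gives fewer, denser categories at the cost of hiding more of the long tail.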
species
categorical · label
Categorical taxonomic labels, mostly Linnaean binomials (e.g. Canella winterana, Warburgia salutaris) with a few family-level names mixed in (Droseraceae, Sarraceniaceae), suggesting inconsistent taxonomic granularity. One species, Canella winterana, dominates at 28.5% of 610 rows, yet 123 distinct values and an entropy ratio of 0.74 indicate a long tail. The mix of plant genera (Cinnamodendron, Cinnamosma) and butterfly/skipper species (Hylephila phyleus, Ocybadistes walkeri, Urbanus dorantes) is unusual for a single 'species' column.
Treatment: Normalise to a consistent taxonomic rank before grouping; consider collapsing rare classes or target-encoding given the 123-way cardinality.
- n: 610
- nulls: 0 (0.0%)
- unique: 123
- top_value: Canella winterana
- top_rate: 0.2852
- cardinality: 123
- entropy: 5.144
- entropy_ratio: 0.7409
genus
categorical · feature
Categorical genus name with 94 distinct values across 610 rows and no nulls. The distribution is heavy-tailed: 'Canella' alone accounts for 28.5% (174 records), and the top four values appear to be plant genera (Canella, Warburgia, Cinnamosma, Cinnamodendron) while subsequent entries (Urbanus, Hylephila, Burnsius, Pyrgus) are butterfly/skipper genera, suggesting the column mixes taxa from different kingdoms. Entropy ratio of 0.74 reflects moderate concentration around the dominant genus.
Treatment: Group rare genera into an 'other' bucket and one-hot or target-encode before modelling.
- n: 610
- nulls: 0 (0.0%)
- unique: 94
- top_value: Canella
- top_rate: 0.2852
- cardinality: 94
- entropy: 4.84
- entropy_ratio: 0.7384
family
categorical · label
Categorical column holding taxonomic family labels across 610 rows with only 3 distinct values and no nulls. The distribution is essentially bimodal: Hesperiidae and Canellaceae each appear 300 times (top_rate 0.492) while Araceae appears just 10 times. It notably mixes an animal family (Hesperiidae, skipper butterflies) with two plant families, which is an unusual cross-kingdom blend.
Treatment: One-hot encode; consider merging or stratifying given the rare Araceae class (10/610).
- n: 610
- nulls: 0 (0.0%)
- unique: 3
- top_value: Hesperiidae
- top_rate: 0.4918
- cardinality: 3
- entropy: 1.104
- entropy_ratio: 0.6967
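The entropy and entropy_ratio figures can be reproduced from the three family counts. The sketch below assumes the profiler reports Shannon entropy in bits and divides by log2(cardinality) to get the ratio; the report does not state this, but the assumption matches the numbers listed for this column.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Counts from the family table above.
family_counts = {"Hesperiidae": 300, "Canellaceae": 300, "Araceae": 10}

h = entropy_bits(list(family_counts.values()))
# Assumed normalisation: divide by the maximum entropy for this cardinality.
ratio = h / math.log2(len(family_counts))
print(round(h, 3), round(ratio, 4))
```

A ratio of 1 would mean all categories are equally frequent; the small Araceae class is what pulls this column below that.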
latitude
numeric · feature
This column holds geographic latitudes in decimal degrees, ranging from -43.245933 to 46.704735 with a median of 17.008014. The wide IQR of 47.748 and bimodal-leaning kurtosis of -1.28 suggest observations are spread across both hemispheres rather than clustered in one region. With 466 unique values across 610 rows and no nulls or outliers, coverage is clean but globally dispersed.
Treatment: Pair with longitude for geospatial features; avoid treating as a plain scalar in models.
- n: 610
- nulls: 0 (0.0%)
- unique: 466
- min: -43.25
- max: 46.7
- mean: 5.2
- median: 17.01
- std: 22.75
- q1: -22.92
- q3: 24.83
- iqr: 47.75
- skew: -0.6517
- kurtosis: -1.283
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
longitude
numeric · feature
Geographic longitude in decimal degrees, spanning -115.04 to 153.39 across 610 rows with no nulls and 467 unique values. The distribution is right-skewed (1.18) with a median of -63.06 sitting well below the mean of -32.94, suggesting a concentration of points in the Western Hemisphere with a long tail reaching into the Eastern Hemisphere. No outliers flagged, consistent with valid lon bounds.
Treatment: Pair with latitude for geospatial features; avoid treating as a standalone scalar in linear models.
- n: 610
- nulls: 0 (0.0%)
- unique: 467
- min: -115
- max: 153.4
- mean: -32.94
- median: -63.06
- std: 78.93
- q1: -89.37
- q3: 30.84
- iqr: 120.2
- skew: 1.184
- kurtosis: 0.0844
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
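One common way to pair latitude and longitude into a joint feature is to map each point onto the unit sphere. This avoids the artificial jump at ±180° longitude. The sketch below is one possible approach rather than something the profiler prescribes, and `latlon_to_unit_xyz` is a hypothetical helper name.

```python
import math

def latlon_to_unit_xyz(lat_deg, lon_deg):
    """Map decimal-degree coordinates to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

# Two illustrative points either side of the antimeridian: far apart as raw
# longitudes (179.9 vs -179.9) but close together on the sphere.
east = latlon_to_unit_xyz(10.0, 179.9)
west = latlon_to_unit_xyz(10.0, -179.9)
print(math.dist(east, west))
```

The three resulting components can be fed to a model together; distance-based methods may prefer a great-circle distance instead.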
country
categorical · feature
Country of origin or observation, with 35 distinct values across 610 complete rows. The distribution is moderately concentrated: United States of America leads at 21.3% (130 rows), followed by Mexico (73) and Brazil (51), and the entropy ratio of 0.77 indicates a fairly diverse but US-tilted mix. Notable is the prominence of small territories like Guadeloupe (48) and Puerto Rico (37) ranking above larger nations, suggesting a tropical/Americas sampling bias rather than a global population sample.
Treatment: One-hot encode top values and bucket the long tail into 'Other' before modelling.
- n: 610
- nulls: 0 (0.0%)
- unique: 35
- top_value: United States of America
- top_rate: 0.2131
- cardinality: 35
- entropy: 3.961
- entropy_ratio: 0.7722
stateProvince
categorical · feature
Holds state or province names for 610 records spanning 108 distinct values across multiple countries (Texas, Florida, Nayarit, Queensland, KwaZulu-Natal). The mix is uneven: Texas alone covers 13.3% of rows, and the categories blend US states, Mexican states, Brazilian states, and a French city ('Pointe-à-Pitre'), suggesting inconsistent administrative granularity. 30 rows carry an empty-string value that null_rate=0 does not flag, and an explicit 'Other' bucket appears 11 times.
Treatment: Normalise empty strings to null and group rare levels before one-hot or target encoding.
- n: 610
- nulls: 0 (0.0%)
- unique: 108
- top_value: Texas
- top_rate: 0.1328
- cardinality: 108
- entropy: 5.53
- entropy_ratio: 0.8187
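Normalising empty strings to null, as suggested above, could be sketched as follows. The helper name `blank_to_none` is hypothetical, and treating whitespace-only strings as missing is an extra assumption not stated in the report.

```python
def blank_to_none(value):
    """Return None for missing, empty, or whitespace-only strings."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped if stripped else None

# Illustrative values only; "Texas" and "Nayarit" appear in the report.
raw = ["Texas", "", "  ", "Nayarit", None]
cleaned = [blank_to_none(v) for v in raw]
n_missing = sum(v is None for v in cleaned)
print(cleaned, n_missing)
```

Recomputing the null rate after this step gives a more honest picture of coverage than the reported 0.0%.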
locality
categorical · free_text · long_tail
Free-text locality descriptions for specimen records, mostly in French with Malagasy place names (districts, communes, fokontany in Madagascar). 563 of 610 rows (top_rate 0.923) are empty strings, so the field is effectively blank for the vast majority of records. Because the empty string counts as one of the 29 unique values, the remaining 47 rows hold 28 distinct non-empty values, which are long sentence-length descriptions rather than controlled vocabulary. Entropy ratio of 0.154 confirms the distribution is dominated by the empty value.
Treatment: Treat empty string as missing and parse remaining entries with NER or regex to extract administrative units before use.
- n: 610
- nulls: 0 (0.0%)
- unique: 29
- top_value: (empty string)
- top_rate: 0.923
- cardinality: 29
- entropy: 0.7475
- entropy_ratio: 0.1539
basisOfRecord
categorical · metadata
Categorical provenance flag from a biodiversity occurrence record (GBIF-style basisOfRecord), with only two values present out of the wider controlled vocabulary. HUMAN_OBSERVATION dominates at 550/610 (90.2%), with PRESERVED_SPECIMEN making up the remaining 60; no nulls. Entropy ratio 0.46 confirms the heavy imbalance.
Treatment: Keep as a binary indicator (e.g., is_specimen) for stratification or filtering.
- n: 610
- nulls: 0 (0.0%)
- unique: 2
- top_value: HUMAN_OBSERVATION
- top_rate: 0.9016
- cardinality: 2
- entropy: 0.4638
- entropy_ratio: 0.4638
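The binary indicator mentioned in the treatment might be derived as below. This is a minimal sketch: `is_specimen` follows the name suggested above, and the short input list is illustrative rather than real data.

```python
def is_specimen(basis_of_record):
    """1 for preserved specimens, 0 for any other basisOfRecord value."""
    return 1 if basis_of_record == "PRESERVED_SPECIMEN" else 0

# Illustrative rows using the two values present in this dataset.
rows = ["HUMAN_OBSERVATION", "PRESERVED_SPECIMEN", "HUMAN_OBSERVATION"]
flags = [is_specimen(r) for r in rows]
print(flags)
```

With only 60 specimens out of 610 records, stratifying on this flag keeps the minority class from vanishing in small splits.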
year
numeric · timestamp
Calendar year of the record, spanning only 2021 to 2026 across 610 rows with 6 distinct values. The distribution is left-skewed (skew -0.80) and concentrated at the recent end: median and Q3 both sit at 2026, with Q1 at 2024.
Treatment: Treat as an ordinal time bucket; consider one-hot or year-since-min rather than raw integer.
- n: 610
- nulls: 0 (0.0%)
- unique: 6
- min: 2021
- max: 2026
- mean: 2025
- median: 2026
- std: 1.503
- q1: 2024
- q3: 2026
- iqr: 2
- skew: -0.7969
- kurtosis: -0.7929
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
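The year-since-min option could be as simple as the sketch below. The helper name `years_since_min` and the sample values are assumptions for illustration.

```python
def years_since_min(years):
    """Offset each year by the earliest year in the list."""
    base = min(years)
    return [y - base for y in years]

# Illustrative values inside the reported 2021-2026 range.
sample_years = [2021, 2024, 2026, 2026]
offsets = years_since_min(sample_years)
print(offsets)
```

In practice the base year should be fixed from the training data so that later batches are encoded on the same scale.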
month
numeric · feature
Integer values bounded between 1 and 12 with 12 unique levels strongly indicate a calendar month index. The distribution is heavily front-loaded: the median is 1.0 and Q3 is only 7.0, so at least half the rows fall in January and the skew of 1.00 confirms a long tail toward year-end months. Nulls are negligible (0.16%) and no outliers are flagged.
Treatment: Treat as a cyclical categorical (one-hot or sin/cos encode) rather than a raw numeric.
- n: 610
- nulls: 1 (0.2%)
- unique: 12
- min: 1
- max: 12
- mean: 3.752
- median: 1
- std: 3.75
- q1: 1
- q3: 7
- iqr: 6
- skew: 1.002
- kurtosis: -0.5078
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
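The sin/cos option places the months on a circle so that December sits next to January. The sketch below is one way to do it; `encode_month` is a hypothetical name, and passing the single null through as `None` is an assumption about how missing values should be handled.

```python
import math

def encode_month(month):
    """Return (sin, cos) for a month in 1..12, or None if it is missing."""
    if month is None:
        return None
    angle = 2 * math.pi * (month - 1) / 12
    return (math.sin(angle), math.cos(angle))

jan, feb, dec = encode_month(1), encode_month(2), encode_month(12)
# December and February are equally far from January on the circle.
print(math.dist(jan, dec), math.dist(jan, feb))
```

As a raw integer, December (12) would sit eleven steps away from January (1), which is the distortion this encoding removes.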
coordinateUncertainty
numeric · feature · null_rate · high_skew · outliers
Numeric coordinate uncertainty values, almost certainly meters of GPS/locality error attached to occurrence records. The distribution is severely right-skewed (skew 17.3, kurtosis 335.7): the median is 35 but the mean is 6463 and the max reaches 766917, with 19.3% of non-null values (91 of 472) flagged as outliers. Roughly 22.6% of rows are null, so coverage is partial.
Treatment: Log-transform and impute missing values before using as a quality filter or feature.
- n: 610
- nulls: 138 (22.6%)
- unique: 151
- min: 1
- max: 766,917
- mean: 6463
- median: 35
- std: 3.814e+04
- q1: 5
- q3: 466.8
- iqr: 461.8
- skew: 17.3
- kurtosis: 335.7
- n_outliers: 91
- outlier_rate: 0.1928
- zero_rate: 0
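A minimal sketch of the log-transform and filtering steps follows. It assumes the values are in meters, uses a base-10 log (safe here because the reported minimum is 1), and picks 1,000 m as a purely illustrative cutoff. The function names and the `keep_missing` switch are hypothetical, and imputation is left out.

```python
import math

def log_uncertainty(meters):
    """log10 of the uncertainty; None stays None."""
    return None if meters is None else math.log10(meters)

def within_uncertainty(meters, max_m=1000.0, keep_missing=False):
    """True if a record passes the spatial-quality filter."""
    if meters is None:
        return keep_missing
    return meters <= max_m

# Illustrative values: the reported min, median and max, plus one null.
values = [1, 35, 766917, None]
logged = [log_uncertainty(v) for v in values]
kept = [v for v in values if within_uncertainty(v)]
print(logged, kept)
```

Whether nulls are kept is a real decision for this dataset: dropping them discards 138 of 610 rows.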
gbifID
categorical · identifier · long_tail
This is the GBIF occurrence identifier: every one of the 610 rows carries a unique numeric ID (n_unique=610, top_rate=0.0016, entropy_ratio≈1.0) with no nulls. The top values cluster tightly in the 5937748304–5937748333 range, suggesting the records were ingested in a single contiguous GBIF batch rather than sampled across time.
Treatment: Keep as a primary key for joins back to GBIF; drop from any model features.
- n: 610
- nulls: 0 (0.0%)
- unique: 610
- top_value: 5937748304
- top_rate: 0.001639
- cardinality: 610
- entropy: 9.253
- entropy_ratio: 1