data trove caves worldwide
Reading
This dataset is a global registry of 69,716 cave entries, likely sourced from OpenStreetMap, containing geographic coordinates, names, access rules, and optional metadata such as depth, length, and tourism classification. The most striking issue is extreme sparsity: most records have empty descriptions (93%), websites (96%), Wikipedia links (97%), depth (99.6%), and length (99.1%), so most caves are little more than a name and a pin on a map. About 28% of all records (19,527) carry the placeholder name 'Unnamed Cave', a completeness problem worth investigating before any analysis. Geographic coverage skews heavily toward Europe (median latitude ~44°N with an interquartile range of about 7°), while the 16% of longitudes flagged as outliers point to a global but uneven spread. Only about 10% of records carry an access tag; among them, the split between open ('yes'), closed ('no'), and 'private' is worth exploring for any public-access analysis.
citing: row_count · column_count · name.top_values · name.stats.n_duplicates · description.stats.n_empty · website.stats.n_empty · wikipedia.stats.n_empty · cave:depth.stats.top_rate · cave:length.stats.top_rate · lat.stats.median · lat.stats.iqr · lon.stats.outlier_rate · access.top_values · tourism.top_values
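The sparsity figures above can be rechecked with a small helper. This is a minimal sketch, assuming the registry has been loaded as a list of dictionaries (for example via `csv.DictReader`); the function name `empty_rates` and the three sample rows are hypothetical and not part of the report.

```python
from collections import Counter

def empty_rates(rows, columns):
    """Return {column: share of rows whose value is None or blank}."""
    blanks = Counter()
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                blanks[col] += 1
    total = len(rows)
    return {col: (blanks[col] / total if total else 0.0) for col in columns}

# Hypothetical three-row sample shaped like the registry (not real data).
sample = [
    {"name": "Unnamed Cave", "description": "", "website": ""},
    {"name": "Grotta Example", "description": "unterer Eingang", "website": ""},
    {"name": "Unnamed Cave", "description": "", "website": "https://example.org"},
]
rates = empty_rates(sample, ["name", "description", "website"])
```

Counting blank strings explicitly matters here because every column reports 0.0% nulls even though most optional fields are empty.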
Charts the summary said to look at first
access (value counts)
| value | count | share |
|---|---|---|
| (empty) | 62551 | 89.7% |
| yes | 2717 | 3.9% |
| no | 2266 | 3.3% |
| private | 815 | 1.2% |
| permit | 575 | 0.8% |
| permissive | 440 | 0.6% |
| customers | 271 | 0.4% |
| unknown | 47 | 0.1% |
| destination | 11 | 0.0% |
| restricted | 9 | 0.0% |
| tidal | 2 | 0.0% |
| request | 2 | 0.0% |
| key | 2 | 0.0% |
| discouraged | 2 | 0.0% |
| designated | 1 | 0.0% |
| official | 1 | 0.0% |
| forestry | 1 | 0.0% |
| agricultural | 1 | 0.0% |
| guided | 1 | 0.0% |
| university | 1 | 0.0% |
tourism (value counts)
| value | count | share |
|---|---|---|
| (empty) | 67776 | 97.2% |
| attraction | 1670 | 2.4% |
| viewpoint | 119 | 0.2% |
| yes | 105 | 0.2% |
| camp_site | 7 | 0.0% |
| museum | 6 | 0.0% |
| register | 6 | 0.0% |
| artwork | 6 | 0.0% |
| information | 5 | 0.0% |
| picnic_site | 4 | 0.0% |
| cave_entrance | 3 | 0.0% |
| caves | 2 | 0.0% |
| wilderness_hut | 2 | 0.0% |
| attraction;museum | 1 | 0.0% |
| guestbook | 1 | 0.0% |
| no | 1 | 0.0% |
| cave | 1 | 0.0% |
| hotel | 1 | 0.0% |
name (length in characters, histogram)
| chars | count |
|---|---|
| 1 – 5 | 1601 |
| 5 – 8 | 3457 |
| 8 – 12 | 5854 |
| 12 – 15 | 31600 |
| 15 – 19 | 9877 |
| 19 – 22 | 8424 |
| 22 – 26 | 3274 |
| 26 – 29 | 2614 |
| 29 – 33 | 1175 |
| 33 – 36 | 885 |
| 36 – 40 | 482 |
| 40 – 44 | 175 |
| 44 – 47 | 154 |
| 47 – 51 | 56 |
| 51 – 54 | 45 |
| 54 – 58 | 17 |
| 58 – 61 | 11 |
| 61 – 65 | 3 |
| 65 – 68 | 2 |
| 68 – 72 | 2 |
| 72 – 76 | 4 |
| 76 – 79 | 2 |
| 79 – 83 | 0 |
| 83 – 86 | 1 |
| 86 – 90 | 0 |
| 90 – 93 | 0 |
| 93 – 97 | 0 |
| 97 – 100 | 0 |
| 100 – 104 | 0 |
| 104 – 108 | 0 |
| 108 – 111 | 0 |
| 111 – 115 | 0 |
| 115 – 118 | 0 |
| 118 – 122 | 0 |
| 122 – 125 | 0 |
| 125 – 129 | 0 |
| 129 – 132 | 0 |
| 132 – 136 | 0 |
| 136 – 139 | 0 |
| 139 – 143 | 1 |
lat (histogram)
| bin | count |
|---|---|
| -77.98 – -74.07 | 4 |
| -74.07 – -70.17 | 0 |
| -70.17 – -66.26 | 3 |
| -66.26 – -62.36 | 0 |
| -62.36 – -58.46 | 1 |
| -58.46 – -54.55 | 2 |
| -54.55 – -50.65 | 13 |
| -50.65 – -46.74 | 26 |
| -46.74 – -42.84 | 60 |
| -42.84 – -38.94 | 127 |
| -38.94 – -35.03 | 329 |
| -35.03 – -31.13 | 330 |
| -31.13 – -27.23 | 225 |
| -27.23 – -23.32 | 257 |
| -23.32 – -19.42 | 301 |
| -19.42 – -15.51 | 204 |
| -15.51 – -11.61 | 78 |
| -11.61 – -7.705 | 139 |
| -7.705 – -3.801 | 314 |
| -3.801 – 0.1026 | 183 |
| 0.1026 – 4.007 | 118 |
| 4.007 – 7.911 | 358 |
| 7.911 – 11.81 | 287 |
| 11.81 – 15.72 | 385 |
| 15.72 – 19.62 | 1856 |
| 19.62 – 23.53 | 937 |
| 23.53 – 27.43 | 775 |
| 27.43 – 31.33 | 1082 |
| 31.33 – 35.24 | 1231 |
| 35.24 – 39.14 | 3928 |
| 39.14 – 43.05 | 12068 |
| 43.05 – 46.95 | 20377 |
| 46.95 – 50.85 | 19250 |
| 50.85 – 54.76 | 2492 |
| 54.76 – 58.66 | 1075 |
| 58.66 – 62.57 | 564 |
| 62.57 – 66.47 | 267 |
| 66.47 – 70.37 | 47 |
| 70.37 – 74.28 | 19 |
| 74.28 – 78.18 | 4 |
cave:length (top values)
| value | count | share |
|---|---|---|
| (empty) | 69074 | 99.1% |
| 5 | 32 | 0.0% |
| 6 | 26 | 0.0% |
| 10 | 25 | 0.0% |
| 3 | 23 | 0.0% |
| 4 | 23 | 0.0% |
| 7 | 20 | 0.0% |
| 8 | 19 | 0.0% |
| 15 | 16 | 0.0% |
| 20 | 14 | 0.0% |
| 12 | 13 | 0.0% |
| 30 | 13 | 0.0% |
| 2 | 11 | 0.0% |
| 11 | 9 | 0.0% |
| 60 | 8 | 0.0% |
| 4.5 | 8 | 0.0% |
| 13 | 8 | 0.0% |
| 16 | 8 | 0.0% |
| 17 | 8 | 0.0% |
| 25 | 8 | 0.0% |
Schema
12 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| id | numeric | 0.0% | 69,716 | |
| name | text | 0.0% | 45,229 | duplicates |
| lat | numeric | 0.0% | 69,544 | high_skew, outliers |
| lon | numeric | 0.0% | 69,585 | outliers |
| description | text | 0.0% | 3,705 | one_word, short_text, duplicates |
| access | categorical | 0.0% | 20 | |
| tourism | categorical | 0.0% | 18 | imbalance |
| wikipedia | text | 0.0% | 2,077 | one_word, short_text, duplicates |
| website | text | 0.0% | 2,492 | one_word, short_text, duplicates |
| cave:length | categorical | 0.0% | 238 | long_tail, imbalance |
| cave:depth | categorical | 0.0% | 124 | long_tail, imbalance |
| country | categorical | 0.0% | 16 | imbalance |
id
numeric identifier

This column is a numeric unique identifier: every one of the 69,716 rows has a distinct value with zero nulls, consistent with a primary-key role. The values span a wide range (12,788,558 to 13,536,013,182) with near-zero skew (0.0196) and a platykurtic, roughly uniform shape (kurtosis −1.18), consistent with IDs drawn from a large, sparsely populated keyspace rather than a dense auto-increment sequence. The IQR of ~6.3 billion against a full range of ~13.5 billion indicates values are spread broadly across the domain, and no outliers were detected. Treatment: Drop before modelling; use as row key for joins or deduplication only.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 69,716
- min: 1.279e+07
- max: 1.354e+10
- mean: 6.842e+09
- median: 7.007e+09
- std: 3.774e+09
- q1: 3.61e+09
- q3: 9.948e+09
- iqr: 6.338e+09
- skew: 0.01955
- kurtosis: -1.177
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
name
text label duplicates

This column contains the proper names of caves or other underground features, with multilingual entries spanning English, French, Italian, Spanish, German, and Russian. The most striking signal is that 'Unnamed Cave' appears 19,527 times, roughly 28% of all 69,716 rows. That placeholder accounts for most of the 35.1% duplicate rate (24,487 duplicates), even though 45,229 unique values exist. The vocabulary mix (grotta, grotte, cueva, cova, Грот) confirms international coverage, and the name is clearly a label, not a unique identifier. Treatment: Use as a display label only; do not treat as a unique key. Normalise language variants and 'Unnamed Cave' placeholders before any grouping or matching.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 45,229
- len_min: 1
- len_max: 143
- len_mean: 15.52
- len_median: 13
- len_p95: 29
- word_mean: 2.519
- word_median: 2
- n_empty: 0
- n_duplicates: 24,487
- duplicate_rate: 0.3512
- vocab_size: 15,219
- readability_flesch_mean: 55.51
- emoji_rate: 1.434e-05
- url_rate: 0
- one_word_rate: 0.1545
- allcaps_rate: 0.03242
- boilerplate_rate: 0
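One way to see how much of the duplicate count comes from the placeholder is to count duplicates with and without it. This is a minimal sketch: the helper `name_duplicate_stats`, the case-insensitive matching, and the sample names are assumptions for illustration. It computes `n_duplicates` as rows minus distinct values, which matches the reported 69,716 − 45,229 = 24,487.

```python
PLACEHOLDER = "Unnamed Cave"  # placeholder string reported in the profile

def name_duplicate_stats(names):
    """Count duplicate names with and without the placeholder."""
    n_duplicates = len(names) - len(set(names))
    # Assumption: treat case/whitespace variants of the placeholder as the same.
    real = [n for n in names if n.strip().casefold() != PLACEHOLDER.casefold()]
    return {
        "n_duplicates": n_duplicates,
        "n_placeholder": len(names) - len(real),
        "n_duplicates_excluding_placeholder": len(real) - len(set(real)),
    }

# Hypothetical sample names (not taken from the dataset).
names = ["Unnamed Cave", "Unnamed Cave", "Cueva A", "Cueva A", "Grotte B", "unnamed cave"]
stats = name_duplicate_stats(names)
```

Running the same comparison on the full column would show how many genuine name collisions remain once placeholders are set aside.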
lat
numeric feature high_skew outliers

This column is a geographic latitude, spanning from -77.98° (Antarctic latitudes) to 78.18° (high Arctic). The bulk of records cluster between ~40.5° and ~47.75° (IQR of 7.26°), a band that on latitude alone fits either southern Europe or North America; the longitude quartiles (about 1.2° to 18.2°E) point to Europe. The strong negative skew (-3.16) and high kurtosis (11.50) indicate a heavy left tail, and 8,996 rows (12.9%) are flagged as outliers, likely locations far outside the core geographic cluster. With 69,544 unique values across 69,716 rows and zero nulls, near-uniqueness suggests these are precise point coordinates rather than binned regions. Treatment: Retain as-is for geospatial modelling; check whether the 8,996 outlier rows are data-quality issues or legitimate geographic subpopulations before regression or clustering.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 69,544
- min: -77.98
- max: 78.18
- mean: 40.58
- median: 44.14
- std: 15.48
- q1: 40.49
- q3: 47.75
- iqr: 7.256
- skew: -3.161
- kurtosis: 11.5
- n_outliers: 8,996
- outlier_rate: 0.129
- zero_rate: 0
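The report does not say how outliers are flagged. A common convention is Tukey's rule, which marks values more than 1.5 × IQR beyond the quartiles; the sketch below applies that rule to the reported quartiles purely as an assumption. The function names are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, fences):
    """True when value lies outside the (low, high) fences."""
    low, high = fences
    return value < low or value > high

# Quartiles as reported for the lat column.
fences = tukey_fences(40.49, 47.75)
```

Under this assumed rule the fences fall near 29.6° and 58.6°, so a cave in the southern hemisphere or the far north would be flagged even if its coordinates are perfectly valid.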
lon
numeric feature outliers

This column contains geographic longitude values, ranging from -178.0 to 178.8 degrees, consistent with a global coordinate field. Nearly all 69,716 rows are unique (69,585 distinct values) with zero nulls, indicating high-precision GPS or geocoded data. The distribution is heavily concentrated in a narrow band (IQR of ~17 degrees, Q1=1.24, Q3=18.18), suggesting the bulk of records cluster around Europe/Africa longitudes, while 16.2% of values (11,302 rows) are flagged as outliers, likely legitimate points in the Americas, East Asia, or Oceania that fall far from this central cluster. The kurtosis (4.51) and the large std (40.5) relative to the IQR are consistent with this heavy-tailed, likely multimodal geographic spread. Treatment: Retain as-is for geospatial modelling; consider pairing with latitude and clustering by region to handle the multimodal distribution before feeding into distance-based models.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 69,585
- min: -178
- max: 178.8
- mean: 12.03
- median: 11.38
- std: 40.5
- q1: 1.245
- q3: 18.18
- iqr: 16.94
- skew: 0.2755
- kurtosis: 4.509
- n_outliers: 11,302
- outlier_rate: 0.1621
- zero_rate: 0
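For distance-based models, treating raw degrees as planar coordinates distorts distances, especially across the antimeridian and at high latitudes. One possible approach, sketched here as an assumption and not as the report's method, is the haversine great-circle distance on a spherical Earth. The radius constant and function name are illustrative choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude at the equator versus at 60 degrees north.
d_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
d_60n = haversine_km(60.0, 0.0, 60.0, 1.0)
```

The same one-degree step covers roughly half the distance at 60°N, which is why latitude and longitude should be used together and not as independent numeric features.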
description
text free_text one_word short_text duplicates

This column is a short free-text description field for what appears to be a largely European cave/karst cadastral source, with entries in German, French, Spanish, and English. It is overwhelmingly empty: 65,189 of 69,716 rows (93.5%) are blank strings, making it near-useless as a feature in its current form. The duplicate rate is 94.7% (66,011 duplicates), driven almost entirely by those empty strings. The length statistics are computed over all rows, blanks included, which is why the median length is 0 and p95 is only 19 characters; the non-empty values appear to be mostly single words or brief directional/categorical labels such as 'unterer Eingang' or 'nicht katasterwürdig'. The modest cardinality (3,705 unique values drawn from a 4,651-word vocabulary) suggests an optional annotation field rather than a structured attribute. Treatment: Exclude empty strings, then optionally use as a categorical label or tokenize non-empty values for NLP; do not use as a predictive feature without a strategy for the 93.5% blanks.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 3,705
- len_min: 0
- len_max: 255
- len_mean: 3.462
- len_median: 0
- len_p95: 19
- word_mean: 1.455
- word_median: 1
- n_empty: 65,189
- n_duplicates: 66,011
- duplicate_rate: 0.9469
- vocab_size: 4,651
- readability_flesch_mean: 1.752
- emoji_rate: 0
- url_rate: 0.0006311
- one_word_rate: 0.9428
- allcaps_rate: 0.00241
- boilerplate_rate: 1.434e-05
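Because blanks dominate, length statistics taken over all rows say little about the populated values. A minimal sketch of recomputing the median over non-empty entries only follows; the helper name and sample values are hypothetical, with the two non-empty strings borrowed from the examples quoted above.

```python
import statistics

def length_medians(values):
    """Median character length over all rows and over non-empty rows only."""
    all_lengths = [len(v) for v in values]
    non_empty = [len(v) for v in values if v.strip()]
    return {
        "median_all": statistics.median(all_lengths),
        "median_non_empty": statistics.median(non_empty) if non_empty else None,
        "n_non_empty": len(non_empty),
    }

# Mostly blank sample, mirroring the column's shape at a tiny scale.
values = ["", "", "", "unterer Eingang", "nicht katasterwürdig"]
result = length_medians(values)
```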
access
categorical feature

This column is an OpenStreetMap-style 'access' tag, here describing who may enter a cave, with values like 'yes', 'no', 'private', 'permit', 'permissive', 'customers', and 'destination'. The dominant signal is that 89.7% of rows (62,551 of 69,716) carry an empty string rather than a proper null. This is a data quality concern: missing access information is stored as blank text, so the reported null rate of 0.0% is misleading. The blank counts as one of the 20 distinct values, leaving 19 real categories, and the low entropy (0.707, entropy ratio 0.16) confirms the distribution is heavily concentrated. The blank-dominance will mislead cardinality and frequency analyses unless blanks are recoded as missing. Treatment: Recode empty strings to NaN, then one-hot or ordinal encode the remaining access-permission categories.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 20
- top_value: (empty string)
- top_rate: 0.8972
- cardinality: 20
- entropy: 0.7067
- entropy_ratio: 0.1635
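The recommended treatment can be sketched in plain Python. This is an illustrative version under stated assumptions: `recode_blank` and `one_hot` are hypothetical helpers, `None` stands in for NaN, and rows with a missing tag are encoded as all zeros.

```python
def recode_blank(value):
    """Map None, empty, or whitespace-only strings to None (missing)."""
    if value is None or value.strip() == "":
        return None
    return value.strip()

def one_hot(values):
    """One-hot encode non-missing values; missing rows become all zeros."""
    cleaned = [recode_blank(v) for v in values]
    categories = sorted({v for v in cleaned if v is not None})
    rows = [[1 if v == c else 0 for c in categories] for v in cleaned]
    return categories, rows

# Hypothetical sample of access values, including blanks.
categories, encoded = one_hot(["", "yes", "no", "private", "", "yes"])
```

With 19 real categories and several that occur only once or twice, grouping rare values into an "other" bucket before encoding may be worth considering.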
tourism
categorical feature imbalance

This column is an OpenStreetMap-style 'tourism' tag classifying geographic features by their tourism type (attraction, viewpoint, camp_site, museum, etc.). It is severely imbalanced: 97.2% of the 69,716 rows (67,776) carry an empty string, meaning the tourism tag is absent, leaving only 1,940 rows with meaningful values. An entropy ratio of 0.050 confirms near-zero informational content across the full dataset. The blank counts as one of the 18 distinct values, leaving 17 non-empty categories. Most are recognisable OSM tourism values, although a few ('cave', 'caves', 'cave_entrance', 'yes', 'no') look like free-form or misplaced tags; in either case the column is sparse by design rather than through a data quality failure. Treatment: Treat empty string as missing/absent tag; consider binarising (tourism present vs. absent) or one-hot encoding the non-empty subset, given extreme sparsity.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 18
- top_value: (empty string)
- top_rate: 0.9722
- cardinality: 18
- entropy: 0.2076
- entropy_ratio: 0.04979
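The report does not state how entropy and entropy_ratio are computed. Shannon entropy in bits, divided by log2 of the cardinality, appears to reproduce the reported figures from the tourism value counts; the sketch below works under that assumption, and the function names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Value counts from the tourism table, blank category first.
tourism_counts = [67776, 1670, 119, 105, 7, 6, 6, 6, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1]
h = entropy_bits(tourism_counts)
r = entropy_ratio(tourism_counts)
```

Note that the blank string is counted as a category here, as it is in the reported cardinality of 18; dropping it would change both numbers.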
wikipedia
text metadata one_word short_text duplicates

This column stores Wikipedia article links in a 'language-code:article-title' format, used to cross-reference cave or geological features to their Wikipedia pages across multiple languages (German, French, Spanish, Italian, Catalan, Polish, etc.). The dominant signal is sparsity: 67,531 of 69,716 rows (96.9%) are empty strings, which drives the 97.0% duplicate rate. The 2,185 non-empty rows hold short, mostly single-token values (one_word_rate 0.976) covering 940 unique vocabulary terms across several language prefixes; the len_mean of 0.636 is averaged over all rows, blanks included. Treatment: Filter empty strings before use; parse language prefix and article slug as separate fields for any multilingual Wikipedia lookup or join.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 2,077
- len_min: 0
- len_max: 114
- len_mean: 0.636
- len_median: 0
- len_p95: 0
- word_mean: 1.044
- word_median: 1
- n_empty: 67,531
- n_duplicates: 67,639
- duplicate_rate: 0.9702
- vocab_size: 940
- readability_flesch_mean: -0.02485
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9756
- allcaps_rate: 0
- boilerplate_rate: 0
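Splitting the tag into its two parts could look like the following. This is a minimal sketch assuming the 'language-code:article-title' format described above; `parse_wikipedia_tag` and the example values are hypothetical, and values without a prefix are treated as unparseable and left unguessed.

```python
def parse_wikipedia_tag(value):
    """Split 'lang:Title' into (lang, title); None for blanks or missing prefix."""
    value = value.strip()
    if not value:
        return None
    # Split on the first colon only, since article titles may contain colons.
    lang, sep, title = value.partition(":")
    if not sep or not lang.strip() or not title.strip():
        return None
    return lang.strip().lower(), title.strip()

# Hypothetical example values (not taken from the dataset).
parsed = [parse_wikipedia_tag(v) for v in ["de:Eisriesenwelt", "", "NoPrefix"]]
```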
website
text metadata one_word short_text duplicates

This column contains optional website/URL references associated with dataset records (likely natural features, caves, or protected sites given the UNESCO, speleology, and nature-registry URLs visible in top values). The dominant signal is near-total sparsity: 67,082 of 69,716 rows (96.2%) are empty strings, yielding a duplicate rate of 96.4% and a len_median of 0.0. Even the most frequent URLs appear only 2 to 19 times, suggesting these are editorial citations rather than structured foreign keys. The url_rate of 3.8% is measured over all rows and matches the share of non-empty rows (2,634 of 69,716, or 3.8%), so nearly every populated value is recognised as a URL. Treatment: Exclude from modelling; optionally binarise into a 'has_website' flag or extract domain as a sparse categorical feature.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 2,492
- len_min: 0
- len_max: 255
- len_mean: 2.789
- len_median: 0
- len_p95: 0
- word_mean: 1
- word_median: 1
- n_empty: 67,082
- n_duplicates: 67,224
- duplicate_rate: 0.9643
- vocab_size: 737
- readability_flesch_mean: -24.04
- emoji_rate: 0
- url_rate: 0.03775
- one_word_rate: 1
- allcaps_rate: 1.434e-05
- boilerplate_rate: 0
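The two suggested derived features, a presence flag and a domain, can be sketched with the standard library. The helper name `website_features`, the `www.` stripping, and the example URLs are assumptions for illustration; malformed values may still need their own handling.

```python
from urllib.parse import urlsplit

def website_features(value):
    """Return a has_website flag and a normalised domain for one website value."""
    value = value.strip()
    if not value:
        return {"has_website": 0, "domain": None}
    # urlsplit only fills the host when a scheme or leading '//' is present.
    parts = urlsplit(value if "//" in value else "//" + value)
    domain = parts.hostname  # lower-cased by urlsplit, None if absent
    if domain and domain.startswith("www."):
        domain = domain[4:]
    return {"has_website": 1, "domain": domain}

# Hypothetical example values (not taken from the dataset).
features = [website_features(v) for v in ["", "https://www.example.org/caves/1", "example.org/path"]]
```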
cave:length
categorical feature long_tail imbalance

This column represents a cave length measurement, stored as a categorical/string type rather than numeric. The dominant signal is that 99.08% of 69,716 rows (69,074) carry an empty string, meaning cave length is recorded for only 642 rows across 237 distinct non-empty values (238 including the blank). The non-empty values appear to be numeric strings (e.g., '5', '6', '10', '20'), suggesting this field is sparsely populated and was likely intended as a numeric attribute. Treatment: Treat empty strings as missing and cast remaining values to numeric. Expect ~99% missingness, so impute or use as a sparse indicator feature only.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 238
- top_value: (empty string)
- top_rate: 0.9908
- cardinality: 238
- entropy: 0.1392
- entropy_ratio: 0.01763
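Casting the populated values to numbers while keeping blanks as missing might look like this. It is a minimal sketch: `to_float_or_none` is a hypothetical helper, and the choice to return `None` for anything that does not parse (for example a value with a unit suffix) is an assumption, so such rows can be reviewed rather than silently coerced.

```python
def to_float_or_none(value):
    """Parse a numeric string; return None for blanks or unparseable values."""
    value = value.strip()
    if not value:
        return None
    try:
        return float(value)
    except ValueError:
        return None  # left missing for manual review

# Hypothetical raw values; "12 m" illustrates a string that fails to parse.
raw = ["", "5", "4.5", "12 m"]
lengths = [to_float_or_none(v) for v in raw]
has_length = [int(x is not None) for x in lengths]
```

The `has_length` flag is the sparse indicator feature mentioned in the treatment; the same pattern would apply to cave:depth.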
cave:depth
categorical feature long_tail imbalance

This column represents the depth attribute of a cave feature, stored as a categorical field containing numeric depth values (in unspecified units). It is almost entirely empty: 69,419 of 69,716 rows (99.57%) are blank strings, leaving only 297 rows with any actual depth value across 123 distinct non-empty values. The extreme sparsity and near-zero entropy ratio (0.0092) signal this is a rarely-populated optional attribute, not a reliable analytical field in its current form. Treatment: Filter to non-empty rows before use; convert to numeric and consider imputation or indicator flag for missingness given 99.57% blank rate.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 124
- top_value: (empty string)
- top_rate: 0.9957
- cardinality: 124
- entropy: 0.06432
- entropy_ratio: 0.00925
country
categorical feature imbalance

This column is intended to capture a 2-letter ISO country code for each record. However, it is effectively empty: 69,684 of 69,716 rows (99.95%) contain a blank string rather than a real value, making the column nearly useless as a feature. Only 32 rows carry a real country code spread across 15 distinct values, with no single country dominating; the most common non-blank code is 'KY' with just 7 occurrences. The extreme imbalance (top_rate 0.9995) and near-zero entropy (0.0074) confirm this column carries virtually no information. Treatment: Drop or flag as data-quality issue; if retained, treat blank as missing and note that only 32 rows have usable country data.
- n: 69,716
- nulls: 0 (0.0%)
- unique: 16
- top_value: (empty string)
- top_rate: 0.9995
- cardinality: 16
- entropy: 0.00736
- entropy_ratio: 0.00184