data trove caves worldwide

source /home/coolhand/html/datavis/data_trove/data/quirky/caves.json · 69,716 rows · 12 columns · profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This dataset is a global registry of 69,716 cave entries, likely sourced from OpenStreetMap, containing geographic coordinates, names, access rules, and optional metadata such as depth, length, and tourism classification. The most striking issue is extreme sparsity: the vast majority of records have empty descriptions (93%), websites (96%), wikipedia links (97%), depth (99.6%), and length (99.1%), so most caves are little more than a name and a pin on a map. About 28% of all records (19,527) are simply named 'Unnamed Cave', a significant completeness problem worth investigating before any analysis. Geographic coverage skews heavily toward Europe: the latitude median is ~44°N with a tight interquartile range, while 16.2% of longitudes are flagged as outliers, which suggests a global but uneven spread. Among the roughly 10% of caves with access tags, the split between open ('yes'), closed ('no'), and 'private' is worth exploring for any public-access analysis.

citing: row_count · column_count · name.top_values · name.stats.n_duplicates · description.stats.n_empty · website.stats.n_empty · wikipedia.stats.n_empty · cave:depth.stats.top_rate · cave:length.stats.top_rate · lat.stats.median · lat.stats.iqr · lon.stats.outlier_rate · access.top_values · tourism.top_values
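The sparsity figures in the summary can be rechecked with a few lines of Python. This is a minimal sketch, assuming the source file is a JSON array of flat objects (one per cave) and that missing values are stored as empty strings, as the column details report. The helper names `load_records` and `empty_rates` and the sample rows are made up for illustration.

```python
import json
from collections import Counter


def load_records(path):
    # Assumption: the file is a JSON array of flat objects, one per cave.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def empty_rates(records):
    """Return {column: share of rows whose value is an empty string}."""
    n = len(records)
    if n == 0:
        return {}
    empties = Counter()
    for row in records:
        for column, value in row.items():
            # Adding 0 keeps fully populated columns in the result.
            empties[column] += 1 if value == "" else 0
    return {column: count / n for column, count in empties.items()}


# Illustrative rows only; they are not taken from the dataset.
sample = [
    {"name": "Unnamed Cave", "access": "", "cave:depth": ""},
    {"name": "Grotta Esempio", "access": "yes", "cave:depth": ""},
    {"name": "Cueva Ejemplo", "access": "", "cave:depth": "12"},
    {"name": "Unnamed Cave", "access": "no", "cave:depth": ""},
]

for column, rate in empty_rates(sample).items():
    print(f"{column:<12}{rate:6.1%}")
```

On the real file, `empty_rates(load_records(path))` should come out close to the percentages quoted above if the assumptions hold.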

Charts the summary said to look at first

access · Look at the breakdown between open, private, permit, and no-access caves among the ~10% of entries (7,165 rows) that carry an access tag.


Top values for access (20 unique shown, of 20 total).
value	count	share
(empty)	62551	89.7%
yes	2717	3.9%
no	2266	3.3%
private	815	1.2%
permit	575	0.8%
permissive	440	0.6%
customers	271	0.4%
unknown	47	0.1%
destination	11	0.0%
restricted	9	0.0%
tidal	2	0.0%
request	2	0.0%
key	2	0.0%
discouraged	2	0.0%
designated	1	0.0%
official	1	0.0%
forestry	1	0.0%
agricultural	1	0.0%
guided	1	0.0%
university	1	0.0%
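The shares in this table are relative to all 69,716 rows, so the blank row swamps everything else. A small sketch of how the breakdown among tagged entries only could be recomputed from the counts above (the function name `tagged_shares` is hypothetical):

```python
# Counts copied from the access table; "" is the blank (untagged) row.
access_counts = {
    "": 62551, "yes": 2717, "no": 2266, "private": 815, "permit": 575,
    "permissive": 440, "customers": 271, "unknown": 47, "destination": 11,
    "restricted": 9, "tidal": 2, "request": 2, "key": 2, "discouraged": 2,
    "designated": 1, "official": 1, "forestry": 1, "agricultural": 1,
    "guided": 1, "university": 1,
}


def tagged_shares(counts):
    """Drop the blank value and renormalise shares over tagged rows only."""
    tagged = {value: count for value, count in counts.items() if value != ""}
    total = sum(tagged.values())
    return total, {value: count / total for value, count in tagged.items()}


total, shares = tagged_shares(access_counts)
print(f"tagged rows: {total}")
for value, share in sorted(shares.items(), key=lambda item: -item[1])[:5]:
    print(f"{value:<12}{share:6.1%}")
```

Renormalising this way puts 'yes' and 'no' on a comparable footing instead of both rounding to a few percent of the full dataset.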

tourism · Nearly all tagged tourism values fall under 'attraction' or 'viewpoint' — this shows how few caves are formally designated for visitors.


Top values for tourism (18 unique shown, of 18 total).
value	count	share
(empty)	67776	97.2%
attraction	1670	2.4%
viewpoint	119	0.2%
yes	105	0.2%
camp_site	7	0.0%
museum	6	0.0%
register	6	0.0%
artwork	6	0.0%
information	5	0.0%
picnic_site	4	0.0%
cave_entrance	3	0.0%
caves	2	0.0%
wilderness_hut	2	0.0%
attraction;museum	1	0.0%
guestbook	1	0.0%
no	1	0.0%
cave	1	0.0%
hotel	1	0.0%

name · Check the distribution of name lengths to see how many entries are generic single-word labels versus descriptively named caves.


Character-length distribution for name (mean: 15.52).
chars	count
1 – 5	1601
5 – 8	3457
8 – 12	5854
12 – 15	31600
15 – 19	9877
19 – 22	8424
22 – 26	3274
26 – 29	2614
29 – 33	1175
33 – 36	885
36 – 40	482
40 – 44	175
44 – 47	154
47 – 51	56
51 – 54	45
54 – 58	17
58 – 61	11
61 – 65	3
65 – 68	2
68 – 72	2
72 – 76	4
76 – 79	2
79 – 83	0
83 – 86	1
86 – 90	0
90 – 93	0
93 – 97	0
97 – 100	0
100 – 104	0
104 – 108	0
108 – 111	0
111 – 115	0
115 – 118	0
118 – 122	0
122 – 125	0
125 – 129	0
129 – 132	0
132 – 136	0
136 – 139	0
139 – 143	1

lat · The latitude distribution reveals a strong concentration around 40–48°N (Europe), with a long tail of outliers across other continents.


Histogram bins for lat (median: 44.14308795).
bin	count
-77.98 – -74.07	4
-74.07 – -70.17	0
-70.17 – -66.26	3
-66.26 – -62.36	0
-62.36 – -58.46	1
-58.46 – -54.55	2
-54.55 – -50.65	13
-50.65 – -46.74	26
-46.74 – -42.84	60
-42.84 – -38.94	127
-38.94 – -35.03	329
-35.03 – -31.13	330
-31.13 – -27.23	225
-27.23 – -23.32	257
-23.32 – -19.42	301
-19.42 – -15.51	204
-15.51 – -11.61	78
-11.61 – -7.705	139
-7.705 – -3.801	314
-3.801 – 0.1026	183
0.1026 – 4.007	118
4.007 – 7.911	358
7.911 – 11.81	287
11.81 – 15.72	385
15.72 – 19.62	1856
19.62 – 23.53	937
23.53 – 27.43	775
27.43 – 31.33	1082
31.33 – 35.24	1231
35.24 – 39.14	3928
39.14 – 43.05	12068
43.05 – 46.95	20377
46.95 – 50.85	19250
50.85 – 54.76	2492
54.76 – 58.66	1075
58.66 – 62.57	564
62.57 – 66.47	267
66.47 – 70.37	47
70.37 – 74.28	19
74.28 – 78.18	4

cave:length · Among the small fraction of caves with a recorded length, see whether short caves (under 20m) dominate the measured entries.


Top values for cave:length (20 unique shown, of 238 total).
value	count	share
(empty)	69074	99.1%
5	32	0.0%
6	26	0.0%
10	25	0.0%
3	23	0.0%
4	23	0.0%
7	20	0.0%
8	19	0.0%
15	16	0.0%
20	14	0.0%
12	13	0.0%
30	13	0.0%
2	11	0.0%
11	9	0.0%
60	8	0.0%
4.5	8	0.0%
13	8	0.0%
16	8	0.0%
17	8	0.0%
25	8	0.0%

Schema

12 columns

Per-column summary.
column	type	nulls	unique	alerts
id	numeric	0.0%	69,716
name	text	0.0%	45,229	duplicates
lat	numeric	0.0%	69,544	high_skew outliers
lon	numeric	0.0%	69,585	outliers
description	text	0.0%	3,705	one_word short_text duplicates
access	categorical	0.0%	20
tourism	categorical	0.0%	18	imbalance
wikipedia	text	0.0%	2,077	one_word short_text duplicates
website	text	0.0%	2,492	one_word short_text duplicates
cave:length	categorical	0.0%	238	long_tail imbalance
cave:depth	categorical	0.0%	124	long_tail imbalance
country	categorical	0.0%	16	imbalance

id

numeric identifier

This column is a numeric unique identifier — every one of the 69,716 rows has a distinct value with zero nulls, confirming a true primary key role. The values span a wide range (12,788,558 to 13,536,013,182) with a near-zero skew (0.0196) and a slightly platykurtic distribution (kurtosis −1.18), consistent with IDs assigned from a large, sparsely populated keyspace rather than a simple auto-increment sequence. The IQR of ~6.3 billion against a full range of ~13.5 billion indicates values are spread broadly across the domain with no clustering or outliers detected. Treatment: Drop before modelling; use as row key for joins or deduplication only. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 69,716
min: 1.279e+07
max: 1.354e+10
mean: 6.842e+09
median: 7.007e+09
std: 3.774e+09
q1: 3.61e+09
q3: 9.948e+09
iqr: 6.338e+09
skew: 0.01955
kurtosis: -1.177
n_outliers: 0
outlier_rate: 0
zero_rate: 0

name

text label duplicates

This column contains the proper names of caves or underground features in a speleological dataset, with multilingual entries spanning English, French, Italian, Spanish, German, and Russian. The most striking signal is that 'Unnamed Cave' appears 19,527 times — roughly 28% of all 69,716 rows — driving a duplicate rate of 35.1% (24,487 duplicates) even though 45,229 unique values exist. The vocabulary mix (grotta, grotte, cueva, cova, Грот) confirms international coverage, and the name is clearly a label, not a unique identifier. Treatment: Use as a display label only; do not treat as a unique key — normalise language variants and 'Unnamed Cave' placeholders before any grouping or matching. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 45,229
len_min: 1
len_max: 143
len_mean: 15.52
len_median: 13
len_p95: 29
word_mean: 2.519
word_median: 2
n_empty: 0
n_duplicates: 24,487
duplicate_rate: 0.3512
vocab_size: 15,219
readability_flesch_mean: 55.51
emoji_rate: 1.434e-05
url_rate: 0
one_word_rate: 0.1545
allcaps_rate: 0.03242
boilerplate_rate: 0
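The duplicate figures above are consistent with counting every row beyond the first occurrence of each value (69,716 − 45,229 = 24,487). A minimal sketch of that calculation, plus a placeholder check, follows. That the profiler counts duplicates this way is an assumption, and `PLACEHOLDERS`, `duplicate_stats`, and `is_placeholder` are hypothetical names.

```python
# Hypothetical placeholder list; extend it after inspecting the real names.
PLACEHOLDERS = {"unnamed cave"}


def duplicate_stats(names):
    """Count rows beyond the first occurrence of each distinct value."""
    n = len(names)
    n_duplicates = n - len(set(names))
    rate = n_duplicates / n if n else 0.0
    return n_duplicates, rate


def is_placeholder(name):
    """True when the name is a generic stand-in rather than a real name."""
    return name.strip().casefold() in PLACEHOLDERS


# Illustrative names only; they are not taken from the dataset.
names = ["Unnamed Cave", "Grotta Esempio", "Unnamed Cave", "unnamed cave "]
print(duplicate_stats(names))
print(sum(is_placeholder(name) for name in names))
```

Filtering with `is_placeholder` before grouping keeps the 19,527 placeholder rows from collapsing into one apparent cave.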

lat

numeric feature high_skew outliers

This column is a geographic latitude, spanning from -77.98° (southern hemisphere) to 78.18° (high Arctic), with the middle half of records between ~40.5° and ~47.75° (IQR of 7.26°). Taken together with the longitude IQR of roughly 1.2° to 18.2°, this points to a predominantly European dataset. The strong negative skew (-3.16) and high kurtosis (11.50) indicate a heavy left tail: 8,996 rows (12.9%) are flagged as outliers, likely locations far outside the core geographic cluster. With 69,544 unique values across 69,716 rows and zero nulls, near-uniqueness suggests these are precise point coordinates rather than binned regions. Treatment: Retain as-is for geospatial modelling; investigate the 8,996 outlier rows for data quality issues or legitimate geographic subpopulations before regression or clustering. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 69,544
min: -77.98
max: 78.18
mean: 40.58
median: 44.14
std: 15.48
q1: 40.49
q3: 47.75
iqr: 7.256
skew: -3.161
kurtosis: 11.5
n_outliers: 8,996
outlier_rate: 0.129
zero_rate: 0
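The report does not say which rule produced n_outliers. A common choice is Tukey's fences (1.5 × IQR beyond the quartiles), and the sketch below applies it to the quartiles listed above. Treat that rule as an assumption rather than the profiler's documented method; `tukey_fences` and `is_outlier` are illustrative names.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (low, high) bounds; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, low, high):
    return value < low or value > high


# Quartiles taken from the lat statistics above.
lat_low, lat_high = tukey_fences(40.49, 47.75)
print(f"lat fences: {lat_low:.2f} to {lat_high:.2f}")

# The reported minimum falls outside the fences, the median does not.
print(is_outlier(-77.98, lat_low, lat_high))
print(is_outlier(44.14, lat_low, lat_high))
```

Under this rule, most caves in the tropics and the southern hemisphere are flagged, which fits the description of the outliers as legitimate locations far from the core cluster rather than errors.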

lon

numeric feature outliers

This column contains geographic longitude values, ranging from -178.0 to 178.8 degrees, consistent with a global coordinate field. Nearly all 69,716 rows are unique (69,585 distinct values) with zero nulls, indicating high-precision GPS or geocoded data. The distribution is heavily concentrated in a narrow band (IQR of ~17 degrees, Q1=1.24, Q3=18.18), suggesting the bulk of records cluster around European longitudes, while 16.2% of values (11,302 rows) are flagged as outliers, likely legitimate points in the Americas, East Asia, or Oceania that fall far from this central cluster. The elevated kurtosis (4.51) and large std (40.5) relative to the IQR are consistent with this heavy-tailed, multimodal geographic spread. Treatment: Retain as-is for geospatial modelling; consider pairing with latitude and clustering by region to handle the multimodal distribution before feeding into distance-based models. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 69,585
min: -178
max: 178.8
mean: 12.03
median: 11.38
std: 40.5
q1: 1.245
q3: 18.18
iqr: 16.94
skew: 0.2755
kurtosis: 4.509
n_outliers: 11,302
outlier_rate: 0.1621
zero_rate: 0
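One reason to pair longitude with latitude before distance-based modelling is that a degree of longitude covers less ground the further a point is from the equator. The haversine formula below is a standard way to get great-circle distances from coordinate pairs. It is offered as a sketch, not as something the report prescribes, and the 6371 km radius is the usual mean-Earth approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude at the equator versus at 60°N.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
print(round(haversine_km(60.0, 0.0, 60.0, 1.0), 1))
```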

description

text free_text one_word short_text duplicates

This column is a short free-text description field for what appears to be a cave/karst cadastral dataset, with entries in German, French, Spanish, and English. It is overwhelmingly empty: 65,189 of 69,716 rows (93.5%) are blank strings, making it near-useless as a feature in its current form. The duplicate rate is 94.7% (66,011 duplicates), driven almost entirely by those empty strings. The length statistics are computed over all rows, blanks included, which is why the median length is 0 and p95 is only 19 characters; the non-empty values are mostly brief directional or categorical labels like 'unterer Eingang' or 'nicht katasterwürdig'. The 3,705 unique values and 4,651-word vocabulary suggest this is an optional annotation field rather than a structured attribute. Treatment: Exclude empty strings, then optionally use as a categorical label or tokenize non-empty values for NLP; do not use as a predictive feature without a strategy for handling the 93.5% blanks. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 3,705
len_min: 0
len_max: 255
len_mean: 3.462
len_median: 0
len_p95: 19
word_mean: 1.455
word_median: 1
n_empty: 65,189
n_duplicates: 66,011
duplicate_rate: 0.9469
vocab_size: 4,651
readability_flesch_mean: 1.752
emoji_rate: 0
url_rate: 0.0006311
one_word_rate: 0.9428
allcaps_rate: 0.00241
boilerplate_rate: 1.434e-05

access

categorical feature

This column is an OpenStreetMap-style 'access' tag describing who may enter a feature (here, a cave), with values like 'yes', 'no', 'private', 'permit', 'permissive', 'customers', and 'destination'. The dominant signal is that 89.7% of rows (62,551 of 69,716) carry an empty string rather than a proper null, a data quality concern because missing access information is stored as blank text rather than NULL. The blank counts as one of the 20 distinct values, leaving 19 real categories across 7,165 tagged rows. Entropy over all 20 values is low (0.707), confirming the distribution is heavily concentrated. The blank-dominance will mislead cardinality and frequency analyses unless blanks are recoded as missing. Treatment: Recode empty strings to NaN, then one-hot or ordinal encode the remaining access-permission categories. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 20
top_value: (empty string)
top_rate: 0.8972
cardinality: 20
entropy: 0.7067
entropy_ratio: 0.1635
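The entropy and entropy_ratio figures can be reproduced from the access value counts. The sketch below assumes entropy is Shannon entropy in bits and that the ratio divides by log2 of the cardinality; the report does not state this, but the assumption matches the numbers listed above.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts for all 20 access values, blank included, from the access table.
access_counts = [62551, 2717, 2266, 815, 575, 440, 271, 47, 11, 9,
                 2, 2, 2, 2, 1, 1, 1, 1, 1, 1]

entropy = shannon_entropy_bits(access_counts)
# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(access_counts))

print(f"entropy: {entropy:.4f} bits")
print(f"entropy_ratio: {entropy_ratio:.4f}")
```

Because the blank is included as a category, both figures mostly measure how dominant the missing values are, not how evenly the real access categories are spread.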

tourism

categorical feature imbalance

This column is an OpenStreetMap-style 'tourism' tag classifying geographic features by their tourism type (attraction, viewpoint, camp_site, museum, etc.). It is severely imbalanced: 97.2% of the 69,716 rows carry an empty string, meaning the tourism tag is absent, which leaves only 1,940 rows with meaningful values. An entropy ratio of 0.050 confirms near-zero informational content across the full dataset. Excluding the blank, there are 17 distinct values; most are standard OSM tourism values, though a few (such as 'register', 'guestbook', 'caves', and 'no') look like ad hoc or mistaken tags. Their rarity makes this column sparse by design rather than by data quality failure. Treatment: Treat empty string as missing/absent tag; consider binarising (tourism present vs. absent) or one-hot encoding the non-empty subset, given extreme sparsity. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 18
top_value: (empty string)
top_rate: 0.9722
cardinality: 18
entropy: 0.2076
entropy_ratio: 0.04979

wikipedia

text metadata one_word short_text duplicates

This column stores Wikipedia article links in a 'language-code:article-title' format, used to cross-reference cave or geological features to their Wikipedia pages across multiple languages (German, French, Spanish, Italian, Catalan, Polish, etc.). The overwhelming surprise is that 67,531 of 69,716 rows (96.9%) are empty strings and so have no Wikipedia link at all; these blanks also drive the 97.0% duplicate rate. The remaining 2,185 non-empty rows hold 2,076 distinct links. The text statistics are computed over all rows, blanks included (one_word_rate 0.976, len_mean 0.636), so they mostly reflect the empty strings; the vocabulary covers 940 terms across several language prefixes. Treatment: Filter empty strings before use; parse language prefix and article slug as separate fields for any multilingual Wikipedia lookup or join. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 2,077
len_min: 0
len_max: 114
len_mean: 0.636
len_median: 0
len_p95: 0
word_mean: 1.044
word_median: 1
n_empty: 67,531
n_duplicates: 67,639
duplicate_rate: 0.9702
vocab_size: 940
readability_flesch_mean: -0.02485
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9756
allcaps_rate: 0
boilerplate_rate: 0
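The treatment above suggests splitting the language prefix from the article title. A minimal sketch, assuming values follow the 'language-code:article-title' pattern described and that anything else should be set aside for manual review; `split_wikipedia_tag` is a hypothetical helper and the example titles are made up.

```python
def split_wikipedia_tag(tag):
    """Split 'lang:Title' into (lang, title).

    Returns None for blanks and for values without a usable prefix.
    """
    tag = tag.strip()
    if not tag:
        return None
    # partition splits on the first colon only, so titles may contain colons.
    lang, separator, title = tag.partition(":")
    if not separator or not lang or not title:
        return None
    return lang, title


# Made-up examples, not taken from the dataset.
for raw in ["de:Beispielhöhle", "fr:Grotte: entrée nord", "", "no-prefix"]:
    print(repr(raw), "->", split_wikipedia_tag(raw))
```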

website

text metadata one_word short_text duplicates

This column contains optional website/URL references associated with dataset records (likely natural features, caves, or protected sites given the UNESCO, speleology, and nature-registry URLs visible in top values). The dominant signal is near-total sparsity: 67,082 of 69,716 rows (96.2%) are empty strings, yielding a duplicate rate of 96.4% and a len_median of 0. The most frequent URLs appear only 2–19 times, suggesting these are editorial citations rather than structured foreign keys. The url_rate of 3.8% is measured over all rows, and only 2,634 rows (3.8%) are non-empty, so nearly every non-empty value is in fact parsed as a URL, in line with the top values all being valid URLs. Treatment: Exclude from modelling; optionally binarise into a 'has_website' flag or extract domain as a sparse categorical feature. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 2,492
len_min: 0
len_max: 255
len_mean: 2.789
len_median: 0
len_p95: 0
word_mean: 1
word_median: 1
n_empty: 67,082
n_duplicates: 67,224
duplicate_rate: 0.9643
vocab_size: 737
readability_flesch_mean: -24.04
emoji_rate: 0
url_rate: 0.03775
one_word_rate: 1
allcaps_rate: 1.434e-05
boilerplate_rate: 0
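The two options in the treatment, a 'has_website' flag and a domain feature, could be sketched with the standard library as follows. The function names are illustrative, and stripping a leading "www." is a design choice made here, not something the report specifies.

```python
from urllib.parse import urlsplit


def has_website(value):
    """True when the website field holds anything other than whitespace."""
    return bool(value.strip())


def website_domain(value):
    """Return the host part of a URL, or None if it cannot be determined."""
    value = value.strip()
    if not value:
        return None
    try:
        host = urlsplit(value).hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. broken IPv6 brackets.
        return None
    if host is None:
        # Values without a scheme ("example.org/page") carry no parsed host.
        return None
    return host[4:] if host.startswith("www.") else host


# Made-up URL for illustration.
print(has_website("https://www.example.org/caves/1"))
print(website_domain("https://www.example.org/caves/1"))
```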

cave:length

categorical feature long_tail imbalance

This column represents a cave length measurement, stored as a categorical/string type rather than numeric. The dominant signal is that 99.08% of 69,716 rows (69,074) carry an empty string, meaning cave length is recorded for only 642 rows across 237 distinct non-empty values (238 including the blank). The non-empty values appear to be numeric strings (e.g., '5', '6', '10', '4.5'), suggesting this field is sparsely populated and was likely intended as a numeric attribute. Treatment: Treat empty strings as missing; cast remaining values to numeric; expect ~99% missingness, so impute or use as a sparse indicator feature only. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 238
top_value: (empty string)
top_rate: 0.9908
cardinality: 238
entropy: 0.1392
entropy_ratio: 0.01763
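Casting this column to numeric could look like the sketch below. It treats blanks as missing and returns None for anything that is not a plain number, such as a value with a unit suffix, so those rows can be reviewed rather than silently coerced. `parse_measure` is a hypothetical helper, and the unit of the parsed numbers is whatever the source used, which the report does not specify.

```python
import math


def parse_measure(raw):
    """Convert a numeric string to float; None for blanks or non-numbers."""
    raw = raw.strip()
    if not raw:
        return None
    try:
        value = float(raw)
    except ValueError:
        return None
    # float() also accepts "nan" and "inf"; treat those as missing too.
    return value if math.isfinite(value) else None


# Values of the kind listed in the top-values table, plus edge cases.
raw_values = ["5", "4.5", "", "12 m", "60"]
parsed = [parse_measure(raw) for raw in raw_values]
measured = [value for value in parsed if value is not None]
print(parsed)
print(f"{len(measured)} of {len(raw_values)} values usable")
```

The same helper would apply to cave:depth, which is stored the same way.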

cave:depth

categorical feature long_tail imbalance

This column represents the depth attribute of a cave feature, stored as a categorical field containing numeric depth values (in unspecified units). It is almost entirely empty: 69,419 of 69,716 rows (99.57%) are blank strings, leaving only 297 rows with any actual depth value across 123 distinct non-empty values. The extreme sparsity and near-zero entropy ratio (0.0092) signal this is a rarely-populated optional attribute, not a reliable analytical field in its current form. Treatment: Filter to non-empty rows before use; convert to numeric and consider imputation or indicator flag for missingness given 99.57% blank rate. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 124
top_value: (empty string)
top_rate: 0.9957
cardinality: 124
entropy: 0.06432
entropy_ratio: 0.00925

country

categorical feature imbalance

This column is intended to capture a 2-letter ISO country code for each record. However, it is effectively empty: 69,684 of 69,716 rows (99.95%) contain a blank string rather than a real value, making the column nearly useless as a feature. Only 32 rows carry a real country code spread across 15 distinct values, with no single country dominating — the most common non-blank code is 'KY' with just 7 occurrences. The extreme imbalance (top_rate 0.9995) and near-zero entropy (0.0074) confirm this column carries virtually no information. Treatment: Drop or flag as data-quality issue; if retained, treat blank as missing and note that only 32 rows have usable country data. high · anthropic:default

n: 69,716
nulls: 0 (0.0%)
unique: 16
top_value: (empty string)
top_rate: 0.9995
cardinality: 16
entropy: 0.00736
entropy_ratio: 0.00184