saturn·

data trove caves worldwide

source /home/coolhand/html/datavis/data_trove/data/quirky/caves.json · 69,716 rows · 12 columns · profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This dataset is a global registry of 69,716 cave entries, likely sourced from OpenStreetMap, containing geographic coordinates, names, access rules, and optional metadata such as depth, length, and tourism classification. The most striking issue is extreme sparsity: the great majority of records have empty descriptions (93.5%), websites (96.2%), wikipedia links (96.9%), depth (99.6%), and length (99.1%), so most caves are little more than a name and a pin on a map. Even the name is often a placeholder: 19,527 records (28% of all rows) are called 'Unnamed Cave', a completeness problem worth investigating before any analysis. Geographic coverage skews heavily toward Europe (median latitude ~44°N with an interquartile range of only ~7°, and half of all longitudes between ~1°E and ~18°E), while the 16% of longitudes flagged as outliers indicate a global but uneven spread. Only about 10% of caves carry an access tag; among them, the split between open ('yes'), restricted ('no'), and 'private' is worth exploring for any public-access analysis.

citing: row_count · column_count · name.top_values · name.stats.n_duplicates · description.stats.n_empty · website.stats.n_empty · wikipedia.stats.n_empty · cave:depth.stats.top_rate · cave:length.stats.top_rate · lat.stats.median · lat.stats.iqr · lon.stats.outlier_rate · access.top_values · tourism.top_values
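The emptiness rates quoted in the summary can be recomputed directly from the raw records. The sketch below is a minimal version, assuming the JSON file is a top-level list of objects with one key per column; the helper name `empty_rates` and the sample rows are illustrative and not part of the profiler.

```python
from collections import Counter

def empty_rates(records):
    """Return {column: share of rows whose value is None or a blank string}."""
    n = len(records)
    if n == 0:
        return {}
    # Collect every key seen in any row so absent keys also count as blank.
    columns = sorted({col for row in records for col in row})
    blanks = Counter()
    for row in records:
        for col in columns:
            value = row.get(col)
            if value is None or (isinstance(value, str) and not value.strip()):
                blanks[col] += 1
    return {col: blanks[col] / n for col in columns}

# Made-up sample rows (not taken from caves.json).
sample = [
    {"id": 1, "name": "Unnamed Cave", "description": "", "website": ""},
    {"id": 2, "name": "Grotta Example", "description": "unterer Eingang", "website": ""},
]
rates = empty_rates(sample)
```

On the real file, the records would first be loaded with `json.load`, assuming the layout described above.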

Schema

12 columns
Column       Type         Nulls  Unique  Alerts
id           numeric      0.0%   69,716
name         text         0.0%   45,229  duplicates
lat          numeric      0.0%   69,544  high_skew, outliers
lon          numeric      0.0%   69,585  outliers
description  text         0.0%   3,705   one_word, short_text, duplicates
access       categorical  0.0%   20
tourism      categorical  0.0%   18      imbalance
wikipedia    text         0.0%   2,077   one_word, short_text, duplicates
website      text         0.0%   2,492   one_word, short_text, duplicates
cave:length  categorical  0.0%   238     long_tail, imbalance
cave:depth   categorical  0.0%   124     long_tail, imbalance
country      categorical  0.0%   16      imbalance

id

numeric identifier
This column is a numeric unique identifier — every one of the 69,716 rows has a distinct value with zero nulls, confirming a true primary key role. The values span a wide range (12,788,558 to 13,536,013,182) with a near-zero skew (0.0196) and a slightly platykurtic distribution (kurtosis −1.18), consistent with IDs assigned from a large, sparsely populated keyspace rather than a simple auto-increment sequence. The IQR of ~6.3 billion against a full range of ~13.5 billion indicates values are spread broadly across the domain with no clustering or outliers detected. Treatment: Drop before modelling; use as row key for joins or deduplication only. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
69,716
min
1.279e+07
max
1.354e+10
mean
6.842e+09
median
7.007e+09
std
3.774e+09
q1
3.61e+09
q3
9.948e+09
iqr
6.338e+09
skew
0.01955
kurtosis
-1.177
n_outliers
0
outlier_rate
0
zero_rate
0

name

text label duplicates
This column contains the proper names of caves or underground features in a speleological dataset, with multilingual entries spanning English, French, Italian, Spanish, German, and Russian. The most striking signal is that 'Unnamed Cave' appears 19,527 times — roughly 28% of all 69,716 rows — driving a duplicate rate of 35.1% (24,487 duplicates) even though 45,229 unique values exist. The vocabulary mix (grotta, grotte, cueva, cova, Грот) confirms international coverage, and the name is clearly a label, not a unique identifier. Treatment: Use as a display label only; do not treat as a unique key — normalise language variants and 'Unnamed Cave' placeholders before any grouping or matching. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
45,229
len_min
1
len_max
143
len_mean
15.52
len_median
13
len_p95
29
word_mean
2.519
word_median
2
n_empty
0
n_duplicates
24,487
duplicate_rate
0.3512
vocab_size
15,219
readability_flesch_mean
55.51
emoji_rate
1.434e-05
url_rate
0
one_word_rate
0.1545
allcaps_rate
0.03242
boilerplate_rate
0
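The duplicate figures above are consistent with a simple definition: duplicates = rows minus distinct values, and duplicate rate = duplicates / rows. The sketch below checks that arithmetic against the reported numbers and shows one way to blank out the placeholder before grouping. `PLACEHOLDER_NAMES` and `clean_name` are hypothetical helpers, and treating 'Unnamed Cave' as the only placeholder spelling is an assumption.

```python
PLACEHOLDER_NAMES = {"unnamed cave"}  # assumption: the only placeholder spelling

def clean_name(name):
    """Collapse whitespace; return None for blank or placeholder names."""
    collapsed = " ".join(name.split())
    if not collapsed or collapsed.casefold() in PLACEHOLDER_NAMES:
        return None
    return collapsed

def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate) with duplicates = n - distinct."""
    n = len(values)
    n_duplicates = n - len(set(values))
    return n_duplicates, (n_duplicates / n if n else 0.0)

# Arithmetic check against the reported counts (n = 69,716, unique = 45,229).
reported_duplicates = 69_716 - 45_229
reported_rate = round(reported_duplicates / 69_716, 4)

# Made-up sample names.
names = ["Unnamed Cave", "unnamed cave ", "Grotta Example", "Unnamed Cave"]
cleaned = [clean_name(x) for x in names]
```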

lat

numeric feature high_skew outliers
This column is a geographic latitude, spanning from -77.98° (Antarctic latitudes) to 78.18° (high Arctic). Half of all records fall between 40.49° and 47.75° (IQR of 7.26°); latitude alone fits either southern Europe or the northern United States, but read together with the longitude quartiles (1.24° to 18.18°) the core cluster is European. The strong negative skew (-3.16) and high kurtosis (11.50) indicate a heavy left tail toward the equator and the southern hemisphere. 8,996 rows (12.9%) are flagged as outliers, likely locations far outside the core geographic cluster. With 69,544 unique values across 69,716 rows and zero nulls, near-uniqueness suggests these are precise point coordinates rather than binned regions. Treatment: Retain as-is for geospatial modelling; check whether the 8,996 outlier rows are data quality issues or legitimate geographic subpopulations before regression or clustering. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
69,544
min
-77.98
max
78.18
mean
40.58
median
44.14
std
15.48
q1
40.49
q3
47.75
iqr
7.256
skew
-3.161
kurtosis
11.5
n_outliers
8,996
outlier_rate
0.129
zero_rate
0
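The report does not say which rule produces the outlier flags. A common choice is Tukey's fences at 1.5 × IQR beyond the quartiles, and the sketch below applies that rule to the reported latitude quartiles purely as an illustration. The rule itself is an assumption here, so the resulting bounds may not reproduce the 8,996 flagged rows exactly.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) bounds: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3, k=1.5):
    """True when value lies strictly outside the Tukey fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return value < lower or value > upper

# Reported latitude quartiles (rounded as shown in the stats above).
LAT_Q1, LAT_Q3 = 40.49, 47.75
lower, upper = tukey_fences(LAT_Q1, LAT_Q3)  # roughly 29.6 and 58.6 degrees north
```

Under this assumed rule, anything south of about 29.6°N or north of about 58.6°N would be flagged, which fits the idea that the flagged rows are distant regions rather than bad coordinates.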

lon

numeric feature outliers
This column contains geographic longitude values, ranging from -178.0 to 178.8 degrees, consistent with a global coordinate field. Nearly all 69,716 rows are unique (69,585 distinct values) with zero nulls, indicating high-precision GPS or geocoded data. The distribution is heavily concentrated in a narrow band (IQR of ~17 degrees, Q1=1.24, Q3=18.18), so the bulk of records cluster around Europe/Africa longitudes, while 16.2% of values (11,302 rows) are flagged as outliers, likely legitimate points in the Americas, East Asia, or Oceania that fall far from this central cluster. The elevated kurtosis (4.51) and a standard deviation (40.5) more than twice the IQR are consistent with a heavy-tailed and probably multimodal geographic spread, although these moments alone cannot confirm multimodality. Treatment: Retain as-is for geospatial modelling; consider pairing with latitude and clustering by region to handle the likely multimodal distribution before feeding into distance-based models. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
69,585
min
-178
max
178.8
mean
12.03
median
11.38
std
40.5
q1
1.245
q3
18.18
iqr
16.94
skew
0.2755
kurtosis
4.509
n_outliers
11,302
outlier_rate
0.1621
zero_rate
0
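Pairing longitude with latitude for distance-based models usually means using a great-circle distance rather than a Euclidean distance on raw degrees, which would also mishandle the wrap at ±180°. The following is a minimal sketch of the haversine formula on a spherical Earth. The function name and the mean-radius constant are choices made here, not something the report specifies.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))

# Two points on the equator either side of the antimeridian are 2 degrees apart,
# not 358 degrees apart.
across_dateline = haversine_km(0.0, 179.0, 0.0, -179.0)
two_degrees = haversine_km(0.0, 0.0, 0.0, 2.0)
```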

description

text free_text one_word short_text duplicates
This column is a short free-text description field, with entries in German, French, Spanish, and English; the cadastral vocabulary suggests much of it comes from European cave/karst registers. It is overwhelmingly empty: 65,189 of 69,716 rows (93.5%) are blank strings, making it near-useless as a feature in its current form. The duplicate rate is 94.7% (66,011 duplicates), driven almost entirely by those empty strings. The length statistics are computed over all rows, blanks included, which is why the median length is 0 and the 95th percentile is only 19 characters; the 4,527 non-empty values are mostly single words or brief directional/categorical labels like 'unterer Eingang' or 'nicht katasterwürdig'. The low cardinality (3,705 unique values, 4,651-word vocabulary) suggests an optional annotation field rather than a structured attribute. Treatment: Exclude empty strings, then optionally use as a categorical label or tokenize non-empty values for NLP; do not use as a predictive feature without a strategy for the 93.5% blanks. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
3,705
len_min
0
len_max
255
len_mean
3.462
len_median
0
len_p95
19
word_mean
1.455
word_median
1
n_empty
65,189
n_duplicates
66,011
duplicate_rate
0.9469
vocab_size
4,651
readability_flesch_mean
1.752
emoji_rate
0
url_rate
0.0006311
one_word_rate
0.9428
allcaps_rate
0.00241
boilerplate_rate
1.434e-05
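Because blanks dominate, length statistics taken over every row say little about the descriptions that actually exist. A small sketch of computing the median both ways follows. The helper name is hypothetical; the sample reuses the two example labels quoted above plus made-up blanks.

```python
from statistics import median

def length_medians(values):
    """Return (median length over all rows, median over non-blank rows or None)."""
    all_lengths = [len(v) for v in values]
    non_empty_lengths = [len(v) for v in values if v.strip()]
    return (
        median(all_lengths),
        median(non_empty_lengths) if non_empty_lengths else None,
    )

# Mostly blank, like the real column.
sample = ["", "", "", "unterer Eingang", "nicht katasterwürdig"]
median_all, median_non_empty = length_medians(sample)
```

Here `median_all` is 0 because blanks are the majority, while `median_non_empty` reflects only the populated rows.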

access

categorical feature
This column is an OpenStreetMap-style 'access' tag recording who may enter the feature, with values like 'yes', 'no', 'private', 'permit', 'permissive', 'customers', and 'destination'. The dominant signal is that 89.7% of rows (62,551 of 69,716) carry an empty string rather than a proper null, so missing access information is stored as blank text rather than NULL. The blank is itself counted among the 20 distinct values, leaving 19 real categories, and the low entropy (0.707, entropy ratio 0.16) reflects how heavily the distribution is concentrated on it. Unless blanks are recoded as missing, they will distort cardinality and frequency analyses. Treatment: Recode empty strings to NaN, then one-hot or ordinal encode the remaining access-permission categories. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
20
top_value
(empty string)
top_rate
0.8972
cardinality
20
entropy
0.7067
entropy_ratio
0.1635
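The recommended treatment (blank to missing, then one-hot encode) can be sketched with plain Python structures. The function names and the `access=` column prefix are illustrative choices, and `None` stands in for NaN here.

```python
def recode_blank(value):
    """Map None, empty, or whitespace-only strings to None; strip the rest."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None

def one_hot(values, prefix="access"):
    """One-hot encode non-missing categories; missing rows get all zeros."""
    categories = sorted({v for v in values if v is not None})
    rows = [
        {f"{prefix}={c}": int(v == c) for c in categories}
        for v in values
    ]
    return categories, rows

# Made-up sample using values listed in the description above.
raw = ["yes", "", "private", "no", " "]
cleaned = [recode_blank(v) for v in raw]
categories, encoded = one_hot(cleaned)
```

Leaving missing rows as all zeros keeps "no tag" distinct from `access=no`, which is an explicit restriction.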

tourism

categorical feature imbalance
This column is an OpenStreetMap-style 'tourism' tag classifying features by tourism type (attraction, viewpoint, camp_site, museum, etc.). It is severely imbalanced: 97.2% of the 69,716 rows carry an empty string, meaning the tag is absent, which leaves only ~1,940 rows with meaningful values. The entropy ratio of 0.050 confirms that the column carries almost no information across the full dataset. The blank counts as one of the 18 distinct values, so there are 17 real categories; they look like legitimate OSM tourism values, and their rarity makes this column sparse by design rather than by data quality failure. Treatment: Treat empty string as missing/absent tag; consider binarising (tourism present vs. absent) or one-hot encoding the non-empty subset, given extreme sparsity. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
18
top_value
(empty string)
top_rate
0.9722
cardinality
18
entropy
0.2076
entropy_ratio
0.04979
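The reported figures are consistent with entropy_ratio being Shannon entropy in bits divided by log2 of the cardinality: 0.2076 / log2(18) ≈ 0.0498. That relationship is inferred from the numbers rather than documented in the report, so the sketch below should be read as an assumption about how the profiler computes it. It also shows the suggested present/absent binarisation.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy of the value distribution, in bits."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    return entropy_bits(values) / math.log2(k) if k > 1 else 0.0

# Check against the reported tourism figures (entropy 0.2076, cardinality 18).
reported_ratio = 0.2076 / math.log2(18)

# Binarise a made-up sample: tag present vs. absent.
sample = ["", "", "attraction", "", "viewpoint"]
tourism_present = [bool(v.strip()) for v in sample]
```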

wikipedia

text metadata one_word short_text duplicates
This column stores Wikipedia article links in a 'language-code:article-title' format, cross-referencing caves or geological features to their Wikipedia pages across multiple languages (German, French, Spanish, Italian, Catalan, Polish, etc.). It is sparsely populated: 67,531 of 69,716 rows (96.9%) are empty strings, which drives the 97.0% duplicate rate (67,639 duplicates). The 2,185 non-empty rows hold 2,076 distinct links, each a short token, with a tokenised vocabulary of 940 terms across several language prefixes. The text statistics (one_word_rate 0.976, len_mean 0.636) are computed over all rows, blanks included, so they mostly describe the blanks rather than the links. Treatment: Filter empty strings before use; parse language prefix and article slug as separate fields for any multilingual Wikipedia lookup or join. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
2,077
len_min
0
len_max
114
len_mean
0.636
len_median
0
len_p95
0
word_mean
1.044
word_median
1
n_empty
67,531
n_duplicates
67,639
duplicate_rate
0.9702
vocab_size
940
readability_flesch_mean
-0.02485
emoji_rate
0
url_rate
0
one_word_rate
0.9756
allcaps_rate
0
boilerplate_rate
0
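Splitting the tag into language prefix and article title can be sketched as follows. The function name and the example titles are hypothetical; splitting only on the first colon is a deliberate choice so that titles containing colons stay intact.

```python
def parse_wikipedia_tag(value):
    """Split 'lang:Title' into (lang, title).

    Returns None for blank input and (None, value) when no prefix is present.
    """
    value = value.strip()
    if not value:
        return None
    lang, separator, title = value.partition(":")  # split on the first colon only
    if not separator:
        return (None, value)
    return (lang.strip().lower(), title.strip())

# Made-up examples in the 'language-code:article-title' format.
examples = ["", "de:Beispielhöhle", "fr:Grotte: Exemple", "NoPrefix"]
parsed = [parse_wikipedia_tag(v) for v in examples]
```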

website

text metadata one_word short_text duplicates
This column contains optional website URLs associated with records (the top values include UNESCO, speleology, and nature-registry URLs). The dominant signal is near-total sparsity: 67,082 of 69,716 rows (96.2%) are empty strings, yielding a duplicate rate of 96.4% and a len_median of 0. The most frequent URLs appear only 2 to 19 times each, suggesting these are editorial citations rather than structured foreign keys. The url_rate of 3.8% is computed over all rows and matches the 3.8% of rows (2,634) that are non-empty, so practically every non-empty value parses as a URL. Treatment: Exclude from modelling; optionally binarise into a 'has_website' flag or extract domain as a sparse categorical feature. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
2,492
len_min
0
len_max
255
len_mean
2.789
len_median
0
len_p95
0
word_mean
1
word_median
1
n_empty
67,082
n_duplicates
67,224
duplicate_rate
0.9643
vocab_size
737
readability_flesch_mean
-24.04
emoji_rate
0
url_rate
0.03775
one_word_rate
1
allcaps_rate
1.434e-05
boilerplate_rate
0
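A minimal sketch of the two suggested derived features, a `has_website` flag and the domain, using the standard library's `urllib.parse`. The function name and the fallback for values lacking a scheme are assumptions made for illustration.

```python
from urllib.parse import urlsplit

def website_features(value):
    """Return {'has_website': bool, 'domain': lower-cased host or None}."""
    value = value.strip()
    if not value:
        return {"has_website": False, "domain": None}
    try:
        host = urlsplit(value).hostname
        if host is None:
            # No scheme (e.g. 'example.org/page'): retry as a network-path reference.
            host = urlsplit("//" + value).hostname
    except ValueError:
        host = None  # malformed URL; keep the flag but leave the domain unknown
    return {"has_website": True, "domain": host}

# Made-up examples.
blank = website_features("")
full = website_features("https://WWW.Example.org/caves/1")
bare = website_features("example.org/caves")
```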

cave:length

categorical feature long_tail imbalance
This column represents a cave length measurement, stored as a string rather than a number. The dominant signal is that 99.08% of 69,716 rows (~69,074) carry an empty string, so cave length is recorded for only ~642 rows across 237 distinct non-empty values (238 including the blank). The non-empty values appear to be numeric strings (e.g., '5', '6', '10', '20'), suggesting the field was intended as a numeric attribute. Treatment: Treat empty strings as missing; cast remaining values to numeric; expect ~99% missingness, so prefer a sparse indicator feature over imputation. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
238
top_value
(empty string)
top_rate
0.9908
cardinality
238
entropy
0.1392
entropy_ratio
0.01763
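Casting the string field to numbers while keeping blanks as missing can be sketched like this. The helper name is hypothetical, and returning `None` for anything that is not a plain finite number (for example a value with a unit suffix) is a conservative assumption, since the report does not show how irregular values are written.

```python
import math

def parse_length(raw):
    """Return the value as a float, or None when blank or not a plain finite number."""
    text = raw.strip()
    if not text:
        return None
    try:
        number = float(text)
    except ValueError:
        return None  # e.g. free text or a unit suffix; left for manual review
    return number if math.isfinite(number) else None

# Sample mixing the example values from the description with made-up edge cases.
sample = ["", "5", "10", "20", "about 30"]
parsed = [parse_length(v) for v in sample]
has_length = [p is not None for p in parsed]  # sparse indicator feature
```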

cave:depth

categorical feature long_tail imbalance
This column represents the depth attribute of a cave feature, stored as a categorical field containing numeric depth values (in unspecified units). It is almost entirely empty: 69,419 of 69,716 rows (99.57%) are blank strings, leaving only 297 rows with any actual depth value across 123 distinct non-empty values. The extreme sparsity and near-zero entropy ratio (0.0092) signal this is a rarely-populated optional attribute, not a reliable analytical field in its current form. Treatment: Filter to non-empty rows before use; convert to numeric and consider imputation or indicator flag for missingness given 99.57% blank rate. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
124
top_value
(empty string)
top_rate
0.9957
cardinality
124
entropy
0.06432
entropy_ratio
0.00925

country

categorical feature imbalance
This column is intended to capture a 2-letter ISO country code for each record. However, it is effectively empty: 69,684 of 69,716 rows (99.95%) contain a blank string rather than a real value, making the column nearly useless as a feature. Only 32 rows carry a real country code spread across 15 distinct values, with no single country dominating — the most common non-blank code is 'KY' with just 7 occurrences. The extreme imbalance (top_rate 0.9995) and near-zero entropy (0.0074) confirm this column carries virtually no information. Treatment: Drop or flag as data-quality issue; if retained, treat blank as missing and note that only 32 rows have usable country data. high · anthropic:default
n
69,716
nulls
0 (0.0%)
unique
16
top_value
(empty string)
top_rate
0.9995
cardinality
16
entropy
0.00736
entropy_ratio
0.00184