data-trove-caves-worldwide · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/caves.json

Saturn profiled 69,716 rows across 12 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

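# Notebook cell: install the profiler, then run it over the source file.
!pip install saturn-dissect
import subprocess

# check=True raises CalledProcessError if the saturn CLI exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/caves.json",
    "--findings", "data-trove-caves-worldwide.json",
    "--llm", "anthropic:default",
], check=True)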

Summary confidence: high

This dataset is a global registry of 69,716 cave entries, likely sourced from OpenStreetMap, containing geographic coordinates, names, access rules, and optional metadata such as depth, length, and tourism classification. The most striking issue is extreme sparsity: most records have empty descriptions (93.5%), websites (96.2%), wikipedia links (96.9%), depth (99.6%), and length (99.1%), so most caves are little more than a name and a pin on a map. Even the name is often a placeholder: 19,527 records (28.0% of all rows) are called 'Unnamed Cave', a completeness problem worth investigating before any analysis. Geographic coverage skews heavily toward Europe (latitude median ~44°N with a tight interquartile range of 7.3°), while the longitude outliers (16.2% of rows) show a global but uneven spread. Only about 10% of entries carry an access tag; among them, the split between open ('yes'), closed ('no'), and 'private' is worth exploring for any public-access analysis.

citing: row_count · column_count · name.top_values · name.stats.n_duplicates · description.stats.n_empty · website.stats.n_empty · wikipedia.stats.n_empty · cave:depth.stats.top_rate · cave:length.stats.top_rate · lat.stats.median · lat.stats.iqr · lon.stats.outlier_rate · access.top_values · tourism.top_values
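
The percentages in the summary can be rechecked from the counts reported in the per-column stats. A minimal sketch, assuming only the `n_empty` and top-value counts shown in this notebook; the dictionary and function names are illustrative and not part of saturn:

```python
# Counts copied from the stats tables in this notebook.
ROW_COUNT = 69_716
EMPTY_COUNTS = {
    "description": 65_189,
    "website": 67_082,
    "wikipedia": 67_531,
    "cave:depth": 69_419,
    "cave:length": 69_074,
}
UNNAMED_CAVE_COUNT = 19_527


def share(count: int, total: int = ROW_COUNT) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


empty_share = {col: share(n) for col, n in EMPTY_COUNTS.items()}
unnamed_share = share(UNNAMED_CAVE_COUNT)

for col, pct in empty_share.items():
    print(f"{col:12s} {pct:5.1f}% empty")
print(f"'Unnamed Cave' {unnamed_share:.1f}% of rows")
```

Because every column reports 0.0% nulls, these empty-string shares are the only place the sparsity shows up.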

Out[4]:

saturn.schema() · 12 columns

column	kind	n	null%	unique	alerts
id	numeric	69,716	0.0%	69,716
name	text	69,716	0.0%	45,229	duplicates
lat	numeric	69,716	0.0%	69,544	high_skew outliers
lon	numeric	69,716	0.0%	69,585	outliers
description	text	69,716	0.0%	3,705	one_word short_text duplicates
access	categorical	69,716	0.0%	20
tourism	categorical	69,716	0.0%	18	imbalance
wikipedia	text	69,716	0.0%	2,077	one_word short_text duplicates
website	text	69,716	0.0%	2,492	one_word short_text duplicates
cave:length	categorical	69,716	0.0%	238	long_tail imbalance
cave:depth	categorical	69,716	0.0%	124	long_tail imbalance
country	categorical	69,716	0.0%	16	imbalance

Fig 1.

access · Look at the breakdown between open, private, permit, and no-access caves among the 10.3% of entries (7,165 rows) that carry an access tag.

Top values for access (20 unique shown, of 20 total).
value	count	share
	62551	89.7%
yes	2717	3.9%
no	2266	3.3%
private	815	1.2%
permit	575	0.8%
permissive	440	0.6%
customers	271	0.4%
unknown	47	0.1%
destination	11	0.0%
restricted	9	0.0%
tidal	2	0.0%
request	2	0.0%
key	2	0.0%
discouraged	2	0.0%
designated	1	0.0%
official	1	0.0%
forestry	1	0.0%
agricultural	1	0.0%
guided	1	0.0%
university	1	0.0%

Fig 2.

tourism · Nearly all tagged tourism values fall under 'attraction' or 'viewpoint' — this shows how few caves are formally designated for visitors.

Top values for tourism (18 unique shown, of 18 total).
value	count	share
	67776	97.2%
attraction	1670	2.4%
viewpoint	119	0.2%
yes	105	0.2%
camp_site	7	0.0%
museum	6	0.0%
register	6	0.0%
artwork	6	0.0%
information	5	0.0%
picnic_site	4	0.0%
cave_entrance	3	0.0%
caves	2	0.0%
wilderness_hut	2	0.0%
attraction;museum	1	0.0%
guestbook	1	0.0%
no	1	0.0%
cave	1	0.0%
hotel	1	0.0%

Fig 3.

name · Check the distribution of name lengths to see how many entries are generic single-word labels versus descriptively named caves.

Character-length distribution for name (mean: 15.52237649893855).
chars	count
1 – 5	1601
5 – 8	3457
8 – 12	5854
12 – 15	31600
15 – 19	9877
19 – 22	8424
22 – 26	3274
26 – 29	2614
29 – 33	1175
33 – 36	885
36 – 40	482
40 – 44	175
44 – 47	154
47 – 51	56
51 – 54	45
54 – 58	17
58 – 61	11
61 – 65	3
65 – 68	2
68 – 72	2
72 – 76	4
76 – 79	2
79 – 83	0
83 – 86	1
86 – 90	0
90 – 93	0
93 – 97	0
97 – 100	0
100 – 104	0
104 – 108	0
108 – 111	0
111 – 115	0
115 – 118	0
118 – 122	0
122 – 125	0
125 – 129	0
129 – 132	0
132 – 136	0
136 – 139	0
139 – 143	1

Fig 4.

lat · The latitude distribution reveals a strong concentration around 40–48°N (Europe), with a long tail of outliers across other continents.

Histogram bins for lat (median: 44.14308795).
bin	count
-77.98 – -74.07	4
-74.07 – -70.17	0
-70.17 – -66.26	3
-66.26 – -62.36	0
-62.36 – -58.46	1
-58.46 – -54.55	2
-54.55 – -50.65	13
-50.65 – -46.74	26
-46.74 – -42.84	60
-42.84 – -38.94	127
-38.94 – -35.03	329
-35.03 – -31.13	330
-31.13 – -27.23	225
-27.23 – -23.32	257
-23.32 – -19.42	301
-19.42 – -15.51	204
-15.51 – -11.61	78
-11.61 – -7.705	139
-7.705 – -3.801	314
-3.801 – 0.1026	183
0.1026 – 4.007	118
4.007 – 7.911	358
7.911 – 11.81	287
11.81 – 15.72	385
15.72 – 19.62	1856
19.62 – 23.53	937
23.53 – 27.43	775
27.43 – 31.33	1082
31.33 – 35.24	1231
35.24 – 39.14	3928
39.14 – 43.05	12068
43.05 – 46.95	20377
46.95 – 50.85	19250
50.85 – 54.76	2492
54.76 – 58.66	1075
58.66 – 62.57	564
62.57 – 66.47	267
66.47 – 70.37	47
70.37 – 74.28	19
74.28 – 78.18	4

Fig 5.

cave:length · Among the small fraction of caves with a recorded length, see whether short caves (under 20m) dominate the measured entries.

Top values for cave:length (20 unique shown, of 238 total).
value	count	share
	69074	99.1%
5	32	0.0%
6	26	0.0%
10	25	0.0%
3	23	0.0%
4	23	0.0%
7	20	0.0%
8	19	0.0%
15	16	0.0%
20	14	0.0%
12	13	0.0%
30	13	0.0%
2	11	0.0%
11	9	0.0%
60	8	0.0%
4.5	8	0.0%
13	8	0.0%
16	8	0.0%
17	8	0.0%
25	8	0.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	numeric	0.0%
name	text	0.0%
lat	numeric	0.0%
lon	numeric	0.0%
description	text	0.0%
access	categorical	0.0%
tourism	categorical	0.0%
wikipedia	text	0.0%
website	text	0.0%
cave:length	categorical	0.0%
cave:depth	categorical	0.0%
country	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	id	lat	lon
id	+1.00	-0.04	-0.06
lat	-0.04	+1.00	-0.15
lon	-0.06	-0.15	+1.00

id numeric identifier

This column is a numeric unique identifier: every one of the 69,716 rows has a distinct value with zero nulls, so it can serve as a primary key. The values span a wide range (12,788,558 to 13,536,013,182) with near-zero skew (0.0196) and a kurtosis of −1.18, close to the −1.2 of a uniform distribution, consistent with IDs drawn from a large, sparsely populated keyspace rather than a dense auto-increment sequence. The IQR of ~6.3 billion against a full range of ~13.5 billion shows the values are spread broadly across the domain, and no outliers were detected.

Treatment: Drop before modelling; use as row key for joins or deduplication only.

anthropic:default · confidence high

Out[13]:

saturn.columns["id"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	69,716
min	1.279e+07
max	1.354e+10
mean	6.842e+09
median	7.007e+09
std	3.774e+09
q1	3.61e+09
q3	9.948e+09
iqr	6.338e+09
skew	0.01955
kurtosis	-1.177
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 8.

Distribution of id. Vertical dash marks the median.

Histogram bins for id (median: 7006519241.0).
bin	count
1.279e+07 – 3.509e+08	592
3.509e+08 – 6.889e+08	1123
6.889e+08 – 1.027e+09	2360
1.027e+09 – 1.365e+09	1819
1.365e+09 – 1.703e+09	1497
1.703e+09 – 2.041e+09	1907
2.041e+09 – 2.379e+09	1345
2.379e+09 – 2.717e+09	1415
2.717e+09 – 3.056e+09	1806
3.056e+09 – 3.394e+09	1977
3.394e+09 – 3.732e+09	2439
3.732e+09 – 4.07e+09	1595
4.07e+09 – 4.408e+09	1710
4.408e+09 – 4.746e+09	2212
4.746e+09 – 5.084e+09	2764
5.084e+09 – 5.422e+09	1741
5.422e+09 – 5.76e+09	1236
5.76e+09 – 6.098e+09	1268
6.098e+09 – 6.436e+09	2153
6.436e+09 – 6.774e+09	1135
6.774e+09 – 7.112e+09	2343
7.112e+09 – 7.451e+09	1932
7.451e+09 – 7.789e+09	2060
7.789e+09 – 8.127e+09	1970
8.127e+09 – 8.465e+09	1451
8.465e+09 – 8.803e+09	1229
8.803e+09 – 9.141e+09	1689
9.141e+09 – 9.479e+09	1952
9.479e+09 – 9.817e+09	2902
9.817e+09 – 1.016e+10	1840
1.016e+10 – 1.049e+10	617
1.049e+10 – 1.083e+10	1370
1.083e+10 – 1.117e+10	1613
1.117e+10 – 1.151e+10	3298
1.151e+10 – 1.185e+10	1230
1.185e+10 – 1.218e+10	1741
1.218e+10 – 1.252e+10	1317
1.252e+10 – 1.286e+10	2065
1.286e+10 – 1.32e+10	1532
1.32e+10 – 1.354e+10	1471

name text label

This column contains the proper names of caves or underground features in a speleological dataset, with multilingual entries spanning English, French, Italian, Spanish, German, and Russian. The most striking signal is that 'Unnamed Cave' appears 19,527 times, roughly 28% of all 69,716 rows. That single placeholder accounts for 19,526 of the 24,487 duplicates behind the 35.1% duplicate rate, even though 45,229 unique values exist. The vocabulary mix (grotta, grotte, cueva, cova, Грот) confirms international coverage, and the name is clearly a label, not a unique identifier.

Treatment: Use as a display label only; do not treat as a unique key — normalise language variants and 'Unnamed Cave' placeholders before any grouping or matching.
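
The placeholder handling in the treatment can be sketched as follows. This is a minimal illustration, assuming 'Unnamed Cave' is the only placeholder (it is the only one confirmed by the stats); `clean_name`, `match_key`, and `PLACEHOLDER_NAMES` are hypothetical helpers, and translating language variants such as grotta/grotte/cueva is deliberately left out:

```python
import unicodedata
from typing import Optional

# Placeholder labels to treat as "no name". Only 'Unnamed Cave' is
# confirmed by the top values; extend the set after inspecting the data.
PLACEHOLDER_NAMES = {"unnamed cave"}


def clean_name(raw: str) -> Optional[str]:
    """Return a whitespace-normalised name, or None for placeholders."""
    text = unicodedata.normalize("NFKC", raw)
    text = " ".join(text.split())  # trim and collapse runs of whitespace
    if not text or text.casefold() in PLACEHOLDER_NAMES:
        return None
    return text


def match_key(raw: str) -> Optional[str]:
    """Case-insensitive key for grouping; None when the name is missing."""
    name = clean_name(raw)
    return name.casefold() if name is not None else None
```

Grouping on `match_key` rather than the raw string keeps the 19,527 placeholder rows from collapsing into one artificial "cave".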

anthropic:default · confidence high

Out[16]:

saturn.columns["name"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	45,229
len_min	1
len_max	143
len_mean	15.52
len_median	13
len_p95	29
word_mean	2.519
word_median	2
n_empty	0
n_duplicates	24,487
duplicate_rate	0.3512
vocab_size	15,219
readability_flesch_mean	55.51
emoji_rate	1.434e-05
url_rate	0
one_word_rate	0.1545
allcaps_rate	0.03242
boilerplate_rate	0
alert: duplicates	35.1% duplicate strings

Fig 9.

Character-length distribution for name. The binned counts are identical to the table under Fig 3 (mean: 15.52237649893855).

lat numeric feature

This column is a geographic latitude, spanning from -77.98° (Antarctic latitudes) to 78.18° (high Arctic). The middle half of the records lies between 40.49° and 47.75° (IQR of 7.26°), a band that on latitude alone fits either southern Europe or North America; the lon column's interquartile range (1.24° to 18.18°) points to Europe. The strong negative skew (-3.16) and high kurtosis (11.50) indicate a heavy left tail toward the tropics and the southern hemisphere. 8,996 rows (12.9%) fall beyond 1.5 IQR of the quartiles and are flagged as outliers; these are likely real locations far from the core cluster rather than errors. With 69,544 unique values across 69,716 rows and zero nulls, the near-uniqueness suggests precise point coordinates rather than binned regions.

Treatment: Retain as-is for geospatial modelling; investigate the 8,996 outlier rows for data quality issues or legitimate geographic subpopulations before regression or clustering.
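
To see which latitudes the outlier alert actually covers, the fences can be recomputed from the reported quartiles. A minimal sketch, assuming the "beyond 1.5 IQR" alert follows the standard Tukey rule (saturn's exact implementation is not shown in this notebook); the function names are illustrative:

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) bounds q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value: float, lower: float, upper: float) -> bool:
    """True when value falls outside the closed interval [lower, upper]."""
    return value < lower or value > upper


# Quartiles as reported in saturn.columns["lat"].stats (rounded values).
lat_lower, lat_upper = tukey_fences(q1=40.49, q3=47.75)
print(f"lat flagged below {lat_lower:.2f} or above {lat_upper:.2f}")
```

With these rounded quartiles the fences sit near 29.6° and 58.6°, so every cave in the tropics or the southern hemisphere is flagged regardless of data quality.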

anthropic:default · confidence high

Out[19]:

saturn.columns["lat"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	69,544
min	-77.98
max	78.18
mean	40.58
median	44.14
std	15.48
q1	40.49
q3	47.75
iqr	7.256
skew	-3.161
kurtosis	11.5
n_outliers	8,996
outlier_rate	0.129
zero_rate	0
alert: high_skew	skew=-3.16
alert: outliers	12.9% rows beyond 1.5 IQR

Fig 10.

Distribution of lat. Vertical dash marks the median. The histogram bins are identical to the table under Fig 4 (median: 44.14308795).

lon numeric feature

This column contains geographic longitude values, ranging from -178.0 to 178.8 degrees, consistent with a global coordinate field. Nearly all 69,716 rows are unique (69,585 distinct values) with zero nulls, indicating high-precision GPS or geocoded data. The distribution is heavily concentrated in a narrow band (IQR of ~17 degrees, Q1=1.24, Q3=18.18), suggesting the bulk of records cluster around Europe/Africa longitudes, while 16.2% of values (11,302 rows) are flagged as outliers. These are likely legitimate points in the Americas, East Asia, or Oceania that fall far from the central cluster. The elevated kurtosis (4.51) and large std (40.5) relative to the IQR reflect this heavy-tailed spread, and the histogram shows secondary peaks near -80° to -71° and 98° to 107°.

Treatment: Retain as-is for geospatial modelling; consider pairing with latitude and clustering by region to handle the multimodal distribution before feeding into distance-based models.
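
Distance-based models should not treat lat and lon as plain Cartesian axes, since a degree of longitude shrinks toward the poles and the range wraps at ±180°. One common option is the haversine great-circle distance, sketched here under a spherical-Earth assumption; the function name and radius constant are illustrative choices, not something saturn provides:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

Two points at longitudes 179.5 and -179.5 on the equator come out about 111 km apart, whereas a naive difference of the lon values would suggest 359 degrees of separation.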

anthropic:default · confidence high

Out[22]:

saturn.columns["lon"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	69,585
min	-178
max	178.8
mean	12.03
median	11.38
std	40.5
q1	1.245
q3	18.18
iqr	16.94
skew	0.2755
kurtosis	4.509
n_outliers	11,302
outlier_rate	0.1621
zero_rate	0
alert: outliers	16.2% rows beyond 1.5 IQR

Fig 11.

Distribution of lon. Vertical dash marks the median.

Histogram bins for lon (median: 11.37646115).
bin	count
-178 – -169.1	54
-169.1 – -160.2	6
-160.2 – -151.2	49
-151.2 – -142.3	20
-142.3 – -133.4	6
-133.4 – -124.5	18
-124.5 – -115.6	352
-115.6 – -106.6	454
-106.6 – -97.72	283
-97.72 – -88.8	361
-88.8 – -79.88	579
-79.88 – -70.96	1877
-70.96 – -62.04	236
-62.04 – -53.12	168
-53.12 – -44.2	252
-44.2 – -35.28	203
-35.28 – -26.36	26
-26.36 – -17.44	182
-17.44 – -8.523	766
-8.523 – 0.3967	9279
0.3967 – 9.317	14696
9.317 – 18.24	22468
18.24 – 27.16	8156
27.16 – 36.08	1754
36.08 – 45	1346
45 – 53.92	530
53.92 – 62.84	479
62.84 – 71.76	150
71.76 – 80.68	381
80.68 – 89.6	210
89.6 – 98.52	212
98.52 – 107.4	1114
107.4 – 116.4	956
116.4 – 125.3	696
125.3 – 134.2	395
134.2 – 143.1	363
143.1 – 152	330
152 – 161	46
161 – 169.9	33
169.9 – 178.8	230

description text free_text

This column is a short free-text description field for what appears to be a European cave/karst cadastral dataset, with entries in German, French, Spanish, and English. It is overwhelmingly empty: 65,189 of 69,716 rows (93.5%) are blank strings, making it near-useless as a feature in its current form. The duplicate rate is 94.7% (66,011 duplicates), driven almost entirely by those empty strings. The length statistics (median 0, p95 19 characters, 94.3% single-word rows) are computed over all rows, so they mostly describe the blanks rather than the 4,527 non-empty values; the histogram shows those ranging up to 255 characters, with a cluster of 71 rows in the 249 to 255 bin that may indicate truncation at a 255-character limit. Sample values include brief labels such as 'unterer Eingang' or 'nicht katasterwürdig'. The multilingual vocabulary and low cardinality (3,705 unique values, 4,651-word vocabulary) suggest an optional annotation field rather than a structured attribute.

Treatment: Exclude empty strings, then optionally use as a categorical label or tokenize non-empty values for NLP; do not use as a predictive feature without imputation strategy for the 93.5% blanks.
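
Because blanks dominate, the reported `len_mean` says little about the descriptions that do exist. The mean over non-empty rows can be recovered from the published stats alone, as in this sketch; it assumes empty strings have length 0 (consistent with `len_min` 0), and the function name is illustrative:

```python
def nonempty_mean_length(len_mean: float, n: int, n_empty: int) -> float:
    """Mean character length over non-empty rows only.

    Empty strings contribute zero characters, so the total character
    count is len_mean * n regardless of how many rows are blank.
    """
    n_nonempty = n - n_empty
    if n_nonempty <= 0:
        raise ValueError("no non-empty rows")
    return len_mean * n / n_nonempty


# Figures from saturn.columns["description"].stats and Fig 12.
description_mean = nonempty_mean_length(3.4617591370703997, 69_716, 65_189)
print(f"description: {description_mean:.1f} chars per non-empty value")
```

The same function applies to the wikipedia and website columns, whose `len_mean` values are deflated by blanks in the same way.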

anthropic:default · confidence high

Out[25]:

saturn.columns["description"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	3,705
len_min	0
len_max	255
len_mean	3.462
len_median	0
len_p95	19
word_mean	1.455
word_median	1
n_empty	65,189
n_duplicates	66,011
duplicate_rate	0.9469
vocab_size	4,651
readability_flesch_mean	1.752
emoji_rate	0
url_rate	0.0006311
one_word_rate	0.9428
allcaps_rate	0.00241
boilerplate_rate	1.434e-05
alert: one_word	94.3% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	94.7% duplicate strings

Fig 12.

Character-length distribution for description.

Character-length distribution for description (mean: 3.4617591370703997).
chars	count
0 – 6	65286
6 – 13	422
13 – 19	544
19 – 26	501
26 – 32	458
32 – 38	437
38 – 45	344
45 – 51	218
51 – 57	205
57 – 64	170
64 – 70	229
70 – 76	85
76 – 83	70
83 – 89	60
89 – 96	54
96 – 102	48
102 – 108	51
108 – 115	42
115 – 121	34
121 – 128	34
128 – 134	26
134 – 140	24
140 – 147	23
147 – 153	24
153 – 159	19
159 – 166	13
166 – 172	27
172 – 178	13
178 – 185	18
185 – 191	21
191 – 198	20
198 – 204	12
204 – 210	23
210 – 217	18
217 – 223	16
223 – 230	9
230 – 236	8
236 – 242	17
242 – 249	22
249 – 255	71

access categorical feature

This column is an OpenStreetMap-style 'access' tag describing who may enter a feature, with values like 'yes', 'no', 'private', 'permit', 'permissive', 'customers', and 'destination'. The dominant signal is that 89.7% of rows (62,551 of 69,716) carry an empty string rather than a proper null, a data quality concern because missing access information is stored as blank text instead of NULL. The blank counts as one of the 20 distinct values, leaving 19 real categories across 7,165 tagged rows. Entropy over all 20 values is low (0.707 bits, 16.4% of the 4.32-bit maximum for 20 categories), confirming the distribution is heavily concentrated. The blank dominance will mislead cardinality and frequency analyses unless blanks are recoded as missing.

Treatment: Recode empty strings to NaN, then one-hot or ordinal encode the remaining access-permission categories.
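
The entropy figures can be reproduced from the top-values table, which also shows how much the blank value distorts them. A minimal sketch, assuming saturn reports Shannon entropy in bits and divides by log2 of the cardinality for `entropy_ratio`; the table reproduces the reported 0.7067 and 0.1635 under that reading, but the saturn source was not consulted and the function names are illustrative:

```python
import math

# Counts from the "Top values for access" table; "" is the blank value.
ACCESS_COUNTS = {
    "": 62551, "yes": 2717, "no": 2266, "private": 815, "permit": 575,
    "permissive": 440, "customers": 271, "unknown": 47, "destination": 11,
    "restricted": 9, "tidal": 2, "request": 2, "key": 2, "discouraged": 2,
    "designated": 1, "official": 1, "forestry": 1, "agricultural": 1,
    "guided": 1, "university": 1,
}


def shannon_entropy_bits(counts: dict[str, int]) -> float:
    """Shannon entropy of a count table, in bits."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def entropy_ratio(counts: dict[str, int]) -> float:
    """Entropy divided by log2(cardinality), the maximum for that many values."""
    k = len(counts)
    return shannon_entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


entropy = shannon_entropy_bits(ACCESS_COUNTS)
ratio = entropy_ratio(ACCESS_COUNTS)

# Recode blanks as missing by dropping them, then measure the tagged rows.
tagged = {value: count for value, count in ACCESS_COUNTS.items() if value != ""}
tagged_entropy = shannon_entropy_bits(tagged)
print(f"all rows: {entropy:.4f} bits, tagged rows only: {tagged_entropy:.4f} bits")
```

Once blanks are treated as missing, the entropy of the remaining 19 categories is noticeably higher, which is the practical reason to recode before any frequency analysis.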

anthropic:default · confidence high

Out[28]:

saturn.columns["access"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	20
top_value
top_rate	0.8972
cardinality	20
entropy	0.7067
entropy_ratio	0.1635

Fig 13.

Top values for access. The table is identical to the one under Fig 1 (20 unique shown, of 20 total).

tourism categorical feature

This column is an OpenStreetMap-style 'tourism' tag classifying geographic features by their tourism type (attraction, viewpoint, camp_site, museum, etc.). It is severely imbalanced: 97.2% of the 69,716 rows carry an empty string, meaning the tourism tag is absent, leaving only 1,940 rows with a value. The entropy ratio of 0.050 confirms near-zero informational content across the full dataset. The blank counts as one of the 18 distinct values, so there are 17 non-empty categories; most tagged rows use 'attraction' (1,670), and a few values ('caves', 'cave', 'register', 'guestbook') look like ad hoc tagging. The rarity reflects an optional tag that is sparse by design rather than a data quality failure.

Treatment: Treat empty string as missing/absent tag; consider binarising (tourism present vs. absent) or one-hot encoding the non-empty subset, given extreme sparsity.
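
Both options in the treatment (a presence flag and a one-hot encoding of the non-empty subset) can be combined in one small encoder. This is a sketch only: `encode_tourism` and `KEPT_CATEGORIES` are hypothetical names, and folding everything except 'attraction' and 'viewpoint' into an "other" column is an illustrative cut-off, not a recommendation from the profile:

```python
from typing import Optional

# Categories kept as their own indicator; everything else non-empty
# is folded into "other". The cut-off is an illustrative choice.
KEPT_CATEGORIES = ("attraction", "viewpoint")


def encode_tourism(value: Optional[str]) -> dict[str, int]:
    """Map one raw tourism value to presence and one-hot indicator columns."""
    tag = (value or "").strip()
    row = {"has_tourism": int(bool(tag))}
    for category in KEPT_CATEGORIES:
        row[f"tourism_{category}"] = int(tag == category)
    row["tourism_other"] = int(bool(tag) and tag not in KEPT_CATEGORIES)
    return row
```

Folding the rare values avoids creating fifteen indicator columns that would each be non-zero for at most 105 rows.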

anthropic:default · confidence high

Out[31]:

saturn.columns["tourism"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	18
top_value
top_rate	0.9722
cardinality	18
entropy	0.2076
entropy_ratio	0.04979
alert: imbalance	top value is 97.2% of rows

Fig 14.

Top values for tourism. The table is identical to the one under Fig 2 (18 unique shown, of 18 total).

wikipedia text metadata

This column stores Wikipedia article links in a 'language-code:article-title' format, used to cross-reference cave or geological features to their Wikipedia pages across multiple languages (German, French, Spanish, Italian, Catalan, Polish, etc.). The overwhelming signal is sparsity: 67,531 of 69,716 rows (96.9%) are empty strings, which also drives the 97.0% duplicate rate (67,639 duplicates). Only 2,185 rows carry a link, spread over 2,076 distinct non-empty values. The text statistics (one_word_rate 0.976, len_mean 0.636) are computed over all rows and therefore mostly reflect the blanks; the histogram places most non-empty values between 11 and 26 characters. The vocabulary covers 940 unique terms across several language prefixes.

Treatment: Filter empty strings before use; parse language prefix and article slug as separate fields for any multilingual Wikipedia lookup or join.
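
Splitting the prefix from the article title might look like the sketch below. The function names are hypothetical, the sample values in the tests are illustrative rather than taken from the dataset, and the URL builder assumes the usual `<lang>.wikipedia.org/wiki/<Title>` layout:

```python
from typing import Optional
from urllib.parse import quote


def parse_wikipedia_tag(value: str) -> Optional[tuple[Optional[str], str]]:
    """Split 'lang:Title' into (lang, title); None for blank values.

    Values without a usable prefix are returned with lang=None.
    """
    text = value.strip()
    if not text:
        return None
    # Split on the first colon only, so titles may contain colons.
    lang, sep, title = text.partition(":")
    if not sep or not lang or not title:
        return (None, text)
    return (lang.lower(), title)


def wikipedia_url(lang: str, title: str) -> str:
    """Build an article URL, assuming the usual <lang>.wikipedia.org layout."""
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"
```

Returning `None` for blanks keeps the 67,531 empty rows out of any downstream join without a separate filtering pass.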

anthropic:default · confidence high

Out[34]:

saturn.columns["wikipedia"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	2,077
len_min	0
len_max	114
len_mean	0.636
len_median	0
len_p95	0
word_mean	1.044
word_median	1
n_empty	67,531
n_duplicates	67,639
duplicate_rate	0.9702
vocab_size	940
readability_flesch_mean	-0.02485
emoji_rate	0
url_rate	0
one_word_rate	0.9756
allcaps_rate	0
boilerplate_rate	0
alert: one_word	97.6% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.0% duplicate strings

Fig 15.

Character-length distribution for wikipedia.

Character-length distribution for wikipedia (mean: 0.6359802627804234).
chars	count
0 – 3	67531
3 – 6	0
6 – 9	42
9 – 11	85
11 – 14	252
14 – 17	402
17 – 20	335
20 – 23	445
23 – 26	243
26 – 28	131
28 – 31	102
31 – 34	72
34 – 37	33
37 – 40	13
40 – 43	12
43 – 46	6
46 – 48	4
48 – 51	1
51 – 54	1
54 – 57	1
57 – 60	3
60 – 63	0
63 – 66	1
66 – 68	0
68 – 71	0
71 – 74	0
74 – 77	0
77 – 80	0
80 – 83	0
83 – 86	0
86 – 88	0
88 – 91	0
91 – 94	0
94 – 97	0
97 – 100	0
100 – 103	0
103 – 105	0
105 – 108	0
108 – 111	0
111 – 114	1

website text metadata

This column contains optional website/URL references associated with dataset records (likely natural features, caves, or protected sites given the UNESCO, speleology, and nature-registry URLs visible in top values). The dominant signal is near-total sparsity: 67,082 of 69,716 rows (96.2%) are empty strings, yielding a duplicate rate of 96.4% and a len_median of 0. When a URL is present, it appears only 2–19 times at most, suggesting these are editorial citations rather than structured foreign keys. The url_rate of 3.8% is measured against all rows: it corresponds to about 2,632 rows, nearly all of the 2,634 non-empty values, so the populated entries are almost entirely parsed as URLs.

Treatment: Exclude from modelling; optionally binarise into a 'has_website' flag or extract domain as a sparse categorical feature.
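
The 'has_website' flag and the domain feature can both be derived with the standard library. A minimal sketch; `website_features` is a hypothetical helper, and treating scheme-less values as bare hosts is an assumption about how such entries should be read:

```python
from typing import Optional
from urllib.parse import urlsplit


def website_features(value: str) -> tuple[bool, Optional[str]]:
    """Return (has_website, domain) for one raw website value."""
    text = value.strip()
    if not text:
        return (False, None)
    try:
        # urlsplit only fills in the host when a scheme or leading '//' is present.
        parts = urlsplit(text if "://" in text else f"//{text}")
        domain = parts.hostname  # lower-cased, without port or credentials
    except ValueError:
        domain = None  # malformed value, e.g. an unbalanced IPv6 bracket
    return (True, domain)
```

Keeping the flag separate from the domain means a malformed URL still counts as "has a website" while contributing no domain category.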

anthropic:default · confidence high

Out[37]:

saturn.columns["website"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	2,492
len_min	0
len_max	255
len_mean	2.789
len_median	0
len_p95	0
word_mean	1
word_median	1
n_empty	67,082
n_duplicates	67,224
duplicate_rate	0.9643
vocab_size	737
readability_flesch_mean	-24.04
emoji_rate	0
url_rate	0.03775
one_word_rate	1
allcaps_rate	1.434e-05
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	96.4% duplicate strings

Fig 16.

Character-length distribution for website.

Character-length distribution for website (mean: 2.7889580584084).
chars	count
0 – 6	67082
6 – 13	0
13 – 19	11
19 – 26	45
26 – 32	130
32 – 38	161
38 – 45	109
45 – 51	118
51 – 57	86
57 – 64	101
64 – 70	223
70 – 76	45
76 – 83	49
83 – 89	1360
89 – 96	127
96 – 102	16
102 – 108	14
108 – 115	8
115 – 121	2
121 – 128	5
128 – 134	5
134 – 140	0
140 – 147	5
147 – 153	1
153 – 159	6
159 – 166	0
166 – 172	0
172 – 178	0
178 – 185	1
185 – 191	1
191 – 198	0
198 – 204	0
204 – 210	0
210 – 217	1
217 – 223	2
223 – 230	0
230 – 236	1
236 – 242	0
242 – 249	0
249 – 255	1

cave:length categorical feature

This column represents a cave length measurement, stored as a categorical/string type rather than numeric. The dominant signal is that 99.08% of 69,716 rows (69,074) carry an empty string, so cave length is recorded for only 642 rows across 237 distinct non-empty values (the blank is the 238th). The non-empty values appear to be numeric strings (e.g., '5', '6', '10', '20'), suggesting this field is sparsely populated and was likely intended as a numeric attribute.

Treatment: Treat empty strings as missing; cast remaining values to numeric; expect ~99% missingness — impute or use as a sparse indicator feature only.
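
Casting to numeric while treating blanks as missing can be sketched as below, using the 19 non-empty rows of the top-values table as sample input. `parse_measure` is a hypothetical helper; it assumes bare numeric strings and maps anything else (including values with unit suffixes, if any exist in the full data) to None:

```python
import math
from typing import Optional

# Non-empty rows from the "Top values for cave:length" table (value, count).
TOP_LENGTHS = [
    ("5", 32), ("6", 26), ("10", 25), ("3", 23), ("4", 23), ("7", 20),
    ("8", 19), ("15", 16), ("20", 14), ("12", 13), ("30", 13), ("2", 11),
    ("11", 9), ("60", 8), ("4.5", 8), ("13", 8), ("16", 8), ("17", 8),
    ("25", 8),
]


def parse_measure(raw: str) -> Optional[float]:
    """Cast a numeric string to float; blanks and unparseable text become None."""
    text = raw.strip()
    if not text:
        return None
    try:
        number = float(text)
    except ValueError:
        return None
    return number if math.isfinite(number) else None


listed_total = sum(count for _, count in TOP_LENGTHS)
listed_under_20 = sum(
    count for raw, count in TOP_LENGTHS
    if (parsed := parse_measure(raw)) is not None and parsed < 20
)
print(f"{listed_under_20} of {listed_total} listed rows are under 20")
```

Among the listed values, 249 of 292 rows are under 20; the other 350 populated rows sit in values the table does not show, so this is only a partial view of the measured caves.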

anthropic:default · confidence high

Out[40]:

saturn.columns["cave:length"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	238
top_value
top_rate	0.9908
cardinality	238
entropy	0.1392
entropy_ratio	0.01763
alert: long_tail	158 singleton categories
alert: imbalance	top value is 99.1% of rows

Fig 17.

Top values for cave:length. The table is identical to the one under Fig 5 (20 unique shown, of 238 total).

cave:depth categorical feature

This column represents the depth attribute of a cave feature, stored as a categorical field containing numeric depth values (in unspecified units). It is almost entirely empty: 69,419 of 69,716 rows (99.57%) are blank strings, leaving only 297 rows with any actual depth value across 123 distinct non-empty values. The extreme sparsity and near-zero entropy ratio (0.0092) signal this is a rarely-populated optional attribute, not a reliable analytical field in its current form.

Treatment: Filter to non-empty rows before use; convert to numeric and consider imputation or indicator flag for missingness given 99.57% blank rate.

anthropic:default · confidence high

Out[43]:

saturn.columns["cave:depth"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	124
top_value
top_rate	0.9957
cardinality	124
entropy	0.06432
entropy_ratio	0.00925
alert: long_tail	87 singleton categories
alert: imbalance	top value is 99.6% of rows

Fig 18.

Top values for cave:depth.

Top values for cave:depth (20 unique shown, of 124 total).
value	count	share
	69419	99.6%
0	63	0.1%
10	13	0.0%
3	11	0.0%
1	9	0.0%
5	9	0.0%
4	8	0.0%
25	7	0.0%
30	6	0.0%
6	6	0.0%
2	6	0.0%
11	5	0.0%
35	5	0.0%
28	4	0.0%
14	4	0.0%
40	4	0.0%
70	4	0.0%
12	3	0.0%
8	3	0.0%
15	3	0.0%

country categorical feature

This column is intended to capture a 2-letter ISO country code for each record. However, it is effectively empty: 69,684 of 69,716 rows (99.95%) contain a blank string rather than a real value, making the column nearly useless as a feature. Only 32 rows carry a real country code spread across 15 distinct values, with no single country dominating — the most common non-blank code is 'KY' with just 7 occurrences. The extreme imbalance (top_rate 0.9995) and near-zero entropy (0.0074) confirm this column carries virtually no information.

Treatment: Drop or flag as data-quality issue; if retained, treat blank as missing and note that only 32 rows have usable country data.

anthropic:default · confidence high

Out[46]:

saturn.columns["country"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	16
top_value
top_rate	0.9995
cardinality	16
entropy	0.00736
entropy_ratio	0.00184
alert: imbalance	top value is 100.0% of rows

Fig 19.

Top values for country.

Top values for country (16 unique shown, of 16 total).
value	count	share
	69684	100.0%
KY	7	0.0%
AU	6	0.0%
DE	3	0.0%
RO	2	0.0%
US	2	0.0%
EC	2	0.0%
IT	2	0.0%
GB	1	0.0%
MT	1	0.0%
CH	1	0.0%
GR	1	0.0%
ET	1	0.0%
LB	1	0.0%
VN	1	0.0%
PT	1	0.0%
