data trove caves worldwide

saturn notebook · generated 2026-06-22

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/caves.json

Saturn profiled 69,716 rows across 12 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/caves.json",
    "--findings", "data-trove-caves-worldwide.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This dataset is a global registry of 69,716 cave entries, likely sourced from OpenStreetMap, containing geographic coordinates, names, access rules, and optional metadata such as depth, length, and tourism classification. The most striking issue is extreme sparsity: most records have an empty description (93.5%), website (96.2%), wikipedia link (96.9%), depth (99.6%), and length (99.1%), so a typical cave is little more than a name and a pin on a map. Because these blanks are stored as empty strings, every column still reports a 0.0% null rate. About 28% of entries (19,527 records) carry the placeholder name 'Unnamed Cave', a completeness problem worth investigating before any analysis. Geographic coverage skews heavily toward Europe: the latitude median is ~44°N with a tight interquartile range, while the longitude outliers indicate a global but uneven spread. Among the roughly 10% of caves with access tags, the split between open ('yes'), restricted ('no'), and 'private' is worth exploring for any public-access analysis.

citing: row_count · column_count · name.top_values · name.stats.n_duplicates · description.stats.n_empty · website.stats.n_empty · wikipedia.stats.n_empty · cave:depth.stats.top_rate · cave:length.stats.top_rate · lat.stats.median · lat.stats.iqr · lon.stats.outlier_rate · access.top_values · tourism.top_values

Out[4]:

saturn.schema() · 12 columns

column       kind         n       null%  unique  alerts
id           numeric      69,716  0.0%   69,716  -
name         text         69,716  0.0%   45,229  duplicates
lat          numeric      69,716  0.0%   69,544  high_skew, outliers
lon          numeric      69,716  0.0%   69,585  outliers
description  text         69,716  0.0%   3,705   one_word, short_text, duplicates
access       categorical  69,716  0.0%   20      -
tourism      categorical  69,716  0.0%   18      imbalance
wikipedia    text         69,716  0.0%   2,077   one_word, short_text, duplicates
website      text         69,716  0.0%   2,492   one_word, short_text, duplicates
cave:length  categorical  69,716  0.0%   238     long_tail, imbalance
cave:depth   categorical  69,716  0.0%   124     long_tail, imbalance
country      categorical  69,716  0.0%   16      imbalance
Fig 1.
access · Look at the breakdown between open, private, permit, and no-access caves among the ~10% of entries that carry an access tag.
Top values for access (20 unique shown, of 20 total).
value         count  share
(empty)       62551  89.7%
yes           2717   3.9%
no            2266   3.3%
private       815    1.2%
permit        575    0.8%
permissive    440    0.6%
customers     271    0.4%
unknown       47     0.1%
destination   11     0.0%
restricted    9      0.0%
tidal         2      0.0%
request       2      0.0%
key           2      0.0%
discouraged   2      0.0%
designated    1      0.0%
official      1      0.0%
forestry      1      0.0%
agricultural  1      0.0%
guided        1      0.0%
university    1      0.0%
Fig 2.
tourism · Nearly all tagged tourism values fall under 'attraction' or 'viewpoint' — this shows how few caves are formally designated for visitors.
Top values for tourism (18 unique shown, of 18 total).
value              count  share
(empty)            67776  97.2%
attraction         1670   2.4%
viewpoint          119    0.2%
yes                105    0.2%
camp_site          7      0.0%
museum             6      0.0%
register           6      0.0%
artwork            6      0.0%
information        5      0.0%
picnic_site        4      0.0%
cave_entrance      3      0.0%
caves              2      0.0%
wilderness_hut     2      0.0%
attraction;museum  1      0.0%
guestbook          1      0.0%
no                 1      0.0%
cave               1      0.0%
hotel              1      0.0%
Fig 3.
name · Check the distribution of name lengths to see how many entries are generic single-word labels versus descriptively named caves.
Character-length distribution for name (mean: 15.52237649893855).
chars        count
1 – 5        1601
5 – 8        3457
8 – 12       5854
12 – 15      31600
15 – 19      9877
19 – 22      8424
22 – 26      3274
26 – 29      2614
29 – 33      1175
33 – 36      885
36 – 40      482
40 – 44      175
44 – 47      154
47 – 51      56
51 – 54      45
54 – 58      17
58 – 61      11
61 – 65      3
65 – 68      2
68 – 72      2
72 – 76      4
76 – 79      2
79 – 83      0
83 – 86      1
86 – 90      0
90 – 93      0
93 – 97      0
97 – 100     0
100 – 104    0
104 – 108    0
108 – 111    0
111 – 115    0
115 – 118    0
118 – 122    0
122 – 125    0
125 – 129    0
129 – 132    0
132 – 136    0
136 – 139    0
139 – 143    1
Fig 4.
lat · The latitude distribution reveals a strong concentration around 40–48°N (Europe), with a long tail of outliers across other continents.
Histogram bins for lat (median: 44.14308795).
bin                count
-77.98 – -74.07    4
-74.07 – -70.17    0
-70.17 – -66.26    3
-66.26 – -62.36    0
-62.36 – -58.46    1
-58.46 – -54.55    2
-54.55 – -50.65    13
-50.65 – -46.74    26
-46.74 – -42.84    60
-42.84 – -38.94    127
-38.94 – -35.03    329
-35.03 – -31.13    330
-31.13 – -27.23    225
-27.23 – -23.32    257
-23.32 – -19.42    301
-19.42 – -15.51    204
-15.51 – -11.61    78
-11.61 – -7.705    139
-7.705 – -3.801    314
-3.801 – 0.1026    183
0.1026 – 4.007     118
4.007 – 7.911      358
7.911 – 11.81      287
11.81 – 15.72      385
15.72 – 19.62      1856
19.62 – 23.53      937
23.53 – 27.43      775
27.43 – 31.33      1082
31.33 – 35.24      1231
35.24 – 39.14      3928
39.14 – 43.05      12068
43.05 – 46.95      20377
46.95 – 50.85      19250
50.85 – 54.76      2492
54.76 – 58.66      1075
58.66 – 62.57      564
62.57 – 66.47      267
66.47 – 70.37      47
70.37 – 74.28      19
74.28 – 78.18      4
Fig 5.
cave:length · Among the small fraction of caves with a recorded length, see whether short caves (under 20m) dominate the measured entries.
Top values for cave:length (20 unique shown, of 238 total).
value    count  share
(empty)  69074  99.1%
5        32     0.0%
6        26     0.0%
10       25     0.0%
3        23     0.0%
4        23     0.0%
7        20     0.0%
8        19     0.0%
15       16     0.0%
20       14     0.0%
12       13     0.0%
30       13     0.0%
2        11     0.0%
11       9      0.0%
60       8      0.0%
4.5      8      0.0%
13       8      0.0%
16       8      0.0%
17       8      0.0%
25       8      0.0%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column       kind         null %
id           numeric      0.0%
name         text         0.0%
lat          numeric      0.0%
lon          numeric      0.0%
description  text         0.0%
access       categorical  0.0%
tourism      categorical  0.0%
wikipedia    text         0.0%
website      text         0.0%
cave:length  categorical  0.0%
cave:depth   categorical  0.0%
country      categorical  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
      id     lat    lon
id    +1.00  -0.04  -0.06
lat   -0.04  +1.00  -0.15
lon   -0.06  -0.15  +1.00
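
For readers who want to see what a Pearson coefficient like the ones above measures, here is a minimal sketch. It is not Saturn's code: the `pearson` helper is hypothetical, the sample coordinates are invented, and Saturn's sampling and bounding steps are not reproduced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical coordinates, not taken from the dataset.
lat = [46.2, 44.1, 47.8, 37.2, -33.8]
lon = [11.4, 10.9, 13.0, -86.1, 150.0]

r = pearson(lat, lon)
print(round(r, 2))
print(round(pearson(lat, lat), 2))  # a column against itself gives the +1.00 diagonal
```

Values near zero, such as the -0.04 to -0.15 reported above, indicate almost no linear relationship between the columns.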

id numeric identifier

This column is a numeric unique identifier — every one of the 69,716 rows has a distinct value with zero nulls, confirming a true primary key role. The values span a wide range (12,788,558 to 13,536,013,182) with a near-zero skew (0.0196) and a slightly platykurtic distribution (kurtosis −1.18), consistent with IDs assigned from a large, sparsely populated keyspace rather than a simple auto-increment sequence. The IQR of ~6.3 billion against a full range of ~13.5 billion indicates values are spread broadly across the domain with no clustering or outliers detected.

Treatment: Drop before modelling; use as row key for joins or deduplication only.

anthropic:default · confidence high
Out[13]:

saturn.columns["id"].stats

stat          value
n             69,716
nulls         0 (0.0%)
unique        69,716
min           1.279e+07
max           1.354e+10
mean          6.842e+09
median        7.007e+09
std           3.774e+09
q1            3.61e+09
q3            9.948e+09
iqr           6.338e+09
skew          0.01955
kurtosis      -1.177
n_outliers    0
outlier_rate  0
zero_rate     0
Fig 8.
Distribution of id. Vertical dash marks the median.
Histogram bins for id (median: 7006519241.0).
bin                      count
1.279e+07 – 3.509e+08    592
3.509e+08 – 6.889e+08    1123
6.889e+08 – 1.027e+09    2360
1.027e+09 – 1.365e+09    1819
1.365e+09 – 1.703e+09    1497
1.703e+09 – 2.041e+09    1907
2.041e+09 – 2.379e+09    1345
2.379e+09 – 2.717e+09    1415
2.717e+09 – 3.056e+09    1806
3.056e+09 – 3.394e+09    1977
3.394e+09 – 3.732e+09    2439
3.732e+09 – 4.07e+09     1595
4.07e+09 – 4.408e+09     1710
4.408e+09 – 4.746e+09    2212
4.746e+09 – 5.084e+09    2764
5.084e+09 – 5.422e+09    1741
5.422e+09 – 5.76e+09     1236
5.76e+09 – 6.098e+09     1268
6.098e+09 – 6.436e+09    2153
6.436e+09 – 6.774e+09    1135
6.774e+09 – 7.112e+09    2343
7.112e+09 – 7.451e+09    1932
7.451e+09 – 7.789e+09    2060
7.789e+09 – 8.127e+09    1970
8.127e+09 – 8.465e+09    1451
8.465e+09 – 8.803e+09    1229
8.803e+09 – 9.141e+09    1689
9.141e+09 – 9.479e+09    1952
9.479e+09 – 9.817e+09    2902
9.817e+09 – 1.016e+10    1840
1.016e+10 – 1.049e+10    617
1.049e+10 – 1.083e+10    1370
1.083e+10 – 1.117e+10    1613
1.117e+10 – 1.151e+10    3298
1.151e+10 – 1.185e+10    1230
1.185e+10 – 1.218e+10    1741
1.218e+10 – 1.252e+10    1317
1.252e+10 – 1.286e+10    2065
1.286e+10 – 1.32e+10     1532
1.32e+10 – 1.354e+10     1471

name text label

This column contains the proper names of caves or underground features in a speleological dataset, with multilingual entries spanning English, French, Italian, Spanish, German, and Russian. The most striking signal is that 'Unnamed Cave' appears 19,527 times — roughly 28% of all 69,716 rows — driving a duplicate rate of 35.1% (24,487 duplicates) even though 45,229 unique values exist. The vocabulary mix (grotta, grotte, cueva, cova, Грот) confirms international coverage, and the name is clearly a label, not a unique identifier.

Treatment: Use as a display label only; do not treat as a unique key — normalise language variants and 'Unnamed Cave' placeholders before any grouping or matching.
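
The placeholder handling in this treatment can be sketched as follows. This is a minimal illustration rather than Saturn's implementation: `normalise_name` and `duplicate_count` are hypothetical helpers, the sample names are invented, and it assumes duplicates are counted as rows minus distinct values, which is consistent with the reported 69,716 - 45,229 = 24,487.

```python
from collections import Counter

PLACEHOLDER = "unnamed cave"  # assumption: the only placeholder spelling to neutralise


def normalise_name(name):
    """Return None for placeholder names, otherwise a whitespace-trimmed label."""
    cleaned = " ".join(name.split())
    return None if cleaned.lower() == PLACEHOLDER else cleaned


def duplicate_count(values):
    """Rows minus distinct values (assumed definition of n_duplicates)."""
    return len(values) - len(set(values))


# Hypothetical sample rows, not taken from the dataset.
names = ["Unnamed Cave", "Grotta Azzurra", "unnamed cave ", "Cueva del Agua", "Grotta Azzurra"]
real_names = [n for n in map(normalise_name, names) if n is not None]

print(duplicate_count(real_names))         # duplicates among real names only
print(Counter(real_names).most_common(1))  # most repeated real name

# The same arithmetic applied to the stats reported for this column.
n_rows, n_unique = 69_716, 45_229
n_duplicates = n_rows - n_unique
print(n_duplicates, round(n_duplicates / n_rows, 4))
```

Dropping the placeholder before grouping keeps 'Unnamed Cave' from appearing as one very large, meaningless group.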

anthropic:default · confidence high
Out[16]:

saturn.columns["name"].stats

stat                     value
n                        69,716
nulls                    0 (0.0%)
unique                   45,229
len_min                  1
len_max                  143
len_mean                 15.52
len_median               13
len_p95                  29
word_mean                2.519
word_median              2
n_empty                  0
n_duplicates             24,487
duplicate_rate           0.3512
vocab_size               15,219
readability_flesch_mean  55.51
emoji_rate               1.434e-05
url_rate                 0
one_word_rate            0.1545
allcaps_rate             0.03242
boilerplate_rate         0
alert: duplicates        35.1% duplicate strings
Fig 9.
Character-length distribution for name.
Character-length distribution for name (mean: 15.52237649893855).
chars        count
1 – 5        1601
5 – 8        3457
8 – 12       5854
12 – 15      31600
15 – 19      9877
19 – 22      8424
22 – 26      3274
26 – 29      2614
29 – 33      1175
33 – 36      885
36 – 40      482
40 – 44      175
44 – 47      154
47 – 51      56
51 – 54      45
54 – 58      17
58 – 61      11
61 – 65      3
65 – 68      2
68 – 72      2
72 – 76      4
76 – 79      2
79 – 83      0
83 – 86      1
86 – 90      0
90 – 93      0
93 – 97      0
97 – 100     0
100 – 104    0
104 – 108    0
108 – 111    0
111 – 115    0
115 – 118    0
118 – 122    0
122 – 125    0
125 – 129    0
129 – 132    0
132 – 136    0
136 – 139    0
139 – 143    1

lat numeric feature

This column is a geographic latitude, spanning from -77.98° (southern hemisphere) to 78.18° (high Arctic), with the bulk of records clustered between ~40.5° and ~47.75° (IQR of 7.26°). On its own that band would fit either North America or southern Europe; read together with the lon median of 11.38°, it points to a predominantly European dataset. The strong negative skew (-3.16) and high kurtosis (11.50) indicate a heavy left tail: 8,996 rows (12.9%) are flagged as outliers, likely locations far outside the core geographic cluster. With 69,544 unique values across 69,716 rows and zero nulls, near-uniqueness suggests these are precise point coordinates rather than binned regions.

Treatment: Retain as-is for geospatial modelling; investigate the 8,996 outlier rows for data quality issues or legitimate geographic subpopulations before regression or clustering.
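
The 'rows beyond 1.5 IQR' alert corresponds to the usual Tukey fence rule. The sketch below applies that rule to the quartiles reported for lat. Whether Saturn computes its fences in exactly this way is an assumption, `tukey_fences` is a hypothetical helper, and the sample latitudes are invented.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) bounds; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for lat in this notebook.
lower, upper = tukey_fences(40.49, 47.75)
print(round(lower, 2), round(upper, 2))

# Hypothetical latitudes: one inside the core band, two outside the fences.
sample = [44.1, -33.9, 64.2]
outliers = [v for v in sample if v < lower or v > upper]
print(outliers)
```

With these quartiles the fences fall near 29.6°N and 58.6°N, so every cave in the tropics or the southern hemisphere is flagged. That supports treating most of the flagged rows as legitimate geographic subpopulations rather than as errors.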

anthropic:default · confidence high
Out[19]:

saturn.columns["lat"].stats

stat              value
n                 69,716
nulls             0 (0.0%)
unique            69,544
min               -77.98
max               78.18
mean              40.58
median            44.14
std               15.48
q1                40.49
q3                47.75
iqr               7.256
skew              -3.161
kurtosis          11.5
n_outliers        8,996
outlier_rate      0.129
zero_rate         0
alert: high_skew  skew=-3.16
alert: outliers   12.9% rows beyond 1.5 IQR
Fig 10.
Distribution of lat. Vertical dash marks the median.
Histogram bins for lat (median: 44.14308795).
bin                count
-77.98 – -74.07    4
-74.07 – -70.17    0
-70.17 – -66.26    3
-66.26 – -62.36    0
-62.36 – -58.46    1
-58.46 – -54.55    2
-54.55 – -50.65    13
-50.65 – -46.74    26
-46.74 – -42.84    60
-42.84 – -38.94    127
-38.94 – -35.03    329
-35.03 – -31.13    330
-31.13 – -27.23    225
-27.23 – -23.32    257
-23.32 – -19.42    301
-19.42 – -15.51    204
-15.51 – -11.61    78
-11.61 – -7.705    139
-7.705 – -3.801    314
-3.801 – 0.1026    183
0.1026 – 4.007     118
4.007 – 7.911      358
7.911 – 11.81      287
11.81 – 15.72      385
15.72 – 19.62      1856
19.62 – 23.53      937
23.53 – 27.43      775
27.43 – 31.33      1082
31.33 – 35.24      1231
35.24 – 39.14      3928
39.14 – 43.05      12068
43.05 – 46.95      20377
46.95 – 50.85      19250
50.85 – 54.76      2492
54.76 – 58.66      1075
58.66 – 62.57      564
62.57 – 66.47      267
66.47 – 70.37      47
70.37 – 74.28      19
74.28 – 78.18      4

lon numeric feature

This column contains geographic longitude values, ranging from -178.0 to 178.8 degrees, consistent with a global coordinate field. Nearly all 69,716 rows are unique (69,585 distinct values) with zero nulls, indicating high-precision GPS or geocoded data. The distribution is heavily concentrated in a narrow band (IQR of ~17 degrees, Q1=1.24, Q3=18.18), suggesting the bulk of records cluster around Europe/Africa longitudes, while 16.2% of values (11,302 rows) are flagged as outliers — likely legitimate points in the Americas, East Asia, or Oceania that fall far from this central cluster. The elevated kurtosis (4.51) and large std (40.5) relative to the IQR confirm this heavy-tailed, multimodal geographic spread.

Treatment: Retain as-is for geospatial modelling; consider pairing with latitude and clustering by region to handle the multimodal distribution before feeding into distance-based models.
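
One simple way to pair lon with lat and group points by region is a coarse coordinate grid. The sketch below is a stand-in for a real clustering algorithm, not a recommendation from the report: `grid_cell` is a hypothetical helper, the 10-degree cell size is an arbitrary choice, and the sample points are invented.

```python
import math
from collections import Counter


def grid_cell(lat, lon, size_deg=10.0):
    """Map a coordinate to the south-west corner of its size_deg x size_deg cell."""
    return (math.floor(lat / size_deg) * size_deg,
            math.floor(lon / size_deg) * size_deg)


# Hypothetical points: three in central Europe, one in North America, one in Australia.
points = [(46.2, 11.4), (44.1, 10.9), (47.8, 13.0), (37.2, -86.1), (-33.8, 150.0)]
cells = Counter(grid_cell(lat, lon) for lat, lon in points)

for cell, count in cells.most_common():
    print(cell, count)
```

Cell labels like these can serve as a categorical region feature, or as a first pass before a proper distance-based method is run within each region.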

anthropic:default · confidence high
Out[22]:

saturn.columns["lon"].stats

stat             value
n                69,716
nulls            0 (0.0%)
unique           69,585
min              -178
max              178.8
mean             12.03
median           11.38
std              40.5
q1               1.245
q3               18.18
iqr              16.94
skew             0.2755
kurtosis         4.509
n_outliers       11,302
outlier_rate     0.1621
zero_rate        0
alert: outliers  16.2% rows beyond 1.5 IQR
Fig 11.
Distribution of lon. Vertical dash marks the median.
Histogram bins for lon (median: 11.37646115).
bin                count
-178 – -169.1      54
-169.1 – -160.2    6
-160.2 – -151.2    49
-151.2 – -142.3    20
-142.3 – -133.4    6
-133.4 – -124.5    18
-124.5 – -115.6    352
-115.6 – -106.6    454
-106.6 – -97.72    283
-97.72 – -88.8     361
-88.8 – -79.88     579
-79.88 – -70.96    1877
-70.96 – -62.04    236
-62.04 – -53.12    168
-53.12 – -44.2     252
-44.2 – -35.28     203
-35.28 – -26.36    26
-26.36 – -17.44    182
-17.44 – -8.523    766
-8.523 – 0.3967    9279
0.3967 – 9.317     14696
9.317 – 18.24      22468
18.24 – 27.16      8156
27.16 – 36.08      1754
36.08 – 45         1346
45 – 53.92         530
53.92 – 62.84      479
62.84 – 71.76      150
71.76 – 80.68      381
80.68 – 89.6       210
89.6 – 98.52       212
98.52 – 107.4      1114
107.4 – 116.4      956
116.4 – 125.3      696
125.3 – 134.2      395
134.2 – 143.1      363
143.1 – 152        330
152 – 161          46
161 – 169.9        33
169.9 – 178.8      230

description text free_text

This column is a short free-text description field for what appears to be a European cave/karst cadastral dataset, with entries in German, French, Spanish, and English. It is overwhelmingly empty: 65,189 of 69,716 rows (93.5%) are blank strings, making it near-useless as a feature in its current form. The duplicate rate is 94.7% (66,011 duplicates), driven almost entirely by those empty strings. The length stats are computed over all rows, so the blanks dominate them (median length 0, p95 only 19 characters); the populated values include brief directional or categorical labels like 'unterer Eingang' or 'nicht katasterwürdig' and run up to 255 characters. The multilingual vocabulary and modest cardinality (3,705 unique values across a 4,651-word vocab) suggest this is an optional annotation field rather than a structured attribute.

Treatment: Exclude empty strings, then optionally use as a categorical label or tokenize non-empty values for NLP; do not use as a predictive feature without imputation strategy for the 93.5% blanks.

anthropic:default · confidence high
Out[25]:

saturn.columns["description"].stats

stat                     value
n                        69,716
nulls                    0 (0.0%)
unique                   3,705
len_min                  0
len_max                  255
len_mean                 3.462
len_median               0
len_p95                  19
word_mean                1.455
word_median              1
n_empty                  65,189
n_duplicates             66,011
duplicate_rate           0.9469
vocab_size               4,651
readability_flesch_mean  1.752
emoji_rate               0
url_rate                 0.0006311
one_word_rate            0.9428
allcaps_rate             0.00241
boilerplate_rate         1.434e-05
alert: one_word          94.3% rows are a single word
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        94.7% duplicate strings
Fig 12.
Character-length distribution for description.
Character-length distribution for description (mean: 3.4617591370703997).
chars        count
0 – 6        65286
6 – 13       422
13 – 19      544
19 – 26      501
26 – 32      458
32 – 38      437
38 – 45      344
45 – 51      218
51 – 57      205
57 – 64      170
64 – 70      229
70 – 76      85
76 – 83      70
83 – 89      60
89 – 96      54
96 – 102     48
102 – 108    51
108 – 115    42
115 – 121    34
121 – 128    34
128 – 134    26
134 – 140    24
140 – 147    23
147 – 153    24
153 – 159    19
159 – 166    13
166 – 172    27
172 – 178    13
178 – 185    18
185 – 191    21
191 – 198    20
198 – 204    12
204 – 210    23
210 – 217    18
217 – 223    16
223 – 230    9
230 – 236    8
236 – 242    17
242 – 249    22
249 – 255    71

access categorical feature

This column is an OpenStreetMap-style 'access' tag describing access permissions for a feature, with values like 'yes', 'no', 'private', 'permit', 'permissive', 'customers', and 'destination'. The dominant signal is that 89.7% of rows (62,551 of 69,716) carry an empty string rather than a proper null. This is a data quality concern, because missing access information is stored as blank text rather than NULL. The 20 distinct values (the blank plus 19 real tags) have very low entropy (0.707), confirming the distribution is heavily concentrated. The blank-dominance will mislead cardinality and frequency analyses unless blanks are recoded as missing.

Treatment: Recode empty strings to NaN, then one-hot or ordinal encode the remaining access-permission categories.
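
A minimal sketch of this treatment, using only the standard library. `None` stands in for NaN here, `recode_blank` and `one_hot` are hypothetical helpers, and the sample rows are invented (though the tag values themselves appear in the column).

```python
def recode_blank(value):
    """Treat empty or whitespace-only strings as missing (None)."""
    return value if value.strip() else None


def one_hot(values, categories):
    """One indicator row per value; missing values produce an all-zero row."""
    return [[int(v == c) for c in categories] for v in values]


# Hypothetical rows using tag values that appear in the access column.
raw = ["", "yes", "no", "", "private", "yes"]
access = [recode_blank(v) for v in raw]
categories = sorted({v for v in access if v is not None})
encoded = one_hot(access, categories)

print(categories)
print(encoded)
missing_rate = sum(v is None for v in access) / len(access)
print(round(missing_rate, 3))
```

Once blanks are recoded, the missing rate reflects the real gap in the data, which the 0.0% null rate reported for this column hides.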

anthropic:default · confidence high
Out[28]:

saturn.columns["access"].stats

stat           value
n              69,716
nulls          0 (0.0%)
unique         20
top_value      (empty string)
top_rate       0.8972
cardinality    20
entropy        0.7067
entropy_ratio  0.1635
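
The entropy figures can be checked against the top-values counts reported for access. The sketch below assumes Shannon entropy in bits (base 2) and a ratio normalised by the log of the cardinality. Both are assumptions about Saturn's definitions, chosen because they appear to reproduce the reported values, and `shannon_entropy_bits` is a hypothetical helper.

```python
import math

# Counts from the access top-values table in this notebook (blank value first).
counts = [62551, 2717, 2266, 815, 575, 440, 271, 47, 11, 9,
          2, 2, 2, 2, 1, 1, 1, 1, 1, 1]


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of the distribution implied by the counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


entropy = shannon_entropy_bits(counts)
# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(counts))
print(round(entropy, 4), round(entropy_ratio, 4))
```

A ratio this far below 1 means the column behaves almost like a constant, which matches the 89.7% share held by the blank value.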
Fig 13.
Top values for access.
Top values for access (20 unique shown, of 20 total).
value         count  share
(empty)       62551  89.7%
yes           2717   3.9%
no            2266   3.3%
private       815    1.2%
permit        575    0.8%
permissive    440    0.6%
customers     271    0.4%
unknown       47     0.1%
destination   11     0.0%
restricted    9      0.0%
tidal         2      0.0%
request       2      0.0%
key           2      0.0%
discouraged   2      0.0%
designated    1      0.0%
official      1      0.0%
forestry      1      0.0%
agricultural  1      0.0%
guided        1      0.0%
university    1      0.0%

tourism categorical feature

This column is an OpenStreetMap-style 'tourism' tag classifying geographic features by their tourism type (attraction, viewpoint, camp_site, museum, etc.). It is severely imbalanced: 97.2% of the 69,716 rows carry an empty string, meaning the tourism tag is absent, which leaves only 1,940 rows with meaningful values. The entropy ratio of 0.050 confirms near-zero informational content across the full dataset. Of the 18 distinct values, 17 are non-empty; most look like legitimate OSM tourism values, but their rarity makes this column sparse by design rather than by data quality failure.

Treatment: Treat empty string as missing/absent tag; consider binarising (tourism present vs. absent) or one-hot encoding the non-empty subset, given extreme sparsity.

anthropic:default · confidence high
Out[31]:

saturn.columns["tourism"].stats

stat              value
n                 69,716
nulls             0 (0.0%)
unique            18
top_value         (empty string)
top_rate          0.9722
cardinality       18
entropy           0.2076
entropy_ratio     0.04979
alert: imbalance  top value is 97.2% of rows
Fig 14.
Top values for tourism.
Top values for tourism (18 unique shown, of 18 total).
value              count  share
(empty)            67776  97.2%
attraction         1670   2.4%
viewpoint          119    0.2%
yes                105    0.2%
camp_site          7      0.0%
museum             6      0.0%
register           6      0.0%
artwork            6      0.0%
information        5      0.0%
picnic_site        4      0.0%
cave_entrance      3      0.0%
caves              2      0.0%
wilderness_hut     2      0.0%
attraction;museum  1      0.0%
guestbook          1      0.0%
no                 1      0.0%
cave               1      0.0%
hotel              1      0.0%

wikipedia text metadata

This column stores Wikipedia article links in a 'language-code:article-title' format, used to cross-reference cave or geological features to their Wikipedia pages across multiple languages (German, French, Spanish, Italian, Catalan, Polish, etc.). The overwhelming signal is that 67,531 of 69,716 rows (96.9%) are empty strings with no Wikipedia link at all, which also drives the 97.0% duplicate rate (67,639 duplicates). The text stats are computed over all rows, so the blanks dominate them (one_word_rate 0.976, len_mean 0.636); the 2,185 non-empty rows hold 2,076 distinct links drawn from a 940-term vocabulary across several language prefixes.

Treatment: Filter empty strings before use; parse language prefix and article slug as separate fields for any multilingual Wikipedia lookup or join.
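
Splitting the tag into its two parts might look like the sketch below. `parse_wikipedia_tag` is a hypothetical helper and the sample values are invented; they follow the 'language-code:article-title' shape described above.

```python
def parse_wikipedia_tag(value):
    """Split 'lang:Title' into (lang, title); return None for blank values."""
    value = value.strip()
    if not value:
        return None
    lang, sep, title = value.partition(":")
    if not sep:
        # No prefix present, so the language is unknown.
        return (None, value)
    return (lang, title)


# Hypothetical tag values.
print(parse_wikipedia_tag("de:Beispielhöhle"))
print(parse_wikipedia_tag("fr:Grotte: exemple"))  # only the first colon splits
print(parse_wikipedia_tag(""))
```

Using `str.partition` rather than `str.split` keeps any later colons inside the article title intact.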

anthropic:default · confidence high
Out[34]:

saturn.columns["wikipedia"].stats

stat                     value
n                        69,716
nulls                    0 (0.0%)
unique                   2,077
len_min                  0
len_max                  114
len_mean                 0.636
len_median               0
len_p95                  0
word_mean                1.044
word_median              1
n_empty                  67,531
n_duplicates             67,639
duplicate_rate           0.9702
vocab_size               940
readability_flesch_mean  -0.02485
emoji_rate               0
url_rate                 0
one_word_rate            0.9756
allcaps_rate             0
boilerplate_rate         0
alert: one_word          97.6% rows are a single word
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        97.0% duplicate strings
Fig 15.
Character-length distribution for wikipedia.
Character-length distribution for wikipedia (mean: 0.6359802627804234).
chars        count
0 – 3        67531
3 – 6        0
6 – 9        42
9 – 11       85
11 – 14      252
14 – 17      402
17 – 20      335
20 – 23      445
23 – 26      243
26 – 28      131
28 – 31      102
31 – 34      72
34 – 37      33
37 – 40      13
40 – 43      12
43 – 46      6
46 – 48      4
48 – 51      1
51 – 54      1
54 – 57      1
57 – 60      3
60 – 63      0
63 – 66      1
66 – 68      0
68 – 71      0
71 – 74      0
74 – 77      0
77 – 80      0
80 – 83      0
83 – 86      0
86 – 88      0
88 – 91      0
91 – 94      0
94 – 97      0
97 – 100     0
100 – 103    0
103 – 105    0
105 – 108    0
108 – 111    0
111 – 114    1

website text metadata

This column contains optional website/URL references associated with dataset records (likely natural features, caves, or protected sites given the UNESCO, speleology, and nature-registry URLs visible in top values). The dominant signal is near-total sparsity: 67,082 of 69,716 rows are empty strings, yielding a duplicate rate of 96.4% and a len_median of 0.0. The most repeated URLs appear only 2–19 times, suggesting these are editorial citations rather than structured foreign keys. The url_rate of 3.8% is measured over all rows; because only 2,634 rows (3.8%) are non-empty, it implies that nearly every non-empty value parses as a URL, which agrees with the top values all being valid URLs.

Treatment: Exclude from modelling; optionally binarise into a 'has_website' flag or extract domain as a sparse categorical feature.
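
Both options in this treatment fit in one small function. This is a sketch under assumptions: `website_features` is a hypothetical helper, values without a scheme are assumed to be bare hosts, and the sample URLs use the reserved example.org domain rather than values from the dataset.

```python
from urllib.parse import urlparse


def website_features(value):
    """Return (has_website, domain) for a raw website string."""
    value = value.strip()
    if not value:
        return (False, None)
    # Assumption: values without a scheme are bare hosts, so prepend '//'
    # to make urlparse read the leading part as the network location.
    parsed = urlparse(value if "://" in value else "//" + value)
    return (True, parsed.hostname)


# Hypothetical values; example.org is a reserved documentation domain.
print(website_features("https://www.example.org/caves/42"))
print(website_features("example.org/caves"))
print(website_features(""))
```

The boolean gives the 'has_website' flag directly, and the domain can be counted to find the few registries that contribute many links.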

anthropic:default · confidence high
Out[37]:

saturn.columns["website"].stats

stat                     value
n                        69,716
nulls                    0 (0.0%)
unique                   2,492
len_min                  0
len_max                  255
len_mean                 2.789
len_median               0
len_p95                  0
word_mean                1
word_median              1
n_empty                  67,082
n_duplicates             67,224
duplicate_rate           0.9643
vocab_size               737
readability_flesch_mean  -24.04
emoji_rate               0
url_rate                 0.03775
one_word_rate            1
allcaps_rate             1.434e-05
boilerplate_rate         0
alert: one_word          100.0% rows are a single word
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        96.4% duplicate strings
Fig 16.
Character-length distribution for website.
Character-length distribution for website (mean: 2.7889580584084).
chars        count
0 – 6        67082
6 – 13       0
13 – 19      11
19 – 26      45
26 – 32      130
32 – 38      161
38 – 45      109
45 – 51      118
51 – 57      86
57 – 64      101
64 – 70      223
70 – 76      45
76 – 83      49
83 – 89      1360
89 – 96      127
96 – 102     16
102 – 108    14
108 – 115    8
115 – 121    2
121 – 128    5
128 – 134    5
134 – 140    0
140 – 147    5
147 – 153    1
153 – 159    6
159 – 166    0
166 – 172    0
172 – 178    0
178 – 185    1
185 – 191    1
191 – 198    0
198 – 204    0
204 – 210    0
210 – 217    1
217 – 223    2
223 – 230    0
230 – 236    1
236 – 242    0
242 – 249    0
249 – 255    1

cave:length categorical feature

This column represents a cave length measurement, stored as a categorical/string type rather than numeric. The dominant signal is that 99.08% of 69,716 rows (69,074) carry an empty string, meaning cave length is recorded for only 642 rows across 237 distinct non-empty values. The non-empty values appear to be numeric strings (e.g., '5', '6', '10', '20') in unspecified units, suggesting this field is sparsely populated and was likely intended as a numeric attribute.

Treatment: Treat empty strings as missing; cast remaining values to numeric; expect ~99% missingness — impute or use as a sparse indicator feature only.
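
A cautious numeric cast for this column could look like the sketch below. `to_number` is a hypothetical helper, and the free-text entry in the sample is invented to show the fallback path; the report does not say whether such values occur among the 158 singleton categories.

```python
def to_number(value):
    """Parse a numeric string such as '5' or '4.5'; return None if blank or unparseable."""
    text = value.strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None


# Values in the shape shown in the top-values table, plus one hypothetical free-text entry.
raw = ["", "5", "4.5", "60", "ca. 30 m"]
lengths = [to_number(v) for v in raw]
print(lengths)

measured = [v for v in lengths if v is not None]
has_length = [v is not None for v in lengths]  # sparse indicator feature
print(len(measured), has_length)
```

Returning `None` instead of raising keeps unparseable entries visible as missing, so they can be reviewed by hand rather than silently dropped.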

anthropic:default · confidence high
Out[40]:

saturn.columns["cave:length"].stats

stat              value
n                 69,716
nulls             0 (0.0%)
unique            238
top_value         (empty string)
top_rate          0.9908
cardinality       238
entropy           0.1392
entropy_ratio     0.01763
alert: long_tail  158 singleton categories
alert: imbalance  top value is 99.1% of rows
Fig 17.
Top values for cave:length.
Top values for cave:length (20 unique shown, of 238 total).
value    count  share
(empty)  69074  99.1%
5        32     0.0%
6        26     0.0%
10       25     0.0%
3        23     0.0%
4        23     0.0%
7        20     0.0%
8        19     0.0%
15       16     0.0%
20       14     0.0%
12       13     0.0%
30       13     0.0%
2        11     0.0%
11       9      0.0%
60       8      0.0%
4.5      8      0.0%
13       8      0.0%
16       8      0.0%
17       8      0.0%
25       8      0.0%

cave:depth categorical feature

This column represents the depth attribute of a cave feature, stored as a categorical field containing numeric depth values (in unspecified units). It is almost entirely empty: 69,419 of 69,716 rows (99.57%) are blank strings, leaving only 297 rows with any actual depth value across 123 distinct non-empty values. The extreme sparsity and near-zero entropy ratio (0.0092) signal this is a rarely-populated optional attribute, not a reliable analytical field in its current form.

Treatment: Filter to non-empty rows before use; convert to numeric and consider imputation or indicator flag for missingness given 99.57% blank rate.

anthropic:default · confidence high
Out[43]:

saturn.columns["cave:depth"].stats

stat              value
n                 69,716
nulls             0 (0.0%)
unique            124
top_value         (empty string)
top_rate          0.9957
cardinality       124
entropy           0.06432
entropy_ratio     0.00925
alert: long_tail  87 singleton categories
alert: imbalance  top value is 99.6% of rows
Fig 18.
Top values for cave:depth.
Top values for cave:depth (20 unique shown, of 124 total).
value    count  share
(empty)  69419  99.6%
0        63     0.1%
10       13     0.0%
3        11     0.0%
1        9      0.0%
5        9      0.0%
4        8      0.0%
25       7      0.0%
30       6      0.0%
6        6      0.0%
2        6      0.0%
11       5      0.0%
35       5      0.0%
28       4      0.0%
14       4      0.0%
40       4      0.0%
70       4      0.0%
12       3      0.0%
8        3      0.0%
15       3      0.0%

country categorical feature

This column is intended to capture a 2-letter ISO country code for each record. However, it is effectively empty: 69,684 of 69,716 rows (99.95%) contain a blank string rather than a real value, making the column nearly useless as a feature. Only 32 rows carry a real country code spread across 15 distinct values, with no single country dominating — the most common non-blank code is 'KY' with just 7 occurrences. The extreme imbalance (top_rate 0.9995) and near-zero entropy (0.0074) confirm this column carries virtually no information.

Treatment: Drop or flag as data-quality issue; if retained, treat blank as missing and note that only 32 rows have usable country data.

anthropic:default · confidence high
Out[46]:

saturn.columns["country"].stats

stat              value
n                 69,716
nulls             0 (0.0%)
unique            16
top_value         (empty string)
top_rate          0.9995
cardinality       16
entropy           0.00736
entropy_ratio     0.00184
alert: imbalance  top value is 100.0% of rows
Fig 19.
Top values for country.
Top values for country (16 unique shown, of 16 total).
value    count  share
(empty)  69684  100.0%
KY       7      0.0%
AU       6      0.0%
DE       3      0.0%
RO       2      0.0%
US       2      0.0%
EC       2      0.0%
IT       2      0.0%
GB       1      0.0%
MT       1      0.0%
CH       1      0.0%
GR       1      0.0%
ET       1      0.0%
LB       1      0.0%
VN       1      0.0%
PT       1      0.0%

How to cite

BibTeX
@misc{saturn-data-trove-caves-worldwide-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove caves worldwide},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-caves-worldwide}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove caves worldwide. Source: /home/coolhand/html/datavis/data_trove/data/quirky/caves.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-caves-worldwide