saturn

/home/coolhand/html/datavis/data_trove/data/quirky/caves.json 69,716 rows sample n=69,716 seed 42 2026-06-22T00:34:29+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/caves.json
Total rows: 69,716
Profiled sample: 69,716
Columns: 12
Generated: 2026-06-22T00:34:29+00:00
Per-column null rate across the corpus.
column        kind          null %
id            numeric       0.0%
name          text          0.0%
lat           numeric       0.0%
lon           numeric       0.0%
description   text          0.0%
access        categorical   0.0%
tourism       categorical   0.0%
wikipedia     text          0.0%
website       text          0.0%
cave:length   categorical   0.0%
cave:depth    categorical   0.0%
country       categorical   0.0%

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:default.

Dataset high anthropic:default

This dataset is a global registry of 69,716 cave entries, likely sourced from OpenStreetMap, containing geographic coordinates, names, access rules, and optional metadata such as depth, length, and tourism classification. The most striking issue is extreme sparsity: the vast majority of records have empty descriptions (93.5%), websites (96.2%), wikipedia links (96.9%), depth (99.6%), and length (99.1%), meaning most caves are little more than a name and a pin on a map. About 28% of records (19,527) are simply named 'Unnamed Cave', a significant completeness problem worth investigating before any analysis. Geographic coverage skews heavily toward Europe: the latitude median is ~44°N with an interquartile range of about 7°, and the longitude median is ~11°E, while the longitude outliers indicate a global but uneven spread. Among the roughly 10% of caves with access tags, the split between open ('yes'), closed ('no'), and 'private' is worth exploring for any public-access analysis.

description high anthropic:default

This column is a short free-text description field for what appears to be a largely European cave/karst cadastral dataset, with entries in German, French, Spanish, and English. It is overwhelmingly empty: 65,189 of 69,716 rows (93.5%) are blank strings, making it near-useless as a feature in its current form. The duplicate rate is 94.7% (66,011 duplicates), driven almost entirely by those empty strings. The length statistics are computed over all rows, blanks included, which is why the median length is 0 and the 95th percentile is only 19 characters; the non-empty values run up to 255 characters and include brief directional or categorical labels such as 'unterer Eingang' or 'nicht katasterwürdig'. The multilingual vocabulary and modest cardinality (3,705 unique values, 4,651-word vocabulary) suggest this is an optional annotation field rather than a structured attribute.

website high anthropic:default

This column contains optional website/URL references associated with dataset records (likely natural features, caves, or protected sites given the UNESCO, speleology, and nature-registry URLs visible in top values). The dominant signal is near-total sparsity: 67,082 of 69,716 rows are empty strings, yielding a duplicate rate of 96.4% and a len_median of 0.0. When a URL is present, the most frequent ones each appear only 2 to 19 times, suggesting these are editorial citations rather than structured foreign keys. The url_rate of 3.8% is computed over all rows and matches the non-empty share (2,634 of 69,716 rows, about 3.8%), so essentially every non-empty value is being parsed as a URL; the rate is low only because most rows are blank.

wikipedia high anthropic:default

This column stores Wikipedia article links in a 'language-code:article-title' format, used to cross-reference cave or geological features to their Wikipedia pages across multiple languages (German, French, Spanish, Italian, Catalan, Polish, etc.). The dominant signal is sparsity: 67,531 of 69,716 rows (96.9%) are empty strings, which drives the 97.0% duplicate rate. The remaining 2,185 rows hold 2,076 distinct non-empty values. The one_word_rate (0.976) and len_mean (0.636) are computed over all rows, so they mostly reflect the blanks rather than the links themselves. The vocabulary contains 940 terms across several language prefixes.
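The 'language-code:article-title' format described above can be split with a few lines of Python. This is a minimal sketch, not part of saturn: the function names are hypothetical, the URL pattern is an assumption about how such tags are usually resolved, and the example value is illustrative rather than taken from the dataset.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def parse_wikipedia_tag(value: str) -> Optional[Tuple[str, str]]:
    """Split 'lang:Title' into (lang, title); return None for blank or malformed values."""
    value = value.strip()
    if not value:
        return None  # blank string, i.e. no link recorded
    # Split on the first colon only, since article titles may contain colons.
    lang, sep, title = value.partition(":")
    if not sep or not lang or not title:
        return None
    return lang, title


def wikipedia_url(value: str) -> Optional[str]:
    """Build an article URL from a tag value (assumed URL pattern)."""
    parsed = parse_wikipedia_tag(value)
    if parsed is None:
        return None
    lang, title = parsed
    # Article paths use underscores in place of spaces; quote() percent-encodes the rest.
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


# Illustrative value, not a row from the dataset.
example = wikipedia_url("fr:Grotte de Lascaux")
```

Returning None for blanks keeps the 96.9% of empty rows out of any downstream link check.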

lat high anthropic:default

This column is a geographic latitude, spanning from -77.98° (Antarctic latitudes) to 78.18° (high Arctic), with the bulk of records clustered between ~40.5° and ~47.75° (IQR of 7.26°). That band alone fits either southern and central Europe or the northern United States; the longitude median of ~11°E points to Europe. The strong negative skew (-3.16) and high kurtosis (11.50) indicate a heavy left tail: 8,996 rows (12.9%) are flagged as outliers, likely locations far outside the core geographic cluster. With 69,544 unique values across 69,716 rows and zero nulls, near-uniqueness suggests these are precise point coordinates rather than binned regions.
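The report flags rows "beyond 1.5 IQR". A minimal sketch of that rule, assuming saturn uses the standard Tukey fences (the report does not state the exact formula), applied to the reported quartiles:

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) bounds; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the lat and lon stats in this report.
lat_low, lat_high = tukey_fences(40.495, 47.750)
lon_low, lon_high = tukey_fences(1.245, 18.184)
```

Under that assumption, latitudes below about 29.61° or above about 58.63° are flagged, as are longitudes below about -24.16° or above about 43.59°. In other words, the "outliers" are simply caves outside a box around continental Europe.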

cave:depth high anthropic:default

This column represents the depth attribute of a cave feature, stored as a categorical field containing numeric depth values (in unspecified units). It is almost entirely empty: 69,419 of 69,716 rows (99.57%) are blank strings, leaving only 297 rows with any actual depth value across 123 distinct non-empty values. The most common non-empty value is '0' (63 rows). The extreme sparsity and near-zero entropy ratio (0.00925) signal this is a rarely-populated optional attribute, not a reliable analytical field in its current form.

cave:length high anthropic:default

This column represents a cave length measurement, stored as a categorical/string type rather than numeric. The dominant signal is that 99.08% of 69,716 rows (69,074) carry an empty string, meaning cave length is recorded for only 642 rows across 237 distinct non-empty values. The non-empty values appear to be numeric strings (e.g., '5', '6', '10', '20'), suggesting this field is sparsely populated and was likely intended as a numeric attribute.
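If the field is to be treated as numeric, the strings need converting with blanks mapped to missing. A minimal sketch, with a hypothetical helper name and no assumption about units, since the report does not state them:

```python
import math
from typing import Optional


def parse_measure(raw: str) -> Optional[float]:
    """Convert a numeric string such as '5' or '4.5' to float; None if blank or unparseable."""
    text = raw.strip()
    if not text:
        return None  # blank string means the tag is absent
    try:
        value = float(text)
    except ValueError:
        return None  # e.g. a value with a unit suffix; left for manual review
    # float() also accepts 'nan' and 'inf', which are not usable measurements.
    return value if math.isfinite(value) else None


# Values taken from the top-values list, plus a blank.
parsed = [parse_measure(v) for v in ["5", "4.5", "", "60"]]
```

Anything that fails to parse is returned as None rather than guessed at, so such rows can be counted and inspected separately.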

name high anthropic:default

This column contains the proper names of caves or underground features in a speleological dataset, with multilingual entries spanning English, French, Italian, Spanish, German, and Russian. The most striking signal is that 'Unnamed Cave' appears 19,527 times — roughly 28% of all 69,716 rows — driving a duplicate rate of 35.1% (24,487 duplicates) even though 45,229 unique values exist. The vocabulary mix (grotta, grotte, cueva, cova, Грот) confirms international coverage, and the name is clearly a label, not a unique identifier.
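The duplicate figures in this report appear to follow from the row and unique counts alone. A minimal sketch, assuming n_duplicates is rows minus unique and duplicate_rate is that count divided by rows (an inference from the reported numbers, not a documented saturn formula):

```python
from typing import Tuple


def duplicate_stats(rows: int, unique: int) -> Tuple[int, float]:
    """Return (n_duplicates, duplicate_rate) under the assumed definition."""
    n_duplicates = rows - unique
    return n_duplicates, n_duplicates / rows


# (rows, unique) pairs copied from the four text columns in this report.
reported = {
    "name": (69_716, 45_229),
    "description": (69_716, 3_705),
    "wikipedia": (69_716, 2_077),
    "website": (69_716, 2_492),
}
stats = {col: duplicate_stats(*pair) for col, pair in reported.items()}
```

Under this definition every repeat of a value after its first occurrence counts as a duplicate, so a single dominant value such as 'Unnamed Cave' or the empty string accounts for most of the rate.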

lon high anthropic:default

This column contains geographic longitude values, ranging from -178.0 to 178.8 degrees, consistent with a global coordinate field. Nearly all 69,716 rows are unique (69,585 distinct values) with zero nulls, indicating high-precision GPS or geocoded data. The distribution is heavily concentrated in a narrow band (IQR of ~17 degrees, Q1=1.24, Q3=18.18), suggesting the bulk of records cluster around Europe/Africa longitudes, while 16.2% of values (11,302 rows) are flagged as outliers — likely legitimate points in the Americas, East Asia, or Oceania that fall far from this central cluster. The elevated kurtosis (4.51) and large std (40.5) relative to the IQR confirm this heavy-tailed, multimodal geographic spread.

country high anthropic:default

This column is intended to capture a 2-letter ISO country code for each record. However, it is effectively empty: 69,684 of 69,716 rows (99.95%) contain a blank string rather than a real value, making the column nearly useless as a feature. Only 32 rows carry a real country code spread across 15 distinct values, with no single country dominating — the most common non-blank code is 'KY' with just 7 occurrences. The extreme imbalance (top_rate 0.9995) and near-zero entropy (0.0074) confirm this column carries virtually no information.
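The entropy figures can be reproduced from the value counts. A minimal sketch, assuming Shannon entropy in base 2 and an entropy ratio normalised by log2 of the cardinality; these definitions are inferred because they match the reported numbers, not taken from saturn's documentation:

```python
import math
from typing import Sequence


def entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts copied from the country top-values list (blank first, then KY, AU, ...).
country_counts = [69684, 7, 6, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]

h = entropy_bits(country_counts)
# Normalise by the maximum possible entropy for this many categories.
ratio = h / math.log2(len(country_counts))
```

With 16 categories the maximum entropy is 4 bits, so a ratio near 0.002 says the column behaves almost like a constant.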

tourism high anthropic:default

This column is an OpenStreetMap-style 'tourism' tag classifying geographic features by their tourism type (attraction, viewpoint, camp_site, museum, etc.). It is severely imbalanced: 97.2% of the 69,716 rows (67,776) carry an empty string, meaning the tourism tag is absent, leaving only 1,940 rows with a value. The entropy ratio of 0.050 confirms near-zero informational content across the full dataset. Excluding the blank, there are 17 distinct values. Most are recognisable OSM tourism values, though a few (such as 'yes', 'no', 'cave', 'caves', 'register', and 'guestbook') may be free-form tagging. The column is sparse because the tag is optional, not because of a data quality failure.

id high anthropic:default

This column is a numeric unique identifier — every one of the 69,716 rows has a distinct value with zero nulls, confirming a true primary key role. The values span a wide range (12,788,558 to 13,536,013,182) with a near-zero skew (0.0196) and a slightly platykurtic distribution (kurtosis −1.18), consistent with IDs assigned from a large, sparsely populated keyspace rather than a simple auto-increment sequence. The IQR of ~6.3 billion against a full range of ~13.5 billion indicates values are spread broadly across the domain with no clustering or outliers detected.

access high anthropic:default

This column is an OpenStreetMap-style 'access' tag describing who may enter a feature, with values like 'yes', 'no', 'private', 'permit', 'permissive', 'customers', and 'destination'. The dominant signal is that 89.7% of rows (62,551 of 69,716) carry an empty string rather than a proper null. This is a data quality concern: missing access information is stored as blank text, so the reported null rate of 0.0% understates missingness. Excluding the blank, 19 distinct values remain, and the low entropy (0.707, entropy ratio 0.164) confirms the distribution is heavily concentrated. The blank-dominance will mislead cardinality and frequency analyses unless blanks are recoded as missing.
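Recoding blanks as missing can be sketched with the standard library alone. The helper names and sample rows below are hypothetical (the rows are made up for illustration, not drawn from caves.json), and the sketch assumes each record is a dict keyed by column name:

```python
from typing import Any, Dict, Iterable, List


def blank_to_none(records: Iterable[Dict[str, Any]], columns: Iterable[str]) -> List[Dict[str, Any]]:
    """Return copies of the records with empty or whitespace-only strings replaced by None."""
    columns = list(columns)
    cleaned = []
    for record in records:
        row = dict(record)  # copy so the input is left untouched
        for col in columns:
            value = row.get(col)
            if isinstance(value, str) and not value.strip():
                row[col] = None
        cleaned.append(row)
    return cleaned


def null_rate(records: List[Dict[str, Any]], column: str) -> float:
    """Share of records whose value for `column` is None or absent."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(column) is None) / len(records)


# Made-up sample rows for illustration only.
sample = [
    {"id": 1, "access": ""},
    {"id": 2, "access": "yes"},
    {"id": 3, "access": "  "},
    {"id": 4, "access": "private"},
]
cleaned = blank_to_none(sample, ["access"])
```

On the sample, the null rate for access goes from 0.0 before recoding to 0.5 after, which is the same shift the full column would show (from 0.0% to roughly 89.7%).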

Numeric correlation

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
       id      lat     lon
id     +1.00   -0.04   -0.06
lat    -0.04   +1.00   -0.15
lon    -0.06   -0.15   +1.00

id numeric

rows: 69,716
null: 0 (0.0%)
unique: 69,716
min: 12,788,558
max: 13,536,013,182
mean: 6,841,788,293
median: 7,006,519,241
std: 3,774,305,021
q1: 3,609,747,458
q3: 9,948,115,791
iqr: 6,338,368,333
skew: 0.020
kurtosis: -1.177
n_outliers: 0
outlier_rate: 0.000
zero_rate: 0.000
Histogram bins for id (median: 7006519241.0).
bin                       count
1.279e+07 – 3.509e+08     592
3.509e+08 – 6.889e+08     1123
6.889e+08 – 1.027e+09     2360
1.027e+09 – 1.365e+09     1819
1.365e+09 – 1.703e+09     1497
1.703e+09 – 2.041e+09     1907
2.041e+09 – 2.379e+09     1345
2.379e+09 – 2.717e+09     1415
2.717e+09 – 3.056e+09     1806
3.056e+09 – 3.394e+09     1977
3.394e+09 – 3.732e+09     2439
3.732e+09 – 4.07e+09      1595
4.07e+09 – 4.408e+09      1710
4.408e+09 – 4.746e+09     2212
4.746e+09 – 5.084e+09     2764
5.084e+09 – 5.422e+09     1741
5.422e+09 – 5.76e+09      1236
5.76e+09 – 6.098e+09      1268
6.098e+09 – 6.436e+09     2153
6.436e+09 – 6.774e+09     1135
6.774e+09 – 7.112e+09     2343
7.112e+09 – 7.451e+09     1932
7.451e+09 – 7.789e+09     2060
7.789e+09 – 8.127e+09     1970
8.127e+09 – 8.465e+09     1451
8.465e+09 – 8.803e+09     1229
8.803e+09 – 9.141e+09     1689
9.141e+09 – 9.479e+09     1952
9.479e+09 – 9.817e+09     2902
9.817e+09 – 1.016e+10     1840
1.016e+10 – 1.049e+10     617
1.049e+10 – 1.083e+10     1370
1.083e+10 – 1.117e+10     1613
1.117e+10 – 1.151e+10     3298
1.151e+10 – 1.185e+10     1230
1.185e+10 – 1.218e+10     1741
1.218e+10 – 1.252e+10     1317
1.252e+10 – 1.286e+10     2065
1.286e+10 – 1.32e+10      1532
1.32e+10 – 1.354e+10      1471

name text

35.1% duplicate strings
rows: 69,716
null: 0 (0.0%)
unique: 45,229
len_min: 1
len_max: 143
len_mean: 15.522
len_median: 13.000
len_p95: 29.000
word_mean: 2.519
word_median: 2.000
n_empty: 0
n_duplicates: 24,487
duplicate_rate: 0.351
vocab_size: 15,219
readability_flesch_mean: 55.508
emoji_rate: 1.43e-05
url_rate: 0.000
one_word_rate: 0.154
allcaps_rate: 0.032
boilerplate_rate: 0.000
Character-length distribution for name (mean: 15.52237649893855).
chars        count
1 – 5        1601
5 – 8        3457
8 – 12       5854
12 – 15      31600
15 – 19      9877
19 – 22      8424
22 – 26      3274
26 – 29      2614
29 – 33      1175
33 – 36      885
36 – 40      482
40 – 44      175
44 – 47      154
47 – 51      56
51 – 54      45
54 – 58      17
58 – 61      11
61 – 65      3
65 – 68      2
68 – 72      2
72 – 76      4
76 – 79      2
79 – 83      0
83 – 86      1
86 – 90      0
90 – 93      0
93 – 97      0
97 – 100     0
100 – 104    0
104 – 108    0
108 – 111    0
111 – 115    0
115 – 118    0
118 – 122    0
122 – 125    0
125 – 129    0
129 – 132    0
132 – 136    0
136 – 139    0
139 – 143    1
Sample values (first 10)
  1. Santa Marija Caves
  2. Unnamed Cave
  3. Ipogeo 7° via Monti
  4. Grotte des Vers Luisants
  5. Unnamed Cave
  6. Alibi 1. sz. barlang
  7. skalní byty
  8. Unnamed Cave
  9. Abisso B52
  10. Cueva de la Torca 3 (corrales del trillo)

lat numeric

skew=-3.16; 12.9% rows beyond 1.5 IQR
rows: 69,716
null: 0 (0.0%)
unique: 69,544
min: -77.976
max: 78.182
mean: 40.582
median: 44.143
std: 15.476
q1: 40.495
q3: 47.750
iqr: 7.256
skew: -3.161
kurtosis: 11.499
n_outliers: 8,996
outlier_rate: 0.129
zero_rate: 0.000
Histogram bins for lat (median: 44.14308795).
bin                 count
-77.98 – -74.07     4
-74.07 – -70.17     0
-70.17 – -66.26     3
-66.26 – -62.36     0
-62.36 – -58.46     1
-58.46 – -54.55     2
-54.55 – -50.65     13
-50.65 – -46.74     26
-46.74 – -42.84     60
-42.84 – -38.94     127
-38.94 – -35.03     329
-35.03 – -31.13     330
-31.13 – -27.23     225
-27.23 – -23.32     257
-23.32 – -19.42     301
-19.42 – -15.51     204
-15.51 – -11.61     78
-11.61 – -7.705     139
-7.705 – -3.801     314
-3.801 – 0.1026     183
0.1026 – 4.007      118
4.007 – 7.911       358
7.911 – 11.81       287
11.81 – 15.72       385
15.72 – 19.62       1856
19.62 – 23.53       937
23.53 – 27.43       775
27.43 – 31.33       1082
31.33 – 35.24       1231
35.24 – 39.14       3928
39.14 – 43.05       12068
43.05 – 46.95       20377
46.95 – 50.85       19250
50.85 – 54.76       2492
54.76 – 58.66       1075
58.66 – 62.57       564
62.57 – 66.47       267
66.47 – 70.37       47
70.37 – 74.28       19
74.28 – 78.18       4

lon numeric

16.2% rows beyond 1.5 IQR
rows: 69,716
null: 0 (0.0%)
unique: 69,585
min: -178.005
max: 178.798
mean: 12.025
median: 11.376
std: 40.500
q1: 1.245
q3: 18.184
iqr: 16.939
skew: 0.276
kurtosis: 4.509
n_outliers: 11,302
outlier_rate: 0.162
zero_rate: 0.000
Histogram bins for lon (median: 11.37646115).
bin                 count
-178 – -169.1       54
-169.1 – -160.2     6
-160.2 – -151.2     49
-151.2 – -142.3     20
-142.3 – -133.4     6
-133.4 – -124.5     18
-124.5 – -115.6     352
-115.6 – -106.6     454
-106.6 – -97.72     283
-97.72 – -88.8      361
-88.8 – -79.88      579
-79.88 – -70.96     1877
-70.96 – -62.04     236
-62.04 – -53.12     168
-53.12 – -44.2      252
-44.2 – -35.28      203
-35.28 – -26.36     26
-26.36 – -17.44     182
-17.44 – -8.523     766
-8.523 – 0.3967     9279
0.3967 – 9.317      14696
9.317 – 18.24       22468
18.24 – 27.16       8156
27.16 – 36.08       1754
36.08 – 45          1346
45 – 53.92          530
53.92 – 62.84       479
62.84 – 71.76       150
71.76 – 80.68       381
80.68 – 89.6        210
89.6 – 98.52        212
98.52 – 107.4       1114
107.4 – 116.4       956
116.4 – 125.3       696
125.3 – 134.2       395
134.2 – 143.1       363
143.1 – 152         330
152 – 161           46
161 – 169.9         33
169.9 – 178.8       230

description text

94.3% rows are a single word; 95th-percentile length under 20 chars; 94.7% duplicate strings
rows: 69,716
null: 0 (0.0%)
unique: 3,705
len_min: 0
len_max: 255
len_mean: 3.462
len_median: 0.000
len_p95: 19.000
word_mean: 1.455
word_median: 1.000
n_empty: 65,189
n_duplicates: 66,011
duplicate_rate: 0.947
vocab_size: 4,651
readability_flesch_mean: 1.752
emoji_rate: 0.000
url_rate: 6.31e-04
one_word_rate: 0.943
allcaps_rate: 2.41e-03
boilerplate_rate: 1.43e-05
Character-length distribution for description (mean: 3.4617591370703997).
chars        count
0 – 6        65286
6 – 13       422
13 – 19      544
19 – 26      501
26 – 32      458
32 – 38      437
38 – 45      344
45 – 51      218
51 – 57      205
57 – 64      170
64 – 70      229
70 – 76      85
76 – 83      70
83 – 89      60
89 – 96      54
96 – 102     48
102 – 108    51
108 – 115    42
115 – 121    34
121 – 128    34
128 – 134    26
134 – 140    24
140 – 147    23
147 – 153    24
153 – 159    19
159 – 166    13
166 – 172    27
172 – 178    13
178 – 185    18
185 – 191    21
191 – 198    20
198 – 204    12
204 – 210    23
210 – 217    18
217 – 223    16
223 – 230    9
230 – 236    8
236 – 242    17
242 – 249    22
249 – 255    71
Sample values (first 10)
  1. unerforschtes, nordschauendes Portal

access categorical

rows: 69,716
null: 0 (0.0%)
unique: 20
top_value: (empty string)
top_rate: 0.897
cardinality: 20
entropy: 0.707
entropy_ratio: 0.164
Top values for access (20 unique shown, of 20 total).
value             count    share
(empty string)    62551    89.7%
yes               2717     3.9%
no                2266     3.3%
private           815      1.2%
permit            575      0.8%
permissive        440      0.6%
customers         271      0.4%
unknown           47       0.1%
destination       11       0.0%
restricted        9        0.0%
tidal             2        0.0%
request           2        0.0%
key               2        0.0%
discouraged       2        0.0%
designated        1        0.0%
official          1        0.0%
forestry          1        0.0%
agricultural      1        0.0%
guided            1        0.0%
university        1        0.0%
Top values (rank 1–20)
  1. — 62,551
  2. yes — 2,717
  3. no — 2,266
  4. private — 815
  5. permit — 575
  6. permissive — 440
  7. customers — 271
  8. unknown — 47
  9. destination — 11
  10. restricted — 9
  11. tidal — 2
  12. request — 2
  13. key — 2
  14. discouraged — 2
  15. designated — 1
  16. official — 1
  17. forestry — 1
  18. agricultural — 1
  19. guided — 1
  20. university — 1

tourism categorical

top value is 97.2% of rows
rows: 69,716
null: 0 (0.0%)
unique: 18
top_value: (empty string)
top_rate: 0.972
cardinality: 18
entropy: 0.208
entropy_ratio: 0.050
Top values for tourism (18 unique shown, of 18 total).
value                count    share
(empty string)       67776    97.2%
attraction           1670     2.4%
viewpoint            119      0.2%
yes                  105      0.2%
camp_site            7        0.0%
museum               6        0.0%
register             6        0.0%
artwork              6        0.0%
information          5        0.0%
picnic_site          4        0.0%
cave_entrance        3        0.0%
caves                2        0.0%
wilderness_hut       2        0.0%
attraction;museum    1        0.0%
guestbook            1        0.0%
no                   1        0.0%
cave                 1        0.0%
hotel                1        0.0%
Top values (rank 1–20)
  1. — 67,776
  2. attraction — 1,670
  3. viewpoint — 119
  4. yes — 105
  5. camp_site — 7
  6. museum — 6
  7. register — 6
  8. artwork — 6
  9. information — 5
  10. picnic_site — 4
  11. cave_entrance — 3
  12. caves — 2
  13. wilderness_hut — 2
  14. attraction;museum — 1
  15. guestbook — 1
  16. no — 1
  17. cave — 1
  18. hotel — 1

wikipedia text

97.6% rows are a single word; 95th-percentile length under 20 chars; 97.0% duplicate strings
rows: 69,716
null: 0 (0.0%)
unique: 2,077
len_min: 0
len_max: 114
len_mean: 0.636
len_median: 0.000
len_p95: 0.000
word_mean: 1.044
word_median: 1.000
n_empty: 67,531
n_duplicates: 67,639
duplicate_rate: 0.970
vocab_size: 940
readability_flesch_mean: -0.025
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.976
allcaps_rate: 0.000
boilerplate_rate: 0.000
Character-length distribution for wikipedia (mean: 0.6359802627804234).
chars        count
0 – 3        67531
3 – 6        0
6 – 9        42
9 – 11       85
11 – 14      252
14 – 17      402
17 – 20      335
20 – 23      445
23 – 26      243
26 – 28      131
28 – 31      102
31 – 34      72
34 – 37      33
37 – 40      13
40 – 43      12
43 – 46      6
46 – 48      4
48 – 51      1
51 – 54      1
54 – 57      1
57 – 60      3
60 – 63      0
63 – 66      1
66 – 68      0
68 – 71      0
71 – 74      0
74 – 77      0
77 – 80      0
80 – 83      0
83 – 86      0
86 – 88      0
88 – 91      0
91 – 94      0
94 – 97      0
97 – 100     0
100 – 103    0
103 – 105    0
105 – 108    0
108 – 111    0
111 – 114    1
Sample values (first 10)

website text

100.0% rows are a single word; 95th-percentile length under 20 chars; 96.4% duplicate strings
rows: 69,716
null: 0 (0.0%)
unique: 2,492
len_min: 0
len_max: 255
len_mean: 2.789
len_median: 0.000
len_p95: 0.000
word_mean: 1.000
word_median: 1.000
n_empty: 67,082
n_duplicates: 67,224
duplicate_rate: 0.964
vocab_size: 737
readability_flesch_mean: -24.041
emoji_rate: 0.000
url_rate: 0.038
one_word_rate: 1.000
allcaps_rate: 1.43e-05
boilerplate_rate: 0.000
Character-length distribution for website (mean: 2.7889580584084).
chars        count
0 – 6        67082
6 – 13       0
13 – 19      11
19 – 26      45
26 – 32      130
32 – 38      161
38 – 45      109
45 – 51      118
51 – 57      86
57 – 64      101
64 – 70      223
70 – 76      45
76 – 83      49
83 – 89      1360
89 – 96      127
96 – 102     16
102 – 108    14
108 – 115    8
115 – 121    2
121 – 128    5
128 – 134    5
134 – 140    0
140 – 147    5
147 – 153    1
153 – 159    6
159 – 166    0
166 – 172    0
172 – 178    0
178 – 185    1
185 – 191    1
191 – 198    0
198 – 204    0
204 – 210    0
210 – 217    1
217 – 223    2
223 – 230    0
230 – 236    1
236 – 242    0
242 – 249    0
249 – 255    1
Sample values (first 10)
  1. https://termeszetvedelem.hu/talalati-oldal/?type=orszagos-barlangnyilvantartas&id=4762-23

cave:length categorical

158 singleton categories; top value is 99.1% of rows
rows: 69,716
null: 0 (0.0%)
unique: 238
top_value: (empty string)
top_rate: 0.991
cardinality: 238
entropy: 0.139
entropy_ratio: 0.018
Top values for cave:length (20 unique shown, of 238 total).
value             count    share
(empty string)    69074    99.1%
5                 32       0.0%
6                 26       0.0%
10                25       0.0%
3                 23       0.0%
4                 23       0.0%
7                 20       0.0%
8                 19       0.0%
15                16       0.0%
20                14       0.0%
12                13       0.0%
30                13       0.0%
2                 11       0.0%
11                9        0.0%
60                8        0.0%
4.5               8        0.0%
13                8        0.0%
16                8        0.0%
17                8        0.0%
25                8        0.0%
Top values (rank 1–20)
  1. — 69,074
  2. 5 — 32
  3. 6 — 26
  4. 10 — 25
  5. 3 — 23
  6. 4 — 23
  7. 7 — 20
  8. 8 — 19
  9. 15 — 16
  10. 20 — 14
  11. 12 — 13
  12. 30 — 13
  13. 2 — 11
  14. 11 — 9
  15. 60 — 8
  16. 4.5 — 8
  17. 13 — 8
  18. 16 — 8
  19. 17 — 8
  20. 25 — 8

cave:depth categorical

87 singleton categories; top value is 99.6% of rows
rows: 69,716
null: 0 (0.0%)
unique: 124
top_value: (empty string)
top_rate: 0.996
cardinality: 124
entropy: 0.064
entropy_ratio: 9.25e-03
Top values for cave:depth (20 unique shown, of 124 total).
value             count    share
(empty string)    69419    99.6%
0                 63       0.1%
10                13       0.0%
3                 11       0.0%
1                 9        0.0%
5                 9        0.0%
4                 8        0.0%
25                7        0.0%
30                6        0.0%
6                 6        0.0%
2                 6        0.0%
11                5        0.0%
35                5        0.0%
28                4        0.0%
14                4        0.0%
40                4        0.0%
70                4        0.0%
12                3        0.0%
8                 3        0.0%
15                3        0.0%
Top values (rank 1–20)
  1. — 69,419
  2. 0 — 63
  3. 10 — 13
  4. 3 — 11
  5. 1 — 9
  6. 5 — 9
  7. 4 — 8
  8. 25 — 7
  9. 30 — 6
  10. 6 — 6
  11. 2 — 6
  12. 11 — 5
  13. 35 — 5
  14. 28 — 4
  15. 14 — 4
  16. 40 — 4
  17. 70 — 4
  18. 12 — 3
  19. 8 — 3
  20. 15 — 3

country categorical

top value is 100.0% of rows
rows: 69,716
null: 0 (0.0%)
unique: 16
top_value: (empty string)
top_rate: 1.000
cardinality: 16
entropy: 7.36e-03
entropy_ratio: 1.84e-03
Top values for country (16 unique shown, of 16 total).
value             count    share
(empty string)    69684    100.0%
KY                7        0.0%
AU                6        0.0%
DE                3        0.0%
RO                2        0.0%
US                2        0.0%
EC                2        0.0%
IT                2        0.0%
GB                1        0.0%
MT                1        0.0%
CH                1        0.0%
GR                1        0.0%
ET                1        0.0%
LB                1        0.0%
VN                1        0.0%
PT                1        0.0%
Top values (rank 1–20)
  1. — 69,684
  2. KY — 7
  3. AU — 6
  4. DE — 3
  5. RO — 2
  6. US — 2
  7. EC — 2
  8. IT — 2
  9. GB — 1
  10. MT — 1
  11. CH — 1
  12. GR — 1
  13. ET — 1
  14. LB — 1
  15. VN — 1
  16. PT — 1