saturn

/home/coolhand/html/datavis/data_trove/data/quirky/caves.json | 69,716 rows | sample n=69,716 | seed 42 | 2026-05-01T23:17:16+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/caves.json
Total rows: 69,716
Profiled sample: 69,716
Columns: 12
Generated: 2026-05-01T23:17:16+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset catalogs 69,716 caves with 12 columns covering names, geocoordinates, country, tourism/access tags, and optional metadata like description, website, and Wikipedia links. The headline issue is sparsity in the descriptive fields: 'description' is empty in 65,189 rows, 'website' in 67,082, and 'wikipedia' in 67,531, so most analytical signal sits in name and coordinates. Worth a closer look first: the 'name' column, where 19,527 entries are literally 'Unnamed Cave' and overall duplicate rate is 35%, and the geographic spread, where 'lat' is heavily left-skewed (skew -3.16) with ~12.9% outliers and 'lon' has ~16.2% outliers, suggesting a Northern-Hemisphere/European concentration with scattered global entries. The 'country' field is almost entirely blank (99.95%), so country-level analysis will need to be derived from coordinates rather than read off directly. 'Access' is the most usable categorical, with meaningful splits across yes/no/private/permit when present.

id high anthropic:claude-opus-4-7

This column is almost certainly a row identifier: all 69,716 values are unique with zero nulls, and the numeric range spans roughly 1.28e7 to 1.354e10 with near-zero skew (0.02) and negative kurtosis (-1.18), consistent with broadly distributed assigned IDs rather than a measured quantity. No outliers or zeros are flagged. Treat the numeric stats as incidental: the magnitudes carry no analytical meaning.

name high anthropic:claude-opus-4-7

Short free-text names of caves/caverns in mixed languages (English 'Cave', French 'Grotte', Italian 'Grotta', Spanish 'Cueva', Cyrillic 'Грот', German 'Bärenhöhle'), averaging 2.5 words and 15.5 characters. Severe duplication (35.1%, 24,487 rows) is dominated by the placeholder 'Unnamed Cave' appearing 19,527 times — over a quarter of all rows are effectively unlabelled. Of 69,716 rows only 45,229 are unique, and the vocabulary of 15,219 tokens reflects the multilingual mix.

lat high anthropic:claude-opus-4-7

Latitude coordinates spanning -77.98 to 78.18, so the column reaches from Antarctic to Arctic latitudes. The distribution is heavily left-skewed (skew -3.16, kurtosis 11.5) with a tight IQR of 7.26 around a median of 44.14, indicating that most points cluster in northern mid-latitudes. A long tail toward lower and southern latitudes drives the negative skew, and the 1.5 IQR rule flags 8,996 rows (12.9%) as outliers on either side of the quartiles. Near-unique values (69,544 of 69,716) confirm these are precise geocoordinates rather than bucketed regions.
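The outlier flag for this column reads "rows beyond 1.5 IQR", which suggests the usual Tukey fence rule. The following is a minimal sketch under that assumption, using the q1 and q3 values reported for lat; the function name `tukey_fences` is hypothetical and is not part of saturn.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 as reported for the lat column (assumed inputs to the rule)
lower, upper = tukey_fences(40.495, 47.750)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")
```

Under this assumption, latitudes below roughly 29.6 or above roughly 58.6 are flagged, so the outlier count covers low-latitude northern caves and far-northern caves as well as southern-hemisphere ones.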

lon high anthropic:claude-opus-4-7

This is a longitude coordinate, with values spanning -178.005 to 178.798 and 69,585 unique values across 69,716 rows. The distribution is centered near a mean of 12.03 (median 11.38) with quartiles at 1.245 and 18.184 (IQR 16.94), suggesting a heavy concentration in European/African longitudes, but the std of 40.50 and 16.2% flagged outliers reveal a long global tail. No nulls or zeros are present.

description high anthropic:claude-opus-4-7

A free-text 'description' field, but 65,189 of 69,716 rows (93.5%) are empty strings, which drives the 94.7% duplicate rate and leaves only about 4,500 populated entries. Because the blanks are included, the mean length is 3.46 characters and the median word count is 1. The non-empty values are short multilingual fragments, including German ('nicht katasterwürdig', 'unterer Eingang'), French ('Entrée d'une carrière souterraine'), and single tokens like 'cave' and 'Halbhöhle', suggesting cave/feature annotations rather than prose. With a 94.3% one-word rate and a vocab of 4,651, this is closer to sparse tagging than narrative text.
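The share of empty rows and the duplicate rate are different measures that happen to sit close together for this column. The sketch below works through the arithmetic, assuming duplicate_rate is (rows - unique) / rows; that formula is inferred from the reported n_duplicates values, not confirmed from saturn's source.

```python
rows = 69_716
unique = 3_705     # distinct description strings, including ""
n_empty = 65_189   # rows whose description is ""

n_duplicates = rows - unique          # assumed definition
duplicate_rate = n_duplicates / rows
empty_rate = n_empty / rows
populated = rows - n_empty

print(n_duplicates, round(duplicate_rate, 3), round(empty_rate, 3), populated)
```

The gap between the two rates comes from repeated non-empty descriptions, which count as duplicates but not as blanks.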

access high anthropic:claude-opus-4-7

Categorical access-permission tag, almost certainly the OSM-style `access` key indicating who may use a feature. 89.7% of the 69,716 rows are empty strings, leaving only ~10% with substantive values like `yes` (2,717), `no` (2,266), `private` (815), `permit` (575), and `permissive` (440). Entropy ratio is just 0.16 and 20 distinct values appear, so the signal is sparse but the long tail (`customers`, `destination`, `restricted`, `unknown`) is meaningful when present.
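The entropy figures can be checked against the value counts listed for this column. This is a sketch that assumes saturn reports Shannon entropy in bits and normalises it by log2 of the cardinality; that reading is an inference that appears to match the reported 0.707 and 0.164, not documented behaviour. The helper name `entropy_bits` is hypothetical.

```python
import math

# Counts from the "Top values" list for access; the first entry is the empty string.
counts = [62551, 2717, 2266, 815, 575, 440, 271, 47, 11, 9,
          2, 2, 2, 2, 1, 1, 1, 1, 1, 1]


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


h = entropy_bits(counts)
ratio = h / math.log2(len(counts))  # assumed normalisation: log2(cardinality)
print(f"entropy={h:.3f} bits, ratio={ratio:.3f}")
```

A ratio near 0 means one value dominates; a ratio of 1 would mean all 20 values are equally common.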

tourism high anthropic:claude-opus-4-7

This is an OSM-style `tourism` tag categorising features like attractions, viewpoints, and museums across 18 distinct values (including the empty string). The column is severely imbalanced: 97.2% of the 69,716 rows are empty strings, and only `attraction` (1,670), `viewpoint` (119), and `yes` (105) occur more than seven times. Entropy ratio of 0.05 confirms almost no information content as-is.

wikipedia high anthropic:claude-opus-4-7

This column holds Wikipedia article references in the OSM-style `lang:Article Title` format (e.g. `de:Einödhöhle`, `fr:Grotte...`), mostly pointing to cave-related pages across many languages. It is overwhelmingly empty: 67,531 of 69,716 rows are blank and the duplicate rate is 0.97, leaving only 2,077 unique values. Among the populated entries, language prefixes span de, fr, pl, es, it, ca, en, ru and more, so any downstream use must handle multilingual strings.
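One way to handle the `lang:Article Title` format is to split on the first colon only, since article titles can contain colons themselves. The sketch below assumes blanks and values without a prefix should be treated as missing; the function name `split_wikipedia_tag` is hypothetical.

```python
from typing import Optional, Tuple


def split_wikipedia_tag(value: str) -> Optional[Tuple[str, str]]:
    """Split 'lang:Title' into (lang, title); None for blanks or values without a prefix."""
    value = value.strip()
    if not value or ":" not in value:
        return None
    lang, title = value.split(":", 1)  # split on the first colon only
    return lang.strip().lower(), title.strip()


print(split_wikipedia_tag("de:Einödhöhle"))
print(split_wikipedia_tag(""))
```

Grouping on the first element of the result gives a per-language count of the populated rows.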

website high anthropic:claude-opus-4-7

This is a website/URL field for each record, but it is overwhelmingly empty: 67,082 of 69,716 rows (n_empty) are blank, leaving only 2,492 unique values across the column. Where populated, entries are single-token URLs (one_word_rate 0.99998, word_mean ~1.0) pointing to varied external domains (angloasianmining.com, unesco.org, termeszetvedelem.hu, etc.). The duplicate_rate of 0.96 is driven almost entirely by the empty string, and url_rate is only 0.038 because the metric is computed over all rows including blanks.
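To summarise the populated entries by domain, the host can be pulled out with Python's standard `urllib.parse`. This is a sketch, assuming values without a scheme (for example a bare `www.` address) should be treated as unparseable; the function name `domain_of` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def domain_of(value: str) -> Optional[str]:
    """Return the host of a URL string, or None for blank or scheme-less values."""
    # urlparse only fills in the host when the value has a scheme such as https://
    host = urlparse(value.strip()).hostname
    return host if host else None


sample = ("https://termeszetvedelem.hu/talalati-oldal/"
          "?type=orszagos-barlangnyilvantartas&id=4762-23")
print(domain_of(sample))
print(domain_of(""))
```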

cave:length high anthropic:claude-opus-4-7

This appears to be a cave length attribute (likely OpenStreetMap-style tag) stored as a string, with 99.08% of the 69,716 rows being empty and only 238 distinct values overall. When populated, values look like small integers (5, 6, 10, 3, 4...) suggesting metres, but the signal is so sparse (entropy ratio 0.0176) that it carries almost no information. The non-null counts in the top values are tiny (≤32 each), indicating this tag is rarely filled in upstream.
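Because the values are stored as strings, any numeric use needs a conversion step that tolerates blanks and non-numeric text. The sketch below assumes unparseable or non-finite values should become missing; it does not assume a unit, since metres is only the report's guess. The function name `parse_length` is hypothetical.

```python
import math
from typing import Optional


def parse_length(value: str) -> Optional[float]:
    """Convert a tag string to a finite float, or None if blank or not numeric."""
    value = value.strip()
    if not value:
        return None
    try:
        number = float(value)
    except ValueError:
        return None  # e.g. a value with a unit suffix
    return number if math.isfinite(number) else None


for raw in ["5", "4.5", "", "10 m"]:
    print(repr(raw), "->", parse_length(raw))
```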

cave:depth high anthropic:claude-opus-4-7

This appears to be a cave depth attribute, likely sourced from an OpenStreetMap-style tag, stored as strings. It is effectively empty: 69,419 of 69,716 rows (top_rate 0.9957) carry the blank value "", and the remaining 123 distinct values are short integer-like strings (the top 20 run from "0" to "70") with counts no higher than 63. Entropy ratio of 0.0092 confirms there is virtually no signal here.

country high anthropic:claude-opus-4-7

Two-letter ISO country code, but effectively empty: 69,684 of 69,716 rows (top_rate 0.9995) carry a blank string, leaving only 32 rows spread across 15 actual codes (KY, AU, DE, RO, US, etc.). Entropy ratio of 0.0018 confirms there is essentially no signal here. No nulls are reported because the missingness is encoded as empty string rather than NULL.
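Since missing values are encoded as empty strings, null counts only become meaningful after blanks are mapped to a real missing marker. This is a minimal sketch, assuming each row is loaded as a Python dict; the function name `blank_to_none` and the example row are made up for illustration.

```python
def blank_to_none(record: dict) -> dict:
    """Return a copy of the record with empty or whitespace-only strings set to None."""
    return {
        key: (None if isinstance(value, str) and value.strip() == "" else value)
        for key, value in record.items()
    }


# Hypothetical row, not taken from the dataset
row = {"id": 1, "name": "Unnamed Cave", "country": "", "access": "yes"}
cleaned = blank_to_none(row)
print(cleaned)
```

Applied to every row, this would make the mostly blank columns show up as nulls in the profile.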

id numeric

rows: 69,716
null: 0 (0.0%)
unique: 69,716
min: 12,788,558
max: 13,536,013,182
mean: 6,841,788,293
median: 7,006,519,241
std: 3,774,305,021
q1: 3,609,747,458
q3: 9,948,115,791
iqr: 6,338,368,333
skew: 0.020
kurtosis: -1.177
n_outliers: 0
outlier_rate: 0.000
zero_rate: 0.000

name text

35.1% duplicate strings
rows: 69,716
null: 0 (0.0%)
unique: 45,229
len_min: 1
len_max: 143
len_mean: 15.522
len_median: 13.000
len_p95: 29.000
word_mean: 2.519
word_median: 2.000
n_empty: 0
n_duplicates: 24,487
duplicate_rate: 0.351
vocab_size: 15,219
readability_flesch_mean: 55.508
emoji_rate: 1.43e-05
url_rate: 0.000
one_word_rate: 0.154
allcaps_rate: 0.032
boilerplate_rate: 0.000
Sample values (first 10)
  1. Santa Marija Caves
  2. Unnamed Cave
  3. Ipogeo 7° via Monti
  4. Grotte des Vers Luisants
  5. Unnamed Cave
  6. Alibi 1. sz. barlang
  7. skalní byty
  8. Unnamed Cave
  9. Abisso B52
  10. Cueva de la Torca 3 (corrales del trillo)

lat numeric

skew=-3.16; 12.9% rows beyond 1.5 IQR
rows: 69,716
null: 0 (0.0%)
unique: 69,544
min: -77.976
max: 78.182
mean: 40.582
median: 44.143
std: 15.476
q1: 40.495
q3: 47.750
iqr: 7.256
skew: -3.161
kurtosis: 11.499
n_outliers: 8,996
outlier_rate: 0.129
zero_rate: 0.000

lon numeric

16.2% rows beyond 1.5 IQR
rows: 69,716
null: 0 (0.0%)
unique: 69,585
min: -178.005
max: 178.798
mean: 12.025
median: 11.376
std: 40.500
q1: 1.245
q3: 18.184
iqr: 16.939
skew: 0.276
kurtosis: 4.509
n_outliers: 11,302
outlier_rate: 0.162
zero_rate: 0.000

description text

94.3% rows are a single word; 95th-percentile length under 20 chars; 94.7% duplicate strings
rows: 69,716
null: 0 (0.0%)
unique: 3,705
len_min: 0
len_max: 255
len_mean: 3.462
len_median: 0.000
len_p95: 19.000
word_mean: 1.455
word_median: 1.000
n_empty: 65,189
n_duplicates: 66,011
duplicate_rate: 0.947
vocab_size: 4,651
readability_flesch_mean: 1.752
emoji_rate: 0.000
url_rate: 6.31e-04
one_word_rate: 0.943
allcaps_rate: 2.41e-03
boilerplate_rate: 1.43e-05
Sample values (first 10)
  1. unerforschtes, nordschauendes Portal

access categorical

rows: 69,716
null: 0 (0.0%)
unique: 20
top_value: "" (empty string)
top_rate: 0.897
cardinality: 20
entropy: 0.707
entropy_ratio: 0.164
Top values (rank 1–20)
  1. — 62,551
  2. yes — 2,717
  3. no — 2,266
  4. private — 815
  5. permit — 575
  6. permissive — 440
  7. customers — 271
  8. unknown — 47
  9. destination — 11
  10. restricted — 9
  11. tidal — 2
  12. request — 2
  13. key — 2
  14. discouraged — 2
  15. designated — 1
  16. official — 1
  17. forestry — 1
  18. agricultural — 1
  19. guided — 1
  20. university — 1

tourism categorical

top value is 97.2% of rows
rows: 69,716
null: 0 (0.0%)
unique: 18
top_value: "" (empty string)
top_rate: 0.972
cardinality: 18
entropy: 0.208
entropy_ratio: 0.050
Top values (rank 1–20)
  1. — 67,776
  2. attraction — 1,670
  3. viewpoint — 119
  4. yes — 105
  5. camp_site — 7
  6. museum — 6
  7. register — 6
  8. artwork — 6
  9. information — 5
  10. picnic_site — 4
  11. cave_entrance — 3
  12. caves — 2
  13. wilderness_hut — 2
  14. attraction;museum — 1
  15. guestbook — 1
  16. no — 1
  17. cave — 1
  18. hotel — 1

wikipedia text

97.6% rows are a single word; 95th-percentile length under 20 chars; 97.0% duplicate strings
rows: 69,716
null: 0 (0.0%)
unique: 2,077
len_min: 0
len_max: 114
len_mean: 0.636
len_median: 0.000
len_p95: 0.000
word_mean: 1.044
word_median: 1.000
n_empty: 67,531
n_duplicates: 67,639
duplicate_rate: 0.970
vocab_size: 940
readability_flesch_mean: -0.025
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.976
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)

website text

100.0% rows are a single word; 95th-percentile length under 20 chars; 96.4% duplicate strings
rows: 69,716
null: 0 (0.0%)
unique: 2,492
len_min: 0
len_max: 255
len_mean: 2.789
len_median: 0.000
len_p95: 0.000
word_mean: 1.000
word_median: 1.000
n_empty: 67,082
n_duplicates: 67,224
duplicate_rate: 0.964
vocab_size: 737
readability_flesch_mean: -24.041
emoji_rate: 0.000
url_rate: 0.038
one_word_rate: 1.000
allcaps_rate: 1.43e-05
boilerplate_rate: 0.000
Sample values (first 10)
  1. https://termeszetvedelem.hu/talalati-oldal/?type=orszagos-barlangnyilvantartas&id=4762-23

cave:length categorical

158 singleton categories; top value is 99.1% of rows
rows: 69,716
null: 0 (0.0%)
unique: 238
top_value: "" (empty string)
top_rate: 0.991
cardinality: 238
entropy: 0.139
entropy_ratio: 0.018
Top values (rank 1–20)
  1. — 69,074
  2. 5 — 32
  3. 6 — 26
  4. 10 — 25
  5. 3 — 23
  6. 4 — 23
  7. 7 — 20
  8. 8 — 19
  9. 15 — 16
  10. 20 — 14
  11. 12 — 13
  12. 30 — 13
  13. 2 — 11
  14. 11 — 9
  15. 60 — 8
  16. 4.5 — 8
  17. 13 — 8
  18. 16 — 8
  19. 17 — 8
  20. 25 — 8

cave:depth categorical

87 singleton categories; top value is 99.6% of rows
rows: 69,716
null: 0 (0.0%)
unique: 124
top_value: "" (empty string)
top_rate: 0.996
cardinality: 124
entropy: 0.064
entropy_ratio: 9.25e-03
Top values (rank 1–20)
  1. — 69,419
  2. 0 — 63
  3. 10 — 13
  4. 3 — 11
  5. 1 — 9
  6. 5 — 9
  7. 4 — 8
  8. 25 — 7
  9. 30 — 6
  10. 6 — 6
  11. 2 — 6
  12. 11 — 5
  13. 35 — 5
  14. 28 — 4
  15. 14 — 4
  16. 40 — 4
  17. 70 — 4
  18. 12 — 3
  19. 8 — 3
  20. 15 — 3

country categorical

top value is 100.0% of rows
rows: 69,716
null: 0 (0.0%)
unique: 16
top_value: "" (empty string)
top_rate: 1.000
cardinality: 16
entropy: 7.36e-03
entropy_ratio: 1.84e-03
Top values (rank 1–20)
  1. — 69,684
  2. KY — 7
  3. AU — 6
  4. DE — 3
  5. RO — 2
  6. US — 2
  7. EC — 2
  8. IT — 2
  9. GB — 1
  10. MT — 1
  11. CH — 1
  12. GR — 1
  13. ET — 1
  14. LB — 1
  15. VN — 1
  16. PT — 1