quirky caves

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/caves.json

Saturn profiled 69,716 rows across 12 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/caves.json",
    "--findings", "quirky-caves.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset catalogs 69,716 caves with 12 columns covering names, geocoordinates, country, tourism/access tags, and optional metadata like description, website, and Wikipedia links. The headline issue is sparsity in the descriptive fields: 'description' is empty in 65,189 rows, 'website' in 67,082, and 'wikipedia' in 67,531, so most analytical signal sits in name and coordinates. Worth a closer look first: the 'name' column, where 19,527 entries are literally 'Unnamed Cave' and overall duplicate rate is 35%, and the geographic spread, where 'lat' is heavily left-skewed (skew -3.16) with ~12.9% outliers and 'lon' has ~16.2% outliers, suggesting a Northern-Hemisphere/European concentration with scattered global entries. The 'country' field is almost entirely blank (99.95%), so country-level analysis will need to be derived from coordinates rather than read off directly. 'Access' is the most usable categorical, with meaningful splits across yes/no/private/permit when present.

citing: row_count · column_count · columns.name.top_values · columns.name.stats.duplicate_rate · columns.description.stats.n_empty · columns.website.stats.n_empty · columns.wikipedia.stats.n_empty · columns.lat.stats.skew · columns.lat.stats.outlier_rate · columns.lon.stats.outlier_rate · columns.country.stats.top_rate · columns.access.top_values · columns.tourism.top_values

Out[4]:

saturn.schema() · 12 columns

column kind n null% unique alerts
id numeric 69,716 0.0% 69,716
name text 69,716 0.0% 45,229 duplicates
lat numeric 69,716 0.0% 69,544 high_skew outliers
lon numeric 69,716 0.0% 69,585 outliers
description text 69,716 0.0% 3,705 one_word short_text duplicates
access categorical 69,716 0.0% 20
tourism categorical 69,716 0.0% 18 imbalance
wikipedia text 69,716 0.0% 2,077 one_word short_text duplicates
website text 69,716 0.0% 2,492 one_word short_text duplicates
cave:length categorical 69,716 0.0% 238 long_tail imbalance
cave:depth categorical 69,716 0.0% 124 long_tail imbalance
country categorical 69,716 0.0% 16 imbalance
Fig 1.
lat · Check the left-skewed latitude distribution to see how heavily caves cluster in the Northern Hemisphere.
Histogram bins for lat (median: 44.14308795).
bin                count
-77.98 – -74.07    4
-74.07 – -70.17    0
-70.17 – -66.26    3
-66.26 – -62.36    0
-62.36 – -58.46    1
-58.46 – -54.55    2
-54.55 – -50.65    13
-50.65 – -46.74    26
-46.74 – -42.84    60
-42.84 – -38.94    127
-38.94 – -35.03    329
-35.03 – -31.13    330
-31.13 – -27.23    225
-27.23 – -23.32    257
-23.32 – -19.42    301
-19.42 – -15.51    204
-15.51 – -11.61    78
-11.61 – -7.705    139
-7.705 – -3.801    314
-3.801 – 0.1026    183
0.1026 – 4.007     118
4.007 – 7.911      358
7.911 – 11.81      287
11.81 – 15.72      385
15.72 – 19.62      1856
19.62 – 23.53      937
23.53 – 27.43      775
27.43 – 31.33      1082
31.33 – 35.24      1231
35.24 – 39.14      3928
39.14 – 43.05      12068
43.05 – 46.95      20377
46.95 – 50.85      19250
50.85 – 54.76      2492
54.76 – 58.66      1075
58.66 – 62.57      564
62.57 – 66.47      267
66.47 – 70.37      47
70.37 – 74.28      19
74.28 – 78.18      4
Fig 2.
lon · Longitude spread reveals the European core versus scattered entries across the Americas and Asia-Pacific.
Histogram bins for lon (median: 11.37646115).
bin                count
-178 – -169.1      54
-169.1 – -160.2    6
-160.2 – -151.2    49
-151.2 – -142.3    20
-142.3 – -133.4    6
-133.4 – -124.5    18
-124.5 – -115.6    352
-115.6 – -106.6    454
-106.6 – -97.72    283
-97.72 – -88.8     361
-88.8 – -79.88     579
-79.88 – -70.96    1877
-70.96 – -62.04    236
-62.04 – -53.12    168
-53.12 – -44.2     252
-44.2 – -35.28     203
-35.28 – -26.36    26
-26.36 – -17.44    182
-17.44 – -8.523    766
-8.523 – 0.3967    9279
0.3967 – 9.317     14696
9.317 – 18.24      22468
18.24 – 27.16      8156
27.16 – 36.08      1754
36.08 – 45         1346
45 – 53.92         530
53.92 – 62.84      479
62.84 – 71.76      150
71.76 – 80.68      381
80.68 – 89.6       210
89.6 – 98.52       212
98.52 – 107.4      1114
107.4 – 116.4      956
116.4 – 125.3      696
125.3 – 134.2      395
134.2 – 143.1      363
143.1 – 152        330
152 – 161          46
161 – 169.9        33
169.9 – 178.8      230
Fig 3.
access · Among rows that specify access, compare yes/no/private/permit to gauge how reachable tagged caves are.
Top values for access (20 unique shown, of 20 total).
value           count   share
(empty)         62551   89.7%
yes             2717    3.9%
no              2266    3.3%
private         815     1.2%
permit          575     0.8%
permissive      440     0.6%
customers       271     0.4%
unknown         47      0.1%
destination     11      0.0%
restricted      9       0.0%
tidal           2       0.0%
request         2       0.0%
key             2       0.0%
discouraged     2       0.0%
designated      1       0.0%
official        1       0.0%
forestry        1       0.0%
agricultural    1       0.0%
guided          1       0.0%
university      1       0.0%
Fig 4.
tourism · Note that 'attraction' dominates the tagged tourism values, dwarfing viewpoints, museums, and other uses.
Top values for tourism (18 unique shown, of 18 total).
value                count   share
(empty)              67776   97.2%
attraction           1670    2.4%
viewpoint            119     0.2%
yes                  105     0.2%
camp_site            7       0.0%
museum               6       0.0%
register             6       0.0%
artwork              6       0.0%
information          5       0.0%
picnic_site          4       0.0%
cave_entrance        3       0.0%
caves                2       0.0%
wilderness_hut       2       0.0%
attraction;museum    1       0.0%
guestbook            1       0.0%
no                   1       0.0%
cave                 1       0.0%
hotel                1       0.0%
Fig 5.
name · Name lengths center around 13 characters, but watch for the huge 'Unnamed Cave' spike inflating duplicates.
Character-length distribution for name (mean: 15.52237649893855).
chars        count
1 – 5        1601
5 – 8        3457
8 – 12       5854
12 – 15      31600
15 – 19      9877
19 – 22      8424
22 – 26      3274
26 – 29      2614
29 – 33      1175
33 – 36      885
36 – 40      482
40 – 44      175
44 – 47      154
47 – 51      56
51 – 54      45
54 – 58      17
58 – 61      11
61 – 65      3
65 – 68      2
68 – 72      2
72 – 76      4
76 – 79      2
79 – 83      0
83 – 86      1
86 – 90      0
90 – 93      0
93 – 97      0
97 – 100     0
100 – 104    0
104 – 108    0
108 – 111    0
111 – 115    0
115 – 118    0
118 – 122    0
122 – 125    0
125 – 129    0
129 – 132    0
132 – 136    0
136 – 139    0
139 – 143    1
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column         kind           null %
id             numeric        0.0%
name           text           0.0%
lat            numeric        0.0%
lon            numeric        0.0%
description    text           0.0%
access         categorical    0.0%
tourism        categorical    0.0%
wikipedia      text           0.0%
website        text           0.0%
cave:length    categorical    0.0%
cave:depth     categorical    0.0%
country        categorical    0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
       id       lat      lon
id     +1.00    -0.04    -0.06
lat    -0.04    +1.00    -0.15
lon    -0.06    -0.15    +1.00

id numeric identifier

This column is almost certainly a row identifier: all 69716 values are unique with zero nulls, and the numeric range spans roughly 1.28e7 to 1.354e10 with near-zero skew (0.02) and negative kurtosis (-1.18), consistent with broadly distributed assigned IDs rather than a measured quantity. No outliers or zeros are flagged. Treat the numeric stats as incidental — the magnitudes carry no analytical meaning.

Treatment: Drop from modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["id"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 69,716
min 1.279e+07
max 1.354e+10
mean 6.842e+09
median 7.007e+09
std 3.774e+09
q1 3.61e+09
q3 9.948e+09
iqr 6.338e+09
skew 0.01955
kurtosis -1.177
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 8.
Distribution of id. Vertical dash marks the median.
Histogram bins for id (median: 7006519241.0).
bin                      count
1.279e+07 – 3.509e+08    592
3.509e+08 – 6.889e+08    1123
6.889e+08 – 1.027e+09    2360
1.027e+09 – 1.365e+09    1819
1.365e+09 – 1.703e+09    1497
1.703e+09 – 2.041e+09    1907
2.041e+09 – 2.379e+09    1345
2.379e+09 – 2.717e+09    1415
2.717e+09 – 3.056e+09    1806
3.056e+09 – 3.394e+09    1977
3.394e+09 – 3.732e+09    2439
3.732e+09 – 4.07e+09     1595
4.07e+09 – 4.408e+09     1710
4.408e+09 – 4.746e+09    2212
4.746e+09 – 5.084e+09    2764
5.084e+09 – 5.422e+09    1741
5.422e+09 – 5.76e+09     1236
5.76e+09 – 6.098e+09     1268
6.098e+09 – 6.436e+09    2153
6.436e+09 – 6.774e+09    1135
6.774e+09 – 7.112e+09    2343
7.112e+09 – 7.451e+09    1932
7.451e+09 – 7.789e+09    2060
7.789e+09 – 8.127e+09    1970
8.127e+09 – 8.465e+09    1451
8.465e+09 – 8.803e+09    1229
8.803e+09 – 9.141e+09    1689
9.141e+09 – 9.479e+09    1952
9.479e+09 – 9.817e+09    2902
9.817e+09 – 1.016e+10    1840
1.016e+10 – 1.049e+10    617
1.049e+10 – 1.083e+10    1370
1.083e+10 – 1.117e+10    1613
1.117e+10 – 1.151e+10    3298
1.151e+10 – 1.185e+10    1230
1.185e+10 – 1.218e+10    1741
1.218e+10 – 1.252e+10    1317
1.252e+10 – 1.286e+10    2065
1.286e+10 – 1.32e+10     1532
1.32e+10 – 1.354e+10     1471

name text label

Short free-text names of caves/caverns in mixed languages (English 'Cave', French 'Grotte', Italian 'Grotta', Spanish 'Cueva', Cyrillic 'Грот', German 'Bärenhöhle'), averaging 2.5 words and 15.5 characters. Severe duplication (35.1%, 24,487 rows) is dominated by the placeholder 'Unnamed Cave' appearing 19,527 times — over a quarter of all rows are effectively unlabelled. Of 69,716 rows only 45,229 are unique, and the vocabulary of 15,219 tokens reflects the multilingual mix.

Treatment: Treat 'Unnamed Cave' as missing and language-normalise before any text matching or grouping.
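
A minimal sketch of this treatment in plain Python, assuming the names are already loaded as a list of strings. The helpers `clean_name` and `duplicate_rate` are hypothetical names, not part of saturn-dissect. NFKC plus `casefold` is only one reading of "language-normalise": it makes comparison caseless but does not map 'Grotte' or 'Cueva' to a common term. The duplicate-rate formula, (n - unique) / n, is an assumption inferred from the reported counts, which it reproduces.

```python
import unicodedata
from typing import Optional

PLACEHOLDER = "unnamed cave"  # placeholder reported as the top value of `name`


def clean_name(raw: str) -> Optional[str]:
    """Return a normalised name, or None for blanks and the placeholder."""
    # NFKC folds compatibility forms; casefold gives caseless comparison.
    text = unicodedata.normalize("NFKC", raw).strip().casefold()
    if not text or text == PLACEHOLDER:
        return None
    return text


def duplicate_rate(n_rows: int, n_unique: int) -> float:
    """Share of rows that repeat an already-seen value (assumed definition)."""
    return (n_rows - n_unique) / n_rows


# Names quoted in the profile, with one case variant added for illustration.
names = ["Unnamed Cave", "Bärenhöhle", "BÄRENHÖHLE ", "Грот"]
cleaned = [clean_name(n) for n in names]
```

With the reported counts, `duplicate_rate(69716, 45229)` comes to about 0.3512, matching `duplicate_rate` in the stats table. Recomputing it after the placeholder is mapped to missing would show how much duplication remains among real names.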

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["name"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 45,229
len_min 1
len_max 143
len_mean 15.52
len_median 13
len_p95 29
word_mean 2.519
word_median 2
n_empty 0
n_duplicates 24,487
duplicate_rate 0.3512
vocab_size 15,219
readability_flesch_mean 55.51
emoji_rate 1.434e-05
url_rate 0
one_word_rate 0.1545
allcaps_rate 0.03242
boilerplate_rate 0
alert: duplicates (35.1% duplicate strings)
Fig 9.
Character-length distribution for name.

lat numeric feature

Latitude coordinates span -77.98 to 78.18, so the column covers nearly the full globe from Antarctic to Arctic latitudes. The distribution is heavily left-skewed (skew -3.16, kurtosis 11.5) with a tight IQR of 7.26 around a median of 44.14: most points cluster in northern mid-latitudes, while a long tail toward the tropics and the Southern Hemisphere, plus a thin band of high-northern points, produces 8,996 outliers (12.9%). With q1 at 40.49 and q3 at 47.75, the 1.5 IQR fences sit near 29.6 and 58.6, so everything outside that band is flagged. Near-unique values (69,544 of 69,716) indicate precise geocoordinates rather than bucketed regions.

Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than treating as a raw scalar.
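
One way to read "binning or projecting" is sketched below, using only the standard library. Both helpers (`to_unit_sphere`, `grid_cell`) are hypothetical names, and the 5-degree cell size is an arbitrary assumption for illustration. Projecting to the unit sphere avoids the wrap-around at ±180° longitude; grid cells give a coarse categorical location.

```python
import math
from typing import Tuple


def to_unit_sphere(lat_deg: float, lon_deg: float) -> Tuple[float, float, float]:
    """Map a latitude/longitude pair in degrees to (x, y, z) on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def grid_cell(lat_deg: float, lon_deg: float, size_deg: float = 5.0) -> Tuple[int, int]:
    """Index of the size_deg x size_deg cell that contains the point."""
    return (math.floor(lat_deg / size_deg), math.floor(lon_deg / size_deg))


# The reported lat and lon medians, used here only as a sample point.
x, y, z = to_unit_sphere(44.14308795, 11.37646115)
cell = grid_cell(44.14308795, 11.37646115)
```

Equal-angle cells shrink in area toward the poles, so counts per cell are not directly comparable across latitudes; an equal-area scheme would be needed for density estimates.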

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["lat"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 69,544
min -77.98
max 78.18
mean 40.58
median 44.14
std 15.48
q1 40.49
q3 47.75
iqr 7.256
skew -3.161
kurtosis 11.5
n_outliers 8,996
outlier_rate 0.129
zero_rate 0
alert: high_skew (skew=-3.16)
alert: outliers (12.9% rows beyond 1.5 IQR)
Fig 10.
Distribution of lat. Vertical dash marks the median.

lon numeric feature

This is a longitude coordinate, with values spanning -178.00 to 178.80 (nearly the full range) and 69,585 unique values across 69,716 rows. The distribution is centered near a mean of 12.03 (median 11.38), with the middle half of the rows between 1.245 and 18.18 (IQR 16.94), suggesting a heavy concentration in European/African longitudes, but the std of 40.50 and 16.2% flagged outliers reveal a long global tail. No nulls or zeros are present.

Treatment: Pair with latitude for geospatial features; do not treat outliers as errors since the global range is legitimate.
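
To see why the flagged rows are not errors, the fences can be recomputed from the reported quartiles. This sketch assumes the "beyond 1.5 IQR" alert uses standard Tukey fences; `tukey_fences` is a hypothetical helper, not a saturn-dissect function.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


# Quartiles as reported in saturn.columns["lon"].stats.
lon_low, lon_high = tukey_fences(1.245, 18.18)
```

Under that assumption the fences land near -24.2 and 43.6 degrees, so every cave in the Americas or east of roughly 43.6°E counts as an outlier simply because the European core is so narrow.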

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["lon"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 69,585
min -178
max 178.8
mean 12.03
median 11.38
std 40.5
q1 1.245
q3 18.18
iqr 16.94
skew 0.2755
kurtosis 4.509
n_outliers 11,302
outlier_rate 0.1621
zero_rate 0
alert: outliers (16.2% rows beyond 1.5 IQR)
Fig 11.
Distribution of lon. Vertical dash marks the median.

description text free_text

A free-text 'description' field, but 65,189 of 69,716 rows (93.5%) are empty strings, which is what drives the 94.7% duplicate rate. Only 4,527 entries are populated; the column-wide mean length of 3.46 characters and median word count of 1 are dominated by the blanks. The non-empty values are short multilingual fragments, such as German ('nicht katasterwürdig', 'unterer Eingang'), French ('Entrée d'une carrière souterraine'), and other tokens like 'cave' and 'Halbhöhle', suggesting cave/feature annotations rather than prose. With a 94.3% one-word rate (a figure that appears to include the blank rows) and a vocab of 4,651, this is closer to sparse tagging than narrative text.

Treatment: Treat as sparse optional tag: flag presence as a boolean feature and ignore the text body unless modelling the populated subset.
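
A minimal sketch of the presence flag, assuming the column is available as a list of strings. `has_text` and `blank_share` are hypothetical helper names, and treating whitespace-only values as blank is an assumption the profile does not confirm.

```python
from typing import List


def has_text(value: str) -> bool:
    """True when the field holds anything other than whitespace."""
    return bool(value.strip())


def blank_share(values: List[str]) -> float:
    """Fraction of rows whose value is empty or whitespace-only."""
    return sum(not has_text(v) for v in values) / len(values)


# Values quoted in the profile, padded with blanks for illustration.
descriptions = ["", "unterer Eingang", "", "Halbhöhle", ""]
has_description = [has_text(d) for d in descriptions]
```

On the full column the same ratio is 65,189 / 69,716, or about 0.935, which is the blank share rather than the 0.9469 duplicate rate.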

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["description"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 3,705
len_min 0
len_max 255
len_mean 3.462
len_median 0
len_p95 19
word_mean 1.455
word_median 1
n_empty 65,189
n_duplicates 66,011
duplicate_rate 0.9469
vocab_size 4,651
readability_flesch_mean 1.752
emoji_rate 0
url_rate 0.0006311
one_word_rate 0.9428
allcaps_rate 0.00241
boilerplate_rate 1.434e-05
alert: one_word (94.3% rows are a single word)
alert: short_text (95th-percentile length under 20 chars)
alert: duplicates (94.7% duplicate strings)
Fig 12.
Character-length distribution for description.
Character-length distribution for description (mean: 3.4617591370703997).
chars        count
0 – 6        65286
6 – 13       422
13 – 19      544
19 – 26      501
26 – 32      458
32 – 38      437
38 – 45      344
45 – 51      218
51 – 57      205
57 – 64      170
64 – 70      229
70 – 76      85
76 – 83      70
83 – 89      60
89 – 96      54
96 – 102     48
102 – 108    51
108 – 115    42
115 – 121    34
121 – 128    34
128 – 134    26
134 – 140    24
140 – 147    23
147 – 153    24
153 – 159    19
159 – 166    13
166 – 172    27
172 – 178    13
178 – 185    18
185 – 191    21
191 – 198    20
198 – 204    12
204 – 210    23
210 – 217    18
217 – 223    16
223 – 230    9
230 – 236    8
236 – 242    17
242 – 249    22
249 – 255    71

access categorical feature

Categorical access-permission tag, almost certainly the OSM-style `access` key indicating who may use a feature. 89.7% of the 69,716 rows are empty strings, leaving only ~10% with substantive values like `yes` (2,717), `no` (2,266), `private` (815), `permit` (575), and `permissive` (440). Entropy ratio is just 0.16 and 20 distinct values appear, so the signal is sparse but the long tail (`customers`, `destination`, `restricted`, `unknown`) is meaningful when present.

Treatment: Treat empty string as missing, then collapse rare levels and one-hot encode the survivors.
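
The three steps of this treatment can be sketched in plain Python as follows. The helper names, the `access_` prefix, and the `min_count` threshold are all assumptions made for illustration; the right cut-off for "rare" depends on the downstream model.

```python
from collections import Counter
from typing import Dict, List, Optional


def collapse_rare(values: List[str], min_count: int) -> List[Optional[str]]:
    """Map "" to None and levels rarer than min_count to "other"."""
    counts = Counter(v for v in values if v)
    out: List[Optional[str]] = []
    for v in values:
        if not v:
            out.append(None)
        elif counts[v] < min_count:
            out.append("other")
        else:
            out.append(v)
    return out


def one_hot(values: List[Optional[str]]) -> List[Dict[str, int]]:
    """One indicator per surviving level; missing rows get all zeros."""
    levels = sorted({v for v in values if v is not None})
    return [{f"access_{lvl}": int(v == lvl) for lvl in levels} for v in values]


# A toy sample using values that appear in the access table.
sample = ["", "yes", "yes", "no", "no", "tidal", ""]
collapsed = collapse_rare(sample, min_count=2)
encoded = one_hot(collapsed)
```

Leaving missing rows as all zeros keeps "not tagged" distinct from "tagged with a rare value", which matters here because roughly nine rows in ten are untagged.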

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["access"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 20
top_value "" (empty string)
top_rate 0.8972
cardinality 20
entropy 0.7067
entropy_ratio 0.1635
Fig 13.
Top values for access.

tourism categorical feature

This is an OSM-style `tourism` tag categorising features like attractions, viewpoints, and museums across 18 distinct values. The column is severely imbalanced: 97.2% of the 69,716 rows are empty strings, and only `attraction` (1,670), `viewpoint` (119), and `yes` (105) exceed single-digit counts. Entropy ratio of 0.05 confirms almost no information content as-is.

Treatment: Treat empty string as missing and either drop or collapse rare categories into a binary `is_tourism` flag.
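
The entropy figures can be reproduced from the value counts. The sketch below assumes that `entropy` is Shannon entropy in bits and that `entropy_ratio` divides it by log2 of the cardinality; that definition is inferred from the reported numbers, not taken from saturn-dissect documentation, and `entropy_bits` is a hypothetical helper.

```python
import math
from typing import Iterable


def entropy_bits(counts: Iterable[int]) -> float:
    """Shannon entropy in bits of a distribution given by raw counts."""
    kept = [c for c in counts if c > 0]
    total = sum(kept)
    return -sum((c / total) * math.log2(c / total) for c in kept)


# Counts from the "Top values for tourism" table (18 levels, blank included).
tourism_counts = [67776, 1670, 119, 105, 7, 6, 6, 6, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1]
h = entropy_bits(tourism_counts)
ratio = h / math.log2(len(tourism_counts))
```

Under these assumptions `h` comes out near the reported 0.2076 and `ratio` near 0.0498. Because the blank level is included, the low ratio mostly measures how rarely the tag is set, not how varied the set values are.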

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["tourism"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 18
top_value "" (empty string)
top_rate 0.9722
cardinality 18
entropy 0.2076
entropy_ratio 0.04979
alert: imbalance (top value is 97.2% of rows)
Fig 14.
Top values for tourism.

wikipedia text metadata

This column holds Wikipedia article references in the OSM-style `lang:Article Title` format (e.g. `de:Einödhöhle`, `fr:Grotte...`), mostly pointing to cave-related pages across many languages. It is overwhelmingly empty: 67,531 of 69,716 rows are blank and the duplicate rate is 0.97, leaving only 2,077 unique values. Among the populated entries, language prefixes span de, fr, pl, es, it, ca, en, ru and more, so any downstream use must handle multilingual strings.

Treatment: Split into language prefix and title, and treat as a sparse optional reference rather than a modelling feature.
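
A minimal sketch of the split, assuming every populated value follows the `lang:Article Title` pattern described above. `split_wikipedia` is a hypothetical helper; values without a colon are returned as None rather than guessed at.

```python
from typing import Optional, Tuple


def split_wikipedia(tag: str) -> Optional[Tuple[str, str]]:
    """Split "lang:Title" into (lang, title); None for blank or malformed values."""
    # partition splits on the first colon only, so titles may contain colons.
    lang, sep, title = tag.strip().partition(":")
    if not sep or not lang or not title:
        return None
    return (lang, title)


parsed = split_wikipedia("de:Einödhöhle")
```

Splitting on the first colon only keeps any later colons inside the article title intact.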

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["wikipedia"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 2,077
len_min 0
len_max 114
len_mean 0.636
len_median 0
len_p95 0
word_mean 1.044
word_median 1
n_empty 67,531
n_duplicates 67,639
duplicate_rate 0.9702
vocab_size 940
readability_flesch_mean -0.02485
emoji_rate 0
url_rate 0
one_word_rate 0.9756
allcaps_rate 0
boilerplate_rate 0
alert: one_word (97.6% rows are a single word)
alert: short_text (95th-percentile length under 20 chars)
alert: duplicates (97.0% duplicate strings)
Fig 15.
Character-length distribution for wikipedia.
Character-length distribution for wikipedia (mean: 0.6359802627804234).
chars        count
0 – 3        67531
3 – 6        0
6 – 9        42
9 – 11       85
11 – 14      252
14 – 17      402
17 – 20      335
20 – 23      445
23 – 26      243
26 – 28      131
28 – 31      102
31 – 34      72
34 – 37      33
37 – 40      13
40 – 43      12
43 – 46      6
46 – 48      4
48 – 51      1
51 – 54      1
54 – 57      1
57 – 60      3
60 – 63      0
63 – 66      1
66 – 68      0
68 – 71      0
71 – 74      0
74 – 77      0
77 – 80      0
80 – 83      0
83 – 86      0
86 – 88      0
88 – 91      0
91 – 94      0
94 – 97      0
97 – 100     0
100 – 103    0
103 – 105    0
105 – 108    0
108 – 111    0
111 – 114    1

website text metadata

This is a website/URL field for each record, but it is overwhelmingly empty: 67,082 of 69,716 rows (n_empty) are blank, leaving only 2,492 unique values across the column. Where populated, entries are single-token URLs (one_word_rate 0.99998, word_mean ~1.0) pointing to varied external domains (angloasianmining.com, unesco.org, termeszetvedelem.hu, etc.). The duplicate_rate of 0.96 is driven almost entirely by the empty string, and url_rate is only 0.038 because the metric is computed over all rows including blanks.

Treatment: Treat as optional reference URL; recode blanks as null and avoid using as a feature given >96% blanks.
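
If the populated URLs are kept as a reference, grouping them by host is a natural first step. The sketch below uses `urllib.parse.urlsplit` from the standard library; `host_of` is a hypothetical helper, and the handling of values without a scheme is an assumption about how such entries are written.

```python
from typing import Optional
from urllib.parse import urlsplit


def host_of(url: str) -> Optional[str]:
    """Return the lower-cased host of a URL, or None for blank values."""
    url = url.strip()
    if not url:
        return None
    # urlsplit only fills the network location when "//" precedes it.
    if "://" not in url:
        url = "//" + url
    host = urlsplit(url).hostname
    return host.lower() if host else None


hosts = [host_of(u) for u in ["", "angloasianmining.com", "https://whc.unesco.org/en/list"]]
```

The populated share itself is (69,716 - 67,082) / 69,716, about 0.0378, which is close to the reported `url_rate` of 0.03775 and consistent with the rate being computed over all rows.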

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["website"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 2,492
len_min 0
len_max 255
len_mean 2.789
len_median 0
len_p95 0
word_mean 1
word_median 1
n_empty 67,082
n_duplicates 67,224
duplicate_rate 0.9643
vocab_size 737
readability_flesch_mean -24.04
emoji_rate 0
url_rate 0.03775
one_word_rate 1
allcaps_rate 1.434e-05
boilerplate_rate 0
alert: one_word (100.0% rows are a single word)
alert: short_text (95th-percentile length under 20 chars)
alert: duplicates (96.4% duplicate strings)
Fig 16.
Character-length distribution for website.
Character-length distribution for website (mean: 2.7889580584084).
chars        count
0 – 6        67082
6 – 13       0
13 – 19      11
19 – 26      45
26 – 32      130
32 – 38      161
38 – 45      109
45 – 51      118
51 – 57      86
57 – 64      101
64 – 70      223
70 – 76      45
76 – 83      49
83 – 89      1360
89 – 96      127
96 – 102     16
102 – 108    14
108 – 115    8
115 – 121    2
121 – 128    5
128 – 134    5
134 – 140    0
140 – 147    5
147 – 153    1
153 – 159    6
159 – 166    0
166 – 172    0
172 – 178    0
178 – 185    1
185 – 191    1
191 – 198    0
198 – 204    0
204 – 210    0
210 – 217    1
217 – 223    2
223 – 230    0
230 – 236    1
236 – 242    0
242 – 249    0
249 – 255    1

cave:length categorical metadata

This appears to be a cave length attribute (likely an OpenStreetMap-style tag) stored as a string, with 99.08% of the 69,716 rows being empty and only 238 distinct values overall. When populated, the values are small numbers (5, 6, 10, 3, 4...), which suggests metres, but the signal is so sparse (entropy ratio 0.0176) that it carries almost no information. The counts for the non-blank top values are tiny (≤32 each), indicating this tag is rarely filled in upstream.

Treatment: Drop or convert to a numeric 'has_length' indicator; the column is too sparse to model directly.
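
A minimal sketch of the indicator conversion, assuming the raw values are strings as profiled. `parse_length` is a hypothetical helper, and the unit is left unstated because the profile only suggests metres without confirming it.

```python
from typing import Optional, Tuple


def parse_length(raw: str) -> Tuple[bool, Optional[float]]:
    """Return (has_length, value); value is None when blank or unparseable."""
    text = raw.strip()
    if not text:
        return (False, None)
    try:
        return (True, float(text))
    except ValueError:
        # Populated but not a plain number, for example a value with a unit suffix.
        return (True, None)


# Values that appear in the cave:length top-values table, plus a blank.
parsed = [parse_length(v) for v in ["", "5", "4.5", "60"]]
```

Keeping the flag separate from the parsed value means a populated but unparseable entry still counts toward `has_length`.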

anthropic:claude-opus-4-7 · confidence high
Out[40]:

saturn.columns["cave:length"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 238
top_value "" (empty string)
top_rate 0.9908
cardinality 238
entropy 0.1392
entropy_ratio 0.01763
alert: long_tail (158 singleton categories)
alert: imbalance (top value is 99.1% of rows)
Fig 17.
Top values for cave:length.
Top values for cave:length (20 unique shown, of 238 total).
value      count   share
(empty)    69074   99.1%
5          32      0.0%
6          26      0.0%
10         25      0.0%
3          23      0.0%
4          23      0.0%
7          20      0.0%
8          19      0.0%
15         16      0.0%
20         14      0.0%
12         13      0.0%
30         13      0.0%
2          11      0.0%
11         9       0.0%
60         8       0.0%
4.5        8       0.0%
13         8       0.0%
16         8       0.0%
17         8       0.0%
25         8       0.0%

cave:depth categorical metadata

This appears to be a cave depth attribute, likely sourced from an OpenStreetMap-style tag, stored as strings. It is effectively empty: 69,419 of 69,716 rows (top_rate 0.9957) carry the blank value "", and the remaining 123 distinct values are small number-like strings (the top values shown run from "0" to "70"), each with a count of 63 or fewer. Entropy ratio of 0.0092 confirms there is virtually no signal here.

Treatment: Drop; near-constant blank with negligible entropy.

anthropic:claude-opus-4-7 · confidence high
Out[43]:

saturn.columns["cave:depth"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 124
top_value "" (empty string)
top_rate 0.9957
cardinality 124
entropy 0.06432
entropy_ratio 0.00925
alert: long_tail (87 singleton categories)
alert: imbalance (top value is 99.6% of rows)
Fig 18.
Top values for cave:depth.
Top values for cave:depth (20 unique shown, of 124 total).
value      count   share
(empty)    69419   99.6%
0          63      0.1%
10         13      0.0%
3          11      0.0%
1          9       0.0%
5          9       0.0%
4          8       0.0%
25         7       0.0%
30         6       0.0%
6          6       0.0%
2          6       0.0%
11         5       0.0%
35         5       0.0%
28         4       0.0%
14         4       0.0%
40         4       0.0%
70         4       0.0%
12         3       0.0%
8          3       0.0%
15         3       0.0%

country categorical feature

Two-letter ISO country code, but effectively empty: 69,684 of 69,716 rows (top_rate 0.9995) carry a blank string, leaving only 32 rows spread across 15 actual codes (KY, AU, DE, RO, US, etc.). Entropy ratio of 0.0018 confirms there is essentially no signal here. No nulls are reported because the missingness is encoded as empty string rather than NULL.

Treatment: Drop; near-constant empty string with only 32 populated rows.

anthropic:claude-opus-4-7 · confidence high
Out[46]:

saturn.columns["country"].stats

stat value
n 69,716
nulls 0 (0.0%)
unique 16
top_value "" (empty string)
top_rate 0.9995
cardinality 16
entropy 0.00736
entropy_ratio 0.00184
alert: imbalance (top value is 100.0% of rows)
Fig 19.
Top values for country.
Top values for country (16 unique shown, of 16 total).
value      count   share
(empty)    69684   100.0%
KY         7       0.0%
AU         6       0.0%
DE         3       0.0%
RO         2       0.0%
US         2       0.0%
EC         2       0.0%
IT         2       0.0%
GB         1       0.0%
MT         1       0.0%
CH         1       0.0%
GR         1       0.0%
ET         1       0.0%
LB         1       0.0%
VN         1       0.0%
PT         1       0.0%

How to cite

BibTeX
@misc{saturn-quirky-caves-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky caves},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-caves}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky caves. Source: /home/coolhand/html/datavis/data_trove/data/quirky/caves.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-caves