quirky-caves · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/caves.json

Saturn profiled 69,716 rows across 12 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

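!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source file, passing quirky-caves.json as the findings file.
# check=True raises CalledProcessError if the command exits with a non-zero status.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/caves.json",
    "--findings", "quirky-caves.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)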

Summary confidence: high

This dataset catalogs 69,716 caves with 12 columns covering names, geocoordinates, country, tourism/access tags, and optional metadata like description, website, and Wikipedia links. The headline issue is sparsity in the descriptive fields: 'description' is empty in 65,189 rows, 'website' in 67,082, and 'wikipedia' in 67,531, so most analytical signal sits in name and coordinates. Worth a closer look first: the 'name' column, where 19,527 entries are literally 'Unnamed Cave' and the overall duplicate rate is 35%, and the geographic spread, where 'lat' is heavily left-skewed (skew -3.16) with ~12.9% outliers and 'lon' has ~16.2% outliers, suggesting a Northern-Hemisphere/European concentration with scattered global entries. The 'country' field is almost entirely blank (99.95%), so country-level analysis will need to be derived from coordinates rather than read off directly. 'Access' is the most usable categorical, with meaningful splits across yes/no/private/permit when present, though it is blank in 89.7% of rows.

citing: row_count · column_count · columns.name.top_values · columns.name.stats.duplicate_rate · columns.description.stats.n_empty · columns.website.stats.n_empty · columns.wikipedia.stats.n_empty · columns.lat.stats.skew · columns.lat.stats.outlier_rate · columns.lon.stats.outlier_rate · columns.country.stats.top_rate · columns.access.top_values · columns.tourism.top_values

Out[4]:

saturn.schema() · 12 columns

column	kind	n	null%	unique	alerts
id	numeric	69,716	0.0%	69,716
name	text	69,716	0.0%	45,229	duplicates
lat	numeric	69,716	0.0%	69,544	high_skew outliers
lon	numeric	69,716	0.0%	69,585	outliers
description	text	69,716	0.0%	3,705	one_word short_text duplicates
access	categorical	69,716	0.0%	20
tourism	categorical	69,716	0.0%	18	imbalance
wikipedia	text	69,716	0.0%	2,077	one_word short_text duplicates
website	text	69,716	0.0%	2,492	one_word short_text duplicates
cave:length	categorical	69,716	0.0%	238	long_tail imbalance
cave:depth	categorical	69,716	0.0%	124	long_tail imbalance
country	categorical	69,716	0.0%	16	imbalance

Fig 1.

lat · Check the left-skewed latitude distribution to see how heavily caves cluster in the Northern Hemisphere.

Histogram bins for lat (median: 44.14308795).
bin	count
-77.98 – -74.07	4
-74.07 – -70.17	0
-70.17 – -66.26	3
-66.26 – -62.36	0
-62.36 – -58.46	1
-58.46 – -54.55	2
-54.55 – -50.65	13
-50.65 – -46.74	26
-46.74 – -42.84	60
-42.84 – -38.94	127
-38.94 – -35.03	329
-35.03 – -31.13	330
-31.13 – -27.23	225
-27.23 – -23.32	257
-23.32 – -19.42	301
-19.42 – -15.51	204
-15.51 – -11.61	78
-11.61 – -7.705	139
-7.705 – -3.801	314
-3.801 – 0.1026	183
0.1026 – 4.007	118
4.007 – 7.911	358
7.911 – 11.81	287
11.81 – 15.72	385
15.72 – 19.62	1856
19.62 – 23.53	937
23.53 – 27.43	775
27.43 – 31.33	1082
31.33 – 35.24	1231
35.24 – 39.14	3928
39.14 – 43.05	12068
43.05 – 46.95	20377
46.95 – 50.85	19250
50.85 – 54.76	2492
54.76 – 58.66	1075
58.66 – 62.57	564
62.57 – 66.47	267
66.47 – 70.37	47
70.37 – 74.28	19
74.28 – 78.18	4

Fig 2.

lon · Longitude spread reveals the European core versus scattered entries across the Americas and Asia-Pacific.

Histogram bins for lon (median: 11.37646115).
bin	count
-178 – -169.1	54
-169.1 – -160.2	6
-160.2 – -151.2	49
-151.2 – -142.3	20
-142.3 – -133.4	6
-133.4 – -124.5	18
-124.5 – -115.6	352
-115.6 – -106.6	454
-106.6 – -97.72	283
-97.72 – -88.8	361
-88.8 – -79.88	579
-79.88 – -70.96	1877
-70.96 – -62.04	236
-62.04 – -53.12	168
-53.12 – -44.2	252
-44.2 – -35.28	203
-35.28 – -26.36	26
-26.36 – -17.44	182
-17.44 – -8.523	766
-8.523 – 0.3967	9279
0.3967 – 9.317	14696
9.317 – 18.24	22468
18.24 – 27.16	8156
27.16 – 36.08	1754
36.08 – 45	1346
45 – 53.92	530
53.92 – 62.84	479
62.84 – 71.76	150
71.76 – 80.68	381
80.68 – 89.6	210
89.6 – 98.52	212
98.52 – 107.4	1114
107.4 – 116.4	956
116.4 – 125.3	696
125.3 – 134.2	395
134.2 – 143.1	363
143.1 – 152	330
152 – 161	46
161 – 169.9	33
169.9 – 178.8	230

Fig 3.

access · Among rows that specify access, compare yes/no/private/permit to gauge how reachable tagged caves are.

Top values for access (20 unique shown, of 20 total).
value	count	share
(empty)	62551	89.7%
yes	2717	3.9%
no	2266	3.3%
private	815	1.2%
permit	575	0.8%
permissive	440	0.6%
customers	271	0.4%
unknown	47	0.1%
destination	11	0.0%
restricted	9	0.0%
tidal	2	0.0%
request	2	0.0%
key	2	0.0%
discouraged	2	0.0%
designated	1	0.0%
official	1	0.0%
forestry	1	0.0%
agricultural	1	0.0%
guided	1	0.0%
university	1	0.0%

Fig 4.

tourism · Note that 'attraction' dominates the tagged tourism values, dwarfing viewpoints, museums, and other uses.

Top values for tourism (18 unique shown, of 18 total).
value	count	share
(empty)	67776	97.2%
attraction	1670	2.4%
viewpoint	119	0.2%
yes	105	0.2%
camp_site	7	0.0%
museum	6	0.0%
register	6	0.0%
artwork	6	0.0%
information	5	0.0%
picnic_site	4	0.0%
cave_entrance	3	0.0%
caves	2	0.0%
wilderness_hut	2	0.0%
attraction;museum	1	0.0%
guestbook	1	0.0%
no	1	0.0%
cave	1	0.0%
hotel	1	0.0%

Fig 5.

name · Name lengths center around 13 characters, but watch for the huge 'Unnamed Cave' spike inflating duplicates.

Character-length distribution for name (mean: 15.52237649893855).
chars	count
1 – 5	1601
5 – 8	3457
8 – 12	5854
12 – 15	31600
15 – 19	9877
19 – 22	8424
22 – 26	3274
26 – 29	2614
29 – 33	1175
33 – 36	885
36 – 40	482
40 – 44	175
44 – 47	154
47 – 51	56
51 – 54	45
54 – 58	17
58 – 61	11
61 – 65	3
65 – 68	2
68 – 72	2
72 – 76	4
76 – 79	2
79 – 83	0
83 – 86	1
86 – 90	0
90 – 93	0
93 – 97	0
97 – 100	0
100 – 104	0
104 – 108	0
108 – 111	0
111 – 115	0
115 – 118	0
118 – 122	0
122 – 125	0
125 – 129	0
129 – 132	0
132 – 136	0
136 – 139	0
139 – 143	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	numeric	0.0%
name	text	0.0%
lat	numeric	0.0%
lon	numeric	0.0%
description	text	0.0%
access	categorical	0.0%
tourism	categorical	0.0%
wikipedia	text	0.0%
website	text	0.0%
cave:length	categorical	0.0%
cave:depth	categorical	0.0%
country	categorical	0.0%
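
Every column shows 0.0% nulls here because missing values are stored as empty strings rather than nulls. A minimal sketch of the effective missingness follows, using only the blank counts reported elsewhere in this notebook (`n_empty` for the text columns, the blank row of each top-values table for the categoricals). The names `blank_counts` and `blank_rate` are illustrative and not part of saturn.

```python
# Blank-string counts copied from this notebook's stats and top-values tables.
N_ROWS = 69_716
blank_counts = {
    "description": 65_189,
    "access": 62_551,
    "tourism": 67_776,
    "wikipedia": 67_531,
    "website": 67_082,
    "cave:length": 69_074,
    "cave:depth": 69_419,
    "country": 69_684,
}


def blank_rate(n_blank: int, n_rows: int = N_ROWS) -> float:
    """Share of rows whose value is the empty string."""
    return n_blank / n_rows


blank_rates = {col: blank_rate(n) for col, n in blank_counts.items()}

for col, rate in blank_rates.items():
    print(f"{col:<12} {rate:6.1%}")
```

Read this way, the eight optional columns are between roughly 90% and just under 100% missing, which is the sparsity the overview describes.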

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	id	lat	lon
id	+1.00	-0.04	-0.06
lat	-0.04	+1.00	-0.15
lon	-0.06	-0.15	+1.00

id numeric identifier

This column is almost certainly a row identifier: all 69,716 values are unique with zero nulls, and the numeric range spans roughly 1.28e7 to 1.354e10 with near-zero skew (0.02) and negative kurtosis (-1.18), consistent with broadly distributed assigned IDs rather than a measured quantity. No outliers or zeros are flagged. Treat the numeric stats as incidental: the magnitudes carry no analytical meaning.

Treatment: Drop from modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["id"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	69,716
min	1.279e+07
max	1.354e+10
mean	6.842e+09
median	7.007e+09
std	3.774e+09
q1	3.61e+09
q3	9.948e+09
iqr	6.338e+09
skew	0.01955
kurtosis	-1.177
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 8.

Distribution of id. Vertical dash marks the median.

Histogram bins for id (median: 7006519241.0).
bin	count
1.279e+07 – 3.509e+08	592
3.509e+08 – 6.889e+08	1123
6.889e+08 – 1.027e+09	2360
1.027e+09 – 1.365e+09	1819
1.365e+09 – 1.703e+09	1497
1.703e+09 – 2.041e+09	1907
2.041e+09 – 2.379e+09	1345
2.379e+09 – 2.717e+09	1415
2.717e+09 – 3.056e+09	1806
3.056e+09 – 3.394e+09	1977
3.394e+09 – 3.732e+09	2439
3.732e+09 – 4.07e+09	1595
4.07e+09 – 4.408e+09	1710
4.408e+09 – 4.746e+09	2212
4.746e+09 – 5.084e+09	2764
5.084e+09 – 5.422e+09	1741
5.422e+09 – 5.76e+09	1236
5.76e+09 – 6.098e+09	1268
6.098e+09 – 6.436e+09	2153
6.436e+09 – 6.774e+09	1135
6.774e+09 – 7.112e+09	2343
7.112e+09 – 7.451e+09	1932
7.451e+09 – 7.789e+09	2060
7.789e+09 – 8.127e+09	1970
8.127e+09 – 8.465e+09	1451
8.465e+09 – 8.803e+09	1229
8.803e+09 – 9.141e+09	1689
9.141e+09 – 9.479e+09	1952
9.479e+09 – 9.817e+09	2902
9.817e+09 – 1.016e+10	1840
1.016e+10 – 1.049e+10	617
1.049e+10 – 1.083e+10	1370
1.083e+10 – 1.117e+10	1613
1.117e+10 – 1.151e+10	3298
1.151e+10 – 1.185e+10	1230
1.185e+10 – 1.218e+10	1741
1.218e+10 – 1.252e+10	1317
1.252e+10 – 1.286e+10	2065
1.286e+10 – 1.32e+10	1532
1.32e+10 – 1.354e+10	1471

name text label

Short free-text names of caves/caverns in mixed languages (English 'Cave', French 'Grotte', Italian 'Grotta', Spanish 'Cueva', Cyrillic 'Грот', German 'Bärenhöhle'), averaging 2.5 words and 15.5 characters. Severe duplication (35.1%, 24,487 rows) is dominated by the placeholder 'Unnamed Cave', which appears 19,527 times, so over a quarter of all rows (28.0%) are effectively unlabelled. Only 45,229 distinct values occur across the 69,716 rows, and the vocabulary of 15,219 tokens reflects the multilingual mix.

Treatment: Treat 'Unnamed Cave' as missing and language-normalise before any text matching or grouping.

anthropic:claude-opus-4-7 · confidence high
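
The treatment above can be sketched as a small cleaning helper. This is a minimal illustration that reads "language-normalise" narrowly as Unicode normalisation plus case folding, which is an assumption; it does not translate names or strip generic words such as 'Grotte' or 'Cueva'. The function name `clean_name` is hypothetical.

```python
import unicodedata
from typing import Optional

# The placeholder reported 19,527 times in the name column, in case-folded form.
PLACEHOLDER = "unnamed cave"


def clean_name(raw: str) -> Optional[str]:
    """Return a normalised name, or None for blank or placeholder names."""
    # NFKC folds compatibility forms; casefold enables caseless comparison.
    text = unicodedata.normalize("NFKC", raw).strip().casefold()
    # Collapse internal runs of whitespace to single spaces.
    text = " ".join(text.split())
    if not text or text == PLACEHOLDER:
        return None
    return text
```

Grouping or matching on the returned value keeps the placeholder out of duplicate counts while leaving genuinely repeated names intact.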

Out[16]:

saturn.columns["name"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	45,229
len_min	1
len_max	143
len_mean	15.52
len_median	13
len_p95	29
word_mean	2.519
word_median	2
n_empty	0
n_duplicates	24,487
duplicate_rate	0.3512
vocab_size	15,219
readability_flesch_mean	55.51
emoji_rate	1.434e-05
url_rate	0
one_word_rate	0.1545
allcaps_rate	0.03242
boilerplate_rate	0
alert: duplicates	35.1% duplicate strings
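
The duplicate figures in the text-column stats tables are consistent with counting every row beyond the first occurrence of each distinct string, i.e. `n_duplicates = n - unique`. That reading is inferred from the reported numbers, not from saturn's documentation. A short check under that assumption:

```python
from typing import Tuple

# (n, unique, reported n_duplicates, reported duplicate_rate) from the stats tables.
reported = {
    "name": (69_716, 45_229, 24_487, 0.3512),
    "description": (69_716, 3_705, 66_011, 0.9469),
    "wikipedia": (69_716, 2_077, 67_639, 0.9702),
    "website": (69_716, 2_492, 67_224, 0.9643),
}


def duplicate_stats(n: int, unique: int) -> Tuple[int, float]:
    """Rows beyond the first occurrence of each distinct string, and their share."""
    n_duplicates = n - unique
    return n_duplicates, n_duplicates / n


derived = {col: duplicate_stats(n, u) for col, (n, u, _, _) in reported.items()}

for col, (n_dup, rate) in derived.items():
    print(f"{col:<12} n_duplicates={n_dup:>6}  duplicate_rate={rate:.4f}")
```

One consequence of this definition is that a column dominated by a single repeated value, such as the empty string, shows a duplicate rate close to its blank rate.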

Fig 9.

Character-length distribution for name.

Bins are identical to the data table under Fig 5.

lat numeric feature

Latitude coordinates spanning -77.98 to 78.18, so the column covers nearly the full globe from Antarctic to Arctic ranges. The distribution is heavily left-skewed (skew -3.16, kurtosis 11.5) with a tight IQR of 7.26 around a median of 44.14, indicating that most points cluster in northern mid-latitudes. With quartiles at 40.49 and 47.75, the 1.5 IQR fences sit at roughly 29.6 and 58.6 degrees, so the 8,996 outliers (12.9%) are mostly the long tail toward the tropics and the Southern Hemisphere, plus a smaller group at high northern latitudes. Near-unique values (69,544 of 69,716) confirm these are precise geocoordinates rather than bucketed regions.

Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than treating as a raw scalar.

anthropic:claude-opus-4-7 · confidence high
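
As one possible reading of "binning or projecting", the sketch below maps a latitude/longitude pair to a point on the unit sphere (so that longitudes near -180 and +180 end up close together) and to a coarse grid cell. The helper names and the 5-degree cell size are illustrative choices, not something saturn prescribes.

```python
import math


def to_unit_vector(lat_deg: float, lon_deg: float) -> tuple:
    """Map a latitude/longitude pair (degrees) to x, y, z on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def grid_cell(lat_deg: float, lon_deg: float, size_deg: float = 5.0) -> tuple:
    """Index of the size_deg by size_deg cell that contains the point."""
    return (math.floor(lat_deg / size_deg), math.floor(lon_deg / size_deg))


# The reported column medians, used here only as a sample point.
x, y, z = to_unit_vector(44.14308795, 11.37646115)
cell = grid_cell(44.14308795, 11.37646115)
```

The three unit-vector components can be used as model features in place of raw degrees, while the grid cell gives a simple categorical region key.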

Out[19]:

saturn.columns["lat"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	69,544
min	-77.98
max	78.18
mean	40.58
median	44.14
std	15.48
q1	40.49
q3	47.75
iqr	7.256
skew	-3.161
kurtosis	11.5
n_outliers	8,996
outlier_rate	0.129
zero_rate	0
alert: high_skew	skew=-3.16
alert: outliers	12.9% rows beyond 1.5 IQR
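
The outlier alert refers to rows beyond 1.5 IQR. Assuming this means the standard Tukey fences (an assumption; the exact rule saturn applies is not shown here), the cut-offs can be recomputed from the quartiles in the table above:

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Lower and upper outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported in the stats table for lat.
lat_low, lat_high = tukey_fences(40.49, 47.75)
print(f"lat outliers fall below {lat_low:.2f} or above {lat_high:.2f}")
```

With these inputs the fences land near 29.6 and 58.6 degrees, so under this assumption the flagged rows include tropical and subtropical Northern-Hemisphere caves as well as Southern-Hemisphere ones.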

Fig 10.

Distribution of lat. Vertical dash marks the median.

Histogram bins for lat (median: 44.14308795) are identical to the data table under Fig 1.

lon numeric feature

This is a longitude coordinate, with values spanning nearly the full range (-178.00 to 178.80) and 69,585 unique values across 69,716 rows. The distribution is centered near 12.03 (median 11.38) with quartiles at 1.245 and 18.18 (IQR 16.94), suggesting a heavy concentration in European/African longitudes, but the std of 40.50 and 16.2% flagged outliers reveal a long global tail. No nulls or zeros are present.

Treatment: Pair with latitude for geospatial features; do not treat outliers as errors since the global range is legitimate.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["lon"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	69,585
min	-178
max	178.8
mean	12.03
median	11.38
std	40.5
q1	1.245
q3	18.18
iqr	16.94
skew	0.2755
kurtosis	4.509
n_outliers	11,302
outlier_rate	0.1621
zero_rate	0
alert: outliers	16.2% rows beyond 1.5 IQR

Fig 11.

Distribution of lon. Vertical dash marks the median.

Histogram bins for lon (median: 11.37646115) are identical to the data table under Fig 2.

description text free_text

A free-text 'description' field, but 65,189 of 69,716 rows (93.5%) are empty strings, which also drives the 94.7% duplicate rate. Only about 4,500 entries are populated, so the column-wide mean length is just 3.46 characters (median 0) and the median word count is 1. The non-empty values are short multilingual fragments, such as German ('nicht katasterwürdig', 'unterer Eingang'), French ('Entrée d'une carrière souterraine'), and other tokens like 'cave' and 'Halbhöhle', suggesting cave/feature annotations rather than prose. With a 94.3% one-word rate and a vocab of 4,651, this is closer to sparse tagging than narrative text.

Treatment: Treat as sparse optional tag: flag presence as a boolean feature and ignore the text body unless modelling the populated subset.

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["description"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	3,705
len_min	0
len_max	255
len_mean	3.462
len_median	0
len_p95	19
word_mean	1.455
word_median	1
n_empty	65,189
n_duplicates	66,011
duplicate_rate	0.9469
vocab_size	4,651
readability_flesch_mean	1.752
emoji_rate	0
url_rate	0.0006311
one_word_rate	0.9428
allcaps_rate	0.00241
boilerplate_rate	1.434e-05
alert: one_word	94.3% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	94.7% duplicate strings

Fig 12.

Character-length distribution for description.

Character-length distribution for description (mean: 3.4617591370703997).
chars	count
0 – 6	65286
6 – 13	422
13 – 19	544
19 – 26	501
26 – 32	458
32 – 38	437
38 – 45	344
45 – 51	218
51 – 57	205
57 – 64	170
64 – 70	229
70 – 76	85
76 – 83	70
83 – 89	60
89 – 96	54
96 – 102	48
102 – 108	51
108 – 115	42
115 – 121	34
121 – 128	34
128 – 134	26
134 – 140	24
140 – 147	23
147 – 153	24
153 – 159	19
159 – 166	13
166 – 172	27
172 – 178	13
178 – 185	18
185 – 191	21
191 – 198	20
198 – 204	12
204 – 210	23
210 – 217	18
217 – 223	16
223 – 230	9
230 – 236	8
236 – 242	17
242 – 249	22
249 – 255	71

access categorical feature

Categorical access-permission tag, almost certainly the OSM-style `access` key indicating who may use a feature. 89.7% of the 69,716 rows are empty strings, leaving only ~10% with substantive values like `yes` (2,717), `no` (2,266), `private` (815), `permit` (575), and `permissive` (440). Entropy ratio is just 0.16 and 20 distinct values appear, so the signal is sparse but the long tail (`customers`, `destination`, `restricted`, `unknown`) is meaningful when present.

Treatment: Treat empty string as missing, then collapse rare levels and one-hot encode the survivors.

anthropic:claude-opus-4-7 · confidence high
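
The treatment above (blank as missing, rare levels collapsed, survivors one-hot encoded) can be sketched in plain Python. The helper names, the `min_count` threshold, and the toy sample are illustrative; the sample reuses values from the top-values table but is not the real data.

```python
from collections import Counter
from typing import Dict, List, Optional


def collapse_rare(values: List[str], min_count: int) -> List[Optional[str]]:
    """Map "" to None and levels seen fewer than min_count times to "other"."""
    counts = Counter(v for v in values if v != "")
    out: List[Optional[str]] = []
    for v in values:
        if v == "":
            out.append(None)
        elif counts[v] < min_count:
            out.append("other")
        else:
            out.append(v)
    return out


def one_hot(values: List[Optional[str]]) -> List[Dict[str, int]]:
    """One indicator per surviving level; missing rows get all zeros."""
    levels = sorted({v for v in values if v is not None})
    return [{f"access={lvl}": int(v == lvl) for lvl in levels} for v in values]


# Toy sample built from values in the top-values table; not the real rows.
sample = ["yes", "", "no", "yes", "tidal", "", "private", "no", "key"]
collapsed = collapse_rare(sample, min_count=2)
encoded = one_hot(collapsed)
```

On the full column a threshold would be chosen from the counts in the top-values table, for example high enough to fold the single-digit levels into `other`.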

Out[28]:

saturn.columns["access"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	20
top_value	(empty string)
top_rate	0.8972
cardinality	20
entropy	0.7067
entropy_ratio	0.1635
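
The reported `entropy` and `entropy_ratio` are consistent with base-2 Shannon entropy and with that entropy divided by log2 of the cardinality. This is inferred from the numbers rather than from saturn's documentation. A sketch that recomputes both from the access counts in Fig 3:

```python
import math

# Counts copied from the top-values table for access (blank string first).
access_counts = [62551, 2717, 2266, 815, 575, 440, 271, 47, 11, 9,
                 2, 2, 2, 2, 1, 1, 1, 1, 1, 1]


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


entropy = shannon_entropy_bits(access_counts)
# Normalise by the maximum possible entropy for this many levels (uniform case).
entropy_ratio = entropy / math.log2(len(access_counts))
print(f"entropy={entropy:.4f} bits, entropy_ratio={entropy_ratio:.4f}")
```

A ratio near 0 means one level dominates, as the blank string does here; a ratio of 1 would mean all 20 levels were equally common.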

Fig 13.

Top values for access.

The top-values table for access is identical to the one under Fig 3.

tourism categorical feature

This is an OSM-style `tourism` tag categorising features like attractions, viewpoints, and museums across 18 distinct values (including the empty string). The column is severely imbalanced: 97.2% of the 69,716 rows are empty strings, with `attraction` (1,670) the only sizeable category, followed distantly by `viewpoint` (119) and `yes` (105). Entropy ratio of 0.05 confirms almost no information content as-is.

Treatment: Treat empty string as missing and either drop or collapse rare categories into a binary `is_tourism` flag.

anthropic:claude-opus-4-7 · confidence high
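
A minimal sketch of the `is_tourism` flag suggested in the treatment. Two choices here are assumptions rather than anything the profile specifies: blanks map to `None` (missing) instead of `False`, and the explicit value `no` maps to `False`.

```python
from typing import Optional


def is_tourism(value: str) -> Optional[bool]:
    """True for any tourism tag, False for an explicit "no", None when blank."""
    tag = value.strip().lower()
    if tag == "":
        return None  # blank is treated as missing, not as False
    return tag != "no"
```

Keeping missing separate from `False` avoids asserting that 97.2% of caves are not tourist sites when the tag is simply absent.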

Out[31]:

saturn.columns["tourism"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	18
top_value	(empty string)
top_rate	0.9722
cardinality	18
entropy	0.2076
entropy_ratio	0.04979
alert: imbalance	top value is 97.2% of rows

Fig 14.

Top values for tourism.

The top-values table for tourism is identical to the one under Fig 4.

wikipedia text metadata

This column holds Wikipedia article references in the OSM-style `lang:Article Title` format (e.g. `de:Einödhöhle`, `fr:Grotte...`), mostly pointing to cave-related pages across many languages. It is overwhelmingly empty: 67,531 of 69,716 rows are blank and the duplicate rate is 0.97, leaving only 2,077 unique values. Among the populated entries, language prefixes span de, fr, pl, es, it, ca, en, ru and more, so any downstream use must handle multilingual strings.

Treatment: Split into language prefix and title, and treat as a sparse optional reference rather than a modelling feature.

anthropic:claude-opus-4-7 · confidence high
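
Splitting the `lang:Article Title` values can be sketched with `str.partition`, which splits on the first colon only, so titles that themselves contain a colon stay intact. The function name and the choice to return `None` for blanks are illustrative.

```python
from typing import Optional, Tuple


def split_wikipedia(value: str) -> Optional[Tuple[Optional[str], str]]:
    """Split "lang:Article Title" into (lang, title); None for blank input."""
    value = value.strip()
    if not value:
        return None
    lang, sep, title = value.partition(":")
    if not sep:
        # No language prefix present: keep the whole string as the title.
        return (None, value)
    return (lang.lower(), title.strip())
```

For example, `split_wikipedia("de:Einödhöhle")` yields the pair `("de", "Einödhöhle")`.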

Out[34]:

saturn.columns["wikipedia"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	2,077
len_min	0
len_max	114
len_mean	0.636
len_median	0
len_p95	0
word_mean	1.044
word_median	1
n_empty	67,531
n_duplicates	67,639
duplicate_rate	0.9702
vocab_size	940
readability_flesch_mean	-0.02485
emoji_rate	0
url_rate	0
one_word_rate	0.9756
allcaps_rate	0
boilerplate_rate	0
alert: one_word	97.6% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.0% duplicate strings

Fig 15.

Character-length distribution for wikipedia.

Character-length distribution for wikipedia (mean: 0.6359802627804234).
chars	count
0 – 3	67531
3 – 6	0
6 – 9	42
9 – 11	85
11 – 14	252
14 – 17	402
17 – 20	335
20 – 23	445
23 – 26	243
26 – 28	131
28 – 31	102
31 – 34	72
34 – 37	33
37 – 40	13
40 – 43	12
43 – 46	6
46 – 48	4
48 – 51	1
51 – 54	1
54 – 57	1
57 – 60	3
60 – 63	0
63 – 66	1
66 – 68	0
68 – 71	0
71 – 74	0
74 – 77	0
77 – 80	0
80 – 83	0
83 – 86	0
86 – 88	0
88 – 91	0
91 – 94	0
94 – 97	0
97 – 100	0
100 – 103	0
103 – 105	0
105 – 108	0
108 – 111	0
111 – 114	1

website text metadata

This is a website/URL field for each record, but it is overwhelmingly empty: 67,082 of 69,716 rows (n_empty) are blank, leaving only 2,492 unique values across the column. Where populated, entries are single-token URLs (one_word_rate 0.99998, word_mean ~1.0) pointing to varied external domains (angloasianmining.com, unesco.org, termeszetvedelem.hu, etc.). The duplicate_rate of 0.96 is driven almost entirely by the empty string, and url_rate is only 0.038 because the metric is computed over all rows including blanks.

Treatment: Treat as optional reference URL; encode blanks as null and avoid using as a feature given >96% blanks.

anthropic:claude-opus-4-7 · confidence high
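
If the populated URLs are kept as a reference, reducing each one to its host is one way to group them by source. The sketch below uses the standard library's `urllib.parse.urlsplit`; the helper name and the handling of scheme-less values are assumptions about what the column may contain.

```python
from typing import Optional
from urllib.parse import urlsplit


def website_domain(value: str) -> Optional[str]:
    """Host part of a URL, or None when the field is blank or unparseable."""
    value = value.strip()
    if not value:
        return None
    # urlsplit only fills in the host when a scheme or a leading "//" is present.
    if "://" not in value:
        value = "//" + value
    try:
        return urlsplit(value).hostname or None
    except ValueError:
        return None
```

`urlsplit(...).hostname` is returned in lower case, so hosts that differ only in letter case group together.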

Out[37]:

saturn.columns["website"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	2,492
len_min	0
len_max	255
len_mean	2.789
len_median	0
len_p95	0
word_mean	1
word_median	1
n_empty	67,082
n_duplicates	67,224
duplicate_rate	0.9643
vocab_size	737
readability_flesch_mean	-24.04
emoji_rate	0
url_rate	0.03775
one_word_rate	1
allcaps_rate	1.434e-05
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	96.4% duplicate strings

Fig 16.

Character-length distribution for website.

Character-length distribution for website (mean: 2.7889580584084).
chars	count
0 – 6	67082
6 – 13	0
13 – 19	11
19 – 26	45
26 – 32	130
32 – 38	161
38 – 45	109
45 – 51	118
51 – 57	86
57 – 64	101
64 – 70	223
70 – 76	45
76 – 83	49
83 – 89	1360
89 – 96	127
96 – 102	16
102 – 108	14
108 – 115	8
115 – 121	2
121 – 128	5
128 – 134	5
134 – 140	0
140 – 147	5
147 – 153	1
153 – 159	6
159 – 166	0
166 – 172	0
172 – 178	0
178 – 185	1
185 – 191	1
191 – 198	0
198 – 204	0
204 – 210	0
210 – 217	1
217 – 223	2
223 – 230	0
230 – 236	1
236 – 242	0
242 – 249	0
249 – 255	1

cave:length categorical metadata

This appears to be a cave length attribute (likely an OpenStreetMap-style tag) stored as a string, with 99.08% of the 69,716 rows being empty and only 238 distinct values overall. When populated, values look like small numbers, mostly integers (5, 6, 10, 3, 4...) with the occasional decimal (4.5), suggesting metres, but the signal is so sparse (entropy ratio 0.0176) that it carries almost no information. The counts for the top non-blank values are tiny (≤32 each), indicating this tag is rarely filled in upstream.

Treatment: Drop or convert to a binary 'has_length' indicator; the column is too sparse to model directly.

anthropic:claude-opus-4-7 · confidence high
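
If the populated values are kept, they need parsing from strings first. A minimal sketch, assuming the values are plain decimal numbers as the top-values table suggests; anything else (units, free text, negative or non-finite numbers) is treated as missing. The helper name `parse_length` is hypothetical.

```python
import math
from typing import Optional


def parse_length(value: str) -> Optional[float]:
    """Parse a cave:length string to a float, or None if blank or not a plain number."""
    text = value.strip()
    if not text:
        return None
    try:
        number = float(text)
    except ValueError:
        return None
    # Reject nan/inf and negative lengths, which float() would otherwise accept.
    if not math.isfinite(number) or number < 0:
        return None
    return number


def has_length(value: str) -> bool:
    """Presence indicator derived from the parsed value."""
    return parse_length(value) is not None
```

The same helper would apply to `cave:depth`, which is stored in the same string form.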

Out[40]:

saturn.columns["cave:length"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	238
top_value	(empty string)
top_rate	0.9908
cardinality	238
entropy	0.1392
entropy_ratio	0.01763
alert: long_tail	158 singleton categories
alert: imbalance	top value is 99.1% of rows

Fig 17.

Top values for cave:length.

Top values for cave:length (20 unique shown, of 238 total).
value	count	share
(empty)	69074	99.1%
5	32	0.0%
6	26	0.0%
10	25	0.0%
3	23	0.0%
4	23	0.0%
7	20	0.0%
8	19	0.0%
15	16	0.0%
20	14	0.0%
12	13	0.0%
30	13	0.0%
2	11	0.0%
11	9	0.0%
60	8	0.0%
4.5	8	0.0%
13	8	0.0%
16	8	0.0%
17	8	0.0%
25	8	0.0%

cave:depth categorical metadata

This appears to be a cave depth attribute, likely sourced from an OpenStreetMap-style tag, stored as strings. It is effectively empty: 69,419 of 69,716 rows (top_rate 0.9957) carry the blank value "", and the remaining 123 distinct values are short integer-like strings; the 20 most frequent range from "0" to "70", each with single- or double-digit counts. Entropy ratio of 0.0092 confirms there is virtually no signal here.

Treatment: Drop; near-constant blank with negligible entropy.

anthropic:claude-opus-4-7 · confidence high

Out[43]:

saturn.columns["cave:depth"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	124
top_value	(empty string)
top_rate	0.9957
cardinality	124
entropy	0.06432
entropy_ratio	0.00925
alert: long_tail	87 singleton categories
alert: imbalance	top value is 99.6% of rows

Fig 18.

Top values for cave:depth.

Top values for cave:depth (20 unique shown, of 124 total).
value	count	share
(empty)	69419	99.6%
0	63	0.1%
10	13	0.0%
3	11	0.0%
1	9	0.0%
5	9	0.0%
4	8	0.0%
25	7	0.0%
30	6	0.0%
6	6	0.0%
2	6	0.0%
11	5	0.0%
35	5	0.0%
28	4	0.0%
14	4	0.0%
40	4	0.0%
70	4	0.0%
12	3	0.0%
8	3	0.0%
15	3	0.0%

country categorical feature

Two-letter ISO country code, but effectively empty: 69,684 of 69,716 rows (top_rate 0.9995) carry a blank string, leaving only 32 rows spread across 15 actual codes (KY, AU, DE, RO, US, etc.). Entropy ratio of 0.0018 confirms there is essentially no signal here. No nulls are reported because the missingness is encoded as empty string rather than NULL.

Treatment: Drop; near-constant empty string with only 32 populated rows.

anthropic:claude-opus-4-7 · confidence high

Out[46]:

saturn.columns["country"].stats

stat	value
n	69,716
nulls	0 (0.0%)
unique	16
top_value	(empty string)
top_rate	0.9995
cardinality	16
entropy	0.00736
entropy_ratio	0.00184
alert: imbalance	top value is 100.0% of rows

Fig 19.

Top values for country.

Top values for country (16 unique shown, of 16 total).
value	count	share
(empty)	69684	100.0%
KY	7	0.0%
AU	6	0.0%
DE	3	0.0%
RO	2	0.0%
US	2	0.0%
EC	2	0.0%
IT	2	0.0%
GB	1	0.0%
MT	1	0.0%
CH	1	0.0%
GR	1	0.0%
ET	1	0.0%
LB	1	0.0%
VN	1	0.0%
PT	1	0.0%
