quirky-megaliths · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/megaliths.json

Saturn profiled 15,464 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

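# Install the profiler (notebook shell magic).
!pip install saturn-dissect
import subprocess
# Run the saturn CLI on the source file.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/megaliths.json",
    "--findings", "quirky-megaliths.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # check=True raises CalledProcessError on a non-zero exit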

Summary confidence: high

This dataset catalogues 15,464 megalithic sites with 14 fields covering geographic coordinates, classification (type, megalith_type, material), heritage status, and external references (wikidata, wikipedia, name). Coverage is uneven: several descriptive fields are mostly empty strings (description is blank in 14,814 rows, material in 15,223, start_date in 15,430), which the schema still reports as 0.0% null because blanks are not counted as nulls, so analysis should lean on the well-populated columns. The most informative categorical is megalith_type, where menhir (5,231) and dolmen (4,501) dominate but 73 distinct values appear, while the broader type field is overwhelmingly 'megalith' (97.7%). Geographically, lat and lon are highly skewed, with heavy clustering in western Europe (median lat 47.6, lon -1.6) and a long tail of outliers reaching 144.7°E, 151.4°W, and 51.8°S. Start with megalith_type and the lat/lon distributions to understand what kinds of sites exist and where they cluster.

citing: megalith_type · type · lat · lon · material · description · start_date · osm_type · heritage · name

Out[4]:

saturn.schema() · 14 columns

column	kind	n	null%	unique	alerts
id	numeric	15,464	0.0%	15,464
osm_type	categorical	15,464	0.0%	2
name	text	15,464	0.0%	9,869	one_word duplicates
lat	numeric	15,464	0.0%	15,320	high_skew
lon	numeric	15,464	0.0%	15,407	high_skew
type	categorical	15,464	0.0%	19	imbalance
megalith_type	categorical	15,464	0.0%	73
description	categorical	15,464	0.0%	587	long_tail imbalance
wikipedia	text	15,464	0.0%	2,058	one_word duplicates
wikidata	text	15,464	0.0%	4,289	one_word allcaps short_text duplicates
heritage	categorical	15,464	0.0%	12
heritage_operator	categorical	15,464	0.0%	31
start_date	categorical	15,464	0.0%	26	long_tail imbalance
material	categorical	15,464	0.0%	13	long_tail imbalance

Fig 1.

megalith_type · Shows the dominance of menhirs and dolmens among 73 subtypes — the richest categorical signal in the dataset.

Top values for megalith_type (20 unique shown, of 73 total).
value	count	share
menhir	5231	33.8%
dolmen	4501	29.1%
	1714	11.1%
nuraghe	1080	7.0%
stone_circle	1011	6.5%
passage_grave	537	3.5%
chamber	437	2.8%
long_barrow	184	1.2%
alignment	116	0.8%
cist	107	0.7%
gallery_grave	85	0.5%
standing_stone	68	0.4%
stone_ship	47	0.3%
tholos	32	0.2%
court_tomb	32	0.2%
round_barrow	25	0.2%
well	23	0.1%
wedge_tomb	23	0.1%
cairn	20	0.1%
stone	20	0.1%

Fig 2.

lat · Reveals the strong European clustering (median ~47.6°N) and a long left tail of southern-hemisphere outliers.

Histogram bins for lat (median: 47.59247835).
bin	count
-51.81 – -48.88	1
-48.88 – -45.96	0
-45.96 – -43.04	0
-43.04 – -40.11	0
-40.11 – -37.19	1
-37.19 – -34.26	0
-34.26 – -31.34	3
-31.34 – -28.41	0
-28.41 – -25.49	2
-25.49 – -22.56	0
-22.56 – -19.64	1
-19.64 – -16.72	1
-16.72 – -13.79	4
-13.79 – -10.87	4
-10.87 – -7.942	8
-7.942 – -5.018	21
-5.018 – -2.093	3
-2.093 – 0.8313	13
0.8313 – 3.756	26
3.756 – 6.68	1
6.68 – 9.605	7
9.605 – 12.53	5
12.53 – 15.45	8
15.45 – 18.38	2
18.38 – 21.3	5
21.3 – 24.23	2
24.23 – 27.15	8
27.15 – 30.08	3
30.08 – 33	9
33 – 35.92	38
35.92 – 38.85	523
38.85 – 41.77	2211
41.77 – 44.7	3646
44.7 – 47.62	3269
47.62 – 50.55	1808
50.55 – 53.47	1660
53.47 – 56.4	1627
56.4 – 59.32	506
59.32 – 62.24	33
62.24 – 65.17	5

Fig 3.

lon · Highlights the western-European concentration with extreme outliers spanning -151° to 144°.

Histogram bins for lon (median: -1.6201083).
bin	count
-151.4 – -144	1
-144 – -136.6	0
-136.6 – -129.2	0
-129.2 – -121.7	1
-121.7 – -114.3	0
-114.3 – -106.9	1
-106.9 – -99.54	1
-99.54 – -92.14	2
-92.14 – -84.74	2
-84.74 – -77.33	6
-77.33 – -69.93	34
-69.93 – -62.53	2
-62.53 – -55.13	1
-55.13 – -47.72	5
-47.72 – -40.32	1
-40.32 – -32.92	0
-32.92 – -25.52	0
-25.52 – -18.12	0
-18.12 – -10.71	4
-10.71 – -3.31	3136
-3.31 – 4.092	7654
4.092 – 11.49	3031
11.49 – 18.9	921
18.9 – 26.3	58
26.3 – 33.7	15
33.7 – 41.1	441
41.1 – 48.51	21
48.51 – 55.91	3
55.91 – 63.31	0
63.31 – 70.71	1
70.71 – 78.12	7
78.12 – 85.52	7
85.52 – 92.92	7
92.92 – 100.3	1
100.3 – 107.7	19
107.7 – 115.1	7
115.1 – 122.5	23
122.5 – 129.9	30
129.9 – 137.3	15
137.3 – 144.7	6

Fig 4.

osm_type · Simple split showing ~86% of records are OSM nodes versus ways, useful for understanding source geometry.

Top values for osm_type (2 unique shown, of 2 total).
value	count	share
node	13311	86.1%
way	2153	13.9%

Fig 5.

type · Illustrates the severe imbalance where 'megalith' covers 97.7% of rows, suggesting this top-level field is less useful than megalith_type.

Top values for type (19 unique shown, of 19 total).
value	count	share
megalith	15113	97.7%
menhir	156	1.0%
dolmen	83	0.5%
standing_stone	59	0.4%
stone_circle	16	0.1%
nuraghe	8	0.1%
gallery_grave	6	0.0%
passage_grave	5	0.0%
lech	4	0.0%
stone_ship	3	0.0%
tholos	2	0.0%
chamber	2	0.0%
village	1	0.0%
plaque	1	0.0%
cist	1	0.0%
long_barrow	1	0.0%
chambered_cairn	1	0.0%
grave_field	1	0.0%
stone	1	0.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	numeric	0.0%
osm_type	categorical	0.0%
name	text	0.0%
lat	numeric	0.0%
lon	numeric	0.0%
type	categorical	0.0%
megalith_type	categorical	0.0%
description	categorical	0.0%
wikipedia	text	0.0%
wikidata	text	0.0%
heritage	categorical	0.0%
heritage_operator	categorical	0.0%
start_date	categorical	0.0%
material	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	id	lat	lon
id	+1.00	+0.00	-0.05
lat	+0.00	+1.00	-0.11
lon	-0.05	-0.11	+1.00

id numeric identifier

This is an identifier column: all 15464 values are unique with no nulls and no zeros, spanning 24151805 to 13537320281. The numeric stats (mean 4503184709.89, std 3470459882.49, skew 0.89) reflect ID allocation rather than a meaningful distribution. Treat the numeric summary as incidental.

Treatment: Use as a join key; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["id"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	15,464
min	2.415e+07
max	1.354e+10
mean	4.503e+09
median	3.411e+09
std	3.47e+09
q1	2.375e+09
q3	6.845e+09
iqr	4.471e+09
skew	0.8907
kurtosis	-0.2006
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 8.

Distribution of id. Vertical dash marks the median.

Histogram bins for id (median: 3411205875.5).
bin	count
2.415e+07 – 3.62e+08	682
3.62e+08 – 6.998e+08	889
6.998e+08 – 1.038e+09	623
1.038e+09 – 1.375e+09	736
1.375e+09 – 1.713e+09	297
1.713e+09 – 2.051e+09	352
2.051e+09 – 2.389e+09	305
2.389e+09 – 2.727e+09	2679
2.727e+09 – 3.065e+09	300
3.065e+09 – 3.402e+09	756
3.402e+09 – 3.74e+09	951
3.74e+09 – 4.078e+09	705
4.078e+09 – 4.416e+09	532
4.416e+09 – 4.754e+09	318
4.754e+09 – 5.092e+09	386
5.092e+09 – 5.429e+09	200
5.429e+09 – 5.767e+09	207
5.767e+09 – 6.105e+09	261
6.105e+09 – 6.443e+09	161
6.443e+09 – 6.781e+09	175
6.781e+09 – 7.119e+09	250
7.119e+09 – 7.456e+09	270
7.456e+09 – 7.794e+09	89
7.794e+09 – 8.132e+09	115
8.132e+09 – 8.47e+09	734
8.47e+09 – 8.808e+09	391
8.808e+09 – 9.146e+09	155
9.146e+09 – 9.483e+09	137
9.483e+09 – 9.821e+09	204
9.821e+09 – 1.016e+10	137
1.016e+10 – 1.05e+10	94
1.05e+10 – 1.083e+10	180
1.083e+10 – 1.117e+10	175
1.117e+10 – 1.151e+10	157
1.151e+10 – 1.185e+10	133
1.185e+10 – 1.219e+10	114
1.219e+10 – 1.252e+10	132
1.252e+10 – 1.286e+10	97
1.286e+10 – 1.32e+10	255
1.32e+10 – 1.354e+10	130

osm_type categorical feature

This column records the OpenStreetMap geometry type for each record, taking only two values: "node" (13311 rows, 86%) and "way" (2153 rows). With cardinality of 2 and no nulls across 15464 rows, it's a clean binary categorical, though the 86/14 split means "way" is the clear minority class.

Treatment: Encode as a binary indicator (e.g., is_node) before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["osm_type"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	2
top_value	node
top_rate	0.8608
cardinality	2
entropy	0.5822
entropy_ratio	0.5822
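
The entropy and entropy_ratio figures above can be reproduced from the value counts. The sketch below assumes saturn reports Shannon entropy in bits and normalises it by log2 of the cardinality; that is an inference from the numbers in this notebook, not documented behaviour, and the helper names are hypothetical.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum possible value, log2(number of categories)."""
    k = len(counts)
    if k < 2:
        return 0.0  # a constant column carries no information
    return shannon_entropy_bits(counts) / math.log2(k)

# Counts for osm_type as published in this notebook: node, way.
osm_type_counts = [13311, 2153]
h = shannon_entropy_bits(osm_type_counts)
ratio = entropy_ratio(osm_type_counts)
print(round(h, 4), round(ratio, 4))
```

With only two categories the maximum entropy is 1 bit, so entropy and entropy_ratio coincide, which is why the table lists 0.5822 for both.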

Fig 9.

Top values for osm_type.

Top values for osm_type (2 unique shown, of 2 total).
value	count	share
node	13311	86.1%
way	2153	13.9%

name text label

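This is a free-text 'name' field for megalithic monuments, mixing English, French, German, Italian and Cyrillic-script labels (e.g. 'Dolmen', 'Menhir', 'Großsteingrab', 'Нураге', 'Дольмен'). Names are short (mean 13.6 characters, median 2 words) and heavily duplicated: 4,720 rows (30.5%) are empty strings, and only 9,869 of 15,464 values are unique. The 36.2% duplicate rate (5,595 rows, i.e. 15,464 minus 9,869) is mostly those blanks: 4,719 of the repeats are empty strings, leaving 876 repeats among real names. The vocabulary is dominated by generic monument types rather than proper names. The 37.8% one-word rate should be read with care, because it appears to count empty rows as single words (wikidata reports a one-word rate of 1.0 despite 10,819 empty rows).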

Treatment: Treat as a categorical type label after lowercasing and language-normalising; do not use as a unique identifier.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["name"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	9,869
len_min	0
len_max	84
len_mean	13.65
len_median	15
len_p95	30
word_mean	2.495
word_median	2
n_empty	4,720
n_duplicates	5,595
duplicate_rate	0.3618
vocab_size	9,447
readability_flesch_mean	46.9
emoji_rate	0
url_rate	0
one_word_rate	0.3782
allcaps_rate	0.003169
boilerplate_rate	0
alert: one_word	37.8% rows are a single word
alert: duplicates	36.2% duplicate strings

Fig 10.

Character-length distribution for name.

Character-length distribution for name (mean: 13.64685721676151).
chars	count
0 – 2	4725
2 – 4	27
4 – 6	252
6 – 8	552
8 – 10	420
10 – 13	493
13 – 15	810
15 – 17	1037
17 – 19	1247
19 – 21	1295
21 – 23	1726
23 – 25	831
25 – 27	680
27 – 29	447
29 – 32	283
32 – 34	185
34 – 36	134
36 – 38	86
38 – 40	83
40 – 42	37
42 – 44	43
44 – 46	15
46 – 48	13
48 – 50	13
50 – 52	4
52 – 55	9
55 – 57	4
57 – 59	1
59 – 61	5
61 – 63	2
63 – 65	2
65 – 67	1
67 – 69	0
69 – 71	1
71 – 74	0
74 – 76	0
76 – 78	0
78 – 80	0
80 – 82	0
82 – 84	1

lat numeric feature

This is a latitude coordinate column with 15,320 unique values across 15,464 rows and no nulls. Values span -51.81 to 65.17 with a median of 47.59 and Q1-Q3 of 42.95-50.52, indicating most observations cluster in the northern mid-latitudes (consistent with Europe, given the longitude median of -1.62). The strong negative skew (-3.09) and high kurtosis (26.33) reflect a small tail of southern-hemisphere and low-latitude points pulling against an otherwise tight northern cluster, with 134 outliers flagged.

Treatment: Pair with longitude as a geospatial feature; consider binning by region or projecting before modelling rather than using raw latitude.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["lat"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	15,320
min	-51.81
max	65.17
mean	46.41
median	47.59
std	6.81
q1	42.95
q3	50.52
iqr	7.569
skew	-3.087
kurtosis	26.33
n_outliers	134
outlier_rate	0.008665
zero_rate	0
alert: high_skew	skew=-3.09
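
The notebook does not say how n_outliers is computed. A common choice is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles; the sketch below assumes that rule purely for illustration and uses the rounded quartiles from the table above. The function names are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (low, high) bounds; values outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3, k=1.5):
    low, high = tukey_fences(q1, q3, k)
    return value < low or value > high

# Quartiles for lat as published above (rounded to two decimals).
lat_low, lat_high = tukey_fences(42.95, 50.52)
print(lat_low, lat_high)
```

Under this assumption any latitude south of roughly 31.6 or north of roughly 61.9 would be flagged. The same helper applies to lon with its own quartiles.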

Fig 11.

Distribution of lat. Vertical dash marks the median.

Histogram bins for lat (median: 47.59247835).
bin	count
-51.81 – -48.88	1
-48.88 – -45.96	0
-45.96 – -43.04	0
-43.04 – -40.11	0
-40.11 – -37.19	1
-37.19 – -34.26	0
-34.26 – -31.34	3
-31.34 – -28.41	0
-28.41 – -25.49	2
-25.49 – -22.56	0
-22.56 – -19.64	1
-19.64 – -16.72	1
-16.72 – -13.79	4
-13.79 – -10.87	4
-10.87 – -7.942	8
-7.942 – -5.018	21
-5.018 – -2.093	3
-2.093 – 0.8313	13
0.8313 – 3.756	26
3.756 – 6.68	1
6.68 – 9.605	7
9.605 – 12.53	5
12.53 – 15.45	8
15.45 – 18.38	2
18.38 – 21.3	5
21.3 – 24.23	2
24.23 – 27.15	8
27.15 – 30.08	3
30.08 – 33	9
33 – 35.92	38
35.92 – 38.85	523
38.85 – 41.77	2211
41.77 – 44.7	3646
44.7 – 47.62	3269
47.62 – 50.55	1808
50.55 – 53.47	1660
53.47 – 56.4	1627
56.4 – 59.32	506
59.32 – 62.24	33
62.24 – 65.17	5

lon numeric feature

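This column is longitude coordinates, with values spanning -151.36 to 144.74 across 15,464 rows and 15,407 unique values. The distribution is tightly concentrated around a median of -1.62 with an IQR of just 11.53, but a skew of 3.65 and kurtosis of 34.34 indicate heavy tails. The 676 flagged outliers (4.37%) lie on both sides of the western-European core: a thin scatter to the west as far as -151.36, and what appears to be a larger group to the east, where the histogram shows 441 rows in the 33.7 to 41.1 bin and further points out to 144.74. No nulls or zeros, so coverage is clean.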

Treatment: Pair with latitude as a geospatial feature; consider projecting or binning rather than using raw values in a linear model.

anthropic:claude-opus-4-7 · confidence high
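
One way to act on the binning suggestion is to assign each point to a fixed-size latitude/longitude grid cell. This is a minimal sketch, not something saturn provides; the function name and the 5-degree default are assumptions chosen for illustration.

```python
import math

def grid_cell(lat, lon, size_deg=5.0):
    """Return the (row, col) index of the size_deg x size_deg cell containing the point."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinate out of range: ({lat}, {lon})")
    # floor (not int) so that negative coordinates land in the correct cell
    return math.floor(lat / size_deg), math.floor(lon / size_deg)

# The median site location from this notebook.
print(grid_cell(47.59, -1.62))
```

Cells of equal size in degrees do not cover equal areas on the ground (they narrow toward the poles), so an equal-area projection may be preferable when density comparisons matter.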

Out[25]:

saturn.columns["lon"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	15,407
min	-151.4
max	144.7
mean	2.618
median	-1.62
std	14.64
q1	-3.083
q3	8.447
iqr	11.53
skew	3.654
kurtosis	34.34
n_outliers	676
outlier_rate	0.04371
zero_rate	0
alert: high_skew	skew=+3.65

Fig 12.

Distribution of lon. Vertical dash marks the median.

Histogram bins for lon (median: -1.6201083).
bin	count
-151.4 – -144	1
-144 – -136.6	0
-136.6 – -129.2	0
-129.2 – -121.7	1
-121.7 – -114.3	0
-114.3 – -106.9	1
-106.9 – -99.54	1
-99.54 – -92.14	2
-92.14 – -84.74	2
-84.74 – -77.33	6
-77.33 – -69.93	34
-69.93 – -62.53	2
-62.53 – -55.13	1
-55.13 – -47.72	5
-47.72 – -40.32	1
-40.32 – -32.92	0
-32.92 – -25.52	0
-25.52 – -18.12	0
-18.12 – -10.71	4
-10.71 – -3.31	3136
-3.31 – 4.092	7654
4.092 – 11.49	3031
11.49 – 18.9	921
18.9 – 26.3	58
26.3 – 33.7	15
33.7 – 41.1	441
41.1 – 48.51	21
48.51 – 55.91	3
55.91 – 63.31	0
63.31 – 70.71	1
70.71 – 78.12	7
78.12 – 85.52	7
85.52 – 92.92	7
92.92 – 100.3	1
100.3 – 107.7	19
107.7 – 115.1	7
115.1 – 122.5	23
122.5 – 129.9	30
129.9 – 137.3	15
137.3 – 144.7	6

type categorical label

Categorical type label for what appears to be megalithic monuments, with 19 distinct classes across 15,464 rows and no nulls. The distribution is severely imbalanced: 'megalith' alone covers 97.7% of records (15,113 rows), leaving rarer types like 'menhir' (156), 'dolmen' (83), and 'standing_stone' (59) as long-tail minorities. Entropy ratio of 0.049 confirms the column carries little discriminative signal in its raw form.

Treatment: Collapse rare categories into 'other' or stratify/resample before using as a class label.

anthropic:claude-opus-4-7 · confidence high
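
The rare-category collapse suggested above can be sketched with the standard library. The column is rebuilt here from the counts published in this notebook; the threshold of 50 is an arbitrary assumption for illustration, and collapse_rare is a hypothetical helper.

```python
from collections import Counter

def collapse_rare(values, min_count, other="other"):
    """Replace every value seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Value counts for `type` as published in this notebook.
type_counts = {
    "megalith": 15113, "menhir": 156, "dolmen": 83, "standing_stone": 59,
    "stone_circle": 16, "nuraghe": 8, "gallery_grave": 6, "passage_grave": 5,
    "lech": 4, "stone_ship": 3, "tholos": 2, "chamber": 2, "village": 1,
    "plaque": 1, "cist": 1, "long_barrow": 1, "chambered_cairn": 1,
    "grave_field": 1, "stone": 1,
}
column = [value for value, n in type_counts.items() for _ in range(n)]
collapsed = Counter(collapse_rare(column, min_count=50))
print(collapsed.most_common())
```

With this threshold the 19 classes reduce to five, and the 15 rarest classes merge into a single 'other' bucket of 53 rows.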

Out[28]:

saturn.columns["type"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	19
top_value	megalith
top_rate	0.9773
cardinality	19
entropy	0.2096
entropy_ratio	0.04933
alert: imbalance	top value is 97.7% of rows

Fig 13.

Top values for type.

Top values for type (19 unique shown, of 19 total).
value	count	share
megalith	15113	97.7%
menhir	156	1.0%
dolmen	83	0.5%
standing_stone	59	0.4%
stone_circle	16	0.1%
nuraghe	8	0.1%
gallery_grave	6	0.0%
passage_grave	5	0.0%
lech	4	0.0%
stone_ship	3	0.0%
tholos	2	0.0%
chamber	2	0.0%
village	1	0.0%
plaque	1	0.0%
cist	1	0.0%
long_barrow	1	0.0%
chambered_cairn	1	0.0%
grave_field	1	0.0%
stone	1	0.0%

megalith_type categorical feature

Categorical classification of megalithic structures across 73 distinct values, dominated by 'menhir' (33.8%) and 'dolmen' (29.1%), followed by 'nuraghe' (7.0%), 'stone_circle' (6.5%), and 'passage_grave' (3.5%) ahead of a long tail of rarer types. Notable concern: 1,714 rows (11.1%) carry an empty-string value, the third most common value, despite a reported null rate of 0.0, suggesting blanks are being treated as a valid category rather than missing. Entropy ratio of 0.44 indicates concentration in a few dominant types.

Treatment: Recode empty strings to null, then one-hot encode the top categories and bucket the long tail as 'other'.

anthropic:claude-opus-4-7 · confidence high
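
The treatment above combines two steps: recode blanks to a missing value, then one-hot encode the most frequent categories with everything else bucketed. A minimal standard-library sketch on a small made-up sample follows; the helper names and the choice of k are assumptions.

```python
from collections import Counter

def blank_to_none(values):
    """Map empty or whitespace-only strings to None so they count as missing."""
    return [v if v.strip() else None for v in values]

def one_hot_top_k(values, k, other="other"):
    """One-hot encode the k most common non-missing values; the rest go to `other`."""
    counts = Counter(v for v in values if v is not None)
    top = [v for v, _ in counts.most_common(k)]
    columns = top + [other]
    rows = []
    for v in values:
        if v is None:
            rows.append({c: 0 for c in columns})  # missing: all indicators zero
        else:
            key = v if v in top else other
            rows.append({c: int(c == key) for c in columns})
    return columns, rows

# Illustrative sample, not real rows from the dataset.
sample = ["menhir", "dolmen", "", "menhir", "nuraghe", "dolmen", "menhir"]
columns, rows = one_hot_top_k(blank_to_none(sample), k=2)
print(columns)
print(rows[2], rows[4])
```

Encoding missing rows as all zeros keeps them distinct from the 'other' bucket, which matters here because blanks make up about 11% of the column.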

Out[31]:

saturn.columns["megalith_type"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	73
top_value	menhir
top_rate	0.3383
cardinality	73
entropy	2.749
entropy_ratio	0.4441

Fig 14.

Top values for megalith_type.

Top values for megalith_type (20 unique shown, of 73 total).
value	count	share
menhir	5231	33.8%
dolmen	4501	29.1%
	1714	11.1%
nuraghe	1080	7.0%
stone_circle	1011	6.5%
passage_grave	537	3.5%
chamber	437	2.8%
long_barrow	184	1.2%
alignment	116	0.8%
cist	107	0.7%
gallery_grave	85	0.5%
standing_stone	68	0.4%
stone_ship	47	0.3%
tholos	32	0.2%
court_tomb	32	0.2%
round_barrow	25	0.2%
well	23	0.1%
wedge_tomb	23	0.1%
cairn	20	0.1%
stone	20	0.1%

description categorical free_text

Free-text description field for what appears to be megalithic/archaeological sites, with labels in multiple languages (Danish 'Jættestue', 'Langdysse'; Portuguese 'Anta da Herdade da Ordem'; German 'Großsteingrab'; English 'Stone circle', 'Long Barrow'). The column is effectively empty: 14,814 of 15,464 rows (top_rate 0.958) hold the empty string, leaving 650 populated rows spread across 586 distinct descriptions, 558 of which occur only once. Entropy ratio of 0.069 confirms the near-degenerate distribution, and the language mix means even the populated values won't cluster cleanly without normalization.

Treatment: Drop or treat as a sparse free-text flag; not usable as a categorical feature given 96% empty and multilingual long tail.

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["description"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	587
top_value
top_rate	0.958
cardinality	587
entropy	0.6328
entropy_ratio	0.0688
alert: long_tail	558 singleton categories
alert: imbalance	top value is 95.8% of rows

Fig 15.

Top values for description.

Top values for description (20 unique shown, of 587 total).
value	count	share
	14814	95.8%
Jættestue	11	0.1%
Anta da Herdade da Ordem	8	0.1%
Stone circle	5	0.0%
Großsteingrab	5	0.0%
Rest eines Großsteingrabes	5	0.0%
Long Barrow	4	0.0%
Dolmen	4	0.0%
Langdysse	4	0.0%
Four standing and one recumbent standing stone.	4	0.0%
pair of two standing stones	4	0.0%
Hünengrab	3	0.0%
Henge / Círculo lítico	3	0.0%
Menhir	2	0.0%
Гармония	2	0.0%
Runddysse	2	0.0%
Allée couverte	2	0.0%
Stendysse	2	0.0%
Tumulus, dalle de couverture	2	0.0%
Table, chevet, orthostates droit et gauche	2	0.0%

wikipedia text metadata

This column holds Wikipedia article references in `lang:Title` form (e.g. `de:Großsteingräber im Haldensleber Forst`, `fr:dolmen…`), pointing to megalithic-monument pages across multiple language editions. It is overwhelmingly empty: 13,060 of 15,464 rows (84.5%) are blank, which accounts for nearly all of the 0.87 duplicate rate and leaves 2,058 distinct values including the empty string. The one_word_rate of 0.85 and word_mean of 1.35 are likewise driven by the blank rows, which appear to be counted as single words, so they say little about the populated titles. Where present, entries skew heavily German, with `de`-prefixed titles dominating the top values and words.

Treatment: Treat as an optional cross-reference link; parse the `lang:title` prefix if needed but drop from modelling given 84% emptiness.

anthropic:claude-opus-4-7 · confidence high
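
Splitting the `lang:Title` form is straightforward with `str.partition`, which splits on the first colon only and so leaves colons inside titles intact. A minimal sketch, with a hypothetical function name:

```python
def parse_wikipedia_ref(ref):
    """Split 'lang:Title' into (lang, title); return None for blank or malformed input."""
    ref = ref.strip()
    if not ref:
        return None
    lang, sep, title = ref.partition(":")
    if not sep or not lang or not title:
        return None  # no colon, or one side of it is empty
    return lang, title

print(parse_wikipedia_ref("de:Großsteingräber im Haldensleber Forst"))
print(parse_wikipedia_ref(""))
```

Returning None for blanks lets the 84.5% empty rows drop out naturally when the parsed values are filtered.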

Out[37]:

saturn.columns["wikipedia"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	2,058
len_min	0
len_max	75
len_mean	4.1
len_median	0
len_p95	29
word_mean	1.351
word_median	1
n_empty	13,060
n_duplicates	13,406
duplicate_rate	0.8669
vocab_size	2,769
readability_flesch_mean	5.48
emoji_rate	0
url_rate	0
one_word_rate	0.8524
allcaps_rate	0
boilerplate_rate	0
alert: one_word	85.2% rows are a single word
alert: duplicates	86.7% duplicate strings

Fig 16.

Character-length distribution for wikipedia.

Character-length distribution for wikipedia (mean: 4.100038799793068).
chars	count
0 – 2	13060
2 – 4	0
4 – 6	0
6 – 8	1
8 – 9	3
9 – 11	21
11 – 13	47
13 – 15	28
15 – 17	159
17 – 19	115
19 – 21	191
21 – 22	236
22 – 24	271
24 – 26	254
26 – 28	203
28 – 30	130
30 – 32	198
32 – 34	133
34 – 36	110
36 – 38	75
38 – 39	49
39 – 41	97
41 – 43	30
43 – 45	6
45 – 47	12
47 – 49	7
49 – 51	4
51 – 52	2
52 – 54	8
54 – 56	3
56 – 58	3
58 – 60	0
60 – 62	0
62 – 64	3
64 – 66	0
66 – 68	1
68 – 69	1
69 – 71	0
71 – 73	1
73 – 75	2

wikidata text foreign_key

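This column holds Wikidata Q-identifiers (e.g. Q106546933, Q1917052), one token per row with a max length of 10 characters. Coverage is poor: 10,819 of 15,464 rows (70.0%) are empty strings, and the 4,289 unique values include the empty string, so 4,288 distinct IDs cover the 4,645 populated rows. The 0.72 duplicate rate is therefore almost entirely repeated blanks; only 357 populated rows repeat an ID seen elsewhere. The most frequent non-empty ID recurs only 17 times, so duplication is spread thinly rather than concentrated on a few entities. The allcaps alert (30.0%) simply mirrors the populated share, since every Q-identifier is upper-case.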

Treatment: Treat as an optional Wikidata key; left-join on non-empty values to enrich, and don't use as a feature directly.

anthropic:claude-opus-4-7 · confidence high
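
A left join on non-empty identifiers can be sketched as follows. The Q-ID pattern (a capital Q followed by a positive integer) reflects the standard Wikidata item format; the `lookup` table, its contents, and the `wd` output key are hypothetical stand-ins for whatever enrichment source is used.

```python
import re

QID = re.compile(r"Q[1-9][0-9]*")

def is_qid(value):
    """True if value looks like a Wikidata item identifier such as Q1917052."""
    return QID.fullmatch(value) is not None

def left_join_on_qid(rows, lookup):
    """Keep every row; attach lookup data under 'wd' where a valid Q-ID matches."""
    joined = []
    for row in rows:
        qid = row.get("wikidata", "")
        extra = lookup.get(qid) if is_qid(qid) else None
        joined.append({**row, "wd": extra})
    return joined

# Illustrative rows and a made-up lookup table.
rows = [{"id": 1, "wikidata": "Q1917052"}, {"id": 2, "wikidata": ""}]
lookup = {"Q1917052": {"label": "hypothetical label"}}
joined = left_join_on_qid(rows, lookup)
print(joined)
```

Because blank rows are kept with `wd` set to None, the join never shrinks the dataset, which is the point of a left join on a column that is 70% empty.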

Out[40]:

saturn.columns["wikidata"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	4,289
len_min	0
len_max	10
len_mean	2.667
len_median	0
len_p95	10
word_mean	1
word_median	1
n_empty	10,819
n_duplicates	11,175
duplicate_rate	0.7226
vocab_size	4,288
readability_flesch_mean	38.79
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.3004
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	30.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	72.3% duplicate strings

Fig 17.

Character-length distribution for wikidata.

Character-length distribution for wikidata (mean: 2.6668391101914124). Lengths with no rows are omitted.
chars	count
0	10819
6	5
7	94
8	1167
9	2574
10	805

heritage categorical feature

Categorical heritage flag with 12 distinct values across 15,464 rows and no nulls, but 87.96% of records carry an empty string, leaving only 1,862 rows with any signal. The non-empty values mix three encodings: numeric codes ('1', '2', '3', '4', '6', '7'), yes/no flags, and free-text labels such as 'Em Vias de Classificação', 'Scheduled Monument', and 'M0021', suggesting concatenated sources or inconsistent encoding schemes. Entropy ratio of 0.20 confirms the distribution is heavily concentrated in the blank class.

Treatment: Normalise empty strings to null and harmonise the mixed numeric/yes-no/text codes before any encoding.

anthropic:claude-opus-4-7 · confidence high
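
Before the mixed codes can be harmonised, it helps to separate them by kind. The sketch below only classifies values; it deliberately assigns no meaning to the numeric codes, since the notebook does not say what they represent. The function name and the kind labels ('level', 'flag', 'text') are assumptions.

```python
def classify_heritage(value):
    """Return None for blanks, otherwise a (kind, parsed_value) pair."""
    v = value.strip()
    if not v:
        return None
    if v.isdecimal():
        return ("level", int(v))  # numeric code; its meaning is not given in the source
    if v.lower() in ("yes", "no"):
        return ("flag", v.lower() == "yes")
    return ("text", v)

# The 12 distinct values published in this notebook.
values = ["", "2", "3", "1", "yes", "no", "Em Vias de Classificação",
          "4", "7", "Scheduled Monument", "6", "M0021"]
for v in values:
    print(repr(v), classify_heritage(v))
```

Keeping the kind alongside the parsed value makes it possible to decide later, per kind, whether the values can share one scale or need separate columns.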

Out[43]:

saturn.columns["heritage"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	12
top_value
top_rate	0.8796
cardinality	12
entropy	0.7343
entropy_ratio	0.2048

Fig 18.

Top values for heritage.

Top values for heritage (12 unique shown, of 12 total).
value	count	share
	13602	88.0%
2	1264	8.2%
3	205	1.3%
1	120	0.8%
yes	109	0.7%
no	69	0.4%
Em Vias de Classificação	60	0.4%
4	24	0.2%
7	8	0.1%
Scheduled Monument	1	0.0%
6	1	0.0%
M0021	1	0.0%

heritage_operator categorical metadata

Categorical field naming the operator/agency responsible for a heritage record, with 31 distinct values across 15,464 rows and no nulls. It is overwhelmingly empty: the blank string accounts for 89.5% of rows (13,848), leaving 1,616 rows (10.5%) with an actual operator code such as 'mhs' (960), 'IE:smr' (229), or 'dgpc' (185). Entropy ratio of 0.14 confirms the distribution is concentrated in that empty bucket. Values are also inconsistently written: casing varies ('cadw' vs 'Cadw', 'dgpc' vs 'DGPC'), a full name ('Historic Environment Scotland') sits alongside short codes, and multi-valued entries use different delimiters ('IE:smr;IE:nm' vs 'IE:smr:IE:nm').

Treatment: Treat blanks as 'unknown' and normalise casing; only useful as a sparse categorical flag, not a primary feature.

anthropic:claude-opus-4-7 · confidence high
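
Case normalisation alone already merges some of the variants. The sketch below applies it to a subset of the counts published in this notebook; normalise_operator is a hypothetical helper, and it does not attempt to reconcile the differing delimiters in multi-valued entries.

```python
from collections import Counter

def normalise_operator(value, unknown="unknown"):
    """Trim and case-fold an operator code; map blanks to `unknown`."""
    cleaned = value.strip().casefold()
    return cleaned if cleaned else unknown

# A subset of the published value counts for heritage_operator.
published = {"": 13848, "mhs": 960, "dgpc": 185, "DGPC": 3, "cadw": 14, "Cadw": 5}
merged = Counter()
for value, n in published.items():
    merged[normalise_operator(value)] += n
print(merged)
```

After folding, 'cadw' and 'Cadw' combine into 19 rows and 'dgpc' and 'DGPC' into 188, so the distinct-value count drops without losing any populated rows.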

Out[46]:

saturn.columns["heritage_operator"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	31
top_value
top_rate	0.8955
cardinality	31
entropy	0.7028
entropy_ratio	0.1419

Fig 19.

Top values for heritage_operator.

Top values for heritage_operator (20 unique shown, of 31 total).
value	count	share
	13848	89.5%
mhs	960	6.2%
IE:smr	229	1.5%
dgpc	185	1.2%
pc	103	0.7%
rce	23	0.1%
Historic Environment Scotland	18	0.1%
cadw	14	0.1%
whc	14	0.1%
lda	12	0.1%
nld	9	0.1%
IE:smr;IE:nm	8	0.1%
he	6	0.0%
Cadw	5	0.0%
mecd	4	0.0%
DGPC	3	0.0%
IE:smr:IE:nm	3	0.0%
alsh	2	0.0%
hs	2	0.0%
raa	2	0.0%

start_date categorical metadata

A nominally date-like field that is effectively empty: 15,430 of 15,464 rows (top_rate 0.9978) carry the blank string, leaving only 34 populated cells across 25 other distinct values. Those rare entries are wildly inconsistent in format — ISO dates ('2004-07-01'), bare years ('1999'), BCE ranges ('between 3500 and 2800 BCE'), and codes ('C-30') — so even the non-null content is not parseable as a uniform timestamp. Entropy ratio of 0.0069 confirms there is essentially no information here.

Treatment: Drop; near-constant blank with unparseable mixed-format residue.

anthropic:claude-opus-4-7 · confidence high

Out[49]:

saturn.columns["start_date"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	26
top_value
top_rate	0.9978
cardinality	26
entropy	0.03224
entropy_ratio	0.006859
alert: long_tail	21 singleton categories
alert: imbalance	top value is 99.8% of rows

Fig 20.

Top values for start_date.

Top values for start_date (20 unique shown, of 26 total).
value	count	share
	15430	99.8%
1999	5	0.0%
C-30	4	0.0%
~2000 BC	2	0.0%
between 3500 and 2800 BCE	2	0.0%
2900 BC..2600 BC	1	0.0%
-3000 BC	1	0.0%
-2000	1	0.0%
2004-07-01	1	0.0%
before -3250	1	0.0%
3720 BC	1	0.0%
2800-2200 BC	1	0.0%
~5000 BCE	1	0.0%
~C30 BC	1	0.0%
2000 BC	1	0.0%
Mittelneolithikum (2350 - 2150 v. u. Z.)	1	0.0%
1500 BC	1	0.0%
2800 BC..2200 BC	1	0.0%
2012-04-30	1	0.0%
3100 BC	1	0.0%

material categorical metadata

This is a categorical 'material' attribute, almost certainly the OSM-style material tag for some physical feature, with 13 distinct values across 15,464 rows. It is overwhelmingly empty: 15,223 of 15,464 rows (top_rate 0.984) carry the blank string, leaving only 241 actual material labels dominated by 'stone' (196) and 'granite' (29). Entropy ratio 0.036 confirms almost no information content, and the long tail includes a German 'Quarzit', a 'granit' spelling alongside 'granite', and a compound 'stone;concrete' value, hinting at inconsistent tagging conventions.

Treatment: Drop or treat empty as null and collapse rare variants; too sparse to use as a feature without aggressive grouping.

anthropic:claude-opus-4-7 · confidence high

Out[52]:

saturn.columns["material"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	13
top_value
top_rate	0.9844
cardinality	13
entropy	0.1326
entropy_ratio	0.03582
alert: long_tail	7 singleton categories
alert: imbalance	top value is 98.4% of rows

Fig 21.

Top values for material.

Top values for material (13 unique shown, of 13 total).
value	count	share
	15223	98.4%
stone	196	1.3%
granite	29	0.2%
sandstone	5	0.0%
limestone	2	0.0%
dry_stone	2	0.0%
Quarzit	1	0.0%
reinforced_concrete	1	0.0%
stone;concrete	1	0.0%
basalt	1	0.0%
quartz_blanc	1	0.0%
granit	1	0.0%
andesite	1	0.0%
