data-trove-megaliths · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/megaliths.json

Saturn profiled 15,464 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source file; check=True raises if it exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/megaliths.json",
    "--findings", "data-trove-megaliths.json",
    "--llm", "anthropic:default",
], check=True)

Summary confidence: high

This dataset catalogues 15,464 megalithic structures (dolmens, menhirs, stone circles, nuraghes, and more) drawn from OpenStreetMap, with geographic coordinates, heritage classification, and typology fields. The most striking pattern is extreme sparsity in descriptive metadata: 95.8% of records have no description, 98.4% have no material recorded, and roughly 70% lack a Wikidata link, so the dataset is geographically rich but editorially thin. These gaps are stored as empty strings, which is why every column still reports 0.0% nulls. The megalith_type column is the most informative categorical field, led by menhirs (5,231), dolmens (4,501), nuraghes (1,080), and stone circles (1,011). Geographically, the bulk of sites cluster in Western Europe (median latitude ~47.6°N, median longitude ~1.6°W), but high skew and outliers in both lat and lon indicate a long tail of sites far outside that core, including longitudes typical of the Americas and East Asia and a few southern-hemisphere latitudes, which is worth mapping.

citing: row_count · column_count · megalith_type.top_values · megalith_type.stats.top_rate · description.stats.top_rate · material.stats.top_rate · wikidata.stats.n_empty · lat.stats.median · lat.stats.skew · lon.stats.median · lon.stats.skew · lon.stats.outlier_rate · heritage.top_values · type.top_values

Out[4]:

saturn.schema() · 14 columns

column	kind	n	null%	unique	alerts
id	numeric	15,464	0.0%	15,464
osm_type	categorical	15,464	0.0%	2
name	text	15,464	0.0%	9,869	one_word duplicates
lat	numeric	15,464	0.0%	15,320	high_skew
lon	numeric	15,464	0.0%	15,407	high_skew
type	categorical	15,464	0.0%	19	imbalance
megalith_type	categorical	15,464	0.0%	73
description	categorical	15,464	0.0%	587	long_tail imbalance
wikipedia	text	15,464	0.0%	2,058	one_word duplicates
wikidata	text	15,464	0.0%	4,289	one_word allcaps short_text duplicates
heritage	categorical	15,464	0.0%	12
heritage_operator	categorical	15,464	0.0%	31
start_date	categorical	15,464	0.0%	26	long_tail imbalance
material	categorical	15,464	0.0%	13	long_tail imbalance

Fig 1.

megalith_type · Look for how dominated the dataset is by menhirs and dolmens, and which rarer types (nuraghes, stone circles, passage graves) still have enough records to analyse separately.

Top values for megalith_type (20 unique shown, of 73 total).
value	count	share
menhir	5231	33.8%
dolmen	4501	29.1%
	1714	11.1%
nuraghe	1080	7.0%
stone_circle	1011	6.5%
passage_grave	537	3.5%
chamber	437	2.8%
long_barrow	184	1.2%
alignment	116	0.8%
cist	107	0.7%
gallery_grave	85	0.5%
standing_stone	68	0.4%
stone_ship	47	0.3%
tholos	32	0.2%
court_tomb	32	0.2%
round_barrow	25	0.2%
well	23	0.1%
wedge_tomb	23	0.1%
cairn	20	0.1%
stone	20	0.1%

Fig 2.

lat · The strong negative skew reflects a dense band of sites between roughly 36°N and 59°N, with a sparse tail of lower-latitude and southern-hemisphere outliers worth investigating.

Histogram bins for lat (median: 47.59247835).
bin	count
-51.81 – -48.88	1
-48.88 – -45.96	0
-45.96 – -43.04	0
-43.04 – -40.11	0
-40.11 – -37.19	1
-37.19 – -34.26	0
-34.26 – -31.34	3
-31.34 – -28.41	0
-28.41 – -25.49	2
-25.49 – -22.56	0
-22.56 – -19.64	1
-19.64 – -16.72	1
-16.72 – -13.79	4
-13.79 – -10.87	4
-10.87 – -7.942	8
-7.942 – -5.018	21
-5.018 – -2.093	3
-2.093 – 0.8313	13
0.8313 – 3.756	26
3.756 – 6.68	1
6.68 – 9.605	7
9.605 – 12.53	5
12.53 – 15.45	8
15.45 – 18.38	2
18.38 – 21.3	5
21.3 – 24.23	2
24.23 – 27.15	8
27.15 – 30.08	3
30.08 – 33	9
33 – 35.92	38
35.92 – 38.85	523
38.85 – 41.77	2211
41.77 – 44.7	3646
44.7 – 47.62	3269
47.62 – 50.55	1808
50.55 – 53.47	1660
53.47 – 56.4	1627
56.4 – 59.32	506
59.32 – 62.24	33
62.24 – 65.17	5

Fig 3.

lon · High positive skew and a 4.4% outlier rate in longitude flag sites far outside Western Europe — check whether these are data errors or genuinely remote monuments.

Histogram bins for lon (median: -1.6201083).
bin	count
-151.4 – -144	1
-144 – -136.6	0
-136.6 – -129.2	0
-129.2 – -121.7	1
-121.7 – -114.3	0
-114.3 – -106.9	1
-106.9 – -99.54	1
-99.54 – -92.14	2
-92.14 – -84.74	2
-84.74 – -77.33	6
-77.33 – -69.93	34
-69.93 – -62.53	2
-62.53 – -55.13	1
-55.13 – -47.72	5
-47.72 – -40.32	1
-40.32 – -32.92	0
-32.92 – -25.52	0
-25.52 – -18.12	0
-18.12 – -10.71	4
-10.71 – -3.31	3136
-3.31 – 4.092	7654
4.092 – 11.49	3031
11.49 – 18.9	921
18.9 – 26.3	58
26.3 – 33.7	15
33.7 – 41.1	441
41.1 – 48.51	21
48.51 – 55.91	3
55.91 – 63.31	0
63.31 – 70.71	1
70.71 – 78.12	7
78.12 – 85.52	7
85.52 – 92.92	7
92.92 – 100.3	1
100.3 – 107.7	19
107.7 – 115.1	7
115.1 – 122.5	23
122.5 – 129.9	30
129.9 – 137.3	15
137.3 – 144.7	6

Fig 4.

heritage · Only about 12% of sites have any value in this field (including a few explicit 'no' entries); look at which designation levels (1, 2, 3) are most common among those that do.

Top values for heritage (12 unique shown, of 12 total).
value	count	share
	13602	88.0%
2	1264	8.2%
3	205	1.3%
1	120	0.8%
yes	109	0.7%
no	69	0.4%
Em Vias de Classificação	60	0.4%
4	24	0.2%
7	8	0.1%
Scheduled Monument	1	0.0%
6	1	0.0%
M0021	1	0.0%

Fig 5.

heritage_operator · Among the minority of sites with a heritage operator, 'mhs' (France) and 'IE:smr' (Ireland) dominate — revealing which countries have contributed the most structured heritage data.

Top values for heritage_operator (20 unique shown, of 31 total).
value	count	share
	13848	89.5%
mhs	960	6.2%
IE:smr	229	1.5%
dgpc	185	1.2%
pc	103	0.7%
rce	23	0.1%
Historic Environment Scotland	18	0.1%
cadw	14	0.1%
whc	14	0.1%
lda	12	0.1%
nld	9	0.1%
IE:smr;IE:nm	8	0.1%
he	6	0.0%
Cadw	5	0.0%
mecd	4	0.0%
DGPC	3	0.0%
IE:smr:IE:nm	3	0.0%
alsh	2	0.0%
hs	2	0.0%
raa	2	0.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	numeric	0.0%
osm_type	categorical	0.0%
name	text	0.0%
lat	numeric	0.0%
lon	numeric	0.0%
type	categorical	0.0%
megalith_type	categorical	0.0%
description	categorical	0.0%
wikipedia	text	0.0%
wikidata	text	0.0%
heritage	categorical	0.0%
heritage_operator	categorical	0.0%
start_date	categorical	0.0%
material	categorical	0.0%
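
The 0.0% null rates above hide the fact that blanks are stored as empty strings. As a rough sketch, the share of effectively missing rows per column can be recomputed from the empty-string counts reported elsewhere in this notebook (the n_empty stats for the text columns and the blank row of each top-values table). The function name effective_missing_pct is illustrative and not part of saturn.

```python
# Row count and empty-string counts, copied from the tables in this notebook.
N_ROWS = 15_464
EMPTY_COUNTS = {
    "name": 4_720,
    "megalith_type": 1_714,
    "description": 14_814,
    "wikipedia": 13_060,
    "wikidata": 10_819,
    "heritage": 13_602,
    "heritage_operator": 13_848,
    "start_date": 15_430,
    "material": 15_223,
}


def effective_missing_pct(n_empty, n_rows=N_ROWS):
    """Share of rows that are blank, as a percentage rounded to one decimal."""
    return round(100 * n_empty / n_rows, 1)


for column, n_empty in EMPTY_COUNTS.items():
    print(f"{column:<18} {effective_missing_pct(n_empty):5.1f}%")
```

Columns not listed here (id, osm_type, lat, lon, type) show no blank value in their tables.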

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	id	lat	lon
id	+1.00	+0.00	-0.05
lat	+0.00	+1.00	-0.11
lon	-0.05	-0.11	+1.00

id numeric identifier

This column is a numeric row identifier: every one of the 15,464 rows carries a distinct value with zero nulls, so it can serve as a unique key. The values are large integers spanning roughly 24 million to 13.5 billion, which is consistent with OpenStreetMap element IDs (the summary states the data is drawn from OpenStreetMap) rather than a dense sequential row index. Mild positive skew (0.89) and a wide IQR (~4.5 billion) suggest IDs were assigned non-uniformly over time or across sources, and no outliers are flagged.

Treatment: Retain as a join/lookup key; exclude from any feature matrix or model input.

anthropic:default · confidence high

Out[13]:

saturn.columns["id"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	15,464
min	2.415e+07
max	1.354e+10
mean	4.503e+09
median	3.411e+09
std	3.47e+09
q1	2.375e+09
q3	6.845e+09
iqr	4.471e+09
skew	0.8907
kurtosis	-0.2006
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 8.

Distribution of id. Vertical dash marks the median.

Histogram bins for id (median: 3411205875.5).
bin	count
2.415e+07 – 3.62e+08	682
3.62e+08 – 6.998e+08	889
6.998e+08 – 1.038e+09	623
1.038e+09 – 1.375e+09	736
1.375e+09 – 1.713e+09	297
1.713e+09 – 2.051e+09	352
2.051e+09 – 2.389e+09	305
2.389e+09 – 2.727e+09	2679
2.727e+09 – 3.065e+09	300
3.065e+09 – 3.402e+09	756
3.402e+09 – 3.74e+09	951
3.74e+09 – 4.078e+09	705
4.078e+09 – 4.416e+09	532
4.416e+09 – 4.754e+09	318
4.754e+09 – 5.092e+09	386
5.092e+09 – 5.429e+09	200
5.429e+09 – 5.767e+09	207
5.767e+09 – 6.105e+09	261
6.105e+09 – 6.443e+09	161
6.443e+09 – 6.781e+09	175
6.781e+09 – 7.119e+09	250
7.119e+09 – 7.456e+09	270
7.456e+09 – 7.794e+09	89
7.794e+09 – 8.132e+09	115
8.132e+09 – 8.47e+09	734
8.47e+09 – 8.808e+09	391
8.808e+09 – 9.146e+09	155
9.146e+09 – 9.483e+09	137
9.483e+09 – 9.821e+09	204
9.821e+09 – 1.016e+10	137
1.016e+10 – 1.05e+10	94
1.05e+10 – 1.083e+10	180
1.083e+10 – 1.117e+10	175
1.117e+10 – 1.151e+10	157
1.151e+10 – 1.185e+10	133
1.185e+10 – 1.219e+10	114
1.219e+10 – 1.252e+10	132
1.252e+10 – 1.286e+10	97
1.286e+10 – 1.32e+10	255
1.32e+10 – 1.354e+10	130

osm_type categorical feature

This column encodes the OpenStreetMap geometry type, distinguishing between point features ('node') and linear/polygon features ('way'). With only 2 distinct values across 15,464 rows and zero nulls, it is a clean binary flag. The distribution is heavily skewed: 'node' dominates at 86.1% (13,311 records) versus 'way' at just 13.9% (2,153 records). The low entropy of 0.582 confirms the imbalance, which may matter if 'way' features behave differently in downstream models.

Treatment: Binary-encode (node=1, way=0) and monitor class imbalance if used as a feature or stratification variable.

anthropic:default · confidence high
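
The entropy figures can be sanity-checked by hand. The sketch below assumes entropy is Shannon entropy in bits and that entropy_ratio is entropy divided by log2 of the cardinality; saturn does not state this in the notebook, but the reported values for osm_type, type, and megalith_type are numerically consistent with that reading. Function names are illustrative.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    if k < 2:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)


osm_type_counts = [13_311, 2_153]  # node, way (from the top-values table)
print(round(shannon_entropy_bits(osm_type_counts), 4))
print(round(entropy_ratio(osm_type_counts), 4))
```

With only two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for this column.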

Out[16]:

saturn.columns["osm_type"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	2
top_value	node
top_rate	0.8608
cardinality	2
entropy	0.5822
entropy_ratio	0.5822

Fig 9.

Top values for osm_type.

Top values for osm_type (2 unique shown, of 2 total).
value	count	share
node	13311	86.1%
way	2153	13.9%

name text label

This column contains the local or common name of prehistoric megalithic monuments (dolmens, menhirs, stone circles, nuraghes, etc.), in several languages including at least English, French, Russian (Cyrillic), and German. Two signals stand out: 30.5% of rows (4,720 of 15,464) are empty strings rather than true nulls, and the duplicate rate is 36.2% (5,595 duplicates). Most of those duplicates are the empty strings themselves (4,719 of the 5,595, counting every blank after the first); the remaining 876 come largely from generic type-names like 'Долмен' (191), 'Dolmen' (51), 'Menhir' (50), and 'Standing Stone' (48) being reused across many distinct monuments. The one-word rate of 37.8% and word mean of ~2.5 are consistent with short monument names, but the 4,720 empty strings should be treated as missing values.

Treatment: Replace empty strings with NaN, then use as a descriptive label or weak text feature; do not treat as a unique identifier due to 36% duplicate rate and multilingual generic names.

anthropic:default · confidence high
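
Duplicate counts in the text columns are easy to misread because blanks count as duplicates of each other. The sketch below splits the reported duplicates into those caused by empty strings and those among real values. It assumes n_duplicates equals rows minus distinct values (which matches the reported figures) and that the empty string is counted as one distinct value; duplicate_breakdown is an illustrative helper, not a saturn function.

```python
def duplicate_breakdown(n_rows, n_unique, n_empty):
    """Split duplicates into those caused by blanks and those among real values.

    Assumes duplicates = rows - distinct values, and that the empty string
    counts as one of the distinct values when any blanks are present.
    """
    total = n_rows - n_unique
    from_empty = max(n_empty - 1, 0)
    return {"total": total, "from_empty": from_empty, "from_values": total - from_empty}


# n, unique, and n_empty as reported in the stats tables.
reported = {
    "name": (15_464, 9_869, 4_720),
    "wikipedia": (15_464, 2_058, 13_060),
    "wikidata": (15_464, 4_289, 10_819),
}
for column, stats in reported.items():
    print(column, duplicate_breakdown(*stats))
```

In all three columns the blanks account for the large majority of the duplicate rate.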

Out[19]:

saturn.columns["name"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	9,869
len_min	0
len_max	84
len_mean	13.65
len_median	15
len_p95	30
word_mean	2.495
word_median	2
n_empty	4,720
n_duplicates	5,595
duplicate_rate	0.3618
vocab_size	9,447
readability_flesch_mean	46.9
emoji_rate	0
url_rate	0
one_word_rate	0.3782
allcaps_rate	0.003169
boilerplate_rate	0
alert: one_word	37.8% rows are a single word
alert: duplicates	36.2% duplicate strings

Fig 10.

Character-length distribution for name.

Character-length distribution for name (mean: 13.64685721676151).
chars	count
0 – 2	4725
2 – 4	27
4 – 6	252
6 – 8	552
8 – 10	420
10 – 13	493
13 – 15	810
15 – 17	1037
17 – 19	1247
19 – 21	1295
21 – 23	1726
23 – 25	831
25 – 27	680
27 – 29	447
29 – 32	283
32 – 34	185
34 – 36	134
36 – 38	86
38 – 40	83
40 – 42	37
42 – 44	43
44 – 46	15
46 – 48	13
48 – 50	13
50 – 52	4
52 – 55	9
55 – 57	4
57 – 59	1
59 – 61	5
61 – 63	2
63 – 65	2
65 – 67	1
67 – 69	0
69 – 71	1
71 – 74	0
74 – 76	0
76 – 78	0
78 – 80	0
80 – 82	0
82 – 84	1

lat numeric feature

This column contains geographic latitude in degrees, ranging from -51.81° to 65.17°. Half of all values fall in the 43°–51° band (IQR of ~7.6°); read alongside the longitude median of about -1.6°, this points to a heavy concentration in Western Europe. The negative skew of -3.09 and extreme kurtosis of 26.33 indicate a sharp central peak with a long left tail: a small number of records pull toward lower or even southern-hemisphere latitudes, captured in 134 flagged outliers (~0.87%). Near-uniqueness (15,320 unique out of 15,464 rows) is expected for precise coordinate data.

Treatment: Use as-is or pair with longitude for spatial analysis; investigate the 134 outliers for data-entry errors or genuine remote locations before modelling.

anthropic:default · confidence high

Out[22]:

saturn.columns["lat"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	15,320
min	-51.81
max	65.17
mean	46.41
median	47.59
std	6.81
q1	42.95
q3	50.52
iqr	7.569
skew	-3.087
kurtosis	26.33
n_outliers	134
outlier_rate	0.008665
zero_rate	0
alert: high_skew	skew=-3.09

Fig 11.

Distribution of lat. Vertical dash marks the median.

Histogram bins for lat (median: 47.59247835).
bin	count
-51.81 – -48.88	1
-48.88 – -45.96	0
-45.96 – -43.04	0
-43.04 – -40.11	0
-40.11 – -37.19	1
-37.19 – -34.26	0
-34.26 – -31.34	3
-31.34 – -28.41	0
-28.41 – -25.49	2
-25.49 – -22.56	0
-22.56 – -19.64	1
-19.64 – -16.72	1
-16.72 – -13.79	4
-13.79 – -10.87	4
-10.87 – -7.942	8
-7.942 – -5.018	21
-5.018 – -2.093	3
-2.093 – 0.8313	13
0.8313 – 3.756	26
3.756 – 6.68	1
6.68 – 9.605	7
9.605 – 12.53	5
12.53 – 15.45	8
15.45 – 18.38	2
18.38 – 21.3	5
21.3 – 24.23	2
24.23 – 27.15	8
27.15 – 30.08	3
30.08 – 33	9
33 – 35.92	38
35.92 – 38.85	523
38.85 – 41.77	2211
41.77 – 44.7	3646
44.7 – 47.62	3269
47.62 – 50.55	1808
50.55 – 53.47	1660
53.47 – 56.4	1627
56.4 – 59.32	506
59.32 – 62.24	33
62.24 – 65.17	5

lon numeric feature

This column represents geographic longitude values, with readings spanning from -151.36 to 144.74 degrees — a plausible global range. What is surprising is the severe positive skew (3.65) and extreme kurtosis (34.34), indicating the distribution is heavily concentrated in a narrow band (IQR of only 11.53, centred around Western Europe/Africa longitudes near 0°) with 676 outliers (4.37%) pulled far to the east and west. The mean (2.62) and median (-1.62) diverge noticeably, confirming the asymmetric clustering, likely reflecting a dataset dominated by European locations with a long tail of global outliers.

Treatment: Retain as-is for geo-spatial modelling; investigate and potentially stratify or cap the 676 outlier records before distance-based or regression analyses.

anthropic:default · confidence high
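
To investigate the flagged coordinates, it helps to know roughly where the outlier boundaries sit. The notebook does not say which rule saturn uses, so the sketch below assumes the conventional Tukey rule (1.5 × IQR beyond the quartiles); the resulting fences look broadly compatible with the histograms, but that is an inference rather than a documented behaviour. The quartiles are the rounded values from the lat and lon stats tables.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True when value lies outside the Tukey fences for the given quartiles."""
    lower, upper = tukey_fences(q1, q3, k)
    return value < lower or value > upper


# Quartiles as reported in the lat and lon stats tables.
lat_fences = tukey_fences(42.95, 50.52)
lon_fences = tukey_fences(-3.083, 8.447)
print("lat fences:", tuple(round(f, 2) for f in lat_fences))
print("lon fences:", tuple(round(f, 2) for f in lon_fences))
```

Under this assumption, latitudes below about 31.6° or above about 61.9°, and longitudes below about -20.4° or above about 25.7°, would be the records to review.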

Out[25]:

saturn.columns["lon"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	15,407
min	-151.4
max	144.7
mean	2.618
median	-1.62
std	14.64
q1	-3.083
q3	8.447
iqr	11.53
skew	3.654
kurtosis	34.34
n_outliers	676
outlier_rate	0.04371
zero_rate	0
alert: high_skew	skew=+3.65

Fig 12.

Distribution of lon. Vertical dash marks the median.

Histogram bins for lon (median: -1.6201083).
bin	count
-151.4 – -144	1
-144 – -136.6	0
-136.6 – -129.2	0
-129.2 – -121.7	1
-121.7 – -114.3	0
-114.3 – -106.9	1
-106.9 – -99.54	1
-99.54 – -92.14	2
-92.14 – -84.74	2
-84.74 – -77.33	6
-77.33 – -69.93	34
-69.93 – -62.53	2
-62.53 – -55.13	1
-55.13 – -47.72	5
-47.72 – -40.32	1
-40.32 – -32.92	0
-32.92 – -25.52	0
-25.52 – -18.12	0
-18.12 – -10.71	4
-10.71 – -3.31	3136
-3.31 – 4.092	7654
4.092 – 11.49	3031
11.49 – 18.9	921
18.9 – 26.3	58
26.3 – 33.7	15
33.7 – 41.1	441
41.1 – 48.51	21
48.51 – 55.91	3
55.91 – 63.31	0
63.31 – 70.71	1
70.71 – 78.12	7
78.12 – 85.52	7
85.52 – 92.92	7
92.92 – 100.3	1
100.3 – 107.7	19
107.7 – 115.1	7
115.1 – 122.5	23
122.5 – 129.9	30
129.9 – 137.3	15
137.3 – 144.7	6

type categorical label

This column classifies archaeological monument types, with 19 distinct categories across 15,464 records and no nulls. It is severely imbalanced: 'megalith' dominates at 97.73% of all records (15,113), leaving the remaining 18 types — menhir, dolmen, standing_stone, stone_circle, nuraghe, etc. — sharing just 351 records. The entropy ratio of 0.049 confirms near-total concentration in one class, which will severely impair any multi-class model trained on this label.

Treatment: Treat as a severely imbalanced categorical label; apply oversampling (SMOTE) or class-weighted losses if used as a target, or collapse rare types into an 'other' bucket for feature use.

anthropic:default · confidence high
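
One way to act on the "collapse rare types" suggestion is sketched below, using the counts from the top-values table for type. The threshold of 50 rows and the helper name collapse_rare are arbitrary choices for illustration; a different cut-off may suit a given analysis better.

```python
from collections import Counter


def collapse_rare(counts, min_count, other_label="other"):
    """Merge every category with fewer than min_count rows into other_label."""
    collapsed = Counter()
    for value, count in counts.items():
        key = value if count >= min_count else other_label
        collapsed[key] += count
    return dict(collapsed)


# Counts copied from the top-values table for `type`.
type_counts = {
    "megalith": 15_113, "menhir": 156, "dolmen": 83, "standing_stone": 59,
    "stone_circle": 16, "nuraghe": 8, "gallery_grave": 6, "passage_grave": 5,
    "lech": 4, "stone_ship": 3, "tholos": 2, "chamber": 2, "village": 1,
    "plaque": 1, "cist": 1, "long_barrow": 1, "chambered_cairn": 1,
    "grave_field": 1, "stone": 1,
}
print(collapse_rare(type_counts, min_count=50))
```

With this threshold, 15 of the 19 categories fold into a single bucket of 53 rows, which leaves 'megalith' just as dominant as before.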

Out[28]:

saturn.columns["type"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	19
top_value	megalith
top_rate	0.9773
cardinality	19
entropy	0.2096
entropy_ratio	0.04933
alert: imbalance	top value is 97.7% of rows

Fig 13.

Top values for type.

Top values for type (19 unique shown, of 19 total).
value	count	share
megalith	15113	97.7%
menhir	156	1.0%
dolmen	83	0.5%
standing_stone	59	0.4%
stone_circle	16	0.1%
nuraghe	8	0.1%
gallery_grave	6	0.0%
passage_grave	5	0.0%
lech	4	0.0%
stone_ship	3	0.0%
tholos	2	0.0%
chamber	2	0.0%
village	1	0.0%
plaque	1	0.0%
cist	1	0.0%
long_barrow	1	0.0%
chambered_cairn	1	0.0%
grave_field	1	0.0%
stone	1	0.0%

megalith_type categorical label

This column classifies prehistoric stone monuments into structural types, with 73 distinct categories across 15,464 records and no nulls. The dominant class is 'menhir' (5,231 records, ~33.8%), followed closely by 'dolmen' (4,501), meaning these two types together account for over 60% of all rows — a moderate concentration reflected in an entropy ratio of 0.44. Notably, the third most frequent value is an empty string ('') with 1,714 occurrences (~11.1%), which masquerades as a non-null entry and represents a meaningful data quality issue that null_rate alone does not capture.

Treatment: Recode empty-string entries as explicit nulls or an 'unknown' category, then one-hot or target-encode for modelling given 73 categories and moderate class imbalance.

anthropic:default · confidence high

Out[31]:

saturn.columns["megalith_type"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	73
top_value	menhir
top_rate	0.3383
cardinality	73
entropy	2.749
entropy_ratio	0.4441

Fig 14.

Top values for megalith_type.

Top values for megalith_type (20 unique shown, of 73 total).
value	count	share
menhir	5231	33.8%
dolmen	4501	29.1%
	1714	11.1%
nuraghe	1080	7.0%
stone_circle	1011	6.5%
passage_grave	537	3.5%
chamber	437	2.8%
long_barrow	184	1.2%
alignment	116	0.8%
cist	107	0.7%
gallery_grave	85	0.5%
standing_stone	68	0.4%
stone_ship	47	0.3%
tholos	32	0.2%
court_tomb	32	0.2%
round_barrow	25	0.2%
well	23	0.1%
wedge_tomb	23	0.1%
cairn	20	0.1%
stone	20	0.1%

description categorical free_text

This column is a text description field for archaeological or heritage site records, containing short labels or names of megalithic structures (e.g., 'Jættestue', 'Großsteingrab', 'Dolmen', 'Stone circle') in multiple languages including Danish, German, Portuguese, and English. The most striking signal is that 95.8% of the 15,464 rows (14,814) carry an empty string, effectively making the field near-empty at scale. The remaining 586 distinct non-empty values are heavily long-tailed, with the most frequent non-empty value ('Jættestue') appearing only 11 times. The entropy ratio of 0.069 confirms extreme imbalance driven by the dominant empty-string value.

Treatment: Treat empty strings as missing; for non-null values, consider as a sparse categorical label or tokenize and embed for similarity/search use cases.

anthropic:default · confidence high

Out[34]:

saturn.columns["description"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	587
top_value
top_rate	0.958
cardinality	587
entropy	0.6328
entropy_ratio	0.0688
alert: long_tail	558 singleton categories
alert: imbalance	top value is 95.8% of rows

Fig 15.

Top values for description.

Top values for description (20 unique shown, of 587 total).
value	count	share
	14814	95.8%
Jættestue	11	0.1%
Anta da Herdade da Ordem	8	0.1%
Stone circle	5	0.0%
Großsteingrab	5	0.0%
Rest eines Großsteingrabes	5	0.0%
Long Barrow	4	0.0%
Dolmen	4	0.0%
Langdysse	4	0.0%
Four standing and one recumbent standing stone.	4	0.0%
pair of two standing stones	4	0.0%
Hünengrab	3	0.0%
Henge / Círculo lítico	3	0.0%
Menhir	2	0.0%
Гармония	2	0.0%
Runddysse	2	0.0%
Allée couverte	2	0.0%
Stendysse	2	0.0%
Tumulus, dalle de couverture	2	0.0%
Table, chevet, orthostates droit et gauche	2	0.0%

wikipedia text metadata

This column stores Wikipedia article references in a 'language-code:article-title' format (e.g., 'de:Großsteingräber im Haldensleber Forst'), linking dataset records to Wikipedia pages in several languages including German, French, Catalan, Portuguese, and English. The dominant signal is that 13,060 of 15,464 rows (84.5%) are empty strings, meaning most records have no Wikipedia link at all. The 13,406 reported duplicates are almost entirely those blanks (13,059 of them); among the 2,404 populated rows there are 2,057 distinct articles, so only 347 rows repeat an article already referenced by another record, which is consistent with grouped or list articles covering several individual megalithic sites. The multi-language mix (de, fr, pt, ca, en prefixes) is expected for a multilingual cultural-heritage dataset.

Treatment: Parse language prefix and article slug into separate fields; treat empty strings as nulls; use as an optional enrichment join key rather than a model feature.

anthropic:default · confidence high
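
The suggested split into language prefix and article title can be sketched as follows. It assumes the first colon separates the two parts, as in the example value quoted above; parse_wikipedia_tag is an illustrative name, and values that do not fit the pattern are returned as None rather than guessed at.

```python
def parse_wikipedia_tag(value):
    """Split 'lang:Title' into (lang, title); return None for blank or malformed values."""
    text = value.strip()
    if not text:
        return None
    # partition splits on the first colon only, so colons inside titles survive.
    lang, sep, title = text.partition(":")
    if not sep or not lang or not title:
        return None
    return lang, title


print(parse_wikipedia_tag("de:Großsteingräber im Haldensleber Forst"))
print(parse_wikipedia_tag(""))
```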

Out[37]:

saturn.columns["wikipedia"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	2,058
len_min	0
len_max	75
len_mean	4.1
len_median	0
len_p95	29
word_mean	1.351
word_median	1
n_empty	13,060
n_duplicates	13,406
duplicate_rate	0.8669
vocab_size	2,769
readability_flesch_mean	5.48
emoji_rate	0
url_rate	0
one_word_rate	0.8524
allcaps_rate	0
boilerplate_rate	0
alert: one_word	85.2% rows are a single word
alert: duplicates	86.7% duplicate strings

Fig 16.

Character-length distribution for wikipedia.

Character-length distribution for wikipedia (mean: 4.100038799793068).
chars	count
0 – 2	13060
2 – 4	0
4 – 6	0
6 – 8	1
8 – 9	3
9 – 11	21
11 – 13	47
13 – 15	28
15 – 17	159
17 – 19	115
19 – 21	191
21 – 22	236
22 – 24	271
24 – 26	254
26 – 28	203
28 – 30	130
30 – 32	198
32 – 34	133
34 – 36	110
36 – 38	75
38 – 39	49
39 – 41	97
41 – 43	30
43 – 45	6
45 – 47	12
47 – 49	7
49 – 51	4
51 – 52	2
52 – 54	8
54 – 56	3
56 – 58	3
58 – 60	0
60 – 62	0
62 – 64	3
64 – 66	0
66 – 68	1
68 – 69	1
69 – 71	0
71 – 73	1
73 – 75	2

wikidata text foreign_key

This column stores Wikidata entity identifiers (Q-codes) linking dataset rows to Wikidata knowledge-base entries. Two signals demand attention. First, 10,819 of 15,464 rows (70%) are empty strings rather than true nulls. Second, the duplicate_rate of 0.723 is driven almost entirely by those blanks: of the 11,175 duplicates, 10,818 are repeated empty strings, and only 357 of the 4,645 populated rows repeat a Q-code used elsewhere. The top value 'Q106546933' appears 17 times, so some entities do map to many rows. The allcaps_rate of 0.300 matches the populated share (4,645 of 15,464), which suggests every non-empty value is all-caps, as expected for Q-codes.

Treatment: Replace empty strings with null, then left-join on this Q-code to enrich with Wikidata properties; expect a many-to-one join given high duplicate rate.

anthropic:default · confidence high
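
Before joining on this column, blanks and malformed values can be filtered out. The sketch below treats a valid identifier as an uppercase Q followed by a positive integer without leading zeros, which is the usual form of Wikidata item IDs; clean_qid is an illustrative helper name.

```python
import re

# Uppercase 'Q' followed by a positive integer with no leading zero.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def clean_qid(value):
    """Return a valid Q-code, or None for blanks and anything that is not Q + digits."""
    text = value.strip()
    return text if QID_PATTERN.fullmatch(text) else None


print(clean_qid("Q106546933"))
print(clean_qid(""))
```

Rows where clean_qid returns None can then be excluded from the join rather than matched against an empty key.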

Out[40]:

saturn.columns["wikidata"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	4,289
len_min	0
len_max	10
len_mean	2.667
len_median	0
len_p95	10
word_mean	1
word_median	1
n_empty	10,819
n_duplicates	11,175
duplicate_rate	0.7226
vocab_size	4,288
readability_flesch_mean	38.79
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.3004
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	30.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	72.3% duplicate strings

Fig 17.

Character-length distribution for wikidata.

Character-length distribution for wikidata (mean: 2.6668391101914124).
chars	count
0	10819
6	5
7	94
8	1167
9	2574
10	805

heritage categorical label

This column represents a heritage classification or designation status, likely a regulatory or cultural heritage grading field. The dominant 'value' is an empty string, which accounts for 87.96% of all 15,464 rows, indicating that most records carry no heritage designation. The remaining values are a heterogeneous mix of numeric grades (1–4, 6, 7), boolean-style strings ('yes', 'no'), a Portuguese classification phrase ('Em Vias de Classificação'), a single 'Scheduled Monument' entry, and one code-like value ('M0021'), suggesting the column was populated from multiple source systems or locales with no enforced vocabulary.

Treatment: Treat empty strings as a distinct 'undesignated' category; harmonise numeric grades, boolean strings, and foreign-language values into a unified controlled vocabulary before encoding.

anthropic:default · confidence medium
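
A first pass at harmonising these values might look like the sketch below. The bucket names (undesignated, level_N, designated_unspecified, not_designated, other_text) are invented for illustration and are not an established vocabulary; free-text values are deliberately left in a catch-all bucket for manual review rather than guessed at.

```python
def harmonise_heritage(value):
    """Map a raw heritage value to one of a few illustrative buckets."""
    text = value.strip()
    if not text:
        return "undesignated"
    if text.isdigit():
        return f"level_{text}"
    lowered = text.casefold()
    if lowered == "yes":
        return "designated_unspecified"
    if lowered == "no":
        return "not_designated"
    # e.g. 'Em Vias de Classificação', 'Scheduled Monument', 'M0021'
    return "other_text"


for raw in ["", "2", "yes", "no", "Em Vias de Classificação", "M0021"]:
    print(repr(raw), "->", harmonise_heritage(raw))
```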

Out[43]:

saturn.columns["heritage"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	12
top_value
top_rate	0.8796
cardinality	12
entropy	0.7343
entropy_ratio	0.2048

Fig 18.

Top values for heritage.

Top values for heritage (12 unique shown, of 12 total).
value	count	share
	13602	88.0%
2	1264	8.2%
3	205	1.3%
1	120	0.8%
yes	109	0.7%
no	69	0.4%
Em Vias de Classificação	60	0.4%
4	24	0.2%
7	8	0.1%
Scheduled Monument	1	0.0%
6	1	0.0%
M0021	1	0.0%

heritage_operator categorical label

This column identifies the heritage operator or authority responsible for a record, with 31 distinct coded values across 15,464 rows. The dominant 'value' is an empty string, accounting for 89.5% of all rows (13,848), meaning the vast majority of records have no operator assigned; this blank dominance suppresses the entropy ratio to 0.14. Among the 30 non-empty values, 'mhs' (960), 'IE:smr' (229), and 'dgpc' (185) are the most common. Formatting is inconsistent across sources: abbreviated authority codes sit alongside full names (e.g., 'Historic Environment Scotland'), the same code appears in different cases ('cadw' and 'Cadw', 'dgpc' and 'DGPC'), and multi-valued entries use different separators ('IE:smr;IE:nm' and 'IE:smr:IE:nm').

Treatment: Treat empty string as missing; normalise authority codes to a consistent controlled vocabulary before using as a categorical feature or grouping key.

anthropic:default · confidence high
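
The case and separator variants visible in the top-values table can be folded together with a small normaliser. This is a minimal sketch: the alias table is hypothetical and covers only the one separator variant shown above, and full names such as 'Historic Environment Scotland' are lowercased but not mapped to a code, since the notebook does not say which code they correspond to.

```python
# Hypothetical alias table: maps known variants to one canonical spelling.
OPERATOR_ALIASES = {
    "ie:smr:ie:nm": "ie:smr;ie:nm",  # ':' used where ';' separates two registers
}


def normalise_operator(value):
    """Lowercase an operator code and apply known aliases; blanks become None."""
    text = value.strip().casefold()
    if not text:
        return None
    return OPERATOR_ALIASES.get(text, text)


for raw in ["Cadw", "cadw", "DGPC", "IE:smr:IE:nm", "IE:smr;IE:nm", ""]:
    print(repr(raw), "->", normalise_operator(raw))
```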

Out[46]:

saturn.columns["heritage_operator"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	31
top_value
top_rate	0.8955
cardinality	31
entropy	0.7028
entropy_ratio	0.1419

Fig 19.

Top values for heritage_operator.

Top values for heritage_operator (20 unique shown, of 31 total).
value	count	share
	13848	89.5%
mhs	960	6.2%
IE:smr	229	1.5%
dgpc	185	1.2%
pc	103	0.7%
rce	23	0.1%
Historic Environment Scotland	18	0.1%
cadw	14	0.1%
whc	14	0.1%
lda	12	0.1%
nld	9	0.1%
IE:smr;IE:nm	8	0.1%
he	6	0.0%
Cadw	5	0.0%
mecd	4	0.0%
DGPC	3	0.0%
IE:smr:IE:nm	3	0.0%
alsh	2	0.0%
hs	2	0.0%
raa	2	0.0%

start_date categorical metadata

This column is intended to capture a start date for records, but it is overwhelmingly empty: 15,430 of 15,464 rows (99.78%) contain a blank string, making it nearly useless as a feature. The 34 non-empty values are highly heterogeneous — mixing ISO dates ('2004-07-01'), calendar years ('1999'), approximate historical dates ('~2000 BC'), ranges ('between 3500 and 2800 BCE'), negative year offsets ('-3000 BC'), and even a code-like value ('C-30') — indicating no enforced format or schema. The extreme imbalance (top_rate 0.998) and near-zero entropy (0.032) confirm the column carries almost no information signal.

Treatment: Drop from modelling due to 99.78% blank rate; if historical context is needed, parse and normalise the 34 non-blank values manually before any use.

anthropic:default · confidence high
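
If the 34 non-blank values are worth salvaging, a rough first pass could pull out the first three- or four-digit year and apply a sign. The sketch below is deliberately conservative and makes several assumptions: only 'BC' and 'BCE' markers or a leading minus are treated as negative years, ranges collapse to their first year, and anything without a three- or four-digit number (such as 'C-30') returns None. It does not handle the German 'v. u. Z.' marker, so that value would still need manual review. The name first_year is illustrative.

```python
import re

_YEAR = re.compile(r"-?\d{3,4}")
_BC_MARKER = re.compile(r"\bBCE?\b")


def first_year(value):
    """Return the first year mentioned as a signed int (negative = BC), or None."""
    text = value.strip()
    if not text:
        return None
    match = _YEAR.search(text)
    if match is None:
        return None
    year = int(match.group())
    # A BC/BCE marker flips a positive year; an already-negative year is kept.
    if _BC_MARKER.search(text) and year > 0:
        year = -year
    return year


for raw in ["1999", "~2000 BC", "between 3500 and 2800 BCE", "-3000 BC", "2004-07-01", "C-30"]:
    print(repr(raw), "->", first_year(raw))
```

Modern values such as '1999' or '2004-07-01' parse cleanly but are unlikely to be construction dates for prehistoric monuments, so they are worth checking by hand as well.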

Out[49]:

saturn.columns["start_date"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	26
top_value
top_rate	0.9978
cardinality	26
entropy	0.03224
entropy_ratio	0.006859
alert: long_tail	21 singleton categories
alert: imbalance	top value is 99.8% of rows

Fig 20.

Top values for start_date.

Top values for start_date (20 unique shown, of 26 total).
value	count	share
	15430	99.8%
1999	5	0.0%
C-30	4	0.0%
~2000 BC	2	0.0%
between 3500 and 2800 BCE	2	0.0%
2900 BC..2600 BC	1	0.0%
-3000 BC	1	0.0%
-2000	1	0.0%
2004-07-01	1	0.0%
before -3250	1	0.0%
3720 BC	1	0.0%
2800-2200 BC	1	0.0%
~5000 BCE	1	0.0%
~C30 BC	1	0.0%
2000 BC	1	0.0%
Mittelneolithikum (2350 - 2150 v. u. Z.)	1	0.0%
1500 BC	1	0.0%
2800 BC..2200 BC	1	0.0%
2012-04-30	1	0.0%
3100 BC	1	0.0%

material categorical feature

This column captures the construction material of the monument, with 13 distinct values across 15,464 rows. The dominant 'value' is an empty string, accounting for 98.44% of all records, so the field is effectively unpopulated for the vast majority of entries despite a null rate of 0.0%. The remaining 241 non-empty records are mostly stone types (stone, granite, sandstone, limestone, etc.), with minor spelling and language inconsistencies ('granit' alongside 'granite', the German 'Quarzit', and 'quartz_blanc') and a few values that look out of place for prehistoric monuments ('reinforced_concrete', 'stone;concrete'). Entropy is extremely low (0.133), confirming near-total dominance of the blank value.

Treatment: Treat empty string as missing; recode to NaN, then consider a binary 'has_material' flag or impute/drop depending on task, given 98.44% missingness.

anthropic:default · confidence high

Out[52]:

saturn.columns["material"].stats

stat	value
n	15,464
nulls	0 (0.0%)
unique	13
top_value
top_rate	0.9844
cardinality	13
entropy	0.1326
entropy_ratio	0.03582
alert: long_tail	7 singleton categories
alert: imbalance	top value is 98.4% of rows

Fig 21.

Top values for material.

Top values for material (13 unique shown, of 13 total).
value	count	share
	15223	98.4%
stone	196	1.3%
granite	29	0.2%
sandstone	5	0.0%
limestone	2	0.0%
dry_stone	2	0.0%
Quarzit	1	0.0%
reinforced_concrete	1	0.0%
stone;concrete	1	0.0%
basalt	1	0.0%
quartz_blanc	1	0.0%
granit	1	0.0%
andesite	1	0.0%
