quirky megaliths
Reading
This dataset catalogues 15,464 megalithic sites with 14 fields covering geographic coordinates, classification (type, megalith_type, material), heritage status, and naming and external references (name, wikipedia, wikidata). Coverage is uneven: many descriptive fields are mostly empty (description is blank in 14,814 rows, material in 15,223, start_date in 15,430), so analysis should lean on the well-populated columns. The most informative categorical is megalith_type, where menhir (5,231) and dolmen (4,501) dominate among 73 distinct values (one of which is the empty string, with 1,714 rows), while the broader type field is overwhelmingly 'megalith' (97.7%). Geographically, lat and lon are highly skewed, with heavy clustering in Europe (median lat 47.6, lon -1.6) and a long tail of outliers stretching as far as 144.7°E, 151.4°W, and 51.8°S. Start with megalith_type and the lat/lon distributions to understand what kinds of sites exist and where they cluster.
citing: megalith_type · type · lat · lon · material · description · start_date · osm_type · heritage · name
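Because every column reports 0.0% nulls while several are mostly empty strings, a blank-aware coverage check is a useful first step. The following is a minimal sketch, assuming the data is available as CSV-style rows; the `blank_rates` helper and the four sample rows are hypothetical and stand in for the real export.

```python
import csv
import io


def blank_rates(rows, columns):
    """Share of rows whose value is missing or an empty/whitespace-only string."""
    rows = list(rows)
    if not rows:
        return {c: 0.0 for c in columns}
    return {
        c: sum(1 for r in rows if not (r.get(c) or "").strip()) / len(rows)
        for c in columns
    }


# Tiny made-up sample (hypothetical rows, not taken from the dataset).
SAMPLE = """name,megalith_type,material
Dolmen,dolmen,
,menhir,stone
Menhir,,
,dolmen,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rates = blank_rates(rows, ["name", "megalith_type", "material"])
```

Run against the full table, the same function would be expected to surface the sparsity described in the summary, which the null counts alone hide.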
Charts the summary said to look at first
megalith_type: top 20 values
| value | count | share |
|---|---|---|
| menhir | 5231 | 33.8% |
| dolmen | 4501 | 29.1% |
| (empty) | 1714 | 11.1% |
| nuraghe | 1080 | 7.0% |
| stone_circle | 1011 | 6.5% |
| passage_grave | 537 | 3.5% |
| chamber | 437 | 2.8% |
| long_barrow | 184 | 1.2% |
| alignment | 116 | 0.8% |
| cist | 107 | 0.7% |
| gallery_grave | 85 | 0.5% |
| standing_stone | 68 | 0.4% |
| stone_ship | 47 | 0.3% |
| tholos | 32 | 0.2% |
| court_tomb | 32 | 0.2% |
| round_barrow | 25 | 0.2% |
| well | 23 | 0.1% |
| wedge_tomb | 23 | 0.1% |
| cairn | 20 | 0.1% |
| stone | 20 | 0.1% |
lat: histogram bins
| bin | count |
|---|---|
| -51.81 – -48.88 | 1 |
| -48.88 – -45.96 | 0 |
| -45.96 – -43.04 | 0 |
| -43.04 – -40.11 | 0 |
| -40.11 – -37.19 | 1 |
| -37.19 – -34.26 | 0 |
| -34.26 – -31.34 | 3 |
| -31.34 – -28.41 | 0 |
| -28.41 – -25.49 | 2 |
| -25.49 – -22.56 | 0 |
| -22.56 – -19.64 | 1 |
| -19.64 – -16.72 | 1 |
| -16.72 – -13.79 | 4 |
| -13.79 – -10.87 | 4 |
| -10.87 – -7.942 | 8 |
| -7.942 – -5.018 | 21 |
| -5.018 – -2.093 | 3 |
| -2.093 – 0.8313 | 13 |
| 0.8313 – 3.756 | 26 |
| 3.756 – 6.68 | 1 |
| 6.68 – 9.605 | 7 |
| 9.605 – 12.53 | 5 |
| 12.53 – 15.45 | 8 |
| 15.45 – 18.38 | 2 |
| 18.38 – 21.3 | 5 |
| 21.3 – 24.23 | 2 |
| 24.23 – 27.15 | 8 |
| 27.15 – 30.08 | 3 |
| 30.08 – 33 | 9 |
| 33 – 35.92 | 38 |
| 35.92 – 38.85 | 523 |
| 38.85 – 41.77 | 2211 |
| 41.77 – 44.7 | 3646 |
| 44.7 – 47.62 | 3269 |
| 47.62 – 50.55 | 1808 |
| 50.55 – 53.47 | 1660 |
| 53.47 – 56.4 | 1627 |
| 56.4 – 59.32 | 506 |
| 59.32 – 62.24 | 33 |
| 62.24 – 65.17 | 5 |
lon: histogram bins
| bin | count |
|---|---|
| -151.4 – -144 | 1 |
| -144 – -136.6 | 0 |
| -136.6 – -129.2 | 0 |
| -129.2 – -121.7 | 1 |
| -121.7 – -114.3 | 0 |
| -114.3 – -106.9 | 1 |
| -106.9 – -99.54 | 1 |
| -99.54 – -92.14 | 2 |
| -92.14 – -84.74 | 2 |
| -84.74 – -77.33 | 6 |
| -77.33 – -69.93 | 34 |
| -69.93 – -62.53 | 2 |
| -62.53 – -55.13 | 1 |
| -55.13 – -47.72 | 5 |
| -47.72 – -40.32 | 1 |
| -40.32 – -32.92 | 0 |
| -32.92 – -25.52 | 0 |
| -25.52 – -18.12 | 0 |
| -18.12 – -10.71 | 4 |
| -10.71 – -3.31 | 3136 |
| -3.31 – 4.092 | 7654 |
| 4.092 – 11.49 | 3031 |
| 11.49 – 18.9 | 921 |
| 18.9 – 26.3 | 58 |
| 26.3 – 33.7 | 15 |
| 33.7 – 41.1 | 441 |
| 41.1 – 48.51 | 21 |
| 48.51 – 55.91 | 3 |
| 55.91 – 63.31 | 0 |
| 63.31 – 70.71 | 1 |
| 70.71 – 78.12 | 7 |
| 78.12 – 85.52 | 7 |
| 85.52 – 92.92 | 7 |
| 92.92 – 100.3 | 1 |
| 100.3 – 107.7 | 19 |
| 107.7 – 115.1 | 7 |
| 115.1 – 122.5 | 23 |
| 122.5 – 129.9 | 30 |
| 129.9 – 137.3 | 15 |
| 137.3 – 144.7 | 6 |
osm_type: value counts
| value | count | share |
|---|---|---|
| node | 13311 | 86.1% |
| way | 2153 | 13.9% |
type: value counts
| value | count | share |
|---|---|---|
| megalith | 15113 | 97.7% |
| menhir | 156 | 1.0% |
| dolmen | 83 | 0.5% |
| standing_stone | 59 | 0.4% |
| stone_circle | 16 | 0.1% |
| nuraghe | 8 | 0.1% |
| gallery_grave | 6 | 0.0% |
| passage_grave | 5 | 0.0% |
| lech | 4 | 0.0% |
| stone_ship | 3 | 0.0% |
| tholos | 2 | 0.0% |
| chamber | 2 | 0.0% |
| village | 1 | 0.0% |
| plaque | 1 | 0.0% |
| cist | 1 | 0.0% |
| long_barrow | 1 | 0.0% |
| chambered_cairn | 1 | 0.0% |
| grave_field | 1 | 0.0% |
| stone | 1 | 0.0% |
Schema
14 columns

| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| id | numeric | 0.0% | 15,464 | |
| osm_type | categorical | 0.0% | 2 | |
| name | text | 0.0% | 9,869 | one_word, duplicates |
| lat | numeric | 0.0% | 15,320 | high_skew |
| lon | numeric | 0.0% | 15,407 | high_skew |
| type | categorical | 0.0% | 19 | imbalance |
| megalith_type | categorical | 0.0% | 73 | |
| description | categorical | 0.0% | 587 | long_tail, imbalance |
| wikipedia | text | 0.0% | 2,058 | one_word, duplicates |
| wikidata | text | 0.0% | 4,289 | one_word, allcaps, short_text, duplicates |
| heritage | categorical | 0.0% | 12 | |
| heritage_operator | categorical | 0.0% | 31 | |
| start_date | categorical | 0.0% | 26 | long_tail, imbalance |
| material | categorical | 0.0% | 13 | long_tail, imbalance |
id
numeric · identifier
This is an identifier column: all 15,464 values are unique with no nulls and no zeros, spanning 24151805 to 13537320281. The numeric stats (mean 4503184709.89, std 3470459882.49, skew 0.89) reflect ID allocation rather than a meaningful distribution, so treat the numeric summary as incidental. Treatment: Use as a join key; exclude from modelling features.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 15,464
- min: 2.415e+07
- max: 1.354e+10
- mean: 4.503e+09
- median: 3.411e+09
- std: 3.47e+09
- q1: 2.375e+09
- q3: 6.845e+09
- iqr: 4.471e+09
- skew: 0.8907
- kurtosis: -0.2006
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
osm_type
categorical · feature
This column records the OpenStreetMap element type for each record, taking only two values: "node" (13,311 rows, 86.1%) and "way" (2,153 rows, 13.9%). With a cardinality of 2 and no nulls across 15,464 rows, it is a clean binary categorical, though the 86/14 split makes "way" the clear minority class. Treatment: Encode as a binary indicator (e.g., is_node) before modelling.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 2
- top_value: node
- top_rate: 0.8608
- cardinality: 2
- entropy: 0.5822
- entropy_ratio: 0.5822
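The entropy figures and the suggested binary encoding can be checked with a short sketch. It assumes entropy is measured in bits and that entropy_ratio is entropy divided by log2 of the cardinality. That definition reproduces the numbers reported here, but it is an inference about the profiler, not something the report states. The helper names are illustrative.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories (assumed definition)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


def is_node(osm_type):
    """Binary indicator for the majority class, as the treatment suggests."""
    return 1 if osm_type == "node" else 0


osm_counts = [13311, 2153]  # node, way (from the value counts in this report)
h = entropy_bits(osm_counts)
ratio = entropy_ratio(osm_counts)
```

With two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for this column.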
name
text · label · one_word · duplicates
This is a free-text 'name' field for megalithic monuments, mixing labels in English, French, German, and Italian with Cyrillic-script entries (e.g. 'Dolmen', 'Menhir', 'Großsteingrab', 'Нураге', 'Дольмен'). Values are short (mean 13.6 characters, median 2 words) and heavily duplicated: 36.2% are repeats and 4,720 rows (≈30%) are empty strings, leaving only 9,869 unique values out of 15,464. The vocabulary is dominated by generic monument types rather than proper names, and 37.8% of entries are a single word. Treatment: Treat as a categorical type label after lowercasing and language-normalising; do not use as a unique identifier.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 9,869
- len_min: 0
- len_max: 84
- len_mean: 13.65
- len_median: 15
- len_p95: 30
- word_mean: 2.495
- word_median: 2
- n_empty: 4,720
- n_duplicates: 5,595
- duplicate_rate: 0.3618
- vocab_size: 9,447
- readability_flesch_mean: 46.9
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.3782
- allcaps_rate: 0.003169
- boilerplate_rate: 0
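The lowercasing and language-normalising step for name could look like the sketch below. The `GENERIC_LABELS` mapping is hypothetical: it covers only the generic labels quoted in this report, and the canonical names on the right are illustrative choices. A real mapping would need to be built from the data.

```python
import unicodedata

# Hypothetical mapping from casefolded generic labels to a canonical type.
# Note that casefold() turns the German sharp s into "ss".
GENERIC_LABELS = {
    "dolmen": "dolmen",
    "дольмен": "dolmen",
    "menhir": "menhir",
    "нураге": "nuraghe",
    "grosssteingrab": "megalithic_tomb",
}


def normalise_name(raw):
    """Return a canonical label, the casefolded name, or None for blanks."""
    text = unicodedata.normalize("NFKC", raw or "").strip().casefold()
    if not text:
        return None
    return GENERIC_LABELS.get(text, text)


examples = [normalise_name(s) for s in [" Дольмен ", "Großsteingrab", "", "Some Proper Name"]]
```

Names that are not in the mapping pass through casefolded, so proper names stay distinguishable from generic type labels.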
lat
numeric · feature · high_skew
This is a latitude coordinate column with 15,320 unique values across 15,464 rows and no nulls. Values span -51.81 to 65.17 with a median of 47.59 and Q1-Q3 of 42.95-50.52, indicating that most observations cluster in the northern mid-latitudes (consistent with Europe, given the longitude median of -1.62). The strong negative skew (-3.09) and high kurtosis (26.33) reflect a small tail of low-latitude and southern-hemisphere points pulling against an otherwise tight northern cluster, with 134 outliers flagged. Treatment: Pair with longitude as a geospatial feature; consider binning by region or projecting before modelling rather than using raw latitude.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 15,320
- min: -51.81
- max: 65.17
- mean: 46.41
- median: 47.59
- std: 6.81
- q1: 42.95
- q3: 50.52
- iqr: 7.569
- skew: -3.087
- kurtosis: 26.33
- n_outliers: 134
- outlier_rate: 0.008665
- zero_rate: 0
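One way to see which latitudes would count as outliers is to derive fences from the reported quartiles. The sketch below assumes the common 1.5 × IQR (Tukey) rule. The report does not say which rule produced the 134 flagged points, so treat the rule and the helper names as assumptions.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, low, high):
    """True when the value falls outside the closed interval [low, high]."""
    return value < low or value > high


# Quartiles as reported for lat (rounded values from this report).
lat_low, lat_high = tukey_fences(42.95, 50.52)
```

Under this assumption, anything south of roughly 31.6° or north of roughly 61.9° is flagged, which matches the sparse bins at both ends of the lat histogram.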
lon
numeric · feature · high_skew
This column holds longitude coordinates, with values spanning -151.36 to 144.74 across 15,464 rows and 15,407 unique values. The distribution is tightly concentrated around a median of -1.62 with an IQR of just 11.53, but a skew of 3.65 and kurtosis of 34.34 indicate heavy tails: 676 outliers (4.37%) extend east and west of the main cluster, the farthest reaching the Pacific. There are no nulls or zeros, so coverage is clean. Treatment: Pair with latitude as a geospatial feature; consider projecting or binning rather than using raw values in a linear model.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 15,407
- min: -151.4
- max: 144.7
- mean: 2.618
- median: -1.62
- std: 14.64
- q1: -3.083
- q3: 8.447
- iqr: 11.53
- skew: 3.654
- kurtosis: 34.34
- n_outliers: 676
- outlier_rate: 0.04371
- zero_rate: 0
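As one possible alternative to raw lat/lon values, each coordinate pair can be mapped to a point on the unit sphere. This is a substitute technique chosen for illustration, not necessarily the projection the report has in mind. It avoids the discontinuity at ±180° longitude, and the function name is hypothetical.

```python
import math


def to_unit_sphere(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# The reported medians, used here only as a sample input.
x, y, z = to_unit_sphere(47.59, -1.62)
```

Points that are close on the globe stay close in (x, y, z), including pairs on either side of the antimeridian, which raw longitude would place about 360 degrees apart.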
type
categorical · label · imbalance
Categorical type label for what appear to be megalithic monuments, with 19 distinct classes across 15,464 rows and no nulls. The distribution is severely imbalanced: 'megalith' alone covers 97.7% of records (15,113 rows), leaving rarer types like 'menhir' (156), 'dolmen' (83), and 'standing_stone' (59) as long-tail minorities. An entropy ratio of 0.049 confirms the column carries little discriminative signal in its raw form. Treatment: Collapse rare categories into 'other' or stratify/resample before using as a class label.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 19
- top_value: megalith
- top_rate: 0.9773
- cardinality: 19
- entropy: 0.2096
- entropy_ratio: 0.04933
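Collapsing rare categories into 'other' can be sketched as follows. The `min_count=50` threshold is an arbitrary illustration, not a recommendation from the report, and the input list is rebuilt from the six most frequent counts in the type table rather than read from the data.

```python
from collections import Counter


def collapse_rare(values, min_count, other="other"):
    """Replace categories seen fewer than min_count times with a single bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Six most frequent values of `type`, with counts as listed in this report.
type_counts = {
    "megalith": 15113,
    "menhir": 156,
    "dolmen": 83,
    "standing_stone": 59,
    "stone_circle": 16,
    "nuraghe": 8,
}
values = [v for v, n in type_counts.items() for _ in range(n)]
collapsed = Counter(collapse_rare(values, min_count=50))
```

Even after collapsing, 'megalith' still dominates, so this step reduces cardinality but does not address the imbalance itself.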
megalith_type
categorical · feature
Categorical classification of megalithic structures across 73 distinct values, dominated by 'menhir' (33.8%) and 'dolmen' (29.1%), with a long tail including nuraghe, stone_circle, and passage_grave. Notable concern: 1,714 rows (~11%) carry an empty-string value despite a reported null rate of 0.0%, which suggests blanks are being treated as a valid category rather than as missing. An entropy ratio of 0.44 indicates concentration in a few dominant types. Treatment: Recode empty strings to null, then one-hot encode the top categories and bucket the long tail as 'other'.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 73
- top_value: menhir
- top_rate: 0.3383
- cardinality: 73
- entropy: 2.749
- entropy_ratio: 0.4441
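The treatment for megalith_type (blank to null, one-hot for the top categories, 'other' for the tail) might be sketched like this in plain Python. The separate `missing` indicator column is one way to represent the recoded nulls and is an assumption, as are the function name and the sample values.

```python
from collections import Counter


def one_hot_top_k(values, k, other="other", missing="missing"):
    """One-hot encode the k most frequent non-blank values; bucket the rest."""
    cleaned = [v.strip() if v and v.strip() else None for v in values]
    top = [v for v, _ in Counter(v for v in cleaned if v is not None).most_common(k)]
    columns = top + [other, missing]
    rows = []
    for v in cleaned:
        key = missing if v is None else (v if v in top else other)
        rows.append({c: int(c == key) for c in columns})
    return columns, rows


# Made-up sample values; blanks and None both count as missing.
sample = ["menhir", "menhir", "menhir", "dolmen", "dolmen", "", "nuraghe", None]
columns, encoded = one_hot_top_k(sample, k=2)
```

Keeping blanks in their own column prevents the 1,714 empty rows from being silently merged into 'other'.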
description
categorical · free_text · long_tail · imbalance
Free-text description field for what appear to be megalithic/archaeological sites, with labels in multiple languages (Danish 'Jættestue', 'Langdysse'; Portuguese 'Anta da Herdade da Ordem'; German 'Großsteingrab'; English 'Stone circle', 'Long Barrow'). The column is effectively empty: 14,814 of 15,464 rows (top_rate 0.958) hold the empty string, leaving only ~650 populated rows spread across 586 distinct descriptions. An entropy ratio of 0.069 confirms the near-degenerate distribution, and the language mix means even the populated values won't cluster cleanly without normalization. Treatment: Drop or treat as a sparse free-text flag; not usable as a categorical feature given that 96% of rows are empty and the remainder is a multilingual long tail.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 587
- top_value: (empty string)
- top_rate: 0.958
- cardinality: 587
- entropy: 0.6328
- entropy_ratio: 0.0688
wikipedia
text · metadata · one_word · duplicates
This column holds Wikipedia article references in `lang:Title` form (e.g. `de:Großsteingräber im Haldensleber Forst`, `fr:dolmen…`), pointing to megalithic-monument pages across multiple language editions. It is overwhelmingly empty: 13,060 of 15,464 rows (84%) are blank and the duplicate rate is 0.87, leaving only 2,058 distinct values. Entries are mostly single tokens (one_word_rate 0.85, word_mean 1.35) and skew heavily German, with `de`-prefixed titles dominating the top values and words. Treatment: Treat as an optional cross-reference link; parse the `lang:title` prefix if needed but drop from modelling given 84% emptiness.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 2,058
- len_min: 0
- len_max: 75
- len_mean: 4.1
- len_median: 0
- len_p95: 29
- word_mean: 1.351
- word_median: 1
- n_empty: 13,060
- n_duplicates: 13,406
- duplicate_rate: 0.8669
- vocab_size: 2,769
- readability_flesch_mean: 5.48
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.8524
- allcaps_rate: 0
- boilerplate_rate: 0
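Parsing the `lang:Title` prefix mentioned in the treatment could be done along these lines. This is a sketch: the function name is hypothetical, and values without a recognisable prefix are returned with a language of None rather than guessed.

```python
def parse_wikipedia_ref(ref):
    """Split 'lang:Title' into (lang, title); None for blanks."""
    ref = (ref or "").strip()
    if not ref:
        return None
    # partition splits on the first colon only, so titles may contain colons.
    lang, sep, title = ref.partition(":")
    if not sep or not lang or not title:
        return (None, ref)
    return (lang, title.strip())


parsed = parse_wikipedia_ref("de:Großsteingräber im Haldensleber Forst")
```

Counting the first element of each parsed pair would give the per-language breakdown behind the observation that German titles dominate.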
wikidata
text · foreign_key · one_word · allcaps · short_text · duplicates
This column holds Wikidata Q-identifiers (e.g. Q106546933, Q1917052), one token per row with a maximum length of 10 characters. Coverage is poor: 10,819 of 15,464 rows are empty strings and only 4,289 unique values appear (4,288 by vocab_size, which suggests the blank is counted among the unique values), giving a 0.72 duplicate rate. The most frequent non-empty ID recurs only 17 times, so duplication is spread thinly rather than concentrated on a few entities. Treatment: Treat as an optional Wikidata key; left-join on non-empty values to enrich, and don't use as a feature directly.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 4,289
- len_min: 0
- len_max: 10
- len_mean: 2.667
- len_median: 0
- len_p95: 10
- word_mean: 1
- word_median: 1
- n_empty: 10,819
- n_duplicates: 11,175
- duplicate_rate: 0.7226
- vocab_size: 4,288
- readability_flesch_mean: 38.79
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.3004
- boilerplate_rate: 0
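A left join on non-empty Wikidata keys can be sketched without any external library. The `lookup` table and its `label` field are hypothetical placeholders for whatever enrichment source is used. The Q-identifier pattern reflects the usual `Q` plus digits form.

```python
import re

# Wikidata item identifiers: "Q" followed by a positive integer.
QID = re.compile(r"Q[1-9][0-9]*")


def enrich(rows, lookup):
    """Left join: keep every row, adding lookup fields where a valid QID matches."""
    out = []
    for row in rows:
        qid = (row.get("wikidata") or "").strip()
        extra = lookup.get(qid, {}) if QID.fullmatch(qid) else {}
        out.append({**row, **extra})
    return out


# Hypothetical enrichment table; the label text is a placeholder.
lookup = {"Q1917052": {"label": "example label"}}
rows = [{"id": 1, "wikidata": "Q1917052"}, {"id": 2, "wikidata": ""}]
joined = enrich(rows, lookup)
```

Rows with a blank or malformed key pass through unchanged, so the join never drops records.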
heritage
categorical · feature
Categorical heritage flag with 12 distinct values across 15,464 rows and no nulls, but 87.96% of records carry an empty string, leaving only ~1,862 rows with any signal. The non-empty values are a messy mix: numeric codes ('1', '2', '3', '4', '7'), yes/no, and free-text labels like 'Em Vias de Classificação' and 'Scheduled Monument', suggesting concatenated sources or inconsistent encoding schemes. An entropy ratio of 0.20 confirms the distribution is heavily concentrated in the blank class. Treatment: Normalise empty strings to null and harmonise the mixed numeric/yes-no/text codes before any encoding.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 12
- top_value: (empty string)
- top_rate: 0.8796
- cardinality: 12
- entropy: 0.7343
- entropy_ratio: 0.2048
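Before harmonising heritage, it helps to tag each value with the encoding scheme it appears to follow. The sketch below only separates the three schemes named above (numeric code, yes/no, free text). It makes no claim about what the numeric codes mean, and the scheme names are illustrative.

```python
def classify_heritage(raw):
    """Return (scheme, value): scheme is 'code', 'flag', 'label', or None for blanks."""
    text = (raw or "").strip()
    if not text:
        return (None, None)
    if text.isascii() and text.isdigit():
        return ("code", int(text))
    if text.casefold() in {"yes", "no"}:
        return ("flag", text.casefold() == "yes")
    return ("label", text)


classified = [classify_heritage(v) for v in ["2", "yes", "Scheduled Monument", ""]]
```

Grouping rows by the scheme element would show how many records each convention covers, which is a sensible input to any later mapping onto a single scale.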
heritage_operator
categorical · metadata
Categorical field naming the operator/agency responsible for a heritage record, with 31 distinct values across 15,464 rows and no nulls. It is overwhelmingly empty: the blank string accounts for 89.5% of rows (13,848), leaving only ~10% with an actual operator code such as 'mhs' (960), 'IE:smr' (229), or 'dgpc' (185). An entropy ratio of 0.14 confirms almost all of the mass sits in that empty bucket, and the value casing is inconsistent (lowercase codes alongside 'Historic Environment Scotland'). Treatment: Treat blanks as 'unknown' and normalise casing; only useful as a sparse categorical flag, not a primary feature.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 31
- top_value: (empty string)
- top_rate: 0.8955
- cardinality: 31
- entropy: 0.7028
- entropy_ratio: 0.1419
start_date
categorical · metadata · long_tail · imbalance
A nominally date-like field that is effectively empty: 15,430 of 15,464 rows (top_rate 0.9978) carry the blank string, leaving only 34 populated cells across 25 other distinct values. Those rare entries are wildly inconsistent in format, including ISO dates ('2004-07-01'), bare years ('1999'), BCE ranges ('between 3500 and 2800 BCE'), and codes ('C-30'), so even the non-empty content is not parseable as a uniform timestamp. An entropy ratio of 0.0069 confirms there is essentially no information here. Treatment: Drop; near-constant blank with unparseable mixed-format residue.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 26
- top_value: (empty string)
- top_rate: 0.9978
- cardinality: 26
- entropy: 0.03224
- entropy_ratio: 0.006859
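If the 34 populated start_date cells are inspected before dropping the column, a small format classifier makes the inconsistency explicit. The patterns below cover only the formats quoted above; anything else falls into 'other'. The pattern names are illustrative.

```python
import re

# Patterns for the formats quoted in this report, tried in order.
PATTERNS = [
    ("iso_date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("year", re.compile(r"\d{1,4}")),
    ("bce_range", re.compile(r"between \d+ and \d+ BCE", re.IGNORECASE)),
]


def classify_start_date(raw):
    """Name the format a start_date value follows, or 'empty' / 'other'."""
    text = (raw or "").strip()
    if not text:
        return "empty"
    for name, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "other"


formats = [
    classify_start_date(v)
    for v in ["2004-07-01", "1999", "between 3500 and 2800 BCE", "C-30", ""]
]
```

The classifier checks shape only and does not validate that an ISO-looking string is a real calendar date.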
material
categorical · metadata · long_tail · imbalance
This is a categorical 'material' attribute, almost certainly the OSM-style material tag for some physical feature, with 13 distinct values across 15,464 rows. It is overwhelmingly empty: 15,223 of 15,464 rows (top_rate 0.984) carry the blank string, leaving only ~241 actual material labels dominated by 'stone' (196) and 'granite' (29). An entropy ratio of 0.036 confirms almost no information content, and the long tail includes a German 'Quarzit' and a compound 'stone;concrete' value, hinting at inconsistent tagging conventions. Treatment: Drop or treat empty as null and collapse rare variants; too sparse to use as a feature without aggressive grouping.
- n: 15,464
- nulls: 0 (0.0%)
- unique: 13
- top_value: (empty string)
- top_rate: 0.9844
- cardinality: 13
- entropy: 0.1326
- entropy_ratio: 0.03582
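If material is kept rather than dropped, the compound and non-English values need splitting and translating first. The sketch below assumes semicolons separate multiple materials, as in 'stone;concrete'. The `TRANSLATIONS` table is a hypothetical starting point containing only the German term mentioned above.

```python
# Hypothetical translation table; "Quarzit" is German for quartzite.
TRANSLATIONS = {"quarzit": "quartzite"}


def normalise_material(raw):
    """Split on ';', casefold, translate known variants; None for blanks."""
    text = (raw or "").strip()
    if not text:
        return None
    parts = []
    for part in text.split(";"):
        token = part.strip().casefold()
        if not token:
            continue
        token = TRANSLATIONS.get(token, token)
        if token not in parts:
            parts.append(token)
    return parts or None


materials = [normalise_material(v) for v in ["stone;concrete", "Quarzit", "", "Stone"]]
```

Returning a list keeps multi-material records intact, and a caller can still take the first element when a single label is required.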