data-trove-lighthouses-worldwide

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/lighthouses.json

Saturn profiled 14,585 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess

# Run the Saturn CLI on the source file; check=True raises if it exits non-zero.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/lighthouses.json",
        "--findings", "data-trove-lighthouses-worldwide.json",
        "--llm", "anthropic:default",
    ],
    check=True,
)

Summary confidence: medium

This dataset contains 14,585 lighthouse and seamark records sourced from OpenStreetMap, covering navigational lights and related structures worldwide. The most striking feature is sparsity: five descriptive columns (country, operator, year_built, height, and heritage) have null rates of 90% or higher, so any analysis of those attributes is limited to a small minority of records. Two columns merit close inspection. seamark_type is 48.0% null; among classified records it is led by light_minor (3,496), light_major (3,051), and landmark (716). light_character is 71.3% null; among its non-null values, 'Fl' (flashing) accounts for 74.7% across 19 distinct patterns. All 14,585 records carry latitude and longitude. Latitude is left-skewed (mean 34.5°, median 40.8°, skew -1.46) with 1,295 flagged outliers, which indicates that records cluster in the Northern Hemisphere with a thinner Southern Hemisphere tail worth mapping.

citing: row_count · column_count · seamark_type.top_values · seamark_type.null_rate · light_character.top_rate · light_character.top_value · light_character.null_rate · lat.mean · lat.median · lat.skew · lat.n_outliers · country.null_rate · operator.null_rate · year_built.null_rate · height.null_rate · osm_type.top_values

Out[4]:

saturn.schema() · 13 columns

column	kind	n	null%	unique	alerts
name	text	14,585	0.0%	14,239	near_unique
lat	numeric	14,585	0.0%	14,572	outliers
lon	numeric	14,585	0.0%	14,565
country	categorical	14,585	99.6%	19	long_tail null_rate
height	categorical	14,585	90.2%	316	long_tail null_rate
year_built	categorical	14,585	93.2%	429	long_tail null_rate
operator	categorical	14,585	92.7%	283	long_tail null_rate
seamark_type	categorical	14,585	48.0%	25	null_rate
light_character	categorical	14,585	71.3%	19	null_rate
heritage	categorical	14,585	96.9%	7	null_rate
wikipedia	text	14,585	86.2%	1,965	near_unique null_rate
osm_id	numeric	14,585	0.0%	14,584
osm_type	categorical	14,585	0.0%	2

Fig 1.

seamark_type · light_minor and light_major dominate, together accounting for about 86% of classified (non-null) records, with landmark a distant third. The share column in the table is computed over all 14,585 rows, nulls included.

Top values for seamark_type (20 unique shown, of 25 total).
value	count	share
light_minor	3496	24.0%
light_major	3051	20.9%
landmark	716	4.9%
beacon_special_purpose	146	1.0%
beacon_lateral	102	0.7%
beacon_cardinal	19	0.1%
light	9	0.1%
beacon	5	0.0%
beacon_isolated_danger	5	0.0%
building	4	0.0%
daymark	4	0.0%
radio_station	3	0.0%
pile	3	0.0%
signal_station_traffic	3	0.0%
signal_station_warning	3	0.0%
buoy_lateral	3	0.0%
navigation_line	2	0.0%
fishing_facility	2	0.0%
light_vessel	1	0.0%
cone	1	0.0%

Fig 2.

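light_character · 'Fl' (flashing) overwhelms all other light patterns, representing nearly three-quarters (74.7%) of records that have a light character. The share column in the table is computed over all rows, nulls included, which is why it shows 21.4%.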

Top values for light_character (19 unique shown, of 19 total).
value	count	share
Fl	3126	21.4%
Iso	249	1.7%
F	245	1.7%
Q	182	1.2%
Oc	151	1.0%
LFl	148	1.0%
VQ	24	0.2%
Al.Fl	24	0.2%
Mo	14	0.1%
FFl	4	0.0%
Al	4	0.0%
IQ	3	0.0%
Al.LFl	2	0.0%
Al.Oc	2	0.0%
Q+LFl	2	0.0%
IVQ	1	0.0%
Fl(2)	1	0.0%
LFl W 10s	1	0.0%
Al.Iso	1	0.0%

Fig 3.

lat · Watch for the left skew and outlier cluster revealing that most structures sit in the Northern Hemisphere, with a tail of Southern Hemisphere locations.

Histogram bins for lat (median: 40.8121914).
bin	count
-63.4 – -59.77	3
-59.77 – -56.14	2
-56.14 – -52.51	38
-52.51 – -48.88	44
-48.88 – -45.25	29
-45.25 – -41.62	92
-41.62 – -37.99	108
-37.99 – -34.36	112
-34.36 – -30.73	147
-30.73 – -27.1	101
-27.1 – -23.47	93
-23.47 – -19.84	113
-19.84 – -16.21	58
-16.21 – -12.58	92
-12.58 – -8.947	58
-8.947 – -5.317	130
-5.317 – -1.687	129
-1.687 – 1.942	168
1.942 – 5.572	121
5.572 – 9.202	239
9.202 – 12.83	549
12.83 – 16.46	274
16.46 – 20.09	241
20.09 – 23.72	383
23.72 – 27.35	273
27.35 – 30.98	226
30.98 – 34.61	1062
34.61 – 38.24	1554
38.24 – 41.87	1300
41.87 – 45.5	1976
45.5 – 49.13	1257
49.13 – 52.76	625
52.76 – 56.39	900
56.39 – 60.02	1039
60.02 – 63.65	410
63.65 – 67.28	357
67.28 – 70.91	179
70.91 – 74.54	73
74.54 – 78.17	22
78.17 – 81.8	8

Fig 4.

osm_type · Check the roughly 78/22 split between node and way OSM types, reflecting how most lighthouses are mapped as single points rather than polygons.

Top values for osm_type (2 unique shown, of 2 total).
value	count	share
node	11358	77.9%
way	3227	22.1%

Fig 5.

height · Among the ~10% of records with height data, look for the concentration around 10–20, suggesting a typical structural height range for these seamarks. The unit is not recorded in the profile; OpenStreetMap's height tag defaults to metres.

Top values for height (20 unique shown, of 316 total).
value	count	share
15	59	0.4%
12	55	0.4%
14	54	0.4%
10	51	0.3%
8	45	0.3%
18	44	0.3%
20	43	0.3%
13	38	0.3%
6	37	0.3%
17	33	0.2%
25	31	0.2%
11	28	0.2%
30	28	0.2%
16	27	0.2%
23	26	0.2%
21	21	0.1%
19	20	0.1%
28	19	0.1%
9	19	0.1%
26	18	0.1%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
name	text	0.0%
lat	numeric	0.0%
lon	numeric	0.0%
country	categorical	99.6%
height	categorical	90.2%
year_built	categorical	93.2%
operator	categorical	92.7%
seamark_type	categorical	48.0%
light_character	categorical	71.3%
heritage	categorical	96.9%
wikipedia	text	86.2%
osm_id	numeric	0.0%
osm_type	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	lat	lon	osm_id
lat	+1.00	-0.12	-0.04
lon	-0.12	+1.00	+0.18
osm_id	-0.04	+0.18	+1.00

name text label

This column contains the proper names of lighthouses, drawn from a multilingual global dataset: 'lighthouse', 'faro' (Spanish/Italian), 'fyr' (Scandinavian), 'phare' (French), 'farol' (Portuguese), and 'маяк' (Russian) all appear in the top words, confirming broad geographic coverage. With 14,239 unique values across 14,585 rows, the near-unique alert is expected for a name field; the 346 duplicates (2.37%) likely reflect shared names for distinct structures (e.g., 'North Light'). Average name length is ~19 characters (median 21) with a median of 2 words, consistent with short proper-noun phrases. The length histogram is far from smooth, however: 7,858 names (about 54%) fall in the 20–22 character bin, which may point to a templated or auto-generated naming pattern and is worth checking before the names are used as labels. The multilingual vocabulary mix also warrants attention if name-matching or NLP is planned.

Treatment: Use as a display label or entity identifier; normalize Unicode and language variants before any string-matching or embedding.

anthropic:default · confidence high

Out[13]:

saturn.columns["name"].stats

stat	value
n	14,585
nulls	0 (0.0%)
unique	14,239
len_min	2
len_max	91
len_mean	18.95
len_median	21
len_p95	27
word_mean	2.327
word_median	2
n_empty	0
n_duplicates	346
duplicate_rate	0.02372
vocab_size	14,670
readability_flesch_mean	75.78
emoji_rate	0
url_rate	0
one_word_rate	0.1068
allcaps_rate	0.05794
boilerplate_rate	0
alert: near_unique	97.6% of rows are unique strings
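
The treatment for name calls for normalising Unicode before string matching. A minimal sketch of one way to build a comparison key, using only the standard library; the function name `normalize_name` and the sample strings are illustrative and are not taken from the dataset:

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Build a comparison key: NFKC-normalise, case-fold, collapse whitespace."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.split())


# Illustrative inputs (not from the dataset): full-width Latin letters,
# doubled spaces, and upper-case Cyrillic all reduce to a plain key.
print(normalize_name("Ｆａｒｏ  de Prueba"))
print(normalize_name("МАЯК"))
```

The key is meant for matching only; the original string should stay in place as the display label.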

Fig 8.

Character-length distribution for name.

Character-length distribution for name (mean: 18.947685978745287).
chars	count
2 – 4	301
4 – 6	525
6 – 9	304
9 – 11	528
11 – 13	827
13 – 15	705
15 – 18	859
18 – 20	906
20 – 22	7858
22 – 24	560
24 – 26	414
26 – 29	287
29 – 31	151
31 – 33	155
33 – 35	52
35 – 38	42
38 – 40	42
40 – 42	37
42 – 44	5
44 – 46	8
46 – 49	4
49 – 51	3
51 – 53	2
53 – 55	2
55 – 58	2
58 – 60	1
60 – 62	1
62 – 64	1
64 – 67	0
67 – 69	0
69 – 71	0
71 – 73	2
73 – 75	0
75 – 78	0
78 – 80	0
80 – 82	0
82 – 84	0
84 – 87	0
87 – 89	0
89 – 91	1

lat numeric feature

This column represents geographic latitude, with values ranging from -63.4° (near Antarctica) to 81.8° (high Arctic), consistent with global location data across 14,585 records. The distribution is notably left-skewed (skew = -1.46), with a mean of 34.5° but a median of 40.8°: records concentrate in mid-to-high Northern Hemisphere latitudes, and a Southern Hemisphere tail pulls the mean down. Nearly 9% of values (1,295) are flagged as outliers. Because the interquartile range sits entirely in the Northern Hemisphere (q1 = 28.11°, q3 = 48.98°), these are mostly legitimate Southern Hemisphere records falling below the lower 1.5 × IQR fence, plus at most a handful of high-Arctic points, rather than coordinate errors. Uniqueness is near-perfect (14,572 of 14,585 values distinct), implying precise coordinate capture with minimal rounding.

Treatment: Use as-is or pair with longitude for spatial modelling; consider binning into latitude bands or projecting to Cartesian coordinates for distance-based algorithms.

anthropic:default · confidence high

Out[16]:

saturn.columns["lat"].stats

stat	value
n	14,585
nulls	0 (0.0%)
unique	14,572
min	-63.4
max	81.8
mean	34.52
median	40.81
std	24.64
q1	28.11
q3	48.98
iqr	20.87
skew	-1.458
kurtosis	2.028
n_outliers	1,295
outlier_rate	0.08879
zero_rate	0
alert: outliers	8.9% rows beyond 1.5 IQR
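
The outlier alert refers to rows "beyond 1.5 IQR". As a worked check, the fences can be recomputed from the quartiles in the stats table above. This sketch assumes the standard Tukey rule and uses the rounded quartiles as printed, so the result may differ slightly from Saturn's internal values:

```python
# Quartiles copied from the lat stats table (rounded as printed there).
q1, q3 = 28.11, 48.98

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # values below this are flagged
upper_fence = q3 + 1.5 * iqr  # values above this are flagged

print(f"IQR = {iqr:.2f}")
print(f"lower fence = {lower_fence:.1f}, upper fence = {upper_fence:.1f}")
```

The fences come out at about -3.2° and 80.3°, so essentially every record south of roughly 3°S counts as an outlier under this rule.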

Fig 9.

Distribution of lat. Vertical dash marks the median.

Histogram bins for lat (median: 40.8121914).
bin	count
-63.4 – -59.77	3
-59.77 – -56.14	2
-56.14 – -52.51	38
-52.51 – -48.88	44
-48.88 – -45.25	29
-45.25 – -41.62	92
-41.62 – -37.99	108
-37.99 – -34.36	112
-34.36 – -30.73	147
-30.73 – -27.1	101
-27.1 – -23.47	93
-23.47 – -19.84	113
-19.84 – -16.21	58
-16.21 – -12.58	92
-12.58 – -8.947	58
-8.947 – -5.317	130
-5.317 – -1.687	129
-1.687 – 1.942	168
1.942 – 5.572	121
5.572 – 9.202	239
9.202 – 12.83	549
12.83 – 16.46	274
16.46 – 20.09	241
20.09 – 23.72	383
23.72 – 27.35	273
27.35 – 30.98	226
30.98 – 34.61	1062
34.61 – 38.24	1554
38.24 – 41.87	1300
41.87 – 45.5	1976
45.5 – 49.13	1257
49.13 – 52.76	625
52.76 – 56.39	900
56.39 – 60.02	1039
60.02 – 63.65	410
63.65 – 67.28	357
67.28 – 70.91	179
70.91 – 74.54	73
74.54 – 78.17	22
78.17 – 81.8	8

lon numeric feature

This column contains geographic longitude values, spanning nearly the full valid range from -179.2 to 179.3 degrees with a mean near 23° and a median near 15°, which suggests a slight concentration of observations at European and African longitudes relative to the Americas and East Asia. The near-zero skew (≈ 0.05) and negative kurtosis (≈ -0.95) reflect a broad spread with several regional clusters rather than a single central peak: the histogram shows concentrations at roughly -90° to -54° (the Americas), -9° to 36° (Europe and Africa), and 117° to 144° (East Asia). Near-uniqueness (14,565 distinct values out of 14,585 rows) and zero nulls confirm these are precise geospatial measurements, not coarse bins. The IQR of ~153 degrees reinforces global coverage.

Treatment: Pair with latitude for spatial modelling; consider map-projection or trigonometric encoding (sin/cos) to handle the -180/180 wraparound boundary.

anthropic:default · confidence high

Out[19]:

saturn.columns["lon"].stats

stat	value
n	14,585
nulls	0 (0.0%)
unique	14,565
min	-179.2
max	179.3
mean	23.04
median	14.97
std	79.4
q1	-40.5
q3	112.4
iqr	152.9
skew	0.0526
kurtosis	-0.9537
n_outliers	0
outlier_rate	0
zero_rate	0
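
The treatment for lon mentions sin/cos encoding to handle the -180/180 wraparound. A minimal sketch of that encoding, with the hypothetical helper names `encode_longitude` and `feature_distance`; it shows that the dataset's extreme longitudes (-179.2 and 179.3, only 1.5° apart on the globe) end up close together in the encoded space:

```python
import math
from typing import Tuple


def encode_longitude(lon_deg: float) -> Tuple[float, float]:
    """Map a longitude in degrees to a point on the unit circle."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)


def feature_distance(lon_a: float, lon_b: float) -> float:
    """Euclidean distance between two encoded longitudes."""
    return math.dist(encode_longitude(lon_a), encode_longitude(lon_b))


# Raw difference is 358.5 degrees, but the encoded points are near neighbours.
print(feature_distance(-179.2, 179.3) < feature_distance(0.0, 10.0))
```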

Fig 10.

Distribution of lon. Vertical dash marks the median.

Histogram bins for lon (median: 14.9745494).
bin	count
-179.2 – -170.2	39
-170.2 – -161.3	3
-161.3 – -152.3	20
-152.3 – -143.3	6
-143.3 – -134.4	8
-134.4 – -125.4	179
-125.4 – -116.5	235
-116.5 – -107.5	78
-107.5 – -98.53	38
-98.53 – -89.56	132
-89.56 – -80.6	544
-80.6 – -71.64	805
-71.64 – -62.67	835
-62.67 – -53.71	379
-53.71 – -44.75	272
-44.75 – -35.78	178
-35.78 – -26.82	73
-26.82 – -17.86	128
-17.86 – -8.892	277
-8.892 – 0.07137	1272
0.07137 – 9.035	770
9.035 – 18	1383
18 – 26.96	1441
26.96 – 35.93	659
35.93 – 44.89	258
44.89 – 53.85	141
53.85 – 62.82	77
62.82 – 71.78	38
71.78 – 80.74	131
80.74 – 89.71	73
89.71 – 98.67	57
98.67 – 107.6	245
107.6 – 116.6	388
116.6 – 125.6	955
125.6 – 134.5	1230
134.5 – 143.5	818
143.5 – 152.4	199
152.4 – 161.4	66
161.4 – 170.4	49
170.4 – 179.3	106

country categorical

Out[22]:

saturn.columns["country"].stats

stat	value
n	14,585
nulls	14,524 (99.6%)
unique	19
top_value	LV
top_rate	0.2951
cardinality	19
entropy	3.442
entropy_ratio	0.8102
alert: long_tail	11 singleton categories
alert: null_rate	99.6% null

Fig 11.

Top values for country.

Top values for country (19 unique shown, of 19 total).
value	count	share
LV	18	0.1%
DE	8	0.1%
EE	7	0.0%
HT	6	0.0%
US	5	0.0%
GB	2	0.0%
PT	2	0.0%
IM	2	0.0%
JP	1	0.0%
KY	1	0.0%
PH	1	0.0%
IN	1	0.0%
FI	1	0.0%
PL	1	0.0%
IT	1	0.0%
RU	1	0.0%
UZ	1	0.0%
MX	1	0.0%
TW	1	0.0%

height categorical feature

This column holds a numeric height measurement stored as a string. The top 20 values are all small integers between 6 and 30; the unit is not recorded in the profile, but OpenStreetMap's height tag defaults to metres, so metres are the likely unit. The most alarming signal is the 90.18% null rate: only 1,432 of 14,585 rows carry a value, so this is a severely sparse field. Among non-null values there are 316 distinct levels with modest concentration (the top value '15' accounts for only 4.12% of non-null rows) and a high entropy ratio of 0.823, indicating that values are broadly spread rather than clustered.

Treatment: Cast to numeric after handling nulls; impute or flag missing values given the 90.18% null rate before modelling.

anthropic:default · confidence medium

Out[25]:

saturn.columns["height"].stats

stat	value
n	14,585
nulls	13,153 (90.2%)
unique	316
top_value	15
top_rate	0.0412
cardinality	316
entropy	6.837
entropy_ratio	0.8234
alert: long_tail	197 singleton categories
alert: null_rate	90.2% null
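
The treatment for height is to cast the strings to numbers. A hedged sketch of a conservative parser follows. The helper name `parse_height_m` is hypothetical, and the accepted formats are an assumption: the profile shows only plain integers among the top values, and the 197 singleton categories may contain other notations, which this sketch deliberately leaves as missing rather than guessing:

```python
import re
from typing import Optional

# Accept a plain number, optionally followed by a metre unit (assumed format).
_HEIGHT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:m|metres|meters)?\s*$", re.IGNORECASE)


def parse_height_m(raw: Optional[str]) -> Optional[float]:
    """Return the height in metres, or None for nulls and unrecognised formats."""
    if raw is None:
        return None
    match = _HEIGHT_RE.match(raw)
    return float(match.group(1)) if match else None


print(parse_height_m("15"), parse_height_m("12.5 m"), parse_height_m("50 ft"))
```

Returning None for anything unrecognised keeps unit conversions out of the parser; values in other units can be reviewed and handled separately.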

Fig 12.

Top values for height.

Top values for height (20 unique shown, of 316 total).
value	count	share
15	59	0.4%
12	55	0.4%
14	54	0.4%
10	51	0.3%
8	45	0.3%
18	44	0.3%
20	43	0.3%
13	38	0.3%
6	37	0.3%
17	33	0.2%
25	31	0.2%
11	28	0.2%
30	28	0.2%
16	27	0.2%
23	26	0.2%
21	21	0.1%
19	20	0.1%
28	19	0.1%
9	19	0.1%
26	18	0.1%

year_built categorical feature

This column represents the year a structure was built, but it has been parsed as categorical rather than numeric, likely because it contains century-level codes such as 'C19' and 'C20' mixed with specific years like '1875' and '1906'. The null rate is extremely high at 93.2%: only 992 of 14,585 rows carry a value. Among populated values, 429 distinct entries exist, and the top value 'C19' appears only 31 times (3.1% of non-null rows), indicating a very long tail. The mixed format, with century codes alongside specific years, is a data quality issue that requires harmonisation before any temporal analysis.

Treatment: Impute or exclude 93.2% nulls, normalise century codes ('C19' → 1800–1899) to numeric ranges, then convert to integer or ordinal decade bins before modelling.

anthropic:default · confidence high

Out[28]:

saturn.columns["year_built"].stats

stat	value
n	14,585
nulls	13,593 (93.2%)
unique	429
top_value	C19
top_rate	0.03125
cardinality	429
entropy	8.142
entropy_ratio	0.9311
alert: long_tail	271 singleton categories
alert: null_rate	93.2% null
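
The treatment for year_built maps century codes to year ranges ('C19' → 1800–1899). A minimal sketch of that harmonisation, with the hypothetical helper `parse_year_built`. It handles only the two formats visible in the top values (four-digit years and 'C' plus a century number); any other notation among the 429 distinct values is returned as None:

```python
import re
from typing import Optional, Tuple


def parse_year_built(raw: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return an inclusive (start_year, end_year) range, or None if unparseable."""
    if raw is None:
        return None
    text = raw.strip()
    if re.fullmatch(r"\d{4}", text):
        year = int(text)
        return year, year
    century = re.fullmatch(r"C(\d{1,2})", text)
    if century:
        # Follows the report's convention: 'C19' covers 1800 through 1899.
        start = (int(century.group(1)) - 1) * 100
        return start, start + 99
    return None


print(parse_year_built("1875"), parse_year_built("C19"))
```

Keeping a range instead of a single integer avoids inventing a precise year for century-level entries; a decade bin can still be derived from the range midpoint if needed.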

Fig 13.

Top values for year_built.

Top values for year_built (20 unique shown, of 429 total).
value	count	share
C19	31	0.2%
1875	16	0.1%
1872	15	0.1%
1881	12	0.1%
C20	12	0.1%
1906	11	0.1%
1871	11	0.1%
1877	11	0.1%
1882	11	0.1%
1874	11	0.1%
1873	11	0.1%
1939	10	0.1%
1898	10	0.1%
1897	9	0.1%
1870	9	0.1%
1909	8	0.1%
1884	8	0.1%
1890	8	0.1%
1911	7	0.0%
1950	7	0.0%

operator categorical label

This column records the authority or agency responsible for a navigational aid. It is dominated by coast guard entities (the U.S. Coast Guard, stations of the Philippine Coast Guard, and the Japan Coast Guard, recorded as 海上保安庁) alongside bodies such as Plovput and INEA. The 92.7% null rate is the critical anomaly: only 1,065 of 14,585 rows carry a value, so operator attribution is nearly absent across the dataset. Among populated rows, an entropy ratio of 0.861 across 283 unique values signals a long tail of rarely seen operators, and the top value 'U.S. Coast Guard' covers only 7.7% of non-null rows. Two normalisation issues stand out: Philippine Coast Guard entries are split by station (for example 'Cebu Station, Philippine Coast Guard'), and Japanese-script values such as 海上保安庁 sit alongside Latin-script names, so values need mapping to a canonical form before any grouping or analysis.

Treatment: Impute or flag nulls explicitly, normalise multilingual variants to a canonical form, then use as a categorical grouping variable with an 'Unknown' level for the 92.7% missing.

anthropic:default · confidence high

Out[31]:

saturn.columns["operator"].stats

stat	value
n	14,585
nulls	13,520 (92.7%)
unique	283
top_value	U.S. Coast Guard
top_rate	0.077
cardinality	283
entropy	7.013
entropy_ratio	0.8611
alert: long_tail	167 singleton categories
alert: null_rate	92.7% null
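
The treatment for operator asks for a canonical form plus an explicit 'Unknown' level. One possible sketch is below. The helper `canonical_operator`, the alias table, and the rule that collapses "<Place> Station, <Agency>" to the parent agency are all assumptions made for illustration; whether stations should be merged depends on the analysis:

```python
import unicodedata
from typing import Optional

# Hypothetical alias table; extend it after reviewing all 283 distinct values.
ALIASES = {"海上保安庁": "Japan Coast Guard"}


def canonical_operator(raw: Optional[str]) -> str:
    """Map a raw operator string to a canonical agency name."""
    if raw is None:
        return "Unknown"
    text = unicodedata.normalize("NFKC", raw).strip()
    # Collapse "<Place> Station, <Agency>" to the parent agency (assumed rule).
    _station, sep, agency = text.partition(" Station, ")
    if sep:
        text = agency
    return ALIASES.get(text, text)


print(canonical_operator("Cebu Station, Philippine Coast Guard"))
```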

Fig 14.

Top values for operator.

Top values for operator (20 unique shown, of 283 total).
value	count	share
U.S. Coast Guard	82	0.6%
Plovput	49	0.3%
INEA	42	0.3%
Tagbilaran Station, Philippine Coast Guard	27	0.2%
Cebu Station, Philippine Coast Guard	26	0.2%
Catbalogan Station, Philippine Coast Guard	18	0.1%
Surigao Station, Philippine Coast Guard	18	0.1%
Masbate Station, Philippine Coast Guard	17	0.1%
Maasin Station, Philippine Coast Guard	17	0.1%
海上保安庁	16	0.1%
Directorate General of Lighthouses and Lightships	15	0.1%
Romblon Station, Philippine Coast Guard	15	0.1%
Iloilo Station, Philippine Coast Guard	14	0.1%
Puerto Princesa Station, Philippine Coast Guard	14	0.1%
Bacolod Station, Philippine Coast Guard	13	0.1%
Sorsogon Station, Philippine Coast Guard	13	0.1%
Cagayan de Oro Station, Philippine Coast Guard	13	0.1%
Marine Department of Sabah	13	0.1%
Dumaguete Station, Philippine Coast Guard	12	0.1%
Appari Station, Philippine Coast Guard	12	0.1%

seamark_type categorical label

This column captures the classification of maritime seamarks (navigational aids such as lights, beacons, and landmarks), drawn from OpenStreetMap's seamark tagging. Nearly half the rows (48.01%) are null, which raises an alert and likely reflects features that carry no seamark tag at all. Among the 7,583 populated rows, the distribution is heavily skewed toward light types: 'light_minor' alone accounts for 46.1% of non-null values, 'light_major' adds another 40.2%, and the remaining 23 categories (beacons, landmarks, buildings, etc.) form a long tail. The low entropy ratio of 0.357 confirms that values are concentrated in a few of the 25 categories.

Treatment: Impute nulls only if missingness is structurally meaningful (i.e., non-seamark features); otherwise keep as-is, one-hot or ordinal encode the 25 categories, and consider grouping rare categories (≤5 occurrences) into an 'other' bucket before modelling.

anthropic:default · confidence high

Out[34]:

saturn.columns["seamark_type"].stats

stat	value
n	14,585
nulls	7,002 (48.0%)
unique	25
top_value	light_minor
top_rate	0.461
cardinality	25
entropy	1.657
entropy_ratio	0.3569
alert: null_rate	48.0% null
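
The treatment for seamark_type suggests grouping rare categories (≤5 occurrences) into an 'other' bucket. A minimal standard-library sketch of that step, with the hypothetical helper `group_rare`; nulls are passed through untouched so that missingness can be handled separately:

```python
from collections import Counter
from typing import List, Optional


def group_rare(values: List[Optional[str]], min_count: int = 6,
               other: str = "other") -> List[Optional[str]]:
    """Replace categories seen fewer than min_count times with `other`."""
    counts = Counter(v for v in values if v is not None)
    return [v if v is None or counts[v] >= min_count else other for v in values]


# Small illustrative sample using counts from the table: 'light' (9) is kept,
# while 'beacon' (5) and 'cone' (1) fall into the bucket.
sample = ["light"] * 9 + ["beacon"] * 5 + ["cone"] + [None]
grouped = group_rare(sample)
print(Counter(grouped))
```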

Fig 15.

Top values for seamark_type.

Top values for seamark_type (20 unique shown, of 25 total).
value	count	share
light_minor	3496	24.0%
light_major	3051	20.9%
landmark	716	4.9%
beacon_special_purpose	146	1.0%
beacon_lateral	102	0.7%
beacon_cardinal	19	0.1%
light	9	0.1%
beacon	5	0.0%
beacon_isolated_danger	5	0.0%
building	4	0.0%
daymark	4	0.0%
radio_station	3	0.0%
pile	3	0.0%
signal_station_traffic	3	0.0%
signal_station_warning	3	0.0%
buoy_lateral	3	0.0%
navigation_line	2	0.0%
fishing_facility	2	0.0%
light_vessel	1	0.0%
cone	1	0.0%

light_character categorical feature

This column encodes the light character (flashing pattern) of maritime navigational aids; values like 'Fl' (flashing), 'Iso' (isophase), 'Oc' (occulting), and 'LFl' (long flashing) are standard IALA light notation. The dominant concern is a 71.31% null rate: only 4,184 of 14,585 records carry a light character. Some nulls will be unlighted or non-light structures, but not all: seamark_type lists 6,547 light_minor and light_major records, more than the 4,184 populated rows here, so at least 2,363 lights simply have no character recorded. Among populated rows, 'Fl' accounts for 74.7% of non-null values, making the distribution heavily skewed across 19 categories. A few values embed extra detail ('Fl(2)', 'LFl W 10s') and should be reduced to the base pattern before counting.

Treatment: Fill nulls with an explicit 'Unknown' category (missing does not reliably mean unlighted), reduce composite values to their base pattern, then one-hot or ordinal encode the light character types before modelling.

anthropic:default · confidence high

Out[37]:

saturn.columns["light_character"].stats

stat	value
n	14,585
nulls	10,401 (71.3%)
unique	19
top_value	Fl
top_rate	0.7471
cardinality	19
entropy	1.503
entropy_ratio	0.3539
alert: null_rate	71.3% null
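
The stats block reports top_rate as 0.7471 while the top-values table lists 'Fl' at 21.4%. The two figures use different denominators, as this short worked check shows (all inputs are copied from the report):

```python
n_rows = 14_585    # total rows
n_null = 10_401    # null light_character values
fl_count = 3_126   # rows with light_character == 'Fl'

n_non_null = n_rows - n_null
share_of_all_rows = fl_count / n_rows      # denominator used by the table
share_of_non_null = fl_count / n_non_null  # denominator used by top_rate

print(n_non_null, f"{share_of_all_rows:.1%}", f"{share_of_non_null:.1%}")
# prints: 4184 21.4% 74.7%
```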

Fig 16.

Top values for light_character.

Top values for light_character (19 unique shown, of 19 total).
value	count	share
Fl	3126	21.4%
Iso	249	1.7%
F	245	1.7%
Q	182	1.2%
Oc	151	1.0%
LFl	148	1.0%
VQ	24	0.2%
Al.Fl	24	0.2%
Mo	14	0.1%
FFl	4	0.0%
Al	4	0.0%
IQ	3	0.0%
Al.LFl	2	0.0%
Al.Oc	2	0.0%
Q+LFl	2	0.0%
IVQ	1	0.0%
Fl(2)	1	0.0%
LFl W 10s	1	0.0%
Al.Iso	1	0.0%

heritage categorical

Out[40]:

saturn.columns["heritage"].stats

stat	value
n	14,585
nulls	14,132 (96.9%)
unique	7
top_value	2
top_rate	0.7572
cardinality	7
entropy	1.244
entropy_ratio	0.4431
alert: null_rate	96.9% null

Fig 17.

Top values for heritage.

Top values for heritage (7 unique shown, of 7 total).
value	count	share
2	343	2.4%
3	52	0.4%
4	26	0.2%
yes	25	0.2%
1	4	0.0%
regional	2	0.0%
no	1	0.0%

wikipedia text metadata

This column contains Wikipedia article titles for lighthouse-related entries. Top words include 'lighthouse', 'light', 'fr:phare', 'es:faro', 'de:leuchtturm', and 'pt:farol', which indicates language-prefixed, multilingual cross-references. The column is extremely sparse, with an 86.23% null rate across 14,585 rows: only 2,008 rows carry a value. Despite the near-unique flag, there are 43 duplicate values alongside 1,965 unique entries, which is low but worth noting if the column is intended as a one-to-one reference.

Treatment: Use as an optional external reference link; exclude nulls before joining, and investigate the 43 duplicates for data quality issues.

anthropic:default · confidence high

Out[43]:

saturn.columns["wikipedia"].stats

stat	value
n	14,585
nulls	12,577 (86.2%)
unique	1,965
len_min	6
len_max	55
len_mean	22.3
len_median	22
len_p95	33
word_mean	2.996
word_median	3
n_empty	0
n_duplicates	43
duplicate_rate	0.02141
vocab_size	2,370
readability_flesch_mean	48.92
emoji_rate	0
url_rate	0
one_word_rate	0.0762
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	97.9% of rows are unique strings
alert: null_rate	86.2% null
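
Because the values mix a language prefix with an article title (as in 'fr:phare'), splitting the two is a natural first step before joining. A heuristic sketch, with the hypothetical helper `split_wikipedia_tag`; the prefix pattern is an assumption that covers two- or three-letter codes with optional hyphenated suffixes, and the sample titles are illustrative rather than taken from the dataset:

```python
import re
from typing import Optional, Tuple

# Assumed prefix shape: 'fr', 'de', 'zh-yue', ... followed by a colon.
_TAG_RE = re.compile(r"^([a-z]{2,3}(?:-[a-z]+)*):(.+)$")


def split_wikipedia_tag(raw: Optional[str]) -> Optional[Tuple[Optional[str], str]]:
    """Split 'lang:Title' into (lang, title); lang is None when no prefix is found."""
    if not raw:
        return None
    text = raw.strip()
    match = _TAG_RE.match(text)
    if match:
        return match.group(1), match.group(2).strip()
    return None, text


print(split_wikipedia_tag("fr:Phare de Cordouan"))
print(split_wikipedia_tag("Eddystone Lighthouse"))
```

Leaving the language as None when no prefix is present avoids silently assuming a default edition.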

Fig 18.

Character-length distribution for wikipedia.

Character-length distribution for wikipedia (mean: 22.296314741035857).
chars	count
6 – 7	29
7 – 8	74
8 – 10	20
10 – 11	9
11 – 12	17
12 – 13	22
13 – 15	24
15 – 16	45
16 – 17	107
17 – 18	93
18 – 19	132
19 – 21	139
21 – 22	178
22 – 23	307
23 – 24	147
24 – 26	129
26 – 27	103
27 – 28	168
28 – 29	56
29 – 30	38
30 – 32	38
32 – 33	22
33 – 34	38
34 – 35	11
35 – 37	13
37 – 38	5
38 – 39	20
39 – 40	5
40 – 42	2
42 – 43	2
43 – 44	2
44 – 45	9
45 – 46	0
46 – 48	0
48 – 49	0
49 – 50	0
50 – 51	1
51 – 53	0
53 – 54	1
54 – 55	2

osm_id numeric foreign_key

This column contains OpenStreetMap object identifiers, numeric IDs assigned sequentially as features are added to the OSM database. All but one of the 14,585 rows have a unique ID (14,584 unique values), and the null rate is zero. OSM numbers nodes and ways in separate ID spaces, so a repeated number is not necessarily a duplicated record; check the pair (osm_type, osm_id) before deduplicating. The wide IQR of ~5.4 billion and right skew (1.07) reflect OSM's ID growth over time: older features have lower IDs, while newer additions reach beyond 10 billion (max 1.353e+10).

Treatment: Use together with osm_type as a join key to OSM data sources; do not use as a numeric feature in modelling.

anthropic:default · confidence high

Out[46]:

saturn.columns["osm_id"].stats

stat	value
n	14,585
nulls	0 (0.0%)
unique	14,584
min	1.339e+07
max	1.353e+10
mean	3.723e+09
median	1.574e+09
std	3.828e+09
q1	1.001e+09
q3	6.402e+09
iqr	5.401e+09
skew	1.073
kurtosis	-0.1991
n_outliers	0
outlier_rate	0
zero_rate	0
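
Since OSM IDs are only unique within an element type, a composite key is the safer join key. A minimal sketch with the hypothetical helpers `osm_key` and `duplicate_keys`; the records are made up for illustration, and the profile does not say whether the dataset's single repeated ID is a node/way pair:

```python
from collections import Counter
from typing import Dict, Iterable, List


def osm_key(osm_type: str, osm_id: int) -> str:
    """Build a composite key such as 'node/42'."""
    return f"{osm_type}/{osm_id}"


def duplicate_keys(records: Iterable[Dict]) -> List[str]:
    """Return composite keys that occur more than once."""
    counts = Counter(osm_key(r["osm_type"], r["osm_id"]) for r in records)
    return [key for key, n in counts.items() if n > 1]


# Made-up records: the same number on a node and a way is not a duplicate.
records = [
    {"osm_type": "node", "osm_id": 42},
    {"osm_type": "way", "osm_id": 42},
]
print(duplicate_keys(records))
```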

Fig 19.

Distribution of osm_id. Vertical dash marks the median.

Histogram bins for osm_id (median: 1574285300.0).
bin	count
1.339e+07 – 3.512e+08	1272
3.512e+08 – 6.89e+08	1457
6.89e+08 – 1.027e+09	998
1.027e+09 – 1.365e+09	2358
1.365e+09 – 1.702e+09	1852
1.702e+09 – 2.04e+09	348
2.04e+09 – 2.378e+09	245
2.378e+09 – 2.716e+09	266
2.716e+09 – 3.054e+09	212
3.054e+09 – 3.391e+09	223
3.391e+09 – 3.729e+09	234
3.729e+09 – 4.067e+09	178
4.067e+09 – 4.405e+09	187
4.405e+09 – 4.743e+09	216
4.743e+09 – 5.08e+09	220
5.08e+09 – 5.418e+09	241
5.418e+09 – 5.756e+09	180
5.756e+09 – 6.094e+09	113
6.094e+09 – 6.432e+09	143
6.432e+09 – 6.769e+09	248
6.769e+09 – 7.107e+09	351
7.107e+09 – 7.445e+09	94
7.445e+09 – 7.783e+09	185
7.783e+09 – 8.121e+09	147
8.121e+09 – 8.458e+09	150
8.458e+09 – 8.796e+09	131
8.796e+09 – 9.134e+09	163
9.134e+09 – 9.472e+09	210
9.472e+09 – 9.81e+09	199
9.81e+09 – 1.015e+10	422
1.015e+10 – 1.049e+10	102
1.049e+10 – 1.082e+10	91
1.082e+10 – 1.116e+10	107
1.116e+10 – 1.15e+10	97
1.15e+10 – 1.184e+10	198
1.184e+10 – 1.217e+10	162
1.217e+10 – 1.251e+10	158
1.251e+10 – 1.285e+10	161
1.285e+10 – 1.319e+10	132
1.319e+10 – 1.353e+10	134

osm_type categorical feature

This column captures the OpenStreetMap geometry type, distinguishing between point features ('node') and linear/polygon features ('way'). With only 2 distinct values across 14,585 rows and zero nulls, it is a clean binary indicator. The distribution is notably skewed: 'node' dominates at 77.9% (11,358 records) versus 'way' at 22.1% (3,227 records), reflecting the typical OSM pattern where point features outnumber area/line features. Entropy ratio of 0.76 confirms moderate imbalance but not extreme dominance.

Treatment: One-hot encode or binary-encode (node=1, way=0) before modelling; consider as a stratification variable given the 78/22 class split.

anthropic:default · confidence high

Out[49]:

saturn.columns["osm_type"].stats

stat	value
n	14,585
nulls	0 (0.0%)
unique	2
top_value	node
top_rate	0.7787
cardinality	2
entropy	0.7625
entropy_ratio	0.7625
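
The entropy figures above can be reproduced from the two counts. This sketch assumes Shannon entropy in bits and an entropy ratio defined as entropy divided by log2 of the cardinality; that definition is inferred from the reported numbers, not taken from Saturn's documentation:

```python
import math
from typing import Sequence


def entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a categorical distribution given its counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


counts = [11_358, 3_227]                 # node, way
h = entropy_bits(counts)
ratio = h / math.log2(len(counts))       # maximum entropy for 2 classes is 1 bit

print(round(h, 2), round(ratio, 2))
```

With two categories the maximum entropy is exactly 1 bit, which is why entropy and entropy_ratio coincide for this column.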

Fig 20.

Top values for osm_type.

Top values for osm_type (2 unique shown, of 2 total).
value	count	share
node	11358	77.9%
way	3227	22.1%
