quirky lighthouses

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/lighthouses.json

Saturn profiled 14,585 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
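# Install the profiler into the notebook environment (IPython shell escape).
!pip install saturn-dissect
import subprocess
# Run the saturn CLI on the source file with the flags recorded for this report.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/lighthouses.json",
        "--findings", "quirky-lighthouses.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the command exits non-zero
)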

Summary confidence: high

This dataset catalogues 14,585 lighthouses and related navigational landmarks sourced from OpenStreetMap, with 13 columns covering location (lat/lon), OSM identifiers, names, operators, build years, light characteristics, and heritage status. Coverage is very uneven: descriptive fields like country (99.6% null), heritage (96.9% null), year_built (93.2% null) and operator (92.7% null) are mostly empty, so any analysis of those fields needs to acknowledge the small annotated subsample. The most reliable signals are geographic (lat/lon, fully populated) and the OSM-derived fields osm_type and seamark_type; the latter is dominated by light_minor (3,496) and light_major (3,051), confirming the dataset's lighthouse focus, although it is itself 48.0% null. light_character is also worth examining: where recorded, 'Fl' (flashing) accounts for about 75% of values. Latitude is heavily skewed toward the northern hemisphere (median 40.8°), with 1,295 rows flagged as outliers, nearly all of them in the long southern tail.

citing: row_count · column_count · columns · kinds

Out[4]:

saturn.schema() · 13 columns

column kind n null% unique alerts
name text 14,585 0.0% 14,239 near_unique
lat numeric 14,585 0.0% 14,572 outliers
lon numeric 14,585 0.0% 14,565
country categorical 14,585 99.6% 19 long_tail null_rate
height categorical 14,585 90.2% 316 long_tail null_rate
year_built categorical 14,585 93.2% 429 long_tail null_rate
operator categorical 14,585 92.7% 283 long_tail null_rate
seamark_type categorical 14,585 48.0% 25 null_rate
light_character categorical 14,585 71.3% 19 null_rate
heritage categorical 14,585 96.9% 7 null_rate
wikipedia text 14,585 86.2% 1,965 near_unique null_rate
osm_id numeric 14,585 0.0% 14,584
osm_type categorical 14,585 0.0% 2
Fig 1.
seamark_type · Shows the dataset is dominated by light_minor and light_major entries, confirming its lighthouse focus.
Top values for seamark_type (20 unique shown, of 25 total).
value                   count  share of rows
light_minor             3496   24.0%
light_major             3051   20.9%
landmark                716    4.9%
beacon_special_purpose  146    1.0%
beacon_lateral          102    0.7%
beacon_cardinal         19     0.1%
light                   9      0.1%
beacon                  5      0.0%
beacon_isolated_danger  5      0.0%
building                4      0.0%
daymark                 4      0.0%
radio_station           3      0.0%
pile                    3      0.0%
signal_station_traffic  3      0.0%
signal_station_warning  3      0.0%
buoy_lateral            3      0.0%
navigation_line         2      0.0%
fishing_facility        2      0.0%
light_vessel            1      0.0%
cone                    1      0.0%
Fig 2.
light_character · Reveals that flashing lights ('Fl') account for roughly three-quarters of recorded light characters.
Top values for light_character (19 unique shown, of 19 total).
value      count  share of rows
Fl         3126   21.4%
Iso        249    1.7%
F          245    1.7%
Q          182    1.2%
Oc         151    1.0%
LFl        148    1.0%
VQ         24     0.2%
Al.Fl      24     0.2%
Mo         14     0.1%
FFl        4      0.0%
Al         4      0.0%
IQ         3      0.0%
Al.LFl     2      0.0%
Al.Oc      2      0.0%
Q+LFl      2      0.0%
IVQ        1      0.0%
Fl(2)      1      0.0%
LFl W 10s  1      0.0%
Al.Iso     1      0.0%
Fig 3.
osm_type · Indicates most lighthouses are mapped as point nodes (78%) rather than ways/polygons.
Top values for osm_type (2 unique shown, of 2 total).
value  count  share of rows
node   11358  77.9%
way    3227   22.1%
Fig 4.
lat · Highlights a strong northern-hemisphere skew with a long tail of southern outliers.
Histogram bins for lat (median: 40.8121914).
bin               count
-63.4 – -59.77    3
-59.77 – -56.14   2
-56.14 – -52.51   38
-52.51 – -48.88   44
-48.88 – -45.25   29
-45.25 – -41.62   92
-41.62 – -37.99   108
-37.99 – -34.36   112
-34.36 – -30.73   147
-30.73 – -27.1    101
-27.1 – -23.47    93
-23.47 – -19.84   113
-19.84 – -16.21   58
-16.21 – -12.58   92
-12.58 – -8.947   58
-8.947 – -5.317   130
-5.317 – -1.687   129
-1.687 – 1.942    168
1.942 – 5.572     121
5.572 – 9.202     239
9.202 – 12.83     549
12.83 – 16.46     274
16.46 – 20.09     241
20.09 – 23.72     383
23.72 – 27.35     273
27.35 – 30.98     226
30.98 – 34.61     1062
34.61 – 38.24     1554
38.24 – 41.87     1300
41.87 – 45.5      1976
45.5 – 49.13      1257
49.13 – 52.76     625
52.76 – 56.39     900
56.39 – 60.02     1039
60.02 – 63.65     410
63.65 – 67.28     357
67.28 – 70.91     179
70.91 – 74.54     73
74.54 – 78.17     22
78.17 – 81.8      8
Fig 5.
country · Among the tiny 0.4% with country tags, Latvia, Germany and Estonia lead — illustrating how sparse this field is.
Top values for country (19 unique shown, of 19 total).
value  count  share of rows
LV     18     0.1%
DE     8      0.1%
EE     7      0.0%
HT     6      0.0%
US     5      0.0%
GB     2      0.0%
PT     2      0.0%
IM     2      0.0%
JP     1      0.0%
KY     1      0.0%
PH     1      0.0%
IN     1      0.0%
FI     1      0.0%
PL     1      0.0%
IT     1      0.0%
RU     1      0.0%
UZ     1      0.0%
MX     1      0.0%
TW     1      0.0%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column           kind         null %
name             text         0.0%
lat              numeric      0.0%
lon              numeric      0.0%
country          categorical  99.6%
height           categorical  90.2%
year_built       categorical  93.2%
operator         categorical  92.7%
seamark_type     categorical  48.0%
light_character  categorical  71.3%
heritage         categorical  96.9%
wikipedia        text         86.2%
osm_id           numeric      0.0%
osm_type         categorical  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
        lat    lon    osm_id
lat     +1.00  -0.12  -0.04
lon     -0.12  +1.00  +0.18
osm_id  -0.04  +0.18  +1.00

name text identifier

This column holds proper names of lighthouses, with 14,239 unique values across 14,585 rows (near_unique alert) and a mean of 2.33 words per entry. The vocabulary is multilingual: 'lighthouse' dominates at 8,487 occurrences but 'faro', 'fyr', 'phare', 'farol' and Cyrillic 'маяк' all appear, signalling Spanish/Italian, Scandinavian, French, Portuguese and Russian sources mixed together. Despite the near-uniqueness, 346 duplicate names (2.4%) exist, which is worth checking before treating this as a key.

Treatment: Use as a display label; do not treat as a unique key without disambiguating the 346 duplicates.

anthropic:claude-opus-4-7 · confidence high
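The treatment above (display label, not a key) can be sketched in a few lines. This is a minimal illustration only: it assumes each record is a dict carrying the name, osm_type and osm_id columns, and the helper `label_records` and the sample rows are hypothetical, not part of saturn or the dataset.

```python
from collections import Counter


def label_records(records):
    """Return display labels, adding the OSM key only where a name repeats."""
    name_counts = Counter(r["name"] for r in records)
    labels = []
    for r in records:
        if name_counts[r["name"]] > 1:
            # Repeated name: append osm_type/osm_id so the label is unambiguous.
            labels.append(f'{r["name"]} ({r["osm_type"]}/{r["osm_id"]})')
        else:
            labels.append(r["name"])
    return labels


# Hypothetical rows for illustration; not taken from the dataset.
rows = [
    {"name": "North Pier Light", "osm_type": "node", "osm_id": 101},
    {"name": "North Pier Light", "osm_type": "way", "osm_id": 202},
    {"name": "Faro Ejemplo", "osm_type": "node", "osm_id": 303},
]
labels = label_records(rows)
```

Unique names stay untouched, so only the duplicated rows pick up the longer label.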
Out[13]:

saturn.columns["name"].stats

stat value
n 14,585
nulls 0 (0.0%)
unique 14,239
len_min 2
len_max 91
len_mean 18.95
len_median 21
len_p95 27
word_mean 2.327
word_median 2
n_empty 0
n_duplicates 346
duplicate_rate 0.02372
vocab_size 14,670
readability_flesch_mean 75.78
emoji_rate 0
url_rate 0
one_word_rate 0.1068
allcaps_rate 0.05794
boilerplate_rate 0
alert: near_unique · 97.6% of rows are unique strings
Fig 8.
Character-length distribution for name.
Character-length distribution for name (mean: 18.947685978745287).
chars    count
2 – 4    301
4 – 6    525
6 – 9    304
9 – 11   528
11 – 13  827
13 – 15  705
15 – 18  859
18 – 20  906
20 – 22  7858
22 – 24  560
24 – 26  414
26 – 29  287
29 – 31  151
31 – 33  155
33 – 35  52
35 – 38  42
38 – 40  42
40 – 42  37
42 – 44  5
44 – 46  8
46 – 49  4
49 – 51  3
51 – 53  2
53 – 55  2
55 – 58  2
58 – 60  1
60 – 62  1
62 – 64  1
64 – 67  0
67 – 69  0
69 – 71  0
71 – 73  2
73 – 75  0
75 – 78  0
78 – 80  0
80 – 82  0
82 – 84  0
84 – 87  0
87 – 89  0
89 – 91  1

lat numeric feature

This is a latitude coordinate in decimal degrees, ranging from -63.40 to 81.80 with a median of 40.81, suggesting a Northern-Hemisphere-skewed global distribution. The strong negative skew (-1.46) and 1295 flagged outliers (8.88%) reflect a long tail into the southern hemisphere rather than data errors. Near-unique values (14572 of 14585) indicate per-record geocoordinates with no nulls.

Treatment: Pair with longitude for geospatial features; avoid treating southern-hemisphere points as outliers when modelling.

anthropic:claude-opus-4-7 · confidence high
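To see why southern-hemisphere points end up flagged, it helps to compute the fences implied by the quartiles. The sketch below assumes the "1.5 IQR" alert uses the standard Tukey rule (q1 − 1.5·IQR, q3 + 1.5·IQR) and works from the rounded q1 and q3 reported in the stats, so the fences are approximate; `tukey_fences` is a hypothetical helper, not a saturn function.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Rounded quartiles for lat as reported in the stats table.
lower, upper = tukey_fences(28.11, 48.98)
print(f"lower fence ~ {lower:.3f}, upper fence ~ {upper:.3f}")
```

Under that assumption the lower fence sits a little south of the equator, so nearly every southern-hemisphere row is counted as an outlier, which supports reading the flag as a distribution effect rather than bad data.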
Out[16]:

saturn.columns["lat"].stats

stat value
n 14,585
nulls 0 (0.0%)
unique 14,572
min -63.4
max 81.8
mean 34.52
median 40.81
std 24.64
q1 28.11
q3 48.98
iqr 20.87
skew -1.458
kurtosis 2.028
n_outliers 1,295
outlier_rate 0.08879
zero_rate 0
alert: outliers · 8.9% of rows beyond 1.5 IQR
Fig 9.
Distribution of lat. Vertical dash marks the median.
Histogram bins for lat (median: 40.8121914).
bin               count
-63.4 – -59.77    3
-59.77 – -56.14   2
-56.14 – -52.51   38
-52.51 – -48.88   44
-48.88 – -45.25   29
-45.25 – -41.62   92
-41.62 – -37.99   108
-37.99 – -34.36   112
-34.36 – -30.73   147
-30.73 – -27.1    101
-27.1 – -23.47    93
-23.47 – -19.84   113
-19.84 – -16.21   58
-16.21 – -12.58   92
-12.58 – -8.947   58
-8.947 – -5.317   130
-5.317 – -1.687   129
-1.687 – 1.942    168
1.942 – 5.572     121
5.572 – 9.202     239
9.202 – 12.83     549
12.83 – 16.46     274
16.46 – 20.09     241
20.09 – 23.72     383
23.72 – 27.35     273
27.35 – 30.98     226
30.98 – 34.61     1062
34.61 – 38.24     1554
38.24 – 41.87     1300
41.87 – 45.5      1976
45.5 – 49.13      1257
49.13 – 52.76     625
52.76 – 56.39     900
56.39 – 60.02     1039
60.02 – 63.65     410
63.65 – 67.28     357
67.28 – 70.91     179
70.91 – 74.54     73
74.54 – 78.17     22
78.17 – 81.8      8

lon numeric feature

This is a longitude coordinate, with values spanning -179.20 to 179.34 and a near-symmetric distribution (skew 0.05). The 14565 unique values across 14585 rows suggest each record is a distinct geographic point. The wide IQR of 152.94 and negative kurtosis (-0.95) indicate global coverage rather than concentration in any single region.

Treatment: pair with latitude for geospatial analysis; avoid treating as a standalone scalar feature.

anthropic:claude-opus-4-7 · confidence high
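One reason longitude works poorly as a standalone scalar is that it wraps at ±180°, so two neighbouring points can sit at opposite ends of the numeric range. A common way to pair it with latitude is to project both onto the unit sphere. The sketch below is illustrative only; `latlon_to_unit_xyz` is a hypothetical helper and the coordinates are made up.

```python
import math


def latlon_to_unit_xyz(lat_deg, lon_deg):
    """Project a latitude/longitude pair (degrees) onto the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return x, y, z


# Two made-up points on the equator, either side of the antimeridian.
east = latlon_to_unit_xyz(0.0, 179.9)
west = latlon_to_unit_xyz(0.0, -179.9)

# As raw scalars the longitudes differ by 359.8; on the sphere they are close.
gap = math.dist(east, west)
```

The three coordinates can then be used together as features, or as input to a distance calculation.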
Out[19]:

saturn.columns["lon"].stats

stat value
n 14,585
nulls 0 (0.0%)
unique 14,565
min -179.2
max 179.3
mean 23.04
median 14.97
std 79.4
q1 -40.5
q3 112.4
iqr 152.9
skew 0.0526
kurtosis -0.9537
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 10.
Distribution of lon. Vertical dash marks the median.
Histogram bins for lon (median: 14.9745494).
bin                count
-179.2 – -170.2    39
-170.2 – -161.3    3
-161.3 – -152.3    20
-152.3 – -143.3    6
-143.3 – -134.4    8
-134.4 – -125.4    179
-125.4 – -116.5    235
-116.5 – -107.5    78
-107.5 – -98.53    38
-98.53 – -89.56    132
-89.56 – -80.6     544
-80.6 – -71.64     805
-71.64 – -62.67    835
-62.67 – -53.71    379
-53.71 – -44.75    272
-44.75 – -35.78    178
-35.78 – -26.82    73
-26.82 – -17.86    128
-17.86 – -8.892    277
-8.892 – 0.07137   1272
0.07137 – 9.035    770
9.035 – 18         1383
18 – 26.96         1441
26.96 – 35.93      659
35.93 – 44.89      258
44.89 – 53.85      141
53.85 – 62.82      77
62.82 – 71.78      38
71.78 – 80.74      131
80.74 – 89.71      73
89.71 – 98.67      57
98.67 – 107.6      245
107.6 – 116.6      388
116.6 – 125.6      955
125.6 – 134.5      1230
134.5 – 143.5      818
143.5 – 152.4      199
152.4 – 161.4      66
161.4 – 170.4      49
170.4 – 179.3      106

country categorical metadata

Two-letter ISO country codes (LV, DE, EE, HT, US…) identifying record origin, but the column is effectively empty: 99.58% of the 14,585 rows are null, leaving only 61 populated values spread across 19 codes. Among the few present, Latvia leads at 29.5% (18 records), and the high entropy ratio (0.81) shows that no single country dominates the observed slice.

Treatment: Drop or treat as missing-indicator only; insufficient coverage for modelling or segmentation.

anthropic:claude-opus-4-7 · confidence high
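The entropy figures quoted above can be reproduced from the value counts. The sketch below assumes entropy is Shannon entropy in bits over the non-null values and that the entropy ratio divides it by log2 of the cardinality; that reading is consistent with the numbers reported for this and other columns, but it is an inference, not a documented saturn formula. `entropy_bits` is a hypothetical helper.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (in bits) of a categorical distribution given its counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts from the country top-values table: LV, DE, EE, HT, US, three codes
# with 2 rows each, and 11 singleton codes.
country_counts = [18, 8, 7, 6, 5, 2, 2, 2] + [1] * 11

h = entropy_bits(country_counts)
# Normalise by the maximum entropy for this many categories (uniform case).
ratio = h / math.log2(len(country_counts))
```

A ratio near 1 means the populated values are spread almost evenly across categories; a ratio near 0 means one or two categories dominate.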
Out[22]:

saturn.columns["country"].stats

stat value
n 14,585
nulls 14,524 (99.6%)
unique 19
top_value LV
top_rate 0.2951
cardinality 19
entropy 3.442
entropy_ratio 0.8102
alert: long_tail · 11 singleton categories
alert: null_rate · 99.6% null
Fig 11.
Top values for country.
Top values for country (19 unique shown, of 19 total).
value  count  share of rows
LV     18     0.1%
DE     8      0.1%
EE     7      0.0%
HT     6      0.0%
US     5      0.0%
GB     2      0.0%
PT     2      0.0%
IM     2      0.0%
JP     1      0.0%
KY     1      0.0%
PH     1      0.0%
IN     1      0.0%
FI     1      0.0%
PL     1      0.0%
IT     1      0.0%
RU     1      0.0%
UZ     1      0.0%
MX     1      0.0%
TW     1      0.0%

height categorical feature

Stored as a categorical, this column appears to record a height as a small integer (top values are '15', '12', '14', '10', '8'), with 316 distinct values and high entropy ratio 0.8234. The dominant signal is missingness: null_rate is 0.9018, so roughly 90% of the 14585 rows have no value, and even the most common value '15' covers only 4.12% of non-nulls. The numeric-looking strings suggest it should be a numeric feature rather than a category.

Treatment: Cast to numeric and impute or add a missingness indicator before modelling, given ~90% nulls.

anthropic:claude-opus-4-7 · confidence medium
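A minimal sketch of the cast-plus-indicator treatment follows. It assumes the raw values are either absent or plain numeric strings like the top values shown in the report; other formats (for example a value with a unit suffix) have not been verified here and are simply flagged as unparseable. `parse_height` is a hypothetical helper.

```python
def parse_height(raw):
    """Return (value, is_missing) for one raw height entry.

    Values that cannot be read as a number are treated as missing so that
    they never reach a model as a bogus number.
    """
    if raw is None or str(raw).strip() == "":
        return None, True
    try:
        return float(str(raw).strip()), False
    except ValueError:
        # Assumption: anything that is not a bare number is left for manual review.
        return None, True


# Hypothetical inputs for illustration.
parsed = [parse_height(v) for v in ["15", " 12 ", None, "tall"]]
```

Keeping the indicator alongside the numeric value lets a model use missingness itself as a signal, which matters when roughly 90% of rows are null.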
Out[25]:

saturn.columns["height"].stats

stat value
n 14,585
nulls 13,153 (90.2%)
unique 316
top_value 15
top_rate 0.0412
cardinality 316
entropy 6.837
entropy_ratio 0.8234
alert: long_tail · 197 singleton categories
alert: null_rate · 90.2% null
Fig 12.
Top values for height.
Top values for height (20 unique shown, of 316 total).
value  count  share of rows
15     59     0.4%
12     55     0.4%
14     54     0.4%
10     51     0.3%
8      45     0.3%
18     44     0.3%
20     43     0.3%
13     38     0.3%
6      37     0.3%
17     33     0.2%
25     31     0.2%
11     28     0.2%
30     28     0.2%
16     27     0.2%
23     26     0.2%
21     21     0.1%
19     20     0.1%
28     19     0.1%
9      19     0.1%
26     18     0.1%

year_built categorical feature

Construction year of the asset, stored as free-text strings that mix four-digit years (e.g. '1875', '1872') with century shorthand like 'C19' and 'C20'; 'C19' is the most frequent value at 31 occurrences. The column is 93.2% null, so it is populated for only 992 of the 14,585 rows, with 429 distinct values and a very high entropy ratio (0.93); the populated portion is a long, flat tail in which 19th-century dates are the most common. The mixed encoding (numeric years vs. century codes) means this cannot be treated as a clean numeric field without parsing.

Treatment: Parse century codes (e.g. 'C19' → 1850) and cast to numeric year; given 93% nulls, add a missing-indicator and consider dropping if coverage stays this low.

anthropic:claude-opus-4-7 · confidence high
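The parsing step suggested in the treatment can be sketched as below. It handles only the two encodings named in the report (four-digit years and 'C' plus a century number) and maps a century code to its midpoint, following the 'C19' → 1850 example; any other format returns None. `parse_year_built` is a hypothetical helper, and the midpoint choice is an assumption that blurs dates by up to 50 years.

```python
import re

_YEAR = re.compile(r"^\d{4}$")
_CENTURY = re.compile(r"^C(\d{1,2})$")


def parse_year_built(raw):
    """Return an integer year, or None when the value is absent or unrecognised."""
    if raw is None:
        return None
    text = str(raw).strip()
    if _YEAR.match(text):
        return int(text)
    match = _CENTURY.match(text)
    if match:
        century = int(match.group(1))
        # Midpoint of the century: C19 covers 1801-1900, so use 1850.
        return (century - 1) * 100 + 50
    return None


years = [parse_year_built(v) for v in ["1875", "C19", "C20", None, "unknown"]]
```

Because a century midpoint is much less precise than a recorded year, it may be worth keeping a separate flag that marks which rows were derived from century codes.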
Out[28]:

saturn.columns["year_built"].stats

stat value
n 14,585
nulls 13,593 (93.2%)
unique 429
top_value C19
top_rate 0.03125
cardinality 429
entropy 8.142
entropy_ratio 0.9311
alert: long_tail · 271 singleton categories
alert: null_rate · 93.2% null
Fig 13.
Top values for year_built.
Top values for year_built (20 unique shown, of 429 total).
value  count  share of rows
C19    31     0.2%
1875   16     0.1%
1872   15     0.1%
1881   12     0.1%
C20    12     0.1%
1906   11     0.1%
1871   11     0.1%
1877   11     0.1%
1882   11     0.1%
1874   11     0.1%
1873   11     0.1%
1939   10     0.1%
1898   10     0.1%
1897   9      0.1%
1870   9      0.1%
1909   8      0.1%
1884   8      0.1%
1890   8      0.1%
1911   7      0.0%
1950   7      0.0%

operator categorical metadata

This column names the operating authority responsible for each entry; the prevalence of coast guards and Plovput is consistent with lighthouses and other maritime navigation aids. The column is 92.7% null, leaving only 1,065 populated rows spread across 283 distinct operators, with high entropy (ratio 0.861) and the top value (U.S. Coast Guard) covering just 7.7% of them. Notable signals include a long tail of regional Philippine Coast Guard stations and at least one non-Latin entry (海上保安庁, Japan Coast Guard), suggesting multilingual free-form values rather than a controlled vocabulary.

Treatment: Normalize/translate operator strings and group long-tail values; treat missingness as its own category before any encoding.

anthropic:claude-opus-4-7 · confidence high
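One way to group the long tail is to roll station-level entries up to their parent organisation. The sketch below assumes the "<place> Station, <organisation>" pattern seen in the top values holds for the other station entries too; `parent_operator` is a hypothetical helper, and the counts are a small subset copied from the top-values table.

```python
from collections import Counter


def parent_operator(raw):
    """Collapse '<place> Station, <organisation>' to the organisation name."""
    if raw is None or not raw.strip():
        return "missing"  # keep missingness as its own category
    head, sep, tail = raw.rpartition(",")
    if sep and head.strip().endswith("Station"):
        return tail.strip()
    return raw.strip()


# A subset of counts from the operator top-values table.
observed = {
    "U.S. Coast Guard": 82,
    "Plovput": 49,
    "Tagbilaran Station, Philippine Coast Guard": 27,
    "Cebu Station, Philippine Coast Guard": 26,
}

grouped = Counter()
for operator, count in observed.items():
    grouped[parent_operator(operator)] += count
```

Translating non-Latin names and merging spelling variants would be a separate normalisation pass that this sketch does not attempt.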
Out[31]:

saturn.columns["operator"].stats

stat value
n 14,585
nulls 13,520 (92.7%)
unique 283
top_value U.S. Coast Guard
top_rate 0.077
cardinality 283
entropy 7.013
entropy_ratio 0.8611
alert: long_tail · 167 singleton categories
alert: null_rate · 92.7% null
Fig 14.
Top values for operator.
Top values for operator (20 unique shown, of 283 total).
value                                              count  share of rows
U.S. Coast Guard                                   82     0.6%
Plovput                                            49     0.3%
INEA                                               42     0.3%
Tagbilaran Station, Philippine Coast Guard         27     0.2%
Cebu Station, Philippine Coast Guard               26     0.2%
Catbalogan Station, Philippine Coast Guard         18     0.1%
Surigao Station, Philippine Coast Guard            18     0.1%
Masbate Station, Philippine Coast Guard            17     0.1%
Maasin Station, Philippine Coast Guard             17     0.1%
海上保安庁                                          16     0.1%
Directorate General of Lighthouses and Lightships  15     0.1%
Romblon Station, Philippine Coast Guard            15     0.1%
Iloilo Station, Philippine Coast Guard             14     0.1%
Puerto Princesa Station, Philippine Coast Guard    14     0.1%
Bacolod Station, Philippine Coast Guard            13     0.1%
Sorsogon Station, Philippine Coast Guard           13     0.1%
Cagayan de Oro Station, Philippine Coast Guard     13     0.1%
Marine Department of Sabah                         13     0.1%
Dumaguete Station, Philippine Coast Guard          12     0.1%
Appari Station, Philippine Coast Guard             12     0.1%

seamark_type categorical feature

This column classifies maritime seamark features, with 25 distinct types dominated by navigational lights: 'light_minor' covers 46.1% of non-null rows and 'light_major' is the runner-up at 40.2%, together swamping the long tail. Nearly half the rows (48.0%) are null, which triggered the null_rate alert and means this attribute is absent for almost one record in two. The entropy ratio of 0.36 confirms heavy concentration in the top two categories.

Treatment: Treat nulls as a distinct 'unknown' level and group rare categories before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
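The encoding described in the treatment can be sketched with the standard library alone. The function name, the `min_count` threshold and the 'unknown'/'other' labels below are illustrative choices, not saturn settings, and the sample values are a made-up miniature of the column.

```python
from collections import Counter


def encode_categorical(values, min_count=2, null_label="unknown", rare_label="other"):
    """One-hot encode values, keeping nulls as a level and pooling rare levels."""
    filled = [null_label if v is None else v for v in values]
    counts = Counter(filled)
    grouped = [
        v if counts[v] >= min_count or v == null_label else rare_label
        for v in filled
    ]
    levels = sorted(set(grouped))
    rows = [[1 if g == level else 0 for level in levels] for g in grouped]
    return levels, rows


# Made-up miniature of the column: two common types, one null, one rare type.
sample = ["light_minor", "light_minor", "light_major", "light_major", None, "cone"]
levels, rows = encode_categorical(sample)
```

Pooling before encoding keeps the number of indicator columns small; with 25 raw types and most of them nearly empty, the threshold would normally be set well above 2.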
Out[34]:

saturn.columns["seamark_type"].stats

stat value
n 14,585
nulls 7,002 (48.0%)
unique 25
top_value light_minor
top_rate 0.461
cardinality 25
entropy 1.657
entropy_ratio 0.3569
alert: null_rate · 48.0% null
Fig 15.
Top values for seamark_type.
Top values for seamark_type (20 unique shown, of 25 total).
value                   count  share of rows
light_minor             3496   24.0%
light_major             3051   20.9%
landmark                716    4.9%
beacon_special_purpose  146    1.0%
beacon_lateral          102    0.7%
beacon_cardinal         19     0.1%
light                   9      0.1%
beacon                  5      0.0%
beacon_isolated_danger  5      0.0%
building                4      0.0%
daymark                 4      0.0%
radio_station           3      0.0%
pile                    3      0.0%
signal_station_traffic  3      0.0%
signal_station_warning  3      0.0%
buoy_lateral            3      0.0%
navigation_line         2      0.0%
fishing_facility        2      0.0%
light_vessel            1      0.0%
cone                    1      0.0%

light_character categorical feature

This column captures the light character (flash pattern) of navigational lights, with codes like 'Fl' (flashing), 'Iso' (isophase), 'Q' (quick), and 'Oc' (occulting) drawn from standard maritime chart notation. It is overwhelmingly dominated by 'Fl' at 74.7% of the 4,184 non-null rows, yielding a low entropy ratio of 0.354 across 19 distinct codes. The headline concern is a 71.31% null rate, meaning the field is populated for fewer than 30% of records.

Treatment: Treat nulls as a distinct 'unknown' category and one-hot encode, collapsing rare codes into 'other'.

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["light_character"].stats

stat value
n 14,585
nulls 10,401 (71.3%)
unique 19
top_value Fl
top_rate 0.7471
cardinality 19
entropy 1.503
entropy_ratio 0.3539
alert: null_rate · 71.3% null
Fig 16.
Top values for light_character.
Top values for light_character (19 unique shown, of 19 total).
value      count  share of rows
Fl         3126   21.4%
Iso        249    1.7%
F          245    1.7%
Q          182    1.2%
Oc         151    1.0%
LFl        148    1.0%
VQ         24     0.2%
Al.Fl      24     0.2%
Mo         14     0.1%
FFl        4      0.0%
Al         4      0.0%
IQ         3      0.0%
Al.LFl     2      0.0%
Al.Oc      2      0.0%
Q+LFl      2      0.0%
IVQ        1      0.0%
Fl(2)      1      0.0%
LFl W 10s  1      0.0%
Al.Iso     1      0.0%

heritage categorical metadata

`heritage` is a sparsely populated categorical flag with only 7 distinct values across 14585 rows and a 96.89% null rate. The non-null entries are a messy mix of numeric codes ("2", "3", "4", "1") and free-text tokens ("yes", "no", "regional"), with "2" alone covering 75.7% of populated cells. The coding inconsistency plus the extreme nullity suggest this field was populated ad hoc rather than against a controlled vocabulary.

Treatment: Normalise the mixed numeric/text codes to a single scheme and treat absence as its own category, or drop given the 96.89% null rate.

anthropic:claude-opus-4-7 · confidence high
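A single scheme for the seven observed values might look like the sketch below. The mapping is an assumption made for illustration: numeric strings are read as a designation level, 'yes' and 'no' as a plain flag, and any other token (such as 'regional') is kept as a note for manual review. `normalise_heritage` is a hypothetical helper, and the meaning of each numeric level should be checked against the tagging scheme before use.

```python
def normalise_heritage(raw):
    """Map a raw heritage value to a (listed, level, note) tuple."""
    if raw is None:
        return (None, None, None)        # absent: kept as its own category
    text = str(raw).strip().lower()
    if text.isdigit():
        return (True, int(text), None)   # assumed: numeric code is a level
    if text == "yes":
        return (True, None, None)        # listed, level not recorded
    if text == "no":
        return (False, None, None)
    return (True, None, text)            # e.g. "regional": keep for review


# The seven distinct values reported for this column.
observed = ["2", "3", "4", "yes", "1", "regional", "no"]
normalised = {value: normalise_heritage(value) for value in observed}
```

Splitting the flag from the level keeps 'yes' rows usable without inventing a level for them.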
Out[40]:

saturn.columns["heritage"].stats

stat value
n 14,585
nulls 14,132 (96.9%)
unique 7
top_value 2
top_rate 0.7572
cardinality 7
entropy 1.244
entropy_ratio 0.4431
alert: null_rate · 96.9% null
Fig 17.
Top values for heritage.
Top values for heritage (7 unique shown, of 7 total).
value     count  share of rows
2         343    2.4%
3         52     0.4%
4         26     0.2%
yes       25     0.2%
1         4      0.0%
regional  2      0.0%
no        1      0.0%

wikipedia text metadata

This column holds short Wikipedia article titles or links, almost certainly for lighthouses given the dominance of tokens like 'light' (477), 'lighthouse' (335), and multilingual equivalents 'fr:phare' (195), 'es:faro' (181), 'de:leuchtturm' (89), 'pt:farol' (65). Strings are short (mean 22.3 chars, ~3 words) and the column is 86.23% null with only 1,965 unique values across 14,585 rows. The interwiki-style prefixes ('fr:', 'es:', 'de:', 'pt:') indicate a language mix encoded in the values themselves rather than free prose.

Treatment: Parse the 'lang:title' prefix into a language code and title, and treat as an optional cross-reference link rather than a model feature.

anthropic:claude-opus-4-7 · confidence high
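Splitting the 'lang:title' prefix can be sketched as follows. The pattern assumes the prefix is a short lowercase language code followed by a colon, as in the 'fr:', 'es:', 'de:' and 'pt:' examples above; values without such a prefix are returned with no language. `split_wikipedia_tag` is a hypothetical helper and the sample titles are invented.

```python
import re

# Two or three lowercase letters, optionally with a hyphenated variant.
_LANG_PREFIX = re.compile(r"^([a-z]{2,3}(?:-[a-z]+)?):(.+)$")


def split_wikipedia_tag(raw):
    """Return (language_code, title); language_code is None when no prefix is found."""
    if raw is None:
        return None, None
    text = raw.strip()
    match = _LANG_PREFIX.match(text)
    if match:
        return match.group(1), match.group(2).strip()
    return None, text


# Invented titles for illustration.
with_prefix = split_wikipedia_tag("fr:Phare Exemple")
without_prefix = split_wikipedia_tag("Example Light")
colon_in_title = split_wikipedia_tag("en:Example: The Light")
```

Only the first colon is treated as the separator, so titles that contain a colon of their own stay intact.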
Out[43]:

saturn.columns["wikipedia"].stats

stat value
n 14,585
nulls 12,577 (86.2%)
unique 1,965
len_min 6
len_max 55
len_mean 22.3
len_median 22
len_p95 33
word_mean 2.996
word_median 3
n_empty 0
n_duplicates 43
duplicate_rate 0.02141
vocab_size 2,370
readability_flesch_mean 48.92
emoji_rate 0
url_rate 0
one_word_rate 0.0762
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 97.9% of rows are unique strings
alert: null_rate · 86.2% null
Fig 18.
Character-length distribution for wikipedia.
Character-length distribution for wikipedia (mean: 22.296314741035857).
chars    count
6 – 7    29
7 – 8    74
8 – 10   20
10 – 11  9
11 – 12  17
12 – 13  22
13 – 15  24
15 – 16  45
16 – 17  107
17 – 18  93
18 – 19  132
19 – 21  139
21 – 22  178
22 – 23  307
23 – 24  147
24 – 26  129
26 – 27  103
27 – 28  168
28 – 29  56
29 – 30  38
30 – 32  38
32 – 33  22
33 – 34  38
34 – 35  11
35 – 37  13
37 – 38  5
38 – 39  20
39 – 40  5
40 – 42  2
42 – 43  2
43 – 44  2
44 – 45  9
45 – 46  0
46 – 48  0
48 – 49  0
49 – 50  0
50 – 51  1
51 – 53  0
53 – 54  1
54 – 55  2

osm_id numeric identifier

This is almost certainly the OpenStreetMap object id: 14584 unique values across 14585 rows (effectively one per record), no nulls, and a numeric range from 13,391,742 up to 13,525,496,137 that matches OSM's monotonically growing identifier space. The distribution is right-skewed (skew 1.07) with median 1,574,285,300 well below mean 3,722,756,855, reflecting the mix of older low-numbered and newer high-numbered OSM entities rather than any analytic signal.

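Treatment: Keep as a key for joins to OSM data, paired with osm_type because OSM ids are unique only within an element type (14,584 unique ids across 14,585 rows means one id repeats); exclude from modelling features.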

anthropic:claude-opus-4-7 · confidence high
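Since the stats show one fewer unique id than rows, a quick check is to look for repeated keys with and without osm_type. The sketch below uses made-up records; `duplicate_keys` and `osm_url` are hypothetical helpers, and the URL pattern is the public openstreetmap.org element address.

```python
from collections import Counter


def duplicate_keys(records, fields):
    """Return the key tuples (built from `fields`) that occur more than once."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return [key for key, n in counts.items() if n > 1]


def osm_url(osm_type, osm_id):
    """Build the openstreetmap.org address for a node or way."""
    return f"https://www.openstreetmap.org/{osm_type}/{osm_id}"


# Made-up records: a node and a way that happen to share the same numeric id.
records = [
    {"osm_type": "node", "osm_id": 42},
    {"osm_type": "way", "osm_id": 42},
    {"osm_type": "node", "osm_id": 43},
]

by_id_only = duplicate_keys(records, ["osm_id"])
by_type_and_id = duplicate_keys(records, ["osm_type", "osm_id"])
```

If the composite key still shows a repeat on the real data, the same OSM element was exported twice and should be deduplicated before any join.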
Out[46]:

saturn.columns["osm_id"].stats

stat value
n 14,585
nulls 0 (0.0%)
unique 14,584
min 1.339e+07
max 1.353e+10
mean 3.723e+09
median 1.574e+09
std 3.828e+09
q1 1.001e+09
q3 6.402e+09
iqr 5.401e+09
skew 1.073
kurtosis -0.1991
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 19.
Distribution of osm_id. Vertical dash marks the median.
Histogram bins for osm_id (median: 1574285300.0).
bin                    count
1.339e+07 – 3.512e+08  1272
3.512e+08 – 6.89e+08   1457
6.89e+08 – 1.027e+09   998
1.027e+09 – 1.365e+09  2358
1.365e+09 – 1.702e+09  1852
1.702e+09 – 2.04e+09   348
2.04e+09 – 2.378e+09   245
2.378e+09 – 2.716e+09  266
2.716e+09 – 3.054e+09  212
3.054e+09 – 3.391e+09  223
3.391e+09 – 3.729e+09  234
3.729e+09 – 4.067e+09  178
4.067e+09 – 4.405e+09  187
4.405e+09 – 4.743e+09  216
4.743e+09 – 5.08e+09   220
5.08e+09 – 5.418e+09   241
5.418e+09 – 5.756e+09  180
5.756e+09 – 6.094e+09  113
6.094e+09 – 6.432e+09  143
6.432e+09 – 6.769e+09  248
6.769e+09 – 7.107e+09  351
7.107e+09 – 7.445e+09  94
7.445e+09 – 7.783e+09  185
7.783e+09 – 8.121e+09  147
8.121e+09 – 8.458e+09  150
8.458e+09 – 8.796e+09  131
8.796e+09 – 9.134e+09  163
9.134e+09 – 9.472e+09  210
9.472e+09 – 9.81e+09   199
9.81e+09 – 1.015e+10   422
1.015e+10 – 1.049e+10  102
1.049e+10 – 1.082e+10  91
1.082e+10 – 1.116e+10  107
1.116e+10 – 1.15e+10   97
1.15e+10 – 1.184e+10   198
1.184e+10 – 1.217e+10  162
1.217e+10 – 1.251e+10  158
1.251e+10 – 1.285e+10  161
1.285e+10 – 1.319e+10  132
1.319e+10 – 1.353e+10  134

osm_type categorical feature

Binary categorical flag indicating the OpenStreetMap geometry type, taking only the values 'node' (11358 rows, 77.9%) and 'way' (3227 rows). No nulls across 14585 rows, and entropy ratio of 0.76 reflects the moderate imbalance toward nodes.

Treatment: One-hot or boolean-encode (node vs way) before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[49]:

saturn.columns["osm_type"].stats

stat value
n 14,585
nulls 0 (0.0%)
unique 2
top_value node
top_rate 0.7787
cardinality 2
entropy 0.7625
entropy_ratio 0.7625
Fig 20.
Top values for osm_type.
Top values for osm_type (2 unique shown, of 2 total).
value  count  share of rows
node   11358  77.9%
way    3227   22.1%

How to cite

BibTeX
@misc{saturn-quirky-lighthouses-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky lighthouses},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-lighthouses}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky lighthouses. Source: /home/coolhand/html/datavis/data_trove/data/quirky/lighthouses.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-lighthouses