data-trove-shipwrecks · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json

Saturn profiled 6,914 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

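# Install the profiler (IPython shell escape; run inside a notebook cell).
!pip install saturn-dissect
import subprocess

# Profile the source file, write the findings to JSON, and request LLM prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json",
    "--findings", "data-trove-shipwrecks.json",
    "--llm", "anthropic:default",
], check=True)  # raise CalledProcessError if saturn exits non-zero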

Summary confidence: high

This dataset is an OpenStreetMap-derived catalogue of 6,914 shipwrecks and related maritime hazards mapped globally. Start with the `type` and `seamark_type` columns. In `type`, 'shipwreck' (73.5%) and 'wreck' (19.5%) together label about 93% of entries, leaving a long tail of submarines, aircraft, barges, and other vessels worth examining. In `seamark_type`, 'wreck' alone covers 73.1% of all rows (78.3% of the rows where the field is populated). A secondary point of interest is the high null rate across many descriptive fields: `heritage` (99.8% null), `year_sunk` (99.5% null), and `wikipedia` (95.5% null). Rich contextual data therefore exists for only a small fraction of wrecks, and the dataset is far more useful as a spatial inventory than as a historical record. The `access` column is populated for only 7.4% of rows; where it is, 'yes' accounts for 67.0% of values, while 'permit' and 'private' together make up about 10.6%, which could interest dive-site analysts.

citing: type.top_value · type.top_rate · seamark_type.top_value · seamark_type.top_rate · heritage.null_rate · year_sunk.null_rate · wikipedia.null_rate · access.null_rate · access.top_value · access.top_rate · row_count · lat.min · lat.max

Out[4]:

saturn.schema() · 14 columns

column	kind	n	null%	unique	alerts
name	text	6,914	0.0%	6,841	near_unique
lat	numeric	6,914	0.0%	6,902	outliers
lon	numeric	6,914	0.0%	6,910	outliers
year_sunk	categorical	6,914	99.5%	36	long_tail null_rate
type	categorical	6,914	0.0%	31	long_tail
wikipedia	categorical	6,914	95.5%	307	long_tail null_rate
wikidata	categorical	6,914	94.8%	353	long_tail null_rate
description	categorical	6,914	94.9%	304	long_tail null_rate
heritage	categorical	6,914	99.8%	4	long_tail null_rate
access	categorical	6,914	92.6%	8	null_rate
depth	categorical	6,914	77.4%	502	null_rate
seamark_type	categorical	6,914	6.6%	15
osm_id	numeric	6,914	0.0%	6,914
osm_type	categorical	6,914	0.0%	2

Fig 1.

type · Look for the dominance of 'shipwreck' and 'wreck' and how thin the long tail of aircraft, submarines, and barges really is.

Top values for type (20 unique shown, of 31 total).
value	count	share
shipwreck	5081	73.5%
wreck	1345	19.5%
ship	381	5.5%
barge	27	0.4%
submarine	18	0.3%
aircraft	17	0.2%
plane	10	0.1%
boat	4	0.1%
vehicle	3	0.0%
motor_vehicle	3	0.0%
schooner	2	0.0%
car	2	0.0%
sailboat	2	0.0%
battleship	2	0.0%
steamer	1	0.0%
airplane	1	0.0%
freightcar	1	0.0%
train	1	0.0%
paddle steamer	1	0.0%
motorbike	1	0.0%

Fig 2.

seamark_type · Shows the navigational hazard classification — note how many wrecks are marked 'dangerous' or have 'distributed_remains' versus a clean hull.

Top values for seamark_type (15 unique shown, of 15 total).
value	count	share
wreck	5055	73.1%
dangerous	598	8.6%
non-dangerous	358	5.2%
distributed_remains	306	4.4%
hulk	56	0.8%
hull_showing	46	0.7%
shoreline_construction	14	0.2%
mast_showing	8	0.1%
obstruction	7	0.1%
harbour	2	0.0%
restricted_area	1	0.0%
plane	1	0.0%
beacon_special_purpose	1	0.0%
landmark	1	0.0%
no	1	0.0%

Fig 3.

access · Among the minority of wrecks with access data, check how many are freely accessible versus permit-only or private.

Top values for access (8 unique shown, of 8 total).
value	count	share
yes	341	4.9%
no	73	1.1%
permit	27	0.4%
private	27	0.4%
unknown	20	0.3%
permissive	17	0.2%
customers	3	0.0%
foot	1	0.0%

Fig 4.

lat · The distribution of wreck latitudes shows geographic clustering: look for the concentration in northern mid-latitudes (the 50.25 – 54.24 bin alone holds 1,302 wrecks) and the thin tails toward both poles.

Histogram bins for lat (median: 43.8517503).
bin	count
-77.42 – -73.44	1
-73.44 – -69.45	0
-69.45 – -65.46	0
-65.46 – -61.47	1
-61.47 – -57.48	1
-57.48 – -53.49	20
-53.49 – -49.5	39
-49.5 – -45.51	41
-45.51 – -41.52	50
-41.52 – -37.53	107
-37.53 – -33.54	235
-33.54 – -29.55	110
-29.55 – -25.56	56
-25.56 – -21.57	118
-21.57 – -17.58	67
-17.58 – -13.59	28
-13.59 – -9.597	30
-9.597 – -5.607	72
-5.607 – -1.617	73
-1.617 – 2.373	67
2.373 – 6.363	40
6.363 – 10.35	180
10.35 – 14.34	105
14.34 – 18.33	85
18.33 – 22.32	108
22.32 – 26.31	84
26.31 – 30.3	75
30.3 – 34.29	149
34.29 – 38.28	529
38.28 – 42.27	748
42.27 – 46.26	608
46.26 – 50.25	494
50.25 – 54.24	1302
54.24 – 58.23	846
58.23 – 62.22	212
62.22 – 66.21	85
66.21 – 70.2	103
70.2 – 74.19	39
74.19 – 78.18	5
78.18 – 82.17	1

Fig 5.

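depth · The 20 most frequent depth values (where recorded) all fall between 1.1 and 20, presumably metres, which hints at a skew toward shallow, diveable wrecks rather than deep-sea losses. These values cover only a small share of the populated rows, so the full distribution should be checked after casting the column to numeric.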

Top values for depth (20 unique shown, of 502 total).
value	count	share
12.4	11	0.2%
16	11	0.2%
18	11	0.2%
15.5	11	0.2%
19.2	11	0.2%
1.1	10	0.1%
17.4	10	0.1%
15.6	10	0.1%
7	10	0.1%
14	10	0.1%
5	10	0.1%
15.1	10	0.1%
6.4	9	0.1%
9	9	0.1%
8	9	0.1%
19	9	0.1%
15.2	9	0.1%
20	9	0.1%
16.4	9	0.1%
18.5	9	0.1%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
name	text	0.0%
lat	numeric	0.0%
lon	numeric	0.0%
year_sunk	categorical	99.5%
type	categorical	0.0%
wikipedia	categorical	95.5%
wikidata	categorical	94.8%
description	categorical	94.9%
heritage	categorical	99.8%
access	categorical	92.6%
depth	categorical	77.4%
seamark_type	categorical	6.6%
osm_id	numeric	0.0%
osm_type	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	lat	lon	osm_id
lat	+1.00	-0.27	+0.16
lon	-0.27	+1.00	+0.16
osm_id	+0.16	+0.16	+1.00

name text label

This column contains the names of individual shipwrecks, as reflected in the dominant top words: 'shipwreck' (5,032 occurrences across 6,914 rows), 'wreck', 'ss', 'uss', and 'hms'. With 6,841 unique values out of 6,914 rows and no nulls, it is essentially a name/label field. The 73 surplus rows (1.06% duplicate rate) are mildly surprising: they may be the same wreck recorded more than once, or distinct wrecks that share a generic name. Lengths cluster tightly (median 20, p95 21 characters) with a long tail reaching 153, suggesting most names are concise while a minority carry extended descriptions.

Treatment: Use as a display label; investigate 73 duplicates for deduplication or record linkage before treating as a unique identifier.

anthropic:default · confidence high
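One way to start on the duplicate check suggested in the treatment is to list the names that occur more than once. The sketch below is only an illustration: the helper `duplicated_names` is hypothetical and the sample names are made up, not rows from the dataset.

```python
from collections import Counter

def duplicated_names(names):
    """Return {name: count} for names that appear more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}

# Illustrative values only; these are not rows from shipwrecks.json.
sample = ["Unnamed shipwreck", "SS Example", "Unnamed shipwreck", "HMS Sample"]
dupes = duplicated_names(sample)

# Surplus rows = total rows minus distinct names. The reported
# n_duplicates is consistent with this definition: 6,914 - 6,841 = 73.
surplus = len(sample) - len(set(sample))
print(dupes, surplus)
```

Names that repeat can then be compared on `lat`, `lon`, and `osm_id` to separate true duplicates from distinct wrecks that share a generic label.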

Out[13]:

saturn.columns["name"].stats

stat	value
n	6,914
nulls	0 (0.0%)
unique	6,841
len_min	2
len_max	153
len_mean	18.35
len_median	20
len_p95	21
word_mean	2.058
word_median	2
n_empty	0
n_duplicates	73
duplicate_rate	0.01056
vocab_size	7,602
readability_flesch_mean	73.37
emoji_rate	0
url_rate	0
one_word_rate	0.0849
allcaps_rate	0.01403
boilerplate_rate	0
alert: near_unique	98.9% of rows are unique strings

Fig 8.

Character-length distribution for name.

Character-length distribution for name (mean: 18.35).
chars	count
2 – 6	173
6 – 10	481
10 – 13	493
13 – 17	305
17 – 21	3911
21 – 25	1393
25 – 28	71
28 – 32	35
32 – 36	16
36 – 40	12
40 – 44	10
44 – 47	5
47 – 51	3
51 – 55	0
55 – 59	1
59 – 62	1
62 – 66	1
66 – 70	1
70 – 74	0
74 – 78	1
78 – 81	0
81 – 85	0
85 – 89	0
89 – 93	0
93 – 96	0
96 – 100	0
100 – 104	0
104 – 108	0
108 – 111	0
111 – 115	0
115 – 119	0
119 – 123	0
123 – 127	0
127 – 130	0
130 – 134	0
134 – 138	0
138 – 142	0
142 – 145	0
145 – 149	0
149 – 153	1

lat numeric feature

This column holds geographic latitude, spanning from -77.42° (near Antarctica) to 82.17° (high Arctic), with 6,902 unique values across 6,914 rows. The mean (33.15°) sits well below the median (43.85°), consistent with the left skew of -1.42: most records cluster in mid-to-high northern latitudes, and a long tail of equatorial and southern-hemisphere records pulls the mean down. Roughly 12.5% of values (864 rows) are flagged as outliers. With Q1 = 26.58° and IQR = 27.29°, the lower 1.5 × IQR fence is about -14.36°, while the upper fence (about 94.8°) lies beyond the valid latitude range. Every flagged row is therefore a southern-hemisphere point south of roughly 14°S, not a polar extreme, and these are plausible real locations rather than coordinate errors.

Treatment: Retain as-is for geospatial modelling; consider pairing with longitude and binning into geographic regions to handle the skewed distribution, and do not drop the flagged southern-hemisphere rows as errors.

anthropic:default · confidence high
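The outlier alert refers to a 1.5 IQR rule, so the fences can be recomputed from the quartiles in the stats table. This is a minimal sketch that assumes Saturn applies the standard Tukey fences; the report does not spell out its exact formula, and `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles for lat as reported in the stats table.
lat_lower, lat_upper = tukey_fences(26.58, 53.87)
print(round(lat_lower, 2), round(lat_upper, 2))

# The upper fence lies beyond +90 degrees, so under this rule only
# values south of the lower fence can be flagged.
only_southern_flagged = lat_upper > 90.0
```

Under that assumption the lower fence sits near 14.4°S, which is why the flagged rows are southern-hemisphere points rather than polar ones.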

Out[16]:

saturn.columns["lat"].stats

stat	value
n	6,914
nulls	0 (0.0%)
unique	6,902
min	-77.42
max	82.17
mean	33.15
median	43.85
std	29.88
q1	26.58
q3	53.87
iqr	27.29
skew	-1.417
kurtosis	0.8666
n_outliers	864
outlier_rate	0.125
zero_rate	0
alert: outliers	12.5% rows beyond 1.5 IQR

Fig 9.

Distribution of lat. Vertical dash marks the median.

Histogram bins for lat (median: 43.8517503).
bin	count
-77.42 – -73.44	1
-73.44 – -69.45	0
-69.45 – -65.46	0
-65.46 – -61.47	1
-61.47 – -57.48	1
-57.48 – -53.49	20
-53.49 – -49.5	39
-49.5 – -45.51	41
-45.51 – -41.52	50
-41.52 – -37.53	107
-37.53 – -33.54	235
-33.54 – -29.55	110
-29.55 – -25.56	56
-25.56 – -21.57	118
-21.57 – -17.58	67
-17.58 – -13.59	28
-13.59 – -9.597	30
-9.597 – -5.607	72
-5.607 – -1.617	73
-1.617 – 2.373	67
2.373 – 6.363	40
6.363 – 10.35	180
10.35 – 14.34	105
14.34 – 18.33	85
18.33 – 22.32	108
22.32 – 26.31	84
26.31 – 30.3	75
30.3 – 34.29	149
34.29 – 38.28	529
38.28 – 42.27	748
42.27 – 46.26	608
46.26 – 50.25	494
50.25 – 54.24	1302
54.24 – 58.23	846
58.23 – 62.22	212
62.22 – 66.21	85
66.21 – 70.2	103
70.2 – 74.19	39
74.19 – 78.18	5
78.18 – 82.17	1

lon numeric feature

This column holds geographic longitude, spanning nearly the full valid range from -179.28° to 179.45°. The mean (3.07°) and median (8.32°) both sit modestly east of the Prime Meridian, reflecting a concentration of records at European/African longitudes (the two bins from about 0° to 18° hold over 2,500 rows), while the wide IQR of 58.74° and std of 69.12° confirm global scatter. 806 rows (11.66%) are flagged as outliers. With Q1 = -40.75° and Q3 = 17.99°, the 1.5 × IQR fences sit near -128.9° and 106.1°, so the flagged rows lie in East Asia, Oceania and the Pacific, while most of the Americas fall inside the fences. They are genuine geographic extremes relative to the modal cluster, not erroneous values.

Treatment: Use as-is or pair with latitude for spatial modelling; for ML pipelines, consider converting to radians (longitude wraps at ±180°, so a sine/cosine or unit-sphere encoding avoids a false gap at the antimeridian) or embedding via geohash.

anthropic:default · confidence high
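Longitude wraps at ±180°, so raw degrees put the column's two extremes (-179.28° and 179.45°) far apart even though they are neighbours on the globe. One possible encoding, offered here as an assumption rather than as Saturn's recommendation, converts each lat/lon pair to radians and then to a point on the unit sphere. The function name `to_unit_sphere` is hypothetical.

```python
import math

def to_unit_sphere(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

# The two longitude extremes from the stats table, placed on the equator.
west = to_unit_sphere(0.0, -179.28)
east = to_unit_sphere(0.0, 179.45)

# Straight-line (chord) distance between the two points; it is small
# because they are only about 1.27 degrees apart across the antimeridian.
gap = math.dist(west, east)
print(round(gap, 4))
```

The three coordinates can be fed to a model in place of the raw degree columns; a geohash would be an alternative when discrete cells are preferred.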

Out[19]:

saturn.columns["lon"].stats

stat	value
n	6,914
nulls	0 (0.0%)
unique	6,910
min	-179.3
max	179.4
mean	3.067
median	8.322
std	69.12
q1	-40.75
q3	17.99
iqr	58.74
skew	0.5093
kurtosis	0.9211
n_outliers	806
outlier_rate	0.1166
zero_rate	0
alert: outliers	11.7% rows beyond 1.5 IQR

Fig 10.

Distribution of lon. Vertical dash marks the median.

Histogram bins for lon (median: 8.3217831).
bin	count
-179.3 – -170.3	54
-170.3 – -161.3	17
-161.3 – -152.4	7
-152.4 – -143.4	5
-143.4 – -134.4	4
-134.4 – -125.5	15
-125.5 – -116.5	266
-116.5 – -107.5	15
-107.5 – -98.57	3
-98.57 – -89.6	39
-89.6 – -80.63	161
-80.63 – -71.66	521
-71.66 – -62.7	174
-62.7 – -53.73	212
-53.73 – -44.76	147
-44.76 – -35.79	158
-35.79 – -26.82	39
-26.82 – -17.85	31
-17.85 – -8.886	111
-8.886 – 0.08197	631
0.08197 – 9.05	1144
9.05 – 18.02	1435
18.02 – 26.99	411
26.99 – 35.96	272
35.96 – 44.92	92
44.92 – 53.89	101
53.89 – 62.86	61
62.86 – 71.83	8
71.83 – 80.8	31
80.8 – 89.76	7
89.76 – 98.73	7
98.73 – 107.7	23
107.7 – 116.7	34
116.7 – 125.6	44
125.6 – 134.6	48
134.6 – 143.6	44
143.6 – 152.5	109
152.5 – 161.5	69
161.5 – 170.5	163
170.5 – 179.4	201

year_sunk categorical metadata

This column records the year (or date) a vessel was sunk, but it is almost entirely empty: 6,877 of the 6,914 rows (99.46%) are null, leaving only 37 non-null values. Among those, the formats are highly inconsistent: bare years ('1942', '1854'), full dates in several formats ('30 June 1890', 'June 7, 1928', '1937-09-02'), partial dates ('1963-02'), approximations ('1490s', '~1700'), and even a range ('1643..1663'), making normalisation non-trivial. With 36 unique values across 37 populated rows the column is near-unique within its populated set, and the top value '1942' appears only twice.

Treatment: Detect the format with regular expressions, then parse and normalise to an integer year; treat as sparse metadata and, given 99.46% nulls, do not use it as a primary feature.

anthropic:default · confidence high
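The regex-based normalisation mentioned in the treatment could look like the sketch below. It is a simplification under stated assumptions: it keeps only the first standalone four-digit run, so a range collapses to its earlier year and markers such as '~' or a trailing 's' are discarded. The helper `extract_year` is hypothetical.

```python
import re

# First standalone 4-digit run. For a range such as '1643..1663' this
# keeps the earlier year, which is an arbitrary choice for illustration.
_YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")

def extract_year(raw):
    """Return the first 4-digit year in raw as an int, or None."""
    if raw is None:
        return None
    match = _YEAR.search(str(raw))
    return int(match.group(1)) if match else None

# Formats taken from the top-values table for year_sunk.
samples = ["1942", "30 June 1890", "1937-09-02", "1963-02",
           "1643..1663", "June 7, 1928", "1490s", "~1700"]
years = [extract_year(s) for s in samples]
print(years)
```

If the uncertainty matters, a second column recording the original string or a precision flag would preserve what the integer year drops.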

Out[22]:

saturn.columns["year_sunk"].stats

stat	value
n	6,914
nulls	6,877 (99.5%)
unique	36
top_value	1942
top_rate	0.05405
cardinality	36
entropy	5.155
entropy_ratio	0.9972
alert: long_tail	35 singleton categories
alert: null_rate	99.5% null

Fig 11.

Top values for year_sunk.

Top values for year_sunk (20 unique shown, of 36 total).
value	count	share
1942	2	0.0%
30 June 1890	1	0.0%
1854	1	0.0%
1971	1	0.0%
1937-09-02	1	0.0%
1963-02	1	0.0%
1643..1663	1	0.0%
1982	1	0.0%
June 7, 1928	1	0.0%
1435	1	0.0%
1920-12-16	1	0.0%
1490s	1	0.0%
~1700	1	0.0%
20 April 1943	1	0.0%
25 May 1963	1	0.0%
1710	1	0.0%
1915	1	0.0%
1909	1	0.0%
1951	1	0.0%
1952	1	0.0%

type categorical label

This column classifies underwater or maritime wreck sites by vessel/object type, with 31 distinct categories across 6,914 records and no nulls. The distribution is heavily dominated by 'shipwreck' (73.5% of records) and 'wreck' (19.5%), which together account for about 93% of all entries. The remaining 29 categories share just ~7% (with 'ship' alone at 5.5%), confirming the long-tail alert. The near-redundancy between 'shipwreck', 'wreck', and 'ship' (plus 'boat', 'barge') suggests inconsistent taxonomy that may need consolidation before modelling.

Treatment: Consolidate overlapping categories (e.g. 'shipwreck'/'wreck'/'ship') into a canonical taxonomy, then one-hot or target-encode for modelling.

anthropic:default · confidence high
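Consolidation can be expressed as a lookup table from raw labels to canonical groups. The mapping below is hypothetical: which labels belong together is a judgment call, and treating the generic 'wreck' as a vessel is an assumption, since that label could cover other objects.

```python
from collections import Counter

# Hypothetical mapping; adjust the groups to the analysis at hand.
CANONICAL = {
    "shipwreck": "vessel", "wreck": "vessel", "ship": "vessel",
    "boat": "vessel", "barge": "vessel", "schooner": "vessel",
    "sailboat": "vessel", "battleship": "vessel", "steamer": "vessel",
    "paddle steamer": "vessel", "submarine": "submarine",
    "aircraft": "aircraft", "plane": "aircraft", "airplane": "aircraft",
    "vehicle": "land_vehicle", "motor_vehicle": "land_vehicle",
    "car": "land_vehicle", "motorbike": "land_vehicle",
    "freightcar": "land_vehicle", "train": "land_vehicle",
}

def consolidate(counts):
    """Sum raw type counts into canonical groups; unmapped labels go to 'other'."""
    merged = Counter()
    for label, n in counts.items():
        merged[CANONICAL.get(label, "other")] += n
    return merged

# Counts for a few labels, copied from the top-values table for type.
raw = {"shipwreck": 5081, "wreck": 1345, "ship": 381,
       "aircraft": 17, "plane": 10, "airplane": 1}
merged = consolidate(raw)
print(merged)
```

Routing unmapped labels to 'other' keeps the 11 categories not shown in the table from raising errors, at the cost of hiding them until the mapping is extended.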

Out[25]:

saturn.columns["type"].stats

stat	value
n	6,914
nulls	0 (0.0%)
unique	31
top_value	shipwreck
top_rate	0.7349
cardinality	31
entropy	1.166
entropy_ratio	0.2353
alert: long_tail	17 singleton categories

Fig 12.

Top values for type.

Top values for type (20 unique shown, of 31 total).
value	count	share
shipwreck	5081	73.5%
wreck	1345	19.5%
ship	381	5.5%
barge	27	0.4%
submarine	18	0.3%
aircraft	17	0.2%
plane	10	0.1%
boat	4	0.1%
vehicle	3	0.0%
motor_vehicle	3	0.0%
schooner	2	0.0%
car	2	0.0%
sailboat	2	0.0%
battleship	2	0.0%
steamer	1	0.0%
airplane	1	0.0%
freightcar	1	0.0%
train	1	0.0%
paddle steamer	1	0.0%
motorbike	1	0.0%

wikipedia categorical metadata

This column stores Wikipedia article links associated with dataset entities (ships and aircraft), formatted as language-prefixed titles (e.g., 'en:SS Edmund Fitzgerald', 'fr:Armorique (navire)'). The null rate is extremely high at 95.47%, so only 313 of 6,914 rows have any Wikipedia reference. Among populated values, cardinality is very high (307 unique values across 313 non-null rows), with the top value appearing only 4 times, which indicates near-unique coverage and a long-tail distribution. Several languages are present (English 'en:', French 'fr:', Arabic 'ar:', Estonian 'et:'), which could complicate any downstream lookup or joining logic.

Treatment: Use as an optional enrichment link; do not use in modelling due to 95.47% nulls and near-unique cardinality; parse language prefix if language-specific resolution is needed.

anthropic:default · confidence high
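Parsing the language prefix is a small string operation. The sketch below assumes every value follows the 'lang:Title' pattern seen in the table, optionally with a '#Section' suffix; a value with no prefix but a colon in its title would be split incorrectly. `split_wikipedia_tag` is a hypothetical helper.

```python
def split_wikipedia_tag(value):
    """Split 'lang:Title#Section' into (lang, title, section or None)."""
    lang, sep, rest = value.partition(":")
    if not sep:
        # No prefix present; leave the language unknown.
        lang, rest = None, value
    title, _, section = rest.partition("#")
    return lang, title, section or None

# Values copied from the top-values table for wikipedia.
parsed = [split_wikipedia_tag(v) for v in (
    "en:SS Edmund Fitzgerald",
    "fr:Armorique (navire)",
    "en:Kroombit Tops National Park#Crash site",
)]
print(parsed)
```

Grouping on the first tuple element then gives the per-language counts needed to decide whether language-specific resolution is worth the effort.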

Out[28]:

saturn.columns["wikipedia"].stats

stat	value
n	6,914
nulls	6,601 (95.5%)
unique	307
top_value	en:SS Edmund Fitzgerald
top_rate	0.01278
cardinality	307
entropy	8.245
entropy_ratio	0.998
alert: long_tail	303 singleton categories
alert: null_rate	95.5% null

Fig 13.

Top values for wikipedia.

Top values for wikipedia (20 unique shown, of 307 total).
value	count	share
en:SS Edmund Fitzgerald	4	0.1%
fr:Armorique (navire)	2	0.0%
en:Curtiss C-46 Commando	2	0.0%
en:USS Amesbury	2	0.0%
en:SS America (1939)	1	0.0%
ar:سفينة زيستل جورم	1	0.0%
en:BOS 400	1	0.0%
en:New Carissa	1	0.0%
en:MV Cita	1	0.0%
en:SS Richard Montgomery	1	0.0%
en:Kroombit Tops National Park#Crash site	1	0.0%
en:Astron (ship)	1	0.0%
en:SS Yongala	1	0.0%
et:Raketa (laev, 1949)	1	0.0%
en:USNS General Hoyt S. Vandenberg (T-AGM-10)	1	0.0%
en:USS Oriskany (CV-34)	1	0.0%
en:USS Massachusetts (BB-2)	1	0.0%
en:Water Witch (schooner)	1	0.0%
en:Burlington Bay Horse Ferry	1	0.0%
en:Champlain II	1	0.0%

wikidata categorical foreign_key

This column stores Wikidata entity identifiers (Q-codes), linking dataset records to Wikidata knowledge graph entries. The most striking signal is the extreme null rate of 94.78%, meaning only 361 of 6,914 rows carry a Wikidata link at all. Among the 353 unique Q-codes present, the distribution is nearly flat: the top value 'Q1286267' appears only 4 times, the entropy ratio is 0.998, and the long-tail alert reports 347 singleton values, so almost every populated row points to a distinct entity.

Treatment: Use as an optional foreign key to enrich records via Wikidata API lookup; do not use as a feature directly given 94.78% null rate.

anthropic:default · confidence high
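Before any lookup, it may help to check that each value is a well-formed Q-code and to turn it into an entity page URL. This is a minimal sketch: `wikidata_url` is a hypothetical helper, and it does not call the Wikidata API or confirm that the entity exists.

```python
import re

_QID = re.compile(r"Q[1-9]\d*")

def wikidata_url(value):
    """Return the entity page URL for a well-formed Q-code, else None."""
    if value is None or not _QID.fullmatch(value.strip()):
        return None
    return "https://www.wikidata.org/wiki/" + value.strip()

# 'Q1286267' is the top value in the stats table; the others are
# made-up malformed inputs.
urls = [wikidata_url(v) for v in ("Q1286267", "q12", "Q12;Q13", None)]
print(urls)
```

Rows that return None can be set aside for manual review rather than sent to an enrichment step.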

Out[31]:

saturn.columns["wikidata"].stats

stat	value
n	6,914
nulls	6,553 (94.8%)
unique	353
top_value	Q1286267
top_rate	0.01108
cardinality	353
entropy	8.446
entropy_ratio	0.9979
alert: long_tail	347 singleton categories
alert: null_rate	94.8% null

Fig 14.

Top values for wikidata.

Top values for wikidata (20 unique shown, of 353 total).
value	count	share
Q1286267	4	0.1%
Q959696	2	0.0%
Q215692	2	0.0%
Q2862787	2	0.0%
Q1145708	2	0.0%
Q11675753	2	0.0%
Q463091	1	0.0%
Q32276	1	0.0%
Q115709756	1	0.0%
Q14213801	1	0.0%
Q2877353	1	0.0%
Q7006376	1	0.0%
Q6719379	1	0.0%
Q41771616	1	0.0%
Q7394285	1	0.0%
Q7420193	1	0.0%
Q1359321	1	0.0%
Q4811601	1	0.0%
Q1424289	1	0.0%
Q1618842	1	0.0%

description categorical free_text

This column contains free-text descriptions of maritime wrecks or nautical features, with entries referencing WWII-era vessels, jetties, fishing boats, and abandoned craft. The most striking signal is the 94.87% null rate: only 355 of 6,914 rows carry a description, which makes the column nearly unusable at scale. The 304 unique values across those 355 populated rows give a very high entropy (8.05, ratio 0.976), indicating wide diversity in phrasing, and several languages are mixed (e.g., French 'Chaloupe abandonnée à terre', plus Spanish and Polish entries, alongside English).

Treatment: Exclude from modelling due to 94.87% null rate; if used, tokenize and embed the 5.13% populated values, and flag language mixing before NLP processing.

anthropic:default · confidence high

Out[34]:

saturn.columns["description"].stats

stat	value
n	6,914
nulls	6,559 (94.9%)
unique	304
top_value	WWII era concrete fuel barge converted into breakwater
top_rate	0.03944
cardinality	304
entropy	8.052
entropy_ratio	0.9763
alert: long_tail	282 singleton categories
alert: null_rate	94.9% null

Fig 15.

Top values for description.

Top values for description (20 unique shown, of 304 total).
value	count	share
WWII era concrete fuel barge converted into breakwater	14	0.2%
Wrecks	7	0.1%
WWII concrete barge sunk as part of jetty, partially covered by jetty and fill	5	0.1%
Location is based on divers hand drawn maps. Due to the wreak breaking up and salvage, the wreak is scattered over a large area.	4	0.1%
Partially sunken ships	4	0.1%
Concrete petrol barge sunk as part of breakwater	4	0.1%
Wrecks of Zulu fishing boats	3	0.0%
Chaloupe abandonnée à terre	3	0.0%
WWII era concrete fuel barge sunk as part of jetty foundation	3	0.0%
Armada Ship	2	0.0%
remains of sunken wooden boats	2	0.0%
Hundido el 3 de julio de 1898 durante la batalla naval de Santiago de Cuba en la Guerra Hispano-Cubana-Norteamericana.	2	0.0%
09/09/2006 : Epave en bois, longue de 20 mètres, large de 4 mètres et haute de 3 mètres.	2	0.0%
Steamer	2	0.0%
Iron-hulled barque	2	0.0%
On shore wreck of a small abandoned wooden ship.	2	0.0%
Épave	2	0.0%
Dojście do wraków w zasadzie wolne. Jednak mogą wystąpić sytuacje gdy będzie to utrudnione lub niemożliwe.	2	0.0%
Wrecked sealing vessel	2	0.0%
Staten Island boat graveyard	2	0.0%

heritage categorical feature

This column appears to encode a 'heritage' flag or classification with only 4 distinct values ('1', '2', 'no', 'yes'), suggesting a binary or ordinal attribute that may have been inconsistently encoded across sources. The critical finding is a null rate of 99.81%, meaning only 13 of 6,914 rows have any value at all — rendering this column nearly useless for modelling. Among those 13 non-null values, '2' dominates at 76.9%, while 'no', 'yes', and '1' each appear only once, indicating a mixed encoding scheme (numeric vs. boolean strings) on an already negligible sample.

Treatment: Drop this column; 99.81% null rate and only 13 non-null observations make it statistically unusable.

anthropic:default · confidence high

Out[37]:

saturn.columns["heritage"].stats

stat	value
n	6,914
nulls	6,901 (99.8%)
unique	4
top_value	2
top_rate	0.7692
cardinality	4
entropy	1.145
entropy_ratio	0.5726
alert: long_tail	3 singleton categories
alert: null_rate	99.8% null

Fig 16.

Top values for heritage.

Top values for heritage (4 unique shown, of 4 total).
value	count	share
2	10	0.1%
no	1	0.0%
yes	1	0.0%
1	1	0.0%

access categorical feature

This column appears to encode access permission or restriction tags for geographic features (likely OpenStreetMap-style data), with values such as 'yes', 'no', 'permit', 'private', 'permissive', and 'customers'. The striking finding is a 92.64% null rate — only 509 of 6,914 rows carry a value — meaning this attribute is almost entirely absent from the dataset. Among the non-null values, 'yes' dominates heavily at 66.99% of populated rows, suggesting most tagged features have open access.

Treatment: Flag extreme sparsity (92.64% nulls); treat nulls as a distinct 'untagged' category or drop column if missingness renders it uninformative for modelling.

anthropic:default · confidence high

Out[40]:

saturn.columns["access"].stats

stat	value
n	6,914
nulls	6,405 (92.6%)
unique	8
top_value	yes
top_rate	0.6699
cardinality	8
entropy	1.647
entropy_ratio	0.549
alert: null_rate	92.6% null

Fig 17.

Top values for access.

Top values for access (8 unique shown, of 8 total).
value	count	share
yes	341	4.9%
no	73	1.1%
permit	27	0.4%
private	27	0.4%
unknown	20	0.3%
permissive	17	0.2%
customers	3	0.0%
foot	1	0.0%

depth categorical feature

This column represents a numeric depth measurement (likely in metres or similar units) stored as a categorical string, with values ranging from decimals like '1.1' and '19.2' to integers like '20'. The most striking signal is a null rate of 77.36%, meaning only 1,565 of 6,914 rows carry a value. Missingness this severe warrants investigation into whether it is structural (e.g., not applicable to certain record types) or a data quality issue. Among populated rows, cardinality is very high (502 unique values) with an entropy ratio of 0.956, indicating nearly uniform spread and essentially no dominant depth value: the top value '12.4' appears only 11 times. The column should be cast to numeric before any modelling use.

Treatment: Cast to float, investigate structural vs. random missingness before imputing or dropping nulls, then use as a numeric feature.

anthropic:default · confidence high
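Casting to float is safer with a parser that returns None for anything it cannot read instead of raising. The sketch below assumes values are plain decimal strings; unit suffixes and other free text are treated as unparseable, and `parse_depth` is a hypothetical helper.

```python
import math

def parse_depth(raw):
    """Return depth as a float, or None when missing or not numeric."""
    if raw is None:
        return None
    text = str(raw).strip().replace(",", ".")  # tolerate decimal commas
    try:
        value = float(text)
    except ValueError:
        return None
    # float() also accepts 'nan' and 'inf'; reject those as well.
    return value if math.isfinite(value) else None

# '12.4' and '16' appear in the top-values table; the rest are
# hypothetical inputs chosen to exercise the edge cases.
depths = [parse_depth(v) for v in ("12.4", "16", " 7 ", "12,5", "10 m", None)]
print(depths)
```

Counting how many populated rows come back as None is a quick way to see whether the column is clean enough to use as a numeric feature.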

Out[43]:

saturn.columns["depth"].stats

stat	value
n	6,914
nulls	5,349 (77.4%)
unique	502
top_value	12.4
top_rate	0.007029
cardinality	502
entropy	8.579
entropy_ratio	0.9563
alert: null_rate	77.4% null

Fig 18.

Top values for depth.

Top values for depth (20 unique shown, of 502 total).
value	count	share
12.4	11	0.2%
16	11	0.2%
18	11	0.2%
15.5	11	0.2%
19.2	11	0.2%
1.1	10	0.1%
17.4	10	0.1%
15.6	10	0.1%
7	10	0.1%
14	10	0.1%
5	10	0.1%
15.1	10	0.1%
6.4	9	0.1%
9	9	0.1%
8	9	0.1%
19	9	0.1%
15.2	9	0.1%
20	9	0.1%
16.4	9	0.1%
18.5	9	0.1%

seamark_type categorical label

This column contains a nautical/maritime classification type for seamarks, most likely drawn from an OpenStreetMap or similar marine charting schema. The distribution is heavily dominated by 'wreck', which accounts for 78.3% of the 6,455 non-null rows (73.1% of all 6,914 rows); the next largest category, 'dangerous', covers only 8.6% of all rows, giving a low entropy ratio of 0.307. The mix of subtypes (hull_showing, mast_showing, distributed_remains, hulk) suggests these are sub-classifications of wrecks that could have been normalized into a hierarchy rather than a flat taxonomy. The 6.6% null rate warrants attention if completeness matters for navigation safety contexts.

Treatment: One-hot or ordinal encode; consider grouping wreck subtypes (hull_showing, mast_showing, hulk, distributed_remains) into a parent 'wreck' hierarchy before modelling, and impute or flag the 6.6% nulls.

anthropic:default · confidence high
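The stats table reports a `top_rate` of 0.7831 for 'wreck', while the top-values table shows a 73.1% share for the same count. The two figures are consistent with different denominators (non-null rows versus all rows). That reading is inferred from the reported numbers, not stated by the report, and the arithmetic can be checked as follows.

```python
def shares(count, n_rows, n_nulls):
    """Return (share of all rows, share of non-null rows) as percentages."""
    populated = n_rows - n_nulls
    return 100 * count / n_rows, 100 * count / populated

# 'wreck' count, row count, and null count from the seamark_type tables.
of_all, of_populated = shares(5055, 6914, 459)
print(round(of_all, 1), round(of_populated, 1))
```

The same distinction applies to the other sparse columns, where the gap between the two shares is much larger.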

Out[46]:

saturn.columns["seamark_type"].stats

stat	value
n	6,914
nulls	459 (6.6%)
unique	15
top_value	wreck
top_rate	0.7831
cardinality	15
entropy	1.2
entropy_ratio	0.307

Fig 19.

Top values for seamark_type.

Top values for seamark_type (15 unique shown, of 15 total).
value	count	share
wreck	5055	73.1%
dangerous	598	8.6%
non-dangerous	358	5.2%
distributed_remains	306	4.4%
hulk	56	0.8%
hull_showing	46	0.7%
shoreline_construction	14	0.2%
mast_showing	8	0.1%
obstruction	7	0.1%
harbour	2	0.0%
restricted_area	1	0.0%
plane	1	0.0%
beacon_special_purpose	1	0.0%
landmark	1	0.0%
no	1	0.0%

osm_id numeric identifier

This column is an OpenStreetMap (OSM) object identifier, a large integer surrogate key assigned by the OSM platform to geographic features. Every one of the 6,914 rows has a distinct value with zero nulls, so it functions purely as an identifier. Note that OSM numbers nodes and ways in separate sequences, so an ID is only guaranteed unique in combination with `osm_type`, even though no collisions occur here. The value range (about 13 million to 13.7 billion) and flat distribution (kurtosis −1.45, spread across an IQR of about 8.4 billion) are consistent with OSM's incrementally assigned ID space across different data vintages. No outliers are flagged, and the mild positive skew (0.44) suggests a slight concentration of older, lower-numbered IDs.

Treatment: Retain, together with `osm_type`, as a join/lookup key to OSM data; drop from any model feature set as it carries no predictive signal.

anthropic:default · confidence high
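Because OSM keeps separate ID sequences for nodes, ways, and relations, a lookup key is safest when it carries the element type alongside the number. The helpers `osm_key` and `osm_url` below are hypothetical, and the ID used is a made-up example rather than a row from the dataset.

```python
def osm_key(osm_type, osm_id):
    """Composite key; OSM numbers nodes, ways, and relations separately."""
    return (osm_type, int(osm_id))

def osm_url(osm_type, osm_id):
    """Browse URL for an element on openstreetmap.org."""
    if osm_type not in {"node", "way", "relation"}:
        raise ValueError(f"unexpected osm_type: {osm_type!r}")
    return f"https://www.openstreetmap.org/{osm_type}/{int(osm_id)}"

# The id below is a made-up example, not a row from the dataset.
key = osm_key("way", 123456789)
url = osm_url("way", 123456789)
print(key, url)
```

Using the tuple as a dictionary key or join column avoids accidental matches between a node and a way that happen to share a number.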

Out[49]:

saturn.columns["osm_id"].stats

stat	value
n	6,914
nulls	0 (0.0%)
unique	6,914
min	1.306e+07
max	1.371e+10
mean	5.365e+09
median	3.145e+09
std	4.464e+09
q1	1.348e+09
q3	9.788e+09
iqr	8.44e+09
skew	0.4355
kurtosis	-1.453
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 20.

Distribution of osm_id. Vertical dash marks the median.

Histogram bins for osm_id (median: 3145392419.5).
bin	count
1.306e+07 – 3.555e+08	262
3.555e+08 – 6.979e+08	451
6.979e+08 – 1.04e+09	454
1.04e+09 – 1.383e+09	614
1.383e+09 – 1.725e+09	699
1.725e+09 – 2.068e+09	184
2.068e+09 – 2.41e+09	137
2.41e+09 – 2.752e+09	46
2.752e+09 – 3.095e+09	213
3.095e+09 – 3.437e+09	623
3.437e+09 – 3.78e+09	54
3.78e+09 – 4.122e+09	37
4.122e+09 – 4.464e+09	58
4.464e+09 – 4.807e+09	44
4.807e+09 – 5.149e+09	48
5.149e+09 – 5.492e+09	40
5.492e+09 – 5.834e+09	49
5.834e+09 – 6.177e+09	90
6.177e+09 – 6.519e+09	140
6.519e+09 – 6.861e+09	94
6.861e+09 – 7.204e+09	67
7.204e+09 – 7.546e+09	40
7.546e+09 – 7.889e+09	40
7.889e+09 – 8.231e+09	64
8.231e+09 – 8.573e+09	43
8.573e+09 – 8.916e+09	59
8.916e+09 – 9.258e+09	374
9.258e+09 – 9.601e+09	135
9.601e+09 – 9.943e+09	125
9.943e+09 – 1.029e+10	131
1.029e+10 – 1.063e+10	20
1.063e+10 – 1.097e+10	45
1.097e+10 – 1.131e+10	73
1.131e+10 – 1.166e+10	37
1.166e+10 – 1.2e+10	928
1.2e+10 – 1.234e+10	85
1.234e+10 – 1.268e+10	88
1.268e+10 – 1.302e+10	141
1.302e+10 – 1.337e+10	54
1.337e+10 – 1.371e+10	28

osm_type categorical feature

This column encodes the OpenStreetMap geometry type, distinguishing between point features ('node') and linear/polygonal features ('way'). With only 2 distinct values across 6,914 rows and zero nulls, it is a clean binary categorical. The distribution is moderately skewed: 'node' accounts for 72.3% (5,000 rows) versus 'way' at 27.7% (1,914 rows), which is consistent with OSM datasets where point POIs outnumber way geometries.

Treatment: One-hot encode or map to a binary flag (node=1, way=0) before modelling; check whether geometry type adds information beyond the other features before keeping it.

anthropic:default · confidence high

Out[52]:

saturn.columns["osm_type"].stats

stat	value
n	6,914
nulls	0 (0.0%)
unique	2
top_value	node
top_rate	0.7232
cardinality	2
entropy	0.8511
entropy_ratio	0.8511
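
The `entropy` and `entropy_ratio` values can be reproduced from the category counts. The sketch below assumes Shannon entropy in bits, with the ratio defined as entropy divided by log2 of the cardinality. That definition is inferred from the reported numbers (it also matches `type`, where 1.166 / log2(31) ≈ 0.2353) and is not confirmed by the report itself.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# node / way counts from the top-values table for osm_type.
counts = [5000, 1914]
h = entropy_bits(counts)
ratio = h / math.log2(len(counts))  # normalise by the maximum possible entropy
print(round(h, 4), round(ratio, 4))
```

With two categories the maximum entropy is exactly 1 bit, which is why the two reported values coincide for this column.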

Fig 21.

Top values for osm_type.

Top values for osm_type (2 unique shown, of 2 total).
value	count	share
node	5000	72.3%
way	1914	27.7%
