saturn·

data trove shipwrecks

saturn notebook · generated 2026-06-21

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json

Saturn profiled 6,914 rows across 14 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json",
    "--findings", "data-trove-shipwrecks.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This dataset is an OpenStreetMap-derived catalogue of 6,914 shipwrecks and related maritime hazards mapped globally. Start with the `type` and `seamark_type` columns: both are dominated by a single generic label ('shipwreck' at 73.5% of rows in `type`; 'wreck' at 78.3% of populated rows in `seamark_type`), with a long tail of submarines, aircraft, barges, and other vessels worth examining. A secondary point of interest is the high null rate across many descriptive fields: `heritage` (99.8% null), `year_sunk` (99.5% null), and `wikipedia` (95.5% null). Rich contextual data exists for only a tiny fraction of wrecks, so the dataset is far more useful as a spatial inventory than as a historical record. The `access` column is 92.6% null; where it is populated, most wrecks are tagged open ('yes', 67.0%), but a meaningful share require permits or are private, which could interest dive-site analysts.

citing: type.top_value · type.top_rate · seamark_type.top_value · seamark_type.top_rate · heritage.null_rate · year_sunk.null_rate · wikipedia.null_rate · access.null_rate · access.top_value · access.top_rate · row_count · lat.min · lat.max

Out[4]:

saturn.schema() · 14 columns

column kind n null% unique alerts
name text 6,914 0.0% 6,841 near_unique
lat numeric 6,914 0.0% 6,902 outliers
lon numeric 6,914 0.0% 6,910 outliers
year_sunk categorical 6,914 99.5% 36 long_tail null_rate
type categorical 6,914 0.0% 31 long_tail
wikipedia categorical 6,914 95.5% 307 long_tail null_rate
wikidata categorical 6,914 94.8% 353 long_tail null_rate
description categorical 6,914 94.9% 304 long_tail null_rate
heritage categorical 6,914 99.8% 4 long_tail null_rate
access categorical 6,914 92.6% 8 null_rate
depth categorical 6,914 77.4% 502 null_rate
seamark_type categorical 6,914 6.6% 15
osm_id numeric 6,914 0.0% 6,914
osm_type categorical 6,914 0.0% 2
Fig 1.
type · Look for the dominance of 'shipwreck' and 'wreck' and how thin the long tail of aircraft, submarines, and barges really is.
Top values for type (20 unique shown, of 31 total).
value           count  share
shipwreck       5081   73.5%
wreck           1345   19.5%
ship            381    5.5%
barge           27     0.4%
submarine       18     0.3%
aircraft        17     0.2%
plane           10     0.1%
boat            4      0.1%
vehicle         3      0.0%
motor_vehicle   3      0.0%
schooner        2      0.0%
car             2      0.0%
sailboat        2      0.0%
battleship      2      0.0%
steamer         1      0.0%
airplane        1      0.0%
freightcar      1      0.0%
train           1      0.0%
paddle steamer  1      0.0%
motorbike       1      0.0%
Fig 2.
seamark_type · Shows the navigational hazard classification — note how many wrecks are marked 'dangerous' or have 'distributed_remains' versus a clean hull.
Top values for seamark_type (15 unique shown, of 15 total).
value                   count  share
wreck                   5055   73.1%
dangerous               598    8.6%
non-dangerous           358    5.2%
distributed_remains     306    4.4%
hulk                    56     0.8%
hull_showing            46     0.7%
shoreline_construction  14     0.2%
mast_showing            8      0.1%
obstruction             7      0.1%
harbour                 2      0.0%
restricted_area         1      0.0%
plane                   1      0.0%
beacon_special_purpose  1      0.0%
landmark                1      0.0%
no                      1      0.0%
Fig 3.
access · Among the minority of wrecks with access data, check how many are freely accessible versus permit-only or private.
Top values for access (8 unique shown, of 8 total).
value       count  share
yes         341    4.9%
no          73     1.1%
permit      27     0.4%
private     27     0.4%
unknown     20     0.3%
permissive  17     0.2%
customers   3      0.0%
foot        1      0.0%
Fig 4.
lat · Distribution of wreck latitudes reveals geographic clustering: look for the concentration in northern mid-latitudes (roughly 34° to 58° N, peaking in the 50° to 54° bin) and the near-empty tails toward both poles.
Histogram bins for lat (median: 43.8517503).
bin                count
-77.42 – -73.44    1
-73.44 – -69.45    0
-69.45 – -65.46    0
-65.46 – -61.47    1
-61.47 – -57.48    1
-57.48 – -53.49    20
-53.49 – -49.5     39
-49.5 – -45.51     41
-45.51 – -41.52    50
-41.52 – -37.53    107
-37.53 – -33.54    235
-33.54 – -29.55    110
-29.55 – -25.56    56
-25.56 – -21.57    118
-21.57 – -17.58    67
-17.58 – -13.59    28
-13.59 – -9.597    30
-9.597 – -5.607    72
-5.607 – -1.617    73
-1.617 – 2.373     67
2.373 – 6.363      40
6.363 – 10.35      180
10.35 – 14.34      105
14.34 – 18.33      85
18.33 – 22.32      108
22.32 – 26.31      84
26.31 – 30.3       75
30.3 – 34.29       149
34.29 – 38.28      529
38.28 – 42.27      748
42.27 – 46.26      608
46.26 – 50.25      494
50.25 – 54.24      1302
54.24 – 58.23      846
58.23 – 62.22      212
62.22 – 66.21      85
66.21 – 70.2       103
70.2 – 74.19       39
74.19 – 78.18      5
78.18 – 82.17      1
Fig 5.
depth · The most frequent depth values (where recorded) fall mostly between 5 and 20, presumably metres. Each accounts for at most 11 rows, so this hints at, rather than establishes, a skew toward shallow, diveable wrecks over deep-sea losses.
Top values for depth (20 unique shown, of 502 total).
value  count  share
12.4   11     0.2%
16     11     0.2%
18     11     0.2%
15.5   11     0.2%
19.2   11     0.2%
1.1    10     0.1%
17.4   10     0.1%
15.6   10     0.1%
7      10     0.1%
14     10     0.1%
5      10     0.1%
15.1   10     0.1%
6.4    9      0.1%
9      9      0.1%
8      9      0.1%
19     9      0.1%
15.2   9      0.1%
20     9      0.1%
16.4   9      0.1%
18.5   9      0.1%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column        kind         null %
name          text         0.0%
lat           numeric      0.0%
lon           numeric      0.0%
year_sunk     categorical  99.5%
type          categorical  0.0%
wikipedia     categorical  95.5%
wikidata      categorical  94.8%
description   categorical  94.9%
heritage      categorical  99.8%
access        categorical  92.6%
depth         categorical  77.4%
seamark_type  categorical  6.6%
osm_id        numeric      0.0%
osm_type      categorical  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
        lat    lon    osm_id
lat     +1.00  -0.27  +0.16
lon     -0.27  +1.00  +0.16
osm_id  +0.16  +0.16  +1.00

name text label

This column contains the names of individual shipwrecks, as confirmed by the dominant top words: 'shipwreck' (5032 occurrences across 6914 rows), 'wreck', 'ss', 'uss', and 'hms'. With 6841 unique values out of 6914 rows and no nulls, it is essentially a name/label field. The 73 duplicates (1.06% duplicate rate) are mildly surprising: they may be the same wreck mapped more than once, or different wrecks that share a generic name. Lengths cluster tightly (median 20, p95 21 characters) with a long tail reaching 153, suggesting most names are concise vessel names while a minority carry extended descriptions.

Treatment: Use as a display label; investigate 73 duplicates for deduplication or record linkage before treating as a unique identifier.
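
As a starting point for the duplicate investigation, a minimal sketch in plain Python. The sample names are hypothetical, not taken from the dataset. The `n_duplicates` definition (rows minus unique values) is an assumption that happens to match the reported stats: 6,914 - 6,841 = 73.

```python
from collections import Counter


def duplicated_names(names):
    """Return {name: count} for every name that appears more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


def n_duplicates(names):
    """Rows beyond the first occurrence of each name (rows minus unique)."""
    return len(names) - len(set(names))


# Hypothetical sample; the real column has 6,914 rows.
sample = ["SS Yongala", "Shipwreck", "Shipwreck", "HMS Example", "Shipwreck"]
dupes = duplicated_names(sample)
extra_rows = n_duplicates(sample)
```

Names returned by `duplicated_names` can then be compared on lat, lon, and osm_id to decide whether they are one wreck or several.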

anthropic:default · confidence high
Out[13]:

saturn.columns["name"].stats

stat value
n 6,914
nulls 0 (0.0%)
unique 6,841
len_min 2
len_max 153
len_mean 18.35
len_median 20
len_p95 21
word_mean 2.058
word_median 2
n_empty 0
n_duplicates 73
duplicate_rate 0.01056
vocab_size 7,602
readability_flesch_mean 73.37
emoji_rate 0
url_rate 0
one_word_rate 0.0849
allcaps_rate 0.01403
boilerplate_rate 0
alert: near_unique · 98.9% of rows are unique strings
Fig 8.
Character-length distribution for name.
Character-length distribution for name (mean: 18.353630315302286).
chars      count
2 – 6      173
6 – 10     481
10 – 13    493
13 – 17    305
17 – 21    3911
21 – 25    1393
25 – 28    71
28 – 32    35
32 – 36    16
36 – 40    12
40 – 44    10
44 – 47    5
47 – 51    3
51 – 55    0
55 – 59    1
59 – 62    1
62 – 66    1
66 – 70    1
70 – 74    0
74 – 78    1
78 – 81    0
81 – 85    0
85 – 89    0
89 – 93    0
93 – 96    0
96 – 100   0
100 – 104  0
104 – 108  0
108 – 111  0
111 – 115  0
115 – 119  0
119 – 123  0
123 – 127  0
127 – 130  0
130 – 134  0
134 – 138  0
138 – 142  0
142 – 145  0
145 – 149  0
149 – 153  1

lat numeric feature

This column represents geographic latitude values, spanning from -77.42° (near Antarctica) to 82.17° (high Arctic), with 6,902 unique values across 6,914 rows. The mean (33.15°) sits notably below the median (43.85°), reflecting a left skew of -1.42: most records cluster in mid-to-high northern latitudes, and a long tail of equatorial and southern hemisphere records pulls the mean down. Roughly 12.5% of values (864 rows) are flagged as outliers. Under the 1.5 IQR rule (q1 26.58, IQR 27.29) the lower fence is about -14.4° and the upper fence (about 94.8°) lies beyond the valid range, so every flagged row is a southern hemisphere record, not a polar extreme or an error.

Treatment: Retain as-is for geospatial modelling; consider pairing with longitude and binning into geographic regions to handle the skewed distribution, and treat the flagged southern hemisphere rows as valid data.
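
The outliers alert reports rows "beyond 1.5 IQR". A minimal sketch of that arithmetic, assuming the alert uses standard Tukey fences (q1 - 1.5·IQR and q3 + 1.5·IQR); the report does not confirm saturn's exact rule. The quartiles are the ones listed in the stats tables for lat and lon.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the stats tables in this report.
lat_low, lat_high = tukey_fences(26.58, 53.87)
lon_low, lon_high = tukey_fences(-40.75, 17.99)

# Latitude cannot exceed 90, so under this rule no northern row can be flagged.
lat_upper_fence_is_reachable = lat_high <= 90.0
```

Under this assumption the lat fences come out near -14.4° and 94.8°, and the lon fences near -128.9° and 106.1°.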

anthropic:default · confidence high
Out[16]:

saturn.columns["lat"].stats

stat value
n 6,914
nulls 0 (0.0%)
unique 6,902
min -77.42
max 82.17
mean 33.15
median 43.85
std 29.88
q1 26.58
q3 53.87
iqr 27.29
skew -1.417
kurtosis 0.8666
n_outliers 864
outlier_rate 0.125
zero_rate 0
alert: outliers · 12.5% rows beyond 1.5 IQR
Fig 9.
Distribution of lat. Vertical dash marks the median.
Histogram bins for lat (median: 43.8517503).
bin                count
-77.42 – -73.44    1
-73.44 – -69.45    0
-69.45 – -65.46    0
-65.46 – -61.47    1
-61.47 – -57.48    1
-57.48 – -53.49    20
-53.49 – -49.5     39
-49.5 – -45.51     41
-45.51 – -41.52    50
-41.52 – -37.53    107
-37.53 – -33.54    235
-33.54 – -29.55    110
-29.55 – -25.56    56
-25.56 – -21.57    118
-21.57 – -17.58    67
-17.58 – -13.59    28
-13.59 – -9.597    30
-9.597 – -5.607    72
-5.607 – -1.617    73
-1.617 – 2.373     67
2.373 – 6.363      40
6.363 – 10.35      180
10.35 – 14.34      105
14.34 – 18.33      85
18.33 – 22.32      108
22.32 – 26.31      84
26.31 – 30.3       75
30.3 – 34.29       149
34.29 – 38.28      529
38.28 – 42.27      748
42.27 – 46.26      608
46.26 – 50.25      494
50.25 – 54.24      1302
54.24 – 58.23      846
58.23 – 62.22      212
62.22 – 66.21      85
66.21 – 70.2       103
70.2 – 74.19       39
74.19 – 78.18      5
78.18 – 82.17      1

lon numeric feature

This column contains geographic longitude values, spanning nearly the full valid range from -179.28° to 179.45° and covering both hemispheres. The mean (3.07°) and median (8.32°) are both modestly east of the Prime Meridian, suggesting a concentration of records in Europe/Africa, while the wide IQR of 58.74° and std of 69.12° confirm global scatter. Notably, 806 rows (11.66%) are flagged as outliers. With q1 at -40.75° and q3 at 17.99°, the 1.5 IQR fences fall near -128.9° and 106.1°, so the flagged rows lie mostly in East Asia, Oceania, and the Pacific; they are genuine geographic extremes relative to the modal cluster, not erroneous values.

Treatment: Use as-is or pair with latitude for spatial modelling; consider converting to radians or embedding via geohash for ML pipelines.
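
One way to act on the radians suggestion is to go a step further and map each (lat, lon) pair to a point on the unit sphere, so that longitudes on either side of the antimeridian end up close together. This is a sketch of that swapped-in technique, not something the profiler itself does; the function name is hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert degrees to an (x, y, z) point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Two equatorial points 0.2 degrees apart across the antimeridian.
near_west = to_unit_vector(0.0, -179.9)
near_east = to_unit_vector(0.0, 179.9)
squared_gap = sum((a - b) ** 2 for a, b in zip(near_west, near_east))
```

As raw numbers the two longitudes differ by 359.8; as unit vectors they are nearly identical, which is usually what a distance-based model needs.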

anthropic:default · confidence high
Out[19]:

saturn.columns["lon"].stats

stat value
n 6,914
nulls 0 (0.0%)
unique 6,910
min -179.3
max 179.4
mean 3.067
median 8.322
std 69.12
q1 -40.75
q3 17.99
iqr 58.74
skew 0.5093
kurtosis 0.9211
n_outliers 806
outlier_rate 0.1166
zero_rate 0
alert: outliers · 11.7% rows beyond 1.5 IQR
Fig 10.
Distribution of lon. Vertical dash marks the median.
Histogram bins for lon (median: 8.321783100000001).
bin                count
-179.3 – -170.3    54
-170.3 – -161.3    17
-161.3 – -152.4    7
-152.4 – -143.4    5
-143.4 – -134.4    4
-134.4 – -125.5    15
-125.5 – -116.5    266
-116.5 – -107.5    15
-107.5 – -98.57    3
-98.57 – -89.6     39
-89.6 – -80.63     161
-80.63 – -71.66    521
-71.66 – -62.7     174
-62.7 – -53.73     212
-53.73 – -44.76    147
-44.76 – -35.79    158
-35.79 – -26.82    39
-26.82 – -17.85    31
-17.85 – -8.886    111
-8.886 – 0.08197   631
0.08197 – 9.05     1144
9.05 – 18.02       1435
18.02 – 26.99      411
26.99 – 35.96      272
35.96 – 44.92      92
44.92 – 53.89      101
53.89 – 62.86      61
62.86 – 71.83      8
71.83 – 80.8       31
80.8 – 89.76       7
89.76 – 98.73      7
98.73 – 107.7      23
107.7 – 116.7      34
116.7 – 125.6      44
125.6 – 134.6      48
134.6 – 143.6      44
143.6 – 152.5      109
152.5 – 161.5      69
161.5 – 170.5      163
170.5 – 179.4      201

year_sunk categorical metadata

This column records the year (or date) a vessel was sunk, but it is almost entirely empty: 99.46% of the 6,914 rows are null, leaving only 37 non-null values. Among those, the formats are wildly inconsistent: bare years ('1942', '1854'), full dates in multiple formats ('30 June 1890', 'June 7, 1928', '1937-09-02'), partial dates ('1963-02'), approximations ('1490s', '~1700'), and even a range ('1643..1663'), making normalisation non-trivial. With 36 unique values across 37 populated rows the column is near-unique relative to its populated set, and the top value '1942' appears only twice.

Treatment: Parse and normalise to a standard year integer after regex-based format detection; treat as sparse metadata and do not use as a primary feature without an imputation strategy, given 99.46% nulls.
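
A minimal sketch of the regex-based normalisation, using only the sample values shown in this report. It takes the first standalone four-digit run as the year, which is a deliberately lossy simplification: ranges collapse to their first year, and decades and approximations lose their qualifier. The function name is hypothetical.

```python
import re

# A four-digit run not embedded in a longer digit string.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def first_year(raw):
    """Return the first four-digit year in raw, or None if there is none."""
    if raw is None:
        return None
    match = YEAR_RE.search(raw)
    return int(match.group(1)) if match else None


samples = ["1942", "30 June 1890", "June 7, 1928", "1937-09-02",
           "1963-02", "1643..1663", "1490s", "~1700"]
years = [first_year(s) for s in samples]
```

If the uncertainty matters downstream, a second column flagging ranges, decades, and '~' values would preserve what this function discards.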

anthropic:default · confidence high
Out[22]:

saturn.columns["year_sunk"].stats

stat value
n 6,914
nulls 6,877 (99.5%)
unique 36
top_value 1942
top_rate 0.05405
cardinality 36
entropy 5.155
entropy_ratio 0.9972
alert: long_tail · 35 singleton categories
alert: null_rate · 99.5% null
Fig 11.
Top values for year_sunk.
Top values for year_sunk (20 unique shown, of 36 total).
value          count  share
1942           2      0.0%
30 June 1890   1      0.0%
1854           1      0.0%
1971           1      0.0%
1937-09-02     1      0.0%
1963-02        1      0.0%
1643..1663     1      0.0%
1982           1      0.0%
June 7, 1928   1      0.0%
1435           1      0.0%
1920-12-16     1      0.0%
1490s          1      0.0%
~1700          1      0.0%
20 April 1943  1      0.0%
25 May 1963    1      0.0%
1710           1      0.0%
1915           1      0.0%
1909           1      0.0%
1951           1      0.0%
1952           1      0.0%

type categorical label

This column classifies underwater or maritime wreck sites by vessel/object type, with 31 distinct categories across 6,914 records and no nulls. The distribution is heavily dominated by 'shipwreck' (73.5% of records) and 'wreck' (19.5%), together accounting for about 93% of all entries; the remaining 29 categories share just ~7%, confirming the long-tail alert. The near-redundancy between 'shipwreck', 'wreck', and 'ship' (plus 'boat', 'barge') suggests inconsistent taxonomy that may need consolidation before modelling.

Treatment: Consolidate overlapping categories (e.g. 'shipwreck'/'wreck'/'ship') into a canonical taxonomy, then one-hot or target-encode for modelling.
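
A sketch of what the consolidation could look like. The three groups and their names are hypothetical choices for illustration, not a taxonomy taken from the dataset or from OpenStreetMap; in particular, folding the generic 'wreck' into the vessel group is an assumption that should be checked against the data.

```python
# Hypothetical canonical groups built from the 20 values shown in the report.
GROUPS = {
    "vessel": ["shipwreck", "wreck", "ship", "barge", "submarine", "boat",
               "schooner", "sailboat", "battleship", "steamer", "paddle steamer"],
    "aircraft": ["aircraft", "plane", "airplane"],
    "land_vehicle": ["vehicle", "motor_vehicle", "car", "freightcar",
                     "train", "motorbike"],
}

# Invert to a flat lookup: raw value -> canonical group.
CANONICAL = {raw: group for group, raws in GROUPS.items() for raw in raws}


def canonical_type(raw):
    """Map a raw type value to its group; unseen values fall back to 'other'."""
    return CANONICAL.get(raw.strip().lower(), "other")


labels = [canonical_type(v) for v in ["shipwreck", "Plane", "motorbike", "buoy"]]
```

The 11 type values not shown in the top-20 table would land in 'other' until they are reviewed and assigned.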

anthropic:default · confidence high
Out[25]:

saturn.columns["type"].stats

stat value
n 6,914
nulls 0 (0.0%)
unique 31
top_value shipwreck
top_rate 0.7349
cardinality 31
entropy 1.166
entropy_ratio 0.2353
alert: long_tail · 17 singleton categories
Fig 12.
Top values for type.
Top values for type (20 unique shown, of 31 total).
value           count  share
shipwreck       5081   73.5%
wreck           1345   19.5%
ship            381    5.5%
barge           27     0.4%
submarine       18     0.3%
aircraft        17     0.2%
plane           10     0.1%
boat            4      0.1%
vehicle         3      0.0%
motor_vehicle   3      0.0%
schooner        2      0.0%
car             2      0.0%
sailboat        2      0.0%
battleship      2      0.0%
steamer         1      0.0%
airplane        1      0.0%
freightcar      1      0.0%
train           1      0.0%
paddle steamer  1      0.0%
motorbike       1      0.0%

wikipedia categorical metadata

This column stores Wikipedia article links associated with dataset entities (ships and aircraft), formatted as language-prefixed slugs (e.g., 'en:SS Edmund Fitzgerald', 'fr:Armorique (navire)'). The null rate is extremely high at 95.47%, meaning only ~313 of 6,914 rows have any Wikipedia reference. Among populated values, cardinality is very high (307 unique values across ~313 non-null rows), with the top value appearing only 4 times — indicating near-unique coverage and a long-tail distribution. A language mix is present (English 'en:', French 'fr:', Arabic 'ar:'), which could complicate any downstream lookup or joining logic.

Treatment: Use as an optional enrichment link; do not use in modelling due to 95.47% nulls and near-unique cardinality; parse language prefix if language-specific resolution is needed.
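
Parsing the language prefix can be as simple as splitting on the first colon. This sketch assumes every populated value follows the 'lang:Title' pattern seen in the top values; a value without a colon is returned with no language, and the function name is hypothetical.

```python
def split_wikipedia_tag(tag):
    """Split 'lang:Title' into (lang, title); lang is None if no colon."""
    lang, sep, title = tag.partition(":")
    if not sep:
        return None, tag
    return lang, title


parsed = [split_wikipedia_tag(t) for t in [
    "en:SS Edmund Fitzgerald",
    "fr:Armorique (navire)",
    "en:Kroombit Tops National Park#Crash site",
]]
```

Only the first colon is treated as the separator, so titles that contain colons or '#' section anchors survive intact.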

anthropic:default · confidence high
Out[28]:

saturn.columns["wikipedia"].stats

stat value
n 6,914
nulls 6,601 (95.5%)
unique 307
top_value en:SS Edmund Fitzgerald
top_rate 0.01278
cardinality 307
entropy 8.245
entropy_ratio 0.998
alert: long_tail · 303 singleton categories
alert: null_rate · 95.5% null
Fig 13.
Top values for wikipedia.
Top values for wikipedia (20 unique shown, of 307 total).
value                                           count  share
en:SS Edmund Fitzgerald                         4      0.1%
fr:Armorique (navire)                           2      0.0%
en:Curtiss C-46 Commando                        2      0.0%
en:USS Amesbury                                 2      0.0%
en:SS America (1939)                            1      0.0%
ar:سفينة زيستل جورم                             1      0.0%
en:BOS 400                                      1      0.0%
en:New Carissa                                  1      0.0%
en:MV Cita                                      1      0.0%
en:SS Richard Montgomery                        1      0.0%
en:Kroombit Tops National Park#Crash site       1      0.0%
en:Astron (ship)                                1      0.0%
en:SS Yongala                                   1      0.0%
et:Raketa (laev, 1949)                          1      0.0%
en:USNS General Hoyt S. Vandenberg (T-AGM-10)   1      0.0%
en:USS Oriskany (CV-34)                         1      0.0%
en:USS Massachusetts (BB-2)                     1      0.0%
en:Water Witch (schooner)                       1      0.0%
en:Burlington Bay Horse Ferry                   1      0.0%
en:Champlain II                                 1      0.0%

wikidata categorical foreign_key

This column stores Wikidata entity identifiers (Q-codes), linking dataset records to Wikidata knowledge graph entries. The most striking signal is the extreme null rate of 94.78%, meaning only 361 of 6,914 rows carry a Wikidata link at all. Among the 353 unique Q-codes present, the distribution is nearly flat: the top value 'Q1286267' appears only 4 times, the entropy ratio is 0.998, and the long-tail alert confirms almost no repeated values, suggesting each populated row points to a distinct entity with minimal reuse.

Treatment: Use as an optional foreign key to enrich records via Wikidata API lookup; do not use as a feature directly given 94.78% null rate.
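
Before any lookup, it helps to validate the Q-code and turn it into a link. A minimal sketch that makes no network calls: it checks the 'Q' plus positive integer shape and builds the standard entity page URL. The function name is hypothetical, and the API call itself is left out.

```python
import re

# 'Q' followed by a positive integer with no leading zero.
QID_RE = re.compile(r"Q[1-9]\d*")


def wikidata_url(qid):
    """Return the Wikidata entity page URL, or None for null/malformed IDs."""
    if qid is None or not QID_RE.fullmatch(qid.strip()):
        return None
    return f"https://www.wikidata.org/wiki/{qid.strip()}"


url = wikidata_url("Q1286267")
```

Rows where this returns None (the 94.78% nulls plus any malformed values) can simply be skipped during enrichment.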

anthropic:default · confidence high
Out[31]:

saturn.columns["wikidata"].stats

stat value
n 6,914
nulls 6,553 (94.8%)
unique 353
top_value Q1286267
top_rate 0.01108
cardinality 353
entropy 8.446
entropy_ratio 0.9979
alert: long_tail · 347 singleton categories
alert: null_rate · 94.8% null
Fig 14.
Top values for wikidata.
Top values for wikidata (20 unique shown, of 353 total).
value       count  share
Q1286267    4      0.1%
Q959696     2      0.0%
Q215692     2      0.0%
Q2862787    2      0.0%
Q1145708    2      0.0%
Q11675753   2      0.0%
Q463091     1      0.0%
Q32276      1      0.0%
Q115709756  1      0.0%
Q14213801   1      0.0%
Q2877353    1      0.0%
Q7006376    1      0.0%
Q6719379    1      0.0%
Q41771616   1      0.0%
Q7394285    1      0.0%
Q7420193    1      0.0%
Q1359321    1      0.0%
Q4811601    1      0.0%
Q1424289    1      0.0%
Q1618842    1      0.0%

description categorical free_text

This column contains free-text descriptions of maritime wrecks or nautical features, with entries referencing WWII-era vessels, jetties, fishing boats, and abandoned craft. The most striking signal is the 94.87% null rate: only 355 of 6,914 rows carry a description, which makes the column nearly unusable at scale. Among the 304 unique values in those 355 populated rows, entropy is very high (8.05, ratio 0.976), indicating wide diversity in phrasing, and a language mix is evident (e.g., French 'Chaloupe abandonnée à terre', plus Spanish and Polish entries, alongside English).

Treatment: Exclude from modelling due to 94.87% null rate; if used, tokenize and embed the 5.13% populated values, and flag language mixing before NLP processing.

anthropic:default · confidence high
Out[34]:

saturn.columns["description"].stats

stat value
n 6,914
nulls 6,559 (94.9%)
unique 304
top_value WWII era concrete fuel barge converted into breakwater
top_rate 0.03944
cardinality 304
entropy 8.052
entropy_ratio 0.9763
alert: long_tail · 282 singleton categories
alert: null_rate · 94.9% null
Fig 15.
Top values for description.
Top values for description (20 unique shown, of 304 total).
value  count  share
WWII era concrete fuel barge converted into breakwater  14  0.2%
Wrecks  7  0.1%
WWII concrete barge sunk as part of jetty, partially covered by jetty and fill  5  0.1%
Location is based on divers hand drawn maps. Due to the wreak breaking up and salvage, the wreak is scattered over a large area.  4  0.1%
Partially sunken ships  4  0.1%
Concrete petrol barge sunk as part of breakwater  4  0.1%
Wrecks of Zulu fishing boats  3  0.0%
Chaloupe abandonnée à terre  3  0.0%
WWII era concrete fuel barge sunk as part of jetty foundation  3  0.0%
Armada Ship  2  0.0%
remains of sunken wooden boats  2  0.0%
Hundido el 3 de julio de 1898 durante la batalla naval de Santiago de Cuba en la Guerra Hispano-Cubana-Norteamericana.  2  0.0%
09/09/2006 : Epave en bois, longue de 20 mètres, large de 4 mètres et haute de 3 mètres.  2  0.0%
Steamer  2  0.0%
Iron-hulled barque  2  0.0%
On shore wreck of a small abandoned wooden ship.  2  0.0%
Épave  2  0.0%
Dojście do wraków w zasadzie wolne. Jednak mogą wystąpić sytuacje gdy będzie to utrudnione lub niemożliwe.  2  0.0%
Wrecked sealing vessel  2  0.0%
Staten Island boat graveyard  2  0.0%

heritage categorical feature

This column appears to encode a 'heritage' flag or classification with only 4 distinct values ('1', '2', 'no', 'yes'), suggesting a binary or ordinal attribute that may have been inconsistently encoded across sources. The critical finding is a null rate of 99.81%, meaning only 13 of 6,914 rows have any value at all — rendering this column nearly useless for modelling. Among those 13 non-null values, '2' dominates at 76.9%, while 'no', 'yes', and '1' each appear only once, indicating a mixed encoding scheme (numeric vs. boolean strings) on an already negligible sample.

Treatment: Drop this column; 99.81% null rate and only 13 non-null observations make it statistically unusable.

anthropic:default · confidence high
Out[37]:

saturn.columns["heritage"].stats

stat value
n 6,914
nulls 6,901 (99.8%)
unique 4
top_value 2
top_rate 0.7692
cardinality 4
entropy 1.145
entropy_ratio 0.5726
alert: long_tail · 3 singleton categories
alert: null_rate · 99.8% null
Fig 16.
Top values for heritage.
Top values for heritage (4 unique shown, of 4 total).
value  count  share
2      10     0.1%
no     1      0.0%
yes    1      0.0%
1      1      0.0%

access categorical feature

This column appears to encode access permission or restriction tags for geographic features (likely OpenStreetMap-style data), with values such as 'yes', 'no', 'permit', 'private', 'permissive', and 'customers'. The striking finding is a 92.64% null rate — only 509 of 6,914 rows carry a value — meaning this attribute is almost entirely absent from the dataset. Among the non-null values, 'yes' dominates heavily at 66.99% of populated rows, suggesting most tagged features have open access.

Treatment: Flag extreme sparsity (92.64% nulls); treat nulls as a distinct 'untagged' category or drop column if missingness renders it uninformative for modelling.

anthropic:default · confidence high
Out[40]:

saturn.columns["access"].stats

stat value
n 6,914
nulls 6,405 (92.6%)
unique 8
top_value yes
top_rate 0.6699
cardinality 8
entropy 1.647
entropy_ratio 0.549
alert: null_rate · 92.6% null
Fig 17.
Top values for access.
Top values for access (8 unique shown, of 8 total).
value       count  share
yes         341    4.9%
no          73     1.1%
permit      27     0.4%
private     27     0.4%
unknown     20     0.3%
permissive  17     0.2%
customers   3      0.0%
foot        1      0.0%

depth categorical feature

This column represents a numeric depth measurement (likely in metres or similar units) stored as a categorical string, with values that mix decimals like '1.1' and '19.2' with integers like '16'. The most striking signal is a null rate of 77.36%, meaning only 1,565 of 6,914 rows carry a value; this severe missingness warrants investigation into whether it is structural (e.g., not applicable to certain record types) or a data quality issue. Among populated rows, cardinality is very high (502 unique values) with an entropy ratio of 0.956, indicating nearly uniform spread and essentially no dominant depth value: the top value '12.4' appears only 11 times. The column should be cast to numeric before any modelling use.

Treatment: Cast to float, investigate structural vs. random missingness before imputing or dropping nulls, then use as a numeric feature.
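
A defensive cast might look like the sketch below. It assumes some populated values may not be clean numbers (the '12 m' case is a hypothetical example, not a value seen in this report) and returns None for those rather than raising. The function name is hypothetical.

```python
import math


def parse_depth(raw):
    """Cast a depth string to float; return None for nulls and non-numbers."""
    if raw is None:
        return None
    try:
        value = float(raw.strip())
    except ValueError:
        return None
    # float() also accepts 'nan' and 'inf'; treat those as missing.
    return value if math.isfinite(value) else None


depths = [parse_depth(v) for v in ["12.4", " 16 ", "12 m", None, "nan"]]
```

Counting how many populated values come back as None is a quick check on whether the column needs unit stripping before it can be used.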

anthropic:default · confidence high
Out[43]:

saturn.columns["depth"].stats

stat value
n 6,914
nulls 5,349 (77.4%)
unique 502
top_value 12.4
top_rate 0.007029
cardinality 502
entropy 8.579
entropy_ratio 0.9563
alert: null_rate · 77.4% null
Fig 18.
Top values for depth.
Top values for depth (20 unique shown, of 502 total).
value  count  share
12.4   11     0.2%
16     11     0.2%
18     11     0.2%
15.5   11     0.2%
19.2   11     0.2%
1.1    10     0.1%
17.4   10     0.1%
15.6   10     0.1%
7      10     0.1%
14     10     0.1%
5      10     0.1%
15.1   10     0.1%
6.4    9      0.1%
9      9      0.1%
8      9      0.1%
19     9      0.1%
15.2   9      0.1%
20     9      0.1%
16.4   9      0.1%
18.5   9      0.1%

seamark_type categorical label

This column contains a nautical/maritime classification type for seamarks, most likely drawn from an OpenStreetMap or similar marine charting schema. The distribution is severely dominated by 'wreck', which makes up 78.3% of the 6,455 populated rows (73.1% of all 6,914 rows); the next largest category, 'dangerous', accounts for only 8.6% of all rows, giving a low entropy ratio of 0.307. The mix of subtypes (hull_showing, mast_showing, distributed_remains, hulk) suggests these are sub-classifications of wrecks that could have been normalized into a hierarchy rather than a flat taxonomy. The 6.6% null rate warrants attention if completeness matters for navigation safety contexts.

Treatment: One-hot or ordinal encode; consider grouping wreck subtypes (hull_showing, mast_showing, hulk, distributed_remains) into a parent 'wreck' hierarchy before modelling, and impute or flag the 6.6% nulls.
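
The stats table reports top_rate 0.7831 for 'wreck' while the data table lists a 73.1% share. The arithmetic below, using only counts from this report, suggests the two figures use different denominators: top_rate over populated rows, share over all rows. That reading is inferred from the numbers, not documented by the tool.

```python
ROWS = 6914        # n
NULLS = 459        # nulls for seamark_type
WRECK = 5055       # count for 'wreck'
DANGEROUS = 598    # count for 'dangerous'

populated = ROWS - NULLS

wreck_share_all = WRECK / ROWS              # matches the 73.1% table share
wreck_share_populated = WRECK / populated   # matches top_rate 0.7831
dangerous_share_all = DANGEROUS / ROWS
dangerous_share_populated = DANGEROUS / populated
```

When comparing categories, quote both shares against the same denominator; mixing them overstates the gap between 'wreck' and 'dangerous'.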

anthropic:default · confidence high
Out[46]:

saturn.columns["seamark_type"].stats

stat value
n 6,914
nulls 459 (6.6%)
unique 15
top_value wreck
top_rate 0.7831
cardinality 15
entropy 1.2
entropy_ratio 0.307
Fig 19.
Top values for seamark_type.
Top values for seamark_type (15 unique shown, of 15 total).
value                   count  share
wreck                   5055   73.1%
dangerous               598    8.6%
non-dangerous           358    5.2%
distributed_remains     306    4.4%
hulk                    56     0.8%
hull_showing            46     0.7%
shoreline_construction  14     0.2%
mast_showing            8      0.1%
obstruction             7      0.1%
harbour                 2      0.0%
restricted_area         1      0.0%
plane                   1      0.0%
beacon_special_purpose  1      0.0%
landmark                1      0.0%
no                      1      0.0%

osm_id numeric identifier

This column is an OpenStreetMap (OSM) object identifier — a large integer surrogate key assigned by the OSM platform to geographic features. Every one of the 6,914 rows has a distinct value with zero nulls, confirming it functions purely as a unique identifier. The value range (13 M to ~13.7 B) and flat distribution (kurtosis −1.45, near-uniform spread across a ~8.4 B IQR) are consistent with OSM's incrementally assigned ID space across different data vintages. No outliers are flagged and the mild positive skew (0.44) suggests a slight concentration of older, lower-numbered IDs.

Treatment: Retain as a join/lookup key to OSM data, paired with osm_type because OSM numbers nodes and ways in separate ID spaces; drop from any model feature set as it carries no predictive signal.
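
For lookups, a sketch that builds a composite key and the public element URL from osm_type and osm_id. The helper names and the example ID are hypothetical. The `int()` cast is there in case the column was loaded as a float, which the scientific notation in the stats table hints at but does not confirm.

```python
VALID_TYPES = {"node", "way", "relation"}


def osm_key(osm_type, osm_id):
    """Composite key; the type is included because IDs repeat across types."""
    return (osm_type, int(osm_id))


def osm_url(osm_type, osm_id):
    """Build the openstreetmap.org page URL for one element."""
    if osm_type not in VALID_TYPES:
        raise ValueError(f"unexpected osm_type: {osm_type!r}")
    return f"https://www.openstreetmap.org/{osm_type}/{int(osm_id)}"


# Hypothetical ID, chosen only to show the shape of the output.
key = osm_key("way", 13060000.0)
url = osm_url("way", 13060000.0)
```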

anthropic:default · confidence high
Out[49]:

saturn.columns["osm_id"].stats

stat value
n 6,914
nulls 0 (0.0%)
unique 6,914
min 1.306e+07
max 1.371e+10
mean 5.365e+09
median 3.145e+09
std 4.464e+09
q1 1.348e+09
q3 9.788e+09
iqr 8.44e+09
skew 0.4355
kurtosis -1.453
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 20.
Distribution of osm_id. Vertical dash marks the median.
Histogram bins for osm_id (median: 3145392419.5).
bin                      count
1.306e+07 – 3.555e+08    262
3.555e+08 – 6.979e+08    451
6.979e+08 – 1.04e+09     454
1.04e+09 – 1.383e+09     614
1.383e+09 – 1.725e+09    699
1.725e+09 – 2.068e+09    184
2.068e+09 – 2.41e+09     137
2.41e+09 – 2.752e+09     46
2.752e+09 – 3.095e+09    213
3.095e+09 – 3.437e+09    623
3.437e+09 – 3.78e+09     54
3.78e+09 – 4.122e+09     37
4.122e+09 – 4.464e+09    58
4.464e+09 – 4.807e+09    44
4.807e+09 – 5.149e+09    48
5.149e+09 – 5.492e+09    40
5.492e+09 – 5.834e+09    49
5.834e+09 – 6.177e+09    90
6.177e+09 – 6.519e+09    140
6.519e+09 – 6.861e+09    94
6.861e+09 – 7.204e+09    67
7.204e+09 – 7.546e+09    40
7.546e+09 – 7.889e+09    40
7.889e+09 – 8.231e+09    64
8.231e+09 – 8.573e+09    43
8.573e+09 – 8.916e+09    59
8.916e+09 – 9.258e+09    374
9.258e+09 – 9.601e+09    135
9.601e+09 – 9.943e+09    125
9.943e+09 – 1.029e+10    131
1.029e+10 – 1.063e+10    20
1.063e+10 – 1.097e+10    45
1.097e+10 – 1.131e+10    73
1.131e+10 – 1.166e+10    37
1.166e+10 – 1.2e+10      928
1.2e+10 – 1.234e+10      85
1.234e+10 – 1.268e+10    88
1.268e+10 – 1.302e+10    141
1.302e+10 – 1.337e+10    54
1.337e+10 – 1.371e+10    28

osm_type categorical feature

This column encodes the OpenStreetMap geometry type, distinguishing between point features ('node') and linear/polygonal features ('way'). With only 2 distinct values across 6,914 rows and zero nulls, it is a clean binary categorical. The distribution is moderately skewed: 'node' accounts for 72.3% (5,000 rows) versus 'way' at 27.7% (1,914 rows), which is consistent with OSM datasets where point POIs outnumber way geometries.

Treatment: One-hot encode or map to binary flag (node=1, way=0) before modelling; consider whether geometry type meaningfully differs from other features in the pipeline.
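
The entropy and entropy_ratio stats can be reproduced from the value counts. The sketch below assumes entropy is Shannon entropy in bits and entropy_ratio divides it by log2 of the cardinality. That definition is inferred from the reported numbers for osm_type and heritage, not taken from saturn's documentation.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


osm_type_counts = [5000, 1914]     # node, way
heritage_counts = [10, 1, 1, 1]    # '2', 'no', 'yes', '1'

osm_type_entropy = entropy_bits(osm_type_counts)
heritage_entropy = entropy_bits(heritage_counts)
heritage_ratio = entropy_ratio(heritage_counts)
```

For a two-category column the ratio equals the entropy itself, which is why osm_type reports 0.8511 for both.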

anthropic:default · confidence high
Out[52]:

saturn.columns["osm_type"].stats

stat value
n 6,914
nulls 0 (0.0%)
unique 2
top_value node
top_rate 0.7232
cardinality 2
entropy 0.8511
entropy_ratio 0.8511
Fig 21.
Top values for osm_type.
Top values for osm_type (2 unique shown, of 2 total).
value  count  share
node   5000   72.3%
way    1914   27.7%

How to cite


BibTeX
@misc{saturn-data-trove-shipwrecks-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove shipwrecks},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-shipwrecks}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove shipwrecks. Source: /home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-shipwrecks