saturn·

data trove shipwrecks

source /home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json · 6,914 rows · 14 columns · profiled 2026-06-21

Reading

dataset summary · high confidence anthropic:default

This dataset is an OpenStreetMap-derived catalogue of 6,914 shipwrecks and related maritime hazards mapped globally. Start with the `type` and `seamark_type` columns: their top values, 'shipwreck' (73.5%) and 'wreck' (78.3%) respectively, dominate, leaving a long tail of submarines, aircraft, barges, and other vessels worth examining. A secondary point of interest is the high null rate across many descriptive fields: `heritage` (99.8% null), `year_sunk` (99.5% null), and `wikipedia` (95.5% null). Rich contextual data exists for only a tiny fraction of wrecks, so the dataset is far more useful as a spatial inventory than as a historical record. The `access` column is 92.6% null; where it is populated, about two thirds of wrecks are tagged as open ('yes'), while the rest carry values such as 'permit' or 'private', which could interest dive-site analysts.

citing: type.top_value · type.top_rate · seamark_type.top_value · seamark_type.top_rate · heritage.null_rate · year_sunk.null_rate · wikipedia.null_rate · access.null_rate · access.top_value · access.top_rate · row_count · lat.min · lat.max

Schema

14 columns
Per-column summary.

column        type         null %  unique  alerts
name          text           0.0%   6,841  near_unique
lat           numeric        0.0%   6,902  outliers
lon           numeric        0.0%   6,910  outliers
year_sunk     categorical   99.5%      36  long_tail, null_rate
type          categorical    0.0%      31  long_tail
wikipedia     categorical   95.5%     307  long_tail, null_rate
wikidata      categorical   94.8%     353  long_tail, null_rate
description   categorical   94.9%     304  long_tail, null_rate
heritage      categorical   99.8%       4  long_tail, null_rate
access        categorical   92.6%       8  null_rate
depth         categorical   77.4%     502  null_rate
seamark_type  categorical    6.6%      15
osm_id        numeric        0.0%   6,914
osm_type      categorical    0.0%       2
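
The null counts and top rates quoted in the per-column statistics can be turned back into row counts, which is a quick way to check the narrative figures. The sketch below assumes that `top_rate` is a share of non-null rows rather than of all rows; that reading is inferred from the numbers, not stated by the report.

```python
N_ROWS = 6914

# (null count, top_rate) pairs copied from the per-column statistics in this report.
COLUMNS = {
    "year_sunk": (6877, 0.05405),
    "wikipedia": (6601, 0.01278),
    "wikidata": (6553, 0.01108),
    "description": (6559, 0.03944),
    "heritage": (6901, 0.7692),
    "access": (6405, 0.6699),
    "depth": (5349, 0.007029),
    "seamark_type": (459, 0.7831),
}

def populated(nulls, n_rows=N_ROWS):
    """Number of rows that carry a value."""
    return n_rows - nulls

def top_count(nulls, top_rate, n_rows=N_ROWS):
    """Rows holding the top value, ASSUMING top_rate is relative to non-null rows."""
    return round(top_rate * populated(nulls, n_rows))

for name, (nulls, rate) in COLUMNS.items():
    print(f"{name:13s} populated={populated(nulls):5d} top_count={top_count(nulls, rate)}")
```

Under that assumption every top count comes out as a whole number (for example 2 of 37 for `year_sunk`), which supports the reading.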

name

text label near_unique
This column contains the names of individual shipwrecks; the dominant top words are 'shipwreck' (5032 occurrences across 6914 rows), 'wreck', 'ss', 'uss', and 'hms'. With 6841 unique values out of 6914 rows and a zero null rate, it is essentially a name/label field, but the 73 duplicates (1.06% duplicate rate) mean it cannot serve as a unique key: they may be the same wreck recorded more than once or different wrecks sharing a generic name. Lengths cluster tightly (median 20, p95 21 characters) with a long tail reaching 153, so most names are short and a minority carry extended descriptions; the frequency of 'shipwreck' suggests that many entries may be generic labels rather than proper vessel names. Treatment: Use as a display label; investigate the 73 duplicates for deduplication or record linkage before treating as a unique identifier. high · anthropic:default
n
6,914
nulls
0 (0.0%)
unique
6,841
len_min
2
len_max
153
len_mean
18.35
len_median
20
len_p95
21
word_mean
2.058
word_median
2
n_empty
0
n_duplicates
73
duplicate_rate
0.01056
vocab_size
7,602
readability_flesch_mean
73.37
emoji_rate
0
url_rate
0
one_word_rate
0.0849
allcaps_rate
0.01403
boilerplate_rate
0
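
One way to start the duplicate investigation is to group records by a normalised name and list the `osm_id` values that share it. This is a minimal sketch: the sample rows are made up for illustration, and the normalisation (trim plus case-fold) is an assumption that may surface more collisions than the profiler's 73, which were presumably counted on exact matches.

```python
from collections import defaultdict

def duplicate_name_groups(records):
    """Map each normalised name that occurs more than once to its osm_id values."""
    groups = defaultdict(list)
    for rec in records:
        key = rec["name"].strip().casefold()
        groups[key].append(rec["osm_id"])
    return {name: ids for name, ids in groups.items() if len(ids) > 1}

# Made-up sample rows for illustration only; not taken from the dataset.
sample = [
    {"name": "Shipwreck", "osm_id": 1},
    {"name": "shipwreck ", "osm_id": 2},
    {"name": "Example Vessel", "osm_id": 3},
]
print(duplicate_name_groups(sample))
```

Comparing `lat` and `lon` within each group would then help separate repeated records of one wreck from distinct wrecks with the same label.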

lat

numeric feature outliers
This column represents geographic latitude values, spanning from -77.42° (near Antarctica) to 82.17° (high Arctic), with 6,902 unique values across 6,914 rows. The mean (33.15°) sits well below the median (43.85°), consistent with the left skew of -1.42: most records cluster in mid-to-high northern latitudes, and a tail of equatorial and southern hemisphere records pulls the mean down. Roughly 12.5% of values (864 rows) are flagged as outliers. The report does not state its outlier rule, but under the common 1.5×IQR convention the fences fall at about -14.4° and 94.8°, so the flagged rows would all lie south of roughly 14°S and no Arctic value would be flagged. Treatment: Retain as-is for geospatial modelling; consider pairing with longitude and binning into geographic regions to handle the skewed distribution and the southern tail. high · anthropic:default
n
6,914
nulls
0 (0.0%)
unique
6,902
min
-77.42
max
82.17
mean
33.15
median
43.85
std
29.88
q1
26.58
q3
53.87
iqr
27.29
skew
-1.417
kurtosis
0.8666
n_outliers
864
outlier_rate
0.125
zero_rate
0
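
The location of the flagged latitudes can be reasoned about from the quartiles alone. The sketch below applies Tukey's 1.5×IQR fences to the reported `q1` and `q3`; that the profiler uses this rule is an assumption, since the report does not name its method.

```python
# Quartiles and range copied from the lat summary statistics in this report.
Q1, Q3 = 26.58, 53.87
LAT_MIN, LAT_MAX = -77.42, 82.17

def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = tukey_fences(Q1, Q3)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")
print("any value can exceed the upper fence:", LAT_MAX > upper)
print("any value can fall below the lower fence:", LAT_MIN < lower)
```

Because the upper fence lies beyond 90°, only the lower fence can flag anything under this rule, which places every flagged row in the southern hemisphere.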

lon

numeric feature outliers
This column contains geographic longitude values, spanning nearly the full valid range from -179.28° to 179.45° and covering both hemispheres. The mean (3.07°) and median (8.32°) both lie modestly east of the Prime Meridian, suggesting a concentration of records at European and African longitudes, while the wide IQR of 58.74° and std of 69.12° confirm global scatter. Notably, 806 rows (11.66%) are flagged as outliers. Under the common 1.5×IQR convention (the report does not state its rule) the fences fall at about -128.9° and 106.1°, so the flagged rows would sit in East Asia, Oceania, and the Pacific rather than in most of the Americas; these are genuine geographic extremes relative to the modal cluster, not erroneous values. Treatment: Use as-is or pair with latitude for spatial modelling; for ML pipelines, consider converting to radians with a sine/cosine encoding so that values near ±180° stay adjacent, or embedding via geohash. high · anthropic:default
n
6,914
nulls
0 (0.0%)
unique
6,910
min
-179.3
max
179.4
mean
3.067
median
8.322
std
69.12
q1
-40.75
q3
17.99
iqr
58.74
skew
0.5093
kurtosis
0.9211
n_outliers
806
outlier_rate
0.1166
zero_rate
0
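
Raw longitude is discontinuous at the antimeridian, which matters here because the column's extremes sit on either side of it. The following sketch shows one common workaround, a sine/cosine encoding; the function names are illustrative and not part of any pipeline described in the report.

```python
import math

def encode_lon(lon_deg):
    """Map longitude in degrees to (sin, cos) so that -180 and 180 coincide."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)

def lon_gap(a, b):
    """Euclidean distance between two encoded longitudes (0 = same, 2 = opposite)."""
    (sa, ca), (sb, cb) = encode_lon(a), encode_lon(b)
    return math.hypot(sa - sb, ca - cb)

# The extreme values reported for this column straddle the antimeridian.
west, east = -179.28, 179.45
print(abs(west - east))     # raw difference in degrees: very large
print(lon_gap(west, east))  # encoded distance: small
```

The two encoded components replace the single `lon` feature; pairing them with latitude gives a representation without the ±180° seam.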

year_sunk

categorical metadata long_tail null_rate
This column records the year (or date) a vessel was sunk, but it is almost entirely empty: 6,877 of the 6,914 rows (99.46%) are null, leaving only 37 non-null values. Among those, the formats are highly inconsistent: bare years ('1942', '1854'), full dates in multiple formats ('30 June 1890', 'June 7, 1928', '1937-09-02'), partial dates ('1963-02'), and even a range ('1643..1663'), making normalisation non-trivial. With 36 unique values across the 37 populated rows, the column is near-unique relative to its populated set, and the top value '1942' appears only twice. Treatment: Parse and normalise to a standard year integer after regex-based format detection; treat as sparse metadata and do not use as a primary feature without an imputation strategy, given 99.46% nulls. high · anthropic:default
n
6,914
nulls
6,877 (99.5%)
unique
36
top_value
1942
top_rate
0.05405
cardinality
36
entropy
5.155
entropy_ratio
0.9972
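
The formats listed above all contain a four-digit year, so a single regular expression can cover them. This is a minimal sketch tested only against the example strings quoted in the summary; returning `None` for ranges is a design choice made here, not something the report prescribes.

```python
import re

# A run of exactly four digits, not embedded in a longer number.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")

def extract_year(raw):
    """Return the single four-digit year found in raw, or None if absent or ambiguous."""
    if raw is None:
        return None
    years = {int(y) for y in YEAR_RE.findall(str(raw))}
    # A range such as '1643..1663' has no single year; leave it for manual review.
    if len(years) != 1:
        return None
    return years.pop()

for value in ["1942", "30 June 1890", "June 7, 1928", "1937-09-02", "1963-02", "1643..1663"]:
    print(repr(value), "->", extract_year(value))
```

With only 37 populated rows, any value that comes back as `None` is cheap to inspect by hand.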

type

categorical label long_tail
This column classifies underwater or maritime wreck sites by vessel/object type, with 31 distinct categories across 6,914 records and no nulls. The distribution is heavily dominated by 'shipwreck' (73.5% of records) and 'wreck' (19.5%), which together account for about 93% of all entries; the remaining 29 categories share only about 7%, confirming the long-tail alert. The near-redundancy between 'shipwreck', 'wreck', and 'ship' (plus 'boat', 'barge') suggests an inconsistent taxonomy that may need consolidation before modelling. Treatment: Consolidate overlapping categories (e.g. 'shipwreck'/'wreck'/'ship') into a canonical taxonomy, then one-hot or target-encode for modelling. high · anthropic:default
n
6,914
nulls
0 (0.0%)
unique
31
top_value
shipwreck
top_rate
0.7349
cardinality
31
entropy
1.166
entropy_ratio
0.2353
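
Consolidation of the overlapping labels can be expressed as a lookup table applied before encoding. The mapping below is hypothetical: which labels count as synonyms, and which one becomes canonical, is an editorial decision that the report leaves open.

```python
from collections import Counter

# Hypothetical synonym table; only labels named in the summary are listed.
CANONICAL = {
    "shipwreck": "shipwreck",
    "wreck": "shipwreck",
    "ship": "shipwreck",
}

def canonical_type(raw):
    """Normalise whitespace and case, then fold known synonyms together."""
    key = raw.strip().lower()
    return CANONICAL.get(key, key)

# Made-up sample values for illustration.
sample = ["shipwreck", "Wreck", "ship", "submarine", "barge ", "aircraft"]
print(Counter(canonical_type(t) for t in sample))
```

Labels missing from the table pass through unchanged, so rare categories in the long tail are preserved rather than silently merged.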

wikipedia

categorical metadata long_tail null_rate
This column stores Wikipedia article links associated with dataset entities (ships and aircraft), formatted as language-prefixed slugs (e.g., 'en:SS Edmund Fitzgerald', 'fr:Armorique (navire)'). The null rate is extremely high at 95.47%, meaning only ~313 of 6,914 rows have any Wikipedia reference. Among populated values, cardinality is very high (307 unique values across ~313 non-null rows), with the top value appearing only 4 times — indicating near-unique coverage and a long-tail distribution. A language mix is present (English 'en:', French 'fr:', Arabic 'ar:'), which could complicate any downstream lookup or joining logic. Treatment: Use as an optional enrichment link; do not use in modelling due to 95.47% nulls and near-unique cardinality; parse language prefix if language-specific resolution is needed. high · anthropic:default
n
6,914
nulls
6,601 (95.5%)
unique
307
top_value
en:SS Edmund Fitzgerald
top_rate
0.01278
cardinality
307
entropy
8.245
entropy_ratio
0.998
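
Splitting the language prefix from the title is enough to route each value to the right Wikipedia edition. The sketch below assumes every value follows the `lang:Title` shape seen in the examples and falls back to English when no prefix is present; that fallback is an assumption, not something the data guarantees.

```python
from urllib.parse import quote

def split_wikipedia_tag(tag):
    """Split 'en:SS Edmund Fitzgerald' into ('en', 'SS Edmund Fitzgerald')."""
    lang, sep, title = tag.partition(":")
    if not sep:
        return None, tag  # no prefix found; the whole value is the title
    return lang, title

def wikipedia_url(tag, default_lang="en"):
    """Build an article URL; default_lang is an assumed fallback."""
    lang, title = split_wikipedia_tag(tag)
    slug = quote(title.replace(" ", "_"))
    return f"https://{lang or default_lang}.wikipedia.org/wiki/{slug}"

print(split_wikipedia_tag("fr:Armorique (navire)"))
print(wikipedia_url("en:SS Edmund Fitzgerald"))
```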

wikidata

categorical foreign_key long_tail null_rate
This column stores Wikidata entity identifiers (Q-codes), linking dataset records to Wikidata knowledge graph entries. The most striking signal is the extreme null rate of 94.78%, meaning only 361 of 6,914 rows carry a Wikidata link at all. Among the 353 unique Q-codes present, the distribution is nearly flat: the top value 'Q1286267' appears only 4 times, the entropy ratio is 0.998, and the long-tail alert confirms almost no repeated values, so nearly every populated row points to a distinct entity. Treatment: Use as an optional foreign key to enrich records via Wikidata API lookup; do not use as a feature directly given the 94.78% null rate. high · anthropic:default
n
6,914
nulls
6,553 (94.8%)
unique
353
top_value
Q1286267
top_rate
0.01108
cardinality
353
entropy
8.446
entropy_ratio
0.9979

description

categorical free_text long_tail null_rate
This column contains free-text descriptions of maritime wrecks or nautical features, with entries referencing WWII-era vessels, jetties, fishing boats, and abandoned craft. The most striking signal is the 94.87% null rate: nearly the entire dataset lacks a description, which makes this column nearly unusable at scale. The 355 populated rows hold 304 unique values, entropy is very high (8.05, ratio 0.976), indicating wide diversity in phrasing, and a language mix is evident (e.g., French 'Chaloupe abandonnée à terre' alongside English entries). Treatment: Exclude from modelling due to the 94.87% null rate; if used, tokenize and embed the 5.13% populated values, and flag language mixing before NLP processing. high · anthropic:default
n
6,914
nulls
6,559 (94.9%)
unique
304
top_value
WWII era concrete fuel barge converted into breakwater
top_rate
0.03944
cardinality
304
entropy
8.052
entropy_ratio
0.9763

heritage

categorical feature long_tail null_rate
This column appears to encode a 'heritage' flag or classification with only 4 distinct values ('1', '2', 'no', 'yes'), suggesting a binary or ordinal attribute that may have been inconsistently encoded across sources. The critical finding is a null rate of 99.81%, meaning only 13 of 6,914 rows have any value at all — rendering this column nearly useless for modelling. Among those 13 non-null values, '2' dominates at 76.9%, while 'no', 'yes', and '1' each appear only once, indicating a mixed encoding scheme (numeric vs. boolean strings) on an already negligible sample. Treatment: Drop this column; 99.81% null rate and only 13 non-null observations make it statistically unusable. high · anthropic:default
n
6,914
nulls
6,901 (99.8%)
unique
4
top_value
2
top_rate
0.7692
cardinality
4
entropy
1.145
entropy_ratio
0.5726

access

categorical feature null_rate
This column appears to encode access permission or restriction tags for geographic features (likely OpenStreetMap-style data), with values such as 'yes', 'no', 'permit', 'private', 'permissive', and 'customers'. The striking finding is a 92.64% null rate — only 509 of 6,914 rows carry a value — meaning this attribute is almost entirely absent from the dataset. Among the non-null values, 'yes' dominates heavily at 66.99% of populated rows, suggesting most tagged features have open access. Treatment: Flag extreme sparsity (92.64% nulls); treat nulls as a distinct 'untagged' category or drop column if missingness renders it uninformative for modelling. high · anthropic:default
n
6,914
nulls
6,405 (92.6%)
unique
8
top_value
yes
top_rate
0.6699
cardinality
8
entropy
1.647
entropy_ratio
0.549

depth

categorical feature null_rate
This column represents a numeric depth measurement (likely in meters or similar units) stored as a categorical string, with decimal values such as '1.1', '12.4', and '19.2'. The most striking signal is a null rate of 77.36%, meaning only 1,565 of 6,914 rows carry a value; this severe missingness warrants investigation into whether it is structural (e.g., not applicable to certain record types) or a data quality issue. Among populated rows, cardinality is high (502 unique values) with an entropy ratio of 0.956, indicating a nearly uniform spread with no dominant depth value: the top value '12.4' appears only 11 times. The column should be cast to numeric before any modelling use. Treatment: Cast to float, investigate structural vs. random missingness before imputing or dropping nulls, then use as a numeric feature. high · anthropic:default
n
6,914
nulls
5,349 (77.4%)
unique
502
top_value
12.4
top_rate
0.007029
cardinality
502
entropy
8.579
entropy_ratio
0.9563
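
Casting is safest when unparseable strings become missing values instead of raising. The sketch below is one way to do that; accepting a decimal comma and rejecting values with unit suffixes are assumptions about what the raw strings might contain, since the report shows only plain decimals.

```python
import math

def parse_depth(raw):
    """Return depth as a float, or None when missing or not a plain finite number."""
    if raw is None:
        return None
    text = str(raw).strip().replace(",", ".")  # assumed: a decimal comma may occur
    try:
        value = float(text)
    except ValueError:
        return None  # e.g. '12 m' or free text; leave for manual review
    return value if math.isfinite(value) else None

for value in ["12.4", "1.1", "19,2", "12 m", None]:
    print(repr(value), "->", parse_depth(value))
```

Counting how many populated values come back as `None` gives a first measure of how clean the column is before any imputation decision.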

seamark_type

categorical label
This column contains a nautical/maritime classification type for seamarks, most likely drawn from an OpenStreetMap or similar marine charting schema. The distribution is severely dominated by 'wreck' at 78.3% (the report's top_rate appears to be computed over the 6,455 populated rows rather than all 6,914), with the next largest category 'dangerous' at only 8.6%, giving a low entropy ratio of 0.307. The mix of subtypes (hull_showing, mast_showing, distributed_remains, hulk) suggests these are sub-classifications of wrecks that could have been normalized into a hierarchy rather than a flat taxonomy. The 6.6% null rate (459 rows) warrants attention if completeness matters for navigation safety contexts. Treatment: One-hot or ordinal encode; consider grouping wreck subtypes (hull_showing, mast_showing, hulk, distributed_remains) into a parent 'wreck' hierarchy before modelling, and impute or flag the 6.6% nulls. high · anthropic:default
n
6,914
nulls
459 (6.6%)
unique
15
top_value
wreck
top_rate
0.7831
cardinality
15
entropy
1.2
entropy_ratio
0.307
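
The suggested hierarchy can be represented as a (parent, subtype) pair derived from the flat value. This is a sketch of that grouping only; treating the four subtypes as children of 'wreck' follows the suggestion in the summary and is not an official schema.

```python
# Subtypes named in the column summary; the parent assignment is an assumption.
WRECK_SUBTYPES = {"hull_showing", "mast_showing", "hulk", "distributed_remains"}

def seamark_levels(value):
    """Return (parent, subtype); nulls stay (None, None) so they can be flagged."""
    if value is None:
        return None, None
    if value in WRECK_SUBTYPES:
        return "wreck", value
    return value, None

for value in ["wreck", "hulk", "dangerous", None]:
    print(repr(value), "->", seamark_levels(value))
```

Keeping the subtype alongside the parent means no information is lost, and a model can use either level.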

osm_id

numeric identifier
This column is an OpenStreetMap (OSM) object identifier, a large integer surrogate key assigned by the OSM platform to geographic features. Every one of the 6,914 rows has a distinct value with zero nulls, confirming it functions purely as an identifier. The value range (13 M to ~13.7 B) and flat distribution (kurtosis −1.45, near-uniform spread across a ~8.4 B IQR) are consistent with OSM's incrementally assigned ID space across different data vintages. No outliers are flagged and the mild positive skew (0.44) suggests a slight concentration of older, lower-numbered IDs. Although the values happen to be unique here, OSM numbers nodes and ways in separate sequences, so the safe key is the pair (`osm_type`, `osm_id`). Treatment: Retain as a join/lookup key to OSM data, paired with `osm_type`; drop from any model feature set as it carries no predictive signal. high · anthropic:default
n
6,914
nulls
0 (0.0%)
unique
6,914
min
1.306e+07
max
1.371e+10
mean
5.365e+09
median
3.145e+09
std
4.464e+09
q1
1.348e+09
q3
9.788e+09
iqr
8.44e+09
skew
0.4355
kurtosis
-1.453
n_outliers
0
outlier_rate
0
zero_rate
0
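
When this column is used as a lookup key, combining it with `osm_type` avoids collisions between element types. The sketch below builds such a key and the corresponding openstreetmap.org browse URL; the sample record is invented, and the `int()` cast assumes the JSON may store the ID as a float.

```python
def osm_key(rec):
    """Composite key: OSM IDs are only unique within one element type."""
    return rec["osm_type"], int(rec["osm_id"])

def osm_url(rec):
    """Browse URL for the element on openstreetmap.org."""
    osm_type, osm_id = osm_key(rec)
    return f"https://www.openstreetmap.org/{osm_type}/{osm_id}"

# Invented record for illustration; not a row from the dataset.
example = {"osm_type": "way", "osm_id": 123456789.0}
print(osm_key(example))
print(osm_url(example))
```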

osm_type

categorical feature
This column encodes the OpenStreetMap geometry type, distinguishing between point features ('node') and linear/polygonal features ('way'). With only 2 distinct values across 6,914 rows and zero nulls, it is a clean binary categorical. The distribution is moderately skewed: 'node' accounts for 72.3% (5,000 rows) versus 'way' at 27.7% (1,914 rows), which is consistent with OSM datasets where point POIs outnumber way geometries. Treatment: One-hot encode or map to a binary flag (node=1, way=0) before modelling; check whether geometry type adds any signal beyond what the other features already capture. high · anthropic:default
n
6,914
nulls
0 (0.0%)
unique
2
top_value
node
top_rate
0.7232
cardinality
2
entropy
0.8511
entropy_ratio
0.8511
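
The `entropy` and `entropy_ratio` figures can be reproduced from the two category counts. The sketch below assumes entropy is measured in bits and that the ratio divides by log2 of the cardinality; the report does not define either, but this reading matches the values it prints.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)

# Row counts for 'node' and 'way' from the column summary.
counts = [5000, 1914]
print(round(entropy_bits(counts), 4), round(entropy_ratio(counts), 4))
```

For a two-category column the maximum entropy is 1 bit, which is why `entropy` and `entropy_ratio` coincide here.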