data trove shipwrecks

source /home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json · 6,914 rows · 14 columns · profiled 2026-06-21

Reading

dataset summary · high confidence anthropic:default

This dataset is an OpenStreetMap-derived catalogue of 6,914 shipwrecks and related maritime hazards mapped globally. The most important thing to explore first is the `type` and `seamark_type` columns, each of which is dominated by one generic label: 'shipwreck' makes up 73.5% of `type`, and 'wreck' makes up 73.1% of all `seamark_type` rows (78.3% of the non-null ones). `type` also has a long tail of submarines, aircraft, barges, and other vessels worth examining. A secondary point of interest is the high null rates across many descriptive fields, such as `heritage` (99.8% null), `year_sunk` (99.5% null), and `wikipedia` (95.5% null): rich contextual data exists for only a tiny fraction of wrecks, so the dataset is far more useful as a spatial inventory than as a historical record. The `access` column, populated for only 7.4% of rows, shows that about two thirds of tagged wrecks are open ('yes'), while a meaningful share are closed, permit-only, or private, which could interest dive-site analysts.

citing: type.top_value · type.top_rate · seamark_type.top_value · seamark_type.top_rate · heritage.null_rate · year_sunk.null_rate · wikipedia.null_rate · access.null_rate · access.top_value · access.top_rate · row_count · lat.min · lat.max

Charts the summary said to look at first

type · Look for the dominance of 'shipwreck' and 'wreck' and how thin the long tail of aircraft, submarines, and barges really is.

Top values for type (20 unique shown, of 31 total).
value	count	share
shipwreck	5081	73.5%
wreck	1345	19.5%
ship	381	5.5%
barge	27	0.4%
submarine	18	0.3%
aircraft	17	0.2%
plane	10	0.1%
boat	4	0.1%
vehicle	3	0.0%
motor_vehicle	3	0.0%
schooner	2	0.0%
car	2	0.0%
sailboat	2	0.0%
battleship	2	0.0%
steamer	1	0.0%
airplane	1	0.0%
freightcar	1	0.0%
train	1	0.0%
paddle steamer	1	0.0%
motorbike	1	0.0%

seamark_type · Shows the navigational hazard classification — note how many wrecks are marked 'dangerous' or have 'distributed_remains' versus a clean hull.

Top values for seamark_type (15 unique shown, of 15 total).
value	count	share
wreck	5055	73.1%
dangerous	598	8.6%
non-dangerous	358	5.2%
distributed_remains	306	4.4%
hulk	56	0.8%
hull_showing	46	0.7%
shoreline_construction	14	0.2%
mast_showing	8	0.1%
obstruction	7	0.1%
harbour	2	0.0%
restricted_area	1	0.0%
plane	1	0.0%
beacon_special_purpose	1	0.0%
landmark	1	0.0%
no	1	0.0%

access · Among the minority of wrecks with access data, check how many are freely accessible versus permit-only or private.

Top values for access (8 unique shown, of 8 total).
value	count	share
yes	341	4.9%
no	73	1.1%
permit	27	0.4%
private	27	0.4%
unknown	20	0.3%
permissive	17	0.2%
customers	3	0.0%
foot	1	0.0%

lat · Distribution of wreck latitudes reveals geographic clustering: look for the concentration in northern hemisphere waters (peaking in the 50–54° bin) and the nearly empty bins toward both poles.

Histogram bins for lat (median: 43.8517503).
bin	count
-77.42 – -73.44	1
-73.44 – -69.45	0
-69.45 – -65.46	0
-65.46 – -61.47	1
-61.47 – -57.48	1
-57.48 – -53.49	20
-53.49 – -49.5	39
-49.5 – -45.51	41
-45.51 – -41.52	50
-41.52 – -37.53	107
-37.53 – -33.54	235
-33.54 – -29.55	110
-29.55 – -25.56	56
-25.56 – -21.57	118
-21.57 – -17.58	67
-17.58 – -13.59	28
-13.59 – -9.597	30
-9.597 – -5.607	72
-5.607 – -1.617	73
-1.617 – 2.373	67
2.373 – 6.363	40
6.363 – 10.35	180
10.35 – 14.34	105
14.34 – 18.33	85
18.33 – 22.32	108
22.32 – 26.31	84
26.31 – 30.3	75
30.3 – 34.29	149
34.29 – 38.28	529
38.28 – 42.27	748
42.27 – 46.26	608
46.26 – 50.25	494
50.25 – 54.24	1302
54.24 – 58.23	846
58.23 – 62.22	212
62.22 – 66.21	85
66.21 – 70.2	103
70.2 – 74.19	39
74.19 – 78.18	5
78.18 – 82.17	1

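depth · The most frequent depth values (where recorded) fall between about 1 and 20, presumably metres, suggesting the dataset skews toward shallow, diveable wrecks rather than deep-sea losses. The 20 values shown cover only 197 of the 1,565 populated rows, so treat this as a hint rather than a distribution.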

Top values for depth (20 unique shown, of 502 total).
value	count	share
12.4	11	0.2%
16	11	0.2%
18	11	0.2%
15.5	11	0.2%
19.2	11	0.2%
1.1	10	0.1%
17.4	10	0.1%
15.6	10	0.1%
7	10	0.1%
14	10	0.1%
5	10	0.1%
15.1	10	0.1%
6.4	9	0.1%
9	9	0.1%
8	9	0.1%
19	9	0.1%
15.2	9	0.1%
20	9	0.1%
16.4	9	0.1%
18.5	9	0.1%

Schema

14 columns

Per-column summary.
column	type	null rate	unique	alerts
name	text	0.0%	6,841	near_unique
lat	numeric	0.0%	6,902	outliers
lon	numeric	0.0%	6,910	outliers
year_sunk	categorical	99.5%	36	long_tail null_rate
type	categorical	0.0%	31	long_tail
wikipedia	categorical	95.5%	307	long_tail null_rate
wikidata	categorical	94.8%	353	long_tail null_rate
description	categorical	94.9%	304	long_tail null_rate
heritage	categorical	99.8%	4	long_tail null_rate
access	categorical	92.6%	8	null_rate
depth	categorical	77.4%	502	null_rate
seamark_type	categorical	6.6%	15
osm_id	numeric	0.0%	6,914
osm_type	categorical	0.0%	2

name

text label near_unique

This column contains the names of individual shipwrecks; the dominant top words are 'shipwreck' (5032 occurrences across 6914 rows), 'wreck', 'ss', 'uss', and 'hms'. With 6841 unique values out of 6914 rows and no nulls, it is essentially a name/label field, but the 73 duplicates (1.06% duplicate rate) are mildly surprising and may indicate that the same wreck appears under the same name in multiple records. Lengths cluster tightly (median 20, p95 21 characters) with a long tail reaching 153. Combined with the word 'shipwreck' occurring about 5,032 times, this suggests that many entries may be generic or templated labels rather than distinct vessel names, while a minority carry extended descriptions. Treatment: Use as a display label; investigate the 73 duplicates for deduplication or record linkage before treating the column as a unique identifier. high · anthropic:default

n: 6,914
nulls: 0 (0.0%)
unique: 6,841
len_min: 2
len_max: 153
len_mean: 18.35
len_median: 20
len_p95: 21
word_mean: 2.058
word_median: 2
n_empty: 0
n_duplicates: 73
duplicate_rate: 0.01056
vocab_size: 7,602
readability_flesch_mean: 73.37
emoji_rate: 0
url_rate: 0
one_word_rate: 0.0849
allcaps_rate: 0.01403
boilerplate_rate: 0
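
The duplicate check suggested above could be sketched as follows. This is a minimal illustration, not the profiler's method: the sample rows are hypothetical, and normalising case and whitespace before comparing is an added assumption that may report more duplicates than the exact-match count of 73.

```python
def duplicate_names(rows):
    """Group osm_ids by normalised name and keep names used more than once."""
    groups = {}
    for row in rows:
        # Assumption: collapse whitespace and ignore case when comparing names.
        key = " ".join(row["name"].split()).casefold()
        groups.setdefault(key, []).append(row["osm_id"])
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


# Hypothetical sample rows; real rows would come from shipwrecks.json.
sample = [
    {"name": "SS Example", "osm_id": 1},
    {"name": "ss  example", "osm_id": 2},
    {"name": "HMS Placeholder", "osm_id": 3},
]
print(duplicate_names(sample))
```

Records that share a name can then be compared on `lat` and `lon` to decide whether they describe the same wreck.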

lat

numeric feature outliers

This column represents geographic latitude values, spanning from -77.42° (near Antarctica) to 82.17° (high Arctic), with 6,902 unique values across 6,914 rows. The mean (33.15°) sits notably below the median (43.85°), consistent with the left skew of -1.42: most records cluster in mid-to-high northern latitudes, and a long tail of equatorial and southern hemisphere records pulls the mean down. Roughly 12.5% of values (864 rows) are flagged as outliers. If the profiler uses the usual 1.5 × IQR rule, the lower fence is 26.58 - 1.5 × 27.29 ≈ -14.4° and the upper fence (≈ 94.8°) lies beyond the valid range, so the flagged rows are all locations south of about 14.4° S, not polar coordinates or errors. The histogram is consistent with this: 846 records fall in bins entirely below -17.58°, and the bin spanning the fence holds 28 more. Treatment: Retain as-is for geospatial modelling; consider pairing with longitude and binning into geographic regions to handle the skewed distribution. high · anthropic:default

n: 6,914
nulls: 0 (0.0%)
unique: 6,902
min: -77.42
max: 82.17
mean: 33.15
median: 43.85
std: 29.88
q1: 26.58
q3: 53.87
iqr: 27.29
skew: -1.417
kurtosis: 0.8666
n_outliers: 864
outlier_rate: 0.125
zero_rate: 0
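
The outlier count can be sanity-checked from the quartiles alone. The sketch below assumes the profiler flags outliers with Tukey's 1.5 × IQR rule, which the report does not state; `tukey_fences` is an illustrative helper, not part of the profiler.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 as reported for the lat column.
lat_low, lat_high = tukey_fences(26.58, 53.87)
print(f"lower fence: {lat_low:.3f}, upper fence: {lat_high:.3f}")
```

Under that assumption the upper fence exceeds 90°, so no northern latitude can be flagged and every outlier lies in the southern tail.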

lon

numeric feature outliers

This column contains geographic longitude values, spanning nearly the full valid range (-179.28° to 179.45°). The mean (3.07°) and median (8.32°) are both modestly east of the Prime Meridian, suggesting a concentration of records at European and African longitudes, while the wide IQR of 58.74° and std of 69.12° confirm global scatter. Notably, 806 rows (11.66%) are flagged as outliers. If the profiler uses the usual 1.5 × IQR rule, the fences sit at about -128.9° and 106.1°, so the flagged rows would lie in East Asia, Australasia, and the Pacific rather than in most of the Americas; they are genuine geographic extremes relative to the main cluster, not erroneous values. Treatment: Use as-is or pair with latitude for spatial modelling; for ML pipelines, consider converting to radians or embedding via geohash, keeping in mind that longitude wraps around at ±180°. high · anthropic:default

n: 6,914
nulls: 0 (0.0%)
unique: 6,910
min: -179.3
max: 179.4
mean: 3.067
median: 8.322
std: 69.12
q1: -40.75
q3: 17.99
iqr: 58.74
skew: 0.5093
kurtosis: 0.9211
n_outliers: 806
outlier_rate: 0.1166
zero_rate: 0
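
One way to act on the radians suggestion is a sine/cosine encoding, sketched below. This is a common technique for cyclical features, not something the report prescribes, and `lon_to_unit_circle` is a hypothetical helper name.

```python
import math


def lon_to_unit_circle(lon_deg):
    """Map a longitude in degrees to a (sin, cos) point on the unit circle."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)


# Two points 0.2 degrees apart across the antimeridian.
east = lon_to_unit_circle(179.9)
west = lon_to_unit_circle(-179.9)
gap = math.hypot(east[0] - west[0], east[1] - west[1])
print(f"raw difference: {179.9 - (-179.9):.1f} degrees, encoded gap: {gap:.4f}")
```

The raw values differ by almost 360°, while the encoded points are nearly identical, which is the behaviour a distance-based model needs near the antimeridian.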

year_sunk

categorical metadata long_tail null_rate

This column records the year (or date) a vessel was sunk, but it is almost entirely empty: 99.46% of the 6,914 rows are null (6,877), leaving only 37 non-null values. Among those, the formats are wildly inconsistent: bare years ('1942', '1854'), full dates in multiple formats ('30 June 1890', 'June 7, 1928', '1937-09-02'), partial dates ('1963-02'), and even a range ('1643..1663'), making normalisation non-trivial. With 36 unique values across the 37 populated rows, the column is near-unique relative to its populated set, and the top value '1942' appears only twice. Treatment: Parse and normalise to a standard year integer after regex-based format detection; treat as sparse metadata and do not use as a primary feature without an imputation strategy, given 99.46% nulls. high · anthropic:default

n: 6,914
nulls: 6,877 (99.5%)
unique: 36
top_value: 1942
top_rate: 0.05405
cardinality: 36
entropy: 5.155
entropy_ratio: 0.9972
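
A minimal sketch of the regex-based normalisation, using only the example formats quoted above. Returning a (first, last) pair so that ranges such as '1643..1663' survive is a design assumption; `year_bounds` is a hypothetical helper.

```python
import re

# A run of exactly four digits that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")


def year_bounds(raw):
    """Return (earliest, latest) year found in the text, or None if none is found."""
    if raw is None:
        return None
    years = [int(y) for y in YEAR.findall(raw)]
    if not years:
        return None
    return min(years), max(years)


for value in ["1942", "30 June 1890", "June 7, 1928", "1937-09-02", "1963-02", "1643..1663"]:
    print(f"{value!r:16} -> {year_bounds(value)}")
```

A single year yields a pair with equal bounds, so downstream code can take the first element when one integer is needed and still detect ranges.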

type

categorical label long_tail

This column classifies underwater or maritime wreck sites by vessel/object type, with 31 distinct categories across 6,914 records and no nulls. The distribution is heavily dominated by 'shipwreck' (73.5% of records) and 'wreck' (19.5%), together accounting for about 92.9% of all entries; the remaining 29 categories share just ~7.1%, confirming the long-tail alert. The near-redundancy between 'shipwreck', 'wreck', and 'ship' (plus 'boat', 'barge'), and likewise between 'aircraft', 'plane', and 'airplane', suggests inconsistent taxonomy that may need consolidation before modelling. Treatment: Consolidate overlapping categories (e.g. 'shipwreck'/'wreck'/'ship') into a canonical taxonomy, then one-hot or target-encode for modelling. high · anthropic:default

n: 6,914
nulls: 0 (0.0%)
unique: 31
top_value: shipwreck
top_rate: 0.7349
cardinality: 31
entropy: 1.166
entropy_ratio: 0.2353
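
The consolidation step might look like the sketch below. The counts are copied from the top-values table for `type`; the four-group taxonomy in `CANONICAL` is purely illustrative and would need review by someone who knows the data (for example, whether a bare 'wreck' always means a vessel).

```python
from collections import Counter

# Counts from the top-values table (20 of the 31 values).
TYPE_COUNTS = {
    "shipwreck": 5081, "wreck": 1345, "ship": 381, "barge": 27,
    "submarine": 18, "aircraft": 17, "plane": 10, "boat": 4,
    "vehicle": 3, "motor_vehicle": 3, "schooner": 2, "car": 2,
    "sailboat": 2, "battleship": 2, "steamer": 1, "airplane": 1,
    "freightcar": 1, "train": 1, "paddle steamer": 1, "motorbike": 1,
}

# Hypothetical canonical taxonomy; the grouping is an assumption.
GROUPS = {
    "vessel": ["shipwreck", "wreck", "ship", "barge", "boat", "schooner",
               "sailboat", "battleship", "steamer", "paddle steamer"],
    "submarine": ["submarine"],
    "aircraft": ["aircraft", "plane", "airplane"],
    "land_vehicle": ["vehicle", "motor_vehicle", "car", "freightcar",
                     "train", "motorbike"],
}
CANONICAL = {raw: group for group, raws in GROUPS.items() for raw in raws}


def consolidate(counts, mapping, default="other"):
    """Sum raw category counts under their canonical label."""
    merged = Counter()
    for raw, n in counts.items():
        merged[mapping.get(raw, default)] += n
    return merged


merged = consolidate(TYPE_COUNTS, CANONICAL)
for label, n in merged.most_common():
    print(f"{label:13s} {n:5d}")
```

Values missing from the mapping fall into 'other', so the 11 categories not shown in the table would be kept rather than silently dropped.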

wikipedia

categorical metadata long_tail null_rate

This column stores Wikipedia article links associated with dataset entities (ships and aircraft), formatted as language-prefixed slugs (e.g., 'en:SS Edmund Fitzgerald', 'fr:Armorique (navire)'). The null rate is extremely high at 95.47%, meaning only ~313 of 6,914 rows have any Wikipedia reference. Among populated values, cardinality is very high (307 unique values across ~313 non-null rows), with the top value appearing only 4 times — indicating near-unique coverage and a long-tail distribution. A language mix is present (English 'en:', French 'fr:', Arabic 'ar:'), which could complicate any downstream lookup or joining logic. Treatment: Use as an optional enrichment link; do not use in modelling due to 95.47% nulls and near-unique cardinality; parse language prefix if language-specific resolution is needed. high · anthropic:default

n: 6,914
nulls: 6,601 (95.5%)
unique: 307
top_value: en:SS Edmund Fitzgerald
top_rate: 0.01278
cardinality: 307
entropy: 8.245
entropy_ratio: 0.998
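
Parsing the language prefix could be sketched as below. The `lang:Title` shape comes from the examples above; building an article URL from it follows the public Wikipedia URL pattern, and the helper names are hypothetical.

```python
from urllib.parse import quote


def split_wikipedia_tag(tag):
    """Split 'en:SS Edmund Fitzgerald' into ('en', 'SS Edmund Fitzgerald')."""
    lang, sep, title = tag.partition(":")
    if not sep or not lang or not title:
        return None  # not in lang:Title form
    return lang, title


def wikipedia_url(tag):
    """Build an article URL from a lang:Title tag, or return None."""
    parts = split_wikipedia_tag(tag)
    if parts is None:
        return None
    lang, title = parts
    # Wikipedia uses underscores for spaces; other characters are percent-encoded.
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


print(split_wikipedia_tag("fr:Armorique (navire)"))
print(wikipedia_url("en:SS Edmund Fitzgerald"))
```

Only the first colon is treated as the separator, so titles that themselves contain a colon stay intact.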

wikidata

categorical foreign_key long_tail null_rate

This column stores Wikidata entity identifiers (Q-codes), linking dataset records to Wikidata knowledge graph entries. The most striking signal is the extreme null rate of 94.78%, meaning only 361 of 6,914 rows carry a Wikidata link at all. Among the 353 unique Q-codes present, the distribution is nearly flat: the top value 'Q1286267' appears only 4 times, the entropy ratio is 0.998, and the long-tail alert confirms almost no repeated values, suggesting each populated row points to a distinct entity with minimal reuse. Treatment: Use as an optional foreign key to enrich records via Wikidata API lookup; do not use as a feature directly given the 94.78% null rate. high · anthropic:default

n: 6,914
nulls: 6,553 (94.8%)
unique: 353
top_value: Q1286267
top_rate: 0.01108
cardinality: 353
entropy: 8.446
entropy_ratio: 0.9979
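
Before using the column as a foreign key, it may be worth validating that each value looks like a Q-code. The sketch below assumes the plain `Q<digits>` form seen in the top value and builds the public entity page URL; it does not call the Wikidata API, and `wikidata_url` is a hypothetical helper.

```python
import re

# 'Q' followed by a positive integer with no leading zero.
QID = re.compile(r"Q[1-9][0-9]*")


def wikidata_url(value):
    """Return the entity page URL for a well-formed Q-code, else None."""
    if value is None:
        return None
    qid = value.strip()
    if not QID.fullmatch(qid):
        return None
    return f"https://www.wikidata.org/wiki/{qid}"


print(wikidata_url("Q1286267"))
```

Rows that return None can be logged for manual review rather than passed to a lookup.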

description

categorical free_text long_tail null_rate

This column contains free-text descriptions of maritime wrecks or nautical features, with entries referencing WWII-era vessels, jetties, fishing boats, and abandoned craft. The most striking signal is the 94.87% null rate: only 355 of 6,914 rows have a description, making this column nearly unusable at scale. The populated rows hold 304 unique values, entropy is very high (8.05, ratio 0.976), indicating wide diversity in phrasing, and a language mix is evident (e.g., French 'Chaloupe abandonnée à terre' alongside English entries). Treatment: Exclude from modelling due to the 94.87% null rate; if used, tokenize and embed the 5.13% populated values, and flag language mixing before NLP processing. high · anthropic:default

n: 6,914
nulls: 6,559 (94.9%)
unique: 304
top_value: WWII era concrete fuel barge converted into breakwater
top_rate: 0.03944
cardinality: 304
entropy: 8.052
entropy_ratio: 0.9763

heritage

categorical feature long_tail null_rate

This column appears to encode a 'heritage' flag or classification with only 4 distinct values ('1', '2', 'no', 'yes'), suggesting a binary or ordinal attribute that may have been inconsistently encoded across sources. The critical finding is a null rate of 99.81%, meaning only 13 of 6,914 rows have any value at all — rendering this column nearly useless for modelling. Among those 13 non-null values, '2' dominates at 76.9%, while 'no', 'yes', and '1' each appear only once, indicating a mixed encoding scheme (numeric vs. boolean strings) on an already negligible sample. Treatment: Drop this column; 99.81% null rate and only 13 non-null observations make it statistically unusable. high · anthropic:default

n: 6,914
nulls: 6,901 (99.8%)
unique: 4
top_value: 2
top_rate: 0.7692
cardinality: 4
entropy: 1.145
entropy_ratio: 0.5726

access

categorical feature null_rate

This column appears to encode access permission or restriction tags for geographic features (likely OpenStreetMap-style data), with values such as 'yes', 'no', 'permit', 'private', 'permissive', and 'customers'. The striking finding is a 92.64% null rate — only 509 of 6,914 rows carry a value — meaning this attribute is almost entirely absent from the dataset. Among the non-null values, 'yes' dominates heavily at 66.99% of populated rows, suggesting most tagged features have open access. Treatment: Flag extreme sparsity (92.64% nulls); treat nulls as a distinct 'untagged' category or drop column if missingness renders it uninformative for modelling. high · anthropic:default

n: 6,914
nulls: 6,405 (92.6%)
unique: 8
top_value: yes
top_rate: 0.6699
cardinality: 8
entropy: 1.647
entropy_ratio: 0.549

depth

categorical feature null_rate

This column represents a numeric depth measurement (likely in metres or similar units) stored as a categorical string, with values such as '1.1', '12.4', and '19.2'. The most striking signal is a null rate of 77.36%, meaning only 1,565 of 6,914 rows carry a value; missingness this severe warrants investigation into whether it is structural (e.g., not applicable to certain record types) or a data quality issue. Among populated rows, cardinality is very high (502 unique values) with an entropy ratio of 0.956, indicating a nearly uniform spread and essentially no dominant depth value: the top value '12.4' appears only 11 times. The column should be cast to numeric before any modelling use. Treatment: Cast to float, investigate structural vs. random missingness before imputing or dropping nulls, then use as a numeric feature. high · anthropic:default

n: 6,914
nulls: 5,349 (77.4%)
unique: 502
top_value: 12.4
top_rate: 0.007029
cardinality: 502
entropy: 8.579
entropy_ratio: 0.9563
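
The cast to float could be sketched as below. The report only shows clean values such as '12.4'; tolerating a decimal comma and returning None for anything unparseable are assumptions about what the other 482 values might contain.

```python
import math


def parse_depth(raw):
    """Convert a depth string to float, returning None when it cannot be parsed."""
    if raw is None:
        return None
    # Assumption: some mappers may have typed a decimal comma.
    text = raw.strip().replace(",", ".")
    try:
        value = float(text)
    except ValueError:
        return None
    # Reject 'nan' and 'inf', which float() accepts.
    return value if math.isfinite(value) else None


for raw in ["12.4", "19,2", "about 5 m", None]:
    print(f"{raw!r:12} -> {parse_depth(raw)}")
```

Keeping unparseable values as None, rather than raising, makes it easy to count how many populated rows fail the cast before deciding how to treat them.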

seamark_type

categorical label

This column contains a nautical/maritime classification type for seamarks, most likely drawn from an OpenStreetMap or similar marine charting schema. The distribution is severely dominated by 'wreck', which makes up 78.3% of the 6,455 populated rows (73.1% of all 6,914 rows); the next largest category, 'dangerous', accounts for only 9.3% of populated rows (8.6% of all rows), giving a low entropy ratio of 0.307. The mix of subtypes (hull_showing, mast_showing, distributed_remains, hulk) suggests these are sub-classifications of wrecks that could have been normalized into a hierarchy rather than a flat taxonomy. The 6.6% null rate warrants attention if completeness matters for navigation safety contexts. Treatment: One-hot or ordinal encode; consider grouping wreck subtypes (hull_showing, mast_showing, hulk, distributed_remains) into a parent 'wreck' hierarchy before modelling, and impute or flag the 6.6% nulls. high · anthropic:default

n: 6,914
nulls: 459 (6.6%)
unique: 15
top_value: wreck
top_rate: 0.7831
cardinality: 15
entropy: 1.2
entropy_ratio: 0.307
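
The report quotes shares for this column against two different denominators: the top-values table divides by all rows, while `top_rate` divides by non-null rows. The short sketch below makes the difference explicit using the counts from the table; `shares` is an illustrative helper.

```python
N_ROWS = 6914
N_NULL = 459


def shares(count, n_rows=N_ROWS, n_null=N_NULL):
    """Return (share of all rows, share of populated rows) for one value."""
    populated = n_rows - n_null
    return count / n_rows, count / populated


for value, count in [("wreck", 5055), ("dangerous", 598)]:
    of_all, of_populated = shares(count)
    print(f"{value:10s} {of_all:6.1%} of all rows, {of_populated:6.1%} of populated rows")
```

Stating the denominator next to each percentage avoids comparing a 78.3% figure for one value with an 8.6% figure for another.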

osm_id

numeric identifier

This column is an OpenStreetMap (OSM) object identifier, a large integer surrogate key assigned by the OSM platform to geographic features. Every one of the 6,914 rows has a distinct value with zero nulls, confirming it functions purely as an identifier. The value range (13 M to ~13.7 B) and flat distribution (kurtosis −1.45, near-uniform spread across a ~8.4 B IQR) are consistent with OSM's incrementally assigned ID space across different data vintages. No outliers are flagged and the mild positive skew (0.44) suggests a slight concentration of older, lower-numbered IDs. Note that OSM numbers nodes and ways in separate sequences, so the statistics blend two ID spaces and an ID identifies an object only together with `osm_type`. Treatment: Retain, paired with `osm_type`, as a join/lookup key to OSM data; drop from any model feature set as it carries no predictive signal. high · anthropic:default

n: 6,914
nulls: 0 (0.0%)
unique: 6,914
min: 1.306e+07
max: 1.371e+10
mean: 5.365e+09
median: 3.145e+09
std: 4.464e+09
q1: 1.348e+09
q3: 9.788e+09
iqr: 8.44e+09
skew: 0.4355
kurtosis: -1.453
n_outliers: 0
outlier_rate: 0
zero_rate: 0
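
Because the same number can refer to different objects depending on element type, a lookup needs both columns. The sketch below builds the public openstreetmap.org browse URL from `osm_type` and `osm_id`; the ID in the example is made up, and `osm_url` is a hypothetical helper.

```python
OSM_ELEMENT_TYPES = {"node", "way", "relation"}


def osm_url(osm_type, osm_id):
    """Return the openstreetmap.org page for an element, keyed by type and ID."""
    if osm_type not in OSM_ELEMENT_TYPES:
        raise ValueError(f"unknown OSM element type: {osm_type!r}")
    return f"https://www.openstreetmap.org/{osm_type}/{int(osm_id)}"


# Hypothetical ID, used with both element types present in this dataset.
print(osm_url("node", 123456))
print(osm_url("way", 123456))
```

The two printed URLs point at unrelated objects, which is why `(osm_type, osm_id)` is the safer join key.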

osm_type

categorical feature

This column encodes the OpenStreetMap geometry type, distinguishing between point features ('node') and linear/polygonal features ('way'). With only 2 distinct values across 6,914 rows and zero nulls, it is a clean binary categorical. The distribution is moderately skewed: 'node' accounts for 72.3% (5,000 rows) versus 'way' at 27.7% (1,914 rows), which is consistent with OSM datasets where point POIs outnumber way geometries. Treatment: One-hot encode or map to binary flag (node=1, way=0) before modelling; consider whether geometry type meaningfully differs from other features in the pipeline. high · anthropic:default

n: 6,914
nulls: 0 (0.0%)
unique: 2
top_value: node
top_rate: 0.7232
cardinality: 2
entropy: 0.8511
entropy_ratio: 0.8511
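
The `entropy` and `entropy_ratio` figures reported for the categorical columns appear to be Shannon entropy in bits and that entropy divided by log2 of the cardinality. That reading is an assumption, but it reproduces the reported values for this column and for `heritage`, as the sketch below shows.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c)
    if k < 2:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)


osm_type_counts = [5000, 1914]    # node, way
heritage_counts = [10, 1, 1, 1]   # '2', 'no', 'yes', '1'
print(f"osm_type entropy: {shannon_entropy_bits(osm_type_counts):.4f}")
print(f"heritage ratio:   {entropy_ratio(heritage_counts):.4f}")
```

A ratio near 1 means the values are spread almost evenly (as in `wikidata`), while a ratio near 0 means one value dominates (as in `type`).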