data trove shipwrecks
Reading
This dataset is an OpenStreetMap-derived catalogue of 6,914 shipwrecks and related maritime hazards mapped globally. Start with the `type` and `seamark_type` columns: the top value dominates each ('shipwreck' at 73.5% of rows in `type`, 'wreck' at 78.3% of populated rows in `seamark_type`), with a long tail of submarines, aircraft, barges, and other vessels worth examining. A secondary point of interest is the high null rate across many descriptive fields: `heritage` (99.8% null), `year_sunk` (99.5% null), and `wikipedia` (95.5% null). Rich contextual data exists for only a small fraction of wrecks, so the dataset is far more useful as a spatial inventory than as a historical record. The `access` column is 92.6% null; where it is populated, most wrecks are tagged as open ('yes', 67.0%), but about 10.6% require a permit or are private, which could interest dive-site analysts.
citing: type.top_value · type.top_rate · seamark_type.top_value · seamark_type.top_rate · heritage.null_rate · year_sunk.null_rate · wikipedia.null_rate · access.null_rate · access.top_value · access.top_rate · row_count · lat.min · lat.max
Charts the summary said to look at first
`type`: top 20 values (share of all 6,914 rows)
| value | count | share |
|---|---|---|
| shipwreck | 5081 | 73.5% |
| wreck | 1345 | 19.5% |
| ship | 381 | 5.5% |
| barge | 27 | 0.4% |
| submarine | 18 | 0.3% |
| aircraft | 17 | 0.2% |
| plane | 10 | 0.1% |
| boat | 4 | 0.1% |
| vehicle | 3 | 0.0% |
| motor_vehicle | 3 | 0.0% |
| schooner | 2 | 0.0% |
| car | 2 | 0.0% |
| sailboat | 2 | 0.0% |
| battleship | 2 | 0.0% |
| steamer | 1 | 0.0% |
| airplane | 1 | 0.0% |
| freightcar | 1 | 0.0% |
| train | 1 | 0.0% |
| paddle steamer | 1 | 0.0% |
| motorbike | 1 | 0.0% |
`seamark_type`: all 15 values (share of all 6,914 rows)
| value | count | share |
|---|---|---|
| wreck | 5055 | 73.1% |
| dangerous | 598 | 8.6% |
| non-dangerous | 358 | 5.2% |
| distributed_remains | 306 | 4.4% |
| hulk | 56 | 0.8% |
| hull_showing | 46 | 0.7% |
| shoreline_construction | 14 | 0.2% |
| mast_showing | 8 | 0.1% |
| obstruction | 7 | 0.1% |
| harbour | 2 | 0.0% |
| restricted_area | 1 | 0.0% |
| plane | 1 | 0.0% |
| beacon_special_purpose | 1 | 0.0% |
| landmark | 1 | 0.0% |
| no | 1 | 0.0% |
`access`: all 8 values (share of all 6,914 rows)
| value | count | share |
|---|---|---|
| yes | 341 | 4.9% |
| no | 73 | 1.1% |
| permit | 27 | 0.4% |
| private | 27 | 0.4% |
| unknown | 20 | 0.3% |
| permissive | 17 | 0.2% |
| customers | 3 | 0.0% |
| foot | 1 | 0.0% |
`lat`: histogram (40 bins)
| bin | count |
|---|---|
| -77.42 – -73.44 | 1 |
| -73.44 – -69.45 | 0 |
| -69.45 – -65.46 | 0 |
| -65.46 – -61.47 | 1 |
| -61.47 – -57.48 | 1 |
| -57.48 – -53.49 | 20 |
| -53.49 – -49.5 | 39 |
| -49.5 – -45.51 | 41 |
| -45.51 – -41.52 | 50 |
| -41.52 – -37.53 | 107 |
| -37.53 – -33.54 | 235 |
| -33.54 – -29.55 | 110 |
| -29.55 – -25.56 | 56 |
| -25.56 – -21.57 | 118 |
| -21.57 – -17.58 | 67 |
| -17.58 – -13.59 | 28 |
| -13.59 – -9.597 | 30 |
| -9.597 – -5.607 | 72 |
| -5.607 – -1.617 | 73 |
| -1.617 – 2.373 | 67 |
| 2.373 – 6.363 | 40 |
| 6.363 – 10.35 | 180 |
| 10.35 – 14.34 | 105 |
| 14.34 – 18.33 | 85 |
| 18.33 – 22.32 | 108 |
| 22.32 – 26.31 | 84 |
| 26.31 – 30.3 | 75 |
| 30.3 – 34.29 | 149 |
| 34.29 – 38.28 | 529 |
| 38.28 – 42.27 | 748 |
| 42.27 – 46.26 | 608 |
| 46.26 – 50.25 | 494 |
| 50.25 – 54.24 | 1302 |
| 54.24 – 58.23 | 846 |
| 58.23 – 62.22 | 212 |
| 62.22 – 66.21 | 85 |
| 66.21 – 70.2 | 103 |
| 70.2 – 74.19 | 39 |
| 74.19 – 78.18 | 5 |
| 78.18 – 82.17 | 1 |
`depth`: top 20 values (share of all 6,914 rows)
| value | count | share |
|---|---|---|
| 12.4 | 11 | 0.2% |
| 16 | 11 | 0.2% |
| 18 | 11 | 0.2% |
| 15.5 | 11 | 0.2% |
| 19.2 | 11 | 0.2% |
| 1.1 | 10 | 0.1% |
| 17.4 | 10 | 0.1% |
| 15.6 | 10 | 0.1% |
| 7 | 10 | 0.1% |
| 14 | 10 | 0.1% |
| 5 | 10 | 0.1% |
| 15.1 | 10 | 0.1% |
| 6.4 | 9 | 0.1% |
| 9 | 9 | 0.1% |
| 8 | 9 | 0.1% |
| 19 | 9 | 0.1% |
| 15.2 | 9 | 0.1% |
| 20 | 9 | 0.1% |
| 16.4 | 9 | 0.1% |
| 18.5 | 9 | 0.1% |
Schema
14 columns
| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| name | text | 0.0% | 6,841 | near_unique |
| lat | numeric | 0.0% | 6,902 | outliers |
| lon | numeric | 0.0% | 6,910 | outliers |
| year_sunk | categorical | 99.5% | 36 | long_tail, null_rate |
| type | categorical | 0.0% | 31 | long_tail |
| wikipedia | categorical | 95.5% | 307 | long_tail, null_rate |
| wikidata | categorical | 94.8% | 353 | long_tail, null_rate |
| description | categorical | 94.9% | 304 | long_tail, null_rate |
| heritage | categorical | 99.8% | 4 | long_tail, null_rate |
| access | categorical | 92.6% | 8 | null_rate |
| depth | categorical | 77.4% | 502 | null_rate |
| seamark_type | categorical | 6.6% | 15 | |
| osm_id | numeric | 0.0% | 6,914 | |
| osm_type | categorical | 0.0% | 2 | |
name
text · label · near_unique
This column holds the name of each wreck. The dominant top words are 'shipwreck' (5,032 occurrences across 6,914 rows), 'wreck', 'ss', 'uss', and 'hms'. With 6,841 unique values and no nulls it is essentially a label field, but the 73 duplicates (1.06% duplicate rate) are mildly surprising and may indicate that the same wreck appears in more than one record. Lengths cluster tightly (median 20, p95 21 characters) with a long tail reaching 153, suggesting most names are concise while a minority carry extended descriptions.
Treatment: Use as a display label; investigate the 73 duplicates for deduplication or record linkage before treating the column as a unique identifier.
- n: 6,914
- nulls: 0 (0.0%)
- unique: 6,841
- len_min: 2
- len_max: 153
- len_mean: 18.35
- len_median: 20
- len_p95: 21
- word_mean: 2.058
- word_median: 2
- n_empty: 0
- n_duplicates: 73
- duplicate_rate: 0.01056
- vocab_size: 7,602
- readability_flesch_mean: 73.37
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.0849
- allcaps_rate: 0.01403
- boilerplate_rate: 0
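The duplicate check suggested in the treatment can be sketched as follows. This is a minimal illustration, assuming each record is a dict with `name` and `osm_id` keys; the sample rows and the normalisation (trim, collapse whitespace, case-fold) are hypothetical choices, and a normalised comparison may report more duplicates than the 73 exact matches counted above.

```python
from collections import defaultdict

def duplicate_name_groups(rows):
    """Group osm_id values by normalised name; keep names used by more than one record."""
    groups = defaultdict(list)
    for row in rows:
        # Collapse runs of whitespace and ignore case (an assumed normalisation).
        key = " ".join(row["name"].split()).casefold()
        groups[key].append(row["osm_id"])
    return {name: ids for name, ids in groups.items() if len(ids) > 1}

# Illustrative rows only; not taken from the dataset.
sample = [
    {"name": "Shipwreck", "osm_id": 1},
    {"name": "shipwreck ", "osm_id": 2},
    {"name": "SS Example", "osm_id": 3},
]
dupes = duplicate_name_groups(sample)
```

Each returned group lists the `osm_id` values to compare by coordinates before deciding whether the records describe one wreck or several.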
lat
numeric · feature · outliers
This column holds geographic latitude, spanning from -77.42° (near Antarctica) to 82.17° (high Arctic), with 6,902 unique values across 6,914 rows. The mean (33.15°) sits well below the median (43.85°), consistent with a left skew of -1.42: most records cluster in mid-to-high northern latitudes, and a tail of equatorial and southern-hemisphere records pulls the mean down. Roughly 12.5% of values (864 rows) are flagged as outliers. If the flag uses the common 1.5×IQR rule (an assumption; the profile does not state its method), the lower fence is about -14.4° and the upper fence of about 94.8° lies beyond the valid range, so the flagged rows are southern-hemisphere records rather than polar extremes.
Treatment: Retain as-is for geospatial modelling; consider pairing with longitude and binning into geographic regions to handle the skewed distribution and the flagged southern-hemisphere values.
- n: 6,914
- nulls: 0 (0.0%)
- unique: 6,902
- min: -77.42
- max: 82.17
- mean: 33.15
- median: 43.85
- std: 29.88
- q1: 26.58
- q3: 53.87
- iqr: 27.29
- skew: -1.417
- kurtosis: 0.8666
- n_outliers: 864
- outlier_rate: 0.125
- zero_rate: 0
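To see which latitudes an outlier flag of this kind would pick up, the fences can be computed from the quartiles listed above. This is a sketch under the assumption that the profiler uses Tukey's 1.5×IQR rule, which the report does not confirm; `iqr_fences` is a hypothetical helper name.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# q1 and q3 are the values reported for `lat`.
lat_low, lat_high = iqr_fences(26.58, 53.87)
# lat_low is about -14.36; lat_high is about 94.8, above the 90 degree maximum,
# so under this rule only southern latitudes can be flagged.
```

Under that assumption, every flagged row lies south of roughly 14°S, which is broadly in line with the histogram bins below that latitude.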
lon
numeric · feature · outliers
This column holds geographic longitude, spanning nearly the full valid range from -179.28° to 179.45° and covering both hemispheres. The mean (3.07°) and median (8.32°) are both modestly east of the Prime Meridian, suggesting a concentration of records at European and African longitudes, while the wide IQR of 58.74° and standard deviation of 69.12° confirm global scatter. Notably, 806 rows (11.66%) are flagged as outliers. If the flag uses the common 1.5×IQR rule (an assumption), the fences sit near -128.9° and 106.1°, so these rows lie at East Asian, Australasian and Pacific longitudes; they are genuine geographic extremes relative to the main cluster, not erroneous values.
Treatment: Use as-is or pair with latitude for spatial modelling; consider converting to radians or embedding via geohash for ML pipelines.
- n: 6,914
- nulls: 0 (0.0%)
- unique: 6,910
- min: -179.3
- max: 179.4
- mean: 3.067
- median: 8.322
- std: 69.12
- q1: -40.75
- q3: 17.99
- iqr: 58.74
- skew: 0.5093
- kurtosis: 0.9211
- n_outliers: 806
- outlier_rate: 0.1166
- zero_rate: 0
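One caveat when using longitude as a model feature is that -180° and 180° are the same meridian, which a raw numeric column does not reflect. The sketch below shows a sine/cosine encoding, a different technique from the radians or geohash options named in the treatment and offered only as an illustration; `encode_lon` is a hypothetical helper.

```python
import math

def encode_lon(lon_deg):
    """Map longitude in degrees to (sin, cos) so that -180 and 180 coincide."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)

# The reported extremes are about 358.7 degrees apart as raw numbers,
# but only about 1.3 degrees apart on the globe.
west = encode_lon(-179.3)
east = encode_lon(179.4)
gap = math.dist(west, east)  # small, because the two points are neighbours
```

Whether this encoding helps depends on the model; tree-based models often cope with raw longitude, while distance-based ones usually benefit from it.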
year_sunk
categorical · metadata · long_tail · null_rate
This column records the year (or date) a vessel sank, but it is almost entirely empty: 99.46% of the 6,914 rows are null, leaving only 37 non-null values. Among those, formats are highly inconsistent: bare years ('1942', '1854'), full dates in several formats ('30 June 1890', 'June 7, 1928', '1937-09-02'), partial dates ('1963-02'), and even a range ('1643..1663'), which makes normalisation non-trivial. With 36 unique values across 37 populated rows the column is near-unique within its populated set; the top value '1942' appears only twice.
Treatment: Parse and normalise to an integer year after regex-based format detection; treat as sparse metadata and do not use as a primary feature without an imputation strategy, given 99.46% nulls.
- n: 6,914
- nulls: 6,877 (99.5%)
- unique: 36
- top_value: 1942
- top_rate: 0.05405
- cardinality: 36
- entropy: 5.155
- entropy_ratio: 0.9972
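The regex-based normalisation in the treatment might look like the sketch below. It assumes that the first four-digit year in the string is the one wanted, so a range such as '1643..1663' resolves to its lower bound; that is a simplification for illustration, and `first_year` is a hypothetical helper.

```python
import re

# A four-digit year from 1000 to 2099 that is not part of a longer digit run.
YEAR = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")

def first_year(raw):
    """Return the first four-digit year found in raw, or None."""
    if raw is None:
        return None
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None

# The format examples quoted in the column description.
examples = ["1942", "30 June 1890", "June 7, 1928", "1937-09-02", "1963-02", "1643..1663"]
years = [first_year(value) for value in examples]
```

Values that contain no recognisable year come back as `None`, so they can be reviewed by hand rather than silently coerced.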
type
categorical · label · long_tail
This column classifies wreck sites by vessel or object type, with 31 distinct categories across 6,914 records and no nulls. The distribution is dominated by 'shipwreck' (73.5% of records) and 'wreck' (19.5%), which together account for about 93% of all entries; the remaining 29 categories share roughly 7%, confirming the long-tail alert. The near-redundancy between 'shipwreck', 'wreck', and 'ship' (plus 'boat' and 'barge'), and likewise between 'aircraft', 'plane', and 'airplane', suggests an inconsistent taxonomy that may need consolidation before modelling.
Treatment: Consolidate overlapping categories (e.g. 'shipwreck'/'wreck'/'ship') into a canonical taxonomy, then one-hot or target-encode for modelling.
- n: 6,914
- nulls: 0 (0.0%)
- unique: 31
- top_value: shipwreck
- top_rate: 0.7349
- cardinality: 31
- entropy: 1.166
- entropy_ratio: 0.2353
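The consolidation step can be sketched with a lookup table. The counts below are copied from the top-20 table for `type`; the four groups and their names are hypothetical, and mapping the generic value 'wreck' to a vessel group is an assumption, since a 'wreck' could in principle be any object.

```python
from collections import Counter

# Counts copied from the top-20 table for `type` (the other 11 categories are not listed there).
TYPE_COUNTS = {
    "shipwreck": 5081, "wreck": 1345, "ship": 381, "barge": 27,
    "submarine": 18, "aircraft": 17, "plane": 10, "boat": 4,
    "vehicle": 3, "motor_vehicle": 3, "schooner": 2, "car": 2,
    "sailboat": 2, "battleship": 2, "steamer": 1, "airplane": 1,
    "freightcar": 1, "train": 1, "paddle steamer": 1, "motorbike": 1,
}

# Hypothetical grouping, chosen for illustration only.
GROUPS = {
    "vessel": ["shipwreck", "wreck", "ship", "barge", "submarine", "boat",
               "schooner", "sailboat", "battleship", "steamer", "paddle steamer"],
    "aircraft": ["aircraft", "plane", "airplane"],
    "road_vehicle": ["vehicle", "motor_vehicle", "car", "motorbike"],
    "rail": ["freightcar", "train"],
}
CANONICAL = {raw: group for group, raws in GROUPS.items() for raw in raws}

def consolidate(counts, mapping, default="other"):
    """Sum raw category counts into canonical groups; unmapped values go to default."""
    merged = Counter()
    for value, n in counts.items():
        merged[mapping.get(value, default)] += n
    return merged

merged = consolidate(TYPE_COUNTS, CANONICAL)
```

With this grouping the listed rows collapse to four categories, which makes the later one-hot or target encoding far less sparse.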
wikipedia
categorical · metadata · long_tail · null_rate
This column stores Wikipedia article references for the mapped ships and aircraft, formatted as language-prefixed titles (e.g., 'en:SS Edmund Fitzgerald', 'fr:Armorique (navire)'). The null rate is extremely high at 95.47%: only 313 of 6,914 rows have a Wikipedia reference. Among populated values, cardinality is very high (307 unique values across 313 non-null rows) and the top value appears only 4 times, indicating near-unique coverage and a long-tail distribution. Several languages are present (English 'en:', French 'fr:', Arabic 'ar:'), which could complicate downstream lookup or join logic.
Treatment: Use as an optional enrichment link; do not use in modelling due to 95.47% nulls and near-unique cardinality; parse the language prefix if language-specific resolution is needed.
- n: 6,914
- nulls: 6,601 (95.5%)
- unique: 307
- top_value: en:SS Edmund Fitzgerald
- top_rate: 0.01278
- cardinality: 307
- entropy: 8.245
- entropy_ratio: 0.998
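Parsing the language prefix is a small string operation. The sketch below assumes the prefix is a lowercase code made of letters and hyphens before the first colon; anything else is treated as an unprefixed title. `split_wikipedia_tag` is a hypothetical helper.

```python
def split_wikipedia_tag(tag):
    """Split 'lang:Title' into (lang, title); lang is None when no valid prefix is present."""
    lang, sep, title = tag.partition(":")
    # Accept only prefixes that look like language codes, e.g. 'en' or 'zh-yue'.
    looks_like_code = lang.replace("-", "").isalpha() and lang.islower()
    if sep and looks_like_code:
        return lang, title
    return None, tag

parsed = split_wikipedia_tag("en:SS Edmund Fitzgerald")
```

Splitting on the first colon only keeps titles that themselves contain a colon intact.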
wikidata
categorical · foreign_key · long_tail · null_rate
This column stores Wikidata entity identifiers (Q-codes), linking records to Wikidata knowledge-graph entries. The most striking signal is the 94.78% null rate: only 361 of 6,914 rows carry a Wikidata link. Among the 353 unique Q-codes the distribution is nearly flat: the top value 'Q1286267' appears only 4 times and the entropy ratio is 0.998, so almost every populated row points to a distinct entity.
Treatment: Use as an optional foreign key to enrich records via Wikidata API lookup; do not use as a feature directly given the 94.78% null rate.
- n: 6,914
- nulls: 6,553 (94.8%)
- unique: 353
- top_value: Q1286267
- top_rate: 0.01108
- cardinality: 353
- entropy: 8.446
- entropy_ratio: 0.9979
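Before any lookup, it helps to validate the identifier and build the entity page address. This sketch assumes Q-codes are the letter Q followed by a positive integer; it builds a link only and makes no network call. `wikidata_url` is a hypothetical helper.

```python
import re

# 'Q' followed by a positive integer without leading zeros.
QID = re.compile(r"Q[1-9][0-9]*")

def wikidata_url(qid):
    """Return the Wikidata entity page URL for a well-formed Q-code, else None."""
    if qid is None:
        return None
    cleaned = qid.strip()
    if not QID.fullmatch(cleaned):
        return None
    return "https://www.wikidata.org/wiki/" + cleaned

url = wikidata_url("Q1286267")
```

Rows that return `None` are either untagged or malformed and can be skipped during enrichment.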
description
categorical · free_text · long_tail · null_rate
This column contains free-text descriptions of wrecks and other nautical features, with entries referencing WWII-era vessels, jetties, fishing boats, and abandoned craft. The most striking signal is the 94.87% null rate: only 355 of 6,914 rows have a description, which makes the column nearly unusable at scale. Among the 304 unique values across those 355 populated rows, entropy is very high (8.05, ratio 0.976), indicating wide diversity in phrasing, and more than one language is present (e.g., French 'Chaloupe abandonnée à terre' alongside English entries).
Treatment: Exclude from modelling due to the 94.87% null rate; if used, tokenize and embed the 5.13% of values that are populated, and flag language mixing before NLP processing.
- n: 6,914
- nulls: 6,559 (94.9%)
- unique: 304
- top_value: WWII era concrete fuel barge converted into breakwater
- top_rate: 0.03944
- cardinality: 304
- entropy: 8.052
- entropy_ratio: 0.9763
heritage
categorical · feature · long_tail · null_rate
This column appears to encode a 'heritage' flag or classification with only 4 distinct values ('1', '2', 'no', 'yes'), suggesting a binary or ordinal attribute that was encoded inconsistently across sources. The critical finding is a null rate of 99.81%: only 13 of 6,914 rows have any value, which renders the column nearly useless for modelling. Among those 13 values, '2' dominates at 76.9% (10 rows), while 'no', 'yes', and '1' each appear once, indicating a mixed encoding scheme (numeric codes vs. boolean strings) on an already negligible sample.
Treatment: Drop this column; a 99.81% null rate and only 13 non-null observations make it statistically unusable.
- n: 6,914
- nulls: 6,901 (99.8%)
- unique: 4
- top_value: 2
- top_rate: 0.7692
- cardinality: 4
- entropy: 1.145
- entropy_ratio: 0.5726
access
categorical · feature · null_rate
This column appears to encode access permission or restriction tags for geographic features (likely OpenStreetMap-style data), with values such as 'yes', 'no', 'permit', 'private', 'permissive', and 'customers'. The striking finding is a 92.64% null rate: only 509 of 6,914 rows carry a value, so the attribute is absent from most of the dataset. Among the non-null values, 'yes' dominates at 66.99% of populated rows (4.9% of all rows), suggesting most tagged features have open access.
Treatment: Flag extreme sparsity (92.64% nulls); treat nulls as a distinct 'untagged' category, or drop the column if missingness renders it uninformative for modelling.
- n: 6,914
- nulls: 6,405 (92.6%)
- unique: 8
- top_value: yes
- top_rate: 0.6699
- cardinality: 8
- entropy: 1.647
- entropy_ratio: 0.549
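The two percentages quoted for 'yes' use different denominators: `top_rate` is a share of populated rows, while the value table reports a share of all rows. The arithmetic below reproduces both from the counts in the `access` table and adds the 'untagged' category suggested in the treatment; the variable names are illustrative.

```python
# Counts copied from the `access` value table.
ACCESS_COUNTS = {
    "yes": 341, "no": 73, "permit": 27, "private": 27,
    "unknown": 20, "permissive": 17, "customers": 3, "foot": 1,
}
TOTAL_ROWS = 6914

populated = sum(ACCESS_COUNTS.values())   # rows that carry a value
untagged = TOTAL_ROWS - populated         # rows with a null `access`

share_of_populated = ACCESS_COUNTS["yes"] / populated  # matches top_rate
share_of_all = ACCESS_COUNTS["yes"] / TOTAL_ROWS       # matches the table share

# Treat nulls as their own category rather than dropping them.
with_untagged = dict(ACCESS_COUNTS, untagged=untagged)
```

Keeping 'untagged' as an explicit category preserves the fact that most wrecks carry no access information at all.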
depth
categorical · feature · null_rate
This column holds a numeric depth measurement (likely in meters or a similar unit) stored as a string, with frequent values running from small decimals such as '1.1' to whole numbers such as '20'. The most striking signal is a null rate of 77.36%: only 1,565 of 6,914 rows carry a value. It is worth investigating whether depth is structurally absent (e.g., not applicable to certain record types) or missing because of a data quality issue. Among populated rows, cardinality is high (502 unique values) with an entropy ratio of 0.956, indicating a nearly uniform spread with no dominant depth value; the top value '12.4' appears only 11 times. The column should be cast to numeric before any modelling use.
Treatment: Cast to float, investigate structural vs. random missingness before imputing or dropping nulls, then use as a numeric feature.
- n: 6,914
- nulls: 5,349 (77.4%)
- unique: 502
- top_value: 12.4
- top_rate: 0.007029
- cardinality: 502
- entropy: 8.579
- entropy_ratio: 0.9563
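A defensive cast keeps unparseable strings out of the numeric column instead of raising an error. This is a minimal sketch; accepting a decimal comma and rejecting anything with units or text are assumptions about how the strings might look, not something the profile confirms.

```python
import math

def parse_depth(raw):
    """Convert a depth string to float; return None for nulls and unparseable text."""
    if raw is None:
        return None
    text = raw.strip().replace(",", ".")  # assumed: tolerate a decimal comma
    try:
        value = float(text)
    except ValueError:
        return None
    # float() also accepts 'nan' and 'inf'; treat those as missing.
    return value if math.isfinite(value) else None

depths = [parse_depth(v) for v in ["12.4", "20", "12,4", "approx 5 m", None]]
```

Counting how many non-null strings come back as `None` gives a quick measure of how clean the column is before imputation is considered.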
seamark_type
categorical · label
This column holds a maritime classification type for seamarks, most likely drawn from an OpenStreetMap or similar marine charting schema. The distribution is dominated by 'wreck' at 78.3% of the 6,455 populated rows (73.1% of all 6,914 rows); the next largest category, 'dangerous', accounts for only 9.3% of populated rows (8.6% of all rows), giving a low entropy ratio of 0.307. Several values (hull_showing, mast_showing, distributed_remains, hulk) look like sub-classifications of wrecks that would fit a hierarchy better than a flat taxonomy. The 6.6% null rate warrants attention if completeness matters for navigation-safety contexts.
Treatment: One-hot or ordinal encode; consider grouping wreck subtypes (hull_showing, mast_showing, hulk, distributed_remains) into a parent 'wreck' hierarchy before modelling, and impute or flag the 6.6% nulls.
- n: 6,914
- nulls: 459 (6.6%)
- unique: 15
- top_value: wreck
- top_rate: 0.7831
- cardinality: 15
- entropy: 1.2
- entropy_ratio: 0.307
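The parent/subtype grouping in the treatment can be expressed as a function that returns a two-level label. The subtype set comes from the treatment text; the 'untagged' label for nulls and the `to_hierarchy` name are hypothetical.

```python
# Subtypes named in the treatment as belonging under 'wreck'.
WRECK_SUBTYPES = {"hull_showing", "mast_showing", "hulk", "distributed_remains"}

def to_hierarchy(value):
    """Map a flat seamark_type value to a (parent, subtype) pair."""
    if value is None:
        return ("untagged", None)   # flag nulls instead of imputing them
    if value in WRECK_SUBTYPES:
        return ("wreck", value)
    return (value, None)            # 'wreck', 'dangerous', etc. stay top-level

pairs = [to_hierarchy(v) for v in ["wreck", "hulk", "dangerous", None]]
```

Encoding the parent and the subtype as two separate features lets a model use the coarse 'wreck' signal without losing the finer distinction.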
osm_id
numeric · identifier
This column is the OpenStreetMap (OSM) object identifier, a large integer surrogate key that OSM assigns to geographic features. All 6,914 rows have a distinct value with zero nulls, confirming it functions purely as a unique identifier. The value range (about 13 million to 13.7 billion) and flat distribution (kurtosis -1.45, near-uniform spread across an IQR of about 8.4 billion) are consistent with OSM's incrementally assigned ID space across different data vintages. No outliers are flagged, and the mild positive skew (0.44) suggests a slight concentration of older, lower-numbered IDs.
Treatment: Retain as a join/lookup key to OSM data; drop from any model feature set, as it carries no predictive signal.
- n: 6,914
- nulls: 0 (0.0%)
- unique: 6,914
- min: 1.306e+07
- max: 1.371e+10
- mean: 5.365e+09
- median: 3.145e+09
- std: 4.464e+09
- q1: 1.348e+09
- q3: 9.788e+09
- iqr: 8.44e+09
- skew: 0.4355
- kurtosis: -1.453
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
osm_type
categorical · feature
This column encodes the OpenStreetMap element type, distinguishing point features ('node') from linear or polygonal features ('way'). With only 2 distinct values across 6,914 rows and zero nulls, it is a clean binary categorical. The distribution is moderately skewed: 'node' accounts for 72.3% (5,000 rows) versus 'way' at 27.7% (1,914 rows), which is consistent with OSM datasets where point POIs outnumber way geometries.
Treatment: One-hot encode or map to a binary flag (node=1, way=0) before modelling; consider whether geometry type carries information beyond the other features in the pipeline.
- n: 6,914
- nulls: 0 (0.0%)
- unique: 2
- top_value: node
- top_rate: 0.7232
- cardinality: 2
- entropy: 0.8511
- entropy_ratio: 0.8511
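The `entropy` and `entropy_ratio` figures reported for the categorical columns are consistent with Shannon entropy in bits, divided by the base-2 logarithm of the cardinality. That definition is inferred from the numbers rather than stated by the profile, so the sketch below is an assumption; it reproduces the `osm_type` values from the node and way counts.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2 of the number of non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# 'node' and 'way' row counts from the description of osm_type.
h = entropy_bits([5000, 1914])
ratio = entropy_ratio([5000, 1914])
```

With two categories the maximum entropy is 1 bit, which is why the entropy and the ratio coincide for this column.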