quirky shipwrecks
Reading
This dataset catalogues 5,569 shipwrecks (and a handful of related features) sourced from OpenStreetMap, with 14 columns covering geography (lat/lon), OSM identifiers, type classifications, and optional metadata like depth, year sunk, and Wikipedia links. The collection is overwhelmingly homogeneous in category: 'wreck' accounts for 98.4% of non-null seamark_type values (90.2% of all rows) and 'shipwreck' for 91.2% of type, so the interesting variation lives elsewhere. Geographic spread is global (longitude ranges from -179.28 to 179.45 and latitude from -77.42 to 82.17), making the lat/lon distribution the most informative view. Be aware that descriptive fields are largely empty: heritage is 99.8% null, year_sunk 99.3% null, depth 96.3% null, and Wikipedia/Wikidata links are missing for ~94% of records, so any analysis beyond location and basic typing will be working with a small subset.
citing: row_count · column_count · seamark_type.top_rate · type.top_rate · lon.min · lon.max · lat.min · lat.max · heritage.null_rate · year_sunk.null_rate · depth.null_rate · wikipedia.null_rate · osm_type.top_rate
Charts the summary said to look at first
lon: histogram bin counts
| bin | count |
|---|---|
| -179.3 – -170.3 | 54 |
| -170.3 – -161.3 | 17 |
| -161.3 – -152.4 | 7 |
| -152.4 – -143.4 | 5 |
| -143.4 – -134.4 | 4 |
| -134.4 – -125.5 | 15 |
| -125.5 – -116.5 | 266 |
| -116.5 – -107.5 | 15 |
| -107.5 – -98.57 | 3 |
| -98.57 – -89.6 | 39 |
| -89.6 – -80.63 | 161 |
| -80.63 – -71.66 | 520 |
| -71.66 – -62.7 | 163 |
| -62.7 – -53.73 | 211 |
| -53.73 – -44.76 | 146 |
| -44.76 – -35.79 | 157 |
| -35.79 – -26.82 | 39 |
| -26.82 – -17.85 | 25 |
| -17.85 – -8.886 | 98 |
| -8.886 – 0.08197 | 586 |
| 0.08197 – 9.05 | 539 |
| 9.05 – 18.02 | 923 |
| 18.02 – 26.99 | 302 |
| 26.99 – 35.96 | 239 |
| 35.96 – 44.92 | 86 |
| 44.92 – 53.89 | 100 |
| 53.89 – 62.86 | 61 |
| 62.86 – 71.83 | 8 |
| 71.83 – 80.8 | 31 |
| 80.8 – 89.76 | 7 |
| 89.76 – 98.73 | 7 |
| 98.73 – 107.7 | 23 |
| 107.7 – 116.7 | 34 |
| 116.7 – 125.6 | 44 |
| 125.6 – 134.6 | 48 |
| 134.6 – 143.6 | 44 |
| 143.6 – 152.5 | 109 |
| 152.5 – 161.5 | 69 |
| 161.5 – 170.5 | 163 |
| 170.5 – 179.4 | 201 |
lat: histogram bin counts
| bin | count |
|---|---|
| -77.42 – -73.44 | 1 |
| -73.44 – -69.45 | 0 |
| -69.45 – -65.46 | 0 |
| -65.46 – -61.47 | 1 |
| -61.47 – -57.48 | 1 |
| -57.48 – -53.49 | 20 |
| -53.49 – -49.5 | 39 |
| -49.5 – -45.51 | 41 |
| -45.51 – -41.52 | 50 |
| -41.52 – -37.53 | 107 |
| -37.53 – -33.54 | 235 |
| -33.54 – -29.55 | 110 |
| -29.55 – -25.56 | 55 |
| -25.56 – -21.57 | 117 |
| -21.57 – -17.58 | 67 |
| -17.58 – -13.59 | 28 |
| -13.59 – -9.597 | 30 |
| -9.597 – -5.607 | 72 |
| -5.607 – -1.617 | 73 |
| -1.617 – 2.373 | 67 |
| 2.373 – 6.363 | 39 |
| 6.363 – 10.35 | 180 |
| 10.35 – 14.34 | 104 |
| 14.34 – 18.33 | 80 |
| 18.33 – 22.32 | 106 |
| 22.32 – 26.31 | 84 |
| 26.31 – 30.3 | 74 |
| 30.3 – 34.29 | 140 |
| 34.29 – 38.28 | 470 |
| 38.28 – 42.27 | 734 |
| 42.27 – 46.26 | 578 |
| 46.26 – 50.25 | 466 |
| 50.25 – 54.24 | 668 |
| 54.24 – 58.23 | 332 |
| 58.23 – 62.22 | 185 |
| 62.22 – 66.21 | 77 |
| 66.21 – 70.2 | 98 |
| 70.2 – 74.19 | 34 |
| 74.19 – 78.18 | 5 |
| 78.18 – 82.17 | 1 |
seamark_type: value counts (share is of all 5,569 rows, including nulls)
| value | count | share |
|---|---|---|
| wreck | 5026 | 90.2% |
| hulk | 56 | 1.0% |
| shoreline_construction | 14 | 0.3% |
| obstruction | 7 | 0.1% |
| harbour | 2 | 0.0% |
| restricted_area | 1 | 0.0% |
| plane | 1 | 0.0% |
| beacon_special_purpose | 1 | 0.0% |
| landmark | 1 | 0.0% |
| no | 1 | 0.0% |
type: value counts (top 20 values)
| value | count | share |
|---|---|---|
| shipwreck | 5081 | 91.2% |
| ship | 381 | 6.8% |
| barge | 27 | 0.5% |
| submarine | 18 | 0.3% |
| aircraft | 17 | 0.3% |
| plane | 10 | 0.2% |
| boat | 4 | 0.1% |
| vehicle | 3 | 0.1% |
| motor_vehicle | 3 | 0.1% |
| schooner | 2 | 0.0% |
| car | 2 | 0.0% |
| sailboat | 2 | 0.0% |
| battleship | 2 | 0.0% |
| steamer | 1 | 0.0% |
| airplane | 1 | 0.0% |
| freightcar | 1 | 0.0% |
| train | 1 | 0.0% |
| paddle steamer | 1 | 0.0% |
| motorbike | 1 | 0.0% |
| helicopter | 1 | 0.0% |
osm_type: value counts
| value | count | share |
|---|---|---|
| node | 3656 | 65.6% |
| way | 1913 | 34.4% |
Schema
14 columns
| column | type | null rate | unique | alerts |
|---|---|---|---|---|
| name | text | 0.0% | 5,497 | near_unique |
| lat | numeric | 0.0% | 5,561 | |
| lon | numeric | 0.0% | 5,568 | outliers |
| year_sunk | categorical | 99.3% | 36 | long_tail, null_rate |
| type | categorical | 0.0% | 30 | long_tail |
| wikipedia | categorical | 94.4% | 307 | long_tail, null_rate |
| wikidata | categorical | 93.5% | 353 | long_tail, null_rate |
| description | categorical | 93.9% | 291 | long_tail, null_rate |
| heritage | categorical | 99.8% | 4 | long_tail, null_rate |
| access | categorical | 90.9% | 8 | null_rate |
| depth | categorical | 96.3% | 154 | long_tail, null_rate |
| seamark_type | categorical | 8.2% | 10 | imbalance |
| osm_id | numeric | 0.0% | 5,569 | |
| osm_type | categorical | 0.0% | 2 | |
name
text identifier near_unique
This column holds short text labels for individual records, almost certainly vessel or wreck names: 5,497 of 5,569 values are unique, mean length is 17.8 characters with a median of 2 words, and the dominant token "shipwreck" appears 3,706 times alongside nautical prefixes like "ss", "uss", and "hms". Despite the near_unique alert, there are 72 duplicates (1.3%) worth inspecting, and the recurring "shipwreck"/"(wrack)" tokens suggest names follow a templated pattern rather than being free prose. Treatment: Treat as a name identifier; strip boilerplate tokens like "shipwreck" before any text matching, and do not use as a model feature.
- n: 5,569
- nulls: 0 (0.0%)
- unique: 5,497
- len_min: 2
- len_max: 153
- len_mean: 17.81
- len_median: 20
- len_p95: 21
- word_mean: 2.073
- word_median: 2
- n_empty: 0
- n_duplicates: 72
- duplicate_rate: 0.01293
- vocab_size: 6,255
- readability_flesch_mean: 71.33
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.1034
- allcaps_rate: 0.01724
- boilerplate_rate: 0
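The token stripping suggested in the treatment could look roughly like the sketch below. It assumes only the two boilerplate tokens the profile mentions ("shipwreck" and "(wrack)"); the pattern and the `clean_name` helper are hypothetical and would need extending after inspecting the real names.

```python
import re

# Hypothetical boilerplate pattern: only the two tokens named in the profile.
BOILERPLATE = re.compile(r"\(wrack\)|\bshipwreck\b", re.IGNORECASE)

def clean_name(name: str) -> str:
    """Remove boilerplate tokens, then collapse leftover whitespace."""
    stripped = BOILERPLATE.sub(" ", name)
    return " ".join(stripped.split())

# Illustrative inputs, not rows from the dataset.
print(clean_name("SS Example Shipwreck"))
print(clean_name("Beispiel (Wrack)"))
```

A name consisting only of boilerplate cleans to an empty string, so a caller may want to map that case to a missing value before matching.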
lat
numeric feature
This is a latitude feature spanning -77.42 to 82.17, covering nearly the full geographic range from Antarctica to the high Arctic. The distribution is left-skewed (skew -1.14) with a median of 40.64 well above the mean of 28.41, suggesting a concentration of points in northern mid-latitudes with a tail of southern hemisphere observations. With 5561 unique values across 5569 rows and no nulls, each record carries a near-distinct coordinate. Treatment: Pair with the matching longitude column for geospatial features rather than using as a standalone scalar.
- n: 5,569
- nulls: 0 (0.0%)
- unique: 5,561
- min: -77.42
- max: 82.17
- mean: 28.41
- median: 40.64
- std: 31.36
- q1: 12.54
- q3: 50.36
- iqr: 37.82
- skew: -1.136
- kurtosis: 0.08299
- n_outliers: 112
- outlier_rate: 0.02011
- zero_rate: 0
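One possible way to pair lat with lon is to map each coordinate to a 3D unit vector, so that points on either side of the antimeridian (lon near ±180) stay close together. This is a minimal sketch assuming coordinates in degrees and a spherical Earth; `to_unit_vector` is a hypothetical helper, not part of the profile.

```python
import math

def to_unit_vector(lat_deg: float, lon_deg: float) -> tuple:
    """Convert a lat/lon pair in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

# Two illustrative points at the reported lon extremes, same latitude.
a = to_unit_vector(10.0, 179.45)
b = to_unit_vector(10.0, -179.28)
# A dot product near 1 means the points are close on the sphere.
closeness = sum(p * q for p, q in zip(a, b))
print(closeness)
```

The raw lon values of these two points differ by almost 360, yet the vectors are nearly identical, which is the behaviour a distance-based model usually needs.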
lon
numeric feature outliers
This column holds longitude coordinates, with values ranging from -179.28 to 179.45 spanning the full globe and 5568 unique values across 5569 rows. The distribution is mildly right-skewed (0.53) with a median of 2.03 sitting near the prime meridian, and the IQR of 80.73 suggests broad geographic coverage. The flagged 542 outliers (9.7%) likely reflect points in the Pacific tails rather than data errors, given valid lon bounds. Treatment: Pair with latitude as a geospatial feature; avoid treating outliers as anomalies since extremes are valid longitudes.
- n: 5,569
- nulls: 0 (0.0%)
- unique: 5,568
- min: -179.3
- max: 179.4
- mean: 1.413
- median: 2.033
- std: 76.75
- q1: -58.6
- q3: 22.13
- iqr: 80.73
- skew: 0.5273
- kurtosis: 0.2435
- n_outliers: 542
- outlier_rate: 0.09732
- zero_rate: 0
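The outlier count can be sanity-checked against the reported quartiles. The sketch below assumes the profiler uses the common 1.5 × IQR (Tukey) fence rule, which the report does not state. Under that assumption the lower fence falls below the minimum lon, so every flagged point would sit east of roughly 143.2, and the four easternmost bins of the lon histogram in this report sum to the reported 542.

```python
# Quartiles copied from the lon statistics in this report.
q1, q3 = -58.6, 22.13
iqr = q3 - q1

# Tukey fences; the 1.5 multiplier is an assumption about the profiler.
lower_fence = q1 - 1.5 * iqr  # about -179.7, below the observed minimum
upper_fence = q3 + 1.5 * iqr  # about 143.2

# Histogram bins from this report that lie entirely above the upper fence.
bins_above_fence = {
    (143.6, 152.5): 109,
    (152.5, 161.5): 69,
    (161.5, 170.5): 163,
    (170.5, 179.4): 201,
}
flagged = sum(bins_above_fence.values())
print(lower_fence, upper_fence, flagged)
```

If the fences are right, the "outliers" are simply the easternmost longitudes, which supports the advice not to treat them as anomalies.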
year_sunk
categorical metadata long_tail null_rate
This column records the year (or fuller date) a vessel sank, but it's almost entirely empty: 99.34% null with only 36 distinct values across 5569 rows. Date formats are inconsistent: bare years like '1942' and '1435', ISO strings like '1937-09-02', verbose forms like 'June 7, 1928', and even ranges like '1643..1663' coexist. Entropy ratio of 0.997 confirms the few populated values are nearly all unique, with '1942' the only repeat (2 occurrences). Treatment: Parse to a normalized year integer and treat as sparse metadata; too null-heavy to use as a feature.
- n: 5,569
- nulls: 5,532 (99.3%)
- unique: 36
- top_value: 1942
- top_rate: 0.05405
- cardinality: 36
- entropy: 5.155
- entropy_ratio: 0.9972
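Normalising these mixed formats to a year integer might be sketched as follows. The `parse_year` helper is hypothetical. It takes the first standalone four-digit run, which means a range such as '1643..1663' collapses to its lower bound; that is a simplifying assumption, not something the profile prescribes.

```python
import re
from typing import Optional

# A standalone run of exactly four digits.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")

def parse_year(raw: Optional[str]) -> Optional[int]:
    """Return the first four-digit year found in raw, or None."""
    if not raw:
        return None
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None

# Formats quoted in the profile.
for sample in ["1942", "1937-09-02", "June 7, 1928", "1643..1663"]:
    print(sample, "->", parse_year(sample))
```

Years written with fewer than four digits would come back as None under this pattern, so any such values would need separate handling.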
type
categorical label long_tail
Categorical type label for each record, dominated overwhelmingly by maritime wreckage: 'shipwreck' accounts for 5081 of 5569 rows (91.2% top_rate) with 30 distinct values total. Entropy ratio of 0.115 confirms the long_tail alert: the remaining 29 categories split fewer than 500 rows, with several near-synonyms ('ship'/'boat'/'schooner', 'aircraft'/'plane', 'vehicle'/'motor_vehicle') suggesting inconsistent labelling that could be consolidated. Treatment: Collapse synonymous categories and consider binarising as shipwreck-vs-other given the extreme imbalance.
- n: 5,569
- nulls: 0 (0.0%)
- unique: 30
- top_value: shipwreck
- top_rate: 0.9124
- cardinality: 30
- entropy: 0.565
- entropy_ratio: 0.1151
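The consolidation and binarisation could be sketched like this. The synonym map is hypothetical and deliberately small: it covers the near-synonym groups the profile names plus 'airplane', and would need review against all 30 observed values. The counts are a subset copied from the type value table in this report.

```python
from collections import Counter

# Hypothetical synonym map; extend after reviewing every observed value.
SYNONYMS = {
    "boat": "ship",
    "schooner": "ship",
    "plane": "aircraft",
    "airplane": "aircraft",
    "motor_vehicle": "vehicle",
}

def consolidate_type(value: str) -> str:
    """Map a raw type label to its consolidated category."""
    return SYNONYMS.get(value, value)

def is_shipwreck(value: str) -> int:
    """Binary shipwreck-vs-other indicator."""
    return int(value == "shipwreck")

# A few counts copied from the type value table in this report.
observed = Counter({"shipwreck": 5081, "ship": 381, "aircraft": 17,
                    "plane": 10, "boat": 4, "schooner": 2, "airplane": 1})
merged = Counter()
for value, count in observed.items():
    merged[consolidate_type(value)] += count
print(merged)
```

Whether 'ship' itself should fold into 'shipwreck' is a domain decision this sketch leaves open.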
wikipedia
categorical metadata long_tail null_rate
This column holds Wikipedia article references prefixed with a language code (e.g., 'en:SS Edmund Fitzgerald', 'fr:Armorique (navire)', 'ar:...'), likely linking each record to an encyclopedia entry about a ship, aircraft, or wreck. It is overwhelmingly sparse (94.38% null with only 307 distinct values across 5569 rows) and the distribution is nearly flat (entropy ratio 0.998, top value appears just 4 times, top_rate 1.28%). The presence of multiple language prefixes (en, fr, ar) signals a mixed-language reference field rather than a clean categorical. Treatment: Treat as an optional external reference link; drop for modelling or split off the language prefix if needed.
- n: 5,569
- nulls: 5,256 (94.4%)
- unique: 307
- top_value: en:SS Edmund Fitzgerald
- top_rate: 0.01278
- cardinality: 307
- entropy: 8.245
- entropy_ratio: 0.998
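Splitting off the language prefix might look like the sketch below. `split_wikipedia` is a hypothetical helper. It splits on the first colon only, on the assumption that the prefix never contains one while article titles might.

```python
from typing import Optional, Tuple

def split_wikipedia(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split 'lang:Title' into (lang, title); tolerate missing parts."""
    if not value:
        return (None, None)
    lang, sep, title = value.partition(":")
    if not sep:
        # No prefix present: keep the whole string as the title.
        return (None, value)
    return (lang, title)

# Values quoted in the profile.
print(split_wikipedia("en:SS Edmund Fitzgerald"))
print(split_wikipedia("fr:Armorique (navire)"))
```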
wikidata
categorical foreign_key long_tail null_rate
This column holds Wikidata Q-identifiers (e.g., Q1286267), linking rows to entities in the Wikidata knowledge graph. It is overwhelmingly sparse (93.52% null) and among the 5569 rows only 353 unique values appear, with the most common identifier showing up just 4 times (top_rate 0.011). Entropy ratio of 0.998 confirms the non-null values are nearly all distinct, consistent with a foreign key rather than a categorical feature. Treatment: Left-join on this id to enrich with Wikidata attributes; do not use as a model feature directly.
- n: 5,569
- nulls: 5,208 (93.5%)
- unique: 353
- top_value: Q1286267
- top_rate: 0.01108
- cardinality: 353
- entropy: 8.446
- entropy_ratio: 0.9979
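A left join on this id can be illustrated with plain dictionaries. Everything in the sketch is a placeholder: the row names, the `label` attribute, and the lookup table stand in for whatever Wikidata export is actually used. The point is only that every wreck row survives the join, whether or not it has a Q-identifier.

```python
from typing import Dict, List, Optional

def left_join_wikidata(rows: List[dict], attributes: Dict[str, dict]) -> List[dict]:
    """Keep every row; add attributes where the wikidata id is known."""
    joined = []
    for row in rows:
        extra = attributes.get(row.get("wikidata"), {})
        joined.append({**row, **extra})
    return joined

# Placeholder rows and attributes, not real dataset or Wikidata content.
rows = [
    {"name": "wreck A", "wikidata": "Q1286267"},
    {"name": "wreck B", "wikidata": None},
]
attributes = {"Q1286267": {"label": "placeholder label"}}
enriched = left_join_wikidata(rows, attributes)
print(enriched)
```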
description
categorical free_text long_tail null_rate
Free-text descriptive notes about wrecks, barges, and other maritime features, populated for only ~6% of rows (null_rate 0.9386). Among the 342 non-null entries there are 291 distinct strings with entropy_ratio 0.975, so values are nearly all unique short narratives; the modal phrase 'WWII era concrete fuel barge converted into breakwater' appears just 14 times (top_rate 0.041). Mixed languages are present (e.g., French 'Chaloupe abandonnée à terre' alongside English), confirming this is curator-authored prose rather than a controlled vocabulary. Treatment: Treat as sparse free text; tokenize/embed for search or keyword extraction rather than using as a categorical feature.
- n: 5,569
- nulls: 5,227 (93.9%)
- unique: 291
- top_value: WWII era concrete fuel barge converted into breakwater
- top_rate: 0.04094
- cardinality: 291
- entropy: 7.983
- entropy_ratio: 0.9753
heritage
categorical metadata long_tail null_rate
A categorical 'heritage' field that is effectively empty: 99.77% null, with only 13 non-null values across 4 distinct levels. The observed values are inconsistent ('2', '1', 'yes', 'no'), suggesting a coding scheme that was never standardized or fully populated. Treatment: Drop; null_rate of 0.9977 leaves too little signal and the value coding is inconsistent.
- n: 5,569
- nulls: 5,556 (99.8%)
- unique: 4
- top_value: 2
- top_rate: 0.7692
- cardinality: 4
- entropy: 1.145
- entropy_ratio: 0.5726
access
categorical metadata null_rate
This is an OpenStreetMap-style 'access' tag indicating who may use a feature, with values like 'yes', 'no', 'permit', 'private', 'permissive', 'customers', and 'foot'. It is overwhelmingly null (90.88%), and among the 508 populated rows 'yes' dominates at 66.93%, leaving the other 7 categories thinly represented. Cardinality is only 8 with entropy ratio 0.55, so signal beyond presence/absence is limited. Treatment: Collapse rare levels and encode as a low-cardinality categorical, or reduce to a populated/'yes' indicator given the 90.88% null rate.
- n: 5,569
- nulls: 5,061 (90.9%)
- unique: 8
- top_value: yes
- top_rate: 0.6693
- cardinality: 8
- entropy: 1.649
- entropy_ratio: 0.5497
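Collapsing the rare levels could be as simple as the three-way encoding sketched below. The bucket names ("missing", "yes", "other") and the `encode_access` helper are assumptions for illustration; the profile only recommends collapsing, not a specific scheme.

```python
from typing import Optional

def encode_access(value: Optional[str]) -> str:
    """Collapse the access tag into three coarse buckets."""
    if value is None:
        return "missing"
    if value == "yes":
        return "yes"
    # Every remaining level (no, permit, private, ...) is rare here.
    return "other"

# Values quoted in the profile, plus a missing entry.
for sample in ["yes", "permit", "private", None]:
    print(sample, "->", encode_access(sample))
```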
depth
categorical feature long_tail null_rate
A free-text 'depth' field, almost certainly a measurement (likely meters) but stored as strings with mixed formats: bare numbers like '7', '16', '14' coexist with unit-suffixed values like '30m', '25m', and decimals like '12.2'. It is overwhelmingly missing (null_rate 0.9627) and extremely diffuse among the 208 populated rows: 154 unique values, top value '7' covers only 2.88%, and entropy_ratio 0.975 indicates a near-uniform long tail. Treatment: Strip unit suffixes and parse to numeric meters; given >96% nulls, treat as low-signal and consider dropping or flagging presence only.
- n: 5,569
- nulls: 5,361 (96.3%)
- unique: 154
- top_value: 7
- top_rate: 0.02885
- cardinality: 154
- entropy: 7.085
- entropy_ratio: 0.975
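Parsing these strings to numeric meters might be sketched as below. The `parse_depth_m` helper is hypothetical and rests on the profile's guess that bare numbers are meters. It accepts a plain or decimal number with an optional "m" suffix and returns None for anything else, rather than guessing at other units.

```python
import re
from typing import Optional

# Number with optional decimal part and optional "m" suffix.
DEPTH = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:m)?\s*$", re.IGNORECASE)

def parse_depth_m(raw: Optional[str]) -> Optional[float]:
    """Parse a depth string to meters, or None if the format is unknown."""
    if raw is None:
        return None
    match = DEPTH.match(raw)
    return float(match.group(1)) if match else None

# Formats quoted in the profile.
for sample in ["7", "30m", "12.2"]:
    print(sample, "->", parse_depth_m(sample))
```

Returning None for unrecognised formats keeps unit mistakes visible for manual review instead of silently converting them.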
seamark_type
categorical feature imbalance
Categorical seamark classification with 10 distinct values, almost entirely dominated by 'wreck' at 98.36% of non-null rows (5026 of 5110, or 90.2% of all 5569 rows). The remaining categories are extreme long-tail (hulk at 56, shoreline_construction at 14, then single-digit counts down to one), and 8.24% of rows are null. Entropy ratio of 0.044 confirms the column carries almost no discriminative signal. Treatment: Drop or collapse to a binary 'is_wreck' flag; near-constant.
- n: 5,569
- nulls: 459 (8.2%)
- unique: 10
- top_value: wreck
- top_rate: 0.9836
- cardinality: 10
- entropy: 0.1477
- entropy_ratio: 0.04447
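The two 'wreck' percentages and the entropy figures can be reproduced from the value counts. The sketch assumes that top_rate is computed over non-null rows, that entropy is Shannon entropy in bits, and that entropy_ratio divides it by log2 of the cardinality. The report does not spell out these definitions, but the numbers are consistent with them.

```python
import math

# Counts copied from the seamark_type value table in this report.
counts = {
    "wreck": 5026, "hulk": 56, "shoreline_construction": 14,
    "obstruction": 7, "harbour": 2, "restricted_area": 1, "plane": 1,
    "beacon_special_purpose": 1, "landmark": 1, "no": 1,
}
n_rows = 5569                      # all rows, including nulls
n_non_null = sum(counts.values())  # 5110, i.e. 5569 minus 459 nulls

top_rate_non_null = counts["wreck"] / n_non_null  # about 0.9836
top_share_all_rows = counts["wreck"] / n_rows     # about 0.9025

# Shannon entropy in bits over non-null values (assumed definition).
probs = [c / n_non_null for c in counts.values()]
entropy = -sum(p * math.log2(p) for p in probs)
# Normalise by the maximum entropy for this cardinality (assumed definition).
entropy_ratio = entropy / math.log2(len(counts))

print(round(top_rate_non_null, 4), round(top_share_all_rows, 4))
print(round(entropy, 4), round(entropy_ratio, 4))
```

The same calculation explains why the value table shows 90.2% while top_rate shows 98.36%: the denominators differ by the 459 null rows.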
osm_id
numeric identifier
This is almost certainly the OpenStreetMap object id: every one of the 5569 rows is unique, no nulls, no zeros, and values span 13M to 13.5B which matches OSM's monotonically growing id space. The distribution is right-skewed (skew 1.07) with mean 4.03B well above the median 2.35B, reflecting OSM's accumulation of newer, higher ids over time rather than anything analytically meaningful. Treatment: Drop from modelling; retain as a join key to OSM source data, paired with osm_type because OSM ids are unique only within an element type.
- n: 5,569
- nulls: 0 (0.0%)
- unique: 5,569
- min: 1.306e+07
- max: 1.354e+10
- mean: 4.032e+09
- median: 2.349e+09
- std: 3.875e+09
- q1: 1.181e+09
- q3: 6.516e+09
- iqr: 5.335e+09
- skew: 1.071
- kurtosis: -0.2044
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
osm_type
categorical feature
This column records the OpenStreetMap geometry type for each row, taking only two values: "node" (3656 rows, 65.6%) and "way" (1913 rows, 34.4%). With cardinality 2 and entropy ratio 0.928, the split is fairly balanced but tilted toward nodes, and there are no nulls across all 5569 rows. Treatment: Encode as a binary indicator (node vs way) for modelling.
- n: 5,569
- nulls: 0 (0.0%)
- unique: 2
- top_value: node
- top_rate: 0.6565
- cardinality: 2
- entropy: 0.9281
- entropy_ratio: 0.9281