saturn

quirky shipwrecks

source: /home/coolhand/html/datavis/data_trove/data/quirky/shipwrecks.json · 5,569 rows · 14 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogues 5,569 shipwrecks (and a handful of related features) sourced from OpenStreetMap, with 14 columns covering geography (lat/lon), OSM identifiers, type classifications, and optional metadata like depth, year sunk, and Wikipedia links. The collection is overwhelmingly homogeneous in category: 'wreck' accounts for 98.4% of non-null seamark_type values and 'shipwreck' for 91.2% of type, so the interesting variation lives elsewhere. Geographic spread is global, with longitude ranging from -179.28 to 179.45 and latitude from -77.42 to 82.17, which makes the lat/lon distribution the most informative view. Be aware that descriptive fields are largely empty: heritage is 99.8% null, year_sunk 99.3% null, depth 96.3% null, and Wikipedia/Wikidata links are missing for ~94% of records, so any analysis beyond location and basic typing will be working with a small subset.

citing: row_count · column_count · seamark_type.top_rate · type.top_rate · lon.min · lon.max · lat.min · lat.max · heritage.null_rate · year_sunk.null_rate · depth.null_rate · wikipedia.null_rate · osm_type.top_rate

Schema

14 columns
Per-column summary.

column        type         nulls   unique  alerts
name          text         0.0%    5,497   near_unique
lat           numeric      0.0%    5,561   -
lon           numeric      0.0%    5,568   outliers
year_sunk     categorical  99.3%   36      long_tail, null_rate
type          categorical  0.0%    30      long_tail
wikipedia     categorical  94.4%   307     long_tail, null_rate
wikidata      categorical  93.5%   353     long_tail, null_rate
description   categorical  93.9%   291     long_tail, null_rate
heritage      categorical  99.8%   4       long_tail, null_rate
access        categorical  90.9%   8       null_rate
depth         categorical  96.3%   154     long_tail, null_rate
seamark_type  categorical  8.2%    10      imbalance
osm_id        numeric      0.0%    5,569   -
osm_type      categorical  0.0%    2       -

name

text identifier near_unique
This column holds short text labels for individual records, almost certainly vessel or wreck names: 5,497 of 5,569 values are unique, mean length is 17.8 characters with a median of 2 words, and the dominant token "shipwreck" appears 3,706 times alongside nautical prefixes like "ss", "uss", and "hms". Despite the near_unique alert, there are 72 duplicates (1.3%) worth inspecting, and the recurring "shipwreck"/"(wrack)" tokens suggest names follow a templated pattern rather than being free prose. Treatment: Treat as a name identifier; strip boilerplate tokens like "shipwreck" before any text matching, and do not use as a model feature. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 0 (0.0%)
unique: 5,497
len_min: 2
len_max: 153
len_mean: 17.81
len_median: 20
len_p95: 21
word_mean: 2.073
word_median: 2
n_empty: 0
n_duplicates: 72
duplicate_rate: 0.01293
vocab_size: 6,255
readability_flesch_mean: 71.33
emoji_rate: 0
url_rate: 0
one_word_rate: 0.1034
allcaps_rate: 0.01724
boilerplate_rate: 0
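The treatment for name suggests stripping boilerplate tokens before any text matching. A minimal sketch of that step follows. Only "shipwreck" and "(wrack)" come from the profile; the function name and the sample inputs are made up for illustration, and a real token list would need to be reviewed against the data.

```python
import re

# Boilerplate tokens named in the profile. Extending this list is left to
# the analyst; nothing beyond these two tokens is taken from the report.
BOILERPLATE = re.compile(r"\(wrack\)|\bshipwreck\b", re.IGNORECASE)


def normalize_name(name: str) -> str:
    """Lowercase a name, drop boilerplate tokens, and collapse whitespace."""
    stripped = BOILERPLATE.sub(" ", name.lower())
    return " ".join(stripped.split())


# Illustrative inputs only; these are not rows quoted from the dataset.
print(normalize_name("SS Edmund Fitzgerald Shipwreck"))
print(normalize_name("Seute Deern (Wrack)"))
```

A name that consists only of boilerplate normalizes to an empty string, so it is worth counting those before matching on the cleaned value.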

lat

numeric feature
This is a latitude feature spanning -77.42 to 82.17, covering nearly the full geographic range from Antarctica to the high Arctic. The distribution is left-skewed (skew -1.14) with a median of 40.64 well above the mean of 28.41, suggesting a concentration of points in northern mid-latitudes with a tail of southern hemisphere observations. With 5561 unique values across 5569 rows and no nulls, each record carries a near-distinct coordinate. Treatment: Pair with the matching longitude column for geospatial features rather than using as a standalone scalar. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 0 (0.0%)
unique: 5,561
min: -77.42
max: 82.17
mean: 28.41
median: 40.64
std: 31.36
q1: 12.54
q3: 50.36
iqr: 37.82
skew: -1.136
kurtosis: 0.08299
n_outliers: 112
outlier_rate: 0.02011
zero_rate: 0
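One common way to pair latitude with longitude, rather than using either as a standalone scalar, is to map each point to a 3D unit vector on the sphere. The sketch below is one possible encoding, not something the report prescribes; the function names are illustrative. It has the side benefit that points on either side of the antimeridian end up close together, which raw longitude does not give you.

```python
import math


def latlon_to_unit_vector(lat_deg: float, lon_deg: float) -> tuple:
    """Map a latitude/longitude pair in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def chord_distance(a: tuple, b: tuple) -> float:
    """Straight-line distance between two unit vectors (0 = same point, 2 = antipodal)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# The dataset's longitude extremes, placed on the equator for illustration:
west = latlon_to_unit_vector(0.0, -179.28)
east = latlon_to_unit_vector(0.0, 179.45)
print(chord_distance(west, east))  # small, although the raw longitudes differ by ~358.7
```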

lon

numeric feature outliers
This column holds longitude coordinates, with values ranging from -179.28 to 179.45 spanning the full globe and 5568 unique values across 5569 rows. The distribution is mildly right-skewed (0.53) with a median of 2.03 sitting near the prime meridian, and the IQR of 80.73 suggests broad geographic coverage. The flagged 542 outliers (9.7%) likely reflect points in the Pacific tails rather than data errors, given valid lon bounds. Treatment: Pair with latitude as a geospatial feature; avoid treating outliers as anomalies since extremes are valid longitudes. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 0 (0.0%)
unique: 5,568
min: -179.3
max: 179.4
mean: 1.413
median: 2.033
std: 76.75
q1: -58.6
q3: 22.13
iqr: 80.73
skew: 0.5273
kurtosis: 0.2435
n_outliers: 542
outlier_rate: 0.09732
zero_rate: 0
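The report does not say how the outliers alert is computed. If it uses the common Tukey rule (values more than 1.5 × IQR beyond the quartiles), which is an assumption here, the reported quartiles are enough to work out where the flagged points would sit. The sketch below does that arithmetic for both coordinate columns.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported in the profile.
lon_low, lon_high = tukey_fences(-58.6, 22.13)
lat_low, lat_high = tukey_fences(12.54, 50.36)

print(f"lon fences: {lon_low:.3f} .. {lon_high:.3f}")
print(f"lat fences: {lat_low:.2f} .. {lat_high:.2f}")
```

Under that assumption the lower longitude fence (about -179.7) lies below the observed minimum, so every flagged longitude would be east of roughly 143.2°E. Likewise the upper latitude fence exceeds 90, so flagged latitudes would all be south of roughly -44.2. Both readings fit the view that these are valid coordinates rather than data errors.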

year_sunk

categorical metadata long_tail null_rate
This column records the year (or fuller date) a vessel sank, but it's almost entirely empty — 99.34% null with only 36 distinct values across 5569 rows. Date formats are inconsistent: bare years like '1942' and '1435', ISO strings like '1937-09-02', verbose forms like 'June 7, 1928', and even ranges like '1643..1663' coexist. Entropy ratio of 0.997 confirms the few populated values are nearly all unique, with '1942' the only repeat (2 occurrences). Treatment: Parse to a normalized year integer and treat as sparse metadata; too null-heavy to use as a feature. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 5,532 (99.3%)
unique: 36
top_value: 1942
top_rate: 0.05405
cardinality: 36
entropy: 5.155
entropy_ratio: 0.9972
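The mixed formats listed for year_sunk can be reduced to an integer year with a single pattern. The following is a rough sketch, not the profiler's own logic: it takes the first standalone four-digit number, which means a range such as '1643..1663' resolves to its first year. That choice is an assumption and may not suit every analysis.

```python
import re
from typing import Optional

# A four-digit run that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def parse_year(raw: Optional[str]) -> Optional[int]:
    """Return the first four-digit year found in raw, or None if there is none."""
    if not raw:
        return None
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None


# Sample values quoted in the column description above.
for value in ["1942", "1435", "1937-09-02", "June 7, 1928", "1643..1663", None]:
    print(repr(value), "->", parse_year(value))
```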

type

categorical label long_tail
Categorical type label for each record, dominated overwhelmingly by maritime wreckage: 'shipwreck' accounts for 5081 of 5569 rows (91.2% top_rate) with 30 distinct values total. Entropy ratio of 0.115 confirms the long_tail alert — the remaining 29 categories split fewer than 500 rows, with several near-synonyms ('ship'/'boat'/'schooner', 'aircraft'/'plane', 'vehicle'/'motor_vehicle') suggesting inconsistent labelling that could be consolidated. Treatment: Collapse synonymous categories and consider binarising as shipwreck-vs-other given the extreme imbalance. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 0 (0.0%)
unique: 30
top_value: shipwreck
top_rate: 0.9124
cardinality: 30
entropy: 0.565
entropy_ratio: 0.1151
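Collapsing the near-synonyms and binarising could look like the sketch below. The groupings follow the pairs named in the description, but the choice of which label to keep in each group is an assumption, as are the function names. Whether 'ship' should also fold into 'shipwreck' is a domain decision the profile does not settle, so it is left alone here.

```python
# Hypothetical canonical labels; only the groupings come from the profile.
SYNONYMS = {
    "boat": "ship",
    "schooner": "ship",
    "plane": "aircraft",
    "motor_vehicle": "vehicle",
}


def canonical_type(value: str) -> str:
    """Map a type label to its canonical form, leaving unknown labels unchanged."""
    return SYNONYMS.get(value, value)


def is_shipwreck(value: str) -> bool:
    """Binary shipwreck-vs-other indicator."""
    return value == "shipwreck"


print(canonical_type("schooner"), canonical_type("shipwreck"), is_shipwreck("plane"))
```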

wikipedia

categorical metadata long_tail null_rate
This column holds Wikipedia article references prefixed with a language code (e.g., 'en:SS Edmund Fitzgerald', 'fr:Armorique (navire)', 'ar:...'), likely linking each record to an encyclopedia entry about a ship, aircraft, or wreck. It is overwhelmingly sparse — 94.38% null with only 307 distinct values across 5569 rows — and the distribution is nearly flat (entropy ratio 0.998, top value appears just 4 times, top_rate 1.28%). The presence of multiple language prefixes (en, fr, ar) signals a mixed-language reference field rather than a clean categorical. Treatment: Treat as an optional external reference link; drop for modelling or split off the language prefix if needed. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 5,256 (94.4%)
unique: 307
top_value: en:SS Edmund Fitzgerald
top_rate: 0.01278
cardinality: 307
entropy: 8.245
entropy_ratio: 0.998
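Splitting off the language prefix is a one-line operation. The sketch below assumes every populated value follows the 'lang:Title' shape seen in the examples; values without a colon are returned with no language. The function name is illustrative.

```python
from typing import Optional, Tuple


def split_wikipedia_ref(ref: str) -> Tuple[Optional[str], str]:
    """Split 'lang:Title' at the first colon; titles may contain further colons."""
    lang, sep, title = ref.partition(":")
    if not sep:
        return None, ref
    return lang, title


print(split_wikipedia_ref("en:SS Edmund Fitzgerald"))
print(split_wikipedia_ref("fr:Armorique (navire)"))
```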

wikidata

categorical foreign_key long_tail null_rate
This column holds Wikidata Q-identifiers (e.g., Q1286267), linking rows to entities in the Wikidata knowledge graph. It is overwhelmingly sparse — 93.52% null — and among the 5569 rows only 353 unique values appear, with the most common identifier showing up just 4 times (top_rate 0.011). Entropy ratio of 0.998 confirms the non-null values are nearly all distinct, consistent with a foreign key rather than a categorical feature. Treatment: Left-join on this id to enrich with Wikidata attributes; do not use as a model feature directly. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 5,208 (93.5%)
unique: 353
top_value: Q1286267
top_rate: 0.01108
cardinality: 353
entropy: 8.446
entropy_ratio: 0.9979
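A left join on this id can be sketched with plain dictionaries. Everything about the enrichment table below is hypothetical: the attribute name, its value, and the sample osm_id numbers are placeholders, since the report does not say which Wikidata attributes would be fetched or how.

```python
import re

# Wikidata item ids are "Q" followed by a positive integer.
QID = re.compile(r"Q[1-9]\d*")


def left_join_wikidata(rows: list, attributes: dict) -> list:
    """Attach attributes by Q-id, keeping every input row (left join)."""
    joined = []
    for row in rows:
        qid = row.get("wikidata")
        extra = attributes.get(qid, {}) if qid and QID.fullmatch(qid) else {}
        joined.append({**row, **extra})
    return joined


# Placeholder data for illustration only.
rows = [
    {"osm_id": 1, "wikidata": "Q1286267"},
    {"osm_id": 2, "wikidata": None},
]
attributes = {"Q1286267": {"wd_label": "placeholder label"}}

for record in left_join_wikidata(rows, attributes):
    print(record)
```

Because roughly 93.5% of rows have no id, most output records will carry no enrichment fields, and downstream code should treat them as optional.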

description

categorical free_text long_tail null_rate
Free-text descriptive notes about wrecks, barges, and other maritime features, populated for only ~6% of rows (null_rate 0.9386). Among the 342 non-null entries there are 291 distinct strings with entropy_ratio 0.975, so values are nearly all unique short narratives; the modal phrase 'WWII era concrete fuel barge converted into breakwater' appears just 14 times (top_rate 0.041). Mixed languages are present (e.g., French 'Chaloupe abandonnée à terre' alongside English), confirming this is curator-authored prose rather than a controlled vocabulary. Treatment: Treat as sparse free text; tokenize/embed for search or keyword extraction rather than using as a categorical feature. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 5,227 (93.9%)
unique: 291
top_value: WWII era concrete fuel barge converted into breakwater
top_rate: 0.04094
cardinality: 291
entropy: 7.983
entropy_ratio: 0.9753
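For keyword extraction, a simple Unicode-aware tokenizer is usually enough to get started. The sketch below is a minimal version with a length filter standing in for a stop-word list; the threshold and function names are assumptions, and mixed-language text would benefit from per-language handling that is not shown.

```python
import re
from collections import Counter

# Runs of letters only; in Python 3 this pattern also matches accented letters.
WORD = re.compile(r"[^\W\d_]+")


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return WORD.findall(text.lower())


def keyword_counts(texts: list, min_len: int = 3) -> Counter:
    """Count tokens of at least min_len characters across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in tokenize(text) if len(tok) >= min_len)
    return counts


# The two sample strings quoted in the column description.
samples = [
    "WWII era concrete fuel barge converted into breakwater",
    "Chaloupe abandonnée à terre",
]
print(keyword_counts(samples).most_common(5))
```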

heritage

categorical metadata long_tail null_rate
A categorical 'heritage' field that is effectively empty: 99.77% null, with only 13 non-null values across 4 distinct levels. The observed values are inconsistent ('2', '1', 'yes', 'no'), suggesting a coding scheme that was never standardized or fully populated. Treatment: Drop; null_rate of 0.9977 leaves too little signal and the value coding is inconsistent. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 5,556 (99.8%)
unique: 4
top_value: 2
top_rate: 0.7692
cardinality: 4
entropy: 1.145
entropy_ratio: 0.5726

access

categorical metadata null_rate
This is an OpenStreetMap-style 'access' tag indicating who may use a feature, with values like 'yes', 'no', 'permit', 'private', 'permissive', 'customers', and 'foot'. It is overwhelmingly null (90.88%), and among the 508 populated rows 'yes' dominates at 66.93%, leaving the other 7 categories thinly represented. Cardinality is only 8 with entropy ratio 0.55, so signal beyond presence/absence is limited. Treatment: Collapse rare levels and encode as a low-cardinality categorical, or reduce to a populated/'yes' indicator given the 90.88% null rate. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 5,061 (90.9%)
unique: 8
top_value: yes
top_rate: 0.6693
cardinality: 8
entropy: 1.649
entropy_ratio: 0.5497
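Collapsing rare levels can be done by frequency. The sketch below folds every level seen fewer than min_count times into 'other' and leaves nulls as they are; the threshold, the 'other' label, and the sample values are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Optional


def collapse_rare(values: List[Optional[str]], min_count: int) -> List[Optional[str]]:
    """Replace levels seen fewer than min_count times with 'other'; keep None as None."""
    counts = Counter(v for v in values if v is not None)
    return [
        None if v is None else (v if counts[v] >= min_count else "other")
        for v in values
    ]


# Made-up sample, not rows from the dataset.
sample = ["yes", "yes", "yes", "no", "permit", None]
print(collapse_rare(sample, min_count=2))
```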

depth

categorical feature long_tail null_rate
A free-text 'depth' field, almost certainly a measurement (likely meters) but stored as strings with mixed formats — bare numbers like '7', '16', '14' coexist with unit-suffixed values like '30m', '25m', and decimals like '12.2'. It is overwhelmingly missing (null_rate 0.9627) and extremely diffuse among the 208 populated rows: 154 unique values, top value '7' covers only 2.88%, and entropy_ratio 0.975 indicates a near-uniform long tail. Treatment: Strip unit suffixes and parse to numeric meters; given >96% nulls, treat as low-signal and consider dropping or flagging presence only. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 5,361 (96.3%)
unique: 154
top_value: 7
top_rate: 0.02885
cardinality: 154
entropy: 7.085
entropy_ratio: 0.975
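Parsing the depth strings might look like the following sketch. It assumes bare numbers are metres, as the description suggests is likely, and it deliberately returns None for anything that is not a plain number with an optional 'm' suffix instead of guessing at other units.

```python
import re
from typing import Optional

# A non-negative number with an optional decimal part and optional "m" suffix.
DEPTH = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(?:m)?\s*", re.IGNORECASE)


def parse_depth_m(raw: Optional[str]) -> Optional[float]:
    """Parse strings like '7', '12.2', or '30m' to metres; None if unrecognised."""
    if raw is None:
        return None
    match = DEPTH.fullmatch(raw)
    return float(match.group(1)) if match else None


# Sample values quoted in the column description, plus one unparseable string.
for value in ["7", "16", "30m", "25m", "12.2", "deep"]:
    print(repr(value), "->", parse_depth_m(value))
```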

seamark_type

categorical feature imbalance
Categorical seamark classification with 10 distinct values, almost entirely dominated by 'wreck' at 98.36% of non-null rows (5026 of 5110). The remaining categories are extreme long-tail (hulk at 56, then single-digit counts down to one), and 8.24% of rows (459) are null. Entropy ratio of 0.044 confirms the column carries almost no discriminative signal. Treatment: Drop or collapse to a binary 'is_wreck' flag; near-constant. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 459 (8.2%)
unique: 10
top_value: wreck
top_rate: 0.9836
cardinality: 10
entropy: 0.1477
entropy_ratio: 0.04447
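The top_rate here is a share of non-null rows, which matters when a binary flag has to decide what to do with the 459 nulls. The arithmetic below uses the counts from this section; how nulls should be encoded in an 'is_wreck' flag is not specified in the report, so the three-way return value is just one option.

```python
from typing import Optional

ROWS, NULLS, WRECKS = 5569, 459, 5026

non_null = ROWS - NULLS               # rows that carry a seamark_type value
rate_non_null = WRECKS / non_null     # matches the reported top_rate
rate_all_rows = WRECKS / ROWS         # share of the whole table


def is_wreck(value: Optional[str]) -> Optional[bool]:
    """True/False for populated values; None when seamark_type is missing."""
    return None if value is None else value == "wreck"


print(non_null, round(rate_non_null, 4), round(rate_all_rows, 4))
```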

osm_id

numeric identifier
This is almost certainly the OpenStreetMap object id: every one of the 5569 rows is unique, with no nulls and no zeros, and values span 13M to 13.5B, which matches OSM's monotonically growing id space. The distribution is right-skewed (skew 1.07) with mean 4.03B well above the median 2.35B, reflecting OSM's accumulation of newer, higher ids over time rather than anything analytically meaningful. Treatment: Drop from modelling; retain as a join key to OSM source data, paired with osm_type because OSM numbers nodes and ways in separate id spaces. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 0 (0.0%)
unique: 5,569
min: 1.306e+07
max: 1.354e+10
mean: 4.032e+09
median: 2.349e+09
std: 3.875e+09
q1: 1.181e+09
q3: 6.516e+09
iqr: 5.335e+09
skew: 1.071
kurtosis: -0.2044
n_outliers: 0
outlier_rate: 0
zero_rate: 0

osm_type

categorical feature
This column records the OpenStreetMap element type for each row, taking only two values: "node" (3656 rows, 65.6%) and "way" (1913 rows, 34.4%). With cardinality 2 and entropy ratio 0.928, the split is fairly balanced but tilted toward nodes, and there are no nulls across all 5569 rows. Treatment: Encode as a binary indicator (node vs way) for modelling. high · anthropic:claude-opus-4-7
n: 5,569
nulls: 0 (0.0%)
unique: 2
top_value: node
top_rate: 0.6565
cardinality: 2
entropy: 0.9281
entropy_ratio: 0.9281
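The entropy figures in this report can be reproduced from the category counts. The sketch below assumes entropy is Shannon entropy in bits and that entropy_ratio divides it by log2 of the cardinality. The report does not define either, but the assumption fits its numbers, for example 0.565 / log2(30) ≈ 0.115 for the type column.

```python
import math


def entropy_bits(counts: list) -> float:
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: list) -> float:
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# node and way counts from this section.
osm_type_counts = [3656, 1913]
print(round(entropy_bits(osm_type_counts), 4), round(entropy_ratio(osm_type_counts), 4))
```

With only two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for this column.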