saturn·

data trove strange places v5 2

source /home/coolhand/html/datavis/data_trove/data/quirky/strange_places_v5.2.json · 354,770 rows · 48 columns · profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This is a 354,770-row mashup of 14 heterogeneous 'strange places' datasets (tornadoes, UFO sightings, cave entrances, meteorites, ghost towns, earthquakes, shipwrecks, and more) unified under a single 'category' column. The first thing to examine is the category distribution: no single source dominates, but tornadoes (~71K), caves (~70K), and UFO sightings (~61K) each make up roughly 17–20% of records. The second key signal is pervasive sparsity: most domain-specific columns (depth_km, duration_seconds, shape, damage_property) carry null rates of 80–99%, because each is only meaningful for the rows belonging to its originating dataset. UFO sighting durations show extreme right skew (median 180 s, max 66 million s); earthquake depths are also right-skewed, though far less severely (skew 3.07 versus 135.9). Both are worth closer inspection within their respective subsets.

citing: category.top_values · category.null_rate · duration_seconds.stats · depth_km.stats · shape.null_rate · damage_property.null_rate · source.top_values · fatalities.top_values · event_type.top_values
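
The sparsity pattern described above can be checked by computing each column's null rate per category rather than over the whole file. The following is a minimal sketch, assuming the JSON has been loaded as a list of dicts; the helper name and the sample rows are made up for illustration, and the 'ufo_sightings' label is a guess at how that category is spelled.

```python
from collections import defaultdict

def null_rate_by_category(rows, column, key="category"):
    """Return {category: share of rows where `column` is missing or empty}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        cat = row[key]
        total[cat] += 1
        value = row.get(column)  # an absent key counts as missing
        if value is None or value == "":
            missing[cat] += 1
    return {cat: missing[cat] / total[cat] for cat in total}

# Tiny made-up sample, not taken from the real file.
sample = [
    {"category": "ufo_sightings", "shape": "light"},
    {"category": "ufo_sightings", "shape": None},
    {"category": "noaa_tornadoes", "shape": None},
    {"category": "noaa_tornadoes"},
]
rates = null_rate_by_category(sample, "shape")
```

A column whose null rate is near 1.0 for every category except one is structurally missing rather than poorly recorded, which changes how its nulls should be treated.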

Schema

48 columns
Per-column summary.
column · type · null rate · unique · alerts
latitude · numeric · 0.0% · 215,964 · high_skew outliers
longitude · numeric · 0.0% · 223,129 · -
name · text · 0.0% · 189,861 · multilingual duplicates
description · text · 0.0% · 218,717 · multilingual duplicates
category · categorical · 0.0% · 14 · -
date · text · 41.9% · 23,500 · one_word allcaps null_rate short_text duplicates
country · categorical · 55.3% · 28 · null_rate
city · text · 82.9% · 9,149 · one_word null_rate short_text duplicates
state · categorical · 58.5% · 118 · null_rate
shape · categorical · 82.9% · 28 · null_rate
duration_seconds · numeric · 82.9% · 444 · null_rate high_skew outliers
mass_g · unknown · 0.0% · - · skipped
meteorite_class · categorical · 90.9% · 395 · null_rate
fall_type · categorical · 90.9% · 2 · null_rate imbalance
magnitude · categorical · 76.7% · 294 · null_rate
depth_km · numeric · 98.9% · 1,505 · null_rate high_skew outliers
place · text · 98.9% · 3,002 · null_rate
earthquake_type · categorical · 98.9% · 3 · null_rate imbalance
volcano_type · categorical · 100.0% · 1 · null_rate imbalance
elevation_m · unknown · 0.0% · - · skipped
status · categorical · 100.0% · 1 · null_rate imbalance
last_eruption · categorical · 100.0% · 1 · null_rate imbalance
injuries · categorical · 75.6% · 233 · null_rate
fatalities · categorical · 75.6% · 57 · null_rate
length_miles · text · 79.8% · 3,795 · one_word allcaps null_rate short_text duplicates
width_yards · categorical · 79.8% · 437 · null_rate
type · categorical · 98.6% · 1 · null_rate imbalance
temperature · categorical · 98.6% · 44 · long_tail null_rate imbalance
source · categorical · 51.6% · 4 · null_rate
vessel_type · categorical · 99.0% · 23 · long_tail null_rate
cargo · categorical · 99.0% · 17 · long_tail null_rate imbalance
peak_brightness_altitude_km · categorical · 99.8% · 224 · null_rate
velocity_km_s · categorical · 99.9% · 158 · null_rate
energy_joules · categorical · 99.8% · 518 · long_tail null_rate
event_type · categorical · 95.8% · 17 · null_rate
damage_property · text · 95.8% · 1,014 · one_word allcaps null_rate short_text duplicates
cave_type · categorical · 100.0% · 5 · long_tail null_rate
cave_length_m · categorical · 99.8% · 237 · long_tail null_rate
cave_depth_m · categorical · 99.9% · 124 · long_tail null_rate
access · categorical · 98.0% · 20 · null_rate
cave_ref · text · 97.9% · 7,162 · one_word allcaps null_rate short_text
osm_id · numeric · 75.1% · 88,395 · null_rate
osm_type · categorical · 75.1% · 3 · null_rate imbalance
place_type · categorical · 94.9% · 48 · long_tail null_rate
abandoned_year · categorical · 99.7% · 147 · long_tail null_rate
abandoned_reason · unknown · 0.0% · - · skipped
former_population · categorical · 99.3% · 75 · null_rate
heritage · categorical · 100.0% · 6 · long_tail null_rate

latitude

numeric feature high_skew outliers
This column contains geographic latitude values ranging from -87.37° to 88.5°, consistent with global coordinates. The distribution is strongly left-skewed (skew = -2.84) with high kurtosis (7.30): the median sits at 40.6°N and the middle half of records falls between 33.69°N and 46.53°N, while a long tail of tropical and southern-hemisphere values pulls the mean down to 32.66°. About 9.4% of rows (33,355) are flagged as outliers. If the profiler uses the conventional 1.5×IQR rule (an assumption; the report does not state its rule), any latitude south of roughly 14.4°N or north of roughly 65.8°N is flagged, so most of these are probably legitimate low-latitude and southern-hemisphere records rather than errors. The zero_rate of 0.06% (about 226 rows) is small but worth checking for missing coordinates encoded as 0. Treatment: Retain as-is for geospatial modelling; inspect the zero-value rows as possible null sentinels, and review the 33,355 flagged records before clustering or distance-based methods, without dropping them solely because of the flag. high · anthropic:default
n
354,770
nulls
0 (0.0%)
unique
215,964
min
-87.37
max
88.5
mean
32.66
median
40.6
std
31.01
q1
33.69
q3
46.53
iqr
12.85
skew
-2.84
kurtosis
7.302
n_outliers
33,355
outlier_rate
0.09402
zero_rate
0.000637
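
To see which latitudes an IQR-based rule would flag, the fences can be computed directly from the reported quartiles. This is a sketch under the assumption that the profiler uses the common 1.5×IQR rule; the report does not state how n_outliers is derived, and `iqr_fences` is a hypothetical helper.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# q1 and q3 as reported for latitude above.
lower, upper = iqr_fences(33.69, 46.53)
# Under this assumed rule, values below `lower` or above `upper` are flagged.
```

With these quartiles the lower fence lands in the northern tropics, so a flagged latitude here need not be an implausible coordinate.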

longitude

numeric feature
This column contains geographic longitude values for 354,770 records, spanning the full valid range from -179.28° to 180°. The distribution is moderately right-skewed (skew = 0.755) with a mean of -31.75° and median of -42.66°, indicating a concentration of records in the Western Hemisphere (Americas/Atlantic). The IQR of 104.81° is extremely wide, suggesting genuinely global coverage rather than a region-specific dataset, and only 827 values (0.23%) are flagged as outliers. Treatment: Pair with latitude for geospatial modelling; consider coordinate binning or haversine-based features rather than treating as a raw numeric. high · anthropic:default
n
354,770
nulls
0 (0.0%)
unique
223,129
min
-179.3
max
180
mean
-31.75
median
-42.66
std
72.11
q1
-92.08
q3
12.73
iqr
104.8
skew
0.7545
kurtosis
0.1165
n_outliers
827
outlier_rate
0.002331
zero_rate
0
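
The haversine-based features mentioned in the treatment rely on great-circle distance between coordinate pairs. A minimal sketch follows, assuming a spherical Earth with mean radius 6371 km; the function name is illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Two points one degree apart across the antimeridian (about 111 km),
# although their raw longitudes differ by 359.
across_dateline = haversine_km(0.0, 179.5, 0.0, -179.5)
```

The antimeridian case is the main reason not to treat longitude as a plain numeric: values near -180 and 180 are neighbours on the globe.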

name

text label multilingual duplicates
This column contains the name or title of individual records in what appears to be a multi-domain dataset covering natural features (caves), weather events (tornadoes by US state), and UFO sightings. The duplicate rate is high at 46.5%, driven largely by templated strings like 'Unnamed Cave' (19,962 occurrences) and repeated tornado/state/count patterns. Content is predominantly English (3,363 language-detected values), but the multilingual alert reflects 30 detected languages, including French (279), Italian (236), German (230), Spanish (156), and Russian (102), which suggests internationally sourced named entities mixed into the dataset. Because nearly half of the values are non-unique, this column cannot serve as a reliable identifier. Treatment: Deduplicate or group by name pattern before use; consider splitting templated names (e.g. 'Tornado in TX, 48') into structured fields; embed free-form names if semantic similarity is needed. high · anthropic:default
n
354,770
nulls
0 (0.0%)
unique
189,861
len_min
1
len_max
235
len_mean
20
len_median
17
len_p95
32
word_mean
3.564
word_median
4
n_empty
0
n_duplicates
164,909
duplicate_rate
0.4648
vocab_size
15,811
readability_flesch_mean
64.79
emoji_rate
2.819e-06
url_rate
0
one_word_rate
0.09411
allcaps_rate
0.01283
boilerplate_rate
0
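
Splitting templated names into structured fields could look like the sketch below. The pattern is inferred from the single example 'Tornado in TX, 48' quoted above; whether all tornado names follow it, and what the trailing number means, are assumptions, so the second field is simply called `suffix`.

```python
import re

# Hypothetical pattern based on the one example in the report.
TORNADO_NAME = re.compile(r"^Tornado in (?P<state>[A-Z]{2}), (?P<suffix>\d+)$")

def split_tornado_name(name):
    """Return {'state', 'suffix'} for a templated tornado name, else None."""
    match = TORNADO_NAME.match(name)
    if match is None:
        return None
    return {"state": match.group("state"), "suffix": int(match.group("suffix"))}

parsed = split_tornado_name("Tornado in TX, 48")
unparsed = split_tornado_name("Unnamed Cave")
```

Names that return None stay as free text, so the split can be applied to the whole column without losing non-templated values.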

description

text free_text multilingual duplicates
This column contains free-text descriptions of geographic or physical features. Cave entrances, former hamlets, hot springs, shipwrecks, and tornado tracks (e.g. 'F0, 0.1mi long, 10yd wide') dominate the top values, consistent with a points-of-interest or gazetteer-style dataset. The duplicate rate is high at 38.3% (136,053 repeated values out of 354,770 rows), largely from templated entries such as 'Cave entrance' (52,067 occurrences) and storm-track boilerplate. Text is overwhelmingly English (4,893 sampled values), but 21 languages are detected, including German (28), Bashkir (13), Russian (9), and Belarusian (9), a multilingual minority that may require separate handling. Length is heavily right-skewed (median 40 chars, mean 114, p95 491), and a len_max of exactly 500 suggests that longer descriptions may have been truncated at 500 characters. Treatment: Deduplicate or group templated entries before NLP; apply language detection and route non-English rows to language-specific pipelines; tokenize and embed for semantic modelling. high · anthropic:default
n
354,770
nulls
0 (0.0%)
unique
218,717
len_min
1
len_max
500
len_mean
114
len_median
40
len_p95
491
word_mean
24.07
word_median
7
n_empty
0
n_duplicates
136,053
duplicate_rate
0.3835
vocab_size
38,639
readability_flesch_mean
66.65
emoji_rate
0
url_rate
0.008149
one_word_rate
0.01018
allcaps_rate
0.004256
boilerplate_rate
0.0002509
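
Before deduplicating, it helps to list which descriptions repeat and how often. A minimal sketch, assuming descriptions are available as a list of strings; `templated_values` and the sample values are illustrative only.

```python
from collections import Counter

def templated_values(texts, min_count=2):
    """Return (text, count) pairs for values repeated at least min_count times."""
    counts = Counter(t.strip() for t in texts if t and t.strip())
    return [(text, n) for text, n in counts.most_common() if n >= min_count]

# Made-up sample; note the trailing space on the second value.
sample = ["Cave entrance", "Cave entrance ", "Hot spring", None, "", "Cave entrance"]
repeated = templated_values(sample)
```

The resulting list can be reviewed by hand to decide which strings are boilerplate to collapse and which are genuine repeated descriptions.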

category

categorical label
This column is a data-source label with 14 distinct categories across 354,770 rows and zero nulls. The categories span scientific datasets (NOAA tornadoes, NASA meteorites, OSM features) and paranormal or anomalous phenomena (UFO sightings, Bigfoot, haunted places, megalithic portal), so this is a multi-source 'strange phenomena' aggregation. The distribution is moderately uneven: the top value 'noaa_tornadoes' holds 20.2% of rows (71,813), while 'bigfoot_sightings' has only 3,797. Even so, the entropy of 2.99 bits is 78% of the maximum possible for 14 classes (log2(14) ≈ 3.81), indicating a reasonable spread. No nulls and clean cardinality make this an immediately usable stratification variable. Treatment: Use as a stratification or grouping key; one-hot encode or target-encode for modelling. high · anthropic:default
n
354,770
nulls
0 (0.0%)
unique
14
top_value
noaa_tornadoes
top_rate
0.2024
cardinality
14
entropy
2.985
entropy_ratio
0.7841
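
The entropy and entropy_ratio figures appear to be Shannon entropy in bits and that entropy divided by log2 of the number of classes. This is inferred from the numbers (2.985 / log2(14) ≈ 0.784), not stated by the profiler. A sketch of both quantities, with hypothetical function names:

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k); 0.0 when only one class exists."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Consistency check against the reported category figures.
reported_ratio = 2.985 / math.log2(14)
# The two fall_type counts quoted in this report reproduce its entropy.
fall_type_entropy = entropy_bits([31_090, 1_096])
```

A ratio near 1 means the classes are close to evenly populated; a ratio near 0 means one class dominates.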

date

text timestamp one_word allcaps null_rate short_text duplicates
This column contains date strings, mostly in ISO format (YYYY-MM-DD), stored as text rather than a proper date type. The top values all fall on January 1st of a given year, which suggests that some sources record only the year; with 23,500 unique values, however, many rows must carry full dates, and a len_max of 30 indicates a few values in other formats. Two data quality issues stand out. First, 41.88% of rows are null (148,570), and a len_min of 0 shows that the 17,854 empty strings are counted among the non-null values, so effective missingness is about 46.9%. Second, the duplicate rate is 88.6% among the 206,200 non-null values (182,700 duplicates, 23,500 unique), which is expected for a date field. The 'allcaps' alert is a false positive from the Saturn parser: ISO date strings trigger it because they contain no lowercase letters. Treatment: Convert empty strings to null, cast to a date type, flag the missing values, and extract year as an integer feature, keeping in mind that a January 1st date may mean only the year is known. high · anthropic:default
n
354,770
nulls
148,570 (41.9%)
unique
23,500
len_min
0
len_max
30
len_mean
9.331
len_median
10
len_p95
10
word_mean
1.005
word_median
1
n_empty
17,854
n_duplicates
182,700
duplicate_rate
0.886
vocab_size
8,565
readability_flesch_mean
112.1
emoji_rate
0
url_rate
0
one_word_rate
0.9954
allcaps_rate
0.913
boilerplate_rate
0
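
Casting the text dates and extracting a year could be sketched as follows. It assumes values are either empty, ISO-like strings beginning with YYYY-MM-DD, or something unparseable; the longer values implied by len_max = 30 have not been inspected, so taking the first ten characters is a guess at their layout.

```python
from datetime import date

def parse_year(raw):
    """Return the year of a YYYY-MM-DD-prefixed string, or None if missing/invalid."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:  # empty strings are treated as missing
        return None
    try:
        return date.fromisoformat(text[:10]).year
    except ValueError:
        return None

years = [parse_year(v) for v in ["1999-01-01", "", None, "2004-10-12T08:30:00", "not a date"]]
```

Returning None rather than a default year keeps missing dates distinguishable from genuine January 1st entries.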

country

categorical feature null_rate
This column records the country of each record, using a mix of ISO 2-letter codes and other variants such as 'USA'. The most serious issue is a 55.29% null rate: 196,154 of 354,770 rows carry no country value. In addition, 'USA' (86,583) and 'US' (60,634) denote the same country but are stored as two distinct values. 'USA' alone accounts for 54.6% of the 158,616 non-null records, and the two together for about 92.8%, so the column is overwhelmingly US-dominated and its apparent cardinality is inflated by inconsistent data entry. There are also 9,497 empty-string records that escaped null detection, and entropy is low (1.34 across 28 unique values). Treatment: Unify 'USA'/'US' and other aliases into ISO-3166 codes, convert empty strings to null, then impute or flag remaining nulls before using as a categorical feature. high · anthropic:default
n
354,770
nulls
196,154 (55.3%)
unique
28
top_value
USA
top_rate
0.5459
cardinality
28
entropy
1.341
entropy_ratio
0.279
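
Alias unification and the resulting US share can be sketched as below. The alias map is hypothetical and covers only the two spellings named above; the other 26 values would need auditing before it is complete. The share arithmetic uses the counts quoted in this section.

```python
# Hypothetical alias map; extend after auditing the remaining values.
COUNTRY_ALIASES = {"USA": "US", "US": "US"}

def normalize_country(raw):
    """Map known aliases to one code; empty strings and None become None."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    return COUNTRY_ALIASES.get(text.upper(), text)

non_null = 354_770 - 196_154                    # rows with a country value
usa_share = 86_583 / non_null                   # 'USA' alone
combined_share = (86_583 + 60_634) / non_null   # 'USA' and 'US' together
```

After merging, a single US code covers the large majority of non-null rows, so it may be simpler to model country as a US / non-US flag.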

city

text feature one_word null_rate short_text duplicates
This column contains city names, apparently mostly from the US given the top values (Seattle, Phoenix, Las Vegas, Portland, Los Angeles) and top words ('beach', 'san', 'lake', 'springs'). The most striking issue is the 82.91% null rate: only about 1 in 6 rows (60,632) has a city value. Among those non-null values the duplicate rate is 84.91%, so populated rows cluster around a relatively small set of repeated cities (9,149 unique values built from 4,862 vocabulary tokens). The word 'city' appears 531 times in top_words, which most likely reflects legitimate names such as 'Kansas City' or 'Oklahoma City' rather than placeholder text or data quality noise. Treatment: Flag nulls (82.91% missing) before use rather than imputing city names; consider grouping rare cities or encoding as region/state for modelling. high · anthropic:default
n
354,770
nulls
294,138 (82.9%)
unique
9,149
len_min
3
len_max
23
len_mean
8.829
len_median
9
len_p95
14
word_mean
1.288
word_median
1
n_empty
0
n_duplicates
51,483
duplicate_rate
0.8491
vocab_size
4,862
readability_flesch_mean
21.74
emoji_rate
0
url_rate
0
one_word_rate
0.7294
allcaps_rate
0
boilerplate_rate
0

state

categorical feature null_rate
This column contains US state abbreviations, with 118 unique values against the expected 50–60, making it a geographic categorical feature that needs cleaning. The most critical signal is a 58.5% null rate: over half of the 354,770 rows have no state recorded. The top value is 'TX' at 8.6% of non-null rows, with CA and FL following. A cardinality of 118, more than double the 50 US states, suggests territories, foreign region codes, case variants, or dirty values worth auditing. Treatment: Audit the 118 unique values to identify non-US-state codes, impute or flag nulls (58.5% missing), then encode as categorical for modelling. high · anthropic:default
n
354,770
nulls
207,555 (58.5%)
unique
118
top_value
TX
top_rate
0.08645
cardinality
118
entropy
5.668
entropy_ratio
0.8236
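
The audit of the 118 values could start by separating case or whitespace variants of valid codes from codes that are not US states at all. This is a sketch: treating 'DC' as valid is a choice made here, and the sample values are invented, not taken from the file.

```python
US_STATES = frozenset(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO "
    "MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY".split()
)

def audit_state_codes(values):
    """Return (case_or_spacing_variants, unknown_codes) as sorted lists."""
    valid = US_STATES | {"DC"}
    variants, unknown = set(), set()
    for v in values:
        if v is None or v in valid:
            continue
        if v.strip().upper() in valid:
            variants.add(v)   # valid once trimmed and upper-cased
        else:
            unknown.add(v)    # territory, foreign code, or dirty value
    return sorted(variants), sorted(unknown)

variants, unknown = audit_state_codes(["TX", "tx", "ON", None, "CA "])
```

Variants can be normalised automatically, while the unknown list is short enough to review by hand.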

shape

categorical feature null_rate
This column captures the reported shape of UFO/unidentified aerial phenomena sightings, with 28 distinct categories such as 'light', 'triangle', 'circle', and 'fireball'. The null rate is 82.91% across 354,770 rows, so only 60,632 records have a shape value. The null count (294,138) is identical to that of city and duration_seconds, which is consistent with the column being populated only for UFO-sighting rows; most nulls are therefore probably structural rather than unreported. Among non-null records, 'light' dominates at 21.27%, and catch-all categories like 'unknown' (4,359) and 'other' (4,209) further dilute the informativeness of the non-missing data. Treatment: Restrict analysis to the rows where shape applies, flag any remaining nulls there as a separate 'not_reported' category before encoding, and consider consolidating 'unknown' and 'other' with it given their ambiguity. high · anthropic:default
n
354,770
nulls
294,138 (82.9%)
unique
28
top_value
light
top_rate
0.2127
cardinality
28
entropy
3.774
entropy_ratio
0.785

duration_seconds

numeric feature null_rate high_skew outliers
This column records durations in seconds, presumably of UFO sightings (its 294,138 nulls match those of shape), with values ranging from 0.01 s to 66,276,000 s (~767 days). The most striking issue is that 82.91% of rows are null, so duration is captured for roughly 1 in 6 records. Among non-null values the distribution is extremely right-skewed (skew = 135.86, kurtosis = 19,379.84): the median is just 180 s while the mean inflates to 5,410 s, and 7,753 rows (12.79% of non-null) are flagged as outliers. The maximum of 66,276,000 s is almost certainly erroneous or a data-entry artifact. Treatment: Investigate and cap or remove extreme outliers (especially values near 66276000.0), flag nulls explicitly, then log-transform before modelling. high · anthropic:default
n
354,770
nulls
294,138 (82.9%)
unique
444
min
0.01
max
6.628e+07
mean
5410
median
180
std
4.144e+05
q1
30
q3
600
iqr
570
skew
135.9
kurtosis
1.938e+04
n_outliers
7,753
outlier_rate
0.1279
zero_rate
0
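
Capping followed by a log transform might look like the sketch below. The one-day cap is an arbitrary illustration, not a value recommended by the profiler, and `cap_and_log` is a hypothetical helper.

```python
import math

ONE_DAY_S = 86_400  # hypothetical cap; choose after inspecting the upper tail

def cap_and_log(values, cap=ONE_DAY_S):
    """Clip each duration at `cap` seconds, then apply log1p; None stays None."""
    return [None if v is None else math.log1p(min(v, cap)) for v in values]

# The reported maximum expressed in days.
max_days = 66_276_000 / 86_400

transformed = cap_and_log([180, None, 66_276_000])
```

log1p is used rather than log so that very small durations close to zero stay finite.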

mass_g

unknown feature skipped
The column 'mass_g' likely represents meteorite mass in grams. The profiler skipped this column, so no distributional statistics are available, and the reported 0 nulls should not be read as complete coverage: the related meteorite_class and fall_type columns are 90.93% null, so mass is probably populated only for the roughly 32,000 meteorite rows. Skew, range, outliers, and uniqueness cannot be assessed from the evidence provided. Treatment: Re-profile to obtain null counts and distribution stats; then check for skew and consider a log-transform before modelling. low · anthropic:default
n
354,770
nulls
0 (0.0%)
unique

meteorite_class

categorical label null_rate
This column contains meteorite classification codes (e.g., 'L6', 'H5', 'CM2'), the standard group and petrologic-type designations for chondrites and other meteorite classes. The null rate is extremely high at 90.93%: only 32,186 of 354,770 rows carry a classification. Among classified records 'L6' alone accounts for 20.3% of values, and with 395 distinct classes and an entropy ratio of ~0.51 the remainder is spread over a long tail of rarer classes. Treatment: Treat nulls as an explicit 'Unknown' category or exclude the column from supervised models; because the classes are nominal, prefer target encoding or grouping of rare classes over ordinal encoding. high · anthropic:default
n
354,770
nulls
322,584 (90.9%)
unique
395
top_value
L6
top_rate
0.2033
cardinality
395
entropy
4.37
entropy_ratio
0.5067

fall_type

categorical label null_rate imbalance
This column classifies meteorite recovery type, distinguishing specimens that were 'Found' (discovered without an observed fall) from those that 'Fell' (witnessed falling). The null rate is 90.93%, so only 32,186 of 354,770 records have a value. Among those, the distribution is heavily skewed: 'Found' accounts for 96.6% (31,090) versus 'Fell' at just 3.4% (1,096), which matches real-world meteorite data but still triggers the class imbalance alert. Treatment: Impute nulls as a third category ('Unknown') or exclude from classification tasks; apply class-weighting or oversampling to address the 97:3 Found-to-Fell imbalance before modelling. high · anthropic:default
n
354,770
nulls
322,584 (90.9%)
unique
2
top_value
Found
top_rate
0.9659
cardinality
2
entropy
0.2143
entropy_ratio
0.2143
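
Class weighting for the Found/Fell imbalance can be illustrated with inverse-frequency weights, total / (n_classes × class_count). This is one common heuristic (to my knowledge the same formula scikit-learn's 'balanced' mode uses), shown here as a sketch rather than a prescribed method.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

# Counts as reported for fall_type.
weights = balanced_class_weights({"Found": 31_090, "Fell": 1_096})
```

With these counts each 'Fell' record weighs roughly 28 times as much as a 'Found' record, which offsets the imbalance in a weighted loss.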

magnitude

categorical feature null_rate
This column holds magnitude values stored as a categorical type despite being fundamentally numeric: values include integers (0, 1, 2, 3, 4) and decimals (4.5, 4.6, 4.7, 1.75). In this dataset it plausibly mixes more than one scale, for example tornado intensity ratings and earthquake magnitudes, so values may not be comparable across categories. The null rate of 76.7% triggered an alert: over three-quarters of the 354,770 rows carry no value. The value '-9' appears 1,278 times and is almost certainly a sentinel/missing-value code rather than a true measurement. The top value '0' accounts for 44.4% of non-null observations, and an entropy_ratio of 0.31 confirms a concentrated distribution despite 294 unique string representations. Treatment: Replace '-9' with NaN and cast to float, check which categories populate the column and whether missingness is systematic, then consider binning each scale separately before modelling. high · anthropic:default
n
354,770
nulls
272,093 (76.7%)
unique
294
top_value
0
top_rate
0.4436
cardinality
294
entropy
2.514
entropy_ratio
0.3065
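
Casting the string values to floats while dropping the '-9' code could be sketched as follows. That -9 is a missing-value sentinel is the report's inference, and `parse_magnitude` is a hypothetical helper.

```python
SENTINEL = -9.0  # assumed missing-value code, per the narrative above

def parse_magnitude(raw):
    """Return the magnitude as float; None for missing, sentinel, or unparseable."""
    if raw is None:
        return None
    try:
        value = float(raw.strip())
    except ValueError:  # covers empty strings and non-numeric text
        return None
    return None if value == SENTINEL else value

parsed = [parse_magnitude(v) for v in ["4.5", "-9", "-9.0", "0", "n/a", None]]
```

Comparing the parsed number rather than the raw string means spellings such as '-9.0' are caught as well.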

depth_km

numeric null_rate high_skew outliers
n
354,770
nulls
351,028 (98.9%)
unique
1,505
min
-2.261
max
248.7
mean
23.71
median
10
std
28.79
q1
10
q3
29.1
iqr
19.1
skew
3.072
kurtosis
11.61
n_outliers
314
outlier_rate
0.08391
zero_rate
0.002672

place

text null_rate
n
354,770
nulls
351,028 (98.9%)
unique
3,002
len_min
4
len_max
59
len_mean
29.47
len_median
29
len_p95
36
word_mean
6.293
word_median
6
n_empty
0
n_duplicates
740
duplicate_rate
0.1978
vocab_size
1,036
readability_flesch_mean
69.91
emoji_rate
0
url_rate
0
one_word_rate
0.0005345
allcaps_rate
0
boilerplate_rate
0

earthquake_type

categorical null_rate imbalance
n
354,770
nulls
351,028 (98.9%)
unique
3
top_value
earthquake
top_rate
0.9992
cardinality
3
entropy
0.01014
entropy_ratio
0.006396

volcano_type

categorical null_rate imbalance
n
354,770
nulls
354,600 (100.0%)
unique
1
top_value
Unknown
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

elevation_m

unknown feature skipped
This column records elevation in metres. The profiler emitted a 'skipped' alert and returned no computed statistics, so distribution shape, range, skew, and uniqueness are entirely unknown from this evidence. The reported 0 nulls across 354,770 rows may also be an artifact of the skip and should be verified, since the neighbouring volcano_type, status, and last_eruption columns are almost entirely null. The name strongly implies a continuous numeric geographic feature, but no further characterisation can be made without re-running profiling. Treatment: Re-profile to obtain null counts, range, skew, and outlier metrics; then consider log-transform or clipping if heavily right-skewed before use in modelling. low · anthropic:default
n
354,770
nulls
0 (0.0%)
unique

status

categorical null_rate imbalance
n
354,770
nulls
354,600 (100.0%)
unique
1
top_value
Unknown
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

last_eruption

categorical null_rate imbalance
n
354,770
nulls
354,600 (100.0%)
unique
1
top_value
Unknown
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

injuries

categorical feature null_rate
This column records injury counts per incident, stored as a categorical type despite being numeric in nature: values are integers ('0', '1', '2', …) with a cardinality of 233 distinct values. The null rate is severely high at 75.59%, meaning only 86,583 of 354,770 rows have a recorded value, which is flagged as an alert. Among non-null rows, 85.4% report zero injuries, producing a heavily right-skewed distribution with low entropy (1.23, entropy ratio 0.157). The 233 distinct values may simply reflect a wide range of counts, but some entries could encode ranges, text annotations, or data-entry anomalies, so validate values before casting. Treatment: Cast to numeric, investigate nulls (MCAR vs. structurally missing), treat missing as unknown rather than zero, then consider zero-inflated or count-based model treatment. medium · anthropic:default
n
354,770
nulls
268,187 (75.6%)
unique
233
top_value
0
top_rate
0.854
cardinality
233
entropy
1.234
entropy_ratio
0.1569
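
Casting count-like strings to integers while keeping missing values distinct from zero could be done as in this sketch. Rejecting negatives and non-integers is a validation choice made here, not something the profiler specifies.

```python
import math

def parse_count(raw):
    """Return a non-negative int, or None for missing or non-count values."""
    if raw is None:
        return None
    try:
        value = float(raw.strip())
    except ValueError:  # empty strings, ranges such as '1-5', free text
        return None
    if not math.isfinite(value) or value < 0 or value != int(value):
        return None
    return int(value)

def zero_share(counts):
    """Share of known (non-None) counts that are exactly zero."""
    known = [c for c in counts if c is not None]
    return sum(1 for c in known if c == 0) / len(known) if known else None

parsed = [parse_count(v) for v in ["3", "2.0", "", "1-5", "nan", None]]
```

Because missing values stay None, the zero share is computed over known counts only and is not inflated by unrecorded rows.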

fatalities

categorical feature null_rate
This column represents a count of fatalities per incident, stored as a categorical type despite being inherently numeric. The null rate is severe at 75.59%, meaning only 86,583 of 354,770 rows have a value. Among non-null rows, 92.86% record zero fatalities, with a long tail reaching at least 10; the low entropy ratio (0.088) confirms extreme concentration at '0'. Treatment: Cast to integer, investigate and impute or exclude the 75.59% nulls, then treat as a heavily zero-inflated count variable (consider zero-inflated Poisson or log1p transform for regression). high · anthropic:default
n
354,770
nulls
268,187 (75.6%)
unique
57
top_value
0
top_rate
0.9286
cardinality
57
entropy
0.5134
entropy_ratio
0.08802

length_miles

text feature one_word allcaps null_rate short_text duplicates
This column stores numeric distance measurements (miles) encoded as text strings: all values are single tokens like '0.1', '0.5', '1.0' with a mean character length of 3.69 and a max of 8. The non-null count (71,813) equals the number of 'noaa_tornadoes' rows, so this is presumably tornado path length. Two signals demand attention: the null rate is extremely high at 79.76%, meaning roughly four in five rows carry no value, and the duplicate rate among non-null values is 94.72%, reflecting a coarse, rounded measurement scale (only 3,795 unique values across 71,813 non-null rows). The top value '0.1' alone appears 15,456 times, suggesting heavy concentration at short distances. Treatment: Cast to float, investigate and handle the 79.76% nulls (impute or flag), then use directly or log-transform given likely right skew. high · anthropic:default
n
354,770
nulls
282,957 (79.8%)
unique
3,795
len_min
3
len_max
8
len_mean
3.688
len_median
3
len_p95
6
word_mean
1
word_median
1
n_empty
0
n_duplicates
68,018
duplicate_rate
0.9472
vocab_size
2,268
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0

width_yards

categorical feature null_rate
This column represents a width measured in yards, stored as a categorical type despite being numeric in nature; its null count matches length_miles exactly (282,957), so it is presumably tornado path width. Nearly 80% of values are null (null_rate = 0.7976), making missingness the dominant signal. Among the 71,813 non-null records, values are round numbers (10, 50, 100, 30, 20, 200…), suggesting manual or estimated entries rather than precise measurements. The top value '10' accounts for 20.2% of non-null rows, and with 437 unique values and an entropy ratio of 0.51, the distribution is moderately concentrated. Treatment: Cast to numeric, investigate whether nulls are structurally missing (feature absent) or simply unrecorded before imputing or dropping; log-transform or bin for modelling given round-number clustering. medium · anthropic:default
n
354,770
nulls
282,957 (79.8%)
unique
437
top_value
10
top_rate
0.2018
cardinality
437
entropy
4.493
entropy_ratio
0.5122

type

categorical null_rate imbalance
n
354,770
nulls
349,767 (98.6%)
unique
1
top_value
hot_spring
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

temperature

categorical long_tail null_rate imbalance
n
354,770
nulls
349,767 (98.6%)
unique
44
top_value
(empty string)
top_rate
0.9742
cardinality
44
entropy
0.2566
entropy_ratio
0.04699

source

categorical metadata null_rate
This column records the data provider or attribution source for each row, with only 4 distinct values drawn from named external datasets (OpenStreetMap contributors, The Megalithic Portal, NOAA Storm Events Database, OpenStreetMap). The most striking signal is a 51.56% null rate — meaning over half of all 354,770 rows carry no source attribution, which is a data quality concern for provenance tracking. The top value 'OpenStreetMap contributors' accounts for 51.44% of non-null rows (88,396 records), while the closely related 'OpenStreetMap' (8,656 records) suggests inconsistent attribution for the same upstream source. Treatment: Consolidate 'OpenStreetMap contributors' and 'OpenStreetMap' into a single category, investigate and impute or flag the 51.56% nulls before using as a stratification or filter variable. high · anthropic:default
n
354,770
nulls
182,920 (51.6%)
unique
4
top_value
OpenStreetMap contributors
top_rate
0.5144
cardinality
4
entropy
1.545
entropy_ratio
0.7724

vessel_type

categorical long_tail null_rate
n
354,770
nulls
351,117 (99.0%)
unique
23
top_value
(empty string)
top_rate
0.9064
cardinality
23
entropy
0.5764
entropy_ratio
0.1274

cargo

categorical long_tail null_rate imbalance
n
354,770
nulls
351,117 (99.0%)
unique
17
top_value
(empty string)
top_rate
0.9943
cardinality
17
entropy
0.07302
entropy_ratio
0.01786

peak_brightness_altitude_km

categorical null_rate
n
354,770
nulls
354,193 (99.8%)
unique
224
top_value
37.0
top_rate
0.06066
cardinality
224
entropy
7.187
entropy_ratio
0.9206

velocity_km_s

categorical null_rate
n
354,770
nulls
354,421 (99.9%)
unique
158
top_value
13.6
top_rate
0.01719
cardinality
158
entropy
7.052
entropy_ratio
0.9656

energy_joules

categorical long_tail null_rate
n
354,770
nulls
353,907 (99.8%)
unique
518
top_value
2.1
top_rate
0.01738
cardinality
518
entropy
8.634
entropy_ratio
0.9576

event_type

categorical null_rate
n
354,770
nulls
340,000 (95.8%)
unique
17
top_value
Tornado
top_rate
0.4288
cardinality
17
entropy
2.336
entropy_ratio
0.5715

damage_property

text one_word allcaps null_rate short_text duplicates
n
354,770
nulls
340,000 (95.8%)
unique
1,014
len_min
0
len_max
8
len_mean
4.381
len_median
5
len_p95
7
word_mean
1
word_median
1
n_empty
368
n_duplicates
13,756
duplicate_rate
0.9313
vocab_size
1,013
readability_flesch_mean
117
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.8724
boilerplate_rate
0

cave_type

categorical long_tail null_rate
n
354,770
nulls
354,729 (100.0%)
unique
5
top_value
pit
top_rate
0.878
cardinality
5
entropy
0.7693
entropy_ratio
0.3313

cave_length_m

categorical long_tail null_rate
n
354,770
nulls
354,128 (99.8%)
unique
237
top_value
5
top_rate
0.04984
cardinality
237
entropy
6.919
entropy_ratio
0.877

cave_depth_m

categorical long_tail null_rate
n
354,770
nulls
354,472 (99.9%)
unique
124
top_value
0
top_rate
0.2114
cardinality
124
entropy
5.797
entropy_ratio
0.8336

access

categorical null_rate
n
354,770
nulls
347,515 (98.0%)
unique
20
top_value
yes
top_rate
0.3795
cardinality
20
entropy
2.234
entropy_ratio
0.517

cave_ref

text one_word allcaps null_rate short_text
n
354,770
nulls
347,184 (97.9%)
unique
7,162
len_min
1
len_max
38
len_mean
6.341
len_median
7
len_p95
8
word_mean
1.068
word_median
1
n_empty
0
n_duplicates
424
duplicate_rate
0.05589
vocab_size
7,005
readability_flesch_mean
117.8
emoji_rate
0
url_rate
0
one_word_rate
0.9359
allcaps_rate
0.8559
boilerplate_rate
0

osm_id

numeric foreign_key null_rate
This column contains OpenStreetMap (OSM) numeric identifiers referencing nodes, ways, or relations in the OSM database. The most striking issue is a 75.08% null rate across 354,770 rows, meaning only about one quarter of records carry an OSM linkage. There are 88,395 unique values among 88,396 non-null rows, so the IDs are almost perfectly unique, and the flat, platykurtic distribution (kurtosis ≈ -1.23) is consistent with IDs drawn broadly across OSM's ID space (min ~1.3M, max ~13.5B), with no outliers detected. Because OSM IDs are unique only within an element type, the single repeated value may be two different element types sharing a number rather than a true duplicate. Treatment: Treat as an identifier, not a numeric feature; join to OSM data on the pair (osm_type, osm_id), leaving null rows unmatched rather than imputing IDs, and investigate whether missingness is systematic before joining. high · anthropic:default
n
354,770
nulls
266,374 (75.1%)
unique
88,395
min
1.334e+06
max
1.347e+10
mean
6.183e+09
median
6.047e+09
std
3.993e+09
q1
2.628e+09
q3
9.53e+09
iqr
6.903e+09
skew
0.1321
kurtosis
-1.228
n_outliers
0
outlier_rate
0
zero_rate
0
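
A join to OSM data is safer on the pair (osm_type, osm_id) than on the number alone, because OSM numbers nodes, ways, and relations independently. A minimal sketch, assuming rows are dicts and the OSM side is an in-memory dict keyed by that pair; the lookup contents are invented.

```python
def join_on_osm_key(rows, osm_lookup):
    """Left-join rows to osm_lookup keyed by (osm_type, osm_id)."""
    joined = []
    for row in rows:
        osm_type, osm_id = row.get("osm_type"), row.get("osm_id")
        match = None
        if osm_type is not None and osm_id is not None:
            # IDs may arrive as floats from JSON; normalise to int for the key.
            match = osm_lookup.get((osm_type, int(osm_id)))
        joined.append({**row, "osm": match})  # rows without a key stay unmatched
    return joined

# Made-up lookup: the same number used by a node and a way.
lookup = {("node", 42): {"tag": "cave_entrance"}, ("way", 42): {"tag": "ruins"}}
rows = [
    {"name": "a", "osm_type": "node", "osm_id": 42.0},
    {"name": "b", "osm_type": "way", "osm_id": 42.0},
    {"name": "c", "osm_type": None, "osm_id": None},
]
joined = join_on_osm_key(rows, lookup)
```

Joining on the number alone would silently attach the wrong feature whenever two element types share an ID.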

osm_type

categorical feature null_rate imbalance
This column stores OpenStreetMap geometry type classifications, taking only three possible values: 'node', 'way', and 'relation'. Two signals demand attention: 75.08% of the 354,770 rows are null, meaning OSM type is only recorded for roughly a quarter of records, and among the non-null values the distribution is severely imbalanced — 'node' accounts for 96.39% of non-null entries (85,204 occurrences) versus 2,560 'way' and just 632 'relation'. The near-zero entropy ratio (0.158) confirms this column carries very little discriminative information as-is. Treatment: Impute nulls as a distinct 'unknown' category, then one-hot encode; consider whether the 'way'/'relation' minority classes carry signal worth preserving or should be collapsed. high · anthropic:default
n
354,770
nulls
266,374 (75.1%)
unique
3
top_value
node
top_rate
0.9639
cardinality
3
entropy
0.2501
entropy_ratio
0.1578

place_type

categorical label long_tail null_rate
This column captures the settlement/place classification type, likely from an OpenStreetMap-style geographic dataset, with values such as 'hamlet', 'isolated_dwelling', 'village', and 'town'. The most striking signal is the extreme null rate of 94.88%, meaning only 18,154 of 354,770 rows carry a value, so the column is very sparse. Among populated rows, 'hamlet' dominates at 66.57% of non-null values, and the presence of a raw 'yes' tag (131 occurrences) indicates dirty or uncleaned OSM data that needs remediation. Treatment: Filter or impute nulls before use; remap 'yes' and other dirty values; treat as a categorical with one-hot encoding, or ordinal encoding that reflects the settlement hierarchy. high · anthropic:default
n
354,770
nulls
336,616 (94.9%)
unique
48
top_value
hamlet
top_rate
0.6657
cardinality
48
entropy
1.498
entropy_ratio
0.2682

abandoned_year

categorical long_tail null_rate
n
354,770
nulls
353,545 (99.7%)
unique
147
top_value
yes
top_rate
0.4359
cardinality
147
entropy
2.939
entropy_ratio
0.4082

abandoned_reason

unknown label skipped
This column presumably records why a place was abandoned, most likely for the ghost-town or abandoned-settlement records given the neighbouring abandoned_year and former_population columns. The profiler emitted a 'skipped' alert with no stats or uniqueness counts, meaning the column's type could not be resolved and no frequency analysis was performed. The reported null rate of exactly 0.0 is probably an artifact of the skip rather than evidence that the field is fully populated (abandoned_year is 99.7% null); its true cardinality, distribution, and value content are unknown from this evidence. Treatment: Re-profile with explicit string/categorical typing to recover null counts, value counts, and cardinality before any downstream use. low · anthropic:default
n
354,770
nulls
0 (0.0%)
unique

former_population

categorical null_rate
n
354,770
nulls
352,243 (99.3%)
unique
75
top_value
0
top_rate
0.4951
cardinality
75
entropy
2.605
entropy_ratio
0.4182

heritage

categorical long_tail null_rate
n
354,770
nulls
354,763 (100.0%)
unique
6
top_value
2
top_rate
0.2857
cardinality
6
entropy
2.522
entropy_ratio
0.9755