data trove: strange places v5.2

source /home/coolhand/html/datavis/data_trove/data/quirky/strange_places_v5.2.json · 354,770 rows · 48 columns · profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This is a 354,770-row mashup of 14 heterogeneous 'strange places' datasets (tornadoes, UFO sightings, cave entrances, meteorites, ghost towns, earthquakes, shipwrecks, and more) unified under a single 'category' column. Examine the category distribution first: no single source dominates, but tornadoes (~72K), caves (~70K), UFO sightings (~61K), and megalithic sites (~60K) each make up roughly 17–20% of records. A second key signal is the pervasive sparsity: most domain-specific columns (depth_km, duration_seconds, shape, damage_property) carry null rates of 80–99%, so each column is only meaningful for the subset of rows belonging to its originating dataset. UFO sighting durations show extreme right-skew (median 180 s, max 66 million s) and earthquake depths are also right-skewed (skew ≈ 3.1); both are worth closer inspection within their respective subsets.

citing: category.top_values · category.null_rate · duration_seconds.stats · depth_km.stats · shape.null_rate · damage_property.null_rate · source.top_values · fatalities.top_values · event_type.top_values
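
One way to confirm that the sparsity is structural is to compute each column's null rate per category rather than over the whole table. For example, the 294,138 nulls in shape equal 354,770 − 60,632, which is exactly the number of rows outside ufo_sightings. The sketch below assumes the JSON has been loaded as a list of dicts; the function name and the toy rows are hypothetical and not part of the profiler.

```python
from collections import defaultdict


def null_rate_by_category(rows, column, category_key="category"):
    """Return {category: share of rows where `column` is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        cat = row.get(category_key)
        totals[cat] += 1
        # Absent keys, None, and empty strings all count as missing.
        if row.get(column) in (None, ""):
            missing[cat] += 1
    return {cat: missing[cat] / totals[cat] for cat in totals}


# Toy rows, invented for illustration only.
rows = [
    {"category": "ufo_sightings", "shape": "light"},
    {"category": "ufo_sightings", "shape": "triangle"},
    {"category": "osm_caves", "shape": None},
    {"category": "osm_caves"},
]
rates = null_rate_by_category(rows, "shape")
print(rates)
```

A column whose null rate is near 0 in one category and 1 in all others is missing by design, not by accident, and is better analysed inside its own subset than imputed.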

Charts the summary said to look at first

category · Shows how records are split across the 14 source datasets: tornadoes, caves, UFO sightings, and megalithic sites are the four largest at 17–20% each, and all 14 categories are present.


Top values for category (14 unique shown, of 14 total).
value	count	share
noaa_tornadoes	71813	20.2%
osm_caves	70242	19.8%
ufo_sightings	60632	17.1%
megalithic_portal	60028	16.9%
nasa_meteorites	32186	9.1%
osm_ghost_towns	18154	5.1%
noaa_storm_events	14770	4.2%
haunted_places	9717	2.7%
noaa_thermal_springs	5003	1.4%
bigfoot_sightings	3797	1.1%
usgs_earthquakes	3742	1.1%
noaa_shipwrecks	3653	1.0%
nasa_fireballs	863	0.2%
usgs_volcanoes	170	0.0%

shape · For UFO sighting rows, 'light' is by far the most reported shape, followed by triangle and circle — look for the long tail of rarer forms.


Top values for shape (20 unique shown, of 28 total).
value	count	share
light	12895	3.6%
triangle	6268	1.8%
circle	5890	1.7%
fireball	4939	1.4%
unknown	4359	1.2%
other	4209	1.2%
sphere	4134	1.2%
disk	3853	1.1%
oval	2881	0.8%
formation	1908	0.5%
cigar	1569	0.4%
changing	1517	0.4%
flash	1025	0.3%
rectangle	1010	0.3%
cylinder	977	0.3%
diamond	884	0.2%
chevron	774	0.2%
teardrop	560	0.2%
egg	555	0.2%
cone	235	0.1%

event_type · Among storm-event rows, tornadoes (6,334, about 43% of non-null values) outnumber flash floods (2,358) and thunderstorm winds (2,257) by more than 2.5 to 1, a strong imbalance in weather event coverage.


Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	1.8%
Flash Flood	2358	0.7%
Thunderstorm Wind	2257	0.6%
Flood	1777	0.5%
Hail	1246	0.4%
Lightning	574	0.2%
Heavy Rain	99	0.0%
Marine Strong Wind	43	0.0%
Debris Flow	43	0.0%
Marine Thunderstorm Wind	25	0.0%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

duration_seconds · Extreme right-skew with a median of 180 seconds but a max of 66 million seconds — look for the spike near zero and the extreme outliers.


Histogram bins for duration_seconds (median: 180.0).
bin	count
0.01 – 1.657e+06	60612
1.657e+06 – 3.314e+06	11
3.314e+06 – 4.971e+06	0
4.971e+06 – 6.628e+06	3
6.628e+06 – 8.285e+06	1
8.285e+06 – 9.941e+06	0
9.941e+06 – 1.16e+07	2
1.16e+07 – 1.326e+07	0
1.326e+07 – 1.491e+07	0
1.491e+07 – 1.657e+07	0
1.657e+07 – 1.823e+07	0
1.823e+07 – 1.988e+07	0
1.988e+07 – 2.154e+07	0
2.154e+07 – 2.32e+07	0
2.32e+07 – 2.485e+07	0
2.485e+07 – 2.651e+07	0
2.651e+07 – 2.817e+07	0
2.817e+07 – 2.982e+07	0
2.982e+07 – 3.148e+07	0
3.148e+07 – 3.314e+07	0
3.314e+07 – 3.479e+07	0
3.479e+07 – 3.645e+07	0
3.645e+07 – 3.811e+07	0
3.811e+07 – 3.977e+07	0
3.977e+07 – 4.142e+07	0
4.142e+07 – 4.308e+07	0
4.308e+07 – 4.474e+07	0
4.474e+07 – 4.639e+07	0
4.639e+07 – 4.805e+07	0
4.805e+07 – 4.971e+07	0
4.971e+07 – 5.136e+07	0
5.136e+07 – 5.302e+07	2
5.302e+07 – 5.468e+07	0
5.468e+07 – 5.633e+07	0
5.633e+07 – 5.799e+07	0
5.799e+07 – 5.965e+07	0
5.965e+07 – 6.131e+07	0
6.131e+07 – 6.296e+07	0
6.296e+07 – 6.462e+07	0
6.462e+07 – 6.628e+07	1
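
With equal-width bins, 60,612 of the 60,632 values land in the first bin, so the histogram hides the shape of the bulk. A minimal sketch of decade (log10) binning follows; the function name is hypothetical and the sample values are a handful of figures taken from the stats in this report, not real rows.

```python
import math
from collections import Counter


def log10_histogram(values):
    """Count positive values per decade: bin k holds 10**k <= v < 10**(k+1)."""
    counts = Counter()
    for v in values:
        if v is None or v <= 0:
            continue  # log10 is undefined for non-positive values
        counts[math.floor(math.log10(v))] += 1
    return dict(sorted(counts.items()))


# Illustrative values only: q1, median, q3, and max from the report, plus 0.5.
sample = [0.5, 30, 180, 600, 66_276_000]
print(log10_histogram(sample))
```

Each key is the exponent of the bin's lower edge, so key 2 covers 100 s up to (but not including) 1,000 s.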

meteorite_class · L6 and H5 ordinary chondrites dominate meteorite finds, with a long tail of rarer classes worth noting for completeness of the record.


Top values for meteorite_class (20 unique shown, of 395 total).
value	count	share
L6	6544	1.8%
H5	5614	1.6%
H4	3336	0.9%
H6	3234	0.9%
L5	2750	0.8%
LL5	1899	0.5%
LL6	963	0.3%
L4	831	0.2%
H4/5	380	0.1%
CM2	281	0.1%
Iron, IIIAB	272	0.1%
H3	244	0.1%
LL	220	0.1%
E3	205	0.1%
L3	176	0.0%
LL4	160	0.0%
H5/6	156	0.0%
Ureilite	155	0.0%
Howardite	127	0.0%
Diogenite	125	0.0%

Schema

48 columns

Per-column summary.
column	type	null_rate	unique	alerts
latitude	numeric	0.0%	215,964	high_skew outliers
longitude	numeric	0.0%	223,129
name	text	0.0%	189,861	multilingual duplicates
description	text	0.0%	218,717	multilingual duplicates
category	categorical	0.0%	14
date	text	41.9%	23,500	one_word allcaps null_rate short_text duplicates
country	categorical	55.3%	28	null_rate
city	text	82.9%	9,149	one_word null_rate short_text duplicates
state	categorical	58.5%	118	null_rate
shape	categorical	82.9%	28	null_rate
duration_seconds	numeric	82.9%	444	null_rate high_skew outliers
mass_g	unknown	0.0%	—	skipped
meteorite_class	categorical	90.9%	395	null_rate
fall_type	categorical	90.9%	2	null_rate imbalance
magnitude	categorical	76.7%	294	null_rate
depth_km	numeric	98.9%	1,505	null_rate high_skew outliers
place	text	98.9%	3,002	null_rate
earthquake_type	categorical	98.9%	3	null_rate imbalance
volcano_type	categorical	100.0%	1	null_rate imbalance
elevation_m	unknown	0.0%	—	skipped
status	categorical	100.0%	1	null_rate imbalance
last_eruption	categorical	100.0%	1	null_rate imbalance
injuries	categorical	75.6%	233	null_rate
fatalities	categorical	75.6%	57	null_rate
length_miles	text	79.8%	3,795	one_word allcaps null_rate short_text duplicates
width_yards	categorical	79.8%	437	null_rate
type	categorical	98.6%	1	null_rate imbalance
temperature	categorical	98.6%	44	long_tail null_rate imbalance
source	categorical	51.6%	4	null_rate
vessel_type	categorical	99.0%	23	long_tail null_rate
cargo	categorical	99.0%	17	long_tail null_rate imbalance
peak_brightness_altitude_km	categorical	99.8%	224	null_rate
velocity_km_s	categorical	99.9%	158	null_rate
energy_joules	categorical	99.8%	518	long_tail null_rate
event_type	categorical	95.8%	17	null_rate
damage_property	text	95.8%	1,014	one_word allcaps null_rate short_text duplicates
cave_type	categorical	100.0%	5	long_tail null_rate
cave_length_m	categorical	99.8%	237	long_tail null_rate
cave_depth_m	categorical	99.9%	124	long_tail null_rate
access	categorical	98.0%	20	null_rate
cave_ref	text	97.9%	7,162	one_word allcaps null_rate short_text
osm_id	numeric	75.1%	88,395	null_rate
osm_type	categorical	75.1%	3	null_rate imbalance
place_type	categorical	94.9%	48	long_tail null_rate
abandoned_year	categorical	99.7%	147	long_tail null_rate
abandoned_reason	unknown	0.0%	—	skipped
former_population	categorical	99.3%	75	null_rate
heritage	categorical	100.0%	6	long_tail null_rate

latitude

numeric feature high_skew outliers

This column contains geographic latitude values, ranging from -87.37° to 88.5°, consistent with global coordinates. The distribution is surprisingly left-skewed (skew = -2.84) with high kurtosis (7.30), meaning there is a heavy tail toward negative (southern hemisphere) latitudes despite the median sitting at ~40.6°N — suggesting the bulk of records are mid-latitude northern hemisphere but a notable minority of extreme southern values pull the mean down. About 9.4% of rows (33,355) are flagged as outliers, likely driven by records near the poles or far southern hemisphere; the near-zero zero_rate (0.06%) is negligible but worth checking for sentinel nulls encoded as 0. Treatment: Retain as-is for geospatial modelling; investigate ~0.06% zero-value rows as possible null sentinels, and review 33,355 outlier records for data quality before clustering or distance-based methods. high · anthropic:default

n: 354,770
nulls: 0 (0.0%)
unique: 215,964
min: -87.37
max: 88.5
mean: 32.66
median: 40.6
std: 31.01
q1: 33.69
q3: 46.53
iqr: 12.85
skew: -2.84
kurtosis: 7.302
n_outliers: 33,355
outlier_rate: 0.09402
zero_rate: 0.000637
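
The report does not say how outliers are flagged. Assuming the common Tukey rule (1.5 × IQR beyond the quartiles), which is an assumption and not confirmed by the profiler, the fences implied by the stats above can be worked out directly:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 as reported for latitude.
low, high = tukey_fences(33.69, 46.53)
print(round(low, 2), round(high, 2))
```

Under that assumption, any latitude below roughly 14.4° or above roughly 65.8° would be flagged, so the outlier set would include ordinary tropical and southern-hemisphere records and not only polar ones. That is worth keeping in mind before dropping the 33,355 flagged rows.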

longitude

numeric feature

This column contains geographic longitude values for 354,770 records, spanning the full valid range from -179.28° to 180°. The distribution is moderately right-skewed (skew = 0.755) with a mean of -31.75° and median of -42.66°, indicating a concentration of records in the Western Hemisphere (Americas/Atlantic). The IQR of 104.81° is extremely wide, suggesting genuinely global coverage rather than a region-specific dataset, and only 827 values (0.23%) are flagged as outliers. Treatment: Pair with latitude for geospatial modelling; consider coordinate binning or haversine-based features rather than treating as a raw numeric. high · anthropic:default

n: 354,770
nulls: 0 (0.0%)
unique: 223,129
min: -179.3
max: 180
mean: -31.75
median: -42.66
std: 72.11
q1: -92.08
q3: 12.73
iqr: 104.8
skew: 0.7545
kurtosis: 0.1165
n_outliers: 827
outlier_rate: 0.002331
zero_rate: 0
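
As a sketch of the haversine-based features mentioned in the treatment, the function below computes great-circle distance between two latitude/longitude pairs. The radius constant is the usual mean-Earth approximation, and the function name is illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Half the equator: (0°, 0°) to (0°, 180°).
print(haversine_km(0.0, 0.0, 0.0, 180.0))
```

Unlike a raw difference of longitudes, this handles the wrap-around at ±180° correctly, which matters here because the column spans -179.28° to 180°.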

name

text label multilingual duplicates

This column contains the name or title of individual records in what appears to be a multi-domain dataset covering natural features (caves), weather events (tornadoes by US state), and UFO sightings. The duplicate rate is strikingly high at 46.5%, driven largely by templated strings like 'Unnamed Cave' (19,962 occurrences) and repeated tornado/state/count patterns. Despite the predominantly English content (3,363 language-detected values skewing English), the multilingual alert flags 30 detected languages including German (230), French (279), Italian (236), Russian (102), and Spanish (156), suggesting internationally-sourced named entities mixed into the dataset. Analysts should note that near-half of values are non-unique, so this column cannot serve as a reliable identifier. Treatment: Deduplicate or group by name pattern before use; consider splitting templated names (e.g. 'Tornado in TX, 48') into structured fields; embed free-form names if semantic similarity is needed. high · anthropic:default

n: 354,770
nulls: 0 (0.0%)
unique: 189,861
len_min: 1
len_max: 235
len_mean: 20
len_median: 17
len_p95: 32
word_mean: 3.564
word_median: 4
n_empty: 0
n_duplicates: 164,909
duplicate_rate: 0.4648
vocab_size: 15,811
readability_flesch_mean: 64.79
emoji_rate: 2.819e-06
url_rate: 0
one_word_rate: 0.09411
allcaps_rate: 0.01283
boilerplate_rate: 0
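
The duplicate figures in this report are consistent with a simple definition: non-null count minus unique count, divided by non-null count. That definition is inferred from the numbers and not documented by the profiler. A minimal sketch, with a hypothetical function name:

```python
def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate) over the non-null values."""
    non_null = [v for v in values if v is not None]
    n_duplicates = len(non_null) - len(set(non_null))
    rate = n_duplicates / len(non_null) if non_null else 0.0
    return n_duplicates, rate


# Check against the reported stats for this column.
n, unique = 354_770, 189_861
print(n - unique, round((n - unique) / n, 4))  # 164909 0.4648
```

Because the rate is taken over non-null values, columns with many nulls (date, city, length_miles) can show duplicate rates far higher than their share of all rows would suggest.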

description

text free_text multilingual duplicates

This column contains free-text descriptions of geographic or physical features — cave entrances, former hamlets, hot springs, shipwrecks, and tornado tracks (e.g. 'F0, 0.1mi long, 10yd wide') dominate the top values, suggesting a points-of-interest or geographic gazetteer dataset. The duplicate rate is strikingly high at 38.3%, driven by 136,053 repeated values out of 354,770 rows, largely from templated entries like 'Cave entrance' (52,067 occurrences) and storm-track boilerplate. Text is overwhelmingly English (4,893 sampled as English) but 21 languages are detected including German (28), Bashkir (13), Russian (9), and Belarusian (9), flagging a multilingual minority that may require separate handling. The wide spread between median length (40 chars) and mean (114 chars) with a p95 of 491 indicates a heavily right-skewed length distribution. Treatment: Deduplicate or group templated entries before NLP; apply language detection and route non-English rows to language-specific pipelines; tokenize and embed for semantic modelling. high · anthropic:default

n: 354,770
nulls: 0 (0.0%)
unique: 218,717
len_min: 1
len_max: 500
len_mean: 114
len_median: 40
len_p95: 491
word_mean: 24.07
word_median: 7
n_empty: 0
n_duplicates: 136,053
duplicate_rate: 0.3835
vocab_size: 38,639
readability_flesch_mean: 66.65
emoji_rate: 0
url_rate: 0.008149
one_word_rate: 0.01018
allcaps_rate: 0.004256
boilerplate_rate: 0.0002509
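
Templated entries such as 'F0, 0.1mi long, 10yd wide' can be split into structured fields before any NLP step. The pattern below is inferred from that single example quoted in the report; other variants may exist in the data, and the field names are illustrative.

```python
import re

# Pattern inferred from the one example in the report; real data may vary.
TRACK_RE = re.compile(
    r"^F(-?\d+), (\d+(?:\.\d+)?)mi long, (\d+(?:\.\d+)?)yd wide$"
)


def parse_track(description):
    """Return parsed tornado-track fields, or None if the text is not templated."""
    match = TRACK_RE.match(description.strip())
    if match is None:
        return None
    return {
        "f_scale": int(match.group(1)),
        "length_miles": float(match.group(2)),
        "width_yards": float(match.group(3)),
    }


print(parse_track("F0, 0.1mi long, 10yd wide"))
print(parse_track("Cave entrance"))
```

Rows that return None can then go on to language detection and embedding, while matched rows are handled as structured data.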

date

text timestamp one_word allcaps null_rate short_text duplicates

This column contains ISO-format date strings (YYYY-MM-DD) stored as text rather than a proper date type. The top values all fall on January 1st of a given year, which suggests that at least some sources record dates at year-level precision only, while the len_max of 30 indicates that some values are longer than a plain ISO date. Two major data quality issues stand out: a 41.88% null rate (including 17,854 empty strings) and an 88.6% duplicate rate among the 206,200 non-null values, which share only 23,500 unique strings. The 'allcaps' alert is a false positive from the Saturn parser: ISO date strings trigger it because they contain no lowercase letters. Treatment: Cast to date type, impute or flag the 41.88% nulls, and consider extracting year as an integer feature, since the most frequent values are Jan-1 anchored. high · anthropic:default

n: 354,770
nulls: 148,570 (41.9%)
unique: 23,500
len_min: 0
len_max: 30
len_mean: 9.331
len_median: 10
len_p95: 10
word_mean: 1.005
word_median: 1
n_empty: 17,854
n_duplicates: 182,700
duplicate_rate: 0.886
vocab_size: 8,565
readability_flesch_mean: 112.1
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9954
allcaps_rate: 0.913
boilerplate_rate: 0
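
A tolerant cast might look like the sketch below. It treats empty strings as missing and, as an assumption, reads only the first 10 characters so that longer timestamp-like values still yield a date. The function name is hypothetical.

```python
from datetime import date


def parse_iso_date(text):
    """Return a date for 'YYYY-MM-DD...' strings, or None if missing or unparseable."""
    if text is None:
        return None
    text = text.strip()
    if not text:
        return None  # empty strings count as missing
    try:
        # Assumption: longer values start with an ISO date.
        return date.fromisoformat(text[:10])
    except ValueError:
        return None


parsed = parse_iso_date("1999-01-01")
print(parsed, parsed.year)
```

Returning None instead of raising keeps the cast safe to run over all 354,770 rows; the rows that come back as None from a non-empty input are the ones worth inspecting by hand.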

country

categorical feature null_rate

This column records the country of each record, using a mix of ISO 2-letter codes and other variants such as 'USA'. The most alarming issue is a 55.29% null rate: over half of the 354,770 rows carry no country value. Compounding this, 'USA' and 'US' denote the same country but are stored as two distinct values (86,583 and 60,634 rows respectively). 'USA' alone accounts for ~54.6% of the 158,616 non-null records and the two variants together for ~92.8%, so inconsistent data entry inflates apparent cardinality. There are also 9,497 empty-string records that escaped null detection, and the distribution is heavily US-dominated with 28 unique values at low entropy (1.34). Treatment: Unify 'USA'/'US' and other aliases into ISO-3166 codes, convert empty strings to null, then impute or flag remaining nulls before using as a categorical feature. high · anthropic:default

n: 354,770
nulls: 196,154 (55.3%)
unique: 28
top_value: USA
top_rate: 0.5459
cardinality: 28
entropy: 1.341
entropy_ratio: 0.279
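
A minimal normalisation sketch follows. The alias table lists only the two variants named in this report; any further entries would have to come from auditing the 28 unique values, and the names used here are illustrative.

```python
# Only the aliases named in the report; extend after auditing all 28 values.
COUNTRY_ALIASES = {"USA": "US", "US": "US"}


def normalize_country(value):
    """Map known aliases to one code and turn blank strings into None."""
    if value is None:
        return None
    value = value.strip()
    if not value:
        return None
    return COUNTRY_ALIASES.get(value.upper(), value)


print(normalize_country("USA"), normalize_country("us"), normalize_country(""))

# Combined share of 'USA' and 'US' among non-null rows, from the reported counts.
print(round((86_583 + 60_634) / (354_770 - 196_154), 3))  # 0.928
```

Values that are not in the alias table pass through unchanged, so the function is safe to apply before the full audit is done.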

city

text feature one_word null_rate short_text duplicates

This column contains US city names, confirmed by top values (Seattle, Phoenix, Las Vegas, Portland, Los Angeles) and top words ('beach', 'san', 'lake', 'springs'). The most striking issue is the 82.91% null rate: only roughly 1 in 6 rows has a city value at all. Despite that sparsity, the duplicate rate among non-null values is 84.91%, indicating that populated rows cluster around a relatively small set of repeated cities (9,149 unique values from 4,862 vocab tokens). The word 'city' appears 531 times in top_words, which most likely reflects legitimate names such as 'Kansas City' or 'Oklahoma City' rather than placeholder text or data quality noise. Treatment: Impute or flag nulls (82.91% missing) before use; consider grouping rare cities or encoding as region/state for modelling. high · anthropic:default

n: 354,770
nulls: 294,138 (82.9%)
unique: 9,149
len_min: 3
len_max: 23
len_mean: 8.829
len_median: 9
len_p95: 14
word_mean: 1.288
word_median: 1
n_empty: 0
n_duplicates: 51,483
duplicate_rate: 0.8491
vocab_size: 4,862
readability_flesch_mean: 21.74
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7294
allcaps_rate: 0
boilerplate_rate: 0

state

categorical feature null_rate

This column contains US state abbreviations (and possibly territories or non-standard codes, given 118 unique values vs. the expected 50–60), making it a geographic categorical feature. The most critical signal is a 58.5% null rate, meaning over half the 354,770 rows have no state recorded. The top value is 'TX' at 8.6% of non-null rows, with CA and FL following; the cardinality of 118 (more than double the 50 US states) suggests the presence of territories, foreign codes, or dirty values worth auditing. Treatment: Audit the 118 unique values to identify non-US-state codes, impute or flag nulls (58.5% missing), then encode as categorical for modelling. high · anthropic:default

n: 354,770
nulls: 207,555 (58.5%)
unique: 118
top_value: TX
top_rate: 0.08645
cardinality: 118
entropy: 5.668
entropy_ratio: 0.8236
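
The audit could start from a reference set of the 50 state codes plus DC and separate values that differ only by letter case from values that are genuinely unexpected. Whether case variants actually occur in this column is not known from the report; the sketch only shows how to check, and the function name is hypothetical.

```python
from collections import Counter

US_STATE_CODES = frozenset(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS "
    "MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV "
    "WI WY DC".split()
)


def audit_states(values):
    """Return (case_variants, unknown) counters for values outside the reference set."""
    case_variants, unknown = Counter(), Counter()
    for v in values:
        if v is None or v in US_STATE_CODES:
            continue
        if v.strip().upper() in US_STATE_CODES:
            case_variants[v] += 1  # same code, different case or padding
        else:
            unknown[v] += 1
    return case_variants, unknown


# Toy values, invented for illustration only.
case_variants, unknown = audit_states(["TX", "tx", "ON", None, "CA"])
print(case_variants, unknown)
```

Case variants can be fixed mechanically, whereas the unknown bucket needs a decision per value (territory, foreign region, or error).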

shape

categorical feature null_rate

This column captures the reported shape of UFO/unidentified aerial phenomena sightings, with 28 distinct categories such as 'light', 'triangle', 'circle', and 'fireball'. The most striking issue is an 82.91% null rate across 354,770 rows, meaning only ~60,600 records have a shape value at all. Among non-null records, 'light' dominates at 21.27%, and the presence of catch-all categories like 'unknown' (4,359) and 'other' (4,209) further dilutes the informativeness of the non-missing data. Treatment: Impute or flag nulls as a separate 'not_reported' category before encoding; consider consolidating 'unknown' and 'other' with nulls given ambiguity. high · anthropic:default

n: 354,770
nulls: 294,138 (82.9%)
unique: 28
top_value: light
top_rate: 0.2127
cardinality: 28
entropy: 3.774
entropy_ratio: 0.785
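
The entropy_ratio values in this report are consistent with Shannon entropy in bits divided by log2 of the cardinality; for this column, 3.774 / log2(28) ≈ 0.785. That reading is inferred from the numbers, not documented by the profiler. A minimal sketch:

```python
import math
from collections import Counter


def entropy_and_ratio(values):
    """Return (Shannon entropy in bits, entropy / log2(cardinality)) over non-null values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    ratio = entropy / max_entropy if max_entropy else 0.0
    return entropy, ratio


print(entropy_and_ratio(["light", "triangle", "light", "circle"]))
print(round(3.774 / math.log2(28), 3))  # 0.785
```

A ratio near 1 means the values are spread almost evenly across categories; a ratio near 0 means one category dominates, as with fall_type or osm_type.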

duration_seconds

numeric feature null_rate high_skew outliers

This column records sighting durations in seconds (the 60,632 non-null rows match the ufo_sightings count), with values ranging from 0.01 s to 66,276,000 s (~767 days). The most striking issue is that 82.91% of rows are null, meaning duration is only captured for roughly 1-in-6 records. Among non-null values the distribution is catastrophically right-skewed (skew = 135.86, kurtosis = 19,379.84): the median is just 180 s while the mean inflates to 5,410 s, and 7,753 rows (12.79% of non-null) are flagged as outliers. The maximum of 66,276,000 s is almost certainly erroneous or a sentinel value. Treatment: Investigate and cap or remove extreme outliers (especially values near 66276000.0), impute or flag nulls explicitly, then log-transform before modelling. high · anthropic:default

n: 354,770
nulls: 294,138 (82.9%)
unique: 444
min: 0.01
max: 6.628e+07
mean: 5410
median: 180
std: 4.144e+05
q1: 30
q3: 600
iqr: 570
skew: 135.9
kurtosis: 1.938e+04
n_outliers: 7,753
outlier_rate: 0.1279
zero_rate: 0
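
A sketch of the cap-then-log treatment follows. The one-day cap is a hypothetical choice made only for illustration; the right threshold depends on what counts as a plausible sighting duration. Nulls are passed through as None and not imputed.

```python
import math

ONE_DAY_S = 86_400  # hypothetical cap, chosen only for illustration


def cap_and_log(values, cap=ONE_DAY_S):
    """Clip each value at `cap`, then apply log1p; None stays None."""
    return [None if v is None else math.log1p(min(v, cap)) for v in values]


# Median, a missing value, and the reported maximum.
print(cap_and_log([180, None, 66_276_000]))

# The reported maximum expressed in days.
print(round(66_276_000 / ONE_DAY_S, 2))  # 767.08
```

log1p is used instead of log so that very small durations stay finite and close to zero on the transformed scale.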

mass_g

unknown feature skipped

The column 'mass_g' likely represents mass measurements in grams. The profiler skipped this column, so no distributional statistics are available and skew, range, outliers, and uniqueness cannot be assessed. The reported null count of zero should be read with caution for the same reason: sibling meteorite fields such as meteorite_class are 90.9% null, so full coverage across all 354,770 rows would be surprising. Treatment: Re-profile to obtain distribution stats; then check for skew and consider log-transform before modelling. low · anthropic:default

n: 354,770
nulls: 0 (0.0%)
unique: —

meteorite_class

categorical label null_rate

This column contains meteorite classification codes (e.g., 'L6', 'H5', 'CM2'), representing standard petrologic-type designations for chondrite and other meteorite classes. The most striking feature is an extremely high null rate of 90.93%, meaning only roughly 32,000 of 354,770 rows carry a classification. Among classified records the distribution is moderately concentrated — 'L6' alone accounts for 20.3% of non-null values — with 395 distinct classes and an entropy ratio of ~0.51, indicating moderate spread across the taxonomy. Treatment: Impute nulls with an explicit 'Unknown' category or exclude from supervised models; encode via target or ordinal encoding given 395 classes and severe class imbalance. high · anthropic:default

n: 354,770
nulls: 322,584 (90.9%)
unique: 395
top_value: L6
top_rate: 0.2033
cardinality: 395
entropy: 4.37
entropy_ratio: 0.5067

fall_type

categorical label null_rate imbalance

This column classifies meteorite recovery type, distinguishing between specimens that were 'Found' (discovered without an observed fall) and 'Fell' (witnessed falling). The null rate is a striking 90.93%, meaning only 32,186 of 354,770 records have a value at all. Among those, the distribution is heavily skewed: 'Found' accounts for 96.6% (31,090) versus 'Fell' at just 3.4% (1,096), which aligns with real-world meteorite data but constitutes a severe class imbalance. Treatment: Impute nulls as a third category ('Unknown') or exclude from classification tasks; apply class-weighting or oversampling to address the 97:3 Found-to-Fell imbalance before modelling. high · anthropic:default

n: 354,770
nulls: 322,584 (90.9%)
unique: 2
top_value: Found
top_rate: 0.9659
cardinality: 2
entropy: 0.2143
entropy_ratio: 0.2143
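
One common class-weighting heuristic sets each weight to n_samples / (n_classes × class_count), so that every class contributes equally in total. This is one option among several and is not prescribed by the report; the function name is illustrative. Applied to the counts above:

```python
def balanced_class_weights(counts):
    """Return {label: total / (n_classes * count)} for a {label: count} mapping."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights({"Found": 31_090, "Fell": 1_096})
print({label: round(w, 3) for label, w in weights.items()})
```

With these counts, each 'Fell' record is weighted roughly 28 times as heavily as a 'Found' record, which is the ratio of the two class sizes.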

magnitude

categorical feature null_rate

This column represents a magnitude scale (likely seismic or a similar physical intensity measure) stored as a categorical type despite being fundamentally numeric; values include integers (0, 1, 2, 3, 4) and decimals (4.5, 4.6, 4.7, 1.75). The null rate of 76.7% triggered an alert: over three-quarters of the 354,770 rows carry no value. A further surprise is the value '-9', which appears 1,278 times and is almost certainly a sentinel/missing-value code rather than a true measurement. The top value '0' accounts for 44.4% of non-null observations, and the entropy_ratio of 0.31 confirms a heavily concentrated, low-diversity distribution despite 294 unique string representations. Treatment: Cast to float after replacing '-9' with NaN, investigate the 76.7% null rate for systematic missingness, then consider log-transform or binning before modelling. high · anthropic:default

n: 354,770
nulls: 272,093 (76.7%)
unique: 294
top_value: 0
top_rate: 0.4436
cardinality: 294
entropy: 2.514
entropy_ratio: 0.3065
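
A sketch of the cast described in the treatment: parse each string to float and map the suspected sentinel to missing. That -9 is a sentinel is the report's inference, not a confirmed fact, so it is exposed as a parameter. The function name is hypothetical.

```python
def parse_magnitude(raw, sentinels=(-9.0,)):
    """Return a float, or None for blanks, unparseable text, and sentinel codes."""
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    # Compare numerically so that '-9' and '-9.0' are treated alike.
    return None if value in sentinels else value


print([parse_magnitude(v) for v in ["0", "4.5", "-9", "-9.0", "", None, "n/a"]])
```

None is used here instead of NaN so that missing values stay distinguishable in plain Python; convert to NaN when loading into a numeric array.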

depth_km

numeric null_rate high_skew outliers

n: 354,770
nulls: 351,028 (98.9%)
unique: 1,505
min: -2.261
max: 248.7
mean: 23.71
median: 10
std: 28.79
q1: 10
q3: 29.1
iqr: 19.1
skew: 3.072
kurtosis: 11.61
n_outliers: 314
outlier_rate: 0.08391
zero_rate: 0.002672

place

text null_rate

n: 354,770
nulls: 351,028 (98.9%)
unique: 3,002
len_min: 4
len_max: 59
len_mean: 29.47
len_median: 29
len_p95: 36
word_mean: 6.293
word_median: 6
n_empty: 0
n_duplicates: 740
duplicate_rate: 0.1978
vocab_size: 1,036
readability_flesch_mean: 69.91
emoji_rate: 0
url_rate: 0
one_word_rate: 0.0005345
allcaps_rate: 0
boilerplate_rate: 0

earthquake_type

categorical null_rate imbalance

n: 354,770
nulls: 351,028 (98.9%)
unique: 3
top_value: earthquake
top_rate: 0.9992
cardinality: 3
entropy: 0.01014
entropy_ratio: 0.006396

volcano_type

categorical null_rate imbalance

n: 354,770
nulls: 354,600 (100.0%)
unique: 1
top_value: Unknown
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

elevation_m

unknown feature skipped

This column records elevation in metres for 354,770 rows with no nulls. The profiler emitted a 'skipped' alert and returned no computed statistics, so distribution shape, range, skew, and uniqueness are entirely unknown from this evidence. The name strongly implies a continuous numeric geographic feature, but no further characterisation can be made without re-running profiling. Treatment: Re-profile to obtain range, skew, and outlier metrics; then consider log-transform or clipping if heavily right-skewed before use in modelling. low · anthropic:default

n: 354,770
nulls: 0 (0.0%)
unique: —

status

categorical null_rate imbalance

n: 354,770
nulls: 354,600 (100.0%)
unique: 1
top_value: Unknown
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

last_eruption

categorical null_rate imbalance

n: 354,770
nulls: 354,600 (100.0%)
unique: 1
top_value: Unknown
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

injuries

categorical feature null_rate

This column records injury counts per incident, stored as a categorical type despite being numeric in nature; values are integers ('0', '1', '2', …) with a cardinality of 233 distinct values. The null rate is severely high at 75.59%, meaning only 86,583 of 354,770 rows have a recorded value (the combined noaa_tornadoes and noaa_storm_events count), which is flagged as an alert. Among non-null rows, 85.4% report zero injuries, producing a heavily right-skewed distribution with low entropy (1.23, entropy ratio 0.157). The presence of 233 distinct values suggests some entries may encode ranges, text annotations, or data-entry anomalies beyond simple integers. Treatment: Cast to numeric, investigate whether nulls are missing at random or structural, treat missing as unknown rather than zero, then consider zero-inflated or count-based model treatment. medium · anthropic:default

n: 354,770
nulls: 268,187 (75.6%)
unique: 233
top_value: 0
top_rate: 0.854
cardinality: 233
entropy: 1.234
entropy_ratio: 0.1569

fatalities

categorical feature null_rate

This column represents a count of fatalities per incident, stored as a categorical type despite being inherently numeric. The null rate is severe at 75.59%, meaning only 86,583 of 354,770 rows have a value. Among non-null rows, 92.86% record zero fatalities, with a long tail reaching at least 10; the low entropy ratio (0.088) confirms extreme concentration at '0'. Treatment: Cast to integer, investigate and impute or exclude the 75.59% nulls, then treat as a heavily zero-inflated count variable (consider zero-inflated Poisson or log1p transform for regression). high · anthropic:default

n: 354,770
nulls: 268,187 (75.6%)
unique: 57
top_value: 0
top_rate: 0.9286
cardinality: 57
entropy: 0.5134
entropy_ratio: 0.08802

length_miles

text feature one_word allcaps null_rate short_text duplicates

This column stores numeric distance measurements (miles) encoded as text strings: all values are single tokens like '0.1', '0.5', '1.0' with a mean character length of 3.69 and a max of 8. Two signals demand attention: the null rate is extremely high at 79.76%, meaning roughly four in five rows carry no value, and the duplicate rate among non-null values is 94.72%, reflecting a coarse, rounded measurement scale (only 3,795 unique values across the 71,813 non-null rows). The top value '0.1' alone appears 15,456 times, suggesting heavy concentration at short distances. Treatment: Cast to float, investigate and handle the 79.76% nulls (impute or flag), then use directly or log-transform given likely right skew. high · anthropic:default

n: 354,770
nulls: 282,957 (79.8%)
unique: 3,795
len_min: 3
len_max: 8
len_mean: 3.688
len_median: 3
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 68,018
duplicate_rate: 0.9472
vocab_size: 2,268
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

width_yards

categorical feature null_rate

This column represents a width measured in yards, most likely tornado path width since the 71,813 non-null rows match the noaa_tornadoes count, and is stored as a categorical type despite being numeric in nature. Nearly 80% of values are null (null_rate = 0.7976), making missingness the dominant signal. Among the 71,813 non-null records, values are round numbers (10, 50, 100, 30, 20, 200…), suggesting manual or estimated entries rather than precise measurements. The top value '10' accounts for 20.2% of non-null rows, and with 437 unique values and an entropy ratio of 0.51, the distribution is moderately concentrated. Treatment: Cast to numeric, investigate whether nulls are structurally missing (feature absent) or simply unrecorded before imputing or dropping; log-transform or bin for modelling given round-number clustering. medium · anthropic:default

n: 354,770
nulls: 282,957 (79.8%)
unique: 437
top_value: 10
top_rate: 0.2018
cardinality: 437
entropy: 4.493
entropy_ratio: 0.5122

type

categorical null_rate imbalance

n: 354,770
nulls: 349,767 (98.6%)
unique: 1
top_value: hot_spring
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

temperature

categorical long_tail null_rate imbalance

n: 354,770
nulls: 349,767 (98.6%)
unique: 44
top_value: (blank)
top_rate: 0.9742
cardinality: 44
entropy: 0.2566
entropy_ratio: 0.04699

source

categorical metadata null_rate

This column records the data provider or attribution source for each row, with only 4 distinct values drawn from named external datasets (OpenStreetMap contributors, The Megalithic Portal, NOAA Storm Events Database, OpenStreetMap). The most striking signal is a 51.56% null rate — meaning over half of all 354,770 rows carry no source attribution, which is a data quality concern for provenance tracking. The top value 'OpenStreetMap contributors' accounts for 51.44% of non-null rows (88,396 records), while the closely related 'OpenStreetMap' (8,656 records) suggests inconsistent attribution for the same upstream source. Treatment: Consolidate 'OpenStreetMap contributors' and 'OpenStreetMap' into a single category, investigate and impute or flag the 51.56% nulls before using as a stratification or filter variable. high · anthropic:default

n: 354,770
nulls: 182,920 (51.6%)
unique: 4
top_value: OpenStreetMap contributors
top_rate: 0.5144
cardinality: 4
entropy: 1.545
entropy_ratio: 0.7724

vessel_type

categorical long_tail null_rate

n: 354,770
nulls: 351,117 (99.0%)
unique: 23
top_value: (blank)
top_rate: 0.9064
cardinality: 23
entropy: 0.5764
entropy_ratio: 0.1274

cargo

categorical long_tail null_rate imbalance

n: 354,770
nulls: 351,117 (99.0%)
unique: 17
top_value: (blank)
top_rate: 0.9943
cardinality: 17
entropy: 0.07302
entropy_ratio: 0.01786

peak_brightness_altitude_km

categorical null_rate

n: 354,770
nulls: 354,193 (99.8%)
unique: 224
top_value: 37.0
top_rate: 0.06066
cardinality: 224
entropy: 7.187
entropy_ratio: 0.9206

velocity_km_s

categorical null_rate

n: 354,770
nulls: 354,421 (99.9%)
unique: 158
top_value: 13.6
top_rate: 0.01719
cardinality: 158
entropy: 7.052
entropy_ratio: 0.9656

energy_joules

categorical long_tail null_rate

n: 354,770
nulls: 353,907 (99.8%)
unique: 518
top_value: 2.1
top_rate: 0.01738
cardinality: 518
entropy: 8.634
entropy_ratio: 0.9576

event_type

categorical null_rate

n: 354,770
nulls: 340,000 (95.8%)
unique: 17
top_value: Tornado
top_rate: 0.4288
cardinality: 17
entropy: 2.336
entropy_ratio: 0.5715

damage_property

text one_word allcaps null_rate short_text duplicates

n: 354,770
nulls: 340,000 (95.8%)
unique: 1,014
len_min: 0
len_max: 8
len_mean: 4.381
len_median: 5
len_p95: 7
word_mean: 1
word_median: 1
n_empty: 368
n_duplicates: 13,756
duplicate_rate: 0.9313
vocab_size: 1,013
readability_flesch_mean: 117
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.8724
boilerplate_rate: 0

cave_type

categorical long_tail null_rate

n: 354,770
nulls: 354,729 (100.0%)
unique: 5
top_value: pit
top_rate: 0.878
cardinality: 5
entropy: 0.7693
entropy_ratio: 0.3313

cave_length_m

categorical long_tail null_rate

n: 354,770
nulls: 354,128 (99.8%)
unique: 237
top_value: 5
top_rate: 0.04984
cardinality: 237
entropy: 6.919
entropy_ratio: 0.877

cave_depth_m

categorical long_tail null_rate

n: 354,770
nulls: 354,472 (99.9%)
unique: 124
top_value: 0
top_rate: 0.2114
cardinality: 124
entropy: 5.797
entropy_ratio: 0.8336

access

categorical null_rate

n: 354,770
nulls: 347,515 (98.0%)
unique: 20
top_value: yes
top_rate: 0.3795
cardinality: 20
entropy: 2.234
entropy_ratio: 0.517

cave_ref

text one_word allcaps null_rate short_text

n: 354,770
nulls: 347,184 (97.9%)
unique: 7,162
len_min: 1
len_max: 38
len_mean: 6.341
len_median: 7
len_p95: 8
word_mean: 1.068
word_median: 1
n_empty: 0
n_duplicates: 424
duplicate_rate: 0.05589
vocab_size: 7,005
readability_flesch_mean: 117.8
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9359
allcaps_rate: 0.8559
boilerplate_rate: 0

osm_id

numeric foreign_key null_rate

This column contains OpenStreetMap (OSM) numeric identifiers, likely referencing geographic features such as nodes, ways, or relations in the OSM database. The most striking issue is a 75.08% null rate across 354,770 rows, meaning only about one quarter of records carry an OSM linkage. The 88,395 unique values against 88,396 non-null rows make the column near-unique, and the platykurtic distribution (kurtosis ≈ -1.23) is consistent with IDs drawn broadly across OSM's ID space (min ~1.3M, max ~13.5B), with no outliers detected. Treatment: Filter out the 75.08% nulls rather than imputing them (an identifier cannot be meaningfully imputed), then left-join to OSM data on osm_type plus osm_id, because OSM numbers nodes, ways, and relations independently; investigate whether missingness is systematic before joining. high · anthropic:default

n: 354,770
nulls: 266,374 (75.1%)
unique: 88,395
min: 1.334e+06
max: 1.347e+10
mean: 6.183e+09
median: 6.047e+09
std: 3.993e+09
q1: 2.628e+09
q3: 9.53e+09
iqr: 6.903e+09
skew: 0.1321
kurtosis: -1.228
n_outliers: 0
outlier_rate: 0
zero_rate: 0
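
OSM assigns identifiers separately to nodes, ways, and relations, so a node and a way can share the same number. A join is therefore safer on the pair (osm_type, osm_id) than on osm_id alone. The sketch below shows a left join using a dictionary index; the feature records and the `tags` field are hypothetical stand-ins for whatever OSM extract is joined.

```python
def index_osm_features(features):
    """Index OSM features by the composite key (osm_type, osm_id)."""
    return {(f["osm_type"], f["osm_id"]): f for f in features}


def attach_osm(rows, osm_index):
    """Left join: keep every row, adding the matching feature or None."""
    joined = []
    for row in rows:
        key = (row.get("osm_type"), row.get("osm_id"))
        match = None if None in key else osm_index.get(key)
        joined.append({**row, "osm": match})
    return joined


# Toy data, invented for illustration only.
features = [
    {"osm_type": "node", "osm_id": 42, "tags": {"natural": "cave_entrance"}},
    {"osm_type": "way", "osm_id": 42, "tags": {"place": "hamlet"}},
]
rows = [
    {"name": "Unnamed Cave", "osm_type": "node", "osm_id": 42},
    {"name": "Tornado in TX, 48", "osm_type": None, "osm_id": None},
]
result = attach_osm(rows, index_osm_features(features))
print([r["osm"] for r in result])
```

Rows without an OSM linkage come through with `osm` set to None, so the 75% of records from non-OSM sources are preserved.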

osm_type

categorical feature null_rate imbalance

This column stores OpenStreetMap geometry type classifications, taking only three possible values: 'node', 'way', and 'relation'. Two signals demand attention: 75.08% of the 354,770 rows are null, meaning OSM type is only recorded for roughly a quarter of records, and among the non-null values the distribution is severely imbalanced — 'node' accounts for 96.39% of non-null entries (85,204 occurrences) versus 2,560 'way' and just 632 'relation'. The near-zero entropy ratio (0.158) confirms this column carries very little discriminative information as-is. Treatment: Impute nulls as a distinct 'unknown' category, then one-hot encode; consider whether the 'way'/'relation' minority classes carry signal worth preserving or should be collapsed. high · anthropic:default

n: 354,770
nulls: 266,374 (75.1%)
unique: 3
top_value: node
top_rate: 0.9639
cardinality: 3
entropy: 0.2501
entropy_ratio: 0.1578
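
The treatment above (nulls as their own category, then one-hot encoding) can be sketched as follows. The 'unknown' label and the output key names are illustrative choices.

```python
KNOWN_OSM_TYPES = ("node", "way", "relation")
ENCODED_LABELS = KNOWN_OSM_TYPES + ("unknown",)


def one_hot_osm_type(value):
    """Return a one-hot dict; nulls and unexpected values map to 'unknown'."""
    label = value if value in KNOWN_OSM_TYPES else "unknown"
    return {f"osm_type_{name}": int(name == label) for name in ENCODED_LABELS}


print(one_hot_osm_type("node"))
print(one_hot_osm_type(None))
```

Since 'unknown' will cover about 75% of rows and 'node' most of the rest, the 'way' and 'relation' indicators will be almost always zero; collapsing them into one indicator is a reasonable alternative if they add no signal.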

place_type

categorical label long_tail null_rate

This column captures the settlement/place classification type, likely from an OpenStreetMap-style geographic dataset, with values such as 'hamlet', 'isolated_dwelling', 'village', and 'town'. The most striking signal is the extreme null rate of 94.88%, meaning only 18,154 of 354,770 rows carry a value (matching the osm_ghost_towns count). Among populated rows, 'hamlet' dominates at 66.57% of non-null values, and the presence of a raw 'yes' tag (131 occurrences) indicates dirty or uncleaned OSM data that needs remediation. Treatment: Filter or impute nulls before use; remap 'yes' and other dirty values; treat as low-cardinality categorical with one-hot or ordinal encoding reflecting settlement hierarchy. high · anthropic:default

n: 354,770
nulls: 336,616 (94.9%)
unique: 48
top_value: hamlet
top_rate: 0.6657
cardinality: 48
entropy: 1.498
entropy_ratio: 0.2682

abandoned_year

categorical long_tail null_rate

n: 354,770
nulls: 353,545 (99.7%)
unique: 147
top_value: yes
top_rate: 0.4359
cardinality: 147
entropy: 2.939
entropy_ratio: 0.4082

abandoned_reason

unknown label skipped

This column contains abandoned-reason codes or labels, most likely a categorical field recording why a settlement was abandoned, since it sits alongside the ghost-town fields abandoned_year and former_population. The profiler emitted a 'skipped' alert with no stats or uniqueness counts, meaning the column's type could not be resolved and no frequency analysis was performed. The report shows a null rate of exactly 0.0 across 354,770 rows, but because the column was skipped this figure should be treated with caution, and its true cardinality, distribution, and value content are entirely unknown from this evidence. Treatment: Re-profile with explicit string/categorical typing to recover value counts and cardinality before any downstream use. low · anthropic:default

n: 354,770
nulls: 0 (0.0%)
unique: —

former_population

categorical null_rate

n: 354,770
nulls: 352,243 (99.3%)
unique: 75
top_value: 0
top_rate: 0.4951
cardinality: 75
entropy: 2.605
entropy_ratio: 0.4182

heritage

categorical long_tail null_rate

n: 354,770
nulls: 354,763 (100.0%)
unique: 6
top_value: 2
top_rate: 0.2857
cardinality: 6
entropy: 2.522
entropy_ratio: 0.9755