data trove strange places v5 2
Reading
This is a 354,770-row mashup of 14 heterogeneous 'strange places' datasets (tornadoes, UFO sightings, cave entrances, meteorites, ghost towns, earthquakes, shipwrecks, and more) unified under a single 'category' column. The most important thing to examine first is the category distribution, which shows that no single source dominates: tornadoes (~71.8K), caves (~70.2K), UFO sightings (~60.6K), and megalithic sites (~60.0K) each make up roughly 17–20% of records. A second key signal is pervasive sparsity: most domain-specific columns (depth_km, duration_seconds, shape, damage_property) carry null rates of roughly 83–99%, so each column is meaningful only for the rows belonging to its originating dataset. UFO sighting durations show extreme right skew (median 180 s, max ~66 million s), and earthquake depths are also right-skewed, though far less severely (skew 3.07 versus 135.9); both are worth closer inspection within their respective subsets.
citing: category.top_values · category.null_rate · duration_seconds.stats · depth_km.stats · shape.null_rate · damage_property.null_rate · source.top_values · fatalities.top_values · event_type.top_values
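The sparsity claim in the summary can be checked by computing each column's null rate per category rather than over the whole table. The following is a minimal sketch using only the standard library; the helper name `null_rate_by_group` and the four sample rows are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def null_rate_by_group(rows, group_key, column):
    """Return {group: share of rows in that group where `column` is missing}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        # Treat both None and empty strings as missing.
        if row.get(column) in (None, ""):
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


# Hypothetical sample rows for illustration only.
sample = [
    {"category": "ufo_sightings", "shape": "light", "depth_km": None},
    {"category": "ufo_sightings", "shape": "triangle", "depth_km": None},
    {"category": "usgs_earthquakes", "shape": None, "depth_km": 10.0},
    {"category": "usgs_earthquakes", "shape": None, "depth_km": 29.1},
]

shape_nulls = null_rate_by_group(sample, "category", "shape")
depth_nulls = null_rate_by_group(sample, "category", "depth_km")
```

If the missingness is structural, each domain-specific column should show a null rate near 0 within its own category and near 1 everywhere else.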
Charts the summary said to look at first
category (top values)
| value | count | share |
|---|---|---|
| noaa_tornadoes | 71813 | 20.2% |
| osm_caves | 70242 | 19.8% |
| ufo_sightings | 60632 | 17.1% |
| megalithic_portal | 60028 | 16.9% |
| nasa_meteorites | 32186 | 9.1% |
| osm_ghost_towns | 18154 | 5.1% |
| noaa_storm_events | 14770 | 4.2% |
| haunted_places | 9717 | 2.7% |
| noaa_thermal_springs | 5003 | 1.4% |
| bigfoot_sightings | 3797 | 1.1% |
| usgs_earthquakes | 3742 | 1.1% |
| noaa_shipwrecks | 3653 | 1.0% |
| nasa_fireballs | 863 | 0.2% |
| usgs_volcanoes | 170 | 0.0% |
shape (top values)
| value | count | share |
|---|---|---|
| light | 12895 | 3.6% |
| triangle | 6268 | 1.8% |
| circle | 5890 | 1.7% |
| fireball | 4939 | 1.4% |
| unknown | 4359 | 1.2% |
| other | 4209 | 1.2% |
| sphere | 4134 | 1.2% |
| disk | 3853 | 1.1% |
| oval | 2881 | 0.8% |
| formation | 1908 | 0.5% |
| cigar | 1569 | 0.4% |
| changing | 1517 | 0.4% |
| flash | 1025 | 0.3% |
| rectangle | 1010 | 0.3% |
| cylinder | 977 | 0.3% |
| diamond | 884 | 0.2% |
| chevron | 774 | 0.2% |
| teardrop | 560 | 0.2% |
| egg | 555 | 0.2% |
| cone | 235 | 0.1% |
event_type (top values)
| value | count | share |
|---|---|---|
| Tornado | 6334 | 1.8% |
| Flash Flood | 2358 | 0.7% |
| Thunderstorm Wind | 2257 | 0.6% |
| Flood | 1777 | 0.5% |
| Hail | 1246 | 0.4% |
| Lightning | 574 | 0.2% |
| Heavy Rain | 99 | 0.0% |
| Marine Strong Wind | 43 | 0.0% |
| Debris Flow | 43 | 0.0% |
| Marine Thunderstorm Wind | 25 | 0.0% |
| Marine High Wind | 5 | 0.0% |
| Dust Devil | 3 | 0.0% |
| Waterspout | 2 | 0.0% |
| Tropical Storm | 1 | 0.0% |
| High Wind | 1 | 0.0% |
| Heat | 1 | 0.0% |
| Marine Lightning | 1 | 0.0% |
duration_seconds (histogram)
| bin | count |
|---|---|
| 0.01 – 1.657e+06 | 60612 |
| 1.657e+06 – 3.314e+06 | 11 |
| 3.314e+06 – 4.971e+06 | 0 |
| 4.971e+06 – 6.628e+06 | 3 |
| 6.628e+06 – 8.285e+06 | 1 |
| 8.285e+06 – 9.941e+06 | 0 |
| 9.941e+06 – 1.16e+07 | 2 |
| 1.16e+07 – 1.326e+07 | 0 |
| 1.326e+07 – 1.491e+07 | 0 |
| 1.491e+07 – 1.657e+07 | 0 |
| 1.657e+07 – 1.823e+07 | 0 |
| 1.823e+07 – 1.988e+07 | 0 |
| 1.988e+07 – 2.154e+07 | 0 |
| 2.154e+07 – 2.32e+07 | 0 |
| 2.32e+07 – 2.485e+07 | 0 |
| 2.485e+07 – 2.651e+07 | 0 |
| 2.651e+07 – 2.817e+07 | 0 |
| 2.817e+07 – 2.982e+07 | 0 |
| 2.982e+07 – 3.148e+07 | 0 |
| 3.148e+07 – 3.314e+07 | 0 |
| 3.314e+07 – 3.479e+07 | 0 |
| 3.479e+07 – 3.645e+07 | 0 |
| 3.645e+07 – 3.811e+07 | 0 |
| 3.811e+07 – 3.977e+07 | 0 |
| 3.977e+07 – 4.142e+07 | 0 |
| 4.142e+07 – 4.308e+07 | 0 |
| 4.308e+07 – 4.474e+07 | 0 |
| 4.474e+07 – 4.639e+07 | 0 |
| 4.639e+07 – 4.805e+07 | 0 |
| 4.805e+07 – 4.971e+07 | 0 |
| 4.971e+07 – 5.136e+07 | 0 |
| 5.136e+07 – 5.302e+07 | 2 |
| 5.302e+07 – 5.468e+07 | 0 |
| 5.468e+07 – 5.633e+07 | 0 |
| 5.633e+07 – 5.799e+07 | 0 |
| 5.799e+07 – 5.965e+07 | 0 |
| 5.965e+07 – 6.131e+07 | 0 |
| 6.131e+07 – 6.296e+07 | 0 |
| 6.296e+07 – 6.462e+07 | 0 |
| 6.462e+07 – 6.628e+07 | 1 |
meteorite_class (top values)
| value | count | share |
|---|---|---|
| L6 | 6544 | 1.8% |
| H5 | 5614 | 1.6% |
| H4 | 3336 | 0.9% |
| H6 | 3234 | 0.9% |
| L5 | 2750 | 0.8% |
| LL5 | 1899 | 0.5% |
| LL6 | 963 | 0.3% |
| L4 | 831 | 0.2% |
| H4/5 | 380 | 0.1% |
| CM2 | 281 | 0.1% |
| Iron, IIIAB | 272 | 0.1% |
| H3 | 244 | 0.1% |
| LL | 220 | 0.1% |
| E3 | 205 | 0.1% |
| L3 | 176 | 0.0% |
| LL4 | 160 | 0.0% |
| H5/6 | 156 | 0.0% |
| Ureilite | 155 | 0.0% |
| Howardite | 127 | 0.0% |
| Diogenite | 125 | 0.0% |
Schema
48 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| latitude | numeric | 0.0% | 215,964 | high_skew, outliers |
| longitude | numeric | 0.0% | 223,129 | |
| name | text | 0.0% | 189,861 | multilingual, duplicates |
| description | text | 0.0% | 218,717 | multilingual, duplicates |
| category | categorical | 0.0% | 14 | |
| date | text | 41.9% | 23,500 | one_word, allcaps, null_rate, short_text, duplicates |
| country | categorical | 55.3% | 28 | null_rate |
| city | text | 82.9% | 9,149 | one_word, null_rate, short_text, duplicates |
| state | categorical | 58.5% | 118 | null_rate |
| shape | categorical | 82.9% | 28 | null_rate |
| duration_seconds | numeric | 82.9% | 444 | null_rate, high_skew, outliers |
| mass_g | unknown | 0.0% | — | skipped |
| meteorite_class | categorical | 90.9% | 395 | null_rate |
| fall_type | categorical | 90.9% | 2 | null_rate, imbalance |
| magnitude | categorical | 76.7% | 294 | null_rate |
| depth_km | numeric | 98.9% | 1,505 | null_rate, high_skew, outliers |
| place | text | 98.9% | 3,002 | null_rate |
| earthquake_type | categorical | 98.9% | 3 | null_rate, imbalance |
| volcano_type | categorical | 100.0% | 1 | null_rate, imbalance |
| elevation_m | unknown | 0.0% | — | skipped |
| status | categorical | 100.0% | 1 | null_rate, imbalance |
| last_eruption | categorical | 100.0% | 1 | null_rate, imbalance |
| injuries | categorical | 75.6% | 233 | null_rate |
| fatalities | categorical | 75.6% | 57 | null_rate |
| length_miles | text | 79.8% | 3,795 | one_word, allcaps, null_rate, short_text, duplicates |
| width_yards | categorical | 79.8% | 437 | null_rate |
| type | categorical | 98.6% | 1 | null_rate, imbalance |
| temperature | categorical | 98.6% | 44 | long_tail, null_rate, imbalance |
| source | categorical | 51.6% | 4 | null_rate |
| vessel_type | categorical | 99.0% | 23 | long_tail, null_rate |
| cargo | categorical | 99.0% | 17 | long_tail, null_rate, imbalance |
| peak_brightness_altitude_km | categorical | 99.8% | 224 | null_rate |
| velocity_km_s | categorical | 99.9% | 158 | null_rate |
| energy_joules | categorical | 99.8% | 518 | long_tail, null_rate |
| event_type | categorical | 95.8% | 17 | null_rate |
| damage_property | text | 95.8% | 1,014 | one_word, allcaps, null_rate, short_text, duplicates |
| cave_type | categorical | 100.0% | 5 | long_tail, null_rate |
| cave_length_m | categorical | 99.8% | 237 | long_tail, null_rate |
| cave_depth_m | categorical | 99.9% | 124 | long_tail, null_rate |
| access | categorical | 98.0% | 20 | null_rate |
| cave_ref | text | 97.9% | 7,162 | one_word, allcaps, null_rate, short_text |
| osm_id | numeric | 75.1% | 88,395 | null_rate |
| osm_type | categorical | 75.1% | 3 | null_rate, imbalance |
| place_type | categorical | 94.9% | 48 | long_tail, null_rate |
| abandoned_year | categorical | 99.7% | 147 | long_tail, null_rate |
| abandoned_reason | unknown | 0.0% | — | skipped |
| former_population | categorical | 99.3% | 75 | null_rate |
| heritage | categorical | 100.0% | 6 | long_tail, null_rate |
latitude
numeric feature high_skew outliers
This column contains geographic latitude values ranging from -87.37° to 88.5°, consistent with global coordinates. The distribution is strongly left-skewed (skew = -2.84) with high kurtosis (7.30): the median sits at ~40.6°N, so the bulk of records are mid-latitude northern hemisphere, while a minority of low-latitude and southern values pulls the mean down to 32.66°. About 9.4% of rows (33,355) are flagged as outliers. Because the IQR is only 12.85°, these need not lie near the poles: if the profiler uses the common 1.5 × IQR rule (an assumption; the report does not state its method), the fences fall at roughly 14.4° and 65.8°, so every tropical and southern-hemisphere record is flagged. The zero_rate (0.06%, roughly 226 rows) is negligible but worth checking for nulls encoded as 0.
Treatment: Retain as-is for geospatial modelling; investigate the zero-value rows as possible null sentinels, and review the 33,355 flagged records before clustering or distance-based methods, bearing in mind that many are probably valid coordinates rather than errors.
- n
- 354,770
- nulls
- 0 (0.0%)
- unique
- 215,964
- min
- -87.37
- max
- 88.5
- mean
- 32.66
- median
- 40.6
- std
- 31.01
- q1
- 33.69
- q3
- 46.53
- iqr
- 12.85
- skew
- -2.84
- kurtosis
- 7.302
- n_outliers
- 33,355
- outlier_rate
- 0.09402
- zero_rate
- 0.000637
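To see which latitudes an IQR-based outlier flag would catch, the fences can be computed from the reported quartiles. This sketch assumes the common Tukey rule with a 1.5 multiplier; the report does not confirm which rule the profiler applied, and `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 as reported for latitude in this profile.
lower, upper = tukey_fences(33.69, 46.53)
```

Under that assumption the fences land near 14.4° and 65.8°, which would explain why about 9.4% of rows are flagged even though the coordinates are plausible.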
longitude
numeric feature
This column contains geographic longitude values for 354,770 records, spanning the full valid range from -179.28° to 180°. The distribution is moderately right-skewed (skew = 0.755) with a mean of -31.75° and median of -42.66°, indicating a concentration of records in the Western Hemisphere (Americas/Atlantic). The IQR of 104.81° is extremely wide, suggesting genuinely global coverage rather than a region-specific dataset, and only 827 values (0.23%) are flagged as outliers.
Treatment: Pair with latitude for geospatial modelling; consider coordinate binning or haversine-based features rather than treating as a raw numeric.
- n
- 354,770
- nulls
- 0 (0.0%)
- unique
- 223,129
- min
- -179.3
- max
- 180
- mean
- -31.75
- median
- -42.66
- std
- 72.11
- q1
- -92.08
- q3
- 12.73
- iqr
- 104.8
- skew
- 0.7545
- kurtosis
- 0.1165
- n_outliers
- 827
- outlier_rate
- 0.002331
- zero_rate
- 0
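The haversine-based features mentioned in the longitude treatment rest on the great-circle distance between two latitude/longitude pairs. A minimal sketch follows; the function name `haversine_km` is illustrative and the Earth radius is the usual mean-radius approximation, so results are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the equator: from (0°, 0°) to (0°, 90°E).
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
```

Distances to a fixed reference point, or to the nearest record of another category, can then be used as features in place of raw coordinates.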
name
text label multilingual duplicates
This column contains the name or title of individual records in what appears to be a multi-domain dataset covering natural features (caves), weather events (tornadoes by US state), and UFO sightings. The duplicate rate is strikingly high at 46.5%, driven largely by templated strings like 'Unnamed Cave' (19,962 occurrences) and repeated tornado/state/count patterns. Despite the predominantly English content (3,363 language-detected values skewing English), the multilingual alert flags 30 detected languages including German (230), French (279), Italian (236), Russian (102), and Spanish (156), suggesting internationally sourced named entities mixed into the dataset. Nearly half of the values are non-unique, so this column cannot serve as a reliable identifier.
Treatment: Deduplicate or group by name pattern before use; consider splitting templated names (e.g. 'Tornado in TX, 48') into structured fields; embed free-form names if semantic similarity is needed.
- n
- 354,770
- nulls
- 0 (0.0%)
- unique
- 189,861
- len_min
- 1
- len_max
- 235
- len_mean
- 20
- len_median
- 17
- len_p95
- 32
- word_mean
- 3.564
- word_median
- 4
- n_empty
- 0
- n_duplicates
- 164,909
- duplicate_rate
- 0.4648
- vocab_size
- 15,811
- readability_flesch_mean
- 64.79
- emoji_rate
- 2.819e-06
- url_rate
- 0
- one_word_rate
- 0.09411
- allcaps_rate
- 0.01283
- boilerplate_rate
- 0
description
text free_text multilingual duplicates
This column contains free-text descriptions of geographic or physical features. Cave entrances, former hamlets, hot springs, shipwrecks, and tornado tracks (e.g. 'F0, 0.1mi long, 10yd wide') dominate the top values, suggesting a points-of-interest or geographic gazetteer dataset. The duplicate rate is strikingly high at 38.3% (136,053 repeated values out of 354,770 rows), largely from templated entries like 'Cave entrance' (52,067 occurrences) and storm-track boilerplate. Text is overwhelmingly English (4,893 sampled as English), but 21 languages are detected, including German (28), Bashkir (13), Russian (9), and Belarusian (9), flagging a multilingual minority that may require separate handling. The wide gap between median length (40 chars) and mean length (114 chars), with a p95 of 491, indicates a heavily right-skewed length distribution.
Treatment: Deduplicate or group templated entries before NLP; apply language detection and route non-English rows to language-specific pipelines; tokenize and embed for semantic modelling.
- n
- 354,770
- nulls
- 0 (0.0%)
- unique
- 218,717
- len_min
- 1
- len_max
- 500
- len_mean
- 114
- len_median
- 40
- len_p95
- 491
- word_mean
- 24.07
- word_median
- 7
- n_empty
- 0
- n_duplicates
- 136,053
- duplicate_rate
- 0.3835
- vocab_size
- 38,639
- readability_flesch_mean
- 66.65
- emoji_rate
- 0
- url_rate
- 0.008149
- one_word_rate
- 0.01018
- allcaps_rate
- 0.004256
- boilerplate_rate
- 0.0002509
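Grouping templated entries before NLP can start with a simple frequency count that separates boilerplate from free text. This is a sketch only: `templated_values` and the `min_count` threshold are hypothetical, and the last sample string is invented for illustration (the first two appear in the report).

```python
from collections import Counter


def templated_values(texts, min_count):
    """Return {text: count} for values repeated at least `min_count` times."""
    counts = Counter(t.strip() for t in texts if t and t.strip())
    return {text: n for text, n in counts.items() if n >= min_count}


sample = [
    "Cave entrance",
    "Cave entrance",
    "Cave entrance",
    "F0, 0.1mi long, 10yd wide",
    "A hot spring beside the river",  # invented example of free text
]

templates = templated_values(sample, min_count=3)
```

Rows whose description falls in `templates` can be collapsed to a single representative or handled as a categorical value, leaving only the free-text remainder for tokenization.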
category
categorical label
This column is a data-source/event-type label drawn from 14 distinct categories across 354,770 rows with zero nulls. The categories span scientific datasets (NOAA tornadoes, NASA meteorites, OSM features) and paranormal/anomalous phenomena (UFO sightings, Bigfoot, haunted places, megalithic portal), suggesting this is a multi-source 'strange phenomena' aggregation dataset. Distribution is moderately uneven: the top value 'noaa_tornadoes' holds 20.2% of rows (71,813), while the smallest, 'usgs_volcanoes', has only 170. Even so, an entropy of 2.99 bits, 0.78 of the maximum log2(14) ≈ 3.81, indicates reasonable spread across classes. No nulls and clean cardinality make this an immediately usable stratification variable.
Treatment: Use as a stratification or grouping key; one-hot encode or target-encode for modelling.
- n
- 354,770
- nulls
- 0 (0.0%)
- unique
- 14
- top_value
- noaa_tornadoes
- top_rate
- 0.2024
- cardinality
- 14
- entropy
- 2.985
- entropy_ratio
- 0.7841
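The entropy and entropy_ratio figures can be reproduced from the category counts listed earlier in the report. The sketch below assumes base-2 Shannon entropy normalised by log2 of the number of categories, which is consistent with the reported values; the function name is illustrative.

```python
import math

# Counts copied from the category table in this report.
category_counts = {
    "noaa_tornadoes": 71813, "osm_caves": 70242, "ufo_sightings": 60632,
    "megalithic_portal": 60028, "nasa_meteorites": 32186,
    "osm_ghost_towns": 18154, "noaa_storm_events": 14770,
    "haunted_places": 9717, "noaa_thermal_springs": 5003,
    "bigfoot_sightings": 3797, "usgs_earthquakes": 3742,
    "noaa_shipwrecks": 3653, "nasa_fireballs": 863, "usgs_volcanoes": 170,
}


def entropy_bits(counts):
    """Shannon entropy in bits of a collection of non-negative counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


entropy = entropy_bits(category_counts.values())
# Normalise by the maximum entropy for this many categories.
entropy_ratio = entropy / math.log2(len(category_counts))
```

A ratio of 1.0 would mean all 14 sources are equally represented, so a value near 0.78 reflects the long tail of small sources such as fireballs and volcanoes.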
date
text timestamp one_word allcaps null_rate short_text duplicates
This column contains ISO-format date strings (YYYY-MM-DD) stored as text rather than a proper date type. All top values fall on January 1st of a given year, which suggests that many records carry year-level precision only; the 23,500 unique values, however, show that a substantial number of dates are more specific than that. Two major data quality issues stand out: a 41.88% null rate (148,570 rows), plus 17,854 empty strings that the len_min of 0 suggests are counted among the non-null values, and an 88.6% duplicate rate among the 206,200 non-null values (182,700 duplicates). The 'allcaps' alert is a false positive from the Saturn parser: ISO date strings trigger it because they contain no lowercase letters.
Treatment: Cast to a date type, convert empty strings to null, impute or flag the missing values, and consider extracting year as an integer feature for the rows that are Jan-1 anchored.
- n
- 354,770
- nulls
- 148,570 (41.9%)
- unique
- 23,500
- len_min
- 0
- len_max
- 30
- len_mean
- 9.331
- len_median
- 10
- len_p95
- 10
- word_mean
- 1.005
- word_median
- 1
- n_empty
- 17,854
- n_duplicates
- 182,700
- duplicate_rate
- 0.886
- vocab_size
- 8,565
- readability_flesch_mean
- 112.1
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9954
- allcaps_rate
- 0.913
- boilerplate_rate
- 0
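Casting the column to a date type means deciding what to do with empty strings and values that do not match YYYY-MM-DD (the len_max of 30 hints that some exist). A minimal sketch that maps anything unparseable to `None` follows; `parse_iso_date` is a hypothetical helper and the sample inputs are illustrative, not real rows.

```python
from datetime import date, datetime


def parse_iso_date(raw):
    """Return a date for 'YYYY-MM-DD' strings; None for missing or malformed input."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None  # empty strings become nulls
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None  # leave non-ISO values for a separate audit


# Illustrative inputs only.
parsed = [parse_iso_date(v) for v in ["1999-01-01", "", None, "circa 1850"]]
# Year-level feature for rows that parsed successfully.
years = [d.year if d is not None else None for d in parsed]
```

Counting how many non-empty inputs come back as `None` gives a quick measure of how much of the column deviates from the ISO format.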
country
categorical feature null_rate
This column captures the country associated with each record, using a mix of short codes and variant spellings. The most alarming issue is a 55.29% null rate: over half of the 354,770 rows carry no country value. Compounding this, 'USA' and 'US' denote the same country but are stored as two distinct values (86,583 and 60,634 rows respectively). 'USA' alone accounts for ~54.6% of the 158,616 non-null records, and the two together for ~92.8%, so the inconsistency both inflates cardinality and understates how US-dominated the column is. There are also 9,497 empty-string records that escaped null detection, and the 28 unique values have low entropy (1.34).
Treatment: Unify 'USA'/'US' and other aliases into ISO-3166 codes, convert empty strings to null, then impute or flag remaining nulls before using as a categorical feature.
- n
- 354,770
- nulls
- 196,154 (55.3%)
- unique
- 28
- top_value
- USA
- top_rate
- 0.5459
- cardinality
- 28
- entropy
- 1.341
- entropy_ratio
- 0.279
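Alias unification and empty-string handling for this column can be expressed as a small lookup. The mapping below covers only the two aliases the report names; any further entries would have to come from auditing the actual values, and `normalize_country` is a hypothetical helper.

```python
# Only the aliases named in the report; extend after auditing real values.
COUNTRY_ALIASES = {"USA": "US", "US": "US"}


def normalize_country(raw):
    """Map known aliases to one code; return None for missing or empty input."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None  # empty strings become nulls
    return COUNTRY_ALIASES.get(text.upper(), text)


# Share of non-null rows that are US once the two spellings are merged,
# using the counts quoted in the report.
non_null = 354_770 - 196_154
combined_us_share = (86_583 + 60_634) / non_null
```

With the counts from the report, merging the two spellings puts the US share of non-null rows at about 92.8%.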
city
text feature one_word null_rate short_text duplicates
This column contains US city names, confirmed by top values (Seattle, Phoenix, Las Vegas, Portland, Los Angeles) and top words ('beach', 'san', 'lake', 'springs'). The most striking issue is the 82.91% null rate: only roughly 1 in 6 rows has a city value. The 60,632 populated rows match the ufo_sightings row count exactly, which suggests the missingness is structural rather than random. The duplicate rate among non-null values is 84.91%, indicating that populated rows cluster around a relatively small set of repeated cities (9,149 unique values from 4,862 vocab tokens). The word 'city' appearing 531 times in top_words most likely reflects legitimate names such as 'Kansas City' or 'Oklahoma City' rather than placeholder text or data quality noise.
Treatment: Impute or flag nulls (82.91% missing) before use; consider grouping rare cities or encoding as region/state for modelling.
- n
- 354,770
- nulls
- 294,138 (82.9%)
- unique
- 9,149
- len_min
- 3
- len_max
- 23
- len_mean
- 8.829
- len_median
- 9
- len_p95
- 14
- word_mean
- 1.288
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 51,483
- duplicate_rate
- 0.8491
- vocab_size
- 4,862
- readability_flesch_mean
- 21.74
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.7294
- allcaps_rate
- 0
- boilerplate_rate
- 0
state
categorical feature null_rate
This column contains US state abbreviations, and possibly territories or non-standard codes given 118 unique values against the expected 50–60, making it a geographic categorical feature. The most critical signal is a 58.5% null rate: over half of the 354,770 rows have no state recorded. The top value is 'TX' at 8.6% of non-null rows, with CA and FL following. A cardinality of 118, more than double the 50 US states, suggests the presence of territories, foreign region codes, or dirty values worth auditing.
Treatment: Audit the 118 unique values to identify non-US-state codes, impute or flag nulls (58.5% missing), then encode as categorical for modelling.
- n
- 354,770
- nulls
- 207,555 (58.5%)
- unique
- 118
- top_value
- TX
- top_rate
- 0.08645
- cardinality
- 118
- entropy
- 5.668
- entropy_ratio
- 0.8236
shape
categorical feature null_rate
This column captures the reported shape of UFO/unidentified aerial phenomena sightings, with 28 distinct categories such as 'light', 'triangle', 'circle', and 'fireball'. The most striking issue is an 82.91% null rate across 354,770 rows: only 60,632 records have a shape value, the same count as the ufo_sightings category. Among non-null records, 'light' dominates at 21.27%, and catch-all categories like 'unknown' (4,359) and 'other' (4,209) further dilute the informativeness of the non-missing data.
Treatment: Impute or flag nulls as a separate 'not_reported' category before encoding; consider consolidating 'unknown' and 'other' with nulls given ambiguity.
- n
- 354,770
- nulls
- 294,138 (82.9%)
- unique
- 28
- top_value
- light
- top_rate
- 0.2127
- cardinality
- 28
- entropy
- 3.774
- entropy_ratio
- 0.785
duration_seconds
numeric feature null_rate high_skew outliers
This column records UFO sighting durations in seconds, with values ranging from 0.01 s to 66,276,000 s (~767 days). The most striking issue is that 82.91% of rows are null, so duration is captured for only roughly 1 in 6 records. Among non-null values the distribution is extremely right-skewed (skew = 135.86, kurtosis = 19,379.84): the median is just 180 s while the mean inflates to 5,410 s, and 7,753 rows (12.79% of non-null) are flagged as outliers. The maximum of 66,276,000 s is almost certainly erroneous or a sentinel value.
Treatment: Investigate and cap or remove extreme outliers (especially values near 66,276,000), impute or flag nulls explicitly, then log-transform before modelling.
- n
- 354,770
- nulls
- 294,138 (82.9%)
- unique
- 444
- min
- 0.01
- max
- 6.628e+07
- mean
- 5410
- median
- 180
- std
- 4.144e+05
- q1
- 30
- q3
- 600
- iqr
- 570
- skew
- 135.9
- kurtosis
- 1.938e+04
- n_outliers
- 7,753
- outlier_rate
- 0.1279
- zero_rate
- 0
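Capping followed by a log transform is one way to tame a distribution this skewed. The sketch below uses a one-day cap purely as an illustrative assumption; the report does not recommend a specific threshold, and `cap_and_log` is a hypothetical helper.

```python
import math

SECONDS_PER_DAY = 86_400

# The reported maximum expressed in days, as a sanity check on the scale.
max_days = 66_276_000 / SECONDS_PER_DAY


def cap_and_log(values, cap):
    """Clip each value at `cap`, then apply log1p to compress the right tail."""
    return [math.log1p(min(v, cap)) for v in values]


# Min, q1, median, q3 and max as reported for duration_seconds.
reported = [0.01, 30, 180, 600, 66_276_000]
transformed = cap_and_log(reported, cap=SECONDS_PER_DAY)  # cap is an assumption
```

`log1p` is used rather than `log` so that very small durations such as 0.01 s map to small positive values instead of large negative ones.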
mass_g
unknown feature skipped
The column 'mass_g' likely represents mass measurements in grams. No distributional statistics are available because the profiler skipped this column, so skew, range, outliers, and uniqueness cannot be assessed from the evidence provided; the reported 0.0% null rate may likewise reflect the skip rather than complete coverage.
Treatment: Re-profile to obtain distribution stats; then check for skew and consider log-transform before modelling.
- n
- 354,770
- nulls
- 0 (0.0%)
- unique
- —
meteorite_class
categorical label null_rate
This column contains meteorite classification codes (e.g., 'L6', 'H5', 'CM2'), representing standard group and petrologic-type designations for chondrites and other meteorite classes. The most striking feature is an extremely high null rate of 90.93%: only 32,186 of 354,770 rows carry a classification, which matches the nasa_meteorites row count exactly. Among classified records the distribution is moderately concentrated: 'L6' alone accounts for 20.3% of non-null values, with 395 distinct classes and an entropy ratio of ~0.51.
Treatment: Impute nulls with an explicit 'Unknown' category or exclude from supervised models; encode via target or ordinal encoding given 395 classes and severe class imbalance.
- n
- 354,770
- nulls
- 322,584 (90.9%)
- unique
- 395
- top_value
- L6
- top_rate
- 0.2033
- cardinality
- 395
- entropy
- 4.37
- entropy_ratio
- 0.5067
fall_type
categorical label null_rate imbalance
This column classifies meteorite recovery type, distinguishing specimens that were 'Found' (discovered without an observed fall) from those that 'Fell' (witnessed falling). The null rate is 90.93%: only 32,186 of 354,770 records have a value. Among those, the distribution is heavily skewed: 'Found' accounts for 96.6% (31,090) versus 'Fell' at just 3.4% (1,096), which aligns with real-world meteorite data but triggers the class imbalance alert.
Treatment: Impute nulls as a third category ('Unknown') or exclude from classification tasks; apply class-weighting or oversampling to address the 97:3 Found-to-Fell imbalance before modelling.
- n
- 354,770
- nulls
- 322,584 (90.9%)
- unique
- 2
- top_value
- Found
- top_rate
- 0.9659
- cardinality
- 2
- entropy
- 0.2143
- entropy_ratio
- 0.2143
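For the class-weighting option, one common heuristic weights each class by n / (k × n_class), where n is the number of labelled rows and k the number of classes. The sketch below applies it to the reported Found/Fell counts; `balanced_class_weights` is a hypothetical helper, and this is one heuristic among several, not a recommendation from the profiler.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts as reported for fall_type.
weights = balanced_class_weights({"Found": 31_090, "Fell": 1_096})
```

With these counts the 'Fell' class receives a weight roughly 28 times that of 'Found', which offsets the 97:3 imbalance in a weighted loss.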
magnitude
categorical feature null_rate
This column represents a magnitude scale stored as a categorical type despite being fundamentally numeric: values include integers (0, 1, 2, 3, 4) and decimals (4.5, 4.6, 4.7, 1.75). The 82,677 non-null rows far exceed the 3,742 earthquake records, so the column evidently mixes magnitude scales from more than one source and should be interpreted per category. The null rate of 76.7% triggered an alert: over three-quarters of the 354,770 rows carry no value. An additional surprise is the value '-9', which appears 1,278 times and is almost certainly a sentinel/missing-value code rather than a true measurement. The top value '0' accounts for 44.4% of non-null observations, and an entropy_ratio of 0.31 confirms a heavily concentrated, low-diversity distribution despite 294 unique string representations.
Treatment: Cast to float after replacing '-9' with NaN, investigate the 76.7% null rate for systematic missingness, then consider log-transform or binning before modelling.
- n
- 354,770
- nulls
- 272,093 (76.7%)
- unique
- 294
- top_value
- 0
- top_rate
- 0.4436
- cardinality
- 294
- entropy
- 2.514
- entropy_ratio
- 0.3065
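Replacing the sentinel before casting keeps '-9' from being read as a real measurement. A minimal sketch follows, assuming -9 is the only sentinel in the column (the report identifies only that one); `parse_magnitude` is a hypothetical helper.

```python
import math

SENTINEL = -9.0  # assumed to be the only missing-value code in this column


def parse_magnitude(raw):
    """Return the magnitude as a float, or NaN for missing, sentinel, or malformed input."""
    if raw is None:
        return math.nan
    text = raw.strip()
    if not text:
        return math.nan
    try:
        value = float(text)
    except ValueError:
        return math.nan
    # Compare numerically so that '-9' and '-9.0' are both caught.
    return math.nan if value == SENTINEL else value


cleaned = [parse_magnitude(v) for v in ["0", "4.5", "-9", "-9.0", None, "1.75"]]
```

Comparing after the float conversion, rather than matching the string, handles alternative spellings of the sentinel among the 294 distinct string representations.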
depth_km
numeric null_rate high_skew outliers
- n
- 354,770
- nulls
- 351,028 (98.9%)
- unique
- 1,505
- min
- -2.261
- max
- 248.7
- mean
- 23.71
- median
- 10
- std
- 28.79
- q1
- 10
- q3
- 29.1
- iqr
- 19.1
- skew
- 3.072
- kurtosis
- 11.61
- n_outliers
- 314
- outlier_rate
- 0.08391
- zero_rate
- 0.002672
place
text null_rate
- n
- 354,770
- nulls
- 351,028 (98.9%)
- unique
- 3,002
- len_min
- 4
- len_max
- 59
- len_mean
- 29.47
- len_median
- 29
- len_p95
- 36
- word_mean
- 6.293
- word_median
- 6
- n_empty
- 0
- n_duplicates
- 740
- duplicate_rate
- 0.1978
- vocab_size
- 1,036
- readability_flesch_mean
- 69.91
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.0005345
- allcaps_rate
- 0
- boilerplate_rate
- 0
earthquake_type
categorical null_rate imbalance
- n
- 354,770
- nulls
- 351,028 (98.9%)
- unique
- 3
- top_value
- earthquake
- top_rate
- 0.9992
- cardinality
- 3
- entropy
- 0.01014
- entropy_ratio
- 0.006396
volcano_type
categorical null_rate imbalance
- n
- 354,770
- nulls
- 354,600 (100.0%)
- unique
- 1
- top_value
- Unknown
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
elevation_m
unknown feature skipped
This column records elevation in metres. The profiler emitted a 'skipped' alert and returned no computed statistics, so distribution shape, range, skew, and uniqueness are unknown from this evidence; the reported 0.0% null rate may likewise reflect the skip rather than complete coverage. The name strongly implies a continuous numeric geographic feature, but no further characterisation can be made without re-running profiling.
Treatment: Re-profile to obtain range, skew, and outlier metrics; then consider log-transform or clipping if heavily right-skewed before use in modelling.
- n
- 354,770
- nulls
- 0 (0.0%)
- unique
- —
status
categorical null_rate imbalance
- n
- 354,770
- nulls
- 354,600 (100.0%)
- unique
- 1
- top_value
- Unknown
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
last_eruption
categorical null_rate imbalance
- n
- 354,770
- nulls
- 354,600 (100.0%)
- unique
- 1
- top_value
- Unknown
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
injuries
categorical feature null_rate
This column records injury counts per incident, stored as a categorical type despite being numeric in nature: values are integers ('0', '1', '2', …) with a cardinality of 233 distinct values. The null rate is severely high at 75.59%, so only 86,583 of 354,770 rows have a recorded value. That figure equals the combined row count of noaa_tornadoes and noaa_storm_events (71,813 + 14,770), which suggests the missingness is structural. Among non-null rows, 85.4% report zero injuries, producing a heavily right-skewed distribution with low entropy (1.23, entropy ratio 0.157). The 233 distinct values may simply reflect a long tail of large counts, but they are worth auditing for ranges, text annotations, or data-entry anomalies.
Treatment: Cast to numeric, investigate nulls (MCAR vs. structurally missing), treat missing as unknown rather than zero, then consider zero-inflated or count-based model treatment.
- n
- 354,770
- nulls
- 268,187 (75.6%)
- unique
- 233
- top_value
- 0
- top_rate
- 0.854
- cardinality
- 233
- entropy
- 1.234
- entropy_ratio
- 0.1569
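Casting the injury counts while keeping missing values distinct from zero, and collecting anything that is not a plain integer for audit, might look like the sketch below. The helper name `split_counts` and the sample values are hypothetical; '2-3' stands in for the kind of range entry the audit is meant to surface.

```python
def split_counts(values):
    """Return (parsed, anomalies): ints or None per input, plus the raw values that failed."""
    parsed, anomalies = [], []
    for raw in values:
        text = "" if raw is None else str(raw).strip()
        if not text:
            parsed.append(None)  # missing stays missing, never zero
        elif text.isascii() and text.isdigit():
            parsed.append(int(text))
        else:
            parsed.append(None)
            anomalies.append(raw)  # keep the original for manual review
    return parsed, anomalies


# Illustrative inputs only.
parsed, anomalies = split_counts(["0", "1", None, "", "2-3", "12"])
```

An empty `anomalies` list on the real column would indicate that all 233 distinct values are genuine integers.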
fatalities
categorical feature null_rate
This column represents a count of fatalities per incident, stored as a categorical type despite being inherently numeric. The null rate is severe at 75.59%, so only 86,583 of 354,770 rows have a value. Among non-null rows, 92.86% record zero fatalities, with a long tail reaching at least 10; the low entropy ratio (0.088) confirms extreme concentration at '0'.
Treatment: Cast to integer, investigate and impute or exclude the 75.59% nulls, then treat as a heavily zero-inflated count variable (consider zero-inflated Poisson or log1p transform for regression).
- n
- 354,770
- nulls
- 268,187 (75.6%)
- unique
- 57
- top_value
- 0
- top_rate
- 0.9286
- cardinality
- 57
- entropy
- 0.5134
- entropy_ratio
- 0.08802
length_miles
text feature one_word allcaps null_rate short_text duplicates
This column stores numeric distance measurements (miles) encoded as text strings: all values are single tokens like '0.1', '0.5', '1.0' with a mean character length of 3.69 and a max of 8. Two signals demand attention. The null rate is extremely high at 79.76%, so roughly four in five rows carry no value; the 71,813 populated rows match the noaa_tornadoes row count exactly. The duplicate rate among non-null values is 94.72%, reflecting a coarse, rounded measurement scale (only 3,795 unique values across those 71,813 rows). The top value '0.1' alone appears 15,456 times, suggesting heavy concentration at short distances.
Treatment: Cast to float, investigate and handle the 79.76% nulls (impute or flag), then use directly or log-transform given likely right skew.
- n
- 354,770
- nulls
- 282,957 (79.8%)
- unique
- 3,795
- len_min
- 3
- len_max
- 8
- len_mean
- 3.688
- len_median
- 3
- len_p95
- 6
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 68,018
- duplicate_rate
- 0.9472
- vocab_size
- 2,268
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
width_yards
categorical feature null_rate
This column represents the width of some geographic or physical feature measured in yards, stored as a categorical type despite being numeric in nature. Nearly 80% of values are null (null_rate = 0.7976), making missingness the dominant signal. Among the 71,813 non-null records, values are round numbers (10, 50, 100, 30, 20, 200…), suggesting manual or estimated entries rather than precise measurements. The top value '10' accounts for 20.2% of non-null rows, and with 437 unique values and an entropy ratio of 0.51, the distribution is moderately concentrated.
Treatment: Cast to numeric, investigate whether nulls are structurally missing (feature absent) or simply unrecorded before imputing or dropping; log-transform or bin for modelling given round-number clustering.
- n
- 354,770
- nulls
- 282,957 (79.8%)
- unique
- 437
- top_value
- 10
- top_rate
- 0.2018
- cardinality
- 437
- entropy
- 4.493
- entropy_ratio
- 0.5122
type
categorical null_rate imbalance
- n
- 354,770
- nulls
- 349,767 (98.6%)
- unique
- 1
- top_value
- hot_spring
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
temperature
categorical long_tail null_rate imbalance
- n
- 354,770
- nulls
- 349,767 (98.6%)
- unique
- 44
- top_value
- top_rate
- 0.9742
- cardinality
- 44
- entropy
- 0.2566
- entropy_ratio
- 0.04699
source
categorical metadata null_rate
This column records the data provider or attribution source for each row, with only 4 distinct values drawn from named external datasets (OpenStreetMap contributors, The Megalithic Portal, NOAA Storm Events Database, OpenStreetMap). The most striking signal is a 51.56% null rate: over half of all 354,770 rows carry no source attribution, which is a data quality concern for provenance tracking. The top value 'OpenStreetMap contributors' accounts for 51.44% of the 171,850 non-null rows (88,396 records), while the closely related 'OpenStreetMap' (8,656 records) suggests inconsistent attribution for the same upstream source.
Treatment: Consolidate 'OpenStreetMap contributors' and 'OpenStreetMap' into a single category, investigate and impute or flag the 51.56% nulls before using as a stratification or filter variable.
- n
- 354,770
- nulls
- 182,920 (51.6%)
- unique
- 4
- top_value
- OpenStreetMap contributors
- top_rate
- 0.5144
- cardinality
- 4
- entropy
- 1.545
- entropy_ratio
- 0.7724
vessel_type
categorical long_tail null_rate
- n
- 354,770
- nulls
- 351,117 (99.0%)
- unique
- 23
- top_value
- top_rate
- 0.9064
- cardinality
- 23
- entropy
- 0.5764
- entropy_ratio
- 0.1274
cargo
categorical long_tail null_rate imbalance
- n
- 354,770
- nulls
- 351,117 (99.0%)
- unique
- 17
- top_value
- top_rate
- 0.9943
- cardinality
- 17
- entropy
- 0.07302
- entropy_ratio
- 0.01786
peak_brightness_altitude_km
categorical null_rate
- n
- 354,770
- nulls
- 354,193 (99.8%)
- unique
- 224
- top_value
- 37.0
- top_rate
- 0.06066
- cardinality
- 224
- entropy
- 7.187
- entropy_ratio
- 0.9206
velocity_km_s
categorical null_rate
- n
- 354,770
- nulls
- 354,421 (99.9%)
- unique
- 158
- top_value
- 13.6
- top_rate
- 0.01719
- cardinality
- 158
- entropy
- 7.052
- entropy_ratio
- 0.9656
energy_joules
categorical long_tail null_rate
- n
- 354,770
- nulls
- 353,907 (99.8%)
- unique
- 518
- top_value
- 2.1
- top_rate
- 0.01738
- cardinality
- 518
- entropy
- 8.634
- entropy_ratio
- 0.9576
event_type
categorical null_rate
- n
- 354,770
- nulls
- 340,000 (95.8%)
- unique
- 17
- top_value
- Tornado
- top_rate
- 0.4288
- cardinality
- 17
- entropy
- 2.336
- entropy_ratio
- 0.5715
damage_property
text one_word allcaps null_rate short_text duplicates
- n
- 354,770
- nulls
- 340,000 (95.8%)
- unique
- 1,014
- len_min
- 0
- len_max
- 8
- len_mean
- 4.381
- len_median
- 5
- len_p95
- 7
- word_mean
- 1
- word_median
- 1
- n_empty
- 368
- n_duplicates
- 13,756
- duplicate_rate
- 0.9313
- vocab_size
- 1,013
- readability_flesch_mean
- 117
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0.8724
- boilerplate_rate
- 0
cave_type
categorical long_tail null_rate
- n
- 354,770
- nulls
- 354,729 (100.0%)
- unique
- 5
- top_value
- pit
- top_rate
- 0.878
- cardinality
- 5
- entropy
- 0.7693
- entropy_ratio
- 0.3313
cave_length_m
categorical long_tail null_rate
- n
- 354,770
- nulls
- 354,128 (99.8%)
- unique
- 237
- top_value
- 5
- top_rate
- 0.04984
- cardinality
- 237
- entropy
- 6.919
- entropy_ratio
- 0.877
cave_depth_m
categorical long_tail null_rate
- n
- 354,770
- nulls
- 354,472 (99.9%)
- unique
- 124
- top_value
- 0
- top_rate
- 0.2114
- cardinality
- 124
- entropy
- 5.797
- entropy_ratio
- 0.8336
access
categorical null_rate
- n
- 354,770
- nulls
- 347,515 (98.0%)
- unique
- 20
- top_value
- yes
- top_rate
- 0.3795
- cardinality
- 20
- entropy
- 2.234
- entropy_ratio
- 0.517
cave_ref
text one_word allcaps null_rate short_text
- n
- 354,770
- nulls
- 347,184 (97.9%)
- unique
- 7,162
- len_min
- 1
- len_max
- 38
- len_mean
- 6.341
- len_median
- 7
- len_p95
- 8
- word_mean
- 1.068
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 424
- duplicate_rate
- 0.05589
- vocab_size
- 7,005
- readability_flesch_mean
- 117.8
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9359
- allcaps_rate
- 0.8559
- boilerplate_rate
- 0
osm_id
numeric foreign_key null_rate
This column contains OpenStreetMap (OSM) numeric identifiers, likely referencing geographic features such as nodes, ways, or relations in the OSM database. The most striking issue is a 75.08% null rate across 354,770 rows, meaning only about one quarter of records (88,396) carry an OSM linkage. With 88,395 unique values across those 88,396 non-null rows, the column is almost perfectly unique: only one row repeats an existing ID. The platykurtic distribution (kurtosis ≈ -1.23) is consistent with IDs drawn broadly across OSM's ID space (min ~1.3M, max ~13.5B), with no outliers detected. Treatment: Investigate whether missingness is systematic, then filter out the null rows and left-join the remainder to OSM data on this ID. Identifiers cannot be meaningfully imputed.
- n
- 354,770
- nulls
- 266,374 (75.1%)
- unique
- 88,395
- min
- 1.334e+06
- max
- 1.347e+10
- mean
- 6.183e+09
- median
- 6.047e+09
- std
- 3.993e+09
- q1
- 2.628e+09
- q3
- 9.53e+09
- iqr
- 6.903e+09
- skew
- 0.1321
- kurtosis
- -1.228
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
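The missingness check and join for osm_id can be sketched as follows. One hint worth testing: the 88,396 non-null rows equal the combined counts of osm_caves (70,242) and osm_ghost_towns (18,154) in the category table, which suggests the nulls simply mark non-OSM sources. The row layout, the `osm_lookup` table, and the `osm_name` field below are hypothetical.

```python
from collections import defaultdict


def null_rate_by_group(rows, group_key, value_key):
    """Share of rows in each group whose value_key is None."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(value_key) is None:
            nulls[group] += 1
    return {group: nulls[group] / totals[group] for group in totals}


def left_join(rows, lookup, key="osm_id"):
    """Keep every input row; attach lookup fields where the id matches."""
    return [{**row, **lookup.get(row.get(key), {})} for row in rows]


# Toy rows and lookup table (illustrative values only).
rows = [
    {"category": "osm_caves", "osm_id": 101},
    {"category": "osm_caves", "osm_id": 102},
    {"category": "ufo_sightings", "osm_id": None},
]
osm_lookup = {101: {"osm_name": "Example Cave"}}

rates = null_rate_by_group(rows, "category", "osm_id")
linked = [row for row in rows if row["osm_id"] is not None]
joined = left_join(linked, osm_lookup)

print(rates)
print(joined)
```

If every category shows a null rate of exactly 0 or 1, the missingness is structural, and filtering by category is equivalent to filtering on nulls.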
osm_type
categorical feature null_rate imbalance
This column stores OpenStreetMap element types, taking only three possible values: 'node', 'way', and 'relation'. Two signals demand attention. First, 75.08% of the 354,770 rows are null, so OSM type is recorded for only about a quarter of records. Second, the non-null values are severely imbalanced: 'node' accounts for 96.39% of non-null entries (85,204 occurrences), versus 2,560 'way' and just 632 'relation'. The low entropy ratio (0.158) confirms that this column carries little discriminative information as-is. Treatment: Impute nulls as a distinct 'unknown' category, then one-hot encode; consider whether the 'way'/'relation' minority classes carry signal worth preserving or should be collapsed.
- n
- 354,770
- nulls
- 266,374 (75.1%)
- unique
- 3
- top_value
- node
- top_rate
- 0.9639
- cardinality
- 3
- entropy
- 0.2501
- entropy_ratio
- 0.1578
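The entropy figures for osm_type can be reproduced from the three counts quoted in the summary, and the suggested encoding sketched alongside. This assumes entropy is measured in bits and normalized by log2 of the cardinality; the `one_hot` helper and its column names are illustrative.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits for a collection of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Non-null counts quoted in the column summary.
counts = {"node": 85204, "way": 2560, "relation": 632}
h = entropy_bits(counts.values())
ratio = h / math.log2(len(counts))  # normalize by the maximum, log2(3)

print(round(h, 4), round(ratio, 4))

# One-hot encoding with an explicit 'unknown' level for nulls.
LEVELS = ["node", "way", "relation", "unknown"]


def one_hot(value):
    """Map a raw value to indicator columns; None or unseen values become 'unknown'."""
    level = value if value in LEVELS else "unknown"
    return {f"osm_type_{name}": int(name == level) for name in LEVELS}


print(one_hot("way"))
print(one_hot(None))
```

Because 'unknown' would cover roughly 75% of rows, the resulting indicator mostly separates OSM-derived records from everything else.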
place_type
categorical label long_tail null_rate
This column captures the settlement/place classification type, likely from an OpenStreetMap-style geographic dataset, with values such as 'hamlet', 'isolated_dwelling', 'village', and 'town'. The most striking signal is the extreme null rate of 94.88%: only 18,154 of 354,770 rows carry a value, so the column is very sparse. Among populated rows, 'hamlet' dominates at 66.57% of non-null values, and the presence of a raw 'yes' tag (131 occurrences) indicates uncleaned OSM data that needs remediation. Treatment: Filter or impute nulls before use; remap 'yes' and other dirty values; treat as a low-cardinality categorical with one-hot encoding, or ordinal encoding that reflects the settlement hierarchy.
- n
- 354,770
- nulls
- 336,616 (94.9%)
- unique
- 48
- top_value
- hamlet
- top_rate
- 0.6657
- cardinality
- 48
- entropy
- 1.498
- entropy_ratio
- 0.2682
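The remapping and ordinal encoding for place_type might look like the sketch below. The rank order (isolated_dwelling < hamlet < village < town) is an assumed settlement hierarchy covering only the four values named in the summary; the other 44 labels would need a decision of their own.

```python
# Assumed ordinal ranks for the four values named in the summary.
PLACE_RANK = {"isolated_dwelling": 0, "hamlet": 1, "village": 2, "town": 3}

# Raw tag values that say nothing about settlement size.
DIRTY_VALUES = {"yes"}


def encode_place_type(value):
    """Return (rank, status); rank is None unless status is 'ok'."""
    if value is None:
        return None, "missing"
    label = str(value).strip().lower()
    if label in DIRTY_VALUES:
        return None, "dirty"
    if label in PLACE_RANK:
        return PLACE_RANK[label], "ok"
    # Labels outside the assumed hierarchy are kept visible for review.
    return None, "unranked"


for raw in ["hamlet", "Town", "yes", None, "locality"]:
    print(raw, encode_place_type(raw))
```

Returning a status next to the rank keeps nulls, dirty tags, and unranked labels distinguishable instead of collapsing them into a single missing value.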
abandoned_year
categorical long_tail null_rate
- n
- 354,770
- nulls
- 353,545 (99.7%)
- unique
- 147
- top_value
- yes
- top_rate
- 0.4359
- cardinality
- 147
- entropy
- 2.939
- entropy_ratio
- 0.4082
abandoned_reason
unknown label skipped
This column contains abandoned-reason codes or labels. Given the neighbouring ghost-town fields (abandoned_year, former_population), it most likely records why a settlement was abandoned. The profiler emitted a 'skipped' alert with no stats or uniqueness counts, meaning the column's type could not be resolved and no frequency analysis was performed. The profile reports 354,770 rows and a null rate of exactly 0.0, but the related ghost-town columns are over 99% null, so this zero null count should be treated with caution. The column's true cardinality, distribution, and value content are unknown from this evidence. Treatment: Re-profile with explicit string/categorical typing to recover value counts and cardinality before any downstream use.
- n
- 354,770
- nulls
- 0 (0.0%)
- unique
- —
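Re-profiling abandoned_reason with explicit string typing could be sketched as below. This is a hedged illustration: the CSV layout, the sample reasons, and the rule that blank strings count as null are all assumptions, since the real values are unknown.

```python
import csv
import io
from collections import Counter


def profile_as_string(rows, column):
    """Profile one column as text, counting blank or absent values as null."""
    values = [(row.get(column) or "").strip() for row in rows]
    non_null = [v for v in values if v]
    counts = Counter(non_null)
    return {
        "n": len(values),
        "nulls": len(values) - len(non_null),
        "unique": len(counts),
        "top": counts.most_common(3),
    }


# Hypothetical sample; csv.DictReader yields every field as a string.
sample = io.StringIO(
    "name,abandoned_reason\n"
    "A,mine closure\n"
    "B,\n"
    "C,flood\n"
    "D,mine closure\n"
)
report = profile_as_string(csv.DictReader(sample), "abandoned_reason")
print(report)
```

Reading every field as a string avoids the type-inference failure that presumably triggered the 'skipped' alert, and it shows whether the reported zero null count hides blank values.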
former_population
categorical null_rate
- n
- 354,770
- nulls
- 352,243 (99.3%)
- unique
- 75
- top_value
- 0
- top_rate
- 0.4951
- cardinality
- 75
- entropy
- 2.605
- entropy_ratio
- 0.4182
heritage
categorical long_tail null_rate
- n
- 354,770
- nulls
- 354,763 (100.0%)
- unique
- 6
- top_value
- 2
- top_rate
- 0.2857
- cardinality
- 6
- entropy
- 2.522
- entropy_ratio
- 0.9755