data-trove-strange-places-v5-2

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/strange_places_v5.2.json

Saturn profiled 354,770 rows across 48 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

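# Install the profiler (notebook shell escape), then run its CLI on the source file.
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/strange_places_v5.2.json",
    "--findings", "data-trove-strange-places-v5-2.json",  # machine-readable stats output
    "--llm", "anthropic:default",  # opt-in prose interpretation of the stats
], check=True)  # raise CalledProcessError if the CLI exits non-zero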

Summary confidence: high

This is a 354,770-row mashup of 14 heterogeneous 'strange places' datasets (tornadoes, UFO sightings, cave entrances, meteorites, ghost towns, earthquakes, shipwrecks, and more) unified under a single 'category' column. The most important thing to examine first is the category distribution: no single source dominates, but tornadoes (~72K), caves (~70K), UFO sightings (~61K), and megalithic_portal entries (~60K) each make up roughly 17–20% of records. A second key signal is the pervasive sparsity: most domain-specific columns (depth_km, duration_seconds, shape, damage_property) carry null rates of 80–99%, so each column is only meaningful for the subset of rows belonging to its originating dataset. UFO sighting durations show extreme right-skew (median 180 s, max about 66 million s), and earthquake depths are also flagged for high skew; both are worth closer inspection within their respective subsets.

citing: category.top_values · category.null_rate · duration_seconds.stats · depth_km.stats · shape.null_rate · damage_property.null_rate · source.top_values · fatalities.top_values · event_type.top_values
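
The sparsity claim above can be checked by computing null rates per category instead of over the whole table. The following is a minimal sketch, assuming the rows are available as dictionaries with a `category` key; the function name and the three sample rows are hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict


def null_rate_by_category(rows, column):
    """Return {category: fraction of rows where `column` is None or absent}."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        cat = row["category"]
        totals[cat] += 1
        if row.get(column) is None:
            nulls[cat] += 1
    return {cat: nulls[cat] / totals[cat] for cat in totals}


# Hypothetical rows, not taken from the dataset.
rows = [
    {"category": "ufo_sightings", "shape": "light"},
    {"category": "ufo_sightings", "shape": None},
    {"category": "osm_caves"},
]
rates = null_rate_by_category(rows, "shape")
```

If a column is close to fully populated inside one category and empty everywhere else, its corpus-wide null rate mostly reflects that category's share of rows.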

Out[4]:

saturn.schema() · 48 columns

column	kind	n	null%	unique	alerts
latitude	numeric	354,770	0.0%	215,964	high_skew outliers
longitude	numeric	354,770	0.0%	223,129
name	text	354,770	0.0%	189,861	multilingual duplicates
description	text	354,770	0.0%	218,717	multilingual duplicates
category	categorical	354,770	0.0%	14
date	text	354,770	41.9%	23,500	one_word allcaps null_rate short_text duplicates
country	categorical	354,770	55.3%	28	null_rate
city	text	354,770	82.9%	9,149	one_word null_rate short_text duplicates
state	categorical	354,770	58.5%	118	null_rate
shape	categorical	354,770	82.9%	28	null_rate
duration_seconds	numeric	354,770	82.9%	444	null_rate high_skew outliers
mass_g	unknown	354,770	0.0%	—	skipped
meteorite_class	categorical	354,770	90.9%	395	null_rate
fall_type	categorical	354,770	90.9%	2	null_rate imbalance
magnitude	categorical	354,770	76.7%	294	null_rate
depth_km	numeric	354,770	98.9%	1,505	null_rate high_skew outliers
place	text	354,770	98.9%	3,002	null_rate
earthquake_type	categorical	354,770	98.9%	3	null_rate imbalance
volcano_type	categorical	354,770	100.0%	1	null_rate imbalance
elevation_m	unknown	354,770	0.0%	—	skipped
status	categorical	354,770	100.0%	1	null_rate imbalance
last_eruption	categorical	354,770	100.0%	1	null_rate imbalance
injuries	categorical	354,770	75.6%	233	null_rate
fatalities	categorical	354,770	75.6%	57	null_rate
length_miles	text	354,770	79.8%	3,795	one_word allcaps null_rate short_text duplicates
width_yards	categorical	354,770	79.8%	437	null_rate
type	categorical	354,770	98.6%	1	null_rate imbalance
temperature	categorical	354,770	98.6%	44	long_tail null_rate imbalance
source	categorical	354,770	51.6%	4	null_rate
vessel_type	categorical	354,770	99.0%	23	long_tail null_rate
cargo	categorical	354,770	99.0%	17	long_tail null_rate imbalance
peak_brightness_altitude_km	categorical	354,770	99.8%	224	null_rate
velocity_km_s	categorical	354,770	99.9%	158	null_rate
energy_joules	categorical	354,770	99.8%	518	long_tail null_rate
event_type	categorical	354,770	95.8%	17	null_rate
damage_property	text	354,770	95.8%	1,014	one_word allcaps null_rate short_text duplicates
cave_type	categorical	354,770	100.0%	5	long_tail null_rate
cave_length_m	categorical	354,770	99.8%	237	long_tail null_rate
cave_depth_m	categorical	354,770	99.9%	124	long_tail null_rate
access	categorical	354,770	98.0%	20	null_rate
cave_ref	text	354,770	97.9%	7,162	one_word allcaps null_rate short_text
osm_id	numeric	354,770	75.1%	88,395	null_rate
osm_type	categorical	354,770	75.1%	3	null_rate imbalance
place_type	categorical	354,770	94.9%	48	long_tail null_rate
abandoned_year	categorical	354,770	99.7%	147	long_tail null_rate
abandoned_reason	unknown	354,770	0.0%	—	skipped
former_population	categorical	354,770	99.3%	75	null_rate
heritage	categorical	354,770	100.0%	6	long_tail null_rate

Fig 1.

category · Shows how records are split across the 14 source datasets. Tornadoes, caves, UFO sightings, and megalithic_portal entries each hold 17–20% of rows (about 74% combined); the remaining 10 categories share the rest.

Top values for category (14 unique shown, of 14 total).
value	count	share
noaa_tornadoes	71813	20.2%
osm_caves	70242	19.8%
ufo_sightings	60632	17.1%
megalithic_portal	60028	16.9%
nasa_meteorites	32186	9.1%
osm_ghost_towns	18154	5.1%
noaa_storm_events	14770	4.2%
haunted_places	9717	2.7%
noaa_thermal_springs	5003	1.4%
bigfoot_sightings	3797	1.1%
usgs_earthquakes	3742	1.1%
noaa_shipwrecks	3653	1.0%
nasa_fireballs	863	0.2%
usgs_volcanoes	170	0.0%

Fig 2.

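shape · For UFO sighting rows, 'light' is by far the most reported shape, followed by triangle and circle; look for the long tail of rarer forms. Shares in the table are of all 354,770 rows, so 'light' is 3.6% overall but 21.3% of the non-null shape values.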

Top values for shape (20 unique shown, of 28 total).
value	count	share
light	12895	3.6%
triangle	6268	1.8%
circle	5890	1.7%
fireball	4939	1.4%
unknown	4359	1.2%
other	4209	1.2%
sphere	4134	1.2%
disk	3853	1.1%
oval	2881	0.8%
formation	1908	0.5%
cigar	1569	0.4%
changing	1517	0.4%
flash	1025	0.3%
rectangle	1010	0.3%
cylinder	977	0.3%
diamond	884	0.2%
chevron	774	0.2%
teardrop	560	0.2%
egg	555	0.2%
cone	235	0.1%

Fig 3.

event_type · Among the 14,770 rows with an event type, Tornado is the most common value (6,334, about 43%), roughly 2.7 times Flash Flood (2,358). Ten of the 17 types have fewer than 50 rows each.

Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	1.8%
Flash Flood	2358	0.7%
Thunderstorm Wind	2257	0.6%
Flood	1777	0.5%
Hail	1246	0.4%
Lightning	574	0.2%
Heavy Rain	99	0.0%
Marine Strong Wind	43	0.0%
Debris Flow	43	0.0%
Marine Thunderstorm Wind	25	0.0%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%
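
The share column in these top-value tables is computed against all 354,770 rows, which understates how common a value is within its own sparse column. A small sketch of re-basing the shares on non-null rows, using the counts from the table above (the variable and function names are illustrative):

```python
TOTAL_ROWS = 354_770

# Counts copied from the event_type table above.
event_counts = {
    "Tornado": 6334, "Flash Flood": 2358, "Thunderstorm Wind": 2257,
    "Flood": 1777, "Hail": 1246, "Lightning": 574, "Heavy Rain": 99,
    "Marine Strong Wind": 43, "Debris Flow": 43,
    "Marine Thunderstorm Wind": 25, "Marine High Wind": 5, "Dust Devil": 3,
    "Waterspout": 2, "Tropical Storm": 1, "High Wind": 1, "Heat": 1,
    "Marine Lightning": 1,
}
non_null_rows = sum(event_counts.values())


def shares(value):
    """Return (share of all rows, share of non-null rows) for one value."""
    count = event_counts[value]
    return count / TOTAL_ROWS, count / non_null_rows


overall, within = shares("Tornado")
```

The non-null total equals the noaa_storm_events row count in the category table, which suggests that event_type is populated only for that source.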

Fig 4.

duration_seconds · Extreme right-skew with a median of 180 seconds but a max of 66 million seconds — look for the spike near zero and the extreme outliers.

Histogram bins for duration_seconds (median: 180.0).
bin	count
0.01 – 1.657e+06	60612
1.657e+06 – 3.314e+06	11
3.314e+06 – 4.971e+06	0
4.971e+06 – 6.628e+06	3
6.628e+06 – 8.285e+06	1
8.285e+06 – 9.941e+06	0
9.941e+06 – 1.16e+07	2
1.16e+07 – 1.326e+07	0
1.326e+07 – 1.491e+07	0
1.491e+07 – 1.657e+07	0
1.657e+07 – 1.823e+07	0
1.823e+07 – 1.988e+07	0
1.988e+07 – 2.154e+07	0
2.154e+07 – 2.32e+07	0
2.32e+07 – 2.485e+07	0
2.485e+07 – 2.651e+07	0
2.651e+07 – 2.817e+07	0
2.817e+07 – 2.982e+07	0
2.982e+07 – 3.148e+07	0
3.148e+07 – 3.314e+07	0
3.314e+07 – 3.479e+07	0
3.479e+07 – 3.645e+07	0
3.645e+07 – 3.811e+07	0
3.811e+07 – 3.977e+07	0
3.977e+07 – 4.142e+07	0
4.142e+07 – 4.308e+07	0
4.308e+07 – 4.474e+07	0
4.474e+07 – 4.639e+07	0
4.639e+07 – 4.805e+07	0
4.805e+07 – 4.971e+07	0
4.971e+07 – 5.136e+07	0
5.136e+07 – 5.302e+07	2
5.302e+07 – 5.468e+07	0
5.468e+07 – 5.633e+07	0
5.633e+07 – 5.799e+07	0
5.799e+07 – 5.965e+07	0
5.965e+07 – 6.131e+07	0
6.131e+07 – 6.296e+07	0
6.296e+07 – 6.462e+07	0
6.462e+07 – 6.628e+07	1

Fig 5.

meteorite_class · L6 and H5 ordinary chondrites dominate meteorite finds, with a long tail of rarer classes worth noting for completeness of the record.

Top values for meteorite_class (20 unique shown, of 395 total).
value	count	share
L6	6544	1.8%
H5	5614	1.6%
H4	3336	0.9%
H6	3234	0.9%
L5	2750	0.8%
LL5	1899	0.5%
LL6	963	0.3%
L4	831	0.2%
H4/5	380	0.1%
CM2	281	0.1%
Iron, IIIAB	272	0.1%
H3	244	0.1%
LL	220	0.1%
E3	205	0.1%
L3	176	0.0%
LL4	160	0.0%
H5/6	156	0.0%
Ureilite	155	0.0%
Howardite	127	0.0%
Diogenite	125	0.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
latitude	numeric	0.0%
longitude	numeric	0.0%
name	text	0.0%
description	text	0.0%
category	categorical	0.0%
date	text	41.9%
country	categorical	55.3%
city	text	82.9%
state	categorical	58.5%
shape	categorical	82.9%
duration_seconds	numeric	82.9%
mass_g	unknown	0.0%
meteorite_class	categorical	90.9%
fall_type	categorical	90.9%
magnitude	categorical	76.7%
depth_km	numeric	98.9%
place	text	98.9%
earthquake_type	categorical	98.9%
volcano_type	categorical	100.0%
elevation_m	unknown	0.0%
status	categorical	100.0%
last_eruption	categorical	100.0%
injuries	categorical	75.6%
fatalities	categorical	75.6%
length_miles	text	79.8%
width_yards	categorical	79.8%
type	categorical	98.6%
temperature	categorical	98.6%
source	categorical	51.6%
vessel_type	categorical	99.0%
cargo	categorical	99.0%
peak_brightness_altitude_km	categorical	99.8%
velocity_km_s	categorical	99.9%
energy_joules	categorical	99.8%
event_type	categorical	95.8%
damage_property	text	95.8%
cave_type	categorical	100.0%
cave_length_m	categorical	99.8%
cave_depth_m	categorical	99.9%
access	categorical	98.0%
cave_ref	text	97.9%
osm_id	numeric	75.1%
osm_type	categorical	75.1%
place_type	categorical	94.9%
abandoned_year	categorical	99.7%
abandoned_reason	unknown	0.0%
former_population	categorical	99.3%
heritage	categorical	100.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 9,808 detected strings).
lang	count	share
en	8310	84.7%
fr	285	2.9%
de	258	2.6%
it	240	2.4%
es	158	1.6%
ru	111	1.1%
ca	55	0.6%
da	38	0.4%
pt	36	0.4%
nl	30	0.3%
pl	30	0.3%
eu	29	0.3%
hu	24	0.2%
be	23	0.2%
id	18	0.2%
sv	17	0.2%
ar	15	0.2%
cs	14	0.1%
uk	13	0.1%
no	13	0.1%
ba	13	0.1%
sk	10	0.1%
ro	9	0.1%
cy	8	0.1%
el	8	0.1%
tr	7	0.1%
fi	7	0.1%
ja	7	0.1%
az	6	0.1%
hr	5	0.1%
ceb	5	0.1%
sl	2	0.0%
la	1	0.0%
tt	1	0.0%
ko	1	0.0%
als	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 5 numeric columns (values clipped to 2 decimals).
	latitude	longitude	duration_seconds	depth_km	osm_id
latitude	+1.00	-0.39	+0.01	+0.01	+0.34
longitude	-0.39	+1.00	-0.05	-0.05	+0.17
duration_seconds	+0.01	-0.05	+1.00	-0.03	+0.01
depth_km	+0.01	-0.05	-0.03	+1.00	-0.09
osm_id	+0.34	+0.17	+0.01	-0.09	+1.00
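
For reference, a Pearson coefficient over sparse columns can be sketched with pairwise deletion, that is, using only rows where both values are present. This is an assumption for illustration: the report does not say how Saturn samples rows or treats nulls when building this matrix. Columns such as duration_seconds and depth_km appear to be populated for different categories, so the number of complete pairs may be very small.

```python
import math


def pearson(xs, ys):
    """Pearson r over pairs where both values are present (pairwise deletion).

    Returns None when fewer than two complete pairs exist or when either
    side has zero variance.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)
```

Reporting the number of complete pairs alongside each coefficient makes it easier to judge how much weight a value like -0.03 deserves.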

latitude numeric feature

This column contains geographic latitude values, ranging from -87.37° to 88.5°, consistent with global coordinates. The distribution is strongly left-skewed (skew = -2.84) with high kurtosis (7.30): the median sits at ~40.6°N, but a heavy tail of southern values pulls the mean down to 32.66°. The histogram shows that this tail is not smooth: 22,103 rows lie south of about -69.8° (Antarctic latitudes), while only about 6,300 fall between there and the equator. About 9.4% of rows (33,355) are flagged as outliers. With q1 = 33.69 and q3 = 46.53, the 1.5 × IQR fences fall near 14.4°N and 65.8°N, so every record south of roughly 14.4°N counts as an outlier, not only polar ones. The zero_rate (0.06%, about 226 rows) is small but worth checking for missing coordinates encoded as 0.

Treatment: Retain as-is for geospatial modelling; investigate ~0.06% zero-value rows as possible null sentinels, and review 33,355 outlier records for data quality before clustering or distance-based methods.

anthropic:default · confidence high

Out[14]:

saturn.columns["latitude"].stats

stat	value
n	354,770
nulls	0 (0.0%)
unique	215,964
min	-87.37
max	88.5
mean	32.66
median	40.6
std	31.01
q1	33.69
q3	46.53
iqr	12.85
skew	-2.84
kurtosis	7.302
n_outliers	33,355
outlier_rate	0.09402
zero_rate	0.000637
alert: high_skew	skew=-2.84
alert: outliers	9.4% rows beyond 1.5 IQR
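
The "beyond 1.5 IQR" alert corresponds to the usual Tukey fence rule. A minimal sketch that reproduces the fences from the rounded quartiles in the table above (the function names are illustrative, and the five sample latitudes are hypothetical):

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR outside the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, q1, q3, k=1.5):
    """Count non-null values that fall outside the Tukey fences."""
    low, high = tukey_fences(q1, q3, k)
    return sum(1 for v in values if v is not None and (v < low or v > high))


# Quartiles as reported for latitude (rounded to two decimals).
lat_low, lat_high = tukey_fences(33.69, 46.53)

# Hypothetical sample: three of the four non-null values are outside the fences.
n_out = count_outliers([-80.0, 10.0, 40.0, 70.0, None], 33.69, 46.53)
```

Because the interquartile range is narrow (about 12.8°), the fences enclose only mid-northern latitudes, which is why tropical and southern records are flagged.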

Fig 9.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 40.5983333).
bin	count
-87.37 – -82.97	7090
-82.97 – -78.57	1218
-78.57 – -74.18	4088
-74.18 – -69.78	9707
-69.78 – -65.38	4
-65.38 – -60.99	3
-60.99 – -56.59	1
-56.59 – -52.19	20
-52.19 – -47.8	72
-47.8 – -43.4	110
-43.4 – -39	287
-39 – -34.61	726
-34.61 – -30.21	991
-30.21 – -25.81	714
-25.81 – -21.42	902
-21.42 – -17.02	581
-17.02 – -12.62	297
-12.62 – -8.227	341
-8.227 – -3.83	603
-3.83 – 0.5667	690
0.5667 – 4.963	435
4.963 – 9.36	1927
9.36 – 13.76	1608
13.76 – 18.15	1738
18.15 – 22.55	6214
22.55 – 26.95	5180
26.95 – 31.34	20824
31.34 – 35.74	47917
35.74 – 40.14	55501
40.14 – 44.53	77688
44.53 – 48.93	45229
48.93 – 53.33	28613
53.33 – 57.72	24054
57.72 – 62.12	7237
62.12 – 66.52	1480
66.52 – 70.91	490
70.91 – 75.31	92
75.31 – 79.71	86
79.71 – 84.1	9
84.1 – 88.5	3

longitude numeric feature

This column contains geographic longitude values for 354,770 records, spanning nearly the full valid range from -179.28° to 180°. The distribution is moderately right-skewed (skew = 0.755) with a mean of -31.75° and a median of -42.66°. The histogram is bimodal: one mass sits over the Americas (roughly -125° to -71°) and a second over Europe (roughly -9° to 36°), so the median lands in the mid-Atlantic, where few records actually lie. The IQR of 104.8° is very wide, consistent with global coverage rather than a region-specific dataset, and only 827 values (0.23%) are flagged as outliers.

Treatment: Pair with latitude for geospatial modelling; consider coordinate binning or haversine-based features rather than treating as a raw numeric.

anthropic:default · confidence high
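
The treatment mentions haversine-based features. A minimal sketch of the great-circle distance between two coordinate pairs, assuming a spherical Earth with a mean radius of 6371 km (an approximation; the function name is illustrative):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Unlike a raw difference in longitude, this distance handles the wrap at ±180° correctly, so points at 179.5° and -179.5° come out one degree apart.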

Out[17]:

saturn.columns["longitude"].stats

stat	value
n	354,770
nulls	0 (0.0%)
unique	223,129
min	-179.3
max	180
mean	-31.75
median	-42.66
std	72.11
q1	-92.08
q3	12.73
iqr	104.8
skew	0.7545
kurtosis	0.1165
n_outliers	827
outlier_rate	0.002331
zero_rate	0

Fig 10.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: -42.65733525).
bin	count
-179.3 – -170.3	90
-170.3 – -161.3	969
-161.3 – -152.3	1120
-152.3 – -143.4	814
-143.4 – -134.4	413
-134.4 – -125.4	839
-125.4 – -116.4	18752
-116.4 – -107.4	8803
-107.4 – -98.44	21820
-98.44 – -89.46	47491
-89.46 – -80.48	45346
-80.48 – -71.5	24836
-71.5 – -62.52	4804
-62.52 – -53.53	485
-53.53 – -44.55	608
-44.55 – -35.57	503
-35.57 – -26.59	76
-26.59 – -17.61	376
-17.61 – -8.624	2511
-8.624 – 0.3583	40337
0.3583 – 9.34	29096
9.34 – 18.32	36101
18.32 – 27.3	12918
27.3 – 36.29	13238
36.29 – 45.27	6091
45.27 – 54.25	5213
54.25 – 63.23	4220
63.23 – 72.22	604
72.22 – 81.2	3575
81.2 – 90.18	777
90.18 – 99.16	745
99.16 – 108.1	1540
108.1 – 117.1	1434
117.1 – 126.1	1238
126.1 – 135.1	1806
135.1 – 144.1	1475
144.1 – 153.1	730
153.1 – 162	8482
162 – 171	3693
171 – 180	801

name text label

This column contains the name or title of individual records in a multi-domain dataset covering natural features (caves), weather events (tornadoes by US state), and UFO sightings. The duplicate rate is high at 46.5%, driven largely by templated strings like 'Unnamed Cave' (19,962 occurrences) and repeated tornado/state/count patterns. Content is predominantly English (3,363 language-detected values skew English), but the multilingual alert reports 31 detected languages in the sample, including French (279), Italian (236), German (230), Spanish (156), and Russian (102), which suggests internationally sourced named entities mixed into the dataset. Because nearly half of the values are non-unique, this column cannot serve as a reliable identifier.

Treatment: Deduplicate or group by name pattern before use; consider splitting templated names (e.g. 'Tornado in TX, 48') into structured fields; embed free-form names if semantic similarity is needed.

anthropic:default · confidence high
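
Splitting a templated name into structured fields could look like the sketch below. The pattern is inferred from the single example quoted in the treatment ('Tornado in TX, 48'); real names may vary, and the meaning of the trailing number is not stated in the report, so the field name `number` is a placeholder.

```python
import re

# Pattern inferred from one example in the report; treat it as an assumption.
_TORNADO_NAME = re.compile(r"^Tornado in (?P<state>[A-Z]{2}), (?P<number>\d+)$")


def parse_tornado_name(name):
    """Return {'state': ..., 'number': ...} for templated names, else None."""
    match = _TORNADO_NAME.match(name)
    if match is None:
        return None
    return {"state": match.group("state"), "number": int(match.group("number"))}
```

Names that do not match, such as 'Unnamed Cave', return None and can be routed to a different handler.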

Out[20]:

saturn.columns["name"].stats

stat	value
n	354,770
nulls	0 (0.0%)
unique	189,861
len_min	1
len_max	235
len_mean	20
len_median	17
len_p95	32
word_mean	3.564
word_median	4
n_empty	0
n_duplicates	164,909
duplicate_rate	0.4648
vocab_size	15,811
readability_flesch_mean	64.79
emoji_rate	2.819e-06
url_rate	0
one_word_rate	0.09411
allcaps_rate	0.01283
boilerplate_rate	0
alert: multilingual	31 languages detected in sample
alert: duplicates	46.5% duplicate strings

Fig 11.

Character-length distribution for name.

Character-length distribution for name (mean: 19.99670208867717).
chars	count
1 – 7	8769
7 – 13	61484
13 – 19	132248
19 – 24	49793
24 – 30	74096
30 – 36	19324
36 – 42	3592
42 – 48	1239
48 – 54	408
54 – 60	415
60 – 65	461
65 – 71	495
71 – 77	537
77 – 83	395
83 – 89	431
89 – 95	348
95 – 100	245
100 – 106	167
106 – 112	117
112 – 118	69
118 – 124	51
124 – 130	26
130 – 136	22
136 – 141	14
141 – 147	5
147 – 153	6
153 – 159	1
159 – 165	3
165 – 171	4
171 – 176	0
176 – 182	2
182 – 188	0
188 – 194	0
194 – 200	1
200 – 206	0
206 – 212	0
212 – 217	0
217 – 223	0
223 – 229	0
229 – 235	2

description text free_text

This column contains free-text descriptions of geographic or physical features. Cave entrances, former hamlets, hot springs, shipwrecks, and tornado tracks (e.g. 'F0, 0.1mi long, 10yd wide') dominate the top values, suggesting a points-of-interest or gazetteer dataset. The duplicate rate is high at 38.3% (136,053 of 354,770 rows repeat an earlier value), largely from templated entries like 'Cave entrance' (52,067 occurrences) and storm-track boilerplate. Text is overwhelmingly English (4,893 sampled as English), but 22 languages are detected in the sample, including German (28), Bashkir (13), Russian (9), and Belarusian (9), a multilingual minority that may require separate handling. The gap between median length (40 chars) and mean (114 chars), with a p95 of 491, indicates a heavily right-skewed length distribution. The maximum length of exactly 500 and the 19,628 rows in the 488–500 bin suggest that longer descriptions were truncated at 500 characters.

Treatment: Deduplicate or group templated entries before NLP; apply language detection and route non-English rows to language-specific pipelines; tokenize and embed for semantic modelling.

anthropic:default · confidence high

Out[23]:

saturn.columns["description"].stats

stat	value
n	354,770
nulls	0 (0.0%)
unique	218,717
len_min	1
len_max	500
len_mean	114
len_median	40
len_p95	491
word_mean	24.07
word_median	7
n_empty	0
n_duplicates	136,053
duplicate_rate	0.3835
vocab_size	38,639
readability_flesch_mean	66.65
emoji_rate	0
url_rate	0.008149
one_word_rate	0.01018
allcaps_rate	0.004256
boilerplate_rate	0.0002509
alert: multilingual	22 languages detected in sample
alert: duplicates	38.3% duplicate strings
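
The reported duplicate figures are consistent with counting every non-null row beyond the first occurrence of each distinct string, so that n_duplicates equals the non-null count minus the unique count. This reading is inferred from the numbers in the table, not from a documented Saturn definition. A minimal sketch under that assumption, with hypothetical sample strings:

```python
def duplicate_stats(values):
    """Count rows that repeat an earlier non-null value."""
    present = [v for v in values if v is not None]
    n_unique = len(set(present))
    n_duplicates = len(present) - n_unique
    rate = n_duplicates / len(present) if present else 0.0
    return {
        "n": len(present),
        "unique": n_unique,
        "n_duplicates": n_duplicates,
        "duplicate_rate": rate,
    }


# Hypothetical sample: one of three non-null strings is a repeat.
sample = duplicate_stats(["Cave entrance", "Cave entrance", "Hot spring", None])

# Cross-check against the description stats: n minus unique.
reported_n_duplicates = 354_770 - 218_717
```

The same arithmetic holds for the name column (354,770 − 189,861 = 164,909).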

Fig 12.

Character-length distribution for description.

Character-length distribution for description (mean: 114.04985201679962).
chars	count
1 – 13	71484
13 – 26	53646
26 – 38	49789
38 – 51	14135
51 – 63	30283
63 – 76	28147
76 – 88	8588
88 – 101	6163
101 – 113	5647
113 – 126	4082
126 – 138	9912
138 – 151	3581
151 – 163	678
163 – 176	439
176 – 188	426
188 – 201	525
201 – 213	479
213 – 226	430
226 – 238	412
238 – 250	576
250 – 263	1185
263 – 275	1361
275 – 288	1320
288 – 300	5787
300 – 313	878
313 – 325	1496
325 – 338	1410
338 – 350	1914
350 – 363	1940
363 – 375	1927
375 – 388	1960
388 – 400	2295
400 – 413	2190
413 – 425	2487
425 – 438	2493
438 – 450	3014
450 – 463	2997
463 – 475	3598
475 – 488	5468
488 – 500	19628

category categorical label

This column is a data-source label drawn from 14 distinct categories across 354,770 rows with zero nulls. The categories span scientific datasets (NOAA tornadoes, NASA meteorites, OSM features) and paranormal or anomalous phenomena (UFO sightings, Bigfoot, haunted places), suggesting a multi-source 'strange phenomena' aggregation. The distribution is moderately uneven: the top value 'noaa_tornadoes' holds 20.2% of rows (71,813), while the smallest, 'usgs_volcanoes', has only 170. The entropy of 2.985 bits is 78% of the maximum possible for 14 classes (log2 14 ≈ 3.807), which indicates a reasonable spread. No nulls and low cardinality make this an immediately usable stratification variable.

Treatment: Use as a stratification or grouping key; one-hot encode or target-encode for modelling.

anthropic:default · confidence high

Out[26]:

saturn.columns["category"].stats

stat	value
n	354,770
nulls	0 (0.0%)
unique	14
top_value	noaa_tornadoes
top_rate	0.2024
cardinality	14
entropy	2.985
entropy_ratio	0.7841
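
The entropy and entropy_ratio values can be reproduced from the category counts if entropy is measured in bits and the ratio divides by log2 of the number of classes. That definition is inferred from the reported numbers rather than from Saturn documentation; the function names are illustrative.

```python
import math

# Counts copied from the category top-values table.
category_counts = [71813, 70242, 60632, 60028, 32186, 18154, 14770,
                   9717, 5003, 3797, 3742, 3653, 863, 170]


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of the distribution implied by the counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty classes."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)


entropy = shannon_entropy_bits(category_counts)
ratio = entropy_ratio(category_counts)
```

A ratio of 1.0 would mean all 14 categories are equally common; values near 0 mean one category holds almost every row.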

Fig 13.

Top values for category.

Top values for category (14 unique shown, of 14 total).
value	count	share
noaa_tornadoes	71813	20.2%
osm_caves	70242	19.8%
ufo_sightings	60632	17.1%
megalithic_portal	60028	16.9%
nasa_meteorites	32186	9.1%
osm_ghost_towns	18154	5.1%
noaa_storm_events	14770	4.2%
haunted_places	9717	2.7%
noaa_thermal_springs	5003	1.4%
bigfoot_sightings	3797	1.1%
usgs_earthquakes	3742	1.1%
noaa_shipwrecks	3653	1.0%
nasa_fireballs	863	0.2%
usgs_volcanoes	170	0.0%

date text timestamp

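This column contains date strings stored as text rather than a proper date type. Most are ISO-format (YYYY-MM-DD): 183,475 of the non-null values are exactly 10 characters long, but 4,425 fall in the 19–20 character bin (probably full timestamps) and roughly 160 are about 4 characters long (possibly bare years), so the format is not uniform. The top values all fall on January 1st of a given year, which suggests that at least some sources record year-level precision only; with 23,500 unique values, this cannot hold for every row. Two data quality issues stand out. First, the null rate is 41.88% (148,570 rows), and a further 17,854 rows hold empty strings that are not counted as null, so about 46.9% of rows have no usable date. Second, the duplicate rate is 88.6% among the 206,200 non-null rows, which have only 23,500 unique values. The 'allcaps' alert is a false positive: date strings contain no lowercase letters, so they trigger it.

Treatment: Normalize the mixed formats and cast to a date type, convert empty strings to null, and flag the resulting ~46.9% missing values. Extract year as an integer feature where values are Jan-1 anchored.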

anthropic:default · confidence high
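
Casting this column needs to tolerate several string layouts. The sketch below tries a short list of candidate formats; the list itself is an assumption based on the observed string lengths (4, 10, and 19 characters), since the report does not show the actual non-ISO values.

```python
from datetime import date, datetime

# Candidate layouts are guesses based on string lengths seen in the histogram.
_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y")


def parse_date(text):
    """Return a date for recognised layouts; None for null, empty, or unknown."""
    if text is None:
        return None
    text = text.strip()
    if not text:
        return None  # empty strings are treated as missing
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Values that return None can be counted separately from true nulls to see how many strings use an unrecognised layout.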

Out[29]:

saturn.columns["date"].stats

stat	value
n	354,770
nulls	148,570 (41.9%)
unique	23,500
len_min	0
len_max	30
len_mean	9.331
len_median	10
len_p95	10
word_mean	1.005
word_median	1
n_empty	17,854
n_duplicates	182,700
duplicate_rate	0.886
vocab_size	8,565
readability_flesch_mean	112.1
emoji_rate	0
url_rate	0
one_word_rate	0.9954
allcaps_rate	0.913
boilerplate_rate	0
alert: one_word	99.5% rows are a single word
alert: allcaps	91.3% rows are all-caps
alert: null_rate	41.9% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	88.6% duplicate strings

Fig 14.

Character-length distribution for date.

Character-length distribution for date (mean: 9.330581959262851).
chars	count
0 – 1	17854
1 – 2	0
2 – 2	1
2 – 3	0
3 – 4	0
4 – 4	151
4 – 5	13
5 – 6	0
6 – 7	1
7 – 8	21
8 – 8	5
8 – 9	0
9 – 10	3
10 – 10	183475
10 – 11	17
11 – 12	0
12 – 13	6
13 – 14	8
14 – 14	10
14 – 15	0
15 – 16	8
16 – 16	7
16 – 17	12
17 – 18	0
18 – 19	180
19 – 20	4425
20 – 20	1
20 – 21	0
21 – 22	1
22 – 22	0
22 – 23	0
23 – 24	0
24 – 25	0
25 – 26	0
26 – 26	0
26 – 27	0
27 – 28	0
28 – 28	0
28 – 29	0
29 – 30	1

country categorical feature

This column records the country associated with each record, using mostly ISO 2-letter codes alongside at least one 3-letter variant ('USA'). The most pressing issue is a 55.29% null rate: over half of the 354,770 rows carry no country value. In addition, 'USA' and 'US' denote the same country but are stored as two distinct values (86,583 and 60,634 rows respectively). 'USA' alone accounts for 54.6% of the 158,616 non-null records, and the two together for 92.8%, so the column is almost entirely US-dominated and the split inflates apparent cardinality. There are also 9,497 empty-string records that escaped null detection. The 28 unique values have low entropy (1.34).

Treatment: Unify 'USA'/'US' and other aliases into ISO-3166 codes, convert empty strings to null, then impute or flag remaining nulls before using as a categorical feature.

anthropic:default · confidence high
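
The alias unification and empty-string handling described in the treatment can be sketched as follows. The alias table contains only the one variant visible in this report ('USA'); any others would need to be added after auditing the full value list.

```python
# Only alias visible in the report; extend after auditing all 28 values.
_ALIASES = {"USA": "US"}


def normalize_country(value):
    """Map aliases to 2-letter codes and treat empty strings as missing."""
    if value is None:
        return None
    code = value.strip().upper()
    if not code:
        return None
    return _ALIASES.get(code, code)


# Arithmetic behind the combined US share, from the stats table.
us_total = 86_583 + 60_634
non_null = 354_770 - 196_154
us_share = us_total / non_null
```

Excluding the 9,497 empty strings from the denominator as well raises the combined US share to about 98.7%.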

Out[32]:

saturn.columns["country"].stats

stat	value
n	354,770
nulls	196,154 (55.3%)
unique	28
top_value	USA
top_rate	0.5459
cardinality	28
entropy	1.341
entropy_ratio	0.279
alert: null_rate	55.3% null

Fig 15.

Top values for country.

Top values for country (20 unique shown, of 28 total).
value	count	share
USA	86583	24.4%
US	60634	17.1%
	9497	2.7%
RU	1481	0.4%
BY	205	0.1%
KZ	156	0.0%
HT	13	0.0%
KY	9	0.0%
AU	6	0.0%
DE	5	0.0%
GB	5	0.0%
IQ	3	0.0%
RO	2	0.0%
EC	2	0.0%
IT	2	0.0%
TW	1	0.0%
MX	1	0.0%
CW	1	0.0%
BS	1	0.0%
MT	1	0.0%

city text feature

This column appears to contain US city names, judging by the top values (Seattle, Phoenix, Las Vegas, Portland, Los Angeles) and top words ('beach', 'san', 'lake', 'springs'). The null rate is 82.91%: only about 1 in 6 rows has a city value. The 60,632 populated rows equal the ufo_sightings row count, so the column is probably specific to that source. Among non-null values the duplicate rate is 84.91%, meaning populated rows cluster around a relatively small set of repeated cities (9,149 unique values built from 4,862 vocabulary tokens). The word 'city' appears 531 times in top_words, most likely from legitimate names such as 'Kansas City' or 'Oklahoma City' rather than placeholder text.

Treatment: Impute or flag nulls (82.91% missing) before use; consider grouping rare cities or encoding as region/state for modelling.

anthropic:default · confidence high

Out[35]:

saturn.columns["city"].stats

stat	value
n	354,770
nulls	294,138 (82.9%)
unique	9,149
len_min	3
len_max	23
len_mean	8.829
len_median	9
len_p95	14
word_mean	1.288
word_median	1
n_empty	0
n_duplicates	51,483
duplicate_rate	0.8491
vocab_size	4,862
readability_flesch_mean	21.74
emoji_rate	0
url_rate	0
one_word_rate	0.7294
allcaps_rate	0
boilerplate_rate	0
alert: one_word	72.9% rows are a single word
alert: null_rate	82.9% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	84.9% duplicate strings

Fig 16.

Character-length distribution for city.

Character-length distribution for city (mean: 8.828638342789286).
chars	count
3 – 4	74
4 – 4	0
4 – 4	1497
4 – 5	0
5 – 6	3313
6 – 6	0
6 – 6	7780
6 – 7	0
7 – 8	9464
8 – 8	0
8 – 8	7830
8 – 9	0
9 – 10	8339
10 – 10	0
10 – 10	7192
10 – 11	0
11 – 12	5509
12 – 12	0
12 – 12	3694
12 – 13	0
13 – 14	2353
14 – 14	0
14 – 14	1537
14 – 15	0
15 – 16	800
16 – 16	0
16 – 16	787
16 – 17	0
17 – 18	203
18 – 18	0
18 – 18	155
18 – 19	0
19 – 20	44
20 – 20	0
20 – 20	25
20 – 21	0
21 – 22	13
22 – 22	0
22 – 22	21
22 – 23	2

state categorical feature

This column contains US state abbreviations, possibly mixed with territories or non-standard codes: it has 118 unique values against the 50–60 expected for US states and territories. The null rate is 58.5%, so over half of the 354,770 rows have no state recorded; much of this is likely structural, since several source datasets are not US-specific. The top value is 'TX' at 8.6% of non-null rows, followed by CA and FL. A cardinality of 118, more than double the 50 US states, suggests territories, foreign region codes, or dirty values worth auditing.

Treatment: Audit the 118 unique values to identify non-US-state codes, impute or flag nulls (58.5% missing), then encode as categorical for modelling.

anthropic:default · confidence high
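
The audit suggested in the treatment can start by listing every value that is not a standard two-letter code for the 50 states or the District of Columbia. A minimal sketch; the sample input is hypothetical, and territory codes would need to be added to the set if they are considered valid.

```python
# USPS two-letter codes for the 50 states plus DC.
US_STATES_AND_DC = frozenset("""
    AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
    MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
    SD TN TX UT VT VA WA WV WI WY DC
""".split())


def non_state_values(values):
    """Return the sorted distinct non-null values that are not state codes."""
    return sorted({
        v for v in values
        if v is not None and v.strip().upper() not in US_STATES_AND_DC
    })


# Hypothetical input, not taken from the dataset.
unexpected = non_state_values(["TX", "CA", "ZZ", None, "tx"])
```

Running this over the 118 distinct values would separate genuine state codes from the remainder that needs manual review.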

Out[38]:

saturn.columns["state"].stats

stat	value
n	354,770
nulls	207,555 (58.5%)
unique	118
top_value	TX
top_rate	0.08645
cardinality	118
entropy	5.668
entropy_ratio	0.8236
alert: null_rate	58.5% null

Fig 17.

Top values for state.

Top values for state (20 unique shown, of 118 total).
value	count	share
TX	12727	3.6%
CA	8791	2.5%
FL	7372	2.1%
IL	5329	1.5%
KS	5127	1.4%
OK	5056	1.4%
MO	3905	1.1%
CO	3766	1.1%
WA	3648	1.0%
IA	3647	1.0%
OH	3521	1.0%
NE	3506	1.0%
AL	3196	0.9%
PA	3193	0.9%
NC	3186	0.9%
GA	3112	0.9%
MN	3110	0.9%
MS	3086	0.9%
NY	2951	0.8%
LA	2927	0.8%

shape categorical feature

This column captures the reported shape of UFO/unidentified aerial phenomena sightings, with 28 distinct categories such as 'light', 'triangle', 'circle', and 'fireball'. The null rate is 82.91% across 354,770 rows, leaving 60,632 records with a shape value, which matches the ufo_sightings row count. Among non-null records, 'light' leads at 21.27%, and the catch-all categories 'unknown' (4,359) and 'other' (4,209) further dilute the informativeness of the non-missing data.

Treatment: Impute or flag nulls as a separate 'not_reported' category before encoding; consider consolidating 'unknown' and 'other' with nulls given ambiguity.

anthropic:default · confidence high

Out[41]:

saturn.columns["shape"].stats

stat	value
n	354,770
nulls	294,138 (82.9%)
unique	28
top_value	light
top_rate	0.2127
cardinality	28
entropy	3.774
entropy_ratio	0.785
alert: null_rate	82.9% null

Fig 18.

Top values for shape.

Top values for shape (20 unique shown, of 28 total).
value	count	share
light	12895	3.6%
triangle	6268	1.8%
circle	5890	1.7%
fireball	4939	1.4%
unknown	4359	1.2%
other	4209	1.2%
sphere	4134	1.2%
disk	3853	1.1%
oval	2881	0.8%
formation	1908	0.5%
cigar	1569	0.4%
changing	1517	0.4%
flash	1025	0.3%
rectangle	1010	0.3%
cylinder	977	0.3%
diamond	884	0.2%
chevron	774	0.2%
teardrop	560	0.2%
egg	555	0.2%
cone	235	0.1%

duration_seconds numeric feature

This column records sighting durations in seconds (the 60,632 non-null rows match the ufo_sightings count), with values ranging from 0.01 s to 66,276,000 s (~767 days). Some 82.91% of rows are null, so duration is only captured for roughly 1 in 6 records. Among non-null values the distribution is extremely right-skewed (skew = 135.86, kurtosis = 19,379.84): the median is just 180 s while the mean inflates to 5,410 s. With q1 = 30 and q3 = 600, the upper 1.5 × IQR fence is 1,455 s (about 24 minutes), and 7,753 rows (12.79% of non-null) exceed it. The maximum of 66,276,000 s is very likely erroneous or a data-entry artifact.

Treatment: Investigate and cap or remove extreme outliers (especially values near 66,276,000 s), flag nulls explicitly, then log-transform before modelling.

anthropic:default · confidence high

Out[44]:

saturn.columns["duration_seconds"].stats

stat	value
n	354,770
nulls	294,138 (82.9%)
unique	444
min	0.01
max	6.628e+07
mean	5410
median	180
std	4.144e+05
q1	30
q3	600
iqr	570
skew	135.9
kurtosis	1.938e+04
n_outliers	7,753
outlier_rate	0.1279
zero_rate	0
alert: null_rate	82.9% null
alert: high_skew	skew=+135.86
alert: outliers	12.8% rows beyond 1.5 IQR
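
To illustrate why a log transform helps with this kind of skew, the sketch below compares the moment-based skewness of a tiny sample before and after `log1p`. The sample is just the five summary values from the table (min, q1, median, q3, max), used as stand-ins, not real rows, and the population-moment formula is an assumption about how skew is defined here.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance to the 3/2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


# Stand-in values taken from the summary stats: min, q1, median, q3, max.
sample = [0.01, 30.0, 180.0, 600.0, 66_276_000.0]
raw_skew = skewness(sample)
log_skew = skewness([math.log1p(v) for v in sample])

# The maximum expressed in days.
max_days = 66_276_000 / 86_400
```

The transform compresses the upper tail, so the skew after `log1p` is lower than on the raw values, though a single extreme value still dominates a sample this small.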

Fig 19.

Distribution of duration_seconds. Vertical dash marks the median.

Histogram bins for duration_seconds (median: 180.0).
bin	count
0.01 – 1.657e+06	60612
1.657e+06 – 3.314e+06	11
3.314e+06 – 4.971e+06	0
4.971e+06 – 6.628e+06	3
6.628e+06 – 8.285e+06	1
8.285e+06 – 9.941e+06	0
9.941e+06 – 1.16e+07	2
1.16e+07 – 1.326e+07	0
1.326e+07 – 1.491e+07	0
1.491e+07 – 1.657e+07	0
1.657e+07 – 1.823e+07	0
1.823e+07 – 1.988e+07	0
1.988e+07 – 2.154e+07	0
2.154e+07 – 2.32e+07	0
2.32e+07 – 2.485e+07	0
2.485e+07 – 2.651e+07	0
2.651e+07 – 2.817e+07	0
2.817e+07 – 2.982e+07	0
2.982e+07 – 3.148e+07	0
3.148e+07 – 3.314e+07	0
3.314e+07 – 3.479e+07	0
3.479e+07 – 3.645e+07	0
3.645e+07 – 3.811e+07	0
3.811e+07 – 3.977e+07	0
3.977e+07 – 4.142e+07	0
4.142e+07 – 4.308e+07	0
4.308e+07 – 4.474e+07	0
4.474e+07 – 4.639e+07	0
4.639e+07 – 4.805e+07	0
4.805e+07 – 4.971e+07	0
4.971e+07 – 5.136e+07	0
5.136e+07 – 5.302e+07	2
5.302e+07 – 5.468e+07	0
5.468e+07 – 5.633e+07	0
5.633e+07 – 5.799e+07	0
5.799e+07 – 5.965e+07	0
5.965e+07 – 6.131e+07	0
6.131e+07 – 6.296e+07	0
6.296e+07 – 6.462e+07	0
6.462e+07 – 6.628e+07	1

mass_g unknown feature

The column 'mass_g' likely represents mass measurements in grams. The profiler skipped it (no profiler for kind=unknown), so no distributional statistics are available: skew, range, outliers, and uniqueness cannot be assessed. The reported 0.0% null rate should be treated with caution rather than read as complete coverage, because the other meteorite-specific columns (meteorite_class, fall_type) are 90.9% null.

Treatment: Re-profile to obtain distribution stats; then check for skew and consider log-transform before modelling.

anthropic:default · confidence low

Out[47]:

saturn.columns["mass_g"].stats

stat	value
n	354,770
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

meteorite_class categorical label

This column contains meteorite classification codes (e.g., 'L6', 'H5', 'CM2'), the standard group and petrologic-type designations for chondrites and other meteorite classes. The null rate is 90.93%, so only 32,186 of 354,770 rows carry a classification, matching the nasa_meteorites row count. Among classified records the distribution is moderately concentrated: 'L6' alone accounts for 20.3% of non-null values, across 395 distinct classes with an entropy ratio of ~0.51.

Treatment: Impute nulls with an explicit 'Unknown' category or exclude from supervised models; encode via target or frequency encoding (the classes have no natural order) given 395 classes and severe class imbalance.

anthropic:default · confidence high
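
The prose quotes shares of non-null values, while the "share" column in the top-values table is computed over all rows. A minimal sketch of the conversion, using only figures from the stats and top-values tables for this column (the variable names are illustrative):

```python
N_ROWS = 354_770   # n from the stats table
N_NULLS = 322_584  # nulls from the stats table
L6_COUNT = 6_544   # count for 'L6' in the top-values table

non_null = N_ROWS - N_NULLS
share_of_all_rows = L6_COUNT / N_ROWS
share_of_non_null = L6_COUNT / non_null

print(f"non-null rows:     {non_null:,}")
print(f"share of all rows: {share_of_all_rows:.1%}")
print(f"share of non-null: {share_of_non_null:.1%}")
```

The two shares correspond to the 1.8% in the table and the top_rate of 0.2033 in the stats, which is why the same value looks rare in one place and dominant in the other.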

Out[49]:

saturn.columns["meteorite_class"].stats

stat	value
n	354,770
nulls	322,584 (90.9%)
unique	395
top_value	L6
top_rate	0.2033
cardinality	395
entropy	4.37
entropy_ratio	0.5067
alert: null_rate	90.9% null

Fig 20.

Top values for meteorite_class.

Top values for meteorite_class (20 unique shown, of 395 total).
value	count	share
L6	6544	1.8%
H5	5614	1.6%
H4	3336	0.9%
H6	3234	0.9%
L5	2750	0.8%
LL5	1899	0.5%
LL6	963	0.3%
L4	831	0.2%
H4/5	380	0.1%
CM2	281	0.1%
Iron, IIIAB	272	0.1%
H3	244	0.1%
LL	220	0.1%
E3	205	0.1%
L3	176	0.0%
LL4	160	0.0%
H5/6	156	0.0%
Ureilite	155	0.0%
Howardite	127	0.0%
Diogenite	125	0.0%

fall_type categorical label

This column classifies meteorite recovery type, distinguishing specimens that were 'Found' (discovered without an observed fall) from those that 'Fell' (witnessed falling). The 90.93% null rate means only 32,186 of 354,770 records have a value at all. Among those, the distribution is heavily skewed: 'Found' accounts for 96.6% (31,090) versus 'Fell' at just 3.4% (1,096), which matches real-world meteorite catalogues but triggers the profiler's class-imbalance alert.

Treatment: Impute nulls as a third category ('Unknown') or exclude from classification tasks; apply class-weighting or oversampling to address the 97:3 Found-to-Fell imbalance before modelling.

anthropic:default · confidence high
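
The class-weighting option mentioned in the treatment can be sketched with the common "balanced" heuristic, where each class is weighted by total / (n_classes × class_count). This is one possible weighting scheme chosen for illustration, not something the report prescribes; the counts come from the top-values table.

```python
counts = {"Found": 31_090, "Fell": 1_096}  # from the top-values table

def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

weights = balanced_class_weights(counts)
for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

With these weights each class contributes the same total weight to a loss function, so the rare 'Fell' class is not drowned out by 'Found'.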

Out[52]:

saturn.columns["fall_type"].stats

stat	value
n	354,770
nulls	322,584 (90.9%)
unique	2
top_value	Found
top_rate	0.9659
cardinality	2
entropy	0.2143
entropy_ratio	0.2143
alert: null_rate	90.9% null
alert: imbalance	top value is 96.6% of rows

Fig 21.

Top values for fall_type.

Top values for fall_type (2 unique shown, of 2 total).
value	count	share
Found	31090	8.8%
Fell	1096	0.3%

magnitude categorical feature

This column holds magnitude values stored as a categorical type despite being fundamentally numeric; values include integers (0, 1, 2, 3, 4) and decimals (4.5, 4.6, 4.7, 1.75). In this mashup the column appears to mix scales from more than one source dataset: small integers are consistent with tornado intensity ratings and decimals around 4.5 to 5.3 with earthquake magnitudes, though this should be confirmed against the 'category' column. The null rate of 76.7% triggered an alert; over three-quarters of the 354,770 rows carry no value. The value '-9' appears 1,278 times and is almost certainly a sentinel/missing-value code rather than a true measurement. The top value '0' accounts for 44.4% of non-null observations, and an entropy_ratio of 0.31 confirms a heavily concentrated distribution despite 294 unique string representations, some of which duplicate the same number (e.g. '2' and '2.00').

Treatment: Cast to float after replacing '-9' with NaN, check whether the 76.7% nulls simply mark rows from datasets that have no magnitude, then bin or transform per source category rather than across the whole column.

anthropic:default · confidence high
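
A minimal sketch of the cast described in the treatment, assuming that -9 is the only sentinel (the report flags it as "almost certainly" one, but that is not confirmed). `parse_magnitude` is an illustrative name, and the sample list reuses strings from the top-values table plus a `None` for a null.

```python
import math

SENTINEL_VALUES = {-9.0}  # assumption: -9 marks a missing value

def parse_magnitude(raw):
    """Return a float, or NaN for nulls, blanks, sentinels, and unparseable text."""
    if raw is None:
        return math.nan
    text = str(raw).strip()
    if text == "":
        return math.nan
    try:
        value = float(text)
    except ValueError:
        return math.nan
    return math.nan if value in SENTINEL_VALUES else value

raw_values = ["0", "1", "-9", "4.5", "2.00", None, "70.00"]
parsed = [parse_magnitude(v) for v in raw_values]
print(parsed)
```

Comparing the parsed number rather than the raw string means '-9' and '-9.0' are both treated as missing, and '2' and '2.00' collapse to the same value.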

Out[55]:

saturn.columns["magnitude"].stats

stat	value
n	354,770
nulls	272,093 (76.7%)
unique	294
top_value	0
top_rate	0.4436
cardinality	294
entropy	2.514
entropy_ratio	0.3065
alert: null_rate	76.7% null

Fig 22.

Top values for magnitude.

Top values for magnitude (20 unique shown, of 294 total).
value	count	share
0	36675	10.3%
1	24542	6.9%
2	9904	2.8%
3	2630	0.7%
-9	1278	0.4%
4.5	686	0.2%
4	591	0.2%
4.6	558	0.2%
4.7	415	0.1%
1.75	383	0.1%
4.8	317	0.1%
5	297	0.1%
4.9	261	0.1%
2.75	220	0.1%
5.1	202	0.1%
5.2	167	0.0%
70.00	162	0.0%
50.00	151	0.0%
2.00	150	0.0%
5.3	126	0.0%

depth_km numeric

Out[58]:

saturn.columns["depth_km"].stats

stat	value
n	354,770
nulls	351,028 (98.9%)
unique	1,505
min	-2.261
max	248.7
mean	23.71
median	10
std	28.79
q1	10
q3	29.1
iqr	19.1
skew	3.072
kurtosis	11.61
n_outliers	314
outlier_rate	0.08391
zero_rate	0.002672
alert: null_rate	98.9% null
alert: high_skew	skew=+3.07
alert: outliers	8.4% rows beyond 1.5 IQR

Fig 23.

Distribution of depth_km. Vertical dash marks the median.

Histogram bins for depth_km (median: 10.0).
bin	count
-2.261 – 4.013	219
4.013 – 10.29	1730
10.29 – 16.56	370
16.56 – 22.84	258
22.84 – 29.11	230
29.11 – 35.38	250
35.38 – 41.66	167
41.66 – 47.93	129
47.93 – 54.21	56
54.21 – 60.48	31
60.48 – 66.75	43
66.75 – 73.03	27
73.03 – 79.3	19
79.3 – 85.58	29
85.58 – 91.85	21
91.85 – 98.12	19
98.12 – 104.4	24
104.4 – 110.7	12
110.7 – 116.9	9
116.9 – 123.2	14
123.2 – 129.5	13
129.5 – 135.8	19
135.8 – 142	14
142 – 148.3	5
148.3 – 154.6	6
154.6 – 160.9	0
160.9 – 167.1	7
167.1 – 173.4	4
173.4 – 179.7	0
179.7 – 186	1
186 – 192.2	5
192.2 – 198.5	1
198.5 – 204.8	3
204.8 – 211.1	2
211.1 – 217.3	3
217.3 – 223.6	1
223.6 – 229.9	0
229.9 – 236.2	0
236.2 – 242.4	0
242.4 – 248.7	1

place text

Out[61]:

saturn.columns["place"].stats

stat	value
n	354,770
nulls	351,028 (98.9%)
unique	3,002
len_min	4
len_max	59
len_mean	29.47
len_median	29
len_p95	36
word_mean	6.293
word_median	6
n_empty	0
n_duplicates	740
duplicate_rate	0.1978
vocab_size	1,036
readability_flesch_mean	69.91
emoji_rate	0
url_rate	0
one_word_rate	0.0005345
allcaps_rate	0
boilerplate_rate	0
alert: null_rate	98.9% null

Fig 24.

Character-length distribution for place.

Character-length distribution for place (mean: 29.47).
chars	count
4 – 5	1
5 – 7	0
7 – 8	1
8 – 10	0
10 – 11	0
11 – 12	0
12 – 14	2
14 – 15	11
15 – 16	40
16 – 18	1
18 – 19	20
19 – 20	19
20 – 22	8
22 – 23	219
23 – 25	60
25 – 26	122
26 – 27	543
27 – 29	499
29 – 30	823
30 – 32	325
32 – 33	362
33 – 34	378
34 – 36	105
36 – 37	40
37 – 38	37
38 – 40	25
40 – 41	34
41 – 42	3
42 – 44	0
44 – 45	15
45 – 47	22
47 – 48	14
48 – 49	3
49 – 51	1
51 – 52	2
52 – 54	5
54 – 55	0
55 – 56	0
56 – 58	0
58 – 59	2

earthquake_type categorical

Out[64]:

saturn.columns["earthquake_type"].stats

stat	value
n	354,770
nulls	351,028 (98.9%)
unique	3
top_value	earthquake
top_rate	0.9992
cardinality	3
entropy	0.01014
entropy_ratio	0.006396
alert: null_rate	98.9% null
alert: imbalance	top value is 99.9% of rows

Fig 25.

Top values for earthquake_type.

Top values for earthquake_type (3 unique shown, of 3 total).
value	count	share
earthquake	3739	1.1%
explosion	2	0.0%
landslide	1	0.0%

volcano_type categorical

Out[67]:

saturn.columns["volcano_type"].stats

stat	value
n	354,770
nulls	354,600 (100.0%)
unique	1
top_value	Unknown
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	100.0% null
alert: imbalance	top value is 100.0% of rows

Fig 26.

Top values for volcano_type.

Top values for volcano_type (1 unique shown, of 1 total).
value	count	share
Unknown	170	0.0%

elevation_m unknown feature

This column records elevation in metres for 354,770 rows with no nulls. The profiler emitted a 'skipped' alert and returned no computed statistics, so distribution shape, range, skew, and uniqueness are entirely unknown from this evidence. The name strongly implies a continuous numeric geographic feature, but no further characterisation can be made without re-running profiling.

Treatment: Re-profile to obtain range, skew, and outlier metrics; then consider log-transform or clipping if heavily right-skewed before use in modelling.

anthropic:default · confidence low

Out[70]:

saturn.columns["elevation_m"].stats

stat	value
n	354,770
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

status categorical

Out[72]:

saturn.columns["status"].stats

stat	value
n	354,770
nulls	354,600 (100.0%)
unique	1
top_value	Unknown
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	100.0% null
alert: imbalance	top value is 100.0% of rows

Fig 27.

Top values for status.

Top values for status (1 unique shown, of 1 total).
value	count	share
Unknown	170	0.0%

last_eruption categorical

Out[75]:

saturn.columns["last_eruption"].stats

stat	value
n	354,770
nulls	354,600 (100.0%)
unique	1
top_value	Unknown
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	100.0% null
alert: imbalance	top value is 100.0% of rows

Fig 28.

Top values for last_eruption.

Top values for last_eruption (1 unique shown, of 1 total).
value	count	share
Unknown	170	0.0%

injuries categorical feature

This column records injury counts per incident, stored as a categorical type despite being numeric in nature; the top values are integers ('0', '1', '2', …) and there are 233 distinct values. The null rate is high at 75.59%, meaning only 86,583 of 354,770 rows have a recorded value, which is flagged as an alert. Among non-null rows, 85.4% report zero injuries, producing a heavily right-skewed distribution with low entropy (1.23, entropy ratio 0.157). The 233 distinct values may simply reflect a long tail of large counts, but some entries could also encode ranges, text annotations, or data-entry anomalies; the top-20 table cannot distinguish these cases.

Treatment: Cast to numeric, investigate nulls (MCAR vs. structural zero), treat missing as unknown rather than zero, then consider zero-inflated or count-based model treatment.

anthropic:default · confidence medium
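
Before casting, it may help to separate values that are clean non-negative integers from anything that needs manual review. The sketch below is a hypothetical pre-check; `split_counts` is an illustrative name, and the non-integer entries in the sample are invented to show the idea, since the report does not list any.

```python
def split_counts(raw_values):
    """Separate clean non-negative integers from entries needing review."""
    clean, suspect = [], []
    for raw in raw_values:
        if raw is None:
            clean.append(None)  # keep missing as unknown, not zero
            continue
        text = str(raw).strip()
        if text.isascii() and text.isdigit():
            clean.append(int(text))
        else:
            suspect.append(raw)
    return clean, suspect

# Illustrative values only; '10-15' and 'several' are invented examples.
sample = ["0", "3", None, "10-15", "several", "25"]
clean, suspect = split_counts(sample)
print(clean)
print(suspect)
```

Keeping `None` in the clean list preserves the distinction between "no injuries recorded" (0) and "not recorded at all", which matters for any zero-inflated model.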

Out[78]:

saturn.columns["injuries"].stats

stat	value
n	354,770
nulls	268,187 (75.6%)
unique	233
top_value	0
top_rate	0.854
cardinality	233
entropy	1.234
entropy_ratio	0.1569
alert: null_rate	75.6% null

Fig 29.

Top values for injuries.

Top values for injuries (20 unique shown, of 233 total).
value	count	share
0	73943	20.8%
1	3402	1.0%
2	1957	0.6%
3	1118	0.3%
4	727	0.2%
5	625	0.2%
6	500	0.1%
10	362	0.1%
7	332	0.1%
8	292	0.1%
12	280	0.1%
9	202	0.1%
20	185	0.1%
15	181	0.1%
11	170	0.0%
13	136	0.0%
14	124	0.0%
30	116	0.0%
25	100	0.0%
16	91	0.0%

fatalities categorical feature

This column represents a count of fatalities per incident, stored as a categorical type despite being inherently numeric. The null rate is severe at 75.59%, meaning only 86,583 of 354,770 rows have a value. Among non-null rows, 92.86% record zero fatalities, with a long tail reaching at least 21 within the 20 most frequent values; the low entropy ratio (0.088) confirms extreme concentration at '0'.

Treatment: Cast to integer, investigate and impute or exclude the 75.59% nulls, then treat as a heavily zero-inflated count variable (consider zero-inflated Poisson or log1p transform for regression).

anthropic:default · confidence high
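
The zero-inflation figure and the log1p transform mentioned in the treatment can be illustrated with the numbers from the stats and top-values tables. This is a sketch of the arithmetic only, not a modelling recommendation beyond what the treatment already says.

```python
import math

N_ROWS, N_NULLS = 354_770, 268_187  # from the stats table
ZERO_COUNT = 80_397                 # count for '0' in the top-values table

zero_share = ZERO_COUNT / (N_ROWS - N_NULLS)
print(f"share of non-null rows with zero fatalities: {zero_share:.2%}")

# log1p(x) = log(1 + x) keeps zero at zero and compresses the long tail.
for count in (0, 1, 10, 21):
    print(count, round(math.log1p(count), 3))
```

Because log1p maps 0 to 0, the transform leaves the dominant zero class untouched, so a model still has to handle the spike at zero explicitly.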

Out[81]:

saturn.columns["fatalities"].stats

stat	value
n	354,770
nulls	268,187 (75.6%)
unique	57
top_value	0
top_rate	0.9286
cardinality	57
entropy	0.5134
entropy_ratio	0.08802
alert: null_rate	75.6% null

Fig 30.

Top values for fatalities.

Top values for fatalities (20 unique shown, of 57 total).
value	count	share
0	80397	22.7%
1	4053	1.1%
2	932	0.3%
3	357	0.1%
4	190	0.1%
5	121	0.0%
6	112	0.0%
7	71	0.0%
9	40	0.0%
10	39	0.0%
8	34	0.0%
11	34	0.0%
16	22	0.0%
13	19	0.0%
12	15	0.0%
17	13	0.0%
14	11	0.0%
21	9	0.0%
18	8	0.0%
20	8	0.0%

length_miles text feature

This column stores numeric distance measurements (miles) encoded as text strings; all values are single tokens like '0.1', '0.5', '1.0' with a mean character length of 3.69 and a max of 8. Two signals demand attention: the null rate is extremely high at 79.76%, meaning roughly four in five rows carry no value, and the duplicate rate among non-null values is 94.72%, reflecting a coarse, rounded measurement scale (only 3,795 unique values across 71,813 non-null rows). The top value '0.1' alone appears 15,456 times, suggesting heavy concentration at short distances. The one_word and allcaps alerts are side effects of profiling numbers as text and say nothing about the measurements themselves.

Treatment: Cast to float, investigate and handle the 79.76% nulls (impute or flag), then use directly or log-transform given likely right skew.

anthropic:default · confidence high
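
The duplicate figures in the stats table can be reproduced from n, nulls, and unique, assuming the profiler counts a duplicate as any non-null row beyond the first occurrence of its string (an inference from the numbers, not a documented definition). The cast itself is a plain `float` conversion; `to_miles` is an illustrative name.

```python
N_ROWS, N_NULLS, N_UNIQUE = 354_770, 282_957, 3_795  # from the stats table

non_null = N_ROWS - N_NULLS
n_duplicates = non_null - N_UNIQUE
duplicate_rate = n_duplicates / non_null
print(n_duplicates, round(duplicate_rate, 4))

def to_miles(raw):
    """Cast a text value such as '0.1' to float; None stays None."""
    return None if raw is None else float(raw)

print([to_miles(v) for v in ["0.1", "0.5", "1.0", None]])
```

The computed values line up with n_duplicates (68,018) and duplicate_rate (0.9472) in the table, which supports reading the rate as relative to non-null rows only.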

Out[84]:

saturn.columns["length_miles"].stats

stat	value
n	354,770
nulls	282,957 (79.8%)
unique	3,795
len_min	3
len_max	8
len_mean	3.688
len_median	3
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	68,018
duplicate_rate	0.9472
vocab_size	2,268
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: null_rate	79.8% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	94.7% duplicate strings

Fig 31.

Character-length distribution for length_miles.

Character-length distribution for length_miles (mean: 3.688).
chars	count
3	47630
4	11687
5	795
6	10712
7	986
8	3

width_yards categorical feature

This column represents the width of some geographic or physical feature measured in yards, stored as a categorical type despite being numeric in nature. Nearly 80% of values are null (null_rate = 0.7976), making missingness the dominant signal. Among the 71,813 non-null records, the most common values are mostly round numbers (10, 50, 100, 30, 20, 200…), suggesting manual or estimated entries rather than precise measurements. The top value '10' accounts for 20.2% of non-null rows, and with 437 unique values and an entropy ratio of 0.51, the distribution is moderately concentrated.

Treatment: Cast to numeric, investigate whether nulls are structurally missing (feature absent) or simply unrecorded before imputing or dropping; log-transform or bin for modelling given round-number clustering.

anthropic:default · confidence medium
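
One quick check for structural missingness is to group columns by their exact null count: columns that come from the same source dataset tend to be null on exactly the same rows. The sketch below uses null counts copied from the stats tables in this report; matching counts are suggestive only, and confirming the pattern would need row-level data.

```python
from collections import defaultdict

# Null counts copied from the per-column stats tables in this report.
null_counts = {
    "meteorite_class": 322_584, "fall_type": 322_584,
    "depth_km": 351_028, "place": 351_028, "earthquake_type": 351_028,
    "injuries": 268_187, "fatalities": 268_187,
    "length_miles": 282_957, "width_yards": 282_957,
    "magnitude": 272_093,
}

groups = defaultdict(list)
for column, nulls in null_counts.items():
    groups[nulls].append(column)

for nulls, columns in sorted(groups.items()):
    print(f"{nulls:>7,}  {', '.join(columns)}")
```

Here width_yards shares its null count with length_miles, which is consistent with both being populated only for rows from one originating dataset rather than being sporadically unrecorded.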

Out[87]:

saturn.columns["width_yards"].stats

stat	value
n	354,770
nulls	282,957 (79.8%)
unique	437
top_value	10
top_rate	0.2018
cardinality	437
entropy	4.493
entropy_ratio	0.5122
alert: null_rate	79.8% null

Fig 32.

Top values for width_yards.

Top values for width_yards (20 unique shown, of 437 total).
value	count	share
10	14492	4.1%
50	10603	3.0%
100	7243	2.0%
30	4882	1.4%
20	4431	1.2%
200	3046	0.9%
25	2530	0.7%
150	2234	0.6%
75	2026	0.6%
40	2006	0.6%
300	1518	0.4%
33	1161	0.3%
17	1037	0.3%
400	977	0.3%
250	828	0.2%
23	812	0.2%
60	765	0.2%
440	682	0.2%
500	665	0.2%
80	605	0.2%

type categorical

Out[90]:

saturn.columns["type"].stats

stat	value
n	354,770
nulls	349,767 (98.6%)
unique	1
top_value	hot_spring
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	98.6% null
alert: imbalance	top value is 100.0% of rows

Fig 33.

Top values for type.

Top values for type (1 unique shown, of 1 total).
value	count	share
hot_spring	5003	1.4%

temperature categorical

Out[93]:

saturn.columns["temperature"].stats

stat	value
n	354,770
nulls	349,767 (98.6%)
unique	44
top_value	(blank)
top_rate	0.9742
cardinality	44
entropy	0.2566
entropy_ratio	0.04699
alert: long_tail	34 singleton categories
alert: null_rate	98.6% null
alert: imbalance	top value is 97.4% of rows

Fig 34.

Top values for temperature.

Top values for temperature (20 unique shown, of 44 total).
value	count	share
(blank)	4874	1.4%
hot	73	0.0%
90	4	0.0%
100	4	0.0%
21	3	0.0%
95	3	0.0%
37	2	0.0%
43	2	0.0%
28	2	0.0%
40	2	0.0%
37-39°	1	0.0%
35-37 °C	1	0.0%
58°C	1	0.0%
52,1	1	0.0%
25-30	1	0.0%
98°C	1	0.0%
40-43°	1	0.0%
77	1	0.0%
25°C	1	0.0%
52	1	0.0%

source categorical metadata

This column records the data provider or attribution source for each row, with only 4 distinct values drawn from named external datasets (OpenStreetMap contributors, The Megalithic Portal, NOAA Storm Events Database, OpenStreetMap). The most striking signal is a 51.56% null rate — meaning over half of all 354,770 rows carry no source attribution, which is a data quality concern for provenance tracking. The top value 'OpenStreetMap contributors' accounts for 51.44% of non-null rows (88,396 records), while the closely related 'OpenStreetMap' (8,656 records) suggests inconsistent attribution for the same upstream source.

Treatment: Consolidate 'OpenStreetMap contributors' and 'OpenStreetMap' into a single category, investigate and impute or flag the 51.56% nulls before using as a stratification or filter variable.

anthropic:default · confidence high
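
The consolidation step in the treatment can be sketched as a simple alias map applied to the value counts. The counts are taken from the top-values table; the assumption that both OpenStreetMap spellings mean the same provider follows the interpretation above and is not otherwise verified.

```python
from collections import Counter

# Counts from the top-values table for 'source'.
raw_counts = Counter({
    "OpenStreetMap contributors": 88_396,
    "The Megalithic Portal": 60_028,
    "NOAA Storm Events Database": 14_770,
    "OpenStreetMap": 8_656,
})

# Assumption: both spellings refer to the same upstream provider.
ALIASES = {"OpenStreetMap contributors": "OpenStreetMap"}

merged = Counter()
for label, count in raw_counts.items():
    merged[ALIASES.get(label, label)] += count

total = sum(merged.values())
for label, count in merged.most_common():
    print(f"{label}: {count:,} ({count / total:.1%} of non-null)")
```

After merging, three categories remain and the combined OpenStreetMap share of non-null rows rises above the 51.44% reported for the top value alone.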

Out[96]:

saturn.columns["source"].stats

stat	value
n	354,770
nulls	182,920 (51.6%)
unique	4
top_value	OpenStreetMap contributors
top_rate	0.5144
cardinality	4
entropy	1.545
entropy_ratio	0.7724
alert: null_rate	51.6% null

Fig 35.

Top values for source.

Top values for source (4 unique shown, of 4 total).
value	count	share
OpenStreetMap contributors	88396	24.9%
The Megalithic Portal	60028	16.9%
NOAA Storm Events Database	14770	4.2%
OpenStreetMap	8656	2.4%

vessel_type categorical

Out[99]:

saturn.columns["vessel_type"].stats

stat	value
n	354,770
nulls	351,117 (99.0%)
unique	23
top_value	(blank)
top_rate	0.9064
cardinality	23
entropy	0.5764
entropy_ratio	0.1274
alert: long_tail	14 singleton categories
alert: null_rate	99.0% null

Fig 36.

Top values for vessel_type.

Top values for vessel_type (20 unique shown, of 23 total).
value	count	share
(blank)	3311	0.9%
ship	275	0.1%
submarine	18	0.0%
aircraft	16	0.0%
plane	10	0.0%
boat	3	0.0%
schooner	2	0.0%
car	2	0.0%
sailboat	2	0.0%
steamer	1	0.0%
airplane	1	0.0%
freightcar	1	0.0%
train	1	0.0%
paddle steamer	1	0.0%
vehicle	1	0.0%
motorbike	1	0.0%
helicopter	1	0.0%
Steam hoist	1	0.0%
tractor	1	0.0%
Airplane	1	0.0%

cargo categorical

Out[102]:

saturn.columns["cargo"].stats

stat	value
n	354,770
nulls	351,117 (99.0%)
unique	17
top_value	(blank)
top_rate	0.9943
cardinality	17
entropy	0.07302
entropy_ratio	0.01786
alert: long_tail	13 singleton categories
alert: null_rate	99.0% null
alert: imbalance	top value is 99.4% of rows

Fig 37.

Top values for cargo.

Top values for cargo (17 unique shown, of 17 total).
value	count	share
(blank)	3632	1.0%
human	4	0.0%
timber	2	0.0%
coal	2	0.0%
fertilizer	1	0.0%
ore pellets	1	0.0%
Fischkutter (Stahl)	1	0.0%
seafood	1	0.0%
fish	1	0.0%
passengers	1	0.0%
mexican army supposed drugs, but the crew and cargo was not found	1	0.0%
iron ore	1	0.0%
pulp	1	0.0%
18 mines, 6 torpedos	1	0.0%
sugar	1	0.0%
containers;vehicles	1	0.0%
container;oil	1	0.0%

peak_brightness_altitude_km categorical

Out[105]:

saturn.columns["peak_brightness_altitude_km"].stats

stat	value
n	354,770
nulls	354,193 (99.8%)
unique	224
top_value	37.0
top_rate	0.06066
cardinality	224
entropy	7.187
entropy_ratio	0.9206
alert: null_rate	99.8% null

Fig 38.

Top values for peak_brightness_altitude_km.

Top values for peak_brightness_altitude_km (20 unique shown, of 224 total).
value	count	share
37.0	35	0.0%
31.5	15	0.0%
33.3	15	0.0%
38.0	11	0.0%
29.6	11	0.0%
35.2	11	0.0%
40.7	11	0.0%
32.0	10	0.0%
26.0	10	0.0%
32.4	9	0.0%
42.0	8	0.0%
26.5	8	0.0%
33.0	8	0.0%
25.0	7	0.0%
50.0	7	0.0%
36.0	7	0.0%
35.0	6	0.0%
39.0	6	0.0%
28.7	6	0.0%
37	6	0.0%

velocity_km_s categorical

Out[108]:

saturn.columns["velocity_km_s"].stats

stat	value
n	354,770
nulls	354,421 (99.9%)
unique	158
top_value	13.6
top_rate	0.01719
cardinality	158
entropy	7.052
entropy_ratio	0.9656
alert: null_rate	99.9% null

Fig 39.

Top values for velocity_km_s.

Top values for velocity_km_s (20 unique shown, of 158 total).
value	count	share
13.6	6	0.0%
15.2	6	0.0%
16.9	6	0.0%
17.8	5	0.0%
20.1	5	0.0%
17.4	5	0.0%
13.1	5	0.0%
16.2	5	0.0%
19.8	5	0.0%
16.5	5	0.0%
15.9	5	0.0%
14.1	5	0.0%
18.1	5	0.0%
14.9	5	0.0%
12.9	5	0.0%
12.2	5	0.0%
19.6	4	0.0%
17.0	4	0.0%
14.4	4	0.0%
18.3	4	0.0%

energy_joules categorical

Out[111]:

saturn.columns["energy_joules"].stats

stat	value
n	354,770
nulls	353,907 (99.8%)
unique	518
top_value	2.1
top_rate	0.01738
cardinality	518
entropy	8.634
entropy_ratio	0.9576
alert: long_tail	361 singleton categories
alert: null_rate	99.8% null

Fig 40.

Top values for energy_joules.

Top values for energy_joules (20 unique shown, of 518 total).
value	count	share
2.1	15	0.0%
2.0	13	0.0%
3.2	13	0.0%
3.0	10	0.0%
2.8	8	0.0%
2.3	8	0.0%
3.5	8	0.0%
2.7	8	0.0%
2.2	8	0.0%
4.1	8	0.0%
3.3	7	0.0%
2.5	7	0.0%
4.0	6	0.0%
2.9	6	0.0%
3.1	6	0.0%
5.8	6	0.0%
10.4	6	0.0%
11.8	6	0.0%
3.6	5	0.0%
4.4	5	0.0%

event_type categorical

Out[114]:

saturn.columns["event_type"].stats

stat	value
n	354,770
nulls	340,000 (95.8%)
unique	17
top_value	Tornado
top_rate	0.4288
cardinality	17
entropy	2.336
entropy_ratio	0.5715
alert: null_rate	95.8% null

Fig 41.

Top values for event_type.

Top values for event_type (17 unique shown, of 17 total).
value	count	share
Tornado	6334	1.8%
Flash Flood	2358	0.7%
Thunderstorm Wind	2257	0.6%
Flood	1777	0.5%
Hail	1246	0.4%
Lightning	574	0.2%
Heavy Rain	99	0.0%
Marine Strong Wind	43	0.0%
Debris Flow	43	0.0%
Marine Thunderstorm Wind	25	0.0%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

damage_property text

Out[117]:

saturn.columns["damage_property"].stats

stat	value
n	354,770
nulls	340,000 (95.8%)
unique	1,014
len_min	0
len_max	8
len_mean	4.381
len_median	5
len_p95	7
word_mean	1
word_median	1
n_empty	368
n_duplicates	13,756
duplicate_rate	0.9313
vocab_size	1,013
readability_flesch_mean	117
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.8724
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	87.2% rows are all-caps
alert: null_rate	95.8% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	93.1% duplicate strings

Fig 42.

Character-length distribution for damage_property.

Character-length distribution for damage_property (mean: 4.381).
chars	count
0	368
1	264
2	1252
3	1172
4	3414
5	6075
6	1450
7	514
8	261

cave_type categorical

Out[120]:

saturn.columns["cave_type"].stats

stat	value
n	354,770
nulls	354,729 (100.0%)
unique	5
top_value	pit
top_rate	0.878
cardinality	5
entropy	0.7693
entropy_ratio	0.3313
alert: long_tail	3 singleton categories
alert: null_rate	100.0% null

Fig 43.

Top values for cave_type.

Top values for cave_type (5 unique shown, of 5 total).
value	count	share
pit	36	0.0%
ponor	2	0.0%
showcave	1	0.0%
sinkhole	1	0.0%
overhang	1	0.0%

cave_length_m categorical

Out[123]:

saturn.columns["cave_length_m"].stats

stat	value
n	354,770
nulls	354,128 (99.8%)
unique	237
top_value	5
top_rate	0.04984
cardinality	237
entropy	6.919
entropy_ratio	0.877
alert: long_tail	158 singleton categories
alert: null_rate	99.8% null

Fig 44.

Top values for cave_length_m.

Top values for cave_length_m (20 unique shown, of 237 total).
value	count	share
5	32	0.0%
6	26	0.0%
10	25	0.0%
3	23	0.0%
4	23	0.0%
7	20	0.0%
8	19	0.0%
15	16	0.0%
20	14	0.0%
12	13	0.0%
30	13	0.0%
2	11	0.0%
11	9	0.0%
60	8	0.0%
4.5	8	0.0%
13	8	0.0%
16	8	0.0%
17	8	0.0%
25	8	0.0%
9	7	0.0%

cave_depth_m categorical

Out[126]:

saturn.columns["cave_depth_m"].stats

stat	value
n	354,770
nulls	354,472 (99.9%)
unique	124
top_value	0
top_rate	0.2114
cardinality	124
entropy	5.797
entropy_ratio	0.8336
alert: long_tail	88 singleton categories
alert: null_rate	99.9% null

Fig 45.

Top values for cave_depth_m.

Top values for cave_depth_m (20 unique shown, of 124 total).
value	count	share
0	63	0.0%
10	13	0.0%
3	11	0.0%
1	9	0.0%
5	9	0.0%
4	8	0.0%
25	7	0.0%
30	6	0.0%
6	6	0.0%
2	6	0.0%
11	5	0.0%
35	5	0.0%
28	4	0.0%
14	4	0.0%
40	4	0.0%
70	4	0.0%
12	3	0.0%
8	3	0.0%
15	3	0.0%
23	3	0.0%

access categorical

Out[129]:

saturn.columns["access"].stats

stat	value
n	354,770
nulls	347,515 (98.0%)
unique	20
top_value	yes
top_rate	0.3795
cardinality	20
entropy	2.234
entropy_ratio	0.517
alert: null_rate	98.0% null

Fig 46.

Top values for access.

Top values for access (20 unique shown, of 20 total).
value	count	share
yes	2753	0.8%
no	2279	0.6%
private	830	0.2%
permit	586	0.2%
permissive	448	0.1%
customers	273	0.1%
unknown	51	0.0%
destination	11	0.0%
restricted	9	0.0%
tidal	2	0.0%
request	2	0.0%
key	2	0.0%
discouraged	2	0.0%
designated	1	0.0%
official	1	0.0%
forestry	1	0.0%
agricultural	1	0.0%
guided	1	0.0%
university	1	0.0%
cancello_all'ingresso	1	0.0%

cave_ref text

Out[132]:

saturn.columns["cave_ref"].stats

stat	value
n	354,770
nulls	347,184 (97.9%)
unique	7,162
len_min	1
len_max	38
len_mean	6.341
len_median	7
len_p95	8
word_mean	1.068
word_median	1
n_empty	0
n_duplicates	424
duplicate_rate	0.05589
vocab_size	7,005
readability_flesch_mean	117.8
emoji_rate	0
url_rate	0
one_word_rate	0.9359
allcaps_rate	0.8559
boilerplate_rate	0
alert: one_word	93.6% rows are a single word
alert: allcaps	85.6% rows are all-caps
alert: null_rate	97.9% null
alert: short_text	95th-percentile length under 20 chars

Fig 47.

Character-length distribution for cave_ref.

Character-length distribution for cave_ref (mean: 6.341).
chars	count
1 – 2	36
2 – 3	210
3 – 4	356
4 – 5	954
5 – 6	412
6 – 7	1123
7 – 7	2763
7 – 8	1372
8 – 9	232
9 – 10	29
10 – 11	44
11 – 12	28
12 – 13	11
13 – 14	0
14 – 15	2
15 – 16	2
16 – 17	3
17 – 18	3
18 – 19	1
19 – 20	1
20 – 20	1
20 – 21	0
21 – 22	0
22 – 23	0
23 – 24	0
24 – 25	0
25 – 26	0
26 – 27	0
27 – 28	2
28 – 29	0
29 – 30	0
30 – 31	0
31 – 32	0
32 – 32	0
32 – 33	0
33 – 34	0
34 – 35	0
35 – 36	0
36 – 37	0
37 – 38	1

osm_id numeric foreign_key

This column contains OpenStreetMap (OSM) numeric identifiers referencing OSM elements (nodes, ways, or relations). The most striking issue is a 75.08% null rate across 354,770 rows, meaning only about one quarter of records (88,396) carry an OSM linkage. There are 88,395 unique values among those 88,396 non-null rows, so exactly one ID appears twice. The near-unique cardinality and platykurtic distribution (kurtosis ≈ -1.23) are consistent with IDs drawn broadly across OSM's ID space (min ~1.3M, max ~13.5B), with no outliers detected.

Treatment: Left-join to OSM data on the pair (osm_type, osm_id) rather than on this id alone, since OSM numbers nodes, ways, and relations independently; leave the 75.08% null rows unmatched instead of imputing identifiers, and check whether the missingness is systematic before joining.

anthropic:default · confidence high
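
OSM numbers nodes, ways, and relations in separate ID sequences, so the same number can refer to different elements. A minimal sketch of a left join keyed on the (osm_type, osm_id) pair follows; the rows, the lookup table, and the tags are hypothetical, and `left_join` is an illustrative helper rather than an existing API.

```python
# Hypothetical rows from this dataset and a hypothetical OSM lookup table.
rows = [
    {"name": "A", "osm_type": "node", "osm_id": 1334000},
    {"name": "B", "osm_type": "way", "osm_id": 1334000},  # same number, different element
    {"name": "C", "osm_type": None, "osm_id": None},      # no OSM linkage
]
osm_lookup = {
    ("node", 1334000): {"tags": {"natural": "cave_entrance"}},
    ("way", 1334000): {"tags": {"place": "hamlet"}},
}

def left_join(rows, lookup):
    """Attach OSM attributes by (osm_type, osm_id); unmatched rows get None."""
    joined = []
    for row in rows:
        key = (row["osm_type"], row["osm_id"])
        joined.append({**row, "osm": lookup.get(key)})
    return joined

for row in left_join(rows, osm_lookup):
    print(row["name"], row["osm"])
```

Rows without an OSM linkage pass through with `osm` set to `None`, so nothing is dropped or imputed. If the IDs are loaded as floats, casting them to `int` before building the key avoids mismatches against an integer-keyed lookup.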

Out[135]:

saturn.columns["osm_id"].stats

stat	value
n	354,770
nulls	266,374 (75.1%)
unique	88,395
min	1.334e+06
max	1.347e+10
mean	6.183e+09
median	6.047e+09
std	3.993e+09
q1	2.628e+09
q3	9.53e+09
iqr	6.903e+09
skew	0.1321
kurtosis	-1.228
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	75.1% null

Fig 48.

Distribution of osm_id. Vertical dash marks the median.

Histogram bins for osm_id (median: 6047018322.5).
bin	count
1.334e+06 – 3.38e+08	3968
3.38e+08 – 6.748e+08	2767
6.748e+08 – 1.011e+09	3230
1.011e+09 – 1.348e+09	3501
1.348e+09 – 1.685e+09	3041
1.685e+09 – 2.022e+09	2397
2.022e+09 – 2.358e+09	1663
2.358e+09 – 2.695e+09	1794
2.695e+09 – 3.032e+09	2343
3.032e+09 – 3.368e+09	2457
3.368e+09 – 3.705e+09	2751
3.705e+09 – 4.042e+09	2046
4.042e+09 – 4.379e+09	1826
4.379e+09 – 4.715e+09	2556
4.715e+09 – 5.052e+09	2957
5.052e+09 – 5.389e+09	2055
5.389e+09 – 5.725e+09	1481
5.725e+09 – 6.062e+09	1420
6.062e+09 – 6.399e+09	2234
6.399e+09 – 6.736e+09	1597
6.736e+09 – 7.072e+09	2035
7.072e+09 – 7.409e+09	2535
7.409e+09 – 7.746e+09	2516
7.746e+09 – 8.082e+09	1909
8.082e+09 – 8.419e+09	2018
8.419e+09 – 8.756e+09	1425
8.756e+09 – 9.092e+09	2194
9.092e+09 – 9.429e+09	1852
9.429e+09 – 9.766e+09	3471
9.766e+09 – 1.01e+10	2391
1.01e+10 – 1.044e+10	1034
1.044e+10 – 1.078e+10	1192
1.078e+10 – 1.111e+10	1627
1.111e+10 – 1.145e+10	3575
1.145e+10 – 1.179e+10	1447
1.179e+10 – 1.212e+10	1836
1.212e+10 – 1.246e+10	1355
1.246e+10 – 1.28e+10	2246
1.28e+10 – 1.313e+10	1582
1.313e+10 – 1.347e+10	2072

osm_type categorical feature

This column stores the OpenStreetMap element type, taking only three possible values: 'node', 'way', and 'relation'. Two signals demand attention: 75.08% of the 354,770 rows are null, meaning OSM type is only recorded for roughly a quarter of records, and among the non-null values the distribution is severely imbalanced, with 'node' accounting for 96.39% of non-null entries (85,204 occurrences) versus 2,560 'way' and just 632 'relation'. The low entropy ratio (0.158) confirms this column carries very little discriminative information as-is.

Treatment: Impute nulls as a distinct 'unknown' category, then one-hot encode; consider whether the 'way'/'relation' minority classes carry signal worth preserving or should be collapsed.

anthropic:default · confidence high
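
The reported entropy and entropy_ratio are consistent with Shannon entropy in bits, normalised by log2 of the cardinality. That reading is an inference from the numbers in this report rather than a documented saturn definition. A short sketch using the counts from the top-values table:

```python
import math

counts = {"node": 85_204, "way": 2_560, "relation": 632}  # top-values table

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)

h = entropy_bits(counts)
ratio = h / math.log2(len(counts))  # normalise by the maximum possible, log2(k)
print(round(h, 4), round(ratio, 4))
```

A ratio of 1 would mean all three types are equally common and 0 would mean a single type; the result here should agree with the 0.2501 and 0.1578 in the stats table to within rounding.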

Out[138]:

saturn.columns["osm_type"].stats

stat	value
n	354,770
nulls	266,374 (75.1%)
unique	3
top_value	node
top_rate	0.9639
cardinality	3
entropy	0.2501
entropy_ratio	0.1578
alert: null_rate	75.1% null
alert: imbalance	top value is 96.4% of rows

Fig 49.

Top values for osm_type.

Top values for osm_type (3 unique shown, of 3 total).
value	count	share
node	85204	24.0%
way	2560	0.7%
relation	632	0.2%

place_type categorical label

This column captures the settlement/place classification, apparently the OpenStreetMap 'place' tag, with values such as 'hamlet', 'isolated_dwelling', 'village', and 'town'. The most striking signal is the extreme null rate of 94.88%: only 18,154 of 354,770 rows carry a value. Among populated rows, 'hamlet' dominates at 66.57% of non-null values. A raw 'yes' tag (131 occurrences) and misspellings such as 'hamtel' indicate uncleaned OSM data that needs remediation.

Treatment: Filter or impute nulls before use; remap 'yes' and other dirty values; treat as low-cardinality categorical with one-hot or ordinal encoding reflecting settlement hierarchy.

anthropic:default · confidence high
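
The remapping and ordinal encoding in the treatment could look like the sketch below. The correction table, the decision to drop 'yes', and the rank order are illustrative assumptions, not taken from the report; values outside the assumed hierarchy keep their label but get no rank.

```python
# Assumption: these corrections and ranks are illustrative, not from the source.
FIXES = {"hamtel": "hamlet"}  # misspelling seen in the top-values table
DROP = {"yes"}                # tag value that carries no place type
RANK = {"isolated_dwelling": 0, "hamlet": 1, "village": 2, "town": 3, "city": 4}

def encode_place_type(raw):
    """Return (clean_label, ordinal_rank); rank is None outside the hierarchy."""
    if raw is None:
        return None, None
    label = raw.strip().lower()
    label = FIXES.get(label, label)
    if label in DROP:
        return None, None
    return label, RANK.get(label)

for value in ["hamlet", "hamtel", "yes", "town", "farm", None]:
    print(repr(value), "->", encode_place_type(value))
```

Returning the label and the rank separately leaves room to one-hot encode categories such as 'farm' or 'locality' that do not fit a size ordering.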

Out[141]:

saturn.columns["place_type"].stats

stat	value
n	354,770
nulls	336,616 (94.9%)
unique	48
top_value	hamlet
top_rate	0.6657
cardinality	48
entropy	1.498
entropy_ratio	0.2682
alert: long_tail	31 singleton categories
alert: null_rate	94.9% null

Fig 50.

Top values for place_type.

Top values for place_type (20 unique shown, of 48 total).
value	count	share
hamlet	12086	3.4%
isolated_dwelling	2977	0.8%
village	2388	0.7%
locality	251	0.1%
yes	131	0.0%
farm	122	0.0%
neighbourhood	73	0.0%
town	38	0.0%
suburb	23	0.0%
quarter	7	0.0%
square	7	0.0%
island	4	0.0%
local	4	0.0%
allotments	4	0.0%
house	3	0.0%
city	3	0.0%
islet	2	0.0%
county	1	0.0%
bus_station	1	0.0%
hamtel	1	0.0%

abandoned_year categorical

Out[144]:

saturn.columns["abandoned_year"].stats

stat	value
n	354,770
nulls	353,545 (99.7%)
unique	147
top_value	yes
top_rate	0.4359
cardinality	147
entropy	2.939
entropy_ratio	0.4082
alert: long_tail	97 singleton categories
alert: null_rate	99.7% null

Fig 51.

Top values for abandoned_year.

Top values for abandoned_year (20 unique shown, of 147 total).
value	count	share
yes	534	0.2%
village	433	0.1%
2022	20	0.0%
1986	11	0.0%
hamlet	9	0.0%
1978	6	0.0%
1974	6	0.0%
1987	5	0.0%
2023	4	0.0%
1950	4	0.0%
1985	4	0.0%
1983	3	0.0%
~1500	3	0.0%
2013-12-02	3	0.0%
isolated_dwelling	3	0.0%
2022-12-26	3	0.0%
1946	3	0.0%
1938	3	0.0%
1955	3	0.0%
1962	3	0.0%

abandoned_reason unknown label

This column presumably records why a place was abandoned, given its neighbours 'abandoned_year' and 'former_population'. The profiler emitted a 'skipped' alert with no stats or uniqueness counts, meaning the column's type could not be resolved and no frequency analysis was performed. It reports 354,770 rows and a null rate of exactly 0.0, but because the column was skipped that figure may not reflect real coverage; its true cardinality, distribution, and value content are entirely unknown from this evidence.

Treatment: Re-profile with explicit string/categorical typing to recover value counts and cardinality before any downstream use.

anthropic:default · confidence low

Out[147]:

saturn.columns["abandoned_reason"].stats

stat	value
n	354,770
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

former_population categorical

Out[149]:

saturn.columns["former_population"].stats

stat	value
n	354,770
nulls	352,243 (99.3%)
unique	75
top_value	0
top_rate	0.4951
cardinality	75
entropy	2.605
entropy_ratio	0.4182
alert: null_rate	99.3% null

Fig 52. Top values for former_population.

Top values for former_population (20 unique shown, of 75 total).
value	count	share
0	1251	0.4%
2010	574	0.2%
2024	244	0.1%
2010-10-14	81	0.0%
2008	69	0.0%
2021-08-31	53	0.0%
1999	37	0.0%
2009	12	0.0%
1	11	0.0%
2021	11	0.0%
2021-09-01	10	0.0%
2021-10-01	10	0.0%
1989	9	0.0%
2018-01-01	9	0.0%
2005	8	0.0%
2016	7	0.0%
10	7	0.0%
3	6	0.0%
2001	6	0.0%
2	6	0.0%

heritage categorical

Out[152]:

saturn.columns["heritage"].stats

stat	value
n	354,770
nulls	354,763 (100.0%)
unique	6
top_value	2
top_rate	0.2857
cardinality	6
entropy	2.522
entropy_ratio	0.9755
alert: long_tail	5 singleton categories
alert: null_rate	100.0% null

Fig 53. Top values for heritage.

Top values for heritage (6 unique shown, of 6 total).
value	count	share
2	2	0.0%
8	1	0.0%
yes	1	0.0%
4	1	0.0%
district	1	0.0%
3	1	0.0%
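
The entropy, entropy_ratio, and top_rate figures reported for heritage are consistent with Shannon entropy in bits, that entropy divided by log2 of the cardinality, and the share of the most frequent value among non-null rows. The sketch below recomputes them from the six counts in the table above. It is an assumption that Saturn computes the stats this way; the function name entropy_stats is illustrative.

```python
import math

def entropy_stats(counts):
    """Return (entropy in bits, entropy / log2(cardinality), top share)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(probs))
    ratio = entropy / max_entropy if max_entropy > 0 else 0.0
    return entropy, ratio, max(probs)

# Non-null heritage counts from the table above: 2, 8, yes, 4, district, 3.
heritage_counts = [2, 1, 1, 1, 1, 1]

entropy, ratio, top_rate = entropy_stats(heritage_counts)
print(round(entropy, 3), round(ratio, 4), round(top_rate, 4))
# → 2.522 0.9755 0.2857
```

With only 7 non-null rows out of 354,770, an entropy ratio near 1 says little more than that almost every value is different. The null rate is the more informative statistic for this column.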

How to cite