saturn·

data trove strange places v5 2

saturn notebook · generated 2026-06-22

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/strange_places_v5.2.json

Saturn profiled 354,770 rows across 48 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/strange_places_v5.2.json",
    "--findings", "data-trove-strange-places-v5-2.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This is a 354,770-row mashup of 14 heterogeneous 'strange places' datasets — spanning tornadoes, UFO sightings, cave entrances, meteorites, ghost towns, earthquakes, shipwrecks, and more — unified under a single 'category' column. The most important thing to examine first is the category distribution, which reveals that no single source dominates but tornadoes (~71K), caves (~70K), and UFO sightings (~61K) each make up roughly 17–20% of records. A second key signal is the pervasive sparsity: most domain-specific columns (depth_km, duration_seconds, shape, damage_property) carry null rates of 80–99%, meaning each column is only meaningful for the subset of rows belonging to its originating dataset. UFO sighting durations show extreme right-skew (median 180 s, max 66 million s) and earthquake depths are similarly skewed, both worth closer inspection within their respective subsets.

citing: category.top_values · category.null_rate · duration_seconds.stats · depth_km.stats · shape.null_rate · damage_property.null_rate · source.top_values · fatalities.top_values · event_type.top_values

Out[4]:

saturn.schema() · 48 columns

column kind n null% unique alerts
latitude numeric 354,770 0.0% 215,964 high_skew outliers
longitude numeric 354,770 0.0% 223,129
name text 354,770 0.0% 189,861 multilingual duplicates
description text 354,770 0.0% 218,717 multilingual duplicates
category categorical 354,770 0.0% 14
date text 354,770 41.9% 23,500 one_word allcaps null_rate short_text duplicates
country categorical 354,770 55.3% 28 null_rate
city text 354,770 82.9% 9,149 one_word null_rate short_text duplicates
state categorical 354,770 58.5% 118 null_rate
shape categorical 354,770 82.9% 28 null_rate
duration_seconds numeric 354,770 82.9% 444 null_rate high_skew outliers
mass_g unknown 354,770 0.0% skipped
meteorite_class categorical 354,770 90.9% 395 null_rate
fall_type categorical 354,770 90.9% 2 null_rate imbalance
magnitude categorical 354,770 76.7% 294 null_rate
depth_km numeric 354,770 98.9% 1,505 null_rate high_skew outliers
place text 354,770 98.9% 3,002 null_rate
earthquake_type categorical 354,770 98.9% 3 null_rate imbalance
volcano_type categorical 354,770 100.0% 1 null_rate imbalance
elevation_m unknown 354,770 0.0% skipped
status categorical 354,770 100.0% 1 null_rate imbalance
last_eruption categorical 354,770 100.0% 1 null_rate imbalance
injuries categorical 354,770 75.6% 233 null_rate
fatalities categorical 354,770 75.6% 57 null_rate
length_miles text 354,770 79.8% 3,795 one_word allcaps null_rate short_text duplicates
width_yards categorical 354,770 79.8% 437 null_rate
type categorical 354,770 98.6% 1 null_rate imbalance
temperature categorical 354,770 98.6% 44 long_tail null_rate imbalance
source categorical 354,770 51.6% 4 null_rate
vessel_type categorical 354,770 99.0% 23 long_tail null_rate
cargo categorical 354,770 99.0% 17 long_tail null_rate imbalance
peak_brightness_altitude_km categorical 354,770 99.8% 224 null_rate
velocity_km_s categorical 354,770 99.9% 158 null_rate
energy_joules categorical 354,770 99.8% 518 long_tail null_rate
event_type categorical 354,770 95.8% 17 null_rate
damage_property text 354,770 95.8% 1,014 one_word allcaps null_rate short_text duplicates
cave_type categorical 354,770 100.0% 5 long_tail null_rate
cave_length_m categorical 354,770 99.8% 237 long_tail null_rate
cave_depth_m categorical 354,770 99.9% 124 long_tail null_rate
access categorical 354,770 98.0% 20 null_rate
cave_ref text 354,770 97.9% 7,162 one_word allcaps null_rate short_text
osm_id numeric 354,770 75.1% 88,395 null_rate
osm_type categorical 354,770 75.1% 3 null_rate imbalance
place_type categorical 354,770 94.9% 48 long_tail null_rate
abandoned_year categorical 354,770 99.7% 147 long_tail null_rate
abandoned_reason unknown 354,770 0.0% skipped
former_population categorical 354,770 99.3% 75 null_rate
heritage categorical 354,770 100.0% 6 long_tail null_rate
Fig 1.
category · Shows how records are split across the 14 source datasets: tornadoes, caves, and UFO sightings are the three largest, but all 14 categories are present.
Top values for category (14 unique shown, of 14 total).
value count share
noaa_tornadoes 71813 20.2%
osm_caves 70242 19.8%
ufo_sightings 60632 17.1%
megalithic_portal 60028 16.9%
nasa_meteorites 32186 9.1%
osm_ghost_towns 18154 5.1%
noaa_storm_events 14770 4.2%
haunted_places 9717 2.7%
noaa_thermal_springs 5003 1.4%
bigfoot_sightings 3797 1.1%
usgs_earthquakes 3742 1.1%
noaa_shipwrecks 3653 1.0%
nasa_fireballs 863 0.2%
usgs_volcanoes 170 0.0%
Fig 2.
shape · For UFO sighting rows, 'light' is by far the most reported shape, followed by triangle and circle — look for the long tail of rarer forms.
Top values for shape (20 unique shown, of 28 total).
value count share
light 12895 3.6%
triangle 6268 1.8%
circle 5890 1.7%
fireball 4939 1.4%
unknown 4359 1.2%
other 4209 1.2%
sphere 4134 1.2%
disk 3853 1.1%
oval 2881 0.8%
formation 1908 0.5%
cigar 1569 0.4%
changing 1517 0.4%
flash 1025 0.3%
rectangle 1010 0.3%
cylinder 977 0.3%
diamond 884 0.2%
chevron 774 0.2%
teardrop 560 0.2%
egg 555 0.2%
cone 235 0.1%
Fig 3.
event_type · Among storm-event rows, tornadoes vastly outnumber flash floods and thunderstorm winds, revealing a strong imbalance in weather event coverage.
Top values for event_type (17 unique shown, of 17 total).
value count share
Tornado 6334 1.8%
Flash Flood 2358 0.7%
Thunderstorm Wind 2257 0.6%
Flood 1777 0.5%
Hail 1246 0.4%
Lightning 574 0.2%
Heavy Rain 99 0.0%
Marine Strong Wind 43 0.0%
Debris Flow 43 0.0%
Marine Thunderstorm Wind 25 0.0%
Marine High Wind 5 0.0%
Dust Devil 3 0.0%
Waterspout 2 0.0%
Tropical Storm 1 0.0%
High Wind 1 0.0%
Heat 1 0.0%
Marine Lightning 1 0.0%
Fig 4.
duration_seconds · Extreme right-skew with a median of 180 seconds but a max of 66 million seconds — look for the spike near zero and the extreme outliers.
Histogram bins for duration_seconds (median: 180.0).
bin count
0.01 – 1.657e+06 60612
1.657e+06 – 3.314e+06 11
3.314e+06 – 4.971e+06 0
4.971e+06 – 6.628e+06 3
6.628e+06 – 8.285e+06 1
8.285e+06 – 9.941e+06 0
9.941e+06 – 1.16e+07 2
1.16e+07 – 1.326e+07 0
1.326e+07 – 1.491e+07 0
1.491e+07 – 1.657e+07 0
1.657e+07 – 1.823e+07 0
1.823e+07 – 1.988e+07 0
1.988e+07 – 2.154e+07 0
2.154e+07 – 2.32e+07 0
2.32e+07 – 2.485e+07 0
2.485e+07 – 2.651e+07 0
2.651e+07 – 2.817e+07 0
2.817e+07 – 2.982e+07 0
2.982e+07 – 3.148e+07 0
3.148e+07 – 3.314e+07 0
3.314e+07 – 3.479e+07 0
3.479e+07 – 3.645e+07 0
3.645e+07 – 3.811e+07 0
3.811e+07 – 3.977e+07 0
3.977e+07 – 4.142e+07 0
4.142e+07 – 4.308e+07 0
4.308e+07 – 4.474e+07 0
4.474e+07 – 4.639e+07 0
4.639e+07 – 4.805e+07 0
4.805e+07 – 4.971e+07 0
4.971e+07 – 5.136e+07 0
5.136e+07 – 5.302e+07 2
5.302e+07 – 5.468e+07 0
5.468e+07 – 5.633e+07 0
5.633e+07 – 5.799e+07 0
5.799e+07 – 5.965e+07 0
5.965e+07 – 6.131e+07 0
6.131e+07 – 6.296e+07 0
6.296e+07 – 6.462e+07 0
6.462e+07 – 6.628e+07 1
Fig 5.
meteorite_class · L6 and H5 ordinary chondrites dominate meteorite finds, with a long tail of rarer classes worth noting for completeness of the record.
Top values for meteorite_class (20 unique shown, of 395 total).
value count share
L6 6544 1.8%
H5 5614 1.6%
H4 3336 0.9%
H6 3234 0.9%
L5 2750 0.8%
LL5 1899 0.5%
LL6 963 0.3%
L4 831 0.2%
H4/5 380 0.1%
CM2 281 0.1%
Iron, IIIAB 272 0.1%
H3 244 0.1%
LL 220 0.1%
E3 205 0.1%
L3 176 0.0%
LL4 160 0.0%
H5/6 156 0.0%
Ureilite 155 0.0%
Howardite 127 0.0%
Diogenite 125 0.0%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column kind null %
latitude numeric 0.0%
longitude numeric 0.0%
name text 0.0%
description text 0.0%
category categorical 0.0%
date text 41.9%
country categorical 55.3%
city text 82.9%
state categorical 58.5%
shape categorical 82.9%
duration_seconds numeric 82.9%
mass_g unknown 0.0%
meteorite_class categorical 90.9%
fall_type categorical 90.9%
magnitude categorical 76.7%
depth_km numeric 98.9%
place text 98.9%
earthquake_type categorical 98.9%
volcano_type categorical 100.0%
elevation_m unknown 0.0%
status categorical 100.0%
last_eruption categorical 100.0%
injuries categorical 75.6%
fatalities categorical 75.6%
length_miles text 79.8%
width_yards categorical 79.8%
type categorical 98.6%
temperature categorical 98.6%
source categorical 51.6%
vessel_type categorical 99.0%
cargo categorical 99.0%
peak_brightness_altitude_km categorical 99.8%
velocity_km_s categorical 99.9%
energy_joules categorical 99.8%
event_type categorical 95.8%
damage_property text 95.8%
cave_type categorical 100.0%
cave_length_m categorical 99.8%
cave_depth_m categorical 99.9%
access categorical 98.0%
cave_ref text 97.9%
osm_id numeric 75.1%
osm_type categorical 75.1%
place_type categorical 94.9%
abandoned_year categorical 99.7%
abandoned_reason unknown 0.0%
former_population categorical 99.3%
heritage categorical 100.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 9,808 detected strings).
lang count share
en 8310 84.7%
fr 285 2.9%
de 258 2.6%
it 240 2.4%
es 158 1.6%
ru 111 1.1%
ca 55 0.6%
da 38 0.4%
pt 36 0.4%
nl 30 0.3%
pl 30 0.3%
eu 29 0.3%
hu 24 0.2%
be 23 0.2%
id 18 0.2%
sv 17 0.2%
ar 15 0.2%
cs 14 0.1%
uk 13 0.1%
no 13 0.1%
ba 13 0.1%
sk 10 0.1%
ro 9 0.1%
cy 8 0.1%
el 8 0.1%
tr 7 0.1%
fi 7 0.1%
ja 7 0.1%
az 6 0.1%
hr 5 0.1%
ceb 5 0.1%
sl 2 0.0%
la 1 0.0%
tt 1 0.0%
ko 1 0.0%
als 1 0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 5 numeric columns (values clipped to 2 decimals).
column latitude longitude duration_seconds depth_km osm_id
latitude +1.00 -0.39 +0.01 +0.01 +0.34
longitude -0.39 +1.00 -0.05 -0.05 +0.17
duration_seconds +0.01 -0.05 +1.00 -0.03 +0.01
depth_km +0.01 -0.05 -0.03 +1.00 -0.09
osm_id +0.34 +0.17 +0.01 -0.09 +1.00

latitude numeric feature

This column contains geographic latitude values, ranging from -87.37° to 88.5°, consistent with global coordinates. The distribution is strongly left-skewed (skew = -2.84) with high kurtosis (7.30): the median sits at ~40.6°N and the bulk of records lie in the northern mid-latitudes, but a heavy tail of southern-hemisphere values pulls the mean down to 32.66°. About 9.4% of rows (33,355) are flagged as outliers. With q1 = 33.69 and q3 = 46.53, the 1.5 × IQR fences fall near 14.4°N and 65.8°N, so the flag covers everything outside that band, including the roughly 22,100 rows south of 69.78°S visible in the histogram, and not only near-polar records. The zero_rate (0.06%) is negligible but worth checking for sentinel nulls encoded as 0.

Treatment: Retain as-is for geospatial modelling; investigate ~0.06% zero-value rows as possible null sentinels, and review 33,355 outlier records for data quality before clustering or distance-based methods.

anthropic:default · confidence high
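
The outlier alert for this column refers to a 1.5 × IQR rule. The following is a minimal sketch of that fence arithmetic using the q1 and q3 values reported for latitude. `tukey_fences` is a hypothetical helper name, not part of Saturn, and the sketch assumes the standard Tukey definition rather than Saturn's exact implementation.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences; values outside them count as outliers.

    Assumes the standard Tukey rule: q1 - k*IQR and q3 + k*IQR.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported for latitude in this report.
lower, upper = tukey_fences(q1=33.69, q3=46.53)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")
```

Under these assumptions the fences land inside the populated range on both sides, so the outlier count reflects how narrow the IQR is as much as any data quality problem.
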
Out[14]:

saturn.columns["latitude"].stats

stat value
n 354,770
nulls 0 (0.0%)
unique 215,964
min -87.37
max 88.5
mean 32.66
median 40.6
std 31.01
q1 33.69
q3 46.53
iqr 12.85
skew -2.84
kurtosis 7.302
n_outliers 33,355
outlier_rate 0.09402
zero_rate 0.000637
alert: high_skew skew=-2.84
alert: outliers 9.4% rows beyond 1.5 IQR
Fig 9.
Distribution of latitude. Vertical dash marks the median.
Histogram bins for latitude (median: 40.5983333).
bin count
-87.37 – -82.97 7090
-82.97 – -78.57 1218
-78.57 – -74.18 4088
-74.18 – -69.78 9707
-69.78 – -65.38 4
-65.38 – -60.99 3
-60.99 – -56.59 1
-56.59 – -52.19 20
-52.19 – -47.8 72
-47.8 – -43.4 110
-43.4 – -39 287
-39 – -34.61 726
-34.61 – -30.21 991
-30.21 – -25.81 714
-25.81 – -21.42 902
-21.42 – -17.02 581
-17.02 – -12.62 297
-12.62 – -8.227 341
-8.227 – -3.83 603
-3.83 – 0.5667 690
0.5667 – 4.963 435
4.963 – 9.36 1927
9.36 – 13.76 1608
13.76 – 18.15 1738
18.15 – 22.55 6214
22.55 – 26.95 5180
26.95 – 31.34 20824
31.34 – 35.74 47917
35.74 – 40.14 55501
40.14 – 44.53 77688
44.53 – 48.93 45229
48.93 – 53.33 28613
53.33 – 57.72 24054
57.72 – 62.12 7237
62.12 – 66.52 1480
66.52 – 70.91 490
70.91 – 75.31 92
75.31 – 79.71 86
79.71 – 84.1 9
84.1 – 88.5 3

longitude numeric feature

This column contains geographic longitude values for 354,770 records, spanning the full valid range from -179.28° to 180°. The distribution is moderately right-skewed (skew = 0.755) with a mean of -31.75° and median of -42.66°, indicating a concentration of records in the Western Hemisphere (Americas/Atlantic). The IQR of 104.81° is extremely wide, suggesting genuinely global coverage rather than a region-specific dataset, and only 827 values (0.23%) are flagged as outliers.

Treatment: Pair with latitude for geospatial modelling; consider coordinate binning or haversine-based features rather than treating as a raw numeric.

anthropic:default · confidence high
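
The treatment above suggests haversine-based features instead of raw longitude. A minimal sketch of the haversine great-circle distance follows; the function name and the mean Earth radius of 6371 km are illustrative assumptions, not something the report specifies.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Two points straddling the antimeridian are close on the globe,
# even though their raw longitudes differ by 358 degrees.
print(haversine_km(0.0, 179.0, 0.0, -179.0))
```

This is the main reason to avoid treating longitude as a plain numeric: values near -180 and +180 are neighbours geographically but sit at opposite ends of the numeric range.
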
Out[17]:

saturn.columns["longitude"].stats

stat value
n 354,770
nulls 0 (0.0%)
unique 223,129
min -179.3
max 180
mean -31.75
median -42.66
std 72.11
q1 -92.08
q3 12.73
iqr 104.8
skew 0.7545
kurtosis 0.1165
n_outliers 827
outlier_rate 0.002331
zero_rate 0
Fig 10.
Distribution of longitude. Vertical dash marks the median.
Histogram bins for longitude (median: -42.65733525).
bin count
-179.3 – -170.3 90
-170.3 – -161.3 969
-161.3 – -152.3 1120
-152.3 – -143.4 814
-143.4 – -134.4 413
-134.4 – -125.4 839
-125.4 – -116.4 18752
-116.4 – -107.4 8803
-107.4 – -98.44 21820
-98.44 – -89.46 47491
-89.46 – -80.48 45346
-80.48 – -71.5 24836
-71.5 – -62.52 4804
-62.52 – -53.53 485
-53.53 – -44.55 608
-44.55 – -35.57 503
-35.57 – -26.59 76
-26.59 – -17.61 376
-17.61 – -8.624 2511
-8.624 – 0.3583 40337
0.3583 – 9.34 29096
9.34 – 18.32 36101
18.32 – 27.3 12918
27.3 – 36.29 13238
36.29 – 45.27 6091
45.27 – 54.25 5213
54.25 – 63.23 4220
63.23 – 72.22 604
72.22 – 81.2 3575
81.2 – 90.18 777
90.18 – 99.16 745
99.16 – 108.1 1540
108.1 – 117.1 1434
117.1 – 126.1 1238
126.1 – 135.1 1806
135.1 – 144.1 1475
144.1 – 153.1 730
153.1 – 162 8482
162 – 171 3693
171 – 180 801

name text label

This column contains the name or title of individual records in what appears to be a multi-domain dataset covering natural features (caves), weather events (tornadoes by US state), and UFO sightings. The duplicate rate is strikingly high at 46.5%, driven largely by templated strings like 'Unnamed Cave' (19,962 occurrences) and repeated tornado/state/count patterns. Despite the predominantly English content (3,363 language-detected values skewing English), the multilingual alert flags 31 detected languages including German (230), French (279), Italian (236), Russian (102), and Spanish (156), suggesting internationally-sourced named entities mixed into the dataset. Analysts should note that nearly half of the values are non-unique, so this column cannot serve as a reliable identifier.

Treatment: Deduplicate or group by name pattern before use; consider splitting templated names (e.g. 'Tornado in TX, 48') into structured fields; embed free-form names if semantic similarity is needed.

anthropic:default · confidence high
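
The treatment above mentions splitting templated names such as 'Tornado in TX, 48' into structured fields. A minimal sketch with a regular expression follows. The pattern is inferred from that single quoted example, the meaning of the trailing number is not stated in the report (it is kept under the neutral key `number`), and `split_tornado_name` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed template, based only on the example 'Tornado in TX, 48'.
TORNADO_NAME = re.compile(r"^Tornado in (?P<state>[A-Z]{2}), (?P<number>\d+)$")


def split_tornado_name(name: str) -> Optional[dict]:
    """Return {'state', 'number'} for templated tornado names, else None."""
    match = TORNADO_NAME.match(name)
    if match is None:
        return None  # free-form name; leave it for other handling
    return {"state": match.group("state"), "number": int(match.group("number"))}


print(split_tornado_name("Tornado in TX, 48"))
print(split_tornado_name("Unnamed Cave"))
```

Names that do not match return `None`, so free-form names can still be routed to deduplication or embedding.
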
Out[20]:

saturn.columns["name"].stats

stat value
n 354,770
nulls 0 (0.0%)
unique 189,861
len_min 1
len_max 235
len_mean 20
len_median 17
len_p95 32
word_mean 3.564
word_median 4
n_empty 0
n_duplicates 164,909
duplicate_rate 0.4648
vocab_size 15,811
readability_flesch_mean 64.79
emoji_rate 2.819e-06
url_rate 0
one_word_rate 0.09411
allcaps_rate 0.01283
boilerplate_rate 0
alert: multilingual 31 languages detected in sample
alert: duplicates 46.5% duplicate strings
Fig 11.
Character-length distribution for name.
Character-length distribution for name (mean: 19.99670208867717).
chars count
1 – 7 8769
7 – 13 61484
13 – 19 132248
19 – 24 49793
24 – 30 74096
30 – 36 19324
36 – 42 3592
42 – 48 1239
48 – 54 408
54 – 60 415
60 – 65 461
65 – 71 495
71 – 77 537
77 – 83 395
83 – 89 431
89 – 95 348
95 – 100 245
100 – 106 167
106 – 112 117
112 – 118 69
118 – 124 51
124 – 130 26
130 – 136 22
136 – 141 14
141 – 147 5
147 – 153 6
153 – 159 1
159 – 165 3
165 – 171 4
171 – 176 0
176 – 182 2
182 – 188 0
188 – 194 0
194 – 200 1
200 – 206 0
206 – 212 0
212 – 217 0
217 – 223 0
223 – 229 0
229 – 235 2

description text free_text

This column contains free-text descriptions of geographic or physical features: cave entrances, former hamlets, hot springs, shipwrecks, and tornado tracks (e.g. 'F0, 0.1mi long, 10yd wide') dominate the top values, suggesting a points-of-interest or geographic gazetteer dataset. The duplicate rate is strikingly high at 38.3%, driven by 136,053 repeated values out of 354,770 rows, largely from templated entries like 'Cave entrance' (52,067 occurrences) and storm-track boilerplate. Text is overwhelmingly English (4,893 sampled as English) but 22 languages are detected including German (28), Bashkir (13), Russian (9), and Belarusian (9), flagging a multilingual minority that may require separate handling. The wide spread between median length (40 chars) and mean (114 chars) with a p95 of 491 indicates a heavily right-skewed length distribution.

Treatment: Deduplicate or group templated entries before NLP; apply language detection and route non-English rows to language-specific pipelines; tokenize and embed for semantic modelling.

anthropic:default · confidence high
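
The duplicate figures for the text columns in this report are consistent with a simple rule: n_duplicates equals non-null rows minus unique values, and duplicate_rate divides that by non-null rows. This is inferred from the reported numbers, not from Saturn's documentation, and `duplicate_stats` is a hypothetical helper. A minimal sketch that reproduces the reported values:

```python
def duplicate_stats(n_rows: int, n_nulls: int, n_unique: int) -> tuple[int, float]:
    """Return (n_duplicates, duplicate_rate) over non-null rows.

    Assumed definition: every non-null row beyond the first occurrence
    of its value counts as a duplicate.
    """
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null


# (n, nulls, unique) as reported for each text column in this report.
REPORTED = {
    "name": (354_770, 0, 189_861),
    "description": (354_770, 0, 218_717),
    "date": (354_770, 148_570, 23_500),
    "city": (354_770, 294_138, 9_149),
}

for column, args in REPORTED.items():
    n_dup, rate = duplicate_stats(*args)
    print(f"{column}: n_duplicates={n_dup:,} duplicate_rate={rate:.4f}")
```

The denominator matters for sparse columns: a duplicate rate is a share of populated rows, not of all 354,770 rows.
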
Out[23]:

saturn.columns["description"].stats

stat value
n 354,770
nulls 0 (0.0%)
unique 218,717
len_min 1
len_max 500
len_mean 114
len_median 40
len_p95 491
word_mean 24.07
word_median 7
n_empty 0
n_duplicates 136,053
duplicate_rate 0.3835
vocab_size 38,639
readability_flesch_mean 66.65
emoji_rate 0
url_rate 0.008149
one_word_rate 0.01018
allcaps_rate 0.004256
boilerplate_rate 0.0002509
alert: multilingual 22 languages detected in sample
alert: duplicates 38.3% duplicate strings
Fig 12.
Character-length distribution for description.
Character-length distribution for description (mean: 114.04985201679962).
chars count
1 – 13 71484
13 – 26 53646
26 – 38 49789
38 – 51 14135
51 – 63 30283
63 – 76 28147
76 – 88 8588
88 – 101 6163
101 – 113 5647
113 – 126 4082
126 – 138 9912
138 – 151 3581
151 – 163 678
163 – 176 439
176 – 188 426
188 – 201 525
201 – 213 479
213 – 226 430
226 – 238 412
238 – 250 576
250 – 263 1185
263 – 275 1361
275 – 288 1320
288 – 300 5787
300 – 313 878
313 – 325 1496
325 – 338 1410
338 – 350 1914
350 – 363 1940
363 – 375 1927
375 – 388 1960
388 – 400 2295
400 – 413 2190
413 – 425 2487
425 – 438 2493
438 – 450 3014
450 – 463 2997
463 – 475 3598
475 – 488 5468
488 – 500 19628

category categorical label

This column is a data-source/event-type label drawn from 14 distinct categories across 354,770 rows with zero nulls. The categories span scientific datasets (NOAA tornadoes, NASA meteorites, OSM features) and paranormal/anomalous phenomena (UFO sightings, Bigfoot, haunted places, megalithic portal), suggesting this is a multi-source 'strange phenomena' aggregation dataset. Distribution is moderately uneven: the top value 'noaa_tornadoes' holds 20.2% of rows (71,813), while the smallest, 'usgs_volcanoes', has only 170. Still, an entropy of 2.99 bits, which is 0.78 of the log2(14) ≈ 3.81-bit maximum for 14 classes, indicates reasonable spread across classes. No nulls and clean cardinality make this an immediately usable stratification variable.

Treatment: Use as a stratification or grouping key; one-hot encode or target-encode for modelling.

anthropic:default · confidence high
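
The entropy and entropy_ratio figures for this column can be reproduced from the category counts. The sketch below assumes Shannon entropy in bits and a ratio defined as entropy divided by log2 of the number of observed classes; that definition matches the reported numbers but is an inference, not a documented Saturn formula. The function names are hypothetical.

```python
import math


def entropy_bits(counts: list[int]) -> float:
    """Shannon entropy in bits of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: list[int]) -> float:
    """Entropy divided by its maximum, log2(k), for k observed classes."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # assumption: a single-class column is treated as zero spread
    return entropy_bits(counts) / math.log2(k)


# Category counts as listed in the top-values table of this report.
CATEGORY_COUNTS = [71813, 70242, 60632, 60028, 32186, 18154, 14770,
                   9717, 5003, 3797, 3742, 3653, 863, 170]

print(f"entropy: {entropy_bits(CATEGORY_COUNTS):.3f} bits")
print(f"entropy_ratio: {entropy_ratio(CATEGORY_COUNTS):.4f}")
```

A ratio near 1 means classes are close to uniformly populated; a ratio near 0 means one class dominates.
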
Out[26]:

saturn.columns["category"].stats

stat value
n 354,770
nulls 0 (0.0%)
unique 14
top_value noaa_tornadoes
top_rate 0.2024
cardinality 14
entropy 2.985
entropy_ratio 0.7841
Fig 13.
Top values for category.
Top values for category (14 unique shown, of 14 total).
value count share
noaa_tornadoes 71813 20.2%
osm_caves 70242 19.8%
ufo_sightings 60632 17.1%
megalithic_portal 60028 16.9%
nasa_meteorites 32186 9.1%
osm_ghost_towns 18154 5.1%
noaa_storm_events 14770 4.2%
haunted_places 9717 2.7%
noaa_thermal_springs 5003 1.4%
bigfoot_sightings 3797 1.1%
usgs_earthquakes 3742 1.1%
noaa_shipwrecks 3653 1.0%
nasa_fireballs 863 0.2%
usgs_volcanoes 170 0.0%

date text timestamp

This column contains mostly ISO-format date strings (YYYY-MM-DD), stored as text rather than a proper date type. All top values fall on January 1st of a given year, which suggests year-level precision for at least the most frequent values. Three data quality issues stand out: a 41.88% null rate (148,570 rows); a further 17,854 empty strings that are counted as non-null, which brings effectively missing values to about 46.9% of rows; and an 88.6% duplicate rate among the 206,200 non-null rows, with only 23,500 unique values. The length distribution also shows mixed formats: 183,475 values have exactly 10 characters, but about 4,400 fall in the 19 to 20 character bin and a small number are shorter than 10. The 'allcaps' alert is a false positive from the Saturn parser: ISO date strings trigger it because they contain no lowercase letters.

Treatment: Cast to date type, convert empty strings to null, impute or flag the resulting missing values, and consider extracting year as an integer feature if the Jan-1 anchoring seen in the top values holds across the column.

anthropic:default · confidence high
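
A minimal sketch of the cast described in the treatment, using only the standard library. It assumes that parseable values begin with a YYYY-MM-DD date (longer timestamp-like strings are truncated to their first 10 characters) and treats empty strings like nulls. `parse_year` is a hypothetical helper, and anything that does not parse is returned as `None` rather than guessed.

```python
from datetime import date
from typing import Optional


def parse_year(raw: Optional[str]) -> Optional[int]:
    """Return the year of a string that starts with YYYY-MM-DD, else None."""
    if raw is None or raw.strip() == "":
        return None  # nulls and empty strings are both treated as missing
    try:
        # Keep only the date part so longer timestamp strings also parse.
        return date.fromisoformat(raw.strip()[:10]).year
    except ValueError:
        return None  # unparseable value: flag for review instead of guessing


for value in ["2004-01-01", "", None, "2004-01-01T12:30:00", "not a date"]:
    print(repr(value), "->", parse_year(value))
```

Counting how many non-empty values come back as `None` gives a quick measure of how much of the column is not in the expected format.
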
Out[29]:

saturn.columns["date"].stats

stat value
n 354,770
nulls 148,570 (41.9%)
unique 23,500
len_min 0
len_max 30
len_mean 9.331
len_median 10
len_p95 10
word_mean 1.005
word_median 1
n_empty 17,854
n_duplicates 182,700
duplicate_rate 0.886
vocab_size 8,565
readability_flesch_mean 112.1
emoji_rate 0
url_rate 0
one_word_rate 0.9954
allcaps_rate 0.913
boilerplate_rate 0
alert: one_word 99.5% rows are a single word
alert: allcaps 91.3% rows are all-caps
alert: null_rate 41.9% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 88.6% duplicate strings
Fig 14.
Character-length distribution for date.
Character-length distribution for date (mean: 9.330581959262851).
chars count
0 – 1 17854
1 – 2 0
2 – 2 1
2 – 3 0
3 – 4 0
4 – 4 151
4 – 5 13
5 – 6 0
6 – 7 1
7 – 8 21
8 – 8 5
8 – 9 0
9 – 10 3
10 – 10 183475
10 – 11 17
11 – 12 0
12 – 13 6
13 – 14 8
14 – 14 10
14 – 15 0
15 – 16 8
16 – 16 7
16 – 17 12
17 – 18 0
18 – 19 180
19 – 20 4425
20 – 20 1
20 – 21 0
21 – 22 1
22 – 22 0
22 – 23 0
23 – 24 0
24 – 25 0
25 – 26 0
26 – 26 0
26 – 27 0
27 – 28 0
28 – 28 0
28 – 29 0
29 – 30 1

country categorical feature

This column captures the country associated with each record, using a mix of ISO 2-letter codes and the 3-letter 'USA'. The most alarming issue is a 55.29% null rate, meaning over half of 354,770 rows carry no country value. Compounding this, 'USA' and 'US' denote the same country but are stored as two distinct values (86,583 and 60,634 rows respectively). 'USA' alone accounts for 54.6% of non-null records, and the two together for about 92.8%, so the split both inflates apparent cardinality and understates how US-dominated the column is. There are also 9,497 empty-string records that escaped null detection. Overall the column has 28 unique values at low entropy (1.34).

Treatment: Unify 'USA'/'US' and other aliases into ISO-3166 codes, convert empty strings to null, then impute or flag remaining nulls before using as a categorical feature.

anthropic:default · confidence high
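
A minimal sketch of the alias unification described in the treatment. The alias map contains only the one alias visible in the top values ('USA' for 'US'); any other aliases among the 28 unique values would need to be added after inspecting them. `normalize_country` is a hypothetical helper name.

```python
from typing import Optional

# Only alias confirmed by the top-values table; extend after auditing the rest.
COUNTRY_ALIASES = {"USA": "US"}


def normalize_country(raw: Optional[str]) -> Optional[str]:
    """Map aliases to a 2-letter code and turn empty strings into None."""
    if raw is None:
        return None
    code = raw.strip().upper()
    if code == "":
        return None  # empty strings escaped null detection upstream
    return COUNTRY_ALIASES.get(code, code)


for value in ["USA", "US", "", " ru ", None]:
    print(repr(value), "->", normalize_country(value))
```

After this step 'USA' and 'US' collapse into one value, and the 9,497 empty strings join the nulls instead of appearing as a category of their own.
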
Out[32]:

saturn.columns["country"].stats

stat value
n 354,770
nulls 196,154 (55.3%)
unique 28
top_value USA
top_rate 0.5459
cardinality 28
entropy 1.341
entropy_ratio 0.279
alert: null_rate 55.3% null
Fig 15.
Top values for country.
Top values for country (20 unique shown, of 28 total).
value count share
USA 86583 24.4%
US 60634 17.1%
(empty string) 9497 2.7%
RU 1481 0.4%
BY 205 0.1%
KZ 156 0.0%
HT 13 0.0%
KY 9 0.0%
AU 6 0.0%
DE 5 0.0%
GB 5 0.0%
IQ 3 0.0%
RO 2 0.0%
EC 2 0.0%
IT 2 0.0%
TW 1 0.0%
MX 1 0.0%
CW 1 0.0%
BS 1 0.0%
MT 1 0.0%

city text feature

This column contains US city names, as indicated by the top values (Seattle, Phoenix, Las Vegas, Portland, Los Angeles) and top words ('beach', 'san', 'lake', 'springs'). The most striking issue is the 82.91% null rate: only roughly 1 in 6 rows has a city value at all. Despite that sparsity, the duplicate rate among non-null values is 84.91%, indicating that populated rows cluster around a relatively small set of repeated cities (9,149 unique values built from 4,862 vocabulary tokens). The word 'city' appears 531 times in top_words, which most likely reflects legitimate names such as 'Kansas City' or 'Oklahoma City' rather than placeholder text or data quality noise.

Treatment: Impute or flag nulls (82.91% missing) before use; consider grouping rare cities or encoding as region/state for modelling.

anthropic:default · confidence high
Out[35]:

saturn.columns["city"].stats

stat value
n 354,770
nulls 294,138 (82.9%)
unique 9,149
len_min 3
len_max 23
len_mean 8.829
len_median 9
len_p95 14
word_mean 1.288
word_median 1
n_empty 0
n_duplicates 51,483
duplicate_rate 0.8491
vocab_size 4,862
readability_flesch_mean 21.74
emoji_rate 0
url_rate 0
one_word_rate 0.7294
allcaps_rate 0
boilerplate_rate 0
alert: one_word 72.9% rows are a single word
alert: null_rate 82.9% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 84.9% duplicate strings
Fig 16.
Character-length distribution for city.
Character-length distribution for city (mean: 8.828638342789286).
chars count
3 – 4 74
4 – 4 0
4 – 4 1497
4 – 5 0
5 – 6 3313
6 – 6 0
6 – 6 7780
6 – 7 0
7 – 8 9464
8 – 8 0
8 – 8 7830
8 – 9 0
9 – 10 8339
10 – 10 0
10 – 10 7192
10 – 11 0
11 – 12 5509
12 – 12 0
12 – 12 3694
12 – 13 0
13 – 14 2353
14 – 14 0
14 – 14 1537
14 – 15 0
15 – 16 800
16 – 16 0
16 – 16 787
16 – 17 0
17 – 18 203
18 – 18 0
18 – 18 155
18 – 19 0
19 – 20 44
20 – 20 0
20 – 20 25
20 – 21 0
21 – 22 13
22 – 22 0
22 – 22 21
22 – 23 2

state categorical feature

This column contains US state abbreviations (and possibly territories or non-standard codes, given 118 unique values vs. the expected 50–60), making it a geographic categorical feature. The most critical signal is a 58.5% null rate: over half the 354,770 rows have no state recorded, which is a severe data quality issue. The top value is 'TX' at 8.6% of non-null rows, with CA and FL following; the cardinality of 118 (more than double the 50 US states) suggests the presence of territories, foreign country codes, or dirty values worth auditing.

Treatment: Audit the 118 unique values to identify non-US-state codes, impute or flag nulls (58.5% missing), then encode as categorical for modelling.

anthropic:default · confidence high
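
A minimal sketch of the audit described in the treatment: it sorts distinct values into the 50 state codes, DC and territory codes, and everything else. The comparison is deliberately case-sensitive so that variants such as lowercase codes surface as unrecognized. `audit_state_codes` is a hypothetical helper, and the sample values in the usage lines are illustrative, not taken from the dataset.

```python
US_STATE_CODES = frozenset(
    """AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
       MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
       SD TN TX UT VT VA WA WV WI WY""".split()
)
# District of Columbia plus the inhabited US territories (USPS codes).
US_DC_AND_TERRITORIES = frozenset({"DC", "PR", "GU", "VI", "AS", "MP"})


def audit_state_codes(values):
    """Split distinct values into (states, dc_or_territories, unrecognized)."""
    distinct = {v for v in values if v is not None}
    states = distinct & US_STATE_CODES
    territories = distinct & US_DC_AND_TERRITORIES
    unrecognized = distinct - states - territories
    return states, territories, unrecognized


# Illustrative input only; run this on the column's 118 distinct values.
states, territories, unrecognized = audit_state_codes(["TX", "CA", "PR", "ZZ", "tx", None])
print(sorted(states), sorted(territories), sorted(unrecognized))
```

The size of the unrecognized set indicates how much of the extra cardinality comes from foreign codes or dirty values rather than territories.
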
Out[38]:

saturn.columns["state"].stats

stat value
n 354,770
nulls 207,555 (58.5%)
unique 118
top_value TX
top_rate 0.08645
cardinality 118
entropy 5.668
entropy_ratio 0.8236
alert: null_rate 58.5% null
Fig 17.
Top values for state.
Top values for state (20 unique shown, of 118 total).
value count share
TX 12727 3.6%
CA 8791 2.5%
FL 7372 2.1%
IL 5329 1.5%
KS 5127 1.4%
OK 5056 1.4%
MO 3905 1.1%
CO 3766 1.1%
WA 3648 1.0%
IA 3647 1.0%
OH 3521 1.0%
NE 3506 1.0%
AL 3196 0.9%
PA 3193 0.9%
NC 3186 0.9%
GA 3112 0.9%
MN 3110 0.9%
MS 3086 0.9%
NY 2951 0.8%
LA 2927 0.8%

shape categorical feature

This column captures the reported shape of UFO/unidentified aerial phenomena sightings, with 28 distinct categories such as 'light', 'triangle', 'circle', and 'fireball'. The most striking issue is an 82.91% null rate across 354,770 rows, meaning only ~60,600 records have a shape value at all. Among non-null records, 'light' dominates at 21.27%, and the presence of catch-all categories like 'unknown' (4,359) and 'other' (4,209) further dilutes the informativeness of the non-missing data.

Treatment: Impute or flag nulls as a separate 'not_reported' category before encoding; consider consolidating 'unknown' and 'other' with nulls given ambiguity.

anthropic:default · confidence high
Out[41]:

saturn.columns["shape"].stats

stat value
n 354,770
nulls 294,138 (82.9%)
unique 28
top_value light
top_rate 0.2127
cardinality 28
entropy 3.774
entropy_ratio 0.785
alert: null_rate 82.9% null
Fig 18.
Top values for shape.
Top values for shape (20 unique shown, of 28 total).
value count share
light 12895 3.6%
triangle 6268 1.8%
circle 5890 1.7%
fireball 4939 1.4%
unknown 4359 1.2%
other 4209 1.2%
sphere 4134 1.2%
disk 3853 1.1%
oval 2881 0.8%
formation 1908 0.5%
cigar 1569 0.4%
changing 1517 0.4%
flash 1025 0.3%
rectangle 1010 0.3%
cylinder 977 0.3%
diamond 884 0.2%
chevron 774 0.2%
teardrop 560 0.2%
egg 555 0.2%
cone 235 0.1%

duration_seconds numeric feature

This column records event durations in seconds (UFO sighting durations, according to the summary), with values ranging from 0.01 s to 66,276,000 s (~767 days). The most striking issue is that 82.91% of rows are null, meaning duration is only captured for roughly 1-in-6 records. Among non-null values the distribution is extremely right-skewed (skew = 135.86, kurtosis = 19,379.84): the median is just 180 s while the mean inflates to 5,410 s, and 7,753 rows (12.79% of non-null) are flagged as outliers. The maximum of 66,276,000 s is almost certainly erroneous or a sentinel value.

Treatment: Investigate and cap or remove extreme outliers (especially values near 66276000.0), impute or flag nulls explicitly, then log-transform before modelling.

anthropic:default · confidence high
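
A minimal sketch of the cap-then-log transform suggested in the treatment. The one-day cap is a hypothetical choice for illustration only; the report does not recommend a specific threshold, and `cap_and_log` is not a Saturn function. Nulls are passed through unchanged so they can be flagged separately.

```python
import math
from typing import Optional

ONE_DAY_SECONDS = 86_400  # hypothetical cap, chosen only for illustration


def cap_and_log(seconds: Optional[float], cap: float = ONE_DAY_SECONDS) -> Optional[float]:
    """Clip a duration at `cap`, then apply log1p; None stays None."""
    if seconds is None:
        return None
    return math.log1p(min(seconds, cap))


# Median, q3, and maximum as reported for duration_seconds.
for value in [180.0, 600.0, 66_276_000.0, None]:
    print(value, "->", cap_and_log(value))

# The reported maximum expressed in days.
print(66_276_000 / 86_400)
```

`log1p` is used instead of a plain logarithm so that values close to zero, such as the reported minimum of 0.01 s, stay well behaved.
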
Out[44]:

saturn.columns["duration_seconds"].stats

stat value
n 354,770
nulls 294,138 (82.9%)
unique 444
min 0.01
max 6.628e+07
mean 5410
median 180
std 4.144e+05
q1 30
q3 600
iqr 570
skew 135.9
kurtosis 1.938e+04
n_outliers 7,753
outlier_rate 0.1279
zero_rate 0
alert: null_rate 82.9% null
alert: high_skew skew=+135.86
alert: outliers 12.8% rows beyond 1.5 IQR
Fig 19.
Distribution of duration_seconds. Vertical dash marks the median.
Histogram bins for duration_seconds (median: 180.0).
bin count
0.01 – 1.657e+06 60612
1.657e+06 – 3.314e+06 11
3.314e+06 – 4.971e+06 0
4.971e+06 – 6.628e+06 3
6.628e+06 – 8.285e+06 1
8.285e+06 – 9.941e+06 0
9.941e+06 – 1.16e+07 2
1.16e+07 – 1.326e+07 0
1.326e+07 – 1.491e+07 0
1.491e+07 – 1.657e+07 0
1.657e+07 – 1.823e+07 0
1.823e+07 – 1.988e+07 0
1.988e+07 – 2.154e+07 0
2.154e+07 – 2.32e+07 0
2.32e+07 – 2.485e+07 0
2.485e+07 – 2.651e+07 0
2.651e+07 – 2.817e+07 0
2.817e+07 – 2.982e+07 0
2.982e+07 – 3.148e+07 0
3.148e+07 – 3.314e+07 0
3.314e+07 – 3.479e+07 0
3.479e+07 – 3.645e+07 0
3.645e+07 – 3.811e+07 0
3.811e+07 – 3.977e+07 0
3.977e+07 – 4.142e+07 0
4.142e+07 – 4.308e+07 0
4.308e+07 – 4.474e+07 0
4.474e+07 – 4.639e+07 0
4.639e+07 – 4.805e+07 0
4.805e+07 – 4.971e+07 0
4.971e+07 – 5.136e+07 0
5.136e+07 – 5.302e+07 2
5.302e+07 – 5.468e+07 0
5.468e+07 – 5.633e+07 0
5.633e+07 – 5.799e+07 0
5.799e+07 – 5.965e+07 0
5.965e+07 – 6.131e+07 0
6.131e+07 – 6.296e+07 0
6.296e+07 – 6.462e+07 0
6.462e+07 – 6.628e+07 1

mass_g unknown feature

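The column 'mass_g' likely represents mass measurements in grams. The profiler skipped this column (no profiler for kind=unknown), so no distributional statistics are available: skew, range, outliers, and uniqueness cannot be assessed from the evidence provided. The reported 0.0% null rate should be treated with caution for the same reason; it may reflect the skipped profiling rather than complete data coverage, given that the other meteorite-specific columns are 90.9% null.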

Treatment: Re-profile to obtain distribution stats; then check for skew and consider log-transform before modelling.

anthropic:default · confidence low
Out[47]:

saturn.columns["mass_g"].stats

stat value
n 354,770
nulls 0 (0.0%)
unique
alert: skipped no profiler for kind=unknown

meteorite_class categorical label

This column contains meteorite classification codes (e.g., 'L6', 'H5', 'CM2'), representing standard petrologic-type designations for chondrite and other meteorite classes. The most striking feature is an extremely high null rate of 90.93%, meaning only roughly 32,000 of 354,770 rows carry a classification. Among classified records the distribution is moderately concentrated — 'L6' alone accounts for 20.3% of non-null values — with 395 distinct classes and an entropy ratio of ~0.51, indicating moderate spread across the taxonomy.

Treatment: Impute nulls with an explicit 'Unknown' category or exclude from supervised models; encode via target or ordinal encoding given 395 classes and severe class imbalance.

anthropic:default · confidence high
Out[49]:

saturn.columns["meteorite_class"].stats

stat value
n 354,770
nulls 322,584 (90.9%)
unique 395
top_value L6
top_rate 0.2033
cardinality 395
entropy 4.37
entropy_ratio 0.5067
alert: null_rate 90.9% null
Fig 20.
Top values for meteorite_class.
Top values for meteorite_class (20 unique shown, of 395 total).
value count share
L6 6544 1.8%
H5 5614 1.6%
H4 3336 0.9%
H6 3234 0.9%
L5 2750 0.8%
LL5 1899 0.5%
LL6 963 0.3%
L4 831 0.2%
H4/5 380 0.1%
CM2 281 0.1%
Iron, IIIAB 272 0.1%
H3 244 0.1%
LL 220 0.1%
E3 205 0.1%
L3 176 0.0%
LL4 160 0.0%
H5/6 156 0.0%
Ureilite 155 0.0%
Howardite 127 0.0%
Diogenite 125 0.0%

fall_type categorical label

This column classifies meteorite recovery type: 'Found' (discovered without an observed fall) versus 'Fell' (witnessed falling). The null rate is 90.93%, so only 32,186 of 354,770 records have a value. Among those, the distribution is heavily skewed: 'Found' accounts for 96.6% (31,090) versus 'Fell' at just 3.4% (1,096). This aligns with real-world meteorite data, but it is still a severe class imbalance and the profiler flags it.

Treatment: Impute nulls as a third category ('Unknown') or exclude from classification tasks; apply class-weighting or oversampling to address the 97:3 Found-to-Fell imbalance before modelling.

anthropic:default · confidence high
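
One common way to apply the class-weighting suggested above is the "balanced" heuristic, which weights each class by n_samples / (n_classes * class_count). The sketch below applies it to the non-null counts reported for fall_type; the helper name is hypothetical and the choice of heuristic is an assumption, not something the report prescribes.

```python
def balanced_class_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Non-null counts reported for fall_type.
fall_type_counts = {"Found": 31090, "Fell": 1096}
weights = balanced_class_weights(fall_type_counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Under this scheme a 'Fell' row counts for roughly 28 times as much as a 'Found' row, which is the inverse of the count ratio.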
Out[52]:

saturn.columns["fall_type"].stats

stat value
n 354,770
nulls 322,584 (90.9%)
unique 2
top_value Found
top_rate 0.9659
cardinality 2
entropy 0.2143
entropy_ratio 0.2143
alert: null_rate 90.9% null
alert: imbalance top value is 96.6% of rows
Fig 21. Top values for fall_type.
Top values for fall_type (2 unique shown, of 2 total).
value | count | share
Found | 31090 | 8.8%
Fell | 1096 | 0.3%

magnitude categorical feature

This column holds magnitude values stored as a categorical type despite being fundamentally numeric: values include integers (0, 1, 2, 3, 4) and decimals (4.5, 4.6, 4.7, 1.75). Because the file merges tornado, earthquake, and other event datasets, the column may mix several magnitude scales; values such as '50.00' and '70.00' sit far outside the 0 to 5 range that dominates. The null rate of 76.7% triggered an alert: over three-quarters of the 354,770 rows carry no value. The value '-9' appears 1,278 times and is almost certainly a sentinel/missing-value code rather than a true measurement. The top value '0' accounts for 44.4% of non-null observations, and the entropy ratio of 0.31 confirms a concentrated, low-diversity distribution despite 294 unique string representations, some of which are numerically equal (for example '2' and '2.00').

Treatment: Cast to float after replacing '-9' with NaN, investigate the 76.7% null rate for systematic missingness, and confirm which scale each source dataset uses before binning or transforming for modelling.

anthropic:default · confidence high
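
The cast described in the treatment can be sketched with the standard library alone. The assumption that -9 is a missing-value sentinel comes from the interpretation above and has not been verified against the source datasets; the function name is hypothetical.

```python
import math
from typing import Optional

SENTINEL = -9.0  # assumption: -9 marks a missing magnitude


def parse_magnitude(raw: Optional[str]) -> float:
    """Return the magnitude as a float, or NaN for nulls, sentinels, and unparseable text."""
    if raw is None:
        return math.nan
    try:
        value = float(raw.strip())
    except ValueError:
        return math.nan
    return math.nan if value == SENTINEL else value


raw_values = ["0", "4.5", "-9", "2.00", "2", None]
parsed = [parse_magnitude(v) for v in raw_values]
print(parsed)
```

After parsing, '2' and '2.00' collapse to the same number, so the count of distinct values will drop below the 294 string representations.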
Out[55]:

saturn.columns["magnitude"].stats

stat value
n 354,770
nulls 272,093 (76.7%)
unique 294
top_value 0
top_rate 0.4436
cardinality 294
entropy 2.514
entropy_ratio 0.3065
alert: null_rate 76.7% null
Fig 22. Top values for magnitude.
Top values for magnitude (20 unique shown, of 294 total).
value | count | share
0 | 36675 | 10.3%
1 | 24542 | 6.9%
2 | 9904 | 2.8%
3 | 2630 | 0.7%
-9 | 1278 | 0.4%
4.5 | 686 | 0.2%
4 | 591 | 0.2%
4.6 | 558 | 0.2%
4.7 | 415 | 0.1%
1.75 | 383 | 0.1%
4.8 | 317 | 0.1%
5 | 297 | 0.1%
4.9 | 261 | 0.1%
2.75 | 220 | 0.1%
5.1 | 202 | 0.1%
5.2 | 167 | 0.0%
70.00 | 162 | 0.0%
50.00 | 151 | 0.0%
2.00 | 150 | 0.0%
5.3 | 126 | 0.0%

depth_km numeric

Out[58]:

saturn.columns["depth_km"].stats

stat value
n 354,770
nulls 351,028 (98.9%)
unique 1,505
min -2.261
max 248.7
mean 23.71
median 10
std 28.79
q1 10
q3 29.1
iqr 19.1
skew 3.072
kurtosis 11.61
n_outliers 314
outlier_rate 0.08391
zero_rate 0.002672
alert: null_rate 98.9% null
alert: high_skew skew=+3.07
alert: outliers 8.4% rows beyond 1.5 IQR
Fig 23. Distribution of depth_km. Vertical dash marks the median.
Histogram bins for depth_km (median: 10.0).
bin | count
-2.261 – 4.013 | 219
4.013 – 10.29 | 1730
10.29 – 16.56 | 370
16.56 – 22.84 | 258
22.84 – 29.11 | 230
29.11 – 35.38 | 250
35.38 – 41.66 | 167
41.66 – 47.93 | 129
47.93 – 54.21 | 56
54.21 – 60.48 | 31
60.48 – 66.75 | 43
66.75 – 73.03 | 27
73.03 – 79.3 | 19
79.3 – 85.58 | 29
85.58 – 91.85 | 21
91.85 – 98.12 | 19
98.12 – 104.4 | 24
104.4 – 110.7 | 12
110.7 – 116.9 | 9
116.9 – 123.2 | 14
123.2 – 129.5 | 13
129.5 – 135.8 | 19
135.8 – 142 | 14
142 – 148.3 | 5
148.3 – 154.6 | 6
154.6 – 160.9 | 0
160.9 – 167.1 | 7
167.1 – 173.4 | 4
173.4 – 179.7 | 0
179.7 – 186 | 1
186 – 192.2 | 5
192.2 – 198.5 | 1
198.5 – 204.8 | 3
204.8 – 211.1 | 2
211.1 – 217.3 | 3
217.3 – 223.6 | 1
223.6 – 229.9 | 0
229.9 – 236.2 | 0
236.2 – 242.4 | 0
242.4 – 248.7 | 1
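
The outlier alert for depth_km reads as a Tukey-fence rule: a value counts as an outlier when it falls more than 1.5 IQR below q1 or above q3. That reading is an assumption based on the alert text ("beyond 1.5 IQR"); the report does not document the exact rule. A short worked calculation from the reported quartiles:

```python
# Reported stats for depth_km.
q1, q3 = 10.0, 29.1
n_rows, nulls, n_outliers = 354_770, 351_028, 314

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # values below this would be flagged
upper_fence = q3 + 1.5 * iqr  # values above this would be flagged

non_null = n_rows - nulls
outlier_rate = n_outliers / non_null

print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}] km")
print(f"outlier_rate: {outlier_rate:.5f} over {non_null} non-null rows")
```

Because the reported minimum (-2.261) lies above the lower fence, all 314 flagged rows would sit in the deep tail under this rule, and the rate is taken over non-null rows rather than over all 354,770.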

place text

Out[61]:

saturn.columns["place"].stats

stat value
n 354,770
nulls 351,028 (98.9%)
unique 3,002
len_min 4
len_max 59
len_mean 29.47
len_median 29
len_p95 36
word_mean 6.293
word_median 6
n_empty 0
n_duplicates 740
duplicate_rate 0.1978
vocab_size 1,036
readability_flesch_mean 69.91
emoji_rate 0
url_rate 0
one_word_rate 0.0005345
allcaps_rate 0
boilerplate_rate 0
alert: null_rate 98.9% null
Fig 24. Character-length distribution for place.
Character-length distribution for place (mean: 29.465793693212184).
chars | count
4 – 5 | 1
5 – 7 | 0
7 – 8 | 1
8 – 10 | 0
10 – 11 | 0
11 – 12 | 0
12 – 14 | 2
14 – 15 | 11
15 – 16 | 40
16 – 18 | 1
18 – 19 | 20
19 – 20 | 19
20 – 22 | 8
22 – 23 | 219
23 – 25 | 60
25 – 26 | 122
26 – 27 | 543
27 – 29 | 499
29 – 30 | 823
30 – 32 | 325
32 – 33 | 362
33 – 34 | 378
34 – 36 | 105
36 – 37 | 40
37 – 38 | 37
38 – 40 | 25
40 – 41 | 34
41 – 42 | 3
42 – 44 | 0
44 – 45 | 15
45 – 47 | 22
47 – 48 | 14
48 – 49 | 3
49 – 51 | 1
51 – 52 | 2
52 – 54 | 5
54 – 55 | 0
55 – 56 | 0
56 – 58 | 0
58 – 59 | 2

earthquake_type categorical

Out[64]:

saturn.columns["earthquake_type"].stats

stat value
n 354,770
nulls 351,028 (98.9%)
unique 3
top_value earthquake
top_rate 0.9992
cardinality 3
entropy 0.01014
entropy_ratio 0.006396
alert: null_rate 98.9% null
alert: imbalance top value is 99.9% of rows
Fig 25. Top values for earthquake_type.
Top values for earthquake_type (3 unique shown, of 3 total).
value | count | share
earthquake | 3739 | 1.1%
explosion | 2 | 0.0%
landslide | 1 | 0.0%

volcano_type categorical

Out[67]:

saturn.columns["volcano_type"].stats

stat value
n 354,770
nulls 354,600 (100.0%)
unique 1
top_value Unknown
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: null_rate 100.0% null
alert: imbalance top value is 100.0% of rows
Fig 26. Top values for volcano_type.
Top values for volcano_type (1 unique shown, of 1 total).
value | count | share
Unknown | 170 | 0.0%

elevation_m unknown feature

This column records elevation in metres for 354,770 rows with no nulls. The profiler emitted a 'skipped' alert and returned no computed statistics, so distribution shape, range, skew, and uniqueness are entirely unknown from this evidence. The name strongly implies a continuous numeric geographic feature, but no further characterisation can be made without re-running profiling.

Treatment: Re-profile to obtain range, skew, and outlier metrics; then consider log-transform or clipping if heavily right-skewed before use in modelling.

anthropic:default · confidence low
Out[70]:

saturn.columns["elevation_m"].stats

stat value
n 354,770
nulls 0 (0.0%)
unique
alert: skipped no profiler for kind=unknown

status categorical

Out[72]:

saturn.columns["status"].stats

stat value
n 354,770
nulls 354,600 (100.0%)
unique 1
top_value Unknown
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: null_rate 100.0% null
alert: imbalance top value is 100.0% of rows
Fig 27. Top values for status.
Top values for status (1 unique shown, of 1 total).
value | count | share
Unknown | 170 | 0.0%

last_eruption categorical

Out[75]:

saturn.columns["last_eruption"].stats

stat value
n 354,770
nulls 354,600 (100.0%)
unique 1
top_value Unknown
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: null_rate 100.0% null
alert: imbalance top value is 100.0% of rows
Fig 28. Top values for last_eruption.
Top values for last_eruption (1 unique shown, of 1 total).
value | count | share
Unknown | 170 | 0.0%

injuries categorical feature

This column records injury counts per incident, stored as a categorical type despite being numeric in nature: values are integers ('0', '1', '2', …) with 233 distinct values. The null rate is high at 75.59%, so only 86,583 of 354,770 rows have a recorded value, which is flagged as an alert. Among non-null rows, 85.4% report zero injuries, producing a heavily right-skewed distribution with low entropy (1.23, entropy ratio 0.157). The 233 distinct values may simply reflect a long tail of large counts, but some entries could also encode ranges, text annotations, or data-entry anomalies, so non-integer strings are worth checking before casting.

Treatment: Cast to numeric, investigate whether nulls are missing at random or structurally missing (the source dataset has no injuries field), treat missing as unknown rather than zero, then consider zero-inflated or count-based model treatment.

anthropic:default · confidence medium
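
The non-null count and the rates quoted for this column can be re-derived from the stats table, which is a quick way to sanity-check any figure in the interpretation. The variable names below are illustrative; the inputs are the reported n, nulls, and the count for the top value '0'.

```python
# Reported stats for injuries.
n_rows = 354_770
nulls = 268_187
top_count = 73_943  # rows where injuries == '0'

non_null = n_rows - nulls
null_rate = nulls / n_rows
top_rate = top_count / non_null  # share of zero-injury rows among non-null rows

print(f"non-null rows: {non_null}")
print(f"null_rate: {null_rate:.4f}")
print(f"top_rate: {top_rate:.3f}")
```

Note that top_rate is computed over non-null rows, while the "share" column in the top-values table is computed over all 354,770 rows.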
Out[78]:

saturn.columns["injuries"].stats

stat value
n 354,770
nulls 268,187 (75.6%)
unique 233
top_value 0
top_rate 0.854
cardinality 233
entropy 1.234
entropy_ratio 0.1569
alert: null_rate 75.6% null
Fig 29. Top values for injuries.
Top values for injuries (20 unique shown, of 233 total).
value | count | share
0 | 73943 | 20.8%
1 | 3402 | 1.0%
2 | 1957 | 0.6%
3 | 1118 | 0.3%
4 | 727 | 0.2%
5 | 625 | 0.2%
6 | 500 | 0.1%
10 | 362 | 0.1%
7 | 332 | 0.1%
8 | 292 | 0.1%
12 | 280 | 0.1%
9 | 202 | 0.1%
20 | 185 | 0.1%
15 | 181 | 0.1%
11 | 170 | 0.0%
13 | 136 | 0.0%
14 | 124 | 0.0%
30 | 116 | 0.0%
25 | 100 | 0.0%
16 | 91 | 0.0%

fatalities categorical feature

This column represents a count of fatalities per incident, stored as a categorical type despite being inherently numeric. The null rate is high at 75.59%, so only 86,583 of 354,770 rows have a value. Among non-null rows, 92.86% record zero fatalities, and the tail reaches at least 21 within the 20 most frequent values shown (57 distinct values in total); the low entropy ratio (0.088) confirms extreme concentration at '0'.

Treatment: Cast to integer, investigate and impute or exclude the 75.59% nulls, then treat as a heavily zero-inflated count variable (consider zero-inflated Poisson or log1p transform for regression).

anthropic:default · confidence high
Out[81]:

saturn.columns["fatalities"].stats

stat value
n 354,770
nulls 268,187 (75.6%)
unique 57
top_value 0
top_rate 0.9286
cardinality 57
entropy 0.5134
entropy_ratio 0.08802
alert: null_rate 75.6% null
Fig 30. Top values for fatalities.
Top values for fatalities (20 unique shown, of 57 total).
value | count | share
0 | 80397 | 22.7%
1 | 4053 | 1.1%
2 | 932 | 0.3%
3 | 357 | 0.1%
4 | 190 | 0.1%
5 | 121 | 0.0%
6 | 112 | 0.0%
7 | 71 | 0.0%
9 | 40 | 0.0%
10 | 39 | 0.0%
8 | 34 | 0.0%
11 | 34 | 0.0%
16 | 22 | 0.0%
13 | 19 | 0.0%
12 | 15 | 0.0%
17 | 13 | 0.0%
14 | 11 | 0.0%
21 | 9 | 0.0%
18 | 8 | 0.0%
20 | 8 | 0.0%

length_miles text feature

This column stores numeric distance measurements (miles) encoded as text strings: all values are single tokens like '0.1', '0.5', '1.0' with a mean character length of 3.69 and a max of 8. Two signals demand attention. The null rate is 79.76%, so roughly four in five rows carry no value. The duplicate rate among non-null values is 94.72%, reflecting a coarse, rounded measurement scale (only 3,795 unique values across the 71,813 non-null rows). The top value '0.1' alone appears 15,456 times, suggesting heavy concentration at short distances. The one_word and allcaps alerts are most likely side effects of profiling numeric strings as text.

Treatment: Cast to float, investigate and handle the 79.76% nulls (impute or flag), then use directly or log-transform given likely right skew.

anthropic:default · confidence high
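
The duplicate rate reported for this column is consistent with counting every row beyond the first occurrence of each distinct string, divided by the number of non-null rows. That definition is inferred from the reported numbers rather than documented, and the helper below is a hypothetical sketch of it.

```python
def duplicate_stats(values: list[str]) -> tuple[int, float]:
    """Return (n_duplicates, duplicate_rate), counting rows beyond each string's first occurrence."""
    n_duplicates = len(values) - len(set(values))
    rate = n_duplicates / len(values) if values else 0.0
    return n_duplicates, rate


# Small illustrative sample of length_miles strings.
sample = ["0.1", "0.1", "0.5", "1.0", "0.1"]
n_dup, rate = duplicate_stats(sample)

# The reported figures fit the same definition: 71,813 non-null rows, 3,795 unique.
reported_dup = 71_813 - 3_795
reported_rate = reported_dup / 71_813
print(n_dup, rate, reported_dup, round(reported_rate, 4))
```

A high duplicate rate is expected for a numeric measurement stored as text, so this alert says more about the column's type than about data quality.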
Out[84]:

saturn.columns["length_miles"].stats

stat value
n 354,770
nulls 282,957 (79.8%)
unique 3,795
len_min 3
len_max 8
len_mean 3.688
len_median 3
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 68,018
duplicate_rate 0.9472
vocab_size 2,268
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: null_rate 79.8% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 94.7% duplicate strings
Fig 31. Character-length distribution for length_miles.
Character-length distribution for length_miles (mean: 3.6875078328436355).
chars | count
3 | 47630
4 | 11687
5 | 795
6 | 10712
7 | 986
8 | 3

width_yards categorical feature

This column holds a width measured in yards, stored as a categorical type despite being numeric in nature. Its null count (282,957) matches that of length_miles, which suggests both fields come from the same source rows. Nearly 80% of values are null (null_rate = 0.7976), making missingness the dominant signal. Among the 71,813 non-null records, the most frequent values are mostly round numbers (10, 50, 100, 30, 20, 200…), suggesting manual or estimated entries rather than precise measurements. The top value '10' accounts for 20.2% of non-null rows, and with 437 unique values and an entropy ratio of 0.51, the distribution is moderately concentrated.

Treatment: Cast to numeric, investigate whether nulls are structurally missing (feature absent) or simply unrecorded before imputing or dropping; log-transform or bin for modelling given round-number clustering.

anthropic:default · confidence medium
Out[87]:

saturn.columns["width_yards"].stats

stat value
n 354,770
nulls 282,957 (79.8%)
unique 437
top_value 10
top_rate 0.2018
cardinality 437
entropy 4.493
entropy_ratio 0.5122
alert: null_rate 79.8% null
Fig 32. Top values for width_yards.
Top values for width_yards (20 unique shown, of 437 total).
value | count | share
10 | 14492 | 4.1%
50 | 10603 | 3.0%
100 | 7243 | 2.0%
30 | 4882 | 1.4%
20 | 4431 | 1.2%
200 | 3046 | 0.9%
25 | 2530 | 0.7%
150 | 2234 | 0.6%
75 | 2026 | 0.6%
40 | 2006 | 0.6%
300 | 1518 | 0.4%
33 | 1161 | 0.3%
17 | 1037 | 0.3%
400 | 977 | 0.3%
250 | 828 | 0.2%
23 | 812 | 0.2%
60 | 765 | 0.2%
440 | 682 | 0.2%
500 | 665 | 0.2%
80 | 605 | 0.2%

type categorical

Out[90]:

saturn.columns["type"].stats

stat value
n 354,770
nulls 349,767 (98.6%)
unique 1
top_value hot_spring
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: null_rate 98.6% null
alert: imbalance top value is 100.0% of rows
Fig 33. Top values for type.
Top values for type (1 unique shown, of 1 total).
value | count | share
hot_spring | 5003 | 1.4%

temperature categorical

Out[93]:

saturn.columns["temperature"].stats

stat value
n 354,770
nulls 349,767 (98.6%)
unique 44
top_value (blank)
top_rate 0.9742
cardinality 44
entropy 0.2566
entropy_ratio 0.04699
alert: long_tail 34 singleton categories
alert: null_rate 98.6% null
alert: imbalance top value is 97.4% of rows
Fig 34. Top values for temperature.
Top values for temperature (20 unique shown, of 44 total).
value | count | share
(blank) | 4874 | 1.4%
hot | 73 | 0.0%
90 | 4 | 0.0%
100 | 4 | 0.0%
21 | 3 | 0.0%
95 | 3 | 0.0%
37 | 2 | 0.0%
43 | 2 | 0.0%
28 | 2 | 0.0%
40 | 2 | 0.0%
37-39° | 1 | 0.0%
35-37 °C | 1 | 0.0%
58°C | 1 | 0.0%
52,1 | 1 | 0.0%
25-30 | 1 | 0.0%
98°C | 1 | 0.0%
40-43° | 1 | 0.0%
77 | 1 | 0.0%
25°C | 1 | 0.0%
52 | 1 | 0.0%

source categorical metadata

This column records the data provider or attribution source for each row, with only 4 distinct values drawn from named external datasets (OpenStreetMap contributors, The Megalithic Portal, NOAA Storm Events Database, OpenStreetMap). The most striking signal is a 51.56% null rate — meaning over half of all 354,770 rows carry no source attribution, which is a data quality concern for provenance tracking. The top value 'OpenStreetMap contributors' accounts for 51.44% of non-null rows (88,396 records), while the closely related 'OpenStreetMap' (8,656 records) suggests inconsistent attribution for the same upstream source.

Treatment: Consolidate 'OpenStreetMap contributors' and 'OpenStreetMap' into a single category, investigate and impute or flag the 51.56% nulls before using as a stratification or filter variable.

anthropic:default · confidence high
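
The consolidation step in the treatment amounts to mapping both OpenStreetMap labels onto one canonical value before counting. The sketch below assumes the two labels really do refer to the same upstream source, which the interpretation suggests but the stats alone do not prove; the mapping and function name are hypothetical.

```python
from collections import Counter

# Assumption: both labels attribute the same upstream source.
CANONICAL_SOURCE = {"OpenStreetMap contributors": "OpenStreetMap"}


def consolidate(counts: Counter) -> Counter:
    """Merge counts for labels that map to the same canonical source."""
    merged: Counter = Counter()
    for label, count in counts.items():
        merged[CANONICAL_SOURCE.get(label, label)] += count
    return merged


# Non-null counts reported for source.
source_counts = Counter({
    "OpenStreetMap contributors": 88396,
    "The Megalithic Portal": 60028,
    "NOAA Storm Events Database": 14770,
    "OpenStreetMap": 8656,
})
merged = consolidate(source_counts)
print(merged.most_common())
```

After merging, three categories remain and the combined OpenStreetMap share of non-null rows rises from 51.4% to about 56.5%.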
Out[96]:

saturn.columns["source"].stats

stat value
n 354,770
nulls 182,920 (51.6%)
unique 4
top_value OpenStreetMap contributors
top_rate 0.5144
cardinality 4
entropy 1.545
entropy_ratio 0.7724
alert: null_rate 51.6% null
Fig 35. Top values for source.
Top values for source (4 unique shown, of 4 total).
value | count | share
OpenStreetMap contributors | 88396 | 24.9%
The Megalithic Portal | 60028 | 16.9%
NOAA Storm Events Database | 14770 | 4.2%
OpenStreetMap | 8656 | 2.4%

vessel_type categorical

Out[99]:

saturn.columns["vessel_type"].stats

stat value
n 354,770
nulls 351,117 (99.0%)
unique 23
top_value (blank)
top_rate 0.9064
cardinality 23
entropy 0.5764
entropy_ratio 0.1274
alert: long_tail 14 singleton categories
alert: null_rate 99.0% null
Fig 36. Top values for vessel_type.
Top values for vessel_type (20 unique shown, of 23 total).
value | count | share
(blank) | 3311 | 0.9%
ship | 275 | 0.1%
submarine | 18 | 0.0%
aircraft | 16 | 0.0%
plane | 10 | 0.0%
boat | 3 | 0.0%
schooner | 2 | 0.0%
car | 2 | 0.0%
sailboat | 2 | 0.0%
steamer | 1 | 0.0%
airplane | 1 | 0.0%
freightcar | 1 | 0.0%
train | 1 | 0.0%
paddle steamer | 1 | 0.0%
vehicle | 1 | 0.0%
motorbike | 1 | 0.0%
helicopter | 1 | 0.0%
Steam hoist | 1 | 0.0%
tractor | 1 | 0.0%
Airplane | 1 | 0.0%

cargo categorical

Out[102]:

saturn.columns["cargo"].stats

stat value
n 354,770
nulls 351,117 (99.0%)
unique 17
top_value (blank)
top_rate 0.9943
cardinality 17
entropy 0.07302
entropy_ratio 0.01786
alert: long_tail 13 singleton categories
alert: null_rate 99.0% null
alert: imbalance top value is 99.4% of rows
Fig 37. Top values for cargo.
Top values for cargo (17 unique shown, of 17 total).
value | count | share
(blank) | 3632 | 1.0%
human | 4 | 0.0%
timber | 2 | 0.0%
coal | 2 | 0.0%
fertilizer | 1 | 0.0%
ore pellets | 1 | 0.0%
Fischkutter (Stahl) | 1 | 0.0%
seafood | 1 | 0.0%
fish | 1 | 0.0%
passengers | 1 | 0.0%
mexican army supposed drugs, but the crew and cargo was not found | 1 | 0.0%
iron ore | 1 | 0.0%
pulp | 1 | 0.0%
18 mines, 6 torpedos | 1 | 0.0%
sugar | 1 | 0.0%
containers;vehicles | 1 | 0.0%
container;oil | 1 | 0.0%

peak_brightness_altitude_km categorical

Out[105]:

saturn.columns["peak_brightness_altitude_km"].stats

stat value
n 354,770
nulls 354,193 (99.8%)
unique 224
top_value 37.0
top_rate 0.06066
cardinality 224
entropy 7.187
entropy_ratio 0.9206
alert: null_rate 99.8% null
Fig 38. Top values for peak_brightness_altitude_km.
Top values for peak_brightness_altitude_km (20 unique shown, of 224 total).
value | count | share
37.0 | 35 | 0.0%
31.5 | 15 | 0.0%
33.3 | 15 | 0.0%
38.0 | 11 | 0.0%
29.6 | 11 | 0.0%
35.2 | 11 | 0.0%
40.7 | 11 | 0.0%
32.0 | 10 | 0.0%
26.0 | 10 | 0.0%
32.4 | 9 | 0.0%
42.0 | 8 | 0.0%
26.5 | 8 | 0.0%
33.0 | 8 | 0.0%
25.0 | 7 | 0.0%
50.0 | 7 | 0.0%
36.0 | 7 | 0.0%
35.0 | 6 | 0.0%
39.0 | 6 | 0.0%
28.7 | 6 | 0.0%
37 | 6 | 0.0%

velocity_km_s categorical

Out[108]:

saturn.columns["velocity_km_s"].stats

stat value
n 354,770
nulls 354,421 (99.9%)
unique 158
top_value 13.6
top_rate 0.01719
cardinality 158
entropy 7.052
entropy_ratio 0.9656
alert: null_rate 99.9% null
Fig 39. Top values for velocity_km_s.
Top values for velocity_km_s (20 unique shown, of 158 total).
value | count | share
13.6 | 6 | 0.0%
15.2 | 6 | 0.0%
16.9 | 6 | 0.0%
17.8 | 5 | 0.0%
20.1 | 5 | 0.0%
17.4 | 5 | 0.0%
13.1 | 5 | 0.0%
16.2 | 5 | 0.0%
19.8 | 5 | 0.0%
16.5 | 5 | 0.0%
15.9 | 5 | 0.0%
14.1 | 5 | 0.0%
18.1 | 5 | 0.0%
14.9 | 5 | 0.0%
12.9 | 5 | 0.0%
12.2 | 5 | 0.0%
19.6 | 4 | 0.0%
17.0 | 4 | 0.0%
14.4 | 4 | 0.0%
18.3 | 4 | 0.0%

energy_joules categorical

Out[111]:

saturn.columns["energy_joules"].stats

stat value
n 354,770
nulls 353,907 (99.8%)
unique 518
top_value 2.1
top_rate 0.01738
cardinality 518
entropy 8.634
entropy_ratio 0.9576
alert: long_tail 361 singleton categories
alert: null_rate 99.8% null
Fig 40. Top values for energy_joules.
Top values for energy_joules (20 unique shown, of 518 total).
value | count | share
2.1 | 15 | 0.0%
2.0 | 13 | 0.0%
3.2 | 13 | 0.0%
3.0 | 10 | 0.0%
2.8 | 8 | 0.0%
2.3 | 8 | 0.0%
3.5 | 8 | 0.0%
2.7 | 8 | 0.0%
2.2 | 8 | 0.0%
4.1 | 8 | 0.0%
3.3 | 7 | 0.0%
2.5 | 7 | 0.0%
4.0 | 6 | 0.0%
2.9 | 6 | 0.0%
3.1 | 6 | 0.0%
5.8 | 6 | 0.0%
10.4 | 6 | 0.0%
11.8 | 6 | 0.0%
3.6 | 5 | 0.0%
4.4 | 5 | 0.0%

event_type categorical

Out[114]:

saturn.columns["event_type"].stats

stat value
n 354,770
nulls 340,000 (95.8%)
unique 17
top_value Tornado
top_rate 0.4288
cardinality 17
entropy 2.336
entropy_ratio 0.5715
alert: null_rate 95.8% null
Fig 41. Top values for event_type.
Top values for event_type (17 unique shown, of 17 total).
value | count | share
Tornado | 6334 | 1.8%
Flash Flood | 2358 | 0.7%
Thunderstorm Wind | 2257 | 0.6%
Flood | 1777 | 0.5%
Hail | 1246 | 0.4%
Lightning | 574 | 0.2%
Heavy Rain | 99 | 0.0%
Marine Strong Wind | 43 | 0.0%
Debris Flow | 43 | 0.0%
Marine Thunderstorm Wind | 25 | 0.0%
Marine High Wind | 5 | 0.0%
Dust Devil | 3 | 0.0%
Waterspout | 2 | 0.0%
Tropical Storm | 1 | 0.0%
High Wind | 1 | 0.0%
Heat | 1 | 0.0%
Marine Lightning | 1 | 0.0%

damage_property text

Out[117]:

saturn.columns["damage_property"].stats

stat value
n 354,770
nulls 340,000 (95.8%)
unique 1,014
len_min 0
len_max 8
len_mean 4.381
len_median 5
len_p95 7
word_mean 1
word_median 1
n_empty 368
n_duplicates 13,756
duplicate_rate 0.9313
vocab_size 1,013
readability_flesch_mean 117
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.8724
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 87.2% rows are all-caps
alert: null_rate 95.8% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 93.1% duplicate strings
Fig 42. Character-length distribution for damage_property.
Character-length distribution for damage_property (mean: 4.380568720379147).
chars | count
0 | 368
1 | 264
2 | 1252
3 | 1172
4 | 3414
5 | 6075
6 | 1450
7 | 514
8 | 261

cave_type categorical

Out[120]:

saturn.columns["cave_type"].stats

stat value
n 354,770
nulls 354,729 (100.0%)
unique 5
top_value pit
top_rate 0.878
cardinality 5
entropy 0.7693
entropy_ratio 0.3313
alert: long_tail 3 singleton categories
alert: null_rate 100.0% null
Fig 43. Top values for cave_type.
Top values for cave_type (5 unique shown, of 5 total).
value | count | share
pit | 36 | 0.0%
ponor | 2 | 0.0%
showcave | 1 | 0.0%
sinkhole | 1 | 0.0%
overhang | 1 | 0.0%

cave_length_m categorical

Out[123]:

saturn.columns["cave_length_m"].stats

stat value
n 354,770
nulls 354,128 (99.8%)
unique 237
top_value 5
top_rate 0.04984
cardinality 237
entropy 6.919
entropy_ratio 0.877
alert: long_tail 158 singleton categories
alert: null_rate 99.8% null
Fig 44. Top values for cave_length_m.
Top values for cave_length_m (20 unique shown, of 237 total).
value | count | share
5 | 32 | 0.0%
6 | 26 | 0.0%
10 | 25 | 0.0%
3 | 23 | 0.0%
4 | 23 | 0.0%
7 | 20 | 0.0%
8 | 19 | 0.0%
15 | 16 | 0.0%
20 | 14 | 0.0%
12 | 13 | 0.0%
30 | 13 | 0.0%
2 | 11 | 0.0%
11 | 9 | 0.0%
60 | 8 | 0.0%
4.5 | 8 | 0.0%
13 | 8 | 0.0%
16 | 8 | 0.0%
17 | 8 | 0.0%
25 | 8 | 0.0%
9 | 7 | 0.0%

cave_depth_m categorical

Out[126]:

saturn.columns["cave_depth_m"].stats

stat value
n 354,770
nulls 354,472 (99.9%)
unique 124
top_value 0
top_rate 0.2114
cardinality 124
entropy 5.797
entropy_ratio 0.8336
alert: long_tail 88 singleton categories
alert: null_rate 99.9% null
Fig 45. Top values for cave_depth_m.
Top values for cave_depth_m (20 unique shown, of 124 total).
value | count | share
0 | 63 | 0.0%
10 | 13 | 0.0%
3 | 11 | 0.0%
1 | 9 | 0.0%
5 | 9 | 0.0%
4 | 8 | 0.0%
25 | 7 | 0.0%
30 | 6 | 0.0%
6 | 6 | 0.0%
2 | 6 | 0.0%
11 | 5 | 0.0%
35 | 5 | 0.0%
28 | 4 | 0.0%
14 | 4 | 0.0%
40 | 4 | 0.0%
70 | 4 | 0.0%
12 | 3 | 0.0%
8 | 3 | 0.0%
15 | 3 | 0.0%
23 | 3 | 0.0%

access categorical

Out[129]:

saturn.columns["access"].stats

stat value
n 354,770
nulls 347,515 (98.0%)
unique 20
top_value yes
top_rate 0.3795
cardinality 20
entropy 2.234
entropy_ratio 0.517
alert: null_rate 98.0% null
Fig 46. Top values for access.
Top values for access (20 unique shown, of 20 total).
value | count | share
yes | 2753 | 0.8%
no | 2279 | 0.6%
private | 830 | 0.2%
permit | 586 | 0.2%
permissive | 448 | 0.1%
customers | 273 | 0.1%
unknown | 51 | 0.0%
destination | 11 | 0.0%
restricted | 9 | 0.0%
tidal | 2 | 0.0%
request | 2 | 0.0%
key | 2 | 0.0%
discouraged | 2 | 0.0%
designated | 1 | 0.0%
official | 1 | 0.0%
forestry | 1 | 0.0%
agricultural | 1 | 0.0%
guided | 1 | 0.0%
university | 1 | 0.0%
cancello_all'ingresso | 1 | 0.0%

cave_ref text

Out[132]:

saturn.columns["cave_ref"].stats

stat value
n 354,770
nulls 347,184 (97.9%)
unique 7,162
len_min 1
len_max 38
len_mean 6.341
len_median 7
len_p95 8
word_mean 1.068
word_median 1
n_empty 0
n_duplicates 424
duplicate_rate 0.05589
vocab_size 7,005
readability_flesch_mean 117.8
emoji_rate 0
url_rate 0
one_word_rate 0.9359
allcaps_rate 0.8559
boilerplate_rate 0
alert: one_word 93.6% rows are a single word
alert: allcaps 85.6% rows are all-caps
alert: null_rate 97.9% null
alert: short_text 95th-percentile length under 20 chars
Fig 47. Character-length distribution for cave_ref.
Character-length distribution for cave_ref (mean: 6.340891115212233).
chars | count
1 – 2 | 36
2 – 3 | 210
3 – 4 | 356
4 – 5 | 954
5 – 6 | 412
6 – 7 | 1123
7 – 7 | 2763
7 – 8 | 1372
8 – 9 | 232
9 – 10 | 29
10 – 11 | 44
11 – 12 | 28
12 – 13 | 11
13 – 14 | 0
14 – 15 | 2
15 – 16 | 2
16 – 17 | 3
17 – 18 | 3
18 – 19 | 1
19 – 20 | 1
20 – 20 | 1
20 – 21 | 0
21 – 22 | 0
22 – 23 | 0
23 – 24 | 0
24 – 25 | 0
25 – 26 | 0
26 – 27 | 0
27 – 28 | 2
28 – 29 | 0
29 – 30 | 0
30 – 31 | 0
31 – 32 | 0
32 – 32 | 0
32 – 33 | 0
33 – 34 | 0
34 – 35 | 0
35 – 36 | 0
36 – 37 | 0
37 – 38 | 1

osm_id numeric foreign_key

This column contains OpenStreetMap (OSM) numeric identifiers, likely referencing nodes, ways, or relations in the OSM database. The most striking issue is a 75.08% null rate across 354,770 rows, meaning only about one quarter of records carry an OSM linkage. There are 88,395 unique values among 88,396 non-null rows; this near-unique cardinality and the platykurtic distribution (kurtosis ≈ -1.23) are consistent with IDs drawn broadly across OSM's ID space (min ~1.3M, max ~13.5B), with no outliers detected.

Treatment: Left-join to OSM data on osm_type together with osm_id, since OSM IDs are unique only within an element type; filter out the 75.08% nulls first and investigate whether missingness is systematic before joining.

anthropic:default · confidence high
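
Because OSM identifiers are unique only within an element type, a join on osm_id is safer when keyed on the pair (osm_type, osm_id). The sketch below shows a left join with that composite key over a few hypothetical rows; the row contents, tags, and function name are illustrative and not taken from the dataset.

```python
from typing import Any, Optional

# Hypothetical rows; field names mirror the profiled columns.
places: list[dict[str, Any]] = [
    {"name": "A", "osm_type": "node", "osm_id": 42},
    {"name": "B", "osm_type": "way", "osm_id": 42},   # same id, different element type
    {"name": "C", "osm_type": None, "osm_id": None},  # no OSM linkage
]

# Hypothetical lookup keyed on (osm_type, osm_id).
osm_lookup: dict[tuple[str, int], dict[str, str]] = {
    ("node", 42): {"tag": "natural=spring"},
    ("way", 42): {"tag": "natural=cave_entrance"},
}


def left_join(rows: list[dict[str, Any]], lookup: dict) -> list[dict[str, Any]]:
    """Keep every input row; attach the matching OSM record, or None when unmatched."""
    joined = []
    for row in rows:
        match: Optional[dict] = None
        if row["osm_id"] is not None:
            match = lookup.get((row["osm_type"], row["osm_id"]))
        joined.append({**row, "osm": match})
    return joined


result = left_join(places, osm_lookup)
for row in result:
    print(row["name"], row["osm"])
```

Keying on the pair keeps rows A and B distinct even though they share the numeric id, and rows without an OSM linkage survive the join with an empty match.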
Out[135]:

saturn.columns["osm_id"].stats

stat value
n 354,770
nulls 266,374 (75.1%)
unique 88,395
min 1.334e+06
max 1.347e+10
mean 6.183e+09
median 6.047e+09
std 3.993e+09
q1 2.628e+09
q3 9.53e+09
iqr 6.903e+09
skew 0.1321
kurtosis -1.228
n_outliers 0
outlier_rate 0
zero_rate 0
alert: null_rate 75.1% null
Fig 48. Distribution of osm_id. Vertical dash marks the median.
Histogram bins for osm_id (median: 6047018322.5).
bin | count
1.334e+06 – 3.38e+08 | 3968
3.38e+08 – 6.748e+08 | 2767
6.748e+08 – 1.011e+09 | 3230
1.011e+09 – 1.348e+09 | 3501
1.348e+09 – 1.685e+09 | 3041
1.685e+09 – 2.022e+09 | 2397
2.022e+09 – 2.358e+09 | 1663
2.358e+09 – 2.695e+09 | 1794
2.695e+09 – 3.032e+09 | 2343
3.032e+09 – 3.368e+09 | 2457
3.368e+09 – 3.705e+09 | 2751
3.705e+09 – 4.042e+09 | 2046
4.042e+09 – 4.379e+09 | 1826
4.379e+09 – 4.715e+09 | 2556
4.715e+09 – 5.052e+09 | 2957
5.052e+09 – 5.389e+09 | 2055
5.389e+09 – 5.725e+09 | 1481
5.725e+09 – 6.062e+09 | 1420
6.062e+09 – 6.399e+09 | 2234
6.399e+09 – 6.736e+09 | 1597
6.736e+09 – 7.072e+09 | 2035
7.072e+09 – 7.409e+09 | 2535
7.409e+09 – 7.746e+09 | 2516
7.746e+09 – 8.082e+09 | 1909
8.082e+09 – 8.419e+09 | 2018
8.419e+09 – 8.756e+09 | 1425
8.756e+09 – 9.092e+09 | 2194
9.092e+09 – 9.429e+09 | 1852
9.429e+09 – 9.766e+09 | 3471
9.766e+09 – 1.01e+10 | 2391
1.01e+10 – 1.044e+10 | 1034
1.044e+10 – 1.078e+10 | 1192
1.078e+10 – 1.111e+10 | 1627
1.111e+10 – 1.145e+10 | 3575
1.145e+10 – 1.179e+10 | 1447
1.179e+10 – 1.212e+10 | 1836
1.212e+10 – 1.246e+10 | 1355
1.246e+10 – 1.28e+10 | 2246
1.28e+10 – 1.313e+10 | 1582
1.313e+10 – 1.347e+10 | 2072

osm_type categorical feature

This column stores the OpenStreetMap element type, taking only three possible values: 'node', 'way', and 'relation'. Two signals demand attention. First, 75.08% of the 354,770 rows are null, so OSM type is recorded for roughly a quarter of records. Second, the non-null distribution is severely imbalanced: 'node' accounts for 96.39% of non-null entries (85,204 occurrences) versus 2,560 'way' and just 632 'relation'. The low entropy ratio (0.158) confirms this column carries little discriminative information as-is.

Treatment: Impute nulls as a distinct 'unknown' category, then one-hot encode; consider whether the 'way'/'relation' minority classes carry signal worth preserving or should be collapsed.

anthropic:default · confidence high
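
The encoding suggested in the treatment can be sketched without any library: map nulls (and any unexpected value) to an explicit 'unknown' category, then emit one indicator per category. The output key names are hypothetical.

```python
from typing import Optional

KNOWN_TYPES = ("node", "way", "relation")
CATEGORIES = KNOWN_TYPES + ("unknown",)


def one_hot_osm_type(value: Optional[str]) -> dict[str, int]:
    """One-hot encode osm_type, routing nulls and unexpected values to 'unknown'."""
    label = value if value in KNOWN_TYPES else "unknown"
    return {f"osm_type_{category}": int(category == label) for category in CATEGORIES}


encoded_way = one_hot_osm_type("way")
encoded_null = one_hot_osm_type(None)
print(encoded_way)
print(encoded_null)
```

With 75% of rows null, the 'unknown' indicator will dominate; it effectively flags whether a row came from an OSM-derived source at all.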
Out[138]:

saturn.columns["osm_type"].stats

stat value
n 354,770
nulls 266,374 (75.1%)
unique 3
top_value node
top_rate 0.9639
cardinality 3
entropy 0.2501
entropy_ratio 0.1578
alert: null_rate 75.1% null
alert: imbalance top value is 96.4% of rows
Fig 49. Top values for osm_type.
Top values for osm_type (3 unique shown, of 3 total).
value | count | share
node | 85204 | 24.0%
way | 2560 | 0.7%
relation | 632 | 0.2%

place_type categorical label

This column captures the settlement/place classification type, likely from an OpenStreetMap-style geographic dataset, with values such as 'hamlet', 'isolated_dwelling', 'village', and 'town'. The most striking signal is the extreme null rate of 94.88%, meaning only 18,154 of 354,770 rows carry a value. Among populated rows, 'hamlet' dominates at 66.57% of non-null values. The raw 'yes' tag (131 occurrences) and the misspelling 'hamtel' (1 occurrence) indicate uncleaned OSM data that needs remediation.

Treatment: Filter or impute nulls before use; remap 'yes' and other dirty values; treat as low-cardinality categorical with one-hot or ordinal encoding reflecting settlement hierarchy.

anthropic:default · confidence high
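
The remapping and ordinal encoding in the treatment might look like the sketch below. Both tables are assumptions made for illustration: treating 'yes' as missing and 'hamtel' as a typo for 'hamlet' are judgment calls, and the settlement ranking covers only five common values and should be reviewed before use.

```python
from typing import Optional

# Assumption: 'yes' carries no place type; 'hamtel' is a misspelling of 'hamlet'.
REMAP: dict[str, Optional[str]] = {"yes": None, "hamtel": "hamlet"}

# Assumption: a coarse settlement hierarchy, smallest to largest.
SETTLEMENT_RANK = {"isolated_dwelling": 0, "hamlet": 1, "village": 2, "town": 3, "city": 4}


def clean_place_type(raw: Optional[str]) -> Optional[str]:
    """Apply the remap table; nulls stay null."""
    if raw is None:
        return None
    return REMAP.get(raw, raw)


def settlement_rank(raw: Optional[str]) -> Optional[int]:
    """Ordinal rank for ranked settlement types, else None (e.g. 'farm', 'island')."""
    cleaned = clean_place_type(raw)
    return SETTLEMENT_RANK.get(cleaned) if cleaned is not None else None


ranks = [settlement_rank(v) for v in ["hamlet", "hamtel", "yes", "town", "farm", None]]
print(ranks)
```

Values outside the hierarchy return None rather than a guessed rank, so they can be one-hot encoded or grouped separately.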
Out[141]:

saturn.columns["place_type"].stats

stat value
n 354,770
nulls 336,616 (94.9%)
unique 48
top_value hamlet
top_rate 0.6657
cardinality 48
entropy 1.498
entropy_ratio 0.2682
alert: long_tail 31 singleton categories
alert: null_rate 94.9% null
Fig 50. Top values for place_type.
Top values for place_type (20 unique shown, of 48 total).
value | count | share
hamlet | 12086 | 3.4%
isolated_dwelling | 2977 | 0.8%
village | 2388 | 0.7%
locality | 251 | 0.1%
yes | 131 | 0.0%
farm | 122 | 0.0%
neighbourhood | 73 | 0.0%
town | 38 | 0.0%
suburb | 23 | 0.0%
quarter | 7 | 0.0%
square | 7 | 0.0%
island | 4 | 0.0%
local | 4 | 0.0%
allotments | 4 | 0.0%
house | 3 | 0.0%
city | 3 | 0.0%
islet | 2 | 0.0%
county | 1 | 0.0%
bus_station | 1 | 0.0%
hamtel | 1 | 0.0%

abandoned_year categorical

Out[144]:

saturn.columns["abandoned_year"].stats

stat value
n 354,770
nulls 353,545 (99.7%)
unique 147
top_value yes
top_rate 0.4359
cardinality 147
entropy 2.939
entropy_ratio 0.4082
alert: long_tail 97 singleton categories
alert: null_rate 99.7% null
Fig 51. Top values for abandoned_year.
Top values for abandoned_year (20 unique shown, of 147 total).
value | count | share
yes | 534 | 0.2%
village | 433 | 0.1%
2022 | 20 | 0.0%
1986 | 11 | 0.0%
hamlet | 9 | 0.0%
1978 | 6 | 0.0%
1974 | 6 | 0.0%
1987 | 5 | 0.0%
2023 | 4 | 0.0%
1950 | 4 | 0.0%
1985 | 4 | 0.0%
1983 | 3 | 0.0%
~1500 | 3 | 0.0%
2013-12-02 | 3 | 0.0%
isolated_dwelling | 3 | 0.0%
2022-12-26 | 3 | 0.0%
1946 | 3 | 0.0%
1938 | 3 | 0.0%
1955 | 3 | 0.0%
1962 | 3 | 0.0%

abandoned_reason unknown label

This column appears to hold abandoned-reason labels. Given the ghost-town records in this file, it most likely records why a place was abandoned, though no values were profiled to confirm this. The profiler emitted a 'skipped' alert with no stats or uniqueness counts, meaning the column's type could not be resolved and no frequency analysis was performed. The stats table reports 354,770 rows and a null rate of 0.0, but with profiling skipped, the column's true coverage, cardinality, distribution, and value content are unknown from this evidence.

Treatment: Re-profile with explicit string/categorical typing to recover value counts and cardinality before any downstream use.

anthropic:default · confidence low
Out[147]:

saturn.columns["abandoned_reason"].stats

stat value
n 354,770
nulls 0 (0.0%)
unique
alert: skipped no profiler for kind=unknown

former_population categorical

Out[149]:

saturn.columns["former_population"].stats

stat value
n 354,770
nulls 352,243 (99.3%)
unique 75
top_value 0
top_rate 0.4951
cardinality 75
entropy 2.605
entropy_ratio 0.4182
alert: null_rate 99.3% null
Fig 52. Top values for former_population.
Top values for former_population (20 unique shown, of 75 total).
value | count | share
0 | 1251 | 0.4%
2010 | 574 | 0.2%
2024 | 244 | 0.1%
2010-10-14 | 81 | 0.0%
2008 | 69 | 0.0%
2021-08-31 | 53 | 0.0%
1999 | 37 | 0.0%
2009 | 12 | 0.0%
1 | 11 | 0.0%
2021 | 11 | 0.0%
2021-09-01 | 10 | 0.0%
2021-10-01 | 10 | 0.0%
1989 | 9 | 0.0%
2018-01-01 | 9 | 0.0%
2005 | 8 | 0.0%
2016 | 7 | 0.0%
10 | 7 | 0.0%
3 | 6 | 0.0%
2001 | 6 | 0.0%
2 | 6 | 0.0%

heritage categorical

Out[152]:

saturn.columns["heritage"].stats

stat value
n 354,770
nulls 354,763 (100.0%)
unique 6
top_value 2
top_rate 0.2857
cardinality 6
entropy 2.522
entropy_ratio 0.9755
alert: long_tail 5 singleton categories
alert: null_rate 100.0% null
Fig 53. Top values for heritage.
Top values for heritage (6 unique shown, of 6 total).
value | count | share
2 | 2 | 0.0%
8 | 1 | 0.0%
yes | 1 | 0.0%
4 | 1 | 0.0%
district | 1 | 0.0%
3 | 1 | 0.0%

How to cite

BibTeX
@misc{saturn-data-trove-strange-places-v5-2-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove strange places v5 2},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-strange-places-v5-2}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove strange places v5 2. Source: /home/coolhand/html/datavis/data_trove/data/quirky/strange_places_v5.2.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-strange-places-v5-2