data-trove-us-disasters-mashup

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/disasters/disasters_mashup.json

Saturn profiled 54,575 rows across 16 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

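# Install the Saturn profiler (notebook shell magic).
!pip install saturn-dissect
import subprocess

# Run the Saturn CLI against the source JSON with the flags used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/disasters/disasters_mashup.json",
    "--findings", "data-trove-us-disasters-mashup.json",
    "--llm", "anthropic:default",
])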

Summary confidence: medium

This dataset is a multi-hazard disaster event mashup of 54,575 records spanning aviation accidents, storms, earthquakes, and shipwrecks, each geolocated with latitude and longitude. Aviation accidents dominate at 59.4% of all records, with Cessna models the most frequently involved aircraft; it is worth examining whether this reflects true prevalence or a reporting or sourcing bias. A second area of interest is the severity data: fatalities, injuries, and damage each carry a 72.9% null rate, so consequence analysis is limited to roughly a quarter of the dataset (14,770 rows, the same count as the storm category, which suggests these fields are populated only for storm records), and the recorded values are skewed toward zero-casualty events. The storm subcategory breakdown (Tornado, Flash Flood, Thunderstorm Wind) also deserves a closer look for geographic and seasonal clustering, given the strong US state representation.

citing: category.top_values · subcategory.top_values · fatalities.null_rate · injuries.null_rate · damage.null_rate · aircraft_type.top_words · state.top_values · fatalities.top_values · category.stats.top_rate

Out[4]:

saturn.schema() · 16 columns

column	kind	n	null%	unique	alerts
category	categorical	54,575	0.0%	4
latitude	numeric	54,575	0.0%	32,209	high_skew outliers
longitude	numeric	54,575	0.0%	34,804	high_skew outliers
name	text	54,575	0.0%	20,587	multilingual duplicates
date	text	54,575	5.7%	9,264	one_word allcaps short_text duplicates
subcategory	categorical	54,575	0.0%	20
magnitude	categorical	54,575	80.1%	291	null_rate
fatalities	categorical	54,575	72.9%	49	null_rate
injuries	categorical	54,575	72.9%	178	null_rate
damage	text	54,575	72.9%	1,014	one_word allcaps null_rate short_text duplicates
state	categorical	54,575	72.9%	65	null_rate
aircraft_type	text	54,575	40.6%	9,478	allcaps null_rate duplicates
event_id	text	54,575	40.6%	26,427	one_word allcaps null_rate short_text
vessel_type	categorical	54,575	93.3%	23	long_tail null_rate
cargo	categorical	54,575	93.3%	17	long_tail null_rate imbalance
depth_km	unknown	54,575	0.0%	—	skipped

Fig 1.

category · Look for how heavily aviation accidents (59%) outweigh storms, earthquakes, and shipwrecks combined.

Top values for category (4 unique shown, of 4 total).
value	count	share
aviation_accident	32410	59.4%
storm	14770	27.1%
earthquake	3742	6.9%
shipwreck	3653	6.7%

Fig 2.

subcategory · Check whether aviation and Tornado dominate all other subcategories, signalling potential source imbalance.

Top values for subcategory (20 unique shown, of 20 total).
value	count	share
aviation	32410	59.4%
Tornado	6334	11.6%
seismic	3742	6.9%
maritime	3653	6.7%
Flash Flood	2358	4.3%
Thunderstorm Wind	2257	4.1%
Flood	1777	3.3%
Hail	1246	2.3%
Lightning	574	1.1%
Heavy Rain	99	0.2%
Marine Strong Wind	43	0.1%
Debris Flow	43	0.1%
Marine Thunderstorm Wind	25	0.0%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

Fig 3.

state · Texas leads by a wide margin — look for whether the top states reflect tornado-prone and storm-prone regions of the US.

Top values for state (20 unique shown, of 65 total).
value	count	share
TEXAS	1450	2.7%
MISSOURI	648	1.2%
ARKANSAS	602	1.1%
MISSISSIPPI	570	1.0%
GEORGIA	562	1.0%
ILLINOIS	560	1.0%
IOWA	527	1.0%
LOUISIANA	507	0.9%
TENNESSEE	499	0.9%
FLORIDA	498	0.9%
OKLAHOMA	490	0.9%
NEBRASKA	486	0.9%
ALABAMA	469	0.9%
WISCONSIN	463	0.8%
OHIO	441	0.8%
MICHIGAN	426	0.8%
NORTH CAROLINA	422	0.8%
KANSAS	418	0.8%
INDIANA	408	0.7%
KENTUCKY	383	0.7%

Fig 4.

fatalities · The vast majority of recorded events show zero fatalities; look for how sharply the tail drops off beyond 1–2 deaths.

Top values for fatalities (20 unique shown, of 49 total).
value	count	share
0	10209	18.7%
1	3208	5.9%
2	649	1.2%
3	222	0.4%
4	112	0.2%
5	74	0.1%
6	66	0.1%
7	38	0.1%
9	25	0.0%
10	24	0.0%
8	21	0.0%
11	20	0.0%
13	11	0.0%
16	10	0.0%
12	9	0.0%
14	8	0.0%
17	6	0.0%
20	6	0.0%
25	4	0.0%
23	3	0.0%

Fig 5.

date · Event counts cluster around 2002–2012 — look for whether coverage drops off in more recent or earlier years.

Character-length distribution for date (mean: 9.98; one row per string length).
chars	count
2	1
3	0
4	151
5	13
6	1
7	19
8	5
9	3
10	51248

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
category	categorical	0.0%
latitude	numeric	0.0%
longitude	numeric	0.0%
name	text	0.0%
date	text	5.7%
subcategory	categorical	0.0%
magnitude	categorical	80.1%
fatalities	categorical	72.9%
injuries	categorical	72.9%
damage	text	72.9%
state	categorical	72.9%
aircraft_type	text	40.6%
event_id	text	40.6%
vessel_type	categorical	93.3%
cargo	categorical	93.3%
depth_km	unknown	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,964 detected strings).
lang	count	share
en	4726	95.2%
fr	60	1.2%
de	58	1.2%
es	46	0.9%
ja	32	0.6%
it	13	0.3%
ru	7	0.1%
zh	6	0.1%
eu	3	0.1%
pt	3	0.1%
id	3	0.1%
pl	2	0.0%
sr	1	0.0%
sv	1	0.0%
ht	1	0.0%
uk	1	0.0%
lv	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	latitude	longitude
latitude	+1.00	-0.45
longitude	-0.45	+1.00

category categorical label

This column is a disaster/incident type label with exactly 4 categories: aviation_accident, storm, earthquake, and shipwreck. The distribution is notably skewed: aviation_accident accounts for 59.4% of all 54,575 rows (32,410 records), while earthquake (6.9%) and shipwreck (6.7%) are the minority classes. The entropy ratio of 0.74 (1.483 bits against a maximum of 2 bits for four classes) confirms a meaningful but unbalanced spread across classes, which could bias classifiers trained on this target without resampling.

Treatment: Use as classification target; apply class-weighting or oversampling for minority classes (earthquake, shipwreck) before modelling.
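The class-weighting suggestion can be sketched with the common "balanced" heuristic, n_samples / (n_classes * class_count). This is an illustration only: the counts are copied from the category table, and the function name `balanced_class_weights` is hypothetical, not part of Saturn.

```python
# Row counts per class, copied from the category table in this report.
counts = {
    "aviation_accident": 32410,
    "storm": 14770,
    "earthquake": 3742,
    "shipwreck": 3653,
}


def balanced_class_weights(class_counts):
    """Return n_samples / (n_classes * count) for each class (hypothetical helper)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * n) for label, n in class_counts.items()}


weights = balanced_class_weights(counts)
for label, weight in weights.items():
    # Rare classes receive weights above 1, the majority class below 1.
    print(f"{label:18s} {weight:.3f}")
```

With these weights every class contributes the same total weight to a loss function, which is one way to offset the aviation_accident majority without resampling.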

anthropic:default · confidence high

Out[14]:

saturn.columns["category"].stats

stat	value
n	54,575
nulls	0 (0.0%)
unique	4
top_value	aviation_accident
top_rate	0.5939
cardinality	4
entropy	1.483
entropy_ratio	0.7415
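
The entropy figures above can be reproduced from the category counts. The sketch below assumes entropy is measured in bits and that the ratio divides by log2 of the number of observed classes; that definition matches the reported values, but the profiler's actual implementation is not shown in this report.

```python
import math

# Category counts from the top-values table.
category_counts = [32410, 14770, 3742, 3653]


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of the distribution implied by the counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed classes."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)


print(round(shannon_entropy_bits(category_counts), 3))
print(round(entropy_ratio(category_counts), 4))
```

A ratio of 1.0 would mean four equally sized classes; lower values mean the mass is concentrated in fewer classes.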

Fig 9.

Top values for category.

Top values for category (4 unique shown, of 4 total).
value	count	share
aviation_accident	32410	59.4%
storm	14770	27.1%
earthquake	3742	6.9%
shipwreck	3653	6.7%

latitude numeric feature

This column contains geographic latitude values ranging from -77.42 to 82.17, consistent with global coordinate data. The median of 38.38 and IQR of 9.12 show that the bulk of records cluster at mid-latitudes in the Northern Hemisphere (roughly US/Europe), while the negative minimum indicates some Southern Hemisphere entries. The skew of -2.51 and kurtosis of 15.97 point to a long tail toward low and Southern Hemisphere latitudes. The 4,302 outliers (7.88% of rows) are not confined to that tail: with q1 = 33.65 and q3 = 42.77, the 1.5 × IQR fences sit at about 19.97 and 56.45, so high-latitude records above 56.45 are flagged as well. Both groups warrant geographic subsetting or anomaly review.

Treatment: Retain as-is for geo-spatial modelling; investigate the 4,302 outliers for data quality issues before binning or clustering by region.
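
To see which latitudes the "beyond 1.5 IQR" alert covers, the fences can be recomputed from the quartiles in the stats table. This is a minimal sketch; `iqr_fences` and `is_outlier` are illustrative names, and the sample latitudes are made up.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, low, high):
    """True when the value falls outside the fences."""
    return value < low or value > high


# Quartiles for latitude, copied from the stats table.
lat_low, lat_high = iqr_fences(33.65, 42.77)
print(round(lat_low, 2), round(lat_high, 2))

# Hypothetical sample points: a mid-latitude, a far-northern, and a southern value.
for lat in (38.38, 61.2, -33.9):
    print(lat, is_outlier(lat, lat_low, lat_high))
```

Because the fences are symmetric around the interquartile range, both far-northern and low or southern latitudes count as outliers under this rule.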

anthropic:default · confidence high

Out[17]:

saturn.columns["latitude"].stats

stat	value
n	54,575
nulls	0 (0.0%)
unique	32,209
min	-77.42
max	82.17
mean	38.16
median	38.38
std	11.96
q1	33.65
q3	42.77
iqr	9.12
skew	-2.51
kurtosis	15.97
n_outliers	4,302
outlier_rate	0.07883
zero_rate	0
alert: high_skew	skew=-2.51
alert: outliers	7.9% rows beyond 1.5 IQR

Fig 10.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 38.376667).
bin	count
-77.42 – -73.44	1
-73.44 – -69.45	0
-69.45 – -65.46	0
-65.46 – -61.47	1
-61.47 – -57.48	1
-57.48 – -53.49	5
-53.49 – -49.5	30
-49.5 – -45.51	22
-45.51 – -41.52	37
-41.52 – -37.53	79
-37.53 – -33.54	176
-33.54 – -29.55	103
-29.55 – -25.56	35
-25.56 – -21.57	108
-21.57 – -17.58	56
-17.58 – -13.59	20
-13.59 – -9.597	23
-9.597 – -5.607	41
-5.607 – -1.617	66
-1.617 – 2.373	64
2.373 – 6.363	28
6.363 – 10.35	171
10.35 – 14.34	70
14.34 – 18.33	112
18.33 – 22.32	346
22.32 – 26.31	895
26.31 – 30.3	4049
30.3 – 34.29	9547
34.29 – 38.28	10928
38.28 – 42.27	12642
42.27 – 46.26	7278
46.26 – 50.25	2535
50.25 – 54.24	1308
54.24 – 58.23	1095
58.23 – 62.22	1761
62.22 – 66.21	697
66.21 – 70.2	220
70.2 – 74.19	22
74.19 – 78.18	2
78.18 – 82.17	1

longitude numeric feature

This column contains geographic longitude values spanning -179.28° to +178.83°, consistent with worldwide coordinates. The mean (-92.97°) and median (-92.81°) nearly coincide at central United States longitudes, suggesting the bulk of records are North American, yet 4,320 outliers (7.9% of rows) and a kurtosis of 15.13 indicate a heavy-tailed distribution with a substantial minority of widely dispersed points. The positive skew of 2.84 confirms an asymmetric pull toward higher (less negative or positive) longitude values, i.e., locations east of the Americas. With an IQR of 29.86, the 1.5 × IQR fences fall near -157° and -37°, so the outlier count also includes far-western points beyond roughly -157°.

Treatment: Retain as-is for geospatial modelling; consider pairing with latitude and clustering by region to handle the heavy-tailed distribution before feeding into non-spatial models.

anthropic:default · confidence high

Out[20]:

saturn.columns["longitude"].stats

stat	value
n	54,575
nulls	0 (0.0%)
unique	34,804
min	-179.3
max	178.8
mean	-92.97
median	-92.81
std	39.5
q1	-112
q3	-82.18
iqr	29.86
skew	2.843
kurtosis	15.13
n_outliers	4,320
outlier_rate	0.07916
zero_rate	0
alert: high_skew	skew=+2.84
alert: outliers	7.9% rows beyond 1.5 IQR

Fig 11.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: -92.8126).
bin	count
-179.3 – -170.3	49
-170.3 – -161.4	1005
-161.4 – -152.4	1182
-152.4 – -143.5	1679
-143.5 – -134.5	289
-134.5 – -125.6	833
-125.6 – -116.6	6128
-116.6 – -107.7	4912
-107.7 – -98.71	3964
-98.71 – -89.76	10929
-89.76 – -80.8	12439
-80.8 – -71.85	7045
-71.85 – -62.9	1013
-62.9 – -53.94	178
-53.94 – -44.99	139
-44.99 – -36.04	143
-36.04 – -27.09	40
-27.09 – -18.13	15
-18.13 – -9.181	39
-9.181 – -0.2277	348
-0.2277 – 8.725	275
8.725 – 17.68	834
17.68 – 26.63	267
26.63 – 35.58	121
35.58 – 44.54	36
44.54 – 53.49	80
53.49 – 62.44	53
62.44 – 71.39	2
71.39 – 80.35	20
80.35 – 89.3	2
89.3 – 98.25	4
98.25 – 107.2	19
107.2 – 116.2	26
116.2 – 125.1	37
125.1 – 134.1	18
134.1 – 143	41
143 – 152	59
152 – 160.9	22
160.9 – 169.9	140
169.9 – 178.8	150

name text label

This column contains descriptive incident or event names, predominantly aviation accidents and natural disaster events (floods, tornadoes). The duplicate rate is high at 62.3% (33,988 duplicates, 20,587 unique values, 54,575 rows), largely driven by generic labels such as 'Unnamed Wreck' (2,184 occurrences) and repeated aircraft model patterns (e.g., 'Aviation Accident - CESSNA 172' variants). Detected-language tokens are 86.6% English, and the profiler's multilingual alert reports 18 languages in the sample (French: 60, German: 58, Spanish: 46, and Japanese: 32 are the most frequent after English), so language-aware processing may be needed.

Treatment: Normalize case variants (e.g., 'CESSNA 172' vs 'Cessna 172') before grouping or embedding; treat as a categorical label with high cardinality rather than free text.

anthropic:default · confidence high

Out[23]:

saturn.columns["name"].stats

stat	value
n	54,575
nulls	0 (0.0%)
unique	20,587
len_min	2
len_max	153
len_mean	32.38
len_median	31
len_p95	48
word_mean	4.782
word_median	5
n_empty	0
n_duplicates	33,988
duplicate_rate	0.6228
vocab_size	8,062
readability_flesch_mean	8.744
emoji_rate	0
url_rate	0
one_word_rate	0.007971
allcaps_rate	0.001264
boilerplate_rate	0
alert: multilingual	18 languages detected in sample
alert: duplicates	62.3% duplicate strings

Fig 12.

Character-length distribution for name.

Character-length distribution for name (mean: 32.38147503435639).
chars	count
2 – 6	129
6 – 10	357
10 – 13	2564
13 – 17	297
17 – 21	296
21 – 25	2319
25 – 28	6611
28 – 32	20097
32 – 36	7724
36 – 40	5480
40 – 44	3498
44 – 47	2132
47 – 51	1396
51 – 55	827
55 – 59	490
59 – 62	191
62 – 66	82
66 – 70	19
70 – 74	19
74 – 78	14
78 – 81	5
81 – 85	8
85 – 89	2
89 – 93	5
93 – 96	1
96 – 100	8
100 – 104	0
104 – 108	0
108 – 111	0
111 – 115	0
115 – 119	1
119 – 123	0
123 – 127	0
127 – 130	0
130 – 134	2
134 – 138	0
138 – 142	0
142 – 145	0
145 – 149	0
149 – 153	1

date text timestamp

This column contains date strings in ISO-8601 format (YYYY-MM-DD), stored as text rather than a native date type. Of the 51,441 non-null values, 51,248 (99.6%) are exactly 10 characters long; the remaining 193 are shorter (for example, 151 four-character values, plausibly year-only) and will not parse as full dates. Nearly all top values fall on January 1st of their respective years (2002–2012), suggesting dates are truncated or snapped to year-start and are likely not raw event timestamps. The duplicate rate is high at 81.99%, consistent with coarse granularity across 54,575 rows, while the 9,264 unique values show that finer dates do exist beyond the dominant Jan-1 entries. The null rate is low at 5.74%. The one_word and allcaps alerts are most likely a side effect of profiling digit-only strings as text.

Treatment: Parse to date type, investigate year-start snapping before using as a time feature, and consider extracting year as an ordinal variable.
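
One way to act on this treatment is to parse strictly and flag year-start values rather than trusting them. The sketch below uses only the standard library; `parse_event_date` is a hypothetical helper, and treating anything other than a 10-character YYYY-MM-DD string as unparseable is an assumption, not a rule taken from the profiler.

```python
from datetime import date, datetime


def parse_event_date(raw):
    """Return (parsed_date, is_year_start); parsed_date is None if unparseable."""
    if raw is None:
        return None, False
    text = raw.strip()
    # Only full YYYY-MM-DD strings are accepted; shorter values stay unparsed.
    if len(text) != 10:
        return None, False
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None, False
    return parsed, parsed.month == 1 and parsed.day == 1


# Illustrative inputs, not rows from the dataset.
for raw in ("2005-01-01", "2011-04-27", "2004", None):
    parsed, year_start = parse_event_date(raw)
    year = parsed.year if isinstance(parsed, date) else None
    print(raw, parsed, year_start, year)
```

Keeping the year-start flag as its own column lets later analysis decide whether Jan-1 rows should be used at day resolution or only at year resolution.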

anthropic:default · confidence high

Out[26]:

saturn.columns["date"].stats

stat	value
n	54,575
nulls	3,134 (5.7%)
unique	9,264
len_min	2
len_max	10
len_mean	9.98
len_median	10
len_p95	10
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	42,177
duplicate_rate	0.8199
vocab_size	4,710
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9985
allcaps_rate	0.9983
boilerplate_rate	0
alert: one_word	99.9% rows are a single word
alert: allcaps	99.8% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	82.0% duplicate strings

Fig 13.

Character-length distribution for date.

Character-length distribution for date (mean: 9.98; one row per string length).
chars	count
2	1
3	0
4	151
5	13
6	1
7	19
8	5
9	3
10	51248

subcategory categorical label

This column is a categorical event subcategory, most likely classifying incident or hazard reports across domains such as aviation, geophysical (seismic), meteorological (Tornado, Flash Flood, Thunderstorm Wind), and maritime events. 'aviation' dominates heavily at 59.4% of all 54,575 rows, creating pronounced class imbalance. A subtle data quality issue is present: some values use title case ('Tornado', 'Flash Flood', 'Thunderstorm Wind', 'Hail') while others are fully lowercase ('aviation', 'seismic', 'maritime'), suggesting records were ingested from at least two inconsistently formatted sources. Entropy ratio of 0.49 confirms the distribution is far from uniform.

Treatment: Normalize casing before use, then one-hot encode or target-encode accounting for the heavy 'aviation' majority (59.4%).
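
A minimal sketch of the casing fix followed by one-hot encoding, using plain Python. The helper names are illustrative, and lowercasing with underscores is just one possible convention.

```python
def normalize_label(raw):
    """Trim, lowercase, and join words with underscores."""
    return "_".join(raw.strip().lower().split())


def one_hot(label, vocabulary):
    """Return a 0/1 list with a single 1 at the label's position."""
    return [1 if label == entry else 0 for entry in vocabulary]


# A few values from the subcategory table, in their original mixed casing.
raw_values = ["aviation", "Tornado", "Flash Flood", "Thunderstorm Wind", "seismic"]
vocabulary = sorted({normalize_label(value) for value in raw_values})
encoded = one_hot(normalize_label("Flash Flood"), vocabulary)

print(vocabulary)
print(encoded)
```

Normalizing before building the vocabulary keeps case variants of the same label from becoming separate columns.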

anthropic:default · confidence high

Out[29]:

saturn.columns["subcategory"].stats

stat	value
n	54,575
nulls	0 (0.0%)
unique	20
top_value	aviation
top_rate	0.5939
cardinality	20
entropy	2.115
entropy_ratio	0.4894

Fig 14.

Top values for subcategory.

Top values for subcategory (20 unique shown, of 20 total).
value	count	share
aviation	32410	59.4%
Tornado	6334	11.6%
seismic	3742	6.9%
maritime	3653	6.7%
Flash Flood	2358	4.3%
Thunderstorm Wind	2257	4.1%
Flood	1777	3.3%
Hail	1246	2.3%
Lightning	574	1.1%
Heavy Rain	99	0.2%
Marine Strong Wind	43	0.1%
Debris Flow	43	0.1%
Marine Thunderstorm Wind	25	0.0%
Marine High Wind	5	0.0%
Dust Devil	3	0.0%
Waterspout	2	0.0%
Tropical Storm	1	0.0%
High Wind	1	0.0%
Heat	1	0.0%
Marine Lightning	1	0.0%

magnitude categorical feature

This column is labelled as a magnitude and is stored as a categorical/string type despite holding numeric measurements with 291 distinct values (e.g., 4.5, 4.6, 4.7). Three signals demand attention. First, the null rate is extremely high at 80.09%: only 10,864 of 54,575 rows carry a value. Second, the dominant value '0' accounts for 35.56% of non-null records (3,863 occurrences), which is likely a sentinel or placeholder rather than a true zero magnitude, since the next most frequent values cluster around 4.5–5.1. Third, the column does not appear to hold a single scale: the 10,864 non-null values exceed the 3,742 earthquake rows, and top values such as 50.00–70.00 lie far outside any earthquake magnitude scale, so some values probably come from non-seismic records in other units. Formatting is also inconsistent ('5' alongside '2.00').

Treatment: Cast to float after replacing '0' sentinel values with NaN, and split by category before treating the values as one scale; investigate whether the 80.09% nulls are structurally missing or a data quality issue before imputing or dropping rows.
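
The cast described in the treatment might look like the following. Treating '0' as missing is the report's hypothesis, not a confirmed fact, so the sketch exposes it as a switch; `parse_magnitude` is a hypothetical helper.

```python
import math


def parse_magnitude(raw, zero_is_missing=True):
    """Convert a magnitude string to float; NaN for missing or unparseable values."""
    if raw is None:
        return math.nan
    text = raw.strip()
    if not text:
        return math.nan
    try:
        value = float(text)
    except ValueError:
        return math.nan
    # Assumption from the report: '0' is a placeholder, not a measurement.
    if zero_is_missing and value == 0.0:
        return math.nan
    return value


# Illustrative inputs taken from the top-values table.
for raw in ("4.5", "0", "70.00", None):
    print(raw, parse_magnitude(raw))
```

Passing `zero_is_missing=False` keeps zeros as real values, which makes it easy to compare both readings before committing to one.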

anthropic:default · confidence high

Out[32]:

saturn.columns["magnitude"].stats

stat	value
n	54,575
nulls	43,711 (80.1%)
unique	291
top_value	0
top_rate	0.3556
cardinality	291
entropy	4.732
entropy_ratio	0.5781
alert: null_rate	80.1% null

Fig 15.

Top values for magnitude.

Top values for magnitude (20 unique shown, of 291 total).
value	count	share
0	3863	7.1%
4.5	686	1.3%
4.6	558	1.0%
4.7	415	0.8%
1.75	383	0.7%
4.8	317	0.6%
4.9	261	0.5%
5	238	0.4%
2.75	220	0.4%
5.1	202	0.4%
5.2	167	0.3%
70.00	162	0.3%
50.00	151	0.3%
2.00	150	0.3%
5.3	126	0.2%
2.50	123	0.2%
61.00	122	0.2%
65.00	104	0.2%
52.00	95	0.2%
5.4	95	0.2%

fatalities categorical feature

This column represents a fatality count per incident, stored as a categorical/string type despite being numeric in nature. The null rate is severe at 72.94% (39,805 rows), so nearly three-quarters of records have no value recorded; this is the primary alert. Among non-null values, the distribution is heavily right-skewed: '0' accounts for 69.1% of non-null rows, and counts drop sharply across the 49 distinct values, indicating that rare high-fatality events exist in the tail.

Treatment: Cast to integer, treat nulls as unknown (not zero), then apply log1p-transform or use as-is for count-based modelling given heavy zero-inflation.
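
A small sketch of this treatment, keeping unknown values distinct from zero. Both helper names are hypothetical, and returning None for unparseable strings is a design choice rather than something the profile prescribes.

```python
import math


def parse_count(raw):
    """Convert a count string to int; None means unknown, never zero."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    try:
        return int(text)
    except ValueError:
        return None


def log1p_or_none(count):
    """Apply log(1 + x) to known counts and pass unknowns through."""
    return None if count is None else math.log1p(count)


# Illustrative inputs: a zero, a tail value, and a missing value.
for raw in ("0", "25", None):
    count = parse_count(raw)
    print(raw, count, log1p_or_none(count))
```

log1p maps 0 to 0 and compresses the long tail, while None keeps the 72.9% of unknown rows out of any average.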

anthropic:default · confidence high

Out[35]:

saturn.columns["fatalities"].stats

stat	value
n	54,575
nulls	39,805 (72.9%)
unique	49
top_value	0
top_rate	0.6912
cardinality	49
entropy	1.423
entropy_ratio	0.2535
alert: null_rate	72.9% null

Fig 16.

Top values for fatalities.

Top values for fatalities (20 unique shown, of 49 total).
value	count	share
0	10209	18.7%
1	3208	5.9%
2	649	1.2%
3	222	0.4%
4	112	0.2%
5	74	0.1%
6	66	0.1%
7	38	0.1%
9	25	0.0%
10	24	0.0%
8	21	0.0%
11	20	0.0%
13	11	0.0%
16	10	0.0%
12	9	0.0%
14	8	0.0%
17	6	0.0%
20	6	0.0%
25	4	0.0%
23	3	0.0%

injuries categorical feature

This column represents an injury count per incident, stored as a categorical type despite containing integer values (0, 1, 2, …). The dominant concern is an extreme null rate of 72.94%, meaning nearly three-quarters of rows carry no injury data at all. Among non-null rows, the value '0' accounts for 68.14% of entries, indicating most recorded incidents involved no injuries, with a long tail spread over 178 distinct values that suggests occasional high-casualty outliers.

Treatment: Cast to integer, impute or flag nulls explicitly, then consider log-transform or treat as a count target given heavy zero-inflation.

anthropic:default · confidence high

Out[38]:

saturn.columns["injuries"].stats

stat	value
n	54,575
nulls	39,805 (72.9%)
unique	178
top_value	0
top_rate	0.6814
cardinality	178
entropy	2.468
entropy_ratio	0.3301
alert: null_rate	72.9% null

Fig 17.

Top values for injuries.

Top values for injuries (20 unique shown, of 178 total).
value	count	share
0	10064	18.4%
1	893	1.6%
2	552	1.0%
3	343	0.6%
4	236	0.4%
5	234	0.4%
10	219	0.4%
6	196	0.4%
12	158	0.3%
7	134	0.2%
8	121	0.2%
20	114	0.2%
15	111	0.2%
11	90	0.2%
9	85	0.2%
13	70	0.1%
14	69	0.1%
30	68	0.1%
25	56	0.1%
16	48	0.1%

damage text feature

This column contains abbreviated monetary damage estimates (e.g., '2.5M', '250K', '0.00K') stored as free-form text, most likely representing financial loss or property damage figures from incident or insurance records. The null rate is extremely high at 72.94%, and 368 of the non-null values are empty strings, which are effectively missing as well. The all-caps rate of 87.2% and one-word rate of 100% confirm a consistent but non-numeric encoding; the 1,014 unique values and 93.1% duplicate rate indicate a relatively coarse discrete scale. String suffixes (K vs M) encode magnitude and must be parsed before any quantitative use.

Treatment: Parse magnitude suffixes (K=thousands, M=millions) and convert to a numeric column; impute or flag the 72.94% nulls before modelling.
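
The suffix parsing can be sketched as below. Only the K and M suffixes named in the treatment are handled; reading a bare number as an unscaled amount and accepting lowercase suffixes are assumptions, and `parse_damage` is a hypothetical helper.

```python
import re

# Multipliers for the two suffixes described in the report.
_SUFFIX_MULTIPLIER = {"K": 1_000.0, "M": 1_000_000.0}
_DAMAGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)([KM]?)")


def parse_damage(raw):
    """Convert strings like '2.5M' or '250K' to a float amount; None if unparseable."""
    if raw is None:
        return None
    text = raw.strip().upper()
    match = _DAMAGE_PATTERN.fullmatch(text)
    if match is None:
        # Covers empty strings and anything that is not number-plus-suffix.
        return None
    number, suffix = match.groups()
    return float(number) * _SUFFIX_MULTIPLIER.get(suffix, 1.0)


# The example values quoted in the column description, plus an empty string.
for raw in ("2.5M", "250K", "0.00K", ""):
    print(repr(raw), parse_damage(raw))
```

Returning None for empty strings keeps the 368 empty values aligned with the nulls instead of silently turning them into zero damage.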

anthropic:default · confidence high

Out[41]:

saturn.columns["damage"].stats

stat	value
n	54,575
nulls	39,805 (72.9%)
unique	1,014
len_min	0
len_max	8
len_mean	4.381
len_median	5
len_p95	7
word_mean	1
word_median	1
n_empty	368
n_duplicates	13,756
duplicate_rate	0.9313
vocab_size	1,013
readability_flesch_mean	117
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.8724
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	87.2% rows are all-caps
alert: null_rate	72.9% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	93.1% duplicate strings

Fig 18.

Character-length distribution for damage.

Character-length distribution for damage (mean: 4.38; one row per string length).
chars	count
0	368
1	264
2	1252
3	1172
4	3414
5	6075
6	1450
7	514
8	261

state categorical feature

This column contains US state names (full uppercase spellings), acting as a geographic feature for records in the dataset. The critical issue is a 72.94% null rate: nearly three-quarters of all 54,575 rows carry no state value. The 14,770 non-null rows equal the storm row count, which suggests the field is populated only for storm records and that the missingness is structural. Among non-null values, cardinality is 65 (above the 50 US states, suggesting territories or data anomalies), and the distribution is moderately spread (entropy ratio 0.86), with Texas as the top value at 9.82% of non-null records.

Treatment: Confirm the missingness mechanism before use; consider a missingness indicator flag rather than imputation if the nulls prove structural, and audit the 65 unique values to identify non-standard entries beyond the 50 states.
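
The audit of the unique values can be done against a fixed list of the 50 state names. The sketch below is self-contained; the sample values passed to `non_standard_states` are hypothetical, since the report lists only the top 20 of the 65 values.

```python
# The 50 US states in uppercase, matching the column's spelling convention.
US_STATES = {
    name.strip()
    for name in """
    ALABAMA, ALASKA, ARIZONA, ARKANSAS, CALIFORNIA, COLORADO, CONNECTICUT,
    DELAWARE, FLORIDA, GEORGIA, HAWAII, IDAHO, ILLINOIS, INDIANA, IOWA,
    KANSAS, KENTUCKY, LOUISIANA, MAINE, MARYLAND, MASSACHUSETTS, MICHIGAN,
    MINNESOTA, MISSISSIPPI, MISSOURI, MONTANA, NEBRASKA, NEVADA,
    NEW HAMPSHIRE, NEW JERSEY, NEW MEXICO, NEW YORK, NORTH CAROLINA,
    NORTH DAKOTA, OHIO, OKLAHOMA, OREGON, PENNSYLVANIA, RHODE ISLAND,
    SOUTH CAROLINA, SOUTH DAKOTA, TENNESSEE, TEXAS, UTAH, VERMONT,
    VIRGINIA, WASHINGTON, WEST VIRGINIA, WISCONSIN, WYOMING
    """.split(",")
}


def non_standard_states(values):
    """Return the sorted distinct values that are not one of the 50 states."""
    normalized = {value.strip().upper() for value in values if value and value.strip()}
    return sorted(normalized - US_STATES)


# Hypothetical sample of distinct values; replace with the column's 65 uniques.
sample = ["TEXAS", "PUERTO RICO", "texas ", "GULF OF MEXICO"]
print(non_standard_states(sample))
```

Whatever the function returns for the real column is the list to review by hand: territories can be kept, while anything else points to a data issue.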

anthropic:default · confidence high

Out[44]:

saturn.columns["state"].stats

stat	value
n	54,575
nulls	39,805 (72.9%)
unique	65
top_value	TEXAS
top_rate	0.09817
cardinality	65
entropy	5.182
entropy_ratio	0.8605
alert: null_rate	72.9% null

Fig 19.

Top values for state.

Top values for state (20 unique shown, of 65 total).
value	count	share
TEXAS	1450	2.7%
MISSOURI	648	1.2%
ARKANSAS	602	1.1%
MISSISSIPPI	570	1.0%
GEORGIA	562	1.0%
ILLINOIS	560	1.0%
IOWA	527	1.0%
LOUISIANA	507	0.9%
TENNESSEE	499	0.9%
FLORIDA	498	0.9%
OKLAHOMA	490	0.9%
NEBRASKA	486	0.9%
ALABAMA	469	0.9%
WISCONSIN	463	0.8%
OHIO	441	0.8%
MICHIGAN	426	0.8%
NORTH CAROLINA	422	0.8%
KANSAS	418	0.8%
INDIANA	408	0.7%
KENTUCKY	383	0.7%

aircraft_type text label

This column contains aircraft make-and-model designations (e.g., 'Cessna 172', 'Piper PA-28-140') from what appears to be an aviation incident or registration dataset. Two points stand out. First, 40.6% of rows are null; the 32,410 non-null rows equal the aviation_accident row count, so the gap most likely reflects non-aviation records rather than missing coverage. Second, case inconsistency is severe: 'CESSNA 172' (360 occurrences) and 'Cessna 172' (189 occurrences) are counted as distinct values despite being the same aircraft, and about 49.5% of values are all-caps. This inflates the unique count (9,478) and understates the duplicate rate (70.8%). The top words confirm a general-aviation-heavy dataset dominated by Cessna, Piper, and Beech.

Treatment: Normalize case (lowercase or title-case), then deduplicate/consolidate variant spellings before using as a categorical feature or grouping key.
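
The effect of case normalization on the two Cessna 172 variants can be shown with a counter. The counts are the ones quoted in the column description; `normalize_aircraft` is an illustrative helper, and choosing uppercase is an assumption made here because title-casing would alter designations such as 'PA-28-140'.

```python
from collections import Counter


def normalize_aircraft(raw):
    """Collapse whitespace and uppercase so case variants share one key."""
    return " ".join(raw.split()).upper()


# Variant counts quoted in the column description.
raw_counts = Counter({"CESSNA 172": 360, "Cessna 172": 189})

merged = Counter()
for value, n in raw_counts.items():
    merged[normalize_aircraft(value)] += n

print(merged)
```

After merging, the two spellings collapse into a single key, so the unique count falls and the duplicate count rises by the same amount.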

anthropic:default · confidence high

Out[47]:

saturn.columns["aircraft_type"].stats

stat	value
n	54,575
nulls	22,165 (40.6%)
unique	9,478
len_min	7
len_max	50
len_mean	15.85
len_median	13
len_p95	31
word_mean	2
word_median	2
n_empty	0
n_duplicates	22,932
duplicate_rate	0.7076
vocab_size	7,261
readability_flesch_mean	45.12
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0.4955
boilerplate_rate	0
alert: allcaps	49.5% rows are all-caps
alert: null_rate	40.6% null
alert: duplicates	70.8% duplicate strings

Fig 20.

Character-length distribution for aircraft_type.

Character-length distribution for aircraft_type (mean: 15.850848503548288).
chars	count
7 – 8	592
8 – 9	1240
9 – 10	3852
10 – 11	7354
11 – 12	2473
12 – 13	1129
13 – 15	1380
15 – 16	3032
16 – 17	1131
17 – 18	853
18 – 19	1024
19 – 20	779
20 – 21	670
21 – 22	1117
22 – 23	947
23 – 24	528
24 – 25	507
25 – 26	403
26 – 27	455
27 – 28	447
28 – 30	312
30 – 31	352
31 – 32	258
32 – 33	282
33 – 34	261
34 – 35	276
35 – 36	283
36 – 37	145
37 – 38	56
38 – 39	46
39 – 40	35
40 – 41	48
41 – 42	51
42 – 44	38
44 – 45	19
45 – 46	5
46 – 47	7
47 – 48	6
48 – 49	2
49 – 50	15

event_id text foreign_key

This column is an aviation or safety incident event identifier: the 14-character format (e.g., '20010519X00967') encodes a date prefix followed by an alphanumeric case code, consistent with NTSB accident/incident IDs. Two signals stand out. The null rate of 40.61% means about two-fifths of rows lack an event ID; the 32,410 non-null rows equal the aviation_accident row count, so the field appears to apply only to aviation records. The duplicate rate of 18.46% (5,983 duplicates across 26,427 unique values) indicates that some event IDs appear in more than one row, implying a one-to-many relationship for those events. All values are exactly 14 characters and fully uppercase, which indicates a tightly controlled format.

Treatment: Left-join on this ID to an events dimension table; investigate and handle the 40.61% null rate before joining.
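
Before joining, the ID format can be validated and its date prefix extracted. The pattern below (eight digits followed by six uppercase alphanumerics) is inferred from the single example in this report and is an assumption, not a documented specification; `event_id_date` is a hypothetical helper.

```python
import re
from datetime import date

# Assumed layout: YYYYMMDD followed by a 6-character uppercase case code.
_EVENT_ID = re.compile(r"(\d{4})(\d{2})(\d{2})[A-Z0-9]{6}")


def event_id_date(event_id):
    """Return the date encoded in the ID prefix, or None if the ID is malformed."""
    if event_id is None:
        return None
    match = _EVENT_ID.fullmatch(event_id)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits in the right places but not a real calendar date.
        return None


print(event_id_date("20010519X00967"))
```

Comparing this prefix date with the date column is one way to check whether the Jan-1 dates seen elsewhere are truncated versions of the real event dates.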

anthropic:default · confidence high

Out[50]:

saturn.columns["event_id"].stats

stat	value
n	54,575
nulls	22,165 (40.6%)
unique	26,427
len_min	14
len_max	14
len_mean	14
len_median	14
len_p95	14
word_mean	1
word_median	1
n_empty	0
n_duplicates	5,983
duplicate_rate	0.1846
vocab_size	17,535
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: null_rate	40.6% null
alert: short_text	95th-percentile length under 20 chars

Fig 21.

Character-length distribution for event_id.

Character-length distribution for event_id (mean: 14.0; all values have the same length).
chars	count
14	32410

vessel_type categorical feature

This column categorizes the type of vessel involved in an incident or record, with 23 distinct values including 'ship', 'submarine', 'aircraft', and oddly 'car'. Two major data quality issues stand out. First, the null rate is extreme at 93.31%: only 3,653 of 54,575 rows carry any value, a count that equals the shipwreck row total. Second, the top recorded value is an empty string (3,311 occurrences, which is the 90.6% top_rate among non-null rows), so only 342 rows hold a real vessel type and the true fill rate is far lower than the null rate implies. The long-tail alert is consistent with rare values like 'schooner' (2), 'sailboat' (2), and 'steamer' (1), while 'car' appearing as a vessel type signals potential data entry errors or schema misuse. Case variants ('airplane' and 'Airplane') are also counted separately.

Treatment: Treat empty strings as nulls, impute or exclude before modelling given 93.31% missingness, and audit 'car' and 'aircraft' entries for schema validity.
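
The gap between the reported null rate and the real fill rate can be worked out from the stats above. This is a small arithmetic sketch; the helper names are illustrative.

```python
def empty_to_none(value):
    """Map empty or whitespace-only strings to None so they count as nulls."""
    if value is None or not value.strip():
        return None
    return value


def effective_fill(n_rows, n_nulls, n_empty):
    """Return (populated_rows, populated_share) once empty strings count as nulls."""
    populated = n_rows - n_nulls - n_empty
    return populated, populated / n_rows


# vessel_type: 54,575 rows, 50,922 nulls, 3,311 empty strings.
vessel_rows, vessel_share = effective_fill(54575, 50922, 3311)
# cargo: same row and null counts, 3,632 empty strings.
cargo_rows, cargo_share = effective_fill(54575, 50922, 3632)

print(vessel_rows, f"{vessel_share:.2%}")
print(cargo_rows, f"{cargo_share:.2%}")
```

Applying `empty_to_none` before profiling would make the null rate itself reflect these numbers.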

anthropic:default · confidence high

Out[53]:

saturn.columns["vessel_type"].stats

stat	value
n	54,575
nulls	50,922 (93.3%)
unique	23
top_value
top_rate	0.9064
cardinality	23
entropy	0.5764
entropy_ratio	0.1274
alert: long_tail	14 singleton categories
alert: null_rate	93.3% null

Fig 22.

Top values for vessel_type.

Top values for vessel_type (20 unique shown, of 23 total).
value	count	share
	3311	6.1%
ship	275	0.5%
submarine	18	0.0%
aircraft	16	0.0%
plane	10	0.0%
boat	3	0.0%
schooner	2	0.0%
car	2	0.0%
sailboat	2	0.0%
steamer	1	0.0%
airplane	1	0.0%
freightcar	1	0.0%
train	1	0.0%
paddle steamer	1	0.0%
vehicle	1	0.0%
motorbike	1	0.0%
helicopter	1	0.0%
Steam hoist	1	0.0%
tractor	1	0.0%
Airplane	1	0.0%

cargo categorical feature

This column records the type of cargo carried by vessels or vehicles, with 17 distinct categories including 'human', 'timber', 'coal', 'fertilizer', and 'fish'. It is overwhelmingly sparse: 93.31% of rows are null, and among the non-null rows the top value is an empty string (3,632 occurrences), leaving only 21 genuinely populated rows, with no category above 4 occurrences. The entropy ratio of 0.018 confirms near-total concentration. The German-language entry 'Fischkutter (Stahl)', which describes a vessel (a steel fishing cutter) rather than a cargo, signals both a language mix and misfiled values in the rare populated records.

Treatment: Exclude from modelling unless the non-null subset is the analytic focus; treat empty strings as nulls, consolidate language variants, and flag the 93.31% missingness as likely structural (field not applicable to most records).

anthropic:default · confidence high

Out[56]:

saturn.columns["cargo"].stats

stat	value
n	54,575
nulls	50,922 (93.3%)
unique	17
top_value
top_rate	0.9943
cardinality	17
entropy	0.07302
entropy_ratio	0.01786
alert: long_tail	13 singleton categories
alert: null_rate	93.3% null
alert: imbalance	top value is 99.4% of rows

Fig 23.

Top values for cargo.

Top values for cargo (17 unique shown, of 17 total).
value	count	share
	3632	6.7%
human	4	0.0%
timber	2	0.0%
coal	2	0.0%
fertilizer	1	0.0%
ore pellets	1	0.0%
Fischkutter (Stahl)	1	0.0%
seafood	1	0.0%
fish	1	0.0%
passengers	1	0.0%
mexican army supposed drugs, but the crew and cargo was not found	1	0.0%
iron ore	1	0.0%
pulp	1	0.0%
18 mines, 6 torpedos	1	0.0%
sugar	1	0.0%
containers;vehicles	1	0.0%
container;oil	1	0.0%

depth_km unknown feature

This column appears to represent earthquake or geological event depth in kilometres, a continuous numeric feature. The profiler skipped analysis entirely (no profiler for kind=unknown), so no distribution statistics, uniqueness counts, or range information are available. The reported null rate of 0.0% across 54,575 rows should be read with caution for the same reason: only 3,742 rows are earthquakes, so a depth value on every row would be unexpected. Nothing can be said about skew, outliers, or value range from this evidence alone, and an analyst should inspect the column directly before modelling.

Treatment: Manually profile for range and skew; apply log-transform if depth distribution is right-skewed before regression.

anthropic:default · confidence low

Out[59]:

saturn.columns["depth_km"].stats

stat	value
n	54,575
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown
