saturn·

disasters airplane crashes

source /home/coolhand/html/datavis/data_trove/data/wild/disasters/airplane_crashes.csv · 5,268 rows · 13 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset records 5,268 airplane crashes across 13 columns, mixing dates and times with operator, aircraft type, route, location, and casualty counts (Aboard, Fatalities, Ground). Casualty figures are highly skewed: Aboard averages 27.5 with a median of 13 and a maximum of 644, while Fatalities averages 20.1 with a median of 9 and a maximum of 583. Ground deaths are zero in roughly 96% of rows but reach a maximum of 2,750, an extreme outlier worth investigating. Operator and Type are dominated by a few heavy hitters (Aeroflot and U.S. military operators; Douglas DC-3 alone appears 334 times), a concentration that could bias any aggregate analysis. Flight # is missing in nearly 80% of rows and Time in 42%, so those fields are weak for filtering. Start with the Fatalities distribution and the top operators and aircraft types.

citing: row_count · column_count · columns.Aboard.stats · columns.Fatalities.stats · columns.Ground.stats · columns.Operator.top_values · columns.Type.top_values · columns.Flight #.null_rate · columns.Time.null_rate

Schema

13 columns
Column | Type | Nulls | Unique | Alerts
Date | text | 0.0% | 4,753 | one_word, allcaps, short_text
Time | text | 42.1% | 1,005 | one_word, allcaps, null_rate, short_text, duplicates
Location | text | 0.4% | 4,303 | none
Operator | text | 0.3% | 2,476 | multilingual, duplicates
Flight # | categorical | 79.7% | 724 | long_tail, null_rate
Route | text | 32.4% | 3,244 | multilingual, null_rate
Type | text | 0.5% | 2,446 | duplicates
Registration | text | 6.4% | 4,905 | near_unique, one_word, allcaps, short_text
cn/In | text | 23.3% | 3,707 | one_word, allcaps, null_rate, short_text
Aboard | numeric | 0.4% | 239 | high_skew, outliers
Fatalities | numeric | 0.2% | 191 | high_skew, outliers
Ground | numeric | 0.4% | 50 | high_skew
Summary | text | 7.4% | 4,673 | near_unique

Date

text timestamp one_word allcaps short_text
This column holds dates stored as text in MM/DD/YYYY format — every one of 5268 values is exactly 10 characters and a single token. There are 515 duplicates (9.8%) with repeats clustering on historically notable days such as 09/11/2001 and 06/06/1944, suggesting the rows describe events tied to those dates rather than unique daily records. The text alerts (allcaps, one_word, short_text) are artifacts of the date formatting, not real free-text content. Treatment: parse to a proper date type (MM/DD/YYYY) before any temporal analysis. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 0 (0.0%)
unique: 4,753
len_min: 10
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 515
duplicate_rate: 0.09776
vocab_size: 4,753
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
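A minimal sketch of the Date treatment, assuming the MM/DD/YYYY format reported above holds for every value. The helper name `parse_crash_date` is hypothetical; malformed strings come back as `None` rather than raising, so they can be counted and inspected afterwards.

```python
from datetime import date, datetime
from typing import Optional


def parse_crash_date(raw: Optional[str]) -> Optional[date]:
    """Parse an MM/DD/YYYY string; return None for missing or malformed input."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), "%m/%d/%Y").date()
    except ValueError:
        return None


print(parse_crash_date("09/11/2001"))  # 2001-09-11
print(parse_crash_date("13/40/2001"))  # None
```

Counting the `None` results after parsing is a cheap way to confirm the profile's claim that every value fits the format.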

Time

text timestamp one_word allcaps null_rate short_text duplicates
Clock times, mostly in HH:MM format, stored as text rather than a temporal type. Values like '15:00', '12:00' and '11:00' top the list, and lengths range from 4 to 7 characters (mean 5.0, median and p95 both 5), so a small share of values deviate from the five-character HH:MM pattern. Roughly 42% of rows are null, and 67% of the non-null values are duplicates across only 1,005 distinct times; the top values suggest times cluster on the hour. Although the values look numeric, the column tripped the allcaps and one-word alerts because the profiler treats the strings as text tokens. Treatment: parse to a time-of-day type and impute or flag the 42% missing before use. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 2,219 (42.1%)
unique: 1,005
len_min: 4
len_max: 7
len_mean: 5.003
len_median: 5
len_p95: 5
word_mean: 1.001
word_median: 1
n_empty: 0
n_duplicates: 2,044
duplicate_rate: 0.6704
vocab_size: 1,004
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.999
allcaps_rate: 0.9974
boilerplate_rate: 0
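One way to sketch the Time treatment: pull an H:MM or HH:MM clock value out of the string and keep an explicit missing flag instead of imputing. The function names are hypothetical, and the tolerance for surrounding characters is an assumption based only on the reported lengths of 4 to 7; the input 'c 14:30' in the usage note is invented for illustration.

```python
import re
from datetime import time
from typing import Optional, Tuple

# One or two hour digits, a colon, two minute digits, not glued to other digits.
_CLOCK = re.compile(r"(?<!\d)(\d{1,2}):(\d{2})(?!\d)")


def parse_clock(raw: Optional[str]) -> Optional[time]:
    """Return a time-of-day, or None when the value is missing or invalid."""
    if not raw:
        return None
    match = _CLOCK.search(raw)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        return None
    return time(hour, minute)


def with_missing_flag(raw: Optional[str]) -> Tuple[Optional[time], bool]:
    """Pair the parsed time with an explicit missing-value indicator."""
    parsed = parse_clock(raw)
    return parsed, parsed is None
```

For example, `parse_clock("9:05")` and `parse_clock("c 14:30")` both yield a time, while `parse_clock("25:00")` yields `None`. Keeping the flag as its own column preserves the information that 42% of rows had no usable time.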

Location

text feature
Free-text place names, typically 'City, Country/State' (word_median 3, len_median 19), with 4,303 unique values across 5,268 rows. Top entries cluster on major world cities (Sao Paulo, Moscow, Rio), but 'near' appears 1,272 times, so many entries give approximate rather than exact locations. The 945 repeated strings (duplicate rate 18%) indicate moderate repetition, though high cardinality limits direct grouping. Treatment: parse into city/region/country components and geocode before use; raw strings are too high-cardinality to one-hot encode. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 20 (0.4%)
unique: 4,303
len_min: 5
len_max: 60
len_mean: 20.38
len_median: 19
len_p95: 31
word_mean: 2.866
word_median: 3
n_empty: 0
n_duplicates: 945
duplicate_rate: 0.1801
vocab_size: 4,541
readability_flesch_mean: 24.03
emoji_rate: 0
url_rate: 0
one_word_rate: 0.01124
allcaps_rate: 0
boilerplate_rate: 0
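A rough sketch of the first parsing step for Location, before any geocoding. It assumes that 'near' appears as a leading word and that the text after the last comma is the country or state; neither is confirmed by the profile, and `ParsedLocation` and `parse_location` are hypothetical names.

```python
from typing import NamedTuple, Optional


class ParsedLocation(NamedTuple):
    place: str
    region: Optional[str]  # text after the last comma, if any
    approximate: bool      # True when the raw string started with "near"


def parse_location(raw: str) -> ParsedLocation:
    """Split a free-text location into place, region, and an approximate flag."""
    text = raw.strip()
    approximate = text.lower().startswith("near ")
    if approximate:
        text = text[5:].strip()
    place, sep, region = text.rpartition(",")
    if not sep:  # no comma: the whole string is the place
        return ParsedLocation(text, None, approximate)
    return ParsedLocation(place.strip(), region.strip(), approximate)


print(parse_location("Near Moscow, Russia"))
```

Grouping on `region` alone should already cut the cardinality well below the 4,303 raw strings, though the exact reduction depends on how consistently regions are spelled.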

Operator

text feature multilingual duplicates
This column holds the airline or military operator name for each record, with 2476 unique values across 5268 rows and only a 0.0034 null rate. It is heavily duplicated (duplicate_rate 0.528, n_duplicates 2774), led by Aeroflot (179) and Military - U.S. Air Force (176), and the language detector flags a multilingual mix dominated by English (3340) but with sizable Italian (278), Spanish (224), German (202), and French (183) counts — likely an artifact of short proper nouns rather than true translations. Entries are short (word_mean 3.05, len_mean 19.5) and one_word_rate is 0.165, consistent with brand-style names. Treatment: Normalize casing and consolidate Military - * variants, then treat as a high-cardinality categorical (target/frequency encode). high · anthropic:claude-opus-4-7
n: 5,268
nulls: 18 (0.3%)
unique: 2,476
len_min: 3
len_max: 65
len_mean: 19.49
len_median: 19
len_p95: 35
word_mean: 3.047
word_median: 3
n_empty: 0
n_duplicates: 2,774
duplicate_rate: 0.5284
vocab_size: 2,370
readability_flesch_mean: 19.61
emoji_rate: 0
url_rate: 0
one_word_rate: 0.1651
allcaps_rate: 0.03733
boilerplate_rate: 0
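The Operator treatment could look roughly like this. Merging every 'Military - *' value into a single bucket is only one reading of "consolidate" and is an assumption here; keeping the branch (for example, all U.S. Air Force variants together) would be an equally valid choice. Function names are hypothetical.

```python
import re
from collections import Counter
from typing import List

_MILITARY = re.compile(r"^military\s*-\s*", re.IGNORECASE)


def normalize_operator(raw: str) -> str:
    """Collapse whitespace, fold case, and merge 'Military - *' variants."""
    text = " ".join(raw.split())
    if _MILITARY.match(text):
        return "military"
    return text.casefold()


def frequency_encode(values: List[str]) -> List[float]:
    """Map each operator to the share of rows its normalized form covers."""
    normalized = [normalize_operator(v) for v in values]
    counts = Counter(normalized)
    total = len(normalized)
    return [counts[v] / total for v in normalized]


print(normalize_operator("Military - U.S. Air Force"))  # military
```

Frequency encoding avoids creating 2,476 one-hot columns; target encoding would need the Fatalities column and a leakage-safe split, so it is left out of this sketch.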

Flight #

categorical identifier long_tail null_rate
Flight number identifier for each crash record. Nearly 80% of rows are null (null_rate 0.7971), and the most common non-null value is the placeholder '-' at 67 occurrences (6.27% of the 1,069 present values), so missing-data sentinels are mixed in with real codes. Cardinality is high relative to the populated rows (724 unique among 1,069 non-null values) with entropy_ratio 0.953, so among populated rows values are nearly uniformly distributed. Treatment: Normalize '-' to null and treat as a high-cardinality identifier; drop from modelling or use only as a join key. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 4,199 (79.7%)
unique: 724
top_value: -
top_rate: 0.06268
cardinality: 724
entropy: 9.058
entropy_ratio: 0.9534
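A sketch of the sentinel cleanup plus a check on the entropy figure. It assumes entropy_ratio means Shannon entropy in bits divided by log2 of the distinct count; the profile does not define the term, but the reported numbers are consistent with that reading. Function names are hypothetical.

```python
import math
from collections import Counter


def normalize_flight_number(raw):
    """Treat None, empty strings, and the '-' placeholder as missing."""
    if raw is None:
        return None
    text = raw.strip()
    return None if text in ("", "-") else text


def entropy_ratio(values):
    """Shannon entropy (bits) of non-missing values over log2(distinct count)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    if len(counts) < 2:
        return 0.0
    total = len(present)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


# Reported figures: 9.058 bits of entropy over 724 distinct values.
print(9.058 / math.log2(724))
```

The printed ratio lands within rounding distance of the reported 0.9534. Note that once '-' is mapped to null, both the null rate and the distinct count change slightly, so the statistics should be recomputed after cleaning.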

Route

text free_text multilingual null_rate
Short free text describing a flight route, typically formatted as 'Origin - Destination' (the hyphen appears 3,658 times) with occasional non-route labels like 'Training' (81), 'Sightseeing' (29), or 'Test flight' (17). Values are short (mean 22 chars, 4 words) and highly varied: 3,244 unique among the 3,562 non-null rows, with only 318 duplicates (8.9%). The remaining 32.4% of rows are null. Language detection flags a multilingual mix dominated by English (2567) with notable Spanish (237), Portuguese (100), German (88) and Italian (88), reflecting place names rather than true prose. Treatment: Parse on ' - ' to split origin/destination and bucket non-route labels separately; impute or flag the 32% nulls. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 1,706 (32.4%)
unique: 3,244
len_min: 4
len_max: 59
len_mean: 22.09
len_median: 20
len_p95: 37
word_mean: 4.065
word_median: 4
n_empty: 0
n_duplicates: 318
duplicate_rate: 0.08928
vocab_size: 3,647
readability_flesch_mean: 27.15
emoji_rate: 0
url_rate: 0
one_word_rate: 0.04099
allcaps_rate: 0.0002807
boilerplate_rate: 0
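A minimal sketch of the Route treatment. The label set contains only the three non-route values named above and would need extending against the real data; `parse_route` and the place names in the usage line are hypothetical.

```python
NON_ROUTE_LABELS = {"training", "sightseeing", "test flight"}


def parse_route(raw):
    """Return (kind, origin, destination).

    kind is 'missing', 'label', 'route', or 'other'.
    """
    if raw is None or not raw.strip():
        return ("missing", None, None)
    text = raw.strip()
    if text.casefold() in NON_ROUTE_LABELS:
        return ("label", None, None)
    stops = [s.strip() for s in text.split(" - ")]
    if len(stops) >= 2 and all(stops):
        # Intermediate stops are ignored here; only the endpoints are kept.
        return ("route", stops[0], stops[-1])
    return ("other", None, None)


print(parse_route("Paris - London"))  # ('route', 'Paris', 'London')
```

Splitting on ' - ' with surrounding spaces, rather than a bare hyphen, keeps hyphenated place names intact. The 'other' bucket collects anything that is neither a known label nor a parseable route, which makes it a useful review queue.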

Type

text feature duplicates
This column records aircraft make-and-model designations, dominated by manufacturer-plus-type strings like 'Douglas DC-3' (334 occurrences) and 'de Havilland Canada DHC-6 Twin Otter 300'. Values are short (mean 18.3 chars, median 2 words) but highly repetitive: 53.3% duplicate rate across 2,446 unique types, with Douglas alone appearing in 1,113 rows. Watch for near-duplicate variants of the same airframe ('Douglas C-47', 'Douglas C-47A', 'Douglas C-47B') that will fragment any group-by unless normalised. Treatment: Normalise manufacturer/variant strings (e.g. collapse C-47 sub-variants) before using as a categorical feature. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 27 (0.5%)
unique: 2,446
len_min: 4
len_max: 40
len_mean: 18.33
len_median: 16
len_p95: 34
word_mean: 2.718
word_median: 2
n_empty: 0
n_duplicates: 2,795
duplicate_rate: 0.5333
vocab_size: 2,534
readability_flesch_mean: 69.26
emoji_rate: 0
url_rate: 0
one_word_rate: 0.007441
allcaps_rate: 0.00954
boilerplate_rate: 0
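To illustrate the variant-collapsing step for Type, here is a deliberately narrow sketch: it only strips a single trailing variant letter from a final token shaped like 'C-47A'. That rule is an assumption chosen to cover the C-47 example above, not a general aircraft-naming scheme, and `base_type` is a hypothetical name.

```python
import re

# A designation such as "C-47A": letters, hyphen, digits, then one variant letter.
_VARIANT = re.compile(r"^([A-Za-z]+-\d+)[A-Z]$")


def base_type(raw: str) -> str:
    """Drop a single trailing variant letter from the last token, if present."""
    tokens = raw.split()
    if not tokens:
        return raw
    match = _VARIANT.match(tokens[-1])
    if match:
        tokens[-1] = match.group(1)
    return " ".join(tokens)


for name in ("Douglas C-47", "Douglas C-47A", "Douglas C-47B", "Douglas DC-3"):
    print(name, "->", base_type(name))
```

All three C-47 strings map to 'Douglas C-47' while 'Douglas DC-3' is left alone. Longer names such as 'de Havilland Canada DHC-6 Twin Otter 300' pass through unchanged, so a fuller normalisation would need a lookup table or manufacturer-specific rules.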

Registration

text identifier near_unique one_word allcaps short_text
Almost certainly aircraft tail/registration codes: 4905 unique values across 5268 rows, 99% all-caps single tokens with mean length 6.4 (max 15), and top tokens like 'hk-' and 'nc10809' resemble registration prefixes. Near-unique (n_unique/n ≈ 0.93) with a 6.36% null rate and only 28 duplicates, so it behaves as an identifier rather than a feature. The lone '/' appearing 36 times suggests a placeholder for split/unknown registrations worth inspecting. Treatment: Treat as an identifier: drop from modelling or use only for joins/lookup after normalising case and the '/' placeholder. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 335 (6.4%)
unique: 4,905
len_min: 1
len_max: 15
len_mean: 6.394
len_median: 6
len_p95: 10
word_mean: 1.018
word_median: 1
n_empty: 0
n_duplicates: 28
duplicate_rate: 0.005676
vocab_size: 4,948
readability_flesch_mean: 103
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9899
allcaps_rate: 0.9919
boilerplate_rate: 0
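Before using Registration as a join key, it may help to normalise case, null out the '/' placeholder, and list the few repeated values for inspection. This is a sketch under the assumption that a bare '/' always means "unknown"; the function names and the sample registrations in the usage line are made up.

```python
from collections import Counter


def normalize_registration(raw):
    """Upper-case and trim a registration; map None, '', and '/' to None."""
    if raw is None:
        return None
    text = raw.strip().upper()
    return None if text in ("", "/") else text


def repeated_registrations(values):
    """Return registrations seen more than once, with their counts."""
    cleaned = (normalize_registration(v) for v in values)
    counts = Counter(r for r in cleaned if r is not None)
    return {reg: n for reg, n in counts.items() if n > 1}


print(repeated_registrations(["hk-101", "HK-101", "/", "/", None, "N123"]))
# {'HK-101': 2}
```

Repeats that survive normalisation are worth a manual look, since the same airframe appearing in more than one crash record is plausible but uncommon.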

cn/In

text feature one_word allcaps null_rate short_text
Despite the text classification, 'cn/In' looks like a short numeric code field: values are predominantly one-word, all-caps tokens with a mean length of 5.6 characters, and the top values ('178', '19', '229', '125') are all integers. About 23.3% of rows are null and only 333 duplicates (8.2%) appear across 3,707 unique values, so cardinality is high relative to 5,268 rows. The '/' character showing up 49 times in top_words hints at occasional composite values (e.g., 'a/b'), which would fit the column name if it abbreviates construction number / line number (an interpretation the profile itself does not confirm). Treatment: Cast to numeric where possible and split composite '/'-separated entries; impute or flag the 23% nulls before modelling. medium · anthropic:claude-opus-4-7
n: 5,268
nulls: 1,228 (23.3%)
unique: 3,707
len_min: 1
len_max: 20
len_mean: 5.645
len_median: 5
len_p95: 10
word_mean: 1.026
word_median: 1
n_empty: 0
n_duplicates: 333
duplicate_rate: 0.08243
vocab_size: 3,739
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9842
allcaps_rate: 0.9663
boilerplate_rate: 0
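A small sketch of the split-and-cast step for cn/In. It assumes '/' is the only separator in composite values, which the profile hints at but does not establish; `parse_serial` and the sample inputs are hypothetical.

```python
def parse_serial(raw):
    """Split on '/' and convert purely numeric parts to int; others stay text."""
    if raw is None or not raw.strip():
        return []
    parts = [p.strip() for p in raw.split("/")]
    return [int(p) if p.isdecimal() else p for p in parts if p]


print(parse_serial("178"))      # [178]
print(parse_serial("1234/56"))  # [1234, 56]
print(parse_serial("A-12"))     # ['A-12']
```

Returning a list keeps single and composite values in one shape. Parts that are not purely numeric are kept as text rather than dropped, so nothing is lost if the codes turn out to be identifiers rather than quantities.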

Aboard

numeric feature high_skew outliers
This column records the number of people aboard, with values ranging from 0 to 644 and a median of 13. The distribution is heavily right-skewed (skew 4.25, kurtosis 28.4) and roughly 10% of non-null rows (529) are flagged as outliers, indicating a long tail of very large flights against a typical small-aircraft baseline. Nulls are negligible (0.42%) and only 239 distinct values appear across 5,268 rows. Treatment: log1p-transform before modelling to tame the heavy right tail; a plain log is undefined for the rows where Aboard is 0. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 22 (0.4%)
unique: 239
min: 0
max: 644
mean: 27.55
median: 13
std: 43.08
q1: 5
q3: 30
iqr: 25
skew: 4.247
kurtosis: 28.41
n_outliers: 529
outlier_rate: 0.1008
zero_rate: 0.0003812
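A sketch of the transform for Aboard, using log1p rather than a plain logarithm because the column's minimum is 0 and log(0) is undefined. The helper name is hypothetical, and missing values are passed through untouched rather than imputed.

```python
import math


def log1p_or_none(value):
    """log(1 + value) for non-negative counts; None passes through."""
    if value is None:
        return None
    if value < 0:
        raise ValueError("counts must be non-negative")
    return math.log1p(value)


print(round(log1p_or_none(13), 3))   # median of Aboard -> 2.639
print(round(log1p_or_none(644), 3))  # maximum of Aboard -> 6.469
```

On this scale the maximum sits at roughly 2.5 times the median instead of about 50 times, which is the compression the treatment is after.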

Fatalities

numeric numeric_target high_skew outliers
Counts of deaths per event, ranging from 0 to 583 with a median of 9 and mean of 20.07. The distribution is heavily right-skewed (skew 4.95, kurtosis 42.79) with 444 outliers (8.4% of rows) and a small zero rate of 1.1%. The IQR of 20 against a max of 583 confirms a long tail driven by rare catastrophic events. Treatment: log1p-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 12 (0.2%)
unique: 191
min: 0
max: 583
mean: 20.07
median: 9
std: 33.2
q1: 3
q3: 23
iqr: 20
skew: 4.948
kurtosis: 42.79
n_outliers: 444
outlier_rate: 0.08447
zero_rate: 0.01104
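To make the outlier count for Fatalities concrete, the sketch below computes fences from the reported quartiles using the common Tukey 1.5 × IQR rule. The profile does not say which rule the profiler applies, so treating it as Tukey's is an assumption; function names are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True when the value lies strictly outside the fences."""
    low, high = tukey_fences(q1, q3, k)
    return value < low or value > high


low, high = tukey_fences(3, 23)  # q1 and q3 reported for Fatalities
print(low, high)                 # -27.0 53.0
print(is_outlier(583, 3, 23))    # True
```

Under this rule any crash with more than 53 fatalities would be flagged. Since counts cannot be negative, only the upper fence matters in practice.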

Ground

numeric feature high_skew
Numeric field 'Ground' is overwhelmingly zero (zero_rate 0.9583) with median, q1, and q3 all at 0 and only 50 unique values across 5,268 rows. The non-zero tail is extreme: max 2,750 against a mean of 1.61, skew 50.3, and kurtosis 2559. The 219 outliers (4.17%) and the 95.83% zeros together account for all non-null rows, which is consistent with an IQR of 0 causing every non-zero value to be flagged. In this dataset the field counts deaths on the ground: almost every crash has none, but a few carry very large counts. Treatment: Split into a zero/non-zero indicator and log-transform the non-zero magnitudes before modelling. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 22 (0.4%)
unique: 50
min: 0
max: 2,750
mean: 1.609
median: 0
std: 53.99
q1: 0
q3: 0
iqr: 0
skew: 50.34
kurtosis: 2559
n_outliers: 219
outlier_rate: 0.04175
zero_rate: 0.9583
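The two-part treatment for Ground can be sketched as an indicator plus a log magnitude that is only defined for non-zero counts. `split_ground` is a hypothetical helper; leaving the magnitude as `None` for zero rows, instead of filling it with 0, is a design choice made here and not something the profile prescribes.

```python
import math


def split_ground(value):
    """Return (indicator, log_magnitude) for a ground-death count.

    indicator is 1 for a non-zero count, 0 for zero, None when missing.
    log_magnitude is the natural log of the count, or None when not applicable.
    """
    if value is None:
        return (None, None)
    if value < 0:
        raise ValueError("counts must be non-negative")
    if value == 0:
        return (0, None)
    return (1, math.log(value))


print(split_ground(0))     # (0, None)
print(split_ground(1))     # (1, 0.0)
print(split_ground(2750))
```

With about 96% zeros, the indicator carries most of the usable signal; the magnitude matters only for the small non-zero group.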

Summary

text free_text near_unique
Free-text incident summaries averaging 201 characters (median 136, max 1954) with a Flesch readability of 61.7, suggesting short narrative paragraphs. Domain vocabulary is clearly aviation-accident: 'crashed' (2925), 'into' (2300), and 'aircraft' (2031) lead the word counts after stopword removal. The column is near-unique (4,673 distinct values across 5,268 rows), with 205 exact duplicates (4.2% of non-null values) and a 7.4% null rate worth checking before modelling. Treatment: Tokenize and embed (or TF-IDF) for downstream NLP; dedupe the 205 exact repeats first. high · anthropic:claude-opus-4-7
n: 5,268
nulls: 390 (7.4%)
unique: 4,673
len_min: 6
len_max: 1,954
len_mean: 200.7
len_median: 136
len_p95: 584
word_mean: 33.24
word_median: 23
n_empty: 0
n_duplicates: 205
duplicate_rate: 0.04203
vocab_size: 12,513
readability_flesch_mean: 61.68
emoji_rate: 0
url_rate: 0
one_word_rate: 0.00041
allcaps_rate: 0
boilerplate_rate: 0
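A minimal sketch of the first two steps for Summary: drop missing and exactly repeated texts, then tokenize and count terms. It stops short of TF-IDF or embeddings, which would normally come from a dedicated library. The tokenizer regex and function names are assumptions, and the sample sentences are invented.

```python
import re
from collections import Counter

_WORD = re.compile(r"[a-z0-9']+")


def dedupe_exact(texts):
    """Drop missing entries and exact repeats, keeping first occurrences in order."""
    seen = set()
    unique = []
    for text in texts:
        if text is None or text in seen:
            continue
        seen.add(text)
        unique.append(text)
    return unique


def term_counts(texts):
    """Count lowercase word tokens across the deduplicated texts."""
    counts = Counter()
    for text in dedupe_exact(texts):
        counts.update(_WORD.findall(text.lower()))
    return counts


sample = ["Crashed into a hill.", "Crashed into a hill.", None, "Crashed on takeoff."]
print(term_counts(sample).most_common(2))
```

Deduplicating before counting keeps repeated boilerplate summaries from inflating term frequencies, which is the reason for doing it first.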