disasters airplane crashes

source /home/coolhand/html/datavis/data_trove/data/wild/disasters/airplane_crashes.csv 5,268 rows 13 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset records 5,268 airplane crashes across 13 columns, combining date and time with operator, aircraft type, route, location, and casualty counts (Aboard, Fatalities, Ground). Casualty figures are strongly right-skewed: Aboard has a mean of 27.5 against a median of 13 and a maximum of 644; Fatalities has a mean of 20.1 against a median of 9 and a maximum of 583; Ground deaths are zero in roughly 96% of rows but reach 2,750, an extreme outlier worth investigating. Operator and Type are concentrated in a few values (Aeroflot and U.S. military operators; Douglas DC-3 alone appears 334 times), which could bias any aggregate analysis. Flight # is missing in nearly 80% of rows and Time in 42%, so both are weak fields for filtering. Start with the Fatalities distribution and the top operators and aircraft types.

citing: row_count · column_count · columns.Aboard.stats · columns.Fatalities.stats · columns.Ground.stats · columns.Operator.top_values · columns.Type.top_values · columns.Flight #.null_rate · columns.Time.null_rate

Charts the summary said to look at first

Fatalities · Heavily right-skewed: most crashes kill under 10 people but a long tail reaches 583.

Histogram bins for Fatalities (median: 9.0).
bin	count
0 – 14.57	3314
14.57 – 29.15	980
29.15 – 43.72	343
43.72 – 58.3	215
58.3 – 72.88	96
72.88 – 87.45	90
87.45 – 102	51
102 – 116.6	42
116.6 – 131.2	39
131.2 – 145.8	19
145.8 – 160.3	18
160.3 – 174.9	9
174.9 – 189.5	11
189.5 – 204	3
204 – 218.6	2
218.6 – 233.2	6
233.2 – 247.8	2
247.8 – 262.3	5
262.3 – 276.9	4
276.9 – 291.5	1
291.5 – 306.1	1
306.1 – 320.6	0
320.6 – 335.2	1
335.2 – 349.8	2
349.8 – 364.4	0
364.4 – 378.9	0
378.9 – 393.5	0
393.5 – 408.1	0
408.1 – 422.7	0
422.7 – 437.2	0
437.2 – 451.8	0
451.8 – 466.4	0
466.4 – 481	0
481 – 495.5	0
495.5 – 510.1	0
510.1 – 524.7	1
524.7 – 539.3	0
539.3 – 553.9	0
553.9 – 568.4	0
568.4 – 583	1
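
The bin edges above are consistent with 40 equal-width bins spanning 0 to the column maximum (583 / 40 ≈ 14.57). The following is a minimal sketch of that kind of binning, assuming the last bin is closed on the right; the function name `equal_width_bins` and the sample values are illustrative and are not taken from the profiler.

```python
def equal_width_bins(values, n_bins=40):
    """Count values into n_bins equal-width bins from min to max.

    Assumes max > min; the maximum value is placed in the final bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum lands in the last bin instead of overflowing.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Illustrative values only, not rows from the dataset.
sample = [0, 3, 9, 14, 15, 29, 583]
edges, counts = equal_width_bins(sample)
```

With a maximum of 583 the first edge after zero falls at 14.575, which matches the first bin boundary in the table.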

Aboard · Shows passenger load distribution; median 13 with rare wide-body events near 644.

Histogram bins for Aboard (median: 13.0).
bin	count
0 – 16.1	2978
16.1 – 32.2	1055
32.2 – 48.3	430
48.3 – 64.4	230
64.4 – 80.5	129
80.5 – 96.6	105
96.6 – 112.7	75
112.7 – 128.8	56
128.8 – 144.9	46
144.9 – 161	35
161 – 177.1	27
177.1 – 193.2	16
193.2 – 209.3	8
209.3 – 225.4	7
225.4 – 241.5	9
241.5 – 257.6	4
257.6 – 273.7	9
273.7 – 289.8	3
289.8 – 305.9	9
305.9 – 322	3
322 – 338.1	2
338.1 – 354.2	3
354.2 – 370.3	1
370.3 – 386.4	1
386.4 – 402.5	2
402.5 – 418.6	0
418.6 – 434.7	0
434.7 – 450.8	0
450.8 – 466.9	0
466.9 – 483	0
483 – 499.1	0
499.1 – 515.2	0
515.2 – 531.3	2
531.3 – 547.4	0
547.4 – 563.5	0
563.5 – 579.6	0
579.6 – 595.7	0
595.7 – 611.8	0
611.8 – 627.9	0
627.9 – 644	1

Operator · Aeroflot and U.S. military operators dominate — check for over-representation before averaging.

Character-length distribution for Operator (mean: 19.49).
chars	count
3 – 5	96
5 – 6	233
6 – 8	140
8 – 9	462
9 – 11	169
11 – 12	184
12 – 14	128
14 – 15	395
15 – 17	270
17 – 18	447
18 – 20	407
20 – 22	143
22 – 23	340
23 – 25	205
25 – 26	542
26 – 28	166
28 – 29	229
29 – 31	102
31 – 32	194
32 – 34	54
34 – 36	127
36 – 37	62
37 – 39	35
39 – 40	30
40 – 42	13
42 – 43	25
43 – 45	11
45 – 46	5
46 – 48	7
48 – 50	7
50 – 51	7
51 – 53	0
53 – 54	9
54 – 56	1
56 – 57	3
57 – 59	0
59 – 60	1
60 – 62	0
62 – 63	0
63 – 65	1

Type · Douglas DC-3 alone accounts for 334 crashes; aircraft type is highly concentrated at the top.

Character-length distribution for Type (mean: 18.33).
chars	count
4 – 5	6
5 – 6	5
6 – 7	6
7 – 8	19
8 – 8	32
8 – 9	57
9 – 10	178
10 – 11	255
11 – 12	685
12 – 13	0
13 – 14	522
14 – 15	331
15 – 16	441
16 – 17	369
17 – 18	208
18 – 18	158
18 – 19	154
19 – 20	166
20 – 21	154
21 – 22	0
22 – 23	109
23 – 24	120
24 – 25	158
25 – 26	188
26 – 26	174
26 – 27	107
27 – 28	73
28 – 29	85
29 – 30	39
30 – 31	0
31 – 32	66
32 – 33	58
33 – 34	55
34 – 35	43
35 – 36	16
36 – 36	25
36 – 37	16
37 – 38	9
38 – 39	21
39 – 40	133

Ground · Almost all values are zero; the few non-zero entries (up to 2,750) are extreme outliers worth flagging.

Histogram bins for Ground (median: 0.0).
bin	count
0 – 68.75	5235
68.75 – 137.5	8
137.5 – 206.2	0
206.2 – 275	1
275 – 343.8	0
343.8 – 412.5	0
412.5 – 481.2	0
481.2 – 550	0
550 – 618.8	0
618.8 – 687.5	0
687.5 – 756.2	0
756.2 – 825	0
825 – 893.8	0
893.8 – 962.5	0
962.5 – 1031	0
1031 – 1100	0
1100 – 1169	0
1169 – 1238	0
1238 – 1306	0
1306 – 1375	0
1375 – 1444	0
1444 – 1512	0
1512 – 1581	0
1581 – 1650	0
1650 – 1719	0
1719 – 1788	0
1788 – 1856	0
1856 – 1925	0
1925 – 1994	0
1994 – 2062	0
2062 – 2131	0
2131 – 2200	0
2200 – 2269	0
2269 – 2338	0
2338 – 2406	0
2406 – 2475	0
2475 – 2544	0
2544 – 2612	0
2612 – 2681	0
2681 – 2750	2

Schema

13 columns

Per-column summary. Click column name to jump to its detail.
Column	Type	Nulls	Unique	Alerts
Date	text	0.0%	4,753	one_word allcaps short_text
Time	text	42.1%	1,005	one_word allcaps null_rate short_text duplicates
Location	text	0.4%	4,303
Operator	text	0.3%	2,476	multilingual duplicates
Flight #	categorical	79.7%	724	long_tail null_rate
Route	text	32.4%	3,244	multilingual null_rate
Type	text	0.5%	2,446	duplicates
Registration	text	6.4%	4,905	near_unique one_word allcaps short_text
cn/In	text	23.3%	3,707	one_word allcaps null_rate short_text
Aboard	numeric	0.4%	239	high_skew outliers
Fatalities	numeric	0.2%	191	high_skew outliers
Ground	numeric	0.4%	50	high_skew
Summary	text	7.4%	4,673	near_unique

Date

text timestamp one_word allcaps short_text

This column holds dates stored as text in MM/DD/YYYY format — every one of 5268 values is exactly 10 characters and a single token. There are 515 duplicates (9.8%) with repeats clustering on historically notable days such as 09/11/2001 and 06/06/1944, suggesting the rows describe events tied to those dates rather than unique daily records. The text alerts (allcaps, one_word, short_text) are artifacts of the date formatting, not real free-text content. Treatment: parse to a proper date type (MM/DD/YYYY) before any temporal analysis. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 0 (0.0%)
unique: 4,753
len_min: 10
len_max: 10
len_mean: 10
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 515
duplicate_rate: 0.09776
vocab_size: 4,753
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
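
The treatment for Date (parse MM/DD/YYYY text into a proper date type) could look like the sketch below. It uses only the standard library; the helper name `parse_crash_date` is hypothetical, and returning `None` for unparseable values is an assumption about how bad rows should be handled.

```python
from datetime import datetime, date


def parse_crash_date(text):
    """Parse an MM/DD/YYYY string into a date; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%m/%d/%Y").date()
    except (ValueError, AttributeError):
        # ValueError: wrong format or impossible date; AttributeError: non-string input.
        return None


# The first two strings are the dates cited in the column note; the rest are invalid inputs.
parsed = [parse_crash_date(s) for s in ["09/11/2001", "06/06/1944", "13/40/2001", None]]
```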

Time

text timestamp one_word allcaps null_rate short_text duplicates

Clock times in HH:MM format, stored as text rather than a temporal type. Values such as '15:00', '12:00' and '11:00' top the list, and lengths range from 4 to 7 characters (mean 5.0), so a small share of values deviates from the five-character HH:MM pattern. Roughly 42% of rows are null, and 67% of the non-null values are duplicates across only 1,005 distinct times; the top values suggest reported times cluster on the hour. Although the values look numeric, the column tripped the allcaps and one_word alerts because the profiler treats the strings as tokens. Treatment: parse to a time-of-day type and impute or flag the 42% missing before use. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 2,219 (42.1%)
unique: 1,005
len_min: 4
len_max: 7
len_mean: 5.003
len_median: 5
len_p95: 5
word_mean: 1.001
word_median: 1
n_empty: 0
n_duplicates: 2,044
duplicate_rate: 0.6704
vocab_size: 1,004
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.999
allcaps_rate: 0.9974
boilerplate_rate: 0
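
A minimal sketch of the Time treatment, assuming that values which are not exactly HH:MM still contain an H:MM or HH:MM fragment that can be extracted. The helper name `parse_clock_time` and the choice to flag rather than impute missing values are assumptions for illustration.

```python
import re
from datetime import time

# One or two hour digits, a colon, then two minute digits, anywhere in the string.
_TIME_RE = re.compile(r"(\d{1,2}):(\d{2})")


def parse_clock_time(text):
    """Return (time_or_None, is_missing) for a raw Time value."""
    if text is None or not text.strip():
        return None, True
    match = _TIME_RE.search(text)
    if not match:
        return None, True
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        # Looks like a time but is out of range; treat it as missing.
        return None, True
    return time(hour, minute), False
```

Keeping the `is_missing` flag next to the parsed value lets a later step decide between imputing and dropping without re-reading the raw text.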

Location

text feature

Free-text place names, typically 'City, Country/State' (word_median 3, len_median 19), with 4303 unique values across 5268 rows. Top entries cluster on major world cities (Sao Paulo, Moscow, Rio), but 'near' appears 1272 times suggesting many entries are approximate locations rather than exact place names. Duplicate rate of 18% and 945 repeated strings indicate moderate reusability, though high cardinality limits direct grouping. Treatment: Parse into city/region/country components and geocode before use; raw strings are too high-cardinality to one-hot. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 20 (0.4%)
unique: 4,303
len_min: 5
len_max: 60
len_mean: 20.38
len_median: 19
len_p95: 31
word_mean: 2.866
word_median: 3
n_empty: 0
n_duplicates: 945
duplicate_rate: 0.1801
vocab_size: 4,541
readability_flesch_mean: 24.03
emoji_rate: 0
url_rate: 0
one_word_rate: 0.01124
allcaps_rate: 0
boilerplate_rate: 0
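
As a first step toward the Location treatment, the sketch below separates the approximate-location marker from the place and splits on the last comma. It assumes the 'near' token appears as a leading prefix and that the text after the last comma is the country or state; the helper name `split_location` is hypothetical, the sample strings in the usage lines are illustrative, and geocoding is left out.

```python
def split_location(text):
    """Split 'Near City, Region' into (place, region, is_approximate)."""
    if text is None or not text.strip():
        return None, None, False
    cleaned = text.strip()
    approximate = cleaned.lower().startswith("near ")
    if approximate:
        cleaned = cleaned[len("near "):].strip()
    # Split on the last comma: the tail is assumed to be the country or state.
    place, sep, region = cleaned.rpartition(",")
    if not sep:
        return cleaned, None, approximate
    return place.strip(), region.strip(), approximate


example_exact = split_location("Sao Paulo, Brazil")
example_near = split_location("Near Moscow, Russia")
```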

Operator

text feature multilingual duplicates

This column holds the airline or military operator name for each record, with 2476 unique values across 5268 rows and only a 0.0034 null rate. It is heavily duplicated (duplicate_rate 0.528, n_duplicates 2774), led by Aeroflot (179) and Military - U.S. Air Force (176), and the language detector flags a multilingual mix dominated by English (3340) but with sizable Italian (278), Spanish (224), German (202), and French (183) counts — likely an artifact of short proper nouns rather than true translations. Entries are short (word_mean 3.05, len_mean 19.5) and one_word_rate is 0.165, consistent with brand-style names. Treatment: Normalize casing and consolidate Military - * variants, then treat as a high-cardinality categorical (target/frequency encode). high · anthropic:claude-opus-4-7

n: 5,268
nulls: 18 (0.3%)
unique: 2,476
len_min: 3
len_max: 65
len_mean: 19.49
len_median: 19
len_p95: 35
word_mean: 3.047
word_median: 3
n_empty: 0
n_duplicates: 2,774
duplicate_rate: 0.5284
vocab_size: 2,370
readability_flesch_mean: 19.61
emoji_rate: 0
url_rate: 0
one_word_rate: 0.1651
allcaps_rate: 0.03733
boilerplate_rate: 0
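
One way to sketch the Operator treatment with the standard library is shown below. Folding every 'Military - *' variant into a single bucket is a modelling assumption (keeping one bucket per branch is equally defensible), and the helper names `normalize_operator` and `frequency_encode` are hypothetical.

```python
from collections import Counter


def normalize_operator(name):
    """Return a case-insensitive grouping key, with all military variants merged."""
    if name is None or not name.strip():
        return None
    cleaned = " ".join(name.split()).casefold()
    if cleaned.startswith("military"):
        return "military"
    return cleaned


def frequency_encode(names):
    """Replace each operator with its share of the non-null rows (0.0 for nulls)."""
    keys = [normalize_operator(n) for n in names]
    counts = Counter(k for k in keys if k is not None)
    total = sum(counts.values())
    return [counts[k] / total if k is not None else 0.0 for k in keys]


# Operator names taken from the column note, with casing and spacing varied for illustration.
sample = ["Aeroflot", "AEROFLOT ", "Military - U.S. Air Force", "Military - U.S. Navy", None]
encoded = frequency_encode(sample)
```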

Flight #

categorical identifier long_tail null_rate

Flight number identifier for the crashed flight. Nearly 80% of rows are null (null_rate 0.7971), and the most common non-null value is the placeholder '-' at 67 occurrences (6.27% of present values), so missing-data sentinels are mixed with real codes. Cardinality is high (724 unique values among the 1,069 populated rows) with entropy_ratio 0.953, so populated values are close to uniformly distributed. Treatment: Normalize '-' to null and treat as a high-cardinality identifier; drop from modelling or use only as a join key. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 4,199 (79.7%)
unique: 724
top_value: -
top_rate: 0.06268
cardinality: 724
entropy: 9.058
entropy_ratio: 0.9534
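
The sentinel clean-up for Flight # could be sketched as follows. The helper name `clean_flight_number` is hypothetical, and treating blank strings the same way as '-' is an added assumption.

```python
def clean_flight_number(value):
    """Map the '-' sentinel and blank strings to None; otherwise return the trimmed code."""
    if value is None:
        return None
    cleaned = value.strip()
    return None if cleaned in {"", "-"} else cleaned


# Illustrative inputs: a placeholder, a padded code, and a missing value.
cleaned_values = [clean_flight_number(v) for v in ["-", " 101 ", None]]
```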

Route

text free_text multilingual null_rate

Short free-text describing a flight route, typically formatted as 'Origin - Destination' (the hyphen appears 3658 times across 5268 rows) with occasional non-route labels like 'Training' (81), 'Sightseeing' (29), or 'Test flight' (17). Values are short (mean 22 chars, 4 words) and highly varied (3244 unique out of 5268), but 32.38% are null and 318 duplicates exist. Language detection flags a multilingual mix dominated by English (2567) with notable Spanish (237), Portuguese (100), German (88) and Italian (88), reflecting place names rather than true prose. Treatment: Parse on ' - ' to split origin/destination and bucket non-route labels separately; impute or flag the 32% nulls. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 1,706 (32.4%)
unique: 3,244
len_min: 4
len_max: 59
len_mean: 22.09
len_median: 20
len_p95: 37
word_mean: 4.065
word_median: 4
n_empty: 0
n_duplicates: 318
duplicate_rate: 0.08928
vocab_size: 3,647
readability_flesch_mean: 27.15
emoji_rate: 0
url_rate: 0
one_word_rate: 0.04099
allcaps_rate: 0.0002807
boilerplate_rate: 0
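
A minimal sketch of the Route treatment: split on ' - ' and bucket the non-route labels separately. The label set contains only the three labels named in the column note, so it is almost certainly incomplete; the helper name `parse_route` and the decision to keep only the first and last stop of multi-leg routes are assumptions.

```python
# Labels named in the column note; extend after inspecting the real values.
NON_ROUTE_LABELS = {"training", "sightseeing", "test flight"}


def parse_route(text):
    """Return (origin, destination, kind) for a raw Route value."""
    if text is None or not text.strip():
        return None, None, "missing"
    cleaned = text.strip()
    if cleaned.lower() in NON_ROUTE_LABELS:
        return None, None, cleaned.lower()
    parts = [p.strip() for p in cleaned.split(" - ")]
    if len(parts) < 2:
        return None, None, "unparsed"
    # Multi-leg routes keep only the first and last stop.
    return parts[0], parts[-1], "route"


# Illustrative route string, not a row from the dataset.
example = parse_route("Paris - London")
```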

Type

text feature duplicates

This column records aircraft make-and-model designations, dominated by manufacturer-plus-type strings like 'Douglas DC-3' (334 occurrences) and 'de Havilland Canada DHC-6 Twin Otter 300'. Values are short (mean 18.3 chars, median 2 words) but highly repetitive: 53.3% duplicate rate across 2,446 unique types, with Douglas alone appearing in 1,113 rows. Watch for near-duplicate variants of the same airframe ('Douglas C-47', 'Douglas C-47A', 'Douglas C-47B') that will fragment any group-by unless normalised. Treatment: Normalise manufacturer/variant strings (e.g. collapse C-47 sub-variants) before using as a categorical feature. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 27 (0.5%)
unique: 2,446
len_min: 4
len_max: 40
len_mean: 18.33
len_median: 16
len_p95: 34
word_mean: 2.718
word_median: 2
n_empty: 0
n_duplicates: 2,795
duplicate_rate: 0.5333
vocab_size: 2,534
readability_flesch_mean: 69.26
emoji_rate: 0
url_rate: 0
one_word_rate: 0.007441
allcaps_rate: 0.00954
boilerplate_rate: 0
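
The sub-variant collapse for Type could be approximated with a regular expression, as in the sketch below. It assumes a variant is a single trailing capital letter after a letters-hyphen-digits designation (as in 'C-47A'); that rule will not catch every naming scheme and may merge types that should stay distinct, so results should be reviewed. The helper name `base_type` is hypothetical.

```python
import re

# Designation such as 'C-47' followed by exactly one trailing variant letter.
_VARIANT_SUFFIX = re.compile(r"^(?P<base>.*\b[A-Z]+-\d+)[A-Z]$")


def base_type(name):
    """Strip a single trailing variant letter, e.g. 'Douglas C-47A' -> 'Douglas C-47'."""
    if name is None or not name.strip():
        return None
    cleaned = " ".join(name.split())
    match = _VARIANT_SUFFIX.match(cleaned)
    return match.group("base") if match else cleaned


# The three variants named in the column note collapse to one key.
collapsed = {base_type(t) for t in ["Douglas C-47", "Douglas C-47A", "Douglas C-47B"]}
```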

Registration

text identifier near_unique one_word allcaps short_text

Almost certainly aircraft tail/registration codes: 4905 unique values across 5268 rows, 99% all-caps single tokens with mean length 6.4 (max 15), and top tokens like 'hk-' and 'nc10809' resemble registration prefixes. Near-unique (n_unique/n ≈ 0.93) with a 6.36% null rate and only 28 duplicates, so it behaves as an identifier rather than a feature. The lone '/' appearing 36 times suggests a placeholder for split/unknown registrations worth inspecting. Treatment: Treat as an identifier: drop from modelling or use only for joins/lookup after normalising case and the '/' placeholder. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 335 (6.4%)
unique: 4,905
len_min: 1
len_max: 15
len_mean: 6.394
len_median: 6
len_p95: 10
word_mean: 1.018
word_median: 1
n_empty: 0
n_duplicates: 28
duplicate_rate: 0.005676
vocab_size: 4,948
readability_flesch_mean: 103
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9899
allcaps_rate: 0.9919
boilerplate_rate: 0
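
For Registration, normalising case and the '/' placeholder might look like the sketch below. It assumes a value made up only of '/' or '-' characters is a placeholder; how '/' actually occurs in the data should be inspected first. The helper name `clean_registration` is hypothetical.

```python
def clean_registration(value):
    """Upper-case and trim a registration; map placeholder-only values to None."""
    if value is None:
        return None
    cleaned = value.strip().upper()
    # A value consisting only of '/' or '-' (or nothing) carries no registration.
    if cleaned.strip("/-") == "":
        return None
    return cleaned


# 'nc10809' is one of the top tokens cited in the column note.
cleaned_regs = [clean_registration(v) for v in [" nc10809 ", "/", None]]
```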

cn/In

text feature one_word allcaps null_rate short_text

Despite the text classification, 'cn/In' looks like a short numeric code field — values are predominantly one-word, all-caps tokens with a mean length of 5.6 characters and the top values ('178', '19', '229', '125') all being integers. About 23.3% of rows are null and only 333 duplicates (8.2%) appear across 3,707 unique values, so cardinality is high relative to 5,268 rows. The '/' character showing up 49 times in top_words hints at occasional composite values (e.g., 'a/b'), which the column name 'cn/In' also suggests. Treatment: Cast to numeric where possible and split composite '/'-separated entries; impute or flag the 23% nulls before modelling. medium · anthropic:claude-opus-4-7

n: 5,268
nulls: 1,228 (23.3%)
unique: 3,707
len_min: 1
len_max: 20
len_mean: 5.645
len_median: 5
len_p95: 10
word_mean: 1.026
word_median: 1
n_empty: 0
n_duplicates: 333
duplicate_rate: 0.08243
vocab_size: 3,739
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9842
allcaps_rate: 0.9663
boilerplate_rate: 0
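
A minimal sketch of the cn/In treatment: split composite values on '/' and cast the purely numeric parts to integers. The helper name `parse_cn` is hypothetical, the composite and non-numeric inputs in the test values are illustrative, and leaving non-numeric parts as strings is an assumption about what 'where possible' should mean.

```python
def parse_cn(value):
    """Split on '/' and cast each part to int where possible; other parts stay strings."""
    if value is None or not value.strip():
        return []
    parts = [p.strip() for p in value.split("/") if p.strip()]
    # isascii() guards against Unicode digits that int() would reject.
    return [int(p) if p.isascii() and p.isdigit() else p for p in parts]


# '178' is one of the top values cited in the column note.
single = parse_cn("178")
composite = parse_cn("229/12")
```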

Aboard

numeric feature high_skew outliers

This column records the number of people aboard, with values ranging from 0 to 644 and a median of 13. The distribution is heavily right-skewed (skew 4.25, kurtosis 28.4) and roughly 10% of non-null rows (529) are flagged as outliers, indicating a long tail of large airliners against a typical small-aircraft baseline. Nulls are negligible (0.42%) and only 239 distinct values appear across 5268 rows. Treatment: log1p-transform before modelling to tame the heavy right tail; a plain log is undefined for the few zero values. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 22 (0.4%)
unique: 239
min: 0
max: 644
mean: 27.55
median: 13
std: 43.08
q1: 5
q3: 30
iqr: 25
skew: 4.247
kurtosis: 28.41
n_outliers: 529
outlier_rate: 0.1008
zero_rate: 0.0003812
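
The outlier count for Aboard is in line with the common 1.5 × IQR (Tukey fence) rule, although the profiler's exact rule is not stated in this report. Under that assumption, the fences follow directly from the reported quartiles; the helper names `iqr_fences` and `is_outlier` are hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) Tukey fences; values outside them count as outliers."""
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def is_outlier(value, q1, q3, k=1.5):
    """True when value falls strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return value < lower or value > upper


# Quartiles reported for Aboard: q1 = 5, q3 = 30.
lower, upper = iqr_fences(5, 30)
```

With q1 = 5 and q3 = 30 the upper fence is 67.5, so under this rule any crash with 68 or more people aboard would be flagged.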

Fatalities

numeric numeric_target high_skew outliers

Counts of deaths per event, ranging from 0 to 583 with a median of 9 and mean of 20.07. The distribution is heavily right-skewed (skew 4.95, kurtosis 42.79) with 444 outliers (8.4% of rows) and a small zero rate of 1.1%. The IQR of 20 against a max of 583 confirms a long tail driven by rare catastrophic events. Treatment: log1p-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 12 (0.2%)
unique: 191
min: 0
max: 583
mean: 20.07
median: 9
std: 33.2
q1: 3
q3: 23
iqr: 20
skew: 4.948
kurtosis: 42.79
n_outliers: 444
outlier_rate: 0.08447
zero_rate: 0.01104
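
The log1p treatment for Fatalities can be sketched with the standard library as below. `log1p(x)` computes ln(1 + x), so zero maps to zero and the transform stays defined for the 1.1% of rows with no deaths. The helper names are hypothetical, and passing nulls through as `None` is an assumption.

```python
import math


def log1p_transform(values):
    """Apply ln(1 + x) to each value, passing None through unchanged."""
    return [None if v is None else math.log1p(v) for v in values]


def inverse_log1p(values):
    """Undo log1p_transform with exp(x) - 1."""
    return [None if v is None else math.expm1(v) for v in values]


# Minimum, median, and maximum reported for Fatalities, plus a null.
raw = [0, 9, 583, None]
transformed = log1p_transform(raw)
restored = inverse_log1p(transformed)
```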

Ground

numeric feature high_skew

Numeric field 'Ground' counts people killed on the ground. It is overwhelmingly zero (zero_rate 0.9583), with median, q1, and q3 all at 0 and only 50 unique values across 5268 rows. The non-zero tail is extreme: max 2,750 against a mean of 1.61, skew 50.3, and kurtosis 2559. Because the IQR is 0, every non-zero row is flagged, which accounts for the 219 outliers (4.17%). Treatment: Split into a zero/non-zero indicator and log-transform the non-zero magnitudes before modelling. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 22 (0.4%)
unique: 50
min: 0
max: 2,750
mean: 1.609
median: 0
std: 53.99
q1: 0
q3: 0
iqr: 0
skew: 50.34
kurtosis: 2559
n_outliers: 219
outlier_rate: 0.04175
zero_rate: 0.9583
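
The two-part treatment for Ground (indicator plus log magnitude) could be sketched as follows. The helper name `split_ground` is hypothetical, and returning `None` for the magnitude of zero or missing rows is an assumption; a model may need a different fill value.

```python
import math


def split_ground(value):
    """Return (has_ground_deaths, log_magnitude).

    The indicator is 0 or 1 (None when the value is missing); the log
    magnitude is only defined for non-zero counts.
    """
    if value is None:
        return None, None
    if value == 0:
        return 0, None
    return 1, math.log(value)


# Zero (the typical row) and the reported maximum of 2,750.
typical = split_ground(0)
extreme = split_ground(2750)
```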

Summary

text free_text near_unique

Free-text incident summaries averaging 201 characters (median 136, max 1954) with a Flesch readability of 61.7, suggesting short narrative paragraphs. Domain vocabulary is clearly aviation-accident: 'crashed' (2925), 'into' (2300), and 'aircraft' (2031) dominate after stopwords. Near-unique (4673 of 5268) but with 205 exact duplicates (4.2%) and a 7.4% null rate worth checking before modelling. Treatment: Tokenize and embed (or TF-IDF) for downstream NLP; dedupe the 205 exact repeats first. high · anthropic:claude-opus-4-7

n: 5,268
nulls: 390 (7.4%)
unique: 4,673
len_min: 6
len_max: 1,954
len_mean: 200.7
len_median: 136
len_p95: 584
word_mean: 33.24
word_median: 23
n_empty: 0
n_duplicates: 205
duplicate_rate: 0.04203
vocab_size: 12,513
readability_flesch_mean: 61.68
emoji_rate: 0
url_rate: 0
one_word_rate: 0.00041
allcaps_rate: 0
boilerplate_rate: 0
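
A minimal sketch of the first two Summary steps, deduplicating exact repeats and tokenizing, using only the standard library. The helper name `dedupe_and_tokenize`, the simple lowercase regex tokenizer, and the sample sentences are all illustrative assumptions; embedding or TF-IDF would follow as a separate step. The deduplication here applies to the text corpus only, since two distinct crashes can share the same summary text.

```python
import re


def dedupe_and_tokenize(summaries):
    """Drop nulls and exact duplicate texts, then split each text into lowercase tokens."""
    seen = set()
    documents = []
    for text in summaries:
        if text is None or not text.strip():
            continue
        key = text.strip()
        if key in seen:
            continue
        seen.add(key)
        # Keep runs of letters, digits, and apostrophes as tokens.
        documents.append(re.findall(r"[a-z0-9']+", key.lower()))
    return documents


# Illustrative sentences, not rows from the dataset.
docs = dedupe_and_tokenize(
    ["Crashed into a hill.", "Crashed into a hill.", None, "Engine failure."]
)
```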