disasters airplane crashes
Reading
This dataset records 5,268 airplane crashes across 13 columns, combining dates and times with operator, aircraft type, route, location, and casualty counts (Aboard, Fatalities, Ground). Casualty figures are heavily right-skewed: Aboard averages 27.5 against a median of 13 and a maximum of 644, while Fatalities averages 20.1 against a median of 9 and a maximum of 583. Ground deaths are zero in roughly 96% of rows but reach 2,750, an extreme outlier worth investigating. Operator and Type are dominated by a few heavy hitters (Aeroflot and U.S. military operators; Douglas DC-3 alone appears 334 times), a concentration that could bias any aggregate analysis. Flight # is missing in nearly 80% of rows and Time in 42%, so both fields are weak for filtering. Start with the Fatalities distribution and the top operators and aircraft types.
citing: row_count · column_count · columns.Aboard.stats · columns.Fatalities.stats · columns.Ground.stats · columns.Operator.top_values · columns.Type.top_values · columns.Flight #.null_rate · columns.Time.null_rate
Charts the summary said to look at first
Fatalities histogram (40 equal-width bins, 0 to 583)
| bin | count |
|---|---|
| 0 – 14.57 | 3314 |
| 14.57 – 29.15 | 980 |
| 29.15 – 43.72 | 343 |
| 43.72 – 58.3 | 215 |
| 58.3 – 72.88 | 96 |
| 72.88 – 87.45 | 90 |
| 87.45 – 102 | 51 |
| 102 – 116.6 | 42 |
| 116.6 – 131.2 | 39 |
| 131.2 – 145.8 | 19 |
| 145.8 – 160.3 | 18 |
| 160.3 – 174.9 | 9 |
| 174.9 – 189.5 | 11 |
| 189.5 – 204 | 3 |
| 204 – 218.6 | 2 |
| 218.6 – 233.2 | 6 |
| 233.2 – 247.8 | 2 |
| 247.8 – 262.3 | 5 |
| 262.3 – 276.9 | 4 |
| 276.9 – 291.5 | 1 |
| 291.5 – 306.1 | 1 |
| 306.1 – 320.6 | 0 |
| 320.6 – 335.2 | 1 |
| 335.2 – 349.8 | 2 |
| 349.8 – 364.4 | 0 |
| 364.4 – 378.9 | 0 |
| 378.9 – 393.5 | 0 |
| 393.5 – 408.1 | 0 |
| 408.1 – 422.7 | 0 |
| 422.7 – 437.2 | 0 |
| 437.2 – 451.8 | 0 |
| 451.8 – 466.4 | 0 |
| 466.4 – 481 | 0 |
| 481 – 495.5 | 0 |
| 495.5 – 510.1 | 0 |
| 510.1 – 524.7 | 1 |
| 524.7 – 539.3 | 0 |
| 539.3 – 553.9 | 0 |
| 553.9 – 568.4 | 0 |
| 568.4 – 583 | 1 |
Aboard histogram (40 equal-width bins, 0 to 644)
| bin | count |
|---|---|
| 0 – 16.1 | 2978 |
| 16.1 – 32.2 | 1055 |
| 32.2 – 48.3 | 430 |
| 48.3 – 64.4 | 230 |
| 64.4 – 80.5 | 129 |
| 80.5 – 96.6 | 105 |
| 96.6 – 112.7 | 75 |
| 112.7 – 128.8 | 56 |
| 128.8 – 144.9 | 46 |
| 144.9 – 161 | 35 |
| 161 – 177.1 | 27 |
| 177.1 – 193.2 | 16 |
| 193.2 – 209.3 | 8 |
| 209.3 – 225.4 | 7 |
| 225.4 – 241.5 | 9 |
| 241.5 – 257.6 | 4 |
| 257.6 – 273.7 | 9 |
| 273.7 – 289.8 | 3 |
| 289.8 – 305.9 | 9 |
| 305.9 – 322 | 3 |
| 322 – 338.1 | 2 |
| 338.1 – 354.2 | 3 |
| 354.2 – 370.3 | 1 |
| 370.3 – 386.4 | 1 |
| 386.4 – 402.5 | 2 |
| 402.5 – 418.6 | 0 |
| 418.6 – 434.7 | 0 |
| 434.7 – 450.8 | 0 |
| 450.8 – 466.9 | 0 |
| 466.9 – 483 | 0 |
| 483 – 499.1 | 0 |
| 499.1 – 515.2 | 0 |
| 515.2 – 531.3 | 2 |
| 531.3 – 547.4 | 0 |
| 547.4 – 563.5 | 0 |
| 563.5 – 579.6 | 0 |
| 579.6 – 595.7 | 0 |
| 595.7 – 611.8 | 0 |
| 611.8 – 627.9 | 0 |
| 627.9 – 644 | 1 |
Operator string length in characters (40 bins, 3 to 65)
| chars | count |
|---|---|
| 3 – 5 | 96 |
| 5 – 6 | 233 |
| 6 – 8 | 140 |
| 8 – 9 | 462 |
| 9 – 11 | 169 |
| 11 – 12 | 184 |
| 12 – 14 | 128 |
| 14 – 15 | 395 |
| 15 – 17 | 270 |
| 17 – 18 | 447 |
| 18 – 20 | 407 |
| 20 – 22 | 143 |
| 22 – 23 | 340 |
| 23 – 25 | 205 |
| 25 – 26 | 542 |
| 26 – 28 | 166 |
| 28 – 29 | 229 |
| 29 – 31 | 102 |
| 31 – 32 | 194 |
| 32 – 34 | 54 |
| 34 – 36 | 127 |
| 36 – 37 | 62 |
| 37 – 39 | 35 |
| 39 – 40 | 30 |
| 40 – 42 | 13 |
| 42 – 43 | 25 |
| 43 – 45 | 11 |
| 45 – 46 | 5 |
| 46 – 48 | 7 |
| 48 – 50 | 7 |
| 50 – 51 | 7 |
| 51 – 53 | 0 |
| 53 – 54 | 9 |
| 54 – 56 | 1 |
| 56 – 57 | 3 |
| 57 – 59 | 0 |
| 59 – 60 | 1 |
| 60 – 62 | 0 |
| 62 – 63 | 0 |
| 63 – 65 | 1 |
Type string length in characters (40 bins, 4 to 40)
| chars | count |
|---|---|
| 4 – 5 | 6 |
| 5 – 6 | 5 |
| 6 – 7 | 6 |
| 7 – 8 | 19 |
| 8 – 8 | 32 |
| 8 – 9 | 57 |
| 9 – 10 | 178 |
| 10 – 11 | 255 |
| 11 – 12 | 685 |
| 12 – 13 | 0 |
| 13 – 14 | 522 |
| 14 – 15 | 331 |
| 15 – 16 | 441 |
| 16 – 17 | 369 |
| 17 – 18 | 208 |
| 18 – 18 | 158 |
| 18 – 19 | 154 |
| 19 – 20 | 166 |
| 20 – 21 | 154 |
| 21 – 22 | 0 |
| 22 – 23 | 109 |
| 23 – 24 | 120 |
| 24 – 25 | 158 |
| 25 – 26 | 188 |
| 26 – 26 | 174 |
| 26 – 27 | 107 |
| 27 – 28 | 73 |
| 28 – 29 | 85 |
| 29 – 30 | 39 |
| 30 – 31 | 0 |
| 31 – 32 | 66 |
| 32 – 33 | 58 |
| 33 – 34 | 55 |
| 34 – 35 | 43 |
| 35 – 36 | 16 |
| 36 – 36 | 25 |
| 36 – 37 | 16 |
| 37 – 38 | 9 |
| 38 – 39 | 21 |
| 39 – 40 | 133 |
Ground histogram (40 equal-width bins, 0 to 2,750)
| bin | count |
|---|---|
| 0 – 68.75 | 5235 |
| 68.75 – 137.5 | 8 |
| 137.5 – 206.2 | 0 |
| 206.2 – 275 | 1 |
| 275 – 343.8 | 0 |
| 343.8 – 412.5 | 0 |
| 412.5 – 481.2 | 0 |
| 481.2 – 550 | 0 |
| 550 – 618.8 | 0 |
| 618.8 – 687.5 | 0 |
| 687.5 – 756.2 | 0 |
| 756.2 – 825 | 0 |
| 825 – 893.8 | 0 |
| 893.8 – 962.5 | 0 |
| 962.5 – 1031 | 0 |
| 1031 – 1100 | 0 |
| 1100 – 1169 | 0 |
| 1169 – 1238 | 0 |
| 1238 – 1306 | 0 |
| 1306 – 1375 | 0 |
| 1375 – 1444 | 0 |
| 1444 – 1512 | 0 |
| 1512 – 1581 | 0 |
| 1581 – 1650 | 0 |
| 1650 – 1719 | 0 |
| 1719 – 1788 | 0 |
| 1788 – 1856 | 0 |
| 1856 – 1925 | 0 |
| 1925 – 1994 | 0 |
| 1994 – 2062 | 0 |
| 2062 – 2131 | 0 |
| 2131 – 2200 | 0 |
| 2200 – 2269 | 0 |
| 2269 – 2338 | 0 |
| 2338 – 2406 | 0 |
| 2406 – 2475 | 0 |
| 2475 – 2544 | 0 |
| 2544 – 2612 | 0 |
| 2612 – 2681 | 0 |
| 2681 – 2750 | 2 |
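The bin edges in the numeric tables above are consistent with 40 equal-width bins running from the column minimum to its maximum (for example, 583 / 40 ≈ 14.57 and 2,750 / 40 = 68.75). A minimal Python sketch of that binning follows. The function name `equal_width_histogram` is hypothetical, and the profiler's actual implementation is not shown in this report.

```python
def equal_width_histogram(values, n_bins=40):
    """Count values in n_bins equal-width bins from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a 41st one.
        idx = min(int((v - lo) / width), n_bins - 1) if width else 0
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts
```

With a maximum of 583, the first bin is roughly 0 to 14.57, so a heavily skewed column puts most rows in the first one or two bins, as the tables show.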
Schema
13 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| Date | text | 0.0% | 4,753 | one_word, allcaps, short_text |
| Time | text | 42.1% | 1,005 | one_word, allcaps, null_rate, short_text, duplicates |
| Location | text | 0.4% | 4,303 | |
| Operator | text | 0.3% | 2,476 | multilingual, duplicates |
| Flight # | categorical | 79.7% | 724 | long_tail, null_rate |
| Route | text | 32.4% | 3,244 | multilingual, null_rate |
| Type | text | 0.5% | 2,446 | duplicates |
| Registration | text | 6.4% | 4,905 | near_unique, one_word, allcaps, short_text |
| cn/In | text | 23.3% | 3,707 | one_word, allcaps, null_rate, short_text |
| Aboard | numeric | 0.4% | 239 | high_skew, outliers |
| Fatalities | numeric | 0.2% | 191 | high_skew, outliers |
| Ground | numeric | 0.4% | 50 | high_skew |
| Summary | text | 7.4% | 4,673 | near_unique |
Date
text timestamp one_word allcaps short_text
This column holds dates stored as text in MM/DD/YYYY format: every one of the 5,268 values is exactly 10 characters and a single token. There are 515 duplicates (9.8%), with repeats clustering on historically notable days such as 09/11/2001 and 06/06/1944, so several rows can share one date and Date alone does not identify a crash. The text alerts (allcaps, one_word, short_text) are artifacts of the date formatting, not real free-text content. Treatment: parse to a proper date type (MM/DD/YYYY) before any temporal analysis.
- n: 5,268
- nulls: 0 (0.0%)
- unique: 4,753
- len_min: 10
- len_max: 10
- len_mean: 10
- len_median: 10
- len_p95: 10
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 515
- duplicate_rate: 0.09776
- vocab_size: 4,753
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
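The Date treatment (parsing MM/DD/YYYY text into a date type) could look like the sketch below. The helper name `parse_crash_date` is hypothetical, and returning `None` for malformed input is an assumption about how bad values should be handled.

```python
from datetime import date, datetime
from typing import Optional


def parse_crash_date(text: Optional[str]) -> Optional[date]:
    """Parse an MM/DD/YYYY string; return None for missing or malformed input."""
    if not text:
        return None
    try:
        return datetime.strptime(text.strip(), "%m/%d/%Y").date()
    except ValueError:
        return None


print(parse_crash_date("09/11/2001"))  # 2001-09-11
```

Once parsed, the values sort chronologically and can be grouped by year or decade, which the raw strings cannot do reliably.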
Time
text timestamp one_word allcaps null_rate short_text duplicates
Clock times in HH:MM format, stored as text rather than a temporal type. Values like '15:00', '12:00' and '11:00' top the list, and lengths range from 4 to 7 characters (mean 5.0, p95 5), so a small share of values deviate from the five-character pattern. Roughly 42% of rows are null, and 67% of the non-null values are duplicates across only 1,005 distinct times; the top values suggest times cluster on round hours. Although the values look numeric, the column tripped the allcaps and one_word alerts because the profiler treats each string as a token. Treatment: parse to a time-of-day type and impute or flag the 42% missing before use.
- n: 5,268
- nulls: 2,219 (42.1%)
- unique: 1,005
- len_min: 4
- len_max: 7
- len_mean: 5.003
- len_median: 5
- len_p95: 5
- word_mean: 1.001
- word_median: 1
- n_empty: 0
- n_duplicates: 2,044
- duplicate_rate: 0.6704
- vocab_size: 1,004
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.999
- allcaps_rate: 0.9974
- boilerplate_rate: 0
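A tolerant parser is one way to handle the Time column, since lengths of 4 to 7 characters imply that not every value is a clean five-character HH:MM string. The sketch below is illustrative: `parse_clock_time` is a hypothetical name, and searching for the first `H:MM` or `HH:MM` pattern anywhere in the string is an assumption about what the irregular values look like.

```python
import re
from datetime import time
from typing import Optional

_TIME_RE = re.compile(r"(\d{1,2}):(\d{2})")


def parse_clock_time(text: Optional[str]) -> Optional[time]:
    """Return a time for the first H:MM / HH:MM pattern found, else None."""
    if not text:
        return None
    match = _TIME_RE.search(text)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    # Reject values that look like a time but are out of range.
    if hour > 23 or minute > 59:
        return None
    return time(hour, minute)
```

Returning `None` keeps missing and unparseable values flagged rather than silently imputed.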
Location
text feature
Free-text place names, typically 'City, Country/State' (word_median 3, len_median 19), with 4,303 unique values across 5,268 rows. Top entries cluster on major world cities (Sao Paulo, Moscow, Rio), but 'near' appears 1,272 times, suggesting many entries are approximate locations rather than exact place names. With 945 duplicates (18%), some strings repeat, but the high cardinality limits direct grouping. Treatment: parse into city/region/country components and geocode before use; raw strings are too high-cardinality to one-hot encode.
- n: 5,268
- nulls: 20 (0.4%)
- unique: 4,303
- len_min: 5
- len_max: 60
- len_mean: 20.38
- len_median: 19
- len_p95: 31
- word_mean: 2.866
- word_median: 3
- n_empty: 0
- n_duplicates: 945
- duplicate_rate: 0.1801
- vocab_size: 4,541
- readability_flesch_mean: 24.03
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.01124
- allcaps_rate: 0
- boilerplate_rate: 0
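Splitting Location into components might start with something like the sketch below. It assumes that 'near' appears as a leading word and that the text after the last comma is the region or country. Neither is confirmed by the profile, and `parse_location` and `ParsedLocation` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ParsedLocation(NamedTuple):
    place: str
    region: Optional[str]
    approximate: bool


def parse_location(text: str) -> ParsedLocation:
    """Split 'Near City, Region' style strings into place, region, and a flag."""
    cleaned = text.strip()
    # Flag and strip a leading "near" so approximate and exact entries group together.
    approximate = re.match(r"(?i)near\s+", cleaned) is not None
    if approximate:
        cleaned = re.sub(r"(?i)^near\s+", "", cleaned)
    # Treat the text after the last comma as the region or country.
    place, sep, region = cleaned.rpartition(",")
    if not sep:
        return ParsedLocation(cleaned, None, approximate)
    return ParsedLocation(place.strip(), region.strip(), approximate)
```

Geocoding would then run on the `place` and `region` fields rather than on the raw string.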
Operator
text feature multilingual duplicates
This column holds the airline or military operator name for each record, with 2,476 unique values across 5,268 rows and a null rate of only 0.34%. It is heavily duplicated (duplicate_rate 0.528, n_duplicates 2,774), led by Aeroflot (179) and Military - U.S. Air Force (176). The language detector flags a multilingual mix dominated by English (3,340) with sizable Italian (278), Spanish (224), German (202), and French (183) counts, which is likely an artifact of running detection on short proper nouns rather than evidence of translated text. Entries are short (word_mean 3.05, len_mean 19.5) and one_word_rate is 0.165, consistent with brand-style names. Treatment: normalize casing and consolidate Military - * variants, then treat as a high-cardinality categorical (target or frequency encode).
- n: 5,268
- nulls: 18 (0.3%)
- unique: 2,476
- len_min: 3
- len_max: 65
- len_mean: 19.49
- len_median: 19
- len_p95: 35
- word_mean: 3.047
- word_median: 3
- n_empty: 0
- n_duplicates: 2,774
- duplicate_rate: 0.5284
- vocab_size: 2,370
- readability_flesch_mean: 19.61
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.1651
- allcaps_rate: 0.03733
- boilerplate_rate: 0
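One possible reading of the Operator treatment is sketched below: case-fold the names, collapse every 'Military - ...' entry into a single bucket, and replace each name with its relative frequency. Collapsing all military branches together is an assumption (keeping one bucket per branch would be equally defensible), and both function names are hypothetical.

```python
import re
from collections import Counter


def normalize_operator(name: str) -> str:
    """Case-fold an operator name and bucket all 'Military - ...' variants."""
    cleaned = " ".join(name.split())
    # Assumption: every "Military - <branch>" variant shares one bucket.
    if re.match(r"(?i)military\s*-", cleaned):
        return "military"
    return cleaned.casefold()


def frequency_encode(names):
    """Map each name to the share of rows carrying its normalized form."""
    normalized = [normalize_operator(n) for n in names]
    counts = Counter(normalized)
    total = len(normalized)
    return [counts[n] / total for n in normalized]
```

Frequency encoding yields one numeric column however many of the 2,476 distinct operators appear, which avoids a very wide one-hot matrix.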
Flight #
categorical identifier long_tail null_rate
Flight number identifier for each crash record. Nearly 80% of rows are null (null_rate 0.7971), and the most common non-null value is the placeholder '-' at 67 occurrences (6.27% of present values), so missing-data sentinels are mixed in with real codes. Cardinality is high (724 unique values among the 1,069 populated rows) with entropy_ratio 0.953, so populated values are spread almost uniformly. Treatment: normalize '-' to null and treat as a high-cardinality identifier; drop from modelling or use only as a join key.
- n: 5,268
- nulls: 4,199 (79.7%)
- unique: 724
- top_value: -
- top_rate: 0.06268
- cardinality: 724
- entropy: 9.058
- entropy_ratio: 0.9534
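The reported figures are consistent with entropy_ratio being Shannon entropy in bits divided by log2 of the cardinality (9.058 / log2(724) ≈ 0.953). That definition is inferred from the numbers, not stated by the profiler. The sketch below implements it alongside the '-' sentinel cleanup, using hypothetical function names.

```python
import math
from collections import Counter


def normalize_flight_number(value):
    """Map the '-' placeholder and blank strings to None."""
    if value is None:
        return None
    stripped = value.strip()
    return None if stripped in {"", "-"} else stripped


def entropy_ratio(values):
    """Shannon entropy (bits) of non-null values divided by log2(cardinality)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    if len(counts) < 2:
        return 0.0  # a single distinct value carries no information
    n = len(present)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))
```

A ratio near 1 means no single flight number dominates, which is why the column behaves like an identifier rather than a category.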
Route
text free_text multilingual null_rate
Short free text describing a flight route, typically formatted as 'Origin - Destination' (the hyphen appears 3,658 times), with occasional non-route labels such as 'Training' (81), 'Sightseeing' (29), or 'Test flight' (17). Values are short (mean 22 characters, 4 words) and highly varied (3,244 unique), but 32.38% are null and there are 318 duplicates. Language detection flags a multilingual mix dominated by English (2,567) with notable Spanish (237), Portuguese (100), German (88) and Italian (88), reflecting place names rather than true prose. Treatment: split on ' - ' into origin and destination and bucket non-route labels separately; impute or flag the 32% nulls.
- n: 5,268
- nulls: 1,706 (32.4%)
- unique: 3,244
- len_min: 4
- len_max: 59
- len_mean: 22.09
- len_median: 20
- len_p95: 37
- word_mean: 4.065
- word_median: 4
- n_empty: 0
- n_duplicates: 318
- duplicate_rate: 0.08928
- vocab_size: 3,647
- readability_flesch_mean: 27.15
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.04099
- allcaps_rate: 0.0002807
- boilerplate_rate: 0
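The Route treatment could be sketched as below. The label set contains only the three non-route labels named in this profile, so a real pass would likely need more. Taking the first and last stop of a multi-leg route is an assumption, and `parse_route` and `ParsedRoute` are hypothetical names.

```python
from typing import NamedTuple, Optional

# Only the non-route labels named in the profile; extend after inspecting the data.
NON_ROUTE_LABELS = {"training", "sightseeing", "test flight"}


class ParsedRoute(NamedTuple):
    origin: Optional[str]
    destination: Optional[str]
    label: Optional[str]


def parse_route(text: Optional[str]) -> ParsedRoute:
    """Split 'Origin - Destination'; bucket non-route labels separately."""
    if text is None or not text.strip():
        return ParsedRoute(None, None, None)
    cleaned = text.strip()
    if cleaned.casefold() in NON_ROUTE_LABELS:
        return ParsedRoute(None, None, cleaned)
    stops = [s.strip() for s in cleaned.split(" - ")]
    if len(stops) < 2:
        # Not a recognizable route; keep the raw text as a label for review.
        return ParsedRoute(None, None, cleaned)
    # First and last stops only; intermediate stops are ignored in this sketch.
    return ParsedRoute(stops[0], stops[-1], None)
```

Splitting on ' - ' with surrounding spaces, rather than on a bare hyphen, avoids breaking place names that contain hyphens.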
Type
text feature duplicates
This column records aircraft make-and-model designations, dominated by manufacturer-plus-type strings like 'Douglas DC-3' (334 occurrences) and 'de Havilland Canada DHC-6 Twin Otter 300'. Values are short (mean 18.3 characters, median 2 words) but highly repetitive: the duplicate rate is 53.3% across 2,446 unique types, and 'Douglas' alone appears in 1,113 rows. Watch for near-duplicate variants of the same airframe ('Douglas C-47', 'Douglas C-47A', 'Douglas C-47B') that will fragment any group-by unless normalised. Treatment: normalise manufacturer/variant strings (e.g. collapse C-47 sub-variants) before using as a categorical feature.
- n: 5,268
- nulls: 27 (0.5%)
- unique: 2,446
- len_min: 4
- len_max: 40
- len_mean: 18.33
- len_median: 16
- len_p95: 34
- word_mean: 2.718
- word_median: 2
- n_empty: 0
- n_duplicates: 2,795
- duplicate_rate: 0.5333
- vocab_size: 2,534
- readability_flesch_mean: 69.26
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.007441
- allcaps_rate: 0.00954
- boilerplate_rate: 0
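Collapsing sub-variants can be approximated with a simple heuristic, sketched below: drop a single capital letter that directly follows a digit at the end of the string. This rule is an illustrative assumption, not the profiler's method. It would merge 'C-47A' into 'C-47' but could also merge designations that ought to stay distinct, so the mapping should be reviewed before use.

```python
import re
from collections import Counter


def collapse_variant(aircraft_type: str) -> str:
    """Strip a trailing one-letter sub-variant suffix such as the 'A' in 'C-47A'."""
    cleaned = " ".join(aircraft_type.split())
    # Hypothetical rule: one capital letter right after a digit at end of string.
    return re.sub(r"(?<=\d)[A-Z]$", "", cleaned)


types = ["Douglas C-47", "Douglas C-47A", "Douglas C-47B", "Douglas DC-3"]
grouped = Counter(collapse_variant(t) for t in types)
```

Here the three C-47 strings fall into one group while 'Douglas DC-3' is left untouched.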
Registration
text identifier near_unique one_word allcaps short_text
Almost certainly aircraft tail/registration codes: 4,905 unique values across 5,268 rows, about 99% all-caps single tokens with mean length 6.4 (max 15), and top tokens like 'hk-' and 'nc10809' that resemble registration prefixes. The column is near-unique (n_unique/n ≈ 0.93) with a 6.36% null rate and only 28 duplicates, so it behaves as an identifier rather than a feature. A lone '/' appears 36 times, which suggests a placeholder for split or unknown registrations and is worth inspecting. Treatment: treat as an identifier; drop from modelling or use only for joins/lookup after normalising case and the '/' placeholder.
- n: 5,268
- nulls: 335 (6.4%)
- unique: 4,905
- len_min: 1
- len_max: 15
- len_mean: 6.394
- len_median: 6
- len_p95: 10
- word_mean: 1.018
- word_median: 1
- n_empty: 0
- n_duplicates: 28
- duplicate_rate: 0.005676
- vocab_size: 4,948
- readability_flesch_mean: 103
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9899
- allcaps_rate: 0.9919
- boilerplate_rate: 0
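A small normalizer for Registration might look like the sketch below. It assumes '/' either stands alone as a placeholder or separates two registrations in one cell. The profile does not say which case applies, so both are handled, and `split_registrations` is a hypothetical name.

```python
from typing import List, Optional


def split_registrations(value: Optional[str]) -> List[str]:
    """Upper-case a registration cell and split it on '/'; drop empty parts."""
    if value is None:
        return []
    parts = [part.strip().upper() for part in value.split("/")]
    # A bare '/' yields only empty parts, so it normalizes to an empty list.
    return [part for part in parts if part]
```

Returning a list keeps a two-aircraft cell usable for lookups against either registration.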
cn/In
text feature one_word allcaps null_rate short_text
Despite the text classification, 'cn/In' looks like a short code field that is mostly numeric: values are predominantly one-word, all-caps tokens with a mean length of 5.6 characters, and the top values ('178', '19', '229', '125') are all integers. About 23.3% of rows are null and only 333 duplicates (8.2%) appear across 3,707 unique values, so cardinality is high relative to 5,268 rows. The '/' character shows up 49 times in top_words, hinting at occasional composite values (e.g., 'a/b'), which the column name also suggests. Treatment: cast to numeric where possible and split composite '/'-separated entries; impute or flag the 23% nulls before modelling.
- n: 5,268
- nulls: 1,228 (23.3%)
- unique: 3,707
- len_min: 1
- len_max: 20
- len_mean: 5.645
- len_median: 5
- len_p95: 10
- word_mean: 1.026
- word_median: 1
- n_empty: 0
- n_duplicates: 333
- duplicate_rate: 0.08243
- vocab_size: 3,739
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9842
- allcaps_rate: 0.9663
- boilerplate_rate: 0
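The cn/In treatment (cast where possible, split composites) could be sketched as follows. The helper name `parse_cn` is hypothetical, and leaving non-numeric parts as strings rather than discarding them is an assumption.

```python
def parse_cn(value):
    """Split a cn/In cell on '/' and cast all-digit parts to int."""
    if value is None:
        return []
    parts = [part.strip() for part in value.split("/")]
    # Parts that are not plain ASCII digits stay as strings for later review.
    return [
        int(part) if part.isascii() and part.isdigit() else part
        for part in parts
        if part
    ]
```

Keeping the unparsed strings makes it easy to count how many values resist numeric casting before deciding on a final type.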
Aboard
numeric feature high_skew outliers
This column records the number of people aboard, with values ranging from 0 to 644 and a median of 13. The distribution is heavily right-skewed (skew 4.25, kurtosis 28.4), and roughly 10% of rows (529) are flagged as outliers, indicating a long tail of very large flights against a typical small-aircraft baseline. Nulls are negligible (0.42%) and only 239 distinct values appear across 5,268 rows. Treatment: log1p-transform before modelling to tame the heavy right tail; plain log is undefined at the minimum value of 0.
- n: 5,268
- nulls: 22 (0.4%)
- unique: 239
- min: 0
- max: 644
- mean: 27.55
- median: 13
- std: 43.08
- q1: 5
- q3: 30
- iqr: 25
- skew: 4.247
- kurtosis: 28.41
- n_outliers: 529
- outlier_rate: 0.1008
- zero_rate: 0.0003812
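The profile does not state which rule produced n_outliers. A common choice is Tukey's 1.5 × IQR fences, sketched below as an assumption rather than as the profiler's documented method. With q1 = 5 and q3 = 30, the upper fence is 30 + 1.5 × 25 = 67.5.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True when value falls outside the Tukey fences."""
    low, high = tukey_fences(q1, q3, k)
    return value < low or value > high


low, high = tukey_fences(5, 30)
print(low, high)  # -32.5 67.5
```

Under this rule any flight with more than 67 people aboard would be flagged, so an "outlier" here largely means a large airliner, not necessarily a data error.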
Fatalities
numeric numeric_target high_skew outliers
Counts of deaths per event, ranging from 0 to 583 with a median of 9 and a mean of 20.07. The distribution is heavily right-skewed (skew 4.95, kurtosis 42.79) with 444 outliers (8.4% of rows) and a small zero rate of 1.1%. The IQR of 20 against a max of 583 confirms a long tail driven by rare catastrophic events. Treatment: log1p-transform before modelling to tame the heavy right tail.
- n: 5,268
- nulls: 12 (0.2%)
- unique: 191
- min: 0
- max: 583
- mean: 20.07
- median: 9
- std: 33.2
- q1: 3
- q3: 23
- iqr: 20
- skew: 4.948
- kurtosis: 42.79
- n_outliers: 444
- outlier_rate: 0.08447
- zero_rate: 0.01104
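The log1p treatment maps a count c to ln(1 + c), which is defined at zero and compresses the long tail. A minimal sketch with its inverse follows; the function names are hypothetical.

```python
import math


def log1p_transform(counts):
    """Apply ln(1 + c) to each count; safe for zeros."""
    return [math.log1p(c) for c in counts]


def inverse_log1p(values):
    """Undo log1p_transform, e.g. to report predictions as counts."""
    return [math.expm1(v) for v in values]
```

On this scale the median of 9 becomes ln(10) ≈ 2.30 and the maximum of 583 becomes ln(584) ≈ 6.37, so the largest value is under three times the median instead of about 65 times.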
Ground
numeric feature high_skew
'Ground' counts deaths on the ground and is overwhelmingly zero (zero_rate 0.9583): median, q1, and q3 are all 0, and only 50 unique values appear across 5,268 rows. The non-zero tail is extreme: max 2,750 against a mean of 1.61, skew 50.3, and kurtosis 2559. Because the IQR is 0, essentially every non-zero value is flagged, which is why the 219 outliers (4.17%) match the non-zero share. Treatment: split into a zero/non-zero indicator and log-transform the non-zero magnitudes before modelling.
- n: 5,268
- nulls: 22 (0.4%)
- unique: 50
- min: 0
- max: 2,750
- mean: 1.609
- median: 0
- std: 53.99
- q1: 0
- q3: 0
- iqr: 0
- skew: 50.34
- kurtosis: 2559
- n_outliers: 219
- outlier_rate: 0.04175
- zero_rate: 0.9583
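The two-part treatment for Ground (an indicator plus a log magnitude) could be sketched as below. The name `split_ground` is hypothetical, and using `None` for the magnitude of zero or missing rows is an assumption about how downstream code represents "not applicable".

```python
import math


def split_ground(count):
    """Return (has_ground_deaths, log_magnitude) for one Ground value.

    log_magnitude is None when the count is zero or missing.
    """
    if count is None:
        return None, None
    if count == 0:
        return 0, None
    return 1, math.log(count)
```

The indicator captures the common question (did anyone on the ground die?), while the log magnitude keeps the rare extreme rows from dominating a model.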
Summary
text free_text near_unique
Free-text incident summaries averaging 201 characters (median 136, max 1,954) with a mean Flesch readability of 61.7, consistent with short narrative paragraphs. The vocabulary is clearly aviation-accident: 'crashed' (2,925), 'into' (2,300), and 'aircraft' (2,031) dominate after stopword removal. The column is near-unique (4,673 distinct values among 5,268 rows), but it has 205 exact duplicates (4.2%) and a 7.4% null rate worth checking before modelling. Treatment: tokenize and embed (or use TF-IDF) for downstream NLP; dedupe the 205 exact repeats first.
- n: 5,268
- nulls: 390 (7.4%)
- unique: 4,673
- len_min: 6
- len_max: 1,954
- len_mean: 200.7
- len_median: 136
- len_p95: 584
- word_mean: 33.24
- word_median: 23
- n_empty: 0
- n_duplicates: 205
- duplicate_rate: 0.04203
- vocab_size: 12,513
- readability_flesch_mean: 61.68
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.00041
- allcaps_rate: 0
- boilerplate_rate: 0
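The Summary treatment (dedupe, then TF-IDF) can be illustrated with the standard library alone. This is a minimal sketch under simplifying assumptions: a regex tokenizer, no stopword list, and the plain tf × ln(N / df) weighting. A real pipeline would more likely use an established NLP library, and all function names here are hypothetical.

```python
import math
import re
from collections import Counter


def dedupe(texts):
    """Drop exact repeats, keeping first occurrences in their original order."""
    return list(dict.fromkeys(texts))


def tokenize(text):
    """Lower-case the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(texts):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    return [
        {
            term: (count / sum(doc.values())) * math.log(n_docs / doc_freq[term])
            for term, count in doc.items()
        }
        for doc in docs
    ]
```

A term such as 'crashed' that appears in every summary gets weight 0 under this scheme, which is the intended effect: ubiquitous domain words carry little signal for distinguishing one crash from another.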