data trove airplane crashes fatalities 1908 2009
Reading
This dataset catalogues 5,268 aviation accidents spanning roughly a century, recording details such as date, operator, aircraft type, location, passengers aboard, fatalities, and ground casualties. Two numeric columns stand out immediately: Fatalities (mean 20, max 583) and Aboard (mean 28, max 644) are both highly right-skewed with significant outliers, suggesting a small number of catastrophic mass-casualty events dominate the tail. The Operator column reveals that Aeroflot (179 incidents) and U.S. military branches collectively account for a large share of recorded crashes, worth examining for era-specific clustering. Ground fatalities are near-zero in 95% of cases but spike dramatically in rare events (max 2,750), likely reflecting high-profile urban crashes.
citing: Fatalities.stats.mean · Fatalities.stats.max · Fatalities.stats.skew · Aboard.stats.mean · Aboard.stats.max · Aboard.stats.skew · Ground.stats.median · Ground.stats.max · Ground.stats.zero_rate · Operator.top_values · Type.top_values · row_count
Charts the summary said to look at first
Show data table
| bin | count |
|---|---|
| 0 – 14.57 | 3314 |
| 14.57 – 29.15 | 980 |
| 29.15 – 43.72 | 343 |
| 43.72 – 58.3 | 215 |
| 58.3 – 72.88 | 96 |
| 72.88 – 87.45 | 90 |
| 87.45 – 102 | 51 |
| 102 – 116.6 | 42 |
| 116.6 – 131.2 | 39 |
| 131.2 – 145.8 | 19 |
| 145.8 – 160.3 | 18 |
| 160.3 – 174.9 | 9 |
| 174.9 – 189.5 | 11 |
| 189.5 – 204 | 3 |
| 204 – 218.6 | 2 |
| 218.6 – 233.2 | 6 |
| 233.2 – 247.8 | 2 |
| 247.8 – 262.3 | 5 |
| 262.3 – 276.9 | 4 |
| 276.9 – 291.5 | 1 |
| 291.5 – 306.1 | 1 |
| 306.1 – 320.6 | 0 |
| 320.6 – 335.2 | 1 |
| 335.2 – 349.8 | 2 |
| 349.8 – 364.4 | 0 |
| 364.4 – 378.9 | 0 |
| 378.9 – 393.5 | 0 |
| 393.5 – 408.1 | 0 |
| 408.1 – 422.7 | 0 |
| 422.7 – 437.2 | 0 |
| 437.2 – 451.8 | 0 |
| 451.8 – 466.4 | 0 |
| 466.4 – 481 | 0 |
| 481 – 495.5 | 0 |
| 495.5 – 510.1 | 0 |
| 510.1 – 524.7 | 1 |
| 524.7 – 539.3 | 0 |
| 539.3 – 553.9 | 0 |
| 553.9 – 568.4 | 0 |
| 568.4 – 583 | 1 |
Show data table
| bin | count |
|---|---|
| 0 – 16.1 | 2978 |
| 16.1 – 32.2 | 1055 |
| 32.2 – 48.3 | 430 |
| 48.3 – 64.4 | 230 |
| 64.4 – 80.5 | 129 |
| 80.5 – 96.6 | 105 |
| 96.6 – 112.7 | 75 |
| 112.7 – 128.8 | 56 |
| 128.8 – 144.9 | 46 |
| 144.9 – 161 | 35 |
| 161 – 177.1 | 27 |
| 177.1 – 193.2 | 16 |
| 193.2 – 209.3 | 8 |
| 209.3 – 225.4 | 7 |
| 225.4 – 241.5 | 9 |
| 241.5 – 257.6 | 4 |
| 257.6 – 273.7 | 9 |
| 273.7 – 289.8 | 3 |
| 289.8 – 305.9 | 9 |
| 305.9 – 322 | 3 |
| 322 – 338.1 | 2 |
| 338.1 – 354.2 | 3 |
| 354.2 – 370.3 | 1 |
| 370.3 – 386.4 | 1 |
| 386.4 – 402.5 | 2 |
| 402.5 – 418.6 | 0 |
| 418.6 – 434.7 | 0 |
| 434.7 – 450.8 | 0 |
| 450.8 – 466.9 | 0 |
| 466.9 – 483 | 0 |
| 483 – 499.1 | 0 |
| 499.1 – 515.2 | 0 |
| 515.2 – 531.3 | 2 |
| 531.3 – 547.4 | 0 |
| 547.4 – 563.5 | 0 |
| 563.5 – 579.6 | 0 |
| 579.6 – 595.7 | 0 |
| 595.7 – 611.8 | 0 |
| 611.8 – 627.9 | 0 |
| 627.9 – 644 | 1 |
Show data table
| chars | count |
|---|---|
| 3 – 5 | 96 |
| 5 – 6 | 233 |
| 6 – 8 | 140 |
| 8 – 9 | 462 |
| 9 – 11 | 169 |
| 11 – 12 | 184 |
| 12 – 14 | 128 |
| 14 – 15 | 395 |
| 15 – 17 | 270 |
| 17 – 18 | 447 |
| 18 – 20 | 407 |
| 20 – 22 | 143 |
| 22 – 23 | 340 |
| 23 – 25 | 205 |
| 25 – 26 | 542 |
| 26 – 28 | 166 |
| 28 – 29 | 229 |
| 29 – 31 | 102 |
| 31 – 32 | 194 |
| 32 – 34 | 54 |
| 34 – 36 | 127 |
| 36 – 37 | 62 |
| 37 – 39 | 35 |
| 39 – 40 | 30 |
| 40 – 42 | 13 |
| 42 – 43 | 25 |
| 43 – 45 | 11 |
| 45 – 46 | 5 |
| 46 – 48 | 7 |
| 48 – 50 | 7 |
| 50 – 51 | 7 |
| 51 – 53 | 0 |
| 53 – 54 | 9 |
| 54 – 56 | 1 |
| 56 – 57 | 3 |
| 57 – 59 | 0 |
| 59 – 60 | 1 |
| 60 – 62 | 0 |
| 62 – 63 | 0 |
| 63 – 65 | 1 |
Show data table
| chars | count |
|---|---|
| 4 – 5 | 6 |
| 5 – 6 | 5 |
| 6 – 7 | 6 |
| 7 – 8 | 19 |
| 8 – 8 | 32 |
| 8 – 9 | 57 |
| 9 – 10 | 178 |
| 10 – 11 | 255 |
| 11 – 12 | 685 |
| 12 – 13 | 0 |
| 13 – 14 | 522 |
| 14 – 15 | 331 |
| 15 – 16 | 441 |
| 16 – 17 | 369 |
| 17 – 18 | 208 |
| 18 – 18 | 158 |
| 18 – 19 | 154 |
| 19 – 20 | 166 |
| 20 – 21 | 154 |
| 21 – 22 | 0 |
| 22 – 23 | 109 |
| 23 – 24 | 120 |
| 24 – 25 | 158 |
| 25 – 26 | 188 |
| 26 – 26 | 174 |
| 26 – 27 | 107 |
| 27 – 28 | 73 |
| 28 – 29 | 85 |
| 29 – 30 | 39 |
| 30 – 31 | 0 |
| 31 – 32 | 66 |
| 32 – 33 | 58 |
| 33 – 34 | 55 |
| 34 – 35 | 43 |
| 35 – 36 | 16 |
| 36 – 36 | 25 |
| 36 – 37 | 16 |
| 37 – 38 | 9 |
| 38 – 39 | 21 |
| 39 – 40 | 133 |
Show data table
| bin | count |
|---|---|
| 0 – 68.75 | 5235 |
| 68.75 – 137.5 | 8 |
| 137.5 – 206.2 | 0 |
| 206.2 – 275 | 1 |
| 275 – 343.8 | 0 |
| 343.8 – 412.5 | 0 |
| 412.5 – 481.2 | 0 |
| 481.2 – 550 | 0 |
| 550 – 618.8 | 0 |
| 618.8 – 687.5 | 0 |
| 687.5 – 756.2 | 0 |
| 756.2 – 825 | 0 |
| 825 – 893.8 | 0 |
| 893.8 – 962.5 | 0 |
| 962.5 – 1031 | 0 |
| 1031 – 1100 | 0 |
| 1100 – 1169 | 0 |
| 1169 – 1238 | 0 |
| 1238 – 1306 | 0 |
| 1306 – 1375 | 0 |
| 1375 – 1444 | 0 |
| 1444 – 1512 | 0 |
| 1512 – 1581 | 0 |
| 1581 – 1650 | 0 |
| 1650 – 1719 | 0 |
| 1719 – 1788 | 0 |
| 1788 – 1856 | 0 |
| 1856 – 1925 | 0 |
| 1925 – 1994 | 0 |
| 1994 – 2062 | 0 |
| 2062 – 2131 | 0 |
| 2131 – 2200 | 0 |
| 2200 – 2269 | 0 |
| 2269 – 2338 | 0 |
| 2338 – 2406 | 0 |
| 2406 – 2475 | 0 |
| 2475 – 2544 | 0 |
| 2544 – 2612 | 0 |
| 2612 – 2681 | 0 |
| 2681 – 2750 | 2 |
Schema
13 columns| Alerts | ||||
|---|---|---|---|---|
| Date | text | 0.0% | 4,753 |
one_word
allcaps
short_text
|
| Time | text | 42.1% | 1,005 |
one_word
allcaps
null_rate
short_text
duplicates
|
| Location | text | 0.4% | 4,303 |
|
| Operator | text | 0.3% | 2,476 |
multilingual
duplicates
|
| Flight # | categorical | 79.7% | 724 |
long_tail
null_rate
|
| Route | text | 32.4% | 3,244 |
multilingual
null_rate
|
| Type | text | 0.5% | 2,446 |
duplicates
|
| Registration | text | 6.4% | 4,905 |
near_unique
one_word
allcaps
short_text
|
| cn/In | text | 23.3% | 3,707 |
one_word
allcaps
null_rate
short_text
|
| Aboard | numeric | 0.4% | 239 |
high_skew
outliers
|
| Fatalities | numeric | 0.2% | 191 |
high_skew
outliers
|
| Ground | numeric | 0.4% | 50 |
high_skew
|
| Summary | text | 7.4% | 4,673 |
near_unique
|
Date
text timestamp one_word allcaps short_textThis column contains dates stored as text strings in MM/DD/YYYY format, with every value exactly 10 characters long and zero nulls across 5,268 rows. The duplicate rate of ~9.8% (515 duplicates across only 4,753 unique values) is notable — multiple records share the same date, with the most frequent dates appearing up to 4 times, including historically significant dates like 09/11/2001 and 06/06/1944, suggesting the dataset may track events tied to recurring or landmark dates. The 'allcaps' alert is a false positive from the date format containing no letters. Treatment: Parse to datetime dtype (strptime MM/DD/YYYY) before any time-based analysis or feature engineering.
- n
- 5,268
- nulls
- 0 (0.0%)
- unique
- 4,753
- len_min
- 10
- len_max
- 10
- len_mean
- 10
- len_median
- 10
- len_p95
- 10
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 515
- duplicate_rate
- 0.09776
- vocab_size
- 4,753
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
Time
text feature one_word allcaps null_rate short_text duplicatesThis column contains clock times in HH:MM format (lengths 4–7 characters), almost certainly representing scheduled or recorded event times. Two signals warrant attention: the null rate is high at 42.12%, meaning nearly half of all 5,268 rows carry no time value, and the duplicate rate is 67.04% — expected for a time-of-day field with only 1,005 distinct values across non-null rows. The 'allcaps' alert is a false positive from saturn misclassifying colon-separated digit strings. Treatment: Parse to datetime.time or extract hour/minute as numeric features; investigate the 42.12% null rate before deciding on imputation or exclusion strategy.
- n
- 5,268
- nulls
- 2,219 (42.1%)
- unique
- 1,005
- len_min
- 4
- len_max
- 7
- len_mean
- 5.003
- len_median
- 5
- len_p95
- 5
- word_mean
- 1.001
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 2,044
- duplicate_rate
- 0.6704
- vocab_size
- 1,004
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.999
- allcaps_rate
- 0.9974
- boilerplate_rate
- 0
Location
text featureThis column contains free-text geographic location descriptions, most commonly in 'City, Country/State' format (mean ~2.9 words, median length 19 characters), representing where individual events occurred. The high frequency of the word 'near' (1,272 occurrences out of 5,268 rows) indicates a substantial proportion of entries are approximate locations rather than precise place names, which could complicate geocoding. The duplicate rate of 18% (945 duplicates across 4,303 unique values) is expected for a location field but the long tail of near-unique entries (vocab size 4,541) suggests significant free-text variation in how locations are recorded. Treatment: Normalize 'near X' / 'off X' patterns, then geocode or extract country/region via NLP before modelling.
- n
- 5,268
- nulls
- 20 (0.4%)
- unique
- 4,303
- len_min
- 5
- len_max
- 60
- len_mean
- 20.38
- len_median
- 19
- len_p95
- 31
- word_mean
- 2.866
- word_median
- 3
- n_empty
- 0
- n_duplicates
- 945
- duplicate_rate
- 0.1801
- vocab_size
- 4,541
- readability_flesch_mean
- 24.03
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.01124
- allcaps_rate
- 0
- boilerplate_rate
- 0
Operator
text label multilingual duplicatesThis column contains the name of the airline or military branch operating an aircraft involved in an incident, making it a categorical label field. With 2,476 unique values across 5,268 rows, the duplicate rate of 52.8% is expected for a label of this type — operators recur across multiple incidents. The multilingual alert is a natural artifact of international airline names (German, French, Italian, Spanish, Russian operators all present), not a data quality issue per se, though analysts should be aware that variant spellings of the same operator may inflate cardinality. Top values (Aeroflot at 179, U.S. Air Force at 176) reveal a mix of commercial and military operators. Treatment: Normalize operator name variants, then encode as categorical (target-encode or embed) for modelling; consider grouping military sub-branches under a single 'Military' category.
- n
- 5,268
- nulls
- 18 (0.3%)
- unique
- 2,476
- len_min
- 3
- len_max
- 65
- len_mean
- 19.49
- len_median
- 19
- len_p95
- 35
- word_mean
- 3.047
- word_median
- 3
- n_empty
- 0
- n_duplicates
- 2,774
- duplicate_rate
- 0.5284
- vocab_size
- 2,370
- readability_flesch_mean
- 19.61
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.1651
- allcaps_rate
- 0.03733
- boilerplate_rate
- 0
Flight #
categorical label long_tail null_rateThis column represents a flight number identifier, likely recording the flight designation for each row in the dataset. Two major issues stand out: 79.71% of values are null, making the column largely unpopulated, and the most frequent non-null value is a placeholder dash ('-') appearing 67 times, suggesting systematic missing-data encoding. With 724 unique values across only 1,073 non-null rows and an entropy ratio of 0.953, the distribution is near-uniform with a pronounced long tail — no single flight number dominates meaningfully beyond the placeholder. Treatment: Treat '-' as null, impute or drop rows depending on whether flight number is required; with 79.71% nulls, consider dropping this column unless the analysis specifically targets flight-level granularity.
- n
- 5,268
- nulls
- 4,199 (79.7%)
- unique
- 724
- top_value
- -
- top_rate
- 0.06268
- cardinality
- 724
- entropy
- 9.058
- entropy_ratio
- 0.9534
Route
text label multilingual null_rateThis column represents aviation route descriptions, capturing both origin-destination pairs (e.g., 'Saigon - Paris', 'Bogota - Barranquilla') and flight purpose labels (e.g., 'Training', 'Sightseeing', 'Test flight'). The null rate of 32.38% is a significant concern, meaning roughly one-third of records lack route information. The multilingual alert is expected given the international nature of routes — English dominates at 2,567 detections but Spanish (237), Portuguese (100), German (88), and French (64) are well-represented, reflecting global aviation data. The high n_unique count (3,244 of 5,268 non-null values) with a duplicate rate of 8.93% (318 duplicates) confirms this is a descriptive label field with many distinct routes but some recurring purpose/training entries. Treatment: Impute or flag nulls (32.38% missing); split into 'purpose' vs 'OD-pair' subtypes using presence of '-' delimiter before encoding or embedding.
- n
- 5,268
- nulls
- 1,706 (32.4%)
- unique
- 3,244
- len_min
- 4
- len_max
- 59
- len_mean
- 22.09
- len_median
- 20
- len_p95
- 37
- word_mean
- 4.065
- word_median
- 4
- n_empty
- 0
- n_duplicates
- 318
- duplicate_rate
- 0.08928
- vocab_size
- 3,647
- readability_flesch_mean
- 27.15
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.04099
- allcaps_rate
- 0.0002807
- boilerplate_rate
- 0
Type
text label duplicatesThis column captures aircraft model designations (e.g., 'Douglas DC-3', 'de Havilland Canada DHC-6 Twin Otter 300'), making it an aircraft type label in what appears to be an aviation incident or accident dataset. The duplicate rate of 53.3% (2,795 of 5,268 rows) is expected for a categorical-like field where many incidents share the same aircraft type, with 'Douglas DC-3' alone appearing 334 times. There are 2,446 unique values against a vocabulary of 2,534 words, indicating many near-unique variant spellings or sub-model suffixes (e.g., 'Douglas C-47', 'Douglas C-47A', 'Douglas C-47B' are counted separately), which is the key analyst surprise. Null rate is negligible at 0.51%. Treatment: Normalize variant spellings and sub-model suffixes into canonical families before grouping or encoding; consider a manufacturer + model hierarchy for feature engineering.
- n
- 5,268
- nulls
- 27 (0.5%)
- unique
- 2,446
- len_min
- 4
- len_max
- 40
- len_mean
- 18.33
- len_median
- 16
- len_p95
- 34
- word_mean
- 2.718
- word_median
- 2
- n_empty
- 0
- n_duplicates
- 2,795
- duplicate_rate
- 0.5333
- vocab_size
- 2,534
- readability_flesch_mean
- 69.26
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.007441
- allcaps_rate
- 0.00954
- boilerplate_rate
- 0
Registration
text identifier near_unique one_word allcaps short_textThis column contains vehicle or aircraft registration codes — short, almost entirely uppercase alphanumeric identifiers (allcaps_rate 99.2%, median length 6 characters) consistent with licence plates or tail numbers. With 4905 unique values out of 5268 rows and only 28 duplicates, it behaves as a near-unique identifier, though the 6.36% null rate and occasional slash-containing entries (top word '/' appears 36 times) suggest some composite or malformed registrations worth inspecting. The presence of tokens like 'HK-' (a Colombian aviation prefix) and 'NC10809' hints at international aircraft tail numbers rather than road vehicle plates. Treatment: Use as a near-unique entity key; cleanse slash-delimited entries and nulls before joining or deduplicating on this field.
- n
- 5,268
- nulls
- 335 (6.4%)
- unique
- 4,905
- len_min
- 1
- len_max
- 15
- len_mean
- 6.394
- len_median
- 6
- len_p95
- 10
- word_mean
- 1.018
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 28
- duplicate_rate
- 0.005676
- vocab_size
- 4,948
- readability_flesch_mean
- 103
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9899
- allcaps_rate
- 0.9919
- boilerplate_rate
- 0
cn/In
text label one_word allcaps null_rate short_textThis column ('cn/In') appears to be a short coded identifier or reference field — likely a chemical notation, index number, or abbreviated category code — given its near-universal single-word (98.4%), all-caps (96.6%) character and very short values (median length 5, max 20). The top word '/' appearing 49 times suggests some values are compound codes using slash-delimited notation (e.g., 'CN/IN' style references), while most top values are pure numeric strings ('178', '19', '229', etc.). Two signals warrant attention: the null rate is high at 23.3%, and despite 3,707 unique values across 5,268 rows, there are 333 duplicates, indicating this is not a strict unique identifier. Treatment: Investigate nulls (23.3% missing) before use; treat as categorical label or join key after resolving slash-delimited compound values.
- n
- 5,268
- nulls
- 1,228 (23.3%)
- unique
- 3,707
- len_min
- 1
- len_max
- 20
- len_mean
- 5.645
- len_median
- 5
- len_p95
- 10
- word_mean
- 1.026
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 333
- duplicate_rate
- 0.08243
- vocab_size
- 3,739
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9842
- allcaps_rate
- 0.9663
- boilerplate_rate
- 0
Aboard
numeric feature high_skew outliersThis column records the number of people aboard a vehicle (likely an aircraft or ship) at the time of an incident. The distribution is severely right-skewed (skew=4.25, kurtosis=28.41): the median is only 13 passengers while the mean is 27.6, and the max reaches 644 — consistent with a few large commercial aircraft disasters pulling the tail far right. Roughly 10% of rows (529) are flagged as outliers, and the IQR spans just 5–30, meaning the vast majority of incidents involve small craft. Treatment: Log-transform (log1p) before regression or modelling to reduce skew; retain outliers as they represent real large-scale events.
- n
- 5,268
- nulls
- 22 (0.4%)
- unique
- 239
- min
- 0
- max
- 644
- mean
- 27.55
- median
- 13
- std
- 43.08
- q1
- 5
- q3
- 30
- iqr
- 25
- skew
- 4.247
- kurtosis
- 28.41
- n_outliers
- 529
- outlier_rate
- 0.1008
- zero_rate
- 0.0003812
Fatalities
numeric numeric_target high_skew outliersThis column records the number of fatalities per incident (likely aviation accidents, conflicts, or similar events). The distribution is extremely right-skewed (skew = 4.95, kurtosis = 42.79): the median is only 9 fatalities while the mean is 20.07 and the maximum reaches 583, indicating a long tail of mass-casualty events. 444 rows (8.4%) are flagged as outliers, and the IQR of 20 against a std of 33.2 confirms that most incidents are low-fatality but a meaningful minority are catastrophic. Treatment: Log-transform (log1p) before regression or modelling to reduce skew; retain outliers as they represent real high-severity events.
- n
- 5,268
- nulls
- 12 (0.2%)
- unique
- 191
- min
- 0
- max
- 583
- mean
- 20.07
- median
- 9
- std
- 33.2
- q1
- 3
- q3
- 23
- iqr
- 20
- skew
- 4.948
- kurtosis
- 42.79
- n_outliers
- 444
- outlier_rate
- 0.08447
- zero_rate
- 0.01104
Ground
numeric feature high_skewThis column likely represents a ground elevation, ground clearance, or grounding-related measurement (possibly in feet or meters) associated with physical infrastructure or flight/equipment records. The distribution is extreme: 95.8% of values are exactly zero, yet the maximum reaches 2750.0 with a skew of 50.34 and kurtosis of 2558.60, indicating a tiny fraction of records carry very large non-zero values. Only 50 unique values exist across 5,268 rows, and 219 observations (4.17%) are flagged as outliers — the near-zero IQR (Q1=Q3=0) confirms the overwhelming concentration at zero. Treatment: Treat as sparse indicator/feature; consider binarizing (zero vs. non-zero) or log1p-transforming the non-zero subset, and investigate whether the 2750.0 outliers are valid or data-entry errors.
- n
- 5,268
- nulls
- 22 (0.4%)
- unique
- 50
- min
- 0
- max
- 2,750
- mean
- 1.609
- median
- 0
- std
- 53.99
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 50.34
- kurtosis
- 2559
- n_outliers
- 219
- outlier_rate
- 0.04175
- zero_rate
- 0.9583
Summary
text free_text near_uniqueThis column contains free-text narrative summaries of aviation incidents or accidents, as evidenced by dominant domain terms 'crashed', 'into', and 'aircraft' appearing thousands of times across 5,268 records. Text length varies widely (min 6, median 136, max 1,954 characters), suggesting entries range from brief one-liners to detailed multi-sentence accounts. A duplicate rate of 4.2% (205 duplicates) is mildly surprising for free-text summaries and may indicate repeated incident templates or copy-paste entries. Flesch readability of 61.7 indicates moderate accessibility, consistent with factual incident reporting prose. Treatment: Tokenize and embed (e.g., TF-IDF or sentence transformer) for modelling; deduplicate 205 exact-match rows before training.
- n
- 5,268
- nulls
- 390 (7.4%)
- unique
- 4,673
- len_min
- 6
- len_max
- 1,954
- len_mean
- 200.7
- len_median
- 136
- len_p95
- 584
- word_mean
- 33.24
- word_median
- 23
- n_empty
- 0
- n_duplicates
- 205
- duplicate_rate
- 0.04203
- vocab_size
- 12,513
- readability_flesch_mean
- 61.68
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.00041
- allcaps_rate
- 0
- boilerplate_rate
- 0