disasters airplane crashes

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/disasters/airplane_crashes.csv

Saturn profiled 5,268 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
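!pip install saturn-dissect
import subprocess

# Profile the source CSV with the saturn CLI.
# check=True makes a failed run raise CalledProcessError instead of passing silently.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/disasters/airplane_crashes.csv",
    "--findings", "disasters-airplane_crashes.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)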

Summary confidence: high

This dataset records 5,268 airplane crashes across 13 columns, mixing dates and times with operator, aircraft type, route, location, and casualty counts (Aboard, Fatalities, Ground). Casualty figures are highly skewed: Aboard averages 27.5 with a median of 13 and a maximum of 644, while Fatalities averages 20.1 with a median of 9 and a max of 583, and Ground deaths are zero in roughly 96% of rows but spike to 2,750 — clear outliers worth investigating. Operator and Type are dominated by a few heavy hitters (Aeroflot and U.S. military operators; Douglas DC-3 alone appears 334 times), suggesting concentration that could bias any aggregate analysis. Note also that Flight # is missing in nearly 80% of rows and Time in 42%, so those fields are weak for filtering. Start by looking at the Fatalities distribution and the top operators and aircraft types.

citing: row_count · column_count · columns.Aboard.stats · columns.Fatalities.stats · columns.Ground.stats · columns.Operator.top_values · columns.Type.top_values · columns.Flight #.null_rate · columns.Time.null_rate

Out[4]:

saturn.schema() · 13 columns

column kind n null% unique alerts
Date text 5,268 0.0% 4,753 one_word allcaps short_text
Time text 5,268 42.1% 1,005 one_word allcaps null_rate short_text duplicates
Location text 5,268 0.4% 4,303
Operator text 5,268 0.3% 2,476 multilingual duplicates
Flight # categorical 5,268 79.7% 724 long_tail null_rate
Route text 5,268 32.4% 3,244 multilingual null_rate
Type text 5,268 0.5% 2,446 duplicates
Registration text 5,268 6.4% 4,905 near_unique one_word allcaps short_text
cn/In text 5,268 23.3% 3,707 one_word allcaps null_rate short_text
Aboard numeric 5,268 0.4% 239 high_skew outliers
Fatalities numeric 5,268 0.2% 191 high_skew outliers
Ground numeric 5,268 0.4% 50 high_skew
Summary text 5,268 7.4% 4,673 near_unique
Fig 1.
Fatalities · Heavily right-skewed: most crashes kill under 10 people but a long tail reaches 583.
Histogram bins for Fatalities (median: 9.0).
bin count
0 – 14.57 3314
14.57 – 29.15 980
29.15 – 43.72 343
43.72 – 58.3 215
58.3 – 72.88 96
72.88 – 87.45 90
87.45 – 102 51
102 – 116.6 42
116.6 – 131.2 39
131.2 – 145.8 19
145.8 – 160.3 18
160.3 – 174.9 9
174.9 – 189.5 11
189.5 – 204 3
204 – 218.6 2
218.6 – 233.2 6
233.2 – 247.8 2
247.8 – 262.3 5
262.3 – 276.9 4
276.9 – 291.5 1
291.5 – 306.1 1
306.1 – 320.6 0
320.6 – 335.2 1
335.2 – 349.8 2
349.8 – 364.4 0
364.4 – 378.9 0
378.9 – 393.5 0
393.5 – 408.1 0
408.1 – 422.7 0
422.7 – 437.2 0
437.2 – 451.8 0
451.8 – 466.4 0
466.4 – 481 0
481 – 495.5 0
495.5 – 510.1 0
510.1 – 524.7 1
524.7 – 539.3 0
539.3 – 553.9 0
553.9 – 568.4 0
568.4 – 583 1
Fig 2.
Aboard · Shows passenger load distribution; median 13 with rare wide-body events near 644.
Histogram bins for Aboard (median: 13.0).
bin count
0 – 16.1 2978
16.1 – 32.2 1055
32.2 – 48.3 430
48.3 – 64.4 230
64.4 – 80.5 129
80.5 – 96.6 105
96.6 – 112.7 75
112.7 – 128.8 56
128.8 – 144.9 46
144.9 – 161 35
161 – 177.1 27
177.1 – 193.2 16
193.2 – 209.3 8
209.3 – 225.4 7
225.4 – 241.5 9
241.5 – 257.6 4
257.6 – 273.7 9
273.7 – 289.8 3
289.8 – 305.9 9
305.9 – 322 3
322 – 338.1 2
338.1 – 354.2 3
354.2 – 370.3 1
370.3 – 386.4 1
386.4 – 402.5 2
402.5 – 418.6 0
418.6 – 434.7 0
434.7 – 450.8 0
450.8 – 466.9 0
466.9 – 483 0
483 – 499.1 0
499.1 – 515.2 0
515.2 – 531.3 2
531.3 – 547.4 0
547.4 – 563.5 0
563.5 – 579.6 0
579.6 – 595.7 0
595.7 – 611.8 0
611.8 – 627.9 0
627.9 – 644 1
Fig 3.
Operator · Aeroflot and U.S. military operators dominate — check for over-representation before averaging.
Character-length distribution for Operator (mean: 19.493904761904762).
chars count
3 – 5 96
5 – 6 233
6 – 8 140
8 – 9 462
9 – 11 169
11 – 12 184
12 – 14 128
14 – 15 395
15 – 17 270
17 – 18 447
18 – 20 407
20 – 22 143
22 – 23 340
23 – 25 205
25 – 26 542
26 – 28 166
28 – 29 229
29 – 31 102
31 – 32 194
32 – 34 54
34 – 36 127
36 – 37 62
37 – 39 35
39 – 40 30
40 – 42 13
42 – 43 25
43 – 45 11
45 – 46 5
46 – 48 7
48 – 50 7
50 – 51 7
51 – 53 0
53 – 54 9
54 – 56 1
56 – 57 3
57 – 59 0
59 – 60 1
60 – 62 0
62 – 63 0
63 – 65 1
Fig 4.
Type · Douglas DC-3 alone accounts for 334 crashes; aircraft type is highly concentrated at the top.
Character-length distribution for Type (mean: 18.325701202060674).
chars count
4 – 5 6
5 – 6 5
6 – 7 6
7 – 8 19
8 – 8 32
8 – 9 57
9 – 10 178
10 – 11 255
11 – 12 685
12 – 13 0
13 – 14 522
14 – 15 331
15 – 16 441
16 – 17 369
17 – 18 208
18 – 18 158
18 – 19 154
19 – 20 166
20 – 21 154
21 – 22 0
22 – 23 109
23 – 24 120
24 – 25 158
25 – 26 188
26 – 26 174
26 – 27 107
27 – 28 73
28 – 29 85
29 – 30 39
30 – 31 0
31 – 32 66
32 – 33 58
33 – 34 55
34 – 35 43
35 – 36 16
36 – 36 25
36 – 37 16
37 – 38 9
38 – 39 21
39 – 40 133
Fig 5.
Ground · Almost all values are zero; the few non-zero entries (up to 2,750) are extreme outliers worth flagging.
Histogram bins for Ground (median: 0.0).
bin count
0 – 68.75 5235
68.75 – 137.5 8
137.5 – 206.2 0
206.2 – 275 1
275 – 343.8 0
343.8 – 412.5 0
412.5 – 481.2 0
481.2 – 550 0
550 – 618.8 0
618.8 – 687.5 0
687.5 – 756.2 0
756.2 – 825 0
825 – 893.8 0
893.8 – 962.5 0
962.5 – 1031 0
1031 – 1100 0
1100 – 1169 0
1169 – 1238 0
1238 – 1306 0
1306 – 1375 0
1375 – 1444 0
1444 – 1512 0
1512 – 1581 0
1581 – 1650 0
1650 – 1719 0
1719 – 1788 0
1788 – 1856 0
1856 – 1925 0
1925 – 1994 0
1994 – 2062 0
2062 – 2131 0
2131 – 2200 0
2200 – 2269 0
2269 – 2338 0
2338 – 2406 0
2406 – 2475 0
2475 – 2544 0
2544 – 2612 0
2612 – 2681 0
2681 – 2750 2
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column kind null %
Date text 0.0%
Time text 42.1%
Location text 0.4%
Operator text 0.3%
Flight # categorical 79.7%
Route text 32.4%
Type text 0.5%
Registration text 6.4%
cn/In text 23.3%
Aboard numeric 0.4%
Fatalities numeric 0.2%
Ground numeric 0.4%
Summary text 7.4%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 7,904 detected strings).
lang count share
en 5907 74.7%
es 461 5.8%
it 366 4.6%
de 290 3.7%
fr 247 3.1%
pt 155 2.0%
id 93 1.2%
nl 73 0.9%
sv 51 0.6%
ca 39 0.5%
pl 31 0.4%
ru 27 0.3%
no 22 0.3%
sl 20 0.3%
tr 18 0.2%
ceb 14 0.2%
hr 14 0.2%
cs 11 0.1%
eo 7 0.1%
uk 6 0.1%
hu 6 0.1%
fi 6 0.1%
ms 6 0.1%
ro 6 0.1%
da 5 0.1%
bs 3 0.0%
vi 3 0.0%
sh 3 0.0%
et 3 0.0%
gl 2 0.0%
lt 2 0.0%
la 2 0.0%
eu 1 0.0%
ku 1 0.0%
te 1 0.0%
gd 1 0.0%
ja 1 0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
column Aboard Fatalities Ground
Aboard +1.00 +0.04 +0.06
Fatalities +0.04 +1.00 +0.05
Ground +0.06 +0.05 +1.00
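
For readers who want to recompute a coefficient like those in Fig 8 from the raw CSV, here is a minimal sketch of the Pearson formula in plain Python. The sample lists are made-up numbers, not rows from the dataset, and saturn's own sampling and bounding steps are not reproduced here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one column is constant
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, only to exercise the function.
aboard = [4, 13, 30, 120]
fatalities = [4, 9, 30, 2]
r = pearson(aboard, fatalities)
```

Rows with a null in either column would need to be dropped pairwise before calling the function.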

Date text timestamp

This column holds dates stored as text in MM/DD/YYYY format — every one of 5268 values is exactly 10 characters and a single token. There are 515 duplicates (9.8%) with repeats clustering on historically notable days such as 09/11/2001 and 06/06/1944, suggesting the rows describe events tied to those dates rather than unique daily records. The text alerts (allcaps, one_word, short_text) are artifacts of the date formatting, not real free-text content.

Treatment: parse to a proper date type (MM/DD/YYYY) before any temporal analysis.
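
A minimal sketch of that parsing step, using only the standard library. The function name is illustrative, and the third sample is a made-up bad value; the first two dates are the ones quoted in the notes above.

```python
from datetime import date, datetime

def parse_crash_date(text):
    """Parse an MM/DD/YYYY string into a date; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%m/%d/%Y").date()
    except (ValueError, AttributeError):
        # ValueError: wrong format; AttributeError: value was None.
        return None

samples = ["09/11/2001", "06/06/1944", "not a date"]
parsed = [parse_crash_date(s) for s in samples]
```

Returning None instead of raising keeps a single malformed cell from stopping a full-column pass.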

anthropic:claude-opus-4-7 · confidence high
Out[14]:

saturn.columns["Date"].stats

stat value
n 5,268
nulls 0 (0.0%)
unique 4,753
len_min 10
len_max 10
len_mean 10
len_median 10
len_p95 10
word_mean 1
word_median 1
n_empty 0
n_duplicates 515
duplicate_rate 0.09776
vocab_size 4,753
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 9.
Character-length distribution for Date.
Character-length distribution for Date (mean: 10.0).
chars count
10 5268
(All 40 histogram bins are labelled 10 – 10; one holds all 5268 values and the other 39 are empty.)

Time text timestamp

Clock times, mostly in HH:MM format, stored as text rather than a temporal type. Values like '15:00', '12:00' and '11:00' top the list, and lengths sit tightly between 4 and 7 characters (mean 5.0), so a few entries deviate from the five-character pattern. Roughly 42% of rows are null, and 67% of the non-null values are duplicates across only 1,005 distinct times, suggesting that reported times cluster on round values such as the top of the hour. Despite being numeric-looking, the column tripped the allcaps and one_word alerts because the profiler treats the strings as tokens.

Treatment: parse to a time-of-day type and impute or flag the 42% missing before use.
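
One way to sketch the parse-and-flag half of that treatment is shown below. It assumes that anything not matching H:MM or HH:MM should be flagged rather than repaired; the helper name is hypothetical and imputation is left out.

```python
import re
from datetime import time

_HHMM = re.compile(r"^(\d{1,2}):(\d{2})$")

def parse_clock(text):
    """Return (time_or_None, is_missing) for one raw Time cell."""
    if text is None or not text.strip():
        return None, True
    match = _HHMM.match(text.strip())
    if not match:
        return None, True  # unparseable values are flagged, not guessed
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        return None, True
    return time(hour, minute), False

examples = ["15:00", "9:30", None, "25:00"]
results = [parse_clock(value) for value in examples]
```

Keeping the missing flag as its own column lets a later model learn from missingness without inventing a clock time.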

anthropic:claude-opus-4-7 · confidence high
Out[17]:

saturn.columns["Time"].stats

stat value
n 5,268
nulls 2,219 (42.1%)
unique 1,005
len_min 4
len_max 7
len_mean 5.003
len_median 5
len_p95 5
word_mean 1.001
word_median 1
n_empty 0
n_duplicates 2,044
duplicate_rate 0.6704
vocab_size 1,004
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.999
allcaps_rate 0.9974
boilerplate_rate 0
alert: one_word 99.9% rows are a single word
alert: allcaps 99.7% rows are all-caps
alert: null_rate 42.1% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 67.0% duplicate strings
Fig 10.
Character-length distribution for Time.
Character-length distribution for Time (mean: 5.002623811085602).
chars count
4 7
5 3033
6 3
7 6
(The 36 empty histogram bins between these lengths are omitted.)

Location text feature

Free-text place names, typically 'City, Country/State' (word_median 3, len_median 19), with 4303 unique values across 5268 rows. Top entries cluster on major world cities (Sao Paulo, Moscow, Rio), but 'near' appears 1272 times, suggesting that many entries are approximate locations rather than exact place names. A duplicate rate of 18% (945 repeated strings) means some locations recur, but the high cardinality still limits direct grouping.

Treatment: Parse into city/region/country components and geocode before use; raw strings are too high-cardinality to one-hot.
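
A rough sketch of the parsing half of that treatment follows; geocoding is out of scope here. It assumes the last comma separates the place from its country or state and that approximate entries start with "Near". The sample strings are illustrative, not quoted rows.

```python
def split_location(text):
    """Split 'Place, Region' into parts and flag approximate ('Near ...') entries."""
    cleaned = text.strip()
    approximate = cleaned.lower().startswith("near ")
    if approximate:
        cleaned = cleaned[5:].strip()  # drop the leading 'near '
    place, sep, region = cleaned.rpartition(",")
    if not sep:
        # No comma: keep the whole string as the region-level label.
        place, region = "", cleaned
    return {"place": place.strip(), "region": region.strip(), "approximate": approximate}

exact = split_location("Sao Paulo, Brazil")
approx = split_location("Near Moscow, Russia")
```

Grouping on the region field alone should already cut cardinality well below the 4,303 raw strings, though the exact figure depends on how consistently regions are spelled.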

anthropic:claude-opus-4-7 · confidence high
Out[20]:

saturn.columns["Location"].stats

stat value
n 5,268
nulls 20 (0.4%)
unique 4,303
len_min 5
len_max 60
len_mean 20.38
len_median 19
len_p95 31
word_mean 2.866
word_median 3
n_empty 0
n_duplicates 945
duplicate_rate 0.1801
vocab_size 4,541
readability_flesch_mean 24.03
emoji_rate 0
url_rate 0
one_word_rate 0.01124
allcaps_rate 0
boilerplate_rate 0
Fig 11.
Character-length distribution for Location.
Character-length distribution for Location (mean: 20.379954268292682).
chars count
5 – 6 21
6 – 8 8
8 – 9 22
9 – 10 16
10 – 12 61
12 – 13 326
13 – 15 280
15 – 16 289
16 – 17 753
17 – 19 413
19 – 20 828
20 – 22 383
22 – 23 313
23 – 24 479
24 – 26 189
26 – 27 170
27 – 28 244
28 – 30 85
30 – 31 114
31 – 32 36
32 – 34 28
34 – 35 52
35 – 37 25
37 – 38 23
38 – 39 26
39 – 41 11
41 – 42 18
42 – 44 3
44 – 45 13
45 – 46 3
46 – 48 1
48 – 49 5
49 – 50 2
50 – 52 1
52 – 53 3
53 – 54 1
54 – 56 0
56 – 57 0
57 – 59 1
59 – 60 2

Operator text feature

This column holds the airline or military operator name for each record, with 2476 unique values across 5268 rows and only a 0.0034 null rate. It is heavily duplicated (duplicate_rate 0.528, n_duplicates 2774), led by Aeroflot (179) and Military - U.S. Air Force (176), and the language detector flags a multilingual mix dominated by English (3340) but with sizable Italian (278), Spanish (224), German (202), and French (183) counts — likely an artifact of short proper nouns rather than true translations. Entries are short (word_mean 3.05, len_mean 19.5) and one_word_rate is 0.165, consistent with brand-style names.

Treatment: Normalize casing and consolidate Military - * variants, then treat as a high-cardinality categorical (target/frequency encode).
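
The sketch below shows one possible reading of that treatment: case-fold the names, collapse every "Military - ..." variant into a single bucket, then frequency-encode. Collapsing all military operators into one key is an assumption, and "Military - U.S. Navy" is a made-up sample value.

```python
from collections import Counter

def normalize_operator(name):
    """Case-fold, collapse whitespace, and merge 'Military - *' variants."""
    key = " ".join(name.split()).casefold()
    if key.startswith("military -"):
        return "military"
    return key

def frequency_encode(names):
    """Replace each operator with the share of rows carrying its normalized key."""
    keys = [normalize_operator(n) for n in names]
    counts = Counter(keys)
    total = len(keys)
    return [counts[k] / total for k in keys]

operators = ["Aeroflot", "AEROFLOT", "Military - U.S. Air Force", "Military - U.S. Navy"]
encoded = frequency_encode(operators)
```

If the military branches need to stay distinct, the early return can instead keep the text after the dash as a second feature.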

anthropic:claude-opus-4-7 · confidence high
Out[23]:

saturn.columns["Operator"].stats

stat value
n 5,268
nulls 18 (0.3%)
unique 2,476
len_min 3
len_max 65
len_mean 19.49
len_median 19
len_p95 35
word_mean 3.047
word_median 3
n_empty 0
n_duplicates 2,774
duplicate_rate 0.5284
vocab_size 2,370
readability_flesch_mean 19.61
emoji_rate 0
url_rate 0
one_word_rate 0.1651
allcaps_rate 0.03733
boilerplate_rate 0
alert: multilingual 31 languages detected in sample
alert: duplicates 52.8% duplicate strings
Fig 12.
Character-length distribution for Operator.
Character-length distribution for Operator (mean: 19.493904761904762).
chars count
3 – 5 96
5 – 6 233
6 – 8 140
8 – 9 462
9 – 11 169
11 – 12 184
12 – 14 128
14 – 15 395
15 – 17 270
17 – 18 447
18 – 20 407
20 – 22 143
22 – 23 340
23 – 25 205
25 – 26 542
26 – 28 166
28 – 29 229
29 – 31 102
31 – 32 194
32 – 34 54
34 – 36 127
36 – 37 62
37 – 39 35
39 – 40 30
40 – 42 13
42 – 43 25
43 – 45 11
45 – 46 5
46 – 48 7
48 – 50 7
50 – 51 7
51 – 53 0
53 – 54 9
54 – 56 1
56 – 57 3
57 – 59 0
59 – 60 1
60 – 62 0
62 – 63 0
63 – 65 1

Flight # categorical identifier

Likely a flight number identifier attached to records (probably aviation incidents). Nearly 80% of rows are null (null_rate 0.7971) and the most common non-null value is the placeholder '-' at 67 occurrences (6.27% of present values), suggesting missing-data sentinels mixed with real codes. Cardinality is high (724 unique across 5268 rows) with entropy_ratio 0.953, so among populated rows values are nearly uniformly distributed.

Treatment: Normalize '-' to null and treat as a high-cardinality identifier; drop from modelling or use only as a join key.
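
A small sketch of the sentinel clean-up, assuming '-' and blank strings are the only placeholders; that assumption should be checked against the full value list. The function name and the sample values are illustrative.

```python
def clean_flight_number(value):
    """Map the '-' placeholder and blank strings to None; keep real codes as text."""
    if value is None:
        return None
    stripped = value.strip()
    return None if stripped in {"", "-"} else stripped

raw = ["-", "101", None, " 901 ", ""]
cleaned = [clean_flight_number(v) for v in raw]
```

The codes stay strings on purpose: a flight number is an identifier, so leading zeros and letters should survive.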

anthropic:claude-opus-4-7 · confidence high
Out[26]:

saturn.columns["Flight #"].stats

stat value
n 5,268
nulls 4,199 (79.7%)
unique 724
top_value -
top_rate 0.06268
cardinality 724
entropy 9.058
entropy_ratio 0.9534
alert: long_tail 543 singleton categories
alert: null_rate 79.7% null
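
The entropy figures above can be sanity-checked with a short calculation. This sketch assumes entropy is measured in bits and that entropy_ratio divides it by log2(cardinality); that convention reproduces the reported 0.9534 but is not confirmed by the source.

```python
import math
from collections import Counter

def entropy_ratio(values):
    """Shannon entropy in bits divided by its maximum, log2(number of categories)."""
    counts = Counter(values)
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # a single category carries no uncertainty
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))

# Reproduce the reported ratio from the two stats in the table above.
reported_ratio = 9.058 / math.log2(724)

# A perfectly uniform toy column has a ratio of 1.
uniform_ratio = entropy_ratio(["a", "b", "c", "d"])
```

A ratio this close to 1 means no single flight number dominates once the nulls are set aside.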
Fig 13.
Top values for Flight #.
Top values for Flight # (20 unique shown, of 724 total).
value count share
- 67 1.3%
1 10 0.2%
4 7 0.1%
6 6 0.1%
21 6 0.1%
101 6 0.1%
901 6 0.1%
7 5 0.1%
201 5 0.1%
701 5 0.1%
706 5 0.1%
703 5 0.1%
2 4 0.1%
203 4 0.1%
304 4 0.1%
601 4 0.1%
514 4 0.1%
11 4 0.1%
217 4 0.1%
114 4 0.1%

Route text free_text

Short free-text describing a flight route, typically formatted as 'Origin - Destination' (the hyphen appears 3658 times across 5268 rows) with occasional non-route labels like 'Training' (81), 'Sightseeing' (29), or 'Test flight' (17). Values are short (mean 22 chars, 4 words) and highly varied (3244 unique out of 5268), but 32.38% are null and 318 duplicates exist. Language detection flags a multilingual mix dominated by English (2567) with notable Spanish (237), Portuguese (100), German (88) and Italian (88), reflecting place names rather than true prose.

Treatment: Parse on ' - ' to split origin/destination and bucket non-route labels separately; impute or flag the 32% nulls.
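
A minimal sketch of that split, assuming the first and last stops are origin and destination and that the three labels named in the notes above are the only non-route buckets. "Paris - London" is an illustrative value, not a quoted row.

```python
NON_ROUTE_LABELS = {"training", "sightseeing", "test flight"}

def parse_route(text):
    """Classify one Route cell and split it into origin and destination when possible."""
    result = {"kind": "missing", "origin": None, "destination": None, "label": None}
    if text is None or not text.strip():
        return result  # null or blank stays an explicit 'missing' flag
    cleaned = text.strip()
    if cleaned.casefold() in NON_ROUTE_LABELS:
        result.update(kind="label", label=cleaned)
        return result
    stops = [stop.strip() for stop in cleaned.split(" - ")]
    if len(stops) >= 2 and all(stops):
        # Intermediate stops, if any, are ignored in this sketch.
        result.update(kind="route", origin=stops[0], destination=stops[-1])
    else:
        result.update(kind="other", label=cleaned)
    return result

route = parse_route("Paris - London")
label = parse_route("Training")
```

Splitting on ' - ' with its surrounding spaces, rather than on a bare hyphen, avoids breaking hyphenated place names.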

anthropic:claude-opus-4-7 · confidence high
Out[29]:

saturn.columns["Route"].stats

stat value
n 5,268
nulls 1,706 (32.4%)
unique 3,244
len_min 4
len_max 59
len_mean 22.09
len_median 20
len_p95 37
word_mean 4.065
word_median 4
n_empty 0
n_duplicates 318
duplicate_rate 0.08928
vocab_size 3,647
readability_flesch_mean 27.15
emoji_rate 0
url_rate 0
one_word_rate 0.04099
allcaps_rate 0.0002807
boilerplate_rate 0
alert: multilingual 31 languages detected in sample
alert: null_rate 32.4% null
Fig 14.
Character-length distribution for Route.
Character-length distribution for Route (mean: 22.088152723189218).
chars count
4 – 5 8
5 – 7 4
7 – 8 93
8 – 10 6
10 – 11 5
11 – 12 100
12 – 14 99
14 – 15 155
15 – 16 452
16 – 18 247
18 – 19 443
19 – 20 170
20 – 22 179
22 – 23 286
23 – 25 155
25 – 26 135
26 – 27 245
27 – 29 94
29 – 30 213
30 – 32 71
32 – 33 49
33 – 34 74
34 – 36 39
36 – 37 40
37 – 38 51
38 – 40 20
40 – 41 27
41 – 42 12
42 – 44 10
44 – 45 19
45 – 47 6
47 – 48 4
48 – 49 17
49 – 51 9
51 – 52 6
52 – 54 3
54 – 55 4
55 – 56 8
56 – 58 2
58 – 59 2

Type text feature

This column records aircraft make-and-model designations, dominated by manufacturer-plus-type strings like 'Douglas DC-3' (334 occurrences) and 'de Havilland Canada DHC-6 Twin Otter 300'. Values are short (mean 18.3 chars, median 2 words) but highly repetitive: 53.3% duplicate rate across 2,446 unique types, with Douglas alone appearing in 1,113 rows. Watch for near-duplicate variants of the same airframe ('Douglas C-47', 'Douglas C-47A', 'Douglas C-47B') that will fragment any group-by unless normalised.

Treatment: Normalise manufacturer/variant strings (e.g. collapse C-47 sub-variants) before using as a categorical feature.
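
As a sketch of that normalisation, the rule below strips a single trailing capital letter that directly follows a digit, which collapses the C-47 sub-variants named above. It is a heuristic assumed for illustration and may over-merge airframes whose letter suffix is a meaningful distinction.

```python
import re
from collections import Counter

# One trailing capital letter immediately after a digit, e.g. the 'A' in 'C-47A'.
_VARIANT_SUFFIX = re.compile(r"(?<=\d)[A-Z]$")

def normalize_type(name):
    """Collapse whitespace and drop a single-letter sub-variant suffix."""
    cleaned = " ".join(name.split())
    return _VARIANT_SUFFIX.sub("", cleaned)

types = ["Douglas C-47", "Douglas C-47A", "Douglas C-47B", "Douglas DC-3"]
counts = Counter(normalize_type(t) for t in types)
```

Reviewing the merged groups by hand before committing to the rule is advisable, since the pattern knows nothing about aviation naming.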

anthropic:claude-opus-4-7 · confidence high
Out[32]:

saturn.columns["Type"].stats

stat value
n 5,268
nulls 27 (0.5%)
unique 2,446
len_min 4
len_max 40
len_mean 18.33
len_median 16
len_p95 34
word_mean 2.718
word_median 2
n_empty 0
n_duplicates 2,795
duplicate_rate 0.5333
vocab_size 2,534
readability_flesch_mean 69.26
emoji_rate 0
url_rate 0
one_word_rate 0.007441
allcaps_rate 0.00954
boilerplate_rate 0
alert: duplicates 53.3% duplicate strings
Fig 15.
Character-length distribution for Type.
Character-length distribution for Type (mean: 18.325701202060674).
chars count
4 – 5 6
5 – 6 5
6 – 7 6
7 – 8 19
8 – 8 32
8 – 9 57
9 – 10 178
10 – 11 255
11 – 12 685
12 – 13 0
13 – 14 522
14 – 15 331
15 – 16 441
16 – 17 369
17 – 18 208
18 – 18 158
18 – 19 154
19 – 20 166
20 – 21 154
21 – 22 0
22 – 23 109
23 – 24 120
24 – 25 158
25 – 26 188
26 – 26 174
26 – 27 107
27 – 28 73
28 – 29 85
29 – 30 39
30 – 31 0
31 – 32 66
32 – 33 58
33 – 34 55
34 – 35 43
35 – 36 16
36 – 36 25
36 – 37 16
37 – 38 9
38 – 39 21
39 – 40 133

Registration text identifier

Almost certainly aircraft tail/registration codes: 4905 unique values across 5268 rows, 99% all-caps single tokens with mean length 6.4 (max 15), and top tokens like 'hk-' and 'nc10809' resemble registration prefixes. Near-unique (n_unique/n ≈ 0.93) with a 6.36% null rate and only 28 duplicates, so it behaves as an identifier rather than a feature. The lone '/' appearing 36 times suggests a placeholder for split/unknown registrations worth inspecting.

Treatment: Treat as an identifier: drop from modelling or use only for joins/lookup after normalising case and the '/' placeholder.
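
A short sketch of using the column as a lookup key: normalise case, drop the '/' placeholder, then list registrations that occur on more than one row. 'nc10809' is the token quoted above; 'HK-1234' and the helper names are made up for illustration.

```python
from collections import defaultdict

def normalize_registration(value):
    """Upper-case and trim a registration; treat blanks and a lone '/' as missing."""
    if value is None:
        return None
    cleaned = value.strip().upper()
    return None if cleaned in {"", "/"} else cleaned

def repeated_registrations(values):
    """Map each registration seen more than once to the row indices where it occurs."""
    rows_by_reg = defaultdict(list)
    for index, value in enumerate(values):
        key = normalize_registration(value)
        if key is not None:
            rows_by_reg[key].append(index)
    return {key: rows for key, rows in rows_by_reg.items() if len(rows) > 1}

sample = ["nc10809", "NC10809 ", "/", None, "HK-1234"]
repeats = repeated_registrations(sample)
```

Repeated registrations may be the same airframe in two incidents or a data-entry duplicate, so they are worth a manual look either way.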

anthropic:claude-opus-4-7 · confidence high
Out[35]:

saturn.columns["Registration"].stats

stat value
n 5,268
nulls 335 (6.4%)
unique 4,905
len_min 1
len_max 15
len_mean 6.394
len_median 6
len_p95 10
word_mean 1.018
word_median 1
n_empty 0
n_duplicates 28
duplicate_rate 0.005676
vocab_size 4,948
readability_flesch_mean 103
emoji_rate 0
url_rate 0
one_word_rate 0.9899
allcaps_rate 0.9919
boilerplate_rate 0
alert: near_unique 99.4% of rows are unique strings
alert: one_word 99.0% rows are a single word
alert: allcaps 99.2% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 16.
Character-length distribution for Registration.
Character-length distribution for Registration (mean: 6.393877964727347).
chars count
1 1
2 36
3 64
4 69
5 398
6 3228
7 512
8 267
9 42
10 206
11 10
12 12
13 41
14 8
15 39
(The 25 empty histogram bins between these lengths are omitted.)

cn/In text feature

Despite the text classification, 'cn/In' looks like a short numeric code field — values are predominantly one-word, all-caps tokens with a mean length of 5.6 characters and the top values ('178', '19', '229', '125') all being integers. About 23.3% of rows are null and only 333 duplicates (8.2%) appear across 3,707 unique values, so cardinality is high relative to 5,268 rows. The '/' character showing up 49 times in top_words hints at occasional composite values (e.g., 'a/b'), which the column name 'cn/In' also suggests.

Treatment: Cast to numeric where possible and split composite '/'-separated entries; impute or flag the 23% nulls before modelling.
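
One way to sketch the cast-and-split step is below, assuming '/' is the only composite separator. Parts made purely of ASCII digits become integers and everything else stays text; '19/229' and 'AB-12' are illustrative values.

```python
def parse_cn_ln(value):
    """Split a cn/In cell on '/' and cast all-digit parts to int."""
    if value is None or not value.strip():
        return []
    parts = [part.strip() for part in value.split("/") if part.strip()]
    # isascii() guards against Unicode digits that int() cannot parse.
    return [int(p) if p.isascii() and p.isdigit() else p for p in parts]

single = parse_cn_ln("178")
composite = parse_cn_ln("19/229")
mixed = parse_cn_ln("AB-12")
```

Returning a list keeps composite values intact; a later step can decide whether the first part alone is enough for modelling.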

anthropic:claude-opus-4-7 · confidence medium
Out[38]:

saturn.columns["cn/In"].stats

stat value
n 5,268
nulls 1,228 (23.3%)
unique 3,707
len_min 1
len_max 20
len_mean 5.645
len_median 5
len_p95 10
word_mean 1.026
word_median 1
n_empty 0
n_duplicates 333
duplicate_rate 0.08243
vocab_size 3,739
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9842
allcaps_rate 0.9663
boilerplate_rate 0
alert: one_word 98.4% rows are a single word
alert: allcaps 96.6% rows are all-caps
alert: null_rate 23.3% null
alert: short_text 95th-percentile length under 20 chars
Fig 17.
Character-length distribution for cn/In.
Character-length distribution for cn/In (mean: 5.64480198019802).
chars count
1 23
2 113
3 604
4 866
5 895
6 268
7 269
8 281
9 457
10 125
11 92
12 14
13 9
14 2
15 5
16 4
17 5
18 2
19 2
20 4
(The 20 empty histogram bins between these lengths are omitted.)

Aboard numeric feature

This column records the number of people aboard, with values ranging from 0 to 644 and a median of 13. The distribution is heavily right-skewed (skew 4.25, kurtosis 28.4) and roughly 10% of rows (529) are flagged as outliers, indicating a long tail of very large flights against a typical small-aircraft baseline. Nulls are negligible (0.42%) and only 239 distinct values appear across 5268 rows.

Treatment: log1p-transform before modelling to tame the heavy right tail (the column contains zeros, so a plain log is undefined for those rows).
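
To illustrate how a log-style transform compresses this tail, the sketch below applies log(1 + x) to the summary statistics reported for the column. The log1p variant is used because the column's minimum is 0 and a plain log is undefined there; the dictionary is only a convenient stand-in for real rows.

```python
import math

# Summary statistics reported for Aboard in the stats table.
reported = {"min": 0, "q1": 5, "median": 13, "q3": 30, "max": 644}

# log1p(x) = log(1 + x), so a value of 0 maps cleanly to 0.0.
transformed = {name: math.log1p(value) for name, value in reported.items()}

# expm1 inverts the transform when predictions need to go back to head counts.
recovered_max = math.expm1(transformed["max"])
```

On the transformed scale the maximum sits only a few units above the median, instead of hundreds of passengers above it.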

anthropic:claude-opus-4-7 · confidence high
Out[41]:

saturn.columns["Aboard"].stats

stat value
n 5,268
nulls 22 (0.4%)
unique 239
min 0
max 644
mean 27.55
median 13
std 43.08
q1 5
q3 30
iqr 25
skew 4.247
kurtosis 28.41
n_outliers 529
outlier_rate 0.1008
zero_rate 0.0003812
alert: high_skew skew=+4.25
alert: outliers 10.1% rows beyond 1.5 IQR
Fig 18.
Distribution of Aboard. Vertical dash marks the median.
Histogram bins for Aboard (median: 13.0).
bin count
0 – 16.1 2978
16.1 – 32.2 1055
32.2 – 48.3 430
48.3 – 64.4 230
64.4 – 80.5 129
80.5 – 96.6 105
96.6 – 112.7 75
112.7 – 128.8 56
128.8 – 144.9 46
144.9 – 161 35
161 – 177.1 27
177.1 – 193.2 16
193.2 – 209.3 8
209.3 – 225.4 7
225.4 – 241.5 9
241.5 – 257.6 4
257.6 – 273.7 9
273.7 – 289.8 3
289.8 – 305.9 9
305.9 – 322 3
322 – 338.1 2
338.1 – 354.2 3
354.2 – 370.3 1
370.3 – 386.4 1
386.4 – 402.5 2
402.5 – 418.6 0
418.6 – 434.7 0
434.7 – 450.8 0
450.8 – 466.9 0
466.9 – 483 0
483 – 499.1 0
499.1 – 515.2 0
515.2 – 531.3 2
531.3 – 547.4 0
547.4 – 563.5 0
563.5 – 579.6 0
579.6 – 595.7 0
595.7 – 611.8 0
611.8 – 627.9 0
627.9 – 644 1

Fatalities numeric numeric_target

Counts of deaths per event, ranging from 0 to 583 with a median of 9 and mean of 20.07. The distribution is heavily right-skewed (skew 4.95, kurtosis 42.79) with 444 outliers (8.4% of rows) and a small zero rate of 1.1%. The IQR of 20 against a max of 583 confirms a long tail driven by rare catastrophic events.

Treatment: log1p-transform before modelling to tame the heavy right tail.
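
The outlier counts for the casualty columns come from a "1.5 IQR" rule. Assuming the usual convention of fences placed 1.5 IQR beyond the quartiles, the cut-offs can be worked out from the reported quartiles alone, as this sketch shows.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as reported in the stats tables.
fatalities_low, fatalities_high = iqr_fences(3, 23)   # IQR = 20
aboard_low, aboard_high = iqr_fences(5, 30)           # IQR = 25
```

Because counts cannot be negative, only the upper fences matter: under this convention a crash counts as an outlier above 53 fatalities or above 67.5 people aboard.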

anthropic:claude-opus-4-7 · confidence high
Out[44]:

saturn.columns["Fatalities"].stats

stat value
n 5,268
nulls 12 (0.2%)
unique 191
min 0
max 583
mean 20.07
median 9
std 33.2
q1 3
q3 23
iqr 20
skew 4.948
kurtosis 42.79
n_outliers 444
outlier_rate 0.08447
zero_rate 0.01104
alert: high_skew skew=+4.95
alert: outliers 8.4% rows beyond 1.5 IQR
Fig 19.
Distribution of Fatalities. Vertical dash marks the median.
Histogram bins for Fatalities (median: 9.0).
bin count
0 – 14.57 3314
14.57 – 29.15 980
29.15 – 43.72 343
43.72 – 58.3 215
58.3 – 72.88 96
72.88 – 87.45 90
87.45 – 102 51
102 – 116.6 42
116.6 – 131.2 39
131.2 – 145.8 19
145.8 – 160.3 18
160.3 – 174.9 9
174.9 – 189.5 11
189.5 – 204 3
204 – 218.6 2
218.6 – 233.2 6
233.2 – 247.8 2
247.8 – 262.3 5
262.3 – 276.9 4
276.9 – 291.5 1
291.5 – 306.1 1
306.1 – 320.6 0
320.6 – 335.2 1
335.2 – 349.8 2
349.8 – 364.4 0
364.4 – 378.9 0
378.9 – 393.5 0
393.5 – 408.1 0
408.1 – 422.7 0
422.7 – 437.2 0
437.2 – 451.8 0
451.8 – 466.4 0
466.4 – 481 0
481 – 495.5 0
495.5 – 510.1 0
510.1 – 524.7 1
524.7 – 539.3 0
539.3 – 553.9 0
553.9 – 568.4 0
568.4 – 583 1

Ground numeric feature

Numeric field 'Ground' is overwhelmingly zero (zero_rate 0.9583), with median, q1, and q3 all at 0.0 and only 50 unique values across 5268 rows. The non-zero tail is extreme: max 2750.0 against a mean of 1.61, skew 50.3, and kurtosis 2559. Because the IQR is 0, every non-zero value falls outside the 1.5 IQR fences, which is what produces the 219 outliers (4.17%). The overview reads this column as ground deaths, so it behaves as a sparse count: almost every record has none, but a few carry very large magnitudes.

Treatment: Split into a zero/non-zero indicator and log-transform the non-zero magnitudes before modelling.
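
That two-part encoding can be sketched as follows. The function name is illustrative, and nulls are passed through unchanged rather than imputed.

```python
import math

def split_ground(value):
    """Return (non_zero_flag, log_magnitude) for one Ground value."""
    if value is None:
        return None, None  # nulls are left for a separate imputation decision
    has_ground = value > 0
    # The natural log is safe here because it is applied to non-zero values only.
    return int(has_ground), (math.log(value) if has_ground else None)

zero_row = split_ground(0)
extreme_row = split_ground(2750)
```

With roughly 96% zeros, the indicator carries most of the signal, and the log magnitude only matters for the small non-zero group.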

anthropic:claude-opus-4-7 · confidence high
Out[47]:

saturn.columns["Ground"].stats

stat value
n 5,268
nulls 22 (0.4%)
unique 50
min 0
max 2,750
mean 1.609
median 0
std 53.99
q1 0
q3 0
iqr 0
skew 50.34
kurtosis 2559
n_outliers 219
outlier_rate 0.04175
zero_rate 0.9583
alert: high_skew skew=+50.34
Fig 20.
Distribution of Ground. Vertical dash marks the median.
Histogram bins for Ground (median: 0.0).
bin count
0 – 68.75 5235
68.75 – 137.5 8
137.5 – 206.2 0
206.2 – 275 1
275 – 343.8 0
343.8 – 412.5 0
412.5 – 481.2 0
481.2 – 550 0
550 – 618.8 0
618.8 – 687.5 0
687.5 – 756.2 0
756.2 – 825 0
825 – 893.8 0
893.8 – 962.5 0
962.5 – 1031 0
1031 – 1100 0
1100 – 1169 0
1169 – 1238 0
1238 – 1306 0
1306 – 1375 0
1375 – 1444 0
1444 – 1512 0
1512 – 1581 0
1581 – 1650 0
1650 – 1719 0
1719 – 1788 0
1788 – 1856 0
1856 – 1925 0
1925 – 1994 0
1994 – 2062 0
2062 – 2131 0
2131 – 2200 0
2200 – 2269 0
2269 – 2338 0
2338 – 2406 0
2406 – 2475 0
2475 – 2544 0
2544 – 2612 0
2612 – 2681 0
2681 – 2750 2

Summary text free_text

Free-text incident summaries averaging 201 characters (median 136, max 1954) with a Flesch readability of 61.7, suggesting short narrative paragraphs. Domain vocabulary is clearly aviation-accident: 'crashed' (2925), 'aircraft' (2031), and 'into' (2300) dominate after stopwords. Near-unique (4673 of 5268) but with 205 exact duplicates (4.2%) and a 7.4% null rate worth checking before modelling.

Treatment: Tokenize and embed (or TF-IDF) for downstream NLP; dedupe the 205 exact repeats first.
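
A dependency-free sketch of the dedupe-then-TF-IDF route is given below. The tokenizer and the weighting (term frequency times log(N / document frequency)) are simple textbook choices assumed for illustration, and the three summaries are made-up examples.

```python
import math
import re
from collections import Counter

def dedupe(texts):
    """Drop exact repeats while keeping first-seen order."""
    seen, unique = set(), []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf(texts):
    """Return one {token: weight} dict per document."""
    docs = [Counter(tokenize(t)) for t in texts]
    doc_freq = Counter(token for doc in docs for token in doc)
    n_docs = len(docs)
    return [
        {tok: (count / sum(doc.values())) * math.log(n_docs / doc_freq[tok])
         for tok, count in doc.items()}
        for doc in docs
    ]

summaries = ["Crashed into a mountain.", "Crashed into a mountain.", "Engine failure on takeoff."]
unique = dedupe(summaries)
weights = tfidf(unique)
```

Deduplicating first matters because repeated summaries would otherwise inflate document frequencies and push the weights of their words down.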

anthropic:claude-opus-4-7 · confidence high
Out[50]:

saturn.columns["Summary"].stats

stat value
n 5,268
nulls 390 (7.4%)
unique 4,673
len_min 6
len_max 1,954
len_mean 200.7
len_median 136
len_p95 584
word_mean 33.24
word_median 23
n_empty 0
n_duplicates 205
duplicate_rate 0.04203
vocab_size 12,513
readability_flesch_mean 61.68
emoji_rate 0
url_rate 0
one_word_rate 0.00041
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 95.8% of rows are unique strings
Fig 21.
Character-length distribution for Summary.
Character-length distribution for Summary (mean: 200.73575235752358).
chars count
6 – 55 822
55 – 103 1039
103 – 152 800
152 – 201 547
201 – 250 364
250 – 298 280
298 – 347 231
347 – 396 172
396 – 444 123
444 – 493 128
493 – 542 86
542 – 590 50
590 – 639 57
639 – 688 33
688 – 736 37
736 – 785 19
785 – 834 15
834 – 883 16
883 – 931 11
931 – 980 10
980 – 1029 5
1029 – 1077 1
1077 – 1126 6
1126 – 1175 3
1175 – 1224 4
1224 – 1272 3
1272 – 1321 2
1321 – 1370 1
1370 – 1418 1
1418 – 1467 3
1467 – 1516 1
1516 – 1564 2
1564 – 1613 1
1613 – 1662 4
1662 – 1710 0
1710 – 1759 0
1759 – 1808 0
1808 – 1857 0
1857 – 1905 0
1905 – 1954 1

How to cite

BibTeX
@misc{saturn-disasters-airplane-crashes-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: disasters airplane crashes},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/disasters-airplane_crashes}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: disasters airplane crashes. Source: /home/coolhand/html/datavis/data_trove/data/wild/disasters/airplane_crashes.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/disasters-airplane_crashes