disasters-airplane_crashes · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/disasters/airplane_crashes.csv

Saturn profiled 5,268 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

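# Install the profiler (IPython shell escape), then run its CLI on the source CSV.
!pip install saturn-dissect
import subprocess
# Arguments are passed to the saturn CLI exactly as recorded in this notebook.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/disasters/airplane_crashes.csv",
    "--findings", "disasters-airplane_crashes.json",
    "--llm", "anthropic:claude-opus-4-7",
])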

Summary confidence: high

This dataset records 5,268 airplane crashes across 13 columns, mixing dates and times with operator, aircraft type, route, location, and casualty counts (Aboard, Fatalities, Ground). Casualty figures are highly skewed: Aboard averages 27.5 with a median of 13 and a maximum of 644, while Fatalities averages 20.1 with a median of 9 and a max of 583, and Ground deaths are zero in roughly 96% of rows but spike to 2,750 — clear outliers worth investigating. Operator and Type are dominated by a few heavy hitters (Aeroflot and U.S. military operators; Douglas DC-3 alone appears 334 times), suggesting concentration that could bias any aggregate analysis. Note also that Flight # is missing in nearly 80% of rows and Time in 42%, so those fields are weak for filtering. Start by looking at the Fatalities distribution and the top operators and aircraft types.

citing: row_count · column_count · columns.Aboard.stats · columns.Fatalities.stats · columns.Ground.stats · columns.Operator.top_values · columns.Type.top_values · columns.Flight #.null_rate · columns.Time.null_rate

Out[4]:

saturn.schema() · 13 columns

column	kind	n	null%	unique	alerts
Date	text	5,268	0.0%	4,753	one_word allcaps short_text
Time	text	5,268	42.1%	1,005	one_word allcaps null_rate short_text duplicates
Location	text	5,268	0.4%	4,303
Operator	text	5,268	0.3%	2,476	multilingual duplicates
Flight #	categorical	5,268	79.7%	724	long_tail null_rate
Route	text	5,268	32.4%	3,244	multilingual null_rate
Type	text	5,268	0.5%	2,446	duplicates
Registration	text	5,268	6.4%	4,905	near_unique one_word allcaps short_text
cn/In	text	5,268	23.3%	3,707	one_word allcaps null_rate short_text
Aboard	numeric	5,268	0.4%	239	high_skew outliers
Fatalities	numeric	5,268	0.2%	191	high_skew outliers
Ground	numeric	5,268	0.4%	50	high_skew
Summary	text	5,268	7.4%	4,673	near_unique

Fig 1.

Fatalities · Heavily right-skewed: most crashes kill under 10 people but a long tail reaches 583.

Histogram bins for Fatalities (median: 9.0).
bin	count
0 – 14.57	3314
14.57 – 29.15	980
29.15 – 43.72	343
43.72 – 58.3	215
58.3 – 72.88	96
72.88 – 87.45	90
87.45 – 102	51
102 – 116.6	42
116.6 – 131.2	39
131.2 – 145.8	19
145.8 – 160.3	18
160.3 – 174.9	9
174.9 – 189.5	11
189.5 – 204	3
204 – 218.6	2
218.6 – 233.2	6
233.2 – 247.8	2
247.8 – 262.3	5
262.3 – 276.9	4
276.9 – 291.5	1
291.5 – 306.1	1
306.1 – 320.6	0
320.6 – 335.2	1
335.2 – 349.8	2
349.8 – 364.4	0
364.4 – 378.9	0
378.9 – 393.5	0
393.5 – 408.1	0
408.1 – 422.7	0
422.7 – 437.2	0
437.2 – 451.8	0
451.8 – 466.4	0
466.4 – 481	0
481 – 495.5	0
495.5 – 510.1	0
510.1 – 524.7	1
524.7 – 539.3	0
539.3 – 553.9	0
553.9 – 568.4	0
568.4 – 583	1

Fig 2.

Aboard · Distribution of people aboard: median 13, with a sparse tail of large-aircraft events reaching 644.

Histogram bins for Aboard (median: 13.0).
bin	count
0 – 16.1	2978
16.1 – 32.2	1055
32.2 – 48.3	430
48.3 – 64.4	230
64.4 – 80.5	129
80.5 – 96.6	105
96.6 – 112.7	75
112.7 – 128.8	56
128.8 – 144.9	46
144.9 – 161	35
161 – 177.1	27
177.1 – 193.2	16
193.2 – 209.3	8
209.3 – 225.4	7
225.4 – 241.5	9
241.5 – 257.6	4
257.6 – 273.7	9
273.7 – 289.8	3
289.8 – 305.9	9
305.9 – 322	3
322 – 338.1	2
338.1 – 354.2	3
354.2 – 370.3	1
370.3 – 386.4	1
386.4 – 402.5	2
402.5 – 418.6	0
418.6 – 434.7	0
434.7 – 450.8	0
450.8 – 466.9	0
466.9 – 483	0
483 – 499.1	0
499.1 – 515.2	0
515.2 – 531.3	2
531.3 – 547.4	0
547.4 – 563.5	0
563.5 – 579.6	0
579.6 – 595.7	0
595.7 – 611.8	0
611.8 – 627.9	0
627.9 – 644	1

Fig 3.

Operator · Aeroflot and U.S. military operators dominate — check for over-representation before averaging.

Character-length distribution for Operator (mean: 19.493904761904762).
chars	count
3 – 5	96
5 – 6	233
6 – 8	140
8 – 9	462
9 – 11	169
11 – 12	184
12 – 14	128
14 – 15	395
15 – 17	270
17 – 18	447
18 – 20	407
20 – 22	143
22 – 23	340
23 – 25	205
25 – 26	542
26 – 28	166
28 – 29	229
29 – 31	102
31 – 32	194
32 – 34	54
34 – 36	127
36 – 37	62
37 – 39	35
39 – 40	30
40 – 42	13
42 – 43	25
43 – 45	11
45 – 46	5
46 – 48	7
48 – 50	7
50 – 51	7
51 – 53	0
53 – 54	9
54 – 56	1
56 – 57	3
57 – 59	0
59 – 60	1
60 – 62	0
62 – 63	0
63 – 65	1

Fig 4.

Type · Douglas DC-3 alone accounts for 334 crashes; aircraft type is highly concentrated at the top.

Character-length distribution for Type (mean: 18.325701202060674).
chars	count
4 – 5	6
5 – 6	5
6 – 7	6
7 – 8	19
8 – 8	32
8 – 9	57
9 – 10	178
10 – 11	255
11 – 12	685
12 – 13	0
13 – 14	522
14 – 15	331
15 – 16	441
16 – 17	369
17 – 18	208
18 – 18	158
18 – 19	154
19 – 20	166
20 – 21	154
21 – 22	0
22 – 23	109
23 – 24	120
24 – 25	158
25 – 26	188
26 – 26	174
26 – 27	107
27 – 28	73
28 – 29	85
29 – 30	39
30 – 31	0
31 – 32	66
32 – 33	58
33 – 34	55
34 – 35	43
35 – 36	16
36 – 36	25
36 – 37	16
37 – 38	9
38 – 39	21
39 – 40	133

Fig 5.

Ground · About 96% of values are zero; nearly all non-zero entries fall below 69, while two rows sit in the top bin ending at the 2,750 maximum and are extreme outliers worth flagging.

Histogram bins for Ground (median: 0.0).
bin	count
0 – 68.75	5235
68.75 – 137.5	8
137.5 – 206.2	0
206.2 – 275	1
275 – 343.8	0
343.8 – 412.5	0
412.5 – 481.2	0
481.2 – 550	0
550 – 618.8	0
618.8 – 687.5	0
687.5 – 756.2	0
756.2 – 825	0
825 – 893.8	0
893.8 – 962.5	0
962.5 – 1031	0
1031 – 1100	0
1100 – 1169	0
1169 – 1238	0
1238 – 1306	0
1306 – 1375	0
1375 – 1444	0
1444 – 1512	0
1512 – 1581	0
1581 – 1650	0
1650 – 1719	0
1719 – 1788	0
1788 – 1856	0
1856 – 1925	0
1925 – 1994	0
1994 – 2062	0
2062 – 2131	0
2131 – 2200	0
2200 – 2269	0
2269 – 2338	0
2338 – 2406	0
2406 – 2475	0
2475 – 2544	0
2544 – 2612	0
2612 – 2681	0
2681 – 2750	2

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
Date	text	0.0%
Time	text	42.1%
Location	text	0.4%
Operator	text	0.3%
Flight #	categorical	79.7%
Route	text	32.4%
Type	text	0.5%
Registration	text	6.4%
cn/In	text	23.3%
Aboard	numeric	0.4%
Fatalities	numeric	0.2%
Ground	numeric	0.4%
Summary	text	7.4%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 7,904 detected strings).
lang	count	share
en	5907	74.7%
es	461	5.8%
it	366	4.6%
de	290	3.7%
fr	247	3.1%
pt	155	2.0%
id	93	1.2%
nl	73	0.9%
sv	51	0.6%
ca	39	0.5%
pl	31	0.4%
ru	27	0.3%
no	22	0.3%
sl	20	0.3%
tr	18	0.2%
ceb	14	0.2%
hr	14	0.2%
cs	11	0.1%
eo	7	0.1%
uk	6	0.1%
hu	6	0.1%
fi	6	0.1%
ms	6	0.1%
ro	6	0.1%
da	5	0.1%
bs	3	0.0%
vi	3	0.0%
sh	3	0.0%
et	3	0.0%
gl	2	0.0%
lt	2	0.0%
la	2	0.0%
eu	1	0.0%
ku	1	0.0%
te	1	0.0%
gd	1	0.0%
ja	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	Aboard	Fatalities	Ground
Aboard	+1.00	+0.04	+0.06
Fatalities	+0.04	+1.00	+0.05
Ground	+0.06	+0.05	+1.00

Date text timestamp

This column holds dates stored as text in MM/DD/YYYY format — every one of 5268 values is exactly 10 characters and a single token. There are 515 duplicates (9.8%) with repeats clustering on historically notable days such as 09/11/2001 and 06/06/1944, suggesting the rows describe events tied to those dates rather than unique daily records. The text alerts (allcaps, one_word, short_text) are artifacts of the date formatting, not real free-text content.

Treatment: parse to a proper date type (MM/DD/YYYY) before any temporal analysis.

anthropic:claude-opus-4-7 · confidence high
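
The date treatment above can be sketched with the standard library. This is a minimal sketch, assuming each value arrives as a string; `parse_crash_date` is a hypothetical helper name, and returning None for malformed input is a design choice made here, not something the profile prescribes.

```python
from datetime import date, datetime
from typing import Optional


def parse_crash_date(raw: Optional[str]) -> Optional[date]:
    """Parse an MM/DD/YYYY string; return None when it is missing or malformed."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), "%m/%d/%Y").date()
    except ValueError:
        return None


# Two dates quoted in the note above, plus one deliberately invalid value.
parsed = [parse_crash_date(s) for s in ["09/11/2001", "06/06/1944", "13/40/1999"]]
```

Keeping failures as None instead of raising lets a later step count how many rows did not match the expected format.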

Out[14]:

saturn.columns["Date"].stats

stat	value
n	5,268
nulls	0 (0.0%)
unique	4,753
len_min	10
len_max	10
len_mean	10
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	515
duplicate_rate	0.09776
vocab_size	4,753
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 9.

Character-length distribution for Date.

Character-length distribution for Date (mean: 10.0).
chars	count
10	5268

Time text timestamp

Clock times in HH:MM format, stored as text rather than a temporal type. Values like '15:00', '12:00' and '11:00' top the list, and lengths range from 4 to 7 characters (mean 5.0), so a small number of values deviate from the five-character HH:MM pattern. Roughly 42% of rows are null, and 67% of the non-null values are duplicates across only 1,005 distinct times; the top values suggest times cluster on the hour. Despite being numeric-looking, it tripped allcaps and one-word alerts because the profiler treats the strings as tokens.

Treatment: parse to a time-of-day type and impute or flag the 42% missing before use.

anthropic:claude-opus-4-7 · confidence high
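
A minimal sketch of the time treatment, using only the standard library. It assumes that off-format values (lengths 4, 6 and 7 in the stats) still contain an H:MM or HH:MM clock somewhere in the string; that is a guess, not something the profile confirms. `parse_time_of_day` and `with_missing_flag` are hypothetical names.

```python
import re
from datetime import time
from typing import Optional

# Assumption: any usable value contains an H:MM or HH:MM clock somewhere.
_CLOCK = re.compile(r"(\d{1,2}):(\d{2})")


def parse_time_of_day(raw: Optional[str]) -> Optional[time]:
    """Return a time for the first H:MM / HH:MM match, else None."""
    if not raw:
        return None
    match = _CLOCK.search(raw)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        return None
    return time(hour, minute)


def with_missing_flag(raw: Optional[str]) -> dict:
    """Keep the parsed value plus an explicit indicator instead of imputing."""
    parsed = parse_time_of_day(raw)
    return {"time": parsed, "time_missing": parsed is None}
```

Flagging is shown here in place of imputation because, with 42% of values missing, an imputed time would dominate the column.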

Out[17]:

saturn.columns["Time"].stats

stat	value
n	5,268
nulls	2,219 (42.1%)
unique	1,005
len_min	4
len_max	7
len_mean	5.003
len_median	5
len_p95	5
word_mean	1.001
word_median	1
n_empty	0
n_duplicates	2,044
duplicate_rate	0.6704
vocab_size	1,004
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.999
allcaps_rate	0.9974
boilerplate_rate	0
alert: one_word	99.9% rows are a single word
alert: allcaps	99.7% rows are all-caps
alert: null_rate	42.1% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	67.0% duplicate strings

Fig 10.

Character-length distribution for Time.

Character-length distribution for Time (mean: 5.002623811085602).
chars	count
4	7
5	3033
6	3
7	6

Location text feature

Free-text place names, typically 'City, Country/State' (word_median 3, len_median 19), with 4303 unique values across 5268 rows. Top entries cluster on major world cities (Sao Paulo, Moscow, Rio), but 'near' appears 1272 times, suggesting many entries are approximate locations rather than exact place names. The 945 repeated strings (duplicate rate 18%) show that some place names recur, though high cardinality limits direct grouping.

Treatment: Parse into city/region/country components and geocode before use; raw strings are too high-cardinality to one-hot.

anthropic:claude-opus-4-7 · confidence high
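
One way the location treatment might begin, before any geocoding. This sketch assumes the last comma separates the place from its country or state and that approximate entries start with the word "near"; both are assumptions drawn from the description above. `ParsedLocation` and `parse_location` are hypothetical names, and the example strings in the usage note are made up.

```python
from typing import NamedTuple, Optional


class ParsedLocation(NamedTuple):
    place: str
    region: Optional[str]  # country or state, when a comma is present
    approximate: bool      # True when the string started with "near"


def parse_location(raw: str) -> ParsedLocation:
    text = raw.strip()
    approximate = text.lower().startswith("near ")
    if approximate:
        text = text[len("near "):].strip()
    # Assumption: the last comma separates the place from its country/state.
    place, sep, region = text.rpartition(",")
    if not sep:
        return ParsedLocation(text, None, approximate)
    return ParsedLocation(place.strip(), region.strip(), approximate)
```

For instance, `parse_location("Near Moscow, Russia")` yields a place of "Moscow", a region of "Russia", and the approximate flag set, which keeps the 'near' information available as its own feature.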

Out[20]:

saturn.columns["Location"].stats

stat	value
n	5,268
nulls	20 (0.4%)
unique	4,303
len_min	5
len_max	60
len_mean	20.38
len_median	19
len_p95	31
word_mean	2.866
word_median	3
n_empty	0
n_duplicates	945
duplicate_rate	0.1801
vocab_size	4,541
readability_flesch_mean	24.03
emoji_rate	0
url_rate	0
one_word_rate	0.01124
allcaps_rate	0
boilerplate_rate	0

Fig 11.

Character-length distribution for Location.

Character-length distribution for Location (mean: 20.379954268292682).
chars	count
5 – 6	21
6 – 8	8
8 – 9	22
9 – 10	16
10 – 12	61
12 – 13	326
13 – 15	280
15 – 16	289
16 – 17	753
17 – 19	413
19 – 20	828
20 – 22	383
22 – 23	313
23 – 24	479
24 – 26	189
26 – 27	170
27 – 28	244
28 – 30	85
30 – 31	114
31 – 32	36
32 – 34	28
34 – 35	52
35 – 37	25
37 – 38	23
38 – 39	26
39 – 41	11
41 – 42	18
42 – 44	3
44 – 45	13
45 – 46	3
46 – 48	1
48 – 49	5
49 – 50	2
50 – 52	1
52 – 53	3
53 – 54	1
54 – 56	0
56 – 57	0
57 – 59	1
59 – 60	2

Operator text feature

This column holds the airline or military operator name for each record, with 2476 unique values across 5268 rows and a null rate of only 0.34%. It is heavily duplicated (duplicate_rate 0.528, n_duplicates 2774), led by Aeroflot (179) and Military - U.S. Air Force (176). The language detector flags a multilingual mix dominated by English (3340) with sizable Italian (278), Spanish (224), German (202), and French (183) counts; this is likely an artifact of running detection on short proper nouns rather than evidence of translated text. Entries are short (word_mean 3.05, len_mean 19.5) and one_word_rate is 0.165, consistent with brand-style names.

Treatment: Normalize casing and consolidate Military - * variants, then treat as a high-cardinality categorical (target/frequency encode).

anthropic:claude-opus-4-7 · confidence high
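
A rough sketch of the operator treatment: normalise case, fold the military variants together, then frequency-encode. Collapsing every 'Military - *' value into a single label is only one reading of "consolidate" and is an assumption here, as are the helper names `normalize_operator` and `frequency_encode`.

```python
from collections import Counter
from typing import Dict, Iterable, List


def normalize_operator(raw: str) -> str:
    """Collapse whitespace and case; fold every 'Military - *' value into one label."""
    name = " ".join(raw.split()).casefold()
    # Assumption: military operators share the 'Military - ' prefix seen in the top values.
    if name.startswith("military -"):
        return "military"
    return name


def frequency_encode(values: Iterable[str]) -> Dict[str, float]:
    """Map each normalized operator to its share of the rows."""
    normalized: List[str] = [normalize_operator(v) for v in values]
    counts = Counter(normalized)
    total = len(normalized)
    return {name: count / total for name, count in counts.items()}


# Illustrative values only; the Navy entry is hypothetical.
sample = ["Aeroflot", "AEROFLOT", "Military - U.S. Air Force", "Military - U.S. Navy"]
encoding = frequency_encode(sample)
```

If the service branch matters for the analysis, keeping the text after the prefix as a second column would preserve it.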

Out[23]:

saturn.columns["Operator"].stats

stat	value
n	5,268
nulls	18 (0.3%)
unique	2,476
len_min	3
len_max	65
len_mean	19.49
len_median	19
len_p95	35
word_mean	3.047
word_median	3
n_empty	0
n_duplicates	2,774
duplicate_rate	0.5284
vocab_size	2,370
readability_flesch_mean	19.61
emoji_rate	0
url_rate	0
one_word_rate	0.1651
allcaps_rate	0.03733
boilerplate_rate	0
alert: multilingual	31 languages detected in sample
alert: duplicates	52.8% duplicate strings

Fig 12.

Character-length distribution for Operator.

Character-length distribution for Operator (mean: 19.493904761904762).
chars	count
3 – 5	96
5 – 6	233
6 – 8	140
8 – 9	462
9 – 11	169
11 – 12	184
12 – 14	128
14 – 15	395
15 – 17	270
17 – 18	447
18 – 20	407
20 – 22	143
22 – 23	340
23 – 25	205
25 – 26	542
26 – 28	166
28 – 29	229
29 – 31	102
31 – 32	194
32 – 34	54
34 – 36	127
36 – 37	62
37 – 39	35
39 – 40	30
40 – 42	13
42 – 43	25
43 – 45	11
45 – 46	5
46 – 48	7
48 – 50	7
50 – 51	7
51 – 53	0
53 – 54	9
54 – 56	1
56 – 57	3
57 – 59	0
59 – 60	1
60 – 62	0
62 – 63	0
63 – 65	1

Flight # categorical identifier

Likely a flight number identifier attached to records (probably aviation incidents). Nearly 80% of rows are null (null_rate 0.7971) and the most common non-null value is the placeholder '-' at 67 occurrences (6.27% of present values), suggesting missing-data sentinels mixed with real codes. Cardinality is high (724 unique across 5268 rows) with entropy_ratio 0.953, so among populated rows values are nearly uniformly distributed.

Treatment: Normalize '-' to null and treat as a high-cardinality identifier; drop from modelling or use only as a join key.

anthropic:claude-opus-4-7 · confidence high
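
Normalising the placeholder can be sketched in a few lines. The '-' sentinel comes from the stats below; treating the empty string as missing too is an added assumption, and `clean_flight_number` is a hypothetical name.

```python
from typing import Optional

# '-' is the reported sentinel; "" is an added assumption.
PLACEHOLDERS = {"-", ""}


def clean_flight_number(raw: Optional[str]) -> Optional[str]:
    """Return the trimmed flight number, or None for nulls and placeholders."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value in PLACEHOLDERS else value
```

Values stay as strings because a flight number is a label, not a quantity.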

Out[26]:

saturn.columns["Flight #"].stats

stat	value
n	5,268
nulls	4,199 (79.7%)
unique	724
top_value	-
top_rate	0.06268
cardinality	724
entropy	9.058
entropy_ratio	0.9534
alert: long_tail	543 singleton categories
alert: null_rate	79.7% null
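
The reported entropy_ratio can be reproduced from the two stats above, assuming it is defined as entropy divided by the maximum entropy for that many categories (log2 of the cardinality). That definition is an inference from the numbers, not taken from Saturn's documentation.

```python
import math

entropy_bits = 9.058   # reported entropy
cardinality = 724      # reported number of distinct values

# Assumption: entropy_ratio = entropy / log2(cardinality), i.e. entropy
# relative to a perfectly uniform spread over the observed categories.
max_entropy_bits = math.log2(cardinality)
ratio = entropy_bits / max_entropy_bits
```

Under that assumption the ratio agrees with the reported 0.9534 to about three decimal places, with the small gap attributable to the rounded entropy value.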

Fig 13.

Top values for Flight #.

Top values for Flight # (20 unique shown, of 724 total; share is of all 5,268 rows, nulls included).
value	count	share
-	67	1.3%
1	10	0.2%
4	7	0.1%
6	6	0.1%
21	6	0.1%
101	6	0.1%
901	6	0.1%
7	5	0.1%
201	5	0.1%
701	5	0.1%
706	5	0.1%
703	5	0.1%
2	4	0.1%
203	4	0.1%
304	4	0.1%
601	4	0.1%
514	4	0.1%
11	4	0.1%
217	4	0.1%
114	4	0.1%

Route text free_text

Short free-text describing a flight route, typically formatted as 'Origin - Destination' (the hyphen appears 3658 times across 5268 rows) with occasional non-route labels like 'Training' (81), 'Sightseeing' (29), or 'Test flight' (17). Values are short (mean 22 chars, 4 words) and highly varied (3244 unique out of 5268), but 32.38% are null and 318 duplicates exist. Language detection flags a multilingual mix dominated by English (2567) with notable Spanish (237), Portuguese (100), German (88) and Italian (88), reflecting place names rather than true prose.

Treatment: Parse on ' - ' to split origin/destination and bucket non-route labels separately; impute or flag the 32% nulls.

anthropic:claude-opus-4-7 · confidence high
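
A minimal sketch of the route treatment. The non-route labels are the three quoted above; the set is certainly incomplete. Reducing multi-leg routes to first and last stop is a simplification chosen here, and `ParsedRoute` and `parse_route` are hypothetical names.

```python
from typing import NamedTuple, Optional

# Labels quoted in the profile; extend after inspecting the data.
NON_ROUTE_LABELS = {"training", "sightseeing", "test flight"}


class ParsedRoute(NamedTuple):
    origin: Optional[str]
    destination: Optional[str]
    label: Optional[str]  # non-route category or unparsed text


def parse_route(raw: Optional[str]) -> Optional[ParsedRoute]:
    if raw is None or not raw.strip():
        return None  # keep nulls as nulls so they can be flagged downstream
    text = raw.strip()
    if text.casefold() in NON_ROUTE_LABELS:
        return ParsedRoute(None, None, text)
    stops = [part.strip() for part in text.split(" - ")]
    if len(stops) < 2:
        return ParsedRoute(None, None, text)
    # Multi-leg routes keep the first and last stop only.
    return ParsedRoute(stops[0], stops[-1], None)
```

Splitting on ' - ' with surrounding spaces, not on a bare hyphen, avoids breaking hyphenated place names.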

Out[29]:

saturn.columns["Route"].stats

stat	value
n	5,268
nulls	1,706 (32.4%)
unique	3,244
len_min	4
len_max	59
len_mean	22.09
len_median	20
len_p95	37
word_mean	4.065
word_median	4
n_empty	0
n_duplicates	318
duplicate_rate	0.08928
vocab_size	3,647
readability_flesch_mean	27.15
emoji_rate	0
url_rate	0
one_word_rate	0.04099
allcaps_rate	0.0002807
boilerplate_rate	0
alert: multilingual	31 languages detected in sample
alert: null_rate	32.4% null

Fig 14.

Character-length distribution for Route.

Character-length distribution for Route (mean: 22.088152723189218).
chars	count
4 – 5	8
5 – 7	4
7 – 8	93
8 – 10	6
10 – 11	5
11 – 12	100
12 – 14	99
14 – 15	155
15 – 16	452
16 – 18	247
18 – 19	443
19 – 20	170
20 – 22	179
22 – 23	286
23 – 25	155
25 – 26	135
26 – 27	245
27 – 29	94
29 – 30	213
30 – 32	71
32 – 33	49
33 – 34	74
34 – 36	39
36 – 37	40
37 – 38	51
38 – 40	20
40 – 41	27
41 – 42	12
42 – 44	10
44 – 45	19
45 – 47	6
47 – 48	4
48 – 49	17
49 – 51	9
51 – 52	6
52 – 54	3
54 – 55	4
55 – 56	8
56 – 58	2
58 – 59	2

Type text feature

This column records aircraft make-and-model designations, dominated by manufacturer-plus-type strings like 'Douglas DC-3' (334 occurrences) and 'de Havilland Canada DHC-6 Twin Otter 300'. Values are short (mean 18.3 chars, median 2 words) but highly repetitive: 53.3% duplicate rate across 2,446 unique types, with Douglas alone appearing in 1,113 rows. Watch for near-duplicate variants of the same airframe ('Douglas C-47', 'Douglas C-47A', 'Douglas C-47B') that will fragment any group-by unless normalised.

Treatment: Normalise manufacturer/variant strings (e.g. collapse C-47 sub-variants) before using as a categorical feature.

anthropic:claude-opus-4-7 · confidence high
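
Collapsing sub-variants might look like the sketch below. It assumes a sub-variant is a hyphenated designation followed directly by letters, as in 'C-47A'; real type strings will need more rules than this, and `base_type` is a hypothetical name.

```python
import re

# Assumption: a sub-variant is a designation like 'C-47' followed by letters ('C-47A').
_VARIANT_SUFFIX = re.compile(r"\b([A-Za-z]+-\d+)[A-Za-z]+\b")


def base_type(raw: str) -> str:
    """Strip trailing variant letters from hyphenated designations."""
    return _VARIANT_SUFFIX.sub(r"\1", " ".join(raw.split()))
```

With this rule 'Douglas C-47A' and 'Douglas C-47B' both map to 'Douglas C-47', while 'Douglas DC-3' is left unchanged.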

Out[32]:

saturn.columns["Type"].stats

stat	value
n	5,268
nulls	27 (0.5%)
unique	2,446
len_min	4
len_max	40
len_mean	18.33
len_median	16
len_p95	34
word_mean	2.718
word_median	2
n_empty	0
n_duplicates	2,795
duplicate_rate	0.5333
vocab_size	2,534
readability_flesch_mean	69.26
emoji_rate	0
url_rate	0
one_word_rate	0.007441
allcaps_rate	0.00954
boilerplate_rate	0
alert: duplicates	53.3% duplicate strings

Fig 15.

Character-length distribution for Type.

Character-length distribution for Type (mean: 18.325701202060674).
chars	count
4 – 5	6
5 – 6	5
6 – 7	6
7 – 8	19
8 – 8	32
8 – 9	57
9 – 10	178
10 – 11	255
11 – 12	685
12 – 13	0
13 – 14	522
14 – 15	331
15 – 16	441
16 – 17	369
17 – 18	208
18 – 18	158
18 – 19	154
19 – 20	166
20 – 21	154
21 – 22	0
22 – 23	109
23 – 24	120
24 – 25	158
25 – 26	188
26 – 26	174
26 – 27	107
27 – 28	73
28 – 29	85
29 – 30	39
30 – 31	0
31 – 32	66
32 – 33	58
33 – 34	55
34 – 35	43
35 – 36	16
36 – 36	25
36 – 37	16
37 – 38	9
38 – 39	21
39 – 40	133

Registration text identifier

Almost certainly aircraft tail/registration codes: 4905 unique values across 5268 rows, 99% all-caps single tokens with mean length 6.4 (max 15), and top tokens like 'hk-' and 'nc10809' resemble registration prefixes. Near-unique (n_unique/n ≈ 0.93) with a 6.36% null rate and only 28 duplicates, so it behaves as an identifier rather than a feature. The lone '/' appearing 36 times suggests a placeholder for split/unknown registrations worth inspecting.

Treatment: Treat as an identifier: drop from modelling or use only for joins/lookup after normalising case and the '/' placeholder.

anthropic:claude-opus-4-7 · confidence high

Out[35]:

saturn.columns["Registration"].stats

stat	value
n	5,268
nulls	335 (6.4%)
unique	4,905
len_min	1
len_max	15
len_mean	6.394
len_median	6
len_p95	10
word_mean	1.018
word_median	1
n_empty	0
n_duplicates	28
duplicate_rate	0.005676
vocab_size	4,948
readability_flesch_mean	103
emoji_rate	0
url_rate	0
one_word_rate	0.9899
allcaps_rate	0.9919
boilerplate_rate	0
alert: near_unique	99.4% of rows are unique strings
alert: one_word	99.0% rows are a single word
alert: allcaps	99.2% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 16.

Character-length distribution for Registration.

Character-length distribution for Registration (mean: 6.393877964727347).
chars	count
1	1
2	36
3	64
4	69
5	398
6	3228
7	512
8	267
9	42
10	206
11	10
12	12
13	41
14	8
15	39

cn/In text feature

Despite the text classification, 'cn/In' looks like a short code field: values are predominantly one-word, all-caps tokens with a mean length of 5.6 characters, and the top values ('178', '19', '229', '125') are all integers. About 23.3% of rows are null and only 333 duplicates (8.2%) appear across 3,707 unique values, so cardinality is high relative to 5,268 rows. The '/' character showing up 49 times in top_words hints at occasional composite values (e.g., 'a/b'), which the column name also suggests; 'cn/In' plausibly stands for construction number / line number, in which case the values are serial-style identifiers rather than quantities.

Treatment: Cast to numeric where possible and split composite '/'-separated entries; impute or flag the 23% nulls before modelling.

anthropic:claude-opus-4-7 · confidence medium

Out[38]:

saturn.columns["cn/In"].stats

stat	value
n	5,268
nulls	1,228 (23.3%)
unique	3,707
len_min	1
len_max	20
len_mean	5.645
len_median	5
len_p95	10
word_mean	1.026
word_median	1
n_empty	0
n_duplicates	333
duplicate_rate	0.08243
vocab_size	3,739
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9842
allcaps_rate	0.9663
boilerplate_rate	0
alert: one_word	98.4% rows are a single word
alert: allcaps	96.6% rows are all-caps
alert: null_rate	23.3% null
alert: short_text	95th-percentile length under 20 chars

Fig 17.

Character-length distribution for cn/In.

Character-length distribution for cn/In (mean: 5.64480198019802).
chars	count
1	23
2	113
3	604
4	866
5	895
6	268
7	269
8	281
9	457
10	125
11	92
12	14
13	9
14	2
15	5
16	4
17	5
18	2
19	2
20	4

Aboard numeric feature

This column records the number of people aboard, with values ranging from 0 to 644 and a median of 13. The distribution is heavily right-skewed (skew 4.25, kurtosis 28.4) and roughly 10% of rows (529) are flagged as outliers, indicating a long tail of very large flights against a typical small-aircraft baseline. Nulls are negligible (0.42%) and only 239 distinct values appear across 5268 rows.

Treatment: log1p-transform before modelling to tame the heavy right tail; the minimum is 0, so a plain log is undefined for some rows.

anthropic:claude-opus-4-7 · confidence high
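
Because Aboard has a minimum of 0, a plain logarithm fails on some rows. The sketch below uses log1p, which handles zero, on the five summary values from the stats table; the variable names are illustrative.

```python
import math

aboard = [0, 5, 13, 30, 644]  # min, q1, median, q3, max from the stats table

# math.log(0) raises ValueError, so use log1p(x) = log(1 + x), which maps 0 to 0.
log_aboard = [math.log1p(x) for x in aboard]

# expm1 inverts the transform when predictions must be reported as counts.
restored = [round(math.expm1(v)) for v in log_aboard]
```

The transform keeps the ordering of values while compressing the gap between the median and the maximum.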

Out[41]:

saturn.columns["Aboard"].stats

stat	value
n	5,268
nulls	22 (0.4%)
unique	239
min	0
max	644
mean	27.55
median	13
std	43.08
q1	5
q3	30
iqr	25
skew	4.247
kurtosis	28.41
n_outliers	529
outlier_rate	0.1008
zero_rate	0.0003812
alert: high_skew	skew=+4.25
alert: outliers	10.1% rows beyond 1.5 IQR
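
The outlier alert can be made concrete with the quartiles above. This sketch assumes "beyond 1.5 IQR" means the usual Tukey fences (q1 - 1.5*IQR and q3 + 1.5*IQR); Saturn's exact rule is not stated in the output.

```python
q1, q3 = 5, 30            # Aboard quartiles from the stats table
iqr = q3 - q1             # 25, matching the reported iqr
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_outlier(value: float) -> bool:
    """Tukey rule: outside [q1 - 1.5*IQR, q3 + 1.5*IQR]."""
    return value < lower_fence or value > upper_fence
```

Under that assumption the upper fence is 67.5, so any flight with 68 or more people aboard is flagged; the lower fence is negative and never triggers for a count.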

Fig 18.

Distribution of Aboard. Vertical dash marks the median.

Histogram bins for Aboard (median: 13.0).
bin	count
0 – 16.1	2978
16.1 – 32.2	1055
32.2 – 48.3	430
48.3 – 64.4	230
64.4 – 80.5	129
80.5 – 96.6	105
96.6 – 112.7	75
112.7 – 128.8	56
128.8 – 144.9	46
144.9 – 161	35
161 – 177.1	27
177.1 – 193.2	16
193.2 – 209.3	8
209.3 – 225.4	7
225.4 – 241.5	9
241.5 – 257.6	4
257.6 – 273.7	9
273.7 – 289.8	3
289.8 – 305.9	9
305.9 – 322	3
322 – 338.1	2
338.1 – 354.2	3
354.2 – 370.3	1
370.3 – 386.4	1
386.4 – 402.5	2
402.5 – 418.6	0
418.6 – 434.7	0
434.7 – 450.8	0
450.8 – 466.9	0
466.9 – 483	0
483 – 499.1	0
499.1 – 515.2	0
515.2 – 531.3	2
531.3 – 547.4	0
547.4 – 563.5	0
563.5 – 579.6	0
579.6 – 595.7	0
595.7 – 611.8	0
611.8 – 627.9	0
627.9 – 644	1

Fatalities numeric numeric_target

Counts of deaths per event, ranging from 0 to 583 with a median of 9 and mean of 20.07. The distribution is heavily right-skewed (skew 4.95, kurtosis 42.79) with 444 outliers (8.4% of rows) and a small zero rate of 1.1%. The IQR of 20 against a max of 583 confirms a long tail driven by rare catastrophic events.

Treatment: log1p-transform before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high

Out[44]:

saturn.columns["Fatalities"].stats

stat	value
n	5,268
nulls	12 (0.2%)
unique	191
min	0
max	583
mean	20.07
median	9
std	33.2
q1	3
q3	23
iqr	20
skew	4.948
kurtosis	42.79
n_outliers	444
outlier_rate	0.08447
zero_rate	0.01104
alert: high_skew	skew=+4.95
alert: outliers	8.4% rows beyond 1.5 IQR

Fig 19.

Distribution of Fatalities. Vertical dash marks the median.

Histogram bins for Fatalities (median: 9.0).
bin	count
0 – 14.57	3314
14.57 – 29.15	980
29.15 – 43.72	343
43.72 – 58.3	215
58.3 – 72.88	96
72.88 – 87.45	90
87.45 – 102	51
102 – 116.6	42
116.6 – 131.2	39
131.2 – 145.8	19
145.8 – 160.3	18
160.3 – 174.9	9
174.9 – 189.5	11
189.5 – 204	3
204 – 218.6	2
218.6 – 233.2	6
233.2 – 247.8	2
247.8 – 262.3	5
262.3 – 276.9	4
276.9 – 291.5	1
291.5 – 306.1	1
306.1 – 320.6	0
320.6 – 335.2	1
335.2 – 349.8	2
349.8 – 364.4	0
364.4 – 378.9	0
378.9 – 393.5	0
393.5 – 408.1	0
408.1 – 422.7	0
422.7 – 437.2	0
437.2 – 451.8	0
451.8 – 466.4	0
466.4 – 481	0
481 – 495.5	0
495.5 – 510.1	0
510.1 – 524.7	1
524.7 – 539.3	0
539.3 – 553.9	0
553.9 – 568.4	0
568.4 – 583	1

Ground numeric feature

Numeric field 'Ground' is overwhelmingly zero (zero_rate 0.9583) with median, q1, and q3 all at 0.0 and only 50 unique values across 5268 rows. The non-zero tail is extreme: max 2750.0 against a mean of 1.61, skew 50.3, and kurtosis 2559, producing 219 outliers (4.17%). Because the IQR is 0, every non-zero value counts as an outlier, which is why the outlier rate is exactly the complement of the zero rate. This is a sparse count of deaths on the ground: almost every record has none, but a few carry very large magnitudes.

Treatment: Split into a zero/non-zero indicator and log-transform the non-zero magnitudes before modelling.

anthropic:claude-opus-4-7 · confidence high
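
The two-part treatment can be sketched as a function that returns an indicator and a log magnitude. Returning None for the magnitude of zero rows, rather than 0.0, is a choice made here so that zeros and ones are not confused after the log; `split_ground` is a hypothetical name.

```python
import math
from typing import Optional, Tuple


def split_ground(value: Optional[float]) -> Tuple[Optional[int], Optional[float]]:
    """Return (non_zero_indicator, log_magnitude); magnitude is None when zero or missing."""
    if value is None:
        return None, None
    if value == 0:
        return 0, None
    # Counts are non-negative (min is 0), so the log is defined for every non-zero row.
    return 1, math.log(value)
```

A model can then use the indicator for all rows and the log magnitude only for the roughly 4% of rows where it is defined.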

Out[47]:

saturn.columns["Ground"].stats

stat	value
n	5,268
nulls	22 (0.4%)
unique	50
min	0
max	2,750
mean	1.609
median	0
std	53.99
q1	0
q3	0
iqr	0
skew	50.34
kurtosis	2559
n_outliers	219
outlier_rate	0.04175
zero_rate	0.9583
alert: high_skew	skew=+50.34

Fig 20.

Distribution of Ground. Vertical dash marks the median.

Histogram bins for Ground (median: 0.0).
bin	count
0 – 68.75	5235
68.75 – 137.5	8
137.5 – 206.2	0
206.2 – 275	1
275 – 343.8	0
343.8 – 412.5	0
412.5 – 481.2	0
481.2 – 550	0
550 – 618.8	0
618.8 – 687.5	0
687.5 – 756.2	0
756.2 – 825	0
825 – 893.8	0
893.8 – 962.5	0
962.5 – 1031	0
1031 – 1100	0
1100 – 1169	0
1169 – 1238	0
1238 – 1306	0
1306 – 1375	0
1375 – 1444	0
1444 – 1512	0
1512 – 1581	0
1581 – 1650	0
1650 – 1719	0
1719 – 1788	0
1788 – 1856	0
1856 – 1925	0
1925 – 1994	0
1994 – 2062	0
2062 – 2131	0
2131 – 2200	0
2200 – 2269	0
2269 – 2338	0
2338 – 2406	0
2406 – 2475	0
2475 – 2544	0
2544 – 2612	0
2612 – 2681	0
2681 – 2750	2

Summary text free_text

Free-text incident summaries averaging 201 characters (median 136, max 1954) with a Flesch readability of 61.7, suggesting short narrative paragraphs. Domain vocabulary is clearly aviation-accident: 'crashed' (2925), 'into' (2300), and 'aircraft' (2031) dominate after stopwords. Near-unique (4673 of 5268) but with 205 exact duplicates (4.2%) and a 7.4% null rate worth checking before modelling.

Treatment: Tokenize and embed (or TF-IDF) for downstream NLP; dedupe the 205 exact repeats first.

anthropic:claude-opus-4-7 · confidence high
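
A small sketch of the first two steps, deduplication and tokenisation, using only the standard library. The tokenizer is a deliberately simple stand-in for whatever NLP pipeline is used downstream, and both function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional


def dedupe_summaries(summaries: Iterable[Optional[str]]) -> List[str]:
    """Drop nulls and exact repeats, keeping first occurrences in order."""
    seen = set()
    unique = []
    for text in summaries:
        if text is None or text in seen:
            continue
        seen.add(text)
        unique.append(text)
    return unique


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; a stand-in for a real NLP tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())
```

This deduplicates the texts themselves; if the crash rows must all be kept, the same `seen` check can instead mark repeats with a flag column.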

Out[50]:

saturn.columns["Summary"].stats

stat	value
n	5,268
nulls	390 (7.4%)
unique	4,673
len_min	6
len_max	1,954
len_mean	200.7
len_median	136
len_p95	584
word_mean	33.24
word_median	23
n_empty	0
n_duplicates	205
duplicate_rate	0.04203
vocab_size	12,513
readability_flesch_mean	61.68
emoji_rate	0
url_rate	0
one_word_rate	0.00041
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	95.8% of rows are unique strings

Fig 21.

Character-length distribution for Summary.

Character-length distribution for Summary (mean: 200.73575235752358).
chars	count
6 – 55	822
55 – 103	1039
103 – 152	800
152 – 201	547
201 – 250	364
250 – 298	280
298 – 347	231
347 – 396	172
396 – 444	123
444 – 493	128
493 – 542	86
542 – 590	50
590 – 639	57
639 – 688	33
688 – 736	37
736 – 785	19
785 – 834	15
834 – 883	16
883 – 931	11
931 – 980	10
980 – 1029	5
1029 – 1077	1
1077 – 1126	6
1126 – 1175	3
1175 – 1224	4
1224 – 1272	3
1272 – 1321	2
1321 – 1370	1
1370 – 1418	1
1418 – 1467	3
1467 – 1516	1
1516 – 1564	2
1564 – 1613	1
1613 – 1662	4
1662 – 1710	0
1710 – 1759	0
1759 – 1808	0
1808 – 1857	0
1857 – 1905	0
1905 – 1954	1
