data-trove-global-shark-attack-file-gsaf

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv

Saturn profiled 6,462 rows across 24 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

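[2]:

# Install the profiler (notebook shell escape), then run it on the CSV.
!pip install saturn-dissect
import subprocess

subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv",
        "--findings", "data-trove-global-shark-attack-file-gsaf.json",
        "--llm", "anthropic:default",
    ],
    check=True,  # raise CalledProcessError if the CLI exits with a non-zero status
)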

Summary confidence: high

This dataset is the Global Shark Attack File (GSAF), containing 6,462 records of shark attack incidents spanning several centuries of documented cases. The first thing to examine is the attack outcome: 4,439 incidents are recorded as non-fatal ('N'), which is 68.7% of all rows or about 75% of rows with a non-null flag, and 1,400 are recorded as fatal ('Y'). The 'Injury' column contains 823 entries marked simply 'FATAL', so the two columns are worth cross-checking for consistency. A second priority is the geographic and activity breakdown: the USA leads with 2,310 cases (36%), Florida alone accounts for 1,076, and surfing (1,025) and swimming (932) are by far the most frequently recorded activities. These are raw counts, not exposure-adjusted risk. The 'Year' column carries a data quality warning: a maximum value of 3019 and high kurtosis signal outliers that should be cleaned before any time-series analysis.

citing: Fatal (Y/N).top_values · Fatal (Y/N).stats.top_rate · Injury.top_values · Country.top_values · Area.top_values · Activity.top_values · Year.stats.max · Year.stats.kurtosis · Year.stats.n_outliers · row_count
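
The cross-check between 'Injury' and 'Fatal (Y/N)' suggested above could look roughly like the sketch below. It is not part of Saturn: it assumes the CSV has been read into dictionaries keyed by the original column names (for example with csv.DictReader), `find_fatal_mismatches` is a hypothetical helper name, and the sample rows are invented for illustration.

```python
import re

FATAL_WORD = re.compile(r"\bfatal\b", re.IGNORECASE)
NON_FATAL = re.compile(r"\bnon[- ]?fatal\b", re.IGNORECASE)


def find_fatal_mismatches(rows):
    """Return rows whose Injury text says FATAL but whose flag is not 'Y'."""
    mismatches = []
    for row in rows:
        injury = (row.get("Injury") or "").strip()
        flag = (row.get("Fatal (Y/N)") or "").strip().upper()
        # Wording such as "Non-fatal" must not be counted as a fatality.
        says_fatal = bool(FATAL_WORD.search(injury)) and not NON_FATAL.search(injury)
        if says_fatal and flag != "Y":
            mismatches.append(row)
    return mismatches


# Invented sample rows, only to show the call shape.
sample = [
    {"Injury": "FATAL", "Fatal (Y/N)": "Y"},
    {"Injury": "FATAL", "Fatal (Y/N)": "N"},
    {"Injury": "Left foot bitten", "Fatal (Y/N)": "N"},
    {"Injury": "Non-fatal laceration", "Fatal (Y/N)": "N"},
    {"Injury": "Fatal", "Fatal (Y/N)": None},
]
suspect_rows = find_fatal_mismatches(sample)
```

Rows returned by the helper are candidates for manual review rather than automatic correction, since either column could be the one in error.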

Out[4]:

saturn.schema() · 24 columns

column	kind	n	null%	unique	alerts
index	numeric	6,462	0.0%	6,462
Case Number	text	6,462	0.0%	6,442	near_unique one_word allcaps short_text
Date	text	6,462	0.0%	5,552	one_word
Year	numeric	6,462	0.0%	252	high_skew
Type	categorical	6,462	0.1%	12
Country	categorical	6,462	0.8%	205
Area	categorical	6,462	7.2%	810	long_tail
Location	text	6,462	8.4%	4,148	multilingual duplicates
Activity	text	6,462	8.5%	1,516	one_word duplicates
Name	text	6,462	3.3%	5,339
Unnamed: 9	categorical	6,462	99.6%	2	null_rate
Age	categorical	6,462	44.4%	154	null_rate
Injury	text	6,462	0.4%	3,738	multilingual allcaps duplicates
Fatal (Y/N)	categorical	6,462	8.5%	7
Time	categorical	6,462	52.5%	366	long_tail null_rate
Species	text	6,462	45.2%	1,466	multilingual null_rate duplicates
Investigator or Source	text	6,462	0.3%	4,979	multilingual duplicates
pdf	text	6,462	52.6%	3,054	near_unique one_word null_rate
href formula	text	6,462	52.6%	3,051	near_unique one_word url_heavy null_rate
href	text	6,462	52.6%	3,051	near_unique one_word url_heavy null_rate
Case Number.1	text	6,462	52.6%	3,054	near_unique one_word allcaps null_rate short_text
Case Number.2	text	6,462	52.6%	3,055	near_unique one_word allcaps null_rate short_text
original order	numeric	6,462	52.6%	3,061	null_rate
Unnamed: 23	categorical	6,462	100.0%	2	long_tail null_rate

Fig 1.

Fatal (Y/N) · Look at the dominant 'N' slice versus fatal 'Y' and check for dirty values like 'M', 'F', and '2017' that indicate data entry errors.

Top values for Fatal (Y/N) (7 unique shown, of 7 total).
value	count	share
N	4439	68.7%
Y	1400	21.7%
UNKNOWN	71	1.1%
F	2	0.0%
M	1	0.0%
2017	1	0.0%
y	1	0.0%

Fig 2.

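Activity · Surfing (1,025) and swimming (932) are the two most frequently recorded activities, together about 30% of all rows; compare their counts to less frequent activities like diving and fishing. The table for this figure lists string lengths rather than activity counts.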

Character-length distribution for Activity (mean: 16.207952622673435).
chars	count
1 – 7	2033
7 – 14	2150
14 – 20	468
20 – 26	376
26 – 33	228
33 – 39	177
39 – 45	134
45 – 52	84
52 – 58	42
58 – 64	51
64 – 71	35
71 – 77	25
77 – 83	20
83 – 90	10
90 – 96	9
96 – 102	14
102 – 109	10
109 – 115	6
115 – 121	5
121 – 128	1
128 – 134	1
134 – 140	7
140 – 146	4
146 – 153	4
153 – 159	3
159 – 165	0
165 – 172	2
172 – 178	1
178 – 184	0
184 – 191	2
191 – 197	3
197 – 203	0
203 – 210	0
210 – 216	1
216 – 222	0
222 – 229	0
229 – 235	2
235 – 241	1
241 – 248	0
248 – 254	1

Fig 3.

Country · The USA, Australia, and South Africa together dominate the record count; note how sharply the frequency drops after the top three.

Top values for Country (20 unique shown, of 205 total).
value	count	share
USA	2310	35.7%
AUSTRALIA	1374	21.3%
SOUTH AFRICA	585	9.1%
NEW ZEALAND	135	2.1%
PAPUA NEW GUINEA	135	2.1%
BAHAMAS	115	1.8%
BRAZIL	113	1.7%
MEXICO	95	1.5%
ITALY	71	1.1%
PHILIPPINES	62	1.0%
FIJI	62	1.0%
REUNION	60	0.9%
NEW CALEDONIA	56	0.9%
CUBA	46	0.7%
SPAIN	44	0.7%
MOZAMBIQUE	44	0.7%
EGYPT	42	0.6%
INDIA	40	0.6%
JAPAN	34	0.5%
CROATIA	34	0.5%

Fig 4.

Type · Unprovoked attacks make up about 73% of all incidents; compare against the provoked, invalid, and sea disaster categories.

Top values for Type (12 unique shown, of 12 total).
value	count	share
Unprovoked	4716	73.0%
Provoked	593	9.2%
Invalid	552	8.5%
Sea Disaster	239	3.7%
Watercraft	142	2.2%
Boat	109	1.7%
Boating	92	1.4%
Questionable	10	0.2%
Unconfirmed	1	0.0%
Unverified	1	0.0%
Under investigation	1	0.0%
Boatomg	1	0.0%

Fig 5.

Year · Most records cluster in the modern era, but look for the extreme outliers (including a year value of 3019) that skew the distribution.

Histogram bins for Year (median: 1980.0).
bin	count
0 – 75.47	126
75.47 – 150.9	1
150.9 – 226.4	0
226.4 – 301.9	0
301.9 – 377.4	0
377.4 – 452.8	0
452.8 – 528.3	1
528.3 – 603.8	0
603.8 – 679.3	0
679.3 – 754.8	0
754.8 – 830.2	0
830.2 – 905.7	0
905.7 – 981.2	0
981.2 – 1057	0
1057 – 1132	0
1132 – 1208	0
1208 – 1283	0
1283 – 1359	0
1359 – 1434	0
1434 – 1510	0
1510 – 1585	4
1585 – 1660	6
1660 – 1736	7
1736 – 1811	37
1811 – 1887	371
1887 – 1962	1975
1962 – 2038	3930
2038 – 2113	0
2113 – 2189	0
2189 – 2264	0
2264 – 2340	0
2340 – 2415	0
2415 – 2491	0
2491 – 2566	0
2566 – 2642	0
2642 – 2717	0
2717 – 2793	0
2793 – 2868	0
2868 – 2944	0
2944 – 3019	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
index	numeric	0.0%
Case Number	text	0.0%
Date	text	0.0%
Year	numeric	0.0%
Type	categorical	0.1%
Country	categorical	0.8%
Area	categorical	7.2%
Location	text	8.4%
Activity	text	8.5%
Name	text	3.3%
Unnamed: 9	categorical	99.6%
Age	categorical	44.4%
Injury	text	0.4%
Fatal (Y/N)	categorical	8.5%
Time	categorical	52.5%
Species	text	45.2%
Investigator or Source	text	0.3%
pdf	text	52.6%
href formula	text	52.6%
href	text	52.6%
Case Number.1	text	52.6%
Case Number.2	text	52.6%
original order	numeric	52.6%
Unnamed: 23	categorical	100.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 16,164 detected strings).
lang	count	share
en	14597	90.3%
de	627	3.9%
es	219	1.4%
fr	177	1.1%
it	123	0.8%
pt	80	0.5%
nl	45	0.3%
ru	31	0.2%
fi	26	0.2%
sv	23	0.1%
zh	22	0.1%
pl	18	0.1%
ja	16	0.1%
id	15	0.1%
ca	14	0.1%
ceb	13	0.1%
tr	13	0.1%
eu	13	0.1%
vi	12	0.1%
cy	10	0.1%
ro	9	0.1%
ms	8	0.0%
eo	7	0.0%
sq	7	0.0%
af	5	0.0%
hu	5	0.0%
hr	4	0.0%
jbo	4	0.0%
war	4	0.0%
cs	4	0.0%
lv	3	0.0%
sh	3	0.0%
sw	2	0.0%
tl	1	0.0%
th	1	0.0%
no	1	0.0%
sl	1	0.0%
uk	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	index	Year	original order
index	+1.00	-0.40	-0.97
Year	-0.40	+1.00	+0.38
original order	-0.97	+0.38	+1.00

index numeric identifier

This column is a row index — a sequential integer identifier running from 0 to 6461 with 6462 unique values and no nulls, perfectly matching the row count. Its distribution is exactly uniform (skew = 0.0, kurtosis ≈ −1.2, mean = median = 3230.5, zero outliers), confirming it was generated as a positional index rather than carrying any domain meaning. The single 'zero' (zero_rate ≈ 0.00015) is simply row 0. There is no analytical signal here.

Treatment: Drop before modelling; carries no predictive information.

anthropic:default · confidence high

Out[14]:

saturn.columns["index"].stats

stat	value
n	6,462
nulls	0 (0.0%)
unique	6,462
min	0
max	6,461
mean	3230
median	3230
std	1866
q1	1615
q3	4846
iqr	3230
skew	0
kurtosis	-1.2
n_outliers	0
outlier_rate	0
zero_rate	0.0001548

Fig 9.

Distribution of index. Vertical dash marks the median.

Histogram bins for index (median: 3230.5).
bin	count
0 – 161.5	162
161.5 – 323.1	162
323.1 – 484.6	161
484.6 – 646.1	162
646.1 – 807.6	161
807.6 – 969.2	162
969.2 – 1131	161
1131 – 1292	162
1292 – 1454	161
1454 – 1615	162
1615 – 1777	161
1777 – 1938	162
1938 – 2100	161
2100 – 2261	162
2261 – 2423	161
2423 – 2584	162
2584 – 2746	161
2746 – 2907	162
2907 – 3069	161
3069 – 3230	162
3230 – 3392	162
3392 – 3554	161
3554 – 3715	162
3715 – 3877	161
3877 – 4038	162
4038 – 4200	161
4200 – 4361	162
4361 – 4523	161
4523 – 4684	162
4684 – 4846	161
4846 – 5007	162
5007 – 5169	161
5169 – 5330	162
5330 – 5492	161
5492 – 5653	162
5653 – 5815	161
5815 – 5976	162
5976 – 6138	161
6138 – 6299	162
6299 – 6461	162

Case Number text identifier

This column is a case identifier whose values appear to encode dates in YYYY.MM.DD format (e.g., '2019.10.08'), suggesting case numbers tied to filing or incident dates. With 6,442 unique values across 6,462 rows and only 2 nulls (null rate 0.0003), it is near-unique and can serve as a candidate key, though not yet as a strict primary key. The 18 duplicated rows (duplicate_rate 0.0028) are a mild anomaly worth investigating: a value such as '2012.09.02.b' shows that letter suffixes are used to disambiguate same-date cases, so the remaining duplicates suggest that suffixing was not applied consistently. The allcaps_rate of 0.748 suggests a mix of formatting styles across records.

Treatment: Use as a case-level join key; flag the 18 duplicates for deduplication before modelling.

anthropic:default · confidence high
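
Flagging the repeated case numbers can be done with a simple tally. The sketch below is an illustration only; `duplicated_case_numbers` is a hypothetical helper, and the input is assumed to be the raw 'Case Number' strings (nulls as None).

```python
from collections import Counter


def duplicated_case_numbers(case_numbers):
    """Map each repeated case number to the number of rows that carry it."""
    cleaned = [c.strip() for c in case_numbers if c and c.strip()]
    counts = Counter(cleaned)
    return {case: n for case, n in counts.items() if n > 1}


def extra_rows(duplicates):
    """Rows beyond the first occurrence of each repeated value."""
    return sum(n - 1 for n in duplicates.values())


# Invented sample values, reusing two identifiers quoted in the text.
dupes = duplicated_case_numbers(
    ["2012.09.02.b", "2012.09.02.b", "2019.10.08", None, "  "]
)
```

On the full column, `extra_rows` would be the figure to compare against the reported n_duplicates of 18, assuming the profiler counts rows beyond the first occurrence (6,460 non-null values minus 6,442 unique values gives 18).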

Out[17]:

saturn.columns["Case Number"].stats

stat	value
n	6,462
nulls	2 (0.0%)
unique	6,442
len_min	6
len_max	18
len_mean	10.63
len_median	10
len_p95	12
word_mean	1.001
word_median	1
n_empty	0
n_duplicates	18
duplicate_rate	0.002786
vocab_size	6,445
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9992
allcaps_rate	0.748
boilerplate_rate	0
alert: near_unique	99.7% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: allcaps	74.8% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 10.

Character-length distribution for Case Number.

Character-length distribution for Case Number (mean: 10.63). Lengths are whole numbers, so counts are listed per exact length.
chars	count
6	4
7	122
8	0
9	0
10	4150
11	14
12	2128
13	14
14	24
15	2
16	1
17	0
18	1

Date text metadata

This column contains free-form date annotations for the incidents, stored in highly inconsistent formats: bare years (e.g., '1957', '1942'), structured dates ('05-Oct-2003'), and vague phrases ('Before 1958', 'No date', 'ca.', 'summer', 'late'). The top word 'reported' appears 559 times, which suggests many values follow a pattern like 'reported [date]' and will not parse directly. With a 14.1% duplicate rate (909 duplicates against 5,552 unique values) and a maximum length of 64 characters, this column cannot be safely cast to a datetime type without substantial normalization work.

Treatment: Parse and normalize into structured date fields using regex rules per format pattern; flag unparseable values ('No date', 'ca.', 'reported …') as null or uncertain.

anthropic:default · confidence high
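
A per-pattern parser along the lines of the treatment might look like the sketch below. It covers only the formats quoted above; the regexes, the list of uncertainty words, and the name `parse_gsaf_date` are assumptions made for illustration, and month abbreviations are parsed with `%b`, which depends on an English locale.

```python
import re
from datetime import datetime

DAY_MON_YEAR = re.compile(r"\b(\d{1,2})[- ]([A-Za-z]{3})[- ](\d{4})\b")
BARE_YEAR = re.compile(r"\b(\d{4})\b")
UNCERTAIN = re.compile(
    r"\b(reported|before|after|ca\.?|circa|summer|late|early|no date)\b",
    re.IGNORECASE,
)


def parse_gsaf_date(raw):
    """Return (year, month, day, uncertain); unknown parts are None."""
    text = (raw or "").strip()
    uncertain = bool(UNCERTAIN.search(text))
    match = DAY_MON_YEAR.search(text)
    if match:
        try:
            parsed = datetime.strptime("-".join(match.groups()), "%d-%b-%Y")
            return parsed.year, parsed.month, parsed.day, uncertain
        except ValueError:
            uncertain = True  # impossible day or unrecognised month
    match = BARE_YEAR.search(text)
    if match:
        # Any text around the year (e.g. "Before 1958") marks it uncertain.
        return int(match.group(1)), None, None, uncertain or match.group(0) != text
    return None, None, None, True
```

Returning an explicit `uncertain` flag instead of silently dropping qualifiers keeps values such as 'Before 1958' usable for coarse, year-level analysis.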

Out[20]:

saturn.columns["Date"].stats

stat	value
n	6,462
nulls	1 (0.0%)
unique	5,552
len_min	4
len_max	64
len_mean	11.43
len_median	11
len_p95	20
word_mean	1.155
word_median	1
n_empty	0
n_duplicates	909
duplicate_rate	0.1407
vocab_size	5,496
readability_flesch_mean	89.93
emoji_rate	0
url_rate	0
one_word_rate	0.8728
allcaps_rate	0.05092
boilerplate_rate	0
alert: one_word	87.3% rows are a single word

Fig 11.

Character-length distribution for Date.

Character-length distribution for Date (mean: 11.425475932518186).
chars	count
4 – 6	308
6 – 7	0
7 – 8	347
8 – 10	31
10 – 12	5088
12 – 13	33
13 – 14	43
14 – 16	4
16 – 18	7
18 – 19	5
19 – 20	552
20 – 22	5
22 – 24	5
24 – 25	7
25 – 26	8
26 – 28	0
28 – 30	1
30 – 31	0
31 – 32	3
32 – 34	1
34 – 36	2
36 – 37	1
37 – 38	2
38 – 40	0
40 – 42	1
42 – 43	1
43 – 44	1
44 – 46	0
46 – 48	0
48 – 49	1
49 – 50	0
50 – 52	0
52 – 54	0
54 – 55	2
55 – 56	1
56 – 58	0
58 – 60	0
60 – 61	0
61 – 62	0
62 – 64	1

Year numeric feature

This column records the year of each incident for 6,462 records, with a median of 1980 and an interquartile range of 1943–2006. Two signals stand out: a minimum of 0 (a zero rate of nearly 2%, almost certainly a placeholder for a missing year) and a maximum of 3019, which is a data-entry error (plausibly a typo for a year such as 2019). These outliers (266 records, ~4.1%) drive the extreme negative skew (−6.55) and very high kurtosis (42.54), masking what is otherwise a fairly clean temporal distribution.

Treatment: Null-code zeros and values outside a valid range, correct obvious typos like 3019, then use as an ordinal or numeric feature. Choose the lower bound with care: the histogram shows plausible records from the 1500s onward, so a cutoff such as 1800 would also discard dozens of early cases.

anthropic:default · confidence high
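
The null-coding step can be sketched as follows. The bounds are assumptions: 1500 is chosen only because the histogram shows records from roughly 1510 onward, and 2100 comes from the treatment. The sketch deliberately does not rewrite 3019 to any particular year, since the intended value cannot be known from the statistics alone; `clean_year` is a hypothetical name.

```python
def clean_year(value, lo=1500, hi=2100):
    """Return an int year inside [lo, hi], or None for sentinels and outliers."""
    try:
        year = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None  # null, non-numeric, NaN or infinite input
    if year == 0:
        return None  # 0 appears to be a missing-year placeholder
    if not lo <= year <= hi:
        return None  # e.g. 3019: leave for manual correction, do not guess
    return year
```

Values mapped to None here can then be resolved by hand or from the 'Date' column where that column carries a usable year.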

Out[23]:

saturn.columns["Year"].stats

stat	value
n	6,462
nulls	3 (0.0%)
unique	252
min	0
max	3,019
mean	1930
median	1,980
std	278.3
q1	1,943
q3	2,006
iqr	63
skew	-6.554
kurtosis	42.54
n_outliers	266
outlier_rate	0.04118
zero_rate	0.01935
alert: high_skew	skew=-6.55

Fig 12.

Distribution of Year. Vertical dash marks the median.

Histogram bins for Year (median: 1980.0).
bin	count
0 – 75.47	126
75.47 – 150.9	1
150.9 – 226.4	0
226.4 – 301.9	0
301.9 – 377.4	0
377.4 – 452.8	0
452.8 – 528.3	1
528.3 – 603.8	0
603.8 – 679.3	0
679.3 – 754.8	0
754.8 – 830.2	0
830.2 – 905.7	0
905.7 – 981.2	0
981.2 – 1057	0
1057 – 1132	0
1132 – 1208	0
1208 – 1283	0
1283 – 1359	0
1359 – 1434	0
1434 – 1510	0
1510 – 1585	4
1585 – 1660	6
1660 – 1736	7
1736 – 1811	37
1811 – 1887	371
1887 – 1962	1975
1962 – 2038	3930
2038 – 2113	0
2113 – 2189	0
2189 – 2264	0
2264 – 2340	0
2340 – 2415	0
2415 – 2491	0
2491 – 2566	0
2566 – 2642	0
2642 – 2717	0
2717 – 2793	0
2793 – 2868	0
2868 – 2944	0
2944 – 3019	1

Type categorical label

This column classifies shark attack incidents by the nature of the encounter, with 12 distinct categories across 6,462 records. It is heavily dominated by 'Unprovoked' at 73% of all records, creating notable class imbalance. A data quality issue is the fragmentation of watercraft-related incidents across three near-synonymous labels, 'Watercraft' (142), 'Boat' (109), and 'Boating' (92), plus a single misspelled 'Boatomg'; these almost certainly denote the same category and should be consolidated. The 'Invalid' category (552 records, ~8.5%) also warrants attention, as it may represent records that should be excluded from incident analysis.

Treatment: Consolidate 'Boat', 'Boating', 'Boatomg', and 'Watercraft' into a single category; consider excluding or flagging 'Invalid' records; one-hot encode for modelling, keeping the severe class imbalance in mind.

anthropic:default · confidence high
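
The consolidation amounts to a small alias table. In the sketch below the target label 'Watercraft' is an arbitrary choice and `normalise_type` is a hypothetical helper; going by the value table, the merged category would hold 142 + 109 + 92 + 1 = 344 rows.

```python
# Lower-cased source label -> canonical label (the choice of target is an assumption).
TYPE_ALIASES = {
    "watercraft": "Watercraft",
    "boat": "Watercraft",
    "boating": "Watercraft",
    "boatomg": "Watercraft",  # misspelling that occurs once in the value table
}


def normalise_type(value):
    """Collapse watercraft synonyms; leave every other label unchanged."""
    if value is None:
        return None
    label = value.strip()
    if not label:
        return None
    return TYPE_ALIASES.get(label.lower(), label)
```

Labels such as 'Questionable', 'Unconfirmed', and 'Unverified' are left alone here; whether they should also be merged is a separate judgment call.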

Out[26]:

saturn.columns["Type"].stats

stat	value
n	6,462
nulls	5 (0.1%)
unique	12
top_value	Unprovoked
top_rate	0.7304
cardinality	12
entropy	1.457
entropy_ratio	0.4064

Fig 13.

Top values for Type.

Top values for Type (12 unique shown, of 12 total).
value	count	share
Unprovoked	4716	73.0%
Provoked	593	9.2%
Invalid	552	8.5%
Sea Disaster	239	3.7%
Watercraft	142	2.2%
Boat	109	1.7%
Boating	92	1.4%
Questionable	10	0.2%
Unconfirmed	1	0.0%
Unverified	1	0.0%
Under investigation	1	0.0%
Boatomg	1	0.0%
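
The report does not define entropy or entropy_ratio. The reported values are consistent with Shannon entropy in bits and with that entropy divided by log2 of the cardinality, but this is an inference from the numbers, not a documented Saturn formula. The sketch below recomputes both from the counts in the table above so the reading can be checked.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts copied from the Type value table (nulls excluded).
type_counts = [4716, 593, 552, 239, 142, 109, 92, 10, 1, 1, 1, 1]

entropy = shannon_entropy_bits(type_counts)
ratio = entropy / math.log2(len(type_counts))  # 1.0 would mean a uniform spread
print(round(entropy, 3), round(ratio, 4))
```

Under this reading, a ratio near 0.4 for Type says the 12 labels carry well under half the information that 12 equally frequent labels would, which matches the dominance of 'Unprovoked'.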

Country categorical feature

This column records the country where each incident occurred, with 205 distinct values across 6,462 rows. The distribution is heavily skewed: USA alone accounts for 36% of records (2,310), followed by AUSTRALIA at 21% (1,374) and SOUTH AFRICA at 9% (585), so these three countries together represent roughly two-thirds of the dataset. The entropy ratio of 0.51 confirms moderate concentration despite 205 unique values, and the low null rate (0.79%) means coverage is excellent.

Treatment: One-hot encode top countries and group tail countries into a residual 'OTHER' category before modelling.

anthropic:default · confidence high
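
Grouping the tail into a residual category before one-hot encoding could be sketched like this. The cut-off of ten countries and the helper name `group_rare` are assumptions for illustration; nulls are passed through untouched so they can be handled separately.

```python
from collections import Counter


def group_rare(values, top_k=10, other="OTHER"):
    """Keep the top_k most frequent categories; replace the rest with `other`."""
    counts = Counter(v for v in values if v is not None)
    keep = {v for v, _ in counts.most_common(top_k)}
    return [v if (v is None or v in keep) else other for v in values]


# Invented sample, only to show the behaviour.
grouped = group_rare(["USA"] * 3 + ["AUSTRALIA"] * 2 + ["FIJI", None], top_k=2)
```

A threshold on minimum count would work just as well as a fixed top_k; the right choice depends on how many indicator columns the downstream model can tolerate.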

Out[29]:

saturn.columns["Country"].stats

stat	value
n	6,462
nulls	51 (0.8%)
unique	205
top_value	USA
top_rate	0.3603
cardinality	205
entropy	3.909
entropy_ratio	0.509

Fig 14.

Top values for Country.

Top values for Country (20 unique shown, of 205 total).
value	count	share
USA	2310	35.7%
AUSTRALIA	1374	21.3%
SOUTH AFRICA	585	9.1%
NEW ZEALAND	135	2.1%
PAPUA NEW GUINEA	135	2.1%
BAHAMAS	115	1.8%
BRAZIL	113	1.7%
MEXICO	95	1.5%
ITALY	71	1.1%
PHILIPPINES	62	1.0%
FIJI	62	1.0%
REUNION	60	0.9%
NEW CALEDONIA	56	0.9%
CUBA	46	0.7%
SPAIN	44	0.7%
MOZAMBIQUE	44	0.7%
EGYPT	42	0.6%
INDIA	40	0.6%
JAPAN	34	0.5%
CROATIA	34	0.5%

Area categorical feature

This column represents geographic sub-national regions (states, provinces) drawn from multiple countries: the US (Florida, Hawaii, California, South Carolina), Australia (New South Wales, Queensland, Western Australia), and South Africa (KwaZulu-Natal, Western Cape Province, Eastern Cape Province). Florida dominates at 17.9% of non-null values (16.7% of all rows), while 810 unique values against 6,462 rows signal a severe long-tail distribution: 540 areas appear exactly once. The null rate is 7.16%, and the geographic diversity across at least three countries confirms that the dataset is multinational in scope.

Treatment: Encode with frequency or target encoding; consider grouping rare areas (long-tail) into an 'Other' bucket or rolling up to country level to reduce cardinality from 810 classes.

anthropic:default · confidence high
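
Frequency encoding, one of the options named in the treatment, replaces each area with its share of the non-null values. The sketch below is a minimal standard-library illustration with a hypothetical helper name; on the real column, Florida would map to roughly 0.179 (1,076 of 5,999 non-null rows).

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its share of the non-null values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    return [None if v is None else counts[v] / total for v in values]


# Invented sample, only to show the output shape.
encoded = frequency_encode(["Florida", "Florida", "Hawaii", None])
```

Because all singleton areas receive the same small value, this encoding folds the long tail together without needing an explicit 'Other' bucket.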

Out[32]:

saturn.columns["Area"].stats

stat	value
n	6,462
nulls	463 (7.2%)
unique	810
top_value	Florida
top_rate	0.1794
cardinality	810
entropy	6.163
entropy_ratio	0.6379
alert: long_tail	540 singleton categories

Fig 15.

Top values for Area.

Top values for Area (20 unique shown, of 810 total).
value	count	share
Florida	1076	16.7%
New South Wales	498	7.7%
Queensland	325	5.0%
Hawaii	312	4.8%
California	294	4.5%
KwaZulu-Natal	215	3.3%
Western Australia	197	3.0%
Western Cape Province	195	3.0%
Eastern Cape Province	165	2.6%
South Carolina	163	2.5%
North Carolina	111	1.7%
South Australia	104	1.6%
Victoria	92	1.4%
Texas	75	1.2%
Pernambuco	74	1.1%
Torres Strait	72	1.1%
North Island	70	1.1%
New Jersey	55	0.9%
Tasmania	42	0.6%
South Island	41	0.6%

Location text label

This column captures the location of each incident, predominantly as 'City, County' strings; top values such as 'New Smyrna Beach, Volusia County' (181 occurrences) and the dominant words 'county', 'beach', 'island', and 'bay' are what one would expect from coastal incident records. The duplicate rate is high at ~29.9% (1,769 duplicates among the 5,917 non-null values), reflecting repeated incidents at the same hotspot locations rather than data error. The multilingual alert is most likely caused by automated language detection misclassifying short geographic names (e.g. 'Durban', 'Boa Viagem, Recife') as non-English: 3,746 values are detected as English, with the remainder split across the other detected 'languages' (31 in total according to the alert) because short strings are ambiguous. The null rate is 8.43%, which may represent unknown or offshore incident locations.

Treatment: Normalize to canonical 'City, County, Country' format, then use as a categorical grouping variable or geocode for spatial analysis.

anthropic:default · confidence high

Out[35]:

saturn.columns["Location"].stats

stat	value
n	6,462
nulls	545 (8.4%)
unique	4,148
len_min	3
len_max	119
len_mean	22.75
len_median	21
len_p95	47
word_mean	3.54
word_median	3
n_empty	0
n_duplicates	1,769
duplicate_rate	0.299
vocab_size	4,483
readability_flesch_mean	53.4
emoji_rate	0
url_rate	0
one_word_rate	0.1486
allcaps_rate	0.000507
boilerplate_rate	0
alert: multilingual	31 languages detected in sample
alert: duplicates	29.9% duplicate strings

Fig 16.

Character-length distribution for Location.

Character-length distribution for Location (mean: 22.75477437890823).
chars	count
3 – 6	106
6 – 9	542
9 – 12	758
12 – 15	720
15 – 18	445
18 – 20	358
20 – 23	352
23 – 26	393
26 – 29	524
29 – 32	293
32 – 35	505
35 – 38	203
38 – 41	152
41 – 44	124
44 – 46	124
46 – 49	95
49 – 52	54
52 – 55	52
55 – 58	26
58 – 61	17
61 – 64	21
64 – 67	14
67 – 70	8
70 – 73	5
73 – 76	2
76 – 78	4
78 – 81	6
81 – 84	1
84 – 87	2
87 – 90	1
90 – 93	1
93 – 96	0
96 – 99	1
99 – 102	1
102 – 104	2
104 – 107	0
107 – 110	3
110 – 113	1
113 – 116	0
116 – 119	1

Activity text label

This column captures the activity the person was engaged in at the time of the incident, dominated by Surfing (1,025) and Swimming (932) with a long tail of descriptive phrases. Despite being labelled 'text', it behaves largely as a categorical label: 62.9% of values are single words, there are only 1,516 unique values, and the duplicate rate is 74.3% of non-null values, indicating a loosely controlled vocabulary rather than a strict enum. The median string length of 8 characters versus a maximum of 254 suggests a mix of clean category entries and occasional free-text annotations, which will need normalisation before use.

Treatment: Standardise to a controlled vocabulary by clustering near-duplicates (e.g. 'Diving' vs 'Scuba diving'), then encode as a categorical feature; impute or flag the 8.54% nulls separately.

anthropic:default · confidence high
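
One simple way to approximate a controlled vocabulary is an ordered list of keyword rules. The categories and patterns below are assumptions chosen for illustration, not a vocabulary taken from the dataset, and the first matching rule wins, so rule order matters for phrases that mention two activities.

```python
import re

# (pattern, canonical label); checked in order, first match wins.
ACTIVITY_RULES = [
    (re.compile(r"\bsurf", re.IGNORECASE), "Surfing"),
    (re.compile(r"\bswim", re.IGNORECASE), "Swimming"),
    (re.compile(r"\b(div|snorkel)", re.IGNORECASE), "Diving"),
    (re.compile(r"\bfish", re.IGNORECASE), "Fishing"),
]


def normalise_activity(raw):
    """Map free text to a coarse activity label; None stays None."""
    if raw is None or not raw.strip():
        return None
    for pattern, label in ACTIVITY_RULES:
        if pattern.search(raw):
            return label
    return "Other"
```

Reviewing the most frequent values that land in 'Other' is a quick way to decide which rules to add next.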

Out[38]:

saturn.columns["Activity"].stats

stat	value
n	6,462
nulls	552 (8.5%)
unique	1,516
len_min	1
len_max	254
len_mean	16.21
len_median	8
len_p95	49
word_mean	2.497
word_median	1
n_empty	0
n_duplicates	4,394
duplicate_rate	0.7435
vocab_size	2,244
readability_flesch_mean	39.56
emoji_rate	0
url_rate	0
one_word_rate	0.6289
allcaps_rate	0.0005076
boilerplate_rate	0
alert: one_word	62.9% rows are a single word
alert: duplicates	74.3% duplicate strings

Fig 17.

Character-length distribution for Activity.

Character-length distribution for Activity (mean: 16.207952622673435).
chars	count
1 – 7	2033
7 – 14	2150
14 – 20	468
20 – 26	376
26 – 33	228
33 – 39	177
39 – 45	134
45 – 52	84
52 – 58	42
58 – 64	51
64 – 71	35
71 – 77	25
77 – 83	20
83 – 90	10
90 – 96	9
96 – 102	14
102 – 109	10
109 – 115	6
115 – 121	5
121 – 128	1
128 – 134	1
134 – 140	7
140 – 146	4
146 – 153	4
153 – 159	3
159 – 165	0
165 – 172	2
172 – 178	1
178 – 184	0
184 – 191	2
191 – 197	3
197 – 203	0
203 – 210	0
210 – 216	1
216 – 222	0
222 – 229	0
229 – 235	2
235 – 241	1
241 – 248	0
248 – 254	1

Name text label

This column is the 'Name' field for the person (or, in some rows, the vessel) involved in each incident. It is heavily contaminated with non-name entries: the most frequent value is 'male' (579 occurrences), followed by 'female' (106), 'boy' (23), '2 males' (19), and 'boat' (14), with 'sailor' and 'a sailor' also among the top values, indicating that gender and role descriptors were freely mixed with actual proper names. The duplicate rate of 14.5% (908 duplicates against 5,339 unique values) and a 3.33% null rate further confirm that this column is inconsistently populated and not a clean identifier.

Treatment: Split into two derived columns—one for proper names, one for role/gender descriptors—before any modelling or grouping.

anthropic:default · confidence high
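
The split into a proper-name column and a descriptor column can be sketched with one pattern built from the descriptors quoted above. The pattern is an assumption and will miss descriptors not listed in the text; `split_name` is a hypothetical helper.

```python
import re

# Optional count ("2 "), optional article ("a "), then a known descriptor word.
DESCRIPTOR = re.compile(
    r"^(?:\d+\s+)?(?:a\s+)?(?:male|female|boy|girl|boat|sailor)s?$",
    re.IGNORECASE,
)


def split_name(raw):
    """Return (proper_name, descriptor); at most one of the two is set."""
    text = (raw or "").strip()
    if not text:
        return None, None
    if DESCRIPTOR.match(text):
        return None, text.lower()
    return text, None
```

Values that mix both kinds of information in one string would need a further pass; this sketch only separates the pure descriptor entries.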

Out[41]:

saturn.columns["Name"].stats

stat	value
n	6,462
nulls	215 (3.3%)
unique	5,339
len_min	1
len_max	221
len_mean	14.83
len_median	13
len_p95	35
word_mean	2.453
word_median	2
n_empty	0
n_duplicates	908
duplicate_rate	0.1453
vocab_size	6,536
readability_flesch_mean	51.73
emoji_rate	0
url_rate	0
one_word_rate	0.1657
allcaps_rate	0.006883
boilerplate_rate	0

Fig 18.

Character-length distribution for Name.

Character-length distribution for Name (mean: 14.830158476068513).
chars	count
1 – 6	939
6 – 12	1355
12 – 18	2774
18 – 23	488
23 – 28	223
28 – 34	128
34 – 40	94
40 – 45	47
45 – 50	66
50 – 56	34
56 – 62	25
62 – 67	26
67 – 72	20
72 – 78	6
78 – 84	3
84 – 89	8
89 – 94	2
94 – 100	1
100 – 106	1
106 – 111	2
111 – 116	1
116 – 122	0
122 – 128	0
128 – 133	1
133 – 138	1
138 – 144	1
144 – 150	0
150 – 155	0
155 – 160	0
160 – 166	0
166 – 172	0
172 – 177	0
177 – 182	0
182 – 188	0
188 – 194	0
194 – 199	0
199 – 204	0
204 – 210	0
210 – 216	0
216 – 221	1

Unnamed: 9 categorical

Out[44]:

saturn.columns["Unnamed: 9"].stats

stat	value
n	6,462
nulls	6,434 (99.6%)
unique	2
top_value	M
top_rate	0.8571
cardinality	2
entropy	0.5917
entropy_ratio	0.5917
alert: null_rate	99.6% null

Fig 19.

Top values for Unnamed: 9.

Top values for Unnamed: 9 (2 unique shown, of 2 total).
value	count	share
M	24	0.4%
F	4	0.1%

Age categorical feature

This column represents victim age stored as a categorical (string) type rather than a numeric type, with 154 distinct values across 6,462 rows. The most striking issue is the 44.43% null rate, flagged as an alert: nearly half the records lack an age value. The distribution skews young. The top values cluster between ages 15 and 25, and the most frequent value, '17', accounts for only 4.46% of non-null values (2.5% of all rows), which suggests that teenagers and young adults are over-represented among recorded victims. The high entropy ratio of 0.80 confirms that values are spread broadly across the 154 categories despite this concentration. Having 154 distinct values is more than whole-number ages alone would be expected to produce, so some entries are probably non-numeric.

Treatment: Cast to integer (coercing any non-numeric entries to null), impute or flag nulls (44.43% missing requires an explicit strategy), then treat as an ordinal or numeric feature.

anthropic:default · confidence high
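
A conservative cast keeps only values that are plain whole numbers in a plausible range and turns everything else into a null. The upper bound of 120 and the example non-numeric string in the tests are assumptions; `parse_age` is a hypothetical helper.

```python
def parse_age(raw, max_age=120):
    """Return an int age for plain ASCII digit strings in range, else None."""
    text = "" if raw is None else str(raw).strip()
    if text.isascii() and text.isdigit():
        age = int(text)
        if 0 < age <= max_age:
            return age
    return None
```

Counting how many non-null values come back as None is a cheap way to measure how much of the column is not a clean integer before choosing an imputation strategy.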

Out[47]:

saturn.columns["Age"].stats

stat	value
n	6,462
nulls	2,871 (44.4%)
unique	154
top_value	17
top_rate	0.04456
cardinality	154
entropy	5.827
entropy_ratio	0.8018
alert: null_rate	44.4% null

Fig 20.

Top values for Age.

Top values for Age (20 unique shown, of 154 total).
value	count	share
17	160	2.5%
18	155	2.4%
20	146	2.3%
19	145	2.2%
16	140	2.2%
15	140	2.2%
21	122	1.9%
22	118	1.8%
25	111	1.7%
24	109	1.7%
14	104	1.6%
13	97	1.5%
26	85	1.3%
23	83	1.3%
28	82	1.3%
30	80	1.2%
29	80	1.2%
27	79	1.2%
12	75	1.2%
32	72	1.1%

Injury text label

This column describes the nature and outcome of the injuries as free text, ranging from 'FATAL' to specific anatomical bite locations (e.g., 'Left foot bitten', 'Leg bitten'). The dominant value is 'FATAL', appearing 823 times, by far the most frequent entry. Two signals stand out: a high duplicate rate of 41.9% (2,695 duplicates among the 6,433 non-null values) driven by repetitive categorical-style phrases, and an all-caps rate of 13.1% suggesting inconsistent data entry conventions. Additionally, language detection labels 496 entries as German against 3,812 as English. Because short strings are easily misclassified, these may be detection artefacts rather than genuinely German text and should be spot-checked before being read as multilingual sourcing.

Treatment: Normalize case, map high-frequency values to a structured severity/outcome taxonomy, and spot-check the entries detected as German; translate or handle them separately only if they prove genuine, before any text embedding or categorical encoding.

anthropic:default · confidence high

Out[50]:

saturn.columns["Injury"].stats

stat	value
n	6,462
nulls	29 (0.4%)
unique	3,738
len_min	5
len_max	234
len_mean	31.53
len_median	25
len_p95	82
word_mean	5.414
word_median	4
n_empty	0
n_duplicates	2,695
duplicate_rate	0.4189
vocab_size	2,550
readability_flesch_mean	53.74
emoji_rate	0
url_rate	0
one_word_rate	0.1489
allcaps_rate	0.1307
boilerplate_rate	0
alert: multilingual	10 languages detected in sample
alert: allcaps	13.1% rows are all-caps
alert: duplicates	41.9% duplicate strings

Fig 21.

Character-length distribution for Injury.

Character-length distribution for Injury (mean: 31.52868024249961).
chars	count
5 – 11	1206
11 – 16	740
16 – 22	851
22 – 28	788
28 – 34	636
34 – 39	429
39 – 45	382
45 – 51	251
51 – 57	238
57 – 62	200
62 – 68	132
68 – 74	128
74 – 79	100
79 – 85	73
85 – 91	55
91 – 97	47
97 – 102	45
102 – 108	31
108 – 114	13
114 – 120	22
120 – 125	15
125 – 131	6
131 – 137	9
137 – 142	10
142 – 148	4
148 – 154	2
154 – 160	2
160 – 165	0
165 – 171	3
171 – 177	3
177 – 182	3
182 – 188	1
188 – 194	0
194 – 200	2
200 – 205	2
205 – 211	1
211 – 217	0
217 – 223	1
223 – 228	0
228 – 234	2

Fatal (Y/N) categorical label

This column is a binary fatality flag for incidents, expected to hold only 'Y' or 'N' values. The dominant value is 'N' with 4,439 occurrences (about 75% of non-null values, 68.7% of all rows), and 'Y' accounts for 1,400 cases (23.7% of non-null values, 21.7% of all rows). There are data quality issues: 5 rows contain clearly erroneous values ('F', 'M', '2017', lowercase 'y'), suggesting data entry errors or row misalignment, and 71 rows are labeled 'UNKNOWN'. The 8.46% null rate adds further incompleteness.

Treatment: Standardise 'y' → 'Y', investigate and recode/drop 'F', 'M', '2017' entries, decide on treatment of 'UNKNOWN' and nulls, then binarise (Y=1, N=0) for modelling.

anthropic:default · confidence high
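
The recoding in the treatment reduces to a small mapping. In this sketch every value other than 'Y' or 'N' (after trimming and upper-casing) becomes None, which treats 'UNKNOWN', 'F', 'M', and '2017' as missing; that is one possible policy, not the only defensible one, and `binarise_fatal` is a hypothetical helper.

```python
def binarise_fatal(raw):
    """Map 'Y'/'y' to 1 and 'N' to 0; anything else, including null, to None."""
    text = "" if raw is None else str(raw).strip().upper()
    return {"Y": 1, "N": 0}.get(text)


# Shares among the 5,915 non-null values, from the counts in the value table.
non_null = 4439 + 1400 + 71 + 2 + 1 + 1 + 1
share_n = 4439 / non_null
share_y = 1400 / non_null
```

The two shares make the denominators explicit: top_rate (0.7505) is computed over non-null values, while the 'share' column in the value table is computed over all 6,462 rows.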

Out[53]:

saturn.columns["Fatal (Y/N)"].stats

stat	value
n	6,462
nulls	547 (8.5%)
unique	7
top_value	N
top_rate	0.7505
cardinality	7
entropy	0.8897
entropy_ratio	0.3169

Fig 22.

Top values for Fatal (Y/N).

Top values for Fatal (Y/N) (7 unique shown, of 7 total).
value	count	share
N	4439	68.7%
Y	1400	21.7%
UNKNOWN	71	1.1%
F	2	0.0%
M	1	0.0%
2017	1	0.0%
y	1	0.0%

Time categorical feature

This column captures time-of-day information, but it is stored inconsistently: values mix coarse labels ('Afternoon', 'Morning') with specific clock times in 'HHhMM' format ('11h00', '16h30'), yielding 366 unique values across 6,462 rows. The null rate is severe at 52.49%: over half of all records have no time at all. The top value 'Afternoon' accounts for only 6.29% of non-null values (3.0% of all rows), and the entropy ratio is 0.77, indicating a long tail of rarely seen time strings (199 categories occur only once), likely from inconsistent data entry across sources or time periods.

Treatment: Standardise to 24h numeric minutes-since-midnight, map label categories ('Morning', 'Afternoon') to representative values or a separate flag, then impute or model missingness explicitly given 52.49% null rate.

anthropic:default · confidence high
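
Converting the clock-style values to minutes since midnight while keeping the coarse labels separate could be sketched as follows. Only the 'HHhMM' pattern shown in the value table is handled; `parse_time` is a hypothetical helper, and any other string is returned unchanged as a label.

```python
import re

CLOCK = re.compile(r"^(\d{1,2})h(\d{2})$")


def parse_time(raw):
    """Return (minutes_since_midnight, label); at most one of the two is set."""
    text = "" if raw is None else str(raw).strip()
    match = CLOCK.match(text)
    if match:
        hours, minutes = int(match.group(1)), int(match.group(2))
        if hours < 24 and minutes < 60:
            return hours * 60 + minutes, None
    return None, (text or None)
```

Keeping labels such as 'Afternoon' in their own field avoids inventing a precise clock time for records that never had one.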

Out[56]:

saturn.columns["Time"].stats

stat	value
n	6,462
nulls	3,392 (52.5%)
unique	366
top_value	Afternoon
top_rate	0.06287
cardinality	366
entropy	6.559
entropy_ratio	0.7702
alert: long_tail	199 singleton categories
alert: null_rate	52.5% null

Fig 23.

Top values for Time.

Top values for Time (20 unique shown, of 366 total).
value	count	share
Afternoon	193	3.0%
11h00	131	2.0%
Morning	126	1.9%
12h00	113	1.7%
15h00	111	1.7%
16h00	106	1.6%
14h00	102	1.6%
16h30	79	1.2%
17h30	77	1.2%
13h00	75	1.2%
17h00	74	1.1%
14h30	73	1.1%
18h00	72	1.1%
15h30	67	1.0%
11h30	65	1.0%
13h30	64	1.0%
10h00	63	1.0%
Night	63	1.0%
09h00	55	0.9%
10h30	51	0.8%

Species text label

This column records shark species (and incident validity notes), with values ranging from specific species ('White shark', 'Tiger shark', 'Bull shark') to free-text qualifiers like 'Shark involvement not confirmed' and 'Invalid'. The null rate is severe at 45.25%, and 58.56% of non-null values are duplicates, which is expected for a species label with only 1,466 unique values. The multilingual alert deserves scepticism: while 2,582 values are classified as English, 14 are labelled German, 18 Finnish, 11 Chinese, and 8 Turkish, among others. With counts this small on short strings, detector misclassification is a more likely explanation than genuinely non-English entries, though a spot check would settle it. The mix of species names, size descriptions ("4' shark", "1.8 m [6'] shark"), and incident-status phrases means this column is semantically heterogeneous and will require parsing or splitting before use.

Treatment: Split into a normalized species category and a separate incident-validity flag; impute or exclude the 45.25% nulls based on task context.
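
One possible shape for that split is sketched below. The keyword lists contain only the values quoted in the note above; a real cleaning pass would need a much longer vocabulary, and the names `split_species`, `KNOWN_SPECIES`, and `VALIDITY_MARKERS` are hypothetical.

```python
# Only the species and status phrases quoted in this report; extend as needed.
KNOWN_SPECIES = ("white shark", "tiger shark", "bull shark")
VALIDITY_MARKERS = ("not confirmed", "invalid")


def split_species(value):
    """Return (species, involvement_confirmed).

    species is a normalized name, 'unidentified', or None.
    involvement_confirmed is True, False, or None when the cell is empty.
    """
    if value is None or not value.strip():
        return None, None
    text = value.strip().lower()
    # Status phrases take priority over any species name in the same cell.
    if any(marker in text for marker in VALIDITY_MARKERS):
        return None, False
    for name in KNOWN_SPECIES:
        if name in text:
            return name, True
    # Size-only descriptions such as "4' shark" land here.
    return "unidentified", True
```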

anthropic:default · confidence high

Out[59]:

saturn.columns["Species "].stats

stat	value
n	6,462
nulls	2,924 (45.2%)
unique	1,466
len_min	3
len_max	194
len_mean	22.95
len_median	17
len_p95	50
word_mean	4.445
word_median	4
n_empty	0
n_duplicates	2,072
duplicate_rate	0.5856
vocab_size	1,105
readability_flesch_mean	88.63
emoji_rate	0
url_rate	0
one_word_rate	0.04098
allcaps_rate	0.0002826
boilerplate_rate	0
alert: multilingual	15 languages detected in sample
alert: null_rate	45.2% null
alert: duplicates	58.6% duplicate strings

Fig 24.

Character-length distribution for Species .

Character-length distribution for Species (mean: 22.95110231769361).
chars	count
3 – 8	105
8 – 13	855
13 – 17	833
17 – 22	448
22 – 27	324
27 – 32	285
32 – 36	120
36 – 41	127
41 – 46	115
46 – 51	170
51 – 56	32
56 – 60	25
60 – 65	21
65 – 70	13
70 – 75	17
75 – 79	11
79 – 84	7
84 – 89	4
89 – 94	6
94 – 98	4
98 – 103	2
103 – 108	0
108 – 113	1
113 – 118	1
118 – 122	0
122 – 127	2
127 – 132	1
132 – 137	1
137 – 141	1
141 – 146	1
146 – 151	0
151 – 156	2
156 – 161	1
161 – 165	1
165 – 170	0
170 – 175	0
175 – 180	0
180 – 184	1
184 – 189	0
189 – 194	1

Investigator or Source text metadata

This column records the investigator or data source credited for each shark attack incident, typically formatted as an abbreviated name plus an organizational affiliation (e.g., 'C. Moore, GSAF'). GSAF (Global Shark Attack File) dominates the top entries and appears in 983 word tokens, making it the primary contributing organization. The duplicate rate of 22.7% (1,464 duplicates among the 6,443 non-null values) is expected given a finite set of investigators filing multiple reports. The multilingual alert, with 23 languages detected in the sample, is more notable: 'en' accounts for 4,457 entries, while 'es' (134), 'fr' (100), 'it' (62), and 'de' (56) reflect international source attribution or non-English investigator names. The high cardinality (4,979 unique values among 6,443 non-null rows) suggests many entries are one-off source combinations rather than standardized identifiers.

Treatment: Normalize organization affiliations (e.g., extract 'GSAF' tag) and standardize investigator name formats before using as a categorical grouping variable.
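
A minimal sketch of that normalization, assuming entries are comma-separated and that 'GSAF' is the only organization tag of interest (both are assumptions drawn from the single example 'C. Moore, GSAF'; `split_source` is a hypothetical helper):

```python
import re


def split_source(value, org_tags=("GSAF",)):
    """Return (names, orgs): the credited names with whitespace collapsed,
    and the list of organization tags found in the entry."""
    if value is None or not value.strip():
        return None, []
    parts = [p.strip() for p in value.split(",") if p.strip()]
    orgs = [p.upper() for p in parts if p.upper() in org_tags]
    names = [re.sub(r"\s+", " ", p) for p in parts if p.upper() not in org_tags]
    return ("; ".join(names) or None), orgs
```

Grouping on the extracted tag rather than the raw string avoids treating each of the 4,979 distinct strings as its own category.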

anthropic:default · confidence high

Out[62]:

saturn.columns["Investigator or Source"].stats

stat	value
n	6,462
nulls	19 (0.3%)
unique	4,979
len_min	3
len_max	210
len_mean	32.24
len_median	26
len_p95	77
word_mean	4.792
word_median	3
n_empty	0
n_duplicates	1,464
duplicate_rate	0.2272
vocab_size	7,898
readability_flesch_mean	73.62
emoji_rate	0
url_rate	0.002328
one_word_rate	0.02592
allcaps_rate	0.01847
boilerplate_rate	0
alert: multilingual	23 languages detected in sample
alert: duplicates	22.7% duplicate strings

Fig 25.

Character-length distribution for Investigator or Source.

Character-length distribution for Investigator or Source (mean: 32.23731181126804).
chars	count
3 – 8	152
8 – 13	461
13 – 19	1164
19 – 24	844
24 – 29	1098
29 – 34	816
34 – 39	400
39 – 44	315
44 – 50	203
50 – 55	174
55 – 60	150
60 – 65	142
65 – 70	116
70 – 75	59
75 – 81	61
81 – 86	45
86 – 91	39
91 – 96	40
96 – 101	30
101 – 106	23
106 – 112	23
112 – 117	18
117 – 122	14
122 – 127	14
127 – 132	9
132 – 138	8
138 – 143	5
143 – 148	5
148 – 153	4
153 – 158	3
158 – 163	4
163 – 169	0
169 – 174	0
174 – 179	1
179 – 184	0
184 – 189	1
189 – 194	1
194 – 200	0
200 – 205	0
205 – 210	1

pdf text foreign_key

This column contains PDF filenames or partial file paths, evidenced by the '.pdf' suffixes in the top words and a mean token count of about one word per value (word_mean 1.022). Over half the rows (52.55%) are null, and with 3,054 unique values among the 3,066 non-null rows the column is near-unique, functioning more like a document reference key than a descriptive field. The strongly negative Flesch readability score (−66.81) is consistent with structured filename strings rather than natural language. The 12 duplicates suggest a few documents are referenced by more than one record.

Treatment: Use as a document reference key to join or retrieve associated PDF files; impute or flag nulls before any join.
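
Before joining on this key, it helps to know which rows have no filename and which filenames are shared. A rough sketch, assuming rows are available as dictionaries with a 'pdf' field (`index_by_pdf` is a hypothetical helper, not part of Saturn):

```python
from collections import defaultdict


def index_by_pdf(rows, key="pdf"):
    """Return (by_pdf, shared, missing).

    by_pdf maps each filename to the row positions that cite it,
    shared keeps only filenames cited by more than one row,
    missing lists row positions with no filename.
    """
    by_pdf = defaultdict(list)
    missing = []
    for position, row in enumerate(rows):
        name = (row.get(key) or "").strip()
        if name:
            by_pdf[name].append(position)
        else:
            missing.append(position)
    shared = {name: rows_ for name, rows_ in by_pdf.items() if len(rows_) > 1}
    return dict(by_pdf), shared, missing
```

On this dataset `missing` would be expected to hold 3,396 positions, and the rows listed in `shared` would account for the 12 duplicates.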

anthropic:default · confidence high

Out[65]:

saturn.columns["pdf"].stats

stat	value
n	6,462
nulls	3,396 (52.6%)
unique	3,054
len_min	10
len_max	41
len_mean	23.73
len_median	23
len_p95	31
word_mean	1.022
word_median	1
n_empty	0
n_duplicates	12
duplicate_rate	0.003914
vocab_size	3,098
readability_flesch_mean	-66.81
emoji_rate	0
url_rate	0
one_word_rate	0.9866
allcaps_rate	0.0003262
boilerplate_rate	0
alert: near_unique	99.6% of rows are unique strings
alert: one_word	98.7% rows are a single word
alert: null_rate	52.6% null

Fig 26.

Character-length distribution for pdf.

Character-length distribution for pdf (mean: 23.73091976516634).
chars	count
10 – 11	2
11 – 12	0
12 – 12	0
12 – 13	1
13 – 14	0
14 – 15	4
15 – 15	0
15 – 16	4
16 – 17	0
17 – 18	14
18 – 19	49
19 – 19	139
19 – 20	277
20 – 21	0
21 – 22	438
22 – 22	473
22 – 23	391
23 – 24	0
24 – 25	275
25 – 26	229
26 – 26	159
26 – 27	134
27 – 28	0
28 – 29	112
29 – 29	86
29 – 30	72
30 – 31	0
31 – 32	60
32 – 32	50
32 – 33	31
33 – 34	27
34 – 35	0
35 – 36	13
36 – 36	8
36 – 37	6
37 – 38	0
38 – 39	3
39 – 39	2
39 – 40	4
40 – 41	3

href formula text metadata

This column contains hyperlink formulas pointing to PDF source documents on sharkattackfile.net, each URL referencing a dated incident report (e.g., '1935.06.05.r-solomonislands.pdf'). Over half the rows (52.62%) are null, meaning many records lack a linked source document. Values are nearly all single-token URLs (one_word_rate 0.9879, url_rate 0.9997), and the extremely negative Flesch readability score (−820.18) confirms these are machine-generated URL strings, not natural text. Only 11 duplicates exist among the 3,062 non-null values (3,051 unique), suggesting most cited PDFs are distinct incident references.

Treatment: Extract raw URL string from formula syntax before use; treat as a source-citation reference field and consider joining or flagging unsourced rows (52.62% null) separately.
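
The exact formula syntax is not visible in the profile, so the sketch below simply pulls the first http(s) URL out of whatever wrapper surrounds it. The `=HYPERLINK("...")` wrapper and the sample path in the usage note are illustrative assumptions.

```python
import re

# Stop at whitespace, quotes, angle brackets, or a closing parenthesis.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")


def extract_url(cell):
    """Return the first http(s) URL found in the cell, or None."""
    if not isinstance(cell, str):
        return None
    match = URL_RE.search(cell)
    return match.group(0) if match else None
```

For example, `extract_url('=HYPERLINK("http://sharkattackfile.net/spreadsheets/pdf_directory/1935.06.05.r-solomonislands.pdf")')` returns the bare URL, and a cell that already holds a plain URL is returned unchanged.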

anthropic:default · confidence high

Out[68]:

saturn.columns["href formula"].stats

stat	value
n	6,462
nulls	3,400 (52.6%)
unique	3,051
len_min	64
len_max	95
len_mean	77.73
len_median	77
len_p95	85
word_mean	1.019
word_median	1
n_empty	0
n_duplicates	11
duplicate_rate	0.003592
vocab_size	3,089
readability_flesch_mean	-820.2
emoji_rate	0
url_rate	0.9997
one_word_rate	0.9879
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	99.6% of rows are unique strings
alert: one_word	98.8% rows are a single word
alert: url_heavy	100.0% rows contain a URL
alert: null_rate	52.6% null

Fig 27.

Character-length distribution for href formula.

Character-length distribution for href formula (mean: 77.72893533638145).
chars	count
64 – 65	1
65 – 66	0
66 – 66	0
66 – 67	1
67 – 68	0
68 – 69	4
69 – 69	0
69 – 70	5
70 – 71	0
71 – 72	14
72 – 73	49
73 – 73	139
73 – 74	276
74 – 75	0
75 – 76	437
76 – 76	473
76 – 77	392
77 – 78	0
78 – 79	275
79 – 80	228
80 – 80	159
80 – 81	134
81 – 82	0
82 – 83	112
83 – 83	86
83 – 84	72
84 – 85	0
85 – 86	60
86 – 86	48
86 – 87	31
87 – 88	27
88 – 89	0
89 – 90	13
90 – 90	8
90 – 91	6
91 – 92	0
92 – 93	3
93 – 93	2
93 – 94	4
94 – 95	3

href text metadata

This column contains URLs linking to PDF source documents in a shark attack file directory (sharkattackfile.net/spreadsheets/pdf_directory/), serving as citation or evidence links for individual incident records. Over half the rows are null (52.62%), indicating many records lack a linked source document. Nearly all non-null values are single-token URLs (one_word_rate 0.988, url_rate 0.9997), with very few duplicates (11 among 3,062 non-null values, 3,051 unique), consistent with per-incident citation links. The high null rate is the key analyst concern: roughly half of incidents have no associated PDF reference.

Treatment: Exclude from predictive modelling; retain as a provenance/citation field, or engineer a binary 'has_source_pdf' indicator from non-null presence.
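
The binary indicator mentioned above could look like this. The function names are hypothetical; the `isinstance` check is there so that a float NaN, which is how a dataframe library would typically represent a missing cell, also counts as absent.

```python
def has_source_pdf(href):
    """Return 1 when a non-blank link string is present, else 0."""
    return int(isinstance(href, str) and bool(href.strip()))


def source_coverage(hrefs):
    """Share of rows that carry a link (0.0 for an empty input)."""
    flags = [has_source_pdf(h) for h in hrefs]
    return sum(flags) / len(flags) if flags else 0.0
```

With 3,062 linked rows out of 6,462, `source_coverage` on this column should come out at roughly 0.474.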

anthropic:default · confidence high

Out[71]:

saturn.columns["href"].stats

stat	value
n	6,462
nulls	3,400 (52.6%)
unique	3,051
len_min	64
len_max	135
len_mean	77.89
len_median	77
len_p95	86
word_mean	1.02
word_median	1
n_empty	0
n_duplicates	11
duplicate_rate	0.003592
vocab_size	3,091
readability_flesch_mean	-824.4
emoji_rate	0
url_rate	0.9997
one_word_rate	0.9879
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	99.6% of rows are unique strings
alert: one_word	98.8% rows are a single word
alert: url_heavy	100.0% rows contain a URL
alert: null_rate	52.6% null

Fig 28.

Character-length distribution for href.

Character-length distribution for href (mean: 77.88667537557153).
chars	count
64 – 66	1
66 – 68	1
68 – 69	4
69 – 71	19
71 – 73	49
73 – 75	415
75 – 76	908
76 – 78	666
78 – 80	224
80 – 82	290
82 – 84	199
84 – 85	131
85 – 87	80
87 – 89	27
89 – 91	21
91 – 92	9
92 – 94	6
94 – 96	3
96 – 98	0
98 – 100	0
100 – 101	0
101 – 103	0
103 – 105	0
105 – 107	0
107 – 108	0
108 – 110	0
110 – 112	0
112 – 114	0
114 – 115	0
115 – 117	0
117 – 119	0
119 – 121	0
121 – 123	0
123 – 124	0
124 – 126	0
126 – 128	0
128 – 130	1
130 – 131	4
131 – 133	2
133 – 135	2

Case Number.1 text identifier

This column appears to be a duplicate or alternate version of a case number field, with values formatted as date-like codes (e.g., '1966.12.26', '1923.00.00.a') suggesting archival case identifiers tied to dates with alphabetic suffixes for disambiguation. A striking 52.62% null rate makes this column unreliable for most analyses, and the near-unique flag (3,054 unique values among 3,062 non-null rows, so only 8 duplicates) confirms it functions as a quasi-identifier where present. The allcaps rate of 78.97% reflects alphanumeric codes rather than natural language. The '.1' suffix in the column name matches the way pandas renames repeated header names on load (just as 'Unnamed: 23' matches its naming of blank headers), which suggests the source file carries a second 'Case Number' column rather than this being the product of a merge or pivot.

Treatment: Investigate overlap with the original 'Case Number' column; if redundant, drop; otherwise impute nulls cautiously or use as a secondary join key.
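
A simple way to run that overlap check is to compare the two columns row by row. The sketch below assumes both columns are available as equal-length lists; `compare_columns` is a hypothetical helper.

```python
def compare_columns(primary, alternate):
    """Count rows where the alternate column matches, differs from,
    or is missing relative to the primary column."""
    summary = {"same": 0, "differ": 0, "alt_missing": 0}
    for a, b in zip(primary, alternate):
        if not isinstance(b, str) or not b.strip():
            summary["alt_missing"] += 1
        elif isinstance(a, str) and a.strip() == b.strip():
            summary["same"] += 1
        else:
            summary["differ"] += 1
    return summary
```

If `differ` is zero or close to it, the alternate column adds nothing beyond a presence flag and can be dropped; any differing rows are worth inspecting by hand.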

anthropic:default · confidence medium

Out[74]:

saturn.columns["Case Number.1"].stats

stat	value
n	6,462
nulls	3,400 (52.6%)
unique	3,054
len_min	7
len_max	18
len_mean	10.59
len_median	10
len_p95	12
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	8
duplicate_rate	0.002613
vocab_size	3,057
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9984
allcaps_rate	0.7897
boilerplate_rate	0
alert: near_unique	99.7% of rows are unique strings
alert: one_word	99.8% rows are a single word
alert: allcaps	79.0% rows are all-caps
alert: null_rate	52.6% null
alert: short_text	95th-percentile length under 20 chars

Fig 29.

Character-length distribution for Case Number.1.

Character-length distribution for Case Number.1 (mean: 10.59079033311561).
chars	count
7	120
9	4
10	1884
11	7
12	1011
13	8
14	24
15	2
16	1
18	1
Lengths 8 and 17 have no rows; empty histogram bins are omitted.

Case Number.2 text identifier

This column appears to be a structured case number identifier, encoding a date-based reference system (e.g., '1966.12.26', '1915.07.06.a.r') typical of archival record catalogues. With a 52.62% null rate across 6,462 rows and 3,055 unique values among the 3,062 non-null rows, the column is near-unique but severely incomplete. The 78.97% all-caps rate combined with date-like tokens and alphabetic suffixes (a, b, r) suggests a custom alphanumeric coding scheme rather than free text. Only 7 duplicate values exist, making this effectively an identifier where present. Its length statistics and length histogram are identical to those of 'Case Number.1', so the two columns are probably near-copies of each other.

Treatment: Retain as a join/lookup key; impute or flag nulls separately; do not encode numerically.
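
If the date parts are needed separately, the identifier can be parsed without converting it to a number. The pattern below is inferred from the three examples quoted in this report, and reading '00' as "unknown month or day" is an assumption; values that do not fit (for instance the 120 seven-character codes) return None.

```python
import re

# YYYY.MM.DD followed by zero or more dot-separated letter suffixes.
CASE_RE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})((?:\.[A-Za-z]+)*)$")


def parse_case_number(value):
    """Split a case number into year, month, day, and suffix letters."""
    match = CASE_RE.match(value.strip()) if isinstance(value, str) else None
    if match is None:
        return None
    year, month, day = (int(part) for part in match.group(1, 2, 3))
    suffixes = [s for s in match.group(4).split(".") if s]
    # Assumption: a '00' month or day means the value is unknown.
    return {"year": year, "month": month or None, "day": day or None,
            "suffixes": suffixes}
```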

anthropic:default · confidence high

Out[77]:

saturn.columns["Case Number.2"].stats

stat	value
n	6,462
nulls	3,400 (52.6%)
unique	3,055
len_min	7
len_max	18
len_mean	10.59
len_median	10
len_p95	12
word_mean	1.002
word_median	1
n_empty	0
n_duplicates	7
duplicate_rate	0.002286
vocab_size	3,058
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0.9984
allcaps_rate	0.7897
boilerplate_rate	0
alert: near_unique	99.8% of rows are unique strings
alert: one_word	99.8% rows are a single word
alert: allcaps	79.0% rows are all-caps
alert: null_rate	52.6% null
alert: short_text	95th-percentile length under 20 chars

Fig 30.

Character-length distribution for Case Number.2.

Character-length distribution for Case Number.2 (mean: 10.59079033311561).
chars	count
7	120
9	4
10	1884
11	7
12	1011
13	8
14	24
15	2
16	1
18	1
Lengths 8 and 17 have no rows; empty histogram bins are omitted.

original order numeric metadata

This column appears to be a positional or sequence index assigned to records, likely reflecting the original sort order of items in a source dataset. The most striking issue is a 52.62% null rate: over half the rows carry no value, which is highly anomalous for an ordering field and suggests either a join that left many rows unmatched or that ordering was only recorded for a subset of records. With 3,061 unique values among the 3,062 non-null rows, exactly one value is repeated, which is enough to rule the column out as a strict unique key. The histogram shows values spread almost uniformly between 3 and about 3,090, then a gap, then 27 values in the 6,340 to 6,502 range. Those 27 rows are the flagged outliers, and they account for the right skew (0.99) and the kurtosis of 3.55.

Treatment: Investigate source of 52.62% nulls before use; if retaining, treat as an optional sort key and do not use as a unique identifier given duplicate values.
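
The 27 outliers are consistent with the common 1.5 × IQR rule: with q1 = 768.2 and q3 = 2299, the upper fence is 2299 + 1.5 × 1530.8 ≈ 4595, and every value in the 6,340 to 6,502 bin lies above it. Whether Saturn uses exactly this rule is an assumption. A small sketch:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3, k=1.5):
    """Return the non-null values that fall outside the fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return [v for v in values if v is not None and (v < lower or v > upper)]
```

Calling `tukey_fences(768.2, 2299)` gives an upper fence of about 4595.2, so the bulk of the column (up to roughly 3,090) stays inside while the isolated high block is flagged.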

anthropic:default · confidence medium

Out[80]:

saturn.columns["original order"].stats

stat	value
n	6,462
nulls	3,400 (52.6%)
unique	3,061
min	3
max	6,502
mean	1564
median	1534
std	988.4
q1	768.2
q3	2299
iqr	1530
skew	0.9878
kurtosis	3.551
n_outliers	27
outlier_rate	0.008818
zero_rate	0
alert: null_rate	52.6% null

Fig 31.

Distribution of original order. Vertical dash marks the median.

Histogram bins for original order (median: 1533.5).
bin	count
3 – 165.5	163
165.5 – 327.9	162
327.9 – 490.4	163
490.4 – 652.9	162
652.9 – 815.4	163
815.4 – 977.8	162
977.8 – 1140	163
1140 – 1303	162
1303 – 1465	163
1465 – 1628	162
1628 – 1790	163
1790 – 1953	162
1953 – 2115	163
2115 – 2278	162
2278 – 2440	163
2440 – 2603	162
2603 – 2765	163
2765 – 2928	162
2928 – 3090	110
3090 – 3252	0
3252 – 3415	0
3415 – 3577	0
3577 – 3740	0
3740 – 3902	0
3902 – 4065	0
4065 – 4227	0
4227 – 4390	0
4390 – 4552	0
4552 – 4715	0
4715 – 4877	0
4877 – 5040	0
5040 – 5202	0
5202 – 5365	0
5365 – 5527	0
5527 – 5690	0
5690 – 5852	0
5852 – 6015	0
6015 – 6177	0
6177 – 6340	0
6340 – 6502	27

Unnamed: 23 categorical

Out[83]:

saturn.columns["Unnamed: 23"].stats

stat	value
n	6,462
nulls	6,460 (100.0%)
unique	2
top_value	Teramo
top_rate	0.5
cardinality	2
entropy	1
entropy_ratio	1
alert: long_tail	2 singleton categories
alert: null_rate	100.0% null

Fig 32.

Top values for Unnamed: 23.

Top values for Unnamed: 23 (2 unique shown, of 2 total).
value	count	share
Teramo	1	0.0%
change filename	1	0.0%

How to cite