Data Trove: Global Shark Attack File (GSAF)

source: /home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv · 6,462 rows · 24 columns · profiled 2026-06-21

Reading

dataset summary · high confidence anthropic:default

This dataset is the Global Shark Attack File (GSAF), containing 6,462 records of shark attack incidents spanning centuries of documented cases. The most important thing to examine first is the attack outcome: roughly 75% of incidents with a recorded outcome are non-fatal ('N', 68.7% of all rows) and 1,400 are recorded as fatal ('Y'), yet the 'Injury' column has only 823 entries marked simply 'FATAL', so the two columns are worth cross-checking for consistency. A second priority is the geographic and activity breakdown: the USA dominates with 2,310 cases (36%), Florida alone accounts for 1,076, and surfing (1,025) and swimming (932) are by far the most frequently recorded activities (counts alone do not measure risk). The 'Year' column carries a data quality warning: a maximum value of 3019 and high kurtosis signal outliers that should be cleaned before any time-series analysis.

citing: Fatal (Y/N).top_values · Fatal (Y/N).stats.top_rate · Injury.top_values · Country.top_values · Area.top_values · Activity.top_values · Year.stats.max · Year.stats.kurtosis · Year.stats.n_outliers · row_count
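
The cross-check suggested in the summary could be sketched as follows. This is a minimal, standard-library-only example: the function name injury_flag_mismatches and the sample rows are hypothetical, and only the column names 'Injury' and 'Fatal (Y/N)' come from the profile.

```python
def injury_flag_mismatches(rows):
    """Return rows whose Injury text begins with 'FATAL' but whose
    'Fatal (Y/N)' flag is anything other than 'Y' (case-insensitive)."""
    mismatches = []
    for row in rows:
        injury = (row.get("Injury") or "").strip().upper()
        flag = (row.get("Fatal (Y/N)") or "").strip().upper()
        if injury.startswith("FATAL") and flag != "Y":
            mismatches.append(row)
    return mismatches


# Hypothetical rows for illustration only; not taken from the file.
sample_rows = [
    {"Injury": "FATAL", "Fatal (Y/N)": "Y"},
    {"Injury": "FATAL", "Fatal (Y/N)": "N"},
    {"Injury": "Left foot bitten", "Fatal (Y/N)": "N"},
    {"Injury": "FATAL", "Fatal (Y/N)": None},
]
flagged = injury_flag_mismatches(sample_rows)
print(len(flagged))
```

The same function should work on rows read with csv.DictReader, since it only relies on dictionary lookups by column name. It checks one direction only, because a fatal incident may legitimately carry a more detailed injury description than the bare word 'FATAL'.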

Charts the summary said to look at first

Fatal (Y/N) · Look at the dominant 'N' slice versus fatal 'Y' and check for dirty values like 'M', 'F', and '2017' that indicate data entry errors.

Top values for Fatal (Y/N) (7 unique shown, of 7 total).
value	count	share
N	4439	68.7%
Y	1400	21.7%
UNKNOWN	71	1.1%
F	2	0.0%
M	1	0.0%
2017	1	0.0%
y	1	0.0%

Activity · Surfing (1,025) and swimming (932) together account for about 30% of all records; compare their counts with less frequently recorded activities such as diving and fishing.

Character-length distribution for Activity (mean: 16.21).
chars	count
1 – 7	2033
7 – 14	2150
14 – 20	468
20 – 26	376
26 – 33	228
33 – 39	177
39 – 45	134
45 – 52	84
52 – 58	42
58 – 64	51
64 – 71	35
71 – 77	25
77 – 83	20
83 – 90	10
90 – 96	9
96 – 102	14
102 – 109	10
109 – 115	6
115 – 121	5
121 – 128	1
128 – 134	1
134 – 140	7
140 – 146	4
146 – 153	4
153 – 159	3
159 – 165	0
165 – 172	2
172 – 178	1
178 – 184	0
184 – 191	2
191 – 197	3
197 – 203	0
203 – 210	0
210 – 216	1
216 – 222	0
222 – 229	0
229 – 235	2
235 – 241	1
241 – 248	0
248 – 254	1

Country · The USA, Australia, and South Africa together dominate the record count; note how sharply the frequency drops after the top three.

Top values for Country (20 unique shown, of 205 total).
value	count	share
USA	2310	35.7%
AUSTRALIA	1374	21.3%
SOUTH AFRICA	585	9.1%
NEW ZEALAND	135	2.1%
PAPUA NEW GUINEA	135	2.1%
BAHAMAS	115	1.8%
BRAZIL	113	1.7%
MEXICO	95	1.5%
ITALY	71	1.1%
PHILIPPINES	62	1.0%
FIJI	62	1.0%
REUNION	60	0.9%
NEW CALEDONIA	56	0.9%
CUBA	46	0.7%
SPAIN	44	0.7%
MOZAMBIQUE	44	0.7%
EGYPT	42	0.6%
INDIA	40	0.6%
JAPAN	34	0.5%
CROATIA	34	0.5%

Type · Unprovoked attacks make up nearly 73% of all incidents — compare against provoked, invalid, and sea disaster categories.

Top values for Type (12 unique shown, of 12 total).
value	count	share
Unprovoked	4716	73.0%
Provoked	593	9.2%
Invalid	552	8.5%
Sea Disaster	239	3.7%
Watercraft	142	2.2%
Boat	109	1.7%
Boating	92	1.4%
Questionable	10	0.2%
Unconfirmed	1	0.0%
Unverified	1	0.0%
Under investigation	1	0.0%
Boatomg	1	0.0%

Year · Most records cluster in the modern era, but look for the extreme outliers (including a year value of 3019) that skew the distribution.

Histogram bins for Year (median: 1980.0).
bin	count
0 – 75.47	126
75.47 – 150.9	1
150.9 – 226.4	0
226.4 – 301.9	0
301.9 – 377.4	0
377.4 – 452.8	0
452.8 – 528.3	1
528.3 – 603.8	0
603.8 – 679.3	0
679.3 – 754.8	0
754.8 – 830.2	0
830.2 – 905.7	0
905.7 – 981.2	0
981.2 – 1057	0
1057 – 1132	0
1132 – 1208	0
1208 – 1283	0
1283 – 1359	0
1359 – 1434	0
1434 – 1510	0
1510 – 1585	4
1585 – 1660	6
1660 – 1736	7
1736 – 1811	37
1811 – 1887	371
1887 – 1962	1975
1962 – 2038	3930
2038 – 2113	0
2113 – 2189	0
2189 – 2264	0
2264 – 2340	0
2340 – 2415	0
2415 – 2491	0
2491 – 2566	0
2566 – 2642	0
2642 – 2717	0
2717 – 2793	0
2793 – 2868	0
2868 – 2944	0
2944 – 3019	1

Schema

24 columns

Per-column summary.
column	type	null rate	unique	alerts
index	numeric	0.0%	6,462
Case Number	text	0.0%	6,442	near_unique one_word allcaps short_text
Date	text	0.0%	5,552	one_word
Year	numeric	0.0%	252	high_skew
Type	categorical	0.1%	12
Country	categorical	0.8%	205
Area	categorical	7.2%	810	long_tail
Location	text	8.4%	4,148	multilingual duplicates
Activity	text	8.5%	1,516	one_word duplicates
Name	text	3.3%	5,339
Unnamed: 9	categorical	99.6%	2	null_rate
Age	categorical	44.4%	154	null_rate
Injury	text	0.4%	3,738	multilingual allcaps duplicates
Fatal (Y/N)	categorical	8.5%	7
Time	categorical	52.5%	366	long_tail null_rate
Species	text	45.2%	1,466	multilingual null_rate duplicates
Investigator or Source	text	0.3%	4,979	multilingual duplicates
pdf	text	52.6%	3,054	near_unique one_word null_rate
href formula	text	52.6%	3,051	near_unique one_word url_heavy null_rate
href	text	52.6%	3,051	near_unique one_word url_heavy null_rate
Case Number.1	text	52.6%	3,054	near_unique one_word allcaps null_rate short_text
Case Number.2	text	52.6%	3,055	near_unique one_word allcaps null_rate short_text
original order	numeric	52.6%	3,061	null_rate
Unnamed: 23	categorical	100.0%	2	long_tail null_rate

index

numeric identifier

This column is a row index — a sequential integer identifier running from 0 to 6461 with 6462 unique values and no nulls, perfectly matching the row count. Its distribution is exactly uniform (skew = 0.0, kurtosis ≈ −1.2, mean = median = 3230.5, zero outliers), confirming it was generated as a positional index rather than carrying any domain meaning. The single 'zero' (zero_rate ≈ 0.00015) is simply row 0. There is no analytical signal here. Treatment: Drop before modelling; carries no predictive information. high · anthropic:default

n: 6,462
nulls: 0 (0.0%)
unique: 6,462
min: 0
max: 6,461
mean: 3230
median: 3230
std: 1866
q1: 1615
q3: 4846
iqr: 3230
skew: 0
kurtosis: -1.2
n_outliers: 0
outlier_rate: 0
zero_rate: 0.0001548

Case Number

text identifier near_unique one_word allcaps short_text

This column is a case identifier, with values that appear to encode dates in YYYY.MM.DD format (e.g., '2019.10.08'), suggesting case numbers tied to filing or incident dates. With 6,442 unique values out of 6,462 rows and a null rate of 0.0003, it is near-unique and functions as a primary key. The 18 duplicate values (duplicate_rate 0.0028) are a mild anomaly worth investigating — one value '2012.09.02.b' hints that suffixes are used to disambiguate same-date cases, implying the deduplication logic is not fully consistent. The allcaps_rate of 0.748 suggests a mix of formatting styles across records. Treatment: Use as a case-level join key; flag the 18 duplicates for deduplication before modelling. high · anthropic:default

n: 6,462
nulls: 2 (0.0%)
unique: 6,442
len_min: 6
len_max: 18
len_mean: 10.63
len_median: 10
len_p95: 12
word_mean: 1.001
word_median: 1
n_empty: 0
n_duplicates: 18
duplicate_rate: 0.002786
vocab_size: 6,445
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9992
allcaps_rate: 0.748
boilerplate_rate: 0
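
Flagging the repeated case numbers before using the column as a join key might look like the sketch below. The helper name duplicated_keys and the sample list are hypothetical; the sample only reuses the value formats quoted in the description.

```python
from collections import Counter


def duplicated_keys(case_numbers):
    """Return the set of case numbers that occur more than once.
    Nulls and blank strings are ignored."""
    cleaned = (c.strip() for c in case_numbers if c and c.strip())
    counts = Counter(cleaned)
    return {key for key, n in counts.items() if n > 1}


# Hypothetical sample; the repetition shown here is invented for illustration.
sample = ["2019.10.08", "2012.09.02.b", "2012.09.02.b", None, " 2019.10.08 "]
repeated = duplicated_keys(sample)
print(sorted(repeated))
```

Stripping whitespace before counting is a design choice: it treats keys that differ only by padding as the same case, which is usually what a join needs.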

Date

text metadata one_word

This column contains free-form incident dates stored in highly inconsistent formats: bare years (e.g., '1957', '1942'), structured dates ('05-Oct-2003'), and vague phrases ('Before 1958', 'No date', 'ca.', 'summer', 'late'). The top word 'reported' appears 559 times, which suggests many values follow a pattern like 'reported [date]' and is a red flag for downstream parsing. With a 14.1% duplicate rate (909 duplicates against 5,552 unique values) and a maximum length of 64 characters, this column cannot be safely cast to a datetime type without substantial normalization work. Treatment: Parse and normalize into structured date fields using regex rules per format pattern; mark values with no recoverable date ('No date') as null and flag qualified values ('ca.', 'reported …') as uncertain. high · anthropic:default

n: 6,462
nulls: 1 (0.0%)
unique: 5,552
len_min: 4
len_max: 64
len_mean: 11.43
len_median: 11
len_p95: 20
word_mean: 1.155
word_median: 1
n_empty: 0
n_duplicates: 909
duplicate_rate: 0.1407
vocab_size: 5,496
readability_flesch_mean: 89.93
emoji_rate: 0
url_rate: 0
one_word_rate: 0.8728
allcaps_rate: 0.05092
boilerplate_rate: 0
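
A regex-per-pattern parser for this column could start from the sketch below. It covers only the formats quoted in the description (day-month-year, bare year, and a few leading qualifiers); the function name parse_date, the qualifier list, and the three-part return value are assumptions made for illustration, and real data will need more rules.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}
DAY_MONTH_YEAR = re.compile(r"(\d{1,2})-([A-Za-z]{3})-(\d{4})")
BARE_YEAR = re.compile(r"\b(\d{4})\b")
# Leading words that mark a value as approximate or second-hand (assumed list).
QUALIFIER = re.compile(
    r"\s*(reported|before|after|ca\.?|circa|late|early|summer|no date)\b",
    re.IGNORECASE)


def parse_date(raw):
    """Return (value, precision, qualified) for one raw Date string.
    value is an ISO date, a 4-digit year, or None; precision is 'day',
    'year', or None; qualified is True when a leading qualifier is present."""
    if raw is None:
        return None, None, False
    qualified = QUALIFIER.match(raw) is not None
    match = DAY_MONTH_YEAR.search(raw)
    if match:
        day, month, year = match.groups()
        try:
            parsed = date(int(year), MONTHS[month.lower()], int(day))
            return parsed.isoformat(), "day", qualified
        except (KeyError, ValueError):
            pass  # unknown month or impossible day: fall back to the year
    match = BARE_YEAR.search(raw)
    if match:
        return match.group(1), "year", qualified
    return None, None, qualified


for text in ["05-Oct-2003", "Reported 05-Oct-2003", "Before 1958", "No date"]:
    print(text, "->", parse_date(text))
```

Returning the precision and the qualifier flag separately keeps uncertain values usable for year-level analysis instead of discarding them.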

Year

numeric feature high_skew

This column records the year of each incident for 6,462 records, with most values in a plausible modern range around a median of 1980 and an interquartile range of 1943–2006. Two signals stand out: a minimum of 0 (a zero rate of nearly 2%, almost certainly a placeholder for a missing year) and a maximum of 3019, which is a data-entry error (likely a typo for a year such as 2019). These outliers (266 records, ~4.1%) drive extreme negative skew (−6.55) and very high kurtosis (42.54), masking what is otherwise a fairly clean temporal distribution. Treatment: Null-code zeros and values outside a valid range (e.g., < 1800 or > 2100; the Year histogram shows 54 records between 1510 and 1811, so choose the lower bound deliberately), correct obvious typos like 3019, then use as an ordinal or numeric feature. high · anthropic:default

n: 6,462
nulls: 3 (0.0%)
unique: 252
min: 0
max: 3019
mean: 1930
median: 1980
std: 278.3
q1: 1943
q3: 2006
iqr: 63
skew: -6.554
kurtosis: 42.54
n_outliers: 266
outlier_rate: 0.04118
zero_rate: 0.01935
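
The Year treatment can be sketched as a small cleaning function. The default bounds are the example bounds from the description; the function name clean_year and the corrections argument are hypothetical, and any correction such as 3019 to 2019 is an assumption that should be confirmed against the Date column before it is applied.

```python
def clean_year(value, min_year=1800, max_year=2100, corrections=None):
    """Return a plausible incident year as an int, or None.
    Zeros, non-numeric values, and out-of-range years become None.
    corrections maps known typos to their intended year."""
    try:
        year = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None
    year = (corrections or {}).get(year, year)
    if min_year <= year <= max_year:
        return year
    return None


# The mapping below is an unverified guess used only to show the mechanism.
print(clean_year(0))                                  # placeholder zero
print(clean_year(1980))                               # ordinary year
print(clean_year(3019))                               # rejected as out of range
print(clean_year(3019, corrections={3019: 2019}))     # corrected, then accepted
```

Lowering min_year (for example to 1500) would keep the early historical records visible in the histogram, at the cost of admitting more suspicious values.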

Type

categorical label

This column classifies shark attack incidents by the nature of the encounter, with 12 distinct categories across 6,462 records. It is heavily dominated by 'Unprovoked' at 73% of all records, creating notable class imbalance. A data quality issue is the fragmentation of watercraft-related incidents across near-synonymous labels: 'Watercraft' (142), 'Boat' (109), 'Boating' (92), and the misspelling 'Boatomg' (1), which almost certainly denote the same category and should be consolidated. The 'Invalid' category (552 records, ~8.5%) also warrants attention, as it may represent records that should be excluded from incident analysis. Treatment: Consolidate 'Boat', 'Boating', 'Boatomg', and 'Watercraft' into a single category; consider excluding or flagging 'Invalid' records; one-hot encode for modelling, keeping the severe class imbalance in mind. high · anthropic:default

n: 6,462
nulls: 5 (0.1%)
unique: 12
top_value: Unprovoked
top_rate: 0.7304
cardinality: 12
entropy: 1.457
entropy_ratio: 0.4064
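
Consolidating the watercraft-related labels could look like the sketch below, which re-tallies the counts from the Type table. The target label 'Watercraft' and the helper name consolidate_type are choices made for this example, not something the report prescribes.

```python
from collections import Counter

# Labels treated as one category; 'boatomg' is the misspelling in the table.
WATERCRAFT_LABELS = {"watercraft", "boat", "boating", "boatomg"}


def consolidate_type(raw):
    """Map near-synonymous watercraft labels to 'Watercraft'."""
    if raw is None:
        return None
    label = raw.strip()
    return "Watercraft" if label.lower() in WATERCRAFT_LABELS else label


# Counts copied from the Type table in this report.
type_counts = {
    "Unprovoked": 4716, "Provoked": 593, "Invalid": 552, "Sea Disaster": 239,
    "Watercraft": 142, "Boat": 109, "Boating": 92, "Questionable": 10,
    "Unconfirmed": 1, "Unverified": 1, "Under investigation": 1, "Boatomg": 1,
}
merged = Counter()
for label, count in type_counts.items():
    merged[consolidate_type(label)] += count

print(merged["Watercraft"], len(merged))
```

After merging, the watercraft category holds 344 records and the column drops from 12 to 9 distinct labels.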

Country

categorical feature

This column records the country in which each incident occurred, with 205 distinct values across 6,462 rows. The distribution is heavily skewed: USA alone accounts for 36% of records (2,310), followed by AUSTRALIA at 21% (1,374) and SOUTH AFRICA at 9% (585), meaning these three countries together represent roughly two-thirds of the dataset. The entropy ratio of 0.51 confirms moderate concentration despite 205 unique values, and the near-zero null rate (0.79%) means coverage is excellent. Treatment: One-hot encode top countries and group tail countries into a residual 'OTHER' category before modelling. high · anthropic:default

n: 6,462
nulls: 51 (0.8%)
unique: 205
top_value: USA
top_rate: 0.3603
cardinality: 205
entropy: 3.909
entropy_ratio: 0.509
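
Grouping tail countries into a residual bucket, as the treatment suggests, might be sketched like this. The function name group_tail and the choice of top_n are hypothetical; the counts are the first five rows of the Country table, so the 'OTHER' total here covers only those rows, not the full 205 values.

```python
def group_tail(counts, top_n=3, other_label="OTHER"):
    """Keep the top_n most frequent keys and sum the rest under other_label."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    grouped = dict(ranked[:top_n])
    tail_total = sum(count for _, count in ranked[top_n:])
    if tail_total:
        grouped[other_label] = grouped.get(other_label, 0) + tail_total
    return grouped


# First five rows of the Country table in this report.
country_counts = {
    "USA": 2310, "AUSTRALIA": 1374, "SOUTH AFRICA": 585,
    "NEW ZEALAND": 135, "PAPUA NEW GUINEA": 135,
}
grouped = group_tail(country_counts, top_n=3)
print(grouped)
```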

Area

categorical feature long_tail

This column holds sub-national regions (states, provinces) from multiple countries: the US (Florida, Hawaii, California, South Carolina), Australia (New South Wales, Queensland, Western Australia), and South Africa (KwaZulu-Natal, Western Cape Province, Eastern Cape Province). Florida dominates at 17.9% of non-null values, while 810 unique values against 6,462 rows signal a severe long-tail distribution in which the vast majority of areas appear only rarely. The null rate is 7.16%. Treatment: Encode with frequency or target encoding; consider grouping rare areas (long-tail) into an 'Other' bucket or rolling up to country level to reduce cardinality from 810 classes. high · anthropic:default

n: 6,462
nulls: 463 (7.2%)
unique: 810
top_value: Florida
top_rate: 0.1794
cardinality: 810
entropy: 6.163
entropy_ratio: 0.6379

Location

text label multilingual duplicates

This column captures the location of each incident, predominantly as 'City, County' strings; the top value is 'New Smyrna Beach, Volusia County' (181 occurrences) and the dominant words are 'county', 'beach', 'island', and 'bay'. The duplicate rate is notably high at ~29.9% (1,769 duplicates among 5,917 non-null values), reflecting repeated incidents at the same hotspot locations rather than data error. The multilingual alert is triggered by automated language detection misclassifying short geographic names (e.g. 'Durban', 'Boa Viagem, Recife') as non-English: 3,746 values are detected as English, with the remainder split across 29 other 'languages' due to short-string ambiguity. The null rate is 8.43%, which may represent unknown or offshore incident locations. Treatment: Normalize to canonical 'City, County, Country' format, then use as a categorical grouping variable or geocode for spatial analysis. high · anthropic:default

n: 6,462
nulls: 545 (8.4%)
unique: 4,148
len_min: 3
len_max: 119
len_mean: 22.75
len_median: 21
len_p95: 47
word_mean: 3.54
word_median: 3
n_empty: 0
n_duplicates: 1,769
duplicate_rate: 0.299
vocab_size: 4,483
readability_flesch_mean: 53.4
emoji_rate: 0
url_rate: 0
one_word_rate: 0.1486
allcaps_rate: 0.000507
boilerplate_rate: 0

Activity

text label one_word duplicates

This column captures the water-based activity a person was engaged in at the time of the incident, dominated by Surfing (1,025) and Swimming (932) with a long tail of descriptive phrases. Despite being labelled 'text', it behaves largely as a categorical label: 62.9% of values are single words, only 1,516 unique values exist across 6,462 rows, and the duplicate rate is 74.3%, indicating a loosely controlled vocabulary rather than a strict enum. The median string length of 8 characters versus a maximum of 254 suggests a mix of clean category entries and occasional free-text annotations, which may require normalisation before use. Treatment: Standardise to a controlled vocabulary by clustering near-duplicates (e.g. 'Diving' vs 'Scuba diving'), then encode as a categorical feature; impute or flag the 8.54% nulls separately. high · anthropic:default

n: 6,462
nulls: 552 (8.5%)
unique: 1,516
len_min: 1
len_max: 254
len_mean: 16.21
len_median: 8
len_p95: 49
word_mean: 2.497
word_median: 1
n_empty: 0
n_duplicates: 4,394
duplicate_rate: 0.7435
vocab_size: 2,244
readability_flesch_mean: 39.56
emoji_rate: 0
url_rate: 0
one_word_rate: 0.6289
allcaps_rate: 0.0005076
boilerplate_rate: 0
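
As a simple stand-in for clustering near-duplicates, the sketch below buckets Activity values by keyword. This is keyword matching, not clustering; the rule list, the bucket names, and the function name normalise_activity are all assumptions for illustration, and rule order decides which bucket wins when a value matches several keywords.

```python
import re

# Ordered (keyword, bucket) rules; hypothetical and deliberately short.
RULES = [
    ("surf", "Surfing"),
    ("swim", "Swimming"),
    ("diving", "Diving"),
    ("fish", "Fishing"),
]


def normalise_activity(raw):
    """Map a free-text Activity value to a coarse bucket, or None if empty."""
    if raw is None or not raw.strip():
        return None
    text = re.sub(r"\s+", " ", raw.strip().lower())
    for keyword, bucket in RULES:
        if keyword in text:
            return bucket
    return "Other"


for value in ["Surfing", "  SCUBA   diving ", "Swimming near pier", "Standing"]:
    print(repr(value), "->", normalise_activity(value))
```

Values that fall into 'Other' are the ones worth reviewing by hand before extending the rule list.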

Name

text label

This column holds the victim's name, but it is heavily contaminated with non-name entries: the most frequent value is 'male' (579 occurrences), followed by 'female' (106), 'boy' (23), '2 males' (19), and 'boat' (14), with 'sailor' and 'a sailor' also among the top values, indicating that gender and role descriptors were freely mixed with actual proper names. The duplicate rate of 14.5% (908 duplicates against 5,339 unique values) and a 3.33% null rate further confirm this column is inconsistently populated and not a clean identifier. Treatment: Split into two derived columns, one for proper names and one for role/gender descriptors, before any modelling or grouping. high · anthropic:default

n: 6,462
nulls: 215 (3.3%)
unique: 5,339
len_min: 1
len_max: 221
len_mean: 14.83
len_median: 13
len_p95: 35
word_mean: 2.453
word_median: 2
n_empty: 0
n_duplicates: 908
duplicate_rate: 0.1453
vocab_size: 6,536
readability_flesch_mean: 51.73
emoji_rate: 0
url_rate: 0
one_word_rate: 0.1657
allcaps_rate: 0.006883
boilerplate_rate: 0

Unnamed: 9

categorical null_rate

n: 6,462
nulls: 6,434 (99.6%)
unique: 2
top_value: M
top_rate: 0.8571
cardinality: 2
entropy: 0.5917
entropy_ratio: 0.5917

Age

categorical feature null_rate

This column represents the victim's age, stored as a categorical (string) type rather than a numeric type, with 154 distinct values across 6,462 rows. The most striking issue is the 44.43% null rate, flagged as an alert: nearly half the records lack an age value. The distribution skews young: the top values cluster between ages 15 and 25, with '17' the most frequent at only 4.46% of non-null values, so recorded victims are concentrated among teenagers and young adults. The high entropy ratio of 0.80 confirms values are spread broadly across the 154 categories despite that concentration. Treatment: Cast to integer, coercing any non-numeric entries to null; impute or flag nulls (44.43% missing requires an explicit strategy); then treat as an ordinal or numeric feature. high · anthropic:default

n: 6,462
nulls: 2,871 (44.4%)
unique: 154
top_value: 17
top_rate: 0.04456
cardinality: 154
entropy: 5.827
entropy_ratio: 0.8018
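
Casting Age to an integer with coercion could be sketched as below. The profile does not list which non-numeric entries occur, so the rejected examples in the sample ('teen', '999') are hypothetical, as are the function name parse_age and the 0 to 120 plausibility range.

```python
import re

AGE_PATTERN = re.compile(r"\s*(\d{1,3})\s*")


def parse_age(raw, max_age=120):
    """Return the age as an int, or None when the value is missing,
    non-numeric, or outside the plausible range 0..max_age."""
    if raw is None:
        return None
    match = AGE_PATTERN.fullmatch(raw)
    if not match:
        return None
    age = int(match.group(1))
    return age if age <= max_age else None


for value in ["17", " 25 ", "teen", "999", None]:
    print(repr(value), "->", parse_age(value))
```

Keeping a separate boolean such as age_missing alongside the parsed value is one way to model the 44.43% missingness explicitly rather than hiding it behind an imputed number.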

Injury

text label multilingual allcaps duplicates

This column describes the injuries sustained in each incident as free text, ranging from 'FATAL' to specific anatomical bite locations (e.g., 'Left foot bitten', 'Leg bitten'). The dominant value is 'FATAL', appearing 823 times, by far the most frequent entry. Two signals stand out: a high duplicate rate of 41.9% (2,695 duplicates among 6,433 non-null values) driven by repetitive categorical-style phrases, and an all-caps rate of 13.1% suggesting inconsistent data entry conventions. Additionally, 496 entries are detected as German alongside 3,812 detected as English; as with the Location column, this may reflect the language detector misclassifying short strings rather than genuinely German text, so spot-check before acting on it. Treatment: Normalize case, map high-frequency values to a structured severity/outcome taxonomy, and verify the entries flagged as German (translating only if they are genuinely non-English) before any text embedding or categorical encoding. high · anthropic:default

n: 6,462
nulls: 29 (0.4%)
unique: 3,738
len_min: 5
len_max: 234
len_mean: 31.53
len_median: 25
len_p95: 82
word_mean: 5.414
word_median: 4
n_empty: 0
n_duplicates: 2,695
duplicate_rate: 0.4189
vocab_size: 2,550
readability_flesch_mean: 53.74
emoji_rate: 0
url_rate: 0
one_word_rate: 0.1489
allcaps_rate: 0.1307
boilerplate_rate: 0

Fatal (Y/N)

categorical label

This column is a binary fatality flag for incidents, expected to hold only 'Y' or 'N' values. The dominant value is 'N' (4,439 occurrences: 75.0% of non-null values, 68.7% of all rows), with 'Y' accounting for 1,400 cases (23.7% of non-null values, 21.7% of all rows). Data quality issues exist: 5 rows contain clearly erroneous values ('F' twice, 'M', '2017', and lowercase 'y'), suggesting data entry errors or row misalignment, and 71 rows are labeled 'UNKNOWN'. The 8.46% null rate adds further incompleteness. Treatment: Standardise 'y' → 'Y', investigate and recode/drop 'F', 'M', '2017' entries, decide on treatment of 'UNKNOWN' and nulls, then binarise (Y=1, N=0) for modelling. high · anthropic:default
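
The binarisation step can be sketched as follows, applied to the counts from the Fatal (Y/N) top-values table. The function name normalise_fatal is hypothetical, and mapping 'UNKNOWN', 'F', 'M', and '2017' to None is one possible policy, not the only one.

```python
from collections import Counter


def normalise_fatal(raw):
    """Map a raw flag to 1 (fatal), 0 (non-fatal), or None (unresolved)."""
    if raw is None:
        return None
    value = raw.strip().upper()
    if value == "Y":
        return 1
    if value == "N":
        return 0
    return None  # 'UNKNOWN', 'F', 'M', '2017', and anything else


# Counts copied from the Fatal (Y/N) table in this report.
table_counts = {"N": 4439, "Y": 1400, "UNKNOWN": 71,
                "F": 2, "M": 1, "2017": 1, "y": 1}
tally = Counter()
for value, count in table_counts.items():
    tally[normalise_fatal(value)] += count

print(tally[1], tally[0], tally[None])
```

Under this policy the lowercase 'y' joins the fatal group (1,401 rows) and the 75 unresolved rows stay null alongside the 547 originally missing values.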

n: 6,462
nulls: 547 (8.5%)
unique: 7
top_value: N
top_rate: 0.7505
cardinality: 7
entropy: 0.8897
entropy_ratio: 0.3169
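
The entropy figures above appear consistent with Shannon entropy in bits over the non-null values, with entropy_ratio being that entropy divided by log2 of the cardinality. That reading is an assumption about how the profiler computes them; the sketch below checks it against the counts in the Fatal (Y/N) table.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Non-null counts from the Fatal (Y/N) table: N, Y, UNKNOWN, F, M, 2017, y.
fatal_counts = [4439, 1400, 71, 2, 1, 1, 1]

entropy = entropy_bits(fatal_counts)
entropy_ratio = entropy / math.log2(len(fatal_counts))
top_rate = max(fatal_counts) / sum(fatal_counts)

print(round(entropy, 4), round(entropy_ratio, 4), round(top_rate, 4))
```

The same check also shows that top_rate is computed over non-null values (4,439 of 5,915), which is why it differs from the 68.7% share shown in the top-values table.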

Time

categorical feature long_tail null_rate

This column captures time-of-day information, but it is stored inconsistently: values mix coarse labels ('Afternoon', 'Morning') with specific clock times in 'HHhMM' format ('11h00', '16h30'), yielding 366 unique values across 6,462 rows. The null rate is severe at 52.49%, meaning over half of all records are missing a time entirely. The top value 'Afternoon' accounts for only 6.29% of non-null values, and the entropy ratio is 0.77, indicating a long tail of rarely seen time strings, likely from inconsistent data entry across sources or time periods. Treatment: Standardise to 24h numeric minutes-since-midnight, map label categories ('Morning', 'Afternoon') to representative values or a separate flag, then impute or model missingness explicitly given the 52.49% null rate. high · anthropic:default

n: 6,462
nulls: 3,392 (52.5%)
unique: 366
top_value: Afternoon
top_rate: 0.06287
cardinality: 366
entropy: 6.559
entropy_ratio: 0.7702
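
Converting Time to minutes since midnight might be sketched as below. The representative minutes chosen for 'Morning' and 'Afternoon' are arbitrary placeholders, and the function name time_to_minutes is hypothetical; only the 'HHhMM' pattern and the two labels come from the description.

```python
import re

CLOCK = re.compile(r"^(\d{1,2})h(\d{2})$")

# Placeholder representative times for coarse labels; adjust to the analysis.
LABEL_MINUTES = {"morning": 9 * 60, "afternoon": 15 * 60}


def time_to_minutes(raw):
    """Return (minutes_since_midnight, is_label) for one raw Time value.
    is_label is True when the value came from a coarse label."""
    if raw is None:
        return None, False
    text = raw.strip().lower()
    if text in LABEL_MINUTES:
        return LABEL_MINUTES[text], True
    match = CLOCK.match(text)
    if match:
        hours, minutes = int(match.group(1)), int(match.group(2))
        if hours < 24 and minutes < 60:
            return hours * 60 + minutes, False
    return None, False


for value in ["11h00", "16h30", "Afternoon", "25h00", None]:
    print(repr(value), "->", time_to_minutes(value))
```

Returning the is_label flag keeps exact clock times distinguishable from label-derived estimates in later modelling.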

Species

text label multilingual null_rate duplicates

This column records shark species (and incident validity notes), with values ranging from specific species ('White shark', 'Tiger shark', 'Bull shark') to free-text qualifiers like 'Shark involvement not confirmed' and 'Invalid'. The null rate is severe at 45.25%, and 58.56% of non-null values are duplicates, which is expected for a species label with only 1,466 unique values. The multilingual alert needs care: 2,582 values are classified as English, while 14 are detected as German, 18 as Finnish, 11 as Chinese, and 8 as Turkish among others; given the short strings involved, this more likely reflects language-detector misclassification than genuinely non-English entries, and is worth a spot-check. The mix of species names, size descriptions ("4' shark", "1.8 m [6'] shark"), and incident-status phrases means this column is semantically heterogeneous and will require parsing or splitting before use. Treatment: Split into a normalized species category and a separate incident-validity flag; impute or exclude the 45.25% nulls based on task context. high · anthropic:default

n: 6,462
nulls: 2,924 (45.2%)
unique: 1,466
len_min: 3
len_max: 194
len_mean: 22.95
len_median: 17
len_p95: 50
word_mean: 4.445
word_median: 4
n_empty: 0
n_duplicates: 2,072
duplicate_rate: 0.5856
vocab_size: 1,105
readability_flesch_mean: 88.63
emoji_rate: 0
url_rate: 0
one_word_rate: 0.04098
allcaps_rate: 0.0002826
boilerplate_rate: 0
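
Splitting Species into a species label and a validity flag could start from the sketch below. The species list contains only the three names quoted in the description, and the function name split_species and the output keys are hypothetical; a real mapping would need a fuller species vocabulary.

```python
# Species names quoted in the description above; extend as needed.
KNOWN_SPECIES = ("white shark", "tiger shark", "bull shark")


def split_species(raw):
    """Return {'species': str or None, 'confirmed': bool or None}.
    confirmed is False for validity notes, None when the value is missing."""
    if raw is None or not raw.strip():
        return {"species": None, "confirmed": None}
    text = raw.strip().lower()
    if "not confirmed" in text or text == "invalid":
        return {"species": None, "confirmed": False}
    for name in KNOWN_SPECIES:
        if name in text:
            return {"species": name.capitalize(), "confirmed": True}
    # Size-only descriptions and unrecognised species end up here.
    return {"species": None, "confirmed": True}


for value in ["White shark", "1.8 m [6'] shark",
              "Shark involvement not confirmed", "Invalid"]:
    print(repr(value), "->", split_species(value))
```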

Investigator or Source

text metadata multilingual duplicates

This column records the investigator or data source credited for each shark attack incident, typically formatted as an abbreviated name plus an organizational affiliation (e.g., 'C. Moore, GSAF'). GSAF (Global Shark Attack File) dominates the top entries and appears in 983 word tokens, making it the primary contributing organization. The duplicate rate of 22.7% (1,464 duplicates among 6,443 non-null values) is expected given a finite set of investigators filing multiple reports, but the multilingual alert across 22 detected languages is notable: 'en' accounts for 4,457 entries, while 'es' (134), 'fr' (100), 'it' (62), and 'de' (56) reflect international source attribution or non-English investigator names. The high cardinality (4,979 unique values out of 6,462 rows) suggests many entries are one-off source combinations rather than standardized identifiers. Treatment: Normalize organization affiliations (e.g., extract 'GSAF' tag) and standardize investigator name formats before using as a categorical grouping variable. high · anthropic:default

n: 6,462
nulls: 19 (0.3%)
unique: 4,979
len_min: 3
len_max: 210
len_mean: 32.24
len_median: 26
len_p95: 77
word_mean: 4.792
word_median: 3
n_empty: 0
n_duplicates: 1,464
duplicate_rate: 0.2272
vocab_size: 7,898
readability_flesch_mean: 73.62
emoji_rate: 0
url_rate: 0.002328
one_word_rate: 0.02592
allcaps_rate: 0.01847
boilerplate_rate: 0

pdf

text foreign_key near_unique one_word null_rate

This column contains PDF filenames or partial file paths, evidenced by the '.pdf' suffixes in the top words and a mean token count of ~1 word per value. Over half the rows (52.55%) are null, and with 3,054 unique values among 3,066 non-null rows the column is near-unique, functioning more like a document reference key than a descriptive field. The extremely negative Flesch readability score (−66.81) is consistent with structured filename strings rather than natural language. A small number of duplicates (12) suggest some documents are referenced by multiple records. Treatment: Use as a document reference key to join or retrieve associated PDF files; flag nulls before any join rather than imputing them. high · anthropic:default

n: 6,462
nulls: 3,396 (52.6%)
unique: 3,054
len_min: 10
len_max: 41
len_mean: 23.73
len_median: 23
len_p95: 31
word_mean: 1.022
word_median: 1
n_empty: 0
n_duplicates: 12
duplicate_rate: 0.003914
vocab_size: 3,098
readability_flesch_mean: -66.81
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9866
allcaps_rate: 0.0003262
boilerplate_rate: 0

href formula

text metadata near_unique one_word url_heavy null_rate

This column contains hyperlink formulas pointing to PDF source documents on sharkattackfile.net, each URL referencing a dated incident report (e.g., '1935.06.05.r-solomonislands.pdf'). Over half the rows (52.62%) are null, meaning many records lack a linked source document. Values are nearly all single-token URLs (one_word_rate 0.9879, url_rate 0.9997), and the extremely negative Flesch readability score (-820.18) confirms these are machine-generated URL strings, not natural text. Only 11 duplicate values exist across 3,051 unique entries, suggesting most cited PDFs are distinct incident references. Treatment: Extract raw URL string from formula syntax before use; treat as a source-citation reference field and consider joining or flagging unsourced rows (52.62% null) separately. high · anthropic:default

n: 6,462
nulls: 3,400 (52.6%)
unique: 3,051
len_min: 64
len_max: 95
len_mean: 77.73
len_median: 77
len_p95: 85
word_mean: 1.019
word_median: 1
n_empty: 0
n_duplicates: 11
duplicate_rate: 0.003592
vocab_size: 3,089
readability_flesch_mean: -820.2
emoji_rate: 0
url_rate: 0.9997
one_word_rate: 0.9879
allcaps_rate: 0
boilerplate_rate: 0
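
Extracting the raw URL from a formula string might be sketched as below. The profile does not show the formula syntax, so the spreadsheet-style HYPERLINK wrapper, the URL scheme, and the file name in the sample are assumptions made for illustration; only the sharkattackfile.net directory path comes from the column descriptions. The function name extract_url is hypothetical.

```python
import re

# Match from the scheme up to the first whitespace, quote, or closing bracket.
URL_PATTERN = re.compile(r'''https?://[^\s"')]+''')


def extract_url(formula):
    """Return the first URL found in a formula string, or None."""
    if not formula:
        return None
    match = URL_PATTERN.search(formula)
    return match.group(0) if match else None


# Hypothetical formula text; the real cell syntax may differ.
sample = ('=HYPERLINK("http://sharkattackfile.net/spreadsheets/'
          'pdf_directory/example.pdf","example.pdf")')
print(extract_url(sample))
```

Because the pattern only looks for the URL itself, it should also pass through values that are already bare URLs, which makes it safe to apply to the href column as well.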

href

text metadata near_unique one_word url_heavy null_rate

This column contains URLs linking to PDF source documents in a shark attack file directory (sharkattackfile.net/spreadsheets/pdf_directory/), serving as citation or evidence links for individual incident records. Over half the rows are null (52.62%), indicating many records lack a linked source document. Nearly all non-null values are single-token URLs (one_word_rate 0.988, url_rate 0.9997), with very few duplicates (11 duplicates out of 3,051 unique values), consistent with per-incident citation links. The high null rate is the key analyst concern — roughly half of incidents have no associated PDF reference. Treatment: Exclude from predictive modelling; retain as a provenance/citation field, or engineer a binary 'has_source_pdf' indicator from non-null presence. high · anthropic:default

n: 6,462
nulls: 3,400 (52.6%)
unique: 3,051
len_min: 64
len_max: 135
len_mean: 77.89
len_median: 77
len_p95: 86
word_mean: 1.02
word_median: 1
n_empty: 0
n_duplicates: 11
duplicate_rate: 0.003592
vocab_size: 3,091
readability_flesch_mean: -824.4
emoji_rate: 0
url_rate: 0.9997
one_word_rate: 0.9879
allcaps_rate: 0
boilerplate_rate: 0

Case Number.1

text identifier near_unique one_word allcaps null_rate short_text

This column appears to be a duplicate or alternate version of the case number field, with values formatted as date-like codes (e.g., '1966.12.26', '1923.00.00.a') that tie case identifiers to dates, with alphabetic suffixes for disambiguation. A 52.62% null rate limits its usefulness to roughly half the records, and the near-unique flag (3,054 unique values among 3,062 non-null rows) combined with only 8 duplicates confirms it functions as a quasi-identifier. The allcaps rate of 78.97% is notable given that values are alphanumeric codes rather than natural language, and the '.1' suffix in the column name strongly suggests this is a duplicated column from a merge or pivot operation. Treatment: Investigate overlap with the original 'Case Number' column; if redundant, drop; otherwise keep it as a secondary join key and leave nulls unfilled. medium · anthropic:default

n: 6,462
nulls: 3,400 (52.6%)
unique: 3,054
len_min: 7
len_max: 18
len_mean: 10.59
len_median: 10
len_p95: 12
word_mean: 1.002
word_median: 1
n_empty: 0
n_duplicates: 8
duplicate_rate: 0.002613
vocab_size: 3,057
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9984
allcaps_rate: 0.7897
boilerplate_rate: 0

Case Number.2

text identifier near_unique one_word allcaps null_rate short_text

This column appears to be a third copy of the case number identifier, using the same date-based reference scheme (e.g., '1966.12.26', '1915.07.06.a.r'). With a 52.62% null rate across 6,462 rows and 3,055 unique values among 3,062 non-null rows (vocabulary size 3,058), the column is near-unique but severely incomplete. The 78.97% all-caps rate combined with date-like tokens and alphabetic suffixes (a, b, r) suggests a custom alphanumeric coding scheme rather than free text. Only 7 duplicate values exist, making this effectively an identifier where present. Treatment: Retain as a join/lookup key; flag nulls rather than imputing them; do not encode numerically. high · anthropic:default

n: 6,462
nulls: 3,400 (52.6%)
unique: 3,055
len_min: 7
len_max: 18
len_mean: 10.59
len_median: 10
len_p95: 12
word_mean: 1.002
word_median: 1
n_empty: 0
n_duplicates: 7
duplicate_rate: 0.002286
vocab_size: 3,058
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9984
allcaps_rate: 0.7897
boilerplate_rate: 0

original order

numeric metadata null_rate

This column appears to be a positional or sequence index assigned to records, likely reflecting the original sort order of items in a source dataset. The most striking issue is a 52.62% null rate: over half the rows carry no value, which is highly anomalous for an ordering field and suggests either a join that left many rows unmatched or that ordering was only recorded for a subset of records. With 3,061 unique values among 3,062 non-null rows, exactly one value is repeated, which is enough to rule the column out as a strict sequence key. The distribution is mildly right-skewed (skew 0.99) with notable leptokurtosis (3.55) and 27 outliers, hinting at a few unusually large order values relative to the bulk. Treatment: Investigate the source of the 52.62% nulls before use; if retaining, treat as an optional sort key and do not use as a unique identifier without resolving the duplicate. medium · anthropic:default

n: 6,462
nulls: 3,400 (52.6%)
unique: 3,061
min: 3
max: 6,502
mean: 1564
median: 1534
std: 988.4
q1: 768.2
q3: 2299
iqr: 1530
skew: 0.9878
kurtosis: 3.551
n_outliers: 27
outlier_rate: 0.008818
zero_rate: 0

Unnamed: 23

categorical long_tail null_rate

n: 6,462
nulls: 6,460 (100.0%)
unique: 2
top_value: Teramo
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1