Data Trove: Global Shark Attack File (GSAF)
Reading
This dataset is the Global Shark Attack File (GSAF), containing 6,462 records of shark attack incidents spanning centuries of documented cases. The most important thing to examine first is the attack outcome: 4,439 incidents are recorded as non-fatal ('N'), which is 68.7% of all rows and about 75% of rows with a recorded outcome, while 1,400 are recorded as fatal ('Y'). The 'Injury' column has 823 entries marked simply 'FATAL', which is worth cross-checking against the fatality flag for consistency. A second priority is the geographic and activity breakdown: the USA dominates with 2,310 cases (36%), Florida alone accounts for 1,076, and surfing (1,025) and swimming (932) are by far the most frequently recorded activities. These counts show how often each activity appears in the file, not the risk per participant. The 'Year' column carries a data quality warning: a maximum value of 3019 and high kurtosis signal outliers that should be cleaned before any time-series analysis.
citing: Fatal (Y/N).top_values · Fatal (Y/N).stats.top_rate · Injury.top_values · Country.top_values · Area.top_values · Activity.top_values · Year.stats.max · Year.stats.kurtosis · Year.stats.n_outliers · row_count
Charts the summary said to look at first
Fatal (Y/N): top values
| value | count | share |
|---|---|---|
| N | 4439 | 68.7% |
| Y | 1400 | 21.7% |
| UNKNOWN | 71 | 1.1% |
| F | 2 | 0.0% |
| M | 1 | 0.0% |
| 2017 | 1 | 0.0% |
| y | 1 | 0.0% |
Activity: value length in characters
| chars | count |
|---|---|
| 1 – 7 | 2033 |
| 7 – 14 | 2150 |
| 14 – 20 | 468 |
| 20 – 26 | 376 |
| 26 – 33 | 228 |
| 33 – 39 | 177 |
| 39 – 45 | 134 |
| 45 – 52 | 84 |
| 52 – 58 | 42 |
| 58 – 64 | 51 |
| 64 – 71 | 35 |
| 71 – 77 | 25 |
| 77 – 83 | 20 |
| 83 – 90 | 10 |
| 90 – 96 | 9 |
| 96 – 102 | 14 |
| 102 – 109 | 10 |
| 109 – 115 | 6 |
| 115 – 121 | 5 |
| 121 – 128 | 1 |
| 128 – 134 | 1 |
| 134 – 140 | 7 |
| 140 – 146 | 4 |
| 146 – 153 | 4 |
| 153 – 159 | 3 |
| 159 – 165 | 0 |
| 165 – 172 | 2 |
| 172 – 178 | 1 |
| 178 – 184 | 0 |
| 184 – 191 | 2 |
| 191 – 197 | 3 |
| 197 – 203 | 0 |
| 203 – 210 | 0 |
| 210 – 216 | 1 |
| 216 – 222 | 0 |
| 222 – 229 | 0 |
| 229 – 235 | 2 |
| 235 – 241 | 1 |
| 241 – 248 | 0 |
| 248 – 254 | 1 |
Country: top values
| value | count | share |
|---|---|---|
| USA | 2310 | 35.7% |
| AUSTRALIA | 1374 | 21.3% |
| SOUTH AFRICA | 585 | 9.1% |
| NEW ZEALAND | 135 | 2.1% |
| PAPUA NEW GUINEA | 135 | 2.1% |
| BAHAMAS | 115 | 1.8% |
| BRAZIL | 113 | 1.7% |
| MEXICO | 95 | 1.5% |
| ITALY | 71 | 1.1% |
| PHILIPPINES | 62 | 1.0% |
| FIJI | 62 | 1.0% |
| REUNION | 60 | 0.9% |
| NEW CALEDONIA | 56 | 0.9% |
| CUBA | 46 | 0.7% |
| SPAIN | 44 | 0.7% |
| MOZAMBIQUE | 44 | 0.7% |
| EGYPT | 42 | 0.6% |
| INDIA | 40 | 0.6% |
| JAPAN | 34 | 0.5% |
| CROATIA | 34 | 0.5% |
Type: top values
| value | count | share |
|---|---|---|
| Unprovoked | 4716 | 73.0% |
| Provoked | 593 | 9.2% |
| Invalid | 552 | 8.5% |
| Sea Disaster | 239 | 3.7% |
| Watercraft | 142 | 2.2% |
| Boat | 109 | 1.7% |
| Boating | 92 | 1.4% |
| Questionable | 10 | 0.2% |
| Unconfirmed | 1 | 0.0% |
| Unverified | 1 | 0.0% |
| Under investigation | 1 | 0.0% |
| Boatomg | 1 | 0.0% |
Year: histogram
| bin | count |
|---|---|
| 0 – 75.47 | 126 |
| 75.47 – 150.9 | 1 |
| 150.9 – 226.4 | 0 |
| 226.4 – 301.9 | 0 |
| 301.9 – 377.4 | 0 |
| 377.4 – 452.8 | 0 |
| 452.8 – 528.3 | 1 |
| 528.3 – 603.8 | 0 |
| 603.8 – 679.3 | 0 |
| 679.3 – 754.8 | 0 |
| 754.8 – 830.2 | 0 |
| 830.2 – 905.7 | 0 |
| 905.7 – 981.2 | 0 |
| 981.2 – 1057 | 0 |
| 1057 – 1132 | 0 |
| 1132 – 1208 | 0 |
| 1208 – 1283 | 0 |
| 1283 – 1359 | 0 |
| 1359 – 1434 | 0 |
| 1434 – 1510 | 0 |
| 1510 – 1585 | 4 |
| 1585 – 1660 | 6 |
| 1660 – 1736 | 7 |
| 1736 – 1811 | 37 |
| 1811 – 1887 | 371 |
| 1887 – 1962 | 1975 |
| 1962 – 2038 | 3930 |
| 2038 – 2113 | 0 |
| 2113 – 2189 | 0 |
| 2189 – 2264 | 0 |
| 2264 – 2340 | 0 |
| 2340 – 2415 | 0 |
| 2415 – 2491 | 0 |
| 2491 – 2566 | 0 |
| 2566 – 2642 | 0 |
| 2642 – 2717 | 0 |
| 2717 – 2793 | 0 |
| 2793 – 2868 | 0 |
| 2868 – 2944 | 0 |
| 2944 – 3019 | 1 |
Schema
24 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| index | numeric | 0.0% | 6,462 | |
| Case Number | text | 0.0% | 6,442 | near_unique, one_word, allcaps, short_text |
| Date | text | 0.0% | 5,552 | one_word |
| Year | numeric | 0.0% | 252 | high_skew |
| Type | categorical | 0.1% | 12 | |
| Country | categorical | 0.8% | 205 | |
| Area | categorical | 7.2% | 810 | long_tail |
| Location | text | 8.4% | 4,148 | multilingual, duplicates |
| Activity | text | 8.5% | 1,516 | one_word, duplicates |
| Name | text | 3.3% | 5,339 | |
| Unnamed: 9 | categorical | 99.6% | 2 | null_rate |
| Age | categorical | 44.4% | 154 | null_rate |
| Injury | text | 0.4% | 3,738 | multilingual, allcaps, duplicates |
| Fatal (Y/N) | categorical | 8.5% | 7 | |
| Time | categorical | 52.5% | 366 | long_tail, null_rate |
| Species | text | 45.2% | 1,466 | multilingual, null_rate, duplicates |
| Investigator or Source | text | 0.3% | 4,979 | multilingual, duplicates |
| | text | 52.6% | 3,054 | near_unique, one_word, null_rate |
| href formula | text | 52.6% | 3,051 | near_unique, one_word, url_heavy, null_rate |
| href | text | 52.6% | 3,051 | near_unique, one_word, url_heavy, null_rate |
| Case Number.1 | text | 52.6% | 3,054 | near_unique, one_word, allcaps, null_rate, short_text |
| Case Number.2 | text | 52.6% | 3,055 | near_unique, one_word, allcaps, null_rate, short_text |
| original order | numeric | 52.6% | 3,061 | null_rate |
| Unnamed: 23 | categorical | 100.0% | 2 | long_tail, null_rate |
index
numeric identifier
This column is a row index: a sequential integer identifier running from 0 to 6461, with 6,462 unique values and no nulls, matching the row count exactly. Its distribution is exactly uniform (skew = 0.0, kurtosis ≈ −1.2, mean = median = 3230.5, zero outliers), confirming it was generated as a positional index rather than carrying any domain meaning. The single zero (zero_rate ≈ 0.00015) is simply row 0. There is no analytical signal here.
Treatment: Drop before modelling; carries no predictive information.
- n
- 6,462
- nulls
- 0 (0.0%)
- unique
- 6,462
- min
- 0
- max
- 6,461
- mean
- 3230
- median
- 3230
- std
- 1866
- q1
- 1615
- q3
- 4846
- iqr
- 3230
- skew
- 0
- kurtosis
- -1.2
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.0001548
Case Number
text identifier near_unique one_word allcaps short_text
This column is a case identifier whose values appear to encode dates in YYYY.MM.DD format (e.g., '2019.10.08'), suggesting case numbers tied to filing or incident dates. With 6,442 unique values across 6,462 rows and only 2 nulls (null rate 0.0003), it is near-unique and functions as a primary key. The 18 duplicates (duplicate_rate 0.0028) are a mild anomaly worth investigating: a value such as '2012.09.02.b' shows that letter suffixes are used to disambiguate same-date cases, so any repeated key suggests the scheme was not applied consistently. The allcaps_rate of 0.748 suggests a mix of formatting styles across records.
Treatment: Use as a case-level join key; flag the 18 duplicates for deduplication before modelling.
- n
- 6,462
- nulls
- 2 (0.0%)
- unique
- 6,442
- len_min
- 6
- len_max
- 18
- len_mean
- 10.63
- len_median
- 10
- len_p95
- 12
- word_mean
- 1.001
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 18
- duplicate_rate
- 0.002786
- vocab_size
- 6,445
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9992
- allcaps_rate
- 0.748
- boilerplate_rate
- 0
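The duplicate check recommended for this key can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming the column has been loaded as a list of strings with `None` for missing values; `duplicated_keys` and the sample values are hypothetical and not part of the profile.

```python
from collections import Counter


def duplicated_keys(case_numbers):
    """Return {key: count} for keys that occur more than once.

    Missing values are skipped and surrounding whitespace is ignored.
    """
    cleaned = [c.strip() for c in case_numbers if c is not None and c.strip()]
    counts = Counter(cleaned)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative values only; the repeated key is made up for the example.
sample = ["2019.10.08", "2012.09.02.b", "2012.09.02.b ", None, "1966.12.26"]
duplicates = duplicated_keys(sample)
print(duplicates)
```

Reviewing the returned keys by hand before dropping anything is the safer choice, since a repeated key may be two distinct incidents that were never given a suffix.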
Date
text metadata one_word
This column contains free-form date annotations for each incident, stored in highly inconsistent formats: bare years (e.g., '1957', '1942'), structured dates ('05-Oct-2003'), and vague phrases ('Before 1958', 'No date', 'ca.', 'summer', 'late'). The top word 'reported' appears 559 times, which suggests many values follow a pattern like 'reported [date]', a red flag for downstream parsing because the date is then a report date, not necessarily the incident date. With a 14.1% duplicate rate (909 duplicates alongside 5,552 unique values) and a maximum length of 64 characters, this column cannot be safely cast to a datetime type without substantial normalization work.
Treatment: Parse and normalize into structured date fields using regex rules per format pattern; flag unparseable values ('No date', 'ca.', 'reported …') as null or uncertain.
- n
- 6,462
- nulls
- 1 (0.0%)
- unique
- 5,552
- len_min
- 4
- len_max
- 64
- len_mean
- 11.43
- len_median
- 11
- len_p95
- 20
- word_mean
- 1.155
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 909
- duplicate_rate
- 0.1407
- vocab_size
- 5,496
- readability_flesch_mean
- 89.93
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.8728
- allcaps_rate
- 0.05092
- boilerplate_rate
- 0
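One possible shape for the per-pattern parsing described for this column is sketched below. It is a rough illustration under stated assumptions: only the formats quoted in the description are handled, the `reported` prefix is treated as a marker of uncertainty, and the function name `parse_date` and its return layout are hypothetical.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}
REPORTED = re.compile(r"^reported\s+", re.IGNORECASE)
DAY_MON_YEAR = re.compile(r"^(\d{1,2})-([A-Za-z]{3})-(\d{4})$")
ANY_YEAR = re.compile(r"\b(\d{4})\b")


def parse_date(raw):
    """Return (iso_date, year, uncertain) for one raw Date string."""
    if raw is None or not raw.strip():
        return None, None, True
    text = raw.strip()
    uncertain = bool(REPORTED.match(text))  # a report date, not an incident date
    text = REPORTED.sub("", text)
    match = DAY_MON_YEAR.match(text)
    if match and match.group(2).lower() in MONTHS:
        day = int(match.group(1))
        month = MONTHS[match.group(2).lower()]
        year = int(match.group(3))
        try:
            return date(year, month, day).isoformat(), year, uncertain
        except ValueError:  # impossible calendar date such as 31-Feb
            return None, year, True
    if re.fullmatch(r"\d{4}", text):  # bare year
        return None, int(text), uncertain
    found = ANY_YEAR.search(text)  # vague phrase: keep any year, mark uncertain
    return None, (int(found.group(1)) if found else None), True


for value in ["05-Oct-2003", "Reported 05-Oct-2003", "1957", "Before 1958", "No date"]:
    print(value, "->", parse_date(value))
```

Keeping the uncertainty flag as a separate field lets later analysis decide whether approximate dates are acceptable, instead of silently discarding them.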
Year
numeric feature high_skew
This column records the year of each incident for 6,462 records, with most values in a plausible modern range around a median of 1980 and an interquartile range of 1943 to 2006. Two signals are highly surprising: a minimum of 0.0 (a zero rate of nearly 2%, almost certainly sentinel or missing-year placeholders) and a maximum of 3019.0, which is a data-entry error (likely a typo for a year such as 2019). The flagged outliers (266 records, ~4.1%) drive extreme negative skew (−6.55) and extraordinary kurtosis (42.54), masking what is otherwise a fairly clean temporal distribution.
Treatment: Null-code zeros and values outside a valid range, correct obvious typos like 3019, then use as an ordinal or numeric feature. Choose the lower bound deliberately: the histogram shows 54 records between roughly 1510 and 1811, so a floor such as 1800 would discard documented historical cases.
- n
- 6,462
- nulls
- 3 (0.0%)
- unique
- 252
- min
- 0
- max
- 3019
- mean
- 1930
- median
- 1980
- std
- 278.3
- q1
- 1943
- q3
- 2006
- iqr
- 63
- skew
- -6.554
- kurtosis
- 42.54
- n_outliers
- 266
- outlier_rate
- 0.04118
- zero_rate
- 0.01935
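The cleaning rule for this column can be sketched as follows. The profile does not state how `n_outliers` is computed, so the 1.5 × IQR (Tukey) fences here are an assumption used only to show where the reported quartiles would place the cutoffs; the `lo` and `hi` bounds in the hypothetical `clean_year` helper are likewise illustrative, not recommended values.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean_year(value, lo=1500, hi=2100):
    """Map sentinel zeros and implausible years to None.

    lo and hi are illustrative bounds (assumptions), not values from the profile.
    """
    if value is None:
        return None
    year = int(value)
    if year == 0 or year < lo or year > hi:
        return None
    return year


# Quartiles as reported in the stats for this column.
low, high = tukey_fences(1943, 2006)
print(low, high)  # 1848.5 2100.5
print([clean_year(v) for v in [0, 1642, 1980.0, 3019, None]])
```

Values such as 3019 are set to null here rather than rewritten to 2019, because correcting a typo requires checking the matching Date or Case Number for that row.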
Type
categorical label
This column classifies shark attack incidents by the nature of the encounter, with 12 distinct categories across 6,462 records. It is heavily dominated by 'Unprovoked' at 73% of all records, creating notable class imbalance. A surprising data quality issue is the fragmentation of watercraft-related incidents across near-synonymous labels: 'Watercraft' (142), 'Boat' (109), 'Boating' (92), and a single 'Boatomg', which looks like a misspelling of 'Boating'. These almost certainly describe the same category and should be consolidated. The 'Invalid' category (552 records, ~8.5%) also warrants attention, as it may represent records that should be excluded from incident analysis.
Treatment: Consolidate 'Boat', 'Boating', 'Boatomg', and 'Watercraft' into a single category; consider excluding or flagging 'Invalid' records; one-hot encode for modelling, keeping the severe class imbalance in mind.
- n
- 6,462
- nulls
- 5 (0.1%)
- unique
- 12
- top_value
- Unprovoked
- top_rate
- 0.7304
- cardinality
- 12
- entropy
- 1.457
- entropy_ratio
- 0.4064
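The consolidation of watercraft labels can be illustrated with the counts from the Type table. This is a small sketch: the alias map and the `normalise_type` helper are hypothetical, and choosing 'Watercraft' as the canonical label is an assumption.

```python
from collections import Counter

# Hypothetical alias map: lower-cased raw label -> canonical label.
TYPE_ALIASES = {
    "watercraft": "Watercraft",
    "boat": "Watercraft",
    "boating": "Watercraft",
    "boatomg": "Watercraft",  # apparent misspelling of 'Boating'
}


def normalise_type(raw):
    """Collapse watercraft-related labels; leave every other label unchanged."""
    if raw is None:
        return None
    label = raw.strip()
    return TYPE_ALIASES.get(label.lower(), label)


# Counts copied from the Type table in this report.
type_counts = {
    "Unprovoked": 4716, "Provoked": 593, "Invalid": 552, "Sea Disaster": 239,
    "Watercraft": 142, "Boat": 109, "Boating": 92, "Questionable": 10,
    "Unconfirmed": 1, "Unverified": 1, "Under investigation": 1, "Boatomg": 1,
}
merged = Counter()
for label, n in type_counts.items():
    merged[normalise_type(label)] += n

print(merged["Watercraft"])  # 344
print(len(merged))           # 9
```

After merging, the watercraft category holds 344 records and the column drops from 12 labels to 9.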
Country
categorical feature
This column records the country where each incident occurred, with 205 distinct values across 6,462 rows. The distribution is heavily skewed: USA alone accounts for 36% of records (2,310), followed by AUSTRALIA at 21% (1,374) and SOUTH AFRICA at 9% (585), so these three countries together represent roughly two-thirds of the dataset. The entropy ratio of 0.51 confirms moderate concentration despite 205 unique values, and the low null rate (0.79%) means coverage is excellent.
Treatment: One-hot encode top countries and group tail countries into a residual 'OTHER' category before modelling.
- n
- 6,462
- nulls
- 51 (0.8%)
- unique
- 205
- top_value
- USA
- top_rate
- 0.3603
- cardinality
- 205
- entropy
- 3.909
- entropy_ratio
- 0.509
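The entropy figures in these stat blocks can be reproduced with a short calculation. The sketch below assumes that `entropy` is Shannon entropy in bits and that `entropy_ratio` is entropy divided by log2 of the cardinality; the profile does not state these definitions, but they agree with the reported numbers. The two-value counts are inferred from the 'Unnamed: 9' stats (28 non-null rows, top_rate 0.8571, hence 24 and 4).

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of observed categories)."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# 'Unnamed: 9': two categories with inferred counts 24 and 4.
print(round(entropy_bits([24, 4]), 4))   # 0.5917
print(round(entropy_ratio([24, 4]), 4))  # 0.5917

# Country: reported entropy 3.909 bits over 205 categories.
print(round(3.909 / math.log2(205), 3))  # 0.509
```

A ratio near 1 means records are spread evenly across categories, while a ratio near 0 means a few categories dominate, which is why 0.51 reads as moderate concentration.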
Area
categorical feature long_tail
This column represents geographic sub-national regions (states, provinces) drawn from multiple countries: the US (Florida, Hawaii, California, South Carolina), Australia (New South Wales, Queensland, Western Australia), and South Africa (KwaZulu-Natal, Western Cape Province, Eastern Cape Province). Florida dominates at 17.9% of non-null records, while 810 unique values against 6,462 rows signal a severe long-tail distribution in which the vast majority of areas appear only rarely. The 7.16% null rate and the geographic spread across at least three countries are consistent with the dataset's multinational scope.
Treatment: Encode with frequency or target encoding; consider grouping rare areas (the long tail) into an 'Other' bucket or rolling up to country level to reduce cardinality from 810 classes.
- n
- 6,462
- nulls
- 463 (7.2%)
- unique
- 810
- top_value
- Florida
- top_rate
- 0.1794
- cardinality
- 810
- entropy
- 6.163
- entropy_ratio
- 0.6379
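Frequency encoding with a rare-value bucket, as suggested for this column, might look like the sketch below. The `min_count` threshold, the helper name `frequency_encode`, and the sample values (including 'Rare Cove') are hypothetical.

```python
from collections import Counter


def frequency_encode(values, min_count=2, other="Other"):
    """Group rare categories into `other`, then return (grouped, frequencies).

    Frequencies are shares of the non-null values; nulls stay None.
    """
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    grouped = [
        None if v is None else (v if counts[v] >= min_count else other)
        for v in values
    ]
    grouped_counts = Counter(g for g in grouped if g is not None)
    freqs = [None if g is None else grouped_counts[g] / total for g in grouped]
    return grouped, freqs


# Illustrative sample, not taken from the dataset.
areas = ["Florida", "Florida", "Florida", "Hawaii", "Hawaii", "Rare Cove", None]
grouped, freqs = frequency_encode(areas)
print(grouped)
print(freqs)
```

If the encoding is used for modelling, the counts should be computed on the training split only, so that test rows do not influence the frequencies.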
Location
text label multilingual duplicates
This column captures the location of each incident, predominantly structured as 'City, County' strings, with top values like 'New Smyrna Beach, Volusia County' (181 occurrences) and dominant words 'county', 'beach', 'island', 'bay'. The duplicate rate is notably high at ~29.9% (1,769 duplicates among 5,917 non-null rows), reflecting repeated incidents at the same hotspot locations rather than data error. The multilingual alert is triggered by automated language detection misclassifying short geographic names (e.g. 'Durban', 'Boa Viagem, Recife') as non-English: 3,746 values are detected as English, with the remainder split across 29 other 'languages' due to short-string ambiguity. The null rate is 8.43%, which may represent unknown or offshore incident locations.
Treatment: Normalize to canonical 'City, County, Country' format, then use as a categorical grouping variable or geocode for spatial analysis.
- n
- 6,462
- nulls
- 545 (8.4%)
- unique
- 4,148
- len_min
- 3
- len_max
- 119
- len_mean
- 22.75
- len_median
- 21
- len_p95
- 47
- word_mean
- 3.54
- word_median
- 3
- n_empty
- 0
- n_duplicates
- 1,769
- duplicate_rate
- 0.299
- vocab_size
- 4,483
- readability_flesch_mean
- 53.4
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.1486
- allcaps_rate
- 0.000507
- boilerplate_rate
- 0
Activity
text label one_word duplicates
This column captures the water-based activity the person was engaged in at the time of the incident, dominated by Surfing (1,025) and Swimming (932) with a long tail of descriptive phrases. Despite being labelled 'text', it behaves largely as a categorical label: 62.9% of values are single words, only 1,516 unique values exist among 5,910 non-null rows, and the duplicate rate is 74.3%, indicating a loosely controlled vocabulary rather than a strict enum. The median string length of 8 characters against a maximum of 254 suggests a mix of clean category entries and occasional free-text annotations, which will require normalisation before use.
Treatment: Standardise to a controlled vocabulary by clustering near-duplicates (e.g. 'Diving' vs 'Scuba diving'), then encode as a categorical feature; impute or flag the 8.54% nulls separately.
- n
- 6,462
- nulls
- 552 (8.5%)
- unique
- 1,516
- len_min
- 1
- len_max
- 254
- len_mean
- 16.21
- len_median
- 8
- len_p95
- 49
- word_mean
- 2.497
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 4,394
- duplicate_rate
- 0.7435
- vocab_size
- 2,244
- readability_flesch_mean
- 39.56
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.6289
- allcaps_rate
- 0.0005076
- boilerplate_rate
- 0
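A very simple way to start the controlled vocabulary described above is keyword matching. This sketch is an assumption-heavy illustration: the keyword list, the category labels, and the `normalise_activity` helper are hypothetical, and rule order decides ties when a phrase matches more than one keyword.

```python
# Hypothetical keyword rules, checked in order; the first match wins.
ACTIVITY_RULES = [
    ("surf", "Surfing"),
    ("swim", "Swimming"),
    ("div", "Diving"),
    ("fish", "Fishing"),
]


def normalise_activity(raw):
    """Map a free-text activity to a coarse label, 'Other', or None if missing."""
    if raw is None or not raw.strip():
        return None
    text = raw.strip().lower()
    for keyword, label in ACTIVITY_RULES:
        if keyword in text:
            return label
    return "Other"


for value in ["Surfing", "Body surfing", "Scuba diving", "Spearfishing", "Standing", None]:
    print(value, "->", normalise_activity(value))
```

Reviewing what lands in 'Other' and adding rules iteratively is usually more reliable than trying to enumerate all 1,516 raw values up front.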
Name
text label
This column is the 'Name' field for the person (or vessel) involved in each incident. It is heavily contaminated with non-name entries: the most frequent value is 'male' (579 occurrences), followed by 'female' (106), 'boy' (23), '2 males' (19), and 'boat' (14), with 'sailor' and 'a sailor' also among the top values, indicating that gender and role descriptors were freely mixed with actual proper names. The duplicate rate of 14.5% (908 duplicates alongside 5,339 unique values) and the 3.33% null rate further confirm that this column is inconsistently populated and not a clean identifier.
Treatment: Split into two derived columns, one for proper names and one for role/gender descriptors, before any modelling or grouping.
- n
- 6,462
- nulls
- 215 (3.3%)
- unique
- 5,339
- len_min
- 1
- len_max
- 221
- len_mean
- 14.83
- len_median
- 13
- len_p95
- 35
- word_mean
- 2.453
- word_median
- 2
- n_empty
- 0
- n_duplicates
- 908
- duplicate_rate
- 0.1453
- vocab_size
- 6,536
- readability_flesch_mean
- 51.73
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.1657
- allcaps_rate
- 0.006883
- boilerplate_rate
- 0
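The split into a name column and a descriptor column could be approximated with a pattern for the generic entries quoted in the description. This is only a sketch: the descriptor list covers the quoted top values plus 'girl' (an assumption), `split_name` is a hypothetical helper, and 'John Smith' is a made-up example.

```python
import re

# Matches generic descriptors such as 'male', '2 males', 'a sailor', 'boat'.
DESCRIPTOR = re.compile(
    r"(\d+\s+)?(a\s+)?(male|female|boy|girl|sailor|boat)s?",
    re.IGNORECASE,
)


def split_name(raw):
    """Return (proper_name, descriptor); exactly one is set for non-missing input."""
    if raw is None or not raw.strip():
        return None, None
    text = raw.strip()
    if DESCRIPTOR.fullmatch(text):
        return None, text.lower()
    return text, None


for value in ["male", "2 males", "a sailor", "boat", "John Smith", None]:
    print(value, "->", split_name(value))
```

Entries that mix a name with a descriptor would still land in the name column under this rule and need a second pass.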
Unnamed: 9
categorical null_rate
- n
- 6,462
- nulls
- 6,434 (99.6%)
- unique
- 2
- top_value
- M
- top_rate
- 0.8571
- cardinality
- 2
- entropy
- 0.5917
- entropy_ratio
- 0.5917
Age
categorical feature null_rate
This column holds the victim's age, stored as a categorical (string) type rather than a numeric type, with 154 distinct values across 6,462 rows. The most striking issue is the 44.43% null rate, flagged as an alert: nearly half the records lack an age value. The distribution skews young. The top values cluster tightly between ages 15 and 25, with '17' the most frequent at only 4.46% of non-null values, which suggests victims are disproportionately teenagers and young adults. The high entropy ratio of 0.80 confirms that values are spread broadly across the 154 categories despite this concentration. Because 154 distinct values is more than a plain integer age field would be expected to produce, some entries are probably non-numeric.
Treatment: Cast to a numeric type, coercing entries that are not plain integers to null for review; impute or flag nulls (44.43% missing requires an explicit strategy); then treat as an ordinal or numeric feature.
- n
- 6,462
- nulls
- 2,871 (44.4%)
- unique
- 154
- top_value
- 17
- top_rate
- 0.04456
- cardinality
- 154
- entropy
- 5.827
- entropy_ratio
- 0.8018
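A conservative cast for this column might accept only plain integer strings and send everything else to null for later review. The sketch below assumes raw values arrive as strings; the `parse_age` helper, the upper bound of 120, and the non-numeric sample 'teen' are hypothetical.

```python
import re


def parse_age(raw, max_age=120):
    """Return an int age for plain integer strings, else None.

    max_age is an illustrative plausibility bound (an assumption).
    """
    if raw is None:
        return None
    text = str(raw).strip()
    if re.fullmatch(r"\d{1,3}", text):
        age = int(text)
        return age if age <= max_age else None
    return None


raw_ages = ["17", " 25 ", "teen", "300", None]
ages = [parse_age(v) for v in raw_ages]
age_missing = [a is None for a in ages]  # explicit missingness indicator
print(ages)
print(age_missing)
```

Carrying the `age_missing` indicator alongside the numeric value keeps the 44% of rows without an age usable in a model, instead of forcing an imputation to stand in for real data.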
Injury
text label multilingual allcaps duplicates
This column describes the outcome or nature of the injuries in each incident, containing free-text descriptions ranging from 'FATAL' to specific anatomical bite locations (e.g., 'Left foot bitten', 'Leg bitten'). The dominant value is 'FATAL', appearing 823 times, making it by far the most frequent entry. Two signals stand out: a high duplicate rate of 41.9% (2,695 duplicates among 6,433 non-null rows) driven by repetitive categorical-style phrases, and an all-caps rate of 13.1% suggesting inconsistent data entry conventions. The language detector also labels 496 entries as German against 3,812 as English; as with other short-text columns in this file, this may reflect misclassification of short phrases rather than genuinely German text, so inspect a sample before assuming multilingual sourcing.
Treatment: Normalize case, map high-frequency values to a structured severity/outcome taxonomy, and review the entries flagged as German (translating only those that prove to be non-English) before any text embedding or categorical encoding.
- n
- 6,462
- nulls
- 29 (0.4%)
- unique
- 3,738
- len_min
- 5
- len_max
- 234
- len_mean
- 31.53
- len_median
- 25
- len_p95
- 82
- word_mean
- 5.414
- word_median
- 4
- n_empty
- 0
- n_duplicates
- 2,695
- duplicate_rate
- 0.4189
- vocab_size
- 2,550
- readability_flesch_mean
- 53.74
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.1489
- allcaps_rate
- 0.1307
- boilerplate_rate
- 0
Fatal (Y/N)
categorical label
This column is a binary fatality flag for incidents, expected to hold only 'Y' or 'N' values. The dominant value is 'N' with 4,439 occurrences (75.0% of non-null values, 68.7% of all rows), and 'Y' accounts for 1,400 cases (23.7% of non-null values, 21.7% of all rows). Surprising data quality issues exist: 5 rows contain clearly erroneous values ('F' twice, 'M', '2017', and lowercase 'y'), suggesting data entry errors or row misalignment, and 71 rows are labelled 'UNKNOWN'. The 8.46% null rate adds further incompleteness.
Treatment: Standardise 'y' → 'Y', investigate and recode or drop the 'F', 'M', and '2017' entries, decide on treatment of 'UNKNOWN' and nulls, then binarise (Y=1, N=0) for modelling.
- n
- 6,462
- nulls
- 547 (8.5%)
- unique
- 7
- top_value
- N
- top_rate
- 0.7505
- cardinality
- 7
- entropy
- 0.8897
- entropy_ratio
- 0.3169
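The recoding of this flag, together with a consistency check against the Injury text, can be sketched as below. The helpers `normalise_fatal` and `fatal_mismatch` are hypothetical, and matching only Injury values that start with 'FATAL' is a deliberate simplification (a plain substring test would also match phrases such as 'Non-fatal').

```python
def normalise_fatal(raw):
    """Return 'Y' or 'N'; anything else ('UNKNOWN', 'F', 'M', '2017', None) -> None."""
    if raw is None:
        return None
    value = raw.strip().upper()
    return value if value in ("Y", "N") else None


def fatal_mismatch(fatal_raw, injury_raw):
    """True when Injury starts with 'FATAL' but the flag is not 'Y'."""
    injury_says_fatal = (
        injury_raw is not None and injury_raw.strip().upper().startswith("FATAL")
    )
    return injury_says_fatal and normalise_fatal(fatal_raw) != "Y"


# Shares recomputed from the counts in the table above.
non_null = 6462 - 547
print(round(4439 / non_null, 4))  # 0.7505, matching top_rate
print(round(4439 / 6462, 3))      # 0.687, the share of all rows

print(fatal_mismatch("N", "FATAL"))             # True: needs review
print(fatal_mismatch("y", "FATAL"))             # False: 'y' normalises to 'Y'
print(fatal_mismatch("N", "Left foot bitten"))  # False
```

Rows returned by `fatal_mismatch` are candidates for manual review rather than automatic correction, since either column could be the one in error.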
Time
categorical feature long_tail null_rate
This column captures time-of-day information, but it is stored inconsistently: values mix coarse labels ('Afternoon', 'Morning') with specific clock times in 'HHhMM' format ('11h00', '16h30'), yielding 366 unique values across 6,462 rows. The null rate is severe at 52.49%, meaning over half of all records are missing a time entirely. The top value 'Afternoon' accounts for only 6.29% of non-null values, and the entropy ratio is 0.77, indicating a long tail of rarely seen time strings, likely the result of inconsistent data entry across sources or time periods.
Treatment: Standardise to numeric minutes since midnight on a 24-hour clock, map label categories ('Morning', 'Afternoon') to representative values or a separate flag, then impute or model missingness explicitly given the 52.49% null rate.
- n
- 6,462
- nulls
- 3,392 (52.5%)
- unique
- 366
- top_value
- Afternoon
- top_rate
- 0.06287
- cardinality
- 366
- entropy
- 6.559
- entropy_ratio
- 0.7702
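Converting the two value styles described above into minutes since midnight could look like this sketch. The representative times chosen for 'Morning' and 'Afternoon' are assumptions, and `parse_time` is a hypothetical helper that returns a flag so label-derived values can be told apart from real clock times.

```python
import re

CLOCK = re.compile(r"(\d{1,2})h(\d{2})")

# Representative minutes for coarse labels; these midpoints are assumptions.
LABEL_MINUTES = {"morning": 9 * 60, "afternoon": 15 * 60}


def parse_time(raw):
    """Return (minutes_since_midnight, from_label) for one raw Time string."""
    if raw is None or not raw.strip():
        return None, False
    text = raw.strip().lower()
    match = CLOCK.fullmatch(text)
    if match:
        hours, minutes = int(match.group(1)), int(match.group(2))
        if hours < 24 and minutes < 60:
            return hours * 60 + minutes, False
        return None, False  # malformed clock time
    if text in LABEL_MINUTES:
        return LABEL_MINUTES[text], True
    return None, False


for value in ["11h00", "16h30", "Afternoon", "25h00", None]:
    print(value, "->", parse_time(value))
```

Keeping the `from_label` flag avoids treating 'Afternoon' as if the incident were known to have happened at exactly 15:00.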
Species
text label multilingual null_rate duplicates
This column records shark species (and incident validity notes), with values ranging from specific species ('White shark', 'Tiger shark', 'Bull shark') to free-text qualifiers like 'Shark involvement not confirmed' and 'Invalid'. The null rate is severe at 45.25%, and 58.56% of non-null values are duplicates, which is expected for a species label with only 1,466 unique values. The multilingual alert reports 2,582 values classified as English, with 14 German, 18 Finnish, 11 Chinese, and 8 Turkish among others; given how short these strings are, the non-English labels may be detector misclassifications rather than evidence of multilingual sources, so check a sample before acting on them. The mix of species names, size descriptions ("4' shark", "1.8 m [6'] shark"), and incident-status phrases means this column is semantically heterogeneous and will require parsing or splitting before use.
Treatment: Split into a normalized species category and a separate incident-validity flag; impute or exclude the 45.25% nulls based on task context.
- n
- 6,462
- nulls
- 2,924 (45.2%)
- unique
- 1,466
- len_min
- 3
- len_max
- 194
- len_mean
- 22.95
- len_median
- 17
- len_p95
- 50
- word_mean
- 4.445
- word_median
- 4
- n_empty
- 0
- n_duplicates
- 2,072
- duplicate_rate
- 0.5856
- vocab_size
- 1,105
- readability_flesch_mean
- 88.63
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.04098
- allcaps_rate
- 0.0002826
- boilerplate_rate
- 0
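The split into a species category and a validity flag might begin with a rule set like the one below. It is a sketch under stated assumptions: only the three species and two status phrases quoted in the description are recognised, `split_species` is a hypothetical helper, and 'Tiger shark, 3 m' is a made-up input.

```python
import re

SPECIES = re.compile(r"\b(white|tiger|bull)\s+shark\b", re.IGNORECASE)
# Phrases that question the incident itself, taken from the description above.
STATUS_PHRASES = ("not confirmed", "invalid")


def split_species(raw):
    """Return (species_or_None, questionable_flag) for one raw Species string."""
    if raw is None or not raw.strip():
        return None, False
    text = raw.strip()
    questionable = any(phrase in text.lower() for phrase in STATUS_PHRASES)
    match = SPECIES.search(text)
    species = f"{match.group(1).capitalize()} shark" if match else None
    return species, questionable


for value in ["White shark", "Tiger shark, 3 m", "Shark involvement not confirmed",
              "Invalid", "1.8 m [6'] shark", None]:
    print(value, "->", split_species(value))
```

Size-only strings such as "1.8 m [6'] shark" come back with no species, which matches the data: a length was recorded but the species was not identified.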
Investigator or Source
text metadata multilingual duplicates
This column records the investigator or data source credited for each shark attack incident, typically formatted as an abbreviated name plus an organizational affiliation (e.g., 'C. Moore, GSAF'). GSAF (Global Shark Attack File) dominates the top entries and appears in 983 word tokens, making it the primary contributing organization. The duplicate rate of 22.7% (1,464 duplicates among 6,443 non-null rows) is expected given a finite set of investigators filing multiple reports, but the multilingual alert across 22 detected languages is notable: 'en' accounts for 4,457 entries, while 'es' (134), 'fr' (100), 'it' (62), and 'de' (56) reflect international source attribution or non-English investigator names. The high cardinality (4,979 unique values out of 6,462 rows) suggests many entries are one-off source combinations rather than standardized identifiers.
Treatment: Normalize organization affiliations (e.g., extract the 'GSAF' tag) and standardize investigator name formats before using as a categorical grouping variable.
- n
- 6,462
- nulls
- 19 (0.3%)
- unique
- 4,979
- len_min
- 3
- len_max
- 210
- len_mean
- 32.24
- len_median
- 26
- len_p95
- 77
- word_mean
- 4.792
- word_median
- 3
- n_empty
- 0
- n_duplicates
- 1,464
- duplicate_rate
- 0.2272
- vocab_size
- 7,898
- readability_flesch_mean
- 73.62
- emoji_rate
- 0
- url_rate
- 0.002328
- one_word_rate
- 0.02592
- allcaps_rate
- 0.01847
- boilerplate_rate
- 0
This column contains PDF filenames or partial file paths, evidenced by the '.pdf' suffixes in the top words and a mean token count of ~1 word per value. Over half the rows (52.55%) are null, and with 3,054 unique values among 3,066 non-null rows the column is near-unique, functioning more like a document reference key than a descriptive field. The extremely negative Flesch readability score (−66.81) is consistent with structured filename strings rather than natural language. A small number of duplicates (12) suggests some documents are referenced by multiple records.
Treatment: Use as a document reference key to join or retrieve associated PDF files; flag nulls before any join.
- n
- 6,462
- nulls
- 3,396 (52.6%)
- unique
- 3,054
- len_min
- 10
- len_max
- 41
- len_mean
- 23.73
- len_median
- 23
- len_p95
- 31
- word_mean
- 1.022
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 12
- duplicate_rate
- 0.003914
- vocab_size
- 3,098
- readability_flesch_mean
- -66.81
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9866
- allcaps_rate
- 0.0003262
- boilerplate_rate
- 0
href formula
text metadata near_unique one_word url_heavy null_rate
This column contains hyperlink formulas pointing to PDF source documents on sharkattackfile.net, each URL referencing a dated incident report (e.g., '1935.06.05.r-solomonislands.pdf'). Over half the rows (52.62%) are null, meaning many records lack a linked source document. Values are nearly all single-token URLs (one_word_rate 0.9879, url_rate 0.9997), and the extremely negative Flesch readability score (−820.18) confirms these are machine-generated URL strings, not natural text. Only 11 duplicates exist alongside 3,051 unique entries, suggesting most cited PDFs are distinct incident references.
Treatment: Extract the raw URL string from the formula syntax before use; treat as a source-citation reference field and consider joining or flagging unsourced rows (52.62% null) separately.
- n
- 6,462
- nulls
- 3,400 (52.6%)
- unique
- 3,051
- len_min
- 64
- len_max
- 95
- len_mean
- 77.73
- len_median
- 77
- len_p95
- 85
- word_mean
- 1.019
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 11
- duplicate_rate
- 0.003592
- vocab_size
- 3,089
- readability_flesch_mean
- -820.2
- emoji_rate
- 0
- url_rate
- 0.9997
- one_word_rate
- 0.9879
- allcaps_rate
- 0
- boilerplate_rate
- 0
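Extracting the raw URL from a formula value could be done with a single regular expression. The profile does not show the formula syntax, so the spreadsheet-style `HYPERLINK(...)` wrapper and the `http` scheme in the example string below are assumptions made for illustration; the directory and filename are the ones quoted in the column descriptions, and `extract_url` is a hypothetical helper.

```python
import re

# Stops at whitespace, quotes, or a closing parenthesis.
URL = re.compile(r"https?://[^\s\"')]+")


def extract_url(formula):
    """Return the first URL found in a formula string, or None."""
    if formula is None:
        return None
    match = URL.search(formula)
    return match.group(0) if match else None


# Hypothetical formula value; the wrapper syntax is assumed, not observed.
example = ('=HYPERLINK("http://sharkattackfile.net/spreadsheets/pdf_directory/'
           '1935.06.05.r-solomonislands.pdf")')
url = extract_url(example)
print(url)
print(url is not None)  # a 'has_source_pdf' style indicator
```

If the extracted URLs turn out to be identical to the values in the plain href column, one of the two columns can be dropped.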
href
text metadata near_unique one_word url_heavy null_rate
This column contains URLs linking to PDF source documents in a shark attack file directory (sharkattackfile.net/spreadsheets/pdf_directory/), serving as citation or evidence links for individual incident records. Over half the rows are null (52.62%), indicating many records lack a linked source document. Nearly all non-null values are single-token URLs (one_word_rate 0.988, url_rate 0.9997), with very few duplicates (11 duplicates alongside 3,051 unique values), consistent with per-incident citation links. The high null rate is the key analyst concern: roughly half of incidents have no associated PDF reference.
Treatment: Exclude from predictive modelling; retain as a provenance/citation field, or engineer a binary 'has_source_pdf' indicator from non-null presence.
- n
- 6,462
- nulls
- 3,400 (52.6%)
- unique
- 3,051
- len_min
- 64
- len_max
- 135
- len_mean
- 77.89
- len_median
- 77
- len_p95
- 86
- word_mean
- 1.02
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 11
- duplicate_rate
- 0.003592
- vocab_size
- 3,091
- readability_flesch_mean
- -824.4
- emoji_rate
- 0
- url_rate
- 0.9997
- one_word_rate
- 0.9879
- allcaps_rate
- 0
- boilerplate_rate
- 0
Case Number.1
text identifier near_unique one_word allcaps null_rate short_text
This column appears to be a duplicate or alternate version of the case number field, with values formatted as date-like codes (e.g., '1966.12.26', '1923.00.00.a'), suggesting archival case identifiers tied to dates with alphabetic suffixes for disambiguation. A striking 52.62% null rate makes this column unreliable for most analyses, and the near-unique flag (3,054 unique values among 3,062 non-null rows) combined with only 8 duplicates confirms it functions as a quasi-identifier where present. The allcaps rate of 78.97% is notable given that values appear to be alphanumeric codes rather than natural language. The '.1' suffix in the column name strongly suggests a repeated column header that was renamed on import, or a column duplicated by a merge or pivot operation.
Treatment: Investigate overlap with the original 'Case Number' column; if redundant, drop it; otherwise use it as a secondary join key and leave nulls unimputed, since an identifier cannot be meaningfully imputed.
- n
- 6,462
- nulls
- 3,400 (52.6%)
- unique
- 3,054
- len_min
- 7
- len_max
- 18
- len_mean
- 10.59
- len_median
- 10
- len_p95
- 12
- word_mean
- 1.002
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 8
- duplicate_rate
- 0.002613
- vocab_size
- 3,057
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9984
- allcaps_rate
- 0.7897
- boilerplate_rate
- 0
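The overlap check against the original 'Case Number' column can be sketched as a row-wise comparison. This is a minimal illustration, assuming both columns are available as equal-length lists aligned by row; `compare_keys` and the sample values are hypothetical.

```python
def compare_keys(primary, secondary):
    """Count row-wise agreement between a primary key and an alternate copy."""
    summary = {"match": 0, "differ": 0, "secondary_missing": 0}
    for left, right in zip(primary, secondary):
        if right is None or not str(right).strip():
            summary["secondary_missing"] += 1
        elif left is not None and str(left).strip() == str(right).strip():
            summary["match"] += 1
        else:
            summary["differ"] += 1
    return summary


# Illustrative rows only, not taken from the dataset.
case_number = ["1966.12.26", "1923.00.00.a", "2019.10.08", "2012.09.02.b"]
case_number_1 = ["1966.12.26", "1923.00.00.A", None, "2012.09.02.b "]
result = compare_keys(case_number, case_number_1)
print(result)
```

If `differ` is zero on the real data, the alternate column adds nothing beyond the primary key and can be dropped; a non-zero count points to rows worth inspecting by hand.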
Case Number.2
text identifier near_unique one_word allcaps null_rate short_text
This column appears to be another copy of the structured case number identifier, encoding a date-based reference (e.g., '1966.12.26', '1915.07.06.a.r'). With a 52.62% null rate across 6,462 rows and 3,055 unique values among 3,062 non-null rows, the column is near-unique but severely incomplete. The 78.97% all-caps rate combined with date-like tokens and alphabetic suffixes (a, b, r) suggests a custom alphanumeric coding scheme rather than free text. Only 7 duplicates exist, making this effectively an identifier where present.
Treatment: Retain as a join/lookup key; flag nulls separately; do not encode numerically.
- n
- 6,462
- nulls
- 3,400 (52.6%)
- unique
- 3,055
- len_min
- 7
- len_max
- 18
- len_mean
- 10.59
- len_median
- 10
- len_p95
- 12
- word_mean
- 1.002
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 7
- duplicate_rate
- 0.002286
- vocab_size
- 3,058
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9984
- allcaps_rate
- 0.7897
- boilerplate_rate
- 0
original order
numeric metadata null_rate
This column appears to be a positional or sequence index assigned to records, likely reflecting the original sort order of items in a source dataset. The most striking issue is the 52.62% null rate: over half the rows carry no value, which is highly anomalous for an ordering field and suggests either a join that left many rows unmatched or that ordering was only recorded for a subset of records. With 3,061 unique values among 3,062 non-null rows, exactly one value is repeated, so the column is almost, but not strictly, unique. The distribution is mildly right-skewed (skew 0.99) with notable leptokurtosis (3.55) and 27 outliers, hinting at a few unusually large order values relative to the bulk; the maximum of 6,502 also exceeds the row count of 6,462.
Treatment: Investigate the source of the 52.62% nulls before use; if retaining, treat as an optional sort key and do not rely on it as a unique identifier until the repeated value is resolved.
- n
- 6,462
- nulls
- 3,400 (52.6%)
- unique
- 3,061
- min
- 3
- max
- 6,502
- mean
- 1564
- median
- 1534
- std
- 988.4
- q1
- 768.2
- q3
- 2299
- iqr
- 1530
- skew
- 0.9878
- kurtosis
- 3.551
- n_outliers
- 27
- outlier_rate
- 0.008818
- zero_rate
- 0
Unnamed: 23
categorical long_tail null_rate
- n
- 6,462
- nulls
- 6,460 (100.0%)
- unique
- 2
- top_value
- Teramo
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1