saturn

/home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv 6,462 rows sample n=6,462 seed 42 2026-06-21T23:26:09+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv
Total rows: 6,462
Profiled sample: 6,462
Columns: 24
Generated: 2026-06-21T23:26:09+00:00
Per-column null rate across the corpus.
column                  kind         null %
index                   numeric      0.0%
Case Number             text         0.0%
Date                    text         0.0%
Year                    numeric      0.0%
Type                    categorical  0.1%
Country                 categorical  0.8%
Area                    categorical  7.2%
Location                text         8.4%
Activity                text         8.5%
Name                    text         3.3%
Unnamed: 9              categorical  99.6%
Age                     categorical  44.4%
Injury                  text         0.4%
Fatal (Y/N)             categorical  8.5%
Time                    categorical  52.5%
Species                 text         45.2%
Investigator or Source  text         0.3%
pdf                     text         52.6%
href formula            text         52.6%
href                    text         52.6%
Case Number.1           text         52.6%
Case Number.2           text         52.6%
original order          numeric      52.6%
Unnamed: 23             categorical  100.0%

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:default.

Dataset high anthropic:default

This dataset is the Global Shark Attack File (GSAF), containing 6,462 records of shark attack incidents spanning centuries of documented cases. The first thing to examine is the attack outcome: 4,439 incidents (68.7% of all rows, 75.0% of non-null values) are recorded as non-fatal ('N') and 1,400 (21.7% of all rows) as fatal ('Y'), while the 'Injury' column has 823 entries marked simply 'FATAL'; the two fields are worth cross-checking for consistency. A second priority is the geographic and activity breakdown: the USA leads with 2,310 cases (36%), Florida alone accounts for 1,076, and surfing (1,025) and swimming (932) are the most frequently recorded activities, which reflects how often they appear in the file rather than their relative risk. The 'Year' column carries a data quality warning: a minimum of 0, a maximum of 3019, and high kurtosis signal outliers that should be cleaned before any time-series analysis.
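The consistency check suggested above could be sketched as follows. This is a minimal illustration, assuming each record is a dict keyed by the column names 'Injury' and 'Fatal (Y/N)'; the function name `fatal_flag_conflicts` is hypothetical, and matching on the prefix 'FATAL' is an assumption about how fatal injuries are written.

```python
def fatal_flag_conflicts(rows):
    """Return rows whose Injury text starts with 'FATAL' but whose flag is not 'Y'.

    These are candidates for manual review, not confirmed errors.
    """
    conflicts = []
    for row in rows:
        injury = (row.get("Injury") or "").strip().upper()
        flag = (row.get("Fatal (Y/N)") or "").strip().upper()
        if injury.startswith("FATAL") and flag != "Y":
            conflicts.append(row)
    return conflicts


if __name__ == "__main__":
    sample = [
        {"Injury": "FATAL", "Fatal (Y/N)": "Y"},
        {"Injury": "FATAL", "Fatal (Y/N)": "N"},
        {"Injury": "Hip bitten", "Fatal (Y/N)": "N"},
    ]
    print(len(fatal_flag_conflicts(sample)))
```

Only one direction is checked here; a 'Y' flag paired with a descriptive injury text is not necessarily inconsistent.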

Case Number.1 medium anthropic:default

This column appears to be a duplicate or alternate version of a case number field, with values formatted as date-like codes (e.g., '1966.12.26', '1923.00.00.a') suggesting archival case identifiers tied to dates with alphabetic suffixes for disambiguation. A striking 52.62% null rate makes this column unreliable for most analyses, and the near-unique flag (3,054 unique values across 6,462 rows) combined with only 8 true duplicates confirms it functions as a quasi-identifier. The allcaps rate of 78.97% is notable given that values appear to be alphanumeric codes rather than natural language, and the '.1' suffix in the column name strongly suggests this is a duplicated column from a merge or pivot operation.

Case Number.2 high anthropic:default

This column appears to be a structured case number identifier, likely encoding a date-based reference system (e.g., '1966.12.26', '1915.07.06.a.r') typical of archival, legal, or historical record catalogues. With 52.62% null rate across 6,462 rows and only 3,055 unique values out of 3,058 vocabulary size, the column is near-unique but severely incomplete. The 78.97% all-caps rate combined with date-like tokens and alphabetic suffixes (a, b, r) suggests a custom alphanumeric coding scheme rather than free text. Only 7 duplicate values exist, making this effectively an identifier where present.

href high anthropic:default

This column contains URLs linking to PDF source documents in a shark attack file directory (sharkattackfile.net/spreadsheets/pdf_directory/), serving as citation or evidence links for individual incident records. Over half the rows are null (52.62%), indicating many records lack a linked source document. Nearly all non-null values are single-token URLs (one_word_rate 0.988, url_rate 0.9997), with very few duplicates (11 duplicates out of 3,051 unique values), consistent with per-incident citation links. The high null rate is the key analyst concern — roughly half of incidents have no associated PDF reference.

href formula high anthropic:default

This column contains hyperlink formulas pointing to PDF source documents on sharkattackfile.net, each URL referencing a dated incident report (e.g., '1935.06.05.r-solomonislands.pdf'). Over half the rows (52.62%) are null, meaning many records lack a linked source document. Values are nearly all single-token URLs (one_word_rate 0.9879, url_rate 0.9997), and the extremely negative Flesch readability score (-820.18) confirms these are machine-generated URL strings, not natural text. Only 11 duplicate values exist across 3,051 unique entries, suggesting most cited PDFs are distinct incident references.

Case Number high anthropic:default

This column is a case identifier, with values that appear to encode dates in YYYY.MM.DD format (e.g., '2019.10.08'), suggesting case numbers tied to incident dates. With 6,442 unique values among 6,460 non-null rows (null rate 0.0003), it is near-unique, but the 18 duplicate values (duplicate_rate 0.0028) and 2 nulls mean it cannot serve as a strict primary key without cleanup. One value, '2012.09.02.b', shows that suffixes are used to disambiguate same-date cases, so the remaining duplicates imply that this disambiguation was not applied consistently. The allcaps_rate of 0.748 is hard to interpret for mostly numeric codes and should not be read as a formatting signal without checking how the profiler treats digit-only strings.
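One way to work with these identifiers is to split them into date parts and a suffix before looking for duplicates. The sketch below assumes the layout YYYY.MM.DD with an optional dotted suffix, inferred from the sample values; `parse_case_number` is a hypothetical helper, and values in other layouts simply return None.

```python
import re

# Assumed layout: YYYY.MM.DD plus an optional dotted suffix such as ".b" or ".a.r".
_CASE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})(?:\.(.+))?$")


def parse_case_number(value):
    """Split a case number into date parts and suffix; None if it does not match."""
    if value is None:
        return None
    match = _CASE.match(value.strip())
    if match is None:
        return None
    year, month, day, suffix = match.groups()
    return {
        "year": int(year),
        # "00" occurs in the samples (e.g. '1990.00.00'); treat it as unknown.
        "month": int(month) or None,
        "day": int(day) or None,
        # Lower-casing the suffix is a choice made here so '.R' and '.r' compare equal.
        "suffix": suffix.lower() if suffix else None,
    }
```

Grouping parsed values by (year, month, day, suffix) would then surface the duplicate identifiers for review.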

Species high anthropic:default

This column records shark species together with incident validity notes, with values ranging from specific species ('White shark', 'Tiger shark', 'Bull shark') to free-text qualifiers like 'Shark involvement not confirmed' and 'Invalid'. The null rate is severe at 45.25%, and 58.56% of non-null values are duplicates, which is expected for a species label with 1,466 unique values among 3,538 non-null rows. The multilingual alert reports 2,582 values as English alongside 14 German, 18 Finnish, 11 Chinese, and 8 Turkish among others; for short strings such as '2 m shark' this is more plausibly language-detector noise than evidence of non-English sources, and should be verified before acting on it. The mix of species names, size descriptions ('4\' shark', '1.8 m [6\'] shark'), and incident-status phrases means this column is semantically heterogeneous and will require parsing or splitting before use.

pdf high anthropic:default

This column contains PDF filenames, evidenced by the '.pdf' suffixes in the top words and a mean of about one word per value. Over half the rows (52.55%) are null. Among the 3,066 non-null rows there are 3,054 unique values, so the column is near-unique and functions more like a document reference key than a descriptive field. The strongly negative Flesch readability score (−66.81) is consistent with structured filename strings rather than natural language. The 12 duplicates suggest some documents are referenced by more than one record.

Injury high anthropic:default

This column describes the outcome or nature of injuries, containing free-text descriptions ranging from 'FATAL' to specific anatomical bite locations (e.g., 'Left foot bitten', 'Leg bitten'). The most frequent value is 'FATAL', appearing 823 times. Two signals stand out: a high duplicate rate of 41.9% (2,695 duplicates among 6,433 non-null rows) driven by repetitive categorical-style phrases, and an all-caps rate of 13.1%, much of which is probably the 'FATAL' entries themselves (823 of 6,433 is about 12.8%). The language detector labels 496 entries as German against 3,812 as English; for short English fragments like these, detector misclassification is a more likely explanation than genuinely multilingual text, so the counts should be verified before any language-based filtering.

Activity high anthropic:default

This column captures the activity a person was engaged in at the time of the incident, dominated by Surfing (1,025) and Swimming (932) with a long tail of descriptive phrases. Despite being labelled 'text', it behaves largely as a categorical label: 62.9% of values are single words, only 1,516 unique values exist among 5,910 non-null rows, and the duplicate rate is 74.3%, indicating a loosely controlled vocabulary rather than a strict enum. The median string length of 8 characters versus a maximum of 254 suggests a mix of clean category entries and occasional free-text annotations, which may require normalisation before use.

Investigator or Source high anthropic:default

This column records the investigator or data source credited for each shark attack incident, typically formatted as an abbreviated name plus an organizational affiliation (e.g., 'C. Moore, GSAF'). GSAF (Global Shark Attack File) dominates the top entries and appears in 983 word tokens, making it the primary contributing organization. The duplicate rate of 22.7% (1,464 duplicates among 6,443 non-null rows) is expected given a finite set of investigators filing multiple reports. The multilingual alert spans 23 detected languages: 'en' accounts for 4,457 entries while 'es' (134), 'fr' (100), 'it' (62), and 'de' (56) may reflect international newspapers and non-English investigator names, or simply detector noise on strings made mostly of proper nouns. The high cardinality (4,979 unique values) suggests many entries are one-off source combinations rather than standardized identifiers.

Location high anthropic:default

This column captures geographic incident locations, predominantly structured as 'City, County' strings, with top values like 'New Smyrna Beach, Volusia County' (181 occurrences) and dominant words 'county', 'beach', 'island', 'bay'. The duplicate rate is high at about 29.9% (1,769 duplicates among 5,917 non-null rows), reflecting repeated incidents at the same hotspot locations rather than data error. The multilingual alert is triggered by automated language detection misclassifying short geographic names (e.g. 'Durban', 'Boa Viagem, Recife') as non-English: 3,746 values are detected as English, with the remainder split across 30 other 'languages' due to short-string ambiguity. The null rate is 8.43%, which may represent unknown or offshore incident locations.

Time high anthropic:default

This column captures time-of-day information, but it is stored inconsistently: values mix coarse labels ('Afternoon', 'Morning', 'Night') with specific clock times in 'HHhMM' format ('11h00', '16h30'), yielding 366 unique values. The null rate is severe at 52.49%, meaning over half of all records are missing a time entirely. The top value 'Afternoon' accounts for only 6.29% of non-null values (3.0% of all rows), and the entropy ratio is 0.77, indicating a long tail of rarely seen time strings (199 values occur only once), likely the result of inconsistent data entry across sources or time periods.
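A normalisation pass for this column might look like the sketch below. It assumes only the two patterns visible in the top values (clock times such as '16h30' and the labels 'Morning', 'Afternoon', 'Night'); `parse_time` is a hypothetical helper, and anything else is returned unparsed rather than guessed at.

```python
import re

# Clock times as they appear in the top values, e.g. '11h00' or '16h30'.
_CLOCK = re.compile(r"^(\d{1,2})h(\d{2})$")
# Coarse labels seen in the top values; the real column may contain others.
_LABELS = {"morning", "afternoon", "night"}


def parse_time(value):
    """Classify a Time value as ('clock', minutes), ('label', name),
    ('missing', None) or ('unparsed', raw_text)."""
    if value is None or not value.strip():
        return ("missing", None)
    text = value.strip()
    match = _CLOCK.match(text)
    if match:
        hour, minute = int(match.group(1)), int(match.group(2))
        if hour < 24 and minute < 60:
            return ("clock", hour * 60 + minute)  # minutes since midnight
    if text.lower() in _LABELS:
        return ("label", text.lower())
    return ("unparsed", text)
```

Keeping clock times and labels as separate kinds avoids inventing a precise hour for an entry that only says 'Afternoon'.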

Date high anthropic:default

This column contains free-form date annotations in highly inconsistent formats: bare years (e.g., '1957', '1942'), structured dates ('05-Oct-2003'), and vague phrases ('Before 1958', 'No date', 'ca.', 'summer', 'late'). The top word 'reported' appears 559 times, suggesting many values follow a pattern like 'Reported [date]', which is a red flag for downstream parsing. With a 14.1% duplicate rate (909 duplicates against 5,552 unique values) and a maximum length of 64 characters, this column cannot be safely cast to a datetime type without substantial normalization work.
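A tolerant parser for the formats named above could start from something like this. It is a sketch under the assumption that the well-formed values look like 'DD-Mon-YYYY', 'Mon-YYYY', or a bare 'YYYY', optionally prefixed with 'Reported'; `parse_date` is a hypothetical helper, and vague phrases deliberately return None.

```python
import re

MONTHS = {name: i for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Optional day, optional three-letter month, mandatory four-digit year.
_DATE = re.compile(r"^(?:(\d{1,2})-)?(?:([A-Za-z]{3})-)?(\d{4})$")


def parse_date(raw):
    """Return a dict of year/month/day/reported, or None if the value is vague."""
    if raw is None:
        return None
    text = raw.strip()
    reported = text.lower().startswith("reported")
    if reported:
        text = text[len("reported"):].strip()
    match = _DATE.match(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    if day and not month_name:
        return None  # a day without a month is not a usable date
    month = MONTHS.get(month_name.lower()) if month_name else None
    if month_name and month is None:
        return None  # three letters that are not a month abbreviation
    return {
        "year": int(year),
        "month": month,
        "day": int(day) if day else None,
        "reported": reported,
    }
```

Returning partial dates instead of forcing a full datetime keeps values like 'Nov-1942' usable for year-level analysis.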

original order medium anthropic:default

This column appears to be a positional or sequence index assigned to records, likely reflecting the original sort order of items in a source dataset. The most striking issue is a 52.62% null rate: over half the rows carry no value, which is highly anomalous for an ordering field and suggests either a join that left many rows unmatched or that ordering was only recorded for a subset of records. With 3,061 unique values and roughly 3,061 non-null rows implied by the null rate, the values are unique or very nearly so where present, so the column can act as a sequence key only for the populated subset. The distribution is mildly right-skewed (skew 0.99) with kurtosis of 3.55 and 27 outliers, hinting at a few unusually large order values relative to the bulk.

Age high anthropic:default

This column stores age as a categorical (string) type rather than a numeric type, with 154 distinct values. That is more distinct values than whole-number ages alone would plausibly produce, which suggests non-numeric entries are mixed in. The most striking issue is a 44.43% null rate, flagged as an alert, meaning nearly half the records lack an age value. The distribution skews young: the 20 most frequent values all fall between 12 and 32, with '17' the most frequent at 4.46% of non-null values (2.5% of all rows), indicating that recorded victims are concentrated in the teens and twenties. The high entropy ratio of 0.80 confirms values are spread broadly across the 154 categories despite this concentration.
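Casting this column to numbers could begin with a conservative helper like the one below. It is a sketch: `parse_age` is a hypothetical name, the upper bound of 120 is an assumption, and any value that is not a plain whole number is left as None for manual handling rather than interpreted.

```python
def parse_age(value, max_age=120):
    """Return an int age for plain whole-number strings, else None.

    max_age is an assumed plausibility bound, not something the report states.
    """
    if value is None:
        return None
    text = str(value).strip()
    if text.isascii() and text.isdigit():
        age = int(text)
        return age if age <= max_age else None
    return None


if __name__ == "__main__":
    print([parse_age(v) for v in ["17", " 25 ", "teen", None]])
```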

Year high anthropic:default

This column records the year of each incident, with a median of 1980 and an IQR of 1943–2006. Two signals stand out: a minimum of 0.0 (a zero rate of about 1.9%, almost certainly placeholders for a missing year) and a maximum of 3019.0, which is a data-entry error (possibly a typo for 2019). The 266 flagged outliers (~4.1%) drive extreme negative skew (−6.55) and very high kurtosis (42.54), masking what is otherwise a fairly clean temporal distribution. Outlier status alone does not indicate an error: genuinely old incidents (the Case Number samples include '1830.07.26') also sit far from the bulk.
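A cleaning step for this column could separate placeholder and impossible years from legitimately old ones. The sketch below makes two assumptions that the report does not confirm: that 0 is a missing-value sentinel, and that no incident can postdate the profile's generation year (2026). The `tukey_fences` helper shows the common 1.5×IQR rule for comparison; whether saturn uses that rule for n_outliers is not stated.

```python
def clean_year(value, max_year=2026):
    """Return an int year, or None for sentinel zeros, future years and junk."""
    if value is None:
        return None
    try:
        year = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None
    if year == 0 or year > max_year:
        return None
    return year


def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


if __name__ == "__main__":
    # Quartiles taken from the Year statistics in this report.
    print(tukey_fences(1943, 2006))
```

With the reported quartiles the fences are 1848.5 and 2100.5, so a record from 1830 would be flagged by this rule even though it is a valid historical incident.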

Area high anthropic:default

This column represents geographic sub-national regions (states, provinces) drawn from multiple countries, including the US (Florida, Hawaii, California, South Carolina), Australia (New South Wales, Queensland, Western Australia), and South Africa (KwaZulu-Natal, Western Cape Province, Eastern Cape Province). Florida leads at 17.9% of non-null values (16.7% of all rows), while 810 unique values, 540 of which occur only once, signal a severe long-tail distribution. The 7.16% null rate and the geographic diversity confirm that the dataset is multinational in scope.

Name high anthropic:default

This column is the 'Name' field for the person involved in each incident. It is heavily contaminated with non-name entries: the most frequent value is 'male' (579 occurrences), followed by 'female' (106), 'boy' (23), '2 males' (19), and 'boat' (14), indicating that gender, role, and vessel descriptors (also 'sailor', 'a sailor') were freely mixed with actual proper names. The duplicate rate of 14.5% (908 duplicates against 5,339 unique values) and a 3.33% null rate further confirm this column is inconsistently populated and not a clean identifier.

index high anthropic:default

This column is a row index — a sequential integer identifier running from 0 to 6461 with 6462 unique values and no nulls, perfectly matching the row count. Its distribution is exactly uniform (skew = 0.0, kurtosis ≈ −1.2, mean = median = 3230.5, zero outliers), confirming it was generated as a positional index rather than carrying any domain meaning. The single 'zero' (zero_rate ≈ 0.00015) is simply row 0. There is no analytical signal here.

Type high anthropic:default

This column classifies shark attack incidents by the nature of the encounter, with 12 distinct categories across 6,462 records. It is heavily dominated by 'Unprovoked' at 73% of all records, creating notable class imbalance. One data quality issue is the fragmentation of watercraft-related incidents across near-synonymous labels: 'Watercraft' (142), 'Boat' (109), 'Boating' (92), and a single 'Boatomg' that looks like a typo for 'Boating'. These almost certainly denote the same category and should be consolidated. The 'Invalid' category (552 records, ~8.5%) also warrants attention, as it may represent records that should be excluded from incident analysis.
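Consolidating the watercraft labels can be sketched as a small lookup. The alias set below is taken from the value table for this column ('Watercraft', 'Boat', 'Boating', 'Boatomg'); choosing 'Watercraft' as the canonical label and the helper names are assumptions made for illustration.

```python
from collections import Counter

# Labels from the Type value table that appear to mean the same thing.
_WATERCRAFT_ALIASES = {"watercraft", "boat", "boating", "boatomg"}


def normalize_type(value):
    """Map watercraft-related labels to one canonical label; pass others through."""
    if value is None:
        return None
    label = value.strip()
    if label.lower() in _WATERCRAFT_ALIASES:
        return "Watercraft"
    return label


def merge_counts(counts):
    """Re-aggregate a {label: count} mapping after normalisation."""
    merged = Counter()
    for label, n in counts.items():
        merged[normalize_type(label)] += n
    return merged


if __name__ == "__main__":
    reported = {"Watercraft": 142, "Boat": 109, "Boating": 92, "Boatomg": 1}
    print(merge_counts(reported)["Watercraft"])
```

Applied to the reported counts, the merged category holds 344 records.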

Country high anthropic:default

This column records the country of origin or occurrence for each record, with 205 distinct country values across 6,462 rows. The distribution is heavily skewed: USA alone accounts for 36% of records (2,310), followed by AUSTRALIA at 21% (1,374) and SOUTH AFRICA at 9% (585), meaning these three countries together represent roughly two-thirds of the dataset. The entropy ratio of 0.51 confirms moderate concentration despite 205 unique values, and the near-zero null rate (0.79%) means coverage is excellent.
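The entropy ratio quoted for categorical columns can be reproduced with a few lines. The reported figures are consistent with Shannon entropy in bits divided by log2 of the cardinality (for Country, 3.909 / log2(205) ≈ 0.509), though the report does not state saturn's exact formula, so treat this as a sketch of that assumed definition; the function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a single category carries no uncertainty
    return entropy_bits(counts) / math.log2(k)


if __name__ == "__main__":
    # 'Unnamed: 9' has two values with counts 24 and 4 in this report.
    print(round(entropy_bits([24, 4]), 3))
```

A ratio near 1 means values are spread almost evenly across categories; a ratio near 0 means a few categories dominate.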

Fatal (Y/N) high anthropic:default

This column is a binary fatality flag for incidents, expected to hold only 'Y' or 'N' values. The dominant value is 'N' with 4,439 occurrences (68.7% of all rows, 75.0% of non-null values), and 'Y' accounts for 1,400 cases (21.7% of all rows, 23.7% of non-null values). There are data quality issues: 5 rows contain clearly erroneous values ('F' twice, 'M', '2017', and a lowercase 'y'), suggesting data entry errors or row misalignment, and 71 rows are labeled 'UNKNOWN'. The 8.46% null rate adds further incompleteness.
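The flag can be normalised, and the two kinds of percentage kept apart, with a sketch like the following. Treating a lowercase 'y' as 'Y' and bucketing everything outside 'Y', 'N' and 'UNKNOWN' as 'INVALID' are choices made here for illustration; the helper names are hypothetical.

```python
def normalize_fatal(value):
    """Map raw flag values to 'Y', 'N', 'UNKNOWN', 'INVALID', or None for nulls."""
    if value is None:
        return None
    token = str(value).strip().upper()
    if token in ("Y", "N", "UNKNOWN"):
        return token
    return "INVALID"  # e.g. 'F', 'M', '2017' in this report


def share(count, denominator):
    """Percentage of count over denominator, rounded to one decimal."""
    return round(100 * count / denominator, 1)


if __name__ == "__main__":
    total_rows, nulls, n_count = 6462, 547, 4439
    print(share(n_count, total_rows), share(n_count, total_rows - nulls))
```

The same count gives 68.7% against all rows and 75.0% against non-null rows, which is why the value table and top_rate differ.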

Numeric correlation

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
                index   Year    original order
index           +1.00   -0.40   -0.97
Year            -0.40   +1.00   +0.38
original order  -0.97   +0.38   +1.00
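For reference, a Pearson coefficient like those in the matrix above can be computed as sketched below. Because 'original order' is null in over half the rows, this version drops any pair with a missing value; whether saturn handles nulls pairwise in the same way is an assumption, and `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over pairs where both values are present.

    Returns None when fewer than two complete pairs exist or a column is constant.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)
```

The strong negative value between index and 'original order' (-0.97) indicates the file is stored in roughly the reverse of its original ordering wherever that ordering was recorded.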

Languages detected

Per-string language detection across text columns (sampled).

Per-language counts (total 16,164 detected strings).
lang  count   share
en    14,597  90.3%
de    627     3.9%
es    219     1.4%
fr    177     1.1%
it    123     0.8%
pt    80      0.5%
nl    45      0.3%
ru    31      0.2%
fi    26      0.2%
sv    23      0.1%
zh    22      0.1%
pl    18      0.1%
ja    16      0.1%
id    15      0.1%
ca    14      0.1%
ceb   13      0.1%
tr    13      0.1%
eu    13      0.1%
vi    12      0.1%
cy    10      0.1%
ro    9       0.1%
ms    8       0.0%
eo    7       0.0%
sq    7       0.0%
af    5       0.0%
hu    5       0.0%
hr    4       0.0%
jbo   4       0.0%
war   4       0.0%
cs    4       0.0%
lv    3       0.0%
sh    3       0.0%
sw    2       0.0%
tl    1       0.0%
th    1       0.0%
no    1       0.0%
sl    1       0.0%
uk    1       0.0%

index numeric

rows: 6,462
null: 0 (0.0%)
unique: 6,462
min: 0.000
max: 6,461
mean: 3,230
median: 3,230
std: 1,866
q1: 1,615
q3: 4,846
iqr: 3,230
skew: 0.000
kurtosis: -1.200
n_outliers: 0
outlier_rate: 0.000
zero_rate: 1.55e-04
Histogram bins for index (median: 3230.5).
bin: count
0 – 161.5: 162
161.5 – 323.1: 162
323.1 – 484.6: 161
484.6 – 646.1: 162
646.1 – 807.6: 161
807.6 – 969.2: 162
969.2 – 1131: 161
1131 – 1292: 162
1292 – 1454: 161
1454 – 1615: 162
1615 – 1777: 161
1777 – 1938: 162
1938 – 2100: 161
2100 – 2261: 162
2261 – 2423: 161
2423 – 2584: 162
2584 – 2746: 161
2746 – 2907: 162
2907 – 3069: 161
3069 – 3230: 162
3230 – 3392: 162
3392 – 3554: 161
3554 – 3715: 162
3715 – 3877: 161
3877 – 4038: 162
4038 – 4200: 161
4200 – 4361: 162
4361 – 4523: 161
4523 – 4684: 162
4684 – 4846: 161
4846 – 5007: 162
5007 – 5169: 161
5169 – 5330: 162
5330 – 5492: 161
5492 – 5653: 162
5653 – 5815: 161
5815 – 5976: 162
5976 – 6138: 161
6138 – 6299: 162
6299 – 6461: 162

Case Number text

99.7% of rows are unique strings; 99.9% of rows are a single word; 74.8% of rows are all-caps; 95th-percentile length under 20 chars
rows: 6,462
null: 2 (0.0%)
unique: 6,442
len_min: 6
len_max: 18
len_mean: 10.627
len_median: 10.000
len_p95: 12.000
word_mean: 1.001
word_median: 1.000
n_empty: 0
n_duplicates: 18
duplicate_rate: 2.79e-03
vocab_size: 6,445
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.999
allcaps_rate: 0.748
boilerplate_rate: 0.000
Character-length distribution for Case Number (mean: 10.627). Each non-empty bin covers a single integer length; lengths 8, 9, and 17 have no rows.
chars: count
6: 4
7: 122
10: 4150
11: 14
12: 2128
13: 14
14: 24
15: 2
16: 1
18: 1
Sample values (first 10)
  1. 2020.01.05
  2. 1942.11.01
  3. 1910.03.08
  4. 1926.08.24
  5. 1992.06.17
  6. 1830.07.26
  7. 1990.00.00
  8. 1896.12.17.R
  9. 1925.11.22
  10. 2017.10.21

Date text

87.3% of rows are a single word
rows: 6,462
null: 1 (0.0%)
unique: 5,552
len_min: 4
len_max: 64
len_mean: 11.425
len_median: 11.000
len_p95: 20.000
word_mean: 1.155
word_median: 1.000
n_empty: 0
n_duplicates: 909
duplicate_rate: 0.141
vocab_size: 5,496
readability_flesch_mean: 89.927
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.873
allcaps_rate: 0.051
boilerplate_rate: 0.000
Character-length distribution for Date (mean: 11.425).
chars: count
4 – 6: 308
6 – 7: 0
7 – 8: 347
8 – 10: 31
10 – 12: 5088
12 – 13: 33
13 – 14: 43
14 – 16: 4
16 – 18: 7
18 – 19: 5
19 – 20: 552
20 – 22: 5
22 – 24: 5
24 – 25: 7
25 – 26: 8
26 – 28: 0
28 – 30: 1
30 – 31: 0
31 – 32: 3
32 – 34: 1
34 – 36: 2
36 – 37: 1
37 – 38: 2
38 – 40: 0
40 – 42: 1
42 – 43: 1
43 – 44: 1
44 – 46: 0
46 – 48: 0
48 – 49: 1
49 – 50: 0
50 – 52: 0
52 – 54: 0
54 – 55: 2
55 – 56: 1
56 – 58: 0
58 – 60: 0
60 – 61: 0
61 – 62: 0
62 – 64: 1
Sample values (first 10)
  1. 05-Jan-2020
  2. Nov-1942
  3. Mar-1910
  4. 23-Jul-1926
  5. 17-Jun-1992
  6. 26-Jul-1830
  7. 19-Dec-1989
  8. Reported 17-Dec-1896
  9. 22-Nov-1925
  10. 21-Oct-2017

Year numeric

skew=-6.55
rows: 6,462
null: 3 (0.0%)
unique: 252
min: 0.000
max: 3,019
mean: 1,930
median: 1,980
std: 278.316
q1: 1,943
q3: 2,006
iqr: 63.000
skew: -6.554
kurtosis: 42.541
n_outliers: 266
outlier_rate: 0.041
zero_rate: 0.019
Histogram bins for Year (median: 1980.0).
bin: count
0 – 75.47: 126
75.47 – 150.9: 1
150.9 – 226.4: 0
226.4 – 301.9: 0
301.9 – 377.4: 0
377.4 – 452.8: 0
452.8 – 528.3: 1
528.3 – 603.8: 0
603.8 – 679.3: 0
679.3 – 754.8: 0
754.8 – 830.2: 0
830.2 – 905.7: 0
905.7 – 981.2: 0
981.2 – 1057: 0
1057 – 1132: 0
1132 – 1208: 0
1208 – 1283: 0
1283 – 1359: 0
1359 – 1434: 0
1434 – 1510: 0
1510 – 1585: 4
1585 – 1660: 6
1660 – 1736: 7
1736 – 1811: 37
1811 – 1887: 371
1887 – 1962: 1975
1962 – 2038: 3930
2038 – 2113: 0
2113 – 2189: 0
2189 – 2264: 0
2264 – 2340: 0
2340 – 2415: 0
2415 – 2491: 0
2491 – 2566: 0
2566 – 2642: 0
2642 – 2717: 0
2717 – 2793: 0
2793 – 2868: 0
2868 – 2944: 0
2944 – 3019: 1

Type categorical

rows: 6,462
null: 5 (0.1%)
unique: 12
top_value: Unprovoked
top_rate: 0.730
cardinality: 12
entropy: 1.457
entropy_ratio: 0.406
Top values for Type (12 unique shown, of 12 total).
value                count   share of rows
Unprovoked           4,716   73.0%
Provoked             593     9.2%
Invalid              552     8.5%
Sea Disaster         239     3.7%
Watercraft           142     2.2%
Boat                 109     1.7%
Boating              92      1.4%
Questionable         10      0.2%
Unconfirmed          1       0.0%
Unverified           1       0.0%
Under investigation  1       0.0%
Boatomg              1       0.0%

Country categorical

rows: 6,462
null: 51 (0.8%)
unique: 205
top_value: USA
top_rate: 0.360
cardinality: 205
entropy: 3.909
entropy_ratio: 0.509
Top values for Country (20 unique shown, of 205 total).
value             count   share of rows
USA               2,310   35.7%
AUSTRALIA         1,374   21.3%
SOUTH AFRICA      585     9.1%
NEW ZEALAND       135     2.1%
PAPUA NEW GUINEA  135     2.1%
BAHAMAS           115     1.8%
BRAZIL            113     1.7%
MEXICO            95      1.5%
ITALY             71      1.1%
PHILIPPINES       62      1.0%
FIJI              62      1.0%
REUNION           60      0.9%
NEW CALEDONIA     56      0.9%
CUBA              46      0.7%
SPAIN             44      0.7%
MOZAMBIQUE        44      0.7%
EGYPT             42      0.6%
INDIA             40      0.6%
JAPAN             34      0.5%
CROATIA           34      0.5%

Area categorical

540 singleton categories
rows: 6,462
null: 463 (7.2%)
unique: 810
top_value: Florida
top_rate: 0.179
cardinality: 810
entropy: 6.163
entropy_ratio: 0.638
Top values for Area (20 unique shown, of 810 total).
value                  count   share of rows
Florida                1,076   16.7%
New South Wales        498     7.7%
Queensland             325     5.0%
Hawaii                 312     4.8%
California             294     4.5%
KwaZulu-Natal          215     3.3%
Western Australia      197     3.0%
Western Cape Province  195     3.0%
Eastern Cape Province  165     2.6%
South Carolina         163     2.5%
North Carolina         111     1.7%
South Australia        104     1.6%
Victoria               92      1.4%
Texas                  75      1.2%
Pernambuco             74      1.1%
Torres Strait          72      1.1%
North Island           70      1.1%
New Jersey             55      0.9%
Tasmania               42      0.6%
South Island           41      0.6%

Location text

31 languages detected in sample; 29.9% duplicate strings
rows: 6,462
null: 545 (8.4%)
unique: 4,148
len_min: 3
len_max: 119
len_mean: 22.755
len_median: 21.000
len_p95: 47.000
word_mean: 3.540
word_median: 3.000
n_empty: 0
n_duplicates: 1,769
duplicate_rate: 0.299
vocab_size: 4,483
readability_flesch_mean: 53.399
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.149
allcaps_rate: 5.07e-04
boilerplate_rate: 0.000
Character-length distribution for Location (mean: 22.755).
chars: count
3 – 6: 106
6 – 9: 542
9 – 12: 758
12 – 15: 720
15 – 18: 445
18 – 20: 358
20 – 23: 352
23 – 26: 393
26 – 29: 524
29 – 32: 293
32 – 35: 505
35 – 38: 203
38 – 41: 152
41 – 44: 124
44 – 46: 124
46 – 49: 95
49 – 52: 54
52 – 55: 52
55 – 58: 26
58 – 61: 17
61 – 64: 21
64 – 67: 14
67 – 70: 8
70 – 73: 5
73 – 76: 2
76 – 78: 4
78 – 81: 6
81 – 84: 1
84 – 87: 2
87 – 90: 1
90 – 93: 1
93 – 96: 0
96 – 99: 1
99 – 102: 1
102 – 104: 2
104 – 107: 0
107 – 110: 3
110 – 113: 1
113 – 116: 0
116 – 119: 1
Sample values (first 10)
  1. Esperance
  2. Mornington Island, Gulf of Carpentaria
  3. Tripoli
  4. Rosebud
  5. Boa Viagem, Recife
  6. Sydney Harbor
  7. North Jetty Park, Fort Pierce, St Lucie County
  8. Caibarien Harbor
  9. Middle Brighton, Port Phillip
  10. Gars Garabulli

Activity text

62.9% of rows are a single word; 74.3% duplicate strings
rows: 6,462
null: 552 (8.5%)
unique: 1,516
len_min: 1
len_max: 254
len_mean: 16.208
len_median: 8.000
len_p95: 49.000
word_mean: 2.497
word_median: 1.000
n_empty: 0
n_duplicates: 4,394
duplicate_rate: 0.743
vocab_size: 2,244
readability_flesch_mean: 39.558
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.629
allcaps_rate: 5.08e-04
boilerplate_rate: 0.000
Character-length distribution for Activity (mean: 16.208).
chars: count
1 – 7: 2033
7 – 14: 2150
14 – 20: 468
20 – 26: 376
26 – 33: 228
33 – 39: 177
39 – 45: 134
45 – 52: 84
52 – 58: 42
58 – 64: 51
64 – 71: 35
71 – 77: 25
77 – 83: 20
83 – 90: 10
90 – 96: 9
96 – 102: 14
102 – 109: 10
109 – 115: 6
115 – 121: 5
121 – 128: 1
128 – 134: 1
134 – 140: 7
140 – 146: 4
146 – 153: 4
153 – 159: 3
159 – 165: 0
165 – 172: 2
172 – 178: 1
178 – 184: 0
184 – 191: 2
191 – 197: 3
197 – 203: 0
203 – 210: 0
210 – 216: 1
216 – 222: 0
222 – 229: 0
229 – 235: 2
235 – 241: 1
241 – 248: 0
248 – 254: 1
Sample values (first 10)
  1. Scuba diving
  2. Swimming
  3. Swimming
  4. Launching rowboat through the surf
  5. Surfing
  6. Wreck of the USS Somers
  7. Swimming
  8. Sea Disaster
  9. Fishing
  10. Spearfishing

Name text

rows: 6,462
null: 215 (3.3%)
unique: 5,339
len_min: 1
len_max: 221
len_mean: 14.830
len_median: 13.000
len_p95: 35.000
word_mean: 2.453
word_median: 2.000
n_empty: 0
n_duplicates: 908
duplicate_rate: 0.145
vocab_size: 6,536
readability_flesch_mean: 51.734
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.166
allcaps_rate: 6.88e-03
boilerplate_rate: 0.000
Character-length distribution for Name (mean: 14.830).
chars: count
1 – 6: 939
6 – 12: 1355
12 – 18: 2774
18 – 23: 488
23 – 28: 223
28 – 34: 128
34 – 40: 94
40 – 45: 47
45 – 50: 66
50 – 56: 34
56 – 62: 25
62 – 67: 26
67 – 72: 20
72 – 78: 6
78 – 84: 3
84 – 89: 8
89 – 94: 2
94 – 100: 1
100 – 106: 1
106 – 111: 2
111 – 116: 1
116 – 122: 0
122 – 128: 0
128 – 133: 1
133 – 138: 1
138 – 144: 1
144 – 150: 0
150 – 155: 0
155 – 160: 0
160 – 166: 0
166 – 172: 0
172 – 177: 0
177 – 182: 0
182 – 188: 0
188 – 194: 0
194 – 199: 0
199 – 204: 0
204 – 210: 0
210 – 216: 0
216 – 221: 1
Sample values (first 10)
  1. Peter ___
  2. sailor
  3. Jules Antoine
  4. Mrs. Hoskin
  5. Siale Sime
  6. male
  7. Todd R. Wenke
  8. male
  9. male
  10. Susan Peteka

Unnamed: 9 categorical

99.6% null
rows: 6,462
null: 6,434 (99.6%)
unique: 2
top_value: M
top_rate: 0.857
cardinality: 2
entropy: 0.592
entropy_ratio: 0.592
Top values for Unnamed: 9 (2 unique shown, of 2 total).
value  count  share of rows
M      24     0.4%
F      4      0.1%

Age categorical

44.4% null
rows: 6,462
null: 2,871 (44.4%)
unique: 154
top_value: 17
top_rate: 0.045
cardinality: 154
entropy: 5.827
entropy_ratio: 0.802
Top values for Age (20 unique shown, of 154 total).
value  count  share of rows
17     160    2.5%
18     155    2.4%
20     146    2.3%
19     145    2.2%
16     140    2.2%
15     140    2.2%
21     122    1.9%
22     118    1.8%
25     111    1.7%
24     109    1.7%
14     104    1.6%
13     97     1.5%
26     85     1.3%
23     83     1.3%
28     82     1.3%
30     80     1.2%
29     80     1.2%
27     79     1.2%
12     75     1.2%
32     72     1.1%

Injury text

10 languages detected in sample; 13.1% of rows are all-caps; 41.9% duplicate strings
rows: 6,462
null: 29 (0.4%)
unique: 3,738
len_min: 5
len_max: 234
len_mean: 31.529
len_median: 25.000
len_p95: 82.000
word_mean: 5.414
word_median: 4.000
n_empty: 0
n_duplicates: 2,695
duplicate_rate: 0.419
vocab_size: 2,550
readability_flesch_mean: 53.742
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.149
allcaps_rate: 0.131
boilerplate_rate: 0.000
Character-length distribution for Injury (mean: 31.529).
chars: count
5 – 11: 1206
11 – 16: 740
16 – 22: 851
22 – 28: 788
28 – 34: 636
34 – 39: 429
39 – 45: 382
45 – 51: 251
51 – 57: 238
57 – 62: 200
62 – 68: 132
68 – 74: 128
74 – 79: 100
79 – 85: 73
85 – 91: 55
91 – 97: 47
97 – 102: 45
102 – 108: 31
108 – 114: 13
114 – 120: 22
120 – 125: 15
125 – 131: 6
131 – 137: 9
137 – 142: 10
142 – 148: 4
148 – 154: 2
154 – 160: 2
160 – 165: 0
165 – 171: 3
171 – 177: 3
177 – 182: 3
182 – 188: 1
188 – 194: 0
194 – 200: 2
200 – 205: 2
205 – 211: 1
211 – 217: 0
217 – 223: 1
223 – 228: 0
228 – 234: 2
Sample values (first 10)
  1. FATAL
  2. Hip bitten
  3. Right hand severely bitten by netted shark PROVOKED INCIDENT
  4. Arm injured
  5. No details
  6. FATAL
  7. No injury, board broken in two
  8. FATAL
  9. Thigh lacerated
  10. Minor injuries

Fatal (Y/N) categorical

rows: 6,462
null: 547 (8.5%)
unique: 7
top_value: N
top_rate: 0.750
cardinality: 7
entropy: 0.890
entropy_ratio: 0.317
Top values for Fatal (Y/N) (7 unique shown, of 7 total).
value    count   share of rows
N        4,439   68.7%
Y        1,400   21.7%
UNKNOWN  71      1.1%
F        2       0.0%
M        1       0.0%
2017     1       0.0%
y        1       0.0%

Time categorical

199 singleton categories; 52.5% null
rows: 6,462
null: 3,392 (52.5%)
unique: 366
top_value: Afternoon
top_rate: 0.063
cardinality: 366
entropy: 6.559
entropy_ratio: 0.770
Top values for Time (20 unique shown, of 366 total).
value      count  share of rows
Afternoon  193    3.0%
11h00      131    2.0%
Morning    126    1.9%
12h00      113    1.7%
15h00      111    1.7%
16h00      106    1.6%
14h00      102    1.6%
16h30      79     1.2%
17h30      77     1.2%
13h00      75     1.2%
17h00      74     1.1%
14h30      73     1.1%
18h00      72     1.1%
15h30      67     1.0%
11h30      65     1.0%
13h30      64     1.0%
10h00      63     1.0%
Night      63     1.0%
09h00      55     0.9%
10h30      51     0.8%

Species text

15 languages detected in sample; 45.2% null; 58.6% duplicate strings
rows: 6,462
null: 2,924 (45.2%)
unique: 1,466
len_min: 3
len_max: 194
len_mean: 22.951
len_median: 17.000
len_p95: 50.000
word_mean: 4.445
word_median: 4.000
n_empty: 0
n_duplicates: 2,072
duplicate_rate: 0.586
vocab_size: 1,105
readability_flesch_mean: 88.631
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.041
allcaps_rate: 2.83e-04
boilerplate_rate: 0.000
Character-length distribution for Species (mean: 22.951).
chars: count
3 – 8: 105
8 – 13: 855
13 – 17: 833
17 – 22: 448
22 – 27: 324
27 – 32: 285
32 – 36: 120
36 – 41: 127
41 – 46: 115
46 – 51: 170
51 – 56: 32
56 – 60: 25
60 – 65: 21
65 – 70: 13
70 – 75: 17
75 – 79: 11
79 – 84: 7
84 – 89: 4
89 – 94: 6
94 – 98: 4
98 – 103: 2
103 – 108: 0
108 – 113: 1
113 – 118: 1
118 – 122: 0
122 – 127: 2
127 – 132: 1
132 – 137: 1
137 – 141: 1
141 – 146: 1
146 – 151: 0
151 – 156: 2
156 – 161: 1
161 – 165: 1
165 – 170: 0
170 – 175: 0
175 – 180: 0
180 – 184: 1
184 – 189: 0
189 – 194: 1
Sample values (first 10)
  1. White shark
  2. Questionable incident
  3. Questionable incident
  4. Carpet shark, 5'
  5. Bull shark
  6. Invalid
  7. 1 m shark
  8. Shark involvement prior to death unconfirmed
  9. "Attacked by a number of sharks"
  10. 2 m shark

Investigator or Source text

23 languages detected in sample; 22.7% duplicate strings
rows: 6,462
null: 19 (0.3%)
unique: 4,979
len_min: 3
len_max: 210
len_mean: 32.237
len_median: 26.000
len_p95: 77.000
word_mean: 4.792
word_median: 3.000
n_empty: 0
n_duplicates: 1,464
duplicate_rate: 0.227
vocab_size: 7,898
readability_flesch_mean: 73.623
emoji_rate: 0.000
url_rate: 2.33e-03
one_word_rate: 0.026
allcaps_rate: 0.018
boilerplate_rate: 0.000
Character-length distribution for Investigator or Source (mean: 32.237).
chars | count
3 – 8 | 152
8 – 13 | 461
13 – 19 | 1164
19 – 24 | 844
24 – 29 | 1098
29 – 34 | 816
34 – 39 | 400
39 – 44 | 315
44 – 50 | 203
50 – 55 | 174
55 – 60 | 150
60 – 65 | 142
65 – 70 | 116
70 – 75 | 59
75 – 81 | 61
81 – 86 | 45
86 – 91 | 39
91 – 96 | 40
96 – 101 | 30
101 – 106 | 23
106 – 112 | 23
112 – 117 | 18
117 – 122 | 14
122 – 127 | 14
127 – 132 | 9
132 – 138 | 8
138 – 143 | 5
143 – 148 | 5
148 – 153 | 4
153 – 158 | 3
158 – 163 | 4
163 – 169 | 0
169 – 174 | 0
174 – 179 | 1
179 – 184 | 0
184 – 189 | 1
189 – 194 | 1
194 – 200 | 0
200 – 205 | 0
205 – 210 | 1
Sample values (first 10)
  1. B. Myatt, GSAF
  2. M. Murphy; V.M. Coppleson (1962), pp.207-208
  3. The Sun, 4/3/1910; Authenticity questioned by G.H. Balazs in J. Borg, p.70
  4. NY Herald Tribune, 7/25/1926; A. De Maddalena; Anon. (1926a), Anon. (1926b); C. Moore, GSAF
  5. Charlotte Observer, 6/24/1992, p.1C & 8/8/1992, p.2C
  6. C. Black, GSAF; Sydney Gazette, 1/22/1831
  7. Courier-Mail, 11/24/1989, p.3; J. West, ASAF
  8. The Star, 12/17/1896
  9. V.M. Coppleson.W2, (1933); V.M. Coppleson (1958), pp.111 & 241; West Australia, 1/5/1967; A. Sharpe, pp.129-130; H. Edwards, pp.131-133
  10. New Zealand Herald, 10/22/2017

pdf text

99.6% of rows are unique strings; 98.7% rows are a single word; 52.6% null
rows: 6,462
null: 3,396 (52.6%)
unique: 3,054
len_min: 10
len_max: 41
len_mean: 23.731
len_median: 23.000
len_p95: 31.000
word_mean: 1.022
word_median: 1.000
n_empty: 0
n_duplicates: 12
duplicate_rate: 3.91e-03
vocab_size: 3,098
readability_flesch_mean: -66.809
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.987
allcaps_rate: 3.26e-04
boilerplate_rate: 0.000
Character-length distribution for pdf (mean: 23.731).
chars | count
10 – 11 | 2
11 – 12 | 0
12 – 12 | 0
12 – 13 | 1
13 – 14 | 0
14 – 15 | 4
15 – 15 | 0
15 – 16 | 4
16 – 17 | 0
17 – 18 | 14
18 – 19 | 49
19 – 19 | 139
19 – 20 | 277
20 – 21 | 0
21 – 22 | 438
22 – 22 | 473
22 – 23 | 391
23 – 24 | 0
24 – 25 | 275
25 – 26 | 229
26 – 26 | 159
26 – 27 | 134
27 – 28 | 0
28 – 29 | 112
29 – 29 | 86
29 – 30 | 72
30 – 31 | 0
31 – 32 | 60
32 – 32 | 50
32 – 33 | 31
33 – 34 | 27
34 – 35 | 0
35 – 36 | 13
36 – 36 | 8
36 – 37 | 6
37 – 38 | 0
38 – 39 | 3
39 – 39 | 2
39 – 40 | 4
40 – 41 | 3
Sample values (first 10)
  1. 2020.01.12-Malten.pdf
  2. 1900.09.05-Hartman.pdf
  3. 1872.11.30.R-MalayPirates.pdf
  4. 1885.04.16.R-GermanShip.pdf
  5. 1948.12.26-Keys.pdf
  6. ND-0094-HaeNyeo.pdf
  7. 1947.05.13.R-Kenya.pdf
  8. 1857.05.05-Dunn.pdf
  9. 1884.08.18-Rylor.pdf
  10. 1971.11.25.R-Chan.pdf
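
The pdf samples follow two visible naming shapes: a dotted date with an optional suffix, then a name ("1872.11.30.R-MalayPirates.pdf"), and an undated "ND-0094-HaeNyeo.pdf" form. The sketch below splits a file name into a case-number-like key and a name part. The mapping of "ND-0094" to "ND.0094" is inferred from the Case Number.1 samples further down, and both regular expressions are assumptions based on the ten samples only, not a documented GSAF naming rule.

```python
import re
from typing import Optional, Tuple

# Dated form: YYYY.MM.DD, optional ".suffix", then "-Name.pdf".
DATED = re.compile(r"^(\d{4}\.\d{2}\.\d{2}(?:\.[A-Za-z0-9]+)?)-(.+)\.pdf$")

# Undated form: "ND-<digits>-Name.pdf".
UNDATED = re.compile(r"^ND-(\d+)-(.+)\.pdf$")


def split_pdf_name(filename: str) -> Optional[Tuple[str, str]]:
    """Return (case_key, name) for a pdf file name, or None if it fits neither shape."""
    match = DATED.match(filename)
    if match:
        return match.group(1), match.group(2)
    match = UNDATED.match(filename)
    if match:
        # Assumed convention: the case number uses a dot where the file name uses a hyphen.
        return "ND." + match.group(1), match.group(2)
    return None


print(split_pdf_name("1872.11.30.R-MalayPirates.pdf"))  # ('1872.11.30.R', 'MalayPirates')
print(split_pdf_name("ND-0094-HaeNyeo.pdf"))            # ('ND.0094', 'HaeNyeo')
```

File names that return None are candidates for the handful of irregular entries that the length distribution hints at.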

href formula text

99.6% of rows are unique strings; 98.8% rows are a single word; 100.0% rows contain a URL; 52.6% null
rows: 6,462
null: 3,400 (52.6%)
unique: 3,051
len_min: 64
len_max: 95
len_mean: 77.729
len_median: 77.000
len_p95: 85.000
word_mean: 1.019
word_median: 1.000
n_empty: 0
n_duplicates: 11
duplicate_rate: 3.59e-03
vocab_size: 3,089
readability_flesch_mean: -820.177
emoji_rate: 0.000
url_rate: 1.000
one_word_rate: 0.988
allcaps_rate: 0.000
boilerplate_rate: 0.000
Character-length distribution for href formula (mean: 77.729).
chars | count
64 – 65 | 1
65 – 66 | 0
66 – 66 | 0
66 – 67 | 1
67 – 68 | 0
68 – 69 | 4
69 – 69 | 0
69 – 70 | 5
70 – 71 | 0
71 – 72 | 14
72 – 73 | 49
73 – 73 | 139
73 – 74 | 276
74 – 75 | 0
75 – 76 | 437
76 – 76 | 473
76 – 77 | 392
77 – 78 | 0
78 – 79 | 275
79 – 80 | 228
80 – 80 | 159
80 – 81 | 134
81 – 82 | 0
82 – 83 | 112
83 – 83 | 86
83 – 84 | 72
84 – 85 | 0
85 – 86 | 60
86 – 86 | 48
86 – 87 | 31
87 – 88 | 27
88 – 89 | 0
89 – 90 | 13
90 – 90 | 8
90 – 91 | 6
91 – 92 | 0
92 – 93 | 3
93 – 93 | 2
93 – 94 | 4
94 – 95 | 3
Sample values (first 10)
  1. http://sharkattackfile.net/spreadsheets/pdf_directory/2020.01.12-Malten.pdf
  2. http://sharkattackfile.net/spreadsheets/pdf_directory/1900.08.21-Burriss.pdf
  3. http://sharkattackfile.net/spreadsheets/pdf_directory/1872.11.30.R-MalayPirates.pdf
  4. http://sharkattackfile.net/spreadsheets/pdf_directory/1885.04.16.R-GermanShip.pdf
  5. http://sharkattackfile.net/spreadsheets/pdf_directory/1948.12.14.a-Jeppeson.pdf
  6. http://sharkattackfile.net/spreadsheets/pdf_directory/ND-0094-HaeNyeo.pdf
  7. http://sharkattackfile.net/spreadsheets/pdf_directory/1947.04.06-Watt.pdf
  8. http://sharkattackfile.net/spreadsheets/pdf_directory/1856.11.25.R-Fiji.pdf
  9. http://sharkattackfile.net/spreadsheets/pdf_directory/1884.08.18-Rylor.pdf
  10. http://sharkattackfile.net/spreadsheets/pdf_directory/1971.09.25-Horner.pdf

href text

99.6% of rows are unique strings; 98.8% rows are a single word; 100.0% rows contain a URL; 52.6% null
rows: 6,462
null: 3,400 (52.6%)
unique: 3,051
len_min: 64
len_max: 135
len_mean: 77.887
len_median: 77.000
len_p95: 86.000
word_mean: 1.020
word_median: 1.000
n_empty: 0
n_duplicates: 11
duplicate_rate: 3.59e-03
vocab_size: 3,091
readability_flesch_mean: -824.407
emoji_rate: 0.000
url_rate: 1.000
one_word_rate: 0.988
allcaps_rate: 0.000
boilerplate_rate: 0.000
Character-length distribution for href (mean: 77.887).
chars | count
64 – 66 | 1
66 – 68 | 1
68 – 69 | 4
69 – 71 | 19
71 – 73 | 49
73 – 75 | 415
75 – 76 | 908
76 – 78 | 666
78 – 80 | 224
80 – 82 | 290
82 – 84 | 199
84 – 85 | 131
85 – 87 | 80
87 – 89 | 27
89 – 91 | 21
91 – 92 | 9
92 – 94 | 6
94 – 96 | 3
96 – 98 | 0
98 – 100 | 0
100 – 101 | 0
101 – 103 | 0
103 – 105 | 0
105 – 107 | 0
107 – 108 | 0
108 – 110 | 0
110 – 112 | 0
112 – 114 | 0
114 – 115 | 0
115 – 117 | 0
117 – 119 | 0
119 – 121 | 0
121 – 123 | 0
123 – 124 | 0
124 – 126 | 0
126 – 128 | 0
128 – 130 | 1
130 – 131 | 4
131 – 133 | 2
133 – 135 | 2
Sample values (first 10)
  1. http://sharkattackfile.net/spreadsheets/pdf_directory/2020.01.12-Malten.pdf
  2. http://sharkattackfile.net/spreadsheets/pdf_directory/1900.08.21-Burriss.pdf
  3. http://sharkattackfile.net/spreadsheets/pdf_directory/1872.11.30.R-MalayPirates.pdf
  4. http://sharkattackfile.net/spreadsheets/pdf_directory/1885.04.16.R-GermanShip.pdf
  5. http://sharkattackfile.net/spreadsheets/pdf_directory/1948.12.14.a-Jeppeson.pdf
  6. http://sharkattackfile.net/spreadsheets/pdf_directory/ND-0094-HaeNyeo.pdf
  7. http://sharkattackfile.net/spreadsheets/pdf_directory/1947.04.06-Watt.pdf
  8. http://sharkattackfile.net/spreadsheets/pdf_directory/1856.11.25.R-Fiji.pdf
  9. http://sharkattackfile.net/spreadsheets/pdf_directory/1884.08.18-Rylor.pdf
  10. http://sharkattackfile.net/spreadsheets/pdf_directory/1971.09.25-Horner.pdf
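
The href and href formula columns report the same null count and the same 3,051 unique values, yet href has a longer maximum length (135 versus 95 characters), so a small number of rows presumably differ. The sampled URLs also all end in a file name of the kind stored in pdf. A row-level consistency check could be sketched as follows; the helper name is hypothetical, and treating the last path segment as the pdf file name is an assumption drawn from the samples.

```python
from posixpath import basename
from typing import Optional
from urllib.parse import urlsplit


def href_matches_pdf(href: Optional[str], pdf: Optional[str]) -> Optional[bool]:
    """Compare the last path segment of href with the pdf file name.

    Returns None when either value is missing, so nulls are not counted as mismatches.
    """
    if href is None or pdf is None:
        return None
    file_name = basename(urlsplit(href.strip()).path)
    return file_name == pdf.strip()


url = "http://sharkattackfile.net/spreadsheets/pdf_directory/2020.01.12-Malten.pdf"
print(href_matches_pdf(url, "2020.01.12-Malten.pdf"))   # True
print(href_matches_pdf(url, "1900.09.05-Hartman.pdf"))  # False
```

The same function can be applied to href formula against pdf, or to href against href formula after extracting the file name from both.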

Case Number.1 text

99.7% of rows are unique strings; 99.8% rows are a single word; 79.0% rows are all-caps; 52.6% null; 95th-percentile length under 20 chars
rows: 6,462
null: 3,400 (52.6%)
unique: 3,054
len_min: 7
len_max: 18
len_mean: 10.591
len_median: 10.000
len_p95: 12.000
word_mean: 1.002
word_median: 1.000
n_empty: 0
n_duplicates: 8
duplicate_rate: 2.61e-03
vocab_size: 3,057
readability_flesch_mean: 121.215
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.998
allcaps_rate: 0.790
boilerplate_rate: 0.000
Character-length distribution for Case Number.1 (mean: 10.591).
chars | count
7 – 7 | 120
7 – 8 | 0
8 – 8 | 0
8 – 8 | 0
8 – 8 | 0
8 – 9 | 0
9 – 9 | 0
9 – 9 | 4
9 – 9 | 0
9 – 10 | 0
10 – 10 | 1884
10 – 10 | 0
10 – 11 | 0
11 – 11 | 0
11 – 11 | 7
11 – 11 | 0
11 – 12 | 0
12 – 12 | 0
12 – 12 | 1011
12 – 12 | 0
12 – 13 | 0
13 – 13 | 8
13 – 13 | 0
13 – 14 | 0
14 – 14 | 0
14 – 14 | 24
14 – 14 | 0
14 – 15 | 0
15 – 15 | 0
15 – 15 | 2
15 – 16 | 0
16 – 16 | 0
16 – 16 | 1
16 – 16 | 0
16 – 17 | 0
17 – 17 | 0
17 – 17 | 0
17 – 17 | 0
17 – 18 | 0
18 – 18 | 1
Sample values (first 10)
  1. 2020.01.12
  2. 1900.08.21
  3. 1872.11.30.R
  4. 1885.04.16.R
  5. 1948.12.14.a
  6. ND.0094
  7. 1947.04.06
  8. 1856.11.25.R
  9. 1884.08.18
  10. 1971.09.25

Case Number.2 text

99.8% of rows are unique strings; 99.8% rows are a single word; 79.0% rows are all-caps; 52.6% null; 95th-percentile length under 20 chars
rows: 6,462
null: 3,400 (52.6%)
unique: 3,055
len_min: 7
len_max: 18
len_mean: 10.591
len_median: 10.000
len_p95: 12.000
word_mean: 1.002
word_median: 1.000
n_empty: 0
n_duplicates: 7
duplicate_rate: 2.29e-03
vocab_size: 3,058
readability_flesch_mean: 121.215
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.998
allcaps_rate: 0.790
boilerplate_rate: 0.000
Character-length distribution for Case Number.2 (mean: 10.591).
chars | count
7 – 7 | 120
7 – 8 | 0
8 – 8 | 0
8 – 8 | 0
8 – 8 | 0
8 – 9 | 0
9 – 9 | 0
9 – 9 | 4
9 – 9 | 0
9 – 10 | 0
10 – 10 | 1884
10 – 10 | 0
10 – 11 | 0
11 – 11 | 0
11 – 11 | 7
11 – 11 | 0
11 – 12 | 0
12 – 12 | 0
12 – 12 | 1011
12 – 12 | 0
12 – 13 | 0
13 – 13 | 8
13 – 13 | 0
13 – 14 | 0
14 – 14 | 0
14 – 14 | 24
14 – 14 | 0
14 – 15 | 0
15 – 15 | 0
15 – 15 | 2
15 – 16 | 0
16 – 16 | 0
16 – 16 | 1
16 – 16 | 0
16 – 17 | 0
17 – 17 | 0
17 – 17 | 0
17 – 17 | 0
17 – 18 | 0
18 – 18 | 1
Sample values (first 10)
  1. 2020.01.12
  2. 1900.08.21
  3. 1872.11.30.R
  4. 1885.04.16.R
  5. 1948.12.14.a
  6. ND.0094
  7. 1947.04.06
  8. 1856.11.25.R
  9. 1884.08.18
  10. 1971.09.25
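
Case Number.1 and Case Number.2 share the same length distribution and the same ten sample values, but their unique counts differ by one (3,054 versus 3,055), so at least one row must disagree. A small helper like the one below can list those rows. It is a sketch only: the function name is hypothetical, the example rows are partly made up for illustration, and it assumes stray whitespace should not count as a difference.

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[int, Optional[str], Optional[str]]


def _clean(value: Optional[str]) -> Optional[str]:
    """Strip whitespace and treat empty strings as missing."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def case_number_mismatches(rows: Iterable[Row]) -> List[Row]:
    """Return (row_id, case_number_1, case_number_2) for rows where the two differ."""
    mismatches = []
    for row_id, first, second in rows:
        a, b = _clean(first), _clean(second)
        if a != b:
            mismatches.append((row_id, a, b))
    return mismatches


# Illustrative rows; row 2 is a made-up mismatch, not a value from the dataset.
rows = [
    (0, "2020.01.12", "2020.01.12"),
    (1, "1872.11.30.R", "1872.11.30.R "),
    (2, "1900.08.21", "1900.08.21.b"),
    (3, None, None),
]
print(case_number_mismatches(rows))  # [(2, '1900.08.21', '1900.08.21.b')]
```

If the list of mismatches is short, one of the two columns can likely be dropped after the differing rows are reviewed by hand.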

original order numeric

52.6% null
rows: 6,462
null: 3,400 (52.6%)
unique: 3,061
min: 3.000
max: 6,502
mean: 1,564
median: 1,534
std: 988.410
q1: 768.250
q3: 2,299
iqr: 1,530
skew: 0.988
kurtosis: 3.551
n_outliers: 27
outlier_rate: 8.82e-03
zero_rate: 0.000
Histogram bins for original order (median: 1533.5).
bin | count
3 – 165.5 | 163
165.5 – 327.9 | 162
327.9 – 490.4 | 163
490.4 – 652.9 | 162
652.9 – 815.4 | 163
815.4 – 977.8 | 162
977.8 – 1140 | 163
1140 – 1303 | 162
1303 – 1465 | 163
1465 – 1628 | 162
1628 – 1790 | 163
1790 – 1953 | 162
1953 – 2115 | 163
2115 – 2278 | 162
2278 – 2440 | 163
2440 – 2603 | 162
2603 – 2765 | 163
2765 – 2928 | 162
2928 – 3090 | 110
3090 – 3252 | 0
3252 – 3415 | 0
3415 – 3577 | 0
3577 – 3740 | 0
3740 – 3902 | 0
3902 – 4065 | 0
4065 – 4227 | 0
4227 – 4390 | 0
4390 – 4552 | 0
4552 – 4715 | 0
4715 – 4877 | 0
4877 – 5040 | 0
5040 – 5202 | 0
5202 – 5365 | 0
5365 – 5527 | 0
5527 – 5690 | 0
5690 – 5852 | 0
5852 – 6015 | 0
6015 – 6177 | 0
6177 – 6340 | 0
6340 – 6502 | 27
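
The 27 reported outliers match the 27 values in the last histogram bin (6340 – 6502), which sits well apart from the rest of the values ending near 3090. One way to reproduce that split is Tukey's 1.5 × IQR rule, worked through below with the reported quartiles. Whether saturn actually uses this rule is an assumption, and q3 is taken at its displayed (possibly rounded) value of 2,299.

```python
from typing import Tuple

# Quartiles as reported for "original order".
Q1 = 768.25
Q3 = 2299.0


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = tukey_fences(Q1, Q3)


def is_outlier(value: float) -> bool:
    """True when the value falls outside the Tukey fences."""
    return value < lower or value > upper


print(lower, upper)      # -1527.875 4595.125
print(is_outlier(3090))  # False
print(is_outlier(6340))  # True
```

Under these fences every value in the 6340 – 6502 bin is flagged and nothing at or below 3090 is, which is consistent with the reported n_outliers of 27.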

Unnamed: 23 categorical

2 singleton categories; 100.0% null
rows: 6,462
null: 6,460 (100.0%)
unique: 2
top_value: Teramo
top_rate: 0.500
cardinality: 2
entropy: 1.000
entropy_ratio: 1.000
Top values for Unnamed: 23 (2 unique shown, of 2 total).
value | count | share
Teramo | 1 | 0.0%
change filename | 1 | 0.0%
Top values (rank 1–20)
  1. Teramo — 1
  2. change filename — 1
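
With only two non-null values in 6,462 rows, Unnamed: 23 looks like spreadsheet residue rather than data. A simple way to flag such columns before analysis is a null-rate threshold, sketched below using null counts reported in this profile. The 0.95 threshold and the function name are arbitrary assumptions, and flagged columns should still be inspected before they are dropped.

```python
from typing import Dict, List

TOTAL_ROWS = 6462

# Null counts copied from the per-column stats in this report.
NULL_COUNTS: Dict[str, int] = {
    "Investigator or Source": 19,
    "pdf": 3396,
    "original order": 3400,
    "Unnamed: 23": 6460,
}


def sparse_columns(null_counts: Dict[str, int], total_rows: int,
                   threshold: float = 0.95) -> List[str]:
    """Return the columns whose null rate is at or above the threshold."""
    return [
        name
        for name, n_null in null_counts.items()
        if n_null / total_rows >= threshold
    ]


print(sparse_columns(NULL_COUNTS, TOTAL_ROWS))  # ['Unnamed: 23']
```

The columns at about 52.6% null stay below the threshold here, since roughly half of their rows still carry values.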