saturn·

data trove global shark attack file gsaf

saturn notebook · generated 2026-06-21 Report Notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv

Saturn profiled 6,462 rows across 24 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv",
    "--findings", "data-trove-global-shark-attack-file-gsaf.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This dataset is the Global Shark Attack File (GSAF), containing 6,462 records of shark attack incidents spanning centuries of documented cases. The most important thing to examine first is the attack outcome: roughly 75% of the non-null 'Fatal (Y/N)' values are non-fatal ('N'; 68.7% of all rows), 1,400 are recorded as fatal ('Y'), and the 'Injury' column contains 823 entries simply marked 'FATAL', which is worth cross-checking against the flag for consistency. A second priority is the geographic and activity breakdown: the USA dominates with 2,310 cases (36%), Florida alone accounts for 1,076, and surfing (1,025) and swimming (932) are by far the most frequently recorded activities. The 'Year' column carries a data quality warning: a maximum value of 3019 and high kurtosis signal outliers that should be cleaned before any time-series analysis.

citing: Fatal (Y/N).top_values · Fatal (Y/N).stats.top_rate · Injury.top_values · Country.top_values · Area.top_values · Activity.top_values · Year.stats.max · Year.stats.kurtosis · Year.stats.n_outliers · row_count

Out[4]:

saturn.schema() · 24 columns

column kind n null% unique alerts
index numeric 6,462 0.0% 6,462
Case Number text 6,462 0.0% 6,442 near_unique one_word allcaps short_text
Date text 6,462 0.0% 5,552 one_word
Year numeric 6,462 0.0% 252 high_skew
Type categorical 6,462 0.1% 12
Country categorical 6,462 0.8% 205
Area categorical 6,462 7.2% 810 long_tail
Location text 6,462 8.4% 4,148 multilingual duplicates
Activity text 6,462 8.5% 1,516 one_word duplicates
Name text 6,462 3.3% 5,339
Unnamed: 9 categorical 6,462 99.6% 2 null_rate
Age categorical 6,462 44.4% 154 null_rate
Injury text 6,462 0.4% 3,738 multilingual allcaps duplicates
Fatal (Y/N) categorical 6,462 8.5% 7
Time categorical 6,462 52.5% 366 long_tail null_rate
Species text 6,462 45.2% 1,466 multilingual null_rate duplicates
Investigator or Source text 6,462 0.3% 4,979 multilingual duplicates
pdf text 6,462 52.6% 3,054 near_unique one_word null_rate
href formula text 6,462 52.6% 3,051 near_unique one_word url_heavy null_rate
href text 6,462 52.6% 3,051 near_unique one_word url_heavy null_rate
Case Number.1 text 6,462 52.6% 3,054 near_unique one_word allcaps null_rate short_text
Case Number.2 text 6,462 52.6% 3,055 near_unique one_word allcaps null_rate short_text
original order numeric 6,462 52.6% 3,061 null_rate
Unnamed: 23 categorical 6,462 100.0% 2 long_tail null_rate
Fig 1.
Fatal (Y/N) · Look at the dominant 'N' slice versus fatal 'Y' and check for dirty values like 'M', 'F', and '2017' that indicate data entry errors.
Top values for Fatal (Y/N) (7 unique shown, of 7 total).
value count share
N 4439 68.7%
Y 1400 21.7%
UNKNOWN 71 1.1%
F 2 0.0%
M 1 0.0%
2017 1 0.0%
y 1 0.0%
Fig 2.
Activity · Surfing (1,025) and swimming (932) are the two most frequent activities, together about 30% of all rows. The data table for this figure shows the character-length distribution of the Activity strings, not per-activity counts.
Character-length distribution for Activity (mean: 16.21).
chars count
1 – 7 2033
7 – 14 2150
14 – 20 468
20 – 26 376
26 – 33 228
33 – 39 177
39 – 45 134
45 – 52 84
52 – 58 42
58 – 64 51
64 – 71 35
71 – 77 25
77 – 83 20
83 – 90 10
90 – 96 9
96 – 102 14
102 – 109 10
109 – 115 6
115 – 121 5
121 – 128 1
128 – 134 1
134 – 140 7
140 – 146 4
146 – 153 4
153 – 159 3
159 – 165 0
165 – 172 2
172 – 178 1
178 – 184 0
184 – 191 2
191 – 197 3
197 – 203 0
203 – 210 0
210 – 216 1
216 – 222 0
222 – 229 0
229 – 235 2
235 – 241 1
241 – 248 0
248 – 254 1
Fig 3.
Country · The USA, Australia, and South Africa together dominate the record count; note how sharply the frequency drops after the top three.
Top values for Country (20 unique shown, of 205 total).
value count share
USA 2310 35.7%
AUSTRALIA 1374 21.3%
SOUTH AFRICA 585 9.1%
NEW ZEALAND 135 2.1%
PAPUA NEW GUINEA 135 2.1%
BAHAMAS 115 1.8%
BRAZIL 113 1.7%
MEXICO 95 1.5%
ITALY 71 1.1%
PHILIPPINES 62 1.0%
FIJI 62 1.0%
REUNION 60 0.9%
NEW CALEDONIA 56 0.9%
CUBA 46 0.7%
SPAIN 44 0.7%
MOZAMBIQUE 44 0.7%
EGYPT 42 0.6%
INDIA 40 0.6%
JAPAN 34 0.5%
CROATIA 34 0.5%
Fig 4.
Type · Unprovoked attacks make up nearly 73% of all incidents — compare against provoked, invalid, and sea disaster categories.
Top values for Type (12 unique shown, of 12 total).
value count share
Unprovoked 4716 73.0%
Provoked 593 9.2%
Invalid 552 8.5%
Sea Disaster 239 3.7%
Watercraft 142 2.2%
Boat 109 1.7%
Boating 92 1.4%
Questionable 10 0.2%
Unconfirmed 1 0.0%
Unverified 1 0.0%
Under investigation 1 0.0%
Boatomg 1 0.0%
Fig 5.
Year · Most records cluster in the modern era, but look for the extreme outliers (including a year value of 3019) that skew the distribution.
Histogram bins for Year (median: 1980.0).
bin count
0 – 75.47 126
75.47 – 150.9 1
150.9 – 226.4 0
226.4 – 301.9 0
301.9 – 377.4 0
377.4 – 452.8 0
452.8 – 528.3 1
528.3 – 603.8 0
603.8 – 679.3 0
679.3 – 754.8 0
754.8 – 830.2 0
830.2 – 905.7 0
905.7 – 981.2 0
981.2 – 1057 0
1057 – 1132 0
1132 – 1208 0
1208 – 1283 0
1283 – 1359 0
1359 – 1434 0
1434 – 1510 0
1510 – 1585 4
1585 – 1660 6
1660 – 1736 7
1736 – 1811 37
1811 – 1887 371
1887 – 1962 1975
1962 – 2038 3930
2038 – 2113 0
2113 – 2189 0
2189 – 2264 0
2264 – 2340 0
2340 – 2415 0
2415 – 2491 0
2491 – 2566 0
2566 – 2642 0
2642 – 2717 0
2717 – 2793 0
2793 – 2868 0
2868 – 2944 0
2944 – 3019 1
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column kind null %
index numeric 0.0%
Case Number text 0.0%
Date text 0.0%
Year numeric 0.0%
Type categorical 0.1%
Country categorical 0.8%
Area categorical 7.2%
Location text 8.4%
Activity text 8.5%
Name text 3.3%
Unnamed: 9 categorical 99.6%
Age categorical 44.4%
Injury text 0.4%
Fatal (Y/N) categorical 8.5%
Time categorical 52.5%
Species text 45.2%
Investigator or Source text 0.3%
pdf text 52.6%
href formula text 52.6%
href text 52.6%
Case Number.1 text 52.6%
Case Number.2 text 52.6%
original order numeric 52.6%
Unnamed: 23 categorical 100.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 16,164 detected strings).
lang count share
en 14597 90.3%
de 627 3.9%
es 219 1.4%
fr 177 1.1%
it 123 0.8%
pt 80 0.5%
nl 45 0.3%
ru 31 0.2%
fi 26 0.2%
sv 23 0.1%
zh 22 0.1%
pl 18 0.1%
ja 16 0.1%
id 15 0.1%
ca 14 0.1%
ceb 13 0.1%
tr 13 0.1%
eu 13 0.1%
vi 12 0.1%
cy 10 0.1%
ro 9 0.1%
ms 8 0.0%
eo 7 0.0%
sq 7 0.0%
af 5 0.0%
hu 5 0.0%
hr 4 0.0%
jbo 4 0.0%
war 4 0.0%
cs 4 0.0%
lv 3 0.0%
sh 3 0.0%
sw 2 0.0%
tl 1 0.0%
th 1 0.0%
no 1 0.0%
sl 1 0.0%
uk 1 0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
column index Year original order
index +1.00 -0.40 -0.97
Year -0.40 +1.00 +0.38
original order -0.97 +0.38 +1.00

index numeric identifier

This column is a row index — a sequential integer identifier running from 0 to 6461 with 6462 unique values and no nulls, perfectly matching the row count. Its distribution is exactly uniform (skew = 0.0, kurtosis ≈ −1.2, mean = median = 3230.5, zero outliers), confirming it was generated as a positional index rather than carrying any domain meaning. The single 'zero' (zero_rate ≈ 0.00015) is simply row 0. There is no analytical signal here.

Treatment: Drop before modelling; carries no predictive information.

anthropic:default · confidence high
Out[14]:

saturn.columns["index"].stats

stat value
n 6,462
nulls 0 (0.0%)
unique 6,462
min 0
max 6,461
mean 3230
median 3230
std 1866
q1 1615
q3 4846
iqr 3230
skew 0
kurtosis -1.2
n_outliers 0
outlier_rate 0
zero_rate 0.0001548
Fig 9.
Distribution of index. Vertical dash marks the median.
Histogram bins for index (median: 3230.5).
bin count
0 – 161.5 162
161.5 – 323.1 162
323.1 – 484.6 161
484.6 – 646.1 162
646.1 – 807.6 161
807.6 – 969.2 162
969.2 – 1131 161
1131 – 1292 162
1292 – 1454 161
1454 – 1615 162
1615 – 1777 161
1777 – 1938 162
1938 – 2100 161
2100 – 2261 162
2261 – 2423 161
2423 – 2584 162
2584 – 2746 161
2746 – 2907 162
2907 – 3069 161
3069 – 3230 162
3230 – 3392 162
3392 – 3554 161
3554 – 3715 162
3715 – 3877 161
3877 – 4038 162
4038 – 4200 161
4200 – 4361 162
4361 – 4523 161
4523 – 4684 162
4684 – 4846 161
4846 – 5007 162
5007 – 5169 161
5169 – 5330 162
5330 – 5492 161
5492 – 5653 162
5653 – 5815 161
5815 – 5976 162
5976 – 6138 161
6138 – 6299 162
6299 – 6461 162

Case Number text identifier

This column is a case identifier, with values that appear to encode dates in YYYY.MM.DD format (e.g., '2019.10.08'), suggesting case numbers tied to filing or incident dates. With 6,442 unique values out of 6,462 rows and a null rate of 0.0003, it is near-unique and functions as a primary key. The 18 duplicate values (duplicate_rate 0.0028) are a mild anomaly worth investigating — one value '2012.09.02.b' hints that suffixes are used to disambiguate same-date cases, implying the deduplication logic is not fully consistent. The allcaps_rate of 0.748 suggests a mix of formatting styles across records.

Treatment: Use as a case-level join key; flag the 18 duplicates for deduplication before modelling.

anthropic:default · confidence high
Out[17]:

saturn.columns["Case Number"].stats

stat value
n 6,462
nulls 2 (0.0%)
unique 6,442
len_min 6
len_max 18
len_mean 10.63
len_median 10
len_p95 12
word_mean 1.001
word_median 1
n_empty 0
n_duplicates 18
duplicate_rate 0.002786
vocab_size 6,445
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9992
allcaps_rate 0.748
boilerplate_rate 0
alert: near_unique 99.7% of rows are unique strings
alert: one_word 99.9% rows are a single word
alert: allcaps 74.8% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 10.
Character-length distribution for Case Number.
Character-length distribution for Case Number (mean: 10.63).
chars count
6 – 6 4
6 – 7 0
7 – 7 0
7 – 7 122
7 – 8 0
8 – 8 0
8 – 8 0
8 – 8 0
8 – 9 0
9 – 9 0
9 – 9 0
9 – 10 0
10 – 10 0
10 – 10 4150
10 – 10 0
10 – 11 0
11 – 11 14
11 – 11 0
11 – 12 0
12 – 12 0
12 – 12 2128
12 – 13 0
13 – 13 0
13 – 13 14
13 – 14 0
14 – 14 0
14 – 14 24
14 – 14 0
14 – 15 0
15 – 15 0
15 – 15 2
15 – 16 0
16 – 16 0
16 – 16 1
16 – 16 0
16 – 17 0
17 – 17 0
17 – 17 0
17 – 18 0
18 – 18 1
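
The Case Number treatment above asks for the 18 duplicates to be flagged before the column is used as a join key. A minimal sketch of that check, using only the standard library; the function name and the sample values are illustrative assumptions, not saturn output or rows from the file:

```python
from collections import Counter


def find_duplicate_keys(case_numbers):
    """Return {key: count} for non-empty case numbers that occur more than once."""
    cleaned = [c.strip() for c in case_numbers if c is not None and c.strip()]
    counts = Counter(cleaned)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample values, chosen to mimic the YYYY.MM.DD[.suffix] pattern.
sample = ["2012.09.02.a", "2012.09.02.b", "2012.09.02.b", None, "1957.00.00"]
duplicates = find_duplicate_keys(sample)
print(duplicates)
```

Keys returned here would need manual review before rows are merged or dropped, since a repeated case number may describe two distinct incidents.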

Date text metadata

This column contains free-form date annotations for what appears to be a historical or archival dataset, storing dates in highly inconsistent formats: bare years (e.g., '1957', '1942'), structured dates ('05-Oct-2003'), and vague phrases ('Before 1958', 'No date', 'ca.', 'summer', 'late'). The top word 'reported' appearing 559 times suggests many values follow a pattern like 'reported [date]', which is a red flag for downstream parsing. With a 14.1% duplicate rate (909 duplicates across 5,552 unique values) and a max length of 64 characters, this column cannot be safely cast to a datetime type without substantial normalization work.

Treatment: Parse and normalize into structured date fields using regex rules per format pattern; flag unparseable values ('No date', 'ca.', 'reported …') as null or uncertain.

anthropic:default · confidence high
Out[20]:

saturn.columns["Date"].stats

stat value
n 6,462
nulls 1 (0.0%)
unique 5,552
len_min 4
len_max 64
len_mean 11.43
len_median 11
len_p95 20
word_mean 1.155
word_median 1
n_empty 0
n_duplicates 909
duplicate_rate 0.1407
vocab_size 5,496
readability_flesch_mean 89.93
emoji_rate 0
url_rate 0
one_word_rate 0.8728
allcaps_rate 0.05092
boilerplate_rate 0
alert: one_word 87.3% rows are a single word
Fig 11.
Character-length distribution for Date.
Character-length distribution for Date (mean: 11.43).
chars count
4 – 6 308
6 – 7 0
7 – 8 347
8 – 10 31
10 – 12 5088
12 – 13 33
13 – 14 43
14 – 16 4
16 – 18 7
18 – 19 5
19 – 20 552
20 – 22 5
22 – 24 5
24 – 25 7
25 – 26 8
26 – 28 0
28 – 30 1
30 – 31 0
31 – 32 3
32 – 34 1
34 – 36 2
36 – 37 1
37 – 38 2
38 – 40 0
40 – 42 1
42 – 43 1
43 – 44 1
44 – 46 0
46 – 48 0
48 – 49 1
49 – 50 0
50 – 52 0
52 – 54 0
54 – 55 2
55 – 56 1
56 – 58 0
58 – 60 0
60 – 61 0
61 – 62 0
62 – 64 1
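
The Date treatment above suggests regex rules per format pattern. The sketch below covers only two of the patterns named in the interpretation ('05-Oct-2003' and bare or qualified years); the function name, the tuple layout, and the `qualified` flag are assumptions made for illustration, and other formats in the file would need their own rules:

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

FULL_DATE = re.compile(r"(\d{1,2})-([A-Za-z]{3})-(\d{4})")
BARE_YEAR = re.compile(r"\b(\d{4})\b")


def parse_date_text(raw):
    """Return (year, month, day, qualified), or None if no year is recoverable.

    qualified is True when the value carries extra text such as
    'Reported' or 'Before', so the date should be treated as uncertain.
    """
    if raw is None:
        return None
    text = raw.strip()
    match = FULL_DATE.search(text)
    if match:
        day, month_name, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month is not None:
            try:
                parsed = date(int(year), month, int(day))
            except ValueError:
                parsed = None  # impossible calendar date, fall through to year-only
            if parsed is not None:
                return (parsed.year, parsed.month, parsed.day, match.group(0) != text)
    match = BARE_YEAR.search(text)
    if match:
        return (int(match.group(1)), None, None, match.group(1) != text)
    return None


for value in ["05-Oct-2003", "Reported 05-Oct-2003", "Before 1958", "1957", "No date"]:
    print(value, "->", parse_date_text(value))
```

Values that return None ('No date', 'ca.', 'summer') stay null, which matches the treatment's advice to flag unparseable entries rather than guess.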

Year numeric feature

This column records the year of each incident for 6,462 records, with most values in a plausible modern range centered on a median of 1980 and an IQR of 1943–2006. Two signals stand out: a minimum of 0.0 (a zero rate of nearly 2%, almost certainly sentinel placeholders for a missing year) and a maximum of 3019.0, which is a data-entry error (likely a typo for a year such as 2019). These outliers (266 records, ~4.1%) drive extreme negative skew (−6.55) and very high kurtosis (42.54), masking what is otherwise a fairly clean temporal distribution.

Treatment: Null-code zeros and values outside a valid range (e.g., < 1800 or > 2100), correct obvious typos like 3019, then use as an ordinal or numeric feature. The histogram shows a small number of records from the 1500s onward, so choose the lower bound deliberately rather than discarding plausible historical cases.

anthropic:default · confidence high
Out[23]:

saturn.columns["Year"].stats

stat value
n 6,462
nulls 3 (0.0%)
unique 252
min 0
max 3,019
mean 1930
median 1,980
std 278.3
q1 1,943
q3 2,006
iqr 63
skew -6.554
kurtosis 42.54
n_outliers 266
outlier_rate 0.04118
zero_rate 0.01935
alert: high_skew skew=-6.55
Fig 12.
Distribution of Year. Vertical dash marks the median.
Histogram bins for Year (median: 1980.0).
bin count
0 – 75.47 126
75.47 – 150.9 1
150.9 – 226.4 0
226.4 – 301.9 0
301.9 – 377.4 0
377.4 – 452.8 0
452.8 – 528.3 1
528.3 – 603.8 0
603.8 – 679.3 0
679.3 – 754.8 0
754.8 – 830.2 0
830.2 – 905.7 0
905.7 – 981.2 0
981.2 – 1057 0
1057 – 1132 0
1132 – 1208 0
1208 – 1283 0
1283 – 1359 0
1359 – 1434 0
1434 – 1510 0
1510 – 1585 4
1585 – 1660 6
1660 – 1736 7
1736 – 1811 37
1811 – 1887 371
1887 – 1962 1975
1962 – 2038 3930
2038 – 2113 0
2113 – 2189 0
2189 – 2264 0
2264 – 2340 0
2340 – 2415 0
2415 – 2491 0
2491 – 2566 0
2566 – 2642 0
2642 – 2717 0
2717 – 2793 0
2793 – 2868 0
2868 – 2944 0
2944 – 3019 1
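
A minimal sketch of the Year treatment, assuming zeros are missing-value sentinels and that out-of-range values should become null for later review rather than being auto-corrected. The function name and the default bounds (taken from the example range in the Treatment line) are illustrative choices:

```python
def clean_year(value, lo=1800, hi=2100):
    """Return the year as an int, or None for sentinels and out-of-range values."""
    try:
        year = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None  # missing, non-numeric, NaN, or infinite
    if year == 0:
        return None  # treated here as a 'year unknown' placeholder
    if lo <= year <= hi:
        return year
    return None  # e.g. 3019: left null so a human can decide on the correction


print([clean_year(v) for v in [0, 1980, "2006.0", 3019, None, 1642]])
```

With the default bounds, a value like 1642 is also nulled; passing a smaller `lo` keeps early historical records if they are wanted in the analysis.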

Type categorical label

This column classifies shark attack incidents by the nature of the encounter, with 12 distinct categories across 6,462 records. It is heavily dominated by 'Unprovoked' at 73% of all records, creating notable class imbalance. A clear data quality issue is the fragmentation of watercraft-related incidents across three near-synonymous labels, 'Watercraft' (142), 'Boat' (109), and 'Boating' (92), plus a single 'Boatomg' that looks like a typo for 'Boating'. These are almost certainly the same category and should be consolidated. The 'Invalid' category (552 records, ~8.5%) also warrants attention as it may represent records that should be excluded from incident analysis.

Treatment: Consolidate 'Boat', 'Boating', 'Boatomg', and 'Watercraft' into a single category; consider excluding or flagging 'Invalid' records; one-hot encode for modelling, keeping the severe class imbalance in mind.

anthropic:default · confidence high
Out[26]:

saturn.columns["Type"].stats

stat value
n 6,462
nulls 5 (0.1%)
unique 12
top_value Unprovoked
top_rate 0.7304
cardinality 12
entropy 1.457
entropy_ratio 0.4064
Fig 13.
Top values for Type.
Top values for Type (12 unique shown, of 12 total).
value count share
Unprovoked 4716 73.0%
Provoked 593 9.2%
Invalid 552 8.5%
Sea Disaster 239 3.7%
Watercraft 142 2.2%
Boat 109 1.7%
Boating 92 1.4%
Questionable 10 0.2%
Unconfirmed 1 0.0%
Unverified 1 0.0%
Under investigation 1 0.0%
Boatomg 1 0.0%
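
The consolidation suggested for Type can be sketched as a plain lookup table applied to the counts in Fig 13. The alias map is an assumption about which labels belong together (including treating 'Boatomg' as a typo), and the function name is illustrative:

```python
from collections import Counter

# Assumed aliases: every boat-related label folds into 'Watercraft'.
TYPE_ALIASES = {"Boat": "Watercraft", "Boating": "Watercraft", "Boatomg": "Watercraft"}


def normalize_type(label):
    """Map a raw Type label to its consolidated category (None stays None)."""
    if label is None:
        return None
    cleaned = label.strip()
    return TYPE_ALIASES.get(cleaned, cleaned)


# Counts copied from the Type top-values table.
type_counts = {
    "Unprovoked": 4716, "Provoked": 593, "Invalid": 552, "Sea Disaster": 239,
    "Watercraft": 142, "Boat": 109, "Boating": 92, "Questionable": 10,
    "Unconfirmed": 1, "Unverified": 1, "Under investigation": 1, "Boatomg": 1,
}

merged = Counter()
for label, count in type_counts.items():
    merged[normalize_type(label)] += count

print(merged["Watercraft"], len(merged))
```

Under these assumptions the four boat-related labels collapse into one category of 344 records, and the number of distinct categories drops from 12 to 9.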

Country categorical feature

This column records the country where each incident occurred, with 205 distinct values across 6,462 rows. The distribution is heavily skewed: USA alone accounts for 36% of records (2,310), followed by AUSTRALIA at 21% (1,374) and SOUTH AFRICA at 9% (585), meaning these three countries together represent roughly two-thirds of the dataset. The entropy ratio of 0.51 confirms moderate concentration despite 205 unique values, and the near-zero null rate (0.79%) means coverage is excellent.

Treatment: One-hot encode top countries and group tail countries into a residual 'OTHER' category before modelling.

anthropic:default · confidence high
Out[29]:

saturn.columns["Country"].stats

stat value
n 6,462
nulls 51 (0.8%)
unique 205
top_value USA
top_rate 0.3603
cardinality 205
entropy 3.909
entropy_ratio 0.509
Fig 14.
Top values for Country.
Top values for Country (20 unique shown, of 205 total).
value count share
USA 2310 35.7%
AUSTRALIA 1374 21.3%
SOUTH AFRICA 585 9.1%
NEW ZEALAND 135 2.1%
PAPUA NEW GUINEA 135 2.1%
BAHAMAS 115 1.8%
BRAZIL 113 1.7%
MEXICO 95 1.5%
ITALY 71 1.1%
PHILIPPINES 62 1.0%
FIJI 62 1.0%
REUNION 60 0.9%
NEW CALEDONIA 56 0.9%
CUBA 46 0.7%
SPAIN 44 0.7%
MOZAMBIQUE 44 0.7%
EGYPT 42 0.6%
INDIA 40 0.6%
JAPAN 34 0.5%
CROATIA 34 0.5%
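
The Country treatment groups tail countries into a residual 'OTHER' bucket before encoding. A minimal sketch of that grouping step, assuming a simple top-k cut-off; the function name, the `top_k` parameter, and the sample list are illustrative, and nulls are deliberately left as nulls:

```python
from collections import Counter


def group_rare(values, top_k=10, other="OTHER"):
    """Keep the top_k most frequent labels and replace the rest with `other`."""
    counts = Counter(v for v in values if v is not None)
    keep = {label for label, _ in counts.most_common(top_k)}
    return [v if (v is None or v in keep) else other for v in values]


# Hypothetical mini-sample, not rows from the file.
sample = ["USA", "USA", "USA", "AUSTRALIA", "AUSTRALIA", "FIJI", None]
grouped = group_rare(sample, top_k=2)
print(grouped)
```

The choice of `top_k` is a modelling decision; with 205 distinct countries and a steep drop after the top three, a small value already covers most rows.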

Area categorical feature

This column represents geographic sub-national regions (states, provinces) drawn from multiple countries: the US (Florida, Hawaii, California, South Carolina), Australia (New South Wales, Queensland, Western Australia), and South Africa (KwaZulu-Natal, Western Cape Province, Eastern Cape Province). Florida dominates at 17.9% of non-null values (16.7% of all rows), while 810 unique values against 6,462 rows signal a severe long-tail distribution in which 540 areas appear exactly once. The 7.16% null rate and high geographic diversity across at least three countries suggest this dataset is multinational in scope.

Treatment: Encode with frequency or target encoding; consider grouping rare areas (long-tail) into an 'Other' bucket or rolling up to country level to reduce cardinality from 810 classes.

anthropic:default · confidence high
Out[32]:

saturn.columns["Area"].stats

stat value
n 6,462
nulls 463 (7.2%)
unique 810
top_value Florida
top_rate 0.1794
cardinality 810
entropy 6.163
entropy_ratio 0.6379
alert: long_tail 540 singleton categories
Fig 15.
Top values for Area.
Top values for Area (20 unique shown, of 810 total).
value count share
Florida 1076 16.7%
New South Wales 498 7.7%
Queensland 325 5.0%
Hawaii 312 4.8%
California 294 4.5%
KwaZulu-Natal 215 3.3%
Western Australia 197 3.0%
Western Cape Province 195 3.0%
Eastern Cape Province 165 2.6%
South Carolina 163 2.5%
North Carolina 111 1.7%
South Australia 104 1.6%
Victoria 92 1.4%
Texas 75 1.2%
Pernambuco 74 1.1%
Torres Strait 72 1.1%
North Island 70 1.1%
New Jersey 55 0.9%
Tasmania 42 0.6%
South Island 41 0.6%
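
The Area stats report a top_rate of 0.1794 for Florida, while the top-values table shows a 16.7% share. The two figures are consistent if top_rate is computed over non-null values and the table share over all rows; that reading is an inference from the numbers, not something the report states. The arithmetic, using only values from the stats above:

```python
rows = 6462          # n
nulls = 463          # nulls for Area
florida = 1076       # count for the top value

share_of_all_rows = florida / rows
share_of_non_null = florida / (rows - nulls)

print(f"share of all rows: {share_of_all_rows:.4f}")
print(f"share of non-null: {share_of_non_null:.4f}")
```

The same two-denominator pattern appears to explain the gap between top_rate and table share for the other categorical columns with many nulls, such as Age, Time, and Fatal (Y/N).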

Location text label

This column captures geographic incident locations, predominantly structured as 'City, County' strings, with top values like 'New Smyrna Beach, Volusia County' (181 occurrences) and dominant words 'county', 'beach', 'island', and 'bay'. The duplicate rate is notably high at ~29.9% (1,769 duplicates among the 5,917 non-null values), reflecting repeated incidents at the same hotspot locations rather than data error. The multilingual alert is most likely triggered by automated language detection misclassifying short geographic names (e.g. 'Durban', 'Boa Viagem, Recife') as non-English: 3,746 values are detected as English, with the remainder split across the other detected 'languages' (the alert reports 31 in the sample) due to short-string ambiguity. The null rate is 8.43%, which may represent unknown or offshore incident locations.

Treatment: Normalize to canonical 'City, County, Country' format, then use as a categorical grouping variable or geocode for spatial analysis.

anthropic:default · confidence high
Out[35]:

saturn.columns["Location"].stats

stat value
n 6,462
nulls 545 (8.4%)
unique 4,148
len_min 3
len_max 119
len_mean 22.75
len_median 21
len_p95 47
word_mean 3.54
word_median 3
n_empty 0
n_duplicates 1,769
duplicate_rate 0.299
vocab_size 4,483
readability_flesch_mean 53.4
emoji_rate 0
url_rate 0
one_word_rate 0.1486
allcaps_rate 0.000507
boilerplate_rate 0
alert: multilingual 31 languages detected in sample
alert: duplicates 29.9% duplicate strings
Fig 16.
Character-length distribution for Location.
Character-length distribution for Location (mean: 22.75).
chars count
3 – 6 106
6 – 9 542
9 – 12 758
12 – 15 720
15 – 18 445
18 – 20 358
20 – 23 352
23 – 26 393
26 – 29 524
29 – 32 293
32 – 35 505
35 – 38 203
38 – 41 152
41 – 44 124
44 – 46 124
46 – 49 95
49 – 52 54
52 – 55 52
55 – 58 26
58 – 61 17
61 – 64 21
64 – 67 14
67 – 70 8
70 – 73 5
73 – 76 2
76 – 78 4
78 – 81 6
81 – 84 1
84 – 87 2
87 – 90 1
90 – 93 1
93 – 96 0
96 – 99 1
99 – 102 1
102 – 104 2
104 – 107 0
107 – 110 3
110 – 113 1
113 – 116 0
116 – 119 1

Activity text label

This column captures the water-based activity a person was engaged in at the time of the incident, dominated by Surfing (1,025) and Swimming (932) with a tail of descriptive phrases. Despite being labelled 'text', it behaves largely as a categorical label: 62.9% of values are single words, only 1,516 unique values exist across 6,462 rows, and the duplicate rate is 74.3%, indicating a loosely controlled vocabulary rather than a strict enum. The median string length of 8 characters versus a max of 254 suggests a mix of clean category entries and occasional free-text annotations, which may require normalisation before use.

Treatment: Standardise to a controlled vocabulary by clustering near-duplicates (e.g. 'Diving' vs 'Scuba diving'), then encode as a categorical feature; impute or flag the 8.54% nulls separately.

anthropic:default · confidence high
Out[38]:

saturn.columns["Activity"].stats

stat value
n 6,462
nulls 552 (8.5%)
unique 1,516
len_min 1
len_max 254
len_mean 16.21
len_median 8
len_p95 49
word_mean 2.497
word_median 1
n_empty 0
n_duplicates 4,394
duplicate_rate 0.7435
vocab_size 2,244
readability_flesch_mean 39.56
emoji_rate 0
url_rate 0
one_word_rate 0.6289
allcaps_rate 0.0005076
boilerplate_rate 0
alert: one_word 62.9% rows are a single word
alert: duplicates 74.3% duplicate strings
Fig 17.
Character-length distribution for Activity.
Character-length distribution for Activity (mean: 16.21).
chars count
1 – 7 2033
7 – 14 2150
14 – 20 468
20 – 26 376
26 – 33 228
33 – 39 177
39 – 45 134
45 – 52 84
52 – 58 42
58 – 64 51
64 – 71 35
71 – 77 25
77 – 83 20
83 – 90 10
90 – 96 9
96 – 102 14
102 – 109 10
109 – 115 6
115 – 121 5
121 – 128 1
128 – 134 1
134 – 140 7
140 – 146 4
146 – 153 4
153 – 159 3
159 – 165 0
165 – 172 2
172 – 178 1
178 – 184 0
184 – 191 2
191 – 197 3
197 – 203 0
203 – 210 0
210 – 216 1
216 – 222 0
222 – 229 0
229 – 235 2
235 – 241 1
241 – 248 0
248 – 254 1
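
One simple way to approach the controlled vocabulary suggested for Activity is keyword matching, sketched below. This is a swapped-in stand-in for the "clustering near-duplicates" mentioned in the treatment, and the list of canonical labels is an assumption for illustration, not a vocabulary derived from the file:

```python
# Assumed canonical labels; checked in order, first match wins.
CANONICAL = ("surfing", "swimming", "diving", "fishing")


def normalize_activity(raw):
    """Map a free-text activity to a canonical label, 'other', or None if missing."""
    if raw is None or not raw.strip():
        return None
    text = raw.strip().lower()
    for label in CANONICAL:
        if label in text:
            return label
    return "other"


for value in ["Surfing", "Scuba diving", "Spearfishing", "Standing", None]:
    print(value, "->", normalize_activity(value))
```

Substring matching is crude: the order of `CANONICAL` decides ties, and long free-text entries that mention several activities get only the first match, so a review of the 'other' bucket would still be needed.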

Name text label

This column is the 'Name' field of the incident records, with top values that include 'boat', 'sailor', and 'a sailor'. It is heavily contaminated with non-name entries: the most frequent value is 'male' (579 occurrences), followed by 'female' (106), 'boy' (23), '2 males' (19), and 'boat' (14), indicating that gender/role descriptors were freely mixed with actual proper names. The duplicate rate of 14.5% (908 duplicates across 5,339 unique values) and a 3.33% null rate further confirm this column is inconsistently populated and not a clean identifier.

Treatment: Split into two derived columns—one for proper names, one for role/gender descriptors—before any modelling or grouping.

anthropic:default · confidence high
Out[41]:

saturn.columns["Name"].stats

stat value
n 6,462
nulls 215 (3.3%)
unique 5,339
len_min 1
len_max 221
len_mean 14.83
len_median 13
len_p95 35
word_mean 2.453
word_median 2
n_empty 0
n_duplicates 908
duplicate_rate 0.1453
vocab_size 6,536
readability_flesch_mean 51.73
emoji_rate 0
url_rate 0
one_word_rate 0.1657
allcaps_rate 0.006883
boilerplate_rate 0
Fig 18.
Character-length distribution for Name.
Character-length distribution for Name (mean: 14.83).
chars count
1 – 6 939
6 – 12 1355
12 – 18 2774
18 – 23 488
23 – 28 223
28 – 34 128
34 – 40 94
40 – 45 47
45 – 50 66
50 – 56 34
56 – 62 25
62 – 67 26
67 – 72 20
72 – 78 6
78 – 84 3
84 – 89 8
89 – 94 2
94 – 100 1
100 – 106 1
106 – 111 2
111 – 116 1
116 – 122 0
122 – 128 0
128 – 133 1
133 – 138 1
138 – 144 1
144 – 150 0
150 – 155 0
155 – 160 0
160 – 166 0
166 – 172 0
172 – 177 0
177 – 182 0
182 – 188 0
188 – 194 0
194 – 199 0
199 – 204 0
204 – 210 0
210 – 216 0
216 – 221 1
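
The Name treatment proposes splitting proper names from role or gender descriptors. A minimal sketch, assuming the descriptors are limited to the ones quoted in the interpretation ('male', 'female', 'boy', '2 males', 'boat', 'sailor', 'a sailor'); the regex, the function name, and the sample proper name are illustrative:

```python
import re

# Optional leading count ('2 '), optional article ('a '), then a known descriptor.
DESCRIPTOR = re.compile(
    r"^(?:\d+\s+)?(?:a\s+)?(?:male|female|boy|boat|sailor)s?$", re.IGNORECASE)


def split_name(raw):
    """Return (proper_name, descriptor); at most one of the two is non-None."""
    if raw is None or not raw.strip():
        return (None, None)
    text = raw.strip()
    if DESCRIPTOR.match(text):
        return (None, text.lower())
    return (text, None)


# 'Jane Doe' is a made-up name used only for the demonstration.
for value in ["male", "2 males", "a sailor", "Jane Doe", None]:
    print(value, "->", split_name(value))
```

Entries that mix both kinds of information in one string would still land in the proper-name column here, so the descriptor list would need extending after inspecting the real values.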

Unnamed: 9 categorical

Out[44]:

saturn.columns["Unnamed: 9"].stats

stat value
n 6,462
nulls 6,434 (99.6%)
unique 2
top_value M
top_rate 0.8571
cardinality 2
entropy 0.5917
entropy_ratio 0.5917
alert: null_rate 99.6% null
Fig 19.
Top values for Unnamed: 9.
Top values for Unnamed: 9 (2 unique shown, of 2 total).
value count share
M 24 0.4%
F 4 0.1%

Age categorical feature

This column stores age as a categorical (string) type rather than a numeric type, covering 154 distinct values across 6,462 rows. The most striking issue is a 44.43% null rate, flagged as an alert, meaning nearly half the records lack an age value. The distribution skews young: the top values cluster between ages 15 and 25, with '17' the most frequent at only 4.46% of non-null values (2.5% of all rows), so teenagers and young adults are the most commonly recorded ages. The high entropy ratio of 0.80 confirms values are spread broadly across the 154 categories despite the youth concentration.

Treatment: Coerce to integer (154 distinct values suggests that some entries are not plain integers, so set unparseable values to null), impute or flag nulls (44.43% missing requires an explicit strategy), then treat as an ordinal or numeric feature.

anthropic:default · confidence high
Out[47]:

saturn.columns["Age"].stats

stat value
n 6,462
nulls 2,871 (44.4%)
unique 154
top_value 17
top_rate 0.04456
cardinality 154
entropy 5.827
entropy_ratio 0.8018
alert: null_rate 44.4% null
Fig 20.
Top values for Age.
Top values for Age (20 unique shown, of 154 total).
value count share
17 160 2.5%
18 155 2.4%
20 146 2.3%
19 145 2.2%
16 140 2.2%
15 140 2.2%
21 122 1.9%
22 118 1.8%
25 111 1.7%
24 109 1.7%
14 104 1.6%
13 97 1.5%
26 85 1.3%
23 83 1.3%
28 82 1.3%
30 80 1.2%
29 80 1.2%
27 79 1.2%
12 75 1.2%
32 72 1.1%
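
Because Age is stored as strings, a direct integer cast may fail on entries that are not plain numbers. A minimal sketch of a tolerant conversion; the function name, the upper bound of 120, and the non-numeric sample value are assumptions for illustration:

```python
def parse_age(raw, max_age=120):
    """Return the age as an int, or None when the value is missing or not a plain integer."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.isascii() and text.isdigit():
        age = int(text)
        return age if 0 < age <= max_age else None
    return None  # non-numeric entries are nulled for separate review


# 'teen' is a hypothetical non-numeric entry, not a value confirmed in the file.
print([parse_age(v) for v in ["17", " 25 ", "teen", "", None, "300"]])
```

Keeping a separate boolean column that records whether the original value was present but unparseable would preserve that information for later cleaning.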

Injury text label

This column describes the outcome or nature of injuries, containing free-text descriptions ranging from 'FATAL' to specific anatomical bite locations (e.g., 'Left foot bitten', 'Leg bitten'). The dominant value is 'FATAL', appearing 823 times, making it by far the most frequent entry. Two signals stand out: a high duplicate rate of 41.9% (2,695 duplicates among 6,433 non-null values) driven by repetitive categorical-style phrases, and an all-caps rate of 13.1% suggesting inconsistent data entry conventions. Additionally, 496 entries are detected as German against 3,812 detected as English; given how short many of the strings are, this may reflect language-detection misclassification rather than genuinely multilingual sourcing, and should be spot-checked before any text-based analysis.

Treatment: Normalize case, map high-frequency values to a structured severity/outcome taxonomy, and spot-check the entries detected as German (translating them only if they are genuinely non-English) before any text embedding or categorical encoding.

anthropic:default · confidence high
Out[50]:

saturn.columns["Injury"].stats

stat value
n 6,462
nulls 29 (0.4%)
unique 3,738
len_min 5
len_max 234
len_mean 31.53
len_median 25
len_p95 82
word_mean 5.414
word_median 4
n_empty 0
n_duplicates 2,695
duplicate_rate 0.4189
vocab_size 2,550
readability_flesch_mean 53.74
emoji_rate 0
url_rate 0
one_word_rate 0.1489
allcaps_rate 0.1307
boilerplate_rate 0
alert: multilingual 10 languages detected in sample
alert: allcaps 13.1% rows are all-caps
alert: duplicates 41.9% duplicate strings
Fig 21.
Character-length distribution for Injury.
Character-length distribution for Injury (mean: 31.53).
chars count
5 – 11 1206
11 – 16 740
16 – 22 851
22 – 28 788
28 – 34 636
34 – 39 429
39 – 45 382
45 – 51 251
51 – 57 238
57 – 62 200
62 – 68 132
68 – 74 128
74 – 79 100
79 – 85 73
85 – 91 55
91 – 97 47
97 – 102 45
102 – 108 31
108 – 114 13
114 – 120 22
120 – 125 15
125 – 131 6
131 – 137 9
137 – 142 10
142 – 148 4
148 – 154 2
154 – 160 2
160 – 165 0
165 – 171 3
171 – 177 3
177 – 182 3
182 – 188 1
188 – 194 0
194 – 200 2
200 – 205 2
205 – 211 1
211 – 217 0
217 – 223 1
223 – 228 0
228 – 234 2
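
The summary recommends cross-checking the 823 'FATAL' entries in Injury against the Fatal (Y/N) flag. A minimal sketch of that consistency check over rows represented as dicts; the function name and the sample rows are illustrative assumptions, and only the two column names come from the schema:

```python
def inconsistent_fatal_rows(rows):
    """Return positions of rows where Injury is exactly 'FATAL' but the flag is not 'Y'."""
    mismatches = []
    for position, row in enumerate(rows):
        injury = (row.get("Injury") or "").strip().upper()
        flag = (row.get("Fatal (Y/N)") or "").strip().upper()
        if injury == "FATAL" and flag != "Y":
            mismatches.append(position)
    return mismatches


# Made-up rows for demonstration only.
sample_rows = [
    {"Injury": "FATAL", "Fatal (Y/N)": "Y"},
    {"Injury": "FATAL", "Fatal (Y/N)": "N"},
    {"Injury": "Leg bitten", "Fatal (Y/N)": "N"},
    {"Injury": "FATAL", "Fatal (Y/N)": None},
]
mismatched = inconsistent_fatal_rows(sample_rows)
print(mismatched)
```

Rows produced by `csv.DictReader` have the same dict shape, so the function could be pointed at the source file directly; a null flag next to a 'FATAL' injury is reported as a mismatch because it is a candidate for back-filling.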

Fatal (Y/N) categorical label

This column is a binary fatality flag for incidents, expected to hold only 'Y' or 'N' values. The dominant value is 'N' (4,439 occurrences, 75.0% of non-null values and 68.7% of all rows), with 'Y' accounting for 1,400 cases (23.7% of non-null values, 21.7% of all rows). There are also data quality issues: 5 rows contain clearly erroneous values ('F', 'M', '2017', lowercase 'y') suggesting data entry errors or row misalignment, and 71 rows are labeled 'UNKNOWN'. The 8.46% null rate adds further incompleteness.

Treatment: Standardise 'y' → 'Y', investigate and recode/drop 'F', 'M', '2017' entries, decide on treatment of 'UNKNOWN' and nulls, then binarise (Y=1, N=0) for modelling.

anthropic:default · confidence high
Out[53]:

saturn.columns["Fatal (Y/N)"].stats

stat value
n 6,462
nulls 547 (8.5%)
unique 7
top_value N
top_rate 0.7505
cardinality 7
entropy 0.8897
entropy_ratio 0.3169
Fig 22.
Top values for Fatal (Y/N).
Top values for Fatal (Y/N) (7 unique shown, of 7 total).
value count share
N 4439 68.7%
Y 1400 21.7%
UNKNOWN 71 1.1%
F 2 0.0%
M 1 0.0%
2017 1 0.0%
y 1 0.0%
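
A minimal sketch of the Fatal (Y/N) treatment, applied to the counts in Fig 22. It assumes that lowercase 'y' means 'Y' and that 'UNKNOWN', 'F', 'M', and '2017' should become null pending investigation; the function name is illustrative:

```python
from collections import Counter


def normalize_fatal(raw):
    """Return 1 for fatal, 0 for non-fatal, None for anything else."""
    if raw is None:
        return None
    value = raw.strip().upper()
    if value == "Y":
        return 1
    if value == "N":
        return 0
    return None  # 'UNKNOWN', 'F', 'M', '2017', ...


# Counts copied from the Fatal (Y/N) top-values table.
fatal_counts = {"N": 4439, "Y": 1400, "UNKNOWN": 71, "F": 2, "M": 1, "2017": 1, "y": 1}

recoded = Counter()
for label, count in fatal_counts.items():
    recoded[normalize_fatal(label)] += count

print(recoded[1], recoded[0], recoded[None])
```

Under these assumptions the recode yields 1,401 fatal and 4,439 non-fatal records, with 75 non-null values left unresolved in addition to the 547 nulls.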

Time categorical feature

This column captures time-of-day information, but it is stored inconsistently: values mix coarse labels ('Afternoon', 'Morning') with specific clock times in 'HHhMM' format ('11h00', '16h30'), yielding 366 unique values across 6,462 rows. The null rate is severe at 52.49%, meaning over half of all records are missing a time entirely. The top value 'Afternoon' accounts for only 6.29% of non-null values (3.0% of all rows), and the entropy ratio is 0.77, indicating a long tail of rarely seen time strings (199 categories occur only once), likely from inconsistent data entry across sources or time periods.

Treatment: Standardise to 24h numeric minutes-since-midnight, map label categories ('Morning', 'Afternoon') to representative values or a separate flag, then impute or model missingness explicitly given 52.49% null rate.

anthropic:default · confidence high
Out[56]:

saturn.columns["Time"].stats

stat value
n 6,462
nulls 3,392 (52.5%)
unique 366
top_value Afternoon
top_rate 0.06287
cardinality 366
entropy 6.559
entropy_ratio 0.7702
alert: long_tail 199 singleton categories
alert: null_rate 52.5% null
Fig 23.
Top values for Time.
Top values for Time (20 unique shown, of 366 total).
value count share
Afternoon 193 3.0%
11h00 131 2.0%
Morning 126 1.9%
12h00 113 1.7%
15h00 111 1.7%
16h00 106 1.6%
14h00 102 1.6%
16h30 79 1.2%
17h30 77 1.2%
13h00 75 1.2%
17h00 74 1.1%
14h30 73 1.1%
18h00 72 1.1%
15h30 67 1.0%
11h30 65 1.0%
13h30 64 1.0%
10h00 63 1.0%
Night 63 1.0%
09h00 55 0.9%
10h30 51 0.8%
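
The Time treatment standardises clock times to minutes since midnight. A minimal sketch for the 'HHhMM' pattern seen in the table above; the function name is illustrative, and coarse labels such as 'Afternoon' are deliberately returned as None so they can be handled by a separate flag:

```python
import re

CLOCK = re.compile(r"^(\d{1,2})h(\d{2})$")


def time_to_minutes(raw):
    """Convert 'HHhMM' to minutes since midnight; return None for anything else."""
    if raw is None:
        return None
    match = CLOCK.match(raw.strip())
    if not match:
        return None  # labels like 'Afternoon', 'Morning', 'Night'
    hours, minutes = int(match.group(1)), int(match.group(2))
    if hours > 23 or minutes > 59:
        return None  # not a valid clock time
    return hours * 60 + minutes


for value in ["11h00", "16h30", "09h00", "Afternoon", None]:
    print(value, "->", time_to_minutes(value))
```

Mapping labels to representative minute values (for example, a midpoint for 'Afternoon') is a separate modelling choice and is left out of this sketch.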

Species text label

This column records shark species (and incident validity notes), with values ranging from specific species ('White shark', 'Tiger shark', 'Bull shark') to free-text qualifiers like 'Shark involvement not confirmed' and 'Invalid'. The null rate is severe at 45.25%, and 58.56% of non-null values are duplicates, which is expected for a species label with only 1,466 unique values across 6,462 rows. The multilingual alert deserves a closer look: while 2,582 values are classified as English, 14 are classified as German, 18 as Finnish, 11 as Chinese, and 8 as Turkish, among others. These counts are small enough that they may be detection errors on short strings rather than genuinely non-English records. The mix of species names, size descriptions ('4' shark', '1.8 m [6'] shark'), and incident-status phrases means this column is semantically heterogeneous and will require parsing or splitting before use. Note that the column header in the source file carries a trailing space ('Species ').

Treatment: Split into a normalized species category and a separate incident-validity flag; impute or exclude the 45.25% nulls based on task context.

anthropic:default · confidence high
Out[59]:

saturn.columns["Species "].stats

stat value
n 6,462
nulls 2,924 (45.2%)
unique 1,466
len_min 3
len_max 194
len_mean 22.95
len_median 17
len_p95 50
word_mean 4.445
word_median 4
n_empty 0
n_duplicates 2,072
duplicate_rate 0.5856
vocab_size 1,105
readability_flesch_mean 88.63
emoji_rate 0
url_rate 0
one_word_rate 0.04098
allcaps_rate 0.0002826
boilerplate_rate 0
alert: multilingual 15 languages detected in sample
alert: null_rate 45.2% null
alert: duplicates 58.6% duplicate strings
Fig 24.
Character-length distribution for Species .
Character-length distribution for Species (mean: 22.95).
chars count
3 – 8 105
8 – 13 855
13 – 17 833
17 – 22 448
22 – 27 324
27 – 32 285
32 – 36 120
36 – 41 127
41 – 46 115
46 – 51 170
51 – 56 32
56 – 60 25
60 – 65 21
65 – 70 13
70 – 75 17
75 – 79 11
79 – 84 7
84 – 89 4
89 – 94 6
94 – 98 4
98 – 103 2
103 – 108 0
108 – 113 1
113 – 118 1
118 – 122 0
122 – 127 2
127 – 132 1
132 – 137 1
137 – 141 1
141 – 146 1
146 – 151 0
151 – 156 2
156 – 161 1
161 – 165 1
165 – 170 0
170 – 175 0
175 – 180 0
180 – 184 1
184 – 189 0
189 – 194 1

Investigator or Source text metadata

This column records the investigator or data source credited for each incident, typically an abbreviated name plus an organizational affiliation (e.g., 'C. Moore, GSAF'). GSAF (Global Shark Attack File) dominates the top entries and appears in 983 word tokens, making it the primary contributing organization. The duplicate rate of 22.7% (1,464 duplicates among 6,443 non-null rows) is expected given a finite set of investigators filing multiple reports. The multilingual alert is notable: 23 languages were detected, with 'en' accounting for 4,457 entries, while 'es' (134), 'fr' (100), 'it' (62), and 'de' (56) reflect international source attribution or non-English investigator names. The high cardinality (4,979 unique values among 6,443 non-null rows) suggests many entries are one-off source combinations rather than standardized identifiers.

Treatment: Normalize organization affiliations (e.g., extract 'GSAF' tag) and standardize investigator name formats before using as a categorical grouping variable.
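One way to approach this normalization is sketched below. It assumes that affiliations are separated from names by commas, semicolons, or ampersands, which the single quoted example ('C. Moore, GSAF') suggests but does not prove; the function name `normalize_source` is hypothetical.

```python
import re


def normalize_source(raw):
    """Split one 'Investigator or Source' value into (names, has_gsaf)."""
    if raw is None or not raw.strip():
        return ([], False)
    # Delimiter set is an assumption based on the 'C. Moore, GSAF' example.
    parts = [p.strip() for p in re.split(r"[,;&]", raw) if p.strip()]
    has_gsaf = any(p.upper() == "GSAF" for p in parts)
    # Collapse repeated whitespace so 'C.  Moore' and 'C. Moore' group together.
    names = [re.sub(r"\s+", " ", p) for p in parts if p.upper() != "GSAF"]
    return (names, has_gsaf)
```

The boolean `has_gsaf` can then serve as the grouping variable, while the cleaned names remain available for a later, more careful deduplication.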

anthropic:default · confidence high
Out[62]:

saturn.columns["Investigator or Source"].stats

stat value
n 6,462
nulls 19 (0.3%)
unique 4,979
len_min 3
len_max 210
len_mean 32.24
len_median 26
len_p95 77
word_mean 4.792
word_median 3
n_empty 0
n_duplicates 1,464
duplicate_rate 0.2272
vocab_size 7,898
readability_flesch_mean 73.62
emoji_rate 0
url_rate 0.002328
one_word_rate 0.02592
allcaps_rate 0.01847
boilerplate_rate 0
alert: multilingual 23 languages detected in sample
alert: duplicates 22.7% duplicate strings
Fig 25.
Character-length distribution for Investigator or Source.
Character-length distribution for Investigator or Source (mean: 32.24).
chars count
3 – 8 152
8 – 13 461
13 – 19 1164
19 – 24 844
24 – 29 1098
29 – 34 816
34 – 39 400
39 – 44 315
44 – 50 203
50 – 55 174
55 – 60 150
60 – 65 142
65 – 70 116
70 – 75 59
75 – 81 61
81 – 86 45
86 – 91 39
91 – 96 40
96 – 101 30
101 – 106 23
106 – 112 23
112 – 117 18
117 – 122 14
122 – 127 14
127 – 132 9
132 – 138 8
138 – 143 5
143 – 148 5
148 – 153 4
153 – 158 3
158 – 163 4
163 – 169 0
169 – 174 0
174 – 179 1
179 – 184 0
184 – 189 1
189 – 194 1
194 – 200 0
200 – 205 0
205 – 210 1

pdf text foreign_key

This column contains PDF filenames or partial file paths, as shown by the '.pdf' suffixes in the top words and a mean of about one word per value. Over half the rows (52.55%) are null. With 3,054 unique values among 3,066 non-null rows, the column is near-unique and functions more like a document reference key than a descriptive field. The strongly negative Flesch readability score (−66.81) is consistent with structured filename strings rather than natural language. The 12 duplicates suggest that a few documents are referenced by more than one record.

Treatment: Use as a document reference key to join or retrieve associated PDF files; flag or exclude nulls before any join, since a missing filename cannot be meaningfully imputed.
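The counts reported for pdf can be cross-checked with a little arithmetic. The sketch below assumes that duplicates are counted among non-null rows only (n_duplicates = non-null rows minus unique values, and duplicate_rate = n_duplicates divided by non-null rows). The source does not state saturn's definition; this is simply the reading that reproduces the published figures. The helper name `duplicate_stats` is hypothetical.

```python
def duplicate_stats(n, nulls, unique):
    """Return (non_null, n_duplicates, duplicate_rate) from three counts."""
    non_null = n - nulls
    n_duplicates = non_null - unique
    rate = n_duplicates / non_null if non_null else 0.0
    return non_null, n_duplicates, rate


# Figures for the pdf column: n, nulls, unique.
non_null, dups, rate = duplicate_stats(n=6462, nulls=3396, unique=3054)
print(non_null, dups, round(rate, 6))  # 3066 12 0.003914
```

The same relation holds for the other text columns in this report, for example 6,462 − 19 − 4,979 = 1,464 duplicates for Investigator or Source.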

anthropic:default · confidence high
Out[65]:

saturn.columns["pdf"].stats

stat value
n 6,462
nulls 3,396 (52.6%)
unique 3,054
len_min 10
len_max 41
len_mean 23.73
len_median 23
len_p95 31
word_mean 1.022
word_median 1
n_empty 0
n_duplicates 12
duplicate_rate 0.003914
vocab_size 3,098
readability_flesch_mean -66.81
emoji_rate 0
url_rate 0
one_word_rate 0.9866
allcaps_rate 0.0003262
boilerplate_rate 0
alert: near_unique 99.6% of rows are unique strings
alert: one_word 98.7% rows are a single word
alert: null_rate 52.6% null
Fig 26.
Character-length distribution for pdf.
Character-length distribution for pdf (mean: 23.73).
chars count
10 – 11 2
11 – 12 0
12 – 12 0
12 – 13 1
13 – 14 0
14 – 15 4
15 – 15 0
15 – 16 4
16 – 17 0
17 – 18 14
18 – 19 49
19 – 19 139
19 – 20 277
20 – 21 0
21 – 22 438
22 – 22 473
22 – 23 391
23 – 24 0
24 – 25 275
25 – 26 229
26 – 26 159
26 – 27 134
27 – 28 0
28 – 29 112
29 – 29 86
29 – 30 72
30 – 31 0
31 – 32 60
32 – 32 50
32 – 33 31
33 – 34 27
34 – 35 0
35 – 36 13
36 – 36 8
36 – 37 6
37 – 38 0
38 – 39 3
39 – 39 2
39 – 40 4
40 – 41 3

href formula text metadata

This column contains hyperlink formulas pointing to PDF source documents on sharkattackfile.net, each URL referencing a dated incident report (e.g., '1935.06.05.r-solomonislands.pdf'). Over half the rows (52.62%) are null, meaning many records lack a linked source document. Values are nearly all single-token URLs (one_word_rate 0.9879, url_rate 0.9997), and the extremely negative Flesch readability score (-820.18) confirms these are machine-generated URL strings, not natural text. Only 11 duplicate values exist across 3,051 unique entries, suggesting most cited PDFs are distinct incident references.

Treatment: Extract raw URL string from formula syntax before use; treat as a source-citation reference field and consider joining or flagging unsourced rows (52.62% null) separately.
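A hedged sketch of the extraction step follows. The actual cell format is not shown in this report, so the code accepts both a spreadsheet-style wrapper (an assumed `=HYPERLINK("url","label")` form) and a bare URL. The length statistics differ from those of the pdf column by a constant 54 characters, which would also be consistent with a fixed URL prefix and no wrapping formula at all. The function name `extract_url` and the example URL, assembled from fragments quoted in this report with an assumed `http` scheme, are illustrative.

```python
import re

# First http(s) URL in the cell, stopping at whitespace, quotes, or ')'.
URL_PATTERN = re.compile(r"https?://[^\s\"')]+", re.IGNORECASE)


def extract_url(cell):
    """Return the first http(s) URL found in a cell, or None."""
    if cell is None:
        return None
    match = URL_PATTERN.search(cell)
    return match.group(0) if match else None


# Hypothetical example values; the real cell syntax may differ.
url = "http://sharkattackfile.net/spreadsheets/pdf_directory/1935.06.05.r-solomonislands.pdf"
wrapped = '=HYPERLINK("' + url + '","pdf")'
```

Because the pattern only searches for a URL, the same function works unchanged if the column turns out to hold plain URLs.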

anthropic:default · confidence high
Out[68]:

saturn.columns["href formula"].stats

stat value
n 6,462
nulls 3,400 (52.6%)
unique 3,051
len_min 64
len_max 95
len_mean 77.73
len_median 77
len_p95 85
word_mean 1.019
word_median 1
n_empty 0
n_duplicates 11
duplicate_rate 0.003592
vocab_size 3,089
readability_flesch_mean -820.2
emoji_rate 0
url_rate 0.9997
one_word_rate 0.9879
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 99.6% of rows are unique strings
alert: one_word 98.8% rows are a single word
alert: url_heavy 100.0% rows contain a URL
alert: null_rate 52.6% null
Fig 27.
Character-length distribution for href formula.
Character-length distribution for href formula (mean: 77.73).
chars count
64 – 65 1
65 – 66 0
66 – 66 0
66 – 67 1
67 – 68 0
68 – 69 4
69 – 69 0
69 – 70 5
70 – 71 0
71 – 72 14
72 – 73 49
73 – 73 139
73 – 74 276
74 – 75 0
75 – 76 437
76 – 76 473
76 – 77 392
77 – 78 0
78 – 79 275
79 – 80 228
80 – 80 159
80 – 81 134
81 – 82 0
82 – 83 112
83 – 83 86
83 – 84 72
84 – 85 0
85 – 86 60
86 – 86 48
86 – 87 31
87 – 88 27
88 – 89 0
89 – 90 13
90 – 90 8
90 – 91 6
91 – 92 0
92 – 93 3
93 – 93 2
93 – 94 4
94 – 95 3

href text metadata

This column contains URLs linking to PDF source documents in a shark attack file directory (sharkattackfile.net/spreadsheets/pdf_directory/), serving as citation or evidence links for individual incident records. Over half the rows are null (52.62%), indicating many records lack a linked source document. Nearly all non-null values are single-token URLs (one_word_rate 0.988, url_rate 0.9997), with very few duplicates (11 duplicates out of 3,051 unique values), consistent with per-incident citation links. The high null rate is the key analyst concern — roughly half of incidents have no associated PDF reference.

Treatment: Exclude from predictive modelling; retain as a provenance/citation field, or engineer a binary 'has_source_pdf' indicator from non-null presence.
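The binary indicator mentioned in the treatment could be derived as in the sketch below. The name `has_source_pdf` comes from the treatment line; the decision to treat whitespace-only strings and float NaN (as produced by some CSV loaders) as missing is an assumption, and the sample rows are hypothetical.

```python
def has_source_pdf(href):
    """Return 1 when a row carries a non-empty href value, else 0."""
    if href is None:
        return 0
    # NaN is the only float that is not equal to itself.
    if isinstance(href, float) and href != href:
        return 0
    return 1 if str(href).strip() else 0


# Hypothetical rows, shaped like csv.DictReader output.
rows = [
    {"href": "http://sharkattackfile.net/spreadsheets/pdf_directory/example.pdf"},
    {"href": None},
    {"href": "   "},
]
flags = [has_source_pdf(r["href"]) for r in rows]
```

With 52.62% of rows null, the mean of this flag over the full table should be close to 0.47.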

anthropic:default · confidence high
Out[71]:

saturn.columns["href"].stats

stat value
n 6,462
nulls 3,400 (52.6%)
unique 3,051
len_min 64
len_max 135
len_mean 77.89
len_median 77
len_p95 86
word_mean 1.02
word_median 1
n_empty 0
n_duplicates 11
duplicate_rate 0.003592
vocab_size 3,091
readability_flesch_mean -824.4
emoji_rate 0
url_rate 0.9997
one_word_rate 0.9879
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 99.6% of rows are unique strings
alert: one_word 98.8% rows are a single word
alert: url_heavy 100.0% rows contain a URL
alert: null_rate 52.6% null
Fig 28.
Character-length distribution for href.
Character-length distribution for href (mean: 77.89).
chars count
64 – 66 1
66 – 68 1
68 – 69 4
69 – 71 19
71 – 73 49
73 – 75 415
75 – 76 908
76 – 78 666
78 – 80 224
80 – 82 290
82 – 84 199
84 – 85 131
85 – 87 80
87 – 89 27
89 – 91 21
91 – 92 9
92 – 94 6
94 – 96 3
96 – 98 0
98 – 100 0
100 – 101 0
101 – 103 0
103 – 105 0
105 – 107 0
107 – 108 0
108 – 110 0
110 – 112 0
112 – 114 0
114 – 115 0
115 – 117 0
117 – 119 0
119 – 121 0
121 – 123 0
123 – 124 0
124 – 126 0
126 – 128 0
128 – 130 1
130 – 131 4
131 – 133 2
133 – 135 2

Case Number.1 text identifier

This column appears to be a duplicate or alternate version of a case number field. Its values are date-like codes (e.g., '1966.12.26', '1923.00.00.a'), suggesting archival case identifiers tied to dates, with alphabetic suffixes for disambiguation. The 52.62% null rate makes the column unreliable for most analyses. The near-unique flag (3,054 unique values among 3,062 non-null rows, with only 8 duplicates) confirms that it functions as a quasi-identifier where present. The all-caps rate of 78.97% likely reflects codes that contain no lowercase letters, since the values are alphanumeric codes rather than natural language. The '.1' suffix in the column name strongly suggests a duplicated column, for example one renamed automatically when a file with repeated headers was loaded, or one produced by a merge or pivot operation.

Treatment: Investigate overlap with the original 'Case Number' column; if redundant, drop it; otherwise keep it as a secondary join key and flag nulls rather than imputing them.
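The overlap check could look like the sketch below. It assumes rows are available as dictionaries keyed by column name (for example from `csv.DictReader`) and that a case-insensitive comparison is appropriate; both are assumptions, and the function name `compare_case_numbers` and the sample rows are hypothetical.

```python
from collections import Counter


def compare_case_numbers(rows, left="Case Number", right="Case Number.1"):
    """Tally how two case-number columns relate, row by row."""
    tally = Counter()
    for row in rows:
        a = (row.get(left) or "").strip()
        b = (row.get(right) or "").strip()
        if not b:
            tally["right_missing"] += 1
        elif not a:
            tally["left_missing"] += 1
        elif a.lower() == b.lower():
            tally["match"] += 1
        else:
            tally["differ"] += 1
    return tally


# Hypothetical rows built from the example codes quoted above.
sample = [
    {"Case Number": "1966.12.26", "Case Number.1": "1966.12.26"},
    {"Case Number": "1923.00.00.a", "Case Number.1": None},
    {"Case Number": "1966.12.26", "Case Number.1": "1966.12.27"},
]
result = compare_case_numbers(sample)
```

If nearly every non-null row lands in `match`, the column is redundant and can be dropped; a sizeable `differ` count would argue for keeping it and inspecting the mismatches.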

anthropic:default · confidence medium
Out[74]:

saturn.columns["Case Number.1"].stats

stat value
n 6,462
nulls 3,400 (52.6%)
unique 3,054
len_min 7
len_max 18
len_mean 10.59
len_median 10
len_p95 12
word_mean 1.002
word_median 1
n_empty 0
n_duplicates 8
duplicate_rate 0.002613
vocab_size 3,057
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9984
allcaps_rate 0.7897
boilerplate_rate 0
alert: near_unique 99.7% of rows are unique strings
alert: one_word 99.8% rows are a single word
alert: allcaps 79.0% rows are all-caps
alert: null_rate 52.6% null
alert: short_text 95th-percentile length under 20 chars
Fig 29.
Character-length distribution for Case Number.1.
Character-length distribution for Case Number.1 (mean: 10.59).
chars count
7 – 7 120
7 – 8 0
8 – 8 0
8 – 8 0
8 – 8 0
8 – 9 0
9 – 9 0
9 – 9 4
9 – 9 0
9 – 10 0
10 – 10 1884
10 – 10 0
10 – 11 0
11 – 11 0
11 – 11 7
11 – 11 0
11 – 12 0
12 – 12 0
12 – 12 1011
12 – 12 0
12 – 13 0
13 – 13 8
13 – 13 0
13 – 14 0
14 – 14 0
14 – 14 24
14 – 14 0
14 – 15 0
15 – 15 0
15 – 15 2
15 – 16 0
16 – 16 0
16 – 16 1
16 – 16 0
16 – 17 0
17 – 17 0
17 – 17 0
17 – 17 0
17 – 18 0
18 – 18 1

Case Number.2 text identifier

This column appears to be a structured case number identifier, likely encoding a date-based reference system (e.g., '1966.12.26', '1915.07.06.a.r') typical of archival, legal, or historical record catalogues. With a 52.62% null rate across 6,462 rows and 3,055 unique values among the 3,062 non-null rows, the column is near-unique but severely incomplete. The 78.97% all-caps rate, combined with date-like tokens and alphabetic suffixes (a, b, r), suggests a custom alphanumeric coding scheme rather than free text. Only 7 duplicate values exist, making this effectively an identifier where present.

Treatment: Retain as a join/lookup key; flag nulls separately rather than imputing them; do not encode numerically.
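The date-based coding scheme described above can be made explicit with a small parser. This is a sketch under assumptions: the pattern `YYYY.MM.DD` followed by optional dot-separated letter suffixes is inferred from the two quoted examples, reading '00' as "unknown" is a guess, and values that do not fit (the length histogram shows some as short as 7 characters) simply return `None`. The name `parse_case_number` is hypothetical.

```python
import re

# Four-digit year, two-digit month and day, then zero or more '.letters' suffixes.
CASE_PATTERN = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})((?:\.[A-Za-z]+)*)$")


def parse_case_number(value):
    """Return year, month, day and suffixes as a dict, or None if the value does not fit."""
    if value is None:
        return None
    match = CASE_PATTERN.match(value.strip())
    if match is None:
        return None
    year, month, day, tail = match.groups()
    return {
        "year": int(year),
        "month": int(month) or None,  # '00' is read as unknown (assumption)
        "day": int(day) or None,
        "suffixes": [s.lower() for s in tail.split(".") if s],
    }
```

Parsing keeps the original string intact as the join key while exposing the date parts for validation, for instance against the Year column.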

anthropic:default · confidence high
Out[77]:

saturn.columns["Case Number.2"].stats

stat value
n 6,462
nulls 3,400 (52.6%)
unique 3,055
len_min 7
len_max 18
len_mean 10.59
len_median 10
len_p95 12
word_mean 1.002
word_median 1
n_empty 0
n_duplicates 7
duplicate_rate 0.002286
vocab_size 3,058
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0.9984
allcaps_rate 0.7897
boilerplate_rate 0
alert: near_unique 99.8% of rows are unique strings
alert: one_word 99.8% rows are a single word
alert: allcaps 79.0% rows are all-caps
alert: null_rate 52.6% null
alert: short_text 95th-percentile length under 20 chars
Fig 30.
Character-length distribution for Case Number.2.
Character-length distribution for Case Number.2 (mean: 10.59).
chars count
7 – 7 120
7 – 8 0
8 – 8 0
8 – 8 0
8 – 8 0
8 – 9 0
9 – 9 0
9 – 9 4
9 – 9 0
9 – 10 0
10 – 10 1884
10 – 10 0
10 – 11 0
11 – 11 0
11 – 11 7
11 – 11 0
11 – 12 0
12 – 12 0
12 – 12 1011
12 – 12 0
12 – 13 0
13 – 13 8
13 – 13 0
13 – 14 0
14 – 14 0
14 – 14 24
14 – 14 0
14 – 15 0
15 – 15 0
15 – 15 2
15 – 16 0
16 – 16 0
16 – 16 1
16 – 16 0
16 – 17 0
17 – 17 0
17 – 17 0
17 – 17 0
17 – 18 0
18 – 18 1

original order numeric metadata

This column appears to be a positional or sequence index assigned to records, likely reflecting the original sort order of items in a source dataset. The most striking issue is the 52.62% null rate: over half the rows carry no value, which is highly anomalous for an ordering field and suggests either a join that left many rows unmatched or an ordering that was recorded for only a subset of records. With 3,061 unique values among 3,062 non-null rows, one value is repeated, so the column is not strictly unique as a sequence key. The distribution is right-skewed (skew 0.99, kurtosis 3.55) with 27 outliers. The histogram shows why: values are spread almost evenly from 3 to about 3,090, and the remaining 27 sit in an isolated bin between 6,340 and 6,502.

Treatment: Investigate source of 52.62% nulls before use; if retaining, treat as an optional sort key and do not use as a unique identifier given duplicate values.
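The outlier count can be related to the quartiles reported for this column. The sketch below assumes the common 1.5 × IQR (Tukey) rule; the source does not say which rule saturn applies, so the fences are illustrative. The function name `tukey_fences` is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for 'original order'.
low, high = tukey_fences(q1=768.2, q3=2299)

# Outlier share among non-null rows: 27 of 6,462 - 3,400 = 3,062.
outlier_rate = 27 / (6462 - 3400)
```

Under that assumption the upper fence is about 4,595. The only occupied histogram bin above it is 6,340 – 6,502, which holds 27 values, matching n_outliers, and 27 / 3,062 reproduces the reported outlier_rate of 0.008818.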

anthropic:default · confidence medium
Out[80]:

saturn.columns["original order"].stats

stat value
n 6,462
nulls 3,400 (52.6%)
unique 3,061
min 3
max 6,502
mean 1564
median 1534
std 988.4
q1 768.2
q3 2299
iqr 1530
skew 0.9878
kurtosis 3.551
n_outliers 27
outlier_rate 0.008818
zero_rate 0
alert: null_rate 52.6% null
Fig 31.
Distribution of original order. Vertical dash marks the median.
Histogram bins for original order (median: 1533.5).
bin count
3 – 165.5 163
165.5 – 327.9 162
327.9 – 490.4 163
490.4 – 652.9 162
652.9 – 815.4 163
815.4 – 977.8 162
977.8 – 1140 163
1140 – 1303 162
1303 – 1465 163
1465 – 1628 162
1628 – 1790 163
1790 – 1953 162
1953 – 2115 163
2115 – 2278 162
2278 – 2440 163
2440 – 2603 162
2603 – 2765 163
2765 – 2928 162
2928 – 3090 110
3090 – 3252 0
3252 – 3415 0
3415 – 3577 0
3577 – 3740 0
3740 – 3902 0
3902 – 4065 0
4065 – 4227 0
4227 – 4390 0
4390 – 4552 0
4552 – 4715 0
4715 – 4877 0
4877 – 5040 0
5040 – 5202 0
5202 – 5365 0
5365 – 5527 0
5527 – 5690 0
5690 – 5852 0
5852 – 6015 0
6015 – 6177 0
6177 – 6340 0
6340 – 6502 27

Unnamed: 23 categorical

Out[83]:

saturn.columns["Unnamed: 23"].stats

stat value
n 6,462
nulls 6,460 (100.0%)
unique 2
top_value Teramo
top_rate 0.5
cardinality 2
entropy 1
entropy_ratio 1
alert: long_tail 2 singleton categories
alert: null_rate 100.0% null
Fig 32.
Top values for Unnamed: 23.
Top values for Unnamed: 23 (2 unique shown, of 2 total).
value count share
Teramo 1 0.0%
change filename 1 0.0%

How to cite


BibTeX
@misc{saturn-data-trove-global-shark-attack-file-gsaf-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove global shark attack file gsaf},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-global-shark-attack-file-gsaf}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove global shark attack file gsaf. Source: /home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-global-shark-attack-file-gsaf