data trove global shark attack file gsaf

source /home/coolhand/html/datavis/data_trove/data/wild/animal_attacks/shark_attacks_gsaf.csv · 6,462 rows · 24 columns · profiled 2026-06-21

Reading

dataset summary · high confidence anthropic:default

This dataset is the Global Shark Attack File (GSAF), containing 6,462 records of shark attack incidents spanning centuries of documented cases. The first thing to examine is the attack outcome: roughly 75% of non-null 'Fatal (Y/N)' values are non-fatal ('N') and 1,400 are fatal ('Y'), yet the 'Injury' column contains only 823 entries marked simply 'FATAL', so the two columns are worth cross-checking for consistency. A second priority is the geographic and activity breakdown: the USA dominates with 2,310 cases (36%), Florida alone accounts for 1,076, and surfing (1,025) and swimming (932) are by far the most frequently recorded activities. These counts alone do not measure per-participant risk. The 'Year' column carries a data quality warning: a maximum value of 3019 and high kurtosis signal outliers that should be cleaned before any time-series analysis.

citing: Fatal (Y/N).top_values · Fatal (Y/N).stats.top_rate · Injury.top_values · Country.top_values · Area.top_values · Activity.top_values · Year.stats.max · Year.stats.kurtosis · Year.stats.n_outliers · row_count
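
The cross-check between 'Injury' and 'Fatal (Y/N)' suggested in the summary could be sketched as follows. This is a minimal illustration, assuming rows are available as dictionaries keyed by column name (for example from `csv.DictReader`); the function name `fatal_mismatches` and the sample rows are hypothetical and not taken from the file.

```python
def fatal_mismatches(rows):
    """Return rows whose Injury is exactly 'FATAL' but whose flag is not 'Y'."""
    mismatches = []
    for row in rows:
        # Treat missing values (None or empty string) as empty text.
        injury = (row.get("Injury") or "").strip().upper()
        flag = (row.get("Fatal (Y/N)") or "").strip().upper()
        if injury == "FATAL" and flag != "Y":
            mismatches.append(row)
    return mismatches


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"Injury": "FATAL", "Fatal (Y/N)": "Y"},
    {"Injury": "FATAL", "Fatal (Y/N)": ""},
    {"Injury": "Leg bitten", "Fatal (Y/N)": "N"},
]
print(len(fatal_mismatches(sample_rows)))
```

Rows returned by this check are candidates for manual review rather than automatic correction.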

Schema

24 columns
Per-column summary.
column · type · null rate · unique · alerts
index · numeric · 0.0% · 6,462 · none
Case Number · text · 0.0% · 6,442 · near_unique, one_word, allcaps, short_text
Date · text · 0.0% · 5,552 · one_word
Year · numeric · 0.0% · 252 · high_skew
Type · categorical · 0.1% · 12 · none
Country · categorical · 0.8% · 205 · none
Area · categorical · 7.2% · 810 · long_tail
Location · text · 8.4% · 4,148 · multilingual, duplicates
Activity · text · 8.5% · 1,516 · one_word, duplicates
Name · text · 3.3% · 5,339 · none
Unnamed: 9 · categorical · 99.6% · 2 · null_rate
Age · categorical · 44.4% · 154 · null_rate
Injury · text · 0.4% · 3,738 · multilingual, allcaps, duplicates
Fatal (Y/N) · categorical · 8.5% · 7 · none
Time · categorical · 52.5% · 366 · long_tail, null_rate
Species · text · 45.2% · 1,466 · multilingual, null_rate, duplicates
Investigator or Source · text · 0.3% · 4,979 · multilingual, duplicates
pdf · text · 52.6% · 3,054 · near_unique, one_word, null_rate
href formula · text · 52.6% · 3,051 · near_unique, one_word, url_heavy, null_rate
href · text · 52.6% · 3,051 · near_unique, one_word, url_heavy, null_rate
Case Number.1 · text · 52.6% · 3,054 · near_unique, one_word, allcaps, null_rate, short_text
Case Number.2 · text · 52.6% · 3,055 · near_unique, one_word, allcaps, null_rate, short_text
original order · numeric · 52.6% · 3,061 · null_rate
Unnamed: 23 · categorical · 100.0% · 2 · long_tail, null_rate

index

numeric identifier
This column is a row index — a sequential integer identifier running from 0 to 6461 with 6462 unique values and no nulls, perfectly matching the row count. Its distribution is exactly uniform (skew = 0.0, kurtosis ≈ −1.2, mean = median = 3230.5, zero outliers), confirming it was generated as a positional index rather than carrying any domain meaning. The single 'zero' (zero_rate ≈ 0.00015) is simply row 0. There is no analytical signal here. Treatment: Drop before modelling; carries no predictive information. high · anthropic:default
n 6,462 · nulls 0 (0.0%) · unique 6,462
min 0 · max 6,461 · mean 3230 · median 3230 · std 1866
q1 1615 · q3 4846 · iqr 3230 · skew 0 · kurtosis -1.2
n_outliers 0 · outlier_rate 0 · zero_rate 0.0001548

Case Number

text identifier near_unique one_word allcaps short_text
This column is a case identifier whose values appear to encode dates in YYYY.MM.DD format (e.g., '2019.10.08'), suggesting case numbers tied to filing or incident dates. With 6,442 unique values out of 6,462 rows and a null rate of 0.0003 (2 rows), it is near-unique and functions as a primary key. The 18 duplicate values (duplicate_rate 0.0028) are a mild anomaly worth investigating: one value, '2012.09.02.b', hints that suffixes are used to disambiguate same-date cases, so the duplicates imply that this suffixing convention is not applied consistently. The allcaps_rate of 0.748 suggests a mix of formatting styles across records. Treatment: Use as a case-level join key; flag the 18 duplicates for deduplication before modelling. high · anthropic:default
n 6,462 · nulls 2 (0.0%) · unique 6,442
len_min 6 · len_max 18 · len_mean 10.63 · len_median 10 · len_p95 12
word_mean 1.001 · word_median 1 · n_empty 0
n_duplicates 18 · duplicate_rate 0.002786 · vocab_size 6,445
readability_flesch_mean 121.2 · emoji_rate 0 · url_rate 0
one_word_rate 0.9992 · allcaps_rate 0.748 · boilerplate_rate 0
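
Flagging the duplicated case numbers named in the Treatment can be sketched as below. This is a minimal sketch, assuming the column has been read into a plain list of strings; `duplicated_keys` is a hypothetical helper, and the repeated value in the sample is invented for illustration.

```python
from collections import Counter


def duplicated_keys(case_numbers):
    """Map each case number that occurs more than once to its count."""
    # Skip nulls and blank strings; compare on whitespace-trimmed values.
    counts = Counter(c.strip() for c in case_numbers if c and c.strip())
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative values; the duplicate here is made up.
sample_cases = ["2019.10.08", "2012.09.02.b", "2012.09.02.b", "", None]
print(duplicated_keys(sample_cases))
```

Keys returned here would need a manual decision (merge, re-suffix, or drop) before the column is used as a join key.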

Date

text metadata one_word
This column holds the incident date as free-form text in highly inconsistent formats: bare years (e.g., '1957', '1942'), structured dates ('05-Oct-2003'), and vague phrases ('Before 1958', 'No date', 'ca.', 'summer', 'late'). The top word 'reported' appears 559 times, which suggests many values follow a pattern like 'reported [date]' and is a red flag for downstream parsing. With a 14.1% duplicate rate (909 duplicates against 5,552 unique values) and a maximum length of 64 characters, this column cannot be safely cast to a datetime type without substantial normalization work. Treatment: Parse and normalize into structured date fields using regex rules per format pattern; flag unparseable values ('No date', 'ca.', 'reported …') as null or uncertain. high · anthropic:default
n 6,462 · nulls 1 (0.0%) · unique 5,552
len_min 4 · len_max 64 · len_mean 11.43 · len_median 11 · len_p95 20
word_mean 1.155 · word_median 1 · n_empty 0
n_duplicates 909 · duplicate_rate 0.1407 · vocab_size 5,496
readability_flesch_mean 89.93 · emoji_rate 0 · url_rate 0
one_word_rate 0.8728 · allcaps_rate 0.05092 · boilerplate_rate 0
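
The per-pattern parsing described in the Treatment might look like the sketch below. It handles only two of the formats listed above ('05-Oct-2003' style dates and bare years) and assumes that 'reported' appears as a leading prefix, which the profile suggests but does not confirm. `parse_incident_date` is a hypothetical name, and English month abbreviations are assumed.

```python
import re
from datetime import datetime

# Assumed pattern: a leading "Reported" prefix before the date itself.
REPORTED_PREFIX = re.compile(r"^reported\s+", re.IGNORECASE)


def parse_incident_date(raw):
    """Return (value, uncertain): an ISO date or a bare year, else None."""
    if raw is None:
        return None, True
    text = raw.strip()
    uncertain = False
    if REPORTED_PREFIX.match(text):
        text = REPORTED_PREFIX.sub("", text, count=1).strip()
        uncertain = True  # a report date is not necessarily the incident date
    try:
        parsed = datetime.strptime(text, "%d-%b-%Y")
        return parsed.date().isoformat(), uncertain
    except ValueError:
        pass
    if re.fullmatch(r"\d{4}", text):
        return text, True  # year only: day and month are unknown
    return None, True  # 'No date', 'Before 1958', 'ca.', and similar


for value in ["05-Oct-2003", "1957", "Reported 05-Oct-2003", "No date"]:
    print(value, "->", parse_incident_date(value))
```

Values that fall through to `(None, True)` are the ones that would need additional rules or manual review.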

Year

numeric feature high_skew
This column records the year of each incident for 6,462 records, with the bulk of values in a plausible range: a median of 1980 and an IQR of 1943–2006. Two signals are highly surprising: a minimum of 0.0 (a zero rate of nearly 2%, almost certainly a sentinel for a missing year) and a maximum of 3019.0, which is a data-entry error (likely a typo for 2019). The 266 outliers (~4.1% of records) drive extreme negative skew (−6.55) and extraordinary kurtosis (42.54), masking what is otherwise a fairly clean temporal distribution. Treatment: Null-code zeros and values outside a valid range (e.g., < 1800 or > 2100), correct obvious typos like 3019, then use as an ordinal or numeric feature. high · anthropic:default
n 6,462 · nulls 3 (0.0%) · unique 252
min 0 · max 3019 · mean 1930 · median 1980 · std 278.3
q1 1943 · q3 2006 · iqr 63 · skew -6.554 · kurtosis 42.54
n_outliers 266 · outlier_rate 0.04118 · zero_rate 0.01935
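
The null-coding step in the Treatment could be sketched as follows. The bounds 1800 and 2100 are the example range given above, not validated limits for this dataset, and `clean_year` is a hypothetical helper. Note that the sketch nulls out-of-range values such as 3019 rather than correcting them; any correction would be a manual decision.

```python
def clean_year(value, lo=1800, hi=2100):
    """Return the year as an int, or None if missing, zero, or out of range."""
    if value is None:
        return None
    try:
        year = int(float(value))  # accepts 1980, 1980.0, and "1980"
    except (TypeError, ValueError):
        return None  # non-numeric text and NaN both end up here
    if year < lo or year > hi:
        return None  # covers the 0 sentinel and the 3019 entry
    return year


print([clean_year(v) for v in [0, 3019.0, "1980", 2006.0, None]])
```

Because the file is described as spanning centuries, the lower bound may need to be relaxed if earlier incidents are to be retained.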

Type

categorical label
This column classifies shark attack incidents by the nature of the encounter, with 12 distinct categories across 6,462 records. It is heavily dominated by 'Unprovoked' at 73% of all records, creating notable class imbalance. A surprising data quality issue is the fragmentation of watercraft-related incidents across three near-synonymous labels, 'Watercraft' (142), 'Boat' (109), and 'Boating' (92), which are almost certainly the same category and would total 343 records once consolidated. The 'Invalid' category (552 records, ~8.5%) also warrants attention, as it may represent records that should be excluded from incident analysis. Treatment: Consolidate 'Boat', 'Boating', and 'Watercraft' into a single category; consider excluding or flagging 'Invalid' records; one-hot encode for modelling, keeping the severe class imbalance in mind. high · anthropic:default
n 6,462 · nulls 5 (0.1%) · unique 12
top_value Unprovoked · top_rate 0.7304 · cardinality 12
entropy 1.457 · entropy_ratio 0.4064
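
The consolidation in the Treatment amounts to a small lookup table. A minimal sketch follows, assuming 'Watercraft' is chosen as the canonical label (the report does not specify which of the three to keep); `TYPE_MAP` and `normalise_type` are hypothetical names.

```python
# Assumed canonical label: 'Watercraft'. Other labels pass through unchanged.
TYPE_MAP = {"Boat": "Watercraft", "Boating": "Watercraft"}


def normalise_type(value):
    """Collapse near-synonymous watercraft labels; keep nulls as None."""
    if value is None:
        return None
    text = value.strip()
    if not text:
        return None
    return TYPE_MAP.get(text, text)


print([normalise_type(v) for v in ["Boat", "Boating", "Watercraft", "Unprovoked"]])
```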

Country

categorical feature
This column records the country where each incident occurred, with 205 distinct values across 6,462 rows. The distribution is heavily skewed: USA alone accounts for 36% of records (2,310), followed by AUSTRALIA at 21% (1,374) and SOUTH AFRICA at 9% (585), meaning these three countries together represent roughly two-thirds of the dataset. The entropy ratio of 0.51 confirms moderate concentration despite 205 unique values, and the near-zero null rate (0.79%) means coverage is excellent. Treatment: One-hot encode top countries and group tail countries into a residual 'OTHER' category before modelling. high · anthropic:default
n 6,462 · nulls 51 (0.8%) · unique 205
top_value USA · top_rate 0.3603 · cardinality 205
entropy 3.909 · entropy_ratio 0.509
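
Grouping tail countries into 'OTHER' before encoding can be sketched as below. This assumes the column is a plain list of strings with `None` for nulls; `group_rare` and its `top_n` parameter are hypothetical, and the cut-off is a modelling choice rather than something the profile prescribes.

```python
from collections import Counter


def group_rare(values, top_n=3, other="OTHER"):
    """Keep the top_n most frequent values; replace the rest with `other`."""
    counts = Counter(v for v in values if v)
    keep = {name for name, _ in counts.most_common(top_n)}
    # Nulls and empty strings are passed through untouched.
    return [v if (not v or v in keep) else other for v in values]


# Illustrative values, not the real distribution.
sample_countries = ["USA"] * 3 + ["AUSTRALIA"] * 2 + ["FIJI", None]
print(group_rare(sample_countries, top_n=2))
```

The same helper could be applied to 'Area', where the long tail is more severe.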

Area

categorical feature long_tail
This column represents geographic sub-national regions (states, provinces) drawn from multiple countries: the US (Florida, Hawaii, California, South Carolina), Australia (New South Wales, Queensland, Western Australia), and South Africa (KwaZulu-Natal, Western Cape Province, Eastern Cape Province). Florida dominates at 17.9% of non-null values (1,076 records), while 810 unique values against 6,462 rows signal a severe long-tail distribution in which the vast majority of areas appear only rarely. The null rate is 7.16%. Treatment: Encode with frequency or target encoding; consider grouping rare areas (long-tail) into an 'Other' bucket or rolling up to country level to reduce cardinality from 810 classes. high · anthropic:default
n 6,462 · nulls 463 (7.2%) · unique 810
top_value Florida · top_rate 0.1794 · cardinality 810
entropy 6.163 · entropy_ratio 0.6379

Location

text label multilingual duplicates
This column captures incident locations, predominantly structured as 'City, County' strings; the top value is 'New Smyrna Beach, Volusia County' (181 occurrences) and the dominant words are 'county', 'beach', 'island', and 'bay'. The duplicate rate is notably high at ~29.9% (1,769 duplicates among the 5,917 non-null values), reflecting repeated incidents at the same hotspot locations rather than data error. The multilingual alert is triggered by automated language detection misclassifying short geographic names (e.g. 'Durban', 'Boa Viagem, Recife') as non-English: 3,746 values are detected as English, with the remainder split across 29 other 'languages' due to short-string ambiguity. The null rate is 8.43%, which may represent unknown or offshore incident locations. Treatment: Normalize to canonical 'City, County, Country' format, then use as a categorical grouping variable or geocode for spatial analysis. high · anthropic:default
n 6,462 · nulls 545 (8.4%) · unique 4,148
len_min 3 · len_max 119 · len_mean 22.75 · len_median 21 · len_p95 47
word_mean 3.54 · word_median 3 · n_empty 0
n_duplicates 1,769 · duplicate_rate 0.299 · vocab_size 4,483
readability_flesch_mean 53.4 · emoji_rate 0 · url_rate 0
one_word_rate 0.1486 · allcaps_rate 0.000507 · boilerplate_rate 0

Activity

text label one_word duplicates
This column captures the water-based activity the person was engaged in at the time of the incident, dominated by Surfing (1,025) and Swimming (932) with a long tail of descriptive phrases. Despite being labelled 'text', it behaves largely as a categorical label: 62.9% of values are single words, only 1,516 unique values exist across 6,462 rows, and the duplicate rate among non-null values is 74.3%, indicating a loosely controlled vocabulary rather than a strict enum. The median string length of 8 characters versus a maximum of 254 suggests a mix of clean category entries and occasional free-text annotations, which may require normalisation before use. Treatment: Standardise to a controlled vocabulary by clustering near-duplicates (e.g. 'Diving' vs 'Scuba diving'), then encode as a categorical feature; impute or flag the 8.54% nulls separately. high · anthropic:default
n 6,462 · nulls 552 (8.5%) · unique 1,516
len_min 1 · len_max 254 · len_mean 16.21 · len_median 8 · len_p95 49
word_mean 2.497 · word_median 1 · n_empty 0
n_duplicates 4,394 · duplicate_rate 0.7435 · vocab_size 2,244
readability_flesch_mean 39.56 · emoji_rate 0 · url_rate 0
one_word_rate 0.6289 · allcaps_rate 0.0005076 · boilerplate_rate 0

Name

text label
This column is the 'Name' field, intended to identify the person involved in each incident. It is heavily contaminated with non-name entries: the most frequent value is 'male' (579 occurrences), followed by 'female' (106), 'boy' (23), '2 males' (19), and 'boat' (14), with descriptors such as 'sailor' and 'a sailor' also among the top values, indicating that gender and role descriptors were freely mixed with actual proper names. The duplicate rate of 14.5% (908 duplicates against 5,339 unique values) and a 3.33% null rate further confirm this column is inconsistently populated and not a clean identifier. Treatment: Split into two derived columns, one for proper names and one for role/gender descriptors, before any modelling or grouping. high · anthropic:default
n 6,462 · nulls 215 (3.3%) · unique 5,339
len_min 1 · len_max 221 · len_mean 14.83 · len_median 13 · len_p95 35
word_mean 2.453 · word_median 2 · n_empty 0
n_duplicates 908 · duplicate_rate 0.1453 · vocab_size 6,536
readability_flesch_mean 51.73 · emoji_rate 0 · url_rate 0
one_word_rate 0.1657 · allcaps_rate 0.006883 · boilerplate_rate 0

Unnamed: 9

categorical null_rate
n 6,462 · nulls 6,434 (99.6%) · unique 2
top_value M · top_rate 0.8571 · cardinality 2
entropy 0.5917 · entropy_ratio 0.5917

Age

categorical feature null_rate
This column holds the victim's age stored as a categorical (string) type rather than a numeric type, with 154 distinct values across 6,462 rows. The most striking issue is a 44.43% null rate, flagged as an alert, meaning nearly half the records lack an age value. The distribution skews young: the top values cluster between ages 15 and 25, with '17' the most frequent at only 4.46% of non-null values, so teenagers and young adults are the most frequently recorded victims. The high entropy ratio of 0.80 confirms values are spread broadly across the 154 categories despite the youth concentration. A count of 154 distinct values is more than whole-number ages alone would plausibly produce, so some entries are probably non-numeric and will need handling when casting. Treatment: Cast to integer, impute or flag nulls (44.43% missing requires explicit strategy), then treat as ordinal or numeric feature. high · anthropic:default
n 6,462 · nulls 2,871 (44.4%) · unique 154
top_value 17 · top_rate 0.04456 · cardinality 154
entropy 5.827 · entropy_ratio 0.8018
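
Casting 'Age' to an integer, as the Treatment suggests, needs a guard for entries that are not plain numbers. The sketch below is one conservative option: it accepts only values that are entirely digits and treats everything else as missing. `parse_age`, the 120 upper bound, and the non-numeric sample 'teen' are all illustrative assumptions.

```python
import re


def parse_age(raw, max_age=120):
    """Return the age as an int, or None for nulls and non-numeric entries."""
    if raw is None:
        return None
    match = re.fullmatch(r"\s*(\d{1,3})\s*", str(raw))
    if match is None:
        return None  # free text or ranges are left for manual rules
    age = int(match.group(1))
    return age if age <= max_age else None


print([parse_age(v) for v in ["17", " 25 ", "teen", None]])
```

Tracking how many non-null values come back as `None` gives a quick measure of how much free text the column contains.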

Injury

text label multilingual allcaps duplicates
This column describes the outcome or nature of injuries as free text, ranging from 'FATAL' to specific anatomical bite locations (e.g., 'Left foot bitten', 'Leg bitten'). The dominant value is 'FATAL', appearing 823 times, making it by far the most frequent entry. Two signals stand out: a high duplicate rate of 41.9% (2,695 duplicates among the 6,433 non-null values) driven by repetitive categorical-style phrases, and an all-caps rate of 13.1% suggesting inconsistent data entry conventions. Additionally, language detection labels 496 entries as German against 3,812 as English; as with 'Location', this may reflect misclassification of short strings rather than genuinely German text, so it should be verified by inspection before any text-based analysis. Treatment: Normalize case, map high-frequency values to a structured severity/outcome taxonomy, and handle German entries separately or translate before any text embedding or categorical encoding. high · anthropic:default
n 6,462 · nulls 29 (0.4%) · unique 3,738
len_min 5 · len_max 234 · len_mean 31.53 · len_median 25 · len_p95 82
word_mean 5.414 · word_median 4 · n_empty 0
n_duplicates 2,695 · duplicate_rate 0.4189 · vocab_size 2,550
readability_flesch_mean 53.74 · emoji_rate 0 · url_rate 0
one_word_rate 0.1489 · allcaps_rate 0.1307 · boilerplate_rate 0

Fatal (Y/N)

categorical label
This column is a binary fatality flag for incidents, expected to hold only 'Y' or 'N' values. The dominant value is 'N' (4,439 occurrences, 75.0% of non-null values and 68.7% of all rows), with 'Y' accounting for 1,400 cases (23.7% of non-null values, 21.7% of all rows). Surprising data quality issues exist: 5 rows contain clearly erroneous values ('F', 'M', '2017', lowercase 'y'), suggesting data entry errors or row misalignment, and 71 rows are labeled 'UNKNOWN'. The 8.46% null rate adds further incompleteness. Treatment: Standardise 'y' → 'Y', investigate and recode/drop 'F', 'M', '2017' entries, decide on treatment of 'UNKNOWN' and nulls, then binarise (Y=1, N=0) for modelling. high · anthropic:default
n 6,462 · nulls 547 (8.5%) · unique 7
top_value N · top_rate 0.7505 · cardinality 7
entropy 0.8897 · entropy_ratio 0.3169
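
The standardise-then-binarise steps in the Treatment could be sketched as follows. The sketch makes one explicit choice that the report leaves open: 'UNKNOWN' and the erroneous values ('F', 'M', '2017') are mapped to `None` alongside true nulls. `normalise_fatal` is a hypothetical helper name.

```python
def normalise_fatal(raw):
    """Map the flag to 1 (fatal), 0 (non-fatal), or None (unusable)."""
    if raw is None:
        return None
    text = str(raw).strip().upper()  # folds lowercase 'y' into 'Y'
    if text == "Y":
        return 1
    if text == "N":
        return 0
    # Assumption: UNKNOWN, F, M, 2017 and anything else count as missing.
    return None


print([normalise_fatal(v) for v in ["Y", "y", "N", "UNKNOWN", "2017", None]])
```

If 'UNKNOWN' should be kept as its own category instead, it would need a separate return value or an accompanying indicator column.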

Time

categorical feature long_tail null_rate
This column captures time-of-day information, but it is stored inconsistently: values mix coarse labels ('Afternoon', 'Morning') with specific clock times in 'HHhMM' format ('11h00', '16h30'), yielding 366 unique values across 6,462 rows. The null rate is severe at 52.49%, meaning over half of all records are missing a time entirely. The top value 'Afternoon' accounts for only 6.29% of non-null values, and the entropy ratio is 0.77, indicating a long tail of rarely seen time strings, likely due to data entry inconsistency across sources or time periods. Treatment: Standardise to 24h numeric minutes-since-midnight, map label categories ('Morning', 'Afternoon') to representative values or a separate flag, then impute or model missingness explicitly given 52.49% null rate. high · anthropic:default
n 6,462 · nulls 3,392 (52.5%) · unique 366
top_value Afternoon · top_rate 0.06287 · cardinality 366
entropy 6.559 · entropy_ratio 0.7702
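
Converting 'Time' to minutes since midnight, as in the Treatment, might be sketched like this. Clock times in the '11h00' style are parsed directly; the representative minutes for 'Morning' and 'Afternoon' (09:00 and 15:00) are arbitrary placeholders chosen for illustration, and `parse_time` and `LABEL_MINUTES` are hypothetical names.

```python
import re

# Placeholder representative values; these are assumptions, not report data.
LABEL_MINUTES = {"morning": 9 * 60, "afternoon": 15 * 60}


def parse_time(raw):
    """Return (minutes_since_midnight, is_label); minutes is None if unparsed."""
    if raw is None:
        return None, False
    text = str(raw).strip()
    match = re.fullmatch(r"(\d{1,2})h(\d{2})", text)
    if match:
        hours, minutes = int(match.group(1)), int(match.group(2))
        if hours < 24 and minutes < 60:
            return hours * 60 + minutes, False
        return None, False  # looks like a clock time but is out of range
    label = LABEL_MINUTES.get(text.lower())
    if label is not None:
        return label, True  # coarse label mapped to a representative time
    return None, False


for value in ["11h00", "16h30", "Afternoon", None]:
    print(value, "->", parse_time(value))
```

Keeping the `is_label` flag separate lets a model distinguish exact times from coarse ones.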

Species

text label multilingual null_rate duplicates
This column records shark species (and incident validity notes), with values ranging from specific species ('White shark', 'Tiger shark', 'Bull shark') to free-text qualifiers like 'Shark involvement not confirmed' and 'Invalid'. The null rate is severe at 45.25%, and 58.56% of non-null values are duplicates, which is expected for a species label with only 1,466 unique values across 6,462 rows. The multilingual alert deserves a closer look: while 2,582 values are classified as English, 14 are labelled German, 18 Finnish, 11 Chinese, and 8 Turkish among others. Given how short these strings are, detector misclassification is a likely explanation, though entry in non-English locales or multilingual sourcing cannot be ruled out without inspecting the flagged rows. The mix of species names, size descriptions ('4' shark', '1.8 m [6'] shark'), and incident-status phrases means this column is semantically heterogeneous and will require parsing or splitting before use. Treatment: Split into a normalized species category and a separate incident-validity flag; impute or exclude the 45.25% nulls based on task context. high · anthropic:default
n 6,462 · nulls 2,924 (45.2%) · unique 1,466
len_min 3 · len_max 194 · len_mean 22.95 · len_median 17 · len_p95 50
word_mean 4.445 · word_median 4 · n_empty 0
n_duplicates 2,072 · duplicate_rate 0.5856 · vocab_size 1,105
readability_flesch_mean 88.63 · emoji_rate 0 · url_rate 0
one_word_rate 0.04098 · allcaps_rate 0.0002826 · boilerplate_rate 0

Investigator or Source

text metadata multilingual duplicates
This column records the investigator or data source credited for each shark attack incident, typically formatted as an abbreviated name plus an organizational affiliation (e.g., 'C. Moore, GSAF'). GSAF (Global Shark Attack File) dominates the top entries and appears in 983 word tokens, making it the primary contributing organization. The duplicate rate of 22.7% (1,464 duplicates among the 6,443 non-null values) is expected given a finite set of investigators filing multiple reports. The multilingual alert across 22 detected languages is notable: 'en' accounts for 4,457 entries, while 'es' (134), 'fr' (100), 'it' (62), and 'de' (56) reflect international source attribution or non-English investigator names. The high cardinality (4,979 unique values out of 6,462 rows) suggests many entries are one-off source combinations rather than standardized identifiers. Treatment: Normalize organization affiliations (e.g., extract 'GSAF' tag) and standardize investigator name formats before using as a categorical grouping variable. high · anthropic:default
n 6,462 · nulls 19 (0.3%) · unique 4,979
len_min 3 · len_max 210 · len_mean 32.24 · len_median 26 · len_p95 77
word_mean 4.792 · word_median 3 · n_empty 0
n_duplicates 1,464 · duplicate_rate 0.2272 · vocab_size 7,898
readability_flesch_mean 73.62 · emoji_rate 0 · url_rate 0.002328
one_word_rate 0.02592 · allcaps_rate 0.01847 · boilerplate_rate 0

pdf

text foreign_key near_unique one_word null_rate
This column contains PDF filenames or partial file paths, evidenced by the '.pdf' suffixes in the top words and a mean token count of ~1 word per value. Over half the rows (52.55%) are null, and with 3,054 unique values among the 3,066 non-null rows the column is near-unique, functioning more like a document reference key than a descriptive field. The extremely negative Flesch readability score (−66.81) is consistent with structured filename strings rather than natural language. A small number of duplicates (12) suggests some documents are referenced by multiple records. Treatment: Use as a document reference key to join or retrieve associated PDF files; impute or flag nulls before any join. high · anthropic:default
n 6,462 · nulls 3,396 (52.6%) · unique 3,054
len_min 10 · len_max 41 · len_mean 23.73 · len_median 23 · len_p95 31
word_mean 1.022 · word_median 1 · n_empty 0
n_duplicates 12 · duplicate_rate 0.003914 · vocab_size 3,098
readability_flesch_mean -66.81 · emoji_rate 0 · url_rate 0
one_word_rate 0.9866 · allcaps_rate 0.0003262 · boilerplate_rate 0

href formula

text metadata near_unique one_word url_heavy null_rate
This column contains hyperlink formulas pointing to PDF source documents on sharkattackfile.net, each URL referencing a dated incident report (e.g., '1935.06.05.r-solomonislands.pdf'). Over half the rows (52.62%) are null, meaning many records lack a linked source document. Values are nearly all single-token URLs (one_word_rate 0.9879, url_rate 0.9997), and the extremely negative Flesch readability score (-820.18) confirms these are machine-generated URL strings, not natural text. Only 11 duplicate values exist across 3,051 unique entries, suggesting most cited PDFs are distinct incident references. Treatment: Extract raw URL string from formula syntax before use; treat as a source-citation reference field and consider joining or flagging unsourced rows (52.62% null) separately. high · anthropic:default
n 6,462 · nulls 3,400 (52.6%) · unique 3,051
len_min 64 · len_max 95 · len_mean 77.73 · len_median 77 · len_p95 85
word_mean 1.019 · word_median 1 · n_empty 0
n_duplicates 11 · duplicate_rate 0.003592 · vocab_size 3,089
readability_flesch_mean -820.2 · emoji_rate 0 · url_rate 0.9997
one_word_rate 0.9879 · allcaps_rate 0 · boilerplate_rate 0

href

text metadata near_unique one_word url_heavy null_rate
This column contains URLs linking to PDF source documents in a shark attack file directory (sharkattackfile.net/spreadsheets/pdf_directory/), serving as citation or evidence links for individual incident records. Over half the rows are null (52.62%), indicating many records lack a linked source document. Nearly all non-null values are single-token URLs (one_word_rate 0.988, url_rate 0.9997), with very few duplicates (11 duplicates out of 3,051 unique values), consistent with per-incident citation links. The high null rate is the key analyst concern — roughly half of incidents have no associated PDF reference. Treatment: Exclude from predictive modelling; retain as a provenance/citation field, or engineer a binary 'has_source_pdf' indicator from non-null presence. high · anthropic:default
n 6,462 · nulls 3,400 (52.6%) · unique 3,051
len_min 64 · len_max 135 · len_mean 77.89 · len_median 77 · len_p95 86
word_mean 1.02 · word_median 1 · n_empty 0
n_duplicates 11 · duplicate_rate 0.003592 · vocab_size 3,091
readability_flesch_mean -824.4 · emoji_rate 0 · url_rate 0.9997
one_word_rate 0.9879 · allcaps_rate 0 · boilerplate_rate 0

Case Number.1

text identifier near_unique one_word allcaps null_rate short_text
This column appears to be a duplicate or alternate version of the case number field, with values formatted as date-like codes (e.g., '1966.12.26', '1923.00.00.a') tied to dates with alphabetic suffixes for disambiguation. A striking 52.62% null rate makes this column unreliable for most analyses, and the near-unique flag (3,054 unique values among the 3,062 non-null rows) combined with only 8 duplicates confirms it functions as a quasi-identifier. The allcaps rate of 78.97% is notable given that values appear to be alphanumeric codes rather than natural language. The '.1' suffix in the column name is the kind of suffix that tools such as pandas append when a file contains repeated column headers, which strongly suggests the source spreadsheet has more than one 'Case Number' column. Treatment: Investigate overlap with the original 'Case Number' column; if redundant, drop; otherwise impute nulls cautiously or use as a secondary join key. medium · anthropic:default
n 6,462 · nulls 3,400 (52.6%) · unique 3,054
len_min 7 · len_max 18 · len_mean 10.59 · len_median 10 · len_p95 12
word_mean 1.002 · word_median 1 · n_empty 0
n_duplicates 8 · duplicate_rate 0.002613 · vocab_size 3,057
readability_flesch_mean 121.2 · emoji_rate 0 · url_rate 0
one_word_rate 0.9984 · allcaps_rate 0.7897 · boilerplate_rate 0
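
The overlap investigation in the Treatment can start with a simple agreement count. The sketch below assumes rows are dictionaries keyed by column name (for example from `csv.DictReader`); `case_number_agreement` is a hypothetical helper and the sample rows are invented.

```python
def case_number_agreement(rows, a="Case Number", b="Case Number.1"):
    """Return (rows with both values present, rows where the two agree)."""
    both = same = 0
    for row in rows:
        left, right = row.get(a), row.get(b)
        if not left or not right:
            continue  # skip rows where either side is null or empty
        both += 1
        if left.strip() == right.strip():
            same += 1
    return both, same


# Invented rows for illustration only.
sample_pairs = [
    {"Case Number": "1966.12.26", "Case Number.1": "1966.12.26"},
    {"Case Number": "1923.00.00.a", "Case Number.1": "1923.00.00.b"},
    {"Case Number": "2019.10.08", "Case Number.1": None},
]
print(case_number_agreement(sample_pairs))
```

If the two counts are equal on the real data, the column is redundant wherever it is populated; the same check applies to 'Case Number.2' by passing `b="Case Number.2"`.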

Case Number.2

text identifier near_unique one_word allcaps null_rate short_text
This column appears to be another copy of the case number identifier, encoding a date-based reference (e.g., '1966.12.26', '1915.07.06.a.r') in the same style as 'Case Number' and 'Case Number.1'. With a 52.62% null rate across 6,462 rows and 3,055 unique values among the 3,062 non-null rows, the column is near-unique but severely incomplete. The 78.97% all-caps rate combined with date-like tokens and alphabetic suffixes (a, b, r) suggests a custom alphanumeric coding scheme rather than free text. Only 7 duplicate values exist, making this effectively an identifier where present. Treatment: Retain as a join/lookup key; impute or flag nulls separately; do not encode numerically. high · anthropic:default
n 6,462 · nulls 3,400 (52.6%) · unique 3,055
len_min 7 · len_max 18 · len_mean 10.59 · len_median 10 · len_p95 12
word_mean 1.002 · word_median 1 · n_empty 0
n_duplicates 7 · duplicate_rate 0.002286 · vocab_size 3,058
readability_flesch_mean 121.2 · emoji_rate 0 · url_rate 0
one_word_rate 0.9984 · allcaps_rate 0.7897 · boilerplate_rate 0

original order

numeric metadata null_rate
This column appears to be a positional or sequence index assigned to records, likely reflecting the original sort order of items in a source dataset. The most striking issue is a 52.62% null rate: over half the rows carry no value, which is highly anomalous for an ordering field and suggests either a join that left many rows unmatched or that ordering was only recorded for a subset of records. With 3,061 unique values among the 3,062 non-null rows, one value is repeated, which is enough to rule the column out as a unique sequence key. The maximum of 6,502 also exceeds the row count of 6,462, so the numbering does not map one-to-one onto the current rows. The distribution is right-skewed (skew 0.99) with notable leptokurtosis (3.55) and 27 outliers, hinting at a few unusually large order values relative to the bulk. Treatment: Investigate source of 52.62% nulls before use; if retaining, treat as an optional sort key and do not use as a unique identifier given duplicate values. medium · anthropic:default
n 6,462 · nulls 3,400 (52.6%) · unique 3,061
min 3 · max 6,502 · mean 1564 · median 1534 · std 988.4
q1 768.2 · q3 2299 · iqr 1530 · skew 0.9878 · kurtosis 3.551
n_outliers 27 · outlier_rate 0.008818 · zero_rate 0

Unnamed: 23

categorical long_tail null_rate
n 6,462 · nulls 6,460 (100.0%) · unique 2
top_value Teramo · top_rate 0.5 · cardinality 2
entropy 1 · entropy_ratio 1