
bigfoot listings 20260210

source: /home/coolhand/html/datavis/data_trove/cache/bigfoot/listings_20260210.json · 5,411 rows · 9 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 5,411 Bigfoot sighting reports from BFRO (Bigfoot Field Researchers Organization) across 9 columns: an id, location (state, state_code, county), timing (month, year), a classification grade, a short description, and a source URL. Sightings are concentrated in Washington, California, Ohio and Florida, and cluster in summer and early-fall months (August, October, July). Classification is dominated by Class B (2,722) and Class A (2,655), with Class C barely represented (34), which is worth flagging if you plan to filter by report quality. The year distribution is left-skewed, with a median of 2001 and a long tail back to 1870, so most reports are recent. The county field has 338 empty values, stored as empty strings rather than nulls, and an 81% duplicate rate, which is expected because counties repeat across reports.

citing: row_count · column_count · columns.state.top_values · columns.month.top_values · columns.classification.top_values · columns.year.stats · columns.county.stats · columns.description.stats

Schema

9 columns
Per-column summary.

column          type         nulls  unique  alerts
id              numeric      0.0%   5,411
state           categorical  0.0%   53
state_code      categorical  0.0%   53
county          text         0.0%   1,022   one_word, short_text, duplicates
url             text         0.0%   5,411   near_unique, one_word, url_heavy
month           categorical  3.0%   32
year            numeric      1.1%   99
classification  categorical  0.0%   3
description     text         0.0%   5,407   near_unique

id

numeric identifier
This column is almost certainly a row identifier: all 5411 values are unique, none are null, and they span a wide integer range from 60 to 79711. The distribution is right-skewed (skew 0.91) with no outliers flagged, consistent with sparsely allocated record IDs rather than a measured quantity. Treatment: Drop from modelling features; retain only as a join key. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
5,411
min
60
max
79,711
mean
2.329e+04
median
16,598
std
2.138e+04
q1
4,898
q3
3.636e+04
iqr
31,464
skew
0.9109
kurtosis
-0.151
n_outliers
0
outlier_rate
0
zero_rate
0

state

categorical feature
US state names across 5411 rows with 53 unique values (slightly above the 50 states, suggesting DC, territories, or stray entries) and no nulls. Distribution is fairly even (entropy ratio 0.877) but Washington leads at 11.66% with 631 rows, ahead of California (431) and Ohio (317), which is unusual since California typically dominates US samples. Treatment: One-hot or target-encode; investigate the 3 extra categories beyond 50 states. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
53
top_value
Washington
top_rate
0.1166
cardinality
53
entropy
5.025
entropy_ratio
0.8773

state_code

categorical feature
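Two-letter US state codes with 53 distinct values, suggesting states plus DC, territories, or stray entries. The top value is stored in lowercase (wa), so codes may need upper-casing before joining to other sources. Distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.7% (631 rows), with CA, OH, and FL also prominent, so the ranking is not population-weighted. No nulls. The cardinality, top rate, and entropy are identical to those of state, which suggests the two columns encode the same field. Treatment: one-hot or target-encode for modelling; safe to use as-is since complete and low-cardinality, but consider keeping only one of state and state_code. high · anthropic:claude-opus-4-7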
n
5,411
nulls
0 (0.0%)
unique
53
top_value
wa
top_rate
0.1166
cardinality
53
entropy
5.025
entropy_ratio
0.8773
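
Because state and state_code report the same cardinality (53), top rate (0.1166), and entropy (5.025), it may be worth confirming that one is a strict recoding of the other before dropping either. The following is a minimal sketch, assuming the rows are available as dicts keyed by column name; `is_one_to_one` is a hypothetical helper, and the sample rows are illustrative rather than taken from the file.

```python
def is_one_to_one(pairs: list[tuple[str, str]]) -> bool:
    """True when each left value pairs with exactly one right value and vice versa."""
    distinct = set(pairs)
    lefts = {left for left, _ in distinct}
    rights = {right for _, right in distinct}
    # Equal sizes mean no left value has two rights and no right value has two lefts.
    return len(distinct) == len(lefts) == len(rights)


# Illustrative rows only; the lowercase code mirrors the reported top value "wa".
rows = [
    {"state": "Washington", "state_code": "wa"},
    {"state": "Ohio", "state_code": "oh"},
    {"state": "Washington", "state_code": "wa"},
]
# Upper-case the codes so "wa" and "WA" are not treated as different values.
pairs = [(r["state"], r["state_code"].upper()) for r in rows]
print(is_one_to_one(pairs))
```

If the check returns False on the full data, the conflicting pairs are a natural place to look for the entries beyond the 50 states.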

county

text feature one_word short_text duplicates
Single-word US county names (Pierce, Jefferson, Lewis, Snohomish, and Skamania suggest a Pacific Northwest tilt), with 1,022 unique values across 5,411 rows. Duplicates dominate at 81.1% (4,389 repeats), which is expected for a categorical. However, 338 rows (6.2%) are empty strings rather than nulls, so the reported null rate of 0.0% understates missingness. Treatment: Coerce empty strings to null, then treat as a categorical (target/frequency encode for modelling). high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
1,022
len_min
0
len_max
23
len_mean
6.621
len_median
7
len_p95
10
word_mean
1
word_median
1
n_empty
338
n_duplicates
4,389
duplicate_rate
0.8111
vocab_size
1,020
readability_flesch_mean
16.9
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
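
The county treatment (coerce blanks to null, then frequency-encode) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the column is available as a list of strings, `blank_to_none` and `frequency_encode` are hypothetical helper names, and the sample values are illustrative rather than rows from the file.

```python
from collections import Counter
from typing import Optional


def blank_to_none(value: Optional[str]) -> Optional[str]:
    """Map empty or whitespace-only strings to None so they count as nulls."""
    if value is None or value.strip() == "":
        return None
    return value.strip()


def frequency_encode(values: list[Optional[str]]) -> list[Optional[float]]:
    """Replace each category with its share of the non-null rows."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return [None if v is None else counts[v] / total for v in values]


# Illustrative values only; the real column has 5,411 rows, 338 of them blank.
raw = ["Pierce", "", "Lewis", "Pierce", "  ", "Skamania"]
cleaned = [blank_to_none(v) for v in raw]
encoded = frequency_encode(cleaned)
null_rate = cleaned.count(None) / len(cleaned)

print(cleaned)
print(encoded)
print(f"null rate after coercion: {null_rate:.3f}")
```

Applied to the real column, the null rate after coercion should land near 338 / 5,411 ≈ 6.2%, assuming blank strings are the only disguised nulls.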

url

text identifier near_unique one_word url_heavy
This column holds a unique BFRO (Bigfoot Field Researchers Organization) report URL for each of the 5411 rows, all following the pattern https://www.bfro.net/gdb/show_report.asp?id=. Every value is unique (n_unique=5411, duplicate_rate=0.0), non-null, and url_rate=1.0, so it functions as a per-row identifier rather than a feature. Lengths cluster tightly between 46 and 49 characters, consistent with the report id being the only varying segment. Treatment: Drop from modelling; retain as a row-level link or extract the numeric report id as a key. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
5,411
len_min
46
len_max
49
len_mean
48.56
len_median
49
len_p95
49
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
5,411
readability_flesch_mean
-301.8
emoji_rate
0
url_rate
1
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
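
Extracting the numeric report id from each URL, as the treatment suggests, could look like the sketch below. It assumes every URL carries the id in an `id` query parameter, as the quoted pattern implies; `report_id` is a hypothetical helper. The quoted prefix is 44 characters long, so ids of 2 to 5 digits would give exactly the reported length range of 46 to 49. Whether the URL id equals the id column is an assumption to verify, not something the profile states.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Prefix quoted in the column description.
PREFIX = "https://www.bfro.net/gdb/show_report.asp?id="


def report_id(url: str) -> Optional[int]:
    """Return the integer in the `id` query parameter, or None if absent or non-numeric."""
    values = parse_qs(urlparse(url).query).get("id", [])
    if len(values) == 1 and re.fullmatch(r"\d+", values[0]):
        return int(values[0])
    return None


# 60 and 79711 are the min and max of the id column, used here only for illustration.
for rid in (60, 79711):
    url = f"{PREFIX}{rid}"
    print(len(url), report_id(url))
```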

month

categorical feature
Month names, presumably the month of the sighting. The distribution is seasonal, with summer and autumn months (August at 12.07%, then October and July) dominating and winter months trailing. Cardinality is 32, far above the expected 12, which suggests dirty values (typos, abbreviations, or non-month strings), alongside a 2.96% null rate (160 rows). Treatment: Normalize to the 12 canonical months (resolving the 20 extra categories) and impute or flag nulls before encoding. high · anthropic:claude-opus-4-7
n
5,411
nulls
160 (3.0%)
unique
32
top_value
August
top_rate
0.1207
cardinality
32
entropy
3.807
entropy_ratio
0.7614

year

numeric timestamp
A year value, presumably the year of the sighting, spanning 1870 to 2025 with a median of 2001 and an IQR of 22 years. The distribution is left-skewed (skew -0.97) with a long tail of older entries, and 49 outliers (0.9%) sit on the early end. Null rate is low at 1.05% (57 rows) and there are 99 distinct years. Treatment: Treat as a temporal feature; consider bucketing by decade or computing age relative to a reference year. high · anthropic:claude-opus-4-7
n
5,411
nulls
57 (1.1%)
unique
99
min
1870
max
2025
mean
1998
median
2001
std
15.79
q1
1987
q3
2009
iqr
22
skew
-0.9738
kurtosis
1.997
n_outliers
49
outlier_rate
0.009152
zero_rate
0
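
The outlier count and the decade bucketing can be illustrated from the reported quartiles. The profile does not say which outlier rule it uses; the sketch below assumes Tukey's 1.5 × IQR fences, and `tukey_fences` and `decade` are hypothetical helper names.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def decade(year: int) -> int:
    """Bucket a year into the decade it belongs to, e.g. 1987 -> 1980."""
    return year - year % 10


# Quartiles as reported in the stats for this column.
lower, upper = tukey_fences(q1=1987, q3=2009)
print(lower, upper)  # 1954.0 2042.0
print(decade(1870), decade(2001), decade(2025))  # 1870 2000 2020
```

Under that assumption the upper fence (2042) lies above the maximum year (2025), so every flagged value would fall before 1954, which is consistent with the outliers sitting on the early end.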

classification

categorical label
A 3-level categorical label, almost certainly the target or stratification class. Class B (2,722) and Class A (2,655) split the data nearly 50/50, while Class C appears only 34 times (0.6%), a severe minority that accuracy-style metrics will largely ignore. No nulls across 5,411 rows. Treatment: Use as classification target with stratified splits and class-weighting to handle the Class C minority. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
3
top_value
Class B
top_rate
0.503
cardinality
3
entropy
1.049
entropy_ratio
0.6616
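
The entropy figures can be reproduced from the three class counts quoted in the description. The sketch assumes the profiler uses Shannon entropy in bits and normalizes by the log of the cardinality; `shannon_entropy` is an illustrative helper name.

```python
import math


def shannon_entropy(counts: list[int]) -> float:
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Class counts quoted in the column description: Class B, Class A, Class C.
counts = [2722, 2655, 34]
entropy = shannon_entropy(counts)
# Normalize by the maximum possible entropy for three classes, log2(3).
entropy_ratio = entropy / math.log2(len(counts))

print(round(entropy, 3), round(entropy_ratio, 4))
```

Under those assumptions the result should agree with the reported 1.049 and 0.6616 up to rounding. The ratio is well below 1 almost entirely because of Class C; the two large classes alone would give a ratio close to 1.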

description

text free_text near_unique
Short free-text descriptions, averaging 10.6 words (median 10) and 67 characters, almost certainly summarizing sighting reports: top tokens include 'near' (2283), 'sighting' (1436), and 'possible' (1117). Values are nearly unique (5,407 distinct out of 5,411) with only 4 duplicates and no nulls or empties, and a Flesch readability of 55.7 suggests fairly plain prose. A vocabulary of 7,169 words across this small corpus indicates rich lexical variety rather than templated text. Treatment: Tokenize and embed (or extract entities) before modelling; do not treat as a categorical. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
5,407
len_min
10
len_max
221
len_mean
67.04
len_median
65
len_p95
101.5
word_mean
10.62
word_median
10
n_empty
0
n_duplicates
4
duplicate_rate
0.0007392
vocab_size
7,169
readability_flesch_mean
55.71
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0.0001848
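
As a first step toward the tokenize-and-embed treatment, token counts and vocabulary size can be computed with the standard library. This is a sketch only: the profiler's actual tokenizer is not documented here, so the regular expression is an assumption, and the three descriptions are hypothetical examples written for illustration, not rows from the dataset.

```python
import re
from collections import Counter

# Assumed token pattern: runs of lowercase letters, digits, and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


# Hypothetical descriptions for illustration only.
docs = [
    "Possible sighting near a logging road",
    "Hikers hear loud knocks near the lake",
    "Possible footprints found near camp",
]
token_counts = Counter(tok for doc in docs for tok in tokenize(doc))
vocab_size = len(token_counts)

print(token_counts.most_common(2), vocab_size)
```

A different token pattern or stop-word policy would change both the top tokens and the vocabulary size, so figures such as the reported 7,169 should only be compared against a run that uses the same tokenizer.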