saturn·

wild bigfoot sightings

source: /home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json · 5,411 rows · 9 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 5,411 Bigfoot sighting reports from the BFRO database, with fields covering location (state, county), timing (year, month), a classification grade, a short description, and a source URL. Geographically, sightings concentrate in Washington (631), California (431), and Ohio (317), and the most common county is Pierce, which is worth a closer look because the data skews toward the Pacific Northwest. Temporally, the year distribution is left-skewed (mean 1998, median 2001, range 1870–2025), so at least half of the dated reports are from 2001 or later, and August, October, and July dominate the month field, hinting at a summer-to-autumn reporting pattern. Classification is nearly a coin-flip between Class A (2,655) and Class B (2,722), with Class C almost absent (34); flag that imbalance before any modeling. Note also that 338 county values are empty strings, even though the column reports no nulls and state coverage is complete.

citing: state.top_values · county.top_values · year.stats · month.top_values · classification.top_values · row_count · county.stats.n_empty

Schema

9 columns
Per-column summary.

column          type         nulls  unique  Alerts
id              numeric      0.0%   5,411
state           categorical  0.0%   53
state_code      categorical  0.0%   53
county          text         0.0%   1,022   one_word short_text duplicates
url             text         0.0%   5,411   near_unique one_word url_heavy
month           categorical  3.0%   32
year            numeric      1.1%   99
classification  categorical  0.0%   3
description     text         0.0%   5,407   near_unique

id

numeric identifier
This column is almost certainly a row identifier: all 5411 values are unique with no nulls, spanning 60 to 79711. The wide range relative to the row count suggests sparse, non-sequential IDs (likely assigned from a larger source system) rather than a dense 1..N index. Skew of 0.91 and median 16598 vs mean 23288 are expected artifacts of ID allocation, not meaningful distribution signals. Treatment: Exclude from modelling; retain as a join key. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
5,411
min
60
max
79,711
mean
2.329e+04
median
16,598
std
2.138e+04
q1
4,898
q3
3.636e+04
iqr
31,464
skew
0.9109
kurtosis
-0.151
n_outliers
0
outlier_rate
0
zero_rate
0

state

categorical feature
This is a US state field with 53 distinct values across 5411 rows and no nulls, suggesting it includes the 50 states plus a few extras such as DC or territories. Distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.66% (631 rows), ahead of California (431 rows, 7.97%) despite California's larger population, which hints at a geographic bias toward the Pacific Northwest. Treatment: One-hot or target-encode; consider grouping the long tail and noting the Washington over-representation. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
53
top_value
Washington
top_rate
0.1166
cardinality
53
entropy
5.025
entropy_ratio
0.8773
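
The reported entropy_ratio is consistent with Shannon entropy in bits divided by log2 of the cardinality (5.025 / log2(53) ≈ 0.877). The sketch below assumes that definition; the profiler's own implementation is not shown in this report, and the function name `entropy_ratio` is illustrative.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Return (entropy in bits, entropy / log2(cardinality)) for a list of labels."""
    counts = Counter(values)
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    k = len(counts)
    # A single-level column has zero entropy; define its ratio as 0 to avoid 0/0.
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio


# Cross-check of the figures reported for this column (assumed definition).
reported_ratio = 5.025 / math.log2(53)
```

Applied to the classification counts given elsewhere in this report (2,655 / 2,722 / 34), the same function returns roughly 1.05 bits and a ratio of roughly 0.66, which agrees with that column's reported values.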

state_code

categorical feature
This column holds US state codes as two-letter lowercase abbreviations, with 53 distinct values across 5411 rows and no nulls. That is slightly more than the 50 states, suggesting territories or DC are included. Distribution is broad (entropy ratio 0.88) but tilts toward Washington (wa, 631 rows, 11.7%) and California (ca, 431 rows, 8.0%), which is unusual since wa outranks ca despite California's larger population. Treatment: Use as a categorical feature; one-hot or target-encode for modelling. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
53
top_value
wa
top_rate
0.1166
cardinality
53
entropy
5.025
entropy_ratio
0.8773

county

text feature one_word short_text duplicates
Single-word US county names (e.g., Pierce, Jefferson, Lewis, Snohomish, Skamania) acting as a geographic categorical feature. Heavy duplication is expected for this kind of field (duplicate_rate 0.81, 1022 unique values across 5411 rows), but 338 empty strings are recorded as non-null and should be treated as missing. The county mix (Snohomish, Skamania, Pierce, King) skews toward Washington and the Pacific Northwest, and names such as Jefferson and Jackson recur across many states, so a county name is ambiguous unless it is paired with its state. Treatment: Convert empty strings to nulls and encode as a categorical (target/frequency encode given ~1k levels). high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
1,022
len_min
0
len_max
23
len_mean
6.621
len_median
7
len_p95
10
word_mean
1
word_median
1
n_empty
338
n_duplicates
4,389
duplicate_rate
0.8111
vocab_size
1,020
readability_flesch_mean
16.9
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
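
The county treatment (empty strings to nulls, then frequency encoding) could look roughly like the following. This is a minimal standard-library sketch; the helper names and the sample values are illustrative, and treating whitespace-only strings as missing is an added assumption.

```python
from collections import Counter


def clean_county(value):
    """Map empty or whitespace-only county strings to None; strip the rest."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def frequency_encode(values):
    """Replace each non-missing level with its share of the non-missing rows."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    return [counts[v] / total if v is not None else None for v in values]


# Illustrative sample, not rows from the dataset.
sample = ["Pierce", "", "Pierce", "Lewis", "  "]
cleaned = [clean_county(v) for v in sample]
encoded = frequency_encode(cleaned)
```

Because the same county name can occur in several states, encoding the (state_code, county) pair instead of county alone may be the safer choice.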

url

text identifier near_unique one_word url_heavy
This column holds a unique BFRO report URL for each of the 5411 rows, all following the pattern https://www.bfro.net/gdb/show_report.asp?id=. Every value is distinct (n_unique=5411, duplicate_rate=0.0), non-null, and url_rate is 1.0, so it functions as a per-row record locator rather than a feature. Lengths are tightly bounded between 46 and 49 characters, which matches the fixed 44-character prefix followed by a 2- to 5-digit id. Treatment: Keep as a row-level link for traceability; drop from modelling or extract the trailing id as a foreign key. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
5,411
len_min
46
len_max
49
len_mean
48.56
len_median
49
len_p95
49
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
5,411
readability_flesch_mean
-301.8
emoji_rate
0
url_rate
1
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
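
Extracting the trailing id from the URL, as the treatment suggests, might be done as below. The sketch assumes every URL carries a single numeric `id` query parameter, as in the pattern quoted above; whether that value equals the `id` column is not confirmed by this report and should be checked. The function name is illustrative.

```python
from urllib.parse import parse_qs, urlparse


def report_id_from_url(url):
    """Return the integer `id` query parameter, or None if absent or non-numeric."""
    ids = parse_qs(urlparse(url).query).get("id", [])
    if len(ids) != 1:
        return None
    raw = ids[0]
    # Accept plain ASCII digits only, so int() cannot raise.
    if not (raw.isascii() and raw.isdigit()):
        return None
    return int(raw)


# URLs built from the reported pattern and the min/max of the id column.
shortest = "https://www.bfro.net/gdb/show_report.asp?id=60"
longest = "https://www.bfro.net/gdb/show_report.asp?id=79711"
```

The two example URLs are 46 and 49 characters long, matching len_min and len_max for this column.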

month

categorical feature
This column captures the month name of an event, with August (634), October (632), and July (618) leading, consistent with a summer-to-autumn seasonal pattern. Cardinality is 32, far above the expected 12, so at least 20 of the distinct values are not canonical month names and need investigation. Null rate is 2.96% and entropy ratio is 0.76, indicating a reasonably spread but non-uniform distribution; the ratio is measured against all 32 levels, so it will shift once the stray values are cleaned. Treatment: Normalize to the 12 canonical months, then one-hot or cyclically encode. high · anthropic:claude-opus-4-7
n
5,411
nulls
160 (3.0%)
unique
32
top_value
August
top_rate
0.1207
cardinality
32
entropy
3.807
entropy_ratio
0.7614
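
Normalizing to the 12 canonical months and then encoding cyclically could be sketched as follows. The report does not list the 20 or so stray values, so the matching rules here (case, surrounding whitespace, a trailing period, three-letter abbreviations) are assumptions; anything else maps to None for manual review. All names are illustrative.

```python
import math

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Lookup keyed by lowercase full name and by three-letter abbreviation.
LOOKUP = {name.lower(): i for i, name in enumerate(MONTHS, start=1)}
LOOKUP.update({name[:3].lower(): i for i, name in enumerate(MONTHS, start=1)})


def month_number(raw):
    """Return 1..12 for a recognizable month token, else None."""
    if raw is None:
        return None
    key = raw.strip().lower().rstrip(".")
    return LOOKUP.get(key)


def cyclical(month):
    """Map month 1..12 onto the unit circle so December sits next to January."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)
```

The sine/cosine pair keeps adjacent months close in feature space, which a plain 1 to 12 integer does not do across the December to January boundary.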

year

numeric timestamp
Year of record, ranging from 1870 to 2025 across 99 distinct values with a median of 2001 and IQR spanning 1987-2009. The distribution is left-skewed (skew -0.97) with 49 outliers (0.9%) on the early-year tail, and 1.05% of rows are null. Treatment: Treat as a temporal feature; consider bucketing by decade or clipping pre-1970 outliers before modelling. high · anthropic:claude-opus-4-7
n
5,411
nulls
57 (1.1%)
unique
99
min
1870
max
2025
mean
1998
median
2001
std
15.79
q1
1987
q3
2009
iqr
22
skew
-0.9738
kurtosis
1.997
n_outliers
49
outlier_rate
0.009152
zero_rate
0
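
The outlier count and the suggested treatments (decade buckets, clipping early years) can be illustrated with a few small helpers. This assumes the profiler flags outliers with the common 1.5 × IQR (Tukey) rule, which the report does not state; the function names and the 1970 floor parameter are illustrative, the floor taken from the treatment text.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def decade(year):
    """Bucket a year into its decade, e.g. 1987 -> 1980; None stays None."""
    return None if year is None else (year // 10) * 10


def clip_year(year, floor=1970):
    """Raise years below `floor` up to `floor`; None stays None."""
    return None if year is None else max(year, floor)


# Quartiles reported for this column.
low, high = tukey_fences(1987, 2009)
```

With q1 = 1987 and q3 = 2009 the lower fence is 1954 and the upper fence is 2042, so under this rule only early years could be flagged, which fits the report's note that the outliers sit on the early-year tail.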

classification

categorical label
A three-level categorical label (Class A, B, C) with no nulls across 5411 rows. The distribution is essentially binary in practice: Class B (2,722 rows, 50.3%) and Class A (2,655 rows, 49.1%) are nearly tied, while Class C appears only 34 times (0.6%), making it a rare class that will be hard to model or evaluate. Treatment: One-hot or ordinal encode; consider stratified splits or merging Class C given its rarity. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
3
top_value
Class B
top_rate
0.503
cardinality
3
entropy
1.049
entropy_ratio
0.6616
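
A stratified split keeps the 34 Class C rows represented on both sides of a train/test partition. The following is a minimal standard-library sketch; the function name, the 20% test fraction, and the seed are illustrative assumptions, and in practice a library routine would usually be used instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split at about test_fraction."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        if len(indices) > 1:
            # Keep at least one row of each class on both sides of the split.
            n_test = min(max(n_test, 1), len(indices) - 1)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Class counts as reported in the dataset summary.
labels = ["Class A"] * 2655 + ["Class B"] * 2722 + ["Class C"] * 34
train_idx, test_idx = stratified_split(labels)
```

With these counts and a 20% test fraction, Class C contributes 7 rows to the test set and 27 to the training set, which is still very little for evaluation; merging Class C into another class remains the alternative.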

description

text free_text near_unique
Short free-text descriptions, almost certainly sighting summaries: 5407 of 5411 values are unique with a mean length of 67 characters and ~10.6 words. The vocabulary of 7169 tokens is dominated by 'near', 'sighting', and 'possible', suggesting templated phrasings about location-based observations. Duplicates and boilerplate are negligible (4 duplicates, boilerplate_rate 0.00018), and a mean Flesch score of ~55.7 indicates fairly readable prose with no URLs or emoji. Treatment: Tokenize and embed (or extract entities like species/location) before modelling; do not use as a key. high · anthropic:claude-opus-4-7
n
5,411
nulls
0 (0.0%)
unique
5,407
len_min
10
len_max
221
len_mean
67.04
len_median
65
len_p95
101.5
word_mean
10.62
word_median
10
n_empty
0
n_duplicates
4
duplicate_rate
0.0007392
vocab_size
7,169
readability_flesch_mean
55.71
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0.0001848
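
As a first step toward the tokenize-and-embed treatment, the dominant tokens can be checked with a simple counter. The profiler's tokenizer is not described in this report, so the regular expression below is an assumption, and the example strings are made up for illustration rather than taken from the dataset.

```python
import re
from collections import Counter

# Lowercase words and numbers, allowing an internal apostrophe (e.g. "hiker's").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text):
    """Split a description into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


def top_tokens(descriptions, n=3):
    """Return the n tokens that appear in the most descriptions."""
    counts = Counter()
    for text in descriptions:
        # Count each token once per description (document frequency).
        counts.update(set(tokenize(text)))
    return counts.most_common(n)


# Hypothetical descriptions, for illustration only.
examples = [
    "Possible sighting near lake at dusk",
    "Hikers hear knocks near trailhead",
    "Possible tracks found near creek",
]
```

Counting document frequency rather than raw occurrences keeps a single long description from dominating the ranking.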