wild bigfoot sightings

source /home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json · 5,411 rows · 9 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 5,411 Bigfoot sighting reports from the BFRO database, with fields covering location (state, county), timing (year, month), a classification grade, a short description, and a source URL. Geographically, sightings concentrate in Washington (631), California (431), and Ohio (317), and the most common county is Pierce, which is worth a closer look because the data skews toward the Pacific Northwest. Temporally, the year distribution is left-skewed (mean 1998, median 2001, range 1870–2025), so most reports come from the late 1990s onward. August, October, and July dominate the month field, which points to a summer-to-autumn reporting pattern. Classification is nearly a coin flip between Class A (2,655) and Class B (2,722), with Class C almost absent (34); flag that imbalance before any modeling. Note also that 338 county values are empty strings even though the state column has no nulls.

citing: state.top_values · county.top_values · year.stats · month.top_values · classification.top_values · row_count · county.stats.n_empty

Charts the summary said to look at first

state · Top reporting states — note Washington, California, and Ohio lead by a wide margin.

Top values for state (20 unique shown, of 53 total).
value	count	share
Washington	631	11.7%
California	431	8.0%
Ohio	317	5.9%
Florida	314	5.8%
Oregon	253	4.7%
Illinois	239	4.4%
Texas	238	4.4%
Michigan	217	4.0%
Missouri	161	3.0%
Georgia	135	2.5%
Colorado	128	2.4%
Pennsylvania	125	2.3%
British Columbia	122	2.3%
New York	116	2.1%
Kentucky	115	2.1%
Arkansas	104	1.9%
Tennessee	104	1.9%
West Virginia	104	1.9%
Oklahoma	101	1.9%
Idaho	99	1.8%

classification · Class A and Class B are nearly even while Class C is negligible.

Top values for classification (3 unique shown, of 3 total).
value	count	share
Class B	2722	50.3%
Class A	2655	49.1%
Class C	34	0.6%

year · Reports skew toward recent decades, with a long thin tail back to 1870.

Histogram bins for year (median: 2001.0).
bin	count
1870 – 1874	1
1874 – 1878	0
1878 – 1882	0
1882 – 1886	0
1886 – 1889	0
1889 – 1893	1
1893 – 1897	0
1897 – 1901	0
1901 – 1905	0
1905 – 1909	1
1909 – 1913	1
1913 – 1916	0
1916 – 1920	2
1920 – 1924	2
1924 – 1928	2
1928 – 1932	2
1932 – 1936	4
1936 – 1940	2
1940 – 1944	5
1944 – 1948	4
1948 – 1951	15
1951 – 1955	13
1955 – 1959	18
1959 – 1963	24
1963 – 1967	53
1967 – 1971	120
1971 – 1975	158
1975 – 1978	331
1978 – 1982	307
1982 – 1986	257
1986 – 1990	224
1990 – 1994	195
1994 – 1998	380
1998 – 2002	610
2002 – 2006	679
2006 – 2010	622
2010 – 2013	616
2013 – 2017	355
2017 – 2021	220
2021 – 2025	130

month · Sightings cluster in summer and early fall, peaking in August and October.

Top values for month (20 unique shown, of 32 total).
value	count	share
August	634	11.7%
October	632	11.7%
July	618	11.4%
September	515	9.5%
June	468	8.6%
November	458	8.5%
May	303	5.6%
April	259	4.8%
December	233	4.3%
January	228	4.2%
Summer	217	4.0%
March	201	3.7%
February	163	3.0%
Fall	129	2.4%
Spring	96	1.8%
Winter	57	1.1%
Late	6	0.1%
about	6	0.1%
mid	5	0.1%
or	5	0.1%
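
The table above mixes real month names with season labels and stray fragments. A minimal sketch of one way to separate them follows; the function name `normalize_month` and the decision to return seasons in a separate slot are assumptions for illustration, not part of the profiler. Only 20 of the 32 tokens are shown in the report, so anything unrecognized falls through to missing.

```python
CANONICAL_MONTHS = {
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
}
SEASONS = {"Spring", "Summer", "Fall", "Winter"}


def normalize_month(raw):
    """Return a (month, season) pair; at most one element is not None.

    Fragments such as "Late", "about", "mid", and "or" map to (None, None).
    """
    if raw is None:
        return (None, None)
    token = raw.strip().capitalize()  # "AUGUST" and " august " become "August"
    if token in CANONICAL_MONTHS:
        return (token, None)
    if token in SEASONS:
        return (None, token)
    return (None, None)


for value in ["August", "Summer", "about", None]:
    print(repr(value), "->", normalize_month(value))
```

Keeping the season in its own slot preserves a coarse timing signal for the rows that have one (Summer, Fall, Spring, and Winter account for 499 rows in the table) instead of discarding them outright.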

county · Most-named counties, led by Pierce; watch for the 338 missing county entries.

Character-length distribution for county (mean: 6.62).
chars	count
0 – 1	338
1 – 1	0
1 – 2	0
2 – 2	0
2 – 3	0
3 – 3	28
3 – 4	457
4 – 5	0
5 – 5	640
5 – 6	0
6 – 6	1110
6 – 7	0
7 – 7	802
7 – 8	916
8 – 9	0
9 – 9	608
9 – 10	0
10 – 10	301
10 – 11	0
11 – 12	62
12 – 12	94
12 – 13	0
13 – 13	5
13 – 14	0
14 – 14	24
14 – 15	0
15 – 16	16
16 – 16	3
16 – 17	0
17 – 17	3
17 – 18	0
18 – 18	0
18 – 19	0
19 – 20	3
20 – 20	0
20 – 21	0
21 – 21	0
21 – 22	0
22 – 22	0
22 – 23	1

Schema

9 columns

Per-column summary.
column	type	nulls	unique	alerts
id	numeric	0.0%	5,411
state	categorical	0.0%	53
state_code	categorical	0.0%	53
county	text	0.0%	1,022	one_word short_text duplicates
url	text	0.0%	5,411	near_unique one_word url_heavy
month	categorical	3.0%	32
year	numeric	1.1%	99
classification	categorical	0.0%	3
description	text	0.0%	5,407	near_unique

id

numeric identifier

This column is almost certainly a row identifier: all 5411 values are unique with no nulls, spanning 60 to 79711. The wide range relative to the row count suggests sparse, non-sequential IDs (likely assigned from a larger source system) rather than a dense 1..N index. Skew of 0.91 and median 16598 vs mean 23288 are expected artifacts of ID allocation, not meaningful distribution signals. Treatment: Exclude from modelling; retain as a join key. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 5,411
min: 60
max: 79,711
mean: 23,288
median: 16,598
std: 2.138e+04
q1: 4,898
q3: 36,362
iqr: 31,464
skew: 0.9109
kurtosis: -0.151
n_outliers: 0
outlier_rate: 0
zero_rate: 0
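
The "sparse, non-sequential" reading can be checked with one line of arithmetic: compare the number of rows to the size of the integer range the ids span. The sketch below uses only the n, min, and max reported above; the helper name `id_density` is hypothetical.

```python
def id_density(n_rows: int, id_min: int, id_max: int) -> float:
    """Fraction of the integer range [id_min, id_max] that is actually used."""
    span = id_max - id_min + 1
    return n_rows / span


density = id_density(n_rows=5411, id_min=60, id_max=79711)
print(f"{density:.1%} of the id range is occupied")  # prints "6.8% of the id range is occupied"
```

A dense 1..N index would score close to 100%, so a value under 7% supports treating the column as a key drawn from a larger source system.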

state

categorical feature

This is a state-level location field with 53 distinct values across 5,411 rows and no nulls. The count exceeds the 50 US states because the field is not limited to them: British Columbia (122 rows) appears among the top values, so at least some Canadian provinces are included. The distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.66% (631 rows), which is unusually high for a national sample and ahead of California (431), hinting at geographic bias toward the Pacific Northwest. Treatment: One-hot or target-encode; consider grouping the long tail and noting the Washington over-representation. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 53
top_value: Washington
top_rate: 0.1166
cardinality: 53
entropy: 5.025
entropy_ratio: 0.8773
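
Grouping the long tail, as the treatment note suggests, can be sketched as follows. The function name `lump_rare`, the 2% default threshold, and the "Other" label are all assumptions chosen for illustration; the sample list is a toy input, not data from the file.

```python
from collections import Counter


def lump_rare(values, min_share=0.02, other="Other"):
    """Replace categories whose share of rows is below min_share with `other`."""
    values = list(values)
    counts = Counter(values)
    total = len(values)
    keep = {v for v, c in counts.items() if c / total >= min_share}
    return [v if v in keep else other for v in values]


# Toy input: Washington 60%, California 30%, Idaho 10%.
sample = ["Washington"] * 6 + ["California"] * 3 + ["Idaho"]
print(Counter(lump_rare(sample, min_share=0.2)))
```

Applied to the real column with a 2% threshold, the top-values table indicates that the 15 regions from Washington through Kentucky would be kept and everything from Arkansas (1.9%) down would be merged.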

state_code

categorical feature

This column holds two-letter lowercase region codes, with 53 distinct values across 5,411 rows and no nulls. That is more than the 50 US states; the state column lists British Columbia among its top values, so the extra codes likely include Canadian provinces. The distribution is broad (entropy ratio 0.88) but tilts toward Washington (wa, 631 rows, 11.7%) and California (ca, 431 rows, 8.0%), which is unusual since wa outranks ca despite California's larger population. The cardinality, top rate, and entropy match the state column exactly, so this is likely a one-to-one recoding of it. Treatment: Use as a categorical feature; one-hot or target-encode for modelling, and keep only one of state and state_code. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 53
top_value: wa
top_rate: 0.1166
cardinality: 53
entropy: 5.025
entropy_ratio: 0.8773

county

text feature one_word short_text duplicates

Single-word county names (e.g., Pierce, Jefferson, Lewis, Snohomish, Skamania) acting as a geographic categorical feature. Heavy duplication is expected for this kind of field (duplicate_rate 0.81, 1,022 unique values across 5,411 rows), but 338 empty strings are recorded as non-null and should be treated as missing. The county mix (Snohomish, Skamania, Pierce, King) skews toward Washington and the Pacific Northwest, and some names, such as Jefferson and Jackson, occur in many states, so a county name alone does not identify a place. Treatment: Convert empty strings to nulls and encode as a categorical (target or frequency encoding suits the roughly 1,000 levels). high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 1,022
len_min: 0
len_max: 23
len_mean: 6.621
len_median: 7
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 338
n_duplicates: 4,389
duplicate_rate: 0.8111
vocab_size: 1,020
readability_flesch_mean: 16.9
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
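
The treatment above has two steps: turn empty strings into real missing values, then frequency-encode what remains. A minimal sketch, assuming the column arrives as a list of strings; `clean_county` and `frequency_encode` are hypothetical helper names and the sample list is a toy input.

```python
from collections import Counter


def clean_county(value):
    """Map None, empty, and whitespace-only strings to None."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def frequency_encode(values):
    """Encode each county as its share of the non-missing rows."""
    cleaned = [clean_county(v) for v in values]
    counts = Counter(v for v in cleaned if v is not None)
    n_present = sum(counts.values())
    return [counts[v] / n_present if v is not None else None for v in cleaned]


sample = ["Pierce", "", "Pierce", "Lewis", "  "]
encoded = frequency_encode(sample)
print(encoded)
```

Because names such as Jefferson and Jackson recur across states, encoding on a (state, county) pair rather than on county alone may be the safer choice when the distinction matters.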

url

text identifier near_unique one_word url_heavy

This column holds a unique BFRO report URL for each of the 5,411 rows, all following the pattern https://www.bfro.net/gdb/show_report.asp?id= followed by a numeric id. Every value is distinct (n_unique=5411, duplicate_rate=0.0), non-null, and url_rate is 1.0, so it functions as a per-row record locator rather than a feature. Lengths are tightly bound between 46 and 49 characters: the fixed prefix is 44 characters, which leaves a numeric id of 2 to 5 digits and is consistent with the id column's range of 60 to 79711. Treatment: Keep as a row-level link for traceability; drop from modelling or extract the trailing id as a foreign key. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 5,411
len_min: 46
len_max: 49
len_mean: 48.56
len_median: 49
len_p95: 49
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 5,411
readability_flesch_mean: -301.8
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
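
Extracting the trailing id, as the treatment suggests, can be sketched with the standard library's URL parser. The function name `report_id_from_url` is hypothetical, and the example URL is built from the documented pattern plus the minimum id value rather than copied from a row. Whether the extracted value always equals the id column is an assumption the report does not confirm.

```python
from urllib.parse import parse_qs, urlparse


def report_id_from_url(url):
    """Return the integer `id` query parameter, or None if absent or non-numeric."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    if not values or not values[0].isdecimal():
        return None
    return int(values[0])


# Illustrative URL: documented prefix (44 characters) + a 2-digit id = 46 characters.
example_url = "https://www.bfro.net/gdb/show_report.asp?id=60"
print(report_id_from_url(example_url), len(example_url))
```

Comparing the extracted value against the id column row by row would be a quick integrity check before dropping either field.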

month

categorical feature

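This column captures the month of the sighting, with August (634), October (632), and July (618) leading, consistent with a summer-to-autumn seasonal pattern. Cardinality is 32, far above the expected 12, so 20 non-month tokens pollute the field: the top values include season labels (Summer, Fall, Spring, Winter) and stray fragments (Late, about, mid, or) that look like pieces of free-text dates. The null rate is 2.96% and the entropy ratio is 0.76, indicating a reasonably spread but not uniform distribution. August's top_rate of 0.1207 matches 634 divided by the 5,251 non-null rows, whereas the 11.7% share in the top-values table is taken over all 5,411 rows. Treatment: Normalize to the 12 canonical months, decide whether season labels are kept as a coarser signal or treated as missing, then one-hot or cyclically encode. high · anthropic:claude-opus-4-7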

n: 5,411
nulls: 160 (3.0%)
unique: 32
top_value: August
top_rate: 0.1207
cardinality: 32
entropy: 3.807
entropy_ratio: 0.7614
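
Cyclical encoding, mentioned in the treatment, places the months on a circle so that December and January end up adjacent instead of eleven steps apart. A minimal sketch, assuming the field has already been cleaned to the 12 canonical month names; `encode_month_cyclic` is a hypothetical helper name.

```python
import math

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def encode_month_cyclic(month):
    """Map a canonical month name to (sin, cos) on the unit circle.

    Raises ValueError for anything that is not one of the 12 month names.
    """
    index = MONTHS.index(month)  # 0 for January ... 11 for December
    angle = 2 * math.pi * index / 12
    return (math.sin(angle), math.cos(angle))


jan = encode_month_cyclic("January")
feb = encode_month_cyclic("February")
dec = encode_month_cyclic("December")
# The December-January gap equals the January-February gap.
print(math.dist(dec, jan), math.dist(jan, feb))
```

One-hot encoding remains the simpler option when the model does not benefit from knowing that adjacent months are similar.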

year

numeric timestamp

Year of record, ranging from 1870 to 2025 across 99 distinct values with a median of 2001 and IQR spanning 1987-2009. The distribution is left-skewed (skew -0.97) with 49 outliers (0.9%) on the early-year tail, and 1.05% of rows are null. Treatment: Treat as a temporal feature; consider bucketing by decade or clipping pre-1970 outliers before modelling. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 57 (1.1%)
unique: 99
min: 1870
max: 2025
mean: 1998
median: 2001
std: 15.79
q1: 1987
q3: 2009
iqr: 22
skew: -0.9738
kurtosis: 1.997
n_outliers: 49
outlier_rate: 0.009152
zero_rate: 0
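
The report does not state which rule flagged the 49 outliers. A common choice is Tukey's 1.5 × IQR fence, which is assumed in the sketch below together with a simple decade bucket; `tukey_fences` and `decade` are hypothetical helper names.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


def decade(year):
    """Bucket a year into the decade it starts, e.g. 1987 -> 1980."""
    return (year // 10) * 10


lower, upper = tukey_fences(q1=1987, q3=2009)
print(lower, upper)                 # prints "1954.0 2042.0"
print(decade(1870), decade(2025))   # prints "1870 2020"
```

Under that assumption the lower fence lands at 1954 and the upper fence lies beyond the maximum of 2025, so all flagged rows sit on the early tail, and a pre-1970 clip would remove more rows than the flagged outliers alone.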

classification

categorical label

A three-level categorical label (Class A, B, C) with no nulls across 5,411 rows. The distribution is essentially binary in practice: Class B (2,722 rows, 50.3%) and Class A (2,655 rows, 49.1%) are nearly tied, while Class C appears only 34 times (0.6%), making it a rare class that will be hard to model or evaluate. Treatment: One-hot or ordinal encode; consider stratified splits or merging Class C given its rarity. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 3
top_value: Class B
top_rate: 0.503
cardinality: 3
entropy: 1.049
entropy_ratio: 0.6616
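
The entropy figures above can be reproduced from the three class counts. The sketch assumes entropy is measured in bits and that entropy_ratio is entropy divided by log2 of the cardinality; the report does not define either, but this reading matches the reported 1.049 and 0.6616.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


class_counts = [2722, 2655, 34]  # Class B, Class A, Class C
h = entropy_bits(class_counts)
r = entropy_ratio(class_counts)
print(round(h, 3), round(r, 4))
```

Two perfectly balanced classes alone would give exactly 1 bit, so the 34 Class C rows add only about 0.05 bits, which is another way of seeing that the label is binary in practice.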

description

text free_text near_unique

Short free-text descriptions, almost certainly sighting summaries: 5407 of 5411 values are unique with a mean length of 67 characters and ~10.6 words. The vocabulary of 7169 tokens is dominated by 'near', 'sighting', and 'possible', suggesting templated phrasings about location-based observations. Duplicates and boilerplate are negligible (4 dupes, boilerplate_rate 0.00018), and Flesch ~55.7 indicates fairly readable prose with no URLs or emoji. Treatment: Tokenize and embed (or extract entities like species/location) before modelling; do not use as a key. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 5,407
len_min: 10
len_max: 221
len_mean: 67.04
len_median: 65
len_p95: 101.5
word_mean: 10.62
word_median: 10
n_empty: 0
n_duplicates: 4
duplicate_rate: 0.0007392
vocab_size: 7,169
readability_flesch_mean: 55.71
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0.0001848
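
A first pass at the tokenization step can be as simple as counting lowercase word tokens. The sketch below is an assumption about how such a count might be made, not the profiler's own tokenizer; the stopword list is a small illustrative set, and the sample strings are made up for the example rather than taken from the dataset.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")
STOPWORDS = {"a", "an", "the", "of", "in", "on", "by", "and"}


def top_tokens(descriptions, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    counts = Counter()
    for text in descriptions:
        tokens = TOKEN_RE.findall(text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


# Made-up strings in the style the summary describes, not real rows.
samples = [
    "Possible sighting near a lake",
    "Possible vocalizations heard near a campsite",
    "Daylight sighting near a highway",
]
print(top_tokens(samples))
```

Running the same count on the real column should surface 'near', 'sighting', and 'possible' at the top if the profile's vocabulary summary holds.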