bigfoot listings 20260210

source: /home/coolhand/html/datavis/data_trove/cache/bigfoot/listings_20260210.json · 5,411 rows · 9 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 5,411 Bigfoot sighting reports from the BFRO (Bigfoot Field Researchers Organization), with 9 columns covering an id, location (state, state_code, county), timing (year, month), a classification grade, a short description, and a source URL. Sightings are concentrated in Washington, California, Ohio, and Florida, and peak in summer and early fall (August, October, and July are the top three months). Classification is dominated by Class B (2,722) and Class A (2,655), with Class C barely represented (34), which is worth flagging if you plan to filter by report quality. The year distribution is left-skewed with a median of 2001 and a long tail back to 1870, so most reports describe recent decades. The county field has 338 empty strings and an 81% duplicate rate (expected, since counties repeat across reports).

citing: row_count · column_count · columns.state.top_values · columns.month.top_values · columns.classification.top_values · columns.year.stats · columns.county.stats · columns.description.stats

Charts the summary said to look at first

state · Top states for sightings — Washington and California lead by a wide margin.

Top values for state (20 unique shown, of 53 total).
value	count	share
Washington	631	11.7%
California	431	8.0%
Ohio	317	5.9%
Florida	314	5.8%
Oregon	253	4.7%
Illinois	239	4.4%
Texas	238	4.4%
Michigan	217	4.0%
Missouri	161	3.0%
Georgia	135	2.5%
Colorado	128	2.4%
Pennsylvania	125	2.3%
British Columbia	122	2.3%
New York	116	2.1%
Kentucky	115	2.1%
Arkansas	104	1.9%
Tennessee	104	1.9%
West Virginia	104	1.9%
Oklahoma	101	1.9%
Idaho	99	1.8%

classification · Report quality is split almost evenly between Class A and B, with Class C negligible.

Top values for classification (3 unique shown, of 3 total).
value	count	share
Class B	2722	50.3%
Class A	2655	49.1%
Class C	34	0.6%

month · Seasonality of sightings — peaks in August, October, and July.

Top values for month (20 unique shown, of 32 total).
value	count	share
August	634	11.7%
October	632	11.7%
July	618	11.4%
September	515	9.5%
June	468	8.6%
November	458	8.5%
May	303	5.6%
April	259	4.8%
December	233	4.3%
January	228	4.2%
Summer	217	4.0%
March	201	3.7%
February	163	3.0%
Fall	129	2.4%
Spring	96	1.8%
Winter	57	1.1%
Late	6	0.1%
about	6	0.1%
mid	5	0.1%
or	5	0.1%

year · Reports skew heavily toward recent decades, centered around 2001.

Histogram bins for year (median: 2001.0).
bin	count
1870 – 1874	1
1874 – 1878	0
1878 – 1882	0
1882 – 1886	0
1886 – 1889	0
1889 – 1893	1
1893 – 1897	0
1897 – 1901	0
1901 – 1905	0
1905 – 1909	1
1909 – 1913	1
1913 – 1916	0
1916 – 1920	2
1920 – 1924	2
1924 – 1928	2
1928 – 1932	2
1932 – 1936	4
1936 – 1940	2
1940 – 1944	5
1944 – 1948	4
1948 – 1951	15
1951 – 1955	13
1955 – 1959	18
1959 – 1963	24
1963 – 1967	53
1967 – 1971	120
1971 – 1975	158
1975 – 1978	331
1978 – 1982	307
1982 – 1986	257
1986 – 1990	224
1990 – 1994	195
1994 – 1998	380
1998 – 2002	610
2002 – 2006	679
2006 – 2010	622
2010 – 2013	616
2013 – 2017	355
2017 – 2021	220
2021 – 2025	130

description · Descriptions are short (median 10 words), suggesting summary-level rather than full narrative text.

Character-length distribution for description (mean: 67.04).
chars	count
10 – 15	2
15 – 21	4
21 – 26	21
26 – 31	56
31 – 36	108
36 – 42	185
42 – 47	376
47 – 52	551
52 – 57	525
57 – 63	568
63 – 68	692
68 – 73	495
73 – 79	486
79 – 84	369
84 – 89	330
89 – 94	196
94 – 100	135
100 – 105	99
105 – 110	74
110 – 116	42
116 – 121	26
121 – 126	23
126 – 131	10
131 – 137	9
137 – 142	6
142 – 147	4
147 – 152	6
152 – 158	2
158 – 163	1
163 – 168	2
168 – 174	3
174 – 179	0
179 – 184	0
184 – 189	2
189 – 195	1
195 – 200	0
200 – 205	0
205 – 210	0
210 – 216	0
216 – 221	2

Schema

9 columns

Per-column summary.
column	type	nulls	unique	alerts
id	numeric	0.0%	5,411
state	categorical	0.0%	53
state_code	categorical	0.0%	53
county	text	0.0%	1,022	one_word short_text duplicates
url	text	0.0%	5,411	near_unique one_word url_heavy
month	categorical	3.0%	32
year	numeric	1.1%	99
classification	categorical	0.0%	3
description	text	0.0%	5,407	near_unique

id

numeric identifier

This column is almost certainly a row identifier: all 5411 values are unique, none are null, and they span a wide integer range from 60 to 79711. The distribution is right-skewed (skew 0.91) with no outliers flagged, consistent with sparsely allocated record IDs rather than a measured quantity. Treatment: Drop from modelling features; retain only as a join key. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 5,411
min: 60
max: 79,711
mean: 2.329e+04
median: 16,598
std: 2.138e+04
q1: 4898
q3: 3.636e+04
iqr: 31,464
skew: 0.9109
kurtosis: -0.151
n_outliers: 0
outlier_rate: 0
zero_rate: 0

state

categorical feature

State or province names across 5,411 rows with 53 unique values and no nulls. The count exceeds the 50 US states because the column is not US-only: British Columbia appears in the top-values table with 122 rows, and the remaining extras may be DC, territories, other Canadian provinces, or stray entries. Distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.66% with 631 rows, ahead of California (431) and Ohio (317), so the ranking does not follow population. Treatment: One-hot or target-encode; list the categories that are not among the 50 US states and decide whether to keep them. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 53
top_value: Washington
top_rate: 0.1166
cardinality: 53
entropy: 5.025
entropy_ratio: 0.8773
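
The check on extra categories suggested above can be sketched as follows. This is a minimal example, assuming the column holds full names spelled as in the top-values table; `US_STATES` and `non_us_values` are hypothetical names introduced here, not part of the profiler.

```python
# The 50 US state names; anything outside this set is worth a manual look.
US_STATES = frozenset(
    name.strip()
    for name in """
    Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut,
    Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas,
    Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota,
    Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey,
    New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon,
    Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas,
    Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming
    """.split(",")
)


def non_us_values(values):
    """Return the sorted distinct values that are not one of the 50 US states."""
    return sorted({v for v in values if v not in US_STATES})


# Hypothetical sample drawn from names in the top-values table.
sample = ["Washington", "British Columbia", "Ohio", "British Columbia"]
extras = non_us_values(sample)
```

Run against the full column, the result should list every category that is not a US state, which makes the keep-or-drop decision explicit.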

state_code

categorical feature

Two-letter state codes, stored in lowercase (top value "wa"), with 53 distinct values and no nulls. The distinct count, top rate (11.7%, 631 rows), and entropy ratio (0.877) all match the state column, so this looks like a recoding of the same field, including its non-US entries. Treatment: one-hot or target-encode for modelling; the column is complete and low-cardinality, but it is likely redundant with state, so keep only one of the two as a feature. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 53
top_value: wa
top_rate: 0.1166
cardinality: 53
entropy: 5.025
entropy_ratio: 0.8773

county

text feature one_word short_text duplicates

Single-word US county names (Pierce, Jefferson, Lewis, Snohomish, Skamania suggest a Pacific Northwest tilt), with 1,022 unique values across 5,411 rows. Duplicates dominate at 81.1% (4,389 repeats) which is expected for a categorical, but 338 rows are empty strings rather than nulls — null_rate reads 0.0 only because the blanks aren't typed as null. Treatment: Coerce empty strings to null, then treat as a categorical (target/frequency encode for modelling). high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 1,022
len_min: 0
len_max: 23
len_mean: 6.621
len_median: 7
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 338
n_duplicates: 4,389
duplicate_rate: 0.8111
vocab_size: 1,020
readability_flesch_mean: 16.9
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
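
The blank-to-null coercion recommended above can be sketched as follows. Counting the 338 empty strings as missing would raise the county null rate from 0.0% to about 6.2% (338 / 5,411). The helper names `blank_to_none` and `null_rate` are hypothetical, and treating whitespace-only strings as blank is an assumption beyond what the profile reports.

```python
def blank_to_none(value):
    """Return None for None, empty, or whitespace-only values; else stripped text."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None


def null_rate(values):
    """Share of values that are missing once blanks are coerced to None."""
    cleaned = [blank_to_none(v) for v in values]
    return sum(v is None for v in cleaned) / len(cleaned)


# Hypothetical sample; county names are taken from the examples cited above.
sample = ["Pierce", "", "Lewis", "  "]
rate = null_rate(sample)
```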

url

text identifier near_unique one_word url_heavy

This column holds a unique BFRO (Bigfoot Field Researchers Organization) report URL for each of the 5411 rows, all following the pattern https://www.bfro.net/gdb/show_report.asp?id=. Every value is unique (n_unique=5411, duplicate_rate=0.0), non-null, and url_rate=1.0, so it functions as a per-row identifier rather than a feature. Lengths cluster tightly between 46 and 49 characters, consistent with the report id being the only varying segment. Treatment: Drop from modelling; retain as a row-level link or extract the numeric report id as a key. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 5,411
len_min: 46
len_max: 49
len_mean: 48.56
len_median: 49
len_p95: 49
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 5,411
readability_flesch_mean: -301.8
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
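
Extracting the numeric report id, as the treatment suggests, can be sketched with the standard library. The function name `report_id_from_url` is hypothetical. The example ids reuse the minimum and maximum of the id column purely for illustration; whether the id column equals the URL id is an assumption that should be checked on the data.

```python
from urllib.parse import parse_qs, urlparse


def report_id_from_url(url):
    """Return the integer `id` query parameter, or None if absent or non-numeric."""
    values = parse_qs(urlparse(url).query).get("id", [])
    if len(values) == 1 and values[0].isascii() and values[0].isdigit():
        return int(values[0])
    return None


# Illustrative URLs built from the documented pattern (ids are assumptions).
short_url = "https://www.bfro.net/gdb/show_report.asp?id=60"
long_url = "https://www.bfro.net/gdb/show_report.asp?id=79711"
ids = [report_id_from_url(short_url), report_id_from_url(long_url)]
```

The fixed prefix is 44 characters long, so ids of two to five digits give the 46 to 49 character lengths reported above.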

month

categorical feature

Month of the sighting, stored as text. The distribution is seasonal: August (12.07% of non-null rows, 11.7% of all rows), October, and July lead, and winter months trail. Cardinality is 32, well above the expected 12. The top-values table shows why: alongside the 12 month names there are season labels (Summer, Fall, Spring, Winter) and stray fragments (Late, about, mid, or) that look like the first word of a free-text phrase. The null rate is 2.96% (160 rows). Treatment: Normalize to the 12 canonical months, decide how to handle the 20 extra categories (for example, keep seasons in a separate field), and impute or flag nulls before encoding. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 160 (3.0%)
unique: 32
top_value: August
top_rate: 0.1207
cardinality: 32
entropy: 3.807
entropy_ratio: 0.7614
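
One way to carry out the normalization described above is sketched here. It keeps canonical months and season labels in separate slots and maps everything else to missing. The function name `normalize_month` and the decision to preserve seasons rather than discard them are assumptions, not something the profile prescribes.

```python
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)
SEASONS = ("Spring", "Summer", "Fall", "Winter")

_MONTH_LOOKUP = {m.lower(): m for m in MONTHS}
_SEASON_LOOKUP = {s.lower(): s for s in SEASONS}


def normalize_month(raw):
    """Return (month, season); each slot is None unless `raw` matches it exactly."""
    if raw is None:
        return (None, None)
    key = str(raw).strip().lower()
    return (_MONTH_LOOKUP.get(key), _SEASON_LOOKUP.get(key))


# Values taken from the top-values table for month.
results = [normalize_month(v) for v in ["August", "Summer", "Late", None]]
```

Fragments such as "Late" or "or" end up as (None, None), so they can be counted together with the true nulls or inspected separately.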

year

numeric timestamp

Year of the reported sighting, spanning 1870 to 2025 with a median of 2001 and an IQR of 22 years (1987 to 2009). The distribution is left-skewed (skew -0.97) with a long tail of older entries, and the 49 flagged outliers (0.9% of non-null rows) sit on the early end. Null rate is low at 1.05% (57 rows) and there are 99 distinct years. Treatment: Treat as a temporal feature; consider bucketing by decade or computing age relative to a reference year. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 57 (1.1%)
unique: 99
min: 1870
max: 2025
mean: 1998
median: 2001
std: 15.79
q1: 1987
q3: 2009
iqr: 22
skew: -0.9738
kurtosis: 1.997
n_outliers: 49
outlier_rate: 0.009152
zero_rate: 0
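
The two temporal treatments mentioned above, decade buckets and age relative to a reference year, can be sketched as follows. The function names are hypothetical, and the default reference year of 2026 is an assumption based on the profiling date. Missing years (None or NaN) pass through as None.

```python
import math


def _clean_year(year):
    """Return the year as an int, or None for missing values (None or NaN)."""
    if year is None:
        return None
    if isinstance(year, float) and math.isnan(year):
        return None
    return int(year)


def decade(year):
    """Map a year to the first year of its decade, e.g. 2001 -> 2000."""
    y = _clean_year(year)
    return None if y is None else (y // 10) * 10


def age(year, reference_year=2026):
    """Years elapsed between the sighting year and the reference year."""
    y = _clean_year(year)
    return None if y is None else reference_year - y


# The median, minimum, and a missing value.
buckets = [decade(2001.0), decade(1870), decade(None)]
```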

classification

categorical label

A 3-level categorical label (Class A, Class B, Class C) that serves as the report-quality grade and is a natural target or stratification variable. Class B (2,722) and Class A (2,655) split the data nearly 50/50, while Class C appears only 34 times (0.6%), a severe minority that accuracy-style metrics will largely ignore. No nulls across 5,411 rows. Treatment: If used as a classification target, use stratified splits and class weighting to handle the Class C minority. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 3
top_value: Class B
top_rate: 0.503
cardinality: 3
entropy: 1.049
entropy_ratio: 0.6616
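
The entropy figures above are consistent with Shannon entropy in bits, with the ratio taken as entropy divided by log2 of the number of categories. That is an inference from the numbers, since the profiler's formula is not stated. The sketch below reproduces both values from the three class counts and adds inverse-frequency class weights, n / (k * count), the same heuristic as scikit-learn's "balanced" class weights. All function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


def balanced_class_weights(counts_by_label):
    """Inverse-frequency weights: total / (number of classes * class count)."""
    total = sum(counts_by_label.values())
    k = len(counts_by_label)
    return {label: total / (k * n) for label, n in counts_by_label.items()}


# Counts from the classification top-values table.
counts = {"Class A": 2655, "Class B": 2722, "Class C": 34}
h = entropy_bits(list(counts.values()))
ratio = entropy_ratio(list(counts.values()))
weights = balanced_class_weights(counts)
```

With these counts, Class C receives a weight of roughly 53 against roughly 0.7 for the other two classes, which shows how strongly the minority has to be up-weighted.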

description

text free_text near_unique

Short free-text descriptions averaging 10.6 words (median 10) and 67 characters, almost certainly one-line summaries of sighting reports: top tokens include 'near' (2,283), 'sighting' (1,436), and 'possible' (1,117). Values are nearly unique (5,407 distinct out of 5,411, so 4 duplicates) with no nulls or empties, and a mean Flesch readability of 55.7 suggests fairly plain prose. A vocabulary of 7,169 words across such short texts indicates varied wording rather than templated text. Treatment: Tokenize and embed (or extract entities) before modelling; do not treat as a categorical. high · anthropic:claude-opus-4-7

n: 5,411
nulls: 0 (0.0%)
unique: 5,407
len_min: 10
len_max: 221
len_mean: 67.04
len_median: 65
len_p95: 101.5
word_mean: 10.62
word_median: 10
n_empty: 0
n_duplicates: 4
duplicate_rate: 0.0007392
vocab_size: 7,169
readability_flesch_mean: 55.71
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0.0001848
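
A first tokenization pass, as the treatment suggests, can be sketched as follows. The tokenizer here (lowercase, then runs of letters, digits, and apostrophes) is an assumption; the profiler's own tokenizer is not documented, so counts from this sketch may not match the vocab_size or top-token figures above. The sample descriptions are invented for illustration and are not rows from the dataset.

```python
import re
from collections import Counter

_TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return _TOKEN.findall(text.lower())


def token_counts(descriptions):
    """Count token occurrences across an iterable of description strings."""
    counts = Counter()
    for text in descriptions:
        counts.update(tokenize(text))
    return counts


# Invented examples in the style of the column, not actual rows.
sample = [
    "Possible sighting near a lake.",
    "Hikers report a sighting near camp",
]
counts = token_counts(sample)
```

`counts.most_common(n)` then gives a quick top-token list, and `len(counts)` gives a vocabulary size under this tokenizer.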