data-trove-bfro-bigfoot-sightings-full-scrape

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json

Saturn profiled 5,411 rows across 9 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json",
    "--findings", "data-trove-bfro-bigfoot-sightings-full-scrape.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This dataset contains 5,411 Bigfoot sighting reports sourced from the Bigfoot Field Researchers Organization (BFRO), covering 53 distinct state or province values (U.S. states plus at least one Canadian province, British Columbia) with attributes including location, date, classification, and a short description. Washington state leads with 631 reports (about 12% of all sightings), followed by California (431) and Ohio (317), suggesting strong geographic clustering worth examining. The temporal distribution is concentrated in recent decades: the median year is 2001, with a sparse tail stretching back to 1870, which raises the question of whether sightings are truly increasing or simply better reported. Sighting classifications split almost evenly between Class A (direct sightings, 2,655) and Class B (indirect evidence, 2,722), with Class C rare at just 34 reports.

citing: row_count · column_count · state.top_values · state.n_unique · year.median · year.min · year.max · year.skew · classification.top_values · month.top_value · month.top_values · county.n_unique

Out[4]:

saturn.schema() · 9 columns

column	kind	n	null%	unique	alerts
id	numeric	5,411	0.0%	5,411
state	categorical	5,411	0.0%	53
state_code	categorical	5,411	0.0%	53
county	text	5,411	0.0%	1,022	one_word short_text duplicates
url	text	5,411	0.0%	5,411	near_unique one_word url_heavy
month	categorical	5,411	3.0%	32
year	numeric	5,411	1.1%	99
classification	categorical	5,411	0.0%	3
description	text	5,411	0.0%	5,407	near_unique

Fig 1.

state · Look for the outsized lead of Washington state and the Pacific Northwest/Midwest cluster in reported sightings.

Top values for state (20 unique shown, of 53 total).
value	count	share
Washington	631	11.7%
California	431	8.0%
Ohio	317	5.9%
Florida	314	5.8%
Oregon	253	4.7%
Illinois	239	4.4%
Texas	238	4.4%
Michigan	217	4.0%
Missouri	161	3.0%
Georgia	135	2.5%
Colorado	128	2.4%
Pennsylvania	125	2.3%
British Columbia	122	2.3%
New York	116	2.1%
Kentucky	115	2.1%
Arkansas	104	1.9%
Tennessee	104	1.9%
West Virginia	104	1.9%
Oklahoma	101	1.9%
Idaho	99	1.8%

Fig 2.

year · Notice the strong skew toward post-1980 reports, with a long sparse tail stretching back to 1870.

Histogram bins for year (median: 2001.0).
bin	count
1870 – 1874	1
1874 – 1878	0
1878 – 1882	0
1882 – 1886	0
1886 – 1889	0
1889 – 1893	1
1893 – 1897	0
1897 – 1901	0
1901 – 1905	0
1905 – 1909	1
1909 – 1913	1
1913 – 1916	0
1916 – 1920	2
1920 – 1924	2
1924 – 1928	2
1928 – 1932	2
1932 – 1936	4
1936 – 1940	2
1940 – 1944	5
1944 – 1948	4
1948 – 1951	15
1951 – 1955	13
1955 – 1959	18
1959 – 1963	24
1963 – 1967	53
1967 – 1971	120
1971 – 1975	158
1975 – 1978	331
1978 – 1982	307
1982 – 1986	257
1986 – 1990	224
1990 – 1994	195
1994 – 1998	380
1998 – 2002	610
2002 – 2006	679
2006 – 2010	622
2010 – 2013	616
2013 – 2017	355
2017 – 2021	220
2021 – 2025	130

Fig 3.

classification · Class A and Class B sightings are nearly equal halves, with Class C barely registering.

Top values for classification (3 unique shown, of 3 total).
value	count	share
Class B	2722	50.3%
Class A	2655	49.1%
Class C	34	0.6%

Fig 4.

month · Reports peak from July through October (late summer into early autumn), likely reflecting increased outdoor activity rather than Bigfoot seasonality.

Top values for month (20 unique shown, of 32 total).
value	count	share
August	634	11.7%
October	632	11.7%
July	618	11.4%
September	515	9.5%
June	468	8.6%
November	458	8.5%
May	303	5.6%
April	259	4.8%
December	233	4.3%
January	228	4.2%
Summer	217	4.0%
March	201	3.7%
February	163	3.0%
Fall	129	2.4%
Spring	96	1.8%
Winter	57	1.1%
Late	6	0.1%
about	6	0.1%
mid	5	0.1%
or	5	0.1%

Fig 5.

county · Pierce, Jefferson, and Lewis counties lead — check whether these align with the dominant Washington state concentration.

Character-length distribution for county (mean: 6.620957309184994).
chars	count
0 – 1	338
1 – 1	0
1 – 2	0
2 – 2	0
2 – 3	0
3 – 3	28
3 – 4	457
4 – 5	0
5 – 5	640
5 – 6	0
6 – 6	1110
6 – 7	0
7 – 7	802
7 – 8	916
8 – 9	0
9 – 9	608
9 – 10	0
10 – 10	301
10 – 11	0
11 – 12	62
12 – 12	94
12 – 13	0
13 – 13	5
13 – 14	0
14 – 14	24
14 – 15	0
15 – 16	16
16 – 16	3
16 – 17	0
17 – 17	3
17 – 18	0
18 – 18	0
18 – 19	0
19 – 20	3
20 – 20	0
20 – 21	0
21 – 21	0
21 – 22	0
22 – 22	0
22 – 23	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	numeric	0.0%
state	categorical	0.0%
state_code	categorical	0.0%
county	text	0.0%
url	text	0.0%
month	categorical	3.0%
year	numeric	1.1%
classification	categorical	0.0%
description	text	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	id	year
id	+1.00	+0.12
year	+0.12	+1.00

id numeric identifier

This column is a numeric row identifier: all 5,411 values are unique, there are no nulls, and the zero rate is 0.0, consistent with a primary key or surrogate ID. The IDs are not contiguous (range 60 to 79,711, mean about 23,288, median 16,598), suggesting they originate from a larger parent table or were assigned with gaps. Positive skew (0.91) indicates that records cluster in the lower ID ranges, which the histogram confirms (743 IDs fall in the lowest bin); the slightly negative kurtosis (−0.15) and absence of outliers indicate a broad spread without heavy tails.

Treatment: Retain as row key for joins/lookups; exclude from any predictive model as a feature.

anthropic:default · confidence high

Out[13]:

saturn.columns["id"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	5,411
min	60
max	79,711
mean	2.329e+04
median	16,598
std	2.138e+04
q1	4898
q3	3.636e+04
iqr	31,464
skew	0.9109
kurtosis	-0.151
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 8.

Distribution of id. Vertical dash marks the median.

Histogram bins for id (median: 16598.0).
bin	count
60 – 2051	743
2051 – 4043	469
4043 – 6034	305
6034 – 8025	306
8025 – 1.002e+04	268
1.002e+04 – 1.201e+04	202
1.201e+04 – 1.4e+04	198
1.4e+04 – 1.599e+04	176
1.599e+04 – 1.798e+04	119
1.798e+04 – 1.997e+04	81
1.997e+04 – 2.196e+04	89
2.196e+04 – 2.396e+04	146
2.396e+04 – 2.595e+04	254
2.595e+04 – 2.794e+04	215
2.794e+04 – 2.993e+04	191
2.993e+04 – 3.192e+04	105
3.192e+04 – 3.391e+04	77
3.391e+04 – 3.59e+04	85
3.59e+04 – 3.789e+04	98
3.789e+04 – 3.989e+04	91
3.989e+04 – 4.188e+04	113
4.188e+04 – 4.387e+04	90
4.387e+04 – 4.586e+04	90
4.586e+04 – 4.785e+04	84
4.785e+04 – 4.984e+04	71
4.984e+04 – 5.183e+04	80
5.183e+04 – 5.382e+04	10
5.382e+04 – 5.582e+04	33
5.582e+04 – 5.781e+04	70
5.781e+04 – 5.98e+04	92
5.98e+04 – 6.179e+04	18
6.179e+04 – 6.378e+04	78
6.378e+04 – 6.577e+04	47
6.577e+04 – 6.776e+04	65
6.776e+04 – 6.975e+04	42
6.975e+04 – 7.175e+04	8
7.175e+04 – 7.374e+04	33
7.374e+04 – 7.573e+04	45
7.573e+04 – 7.772e+04	50
7.772e+04 – 7.971e+04	74

state categorical feature

This column contains state and province names: 53 unique values with 0 nulls across 5,411 rows. The count exceeds 50 because the values are not limited to U.S. states; the top-values table includes British Columbia (122 records), so at least one Canadian province is present. Washington leads at 11.7% (631 records), roughly 1.5× California's count (431), suggesting a geographic bias toward the Pacific Northwest. An entropy ratio of 0.877 indicates a reasonably broad distribution across states, though the concentration in a handful of large or coastal states is apparent.

Treatment: One-hot encode for tree models or ordinal-encode by region grouping; investigate the Washington overrepresentation relative to population before modelling.

anthropic:default · confidence high

Out[16]:

saturn.columns["state"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	53
top_value	Washington
top_rate	0.1166
cardinality	53
entropy	5.025
entropy_ratio	0.8773

Fig 9.

Top values for state.

Top values for state (20 unique shown, of 53 total).
value	count	share
Washington	631	11.7%
California	431	8.0%
Ohio	317	5.9%
Florida	314	5.8%
Oregon	253	4.7%
Illinois	239	4.4%
Texas	238	4.4%
Michigan	217	4.0%
Missouri	161	3.0%
Georgia	135	2.5%
Colorado	128	2.4%
Pennsylvania	125	2.3%
British Columbia	122	2.3%
New York	116	2.1%
Kentucky	115	2.1%
Arkansas	104	1.9%
Tennessee	104	1.9%
West Virginia	104	1.9%
Oklahoma	101	1.9%
Idaho	99	1.8%

state_code categorical feature

This column contains lowercase state and province codes. The 53 distinct values exceed the 50 U.S. states because non-U.S. codes are present: 'ca-bc' (British Columbia, 122 rows) appears in the top values. Washington ('wa') is notably over-represented at 11.7% of 5,411 rows, roughly 1.5× California ('ca') and nearly 2× Ohio ('oh'), suggesting a geographic skew toward the Pacific Northwest rather than a nationally representative sample. The cardinality, top rate, and entropy match the state column exactly, so the two columns appear to be redundant encodings of the same field. An entropy ratio of 0.877 indicates a reasonably broad distribution, but the concentration in 'wa' is worth flagging.

Treatment: One-hot encode or target-encode for modelling; investigate why 'wa' is over-represented relative to population share before training.

anthropic:default · confidence high

Out[19]:

saturn.columns["state_code"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	53
top_value	wa
top_rate	0.1166
cardinality	53
entropy	5.025
entropy_ratio	0.8773

Fig 10.

Top values for state_code.

Top values for state_code (20 unique shown, of 53 total).
value	count	share
wa	631	11.7%
ca	431	8.0%
oh	317	5.9%
fl	314	5.8%
or	253	4.7%
il	239	4.4%
tx	238	4.4%
mi	217	4.0%
mo	161	3.0%
ga	135	2.5%
co	128	2.4%
pa	125	2.3%
ca-bc	122	2.3%
ny	116	2.1%
ky	115	2.1%
ar	104	1.9%
tn	104	1.9%
wv	104	1.9%
ok	101	1.9%
id	99	1.8%

county text label

This column contains county names, functioning as a categorical geographic label with 1,022 unique values across 5,411 rows. The duplicate rate is high at 81.1%, which is expected for a county field where many records share the same geography. Notably, 338 rows (6.2%) hold empty strings rather than nulls, so the reported null rate of 0.0% understates true missingness. The top counties (Pierce, Jefferson, Lewis, Washington, Snohomish) suggest a Pacific Northwest-heavy dataset, while the presence of Humboldt and Polk hints at multi-state coverage. Because names such as Jefferson, Washington, and Polk recur in many states, county is only meaningful in combination with state.

Treatment: Replace empty strings with NaN, then encode as categorical (ordinal or target-encode) for modelling or use as a group-by key for geographic aggregation.
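
The empty-string clean-up in the treatment can be sketched as follows. This is a minimal standard-library illustration, not Saturn's own method; the helper names `blank_to_none` and `missing_rate` and the four sample values are hypothetical.

```python
def blank_to_none(values):
    """Map empty or whitespace-only strings to None so they count as missing."""
    cleaned = []
    for value in values:
        if value is None or value.strip() == "":
            cleaned.append(None)
        else:
            cleaned.append(value.strip())
    return cleaned


def missing_rate(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)


# Hypothetical sample; the real column has 338 empty strings in 5,411 rows.
sample = ["Pierce", "", "Lewis", "   "]
cleaned = blank_to_none(sample)
sample_rate = missing_rate(cleaned)

# The profile's own counts give the true missing share for county.
true_missing_share = 338 / 5411
```

With the profile's counts, the true missing share for county is about 6.2%, not the 0.0% that the null rate reports.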

anthropic:default · confidence high

Out[22]:

saturn.columns["county"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	1,022
len_min	0
len_max	23
len_mean	6.621
len_median	7
len_p95	10
word_mean	1
word_median	1
n_empty	338
n_duplicates	4,389
duplicate_rate	0.8111
vocab_size	1,020
readability_flesch_mean	16.9
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	81.1% duplicate strings

Fig 11.

Character-length distribution for county.

Character-length distribution for county (mean: 6.620957309184994).
chars	count
0 – 1	338
1 – 1	0
1 – 2	0
2 – 2	0
2 – 3	0
3 – 3	28
3 – 4	457
4 – 5	0
5 – 5	640
5 – 6	0
6 – 6	1110
6 – 7	0
7 – 7	802
7 – 8	916
8 – 9	0
9 – 9	608
9 – 10	0
10 – 10	301
10 – 11	0
11 – 12	62
12 – 12	94
12 – 13	0
13 – 13	5
13 – 14	0
14 – 14	24
14 – 15	0
15 – 16	16
16 – 16	3
16 – 17	0
17 – 17	3
17 – 18	0
18 – 18	0
18 – 19	0
19 – 20	3
20 – 20	0
20 – 21	0
21 – 21	0
21 – 22	0
22 – 22	0
22 – 23	1

url text identifier

This column contains unique URLs pointing to individual report pages on bfro.net (the Bigfoot Field Researchers Organization database), all following the pattern `https://www.bfro.net/gdb/show_report.asp?id=` followed by a numeric report ID. Every one of the 5,411 rows holds a distinct URL (duplicate_rate: 0.0, n_unique: 5411). Lengths fall between 46 and 49 characters (len_mean: 48.56, len_median: 49.0): the fixed prefix is 44 characters, so the variation reflects only report IDs of two to five digits. This column is effectively a key into the BFRO report database, not a content feature, and carries no modelling signal on its own.

Treatment: Retain as a row identifier or use to left-join additional scraped report metadata; drop from any feature matrix.
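
If the URL is used as a join key, the report ID can be pulled out of the query string. The sketch below uses only the standard library; the function name `report_id` is hypothetical, and whether the extracted value equals the `id` column is an assumption that should be checked against the data.

```python
from urllib.parse import urlparse, parse_qs

# Fixed prefix taken from the URL pattern described above.
PREFIX = "https://www.bfro.net/gdb/show_report.asp?id="


def report_id(url):
    """Return the integer report ID from a BFRO report URL, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    if not values or not values[0].isdigit():
        return None
    return int(values[0])


# Illustrative URL built from the pattern and the smallest value in the id column.
example_url = PREFIX + "60"
example_id = report_id(example_url)
example_length = len(example_url)
```

A two-digit ID yields a 46-character URL and a five-digit ID a 49-character one, which matches the reported len_min and len_max.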

anthropic:default · confidence high

Out[25]:

saturn.columns["url"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	5,411
len_min	46
len_max	49
len_mean	48.56
len_median	49
len_p95	49
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	5,411
readability_flesch_mean	-301.8
emoji_rate	0
url_rate	1
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL

Fig 12.

Character-length distribution for url.

Character-length distribution for url (mean: 48.56). Every length is an integer from 46 to 49; empty bins are omitted.
chars	count
46	11
47	288
48	1789
49	3323

month categorical feature

This column is meant to hold calendar month names, but its cardinality is 32 rather than the expected 12. The top-values table shows what the extra values are: season labels (Summer 217, Fall 129, Spring 96, Winter 57) and stray tokens such as 'Late', 'about', 'mid', and 'or', which look like fragments of free-text dates rather than alternate spellings. The distribution is skewed toward summer and early autumn (August 634, October 632, July 618), with winter months underrepresented (December 233, January 228), suggesting a seasonal pattern in when sightings occur or are reported. Note that top_rate (0.1207) equals 634 divided by the 5,251 non-null rows, while the shares in the top-values table divide by all 5,411 rows, which is why August appears as 11.7% there. The field needs normalisation before use.

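Treatment: Audit the 32 distinct values: keep the 12 canonical month names, move season labels to a separate season field (or treat them as a missing month), set fragments such as 'Late' or 'or' to missing, then encode month as an ordered cyclic feature.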
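
One way to carry out the month clean-up and cyclic encoding is sketched below. It assumes that season labels cannot be assigned to a single month and that unrecognised fragments should become missing; the function names `normalise_month` and `cyclic_month` are hypothetical, and only values visible in the top-values table are handled.

```python
import math

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTH_INDEX = {name.lower(): i for i, name in enumerate(MONTHS, start=1)}
SEASONS = {"spring", "summer", "fall", "winter"}


def normalise_month(raw):
    """Return (month_number, season); either part is None when unknown."""
    if raw is None:
        return (None, None)
    token = raw.strip().lower()
    if token in MONTH_INDEX:
        return (MONTH_INDEX[token], None)
    if token in SEASONS:
        return (None, token)
    # Fragments such as 'Late', 'about', 'mid', 'or' carry no usable month.
    return (None, None)


def cyclic_month(month_number):
    """Encode month 1..12 as a point on the unit circle (sin, cos)."""
    angle = 2 * math.pi * (month_number - 1) / 12
    return (math.sin(angle), math.cos(angle))
```

The sine and cosine pair places December next to January, which a plain 1 to 12 ordinal encoding does not.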

anthropic:default · confidence medium

Out[28]:

saturn.columns["month"].stats

stat	value
n	5,411
nulls	160 (3.0%)
unique	32
top_value	August
top_rate	0.1207
cardinality	32
entropy	3.807
entropy_ratio	0.7614

Fig 13.

Top values for month.

Top values for month (20 unique shown, of 32 total).
value	count	share
August	634	11.7%
October	632	11.7%
July	618	11.4%
September	515	9.5%
June	468	8.6%
November	458	8.5%
May	303	5.6%
April	259	4.8%
December	233	4.3%
January	228	4.2%
Summer	217	4.0%
March	201	3.7%
February	163	3.0%
Fall	129	2.4%
Spring	96	1.8%
Winter	57	1.1%
Late	6	0.1%
about	6	0.1%
mid	5	0.1%
or	5	0.1%

year numeric timestamp

This column represents calendar years for records in the dataset, spanning 1870 to 2025 with 99 distinct values. The distribution is left-skewed (skew = -0.974) with a mean of ~1998 and IQR of 22 years (1987–2009), meaning the bulk of records cluster in the late 20th to early 21st century while a thin tail extends back to 1870. The 49 outliers (0.9%) likely correspond to those historically distant records, and analysts should verify whether pre-20th-century entries are genuine or data quality issues.

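Treatment: Treat as an ordinal temporal feature; investigate the 49 outlier records for validity before modelling. If the outlier rule is the conventional 1.5 × IQR fence (an assumption about Saturn's method), the lower cutoff is 1987 − 33 = 1954, so the outliers are the reports dated before 1954.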
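
The outlier cutoff can be estimated from the reported quartiles. The sketch below applies the conventional 1.5 × IQR (Tukey) fence; that Saturn uses this rule is an assumption, and the function name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


# Quartiles taken from the year stats table (q1 = 1987, q3 = 2009).
lower_fence, upper_fence = iqr_fences(1987, 2009)
```

Under this assumption the lower fence is 1954 and the upper fence is 2042, so every outlier lies in the historical tail. That is consistent with the histogram, where the bins before roughly 1954 hold a few dozen records.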

anthropic:default · confidence high

Out[31]:

saturn.columns["year"].stats

stat	value
n	5,411
nulls	57 (1.1%)
unique	99
min	1,870
max	2,025
mean	1998
median	2,001
std	15.79
q1	1,987
q3	2,009
iqr	22
skew	-0.9738
kurtosis	1.997
n_outliers	49
outlier_rate	0.009152
zero_rate	0

Fig 14.

Distribution of year. Vertical dash marks the median.

Histogram bins for year (median: 2001.0).
bin	count
1870 – 1874	1
1874 – 1878	0
1878 – 1882	0
1882 – 1886	0
1886 – 1889	0
1889 – 1893	1
1893 – 1897	0
1897 – 1901	0
1901 – 1905	0
1905 – 1909	1
1909 – 1913	1
1913 – 1916	0
1916 – 1920	2
1920 – 1924	2
1924 – 1928	2
1928 – 1932	2
1932 – 1936	4
1936 – 1940	2
1940 – 1944	5
1944 – 1948	4
1948 – 1951	15
1951 – 1955	13
1955 – 1959	18
1959 – 1963	24
1963 – 1967	53
1967 – 1971	120
1971 – 1975	158
1975 – 1978	331
1978 – 1982	307
1982 – 1986	257
1986 – 1990	224
1990 – 1994	195
1994 – 1998	380
1998 – 2002	610
2002 – 2006	679
2006 – 2010	622
2010 – 2013	616
2013 – 2017	355
2017 – 2021	220
2021 – 2025	130

classification categorical label

This column is a three-level classification label applied to all 5,411 rows with no nulls. The distribution is nearly balanced between 'Class B' (2,722; 50.3%) and 'Class A' (2,655; 49.1%), but 'Class C' is severely underrepresented with only 34 instances (about 0.6%). If the column is used as a target variable, that imbalance calls for class weighting or oversampling, and accuracy alone will hide errors on 'Class C', since those 34 rows contribute at most 0.6% of it.

Treatment: Use as classification target; apply class-weighting or oversampling to address severe 'Class C' imbalance (34 of 5411 rows).
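
As a rough illustration of the class-weighting option, the sketch below computes "balanced" weights, n / (k × count), from the counts in this profile. This is the same heuristic scikit-learn applies for `class_weight="balanced"`; the function name `balanced_weights` is hypothetical.

```python
def balanced_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Counts from the classification top-values table.
class_counts = {"Class B": 2722, "Class A": 2655, "Class C": 34}
weights = balanced_weights(class_counts)
```

Class A and Class B receive weights below 1, while Class C is weighted at roughly 53, which shows how heavily each of its 34 rows must count to offset the imbalance.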

anthropic:default · confidence high

Out[34]:

saturn.columns["classification"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	3
top_value	Class B
top_rate	0.503
cardinality	3
entropy	1.049
entropy_ratio	0.6616
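
The entropy and entropy_ratio values above are consistent with Shannon entropy in bits divided by log2 of the cardinality. That definition is inferred from the reported numbers rather than taken from Saturn's documentation, and the function name `shannon_entropy_bits` is hypothetical. A minimal check using the three class counts:

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a discrete distribution given as counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts from the classification top-values table.
class_counts = {"Class B": 2722, "Class A": 2655, "Class C": 34}
entropy = shannon_entropy_bits(class_counts.values())

# Normalise by the maximum possible entropy for this many classes.
entropy_ratio = entropy / math.log2(len(class_counts))
```

The same relationship holds for month, where 3.807 / log2(32) gives the reported 0.7614.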

Fig 15.

Top values for classification.

Top values for classification (3 unique shown, of 3 total).
value	count	share
Class B	2722	50.3%
Class A	2655	49.1%
Class C	34	0.6%

description text free_text

This column contains short free-text summaries of the reported Bigfoot sightings, consistent with the high-frequency terms 'sighting', 'near', and 'possible'. With 5,407 unique values out of 5,411 rows and zero nulls, entries are nearly all distinct; the 4 duplicates (duplicate_rate 0.00074) are negligible. A mean length of about 67 characters and about 10 words per entry suggests one-line summaries rather than long narratives. A Flesch readability of 55.7 indicates plain, accessible prose, with a vocabulary of 7,169 unique tokens across the corpus.

Treatment: Tokenize and embed (e.g., TF-IDF or sentence transformer) before modelling; near-uniqueness makes direct encoding unusable.

anthropic:default · confidence high

Out[37]:

saturn.columns["description"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	5,407
len_min	10
len_max	221
len_mean	67.04
len_median	65
len_p95	101.5
word_mean	10.62
word_median	10
n_empty	0
n_duplicates	4
duplicate_rate	0.0007392
vocab_size	7,169
readability_flesch_mean	55.71
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0.0001848
alert: near_unique	99.9% of rows are unique strings

Fig 16.

Character-length distribution for description.

Character-length distribution for description (mean: 67.04213638883755).
chars	count
10 – 15	2
15 – 21	4
21 – 26	21
26 – 31	56
31 – 36	108
36 – 42	185
42 – 47	376
47 – 52	551
52 – 57	525
57 – 63	568
63 – 68	692
68 – 73	495
73 – 79	486
79 – 84	369
84 – 89	330
89 – 94	196
94 – 100	135
100 – 105	99
105 – 110	74
110 – 116	42
116 – 121	26
121 – 126	23
126 – 131	10
131 – 137	9
137 – 142	6
142 – 147	4
147 – 152	6
152 – 158	2
158 – 163	1
163 – 168	2
168 – 174	3
174 – 179	0
179 – 184	0
184 – 189	2
189 – 195	1
195 – 200	0
200 – 205	0
205 – 210	0
210 – 216	0
216 – 221	2
