wild-bigfoot_sightings · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json

Saturn profiled 5,411 rows across 9 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json",
    "--findings", "wild-bigfoot_sightings.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset catalogs 5,411 Bigfoot sighting reports from the BFRO database, with fields covering location (state, county), timing (year, month), a classification grade, a short description, and a source URL. Geographically, sightings concentrate in Washington (631), California (431), and Ohio (317), and the most common county is Pierce, so the data skews toward the Pacific Northwest and that skew is worth a closer look. Temporally, the year distribution is left-skewed (mean 1998, median 2001, range 1870–2025), so most reports come from the late 1990s onward, and August, October, and July dominate the month field, hinting at a warm-season reporting pattern. Classification splits almost evenly between Class A (2,655) and Class B (2,722), with Class C almost absent (34); flag that imbalance before any modeling. Note also that 338 county values are empty even though state coverage is complete.

citing: state.top_values · county.top_values · year.stats · month.top_values · classification.top_values · row_count · county.stats.n_empty

Out[4]:

saturn.schema() · 9 columns

column	kind	n	null%	unique	alerts
id	numeric	5,411	0.0%	5,411
state	categorical	5,411	0.0%	53
state_code	categorical	5,411	0.0%	53
county	text	5,411	0.0%	1,022	one_word short_text duplicates
url	text	5,411	0.0%	5,411	near_unique one_word url_heavy
month	categorical	5,411	3.0%	32
year	numeric	5,411	1.1%	99
classification	categorical	5,411	0.0%	3
description	text	5,411	0.0%	5,407	near_unique

Fig 1.

state · Top reporting states. Washington leads by a wide margin, followed by California; Ohio (317) and Florida (314) are nearly tied for third.

Top values for state (20 unique shown, of 53 total).
value	count	share
Washington	631	11.7%
California	431	8.0%
Ohio	317	5.9%
Florida	314	5.8%
Oregon	253	4.7%
Illinois	239	4.4%
Texas	238	4.4%
Michigan	217	4.0%
Missouri	161	3.0%
Georgia	135	2.5%
Colorado	128	2.4%
Pennsylvania	125	2.3%
British Columbia	122	2.3%
New York	116	2.1%
Kentucky	115	2.1%
Arkansas	104	1.9%
Tennessee	104	1.9%
West Virginia	104	1.9%
Oklahoma	101	1.9%
Idaho	99	1.8%

Fig 2.

classification · Class A and Class B are nearly even while Class C is negligible.

Top values for classification (3 unique shown, of 3 total).
value	count	share
Class B	2722	50.3%
Class A	2655	49.1%
Class C	34	0.6%

Fig 3.

year · Reports skew toward recent decades, with a long thin tail back to 1870.

Histogram bins for year (median: 2001).
bin	count
1870 – 1874	1
1874 – 1878	0
1878 – 1882	0
1882 – 1886	0
1886 – 1889	0
1889 – 1893	1
1893 – 1897	0
1897 – 1901	0
1901 – 1905	0
1905 – 1909	1
1909 – 1913	1
1913 – 1916	0
1916 – 1920	2
1920 – 1924	2
1924 – 1928	2
1928 – 1932	2
1932 – 1936	4
1936 – 1940	2
1940 – 1944	5
1944 – 1948	4
1948 – 1951	15
1951 – 1955	13
1955 – 1959	18
1959 – 1963	24
1963 – 1967	53
1967 – 1971	120
1971 – 1975	158
1975 – 1978	331
1978 – 1982	307
1982 – 1986	257
1986 – 1990	224
1990 – 1994	195
1994 – 1998	380
1998 – 2002	610
2002 – 2006	679
2006 – 2010	622
2010 – 2013	616
2013 – 2017	355
2017 – 2021	220
2021 – 2025	130

Fig 4.

month · Sightings cluster in summer and early fall, peaking in August and October.

Top values for month (20 unique shown, of 32 total).
value	count	share
August	634	11.7%
October	632	11.7%
July	618	11.4%
September	515	9.5%
June	468	8.6%
November	458	8.5%
May	303	5.6%
April	259	4.8%
December	233	4.3%
January	228	4.2%
Summer	217	4.0%
March	201	3.7%
February	163	3.0%
Fall	129	2.4%
Spring	96	1.8%
Winter	57	1.1%
Late	6	0.1%
about	6	0.1%
mid	5	0.1%
or	5	0.1%

Fig 5.

county · Most-named counties, led by Pierce; watch for the 338 missing county entries.

Character-length distribution for county (mean: 6.62). Lengths with zero rows are omitted.
chars	count
0	338
3	28
4	457
5	640
6	1110
7	802
8	916
9	608
10	301
11	62
12	94
13	5
14	24
15	16
16	3
17	3
19	3
23	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	numeric	0.0%
state	categorical	0.0%
state_code	categorical	0.0%
county	text	0.0%
url	text	0.0%
month	categorical	3.0%
year	numeric	1.1%
classification	categorical	0.0%
description	text	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	id	year
id	+1.00	+0.12
year	+0.12	+1.00

id numeric identifier

This column is almost certainly a row identifier: all 5411 values are unique with no nulls, spanning 60 to 79711. The wide range relative to the row count suggests sparse, non-sequential IDs (likely assigned from a larger source system) rather than a dense 1..N index. Skew of 0.91 and median 16598 vs mean 23288 are expected artifacts of ID allocation, not meaningful distribution signals.

Treatment: Exclude from modelling; retain as a join key.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["id"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	5,411
min	60
max	79,711
mean	2.329e+04
median	16,598
std	2.138e+04
q1	4898
q3	3.636e+04
iqr	31,464
skew	0.9109
kurtosis	-0.151
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 8.

Distribution of id. Vertical dash marks the median.

Histogram bins for id (median: 16,598).
bin	count
60 – 2051	743
2051 – 4043	469
4043 – 6034	305
6034 – 8025	306
8025 – 1.002e+04	268
1.002e+04 – 1.201e+04	202
1.201e+04 – 1.4e+04	198
1.4e+04 – 1.599e+04	176
1.599e+04 – 1.798e+04	119
1.798e+04 – 1.997e+04	81
1.997e+04 – 2.196e+04	89
2.196e+04 – 2.396e+04	146
2.396e+04 – 2.595e+04	254
2.595e+04 – 2.794e+04	215
2.794e+04 – 2.993e+04	191
2.993e+04 – 3.192e+04	105
3.192e+04 – 3.391e+04	77
3.391e+04 – 3.59e+04	85
3.59e+04 – 3.789e+04	98
3.789e+04 – 3.989e+04	91
3.989e+04 – 4.188e+04	113
4.188e+04 – 4.387e+04	90
4.387e+04 – 4.586e+04	90
4.586e+04 – 4.785e+04	84
4.785e+04 – 4.984e+04	71
4.984e+04 – 5.183e+04	80
5.183e+04 – 5.382e+04	10
5.382e+04 – 5.582e+04	33
5.582e+04 – 5.781e+04	70
5.781e+04 – 5.98e+04	92
5.98e+04 – 6.179e+04	18
6.179e+04 – 6.378e+04	78
6.378e+04 – 6.577e+04	47
6.577e+04 – 6.776e+04	65
6.776e+04 – 6.975e+04	42
6.975e+04 – 7.175e+04	8
7.175e+04 – 7.374e+04	33
7.374e+04 – 7.573e+04	45
7.573e+04 – 7.772e+04	50
7.772e+04 – 7.971e+04	74

state categorical feature

This is a state field with 53 distinct values across 5411 rows and no nulls. The extras beyond the 50 US states are not limited to DC or territories: the top values include British Columbia (122 rows), so at least one Canadian province is present. Distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.66% (631 rows), well ahead of California (431) and unusually high for a national sample, hinting at geographic bias toward the Pacific Northwest.

Treatment: One-hot or target-encode; consider grouping the long tail and noting the Washington over-representation.
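
One way to group the long tail is sketched below using only the standard library. The `min_share` threshold and the `"Other"` label are illustrative assumptions, not Saturn settings, and the toy column is built from four counts in the state top-values table rather than from the real rows.

```python
from collections import Counter

def group_long_tail(values, min_share=0.02, other_label="Other"):
    """Replace categories whose share of rows is below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    keep = {v for v, c in counts.items() if c / total >= min_share}
    return [v if v in keep else other_label for v in values]

# Toy column: four states with their counts from the top-values table.
sample = (["Washington"] * 631 + ["California"] * 431
          + ["Oklahoma"] * 101 + ["Idaho"] * 99)

# With a 10% threshold on this toy column, Oklahoma and Idaho fall into "Other".
grouped = group_long_tail(sample, min_share=0.10)
print(Counter(grouped).most_common())
```

On the full column the threshold would need tuning, since only Washington exceeds a 10% share there.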

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["state"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	53
top_value	Washington
top_rate	0.1166
cardinality	53
entropy	5.025
entropy_ratio	0.8773

Fig 9.

Top values for state.

Top values for state (20 unique shown, of 53 total).
value	count	share
Washington	631	11.7%
California	431	8.0%
Ohio	317	5.9%
Florida	314	5.8%
Oregon	253	4.7%
Illinois	239	4.4%
Texas	238	4.4%
Michigan	217	4.0%
Missouri	161	3.0%
Georgia	135	2.5%
Colorado	128	2.4%
Pennsylvania	125	2.3%
British Columbia	122	2.3%
New York	116	2.1%
Kentucky	115	2.1%
Arkansas	104	1.9%
Tennessee	104	1.9%
West Virginia	104	1.9%
Oklahoma	101	1.9%
Idaho	99	1.8%

state_code categorical feature

This column holds lowercase state codes, mostly two-letter US abbreviations, with 53 distinct values across 5411 rows and no nulls. That is more than the 50 states, and the top values include ca-bc (British Columbia, 122 rows), so the codes are not uniformly two letters and cover at least one Canadian province. Distribution is broad (entropy ratio 0.88) but tilts toward Washington (wa, 11.7%) and California (ca, 8.0%); wa outranking ca is notable given California's larger population.

Treatment: Use as a categorical feature; one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["state_code"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	53
top_value	wa
top_rate	0.1166
cardinality	53
entropy	5.025
entropy_ratio	0.8773

Fig 10.

Top values for state_code.

Top values for state_code (20 unique shown, of 53 total).
value	count	share
wa	631	11.7%
ca	431	8.0%
oh	317	5.9%
fl	314	5.8%
or	253	4.7%
il	239	4.4%
tx	238	4.4%
mi	217	4.0%
mo	161	3.0%
ga	135	2.5%
co	128	2.4%
pa	125	2.3%
ca-bc	122	2.3%
ny	116	2.1%
ky	115	2.1%
ar	104	1.9%
tn	104	1.9%
wv	104	1.9%
ok	101	1.9%
id	99	1.8%

county text feature

Single-word US county names (e.g., Pierce, Jefferson, Lewis, Snohomish, Skamania) acting as a geographic categorical feature. Heavy duplication is expected for this kind of field (duplicate_rate 0.81, 1022 unique values across 5411 rows), but 338 empty strings are recorded as non-null and should be treated as missing. The county mix (Snohomish, Skamania, Pierce, King) skews toward Washington and the Pacific Northwest, and some names, such as Jefferson and Jackson, appear in many states, so a county name is ambiguous without its state.

Treatment: Convert empty strings to nulls and encode as a categorical (target/frequency encode given ~1k levels).
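
The treatment above can be sketched as follows. This is a minimal standard-library illustration, not Saturn functionality: `clean_county` and `frequency_encode` are hypothetical helper names, and the six-value input is made up for the example.

```python
from collections import Counter

def clean_county(value):
    """Map None, empty, or whitespace-only strings to None so they count as missing."""
    if value is None or not value.strip():
        return None
    return value.strip()

def frequency_encode(values):
    """Encode each non-missing category as its relative frequency among non-missing rows."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n = len(present)
    return [counts[v] / n if v is not None else None for v in values]

# Made-up sample: two of the six entries are empty and should become missing.
raw = ["Pierce", "", "Pierce", "Lewis", "  ", "Snohomish"]
cleaned = [clean_county(v) for v in raw]
encoded = frequency_encode(cleaned)
print(cleaned)
print(encoded)
```

Because names such as Jefferson occur in several states, encoding the (state_code, county) pair rather than county alone may be the safer choice.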

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["county"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	1,022
len_min	0
len_max	23
len_mean	6.621
len_median	7
len_p95	10
word_mean	1
word_median	1
n_empty	338
n_duplicates	4,389
duplicate_rate	0.8111
vocab_size	1,020
readability_flesch_mean	16.9
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	81.1% duplicate strings

Fig 11.

Character-length distribution for county.

Character-length distribution for county (mean: 6.62). Lengths with zero rows are omitted.
chars	count
0	338
3	28
4	457
5	640
6	1110
7	802
8	916
9	608
10	301
11	62
12	94
13	5
14	24
15	16
16	3
17	3
19	3
23	1

url text identifier

This column holds a unique BFRO report URL for each of the 5411 rows, all following the pattern https://www.bfro.net/gdb/show_report.asp?id= followed by a numeric id. Every value is distinct (n_unique=5411, duplicate_rate=0.0), non-null, and url_rate is 1.0, so it functions as a per-row record locator rather than a feature. Lengths are tightly bounded between 46 and 49 characters, consistent with only the numeric id varying: the fixed prefix is 44 characters, leaving 2 to 5 digits.

Treatment: Keep as a row-level link for traceability; drop from modelling or extract the trailing id as a foreign key.
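
Extracting the trailing id could look like the sketch below, which uses `urllib.parse` from the standard library. The helper name `report_id` is hypothetical, the example URL is constructed from the pattern quoted above with an arbitrary id, and whether the extracted value matches the id column has not been verified here.

```python
from urllib.parse import urlparse, parse_qs

def report_id(url):
    """Return the integer `id` query parameter, or None if it is absent or non-numeric."""
    params = parse_qs(urlparse(url).query)
    values = params.get("id", [])
    if len(values) == 1 and values[0].isdigit():
        return int(values[0])
    return None

# Constructed example following the documented URL pattern (the id is arbitrary).
example = "https://www.bfro.net/gdb/show_report.asp?id=12345"
print(report_id(example))
```

Parsing the query string, rather than slicing by position, keeps the helper working even if the prefix length changes.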

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["url"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	5,411
len_min	46
len_max	49
len_mean	48.56
len_median	49
len_p95	49
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	5,411
readability_flesch_mean	-301.8
emoji_rate	0
url_rate	1
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL

Fig 12.

Character-length distribution for url.

Character-length distribution for url (mean: 48.56).
chars	count
46	11
47	288
48	1789
49	3323

month categorical feature

This column captures the month name of an event, with August (634), October (632), and July (618) leading, consistent with a summer/autumn-skewed seasonal pattern. Cardinality is 32, far above the expected 12, so 20 extra non-month tokens pollute the field and need investigation; the ones visible in the top values are season names (Summer, Fall, Spring, Winter) and stray fragments (Late, about, mid, or). Null rate is 2.96% and entropy ratio is 0.76, indicating a reasonably spread but non-uniform distribution.

Treatment: Normalize to the 12 canonical months, then one-hot or cyclically encode.
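
A minimal sketch of that normalization and a sine/cosine cyclical encoding follows. It assumes non-month tokens (seasons, fragments) are simply treated as missing, which discards some information such as the 217 "Summer" rows; the helper names are hypothetical.

```python
import math

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
MONTH_INDEX = {name.lower(): i for i, name in enumerate(MONTHS)}

def normalize_month(value):
    """Return a 0-based month index, or None for nulls and non-month tokens."""
    if value is None:
        return None
    return MONTH_INDEX.get(value.strip().lower())

def cyclical(index, period=12):
    """Map a 0-based index onto the unit circle so December and January are adjacent."""
    angle = 2 * math.pi * index / period
    return math.sin(angle), math.cos(angle)

for raw in ["August", "Summer", "about", None]:
    idx = normalize_month(raw)
    print(raw, idx, cyclical(idx) if idx is not None else None)
```

If the season tokens matter, a separate coarse season feature could be derived from them instead of dropping them.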

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["month"].stats

stat	value
n	5,411
nulls	160 (3.0%)
unique	32
top_value	August
top_rate	0.1207
cardinality	32
entropy	3.807
entropy_ratio	0.7614
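
The top_rate of 0.1207 differs from the 11.7% share shown for August in the top-values table. One reading that reconciles the two, offered as an assumption about how Saturn computes these figures, is that top_rate divides by non-null rows while share divides by all rows:

```python
n_rows = 5411          # n from the stats table
n_null = 160           # nulls from the stats table
august = 634           # August count from the top-values table

rate_non_null = august / (n_rows - n_null)   # denominator excludes nulls
rate_all = august / n_rows                   # denominator includes nulls

print(round(rate_non_null, 4), round(rate_all * 100, 1))
```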

Fig 13.

Top values for month.

Top values for month (20 unique shown, of 32 total).
value	count	share
August	634	11.7%
October	632	11.7%
July	618	11.4%
September	515	9.5%
June	468	8.6%
November	458	8.5%
May	303	5.6%
April	259	4.8%
December	233	4.3%
January	228	4.2%
Summer	217	4.0%
March	201	3.7%
February	163	3.0%
Fall	129	2.4%
Spring	96	1.8%
Winter	57	1.1%
Late	6	0.1%
about	6	0.1%
mid	5	0.1%
or	5	0.1%

year numeric timestamp

Year of record, ranging from 1870 to 2025 across 99 distinct values with a median of 2001 and IQR spanning 1987-2009. The distribution is left-skewed (skew -0.97) with 49 outliers (0.9%) on the early-year tail, and 1.05% of rows are null.

Treatment: Treat as a temporal feature; consider bucketing by decade or clipping the early-year tail (the 49 flagged outliers) before modelling.
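
The sketch below shows decade bucketing and clipping at a lower outlier fence. It assumes the outliers are flagged with the common 1.5 × IQR (Tukey) rule, which the report does not confirm; with q1 = 1987 and q3 = 2009 that rule puts the lower fence at 1954 and the upper fence at 2042, above the maximum year. The sample years are made up.

```python
def lower_fence(q1, q3, k=1.5):
    """Tukey lower fence: values below q1 - k * IQR are flagged as low outliers."""
    return q1 - k * (q3 - q1)

def decade(year):
    """Bucket a year into its decade, e.g. 1987 -> 1980."""
    return (year // 10) * 10

fence = lower_fence(1987, 2009)   # q1 and q3 taken from the stats table

# Made-up sample including a null, which is passed through unchanged.
years = [1870, 1953, 1987, 2001, 2025, None]
clipped = [None if y is None else max(y, fence) for y in years]
buckets = [None if y is None else decade(y) for y in years]
print(fence, clipped, buckets)
```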

anthropic:claude-opus-4-7 · confidence high

Out[31]:

saturn.columns["year"].stats

stat	value
n	5,411
nulls	57 (1.1%)
unique	99
min	1870
max	2025
mean	1998
median	2001
std	15.79
q1	1987
q3	2009
iqr	22
skew	-0.9738
kurtosis	1.997
n_outliers	49
outlier_rate	0.009152
zero_rate	0

Fig 14.

Distribution of year. Vertical dash marks the median.

Histogram bins for year (median: 2001).
bin	count
1870 – 1874	1
1874 – 1878	0
1878 – 1882	0
1882 – 1886	0
1886 – 1889	0
1889 – 1893	1
1893 – 1897	0
1897 – 1901	0
1901 – 1905	0
1905 – 1909	1
1909 – 1913	1
1913 – 1916	0
1916 – 1920	2
1920 – 1924	2
1924 – 1928	2
1928 – 1932	2
1932 – 1936	4
1936 – 1940	2
1940 – 1944	5
1944 – 1948	4
1948 – 1951	15
1951 – 1955	13
1955 – 1959	18
1959 – 1963	24
1963 – 1967	53
1967 – 1971	120
1971 – 1975	158
1975 – 1978	331
1978 – 1982	307
1982 – 1986	257
1986 – 1990	224
1990 – 1994	195
1994 – 1998	380
1998 – 2002	610
2002 – 2006	679
2006 – 2010	622
2010 – 2013	616
2013 – 2017	355
2017 – 2021	220
2021 – 2025	130

classification categorical label

A three-level categorical label (Class A, B, C) with no nulls across 5411 rows. The distribution is essentially binary in practice: Class B (50.3%) and Class A (49.1%) are nearly tied, while Class C appears only 34 times (0.6%), making it a rare class that will be hard to model or evaluate.

Treatment: One-hot or ordinal encode; consider stratified splits or merging Class C given its rarity.
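
A stratified split can be sketched with the standard library as below. The function name, the 20% test fraction, and the seed are illustrative choices; the label list is rebuilt from the reported class counts rather than read from the data.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split at roughly test_fraction."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test row per class so rare classes can be evaluated.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Labels rebuilt from the reported counts for Class B, Class A, and Class C.
labels = ["Class B"] * 2722 + ["Class A"] * 2655 + ["Class C"] * 34
train_idx, test_idx = stratified_split(labels)
test_counts = {c: sum(labels[i] == c for i in test_idx) for c in set(labels)}
print(test_counts)
```

With only 34 Class C rows, the test set holds about 7 of them, so per-class metrics for Class C will be noisy whichever split is used.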

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["classification"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	3
top_value	Class B
top_rate	0.503
cardinality	3
entropy	1.049
entropy_ratio	0.6616

Fig 15.

Top values for classification.

Top values for classification (3 unique shown, of 3 total).
value	count	share
Class B	2722	50.3%
Class A	2655	49.1%
Class C	34	0.6%
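
The entropy figures in the stats table can be checked against these counts. The sketch assumes entropy is Shannon entropy in bits and that entropy_ratio divides it by log2 of the cardinality; that reading is an inference from the reported numbers, not something the report states.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

counts = [2722, 2655, 34]               # Class B, Class A, Class C
h = entropy_bits(counts)
ratio = h / math.log2(len(counts))      # normalize by the maximum entropy for 3 classes
print(round(h, 3), round(ratio, 4))
```

Under these assumptions the results land close to the reported entropy (1.049) and entropy_ratio (0.6616); the ratio is well below 1 because Class C carries almost no mass.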

description text free_text

Short free-text descriptions, almost certainly sighting summaries: 5407 of 5411 values are unique with a mean length of 67 characters and ~10.6 words. The vocabulary of 7169 tokens is dominated by 'near', 'sighting', and 'possible', suggesting templated phrasings about location-based observations. Duplicates and boilerplate are negligible (4 dupes, boilerplate_rate 0.00018), and Flesch ~55.7 indicates fairly readable prose with no URLs or emoji.

Treatment: Tokenize and embed (or extract entities like species/location) before modelling; do not use as a key.

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["description"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	5,407
len_min	10
len_max	221
len_mean	67.04
len_median	65
len_p95	101.5
word_mean	10.62
word_median	10
n_empty	0
n_duplicates	4
duplicate_rate	0.0007392
vocab_size	7,169
readability_flesch_mean	55.71
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0.0001848
alert: near_unique	99.9% of rows are unique strings

Fig 16.

Character-length distribution for description.

Character-length distribution for description (mean: 67.04).
chars	count
10 – 15	2
15 – 21	4
21 – 26	21
26 – 31	56
31 – 36	108
36 – 42	185
42 – 47	376
47 – 52	551
52 – 57	525
57 – 63	568
63 – 68	692
68 – 73	495
73 – 79	486
79 – 84	369
84 – 89	330
89 – 94	196
94 – 100	135
100 – 105	99
105 – 110	74
110 – 116	42
116 – 121	26
121 – 126	23
126 – 131	10
131 – 137	9
137 – 142	6
142 – 147	4
147 – 152	6
152 – 158	2
158 – 163	1
163 – 168	2
168 – 174	3
174 – 179	0
179 – 184	0
184 – 189	2
189 – 195	1
195 – 200	0
200 – 205	0
205 – 210	0
210 – 216	0
216 – 221	2
