saturn·

wild bigfoot sightings

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json

Saturn profiled 5,411 rows across 9 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
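# Install the profiler into the notebook environment.
!pip install saturn-dissect
import subprocess
# Run the saturn CLI on the source file; check=True raises if the command fails.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json",
    "--findings", "wild-bigfoot_sightings.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)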

Summary confidence: high

This dataset catalogs 5,411 Bigfoot sighting reports from the BFRO database, with fields covering location (state, county), timing (year, month), a classification grade, a short description, and a source URL. Geographically, sightings concentrate in Washington (631), California (431), and Ohio (317), and the most common county is Pierce, so the data skews toward the Pacific Northwest and the geography is worth a closer look. Temporally, the year distribution is left-skewed (mean 1998, median 2001, range 1870–2025), so most reports come from the late 1990s onward. August, October, and July dominate the month field, hinting at a warm-season reporting pattern. Classification is nearly a coin flip between Class A (2,655) and Class B (2,722), with Class C almost absent (34); flag that imbalance before any modeling. Note also that 338 county values are empty strings, even though the column reports no nulls and state coverage is complete.

citing: state.top_values · county.top_values · year.stats · month.top_values · classification.top_values · row_count · county.stats.n_empty
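
The shares quoted in the summary can be rechecked from the counts alone. The sketch below uses only figures printed in this report; the variable and function names are illustrative and are not part of saturn.

```python
ROW_COUNT = 5411
class_counts = {"Class A": 2655, "Class B": 2722, "Class C": 34}
top_states = {"Washington": 631, "California": 431, "Ohio": 317}

def share(count, total=ROW_COUNT):
    """Percentage of all rows, rounded to one decimal as in the tables."""
    return round(100 * count / total, 1)

# classification has no nulls, so the three classes should cover every row.
assert sum(class_counts.values()) == ROW_COUNT

class_shares = {label: share(n) for label, n in class_counts.items()}
state_shares = {name: share(n) for name, n in top_states.items()}
print(class_shares)
print(state_shares)
```

A check like this is cheap to run whenever the prose and the tables are generated by different steps, as they are here.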

Out[4]:

saturn.schema() · 9 columns

column          kind         n      null%  unique  alerts
id              numeric      5,411  0.0%   5,411
state           categorical  5,411  0.0%   53
state_code      categorical  5,411  0.0%   53
county          text         5,411  0.0%   1,022   one_word, short_text, duplicates
url             text         5,411  0.0%   5,411   near_unique, one_word, url_heavy
month           categorical  5,411  3.0%   32
year            numeric      5,411  1.1%   99
classification  categorical  5,411  0.0%   3
description     text         5,411  0.0%   5,407   near_unique
Fig 1.
state · Top reporting states — note Washington, California, and Ohio lead by a wide margin.
Top values for state (20 unique shown, of 53 total).
value             count  share
Washington          631  11.7%
California          431   8.0%
Ohio                317   5.9%
Florida             314   5.8%
Oregon              253   4.7%
Illinois            239   4.4%
Texas               238   4.4%
Michigan            217   4.0%
Missouri            161   3.0%
Georgia             135   2.5%
Colorado            128   2.4%
Pennsylvania        125   2.3%
British Columbia    122   2.3%
New York            116   2.1%
Kentucky            115   2.1%
Arkansas            104   1.9%
Tennessee           104   1.9%
West Virginia       104   1.9%
Oklahoma            101   1.9%
Idaho                99   1.8%
Fig 2.
classification · Class A and Class B are nearly even while Class C is negligible.
Top values for classification (3 unique shown, of 3 total).
value    count  share
Class B   2722  50.3%
Class A   2655  49.1%
Class C     34   0.6%
Fig 3.
year · Reports skew toward recent decades, with a long thin tail back to 1870.
Histogram bins for year (median: 2001.0).
bin          count
1870 – 1874      1
1874 – 1878      0
1878 – 1882      0
1882 – 1886      0
1886 – 1889      0
1889 – 1893      1
1893 – 1897      0
1897 – 1901      0
1901 – 1905      0
1905 – 1909      1
1909 – 1913      1
1913 – 1916      0
1916 – 1920      2
1920 – 1924      2
1924 – 1928      2
1928 – 1932      2
1932 – 1936      4
1936 – 1940      2
1940 – 1944      5
1944 – 1948      4
1948 – 1951     15
1951 – 1955     13
1955 – 1959     18
1959 – 1963     24
1963 – 1967     53
1967 – 1971    120
1971 – 1975    158
1975 – 1978    331
1978 – 1982    307
1982 – 1986    257
1986 – 1990    224
1990 – 1994    195
1994 – 1998    380
1998 – 2002    610
2002 – 2006    679
2006 – 2010    622
2010 – 2013    616
2013 – 2017    355
2017 – 2021    220
2021 – 2025    130
Fig 4.
month · Sightings cluster in summer and early fall, peaking in August and October.
Top values for month (20 unique shown, of 32 total).
value      count  share
August       634  11.7%
October      632  11.7%
July         618  11.4%
September    515   9.5%
June         468   8.6%
November     458   8.5%
May          303   5.6%
April        259   4.8%
December     233   4.3%
January      228   4.2%
Summer       217   4.0%
March        201   3.7%
February     163   3.0%
Fall         129   2.4%
Spring        96   1.8%
Winter        57   1.1%
Late           6   0.1%
about          6   0.1%
mid            5   0.1%
or             5   0.1%
Fig 5.
county · Most-named counties, led by Pierce; watch for the 338 missing county entries.
Character-length distribution for county (mean: 6.621).
chars    count
0 – 1      338
1 – 1        0
1 – 2        0
2 – 2        0
2 – 3        0
3 – 3       28
3 – 4      457
4 – 5        0
5 – 5      640
5 – 6        0
6 – 6     1110
6 – 7        0
7 – 7      802
7 – 8      916
8 – 9        0
9 – 9      608
9 – 10       0
10 – 10    301
10 – 11      0
11 – 12     62
12 – 12     94
12 – 13      0
13 – 13      5
13 – 14      0
14 – 14     24
14 – 15      0
15 – 16     16
16 – 16      3
16 – 17      0
17 – 17      3
17 – 18      0
18 – 18      0
18 – 19      0
19 – 20      3
20 – 20      0
20 – 21      0
21 – 21      0
21 – 22      0
22 – 22      0
22 – 23      1
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column          kind         null %
id              numeric      0.0%
state           categorical  0.0%
state_code      categorical  0.0%
county          text         0.0%
url             text         0.0%
month           categorical  3.0%
year            numeric      1.1%
classification  categorical  0.0%
description     text         0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
       id     year
id     +1.00  +0.12
year   +0.12  +1.00

id numeric identifier

This column is almost certainly a row identifier: all 5411 values are unique with no nulls, spanning 60 to 79711. The wide range relative to the row count suggests sparse, non-sequential IDs (likely assigned from a larger source system) rather than a dense 1..N index. Skew of 0.91 and median 16598 vs mean 23288 are expected artifacts of ID allocation, not meaningful distribution signals.

Treatment: Exclude from modelling; retain as a join key.

anthropic:claude-opus-4-7 · confidence high
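
One way to follow this treatment is to split each record into its key and its features, so that the id never reaches a model but stays available for joins. The sketch below uses made-up records shaped like the profiled columns; `split_key_and_features` is a hypothetical helper, not a saturn function.

```python
# Hypothetical records shaped like the profiled columns (values are made up).
rows = [
    {"id": 60, "state": "Washington", "year": 1995, "classification": "Class A"},
    {"id": 79711, "state": "Ohio", "year": 2025, "classification": "Class B"},
]

def split_key_and_features(record, key="id"):
    """Return (key value, feature dict without the key column)."""
    features = {name: value for name, value in record.items() if name != key}
    return record[key], features

keys, features = zip(*(split_key_and_features(r) for r in rows))

# The key is only a safe join key if it is unique per row.
assert len(set(keys)) == len(keys)

# Keep a lookup from id back to the feature row for later joins.
by_id = dict(zip(keys, features))
print(by_id[60])
```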
Out[13]:

saturn.columns["id"].stats

stat          value
n             5,411
nulls         0 (0.0%)
unique        5,411
min           60
max           79,711
mean          2.329e+04
median        16,598
std           2.138e+04
q1            4,898
q3            3.636e+04
iqr           31,464
skew          0.9109
kurtosis      -0.151
n_outliers    0
outlier_rate  0
zero_rate     0
Fig 8.
Distribution of id. Vertical dash marks the median.
Histogram bins for id (median: 16598.0).
bin                    count
60 – 2051                743
2051 – 4043              469
4043 – 6034              305
6034 – 8025              306
8025 – 1.002e+04         268
1.002e+04 – 1.201e+04    202
1.201e+04 – 1.4e+04      198
1.4e+04 – 1.599e+04      176
1.599e+04 – 1.798e+04    119
1.798e+04 – 1.997e+04     81
1.997e+04 – 2.196e+04     89
2.196e+04 – 2.396e+04    146
2.396e+04 – 2.595e+04    254
2.595e+04 – 2.794e+04    215
2.794e+04 – 2.993e+04    191
2.993e+04 – 3.192e+04    105
3.192e+04 – 3.391e+04     77
3.391e+04 – 3.59e+04      85
3.59e+04 – 3.789e+04      98
3.789e+04 – 3.989e+04     91
3.989e+04 – 4.188e+04    113
4.188e+04 – 4.387e+04     90
4.387e+04 – 4.586e+04     90
4.586e+04 – 4.785e+04     84
4.785e+04 – 4.984e+04     71
4.984e+04 – 5.183e+04     80
5.183e+04 – 5.382e+04     10
5.382e+04 – 5.582e+04     33
5.582e+04 – 5.781e+04     70
5.781e+04 – 5.98e+04      92
5.98e+04 – 6.179e+04      18
6.179e+04 – 6.378e+04     78
6.378e+04 – 6.577e+04     47
6.577e+04 – 6.776e+04     65
6.776e+04 – 6.975e+04     42
6.975e+04 – 7.175e+04      8
7.175e+04 – 7.374e+04     33
7.374e+04 – 7.573e+04     45
7.573e+04 – 7.772e+04     50
7.772e+04 – 7.971e+04     74

state categorical feature

This is a state-level location field with 53 distinct values across 5,411 rows and no nulls. That is more than the 50 US states, and the top values include British Columbia, so the extras cover at least one Canadian province and possibly DC or territories. The distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.66% (631 rows), well ahead of the more populous California (431) and unusually high for a national sample, hinting at geographic bias toward the Pacific Northwest.

Treatment: One-hot or target-encode; consider grouping the long tail and noting the Washington over-representation.

anthropic:claude-opus-4-7 · confidence high
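
Grouping the long tail before one-hot encoding might look like the sketch below. The `min_share` threshold and the "Other" label are assumptions chosen for illustration, and the toy list stands in for the real column.

```python
from collections import Counter

def group_long_tail(values, min_share=0.02, other="Other"):
    """Replace levels whose share of rows is below min_share with one label."""
    counts = Counter(values)
    total = len(values)
    keep = {level for level, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other for v in values]

def one_hot(values):
    """Return the sorted level names and one 0/1 row per input value."""
    levels = sorted(set(values))
    return levels, [[int(v == level) for level in levels] for v in values]

# Toy column: Idaho holds 10% of rows and falls under the 20% threshold.
states = ["Washington"] * 6 + ["California"] * 3 + ["Idaho"]
grouped = group_long_tail(states, min_share=0.2)
levels, matrix = one_hot(grouped)
print(levels)
print(matrix[0], matrix[-1])
```

Grouping first keeps the encoded width small; with 53 levels and a long tail, many one-hot columns would otherwise be almost entirely zero.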
Out[16]:

saturn.columns["state"].stats

stat           value
n              5,411
nulls          0 (0.0%)
unique         53
top_value      Washington
top_rate       0.1166
cardinality    53
entropy        5.025
entropy_ratio  0.8773
Fig 9.
Top values for state.
Top values for state (20 unique shown, of 53 total).
value             count  share
Washington          631  11.7%
California          431   8.0%
Ohio                317   5.9%
Florida             314   5.8%
Oregon              253   4.7%
Illinois            239   4.4%
Texas               238   4.4%
Michigan            217   4.0%
Missouri            161   3.0%
Georgia             135   2.5%
Colorado            128   2.4%
Pennsylvania        125   2.3%
British Columbia    122   2.3%
New York            116   2.1%
Kentucky            115   2.1%
Arkansas            104   1.9%
Tennessee           104   1.9%
West Virginia       104   1.9%
Oklahoma            101   1.9%
Idaho                99   1.8%

state_code categorical feature

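This column holds lowercase state codes, mostly two-letter US postal abbreviations, with 53 distinct values across 5,411 rows and no nulls. That is slightly more than the 50 states, and at least one code (ca-bc, British Columbia) is a longer, non-US value, so the extras are not limited to DC or territories. The distribution is broad (entropy ratio 0.88) but tilts toward Washington (wa, 11.7%) and California (ca, 8.0%); wa outranking ca is unusual given California's larger population. The top-20 counts match the state column value for value, which suggests the two columns encode the same information.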

Treatment: Use as a categorical feature; one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
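
Because state and state_code show the same cardinality and the same counts, it may be worth confirming that they map one-to-one before encoding both. The sketch below is a generic check on (state, code) pairs; the sample pairs are illustrative and `is_one_to_one` is a hypothetical helper.

```python
def is_one_to_one(pairs):
    """True if every left value maps to exactly one right value and back."""
    forward, backward = {}, {}
    for left, right in pairs:
        # setdefault stores the first partner seen and returns it afterwards.
        if forward.setdefault(left, right) != right:
            return False
        if backward.setdefault(right, left) != left:
            return False
    return True

pairs = [
    ("Washington", "wa"),
    ("California", "ca"),
    ("British Columbia", "ca-bc"),
    ("Washington", "wa"),
]
print(is_one_to_one(pairs))
print(is_one_to_one(pairs + [("Oregon", "wa")]))
```

If the mapping holds on the full data, one of the two columns can be dropped from the feature set without losing information.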
Out[19]:

saturn.columns["state_code"].stats

stat           value
n              5,411
nulls          0 (0.0%)
unique         53
top_value      wa
top_rate       0.1166
cardinality    53
entropy        5.025
entropy_ratio  0.8773
Fig 10.
Top values for state_code.
Top values for state_code (20 unique shown, of 53 total).
value  count  share
wa       631  11.7%
ca       431   8.0%
oh       317   5.9%
fl       314   5.8%
or       253   4.7%
il       239   4.4%
tx       238   4.4%
mi       217   4.0%
mo       161   3.0%
ga       135   2.5%
co       128   2.4%
pa       125   2.3%
ca-bc    122   2.3%
ny       116   2.1%
ky       115   2.1%
ar       104   1.9%
tn       104   1.9%
wv       104   1.9%
ok       101   1.9%
id        99   1.8%

county text feature

Single-word US county names (e.g., Pierce, Jefferson, Lewis, Snohomish, Skamania) acting as a geographic categorical feature. Heavy duplication is expected for this kind of field (duplicate_rate 0.81, 1,022 unique values across 5,411 rows). However, 338 values are empty strings that are counted as non-null, which is why the schema reports a 0.0% null rate; treat them as missing. The county mix (Snohomish, Skamania, Pierce, King) skews toward Washington and the Pacific Northwest, and some names, such as Jefferson and Jackson, occur in many states, so a county value is only unambiguous when paired with its state.

Treatment: Convert empty strings to nulls and encode as a categorical (target/frequency encode given ~1k levels).

anthropic:claude-opus-4-7 · confidence high
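
The two steps in this treatment can be sketched with the standard library alone. The sample values are made up, and frequency encoding is shown as one option; the helper names are illustrative.

```python
from collections import Counter

def empty_to_none(values):
    """Turn empty or whitespace-only strings into None (missing)."""
    return [v if v and v.strip() else None for v in values]

def frequency_encode(values):
    """Map each present value to its share among non-missing rows."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    return [counts[v] / total if v is not None else None for v in values]

counties = ["Pierce", "", "Pierce", "Lewis", "  "]
cleaned = empty_to_none(counties)
encoded = frequency_encode(cleaned)
print(cleaned)
print(encoded)
```

Converting before encoding matters: otherwise the empty string would be treated as a 338-row "county" of its own and receive a frequency like any real name.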
Out[22]:

saturn.columns["county"].stats

stat                     value
n                        5,411
nulls                    0 (0.0%)
unique                   1,022
len_min                  0
len_max                  23
len_mean                 6.621
len_median               7
len_p95                  10
word_mean                1
word_median              1
n_empty                  338
n_duplicates             4,389
duplicate_rate           0.8111
vocab_size               1,020
readability_flesch_mean  16.9
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             0
boilerplate_rate         0
alert: one_word · 100.0% of rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 81.1% duplicate strings
Fig 11.
Character-length distribution for county.
Character-length distribution for county (mean: 6.621).
chars    count
0 – 1      338
1 – 1        0
1 – 2        0
2 – 2        0
2 – 3        0
3 – 3       28
3 – 4      457
4 – 5        0
5 – 5      640
5 – 6        0
6 – 6     1110
6 – 7        0
7 – 7      802
7 – 8      916
8 – 9        0
9 – 9      608
9 – 10       0
10 – 10    301
10 – 11      0
11 – 12     62
12 – 12     94
12 – 13      0
13 – 13      5
13 – 14      0
14 – 14     24
14 – 15      0
15 – 16     16
16 – 16      3
16 – 17      0
17 – 17      3
17 – 18      0
18 – 18      0
18 – 19      0
19 – 20      3
20 – 20      0
20 – 21      0
21 – 21      0
21 – 22      0
22 – 22      0
22 – 23      1

url text identifier

This column holds a unique BFRO report URL for each of the 5,411 rows, all following the pattern https://www.bfro.net/gdb/show_report.asp?id=. Every value is distinct (n_unique=5411, duplicate_rate=0.0), non-null, and url_rate is 1.0, so it functions as a per-row record locator rather than a feature. Lengths are tightly bounded between 46 and 49 characters, consistent with only the numeric id varying: the fixed prefix is 44 characters, leaving two to five digits for the id.

Treatment: Keep as a row-level link for traceability; drop from modelling or extract the trailing id as a foreign key.

anthropic:claude-opus-4-7 · confidence high
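
Extracting the trailing id could be done by parsing the query string instead of slicing by position. The sketch below assumes every URL follows the pattern quoted above; whether the extracted value equals the row's id column is not stated in this report and would need checking on the data. `report_id` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse

def report_id(url):
    """Return the integer in the ?id= query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id", [])
    if values and values[0].isdigit():
        return int(values[0])
    return None

# Example URL built from the documented pattern and the maximum id (79711).
url = "https://www.bfro.net/gdb/show_report.asp?id=79711"
print(len(url), report_id(url))
print(report_id("https://www.bfro.net/gdb/show_report.asp"))
```

Parsing the query is more tolerant than a fixed slice if the host or path ever changes length, and it returns None instead of raising on malformed rows.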
Out[25]:

saturn.columns["url"].stats

stat                     value
n                        5,411
nulls                    0 (0.0%)
unique                   5,411
len_min                  46
len_max                  49
len_mean                 48.56
len_median               49
len_p95                  49
word_mean                1
word_median              1
n_empty                  0
n_duplicates             0
duplicate_rate           0
vocab_size               5,411
readability_flesch_mean  -301.8
emoji_rate               0
url_rate                 1
one_word_rate            1
allcaps_rate             0
boilerplate_rate         0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% of rows are a single word
alert: url_heavy · 100.0% of rows contain a URL
Fig 12.
Character-length distribution for url.
Character-length distribution for url (mean: 48.56). Only the 4 non-empty bins are listed; the other 36 bins have a count of 0.
chars    count
46 – 46     11
47 – 47    288
48 – 48   1789
49 – 49   3323

month categorical feature

This column records the month of a sighting, with August (634), October (632), and July (618) leading, consistent with a summer/autumn-skewed seasonal pattern. Cardinality is 32, far above the expected 12, so 20 extra non-month tokens pollute the field. The visible ones are season names (Summer 217, Fall 129, Spring 96, Winter 57) and fragments of free-text dates (Late, about, mid, or); the rest need investigation. Null rate is 2.96% and entropy ratio is 0.76, indicating a reasonably spread but non-uniform distribution.

Treatment: Normalize to the 12 canonical months, decide how to handle season-only values (for example a separate season feature, or missing), then one-hot or cyclically encode.

anthropic:claude-opus-4-7 · confidence high
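
A minimal sketch of that normalization and a cyclical encoding follows. It treats anything that is not a canonical month name, including the season names and fragments, as missing; that is one possible policy, not something the report prescribes. The sine/cosine encoding places the 12 months evenly on a circle so that December and January end up adjacent.

```python
import math

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTH_INDEX = {name.lower(): i for i, name in enumerate(MONTHS, start=1)}

def normalize_month(raw):
    """Return 1..12 for a canonical month name, else None."""
    if raw is None:
        return None
    return MONTH_INDEX.get(raw.strip().lower())

def cyclical(month_number):
    """Encode a month number 1..12 as (sin, cos) on the unit circle."""
    angle = 2 * math.pi * (month_number - 1) / 12
    return math.sin(angle), math.cos(angle)

for raw in ["August", " october ", "Summer", "about", None]:
    number = normalize_month(raw)
    print(repr(raw), number, cyclical(number) if number else None)
```

With this encoding the distance between December and January equals the distance between any other pair of neighbouring months, which a plain 1..12 integer would not preserve.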
Out[28]:

saturn.columns["month"].stats

stat           value
n              5,411
nulls          160 (3.0%)
unique         32
top_value      August
top_rate       0.1207
cardinality    32
entropy        3.807
entropy_ratio  0.7614
Fig 13.
Top values for month.
Top values for month (20 unique shown, of 32 total).
value      count  share
August       634  11.7%
October      632  11.7%
July         618  11.4%
September    515   9.5%
June         468   8.6%
November     458   8.5%
May          303   5.6%
April        259   4.8%
December     233   4.3%
January      228   4.2%
Summer       217   4.0%
March        201   3.7%
February     163   3.0%
Fall         129   2.4%
Spring        96   1.8%
Winter        57   1.1%
Late           6   0.1%
about          6   0.1%
mid            5   0.1%
or             5   0.1%

year numeric timestamp

Year of record, ranging from 1870 to 2025 across 99 distinct values with a median of 2001 and IQR spanning 1987-2009. The distribution is left-skewed (skew -0.97) with 49 outliers (0.9%) on the early-year tail, and 1.05% of rows are null.

Treatment: Treat as a temporal feature; consider bucketing by decade or clipping pre-1970 outliers before modelling.

anthropic:claude-opus-4-7 · confidence high
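
The two options in this treatment can be sketched from the reported quartiles. The report does not say how saturn flags outliers; the code assumes the common 1.5 × IQR (Tukey) rule, under which the lower fence would fall at 1954, so a pre-1970 cutoff would be a separate, looser choice. The function names are illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences for the k * IQR rule (an assumption)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def decade(year):
    """Bucket a year into its decade, passing missing values through."""
    return None if year is None else (year // 10) * 10

# Quartiles as reported for the year column.
lower, upper = tukey_fences(1987, 2009)
print(lower, upper)

for year in [1870, 1954, 1999, 2025, None]:
    is_early = year is not None and year < lower
    print(year, decade(year), is_early)
```

Decade bucketing keeps the early reports as sparse categories, while clipping at a fence folds them into the earliest retained year; which is preferable depends on whether the pre-1950s reports matter to the analysis.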
Out[31]:

saturn.columns["year"].stats

stat          value
n             5,411
nulls         57 (1.1%)
unique        99
min           1870
max           2025
mean          1998
median        2001
std           15.79
q1            1987
q3            2009
iqr           22
skew          -0.9738
kurtosis      1.997
n_outliers    49
outlier_rate  0.009152
zero_rate     0
Fig 14.
Distribution of year. Vertical dash marks the median.
Histogram bins for year (median: 2001.0).
bin          count
1870 – 1874      1
1874 – 1878      0
1878 – 1882      0
1882 – 1886      0
1886 – 1889      0
1889 – 1893      1
1893 – 1897      0
1897 – 1901      0
1901 – 1905      0
1905 – 1909      1
1909 – 1913      1
1913 – 1916      0
1916 – 1920      2
1920 – 1924      2
1924 – 1928      2
1928 – 1932      2
1932 – 1936      4
1936 – 1940      2
1940 – 1944      5
1944 – 1948      4
1948 – 1951     15
1951 – 1955     13
1955 – 1959     18
1959 – 1963     24
1963 – 1967     53
1967 – 1971    120
1971 – 1975    158
1975 – 1978    331
1978 – 1982    307
1982 – 1986    257
1986 – 1990    224
1990 – 1994    195
1994 – 1998    380
1998 – 2002    610
2002 – 2006    679
2006 – 2010    622
2010 – 2013    616
2013 – 2017    355
2017 – 2021    220
2021 – 2025    130

classification categorical label

A three-level categorical label (Class A, B, C) with no nulls across 5,411 rows. The distribution is essentially binary in practice: Class B (2,722 rows, 50.3%) and Class A (2,655 rows, 49.1%) are nearly tied, while Class C appears only 34 times (0.6%), making it a rare class that will be hard to model or evaluate.

Treatment: One-hot or ordinal encode; consider stratified splits or merging Class C given its rarity.

anthropic:claude-opus-4-7 · confidence high
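
A stratified split keeps the rare class represented on both sides of a train/test boundary. The sketch below is a small standard-library version run on labels reconstructed from the reported counts; the 20% test share, the seed, and the function name are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_share=0.2, seed=0):
    """Split row indices so each label keeps roughly the same test share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # At least one test row per class, even for very rare labels.
        n_test = max(1, round(len(indices) * test_share))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

labels = ["Class B"] * 2722 + ["Class A"] * 2655 + ["Class C"] * 34
train, test = stratified_split(labels)
rare_in_test = sum(labels[i] == "Class C" for i in test)
print(len(train), len(test), rare_in_test)
```

Even with stratification, Class C contributes only a handful of test rows, so per-class metrics for it will be noisy; that is the practical argument for merging or excluding it.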
Out[34]:

saturn.columns["classification"].stats

stat           value
n              5,411
nulls          0 (0.0%)
unique         3
top_value      Class B
top_rate       0.503
cardinality    3
entropy        1.049
entropy_ratio  0.6616
Fig 15.
Top values for classification.
Top values for classification (3 unique shown, of 3 total).
value    count  share
Class B   2722  50.3%
Class A   2655  49.1%
Class C     34   0.6%
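
The entropy figures in the stats table can be approximately reproduced from these three counts. The sketch assumes entropy is Shannon entropy in bits and that entropy_ratio divides it by log2 of the cardinality; the report does not state either definition, but the reported values are consistent with both.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

counts = [2722, 2655, 34]  # Class B, Class A, Class C
h = entropy_bits(counts)
ratio = h / math.log2(len(counts))  # 1.0 would mean perfectly even classes
print(round(h, 3), round(ratio, 4))
```

The results should land close to the reported 1.049 and 0.6616. The ratio sits well below 1 because a three-way split with one nearly empty class carries little more information than a fair coin.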

description text free_text

Short free-text descriptions, almost certainly sighting summaries: 5,407 of 5,411 values are unique, with a mean length of 67 characters and about 10.6 words. The vocabulary of 7,169 tokens is dominated by 'near', 'sighting', and 'possible', suggesting templated phrasings about location-based observations. Duplicates and boilerplate are negligible (4 duplicates, boilerplate_rate 0.00018), and a mean Flesch score of about 55.7 indicates fairly readable prose with no URLs or emoji.

Treatment: Tokenize and embed (or extract entities like species/location) before modelling; do not use as a key.

anthropic:claude-opus-4-7 · confidence high
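
A first tokenization pass might look like the sketch below. The regular expression and the three sample descriptions are assumptions made for illustration; saturn's own tokenizer and the real descriptions are not shown in this report.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())

def top_tokens(texts, n=3):
    """Most common tokens across all texts, with their counts."""
    counts = Counter(token for text in texts for token in tokenize(text))
    return counts.most_common(n)

# Made-up descriptions in the style the vocabulary stats suggest.
descriptions = [
    "Possible sighting near a creek",
    "Sighting near the highway",
    "Possible tracks near a campsite",
]
print(tokenize(descriptions[0]))
print(top_tokens(descriptions, n=1))
```

Counting tokens this way is enough to confirm which words dominate before investing in embeddings or entity extraction.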
Out[37]:

saturn.columns["description"].stats

stat                     value
n                        5,411
nulls                    0 (0.0%)
unique                   5,407
len_min                  10
len_max                  221
len_mean                 67.04
len_median               65
len_p95                  101.5
word_mean                10.62
word_median              10
n_empty                  0
n_duplicates             4
duplicate_rate           0.0007392
vocab_size               7,169
readability_flesch_mean  55.71
emoji_rate               0
url_rate                 0
one_word_rate            0
allcaps_rate             0
boilerplate_rate         0.0001848
alert: near_unique · 99.9% of rows are unique strings
Fig 16.
Character-length distribution for description.
Character-length distribution for description (mean: 67.04).
chars      count
10 – 15        2
15 – 21        4
21 – 26       21
26 – 31       56
31 – 36      108
36 – 42      185
42 – 47      376
47 – 52      551
52 – 57      525
57 – 63      568
63 – 68      692
68 – 73      495
73 – 79      486
79 – 84      369
84 – 89      330
89 – 94      196
94 – 100     135
100 – 105     99
105 – 110     74
110 – 116     42
116 – 121     26
121 – 126     23
126 – 131     10
131 – 137      9
137 – 142      6
142 – 147      4
147 – 152      6
152 – 158      2
158 – 163      1
163 – 168      2
168 – 174      3
174 – 179      0
179 – 184      0
184 – 189      2
189 – 195      1
195 – 200      0
200 – 205      0
205 – 210      0
210 – 216      0
216 – 221      2

How to cite


BibTeX
@misc{saturn-wild-bigfoot-sightings-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: wild bigfoot sightings},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/wild-bigfoot_sightings}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: wild bigfoot sightings. Source: /home/coolhand/html/datavis/data_trove/data/wild/bigfoot_sightings.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/wild-bigfoot_sightings