bigfoot-listings_20260210 · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/bigfoot/listings_20260210.json

Saturn profiled 5,411 rows across 9 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

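!pip install saturn-dissect
import subprocess

# Run the Saturn CLI on the cached listings file.
# check=True raises CalledProcessError if the command exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/bigfoot/listings_20260210.json",
    "--findings", "bigfoot-listings_20260210.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)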

Summary confidence: high

This dataset contains 5,411 Bigfoot sighting reports from BFRO in 9 columns: a numeric id, location (state, state_code, county), timing (year, month), a classification grade, a short description, and a source URL. Sightings are concentrated in Washington, California, Ohio, and Florida, and peak from July through October (August 634, October 632, July 618, September 515). Classification is dominated by Class B (2,722) and Class A (2,655), with Class C barely represented (34), which is worth flagging if you plan to filter by report quality. The year distribution is left-skewed with a median of 2001 and a long tail back to 1870, so most reports describe recent sightings. The county field has 338 empty strings (not counted as nulls) and an 81% duplicate rate, which is expected because counties repeat across reports.

citing: row_count · column_count · columns.state.top_values · columns.month.top_values · columns.classification.top_values · columns.year.stats · columns.county.stats · columns.description.stats
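The citing line lists dotted paths into the findings file. A minimal sketch of resolving such a path, assuming the findings JSON nests its keys exactly as the dotted path reads; that layout is not confirmed by this notebook, and `resolve_path` and `sample_findings` are hypothetical names used only for illustration:

```python
from functools import reduce


def resolve_path(findings: dict, dotted: str):
    """Walk a nested dict following a dotted path such as 'columns.year.stats'."""
    return reduce(lambda node, key: node[key], dotted.split("."), findings)


# Hypothetical excerpt of a findings file; the values are copied from the
# stats tables in this notebook, the nesting is an assumption.
sample_findings = {
    "row_count": 5411,
    "column_count": 9,
    "columns": {"year": {"stats": {"median": 2001, "iqr": 22}}},
}

print(resolve_path(sample_findings, "row_count"))
print(resolve_path(sample_findings, "columns.year.stats")["median"])
```

A missing key raises KeyError, which is usually what you want when checking that a cited stat actually exists.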

Out[4]:

saturn.schema() · 9 columns

column	kind	n	null%	unique	alerts
id	numeric	5,411	0.0%	5,411
state	categorical	5,411	0.0%	53
state_code	categorical	5,411	0.0%	53
county	text	5,411	0.0%	1,022	one_word short_text duplicates
url	text	5,411	0.0%	5,411	near_unique one_word url_heavy
month	categorical	5,411	3.0%	32
year	numeric	5,411	1.1%	99
classification	categorical	5,411	0.0%	3
description	text	5,411	0.0%	5,407	near_unique

Fig 1.

state · Top states for sightings — Washington and California lead by a wide margin.

Top values for state (20 unique shown, of 53 total).
value	count	share
Washington	631	11.7%
California	431	8.0%
Ohio	317	5.9%
Florida	314	5.8%
Oregon	253	4.7%
Illinois	239	4.4%
Texas	238	4.4%
Michigan	217	4.0%
Missouri	161	3.0%
Georgia	135	2.5%
Colorado	128	2.4%
Pennsylvania	125	2.3%
British Columbia	122	2.3%
New York	116	2.1%
Kentucky	115	2.1%
Arkansas	104	1.9%
Tennessee	104	1.9%
West Virginia	104	1.9%
Oklahoma	101	1.9%
Idaho	99	1.8%

Fig 2.

classification · Report quality is split almost evenly between Class A and B, with Class C negligible.

Top values for classification (3 unique shown, of 3 total).
value	count	share
Class B	2722	50.3%
Class A	2655	49.1%
Class C	34	0.6%

Fig 3.

month · Seasonality of sightings — peaks in August, October, and July.

Top values for month (20 unique shown, of 32 total).
value	count	share
August	634	11.7%
October	632	11.7%
July	618	11.4%
September	515	9.5%
June	468	8.6%
November	458	8.5%
May	303	5.6%
April	259	4.8%
December	233	4.3%
January	228	4.2%
Summer	217	4.0%
March	201	3.7%
February	163	3.0%
Fall	129	2.4%
Spring	96	1.8%
Winter	57	1.1%
Late	6	0.1%
about	6	0.1%
mid	5	0.1%
or	5	0.1%

Fig 4.

year · Reports skew heavily toward recent decades, centered around 2001.

Histogram bins for year (median: 2001.0).
bin	count
1870 – 1874	1
1874 – 1878	0
1878 – 1882	0
1882 – 1886	0
1886 – 1889	0
1889 – 1893	1
1893 – 1897	0
1897 – 1901	0
1901 – 1905	0
1905 – 1909	1
1909 – 1913	1
1913 – 1916	0
1916 – 1920	2
1920 – 1924	2
1924 – 1928	2
1928 – 1932	2
1932 – 1936	4
1936 – 1940	2
1940 – 1944	5
1944 – 1948	4
1948 – 1951	15
1951 – 1955	13
1955 – 1959	18
1959 – 1963	24
1963 – 1967	53
1967 – 1971	120
1971 – 1975	158
1975 – 1978	331
1978 – 1982	307
1982 – 1986	257
1986 – 1990	224
1990 – 1994	195
1994 – 1998	380
1998 – 2002	610
2002 – 2006	679
2006 – 2010	622
2010 – 2013	616
2013 – 2017	355
2017 – 2021	220
2021 – 2025	130

Fig 5.

description · Descriptions are short (median 10 words), suggesting summary-level rather than full narrative text.

Character-length distribution for description (mean: 67.04).
chars	count
10 – 15	2
15 – 21	4
21 – 26	21
26 – 31	56
31 – 36	108
36 – 42	185
42 – 47	376
47 – 52	551
52 – 57	525
57 – 63	568
63 – 68	692
68 – 73	495
73 – 79	486
79 – 84	369
84 – 89	330
89 – 94	196
94 – 100	135
100 – 105	99
105 – 110	74
110 – 116	42
116 – 121	26
121 – 126	23
126 – 131	10
131 – 137	9
137 – 142	6
142 – 147	4
147 – 152	6
152 – 158	2
158 – 163	1
163 – 168	2
168 – 174	3
174 – 179	0
179 – 184	0
184 – 189	2
189 – 195	1
195 – 200	0
200 – 205	0
205 – 210	0
210 – 216	0
216 – 221	2

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
id	numeric	0.0%
state	categorical	0.0%
state_code	categorical	0.0%
county	text	0.0%
url	text	0.0%
month	categorical	3.0%
year	numeric	1.1%
classification	categorical	0.0%
description	text	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	id	year
id	+1.00	+0.12
year	+0.12	+1.00

id numeric identifier

This column is almost certainly a row identifier: all 5411 values are unique, none are null, and they span a wide integer range from 60 to 79711. The distribution is right-skewed (skew 0.91) with no outliers flagged, consistent with sparsely allocated record IDs rather than a measured quantity.

Treatment: Drop from modelling features; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["id"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	5,411
min	60
max	79,711
mean	2.329e+04
median	16,598
std	2.138e+04
q1	4,898
q3	3.636e+04
iqr	31,464
skew	0.9109
kurtosis	-0.151
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 8.

Distribution of id. Vertical dash marks the median.

Histogram bins for id (median: 16598.0).
bin	count
60 – 2051	743
2051 – 4043	469
4043 – 6034	305
6034 – 8025	306
8025 – 1.002e+04	268
1.002e+04 – 1.201e+04	202
1.201e+04 – 1.4e+04	198
1.4e+04 – 1.599e+04	176
1.599e+04 – 1.798e+04	119
1.798e+04 – 1.997e+04	81
1.997e+04 – 2.196e+04	89
2.196e+04 – 2.396e+04	146
2.396e+04 – 2.595e+04	254
2.595e+04 – 2.794e+04	215
2.794e+04 – 2.993e+04	191
2.993e+04 – 3.192e+04	105
3.192e+04 – 3.391e+04	77
3.391e+04 – 3.59e+04	85
3.59e+04 – 3.789e+04	98
3.789e+04 – 3.989e+04	91
3.989e+04 – 4.188e+04	113
4.188e+04 – 4.387e+04	90
4.387e+04 – 4.586e+04	90
4.586e+04 – 4.785e+04	84
4.785e+04 – 4.984e+04	71
4.984e+04 – 5.183e+04	80
5.183e+04 – 5.382e+04	10
5.382e+04 – 5.582e+04	33
5.582e+04 – 5.781e+04	70
5.781e+04 – 5.98e+04	92
5.98e+04 – 6.179e+04	18
6.179e+04 – 6.378e+04	78
6.378e+04 – 6.577e+04	47
6.577e+04 – 6.776e+04	65
6.776e+04 – 6.975e+04	42
6.975e+04 – 7.175e+04	8
7.175e+04 – 7.374e+04	33
7.374e+04 – 7.573e+04	45
7.573e+04 – 7.772e+04	50
7.772e+04 – 7.971e+04	74

state categorical feature

State or province names across 5,411 rows, with 53 unique values and no nulls. The count exceeds 50 because the column is not limited to US states: British Columbia appears with 122 rows, and the remaining extra values may be DC, territories, or other provinces. The distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.66% with 631 rows, ahead of California (431) and Ohio (317), so the ranking does not follow state population.

Treatment: One-hot or target-encode; identify the non-state values (British Columbia is one) before treating the column as US-only.

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["state"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	53
top_value	Washington
top_rate	0.1166
cardinality	53
entropy	5.025
entropy_ratio	0.8773

Fig 9.

Top values for state.

Top values for state (20 unique shown, of 53 total).
value	count	share
Washington	631	11.7%
California	431	8.0%
Ohio	317	5.9%
Florida	314	5.8%
Oregon	253	4.7%
Illinois	239	4.4%
Texas	238	4.4%
Michigan	217	4.0%
Missouri	161	3.0%
Georgia	135	2.5%
Colorado	128	2.4%
Pennsylvania	125	2.3%
British Columbia	122	2.3%
New York	116	2.1%
Kentucky	115	2.1%
Arkansas	104	1.9%
Tennessee	104	1.9%
West Virginia	104	1.9%
Oklahoma	101	1.9%
Idaho	99	1.8%

state_code categorical feature

Lowercase region codes, 53 distinct values with no nulls. Most are two-letter US state codes, but at least one is a compound code (ca-bc for British Columbia, 122 rows), so the column is not strictly two-letter. The distribution mirrors the state column: entropy ratio 0.877, with wa leading at 11.7% (631 rows), followed by ca, oh, and fl, which is not a population-weighted ranking.

Treatment: one-hot or target-encode for modelling; the column is complete and low-cardinality, but decide first how to handle non-US codes such as ca-bc.

anthropic:claude-opus-4-7 · confidence high
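One way to surface the codes that are not plain two-letter state codes is a pattern check. This is a sketch only: `nonstandard_codes` is a hypothetical helper, and the sample list is illustrative rather than drawn from the raw rows.

```python
import re
from collections import Counter

# Exactly two lowercase ASCII letters, matching the codes shown in Fig 10.
TWO_LETTER = re.compile(r"[a-z]{2}")


def nonstandard_codes(codes):
    """Count the codes that are not exactly two lowercase letters."""
    return Counter(c for c in codes if not TWO_LETTER.fullmatch(c))


sample = ["wa", "ca", "ca-bc", "oh", "ca-bc"]
print(nonstandard_codes(sample))
```

Codes that pass the pattern can still be non-US, so a lookup against a list of state codes would be a stricter follow-up check.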

Out[19]:

saturn.columns["state_code"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	53
top_value	wa
top_rate	0.1166
cardinality	53
entropy	5.025
entropy_ratio	0.8773

Fig 10.

Top values for state_code.

Top values for state_code (20 unique shown, of 53 total).
value	count	share
wa	631	11.7%
ca	431	8.0%
oh	317	5.9%
fl	314	5.8%
or	253	4.7%
il	239	4.4%
tx	238	4.4%
mi	217	4.0%
mo	161	3.0%
ga	135	2.5%
co	128	2.4%
pa	125	2.3%
ca-bc	122	2.3%
ny	116	2.1%
ky	115	2.1%
ar	104	1.9%
tn	104	1.9%
wv	104	1.9%
ok	101	1.9%
id	99	1.8%

county text feature

County names, one token per row (Pierce, Jefferson, Lewis, Snohomish, Skamania suggest a Pacific Northwest tilt), with 1,022 unique values across 5,411 rows. The 81.1% duplicate rate (4,389 repeats) is expected for a categorical. However, 338 rows are empty strings rather than nulls, so the reported null rate of 0.0% understates missingness: the effective missing rate is 338 / 5,411 ≈ 6.2%.

Treatment: Coerce empty strings to null, then treat as a categorical (target/frequency encode for modelling).

anthropic:claude-opus-4-7 · confidence high
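A minimal sketch of the coercion step in plain Python, assuming the values arrive as a list of strings and None; `blank_to_none` and `missing_rate` are hypothetical helper names, and the sample values are illustrative:

```python
def blank_to_none(value):
    """Map empty or whitespace-only strings to None; pass other values through."""
    if isinstance(value, str) and not value.strip():
        return None
    return value


def missing_rate(values):
    """Share of values that are missing once blanks are treated as None."""
    cleaned = [blank_to_none(v) for v in values]
    return sum(v is None for v in cleaned) / len(cleaned)


sample = ["Pierce", "", "Lewis", "  ", None, "Skamania"]
print(missing_rate(sample))
```

Treating whitespace-only strings as missing is a design choice that goes slightly beyond the empty strings reported here; drop the `strip()` call if only exact empties should count.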

Out[22]:

saturn.columns["county"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	1,022
len_min	0
len_max	23
len_mean	6.621
len_median	7
len_p95	10
word_mean	1
word_median	1
n_empty	338
n_duplicates	4,389
duplicate_rate	0.8111
vocab_size	1,020
readability_flesch_mean	16.9
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	81.1% duplicate strings

Fig 11.

Character-length distribution for county.

Character-length distribution for county (mean: 6.62). Each occupied bin holds a single integer length; lengths not listed have zero rows.
chars	count
0	338
3	28
4	457
5	640
6	1110
7	802
8	916
9	608
10	301
11	62
12	94
13	5
14	24
15	16
16	3
17	3
19	3
23	1

url text identifier

This column holds a unique BFRO (Bigfoot Field Researchers Organization) report URL for each of the 5411 rows, all following the pattern https://www.bfro.net/gdb/show_report.asp?id=. Every value is unique (n_unique=5411, duplicate_rate=0.0), non-null, and url_rate=1.0, so it functions as a per-row identifier rather than a feature. Lengths cluster tightly between 46 and 49 characters, consistent with the report id being the only varying segment.

Treatment: Drop from modelling; retain as a row-level link or extract the numeric report id as a key.

anthropic:claude-opus-4-7 · confidence high
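Extracting the report id as a key can be sketched with the standard library. `report_id` is a hypothetical helper, and the example URL pairs the pattern quoted above with an illustrative id value:

```python
from urllib.parse import parse_qs, urlparse


def report_id(url: str):
    """Return the integer 'id' query parameter, or None if absent or non-numeric."""
    values = parse_qs(urlparse(url).query).get("id", [])
    if values and values[0].isdigit():
        return int(values[0])
    return None


print(report_id("https://www.bfro.net/gdb/show_report.asp?id=60"))
```

Whether this extracted value equals the id column is not established by the stats above, so it is worth verifying on the data before using either one as a join key.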

Out[25]:

saturn.columns["url"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	5,411
len_min	46
len_max	49
len_mean	48.56
len_median	49
len_p95	49
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	5,411
readability_flesch_mean	-301.8
emoji_rate	0
url_rate	1
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL

Fig 12.

Character-length distribution for url.

Character-length distribution for url (mean: 48.56). Only four lengths occur, so the empty bins are omitted.
chars	count
46	11
47	288
48	1789
49	3323

month categorical feature

Month of the reported sighting. The distribution is seasonal: August (634 rows, 12.07% of non-null values, or 11.7% of all rows), October, and July lead, and winter months trail. Cardinality is 32, well above the expected 12: besides month names, the top values include seasons (Summer, Fall, Spring, Winter) and free-text fragments (Late, about, mid, or), which suggests the field was parsed from free text. The null rate is 2.96% (160 rows).

Treatment: Normalize to the 12 canonical months, map or separately flag the 20 extra categories (seasons and fragments), and impute or flag nulls before encoding.

anthropic:claude-opus-4-7 · confidence high
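The normalization could look like the sketch below. It assumes seasons are worth keeping as a separate coarse signal and that fragments such as "about" carry no usable timing; `normalize_month` is a hypothetical helper, not part of Saturn.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
SEASONS = {"spring", "summer", "fall", "winter"}
_MONTH_LOOKUP = {m.lower(): m for m in MONTHS}


def normalize_month(raw):
    """Return (month, season): a canonical month, else a season, else (None, None)."""
    if raw is None:
        return (None, None)
    token = raw.strip().lower()
    if token in _MONTH_LOOKUP:
        return (_MONTH_LOOKUP[token], None)
    if token in SEASONS:
        return (None, token.capitalize())
    return (None, None)  # fragments such as "Late", "about", "mid", "or"


for value in ["August", "Summer", "about", None]:
    print(value, normalize_month(value))
```

Returning the season separately avoids forcing "Summer" into a single month, which would inflate whichever month was chosen.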

Out[28]:

saturn.columns["month"].stats

stat	value
n	5,411
nulls	160 (3.0%)
unique	32
top_value	August
top_rate	0.1207
cardinality	32
entropy	3.807
entropy_ratio	0.7614

Fig 13.

Top values for month.

Top values for month (20 unique shown, of 32 total).
value	count	share
August	634	11.7%
October	632	11.7%
July	618	11.4%
September	515	9.5%
June	468	8.6%
November	458	8.5%
May	303	5.6%
April	259	4.8%
December	233	4.3%
January	228	4.2%
Summer	217	4.0%
March	201	3.7%
February	163	3.0%
Fall	129	2.4%
Spring	96	1.8%
Winter	57	1.1%
Late	6	0.1%
about	6	0.1%
mid	5	0.1%
or	5	0.1%

year numeric timestamp

Year of the reported sighting, spanning 1870 to 2025 with a median of 2001 and an IQR of 22 years (1987 to 2009). The distribution is left-skewed (skew -0.97) with a long tail of older entries, and 49 outliers (0.9% of non-null values) sit on the early end. The null rate is low at 1.05% (57 rows), and there are 99 distinct years.

Treatment: Treat as a temporal feature; consider bucketing by decade or computing age relative to a reference year.

anthropic:claude-opus-4-7 · confidence high
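Both suggested transforms are one-liners. In this sketch `decade` and `age` are hypothetical helpers, and the reference year 2026 is an assumption taken from the date in the file name:

```python
def decade(year):
    """Floor a year to the start of its decade; None stays None."""
    return None if year is None else (year // 10) * 10


def age(year, reference_year):
    """Years elapsed between the sighting year and a chosen reference year."""
    return None if year is None else reference_year - year


REFERENCE_YEAR = 2026  # assumption: snapshot year from listings_20260210

for y in [1870, 2001, None]:
    print(y, decade(y), age(y, REFERENCE_YEAR))
```

Passing None through keeps the 57 null years explicit instead of silently imputing them.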

Out[31]:

saturn.columns["year"].stats

stat	value
n	5,411
nulls	57 (1.1%)
unique	99
min	1870
max	2025
mean	1998
median	2001
std	15.79
q1	1987
q3	2009
iqr	22
skew	-0.9738
kurtosis	1.997
n_outliers	49
outlier_rate	0.009152
zero_rate	0
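
The rule behind n_outliers is not stated in this notebook. One common convention, Tukey's fences at 1.5 × IQR, is sketched here as an assumption about how such a count could be produced, not as Saturn's documented method; `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; values outside them count as outliers."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


# q1 and q3 are taken from the year stats table.
low, high = tukey_fences(1987, 2009)
print(low, high)
```

Under this convention only years before the lower fence would be flagged, since the upper fence lies beyond the maximum year of 2025. That matches the observation that the outliers sit on the early end.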

Fig 14.

Distribution of year. Vertical dash marks the median.

Histogram bins for year (median: 2001.0).
bin	count
1870 – 1874	1
1874 – 1878	0
1878 – 1882	0
1882 – 1886	0
1886 – 1889	0
1889 – 1893	1
1893 – 1897	0
1897 – 1901	0
1901 – 1905	0
1905 – 1909	1
1909 – 1913	1
1913 – 1916	0
1916 – 1920	2
1920 – 1924	2
1924 – 1928	2
1928 – 1932	2
1932 – 1936	4
1936 – 1940	2
1940 – 1944	5
1944 – 1948	4
1948 – 1951	15
1951 – 1955	13
1955 – 1959	18
1959 – 1963	24
1963 – 1967	53
1967 – 1971	120
1971 – 1975	158
1975 – 1978	331
1978 – 1982	307
1982 – 1986	257
1986 – 1990	224
1990 – 1994	195
1994 – 1998	380
1998 – 2002	610
2002 – 2006	679
2006 – 2010	622
2010 – 2013	616
2013 – 2017	355
2017 – 2021	220
2021 – 2025	130

classification categorical label

A 3-level categorical label holding the BFRO report class, and a plausible target or stratification variable. Class B (2,722) and Class A (2,655) split the data nearly 50/50, while Class C appears only 34 times (0.6%), a severe minority that accuracy-style metrics will barely register. No nulls across 5,411 rows.

Treatment: Use as classification target with stratified splits and class-weighting to handle the Class C minority.

anthropic:claude-opus-4-7 · confidence high
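To show what class-weighting means for these counts, here is a sketch of inverse-frequency weights, total / (number of classes × class count). This is the same formula scikit-learn uses for class_weight="balanced"; `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count) per class."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Counts taken from the classification table.
counts = {"Class B": 2722, "Class A": 2655, "Class C": 34}
weights = balanced_class_weights(counts)
for label, w in weights.items():
    print(f"{label}: {w:.2f}")
```

With only 34 rows, Class C receives a weight of roughly 53, so a handful of examples would dominate the loss. Merging or dropping the class may be a safer choice than weighting it.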

Out[34]:

saturn.columns["classification"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	3
top_value	Class B
top_rate	0.503
cardinality	3
entropy	1.049
entropy_ratio	0.6616
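
The entropy and entropy_ratio values above are consistent with Shannon entropy in bits divided by log2 of the cardinality. That reading is inferred from the numbers, not taken from Saturn documentation, and can be checked against the three class counts with this sketch (`entropy_bits` and `entropy_ratio` are hypothetical helpers):

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


counts = [2722, 2655, 34]  # Class B, Class A, Class C
print(round(entropy_bits(counts), 3), round(entropy_ratio(counts), 4))
```

A ratio of 1.0 would mean perfectly even classes. Here the ratio is pulled down because Class C contributes almost nothing.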

Fig 15.

Top values for classification.

Top values for classification (3 unique shown, of 3 total).
value	count	share
Class B	2722	50.3%
Class A	2655	49.1%
Class C	34	0.6%

description text free_text

Short free-text descriptions, averaging 10.6 words (median 10) and 67 characters, that summarize each sighting report; top tokens include 'near' (2,283), 'sighting' (1,436), and 'possible' (1,117). Values are nearly unique (5,407 distinct out of 5,411, with 4 duplicates) and contain no nulls or empties, and a mean Flesch readability of 55.71 suggests fairly plain prose. A vocabulary of 7,169 words across such short texts indicates lexical variety rather than templated text.

Treatment: Tokenize and embed (or extract entities) before modelling; do not treat as a categorical.

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["description"].stats

stat	value
n	5,411
nulls	0 (0.0%)
unique	5,407
len_min	10
len_max	221
len_mean	67.04
len_median	65
len_p95	101.5
word_mean	10.62
word_median	10
n_empty	0
n_duplicates	4
duplicate_rate	0.0007392
vocab_size	7,169
readability_flesch_mean	55.71
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0.0001848
alert: near_unique	99.9% of rows are unique strings

Fig 16.

Character-length distribution for description.

Character-length distribution for description (mean: 67.04).
chars	count
10 – 15	2
15 – 21	4
21 – 26	21
26 – 31	56
31 – 36	108
36 – 42	185
42 – 47	376
47 – 52	551
52 – 57	525
57 – 63	568
63 – 68	692
68 – 73	495
73 – 79	486
79 – 84	369
84 – 89	330
89 – 94	196
94 – 100	135
100 – 105	99
105 – 110	74
110 – 116	42
116 – 121	26
121 – 126	23
126 – 131	10
131 – 137	9
137 – 142	6
142 – 147	4
147 – 152	6
152 – 158	2
158 – 163	1
163 – 168	2
168 – 174	3
174 – 179	0
179 – 184	0
184 – 189	2
189 – 195	1
195 – 200	0
200 – 205	0
205 – 210	0
210 – 216	0
216 – 221	2
