wild bigfoot sightings
Reading
This dataset catalogs 5,411 Bigfoot sighting reports from the BFRO database, with fields covering location (state, county), timing (year, month), a classification grade, a short description, and a source URL. Geographically, sightings concentrate in Washington (631), California (431), and Ohio (317), and the most common county is Pierce, which is consistent with a skew toward the Pacific Northwest and worth a closer look. Temporally, the year distribution is left-skewed (mean about 1998, median 2001, range 1870–2025), so most reports come from the late 1990s onward. August, October, and July are the most frequent values in the month field, hinting at a summer-to-autumn reporting peak. Classification is nearly an even split between Class A (2,655) and Class B (2,722), with Class C almost absent (34); that imbalance is something to flag before any modeling. Note also that 338 county values are empty even though state coverage is complete.
citing: state.top_values · county.top_values · year.stats · month.top_values · classification.top_values · row_count · county.stats.n_empty
Charts the summary said to look at first
state (top 20 values)
| value | count | share |
|---|---|---|
| Washington | 631 | 11.7% |
| California | 431 | 8.0% |
| Ohio | 317 | 5.9% |
| Florida | 314 | 5.8% |
| Oregon | 253 | 4.7% |
| Illinois | 239 | 4.4% |
| Texas | 238 | 4.4% |
| Michigan | 217 | 4.0% |
| Missouri | 161 | 3.0% |
| Georgia | 135 | 2.5% |
| Colorado | 128 | 2.4% |
| Pennsylvania | 125 | 2.3% |
| British Columbia | 122 | 2.3% |
| New York | 116 | 2.1% |
| Kentucky | 115 | 2.1% |
| Arkansas | 104 | 1.9% |
| Tennessee | 104 | 1.9% |
| West Virginia | 104 | 1.9% |
| Oklahoma | 101 | 1.9% |
| Idaho | 99 | 1.8% |
classification (all values)
| value | count | share |
|---|---|---|
| Class B | 2722 | 50.3% |
| Class A | 2655 | 49.1% |
| Class C | 34 | 0.6% |
year (histogram)
| bin | count |
|---|---|
| 1870 – 1874 | 1 |
| 1874 – 1878 | 0 |
| 1878 – 1882 | 0 |
| 1882 – 1886 | 0 |
| 1886 – 1889 | 0 |
| 1889 – 1893 | 1 |
| 1893 – 1897 | 0 |
| 1897 – 1901 | 0 |
| 1901 – 1905 | 0 |
| 1905 – 1909 | 1 |
| 1909 – 1913 | 1 |
| 1913 – 1916 | 0 |
| 1916 – 1920 | 2 |
| 1920 – 1924 | 2 |
| 1924 – 1928 | 2 |
| 1928 – 1932 | 2 |
| 1932 – 1936 | 4 |
| 1936 – 1940 | 2 |
| 1940 – 1944 | 5 |
| 1944 – 1948 | 4 |
| 1948 – 1951 | 15 |
| 1951 – 1955 | 13 |
| 1955 – 1959 | 18 |
| 1959 – 1963 | 24 |
| 1963 – 1967 | 53 |
| 1967 – 1971 | 120 |
| 1971 – 1975 | 158 |
| 1975 – 1978 | 331 |
| 1978 – 1982 | 307 |
| 1982 – 1986 | 257 |
| 1986 – 1990 | 224 |
| 1990 – 1994 | 195 |
| 1994 – 1998 | 380 |
| 1998 – 2002 | 610 |
| 2002 – 2006 | 679 |
| 2006 – 2010 | 622 |
| 2010 – 2013 | 616 |
| 2013 – 2017 | 355 |
| 2017 – 2021 | 220 |
| 2021 – 2025 | 130 |
month (top 20 values)
| value | count | share |
|---|---|---|
| August | 634 | 11.7% |
| October | 632 | 11.7% |
| July | 618 | 11.4% |
| September | 515 | 9.5% |
| June | 468 | 8.6% |
| November | 458 | 8.5% |
| May | 303 | 5.6% |
| April | 259 | 4.8% |
| December | 233 | 4.3% |
| January | 228 | 4.2% |
| Summer | 217 | 4.0% |
| March | 201 | 3.7% |
| February | 163 | 3.0% |
| Fall | 129 | 2.4% |
| Spring | 96 | 1.8% |
| Winter | 57 | 1.1% |
| Late | 6 | 0.1% |
| about | 6 | 0.1% |
| mid | 5 | 0.1% |
| or | 5 | 0.1% |
county (value length in characters)
| length (chars) | count |
|---|---|
| 0 | 338 |
| 1 | 0 |
| 2 | 0 |
| 3 | 28 |
| 4 | 457 |
| 5 | 640 |
| 6 | 1110 |
| 7 | 802 |
| 8 | 916 |
| 9 | 608 |
| 10 | 301 |
| 11 | 62 |
| 12 | 94 |
| 13 | 5 |
| 14 | 24 |
| 15 | 16 |
| 16 | 3 |
| 17 | 3 |
| 18 | 0 |
| 19 | 3 |
| 20 | 0 |
| 21 | 0 |
| 22 | 0 |
| 23 | 1 |
Schema
9 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| id | numeric | 0.0% | 5,411 | |
| state | categorical | 0.0% | 53 | |
| state_code | categorical | 0.0% | 53 | |
| county | text | 0.0% | 1,022 | one_word, short_text, duplicates |
| url | text | 0.0% | 5,411 | near_unique, one_word, url_heavy |
| month | categorical | 3.0% | 32 | |
| year | numeric | 1.1% | 99 | |
| classification | categorical | 0.0% | 3 | |
| description | text | 0.0% | 5,407 | near_unique |
id
numeric identifier
This column is almost certainly a row identifier: all 5,411 values are unique with no nulls, spanning 60 to 79,711. The wide range relative to the row count suggests sparse, non-sequential IDs (likely assigned from a larger source system) rather than a dense 1..N index. The skew of 0.91 and the gap between the median (16,598) and the mean (23,288) are expected artifacts of ID allocation, not meaningful distribution signals.
Treatment: Exclude from modelling; retain as a join key.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 5,411
- min: 60
- max: 79,711
- mean: 2.329e+04
- median: 16,598
- std: 2.138e+04
- q1: 4,898
- q3: 3.636e+04
- iqr: 31,464
- skew: 0.9109
- kurtosis: -0.151
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
state
categorical feature
This is a state-level location field with 53 distinct values across 5,411 rows and no nulls. The count exceeds the 50 US states because the field also holds non-US regions (British Columbia appears among the top values with 122 rows), and possibly DC or territories. The distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.66% (631 rows), ahead of the far more populous California (431), which points to a geographic bias toward the Pacific Northwest.
Treatment: One-hot or target-encode; consider grouping the long tail and noting the Washington over-representation.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 53
- top_value: Washington
- top_rate: 0.1166
- cardinality: 53
- entropy: 5.025
- entropy_ratio: 0.8773
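The "group the long tail" treatment for state could be sketched as follows. This is a minimal illustration in plain Python, not the report's own method: the function name `group_long_tail`, the `min_share` threshold, and the `"Other"` label are all assumptions, and the sample list is toy data rather than the real column.

```python
from collections import Counter

def group_long_tail(values, min_share=0.02, other_label="Other"):
    """Replace categories whose share of rows is below min_share with other_label.

    Hypothetical helper: the threshold and label are illustrative choices.
    """
    counts = Counter(values)
    total = len(values)
    keep = {v for v, c in counts.items() if c / total >= min_share}
    return [v if v in keep else other_label for v in values]

# Toy sample (not the real data): Idaho holds 10% of rows, below the 20% cut-off.
states = ["Washington"] * 6 + ["California"] * 3 + ["Idaho"]
grouped = group_long_tail(states, min_share=0.2)
print(Counter(grouped))
```

On the real column, a threshold would need to be chosen with the 53-level distribution in mind; with the shares listed in the state table, a 2% cut-off would keep roughly the top 15 values.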
state_code
categorical feature
This column holds two-letter lowercase region codes (e.g. wa), with 53 distinct values across 5,411 rows and no nulls. Its cardinality, top rate, and entropy match the state column exactly, so it appears to be a one-to-one recoding of state, and the values beyond the 50 US states presumably correspond to the same extras (such as British Columbia). The distribution is broad (entropy ratio 0.88) but tilts toward Washington (wa, 11.7%) and California (ca, 8.0%), which is notable because wa outranks ca despite California's larger population.
Treatment: Use as a categorical feature; one-hot or target-encode for modelling. Because it appears redundant with state, consider keeping only one of the two.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 53
- top_value: wa
- top_rate: 0.1166
- cardinality: 53
- entropy: 5.025
- entropy_ratio: 0.8773
county
text feature one_word short_text duplicates
Single-word county names (e.g., Pierce, Jefferson, Lewis, Snohomish, Skamania) acting as a geographic categorical feature. Heavy duplication is expected for this kind of field (duplicate_rate 0.81, 1,022 unique values across 5,411 rows), but 338 empty strings (6.2% of rows) are recorded as non-null and should be treated as missing. The county mix (Snohomish, Skamania, Pierce, King) skews toward Washington and the Pacific Northwest. Some names, such as Jefferson and Jackson, occur in many states, so a county value is only unambiguous in combination with state.
Treatment: Convert empty strings to nulls and encode as a categorical (target or frequency encode, given about 1,000 levels).
- n: 5,411
- nulls: 0 (0.0%)
- unique: 1,022
- len_min: 0
- len_max: 23
- len_mean: 6.621
- len_median: 7
- len_p95: 10
- word_mean: 1
- word_median: 1
- n_empty: 338
- n_duplicates: 4,389
- duplicate_rate: 0.8111
- vocab_size: 1,020
- readability_flesch_mean: 16.9
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
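The county treatment (empty strings to nulls, then frequency encoding) might look like the sketch below. The helper names `clean_county` and `frequency_encode` are hypothetical, the input list is toy data, and the choice to compute frequencies over non-missing rows only is an assumption rather than something the report specifies.

```python
from collections import Counter

def clean_county(value):
    """Map None, empty, or whitespace-only strings to None so they count as missing."""
    if value is None or value.strip() == "":
        return None
    return value.strip()

def frequency_encode(values):
    """Encode each non-missing category as its share of the non-missing rows.

    Missing values stay None (assumption: they are handled downstream).
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n = len(present)
    return [counts[v] / n if v is not None else None for v in values]

# Toy sample (not the real data).
raw = ["Pierce", "", "Pierce", "Skamania", "  ", "Lewis"]
cleaned = [clean_county(v) for v in raw]
encoded = frequency_encode(cleaned)
print(cleaned)
print(encoded)
```

Since names such as Jefferson and Jackson recur across states, it may be worth encoding the (state, county) pair instead of county alone; the same two helpers would apply to tuples unchanged apart from the cleaning step.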
url
text identifier near_unique one_word url_heavy
This column holds a unique BFRO report URL for each of the 5,411 rows, all following the pattern https://www.bfro.net/gdb/show_report.asp?id=. Every value is distinct (n_unique=5411, duplicate_rate=0.0) and non-null, and url_rate is 1.0, so it functions as a per-row record locator rather than a feature. Lengths fall between 46 and 49 characters, which matches the 44-character prefix followed by a 2- to 5-digit numeric id and is consistent with only the id varying.
Treatment: Keep as a row-level link for traceability; drop from modelling or extract the trailing id as a foreign key.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 5,411
- len_min: 46
- len_max: 49
- len_mean: 48.56
- len_median: 49
- len_p95: 49
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 5,411
- readability_flesch_mean: -301.8
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
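Extracting the trailing id from the url column, as the treatment suggests, can be done with the standard library's URL parser. This is a sketch under the assumption that every URL carries a single numeric `id` query parameter; `extract_report_id` is a hypothetical helper name. Whether the extracted value equals the dataset's own id column is not stated in the report and would need to be checked.

```python
from urllib.parse import urlparse, parse_qs

def extract_report_id(url):
    """Return the integer `id` query parameter, or None if it is absent or not numeric."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id", [])
    if len(values) != 1:
        return None
    candidate = values[0]
    if not (candidate.isascii() and candidate.isdigit()):
        return None
    return int(candidate)

print(extract_report_id("https://www.bfro.net/gdb/show_report.asp?id=60"))
print(extract_report_id("https://www.bfro.net/gdb/show_report.asp?id="))
```

Returning None instead of raising keeps malformed rows visible for a later null-count check.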
month
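categorical feature
This column captures the month of the event, with August (634), October (632), and July (618) leading, consistent with a summer-to-autumn seasonal pattern. Cardinality is 32, far above the expected 12, so 20 extra non-month values pollute the field. The ones visible in the top values are season names (Summer, Fall, Spring, Winter) and fragments such as "Late", "about", "mid", and "or", which suggests free-text dates that were only partly parsed. The null rate is 2.96% and the entropy ratio is 0.76, indicating a reasonably spread but not uniform distribution. Note that top_rate (0.1207) is computed over the 5,251 non-null rows, whereas the 11.7% share in the data table is over all 5,411 rows.
Treatment: Normalize to the 12 canonical months, then one-hot or cyclically encode; decide separately how to handle season-only values, which do not map to a single month.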
- n: 5,411
- nulls: 160 (3.0%)
- unique: 32
- top_value: August
- top_rate: 0.1207
- cardinality: 32
- entropy: 3.807
- entropy_ratio: 0.7614
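A possible shape for the month treatment, normalizing to canonical months and then applying a cyclical encoding, is sketched below. The names `normalize_month` and `cyclical_encode` are hypothetical, and mapping every non-month value to None (rather than, say, assigning seasons to a mid-season month) is an assumption made here for simplicity.

```python
import math

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTH_INDEX = {name.lower(): i for i, name in enumerate(MONTHS, start=1)}

def normalize_month(value):
    """Return 1-12 for a canonical month name, else None.

    Seasons ("Summer") and fragments ("about", "mid") are left unmapped.
    """
    if value is None:
        return None
    return MONTH_INDEX.get(value.strip().lower())

def cyclical_encode(month_number):
    """Place month 1-12 on the unit circle so December and January are neighbours."""
    angle = 2 * math.pi * (month_number - 1) / 12
    return math.sin(angle), math.cos(angle)

for raw in ["August", " october ", "Summer", "about", None]:
    m = normalize_month(raw)
    print(repr(raw), m, cyclical_encode(m) if m is not None else None)
```

The cyclical form avoids the artificial jump between 12 and 1 that a plain integer encoding would introduce, which matters if the seasonal pattern is to be modelled.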
year
numeric timestamp
Year of the record, ranging from 1870 to 2025 across 99 distinct values, with a median of 2001 and an interquartile range of 1987 to 2009. The distribution is left-skewed (skew -0.97) with 49 outliers (0.9% of non-null rows) on the early-year tail, and 1.05% of rows are null.
Treatment: Treat as a temporal feature; consider bucketing by decade or clipping pre-1970 outliers before modelling.
- n: 5,411
- nulls: 57 (1.1%)
- unique: 99
- min: 1870
- max: 2025
- mean: 1998
- median: 2001
- std: 15.79
- q1: 1987
- q3: 2009
- iqr: 22
- skew: -0.9738
- kurtosis: 1.997
- n_outliers: 49
- outlier_rate: 0.009152
- zero_rate: 0
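To make the outlier and bucketing suggestions for year concrete, the sketch below computes outlier fences from the reported quartiles and assigns a decade bucket. The report does not say which outlier rule it uses; the 1.5 × IQR (Tukey) rule here is an assumption, and `iqr_fences` and `decade` are hypothetical helper names.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at q1 - k*IQR and q3 + k*IQR (assumed Tukey rule)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def decade(year):
    """Bucket a year into its decade start (1999 -> 1990); None stays None."""
    if year is None:
        return None
    return (year // 10) * 10

lower, upper = iqr_fences(1987, 2009)  # q1 and q3 from the stats above
print(lower, upper)  # 1954.0 2042.0
print([decade(y) for y in (1870, 1999, 2025, None)])  # [1870, 1990, 2020, None]
```

Under this assumed rule only years before 1954 would be flagged, a stricter cut-off than the pre-1970 clip mentioned in the treatment, so the two thresholds should not be treated as interchangeable.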
classification
categorical label
A three-level categorical label (Class A, B, C) with no nulls across 5,411 rows. The distribution is essentially binary in practice: Class B (50.3%) and Class A (49.1%) are nearly tied, while Class C appears only 34 times (0.6%), making it a rare class that will be hard to model or evaluate.
Treatment: One-hot or ordinal encode; consider stratified splits or merging Class C given its rarity.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 3
- top_value: Class B
- top_rate: 0.503
- cardinality: 3
- entropy: 1.049
- entropy_ratio: 0.6616
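The entropy figures above can be reproduced from the class counts. The sketch assumes that entropy is Shannon entropy in bits and that entropy_ratio divides it by log2 of the cardinality; the report does not define either, but this reading matches the reported values for classification (and also for state and month, where 5.025 / log2(53) ≈ 0.877 and 3.807 / log2(32) ≈ 0.761). The function names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories (assumed definition)."""
    k = len(counts)
    if k <= 1:
        return 0.0
    return entropy_bits(counts) / math.log2(k)

counts = [2722, 2655, 34]  # Class B, Class A, Class C
print(round(entropy_bits(counts), 3), round(entropy_ratio(counts), 4))  # 1.049 0.6616
```

A ratio of about 0.66 for three classes is close to what a perfectly balanced two-class label would give (1 / log2(3) ≈ 0.63), which is another way of seeing that Class C contributes very little.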
description
text free_text near_unique
Short free-text descriptions, almost certainly sighting summaries: 5,407 of 5,411 values are unique, with a mean length of 67 characters and about 10.6 words. The vocabulary of 7,169 tokens is dominated by 'near', 'sighting', and 'possible', suggesting templated phrasing about location-based observations. Duplicates and boilerplate are negligible (4 duplicates, boilerplate_rate 0.00018), and a Flesch score of about 55.7 indicates fairly readable prose, with no URLs or emoji.
Treatment: Tokenize and embed (or extract entities like species/location) before modelling; do not use as a key.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 5,407
- len_min: 10
- len_max: 221
- len_mean: 67.04
- len_median: 65
- len_p95: 101.5
- word_mean: 10.62
- word_median: 10
- n_empty: 0
- n_duplicates: 4
- duplicate_rate: 0.0007392
- vocab_size: 7,169
- readability_flesch_mean: 55.71
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0.0001848
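As a first step toward the "tokenize" treatment for description, a simple token count can confirm which words dominate. This is a minimal sketch: the regular expression, the helper names `tokenize` and `top_tokens`, and the two sample sentences are all invented for illustration and are not taken from the dataset or from the profiler's own tokenizer.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())

def top_tokens(descriptions, n=3):
    """Return the n most frequent tokens across all descriptions."""
    counts = Counter(tok for d in descriptions for tok in tokenize(d))
    return counts.most_common(n)

# Invented sample sentences (not real rows).
samples = [
    "Possible sighting near a lake",
    "Daylight sighting near a campground",
]
print(top_tokens(samples, n=3))
```

In practice, stop words such as "a" would usually be removed before counting, and an embedding model would replace raw counts for modelling.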