bigfoot listings 20260210
Reading
This dataset contains 5,411 Bigfoot sighting reports from BFRO (the Bigfoot Field Researchers Organization), with 9 columns covering a row id, location (state, state_code, county), timing (year, month), a classification grade, a short description, and a source URL. Sightings are concentrated in Washington, California, Ohio, and Florida, and cluster in summer and early fall (August, October, and July each hold 11 to 12% of rows). Classification is dominated by Class B (2,722) and Class A (2,655), with Class C barely represented (34), which is worth flagging if you plan to filter by report quality. The year distribution is left-skewed, with a median of 2001 and a long tail back to 1870, so most reports fall in the last few decades. Note that the county field has 338 empty values and an 81% duplicate rate (expected, since counties repeat across reports).
citing: row_count · column_count · columns.state.top_values · columns.month.top_values · columns.classification.top_values · columns.year.stats · columns.county.stats · columns.description.stats
Charts the summary said to look at first
state (top 20 values)
| value | count | share |
|---|---|---|
| Washington | 631 | 11.7% |
| California | 431 | 8.0% |
| Ohio | 317 | 5.9% |
| Florida | 314 | 5.8% |
| Oregon | 253 | 4.7% |
| Illinois | 239 | 4.4% |
| Texas | 238 | 4.4% |
| Michigan | 217 | 4.0% |
| Missouri | 161 | 3.0% |
| Georgia | 135 | 2.5% |
| Colorado | 128 | 2.4% |
| Pennsylvania | 125 | 2.3% |
| British Columbia | 122 | 2.3% |
| New York | 116 | 2.1% |
| Kentucky | 115 | 2.1% |
| Arkansas | 104 | 1.9% |
| Tennessee | 104 | 1.9% |
| West Virginia | 104 | 1.9% |
| Oklahoma | 101 | 1.9% |
| Idaho | 99 | 1.8% |
classification (all values)
| value | count | share |
|---|---|---|
| Class B | 2722 | 50.3% |
| Class A | 2655 | 49.1% |
| Class C | 34 | 0.6% |
month (top 20 values)
| value | count | share |
|---|---|---|
| August | 634 | 11.7% |
| October | 632 | 11.7% |
| July | 618 | 11.4% |
| September | 515 | 9.5% |
| June | 468 | 8.6% |
| November | 458 | 8.5% |
| May | 303 | 5.6% |
| April | 259 | 4.8% |
| December | 233 | 4.3% |
| January | 228 | 4.2% |
| Summer | 217 | 4.0% |
| March | 201 | 3.7% |
| February | 163 | 3.0% |
| Fall | 129 | 2.4% |
| Spring | 96 | 1.8% |
| Winter | 57 | 1.1% |
| Late | 6 | 0.1% |
| about | 6 | 0.1% |
| mid | 5 | 0.1% |
| or | 5 | 0.1% |
year (histogram, 40 bins)
| bin | count |
|---|---|
| 1870 – 1874 | 1 |
| 1874 – 1878 | 0 |
| 1878 – 1882 | 0 |
| 1882 – 1886 | 0 |
| 1886 – 1889 | 0 |
| 1889 – 1893 | 1 |
| 1893 – 1897 | 0 |
| 1897 – 1901 | 0 |
| 1901 – 1905 | 0 |
| 1905 – 1909 | 1 |
| 1909 – 1913 | 1 |
| 1913 – 1916 | 0 |
| 1916 – 1920 | 2 |
| 1920 – 1924 | 2 |
| 1924 – 1928 | 2 |
| 1928 – 1932 | 2 |
| 1932 – 1936 | 4 |
| 1936 – 1940 | 2 |
| 1940 – 1944 | 5 |
| 1944 – 1948 | 4 |
| 1948 – 1951 | 15 |
| 1951 – 1955 | 13 |
| 1955 – 1959 | 18 |
| 1959 – 1963 | 24 |
| 1963 – 1967 | 53 |
| 1967 – 1971 | 120 |
| 1971 – 1975 | 158 |
| 1975 – 1978 | 331 |
| 1978 – 1982 | 307 |
| 1982 – 1986 | 257 |
| 1986 – 1990 | 224 |
| 1990 – 1994 | 195 |
| 1994 – 1998 | 380 |
| 1998 – 2002 | 610 |
| 2002 – 2006 | 679 |
| 2006 – 2010 | 622 |
| 2010 – 2013 | 616 |
| 2013 – 2017 | 355 |
| 2017 – 2021 | 220 |
| 2021 – 2025 | 130 |
description length in characters (histogram, 40 bins)
| chars | count |
|---|---|
| 10 – 15 | 2 |
| 15 – 21 | 4 |
| 21 – 26 | 21 |
| 26 – 31 | 56 |
| 31 – 36 | 108 |
| 36 – 42 | 185 |
| 42 – 47 | 376 |
| 47 – 52 | 551 |
| 52 – 57 | 525 |
| 57 – 63 | 568 |
| 63 – 68 | 692 |
| 68 – 73 | 495 |
| 73 – 79 | 486 |
| 79 – 84 | 369 |
| 84 – 89 | 330 |
| 89 – 94 | 196 |
| 94 – 100 | 135 |
| 100 – 105 | 99 |
| 105 – 110 | 74 |
| 110 – 116 | 42 |
| 116 – 121 | 26 |
| 121 – 126 | 23 |
| 126 – 131 | 10 |
| 131 – 137 | 9 |
| 137 – 142 | 6 |
| 142 – 147 | 4 |
| 147 – 152 | 6 |
| 152 – 158 | 2 |
| 158 – 163 | 1 |
| 163 – 168 | 2 |
| 168 – 174 | 3 |
| 174 – 179 | 0 |
| 179 – 184 | 0 |
| 184 – 189 | 2 |
| 189 – 195 | 1 |
| 195 – 200 | 0 |
| 200 – 205 | 0 |
| 205 – 210 | 0 |
| 210 – 216 | 0 |
| 216 – 221 | 2 |
Schema
9 columns

| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| id | numeric | 0.0% | 5,411 | |
| state | categorical | 0.0% | 53 | |
| state_code | categorical | 0.0% | 53 | |
| county | text | 0.0% | 1,022 | one_word, short_text, duplicates |
| url | text | 0.0% | 5,411 | near_unique, one_word, url_heavy |
| month | categorical | 3.0% | 32 | |
| year | numeric | 1.1% | 99 | |
| classification | categorical | 0.0% | 3 | |
| description | text | 0.0% | 5,407 | near_unique |
id
numeric identifier
This column is almost certainly a row identifier: all 5,411 values are unique, none are null, and they span a wide integer range from 60 to 79,711. The distribution is right-skewed (skew 0.91) with no outliers flagged, consistent with sparsely allocated record IDs rather than a measured quantity. Treatment: Drop from modelling features; retain only as a join key.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 5,411
- min: 60
- max: 79,711
- mean: ≈23,290
- median: 16,598
- std: ≈21,380
- q1: 4,898
- q3: ≈36,360
- iqr: 31,464
- skew: 0.9109
- kurtosis: -0.151
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
state
categorical feature
State or province names across 5,411 rows, with 53 unique values and no nulls. The count exceeds 50 because the column is not limited to US states: British Columbia appears in the top values with 122 rows, so the extra categories likely include Canadian provinces and possibly DC or territories. Distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.66% with 631 rows, ahead of California (431) and Ohio (317), which is unusual since California typically dominates population-weighted US samples. Treatment: One-hot or target-encode; audit the categories that are not among the 50 US states.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 53
- top_value: Washington
- top_rate: 0.1166
- cardinality: 53
- entropy: 5.025
- entropy_ratio: 0.8773
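The entropy_ratio figure is not defined in this report. A plausible reading, assumed here rather than confirmed, is Shannon entropy in bits divided by log2 of the cardinality; the reported state values (5.025 bits, 53 categories) are consistent with that. A minimal sketch, where `entropy_ratio` is a hypothetical helper and not the profiler's own code:

```python
import math
from collections import Counter

def entropy_ratio(values):
    """Shannon entropy in bits divided by log2(number of distinct values).

    Assumed definition; returns 0.0 when there are fewer than two categories.
    """
    counts = Counter(values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    k = len(counts)
    return entropy / math.log2(k) if k > 1 else 0.0

# Check the assumed definition against the reported state figures.
reported_ratio = 5.025 / math.log2(53)  # entropy / log2(cardinality)
print(round(reported_ratio, 3))
```

A ratio of 1.0 would mean all categories are equally frequent, so 0.877 indicates a fairly even spread with a few dominant values.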
state_code
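categorical feature
Two-letter state codes, apparently stored in lowercase (the top value is wa), with 53 distinct values and no nulls. Cardinality, top rate, and entropy are identical to the state column, which suggests a one-to-one mapping between the two; using both as features would be redundant. Distribution is fairly even (entropy ratio 0.877), but Washington leads at 11.7% (631 rows), with California, Ohio, and Florida also prominent rather than following a population-weighted ranking. Treatment: One-hot or target-encode for modelling; safe to use as-is since it is complete and low-cardinality.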
- n: 5,411
- nulls: 0 (0.0%)
- unique: 53
- top_value: wa
- top_rate: 0.1166
- cardinality: 53
- entropy: 5.025
- entropy_ratio: 0.8773
county
text feature one_word short_text duplicates
Single-word county names (Pierce, Jefferson, Lewis, Snohomish, and Skamania suggest a Pacific Northwest tilt), with 1,022 unique values across 5,411 rows. Duplicates dominate at 81.1% (4,389 repeats), which is expected for a categorical, but 338 rows are empty strings rather than nulls, so the null rate reads 0.0% only because the blanks are not typed as null. Treatment: Coerce empty strings to null, then treat as a categorical (target or frequency encode for modelling).
- n: 5,411
- nulls: 0 (0.0%)
- unique: 1,022
- len_min: 0
- len_max: 23
- len_mean: 6.621
- len_median: 7
- len_p95: 10
- word_mean: 1
- word_median: 1
- n_empty: 338
- n_duplicates: 4,389
- duplicate_rate: 0.8111
- vocab_size: 1,020
- readability_flesch_mean: 16.9
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
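The county treatment above (coercing empty strings to null before encoding) could be sketched as follows. The helper name `blank_to_none` and the toy list are illustrative only, and treating whitespace-only strings as blank is an extra assumption; the report only confirms zero-length strings.

```python
def blank_to_none(value):
    """Return None for None, empty, or whitespace-only strings; else the stripped text.

    Treating whitespace-only values as blank is an assumption, not a reported fact.
    """
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None

# Toy sample, not the real data.
counties = ["Pierce", "", "Lewis", "  ", "Skamania"]
cleaned = [blank_to_none(c) for c in counties]
null_rate = sum(c is None for c in cleaned) / len(cleaned)

# Using the reported figures: 338 blanks out of 5,411 rows.
reported_blank_rate = 338 / 5411
print(f"{reported_blank_rate:.1%}")
```

After coercion, the county null rate on the real data would be about 6% rather than the 0.0% shown in the schema.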
url
text identifier near_unique one_word url_heavy
This column holds a unique BFRO (Bigfoot Field Researchers Organization) report URL for each of the 5,411 rows, all following the pattern https://www.bfro.net/gdb/show_report.asp?id= with a report id appended. Every value is unique (n_unique=5411, duplicate_rate=0.0) and non-null, and url_rate=1.0, so it functions as a per-row identifier rather than a feature. Lengths fall between 46 and 49 characters, consistent with the report id being the only varying segment. Treatment: Drop from modelling; retain as a row-level link or extract the numeric report id as a key.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 5,411
- len_min: 46
- len_max: 49
- len_mean: 48.56
- len_median: 49
- len_p95: 49
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 5,411
- readability_flesch_mean: -301.8
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
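Extracting the numeric report id from the URL, as the treatment suggests, might look like the sketch below. The function name `report_id` is hypothetical, and the example id of 60 is only illustrative (it is the minimum of the id column; whether the id column equals the URL id is an assumption that would need checking on the data).

```python
from urllib.parse import urlparse, parse_qs

def report_id(url):
    """Return the integer value of the `id` query parameter, or None if absent or non-numeric."""
    values = parse_qs(urlparse(url).query).get("id")
    if not values:
        return None
    try:
        return int(values[0])
    except ValueError:
        return None

# Illustrative URL built from the pattern quoted in the report.
example = "https://www.bfro.net/gdb/show_report.asp?id=60"
print(report_id(example), len(example))
```

A two-digit id gives a 46-character URL and a five-digit id gives 49, which lines up with the reported len_min and len_max.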
month
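categorical feature
Month names, most likely the month of the sighting. The distribution is seasonal, with summer and autumn months dominating (August, October, and July lead) and winter months trailing. August's top_rate of 12.07% is computed over the 5,251 non-null rows (634 / 5,251); over all 5,411 rows its share is 11.7%, as shown in the value table. Cardinality is 32, far above the expected 12: the extra values include season names (Summer, Fall, Spring, Winter) and stray tokens such as Late, about, mid, and or, which look like fragments of free-text dates. The null rate is 2.96%. Treatment: Normalize to the 12 canonical months, decide how to handle season-only values among the 20 extra categories, and impute or flag nulls before encoding.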
- n: 5,411
- nulls: 160 (3.0%)
- unique: 32
- top_value: August
- top_rate: 0.1207
- cardinality: 32
- entropy: 3.807
- entropy_ratio: 0.7614
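One way to approach the month normalization is to map canonical month names to numbers, keep season names as a separate coarse signal, and send everything else to null. This is a sketch under assumptions: the function name `normalize_month`, the decision to keep seasons separate, and the case-insensitive matching are all illustrative choices, not part of the report.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTH_INDEX = {name.lower(): i for i, name in enumerate(MONTHS, start=1)}
SEASONS = {"spring", "summer", "fall", "winter"}

def normalize_month(raw):
    """Return (month_number, season); unknown or missing values give (None, None)."""
    if raw is None:
        return (None, None)
    key = raw.strip().lower()
    if key in MONTH_INDEX:
        return (MONTH_INDEX[key], None)
    if key in SEASONS:
        return (None, key)
    return (None, None)  # stray tokens such as "about" or "mid"

# The two August rates in the report differ only in their denominator.
rate_non_null = 634 / (5411 - 160)  # top_rate in the stats list
rate_all_rows = 634 / 5411          # share in the value table
```

Keeping the season as its own field avoids discarding the 499 season-only rows (217 + 129 + 96 + 57 in the value table) while still leaving month null for them.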
year
numeric timestamp
This column is a year value, most likely the year of the sighting, spanning 1870 to 2025 with a median of 2001 and an IQR of 22 years (1987 to 2009). The distribution is left-skewed (skew -0.97) with a long tail of older entries, and 49 outliers (0.9% of non-null values) sit on the early end. Null rate is low at 1.05% and there are 99 distinct years. Treatment: Treat as a temporal feature; consider bucketing by decade or computing age relative to a reference year.
- n: 5,411
- nulls: 57 (1.1%)
- unique: 99
- min: 1870
- max: 2025
- mean: 1998
- median: 2001
- std: 15.79
- q1: 1987
- q3: 2009
- iqr: 22
- skew: -0.9738
- kurtosis: 1.997
- n_outliers: 49
- outlier_rate: 0.009152
- zero_rate: 0
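The decade bucketing and age features mentioned in the treatment can be sketched as below. The helper names and the reference year of 2025 (the column maximum) are assumptions for illustration. The fence calculation assumes the profiler uses the common 1.5 × IQR rule, which the report does not state; under that assumption the upper fence lies beyond the maximum year, which would explain why every outlier is on the early end.

```python
def decade(year):
    """Map a year to the start of its decade (1987 -> 1980); None stays None."""
    return None if year is None else (int(year) // 10) * 10

def age(year, reference_year=2025):
    """Years elapsed up to reference_year; the default is an assumed reference."""
    return None if year is None else reference_year - int(year)

def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under the (assumed) k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Using the reported quartiles for year.
lower, upper = tukey_fences(1987, 2009)
print(lower, upper)
```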
classification
categorical label
A 3-level categorical label, plausibly a target or stratification variable. Class B (2,722) and Class A (2,655) split the data nearly 50/50, while Class C appears only 34 times, a severe minority that will distort accuracy-style metrics. No nulls across 5,411 rows. Treatment: If used as a classification target, apply stratified splits and class weighting to handle the Class C minority.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 3
- top_value: Class B
- top_rate: 0.503
- cardinality: 3
- entropy: 1.049
- entropy_ratio: 0.6616
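To make the class-weighting suggestion concrete, here is a sketch using the reported counts and the common "balanced" heuristic, weight = n_samples / (n_classes × class_count). The choice of this heuristic is an assumption; the report does not prescribe a weighting scheme.

```python
# Counts taken from the classification value table.
counts = {"Class B": 2722, "Class A": 2655, "Class C": 34}

def balanced_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * n) for label, n in class_counts.items()}

weights = balanced_weights(counts)

# Accuracy of always predicting the majority class, for comparison.
majority_baseline = max(counts.values()) / sum(counts.values())

for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

Under this scheme a Class C row weighs roughly 80 times as much as a Class A or Class B row, and a model that ignores Class C entirely still reaches about 99% of the best achievable accuracy, which is why accuracy alone is a poor metric here.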
description
text free_text near_unique
Short free-text descriptions, averaging 10.6 words (median 10) and 67 characters, almost certainly summarizing sighting reports: top tokens include 'near' (2,283), 'sighting' (1,436), and 'possible' (1,117). Values are nearly unique (5,407 distinct out of 5,411), with only 4 duplicates and no nulls or empties, and a Flesch readability of 55.7 suggests fairly plain prose. A vocabulary of 7,169 words across this small corpus indicates rich lexical variety rather than templated text. Treatment: Tokenize and embed (or extract entities) before modelling; do not treat as a categorical.
- n: 5,411
- nulls: 0 (0.0%)
- unique: 5,407
- len_min: 10
- len_max: 221
- len_mean: 67.04
- len_median: 65
- len_p95: 101.5
- word_mean: 10.62
- word_median: 10
- n_empty: 0
- n_duplicates: 4
- duplicate_rate: 0.0007392
- vocab_size: 7,169
- readability_flesch_mean: 55.71
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
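As a first step toward the tokenization suggested for description, a simple lowercase word tokenizer with token counts might look like this. The regular expression, the helper name `tokenize`, and the two sample strings are illustrative assumptions; the sample strings are made up and are not rows from the dataset, and the profiler's own tokenizer may differ.

```python
import re
from collections import Counter

# Lowercase runs of letters, digits, and apostrophes count as tokens (an assumed rule).
TOKEN_RE = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())

# Made-up examples in the style of short sighting summaries.
descriptions = [
    "Possible sighting near a lake",
    "Hikers hear vocalizations near camp",
]
token_counts = Counter(tok for d in descriptions for tok in tokenize(d))
print(token_counts.most_common(3))
```

Token counts like these are the kind of summary behind the top-token figures quoted above; embedding or entity extraction would build on the same tokenized text.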