wild ghost sightings
Reading
This dataset catalogs 10,992 reported ghost sightings across the United States. Each record gives a location (city, state, latitude/longitude) and a free-text description of the sighting. Every record is in the United States (country has only 1 unique value), and California, Texas, and Pennsylvania lead the state counts; California alone holds about 9.7% of all sightings. The location column is rich in thematic words like 'school', 'cemetery', 'high', and 'house', hinting at the kinds of places where people report hauntings. Worth a closer look: the geographic skew toward a few populous states, and the description field, which averages ~70 words per entry and is nearly all unique, making it a good candidate for text mining. Note that latitude and longitude each have ~11.5% nulls and roughly 1.3% flagged outliers (124 and 131 values respectively), including a minimum latitude of -45 that falls outside the US.
citing: row_count · column_count · country.top_rate · state.top_values · state.entropy_ratio · location.top_words · description.word_mean · description.n_unique · latitude.null_rate · latitude.min · longitude.n_outliers
Charts the summary said to look at first
state: top 20 values by count
| value | count | share |
|---|---|---|
| California | 1070 | 9.7% |
| Texas | 696 | 6.3% |
| Pennsylvania | 649 | 5.9% |
| Michigan | 529 | 4.8% |
| Ohio | 477 | 4.3% |
| New York | 459 | 4.2% |
| Illinois | 395 | 3.6% |
| Kentucky | 370 | 3.4% |
| Indiana | 351 | 3.2% |
| Massachusetts | 342 | 3.1% |
| Florida | 328 | 3.0% |
| Missouri | 314 | 2.9% |
| Georgia | 289 | 2.6% |
| Wisconsin | 274 | 2.5% |
| Alabama | 224 | 2.0% |
| Tennessee | 221 | 2.0% |
| Washington | 218 | 2.0% |
| Oklahoma | 211 | 1.9% |
| North Carolina | 211 | 1.9% |
| New Jersey | 194 | 1.8% |
location: length in characters
| chars | count |
|---|---|
| 2 – 27 | 9633 |
| 27 – 53 | 1280 |
| 53 – 78 | 57 |
| 78 – 103 | 7 |
| 103 – 129 | 0 |
| 129 – 154 | 0 |
| 154 – 179 | 0 |
| 179 – 205 | 2 |
| 205 – 230 | 2 |
| 230 – 256 | 0 |
| 256 – 281 | 2 |
| 281 – 306 | 0 |
| 306 – 332 | 0 |
| 332 – 357 | 1 |
| 357 – 382 | 0 |
| 382 – 408 | 1 |
| 408 – 433 | 0 |
| 433 – 458 | 1 |
| 458 – 484 | 0 |
| 484 – 509 | 1 |
| 509 – 534 | 0 |
| 534 – 560 | 0 |
| 560 – 585 | 0 |
| 585 – 610 | 0 |
| 610 – 636 | 0 |
| 636 – 661 | 0 |
| 661 – 686 | 0 |
| 686 – 712 | 0 |
| 712 – 737 | 0 |
| 737 – 762 | 0 |
| 762 – 788 | 0 |
| 788 – 813 | 0 |
| 813 – 839 | 0 |
| 839 – 864 | 0 |
| 864 – 889 | 0 |
| 889 – 915 | 0 |
| 915 – 940 | 1 |
| 940 – 965 | 0 |
| 965 – 991 | 0 |
| 991 – 1016 | 1 |
latitude: distribution
| bin | count |
|---|---|
| -45.02 – -42.23 | 1 |
| -42.23 – -39.43 | 0 |
| -39.43 – -36.63 | 0 |
| -36.63 – -33.83 | 1 |
| -33.83 – -31.03 | 0 |
| -31.03 – -28.24 | 0 |
| -28.24 – -25.44 | 0 |
| -25.44 – -22.64 | 0 |
| -22.64 – -19.84 | 0 |
| -19.84 – -17.04 | 0 |
| -17.04 – -14.25 | 0 |
| -14.25 – -11.45 | 0 |
| -11.45 – -8.651 | 0 |
| -8.651 – -5.853 | 0 |
| -5.853 – -3.055 | 0 |
| -3.055 – -0.2572 | 0 |
| -0.2572 – 2.541 | 0 |
| 2.541 – 5.339 | 0 |
| 5.339 – 8.137 | 0 |
| 8.137 – 10.93 | 0 |
| 10.93 – 13.73 | 1 |
| 13.73 – 16.53 | 0 |
| 16.53 – 19.33 | 0 |
| 19.33 – 22.13 | 91 |
| 22.13 – 24.92 | 5 |
| 24.92 – 27.72 | 176 |
| 27.72 – 30.52 | 509 |
| 30.52 – 33.32 | 672 |
| 33.32 – 36.12 | 1609 |
| 36.12 – 38.91 | 1506 |
| 38.91 – 41.71 | 2549 |
| 41.71 – 44.51 | 1854 |
| 44.51 – 47.31 | 561 |
| 47.31 – 50.11 | 164 |
| 50.11 – 52.9 | 2 |
| 52.9 – 55.7 | 2 |
| 55.7 – 58.5 | 0 |
| 58.5 – 61.3 | 14 |
| 61.3 – 64.09 | 4 |
| 64.09 – 66.89 | 10 |
longitude: distribution
| bin | count |
|---|---|
| -164.7 – -156.4 | 86 |
| -156.4 – -148.1 | 22 |
| -148.1 – -139.7 | 10 |
| -139.7 – -131.4 | 3 |
| -131.4 – -123 | 65 |
| -123 – -114.7 | 1371 |
| -114.7 – -106.4 | 436 |
| -106.4 – -98.04 | 590 |
| -98.04 – -89.7 | 1570 |
| -89.7 – -81.37 | 2857 |
| -81.37 – -73.03 | 2086 |
| -73.03 – -64.7 | 625 |
| -64.7 – -56.36 | 0 |
| -56.36 – -48.03 | 0 |
| -48.03 – -39.69 | 0 |
| -39.69 – -31.35 | 0 |
| -31.35 – -23.02 | 0 |
| -23.02 – -14.68 | 0 |
| -14.68 – -6.348 | 0 |
| -6.348 – 1.987 | 2 |
| 1.987 – 10.32 | 0 |
| 10.32 – 18.66 | 0 |
| 18.66 – 26.99 | 0 |
| 26.99 – 35.33 | 0 |
| 35.33 – 43.66 | 1 |
| 43.66 – 52 | 0 |
| 52 – 60.34 | 0 |
| 60.34 – 68.67 | 0 |
| 68.67 – 77.01 | 1 |
| 77.01 – 85.34 | 3 |
| 85.34 – 93.68 | 0 |
| 93.68 – 102 | 0 |
| 102 – 110.3 | 0 |
| 110.3 – 118.7 | 0 |
| 118.7 – 127 | 0 |
| 127 – 135.4 | 0 |
| 135.4 – 143.7 | 2 |
| 143.7 – 152 | 0 |
| 152 – 160.4 | 0 |
| 160.4 – 168.7 | 1 |
description: length in characters
| chars | count |
|---|---|
| 2 – 134 | 1623 |
| 134 – 266 | 3145 |
| 266 – 398 | 2473 |
| 398 – 530 | 1528 |
| 530 – 662 | 810 |
| 662 – 794 | 509 |
| 794 – 926 | 303 |
| 926 – 1058 | 195 |
| 1058 – 1190 | 124 |
| 1190 – 1322 | 71 |
| 1322 – 1453 | 63 |
| 1453 – 1585 | 37 |
| 1585 – 1717 | 27 |
| 1717 – 1849 | 20 |
| 1849 – 1981 | 15 |
| 1981 – 2113 | 12 |
| 2113 – 2245 | 10 |
| 2245 – 2377 | 8 |
| 2377 – 2509 | 6 |
| 2509 – 2641 | 2 |
| 2641 – 2773 | 1 |
| 2773 – 2905 | 2 |
| 2905 – 3037 | 1 |
| 3037 – 3169 | 2 |
| 3169 – 3301 | 0 |
| 3301 – 3433 | 0 |
| 3433 – 3565 | 2 |
| 3565 – 3697 | 0 |
| 3697 – 3829 | 1 |
| 3829 – 3960 | 0 |
| 3960 – 4092 | 0 |
| 4092 – 4224 | 0 |
| 4224 – 4356 | 0 |
| 4356 – 4488 | 0 |
| 4488 – 4620 | 0 |
| 4620 – 4752 | 0 |
| 4752 – 4884 | 1 |
| 4884 – 5016 | 0 |
| 5016 – 5148 | 0 |
| 5148 – 5280 | 1 |
Schema
12 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| city | text | 0.0% | 4,385 | one_word, short_text, duplicates |
| country | categorical | 0.0% | 1 | imbalance |
| description | text | 0.0% | 10,987 | near_unique |
| location | text | 0.0% | 9,903 | |
| state | categorical | 0.0% | 51 | |
| state_abbrev | categorical | 0.0% | 51 | |
| longitude | numeric | 11.5% | 8,774 | |
| latitude | numeric | 11.5% | 8,775 | |
| city_longitude | numeric | 0.3% | 5,222 | |
| city_latitude | numeric | 0.3% | 5,310 | |
| location_2 | text | 11.5% | 8,776 | allcaps |
| city_location | text | 0.3% | 5,311 | allcaps, duplicates |
city
text · feature · one_word · short_text · duplicates
Short place-name strings (mean 9 characters, 73% one-word), with familiar US city names like Los Angeles, San Antonio, and Houston dominating the top values. Heavy duplication (60%, 6,604 rows) is expected for a city field, and 4,385 unique values suggest broad geographic coverage. Top word frequencies ('city', 'county', 'san', 'st.', 'new', 'fort') confirm conventional US toponymy, with no emoji or boilerplate and a negligible URL rate (9.1e-05).
Treatment: Treat as a categorical/geographic feature; consider geocoding or grouping rare values before encoding.
- n: 10,992
- nulls: 3 (0.0%)
- unique: 4,385
- len_min: 3
- len_max: 49
- len_mean: 9.043
- len_median: 9
- len_p95: 14
- word_mean: 1.291
- word_median: 1
- n_empty: 0
- n_duplicates: 6,604
- duplicate_rate: 0.601
- vocab_size: 3,988
- readability_flesch_mean: 20.49
- emoji_rate: 0
- url_rate: 9.1e-05
- one_word_rate: 0.7323
- allcaps_rate: 0.000455
- boilerplate_rate: 0
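The treatment for this column suggests grouping rare values before encoding. A minimal sketch of that step, using only the Python standard library; the helper name `group_rare`, the `min_count` threshold, and the `"Other"` label are illustrative assumptions, not something the profile prescribes:

```python
from collections import Counter

def group_rare(values, min_count=5, other="Other"):
    """Replace values seen fewer than min_count times with a shared label.

    None (null) entries are passed through unchanged.
    """
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other
        for v in values
    ]

# Toy sample for illustration; not rows from the dataset.
cities = ["Los Angeles"] * 3 + ["Houston"] * 2 + ["Smallville", None]
grouped = group_rare(cities, min_count=2)
```

The right threshold depends on the downstream model; with 4,385 distinct cities, even a small `min_count` is likely to fold a large share of values into the shared label.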
country
categorical · metadata · imbalance
This column records country, but every one of the 10,992 rows is "United States": cardinality is 1 and top_rate is 1.0. It carries zero information (entropy 0.0) and cannot discriminate between records.
Treatment: Drop; constant column with no signal.
- n: 10,992
- nulls: 0 (0.0%)
- unique: 1
- top_value: United States
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
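Dropping constant columns can be automated rather than done by hand. The sketch below assumes the data is held as a list of dictionaries (one per row); `constant_columns` is a hypothetical helper name and the two sample rows are made up for illustration:

```python
def constant_columns(rows):
    """Return names of columns with at most one distinct non-null value."""
    distinct = {}
    for row in rows:
        for name, value in row.items():
            bucket = distinct.setdefault(name, set())
            if value is not None:
                bucket.add(value)
    return sorted(n for n, vals in distinct.items() if len(vals) <= 1)

# Toy rows for illustration; not taken from the dataset.
rows = [
    {"city": "Houston", "country": "United States"},
    {"city": "Los Angeles", "country": "United States"},
]
to_drop = constant_columns(rows)
kept = [{k: v for k, v in r.items() if k not in to_drop} for r in rows]
```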
description
text · free_text · near_unique
Free-text descriptions averaging 70 words (median 55) and reaching up to 5,280 characters, with a mean Flesch readability of ~69.7, indicating fairly accessible prose. Nearly every row is unique (10,987 distinct values in 10,992 rows, only 5 duplicates) with a vocabulary of 33,001 tokens, so this reads as bespoke long-form copy rather than templated text. Boilerplate (0.6%), URLs (0.5%), all-caps (0.06%) and emoji (0%) are all negligible, and the top words are common English stopwords, consistent with natural English narrative.
Treatment: Tokenize and embed before modelling; do not use as a categorical key.
- n: 10,992
- nulls: 0 (0.0%)
- unique: 10,987
- len_min: 2
- len_max: 5,280
- len_mean: 380
- len_median: 297
- len_p95: 954
- word_mean: 70.33
- word_median: 55
- n_empty: 0
- n_duplicates: 5
- duplicate_rate: 0.0004549
- vocab_size: 33,001
- readability_flesch_mean: 69.67
- emoji_rate: 0
- url_rate: 0.005095
- one_word_rate: 0.000182
- allcaps_rate: 0.0006368
- boilerplate_rate: 0.006459
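As a rough illustration of the tokenization half of that treatment, the sketch below lowercases each description and splits it with a simple regular expression. The token pattern is an assumption; the profiler's own tokenizer is not documented here, so the resulting vocabulary size need not match the reported 33,001. The embedding step depends on the model chosen and is not shown.

```python
import re
from collections import Counter

# Assumed token pattern: runs of lowercase letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Lowercase the text and return its word-like tokens."""
    return TOKEN_RE.findall(text.lower())

def build_vocab(texts):
    """Count how often each token occurs across all texts."""
    vocab = Counter()
    for text in texts:
        vocab.update(tokenize(text))
    return vocab

# Made-up sentences for illustration; not rows from the dataset.
docs = [
    "A figure is seen on the stairs.",
    "Footsteps are heard on the stairs at night.",
]
vocab = build_vocab(docs)
```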
location
text · free_text
Short free-text place names averaging 2.98 words and 19.3 characters, with values like 'Prince Georges county', 'Cemetery', 'Cry Baby Bridge', and 'Wal-Mart'; these are likely the named sites of the sightings. Vocabulary is dominated by 'school', 'cemetery', 'high', 'university', 'house', 'road', suggesting a strong tilt toward institutional and roadside locations. High cardinality (9,903 unique of 10,992) coexists with a 9.9% duplicate rate, and one value ('Prince Georges county') appears 18 times, so most entries are unique strings but a long tail of repeated landmarks exists.
Treatment: Normalise casing/whitespace and geocode or entity-link to extract structured place features before modelling.
- n: 10,992
- nulls: 3 (0.0%)
- unique: 9,903
- len_min: 2
- len_max: 1,016
- len_mean: 19.3
- len_median: 17
- len_p95: 34
- word_mean: 2.976
- word_median: 3
- n_empty: 0
- n_duplicates: 1,086
- duplicate_rate: 0.09883
- vocab_size: 8,232
- readability_flesch_mean: 45.79
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.06334
- allcaps_rate: 0.003367
- boilerplate_rate: 0
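The casing/whitespace normalisation mentioned in the treatment might look like the following sketch. `normalise_place` is a hypothetical helper, and the spelling variants in the sample are invented to show the effect (only 'Cry Baby Bridge' and 'Wal-Mart' appear in the profile itself); geocoding and entity linking are separate steps not covered here.

```python
import re

def normalise_place(name):
    """Collapse runs of whitespace, trim, and case-fold a place name."""
    if name is None:
        return None
    return re.sub(r"\s+", " ", name).strip().casefold()

# Variants invented for illustration.
raw = ["Cry Baby Bridge", "  cry baby  bridge ", "CRY BABY BRIDGE", "Wal-Mart"]
keys = {normalise_place(r) for r in raw}
```

Using the normalised string only as a matching key, and keeping the original value for display, avoids losing the source spelling.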
state
categorical · feature
This is a US state field with 51 distinct values (likely the 50 states plus DC) across 10,992 rows and no nulls. The distribution is broad, with a high entropy ratio (0.916); California leads at 9.73% (1,070 rows), followed by Texas (696) and Pennsylvania (649). Counts roughly track population, except that Pennsylvania ranks unusually high and Florida falls outside the top 10 (11th, with 328 rows).
Treatment: One-hot or target-encode for modelling; consider grouping low-frequency states.
- n: 10,992
- nulls: 0 (0.0%)
- unique: 51
- top_value: California
- top_rate: 0.09734
- cardinality: 51
- entropy: 5.194
- entropy_ratio: 0.9157
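The entropy figures can be sanity-checked by hand. The sketch below assumes that entropy is Shannon entropy in bits and that entropy_ratio divides it by the maximum possible entropy, log2(cardinality); the profile does not state these definitions, but they are consistent with the reported values (5.194 / log2(51) ≈ 0.9157). The function names are illustrative.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(entropy, cardinality):
    """Entropy divided by its maximum, log2(cardinality); 0 if constant."""
    if cardinality <= 1:
        return 0.0
    return entropy / math.log2(cardinality)

# Reported values for the state column.
ratio = entropy_ratio(5.194, 51)
```

A ratio near 1 means the values are spread almost evenly across categories; the constant country column sits at the other extreme with a ratio of 0.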
state_abbrev
categorical · feature
This is a US state abbreviation field with 51 distinct values (presumably the 50 states plus DC) and no nulls across 10,992 rows. The distribution is broadly spread (entropy ratio 0.916), with CA leading at 9.7% (1,070 rows), followed by TX (696) and PA (649), consistent with population-weighted geographic data.
Treatment: One-hot or target-encode for modelling; safe to use as-is for grouping or joins.
- n: 10,992
- nulls: 0 (0.0%)
- unique: 51
- top_value: CA
- top_rate: 0.09734
- cardinality: 51
- entropy: 5.194
- entropy_ratio: 0.9157
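For readers unfamiliar with one-hot encoding, a minimal standard-library sketch follows. The `one_hot` helper is hypothetical; in practice a data-frame or machine-learning library would normally handle this step.

```python
def one_hot(values):
    """Return (sorted categories, one indicator row per input value)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cats, encoded = one_hot(["CA", "TX", "PA", "CA"])
```

With 51 categories this yields 51 indicator columns, which is usually manageable without further grouping.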
longitude
numeric · feature
Geographic longitude coordinates, with 8,774 unique values across 10,992 rows and an 11.47% null rate. Values span -164.72 to 168.70 but cluster between Q1 -99.12 and Q3 -80.30, indicating that most records sit in North America despite the global range. Kurtosis of 15.2 reflects heavy tails from the 131 outliers (1.35% of non-null values), which include far-western values as low as -164.72 and a few positive (Eastern Hemisphere) longitudes.
Treatment: Pair with latitude for geospatial features; impute or filter the 11.47% nulls before mapping.
- n: 10,992
- nulls: 1,261 (11.5%)
- unique: 8,774
- min: -164.7
- max: 168.7
- mean: -92
- median: -87.23
- std: 17.69
- q1: -99.12
- q3: -80.3
- iqr: 18.82
- skew: 0.2512
- kurtosis: 15.2
- n_outliers: 131
- outlier_rate: 0.01346
- zero_rate: 0
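The profile does not say how outliers are flagged. A common convention, assumed here, is the Tukey rule: values more than 1.5 × IQR below Q1 or above Q3. The sketch computes those fences from the reported quartiles; `tukey_fences` and `is_outlier` are illustrative names.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, fences):
    """True if x lies strictly outside the (lower, upper) fences."""
    lo, hi = fences
    return x < lo or x > hi

# Reported quartiles for longitude: fences come out near -127.35 and -52.07.
lon_fences = tukey_fences(-99.12, -80.30)
```

Under this assumption, the longitude histogram bins lying entirely outside the fences hold 121 values on the western side and 10 on the eastern side, 131 in total, which agrees with the reported n_outliers.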
latitude
numeric · feature
Geographic latitude in decimal degrees, ranging from -45.02 to 66.89 with a median of 39.28 and IQR of 7.20, suggesting most observations cluster in the northern mid-latitudes. The distribution is left-skewed (-0.97) with heavy tails (kurtosis 11.38) and 124 outliers (1.27% of non-null values), consistent with a few southern-hemisphere points dragging the lower tail. Note the 11.47% null rate, which will need handling before any spatial analysis.
Treatment: Pair with longitude for geospatial features; impute or drop the ~11% nulls before modelling.
- n: 10,992
- nulls: 1,261 (11.5%)
- unique: 8,775
- min: -45.02
- max: 66.89
- mean: 38.34
- median: 39.28
- std: 5.259
- q1: 34.68
- q3: 41.87
- iqr: 7.197
- skew: -0.971
- kurtosis: 11.38
- n_outliers: 124
- outlier_rate: 0.01274
- zero_rate: 0
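One way to handle both the nulls and the implausible coordinates before spatial analysis is a simple plausibility filter. The bounding box below is a rough assumption chosen to include Hawaii and Alaska, not an authoritative boundary, and the sample pairs are assembled from the reported summary values rather than from actual rows.

```python
# Assumed rough bounds covering the 50 states; adjust as needed.
US_LAT = (18.0, 72.0)
US_LON = (-180.0, -66.0)  # ignores the few Aleutian islands with positive longitudes

def plausible_us_point(lat, lon):
    """True if both coordinates are present and inside the assumed box."""
    if lat is None or lon is None:
        return False
    return US_LAT[0] <= lat <= US_LAT[1] and US_LON[0] <= lon <= US_LON[1]

# Illustrative pairs built from reported min/median/max values.
points = [(39.28, -87.23), (-45.02, 168.7), (None, None), (66.89, -164.7)]
kept = [p for p in points if plausible_us_point(*p)]
```

Rows that fail the check could fall back to the city-level coordinates, which have far fewer nulls (0.3%), though that choice trades precision for coverage.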
city_longitude
numeric · feature
Longitude of a city, with all 10,963 non-null values negative (min -164.72, max -67.84), placing every record in the Western Hemisphere and consistent with North American coverage. The distribution is left-skewed (skew -1.16), with the mean (-91.91) pulled west of the median (-87.09), suggesting a concentration in the central/eastern US with a tail reaching toward Alaska or the Pacific. 5,222 unique values across 10,992 rows indicate that many records share the same city, and the 128 outliers (1.17%) likely correspond to far-western locations.
Treatment: Use as a geospatial feature alongside latitude; consider clustering or binning rather than treating as a plain scalar.
- n: 10,992
- nulls: 29 (0.3%)
- unique: 5,222
- min: -164.7
- max: -67.84
- mean: -91.91
- median: -87.09
- std: 16.4
- q1: -98.49
- q3: -80.51
- iqr: 17.99
- skew: -1.155
- kurtosis: 1.328
- n_outliers: 128
- outlier_rate: 0.01168
- zero_rate: 0
city_latitude
numeric · feature
This column holds city latitudes in decimal degrees, ranging from 19.58 to 66.90 with a median of 39.28 and an IQR spanning 34.74 to 41.85. Values cluster firmly in the northern mid-latitudes (mean 38.38, std 5.07), with mild left skew (-0.32) and 128 outliers (1.17%) that likely sit at both ends of the range, from far-southern locales near 19.58 to far-northern ones near 66.90. Cardinality is moderate (5,310 unique of 10,992) and nulls are negligible (0.26%).
Treatment: Pair with longitude for geospatial features; avoid treating as a standalone scalar.
- n: 10,992
- nulls: 29 (0.3%)
- unique: 5,310
- min: 19.58
- max: 66.9
- mean: 38.38
- median: 39.28
- std: 5.067
- q1: 34.74
- q3: 41.85
- iqr: 7.109
- skew: -0.325
- kurtosis: 1.478
- n_outliers: 128
- outlier_rate: 0.01168
- zero_rate: 0
location_2
text · feature · allcaps
This column holds WKT geometry strings in the form `POINT(lon lat)`, with 8,776 unique values across 10,992 rows and an 11.47% null rate. Values are tightly bounded (length 22-43, always two 'words') and 9.81% are duplicates, with the most repeated coordinate appearing 20 times, which suggests clusters at recurring sites. The 'allcaps' alert is a false positive driven by the literal `POINT` prefix rather than natural language.
Treatment: Parse into numeric longitude/latitude pair before any geospatial modelling.
- n: 10,992
- nulls: 1,261 (11.5%)
- unique: 8,776
- len_min: 22
- len_max: 43
- len_mean: 30.81
- len_median: 29
- len_p95: 36
- word_mean: 2
- word_median: 2
- n_empty: 0
- n_duplicates: 955
- duplicate_rate: 0.09814
- vocab_size: 17,549
- readability_flesch_mean: 120.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 1
- boilerplate_rate: 0
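Parsing the `POINT(lon lat)` strings needs no geospatial library. The sketch below uses a regular expression that assumes plain decimal numbers (no exponent notation) and returns `None` for nulls or anything that does not match; `parse_point` is a hypothetical helper and the sample string is a shortened illustration, since the real values run 22 to 43 characters.

```python
import re

# Assumed shape: POINT(<lon> <lat>) with plain decimal numbers.
POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)

def parse_point(wkt):
    """Return (longitude, latitude) as floats, or None if unparseable."""
    if wkt is None:
        return None
    m = POINT_RE.match(wkt)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

lon, lat = parse_point("POINT(-118.24 34.05)")
```

Note the order: WKT puts longitude first, so swapping the two values is an easy mistake when building latitude/longitude pairs.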
city_location
text · feature · allcaps · duplicates
Despite the name, city_location stores WKT POINT(longitude latitude) geometry strings, not city names; every value matches that format (allcaps_rate 1.0, word_mean 2.0). Roughly half the non-null rows are repeats (duplicate_rate 0.516, n_duplicates 5,652), with the top coordinate (Los Angeles, 34.05/-118.24) appearing 61 times, suggesting a finite set of city centroids reused across records. Cardinality is still high (5,311 unique of 10,992) and nulls are negligible (0.26%).
Treatment: Parse the POINT() strings into numeric lat/long pairs before any geospatial modelling.
- n: 10,992
- nulls: 29 (0.3%)
- unique: 5,311
- len_min: 22
- len_max: 43
- len_mean: 31.19
- len_median: 29
- len_p95: 36
- word_mean: 2
- word_median: 2
- n_empty: 0
- n_duplicates: 5,652
- duplicate_rate: 0.5156
- vocab_size: 10,532
- readability_flesch_mean: 120.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 1
- boilerplate_rate: 0
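The duplicate figures throughout this profile are consistent with a simple definition, assumed here: n_duplicates is the number of non-null rows minus the number of distinct values, and duplicate_rate divides that by the non-null row count. The sketch implements that definition and checks it against the reported numbers for city_location; `duplicate_stats` is an illustrative name.

```python
from collections import Counter

def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate, most common (value, count))."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n_dup = len(present) - len(counts)
    rate = n_dup / len(present) if present else 0.0
    top = counts.most_common(1)[0] if counts else None
    return n_dup, rate, top

# Check the assumed definition against the reported city_location figures.
n_non_null = 10_992 - 29            # n minus nulls
n_dup_reported = n_non_null - 5_311  # minus unique
rate_reported = n_dup_reported / n_non_null
```

The arithmetic gives 5,652 duplicates and a rate of about 0.5156, matching the values listed above.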