wild ghost sightings

source /home/coolhand/html/datavis/data_trove/data/wild/ghost_sightings.csv · 10,992 rows · 12 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 10,992 reported ghost sightings across the United States. Each record describes a location (city, state, latitude/longitude) and carries a free-text description of the sighting. Every record is labeled United States (country has only 1 unique value), and California, Texas, and Pennsylvania lead the state counts; California alone holds about 9.7% of all sightings. The location column is rich in thematic words like 'school', 'cemetery', 'high', and 'house', hinting at the kinds of places where people report hauntings. Worth a closer look: the geographic skew toward a few populous states, and the description field, which averages ~70 words per entry and is nearly all unique, making it a good candidate for text mining. Note that latitude and longitude each have ~11.5% nulls and a small share of flagged outliers (124 and 131 values respectively), including a minimum latitude of -45 that falls outside the US.

citing: row_count · column_count · country.top_rate · state.top_values · state.entropy_ratio · location.top_words · description.word_mean · description.n_unique · latitude.null_rate · latitude.min · longitude.n_outliers

Charts the summary said to look at first

state · Shows which states report the most ghost sightings — California, Texas, and Pennsylvania dominate.

Top values for state (20 unique shown, of 51 total).
value	count	share
California	1070	9.7%
Texas	696	6.3%
Pennsylvania	649	5.9%
Michigan	529	4.8%
Ohio	477	4.3%
New York	459	4.2%
Illinois	395	3.6%
Kentucky	370	3.4%
Indiana	351	3.2%
Massachusetts	342	3.1%
Florida	328	3.0%
Missouri	314	2.9%
Georgia	289	2.6%
Wisconsin	274	2.5%
Alabama	224	2.0%
Tennessee	221	2.0%
Washington	218	2.0%
Oklahoma	211	1.9%
North Carolina	211	1.9%
New Jersey	194	1.8%
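
The table lists only the 20 most frequent of the 51 state values. As a quick check of how much of the dataset those rows cover, the counts can be summed and divided by the 10,992 total rows. A minimal sketch, with the counts copied from the table above:

```python
# Counts copied from the top-values table for `state` (top 20 of 51).
top20_counts = [
    1070, 696, 649, 529, 477, 459, 395, 370, 351, 342,
    328, 314, 289, 274, 224, 221, 218, 211, 211, 194,
]
TOTAL_ROWS = 10_992  # row count reported for the dataset

covered = sum(top20_counts)
coverage = covered / TOTAL_ROWS
print(f"top 20 states: {covered} rows, {coverage:.1%} of all sightings")
```

The 20 listed states account for 7,822 rows, so the 31 values not shown share the remaining 3,170.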

location · Distribution of location-name lengths; most are short (median 17 chars) but a long tail goes up to 1,016.

Character-length distribution for location (mean: 19.30).
chars	count
2 – 27	9633
27 – 53	1280
53 – 78	57
78 – 103	7
103 – 129	0
129 – 154	0
154 – 179	0
179 – 205	2
205 – 230	2
230 – 256	0
256 – 281	2
281 – 306	0
306 – 332	0
332 – 357	1
357 – 382	0
382 – 408	1
408 – 433	0
433 – 458	1
458 – 484	0
484 – 509	1
509 – 534	0
534 – 560	0
560 – 585	0
585 – 610	0
610 – 636	0
636 – 661	0
661 – 686	0
686 – 712	0
712 – 737	0
737 – 762	0
762 – 788	0
788 – 813	0
813 – 839	0
839 – 864	0
864 – 889	0
889 – 915	0
915 – 940	1
940 – 965	0
965 – 991	0
991 – 1016	1

latitude · Most sightings cluster in the continental US latitudes, but watch for outliers reaching as far south as -45.

Histogram bins for latitude (median: 39.28).
bin	count
-45.02 – -42.23	1
-42.23 – -39.43	0
-39.43 – -36.63	0
-36.63 – -33.83	1
-33.83 – -31.03	0
-31.03 – -28.24	0
-28.24 – -25.44	0
-25.44 – -22.64	0
-22.64 – -19.84	0
-19.84 – -17.04	0
-17.04 – -14.25	0
-14.25 – -11.45	0
-11.45 – -8.651	0
-8.651 – -5.853	0
-5.853 – -3.055	0
-3.055 – -0.2572	0
-0.2572 – 2.541	0
2.541 – 5.339	0
5.339 – 8.137	0
8.137 – 10.93	0
10.93 – 13.73	1
13.73 – 16.53	0
16.53 – 19.33	0
19.33 – 22.13	91
22.13 – 24.92	5
24.92 – 27.72	176
27.72 – 30.52	509
30.52 – 33.32	672
33.32 – 36.12	1609
36.12 – 38.91	1506
38.91 – 41.71	2549
41.71 – 44.51	1854
44.51 – 47.31	561
47.31 – 50.11	164
50.11 – 52.9	2
52.9 – 55.7	2
55.7 – 58.5	0
58.5 – 61.3	14
61.3 – 64.09	4
64.09 – 66.89	10

longitude · Longitude spread reveals the east–west footprint of sightings, with a heavy concentration around -87.

Histogram bins for longitude (median: -87.23).
bin	count
-164.7 – -156.4	86
-156.4 – -148.1	22
-148.1 – -139.7	10
-139.7 – -131.4	3
-131.4 – -123	65
-123 – -114.7	1371
-114.7 – -106.4	436
-106.4 – -98.04	590
-98.04 – -89.7	1570
-89.7 – -81.37	2857
-81.37 – -73.03	2086
-73.03 – -64.7	625
-64.7 – -56.36	0
-56.36 – -48.03	0
-48.03 – -39.69	0
-39.69 – -31.35	0
-31.35 – -23.02	0
-23.02 – -14.68	0
-14.68 – -6.348	0
-6.348 – 1.987	2
1.987 – 10.32	0
10.32 – 18.66	0
18.66 – 26.99	0
26.99 – 35.33	0
35.33 – 43.66	1
43.66 – 52	0
52 – 60.34	0
60.34 – 68.67	0
68.67 – 77.01	1
77.01 – 85.34	3
85.34 – 93.68	0
93.68 – 102	0
102 – 110.3	0
110.3 – 118.7	0
118.7 – 127	0
127 – 135.4	0
135.4 – 143.7	2
143.7 – 152	0
152 – 160.4	0
160.4 – 168.7	1

description · Description lengths are right-skewed: median ~297 characters but reaching up to 5,280 for the most detailed accounts.

Character-length distribution for description (mean: 380.03).
chars	count
2 – 134	1623
134 – 266	3145
266 – 398	2473
398 – 530	1528
530 – 662	810
662 – 794	509
794 – 926	303
926 – 1058	195
1058 – 1190	124
1190 – 1322	71
1322 – 1453	63
1453 – 1585	37
1585 – 1717	27
1717 – 1849	20
1849 – 1981	15
1981 – 2113	12
2113 – 2245	10
2245 – 2377	8
2377 – 2509	6
2509 – 2641	2
2641 – 2773	1
2773 – 2905	2
2905 – 3037	1
3037 – 3169	2
3169 – 3301	0
3301 – 3433	0
3433 – 3565	2
3565 – 3697	0
3697 – 3829	1
3829 – 3960	0
3960 – 4092	0
4092 – 4224	0
4224 – 4356	0
4356 – 4488	0
4488 – 4620	0
4620 – 4752	0
4752 – 4884	1
4884 – 5016	0
5016 – 5148	0
5148 – 5280	1

Schema

12 columns

Per-column summary.
column	type	nulls	unique	alerts
city	text	0.0%	4,385	one_word short_text duplicates
country	categorical	0.0%	1	imbalance
description	text	0.0%	10,987	near_unique
location	text	0.0%	9,903
state	categorical	0.0%	51
state_abbrev	categorical	0.0%	51
longitude	numeric	11.5%	8,774
latitude	numeric	11.5%	8,775
city_longitude	numeric	0.3%	5,222
city_latitude	numeric	0.3%	5,310
location_2	text	11.5%	8,776	allcaps
city_location	text	0.3%	5,311	allcaps duplicates
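
The nulls column above is the share of rows with a missing value. A minimal sketch of how such a rate could be computed from the CSV with the standard library follows; the sample rows are hypothetical, and treating a blank field as null is an assumption about the profiler rather than its documented behaviour.

```python
import csv
import io

# Hypothetical three-row sample using a few column names from the schema.
SAMPLE = """city,state,latitude,longitude
Los Angeles,California,34.05,-118.24
Houston,Texas,,
San Antonio,Texas,29.42,-98.49
"""

def null_rates(rows, columns):
    """Return {column: share of rows whose value is missing or blank}."""
    rows = list(rows)
    return {
        col: sum(1 for r in rows if not (r.get(col) or "").strip()) / len(rows)
        for col in columns
    }

reader = csv.DictReader(io.StringIO(SAMPLE))
rates = null_rates(reader, ["city", "state", "latitude", "longitude"])
print(rates)
```

Applied to the real file, the same ratio for latitude would be 1,261 / 10,992, which rounds to the 11.5% shown in the table.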

city

text feature one_word short_text duplicates

Short place-name strings (mean 9 characters, 73% one-word) with familiar US city names like Los Angeles, San Antonio, and Houston dominating the top values. Heavy duplication (60%, 6,604 rows) is expected for a city field, and 4,385 unique values suggest broad geographic coverage. Top word frequencies ('city', 'county', 'san', 'st.', 'new', 'fort') confirm conventional US toponymy, with no emoji or boilerplate and a negligible URL rate (9.1e-05, about one row). Treatment: Treat as a categorical/geographic feature; consider geocoding or grouping rare values before encoding. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 3 (0.0%)
unique: 4,385
len_min: 3
len_max: 49
len_mean: 9.043
len_median: 9
len_p95: 14
word_mean: 1.291
word_median: 1
n_empty: 0
n_duplicates: 6,604
duplicate_rate: 0.601
vocab_size: 3,988
readability_flesch_mean: 20.49
emoji_rate: 0
url_rate: 9.1e-05
one_word_rate: 0.7323
allcaps_rate: 0.000455
boilerplate_rate: 0
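
The duplicate figures above are consistent with counting every non-null row beyond the first occurrence of each value. That definition is inferred from the reported numbers, not taken from the profiler's documentation, so the sketch below is only a plausibility check:

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Duplicates counted as non-null rows beyond the first occurrence of each value.

    Assumption: this definition is inferred from the reported figures.
    """
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null

# Figures reported for `city`.
n_dup, rate = duplicate_stats(n_rows=10_992, n_nulls=3, n_unique=4_385)
print(n_dup, round(rate, 3))
```

With the `city` figures this reproduces n_duplicates 6,604 and duplicate_rate 0.601.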

country

categorical metadata imbalance

This column records country, but every one of the 10,992 rows is "United States" — cardinality is 1 and top_rate is 1.0. It carries zero information (entropy 0.0) and cannot discriminate between records. Treatment: Drop; constant column with no signal. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 0 (0.0%)
unique: 1
top_value: United States
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
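
Constant columns like this one can be detected generically before modelling. A minimal sketch, using hypothetical rows in which only the `country` value comes from the report:

```python
def constant_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    distinct = {}
    for row in rows:
        for col, value in row.items():
            distinct.setdefault(col, set()).add(value)
    return sorted(col for col, values in distinct.items() if len(values) == 1)

# Hypothetical rows; only the `country` value is taken from the report.
rows = [
    {"state": "California", "country": "United States"},
    {"state": "Texas", "country": "United States"},
    {"state": "Ohio", "country": "United States"},
]
print(constant_columns(rows))
```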

description

text free_text near_unique

Free-text descriptions, averaging 70 words (median 55) and reaching up to 5,280 characters, with a Flesch readability of ~69.7 indicating fairly accessible prose. Nearly every row is unique (10,987 of 10,992) with only 5 duplicates and a vocabulary of 33,001 tokens, so this reads as bespoke long-form copy rather than templated text. Boilerplate (0.6%), URLs (0.5%), all-caps (0.06%) and emoji (0%) are all negligible, and the top words are common English stopwords — consistent with natural English narrative. Treatment: Tokenize and embed before modelling; do not use as a categorical key. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 0 (0.0%)
unique: 10,987
len_min: 2
len_max: 5,280
len_mean: 380
len_median: 297
len_p95: 954
word_mean: 70.33
word_median: 55
n_empty: 0
n_duplicates: 5
duplicate_rate: 0.0004549
vocab_size: 33,001
readability_flesch_mean: 69.67
emoji_rate: 0
url_rate: 0.005095
one_word_rate: 0.000182
allcaps_rate: 0.0006368
boilerplate_rate: 0.006459
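
As an illustration of the tokenization step suggested in the treatment, the sketch below lowercases a description and splits it into word tokens with a regular expression. The sample sentence is hypothetical, and the token pattern is an assumption; the profiler's own word counts may use a different rule, such as whitespace splitting.

```python
import re
from collections import Counter

# Assumed token pattern: runs of lowercase letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return TOKEN_RE.findall(text.lower())

# Hypothetical description, not a row from the dataset.
sample = "Footsteps are heard in the hallway, and the lights flicker at night."
tokens = tokenize(sample)
print(len(tokens), Counter(tokens).most_common(1))
```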

location

text free_text

Short free-text place names averaging 2.98 words and 19.3 characters, with values like 'Prince Georges county', 'Cemetery', 'Cry Baby Bridge', and 'Wal-Mart'; these are most likely the named sites of the reported sightings. Vocabulary is dominated by 'school', 'cemetery', 'high', 'university', 'house', and 'road', suggesting a strong tilt toward institutional and roadside locations. Most entries are unique strings (9,903 unique of 10,992), but 1,086 rows (9.9%) repeat an earlier value, and the most frequent value ('Prince Georges county') appears 18 times, so a small set of landmarks recurs. Treatment: Normalise casing/whitespace and geocode or entity-link to extract structured place features before modelling. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 3 (0.0%)
unique: 9,903
len_min: 2
len_max: 1,016
len_mean: 19.3
len_median: 17
len_p95: 34
word_mean: 2.976
word_median: 3
n_empty: 0
n_duplicates: 1,086
duplicate_rate: 0.09883
vocab_size: 8,232
readability_flesch_mean: 45.79
emoji_rate: 0
url_rate: 0
one_word_rate: 0.06334
allcaps_rate: 0.003367
boilerplate_rate: 0
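
The casing and whitespace normalisation suggested in the treatment could look like the following sketch. The variants are hypothetical spellings of a value that appears in the report; real data may need further cleaning, such as punctuation handling.

```python
def normalise_place(name):
    """Collapse runs of whitespace and case-fold so near-identical names match."""
    return " ".join(name.split()).casefold()

# Hypothetical variants of a value that appears in the report.
variants = ["Cry Baby Bridge", "  cry baby   bridge ", "CRY BABY BRIDGE"]
print({normalise_place(v) for v in variants})
```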

state

categorical feature

This is a US state field with 51 distinct values (likely the 50 states plus DC) across 10,992 rows and no nulls. Distribution is broad with a high entropy ratio (0.916); California leads at 9.73% (1,070 rows), followed by Texas (696) and Pennsylvania (649). The ranking roughly tracks population, with exceptions: Pennsylvania ranks third, while Florida falls just outside the top 10 (11th, 328 rows). Treatment: One-hot or target-encode for modelling; consider grouping low-frequency states. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 0 (0.0%)
unique: 51
top_value: California
top_rate: 0.09734
cardinality: 51
entropy: 5.194
entropy_ratio: 0.9157
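
The entropy ratio divides the observed entropy by the maximum possible for the number of categories. The sketch below assumes the entropy is measured in bits (base 2); that is an inference, made because 5.194 / log2(51) reproduces the reported ratio.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(entropy, n_categories):
    """Entropy divided by its maximum, log2(k), for k categories."""
    return entropy / math.log2(n_categories)

# Reported figures for `state`: entropy 5.194 over 51 categories.
print(round(entropy_ratio(5.194, 51), 4))
# A uniform distribution reaches the maximum, so its ratio is 1.
print(round(entropy_ratio(entropy_bits([10] * 51), 51), 4))
```

A ratio near 1 means rows are spread almost evenly across categories; 0.9157 indicates a broad distribution with a mild concentration in the leading states.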

state_abbrev

categorical feature

This is a US state abbreviation field with 51 distinct values (50 states plus presumably DC) and no nulls across 10,992 rows. Distribution is broadly spread (entropy ratio 0.916) with CA leading at 9.7% (1,070 rows), followed by TX (696) and PA (649), consistent with population-weighted geographic data. Treatment: one-hot or target-encode for modelling; safe to use as-is for grouping or joins. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 0 (0.0%)
unique: 51
top_value: CA
top_rate: 0.09734
cardinality: 51
entropy: 5.194
entropy_ratio: 0.9157
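
If low-frequency states are grouped before encoding, one simple approach is to merge every category below a row threshold into a single bucket. In this sketch the counts come from the report's state table, the NJ abbreviation for New Jersey is supplied here, and the threshold of 500 and the "Other" label are arbitrary choices for illustration.

```python
from collections import Counter

def group_rare(counts, min_count, other_label="Other"):
    """Merge categories with fewer than `min_count` rows into one bucket."""
    grouped = Counter()
    for value, count in counts.items():
        grouped[value if count >= min_count else other_label] += count
    return grouped

# Four counts copied from the state table; the threshold of 500 is arbitrary.
counts = {"CA": 1070, "TX": 696, "PA": 649, "NJ": 194}
print(group_rare(counts, min_count=500))
```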

longitude

numeric feature

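Geographic longitude coordinates, with 8,774 unique values across 10,992 rows and an 11.47% null rate. Values span -164.72 to 168.70 but cluster tightly between Q1 -99.12 and Q3 -80.30, indicating most records sit in North America despite the global range. Kurtosis of 15.2 reflects heavy tails from the 131 outliers (1.35% of non-null values). Per the histogram, 121 values lie west of -131.4, a range consistent with Alaska and Hawaii, while 10 values are positive (Eastern Hemisphere) and conflict with the US-only country field. Treatment: Pair with latitude for geospatial features; impute or filter the 11.47% nulls before mapping, and review the positive longitudes as possible geocoding errors. high · anthropic:claude-opus-4-7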

n: 10,992
nulls: 1,261 (11.5%)
unique: 8,774
min: -164.7
max: 168.7
mean: -92
median: -87.23
std: 17.69
q1: -99.12
q3: -80.3
iqr: 18.82
skew: 0.2512
kurtosis: 15.2
n_outliers: 131
outlier_rate: 0.01346
zero_rate: 0
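
The report does not state which rule flags outliers. A common choice is Tukey's fences at 1.5 × IQR beyond the quartiles; the sketch below assumes that rule and applies it to the quartiles reported for `longitude`.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for `longitude`.
low, high = tukey_fences(q1=-99.12, q3=-80.30)
print(round(low, 2), round(high, 2))

# Outlier rate over non-null rows: 131 outliers, 10,992 rows, 1,261 nulls.
outlier_rate = 131 / (10_992 - 1_261)
print(round(outlier_rate, 5))
```

Under this assumption the fences sit at about -127.35 and -52.07, and dividing the outlier count by the non-null row count reproduces the reported outlier_rate of 0.01346.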

latitude

numeric feature

Geographic latitude in decimal degrees, ranging from -45.02 to 66.89 with a median of 39.28 and IQR of 7.20, suggesting most observations cluster in the northern mid-latitudes. The distribution is left-skewed (-0.97) with heavy tails (kurtosis 11.38) and 124 outliers (1.27% of non-null values). Per the histogram, only 3 values fall below 14° (2 of them in the Southern Hemisphere); the 91 values between 19.33 and 22.13 are consistent with Hawaii, and the 28 values above 58.5 with Alaska, so most flagged points are plausible US locations rather than errors. Note the 11.47% null rate, which will need handling before any spatial analysis. Treatment: Pair with longitude for geospatial features; impute or drop the ~11% nulls before modelling. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 1,261 (11.5%)
unique: 8,775
min: -45.02
max: 66.89
mean: 38.34
median: 39.28
std: 5.259
q1: 34.68
q3: 41.87
iqr: 7.197
skew: -0.971
kurtosis: 11.38
n_outliers: 124
outlier_rate: 0.01274
zero_rate: 0
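
One way to separate implausible coordinates from legitimate far-flung ones is a bounding-box check. The box below is an assumption chosen for illustration (roughly Hawaii to northern Alaska, and the western Aleutians to Maine); it is not part of the report, and it ignores US territories and the few Aleutian islands east of 180°.

```python
# Assumed bounding box, for illustration only.
US_LAT_RANGE = (18.0, 72.0)
US_LON_RANGE = (-180.0, -66.0)

def in_us_bounds(lat, lon):
    """True when the point falls inside the assumed US bounding box."""
    return (US_LAT_RANGE[0] <= lat <= US_LAT_RANGE[1]
            and US_LON_RANGE[0] <= lon <= US_LON_RANGE[1])

# Los Angeles centroid quoted in the report.
print(in_us_bounds(34.05, -118.24))
# Reported minimum latitude and maximum longitude, combined for illustration;
# they do not necessarily belong to the same record.
print(in_us_bounds(-45.02, 168.70))
```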

city_longitude

numeric feature

Longitude of a city, with all 10,963 non-null values negative (min -164.72, max -67.84), placing every record in the Western Hemisphere and consistent with North American coverage. The distribution is left-skewed (skew -1.16) with median -87.09 lying east of the mean -91.91, suggesting a concentration in the central/eastern US with a tail reaching toward Alaska or the Pacific. 5,222 unique values across 10,992 rows indicate many records share the same city, and 128 outliers (1.17%) likely correspond to far-western locations. Treatment: Use as a geospatial feature alongside latitude; consider clustering or binning rather than treating as a plain scalar. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 29 (0.3%)
unique: 5,222
min: -164.7
max: -67.84
mean: -91.91
median: -87.09
std: 16.4
q1: -98.49
q3: -80.51
iqr: 17.99
skew: -1.155
kurtosis: 1.328
n_outliers: 128
outlier_rate: 0.01168
zero_rate: 0
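
The binning mentioned in the treatment can be as simple as mapping each longitude to the left edge of a fixed-width bin. In this sketch the 10-degree width is an arbitrary choice and the input longitudes are hypothetical values inside the reported range.

```python
import math
from collections import Counter

def lon_bin(lon, width=10.0):
    """Left edge of the fixed-width bin that contains `lon`."""
    return math.floor(lon / width) * width

# Hypothetical city longitudes within the reported range (-164.72 to -67.84).
lons = [-118.24, -95.37, -98.49, -87.09, -80.51]
bins = Counter(lon_bin(x) for x in lons)
print(bins)
```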

city_latitude

numeric feature

This column holds city latitudes in decimal degrees, ranging from 19.58 to 66.90 with a median of 39.28 and an IQR spanning 34.74 to 41.85. Values cluster firmly in the northern mid-latitudes (mean 38.38, std 5.07), with mild left skew (-0.32) and 128 outliers (1.17%) that likely sit at both extremes, such as Hawaii near the 19.58 minimum and Alaska near the 66.90 maximum. Cardinality is moderate (5,310 unique of 10,992) and nulls are negligible (0.26%). Treatment: Pair with longitude for geospatial features; avoid treating as a standalone scalar. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 29 (0.3%)
unique: 5,310
min: 19.58
max: 66.9
mean: 38.38
median: 39.28
std: 5.067
q1: 34.74
q3: 41.85
iqr: 7.109
skew: -0.325
kurtosis: 1.478
n_outliers: 128
outlier_rate: 0.01168
zero_rate: 0
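
One example of pairing latitude with longitude is the great-circle distance between a sighting and its city centroid. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km; the sighting point is hypothetical, and the report itself does not compute this feature.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical sighting point compared with the Los Angeles centroid (34.05, -118.24).
print(round(haversine_km(34.10, -118.30, 34.05, -118.24), 1))
```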

location_2

text feature allcaps

This column holds WKT geometry strings in the form `POINT(lon lat)`, with 8776 unique values across 10992 rows and an 11.47% null rate. Values are tightly bounded (length 22-43, always two 'words') and 9.81% are duplicates, with the most repeated coordinate appearing 20 times — suggesting clusters at recurring sites. The 'allcaps' alert is a false positive driven by the literal `POINT` prefix rather than natural language. Treatment: Parse into numeric longitude/latitude pair before any geospatial modelling. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 1,261 (11.5%)
unique: 8,776
len_min: 22
len_max: 43
len_mean: 30.81
len_median: 29
len_p95: 36
word_mean: 2
word_median: 2
n_empty: 0
n_duplicates: 955
duplicate_rate: 0.09814
vocab_size: 17,549
readability_flesch_mean: 120.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 1
boilerplate_rate: 0
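
Parsing the WKT strings into numeric pairs could be sketched as follows. The example string is built from the Los Angeles coordinates quoted elsewhere in the report; the exact spacing used in the file is an assumption, so the pattern tolerates an optional space after `POINT`. Note that WKT lists longitude first.

```python
import re

# Matches strings such as "POINT(-118.24 34.05)".
POINT_RE = re.compile(
    r"^POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$"
)

def parse_point(wkt):
    """Return (longitude, latitude) from a WKT POINT string, or None if unparseable."""
    if not wkt:
        return None
    match = POINT_RE.match(wkt.strip())
    if match is None:
        return None
    lon, lat = (float(g) for g in match.groups())
    return lon, lat

print(parse_point("POINT(-118.24 34.05)"))
print(parse_point(None))
```

Returning None for missing or malformed input keeps the 11.5% of null rows from raising errors during a bulk parse.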

city_location

text feature allcaps duplicates

Despite the name, city_location stores WKT POINT(longitude latitude) geometry strings, not city names — every value matches that format (allcaps_rate 1.0, word_mean 2.0). Roughly half the rows are repeats (duplicate_rate 0.516, n_duplicates 5652) with the top coordinate (Los Angeles, 34.05/-118.24) appearing 61 times, suggesting a finite set of city centroids reused across records. Cardinality is still high (5311 unique of 10992) and nulls are negligible (0.26%). Treatment: Parse the POINT() strings into numeric lat/long pairs before any geospatial modelling. high · anthropic:claude-opus-4-7

n: 10,992
nulls: 29 (0.3%)
unique: 5,311
len_min: 22
len_max: 43
len_mean: 31.19
len_median: 29
len_p95: 36
word_mean: 2
word_median: 2
n_empty: 0
n_duplicates: 5,652
duplicate_rate: 0.5156
vocab_size: 10,532
readability_flesch_mean: 120.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 1
boilerplate_rate: 0