saturn·

wild ghost sightings

source /home/coolhand/html/datavis/data_trove/data/wild/ghost_sightings.csv · 10,992 rows · 12 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 10,992 reported ghost sightings across the United States. Each record gives a location (city, state, latitude/longitude) and a free-text description of the sighting. Every record is in the United States (country has only 1 unique value), and California, Texas, and Pennsylvania lead the state counts; California alone holds about 9.7% of all sightings. The location column is rich with thematic words like 'school', 'cemetery', 'high', and 'house', hinting at the kinds of places where people report hauntings. Worth a closer look: the geographic skew toward a few populous states, and the description field, which averages ~70 words per entry and is nearly all unique, making it a good candidate for text mining. Note that latitude and longitude are each ~11.5% null and carry flagged outliers (124 and 131 respectively), including a minimum latitude of -45.02 that falls outside the US.

citing: row_count · column_count · country.top_rate · state.top_values · state.entropy_ratio · location.top_words · description.word_mean · description.n_unique · latitude.null_rate · latitude.min · longitude.n_outliers

Schema

12 columns
Per-column summary.

column          type         nulls   unique   alerts
city            text         0.0%    4,385    one_word short_text duplicates
country         categorical  0.0%    1        imbalance
description     text         0.0%    10,987   near_unique
location        text         0.0%    9,903
state           categorical  0.0%    51
state_abbrev    categorical  0.0%    51
longitude       numeric      11.5%   8,774
latitude        numeric      11.5%   8,775
city_longitude  numeric      0.3%    5,222
city_latitude   numeric      0.3%    5,310
location_2      text         11.5%   8,776    allcaps
city_location   text         0.3%    5,311    allcaps duplicates

city

text feature one_word short_text duplicates
Short place-name strings (mean 9 characters, 73% one-word) with familiar US city names like Los Angeles, San Antonio, and Houston dominating the top values. Heavy duplication (60%, 6,604 rows) is expected for a city field, and 4,385 unique values suggest broad geographic coverage. Top word frequencies ('city', 'county', 'san', 'st.', 'new', 'fort') confirm conventional US toponymy, with no emoji or boilerplate and a URL in roughly one row (url_rate 9.1e-05). Treatment: Treat as a categorical/geographic feature; consider geocoding or grouping rare values before encoding. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 3 (0.0%)
unique: 4,385
len_min: 3
len_max: 49
len_mean: 9.043
len_median: 9
len_p95: 14
word_mean: 1.291
word_median: 1
n_empty: 0
n_duplicates: 6,604
duplicate_rate: 0.601
vocab_size: 3,988
readability_flesch_mean: 20.49
emoji_rate: 0
url_rate: 9.1e-05
one_word_rate: 0.7323
allcaps_rate: 0.000455
boilerplate_rate: 0
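
The suggestion to group rare values before encoding can be sketched as follows. This is a minimal illustration, not the profiler's own method: the helper name `group_rare`, the `min_count` threshold, and the `"OTHER"` label are all hypothetical choices, and the city list is toy data.

```python
from collections import Counter

def group_rare(values, min_count=5, other_label="OTHER"):
    """Replace values seen fewer than min_count times with other_label.

    None (null) entries are left untouched so they can be handled separately.
    """
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other_label
        for v in values
    ]

# Toy city column; real data would come from the `city` field.
cities = ["Houston"] * 3 + ["Los Angeles"] * 2 + ["Marfa", None]
grouped = group_rare(cities, min_count=2)
print(Counter(grouped))
```

With 4,385 distinct cities, a threshold like this trades geographic detail for a smaller encoded feature space, so the cut-off is worth tuning against the downstream model.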

country

categorical metadata imbalance
This column records country, but every one of the 10,992 rows is "United States" — cardinality is 1 and top_rate is 1.0. It carries zero information (entropy 0.0) and cannot discriminate between records. Treatment: Drop; constant column with no signal. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 0 (0.0%)
unique: 1
top_value: United States
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

description

text free_text near_unique
Free-text descriptions, averaging 70 words (median 55) and reaching up to 5,280 characters, with a Flesch readability of ~69.7 indicating fairly accessible prose. Nearly every row is unique (10,987 of 10,992) with only 5 duplicates and a vocabulary of 33,001 tokens, so this reads as bespoke long-form copy rather than templated text. Boilerplate (0.6%), URLs (0.5%), all-caps (0.06%) and emoji (0%) are all negligible, and the top words are common English stopwords — consistent with natural English narrative. Treatment: Tokenize and embed before modelling; do not use as a categorical key. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 0 (0.0%)
unique: 10,987
len_min: 2
len_max: 5,280
len_mean: 380
len_median: 297
len_p95: 954
word_mean: 70.33
word_median: 55
n_empty: 0
n_duplicates: 5
duplicate_rate: 0.0004549
vocab_size: 33,001
readability_flesch_mean: 69.67
emoji_rate: 0
url_rate: 0.005095
one_word_rate: 0.000182
allcaps_rate: 0.0006368
boilerplate_rate: 0.006459
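
As a rough illustration of the tokenizing step recommended above, the sketch below splits descriptions into word tokens and derives word counts and a vocabulary, the same kind of figures reported as word_mean, word_median, and vocab_size. The regex tokenizer is an assumption; the profiler's actual tokenization rules are not stated in the report, and the three descriptions are invented examples.

```python
import re
from statistics import mean, median

# Hypothetical tokenizer: runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")

def tokenize(text):
    """Lowercase a description and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())

# Invented sample descriptions, not rows from the dataset.
descriptions = [
    "A woman in white walks the hallway at night.",
    "Cold spots.",
    "Doors slam on their own and footsteps are heard upstairs.",
]
token_lists = [tokenize(d) for d in descriptions]
word_counts = [len(t) for t in token_lists]
vocab = {tok for toks in token_lists for tok in toks}

print(mean(word_counts), median(word_counts), len(vocab))
```

The token lists would then feed whatever embedding step is chosen; that step is left out here because the report does not name one.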

location

text free_text
Short free-text place names averaging 2.98 words and 19.3 characters, with values like 'Prince Georges county', 'Cemetery', 'Cry Baby Bridge', and 'Wal-Mart', likely the named site of a ghost story or sighting. Vocabulary is dominated by 'school', 'cemetery', 'high', 'university', 'house', 'road', suggesting a strong tilt toward institutional and roadside locations. High cardinality (9,903 unique of 10,992) coexists with a 9.9% duplicate rate, and one value ('Prince Georges county') appears 18 times, so most entries are unique strings while a small set of landmarks and generic names recurs. Treatment: Normalise casing/whitespace and geocode or entity-link to extract structured place features before modelling. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 3 (0.0%)
unique: 9,903
len_min: 2
len_max: 1,016
len_mean: 19.3
len_median: 17
len_p95: 34
word_mean: 2.976
word_median: 3
n_empty: 0
n_duplicates: 1,086
duplicate_rate: 0.09883
vocab_size: 8,232
readability_flesch_mean: 45.79
emoji_rate: 0
url_rate: 0
one_word_rate: 0.06334
allcaps_rate: 0.003367
boilerplate_rate: 0
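
The casing and whitespace normalisation suggested above might look like the sketch below. The function name `normalize_place` and the exact rules (NFKC normalisation, whitespace collapsing, casefolding) are assumptions made for illustration; geocoding and entity linking are not shown.

```python
import re
import unicodedata

def normalize_place(name):
    """Return a casefolded, whitespace-collapsed key for a place name."""
    if name is None:
        return None
    text = unicodedata.normalize("NFKC", name)   # unify compatibility forms
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text.casefold()                       # case-insensitive key

# Variants of two names taken from the examples above.
raw = ["Cry Baby Bridge", "  cry baby   bridge ", "CEMETERY", "Cemetery"]
keys = [normalize_place(r) for r in raw]
print(sorted(set(keys)))
```

Keeping the original string alongside the normalised key preserves the display form while letting near-identical spellings group together.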

state

categorical feature
This is a US state field with 51 distinct values (likely the 50 states plus DC) across 10,992 rows and no nulls. Distribution is broad with a high entropy ratio (0.916); California leads at 9.73% (1,070 rows), followed by Texas (696) and Pennsylvania (649). The ranking roughly tracks population, except that Pennsylvania ranks unusually high and Florida does not appear in the top 10. Treatment: One-hot or target-encode for modelling; consider grouping low-frequency states. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 0 (0.0%)
unique: 51
top_value: California
top_rate: 0.09734
cardinality: 51
entropy: 5.194
entropy_ratio: 0.9157
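
The entropy figures above are consistent with Shannon entropy in bits divided by its maximum, log2 of the number of categories: 5.194 / log2(51) is approximately 0.9157. The sketch below shows that calculation. The function names are hypothetical, and the base-2 logarithm is an inference from the reported numbers rather than something the report states.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column carries no information
    return entropy_bits(counts) / math.log2(k)

# Check the reported figures: entropy 5.194 bits over 51 categories.
reported_ratio = 5.194 / math.log2(51)
print(round(reported_ratio, 4))
```

A ratio of 1.0 would mean sightings are spread evenly over all 51 values, and 0.0 corresponds to a constant column such as country.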

state_abbrev

categorical feature
This is a US state abbreviation field with 51 distinct values (50 states plus presumably DC) and no nulls across 10,992 rows. Distribution is broadly spread (entropy ratio 0.916) with CA leading at 9.7% (1,070 rows), followed by TX (696) and PA (649), consistent with population-weighted geographic data. Treatment: one-hot or target-encode for modelling; safe to use as-is for grouping or joins. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 0 (0.0%)
unique: 51
top_value: CA
top_rate: 0.09734
cardinality: 51
entropy: 5.194
entropy_ratio: 0.9157

longitude

numeric feature
Geographic longitude coordinates, with 8774 unique values across 10992 rows and an 11.47% null rate. Values span -164.72 to 168.70 but cluster tightly between Q1 -99.12 and Q3 -80.30, indicating most records sit in North America despite the global range. Kurtosis of 15.2 confirms heavy tails from the 131 outliers (1.35%) reaching across the Pacific. Treatment: Pair with latitude for geospatial features; impute or filter the 11.47% nulls before mapping. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 1,261 (11.5%)
unique: 8,774
min: -164.7
max: 168.7
mean: -92
median: -87.23
std: 17.69
q1: -99.12
q3: -80.3
iqr: 18.82
skew: 0.2512
kurtosis: 15.2
n_outliers: 131
outlier_rate: 0.01346
zero_rate: 0
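
The report does not say which rule produces n_outliers. One common convention is Tukey's fences at 1.5 × IQR beyond the quartiles, and the sketch below applies that rule to the quartiles reported for longitude. Treat the rule, the `k=1.5` multiplier, and the function names as assumptions; the sample values are illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(values, q1, q3, k=1.5):
    """Count non-null values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return sum(1 for v in values if v is not None and (v < lower or v > upper))

# Quartiles reported for `longitude` above.
lower, upper = tukey_fences(-99.12, -80.30)
print(round(lower, 2), round(upper, 2))

# Illustrative values, including the reported min and max.
sample = [-164.72, -118.24, -87.23, None, 168.70]
print(count_outliers(sample, -99.12, -80.30))
```

Under this assumption, any longitude west of about -127 or east of about -52 would be flagged, which fits the description of outliers reaching across the Pacific.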

latitude

numeric feature
Geographic latitude in decimal degrees, ranging from -45.02 to 66.89 with a median of 39.28 and IQR of 7.20, suggesting most observations cluster in the northern mid-latitudes. The distribution is left-skewed (-0.97) with heavy tails (kurtosis 11.38) and 124 outliers (1.27%), consistent with a few southern-hemisphere points dragging the tail. Note the 11.47% null rate, which will need handling before any spatial analysis. Treatment: Pair with longitude for geospatial features; impute or drop the ~11% nulls before modelling. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 1,261 (11.5%)
unique: 8,775
min: -45.02
max: 66.89
mean: 38.34
median: 39.28
std: 5.259
q1: 34.68
q3: 41.87
iqr: 7.197
skew: -0.971
kurtosis: 11.38
n_outliers: 124
outlier_rate: 0.01274
zero_rate: 0
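
Before spatial analysis, null and implausible coordinates can be separated from usable ones. The sketch below labels each pair as missing, suspect, or ok against a rough bounding box for the 50 states. The box limits are an approximation chosen for this example (a few Aleutian islands have positive longitudes and would be flagged), and the sample pairs are built from reported summary values without knowing which rows they actually belong to.

```python
# Rough bounding box covering the 50 states (an assumption for this sketch).
US_LAT_RANGE = (18.0, 72.0)
US_LON_RANGE = (-180.0, -66.0)

def classify_point(lat, lon):
    """Label a coordinate pair as 'missing', 'suspect', or 'ok'."""
    if lat is None or lon is None:
        return "missing"
    in_lat = US_LAT_RANGE[0] <= lat <= US_LAT_RANGE[1]
    in_lon = US_LON_RANGE[0] <= lon <= US_LON_RANGE[1]
    return "ok" if in_lat and in_lon else "suspect"

# Illustrative pairs; the true row pairings are not known from the report.
rows = [(39.28, -87.23), (None, None), (-45.02, 168.70), (66.89, -164.72)]
labels = [classify_point(lat, lon) for lat, lon in rows]
print(labels)
```

Flagging rather than dropping keeps the suspect rows available for manual review, which matters when the cause may be a geocoding error rather than a bad report.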

city_longitude

numeric feature
Longitude of a city, with all 10,963 non-null values negative (min -164.72, max -67.84), placing every record in the Western Hemisphere and consistent with North American coverage. The distribution is left-skewed (skew -1.16) with median -87.09 pulled east of the mean -91.91, suggesting a concentration in the central/eastern US with a tail reaching toward Alaska or the Pacific. 5,222 unique values across 10,992 rows indicate many records share the same city, and 128 outliers (1.17%) likely correspond to far-western locations. Treatment: Use as a geospatial feature alongside city_latitude; consider clustering or binning rather than treating as a plain scalar. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 29 (0.3%)
unique: 5,222
min: -164.7
max: -67.84
mean: -91.91
median: -87.09
std: 16.4
q1: -98.49
q3: -80.51
iqr: 17.99
skew: -1.155
kurtosis: 1.328
n_outliers: 128
outlier_rate: 0.01168
zero_rate: 0
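
One simple form of the binning mentioned above is to snap each coordinate pair to a fixed-size grid cell. The sketch below is a hypothetical helper, not part of the profiler; the cell size in degrees is a free parameter, and the example point is the Los Angeles centroid (34.05, -118.24) cited for city_location.

```python
import math

def grid_cell(lat, lon, cell_deg=1.0):
    """Map a coordinate to the integer index of its cell_deg x cell_deg cell."""
    if lat is None or lon is None:
        return None
    # floor (not int) so negative longitudes bin consistently westward
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

print(grid_cell(34.05, -118.24))
print(grid_cell(34.05, -118.24, cell_deg=5.0))
```

Cells of equal degree width shrink in ground distance toward the poles, so this is a coarse grouping device rather than an equal-area scheme.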

city_latitude

numeric feature
This column holds city latitudes in decimal degrees, ranging from 19.58 to 66.90 with a median of 39.28 and an IQR spanning 34.74 to 41.85. Values cluster firmly in the northern mid-latitudes (mean 38.38, std 5.07), with mild left skew (-0.32) and 128 outliers (1.17%) that likely sit at both extremes of the range, far-southern as well as far-northern locales. Cardinality is moderate (5,310 unique of 10,992) and nulls are negligible (0.26%). Treatment: Pair with city_longitude for geospatial features; avoid treating as a standalone scalar. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 29 (0.3%)
unique: 5,310
min: 19.58
max: 66.9
mean: 38.38
median: 39.28
std: 5.067
q1: 34.74
q3: 41.85
iqr: 7.109
skew: -0.325
kurtosis: 1.478
n_outliers: 128
outlier_rate: 0.01168
zero_rate: 0

location_2

text feature allcaps
This column holds WKT geometry strings in the form `POINT(lon lat)`, with 8776 unique values across 10992 rows and an 11.47% null rate. Values are tightly bounded (length 22-43, always two 'words') and 9.81% are duplicates, with the most repeated coordinate appearing 20 times — suggesting clusters at recurring sites. The 'allcaps' alert is a false positive driven by the literal `POINT` prefix rather than natural language. Treatment: Parse into numeric longitude/latitude pair before any geospatial modelling. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 1,261 (11.5%)
unique: 8,776
len_min: 22
len_max: 43
len_mean: 30.81
len_median: 29
len_p95: 36
word_mean: 2
word_median: 2
n_empty: 0
n_duplicates: 955
duplicate_rate: 0.09814
vocab_size: 17,549
readability_flesch_mean: 120.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 1
boilerplate_rate: 0
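
Parsing the `POINT(lon lat)` strings into numbers could be done with a small regular expression, as in the sketch below. The exact spacing of the stored strings is not visible in the report, so the pattern tolerates optional whitespace; it assumes plain decimal numbers (no scientific notation) and returns None for nulls or anything that does not match. The name `parse_point` is hypothetical.

```python
import re

# WKT points list x (longitude) first, then y (latitude).
POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)

def parse_point(wkt):
    """Parse 'POINT(lon lat)' into (lon, lat) floats; None if it does not match."""
    if wkt is None:
        return None
    m = POINT_RE.match(wkt)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

print(parse_point("POINT(-118.24 34.05)"))
print(parse_point(None))
```

Returning None instead of raising keeps the 11.5% null rows and any malformed strings in the same "needs handling" bucket for the later imputation or filtering step.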

city_location

text feature allcaps duplicates
Despite the name, city_location stores WKT POINT(longitude latitude) geometry strings, not city names — every value matches that format (allcaps_rate 1.0, word_mean 2.0). Roughly half the rows are repeats (duplicate_rate 0.516, n_duplicates 5652) with the top coordinate (Los Angeles, 34.05/-118.24) appearing 61 times, suggesting a finite set of city centroids reused across records. Cardinality is still high (5311 unique of 10992) and nulls are negligible (0.26%). Treatment: Parse the POINT() strings into numeric lat/long pairs before any geospatial modelling. high · anthropic:claude-opus-4-7
n: 10,992
nulls: 29 (0.3%)
unique: 5,311
len_min: 22
len_max: 43
len_mean: 31.19
len_median: 29
len_p95: 36
word_mean: 2
word_median: 2
n_empty: 0
n_duplicates: 5,652
duplicate_rate: 0.5156
vocab_size: 10,532
readability_flesch_mean: 120.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 1
boilerplate_rate: 0
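
The duplicate figures in this report are consistent with counting duplicates over non-null rows: n_duplicates = (n - nulls) - unique, and duplicate_rate = n_duplicates / (n - nulls). That definition is inferred from the reported numbers, not stated by the profiler. The sketch below reproduces the arithmetic for three of the text columns.

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Return (n_duplicates, duplicate_rate) computed over non-null rows."""
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null

# (column, n, nulls, unique) as reported above.
for name, n, nulls, uniq in [
    ("city_location", 10992, 29, 5311),
    ("city", 10992, 3, 4385),
    ("location_2", 10992, 1261, 8776),
]:
    dups, rate = duplicate_stats(n, nulls, uniq)
    print(name, dups, round(rate, 4))
```

For city_location this gives 5,652 duplicates out of 10,963 non-null rows, matching the reported n_duplicates and a duplicate_rate of about 0.5156.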