wild-ghost_sightings · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/ghost_sightings.csv

Saturn profiled 10,992 rows across 12 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

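# Notebook shell escape: install the profiler into the current environment.
!pip install saturn-dissect

import subprocess

# Run the saturn CLI on the source CSV with the same arguments as before.
# check=True raises CalledProcessError if the command exits non-zero.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/ghost_sightings.csv",
        "--findings", "wild-ghost_sightings.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,
)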

Summary confidence: high

This dataset catalogs 10,992 reported ghost sightings across the United States. Each record describes a location (city, state, latitude/longitude) and carries a free-text description of the sighting. Every record is in the United States (country has only 1 unique value), and California, Texas, and Pennsylvania lead the state counts; California alone holds about 9.7% of all sightings. The location column is rich with thematic words like 'school', 'cemetery', 'high', and 'house', hinting at the kinds of places people report hauntings. Worth a closer look: the geographic skew toward a few populous states, and the description field, which averages ~70 words per entry and is nearly all unique, making it a good candidate for text mining. Note that latitude and longitude each have ~11.5% nulls and roughly 1.3% outliers (124 and 131 rows respectively), including a minimum latitude of -45 that falls outside the US.

citing: row_count · column_count · country.top_rate · state.top_values · state.entropy_ratio · location.top_words · description.word_mean · description.n_unique · latitude.null_rate · latitude.min · longitude.n_outliers

Out[4]:

saturn.schema() · 12 columns

column	kind	n	null%	unique	alerts
city	text	10,992	0.0%	4,385	one_word short_text duplicates
country	categorical	10,992	0.0%	1	imbalance
description	text	10,992	0.0%	10,987	near_unique
location	text	10,992	0.0%	9,903
state	categorical	10,992	0.0%	51
state_abbrev	categorical	10,992	0.0%	51
longitude	numeric	10,992	11.5%	8,774
latitude	numeric	10,992	11.5%	8,775
city_longitude	numeric	10,992	0.3%	5,222
city_latitude	numeric	10,992	0.3%	5,310
location_2	text	10,992	11.5%	8,776	allcaps
city_location	text	10,992	0.3%	5,311	allcaps duplicates

Fig 1.

state · Shows which states report the most ghost sightings — California, Texas, and Pennsylvania dominate.

Top values for state (20 unique shown, of 51 total).
value	count	share
California	1070	9.7%
Texas	696	6.3%
Pennsylvania	649	5.9%
Michigan	529	4.8%
Ohio	477	4.3%
New York	459	4.2%
Illinois	395	3.6%
Kentucky	370	3.4%
Indiana	351	3.2%
Massachusetts	342	3.1%
Florida	328	3.0%
Missouri	314	2.9%
Georgia	289	2.6%
Wisconsin	274	2.5%
Alabama	224	2.0%
Tennessee	221	2.0%
Washington	218	2.0%
Oklahoma	211	1.9%
North Carolina	211	1.9%
New Jersey	194	1.8%

Fig 2.

location · Distribution of location-name lengths; most are short (median 17 chars) but a long tail goes up to 1,016.

Character-length distribution for location (mean: 19.2985712985713).
chars	count
2 – 27	9633
27 – 53	1280
53 – 78	57
78 – 103	7
103 – 129	0
129 – 154	0
154 – 179	0
179 – 205	2
205 – 230	2
230 – 256	0
256 – 281	2
281 – 306	0
306 – 332	0
332 – 357	1
357 – 382	0
382 – 408	1
408 – 433	0
433 – 458	1
458 – 484	0
484 – 509	1
509 – 534	0
534 – 560	0
560 – 585	0
585 – 610	0
610 – 636	0
636 – 661	0
661 – 686	0
686 – 712	0
712 – 737	0
737 – 762	0
762 – 788	0
788 – 813	0
813 – 839	0
839 – 864	0
864 – 889	0
889 – 915	0
915 – 940	1
940 – 965	0
965 – 991	0
991 – 1016	1
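The length table above has 40 rows, and its first edge (about 27) is consistent with splitting the range 2 to 1,016 into 40 equal-width bins ((1016 - 2) / 40 ≈ 25.35). A minimal sketch of that binning follows; the function name and the default of 40 bins are assumptions inferred from the table, not saturn's documented behaviour.

```python
def equal_width_bins(values, n_bins=40):
    """Return (edges, counts) for equal-width bins over a non-empty list.

    The maximum value is placed in the last bin rather than an extra one.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # When all values are equal, width is 0 and everything lands in bin 0.
        idx = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Toy lengths, not rows from the dataset.
edges, counts = equal_width_bins([2, 5, 17, 17, 30, 1016], n_bins=4)
print(edges)
print(counts)
```

With a long right tail like this one, most bins stay empty, which is why the table is dominated by zero counts; quantile-based or log-scaled bins are a common alternative.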

Fig 3.

latitude · Most sightings cluster in the continental US latitudes, but watch for outliers reaching as far south as -45.

Histogram bins for latitude (median: 39.2795837).
bin	count
-45.02 – -42.23	1
-42.23 – -39.43	0
-39.43 – -36.63	0
-36.63 – -33.83	1
-33.83 – -31.03	0
-31.03 – -28.24	0
-28.24 – -25.44	0
-25.44 – -22.64	0
-22.64 – -19.84	0
-19.84 – -17.04	0
-17.04 – -14.25	0
-14.25 – -11.45	0
-11.45 – -8.651	0
-8.651 – -5.853	0
-5.853 – -3.055	0
-3.055 – -0.2572	0
-0.2572 – 2.541	0
2.541 – 5.339	0
5.339 – 8.137	0
8.137 – 10.93	0
10.93 – 13.73	1
13.73 – 16.53	0
16.53 – 19.33	0
19.33 – 22.13	91
22.13 – 24.92	5
24.92 – 27.72	176
27.72 – 30.52	509
30.52 – 33.32	672
33.32 – 36.12	1609
36.12 – 38.91	1506
38.91 – 41.71	2549
41.71 – 44.51	1854
44.51 – 47.31	561
47.31 – 50.11	164
50.11 – 52.9	2
52.9 – 55.7	2
55.7 – 58.5	0
58.5 – 61.3	14
61.3 – 64.09	4
64.09 – 66.89	10

Fig 4.

longitude · Longitude spread reveals the east–west footprint of sightings, with a heavy concentration around -87.

Histogram bins for longitude (median: -87.23121549999999).
bin	count
-164.7 – -156.4	86
-156.4 – -148.1	22
-148.1 – -139.7	10
-139.7 – -131.4	3
-131.4 – -123	65
-123 – -114.7	1371
-114.7 – -106.4	436
-106.4 – -98.04	590
-98.04 – -89.7	1570
-89.7 – -81.37	2857
-81.37 – -73.03	2086
-73.03 – -64.7	625
-64.7 – -56.36	0
-56.36 – -48.03	0
-48.03 – -39.69	0
-39.69 – -31.35	0
-31.35 – -23.02	0
-23.02 – -14.68	0
-14.68 – -6.348	0
-6.348 – 1.987	2
1.987 – 10.32	0
10.32 – 18.66	0
18.66 – 26.99	0
26.99 – 35.33	0
35.33 – 43.66	1
43.66 – 52	0
52 – 60.34	0
60.34 – 68.67	0
68.67 – 77.01	1
77.01 – 85.34	3
85.34 – 93.68	0
93.68 – 102	0
102 – 110.3	0
110.3 – 118.7	0
118.7 – 127	0
127 – 135.4	0
135.4 – 143.7	2
143.7 – 152	0
152 – 160.4	0
160.4 – 168.7	1

Fig 5.

description · Description lengths are right-skewed: median ~297 characters but reaching up to 5,280 for the most detailed accounts.

Character-length distribution for description (mean: 380.03156841339154).
chars	count
2 – 134	1623
134 – 266	3145
266 – 398	2473
398 – 530	1528
530 – 662	810
662 – 794	509
794 – 926	303
926 – 1058	195
1058 – 1190	124
1190 – 1322	71
1322 – 1453	63
1453 – 1585	37
1585 – 1717	27
1717 – 1849	20
1849 – 1981	15
1981 – 2113	12
2113 – 2245	10
2245 – 2377	8
2377 – 2509	6
2509 – 2641	2
2641 – 2773	1
2773 – 2905	2
2905 – 3037	1
3037 – 3169	2
3169 – 3301	0
3301 – 3433	0
3433 – 3565	2
3565 – 3697	0
3697 – 3829	1
3829 – 3960	0
3960 – 4092	0
4092 – 4224	0
4224 – 4356	0
4356 – 4488	0
4488 – 4620	0
4620 – 4752	0
4752 – 4884	1
4884 – 5016	0
5016 – 5148	0
5148 – 5280	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
city	text	0.0%
country	categorical	0.0%
description	text	0.0%
location	text	0.0%
state	categorical	0.0%
state_abbrev	categorical	0.0%
longitude	numeric	11.5%
latitude	numeric	11.5%
city_longitude	numeric	0.3%
city_latitude	numeric	0.3%
location_2	text	11.5%
city_location	text	0.3%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
	longitude	latitude	city_longitude	city_latitude
longitude	+1.00	+0.27	+0.48	+0.18
latitude	+0.27	+1.00	+0.16	+0.53
city_longitude	+0.48	+0.16	+1.00	+0.30
city_latitude	+0.18	+0.53	+0.30	+1.00
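For readers who want to reproduce a cell of the matrix above, here is a minimal sketch of Pearson's r between two columns. It drops any pair where either value is missing (pairwise deletion), which is an assumption: the report does not say how saturn handles nulls, and this sketch does no sampling. The function name `pearson` and the toy values are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson r over pairs where both values are present (None = missing)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # r is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


# Toy values, not rows from the dataset.
lon = [-118.2, -95.4, None, -75.2, -87.6]
city_lon = [-118.3, -95.3, -80.0, -75.1, None]
print(round(pearson(lon, city_lon), 2))
```

Pearson's r is sensitive to extreme values, so a few distant points can pull a coefficient well away from what the bulk of the data would give.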

city text feature

Short place-name strings (mean 9 characters, 73% one-word) with familiar US city names like Los Angeles, San Antonio, and Houston dominating the top values. Heavy duplication (60%, 6,604 rows) is expected for a city field, and 4,385 unique values suggest broad geographic coverage. Top word frequencies ('city', 'county', 'san', 'st.', 'new', 'fort') confirm conventional US toponymy, with no emoji or boilerplate and a negligible URL rate (9.1e-05, about one row).

Treatment: Treat as a categorical/geographic feature; consider geocoding or grouping rare values before encoding.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["city"].stats

stat	value
n	10,992
nulls	3 (0.0%)
unique	4,385
len_min	3
len_max	49
len_mean	9.043
len_median	9
len_p95	14
word_mean	1.291
word_median	1
n_empty	0
n_duplicates	6,604
duplicate_rate	0.601
vocab_size	3,988
readability_flesch_mean	20.49
emoji_rate	0
url_rate	9.1e-05
one_word_rate	0.7323
allcaps_rate	0.000455
boilerplate_rate	0
alert: one_word	73.2% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	60.1% duplicate strings
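The duplicate figures in the stats above are consistent with a simple definition: n_duplicates is the non-null count minus the unique count, and duplicate_rate divides that by the non-null count. The sketch below assumes that definition (it reproduces the reported numbers, but it is not taken from saturn's source); `duplicate_stats` is a hypothetical helper name.

```python
def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate) over non-null values."""
    non_null = [v for v in values if v is not None]
    n_duplicates = len(non_null) - len(set(non_null))
    rate = n_duplicates / len(non_null) if non_null else 0.0
    return n_duplicates, rate


# Check against the reported city stats: n=10,992, nulls=3, unique=4,385.
non_null = 10_992 - 3
print(non_null - 4_385)                         # 6604
print(round((non_null - 4_385) / non_null, 3))  # 0.601
```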

Fig 8.

Character-length distribution for city.

Character-length distribution for city (mean: 9.042770042770043).
chars	count
3 – 4	239
4 – 5	615
5 – 6	1387
6 – 8	1495
8 – 9	1431
9 – 10	1548
10 – 11	2391
11 – 12	630
12 – 13	455
13 – 14	318
14 – 16	145
16 – 17	137
17 – 18	63
18 – 19	49
19 – 20	32
20 – 21	11
21 – 23	7
23 – 24	5
24 – 25	5
25 – 26	5
26 – 27	10
27 – 28	0
28 – 29	6
29 – 31	1
31 – 32	1
32 – 33	0
33 – 34	1
34 – 35	0
35 – 36	0
36 – 38	0
38 – 39	0
39 – 40	0
40 – 41	0
41 – 42	1
42 – 43	0
43 – 44	0
44 – 46	0
46 – 47	0
47 – 48	0
48 – 49	1

country categorical metadata

This column records country, but every one of the 10,992 rows is "United States" — cardinality is 1 and top_rate is 1.0. It carries zero information (entropy 0.0) and cannot discriminate between records.

Treatment: Drop; constant column with no signal.

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["country"].stats

stat	value
n	10,992
nulls	0 (0.0%)
unique	1
top_value	United States
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows
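The drop recommendation for a constant column can be automated. This is a minimal sketch, assuming rows are available as a list of dicts (for example from `csv.DictReader`); `constant_columns` is a hypothetical helper, and treating all-null columns as constant is a design choice made here.

```python
def constant_columns(rows):
    """Names of columns with at most one distinct non-null value."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            values = seen.setdefault(name, set())
            if value is not None:
                values.add(value)
    return [name for name, values in seen.items() if len(values) <= 1]


# Toy rows, not taken from the dataset.
rows = [
    {"country": "United States", "state": "Ohio"},
    {"country": "United States", "state": "Texas"},
]
drop = set(constant_columns(rows))
trimmed = [{k: v for k, v in row.items() if k not in drop} for row in rows]
print(sorted(drop), trimmed)
```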

Fig 9.

Top values for country.

Top values for country (1 unique shown, of 1 total).
value	count	share
United States	10992	100.0%

description text free_text

Free-text descriptions, averaging 70 words (median 55) and reaching up to 5,280 characters, with a Flesch readability of ~69.7 indicating fairly accessible prose. Nearly every row is unique (10,987 of 10,992) with only 5 duplicates and a vocabulary of 33,001 tokens, so this reads as bespoke long-form copy rather than templated text. Boilerplate (0.6%), URLs (0.5%), all-caps (0.06%) and emoji (0%) are all negligible, and the top words are common English stopwords — consistent with natural English narrative.

Treatment: Tokenize and embed before modelling; do not use as a categorical key.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["description"].stats

stat	value
n	10,992
nulls	0 (0.0%)
unique	10,987
len_min	2
len_max	5,280
len_mean	380
len_median	297
len_p95	954
word_mean	70.33
word_median	55
n_empty	0
n_duplicates	5
duplicate_rate	0.0004549
vocab_size	33,001
readability_flesch_mean	69.67
emoji_rate	0
url_rate	0.005095
one_word_rate	0.000182
allcaps_rate	0.0006368
boilerplate_rate	0.006459
alert: near_unique	100.0% of rows are unique strings
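As a first step toward the tokenization suggested in the treatment, the sketch below computes word counts and a vocabulary size from raw text. It assumes whitespace splitting for word counts and a lowercase alphanumeric pattern for vocabulary tokens; saturn's own tokenizer is not described in this report, so the numbers it produces need not match word_mean or vocab_size exactly.

```python
import re
from statistics import mean, median

# Assumed token pattern: lowercase letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def text_stats(texts):
    """Word-count summary and vocabulary size for a non-empty list of strings."""
    word_counts = [len(t.split()) for t in texts]
    vocab = set()
    for t in texts:
        vocab.update(TOKEN_RE.findall(t.lower()))
    return {
        "word_mean": mean(word_counts),
        "word_median": median(word_counts),
        "vocab_size": len(vocab),
    }


# Toy descriptions, not rows from the dataset.
print(text_stats(["A cold spot", "a ghost walks the hall"]))
```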

Fig 10.

Character-length distribution for description.

Character-length distribution for description (mean: 380.03156841339154).
chars	count
2 – 134	1623
134 – 266	3145
266 – 398	2473
398 – 530	1528
530 – 662	810
662 – 794	509
794 – 926	303
926 – 1058	195
1058 – 1190	124
1190 – 1322	71
1322 – 1453	63
1453 – 1585	37
1585 – 1717	27
1717 – 1849	20
1849 – 1981	15
1981 – 2113	12
2113 – 2245	10
2245 – 2377	8
2377 – 2509	6
2509 – 2641	2
2641 – 2773	1
2773 – 2905	2
2905 – 3037	1
3037 – 3169	2
3169 – 3301	0
3301 – 3433	0
3433 – 3565	2
3565 – 3697	0
3697 – 3829	1
3829 – 3960	0
3960 – 4092	0
4092 – 4224	0
4224 – 4356	0
4356 – 4488	0
4488 – 4620	0
4620 – 4752	0
4752 – 4884	1
4884 – 5016	0
5016 – 5148	0
5148 – 5280	1

location text free_text

Short free-text place names averaging 2.98 words and 19.3 characters, with values like 'Prince Georges county', 'Cemetery', 'Cry Baby Bridge', and 'Wal-Mart'. These are likely the named sites of a story or sighting (probably ghost lore or folklore, given the top entries). Vocabulary is dominated by 'school', 'cemetery', 'high', 'university', 'house', and 'road', suggesting a strong tilt toward institutional and roadside locations. High cardinality (9,903 unique of 10,992) coexists with a 9.9% duplicate rate, and one value ('Prince Georges county') appears 18 times, so most entries are unique strings but a set of landmarks repeats. Lengths are mostly short (median 17, 95th percentile 34 characters), though a few entries run up to 1,016 characters and may warrant inspection.

Treatment: Normalise casing/whitespace and geocode or entity-link to extract structured place features before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["location"].stats

stat	value
n	10,992
nulls	3 (0.0%)
unique	9,903
len_min	2
len_max	1,016
len_mean	19.3
len_median	17
len_p95	34
word_mean	2.976
word_median	3
n_empty	0
n_duplicates	1,086
duplicate_rate	0.09883
vocab_size	8,232
readability_flesch_mean	45.79
emoji_rate	0
url_rate	0
one_word_rate	0.06334
allcaps_rate	0.003367
boilerplate_rate	0
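The casing and whitespace normalisation suggested in the treatment might look like the sketch below. The specific steps (Unicode NFKC, collapsing runs of whitespace, case folding) are assumptions about what "normalise" should mean here, and `normalise_place` is a hypothetical helper name.

```python
import re
import unicodedata
from collections import Counter


def normalise_place(name):
    """Canonical form for matching: NFKC, single spaces, case-folded."""
    text = unicodedata.normalize("NFKC", name)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()


# Toy variants of values mentioned in the notes above.
names = ["Cry Baby Bridge", "cry baby  bridge ", "Wal-Mart", "WAL-MART"]
counts = Counter(normalise_place(n) for n in names)
print(counts.most_common(2))
```

Keeping the original string alongside the normalised key makes it possible to display the human-readable name while grouping on the canonical one.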

Fig 11.

Character-length distribution for location.

Character-length distribution for location (mean: 19.2985712985713).
chars	count
2 – 27	9633
27 – 53	1280
53 – 78	57
78 – 103	7
103 – 129	0
129 – 154	0
154 – 179	0
179 – 205	2
205 – 230	2
230 – 256	0
256 – 281	2
281 – 306	0
306 – 332	0
332 – 357	1
357 – 382	0
382 – 408	1
408 – 433	0
433 – 458	1
458 – 484	0
484 – 509	1
509 – 534	0
534 – 560	0
560 – 585	0
585 – 610	0
610 – 636	0
636 – 661	0
661 – 686	0
686 – 712	0
712 – 737	0
737 – 762	0
762 – 788	0
788 – 813	0
813 – 839	0
839 – 864	0
864 – 889	0
889 – 915	0
915 – 940	1
940 – 965	0
965 – 991	0
991 – 1016	1

state categorical feature

This is a US state field with 51 distinct values (likely the 50 states plus DC) across 10,992 rows and no nulls. The distribution is broad, with a high entropy ratio (0.916). California leads at 9.73% (1,070 rows), followed by Texas (696) and Pennsylvania (649). Counts roughly track population, with one notable exception: Pennsylvania ranks third while Florida sits only 11th (328 rows), outside the top 10.

Treatment: One-hot or target-encode for modelling; consider grouping low-frequency states.

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["state"].stats

stat	value
n	10,992
nulls	0 (0.0%)
unique	51
top_value	California
top_rate	0.09734
cardinality	51
entropy	5.194
entropy_ratio	0.9157
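The entropy and entropy_ratio values above are consistent with Shannon entropy in bits, normalised by the maximum possible entropy for the observed cardinality, log2(k). That reading is an assumption inferred from the numbers rather than from saturn's documentation; the sketch below implements it under a hypothetical function name.

```python
import math
from collections import Counter


def entropy_stats(values):
    """Return (entropy in bits, entropy / log2(cardinality))."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    k = len(counts)
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio


# Consistency check against the reported state stats (entropy 5.194, 51 values).
print(round(5.194 / math.log2(51), 4))  # 0.9157
```

A ratio near 1 means the values are spread almost evenly across categories; a constant column such as country yields 0.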

Fig 12.

Top values for state.

Top values for state (20 unique shown, of 51 total).
value	count	share
California	1070	9.7%
Texas	696	6.3%
Pennsylvania	649	5.9%
Michigan	529	4.8%
Ohio	477	4.3%
New York	459	4.2%
Illinois	395	3.6%
Kentucky	370	3.4%
Indiana	351	3.2%
Massachusetts	342	3.1%
Florida	328	3.0%
Missouri	314	2.9%
Georgia	289	2.6%
Wisconsin	274	2.5%
Alabama	224	2.0%
Tennessee	221	2.0%
Washington	218	2.0%
Oklahoma	211	1.9%
North Carolina	211	1.9%
New Jersey	194	1.8%

state_abbrev categorical feature

This is a US state abbreviation field with 51 distinct values (50 states plus presumably DC) and no nulls across 10,992 rows. Distribution is broadly spread (entropy ratio 0.916) with CA leading at 9.7% (1,070 rows), followed by TX (696) and PA (649), consistent with population-weighted geographic data.

Treatment: one-hot or target-encode for modelling; safe to use as-is for grouping or joins.

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["state_abbrev"].stats

stat	value
n	10,992
nulls	0 (0.0%)
unique	51
top_value	CA
top_rate	0.09734
cardinality	51
entropy	5.194
entropy_ratio	0.9157
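The encoding advice for the state columns (group low-frequency values, then one-hot encode) can be sketched in plain Python as follows. The threshold `min_count`, the label "Other", and both function names are illustrative assumptions; a real pipeline would more likely use a dataframe or ML library for this.

```python
from collections import Counter


def group_rare(values, min_count, other="Other"):
    """Replace categories seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def one_hot(values):
    """Return (sorted categories, one indicator row per input value)."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


# Toy abbreviations, not counts from the dataset.
grouped = group_rare(["CA", "CA", "TX", "TX", "WY"], min_count=2)
categories, rows = one_hot(grouped)
print(categories)
print(rows)
```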

Fig 13.

Top values for state_abbrev.

Top values for state_abbrev (20 unique shown, of 51 total).
value	count	share
CA	1070	9.7%
TX	696	6.3%
PA	649	5.9%
MI	529	4.8%
OH	477	4.3%
NY	459	4.2%
IL	395	3.6%
KY	370	3.4%
IN	351	3.2%
MA	342	3.1%
FL	328	3.0%
MO	314	2.9%
GA	289	2.6%
WI	274	2.5%
AL	224	2.0%
TN	221	2.0%
WA	218	2.0%
OK	211	1.9%
NC	211	1.9%
NJ	194	1.8%

longitude numeric feature

Geographic longitude coordinates, with 8,774 unique values across 10,992 rows and an 11.47% null rate. Values span -164.72 to 168.70 but cluster tightly between Q1 -99.12 and Q3 -80.30, indicating most records sit in North America despite the global range. Kurtosis of 15.2 reflects heavy tails from the 131 outliers (1.35%). The histogram shows 10 rows east of -6.35, near the prime meridian or in the Eastern Hemisphere; given that every record is labelled United States, these likely reflect geocoding errors.

Treatment: Pair with latitude for geospatial features; impute or filter the 11.47% nulls before mapping.

anthropic:claude-opus-4-7 · confidence high

Out[31]:

saturn.columns["longitude"].stats

stat	value
n	10,992
nulls	1,261 (11.5%)
unique	8,774
min	-164.7
max	168.7
mean	-92
median	-87.23
std	17.69
q1	-99.12
q3	-80.3
iqr	18.82
skew	0.2512
kurtosis	15.2
n_outliers	131
outlier_rate	0.01346
zero_rate	0
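One common way to arrive at an outlier count like n_outliers is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles. Whether saturn uses this rule is an assumption; the sketch below only shows what the fences would be for the quartiles reported above.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if value lies strictly outside the fences."""
    low, high = iqr_fences(q1, q3, k)
    return value < low or value > high


# Quartiles as reported for longitude: q1 = -99.12, q3 = -80.3.
low, high = iqr_fences(-99.12, -80.3)
print(round(low, 2), round(high, 2))  # -127.35 -52.07
```

Under these fences, any positive longitude would be flagged, and so would values far enough west to fall in the Alaska and Hawaii range, so a flagged row is not necessarily an error.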

Fig 14.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: -87.23121549999999).
bin	count
-164.7 – -156.4	86
-156.4 – -148.1	22
-148.1 – -139.7	10
-139.7 – -131.4	3
-131.4 – -123	65
-123 – -114.7	1371
-114.7 – -106.4	436
-106.4 – -98.04	590
-98.04 – -89.7	1570
-89.7 – -81.37	2857
-81.37 – -73.03	2086
-73.03 – -64.7	625
-64.7 – -56.36	0
-56.36 – -48.03	0
-48.03 – -39.69	0
-39.69 – -31.35	0
-31.35 – -23.02	0
-23.02 – -14.68	0
-14.68 – -6.348	0
-6.348 – 1.987	2
1.987 – 10.32	0
10.32 – 18.66	0
18.66 – 26.99	0
26.99 – 35.33	0
35.33 – 43.66	1
43.66 – 52	0
52 – 60.34	0
60.34 – 68.67	0
68.67 – 77.01	1
77.01 – 85.34	3
85.34 – 93.68	0
93.68 – 102	0
102 – 110.3	0
110.3 – 118.7	0
118.7 – 127	0
127 – 135.4	0
135.4 – 143.7	2
143.7 – 152	0
152 – 160.4	0
160.4 – 168.7	1

latitude numeric feature

Geographic latitude in decimal degrees, ranging from -45.02 to 66.89 with a median of 39.28 and IQR of 7.20, suggesting most observations cluster in the northern mid-latitudes. The distribution is left-skewed (-0.97) with heavy tails (kurtosis 11.38) and 124 outliers (1.27%), consistent with a few southern-hemisphere points dragging the tail. Note the 11.47% null rate, which will need handling before any spatial analysis.

Treatment: Pair with longitude for geospatial features; impute or drop the ~11% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["latitude"].stats

stat	value
n	10,992
nulls	1,261 (11.5%)
unique	8,775
min	-45.02
max	66.89
mean	38.34
median	39.28
std	5.259
q1	34.68
q3	41.87
iqr	7.197
skew	-0.971
kurtosis	11.38
n_outliers	124
outlier_rate	0.01274
zero_rate	0
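Because every record is labelled United States, a coarse bounding-box check is one way to separate implausible coordinates (such as the minimum latitude of -45.02) from legitimate Alaska and Hawaii points. The bounds below are rough, illustrative assumptions, not values from the report: they ignore US territories and the part of the Aleutian chain that crosses the 180th meridian.

```python
# Rough, illustrative bounds covering the 50 states (assumption).
US_BOUNDS = {"lat": (18.0, 72.0), "lon": (-180.0, -66.0)}


def in_bounds(lat, lon, bounds=US_BOUNDS):
    """True if both coordinates are present and inside the box."""
    if lat is None or lon is None:
        return False
    lat_lo, lat_hi = bounds["lat"]
    lon_lo, lon_hi = bounds["lon"]
    return lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi


# Toy coordinate pairs built from extremes mentioned in the report.
points = [(39.28, -87.23), (-45.02, 168.7), (66.89, -164.7), (None, None)]
print([in_bounds(lat, lon) for lat, lon in points])
```

Rows that fail the check are candidates for review or for falling back to the city-level coordinates, rather than automatic deletion.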

Fig 15.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 39.2795837).
bin	count
-45.02 – -42.23	1
-42.23 – -39.43	0
-39.43 – -36.63	0
-36.63 – -33.83	1
-33.83 – -31.03	0
-31.03 – -28.24	0
-28.24 – -25.44	0
-25.44 – -22.64	0
-22.64 – -19.84	0
-19.84 – -17.04	0
-17.04 – -14.25	0
-14.25 – -11.45	0
-11.45 – -8.651	0
-8.651 – -5.853	0
-5.853 – -3.055	0
-3.055 – -0.2572	0
-0.2572 – 2.541	0
2.541 – 5.339	0
5.339 – 8.137	0
8.137 – 10.93	0
10.93 – 13.73	1
13.73 – 16.53	0
16.53 – 19.33	0
19.33 – 22.13	91
22.13 – 24.92	5
24.92 – 27.72	176
27.72 – 30.52	509
30.52 – 33.32	672
33.32 – 36.12	1609
36.12 – 38.91	1506
38.91 – 41.71	2549
41.71 – 44.51	1854
44.51 – 47.31	561
47.31 – 50.11	164
50.11 – 52.9	2
52.9 – 55.7	2
55.7 – 58.5	0
58.5 – 61.3	14
61.3 – 64.09	4
64.09 – 66.89	10

city_longitude numeric feature

Longitude of a city, with all 10,963 non-null values negative (min -164.72, max -67.84), placing every record in the Western Hemisphere and consistent with North American coverage. The distribution is left-skewed (skew -1.16): the mean (-91.91) is pulled west of the median (-87.09), suggesting a concentration in the central/eastern US with a tail reaching toward Alaska or the Pacific. 5,222 unique values across 10,992 rows indicate many records share the same city, and 128 outliers (1.17%) likely correspond to far-western locations.

Treatment: Use as a geospatial feature alongside latitude; consider clustering or binning rather than treating as a plain scalar.

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["city_longitude"].stats

stat	value
n	10,992
nulls	29 (0.3%)
unique	5,222
min	-164.7
max	-67.84
mean	-91.91
median	-87.09
std	16.4
q1	-98.49
q3	-80.51
iqr	17.99
skew	-1.155
kurtosis	1.328
n_outliers	128
outlier_rate	0.01168
zero_rate	0
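The binning suggested in the treatment can be as simple as snapping each coordinate pair to a grid cell and counting rows per cell. The sketch below is one such approach under assumed names; the 1-degree default cell size is arbitrary, and cells cover less ground area at higher latitudes.

```python
import math
from collections import Counter


def grid_cell(lat, lon, size_deg=1.0):
    """Integer (row, col) index of the grid cell containing the point."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


# Toy city coordinates, not rows from the dataset.
points = [(34.05, -118.24), (34.10, -118.30), (41.88, -87.63)]
cell_counts = Counter(grid_cell(lat, lon) for lat, lon in points)
print(cell_counts)
```

Using `math.floor` rather than `int` matters for negative longitudes, since `int` truncates toward zero and would merge cells on either side of 0.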

Fig 16.

Distribution of city_longitude. Vertical dash marks the median.

Histogram bins for city_longitude (median: -87.0902772).
bin	count
-164.7 – -162.3	2
-162.3 – -159.9	0
-159.9 – -157.5	83
-157.5 – -155	14
-155 – -152.6	0
-152.6 – -150.2	3
-150.2 – -147.8	13
-147.8 – -145.3	5
-145.3 – -142.9	4
-142.9 – -140.5	0
-140.5 – -138.1	0
-138.1 – -135.7	0
-135.7 – -133.2	2
-133.2 – -130.8	2
-130.8 – -128.4	0
-128.4 – -126	0
-126 – -123.5	33
-123.5 – -121.1	529
-121.1 – -118.7	249
-118.7 – -116.3	661
-116.3 – -113.9	83
-113.9 – -111.4	245
-111.4 – -109	92
-109 – -106.6	96
-106.6 – -104.2	260
-104.2 – -101.7	93
-101.7 – -99.33	159
-99.33 – -96.91	530
-96.91 – -94.48	545
-94.48 – -92.06	492
-92.06 – -89.64	447
-89.64 – -87.22	792
-87.22 – -84.79	877
-84.79 – -82.37	1252
-82.37 – -79.95	912
-79.95 – -77.53	420
-77.53 – -75.11	825
-75.11 – -72.68	649
-72.68 – -70.26	548
-70.26 – -67.84	46

city_latitude numeric feature

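This column holds city latitudes in decimal degrees, ranging from 19.58 to 66.90 with a median of 39.28 and an IQR spanning 34.74 to 41.85. Values cluster firmly in the northern mid-latitudes (mean 38.38, std 5.07), with mild left skew (-0.32) and 128 outliers (1.17%). The histogram tails account for exactly that count: 96 rows below about 23.1 (latitudes consistent with Hawaii) and 32 rows above about 55 (consistent with Alaska), which suggests the outliers sit at both extremes rather than only in far-northern locales. Cardinality is moderate (5,310 unique of 10,992) and nulls are negligible (0.26%).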

Treatment: Pair with longitude for geospatial features; avoid treating as a standalone scalar.

anthropic:claude-opus-4-7 · confidence high

Out[40]:

saturn.columns["city_latitude"].stats

stat	value
n	10,992
nulls	29 (0.3%)
unique	5,310
min	19.58
max	66.9
mean	38.38
median	39.28
std	5.067
q1	34.74
q3	41.85
iqr	7.109
skew	-0.325
kurtosis	1.478
n_outliers	128
outlier_rate	0.01168
zero_rate	0
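One way to pair latitude with longitude, as the treatment suggests, is to compute the great-circle distance between a sighting's coordinates and its city centroid, which doubles as a sanity check on the geocoding. The sketch below uses the standard haversine formula on a spherical Earth; using it for this dataset is a suggestion, not something the report does.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Toy pair: a sighting location and a nearby city centroid.
print(round(haversine_km(34.10, -118.30, 34.05, -118.24), 1))
```

A sighting hundreds of kilometres from its own city centroid is a strong hint that one of the two coordinate pairs was geocoded incorrectly.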

Fig 17.

Distribution of city_latitude. Vertical dash marks the median.

Histogram bins for city_latitude (median: 39.2833968).
bin	count
19.58 – 20.76	5
20.76 – 21.94	90
21.94 – 23.12	1
23.12 – 24.31	0
24.31 – 25.49	7
25.49 – 26.67	106
26.67 – 27.86	114
27.86 – 29.04	145
29.04 – 30.22	256
30.22 – 31.41	222
31.41 – 32.59	369
32.59 – 33.77	569
33.77 – 34.96	988
34.96 – 36.14	560
36.14 – 37.32	666
37.32 – 38.5	703
38.5 – 39.69	1004
39.69 – 40.87	1341
40.87 – 42.05	1276
42.05 – 43.24	1258
43.24 – 44.42	401
44.42 – 45.6	387
45.6 – 46.79	184
46.79 – 47.97	220
47.97 – 49.15	59
49.15 – 50.34	0
50.34 – 51.52	0
51.52 – 52.7	0
52.7 – 53.88	0
53.88 – 55.07	0
55.07 – 56.25	2
56.25 – 57.43	1
57.43 – 58.62	0
58.62 – 59.8	1
59.8 – 60.98	5
60.98 – 62.17	13
62.17 – 63.35	1
63.35 – 64.53	0
64.53 – 65.72	6
65.72 – 66.9	3

location_2 text feature

This column holds WKT geometry strings in the form `POINT(lon lat)`, with 8776 unique values across 10992 rows and an 11.47% null rate. Values are tightly bounded (length 22-43, always two 'words') and 9.81% are duplicates, with the most repeated coordinate appearing 20 times — suggesting clusters at recurring sites. The 'allcaps' alert is a false positive driven by the literal `POINT` prefix rather than natural language.

Treatment: Parse into numeric longitude/latitude pair before any geospatial modelling.

anthropic:claude-opus-4-7 · confidence high

Out[43]:

saturn.columns["location_2"].stats

stat	value
n	10,992
nulls	1,261 (11.5%)
unique	8,776
len_min	22
len_max	43
len_mean	30.81
len_median	29
len_p95	36
word_mean	2
word_median	2
n_empty	0
n_duplicates	955
duplicate_rate	0.09814
vocab_size	17,549
readability_flesch_mean	120.2
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	1
boilerplate_rate	0
alert: allcaps	100.0% rows are all-caps
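Parsing the POINT strings back into numbers, as the treatment recommends, can be done with a small regular expression. This is a minimal sketch assuming plain decimal coordinates in longitude-then-latitude order, as described above; it accepts an optional space after `POINT` and returns None for nulls or anything that does not match. A dedicated geometry library would be a more robust choice for general WKT.

```python
import re

# Assumed format: POINT(<lon> <lat>) with plain decimal numbers.
POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_point(wkt):
    """Return (longitude, latitude) as floats, or None if unparseable."""
    if wkt is None:
        return None
    match = POINT_RE.match(wkt)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Toy strings in the described format, not rows from the dataset.
for raw in ["POINT(-118.24 34.05)", "POINT (-87.6 41.9)", "Cemetery", None]:
    print(raw, "->", parse_point(raw))
```

Comparing the parsed pair with the existing longitude and latitude columns is a quick way to confirm whether the string column carries any information beyond them.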

Fig 18.

Character-length distribution for location_2.

Character-length distribution for location_2 (mean: 30.80896105230706).
chars	count
22 – 23	1
23 – 23	8
23 – 24	0
24 – 24	2
24 – 25	0
25 – 25	20
25 – 26	0
26 – 26	104
26 – 27	0
27 – 27	679
27 – 28	0
28 – 28	1007
28 – 29	0
29 – 29	3740
29 – 30	0
30 – 30	1567
30 – 31	0
31 – 31	0
31 – 32	0
32 – 32	1
32 – 33	13
33 – 34	0
34 – 34	54
34 – 35	0
35 – 35	453
35 – 36	0
36 – 36	1807
36 – 37	0
37 – 37	114
37 – 38	0
38 – 38	0
38 – 39	0
39 – 39	0
39 – 40	0
40 – 40	0
40 – 41	0
41 – 41	0
41 – 42	0
42 – 42	0
42 – 43	161

city_location text feature

Despite the name, city_location stores WKT POINT(longitude latitude) geometry strings, not city names: every non-null value matches that format (allcaps_rate 1.0, word_mean 2.0). Roughly half the rows are repeats (duplicate_rate 0.516, n_duplicates 5,652), with the top coordinate (Los Angeles, 34.05/-118.24) appearing 61 times, suggesting a finite set of city centroids reused across records. Cardinality is still high (5,311 unique of 10,992) and nulls are negligible (0.26%).

Treatment: Parse the POINT() strings into numeric lat/long pairs before any geospatial modelling.

anthropic:claude-opus-4-7 · confidence high

Out[46]:

saturn.columns["city_location"].stats

stat	value
n	10,992
nulls	29 (0.3%)
unique	5,311
len_min	22
len_max	43
len_mean	31.19
len_median	29
len_p95	36
word_mean	2
word_median	2
n_empty	0
n_duplicates	5,652
duplicate_rate	0.5156
vocab_size	10,532
readability_flesch_mean	120.2
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	1
boilerplate_rate	0
alert: allcaps	100.0% rows are all-caps
alert: duplicates	51.6% duplicate strings

Fig 19.

Character-length distribution for city_location.

Character-length distribution for city_location (mean: 31.192739213718873).
chars	count
22 – 23	2
23 – 23	0
23 – 24	0
24 – 24	1
24 – 25	0
25 – 25	2
25 – 26	0
26 – 26	45
26 – 27	0
27 – 27	158
27 – 28	0
28 – 28	890
28 – 29	0
29 – 29	4648
29 – 30	0
30 – 30	1988
30 – 31	0
31 – 31	0
31 – 32	0
32 – 32	1
32 – 33	1
33 – 34	0
34 – 34	21
34 – 35	0
35 – 35	238
35 – 36	0
36 – 36	2711
36 – 37	0
37 – 37	116
37 – 38	0
38 – 38	0
38 – 39	0
39 – 39	0
39 – 40	0
40 – 40	0
40 – 41	0
41 – 41	0
41 – 42	0
42 – 42	0
42 – 43	141
