saturn·

wild ghost sightings

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/ghost_sightings.csv

Saturn profiled 10,992 rows across 12 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/ghost_sightings.csv",
    "--findings", "wild-ghost_sightings.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset catalogs 10,992 reported ghost sightings across the United States. Each record gives a location (city, state, latitude/longitude) and a free-text description of the sighting. Every record is labelled United States (country has only 1 unique value), and California, Texas, and Pennsylvania lead the state counts; California alone holds about 9.7% of all sightings. The location column is rich in thematic words like 'school', 'cemetery', 'high', and 'house', hinting at the kinds of places where people report hauntings. Worth a closer look: the geographic skew toward a few populous states, and the description field, which averages ~70 words per entry and is nearly all unique, making it a good candidate for text mining. Note that latitude and longitude each have ~11.5% nulls and about 1.3% outliers (124 and 131 rows respectively), including a minimum latitude of -45 that falls outside the US.

citing: row_count · column_count · country.top_rate · state.top_values · state.entropy_ratio · location.top_words · description.word_mean · description.n_unique · latitude.null_rate · latitude.min · longitude.n_outliers

Out[4]:

saturn.schema() · 12 columns

column kind n null% unique alerts
city text 10,992 0.0% 4,385 one_word short_text duplicates
country categorical 10,992 0.0% 1 imbalance
description text 10,992 0.0% 10,987 near_unique
location text 10,992 0.0% 9,903
state categorical 10,992 0.0% 51
state_abbrev categorical 10,992 0.0% 51
longitude numeric 10,992 11.5% 8,774
latitude numeric 10,992 11.5% 8,775
city_longitude numeric 10,992 0.3% 5,222
city_latitude numeric 10,992 0.3% 5,310
location_2 text 10,992 11.5% 8,776 allcaps
city_location text 10,992 0.3% 5,311 allcaps duplicates
Fig 1.
state · Shows which states report the most ghost sightings — California, Texas, and Pennsylvania dominate.
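Top values for state (20 unique shown, of 51 total).

| value | count | share |
| --- | --- | --- |
| California | 1070 | 9.7% |
| Texas | 696 | 6.3% |
| Pennsylvania | 649 | 5.9% |
| Michigan | 529 | 4.8% |
| Ohio | 477 | 4.3% |
| New York | 459 | 4.2% |
| Illinois | 395 | 3.6% |
| Kentucky | 370 | 3.4% |
| Indiana | 351 | 3.2% |
| Massachusetts | 342 | 3.1% |
| Florida | 328 | 3.0% |
| Missouri | 314 | 2.9% |
| Georgia | 289 | 2.6% |
| Wisconsin | 274 | 2.5% |
| Alabama | 224 | 2.0% |
| Tennessee | 221 | 2.0% |
| Washington | 218 | 2.0% |
| Oklahoma | 211 | 1.9% |
| North Carolina | 211 | 1.9% |
| New Jersey | 194 | 1.8% |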
Fig 2.
location · Distribution of location-name lengths; most are short (median 17 chars) but a long tail goes up to 1,016.
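Character-length distribution for location (mean: 19.2985712985713).

| chars | count |
| --- | --- |
| 2 – 27 | 9633 |
| 27 – 53 | 1280 |
| 53 – 78 | 57 |
| 78 – 103 | 7 |
| 103 – 129 | 0 |
| 129 – 154 | 0 |
| 154 – 179 | 0 |
| 179 – 205 | 2 |
| 205 – 230 | 2 |
| 230 – 256 | 0 |
| 256 – 281 | 2 |
| 281 – 306 | 0 |
| 306 – 332 | 0 |
| 332 – 357 | 1 |
| 357 – 382 | 0 |
| 382 – 408 | 1 |
| 408 – 433 | 0 |
| 433 – 458 | 1 |
| 458 – 484 | 0 |
| 484 – 509 | 1 |
| 509 – 534 | 0 |
| 534 – 560 | 0 |
| 560 – 585 | 0 |
| 585 – 610 | 0 |
| 610 – 636 | 0 |
| 636 – 661 | 0 |
| 661 – 686 | 0 |
| 686 – 712 | 0 |
| 712 – 737 | 0 |
| 737 – 762 | 0 |
| 762 – 788 | 0 |
| 788 – 813 | 0 |
| 813 – 839 | 0 |
| 839 – 864 | 0 |
| 864 – 889 | 0 |
| 889 – 915 | 0 |
| 915 – 940 | 1 |
| 940 – 965 | 0 |
| 965 – 991 | 0 |
| 991 – 1016 | 1 |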
Fig 3.
latitude · Most sightings cluster in the continental US latitudes, but watch for outliers reaching as far south as -45.
Histogram bins for latitude (median: 39.2795837).

| bin | count |
| --- | --- |
| -45.02 – -42.23 | 1 |
| -42.23 – -39.43 | 0 |
| -39.43 – -36.63 | 0 |
| -36.63 – -33.83 | 1 |
| -33.83 – -31.03 | 0 |
| -31.03 – -28.24 | 0 |
| -28.24 – -25.44 | 0 |
| -25.44 – -22.64 | 0 |
| -22.64 – -19.84 | 0 |
| -19.84 – -17.04 | 0 |
| -17.04 – -14.25 | 0 |
| -14.25 – -11.45 | 0 |
| -11.45 – -8.651 | 0 |
| -8.651 – -5.853 | 0 |
| -5.853 – -3.055 | 0 |
| -3.055 – -0.2572 | 0 |
| -0.2572 – 2.541 | 0 |
| 2.541 – 5.339 | 0 |
| 5.339 – 8.137 | 0 |
| 8.137 – 10.93 | 0 |
| 10.93 – 13.73 | 1 |
| 13.73 – 16.53 | 0 |
| 16.53 – 19.33 | 0 |
| 19.33 – 22.13 | 91 |
| 22.13 – 24.92 | 5 |
| 24.92 – 27.72 | 176 |
| 27.72 – 30.52 | 509 |
| 30.52 – 33.32 | 672 |
| 33.32 – 36.12 | 1609 |
| 36.12 – 38.91 | 1506 |
| 38.91 – 41.71 | 2549 |
| 41.71 – 44.51 | 1854 |
| 44.51 – 47.31 | 561 |
| 47.31 – 50.11 | 164 |
| 50.11 – 52.9 | 2 |
| 52.9 – 55.7 | 2 |
| 55.7 – 58.5 | 0 |
| 58.5 – 61.3 | 14 |
| 61.3 – 64.09 | 4 |
| 64.09 – 66.89 | 10 |
Fig 4.
longitude · Longitude spread reveals the east–west footprint of sightings, with a heavy concentration around -87.
Histogram bins for longitude (median: -87.23121549999999).

| bin | count |
| --- | --- |
| -164.7 – -156.4 | 86 |
| -156.4 – -148.1 | 22 |
| -148.1 – -139.7 | 10 |
| -139.7 – -131.4 | 3 |
| -131.4 – -123 | 65 |
| -123 – -114.7 | 1371 |
| -114.7 – -106.4 | 436 |
| -106.4 – -98.04 | 590 |
| -98.04 – -89.7 | 1570 |
| -89.7 – -81.37 | 2857 |
| -81.37 – -73.03 | 2086 |
| -73.03 – -64.7 | 625 |
| -64.7 – -56.36 | 0 |
| -56.36 – -48.03 | 0 |
| -48.03 – -39.69 | 0 |
| -39.69 – -31.35 | 0 |
| -31.35 – -23.02 | 0 |
| -23.02 – -14.68 | 0 |
| -14.68 – -6.348 | 0 |
| -6.348 – 1.987 | 2 |
| 1.987 – 10.32 | 0 |
| 10.32 – 18.66 | 0 |
| 18.66 – 26.99 | 0 |
| 26.99 – 35.33 | 0 |
| 35.33 – 43.66 | 1 |
| 43.66 – 52 | 0 |
| 52 – 60.34 | 0 |
| 60.34 – 68.67 | 0 |
| 68.67 – 77.01 | 1 |
| 77.01 – 85.34 | 3 |
| 85.34 – 93.68 | 0 |
| 93.68 – 102 | 0 |
| 102 – 110.3 | 0 |
| 110.3 – 118.7 | 0 |
| 118.7 – 127 | 0 |
| 127 – 135.4 | 0 |
| 135.4 – 143.7 | 2 |
| 143.7 – 152 | 0 |
| 152 – 160.4 | 0 |
| 160.4 – 168.7 | 1 |
Fig 5.
description · Description lengths are right-skewed: median ~297 characters but reaching up to 5,280 for the most detailed accounts.
Character-length distribution for description (mean: 380.03156841339154).

| chars | count |
| --- | --- |
| 2 – 134 | 1623 |
| 134 – 266 | 3145 |
| 266 – 398 | 2473 |
| 398 – 530 | 1528 |
| 530 – 662 | 810 |
| 662 – 794 | 509 |
| 794 – 926 | 303 |
| 926 – 1058 | 195 |
| 1058 – 1190 | 124 |
| 1190 – 1322 | 71 |
| 1322 – 1453 | 63 |
| 1453 – 1585 | 37 |
| 1585 – 1717 | 27 |
| 1717 – 1849 | 20 |
| 1849 – 1981 | 15 |
| 1981 – 2113 | 12 |
| 2113 – 2245 | 10 |
| 2245 – 2377 | 8 |
| 2377 – 2509 | 6 |
| 2509 – 2641 | 2 |
| 2641 – 2773 | 1 |
| 2773 – 2905 | 2 |
| 2905 – 3037 | 1 |
| 3037 – 3169 | 2 |
| 3169 – 3301 | 0 |
| 3301 – 3433 | 0 |
| 3433 – 3565 | 2 |
| 3565 – 3697 | 0 |
| 3697 – 3829 | 1 |
| 3829 – 3960 | 0 |
| 3960 – 4092 | 0 |
| 4092 – 4224 | 0 |
| 4224 – 4356 | 0 |
| 4356 – 4488 | 0 |
| 4488 – 4620 | 0 |
| 4620 – 4752 | 0 |
| 4752 – 4884 | 1 |
| 4884 – 5016 | 0 |
| 5016 – 5148 | 0 |
| 5148 – 5280 | 1 |
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.

| column | kind | null % |
| --- | --- | --- |
| city | text | 0.0% |
| country | categorical | 0.0% |
| description | text | 0.0% |
| location | text | 0.0% |
| state | categorical | 0.0% |
| state_abbrev | categorical | 0.0% |
| longitude | numeric | 11.5% |
| latitude | numeric | 11.5% |
| city_longitude | numeric | 0.3% |
| city_latitude | numeric | 0.3% |
| location_2 | text | 11.5% |
| city_location | text | 0.3% |
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 4 numeric columns (values clipped to 2 decimals).

| | longitude | latitude | city_longitude | city_latitude |
| --- | --- | --- | --- | --- |
| longitude | +1.00 | +0.27 | +0.48 | +0.18 |
| latitude | +0.27 | +1.00 | +0.16 | +0.53 |
| city_longitude | +0.48 | +0.16 | +1.00 | +0.30 |
| city_latitude | +0.18 | +0.53 | +0.30 | +1.00 |
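
For readers who want to recompute a coefficient like those in Fig 7, here is a minimal sketch of Pearson correlation with pairwise deletion of nulls. The function name `pearson` and the toy coordinate lists are illustrative assumptions. Saturn's own sampling and bounding are not described in this report and may give different values.

```python
import math

def pearson(xs, ys):
    """Pearson r over pairs where both values are present (None = null)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Toy values, not rows from the dataset.
longitude = [-118.2, -95.4, None, -75.2, -87.6]
city_longitude = [-118.3, -95.3, -80.0, -75.1, None]
print(round(pearson(longitude, city_longitude), 2))
```

Dropping incomplete pairs matters here because longitude and latitude are about 11.5% null while the city-level columns are almost complete.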

city text feature

Short place-name strings (mean 9 characters, 73% one-word) with familiar US city names like Los Angeles, San Antonio, and Houston dominating the top values. Heavy duplication (60%, 6,604 rows) is expected for a city field, and 4,385 unique values suggest broad geographic coverage. Top word frequencies ('city', 'county', 'san', 'st.', 'new', 'fort') confirm conventional US toponymy, with no emoji or boilerplate and a negligible URL rate (9.1e-05).

Treatment: Treat as a categorical/geographic feature; consider geocoding or grouping rare values before encoding.

anthropic:claude-opus-4-7 · confidence high
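
The suggestion to group rare values before encoding can be sketched as follows. This is one possible approach, not part of Saturn: the helper `group_rare`, the threshold `min_count=5`, and the `"OTHER"` label are all hypothetical choices, and the city list is invented.

```python
from collections import Counter

def group_rare(values, min_count=5, other_label="OTHER"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other_label
        for v in values
    ]

# Invented sample: two frequent cities, two rare ones, one null.
cities = ["Houston"] * 5 + ["Los Angeles"] * 6 + ["Smallville", "Tinytown", None]
grouped = group_rare(cities, min_count=5)
print(Counter(grouped))
```

Nulls are passed through unchanged so that they can be handled separately from genuinely rare cities.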
Out[13]:

saturn.columns["city"].stats

stat value
n 10,992
nulls 3 (0.0%)
unique 4,385
len_min 3
len_max 49
len_mean 9.043
len_median 9
len_p95 14
word_mean 1.291
word_median 1
n_empty 0
n_duplicates 6,604
duplicate_rate 0.601
vocab_size 3,988
readability_flesch_mean 20.49
emoji_rate 0
url_rate 9.1e-05
one_word_rate 0.7323
allcaps_rate 0.000455
boilerplate_rate 0
alert: one_word · 73.2% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 60.1% duplicate strings
Fig 8.
Character-length distribution for city.
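Character-length distribution for city (mean: 9.042770042770043).

| chars | count |
| --- | --- |
| 3 – 4 | 239 |
| 4 – 5 | 615 |
| 5 – 6 | 1387 |
| 6 – 8 | 1495 |
| 8 – 9 | 1431 |
| 9 – 10 | 1548 |
| 10 – 11 | 2391 |
| 11 – 12 | 630 |
| 12 – 13 | 455 |
| 13 – 14 | 318 |
| 14 – 16 | 145 |
| 16 – 17 | 137 |
| 17 – 18 | 63 |
| 18 – 19 | 49 |
| 19 – 20 | 32 |
| 20 – 21 | 11 |
| 21 – 23 | 7 |
| 23 – 24 | 5 |
| 24 – 25 | 5 |
| 25 – 26 | 5 |
| 26 – 27 | 10 |
| 27 – 28 | 0 |
| 28 – 29 | 6 |
| 29 – 31 | 1 |
| 31 – 32 | 1 |
| 32 – 33 | 0 |
| 33 – 34 | 1 |
| 34 – 35 | 0 |
| 35 – 36 | 0 |
| 36 – 38 | 0 |
| 38 – 39 | 0 |
| 39 – 40 | 0 |
| 40 – 41 | 0 |
| 41 – 42 | 1 |
| 42 – 43 | 0 |
| 43 – 44 | 0 |
| 44 – 46 | 0 |
| 46 – 47 | 0 |
| 47 – 48 | 0 |
| 48 – 49 | 1 |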

country categorical metadata

This column records country, but every one of the 10,992 rows is "United States" — cardinality is 1 and top_rate is 1.0. It carries zero information (entropy 0.0) and cannot discriminate between records.

Treatment: Drop; constant column with no signal.

anthropic:claude-opus-4-7 · confidence high
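
A constant column such as country can be detected and dropped mechanically. The sketch below assumes rows are held as plain dictionaries; `constant_columns` is a hypothetical helper and the two rows are invented.

```python
def constant_columns(rows):
    """Return names of columns whose non-null values are all identical."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            if value is not None:
                seen.setdefault(name, set()).add(value)
    # Columns that never hold a non-null value are not reported here.
    return sorted(name for name, values in seen.items() if len(values) == 1)

# Invented rows for illustration.
rows = [
    {"country": "United States", "state": "Texas"},
    {"country": "United States", "state": "Ohio"},
]
to_drop = constant_columns(rows)
cleaned = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
print(to_drop)
```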
Out[16]:

saturn.columns["country"].stats

stat value
n 10,992
nulls 0 (0.0%)
unique 1
top_value United States
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 9.
Top values for country.
Top values for country (1 unique shown, of 1 total).

| value | count | share |
| --- | --- | --- |
| United States | 10992 | 100.0% |

description text free_text

Free-text descriptions, averaging 70 words (median 55) and reaching up to 5,280 characters, with a Flesch readability of ~69.7 indicating fairly accessible prose. Nearly every row is unique (10,987 of 10,992) with only 5 duplicates and a vocabulary of 33,001 tokens, so this reads as bespoke long-form copy rather than templated text. Boilerplate (0.6%), URLs (0.5%), all-caps (0.06%) and emoji (0%) are all negligible, and the top words are common English stopwords — consistent with natural English narrative.

Treatment: Tokenize and embed before modelling; do not use as a categorical key.

anthropic:claude-opus-4-7 · confidence high
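
As a first step toward the tokenize-and-embed treatment, a minimal tokenizer with bag-of-words counts might look like the sketch below. The counts stand in for a learned embedding, which would need a model this report does not name. The regular expression, the function names, and the two sample descriptions are assumptions for illustration.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Lowercase a description and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())

def bag_of_words(texts):
    """Return one token-count Counter per text plus the shared vocabulary."""
    bags = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*bags)) if bags else []
    return bags, vocab

# Invented examples, not rows from the dataset.
descriptions = [
    "A lady in white walks the halls at night.",
    "Footsteps are heard in the halls after closing.",
]
bags, vocab = bag_of_words(descriptions)
print(len(vocab), bags[0]["halls"])
```

Because nearly every description is unique, counts or embeddings of this kind are the usable signal, not the raw string.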
Out[19]:

saturn.columns["description"].stats

stat value
n 10,992
nulls 0 (0.0%)
unique 10,987
len_min 2
len_max 5,280
len_mean 380
len_median 297
len_p95 954
word_mean 70.33
word_median 55
n_empty 0
n_duplicates 5
duplicate_rate 0.0004549
vocab_size 33,001
readability_flesch_mean 69.67
emoji_rate 0
url_rate 0.005095
one_word_rate 0.000182
allcaps_rate 0.0006368
boilerplate_rate 0.006459
alert: near_unique · 100.0% of rows are unique strings
Fig 10.
Character-length distribution for description.

location text free_text

Short free-text place names averaging 2.98 words and 19.3 characters, with values like 'Prince Georges county', 'Cemetery', 'Cry Baby Bridge', and 'Wal-Mart'. These are likely the named site of a story or sighting, probably ghost or folklore related given the top entries. Vocabulary is dominated by 'school', 'cemetery', 'high', 'university', 'house', and 'road', suggesting a strong tilt toward institutional and roadside locations. High cardinality (9,903 unique of 10,992) coexists with a 9.9% duplicate rate, and the most frequent value ('Prince Georges county') appears 18 times, so most entries are unique strings but a long tail of repeated landmarks remains.

Treatment: Normalise casing/whitespace and geocode or entity-link to extract structured place features before modelling.

anthropic:claude-opus-4-7 · confidence high
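
The casing and whitespace normalisation named in the treatment could be sketched as below. `normalise_place` is a hypothetical helper, and the choice of NFKC normalisation plus `casefold()` is an assumption about what "normalise" should mean here. Geocoding and entity linking are out of scope for this sketch.

```python
import re
import unicodedata

def normalise_place(name):
    """Unicode-normalise, collapse whitespace, trim and casefold a place name."""
    if name is None:
        return None
    text = unicodedata.normalize("NFKC", name)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()

# Invented variants of one landmark name.
raw = ["Cry Baby Bridge", "  cry  baby bridge ", "CRY BABY BRIDGE", None]
print({normalise_place(r) for r in raw if r is not None})
```

Running this before counting duplicates would merge variants that differ only in case or spacing, which may raise the 9.9% duplicate rate reported above.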
Out[22]:

saturn.columns["location"].stats

stat value
n 10,992
nulls 3 (0.0%)
unique 9,903
len_min 2
len_max 1,016
len_mean 19.3
len_median 17
len_p95 34
word_mean 2.976
word_median 3
n_empty 0
n_duplicates 1,086
duplicate_rate 0.09883
vocab_size 8,232
readability_flesch_mean 45.79
emoji_rate 0
url_rate 0
one_word_rate 0.06334
allcaps_rate 0.003367
boilerplate_rate 0
Fig 11.
Character-length distribution for location.

state categorical feature

This is a US state field with 51 distinct values (likely the 50 states plus DC) across 10,992 rows and no nulls. Distribution is broad with a high entropy ratio (0.916); California leads at 9.73% (1,070 rows), followed by Texas (696) and Pennsylvania (649). The ranking roughly tracks population, except that Pennsylvania ranks unusually high and Florida falls outside the top 10 (11th, 328 rows).

Treatment: One-hot or target-encode for modelling; consider grouping low-frequency states.

anthropic:claude-opus-4-7 · confidence high
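
The entropy ratio quoted for state is consistent with Shannon entropy in bits divided by its maximum, log2 of the number of distinct values. That reading is inferred from the reported figures (5.194 and 51), not taken from Saturn's documentation, and the function names below are illustrative.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (base 2) of the non-null values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len({v for v in values if v is not None})
    if k <= 1:
        return 0.0  # a constant column carries no information
    return entropy_bits(values) / math.log2(k)

# Check against the figures reported for state: entropy 5.194, cardinality 51.
print(round(5.194 / math.log2(51), 3))
```

A ratio near 1 means the values are spread almost evenly. The constant country column sits at the other extreme with a ratio of 0.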
Out[25]:

saturn.columns["state"].stats

stat value
n 10,992
nulls 0 (0.0%)
unique 51
top_value California
top_rate 0.09734
cardinality 51
entropy 5.194
entropy_ratio 0.9157
Fig 12.
Top values for state.

state_abbrev categorical feature

This is a US state abbreviation field with 51 distinct values (50 states plus presumably DC) and no nulls across 10,992 rows. Distribution is broadly spread (entropy ratio 0.916) with CA leading at 9.7% (1,070 rows), followed by TX (696) and PA (649), consistent with population-weighted geographic data.

Treatment: One-hot or target-encode for modelling; safe to use as-is for grouping or joins.

anthropic:claude-opus-4-7 · confidence high
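
One-hot encoding of a small categorical such as state_abbrev can be illustrated with the standard library alone. `one_hot` is a hypothetical helper, and a real pipeline would more likely use an encoder from a modelling library.

```python
def one_hot(values):
    """Return (categories, rows) where each row is a 0/1 indicator list."""
    categories = sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        if v is not None:
            row[index[v]] = 1  # nulls stay as an all-zero row
        rows.append(row)
    return categories, rows

categories, encoded = one_hot(["CA", "TX", "PA", "CA"])
print(categories)
print(encoded[0])
```

With 51 distinct values this yields 51 indicator columns, which is why target encoding is offered as the alternative.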
Out[28]:

saturn.columns["state_abbrev"].stats

stat value
n 10,992
nulls 0 (0.0%)
unique 51
top_value CA
top_rate 0.09734
cardinality 51
entropy 5.194
entropy_ratio 0.9157
Fig 13.
Top values for state_abbrev.
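Top values for state_abbrev (20 unique shown, of 51 total).

| value | count | share |
| --- | --- | --- |
| CA | 1070 | 9.7% |
| TX | 696 | 6.3% |
| PA | 649 | 5.9% |
| MI | 529 | 4.8% |
| OH | 477 | 4.3% |
| NY | 459 | 4.2% |
| IL | 395 | 3.6% |
| KY | 370 | 3.4% |
| IN | 351 | 3.2% |
| MA | 342 | 3.1% |
| FL | 328 | 3.0% |
| MO | 314 | 2.9% |
| GA | 289 | 2.6% |
| WI | 274 | 2.5% |
| AL | 224 | 2.0% |
| TN | 221 | 2.0% |
| WA | 218 | 2.0% |
| OK | 211 | 1.9% |
| NC | 211 | 1.9% |
| NJ | 194 | 1.8% |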

longitude numeric feature

Geographic longitude coordinates, with 8,774 unique values across 10,992 rows and an 11.47% null rate. Values span -164.72 to 168.70 but cluster tightly between Q1 -99.12 and Q3 -80.30, indicating most records sit in North America despite the global range. Kurtosis of 15.2 reflects heavy tails from the 131 outliers (1.35%). This is consistent with the histogram, where 121 rows lie west of about -131° and 10 rows lie east of about -6°, far from the bulk of the data.

Treatment: Pair with latitude for geospatial features; impute or filter the 11.47% nulls before mapping.

anthropic:claude-opus-4-7 · confidence high
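
The report does not state how outliers are flagged. A common choice, assumed here, is Tukey's rule: values beyond 1.5 × IQR from the quartiles. The sketch below applies it to the quartiles reported for longitude, and the function names are illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences from the quartiles (Tukey's rule)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3, k=1.5):
    """True when value falls outside the Tukey fences."""
    low, high = tukey_fences(q1, q3, k)
    return value < low or value > high

# Quartiles reported for longitude in this report.
low, high = tukey_fences(-99.12, -80.30)
print(round(low, 2), round(high, 2))
```

Under this assumption the fences fall near -127.35 and -52.07, so both far-western points and every positive longitude would be flagged.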
Out[31]:

saturn.columns["longitude"].stats

stat value
n 10,992
nulls 1,261 (11.5%)
unique 8,774
min -164.7
max 168.7
mean -92
median -87.23
std 17.69
q1 -99.12
q3 -80.3
iqr 18.82
skew 0.2512
kurtosis 15.2
n_outliers 131
outlier_rate 0.01346
zero_rate 0
Fig 14.
Distribution of longitude. Vertical dash marks the median.

latitude numeric feature

Geographic latitude in decimal degrees, ranging from -45.02 to 66.89 with a median of 39.28 and IQR of 7.20, suggesting most observations cluster in the northern mid-latitudes. The distribution is left-skewed (-0.97) with heavy tails (kurtosis 11.38) and 124 outliers (1.27%), consistent with a long lower tail that includes two southern-hemisphere points in the histogram. Note the 11.47% null rate, which will need handling before any spatial analysis.

Treatment: Pair with longitude for geospatial features; impute or drop the ~11% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
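
Before imputing or dropping, it can help to separate missing coordinates from implausible ones. The sketch below is one way to do that. The bounding box (latitude 18 to 72, longitude -180 to -66) is a rough, hypothetical envelope for the 50 states that ignores Aleutian points east of the antimeridian, and `split_by_validity` is an illustrative name.

```python
def split_by_validity(points, lat_range=(18.0, 72.0), lon_range=(-180.0, -66.0)):
    """Split (lat, lon) pairs into plausible, implausible and missing."""
    ok, suspect, missing = [], [], []
    for lat, lon in points:
        if lat is None or lon is None:
            missing.append((lat, lon))
        elif (lat_range[0] <= lat <= lat_range[1]
              and lon_range[0] <= lon <= lon_range[1]):
            ok.append((lat, lon))
        else:
            suspect.append((lat, lon))
    return ok, suspect, missing

# Toy points built from summary values in this report, not actual rows.
points = [(39.28, -87.23), (-45.02, 168.70), (None, None)]
ok, suspect, missing = split_by_validity(points)
print(len(ok), len(suspect), len(missing))
```

Rows in the suspect group could then be checked against city_latitude and city_longitude, which are nearly complete.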
Out[34]:

saturn.columns["latitude"].stats

stat value
n 10,992
nulls 1,261 (11.5%)
unique 8,775
min -45.02
max 66.89
mean 38.34
median 39.28
std 5.259
q1 34.68
q3 41.87
iqr 7.197
skew -0.971
kurtosis 11.38
n_outliers 124
outlier_rate 0.01274
zero_rate 0
Fig 15.
Distribution of latitude. Vertical dash marks the median.

city_longitude numeric feature

Longitude of a city, with all 10,963 non-null values negative (min -164.72, max -67.84), placing every record in the Western Hemisphere and consistent with North American coverage. The distribution is left-skewed (skew -1.16), with the median (-87.09) east of the mean (-91.91), suggesting a concentration in the central/eastern US with a tail reaching toward Alaska or the Pacific. 5,222 unique values across 10,992 rows indicate many records share the same city, and the 128 outliers (1.17%) likely correspond to far-western locations.

Treatment: Use as a geospatial feature alongside latitude; consider clustering or binning rather than treating as a plain scalar.

anthropic:claude-opus-4-7 · confidence high
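
Binning, as suggested in the treatment, can be as simple as equal-width buckets over the reported range. The sketch assumes 40 bins between the reported minimum and maximum, which matches the row count of the histogram table in this report. `bin_index` is a hypothetical helper.

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin containing value, clamped to n_bins - 1."""
    if not lo <= value <= hi:
        raise ValueError("value outside the binning range")
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)

# Range reported for city_longitude, split into 40 bins.
LO, HI, N_BINS = -164.72, -67.84, 40
print(bin_index(-87.09, LO, HI, N_BINS))
```

The clamp keeps the maximum value in the last bin and off a nonexistent 41st one.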
Out[37]:

saturn.columns["city_longitude"].stats

stat value
n 10,992
nulls 29 (0.3%)
unique 5,222
min -164.7
max -67.84
mean -91.91
median -87.09
std 16.4
q1 -98.49
q3 -80.51
iqr 17.99
skew -1.155
kurtosis 1.328
n_outliers 128
outlier_rate 0.01168
zero_rate 0
Fig 16.
Distribution of city_longitude. Vertical dash marks the median.
Histogram bins for city_longitude (median: -87.0902772).

| bin | count |
| --- | --- |
| -164.7 – -162.3 | 2 |
| -162.3 – -159.9 | 0 |
| -159.9 – -157.5 | 83 |
| -157.5 – -155 | 14 |
| -155 – -152.6 | 0 |
| -152.6 – -150.2 | 3 |
| -150.2 – -147.8 | 13 |
| -147.8 – -145.3 | 5 |
| -145.3 – -142.9 | 4 |
| -142.9 – -140.5 | 0 |
| -140.5 – -138.1 | 0 |
| -138.1 – -135.7 | 0 |
| -135.7 – -133.2 | 2 |
| -133.2 – -130.8 | 2 |
| -130.8 – -128.4 | 0 |
| -128.4 – -126 | 0 |
| -126 – -123.5 | 33 |
| -123.5 – -121.1 | 529 |
| -121.1 – -118.7 | 249 |
| -118.7 – -116.3 | 661 |
| -116.3 – -113.9 | 83 |
| -113.9 – -111.4 | 245 |
| -111.4 – -109 | 92 |
| -109 – -106.6 | 96 |
| -106.6 – -104.2 | 260 |
| -104.2 – -101.7 | 93 |
| -101.7 – -99.33 | 159 |
| -99.33 – -96.91 | 530 |
| -96.91 – -94.48 | 545 |
| -94.48 – -92.06 | 492 |
| -92.06 – -89.64 | 447 |
| -89.64 – -87.22 | 792 |
| -87.22 – -84.79 | 877 |
| -84.79 – -82.37 | 1252 |
| -82.37 – -79.95 | 912 |
| -79.95 – -77.53 | 420 |
| -77.53 – -75.11 | 825 |
| -75.11 – -72.68 | 649 |
| -72.68 – -70.26 | 548 |
| -70.26 – -67.84 | 46 |

city_latitude numeric feature

This column holds city latitudes in decimal degrees, ranging from 19.58 to 66.90 with a median of 39.28 and an IQR spanning 34.74 to 41.85. Values cluster firmly in the northern mid-latitudes (mean 38.38, std 5.07), with mild left skew (-0.32) and 128 outliers (1.17%). The histogram places 96 rows below about 23.1° and 32 rows above about 55.1°, so the outliers likely sit at both the southern and northern extremes, not only in far-northern locales. Cardinality is moderate (5,310 unique of 10,992) and nulls are negligible (0.26%).

Treatment: Pair with longitude for geospatial features; avoid treating as a standalone scalar.

anthropic:claude-opus-4-7 · confidence high
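
One way to pair latitude with longitude is to derive the great-circle distance between a sighting and its city centroid. The haversine formula below is a standard spherical approximation, swapped in here as an illustration and not something Saturn computes. The row keys follow this report's column names, and `offset_from_city` is a hypothetical helper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def offset_from_city(row):
    """Distance from a sighting to its city centroid, or None if any value is null."""
    keys = ("latitude", "longitude", "city_latitude", "city_longitude")
    if any(row.get(k) is None for k in keys):
        return None
    return haversine_km(*(row[k] for k in keys))
```

A very large offset may indicate a mis-geocoded sighting, which would be one possible explanation for the modest correlations between the sighting and city coordinates in Fig 7.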
Out[40]:

saturn.columns["city_latitude"].stats

stat value
n 10,992
nulls 29 (0.3%)
unique 5,310
min 19.58
max 66.9
mean 38.38
median 39.28
std 5.067
q1 34.74
q3 41.85
iqr 7.109
skew -0.325
kurtosis 1.478
n_outliers 128
outlier_rate 0.01168
zero_rate 0
Fig 17.
Distribution of city_latitude. Vertical dash marks the median.
Histogram bins for city_latitude (median: 39.2833968).

| bin | count |
| --- | --- |
| 19.58 – 20.76 | 5 |
| 20.76 – 21.94 | 90 |
| 21.94 – 23.12 | 1 |
| 23.12 – 24.31 | 0 |
| 24.31 – 25.49 | 7 |
| 25.49 – 26.67 | 106 |
| 26.67 – 27.86 | 114 |
| 27.86 – 29.04 | 145 |
| 29.04 – 30.22 | 256 |
| 30.22 – 31.41 | 222 |
| 31.41 – 32.59 | 369 |
| 32.59 – 33.77 | 569 |
| 33.77 – 34.96 | 988 |
| 34.96 – 36.14 | 560 |
| 36.14 – 37.32 | 666 |
| 37.32 – 38.5 | 703 |
| 38.5 – 39.69 | 1004 |
| 39.69 – 40.87 | 1341 |
| 40.87 – 42.05 | 1276 |
| 42.05 – 43.24 | 1258 |
| 43.24 – 44.42 | 401 |
| 44.42 – 45.6 | 387 |
| 45.6 – 46.79 | 184 |
| 46.79 – 47.97 | 220 |
| 47.97 – 49.15 | 59 |
| 49.15 – 50.34 | 0 |
| 50.34 – 51.52 | 0 |
| 51.52 – 52.7 | 0 |
| 52.7 – 53.88 | 0 |
| 53.88 – 55.07 | 0 |
| 55.07 – 56.25 | 2 |
| 56.25 – 57.43 | 1 |
| 57.43 – 58.62 | 0 |
| 58.62 – 59.8 | 1 |
| 59.8 – 60.98 | 5 |
| 60.98 – 62.17 | 13 |
| 62.17 – 63.35 | 1 |
| 63.35 – 64.53 | 0 |
| 64.53 – 65.72 | 6 |
| 65.72 – 66.9 | 3 |

location_2 text feature

This column holds WKT geometry strings in the form `POINT(lon lat)`, with 8776 unique values across 10992 rows and an 11.47% null rate. Values are tightly bounded (length 22-43, always two 'words') and 9.81% are duplicates, with the most repeated coordinate appearing 20 times — suggesting clusters at recurring sites. The 'allcaps' alert is a false positive driven by the literal `POINT` prefix rather than natural language.

Treatment: Parse into a numeric longitude/latitude pair before any geospatial modelling.

anthropic:claude-opus-4-7 · confidence high
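
Parsing the `POINT(lon lat)` strings can be done with a small regular expression. This sketch assumes plain decimal coordinates without exponents and returns `None` for nulls or anything that does not match. `parse_point` is an illustrative name, and a dedicated WKT library would be the more robust choice for other geometry types.

```python
import re

POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)

def parse_point(wkt):
    """Parse 'POINT(lon lat)' into (longitude, latitude); None if null or malformed."""
    if wkt is None:
        return None
    match = POINT_RE.match(wkt)
    if match is None:
        return None
    lon, lat = (float(g) for g in match.groups())
    return lon, lat

print(parse_point("POINT(-118.24 34.05)"))
```

Note the WKT axis order: longitude comes first, so the tuple should be unpacked as `lon, lat`, not the other way round.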
Out[43]:

saturn.columns["location_2"].stats

stat value
n 10,992
nulls 1,261 (11.5%)
unique 8,776
len_min 22
len_max 43
len_mean 30.81
len_median 29
len_p95 36
word_mean 2
word_median 2
n_empty 0
n_duplicates 955
duplicate_rate 0.09814
vocab_size 17,549
readability_flesch_mean 120.2
emoji_rate 0
url_rate 0
one_word_rate 0
allcaps_rate 1
boilerplate_rate 0
alert: allcaps · 100.0% rows are all-caps
Fig 18.
Character-length distribution for location_2.
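Character-length distribution for location_2 (mean: 30.80896105230706).

| chars | count |
| --- | --- |
| 22 – 23 | 1 |
| 23 – 23 | 8 |
| 23 – 24 | 0 |
| 24 – 24 | 2 |
| 24 – 25 | 0 |
| 25 – 25 | 20 |
| 25 – 26 | 0 |
| 26 – 26 | 104 |
| 26 – 27 | 0 |
| 27 – 27 | 679 |
| 27 – 28 | 0 |
| 28 – 28 | 1007 |
| 28 – 29 | 0 |
| 29 – 29 | 3740 |
| 29 – 30 | 0 |
| 30 – 30 | 1567 |
| 30 – 31 | 0 |
| 31 – 31 | 0 |
| 31 – 32 | 0 |
| 32 – 32 | 1 |
| 32 – 33 | 13 |
| 33 – 34 | 0 |
| 34 – 34 | 54 |
| 34 – 35 | 0 |
| 35 – 35 | 453 |
| 35 – 36 | 0 |
| 36 – 36 | 1807 |
| 36 – 37 | 0 |
| 37 – 37 | 114 |
| 37 – 38 | 0 |
| 38 – 38 | 0 |
| 38 – 39 | 0 |
| 39 – 39 | 0 |
| 39 – 40 | 0 |
| 40 – 40 | 0 |
| 40 – 41 | 0 |
| 41 – 41 | 0 |
| 41 – 42 | 0 |
| 42 – 42 | 0 |
| 42 – 43 | 161 |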

city_location text feature

Despite the name, city_location stores WKT POINT(longitude latitude) geometry strings, not city names — every value matches that format (allcaps_rate 1.0, word_mean 2.0). Roughly half the rows are repeats (duplicate_rate 0.516, n_duplicates 5652) with the top coordinate (Los Angeles, 34.05/-118.24) appearing 61 times, suggesting a finite set of city centroids reused across records. Cardinality is still high (5311 unique of 10992) and nulls are negligible (0.26%).

Treatment: Parse the POINT() strings into numeric lat/long pairs before any geospatial modelling.

anthropic:claude-opus-4-7 · confidence high
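
The duplicate figures in this report are consistent with a simple definition: non-null rows minus unique values, divided by non-null rows. That definition is inferred from the reported numbers, not taken from Saturn's documentation, and `duplicate_stats` is an illustrative helper.

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Rows that repeat an earlier non-null value, as a count and a rate."""
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null

# Figures reported for city_location: 10,992 rows, 29 nulls, 5,311 unique.
n_dup, rate = duplicate_stats(10_992, 29, 5_311)
print(n_dup, round(rate, 4))
```

The same arithmetic applied to city (3 nulls, 4,385 unique) gives 6,604 duplicates, matching the count in that column's stats.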
Out[46]:

saturn.columns["city_location"].stats

stat value
n 10,992
nulls 29 (0.3%)
unique 5,311
len_min 22
len_max 43
len_mean 31.19
len_median 29
len_p95 36
word_mean 2
word_median 2
n_empty 0
n_duplicates 5,652
duplicate_rate 0.5156
vocab_size 10,532
readability_flesch_mean 120.2
emoji_rate 0
url_rate 0
one_word_rate 0
allcaps_rate 1
boilerplate_rate 0
alert: allcaps · 100.0% rows are all-caps
alert: duplicates · 51.6% duplicate strings
Fig 19.
Character-length distribution for city_location.
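Character-length distribution for city_location (mean: 31.192739213718873).

| chars | count |
| --- | --- |
| 22 – 23 | 2 |
| 23 – 23 | 0 |
| 23 – 24 | 0 |
| 24 – 24 | 1 |
| 24 – 25 | 0 |
| 25 – 25 | 2 |
| 25 – 26 | 0 |
| 26 – 26 | 45 |
| 26 – 27 | 0 |
| 27 – 27 | 158 |
| 27 – 28 | 0 |
| 28 – 28 | 890 |
| 28 – 29 | 0 |
| 29 – 29 | 4648 |
| 29 – 30 | 0 |
| 30 – 30 | 1988 |
| 30 – 31 | 0 |
| 31 – 31 | 0 |
| 31 – 32 | 0 |
| 32 – 32 | 1 |
| 32 – 33 | 1 |
| 33 – 34 | 0 |
| 34 – 34 | 21 |
| 34 – 35 | 0 |
| 35 – 35 | 238 |
| 35 – 36 | 0 |
| 36 – 36 | 2711 |
| 36 – 37 | 0 |
| 37 – 37 | 116 |
| 37 – 38 | 0 |
| 38 – 38 | 0 |
| 38 – 39 | 0 |
| 39 – 39 | 0 |
| 39 – 40 | 0 |
| 40 – 40 | 0 |
| 40 – 41 | 0 |
| 41 – 41 | 0 |
| 41 – 42 | 0 |
| 42 – 42 | 0 |
| 42 – 43 | 141 |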

How to cite


BibTeX
@misc{saturn-wild-ghost-sightings-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: wild ghost sightings},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/wild-ghost_sightings}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: wild ghost sightings. Source: /home/coolhand/html/datavis/data_trove/data/wild/ghost_sightings.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/wild-ghost_sightings