saturn·

quirky ufo sightings 20260121

source: /home/coolhand/html/datavis/data_trove/cache/quirky/ufo_sightings_20260121.parquet · 147,890 rows · 13 columns · profiled 2026-05-01

Reading

dataset summary · high confidence · anthropic:claude-opus-4-7

This dataset contains 147,890 UFO sighting reports across 13 columns, mixing free-text descriptions (Summary, Text, Location details), structured categoricals (Shape, Explanation), timestamps (Occurred, Reported, Posted), and a numeric witness count. The Shape field is a clean place to start: 39 categories with 'Light' leading at ~27,494 sightings, followed by Circle and Triangle. Two things deserve a closer look. First, 'No of observers' is extremely skewed: values run from -10 to 20,000 with a median of 2 and over 18,000 flagged outliers, suggesting data-entry errors that need cleaning before any aggregation. Second, the Explanation column is 99.46% null, so claims about 'what UFOs really were' rest on only 803 labelled rows, dominated by Starlink and rocket attributions. Location is dense and US-heavy (Phoenix, Seattle, Las Vegas top the list), and the Characteristics field collapses to ~43 vocabulary tokens dominated by 'Lights on object'.

citing: Shape.top_values · Shape.cardinality · No of observers.skew · No of observers.max · No of observers.min · No of observers.median · No of observers.n_outliers · Explanation.null_rate · Explanation.top_values · Location.top_values · Characteristics.top_values · Characteristics.vocab_size · Duration.top_values · row_count · column_count

Schema

13 columns
Per-column summary.

Column            Type         Null %  Unique   Alerts
Sighting          numeric      0.0%    147,890  -
Occurred          text         0.0%    126,264  -
Location          text         0.0%    37,070   multilingual, duplicates
Shape             categorical  4.3%    39       -
Duration          text         4.8%    15,527   short_text, duplicates
No of observers   numeric      4.5%    137      high_skew, outliers
Reported          text         0.0%    136,471  multilingual
Posted            categorical  0.0%    626      -
Characteristics   text         28.1%   1,446    null_rate, duplicates
Summary           text         0.6%    144,208  near_unique
Text              text         11.3%   127,124  near_unique
Location details  text         93.1%   9,713    near_unique, null_rate
Explanation       categorical  99.5%   58       null_rate

Sighting

numeric identifier
With n_unique equal to n (147890) and zero nulls, this looks like a per-row sighting identifier rather than a measured quantity. The distribution is near-uniform across 111 to 179773 (skew -0.013, kurtosis -1.13, mean 91984 close to median 91435), consistent with a sequential or randomly assigned ID rather than a feature with signal. Treatment: drop from modelling; retain only as a join key. high · anthropic:claude-opus-4-7
n
147,890
nulls
0 (0.0%)
unique
147,890
min
111
max
179,773
mean
9.198e+04
median
9.143e+04
std
5.044e+04
q1
5.015e+04
q3
1.345e+05
iqr
8.437e+04
skew
-0.0132
kurtosis
-1.134
n_outliers
0
outlier_rate
0
zero_rate
0

Occurred

text timestamp
Timestamp strings of the form 'YYYY-MM-DD HH:MM:SS Local', with length tightly clustered at 25 characters (mean 24.96, p95 25). The token 'local' is counted roughly 20,000 times, which looks like a sampled token count rather than a row count: a word_mean of 2.996 implies that essentially every row carries the suffix. Duplicate rate is 14.6% (21,626 repeats), and several July 4th evening timestamps recur dozens of times, suggesting event clustering around holidays and round hours (22:00, 21:00, 23:00 dominate). 299 rows (one_word_rate 0.002) contain only the bare token 'Local', indicating missing date/time portions encoded as a stub rather than nulls. Treatment: Parse to datetime, drop the trailing 'Local' suffix, and treat bare 'Local' entries as missing. high · anthropic:claude-opus-4-7
n
147,890
nulls
0 (0.0%)
unique
126,264
len_min
5
len_max
25
len_mean
24.96
len_median
25
len_p95
25
word_mean
2.996
word_median
3
n_empty
0
n_duplicates
21,626
duplicate_rate
0.1462
vocab_size
10,098
readability_flesch_mean
90.72
emoji_rate
0
url_rate
0
one_word_rate
0.002022
allcaps_rate
0
boilerplate_rate
0
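
The treatment above can be sketched with the standard library alone. This is a minimal illustration, not the profiler's own code: the function name `parse_occurred` and the sample strings are hypothetical, and the format string assumes the 'YYYY-MM-DD HH:MM:SS Local' layout described in the profile.

```python
from datetime import datetime
from typing import Optional

SUFFIX = "Local"  # trailing marker described in the profile


def parse_occurred(raw: Optional[str]) -> Optional[datetime]:
    """Return a naive datetime, or None for nulls, bare stubs, and bad values."""
    if raw is None:
        return None
    text = raw.strip()
    if text.endswith(SUFFIX):
        text = text[: -len(SUFFIX)].strip()
    if not text:  # the value was only the bare 'Local' stub
        return None
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # anything unparseable is treated as missing


# Illustrative inputs, not rows taken from the dataset.
samples = ["2019-07-04 22:00:00 Local", "Local", None]
parsed = [parse_occurred(s) for s in samples]
```

Returning None for the stub keeps the 299 bare 'Local' rows visible as missing values instead of letting them fail silently during parsing.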

Location

text metadata multilingual duplicates
Short city/state/country strings — top values like 'Phoenix, AZ, USA' (770) and dominant token 'usa' (17880) make this a geographic location field. Heavy duplication (74.9%, 110819 rows across 37070 uniques) is expected for city-level data, and entries are short (mean 20.4 chars, 3.57 words). The 'multilingual' alert is mostly noise from place names: 4481 detected as English with small counts in de (144), es (86), fr (89), and others. Treatment: Parse into city/state/country components and use as categorical geo features. high · anthropic:claude-opus-4-7
n
147,890
nulls
1 (0.0%)
unique
37,070
len_min
7
len_max
92
len_mean
20.38
len_median
18
len_p95
37
word_mean
3.571
word_median
3
n_empty
0
n_duplicates
110,819
duplicate_rate
0.7493
vocab_size
8,750
readability_flesch_mean
26.72
emoji_rate
6.762e-06
url_rate
0
one_word_rate
6.762e-06
allcaps_rate
0.003983
boilerplate_rate
0
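
One way to split the field into components is sketched below. The handling of entries with fewer or more than three comma-separated parts is an assumption, since the profile only shows the 'City, ST, Country' shape; `Place` and `parse_location` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Place(NamedTuple):
    city: Optional[str]
    state: Optional[str]
    country: Optional[str]


def parse_location(raw: Optional[str]) -> Place:
    """Split 'City, ST, Country' into parts; missing parts become None."""
    if raw is None:
        return Place(None, None, None)
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if not parts:
        return Place(None, None, None)
    if len(parts) == 1:
        return Place(parts[0], None, None)
    if len(parts) == 2:
        # Assumption: two-part values are 'City, Country' with no state.
        return Place(parts[0], None, parts[1])
    # Three or more parts: the last two are state and country,
    # everything before them is kept together as the city.
    return Place(", ".join(parts[:-2]), parts[-2], parts[-1])


place = parse_location("Phoenix, AZ, USA")
```

The two-part rule in particular should be checked against real values before the derived columns are used as categorical features.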

Shape

categorical feature
Categorical descriptor of an observed object's shape, with 39 distinct values across 147,890 rows. 'Light' dominates at 19.4% of non-null values, followed by 'Circle' (14,367) and 'Triangle' (13,086), and the vocabulary is noticeably noisy: 'Other' (10,062) and 'Unknown' (10,021) together cover 20,083 rows, roughly three quarters as many as the top class, and near-synonyms like 'Sphere', 'Orb', and 'Circle' coexist. Null rate is 4.29% and an entropy ratio of 0.74 indicates moderately spread mass rather than a single peak. Treatment: Consolidate synonyms (Sphere/Orb/Circle), bucket Other/Unknown as missing, then one-hot encode. high · anthropic:claude-opus-4-7
n
147,890
nulls
6,343 (4.3%)
unique
39
top_value
Light
top_rate
0.1942
cardinality
39
entropy
3.93
entropy_ratio
0.7436
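
The consolidation and encoding steps might look like the sketch below. Merging 'Sphere' and 'Orb' into 'Circle' is only one possible choice of canonical label, and the helper names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: 'Circle' is used as the canonical label for the synonym group.
SYNONYMS = {"Sphere": "Circle", "Orb": "Circle"}
MISSING = {"Other", "Unknown"}


def normalise_shape(raw: Optional[str]) -> Optional[str]:
    """Map synonyms to one label and treat Other/Unknown as missing."""
    if raw is None or raw in MISSING:
        return None
    return SYNONYMS.get(raw, raw)


def one_hot(values: List[Optional[str]]) -> Tuple[List[str], List[Dict[str, int]]]:
    """Return the sorted category list and one indicator dict per row."""
    categories = sorted({v for v in values if v is not None})
    rows = [{c: int(v == c) for c in categories} for v in values]
    return categories, rows


raw = ["Light", "Orb", "Sphere", "Unknown", None, "Triangle"]
clean = [normalise_shape(v) for v in raw]
categories, encoded = one_hot(clean)
```

Rows with a missing shape end up as all-zero indicator rows, which keeps them distinguishable from every real category.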

Duration

text feature short_text duplicates
This is a free-text duration field, almost always a number paired with a unit like 'minutes' or 'seconds' (mean length 9.5 chars, word_median 2). Values are highly repetitive: 89% are duplicates and only 15,527 uniques exist across 147,890 rows, with '5 minutes' alone appearing 8,813 times. The vocabulary of 1,757 tokens and presence of variants like 'min' alongside 'minutes' suggest inconsistent unit spellings, and 4.83% are null. Treatment: Parse into a numeric seconds column by normalizing unit tokens (minute/min, second/sec) before modelling. high · anthropic:claude-opus-4-7
n
147,890
nulls
7,148 (4.8%)
unique
15,527
len_min
1
len_max
37
len_mean
9.469
len_median
9
len_p95
15
word_mean
2.066
word_median
2
n_empty
0
n_duplicates
125,215
duplicate_rate
0.8897
vocab_size
1,757
readability_flesch_mean
84.41
emoji_rate
7.105e-06
url_rate
0
one_word_rate
0.0895
allcaps_rate
0.03033
boilerplate_rate
0
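
A minimal sketch of the seconds conversion follows. The unit table is an assumption that covers only the spellings mentioned above plus a few obvious variants; values such as ranges or free-form phrases fall through to None and would need separate rules.

```python
import re
from typing import Optional

# Hypothetical unit table; extend it after inspecting the real vocabulary.
UNIT_SECONDS = {
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
}

# A number, optional whitespace, a unit word, and an optional trailing dot.
PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\.?\s*$", re.IGNORECASE)


def duration_to_seconds(raw: Optional[str]) -> Optional[float]:
    """Convert '<number> <unit>' to seconds; return None if it does not fit."""
    if raw is None:
        return None
    match = PATTERN.match(raw)
    if match is None:
        return None
    factor = UNIT_SECONDS.get(match.group(2).lower())
    if factor is None:
        return None
    return float(match.group(1)) * factor


seconds = duration_to_seconds("5 minutes")
```

Counting how many rows return None is a quick way to measure how much of the column the simple number-plus-unit pattern actually covers.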

No of observers

numeric feature high_skew outliers
This is a count of observers per record, with a typical value of 1-2 (median 2.0, IQR 1.0) but a mean of 4.6 dragged up by extreme outliers reaching 20000. Skew of 109.3 and kurtosis above 14000 confirm a heavy right tail, and 13.0% of values are flagged as outliers. The minimum of -10.0 is implausible for an observer count and suggests data entry errors, while 6.9% are zero and 4.5% are null. Treatment: Clip negatives, investigate the 20000 tail, and log1p-transform before modelling. high · anthropic:claude-opus-4-7
n
147,890
nulls
6,661 (4.5%)
unique
137
min
-10
max
20,000
mean
4.603
median
2
std
129.5
q1
1
q3
2
iqr
1
skew
109.3
kurtosis
1.433e+04
n_outliers
18,390
outlier_rate
0.1302
zero_rate
0.06919
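
The clipping and transform steps can be sketched as below. The `upper_fence` helper assumes the outlier flag follows the common Tukey rule (q3 + 1.5 × IQR); the profile does not state which rule it uses, so treat that part as a guess that would explain why so many small counts are flagged.

```python
import math
from typing import Optional


def clean_observers(value: Optional[float], cap: Optional[float] = None) -> Optional[float]:
    """Clip negatives to zero, optionally cap the tail, then apply log1p."""
    if value is None:
        return None
    clipped = max(0.0, float(value))
    if cap is not None:
        clipped = min(clipped, cap)
    return math.log1p(clipped)


def upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Tukey upper fence; an assumed outlier rule, not confirmed by the profile."""
    return q3 + k * (q3 - q1)


fence = upper_fence(1.0, 2.0)        # q1 and q3 taken from the stats above
transformed = clean_observers(-10)   # the implausible minimum becomes log1p(0)
```

With q1 = 1 and q3 = 2 the fence sits at 3.5, so under this assumed rule any report with four or more observers counts as an outlier, which would be consistent with a 13% outlier rate.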

Reported

text timestamp multilingual
This is a 'Reported' timestamp column stored as text: every non-null value (147,889 of 147,890 rows) is exactly 27 characters long and follows a 'YYYY-MM-DD HH:MM:SS Pacific' pattern. The time component varies, since 136,471 distinct values could not arise from midnight-only dates. Despite being temporal, the column was profiled as text, which explains the spurious 'multilingual' alert (4,649 'en', 336 'de', etc.) from language detection misreading date tokens. Some duplication is expected for a timestamp field (7.7%, 11,418 rows), with the heaviest values falling on 1999-11-16 and 1999-11-17 at 73 and 69 occurrences. Treatment: Parse to datetime (strip the ' Pacific' suffix) and discard the language alert. high · anthropic:claude-opus-4-7
n
147,890
nulls
1 (0.0%)
unique
136,471
len_min
27
len_max
27
len_mean
27
len_median
27
len_p95
27
word_mean
3
word_median
3
n_empty
0
n_duplicates
11,418
duplicate_rate
0.07721
vocab_size
24,077
readability_flesch_mean
62.79
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0
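
A short sketch of the parsing step, assuming the fixed 27-character layout with a trailing ' Pacific' marker. The result is a naive datetime that should be read as Pacific wall-clock time; attaching a real time zone is left out here. `parse_reported` is a hypothetical name and the sample value is illustrative.

```python
from datetime import datetime
from typing import Optional

SUFFIX = " Pacific"


def parse_reported(raw: Optional[str]) -> Optional[datetime]:
    """Strip the ' Pacific' marker and parse; None for nulls or bad values."""
    if raw is None:
        return None
    text = raw.strip()
    if text.endswith(SUFFIX):
        text = text[: -len(SUFFIX)]
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


reported = parse_reported("1999-11-16 00:00:00 Pacific")
```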

Posted

categorical timestamp
`Posted` holds 626 unique date-stamped values (all at 00:00:00) across 147,890 rows with a single null, so it functions as a posting date rather than a precise timestamp. The distribution is broad and high-entropy (entropy ratio 0.93); the most common date, 2020-06-25, accounts for only 1.24% of rows, and the top 10 dates span 1999 to 2023 with no single period dominating. Worth noting: there are far fewer distinct dates (626) than would be expected if rows were spread daily across two-plus decades, suggesting reports are posted in batches on specific days. Treatment: Parse to date and derive year/month/day-of-week features before modelling. high · anthropic:claude-opus-4-7
n
147,890
nulls
1 (0.0%)
unique
626
top_value
2020-06-25 00:00:00
top_rate
0.01239
cardinality
626
entropy
8.644
entropy_ratio
0.9304
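
Deriving the calendar features could look like this sketch. It assumes the 'YYYY-MM-DD 00:00:00' layout shown in top_value; `posted_features` is a hypothetical helper, and day_of_week follows Python's convention of Monday = 0.

```python
from datetime import datetime
from typing import Dict, Optional


def posted_features(raw: Optional[str]) -> Optional[Dict[str, int]]:
    """Parse a posting date and return year, month, and day-of-week."""
    if raw is None:
        return None
    day = datetime.strptime(raw.strip(), "%Y-%m-%d %H:%M:%S").date()
    return {
        "year": day.year,
        "month": day.month,
        "day_of_week": day.weekday(),  # Monday = 0 ... Sunday = 6
    }


features = posted_features("2020-06-25 00:00:00")  # the top_value above
```

Because posts appear to land in batches, the day-of-week feature may say more about the publishing schedule than about the sightings themselves.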

Characteristics

text feature null_rate duplicates
This is a categorical multi-label feature describing observed object characteristics (e.g., 'Lights on object', 'Aura or haze around object'), encoded as comma-joined tags rather than free text. Despite 147,890 rows, only 1,446 unique values and a vocab of 43 words exist, with a 98.6% duplicate rate among non-null values and 28.1% nulls. Note the truncated token 'Changed Colo' appearing repeatedly, suggesting an upstream string-truncation bug. Treatment: Split on comma and one-hot encode the tag set; fix the 'Changed Colo' truncation before encoding. high · anthropic:claude-opus-4-7
n
147,890
nulls
41,631 (28.1%)
unique
1,446
len_min
6
len_max
251
len_mean
33.27
len_median
26
len_p95
76
word_mean
5.613
word_median
5
n_empty
0
n_duplicates
104,813
duplicate_rate
0.9864
vocab_size
43
readability_flesch_mean
69.46
emoji_rate
0
url_rate
0
one_word_rate
0.00384
allcaps_rate
0
boilerplate_rate
0
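
Since each row can carry several tags, the encoding is effectively multi-hot. The sketch below assumes the truncated tag was meant to read 'Changed Color'; that repair target is a guess and should be confirmed against the upstream source. Helper names are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumption: the truncated tag stands for 'Changed Color'.
REPAIRS = {"Changed Colo": "Changed Color"}


def split_tags(raw: Optional[str]) -> List[str]:
    """Split a comma-joined tag string and repair known truncations."""
    if raw is None:
        return []
    tags = [t.strip() for t in raw.split(",") if t.strip()]
    return [REPAIRS.get(t, t) for t in tags]


def multi_hot(rows: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted tag vocabulary and a 0/1 indicator matrix."""
    vocab = sorted({tag for tags in rows for tag in tags})
    matrix = [[int(tag in tags) for tag in vocab] for tags in rows]
    return vocab, matrix


raw = ["Lights on object, Changed Colo", "Aura or haze around object", None]
rows = [split_tags(r) for r in raw]
vocab, matrix = multi_hot(rows)
```

Null rows become all-zero vectors here; whether that is preferable to an explicit missing indicator depends on the downstream model.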

Summary

text free_text near_unique
Free-form English text summaries, near-unique across 147,890 rows (144,208 distinct), with a median of 76 characters / 13 words but a long tail reaching 10,624 characters. Top tokens are stopwords plus 'light' (4,676), and readability averages 67.3 Flesch, suggesting conversational prose rather than structured codes. Surprises: 2,791 exact duplicates (1.9%) despite the near-unique flag, 1.8% all-caps entries, and a 28,632-word vocabulary — emojis and URLs are negligible. Treatment: Tokenize and embed (or TF-IDF) before modelling; dedupe the 2,791 exact repeats first. high · anthropic:claude-opus-4-7
n
147,890
nulls
891 (0.6%)
unique
144,208
len_min
1
len_max
10,624
len_mean
134.9
len_median
76
len_p95
479
word_mean
25.06
word_median
13
n_empty
0
n_duplicates
2,791
duplicate_rate
0.01899
vocab_size
28,632
readability_flesch_mean
67.28
emoji_rate
0.0001088
url_rate
0.0007755
one_word_rate
0.002327
allcaps_rate
0.01786
boilerplate_rate
0.0008708
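
The dedupe-then-weight step can be illustrated with a small, dependency-free TF-IDF. This uses the plain tf × log(N/df) form; library implementations often smooth the IDF term, so the numbers will differ from theirs. The tokenizer and sample sentences are illustrative only.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional


def dedupe(texts: List[Optional[str]]) -> List[str]:
    """Drop nulls and exact repeats while keeping first-seen order."""
    return list(dict.fromkeys(t for t in texts if t is not None))


def tokenize(text: str) -> List[str]:
    """Lowercase and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    tokenised = [tokenize(d) for d in docs]
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    n_docs = len(docs)
    weights = []
    for toks in tokenised:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        weights.append(
            {tok: (c / total) * math.log(n_docs / doc_freq[tok]) for tok, c in counts.items()}
        )
    return weights


docs = ["Bright light in the sky", "Bright light in the sky", "Triangle over the lake"]
unique = dedupe(docs)
weights = tf_idf(unique)
```

Tokens that occur in every document, such as the stopwords that top this column, receive a weight of zero under this formula, which is the behaviour wanted for a field dominated by stopwords.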

Text

text free_text near_unique
Free-text field averaging 950 characters and 182 words per entry, with 127,124 unique values across 147,890 rows and a 65,106-token vocabulary, consistent with longish English prose (mean Flesch 69.6, fairly readable). First-person pronouns 'i' and 'my' rank in the top 10, suggesting first-person witness narratives rather than formal documents. Notable signals: 11.28% null rate, 3.12% exact duplicates (4,091 rows), and a max length of 64,550 characters versus a p95 of 2,673, which points to a small set of extreme outliers. Treatment: Tokenize and embed before modelling; dedupe the 4,091 exact repeats and clip extreme-length outliers first. high · anthropic:claude-opus-4-7
n
147,890
nulls
16,675 (11.3%)
unique
127,124
len_min
1
len_max
64,550
len_mean
949.9
len_median
682
len_p95
2,673
word_mean
181.6
word_median
131
n_empty
0
n_duplicates
4,091
duplicate_rate
0.03118
vocab_size
65,106
readability_flesch_mean
69.56
emoji_rate
0.0001981
url_rate
0.01805
one_word_rate
0.0006783
allcaps_rate
0.007217
boilerplate_rate
0.001875
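
Length clipping might be sketched as follows. Using the p95 of 2,673 characters as the default cap is an arbitrary choice for illustration, and `clip_text` is a hypothetical helper; it backs up to the previous space so the clipped text does not end mid-word.

```python
from typing import Optional


def clip_text(text: Optional[str], max_chars: int = 2673) -> Optional[str]:
    """Truncate long text at a word boundary; short text passes through."""
    if text is None or len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the cut lands inside a word, drop the partial word when possible.
    if not text[max_chars].isspace():
        head, sep, _ = cut.rpartition(" ")
        if sep:
            cut = head
    return cut.rstrip()


clipped = clip_text("alpha beta gamma", max_chars=8)
```

An alternative is to keep the full text and only flag rows above the cap, which preserves the extreme reports for manual review.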

Location details

text free_text near_unique null_rate
Free-text location descriptions, populated for fewer than 7% of rows (null_rate 0.931) and otherwise short prose averaging 38.7 characters / 6.9 words with a Flesch readability of 69.8. Among the 10,205 non-null entries there are 9,713 distinct strings and only 492 duplicates (4.8%), so values are near-unique and unlikely to act as a category. Top tokens are stopwords ('the', 'of', 'in', 'my', 'over'), confirming sentence-style annotations rather than place names or codes. Treatment: Treat as optional free-text annotation; tokenize/embed if used, otherwise drop given the 93% null rate. high · anthropic:claude-opus-4-7
n
147,890
nulls
137,685 (93.1%)
unique
9,713
len_min
1
len_max
197
len_mean
38.75
len_median
33
len_p95
92
word_mean
6.93
word_median
6
n_empty
0
n_duplicates
492
duplicate_rate
0.04821
vocab_size
12,384
readability_flesch_mean
69.85
emoji_rate
0
url_rate
0.001078
one_word_rate
0.03822
allcaps_rate
0.0197
boilerplate_rate
0.000196

Explanation

categorical label null_rate
This is a categorical 'Explanation' field labeling probable causes for sightings (e.g., 'Starlink - Probable', 'Rocket - Certain', 'Balloon - Possible'), with 58 distinct values combining a source and a confidence tier. It is almost entirely empty: null_rate is 0.9946, leaving only a small annotated subset where the top label 'Starlink - Probable' appears 78 times (9.7% of non-nulls). Entropy ratio of 0.82 indicates labels are spread fairly evenly across categories rather than dominated by one. Treatment: Treat as a sparse annotation; split into source and confidence sub-labels and analyze only the ~0.5% labeled rows. high · anthropic:claude-opus-4-7
n
147,890
nulls
147,087 (99.5%)
unique
58
top_value
Starlink - Probable
top_rate
0.09714
cardinality
58
entropy
4.832
entropy_ratio
0.8248
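
Splitting the label into its two parts can be sketched as below, assuming every label follows the 'Source - Confidence' shape seen in the examples. Labels without the ' - ' separator are kept whole as the source; `split_explanation` is a hypothetical name.

```python
from typing import Optional, Tuple


def split_explanation(raw: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split 'Source - Confidence' on the last ' - ' separator."""
    if raw is None:
        return None, None
    source, sep, confidence = raw.rpartition(" - ")
    if not sep:  # no separator: treat the whole label as the source
        return raw.strip(), None
    return source.strip(), confidence.strip()


label = split_explanation("Starlink - Probable")
```

Splitting on the last separator keeps any hyphenated source name intact while still isolating the confidence tier.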