quirky ufo sightings 20260121

source /home/coolhand/html/datavis/data_trove/cache/quirky/ufo_sightings_20260121.parquet 147,890 rows 13 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 147,890 UFO sighting reports across 13 columns, mixing free-text descriptions (Summary, Text, Location details), structured categoricals (Shape, Explanation), timestamps (Occurred, Reported, Posted), and a numeric witness count. The Shape field is a clean place to start: 39 categories with 'Light' leading at 27,494 sightings, followed by Circle and Triangle. Two things deserve a closer look. First, 'No of observers' is extremely skewed: values run from -10 to 20,000 with a median of 2 and 18,390 flagged outliers, and the negative and five-digit values point to data-entry errors that need cleaning before any aggregation. Second, the Explanation column is 99.46% null, so claims about 'what UFOs really were' rest on just 803 labelled rows (147,890 minus 147,087 nulls), dominated by Starlink and rocket attributions. Location is dense and US-heavy (Phoenix, Seattle, Las Vegas top the list), and the Characteristics field collapses to a vocabulary of 43 tokens dominated by 'Lights on object'.

citing: Shape.top_values · Shape.cardinality · No of observers.skew · No of observers.max · No of observers.min · No of observers.median · No of observers.n_outliers · Explanation.null_rate · Explanation.top_values · Location.top_values · Characteristics.top_values · Characteristics.vocab_size · Duration.top_values · row_count · column_count

Charts the summary said to look at first

Shape · Reported UFO shapes are dominated by 'Light', Circle, and Triangle — the long tail of 39 categories shows reporting vocabulary, not physics.

Top values for Shape (20 unique shown, of 39 total).
value	count	share
Light	27494	18.6%
Circle	14367	9.7%
Triangle	13086	8.8%
Other	10062	6.8%
Unknown	10021	6.8%
Fireball	9880	6.7%
Disk	8716	5.9%
Sphere	7652	5.2%
Oval	6369	4.3%
Orb	5924	4.0%
Formation	4864	3.3%
Changing	3987	2.7%
Cigar	3753	2.5%
Rectangle	2610	1.8%
Cylinder	2482	1.7%
Flash	2439	1.6%
Diamond	2116	1.4%
Chevron	1742	1.2%
Egg	1289	0.9%
Teardrop	1238	0.8%

No of observers · Heavily skewed witness counts (median 2, max 20,000, min -10) flag data quality issues that must be handled before any averaging.

Histogram bins for No of observers (median: 2.0).
bin	count
-10 – 490.2	141102
490.2 – 990.5	41
990.5 – 1491	52
1491 – 1991	1
1991 – 2491	8
2491 – 2992	3
2992 – 3492	3
3492 – 3992	0
3992 – 4492	1
4492 – 4992	0
4992 – 5493	7
5493 – 5993	0
5993 – 6493	0
6493 – 6994	0
6994 – 7494	0
7494 – 7994	0
7994 – 8494	0
8494 – 8994	0
8994 – 9495	0
9495 – 9995	0
9995 – 1.05e+04	7
1.05e+04 – 1.1e+04	0
1.1e+04 – 1.15e+04	1
1.15e+04 – 1.2e+04	0
1.2e+04 – 1.25e+04	0
1.25e+04 – 1.3e+04	0
1.3e+04 – 1.35e+04	0
1.35e+04 – 1.4e+04	0
1.4e+04 – 1.45e+04	0
1.45e+04 – 1.5e+04	0
1.5e+04 – 1.55e+04	0
1.55e+04 – 1.6e+04	0
1.6e+04 – 1.65e+04	0
1.65e+04 – 1.7e+04	0
1.7e+04 – 1.75e+04	0
1.75e+04 – 1.8e+04	0
1.8e+04 – 1.85e+04	0
1.85e+04 – 1.9e+04	0
1.9e+04 – 1.95e+04	0
1.95e+04 – 2e+04	3

Explanation · Only ~0.5% of rows have an explanation; among those, Starlink and rocket attributions lead — interpret cautiously given the 99.46% null rate.

Top values for Explanation (20 unique shown, of 58 total).
value	count	share
Starlink - Probable	78	0.1%
Starlink - Certain	69	0.0%
Rocket - Certain	67	0.0%
Balloon - Possible	50	0.0%
Starlink - Possible	49	0.0%
Planet/Star - Possible	42	0.0%
Planet/Star - Probable	41	0.0%
Aircraft - Possible	35	0.0%
Aircraft - Probable	35	0.0%
Camera Anomaly - Probable	33	0.0%
Camera Anomaly - Certain	25	0.0%
Rocket - Probable	24	0.0%
Bird - Possible	21	0.0%
Searchlight - Certain	18	0.0%
Balloon - Probable	18	0.0%
Drone - Possible	17	0.0%
Camera Anomaly - Possible	15	0.0%
Bird - Probable	14	0.0%
Searchlight - Probable	12	0.0%
Rocket - Possible	11	0.0%

Duration · Top duration values cluster on round numbers like '5 minutes', '2 minutes', '10 minutes' — evidence of human estimation rather than precise measurement.

Character-length distribution for Duration (mean: 9.469).
chars	count
1 – 2	784
2 – 3	1006
3 – 4	586
4 – 5	2052
5 – 6	6297
6 – 6	9510
6 – 7	9325
7 – 8	9401
8 – 9	36028
9 – 10	0
10 – 11	37045
11 – 12	10292
12 – 13	3432
13 – 14	5013
14 – 14	1918
14 – 15	1786
15 – 16	1404
16 – 17	751
17 – 18	818
18 – 19	0
19 – 20	598
20 – 21	476
21 – 22	305
22 – 23	343
23 – 24	337
24 – 24	416
24 – 25	818
25 – 26	0
26 – 27	0
27 – 28	0
28 – 29	0
29 – 30	0
30 – 31	0
31 – 32	0
32 – 32	0
32 – 33	0
33 – 34	0
34 – 35	0
35 – 36	0
36 – 37	1

Text · Narrative report length spans 1 to 64,550 characters with a median of 682, useful for sizing any downstream NLP work.

Character-length distribution for Text (mean: 949.9).
chars	count
1 – 1615	111865
1615 – 3228	15329
3228 – 4842	2755
4842 – 6456	757
6456 – 8070	257
8070 – 9683	104
9683 – 11297	60
11297 – 12911	31
12911 – 14525	13
14525 – 16138	13
16138 – 17752	7
17752 – 19366	10
19366 – 20979	4
20979 – 22593	3
22593 – 24207	2
24207 – 25821	2
25821 – 27434	0
27434 – 29048	1
29048 – 30662	0
30662 – 32276	0
32276 – 33889	0
33889 – 35503	1
35503 – 37117	0
37117 – 38730	0
38730 – 40344	0
40344 – 41958	0
41958 – 43572	0
43572 – 45185	0
45185 – 46799	0
46799 – 48413	0
48413 – 50026	0
50026 – 51640	0
51640 – 53254	0
53254 – 54868	0
54868 – 56481	0
56481 – 58095	0
58095 – 59709	0
59709 – 61323	0
61323 – 62936	0
62936 – 64550	1

Schema

13 columns

Per-column summary.
column	type	null rate	unique	alerts
Sighting	numeric	0.0%	147,890
Occurred	text	0.0%	126,264
Location	text	0.0%	37,070	multilingual duplicates
Shape	categorical	4.3%	39
Duration	text	4.8%	15,527	short_text duplicates
No of observers	numeric	4.5%	137	high_skew outliers
Reported	text	0.0%	136,471	multilingual
Posted	categorical	0.0%	626
Characteristics	text	28.1%	1,446	null_rate duplicates
Summary	text	0.6%	144,208	near_unique
Text	text	11.3%	127,124	near_unique
Location details	text	93.1%	9,713	near_unique null_rate
Explanation	categorical	99.5%	58	null_rate

Sighting

numeric identifier

With n_unique equal to n (147890) and zero nulls, this looks like a per-row sighting identifier rather than a measured quantity. The distribution is near-uniform across 111 to 179773 (skew -0.013, kurtosis -1.13, mean 91984 close to median 91435), consistent with a sequential or randomly assigned ID rather than a feature with signal. Treatment: drop from modelling; retain only as a join key. high · anthropic:claude-opus-4-7

n: 147,890
nulls: 0 (0.0%)
unique: 147,890
min: 111
max: 179,773
mean: 9.198e+04
median: 9.143e+04
std: 5.044e+04
q1: 5.015e+04
q3: 1.345e+05
iqr: 8.437e+04
skew: -0.0132
kurtosis: -1.134
n_outliers: 0
outlier_rate: 0
zero_rate: 0

Occurred

text timestamp

Timestamp strings of the form 'YYYY-MM-DD HH:MM:SS Local', with length tightly clustered at 25 characters (mean 24.96, p95 25) and 'local' appearing in roughly 20,000 of 147,890 rows. Duplicate rate is 14.6% (21,626 repeats), and several July 4th evening timestamps recur dozens of times, suggesting event clustering around holidays and round hours (22:00, 21:00, 23:00 dominate). 299 rows contain only the bare token 'Local' (which accounts for the 5-character minimum length), indicating missing date/time portions encoded as a stub rather than nulls. Treatment: Parse to datetime, drop the trailing 'Local' suffix, and treat bare 'Local' entries as missing. high · anthropic:claude-opus-4-7
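
A minimal sketch of the treatment above, using only the Python standard library. The function name `parse_occurred` and its `suffix` parameter are hypothetical, and the format string assumes every non-stub value matches 'YYYY-MM-DD HH:MM:SS Local' as described; anything else is returned as missing rather than guessed at.

```python
from datetime import datetime
from typing import Optional


def parse_occurred(raw: Optional[str], suffix: str = "Local") -> Optional[datetime]:
    """Parse 'YYYY-MM-DD HH:MM:SS <suffix>' into a naive datetime.

    Returns None for nulls, for the bare suffix stub (e.g. 'Local'),
    and for any value that does not match the assumed format.
    """
    if raw is None:
        return None
    text = raw.strip()
    if text.endswith(suffix):
        text = text[: -len(suffix)].strip()
    if not text:  # the value was only the stub token
        return None
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
```

The result is deliberately timezone-naive, since 'Local' refers to the witness's local time and the report does not say which zone that is. Passing `suffix="Pacific"` would handle the Reported column in the same way.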

n: 147,890
nulls: 0 (0.0%)
unique: 126,264
len_min: 5
len_max: 25
len_mean: 24.96
len_median: 25
len_p95: 25
word_mean: 2.996
word_median: 3
n_empty: 0
n_duplicates: 21,626
duplicate_rate: 0.1462
vocab_size: 10,098
readability_flesch_mean: 90.72
emoji_rate: 0
url_rate: 0
one_word_rate: 0.002022
allcaps_rate: 0
boilerplate_rate: 0

Location

text metadata multilingual duplicates

Short city/state/country strings: top values like 'Phoenix, AZ, USA' (770) and the dominant token 'usa' (17,880) make this a geographic location field. Heavy duplication (74.9%, 110,819 duplicate rows against 37,070 unique values) is expected for city-level data, and entries are short (mean 20.4 chars, 3.57 words). The 'multilingual' alert is mostly noise from place names: 4,481 values detected as English with small counts in de (144), es (86), fr (89), and others. Treatment: Parse into city/state/country components and use as categorical geo features. high · anthropic:claude-opus-4-7
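
One way the component split could look, as a sketch only. It assumes the 'City, State, Country' layout seen in the top value 'Phoenix, AZ, USA'; the function name `split_location` is hypothetical, and treating a two-part value as 'City, Country' is a guess that should be checked against the data.

```python
from typing import Dict, Optional


def split_location(raw: Optional[str]) -> Dict[str, Optional[str]]:
    """Split a comma-separated location string into city/state/country."""
    parts = [p.strip() for p in (raw or "").split(",") if p.strip()]
    out: Dict[str, Optional[str]] = {"city": None, "state": None, "country": None}
    if len(parts) >= 3:
        # Anything before the last two components is kept as the city,
        # so extra commas in a place name are not lost.
        out["city"] = ", ".join(parts[:-2])
        out["state"], out["country"] = parts[-2], parts[-1]
    elif len(parts) == 2:
        # Assumption: two components mean 'City, Country'.
        out["city"], out["country"] = parts
    elif len(parts) == 1:
        out["city"] = parts[0]
    return out
```

For example, `split_location("Phoenix, AZ, USA")` returns `{"city": "Phoenix", "state": "AZ", "country": "USA"}`.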

n: 147,890
nulls: 1 (0.0%)
unique: 37,070
len_min: 7
len_max: 92
len_mean: 20.38
len_median: 18
len_p95: 37
word_mean: 3.571
word_median: 3
n_empty: 0
n_duplicates: 110,819
duplicate_rate: 0.7493
vocab_size: 8,750
readability_flesch_mean: 26.72
emoji_rate: 6.762e-06
url_rate: 0
one_word_rate: 6.762e-06
allcaps_rate: 0.003983
boilerplate_rate: 0

Shape

categorical feature

Categorical descriptor of an observed object's shape, with 39 distinct values across 147,890 rows. 'Light' dominates at 19.4% of non-null values (27,494 rows, or 18.6% of all rows), followed by 'Circle' (14,367) and 'Triangle' (13,086). The vocabulary is noticeably noisy: 'Other' (10,062) and 'Unknown' (10,021) together total 20,083 rows, more than any class except 'Light', and near-synonyms like 'Sphere', 'Orb', and 'Circle' coexist. Null rate is 4.29% and an entropy ratio of 0.74 indicates moderately spread mass rather than a single peak. Treatment: Consolidate synonyms (Sphere/Orb/Circle), bucket Other/Unknown as missing, then one-hot encode. high · anthropic:claude-opus-4-7
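
The consolidation and encoding steps might be sketched as follows. The merged label 'Round' and the names `normalize_shape` and `one_hot` are illustrative choices, not part of the dataset or the profiler; whether Sphere, Orb, and Circle really describe the same thing is a modelling decision.

```python
from typing import List, Optional, Tuple

# Assumption: the three near-synonyms collapse into one hypothetical label.
SYNONYMS = {"sphere": "Round", "orb": "Round", "circle": "Round"}
# Values the treatment buckets as missing.
MISSING = {"other", "unknown"}


def normalize_shape(raw: Optional[str]) -> Optional[str]:
    """Map synonyms to one label and Other/Unknown to None."""
    if raw is None:
        return None
    key = raw.strip().lower()
    if not key or key in MISSING:
        return None
    return SYNONYMS.get(key, raw.strip())


def one_hot(values: List[Optional[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted category list and one indicator row per value.

    Missing values become an all-zero row.
    """
    categories = sorted({v for v in values if v is not None})
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows
```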

n: 147,890
nulls: 6,343 (4.3%)
unique: 39
top_value: Light
top_rate: 0.1942
cardinality: 39
entropy: 3.93
entropy_ratio: 0.7436
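
The report does not define entropy_ratio, but the listed figures are consistent with Shannon entropy in bits divided by its maximum, log2 of the cardinality. The sketch below checks that reading against the numbers shown here; the function name is hypothetical.

```python
import math


def entropy_ratio(entropy_bits: float, cardinality: int) -> float:
    """Entropy divided by log2(cardinality), the entropy of a uniform spread."""
    return entropy_bits / math.log2(cardinality)


# Shape: entropy 3.93 over 39 categories.
shape_ratio = entropy_ratio(3.93, 39)
```

Under this reading `shape_ratio` comes out at about 0.7436, matching the listed value; a ratio of 1.0 would mean all 39 shapes were equally common.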

Duration

text feature short_text duplicates

This is a free-text duration field, almost always a number paired with a unit like 'minutes' or 'seconds' (mean length 9.5 chars, word_median 2). Values are highly repetitive: 89% are duplicates and only 15,527 uniques exist across 147,890 rows, with '5 minutes' alone appearing 8,813 times. The vocabulary of 1,757 tokens and presence of variants like 'min' alongside 'minutes' suggest inconsistent unit spellings, and 4.83% are null. Treatment: Parse into a numeric seconds column by normalizing unit tokens (minute/min, second/sec) before modelling. high · anthropic:claude-opus-4-7
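
A hedged sketch of the unit normalization, covering only the simple 'number plus unit' shape. The unit table goes beyond the minute/min and second/sec variants named above (the hour spellings are an assumption), and the name `duration_to_seconds` is hypothetical. Ranges, vague phrases, and bare numbers are returned as missing instead of being guessed.

```python
import re
from typing import Optional

# Seconds per unit; hour spellings are an assumption about the data.
UNIT_SECONDS = {
    "second": 1, "seconds": 1, "sec": 1, "secs": 1,
    "minute": 60, "minutes": 60, "min": 60, "mins": 60,
    "hour": 3600, "hours": 3600, "hr": 3600, "hrs": 3600,
}

# A number, optional whitespace, a unit word, and an optional trailing period.
PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\.?\s*$")


def duration_to_seconds(raw: Optional[str]) -> Optional[float]:
    """Convert strings like '5 minutes' or '30 sec' to seconds."""
    if raw is None:
        return None
    match = PATTERN.match(raw.lower())
    if match is None:
        return None
    factor = UNIT_SECONDS.get(match.group(2))
    if factor is None:
        return None
    return float(match.group(1)) * factor
```

With this sketch '5 minutes' maps to 300.0 seconds, while something like '10-15 minutes' stays missing until a rule for ranges is chosen.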

n: 147,890
nulls: 7,148 (4.8%)
unique: 15,527
len_min: 1
len_max: 37
len_mean: 9.469
len_median: 9
len_p95: 15
word_mean: 2.066
word_median: 2
n_empty: 0
n_duplicates: 125,215
duplicate_rate: 0.8897
vocab_size: 1,757
readability_flesch_mean: 84.41
emoji_rate: 7.105e-06
url_rate: 0
one_word_rate: 0.0895
allcaps_rate: 0.03033
boilerplate_rate: 0

No of observers

numeric feature high_skew outliers

This is a count of observers per record, with a typical value of 1-2 (median 2.0, IQR 1.0) but a mean of 4.6 dragged up by extreme outliers reaching 20000. Skew of 109.3 and kurtosis above 14000 confirm a heavy right tail, and 13.0% of values are flagged as outliers. The minimum of -10.0 is implausible for an observer count and suggests data entry errors, while 6.9% are zero and 4.5% are null. Treatment: Clip negatives, investigate the 20000 tail, and log1p-transform before modelling. high · anthropic:claude-opus-4-7
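
The report does not say how outliers are flagged. If the profiler uses the conventional 1.5 × IQR rule (an assumption), the fences for q1 = 1 and q3 = 2 fall at -0.5 and 3.5, so any report with four or more observers would be flagged. In that case most of the 18,390 flagged values are likely ordinary group sightings, and only the negative and extreme values are suspect. The sketch below shows the fence arithmetic and the clip-then-log1p treatment; both function names are hypothetical.

```python
import math
from typing import Optional, Tuple


def iqr_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper outlier fences under the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean_observers(value: Optional[float]) -> Optional[float]:
    """Clip negative counts to zero, then apply log1p; keep missing as None."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    clipped = max(value, 0)
    return math.log1p(clipped)
```

`iqr_fences(1, 2)` returns `(-0.5, 3.5)`. Clipping follows the stated treatment; setting negative counts to missing instead would avoid mixing them with genuine zeros.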

n: 147,890
nulls: 6,661 (4.5%)
unique: 137
min: -10
max: 20,000
mean: 4.603
median: 2
std: 129.5
q1: 1
q3: 2
iqr: 1
skew: 109.3
kurtosis: 1.433e+04
n_outliers: 18,390
outlier_rate: 0.1302
zero_rate: 0.06919

Reported

text timestamp multilingual

This is a 'Reported' timestamp column stored as text, with every non-null value exactly 27 characters long and following a 'YYYY-MM-DD HH:MM:SS Pacific' pattern. Despite being temporal, it was profiled as text, which explains the spurious 'multilingual' alert (4,649 'en', 336 'de', etc.) from language detection misreading date tokens. Duplicates are limited (7.7%, 11,418 rows), and the 136,471 unique values imply that the time component usually varies; 1999-11-16 and 1999-11-17 are the heaviest days at 73 and 69 occurrences. Treatment: Parse to datetime (strip the ' Pacific' suffix) and discard the language alert. high · anthropic:claude-opus-4-7

n: 147,890
nulls: 1 (0.0%)
unique: 136,471
len_min: 27
len_max: 27
len_mean: 27
len_median: 27
len_p95: 27
word_mean: 3
word_median: 3
n_empty: 0
n_duplicates: 11,418
duplicate_rate: 0.07721
vocab_size: 24,077
readability_flesch_mean: 62.79
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0

Posted

categorical timestamp

`Posted` holds 626 unique date-stamped values (all at 00:00:00) across 147,890 rows with a single null, so it functions as a posting date rather than a precise timestamp. The distribution is broad and high-entropy (entropy ratio 0.93); the most common date, 2020-06-25, accounts for only 1.24% of rows, and the top 10 dates span 1999 to 2023 with no single period standing out. Worth noting: there are far fewer distinct dates (626) than would be expected if rows were spread daily across two-plus decades, suggesting reports are posted in batches on specific days. Treatment: Parse to date and derive year/month/day-of-week features before modelling. high · anthropic:claude-opus-4-7
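
A small sketch of the feature derivation, assuming every non-null value has the 'YYYY-MM-DD 00:00:00' shape of the listed top value. The function name `posted_features` is hypothetical, and day_of_week follows Python's convention of Monday = 0.

```python
from datetime import datetime
from typing import Dict, Optional


def posted_features(raw: Optional[str]) -> Optional[Dict[str, int]]:
    """Derive year, month, and day-of-week from a Posted string."""
    if raw is None:
        return None
    day = datetime.strptime(raw.strip(), "%Y-%m-%d %H:%M:%S").date()
    return {"year": day.year, "month": day.month, "day_of_week": day.weekday()}
```

For the most common value, `posted_features("2020-06-25 00:00:00")` gives year 2020, month 6, and day_of_week 3 (a Thursday).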

n: 147,890
nulls: 1 (0.0%)
unique: 626
top_value: 2020-06-25 00:00:00
top_rate: 0.01239
cardinality: 626
entropy: 8.644
entropy_ratio: 0.9304

Characteristics

text feature null_rate duplicates

This is a categorical multi-label feature describing observed object characteristics (e.g., 'Lights on object', 'Aura or haze around object'), encoded as comma-joined tags rather than free text. Across 147,890 rows there are only 1,446 unique values built from a vocabulary of 43 words, with a 98.6% duplicate rate among non-null values and 28.1% nulls. Note the truncated token 'Changed Colo' appearing repeatedly, suggesting an upstream string-truncation bug. Treatment: Split on comma and one-hot encode the tag set; fix the 'Changed Colo' truncation before encoding. high · anthropic:claude-opus-4-7
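
A sketch of the split-and-encode step. Because each row can carry several tags, the encoding here is multi-hot (one indicator per tag) instead of strict one-hot. Repairing 'Changed Colo' to 'Changed Color' is an assumption about the intended tag, and `split_tags` and `multi_hot` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: the truncated tag was meant to read 'Changed Color'.
DEFAULT_FIXES = {"Changed Colo": "Changed Color"}


def split_tags(raw: Optional[str], fixes: Optional[Dict[str, str]] = None) -> List[str]:
    """Split a comma-joined tag string and repair known truncated tags."""
    fixes = DEFAULT_FIXES if fixes is None else fixes
    if raw is None:
        return []
    tags = [t.strip() for t in raw.split(",") if t.strip()]
    return [fixes.get(t, t) for t in tags]


def multi_hot(rows: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted tag vocabulary and one indicator row per record."""
    vocab = sorted({tag for tags in rows for tag in tags})
    matrix = [[int(tag in tags) for tag in vocab] for tags in rows]
    return vocab, matrix
```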

n: 147,890
nulls: 41,631 (28.1%)
unique: 1,446
len_min: 6
len_max: 251
len_mean: 33.27
len_median: 26
len_p95: 76
word_mean: 5.613
word_median: 5
n_empty: 0
n_duplicates: 104,813
duplicate_rate: 0.9864
vocab_size: 43
readability_flesch_mean: 69.46
emoji_rate: 0
url_rate: 0
one_word_rate: 0.00384
allcaps_rate: 0
boilerplate_rate: 0

Summary

text free_text near_unique

Free-form English text summaries, near-unique across 147,890 rows (144,208 distinct), with a median of 76 characters / 13 words but a long tail reaching 10,624 characters. Top tokens are stopwords plus 'light' (4,676), and readability averages 67.3 Flesch, suggesting conversational prose rather than structured codes. Surprises: 2,791 exact duplicates (1.9%) despite the near-unique flag, 1.8% all-caps entries, and a 28,632-word vocabulary — emojis and URLs are negligible. Treatment: Tokenize and embed (or TF-IDF) before modelling; dedupe the 2,791 exact repeats first. high · anthropic:claude-opus-4-7

n: 147,890
nulls: 891 (0.6%)
unique: 144,208
len_min: 1
len_max: 10,624
len_mean: 134.9
len_median: 76
len_p95: 479
word_mean: 25.06
word_median: 13
n_empty: 0
n_duplicates: 2,791
duplicate_rate: 0.01899
vocab_size: 28,632
readability_flesch_mean: 67.28
emoji_rate: 0.0001088
url_rate: 0.0007755
one_word_rate: 0.002327
allcaps_rate: 0.01786
boilerplate_rate: 0.0008708

Text

text free_text near_unique

Free-text field averaging 950 characters and 182 words per entry, with 127,124 unique values across 147,890 rows and a 65,106-token vocabulary, consistent with longish English prose (mean Flesch 69.6, fairly readable). First-person pronouns 'i' and 'my' rank in the top 10, which fits first-person witness narratives rather than formal documents. Notable signals: 11.28% null rate, 3.12% exact duplicates (4,091 rows), and a max length of 64,550 characters versus a p95 of 2,673, pointing to a small set of extreme outliers. Treatment: Tokenize and embed before modelling; dedupe the 4,091 exact repeats and clip extreme-length outliers first. high · anthropic:claude-opus-4-7
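
The dedupe-and-clip preprocessing might look like the sketch below. Using the p95 length of 2,673 characters as the default cut-off is an assumption for illustration, and character truncation is only one way to 'clip'; the name `dedupe_and_clip` is hypothetical.

```python
from typing import Iterable, List, Optional


def dedupe_and_clip(texts: Iterable[Optional[str]], max_chars: int = 2673) -> List[str]:
    """Drop nulls and exact repeats (keeping first occurrences), then truncate."""
    seen = set()
    kept: List[str] = []
    for text in texts:
        if text is None or text in seen:
            continue
        seen.add(text)  # compare on the full text, before truncation
        kept.append(text[:max_chars])
    return kept
```

Duplicates are detected on the untruncated text so that two long reports sharing the same opening are not merged by accident.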

n: 147,890
nulls: 16,675 (11.3%)
unique: 127,124
len_min: 1
len_max: 64,550
len_mean: 949.9
len_median: 682
len_p95: 2,673
word_mean: 181.6
word_median: 131
n_empty: 0
n_duplicates: 4,091
duplicate_rate: 0.03118
vocab_size: 65,106
readability_flesch_mean: 69.56
emoji_rate: 0.0001981
url_rate: 0.01805
one_word_rate: 0.0006783
allcaps_rate: 0.007217
boilerplate_rate: 0.001875

Location details

text free_text near_unique null_rate

Free-text location descriptions, populated for fewer than 7% of rows (null_rate 0.931) and otherwise short prose averaging 38.7 characters / 6.9 words with a Flesch readability of 69.8. Among the 10,205 non-null entries there are 9,713 distinct strings and only 492 duplicates (4.8%), so values are near-unique and unlikely to act as a category. Top tokens are stopwords ('the', 'of', 'in', 'my', 'over'), confirming sentence-style annotations rather than place names or codes. Treatment: Treat as optional free-text annotation; tokenize/embed if used, otherwise drop given the 93% null rate. high · anthropic:claude-opus-4-7

n: 147,890
nulls: 137,685 (93.1%)
unique: 9,713
len_min: 1
len_max: 197
len_mean: 38.75
len_median: 33
len_p95: 92
word_mean: 6.93
word_median: 6
n_empty: 0
n_duplicates: 492
duplicate_rate: 0.04821
vocab_size: 12,384
readability_flesch_mean: 69.85
emoji_rate: 0
url_rate: 0.001078
one_word_rate: 0.03822
allcaps_rate: 0.0197
boilerplate_rate: 0.000196

Explanation

categorical label null_rate

This is a categorical 'Explanation' field labeling probable causes for sightings (e.g., 'Starlink - Probable', 'Rocket - Certain', 'Balloon - Possible'), with 58 distinct values combining a source and a confidence tier. It is almost entirely empty: null_rate is 0.9946, leaving only a small annotated subset where the top label 'Starlink - Probable' appears 78 times (9.7% of non-nulls). Entropy ratio of 0.82 indicates labels are spread fairly evenly across categories rather than dominated by one. Treatment: Treat as a sparse annotation; split into source and confidence sub-labels and analyze only the ~0.5% labeled rows. high · anthropic:claude-opus-4-7
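
Splitting the label into its two parts could be sketched like this, assuming every label uses ' - ' as the separator between source and confidence tier, as the listed values do. The function name `split_explanation` is hypothetical.

```python
from typing import Optional, Tuple


def split_explanation(raw: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split 'Source - Confidence' on the last ' - ' separator."""
    if raw is None:
        return (None, None)
    source, separator, confidence = raw.rpartition(" - ")
    if not separator:
        # No separator found: keep the whole label as the source.
        return (raw.strip(), None)
    return (source.strip(), confidence.strip())
```

For example, `split_explanation("Starlink - Probable")` returns `("Starlink", "Probable")`. Splitting on the last separator keeps any hyphenated source name intact.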

n: 147,890
nulls: 147,087 (99.5%)
unique: 58
top_value: Starlink - Probable
top_rate: 0.09714
cardinality: 58
entropy: 4.832
entropy_ratio: 0.8248