quirky ufo sightings 20260121
Reading
This dataset contains 147,890 UFO sighting reports across 13 columns, mixing free-text descriptions (Summary, Text, Location details), structured categoricals (Shape, Explanation), timestamps (Occurred, Reported, Posted), and a numeric witness count. The Shape field is a clean place to start: 39 categories with 'Light' leading at 27,494 sightings, followed by Circle and Triangle. Two things deserve a closer look. First, 'No of observers' is extremely skewed: values run from -10 to 20,000 with a median of 2 and 18,390 flagged outliers, and the negative and five-figure counts suggest data-entry errors that need cleaning before any aggregation. Second, the Explanation column is 99.46% null, so claims about 'what UFOs really were' rest on only 803 labelled rows, led by Starlink and rocket attributions. Location is almost fully populated and US-heavy (Phoenix, Seattle, Las Vegas top the list), and the Characteristics field collapses to a 43-token vocabulary dominated by 'Lights on object'.
citing: Shape.top_values · Shape.cardinality · No of observers.skew · No of observers.max · No of observers.min · No of observers.median · No of observers.n_outliers · Explanation.null_rate · Explanation.top_values · Location.top_values · Characteristics.top_values · Characteristics.vocab_size · Duration.top_values · row_count · column_count
Charts the summary said to look at first
Shape: top values
| value | count | share |
|---|---|---|
| Light | 27494 | 18.6% |
| Circle | 14367 | 9.7% |
| Triangle | 13086 | 8.8% |
| Other | 10062 | 6.8% |
| Unknown | 10021 | 6.8% |
| Fireball | 9880 | 6.7% |
| Disk | 8716 | 5.9% |
| Sphere | 7652 | 5.2% |
| Oval | 6369 | 4.3% |
| Orb | 5924 | 4.0% |
| Formation | 4864 | 3.3% |
| Changing | 3987 | 2.7% |
| Cigar | 3753 | 2.5% |
| Rectangle | 2610 | 1.8% |
| Cylinder | 2482 | 1.7% |
| Flash | 2439 | 1.6% |
| Diamond | 2116 | 1.4% |
| Chevron | 1742 | 1.2% |
| Egg | 1289 | 0.9% |
| Teardrop | 1238 | 0.8% |
No of observers: histogram
| bin | count |
|---|---|
| -10 – 490.2 | 141102 |
| 490.2 – 990.5 | 41 |
| 990.5 – 1491 | 52 |
| 1491 – 1991 | 1 |
| 1991 – 2491 | 8 |
| 2491 – 2992 | 3 |
| 2992 – 3492 | 3 |
| 3492 – 3992 | 0 |
| 3992 – 4492 | 1 |
| 4492 – 4992 | 0 |
| 4992 – 5493 | 7 |
| 5493 – 5993 | 0 |
| 5993 – 6493 | 0 |
| 6493 – 6994 | 0 |
| 6994 – 7494 | 0 |
| 7494 – 7994 | 0 |
| 7994 – 8494 | 0 |
| 8494 – 8994 | 0 |
| 8994 – 9495 | 0 |
| 9495 – 9995 | 0 |
| 9995 – 1.05e+04 | 7 |
| 1.05e+04 – 1.1e+04 | 0 |
| 1.1e+04 – 1.15e+04 | 1 |
| 1.15e+04 – 1.2e+04 | 0 |
| 1.2e+04 – 1.25e+04 | 0 |
| 1.25e+04 – 1.3e+04 | 0 |
| 1.3e+04 – 1.35e+04 | 0 |
| 1.35e+04 – 1.4e+04 | 0 |
| 1.4e+04 – 1.45e+04 | 0 |
| 1.45e+04 – 1.5e+04 | 0 |
| 1.5e+04 – 1.55e+04 | 0 |
| 1.55e+04 – 1.6e+04 | 0 |
| 1.6e+04 – 1.65e+04 | 0 |
| 1.65e+04 – 1.7e+04 | 0 |
| 1.7e+04 – 1.75e+04 | 0 |
| 1.75e+04 – 1.8e+04 | 0 |
| 1.8e+04 – 1.85e+04 | 0 |
| 1.85e+04 – 1.9e+04 | 0 |
| 1.9e+04 – 1.95e+04 | 0 |
| 1.95e+04 – 2e+04 | 3 |
Explanation: top values
| value | count | share |
|---|---|---|
| Starlink - Probable | 78 | 0.1% |
| Starlink - Certain | 69 | 0.0% |
| Rocket - Certain | 67 | 0.0% |
| Balloon - Possible | 50 | 0.0% |
| Starlink - Possible | 49 | 0.0% |
| Planet/Star - Possible | 42 | 0.0% |
| Planet/Star - Probable | 41 | 0.0% |
| Aircraft - Possible | 35 | 0.0% |
| Aircraft - Probable | 35 | 0.0% |
| Camera Anomaly - Probable | 33 | 0.0% |
| Camera Anomaly - Certain | 25 | 0.0% |
| Rocket - Probable | 24 | 0.0% |
| Bird - Possible | 21 | 0.0% |
| Searchlight - Certain | 18 | 0.0% |
| Balloon - Probable | 18 | 0.0% |
| Drone - Possible | 17 | 0.0% |
| Camera Anomaly - Possible | 15 | 0.0% |
| Bird - Probable | 14 | 0.0% |
| Searchlight - Probable | 12 | 0.0% |
| Rocket - Possible | 11 | 0.0% |
Duration: length in characters
| chars | count |
|---|---|
| 1 – 2 | 784 |
| 2 – 3 | 1006 |
| 3 – 4 | 586 |
| 4 – 5 | 2052 |
| 5 – 6 | 6297 |
| 6 – 6 | 9510 |
| 6 – 7 | 9325 |
| 7 – 8 | 9401 |
| 8 – 9 | 36028 |
| 9 – 10 | 0 |
| 10 – 11 | 37045 |
| 11 – 12 | 10292 |
| 12 – 13 | 3432 |
| 13 – 14 | 5013 |
| 14 – 14 | 1918 |
| 14 – 15 | 1786 |
| 15 – 16 | 1404 |
| 16 – 17 | 751 |
| 17 – 18 | 818 |
| 18 – 19 | 0 |
| 19 – 20 | 598 |
| 20 – 21 | 476 |
| 21 – 22 | 305 |
| 22 – 23 | 343 |
| 23 – 24 | 337 |
| 24 – 24 | 416 |
| 24 – 25 | 818 |
| 25 – 26 | 0 |
| 26 – 27 | 0 |
| 27 – 28 | 0 |
| 28 – 29 | 0 |
| 29 – 30 | 0 |
| 30 – 31 | 0 |
| 31 – 32 | 0 |
| 32 – 32 | 0 |
| 32 – 33 | 0 |
| 33 – 34 | 0 |
| 34 – 35 | 0 |
| 35 – 36 | 0 |
| 36 – 37 | 1 |
Text: length in characters
| chars | count |
|---|---|
| 1 – 1615 | 111865 |
| 1615 – 3228 | 15329 |
| 3228 – 4842 | 2755 |
| 4842 – 6456 | 757 |
| 6456 – 8070 | 257 |
| 8070 – 9683 | 104 |
| 9683 – 11297 | 60 |
| 11297 – 12911 | 31 |
| 12911 – 14525 | 13 |
| 14525 – 16138 | 13 |
| 16138 – 17752 | 7 |
| 17752 – 19366 | 10 |
| 19366 – 20979 | 4 |
| 20979 – 22593 | 3 |
| 22593 – 24207 | 2 |
| 24207 – 25821 | 2 |
| 25821 – 27434 | 0 |
| 27434 – 29048 | 1 |
| 29048 – 30662 | 0 |
| 30662 – 32276 | 0 |
| 32276 – 33889 | 0 |
| 33889 – 35503 | 1 |
| 35503 – 37117 | 0 |
| 37117 – 38730 | 0 |
| 38730 – 40344 | 0 |
| 40344 – 41958 | 0 |
| 41958 – 43572 | 0 |
| 43572 – 45185 | 0 |
| 45185 – 46799 | 0 |
| 46799 – 48413 | 0 |
| 48413 – 50026 | 0 |
| 50026 – 51640 | 0 |
| 51640 – 53254 | 0 |
| 53254 – 54868 | 0 |
| 54868 – 56481 | 0 |
| 56481 – 58095 | 0 |
| 58095 – 59709 | 0 |
| 59709 – 61323 | 0 |
| 61323 – 62936 | 0 |
| 62936 – 64550 | 1 |
Schema
13 columns

| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| Sighting | numeric | 0.0% | 147,890 | |
| Occurred | text | 0.0% | 126,264 | |
| Location | text | 0.0% | 37,070 | multilingual, duplicates |
| Shape | categorical | 4.3% | 39 | |
| Duration | text | 4.8% | 15,527 | short_text, duplicates |
| No of observers | numeric | 4.5% | 137 | high_skew, outliers |
| Reported | text | 0.0% | 136,471 | multilingual |
| Posted | categorical | 0.0% | 626 | |
| Characteristics | text | 28.1% | 1,446 | null_rate, duplicates |
| Summary | text | 0.6% | 144,208 | near_unique |
| Text | text | 11.3% | 127,124 | near_unique |
| Location details | text | 93.1% | 9,713 | near_unique, null_rate |
| Explanation | categorical | 99.5% | 58 | null_rate |
Sighting
numeric identifier
With n_unique equal to n (147,890) and zero nulls, this looks like a per-row sighting identifier rather than a measured quantity. The distribution is near-uniform across 111 to 179,773 (skew -0.013, kurtosis -1.13, mean 91,984 close to median 91,435), consistent with a sequential or randomly assigned ID rather than a feature with signal. Treatment: drop from modelling; retain only as a join key.
- n: 147,890
- nulls: 0 (0.0%)
- unique: 147,890
- min: 111
- max: 179,773
- mean: 9.198e+04
- median: 9.143e+04
- std: 5.044e+04
- q1: 5.015e+04
- q3: 1.345e+05
- iqr: 8.437e+04
- skew: -0.0132
- kurtosis: -1.134
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
Occurred
text timestamp
Timestamp strings of the form 'YYYY-MM-DD HH:MM:SS Local', with length tightly clustered at 25 characters (mean 24.96, p95 25) and 'local' appearing in roughly 20,000 of 147,890 rows. Duplicate rate is 14.6% (21,626 repeats), and several July 4th evening timestamps recur dozens of times, suggesting event clustering around holidays and round hours (22:00, 21:00, 23:00 dominate). 299 rows contain only the bare token 'Local', indicating missing date/time portions encoded as a stub rather than nulls. Treatment: parse to datetime, drop the trailing 'Local' suffix, and treat bare 'Local' entries as missing.
- n: 147,890
- nulls: 0 (0.0%)
- unique: 126,264
- len_min: 5
- len_max: 25
- len_mean: 24.96
- len_median: 25
- len_p95: 25
- word_mean: 2.996
- word_median: 3
- n_empty: 0
- n_duplicates: 21,626
- duplicate_rate: 0.1462
- vocab_size: 10,098
- readability_flesch_mean: 90.72
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.002022
- allcaps_rate: 0
- boilerplate_rate: 0
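The Occurred treatment can be sketched in pandas as follows. This is a minimal sketch, assuming the column has been loaded as strings; the function name `parse_occurred` and the sample values are hypothetical and only illustrate the pattern described above.

```python
import pandas as pd

def parse_occurred(raw: pd.Series) -> pd.Series:
    """Parse 'YYYY-MM-DD HH:MM:SS Local' strings into naive datetimes."""
    # Remove the trailing 'Local' marker; a bare 'Local' becomes an empty string.
    stripped = raw.str.replace(r"\s*Local$", "", regex=True).str.strip()
    # errors="coerce" turns empty or malformed strings into NaT (missing).
    return pd.to_datetime(stripped, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Hypothetical sample: one full timestamp, one bare stub, one null.
sample = pd.Series(["2021-07-04 22:00:00 Local", "Local", None])
occurred = parse_occurred(sample)
```

Coercing to NaT keeps the 299 stub rows in the frame while making them visible to `isna()`, which seems safer than dropping them silently.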
Location
text metadata multilingual duplicates
Short city/state/country strings: top values like 'Phoenix, AZ, USA' (770) and dominant token 'usa' (17,880) make this a geographic location field. Heavy duplication (74.9%, 110,819 rows across 37,070 uniques) is expected for city-level data, and entries are short (mean 20.4 chars, 3.57 words). The 'multilingual' alert is mostly noise from place names: 4,481 detected as English with small counts in de (144), es (86), fr (89), and others. Treatment: parse into city/state/country components and use as categorical geo features.
- n: 147,890
- nulls: 1 (0.0%)
- unique: 37,070
- len_min: 7
- len_max: 92
- len_mean: 20.38
- len_median: 18
- len_p95: 37
- word_mean: 3.571
- word_median: 3
- n_empty: 0
- n_duplicates: 110,819
- duplicate_rate: 0.7493
- vocab_size: 8,750
- readability_flesch_mean: 26.72
- emoji_rate: 6.762e-06
- url_rate: 0
- one_word_rate: 6.762e-06
- allcaps_rate: 0.003983
- boilerplate_rate: 0
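One way to split Location into components is sketched below. It assumes comma-separated values with the country last and the state second to last, as in 'Phoenix, AZ, USA'; the handling of two-part strings and the 'London, United Kingdom' sample are assumptions, since the profile only shows US-style top values.

```python
import pandas as pd

def split_one(value):
    """Return (city, state, country) for one location string."""
    if not isinstance(value, str):
        return (None, None, None)
    parts = [p.strip() for p in value.split(",")]
    # Assumption: the last part is the country when at least two parts exist.
    country = parts[-1] if len(parts) >= 2 else None
    # Assumption: a state/region is present only when there are 3+ parts.
    state = parts[-2] if len(parts) >= 3 else None
    city = ", ".join(parts[:-2]) if len(parts) >= 3 else parts[0]
    return (city, state, country)

def split_location(raw: pd.Series) -> pd.DataFrame:
    rows = [split_one(v) for v in raw]
    return pd.DataFrame(rows, columns=["city", "state", "country"], index=raw.index)

sample = pd.Series(["Phoenix, AZ, USA", "London, United Kingdom", None])
geo = split_location(sample)
```

With 37,070 distinct strings, the city column will remain high-cardinality, so state and country are likely the more practical categorical features.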
Shape
categorical feature
Categorical descriptor of an observed object's shape, with 39 distinct values across 147,890 rows. 'Light' dominates at 19.4% of non-null rows (27,494), followed by 'Circle' (14,367) and 'Triangle' (13,086), and the vocabulary is noticeably noisy: 'Other' (10,062) and 'Unknown' (10,021) together approach the size of the top class, and near-synonyms like 'Sphere', 'Orb', and 'Circle' coexist. Null rate is 4.29% and an entropy ratio of 0.74 indicates moderately spread mass rather than a single peak. Treatment: consolidate synonyms (Sphere/Orb/Circle), bucket Other/Unknown as missing, then one-hot encode.
- n: 147,890
- nulls: 6,343 (4.3%)
- unique: 39
- top_value: Light
- top_rate: 0.1942
- cardinality: 39
- entropy: 3.93
- entropy_ratio: 0.7436
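The Shape treatment might look like the sketch below. Folding 'Sphere' and 'Orb' into 'Circle' is an assumption about which label should survive; the profile only says the three are near-synonyms. The names `SYNONYMS`, `NON_ANSWERS`, and `encode_shape` are hypothetical.

```python
import pandas as pd

# Assumption: 'Circle' is kept as the canonical label for the synonym group.
SYNONYMS = {"Sphere": "Circle", "Orb": "Circle"}
# Labels that carry no shape information and are treated as missing.
NON_ANSWERS = ["Other", "Unknown"]

def encode_shape(raw: pd.Series) -> pd.DataFrame:
    shape = raw.replace(SYNONYMS)
    # mask() sets the matching rows to NaN, so they join the existing nulls.
    shape = shape.mask(shape.isin(NON_ANSWERS))
    # Missing values produce an all-zero row because dummy_na defaults to False.
    return pd.get_dummies(shape, prefix="shape", dtype=int)

sample = pd.Series(["Light", "Orb", "Sphere", "Unknown", None, "Triangle"])
encoded = encode_shape(sample)
```

Treating Other/Unknown as missing raises the effective null rate well above 4.3%, which is worth keeping in mind when comparing class shares before and after cleaning.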
Duration
text feature short_text duplicates
This is a free-text duration field, almost always a number paired with a unit like 'minutes' or 'seconds' (mean length 9.5 chars, word_median 2). Values are highly repetitive: 89% are duplicates and only 15,527 uniques exist across 147,890 rows, with '5 minutes' alone appearing 8,813 times. The vocabulary of 1,757 tokens and presence of variants like 'min' alongside 'minutes' suggest inconsistent unit spellings, and 4.83% are null. Treatment: parse into a numeric seconds column by normalizing unit tokens (minute/min, second/sec) before modelling.
- n: 147,890
- nulls: 7,148 (4.8%)
- unique: 15,527
- len_min: 1
- len_max: 37
- len_mean: 9.469
- len_median: 9
- len_p95: 15
- word_mean: 2.066
- word_median: 2
- n_empty: 0
- n_duplicates: 125,215
- duplicate_rate: 0.8897
- vocab_size: 1,757
- readability_flesch_mean: 84.41
- emoji_rate: 7.105e-06
- url_rate: 0
- one_word_rate: 0.0895
- allcaps_rate: 0.03033
- boilerplate_rate: 0
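A rough sketch of the Duration parse follows. It assumes each value contains a number followed by a unit that starts with 'sec', 'min', 'hour', or 'hr'; the unit list and the function name `duration_to_seconds` are assumptions, and anything that does not match is left as missing rather than guessed.

```python
import pandas as pd

# Assumed unit prefixes and their length in seconds.
UNIT_SECONDS = {"sec": 1.0, "min": 60.0, "hour": 3600.0, "hr": 3600.0}

def duration_to_seconds(raw: pd.Series) -> pd.Series:
    # Capture the first "<number> <unit prefix>" pair in the lowercased string.
    pattern = r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>sec|min|hour|hr)"
    extracted = raw.str.lower().str.extract(pattern)
    value = pd.to_numeric(extracted["value"], errors="coerce")
    factor = extracted["unit"].map(UNIT_SECONDS)
    # Rows with no match have NaN in both parts, so the product stays NaN.
    return value * factor

sample = pd.Series(["5 minutes", "30 sec", "1 hour", "unknown", None])
seconds = duration_to_seconds(sample)
```

Matching on unit prefixes covers 'min' and 'minutes' with one rule, but ranges and spelled-out numbers would need extra handling.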
No of observers
numeric feature high_skew outliers
This is a count of observers per record, with a typical value of 1-2 (median 2.0, IQR 1.0) but a mean of 4.6 dragged up by extreme outliers reaching 20,000. Skew of 109.3 and kurtosis above 14,000 confirm a heavy right tail, and 13.0% of values are flagged as outliers. The minimum of -10.0 is implausible for an observer count and suggests data entry errors, while 6.9% are zero and 4.5% are null. Treatment: clip negatives, investigate the 20,000 tail, and log1p-transform before modelling.
- n: 147,890
- nulls: 6,661 (4.5%)
- unique: 137
- min: -10
- max: 20,000
- mean: 4.603
- median: 2
- std: 129.5
- q1: 1
- q3: 2
- iqr: 1
- skew: 109.3
- kurtosis: 1.433e+04
- n_outliers: 18,390
- outlier_rate: 0.1302
- zero_rate: 0.06919
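The observer-count treatment could be sketched as below. The `cap` threshold used to flag the upper tail for review is a hypothetical value chosen for illustration, not something the profile specifies; flagged rows are kept so they can be inspected rather than silently altered.

```python
import numpy as np
import pandas as pd

def clean_observers(raw: pd.Series, cap: float = 1000.0) -> pd.DataFrame:
    counts = pd.to_numeric(raw, errors="coerce")
    # Negative counts are impossible, so raise them to zero.
    clipped = counts.clip(lower=0)
    return pd.DataFrame({
        "observers": clipped,
        # Flag originally negative values and the extreme tail for manual review.
        "observers_suspect": counts.lt(0) | counts.gt(cap),
        # log1p compresses the right tail and is defined at zero.
        "observers_log1p": np.log1p(clipped),
    })

sample = pd.Series([-10, 0, 2, 20000, None])
cleaned = clean_observers(sample)
```

An alternative to clipping is to set negative values to missing, which avoids inflating the 6.9% zero rate with entries that were never valid.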
Reported
text timestamp multilingual
This is a 'Reported' timestamp column stored as text, with every non-null value (147,889 of 147,890 rows) exactly 27 characters long and following a 'YYYY-MM-DD HH:MM:SS Pacific' pattern; the 136,471 distinct values show that the time of day varies, so this is not a date-only field. Despite being temporal, it was profiled as text, which explains the spurious 'multilingual' alert (4,649 'en', 336 'de', etc.) from language detection misreading date tokens. Duplicates are modest (7.7%, 11,418 rows), with 1999-11-16 and 1999-11-17 the heaviest days at 73 and 69 occurrences. Treatment: parse to datetime (strip the ' Pacific' suffix) and discard the language alert.
- n: 147,890
- nulls: 1 (0.0%)
- unique: 136,471
- len_min: 27
- len_max: 27
- len_mean: 27
- len_median: 27
- len_p95: 27
- word_mean: 3
- word_median: 3
- n_empty: 0
- n_duplicates: 11,418
- duplicate_rate: 0.07721
- vocab_size: 24,077
- readability_flesch_mean: 62.79
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
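A minimal sketch of the Reported parse, assuming every populated value ends in the literal suffix ' Pacific'. The function name `parse_reported` and the sample values are hypothetical.

```python
import pandas as pd

def parse_reported(raw: pd.Series) -> pd.Series:
    # removesuffix() only strips an exact trailing match and leaves nulls alone.
    stripped = raw.str.removesuffix(" Pacific")
    # Anything that does not fit the fixed 19-character layout becomes NaT.
    return pd.to_datetime(stripped, format="%Y-%m-%d %H:%M:%S", errors="coerce")

sample = pd.Series(["1999-11-16 00:00:00 Pacific", "2020-06-25 14:30:05 Pacific", None])
reported = parse_reported(sample)
```

The result is timezone-naive. Attaching a zone would require deciding what 'Pacific' means for each record, which the profile does not settle.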
Posted
categorical timestamp
`Posted` holds 626 unique date-stamped values (all at 00:00:00) across 147,890 rows with a single null, so it functions as a posting date rather than a precise timestamp. The distribution is broad and high-entropy (entropy ratio 0.93); the most common date, 2020-06-25, accounts for only 1.24% of rows, and the top 10 dates span 1999 to 2023 with no single period dominating. Worth noting: there are far fewer distinct dates (626) than would be expected if rows were spread daily across two-plus decades, suggesting reports are posted in batches on specific days. Treatment: parse to date and derive year/month/day-of-week features before modelling.
- n: 147,890
- nulls: 1 (0.0%)
- unique: 626
- top_value: 2020-06-25 00:00:00
- top_rate: 0.01239
- cardinality: 626
- entropy: 8.644
- entropy_ratio: 0.9304
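The Posted treatment can be sketched as a small feature builder. The output column names (`posted_year`, `posted_month`, `posted_dow`) are hypothetical; the input format follows the top value shown above.

```python
import pandas as pd

def posted_features(raw: pd.Series) -> pd.DataFrame:
    posted = pd.to_datetime(raw, errors="coerce")
    return pd.DataFrame({
        "posted_year": posted.dt.year,
        "posted_month": posted.dt.month,
        # dayofweek counts from Monday = 0 to Sunday = 6.
        "posted_dow": posted.dt.dayofweek,
    })

sample = pd.Series(["2020-06-25 00:00:00", None])
features = posted_features(sample)
```

Because posts appear to arrive in batches, these features describe the publishing schedule more than the sighting itself, so they are probably best paired with the parsed Occurred timestamp.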
Characteristics
text feature null_rate duplicates
This is a categorical multi-label feature describing observed object characteristics (e.g., 'Lights on object', 'Aura or haze around object'), encoded as comma-joined tags rather than free text. Despite 147,890 rows, only 1,446 unique values and a vocab of 43 words exist, with a 98.6% duplicate rate among populated rows and 28.1% nulls. Note the truncated token 'Changed Colo' appearing repeatedly, suggesting an upstream string-truncation bug. Treatment: split on comma and one-hot encode the tag set; fix the 'Changed Colo' truncation before encoding.
- n: 147,890
- nulls: 41,631 (28.1%)
- unique: 1,446
- len_min: 6
- len_max: 251
- len_mean: 33.27
- len_median: 26
- len_p95: 76
- word_mean: 5.613
- word_median: 5
- n_empty: 0
- n_duplicates: 104,813
- duplicate_rate: 0.9864
- vocab_size: 43
- readability_flesch_mean: 69.46
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.00384
- allcaps_rate: 0
- boilerplate_rate: 0
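A sketch of the Characteristics encoding is given below. It assumes 'Changed Colo' is a truncation of 'Changed Color' and that tags are separated by commas with optional surrounding spaces; both are assumptions inferred from the description, and `encode_characteristics` is a hypothetical name.

```python
import pandas as pd

def encode_characteristics(raw: pd.Series) -> pd.DataFrame:
    # Assumption: the truncated tag should read 'Changed Color'.
    # The negative lookahead leaves already-complete tags untouched.
    repaired = raw.str.replace(r"Changed Colo(?!r)", "Changed Color", regex=True)
    # Normalise separators so get_dummies sees clean tag names.
    normalized = repaired.str.replace(r"\s*,\s*", ",", regex=True).str.strip()
    # One indicator column per tag; null rows become all zeros.
    return normalized.str.get_dummies(sep=",")

sample = pd.Series([
    "Lights on object, Changed Colo",
    "Aura or haze around object",
    None,
    "Changed Color",
])
tags = encode_characteristics(sample)
```

Since null rows encode as all zeros, a separate missing-value indicator may be useful to distinguish "no tags recorded" from "no characteristics observed".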
Summary
text free_text near_unique
Free-form English text summaries, near-unique across 147,890 rows (144,208 distinct), with a median of 76 characters / 13 words but a long tail reaching 10,624 characters. Top tokens are stopwords plus 'light' (4,676), and readability averages 67.3 Flesch, suggesting conversational prose rather than structured codes. Surprises: 2,791 exact duplicates (1.9%) despite the near-unique flag, 1.8% all-caps entries, and a 28,632-word vocabulary; emojis and URLs are negligible. Treatment: tokenize and embed (or TF-IDF) before modelling; dedupe the 2,791 exact repeats first.
- n: 147,890
- nulls: 891 (0.6%)
- unique: 144,208
- len_min: 1
- len_max: 10,624
- len_mean: 134.9
- len_median: 76
- len_p95: 479
- word_mean: 25.06
- word_median: 13
- n_empty: 0
- n_duplicates: 2,791
- duplicate_rate: 0.01899
- vocab_size: 28,632
- readability_flesch_mean: 67.28
- emoji_rate: 0.0001088
- url_rate: 0.0007755
- one_word_rate: 0.002327
- allcaps_rate: 0.01786
- boilerplate_rate: 0.0008708
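The dedupe step for Summary might be sketched as follows. It keeps the first occurrence of each repeated summary and deliberately leaves null summaries in place, which is an assumption about the desired behaviour; `drop_repeated_summaries` is a hypothetical helper.

```python
import pandas as pd

def drop_repeated_summaries(df: pd.DataFrame, column: str = "Summary") -> pd.DataFrame:
    # duplicated() also marks repeated nulls, so exclude them explicitly.
    is_repeat = df[column].duplicated(keep="first") & df[column].notna()
    return df.loc[~is_repeat]

sample = pd.DataFrame({
    "Summary": ["bright light", "bright light", None, None, "disk over lake"],
})
kept = drop_repeated_summaries(sample)
```

Two distinct sightings can share a short summary, so it may be worth checking other columns before treating an exact text match as a duplicate report.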
Text
text free_text near_unique
Free-text field averaging 950 characters and 182 words per entry, with 127,124 unique values across 147,890 rows and a 65,106-token vocabulary, consistent with longish English prose (mean Flesch 69.6, fairly readable). First-person pronouns 'i' and 'my' rank in the top 10, suggesting first-person eyewitness narratives rather than formal documents. Notable signals: 11.28% null rate, 3.12% exact duplicates (4,091 rows), and a max length of 64,550 characters versus a p95 of 2,673, which points to a small set of extreme outliers. Treatment: tokenize and embed before modelling; dedupe the 4,091 exact repeats and clip extreme-length outliers first.
- n: 147,890
- nulls: 16,675 (11.3%)
- unique: 127,124
- len_min: 1
- len_max: 64,550
- len_mean: 949.9
- len_median: 682
- len_p95: 2,673
- word_mean: 181.6
- word_median: 131
- n_empty: 0
- n_duplicates: 4,091
- duplicate_rate: 0.03118
- vocab_size: 65,106
- readability_flesch_mean: 69.56
- emoji_rate: 0.0001981
- url_rate: 0.01805
- one_word_rate: 0.0006783
- allcaps_rate: 0.007217
- boilerplate_rate: 0.001875
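Length clipping for Text could be sketched like this. Using the reported p95 of 2,673 characters as the default cap is an illustrative choice, not a recommendation from the profile; `clip_text` and its output column names are hypothetical.

```python
import pandas as pd

def clip_text(raw: pd.Series, max_chars: int = 2673) -> pd.DataFrame:
    lengths = raw.str.len()
    return pd.DataFrame({
        # Keep at most max_chars characters; nulls pass through unchanged.
        "text": raw.str.slice(0, max_chars),
        # Record which rows were shortened so the loss is traceable.
        "was_truncated": lengths.gt(max_chars),
    })

sample = pd.Series(["a" * 3000, "short report", None])
clipped = clip_text(sample)
```

Truncating at a character count can cut a sentence in half, so a tokenizer-level limit may be preferable when the text feeds an embedding model.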
Location details
text free_text near_unique null_rate
Free-text location descriptions, populated for fewer than 7% of rows (null_rate 0.931) and otherwise short prose averaging 38.7 characters / 6.9 words with a Flesch readability of 69.8. Among the 10,205 non-null entries there are 9,713 distinct strings and only 492 duplicates (4.8%), so values are near-unique and unlikely to act as a category. Top tokens are stopwords ('the', 'of', 'in', 'my', 'over'), confirming sentence-style annotations rather than place names or codes. Treatment: treat as optional free-text annotation; tokenize/embed if used, otherwise drop given the 93% null rate.
- n: 147,890
- nulls: 137,685 (93.1%)
- unique: 9,713
- len_min: 1
- len_max: 197
- len_mean: 38.75
- len_median: 33
- len_p95: 92
- word_mean: 6.93
- word_median: 6
- n_empty: 0
- n_duplicates: 492
- duplicate_rate: 0.04821
- vocab_size: 12,384
- readability_flesch_mean: 69.85
- emoji_rate: 0
- url_rate: 0.001078
- one_word_rate: 0.03822
- allcaps_rate: 0.0197
- boilerplate_rate: 0.000196
Explanation
categorical label null_rate
This is a categorical 'Explanation' field labeling probable causes for sightings (e.g., 'Starlink - Probable', 'Rocket - Certain', 'Balloon - Possible'), with 58 distinct values combining a source and a confidence tier. It is almost entirely empty: null_rate is 0.9946, leaving an annotated subset of only 803 rows in which the top label 'Starlink - Probable' appears 78 times (9.7% of non-nulls). Entropy ratio of 0.82 indicates labels are spread fairly evenly across categories rather than dominated by one. Treatment: treat as a sparse annotation; split into source and confidence sub-labels and analyze only the ~0.5% labeled rows.
- n: 147,890
- nulls: 147,087 (99.5%)
- unique: 58
- top_value: Starlink - Probable
- top_rate: 0.09714
- cardinality: 58
- entropy: 4.832
- entropy_ratio: 0.8248
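Splitting Explanation into sub-labels could be sketched as below. The pattern assumes labels follow 'Source - Confidence' as in the top values shown; labels without a tier keep their source and get a missing confidence, which is an assumption since only 20 of the 58 values are listed. `split_explanation` is a hypothetical name.

```python
import pandas as pd

def split_explanation(raw: pd.Series) -> pd.DataFrame:
    # Lazy source group, then an optional ' - <tier>' tail anchored at the end.
    pattern = r"^(?P<source>.+?)(?:\s+-\s+(?P<confidence>\w+))?$"
    return raw.str.extract(pattern)

sample = pd.Series(["Starlink - Probable", "Camera Anomaly - Certain", None])
parts = split_explanation(sample)
# Restrict analysis to the labelled subset, as the treatment suggests.
labelled = parts.dropna(subset=["source"])
```

Separating the two parts lets the three Starlink tiers be counted together, which changes how concentrated the labelled subset looks compared with the raw 58-way split.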