quirky nuforc sightings
Reading
This dataset contains 147,890 UFO sighting reports (likely from NUFORC) with 13 columns covering location, shape, duration, witness counts, and free-text descriptions. The Shape field is a clean categorical with 39 values dominated by 'Light' (27,494), 'Circle', and 'Triangle' — a natural starting point for understanding what people report. Duration is text-based but highly repetitive, with '5 minutes' and '2 minutes' as the most common values, suggesting witnesses anchor on round numbers. Watch out for 'No of observers': it is extremely skewed (max 20,000, min -10, skew 109) with ~13% outliers, so it needs cleaning before any quantitative use. Also note that 'Explanation' is 99.5% null — only a tiny fraction of sightings have an official label, with 'Starlink' explanations leading the small set that do.
citing: row_count · column_count · Shape.top_values · Shape.stats.cardinality · Duration.top_values · No of observers.stats · Explanation.null_rate · Explanation.top_values · Location.top_values · Summary.stats.len_mean
Charts the summary said to look at first
Shape: top values
| value | count | share |
|---|---|---|
| Light | 27494 | 18.6% |
| Circle | 14367 | 9.7% |
| Triangle | 13086 | 8.8% |
| Other | 10062 | 6.8% |
| Unknown | 10021 | 6.8% |
| Fireball | 9880 | 6.7% |
| Disk | 8716 | 5.9% |
| Sphere | 7652 | 5.2% |
| Oval | 6369 | 4.3% |
| Orb | 5924 | 4.0% |
| Formation | 4864 | 3.3% |
| Changing | 3987 | 2.7% |
| Cigar | 3753 | 2.5% |
| Rectangle | 2610 | 1.8% |
| Cylinder | 2482 | 1.7% |
| Flash | 2439 | 1.6% |
| Diamond | 2116 | 1.4% |
| Chevron | 1742 | 1.2% |
| Egg | 1289 | 0.9% |
| Teardrop | 1238 | 0.8% |
Duration: length in characters
| chars | count |
|---|---|
| 1 – 2 | 784 |
| 2 – 3 | 1006 |
| 3 – 4 | 586 |
| 4 – 5 | 2052 |
| 5 – 6 | 6297 |
| 6 – 6 | 9510 |
| 6 – 7 | 9325 |
| 7 – 8 | 9401 |
| 8 – 9 | 36028 |
| 9 – 10 | 0 |
| 10 – 11 | 37045 |
| 11 – 12 | 10292 |
| 12 – 13 | 3432 |
| 13 – 14 | 5013 |
| 14 – 14 | 1918 |
| 14 – 15 | 1786 |
| 15 – 16 | 1404 |
| 16 – 17 | 751 |
| 17 – 18 | 818 |
| 18 – 19 | 0 |
| 19 – 20 | 598 |
| 20 – 21 | 476 |
| 21 – 22 | 305 |
| 22 – 23 | 343 |
| 23 – 24 | 337 |
| 24 – 24 | 416 |
| 24 – 25 | 818 |
| 25 – 26 | 0 |
| 26 – 27 | 0 |
| 27 – 28 | 0 |
| 28 – 29 | 0 |
| 29 – 30 | 0 |
| 30 – 31 | 0 |
| 31 – 32 | 0 |
| 32 – 32 | 0 |
| 32 – 33 | 0 |
| 33 – 34 | 0 |
| 34 – 35 | 0 |
| 35 – 36 | 0 |
| 36 – 37 | 1 |
Explanation: top values
| value | count | share |
|---|---|---|
| Starlink - Probable | 78 | 0.1% |
| Starlink - Certain | 69 | 0.0% |
| Rocket - Certain | 67 | 0.0% |
| Balloon - Possible | 50 | 0.0% |
| Starlink - Possible | 49 | 0.0% |
| Planet/Star - Possible | 42 | 0.0% |
| Planet/Star - Probable | 41 | 0.0% |
| Aircraft - Possible | 35 | 0.0% |
| Aircraft - Probable | 35 | 0.0% |
| Camera Anomaly - Probable | 33 | 0.0% |
| Camera Anomaly - Certain | 25 | 0.0% |
| Rocket - Probable | 24 | 0.0% |
| Bird - Possible | 21 | 0.0% |
| Searchlight - Certain | 18 | 0.0% |
| Balloon - Probable | 18 | 0.0% |
| Drone - Possible | 17 | 0.0% |
| Camera Anomaly - Possible | 15 | 0.0% |
| Bird - Probable | 14 | 0.0% |
| Searchlight - Probable | 12 | 0.0% |
| Rocket - Possible | 11 | 0.0% |
No of observers: value distribution
| bin | count |
|---|---|
| -10 – 490.2 | 141102 |
| 490.2 – 990.5 | 41 |
| 990.5 – 1491 | 52 |
| 1491 – 1991 | 1 |
| 1991 – 2491 | 8 |
| 2491 – 2992 | 3 |
| 2992 – 3492 | 3 |
| 3492 – 3992 | 0 |
| 3992 – 4492 | 1 |
| 4492 – 4992 | 0 |
| 4992 – 5493 | 7 |
| 5493 – 5993 | 0 |
| 5993 – 6493 | 0 |
| 6493 – 6994 | 0 |
| 6994 – 7494 | 0 |
| 7494 – 7994 | 0 |
| 7994 – 8494 | 0 |
| 8494 – 8994 | 0 |
| 8994 – 9495 | 0 |
| 9495 – 9995 | 0 |
| 9995 – 1.05e+04 | 7 |
| 1.05e+04 – 1.1e+04 | 0 |
| 1.1e+04 – 1.15e+04 | 1 |
| 1.15e+04 – 1.2e+04 | 0 |
| 1.2e+04 – 1.25e+04 | 0 |
| 1.25e+04 – 1.3e+04 | 0 |
| 1.3e+04 – 1.35e+04 | 0 |
| 1.35e+04 – 1.4e+04 | 0 |
| 1.4e+04 – 1.45e+04 | 0 |
| 1.45e+04 – 1.5e+04 | 0 |
| 1.5e+04 – 1.55e+04 | 0 |
| 1.55e+04 – 1.6e+04 | 0 |
| 1.6e+04 – 1.65e+04 | 0 |
| 1.65e+04 – 1.7e+04 | 0 |
| 1.7e+04 – 1.75e+04 | 0 |
| 1.75e+04 – 1.8e+04 | 0 |
| 1.8e+04 – 1.85e+04 | 0 |
| 1.85e+04 – 1.9e+04 | 0 |
| 1.9e+04 – 1.95e+04 | 0 |
| 1.95e+04 – 2e+04 | 3 |
Summary: length in characters
| chars | count |
|---|---|
| 1 – 267 | 131782 |
| 267 – 532 | 8911 |
| 532 – 798 | 3233 |
| 798 – 1063 | 1484 |
| 1063 – 1329 | 669 |
| 1329 – 1594 | 333 |
| 1594 – 1860 | 204 |
| 1860 – 2126 | 123 |
| 2126 – 2391 | 74 |
| 2391 – 2657 | 60 |
| 2657 – 2922 | 33 |
| 2922 – 3188 | 16 |
| 3188 – 3453 | 14 |
| 3453 – 3719 | 9 |
| 3719 – 3985 | 13 |
| 3985 – 4250 | 19 |
| 4250 – 4516 | 5 |
| 4516 – 4781 | 5 |
| 4781 – 5047 | 2 |
| 5047 – 5312 | 1 |
| 5312 – 5578 | 4 |
| 5578 – 5844 | 2 |
| 5844 – 6109 | 0 |
| 6109 – 6375 | 0 |
| 6375 – 6640 | 0 |
| 6640 – 6906 | 1 |
| 6906 – 7172 | 0 |
| 7172 – 7437 | 0 |
| 7437 – 7703 | 0 |
| 7703 – 7968 | 0 |
| 7968 – 8234 | 0 |
| 8234 – 8499 | 0 |
| 8499 – 8765 | 0 |
| 8765 – 9031 | 1 |
| 9031 – 9296 | 0 |
| 9296 – 9562 | 0 |
| 9562 – 9827 | 0 |
| 9827 – 10093 | 0 |
| 10093 – 10358 | 0 |
| 10358 – 10624 | 1 |
Schema
13 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| Sighting | numeric | 0.0% | 147,890 | |
| Occurred | text | 0.0% | 126,264 | |
| Location | text | 0.0% | 37,070 | multilingual, duplicates |
| Shape | categorical | 4.3% | 39 | |
| Duration | text | 4.8% | 15,527 | short_text, duplicates |
| No of observers | numeric | 4.5% | 137 | high_skew, outliers |
| Reported | text | 0.0% | 136,471 | multilingual |
| Posted | categorical | 0.0% | 626 | |
| Characteristics | text | 28.1% | 1,446 | null_rate, duplicates |
| Summary | text | 0.6% | 144,208 | near_unique |
| Text | text | 11.3% | 127,124 | near_unique |
| Location details | text | 93.1% | 9,713 | near_unique, null_rate |
| Explanation | categorical | 99.5% | 58 | null_rate |
Sighting
numeric identifier
Sighting is almost certainly a row identifier: every one of the 147,890 values is unique, there are no nulls, and the distribution is essentially uniform (skew -0.013, kurtosis -1.13) spanning 111 to 179,773. The values are not a dense 1..N sequence, suggesting an externally assigned record or sighting ID with gaps. No outliers and no zeros, consistent with an ID rather than a measurement. Treatment: Drop from modelling features; retain as the join/lookup key per row.
- n: 147,890
- nulls: 0 (0.0%)
- unique: 147,890
- min: 111
- max: 179,773
- mean: 9.198e+04
- median: 9.143e+04
- std: 5.044e+04
- q1: 5.015e+04
- q3: 1.345e+05
- iqr: 8.437e+04
- skew: -0.0132
- kurtosis: -1.134
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
Occurred
text timestamp
Timestamp strings of the form 'YYYY-MM-DD HH:MM:SS Local', with length tightly clustered at 25 characters (mean 24.96, p95 25). Stored as text rather than parsed datetimes, and 14.6% of values are duplicates (21,626 rows), with notable spikes on July 4th evenings and one outlier '2015-11-07 18:00:00' appearing 104 times. 299 rows contain just the bare token 'Local' with no date, which will break naive datetime parsing. Treatment: Strip the ' Local' suffix and parse to datetime, coercing the 299 bare 'Local' entries to null.
- n: 147,890
- nulls: 0 (0.0%)
- unique: 126,264
- len_min: 5
- len_max: 25
- len_mean: 24.96
- len_median: 25
- len_p95: 25
- word_mean: 2.996
- word_median: 3
- n_empty: 0
- n_duplicates: 21,626
- duplicate_rate: 0.1462
- vocab_size: 10,098
- readability_flesch_mean: 90.72
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.002022
- allcaps_rate: 0
- boilerplate_rate: 0
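The treatment described for Occurred can be sketched with the standard library alone. This is a minimal illustration, assuming every dated value follows the 'YYYY-MM-DD HH:MM:SS Local' pattern reported above; the function name `parse_occurred` is hypothetical, and any value that does not fit the pattern is coerced to missing rather than repaired.

```python
from datetime import datetime
from typing import Optional

SUFFIX = " Local"  # suffix reported by the profile for this column

def parse_occurred(raw: Optional[str]) -> Optional[datetime]:
    """Return a naive datetime, or None when no usable date is present."""
    if raw is None:
        return None
    text = raw.strip()
    # Rows holding only the bare token 'Local' carry no date at all.
    if text == "Local":
        return None
    if text.endswith(SUFFIX):
        text = text[: -len(SUFFIX)]
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # any other malformed value is coerced to missing

print(parse_occurred("2015-11-07 18:00:00 Local"))
print(parse_occurred("Local"))
```

The result is a naive datetime because 'Local' does not name a specific timezone; attaching one would need the Location column.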
Location
text feature multilingual duplicates
Short 'City, State/Region, Country' location strings, averaging 20 characters and 3.6 words, dominated by US entries (Phoenix, Seattle, Las Vegas lead) with 'usa' appearing 17,880 times. The column is highly repetitive: 110,819 of 147,890 rows are duplicates (75%) across only 37,070 unique values, so it behaves like a categorical despite being free text. Language detection flags multilingual content, but this mostly reflects misclassification of short strings: 4,481 values are detected as English versus small counts in 27 other codes. Treatment: Parse into city/state/country fields and treat as a high-cardinality categorical.
- n: 147,890
- nulls: 1 (0.0%)
- unique: 37,070
- len_min: 7
- len_max: 92
- len_mean: 20.38
- len_median: 18
- len_p95: 37
- word_mean: 3.571
- word_median: 3
- n_empty: 0
- n_duplicates: 110,819
- duplicate_rate: 0.7493
- vocab_size: 8,750
- readability_flesch_mean: 26.72
- emoji_rate: 6.762e-06
- url_rate: 0
- one_word_rate: 6.762e-06
- allcaps_rate: 0.003983
- boilerplate_rate: 0
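One possible way to parse Location into city/state/country fields is to split on commas from the right. The sketch below is an assumption-laden illustration: the `Place` type and `split_location` name are hypothetical, the sample strings are made up, and a two-part value is arbitrarily read as city plus country.

```python
from typing import NamedTuple, Optional

class Place(NamedTuple):
    city: Optional[str]
    region: Optional[str]
    country: Optional[str]

def split_location(raw: Optional[str]) -> Place:
    """Split 'City, State/Region, Country', reading fields from the right."""
    if not raw:
        return Place(None, None, None)
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if len(parts) >= 3:
        # Re-join extra leading pieces so a comma inside the city survives.
        return Place(", ".join(parts[:-2]), parts[-2], parts[-1])
    if len(parts) == 2:
        # Assumption: two parts mean city and country, with no region.
        return Place(parts[0], None, parts[1])
    if len(parts) == 1:
        return Place(parts[0], None, None)
    return Place(None, None, None)

print(split_location("Phoenix, AZ, USA"))
```

After splitting, each field can be treated as its own categorical, which keeps the cardinality of any single field far below the 37,070 distinct full strings.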
Shape
categorical feature
Categorical descriptor of UFO sighting shapes across 39 distinct values, with 'Light' leading at 19.4% of non-null records (27,494, or 18.6% of all rows). The distribution is moderately spread (entropy ratio 0.74), and notably 'Other' (10,062) and 'Unknown' (10,021) together total 20,083, more than the second-largest named category 'Circle' (14,367), suggesting substantial reporter ambiguity. Null rate is 4.29%, modest but non-trivial. Treatment: One-hot or target-encode after collapsing 'Other'/'Unknown'/nulls into a single missing bucket.
- n: 147,890
- nulls: 6,343 (4.3%)
- unique: 39
- top_value: Light
- top_rate: 0.1942
- cardinality: 39
- entropy: 3.93
- entropy_ratio: 0.7436
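The collapse-then-encode treatment for Shape might look like the following sketch. The bucket name `Missing` and both function names are hypothetical choices made for illustration; a real pipeline would more likely rely on a dataframe library's encoder.

```python
from typing import Dict, Iterable, List, Optional

MISSING = "Missing"               # hypothetical name for the merged bucket
AMBIGUOUS = {"Other", "Unknown"}  # labels the profile suggests collapsing

def collapse_shape(value: Optional[str]) -> str:
    """Map nulls, blanks, 'Other' and 'Unknown' to one missing bucket."""
    if value is None:
        return MISSING
    cleaned = value.strip()
    if cleaned == "" or cleaned in AMBIGUOUS:
        return MISSING
    return cleaned

def one_hot(values: Iterable[Optional[str]]) -> List[Dict[str, int]]:
    """One indicator per category observed after collapsing."""
    cleaned = [collapse_shape(v) for v in values]
    categories = sorted(set(cleaned))
    return [{c: int(v == c) for c in categories} for v in cleaned]

rows = one_hot(["Light", "Other", None, "Circle"])
print(rows[0])
```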
Duration
text feature short_text duplicates
This is a free-text duration field, almost always a number-plus-unit phrase like '5 minutes' or '30 seconds' (mean 9.5 chars, ~2 words). Values are highly repetitive: only 15,527 distinct strings across 147,890 rows and an 89% duplicate rate, with a 4.8% null rate. The dominant units are 'minutes' and 'seconds', but the presence of an abbreviated 'min' token signals inconsistent formatting that will need normalisation. Treatment: Parse number and unit, convert to a single numeric scale (e.g., seconds), and treat 'min'/'minutes' variants as equivalent.
- n: 147,890
- nulls: 7,148 (4.8%)
- unique: 15,527
- len_min: 1
- len_max: 37
- len_mean: 9.469
- len_median: 9
- len_p95: 15
- word_mean: 2.066
- word_median: 2
- n_empty: 0
- n_duplicates: 125,215
- duplicate_rate: 0.8897
- vocab_size: 1,757
- readability_flesch_mean: 84.41
- emoji_rate: 7.105e-06
- url_rate: 0
- one_word_rate: 0.0895
- allcaps_rate: 0.03033
- boilerplate_rate: 0
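A rough sketch of the number-plus-unit parsing suggested for Duration follows. The unit alias table is an assumption rather than something read from the data (only 'minutes', 'seconds' and 'min' are named in the profile), and `duration_seconds` is a hypothetical helper; anything that is not a plain number followed by a known unit is left unparsed.

```python
import re
from typing import Optional

# Seconds per unit; this alias list is an assumption, not taken from the data.
UNIT_SECONDS = {
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
}
PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\.?\s*$", re.IGNORECASE)

def duration_seconds(raw: Optional[str]) -> Optional[float]:
    """Convert strings such as '5 minutes' or '2 min' to seconds."""
    if raw is None:
        return None
    match = PATTERN.match(raw)
    if match is None:
        return None  # ranges, '~', 'a few minutes', etc. stay unparsed
    number, unit = match.groups()
    factor = UNIT_SECONDS.get(unit.lower())
    return float(number) * factor if factor is not None else None

print(duration_seconds("5 minutes"))
print(duration_seconds("2 min"))
```

Returning None for unrecognised strings keeps the parse conservative; the share of unparsed values is then a useful measure of how much formatting variety remains.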
No of observers
numeric feature high_skew outliers
Counts of observers per record, with a typical value of 1-2 (median 2, IQR 1) but a maximum of 20,000 driving mean 4.6 and std 129.5. Skew of 109.3 and kurtosis of 14,332 are extreme, and 13.0% of rows are flagged as outliers. A min of -10 is suspicious for a count, and 6.9% are zero with 4.5% null. Treatment: Clip negatives, investigate the 20,000 tail, and log1p-transform before modelling.
- n: 147,890
- nulls: 6,661 (4.5%)
- unique: 137
- min: -10
- max: 20,000
- mean: 4.603
- median: 2
- std: 129.5
- q1: 1
- q3: 2
- iqr: 1
- skew: 109.3
- kurtosis: 1.433e+04
- n_outliers: 18,390
- outlier_rate: 0.1302
- zero_rate: 0.06919
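The clip-and-transform treatment for No of observers can be illustrated as below. Both helper names are hypothetical, and the review threshold of 1,000 is an arbitrary illustrative cut-off, not a value derived from the profile.

```python
import math
from typing import Optional

def clean_observers(value: Optional[float]) -> Optional[float]:
    """Clip negatives to zero, then apply log1p; missing stays missing."""
    if value is None or math.isnan(value):
        return None
    return math.log1p(max(value, 0.0))

def needs_review(value: Optional[float], threshold: float = 1000.0) -> bool:
    """Flag implausibly large counts for manual inspection.

    The default threshold is an illustrative assumption.
    """
    if value is None or math.isnan(value):
        return False
    return value >= threshold

print(clean_observers(-10))   # negative count clipped to 0 before log1p
print(needs_review(20000))
```

Flagging the tail rather than silently capping it keeps the decision about the 20,000-observer rows explicit.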
Reported
text timestamp multilingual
This is a 'Reported' timestamp stored as text in fixed 27-character format like '1999-11-16 00:00:00 Pacific'. Every value has identical length (min/max/mean = 27) and 3 words, with 'Pacific' appearing as a constant timezone suffix in ~20,000 rows. The multilingual alert is a false positive from the language detector misreading dates; the duplicate rate of 7.7% (11,418 rows) reflects multiple events sharing a report date. Treatment: Parse to datetime (strip 'Pacific' suffix, localize to US/Pacific) before any temporal analysis.
- n: 147,890
- nulls: 1 (0.0%)
- unique: 136,471
- len_min: 27
- len_max: 27
- len_mean: 27
- len_median: 27
- len_p95: 27
- word_mean: 3
- word_median: 3
- n_empty: 0
- n_duplicates: 11,418
- duplicate_rate: 0.07721
- vocab_size: 24,077
- readability_flesch_mean: 62.79
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
Posted
categorical timestamp
This column stores posting dates as datetime strings with zeroed time components, almost certainly a publication or upload timestamp. Across 147,890 rows there are 626 distinct dates and a single null, and the distribution is remarkably flat: the entropy ratio is 0.93 and the most common date (2020-06-25) accounts for only 1.24% of rows. The top dates span 1999 to 2023, suggesting the dataset covers more than two decades of activity. Treatment: parse to datetime and derive year/month features for temporal analysis.
- n: 147,890
- nulls: 1 (0.0%)
- unique: 626
- top_value: 2020-06-25 00:00:00
- top_rate: 0.01239
- cardinality: 626
- entropy: 8.644
- entropy_ratio: 0.9304
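Deriving year/month features from Posted could be sketched as follows, assuming the 'YYYY-MM-DD HH:MM:SS' layout seen in the top value. The helper names are hypothetical, and the sample list mixes the reported top value with a made-up date.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional, Tuple

def year_month(raw: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (year, month), or None for missing or malformed values."""
    if not raw:
        return None
    try:
        parsed = datetime.strptime(raw.strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return parsed.year, parsed.month

def posts_per_month(values: Iterable[Optional[str]]) -> Counter:
    """Count rows per (year, month), skipping unparseable entries."""
    return Counter(ym for ym in map(year_month, values) if ym is not None)

# Illustrative sample; only the first value is taken from the profile.
counts = posts_per_month(
    ["2020-06-25 00:00:00", "2020-06-25 00:00:00", "2001-03-01 00:00:00", None]
)
print(counts)
```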
Characteristics
text feature null_rate duplicates
This is a multi-label categorical feature describing observed object characteristics (e.g. "Lights on object", "Aura or haze around object", "Aircraft nearby"), stored as comma-joined tags rather than a structured list. Despite 147,890 rows, only 1,446 distinct strings exist and 98.6% of the non-null values are duplicates, with a tiny vocab of 43 tokens. Watch for the 28.15% null rate and a truncated tag ("Changed Colo", apparently "Changed Color" cut off) that recurs across thousands of rows. Treatment: Split on commas and one-hot encode the ~43 underlying tags; treat nulls as a separate indicator.
- n: 147,890
- nulls: 41,631 (28.1%)
- unique: 1,446
- len_min: 6
- len_max: 251
- len_mean: 33.27
- len_median: 26
- len_p95: 76
- word_mean: 5.613
- word_median: 5
- n_empty: 0
- n_duplicates: 104,813
- duplicate_rate: 0.9864
- vocab_size: 43
- readability_flesch_mean: 69.46
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.00384
- allcaps_rate: 0
- boilerplate_rate: 0
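The split-and-encode treatment for Characteristics might be sketched like this. The function names and the `characteristics_missing` indicator are hypothetical, and the alias that repairs "Changed Colo" rests on the profile's own guess that the tag is a truncated "Changed Color".

```python
from typing import Dict, List, Optional

# Assumption from the profile: 'Changed Colo' is 'Changed Color' cut off.
ALIASES = {"Changed Colo": "Changed Color"}

def split_tags(raw: Optional[str]) -> List[str]:
    """Split a comma-joined tag string and repair known truncations."""
    if raw is None:
        return []
    tags = [t.strip() for t in raw.split(",") if t.strip()]
    return [ALIASES.get(t, t) for t in tags]

def multi_hot(rows: List[Optional[str]]) -> List[Dict[str, int]]:
    """One indicator per tag, plus a separate indicator for null rows."""
    vocab = sorted({t for r in rows for t in split_tags(r)})
    encoded = []
    for r in rows:
        present = set(split_tags(r))
        features = {tag: int(tag in present) for tag in vocab}
        features["characteristics_missing"] = int(r is None)
        encoded.append(features)
    return encoded

sample = ["Lights on object, Aircraft nearby", None, "Changed Colo"]
encoded = multi_hot(sample)
print(encoded[0])
```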
Summary
text free_text near_unique
Free-text summary field with 144,208 unique values across 147,890 rows and a 0.6% null rate, so virtually every record carries its own short description. Lengths are highly skewed: median 76 characters / 13 words but a max of 10,624 characters and a p95 of 479, and mean Flesch readability of 67.3 suggests fairly plain English prose. Top tokens are stopwords plus 'light' (4,676 occurrences), hinting at a recurring topical theme worth investigating; duplicates (1.9%) and boilerplate (<0.1%) are negligible. Treatment: Tokenize and embed before modelling; do not treat as a categorical key.
- n: 147,890
- nulls: 891 (0.6%)
- unique: 144,208
- len_min: 1
- len_max: 10,624
- len_mean: 134.9
- len_median: 76
- len_p95: 479
- word_mean: 25.06
- word_median: 13
- n_empty: 0
- n_duplicates: 2,791
- duplicate_rate: 0.01899
- vocab_size: 28,632
- readability_flesch_mean: 67.28
- emoji_rate: 0.0001088
- url_rate: 0.0007755
- one_word_rate: 0.002327
- allcaps_rate: 0.01786
- boilerplate_rate: 0.0008708
Text
text free_text near_unique
Free-text field containing medium-length English prose, averaging 949.9 characters and 181.6 words with a Flesch readability of 69.6, consistent with short first-person narratives. The column is near-unique (127,124 unique of 147,890) yet still carries 4,091 exact duplicates (3.1% of non-null values) and an 11.3% null rate worth investigating. Top words are dominated by English stopwords plus a frequent first-person 'i' and 'my', hinting at personal/subjective writing rather than formal documents. Treatment: Clean nulls and exact duplicates, then tokenize and embed before modelling.
- n: 147,890
- nulls: 16,675 (11.3%)
- unique: 127,124
- len_min: 1
- len_max: 64,550
- len_mean: 949.9
- len_median: 682
- len_p95: 2,673
- word_mean: 181.6
- word_median: 131
- n_empty: 0
- n_duplicates: 4,091
- duplicate_rate: 0.03118
- vocab_size: 65,106
- readability_flesch_mean: 69.56
- emoji_rate: 0.0001981
- url_rate: 0.01805
- one_word_rate: 0.0006783
- allcaps_rate: 0.007217
- boilerplate_rate: 0.001875
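The first step of the Text treatment, removing nulls and exact duplicates, can be illustrated with a small order-preserving filter. The function name is hypothetical, and treating whitespace-only strings as missing is an added assumption.

```python
from typing import Iterable, List, Optional

def drop_nulls_and_duplicates(texts: Iterable[Optional[str]]) -> List[str]:
    """Keep the first occurrence of each non-empty text, in original order."""
    seen = set()
    kept = []
    for text in texts:
        # Assumption: whitespace-only strings count as missing.
        if text is None or not text.strip():
            continue
        if text in seen:
            continue  # exact duplicate of an earlier row
        seen.add(text)
        kept.append(text)
    return kept

print(drop_nulls_and_duplicates(["a", None, "a", "b", "  "]))
```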
Location details
text free_text near_unique null_rate
Free-text supplementary location notes, populated for only 6.9% of the 147,890 rows (null_rate 0.931). When present, entries are short prose averaging 38.7 characters / 6.9 words with readable Flesch 69.8, and the top tokens ('the', 'of', 'in', 'my', 'from') confirm natural-language descriptions rather than structured place codes. Cardinality is high (9,713 uniques) but 492 exact duplicates (4.8%) hint at recurring phrases worth normalising. Treatment: Treat as optional free-text annotation: tokenize/embed for any modelling and expect to ignore for the 93% of rows where it is null.
- n: 147,890
- nulls: 137,685 (93.1%)
- unique: 9,713
- len_min: 1
- len_max: 197
- len_mean: 38.75
- len_median: 33
- len_p95: 92
- word_mean: 6.93
- word_median: 6
- n_empty: 0
- n_duplicates: 492
- duplicate_rate: 0.04821
- vocab_size: 12,384
- readability_flesch_mean: 69.85
- emoji_rate: 0
- url_rate: 0.001078
- one_word_rate: 0.03822
- allcaps_rate: 0.0197
- boilerplate_rate: 0.000196
Explanation
categorical label null_rate
Classification labels explaining UFO/sky-object sightings, with categories like 'Starlink - Probable', 'Rocket - Certain', and 'Balloon - Possible' combining an object type with a confidence qualifier. The column is 99.46% null (only 803 of 147,890 rows carry a value), so it functions as a sparse annotation rather than a primary feature. Among populated rows, 58 distinct labels appear with relatively even spread (entropy ratio 0.82); the top label 'Starlink - Probable' covers just 9.71% of non-nulls. Treatment: Treat as a sparse secondary label; split into object/confidence and only model on the annotated subset.
- n: 147,890
- nulls: 147,087 (99.5%)
- unique: 58
- top_value: Starlink - Probable
- top_rate: 0.09714
- cardinality: 58
- entropy: 4.832
- entropy_ratio: 0.8248
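Splitting Explanation into object and confidence parts can be sketched by partitioning on the last ' - ' separator, which matches every label shown in the top-values table. The function name is hypothetical, and the fallback for a label without a qualifier is an assumption, since the profile does not show whether such labels exist.

```python
from typing import Optional, Tuple

def split_explanation(raw: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split 'Object - Confidence' into its two parts."""
    if raw is None:
        return None, None
    obj, sep, confidence = raw.rpartition(" - ")
    if not sep:
        # Assumption: a label without ' - ' has no confidence qualifier.
        return raw.strip(), None
    return obj.strip(), confidence.strip()

print(split_explanation("Starlink - Probable"))
print(split_explanation("Planet/Star - Possible"))
```

Partitioning from the right keeps any hyphenated object name intact, since only the final separator is treated as the boundary.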