quirky nuforc sightings

source /home/coolhand/html/datavis/data_trove/cache/quirky/nuforc_sightings.parquet · 147,890 rows · 13 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 147,890 UFO sighting reports (likely from NUFORC) with 13 columns covering location, shape, duration, witness counts, and free-text descriptions. The Shape field is a clean categorical with 39 values dominated by 'Light' (27,494), 'Circle', and 'Triangle' — a natural starting point for understanding what people report. Duration is text-based but highly repetitive, with '5 minutes' and '2 minutes' as the most common values, suggesting witnesses anchor on round numbers. Watch out for 'No of observers': it is extremely skewed (max 20,000, min -10, skew 109) with ~13% outliers, so it needs cleaning before any quantitative use. Also note that 'Explanation' is 99.5% null — only a tiny fraction of sightings have an official label, with 'Starlink' explanations leading the small set that do.

citing: row_count · column_count · Shape.top_values · Shape.stats.cardinality · Duration.top_values · No of observers.stats · Explanation.null_rate · Explanation.top_values · Location.top_values · Summary.stats.len_mean

Charts the summary said to look at first

Shape · See which UFO shapes dominate reports — 'Light' alone accounts for nearly 1 in 5 sightings.

Top values for Shape (20 unique shown, of 39 total).
value	count	share
Light	27494	18.6%
Circle	14367	9.7%
Triangle	13086	8.8%
Other	10062	6.8%
Unknown	10021	6.8%
Fireball	9880	6.7%
Disk	8716	5.9%
Sphere	7652	5.2%
Oval	6369	4.3%
Orb	5924	4.0%
Formation	4864	3.3%
Changing	3987	2.7%
Cigar	3753	2.5%
Rectangle	2610	1.8%
Cylinder	2482	1.7%
Flash	2439	1.6%
Diamond	2116	1.4%
Chevron	1742	1.2%
Egg	1289	0.9%
Teardrop	1238	0.8%

Duration · Look at how witnesses cluster around round-number durations like '5 minutes' and '2 minutes'.

Character-length distribution for Duration (mean: 9.469).
chars	count
1 – 2	784
2 – 3	1006
3 – 4	586
4 – 5	2052
5 – 6	6297
6 – 6	9510
6 – 7	9325
7 – 8	9401
8 – 9	36028
9 – 10	0
10 – 11	37045
11 – 12	10292
12 – 13	3432
13 – 14	5013
14 – 14	1918
14 – 15	1786
15 – 16	1404
16 – 17	751
17 – 18	818
18 – 19	0
19 – 20	598
20 – 21	476
21 – 22	305
22 – 23	343
23 – 24	337
24 – 24	416
24 – 25	818
25 – 26	0
26 – 27	0
27 – 28	0
28 – 29	0
29 – 30	0
30 – 31	0
31 – 32	0
32 – 32	0
32 – 33	0
33 – 34	0
34 – 35	0
35 – 36	0
36 – 37	1

Explanation · Among the rare cases with an official explanation, check the mix of Starlink, rocket, and balloon attributions.

Top values for Explanation (20 unique shown, of 58 total).
value	count	share
Starlink - Probable	78	0.1%
Starlink - Certain	69	0.0%
Rocket - Certain	67	0.0%
Balloon - Possible	50	0.0%
Starlink - Possible	49	0.0%
Planet/Star - Possible	42	0.0%
Planet/Star - Probable	41	0.0%
Aircraft - Possible	35	0.0%
Aircraft - Probable	35	0.0%
Camera Anomaly - Probable	33	0.0%
Camera Anomaly - Certain	25	0.0%
Rocket - Probable	24	0.0%
Bird - Possible	21	0.0%
Searchlight - Certain	18	0.0%
Balloon - Probable	18	0.0%
Drone - Possible	17	0.0%
Camera Anomaly - Possible	15	0.0%
Bird - Probable	14	0.0%
Searchlight - Probable	12	0.0%
Rocket - Possible	11	0.0%

No of observers · Inspect the extreme skew and outliers (negative values and a 20,000 maximum) before trusting any averages.

Histogram bins for No of observers (median: 2.0).
bin	count
-10 – 490.2	141102
490.2 – 990.5	41
990.5 – 1491	52
1491 – 1991	1
1991 – 2491	8
2491 – 2992	3
2992 – 3492	3
3492 – 3992	0
3992 – 4492	1
4492 – 4992	0
4992 – 5493	7
5493 – 5993	0
5993 – 6493	0
6493 – 6994	0
6994 – 7494	0
7494 – 7994	0
7994 – 8494	0
8494 – 8994	0
8994 – 9495	0
9495 – 9995	0
9995 – 1.05e+04	7
1.05e+04 – 1.1e+04	0
1.1e+04 – 1.15e+04	1
1.15e+04 – 1.2e+04	0
1.2e+04 – 1.25e+04	0
1.25e+04 – 1.3e+04	0
1.3e+04 – 1.35e+04	0
1.35e+04 – 1.4e+04	0
1.4e+04 – 1.45e+04	0
1.45e+04 – 1.5e+04	0
1.5e+04 – 1.55e+04	0
1.55e+04 – 1.6e+04	0
1.6e+04 – 1.65e+04	0
1.65e+04 – 1.7e+04	0
1.7e+04 – 1.75e+04	0
1.75e+04 – 1.8e+04	0
1.8e+04 – 1.85e+04	0
1.85e+04 – 1.9e+04	0
1.9e+04 – 1.95e+04	0
1.95e+04 – 2e+04	3

Summary · Review summary length distribution to gauge how much narrative detail accompanies each sighting.

Character-length distribution for Summary (mean: 134.9).
chars	count
1 – 267	131782
267 – 532	8911
532 – 798	3233
798 – 1063	1484
1063 – 1329	669
1329 – 1594	333
1594 – 1860	204
1860 – 2126	123
2126 – 2391	74
2391 – 2657	60
2657 – 2922	33
2922 – 3188	16
3188 – 3453	14
3453 – 3719	9
3719 – 3985	13
3985 – 4250	19
4250 – 4516	5
4516 – 4781	5
4781 – 5047	2
5047 – 5312	1
5312 – 5578	4
5578 – 5844	2
5844 – 6109	0
6109 – 6375	0
6375 – 6640	0
6640 – 6906	1
6906 – 7172	0
7172 – 7437	0
7437 – 7703	0
7703 – 7968	0
7968 – 8234	0
8234 – 8499	0
8499 – 8765	0
8765 – 9031	1
9031 – 9296	0
9296 – 9562	0
9562 – 9827	0
9827 – 10093	0
10093 – 10358	0
10358 – 10624	1

Schema

13 columns

Per-column summary.
column	type	null rate	unique	alerts
Sighting	numeric	0.0%	147,890
Occurred	text	0.0%	126,264
Location	text	0.0%	37,070	multilingual duplicates
Shape	categorical	4.3%	39
Duration	text	4.8%	15,527	short_text duplicates
No of observers	numeric	4.5%	137	high_skew outliers
Reported	text	0.0%	136,471	multilingual
Posted	categorical	0.0%	626
Characteristics	text	28.1%	1,446	null_rate duplicates
Summary	text	0.6%	144,208	near_unique
Text	text	11.3%	127,124	near_unique
Location details	text	93.1%	9,713	near_unique null_rate
Explanation	categorical	99.5%	58	null_rate

Sighting

numeric identifier

Sighting is almost certainly a row identifier: every one of the 147890 values is unique, there are no nulls, and the distribution is essentially uniform (skew -0.013, kurtosis -1.13) spanning 111 to 179773. The values are not a dense 1..N sequence, suggesting an externally assigned record or sighting ID with gaps. No outliers and no zeros, consistent with an ID rather than a measurement. Treatment: Drop from modelling features; retain as the join/lookup key per row. high · anthropic:claude-opus-4-7

n: 147,890
nulls: 0 (0.0%)
unique: 147,890
min: 111
max: 179,773
mean: 9.198e+04
median: 9.143e+04
std: 5.044e+04
q1: 5.015e+04
q3: 1.345e+05
iqr: 8.437e+04
skew: -0.0132
kurtosis: -1.134
n_outliers: 0
outlier_rate: 0
zero_rate: 0

Occurred

text timestamp

Timestamp strings of the form 'YYYY-MM-DD HH:MM:SS Local', with length tightly clustered at 25 characters (mean 24.96, p95 25). Stored as text rather than parsed datetimes, and 14.6% of values are duplicates (21,626 rows), with notable spikes on July 4th evenings and one outlier '2015-11-07 18:00:00' appearing 104 times. 299 rows contain just the bare token 'Local' with no date, which will break naive datetime parsing. Treatment: Strip the ' Local' suffix and parse to datetime, coercing the 299 bare 'Local' entries to null. high · anthropic:claude-opus-4-7
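
The treatment above can be sketched with the standard library alone. This is a minimal illustration, not the profiler's own code: the helper name `parse_occurred` is hypothetical, and it assumes every dated value follows the 'YYYY-MM-DD HH:MM:SS Local' pattern described here.

```python
from datetime import datetime
from typing import Optional

SUFFIX = " Local"  # trailing token described in the profile


def parse_occurred(raw: Optional[str]) -> Optional[datetime]:
    """Return a naive datetime, or None when no usable date is present."""
    if raw is None:
        return None
    text = raw.strip()
    if text == "Local":
        # Bare token with no date: coerce to null instead of raising.
        return None
    if text.endswith(SUFFIX):
        text = text[: -len(SUFFIX)]
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Anything else that does not match the assumed format becomes null.
        return None


print(parse_occurred("2015-11-07 18:00:00 Local"))
print(parse_occurred("Local"))
```

The result is deliberately left timezone-naive, since 'Local' names no specific zone.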

n: 147,890
nulls: 0 (0.0%)
unique: 126,264
len_min: 5
len_max: 25
len_mean: 24.96
len_median: 25
len_p95: 25
word_mean: 2.996
word_median: 3
n_empty: 0
n_duplicates: 21,626
duplicate_rate: 0.1462
vocab_size: 10,098
readability_flesch_mean: 90.72
emoji_rate: 0
url_rate: 0
one_word_rate: 0.002022
allcaps_rate: 0
boilerplate_rate: 0

Location

text feature multilingual duplicates

Short 'City, State/Region, Country' location strings, averaging 20 characters and 3.6 words, dominated by US entries (Phoenix, Seattle, Las Vegas lead) with 'usa' appearing 17,880 times. The column is highly repetitive: 110,819 of 147,890 rows are duplicates (75%) across only 37,070 unique values, so it behaves like a categorical despite being free text. Language detection flags multilingual content but this mostly reflects short-string misclassification — 4,481 detected as English versus small counts in 27 other codes. Treatment: Parse into city/state/country fields and treat as a high-cardinality categorical. high · anthropic:claude-opus-4-7
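
One way to split the field into parts is sketched below. The `Place` tuple and `split_location` helper are hypothetical names, the sample string is made up for illustration, and the handling of two-part values (read as city plus country) is an assumption that should be checked against the real data.

```python
from typing import NamedTuple, Optional


class Place(NamedTuple):
    city: Optional[str]
    region: Optional[str]
    country: Optional[str]


def split_location(raw: Optional[str]) -> Place:
    """Split 'City, State/Region, Country' into its three parts."""
    if raw is None or not raw.strip():
        return Place(None, None, None)
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) >= 3:
        # The last two parts are region and country; any extra commas
        # are assumed to belong to the city name.
        return Place(", ".join(parts[:-2]), parts[-2], parts[-1])
    if len(parts) == 2:
        # Assumption: a two-part value is city plus country.
        return Place(parts[0], None, parts[1])
    return Place(parts[0], None, None)


print(split_location("Phoenix, AZ, USA"))
```

Counting rows per `country` and `region` after the split gives the high-cardinality categorical view that the treatment suggests.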

n: 147,890
nulls: 1 (0.0%)
unique: 37,070
len_min: 7
len_max: 92
len_mean: 20.38
len_median: 18
len_p95: 37
word_mean: 3.571
word_median: 3
n_empty: 0
n_duplicates: 110,819
duplicate_rate: 0.7493
vocab_size: 8,750
readability_flesch_mean: 26.72
emoji_rate: 6.762e-06
url_rate: 0
one_word_rate: 6.762e-06
allcaps_rate: 0.003983
boilerplate_rate: 0

Shape

categorical feature

Categorical descriptor of UFO sighting shapes across 39 distinct values, with 'Light' leading at 27,494 records: 19.4% of non-null values, or 18.6% of all rows. The distribution is moderately spread (entropy ratio 0.74), and notably 'Other' (10,062) and 'Unknown' (10,021) together outnumber every category except 'Light', suggesting substantial reporter ambiguity. Null rate is 4.29%, modest but non-trivial. Treatment: One-hot or target-encode after collapsing 'Other'/'Unknown'/nulls into a single missing bucket. high · anthropic:claude-opus-4-7
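
The collapsing and one-hot step might look like the following sketch. The bucket label `"Missing"` and both helper names are illustrative choices, not part of the dataset or the profiler.

```python
MISSING = "Missing"              # hypothetical label for the merged bucket
AMBIGUOUS = {"Other", "Unknown"}


def normalise_shape(value):
    """Map nulls and ambiguous labels to a single missing bucket."""
    if value is None or value in AMBIGUOUS:
        return MISSING
    return value


def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    cats = sorted(set(values))
    return cats, [[1 if v == c else 0 for c in cats] for v in values]


shapes = ["Light", "Other", None, "Triangle", "Unknown"]
clean = [normalise_shape(s) for s in shapes]
cats, rows = one_hot(clean)
print(cats)
print(rows)
```

With the counts reported here, the merged bucket would hold 10,062 + 10,021 + 6,343 = 26,426 rows, about 17.9% of the dataset and nearly as large as 'Light' itself.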

n: 147,890
nulls: 6,343 (4.3%)
unique: 39
top_value: Light
top_rate: 0.1942
cardinality: 39
entropy: 3.93
entropy_ratio: 0.7436
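
The entropy figures are consistent with Shannon entropy in bits divided by log2 of the cardinality. That reading is an inference from the numbers, not something the report states. A small sketch under that assumption:

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# Check against the reported Shape figures: entropy 3.93, cardinality 39.
reported = 3.93 / math.log2(39)
print(round(reported, 4))
```

The check lands within rounding distance of the reported 0.7436. A ratio of 1.0 would mean all 39 shapes were equally common.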

Duration

text feature short_text duplicates

This is a free-text duration field, almost always a number-plus-unit phrase like '5 minutes' or '30 seconds' (mean 9.5 chars, ~2 words). Values are highly repetitive: only 15,527 distinct strings across 147,890 rows and an 89% duplicate rate, with a 4.8% null rate. The dominant units are 'minutes' and 'seconds', but the presence of an abbreviated 'min' token signals inconsistent formatting that will need normalisation. Treatment: Parse number and unit, convert to a single numeric scale (e.g., seconds), and treat 'min'/'minutes' variants as equivalent. high · anthropic:claude-opus-4-7
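
A minimal parsing sketch for the number-plus-unit case follows. The unit table is an assumption that covers only the variants one would expect from the description; values that do not match, such as free-form phrases, are returned as None rather than guessed.

```python
import re

# Assumed unit spellings; extend after inspecting the real value counts.
UNIT_SECONDS = {
    "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
}
PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\.?\s*$", re.IGNORECASE)


def duration_seconds(raw):
    """Convert strings like '5 minutes' or '2 min' to seconds, else None."""
    if raw is None:
        return None
    match = PATTERN.match(raw)
    if not match:
        return None
    unit = match.group(2).lower()
    if unit not in UNIT_SECONDS:
        return None
    return float(match.group(1)) * UNIT_SECONDS[unit]


for sample in ["5 minutes", "30 seconds", "2 min", "a while"]:
    print(sample, "->", duration_seconds(sample))
```

Tracking the share of rows that come back as None is a quick way to see how much of the column falls outside this simple pattern.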

n: 147,890
nulls: 7,148 (4.8%)
unique: 15,527
len_min: 1
len_max: 37
len_mean: 9.469
len_median: 9
len_p95: 15
word_mean: 2.066
word_median: 2
n_empty: 0
n_duplicates: 125,215
duplicate_rate: 0.8897
vocab_size: 1,757
readability_flesch_mean: 84.41
emoji_rate: 7.105e-06
url_rate: 0
one_word_rate: 0.0895
allcaps_rate: 0.03033
boilerplate_rate: 0

No of observers

numeric feature high_skew outliers

Counts of observers per record, with a typical value of 1-2 (median 2, IQR 1) but a maximum of 20000 driving mean 4.6 and std 129.5. Skew of 109.3 and kurtosis of 14332 are extreme, and 13.0% of rows are flagged as outliers. A min of -10 is suspicious for a count, and 6.9% are zero with 4.5% null. Treatment: Clip negatives, investigate the 20000 tail, and log1p-transform before modelling. high · anthropic:claude-opus-4-7
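
The clip-then-log1p treatment can be sketched as below. Both function names are hypothetical, and clipping negatives to zero (rather than setting them to null) simply follows the wording of the treatment. Whether a negative count is better treated as missing is a separate judgment call.

```python
import math


def clean_observers(value):
    """Clip negative counts to zero; leave nulls as None."""
    if value is None:
        return None
    return max(value, 0)


def log_observers(value):
    """log1p of the cleaned count, so zero maps to zero."""
    cleaned = clean_observers(value)
    return None if cleaned is None else math.log1p(cleaned)


for sample in [-10, 0, 2, 20000, None]:
    print(sample, "->", log_observers(sample))
```

On this scale the 20,000 maximum compresses to roughly 9.9, against about 1.1 for the median of 2, which keeps the tail from dominating averages.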

n: 147,890
nulls: 6,661 (4.5%)
unique: 137
min: -10
max: 20,000
mean: 4.603
median: 2
std: 129.5
q1: 1
q3: 2
iqr: 1
skew: 109.3
kurtosis: 1.433e+04
n_outliers: 18,390
outlier_rate: 0.1302
zero_rate: 0.06919

Reported

text timestamp multilingual

This is a 'Reported' timestamp stored as text in fixed 27-character format like '1999-11-16 00:00:00 Pacific'. Every value has identical length (min/max/mean = 27) and 3 words, with 'Pacific' appearing as a constant timezone suffix in ~20000 rows. The multilingual alert is a false positive from the language detector misreading dates; the duplicate rate of 7.7% (11418 rows) reflects multiple events sharing a report date. Treatment: Parse to datetime (strip 'Pacific' suffix, localize to US/Pacific) before any temporal analysis. high · anthropic:claude-opus-4-7

n: 147,890
nulls: 1 (0.0%)
unique: 136,471
len_min: 27
len_max: 27
len_mean: 27
len_median: 27
len_p95: 27
word_mean: 3
word_median: 3
n_empty: 0
n_duplicates: 11,418
duplicate_rate: 0.07721
vocab_size: 24,077
readability_flesch_mean: 62.79
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0

Posted

categorical timestamp

This column stores posting dates as datetime strings with zeroed time components, almost certainly a publication or upload timestamp. Across 147,890 rows there are 626 distinct dates and a single null, and the distribution is remarkably flat: entropy ratio 0.93, with the most common date (2020-06-25) accounting for only 1.24% of rows. The top dates span 1999 to 2023, suggesting the dataset covers more than two decades of activity. Treatment: Parse to datetime and derive year/month features for temporal analysis. high · anthropic:claude-opus-4-7
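
Deriving year and month could be sketched like this, assuming every non-null value has the same shape as the reported top value '2020-06-25 00:00:00'. The helper name `posted_features` is hypothetical.

```python
from datetime import datetime


def posted_features(raw):
    """Return year/month features, or None for null or malformed values."""
    if raw is None:
        return None
    try:
        ts = datetime.strptime(raw.strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return {"year": ts.year, "month": ts.month}


print(posted_features("2020-06-25 00:00:00"))
```

With only 626 distinct dates, the mapping can be computed once per unique value and reused across rows.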

n: 147,890
nulls: 1 (0.0%)
unique: 626
top_value: 2020-06-25 00:00:00
top_rate: 0.01239
cardinality: 626
entropy: 8.644
entropy_ratio: 0.9304

Characteristics

text feature null_rate duplicates

This is a multi-label categorical feature describing observed object characteristics (e.g. "Lights on object", "Aura or haze around object", "Aircraft nearby"), stored as comma-joined tags rather than a structured list. Despite 147,890 rows, only 1,446 distinct strings exist and 98.6% are duplicates, with a tiny vocab of 43 tokens. Watch for the 28.15% null rate and a truncated tag ("Changed Colo", apparently "Changed Color" cut off) that recurs across thousands of rows. Treatment: Split on commas and one-hot encode the ~43 underlying tags; treat nulls as a separate indicator. high · anthropic:claude-opus-4-7
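
A multi-hot encoding of the tags might be sketched as follows. The repair of "Changed Colo" to "Changed Color" follows the guess made in the description and remains an assumption; the helper names are hypothetical.

```python
# Assumed repair for the truncated tag noted in the description.
REPAIRS = {"Changed Colo": "Changed Color"}


def split_tags(raw):
    """Split a comma-joined tag string into a clean list of tags."""
    if raw is None:
        return []
    tags = (t.strip() for t in raw.split(","))
    return [REPAIRS.get(t, t) for t in tags if t]


def multi_hot(rows):
    """Return the tag vocabulary and one 0/1 dict per row, plus a null flag."""
    vocab = sorted({t for r in rows for t in split_tags(r)})
    encoded = []
    for r in rows:
        present = set(split_tags(r))
        row = {t: int(t in present) for t in vocab}
        row["is_null"] = int(r is None)
        encoded.append(row)
    return vocab, encoded


sample = ["Lights on object, Aura or haze around object", "Aircraft nearby", None]
vocab, encoded = multi_hot(sample)
print(vocab)
print(encoded[2])
```

The separate `is_null` flag keeps the 28% of rows without tags distinguishable from rows where the reporter ticked nothing relevant.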

n: 147,890
nulls: 41,631 (28.1%)
unique: 1,446
len_min: 6
len_max: 251
len_mean: 33.27
len_median: 26
len_p95: 76
word_mean: 5.613
word_median: 5
n_empty: 0
n_duplicates: 104,813
duplicate_rate: 0.9864
vocab_size: 43
readability_flesch_mean: 69.46
emoji_rate: 0
url_rate: 0
one_word_rate: 0.00384
allcaps_rate: 0
boilerplate_rate: 0

Summary

text free_text near_unique

Free-text summary field with 144,208 unique values across 147,890 rows and a 0.6% null rate, so virtually every record carries its own short description. Lengths are highly skewed: median 76 characters / 13 words but a max of 10,624 characters and a p95 of 479, and mean Flesch readability of 67.3 suggests fairly plain English prose. Top tokens are stopwords plus 'light' (4,676 occurrences), hinting at a recurring topical theme worth investigating; duplicates (1.9%) and boilerplate (<0.1%) are negligible. Treatment: Tokenize and embed before modelling; do not treat as a categorical key. high · anthropic:claude-opus-4-7

n: 147,890
nulls: 891 (0.6%)
unique: 144,208
len_min: 1
len_max: 10,624
len_mean: 134.9
len_median: 76
len_p95: 479
word_mean: 25.06
word_median: 13
n_empty: 0
n_duplicates: 2,791
duplicate_rate: 0.01899
vocab_size: 28,632
readability_flesch_mean: 67.28
emoji_rate: 0.0001088
url_rate: 0.0007755
one_word_rate: 0.002327
allcaps_rate: 0.01786
boilerplate_rate: 0.0008708

Text

text free_text near_unique

Free-text field containing medium-length English prose, averaging 949.9 characters and 181.6 words with a Flesch readability of 69.6. Given the dataset, this is most likely the full witness narrative for each sighting. The column is near-unique (127,124 unique of 147,890) yet still carries 4,091 exact duplicates (3.1%) and an 11.3% null rate worth investigating. Top words are dominated by English stopwords plus a frequent first-person 'i' and 'my', hinting at personal/subjective writing rather than formal documents. Treatment: Clean nulls and exact duplicates, then tokenize and embed before modelling. high · anthropic:claude-opus-4-7

n: 147,890
nulls: 16,675 (11.3%)
unique: 127,124
len_min: 1
len_max: 64,550
len_mean: 949.9
len_median: 682
len_p95: 2,673
word_mean: 181.6
word_median: 131
n_empty: 0
n_duplicates: 4,091
duplicate_rate: 0.03118
vocab_size: 65,106
readability_flesch_mean: 69.56
emoji_rate: 0.0001981
url_rate: 0.01805
one_word_rate: 0.0006783
allcaps_rate: 0.007217
boilerplate_rate: 0.001875

Location details

text free_text near_unique null_rate

Free-text supplementary location notes, populated for only 6.9% of the 147,890 rows (null_rate 0.931). When present, entries are short prose averaging 38.7 characters / 6.9 words with readable Flesch 69.8, and the top tokens ('the', 'of', 'in', 'my', 'from') confirm natural-language descriptions rather than structured place codes. Cardinality is high (9,713 uniques) but 492 exact duplicates (4.8%) hint at recurring phrases worth normalising. Treatment: Treat as optional free-text annotation: tokenize/embed for any modelling and expect to ignore for the 93% of rows where it is null. high · anthropic:claude-opus-4-7

n: 147,890
nulls: 137,685 (93.1%)
unique: 9,713
len_min: 1
len_max: 197
len_mean: 38.75
len_median: 33
len_p95: 92
word_mean: 6.93
word_median: 6
n_empty: 0
n_duplicates: 492
duplicate_rate: 0.04821
vocab_size: 12,384
readability_flesch_mean: 69.85
emoji_rate: 0
url_rate: 0.001078
one_word_rate: 0.03822
allcaps_rate: 0.0197
boilerplate_rate: 0.000196

Explanation

categorical label null_rate

Free-form classification labels explaining UFO/sky-object sightings, with categories like 'Starlink - Probable', 'Rocket - Certain', and 'Balloon - Possible' combining an object type with a confidence qualifier. The column is 99.46% null — only ~800 of 147,890 rows carry a value — so it functions as a sparse annotation rather than a primary feature. Among populated rows, 58 distinct labels appear with relatively even spread (entropy ratio 0.82); the top label 'Starlink - Probable' covers just 9.71% of non-nulls. Treatment: Treat as a sparse secondary label; split into object/confidence and only model on the annotated subset. high · anthropic:claude-opus-4-7
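
Splitting a label into object and confidence can be sketched as below, assuming the ' - ' separator seen in the top values holds for all 58 labels. The helper name `split_explanation` is hypothetical.

```python
def split_explanation(label):
    """Split 'Object - Confidence' into (object, confidence)."""
    if label is None:
        return None, None
    # Split on the last ' - ' so an object name containing one stays intact.
    obj, sep, confidence = label.rpartition(" - ")
    if not sep:
        return label.strip(), None
    return obj.strip(), confidence.strip()


for sample in ["Starlink - Probable", "Planet/Star - Possible", None]:
    print(sample, "->", split_explanation(sample))
```

Grouping the annotated rows by the object part alone would, for example, combine the three Starlink labels shown in the table above (78 + 69 + 49 = 196 rows).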

n: 147,890
nulls: 147,087 (99.5%)
unique: 58
top_value: Starlink - Probable
top_rate: 0.09714
cardinality: 58
entropy: 4.832
entropy_ratio: 0.8248