saturn·

quirky nuforc sightings

source /home/coolhand/html/datavis/data_trove/cache/quirky/nuforc_sightings.parquet 147,890 rows 13 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 147,890 UFO sighting reports (likely from NUFORC) with 13 columns covering location, shape, duration, witness counts, and free-text descriptions. The Shape field is a low-cardinality categorical with 39 values, dominated by 'Light' (27,494), 'Circle', and 'Triangle', and is a natural starting point for understanding what people report. Duration is text-based but highly repetitive, with '5 minutes' and '2 minutes' as the most common values, suggesting witnesses anchor on round numbers. Watch out for 'No of observers': it is extremely skewed (max 20,000, min -10, skew 109) with ~13% outliers, so it needs cleaning before any quantitative use. Also note that 'Explanation' is 99.5% null: only a tiny fraction of sightings (803 rows) have an official label, with 'Starlink' explanations leading the small set that do.

citing: row_count · column_count · Shape.top_values · Shape.stats.cardinality · Duration.top_values · No of observers.stats · Explanation.null_rate · Explanation.top_values · Location.top_values · Summary.stats.len_mean

Schema

13 columns
Per-column summary.

Column            Type         Nulls  Unique   Alerts
Sighting          numeric      0.0%   147,890
Occurred          text         0.0%   126,264
Location          text         0.0%   37,070   multilingual, duplicates
Shape             categorical  4.3%   39
Duration          text         4.8%   15,527   short_text, duplicates
No of observers   numeric      4.5%   137      high_skew, outliers
Reported          text         0.0%   136,471  multilingual
Posted            categorical  0.0%   626
Characteristics   text         28.1%  1,446    null_rate, duplicates
Summary           text         0.6%   144,208  near_unique
Text              text         11.3%  127,124  near_unique
Location details  text         93.1%  9,713    near_unique, null_rate
Explanation       categorical  99.5%  58       null_rate

Sighting

numeric identifier
Sighting is almost certainly a row identifier: every one of the 147890 values is unique, there are no nulls, and the distribution is essentially uniform (skew -0.013, kurtosis -1.13) spanning 111 to 179773. The values are not a dense 1..N sequence, suggesting an externally assigned record or sighting ID with gaps. No outliers and no zeros, consistent with an ID rather than a measurement. Treatment: Drop from modelling features; retain as the join/lookup key per row. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 0 (0.0%)
unique: 147,890
min: 111
max: 179,773
mean: 9.198e+04
median: 9.143e+04
std: 5.044e+04
q1: 5.015e+04
q3: 1.345e+05
iqr: 8.437e+04
skew: -0.0132
kurtosis: -1.134
n_outliers: 0
outlier_rate: 0
zero_rate: 0
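
As a quick check on the claim that the IDs are not a dense 1..N sequence, the reported min, max, and row count are enough to estimate how much of the ID range is occupied. This is plain arithmetic on the profile's own figures; the variable names are illustrative.

```python
# Figures taken from the Sighting stats in this profile.
n_rows = 147_890
id_min, id_max = 111, 179_773

# Number of IDs a fully dense sequence would contain over the same range.
span = id_max - id_min + 1

# IDs in the range that never appear, and the share that do appear.
gaps = span - n_rows
density = n_rows / span

print(f"span={span}, gaps={gaps}, density={density:.3f}")
```

A density well below 1 is consistent with an externally assigned ID that skips values.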

Occurred

text timestamp
Timestamp strings of the form 'YYYY-MM-DD HH:MM:SS Local', with length tightly clustered at 25 characters (mean 24.96, p95 25). Stored as text rather than parsed datetimes, and 14.6% of values are duplicates (21,626 rows), with notable spikes on July 4th evenings and one outlier '2015-11-07 18:00:00' appearing 104 times. 299 rows contain just the bare token 'Local' with no date, which will break naive datetime parsing. Treatment: Strip the ' Local' suffix and parse to datetime, coercing the 299 bare 'Local' entries to null. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 0 (0.0%)
unique: 126,264
len_min: 5
len_max: 25
len_mean: 24.96
len_median: 25
len_p95: 25
word_mean: 2.996
word_median: 3
n_empty: 0
n_duplicates: 21,626
duplicate_rate: 0.1462
vocab_size: 10,098
readability_flesch_mean: 90.72
emoji_rate: 0
url_rate: 0
one_word_rate: 0.002022
allcaps_rate: 0
boilerplate_rate: 0
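
The treatment suggested for Occurred can be sketched with the standard library alone. The function name `parse_occurred` is hypothetical, and the sketch assumes every well-formed value follows the 'YYYY-MM-DD HH:MM:SS Local' pattern described above; anything else, including the bare 'Local' token, is coerced to `None`.

```python
from datetime import datetime

def parse_occurred(value):
    """Return a naive datetime, or None when no usable timestamp is present."""
    if value is None:
        return None
    text = value.strip()
    # Drop the trailing 'Local' marker; a bare 'Local' becomes an empty string.
    if text.endswith("Local"):
        text = text[: -len("Local")].strip()
    if not text:
        return None
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Unexpected layout: treat as missing rather than raising.
        return None

print(parse_occurred("2015-11-07 18:00:00 Local"))
print(parse_occurred("Local"))
```

Because the function works on one string at a time, it could be applied row by row in whichever dataframe library is used to load the parquet file.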

Location

text feature multilingual duplicates
Short 'City, State/Region, Country' location strings, averaging 20 characters and 3.6 words, dominated by US entries (Phoenix, Seattle, Las Vegas lead) with 'usa' appearing 17,880 times. The column is highly repetitive: 110,819 of 147,890 rows are duplicates (75%) across only 37,070 unique values, so it behaves like a categorical despite being free text. Language detection flags multilingual content but this mostly reflects short-string misclassification — 4,481 detected as English versus small counts in 27 other codes. Treatment: Parse into city/state/country fields and treat as a high-cardinality categorical. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 1 (0.0%)
unique: 37,070
len_min: 7
len_max: 92
len_mean: 20.38
len_median: 18
len_p95: 37
word_mean: 3.571
word_median: 3
n_empty: 0
n_duplicates: 110,819
duplicate_rate: 0.7493
vocab_size: 8,750
readability_flesch_mean: 26.72
emoji_rate: 6.762e-06
url_rate: 0
one_word_rate: 6.762e-06
allcaps_rate: 0.003983
boilerplate_rate: 0
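
One way to split Location into city, state, and country fields is sketched below. The function name `split_location` and the example string are hypothetical; the sketch assumes the last comma-separated part is the country and the one before it is the state or region, which should be checked against real values before relying on it.

```python
def split_location(value):
    """Split 'City, State/Region, Country' into a (city, state, country) tuple."""
    if value is None:
        return (None, None, None)
    parts = [p.strip() for p in value.split(",") if p.strip()]
    if not parts:
        return (None, None, None)
    if len(parts) == 1:
        # Only one field present: assume it is the city.
        return (parts[0], None, None)
    if len(parts) == 2:
        # Assumption: two fields mean 'City, Country' with no state.
        return (parts[0], None, parts[1])
    # Three or more fields: anything before the last two is kept as the city.
    return (", ".join(parts[:-2]), parts[-2], parts[-1])

print(split_location("Phoenix, AZ, USA"))  # illustrative input, not a quoted row
```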

Shape

categorical feature
Categorical descriptor of UFO sighting shapes across 39 distinct values, with 'Light' leading at 19.4% of records (27,494). The distribution is moderately spread (entropy ratio 0.74), and notably 'Other' (10,062) and 'Unknown' (10,021) together rival the second-largest real category, suggesting substantial reporter ambiguity. Null rate is 4.29%, modest but non-trivial. Treatment: One-hot or target-encode after collapsing 'Other'/'Unknown'/nulls into a single missing bucket. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 6,343 (4.3%)
unique: 39
top_value: Light
top_rate: 0.1942
cardinality: 39
entropy: 3.93
entropy_ratio: 0.7436
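
The collapse-then-encode treatment for Shape might look like the following. The bucket label `"Missing"` and the helper names are assumptions made for illustration; the one-hot step is written out by hand so the sketch has no third-party dependencies.

```python
MISSING = "Missing"              # hypothetical label for the merged bucket
AMBIGUOUS = {"other", "unknown"}

def normalise_shape(value):
    """Map nulls, blanks, 'Other' and 'Unknown' to a single missing bucket."""
    if value is None:
        return MISSING
    text = value.strip()
    if not text or text.lower() in AMBIGUOUS:
        return MISSING
    return text

def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

shapes = ["Light", "Other", None, "Unknown", "Circle"]
cleaned = [normalise_shape(s) for s in shapes]
categories, rows = one_hot(cleaned)
print(categories)
print(rows)
```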

Duration

text feature short_text duplicates
This is a free-text duration field, almost always a number-plus-unit phrase like '5 minutes' or '30 seconds' (mean 9.5 chars, ~2 words). Values are highly repetitive: only 15,527 distinct strings across 147,890 rows and an 89% duplicate rate, with a 4.8% null rate. The dominant units are 'minutes' and 'seconds', but the presence of an abbreviated 'min' token signals inconsistent formatting that will need normalisation. Treatment: Parse number and unit, convert to a single numeric scale (e.g., seconds), and treat 'min'/'minutes' variants as equivalent. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 7,148 (4.8%)
unique: 15,527
len_min: 1
len_max: 37
len_mean: 9.469
len_median: 9
len_p95: 15
word_mean: 2.066
word_median: 2
n_empty: 0
n_duplicates: 125,215
duplicate_rate: 0.8897
vocab_size: 1,757
readability_flesch_mean: 84.41
emoji_rate: 7.105e-06
url_rate: 0
one_word_rate: 0.0895
allcaps_rate: 0.03033
boilerplate_rate: 0
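
A minimal sketch of the number-plus-unit parsing for Duration follows. The function name and the unit table are assumptions: only 'minutes', 'seconds', and 'min' are confirmed by the profile, and the other spellings are plausible variants added for illustration. Values that do not match the simple pattern are returned as `None` instead of being guessed at.

```python
import re

# Seconds per unit; spellings beyond 'seconds', 'minutes' and 'min' are assumed.
UNIT_SECONDS = {
    "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
}

# A number, optional whitespace, a unit word, and an optional trailing dot.
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([a-z]+)\.?")

def duration_seconds(value):
    """Convert strings like '5 minutes' to seconds; return None if unparseable."""
    if value is None:
        return None
    match = PATTERN.fullmatch(value.strip().lower())
    if match is None:
        return None
    factor = UNIT_SECONDS.get(match.group(2))
    if factor is None:
        return None
    return float(match.group(1)) * factor

print(duration_seconds("5 minutes"))
print(duration_seconds("2 min"))
```

Ranges and vague phrases fall through to `None` here, so the share of unparsed rows is worth measuring before using the result.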

No of observers

numeric feature high_skew outliers
Counts of observers per record, with a typical value of 1-2 (median 2, IQR 1) but a maximum of 20,000 that drives the mean to 4.6 and the std to 129.5. Skew of 109.3 and kurtosis of 14,332 are extreme, and 13.0% of non-null values (18,390) are flagged as outliers. A min of -10 is impossible for a count, 6.9% of values are zero, and 4.5% are null. Treatment: Clip negatives, investigate the 20,000 tail, and log1p-transform before modelling. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 6,661 (4.5%)
unique: 137
min: -10
max: 20,000
mean: 4.603
median: 2
std: 129.5
q1: 1
q3: 2
iqr: 1
skew: 109.3
kurtosis: 1.433e+04
n_outliers: 18,390
outlier_rate: 0.1302
zero_rate: 0.06919
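
The clip-then-log1p treatment for 'No of observers' can be sketched per value as below. The function name is hypothetical, and clipping negatives to zero is one reading of "clip negatives"; setting them to missing would be an equally defensible choice.

```python
import math

def clean_observers(value):
    """Clip negative counts to zero, then apply log1p; keep missing as None."""
    if value is None:
        return None
    number = float(value)
    if math.isnan(number):
        return None
    # Assumption: negative counts are data-entry errors and are clipped to 0.
    clipped = max(number, 0.0)
    return math.log1p(clipped)

for raw in (-10, 0, 2, 20_000, None):
    print(raw, clean_observers(raw))
```

The transform compresses the extreme tail but does not replace investigating whether values such as 20,000 are genuine.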

Reported

text timestamp multilingual
This is a 'Reported' timestamp stored as text in a fixed 27-character format like '1999-11-16 00:00:00 Pacific'. Every value has identical length (min/max/mean = 27) and 3 words. The token 'Pacific' is counted in only ~20,000 rows, so confirm whether it is a constant timezone suffix or whether other zone names of the same length occur before stripping it. The multilingual alert is a false positive from the language detector misreading dates; the duplicate rate of 7.7% (11,418 rows) reflects multiple events sharing a report date. Treatment: Parse to datetime (strip the 'Pacific' suffix, localize to US/Pacific) before any temporal analysis. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 1 (0.0%)
unique: 136,471
len_min: 27
len_max: 27
len_mean: 27
len_median: 27
len_p95: 27
word_mean: 3
word_median: 3
n_empty: 0
n_duplicates: 11,418
duplicate_rate: 0.07721
vocab_size: 24,077
readability_flesch_mean: 62.79
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
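
A hedged sketch of the Reported treatment using the standard `zoneinfo` module is shown below. The function name and the `ZONE_NAMES` table are assumptions: 'Pacific' is the only suffix the profile confirms, and mapping it to the IANA zone "America/Los_Angeles" is one reasonable reading of "US/Pacific". Rows with any other suffix are left timezone-naive rather than guessed.

```python
from datetime import datetime
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Only 'Pacific' is confirmed by the profile; extend after inspecting the data.
ZONE_NAMES = {"Pacific": "America/Los_Angeles"}

def parse_reported(value):
    """Parse 'YYYY-MM-DD HH:MM:SS <Zone>' and attach a timezone when known."""
    if not value:
        return None
    stamp, _, label = value.strip().rpartition(" ")
    try:
        naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    zone = ZONE_NAMES.get(label)
    if zone is None:
        return naive  # unknown suffix: keep the naive timestamp
    try:
        return naive.replace(tzinfo=ZoneInfo(zone))
    except ZoneInfoNotFoundError:
        # No timezone database on this system: fall back to naive.
        return naive

print(parse_reported("1999-11-16 00:00:00 Pacific"))
```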

Posted

categorical timestamp
This column stores posting dates as datetime strings with zeroed time components, almost certainly a publication or upload timestamp. Across 147,890 rows there are 626 distinct dates and a single null, and the distribution is remarkably flat: the entropy ratio is 0.93 and the most common date (2020-06-25) accounts for only 1.24% of rows. The top dates span 1999 to 2023, suggesting the dataset covers more than two decades of activity. Treatment: parse to datetime and derive year/month features for temporal analysis. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 1 (0.0%)
unique: 626
top_value: 2020-06-25 00:00:00
top_rate: 0.01239
cardinality: 626
entropy: 8.644
entropy_ratio: 0.9304
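
Deriving year and month features from Posted could be done along these lines. The function name and the returned dictionary layout are illustrative; the format string assumes every value looks like the top value shown above.

```python
from datetime import datetime

def posted_features(value):
    """Return date, year and month for a Posted string, or None if missing."""
    if not value:
        return None
    try:
        parsed = datetime.strptime(value.strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    # The time component is always zero in this column, so keep only the date.
    return {"date": parsed.date(), "year": parsed.year, "month": parsed.month}

features = posted_features("2020-06-25 00:00:00")
print(features)
```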

Characteristics

text feature null_rate duplicates
This is a multi-label categorical feature describing observed object characteristics (e.g. "Lights on object", "Aura or haze around object", "Aircraft nearby"), stored as comma-joined tags rather than a structured list. Among the 106,259 non-null values, only 1,446 distinct strings exist and 98.6% are duplicates, with a tiny vocab of 43 word tokens. Watch for the 28.15% null rate and a truncated tag ("Changed Colo", apparently "Changed Color" cut off) that recurs across thousands of rows. Treatment: Split on commas and one-hot encode the underlying tags (likely fewer than 43, since the vocab counts words and most tags span several words); treat nulls as a separate indicator. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 41,631 (28.1%)
unique: 1,446
len_min: 6
len_max: 251
len_mean: 33.27
len_median: 26
len_p95: 76
word_mean: 5.613
word_median: 5
n_empty: 0
n_duplicates: 104,813
duplicate_rate: 0.9864
vocab_size: 43
readability_flesch_mean: 69.46
emoji_rate: 0
url_rate: 0
one_word_rate: 0.00384
allcaps_rate: 0
boilerplate_rate: 0
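
The split-and-encode treatment for Characteristics can be sketched as a multi-hot encoding with a separate null indicator. Helper names are hypothetical, and repairing "Changed Colo" to "Changed Color" follows the profile's own guess about the truncation rather than a confirmed mapping.

```python
# Assumed repair for the truncated tag noted in the profile.
REPAIRS = {"Changed Colo": "Changed Color"}

def split_tags(value):
    """Split a comma-joined tag string into a list; None stays None."""
    if value is None:
        return None
    tags = [t.strip() for t in value.split(",")]
    return [REPAIRS.get(t, t) for t in tags if t]

def multi_hot(rows):
    """Return (sorted tag vocabulary, 0/1 matrix, null-indicator column)."""
    vocab = sorted({t for tags in rows if tags for t in tags})
    matrix = [
        [int(tags is not None and tag in tags) for tag in vocab]
        for tags in rows
    ]
    missing = [int(tags is None) for tags in rows]
    return vocab, matrix, missing

rows = [
    split_tags("Lights on object, Changed Colo"),
    split_tags(None),
    split_tags("Aircraft nearby"),
]
vocab, matrix, missing = multi_hot(rows)
print(vocab)
print(matrix)
print(missing)
```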

Summary

text free_text near_unique
Free-text summary field with 144,208 unique values across 147,890 rows and a 0.6% null rate, so virtually every record carries its own short description. Lengths are highly skewed: median 76 characters / 13 words but a max of 10,624 characters and a p95 of 479, and mean Flesch readability of 67.3 suggests fairly plain English prose. Top tokens are stopwords plus 'light' (4,676 occurrences), hinting at a recurring topical theme worth investigating; duplicates (1.9%) and boilerplate (<0.1%) are negligible. Treatment: Tokenize and embed before modelling; do not treat as a categorical key. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 891 (0.6%)
unique: 144,208
len_min: 1
len_max: 10,624
len_mean: 134.9
len_median: 76
len_p95: 479
word_mean: 25.06
word_median: 13
n_empty: 0
n_duplicates: 2,791
duplicate_rate: 0.01899
vocab_size: 28,632
readability_flesch_mean: 67.28
emoji_rate: 0.0001088
url_rate: 0.0007755
one_word_rate: 0.002327
allcaps_rate: 0.01786
boilerplate_rate: 0.0008708

Text

text free_text near_unique
Free-text field containing medium-length English prose, averaging 949.9 characters and 181.6 words with a Flesch readability of 69.6, consistent with witness narratives of each sighting. The column is near-unique (127,124 unique of 147,890) yet still carries 4,091 exact duplicates (3.1%) and an 11.3% null rate worth investigating. Top words are dominated by English stopwords plus a frequent first-person 'i' and 'my', hinting at personal/subjective writing rather than formal documents. Treatment: Clean nulls and exact duplicates, then tokenize and embed before modelling. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 16,675 (11.3%)
unique: 127,124
len_min: 1
len_max: 64,550
len_mean: 949.9
len_median: 682
len_p95: 2,673
word_mean: 181.6
word_median: 131
n_empty: 0
n_duplicates: 4,091
duplicate_rate: 0.03118
vocab_size: 65,106
readability_flesch_mean: 69.56
emoji_rate: 0.0001981
url_rate: 0.01805
one_word_rate: 0.0006783
allcaps_rate: 0.007217
boilerplate_rate: 0.001875

Location details

text free_text near_unique null_rate
Free-text supplementary location notes, populated for only 6.9% of the 147,890 rows (null_rate 0.931). When present, entries are short prose averaging 38.7 characters / 6.9 words with readable Flesch 69.8, and the top tokens ('the', 'of', 'in', 'my', 'from') confirm natural-language descriptions rather than structured place codes. Cardinality is high (9,713 uniques) but 492 exact duplicates (4.8%) hint at recurring phrases worth normalising. Treatment: Treat as optional free-text annotation: tokenize/embed for any modelling and expect to ignore for the 93% of rows where it is null. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 137,685 (93.1%)
unique: 9,713
len_min: 1
len_max: 197
len_mean: 38.75
len_median: 33
len_p95: 92
word_mean: 6.93
word_median: 6
n_empty: 0
n_duplicates: 492
duplicate_rate: 0.04821
vocab_size: 12,384
readability_flesch_mean: 69.85
emoji_rate: 0
url_rate: 0.001078
one_word_rate: 0.03822
allcaps_rate: 0.0197
boilerplate_rate: 0.000196

Explanation

categorical label null_rate
Free-form classification labels explaining UFO/sky-object sightings, with categories like 'Starlink - Probable', 'Rocket - Certain', and 'Balloon - Possible' combining an object type with a confidence qualifier. The column is 99.46% null — only ~800 of 147,890 rows carry a value — so it functions as a sparse annotation rather than a primary feature. Among populated rows, 58 distinct labels appear with relatively even spread (entropy ratio 0.82); the top label 'Starlink - Probable' covers just 9.71% of non-nulls. Treatment: Treat as a sparse secondary label; split into object/confidence and only model on the annotated subset. high · anthropic:claude-opus-4-7
n: 147,890
nulls: 147,087 (99.5%)
unique: 58
top_value: Starlink - Probable
top_rate: 0.09714
cardinality: 58
entropy: 4.832
entropy_ratio: 0.8248
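
Splitting Explanation into an object type and a confidence qualifier could be sketched as follows. The function name is hypothetical, and the sketch assumes the qualifier is whatever follows the last " - " separator, as in the labels quoted above; labels without that separator keep a `None` confidence.

```python
def split_explanation(value):
    """Split 'Object - Confidence' into an (object, confidence) tuple."""
    if value is None:
        return (None, None)
    # Split on the last ' - ' so hyphens inside the object name are preserved.
    obj, sep, confidence = value.rpartition(" - ")
    if not sep:
        return (value.strip(), None)
    return (obj.strip(), confidence.strip())

for label in ("Starlink - Probable", "Rocket - Certain", "Balloon - Possible"):
    print(split_explanation(label))
```

With only about 800 annotated rows, any model built on these two fields should be restricted to that subset.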