quirky-nuforc_sightings · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/quirky/nuforc_sightings.parquet

Saturn profiled 147,890 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

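# Install the saturn-dissect package into the notebook environment (IPython shell escape).
!pip install saturn-dissect
import subprocess

# Profile the parquet file, write the findings to JSON, and request the opt-in LLM prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/quirky/nuforc_sightings.parquet",
    "--findings", "quirky-nuforc_sightings.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if the CLI exits with a non-zero status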

Summary confidence: high

This dataset contains 147,890 UFO sighting reports (likely from NUFORC) with 13 columns covering location, shape, duration, witness counts, and free-text descriptions. The Shape field is a clean categorical with 39 values dominated by 'Light' (27,494), 'Circle', and 'Triangle', a natural starting point for understanding what people report. Duration is text-based but highly repetitive, with '5 minutes' and '2 minutes' as the most common values, suggesting witnesses anchor on round numbers. Watch out for 'No of observers': it is extremely skewed (max 20,000, min -10, skew 109) with ~13% outliers, so it needs cleaning before any quantitative use. Also note that 'Explanation' is 99.5% null: only 803 of the 147,890 sightings carry an official label, with 'Starlink' explanations leading the small set that do.

citing: row_count · column_count · Shape.top_values · Shape.stats.cardinality · Duration.top_values · No of observers.stats · Explanation.null_rate · Explanation.top_values · Location.top_values · Summary.stats.len_mean

Out[4]:

saturn.schema() · 13 columns

column	kind	n	null%	unique	alerts
Sighting	numeric	147,890	0.0%	147,890
Occurred	text	147,890	0.0%	126,264
Location	text	147,890	0.0%	37,070	multilingual duplicates
Shape	categorical	147,890	4.3%	39
Duration	text	147,890	4.8%	15,527	short_text duplicates
No of observers	numeric	147,890	4.5%	137	high_skew outliers
Reported	text	147,890	0.0%	136,471	multilingual
Posted	categorical	147,890	0.0%	626
Characteristics	text	147,890	28.1%	1,446	null_rate duplicates
Summary	text	147,890	0.6%	144,208	near_unique
Text	text	147,890	11.3%	127,124	near_unique
Location details	text	147,890	93.1%	9,713	near_unique null_rate
Explanation	categorical	147,890	99.5%	58	null_rate

Fig 1.

Shape · See which UFO shapes dominate reports — 'Light' alone accounts for nearly 1 in 5 sightings.

Top values for Shape (20 unique shown, of 39 total).
value	count	share
Light	27494	18.6%
Circle	14367	9.7%
Triangle	13086	8.8%
Other	10062	6.8%
Unknown	10021	6.8%
Fireball	9880	6.7%
Disk	8716	5.9%
Sphere	7652	5.2%
Oval	6369	4.3%
Orb	5924	4.0%
Formation	4864	3.3%
Changing	3987	2.7%
Cigar	3753	2.5%
Rectangle	2610	1.8%
Cylinder	2482	1.7%
Flash	2439	1.6%
Diamond	2116	1.4%
Chevron	1742	1.2%
Egg	1289	0.9%
Teardrop	1238	0.8%

Fig 2.

Duration · Look at how witnesses cluster around round-number durations like '5 minutes' and '2 minutes'.

Character-length distribution for Duration (mean: 9.469163433800855).
chars	count
1 – 2	784
2 – 3	1006
3 – 4	586
4 – 5	2052
5 – 6	6297
6 – 6	9510
6 – 7	9325
7 – 8	9401
8 – 9	36028
9 – 10	0
10 – 11	37045
11 – 12	10292
12 – 13	3432
13 – 14	5013
14 – 14	1918
14 – 15	1786
15 – 16	1404
16 – 17	751
17 – 18	818
18 – 19	0
19 – 20	598
20 – 21	476
21 – 22	305
22 – 23	343
23 – 24	337
24 – 24	416
24 – 25	818
25 – 26	0
26 – 27	0
27 – 28	0
28 – 29	0
29 – 30	0
30 – 31	0
31 – 32	0
32 – 32	0
32 – 33	0
33 – 34	0
34 – 35	0
35 – 36	0
36 – 37	1

Fig 3.

Explanation · Among the rare cases with an official explanation, check the mix of Starlink, rocket, and balloon attributions.

Top values for Explanation (20 unique shown, of 58 total).
value	count	share
Starlink - Probable	78	0.1%
Starlink - Certain	69	0.0%
Rocket - Certain	67	0.0%
Balloon - Possible	50	0.0%
Starlink - Possible	49	0.0%
Planet/Star - Possible	42	0.0%
Planet/Star - Probable	41	0.0%
Aircraft - Possible	35	0.0%
Aircraft - Probable	35	0.0%
Camera Anomaly - Probable	33	0.0%
Camera Anomaly - Certain	25	0.0%
Rocket - Probable	24	0.0%
Bird - Possible	21	0.0%
Searchlight - Certain	18	0.0%
Balloon - Probable	18	0.0%
Drone - Possible	17	0.0%
Camera Anomaly - Possible	15	0.0%
Bird - Probable	14	0.0%
Searchlight - Probable	12	0.0%
Rocket - Possible	11	0.0%

Fig 4.

No of observers · Inspect the extreme skew and outliers (negative values and a 20,000 maximum) before trusting any averages.

Histogram bins for No of observers (median: 2.0).
bin	count
-10 – 490.2	141102
490.2 – 990.5	41
990.5 – 1491	52
1491 – 1991	1
1991 – 2491	8
2491 – 2992	3
2992 – 3492	3
3492 – 3992	0
3992 – 4492	1
4492 – 4992	0
4992 – 5493	7
5493 – 5993	0
5993 – 6493	0
6493 – 6994	0
6994 – 7494	0
7494 – 7994	0
7994 – 8494	0
8494 – 8994	0
8994 – 9495	0
9495 – 9995	0
9995 – 1.05e+04	7
1.05e+04 – 1.1e+04	0
1.1e+04 – 1.15e+04	1
1.15e+04 – 1.2e+04	0
1.2e+04 – 1.25e+04	0
1.25e+04 – 1.3e+04	0
1.3e+04 – 1.35e+04	0
1.35e+04 – 1.4e+04	0
1.4e+04 – 1.45e+04	0
1.45e+04 – 1.5e+04	0
1.5e+04 – 1.55e+04	0
1.55e+04 – 1.6e+04	0
1.6e+04 – 1.65e+04	0
1.65e+04 – 1.7e+04	0
1.7e+04 – 1.75e+04	0
1.75e+04 – 1.8e+04	0
1.8e+04 – 1.85e+04	0
1.85e+04 – 1.9e+04	0
1.9e+04 – 1.95e+04	0
1.95e+04 – 2e+04	3

Fig 5.

Summary · Review summary length distribution to gauge how much narrative detail accompanies each sighting.

Character-length distribution for Summary (mean: 134.90139388703324).
chars	count
1 – 267	131782
267 – 532	8911
532 – 798	3233
798 – 1063	1484
1063 – 1329	669
1329 – 1594	333
1594 – 1860	204
1860 – 2126	123
2126 – 2391	74
2391 – 2657	60
2657 – 2922	33
2922 – 3188	16
3188 – 3453	14
3453 – 3719	9
3719 – 3985	13
3985 – 4250	19
4250 – 4516	5
4516 – 4781	5
4781 – 5047	2
5047 – 5312	1
5312 – 5578	4
5578 – 5844	2
5844 – 6109	0
6109 – 6375	0
6375 – 6640	0
6640 – 6906	1
6906 – 7172	0
7172 – 7437	0
7437 – 7703	0
7703 – 7968	0
7968 – 8234	0
8234 – 8499	0
8499 – 8765	0
8765 – 9031	1
9031 – 9296	0
9296 – 9562	0
9562 – 9827	0
9827 – 10093	0
10093 – 10358	0
10358 – 10624	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
Sighting	numeric	0.0%
Occurred	text	0.0%
Location	text	0.0%
Shape	categorical	4.3%
Duration	text	4.8%
No of observers	numeric	4.5%
Reported	text	0.0%
Posted	categorical	0.0%
Characteristics	text	28.1%
Summary	text	0.6%
Text	text	11.3%
Location details	text	93.1%
Explanation	categorical	99.5%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 13,527 detected strings).
lang	count	share
en	12657	93.6%
de	480	3.5%
fr	90	0.7%
es	86	0.6%
cs	50	0.4%
sv	37	0.3%
pl	24	0.2%
it	20	0.1%
da	13	0.1%
et	9	0.1%
hu	8	0.1%
pt	6	0.0%
sl	5	0.0%
fi	5	0.0%
tr	5	0.0%
ru	5	0.0%
nl	4	0.0%
ceb	4	0.0%
te	3	0.0%
ca	3	0.0%
id	2	0.0%
eo	2	0.0%
zh	2	0.0%
bn	1	0.0%
kn	1	0.0%
ms	1	0.0%
hr	1	0.0%
no	1	0.0%
sr	1	0.0%
ro	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	Sighting	No of observers
Sighting	+1.00	+0.01
No of observers	+0.01	+1.00

Sighting numeric identifier

Sighting is almost certainly a row identifier: every one of the 147890 values is unique, there are no nulls, and the distribution is essentially uniform (skew -0.013, kurtosis -1.13) spanning 111 to 179773. The values are not a dense 1..N sequence, suggesting an externally assigned record or sighting ID with gaps. No outliers and no zeros, consistent with an ID rather than a measurement.

Treatment: Drop from modelling features; retain as the join/lookup key per row.

anthropic:claude-opus-4-7 · confidence high

Out[14]:

saturn.columns["Sighting"].stats

stat	value
n	147,890
nulls	0 (0.0%)
unique	147,890
min	111
max	179,773
mean	9.198e+04
median	9.143e+04
std	5.044e+04
q1	5.015e+04
q3	1.345e+05
iqr	8.437e+04
skew	-0.0132
kurtosis	-1.134
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 9.

Distribution of Sighting. Vertical dash marks the median.

Histogram bins for Sighting (median: 91434.5).
bin	count
111 – 4603	3091
4603 – 9094	2425
9094 – 1.359e+04	3218
1.359e+04 – 1.808e+04	3438
1.808e+04 – 2.257e+04	3053
2.257e+04 – 2.706e+04	3390
2.706e+04 – 3.155e+04	3183
3.155e+04 – 3.604e+04	3706
3.604e+04 – 4.053e+04	3583
4.053e+04 – 4.503e+04	3666
4.503e+04 – 4.952e+04	3651
4.952e+04 – 5.401e+04	3961
5.401e+04 – 5.85e+04	3961
5.85e+04 – 6.299e+04	3893
6.299e+04 – 6.748e+04	3933
6.748e+04 – 7.198e+04	4132
7.198e+04 – 7.647e+04	4084
7.647e+04 – 8.096e+04	4122
8.096e+04 – 8.545e+04	3949
8.545e+04 – 8.994e+04	4125
8.994e+04 – 9.443e+04	4176
9.443e+04 – 9.893e+04	3928
9.893e+04 – 1.034e+05	3455
1.034e+05 – 1.079e+05	3801
1.079e+05 – 1.124e+05	3934
1.124e+05 – 1.169e+05	3982
1.169e+05 – 1.214e+05	3802
1.214e+05 – 1.259e+05	3834
1.259e+05 – 1.304e+05	3837
1.304e+05 – 1.349e+05	3904
1.349e+05 – 1.393e+05	3795
1.393e+05 – 1.438e+05	2435
1.438e+05 – 1.483e+05	3931
1.483e+05 – 1.528e+05	3928
1.528e+05 – 1.573e+05	3932
1.573e+05 – 1.618e+05	3825
1.618e+05 – 1.663e+05	3654
1.663e+05 – 1.708e+05	2959
1.708e+05 – 1.753e+05	4194
1.753e+05 – 1.798e+05	4020

Occurred text timestamp

Timestamp strings of the form 'YYYY-MM-DD HH:MM:SS Local', with length tightly clustered at 25 characters (mean 24.96, p95 25). Stored as text rather than parsed datetimes, and 14.6% of values are duplicates (21,626 rows), with notable spikes on July 4th evenings and one outlier '2015-11-07 18:00:00' appearing 104 times. 299 rows contain just the bare token 'Local' with no date, which will break naive datetime parsing.

Treatment: Strip the ' Local' suffix and parse to datetime, coercing the 299 bare 'Local' entries to null.

anthropic:claude-opus-4-7 · confidence high
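
The treatment above can be sketched with the standard library alone. The helper name `parse_occurred` is illustrative and not part of saturn; it assumes the 'YYYY-MM-DD HH:MM:SS Local' layout described in the prose and returns `None` for anything that does not fit, including the bare 'Local' token.

```python
from datetime import datetime
from typing import Optional


def parse_occurred(raw: Optional[str]) -> Optional[datetime]:
    """Parse an Occurred string into a naive datetime, or None if no date is present."""
    if raw is None:
        return None
    # Drop the trailing 'Local' marker, then any whitespace left behind.
    text = raw.strip().removesuffix("Local").strip()
    if not text:
        # Covers the bare 'Local' rows that carry no date at all.
        return None
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Anything that is not a full timestamp is treated as missing.
        return None


print(parse_occurred("2015-11-07 18:00:00 Local"))
print(parse_occurred("Local"))
```

The result is deliberately timezone-naive: 'Local' names the witness's local time, and the column itself does not say which zone that is.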

Out[17]:

saturn.columns["Occurred"].stats

stat	value
n	147,890
nulls	0 (0.0%)
unique	126,264
len_min	5
len_max	25
len_mean	24.96
len_median	25
len_p95	25
word_mean	2.996
word_median	3
n_empty	0
n_duplicates	21,626
duplicate_rate	0.1462
vocab_size	10,098
readability_flesch_mean	90.72
emoji_rate	0
url_rate	0
one_word_rate	0.002022
allcaps_rate	0
boilerplate_rate	0

Fig 10.

Character-length distribution for Occurred.

Character-length distribution for Occurred (mean: 24.959564541213062).
chars	count
5 – 6	299
6 – 24	0
24 – 25	147591

Location text feature

Short 'City, State/Region, Country' location strings, averaging 20 characters and 3.6 words, dominated by US entries (Phoenix, Seattle, Las Vegas lead) with 'usa' appearing 17,880 times. The column is highly repetitive: 110,819 of 147,890 rows are duplicates (74.9%) across only 37,070 unique values, so it behaves like a categorical despite being free text. Language detection flags multilingual content (29 languages in the sample), but this mostly reflects misclassification of short strings: 4,481 were detected as English versus small counts in the 28 other codes.

Treatment: Parse into city/state/country fields and treat as a high-cardinality categorical.

anthropic:claude-opus-4-7 · confidence high
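
A minimal sketch of the parsing step, assuming the comma-separated 'City, State/Region, Country' layout described above. `Place` and `split_location` are hypothetical names, the sample strings are made up for illustration, and real rows may need extra rules (for example, commas inside a city name).

```python
from typing import NamedTuple, Optional


class Place(NamedTuple):
    city: Optional[str]
    region: Optional[str]
    country: Optional[str]


def split_location(raw: Optional[str]) -> Place:
    """Split a Location string into city, region and country parts."""
    if raw is None:
        return Place(None, None, None)
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if not parts:
        return Place(None, None, None)
    if len(parts) == 1:
        return Place(parts[0], None, None)
    if len(parts) == 2:
        # Assumption: with two parts, the second is the country.
        return Place(parts[0], None, parts[1])
    # Three or more parts: first is the city, last the country, the rest the region.
    return Place(parts[0], ", ".join(parts[1:-1]), parts[-1])


print(split_location("Phoenix, AZ, USA"))
```

Once split, each field can be treated as its own categorical, which keeps the cardinality of any single feature well below the 37,070 unique strings of the combined column.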

Out[20]:

saturn.columns["Location"].stats

stat	value
n	147,890
nulls	1 (0.0%)
unique	37,070
len_min	7
len_max	92
len_mean	20.38
len_median	18
len_p95	37
word_mean	3.571
word_median	3
n_empty	0
n_duplicates	110,819
duplicate_rate	0.7493
vocab_size	8,750
readability_flesch_mean	26.72
emoji_rate	6.762e-06
url_rate	0
one_word_rate	6.762e-06
allcaps_rate	0.003983
boilerplate_rate	0
alert: multilingual	29 languages detected in sample
alert: duplicates	74.9% duplicate strings

Fig 11.

Character-length distribution for Location.

Character-length distribution for Location (mean: 20.37837161655025).
chars	count
7 – 9	373
9 – 11	34
11 – 13	3169
13 – 16	22450
16 – 18	34587
18 – 20	31739
20 – 22	19449
22 – 24	9423
24 – 26	6446
26 – 28	3843
28 – 30	3241
30 – 32	2232
32 – 35	1936
35 – 37	1540
37 – 39	1773
39 – 41	1495
41 – 43	1469
43 – 45	605
45 – 47	459
47 – 50	353
50 – 52	292
52 – 54	230
54 – 56	170
56 – 58	147
58 – 60	144
60 – 62	75
62 – 64	54
64 – 66	54
66 – 69	37
69 – 71	38
71 – 73	7
73 – 75	4
75 – 77	7
77 – 79	6
79 – 81	3
81 – 84	1
84 – 86	2
86 – 88	1
88 – 90	0
90 – 92	1

Shape categorical feature

Categorical descriptor of UFO sighting shapes across 39 distinct values, with 'Light' leading at 27,494 records (19.4% of non-null values, 18.6% of all rows). The distribution is moderately spread (entropy ratio 0.74), and notably 'Other' (10,062) and 'Unknown' (10,021) together total 20,083, more than the second-largest real category 'Circle' (14,367), suggesting substantial reporter ambiguity. Null rate is 4.29%, modest but non-trivial.

Treatment: One-hot or target-encode after collapsing 'Other'/'Unknown'/nulls into a single missing bucket.

anthropic:claude-opus-4-7 · confidence high
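
The collapse-then-encode treatment can be sketched as follows. The bucket label `"Missing"` and the helper names are illustrative choices, not saturn output.

```python
from typing import List, Optional, Tuple

MISSING = "Missing"               # hypothetical label for the merged bucket
AMBIGUOUS = {"Other", "Unknown"}  # values the treatment folds into that bucket


def collapse_shape(value: Optional[str]) -> str:
    """Map nulls, blanks, 'Other' and 'Unknown' to a single missing bucket."""
    if value is None or not value.strip() or value.strip() in AMBIGUOUS:
        return MISSING
    return value.strip()


def one_hot(values: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted category list and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


shapes = ["Light", "Other", None, "Unknown", "Circle"]
collapsed = [collapse_shape(s) for s in shapes]
categories, rows = one_hot(collapsed)
print(categories)
print(rows)
```

Merging the three sources of ambiguity loses the distinction between "the witness could not say" and "nothing was recorded"; keep them apart if that difference matters downstream.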

Out[23]:

saturn.columns["Shape"].stats

stat	value
n	147,890
nulls	6,343 (4.3%)
unique	39
top_value	Light
top_rate	0.1942
cardinality	39
entropy	3.93
entropy_ratio	0.7436
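
The reported `entropy_ratio` is numerically consistent with Shannon entropy in bits divided by log2 of the cardinality: 3.93 / log2(39) ≈ 0.7436. That reading is inferred from the numbers in this table, not taken from saturn's documentation, and the function names below are illustrative.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def entropy_bits(values: Iterable[Optional[str]]) -> float:
    """Shannon entropy (base 2) of the non-null values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_ratio(values: Iterable[Optional[str]]) -> float:
    """Entropy divided by its maximum, log2(number of distinct values)."""
    items = list(values)
    k = len({v for v in items if v is not None})
    if k < 2:
        return 0.0  # a single category carries no uncertainty
    return entropy_bits(items) / math.log2(k)


# Check against the stats above: entropy 3.93 over 39 categories.
print(round(3.93 / math.log2(39), 4))
print(entropy_ratio(["a", "b", "a", "b"]))
```

A ratio of 1.0 means every category is equally common; values near 0 mean one category dominates.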

Fig 12.

Top values for Shape.

Top values for Shape (20 unique shown, of 39 total).
value	count	share
Light	27494	18.6%
Circle	14367	9.7%
Triangle	13086	8.8%
Other	10062	6.8%
Unknown	10021	6.8%
Fireball	9880	6.7%
Disk	8716	5.9%
Sphere	7652	5.2%
Oval	6369	4.3%
Orb	5924	4.0%
Formation	4864	3.3%
Changing	3987	2.7%
Cigar	3753	2.5%
Rectangle	2610	1.8%
Cylinder	2482	1.7%
Flash	2439	1.6%
Diamond	2116	1.4%
Chevron	1742	1.2%
Egg	1289	0.9%
Teardrop	1238	0.8%

Duration text feature

This is a free-text duration field, almost always a number-plus-unit phrase like '5 minutes' or '30 seconds' (mean 9.5 chars, ~2 words). Values are highly repetitive: only 15,527 distinct strings across 147,890 rows and an 89% duplicate rate, with a 4.8% null rate. The dominant units are 'minutes' and 'seconds', but the presence of an abbreviated 'min' token signals inconsistent formatting that will need normalisation.

Treatment: Parse number and unit, convert to a single numeric scale (e.g., seconds), and treat 'min'/'minutes' variants as equivalent.

anthropic:claude-opus-4-7 · confidence high
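
A rough sketch of the number-plus-unit parse, assuming values shaped like '5 minutes' or '30 seconds'. The unit table and the name `duration_seconds` are assumptions made for illustration; free-form entries fall through to `None` rather than being guessed at.

```python
import re
from typing import Optional

# Hypothetical unit table; extend it after inspecting the real vocabulary.
UNIT_SECONDS = {
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
}

# A number, optional whitespace, a unit word, and an optional trailing period.
PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\.?\s*$", re.IGNORECASE)


def duration_seconds(raw: Optional[str]) -> Optional[float]:
    """Convert a duration phrase to seconds, or None if it cannot be parsed."""
    if raw is None:
        return None
    match = PATTERN.match(raw)
    if match is None:
        return None
    unit = match.group(2).lower()
    if unit not in UNIT_SECONDS:
        return None
    return float(match.group(1)) * UNIT_SECONDS[unit]


print(duration_seconds("5 minutes"))
print(duration_seconds("2 min"))
```

Counting how many rows return `None` is a quick way to see how much of the column needs hand-written rules beyond this pattern.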

Out[26]:

saturn.columns["Duration"].stats

stat	value
n	147,890
nulls	7,148 (4.8%)
unique	15,527
len_min	1
len_max	37
len_mean	9.469
len_median	9
len_p95	15
word_mean	2.066
word_median	2
n_empty	0
n_duplicates	125,215
duplicate_rate	0.8897
vocab_size	1,757
readability_flesch_mean	84.41
emoji_rate	7.105e-06
url_rate	0
one_word_rate	0.0895
allcaps_rate	0.03033
boilerplate_rate	0
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	89.0% duplicate strings

Fig 13.

Character-length distribution for Duration.

Character-length distribution for Duration (mean: 9.469163433800855).
chars	count
1 – 2	784
2 – 3	1006
3 – 4	586
4 – 5	2052
5 – 6	6297
6 – 6	9510
6 – 7	9325
7 – 8	9401
8 – 9	36028
9 – 10	0
10 – 11	37045
11 – 12	10292
12 – 13	3432
13 – 14	5013
14 – 14	1918
14 – 15	1786
15 – 16	1404
16 – 17	751
17 – 18	818
18 – 19	0
19 – 20	598
20 – 21	476
21 – 22	305
22 – 23	343
23 – 24	337
24 – 24	416
24 – 25	818
25 – 26	0
26 – 27	0
27 – 28	0
28 – 29	0
29 – 30	0
30 – 31	0
31 – 32	0
32 – 32	0
32 – 33	0
33 – 34	0
34 – 35	0
35 – 36	0
36 – 37	1

No of observers numeric feature

Counts of observers per record, with a typical value of 1-2 (median 2, IQR 1) but a maximum of 20000 driving mean 4.6 and std 129.5. Skew of 109.3 and kurtosis of 14332 are extreme, and 13.0% of rows are flagged as outliers. A min of -10 is suspicious for a count, and 6.9% are zero with 4.5% null.

Treatment: Clip negatives, investigate the 20000 tail, and log1p-transform before modelling.

anthropic:claude-opus-4-7 · confidence high
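
One way to read the treatment, sketched with the standard library: clip negatives to zero, then apply log1p so the long tail is compressed. Clipping to zero (rather than treating negatives as missing) is an assumption, and `clean_observers` is an illustrative name.

```python
import math
from typing import Optional


def clean_observers(value: Optional[float]) -> Optional[float]:
    """Clip negative counts to 0 and return log(1 + count); nulls pass through."""
    if value is None:
        return None
    clipped = max(0.0, float(value))
    return math.log1p(clipped)


print(clean_observers(-10))
print(clean_observers(2))
print(clean_observers(20000))
```

After the transform the 20,000 maximum sits below 10 on the new scale, so a handful of extreme reports no longer dominates the mean and variance.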

Out[29]:

saturn.columns["No of observers"].stats

stat	value
n	147,890
nulls	6,661 (4.5%)
unique	137
min	-10
max	20,000
mean	4.603
median	2
std	129.5
q1	1
q3	2
iqr	1
skew	109.3
kurtosis	1.433e+04
n_outliers	18,390
outlier_rate	0.1302
zero_rate	0.06919
alert: high_skew	skew=+109.30
alert: outliers	13.0% rows beyond 1.5 IQR
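
The outlier alert reads like the conventional Tukey rule (values more than 1.5 × IQR outside the quartiles); that is an assumption based on the alert text. Under it, q1 = 1 and q3 = 2 give fences at -0.5 and 3.5, so any report with four or more observers would be flagged, which may explain why the outlier rate is as high as 13%.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Return the lower and upper fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x: float, q1: float, q3: float, k: float = 1.5) -> bool:
    """True when x lies strictly outside the fences."""
    low, high = tukey_fences(q1, q3, k)
    return x < low or x > high


# Quartiles taken from the stats table above.
print(tukey_fences(1, 2))
print(is_outlier(4, 1, 2))
```

With an IQR of 1 the rule is very strict, so the flag is better read as "more than a small group" than as evidence of bad data.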

Fig 14.

Distribution of No of observers. Vertical dash marks the median.

Histogram bins for No of observers (median: 2.0).
bin	count
-10 – 490.2	141102
490.2 – 990.5	41
990.5 – 1491	52
1491 – 1991	1
1991 – 2491	8
2491 – 2992	3
2992 – 3492	3
3492 – 3992	0
3992 – 4492	1
4492 – 4992	0
4992 – 5493	7
5493 – 5993	0
5993 – 6493	0
6493 – 6994	0
6994 – 7494	0
7494 – 7994	0
7994 – 8494	0
8494 – 8994	0
8994 – 9495	0
9495 – 9995	0
9995 – 1.05e+04	7
1.05e+04 – 1.1e+04	0
1.1e+04 – 1.15e+04	1
1.15e+04 – 1.2e+04	0
1.2e+04 – 1.25e+04	0
1.25e+04 – 1.3e+04	0
1.3e+04 – 1.35e+04	0
1.35e+04 – 1.4e+04	0
1.4e+04 – 1.45e+04	0
1.45e+04 – 1.5e+04	0
1.5e+04 – 1.55e+04	0
1.55e+04 – 1.6e+04	0
1.6e+04 – 1.65e+04	0
1.65e+04 – 1.7e+04	0
1.7e+04 – 1.75e+04	0
1.75e+04 – 1.8e+04	0
1.8e+04 – 1.85e+04	0
1.85e+04 – 1.9e+04	0
1.9e+04 – 1.95e+04	0
1.95e+04 – 2e+04	3

Reported text timestamp

This is a 'Reported' timestamp stored as text in fixed 27-character format like '1999-11-16 00:00:00 Pacific'. Every value has identical length (min/max/mean = 27) and 3 words, with 'Pacific' appearing as a constant timezone suffix in ~20000 rows. The multilingual alert is a false positive from the language detector misreading dates; the duplicate rate of 7.7% (11418 rows) reflects multiple events sharing a report date.

Treatment: Parse to datetime (strip 'Pacific' suffix, localize to US/Pacific) before any temporal analysis.

anthropic:claude-opus-4-7 · confidence high

Out[32]:

saturn.columns["Reported"].stats

stat	value
n	147,890
nulls	1 (0.0%)
unique	136,471
len_min	27
len_max	27
len_mean	27
len_median	27
len_p95	27
word_mean	3
word_median	3
n_empty	0
n_duplicates	11,418
duplicate_rate	0.07721
vocab_size	24,077
readability_flesch_mean	62.79
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	8 languages detected in sample

Fig 15.

Character-length distribution for Reported.

Character-length distribution for Reported (mean: 27.0).
chars	count
27 – 27	147889

Posted categorical timestamp

This column stores posting dates as datetime strings with zeroed time components, almost certainly a publication or upload timestamp. Across 147,890 rows there are 626 distinct dates and a single null, and the distribution is remarkably flat: the entropy ratio is 0.93 and the most common date (2020-06-25) accounts for only 1.24% of rows. The top dates span 1999 to 2023, suggesting the dataset covers more than two decades of activity.

Treatment: parse to datetime and derive year/month features for temporal analysis.

anthropic:claude-opus-4-7 · confidence high
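
A small sketch of the year/month derivation, assuming the 'YYYY-MM-DD HH:MM:SS' layout seen in the top values. `posted_features` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Dict, Optional


def posted_features(raw: Optional[str]) -> Optional[Dict[str, int]]:
    """Return year and month for a Posted string, or None if it cannot be parsed."""
    if raw is None:
        return None
    try:
        stamp = datetime.strptime(raw.strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return {"year": stamp.year, "month": stamp.month}


print(posted_features("2020-06-25 00:00:00"))
```

Because the time component is always zero, nothing is lost by keeping only the date parts.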

Out[35]:

saturn.columns["Posted"].stats

stat	value
n	147,890
nulls	1 (0.0%)
unique	626
top_value	2020-06-25 00:00:00
top_rate	0.01239
cardinality	626
entropy	8.644
entropy_ratio	0.9304

Fig 16.

Top values for Posted.

Top values for Posted (20 unique shown, of 626 total).
value	count	share
2020-06-25 00:00:00	1833	1.2%
2009-12-12 00:00:00	1627	1.1%
2006-10-30 00:00:00	1573	1.1%
2019-12-01 00:00:00	1484	1.0%
2010-11-21 00:00:00	1365	0.9%
2022-09-09 00:00:00	1333	0.9%
1999-11-02 00:00:00	1314	0.9%
2020-12-23 00:00:00	1312	0.9%
2023-03-06 00:00:00	1274	0.9%
2008-10-31 00:00:00	1274	0.9%
2022-12-22 00:00:00	1252	0.8%
2001-08-05 00:00:00	1229	0.8%
2009-03-19 00:00:00	1201	0.8%
2009-01-10 00:00:00	1198	0.8%
2013-08-30 00:00:00	1142	0.8%
2022-03-04 00:00:00	1035	0.7%
2023-09-10 00:00:00	1028	0.7%
2008-06-12 00:00:00	1023	0.7%
2012-09-24 00:00:00	1019	0.7%
2011-10-10 00:00:00	1017	0.7%

Characteristics text feature

This is a multi-label categorical feature describing observed object characteristics (e.g. "Lights on object", "Aura or haze around object", "Aircraft nearby"), stored as comma-joined tags rather than a structured list. Despite 147,890 rows, only 1,446 distinct strings exist and 98.6% are duplicates, with a tiny vocab of 43 tokens. Watch for the 28.15% null rate and a truncated tag ("Changed Colo", apparently "Changed Color" cut off) that recurs across thousands of rows.

Treatment: Split on commas and one-hot encode the ~43 underlying tags; treat nulls as a separate indicator.

anthropic:claude-opus-4-7 · confidence high
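
The split-and-encode treatment can be sketched as a multi-hot encoding with a separate null indicator. The function names and the `is_null` column are illustrative, and the sketch does not repair the truncated 'Changed Colo' tag mentioned above.

```python
from typing import List, Optional, Tuple


def split_tags(raw: Optional[str]) -> List[str]:
    """Split a comma-joined tag string into trimmed, non-empty tags."""
    if raw is None:
        return []
    return [t.strip() for t in raw.split(",") if t.strip()]


def multi_hot(rows: List[Optional[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return column names (sorted tags plus 'is_null') and one 0/1 row per input."""
    tag_lists = [split_tags(r) for r in rows]
    vocab = sorted({t for tags in tag_lists for t in tags})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for raw, tags in zip(rows, tag_lists):
        vec = [0] * len(vocab)
        for t in tags:
            vec[index[t]] = 1
        # The final column records whether the original value was null.
        matrix.append(vec + [1 if raw is None else 0])
    return vocab + ["is_null"], matrix


columns, matrix = multi_hot(["Lights on object, Aircraft nearby", None, "Lights on object"])
print(columns)
print(matrix)
```

Unlike one-hot encoding of the whole string, this keeps one column per tag, so the 1,446 distinct combinations collapse to a much smaller feature set.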

Out[38]:

saturn.columns["Characteristics"].stats

stat	value
n	147,890
nulls	41,631 (28.1%)
unique	1,446
len_min	6
len_max	251
len_mean	33.27
len_median	26
len_p95	76
word_mean	5.613
word_median	5
n_empty	0
n_duplicates	104,813
duplicate_rate	0.9864
vocab_size	43
readability_flesch_mean	69.46
emoji_rate	0
url_rate	0
one_word_rate	0.00384
allcaps_rate	0
boilerplate_rate	0
alert: null_rate	28.1% null
alert: duplicates	98.6% duplicate strings

Fig 17.

Character-length distribution for Characteristics.

Character-length distribution for Characteristics (mean: 33.2713746600288).
chars	count
6 – 12	5825
12 – 18	36312
18 – 24	1614
24 – 30	18280
30 – 37	7600
37 – 43	5135
43 – 49	12756
49 – 55	2126
55 – 61	6966
61 – 67	2046
67 – 73	1732
73 – 80	1712
80 – 86	1002
86 – 92	866
92 – 98	584
98 – 104	367
104 – 110	436
110 – 116	232
116 – 122	112
122 – 128	106
128 – 135	89
135 – 141	51
141 – 147	80
147 – 153	30
153 – 159	32
159 – 165	28
165 – 171	48
171 – 178	21
178 – 184	14
184 – 190	10
190 – 196	7
196 – 202	8
202 – 208	4
208 – 214	7
214 – 220	4
220 – 226	2
226 – 233	1
233 – 239	3
239 – 245	1
245 – 251	10

Summary text free_text

Free-text summary field with 144,208 unique values across 147,890 rows and a 0.6% null rate, so virtually every record carries its own short description. Lengths are highly skewed: median 76 characters / 13 words but a max of 10,624 characters and a p95 of 479, and mean Flesch readability of 67.3 suggests fairly plain English prose. Top tokens are stopwords plus 'light' (4,676 occurrences), hinting at a recurring topical theme worth investigating; duplicates (1.9%) and boilerplate (<0.1%) are negligible.

Treatment: Tokenize and embed before modelling; do not treat as a categorical key.

anthropic:claude-opus-4-7 · confidence high

Out[41]:

saturn.columns["Summary"].stats

stat	value
n	147,890
nulls	891 (0.6%)
unique	144,208
len_min	1
len_max	10,624
len_mean	134.9
len_median	76
len_p95	479
word_mean	25.06
word_median	13
n_empty	0
n_duplicates	2,791
duplicate_rate	0.01899
vocab_size	28,632
readability_flesch_mean	67.28
emoji_rate	0.0001088
url_rate	0.0007755
one_word_rate	0.002327
allcaps_rate	0.01786
boilerplate_rate	0.0008708
alert: near_unique	98.1% of rows are unique strings

Fig 18.

Character-length distribution for Summary.

Character-length distribution for Summary (mean: 134.90139388703324).
chars	count
1 – 267	131782
267 – 532	8911
532 – 798	3233
798 – 1063	1484
1063 – 1329	669
1329 – 1594	333
1594 – 1860	204
1860 – 2126	123
2126 – 2391	74
2391 – 2657	60
2657 – 2922	33
2922 – 3188	16
3188 – 3453	14
3453 – 3719	9
3719 – 3985	13
3985 – 4250	19
4250 – 4516	5
4516 – 4781	5
4781 – 5047	2
5047 – 5312	1
5312 – 5578	4
5578 – 5844	2
5844 – 6109	0
6109 – 6375	0
6375 – 6640	0
6640 – 6906	1
6906 – 7172	0
7172 – 7437	0
7437 – 7703	0
7703 – 7968	0
7968 – 8234	0
8234 – 8499	0
8499 – 8765	0
8765 – 9031	1
9031 – 9296	0
9296 – 9562	0
9562 – 9827	0
9827 – 10093	0
10093 – 10358	0
10358 – 10624	1

Text text free_text

Free-text field containing medium-length English prose, averaging 949.9 characters and 181.6 words with a Flesch readability of 69.6, consistent with first-person witness narratives. The column is near-unique (127,124 unique of 147,890) yet still carries 4,091 exact duplicates (3.1%) and an 11.3% null rate worth investigating. Top words are dominated by English stopwords plus a frequent first-person 'i' and 'my', hinting at personal/subjective writing rather than formal documents.

Treatment: Clean nulls and exact duplicates, then tokenize and embed before modelling.

anthropic:claude-opus-4-7 · confidence high
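
The cleaning half of the treatment can be sketched as an order-preserving filter that keeps the first copy of each exact string. `drop_nulls_and_duplicates` is an illustrative name, and the tokenize/embed step is left out because it depends on the model chosen.

```python
from typing import Iterable, List, Optional


def drop_nulls_and_duplicates(texts: Iterable[Optional[str]]) -> List[str]:
    """Keep the first occurrence of each exact string and skip nulls."""
    seen = set()
    kept = []
    for text in texts:
        if text is None or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


print(drop_nulls_and_duplicates(["a", None, "b", "a"]))
```

If rows must stay joinable to the rest of the table, carry the Sighting identifier alongside each kept text instead of filtering the column on its own.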

Out[44]:

saturn.columns["Text"].stats

stat	value
n	147,890
nulls	16,675 (11.3%)
unique	127,124
len_min	1
len_max	64,550
len_mean	949.9
len_median	682
len_p95	2,673
word_mean	181.6
word_median	131
n_empty	0
n_duplicates	4,091
duplicate_rate	0.03118
vocab_size	65,106
readability_flesch_mean	69.56
emoji_rate	0.0001981
url_rate	0.01805
one_word_rate	0.0006783
allcaps_rate	0.007217
boilerplate_rate	0.001875
alert: near_unique	96.9% of rows are unique strings

Fig 19.

Character-length distribution for Text.

Character-length distribution for Text (mean: 949.9216781617955).
chars	count
1 – 1615	111865
1615 – 3228	15329
3228 – 4842	2755
4842 – 6456	757
6456 – 8070	257
8070 – 9683	104
9683 – 11297	60
11297 – 12911	31
12911 – 14525	13
14525 – 16138	13
16138 – 17752	7
17752 – 19366	10
19366 – 20979	4
20979 – 22593	3
22593 – 24207	2
24207 – 25821	2
25821 – 27434	0
27434 – 29048	1
29048 – 30662	0
30662 – 32276	0
32276 – 33889	0
33889 – 35503	1
35503 – 37117	0
37117 – 38730	0
38730 – 40344	0
40344 – 41958	0
41958 – 43572	0
43572 – 45185	0
45185 – 46799	0
46799 – 48413	0
48413 – 50026	0
50026 – 51640	0
51640 – 53254	0
53254 – 54868	0
54868 – 56481	0
56481 – 58095	0
58095 – 59709	0
59709 – 61323	0
61323 – 62936	0
62936 – 64550	1

Location details text free_text

Free-text supplementary location notes, populated for only 6.9% of the 147,890 rows (null_rate 0.931). When present, entries are short prose averaging 38.7 characters / 6.9 words with readable Flesch 69.8, and the top tokens ('the', 'of', 'in', 'my', 'from') confirm natural-language descriptions rather than structured place codes. Cardinality is high (9,713 uniques) but 492 exact duplicates (4.8%) hint at recurring phrases worth normalising.

Treatment: Treat as optional free-text annotation: tokenize/embed for any modelling and expect to ignore for the 93% of rows where it is null.

anthropic:claude-opus-4-7 · confidence high

Out[47]:

saturn.columns["Location details"].stats

stat	value
n	147,890
nulls	137,685 (93.1%)
unique	9,713
len_min	1
len_max	197
len_mean	38.75
len_median	33
len_p95	92
word_mean	6.93
word_median	6
n_empty	0
n_duplicates	492
duplicate_rate	0.04821
vocab_size	12,384
readability_flesch_mean	69.85
emoji_rate	0
url_rate	0.001078
one_word_rate	0.03822
allcaps_rate	0.0197
boilerplate_rate	0.000196
alert: near_unique	95.2% of rows are unique strings
alert: null_rate	93.1% null

Fig 20.

Character-length distribution for Location details.

Character-length distribution for Location details (mean: 38.74659480646742).
chars	count
1 – 6	130
6 – 11	582
11 – 16	983
16 – 21	1124
21 – 26	1030
26 – 30	845
30 – 35	837
35 – 40	741
40 – 45	653
45 – 50	435
50 – 55	472
55 – 60	378
60 – 65	321
65 – 70	269
70 – 74	238
74 – 79	217
79 – 84	205
84 – 89	166
89 – 94	206
94 – 99	188
99 – 104	176
104 – 109	0
109 – 114	1
114 – 119	0
119 – 124	1
124 – 128	1
128 – 133	1
133 – 138	0
138 – 143	0
143 – 148	0
148 – 153	0
153 – 158	0
158 – 163	0
163 – 168	0
168 – 172	1
172 – 177	1
177 – 182	0
182 – 187	2
187 – 192	0
192 – 197	1

Explanation categorical label

Classification labels explaining UFO/sky-object sightings, with categories like 'Starlink - Probable', 'Rocket - Certain', and 'Balloon - Possible' combining an object type with a confidence qualifier. The column is 99.46% null (only 803 of 147,890 rows carry a value), so it functions as a sparse annotation rather than a primary feature. Among populated rows, 58 distinct labels appear with relatively even spread (entropy ratio 0.82); the top label 'Starlink - Probable' (78 rows) covers just 9.71% of non-nulls.

Treatment: Treat as a sparse secondary label; split into object/confidence and only model on the annotated subset.

anthropic:claude-opus-4-7 · confidence high
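
Splitting a label into object and confidence can be sketched as below, assuming the ' - ' separator seen in the top values. `split_explanation` is a hypothetical helper; labels without the separator keep their text as the object and get no confidence.

```python
from typing import Optional, Tuple


def split_explanation(label: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split 'Object - Confidence' on the last ' - ' separator."""
    if label is None:
        return None, None
    obj, sep, confidence = label.rpartition(" - ")
    if not sep:
        # No separator found: treat the whole label as the object type.
        return label.strip(), None
    return obj.strip(), confidence.strip()


print(split_explanation("Starlink - Probable"))
print(split_explanation("Planet/Star - Possible"))
```

Splitting on the last separator keeps object names intact even if one of them ever contains ' - ' itself.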

Out[50]:

saturn.columns["Explanation"].stats

stat	value
n	147,890
nulls	147,087 (99.5%)
unique	58
top_value	Starlink - Probable
top_rate	0.09714
cardinality	58
entropy	4.832
entropy_ratio	0.8248
alert: null_rate	99.5% null

Fig 21.

Top values for Explanation.

Top values for Explanation (20 unique shown, of 58 total).
value	count	share
Starlink - Probable	78	0.1%
Starlink - Certain	69	0.0%
Rocket - Certain	67	0.0%
Balloon - Possible	50	0.0%
Starlink - Possible	49	0.0%
Planet/Star - Possible	42	0.0%
Planet/Star - Probable	41	0.0%
Aircraft - Possible	35	0.0%
Aircraft - Probable	35	0.0%
Camera Anomaly - Probable	33	0.0%
Camera Anomaly - Certain	25	0.0%
Rocket - Probable	24	0.0%
Bird - Possible	21	0.0%
Searchlight - Certain	18	0.0%
Balloon - Probable	18	0.0%
Drone - Possible	17	0.0%
Camera Anomaly - Possible	15	0.0%
Bird - Probable	14	0.0%
Searchlight - Probable	12	0.0%
Rocket - Possible	11	0.0%
