quirky nuforc sightings

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/quirky/nuforc_sightings.parquet

Saturn profiled 147,890 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
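# Install the profiler (notebook shell magic), then run the analysis through its CLI.
!pip install saturn-dissect
import subprocess

# check=True raises CalledProcessError if the saturn command exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/quirky/nuforc_sightings.parquet",
    "--findings", "quirky-nuforc_sightings.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)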

Summary confidence: high

This dataset contains 147,890 UFO sighting reports (likely from NUFORC) with 13 columns covering location, shape, duration, witness counts, and free-text descriptions. The Shape field is a clean categorical with 39 values dominated by 'Light' (27,494), 'Circle', and 'Triangle' — a natural starting point for understanding what people report. Duration is text-based but highly repetitive, with '5 minutes' and '2 minutes' as the most common values, suggesting witnesses anchor on round numbers. Watch out for 'No of observers': it is extremely skewed (max 20,000, min -10, skew 109) with ~13% outliers, so it needs cleaning before any quantitative use. Also note that 'Explanation' is 99.5% null — only a tiny fraction of sightings have an official label, with 'Starlink' explanations leading the small set that do.

citing: row_count · column_count · Shape.top_values · Shape.stats.cardinality · Duration.top_values · No of observers.stats · Explanation.null_rate · Explanation.top_values · Location.top_values · Summary.stats.len_mean

Out[4]:

saturn.schema() · 13 columns

column            kind         n        null%  unique   alerts
Sighting          numeric      147,890  0.0%   147,890
Occurred          text         147,890  0.0%   126,264
Location          text         147,890  0.0%   37,070   multilingual, duplicates
Shape             categorical  147,890  4.3%   39
Duration          text         147,890  4.8%   15,527   short_text, duplicates
No of observers   numeric      147,890  4.5%   137      high_skew, outliers
Reported          text         147,890  0.0%   136,471  multilingual
Posted            categorical  147,890  0.0%   626
Characteristics   text         147,890  28.1%  1,446    null_rate, duplicates
Summary           text         147,890  0.6%   144,208  near_unique
Text              text         147,890  11.3%  127,124  near_unique
Location details  text         147,890  93.1%  9,713    near_unique, null_rate
Explanation       categorical  147,890  99.5%  58       null_rate
Fig 1. Shape · See which UFO shapes dominate reports: 'Light' alone accounts for nearly 1 in 5 sightings.
Data table: top values for Shape (20 unique shown, of 39 total).
value      count  share
Light      27494  18.6%
Circle     14367  9.7%
Triangle   13086  8.8%
Other      10062  6.8%
Unknown    10021  6.8%
Fireball   9880   6.7%
Disk       8716   5.9%
Sphere     7652   5.2%
Oval       6369   4.3%
Orb        5924   4.0%
Formation  4864   3.3%
Changing   3987   2.7%
Cigar      3753   2.5%
Rectangle  2610   1.8%
Cylinder   2482   1.7%
Flash      2439   1.6%
Diamond    2116   1.4%
Chevron    1742   1.2%
Egg        1289   0.9%
Teardrop   1238   0.8%
Fig 2. Duration · Look at how witnesses cluster around round-number durations like '5 minutes' and '2 minutes'.
Data table: character-length distribution for Duration (mean: 9.469).
chars    count
1 – 2    784
2 – 3    1006
3 – 4    586
4 – 5    2052
5 – 6    6297
6 – 6    9510
6 – 7    9325
7 – 8    9401
8 – 9    36028
9 – 10   0
10 – 11  37045
11 – 12  10292
12 – 13  3432
13 – 14  5013
14 – 14  1918
14 – 15  1786
15 – 16  1404
16 – 17  751
17 – 18  818
18 – 19  0
19 – 20  598
20 – 21  476
21 – 22  305
22 – 23  343
23 – 24  337
24 – 24  416
24 – 25  818
25 – 26  0
26 – 27  0
27 – 28  0
28 – 29  0
29 – 30  0
30 – 31  0
31 – 32  0
32 – 32  0
32 – 33  0
33 – 34  0
34 – 35  0
35 – 36  0
36 – 37  1
Fig 3. Explanation · Among the rare cases with an official explanation, check the mix of Starlink, rocket, and balloon attributions.
Data table: top values for Explanation (20 unique shown, of 58 total).
value                      count  share
Starlink - Probable        78     0.1%
Starlink - Certain         69     0.0%
Rocket - Certain           67     0.0%
Balloon - Possible         50     0.0%
Starlink - Possible        49     0.0%
Planet/Star - Possible     42     0.0%
Planet/Star - Probable     41     0.0%
Aircraft - Possible        35     0.0%
Aircraft - Probable        35     0.0%
Camera Anomaly - Probable  33     0.0%
Camera Anomaly - Certain   25     0.0%
Rocket - Probable          24     0.0%
Bird - Possible            21     0.0%
Searchlight - Certain      18     0.0%
Balloon - Probable         18     0.0%
Drone - Possible           17     0.0%
Camera Anomaly - Possible  15     0.0%
Bird - Probable            14     0.0%
Searchlight - Probable     12     0.0%
Rocket - Possible          11     0.0%
Fig 4. No of observers · Inspect the extreme skew and outliers (negative values and a 20,000 maximum) before trusting any averages.
Data table: histogram bins for No of observers (median: 2.0).
bin                  count
-10 – 490.2          141102
490.2 – 990.5        41
990.5 – 1491         52
1491 – 1991          1
1991 – 2491          8
2491 – 2992          3
2992 – 3492          3
3492 – 3992          0
3992 – 4492          1
4492 – 4992          0
4992 – 5493          7
5493 – 5993          0
5993 – 6493          0
6493 – 6994          0
6994 – 7494          0
7494 – 7994          0
7994 – 8494          0
8494 – 8994          0
8994 – 9495          0
9495 – 9995          0
9995 – 1.05e+04      7
1.05e+04 – 1.1e+04   0
1.1e+04 – 1.15e+04   1
1.15e+04 – 1.2e+04   0
1.2e+04 – 1.25e+04   0
1.25e+04 – 1.3e+04   0
1.3e+04 – 1.35e+04   0
1.35e+04 – 1.4e+04   0
1.4e+04 – 1.45e+04   0
1.45e+04 – 1.5e+04   0
1.5e+04 – 1.55e+04   0
1.55e+04 – 1.6e+04   0
1.6e+04 – 1.65e+04   0
1.65e+04 – 1.7e+04   0
1.7e+04 – 1.75e+04   0
1.75e+04 – 1.8e+04   0
1.8e+04 – 1.85e+04   0
1.85e+04 – 1.9e+04   0
1.9e+04 – 1.95e+04   0
1.95e+04 – 2e+04     3
Fig 5. Summary · Review summary length distribution to gauge how much narrative detail accompanies each sighting.
Data table: character-length distribution for Summary (mean: 134.9).
chars          count
1 – 267        131782
267 – 532      8911
532 – 798      3233
798 – 1063     1484
1063 – 1329    669
1329 – 1594    333
1594 – 1860    204
1860 – 2126    123
2126 – 2391    74
2391 – 2657    60
2657 – 2922    33
2922 – 3188    16
3188 – 3453    14
3453 – 3719    9
3719 – 3985    13
3985 – 4250    19
4250 – 4516    5
4516 – 4781    5
4781 – 5047    2
5047 – 5312    1
5312 – 5578    4
5578 – 5844    2
5844 – 6109    0
6109 – 6375    0
6375 – 6640    0
6640 – 6906    1
6906 – 7172    0
7172 – 7437    0
7437 – 7703    0
7703 – 7968    0
7968 – 8234    0
8234 – 8499    0
8499 – 8765    0
8765 – 9031    1
9031 – 9296    0
9296 – 9562    0
9562 – 9827    0
9827 – 10093   0
10093 – 10358  0
10358 – 10624  1
Fig 6. Per-column null rate across the corpus. Columns are ordered by input position.
Data table: per-column null rate across the corpus.
column            kind         null %
Sighting          numeric      0.0%
Occurred          text         0.0%
Location          text         0.0%
Shape             categorical  4.3%
Duration          text         4.8%
No of observers   numeric      4.5%
Reported          text         0.0%
Posted            categorical  0.0%
Characteristics   text         28.1%
Summary           text         0.6%
Text              text         11.3%
Location details  text         93.1%
Explanation       categorical  99.5%
Fig 7. Language mix across all text columns (per-string detection, sampled).
Data table: per-language counts (total 13,527 detected strings).
lang  count  share
en    12657  93.6%
de    480    3.5%
fr    90     0.7%
es    86     0.6%
cs    50     0.4%
sv    37     0.3%
pl    24     0.2%
it    20     0.1%
da    13     0.1%
et    9      0.1%
hu    8      0.1%
pt    6      0.0%
sl    5      0.0%
fi    5      0.0%
tr    5      0.0%
ru    5      0.0%
nl    4      0.0%
ceb   4      0.0%
te    3      0.0%
ca    3      0.0%
id    2      0.0%
eo    2      0.0%
zh    2      0.0%
bn    1      0.0%
kn    1      0.0%
ms    1      0.0%
hr    1      0.0%
no    1      0.0%
sr    1      0.0%
ro    1      0.0%
Fig 8. Pearson correlation across numeric columns (sampled, bounded).
Data table: Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
                 Sighting  No of observers
Sighting         +1.00     +0.01
No of observers  +0.01     +1.00

Sighting · numeric · identifier

Sighting is almost certainly a row identifier: all 147,890 values are unique, there are no nulls, and the distribution is essentially uniform (skew -0.013, kurtosis -1.13) across the range 111 to 179,773. Because that range is wider than the row count, the values cannot be a dense 1..N sequence, which suggests an externally assigned record or sighting ID with gaps. There are no outliers and no zeros, consistent with an ID rather than a measurement.

Treatment: Drop from modelling features; retain as the join/lookup key per row.

anthropic:claude-opus-4-7 · confidence high
Out[14]:

saturn.columns["Sighting"].stats

stat value
n 147,890
nulls 0 (0.0%)
unique 147,890
min 111
max 179,773
mean 9.198e+04
median 9.143e+04
std 5.044e+04
q1 5.015e+04
q3 1.345e+05
iqr 8.437e+04
skew -0.0132
kurtosis -1.134
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 9. Distribution of Sighting. Vertical dash marks the median.
Data table: histogram bins for Sighting (median: 91434.5).
bin                    count
111 – 4603             3091
4603 – 9094            2425
9094 – 1.359e+04       3218
1.359e+04 – 1.808e+04  3438
1.808e+04 – 2.257e+04  3053
2.257e+04 – 2.706e+04  3390
2.706e+04 – 3.155e+04  3183
3.155e+04 – 3.604e+04  3706
3.604e+04 – 4.053e+04  3583
4.053e+04 – 4.503e+04  3666
4.503e+04 – 4.952e+04  3651
4.952e+04 – 5.401e+04  3961
5.401e+04 – 5.85e+04   3961
5.85e+04 – 6.299e+04   3893
6.299e+04 – 6.748e+04  3933
6.748e+04 – 7.198e+04  4132
7.198e+04 – 7.647e+04  4084
7.647e+04 – 8.096e+04  4122
8.096e+04 – 8.545e+04  3949
8.545e+04 – 8.994e+04  4125
8.994e+04 – 9.443e+04  4176
9.443e+04 – 9.893e+04  3928
9.893e+04 – 1.034e+05  3455
1.034e+05 – 1.079e+05  3801
1.079e+05 – 1.124e+05  3934
1.124e+05 – 1.169e+05  3982
1.169e+05 – 1.214e+05  3802
1.214e+05 – 1.259e+05  3834
1.259e+05 – 1.304e+05  3837
1.304e+05 – 1.349e+05  3904
1.349e+05 – 1.393e+05  3795
1.393e+05 – 1.438e+05  2435
1.438e+05 – 1.483e+05  3931
1.483e+05 – 1.528e+05  3928
1.528e+05 – 1.573e+05  3932
1.573e+05 – 1.618e+05  3825
1.618e+05 – 1.663e+05  3654
1.663e+05 – 1.708e+05  2959
1.708e+05 – 1.753e+05  4194
1.753e+05 – 1.798e+05  4020

Occurred · text · timestamp

Timestamp strings of the form 'YYYY-MM-DD HH:MM:SS Local', with length tightly clustered at 25 characters (mean 24.96, p95 25). Stored as text rather than parsed datetimes, and 14.6% of values are duplicates (21,626 rows), with notable spikes on July 4th evenings and one outlier '2015-11-07 18:00:00' appearing 104 times. 299 rows contain just the bare token 'Local' with no date, which will break naive datetime parsing.

Treatment: Strip the ' Local' suffix and parse to datetime, coercing the 299 bare 'Local' entries to null.
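The treatment above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not Saturn's own code: the helper name `parse_occurred` is hypothetical, and it assumes every dated value follows the 'YYYY-MM-DD HH:MM:SS Local' pattern described above. Anything that fails to parse, including the bare 'Local' token, becomes None.

```python
from datetime import datetime
from typing import Optional

SUFFIX = " Local"

def parse_occurred(raw: Optional[str]) -> Optional[datetime]:
    """Parse 'YYYY-MM-DD HH:MM:SS Local' into a naive datetime, or None."""
    if raw is None:
        return None
    text = raw.strip()
    # Bare 'Local' entries carry no date at all, so they become nulls.
    if text == "Local":
        return None
    if text.endswith(SUFFIX):
        text = text[: -len(SUFFIX)]
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Any other unparseable value is also coerced to null.
        return None

samples = ["2015-11-07 18:00:00 Local", "Local", None]
parsed = [parse_occurred(s) for s in samples]
```

The result is deliberately timezone-naive: 'Local' means the witness's local time, so attaching a zone would require joining against Location first.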

anthropic:claude-opus-4-7 · confidence high
Out[17]:

saturn.columns["Occurred"].stats

stat value
n 147,890
nulls 0 (0.0%)
unique 126,264
len_min 5
len_max 25
len_mean 24.96
len_median 25
len_p95 25
word_mean 2.996
word_median 3
n_empty 0
n_duplicates 21,626
duplicate_rate 0.1462
vocab_size 10,098
readability_flesch_mean 90.72
emoji_rate 0
url_rate 0
one_word_rate 0.002022
allcaps_rate 0
boilerplate_rate 0
Fig 10. Character-length distribution for Occurred (mean: 24.96).
Data table: 40 bins spanning 5 to 25 characters; only the two non-empty bins are listed, the other 38 have a count of 0.
chars    count
5 – 6    299
24 – 25  147591

Location · text · feature

Short 'City, State/Region, Country' location strings, averaging 20 characters and 3.6 words, dominated by US entries (Phoenix, Seattle, Las Vegas lead) with 'usa' appearing 17,880 times. The column is highly repetitive: 110,819 of 147,890 rows are duplicates (75%) across only 37,070 unique values, so it behaves like a categorical despite being free text. The multilingual alert mostly reflects misclassification of short strings: 4,481 sampled strings were detected as English, with small counts spread across the other 28 detected language codes (29 in total, per the alert).

Treatment: Parse into city/state/country fields and treat as a high-cardinality categorical.
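A rough sketch of that parsing step follows, using only the standard library. The `Place` type and `split_location` helper are hypothetical names, and the handling of two-part values (read as 'City, Country') and of extra commas (folded into the city) are assumptions that should be checked against the real values before use.

```python
from typing import NamedTuple, Optional

class Place(NamedTuple):
    city: str
    region: Optional[str]
    country: Optional[str]

def split_location(raw: str) -> Place:
    """Split 'City, Region, Country' on commas, reading fields from the right."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if not parts:
        return Place("", None, None)
    if len(parts) == 1:
        return Place(parts[0], None, None)
    if len(parts) == 2:
        # Assumption: a two-part value is 'City, Country'.
        return Place(parts[0], None, parts[1])
    # Assumption: any extra commas belong to the city part.
    return Place(", ".join(parts[:-2]), parts[-2], parts[-1])

example = split_location("Phoenix, AZ, USA")
```

Reading from the right keeps the country and region stable even when a city name itself contains a comma.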

anthropic:claude-opus-4-7 · confidence high
Out[20]:

saturn.columns["Location"].stats

stat value
n 147,890
nulls 1 (0.0%)
unique 37,070
len_min 7
len_max 92
len_mean 20.38
len_median 18
len_p95 37
word_mean 3.571
word_median 3
n_empty 0
n_duplicates 110,819
duplicate_rate 0.7493
vocab_size 8,750
readability_flesch_mean 26.72
emoji_rate 6.762e-06
url_rate 0
one_word_rate 6.762e-06
allcaps_rate 0.003983
boilerplate_rate 0
alert: multilingual · 29 languages detected in sample
alert: duplicates · 74.9% duplicate strings
Fig 11. Character-length distribution for Location (mean: 20.38).
Data table:
chars    count
7 – 9    373
9 – 11   34
11 – 13  3169
13 – 16  22450
16 – 18  34587
18 – 20  31739
20 – 22  19449
22 – 24  9423
24 – 26  6446
26 – 28  3843
28 – 30  3241
30 – 32  2232
32 – 35  1936
35 – 37  1540
37 – 39  1773
39 – 41  1495
41 – 43  1469
43 – 45  605
45 – 47  459
47 – 50  353
50 – 52  292
52 – 54  230
54 – 56  170
56 – 58  147
58 – 60  144
60 – 62  75
62 – 64  54
64 – 66  54
66 – 69  37
69 – 71  38
71 – 73  7
73 – 75  4
75 – 77  7
77 – 79  6
79 – 81  3
81 – 84  1
84 – 86  2
86 – 88  1
88 – 90  0
90 – 92  1

Shape · categorical · feature

Categorical descriptor of UFO sighting shapes across 39 distinct values, with 'Light' leading at 19.4% of non-null values (27,494 rows, or 18.6% of all records). The distribution is moderately spread (entropy ratio 0.74). Notably, 'Other' (10,062) and 'Unknown' (10,021) together total 20,083 rows, more than the second-largest named shape, 'Circle' (14,367), which suggests substantial reporter ambiguity. Null rate is 4.29%, modest but non-trivial.

Treatment: One-hot or target-encode after collapsing 'Other'/'Unknown'/nulls into a single missing bucket.
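The collapse-then-encode step might look like the sketch below. It is dependency-free for illustration; the bucket name `Missing` and both helper names are hypothetical, and a real pipeline would more likely use a dataframe library's own encoder.

```python
from typing import Dict, List, Optional

MISSING = "Missing"  # hypothetical name for the combined bucket
AMBIGUOUS = {"Other", "Unknown"}

def collapse_shape(value: Optional[str]) -> str:
    """Map nulls, empty strings, 'Other' and 'Unknown' onto one bucket."""
    if value is None or value.strip() == "" or value in AMBIGUOUS:
        return MISSING
    return value

def one_hot(values: List[Optional[str]]) -> List[Dict[str, int]]:
    """One-hot encode the collapsed shapes as a list of indicator dicts."""
    collapsed = [collapse_shape(v) for v in values]
    categories = sorted(set(collapsed))
    return [{c: int(v == c) for c in categories} for v in collapsed]

rows = one_hot(["Light", "Other", None, "Circle", "Unknown"])
```

With the counts reported here, the combined bucket would hold 10,062 + 10,021 + 6,343 = 26,426 rows, which makes it nearly as large as 'Light' itself.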

anthropic:claude-opus-4-7 · confidence high
Out[23]:

saturn.columns["Shape"].stats

stat value
n 147,890
nulls 6,343 (4.3%)
unique 39
top_value Light
top_rate 0.1942
cardinality 39
entropy 3.93
entropy_ratio 0.7436
Fig 12. Top values for Shape.
Data table: top values for Shape (20 unique shown, of 39 total).
value      count  share
Light      27494  18.6%
Circle     14367  9.7%
Triangle   13086  8.8%
Other      10062  6.8%
Unknown    10021  6.8%
Fireball   9880   6.7%
Disk       8716   5.9%
Sphere     7652   5.2%
Oval       6369   4.3%
Orb        5924   4.0%
Formation  4864   3.3%
Changing   3987   2.7%
Cigar      3753   2.5%
Rectangle  2610   1.8%
Cylinder   2482   1.7%
Flash      2439   1.6%
Diamond    2116   1.4%
Chevron    1742   1.2%
Egg        1289   0.9%
Teardrop   1238   0.8%

Duration · text · feature

This is a free-text duration field, almost always a number-plus-unit phrase like '5 minutes' or '30 seconds' (mean 9.5 chars, ~2 words). Values are highly repetitive: only 15,527 distinct strings across 147,890 rows and an 89% duplicate rate, with a 4.8% null rate. The dominant units are 'minutes' and 'seconds', but the presence of an abbreviated 'min' token signals inconsistent formatting that will need normalisation.

Treatment: Parse number and unit, convert to a single numeric scale (e.g., seconds), and treat 'min'/'minutes' variants as equivalent.
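One way to sketch this normalisation is a small regex plus a unit table. Only 'minutes', 'seconds' and 'min' are attested in the stats above; the other spellings in `UNIT_SECONDS` and the helper name `duration_to_seconds` are assumptions added for illustration. Values that do not match a plain number-plus-unit pattern return None so they can be reviewed rather than silently guessed.

```python
import re
from typing import Optional

# Seconds per unit; abbreviations and plurals map to the same scale.
UNIT_SECONDS = {
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
}

# A number, optional whitespace, a unit word, and an optional trailing dot.
PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\.?\s*$", re.IGNORECASE)

def duration_to_seconds(raw: Optional[str]) -> Optional[float]:
    """Convert strings like '5 minutes' or '2 min' to seconds, else None."""
    if raw is None:
        return None
    match = PATTERN.match(raw)
    if match is None:
        return None  # ranges, prose, or bare numbers need manual review
    number, unit = match.groups()
    scale = UNIT_SECONDS.get(unit.lower())
    return None if scale is None else float(number) * scale
```

For example, `duration_to_seconds("5 minutes")` and `duration_to_seconds("5 min")` both give 300.0, while a value such as "about an hour" yields None.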

anthropic:claude-opus-4-7 · confidence high
Out[26]:

saturn.columns["Duration"].stats

stat value
n 147,890
nulls 7,148 (4.8%)
unique 15,527
len_min 1
len_max 37
len_mean 9.469
len_median 9
len_p95 15
word_mean 2.066
word_median 2
n_empty 0
n_duplicates 125,215
duplicate_rate 0.8897
vocab_size 1,757
readability_flesch_mean 84.41
emoji_rate 7.105e-06
url_rate 0
one_word_rate 0.0895
allcaps_rate 0.03033
boilerplate_rate 0
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 89.0% duplicate strings
Fig 13. Character-length distribution for Duration (mean: 9.469).
Data table:
chars    count
1 – 2    784
2 – 3    1006
3 – 4    586
4 – 5    2052
5 – 6    6297
6 – 6    9510
6 – 7    9325
7 – 8    9401
8 – 9    36028
9 – 10   0
10 – 11  37045
11 – 12  10292
12 – 13  3432
13 – 14  5013
14 – 14  1918
14 – 15  1786
15 – 16  1404
16 – 17  751
17 – 18  818
18 – 19  0
19 – 20  598
20 – 21  476
21 – 22  305
22 – 23  343
23 – 24  337
24 – 24  416
24 – 25  818
25 – 26  0
26 – 27  0
27 – 28  0
28 – 29  0
29 – 30  0
30 – 31  0
31 – 32  0
32 – 32  0
32 – 33  0
33 – 34  0
34 – 35  0
35 – 36  0
36 – 37  1

No of observers · numeric · feature

Counts of observers per record, with a typical value of 1-2 (median 2, IQR 1) but a maximum of 20000 driving mean 4.6 and std 129.5. Skew of 109.3 and kurtosis of 14332 are extreme, and 13.0% of rows are flagged as outliers. A min of -10 is suspicious for a count, and 6.9% are zero with 4.5% null.

Treatment: Clip negatives, investigate the 20000 tail, and log1p-transform before modelling.
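The outlier rate and the treatment can both be made concrete with a short sketch. It assumes the "beyond 1.5 IQR" alert refers to the usual Tukey fences; the function names are hypothetical. With q1 = 1 and q3 = 2 the fences fall at -0.5 and 3.5, so under that assumption any report of four or more observers (or a negative count) is flagged, which would account for an outlier rate as high as 13%.

```python
import math
from typing import Optional, Tuple

def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clean_observers(value: Optional[float]) -> Optional[float]:
    """Clip negative counts to zero, then apply log1p; nulls pass through."""
    if value is None:
        return None
    return math.log1p(max(value, 0.0))

# Quartiles taken from the stats table for this column.
low, high = tukey_fences(q1=1, q3=2)
```

After the transform the 20,000 maximum maps to roughly 9.9 instead of sitting four orders of magnitude above the median, though the tail rows still deserve a manual look before they are trusted.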

anthropic:claude-opus-4-7 · confidence high
Out[29]:

saturn.columns["No of observers"].stats

stat value
n 147,890
nulls 6,661 (4.5%)
unique 137
min -10
max 20,000
mean 4.603
median 2
std 129.5
q1 1
q3 2
iqr 1
skew 109.3
kurtosis 1.433e+04
n_outliers 18,390
outlier_rate 0.1302
zero_rate 0.06919
alert: high_skew · skew=+109.30
alert: outliers · 13.0% rows beyond 1.5 IQR
Fig 14. Distribution of No of observers. Vertical dash marks the median.
Data table: histogram bins for No of observers (median: 2.0).
bin                  count
-10 – 490.2          141102
490.2 – 990.5        41
990.5 – 1491         52
1491 – 1991          1
1991 – 2491          8
2491 – 2992          3
2992 – 3492          3
3492 – 3992          0
3992 – 4492          1
4492 – 4992          0
4992 – 5493          7
5493 – 5993          0
5993 – 6493          0
6493 – 6994          0
6994 – 7494          0
7494 – 7994          0
7994 – 8494          0
8494 – 8994          0
8994 – 9495          0
9495 – 9995          0
9995 – 1.05e+04      7
1.05e+04 – 1.1e+04   0
1.1e+04 – 1.15e+04   1
1.15e+04 – 1.2e+04   0
1.2e+04 – 1.25e+04   0
1.25e+04 – 1.3e+04   0
1.3e+04 – 1.35e+04   0
1.35e+04 – 1.4e+04   0
1.4e+04 – 1.45e+04   0
1.45e+04 – 1.5e+04   0
1.5e+04 – 1.55e+04   0
1.55e+04 – 1.6e+04   0
1.6e+04 – 1.65e+04   0
1.65e+04 – 1.7e+04   0
1.7e+04 – 1.75e+04   0
1.75e+04 – 1.8e+04   0
1.8e+04 – 1.85e+04   0
1.85e+04 – 1.9e+04   0
1.9e+04 – 1.95e+04   0
1.95e+04 – 2e+04     3

Reported · text · timestamp

This is a 'Reported' timestamp stored as text in fixed 27-character format like '1999-11-16 00:00:00 Pacific'. Every value has identical length (min/max/mean = 27) and 3 words, with 'Pacific' appearing as a constant timezone suffix in ~20000 rows. The multilingual alert is a false positive from the language detector misreading dates; the duplicate rate of 7.7% (11418 rows) reflects multiple events sharing a report date.

Treatment: Parse to datetime (strip 'Pacific' suffix, localize to US/Pacific) before any temporal analysis.

anthropic:claude-opus-4-7 · confidence high
Out[32]:

saturn.columns["Reported"].stats

stat value
n 147,890
nulls 1 (0.0%)
unique 136,471
len_min 27
len_max 27
len_mean 27
len_median 27
len_p95 27
word_mean 3
word_median 3
n_empty 0
n_duplicates 11,418
duplicate_rate 0.07721
vocab_size 24,077
readability_flesch_mean 62.79
emoji_rate 0
url_rate 0
one_word_rate 0
allcaps_rate 0
boilerplate_rate 0
alert: multilingual · 8 languages detected in sample
Fig 15. Character-length distribution for Reported (mean: 27.0).
Data table: 40 bins between 26 and 28 characters; only the single non-empty bin is listed, the other 39 have a count of 0.
chars    count
27 – 27  147889

Posted · categorical · timestamp

This column stores posting dates as datetime strings with zeroed time components, almost certainly a publication or upload timestamp. Across 147,890 rows there are 626 distinct dates and a single null, and the distribution is remarkably flat: the entropy ratio is 0.93 and the most common date (2020-06-25) accounts for only 1.24% of rows. The top dates span 1999 to 2023, suggesting the dataset covers more than two decades of activity.

Treatment: parse to datetime and derive year/month features for temporal analysis.
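The entropy ratio quoted for this column can be checked by hand. The reported figures are consistent with Shannon entropy in bits divided by its maximum, log2 of the cardinality: 8.644 / log2(626) ≈ 0.930. That definition is inferred from the numbers in this report (it also reproduces 0.7436 for Shape and 0.8248 for Explanation), not taken from Saturn's documentation, and the function names below are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable

def entropy_bits(values: Iterable[str]) -> float:
    """Shannon entropy, in bits, of the empirical value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_ratio(entropy: float, cardinality: int) -> float:
    """Entropy divided by its maximum possible value, log2(cardinality)."""
    return entropy / math.log2(cardinality) if cardinality > 1 else 0.0

# Figures from the Posted, Shape and Explanation stats tables.
posted_ratio = entropy_ratio(8.644, 626)
shape_ratio = entropy_ratio(3.93, 39)
explanation_ratio = entropy_ratio(4.832, 58)
```

A ratio near 1 means the values are spread almost evenly across categories, which is why Posted (0.93) reads as flat while Shape (0.74) is visibly dominated by a few values.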

anthropic:claude-opus-4-7 · confidence high
Out[35]:

saturn.columns["Posted"].stats

stat value
n 147,890
nulls 1 (0.0%)
unique 626
top_value 2020-06-25 00:00:00
top_rate 0.01239
cardinality 626
entropy 8.644
entropy_ratio 0.9304
Fig 16. Top values for Posted.
Data table: top values for Posted (20 unique shown, of 626 total).
value                count  share
2020-06-25 00:00:00  1833   1.2%
2009-12-12 00:00:00  1627   1.1%
2006-10-30 00:00:00  1573   1.1%
2019-12-01 00:00:00  1484   1.0%
2010-11-21 00:00:00  1365   0.9%
2022-09-09 00:00:00  1333   0.9%
1999-11-02 00:00:00  1314   0.9%
2020-12-23 00:00:00  1312   0.9%
2023-03-06 00:00:00  1274   0.9%
2008-10-31 00:00:00  1274   0.9%
2022-12-22 00:00:00  1252   0.8%
2001-08-05 00:00:00  1229   0.8%
2009-03-19 00:00:00  1201   0.8%
2009-01-10 00:00:00  1198   0.8%
2013-08-30 00:00:00  1142   0.8%
2022-03-04 00:00:00  1035   0.7%
2023-09-10 00:00:00  1028   0.7%
2008-06-12 00:00:00  1023   0.7%
2012-09-24 00:00:00  1019   0.7%
2011-10-10 00:00:00  1017   0.7%

Characteristics · text · feature

This is a multi-label categorical feature describing observed object characteristics (e.g. "Lights on object", "Aura or haze around object", "Aircraft nearby"), stored as comma-joined tags rather than a structured list. Despite 147,890 rows, only 1,446 distinct strings exist and 98.6% are duplicates, with a tiny vocab of 43 tokens. Watch for the 28.15% null rate and a truncated tag ("Changed Colo", apparently "Changed Color" cut off) that recurs across thousands of rows.

Treatment: Split on commas and one-hot encode the ~43 underlying tags; treat nulls as a separate indicator.
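A dependency-free sketch of the split-and-encode step is shown below. The helper names and the `is_null` indicator key are hypothetical, and the alias that repairs 'Changed Colo' follows the reading suggested above rather than a confirmed fix. The example tags are the ones quoted in the description.

```python
from typing import Dict, List, Optional

# Assumption: the truncated tag is 'Changed Color' cut off by one letter.
ALIASES = {"Changed Colo": "Changed Color"}

def split_tags(raw: Optional[str]) -> List[str]:
    """Split a comma-joined tag string into cleaned, de-aliased tags."""
    if raw is None:
        return []
    tags = [t.strip() for t in raw.split(",") if t.strip()]
    return [ALIASES.get(t, t) for t in tags]

def multi_hot(rows: List[Optional[str]]) -> List[Dict[str, int]]:
    """Encode each row as one indicator per tag plus a null indicator."""
    vocab = sorted({t for r in rows for t in split_tags(r)})
    encoded = []
    for r in rows:
        tags = set(split_tags(r))
        row = {t: int(t in tags) for t in vocab}
        row["is_null"] = int(r is None)
        encoded.append(row)
    return encoded

encoded = multi_hot(["Lights on object, Aircraft nearby", None, "Lights on object"])
```

Keeping `is_null` as its own column lets a model distinguish "no characteristics recorded" from "none of these tags apply".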

anthropic:claude-opus-4-7 · confidence high
Out[38]:

saturn.columns["Characteristics"].stats

stat value
n 147,890
nulls 41,631 (28.1%)
unique 1,446
len_min 6
len_max 251
len_mean 33.27
len_median 26
len_p95 76
word_mean 5.613
word_median 5
n_empty 0
n_duplicates 104,813
duplicate_rate 0.9864
vocab_size 43
readability_flesch_mean 69.46
emoji_rate 0
url_rate 0
one_word_rate 0.00384
allcaps_rate 0
boilerplate_rate 0
alert: null_rate · 28.1% null
alert: duplicates · 98.6% duplicate strings
Fig 17. Character-length distribution for Characteristics (mean: 33.27).
Data table:
chars      count
6 – 12     5825
12 – 18    36312
18 – 24    1614
24 – 30    18280
30 – 37    7600
37 – 43    5135
43 – 49    12756
49 – 55    2126
55 – 61    6966
61 – 67    2046
67 – 73    1732
73 – 80    1712
80 – 86    1002
86 – 92    866
92 – 98    584
98 – 104   367
104 – 110  436
110 – 116  232
116 – 122  112
122 – 128  106
128 – 135  89
135 – 141  51
141 – 147  80
147 – 153  30
153 – 159  32
159 – 165  28
165 – 171  48
171 – 178  21
178 – 184  14
184 – 190  10
190 – 196  7
196 – 202  8
202 – 208  4
208 – 214  7
214 – 220  4
220 – 226  2
226 – 233  1
233 – 239  3
239 – 245  1
245 – 251  10

Summary · text · free_text

Free-text summary field with 144,208 unique values across 147,890 rows and a 0.6% null rate, so virtually every record carries its own short description. Lengths are highly skewed: median 76 characters / 13 words but a max of 10,624 characters and a p95 of 479, and mean Flesch readability of 67.3 suggests fairly plain English prose. Top tokens are stopwords plus 'light' (4,676 occurrences), hinting at a recurring topical theme worth investigating; duplicates (1.9%) and boilerplate (<0.1%) are negligible.

Treatment: Tokenize and embed before modelling; do not treat as a categorical key.

anthropic:claude-opus-4-7 · confidence high
Out[41]:

saturn.columns["Summary"].stats

stat value
n 147,890
nulls 891 (0.6%)
unique 144,208
len_min 1
len_max 10,624
len_mean 134.9
len_median 76
len_p95 479
word_mean 25.06
word_median 13
n_empty 0
n_duplicates 2,791
duplicate_rate 0.01899
vocab_size 28,632
readability_flesch_mean 67.28
emoji_rate 0.0001088
url_rate 0.0007755
one_word_rate 0.002327
allcaps_rate 0.01786
boilerplate_rate 0.0008708
alert: near_unique · 98.1% of rows are unique strings
Fig 18. Character-length distribution for Summary (mean: 134.9).
Data table:
chars          count
1 – 267        131782
267 – 532      8911
532 – 798      3233
798 – 1063     1484
1063 – 1329    669
1329 – 1594    333
1594 – 1860    204
1860 – 2126    123
2126 – 2391    74
2391 – 2657    60
2657 – 2922    33
2922 – 3188    16
3188 – 3453    14
3453 – 3719    9
3719 – 3985    13
3985 – 4250    19
4250 – 4516    5
4516 – 4781    5
4781 – 5047    2
5047 – 5312    1
5312 – 5578    4
5578 – 5844    2
5844 – 6109    0
6109 – 6375    0
6375 – 6640    0
6640 – 6906    1
6906 – 7172    0
7172 – 7437    0
7437 – 7703    0
7703 – 7968    0
7968 – 8234    0
8234 – 8499    0
8499 – 8765    0
8765 – 9031    1
9031 – 9296    0
9296 – 9562    0
9562 – 9827    0
9827 – 10093   0
10093 – 10358  0
10358 – 10624  1

Text · text · free_text

Free-text field containing medium-length English prose, averaging 949.9 characters and 181.6 words with a Flesch readability of 69.6, consistent with first-person narrative accounts of each sighting. The column is near-unique (127,124 unique of 147,890) yet still carries 4,091 exact duplicates (3.1%) and an 11.3% null rate worth investigating. Top words are dominated by English stopwords plus a frequent first-person 'i' and 'my', which points to personal, subjective writing rather than formal documents.

Treatment: Clean nulls and exact duplicates, then tokenize and embed before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[44]:

saturn.columns["Text"].stats

stat value
n 147,890
nulls 16,675 (11.3%)
unique 127,124
len_min 1
len_max 64,550
len_mean 949.9
len_median 682
len_p95 2,673
word_mean 181.6
word_median 131
n_empty 0
n_duplicates 4,091
duplicate_rate 0.03118
vocab_size 65,106
readability_flesch_mean 69.56
emoji_rate 0.0001981
url_rate 0.01805
one_word_rate 0.0006783
allcaps_rate 0.007217
boilerplate_rate 0.001875
alert: near_unique · 96.9% of rows are unique strings
Fig 19. Character-length distribution for Text (mean: 949.9).
Data table:
chars          count
1 – 1615       111865
1615 – 3228    15329
3228 – 4842    2755
4842 – 6456    757
6456 – 8070    257
8070 – 9683    104
9683 – 11297   60
11297 – 12911  31
12911 – 14525  13
14525 – 16138  13
16138 – 17752  7
17752 – 19366  10
19366 – 20979  4
20979 – 22593  3
22593 – 24207  2
24207 – 25821  2
25821 – 27434  0
27434 – 29048  1
29048 – 30662  0
30662 – 32276  0
32276 – 33889  0
33889 – 35503  1
35503 – 37117  0
37117 – 38730  0
38730 – 40344  0
40344 – 41958  0
41958 – 43572  0
43572 – 45185  0
45185 – 46799  0
46799 – 48413  0
48413 – 50026  0
50026 – 51640  0
51640 – 53254  0
53254 – 54868  0
54868 – 56481  0
56481 – 58095  0
58095 – 59709  0
59709 – 61323  0
61323 – 62936  0
62936 – 64550  1

Location details · text · free_text

Free-text supplementary location notes, populated for only 6.9% of the 147,890 rows (null_rate 0.931). When present, entries are short prose averaging 38.7 characters / 6.9 words with readable Flesch 69.8, and the top tokens ('the', 'of', 'in', 'my', 'from') confirm natural-language descriptions rather than structured place codes. Cardinality is high (9,713 uniques) but 492 exact duplicates (4.8%) hint at recurring phrases worth normalising.

Treatment: Treat as optional free-text annotation: tokenize/embed for any modelling and expect to ignore for the 93% of rows where it is null.

anthropic:claude-opus-4-7 · confidence high
Out[47]:

saturn.columns["Location details"].stats

stat value
n 147,890
nulls 137,685 (93.1%)
unique 9,713
len_min 1
len_max 197
len_mean 38.75
len_median 33
len_p95 92
word_mean 6.93
word_median 6
n_empty 0
n_duplicates 492
duplicate_rate 0.04821
vocab_size 12,384
readability_flesch_mean 69.85
emoji_rate 0
url_rate 0.001078
one_word_rate 0.03822
allcaps_rate 0.0197
boilerplate_rate 0.000196
alert: near_unique · 95.2% of rows are unique strings
alert: null_rate · 93.1% null
Fig 20. Character-length distribution for Location details (mean: 38.75).
Data table:
chars      count
1 – 6      130
6 – 11     582
11 – 16    983
16 – 21    1124
21 – 26    1030
26 – 30    845
30 – 35    837
35 – 40    741
40 – 45    653
45 – 50    435
50 – 55    472
55 – 60    378
60 – 65    321
65 – 70    269
70 – 74    238
74 – 79    217
79 – 84    205
84 – 89    166
89 – 94    206
94 – 99    188
99 – 104   176
104 – 109  0
109 – 114  1
114 – 119  0
119 – 124  1
124 – 128  1
128 – 133  1
133 – 138  0
138 – 143  0
143 – 148  0
148 – 153  0
153 – 158  0
158 – 163  0
163 – 168  0
168 – 172  1
172 – 177  1
177 – 182  0
182 – 187  2
187 – 192  0
192 – 197  1

Explanation · categorical · label

Free-form classification labels explaining UFO/sky-object sightings, with categories like 'Starlink - Probable', 'Rocket - Certain', and 'Balloon - Possible' combining an object type with a confidence qualifier. The column is 99.46% null — only ~800 of 147,890 rows carry a value — so it functions as a sparse annotation rather than a primary feature. Among populated rows, 58 distinct labels appear with relatively even spread (entropy ratio 0.82); the top label 'Starlink - Probable' covers just 9.71% of non-nulls.

Treatment: Treat as a sparse secondary label; split into object/confidence and only model on the annotated subset.
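Splitting the label into its two parts could be sketched as follows. The helper name `split_explanation` is hypothetical, and the sketch assumes the ' - ' separator seen in the top values holds for all 58 labels; a label without the separator is returned whole with no confidence.

```python
from typing import Optional, Tuple

def split_explanation(label: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split 'Object - Confidence' on the last ' - ' separator."""
    if label is None:
        return None, None
    obj, sep, confidence = label.rpartition(" - ")
    if not sep:
        # No separator found: keep the whole label as the object type.
        return label.strip(), None
    return obj.strip(), confidence.strip()

pair = split_explanation("Starlink - Probable")
```

Splitting on the last separator keeps object names such as 'Planet/Star' or 'Camera Anomaly' intact, and the confidence part ('Possible', 'Probable', 'Certain') can then be treated as an ordinal feature on the annotated subset.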

anthropic:claude-opus-4-7 · confidence high
Out[50]:

saturn.columns["Explanation"].stats

stat value
n 147,890
nulls 147,087 (99.5%)
unique 58
top_value Starlink - Probable
top_rate 0.09714
cardinality 58
entropy 4.832
entropy_ratio 0.8248
alert: null_rate · 99.5% null
Fig 21. Top values for Explanation.
Data table: top values for Explanation (20 unique shown, of 58 total).
value                      count  share
Starlink - Probable        78     0.1%
Starlink - Certain         69     0.0%
Rocket - Certain           67     0.0%
Balloon - Possible         50     0.0%
Starlink - Possible        49     0.0%
Planet/Star - Possible     42     0.0%
Planet/Star - Probable     41     0.0%
Aircraft - Possible        35     0.0%
Aircraft - Probable        35     0.0%
Camera Anomaly - Probable  33     0.0%
Camera Anomaly - Certain   25     0.0%
Rocket - Probable          24     0.0%
Bird - Possible            21     0.0%
Searchlight - Certain      18     0.0%
Balloon - Probable         18     0.0%
Drone - Possible           17     0.0%
Camera Anomaly - Possible  15     0.0%
Bird - Probable            14     0.0%
Searchlight - Probable     12     0.0%
Rocket - Possible          11     0.0%

How to cite

BibTeX
@misc{saturn-quirky-nuforc-sightings-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky nuforc sightings},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-nuforc_sightings}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky nuforc sightings. Source: /home/coolhand/html/datavis/data_trove/cache/quirky/nuforc_sightings.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-nuforc_sightings