quirky asteroids

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/asteroids.json

Saturn profiled 40,827 rows across 11 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
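# Install the profiler (IPython shell escape; run this inside a notebook cell).
!pip install saturn-dissect
import subprocess

# Run the analysis on the source file with the same flags as the original cell.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/asteroids.json",
        "--findings", "quirky-asteroids.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the command exits non-zero
)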

Summary confidence: high

This dataset catalogs 40,827 Near-Earth Objects (asteroids) across 11 columns mixing orbital parameters (H, a, e, i, per), physical properties (diameter, albedo), and classification flags (neo, pha, class). Every record has neo='Y', so that column carries no information and can be ignored. The most analytically interesting fields are 'class', where APO dominates at 56.8% followed by AMO at 35.1%, and 'pha' (potentially hazardous), which flags 2,534 objects (about 6.2%) as 'Y'. Note that 'diameter' and 'albedo' are ~97% null, so any size/reflectivity analysis will be limited to roughly 1,200 rows. The orbital-parameter columns are stored as short text rather than numbers — they will need to be cast to floats before any quantitative work.

citing: row_count · column_count · columns.class.top_values · columns.pha.top_values · columns.neo.top_values · columns.diameter.null_rate · columns.albedo.null_rate · columns.full_name.top_words · columns.H.stats · columns.a.stats
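
The cast recommended in the summary can be sketched in plain Python. This is a minimal sketch, assuming the JSON has already been loaded into a list of dicts keyed by column name; the helper names `to_float` and `cast_record` and the sample rows are hypothetical and not part of Saturn.

```python
import math

NUMERIC_TEXT_COLUMNS = ["H", "a", "e", "i", "per"]  # orbital columns profiled as text


def to_float(value):
    """Parse a numeric string; return NaN for None, empty, or unparseable input."""
    if value is None:
        return math.nan
    try:
        return float(str(value).strip())
    except ValueError:
        return math.nan


def cast_record(record):
    """Return a copy of one row with the orbital columns converted to floats."""
    out = dict(record)
    for col in NUMERIC_TEXT_COLUMNS:
        out[col] = to_float(record.get(col))
    return out


# Hypothetical rows shaped like the profiled columns (values are illustrative).
rows = [
    {"H": "24.20", "a": "1.299", "e": "0.5298", "i": "6.07", "per": "1.13e+03"},
    {"H": None, "a": "1.424", "e": "0.5964", "i": "2.12", "per": "620"},
]
casted = [cast_record(r) for r in rows]
```

Returning NaN rather than raising keeps the handful of null values (for example the 3 nulls in H) from aborting the conversion; whether to drop or impute them afterwards is a separate decision.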

Out[4]:

saturn.schema() · 11 columns

column     kind         n       null%  unique  alerts
full_name  text         40,827  0.0%   40,827  near_unique, allcaps, short_text
neo        categorical  40,827  0.0%   1       imbalance
pha        categorical  40,827  0.3%   2
e          text         40,827  0.0%   7,849   one_word, allcaps, short_text, duplicates
a          text         40,827  0.0%   4,170   one_word, allcaps, short_text, duplicates
i          text         40,827  0.0%   4,489   one_word, allcaps, short_text, duplicates
per        text         40,827  0.0%   1,025   one_word, allcaps, short_text, duplicates
H          text         40,827  0.0%   1,656   one_word, allcaps, short_text, duplicates
diameter   categorical  40,827  96.9%  924     long_tail, null_rate
albedo     categorical  40,827  97.1%  437     null_rate
class      categorical  40,827  0.0%   4
Fig 1.
class · Orbit class distribution: APO and AMO together account for over 90% of the catalog.
Top values for class (4 unique shown, of 4 total).
value  count   share
APO    23,175  56.8%
AMO    14,321  35.1%
ATE    3,293   8.1%
IEO    38      0.1%
Fig 2.
pha · Potentially hazardous flag — about 6% of asteroids are marked 'Y'.
Top values for pha (2 unique shown, of 2 total).
value  count   share
N      38,162  93.5%
Y      2,534   6.2%
Fig 3.
H · Absolute magnitude (H) values cluster tightly around 24–25.5; cast to numeric to see the full shape.
Character-length distribution for H (mean: 4.999975504605135); empty bins omitted.
chars  count
4      1
5      40,823
Fig 4.
full_name · Name length is fairly uniform (16–19 chars), reflecting the standard '(YYYY XX)' designation format.
Character-length distribution for full_name (mean: 17.27251573713474); empty bins omitted.
chars  count
16     6,070
17     21,295
18     10,737
19     2,544
20     4
21     20
22     17
23     28
24     29
25     29
26     17
27     10
28     8
29     9
30     4
31     3
32     1
33     1
34     1
Fig 5.
albedo · Top albedo values among the ~3% of rows that have one — useful for spotting dark vs reflective bodies.
Top values for albedo (20 unique shown, of 437 total).
value  count  share
0.037  15     0.0%
0.020  15     0.0%
0.031  14     0.0%
0.019  12     0.0%
0.023  12     0.0%
0.018  11     0.0%
0.022  10     0.0%
0.030  10     0.0%
0.025  10     0.0%
0.034  10     0.0%
0.028  9      0.0%
0.042  9      0.0%
0.048  9      0.0%
0.026  9      0.0%
0.137  9      0.0%
0.040  9      0.0%
0.024  9      0.0%
0.046  8      0.0%
0.033  8      0.0%
0.039  8      0.0%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column     kind         null %
full_name  text         0.0%
neo        categorical  0.0%
pha        categorical  0.3%
e          text         0.0%
a          text         0.0%
i          text         0.0%
per        text         0.0%
H          text         0.0%
diameter   categorical  96.9%
albedo     categorical  97.1%
class      categorical  0.0%

full_name text identifier

Despite the name 'full_name', this column holds asteroid designation strings rather than personal names: the top tokens are all parenthesised years from 2016-2025 (e.g. '(2024'), consistent with the '(YYYY XX)' designation format. All 40,827 rows are unique with zero nulls, 99.56% are all-caps, and lengths are tightly bounded between 16 and 34 characters. The year distribution skews recent, with 2024 (1607) and 2025 (1594) leading.

Treatment: Treat as a unique identifier key; parse the parenthesised year into a separate numeric feature rather than embedding the raw string.

anthropic:claude-opus-4-7 · confidence high
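
The year-parsing step in the treatment could look like the following. It is a minimal sketch, assuming the year always appears as four digits directly after an opening parenthesis; the function name `designation_year` and the example strings are hypothetical and were not taken from the dataset.

```python
import re

# Matches a four-digit year immediately after an opening parenthesis, e.g. "(2024 ...".
YEAR_IN_PARENS = re.compile(r"\((\d{4})\b")


def designation_year(full_name):
    """Return the parenthesised year as an int, or None if no year is present."""
    match = YEAR_IN_PARENS.search(full_name)
    return int(match.group(1)) if match else None


# Hypothetical names in the '(YYYY XX)' style; not taken from the dataset.
examples = ["(2024 AB1)", "12345 Example (2016 XY)", "no year here"]
years = [designation_year(name) for name in examples]
```

Returning None for non-matching names makes it easy to count how many rows fall outside the assumed pattern before relying on the derived feature.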
Out[12]:

saturn.columns["full_name"].stats

stat                     value
n                        40,827
nulls                    0 (0.0%)
unique                   40,827
len_min                  16
len_max                  34
len_mean                 17.27
len_median               17
len_p95                  19
word_mean                8.479
word_median              9
n_empty                  0
n_duplicates             0
duplicate_rate           0
vocab_size               12,613
readability_flesch_mean  119.5
emoji_rate               0
url_rate                 0
one_word_rate            0
allcaps_rate             0.9956
boilerplate_rate         0
alert: near_unique       100.0% of rows are unique strings
alert: allcaps           99.6% rows are all-caps
alert: short_text        95th-percentile length under 20 chars
Fig 7.
Character-length distribution for full_name (mean: 17.27251573713474); empty bins omitted.
chars  count
16     6,070
17     21,295
18     10,737
19     2,544
20     4
21     20
22     17
23     28
24     29
25     29
26     17
27     10
28     8
29     9
30     4
31     3
32     1
33     1
34     1

neo categorical metadata

The `neo` column is a categorical flag that takes a single value 'Y' across all 40,827 rows, with zero nulls and entropy of 0.0. Because cardinality is 1 and top_rate is 1.0, this column carries no information and cannot discriminate between records.

Treatment: Drop, constant column.

anthropic:claude-opus-4-7 · confidence high
Out[15]:

saturn.columns["neo"].stats

stat              value
n                 40,827
nulls             0 (0.0%)
unique            1
top_value         Y
top_rate          1
cardinality       1
entropy           0
entropy_ratio     0
alert: imbalance  top value is 100.0% of rows
Fig 8.
Top values for neo (1 unique shown, of 1 total).
value  count   share
Y      40,827  100.0%

pha categorical label

Binary Y/N flag, almost certainly a 'potentially hazardous asteroid' indicator given the column name 'pha'. The class is heavily imbalanced: 'N' covers 93.77% of non-null rows (38,162 values; 93.5% of all rows) versus 2,534 'Y' values, with 131 nulls (a 0.32% null rate). An entropy ratio of 0.34 confirms the skew.

Treatment: Encode as binary target and use class-imbalance handling (stratified splits, class weights, or resampling).

anthropic:claude-opus-4-7 · confidence high
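
The entropy figures quoted for this column can be reproduced from the value counts. The report does not state Saturn's formula, so this is a sketch under an assumption: base-2 Shannon entropy over non-null values, divided by log2 of the number of categories. The function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    return 0.0 if k < 2 else entropy_bits(counts) / math.log2(k)


pha_counts = [38162, 2534]               # N, Y (nulls excluded)
class_counts = [23175, 14321, 3293, 38]  # APO, AMO, ATE, IEO

pha_h = entropy_bits(pha_counts)
pha_ratio = entropy_ratio(pha_counts)
class_h = entropy_bits(class_counts)
class_ratio = entropy_ratio(class_counts)
neo_ratio = entropy_ratio([40827])       # constant column
```

Under that assumption the results agree with the reported stats to the printed precision: roughly 0.336 for pha (entropy and ratio coincide because log2(2) = 1), and roughly 1.296 and 0.648 for class. A ratio near 1 would indicate a near-uniform split; values this low indicate a dominant category.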
Out[18]:

saturn.columns["pha"].stats

stat           value
n              40,827
nulls          131 (0.3%)
unique         2
top_value      N
top_rate       0.9377
cardinality    2
entropy        0.3364
entropy_ratio  0.3364
Fig 9.
Top values for pha (2 unique shown, of 2 total).
value  count   share
N      38,162  93.5%
Y      2,534   6.2%

e text feature

Column 'e' is stored as text but every value is a fixed 6-character single token, and the top values ('0.5298', '0.5964', '0.4826', ...) are all numeric strings between 0 and 1. This is almost certainly a numeric feature that has been serialized as text; given the orbital-parameter columns alongside it, it is most likely orbital eccentricity rather than a probability or normalized score. With 7,849 unique values across 40,827 rows and a duplicate_rate of 0.808, repetition is heavy but unsurprising for values rounded to four decimal places.

Treatment: Cast to float and use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
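
The duplicate figures follow from simple arithmetic on the row and unique counts. This sketch assumes Saturn counts every row beyond the first occurrence of a value as a duplicate (n minus unique), which is not stated in the report but matches the numbers shown for `e` and `a`; the function name is hypothetical.

```python
def duplicate_stats(n_rows, n_unique):
    """Count rows beyond the first occurrence of each value, and their share."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, n_duplicates / n_rows


e_dupes, e_rate = duplicate_stats(40_827, 7_849)
a_dupes, a_rate = duplicate_stats(40_827, 4_170)
```

For a rounded numeric column a high rate like this mostly reflects limited precision, so it is not a data-quality problem in the way that duplicated free-text rows would be.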
Out[21]:

saturn.columns["e"].stats

stat                     value
n                        40,827
nulls                    0 (0.0%)
unique                   7,849
len_min                  6
len_max                  6
len_mean                 6
len_median               6
len_p95                  6
word_mean                1
word_median              1
n_empty                  0
n_duplicates             32,978
duplicate_rate           0.8077
vocab_size               6,736
readability_flesch_mean  121.2
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             1
boilerplate_rate         0
alert: one_word          100.0% rows are a single word
alert: allcaps           100.0% rows are all-caps
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        80.8% duplicate strings
Fig 10.
Character-length distribution for e (mean: 6.0); empty bins omitted.
chars  count
6      40,827

a text feature

Column 'a' is stored as text but the values are short numeric strings (e.g., '1.299', '1.424'), all single tokens of length 1-6. With 4,170 unique values across 40,827 rows and an 89.8% duplicate rate, it behaves like a low-precision numeric feature mistyped as a string; in this orbital context it is most likely the semi-major axis. The 99.9% allcaps flag is a quirk of strings with no lowercase letters tripping the case detector and can be ignored.

Treatment: Cast to float and treat as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
Out[24]:

saturn.columns["a"].stats

stat                     value
n                        40,827
nulls                    0 (0.0%)
unique                   4,170
len_min                  1
len_max                  6
len_mean                 4.969
len_median               5
len_p95                  6
word_mean                1
word_median              1
n_empty                  0
n_duplicates             36,657
duplicate_rate           0.8979
vocab_size               3,344
readability_flesch_mean  121.2
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             0.9995
boilerplate_rate         0
alert: one_word          100.0% rows are a single word
alert: allcaps           99.9% rows are all-caps
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        89.8% duplicate strings
Fig 11.
Character-length distribution for a (mean: 4.969113576799667); empty bins omitted.
chars  count
1      22
3      378
4      3,428
5      33,988
6      3,011

i text feature

Despite being typed as text, column 'i' holds short numeric tokens (length 4-6, all single-word) like '6.07', '2.12', '2.26': almost certainly a decimal numeric feature stored as strings, and in this orbital context most likely the inclination. With 40,827 rows but only 4,489 unique values and an 89% duplicate rate, the value space is heavily concentrated. The 'allcaps' flag and Flesch score of 121.22 are artefacts of treating numeric strings as prose and can be ignored.

Treatment: Cast to float and treat as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
Out[27]:

saturn.columns["i"].stats

stat                     value
n                        40,827
nulls                    0 (0.0%)
unique                   4,489
len_min                  4
len_max                  6
len_mean                 4.428
len_median               4
len_p95                  5
word_mean                1
word_median              1
n_empty                  0
n_duplicates             36,338
duplicate_rate           0.89
vocab_size               3,827
readability_flesch_mean  121.2
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             1
boilerplate_rate         0
alert: one_word          100.0% rows are a single word
alert: allcaps           100.0% rows are all-caps
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        89.0% duplicate strings
Fig 12.
Character-length distribution for i (mean: 4.427707154579077); empty bins omitted.
chars  count
4      23,372
5      17,448
6      7

per text feature

Despite being typed as text, every value in `per` is a single short token (word_mean 1.0, len_max 8) and the top values are all numeric strings in scientific notation like '1.13e+03', suggesting this is a numeric measurement that was stringified during export. With 40,827 rows but only 1,025 unique values and a 97.5% duplicate rate, the values are heavily rounded and repeat often, as expected for a coarsely quantised measurement (in this orbital context, most likely the orbital period). The 64% allcaps rate arises because values containing a lowercase exponent such as 'e+03' fail the all-caps check, not because of genuine casing.

Treatment: Cast back to numeric (parse the scientific-notation strings to float) before modelling.

anthropic:claude-opus-4-7 · confidence high
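
The effect of the exponent on the all-caps flag can be illustrated with a toy check. This is a sketch under a stated assumption: the detector behaves like "the string is unchanged by uppercasing". Saturn's actual rule is not given in this report, and `looks_allcaps` and the value "365" are hypothetical.

```python
def looks_allcaps(text):
    """Hypothetical detector: true when uppercasing leaves the string unchanged."""
    return text == text.upper()


plain = looks_allcaps("365")            # digits only: nothing to uppercase
scientific = looks_allcaps("1.13e+03")  # lowercase 'e' would become 'E'
parsed = float("1.13e+03")              # the cast itself is a plain float() call

# Share of `per` values that are 3 characters long, from the length histogram.
three_char_share = 26_231 / 40_827
```

The 3-character share comes out at about 0.6425, the same as the reported allcaps_rate. That is consistent with every longer value carrying a lowercase exponent, though the histogram alone does not prove it.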
Out[30]:

saturn.columns["per"].stats

stat                     value
n                        40,827
nulls                    0 (0.0%)
unique                   1,025
len_min                  3
len_max                  8
len_mean                 4.747
len_median               3
len_p95                  8
word_mean                1
word_median              1
n_empty                  0
n_duplicates             39,802
duplicate_rate           0.9749
vocab_size               985
readability_flesch_mean  121.2
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             0.6425
boilerplate_rate         0
alert: one_word          100.0% rows are a single word
alert: allcaps           64.2% rows are all-caps
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        97.5% duplicate strings
Fig 13.
Character-length distribution for per (mean: 4.746687241286404); empty bins omitted.
chars  count
3      26,231
5      146
7      1,230
8      13,220

H text feature

Column H is stored as text but the values are uniformly short numeric tokens (len_mean ≈ 5.0, one_word_rate 1.0) clustered tightly around 24-25 (top values 24.20-25.50). With 1,656 uniques across 40,827 rows and a 95.9% duplicate_rate, this is a quantised numeric measurement miscast as a string: absolute magnitude, as the Fig 3 caption identifies it, rather than a price or weight. The allcaps flag is a false positive driven by digits.

Treatment: Cast to float and treat as a continuous numeric feature.

anthropic:claude-opus-4-7 · confidence high
Out[33]:

saturn.columns["H"].stats

stat                     value
n                        40,827
nulls                    3 (0.0%)
unique                   1,656
len_min                  4
len_max                  5
len_mean                 5
len_median               5
len_p95                  5
word_mean                1
word_median              1
n_empty                  0
n_duplicates             39,168
duplicate_rate           0.9594
vocab_size               1,522
readability_flesch_mean  121.2
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             1
boilerplate_rate         0
alert: one_word          100.0% rows are a single word
alert: allcaps           100.0% rows are all-caps
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        95.9% duplicate strings
Fig 14.
Character-length distribution for H (mean: 4.999975504605135); empty bins omitted.
chars  count
4      1
5      40,823

diameter categorical feature

This is almost certainly an asteroid/object diameter measurement stored as strings (e.g. '0.4', '2.3', '0.451'), miscoded as categorical. It is overwhelmingly missing — 96.94% null — and among the 40,827 rows only 924 distinct values appear, with the most common ('0.4') occurring just 7 times (top_rate 0.0056) and entropy_ratio 0.985 indicating a near-uniform long tail. The mix of one-decimal and three-decimal strings hints at heterogeneous measurement precision across sources.

Treatment: Cast to numeric and either drop given 96.94% nulls or impute with a missingness indicator before modelling.

anthropic:claude-opus-4-7 · confidence high
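
The "impute with a missingness indicator" option can be sketched as follows. Median imputation is an assumption made here for illustration (the report does not name an imputation method), and the function name and sample values are hypothetical.

```python
import math
from statistics import median


def with_missing_indicator(values):
    """Cast strings to floats and pair each with a 0/1 missingness flag.

    Missing entries are filled with the median of the observed values.
    """
    parsed = [float(v) if v not in (None, "") else math.nan for v in values]
    observed = [x for x in parsed if not math.isnan(x)]
    fill = median(observed) if observed else math.nan
    return [(fill, 1) if math.isnan(x) else (x, 0) for x in parsed]


# Illustrative column: mostly missing, like `diameter`.
diameter_raw = [None, "0.4", None, "2.3", None, None, "0.451", None]
features = with_missing_indicator(diameter_raw)
```

With about 97% of rows missing, the indicator carries most of the signal and the imputed value is close to a constant, which is one reason dropping the column is listed as the alternative.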
Out[36]:

saturn.columns["diameter"].stats

stat              value
n                 40,827
nulls             39,579 (96.9%)
unique            924
top_value         0.4
top_rate          0.005609
cardinality       924
entropy           9.703
entropy_ratio     0.9849
alert: long_tail  684 singleton categories
alert: null_rate  96.9% null
Fig 15.
Top values for diameter (20 unique shown, of 924 total).
value  count  share
0.4    7      0.0%
2.3    6      0.0%
1.0    5      0.0%
1.4    5      0.0%
0.451  5      0.0%
0.3    4      0.0%
0.9    4      0.0%
0.432  4      0.0%
0.413  4      0.0%
0.344  4      0.0%
0.654  4      0.0%
0.374  4      0.0%
0.228  4      0.0%
0.143  4      0.0%
0.066  4      0.0%
1.5    3      0.0%
4.3    3      0.0%
0.6    3      0.0%
1.8    3      0.0%
0.7    3      0.0%

albedo categorical feature

Likely a geometric/Bond albedo measurement (reflectivity, 0-1 range) stored as a string rather than parsed numeric, given top values like '0.037', '0.020', '0.031'. Coverage is extremely sparse: 97.05% null with only 437 distinct values across 40,827 rows, and the modal value appears just 15 times (1.25% of the 1,204 non-null rows). Entropy ratio of 0.954 shows the few populated values are spread almost uniformly across the 437 levels.

Treatment: Cast to float and treat as numeric; given 97% nulls, use only as a sparse feature with missingness indicator or drop.

anthropic:claude-opus-4-7 · confidence high
Out[39]:

saturn.columns["albedo"].stats

stat              value
n                 40,827
nulls             39,623 (97.1%)
unique            437
top_value         0.037
top_rate          0.01246
cardinality       437
entropy           8.366
entropy_ratio     0.9538
alert: null_rate  97.1% null
Fig 16.
Top values for albedo (20 unique shown, of 437 total).
value  count  share
0.037  15     0.0%
0.020  15     0.0%
0.031  14     0.0%
0.019  12     0.0%
0.023  12     0.0%
0.018  11     0.0%
0.022  10     0.0%
0.030  10     0.0%
0.025  10     0.0%
0.034  10     0.0%
0.028  9      0.0%
0.042  9      0.0%
0.048  9      0.0%
0.026  9      0.0%
0.137  9      0.0%
0.040  9      0.0%
0.024  9      0.0%
0.046  8      0.0%
0.033  8      0.0%
0.039  8      0.0%

class categorical label

Categorical label with 4 classes across 40,827 rows and no nulls. Distribution is heavily imbalanced: APO accounts for 56.8% and AMO for most of the remainder, while IEO appears only 38 times — a near-absent class that will be hard to learn or evaluate.

Treatment: Use as classification target with class-weighting or resampling to handle the IEO minority class.

anthropic:claude-opus-4-7 · confidence high
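
One common form of class weighting is the "balanced" heuristic, n_samples / (n_classes * class_count), the same formula scikit-learn applies for class_weight="balanced". The sketch below applies it to the counts reported for this column; the function name is hypothetical, and choosing this heuristic over resampling is an assumption rather than something the report prescribes.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


class_counts = {"APO": 23175, "AMO": 14321, "ATE": 3293, "IEO": 38}
weights = balanced_class_weights(class_counts)
```

With only 38 IEO rows the resulting weight is several hundred times that of APO, so a handful of IEO examples would dominate the loss. That supports the note above that this class will be hard to learn or evaluate, whatever weighting is used.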
Out[42]:

saturn.columns["class"].stats

stat           value
n              40,827
nulls          0 (0.0%)
unique         4
top_value      APO
top_rate       0.5676
cardinality    4
entropy        1.296
entropy_ratio  0.6481
Fig 17.
Top values for class (4 unique shown, of 4 total).
value  count   share
APO    23,175  56.8%
AMO    14,321  35.1%
ATE    3,293   8.1%
IEO    38      0.1%

How to cite

BibTeX
@misc{saturn-quirky-asteroids-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky asteroids},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-asteroids}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky asteroids. Source: /home/coolhand/html/datavis/data_trove/data/quirky/asteroids.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-asteroids