quirky-asteroids · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/asteroids.json

Saturn profiled 40,827 rows across 11 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/asteroids.json",
    "--findings", "quirky-asteroids.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset catalogs 40,827 Near-Earth Objects (asteroids) across 11 columns mixing orbital parameters (a, e, i, per), physical and photometric properties (H, diameter, albedo), and classification flags (neo, pha, class). Every record has neo='Y', so that column carries no information and can be ignored. The most analytically interesting fields are 'class', where APO dominates at 56.8% followed by AMO at 35.1%, and 'pha' (potentially hazardous), which flags 2,534 objects (about 6.2%) as 'Y'. Note that 'diameter' and 'albedo' are ~97% null, so any size/reflectivity analysis will be limited to roughly 1,200 rows. The numeric columns H, a, e, i, and per are stored as short text rather than numbers, so they will need to be cast to floats before any quantitative work.

citing: row_count · column_count · columns.class.top_values · columns.pha.top_values · columns.neo.top_values · columns.diameter.null_rate · columns.albedo.null_rate · columns.full_name.top_words · columns.H.stats · columns.a.stats
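
The cast recommended in the summary can be sketched as follows. This is a minimal illustration using only the standard library; the record layout, the sample designation, and the helper names `to_float` and `cast_record` are assumptions for illustration, since the notebook does not show the raw JSON structure.

```python
# Columns the profile reports as text but which hold numeric strings.
NUMERIC_COLUMNS = ("H", "a", "e", "i", "per")


def to_float(value):
    """Return float(value), or None when the value is missing or unparseable."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def cast_record(record):
    """Return a copy of the record with the numeric text columns cast to float."""
    out = dict(record)
    for col in NUMERIC_COLUMNS:
        out[col] = to_float(record.get(col))
    return out


# Hypothetical record: the values are top values quoted in this notebook,
# the designation and the key layout are assumed.
sample = {
    "full_name": "(2024 AB1)",
    "H": "24.20",
    "a": "1.299",
    "e": "0.5298",
    "i": "6.07",
    "per": "1.13e+03",
    "class": "APO",
}
casted = cast_record(sample)
```

Returning None instead of raising keeps the three null H values and any malformed strings visible as missing data rather than aborting the whole load.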

Out[4]:

saturn.schema() · 11 columns

column	kind	n	null%	unique	alerts
full_name	text	40,827	0.0%	40,827	near_unique allcaps short_text
neo	categorical	40,827	0.0%	1	imbalance
pha	categorical	40,827	0.3%	2
e	text	40,827	0.0%	7,849	one_word allcaps short_text duplicates
a	text	40,827	0.0%	4,170	one_word allcaps short_text duplicates
i	text	40,827	0.0%	4,489	one_word allcaps short_text duplicates
per	text	40,827	0.0%	1,025	one_word allcaps short_text duplicates
H	text	40,827	0.0%	1,656	one_word allcaps short_text duplicates
diameter	categorical	40,827	96.9%	924	long_tail null_rate
albedo	categorical	40,827	97.1%	437	null_rate
class	categorical	40,827	0.0%	4

Fig 1.

class · Orbit class distribution: APO and AMO together account for over 90% of the catalog.

Top values for class (4 unique shown, of 4 total).
value	count	share
APO	23175	56.8%
AMO	14321	35.1%
ATE	3293	8.1%
IEO	38	0.1%

Fig 2.

pha · Potentially hazardous flag — about 6% of asteroids are marked 'Y'.

Top values for pha (2 unique shown, of 2 total).
value	count	share
N	38162	93.5%
Y	2534	6.2%

Fig 3.

H · Absolute magnitude (H) values cluster tightly around 24–25.5; cast to numeric to see the full shape.

Character-length distribution for H (mean: 4.999975504605135). Empty histogram bins omitted.
chars	count
4	1
5	40823

Fig 4.

full_name · Name length is fairly uniform (16–19 chars), reflecting the standard '(YYYY XX)' designation format.

Character-length distribution for full_name (mean: 17.27251573713474). Empty histogram bins omitted.
chars	count
16	6070
17	21295
18	10737
19	2544
20	4
21	20
22	17
23	28
24	29
25	29
26	17
27	10
28	8
29	9
30	4
31	3
32	1
33	1
34	1

Fig 5.

albedo · Top albedo values among the ~3% of rows that have one — useful for spotting dark vs reflective bodies.

Top values for albedo (20 unique shown, of 437 total).
value	count	share
0.037	15	0.0%
0.020	15	0.0%
0.031	14	0.0%
0.019	12	0.0%
0.023	12	0.0%
0.018	11	0.0%
0.022	10	0.0%
0.030	10	0.0%
0.025	10	0.0%
0.034	10	0.0%
0.028	9	0.0%
0.042	9	0.0%
0.048	9	0.0%
0.026	9	0.0%
0.137	9	0.0%
0.040	9	0.0%
0.024	9	0.0%
0.046	8	0.0%
0.033	8	0.0%
0.039	8	0.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
full_name	text	0.0%
neo	categorical	0.0%
pha	categorical	0.3%
e	text	0.0%
a	text	0.0%
i	text	0.0%
per	text	0.0%
H	text	0.0%
diameter	categorical	96.9%
albedo	categorical	97.1%
class	categorical	0.0%
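
The null-rate column above is simply the share of missing values per column. A minimal sketch of that computation, assuming rows are dictionaries with None for missing values (the helper name `null_rates` and the toy rows are hypothetical, not saturn's implementation):

```python
def null_rates(rows, columns):
    """Return {column: fraction of rows whose value is None or absent}."""
    total = len(rows)
    rates = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        rates[col] = missing / total if total else 0.0
    return rates


# Toy rows for illustration only.
rows = [
    {"pha": "N", "diameter": None},
    {"pha": "Y", "diameter": "0.4"},
    {"pha": None, "diameter": None},
    {"pha": "N", "diameter": None},
]
rates = null_rates(rows, ["pha", "diameter"])

# Cross-check against the null counts reported later in this notebook.
diameter_null_pct = round(39579 / 40827 * 100, 1)
pha_null_pct = round(131 / 40827 * 100, 1)
```

The cross-check reuses the null counts from the per-column stats (39,579 for diameter, 131 for pha) and reproduces the 96.9% and 0.3% shown in the table.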

full_name text identifier

Despite the name 'full_name', this column holds asteroid designations rather than personal names: the strings follow the '(YYYY XX)' designation pattern, and the top tokens are all parenthesised years from 2016 to 2025 (e.g. '(2024'). Every one of the 40,827 rows is unique with zero nulls, 99.56% are all-caps, and lengths fall between 16 and 34 characters (95th percentile 19). The year distribution skews recent, with 2024 (1,607) and 2025 (1,594) leading.

Treatment: Treat as a unique identifier key; parse the parenthesised year into a separate numeric feature rather than embedding the raw string.
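
One way to extract the year is a small regular expression. This is a sketch under the assumption that the year always appears as four digits directly after an opening parenthesis, as the top tokens suggest; the sample strings and the name `designation_year` are hypothetical.

```python
import re

# Four digits immediately after "(", e.g. "(2024 AB1)".
YEAR_PATTERN = re.compile(r"\((\d{4})\b")


def designation_year(full_name):
    """Return the parenthesised four-digit year as an int, or None if absent."""
    match = YEAR_PATTERN.search(full_name)
    return int(match.group(1)) if match else None


# Hypothetical designations for illustration.
years = [
    designation_year(name)
    for name in ["(2024 AB1)", "   (2016 XY12)", "Example (no year)"]
]
```

Names that do not match return None, so older or named objects can be inspected separately instead of being silently assigned a year.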

anthropic:claude-opus-4-7 · confidence high

Out[12]:

saturn.columns["full_name"].stats

stat	value
n	40,827
nulls	0 (0.0%)
unique	40,827
len_min	16
len_max	34
len_mean	17.27
len_median	17
len_p95	19
word_mean	8.479
word_median	9
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	12,613
readability_flesch_mean	119.5
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0.9956
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: allcaps	99.6% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 7.

Character-length distribution for full_name.

Character-length distribution for full_name (mean: 17.27251573713474). Empty histogram bins omitted.
chars	count
16	6070
17	21295
18	10737
19	2544
20	4
21	20
22	17
23	28
24	29
25	29
26	17
27	10
28	8
29	9
30	4
31	3
32	1
33	1
34	1

neo categorical metadata

The `neo` column is a categorical flag that takes a single value 'Y' across all 40,827 rows, with zero nulls and entropy of 0.0. Because cardinality is 1 and top_rate is 1.0, this column carries no information and cannot discriminate between records.

Treatment: Drop, constant column.
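
A constant column such as `neo` can be detected generically before dropping it. The sketch below assumes rows are dictionaries; `constant_columns` and `drop_columns` are hypothetical helper names, and the toy rows are for illustration only.

```python
def constant_columns(rows, columns):
    """Return columns with at most one distinct non-null value.

    Note: a column that is entirely null is also reported as constant.
    """
    constant = []
    for col in columns:
        values = {row.get(col) for row in rows if row.get(col) is not None}
        if len(values) <= 1:
            constant.append(col)
    return constant


def drop_columns(rows, to_drop):
    """Return new row dicts without the listed columns."""
    drop = set(to_drop)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


rows = [
    {"neo": "Y", "pha": "N", "class": "APO"},
    {"neo": "Y", "pha": "Y", "class": "AMO"},
    {"neo": "Y", "pha": "N", "class": "APO"},
]
to_drop = constant_columns(rows, ["neo", "pha", "class"])
cleaned = drop_columns(rows, to_drop)
```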

anthropic:claude-opus-4-7 · confidence high

Out[15]:

saturn.columns["neo"].stats

stat	value
n	40,827
nulls	0 (0.0%)
unique	1
top_value	Y
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 8.

Top values for neo.

Top values for neo (1 unique shown, of 1 total).
value	count	share
Y	40827	100.0%

pha categorical label

Binary Y/N flag, almost certainly a 'potentially hazardous asteroid' indicator given the column name 'pha'. The classes are heavily imbalanced: 'N' accounts for 93.77% of non-null rows (93.5% of all rows) against 2,534 'Y' values, with 131 nulls (0.32%). The entropy ratio of 0.34 confirms the skew.

Treatment: Encode as binary target and use class-imbalance handling (stratified splits, class weights, or resampling).
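
The encoding and the stratified split can be sketched with the standard library alone. This is an illustrative sketch, not a recommendation of a specific tool: dropping the null rows is one assumed way to handle the 131 missing labels, and the helper names and the toy label list are hypothetical.

```python
import random
from collections import defaultdict

PHA_TO_INT = {"N": 0, "Y": 1}


def encode_pha(value):
    """Map 'Y'/'N' to 1/0; any other value (including None) becomes None."""
    return PHA_TO_INT.get(value)


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) keeping each label's share in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 90 'N', 10 'Y', 2 missing (dropped before splitting).
raw = ["N"] * 90 + ["Y"] * 10 + [None] * 2
labels = [y for y in (encode_pha(v) for v in raw) if y is not None]
train_idx, test_idx = stratified_split(labels)
```

Splitting per label guarantees that the rare 'Y' class is represented in the test part, which a plain random split does not.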

anthropic:claude-opus-4-7 · confidence high

Out[18]:

saturn.columns["pha"].stats

stat	value
n	40,827
nulls	131 (0.3%)
unique	2
top_value	N
top_rate	0.9377
cardinality	2
entropy	0.3364
entropy_ratio	0.3364
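
The entropy figures above can be reproduced from the value counts. The sketch assumes Shannon entropy in bits, normalised by log2 of the number of observed categories; saturn's exact formula is not documented here, but this assumption matches the reported 0.3364. It also shows why top_rate (0.9377) differs from the 93.5% share in the value table: the former divides by non-null rows, the latter by all rows.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


pha_counts = [38162, 2534]  # N, Y (from the value table in this notebook)
n_rows = 40827              # includes the 131 null rows

h = entropy_bits(pha_counts)
ratio = entropy_ratio(pha_counts)
top_rate_non_null = max(pha_counts) / sum(pha_counts)
top_share_all_rows = max(pha_counts) / n_rows
```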

Fig 9.

Top values for pha.

Top values for pha (2 unique shown, of 2 total).
value	count	share
N	38162	93.5%
Y	2534	6.2%

e text feature

Column 'e' is stored as text, but every value is a fixed 6-character single token, and the top values ('0.5298', '0.5964', '0.4826', ...) are all numeric strings between 0 and 1. Given that this is an asteroid catalog, it is almost certainly the orbital eccentricity serialized as text at fixed precision. With 7,849 unique values across 40,827 rows and a duplicate_rate of 0.808, repetition is heavy but expected for a fixed-precision measurement.

Treatment: Cast to float and use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[21]:

saturn.columns["e"].stats

stat	value
n	40,827
nulls	0 (0.0%)
unique	7,849
len_min	6
len_max	6
len_mean	6
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	32,978
duplicate_rate	0.8077
vocab_size	6,736
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	80.8% duplicate strings
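
The duplicate figures follow directly from the row and unique counts. A minimal sketch, assuming n_duplicates is the number of non-null values minus the number of distinct values and that the rate is taken over all rows (the exact denominator saturn uses is an assumption; for `e` there are no nulls, so both readings agree):

```python
def duplicate_stats(values):
    """Return (unique, n_duplicates, duplicate_rate) for a list of strings."""
    non_null = [v for v in values if v is not None]
    unique = len(set(non_null))
    n_duplicates = len(non_null) - unique
    rate = n_duplicates / len(values) if values else 0.0
    return unique, n_duplicates, rate


# Toy values for illustration only.
stats = duplicate_stats(["0.5298", "0.5964", "0.5298", "0.4826", "0.5298"])

# Cross-check against the reported stats for column e.
reported_duplicates = 40827 - 7849
reported_rate = reported_duplicates / 40827
```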

Fig 10.

Character-length distribution for e.

Character-length distribution for e (mean: 6.0). Empty histogram bins omitted.
chars	count
6	40827

a text feature

Column 'a' is stored as text, but the values are short numeric strings (e.g. '1.299', '1.424'), all single tokens of 1 to 6 characters. In an asteroid catalog this is most likely the orbital semi-major axis. With 4,170 unique values across 40,827 rows and an 89.8% duplicate rate, it behaves like a limited-precision numeric feature mistakenly typed as a string. The 99.9% allcaps flag is a quirk of digit-only strings tripping the case detector and can be ignored.

Treatment: Cast to float and treat as a numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[24]:

saturn.columns["a"].stats

stat	value
n	40,827
nulls	0 (0.0%)
unique	4,170
len_min	1
len_max	6
len_mean	4.969
len_median	5
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	36,657
duplicate_rate	0.8979
vocab_size	3,344
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.9995
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	99.9% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	89.8% duplicate strings

Fig 11.

Character-length distribution for a.

Character-length distribution for a (mean: 4.969113576799667). Empty histogram bins omitted.
chars	count
1	22
3	378
4	3428
5	33988
6	3011

i text feature

Despite being typed as text, column 'i' holds short numeric tokens (length 4-6, all single-word) like '6.07', '2.12', '2.26'. It is almost certainly a decimal numeric feature stored as strings, most likely the orbital inclination. With 40,827 rows but only 4,489 unique values and an 89% duplicate rate, the value space is heavily concentrated. The 'allcaps' flag and Flesch score of 121.22 are artefacts of treating numeric strings as prose and can be ignored.

Treatment: Cast to float and treat as a numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[27]:

saturn.columns["i"].stats

stat	value
n	40,827
nulls	0 (0.0%)
unique	4,489
len_min	4
len_max	6
len_mean	4.428
len_median	4
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	36,338
duplicate_rate	0.89
vocab_size	3,827
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	89.0% duplicate strings

Fig 12.

Character-length distribution for i.

Character-length distribution for i (mean: 4.427707154579077). Empty histogram bins omitted.
chars	count
4	23372
5	17448
6	7

per text feature

Despite being typed as text, every value in `per` is a single short token (word_mean 1.0, len_max 8), and the top values are all numeric strings in scientific notation like '1.13e+03'. This is a numeric measurement, most likely the orbital period, that was stringified during export. With 40,827 rows but only 1,025 unique values and a 97.5% duplicate rate, the values repeat heavily, which is consistent with rounding to about three significant figures rather than with a set of codes. The 64.2% allcaps rate equals the share of 3-character values (26,231 of 40,827), which suggests those are the plain-digit strings while the rest contain a lowercase 'e' exponent; it does not reflect genuine casing.

Treatment: Cast back to numeric (parse the scientific-notation strings to float) before modelling.
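
Python's built-in `float` already accepts this notation, so no custom parser is needed. The sketch below also illustrates one possible explanation for the string lengths of 3, 5, 7, and 8 characters: formatting with three significant figures (`.3g`) produces exactly those shapes. That the export used such a format is an assumption, and the sample values other than '1.13e+03' are hypothetical.

```python
def parse_per(text):
    """Parse a plain or scientific-notation numeric string to float."""
    return float(text)


# '1.13e+03' is a top value quoted in this notebook; the others are hypothetical.
parsed = [parse_per(s) for s in ["365", "1e+03", "1.1e+03", "1.13e+03"]]

# Assumed export format: three significant figures. Nearby values collapse
# to the same string, which would explain the very high duplicate rate.
formatted = [format(x, ".3g") for x in [365.25, 1000.0, 1100.0, 1127.4, 1132.9]]
lengths = sorted({len(s) for s in formatted})
```

If that assumption holds, the parsed floats carry at most three significant digits, which limits the precision of any downstream comparison between periods.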

anthropic:claude-opus-4-7 · confidence high

Out[30]:

saturn.columns["per"].stats

stat	value
n	40,827
nulls	0 (0.0%)
unique	1,025
len_min	3
len_max	8
len_mean	4.747
len_median	3
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	39,802
duplicate_rate	0.9749
vocab_size	985
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.6425
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	64.2% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.5% duplicate strings

Fig 13.

Character-length distribution for per.

Character-length distribution for per (mean: 4.746687241286404). Empty histogram bins omitted.
chars	count
3	26231
5	146
7	1230
8	13220

H text feature

Column H is stored as text, but the values are uniformly short numeric tokens (len_mean 5.0, one_word_rate 1.0) clustered tightly around 24-25 (top values 24.20-25.50). In this catalog H is the absolute magnitude of each asteroid, not a price or weight. With 1,656 uniques across 40,827 rows and a 95.9% duplicate_rate, it is a fixed-precision numeric measurement miscast as a string. The allcaps flag is a false positive driven by digits.

Treatment: Cast to float and treat as a continuous numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[33]:

saturn.columns["H"].stats

stat	value
n	40,827
nulls	3 (0.0%)
unique	1,656
len_min	4
len_max	5
len_mean	5
len_median	5
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	39,168
duplicate_rate	0.9594
vocab_size	1,522
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	95.9% duplicate strings

Fig 14.

Character-length distribution for H.

Character-length distribution for H (mean: 4.999975504605135). Empty histogram bins omitted.
chars	count
4	1
5	40823

diameter categorical feature

This is almost certainly an asteroid diameter measurement stored as strings (e.g. '0.4', '2.3', '0.451') and miscoded as categorical. It is overwhelmingly missing: 96.94% of the 40,827 rows are null, leaving 1,248 populated rows with 924 distinct values. The most common value ('0.4') occurs just 7 times (top_rate 0.0056 of non-null rows), and the entropy_ratio of 0.985 indicates a near-uniform long tail. The mix of one-decimal and three-decimal strings hints at heterogeneous measurement precision across sources.

Treatment: Cast to numeric and either drop given 96.94% nulls or impute with a missingness indicator before modelling.
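
The impute-with-indicator option can be sketched as below. Median imputation is one assumed choice among several, and the helper name `with_missing_flag` and the toy input are hypothetical; with roughly 97% of rows missing, the indicator will carry most of the signal.

```python
from statistics import median


def with_missing_flag(raw_values):
    """Cast strings to float, fill gaps with the median, and flag filled rows."""
    parsed = [float(v) if v not in (None, "") else None for v in raw_values]
    known = [v for v in parsed if v is not None]
    fill = median(known) if known else None
    return [
        {
            "diameter": v if v is not None else fill,
            "diameter_missing": int(v is None),
        }
        for v in parsed
    ]


# Toy input using top values quoted in this notebook plus two missing rows.
rows = with_missing_flag(["0.4", None, "2.3", None, "0.451"])
```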

anthropic:claude-opus-4-7 · confidence high

Out[36]:

saturn.columns["diameter"].stats

stat	value
n	40,827
nulls	39,579 (96.9%)
unique	924
top_value	0.4
top_rate	0.005609
cardinality	924
entropy	9.703
entropy_ratio	0.9849
alert: long_tail	684 singleton categories
alert: null_rate	96.9% null

Fig 15.

Top values for diameter.

Top values for diameter (20 unique shown, of 924 total).
value	count	share
0.4	7	0.0%
2.3	6	0.0%
1.0	5	0.0%
1.4	5	0.0%
0.451	5	0.0%
0.3	4	0.0%
0.9	4	0.0%
0.432	4	0.0%
0.413	4	0.0%
0.344	4	0.0%
0.654	4	0.0%
0.374	4	0.0%
0.228	4	0.0%
0.143	4	0.0%
0.066	4	0.0%
1.5	3	0.0%
4.3	3	0.0%
0.6	3	0.0%
1.8	3	0.0%
0.7	3	0.0%

albedo categorical feature

Likely a geometric/Bond albedo measurement (reflectivity, 0-1 range) stored as a string rather than parsed as numeric, given top values like '0.037', '0.020', '0.031'. Coverage is extremely sparse: 97.05% null, leaving 1,204 populated rows with 437 distinct values, and the modal value appears just 15 times (1.25% of non-null rows). The entropy ratio of 0.954 shows the populated rows are spread almost uniformly across the 437 levels.

Treatment: Cast to float and treat as numeric; given 97% nulls, use only as a sparse feature with missingness indicator or drop.

anthropic:claude-opus-4-7 · confidence high

Out[39]:

saturn.columns["albedo"].stats

stat	value
n	40,827
nulls	39,623 (97.1%)
unique	437
top_value	0.037
top_rate	0.01246
cardinality	437
entropy	8.366
entropy_ratio	0.9538
alert: null_rate	97.1% null

Fig 16.

Top values for albedo.

Top values for albedo (20 unique shown, of 437 total).
value	count	share
0.037	15	0.0%
0.020	15	0.0%
0.031	14	0.0%
0.019	12	0.0%
0.023	12	0.0%
0.018	11	0.0%
0.022	10	0.0%
0.030	10	0.0%
0.025	10	0.0%
0.034	10	0.0%
0.028	9	0.0%
0.042	9	0.0%
0.048	9	0.0%
0.026	9	0.0%
0.137	9	0.0%
0.040	9	0.0%
0.024	9	0.0%
0.046	8	0.0%
0.033	8	0.0%
0.039	8	0.0%

class categorical label

Categorical label with 4 classes across 40,827 rows and no nulls. Distribution is heavily imbalanced: APO accounts for 56.8%, AMO for 35.1%, and ATE for 8.1%, while IEO appears only 38 times (0.1%), a near-absent class that will be hard to learn or evaluate.

Treatment: Use as classification target with class-weighting or resampling to handle the IEO minority class.
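
Class weighting can be sketched with inverse-frequency weights computed from the counts reported for this column. The formula, total / (n_classes * count), is the common "balanced" heuristic (the same one scikit-learn documents for `class_weight="balanced"`); whether it suits a given model is a judgment call, and `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(counts):
    """Return {label: total / (n_classes * count)} for a dict of class counts."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts from the class value table in this notebook.
class_counts = {"APO": 23175, "AMO": 14321, "ATE": 3293, "IEO": 38}
weights = balanced_class_weights(class_counts)

# Each class contributes the same total weight: count * weight = total / n_classes.
per_class_mass = {label: class_counts[label] * w for label, w in weights.items()}
```

With only 38 IEO rows the weight for that class is several hundred times the APO weight, so a handful of mislabelled or atypical IEO rows can dominate training; resampling or merging classes may be worth comparing.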

anthropic:claude-opus-4-7 · confidence high

Out[42]:

saturn.columns["class"].stats

stat	value
n	40,827
nulls	0 (0.0%)
unique	4
top_value	APO
top_rate	0.5676
cardinality	4
entropy	1.296
entropy_ratio	0.6481

Fig 17.

Top values for class.

Top values for class (4 unique shown, of 4 total).
value	count	share
APO	23175	56.8%
AMO	14321	35.1%
ATE	3293	8.1%
IEO	38	0.1%
