quirky asteroids

source: /home/coolhand/html/datavis/data_trove/data/quirky/asteroids.json · 40,827 rows · 11 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 40,827 near-Earth asteroids across 11 columns that mix orbital and brightness parameters (a, e, i, per, H), physical properties (diameter, albedo), and classification flags (neo, pha, class). Every record has neo='Y', so that column carries no information and can be ignored. The most analytically interesting fields are 'class', where APO dominates at 56.8% followed by AMO at 35.1%, and 'pha' (potentially hazardous), which flags 2,534 objects (about 6.2%) as 'Y'. 'diameter' and 'albedo' are about 97% null, so any size or reflectivity analysis is limited to roughly 1,200 rows. The H, a, e, i, and per columns are stored as short text rather than numbers, so they need to be cast to floats before any quantitative work.
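The cleanup steps named in the summary (ignore `neo`, cast the text-typed numeric columns) can be sketched as follows. This is a minimal sketch using only the standard library; it assumes each row is a flat JSON object keyed by column name, which the report does not state, and the sample record is made up for illustration. With pandas available, `pd.to_numeric(..., errors="coerce")` would be a common alternative for the cast.

```python
import math

# Columns the summary says are stored as text but hold numbers.
NUMERIC_COLUMNS = ("H", "a", "e", "i", "per")


def to_float(value):
    """Return value as a float, or None if it is missing or unparseable."""
    if value is None:
        return None
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    return None if math.isnan(number) else number


def clean_record(record):
    """Drop the constant `neo` flag and cast the numeric text columns."""
    cleaned = {key: value for key, value in record.items() if key != "neo"}
    for column in NUMERIC_COLUMNS:
        if column in cleaned:
            cleaned[column] = to_float(cleaned[column])
    return cleaned


# Hypothetical record, invented for illustration (not taken from the file).
sample = {
    "full_name": "(2024 AB1)", "neo": "Y", "pha": "N",
    "H": "24.50", "a": "1.234", "e": "0.3456", "i": "5.67", "per": "501",
    "diameter": None, "albedo": None, "class": "APO",
}
cleaned = clean_record(sample)
```

Returning `None` for unparseable values keeps bad cells visible as missing data instead of raising partway through a 40,827-row load.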

citing: row_count · column_count · columns.class.top_values · columns.pha.top_values · columns.neo.top_values · columns.diameter.null_rate · columns.albedo.null_rate · columns.full_name.top_words · columns.H.stats · columns.a.stats

Charts the summary said to look at first

class · Orbit class distribution: APO and AMO together account for over 90% of the catalog.

Top values for class (4 unique shown, of 4 total).
value	count	share
APO	23175	56.8%
AMO	14321	35.1%
ATE	3293	8.1%
IEO	38	0.1%

pha · Potentially hazardous flag — about 6% of asteroids are marked 'Y'.

Top values for pha (2 unique shown, of 2 total).
value	count	share
N	38162	93.5%
Y	2534	6.2%

H · Absolute magnitude (H) values cluster tightly around 24–25.5; cast to numeric to see the full shape.

Character-length distribution for H (mean: 4.99998).
chars	count
4	1
5	40823

full_name · Name length is fairly uniform (16–19 chars), reflecting the standard '(YYYY XX)' designation format.

Character-length distribution for full_name (mean: 17.27). Empty histogram bins are omitted.
chars	count
16	6070
17	21295
18	10737
19	2544
20	4
21	20
22	17
23	28
24	29
25	29
26	17
27	10
28	8
29	9
30	4
31	3
32	1
33	1
34	1

albedo · Top albedo values among the ~3% of rows that have one — useful for spotting dark vs reflective bodies.

Top values for albedo (20 unique shown, of 437 total).
value	count	share
0.037	15	0.0%
0.020	15	0.0%
0.031	14	0.0%
0.019	12	0.0%
0.023	12	0.0%
0.018	11	0.0%
0.022	10	0.0%
0.030	10	0.0%
0.025	10	0.0%
0.034	10	0.0%
0.028	9	0.0%
0.042	9	0.0%
0.048	9	0.0%
0.026	9	0.0%
0.137	9	0.0%
0.040	9	0.0%
0.024	9	0.0%
0.046	8	0.0%
0.033	8	0.0%
0.039	8	0.0%

Schema

11 columns

Per-column summary.
column	type	null rate	unique	alerts
full_name	text	0.0%	40,827	near_unique allcaps short_text
neo	categorical	0.0%	1	imbalance
pha	categorical	0.3%	2
e	text	0.0%	7,849	one_word allcaps short_text duplicates
a	text	0.0%	4,170	one_word allcaps short_text duplicates
i	text	0.0%	4,489	one_word allcaps short_text duplicates
per	text	0.0%	1,025	one_word allcaps short_text duplicates
H	text	0.0%	1,656	one_word allcaps short_text duplicates
diameter	categorical	96.9%	924	long_tail null_rate
albedo	categorical	97.1%	437	null_rate
class	categorical	0.0%	4
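The "roughly 1,200 rows" figure for size and reflectivity work follows from the null rates in this table. A small arithmetic sketch, assuming the rounded rates shown here (so the results are approximate, not exact non-null counts):

```python
TOTAL_ROWS = 40_827

# Null rates as shown (rounded) in the schema table.
null_rates = {"diameter": 0.969, "albedo": 0.971}

# Approximate number of rows that actually carry a value.
usable_rows = {
    column: round(TOTAL_ROWS * (1 - rate))
    for column, rate in null_rates.items()
}

for column, rows in usable_rows.items():
    print(f"{column}: about {rows} non-null rows")
```

Both estimates land close to the 1,200 rows quoted in the summary; an analysis that needs both columns at once can use at most the smaller of the two.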

full_name

text identifier near_unique allcaps short_text

Despite the name 'full_name', this column holds asteroid designation strings that contain a parenthesised year (tokens such as '(2024'), not personal names; the top tokens are all parenthesised years from 2016-2025. Every one of the 40,827 rows is unique with zero nulls, 99.56% are all-caps, and lengths are bounded between 16 and 34 characters, with 95% at 19 or fewer. The year distribution skews recent, with 2024 (1607) and 2025 (1594) leading. Treatment: Treat as a unique designation key; parse the parenthesised year into a separate numeric feature rather than embedding the raw string. high · anthropic:claude-opus-4-7

n: 40,827
nulls: 0 (0.0%)
unique: 40,827
len_min: 16
len_max: 34
len_mean: 17.27
len_median: 17
len_p95: 19
word_mean: 8.479
word_median: 9
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 12,613
readability_flesch_mean: 119.5
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0.9956
boilerplate_rate: 0
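
The year extraction suggested in the treatment could look like the sketch below. It assumes the year appears as four digits immediately after an opening parenthesis, as in the '(2024' tokens reported above; the function name and sample strings are hypothetical.

```python
import re

# Four digits right after "(", ending at a word boundary, e.g. "(2024 AB1)".
YEAR_PATTERN = re.compile(r"\((\d{4})\b")


def designation_year(full_name):
    """Return the parenthesised year as an int, or None if none is found."""
    match = YEAR_PATTERN.search(full_name)
    return int(match.group(1)) if match else None


# Hypothetical inputs, invented for illustration.
examples = ["(2024 AB1)", "   (2016 XY)", "no year here"]
years = [designation_year(name) for name in examples]
```

Names that do not match return `None` rather than raising, so the rate of unparsed rows can be checked before the feature is used.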

neo

categorical metadata imbalance

The `neo` column is a categorical flag that takes a single value 'Y' across all 40,827 rows, with zero nulls and entropy of 0.0. Because cardinality is 1 and top_rate is 1.0, this column carries no information and cannot discriminate between records. Treatment: Drop, constant column. high · anthropic:claude-opus-4-7

n: 40,827
nulls: 0 (0.0%)
unique: 1
top_value: Y
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
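
A constant column like `neo` can be detected mechanically instead of by inspection. The sketch below assumes rows are dictionaries keyed by column name; the sample rows are invented for illustration.

```python
def constant_columns(records):
    """Return names of columns whose non-null values are all identical.

    Columns that are null in every record never enter `seen`, so they
    are not reported here.
    """
    seen = {}
    for record in records:
        for column, value in record.items():
            if value is not None:
                seen.setdefault(column, set()).add(value)
    return sorted(name for name, values in seen.items() if len(values) == 1)


# Hypothetical rows, invented for illustration.
rows = [
    {"neo": "Y", "pha": "N", "class": "APO"},
    {"neo": "Y", "pha": "Y", "class": "AMO"},
    {"neo": "Y", "pha": None, "class": "APO"},
]
to_drop = constant_columns(rows)
```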

pha

categorical label

Binary Y/N flag, almost certainly a 'potentially hazardous asteroid' indicator given the column name 'pha'. The class is heavily imbalanced: 'N' covers 93.77% of non-null rows (93.5% of all rows) versus 2,534 'Y' values, and 131 rows (0.32%) are null. The entropy ratio of 0.34 confirms the skew. Treatment: Encode as binary target and use class-imbalance handling (stratified splits, class weights, or resampling). high · anthropic:claude-opus-4-7

n: 40,827
nulls: 131 (0.3%)
unique: 2
top_value: N
top_rate: 0.9377
cardinality: 2
entropy: 0.3364
entropy_ratio: 0.3364
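
The `top_rate` and `entropy` figures can be reproduced from the counts in the pha table, which also makes clear that both are computed over non-null rows. The sketch below additionally derives "balanced" class weights using the formula n_samples / (n_classes * count); that weighting scheme is an assumption chosen for illustration, not something the report prescribes.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits for a collection of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts from the pha value table; the 131 null rows are excluded.
pha_counts = {"N": 38162, "Y": 2534}
non_null = sum(pha_counts.values())

top_rate = pha_counts["N"] / non_null
entropy = entropy_bits(pha_counts.values())
# Dividing by log2(number of classes) normalises entropy to [0, 1].
entropy_ratio = entropy / math.log2(len(pha_counts))

# Assumed "balanced" weighting: rarer classes get proportionally more weight.
class_weights = {
    label: non_null / (len(pha_counts) * count)
    for label, count in pha_counts.items()
}
```

With two classes the normalising factor is log2(2) = 1, which is why `entropy` and `entropy_ratio` are identical in the stats above.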