
quirky asteroids

source: /home/coolhand/html/datavis/data_trove/data/quirky/asteroids.json · 40,827 rows · 11 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 40,827 Near-Earth Objects (asteroids) across 11 columns: an identifier (full_name), orbital elements (a, e, i, per), absolute magnitude (H), physical properties (diameter, albedo), and classification flags (neo, pha, class). Every record has neo='Y', so that column carries no information and can be dropped. The most analytically interesting fields are 'class', where APO dominates at 56.8% followed by AMO at 35.1%, and 'pha' (potentially hazardous), which flags 2,534 objects (about 6.2%) as 'Y'. Note that 'diameter' and 'albedo' are ~97% null (1,248 and 1,204 populated rows respectively), so any size or reflectivity analysis will be limited to roughly 1,200 rows. The columns e, a, i, per, and H are stored as short text rather than numbers, and diameter and albedo are stored as categorical strings; all seven need to be cast to floats before any quantitative work.

citing: row_count · column_count · columns.class.top_values · columns.pha.top_values · columns.neo.top_values · columns.diameter.null_rate · columns.albedo.null_rate · columns.full_name.top_words · columns.H.stats · columns.a.stats
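The cast recommended in the summary could look roughly like the following. This is a minimal sketch using only the Python standard library. The helper names `to_float` and `cast_record` and the sample row are hypothetical, and the sketch assumes each record is available as a plain dict, since the on-disk layout of asteroids.json is not described in this report.

```python
import math

# Columns the report identifies as numeric values stored as strings.
NUMERIC_COLUMNS = ("H", "a", "e", "i", "per", "diameter", "albedo")

def to_float(value):
    """Return value as a float, or None if it is missing or unparseable."""
    if value is None:
        return None
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    # Treat a literal "nan" string as missing as well.
    return None if math.isnan(number) else number

def cast_record(record):
    """Copy one row, converting the numeric-as-text columns to floats."""
    cleaned = dict(record)
    for column in NUMERIC_COLUMNS:
        cleaned[column] = to_float(record.get(column))
    return cleaned

# Hypothetical row shaped like the profiled columns (values are illustrative).
row = {"full_name": "(2024 AB1)", "neo": "Y", "pha": "N", "e": "0.5298",
       "a": "1.299", "i": "6.07", "per": "1.13e+03", "H": "24.20",
       "diameter": None, "albedo": None, "class": "APO"}
print(cast_record(row))
```

Returning None for unparseable values keeps bad strings visible as missing data instead of raising partway through a 40,827-row load.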

Schema

11 columns
Per-column summary.

| column | type | null rate | unique | alerts |
|---|---|---|---|---|
| full_name | text | 0.0% | 40,827 | near_unique, allcaps, short_text |
| neo | categorical | 0.0% | 1 | imbalance |
| pha | categorical | 0.3% | 2 | (none) |
| e | text | 0.0% | 7,849 | one_word, allcaps, short_text, duplicates |
| a | text | 0.0% | 4,170 | one_word, allcaps, short_text, duplicates |
| i | text | 0.0% | 4,489 | one_word, allcaps, short_text, duplicates |
| per | text | 0.0% | 1,025 | one_word, allcaps, short_text, duplicates |
| H | text | 0.0% | 1,656 | one_word, allcaps, short_text, duplicates |
| diameter | categorical | 96.9% | 924 | long_tail, null_rate |
| albedo | categorical | 97.1% | 437 | null_rate |
| class | categorical | 0.0% | 4 | (none) |

full_name

text identifier near_unique allcaps short_text
Despite the name 'full_name', this column does not hold personal names. Given that the dataset is an asteroid catalog, the values are most likely object designations that include a parenthesised year: the top tokens all have the form '(2024' and cover 2016-2025. All 40,827 values are unique with zero nulls, 99.56% are all-caps, and lengths fall between 16 and 34 characters (median 17, 95th percentile 19). The year distribution skews recent, with 2024 (1,607) and 2025 (1,594) leading. Treatment: Treat as a unique identifier key; parse the parenthesised year into a separate numeric feature rather than embedding the raw string. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 0 (0.0%)
unique: 40,827
len_min: 16
len_max: 34
len_mean: 17.27
len_median: 17
len_p95: 19
word_mean: 8.479
word_median: 9
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 12,613
readability_flesch_mean: 119.5
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0.9956
boilerplate_rate: 0
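One way to pull the parenthesised year out of `full_name` is sketched below. The exact string layout is not shown in this report, so the pattern only assumes what the top tokens suggest: an opening parenthesis immediately followed by a four-digit year. The sample strings and the `extract_year` helper are hypothetical.

```python
import re

# An opening parenthesis followed by exactly four digits, e.g. "(2024".
YEAR_PATTERN = re.compile(r"\((\d{4})\b")

def extract_year(full_name):
    """Return the parenthesised four-digit year as an int, or None if absent."""
    match = YEAR_PATTERN.search(full_name)
    return int(match.group(1)) if match else None

# Illustrative inputs only; real values may be padded or formatted differently.
for name in ["(2024 AB1)", "   (2016 XY12)", "no year here"]:
    print(repr(name), "->", extract_year(name))
```

Names without a matching token yield None, so rows that do not follow the assumed layout stay identifiable after parsing.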

neo

categorical metadata imbalance
The `neo` column is a categorical flag that takes a single value 'Y' across all 40,827 rows, with zero nulls and entropy of 0.0. Because cardinality is 1 and top_rate is 1.0, this column carries no information and cannot discriminate between records. Treatment: Drop, constant column. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 0 (0.0%)
unique: 1
top_value: Y
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
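Detecting and dropping constant columns such as `neo` can be sketched as follows. The helpers `constant_columns` and `drop_columns` are hypothetical names, the rows are illustrative, and the sketch assumes every row is a dict with hashable values.

```python
def constant_columns(rows):
    """Return the names of columns that take exactly one distinct value."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            seen.setdefault(column, set()).add(value)
    return sorted(name for name, values in seen.items() if len(values) == 1)

def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    to_drop = set(columns)
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]

# Illustrative rows: neo is always 'Y', pha varies.
rows = [
    {"neo": "Y", "pha": "N", "class": "APO"},
    {"neo": "Y", "pha": "Y", "class": "AMO"},
    {"neo": "Y", "pha": "N", "class": "APO"},
]
constant = constant_columns(rows)
print(constant)
print(drop_columns(rows, constant))
```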

pha

categorical label
Binary Y/N flag, almost certainly a 'potentially hazardous asteroid' indicator given the column name 'pha'. The class is heavily imbalanced: 'N' covers 93.77% of the non-null rows versus 2,534 'Y' values, and 131 rows (0.32%) are null. The entropy ratio of 0.34, where 1.0 would indicate a perfectly balanced flag, confirms the skew. Treatment: Encode as binary target and use class-imbalance handling (stratified splits, class weights, or resampling). high · anthropic:claude-opus-4-7
n: 40,827
nulls: 131 (0.3%)
unique: 2
top_value: N
top_rate: 0.9377
cardinality: 2
entropy: 0.3364
entropy_ratio: 0.3364
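The reported top_rate and entropy figures can be reproduced from the counts. The sketch below assumes the profiler uses Shannon entropy in bits, normalises it by log2 of the cardinality, and computes rates over non-null rows; those assumptions are consistent with the numbers above but are not stated in the report. The 'N' count is derived as 40,827 - 131 - 2,534, and the function names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    if len(counts) < 2:
        return 0.0  # a constant column carries no information
    return entropy_bits(counts) / math.log2(len(counts))

n_rows, n_null, n_yes = 40827, 131, 2534
n_no = n_rows - n_null - n_yes  # derived, not reported directly
pha_counts = [n_no, n_yes]

print("top_rate:", round(n_no / (n_no + n_yes), 4))
print("entropy_ratio:", round(entropy_ratio(pha_counts), 4))
```

With two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for `pha`.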

e

text feature one_word allcaps short_text duplicates
Column 'e' is stored as text, but every value is a fixed 6-character single token, and the top values ('0.5298', '0.5964', '0.4826', ...) are all numeric strings between 0 and 1. In an asteroid catalog 'e' is conventionally the orbital eccentricity, serialized here as text with four decimal places. With 7,849 unique values across 40,827 rows and a duplicate_rate of 0.808, repetition is heavy but expected for a value rounded to four decimals. The allcaps flag is an artefact of digit-only strings and can be ignored. Treatment: Cast to float and use as a numeric feature. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 0 (0.0%)
unique: 7,849
len_min: 6
len_max: 6
len_mean: 6
len_median: 6
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 32,978
duplicate_rate: 0.8077
vocab_size: 6,736
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
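The duplicate figures for the numeric-as-text columns follow from the row and unique counts. The sketch below assumes the profiler counts every row beyond the first occurrence of a value as a duplicate (rows minus distinct values), which matches the reported n_duplicates for e, a, i, and per. The function name is hypothetical.

```python
def duplicate_stats(n_rows, n_unique):
    """Return (n_duplicates, duplicate_rate), taking duplicates = rows - distinct."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, n_duplicates / n_rows

# (rows, unique) pairs copied from the per-column stats in this report.
REPORTED = {
    "e": (40827, 7849),
    "a": (40827, 4170),
    "i": (40827, 4489),
    "per": (40827, 1025),
}

for column, (n_rows, n_unique) in REPORTED.items():
    n_dup, rate = duplicate_stats(n_rows, n_unique)
    print(f"{column}: n_duplicates={n_dup}, duplicate_rate={rate:.4f}")
```

A high duplicate rate here says more about rounding precision than about data quality: fewer printed digits means fewer distinct strings.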

a

text feature one_word allcaps short_text duplicates
Column 'a' is stored as text, but the values are short numeric strings (e.g., '1.299', '1.424'), all single tokens of length 1-6. In an asteroid catalog 'a' is conventionally the semi-major axis of the orbit. With 4,170 unique values across 40,827 rows and an 89.8% duplicate rate, it behaves like a low-precision numeric feature typed as a string. The 99.95% allcaps rate is a quirk of digit-only strings tripping the case detector and can be ignored. Treatment: Cast to float and treat as a numeric feature. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 0 (0.0%)
unique: 4,170
len_min: 1
len_max: 6
len_mean: 4.969
len_median: 5
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 36,657
duplicate_rate: 0.8979
vocab_size: 3,344
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.9995
boilerplate_rate: 0

i

text feature one_word allcaps short_text duplicates
Despite being typed as text, column 'i' holds short numeric tokens (length 4-6, all single-word) like '6.07', '2.12', '2.26'. In an asteroid catalog 'i' is conventionally the orbital inclination, here stored as decimal strings. With 40,827 rows but only 4,489 unique values and an 89% duplicate rate, the values repeat heavily, as expected at two-decimal precision. The 'allcaps' flag and the Flesch score of 121.2 are artefacts of treating numeric strings as prose and can be ignored. Treatment: Cast to float and treat as a numeric feature. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 0 (0.0%)
unique: 4,489
len_min: 4
len_max: 6
len_mean: 4.428
len_median: 4
len_p95: 5
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 36,338
duplicate_rate: 0.89
vocab_size: 3,827
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

per

text feature one_word allcaps short_text duplicates
Despite being typed as text, every value in `per` is a single short token (word_mean 1.0, len_max 8), and the top values are numeric strings in scientific notation like '1.13e+03'. This points to a numeric measurement, conventionally the orbital period in an asteroid catalog, that was stringified during export. With 40,827 rows but only 1,025 unique values and a 97.5% duplicate rate, the low cardinality is best explained by rounding to about three significant digits rather than by a small set of codes. The 64% allcaps rate corresponds to values written without an exponent (the median length is 3 characters); the remaining ~36% contain the lowercase 'e' of the exponent and so fail the all-caps test. Treatment: Cast back to numeric (parse the scientific-notation strings to float) before modelling. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 0 (0.0%)
unique: 1,025
len_min: 3
len_max: 8
len_mean: 4.747
len_median: 3
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 39,802
duplicate_rate: 0.9749
vocab_size: 985
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.6425
boilerplate_rate: 0
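Python's built-in `float()` parses both the plain and the scientific-notation strings, so no custom parser is needed. The sketch below also shows why strings like '1.13e+03' are consistent with a three-significant-digit export. That the original export used such a format is an assumption inferred from the string shapes, not something the report confirms. The helper names and input values are hypothetical.

```python
def parse_period(text):
    """Parse a period string such as '412' or '1.13e+03' into a float."""
    return float(text.strip())

def three_sig_figs(value):
    """Format a number with three significant digits, as the export may have."""
    return format(value, ".3g")

# Parsing: both notations yield ordinary floats.
print(parse_period("412"), parse_period("1.13e+03"))

# Round trip: nearby values collapse onto the same string, which would
# explain the low unique count. The inputs here are illustrative.
for value in (1127.6, 1131.2, 412.3):
    print(value, "->", three_sig_figs(value))
```

If the export did round this way, the original precision cannot be recovered from this file. Values at or above 1,000 are then only known to the nearest 10.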

H

text feature one_word allcaps short_text duplicates
Column H is stored as text, but the values are uniformly short numeric tokens (length 4-5, one_word_rate 1.0), and the most frequent values fall between 24.20 and 25.50. In an asteroid catalog H is conventionally the absolute magnitude, a brightness measure, so this is a numeric measurement serialized as a string. With 1,656 unique values across 40,827 rows and a 95.9% duplicate_rate, the values are quantised to two decimals. Three rows are null. The allcaps flag is a false positive driven by digits. Treatment: Cast to float and treat as a continuous numeric feature. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 3 (0.0%)
unique: 1,656
len_min: 4
len_max: 5
len_mean: 5
len_median: 5
len_p95: 5
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 39,168
duplicate_rate: 0.9594
vocab_size: 1,522
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

diameter

categorical feature long_tail null_rate
This is almost certainly an asteroid diameter measurement stored as strings (e.g. '0.4', '2.3', '0.451') and miscoded as categorical. It is overwhelmingly missing: 96.94% of the 40,827 rows are null, leaving 1,248 populated rows. Those rows hold 924 distinct values, the most common ('0.4') occurs just 7 times (top_rate 0.0056 of non-null rows), and the entropy_ratio of 0.985 indicates a near-uniform long tail. The mix of one-decimal and three-decimal strings hints at heterogeneous measurement precision across sources. Treatment: Cast to numeric and either drop given 96.94% nulls or impute with a missingness indicator before modelling. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 39,579 (96.9%)
unique: 924
top_value: 0.4
top_rate: 0.005609
cardinality: 924
entropy: 9.703
entropy_ratio: 0.9849
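The "missingness indicator" option could be implemented along these lines. This is a minimal sketch: the helper name, the `_missing` suffix, and the sample rows are hypothetical, and imputation itself is deliberately left out because the report does not prescribe a fill value.

```python
def add_missing_indicator(rows, column):
    """Cast `column` to float and add a boolean `<column>_missing` flag."""
    result = []
    for row in rows:
        raw = row.get(column)
        try:
            value = float(raw) if raw is not None else None
        except (TypeError, ValueError):
            value = None  # unparseable strings are treated as missing
        new_row = dict(row)
        new_row[column] = value
        new_row[f"{column}_missing"] = value is None
        result.append(new_row)
    return result

# Illustrative rows: most diameters are null, as in the profiled data.
rows = [{"diameter": "0.4"}, {"diameter": None}, {"diameter": "0.451"}, {}]
for row in add_missing_indicator(rows, "diameter"):
    print(row)
```

The same helper applies unchanged to `albedo`, which has a nearly identical null rate.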

albedo

categorical feature null_rate
Likely an albedo measurement (reflectivity on a 0-1 scale) stored as a string rather than a parsed number, given top values like '0.037', '0.020', '0.031'. Coverage is extremely sparse: 97.05% of the 40,827 rows are null, leaving 1,204 populated rows with 437 distinct values, and the modal value appears just 15 times (1.25% of non-null rows). The entropy ratio of 0.954 shows that the populated rows are spread almost uniformly across the 437 levels. Treatment: Cast to float and treat as numeric; given 97% nulls, use only as a sparse feature with a missingness indicator, or drop. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 39,623 (97.1%)
unique: 437
top_value: 0.037
top_rate: 0.01246
cardinality: 437
entropy: 8.366
entropy_ratio: 0.9538

class

categorical label
Categorical label with 4 classes across 40,827 rows and no nulls. The distribution is heavily imbalanced: APO accounts for 56.8% and AMO for most of the remainder, while IEO appears only 38 times (under 0.1% of rows), a near-absent class that will be hard to learn or evaluate. Treatment: Use as classification target with class-weighting or resampling to handle the IEO minority class. high · anthropic:claude-opus-4-7
n: 40,827
nulls: 0 (0.0%)
unique: 4
top_value: APO
top_rate: 0.5676
cardinality: 4
entropy: 1.296
entropy_ratio: 0.6481
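Class weighting for the imbalanced `class` label could be sketched with the common inverse-frequency heuristic, weight = n / (k * count), which is the same formula scikit-learn uses for its "balanced" class weights. The report does not prescribe a specific scheme, so this choice is an assumption. The label list below is a toy example, not the real class counts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {label: n / (k * count)} so that rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}

# Toy label distribution, for illustration only.
labels = ["APO"] * 6 + ["AMO"] * 3 + ["IEO"] * 1
weights = balanced_class_weights(labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

With only 38 IEO rows in the real data, this formula would give that class a very large weight. Capping the weights or merging IEO into another class may be worth considering before training.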