quirky asteroids
Reading
This dataset catalogs 40,827 near-Earth objects (asteroids) across 11 columns: an identifier (full_name), orbital elements (a, e, i, per), absolute magnitude (H), physical properties (diameter, albedo), and classification flags (neo, pha, class). Every record has neo='Y', so that column carries no information and can be dropped. The most analytically interesting fields are 'class', where APO dominates at 56.8% followed by AMO at 35.1%, and 'pha' (potentially hazardous), which flags 2,534 objects (about 6.2% of all rows) as 'Y'. Note that 'diameter' and 'albedo' are about 97% null, so any size or reflectivity analysis is limited to roughly 1,200 rows (1,248 and 1,204 non-null values respectively). The columns H, a, e, i, and per are stored as short text rather than numbers and need to be cast to floats before any quantitative work.
citing: row_count · column_count · columns.class.top_values · columns.pha.top_values · columns.neo.top_values · columns.diameter.null_rate · columns.albedo.null_rate · columns.full_name.top_words · columns.H.stats · columns.a.stats
Charts the summary said to look at first
class: value counts
| value | count | share |
|---|---|---|
| APO | 23175 | 56.8% |
| AMO | 14321 | 35.1% |
| ATE | 3293 | 8.1% |
| IEO | 38 | 0.1% |
pha: value counts (share is of all 40,827 rows; the 131 null rows are not listed)
| value | count | share |
|---|---|---|
| N | 38162 | 93.5% |
| Y | 2534 | 6.2% |
H: string length distribution (non-null values)
| chars | count |
|---|---|
| 4 | 1 |
| 5 | 40823 |
full_name: string length distribution
| chars | count |
|---|---|
| 16 | 6070 |
| 17 | 21295 |
| 18 | 10737 |
| 19 | 2544 |
| 20 | 4 |
| 21 | 20 |
| 22 | 17 |
| 23 | 28 |
| 24 | 29 |
| 25 | 29 |
| 26 | 17 |
| 27 | 10 |
| 28 | 8 |
| 29 | 9 |
| 30 | 4 |
| 31 | 3 |
| 32 | 1 |
| 33 | 1 |
| 34 | 1 |
albedo: 20 most frequent values (share is of all 40,827 rows)
| value | count | share |
|---|---|---|
| 0.037 | 15 | 0.0% |
| 0.020 | 15 | 0.0% |
| 0.031 | 14 | 0.0% |
| 0.019 | 12 | 0.0% |
| 0.023 | 12 | 0.0% |
| 0.018 | 11 | 0.0% |
| 0.022 | 10 | 0.0% |
| 0.030 | 10 | 0.0% |
| 0.025 | 10 | 0.0% |
| 0.034 | 10 | 0.0% |
| 0.028 | 9 | 0.0% |
| 0.042 | 9 | 0.0% |
| 0.048 | 9 | 0.0% |
| 0.026 | 9 | 0.0% |
| 0.137 | 9 | 0.0% |
| 0.040 | 9 | 0.0% |
| 0.024 | 9 | 0.0% |
| 0.046 | 8 | 0.0% |
| 0.033 | 8 | 0.0% |
| 0.039 | 8 | 0.0% |
Schema
11 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| full_name | text | 0.0% | 40,827 | near_unique, allcaps, short_text |
| neo | categorical | 0.0% | 1 | imbalance |
| pha | categorical | 0.3% | 2 | |
| e | text | 0.0% | 7,849 | one_word, allcaps, short_text, duplicates |
| a | text | 0.0% | 4,170 | one_word, allcaps, short_text, duplicates |
| i | text | 0.0% | 4,489 | one_word, allcaps, short_text, duplicates |
| per | text | 0.0% | 1,025 | one_word, allcaps, short_text, duplicates |
| H | text | 0.0% | 1,656 | one_word, allcaps, short_text, duplicates |
| diameter | categorical | 96.9% | 924 | long_tail, null_rate |
| albedo | categorical | 97.1% | 437 | null_rate |
| class | categorical | 0.0% | 4 | |
full_name
text · identifier · near_unique · allcaps · short_text
Despite the name 'full_name', this column does not hold personal names. Its top tokens are all parenthesised years from 2016 to 2025 (e.g. '(2024'), which is consistent with asteroid designations whose parenthesised part opens with a year. Every one of the 40,827 rows is unique with zero nulls, 99.56% are all-caps, and lengths fall between 16 and 34 characters (median 17, 95th percentile 19). The year distribution skews recent, with 2024 (1,607) and 2025 (1,594) leading.
Treatment: Treat as a unique identifier key; parse the parenthesised year into a separate numeric feature rather than embedding the raw string.
- n: 40,827
- nulls: 0 (0.0%)
- unique: 40,827
- len_min: 16
- len_max: 34
- len_mean: 17.27
- len_median: 17
- len_p95: 19
- word_mean: 8.479
- word_median: 9
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 12,613
- readability_flesch_mean: 119.5
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0.9956
- boilerplate_rate: 0
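The year-parsing treatment for full_name could look roughly like the sketch below. The helper name `designation_year`, the regular expression, and the sample strings are illustrative assumptions; the samples are made up to resemble the '(2024' tokens described above and are not rows from the dataset.

```python
import re
from typing import Optional

# Assumed pattern: a four-digit year immediately after an opening parenthesis.
YEAR_IN_PARENS = re.compile(r"\((\d{4})\b")

def designation_year(full_name: str) -> Optional[int]:
    """Return the year that follows '(' in the string, or None if absent."""
    match = YEAR_IN_PARENS.search(full_name)
    return int(match.group(1)) if match else None

# Made-up strings shaped like the tokens in the profile (not real rows).
samples = ["   (2024 AB1)", "   (2016 XY12)", "NO YEAR HERE"]
years = [designation_year(s) for s in samples]
```

Returning `None` instead of raising keeps rows without a parenthesised year visible as missing values, so they can be counted before deciding how to handle them.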
neo
categorical · metadata · imbalance
The `neo` column is a categorical flag that takes a single value 'Y' across all 40,827 rows, with zero nulls and entropy of 0.0. Because cardinality is 1 and top_rate is 1.0, this column carries no information and cannot discriminate between records.
Treatment: Drop, constant column.
- n: 40,827
- nulls: 0 (0.0%)
- unique: 1
- top_value: Y
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
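A constant column such as neo can be found mechanically before dropping it. The following is a minimal sketch, assuming rows are held as a list of dictionaries; `constant_columns` and the three sample rows are hypothetical and only illustrate the check.

```python
def constant_columns(rows):
    """Return column names whose non-null values take at most one distinct value."""
    if not rows:
        return []
    constant = []
    for name in rows[0]:
        # Ignore None and empty strings so nulls do not count as a second value.
        distinct = {row.get(name) for row in rows} - {None, ""}
        if len(distinct) <= 1:
            constant.append(name)
    return constant

# Made-up rows for illustration only.
sample_rows = [
    {"neo": "Y", "pha": "N", "class": "APO"},
    {"neo": "Y", "pha": "Y", "class": "AMO"},
    {"neo": "Y", "pha": None, "class": "APO"},
]
to_drop = constant_columns(sample_rows)
```

Note that a column that is entirely null is also reported by this check, since it has zero distinct non-null values.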
pha
categorical · label
Binary Y/N flag, almost certainly a 'potentially hazardous asteroid' indicator given the column name 'pha'. The classes are heavily imbalanced: 'N' covers 93.77% of non-null rows (38,162, or 93.5% of all rows) versus 2,534 'Y' values, and 131 rows (0.32%) are null. The entropy ratio of 0.34 confirms the skew.
Treatment: Encode as a binary target and handle the imbalance with stratified splits, class weights, or resampling.
- n: 40,827
- nulls: 131 (0.3%)
- unique: 2
- top_value: N
- top_rate: 0.9377
- cardinality: 2
- entropy: 0.3364
- entropy_ratio: 0.3364
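The entropy figures above can be reproduced from the value counts. This sketch assumes the profiler uses Shannon entropy in bits over non-null values and divides by log2 of the number of observed categories to get the ratio; that assumption matches the reported 0.3364 for pha, but the profiler's actual implementation is not shown in the report.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

pha_counts = [38162, 2534]          # N, Y (nulls excluded)
pha_entropy = entropy_bits(pha_counts)
pha_ratio = entropy_ratio(pha_counts)
top_rate = max(pha_counts) / sum(pha_counts)
```

For a two-valued column the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for pha.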
e
text · feature · one_word · allcaps · short_text · duplicates
Column 'e' is stored as text, but every value is a fixed 6-character single token, and the top values ('0.5298', '0.5964', '0.4826', ...) are all numeric strings between 0 and 1. This is almost certainly a numeric feature serialized as text; given the other orbital columns, it is most likely orbital eccentricity. With 7,849 unique values across 40,827 rows and a duplicate_rate of 0.808, repetition is heavy but expected for a value rounded to four decimal places.
Treatment: Cast to float and use as a numeric feature.
- n: 40,827
- nulls: 0 (0.0%)
- unique: 7,849
- len_min: 6
- len_max: 6
- len_mean: 6
- len_median: 6
- len_p95: 6
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 32,978
- duplicate_rate: 0.8077
- vocab_size: 6,736
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
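The cast-to-float treatment, which also applies to a, i, and H, might be sketched as follows. The helper `to_float` is hypothetical, the first three sample strings are the top values quoted above, and the last two are made-up bad inputs added to show the fallback. The range check assumes the values should lie in [0, 1), as the quoted values do.

```python
import math
from typing import Optional

def to_float(text: Optional[str]) -> Optional[float]:
    """Parse a numeric string; return None for missing or unparseable input."""
    if text is None:
        return None
    stripped = text.strip()
    if not stripped:
        return None
    try:
        value = float(stripped)
    except ValueError:
        return None
    # Reject 'nan' and 'inf', which float() accepts but are not usable values.
    return value if math.isfinite(value) else None

raw_e = ["0.5298", "0.5964", "0.4826", "", "n/a"]
parsed = [to_float(v) for v in raw_e]
in_range = all(0.0 <= v < 1.0 for v in parsed if v is not None)
```

Mapping failures to `None` rather than raising makes it easy to count how many values fail to parse before committing to the cast.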
a
text · feature · one_word · allcaps · short_text · duplicates
Column 'a' is stored as text, but the values are short numeric strings (e.g., '1.299', '1.424'), all single tokens of length 1 to 6. Alongside the other orbital columns, it is most likely the semi-major axis. With 4,170 unique values across 40,827 rows and an 89.8% duplicate rate, it behaves like a rounded numeric feature mistyped as a string. The 99.95% allcaps rate is a quirk of digit-only strings tripping the case detector and can be ignored.
Treatment: Cast to float and treat as a numeric feature.
- n: 40,827
- nulls: 0 (0.0%)
- unique: 4,170
- len_min: 1
- len_max: 6
- len_mean: 4.969
- len_median: 5
- len_p95: 6
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 36,657
- duplicate_rate: 0.8979
- vocab_size: 3,344
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.9995
- boilerplate_rate: 0
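The duplicate figures follow directly from the row and unique counts. The sketch below assumes n_duplicates is defined as non-null rows minus distinct values, which reproduces the reported numbers for 'a'; the function name `duplicate_stats` is illustrative.

```python
def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate) over the non-null values."""
    present = [v for v in values if v is not None]
    n_duplicates = len(present) - len(set(present))
    rate = n_duplicates / len(present) if present else 0.0
    return n_duplicates, rate

# Same arithmetic applied to the reported counts for column 'a'.
n_rows, n_unique = 40827, 4170
n_dup_a = n_rows - n_unique
rate_a = n_dup_a / n_rows

# Tiny made-up list to exercise the function itself.
demo = duplicate_stats(["1.299", "1.424", "1.299", None])
```

A high duplicate rate here is a side effect of rounding many objects to the same few digits, not a sign of repeated records.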
i
text · feature · one_word · allcaps · short_text · duplicates
Despite being typed as text, column 'i' holds short numeric tokens (length 4 to 6, all single-word) like '6.07', '2.12', '2.26'. It is almost certainly a decimal numeric feature stored as strings, most likely orbital inclination given the other orbital columns. With 40,827 rows but only 4,489 unique values and an 89% duplicate rate, the value space is heavily concentrated. The 'allcaps' flag and Flesch score of 121.22 are artefacts of treating numeric strings as prose and can be ignored.
Treatment: Cast to float and treat as a numeric feature.
- n: 40,827
- nulls: 0 (0.0%)
- unique: 4,489
- len_min: 4
- len_max: 6
- len_mean: 4.428
- len_median: 4
- len_p95: 5
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 36,338
- duplicate_rate: 0.89
- vocab_size: 3,827
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
per
text · feature · one_word · allcaps · short_text · duplicates
Despite being typed as text, every value in `per` is a single short token (word_mean 1.0, len_max 8), and the top values are all numeric strings in scientific notation like '1.13e+03'. This suggests a numeric measurement, most likely the orbital period, that was stringified during export. With 40,827 rows but only 1,025 unique values and a 97.5% duplicate rate, the repetition most plausibly reflects rounding to about three significant figures rather than a coded field. The 64% allcaps rate is an artefact rather than genuine casing: digit-only strings likely trip the case detector, while strings containing the lowercase 'e' of an exponent do not.
Treatment: Cast back to numeric (parse the scientific-notation strings to float) before modelling.
- n: 40,827
- nulls: 0 (0.0%)
- unique: 1,025
- len_min: 3
- len_max: 8
- len_mean: 4.747
- len_median: 3
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 39,802
- duplicate_rate: 0.9749
- vocab_size: 985
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.6425
- boilerplate_rate: 0
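Python's built-in `float()` already accepts scientific notation, so parsing `per` needs no special handling. The sketch below also shows one way an all-caps detector could produce a rate below 100% on this column. The detector `looks_allcaps` is a guess at the profiler's behaviour, not its actual code, and the tokens '365' and '812' are made-up examples of three-character values.

```python
def looks_allcaps(token: str) -> bool:
    # Hypothetical detector: all-caps means upper-casing changes nothing.
    # Digit-only strings pass trivially; a lowercase 'e' exponent fails.
    return token == token.upper()

tokens = ["365", "812", "1.13e+03"]

# float() parses plain decimals and scientific notation alike.
values = [float(t) for t in tokens]

flags = [looks_allcaps(t) for t in tokens]
allcaps_rate = sum(flags) / len(flags)
```

Under this assumed detector, the share of values written with an exponent would be roughly one minus the allcaps rate.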
H
text · feature · one_word · allcaps · short_text · duplicates
Column H is stored as text, but the values are uniformly short numeric tokens (len_mean 4.999, one_word_rate 1.0), and the most frequent values fall between 24.20 and 25.50. With 1,656 unique values across 40,827 rows and a 95.9% duplicate_rate, this looks like a rounded numeric measurement miscast as a string; for an asteroid catalog, H is most likely absolute magnitude. The allcaps flag is a false positive driven by digits.
Treatment: Cast to float and treat as a continuous numeric feature.
- n: 40,827
- nulls: 3 (0.0%)
- unique: 1,656
- len_min: 4
- len_max: 5
- len_mean: 5
- len_median: 5
- len_p95: 5
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 39,168
- duplicate_rate: 0.9594
- vocab_size: 1,522
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
diameter
categorical · feature · long_tail · null_rate
This is almost certainly an object diameter measurement stored as strings (e.g. '0.4', '2.3', '0.451') and miscoded as categorical. It is overwhelmingly missing: 96.94% of the 40,827 rows are null, leaving 1,248 populated rows. Those rows hold 924 distinct values, with the most common ('0.4') occurring just 7 times (top_rate 0.0056 of non-null rows) and an entropy_ratio of 0.985 indicating a near-uniform long tail. The mix of one-decimal and three-decimal strings hints at heterogeneous measurement precision across sources.
Treatment: Cast to numeric and either drop given 96.94% nulls or impute with a missingness indicator before modelling.
- n: 40,827
- nulls: 39,579 (96.9%)
- unique: 924
- top_value: 0.4
- top_rate: 0.005609
- cardinality: 924
- entropy: 9.703
- entropy_ratio: 0.9849
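If the column is kept, the cast and the missingness indicator can be produced in one pass. This is a minimal sketch; `with_missing_flag` is a hypothetical helper, and the sample list mixes the three values quoted above with made-up nulls.

```python
def with_missing_flag(raw_values):
    """Return (value, is_missing) pairs; is_missing is 1 when no number is available."""
    pairs = []
    for raw in raw_values:
        text = (raw or "").strip()
        try:
            pairs.append((float(text), 0))
        except ValueError:
            # Empty or unparseable strings are treated as missing.
            pairs.append((None, 1))
    return pairs

raw_diameter = ["0.4", None, "2.3", "", "0.451"]
pairs = with_missing_flag(raw_diameter)
missing_rate = sum(flag for _, flag in pairs) / len(pairs)
```

The flag column preserves the fact that a value was absent even if the numeric column is later filled with an imputed value.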
albedo
categorical · feature · null_rate
Likely a geometric/Bond albedo measurement (reflectivity, 0-1 range) stored as strings rather than parsed as numeric, given top values like '0.037', '0.020', '0.031'. Coverage is extremely sparse: 97.05% of the 40,827 rows are null, leaving 1,204 populated rows with 437 distinct values, and the modal value appears just 15 times (1.25% of non-null rows). The entropy ratio of 0.954 shows the populated values are spread almost uniformly across the 437 levels.
Treatment: Cast to float and treat as numeric; given 97% nulls, use only as a sparse feature with a missingness indicator, or drop.
- n: 40,827
- nulls: 39,623 (97.1%)
- unique: 437
- top_value: 0.037
- top_rate: 0.01246
- cardinality: 437
- entropy: 8.366
- entropy_ratio: 0.9538
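The top_rate of 0.01246 and the 0.0% share shown for the same value in the albedo value table differ only in their denominator. The arithmetic below uses the counts reported in this section; the variable names are illustrative.

```python
n_rows = 40827      # all rows
n_null = 39623      # rows with no albedo
top_count = 15      # occurrences of the modal value '0.037'

non_null = n_rows - n_null

# top_rate in the stats list: share of populated rows.
share_of_non_null = top_count / non_null

# 'share' column in the value table: share of all rows, shown to one decimal.
share_of_all = top_count / n_rows
share_label = f"{share_of_all:.1%}"
```

Any comparison of value frequencies in a sparse column should state which of the two denominators is in use.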
class
categorical · label
Categorical label with 4 classes across 40,827 rows and no nulls. The distribution is heavily imbalanced: APO accounts for 56.8%, AMO for 35.1%, and ATE for 8.1%, while IEO appears only 38 times (0.1%), a near-absent class that will be hard to learn or evaluate.
Treatment: Use as a classification target with class weighting or resampling to handle the IEO minority class.
- n: 40,827
- nulls: 0 (0.0%)
- unique: 4
- top_value: APO
- top_rate: 0.5676
- cardinality: 4
- entropy: 1.296
- entropy_ratio: 0.6481
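One common way to implement the class-weighting treatment is inverse-frequency weights of the form total / (number of classes × class count), the same heuristic scikit-learn uses for its "balanced" class weights. The sketch below applies it to the counts from the class value table; whether this weighting suits a particular model is a separate question, and the function name is illustrative.

```python
class_counts = {"APO": 23175, "AMO": 14321, "ATE": 3293, "IEO": 38}

def balanced_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count) per class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

weights = balanced_weights(class_counts)

# Sanity check: weighted counts sum back to the number of rows.
weighted_total = sum(weights[label] * n for label, n in class_counts.items())
```

With only 38 IEO rows, the IEO weight comes out several hundred times larger than the APO weight, so a handful of rows would dominate the loss; resampling or merging the class may be worth comparing against weighting.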