deepsky ngc
Reading
This is an astronomical catalog of 13,969 deep-sky objects (NGC.csv) with 32 columns covering identifiers, sky coordinates, magnitudes across multiple bands, morphological classifications, and kinematic measurements such as radial velocity and redshift. The catalog is dominated by galaxies: 75% of entries are type 'G', with smaller populations of open clusters, duplicate entries, stars, and planetary nebulae. Object morphology (Hubble type) and constellation distribution are the most informative descriptive fields, while RadVel and Redshift show a distance distribution skewed toward nearby objects (median z ≈ 0.016). Many columns are very sparsely populated: parallax (Pax), proper motions, and central-star magnitudes are more than 92% null, so any analysis of those fields is limited to a small subset. Size measurements (MajAx, MinAx) are extremely right-skewed with heavy outliers, meaning a few very large objects dominate the upper tail.
citing: Type · Hubble · Const · Redshift · RadVel · V-Mag · Pax · MajAx · B-Mag
Charts the summary said to look at first
Type: counts and share of all rows
| value | count | share |
|---|---|---|
| G | 10481 | 75.0% |
| OCl | 652 | 4.7% |
| Dup | 651 | 4.7% |
| * | 546 | 3.9% |
| Other | 419 | 3.0% |
| ** | 243 | 1.7% |
| GPair | 231 | 1.7% |
| GCl | 204 | 1.5% |
| PN | 130 | 0.9% |
| Neb | 94 | 0.7% |
| HII | 82 | 0.6% |
| Cl+N | 67 | 0.5% |
| *Ass | 62 | 0.4% |
| RfN | 38 | 0.3% |
| GTrpl | 26 | 0.2% |
| SNR | 11 | 0.1% |
| GGroup | 11 | 0.1% |
| NonEx | 10 | 0.1% |
| EmN | 8 | 0.1% |
| Nova | 3 | 0.0% |
Hubble: top 20 values, counts and share of all rows
| value | count | share |
|---|---|---|
| E | 1546 | 11.1% |
| S0 | 1073 | 7.7% |
| S0-a | 996 | 7.1% |
| Sc | 948 | 6.8% |
| Sb | 758 | 5.4% |
| Sbc | 727 | 5.2% |
| E-S0 | 623 | 4.5% |
| Sa | 478 | 3.4% |
| Sab | 461 | 3.3% |
| SBb | 323 | 2.3% |
| SABc | 281 | 2.0% |
| SBc | 272 | 1.9% |
| SBbc | 269 | 1.9% |
| SABb | 255 | 1.8% |
| SBa | 186 | 1.3% |
| SBab | 160 | 1.1% |
| SABa | 122 | 0.9% |
| I | 104 | 0.7% |
| Scd | 103 | 0.7% |
| Sd | 74 | 0.5% |
Const: top 20 values, counts and share of all rows
| value | count | share |
|---|---|---|
| Vir | 1236 | 8.8% |
| Com | 1044 | 7.5% |
| Leo | 876 | 6.3% |
| Cet | 686 | 4.9% |
| UMa | 543 | 3.9% |
| Boo | 528 | 3.8% |
| CVn | 518 | 3.7% |
| Psc | 490 | 3.5% |
| Peg | 462 | 3.3% |
| Eri | 457 | 3.3% |
| Dra | 380 | 2.7% |
| Hya | 376 | 2.7% |
| Cnc | 354 | 2.5% |
| Dor | 341 | 2.4% |
| Her | 340 | 2.4% |
| Pav | 289 | 2.1% |
| Aqr | 277 | 2.0% |
| Cen | 231 | 1.7% |
| And | 209 | 1.5% |
| Per | 163 | 1.2% |
Redshift: histogram, 40 equal-width bins
| bin | count |
|---|---|
| -0.00161 – 0.003221 | 1111 |
| 0.003221 – 0.008051 | 1535 |
| 0.008051 – 0.01288 | 1344 |
| 0.01288 – 0.01771 | 1763 |
| 0.01771 – 0.02254 | 1330 |
| 0.02254 – 0.02737 | 1167 |
| 0.02737 – 0.0322 | 844 |
| 0.0322 – 0.03704 | 526 |
| 0.03704 – 0.04187 | 328 |
| 0.04187 – 0.0467 | 175 |
| 0.0467 – 0.05153 | 96 |
| 0.05153 – 0.05636 | 92 |
| 0.05636 – 0.06119 | 51 |
| 0.06119 – 0.06602 | 77 |
| 0.06602 – 0.07085 | 46 |
| 0.07085 – 0.07568 | 33 |
| 0.07568 – 0.08051 | 18 |
| 0.08051 – 0.08534 | 26 |
| 0.08534 – 0.09017 | 6 |
| 0.09017 – 0.095 | 8 |
| 0.095 – 0.09983 | 1 |
| 0.09983 – 0.1047 | 1 |
| 0.1047 – 0.1095 | 0 |
| 0.1095 – 0.1143 | 0 |
| 0.1143 – 0.1192 | 2 |
| 0.1192 – 0.124 | 1 |
| 0.124 – 0.1288 | 0 |
| 0.1288 – 0.1336 | 0 |
| 0.1336 – 0.1385 | 0 |
| 0.1385 – 0.1433 | 0 |
| 0.1433 – 0.1481 | 0 |
| 0.1481 – 0.153 | 0 |
| 0.153 – 0.1578 | 0 |
| 0.1578 – 0.1626 | 0 |
| 0.1626 – 0.1675 | 0 |
| 0.1675 – 0.1723 | 0 |
| 0.1723 – 0.1771 | 0 |
| 0.1771 – 0.182 | 1 |
| 0.182 – 0.1868 | 0 |
| 0.1868 – 0.1916 | 1 |
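The bin edges in the histogram above look like 40 equal-width bins spanning the column's reported minimum and maximum (-0.00161 to 0.191616). The following is a minimal sketch under that assumption; the function name `equal_width_edges` is illustrative, and the binning rule is inferred from the table rather than documented by the profiler.

```python
def equal_width_edges(lo: float, hi: float, n_bins: int) -> list[float]:
    """Return n_bins + 1 equally spaced bin edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


# Redshift min/max as reported in the column profile. The bin count of 40
# is inferred from the number of histogram rows, not stated by the tool.
edges = equal_width_edges(-0.00161, 0.191616, 40)
print([round(e, 6) for e in edges[:3]])
```

The first few edges can be compared against the left column of the table to check the assumption.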
B-Mag: histogram, 40 equal-width bins
| bin | count |
|---|---|
| 1.51 – 1.998 | 1 |
| 1.998 – 2.485 | 0 |
| 2.485 – 2.973 | 3 |
| 2.973 – 3.46 | 4 |
| 3.46 – 3.947 | 3 |
| 3.947 – 4.435 | 7 |
| 4.435 – 4.923 | 13 |
| 4.923 – 5.41 | 12 |
| 5.41 – 5.897 | 9 |
| 5.897 – 6.385 | 28 |
| 6.385 – 6.872 | 19 |
| 6.872 – 7.36 | 45 |
| 7.36 – 7.847 | 37 |
| 7.847 – 8.335 | 40 |
| 8.335 – 8.822 | 42 |
| 8.822 – 9.31 | 49 |
| 9.31 – 9.797 | 52 |
| 9.797 – 10.29 | 73 |
| 10.29 – 10.77 | 92 |
| 10.77 – 11.26 | 133 |
| 11.26 – 11.75 | 219 |
| 11.75 – 12.23 | 322 |
| 12.23 – 12.72 | 475 |
| 12.72 – 13.21 | 762 |
| 13.21 – 13.7 | 1038 |
| 13.7 – 14.18 | 1374 |
| 14.18 – 14.67 | 1735 |
| 14.67 – 15.16 | 1765 |
| 15.16 – 15.65 | 1468 |
| 15.65 – 16.14 | 678 |
| 16.14 – 16.62 | 392 |
| 16.62 – 17.11 | 236 |
| 17.11 – 17.6 | 126 |
| 17.6 – 18.09 | 48 |
| 18.09 – 18.57 | 19 |
| 18.57 – 19.06 | 8 |
| 19.06 – 19.55 | 6 |
| 19.55 – 20.04 | 0 |
| 20.04 – 20.52 | 0 |
| 20.52 – 21.01 | 2 |
Schema
32 columns

| Column | Kind | Null % | Unique | Alerts |
|---|---|---|---|---|
| Name | text | 0.0% | 13,969 | near_unique, one_word, allcaps, short_text |
| Type | categorical | 0.0% | 20 | |
| RA | unknown | 0.0% | — | skipped |
| Dec | text | 0.1% | 13,282 | near_unique, one_word, allcaps, short_text |
| Const | categorical | 0.1% | 89 | |
| MajAx | numeric | 14.1% | 734 | high_skew, outliers |
| MinAx | numeric | 20.9% | 465 | null_rate, high_skew, outliers |
| PosAng | numeric | 23.2% | 181 | null_rate |
| B-Mag | numeric | 18.9% | 1,056 | outliers |
| V-Mag | numeric | 69.8% | 774 | null_rate, outliers |
| J-Mag | numeric | 30.9% | 804 | null_rate |
| H-Mag | numeric | 30.6% | 831 | null_rate |
| K-Mag | numeric | 30.9% | 823 | null_rate |
| SurfBr | numeric | 26.8% | 438 | null_rate |
| Hubble | categorical | 27.3% | 30 | null_rate |
| Pax | numeric | 94.8% | 676 | null_rate, high_skew, outliers |
| Pm-RA | numeric | 92.5% | 954 | null_rate, high_skew, outliers |
| Pm-Dec | numeric | 92.5% | 961 | null_rate, outliers |
| RadVel | numeric | 24.2% | 6,691 | null_rate |
| Redshift | numeric | 24.2% | 7,717 | null_rate |
| Cstar U-Mag | numeric | 99.9% | 16 | null_rate |
| Cstar B-Mag | numeric | 99.2% | 97 | null_rate |
| Cstar V-Mag | numeric | 99.3% | 82 | null_rate |
| M | categorical | 99.2% | 107 | long_tail, null_rate |
| NGC | categorical | 93.5% | 891 | long_tail, null_rate |
| IC | categorical | 96.7% | 452 | long_tail, null_rate |
| Cstar Names | categorical | 99.4% | 87 | long_tail, null_rate |
| Identifiers | text | 12.8% | 12,179 | near_unique, allcaps |
| Common names | categorical | 99.1% | 127 | long_tail, null_rate |
| NED notes | text | 83.6% | 1,198 | multilingual, null_rate, duplicates |
| OpenNGC notes | categorical | 98.5% | 159 | long_tail, null_rate |
| Sources | categorical | 0.0% | 344 | long_tail |
Name
text · identifier · near_unique · one_word · allcaps · short_text
This appears to be an astronomical object designation field: short, all-caps codes such as 'IC'-prefixed catalog numbers. All 13,969 rows are unique with zero nulls, lengths are tightly bounded between 6 and 13 characters, and 97.4% are single-word tokens. The tokens 'NED01' and 'NED02' recur 175 and 168 times, which looks odd next to n_unique=13,969 but is consistent with them being component words of the roughly 2.6% multi-word names, counted at the word level rather than as full duplicates.
Treatment: Treat as a primary key; drop from modelling features and use only for joins or lookups.
- n: 13,969
- nulls: 0 (0.0%)
- unique: 13,969
- len_min: 6
- len_max: 13
- len_mean: 6.784
- len_median: 7
- len_p95: 8
- word_mean: 1.026
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 13,607
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9738
- allcaps_rate: 1
- boilerplate_rate: 0
Type
categorical · label
A categorical type code with 20 distinct values across 13,969 rows and no nulls. The distribution is highly imbalanced: 'G' alone accounts for 75.0% of records (10,481), with the next categories ('OCl', 'Dup', '*', 'Other') each below 5%, yielding an entropy ratio of just 0.38. The codes ('G', 'OCl', 'GPair', 'GCl', 'PN', 'Neb') suggest astronomical object classifications (galaxies, open/globular clusters, planetary nebulae, nebulae).
Treatment: Group rare categories, or stratify on Type so the dominant 'G' class does not swamp the rest, before any classification task.
- n: 13,969
- nulls: 0 (0.0%)
- unique: 20
- top_value: G
- top_rate: 0.7503
- cardinality: 20
- entropy: 1.646
- entropy_ratio: 0.3808
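The reported entropy and entropy ratio for Type appear to be Shannon entropy in bits, normalised by log2 of the number of categories. That definition is inferred from the numbers, not documented by the profiler. A short sketch that recomputes both from the Type counts listed in this report:

```python
import math

# Counts copied from the Type value table in this report.
type_counts = {
    "G": 10481, "OCl": 652, "Dup": 651, "*": 546, "Other": 419,
    "**": 243, "GPair": 231, "GCl": 204, "PN": 130, "Neb": 94,
    "HII": 82, "Cl+N": 67, "*Ass": 62, "RfN": 38, "GTrpl": 26,
    "SNR": 11, "GGroup": 11, "NonEx": 10, "EmN": 8, "Nova": 3,
}


def entropy_bits(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


h = entropy_bits(list(type_counts.values()))
# Assumed normalisation: divide by the maximum entropy for this cardinality.
ratio = h / math.log2(len(type_counts))
print(f"entropy={h:.3f} bits, ratio={ratio:.4f}")
```

A ratio near 1 would mean the categories are close to evenly used; the low value here reflects the dominance of 'G'.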
RA
unknown · other · skipped
The column is named RA, which in astronomical datasets typically denotes Right Ascension, but saturn skipped profiling, so kind and uniqueness are unresolved. The only confirmed signals are 13,969 rows with a 0.0 null rate; no distributional statistics are available.
Treatment: Re-profile with an explicit type cast before deciding on use.
- n: 13,969
- nulls: 0 (0.0%)
- unique: —
Dec
text · feature · near_unique · one_word · allcaps · short_text
This column holds astronomical Declination coordinates formatted as signed sexagesimal strings (e.g. '-19:28:17.6'), with every value 9 to 11 characters long and exactly one token. It is near-unique (13,282 distinct of 13,969) yet shows 680 duplicates (4.87%). These may reflect the 651 'Dup' rows in Type sharing coordinates with their originals, or distinct objects that happen to lie at the same declination; the profile alone cannot tell which. The 'allcaps' and Flesch=121 signals are artefacts of the numeric format, not real prose.
Treatment: Parse the sexagesimal string into signed decimal degrees before any numeric use.
- n: 13,969
- nulls: 7 (0.1%)
- unique: 13,282
- len_min: 9
- len_max: 11
- len_mean: 11
- len_median: 11
- len_p95: 11
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 680
- duplicate_rate: 0.0487
- vocab_size: 13,282
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
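The Dec treatment above calls for parsing the sexagesimal string into signed decimal degrees. A minimal sketch, assuming every non-null value follows the `sDD:MM:SS.s` pattern of the example quoted in the profile; the function name `dec_to_degrees` is illustrative.

```python
def dec_to_degrees(dec: str) -> float:
    """Convert a signed sexagesimal declination 'sDD:MM:SS.s' to decimal degrees."""
    text = dec.strip()
    # Take the sign from the string itself so that '-00:30:00' stays negative;
    # int('-00') would lose the sign.
    sign = -1.0 if text.startswith("-") else 1.0
    degrees, minutes, seconds = text.lstrip("+-").split(":")
    return sign * (int(degrees) + int(minutes) / 60 + float(seconds) / 3600)


print(dec_to_degrees("-19:28:17.6"))
```

Null cells would need to be filtered out before calling this, since the function expects a well-formed string.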
Const
categorical · feature
This column holds three-letter constellation abbreviations (Vir, Com, Leo, Cet, UMa…), with 89 distinct values across 13,969 rows. That is one more than the 88 IAU constellations; OpenNGC is known to split Serpens into 'Se1' and 'Se2', which would account for the extra code, though this should be confirmed against the data. Distribution is fairly even (entropy ratio 0.85), though Virgo leads at 8.85% and the top three constellations account for roughly a fifth of records. Nulls are negligible (0.05%).
Treatment: One-hot or target-encode as a categorical feature.
- n: 13,969
- nulls: 7 (0.1%)
- unique: 89
- top_value: Vir
- top_rate: 0.08853
- cardinality: 89
- entropy: 5.489
- entropy_ratio: 0.8476
MajAx
numeric · feature · high_skew · outliers
MajAx is a numeric measurement (likely a major-axis length, e.g. of a galaxy or ellipse) with a tight central distribution (median 1.2, IQR 0.82 to 1.87) but an extreme right tail reaching 299.92. Skew of 21.89 and kurtosis of 641.85 are both severe, and 10.36% of non-null values flag as outliers. The null rate is also non-trivial at 14.06%.
Treatment: Log-transform and impute missing values before modelling; consider winsorising the upper tail.
- n: 13,969
- nulls: 1,964 (14.1%)
- unique: 734
- min: 0.02
- max: 299.9
- mean: 2.145
- median: 1.2
- std: 6.789
- q1: 0.82
- q3: 1.87
- iqr: 1.05
- skew: 21.89
- kurtosis: 641.8
- n_outliers: 1,244
- outlier_rate: 0.1036
- zero_rate: 0
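The MajAx treatment suggests winsorising the upper tail and log-transforming. One way to sketch that uses a Tukey upper fence (q3 + 1.5 × IQR) as the cap. Whether the profiler uses the same rule to count outliers is not stated, so the fence is an assumption here, and both function names are illustrative.

```python
import math


def tukey_upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper outlier fence q3 + k * IQR."""
    return q3 + k * (q3 - q1)


def winsorise_then_log(value: float, cap: float) -> float:
    """Clip value at cap, then take the natural log (value must be > 0)."""
    return math.log(min(value, cap))


# Quartiles taken from the MajAx profile; the reported minimum is 0.02,
# so the log is defined for every non-null value.
cap = tukey_upper_fence(0.82, 1.87)
print(round(cap, 3))
print(winsorise_then_log(299.92, cap), winsorise_then_log(1.2, cap))
```

The same pattern would apply to MinAx with its own quartiles. A percentile-based cap is an equally reasonable choice if the fence clips too many rows.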
MinAx
numeric · feature · null_rate · high_skew · outliers
MinAx is a numeric measurement (likely a minor-axis dimension) with a tight central mass (median 0.69, IQR 0.45 to 1.07) but an extreme right tail stretching to 179.89. Skew of 26.82 and kurtosis near 981 are extraordinary, and 760 outliers (6.88% of non-null values) sit alongside a 20.91% null rate. Having only 465 distinct values across 13,969 rows suggests rounding or a discretised scale.
Treatment: Impute or flag the 20.91% nulls and apply a log transform before modelling to tame the heavy right tail.
- n: 13,969
- nulls: 2,921 (20.9%)
- unique: 465
- min: 0.02
- max: 179.9
- mean: 1.113
- median: 0.69
- std: 3.738
- q1: 0.45
- q3: 1.07
- iqr: 0.62
- skew: 26.82
- kurtosis: 981.5
- n_outliers: 760
- outlier_rate: 0.06879
- zero_rate: 0
PosAng
numeric · feature · null_rate
PosAng is a numeric column bounded between 0 and 180 with mean 87.27 and median 87, consistent with a position angle in degrees (a common astronomical measurement). The distribution is nearly symmetric (skew 0.047) and platykurtic (kurtosis -1.21), spread broadly across the range with IQR 40 to 133 and no outliers. Notable: 23.17% of rows are null, and despite n=13,969 there are only 181 unique values, suggesting integer-degree quantisation.
Treatment: Treat as a circular feature. Because the values span 0 to 180, the orientation is presumably defined modulo 180°, so encode it as sin/cos of twice the angle rather than of the angle itself, and impute or flag the 23% missingness before modelling.
- n: 13,969
- nulls: 3,236 (23.2%)
- unique: 181
- min: 0
- max: 180
- mean: 87.27
- median: 87
- std: 52.68
- q1: 40
- q3: 133
- iqr: 93
- skew: 0.04737
- kurtosis: -1.212
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.008572
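Since PosAng is bounded by 0 and 180, it is plausibly an axial quantity in which 0° and 180° describe the same orientation. Under that assumption, a sin/cos encoding should use the doubled angle so both ends of the range map to the same point. A minimal sketch, with an illustrative function name:

```python
import math


def encode_axial_angle(pos_angle_deg: float) -> tuple[float, float]:
    """Map an angle with period 180 degrees to (cos, sin) of the doubled angle."""
    doubled = math.radians(2.0 * pos_angle_deg)
    return math.cos(doubled), math.sin(doubled)


for angle in (0.0, 45.0, 90.0, 180.0):
    print(angle, encode_axial_angle(angle))
```

With a plain sin/cos of the raw angle, 0° and 180° would land on opposite sides of the circle, which would misrepresent two identical orientations as maximally different.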
B-Mag
numeric · feature · outliers
B-Mag is a numeric photometric measurement, almost certainly the B-band apparent magnitude of astronomical sources, with values concentrated between 13.42 and 15.2 (median 14.42). The distribution is strongly left-skewed (skew -1.69, kurtosis 5.58) with a long bright-end tail down to 1.51 (smaller magnitudes are brighter) and 572 outliers (5.05% of non-null values). Roughly 18.86% of rows are null, so a sizeable share of objects lack a B measurement.
Treatment: Impute or flag the 18.86% missing values and consider winsorising the bright-end tail before modelling.
- n: 13,969
- nulls: 2,634 (18.9%)
- unique: 1,056
- min: 1.51
- max: 21.01
- mean: 14.12
- median: 14.42
- std: 1.833
- q1: 13.42
- q3: 15.2
- iqr: 1.78
- skew: -1.692
- kurtosis: 5.576
- n_outliers: 572
- outlier_rate: 0.05046
- zero_rate: 0
V-Mag
numeric · feature · null_rate · outliers
V-Mag is a numeric column almost certainly recording visual (apparent) magnitude of astronomical objects, with values spanning 1.69 to 20.41 and a median of 12.38 consistent with that scale. The distribution is left-skewed (skew -1.22) with 323 outliers (7.66% of non-null values) on the bright end, and 69.83% of rows are null, so usable coverage is limited to roughly 30% of the catalogue. The interquartile band is tight (11.31 to 13.28, IQR 1.97) around faint magnitudes.
Treatment: Impute or flag the 69.83% missing values before modelling; consider keeping the raw scale since magnitudes are already logarithmic.
- n: 13,969
- nulls: 9,755 (69.8%)
- unique: 774
- min: 1.69
- max: 20.41
- mean: 12.04
- median: 12.38
- std: 2.092
- q1: 11.31
- q3: 13.28
- iqr: 1.97
- skew: -1.223
- kurtosis: 2.67
- n_outliers: 323
- outlier_rate: 0.07665
- zero_rate: 0
J-Mag
numeric · feature · null_rate
This is the J-band magnitude (near-infrared photometry, ~1.25 μm) for catalogued sources, with values centered near 11.42 and spanning 1.11 to 17.02. The distribution is mildly left-skewed (-0.55) with modest excess kurtosis (2.51) and 333 outliers (3.45% of non-null values), consistent with a mix of bright and faint sources. A notable concern is the 30.9% null rate, meaning nearly a third of rows lack a J-Mag measurement.
Treatment: Impute or flag missing values before modelling; consider robust scaling given the left skew and outliers.
- n: 13,969
- nulls: 4,317 (30.9%)
- unique: 804
- min: 1.11
- max: 17.02
- mean: 11.37
- median: 11.42
- std: 1.355
- q1: 10.63
- q3: 12.17
- iqr: 1.54
- skew: -0.5524
- kurtosis: 2.512
- n_outliers: 333
- outlier_rate: 0.0345
- zero_rate: 0
H-Mag
numeric · feature · null_rate
H-Mag is a numeric photometric measurement, most plausibly the H-band near-infrared apparent magnitude (~1.65 μm) that sits between the catalog's J-Mag and K-Mag columns, ranging from 0.83 to 16.67 with a mean of 10.70 and median 10.74. The distribution is mildly left-skewed (-0.45) with light tails (kurtosis 2.18) and a tight IQR of 1.55, indicating most objects cluster around magnitude 10 to 11.5. The notable concern is a 30.6% null rate, plus 341 low-end outliers (3.5% of non-null values) representing unusually bright objects.
Treatment: Impute or flag the 30.6% missing values before modelling; consider keeping the raw scale since skew is mild.
- n: 13,969
- nulls: 4,276 (30.6%)
- unique: 831
- min: 0.83
- max: 16.67
- mean: 10.7
- median: 10.74
- std: 1.368
- q1: 9.95
- q3: 11.5
- iqr: 1.55
- skew: -0.4515
- kurtosis: 2.179
- n_outliers: 341
- outlier_rate: 0.03518
- zero_rate: 0
K-Mag
numeric · feature · null_rate
Numeric K-band magnitude readings (likely 2MASS K-mag photometry), centered around a median of 10.46 and ranging from 0.72 to 15.76. The distribution is mildly left-skewed (-0.45) with 315 outliers (3.3% of non-null values) and a notable 30.9% null rate, suggesting many sources lack K-band coverage. Spread is tight (std 1.36, IQR 1.57) across 823 unique values.
Treatment: Impute or flag the 31% missing values before modelling; no transform needed given the near-symmetric spread.
- n: 13,969
- nulls: 4,317 (30.9%)
- unique: 823
- min: 0.72
- max: 15.76
- mean: 10.41
- median: 10.46
- std: 1.361
- q1: 9.66
- q3: 11.23
- iqr: 1.57
- skew: -0.4513
- kurtosis: 2.166
- n_outliers: 315
- outlier_rate: 0.03264
- zero_rate: 0
SurfBr
numeric · feature · null_rate
SurfBr is a numeric measurement tightly clustered around 23.31 (median 23.33) with a narrow IQR of 0.71 and standard deviation 0.61, consistent with a surface brightness magnitude. The distribution is mildly left-skewed (-0.30) with elevated kurtosis (2.65) and 288 outliers (2.8% of non-null values), and roughly 27% of rows are null, which is a notable coverage gap.
Treatment: Impute or flag the 27% missing values before modelling; no transform needed given the tight, near-symmetric spread.
- n: 13,969
- nulls: 3,742 (26.8%)
- unique: 438
- min: 18.36
- max: 28.48
- mean: 23.31
- median: 23.33
- std: 0.61
- q1: 22.97
- q3: 23.68
- iqr: 0.71
- skew: -0.3014
- kurtosis: 2.65
- n_outliers: 288
- outlier_rate: 0.02816
- zero_rate: 0
Hubble
categorical · label · null_rate
Hubble appears to be the Hubble morphological classification of galaxies, with familiar types like E (elliptical), S0 (lenticular), and the spiral sequence Sa/Sb/Sc dominating. Across 13,969 rows it has 30 distinct codes and a high entropy ratio (0.83), so the type distribution is fairly spread rather than concentrated: the top value E accounts for only 15.2% of non-null values (11.1% of all rows). 27.26% of rows are null, but this is measured over all rows, and about a quarter of the catalog is not Type 'G'. With 10,161 rows carrying a Hubble code against 10,481 'G' rows, most of the nulls probably fall on non-galaxy objects rather than on unlabelled galaxies; a Type-by-null cross-tab is needed to confirm.
Treatment: Treat as a categorical label; restrict to galaxy rows first, then impute or filter any remaining nulls before class-conditional analysis.
- n: 13,969
- nulls: 3,808 (27.3%)
- unique: 30
- top_value: E
- top_rate: 0.1522
- cardinality: 30
- entropy: 4.096
- entropy_ratio: 0.8348
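The Hubble null rate is measured over all rows, while a Hubble type only applies to galaxies, so it helps to break the null rate down by Type before deciding how to treat missing values. The sketch below assumes the rows are available as dictionaries keyed by column name. The helper `null_rate_by_group` and the toy rows are illustrative and are not taken from the catalog.

```python
from collections import defaultdict


def null_rate_by_group(rows, group_key, value_key):
    """Return {group: share of rows where value_key is missing (None or '')}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(value_key) in (None, ""):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Toy rows for illustration only; these are not real catalog entries.
rows = [
    {"Type": "G", "Hubble": "E"},
    {"Type": "G", "Hubble": ""},
    {"Type": "OCl", "Hubble": ""},
    {"Type": "PN", "Hubble": None},
]
rates = null_rate_by_group(rows, "Type", "Hubble")
print(rates)
```

On the real data, a low null rate for 'G' together with rates near 1.0 for the other types would indicate the missingness is structural rather than a data-quality gap.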
Pax
numeric · feature · null_rate · high_skew · outliers
Pax is a sparse numeric measurement, identified in the catalog summary as parallax (presumably in milliarcseconds). It is populated for only ~5% of rows (null_rate 0.9485), with values ranging from 0.003 to 22.8 and a median of 0.4829. The distribution is severely right-skewed (skew 7.21, kurtosis 68.87) with 65 outliers among the 720 non-null values, and the mean (0.919) sits well above the median, indicating a long upper tail. With 676 unique values and no zeros, this is a continuous quantity measured only for a small subpopulation.
Treatment: Log-transform and add a missingness indicator before modelling, given the 94.85% null rate and heavy right skew.
- n: 13,969
- nulls: 13,249 (94.8%)
- unique: 676
- min: 0.003
- max: 22.8
- mean: 0.9192
- median: 0.4829
- std: 1.76
- q1: 0.2517
- q3: 0.9338
- iqr: 0.6821
- skew: 7.206
- kurtosis: 68.87
- n_outliers: 65
- outlier_rate: 0.09028
- zero_rate: 0
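The Pax treatment pairs a log transform with a missingness indicator. A minimal sketch of that pairing follows. It assumes missing cells arrive as `None` and relies on the reported minimum of 0.003, so every present value is positive. The function name and the 0.0 placeholder are illustrative choices.

```python
import math
from typing import Optional


def log_with_missing_flag(value: Optional[float]) -> tuple[float, int]:
    """Return (log(value), 0) for a present value, or (0.0, 1) when missing."""
    if value is None:
        # The placeholder 0.0 is arbitrary; the flag tells the model to ignore it.
        return 0.0, 1
    return math.log(value), 0


median_feature = log_with_missing_flag(0.4829)  # median from the Pax profile
missing_feature = log_with_missing_flag(None)
print(median_feature, missing_feature)
```

With about 95% of rows missing, the indicator column is likely to carry more signal than the transformed value itself.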
Pm-RA
numeric · feature · null_rate · high_skew · outliers
Pm-RA is almost certainly a proper-motion-in-right-ascension measurement (mas/yr) for catalog objects, with values centered slightly negative (median -0.9015, mean -1.374) and a tight IQR of 3.54. Two things stand out: 92.53% of rows are null, suggesting this is only populated for a subset of sources, and the distribution has heavy tails (skew 2.66, kurtosis 50.19) with extremes from -43.57 to 76.0 and a 6.8% outlier rate among non-null values.
Treatment: Impute or flag the 92.53% missingness explicitly and winsorize or robust-scale before modelling.
- n: 13,969
- nulls: 12,925 (92.5%)
- unique: 954
- min: -43.57
- max: 76
- mean: -1.374
- median: -0.9015
- std: 5.826
- q1: -3.103
- q3: 0.4365
- iqr: 3.54
- skew: 2.664
- kurtosis: 50.19
- n_outliers: 71
- outlier_rate: 0.06801
- zero_rate: 0.0009579
Pm-Dec
numeric · feature · null_rate · outliers
Pm-Dec is almost certainly proper motion in declination (mas/yr), a sparse astrometric feature with 92.53% nulls and only 1,044 populated rows out of 13,969. Values center near zero (median -1.088, mean -1.402) but span -58.961 to 64.0 with std 5.9, kurtosis 43.8, and 74 outliers (7.09% of non-null), indicating heavy tails typical of high-proper-motion sources.
Treatment: Impute or mask the 92.53% missing values before modelling, and consider a robust scaler given the heavy tails.
- n: 13,969
- nulls: 12,925 (92.5%)
- unique: 961
- min: -58.96
- max: 64
- mean: -1.402
- median: -1.088
- std: 5.9
- q1: -3.256
- q3: 0.544
- iqr: 3.8
- skew: 0.7277
- kurtosis: 43.81
- n_outliers: 74
- outlier_rate: 0.07088
- zero_rate: 0
RadVel
numeric · feature · null_rate
RadVel reads as a radial velocity in km/s, populated for 10,583 of 13,969 rows, with values ranging from -483 to 52,025 and a median of 4,885. The unit follows from the Redshift column: the speed of light times the median redshift (299,792 km/s × 0.01643 ≈ 4,926) matches the median velocity to within about 1%, which rules out m/s. The distribution is right-skewed (skew 1.53, kurtosis 5.16) with 341 outliers (3.2% of non-null values) and a notable 24.2% null rate flagged as an alert. Having only 6,691 unique values suggests some repeated or quantized measurements.
Treatment: Impute or flag the 24% missing values and consider a robust scaler given the right skew and outliers.
- n: 13,969
- nulls: 3,386 (24.2%)
- unique: 6,691
- min: -483
- max: 52,025
- mean: 5,541
- median: 4,885
- std: 4,264
- q1: 2,406
- q3: 7,632
- iqr: 5,226
- skew: 1.532
- kurtosis: 5.16
- n_outliers: 341
- outlier_rate: 0.03222
- zero_rate: 0.0006614
Redshift
numeric · feature · null_rate
Redshift values cluster tightly between q1 0.008056 and q3 0.02579 with a median of 0.01643, consistent with cosmological redshift measurements for relatively nearby objects. The distribution is right-skewed (skew 1.65, kurtosis 6.29) with a max of 0.191616 and a small negative min of -0.00161 (a slightly blueshifted, approaching object), and 24.24% of rows are null. The null count of 3,386 is identical to RadVel's, so the two columns are probably missing on the same rows.
Treatment: Impute or filter the 24% nulls and consider a log1p transform before modelling to tame the right skew.
- n: 13,969
- nulls: 3,386 (24.2%)
- unique: 7,717
- min: -0.00161
- max: 0.1916
- mean: 0.01877
- median: 0.01643
- std: 0.01467
- q1: 0.008056
- q3: 0.02579
- iqr: 0.01773
- skew: 1.652
- kurtosis: 6.287
- n_outliers: 350
- outlier_rate: 0.03307
- zero_rate: 0.000378
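RadVel and Redshift share the same null count, and their reported extremes line up under the relativistic Doppler relation z = sqrt((1 + v/c) / (1 - v/c)) - 1 with v in km/s. The sketch below checks that relation against the reported minima and maxima. Treating one column as derived from the other is an inference from these four numbers, not something the profile states, and the function name is illustrative.

```python
import math

C_KM_S = 299_792.458  # speed of light in km/s


def redshift_from_radvel(v_km_s: float) -> float:
    """Relativistic Doppler redshift for a line-of-sight velocity in km/s."""
    beta = v_km_s / C_KM_S
    return math.sqrt((1.0 + beta) / (1.0 - beta)) - 1.0


# Reported RadVel extremes, to be compared with Redshift min/max
# (-0.00161 and 0.191616) from the column profiles.
z_max = redshift_from_radvel(52_025)
z_min = redshift_from_radvel(-483)
print(round(z_max, 6), round(z_min, 6))
```

If the relation holds row by row, one of the two columns is redundant for modelling and only one should be kept as a feature.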
Cstar U-Mag
numeric · feature · null_rate
Cstar U-Mag appears to be the U-band magnitude of a central star, populated for only ~0.11% of rows (null_rate 0.9989): just 16 non-null values, all distinct, across 13,969 records. Where present, values span 9.3 to 14.75 with mean 11.93 and median 12.09, roughly symmetric (skew 0.09) and platykurtic (kurtosis -1.12), with no flagged outliers. The extreme sparsity is the dominant signal and limits any aggregate use.
Treatment: Drop or treat as a presence indicator; too sparse (99.89% null) for direct modelling.
- n: 13,969
- nulls: 13,953 (99.9%)
- unique: 16
- min: 9.3
- max: 14.75
- mean: 11.93
- median: 12.09
- std: 1.741
- q1: 10.38
- q3: 13.06
- iqr: 2.683
- skew: 0.09311
- kurtosis: -1.12
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
Cstar B-Mag
numeric · feature · null_rate
Apparent B-band magnitude of a central star (Cstar B-Mag), matching the "central-star magnitudes" named in the catalog summary, populated for only 0.79% of rows (null_rate 0.9921). Among the 111 non-null values there are 97 unique magnitudes spanning 9.93 to 21.1, with mean 15.23 and median 15.5, roughly symmetric (skew -0.17) and no flagged outliers.
Treatment: Treat as a sparse optional feature; impute or model missingness explicitly rather than relying on the value.
- n: 13,969
- nulls: 13,858 (99.2%)
- unique: 97
- min: 9.93
- max: 21.1
- mean: 15.23
- median: 15.5
- std: 2.476
- q1: 13.59
- q3: 16.77
- iqr: 3.185
- skew: -0.1719
- kurtosis: -0.37
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
Cstar V-Mag
numeric · feature · null_rate
Visual magnitude of a central star ('Cstar V-Mag'), populated for only ~0.7% of rows (null_rate 0.9931). The 96 non-null values span 9.42 to 19.6 with median 15.145 and a roughly symmetric distribution (skew -0.15, kurtosis -0.36), consistent with stellar photometry on the magnitude scale. Severe sparsity is the dominant signal: this field is only meaningful for objects where a central star was characterized.
Treatment: Treat as an optional astrophysical feature; impute with a missing-indicator or drop given >99% nulls.
- n: 13,969
- nulls: 13,873 (99.3%)
- unique: 82
- min: 9.42
- max: 19.6
- mean: 14.86
- median: 15.14
- std: 2.308
- q1: 13.39
- q3: 16.06
- iqr: 2.663
- skew: -0.1467
- kurtosis: -0.3632
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
M
categorical · identifier · long_tail · null_rate
Column M is almost certainly a Messier catalogue number, a cross-reference present in only ~0.77% of the 13,969 rows (null_rate 0.9923). All 107 non-null values are distinct (top_rate ≈ 0.0093, entropy_ratio ≈ 1.0); they are zero-padded 3-digit strings like '024', '025', '110', which fits the Messier list's 110 entries. The combination of 99.23% nulls and perfect uniqueness on the remainder makes it an identifier rather than a usable feature.
Treatment: Keep as a cross-reference key or reduce to a has-Messier-number flag; the raw values offer no modelling signal.
- n: 13,969
- nulls: 13,862 (99.2%)
- unique: 107
- top_value: 024
- top_rate: 0.009346
- cardinality: 107
- entropy: 6.741
- entropy_ratio: 1
NGC
categorical · identifier · long_tail · null_rate
This is an NGC (New General Catalogue) number, populated for only 6.5% of rows (null_rate 0.935). Because Name already carries each row's primary designation, this column is presumably a cross-reference to a related NGC entry instead of the row's own identifier. Among the 908 non-null entries it spans 891 unique codes with near-maximal entropy_ratio 0.999 and a top value '3497' appearing just 3 times (top_rate 0.0033), so it behaves almost like a sparse identifier. Some codes carry letter suffixes (e.g. '5619B'), confirming it is a catalog string and not a clean integer.
Treatment: Treat as a sparse cross-reference key; drop or left-join to an NGC catalog rather than using as a model feature.
- n: 13,969
- nulls: 13,061 (93.5%)
- unique: 891
- top_value: 3497
- top_rate: 0.003304
- cardinality: 891
- entropy: 9.788
- entropy_ratio: 0.9989
IC
categorical · foreign_key · long_tail · null_rate
IC is almost certainly a cross-reference to the Index Catalogue (the supplement to the NGC), populated for only ~3.3% of rows (null_rate 0.9671). Among the 460 non-null entries it spreads across 452 distinct values with near-maximal entropy (entropy_ratio 0.9984) and a top frequency of just 3 (top_rate 0.0065), so nearly every present value is unique. This combination of overwhelming nulls and near-unique codes makes it unusable as a grouping feature without enrichment.
Treatment: Drop or treat as a sparse lookup key; do not use as a categorical feature given 96.7% nulls and near-unique values.
- n: 13,969
- nulls: 13,509 (96.7%)
- unique: 452
- top_value: 5003
- top_rate: 0.006522
- cardinality: 452
- entropy: 8.806
- entropy_ratio: 0.9984
Cstar Names
categorical · identifier · long_tail · null_rate
Central-star catalogue identifiers (HD/BD designations), occasionally listing multiple names comma-separated. Effectively empty: 99.38% null, with only 87 non-null cells across 13,969 rows, each holding a distinct value (top_rate 0.0115, entropy_ratio ~1.0).
Treatment: Drop or retain only as a cross-reference key; too sparse and unique to model.
- n: 13,969
- nulls: 13,882 (99.4%)
- unique: 87
- top_value: BD -12 1172,HD 35914
- top_rate: 0.01149
- cardinality: 87
- entropy: 6.443
- entropy_ratio: 1
Identifiers
text · identifier · near_unique · allcaps
This column holds astronomical object identifiers, with 12,179 unique values across 13,969 rows and almost everything in uppercase (allcaps_rate 0.998). The token distribution reveals catalog prefixes like 2MASX (9,805), MWSC, ESO, MCG, SDSS, and PGC, suggesting each cell concatenates cross-catalog designations (word_mean 5.26). Note the 12.81% null rate and that the column is near-unique, so it carries little aggregate signal on its own.
Treatment: Treat as a join key to external catalogs; do not use as a model feature.
- n: 13,969
- nulls: 1,789 (12.8%)
- unique: 12,179
- len_min: 5
- len_max: 211
- len_mean: 71.84
- len_median: 74
- len_p95: 134
- word_mean: 5.255
- word_median: 5
- n_empty: 0
- n_duplicates: 1
- duplicate_rate: 8.21e-05
- vocab_size: 50,977
- readability_flesch_mean: 87.54
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0.9975
- boilerplate_rate: 0
Common names
categorical · metadata · long_tail · null_rate
Vernacular labels for astronomical objects (e.g., 'Antennae Galaxies', 'Flame Nebula'), occasionally comma-concatenating multiple aliases in one cell. The column is 99.06% null across 13,969 rows, leaving 131 populated cells with 127 distinct values and a near-uniform distribution (entropy_ratio 0.998, top_rate 1.5%). Multi-name cells like 'Butterfly Galaxies,Siamese Twins' indicate the field is a delimited list, not a clean category.
Treatment: Treat as sparse free-form labels: split on comma into a name list and use only for display or lookup, not modelling.
- n: 13,969
- nulls: 13,838 (99.1%)
- unique: 127
- top_value: Antennae Galaxies
- top_rate: 0.01527
- cardinality: 127
- entropy: 6.972
- entropy_ratio: 0.9977
NED notes
text · metadata · multilingual · null_rate · duplicates
Short astronomical annotations from the NGC/IC NED catalog, averaging 6.7 words and capped at 80 characters, describing object positions and identification caveats (e.g. 'In the Large Magellanic Cloud.', 'Confused HIPASS source'). 83.6% of rows are null and only 1,198 distinct strings appear among the 2,286 populated rows, with a 47.6% duplicate rate driven by a small set of canned remarks. Language detection flags the field as multilingual but 780 of 783 detected samples are English; the bs/it/pt hits are almost certainly false positives on terse astronomy jargon.
Treatment: Treat as sparse categorical-ish notes: bucket the top recurring phrases as flags and ignore the long tail rather than embedding free text.
- n: 13,969
- nulls: 11,683 (83.6%)
- unique: 1,198
- len_min: 14
- len_max: 80
- len_mean: 39.64
- len_median: 35
- len_p95: 70
- word_mean: 6.658
- word_median: 6
- n_empty: 0
- n_duplicates: 1,088
- duplicate_rate: 0.4759
- vocab_size: 1,836
- readability_flesch_mean: 60.05
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0.01575
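The NED notes treatment proposes bucketing the most frequent phrases as flags and collapsing the long tail. A minimal sketch of that idea, assuming missing notes arrive as `None` or empty strings. The helper name, the 'other'/'missing' labels, and the toy frequencies are illustrative.

```python
from collections import Counter


def top_phrase_flags(notes, top_n):
    """Return (top phrases, per-row label), collapsing rare notes to 'other'."""
    counts = Counter(note for note in notes if note)
    top = [phrase for phrase, _ in counts.most_common(top_n)]
    labels = [
        "missing" if not note else (note if note in top else "other")
        for note in notes
    ]
    return top, labels


# The two phrases are quoted in the column summary; their frequencies
# here are made up for illustration.
notes = [
    "In the Large Magellanic Cloud.",
    "In the Large Magellanic Cloud.",
    "Confused HIPASS source",
    None,
]
top, labels = top_phrase_flags(notes, top_n=1)
print(top, labels)
```

The resulting labels can then be one-hot encoded like any other low-cardinality categorical column.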
OpenNGC notes
categorical · free_text · long_tail · null_rate
Free-text curator notes on OpenNGC catalog entries, present for only 1.52% of rows (null_rate 0.9848). Among the 213 populated cells there are 159 distinct strings with high entropy (6.72, ratio 0.92), and the most common note, 'Identification taken from Corwin's catalog.', covers just 13.6% of non-nulls. Content is heterogeneous provenance commentary referencing LEDA, NED, SIMBAD, and Corwin sources, with no controlled vocabulary.
Treatment: Treat as sparse free-text metadata; drop for modelling or keep as a provenance flag (note present vs absent).
- n: 13,969
- nulls: 13,756 (98.5%)
- unique: 159
- top_value: Identification taken from Corwin’s catalog.
- top_rate: 0.1362
- cardinality: 159
- entropy: 6.719
- entropy_ratio: 0.9187
Sources
categorical · metadata · long_tail
This column encodes a per-row provenance manifest: a pipe-delimited list of astronomical measurement fields (Type, RA, Dec, Const, magnitudes, redshift, etc.), each tagged with a numeric source/catalog code. Despite 344 distinct combinations across 13,969 rows, the distribution is highly concentrated: the top template covers 41.2% of rows and the entropy ratio is just 0.41, confirming the long_tail alert. There are no nulls, but the structured-string format means it is not directly usable as a category without parsing.
Treatment: Parse into per-field source codes (one column per measurement) rather than treating the raw string as a category.
- n: 13,969
- nulls: 0 (0.0%)
- unique: 344
- top_value: Type:1|RA:1|Dec:1|Const:99|MajAx:3|MinAx:3|PosAng:3|B-Mag:3|J-Mag:2|H-Mag:2|K-Mag:2|SurfBr:3|Hubble:3|RadVel:2|Redshift:2
- top_rate: 0.4116
- cardinality: 344
- entropy: 3.486
- entropy_ratio: 0.4138
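The Sources treatment recommends parsing each manifest into per-field source codes. A minimal sketch using the top value shown above as input. It assumes every item has the form `Field:code` with an integer code, which holds for this example but has not been checked against all 344 variants. The function name `parse_sources` is illustrative.

```python
def parse_sources(manifest: str) -> dict[str, int]:
    """Split 'Field:code|Field:code|...' into a {field: code} mapping."""
    result = {}
    for item in manifest.split("|"):
        # rpartition splits on the last colon, so the code is always the tail.
        field, _, code = item.rpartition(":")
        result[field] = int(code)  # assumes integer codes, as in the top value
    return result


top_value = (
    "Type:1|RA:1|Dec:1|Const:99|MajAx:3|MinAx:3|PosAng:3|B-Mag:3|"
    "J-Mag:2|H-Mag:2|K-Mag:2|SurfBr:3|Hubble:3|RadVel:2|Redshift:2"
)
parsed = parse_sources(top_value)
print(parsed["Const"], parsed["RadVel"], len(parsed))
```

Fields absent from a given manifest simply do not appear in the mapping, which can be turned into a null in the corresponding per-field column.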