saturn·

deepsky ngc

source /home/coolhand/data/celestial/deepsky/NGC.csv · 13,969 rows · 32 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is an astronomical catalog of 13,969 deep-sky objects (NGC.csv) with 32 columns covering identifiers, sky coordinates, magnitudes across multiple bands, morphological classifications, and kinematic measurements like radial velocity and redshift. The catalog is dominated by galaxies — 75% of entries are type 'G' — with smaller populations of open clusters, duplicates, stars, and planetary nebulae. Object morphology (Hubble type) and constellation distribution are the most informative descriptive fields, while RadVel and Redshift give a clean view of the cosmological distance distribution skewing toward nearby objects (median z ≈ 0.016). Be aware that many columns are very sparsely populated: parallax (Pax), proper motions, and central-star magnitudes are >92% null, so any analysis on those fields will be limited to a small subset. Size measurements (MajAx, MinAx) are extremely skewed with heavy outliers, suggesting a few very large objects dominate the tails.

citing: Type · Hubble · Const · Redshift · RadVel · V-Mag · Pax · MajAx · B-Mag

Schema

32 columns
Per-column summary. Each entry lists column name, kind, null rate, and unique count; the line below it lists the alerts, if any.
Name text 0.0% 13,969
near_unique one_word allcaps short_text
Type categorical 0.0% 20
RA unknown 0.0%
skipped
Dec text 0.1% 13,282
near_unique one_word allcaps short_text
Const categorical 0.1% 89
MajAx numeric 14.1% 734
high_skew outliers
MinAx numeric 20.9% 465
null_rate high_skew outliers
PosAng numeric 23.2% 181
null_rate
B-Mag numeric 18.9% 1,056
outliers
V-Mag numeric 69.8% 774
null_rate outliers
J-Mag numeric 30.9% 804
null_rate
H-Mag numeric 30.6% 831
null_rate
K-Mag numeric 30.9% 823
null_rate
SurfBr numeric 26.8% 438
null_rate
Hubble categorical 27.3% 30
null_rate
Pax numeric 94.8% 676
null_rate high_skew outliers
Pm-RA numeric 92.5% 954
null_rate high_skew outliers
Pm-Dec numeric 92.5% 961
null_rate outliers
RadVel numeric 24.2% 6,691
null_rate
Redshift numeric 24.2% 7,717
null_rate
Cstar U-Mag numeric 99.9% 16
null_rate
Cstar B-Mag numeric 99.2% 97
null_rate
Cstar V-Mag numeric 99.3% 82
null_rate
M categorical 99.2% 107
long_tail null_rate
NGC categorical 93.5% 891
long_tail null_rate
IC categorical 96.7% 452
long_tail null_rate
Cstar Names categorical 99.4% 87
long_tail null_rate
Identifiers text 12.8% 12,179
near_unique allcaps
Common names categorical 99.1% 127
long_tail null_rate
NED notes text 83.6% 1,198
multilingual null_rate duplicates
OpenNGC notes categorical 98.5% 159
long_tail null_rate
Sources categorical 0.0% 344
long_tail

Name

text identifier near_unique one_word allcaps short_text
This appears to be an astronomical object designation field — short, all-caps, single-token codes like 'NED01', 'NED02', and 'IC' prefixed catalog numbers. Every one of the 13,969 rows is unique with zero nulls, lengths tightly bounded between 6 and 13 characters, and 97.4% are single-word tokens. The 'NED01'/'NED02' values recurring 175 and 168 times sit oddly against the n_unique=13969 claim, suggesting these are prefixes/substrings counted at the word level rather than full duplicates. Treatment: Treat as a primary key; drop from modelling features and use only for joins or lookups. high · anthropic:claude-opus-4-7
n
13,969
nulls
0 (0.0%)
unique
13,969
len_min
6
len_max
13
len_mean
6.784
len_median
7
len_p95
8
word_mean
1.026
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
13,607
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
0.9738
allcaps_rate
1
boilerplate_rate
0

Type

categorical label
A categorical type code with 20 distinct values across 13,969 rows and no nulls. The distribution is highly imbalanced: 'G' alone accounts for 75.0% of records (10,481), with the next categories ('OCl', 'Dup', '*', 'Other') each below 5%, yielding an entropy ratio of just 0.38. The codes ('G', 'OCl', 'GPair', 'GCl', 'PN', 'Neb') suggest astronomical object classifications (galaxies, open/globular clusters, planetary nebulae, nebulae). Treatment: Group rare categories or stratify by 'G' dominance before any classification task. high · anthropic:claude-opus-4-7
n
13,969
nulls
0 (0.0%)
unique
20
top_value
G
top_rate
0.7503
cardinality
20
entropy
1.646
entropy_ratio
0.3808
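
The "group rare categories" treatment can be sketched as follows. This is a minimal illustration rather than the profiler's own method: the helper name `group_rare`, the share threshold, and the toy sample are assumptions, and the bucket is called 'Rare' because the column already contains an 'Other' code.

```python
from collections import Counter

def group_rare(labels, min_share=0.05, bucket="Rare"):
    """Replace labels whose share of rows is below min_share with one bucket label."""
    counts = Counter(labels)
    total = len(labels)
    keep = {label for label, n in counts.items() if n / total >= min_share}
    return [label if label in keep else bucket for label in labels]

# Toy sample that mimics the imbalance (not real catalogue rows).
sample = ["G"] * 15 + ["OCl"] * 2 + ["PN", "Neb", "GCl"]
grouped = group_rare(sample, min_share=0.10)
print(Counter(grouped))
```

A lower threshold keeps more of the original codes; the right value depends on how many rows each downstream class needs.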

RA

unknown other skipped
The column is named RA, which in astronomical datasets typically denotes Right Ascension, but saturn skipped profiling so kind and uniqueness are unresolved. The only confirmed signals are 13969 rows with a 0.0 null rate; no distributional statistics are available. Treatment: Re-profile with an explicit type cast before deciding on use. low · anthropic:claude-opus-4-7
n
13,969
nulls
0 (0.0%)
unique

Dec

text feature near_unique one_word allcaps short_text
This column holds astronomical Declination coordinates formatted as signed sexagesimal strings (e.g. '-19:28:17.6'), with every value 9-11 characters long and exactly one token. It is near-unique (13,282 distinct of 13,969), and the 680 duplicates (4.87%) are more plausibly distinct objects that happen to share a declination, or the catalogue's own 'Dup' entries, than repeated observations: declination alone does not fix a sky position without RA. The 'allcaps' and Flesch=121 signals are artefacts of the numeric format, not real prose. Treatment: Parse the sexagesimal string into signed decimal degrees before any numeric use. high · anthropic:claude-opus-4-7
n
13,969
nulls
7 (0.1%)
unique
13,282
len_min
9
len_max
11
len_mean
11
len_median
11
len_p95
11
word_mean
1
word_median
1
n_empty
0
n_duplicates
680
duplicate_rate
0.0487
vocab_size
13,282
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0
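
A minimal sketch of the sexagesimal parse recommended above, assuming every value follows the sign, degrees, arcminutes, arcseconds layout of the example '-19:28:17.6'. The function name `dec_to_degrees` is illustrative. The sign is read from the string rather than from the degrees field so that values such as '-00:30:00.0' stay negative.

```python
def dec_to_degrees(dec: str) -> float:
    """Convert a signed 'DD:MM:SS.s' declination string to decimal degrees."""
    text = dec.strip()
    sign = -1.0 if text.startswith("-") else 1.0
    degrees, minutes, seconds = text.lstrip("+-").split(":")
    return sign * (int(degrees) + int(minutes) / 60 + float(seconds) / 3600)

print(dec_to_degrees("-19:28:17.6"))
print(dec_to_degrees("-00:30:00.0"))
```

The 7 null cells would need to be skipped or mapped to a missing value before calling the function.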

Const

categorical feature
This column holds three-letter constellation abbreviations (Vir, Com, Leo, Cet, UMa…), with 89 distinct values across 13,969 rows. That is one more than the 88 IAU constellations, which would be explained either by one constellation being split into two codes (Serpens, which has two separate parts, is the usual candidate) or by a stray code; the value list should be inspected to confirm. Distribution is fairly even (entropy ratio 0.85), though Virgo leads at 8.85% and the top three constellations account for roughly a fifth of records. Nulls are negligible (7 rows, 0.05%). Treatment: One-hot or target-encode as a categorical feature. high · anthropic:claude-opus-4-7
n
13,969
nulls
7 (0.1%)
unique
89
top_value
Vir
top_rate
0.08853
cardinality
89
entropy
5.489
entropy_ratio
0.8476
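
The report does not state how entropy_ratio is computed. The reported figures are consistent with Shannon entropy in bits divided by log2 of the cardinality, so the check below assumes that definition; the function name is illustrative and the inputs are the rounded values printed in this report.

```python
import math

def entropy_ratio(entropy_bits: float, cardinality: int) -> float:
    """Entropy as a fraction of its maximum, log2(cardinality), for a categorical column."""
    return entropy_bits / math.log2(cardinality)

# (column, reported entropy, reported cardinality, reported entropy_ratio)
reported = [
    ("Type", 1.646, 20, 0.3808),
    ("Const", 5.489, 89, 0.8476),
    ("Hubble", 4.096, 30, 0.8348),
]
for name, entropy, cardinality, ratio in reported:
    print(name, round(entropy_ratio(entropy, cardinality), 4), "reported:", ratio)
```

Small differences in the last digit are expected because the entropies shown in the report are already rounded.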

MajAx

numeric feature high_skew outliers
MajAx is a numeric measurement (likely a major-axis length, e.g. of a galaxy or ellipse) with a tight central distribution—median 1.2, IQR 0.82–1.87—but an extreme right tail reaching 299.92. Skew of 21.89 and kurtosis of 641.85 are both severe, and 10.36% of values flag as outliers. Null rate is also non-trivial at 14.06%. Treatment: log-transform and impute missing before modelling; consider winsorising the upper tail. high · anthropic:claude-opus-4-7
n
13,969
nulls
1,964 (14.1%)
unique
734
min
0.02
max
299.9
mean
2.145
median
1.2
std
6.789
q1
0.82
q3
1.87
iqr
1.05
skew
21.89
kurtosis
641.8
n_outliers
1,244
outlier_rate
0.1036
zero_rate
0
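
One way to apply the winsorise-then-log treatment, sketched on a toy list rather than the real column. The helper names, the nearest-rank quantile rule, and the 0.75 cap are assumptions chosen to keep the example small; a real run would more likely cap at a high quantile such as 0.99. Nulls are assumed to have been removed or imputed beforehand, and the log is safe only because the column has no zeros (min 0.02).

```python
import math

def winsorise_upper(values, upper_q=0.99):
    """Cap values above the empirical upper_q quantile (nearest-rank rule)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(upper_q * len(ordered)))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]

def log10_all(values):
    """Base-10 log of strictly positive values."""
    return [math.log10(v) for v in values]

toy_majax = [0.02, 0.5, 0.82, 1.0, 1.2, 1.87, 50.0, 299.92]  # illustrative, not real rows
capped = winsorise_upper(toy_majax, upper_q=0.75)
print(capped)
print([round(x, 3) for x in log10_all(capped)])
```

Applying the log first and skipping the cap is also reasonable, since a log already compresses the upper tail considerably.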

MinAx

numeric feature null_rate high_skew outliers
MinAx is a numeric measurement (likely a minor-axis dimension) with a tight central mass — median 0.69, IQR 0.45–1.07 — but an extreme right tail stretching to 179.89. Skew of 26.82 and kurtosis near 981 are extraordinary, and 760 outliers (6.88%) sit alongside a 20.91% null rate. Only 465 distinct values across 13,969 rows suggests rounding or a discretised scale. Treatment: Impute or flag the 20.91% nulls and apply a log transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n
13,969
nulls
2,921 (20.9%)
unique
465
min
0.02
max
179.9
mean
1.113
median
0.69
std
3.738
q1
0.45
q3
1.07
iqr
0.62
skew
26.82
kurtosis
981.5
n_outliers
760
outlier_rate
0.06879
zero_rate
0

PosAng

numeric feature null_rate
PosAng is a numeric column bounded between 0 and 180 with mean 87.27 and median 87, consistent with a position angle in degrees (a common astronomical measurement). The distribution is nearly symmetric (skew 0.047) and platykurtic (kurtosis -1.21), spread broadly across the range with IQR 40-133 and no outliers. Notable: 23.17% of rows are null, and there are only 181 unique values, consistent with integer-degree quantisation over 0-180 inclusive. Treatment: Treat as an axial angle with a 180° period, so that 0 and 180 describe the same orientation (e.g. encode as sin 2θ and cos 2θ rather than sin θ and cos θ), and impute or flag the 23% missingness before modelling. high · anthropic:claude-opus-4-7
n
13,969
nulls
3,236 (23.2%)
unique
181
min
0
max
180
mean
87.27
median
87
std
52.68
q1
40
q3
133
iqr
93
skew
0.04737
kurtosis
-1.212
n_outliers
0
outlier_rate
0
zero_rate
0.008572
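
A sketch of an angular encoding for this column, assuming PosAng is an axis orientation measured modulo 180 degrees (the 0-180 range supports this, but the report does not confirm it). Doubling the angle before taking sine and cosine makes 0 and 180 map to the same point, which a plain sin/cos pair would not do. The function name is illustrative.

```python
import math

def encode_position_angle(angle_deg: float):
    """Map an axial angle in degrees (period 180) to a point on the unit circle."""
    doubled = math.radians(2.0 * angle_deg)
    return math.sin(doubled), math.cos(doubled)

for angle in (0, 45, 90, 180):
    s, c = encode_position_angle(angle)
    print(angle, round(s, 6), round(c, 6))
```

With this encoding, angles of 1 and 179 degrees end up close together, which matches their physical similarity.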

B-Mag

numeric feature outliers
B-Mag is a numeric photometric measurement, almost certainly the B-band apparent magnitude of astronomical sources, with values concentrated between 13.42 and 15.2 (median 14.42). The distribution is strongly left-skewed (skew -1.69, kurtosis 5.58) with a long bright-end tail down to 1.51 (lower magnitudes are brighter) and 572 outliers (5.05%). Roughly 18.86% of rows are null, so a sizeable share of objects lack a B measurement. Treatment: Impute or flag the 18.86% missing values and consider winsorising the bright-end tail before modelling. high · anthropic:claude-opus-4-7
n
13,969
nulls
2,634 (18.9%)
unique
1,056
min
1.51
max
21.01
mean
14.12
median
14.42
std
1.833
q1
13.42
q3
15.2
iqr
1.78
skew
-1.692
kurtosis
5.576
n_outliers
572
outlier_rate
0.05046
zero_rate
0

V-Mag

numeric feature null_rate outliers
V-Mag is a numeric column almost certainly recording visual (apparent) magnitude of astronomical objects, with values spanning 1.69 to 20.41 and a median of 12.38 consistent with that scale. The distribution is left-skewed (skew -1.22) with 323 outliers (7.66%) on the bright end, and 69.83% of rows are null, so usable coverage is limited to roughly 30% of the catalogue. The interquartile band is tight (11.31-13.28, IQR 1.97) around faint magnitudes. Treatment: Impute or flag the 69.83% missing values before modelling; consider keeping raw scale since magnitudes are already logarithmic. high · anthropic:claude-opus-4-7
n
13,969
nulls
9,755 (69.8%)
unique
774
min
1.69
max
20.41
mean
12.04
median
12.38
std
2.092
q1
11.31
q3
13.28
iqr
1.97
skew
-1.223
kurtosis
2.67
n_outliers
323
outlier_rate
0.07665
zero_rate
0

J-Mag

numeric feature null_rate
This is the J-band magnitude (near-infrared photometry, ~1.25 μm) for catalogued sources, with values centered near 11.42 and spanning 1.11 to 17.02. The distribution is mildly left-skewed (-0.55) with modest excess kurtosis (2.51) and 333 outliers (3.45%), consistent with a mix of bright and faint sources. A notable concern is the 30.9% null rate, meaning nearly a third of rows lack a J-Mag measurement. Treatment: Impute or flag missing values before modelling; consider robust scaling given the left skew and outliers. high · anthropic:claude-opus-4-7
n
13,969
nulls
4,317 (30.9%)
unique
804
min
1.11
max
17.02
mean
11.37
median
11.42
std
1.355
q1
10.63
q3
12.17
iqr
1.54
skew
-0.5524
kurtosis
2.512
n_outliers
333
outlier_rate
0.0345
zero_rate
0

H-Mag

numeric feature null_rate
H-Mag is a numeric astronomical magnitude, most plausibly the near-infrared H-band (~1.65 μm) apparent magnitude that sits between the J-Mag and K-Mag columns, rather than the absolute H magnitude used in asteroid and minor-planet catalogs. It ranges from 0.83 to 16.67 with a mean of 10.70 and median 10.74. The distribution is mildly left-skewed (-0.45) with light tails (kurtosis 2.18) and a tight IQR of 1.55, indicating most objects cluster around magnitude 10-11.5. The notable concern is a 30.6% null rate, plus 341 outliers (3.5%), which the left skew suggests sit mostly at the bright (low-magnitude) end. Treatment: Impute or flag the 30.6% missing values before modelling; consider keeping raw scale since skew is mild. high · anthropic:claude-opus-4-7
n
13,969
nulls
4,276 (30.6%)
unique
831
min
0.83
max
16.67
mean
10.7
median
10.74
std
1.368
q1
9.95
q3
11.5
iqr
1.55
skew
-0.4515
kurtosis
2.179
n_outliers
341
outlier_rate
0.03518
zero_rate
0

K-Mag

numeric feature null_rate
Numeric K-band magnitude readings (likely 2MASS K-mag photometry), centered around a median of 10.46 and ranging from 0.72 to 15.76. The distribution is mildly left-skewed (-0.45) with 315 outliers (3.3%) and a notable 30.9% null rate, suggesting many sources lack K-band coverage. Spread is tight (std 1.36, IQR 1.57) across 823 unique values. Treatment: Impute or flag the 31% missing values before modelling; no transform needed given near-symmetric spread. high · anthropic:claude-opus-4-7
n
13,969
nulls
4,317 (30.9%)
unique
823
min
0.72
max
15.76
mean
10.41
median
10.46
std
1.361
q1
9.66
q3
11.23
iqr
1.57
skew
-0.4513
kurtosis
2.166
n_outliers
315
outlier_rate
0.03264
zero_rate
0

SurfBr

numeric feature null_rate
SurfBr is a numeric measurement tightly clustered around 23.31 (median 23.33) with a narrow IQR of 0.71 and standard deviation 0.61, consistent with a surface brightness magnitude. The distribution is mildly left-skewed (-0.30) with elevated kurtosis (2.65) and 288 outliers (2.8%), and roughly 27% of rows are null which is a notable coverage gap. Treatment: Impute or flag the 27% missing values before modelling; no transform needed given the tight, near-symmetric spread. high · anthropic:claude-opus-4-7
n
13,969
nulls
3,742 (26.8%)
unique
438
min
18.36
max
28.48
mean
23.31
median
23.33
std
0.61
q1
22.97
q3
23.68
iqr
0.71
skew
-0.3014
kurtosis
2.65
n_outliers
288
outlier_rate
0.02816
zero_rate
0

Hubble

categorical label null_rate
Hubble appears to be the Hubble morphological classification of galaxies, with familiar types like E (elliptical), S0 (lenticular), and the spiral sequence Sa/Sb/Sc dominating. Across 13,969 rows it has 30 distinct codes and a high entropy ratio (0.83), so the type distribution is fairly spread rather than concentrated: the top value E accounts for only 15.2%. Notably, 27.26% of rows are null. Because about 25% of catalogue entries are not type 'G', much of that missingness probably belongs to non-galaxy objects, for which a Hubble type is not applicable, rather than to unclassified galaxies. Treatment: Treat as a categorical label; restrict to Type 'G' rows first, then impute or filter the remaining nulls before any class-conditional analysis. high · anthropic:claude-opus-4-7
n
13,969
nulls
3,808 (27.3%)
unique
30
top_value
E
top_rate
0.1522
cardinality
30
entropy
4.096
entropy_ratio
0.8348
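
Whether the Hubble nulls sit mostly on non-galaxy rows can be checked by computing the null rate per Type. The sketch below uses hypothetical toy rows and an illustrative function name; on the real file the rows would come from a CSV reader, and empty strings are treated as nulls on that assumption.

```python
from collections import Counter

def null_rate_by_type(rows, column="Hubble", type_column="Type"):
    """Fraction of rows with a missing `column` value, per value of `type_column`."""
    totals, nulls = Counter(), Counter()
    for row in rows:
        kind = row[type_column]
        totals[kind] += 1
        if row.get(column) in (None, ""):
            nulls[kind] += 1
    return {kind: nulls[kind] / totals[kind] for kind in totals}

toy_rows = [  # hypothetical rows, not taken from NGC.csv
    {"Type": "G", "Hubble": "E"},
    {"Type": "G", "Hubble": "Sb"},
    {"Type": "G", "Hubble": None},
    {"Type": "G", "Hubble": "S0"},
    {"Type": "OCl", "Hubble": None},
    {"Type": "PN", "Hubble": None},
]
rates = null_rate_by_type(toy_rows)
print(rates)
```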

Pax

numeric feature null_rate high_skew outliers
Pax is a sparse numeric measurement, identified in the dataset summary as parallax, populated for only ~5% of rows (null_rate 0.9485) with values ranging from 0.003 to 22.8 and a median of 0.4829. The distribution is severely right-skewed (skew 7.21, kurtosis 68.87) with 65 outliers among the non-null values, and the mean (0.919) sits well above the median, indicating a long upper tail. With 676 unique values across the 720 non-null rows and no zeros, it behaves as a continuous astrometric quantity available only for a small subpopulation; the profile does not state units, though milliarcseconds would be conventional. Treatment: Log-transform and add a missingness indicator before modelling, given the 94.85% null rate and heavy right skew. medium · anthropic:claude-opus-4-7
n
13,969
nulls
13,249 (94.8%)
unique
676
min
0.003
max
22.8
mean
0.9192
median
0.4829
std
1.76
q1
0.2517
q3
0.9338
iqr
0.6821
skew
7.206
kurtosis
68.87
n_outliers
65
outlier_rate
0.09028
zero_rate
0
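
A minimal sketch of the "log-transform plus missingness indicator" treatment for one cell. The function name and the 0.0 placeholder for missing values are assumptions; the log is safe only because the profile reports no zeros or negatives (min 0.003).

```python
import math

def pax_features(pax):
    """Return (missing_flag, log10 value); None or NaN counts as missing."""
    if pax is None or (isinstance(pax, float) and math.isnan(pax)):
        return 1, 0.0  # placeholder value; the flag carries the information
    return 0, math.log10(pax)

for value in (None, 0.003, 0.4829, 22.8):
    print(value, pax_features(value))
```

With about 95% of rows missing, the flag alone will often be the more useful of the two features.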

Pm-RA

numeric feature null_rate high_skew outliers
Pm-RA is almost certainly a proper-motion-in-right-ascension measurement (mas/yr) for catalog objects, with values centered slightly negative (median -0.9015, mean -1.374) and a tight IQR of 3.54. Two things stand out: 92.53% of rows are null, suggesting this is only populated for a subset of sources, and the distribution has heavy tails (skew 2.66, kurtosis 50.19) with extremes from -43.57 to 76.0 and a 6.8% outlier rate. Treatment: Impute or flag the 92.53% missingness explicitly and winsorize/robust-scale before modelling. high · anthropic:claude-opus-4-7
n
13,969
nulls
12,925 (92.5%)
unique
954
min
-43.57
max
76
mean
-1.374
median
-0.9015
std
5.826
q1
-3.103
q3
0.4365
iqr
3.54
skew
2.664
kurtosis
50.19
n_outliers
71
outlier_rate
0.06801
zero_rate
0.0009579

Pm-Dec

numeric feature null_rate outliers
Pm-Dec is almost certainly proper motion in declination (mas/yr), a sparse astrometric feature with 92.53% nulls and only 1044 populated rows out of 13969. Values center near zero (median -1.088, mean -1.402) but span -58.961 to 64.0 with std 5.9, kurtosis 43.8, and 74 outliers (7.09% of non-null), indicating heavy tails typical of high-proper-motion sources. Treatment: Impute or mask the 92.53% missing before modelling, and consider a robust scaler given the heavy tails. high · anthropic:claude-opus-4-7
n
13,969
nulls
12,925 (92.5%)
unique
961
min
-58.96
max
64
mean
-1.402
median
-1.088
std
5.9
q1
-3.256
q3
0.544
iqr
3.8
skew
0.7277
kurtosis
43.81
n_outliers
74
outlier_rate
0.07088
zero_rate
0

RadVel

numeric feature null_rate
RadVel reads as a radial velocity measurement in km/s, populated for 10,583 of the 13,969 rows, with values ranging from -483 to 52,025 and a median of 4,885. The km/s reading is supported by the Redshift column: applying the relativistic Doppler formula to the median redshift of 0.01643 gives about 4,885 km/s, and the minimum and maximum agree in the same way. The distribution is right-skewed (skew 1.53, kurtosis 5.16) with 341 outliers (3.2%) and a notable 24.2% null rate flagged as an alert. The 6,691 unique values among the non-null rows suggest some repeated or quantized measurements. Treatment: Impute or flag the 24% missing values and consider a robust scaler given the right skew and outliers. high · anthropic:claude-opus-4-7
n
13,969
nulls
3,386 (24.2%)
unique
6,691
min
-483
max
52,025
mean
5,541
median
4,885
std
4,264
q1
2,406
q3
7,632
iqr
5,226
skew
1.532
kurtosis
5.16
n_outliers
341
outlier_rate
0.03222
zero_rate
0.0006614
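
The units of RadVel can be sanity-checked against the Redshift column. The sketch below assumes the two columns are linked by the relativistic Doppler relation v = c((1+z)² − 1)/((1+z)² + 1); the report does not say so, but the Redshift minimum, median, and maximum printed in this report map to roughly -483, 4,885, and 52,025 km/s under it, matching the RadVel statistics. The function name is illustrative.

```python
C_KM_S = 299_792.458  # speed of light in km/s

def doppler_velocity(z: float) -> float:
    """Relativistic line-of-sight velocity in km/s for redshift z."""
    ratio = (1.0 + z) ** 2
    return C_KM_S * (ratio - 1.0) / (ratio + 1.0)

# Redshift statistics copied from this report.
for label, z in [("min", -0.00161), ("median", 0.01643), ("max", 0.191616)]:
    print(label, round(doppler_velocity(z)))
```

The simpler approximation v ≈ c·z is adequate at the median but overshoots the maximum by several thousand km/s, which is why the relativistic form is used here.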

Redshift

numeric feature null_rate
Redshift values cluster tightly between q1 0.008056 and q3 0.02579 with a median of 0.01643, consistent with cosmological redshift measurements for relatively nearby objects. The distribution is right-skewed (skew 1.65, kurtosis 6.29) with a max of 0.191616 and a small negative min of -0.00161 (a blueshift, i.e. an object approaching the observer, not a data error), and 24.24% of rows are null. Treatment: Impute or filter the 24% nulls and consider a log1p transform before modelling to tame the right skew; log1p remains defined for the small negative values. high · anthropic:claude-opus-4-7
n
13,969
nulls
3,386 (24.2%)
unique
7,717
min
-0.00161
max
0.1916
mean
0.01877
median
0.01643
std
0.01467
q1
0.008056
q3
0.02579
iqr
0.01773
skew
1.652
kurtosis
6.287
n_outliers
350
outlier_rate
0.03307
zero_rate
0.000378

Cstar U-Mag

numeric feature null_rate
Cstar U-Mag appears to be the U-band magnitude of a central star (the dataset summary groups it with the central-star magnitudes), populated for only ~0.11% of rows (null_rate 0.9989): just 16 non-null values, all distinct, across 13,969 records. Where present, values span 9.3 to 14.75 with mean 11.93 and median 12.09, roughly symmetric (skew 0.09) and platykurtic (kurtosis -1.12), with no flagged outliers. The extreme sparsity is the dominant signal and limits any aggregate use. Treatment: Drop or treat as a presence indicator; too sparse (99.89% null) for direct modelling. high · anthropic:claude-opus-4-7
n
13,969
nulls
13,953 (99.9%)
unique
16
min
9.3
max
14.75
mean
11.93
median
12.09
std
1.741
q1
10.38
q3
13.06
iqr
2.683
skew
0.09311
kurtosis
-1.12
n_outliers
0
outlier_rate
0
zero_rate
0

Cstar B-Mag

numeric feature null_rate
Apparent B-band magnitude of a central star (Cstar B-Mag), populated for only 0.79% of rows (null_rate 0.9921). Among the 111 non-null values there are 97 unique magnitudes spanning 9.93 to 21.1, with mean 15.23 and median 15.5, roughly symmetric (skew -0.17) and no flagged outliers. Treatment: Treat as a sparse optional feature; impute or model missingness explicitly rather than relying on the value. high · anthropic:claude-opus-4-7
n
13,969
nulls
13,858 (99.2%)
unique
97
min
9.93
max
21.1
mean
15.23
median
15.5
std
2.476
q1
13.59
q3
16.77
iqr
3.185
skew
-0.1719
kurtosis
-0.37
n_outliers
0
outlier_rate
0
zero_rate
0

Cstar V-Mag

numeric feature null_rate
Visual magnitude of a central star ('Cstar V-Mag'), populated for only ~0.7% of rows (null_rate 0.9931). The 96 non-null values span 9.42 to 19.6 with median 15.145 and a roughly symmetric distribution (skew -0.15, kurtosis -0.36), consistent with stellar photometry on the magnitude scale. Severe sparsity is the dominant signal: this field is only meaningful for objects where a central star was characterized. Treatment: Treat as an optional astrophysical feature; impute with a missing-indicator or drop given >99% nulls. high · anthropic:claude-opus-4-7
n
13,969
nulls
13,873 (99.3%)
unique
82
min
9.42
max
19.6
mean
14.86
median
15.14
std
2.308
q1
13.39
q3
16.06
iqr
2.663
skew
-0.1467
kurtosis
-0.3632
n_outliers
0
outlier_rate
0
zero_rate
0

M

categorical identifier long_tail null_rate
Column M is almost certainly a Messier-number cross-reference: values are zero-padded 3-digit strings like '024', '025', '110', and the Messier catalogue runs to 110 entries. It is present in only ~0.77% of the 13,969 rows (null_rate 0.9923), and each of the 107 non-null values appears exactly once (top_rate ≈ 0.0093, entropy_ratio ≈ 1.0). Treatment: Keep as a sparse cross-reference key for lookups, or drop; near-unique values with >99% nulls offer no modelling signal. high · anthropic:claude-opus-4-7
n
13,969
nulls
13,862 (99.2%)
unique
107
top_value
024
top_rate
0.009346
cardinality
107
entropy
6.741
entropy_ratio
1

NGC

categorical identifier long_tail null_rate
This is an NGC (New General Catalogue) astronomical object identifier, populated for only 6.5% of rows (null_rate 0.935). Among the 908 non-null entries it spans 891 unique codes with near-maximal entropy_ratio 0.999 and a top value '3497' appearing just 3 times (top_rate 0.0033), so it behaves almost like a sparse identifier. Some codes carry letter suffixes (e.g. '5619B'), confirming it's a catalog string rather than a clean integer. Treatment: Treat as a sparse cross-reference key; drop or left-join to an NGC catalog rather than using as a model feature. high · anthropic:claude-opus-4-7
n
13,969
nulls
13,061 (93.5%)
unique
891
top_value
3497
top_rate
0.003304
cardinality
891
entropy
9.788
entropy_ratio
0.9989

IC

categorical foreign_key long_tail null_rate
IC is a sparse cross-reference to the Index Catalogue (the IC supplement to the NGC), not an industry or classification code. It is populated for only ~3.3% of rows (null_rate 0.9671). Among the 460 non-null entries it spreads across 452 distinct values with near-maximal entropy (entropy_ratio 0.9984) and a top frequency of just 3 (top_rate 0.0065), so nearly every present value is unique. This combination of overwhelming nulls and near-unique codes makes it unusable as a grouping feature. Treatment: Drop or treat as a sparse lookup key; do not use as a categorical feature given 96.7% nulls and near-unique values. high · anthropic:claude-opus-4-7
n
13,969
nulls
13,509 (96.7%)
unique
452
top_value
5003
top_rate
0.006522
cardinality
452
entropy
8.806
entropy_ratio
0.9984

Cstar Names

categorical identifier long_tail null_rate
Central-star catalogue identifiers (HD/BD designations), occasionally listing multiple names comma-separated. Effectively empty: 99.38% null, with 87 non-null values across 13,969 rows, each appearing once (top_rate 0.0115, entropy_ratio ~1.0). Treatment: Drop or retain only as a cross-reference key; too sparse and unique to model. high · anthropic:claude-opus-4-7
n
13,969
nulls
13,882 (99.4%)
unique
87
top_value
BD -12 1172,HD 35914
top_rate
0.01149
cardinality
87
entropy
6.443
entropy_ratio
1

Identifiers

text identifier near_unique allcaps
This column holds astronomical object identifiers, with 12,179 unique values across 13,969 rows and almost everything in uppercase (allcaps_rate 0.998). The token distribution reveals catalog prefixes like 2MASX (9,805), MWSC, ESO, MCG, SDSS, and PGC, suggesting each cell concatenates cross-catalog designations (word_mean 5.26). Note the 12.81% null rate and that the column is near-unique, so it carries little aggregate signal on its own. Treatment: Treat as a join key to external catalogs; do not use as a model feature. high · anthropic:claude-opus-4-7
n
13,969
nulls
1,789 (12.8%)
unique
12,179
len_min
5
len_max
211
len_mean
71.84
len_median
74
len_p95
134
word_mean
5.255
word_median
5
n_empty
0
n_duplicates
1
duplicate_rate
8.21e-05
vocab_size
50,977
readability_flesch_mean
87.54
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0.9975
boilerplate_rate
0

Common names

categorical metadata long_tail null_rate
Vernacular labels for astronomical objects (e.g., 'Antennae Galaxies', 'Flame Nebula'), occasionally comma-concatenating multiple aliases in one cell. The column is 99.06% null across 13,969 rows, leaving only ~127 unique populated values with near-uniform distribution (entropy_ratio 0.998, top_rate 1.5%). Multi-name cells like 'Butterfly Galaxies,Siamese Twins' indicate the field is a delimited list rather than a clean category. Treatment: Treat as sparse free-form labels: split on comma into a name list and use only for display or lookup, not modelling. high · anthropic:claude-opus-4-7
n
13,969
nulls
13,838 (99.1%)
unique
127
top_value
Antennae Galaxies
top_rate
0.01527
cardinality
127
entropy
6.972
entropy_ratio
0.9977
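
Splitting a multi-name cell into a list, as the treatment suggests, takes only a few lines. The function name is illustrative, and the sketch assumes that a comma never appears inside a single name.

```python
def split_names(cell):
    """Split a comma-delimited names cell into a clean list; null or empty cells give []."""
    if not cell:
        return []
    return [name.strip() for name in cell.split(",") if name.strip()]

print(split_names("Butterfly Galaxies,Siamese Twins"))
print(split_names("Antennae Galaxies"))
print(split_names(None))
```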

NED notes

text metadata multilingual null_rate duplicates
Short astronomical annotations from the NGC/IC NED catalog, averaging 6.7 words and capped at 80 characters, describing object positions and identification caveats (e.g. 'In the Large Magellanic Cloud.', 'Confused HIPASS source'). 83.6% of rows are null and only 1,198 distinct strings appear across 13,969 rows, with a 47.6% duplicate rate driven by a small set of canned remarks. Language detection flags the field as multilingual but 780 of 783 detected samples are English; the bs/it/pt hits are almost certainly false positives on terse astronomy jargon. Treatment: Treat as sparse categorical-ish notes: bucket the top recurring phrases as flags and ignore the long tail rather than embedding free text. high · anthropic:claude-opus-4-7
n
13,969
nulls
11,683 (83.6%)
unique
1,198
len_min
14
len_max
80
len_mean
39.64
len_median
35
len_p95
70
word_mean
6.658
word_median
6
n_empty
0
n_duplicates
1,088
duplicate_rate
0.4759
vocab_size
1,836
readability_flesch_mean
60.05
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0.01575

OpenNGC notes

categorical free_text long_tail null_rate
Free-text curator notes on OpenNGC catalog entries, present for only 1.52% of rows (null_rate 0.9848). Among the 213 populated cells there are 159 distinct strings with high entropy (6.72, ratio 0.92), and the most common note — 'Identification taken from Corwin's catalog.' — covers just 13.6% of non-nulls. Content is heterogeneous provenance commentary referencing LEDA, NED, SIMBAD, and Corwin sources rather than a controlled vocabulary. Treatment: Treat as sparse free-text metadata; drop for modelling or keep as a provenance flag (note present vs absent). high · anthropic:claude-opus-4-7
n
13,969
nulls
13,756 (98.5%)
unique
159
top_value
Identification taken from Corwin’s catalog.
top_rate
0.1362
cardinality
159
entropy
6.719
entropy_ratio
0.9187

Sources

categorical metadata long_tail
This column encodes a per-row provenance manifest: a pipe-delimited list of astronomical measurement fields (Type, RA, Dec, Const, magnitudes, redshift, etc.) each tagged with a numeric source/catalog code. Despite 344 distinct combinations across 13,969 rows, the distribution is highly concentrated — the top template covers 41.2% of rows and entropy ratio is just 0.41, confirming the long_tail alert. There are no nulls, but the structured-string format means it is not directly usable as a category without parsing. Treatment: Parse into per-field source codes (one column per measurement) rather than treating the raw string as a category. high · anthropic:claude-opus-4-7
n
13,969
nulls
0 (0.0%)
unique
344
top_value
Type:1|RA:1|Dec:1|Const:99|MajAx:3|MinAx:3|PosAng:3|B-Mag:3|J-Mag:2|H-Mag:2|K-Mag:2|SurfBr:3|Hubble:3|RadVel:2|Redshift:2
top_rate
0.4116
cardinality
344
entropy
3.486
entropy_ratio
0.4138
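
The manifest format shown in top_value can be parsed into one source code per field. This sketch assumes every entry has the form 'Field:code' separated by '|', as in that example. Codes are kept as strings in case some are not plain integers, and the function name is illustrative.

```python
def parse_sources(manifest: str) -> dict:
    """Parse 'Field:code|Field:code|...' into a {field: code} mapping (codes kept as strings)."""
    pairs = (item.split(":", 1) for item in manifest.split("|") if item)
    return {field: code for field, code in pairs}

top_value = (
    "Type:1|RA:1|Dec:1|Const:99|MajAx:3|MinAx:3|PosAng:3|B-Mag:3|"
    "J-Mag:2|H-Mag:2|K-Mag:2|SurfBr:3|Hubble:3|RadVel:2|Redshift:2"
)
parsed = parse_sources(top_value)
print(parsed["Const"], parsed["RadVel"], len(parsed))
```

Fields missing from a row's manifest simply do not appear in the mapping, so a per-field column built from it would be null for those rows.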