saturn·

hyg hygdata v41

source /home/coolhand/data/celestial/hyg/hygdata_v41.csv 119,626 rows 37 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is the HYG star catalog (hygdata_v41.csv) with 119,626 stars and 37 columns covering positions (ra/dec, x/y/z), motion (pmra, pmdec, vx/vy/vz, rv), brightness (mag, absmag, lum, ci), and identifiers/classifications (hd, hip, spect, con, proper). The most informative single field is the spectral type 'spect': it has 4,310 distinct values but is dominated by a handful of classes (K0 ~8.6k, G5 ~6.0k, A0 ~4.9k), giving a clean view of stellar populations. Distance and luminosity are extremely right-skewed (lum skew ≈49, dist max 100,000 pc) with 10–15% outliers, so any analysis on those should use log scales. Radial velocity 'rv' is 81% zeros — effectively a 'measured vs not' flag rather than a continuous variable. Constellation 'con' is the most evenly distributed categorical (89 values, entropy ratio 0.95) led by Cen, UMa, and Her, making it a good grouping key.

citing: row_count · column_count · spect.top_values · spect.n_unique · dist.skew · dist.max · lum.skew · lum.max · rv.zero_rate · con.entropy_ratio · con.top_values · mag.median · absmag.skew

Schema

37 columns
Per-column summary: column · type · null rate · unique values · alerts
id · numeric · 0.0% · 119,626
hip · numeric · 1.4% · 117,951
hd · numeric · 17.3% · 98,825
hr · numeric · 92.4% · 9,029 · null_rate
gl · text · 0.0% · 3,802 · one_word, short_text, duplicates
bf · text · 0.0% · 3,066 · one_word, short_text, duplicates
proper · categorical · 0.0% · 465 · long_tail, imbalance
ra · numeric · 0.0% · 119,263
dec · numeric · 0.0% · 119,534
dist · numeric · 0.0% · 5,397 · high_skew, outliers
pmra · numeric · 0.0% · 25,644 · high_skew, outliers
pmdec · numeric · 0.0% · 23,226 · high_skew, outliers
rv · numeric · 0.0% · 1,714 · outliers
mag · numeric · 0.0% · 1,422
absmag · numeric · 0.0% · 13,452 · outliers
spect · text · 0.0% · 4,310 · one_word, allcaps, short_text, duplicates
ci · numeric · 1.6% · 2,439
x · numeric · 0.0% · 119,593 · outliers
y · numeric · 0.0% · 119,585 · outliers
z · numeric · 0.0% · 119,588 · outliers
vx · numeric · 0.0% · 21,555 · high_skew, outliers
vy · numeric · 0.0% · 25,826 · high_skew, outliers
vz · numeric · 0.0% · 23,037 · high_skew, outliers
rarad · numeric · 0.0% · 119,585
decrad · numeric · 0.0% · 119,585
pmrarad · numeric · 0.0% · 25,647 · high_skew, outliers
pmdecrad · numeric · 0.0% · 23,588 · outliers
bayer · categorical · 0.0% · 104 · imbalance
flam · categorical · 0.0% · 139 · imbalance
con · categorical · 0.0% · 89
comp · numeric · 0.0% · 3 · high_skew
comp_primary · numeric · 0.0% · 119,190
base · categorical · 0.0% · 651 · imbalance
lum · numeric · 0.0% · 13,465 · high_skew, outliers
var · text · 0.0% · 1,523 · one_word, short_text, duplicates
var_min · numeric · 85.8% · 6,248 · null_rate
var_max · numeric · 85.8% · 6,090 · null_rate

id

numeric identifier
This is a row identifier: 119,626 values, all unique, no nulls, ranging from 0 to 119,630 with a near-perfectly uniform distribution (mean 59,813.16, median 59,813.5, skew ~1.5e-05, kurtosis -1.20). The min of 0 and max of 119,630 against n=119,626 suggests a 0-based sequential id with a handful of gaps. No analytical signal lives here. Treatment: drop from modelling; retain only as a join key. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
119,626
min
0
max
119,630
mean
5.981e+04
median
5.981e+04
std
3.453e+04
q1
2.991e+04
q3
8.972e+04
iqr
5.981e+04
skew
1.471e-05
kurtosis
-1.2
n_outliers
0
outlier_rate
0
zero_rate
8.359e-06
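
For the `id` column, the "handful of gaps" in the description follows from simple arithmetic on the reported min, max, and unique count. A minimal sketch, assuming the ids are integers (the helper name is illustrative and not part of the profiler):

```python
def missing_id_count(min_id: int, max_id: int, n_unique: int) -> int:
    """Number of integers in [min_id, max_id] that never occur as an id."""
    return (max_id - min_id + 1) - n_unique


# Figures reported for `id`: min 0, max 119,630, 119,626 unique values.
gaps = missing_id_count(0, 119_630, 119_626)
print(gaps)  # 5
```

Five missing values out of 119,631 possible ones is consistent with a sequential key from which a few rows were removed.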

hip

numeric identifier
This is almost certainly the Hipparcos catalog identifier (HIP number) for stars: integer values running from 1 to 120404 with 117951 unique values across 119626 rows, near-perfectly uniform (skew ≈ 0.0002, kurtosis ≈ -1.2). The 1.4% null rate suggests some rows lack a Hipparcos cross-match. No outliers and no zeros, consistent with a catalog index rather than a measurement. Treatment: Treat as a catalog ID; left-join to Hipparcos metadata and exclude from modelling. high · anthropic:claude-opus-4-7
n
119,626
nulls
1,675 (1.4%)
unique
117,951
min
1
max
120,404
mean
5.917e+04
median
59,172
std
3.417e+04
q1
2.956e+04
q3
8.876e+04
iqr
59,198
skew
0.0001943
kurtosis
-1.2
n_outliers
0
outlier_rate
0
zero_rate
0

hd

numeric identifier
The 'hd' column is a numeric identifier, most likely the Henry Draper catalogue number: 119,626 rows hold 98,825 unique values, and the dataset summary groups it with the other catalogue identifiers. Values span 1 to 358,431 with a mean of 114,357 and median of 110,358, showing a roughly symmetric distribution (skew 0.28, kurtosis -0.73) and no flagged outliers. Notably, 17.34% of rows are null, meaning those stars have no cross-match under this identifier. Treatment: Treat as a high-cardinality id; drop from modelling and use only as a join key, filtering rather than imputing the 17% nulls, since an imputed identifier would point at the wrong star. medium · anthropic:claude-opus-4-7
n
119,626
nulls
20,741 (17.3%)
unique
98,825
min
1
max
358,431
mean
1.144e+05
median
110,358
std
7.418e+04
q1
46,723
q3
175,823
iqr
129,100
skew
0.2822
kurtosis
-0.7265
n_outliers
0
outlier_rate
0
zero_rate
0

hr

numeric feature null_rate
The 'hr' column is a numeric field populated for only 7.6% of the 119,626 rows (null_rate 0.9244), with values ranging from 1 to 9110 and 9,029 distinct values. Its near-zero skew (-0.003), flat kurtosis (-1.20), and mean (4563.9) almost equal to median (4566.0) indicate a near-uniform spread across 1 to 9110, which is the signature of a sequential catalogue number rather than a measurement. In a star catalogue this is most likely the Harvard Revised (Yale Bright Star Catalogue) number, not an hour or heart-rate value, and the 92% null rate then reflects how few stars carry one. Treatment: Treat as a sparse catalogue identifier; do not impute it, keep it only as a join key or a has-HR-number flag, and confirm the meaning against the catalogue documentation before use. medium · anthropic:claude-opus-4-7
n
119,626
nulls
110,585 (92.4%)
unique
9,029
min
1
max
9,110
mean
4564
median
4,566
std
2632
q1
2,283
q3
6,848
iqr
4,565
skew
-0.003426
kurtosis
-1.202
n_outliers
0
outlier_rate
0
zero_rate
0

gl

text foreign_key one_word short_text duplicates
This column 'gl' holds Gliese/GJ catalogue identifiers for stars (e.g., 'GJ 1293', 'Gl 914B'), with 'gj' and 'gl' being the dominant tokens (344 and 331 occurrences). It is overwhelmingly empty: 115,825 of 119,626 rows are blank, driving a 96.8% duplicate rate and leaving only ~3,800 unique designations. When populated, values are short single tokens (len_max 9, word_mean 1.03). Treatment: Treat as a sparse cross-catalogue star ID; left-join on it where present and ignore the empty majority. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
3,802
len_min
0
len_max
9
len_mean
0.2263
len_median
0
len_p95
0
word_mean
1.032
word_median
1
n_empty
115,825
n_duplicates
115,824
duplicate_rate
0.9682
vocab_size
677
readability_flesch_mean
4.808
emoji_rate
0
url_rate
0
one_word_rate
0.9682
allcaps_rate
0.01719
boilerplate_rate
0

bf

text metadata one_word short_text duplicates
This column appears to hold Bayer/Flamsteed star designations (e.g. "41The1Ori", "66Alp Gem"), with the trailing tokens being three-letter constellation abbreviations like leo, her, cyg, vir. It is overwhelmingly empty: 116,527 of 119,626 rows are blank strings, driving a 97.4% duplicate rate and a mean length of 0.22 characters. Among the 3,099 non-empty entries the values are nearly unique, suggesting this label only applies to a small subset of named/cataloged stars. Treatment: Treat empty strings as missing and use only as a sparse identifier for cataloged stars; drop from modelling features. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
3,066
len_min
0
len_max
10
len_mean
0.2206
len_median
0
len_p95
0
word_mean
1.064
word_median
1
n_empty
116,527
n_duplicates
116,560
duplicate_rate
0.9744
vocab_size
369
readability_flesch_mean
1.986
emoji_rate
0
url_rate
0
one_word_rate
0.9763
allcaps_rate
0
boilerplate_rate
0

proper

categorical metadata long_tail imbalance
This is the proper name of a star or named celestial object, populated for only a tiny fraction of rows. Empty strings dominate at 119,161 of 119,626 (top_rate 0.9961), leaving 465 named rows such as 'Sol', 'Alpheratz', and 'Caph', nearly all of them singletons (the cardinality of 465 counts the empty string as one value). Entropy ratio of 0.008 confirms the column carries almost no information in aggregate. Treatment: Treat as a sparse name lookup; convert blanks to null and use only as a display label, not a model feature. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
465
top_value
(empty string)
top_rate
0.9961
cardinality
465
entropy
0.07115
entropy_ratio
0.008029
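
Several text columns in this report (`proper`, `gl`, `bf`, `bayer`, `flam`, `base`, `var`) show a 0.0% null rate only because missing values are stored as empty strings. A minimal sketch of the "convert blanks to null" treatment; the helper name and the sample values are illustrative assumptions, not part of the profiler:

```python
from typing import List, Optional


def blank_to_none(value: Optional[str]) -> Optional[str]:
    """Map empty or whitespace-only strings to None; leave real values unchanged."""
    if value is None or value.strip() == "":
        return None
    return value


# Hypothetical sample of `proper` values; blanks stand in for unnamed stars.
sample: List[str] = ["Sol", "", "Alpheratz", "   ", "Caph"]
cleaned = [blank_to_none(v) for v in sample]
named_rate = sum(v is not None for v in cleaned) / len(cleaned)
```

After this step a profiler would report the true missing rate (about 99.6% for `proper`) instead of 0.0%.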

ra

numeric feature
Values span 0.0 to 23.998594 with a near-symmetric distribution (skew -0.012, kurtosis -1.20) and mean 12.09 close to median 12.13, consistent with right ascension expressed in hours. With 119263 unique values across 119626 rows, near-zero zero_rate, and no nulls or outliers, the column behaves like a continuous astronomical coordinate rather than a categorical feature. Treatment: Treat as a circular coordinate (e.g., encode via sine/cosine on the 0-24 hour range) before modelling. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
119,263
min
0
max
24
mean
12.09
median
12.13
std
6.887
q1
6.217
q3
18.12
iqr
11.9
skew
-0.01197
kurtosis
-1.198
n_outliers
0
outlier_rate
0
zero_rate
8.359e-06
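
The circular encoding recommended for `ra` can be sketched as follows, assuming the column is in hours on a 24-hour circle as the description infers. The function name is illustrative:

```python
import math


def encode_ra_hours(ra_hours: float) -> tuple:
    """Map right ascension in hours (0-24) to a (sin, cos) point on the unit circle."""
    angle = 2.0 * math.pi * ra_hours / 24.0
    return math.sin(angle), math.cos(angle)


# 23.99 h and 0.01 h are 0.02 h apart on the sky but almost 24 apart as raw numbers.
near_midnight = math.dist(encode_ra_hours(23.99), encode_ra_hours(0.01))
# 0 h and 12 h sit on opposite sides of the circle.
opposite = math.dist(encode_ra_hours(0.0), encode_ra_hours(12.0))
```

With this encoding, distances between encoded points track angular separation, so stars on either side of the 0 h / 24 h seam are treated as neighbours.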

dec

numeric feature
This column is almost certainly declination (dec), the celestial latitude coordinate, with values bounded in [-89.78, 89.57] degrees matching the full sky range. The distribution is nearly symmetric (skew 0.04) and platykurtic (kurtosis -1.02) with median near -1.64 and IQR spanning ~68 degrees, suggesting broad sky coverage rather than a concentrated survey footprint. With 119,534 unique values across 119,626 rows and no nulls or outliers, it behaves as a continuous astrometric coordinate. Treatment: Use as-is as a continuous coordinate, optionally pairing with RA for spatial features. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
119,534
min
-89.78
max
89.57
mean
-1.986
median
-1.64
std
40.96
q1
-36.42
q3
31.51
iqr
67.94
skew
0.03675
kurtosis
-1.019
n_outliers
0
outlier_rate
0
zero_rate
8.359e-06

dist

numeric feature high_skew outliers
`dist` is the distance column (parsecs, per the dataset summary), with 119,626 rows, no nulls, and 5,397 distinct values. The distribution is severely right-skewed (skew 2.97, kurtosis 6.79): the median is 213.68 with quartiles at 115.07 and 392.16 (IQR 277.1), yet the mean is 8,772.29 and the max reaches exactly 100,000, suggesting a capped or sentinel ceiling. Over 10% of rows (12,350) flag as outliers, and the std (27,890.67) dwarfs the IQR. The minimum is 0 (a single row, zero_rate 8.359e-06), which a plain log transform cannot handle. Treatment: Investigate the 100000 ceiling for sentinel encoding, set aside the zero-distance row, then log-transform before modelling. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
5,397
min
0
max
100,000
mean
8772
median
213.7
std
2.789e+04
q1
115.1
q3
392.2
iqr
277.1
skew
2.965
kurtosis
6.792
n_outliers
12,350
outlier_rate
0.1032
zero_rate
8.359e-06
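
A minimal sketch of the `dist` treatment, assuming that 100,000 is a placeholder rather than a measured distance (the report only suspects this) and that the single zero-distance row should be excluded from the log scale. The constant and function names are illustrative:

```python
import math
from typing import Optional

DIST_CEILING = 100_000.0  # assumed sentinel value; verify before relying on it


def log10_dist(dist: float) -> Optional[float]:
    """Return log10(dist), or None for zero distance and for the assumed sentinel."""
    if dist <= 0.0 or dist >= DIST_CEILING:
        return None
    return math.log10(dist)


median_log = log10_dist(213.7)     # reported median, about 2.33 on the log scale
sentinel = log10_dist(100_000.0)   # None: masked instead of transformed
origin = log10_dist(0.0)           # None: log of zero is undefined
```

Masking before transforming keeps the placeholder rows from forming an artificial spike at log10 = 5.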

pmra

numeric feature high_skew outliers
This is `pmra`, almost certainly proper motion in right ascension (mas/yr) for ~119k astronomical sources, centered near zero (median -1.68, mean -1.31) with an interquartile range of 27.64. The distribution is extremely heavy-tailed: skew 4.61, kurtosis 433.55, std 118.18, and extremes spanning -4432.65 to 6767.26, with 16.4% of rows (19,615) flagged as outliers. No nulls and only 0.04% exact zeros, so the field is densely populated but dominated by tail behaviour. Treatment: Winsorize or apply a signed log/arcsinh transform before modelling to tame the heavy tails. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
25,644
min
-4433
max
6767
mean
-1.307
median
-1.68
std
118.2
q1
-15.46
q3
12.18
iqr
27.64
skew
4.608
kurtosis
433.6
n_outliers
19,615
outlier_rate
0.164
zero_rate
0.0004013
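
The arcsinh transform suggested for `pmra` behaves like a signed logarithm: roughly linear near zero and logarithmic in the tails. A minimal sketch, using the reported IQR as the scale; that choice of scale is an assumption, not something the report prescribes:

```python
import math

PMRA_IQR = 27.64  # reported IQR of pmra, used here as an assumed scale


def asinh_scaled(value: float, scale: float = PMRA_IQR) -> float:
    """Signed, log-like compression of a heavy-tailed value."""
    return math.asinh(value / scale)


core = asinh_scaled(12.18)     # reported Q3, stays below 0.5
tail = asinh_scaled(6767.26)   # reported max, compressed to roughly 6.2
```

Unlike a plain log, this is defined for zero and negative values, which matters for a signed quantity such as proper motion.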

pmdec

numeric feature high_skew outliers
This is `pmdec`, almost certainly proper motion in declination (mas/yr) from an astrometric catalog. The bulk sits between quartiles of -22.4 and 3.77 around a median of -5.76, but the distribution is extremely heavy-tailed: kurtosis of 934.5, skew -2.60, and a min of -5813 against a max of 9999.99. That maximum looks like a sentinel or a value clipped at the field width rather than a measured motion; `pmdecrad` peaks at 5.007e-05, which is about 10,328 mas/yr if its unit is radians per year, so clipping is at least as plausible as a missing-value flag. About 14.4% of rows (17,188) fall outside the standard outlier fence. Treatment: Check the 9999.99 rows against `pmdecrad` before filtering them, then clip extreme tails before any modelling; consider robust-scaling rather than a log transform since values are signed. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
23,226
min
-5,813
max
1e+04
mean
-19.33
median
-5.76
std
112.5
q1
-22.4
q3
3.77
iqr
26.17
skew
-2.605
kurtosis
934.5
n_outliers
17,188
outlier_rate
0.1437
zero_rate
0.0003427
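
A minimal sketch of the `pmdec` treatment, under the assumption that the 9999.99 rows have been checked and found to be placeholders. The sample values are invented for illustration, and `statistics.quantiles` uses its default "exclusive" method, which can differ slightly from the quartiles the profiler reports:

```python
import math
import statistics

PMDEC_SENTINEL = 9999.99  # suspected placeholder; an assumption, not confirmed


def robust_scale_pmdec(values):
    """Drop suspected sentinel rows, then centre on the median and divide by the IQR."""
    kept = [v for v in values if not math.isclose(v, PMDEC_SENTINEL)]
    q1, median, q3 = statistics.quantiles(kept, n=4)
    iqr = q3 - q1  # a zero IQR would need separate handling
    return [(v - median) / iqr for v in kept]


# Hypothetical pmdec values in mas/yr, including one sentinel row.
sample = [-30.0, -20.0, -10.0, 0.0, 10.0, 9999.99]
scaled = robust_scale_pmdec(sample)
```

Because the median and IQR ignore the tails, a few extreme motions do not distort the scale the way a mean and standard deviation would.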

rv

numeric feature outliers
`rv` is a numeric feature dominated by zeros: the median, Q1, and Q3 are all 0.0, and 81.07% of values are exactly zero (zero_rate 0.8107). The non-zero tail is wide and heavy, spanning -386.9 to 471.0 with std 13.90 and kurtosis 116.06, producing 22,643 outliers (18.93% outlier rate). Despite the extremes, mean (-0.276) and skew (0.371) are modest, suggesting roughly balanced positive/negative excursions around a sparse zero baseline. Treatment: Split into a zero-indicator plus a signed-log transform of the non-zero magnitude before modelling. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
1,714
min
-386.9
max
471
mean
-0.2765
median
0
std
13.9
q1
0
q3
0
iqr
0
skew
0.3708
kurtosis
116.1
n_outliers
22,643
outlier_rate
0.1893
zero_rate
0.8107
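
The split recommended for `rv` can be sketched as below. It assumes that an exact zero means "not measured", as the dataset summary suggests; a star with a genuinely zero radial velocity would be misclassified under that assumption. Names are illustrative:

```python
import math


def split_rv(rv: float):
    """Return (has_rv, signed_log): a 0/1 measured flag and sign(rv) * log(1 + |rv|)."""
    has_rv = 0 if rv == 0.0 else 1
    signed_log = math.copysign(math.log1p(abs(rv)), rv)
    return has_rv, signed_log


unmeasured = split_rv(0.0)    # (0, 0.0)
fastest = split_rv(471.0)     # reported max, compressed to about 6.16
receding = split_rv(-386.9)   # reported min, keeps its negative sign
```

Keeping the flag as a separate feature lets a model distinguish "no measurement" from "small measured velocity".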

mag

numeric feature
Numeric column 'mag' looks like an astronomical magnitude reading: values are tightly clustered around a median of 8.46 with an IQR of 1.52, but the range stretches from -26.7 to 21.0. That extreme negative tail (consistent with very bright objects like the Sun at -26.7) drives the high kurtosis of 6.35 and flags 5,241 outliers (4.4%) despite near-symmetric skew of 0.16. No nulls or zeros, and 1,422 distinct values across 119,626 rows suggest measurements rounded to ~0.01. Treatment: Use as-is for modelling but consider winsorizing or separating extreme bright-object outliers before fitting. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
1,422
min
-26.7
max
21
mean
8.429
median
8.46
std
1.428
q1
7.65
q3
9.17
iqr
1.52
skew
0.1607
kurtosis
6.353
n_outliers
5,241
outlier_rate
0.04381
zero_rate
0
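
The outlier counts in this report appear to come from an IQR fence. A minimal sketch of that rule applied to the `mag` quartiles, assuming the conventional multiplier of 1.5 (the report does not state the multiplier it uses):

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return the (low, high) bounds outside which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Reported quartiles for `mag`: q1 = 7.65, q3 = 9.17.
low, high = tukey_fences(7.65, 9.17)
sun_is_outlier = -26.7 < low  # the reported minimum lies far below the lower fence
```

Under these assumptions any star brighter than about magnitude 5.37 or fainter than about 11.45 is flagged, so the naked-eye stars count as outliers even though they are valid observations.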

absmag

numeric feature outliers
This is a numeric `absmag` field, almost certainly absolute magnitude on an astronomical scale, ranging from -16.68 to 19.629 with a median of 1.495 and IQR of 3.021. The distribution is left-skewed (skew -1.37) with heavier tails than normal (kurtosis 3.17), and 11.29% of rows (13,508) flag as outliers — consistent with a long bright-end tail rather than data errors. No nulls and only 0.015% zeros across 119,626 rows, with 13,452 unique values suggesting quantised reporting. Treatment: Keep as-is for modelling but inspect the bright-end tail; consider robust scaling rather than dropping outliers since they are physically meaningful. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
13,452
min
-16.68
max
19.63
mean
0.9907
median
1.495
std
4.353
q1
0.138
q3
3.159
iqr
3.021
skew
-1.37
kurtosis
3.168
n_outliers
13,508
outlier_rate
0.1129
zero_rate
0.0001505

spect

text feature one_word allcaps short_text duplicates
This column holds stellar spectral type codes (e.g. K0, G5, A0, F8): short one-word tokens averaging 3.4 characters with a 1,532-word vocabulary across 119,626 rows. Values are highly repetitive (96.4% duplicate rate, only 4,310 unique), which is expected for a categorical taxonomy, and 3,048 rows are empty. The 45.8% allcaps rate shows that many codes are not purely uppercase. That is not necessarily inconsistency: spectral notation uses lowercase letters deliberately (for example the prefix sd for subdwarfs and suffixes such as e, p, n, or m for emission lines and peculiarities), so blanket uppercasing would erase information. Treatment: Keep the raw code, derive a coarse class (leading letter plus subclass digit) as the categorical feature, group rare codes, and treat the 3,048 empties as missing. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
4,310
len_min
0
len_max
12
len_mean
3.376
len_median
3
len_p95
8
word_mean
1.009
word_median
1
n_empty
3,048
n_duplicates
115,316
duplicate_rate
0.964
vocab_size
1,532
readability_flesch_mean
98.19
emoji_rate
0
url_rate
0
one_word_rate
0.9937
allcaps_rate
0.4584
boilerplate_rate
0
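
One way to reduce the 4,310 `spect` codes to a handful of groups is to extract the leading class letter and subclass digit. The sketch below leaves the case of the raw code untouched, on the assumption that lowercase prefixes such as sd are meaningful. The pattern is illustrative and covers only the O, B, A, F, G, K, M sequence; other codes (white dwarfs, carbon stars, and so on) return None:

```python
import re
from typing import Optional, Tuple

# Optional lowercase luminosity prefix, then class letter, then optional subclass.
_SPECT_RE = re.compile(r"^\s*(?:sd|d|g)?([OBAFGKM])(\d(?:\.\d)?)?")


def spectral_class(spect: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (class_letter, subclass) for a spectral code, or None if unrecognised."""
    match = _SPECT_RE.match(spect)
    if match is None:
        return None
    return match.group(1), match.group(2)


examples = {code: spectral_class(code) for code in ["K0III", "G5", "sdM4", "B9.5V", ""]}
```

Grouping on the first element alone gives seven categories, and grouping on both gives a few dozen, either of which is far easier to encode than the raw codes.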

ci

numeric feature
Numeric feature 'ci' spans -0.4 to 5.46 with mean 0.71 and median 0.616, suggesting a bounded continuous measurement (possibly a colour index or similar physical quantity given the name). Distribution is mildly right-skewed (0.37) with light tails (kurtosis -0.26) and only 208 outliers (0.18%). Negative values exist but zeros are rare (0.15%), and 1.58% of rows are null. Treatment: Impute the 1.58% nulls and use as-is; mild skew does not require transformation. high · anthropic:claude-opus-4-7
n
119,626
nulls
1,891 (1.6%)
unique
2,439
min
-0.4
max
5.46
mean
0.7115
median
0.616
std
0.4932
q1
0.3485
q3
1.083
iqr
0.7345
skew
0.3728
kurtosis
-0.2552
n_outliers
208
outlier_rate
0.001767
zero_rate
0.001537

x

numeric feature outliers
Numeric feature 'x' is effectively continuous (119,593 unique values across 119,626 rows, no nulls) and centered near zero (median -1.05, Q1/Q3 of -89.04/86.27). The distribution has extreme tails: min -99,950 and max 99,982 push the standard deviation to 15,182 against an IQR of just 175, with kurtosis 19.16 and 13.07% of rows flagged as outliers. Slight negative skew (-0.22) suggests the tails are roughly symmetric in direction but heavy in magnitude. Treatment: Winsorize or robust-scale before modelling to contain the heavy tails. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
119,593
min
-9.995e+04
max
9.998e+04
mean
-235.3
median
-1.05
std
1.518e+04
q1
-89.04
q3
86.27
iqr
175.3
skew
-0.2229
kurtosis
19.16
n_outliers
15,635
outlier_rate
0.1307
zero_rate
0

y

numeric feature outliers
This is most likely the Cartesian y coordinate of each star (the dataset summary lists x/y/z among the position fields): a continuous numeric feature centered near zero (median -1.24, mean -39.3) but with an extraordinarily wide spread (std ~17249, min -99979, max 99996). The distribution is roughly symmetric (skew 0.12) yet heavy-tailed (kurtosis 18.0), and 13.9% of values flag as outliers: the bulk sits within an IQR of ~183 while extremes reach ±100k. Values are near-unique (119585 of 119626) with effectively no zeros or nulls. The ±100k extremes match the 100,000 ceiling in `dist`, so they may be inherited from placeholder distances rather than measured positions. Treatment: Resolve the `dist` ceiling first, then winsorize or robust-scale before modelling to tame the heavy tails. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
119,585
min
-9.998e+04
max
1e+05
mean
-39.32
median
-1.239
std
1.725e+04
q1
-91.18
q3
91.87
iqr
183
skew
0.1166
kurtosis
18.03
n_outliers
16,582
outlier_rate
0.1386
zero_rate
8.359e-06

z

numeric feature outliers
Column z is a high-cardinality numeric feature (119,588 unique values across 119,626 rows, no nulls) centered near zero with median -3.42 and IQR roughly -107.6 to 95.0. The distribution has extreme tails: min -99,964.98, max 99,862.51, std 18,074.56, and kurtosis 15.49, with 13.7% of rows (16,441) flagged as outliers despite skew of only -0.27. The IQR is two orders of magnitude smaller than the standard deviation, indicating a tight core swamped by heavy symmetric tails. Treatment: Winsorize or apply a signed log transform before modelling to tame the heavy tails. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
119,588
min
-9.996e+04
max
9.986e+04
mean
-235
median
-3.416
std
1.807e+04
q1
-107.6
q3
94.97
iqr
202.5
skew
-0.2722
kurtosis
15.49
n_outliers
16,441
outlier_rate
0.1374
zero_rate
8.359e-06
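
If `x`, `y`, and `z` are the Cartesian form of (`rarad`, `decrad`, `dist`), as the dataset summary's grouping of them under positions suggests, their ±100k extremes would follow directly from rows where `dist` is 100,000. A minimal sketch of the standard spherical-to-Cartesian conversion; the axis convention is an assumption and has not been checked against the catalogue:

```python
import math


def equatorial_to_cartesian(ra_rad: float, dec_rad: float, dist: float):
    """Convert right ascension, declination (radians) and distance to (x, y, z)."""
    cos_dec = math.cos(dec_rad)
    return (
        dist * cos_dec * math.cos(ra_rad),
        dist * cos_dec * math.sin(ra_rad),
        dist * math.sin(dec_rad),
    )


x, y, z = equatorial_to_cartesian(0.0, 0.0, 10.0)
far = equatorial_to_cartesian(1.0, -0.5, 100_000.0)  # hypothetical star at the ceiling
far_radius = math.sqrt(sum(c * c for c in far))
```

No coordinate can exceed the distance in absolute value, which would explain why all three columns stop just short of ±100,000.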

vx

numeric feature high_skew outliers
`vx` is a numeric feature centered tightly around zero (median 1.3e-07, IQR 1.94e-05) but with a symmetric extreme range of ±0.10227249 — almost certainly a velocity-like component (x-axis). The distribution is pathologically heavy-tailed: kurtosis 1307.6 and skew -11.5, with 13.2% of values flagged as outliers despite std being only 0.00178. The exact symmetry of min and max suggests a hard clipping bound at ±0.10227249. Treatment: Apply a robust scaler or signed-log transform before modelling; investigate the ±0.10227249 clipping boundary. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
21,555
min
-0.1023
max
0.1023
mean
-2.891e-05
median
1.3e-07
std
0.001782
q1
-1.033e-05
q3
9.09e-06
iqr
1.942e-05
skew
-11.51
kurtosis
1308
n_outliers
15,752
outlier_rate
0.1317
zero_rate
0.0004848

vy

numeric feature high_skew outliers
Likely a velocity component (vy) for ~119k objects, with values clustered tightly around zero (median 1.18e-05, IQR 3.4e-05) but spanning -0.102 to 0.102. The distribution is extraordinarily heavy-tailed (skew 15.6, kurtosis 678) and 12.0% of rows fall outside the Tukey fences, so a small minority of fast movers dominate the variance (std 0.0022 vs IQR ~3e-05). Treatment: Apply a signed log or robust scaler before modelling to tame the heavy tails. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
25,826
min
-0.1023
max
0.1023
mean
0.0002164
median
1.182e-05
std
0.002226
q1
-1.86e-06
q3
3.209e-05
iqr
3.395e-05
skew
15.59
kurtosis
678.4
n_outliers
14,368
outlier_rate
0.1201
zero_rate
0.0003427

vz

numeric feature high_skew outliers
A signed numeric quantity centred almost exactly on zero (median -6.23e-06, mean -1.57e-04) with an extremely tight IQR of 2.31e-05 — consistent with a vertical velocity or rate-of-change feature (the name 'vz' suggests a z-axis velocity). The distribution is pathologically heavy-tailed: kurtosis 1029.85, skew -20.30, and symmetric extremes at ±0.10227249 produce 15,774 outliers (13.2%). Despite 119,626 rows there are only 23,037 unique values, hinting at quantisation or repeated stationary readings. Treatment: Clip or winsorise the symmetric ±0.102 tails and consider a signed log (e.g. asinh) transform before modelling. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
23,037
min
-0.1023
max
0.1023
mean
-0.0001566
median
-6.23e-06
std
0.00195
q1
-1.998e-05
q3
3.147e-06
iqr
2.313e-05
skew
-20.3
kurtosis
1030
n_outliers
15,774
outlier_rate
0.1319
zero_rate
0.000535

rarad

numeric feature
Values span 0 to 6.2828 with mean 3.166 and median 3.175, consistent with a right ascension expressed in radians (0 to 2π). The distribution is essentially symmetric (skew -0.012) and platykurtic (kurtosis -1.198), close to uniform across the circle, with no outliers and only one zero out of 119,626 rows. With 119,585 unique values, this is a high-resolution continuous coordinate. Treatment: Encode as sin/cos pair to preserve circular continuity before modelling. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
119,585
min
0
max
6.283
mean
3.166
median
3.175
std
1.803
q1
1.628
q3
4.743
iqr
3.115
skew
-0.01197
kurtosis
-1.198
n_outliers
0
outlier_rate
0
zero_rate
8.359e-06
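
The reported statistics for `rarad` agree with `ra` converted from hours to radians, which can be checked with a short sketch. That `rarad` is derived from `ra` in this way is an inference from the numbers, not something the report states:

```python
import math


def hours_to_radians(ra_hours: float) -> float:
    """Convert right ascension from hours (24 h = full circle) to radians."""
    return ra_hours * math.pi / 12.0


max_rad = hours_to_radians(23.998594)  # reported ra max, close to the rarad max of 6.283
mean_rad = hours_to_radians(12.09)     # reported ra mean, close to the rarad mean of 3.166
```

If the relationship holds row by row, `ra` and `rarad` carry the same information and only one of them needs to be kept.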

decrad

numeric feature
This column appears to be a declination angle expressed in radians, ranging from -1.567 to 1.563 (close to ±π/2). The distribution is near-symmetric (skew 0.037) with negative kurtosis (-1.02), indicating a flatter-than-normal spread, and 119,585 of 119,626 values are unique with no nulls or outliers. The mean (-0.035) and median (-0.029) sit near zero, consistent with an angular coordinate covering most of the celestial sphere. Treatment: Use directly as a numeric feature. Unlike right ascension, declination is bounded at ±π/2 and does not wrap around, so it needs no circular encoding. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
119,585
min
-1.567
max
1.563
mean
-0.03466
median
-0.02862
std
0.715
q1
-0.6357
q3
0.55
iqr
1.186
skew
0.03675
kurtosis
-1.019
n_outliers
0
outlier_rate
0
zero_rate
8.359e-06

pmrarad

numeric feature high_skew outliers
`pmrarad` is a tiny-magnitude signed numeric feature centered near zero (mean -6.4e-09, median -8.1e-09) with values on the order of 1e-7 to 1e-5, consistent with a proper-motion-in-RA quantity expressed in radians. The distribution is highly non-Gaussian: skew 4.6, kurtosis 433.6, and 16.4% of rows flagged as outliers, with extremes reaching 3.28e-05 against an IQR of just 1.34e-07. Nulls are absent and only 0.04% are exact zeros, so the heavy tails are real rather than artefacts of missingness. Treatment: Robust-scale or winsorize before modelling; do not assume normality given the heavy tails. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
25,647
min
-2.149e-05
max
3.281e-05
mean
-6.4e-09
median
-8.145e-09
std
5.729e-07
q1
-7.495e-08
q3
5.905e-08
iqr
1.34e-07
skew
4.607
kurtosis
433.6
n_outliers
19,615
outlier_rate
0.164
zero_rate
0.0004013

pmdecrad

numeric feature outliers
This is `pmdecrad`, proper motion in declination expressed in radians (per time unit), centred near zero with median -2.79e-08 and IQR of about 1.27e-07. The distribution is extremely heavy-tailed: kurtosis of 997.6, skew of -1.99, and 17,187 outliers (14.4% of rows) stretching from -2.82e-05 to 5.01e-05 against a std of 5.47e-07. Values are dense and continuous (23,588 unique across 119,626 rows) with no nulls and essentially no zeros. Treatment: Rescale to mas/yr and consider a robust or signed-log transform before modelling to tame the heavy tails. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
23,588
min
-2.818e-05
max
5.007e-05
mean
-9.363e-08
median
-2.793e-08
std
5.467e-07
q1
-1.086e-07
q3
1.828e-08
iqr
1.269e-07
skew
-1.992
kurtosis
997.6
n_outliers
17,187
outlier_rate
0.1437
zero_rate
0.0003344
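
The rescaling to mas/yr suggested for `pmdecrad` is a fixed unit conversion. A minimal sketch, assuming the column is in radians per year (the report only says "per time unit"):

```python
import math

# Milliarcseconds in one radian: degrees per radian * 3600 arcsec * 1000 mas.
MAS_PER_RADIAN = math.degrees(1.0) * 3600.0 * 1000.0


def rad_to_mas(value_rad: float) -> float:
    """Convert an angle (or angular rate) from radians to milliarcseconds."""
    return value_rad * MAS_PER_RADIAN


iqr_mas = rad_to_mas(1.269e-07)  # reported pmdecrad IQR, close to the pmdec IQR of 26.17
max_mas = rad_to_mas(5.007e-05)  # reported pmdecrad max, a little above 10,000 mas
```

The converted IQR matches the `pmdec` IQR, which supports reading the two columns as the same quantity in different units. The converted maximum exceeds the 9999.99 recorded in `pmdec`, which is worth keeping in mind when deciding whether that value is a sentinel.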

bayer

categorical metadata imbalance
This is the Bayer designation for stars (Greek-letter prefix like 'Alp', 'Bet', 'Gam', 'Del'), with 104 distinct values across 119,626 rows. It is overwhelmingly empty: 118,089 of 119,626 rows (top_rate 0.987) carry no Bayer letter, leaving entropy at just 0.171 (entropy_ratio 0.026). The non-empty tail is roughly evenly distributed across the Greek alphabet, with Alpha (80) and Beta (77) most common. Treatment: Treat empty as missing and either drop or one-hot the small non-empty subset; near-constant signal. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
104
top_value
(empty string)
top_rate
0.9872
cardinality
104
entropy
0.171
entropy_ratio
0.02552

flam

categorical feature imbalance
The column 'flam' most likely holds the Flamsteed number that pairs with `bayer` and `con` in star designations. It is overwhelmingly empty: 116,889 of 119,626 rows (97.7%) hold the blank string. The remaining 2.3% (2,737 rows) spreads across 138 distinct values that look like small integers ('2','4','5','7'…'16'), the most frequent of which appear roughly 48-53 times each. Entropy ratio of 0.043 confirms almost no information content, and the imbalance alert is warranted. Treatment: Drop or treat as near-constant; the 2.3% non-empty integer-like values are designation labels, not magnitudes, and are too sparse to model directly. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
139
top_value
(empty string)
top_rate
0.9771
cardinality
139
entropy
0.3074
entropy_ratio
0.04319

con

categorical feature
Three-letter IAU constellation abbreviations (Cen, UMa, Her, Cyg...) across 119,626 rows with zero nulls. There are 89 distinct values against the 88 IAU constellations, so one value is probably a blank or non-standard code and is worth checking. The distribution is remarkably flat: entropy ratio is 0.949 and the most common value, Cen, accounts for only 3.57% of records, so no single constellation dominates. Useful as a sky-region grouping key rather than a predictive feature on its own. Treatment: one-hot or target-encode if used in modelling; otherwise keep as a categorical group-by key. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
89
top_value
Cen
top_rate
0.03569
cardinality
89
entropy
6.147
entropy_ratio
0.9493
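
The entropy ratio reported for categorical columns appears to be Shannon entropy in bits divided by its maximum for the observed cardinality: 6.147 / log2(89) reproduces the 0.9493 shown for `con`. A minimal sketch under that assumption, with illustrative names:

```python
import math
from collections import Counter


def entropy_ratio(values) -> float:
    """Shannon entropy (base 2) of the values, divided by log2 of the number of categories."""
    counts = Counter(values)
    total = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # a constant column carries no information
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(k)


uniform = entropy_ratio(["Cen", "UMa", "Her", "Cyg"])  # 1.0: perfectly even
constant = entropy_ratio(["Cen", "Cen", "Cen"])        # 0.0: a single value
reported_con = 6.147 / math.log2(89)                   # about 0.949
```

A ratio near 1 means the categories are almost equally frequent, which is why `con` works well as a grouping key while `proper` (0.008) does not.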

comp

numeric feature high_skew
The column 'comp' is a numeric field with only 3 distinct values bounded between 1.0 and 3.0, with a median, Q1, and Q3 all equal to 1.0, indicating it is effectively a low-cardinality code or flag rather than a continuous measure. The mean of 1.0048 shows the value 1 dominates almost entirely, with only 536 outliers (0.45%) deviating, producing extreme skew (16.7) and kurtosis (311.1). Treatment: Cast to a categorical/ordinal code; the near-constant distribution offers little signal for modelling. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
3
min
1
max
3
mean
1.005
median
1
std
0.0739
q1
1
q3
1
iqr
0
skew
16.71
kurtosis
311.1
n_outliers
536
outlier_rate
0.004481
zero_rate
0

comp_primary

numeric identifier
The column 'comp_primary' contains 119,626 nearly unique numeric values (119,190 distinct) ranging from 0 to 119,630 with mean 59,641.96 and median 59,634.5. Its range matches `id` exactly, and the near-perfect symmetry (skew 0.0004), negative kurtosis (-1.20), and quartiles (Q1 29,815.25, Q3 89,462.75) follow a uniform distribution over that range, so it behaves like a reference to `id` rather than a substantive measurement. The 436 repeated values (119,626 minus 119,190) fit a reading in which companion stars point at the `id` of their primary, which would also explain the name and the `comp` column; this should be confirmed against the catalogue documentation. Treatment: Drop before modelling; keep only as a key for grouping components of the same system. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
119,190
min
0
max
119,630
mean
5.964e+04
median
5.963e+04
std
3.444e+04
q1
2.982e+04
q3
8.946e+04
iqr
5.965e+04
skew
0.0003989
kurtosis
-1.2
n_outliers
0
outlier_rate
0
zero_rate
8.359e-06

base

categorical metadata imbalance
The 'base' column is a categorical field that is effectively empty: 118,540 of 119,626 rows (top_rate 0.991) hold the empty string, leaving only ~1,086 rows distributed across 650 other values. The non-empty entries look like Gliese star catalog identifiers (e.g. 'Gl 57.1', 'Gl 60'), each appearing at most 3 times. Entropy ratio of 0.017 confirms almost no information content. Treatment: Drop or binarize as 'has_base_id'; near-constant and unsuitable as a feature. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
651
top_value
(empty string)
top_rate
0.9909
cardinality
651
entropy
0.1587
entropy_ratio
0.01698

lum

numeric feature high_skew outliers
This is a luminosity-like numeric feature spanning roughly 1.2e-06 to 4.09e+08, with a median of about 21.98 but a mean of 356526 — clear evidence of an extremely heavy right tail. Skew of 49.27 and kurtosis of 3885.57 confirm a pathological distribution, and 17485 values (14.6%) fall outside the IQR fence. Of 119626 rows, 13465 are unique with no nulls or zeros, so the spread is genuine rather than padded. Treatment: Apply a log transform before any distance- or variance-based modelling. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
13,465
min
1.226e-06
max
4.093e+08
mean
3.565e+05
median
21.98
std
3.341e+06
q1
4.747
q3
76.7
iqr
71.95
skew
49.27
kurtosis
3886
n_outliers
17,485
outlier_rate
0.1462
zero_rate
0
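
The reported `lum` statistics line up with `absmag` through the standard magnitude-luminosity relation, which suggests `lum` is derived from `absmag` rather than measured independently. A minimal sketch; the solar absolute magnitude of 4.85 is an assumption chosen because it reproduces the reported values:

```python
SUN_ABSMAG = 4.85  # assumed reference value; it reproduces the reported lum figures


def lum_from_absmag(absmag: float) -> float:
    """Luminosity in solar units from absolute magnitude (2.5 mag = factor of 10)."""
    return 10.0 ** (0.4 * (SUN_ABSMAG - absmag))


median_lum = lum_from_absmag(1.495)    # absmag median -> about 21.98, the lum median
faintest = lum_from_absmag(19.629)     # absmag max -> about 1.226e-06, the lum min
brightest = lum_from_absmag(-16.68)    # absmag min -> about 4.093e+08, the lum max
```

If the relation holds row by row, log10(lum) is a linear function of `absmag`, so a model gains nothing from including both.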

var

text feature one_word short_text duplicates
Column 'var' is a sparse short-code field: 113,634 of 119,626 rows (n_empty) are blank and the remaining values are 1–5 character tokens like 'R', 'S', 'T', 'RS'. Duplicate_rate is 0.987, one_word_rate is 1.0, and len_max is 5, suggesting an abbreviation or designation rather than free text; alongside `var_min` and `var_max` it is probably a variable-star designation. Blanks are stored as empty strings, which is why null_rate reads 0.0 even though about 95% of rows carry no value. With 1,523 distinct values spread over roughly 6,000 populated rows, the column is not low-cardinality. Treatment: Convert blanks to missing, then reduce to an is-variable flag or group rare codes before encoding. high · anthropic:claude-opus-4-7
n
119,626
nulls
0 (0.0%)
unique
1,523
len_min
0
len_max
5
len_mean
0.14
len_median
0
len_p95
1
word_mean
1
word_median
1
n_empty
113,634
n_duplicates
118,103
duplicate_rate
0.9873
vocab_size
597
readability_flesch_mean
4.609
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.0177
boilerplate_rate
0

var_min

numeric feature null_rate
Numeric feature 'var_min' is populated for only ~14% of rows (null_rate 0.858), making it a sparse signal. Among the 16,991 observed values it ranges from -1.333 to 14.902 with mean 9.50 and median 9.849, and is left-skewed (skew -0.93) with mild kurtosis (1.25). About 2.6% of present values fall outside the IQR fence (449 outliers), and only 6,248 distinct values appear. Treatment: Impute or add a missingness indicator before modelling given the 85.8% null rate. high · anthropic:claude-opus-4-7
n
119,626
nulls
102,635 (85.8%)
unique
6,248
min
-1.333
max
14.9
mean
9.502
median
9.849
std
1.781
q1
8.526
q3
10.71
iqr
2.181
skew
-0.9339
kurtosis
1.251
n_outliers
449
outlier_rate
0.02643
zero_rate
0
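
For sparse numeric columns such as `var_min` and `var_max`, the "missingness indicator plus imputation" treatment can be sketched as below. Median imputation is one assumed choice among several, and the sample values are invented:

```python
import statistics
from typing import List, Optional, Sequence, Tuple


def impute_with_flag(values: Sequence[Optional[float]]) -> List[Tuple[float, int]]:
    """Return (value, was_missing) pairs, filling gaps with the observed median."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)  # raises StatisticsError if nothing is observed
    return [(fill if v is None else v, int(v is None)) for v in values]


# Hypothetical var_min values; None marks stars with no recorded variability range.
pairs = impute_with_flag([9.0, None, 11.0, None])
```

With 85.8% of rows missing, the flag is likely to carry more signal than the imputed values, so restricting the analysis to the populated subset is also a reasonable option.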

var_max

numeric feature null_rate
Numeric feature 'var_max' is missing for 85.8% of the 119,626 rows, leaving 16,991 populated values spread over 6,090 distinct numbers. Its range (-1.523 to 13.702) and its pairing with `var` and `var_min` suggest one end of a variable star's magnitude range, though the profile alone cannot confirm that. Among observed values it centers near a median of 9.646 with mean 9.259 and is left-skewed (-0.97) with 325 outliers (1.9%). The dominant concern is the null rate, not the distribution shape. Treatment: Add a missingness indicator and impute or restrict modelling to the populated subset given the 85.8% null rate. high · anthropic:claude-opus-4-7
n
119,626
nulls
102,635 (85.8%)
unique
6,090
min
-1.523
max
13.7
mean
9.259
median
9.646
std
1.742
q1
8.243
q3
10.49
iqr
2.249
skew
-0.9704
kurtosis
1.128
n_outliers
325
outlier_rate
0.01913
zero_rate
0