hyg hygdata v41
Reading
This is the HYG star catalog (hygdata_v41.csv) with 119,626 stars and 37 columns covering positions (ra/dec, x/y/z), motion (pmra, pmdec, vx/vy/vz, rv), brightness (mag, absmag, lum, ci), and identifiers/classifications (hd, hip, spect, con, proper). The most informative single field is the spectral type 'spect': it has 4,310 distinct values but is led by a handful of classes (K0 ~8.6k, G5 ~6.0k, A0 ~4.9k), giving a clean view of stellar populations. Distance and luminosity are extremely right-skewed (lum skew ≈49, dist max 100,000 pc) with 10–15% outliers, so any analysis on those should use log scales, after checking whether the 100,000 pc ceiling is a placeholder rather than a real distance. Radial velocity 'rv' is 81% zeros, so it works better as a 'measured vs not' flag than as a continuous variable. Constellation 'con' is the most evenly distributed categorical (89 values, entropy ratio 0.95), led by Cen, UMa, and Her, making it a good grouping key.
citing: row_count · column_count · spect.top_values · spect.n_unique · dist.skew · dist.max · lum.skew · lum.max · rv.zero_rate · con.entropy_ratio · con.top_values · mag.median · absmag.skew
Charts the summary said to look at first
| chars | count |
|---|---|
| 0 | 3048 |
| 1 | 1542 |
| 2 | 54685 |
| 3 | 20214 |
| 4 | 6547 |
| 5 | 17615 |
| 6 | 6693 |
| 7 | 1634 |
| 8 | 4927 |
| 9 | 1161 |
| 10 | 703 |
| 11 | 746 |
| 12 | 111 |
| value | count | share |
|---|---|---|
| Cen | 4270 | 3.6% |
| UMa | 3616 | 3.0% |
| Her | 3434 | 2.9% |
| Cyg | 3116 | 2.6% |
| Hya | 3061 | 2.6% |
| Cet | 3030 | 2.5% |
| Vir | 2921 | 2.4% |
| Eri | 2789 | 2.3% |
| Peg | 2744 | 2.3% |
| Dra | 2722 | 2.3% |
| Sgr | 2504 | 2.1% |
| Boo | 2477 | 2.1% |
| Pup | 2427 | 2.0% |
| Cas | 2352 | 2.0% |
| Tau | 2281 | 1.9% |
| Oph | 2270 | 1.9% |
| Vel | 2238 | 1.9% |
| Aqr | 2188 | 1.8% |
| Leo | 2165 | 1.8% |
| Car | 2162 | 1.8% |
| bin | count |
|---|---|
| -26.7 – -25.51 | 1 |
| -25.51 – -24.31 | 0 |
| -24.31 – -23.12 | 0 |
| -23.12 – -21.93 | 0 |
| -21.93 – -20.74 | 0 |
| -20.74 – -19.54 | 0 |
| -19.54 – -18.35 | 0 |
| -18.35 – -17.16 | 0 |
| -17.16 – -15.97 | 0 |
| -15.97 – -14.77 | 0 |
| -14.77 – -13.58 | 0 |
| -13.58 – -12.39 | 0 |
| -12.39 – -11.2 | 0 |
| -11.2 – -10 | 0 |
| -10 – -8.812 | 0 |
| -8.812 – -7.62 | 0 |
| -7.62 – -6.427 | 0 |
| -6.427 – -5.235 | 0 |
| -5.235 – -4.042 | 0 |
| -4.042 – -2.85 | 0 |
| -2.85 – -1.657 | 0 |
| -1.657 – -0.465 | 2 |
| -0.465 – 0.7275 | 9 |
| 0.7275 – 1.92 | 33 |
| 1.92 – 3.113 | 152 |
| 3.113 – 4.305 | 533 |
| 4.305 – 5.498 | 2104 |
| 5.498 – 6.69 | 8246 |
| 6.69 – 7.883 | 26095 |
| 7.883 – 9.075 | 49082 |
| 9.075 – 10.27 | 24650 |
| 10.27 – 11.46 | 5990 |
| 11.46 – 12.65 | 1650 |
| 12.65 – 13.85 | 582 |
| 13.85 – 15.04 | 305 |
| 15.04 – 16.23 | 122 |
| 16.23 – 17.42 | 44 |
| 17.42 – 18.62 | 19 |
| 18.62 – 19.81 | 5 |
| 19.81 – 21 | 2 |
| bin | count |
|---|---|
| 0 – 2500 | 109401 |
| 2500 – 5000 | 0 |
| 5000 – 7500 | 0 |
| 7500 – 1e+04 | 0 |
| 1e+04 – 1.25e+04 | 0 |
| 1.25e+04 – 1.5e+04 | 0 |
| 1.5e+04 – 1.75e+04 | 0 |
| 1.75e+04 – 2e+04 | 0 |
| 2e+04 – 2.25e+04 | 0 |
| 2.25e+04 – 2.5e+04 | 0 |
| 2.5e+04 – 2.75e+04 | 0 |
| 2.75e+04 – 3e+04 | 0 |
| 3e+04 – 3.25e+04 | 0 |
| 3.25e+04 – 3.5e+04 | 0 |
| 3.5e+04 – 3.75e+04 | 0 |
| 3.75e+04 – 4e+04 | 0 |
| 4e+04 – 4.25e+04 | 0 |
| 4.25e+04 – 4.5e+04 | 0 |
| 4.5e+04 – 4.75e+04 | 0 |
| 4.75e+04 – 5e+04 | 0 |
| 5e+04 – 5.25e+04 | 0 |
| 5.25e+04 – 5.5e+04 | 0 |
| 5.5e+04 – 5.75e+04 | 0 |
| 5.75e+04 – 6e+04 | 0 |
| 6e+04 – 6.25e+04 | 0 |
| 6.25e+04 – 6.5e+04 | 0 |
| 6.5e+04 – 6.75e+04 | 0 |
| 6.75e+04 – 7e+04 | 0 |
| 7e+04 – 7.25e+04 | 0 |
| 7.25e+04 – 7.5e+04 | 0 |
| 7.5e+04 – 7.75e+04 | 0 |
| 7.75e+04 – 8e+04 | 0 |
| 8e+04 – 8.25e+04 | 0 |
| 8.25e+04 – 8.5e+04 | 0 |
| 8.5e+04 – 8.75e+04 | 0 |
| 8.75e+04 – 9e+04 | 0 |
| 9e+04 – 9.25e+04 | 0 |
| 9.25e+04 – 9.5e+04 | 0 |
| 9.5e+04 – 9.75e+04 | 0 |
| 9.75e+04 – 1e+05 | 10225 |
| bin | count |
|---|---|
| -0.4 – -0.2535 | 59 |
| -0.2535 – -0.107 | 1383 |
| -0.107 – 0.0395 | 8328 |
| 0.0395 – 0.186 | 9569 |
| 0.186 – 0.3325 | 8982 |
| 0.3325 – 0.479 | 14728 |
| 0.479 – 0.6255 | 16591 |
| 0.6255 – 0.772 | 8623 |
| 0.772 – 0.9185 | 5682 |
| 0.9185 – 1.065 | 12808 |
| 1.065 – 1.212 | 10614 |
| 1.212 – 1.358 | 6401 |
| 1.358 – 1.505 | 5724 |
| 1.505 – 1.651 | 5558 |
| 1.651 – 1.798 | 1851 |
| 1.798 – 1.944 | 435 |
| 1.944 – 2.091 | 121 |
| 2.091 – 2.237 | 88 |
| 2.237 – 2.384 | 40 |
| 2.384 – 2.53 | 42 |
| 2.53 – 2.677 | 24 |
| 2.677 – 2.823 | 28 |
| 2.823 – 2.97 | 11 |
| 2.97 – 3.116 | 22 |
| 3.116 – 3.263 | 3 |
| 3.263 – 3.409 | 8 |
| 3.409 – 3.556 | 4 |
| 3.556 – 3.702 | 0 |
| 3.702 – 3.849 | 2 |
| 3.849 – 3.995 | 0 |
| 3.995 – 4.142 | 1 |
| 4.142 – 4.288 | 0 |
| 4.288 – 4.434 | 2 |
| 4.434 – 4.581 | 0 |
| 4.581 – 4.728 | 0 |
| 4.728 – 4.874 | 1 |
| 4.874 – 5.021 | 0 |
| 5.021 – 5.167 | 0 |
| 5.167 – 5.314 | 1 |
| 5.314 – 5.46 | 1 |
Schema
37 columns
| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| id | numeric | 0.0% | 119,626 | |
| hip | numeric | 1.4% | 117,951 | |
| hd | numeric | 17.3% | 98,825 | |
| hr | numeric | 92.4% | 9,029 | null_rate |
| gl | text | 0.0% | 3,802 | one_word, short_text, duplicates |
| bf | text | 0.0% | 3,066 | one_word, short_text, duplicates |
| proper | categorical | 0.0% | 465 | long_tail, imbalance |
| ra | numeric | 0.0% | 119,263 | |
| dec | numeric | 0.0% | 119,534 | |
| dist | numeric | 0.0% | 5,397 | high_skew, outliers |
| pmra | numeric | 0.0% | 25,644 | high_skew, outliers |
| pmdec | numeric | 0.0% | 23,226 | high_skew, outliers |
| rv | numeric | 0.0% | 1,714 | outliers |
| mag | numeric | 0.0% | 1,422 | |
| absmag | numeric | 0.0% | 13,452 | outliers |
| spect | text | 0.0% | 4,310 | one_word, allcaps, short_text, duplicates |
| ci | numeric | 1.6% | 2,439 | |
| x | numeric | 0.0% | 119,593 | outliers |
| y | numeric | 0.0% | 119,585 | outliers |
| z | numeric | 0.0% | 119,588 | outliers |
| vx | numeric | 0.0% | 21,555 | high_skew, outliers |
| vy | numeric | 0.0% | 25,826 | high_skew, outliers |
| vz | numeric | 0.0% | 23,037 | high_skew, outliers |
| rarad | numeric | 0.0% | 119,585 | |
| decrad | numeric | 0.0% | 119,585 | |
| pmrarad | numeric | 0.0% | 25,647 | high_skew, outliers |
| pmdecrad | numeric | 0.0% | 23,588 | outliers |
| bayer | categorical | 0.0% | 104 | imbalance |
| flam | categorical | 0.0% | 139 | imbalance |
| con | categorical | 0.0% | 89 | |
| comp | numeric | 0.0% | 3 | high_skew |
| comp_primary | numeric | 0.0% | 119,190 | |
| base | categorical | 0.0% | 651 | imbalance |
| lum | numeric | 0.0% | 13,465 | high_skew, outliers |
| var | text | 0.0% | 1,523 | one_word, short_text, duplicates |
| var_min | numeric | 85.8% | 6,248 | null_rate |
| var_max | numeric | 85.8% | 6,090 | null_rate |
id
numeric identifier
This is a row identifier: 119,626 values, all unique, no nulls, ranging from 0 to 119,630 with a near-perfectly uniform distribution (mean 59,813.16, median 59,813.5, skew ~1.5e-05, kurtosis -1.20). The min of 0 and max of 119,630 against n=119,626 suggests a 0-based sequential id with a handful of gaps. No analytical signal lives here. Treatment: drop from modelling; retain only as a join key.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 119,626
- min
- 0
- max
- 119,630
- mean
- 5.981e+04
- median
- 5.981e+04
- std
- 3.453e+04
- q1
- 2.991e+04
- q3
- 8.972e+04
- iqr
- 5.981e+04
- skew
- 1.471e-05
- kurtosis
- -1.2
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 8.359e-06
hip
numeric identifier
This is almost certainly the Hipparcos catalog identifier (HIP number) for stars: integer values running from 1 to 120,404 with 117,951 unique values across 119,626 rows, near-perfectly uniform (skew ≈ 0.0002, kurtosis ≈ -1.2). The 1.4% null rate suggests some rows lack a Hipparcos cross-match. No outliers and no zeros, consistent with a catalog index rather than a measurement. Treatment: Treat as a catalog ID; left-join to Hipparcos metadata and exclude from modelling.
- n
- 119,626
- nulls
- 1,675 (1.4%)
- unique
- 117,951
- min
- 1
- max
- 120,404
- mean
- 5.917e+04
- median
- 59,172
- std
- 3.417e+04
- q1
- 2.956e+04
- q3
- 8.876e+04
- iqr
- 59,198
- skew
- 0.0001943
- kurtosis
- -1.2
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
hd
numeric identifier
The 'hd' column is a numeric field with 98,825 unique values across 119,626 rows; like 'hip' it is a catalog identifier (the Henry Draper number), not a measurement. Values span 1 to 358,431 with a mean of 114,357 and median of 110,358, showing a roughly symmetric distribution (skew 0.28, kurtosis -0.73) and no flagged outliers. Notably, 17.34% of rows (20,741) are null, meaning those stars have no HD cross-match. Treatment: Treat as a high-cardinality id; drop from modelling or use only as a join key, filtering the 17% nulls rather than imputing them, since an imputed identifier would point at the wrong star.
- n
- 119,626
- nulls
- 20,741 (17.3%)
- unique
- 98,825
- min
- 1
- max
- 358,431
- mean
- 1.144e+05
- median
- 110,358
- std
- 7.418e+04
- q1
- 46,723
- q3
- 175,823
- iqr
- 129,100
- skew
- 0.2822
- kurtosis
- -0.7265
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
hr
numeric feature null_rate
The 'hr' column is populated for only 7.6% of the 119,626 rows (null_rate 0.9244), with values ranging from 1 to 9,110 and 9,029 distinct values. Its near-zero skew (-0.003), flat kurtosis (-1.20), and mean (4563.9) almost equal to median (4566.0) indicate a near-uniform spread over that range, which is the signature of a catalog number rather than a measurement such as an hour or a heart rate. The range matches the Harvard Revised (Yale Bright Star Catalogue) numbering, which runs to 9,110, so the high null rate most likely reflects that only bright stars carry an HR number. Treatment: Treat as a sparse catalog identifier and join key; do not impute, and exclude from modelling.
- n
- 119,626
- nulls
- 110,585 (92.4%)
- unique
- 9,029
- min
- 1
- max
- 9,110
- mean
- 4564
- median
- 4,566
- std
- 2632
- q1
- 2,283
- q3
- 6,848
- iqr
- 4,565
- skew
- -0.003426
- kurtosis
- -1.202
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
gl
text foreign_key one_word short_text duplicates
This column 'gl' holds Gliese/GJ catalogue identifiers for stars (e.g., 'GJ 1293', 'Gl 914B'), with 'gj' and 'gl' being the dominant tokens (344 and 331 occurrences). It is overwhelmingly empty: 115,825 of 119,626 rows are blank, driving a 96.8% duplicate rate and leaving only ~3,800 unique designations. When populated, values are short single tokens (len_max 9, word_mean 1.03). Treatment: Treat as a sparse cross-catalogue star ID; left-join on it where present and ignore the empty majority.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 3,802
- len_min
- 0
- len_max
- 9
- len_mean
- 0.2263
- len_median
- 0
- len_p95
- 0
- word_mean
- 1.032
- word_median
- 1
- n_empty
- 115,825
- n_duplicates
- 115,824
- duplicate_rate
- 0.9682
- vocab_size
- 677
- readability_flesch_mean
- 4.808
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9682
- allcaps_rate
- 0.01719
- boilerplate_rate
- 0
bf
text metadata one_word short_text duplicates
This column appears to hold Bayer/Flamsteed star designations (e.g. "41The1Ori", "66Alp Gem"), with the trailing tokens being three-letter constellation abbreviations like leo, her, cyg, vir. It is overwhelmingly empty: 116,527 of 119,626 rows are blank strings, driving a 97.4% duplicate rate and a mean length of 0.22 characters. Among the 3,099 non-empty entries the values are nearly unique, suggesting this label only applies to a small subset of named/cataloged stars. Treatment: Treat empty strings as missing and use only as a sparse identifier for cataloged stars; drop from modelling features.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 3,066
- len_min
- 0
- len_max
- 10
- len_mean
- 0.2206
- len_median
- 0
- len_p95
- 0
- word_mean
- 1.064
- word_median
- 1
- n_empty
- 116,527
- n_duplicates
- 116,560
- duplicate_rate
- 0.9744
- vocab_size
- 369
- readability_flesch_mean
- 1.986
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9763
- allcaps_rate
- 0
- boilerplate_rate
- 0
proper
categorical metadata long_tail imbalance
This is the proper name of a star or named celestial object, populated for only a tiny fraction of rows. Empty strings dominate at 119,161 of 119,626 (top_rate 0.9961), leaving 465 named rows with names like 'Sol', 'Alpheratz', and 'Caph' that are essentially singletons (the 465 unique values counted above include the blank itself). Entropy ratio of 0.008 confirms the column carries almost no information in aggregate. Treatment: Treat as a sparse name lookup; convert blanks to null and use only as a display label, not a model feature.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 465
- top_value
- (empty string)
- top_rate
- 0.9961
- cardinality
- 465
- entropy
- 0.07115
- entropy_ratio
- 0.008029
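The blank-to-null conversion recommended for sparse text columns such as `gl`, `bf`, and `proper` can be sketched as below. This is a minimal sketch, assuming rows are read as plain strings; `blank_to_none` is a hypothetical helper name, not part of any library.

```python
def blank_to_none(value):
    """Map empty or whitespace-only strings to None; keep real names intact."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped if stripped else None


# Illustrative values only: two blanks and two names quoted in the report.
proper_values = ["", "Sol", "   ", "Alpheratz"]
cleaned = [blank_to_none(v) for v in proper_values]
named_rows = sum(1 for v in cleaned if v is not None)
```

After this step the null rate of the column reflects real coverage instead of the misleading 0.0% shown in the schema.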
ra
numeric feature
Values span 0.0 to 23.998594 with a near-symmetric distribution (skew -0.012, kurtosis -1.20) and mean 12.09 close to median 12.13, consistent with right ascension expressed in hours. With 119,263 unique values across 119,626 rows, near-zero zero_rate, and no nulls or outliers, the column behaves like a continuous astronomical coordinate rather than a categorical feature. Treatment: Treat as a circular coordinate (e.g., encode via sine/cosine on the 0-24 hour range) before modelling.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 119,263
- min
- 0
- max
- 24
- mean
- 12.09
- median
- 12.13
- std
- 6.887
- q1
- 6.217
- q3
- 18.12
- iqr
- 11.9
- skew
- -0.01197
- kurtosis
- -1.198
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 8.359e-06
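A sine/cosine encoding of right ascension could look like the following. It is a sketch that assumes `ra` is in hours on a 0 to 24 range, as the statistics above suggest; `encode_ra_hours` is a hypothetical name.

```python
import math


def encode_ra_hours(ra_hours):
    """Map right ascension in hours onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (ra_hours % 24.0) / 24.0
    return math.sin(angle), math.cos(angle)


# Values near 0h and near 24h end up next to each other after encoding.
low = encode_ra_hours(0.0)
high = encode_ra_hours(23.998594)
gap = math.hypot(low[0] - high[0], low[1] - high[1])
```

The point of the pair is that 0.0 and 23.998594, which are almost 24 apart as raw numbers, become near neighbours, which matches their actual position on the sky.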
dec
numeric feature
This column is almost certainly declination (dec), the celestial latitude coordinate, with values bounded in [-89.78, 89.57] degrees matching the full sky range. The distribution is nearly symmetric (skew 0.04) and platykurtic (kurtosis -1.02) with median near -1.64 and IQR spanning ~68 degrees, suggesting broad sky coverage rather than a concentrated survey footprint. With 119,534 unique values across 119,626 rows and no nulls or outliers, it behaves as a continuous astrometric coordinate. Treatment: Use as-is as a continuous coordinate, optionally pairing with RA for spatial features.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 119,534
- min
- -89.78
- max
- 89.57
- mean
- -1.986
- median
- -1.64
- std
- 40.96
- q1
- -36.42
- q3
- 31.51
- iqr
- 67.94
- skew
- 0.03675
- kurtosis
- -1.019
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 8.359e-06
dist
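numeric feature high_skew outliers
Numeric 'dist' column (a distance; the summary gives the unit as parsecs) with 119,626 rows, no nulls, and 5,397 distinct values. The distribution is severely right-skewed (skew 2.97, kurtosis 6.79): the median is 213.68 with Q1 to Q3 of 115.07 to 392.16, yet the mean is 8,772.29 and the max is exactly 100,000. The histogram puts 10,225 rows in the top bin (97,500 to 100,000) and none between 2,500 and 97,500, so the ceiling behaves like a sentinel rather than a measurement; HYG documents a distance of 100,000 as a marker for missing or unreliable parallax. Over 10% of rows (12,350) flag as outliers and the std (27,890.67) dwarfs the IQR (277.1). Treatment: Mask the 100,000 ceiling as missing first, then log-transform the remaining positive distances; the single zero also has to be excluded before taking logs.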
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 5,397
- min
- 0
- max
- 100,000
- mean
- 8772
- median
- 213.7
- std
- 2.789e+04
- q1
- 115.1
- q3
- 392.2
- iqr
- 277.1
- skew
- 2.965
- kurtosis
- 6.792
- n_outliers
- 12,350
- outlier_rate
- 0.1032
- zero_rate
- 8.359e-06
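The ceiling check and log transform for `dist` can be sketched as follows. This is a minimal sketch, assuming the 100,000 ceiling is a placeholder for unusable distances; `DIST_SENTINEL` and `log_dist` are hypothetical names.

```python
import math

# Assumption: values at or above the observed ceiling are placeholders, not distances.
DIST_SENTINEL = 100_000.0


def log_dist(dist):
    """Return log10 of a usable distance, or None for sentinel, zero, or missing values."""
    if dist is None or dist <= 0 or dist >= DIST_SENTINEL:
        return None
    return math.log10(dist)


# Quartiles, ceiling, and minimum quoted in the statistics above.
sample = [115.07, 213.68, 392.16, 100000.0, 0.0]
transformed = [log_dist(d) for d in sample]
```

Returning `None` instead of a large number keeps the placeholder rows out of means and histograms without silently dropping them from the table.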
pmra
numeric feature high_skew outliers
This is `pmra`, almost certainly proper motion in right ascension (mas/yr) for ~119k astronomical sources, centered near zero (median -1.68, mean -1.31) with an interquartile range of 27.64. The distribution is extremely heavy-tailed: skew 4.61, kurtosis 433.55, std 118.18, and extremes spanning -4432.65 to 6767.26, with 16.4% of rows (19,615) flagged as outliers. No nulls and only 0.04% exact zeros, so the field is densely populated but dominated by tail behaviour. Treatment: Winsorize or apply a signed log/arcsinh transform before modelling to tame the heavy tails.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 25,644
- min
- -4433
- max
- 6767
- mean
- -1.307
- median
- -1.68
- std
- 118.2
- q1
- -15.46
- q3
- 12.18
- iqr
- 27.64
- skew
- 4.608
- kurtosis
- 433.6
- n_outliers
- 19,615
- outlier_rate
- 0.164
- zero_rate
- 0.0004013
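An arcsinh transform for a signed, heavy-tailed column such as `pmra` might be sketched like this. Using the IQR (27.64) as the scale is an assumption chosen for illustration, and `asinh_scale` is a hypothetical name.

```python
import math


def asinh_scale(value, scale):
    """Signed, log-like compression: roughly linear near zero, logarithmic in the tails."""
    return math.asinh(value / scale)


PMRA_IQR = 27.64  # interquartile range reported above, used here as the scale

compressed_max = asinh_scale(6767.26, PMRA_IQR)
compressed_min = asinh_scale(-4432.65, PMRA_IQR)
compressed_q3 = asinh_scale(12.18, PMRA_IQR)
```

Unlike a plain logarithm, `asinh` is defined for zero and negative inputs, so no offset or sign bookkeeping is needed.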
pmdec
numeric feature high_skew outliers
This is `pmdec`, almost certainly proper motion in declination (mas/yr) from an astrometric catalog. The bulk sits in a tight Q1 to Q3 range of -22.4 to 3.77 around a median of -5.76, but the distribution is extremely heavy-tailed: kurtosis of 934.5, skew -2.60, and a min of -5813 against a max of 9999.99. That max looks like a field-width cap or sentinel rather than a measured value. `pmdecrad` peaks at 5.007e-05, which converts to roughly 10,330 mas/yr if that column is in radians per year, so a cap on a genuinely fast mover is the more likely reading. About 14.4% of rows (17,188) fall outside the standard outlier fence. Treatment: Flag rows at 9999.99 and clip extreme tails before any modelling; prefer robust scaling or an arcsinh transform over a plain log, since values are signed.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 23,226
- min
- -5,813
- max
- 1e+04
- mean
- -19.33
- median
- -5.76
- std
- 112.5
- q1
- -22.4
- q3
- 3.77
- iqr
- 26.17
- skew
- -2.605
- kurtosis
- 934.5
- n_outliers
- 17,188
- outlier_rate
- 0.1437
- zero_rate
- 0.0003427
rv
numeric feature outliers
`rv` is a numeric feature dominated by zeros: the median, Q1, and Q3 are all 0.0, and 81.07% of values are exactly zero (zero_rate 0.8107). The non-zero tail is wide and heavy, spanning -386.9 to 471.0 with std 13.90 and kurtosis 116.06, producing 22,643 outliers (18.93% outlier rate). Despite the extremes, mean (-0.276) and skew (0.371) are modest, suggesting roughly balanced positive/negative excursions around a sparse zero baseline. Treatment: Split into a zero-indicator plus a signed-log transform of the non-zero magnitude before modelling.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 1,714
- min
- -386.9
- max
- 471
- mean
- -0.2765
- median
- 0
- std
- 13.9
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 0.3708
- kurtosis
- 116.1
- n_outliers
- 22,643
- outlier_rate
- 0.1893
- zero_rate
- 0.8107
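The split into a zero-indicator plus a signed-log value can be sketched as below. It assumes an exact zero means "no measurement", which the report suggests but does not prove; `split_rv` is a hypothetical name.

```python
import math


def split_rv(rv):
    """Return (has_rv, signed_log): a 0/1 flag and sign(rv) * log(1 + |rv|)."""
    has_rv = 0 if rv == 0 else 1
    signed_log = math.copysign(math.log1p(abs(rv)), rv)
    return has_rv, signed_log


# The reported extremes plus the dominant zero value.
flags_and_values = [split_rv(v) for v in (0.0, -386.9, 471.0)]
```

Keeping the flag as its own feature lets a model separate "unmeasured" from "measured and close to zero", which a single transformed column cannot do.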
mag
numeric feature
Numeric column 'mag' looks like an astronomical magnitude reading: values are tightly clustered around a median of 8.46 with an IQR of 1.52, but the range stretches from -26.7 to 21.0. The value at -26.7 (consistent with the Sun) is a single isolated row; the histogram shows nothing else brighter than about -1.66. The 5,241 flagged outliers (4.4%) come from both the bright and the faint end, which together with that one extreme row explains the high kurtosis of 6.35 despite a near-symmetric skew of 0.16. No nulls or zeros, and 1,422 distinct values across 119,626 rows suggest measurements rounded to ~0.01. Treatment: Use as-is for modelling but consider winsorizing or separating extreme bright-object outliers before fitting.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 1,422
- min
- -26.7
- max
- 21
- mean
- 8.429
- median
- 8.46
- std
- 1.428
- q1
- 7.65
- q3
- 9.17
- iqr
- 1.52
- skew
- 0.1607
- kurtosis
- 6.353
- n_outliers
- 5,241
- outlier_rate
- 0.04381
- zero_rate
- 0
absmag
numeric feature outliers
This is a numeric `absmag` field, almost certainly absolute magnitude on an astronomical scale, ranging from -16.68 to 19.629 with a median of 1.495 and IQR of 3.021. The distribution is left-skewed (skew -1.37) with heavier tails than normal (kurtosis 3.17), and 11.29% of rows (13,508) flag as outliers, which is consistent with a long bright-end tail rather than data errors. No nulls and only 0.015% zeros across 119,626 rows, with 13,452 unique values suggesting quantised reporting. Treatment: Keep as-is for modelling but inspect the bright-end tail; consider robust scaling rather than dropping outliers since they are physically meaningful.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 13,452
- min
- -16.68
- max
- 19.63
- mean
- 0.9907
- median
- 1.495
- std
- 4.353
- q1
- 0.138
- q3
- 3.159
- iqr
- 3.021
- skew
- -1.37
- kurtosis
- 3.168
- n_outliers
- 13,508
- outlier_rate
- 0.1129
- zero_rate
- 0.0001505
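If `absmag` is an absolute magnitude, it should be recoverable from `mag` and `dist` through the standard distance modulus, M = m - 5 log10(d / 10 pc). The sketch below is a consistency check under that assumption; it ignores interstellar extinction, treats the 100,000 ceiling as unusable, and `absolute_magnitude` is a hypothetical name.

```python
import math


def absolute_magnitude(mag, dist_pc):
    """Distance modulus M = m - 5*log10(d/10); None when the distance is unusable."""
    if dist_pc is None or dist_pc <= 0 or dist_pc >= 100_000:
        return None
    return mag - 5.0 * math.log10(dist_pc / 10.0)


# A star at exactly 10 pc has equal apparent and absolute magnitude.
at_ten_pc = absolute_magnitude(5.0, 10.0)
at_hundred_pc = absolute_magnitude(8.46, 100.0)
```

Rows where the recomputed value disagrees strongly with the stored `absmag` are good candidates for the bright-end tail inspection suggested above.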
spect
text feature one_word allcaps short_text duplicates
This column holds stellar spectral type codes (e.g. K0, G5, A0, F8): short one-word tokens averaging 3.4 characters with a 1,532-token vocabulary across 119,626 rows. Values are highly repetitive (96.4% duplicate rate, only 4,310 unique), which is expected for a categorical taxonomy, and 3,048 rows are empty. The 45.8% allcaps rate is not evidence of inconsistent capitalization: spectral notation is case-sensitive, with lowercase letters in luminosity classes (Ia, Ib) and in prefixes and suffixes (sd, e, n, p), so a code such as K0III counts as all-caps while one such as G8IIIa does not. Treatment: Keep the original casing and treat as a categorical feature; consider grouping rare codes by their leading class letter and treating the 3,048 empties as missing.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 4,310
- len_min
- 0
- len_max
- 12
- len_mean
- 3.376
- len_median
- 3
- len_p95
- 8
- word_mean
- 1.009
- word_median
- 1
- n_empty
- 3,048
- n_duplicates
- 115,316
- duplicate_rate
- 0.964
- vocab_size
- 1,532
- readability_flesch_mean
- 98.19
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9937
- allcaps_rate
- 0.4584
- boilerplate_rate
- 0
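One way to group rare spectral codes is to reduce each to its leading class letter and numeric subclass. The sketch below handles only the main O, B, A, F, G, K, M sequence with an optional lowercase `sd`, `d`, or `g` prefix; anything else (white dwarfs, carbon stars, and so on) returns `(None, None)`. The regular expression and `parse_spect` are illustrative assumptions, not a full spectral-type grammar.

```python
import re

# Optional lowercase prefix, one class letter, optional subclass such as 0, 5, or 1.5.
_CLASS_RE = re.compile(r"(?:sd|d|g)?\s*([OBAFGKM])\s*(\d(?:\.\d)?)?")


def parse_spect(code):
    """Return (class_letter, subclass) for main-sequence-style codes, else (None, None)."""
    if not code or not code.strip():
        return None, None
    match = _CLASS_RE.match(code.strip())
    if match is None:
        return None, None
    letter, sub = match.group(1), match.group(2)
    return letter, (float(sub) if sub is not None else None)


parsed = [parse_spect(c) for c in ("K0", "G5", "K0III", "sdM1.5", "")]
```

Matching is deliberately case-sensitive, so the lowercase qualifiers keep their meaning instead of being folded into class letters.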
ci
numeric feature
Numeric feature 'ci' spans -0.4 to 5.46 with mean 0.71 and median 0.616, suggesting a bounded continuous measurement (possibly a colour index or similar physical quantity given the name). Distribution is mildly right-skewed (0.37) with light tails (kurtosis -0.26) and only 208 outliers (0.18%). Negative values exist but zeros are rare (0.15%), and 1.58% of rows are null. Treatment: Impute the 1.58% nulls and use as-is; mild skew does not require transformation.
- n
- 119,626
- nulls
- 1,891 (1.6%)
- unique
- 2,439
- min
- -0.4
- max
- 5.46
- mean
- 0.7115
- median
- 0.616
- std
- 0.4932
- q1
- 0.3485
- q3
- 1.083
- iqr
- 0.7345
- skew
- 0.3728
- kurtosis
- -0.2552
- n_outliers
- 208
- outlier_rate
- 0.001767
- zero_rate
- 0.001537
x
numeric feature outliers
Numeric feature 'x' is effectively continuous (119,593 unique values across 119,626 rows, no nulls) and centered near zero (median -1.05, Q1/Q3 of -89.04/86.27). The distribution has extreme tails: min -99,950 and max 99,982 push the standard deviation to 15,182 against an IQR of just 175, with kurtosis 19.16 and 13.07% of rows flagged as outliers. Skew is slight (-0.22), so the tails are roughly balanced in direction but heavy in magnitude. The extremes sit just inside ±100,000, the same ceiling seen in 'dist', so if 'x' is a Cartesian position derived from distance, the tails are probably those ceiling rows rather than real positions. Treatment: Resolve the 'dist' ceiling first, then winsorize or robust-scale to contain any remaining heavy tails.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 119,593
- min
- -9.995e+04
- max
- 9.998e+04
- mean
- -235.3
- median
- -1.05
- std
- 1.518e+04
- q1
- -89.04
- q3
- 86.27
- iqr
- 175.3
- skew
- -0.2229
- kurtosis
- 19.16
- n_outliers
- 15,635
- outlier_rate
- 0.1307
- zero_rate
- 0
y
numeric feature outliers
A continuous numeric feature centered near zero (median -1.24, mean -39.3) but with an extraordinarily wide spread (std ~17,249, min -99,979, max 99,996). The distribution is roughly symmetric (skew 0.12) yet heavy-tailed (kurtosis 18.0), and 13.9% of values flag as outliers: the bulk sits within an IQR of ~183 while extremes reach ±100k. Near-unique values (119,585 of 119,626) and effectively no zeros or nulls point to a continuous quantity rather than a category, but the ±100k extremes coincide with the ceiling in 'dist' and so may be inherited from placeholder distances. Treatment: Resolve the 'dist' ceiling first, then winsorize or robust-scale before modelling to tame the heavy tails.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 119,585
- min
- -9.998e+04
- max
- 1e+05
- mean
- -39.32
- median
- -1.239
- std
- 1.725e+04
- q1
- -91.18
- q3
- 91.87
- iqr
- 183
- skew
- 0.1166
- kurtosis
- 18.03
- n_outliers
- 16,582
- outlier_rate
- 0.1386
- zero_rate
- 8.359e-06
z
numeric feature outliers
Column z is a high-cardinality numeric feature (119,588 unique values across 119,626 rows, no nulls) centered near zero with median -3.42 and Q1 to Q3 of roughly -107.6 to 95.0. The distribution has extreme tails: min -99,964.98, max 99,862.51, std 18,074.56, and kurtosis 15.49, with 13.7% of rows (16,441) flagged as outliers despite skew of only -0.27. The IQR is two orders of magnitude smaller than the standard deviation, indicating a tight core swamped by heavy symmetric tails that, as with 'x' and 'y', stop just short of the 100,000 ceiling seen in 'dist'. Treatment: Resolve the 'dist' ceiling first, then winsorize or apply a signed log transform before modelling to tame the heavy tails.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 119,588
- min
- -9.996e+04
- max
- 9.986e+04
- mean
- -235
- median
- -3.416
- std
- 1.807e+04
- q1
- -107.6
- q3
- 94.97
- iqr
- 202.5
- skew
- -0.2722
- kurtosis
- 15.49
- n_outliers
- 16,441
- outlier_rate
- 0.1374
- zero_rate
- 8.359e-06
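If `x`, `y`, and `z` are Cartesian positions built from `ra`, `dec`, and `dist`, a row at the 100,000 distance ceiling would land on a sphere of that radius, which would explain tails reaching ±100k. The sketch below uses the usual equatorial convention (x toward RA 0h on the equator, z toward the north celestial pole); that convention, the units (hours, degrees, parsecs), and `equatorial_to_cartesian` are assumptions, not confirmed by the report.

```python
import math


def equatorial_to_cartesian(ra_hours, dec_deg, dist_pc):
    """Convert (RA in hours, Dec in degrees, distance) to Cartesian x, y, z."""
    ra = math.radians(ra_hours * 15.0)  # 24 hours correspond to 360 degrees
    dec = math.radians(dec_deg)
    x = dist_pc * math.cos(dec) * math.cos(ra)
    y = dist_pc * math.cos(dec) * math.sin(ra)
    z = dist_pc * math.sin(dec)
    return x, y, z


# A ceiling-distance row pointing along the x axis.
x, y, z = equatorial_to_cartesian(0.0, 0.0, 100000.0)
radius = math.sqrt(x * x + y * y + z * z)
```

Comparing `sqrt(x² + y² + z²)` against `dist` for a few rows is a quick way to confirm or reject the assumed convention before relying on it.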
vx
numeric feature high_skew outliers
`vx` is a numeric feature centered tightly around zero (median 1.3e-07, IQR 1.94e-05) but with a symmetric extreme range of ±0.10227249, almost certainly a velocity-like component (x-axis). The distribution is pathologically heavy-tailed: kurtosis 1307.6 and skew -11.5, with 13.2% of values flagged as outliers despite std being only 0.00178. The exact symmetry of min and max suggests a hard clipping bound at ±0.10227249. Treatment: Apply a robust scaler or signed-log transform before modelling; investigate the ±0.10227249 clipping boundary.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 21,555
- min
- -0.1023
- max
- 0.1023
- mean
- -2.891e-05
- median
- 1.3e-07
- std
- 0.001782
- q1
- -1.033e-05
- q3
- 9.09e-06
- iqr
- 1.942e-05
- skew
- -11.51
- kurtosis
- 1308
- n_outliers
- 15,752
- outlier_rate
- 0.1317
- zero_rate
- 0.0004848
vy
numeric feature high_skew outliers
Likely a velocity component (vy) for ~119k objects, with values clustered tightly around zero (median 1.18e-05, IQR 3.4e-05) but spanning -0.102 to 0.102. The distribution is extraordinarily heavy-tailed (skew 15.6, kurtosis 678) and 12.0% of rows fall outside the Tukey fences, so a small minority of fast movers dominate the variance (std 0.0022 vs IQR ~3e-05). Treatment: Apply a signed log or robust scaler before modelling to tame the heavy tails.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 25,826
- min
- -0.1023
- max
- 0.1023
- mean
- 0.0002164
- median
- 1.182e-05
- std
- 0.002226
- q1
- -1.86e-06
- q3
- 3.209e-05
- iqr
- 3.395e-05
- skew
- 15.59
- kurtosis
- 678.4
- n_outliers
- 14,368
- outlier_rate
- 0.1201
- zero_rate
- 0.0003427
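The outlier counts quoted throughout are consistent with Tukey fences, which the `vy` note names explicitly. A minimal sketch follows; the multiplier `k=1.5` is the conventional choice and an assumption here, and `tukey_fences` and `outlier_rate` are hypothetical names.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k IQRs below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_rate(values, q1, q3, k=1.5):
    """Share of values falling strictly outside the Tukey fences."""
    lower, upper = tukey_fences(q1, q3, k)
    flagged = sum(1 for v in values if v < lower or v > upper)
    return flagged / len(values)


# Fences for vy from the quartiles listed above.
vy_lower, vy_upper = tukey_fences(-1.86e-06, 3.209e-05)

# With Q1 = Q3 = 0 (as for rv) every non-zero value is flagged.
toy_rate = outlier_rate([0, 0, 0, 0, 5], 0, 0)
```

The degenerate case matters for reading the report: for `rv` the IQR is zero, so its 18.93% outlier rate is simply the share of non-zero rows rather than a statement about extreme velocities.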
vz
numeric feature high_skew outliers
A signed numeric quantity centred almost exactly on zero (median -6.23e-06, mean -1.57e-04) with an extremely tight IQR of 2.31e-05, consistent with a vertical velocity or rate-of-change feature (the name 'vz' suggests a z-axis velocity). The distribution is pathologically heavy-tailed: kurtosis 1029.85, skew -20.30, and symmetric extremes at ±0.10227249 produce 15,774 outliers (13.2%). Despite 119,626 rows there are only 23,037 unique values, hinting at quantisation or repeated stationary readings. Treatment: Clip or winsorise the symmetric ±0.102 tails and consider a signed log (e.g. asinh) transform before modelling.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 23,037
- min
- -0.1023
- max
- 0.1023
- mean
- -0.0001566
- median
- -6.23e-06
- std
- 0.00195
- q1
- -1.998e-05
- q3
- 3.147e-06
- iqr
- 2.313e-05
- skew
- -20.3
- kurtosis
- 1030
- n_outliers
- 15,774
- outlier_rate
- 0.1319
- zero_rate
- 0.000535
rarad
numeric feature
Values span 0 to 6.2828 with mean 3.166 and median 3.175, consistent with a right ascension expressed in radians (0 to 2π). The distribution is essentially symmetric (skew -0.012) and platykurtic (kurtosis -1.198), close to uniform across the circle, with no outliers and only one zero out of 119,626 rows. With 119,585 unique values, this is a high-resolution continuous coordinate. Treatment: Encode as sin/cos pair to preserve circular continuity before modelling.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 119,585
- min
- 0
- max
- 6.283
- mean
- 3.166
- median
- 3.175
- std
- 1.803
- q1
- 1.628
- q3
- 4.743
- iqr
- 3.115
- skew
- -0.01197
- kurtosis
- -1.198
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 8.359e-06
decrad
numeric feature
This column appears to be a declination angle expressed in radians, ranging nearly symmetrically from -1.567 to 1.563 (close to ±π/2). The distribution is near-symmetric (skew 0.037) with negative kurtosis (-1.02), indicating a flatter-than-normal spread, and 119,585 of 119,626 values are unique with no nulls or outliers. The mean (-0.035) and median (-0.029) sit near zero, consistent with an angular coordinate covering most of the celestial sphere. Treatment: Use directly as a numeric feature. Unlike right ascension, declination does not wrap around (it is bounded at ±π/2), so a sin/cos pair is not needed for continuity; sin(decrad) alone is a reasonable alternative if an equal-area scale is wanted.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 119,585
- min
- -1.567
- max
- 1.563
- mean
- -0.03466
- median
- -0.02862
- std
- 0.715
- q1
- -0.6357
- q3
- 0.55
- iqr
- 1.186
- skew
- 0.03675
- kurtosis
- -1.019
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 8.359e-06
pmrarad
numeric feature high_skew outliers
`pmrarad` is a tiny-magnitude signed numeric feature centered near zero (mean -6.4e-09, median -8.1e-09) with values on the order of 1e-7 to 1e-5, consistent with a proper-motion-in-RA quantity expressed in radians. The distribution is highly non-Gaussian: skew 4.6, kurtosis 433.6, and 16.4% of rows flagged as outliers, with extremes reaching 3.28e-05 against an IQR of just 1.34e-07. Nulls are absent and only 0.04% are exact zeros, so the heavy tails are real rather than artefacts of missingness. Skew, kurtosis, and outlier count match `pmra` almost exactly, so the two columns carry the same information in different units. Treatment: Keep only one of `pmra` and `pmrarad` as a model input; robust-scale or winsorize it, and do not assume normality given the heavy tails.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 25,647
- min
- -2.149e-05
- max
- 3.281e-05
- mean
- -6.4e-09
- median
- -8.145e-09
- std
- 5.729e-07
- q1
- -7.495e-08
- q3
- 5.905e-08
- iqr
- 1.34e-07
- skew
- 4.607
- kurtosis
- 433.6
- n_outliers
- 19,615
- outlier_rate
- 0.164
- zero_rate
- 0.0004013
pmdecrad
numeric feature outliers
This is `pmdecrad`, proper motion in declination expressed in radians (per time unit), centred near zero with median -2.79e-08 and IQR of about 1.27e-07. The distribution is extremely heavy-tailed: kurtosis of 997.6, skew of -1.99, and 17,187 outliers (14.4% of rows) stretching from -2.82e-05 to 5.01e-05 against a std of 5.47e-07. Values are dense and continuous (23,588 unique across 119,626 rows) with no nulls and essentially no zeros. Treatment: Rescale to mas/yr and consider a robust or signed-log transform before modelling to tame the heavy tails.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 23,588
- min
- -2.818e-05
- max
- 5.007e-05
- mean
- -9.363e-08
- median
- -2.793e-08
- std
- 5.467e-07
- q1
- -1.086e-07
- q3
- 1.828e-08
- iqr
- 1.269e-07
- skew
- -1.992
- kurtosis
- 997.6
- n_outliers
- 17,187
- outlier_rate
- 0.1437
- zero_rate
- 0.0003344
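Rescaling to mas/yr is a single multiplication, sketched below. The time unit (per year) is an assumption, though it is supported by the report's own numbers: the `pmrarad` maximum of 3.281e-05 converts to about 6,767, matching the `pmra` maximum. `MAS_PER_RAD` and `rad_to_mas` are hypothetical names.

```python
import math

# One radian in milliarcseconds: degrees per radian * 3600 arcsec * 1000 mas.
MAS_PER_RAD = math.degrees(1.0) * 3600.0 * 1000.0


def rad_to_mas(value_rad):
    """Convert an angle (or angular rate) from radians to milliarcseconds."""
    return value_rad * MAS_PER_RAD


pmra_max_check = rad_to_mas(3.281e-05)    # compare with the pmra max of 6767.26
pmdec_max_check = rad_to_mas(5.007e-05)   # compare with the pmdec max of 9999.99
```

The second check is the interesting one: the radian column converts to a value above 9999.99, so the two declination columns do not agree at the top of their range.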
bayer
categorical metadata imbalance
This is the Bayer designation for stars (Greek-letter prefix like 'Alp', 'Bet', 'Gam', 'Del'), with 104 distinct values across 119,626 rows. It is overwhelmingly empty: 118,089 of 119,626 rows (top_rate 0.987) carry no Bayer letter, leaving entropy at just 0.171 (entropy_ratio 0.026). The non-empty tail is roughly evenly distributed across the Greek alphabet, with Alpha (80) and Beta (77) most common. Treatment: Treat empty as missing and either drop or one-hot the small non-empty subset; near-constant signal.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 104
- top_value
- (empty string)
- top_rate
- 0.9872
- cardinality
- 104
- entropy
- 0.171
- entropy_ratio
- 0.02552
flam
categorical feature imbalance
The column 'flam' is a categorical field that is overwhelmingly empty: 116,889 of 119,626 rows (97.7%) hold the blank string. The remaining 2.3% spreads across 138 distinct values that look like small integers ('2','4','5','7'…'16'), each appearing roughly 48-53 times; given that 'bf' holds Bayer/Flamsteed designations, these are most likely Flamsteed numbers, which are labels rather than quantities. Entropy ratio of 0.043 confirms almost no information content, and the imbalance alert is warranted. Treatment: Drop or treat as near-constant; the 2.3% non-empty integer-like values are too sparse to model directly and should not be cast to a numeric feature.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 139
- top_value
- (empty string)
- top_rate
- 0.9771
- cardinality
- 139
- entropy
- 0.3074
- entropy_ratio
- 0.04319
con
categorical feature
Three-letter constellation abbreviations (Cen, UMa, Her, Cyg...) with 89 distinct values across 119,626 rows and zero nulls. That is one more than the 88 IAU constellations, so the extra value deserves a look (a blank or a non-standard code would both produce it). The distribution is remarkably flat: entropy ratio is 0.949 and the most common value, Cen, accounts for only 3.57% of records, so no single constellation dominates. Useful as a sky-region grouping key rather than a predictive feature on its own. Treatment: one-hot or target-encode if used in modelling; otherwise keep as a categorical group-by key.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 89
- top_value
- Cen
- top_rate
- 0.03569
- cardinality
- 89
- entropy
- 6.147
- entropy_ratio
- 0.9493
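The reported figures are consistent with entropy measured in bits and normalized by log2 of the number of distinct values; that definition is inferred from the numbers, not stated by the profile. A minimal sketch under that assumption (the function name `entropy_ratio` is illustrative):

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy in bits divided by log2(number of distinct values)."""
    counts = Counter(values)
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    k = len(counts)
    # A single distinct value carries no information; avoid dividing by log2(1) = 0.
    return entropy / math.log2(k) if k > 1 else 0.0


# Consistency check against the reported figures: 6.147 bits over 89 values.
reported_ratio = 6.147 / math.log2(89)
print(round(reported_ratio, 3))
```

A ratio of 1.0 would mean every constellation holds exactly the same number of rows, so 0.949 indicates a nearly uniform split.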
comp
numeric feature high_skew
The column 'comp' is a numeric field with only 3 distinct values between 1 and 3. Its median, Q1, and Q3 all equal 1, so it is effectively a low-cardinality code or flag rather than a continuous measure. The mean of 1.0048 shows that the value 1 dominates almost entirely: only 536 outliers (0.45%) deviate from it, and they produce the extreme skew (16.7) and kurtosis (311.1). Treatment: cast to a categorical/ordinal code; the near-constant distribution offers little signal for modelling.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 3
- min
- 1
- max
- 3
- mean
- 1.005
- median
- 1
- std
- 0.0739
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- 16.71
- kurtosis
- 311.1
- n_outliers
- 536
- outlier_rate
- 0.004481
- zero_rate
- 0
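The outlier count for 'comp' follows directly from the zero IQR. The sketch below assumes the profiler uses the conventional Tukey fence (1.5 × IQR beyond the quartiles), which the profile does not state; the function name `iqr_fence` is illustrative.

```python
def iqr_fence(q1, q3, k=1.5):
    """Return (lower, upper) Tukey fences; k=1.5 is the conventional choice."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for 'comp': Q1 = Q3 = 1, so the IQR is 0.
lower, upper = iqr_fence(1.0, 1.0)

# With both fences collapsed onto 1, every other value is flagged.
flagged = [v for v in (1, 2, 3) if v < lower or v > upper]
print(lower, upper, flagged)
```

Under that assumption, the 536 outliers are simply the rows whose value is 2 or 3, not extreme measurements.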
comp_primary
numeric identifier
The column 'comp_primary' holds 119,626 nearly unique numeric values (119,190 distinct) ranging from 0 to 119,630, with mean 59,641.96 and median 59,634.5. The near-perfect symmetry (skew 0.0004), negative kurtosis (-1.20), and quartiles (Q1 29,815.25, Q3 89,462.75) closely match a uniform distribution over [0, n], which strongly suggests a row index or sequential identifier rather than a substantive measurement. The 436 rows that repeat an earlier value, together with the column name, suggest it references the id of a primary star, so that several companions can share one value. Treatment: drop before modelling; a near-unique sequential id has no predictive content.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 119,190
- min
- 0
- max
- 119,630
- mean
- 5.964e+04
- median
- 5.963e+04
- std
- 3.444e+04
- q1
- 2.982e+04
- q3
- 8.946e+04
- iqr
- 5.965e+04
- skew
- 0.0003989
- kurtosis
- -1.2
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 8.359e-06
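The kurtosis argument can be checked numerically: a plain 0..n-1 index has an excess kurtosis of about -1.2, the value of a uniform distribution. This sketch assumes the profile reports excess kurtosis (kurtosis minus 3), which the -1.2 figure is consistent with; the function name is illustrative.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


# A plain sequential index with the profile's row count.
index = list(range(119_626))
print(round(excess_kurtosis(index), 2))
```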
base
categorical metadata imbalance
The 'base' column is effectively empty: 118,540 of 119,626 rows (top_rate 0.991) hold the empty string, leaving only 1,086 rows distributed across 650 other values. The non-empty entries look like Gliese catalog identifiers (e.g. 'Gl 57.1', 'Gl 60'), each appearing at most 3 times. An entropy ratio of 0.017 confirms almost no information content. Treatment: drop or binarize as 'has_base_id'; the column is near-constant and unsuitable as a feature.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 651
- top_value
- (empty string)
- top_rate
- 0.9909
- cardinality
- 651
- entropy
- 0.1587
- entropy_ratio
- 0.01698
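The 'has_base_id' binarization named in the treatment could look like this, assuming blank or whitespace-only strings mean no identifier. The function is a sketch, not the profiler's own code.

```python
def has_base_id(values):
    """Return 1 where a non-blank identifier is present, else 0."""
    return [int(bool(v.strip())) for v in values]


# 'Gl 57.1' and 'Gl 60' are the examples quoted in the description.
flags = has_base_id(["", "Gl 57.1", "", "Gl 60", "   "])
print(flags)
```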
lum
numeric feature high_skew outliers
A luminosity-like numeric feature spanning roughly 1.2e-06 to 4.09e+08, with a median of about 21.98 but a mean of 356,526, which is clear evidence of an extremely heavy right tail. A skew of 49.27 and kurtosis of 3,885.57 confirm how extreme the tail is, and 17,485 values (14.6%) fall outside the IQR fence. The 119,626 rows hold 13,465 distinct values with no nulls or zeros, so the spread is genuine rather than padded, and a log transform is defined for every row. Treatment: apply a log transform before any distance- or variance-based modelling.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 13,465
- min
- 1.226e-06
- max
- 4.093e+08
- mean
- 3.565e+05
- median
- 21.98
- std
- 3.341e+06
- q1
- 4.747
- q3
- 76.7
- iqr
- 71.95
- skew
- 49.27
- kurtosis
- 3886
- n_outliers
- 17,485
- outlier_rate
- 0.1462
- zero_rate
- 0
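To illustrate what the log transform does to this column, the sketch below applies log10 to the five-number summary reported above. Base 10 is a choice made here for readability; the profile does not specify a base.

```python
import math

# Five-number summary reported for 'lum' (min, Q1, median, Q3, max).
summary = {"min": 1.226e-06, "q1": 4.747, "median": 21.98, "q3": 76.7, "max": 4.093e+08}

# All values are strictly positive (zero_rate 0), so log10 is defined everywhere.
log_summary = {name: math.log10(value) for name, value in summary.items()}
for name, value in log_summary.items():
    print(f"{name:>6}: {value:7.3f}")

# Distance from Q3 to the maximum, measured in units of the IQR.
raw_ratio = (summary["max"] - summary["q3"]) / (summary["q3"] - summary["q1"])
log_ratio = (log_summary["max"] - log_summary["q3"]) / (log_summary["q3"] - log_summary["q1"])
print(f"raw: {raw_ratio:.3g} IQRs, log10: {log_ratio:.3g} IQRs")
```

On the raw scale the maximum sits millions of IQRs above Q3; after the transform it is within a few IQRs, which is why distance- and variance-based methods behave better on the log scale.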
var
text feature one_word short_text duplicates
Column 'var' is a sparse short-code field: 113,634 of 119,626 rows (n_empty, 95.0%) are blank, and the remaining values are 1-5 character tokens such as 'R', 'S', 'T', 'RS'. duplicate_rate is 0.987 with 1,523 distinct values, one_word_rate is 1.0, and len_max is 5, which suggests a categorical abbreviation or flag rather than free text. The headline surprise is the emptiness: null_rate is reported as 0.0 only because blanks are stored as empty strings rather than nulls. Treatment: treat as a sparse categorical code and encode the empties as a distinct 'missing' level.
- n
- 119,626
- nulls
- 0 (0.0%)
- unique
- 1,523
- len_min
- 0
- len_max
- 5
- len_mean
- 0.14
- len_median
- 0
- len_p95
- 1
- word_mean
- 1
- word_median
- 1
- n_empty
- 113,634
- n_duplicates
- 118,103
- duplicate_rate
- 0.9873
- vocab_size
- 597
- readability_flesch_mean
- 4.609
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0.0177
- boilerplate_rate
- 0
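A minimal sketch of the 'missing'-level encoding described above. Grouping rare codes into an 'other' level is an extra step assumed here because 1,523 distinct values is a lot for a categorical encoder; it is not part of the profile's recommendation, and the function name and `min_count` threshold are illustrative.

```python
from collections import Counter


def encode_codes(values, min_count=2, missing="missing", other="other"):
    """Replace blanks with `missing` and codes seen fewer than `min_count` times with `other`."""
    cleaned = [v.strip() or missing for v in values]
    counts = Counter(cleaned)
    return [v if v == missing or counts[v] >= min_count else other for v in cleaned]


# Illustrative sample using the tokens quoted in the description.
sample = ["", "R", "", "S", "R", "RS", ""]
encoded = encode_codes(sample)
print(encoded)
```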
var_min
numeric feature null_rate
Numeric feature 'var_min' is populated for only about 14% of rows (null_rate 0.858), making it a sparse signal. The 16,991 observed values range from -1.333 to 14.902 with mean 9.50 and median 9.849; they are left-skewed (skew -0.93) with mild kurtosis (1.25). About 2.6% of present values (449 outliers) fall outside the IQR fence, and only 6,248 distinct values appear. Treatment: impute or add a missingness indicator before modelling, given the 85.8% null rate.
- n
- 119,626
- nulls
- 102,635 (85.8%)
- unique
- 6,248
- min
- -1.333
- max
- 14.9
- mean
- 9.502
- median
- 9.849
- std
- 1.781
- q1
- 8.526
- q3
- 10.71
- iqr
- 2.181
- skew
- -0.9339
- kurtosis
- 1.251
- n_outliers
- 449
- outlier_rate
- 0.02643
- zero_rate
- 0
var_max
numeric feature null_rate
Numeric feature 'var_max' is missing for 85.8% of the 119,626 rows, leaving 16,991 populated values spread over 6,090 distinct numbers. Its magnitude-like range and its pairing with 'var_min' suggest it records one end of a variable star's brightness range, though the profile does not confirm this. The observed values center near a median of 9.646 with mean 9.259, range from -1.523 to 13.702, and are left-skewed (-0.97) with 325 outliers (1.9%). The dominant concern is the null rate, not the distribution shape. Treatment: add a missingness indicator, then either impute or restrict modelling to the populated subset, given the 85.8% null rate.
- n
- 119,626
- nulls
- 102,635 (85.8%)
- unique
- 6,090
- min
- -1.523
- max
- 13.7
- mean
- 9.259
- median
- 9.646
- std
- 1.742
- q1
- 8.243
- q3
- 10.49
- iqr
- 2.249
- skew
- -0.9704
- kurtosis
- 1.128
- n_outliers
- 325
- outlier_rate
- 0.01913
- zero_rate
- 0
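For 'var_min' and 'var_max', the indicator-plus-imputation treatment could be sketched as below. Median imputation is one option assumed here; the profile does not prescribe a fill value, and the function name and sample values are illustrative.

```python
from statistics import median


def impute_with_indicator(values):
    """Return (filled, is_missing); None entries are filled with the observed median."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    is_missing = [int(v is None) for v in values]
    return filled, is_missing


# Illustrative values only; None stands for a null entry.
filled, is_missing = impute_with_indicator([9.6, None, 8.2, None, 10.5])
print(filled)
print(is_missing)
```

Keeping the indicator matters here: with 85.8% of rows imputed, the filled column alone would be dominated by a single constant.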