This is the HYG star catalog (hygdata_v41.csv) with 119,626 stars and 37 columns covering positions (ra/dec, x/y/z), motion (pmra, pmdec, vx/vy/vz, rv), brightness (mag, absmag, lum, ci), and identifiers/classifications (hd, hip, spect, con, proper). The most informative single field is the spectral type 'spect': it has 4,310 distinct values but is dominated by a handful of classes (K0 ~8.6k, G5 ~6.0k, A0 ~4.9k), giving a clean view of stellar populations. Distance and luminosity are extremely right-skewed (lum skew ≈49, dist max 100,000 pc) with 10–15% outliers, so any analysis on those should use log scales. Radial velocity 'rv' is 81% zeros — effectively a 'measured vs not' flag rather than a continuous variable. Constellation 'con' is the most evenly distributed categorical (89 values, entropy ratio 0.95) led by Cen, UMa, and Her, making it a good grouping key.
saturn
/home/coolhand/data/celestial/hyg/hygdata_v41.csv 119,626 rows sample n=119,626 seed 42 2026-05-01T23:28:28+00:00
Overview
| Source | /home/coolhand/data/celestial/hyg/hygdata_v41.csv |
| Total rows | 119,626 |
| Profiled sample | 119,626 |
| Columns | 37 |
| Generated | 2026-05-01T23:28:28+00:00 |
Insights opt-in
Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.
This is a row identifier: 119,626 values, all unique, no nulls, ranging from 0 to 119,630 with a near-perfectly uniform distribution (mean 59,813.16, median 59,813.5, skew ~1.5e-05, kurtosis -1.20). The min of 0 and max of 119,630 against n=119,626 suggests a 0-based sequential id with a handful of gaps. No analytical signal lives here.
This is almost certainly the Hipparcos catalog identifier (HIP number) for stars: integer values running from 1 to 120404 with 117951 unique values across 119626 rows, near-perfectly uniform (skew ≈ 0.0002, kurtosis ≈ -1.2). The 1.4% null rate suggests some rows lack a Hipparcos cross-match. No outliers and no zeros, consistent with a catalog index rather than a measurement.
The 'hd' column holds Henry Draper catalog numbers: a numeric identifier with 98,825 unique values across 119,626 rows, not a measurement or a categorical feature. Values span 1 to 358,431 with a mean of 114,357 and median of 110,358, showing a roughly symmetric distribution (skew 0.28, kurtosis -0.73) and no flagged outliers. Notably, 17.34% of rows are null, most likely stars without an HD cross-match, and these would need handling before any join on this key.
The 'hr' column is a numeric field populated for only 7.6% of the 119,626 rows (null_rate 0.9244), with values ranging from 1 to 9110 and 9,029 distinct values. Its near-zero skew (-0.003), flat kurtosis (-1.20), and mean (4563.9) almost equal to median (4566.0) indicate a near-uniform spread across the 1–9110 range, which fits a sequential catalog number (in HYG, the Harvard Revised / Bright Star Catalogue identifier) rather than an hour-of-day or heart-rate measure. The extreme null rate is expected, since only the brightest stars carry an HR number.
This column 'gl' holds Gliese/GJ catalogue identifiers for stars (e.g., 'GJ 1293', 'Gl 914B'), with 'gj' and 'gl' being the dominant tokens (344 and 331 occurrences). It is overwhelmingly empty: 115,825 of 119,626 rows are blank, driving a 96.8% duplicate rate and leaving only ~3,800 unique designations. When populated, values are short single tokens (len_max 9, word_mean 1.03).
This column appears to hold Bayer/Flamsteed star designations (e.g. "41The1Ori", "66Alp Gem"), with the trailing tokens being three-letter constellation abbreviations like leo, her, cyg, vir. It is overwhelmingly empty: 116,527 of 119,626 rows are blank strings, driving a 97.4% duplicate rate and a mean length of 0.22 characters. Among the 3,099 non-empty entries the values are nearly unique, suggesting this label only applies to a small subset of named/cataloged stars.
This is the proper name of a star, populated for only a tiny fraction of rows. Empty strings dominate at 119,161 of 119,626 (top_rate 0.9961), leaving 465 named rows such as 'Sol', 'Alpheratz', and 'Caph', nearly all of them singletons ('p Eridani' appears twice). An entropy ratio of 0.008 confirms the column carries almost no information in aggregate.
Values span 0.0 to 23.998594 with a near-symmetric distribution (skew -0.012, kurtosis -1.20) and mean 12.09 close to median 12.13, consistent with right ascension expressed in hours. With 119263 unique values across 119626 rows, near-zero zero_rate, and no nulls or outliers, the column behaves like a continuous astronomical coordinate rather than a categorical feature.
This column is almost certainly declination (dec), the celestial latitude coordinate, with values bounded in [-89.78, 89.57] degrees matching the full sky range. The distribution is nearly symmetric (skew 0.04) and platykurtic (kurtosis -1.02) with median near -1.64 and IQR spanning ~68 degrees, suggesting broad sky coverage rather than a concentrated survey footprint. With 119,534 unique values across 119,626 rows and no nulls or outliers, it behaves as a continuous astrometric coordinate.
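Numeric 'dist' column (distance, in parsecs per the summary) with 119,626 rows, no nulls, and 5,397 distinct values. The distribution is severely right-skewed (skew 2.97, kurtosis 6.79): median is 213.68 with IQR 115.07–392.16, yet the mean is 8,772.29 and the max reaches exactly 100,000. That ceiling is a sentinel rather than a real distance: HYG uses values of 100,000 or more to mark missing or dubious parallax, so those rows should be masked before any distance-based analysis. Over 10% of rows (12,350) flag as outliers and std (27,890.67) dwarfs the IQR, both largely a consequence of the sentinel rows.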
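One way to act on the ceiling before plotting or modelling is to mask it and work in log space. This is a minimal sketch, assuming that `dist` values at or above 100,000 mark unusable distances and that non-positive distances should also be dropped; the `log_dist` helper and the sample list are hypothetical, with the sample taken from the quartiles and maximum quoted above.

```python
import math

SENTINEL_PC = 100_000.0  # assumption: values at or above this mean "distance unknown"

def log_dist(dist):
    """Return log10(dist), or None for sentinel or non-positive values."""
    if dist <= 0 or dist >= SENTINEL_PC:
        return None
    return math.log10(dist)

# Q1, median, Q3, and max of 'dist' as reported in the profile.
sample = [115.07, 213.68, 392.16, 100000.0]
logs = [log_dist(d) for d in sample]  # the last entry is masked to None
```

Returning None instead of a filler value keeps the masked rows from silently entering a mean or a histogram.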
This is `pmra`, almost certainly proper motion in right ascension (mas/yr) for ~119k astronomical sources, centered near zero (median -1.68, mean -1.31) with an interquartile range of 27.64. The distribution is extremely heavy-tailed: skew 4.61, kurtosis 433.55, std 118.18, and extremes spanning -4432.65 to 6767.26, with 16.4% of rows (19,615) flagged as outliers. No nulls and only 0.04% exact zeros, so the field is densely populated but dominated by tail behaviour.
This is `pmdec`, almost certainly proper motion in declination (mas/yr) from an astrometric catalog. The bulk sits in a tight IQR of -22.4 to 3.77 around a median of -5.76, but the distribution is extremely heavy-tailed: kurtosis of 934.5, skew -2.60, and a min of -5813 against a max of 9999.99. That max looks like a sentinel or a field-width cap rather than a real motion: the companion column `pmdecrad` peaks at 5.01e-05, which converts to roughly 10,330 mas/yr if it is in radians per year, so the largest value appears to have been truncated here. About 14.4% of rows (17,188) fall outside the standard outlier fence.
`rv` is a numeric feature dominated by zeros: the median, Q1, and Q3 are all 0.0, and 81.07% of values are exactly zero (zero_rate 0.8107). The non-zero tail is wide and heavy, spanning -386.9 to 471.0 with std 13.90 and kurtosis 116.06, producing 22,643 outliers (18.93% outlier rate). Despite the extremes, mean (-0.276) and skew (0.371) are modest, suggesting roughly balanced positive/negative excursions around a sparse zero baseline.
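Because the zeros most likely stand for "no measurement", one option is to split `rv` into a flag plus a value that is missing when unmeasured. A minimal sketch, assuming that an exact 0.0 always means unmeasured (a genuinely measured zero would be misclassified); `split_rv` is a hypothetical helper and the sample values are illustrative apart from the reported min and max.

```python
def split_rv(rv):
    """Return (measured, value); value is None when rv is exactly 0.0."""
    measured = rv != 0.0
    return measured, (rv if measured else None)

sample = [0.0, -386.9, 0.0, 471.0, 0.0]
flags = [split_rv(v)[0] for v in sample]
measured_rate = sum(flags) / len(flags)  # share of rows with a non-zero rv
```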
Numeric column 'mag' looks like an astronomical magnitude reading: values are tightly clustered around a median of 8.46 with an IQR of 1.52, but the range stretches from -26.7 to 21.0. That extreme negative tail (consistent with very bright objects like the Sun at -26.7) drives the high kurtosis of 6.35 and flags 5,241 outliers (4.4%) despite near-symmetric skew of 0.16. No nulls or zeros, and 1,422 distinct values across 119,626 rows suggest measurements rounded to ~0.01.
This is a numeric `absmag` field, almost certainly absolute magnitude on an astronomical scale, ranging from -16.68 to 19.629 with a median of 1.495 and IQR of 3.021. The distribution is left-skewed (skew -1.37) with heavier tails than normal (kurtosis 3.17), and 11.29% of rows (13,508) flag as outliers — consistent with a long bright-end tail rather than data errors. No nulls and only 0.015% zeros across 119,626 rows, with 13,452 unique values suggesting quantised reporting.
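If `absmag` was derived from `mag` and `dist` in the usual way, the standard distance-modulus relation M = m - 5 log10(d / 10 pc) should reproduce it. The sketch below assumes distances in parsecs and ignores extinction; whether the catalog applied exactly this formula is an assumption, and `abs_mag` is a hypothetical helper.

```python
import math

def abs_mag(mag, dist_pc):
    """Absolute magnitude from apparent magnitude and distance in parsecs."""
    if dist_pc <= 0:
        return None  # undefined for zero or negative distance
    return mag - 5.0 * math.log10(dist_pc / 10.0)

# At exactly 10 pc, apparent and absolute magnitude coincide.
same = abs_mag(5.0, 10.0)
# Ten times farther away, the same apparent magnitude implies 5 mag brighter.
brighter = abs_mag(5.0, 100.0)
```

Comparing this against the stored `absmag` on rows with a real distance would be a quick consistency check; rows at the 100,000 `dist` ceiling would produce meaningless values.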
This column holds stellar spectral type codes (e.g. K0, G5, A0, F8): short one-word tokens averaging 3.4 characters with a 1,532-word vocabulary across 119,626 rows. Values are highly repetitive (96.4% duplicate rate, only 4,310 unique), which is expected for a categorical taxonomy, and 3,048 rows are empty. The 45.8% allcaps rate reflects mixed-case codes, but lowercase letters in spectral types are usually meaningful (for example 'e' for emission lines, 'p' for peculiar, 'm' for metallic lines), so case should be preserved rather than normalized; when a coarse grouping is needed, parse out the leading class letter instead.
Numeric feature 'ci' spans -0.4 to 5.46 with mean 0.71 and median 0.616, a range typical of a B-V colour index, which is what the name denotes in HYG. Distribution is mildly right-skewed (0.37) with light tails (kurtosis -0.26) and only 208 outliers (0.18%). Negative values exist, as expected for hot blue stars, but zeros are rare (0.15%), and 1.58% of rows are null.
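For grouping by stellar population, the 4,310 distinct codes can be collapsed to their leading class letter. A minimal sketch, assuming the main O/B/A/F/G/K/M classes are the ones of interest and that any lowercase prefix (such as a subdwarf marker) can be skipped; the pattern is deliberately case-sensitive, and `spectral_class` is a hypothetical helper. Codes outside those seven classes return None rather than being guessed.

```python
import re

# Optional lowercase prefix, then one of the seven main class letters.
_CLASS = re.compile(r"^[a-z]*([OBAFGKM])")

def spectral_class(spect):
    """Return the leading main-sequence class letter, or None if absent."""
    match = _CLASS.match(spect.strip())
    return match.group(1) if match else None

# Values taken from the 'spect' sample list in this report.
classes = [spectral_class(s) for s in ["B7III-IV", "K2", "A0...", "G6/G8III", ""]]
```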
Numeric feature 'x' is effectively continuous (119,593 unique values across 119,626 rows, no nulls) and centered near zero (median -1.05, Q1/Q3 of -89.04/86.27). The distribution has extreme tails: min -99,950 and max 99,982 push the standard deviation to 15,182 against an IQR of just 175, with kurtosis 19.16 and 13.07% of rows flagged as outliers. Slight negative skew (-0.22) suggests the tails are roughly symmetric in direction but heavy in magnitude.
Column 'y' is a continuous numeric feature centered near zero (median -1.24, mean -39.3) but with an extraordinarily wide spread (std ~17249, min -99979, max 99996). The distribution is roughly symmetric (skew 0.12) yet heavy-tailed (kurtosis 18.0), and 13.9% of values flag as outliers: the bulk sits within an IQR of ~183 while extremes reach ±100k. Near-unique values (119585 of 119626) and effectively no zeros or nulls point to a continuous coordinate rather than a category. The ±100k extremes, however, mirror the 100,000 ceiling in 'dist' and are probably inherited from it rather than being real positions.
Column z is a high-cardinality numeric feature (119,588 unique values across 119,626 rows, no nulls) centered near zero with median -3.42 and IQR roughly -107.6 to 95.0. The distribution has extreme tails: min -99,964.98, max 99,862.51, std 18,074.56, and kurtosis 15.49, with 13.7% of rows (16,441) flagged as outliers despite skew of only -0.27. The IQR is two orders of magnitude smaller than the standard deviation, indicating a tight core swamped by heavy symmetric tails.
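The ±100k extremes in x, y, and z are what one would expect if these columns are the Cartesian form of (ra, dec, dist) and `dist` carries a 100,000 ceiling. A sketch under that assumption, using the conventional equatorial transformation with ra in hours and dec in degrees; the axis convention and the `to_cartesian` helper are assumptions, not something the profile states.

```python
import math

def to_cartesian(ra_hours, dec_deg, dist):
    """Convert equatorial coordinates to Cartesian x, y, z in units of dist."""
    ra = ra_hours * math.pi / 12.0   # 24 hours = 2*pi radians
    dec = math.radians(dec_deg)
    x = dist * math.cos(dec) * math.cos(ra)
    y = dist * math.cos(dec) * math.sin(ra)
    z = dist * math.sin(dec)
    return x, y, z

# A row at the dist ceiling lands on a sphere of radius 100,000,
# so each coordinate is bounded by +/-100,000.
x, y, z = to_cartesian(6.0, 30.0, 100_000.0)
radius = math.sqrt(x * x + y * y + z * z)
```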
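`vx` is a numeric feature centered tightly around zero (median 1.3e-07, IQR 1.94e-05) but with a symmetric extreme range of ±0.10227249, almost certainly the x-axis component of a Cartesian velocity. The distribution is pathologically heavy-tailed: kurtosis 1307.6 and skew -11.5, with 13.2% of values flagged as outliers despite std being only 0.00178. The exact symmetry of min and max suggests a hard clipping bound at ±0.10227249. If the unit is parsecs per year, that bound is roughly 100,000 km/s, which is unphysical for a star and points to a cap on garbage values (plausibly from rows with sentinel distances) rather than real motion.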
Likely a velocity component (vy) for ~119k objects, with values clustered tightly around zero (median 1.18e-05, IQR 3.4e-05) but spanning -0.102 to 0.102. The distribution is extraordinarily heavy-tailed (skew 15.6, kurtosis 678) and 12.0% of rows fall outside the Tukey fences, so a small minority of fast movers dominate the variance (std 0.0022 vs IQR ~3e-05).
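The outlier rates quoted throughout this report are consistent with Tukey fences at 1.5 × IQR beyond the quartiles. A minimal sketch of that rule, assuming inclusive-method quartiles (saturn's exact quantile method is not stated, so counts on real data may differ slightly); `tukey_outliers` and the toy data are hypothetical.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return (outliers, (low_fence, high_fence)) using k * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high], (low, high)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers, fences = tukey_outliers(data)
```

With a core as tight as the velocity columns have (IQR around 3e-05), even modest values fall outside the fences, which is why 12 to 16% outlier rates here describe a heavy tail rather than bad data.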
`vz` is a signed numeric quantity centred almost exactly on zero (median -6.23e-06, mean -1.57e-04) with an extremely tight IQR of 2.31e-05, consistent with the z-axis component of a Cartesian velocity alongside vx and vy. The distribution is pathologically heavy-tailed: kurtosis 1029.85, skew -20.30, and symmetric extremes at ±0.10227249 produce 15,774 outliers (13.2%). Despite 119,626 rows there are only 23,037 unique values, hinting at quantisation, possibly inherited from rounded proper-motion inputs.
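The shared ±0.10227249 extremes in the velocity columns are easier to interpret after a unit conversion. This sketch assumes the components are in parsecs per year (the profile does not state the unit) and uses a Julian year; `pc_per_yr_to_km_s` is a hypothetical helper.

```python
KM_PER_PC = 3.0856775814913673e13  # kilometres in one parsec
SEC_PER_YR = 365.25 * 86400.0      # seconds in a Julian year

def pc_per_yr_to_km_s(v):
    """Convert a velocity from parsecs per year to kilometres per second."""
    return v * KM_PER_PC / SEC_PER_YR

# The symmetric extreme reported for vx and vz.
bound_km_s = pc_per_yr_to_km_s(0.10227249)
```

Under that assumption the bound comes out very close to 100,000 km/s, which reads more like a round cap or an artefact of sentinel inputs than a measured stellar velocity.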
Values span 0 to 6.2828 with mean 3.166 and median 3.175, consistent with a right ascension expressed in radians (0 to 2π). The distribution is essentially symmetric (skew -0.012) and platykurtic (kurtosis -1.198), close to uniform across the circle, with no outliers and only one zero out of 119,626 rows. With 119,585 unique values, this is a high-resolution continuous coordinate.
This column appears to be a declination angle expressed in radians, ranging symmetrically from -1.567 to 1.563 (close to ±π/2). The distribution is near-symmetric (skew 0.037) with negative kurtosis (-1.02), indicating a flatter-than-normal spread, and 119,585 of 119,626 values are unique with no nulls or outliers. The mean (-0.035) and median (-0.029) sit near zero, consistent with an angular coordinate covering most of the celestial sphere.
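The radian columns can be cross-checked against the hour and degree columns directly from the reported extremes. A small sketch, assuming ra is in hours and dec in degrees as the earlier insights conclude; the helper names are hypothetical.

```python
import math

def hours_to_rad(hours):
    """Right ascension in hours to radians (24 h = 2*pi)."""
    return hours * math.pi / 12.0

def deg_to_rad(degrees):
    """Declination in degrees to radians."""
    return math.radians(degrees)

ra_max_rad = hours_to_rad(23.998594)  # compare with the 6.2828 max above
dec_min_rad = deg_to_rad(-89.78)      # compare with the -1.567 min above
dec_max_rad = deg_to_rad(89.57)       # compare with the 1.563 max above
```

All three converted extremes agree with the reported radian values to the precision shown, which supports reading the two radian columns as unit conversions of ra and dec.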
`pmrarad` is a tiny-magnitude signed numeric feature centered near zero (mean -6.4e-09, median -8.1e-09) with values on the order of 1e-7 to 1e-5, consistent with a proper-motion-in-RA quantity expressed in radians. The distribution is highly non-Gaussian: skew 4.6, kurtosis 433.6, and 16.4% of rows flagged as outliers, with extremes reaching 3.28e-05 against an IQR of just 1.34e-07. Nulls are absent and only 0.04% are exact zeros, so the heavy tails are real rather than artefacts of missingness.
This is `pmdecrad`, proper motion in declination expressed in radians (per time unit), centred near zero with median -2.79e-08 and IQR of about 1.27e-07. The distribution is extremely heavy-tailed: kurtosis of 997.6, skew of -1.99, and 17,187 outliers (14.4% of rows) stretching from -2.82e-05 to 5.01e-05 against a std of 5.47e-07. Values are quantised rather than fully continuous (23,588 unique across 119,626 rows), with no nulls and essentially no zeros.
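The relationship between the mas/yr and radian proper-motion columns can be checked from the reported extremes. A minimal sketch, assuming `pmrarad` and `pmdecrad` are in radians per year; the helper names are hypothetical.

```python
import math

MAS_PER_RAD = 180.0 * 3600.0 * 1000.0 / math.pi  # milliarcseconds in one radian

def mas_to_rad(mas):
    return mas / MAS_PER_RAD

def rad_to_mas(rad):
    return rad * MAS_PER_RAD

pmra_max_rad = mas_to_rad(6767.26)       # reported pmrarad max: 3.28e-05
pmdec_min_rad = mas_to_rad(-5813.0)      # reported pmdecrad min: -2.82e-05
pmdecrad_max_mas = rad_to_mas(5.01e-05)  # reported pmdec max: 9999.99
```

The first two conversions match the reported radian extremes. The third does not: the radian maximum corresponds to more than 10,000 mas/yr, so the 9999.99 maximum in `pmdec` looks capped while `pmdecrad` preserves the larger value.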
This is the Bayer designation for stars (Greek-letter prefix like 'Alp', 'Bet', 'Gam', 'Del'), with 104 distinct values across 119,626 rows. It is overwhelmingly empty: 118,089 of 119,626 rows (top_rate 0.987) carry no Bayer letter, leaving entropy at just 0.171 (entropy_ratio 0.026). The non-empty tail is roughly evenly distributed across the Greek alphabet, with Alpha (80) and Beta (77) most common.
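The entropy figures quoted for categorical columns are consistent with Shannon entropy in bits divided by log2 of the number of distinct values, although saturn's exact definition is not stated here. A sketch under that assumption; `entropy_bits` and `entropy_ratio` are hypothetical helpers.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a single category carries no information
    return entropy_bits(counts) / math.log2(k)

uniform = entropy_ratio([5, 5, 5, 5])  # perfectly even split
# Reported bayer entropy (0.171 bits) over its 104 distinct values.
reported = 0.171 / math.log2(104)
```

Dividing the reported bayer entropy by log2(104) reproduces the reported ratio of 0.026 after rounding, which is what motivates the assumed definition.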
The column 'flam' is a categorical field that is overwhelmingly empty: 116,889 of 119,626 rows (97.7%) hold the blank string. The remaining 2.3% spreads across 138 distinct values that are small integers ('2','4','5','7'…'16'), each appearing roughly 44-53 times among the top 20; these are Flamsteed numbers, the numeric half of designations such as '29 Psc'. Entropy ratio of 0.043 confirms almost no information content, and the imbalance alert is warranted.
Three-letter IAU constellation abbreviations (Cen, UMa, Her, Cyg...) across 119,626 rows with zero nulls. There are 89 distinct values against 88 IAU constellations, so one value is something other than a standard code (possibly a blank string, which the null count would not catch) and is worth checking before grouping. The distribution is remarkably flat: entropy ratio is 0.949 and the most common value, Cen, accounts for only 3.57% of records, so no single constellation dominates. Useful as a sky-region grouping key rather than a predictive feature on its own.
The column 'comp' is a numeric field with only 3 distinct values bounded between 1.0 and 3.0, with a median, Q1, and Q3 all equal to 1.0, indicating a low-cardinality code rather than a continuous measure; it appears to number the components of a multiple-star system. The mean of 1.0048 shows the value 1 dominates almost entirely, with only 536 rows (0.45%) deviating, producing extreme skew (16.7) and kurtosis (311.1).
The column 'comp_primary' contains 119,626 nearly unique numeric values (119,190 distinct) ranging from 0 to 119,630 with mean 59,641.96 and median 59,634.5. The near-perfect symmetry (skew 0.0004), negative kurtosis (-1.20), and quartiles (Q1 29,815.25, Q3 89,462.75) closely match a uniform distribution over [0, n], so the values track the row id. Given the name, the likeliest reading is a pointer to the id of a system's primary star, equal to the row's own id for single stars; the 436 fewer distinct values than rows would then be companions sharing a primary. It is a key, not a substantive measurement.
The 'base' column is a categorical field that is effectively empty: 118,540 of 119,626 rows (top_rate 0.991) hold the empty string, leaving only ~1,086 rows distributed across 650 other values. The non-empty entries look like Gliese star catalog identifiers (e.g. 'Gl 57.1', 'Gl 60'), each appearing at most 3 times. Entropy ratio of 0.017 confirms almost no information content.
Column 'lum' is a luminosity feature spanning roughly 1.2e-06 to 4.09e+08, with a median of about 21.98 but a mean of 356,526, clear evidence of an extremely heavy right tail. Skew of 49.27 and kurtosis of 3885.57 confirm a pathological distribution, and 17,485 values (14.6%) fall outside the IQR fence. Of 119,626 rows, 13,465 values are unique, close to the 13,452 unique values of `absmag`, with no nulls or zeros; the spread is genuine, but it is best viewed on a log scale.
Column 'var' is a sparse short-code field holding what look like variable-star designations: 113,634 of 119,626 rows are blank and the remaining values are 1–5 character tokens like 'R', 'S', 'T', 'RS'. Duplicate_rate is 0.987 with only 1,523 uniques, and one_word_rate is 1.0 with len_max of 5, so this is a categorical label rather than free text. Note that null_rate is reported as 0.0 only because blanks are stored as empty strings; effectively about 95% of the column is missing.
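The reported luminosity statistics line up with the absolute-magnitude statistics under the usual magnitude-to-luminosity relation, L = 10^(0.4 (M_sun - M)). The sketch below assumes luminosity in solar units and a solar absolute magnitude of 4.85; that constant is an assumption chosen because it reproduces the reported median, minimum, and maximum, and `luminosity` is a hypothetical helper.

```python
M_SUN = 4.85  # assumed solar absolute magnitude (fits the reported stats)

def luminosity(absmag):
    """Luminosity in solar units from absolute magnitude."""
    return 10.0 ** (0.4 * (M_SUN - absmag))

lum_at_median = luminosity(1.495)     # median absmag; reported median lum is 21.98
lum_at_brightest = luminosity(-16.68) # min absmag; reported max lum is 4.09e+08
lum_at_faintest = luminosity(19.629)  # max absmag; reported min lum is 1.2e-06
```

Because the mapping is monotonic, quantiles of `absmag` map onto quantiles of `lum`, which is why the medians and extremes can be compared directly. It also means `lum` adds no information beyond `absmag`, only a different scale.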
Numeric feature 'var_min' is populated for only ~14% of rows (null_rate 0.858), making it a sparse signal. Among the 16,991 observed values it ranges from -1.333 to 14.902 with mean 9.50 and median 9.849, and is left-skewed (skew -0.93) with mild kurtosis (1.25). About 2.6% of present values fall outside the IQR fence (449 outliers), and only 6,248 distinct values appear.
Numeric feature 'var_max' is the companion of 'var_min': together they appear to give the approximate magnitude range of a variable star. It is missing for 85.8% of the 119,626 rows, leaving roughly 17k populated values spread over 6,090 distinct numbers. Among observed values it centers near a median of 9.646 with mean 9.259, ranges from -1.523 to 13.702, and is left-skewed (-0.97) with 325 outliers (1.9%). The dominant concern is the null rate, not the distribution shape.
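If the two columns bound a variable star's brightness range, their difference gives a variability amplitude. A minimal sketch, assuming both are magnitudes (larger means fainter) and that 'var_min' is the faint end, which the reported medians (9.849 vs 9.646) suggest but do not prove; `amplitude` is a hypothetical helper.

```python
def amplitude(var_min, var_max):
    """Magnitude range of a variable star, or None if either bound is missing."""
    if var_min is None or var_max is None:
        return None
    return var_min - var_max

# Difference of the two reported medians (not the median per-star amplitude).
median_gap = amplitude(9.849, 9.646)
missing = amplitude(None, 9.646)
```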
Numeric correlation
id numeric
hip numeric
hd numeric
hr numeric
gl text
Sample values (first 10)
bf text
Sample values (first 10)
- 29 Psc
- 64 Aql
proper categorical
Top values (rank 1–20)
- — 119,161
- p Eridani — 2
- Sol — 1
- Alpheratz — 1
- Caph — 1
- Algenib — 1
- Groombridge 34 — 1
- Citadelle — 1
- Ankaa — 1
- Felixvarela — 1
- Fulu — 1
- Schedar — 1
- Diphda — 1
- Cocibolca — 1
- 96 G. Psc — 1
- Achird — 1
- Van Maanen's Star — 1
- Castula — 1
- Cih — 1
- Nenque — 1
ra numeric
dec numeric
dist numeric
pmra numeric
pmdec numeric
rv numeric
mag numeric
absmag numeric
spect text
Sample values (first 10)
- B7III-IV
- K2
- A0...
- K1IV
- G0V
- F5
- G6/G8III
- F8
- K1III
- G0
ci numeric
x numeric
y numeric
z numeric
vx numeric
vy numeric
vz numeric
rarad numeric
decrad numeric
pmrarad numeric
pmdecrad numeric
bayer categorical
Top values (rank 1–20)
- — 118,089
- Alp — 80
- Bet — 77
- Eps — 74
- Del — 71
- Eta — 69
- Zet — 68
- Gam — 68
- The — 61
- Iot — 60
- Lam — 56
- Kap — 54
- Mu — 49
- Nu — 45
- Sig — 39
- Xi — 38
- Pi — 37
- Rho — 36
- Omi — 36
- Tau — 35
flam categorical
Top values (rank 1–20)
- — 116,889
- 7 — 53
- 8 — 51
- 5 — 51
- 9 — 50
- 4 — 50
- 10 — 49
- 12 — 49
- 16 — 49
- 2 — 48
- 11 — 48
- 13 — 48
- 14 — 48
- 15 — 48
- 1 — 47
- 3 — 46
- 6 — 46
- 17 — 46
- 21 — 45
- 20 — 44
con categorical
Top values (rank 1–20)
- Cen — 4,270
- UMa — 3,616
- Her — 3,434
- Cyg — 3,116
- Hya — 3,061
- Cet — 3,030
- Vir — 2,921
- Eri — 2,789
- Peg — 2,744
- Dra — 2,722
- Sgr — 2,504
- Boo — 2,477
- Pup — 2,427
- Cas — 2,352
- Tau — 2,281
- Oph — 2,270
- Vel — 2,238
- Aqr — 2,188
- Leo — 2,165
- Car — 2,162
comp numeric
comp_primary numeric
base categorical
Top values (rank 1–20)
- — 118,540
- Gl 57.1 — 3
- Gl 60 — 3
- Gl 100 — 3
- Gl 106.1 — 3
- Gl 118.2 — 3
- Gl 120.1 — 3
- Gl 140 — 3
- Gl 153 — 3
- Gl 166 — 3
- Gl 225.2 — 3
- Gl 278 — 3
- Gl 294 — 3
- Gl 319 — 3
- Gl 321.3 — 3
- Gl 331 — 3
- Gl 421 — 3
- Gl 520 — 3
- GJ 9490 — 3
- Gl 586 — 3