saturn

hyg hygdata v41

saturn notebook · generated 2026-05-01 Report Notebook

Overview

Source: /home/coolhand/data/celestial/hyg/hygdata_v41.csv

Saturn profiled 119,626 rows across 37 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
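# Install the saturn CLI (IPython shell escape; run inside the notebook).
!pip install saturn-dissect

import subprocess

# Profile the CSV with the same arguments used for this report.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/data/celestial/hyg/hygdata_v41.csv",
        "--findings", "hyg-hygdata_v41.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the command exits non-zero
)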

Summary confidence: high

This is the HYG star catalog (hygdata_v41.csv) with 119,626 stars and 37 columns covering positions (ra/dec, x/y/z), motion (pmra, pmdec, vx/vy/vz, rv), brightness (mag, absmag, lum, ci), and identifiers/classifications (hd, hip, spect, con, proper). The most informative single field is the spectral type 'spect': it has 4,310 distinct values but is dominated by a handful of classes (K0 ~8.6k, G5 ~6.0k, A0 ~4.9k), giving a clean view of stellar populations. Distance and luminosity are extremely right-skewed (lum skew ≈49, dist max 100,000 pc) with 10–15% outliers, so any analysis on those should use log scales. Radial velocity 'rv' is 81% zeros — effectively a 'measured vs not' flag rather than a continuous variable. Constellation 'con' is the most evenly distributed categorical (89 values, entropy ratio 0.95) led by Cen, UMa, and Her, making it a good grouping key.

citing: row_count · column_count · spect.top_values · spect.n_unique · dist.skew · dist.max · lum.skew · lum.max · rv.zero_rate · con.entropy_ratio · con.top_values · mag.median · absmag.skew

Out[4]:

saturn.schema() · 37 columns

column kind n null% unique alerts
id numeric 119,626 0.0% 119,626
hip numeric 119,626 1.4% 117,951
hd numeric 119,626 17.3% 98,825
hr numeric 119,626 92.4% 9,029 null_rate
gl text 119,626 0.0% 3,802 one_word short_text duplicates
bf text 119,626 0.0% 3,066 one_word short_text duplicates
proper categorical 119,626 0.0% 465 long_tail imbalance
ra numeric 119,626 0.0% 119,263
dec numeric 119,626 0.0% 119,534
dist numeric 119,626 0.0% 5,397 high_skew outliers
pmra numeric 119,626 0.0% 25,644 high_skew outliers
pmdec numeric 119,626 0.0% 23,226 high_skew outliers
rv numeric 119,626 0.0% 1,714 outliers
mag numeric 119,626 0.0% 1,422
absmag numeric 119,626 0.0% 13,452 outliers
spect text 119,626 0.0% 4,310 one_word allcaps short_text duplicates
ci numeric 119,626 1.6% 2,439
x numeric 119,626 0.0% 119,593 outliers
y numeric 119,626 0.0% 119,585 outliers
z numeric 119,626 0.0% 119,588 outliers
vx numeric 119,626 0.0% 21,555 high_skew outliers
vy numeric 119,626 0.0% 25,826 high_skew outliers
vz numeric 119,626 0.0% 23,037 high_skew outliers
rarad numeric 119,626 0.0% 119,585
decrad numeric 119,626 0.0% 119,585
pmrarad numeric 119,626 0.0% 25,647 high_skew outliers
pmdecrad numeric 119,626 0.0% 23,588 outliers
bayer categorical 119,626 0.0% 104 imbalance
flam categorical 119,626 0.0% 139 imbalance
con categorical 119,626 0.0% 89
comp numeric 119,626 0.0% 3 high_skew
comp_primary numeric 119,626 0.0% 119,190
base categorical 119,626 0.0% 651 imbalance
lum numeric 119,626 0.0% 13,465 high_skew outliers
var text 119,626 0.0% 1,523 one_word short_text duplicates
var_min numeric 119,626 85.8% 6,248 null_rate
var_max numeric 119,626 85.8% 6,090 null_rate
Fig 1.
spect · Top spectral classes — K0, G5 and A0 dominate, showing the survey is heavy on cool/sun-like stars.
Character-length distribution for spect (mean: 3.3759049036162705).
chars count
0 3048
1 1542
2 54685
3 20214
4 6547
5 17615
6 6693
7 1634
8 4927
9 1161
10 703
11 746
12 111
Fig 2.
con · Stars per constellation — fairly even (entropy ratio 0.95), with Cen, UMa and Her leading.
Top values for con (20 unique shown, of 89 total).
value count share
Cen 4270 3.6%
UMa 3616 3.0%
Her 3434 2.9%
Cyg 3116 2.6%
Hya 3061 2.6%
Cet 3030 2.5%
Vir 2921 2.4%
Eri 2789 2.3%
Peg 2744 2.3%
Dra 2722 2.3%
Sgr 2504 2.1%
Boo 2477 2.1%
Pup 2427 2.0%
Cas 2352 2.0%
Tau 2281 1.9%
Oph 2270 1.9%
Vel 2238 1.9%
Aqr 2188 1.8%
Leo 2165 1.8%
Car 2162 1.8%
Fig 3.
mag · Apparent magnitude distribution — concentrated around 8.4 with a long faint tail; check the bright outliers (min -26.7 is the Sun).
Histogram bins for mag (median: 8.46).
bin count
-26.7 – -25.51 1
-25.51 – -24.31 0
-24.31 – -23.12 0
-23.12 – -21.93 0
-21.93 – -20.74 0
-20.74 – -19.54 0
-19.54 – -18.35 0
-18.35 – -17.16 0
-17.16 – -15.97 0
-15.97 – -14.77 0
-14.77 – -13.58 0
-13.58 – -12.39 0
-12.39 – -11.2 0
-11.2 – -10 0
-10 – -8.812 0
-8.812 – -7.62 0
-7.62 – -6.427 0
-6.427 – -5.235 0
-5.235 – -4.042 0
-4.042 – -2.85 0
-2.85 – -1.657 0
-1.657 – -0.465 2
-0.465 – 0.7275 9
0.7275 – 1.92 33
1.92 – 3.113 152
3.113 – 4.305 533
4.305 – 5.498 2104
5.498 – 6.69 8246
6.69 – 7.883 26095
7.883 – 9.075 49082
9.075 – 10.27 24650
10.27 – 11.46 5990
11.46 – 12.65 1650
12.65 – 13.85 582
13.85 – 15.04 305
15.04 – 16.23 122
16.23 – 17.42 44
17.42 – 18.62 19
18.62 – 19.81 5
19.81 – 21 2
Fig 4.
dist · Distance is highly right-skewed (median 214 pc, max 100,000 pc); use a log axis and expect ~10% outliers.
Histogram bins for dist (median: 213.6752).
bin count
0 – 2500 109401
2500 – 5000 0
5000 – 7500 0
7500 – 1e+04 0
1e+04 – 1.25e+04 0
1.25e+04 – 1.5e+04 0
1.5e+04 – 1.75e+04 0
1.75e+04 – 2e+04 0
2e+04 – 2.25e+04 0
2.25e+04 – 2.5e+04 0
2.5e+04 – 2.75e+04 0
2.75e+04 – 3e+04 0
3e+04 – 3.25e+04 0
3.25e+04 – 3.5e+04 0
3.5e+04 – 3.75e+04 0
3.75e+04 – 4e+04 0
4e+04 – 4.25e+04 0
4.25e+04 – 4.5e+04 0
4.5e+04 – 4.75e+04 0
4.75e+04 – 5e+04 0
5e+04 – 5.25e+04 0
5.25e+04 – 5.5e+04 0
5.5e+04 – 5.75e+04 0
5.75e+04 – 6e+04 0
6e+04 – 6.25e+04 0
6.25e+04 – 6.5e+04 0
6.5e+04 – 6.75e+04 0
6.75e+04 – 7e+04 0
7e+04 – 7.25e+04 0
7.25e+04 – 7.5e+04 0
7.5e+04 – 7.75e+04 0
7.75e+04 – 8e+04 0
8e+04 – 8.25e+04 0
8.25e+04 – 8.5e+04 0
8.5e+04 – 8.75e+04 0
8.75e+04 – 9e+04 0
9e+04 – 9.25e+04 0
9.25e+04 – 9.5e+04 0
9.5e+04 – 9.75e+04 0
9.75e+04 – 1e+05 10225
Fig 5.
ci · Color index B-V — bimodal-ish spread from -0.4 to 5.46 reflects the mix of hot blue and cool red stars.
Histogram bins for ci (median: 0.616).
bin count
-0.4 – -0.2535 59
-0.2535 – -0.107 1383
-0.107 – 0.0395 8328
0.0395 – 0.186 9569
0.186 – 0.3325 8982
0.3325 – 0.479 14728
0.479 – 0.6255 16591
0.6255 – 0.772 8623
0.772 – 0.9185 5682
0.9185 – 1.065 12808
1.065 – 1.212 10614
1.212 – 1.358 6401
1.358 – 1.505 5724
1.505 – 1.651 5558
1.651 – 1.798 1851
1.798 – 1.944 435
1.944 – 2.091 121
2.091 – 2.237 88
2.237 – 2.384 40
2.384 – 2.53 42
2.53 – 2.677 24
2.677 – 2.823 28
2.823 – 2.97 11
2.97 – 3.116 22
3.116 – 3.263 3
3.263 – 3.409 8
3.409 – 3.556 4
3.556 – 3.702 0
3.702 – 3.849 2
3.849 – 3.995 0
3.995 – 4.142 1
4.142 – 4.288 0
4.288 – 4.434 2
4.434 – 4.581 0
4.581 – 4.728 0
4.728 – 4.874 1
4.874 – 5.021 0
5.021 – 5.167 0
5.167 – 5.314 1
5.314 – 5.46 1
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column kind null %
id numeric 0.0%
hip numeric 1.4%
hd numeric 17.3%
hr numeric 92.4%
gl text 0.0%
bf text 0.0%
proper categorical 0.0%
ra numeric 0.0%
dec numeric 0.0%
dist numeric 0.0%
pmra numeric 0.0%
pmdec numeric 0.0%
rv numeric 0.0%
mag numeric 0.0%
absmag numeric 0.0%
spect text 0.0%
ci numeric 1.6%
x numeric 0.0%
y numeric 0.0%
z numeric 0.0%
vx numeric 0.0%
vy numeric 0.0%
vz numeric 0.0%
rarad numeric 0.0%
decrad numeric 0.0%
pmrarad numeric 0.0%
pmdecrad numeric 0.0%
bayer categorical 0.0%
flam categorical 0.0%
con categorical 0.0%
comp numeric 0.0%
comp_primary numeric 0.0%
base categorical 0.0%
lum numeric 0.0%
var text 0.0%
var_min numeric 85.8%
var_max numeric 85.8%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
column id hip hd hr ra dec dist pmra pmdec rv mag absmag
id +1.00 +0.99 +0.87 +0.99 +0.97 -0.03 +0.08 +0.03 -0.01 -0.08 +0.02 -0.04
hip +0.99 +1.00 +0.87 +0.98 +0.96 -0.03 +0.08 +0.03 -0.01 -0.08 +0.02 -0.04
hd +0.87 +0.87 +1.00 +0.86 +0.85 -0.03 +0.09 +0.06 +0.00 -0.08 +0.03 -0.04
hr +0.99 +0.98 +0.86 +1.00 +0.98 -0.04 +0.09 -0.02 +0.00 -0.09 -0.02 -0.08
ra +0.97 +0.96 +0.85 +0.98 +1.00 -0.05 +0.09 -0.04 +0.03 -0.09 -0.05 -0.11
dec -0.03 -0.03 -0.03 -0.04 -0.05 +1.00 -0.05 +0.00 -0.01 -0.06 -0.00 +0.03
dist +0.08 +0.08 +0.09 +0.09 +0.09 -0.05 +1.00 -0.01 +0.06 +0.03 +0.09 -0.85
pmra +0.03 +0.03 +0.06 -0.02 -0.04 +0.00 -0.01 +1.00 -0.06 -0.10 +0.26 +0.12
pmdec -0.01 -0.01 +0.00 +0.00 +0.03 -0.01 +0.06 -0.06 +1.00 +0.07 -0.06 -0.20
rv -0.08 -0.08 -0.08 -0.09 -0.09 -0.06 +0.03 -0.10 +0.07 +1.00 +0.02 -0.01
mag +0.02 +0.02 +0.03 -0.02 -0.05 -0.00 +0.09 +0.26 -0.06 +0.02 +1.00 +0.26
absmag -0.04 -0.04 -0.04 -0.08 -0.11 +0.03 -0.85 +0.12 -0.20 -0.01 +0.26 +1.00

id numeric identifier

This is a row identifier: 119,626 values, all unique, no nulls, ranging from 0 to 119,630 with a near-perfectly uniform distribution (mean 59,813.16, median 59,813.5, skew ~1.5e-05, kurtosis -1.20). The range 0 to 119,630 holds 119,631 possible values against n=119,626, which points to a 0-based sequential id with five gaps. No analytical signal lives here.

Treatment: drop from modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high
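The gap count can be checked directly. The sketch below is a minimal illustration, not part of saturn: `find_missing_ids` is a hypothetical helper, and the sample ids are toy values rather than catalog rows.

```python
def find_missing_ids(ids):
    """Return the integers absent from an otherwise contiguous id sequence."""
    present = set(ids)
    lo, hi = min(present), max(present)
    return [i for i in range(lo, hi + 1) if i not in present]


# Toy data (not from the catalog): 0..9 with 3 and 7 removed.
sample_ids = [0, 1, 2, 4, 5, 6, 8, 9]
missing = find_missing_ids(sample_ids)
print(missing)
```

On the real column, the range 0 to 119,630 has 119,631 slots for 119,626 rows, so the same call should report five missing ids.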
Out[13]:

saturn.columns["id"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 119,626
min 0
max 119,630
mean 5.981e+04
median 5.981e+04
std 3.453e+04
q1 2.991e+04
q3 8.972e+04
iqr 5.981e+04
skew 1.471e-05
kurtosis -1.2
n_outliers 0
outlier_rate 0
zero_rate 8.359e-06
Fig 8.
Distribution of id. Vertical dash marks the median.
Histogram bins for id (median: 59813.5).
bin count
0 – 2991 2991
2991 – 5982 2991
5982 – 8972 2991
8972 – 1.196e+04 2990
1.196e+04 – 1.495e+04 2991
1.495e+04 – 1.794e+04 2991
1.794e+04 – 2.094e+04 2991
2.094e+04 – 2.393e+04 2990
2.393e+04 – 2.692e+04 2991
2.692e+04 – 2.991e+04 2991
2.991e+04 – 3.29e+04 2991
3.29e+04 – 3.589e+04 2990
3.589e+04 – 3.888e+04 2991
3.888e+04 – 4.187e+04 2991
4.187e+04 – 4.486e+04 2991
4.486e+04 – 4.785e+04 2990
4.785e+04 – 5.084e+04 2991
5.084e+04 – 5.383e+04 2991
5.383e+04 – 5.682e+04 2990
5.682e+04 – 5.982e+04 2990
5.982e+04 – 6.281e+04 2991
6.281e+04 – 6.58e+04 2991
6.58e+04 – 6.879e+04 2991
6.879e+04 – 7.178e+04 2990
7.178e+04 – 7.477e+04 2991
7.477e+04 – 7.776e+04 2991
7.776e+04 – 8.075e+04 2991
8.075e+04 – 8.374e+04 2990
8.374e+04 – 8.673e+04 2991
8.673e+04 – 8.972e+04 2991
8.972e+04 – 9.271e+04 2991
9.271e+04 – 9.57e+04 2990
9.57e+04 – 9.869e+04 2991
9.869e+04 – 1.017e+05 2991
1.017e+05 – 1.047e+05 2991
1.047e+05 – 1.077e+05 2990
1.077e+05 – 1.107e+05 2991
1.107e+05 – 1.136e+05 2991
1.136e+05 – 1.166e+05 2989
1.166e+05 – 1.196e+05 2989

hip numeric identifier

This is almost certainly the Hipparcos catalog identifier (HIP number) for stars: integer values running from 1 to 120,404, near-perfectly uniform (skew ≈ 0.0002, kurtosis ≈ -1.2) apart from a sparse top bin. The 1,675 nulls (1.4%) suggest some rows lack a Hipparcos cross-match, and the remaining 117,951 rows each carry a distinct value. No outliers and no zeros, consistent with a catalog index rather than a measurement.

Treatment: Treat as a catalog ID; left-join to Hipparcos metadata and exclude from modelling.

anthropic:claude-opus-4-7 · confidence high
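A left join keeps every catalog row, including the 1.4% without a HIP number. The following is a minimal sketch in plain Python, assuming the Hipparcos metadata is available as a dict keyed by HIP number; `left_join_on_hip`, the `meta` field, and the toy records are hypothetical and only illustrate the join semantics.

```python
def left_join_on_hip(stars, hip_meta):
    """Left-join star rows to metadata keyed by HIP number.

    Rows whose 'hip' is None, or has no match, are kept with meta=None.
    """
    joined = []
    for row in stars:
        hip = row.get("hip")
        meta = hip_meta.get(hip) if hip is not None else None
        joined.append({**row, "meta": meta})
    return joined


# Toy rows (not from the catalog): a match, a null key, and an unmatched key.
stars = [{"id": 1, "hip": 1}, {"id": 2, "hip": None}, {"id": 3, "hip": 99}]
hip_meta = {1: {"note": "toy record"}}  # hypothetical lookup table
result = left_join_on_hip(stars, hip_meta)
print(result)
```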
Out[16]:

saturn.columns["hip"].stats

stat value
n 119,626
nulls 1,675 (1.4%)
unique 117,951
min 1
max 120,404
mean 5.917e+04
median 59,172
std 3.417e+04
q1 2.956e+04
q3 8.876e+04
iqr 59,198
skew 0.0001943
kurtosis -1.2
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 9.
Distribution of hip. Vertical dash marks the median.
Histogram bins for hip (median: 59172.0).
bin count
1 – 3011 3005
3011 – 6021 3003
6021 – 9031 3004
9031 – 1.204e+04 2998
1.204e+04 – 1.505e+04 3004
1.505e+04 – 1.806e+04 3003
1.806e+04 – 2.107e+04 3001
2.107e+04 – 2.408e+04 3004
2.408e+04 – 2.709e+04 3001
2.709e+04 – 3.01e+04 3002
3.01e+04 – 3.311e+04 2994
3.311e+04 – 3.612e+04 2994
3.612e+04 – 3.913e+04 2997
3.913e+04 – 4.214e+04 3001
4.214e+04 – 4.515e+04 2999
4.515e+04 – 4.816e+04 3003
4.816e+04 – 5.117e+04 2999
5.117e+04 – 5.418e+04 2997
5.418e+04 – 5.719e+04 2994
5.719e+04 – 6.02e+04 2997
6.02e+04 – 6.321e+04 2993
6.321e+04 – 6.622e+04 3003
6.622e+04 – 6.923e+04 2996
6.923e+04 – 7.224e+04 3006
7.224e+04 – 7.525e+04 3006
7.525e+04 – 7.826e+04 3003
7.826e+04 – 8.127e+04 3000
8.127e+04 – 8.428e+04 2997
8.428e+04 – 8.729e+04 2998
8.729e+04 – 9.03e+04 3000
9.03e+04 – 9.331e+04 2996
9.331e+04 – 9.632e+04 3001
9.632e+04 – 9.933e+04 2994
9.933e+04 – 1.023e+05 2995
1.023e+05 – 1.054e+05 3000
1.054e+05 – 1.084e+05 3005
1.084e+05 – 1.114e+05 3006
1.114e+05 – 1.144e+05 2994
1.144e+05 – 1.174e+05 3005
1.174e+05 – 1.204e+05 953

hd numeric identifier

The 'hd' column is a numeric identifier, most likely the Henry Draper catalogue number: the 98,885 non-null rows carry 98,825 distinct values, so only a few dozen numbers repeat. Values span 1 to 358,431 with a mean of 114,357 and median of 110,358, a mildly right-skewed distribution (skew 0.28, kurtosis -0.73) with no flagged outliers. Notably, 17.34% of rows (20,741) are null, which limits any join on this key.

Treatment: Treat as a high-cardinality id; drop from modelling and use only as a join key. Filter out or keep the 17% nulls as missing rather than imputing them, since an identifier has no meaningful fill value.

anthropic:claude-opus-4-7 · confidence medium
Out[19]:

saturn.columns["hd"].stats

stat value
n 119,626
nulls 20,741 (17.3%)
unique 98,825
min 1
max 358,431
mean 1.144e+05
median 110,358
std 7.418e+04
q1 46,723
q3 175,823
iqr 129,100
skew 0.2822
kurtosis -0.7265
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 10.
Distribution of hd. Vertical dash marks the median.
Histogram bins for hd (median: 110358.0).
bin count
1 – 8962 5143
8962 – 1.792e+04 5184
1.792e+04 – 2.688e+04 4999
2.688e+04 – 3.584e+04 4556
3.584e+04 – 4.48e+04 4043
4.48e+04 – 5.377e+04 3196
5.377e+04 – 6.273e+04 2916
6.273e+04 – 7.169e+04 3086
7.169e+04 – 8.065e+04 3509
8.065e+04 – 8.961e+04 3884
8.961e+04 – 9.857e+04 3785
9.857e+04 – 1.075e+05 3874
1.075e+05 – 1.165e+05 3980
1.165e+05 – 1.255e+05 3612
1.255e+05 – 1.344e+05 3333
1.344e+05 – 1.434e+05 3331
1.434e+05 – 1.523e+05 3446
1.523e+05 – 1.613e+05 3365
1.613e+05 – 1.703e+05 2892
1.703e+05 – 1.792e+05 3103
1.792e+05 – 1.882e+05 3010
1.882e+05 – 1.971e+05 3485
1.971e+05 – 2.061e+05 3895
2.061e+05 – 2.151e+05 4125
2.151e+05 – 2.24e+05 4602
2.24e+05 – 2.33e+05 1072
2.33e+05 – 2.419e+05 1340
2.419e+05 – 2.509e+05 263
2.509e+05 – 2.599e+05 158
2.599e+05 – 2.688e+05 146
2.688e+05 – 2.778e+05 124
2.778e+05 – 2.867e+05 394
2.867e+05 – 2.957e+05 146
2.957e+05 – 3.047e+05 83
3.047e+05 – 3.136e+05 79
3.136e+05 – 3.226e+05 83
3.226e+05 – 3.315e+05 120
3.315e+05 – 3.405e+05 223
3.405e+05 – 3.495e+05 199
3.495e+05 – 3.584e+05 101

hr numeric feature

The 'hr' column is a numeric field populated for only 7.6% of the 119,626 rows (null_rate 0.9244), with values ranging from 1 to 9110 and 9,029 distinct values among the 9,041 non-null rows. Its near-zero skew (-0.003), flat kurtosis (-1.20), and mean (4563.9) almost equal to median (4566.0) indicate a near-uniform spread across the 1–9110 range. Together with its +0.99 correlation with id and hip in Fig 7, that fits a catalogue index (plausibly the Harvard Revised number of the Bright Star Catalogue) rather than an hour-of-day or heart-rate measure. The extreme null rate is the dominant concern.

Treatment: Do not impute; an identifier has no meaningful fill value. Keep as a sparse join key or drop from modelling, and verify the semantic meaning of 'hr' before use since the 1–9110 range is not an hour or heart-rate scale.

anthropic:claude-opus-4-7 · confidence medium
Out[22]:

saturn.columns["hr"].stats

stat value
n 119,626
nulls 110,585 (92.4%)
unique 9,029
min 1
max 9,110
mean 4564
median 4,566
std 2632
q1 2,283
q3 6,848
iqr 4,565
skew -0.003426
kurtosis -1.202
n_outliers 0
outlier_rate 0
zero_rate 0
alert: null_rate 92.4% null
Fig 11.
Distribution of hr. Vertical dash marks the median.
Histogram bins for hr (median: 4566.0).
bin count
1 – 228.7 225
228.7 – 456.4 226
456.4 – 684.2 225
684.2 – 911.9 225
911.9 – 1140 226
1140 – 1367 226
1367 – 1595 227
1595 – 1823 227
1823 – 2051 222
2051 – 2278 227
2278 – 2506 222
2506 – 2734 228
2734 – 2961 224
2961 – 3189 227
3189 – 3417 222
3417 – 3645 226
3645 – 3872 226
3872 – 4100 226
4100 – 4328 226
4328 – 4556 227
4556 – 4783 224
4783 – 5011 226
5011 – 5239 228
5239 – 5466 227
5466 – 5694 224
5694 – 5922 225
5922 – 6150 225
6150 – 6377 226
6377 – 6605 225
6605 – 6833 225
6833 – 7060 227
7060 – 7288 227
7288 – 7516 228
7516 – 7744 227
7744 – 7971 227
7971 – 8199 227
8199 – 8427 228
8427 – 8655 229
8655 – 8882 229
8882 – 9110 227

gl text foreign_key

This column 'gl' holds Gliese/GJ catalogue identifiers for stars (e.g., 'GJ 1293', 'Gl 914B'), with 'gj' and 'gl' being the dominant tokens (344 and 331 occurrences). It is overwhelmingly empty: 115,825 of 119,626 rows are blank, driving a 96.8% duplicate rate and leaving only ~3,800 unique designations. When populated, values are short single tokens (len_max 9, word_mean 1.03).

Treatment: Treat as a sparse cross-catalogue star ID; left-join on it where present and ignore the empty majority.

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["gl"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 3,802
len_min 0
len_max 9
len_mean 0.2263
len_median 0
len_p95 0
word_mean 1.032
word_median 1
n_empty 115,825
n_duplicates 115,824
duplicate_rate 0.9682
vocab_size 677
readability_flesch_mean 4.808
emoji_rate 0
url_rate 0
one_word_rate 0.9682
allcaps_rate 0.01719
boilerplate_rate 0
alert: one_word 96.8% rows are a single word
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 96.8% duplicate strings
Fig 12.
Character-length distribution for gl.
Character-length distribution for gl (mean: 0.22627188069483223).
chars count
0 115825
1 0
2 0
3 0
4 7
5 73
6 615
7 2092
8 785
9 229

bf text metadata

This column appears to hold Bayer/Flamsteed star designations (e.g. "41The1Ori", "66Alp Gem"), with the trailing tokens being three-letter constellation abbreviations like leo, her, cyg, vir. It is overwhelmingly empty: 116,527 of 119,626 rows are blank strings, driving a 97.4% duplicate rate and a mean length of 0.22 characters. Among the 3,099 non-empty entries the values are nearly unique, suggesting this label only applies to a small subset of named/cataloged stars.

Treatment: Treat empty strings as missing and use only as a sparse identifier for cataloged stars; drop from modelling features.

anthropic:claude-opus-4-7 · confidence high
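Because blanks are stored as empty strings, the reported null rate of 0.0% hides the real sparsity. A minimal sketch of the blank-to-missing step, in plain Python; `blank_to_none` is a hypothetical helper, and the sample reuses the two designations quoted above alongside made-up blanks.

```python
def blank_to_none(values):
    """Map empty or whitespace-only strings to None; leave other values unchanged."""
    return [None if isinstance(v, str) and not v.strip() else v for v in values]


# Small illustrative sample, not a slice of the real column.
bf_sample = ["", "41The1Ori", "  ", "66Alp Gem"]
cleaned = blank_to_none(bf_sample)
null_rate = sum(v is None for v in cleaned) / len(cleaned)
print(cleaned, null_rate)
```

Applied to the full column, this would turn the 116,527 blank rows into missing values, a null rate of about 97.4% instead of 0.0%.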
Out[28]:

saturn.columns["bf"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 3,066
len_min 0
len_max 10
len_mean 0.2206
len_median 0
len_p95 0
word_mean 1.064
word_median 1
n_empty 116,527
n_duplicates 116,560
duplicate_rate 0.9744
vocab_size 369
readability_flesch_mean 1.986
emoji_rate 0
url_rate 0
one_word_rate 0.9763
allcaps_rate 0
boilerplate_rate 0
alert: one_word 97.6% rows are a single word
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 97.4% duplicate strings
Fig 13.
Character-length distribution for bf.
Character-length distribution for bf (mean: 0.22057077892765786).
chars count
0 116527
1 0
2 0
3 0
4 0
5 0
6 0
7 582
8 417
9 2024
10 76

proper categorical metadata

This is the proper name of a star or named celestial object, populated for only a tiny fraction of rows. Empty strings dominate at 119,161 of 119,626 (top_rate 0.9961), leaving 465 named rows that carry 464 distinct names such as 'Sol', 'Alpheratz', and 'Caph'; 463 of them are singletons and only 'p Eridani' appears twice. Entropy ratio of 0.008 confirms the column carries almost no information in aggregate.

Treatment: Treat as a sparse name lookup; convert blanks to null and use only as a display label, not a model feature.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["proper"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 465
top_value (blank)
top_rate 0.9961
cardinality 465
entropy 0.07115
entropy_ratio 0.008029
alert: long_tail 463 singleton categories
alert: imbalance top value is 99.6% of rows
Fig 14.
Top values for proper.
Top values for proper (20 unique shown, of 465 total).
value count share
(blank) 119161 99.6%
p Eridani 2 0.0%
Sol 1 0.0%
Alpheratz 1 0.0%
Caph 1 0.0%
Algenib 1 0.0%
Groombridge 34 1 0.0%
Citadelle 1 0.0%
Ankaa 1 0.0%
Felixvarela 1 0.0%
Fulu 1 0.0%
Schedar 1 0.0%
Diphda 1 0.0%
Cocibolca 1 0.0%
96 G. Psc 1 0.0%
Achird 1 0.0%
Van Maanen's Star 1 0.0%
Castula 1 0.0%
Cih 1 0.0%
Nenque 1 0.0%
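The entropy figures for proper can be reproduced from the counts alone. This report does not state how saturn defines them, so the sketch below rests on an assumption: entropy is Shannon entropy in bits, and entropy_ratio divides it by log2 of the cardinality. The function names are hypothetical; the counts are reconstructed from the stats above (one blank category, 'p Eridani' twice, 463 singletons).

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits, of a categorical distribution given by counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# 465 categories covering 119,626 rows, as reported for 'proper'.
proper_counts = [119_161, 2] + [1] * 463
h = entropy_bits(proper_counts)
ratio = entropy_ratio(proper_counts)
print(round(h, 5), round(ratio, 6))
```

Under that assumption the result agrees with the reported entropy of 0.07115 and entropy_ratio of 0.008029, which suggests the definition is at least close to the one saturn uses.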

ra numeric feature

Values span 0.0 to 23.998594 with a near-symmetric distribution (skew -0.012, kurtosis -1.20) and mean 12.09 close to median 12.13, consistent with right ascension expressed in hours. With 119263 unique values across 119626 rows, near-zero zero_rate, and no nulls or outliers, the column behaves like a continuous astronomical coordinate rather than a categorical feature.

Treatment: Treat as a circular coordinate (e.g., encode via sine/cosine on the 0-24 hour range) before modelling.

anthropic:claude-opus-4-7 · confidence high
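The sine/cosine encoding in the treatment can be sketched as follows. It assumes ra is in hours on a 24-hour circle, as the range above suggests; `encode_ra_hours` is a hypothetical helper name.

```python
import math


def encode_ra_hours(ra_hours):
    """Map right ascension in hours to (sin, cos) on the unit circle."""
    angle = 2.0 * math.pi * (ra_hours % 24.0) / 24.0
    return math.sin(angle), math.cos(angle)


s0, c0 = encode_ra_hours(0.0)
s6, c6 = encode_ra_hours(6.0)
# The column maximum sits right next to 0h once the circle is closed.
s_wrap, c_wrap = encode_ra_hours(23.998594)
wrap_gap = math.hypot(s_wrap - s0, c_wrap - c0)
print(wrap_gap)
```

The point of the encoding is the last line: 23.998594 h and 0 h are almost 24 apart as raw numbers but nearly coincide in the encoded space.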
Out[34]:

saturn.columns["ra"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 119,263
min 0
max 24
mean 12.09
median 12.13
std 6.887
q1 6.217
q3 18.12
iqr 11.9
skew -0.01197
kurtosis -1.198
n_outliers 0
outlier_rate 0
zero_rate 8.359e-06
Fig 15.
Distribution of ra. Vertical dash marks the median.
Histogram bins for ra (median: 12.127026).
bin count
0 – 0.6 2880
0.6 – 1.2 2819
1.2 – 1.8 2802
1.8 – 2.4 2840
2.4 – 3 2833
3 – 3.6 2851
3.6 – 4.2 2850
4.2 – 4.8 2740
4.8 – 5.4 2994
5.4 – 6 3178
6 – 6.6 3135
6.6 – 7.2 3281
7.2 – 7.8 3294
7.8 – 8.4 3142
8.4 – 8.999 3054
8.999 – 9.599 2966
9.599 – 10.2 2878
10.2 – 10.8 2891
10.8 – 11.4 2856
11.4 – 12 2897
12 – 12.6 3004
12.6 – 13.2 2937
13.2 – 13.8 2964
13.8 – 14.4 3088
14.4 – 15 3048
15 – 15.6 3006
15.6 – 16.2 3027
16.2 – 16.8 2886
16.8 – 17.4 2941
17.4 – 18 3020
18 – 18.6 3069
18.6 – 19.2 3184
19.2 – 19.8 3112
19.8 – 20.4 3207
20.4 – 21 3076
21 – 21.6 3021
21.6 – 22.2 3005
22.2 – 22.8 3019
22.8 – 23.4 2970
23.4 – 24 2861

dec numeric feature

This column is almost certainly declination (dec), the celestial latitude coordinate, with values bounded in [-89.78, 89.57] degrees matching the full sky range. The distribution is nearly symmetric (skew 0.04) and platykurtic (kurtosis -1.02) with median near -1.64 and IQR spanning ~68 degrees, suggesting broad sky coverage rather than a concentrated survey footprint. With 119,534 unique values across 119,626 rows and no nulls or outliers, it behaves as a continuous astrometric coordinate.

Treatment: Use as-is as a continuous coordinate, optionally pairing with RA for spatial features.

anthropic:claude-opus-4-7 · confidence high
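One common way to pair dec with RA for spatial features is a unit vector on the celestial sphere. The sketch below assumes ra in hours and dec in degrees, and uses the usual spherical-to-Cartesian convention; `radec_to_unit_vector` is a hypothetical helper, and whether the catalog's own x/y/z columns follow the same axis convention is not verified here.

```python
import math


def radec_to_unit_vector(ra_hours, dec_deg):
    """Convert (ra in hours, dec in degrees) to a unit vector (x, y, z)."""
    ra = math.radians(ra_hours * 15.0)  # 1 hour of RA = 15 degrees
    dec = math.radians(dec_deg)
    return (
        math.cos(dec) * math.cos(ra),
        math.cos(dec) * math.sin(ra),
        math.sin(dec),
    )


origin = radec_to_unit_vector(0.0, 0.0)
pole = radec_to_unit_vector(5.0, 90.0)
# Medians of the two columns, used only as an illustrative point.
mid = radec_to_unit_vector(12.127026, -1.6400785)
print(origin, pole, mid)
```

Unlike raw ra/dec, distances between these vectors behave sensibly across the 0h/24h seam and near the poles.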
Out[37]:

saturn.columns["dec"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 119,534
min -89.78
max 89.57
mean -1.986
median -1.64
std 40.96
q1 -36.42
q3 31.51
iqr 67.94
skew 0.03675
kurtosis -1.019
n_outliers 0
outlier_rate 0
zero_rate 8.359e-06
Fig 16.
Distribution of dec. Vertical dash marks the median.
Histogram bins for dec (median: -1.6400785).
bin count
-89.78 – -85.3 177
-85.3 – -80.81 517
-80.81 – -76.33 940
-76.33 – -71.85 1408
-71.85 – -67.36 1998
-67.36 – -62.88 2595
-62.88 – -58.4 3127
-58.4 – -53.91 3392
-53.91 – -49.43 3868
-49.43 – -44.94 4034
-44.94 – -40.46 4191
-40.46 – -35.98 4069
-35.98 – -31.49 4172
-31.49 – -27.01 4171
-27.01 – -22.53 3850
-22.53 – -18.04 3713
-18.04 – -13.56 3764
-13.56 – -9.074 3665
-9.074 – -4.59 3754
-4.59 – -0.1065 3827
-0.1065 – 4.377 4046
4.377 – 8.861 4037
8.861 – 13.34 4055
13.34 – 17.83 4100
17.83 – 22.31 4088
22.31 – 26.8 4036
26.8 – 31.28 3944
31.28 – 35.76 3889
35.76 – 40.25 4015
40.25 – 44.73 3586
44.73 – 49.22 3484
49.22 – 53.7 3438
53.7 – 58.18 2771
58.18 – 62.67 2562
62.67 – 67.15 2027
67.15 – 71.63 1573
71.63 – 76.12 1182
76.12 – 80.6 860
80.6 – 85.09 510
85.09 – 89.57 191

dist numeric feature

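Numeric 'dist' column (a distance, in parsecs according to the summary) with 119,626 rows, no nulls, and 5,397 distinct values. The distribution is severely right-skewed (skew 2.97, kurtosis 6.79): the median is 213.68 with quartiles at 115.07 and 392.16 (IQR 277.1), yet the mean is 8,772.29 and the max reaches exactly 100,000. The histogram puts 10,225 rows in the top bin and none between 2,500 and 97,500, so the ceiling looks like a sentinel for missing or unreliable distances rather than a measured value. That block accounts for most of the 12,350 rows (10.3%) flagged as outliers and inflates the std (27,890.67) far beyond the IQR.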

Treatment: Investigate the 100000 ceiling for sentinel encoding, then log-transform before modelling.

anthropic:claude-opus-4-7 · confidence high
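The two steps in the treatment, masking the ceiling and then taking logs, can be sketched as below. Treating 100,000 as a sentinel is an assumption drawn from the stats, not something this report confirms; `DIST_CEILING` and `clean_log_dist` are hypothetical names.

```python
import math

DIST_CEILING = 100_000.0  # assumed sentinel value, taken from the column max


def clean_log_dist(dist):
    """Return log10(dist), or None for ceiling values and non-positive distances."""
    if dist >= DIST_CEILING or dist <= 0:
        return None
    return math.log10(dist)


# Column min, median and max as reported in the stats.
sample = [0.0, 213.6752, 100_000.0]
logged = [clean_log_dist(d) for d in sample]
print(logged)
```

The non-positive guard matters because the column minimum is 0, where the logarithm is undefined.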
Out[40]:

saturn.columns["dist"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 5,397
min 0
max 100,000
mean 8772
median 213.7
std 2.789e+04
q1 115.1
q3 392.2
iqr 277.1
skew 2.965
kurtosis 6.792
n_outliers 12,350
outlier_rate 0.1032
zero_rate 8.359e-06
alert: high_skew skew=+2.97
alert: outliers 10.3% rows beyond 1.5 IQR
Fig 17.
Distribution of dist. Vertical dash marks the median.
Histogram bins for dist (median: 213.6752).
bin count
0 – 2500 109401
2500 – 5000 0
5000 – 7500 0
7500 – 1e+04 0
1e+04 – 1.25e+04 0
1.25e+04 – 1.5e+04 0
1.5e+04 – 1.75e+04 0
1.75e+04 – 2e+04 0
2e+04 – 2.25e+04 0
2.25e+04 – 2.5e+04 0
2.5e+04 – 2.75e+04 0
2.75e+04 – 3e+04 0
3e+04 – 3.25e+04 0
3.25e+04 – 3.5e+04 0
3.5e+04 – 3.75e+04 0
3.75e+04 – 4e+04 0
4e+04 – 4.25e+04 0
4.25e+04 – 4.5e+04 0
4.5e+04 – 4.75e+04 0
4.75e+04 – 5e+04 0
5e+04 – 5.25e+04 0
5.25e+04 – 5.5e+04 0
5.5e+04 – 5.75e+04 0
5.75e+04 – 6e+04 0
6e+04 – 6.25e+04 0
6.25e+04 – 6.5e+04 0
6.5e+04 – 6.75e+04 0
6.75e+04 – 7e+04 0
7e+04 – 7.25e+04 0
7.25e+04 – 7.5e+04 0
7.5e+04 – 7.75e+04 0
7.75e+04 – 8e+04 0
8e+04 – 8.25e+04 0
8.25e+04 – 8.5e+04 0
8.5e+04 – 8.75e+04 0
8.75e+04 – 9e+04 0
9e+04 – 9.25e+04 0
9.25e+04 – 9.5e+04 0
9.5e+04 – 9.75e+04 0
9.75e+04 – 1e+05 10225

pmra numeric feature

This is `pmra`, almost certainly proper motion in right ascension (mas/yr) for ~119k astronomical sources, centered near zero (median -1.68, mean -1.31) with an interquartile range of 27.64. The distribution is extremely heavy-tailed: skew 4.61, kurtosis 433.55, std 118.18, and extremes spanning -4432.65 to 6767.26, with 16.4% of rows (19,615) flagged as outliers. No nulls and only 0.04% exact zeros, so the field is densely populated but dominated by tail behaviour.

Treatment: Winsorize or apply a signed log/arcsinh transform before modelling to tame the heavy tails.

anthropic:claude-opus-4-7 · confidence high
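An arcsinh transform is one way to compress the tails while keeping the sign and staying roughly linear near zero. The sketch below scales by the IQR first; that choice of scale is an illustrative assumption, and `asinh_scaled` is a hypothetical helper.

```python
import math

PMRA_IQR = 27.64  # interquartile range reported for pmra


def asinh_scaled(value, scale):
    """Signed, log-like compression: asinh(value / scale)."""
    return math.asinh(value / scale)


# Column min, median, zero and max as reported in the stats.
values = [-4432.65, -1.68, 0.0, 6767.26]
transformed = [asinh_scaled(v, PMRA_IQR) for v in values]
print(transformed)
```

After the transform the extremes land within a few units of zero, so a single outlying proper motion no longer dominates the scale.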
Out[43]:

saturn.columns["pmra"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 25,644
min -4433
max 6767
mean -1.307
median -1.68
std 118.2
q1 -15.46
q3 12.18
iqr 27.64
skew 4.608
kurtosis 433.6
n_outliers 19,615
outlier_rate 0.164
zero_rate 0.0004013
alert: high_skew skew=+4.61
alert: outliers 16.4% rows beyond 1.5 IQR
Fig 18.
Distribution of pmra. Vertical dash marks the median.
Histogram bins for pmra (median: -1.68).
bin count
-4433 – -4153 2
-4153 – -3873 0
-3873 – -3593 5
-3593 – -3313 0
-3313 – -3033 1
-3033 – -2753 1
-2753 – -2473 1
-2473 – -2193 2
-2193 – -1913 5
-1913 – -1633 8
-1633 – -1353 16
-1353 – -1073 36
-1073 – -792.7 94
-792.7 – -512.7 268
-512.7 – -232.7 1339
-232.7 – 47.31 106838
47.31 – 327.3 10019
327.3 – 607.3 668
607.3 – 887.3 176
887.3 – 1167 59
1167 – 1447 43
1447 – 1727 10
1727 – 2007 12
2007 – 2287 4
2287 – 2567 3
2567 – 2847 1
2847 – 3127 3
3127 – 3407 2
3407 – 3687 2
3687 – 3967 1
3967 – 4247 4
4247 – 4527 0
4527 – 4807 0
4807 – 5087 0
5087 – 5367 0
5367 – 5647 1
5647 – 5927 0
5927 – 6207 0
6207 – 6487 0
6487 – 6767 2

pmdec numeric feature

This is `pmdec`, almost certainly proper motion in declination (mas/yr) from an astrometric catalog. The bulk sits between quartiles of -22.4 and 3.77 (IQR 26.17) around a median of -5.76, but the distribution is extremely heavy-tailed: kurtosis of 934.5, skew -2.60, and a min of -5813 against a max of 9999.99. The max looks like a sentinel/missing-value flag rather than a real motion; the histogram shows a single row in the top bin, isolated from everything below about 5,651. About 14.4% of rows (17,188) fall outside the standard outlier fence.

Treatment: Filter the 9999.99 sentinel and clip extreme tails before any modelling; consider robust-scaling rather than a log transform since values are signed.

anthropic:claude-opus-4-7 · confidence high
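Sentinel filtering followed by robust scaling can be sketched with the standard library alone. Treating 9999.99 as a missing-value flag is an assumption based on the stats; `PMDEC_SENTINEL` and `robust_scale` are hypothetical names, and the sample mixes the reported quartiles with made-up values.

```python
import statistics

PMDEC_SENTINEL = 9999.99  # assumed missing-value flag, taken from the column max


def robust_scale(values, sentinel=PMDEC_SENTINEL, tol=1e-6):
    """Scale as (value - median) / IQR, mapping sentinel or None entries to None."""

    def is_missing(v):
        return v is None or abs(v - sentinel) <= tol

    kept = [v for v in values if not is_missing(v)]
    q1, median, q3 = statistics.quantiles(kept, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust scaling is undefined")
    return [None if is_missing(v) else (v - median) / iqr for v in values]


sample = [-22.4, -5.76, 3.77, 9999.99, -40.0, 10.0]
scaled = robust_scale(sample)
print(scaled)
```

Median and IQR are computed after the sentinel is dropped, so the flag value cannot distort the scale it is being removed from.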
Out[46]:

saturn.columns["pmdec"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 23,226
min -5,813
max 1e+04
mean -19.33
median -5.76
std 112.5
q1 -22.4
q3 3.77
iqr 26.17
skew -2.605
kurtosis 934.5
n_outliers 17,188
outlier_rate 0.1437
zero_rate 0.0003427
alert: high_skew skew=-2.60
alert: outliers 14.4% rows beyond 1.5 IQR
Fig 19.
Distribution of pmdec. Vertical dash marks the median.
Histogram bins for pmdec (median: -5.76).
bin count
-5813 – -5418 3
-5418 – -5022 1
-5022 – -4627 1
-4627 – -4232 0
-4232 – -3836 0
-3836 – -3441 3
-3441 – -3046 4
-3046 – -2650 4
-2650 – -2255 5
-2255 – -1860 9
-1860 – -1464 29
-1464 – -1069 83
-1069 – -673.8 252
-673.8 – -278.5 1512
-278.5 – 116.9 115998
116.9 – 512.2 1589
512.2 – 907.5 96
907.5 – 1303 21
1303 – 1698 7
1698 – 2093 3
2093 – 2489 1
2489 – 2884 0
2884 – 3279 2
3279 – 3675 1
3675 – 4070 0
4070 – 4465 0
4465 – 4861 0
4861 – 5256 0
5256 – 5651 1
5651 – 6047 0
6047 – 6442 0
6442 – 6837 0
6837 – 7233 0
7233 – 7628 0
7628 – 8023 0
8023 – 8419 0
8419 – 8814 0
8814 – 9209 0
9209 – 9605 0
9605 – 1e+04 1

rv numeric feature

`rv` is a numeric feature dominated by zeros: the median, Q1, and Q3 are all 0.0, and 81.07% of values are exactly zero (zero_rate 0.8107). The non-zero tail is wide and heavy, spanning -386.9 to 471.0 with std 13.90 and kurtosis 116.06. Because Q1 and Q3 are both 0 the IQR is 0, so the 1.5 IQR fence collapses onto zero and essentially every non-zero value is flagged: the 22,643 outliers (18.93%) match the 18.93% of rows that are non-zero and say nothing about extremity. Despite the extremes, mean (-0.276) and skew (0.371) are modest, suggesting roughly balanced positive/negative excursions around a sparse zero baseline.

Treatment: Split into a zero-indicator plus a signed-log transform of the non-zero magnitude before modelling.

anthropic:claude-opus-4-7 · confidence high
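The split in the treatment can be sketched as a pair of features per row. `split_rv` is a hypothetical helper; reading an exact zero as "not measured" follows the summary's interpretation and is an assumption, since a true zero velocity would be indistinguishable.

```python
import math


def split_rv(rv):
    """Return (measured flag, signed log magnitude) for one radial velocity."""
    measured = 0 if rv == 0 else 1
    # sign(rv) * log(1 + |rv|): symmetric around zero, defined at zero
    signed_log = math.copysign(math.log1p(abs(rv)), rv)
    return measured, signed_log


# Zero plus the reported column min and max.
features = [split_rv(v) for v in (0.0, -386.9, 471.0)]
print(features)
```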
Out[49]:

saturn.columns["rv"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 1,714
min -386.9
max 471
mean -0.2765
median 0
std 13.9
q1 0
q3 0
iqr 0
skew 0.3708
kurtosis 116.1
n_outliers 22,643
outlier_rate 0.1893
zero_rate 0.8107
alert: outliers 18.9% rows beyond 1.5 IQR
Fig 20.
Distribution of rv. Vertical dash marks the median.
Histogram bins for rv (median: 0.0).
bin count
-386.9 – -365.5 2
-365.5 – -344 3
-344 – -322.6 1
-322.6 – -301.1 4
-301.1 – -279.7 1
-279.7 – -258.2 3
-258.2 – -236.8 3
-236.8 – -215.3 3
-215.3 – -193.9 3
-193.9 – -172.4 5
-172.4 – -151 19
-151 – -129.5 21
-129.5 – -108.1 43
-108.1 – -86.63 90
-86.63 – -65.19 238
-65.19 – -43.74 793
-43.74 – -22.29 2900
-22.29 – -0.845 7904
-0.845 – 20.6 103516
20.6 – 42.05 3015
42.05 – 63.5 637
63.5 – 84.94 208
84.94 – 106.4 97
106.4 – 127.8 28
127.8 – 149.3 24
149.3 – 170.7 22
170.7 – 192.2 13
192.2 – 213.6 4
213.6 – 235.1 8
235.1 – 256.5 4
256.5 – 278 0
278 – 299.4 3
299.4 – 320.9 5
320.9 – 342.3 4
342.3 – 363.8 0
363.8 – 385.2 1
385.2 – 406.7 0
406.7 – 428.1 0
428.1 – 449.6 0
449.6 – 471 1

mag numeric feature

Numeric column 'mag' looks like an astronomical magnitude reading: values are tightly clustered around a median of 8.46 with an IQR of 1.52, but the range stretches from -26.7 to 21.0. That extreme negative tail (consistent with very bright objects like the Sun at -26.7) drives the high kurtosis of 6.35 and flags 5,241 outliers (4.4%) despite near-symmetric skew of 0.16. No nulls or zeros, and 1,422 distinct values across 119,626 rows suggest measurements rounded to ~0.01.

Treatment: Use as-is for modelling but consider winsorizing or separating extreme bright-object outliers before fitting.

anthropic:claude-opus-4-7 · confidence high
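Winsorizing at the outlier fences can be sketched as below. The report's alerts say "beyond 1.5 IQR"; reading that as Tukey fences (Q1 - 1.5 IQR, Q3 + 1.5 IQR) is an assumption about saturn's rule, and both function names are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize(values, lo, hi):
    """Clip each value into the closed interval [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]


# Quartiles reported for mag: q1 = 7.65, q3 = 9.17.
lo, hi = tukey_fences(7.65, 9.17)
# Column min, median and max as reported in the stats.
clipped = winsorize([-26.7, 8.46, 21.0], lo, hi)
print(lo, hi, clipped)
```

With these quartiles the fences sit near 5.37 and 11.45, so clipping pulls the Sun's -26.7 and the faintest stars back to the fence values while leaving the bulk untouched.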
Out[52]:

saturn.columns["mag"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 1,422
min -26.7
max 21
mean 8.429
median 8.46
std 1.428
q1 7.65
q3 9.17
iqr 1.52
skew 0.1607
kurtosis 6.353
n_outliers 5,241
outlier_rate 0.04381
zero_rate 0
Fig 21.
Distribution of mag. Vertical dash marks the median.
Histogram bins for mag (median: 8.46).
bin count
-26.7 – -25.51 1
-25.51 – -24.31 0
-24.31 – -23.12 0
-23.12 – -21.93 0
-21.93 – -20.74 0
-20.74 – -19.54 0
-19.54 – -18.35 0
-18.35 – -17.16 0
-17.16 – -15.97 0
-15.97 – -14.77 0
-14.77 – -13.58 0
-13.58 – -12.39 0
-12.39 – -11.2 0
-11.2 – -10 0
-10 – -8.812 0
-8.812 – -7.62 0
-7.62 – -6.427 0
-6.427 – -5.235 0
-5.235 – -4.042 0
-4.042 – -2.85 0
-2.85 – -1.657 0
-1.657 – -0.465 2
-0.465 – 0.7275 9
0.7275 – 1.92 33
1.92 – 3.113 152
3.113 – 4.305 533
4.305 – 5.498 2104
5.498 – 6.69 8246
6.69 – 7.883 26095
7.883 – 9.075 49082
9.075 – 10.27 24650
10.27 – 11.46 5990
11.46 – 12.65 1650
12.65 – 13.85 582
13.85 – 15.04 305
15.04 – 16.23 122
16.23 – 17.42 44
17.42 – 18.62 19
18.62 – 19.81 5
19.81 – 21 2
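
The winsorizing option in the mag treatment can be sketched from the quartiles in the stats table (q1 7.65, q3 9.17). This is a minimal sketch, not saturn's implementation: `tukey_fences` and `winsorize` are hypothetical helper names, and the 1.5 × IQR multiplier is assumed from the wording of the outlier alerts elsewhere in this report.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (low, high) fences at k * IQR outside the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize(values, low, high):
    """Clamp every value into the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]


# Quartiles copied from saturn.columns["mag"].stats
low, high = tukey_fences(7.65, 9.17)

# Illustrative magnitudes only (not rows read from the catalog);
# -26.7 and 21.0 are the reported minimum and maximum.
sample = [-26.7, 4.9, 8.46, 12.0, 21.0]
clipped = winsorize(sample, low, high)
```

With these inputs the fences fall at about 5.37 and 11.45. Clamping keeps every row, whereas separating the extreme objects means filtering on the same fences instead.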

absmag numeric feature

This is a numeric `absmag` field, almost certainly absolute magnitude on an astronomical scale, ranging from -16.68 to 19.63 with a median of 1.495 and IQR of 3.021. The distribution is left-skewed (skew -1.37) with heavier tails than normal (kurtosis 3.17), and 11.29% of rows (13,508) flag as outliers. The histogram shows that most of these are not a smooth tail: roughly 10,000 rows form a separate bright cluster between about -14 and -7.6, divided from the main body by nearly empty bins, with a smaller group of faint outliers above about 7.7. There are no nulls and only 0.015% zeros across 119,626 rows, with 13,452 unique values suggesting quantised reporting.

Treatment: Keep as-is for modelling but inspect the bright cluster before treating it as physically meaningful; prefer robust scaling to dropping outliers.

anthropic:claude-opus-4-7 · confidence high
Out[55]:

saturn.columns["absmag"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 13,452
min -16.68
max 19.63
mean 0.9907
median 1.495
std 4.353
q1 0.138
q3 3.159
iqr 3.021
skew -1.37
kurtosis 3.168
n_outliers 13,508
outlier_rate 0.1129
zero_rate 0.0001505
alert: outliers 11.3% rows beyond 1.5 IQR
Fig 22.
Distribution of absmag. Vertical dash marks the median.
Histogram bins for absmag (median: 1.495).
bin count
-16.68 – -15.77 8
-15.77 – -14.86 28
-14.86 – -13.96 74
-13.96 – -13.05 296
-13.05 – -12.14 945
-12.14 – -11.23 2808
-11.23 – -10.33 3546
-10.33 – -9.418 1564
-9.418 – -8.51 649
-8.51 – -7.603 248
-7.603 – -6.695 59
-6.695 – -5.787 12
-5.787 – -4.88 16
-4.88 – -3.972 65
-3.972 – -3.064 221
-3.064 – -2.156 832
-2.156 – -1.249 2997
-1.249 – -0.341 8438
-0.341 – 0.5668 15299
0.5668 – 1.474 21259
1.474 – 2.382 17499
2.382 – 3.29 14914
3.29 – 4.198 12012
4.198 – 5.105 6472
5.105 – 6.013 3218
6.013 – 6.921 1841
6.921 – 7.829 1235
7.829 – 8.736 795
8.736 – 9.644 465
9.644 – 10.55 376
10.55 – 11.46 397
11.46 – 12.37 399
12.37 – 13.27 268
13.27 – 14.18 179
14.18 – 15.09 104
15.09 – 16 51
16 – 16.91 17
16.91 – 17.81 10
17.81 – 18.72 5
18.72 – 19.63 5
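
The robust scaling mentioned in the absmag treatment can be sketched as centering on the median and dividing by the IQR, so no rows are dropped. This is a minimal sketch under that assumption; `robust_scale` is a hypothetical helper, and the constants are copied from the stats table above.

```python
def robust_scale(values, median, iqr):
    """Center on the median and divide by the IQR; every row is kept."""
    if iqr <= 0:
        raise ValueError("IQR must be positive for robust scaling")
    return [(v - median) / iqr for v in values]


# Median and IQR copied from saturn.columns["absmag"].stats
ABSMAG_MEDIAN = 1.495
ABSMAG_IQR = 3.021

# Reported min, q1, median, q3 and max of absmag
scaled = robust_scale([-16.68, 0.138, 1.495, 3.159, 19.63],
                      ABSMAG_MEDIAN, ABSMAG_IQR)
```

After scaling, the median maps to 0 and the distance between q1 and q3 maps to 1, while the extremes remain visible at roughly six IQRs from the center.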

spect text feature

This column holds stellar spectral type codes (e.g. K0, G5, A0, F8): short one-word tokens averaging 3.4 characters with a 1,532-word vocabulary across 119,626 rows. Values are highly repetitive (96.4% duplicate rate, only 4,310 unique), which is expected for a categorical taxonomy, and 3,048 rows are empty strings (these are not counted as nulls). The 45.8% allcaps rate indicates that many codes contain lowercase characters. In spectral notation lowercase letters can carry meaning (for example an 'sd' prefix or an 'e' suffix), so mixed case is not necessarily inconsistent capitalization.

Treatment: Treat as a categorical feature and check what the lowercase characters encode before normalizing case; consider grouping rare codes and imputing or flagging the 3,048 empties.

anthropic:claude-opus-4-7 · confidence high
Out[58]:

saturn.columns["spect"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 4,310
len_min 0
len_max 12
len_mean 3.376
len_median 3
len_p95 8
word_mean 1.009
word_median 1
n_empty 3,048
n_duplicates 115,316
duplicate_rate 0.964
vocab_size 1,532
readability_flesch_mean 98.19
emoji_rate 0
url_rate 0
one_word_rate 0.9937
allcaps_rate 0.4584
boilerplate_rate 0
alert: one_word 99.4% rows are a single word
alert: allcaps 45.8% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 96.4% duplicate strings
Fig 23.
Character-length distribution for spect.
Character-length distribution for spect (mean: 3.376). Lengths are whole numbers, so only the bins that contain a length are listed; the empty bins between them are omitted.
chars count
0 3048
1 1542
2 54685
3 20214
4 6547
5 17615
6 6693
7 1634
8 4927
9 1161
10 703
11 746
12 111
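
Grouping rare spect codes, as the treatment suggests, can be sketched by collapsing each code to its leading temperature class. This is only a sketch: `spectral_class` is a hypothetical helper, the accepted prefixes (`sd`, `d`, `g`) and the "other" bucket are assumptions, and the real column may need more cases. Case is deliberately left untouched, because lowercase letters in spectral notation can carry meaning.

```python
import re

# Leading Morgan-Keenan temperature class, optionally preceded by a
# lowercase luminosity prefix (assumed set: sd, d, g).
_CLASS = re.compile(r"^(?:sd|d|g)?([OBAFGKM])")


def spectral_class(code):
    """Map a raw spect string to a coarse class; empty strings become None."""
    code = code.strip()
    if not code:
        return None  # keep the empty string as an explicit missing value
    match = _CLASS.match(code)
    return match.group(1) if match else "other"


classes = [spectral_class(c) for c in ["K0III", "G5", "", "sdM2", "DA3"]]
```

Mapping empties to `None` keeps the 3,048 blank rows distinguishable from codes that simply fall outside the seven main classes.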

ci numeric feature

Numeric feature 'ci' spans -0.4 to 5.46 with mean 0.71 and median 0.616, suggesting a bounded continuous measurement (possibly a colour index or similar physical quantity given the name). Distribution is mildly right-skewed (0.37) with light tails (kurtosis -0.26) and only 208 outliers (0.18%). Negative values exist but zeros are rare (0.15%), and 1.58% of rows are null.

Treatment: Impute the 1.58% nulls and use as-is; mild skew does not require transformation.

anthropic:claude-opus-4-7 · confidence high
Out[61]:

saturn.columns["ci"].stats

stat value
n 119,626
nulls 1,891 (1.6%)
unique 2,439
min -0.4
max 5.46
mean 0.7115
median 0.616
std 0.4932
q1 0.3485
q3 1.083
iqr 0.7345
skew 0.3728
kurtosis -0.2552
n_outliers 208
outlier_rate 0.001767
zero_rate 0.001537
Fig 24.
Distribution of ci. Vertical dash marks the median.
Histogram bins for ci (median: 0.616).
bin count
-0.4 – -0.2535 59
-0.2535 – -0.107 1383
-0.107 – 0.0395 8328
0.0395 – 0.186 9569
0.186 – 0.3325 8982
0.3325 – 0.479 14728
0.479 – 0.6255 16591
0.6255 – 0.772 8623
0.772 – 0.9185 5682
0.9185 – 1.065 12808
1.065 – 1.212 10614
1.212 – 1.358 6401
1.358 – 1.505 5724
1.505 – 1.651 5558
1.651 – 1.798 1851
1.798 – 1.944 435
1.944 – 2.091 121
2.091 – 2.237 88
2.237 – 2.384 40
2.384 – 2.53 42
2.53 – 2.677 24
2.677 – 2.823 28
2.823 – 2.97 11
2.97 – 3.116 22
3.116 – 3.263 3
3.263 – 3.409 8
3.409 – 3.556 4
3.556 – 3.702 0
3.702 – 3.849 2
3.849 – 3.995 0
3.995 – 4.142 1
4.142 – 4.288 0
4.288 – 4.434 2
4.434 – 4.581 0
4.581 – 4.728 0
4.728 – 4.874 1
4.874 – 5.021 0
5.021 – 5.167 0
5.167 – 5.314 1
5.314 – 5.46 1
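
Imputing the 1.58% of missing ci values could look like the sketch below, which fills with the reported median and keeps a flag so the imputed rows stay identifiable. The choice of the median as fill value and the helper name `impute_with_flag` are assumptions; the treatment does not specify an imputation method.

```python
def impute_with_flag(values, fill):
    """Replace None with `fill` and return a parallel was-missing flag list."""
    filled = [fill if v is None else v for v in values]
    missing = [v is None for v in values]
    return filled, missing


CI_MEDIAN = 0.616  # copied from saturn.columns["ci"].stats

# Illustrative input: the reported q1, one missing value, the reported q3
filled, missing = impute_with_flag([0.3485, None, 1.083], CI_MEDIAN)
```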

x numeric feature

Numeric feature 'x' is effectively continuous (119,593 unique values across 119,626 rows, no nulls) and centered near zero (median -1.05, Q1/Q3 of -89.04/86.27); the Summary lists x/y/z among the position columns. The distribution has extreme tails: min -99,950 and max 99,982 push the standard deviation to 15,182 against an IQR of just 175, with kurtosis 19.16 and 13.07% of rows flagged as outliers. The extremes sit close to the 100,000 pc distance maximum quoted in the Summary, so the tails may come from rows at that distance limit and are worth checking before any transformation. Skew is slight (-0.22), so the two tails are roughly balanced.

Treatment: Winsorize or robust-scale before modelling to contain the heavy tails.

anthropic:claude-opus-4-7 · confidence high
Out[64]:

saturn.columns["x"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 119,593
min -9.995e+04
max 9.998e+04
mean -235.3
median -1.05
std 1.518e+04
q1 -89.04
q3 86.27
iqr 175.3
skew -0.2229
kurtosis 19.16
n_outliers 15,635
outlier_rate 0.1307
zero_rate 0
alert: outliers 13.1% rows beyond 1.5 IQR
Fig 25.
Distribution of x. Vertical dash marks the median.
Histogram bins for x (median: -1.050364).
bin count
-9.995e+04 – -9.495e+04 139
-9.495e+04 – -8.995e+04 173
-8.995e+04 – -8.496e+04 158
-8.496e+04 – -7.996e+04 184
-7.996e+04 – -7.496e+04 181
-7.496e+04 – -6.996e+04 220
-6.996e+04 – -6.496e+04 235
-6.496e+04 – -5.996e+04 250
-5.996e+04 – -5.497e+04 262
-5.497e+04 – -4.997e+04 347
-4.997e+04 – -4.497e+04 467
-4.497e+04 – -3.997e+04 447
-3.997e+04 – -3.497e+04 346
-3.497e+04 – -2.997e+04 322
-2.997e+04 – -2.498e+04 330
-2.498e+04 – -1.998e+04 347
-1.998e+04 – -1.498e+04 265
-1.498e+04 – -9981 245
-9981 – -4982 233
-4982 – 15.99 62447
15.99 – 5014 47485
5014 – 1.001e+04 245
1.001e+04 – 1.501e+04 255
1.501e+04 – 2.001e+04 248
2.001e+04 – 2.501e+04 237
2.501e+04 – 3.001e+04 258
3.001e+04 – 3.5e+04 213
3.5e+04 – 4e+04 328
4e+04 – 4.5e+04 421
4.5e+04 – 5e+04 356
5e+04 – 5.5e+04 291
5.5e+04 – 6e+04 312
6e+04 – 6.499e+04 264
6.499e+04 – 6.999e+04 187
6.999e+04 – 7.499e+04 198
7.499e+04 – 7.999e+04 172
7.999e+04 – 8.499e+04 156
8.499e+04 – 8.999e+04 141
8.999e+04 – 9.498e+04 139
9.498e+04 – 9.998e+04 122
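
One way to check whether the extremes near ±100,000 in x (and likewise in y and z) belong to rows at a common distance limit is to compute each row's radial distance and compare it with the 100,000 pc maximum quoted in the Summary. This is a sketch under the assumption that x, y and z share one origin and unit; `radial_distance`, `at_distance_cap` and the tolerance are hypothetical choices.

```python
import math

DIST_CAP = 100_000.0  # distance maximum quoted in the report Summary (pc)


def radial_distance(x, y, z):
    """Euclidean distance from the origin of the x/y/z frame."""
    return math.sqrt(x * x + y * y + z * z)


def at_distance_cap(x, y, z, cap=DIST_CAP, rel_tol=1e-3):
    """True when the row lies on the sphere of radius `cap`, within rel_tol."""
    return math.isclose(radial_distance(x, y, z), cap, rel_tol=rel_tol)
```

If most tail rows return `True`, flagging them separately may be more informative than winsorizing each axis independently.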

y numeric feature

A continuous numeric feature centered near zero (median -1.24, mean -39.3) but with an extraordinarily wide spread (std about 17,249, min -99,979, max 99,996). The distribution is roughly symmetric (skew 0.12) yet heavy-tailed (kurtosis 18.0), and 13.9% of values flag as outliers: the bulk sits within an IQR of about 183 while extremes reach ±100k. Near-unique values (119,585 of 119,626), no nulls, and a single zero suggest a measured signal rather than a category or sentinel-coded field.

Treatment: Winsorize or robust-scale before modelling to tame the heavy tails.

anthropic:claude-opus-4-7 · confidence high
Out[67]:

saturn.columns["y"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 119,585
min -9.998e+04
max 1e+05
mean -39.32
median -1.239
std 1.725e+04
q1 -91.18
q3 91.87
iqr 183
skew 0.1166
kurtosis 18.03
n_outliers 16,582
outlier_rate 0.1386
zero_rate 8.359e-06
alert: outliers 13.9% rows beyond 1.5 IQR
Fig 26.
Distribution of y. Vertical dash marks the median.
Histogram bins for y (median: -1.2388565).
bin count
-9.998e+04 – -9.498e+04 248
-9.498e+04 – -8.998e+04 286
-8.998e+04 – -8.498e+04 264
-8.498e+04 – -7.998e+04 304
-7.998e+04 – -7.498e+04 282
-7.498e+04 – -6.998e+04 246
-6.998e+04 – -6.498e+04 253
-6.498e+04 – -5.998e+04 280
-5.998e+04 – -5.498e+04 244
-5.498e+04 – -4.999e+04 257
-4.999e+04 – -4.499e+04 235
-4.499e+04 – -3.999e+04 221
-3.999e+04 – -3.499e+04 220
-3.499e+04 – -2.999e+04 299
-2.999e+04 – -2.499e+04 266
-2.499e+04 – -1.999e+04 275
-1.999e+04 – -1.499e+04 274
-1.499e+04 – -9990 233
-9990 – -4991 272
-4991 – 8.41 59499
8.41 – 5008 50396
5008 – 1.001e+04 265
1.001e+04 – 1.501e+04 282
1.501e+04 – 2.001e+04 277
2.001e+04 – 2.501e+04 228
2.501e+04 – 3e+04 241
3e+04 – 3.5e+04 287
3.5e+04 – 4e+04 241
4e+04 – 4.5e+04 206
4.5e+04 – 5e+04 193
5e+04 – 5.5e+04 211
5.5e+04 – 6e+04 199
6e+04 – 6.5e+04 192
6.5e+04 – 7e+04 212
7e+04 – 7.5e+04 226
7.5e+04 – 8e+04 245
8e+04 – 8.5e+04 267
8.5e+04 – 9e+04 325
9e+04 – 9.5e+04 330
9.5e+04 – 1e+05 345

z numeric feature

Column z is a high-cardinality numeric feature (119,588 unique values across 119,626 rows, no nulls) centered near zero with median -3.42 and Q1/Q3 of -107.6/95.0 (IQR 202.5). The distribution has extreme tails: min -99,964.98, max 99,862.51, std 18,074.56, and kurtosis 15.49, with 13.7% of rows (16,441) flagged as outliers despite skew of only -0.27. The IQR is roughly two orders of magnitude smaller than the standard deviation, indicating a tight core swamped by heavy, roughly symmetric tails.

Treatment: Winsorize or apply a signed log transform before modelling to tame the heavy tails.

anthropic:claude-opus-4-7 · confidence high
Out[70]:

saturn.columns["z"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 119,588
min -9.996e+04
max 9.986e+04
mean -235
median -3.416
std 1.807e+04
q1 -107.6
q3 94.97
iqr 202.5
skew -0.2722
kurtosis 15.49
n_outliers 16,441
outlier_rate 0.1374
zero_rate 8.359e-06
alert: outliers 13.7% rows beyond 1.5 IQR
Fig 27.
Distribution of z. Vertical dash marks the median.
Histogram bins for z (median: -3.416).
bin count
-9.996e+04 – -9.497e+04 175
-9.497e+04 – -8.997e+04 329
-8.997e+04 – -8.498e+04 543
-8.498e+04 – -7.998e+04 386
-7.998e+04 – -7.499e+04 357
-7.499e+04 – -6.999e+04 298
-6.999e+04 – -6.5e+04 258
-6.5e+04 – -6e+04 261
-6e+04 – -5.5e+04 271
-5.5e+04 – -5.001e+04 279
-5.001e+04 – -4.501e+04 269
-4.501e+04 – -4.002e+04 264
-4.002e+04 – -3.502e+04 237
-3.502e+04 – -3.003e+04 218
-3.003e+04 – -2.503e+04 176
-2.503e+04 – -2.003e+04 186
-2.003e+04 – -1.504e+04 174
-1.504e+04 – -1.004e+04 173
-1.004e+04 – -5047 164
-5047 – -51.23 38362
-51.23 – 4944 71405
4944 – 9940 241
9940 – 1.494e+04 230
1.494e+04 – 1.993e+04 218
1.993e+04 – 2.493e+04 216
2.493e+04 – 2.992e+04 195
2.992e+04 – 3.492e+04 195
3.492e+04 – 3.991e+04 223
3.991e+04 – 4.491e+04 234
4.491e+04 – 4.991e+04 234
4.991e+04 – 5.49e+04 234
5.49e+04 – 5.99e+04 240
5.99e+04 – 6.489e+04 308
6.489e+04 – 6.989e+04 281
6.989e+04 – 7.488e+04 303
7.488e+04 – 7.988e+04 336
7.988e+04 – 8.488e+04 360
8.488e+04 – 8.987e+04 429
8.987e+04 – 9.487e+04 236
9.487e+04 – 9.986e+04 128
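
The signed log transform named in the z treatment can be sketched as sign(v) · log(1 + |v|), which compresses large magnitudes while preserving sign and mapping zero to zero. The helper name `signed_log1p` and the choice of the natural log are assumptions; the treatment does not fix a base.

```python
import math


def signed_log1p(v):
    """sign(v) * log(1 + |v|): shrinks large magnitudes, keeps sign and zero."""
    return math.copysign(math.log1p(abs(v)), v)


# Reported min, q1, median, q3 and max of z
transformed = [signed_log1p(v) for v in (-99964.98, -107.6, -3.416, 94.97, 99862.51)]
```

Applied to the reported extremes, the range shrinks from about ±100,000 to about ±11.5, and the ordering of values is unchanged.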

vx numeric feature

`vx` is a numeric feature centered tightly around zero (median 1.3e-07, IQR 1.94e-05) but with a symmetric extreme range of ±0.10227249 — almost certainly a velocity-like component (x-axis). The distribution is pathologically heavy-tailed: kurtosis 1307.6 and skew -11.5, with 13.2% of values flagged as outliers despite std being only 0.00178. The exact symmetry of min and max suggests a hard clipping bound at ±0.10227249.

Treatment: Apply a robust scaler or signed-log transform before modelling; investigate the ±0.10227249 clipping boundary.

anthropic:claude-opus-4-7 · confidence high
Out[73]:

saturn.columns["vx"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 21,555
min -0.1023
max 0.1023
mean -2.891e-05
median 1.3e-07
std 0.001782
q1 -1.033e-05
q3 9.09e-06
iqr 1.942e-05
skew -11.51
kurtosis 1308
n_outliers 15,752
outlier_rate 0.1317
zero_rate 0.0004848
alert: high_skew skew=-11.51
alert: outliers 13.2% rows beyond 1.5 IQR
Fig 28.
Distribution of vx. Vertical dash marks the median.
Histogram bins for vx (median: 1.3e-07).
bin count
-0.1023 – -0.09716 7
-0.09716 – -0.09205 1
-0.09205 – -0.08693 0
-0.08693 – -0.08182 1
-0.08182 – -0.0767 0
-0.0767 – -0.07159 2
-0.07159 – -0.06648 1
-0.06648 – -0.06136 3
-0.06136 – -0.05625 5
-0.05625 – -0.05114 0
-0.05114 – -0.04602 2
-0.04602 – -0.04091 3
-0.04091 – -0.0358 1
-0.0358 – -0.03068 5
-0.03068 – -0.02557 10
-0.02557 – -0.02045 20
-0.02045 – -0.01534 53
-0.01534 – -0.01023 145
-0.01023 – -0.005114 566
-0.005114 – 0 58472
0 – 0.005114 59871
0.005114 – 0.01023 348
0.01023 – 0.01534 62
0.01534 – 0.02045 23
0.02045 – 0.02557 6
0.02557 – 0.03068 4
0.03068 – 0.0358 1
0.0358 – 0.04091 2
0.04091 – 0.04602 1
0.04602 – 0.05114 1
0.05114 – 0.05625 2
0.05625 – 0.06136 1
0.06136 – 0.06648 0
0.06648 – 0.07159 1
0.07159 – 0.0767 0
0.0767 – 0.08182 1
0.08182 – 0.08693 3
0.08693 – 0.09205 0
0.09205 – 0.09716 1
0.09716 – 0.1023 1
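
Investigating the ±0.10227249 boundary flagged in the vx treatment can start with a simple count of values whose magnitude sits exactly at that bound. This is a minimal sketch; `count_at_bound` and the absolute tolerance are hypothetical choices, and the bound is the figure quoted in the prose above.

```python
import math

BOUND = 0.10227249  # magnitude of the vx minimum and maximum quoted in the prose


def count_at_bound(values, bound=BOUND, abs_tol=1e-8):
    """Count values whose magnitude equals `bound` within abs_tol."""
    return sum(
        1 for v in values
        if math.isclose(abs(v), bound, rel_tol=0.0, abs_tol=abs_tol)
    )


# Illustrative values only (not rows read from the catalog)
n_clipped = count_at_bound([0.10227249, -0.10227249, 0.05, 0.0])
```

A count well above one on either side would support the clipping hypothesis; a single row at each end would be more consistent with two genuine extremes.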

vy numeric feature

Likely a velocity component (vy) for about 119,600 objects, with values clustered tightly around zero (median 1.18e-05, IQR 3.4e-05) but spanning -0.102 to 0.102. The distribution is extraordinarily heavy-tailed (skew 15.6, kurtosis 678) and 12.0% of rows fall outside the Tukey fences, so a small minority of fast movers dominate the variance (std 0.0022 vs IQR 3.4e-05).

Treatment: Apply a signed log or robust scaler before modelling to tame the heavy tails.

anthropic:claude-opus-4-7 · confidence high
Out[76]:

saturn.columns["vy"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 25,826
min -0.1023
max 0.1023
mean 0.0002164
median 1.182e-05
std 0.002226
q1 -1.86e-06
q3 3.209e-05
iqr 3.395e-05
skew 15.59
kurtosis 678.4
n_outliers 14,368
outlier_rate 0.1201
zero_rate 0.0003427
alert: high_skew skew=+15.59
alert: outliers 12.0% rows beyond 1.5 IQR
Fig 29.
Distribution of vy. Vertical dash marks the median.
Histogram bins for vy (median: 1.182e-05).
bin count
-0.1023 – -0.09716 1
-0.09716 – -0.09205 0
-0.09205 – -0.08693 0
-0.08693 – -0.08182 0
-0.08182 – -0.0767 0
-0.0767 – -0.07159 2
-0.07159 – -0.06648 0
-0.06648 – -0.06136 0
-0.06136 – -0.05625 0
-0.05625 – -0.05114 0
-0.05114 – -0.04602 0
-0.04602 – -0.04091 3
-0.04091 – -0.0358 1
-0.0358 – -0.03068 2
-0.03068 – -0.02557 4
-0.02557 – -0.02045 9
-0.02045 – -0.01534 19
-0.01534 – -0.01023 69
-0.01023 – -0.005114 278
-0.005114 – 0 33692
0 – 0.005114 83552
0.005114 – 0.01023 1271
0.01023 – 0.01534 428
0.01534 – 0.02045 133
0.02045 – 0.02557 66
0.02557 – 0.03068 35
0.03068 – 0.0358 10
0.0358 – 0.04091 11
0.04091 – 0.04602 4
0.04602 – 0.05114 3
0.05114 – 0.05625 7
0.05625 – 0.06136 5
0.06136 – 0.06648 3
0.06648 – 0.07159 3
0.07159 – 0.0767 0
0.0767 – 0.08182 0
0.08182 – 0.08693 2
0.08693 – 0.09205 1
0.09205 – 0.09716 1
0.09716 – 0.1023 11

vz numeric feature

A signed numeric quantity centred almost exactly on zero (median -6.23e-06, mean -1.57e-04) with an extremely tight IQR of 2.31e-05, consistent with a velocity component (the name 'vz' suggests a z-axis velocity). The distribution is pathologically heavy-tailed: kurtosis 1029.85 and skew -20.30, with extremes at ±0.10227249, and 15,774 rows (13.2%) fall outside the 1.5 × IQR fences. Despite 119,626 rows there are only 23,037 unique values, hinting at quantisation in the underlying measurements.

Treatment: Clip or winsorise the symmetric ±0.102 tails and consider a signed log (e.g. asinh) transform before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[79]:

saturn.columns["vz"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 23,037
min -0.1023
max 0.1023
mean -0.0001566
median -6.23e-06
std 0.00195
q1 -1.998e-05
q3 3.147e-06
iqr 2.313e-05
skew -20.3
kurtosis 1030
n_outliers 15,774
outlier_rate 0.1319
zero_rate 0.000535
alert: high_skew skew=-20.30
alert: outliers 13.2% rows beyond 1.5 IQR
Fig 30.
Distribution of vz. Vertical dash marks the median.
Histogram bins for vz (median: -6.23e-06).
bin count
-0.1023 – -0.09716 9
-0.09716 – -0.09205 2
-0.09205 – -0.08693 2
-0.08693 – -0.08182 1
-0.08182 – -0.0767 2
-0.0767 – -0.07159 1
-0.07159 – -0.06648 1
-0.06648 – -0.06136 2
-0.06136 – -0.05625 1
-0.05625 – -0.05114 2
-0.05114 – -0.04602 4
-0.04602 – -0.04091 8
-0.04091 – -0.0358 10
-0.0358 – -0.03068 15
-0.03068 – -0.02557 14
-0.02557 – -0.02045 47
-0.02045 – -0.01534 80
-0.01534 – -0.01023 254
-0.01023 – -0.005114 866
-0.005114 – 0 79116
0 – 0.005114 38868
0.005114 – 0.01023 258
0.01023 – 0.01534 44
0.01534 – 0.02045 8
0.02045 – 0.02557 3
0.02557 – 0.03068 2
0.03068 – 0.0358 1
0.0358 – 0.04091 2
0.04091 – 0.04602 0
0.04602 – 0.05114 1
0.05114 – 0.05625 0
0.05625 – 0.06136 0
0.06136 – 0.06648 0
0.06648 – 0.07159 0
0.07159 – 0.0767 0
0.0767 – 0.08182 1
0.08182 – 0.08693 0
0.08693 – 0.09205 0
0.09205 – 0.09716 0
0.09716 – 0.1023 1
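
The asinh transform suggested for vz needs a scale, because the raw values are far below 1 and asinh is almost linear there. A minimal sketch, assuming the reported IQR is a reasonable scale (that choice and the helper name `asinh_scaled` are not from the report):

```python
import math

VZ_IQR = 2.313e-05  # copied from saturn.columns["vz"].stats


def asinh_scaled(v, scale=VZ_IQR):
    """asinh(v / scale): near-linear for |v| << scale, logarithmic beyond it."""
    return math.asinh(v / scale)


# Reported minimum, median and maximum of vz
low, mid, high = (asinh_scaled(v) for v in (-0.1023, -6.23e-06, 0.1023))
```

With this scale the ±0.1023 extremes map to roughly ±9.1, while values inside the IQR stay close to their scaled linear positions.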

rarad numeric feature

Values span 0 to 6.2828 with mean 3.166 and median 3.175, consistent with a right ascension expressed in radians (0 to 2π). The distribution is essentially symmetric (skew -0.012) and platykurtic (kurtosis -1.198), close to uniform across the circle, with no outliers and only one zero out of 119,626 rows. With 119,585 unique values, this is a high-resolution continuous coordinate.

Treatment: Encode as sin/cos pair to preserve circular continuity before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[82]:

saturn.columns["rarad"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 119,585
min 0
max 6.283
mean 3.166
median 3.175
std 1.803
q1 1.628
q3 4.743
iqr 3.115
skew -0.01197
kurtosis -1.198
n_outliers 0
outlier_rate 0
zero_rate 8.359e-06
Fig 31.
Distribution of rarad. Vertical dash marks the median.
Histogram bins for rarad (median: 3.175).
bin count
0 – 0.1571 2880
0.1571 – 0.3141 2819
0.3141 – 0.4712 2802
0.4712 – 0.6283 2840
0.6283 – 0.7854 2833
0.7854 – 0.9424 2851
0.9424 – 1.099 2850
1.099 – 1.257 2740
1.257 – 1.414 2994
1.414 – 1.571 3178
1.571 – 1.728 3135
1.728 – 1.885 3281
1.885 – 2.042 3294
2.042 – 2.199 3142
2.199 – 2.356 3054
2.356 – 2.513 2966
2.513 – 2.67 2878
2.67 – 2.827 2891
2.827 – 2.984 2856
2.984 – 3.141 2897
3.141 – 3.298 3004
3.298 – 3.456 2937
3.456 – 3.613 2964
3.613 – 3.77 3088
3.77 – 3.927 3048
3.927 – 4.084 3006
4.084 – 4.241 3027
4.241 – 4.398 2886
4.398 – 4.555 2941
4.555 – 4.712 3020
4.712 – 4.869 3069
4.869 – 5.026 3184
5.026 – 5.183 3112
5.183 – 5.34 3207
5.34 – 5.497 3076
5.497 – 5.655 3021
5.655 – 5.812 3005
5.812 – 5.969 3019
5.969 – 6.126 2970
6.126 – 6.283 2861
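
The sin/cos encoding recommended for rarad can be sketched as follows. The point of the pair is that angles just above 0 and just below 2π end up close together, which the raw radian value does not give. `encode_angle` and `chord` are hypothetical helper names.

```python
import math


def encode_angle(theta):
    """Return (sin(theta), cos(theta)); 0 and 2*pi map to the same point."""
    return math.sin(theta), math.cos(theta)


def chord(a, b):
    """Straight-line distance between two encoded angles on the unit circle."""
    (sa, ca), (sb, cb) = encode_angle(a), encode_angle(b)
    return math.hypot(sa - sb, ca - cb)


# Raw difference is about 6.27, yet the two angles are neighbours on the circle
near_wrap = chord(0.01, 6.2828)
```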

decrad numeric feature

This column appears to be a declination angle expressed in radians, ranging from -1.567 to 1.563 (close to ±π/2 ≈ ±1.571). The distribution is near-symmetric (skew 0.037) with negative kurtosis (-1.02), indicating a flatter-than-normal spread, and 119,585 of 119,626 values are unique with no nulls or outliers. The mean (-0.035) and median (-0.029) sit near zero, consistent with an angular coordinate covering most of the celestial sphere.

Treatment: Use directly as a numeric feature. Declination is bounded at ±π/2 and does not wrap around, so it needs no circular encoding on its own; sin/cos terms are only needed when it is combined with right ascension into a direction on the sphere.

anthropic:claude-opus-4-7 · confidence high
Out[85]:

saturn.columns["decrad"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 119,585
min -1.567
max 1.563
mean -0.03466
median -0.02862
std 0.715
q1 -0.6357
q3 0.55
iqr 1.186
skew 0.03675
kurtosis -1.019
n_outliers 0
outlier_rate 0
zero_rate 8.359e-06
Fig 32.
Distribution of decrad. Vertical dash marks the median.
Histogram bins for decrad (median: -0.02862).
bin count
-1.567 – -1.489 177
-1.489 – -1.41 517
-1.41 – -1.332 940
-1.332 – -1.254 1408
-1.254 – -1.176 1998
-1.176 – -1.097 2595
-1.097 – -1.019 3127
-1.019 – -0.9409 3392
-0.9409 – -0.8627 3868
-0.8627 – -0.7844 4034
-0.7844 – -0.7062 4191
-0.7062 – -0.6279 4069
-0.6279 – -0.5497 4172
-0.5497 – -0.4714 4171
-0.4714 – -0.3931 3850
-0.3931 – -0.3149 3713
-0.3149 – -0.2366 3764
-0.2366 – -0.1584 3665
-0.1584 – -0.08012 3754
-0.08012 – -0.001859 3827
-0.001859 – 0.0764 4046
0.0764 – 0.1547 4037
0.1547 – 0.2329 4055
0.2329 – 0.3112 4100
0.3112 – 0.3894 4088
0.3894 – 0.4677 4036
0.4677 – 0.5459 3944
0.5459 – 0.6242 3889
0.6242 – 0.7025 4015
0.7025 – 0.7807 3586
0.7807 – 0.859 3484
0.859 – 0.9372 3438
0.9372 – 1.015 2771
1.015 – 1.094 2562
1.094 – 1.172 2027
1.172 – 1.25 1573
1.25 – 1.329 1182
1.329 – 1.407 860
1.407 – 1.485 510
1.485 – 1.563 191
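
When rarad and decrad are used together, one common encoding is the unit vector on the celestial sphere, which handles the right-ascension wrap-around and the poles in one step. The sketch below uses the standard spherical-to-Cartesian convention; whether its axes match this catalog's x/y/z columns is not verified here, and `unit_vector` is a hypothetical helper.

```python
import math


def unit_vector(ra, dec):
    """Convert right ascension and declination (radians) to a 3-D unit vector."""
    cos_dec = math.cos(dec)
    return (cos_dec * math.cos(ra), cos_dec * math.sin(ra), math.sin(dec))


# Reported medians of rarad and decrad, used purely as an example direction
vec = unit_vector(3.175, -0.02862)
```

Declination enters only through its sine and cosine, and because it is bounded at ±π/2 its cosine is never negative, so no separate circular treatment of decrad is required.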

pmrarad numeric feature

`pmrarad` is a tiny-magnitude signed numeric feature centered near zero (mean -6.4e-09, median -8.1e-09) with values on the order of 1e-7 to 1e-5, consistent with a proper-motion-in-RA quantity expressed in radians. The distribution is highly non-Gaussian: skew 4.6, kurtosis 433.6, and 16.4% of rows flagged as outliers, with extremes reaching 3.28e-05 against an IQR of just 1.34e-07. Nulls are absent and only 0.04% are exact zeros, so the heavy tails are real rather than artefacts of missingness.

Treatment: Robust-scale or winsorize before modelling; do not assume normality given the heavy tails.

anthropic:claude-opus-4-7 · confidence high
Out[88]:

saturn.columns["pmrarad"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 25,647
min -2.149e-05
max 3.281e-05
mean -6.4e-09
median -8.145e-09
std 5.729e-07
q1 -7.495e-08
q3 5.905e-08
iqr 1.34e-07
skew 4.607
kurtosis 433.6
n_outliers 19,615
outlier_rate 0.164
zero_rate 0.0004013
alert: high_skew skew=+4.61
alert: outliers 16.4% rows beyond 1.5 IQR
Fig 33.
Distribution of pmrarad. Vertical dash marks the median.
Histogram bins for pmrarad (median: -8.1448698e-09).
bin count
-2.149e-05 – -2.013e-05 2
-2.013e-05 – -1.878e-05 0
-1.878e-05 – -1.742e-05 5
-1.742e-05 – -1.606e-05 0
-1.606e-05 – -1.47e-05 1
-1.47e-05 – -1.335e-05 1
-1.335e-05 – -1.199e-05 1
-1.199e-05 – -1.063e-05 2
-1.063e-05 – -9.273e-06 5
-9.273e-06 – -7.915e-06 8
-7.915e-06 – -6.558e-06 16
-6.558e-06 – -5.2e-06 36
-5.2e-06 – -3.843e-06 94
-3.843e-06 – -2.486e-06 268
-2.486e-06 – -1.128e-06 1341
-1.128e-06 – 2.294e-07 106838
2.294e-07 – 1.587e-06 10018
1.587e-06 – 2.944e-06 668
2.944e-06 – 4.302e-06 175
4.302e-06 – 5.659e-06 59
5.659e-06 – 7.017e-06 43
7.017e-06 – 8.374e-06 10
8.374e-06 – 9.732e-06 12
9.732e-06 – 1.109e-05 4
1.109e-05 – 1.245e-05 3
1.245e-05 – 1.38e-05 1
1.38e-05 – 1.516e-05 3
1.516e-05 – 1.652e-05 2
1.652e-05 – 1.788e-05 2
1.788e-05 – 1.923e-05 1
1.923e-05 – 2.059e-05 4
2.059e-05 – 2.195e-05 0
2.195e-05 – 2.331e-05 0
2.331e-05 – 2.466e-05 0
2.466e-05 – 2.602e-05 0
2.602e-05 – 2.738e-05 1
2.738e-05 – 2.874e-05 0
2.874e-05 – 3.009e-05 0
3.009e-05 – 3.145e-05 0
3.145e-05 – 3.281e-05 2

pmdecrad numeric feature

This is `pmdecrad`, proper motion in declination expressed in radians (per time unit), centred near zero with median -2.79e-08 and IQR of about 1.27e-07. The distribution is extremely heavy-tailed: kurtosis of 997.6, skew of -1.99, and 17,187 outliers (14.4% of rows) stretching from -2.82e-05 to 5.01e-05 against a std of 5.47e-07. Values are dense and continuous (23,588 unique across 119,626 rows) with no nulls and essentially no zeros.

Treatment: Rescale to mas/yr and consider a robust or signed-log transform before modelling to tame the heavy tails.

anthropic:claude-opus-4-7 · confidence high
Out[91]:

saturn.columns["pmdecrad"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 23,588
min -2.818e-05
max 5.007e-05
mean -9.363e-08
median -2.793e-08
std 5.467e-07
q1 -1.086e-07
q3 1.828e-08
iqr 1.269e-07
skew -1.992
kurtosis 997.6
n_outliers 17,187
outlier_rate 0.1437
zero_rate 0.0003344
alert: outliers 14.4% rows beyond 1.5 IQR
Fig 34.
Distribution of pmdecrad. Vertical dash marks the median.
Histogram bins for pmdecrad (median: -2.7925268e-08).
bin count
-2.818e-05 – -2.623e-05 3
-2.623e-05 – -2.427e-05 1
-2.427e-05 – -2.231e-05 1
-2.231e-05 – -2.036e-05 0
-2.036e-05 – -1.84e-05 0
-1.84e-05 – -1.644e-05 6
-1.644e-05 – -1.449e-05 1
-1.449e-05 – -1.253e-05 4
-1.253e-05 – -1.058e-05 5
-1.058e-05 – -8.62e-06 13
-8.62e-06 – -6.664e-06 36
-6.664e-06 – -4.708e-06 108
-4.708e-06 – -2.751e-06 379
-2.751e-06 – -7.952e-07 3718
-7.952e-07 – 1.161e-06 114754
1.161e-06 – 3.117e-06 511
3.117e-06 – 5.073e-06 56
5.073e-06 – 7.03e-06 18
7.03e-06 – 8.986e-06 6
8.986e-06 – 1.094e-05 1
1.094e-05 – 1.29e-05 0
1.29e-05 – 1.485e-05 0
1.485e-05 – 1.681e-05 3
1.681e-05 – 1.877e-05 0
1.877e-05 – 2.072e-05 0
2.072e-05 – 2.268e-05 0
2.268e-05 – 2.464e-05 0
2.464e-05 – 2.659e-05 1
2.659e-05 – 2.855e-05 0
2.855e-05 – 3.05e-05 0
3.05e-05 – 3.246e-05 0
3.246e-05 – 3.442e-05 0
3.442e-05 – 3.637e-05 0
3.637e-05 – 3.833e-05 0
3.833e-05 – 4.029e-05 0
4.029e-05 – 4.224e-05 0
4.224e-05 – 4.42e-05 0
4.42e-05 – 4.615e-05 0
4.615e-05 – 4.811e-05 0
4.811e-05 – 5.007e-05 1
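
The rescaling to mas/yr suggested in the pmdecrad treatment is a fixed unit conversion from radians to milliarcseconds. The sketch assumes the column's time unit is one year, which the prose above leaves unspecified; `rad_to_mas` is a hypothetical helper.

```python
import math

# Milliarcseconds per radian: (180 / pi) degrees * 3600 arcsec * 1000 mas
MAS_PER_RAD = math.degrees(1.0) * 3600.0 * 1000.0


def rad_to_mas(value_rad):
    """Convert an angle (or an angular rate) from radians to milliarcseconds."""
    return value_rad * MAS_PER_RAD


# Reported median and standard deviation of pmdecrad
median_mas = rad_to_mas(-2.793e-08)
std_mas = rad_to_mas(5.467e-07)
```

Under that assumption the reported median corresponds to about -5.76 mas/yr and the standard deviation to about 112.8 mas/yr, which are easier magnitudes to reason about than values near 1e-07.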

bayer categorical metadata

This is the Bayer designation for stars (abbreviated Greek letters such as 'Alp', 'Bet', 'Gam', 'Del'), with 104 distinct values across 119,626 rows, counting the empty string. It is overwhelmingly empty: 118,089 rows (top_rate 0.987) carry no Bayer letter, leaving 1,537 populated rows and an entropy of just 0.171 (entropy_ratio 0.026). Among the most common non-empty values the counts decline gently, from Alp (80) and Bet (77) down to Tau (35).

Treatment: Treat empty as missing and either drop or one-hot the small non-empty subset; near-constant signal.

anthropic:claude-opus-4-7 · confidence high
Out[94]:

saturn.columns["bayer"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 104
top_value (empty string)
top_rate 0.9872
cardinality 104
entropy 0.171
entropy_ratio 0.02552
alert: imbalance top value is 98.7% of rows
Fig 35.
Top values for bayer.
Top values for bayer (20 unique shown, of 104 total).
value count share
(empty) 118089 98.7%
Alp 80 0.1%
Bet 77 0.1%
Eps 74 0.1%
Del 71 0.1%
Eta 69 0.1%
Zet 68 0.1%
Gam 68 0.1%
The 61 0.1%
Iot 60 0.1%
Lam 56 0.0%
Kap 54 0.0%
Mu 49 0.0%
Nu 45 0.0%
Sig 39 0.0%
Xi 38 0.0%
Pi 37 0.0%
Rho 36 0.0%
Omi 36 0.0%
Tau 35 0.0%

flam categorical feature

The column 'flam' is a categorical field that is overwhelmingly empty: 116,889 of 119,626 rows (97.7%) hold the blank string. The remaining 2.3% (2,737 rows) spreads across 138 distinct values that look like small integers ('2', '4', '5', '7' … '16'), plausibly Flamsteed numbers given the column name; the 19 most common each appear 44 to 53 times. Entropy ratio of 0.043 confirms almost no information content, and the imbalance alert is warranted.

Treatment: Drop or treat as near-constant; the 2.3% non-empty integer-like values are too sparse to model directly.

anthropic:claude-opus-4-7 · confidence high
Out[97]:

saturn.columns["flam"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 139
top_value (empty string)
top_rate 0.9771
cardinality 139
entropy 0.3074
entropy_ratio 0.04319
alert: imbalance top value is 97.7% of rows
Fig 36.
Top values for flam.
Top values for flam (20 unique shown, of 139 total).
value count share
(empty) 116889 97.7%
7 53 0.0%
8 51 0.0%
5 51 0.0%
9 50 0.0%
4 50 0.0%
10 49 0.0%
12 49 0.0%
16 49 0.0%
2 48 0.0%
11 48 0.0%
13 48 0.0%
14 48 0.0%
15 48 0.0%
1 47 0.0%
3 46 0.0%
6 46 0.0%
17 46 0.0%
21 45 0.0%
20 44 0.0%

con categorical feature

Three-letter IAU constellation abbreviations (Cen, UMa, Her, Cyg...) across 119,626 rows with zero nulls. The column has 89 distinct values, one more than the 88 IAU constellations; the stats shown here do not identify the extra value. The distribution is remarkably flat: entropy ratio is 0.949 and the most common value, Cen, accounts for only 3.57% of records, so no single constellation dominates. Useful as a sky-region grouping key rather than a predictive feature on its own.

Treatment: one-hot or target-encode if used in modelling; otherwise keep as a categorical group-by key.

anthropic:claude-opus-4-7 · confidence high
Out[100]:

saturn.columns["con"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 89
top_value Cen
top_rate 0.03569
cardinality 89
entropy 6.147
entropy_ratio 0.9493
Fig 37.
Top values for con.
Top values for con (20 unique shown, of 89 total).
value count share
Cen 4270 3.6%
UMa 3616 3.0%
Her 3434 2.9%
Cyg 3116 2.6%
Hya 3061 2.6%
Cet 3030 2.5%
Vir 2921 2.4%
Eri 2789 2.3%
Peg 2744 2.3%
Dra 2722 2.3%
Sgr 2504 2.1%
Boo 2477 2.1%
Pup 2427 2.0%
Cas 2352 2.0%
Tau 2281 1.9%
Oph 2270 1.9%
Vel 2238 1.9%
Aqr 2188 1.8%
Leo 2165 1.8%
Car 2162 1.8%
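
The entropy ratio reported for con appears consistent with Shannon entropy in bits divided by log2 of the cardinality: 6.147 / log2(89) ≈ 0.949. The sketch below computes both quantities from a list of category counts. It is inferred from the reported numbers, not taken from saturn's source, and `entropy_bits` and `entropy_ratio` are hypothetical helper names.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by log2(k), the maximum for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Cross-check against the reported con stats: entropy 6.147 over 89 values
reported_ratio = 6.147 / math.log2(89)
```

A ratio of 1.0 means every category is equally frequent; the same arithmetic applied to the bayer stats (0.171 over 104 values) gives about 0.026, matching that column's reported entropy_ratio.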

comp numeric feature

The column 'comp' is a numeric field with only 3 distinct values bounded between 1.0 and 3.0, with a median, Q1, and Q3 all equal to 1.0, indicating it is effectively a low-cardinality code or flag rather than a continuous measure. The mean of 1.0048 shows the value 1 dominates almost entirely, with only 536 outliers (0.45%) deviating, producing extreme skew (16.7) and kurtosis (311.1).

Treatment: Cast to a categorical/ordinal code; the near-constant distribution offers little signal for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[103]:

saturn.columns["comp"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 3
min 1
max 3
mean 1.005
median 1
std 0.0739
q1 1
q3 1
iqr 0
skew 16.71
kurtosis 311.1
n_outliers 536
outlier_rate 0.004481
zero_rate 0
alert: high_skew skew=+16.71
Fig 38.
Distribution of comp. Vertical dash marks the median.
Histogram bins for comp (median: 1.0); the 37 bins not listed are empty.
bin count
1 – 1.05 119090
2 – 2.05 496
2.95 – 3 40

comp_primary numeric identifier

The column 'comp_primary' contains 119,626 nearly unique numeric values (119,190 distinct) ranging from 0 to 119,630 with mean 59,641.96 and median 59,634.5. The near-perfect symmetry (skew 0.0004), negative kurtosis (-1.20), and quartiles (Q1 29,815.25, Q3 89,462.75) closely matching a uniform distribution over [0, n] strongly suggest this is a row index or sequential identifier rather than a substantive measurement.

Treatment: Drop before modelling; near-unique sequential id with no predictive content.

anthropic:claude-opus-4-7 · confidence high
Out[106]:

saturn.columns["comp_primary"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 119,190
min 0
max 119,630
mean 5.964e+04
median 5.963e+04
std 3.444e+04
q1 2.982e+04
q3 8.946e+04
iqr 5.965e+04
skew 0.0003989
kurtosis -1.2
n_outliers 0
outlier_rate 0
zero_rate 8.359e-06
Fig 39.
Distribution of comp_primary. Vertical dash marks the median.
Histogram bins for comp_primary (median: 59634.5).
bin count
0 – 2991 3002
2991 – 5982 3003
5982 – 8972 2996
8972 – 1.196e+04 3002
1.196e+04 – 1.495e+04 3004
1.495e+04 – 1.794e+04 3003
1.794e+04 – 2.094e+04 2995
2.094e+04 – 2.393e+04 3002
2.393e+04 – 2.692e+04 2995
2.692e+04 – 2.991e+04 2997
2.991e+04 – 3.29e+04 2995
3.29e+04 – 3.589e+04 2996
3.589e+04 – 3.888e+04 3000
3.888e+04 – 4.187e+04 2997
4.187e+04 – 4.486e+04 3004
4.486e+04 – 4.785e+04 3006
4.785e+04 – 5.084e+04 2998
5.084e+04 – 5.383e+04 2996
5.383e+04 – 5.682e+04 3002
5.682e+04 – 5.982e+04 3000
5.982e+04 – 6.281e+04 2994
6.281e+04 – 6.58e+04 3001
6.58e+04 – 6.879e+04 3002
6.879e+04 – 7.178e+04 2997
7.178e+04 – 7.477e+04 2997
7.477e+04 – 7.776e+04 2997
7.776e+04 – 8.075e+04 3001
8.075e+04 – 8.374e+04 2995
8.374e+04 – 8.673e+04 3009
8.673e+04 – 8.972e+04 2993
8.972e+04 – 9.271e+04 2992
9.271e+04 – 9.57e+04 2998
9.57e+04 – 9.869e+04 3002
9.869e+04 – 1.017e+05 2997
1.017e+05 – 1.047e+05 3006
1.047e+05 – 1.077e+05 2998
1.077e+05 – 1.107e+05 3001
1.107e+05 – 1.136e+05 3000
1.136e+05 – 1.166e+05 2999
1.166e+05 – 1.196e+05 2654

base categorical metadata

The 'base' column is a categorical field that is effectively empty: 118,540 of 119,626 rows (top_rate 0.991) hold the empty string, leaving only 1,086 rows distributed across 650 other values. The non-empty entries look like Gliese star catalog identifiers (e.g. 'Gl 57.1', 'Gl 60'), each appearing at most 3 times. Entropy ratio of 0.017 confirms almost no information content.

Treatment: Drop or binarize as 'has_base_id'; near-constant and unsuitable as a feature.

anthropic:claude-opus-4-7 · confidence high
Out[109]:

saturn.columns["base"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 651
top_value (empty string)
top_rate 0.9909
cardinality 651
entropy 0.1587
entropy_ratio 0.01698
alert: imbalance top value is 99.1% of rows
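
The two entropy figures in the table can be related with a short calculation. This is a sketch under one assumption that the report does not state: that entropy_ratio is the entropy in bits divided by log2(cardinality), the maximum entropy for that many levels. Under that assumption the arithmetic reproduces the reported 0.01698 to within rounding.

```python
import math

# Values copied from the stats table above.
entropy_bits = 0.1587   # reported entropy
cardinality = 651       # reported number of distinct values

# Assumption (not confirmed by the report): the ratio normalises entropy
# by the maximum achievable with this many levels, log2(cardinality).
max_entropy = math.log2(cardinality)
entropy_ratio = entropy_bits / max_entropy

print(f"max entropy:   {max_entropy:.3f} bits")
print(f"entropy ratio: {entropy_ratio:.5f}")
```
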
Fig 40.
Top values for base.
Top values for base (20 unique shown, of 651 total).
value  count  share
(empty string)  118540  99.1%
Gl 57.1  3  0.0%
Gl 60  3  0.0%
Gl 100  3  0.0%
Gl 106.1  3  0.0%
Gl 118.2  3  0.0%
Gl 120.1  3  0.0%
Gl 140  3  0.0%
Gl 153  3  0.0%
Gl 166  3  0.0%
Gl 225.2  3  0.0%
Gl 278  3  0.0%
Gl 294  3  0.0%
Gl 319  3  0.0%
Gl 321.3  3  0.0%
Gl 331  3  0.0%
Gl 421  3  0.0%
Gl 520  3  0.0%
GJ 9490  3  0.0%
Gl 586  3  0.0%
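
The binarize option from the treatment note for 'base' can be sketched in plain Python. The sample rows and the helper name add_has_base_id are hypothetical, chosen only for illustration. The flag name has_base_id comes from the treatment text.

```python
# Hypothetical sample rows; in the real file 'base' is empty for 99.1% of rows.
rows = [
    {"id": 1, "base": ""},
    {"id": 2, "base": "Gl 57.1"},
    {"id": 3, "base": "  "},
    {"id": 4, "base": "Gl 60"},
]

def add_has_base_id(records):
    """Return copies of the records with 'base' replaced by a 0/1 flag."""
    out = []
    for rec in records:
        flagged = {k: v for k, v in rec.items() if k != "base"}
        # Treat empty or whitespace-only strings as "no identifier".
        flagged["has_base_id"] = int(bool(rec.get("base", "").strip()))
        out.append(flagged)
    return out

flagged_rows = add_has_base_id(rows)
for row in flagged_rows:
    print(row)
```

Dropping the original column after deriving the flag keeps the 650 rare identifier levels out of any downstream encoder.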

lum numeric feature

This is a luminosity-like numeric feature spanning roughly 1.2e-06 to 4.09e+08, with a median of about 21.98 but a mean of 356,526: clear evidence of an extremely heavy right tail. Skew of 49.27 and kurtosis of 3,885.57 confirm a pathological distribution, and 17,485 values (14.6%) fall outside the 1.5 IQR fence. Of 119,626 rows, 13,465 are unique with no nulls or zeros, so the spread is genuine rather than padded.

Treatment: Apply a log transform before any distance- or variance-based modelling.

anthropic:claude-opus-4-7 · confidence high
Out[112]:

saturn.columns["lum"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 13,465
min 1.226e-06
max 4.093e+08
mean 3.565e+05
median 21.98
std 3.341e+06
q1 4.747
q3 76.7
iqr 71.95
skew 49.27
kurtosis 3886
n_outliers 17,485
outlier_rate 0.1462
zero_rate 0
alert: high_skew skew=+49.27
alert: outliers 14.6% rows beyond 1.5 IQR
Fig 41.
Distribution of lum. Vertical dash marks the median.
Histogram bins for lum (median: 21.97859872784824).
bin  count
1.226e-06 – 1.023e+07  118965
1.023e+07 – 2.046e+07  430
2.046e+07 – 3.069e+07  110
3.069e+07 – 4.093e+07  38
4.093e+07 – 5.116e+07  27
5.116e+07 – 6.139e+07  11
6.139e+07 – 7.162e+07  7
7.162e+07 – 8.185e+07  9
8.185e+07 – 9.208e+07  6
9.208e+07 – 1.023e+08  4
1.023e+08 – 1.125e+08  3
1.125e+08 – 1.228e+08  2
1.228e+08 – 1.33e+08  1
1.33e+08 – 1.432e+08  2
1.432e+08 – 1.535e+08  0
1.535e+08 – 1.637e+08  0
1.637e+08 – 1.739e+08  2
1.739e+08 – 1.842e+08  1
1.842e+08 – 1.944e+08  1
1.944e+08 – 2.046e+08  0
2.046e+08 – 2.149e+08  0
2.149e+08 – 2.251e+08  0
2.251e+08 – 2.353e+08  2
2.353e+08 – 2.456e+08  1
2.456e+08 – 2.558e+08  2
2.558e+08 – 2.66e+08  0
2.66e+08 – 2.763e+08  1
2.763e+08 – 2.865e+08  0
2.865e+08 – 2.967e+08  0
2.967e+08 – 3.069e+08  0
3.069e+08 – 3.172e+08  0
3.172e+08 – 3.274e+08  0
3.274e+08 – 3.376e+08  0
3.376e+08 – 3.479e+08  0
3.479e+08 – 3.581e+08  0
3.581e+08 – 3.683e+08  0
3.683e+08 – 3.786e+08  0
3.786e+08 – 3.888e+08  0
3.888e+08 – 3.99e+08  0
3.99e+08 – 4.093e+08  1
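
To illustrate the log transform recommended for lum, the sketch below applies log10 to the summary values from the stats table rather than to raw rows. The choice of base 10 is an assumption made here for readability, since the treatment note only says "log transform". Because zero_rate is 0 and there are no nulls, a plain logarithm is defined for every row and no offset is needed.

```python
import math

# Summary values copied from the lum stats table above.
lum_summary = {
    "min": 1.226e-06,
    "q1": 4.747,
    "median": 21.98,
    "q3": 76.7,
    "max": 4.093e+08,
}

# Base 10 is an illustrative choice; any log base compresses the tail.
log_lum = {name: math.log10(value) for name, value in lum_summary.items()}

for name, value in log_lum.items():
    print(f"{name:>6}: {value:7.2f}")

# Width of the full range on the log scale, in orders of magnitude.
log_span = log_lum["max"] - log_lum["min"]
print(f"span: {log_span:.1f} orders of magnitude")
```

On the raw scale the first of 40 histogram bins holds 118,965 of the 119,626 rows. On the log scale the same range covers about 14.5 orders of magnitude, with the quartiles spread across it instead of collapsing into one bin.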

var text feature

Column 'var' is a sparse short-code field: 113,634 of 119,626 rows (n_empty, 95.0%) are blank, and the remaining 5,992 values are 1–5 character tokens such as 'R', 'S', 'T', 'RS'. duplicate_rate is 0.987 with 1,523 unique values, and one_word_rate is 1.0 with len_max of 5, which points to a categorical abbreviation or flag rather than free text. The headline surprise is the emptiness: null_rate is reported as 0.0 only because blanks are stored as empty strings rather than nulls.

Treatment: Treat as a categorical code (1,523 levels); map the empties to a distinct 'missing' level before encoding.

anthropic:claude-opus-4-7 · confidence high
Out[115]:

saturn.columns["var"].stats

stat value
n 119,626
nulls 0 (0.0%)
unique 1,523
len_min 0
len_max 5
len_mean 0.14
len_median 0
len_p95 1
word_mean 1
word_median 1
n_empty 113,634
n_duplicates 118,103
duplicate_rate 0.9873
vocab_size 597
readability_flesch_mean 4.609
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.0177
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 98.7% duplicate strings
Fig 42.
Character-length distribution for var.
Character-length distribution for var (mean: 0.14000300937923194).
chars  count
0  113634
1  373
2  3272
3  197
4  1510
5  640
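
The treatment for 'var' (give the empties their own level before encoding) can be sketched as follows. The sample values and the label "missing" are illustrative, and the integer coding is only one of several possible encodings.

```python
from collections import Counter

# Hypothetical sample of 'var' values; most rows in the real column are "".
var_values = ["", "R", "", "RS", "", "", "T", "R"]

MISSING = "missing"  # label chosen here for illustration

# Map empty strings to an explicit level so an encoder sees them as a category.
var_clean = [v if v.strip() else MISSING for v in var_values]

# Assign one integer code per level, most frequent level first.
level_counts = Counter(var_clean)
codes = {level: i for i, (level, _) in enumerate(level_counts.most_common())}
var_encoded = [codes[v] for v in var_clean]

print(level_counts)
print(var_encoded)
```

Since the blanks are empty strings rather than nulls, a null-based imputer would skip them, which is why the mapping is done explicitly here.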

var_min numeric feature

Numeric feature 'var_min' is populated for only ~14% of rows (null_rate 0.858), making it a sparse signal. Among the 16,991 observed values it ranges from -1.333 to 14.902 with mean 9.50 and median 9.849, and is left-skewed (skew -0.93) with mild kurtosis (1.25). About 2.6% of present values fall outside the IQR fence (449 outliers), and only 6,248 distinct values appear.

Treatment: Impute or add a missingness indicator before modelling given the 85.8% null rate.

anthropic:claude-opus-4-7 · confidence high
Out[118]:

saturn.columns["var_min"].stats

stat value
n 119,626
nulls 102,635 (85.8%)
unique 6,248
min -1.333
max 14.9
mean 9.502
median 9.849
std 1.781
q1 8.526
q3 10.71
iqr 2.181
skew -0.9339
kurtosis 1.251
n_outliers 449
outlier_rate 0.02643
zero_rate 0
alert: null_rate 85.8% null
Fig 43.
Distribution of var_min. Vertical dash marks the median.
Histogram bins for var_min (median: 9.849).
bin  count
-1.333 – -0.9271  1
-0.9271 – -0.5212  1
-0.5212 – -0.1154  0
-0.1154 – 0.2905  1
0.2905 – 0.6964  3
0.6964 – 1.102  3
1.102 – 1.508  3
1.508 – 1.914  4
1.914 – 2.32  16
2.32 – 2.726  13
2.726 – 3.132  22
3.132 – 3.537  27
3.537 – 3.943  31
3.943 – 4.349  51
4.349 – 4.755  88
4.755 – 5.161  119
5.161 – 5.567  167
5.567 – 5.973  202
5.973 – 6.379  324
6.379 – 6.784  392
6.784 – 7.19  433
7.19 – 7.596  593
7.596 – 8.002  754
8.002 – 8.408  756
8.408 – 8.814  839
8.814 – 9.22  1046
9.22 – 9.626  1535
9.626 – 10.03  2016
10.03 – 10.44  2001
10.44 – 10.84  1940
10.84 – 11.25  1432
11.25 – 11.65  916
11.65 – 12.06  642
12.06 – 12.47  377
12.47 – 12.87  157
12.87 – 13.28  51
13.28 – 13.68  14
13.68 – 14.09  11
14.09 – 14.5  6
14.5 – 14.9  4
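
The treatment for var_min names two steps, a missingness indicator and imputation. A minimal sketch of both follows, assuming median imputation (the note does not say which imputer to use) and using a hypothetical sample in which None stands for a null.

```python
import statistics

# Hypothetical sample; None stands for a null var_min (85.8% of real rows).
var_min = [9.8, None, None, 10.4, None, 8.5, None]

# 1 where the value was missing, 0 where it was observed.
var_min_missing = [int(v is None) for v in var_min]

# Median of the observed values only; the real column's median is 9.849.
observed = [v for v in var_min if v is not None]
fill_value = statistics.median(observed)

# Replace nulls with the fill value, leaving observed values untouched.
var_min_imputed = [fill_value if v is None else v for v in var_min]

print(var_min_missing)
print(var_min_imputed)
```

With only 16,991 observed values out of 119,626, most of the imputed column is the fill value, so the indicator carries much of the usable signal.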

var_max numeric feature

Numeric feature 'var_max' (per the HYG catalog documentation, one end of a variable star's approximate magnitude range, paired with var_min) is missing for 85.8% of the 119,626 rows, leaving 16,991 populated values spread over 6,090 distinct numbers. Among observed values it centers near a median of 9.646 with mean 9.259, ranges from -1.523 to 13.702, and is left-skewed (-0.97) with 325 outliers (1.9%). The dominant concern is the null rate, not the distribution shape.

Treatment: Add a missingness indicator and impute or restrict modelling to the populated subset given the 85.8% null rate.

anthropic:claude-opus-4-7 · confidence high
Out[121]:

saturn.columns["var_max"].stats

stat value
n 119,626
nulls 102,635 (85.8%)
unique 6,090
min -1.523
max 13.7
mean 9.259
median 9.646
std 1.742
q1 8.243
q3 10.49
iqr 2.249
skew -0.9704
kurtosis 1.128
n_outliers 325
outlier_rate 0.01913
zero_rate 0
alert: null_rate 85.8% null
Fig 44.
Distribution of var_max. Vertical dash marks the median.
Histogram bins for var_max (median: 9.646).
bin  count
-1.523 – -1.142  1
-1.142 – -0.7617  0
-0.7617 – -0.3811  1
-0.3811 – -0.0005  1
-0.0005 – 0.3801  1
0.3801 – 0.7608  3
0.7608 – 1.141  3
1.141 – 1.522  2
1.522 – 1.903  5
1.903 – 2.283  17
2.283 – 2.664  11
2.664 – 3.045  22
3.045 – 3.425  28
3.425 – 3.806  29
3.806 – 4.186  41
4.186 – 4.567  79
4.567 – 4.948  110
4.948 – 5.328  157
5.328 – 5.709  185
5.709 – 6.09  236
6.09 – 6.47  394
6.47 – 6.851  428
6.851 – 7.231  519
7.231 – 7.612  662
7.612 – 7.993  760
7.993 – 8.373  849
8.373 – 8.754  855
8.754 – 9.134  1057
9.134 – 9.515  1443
9.515 – 9.896  1867
9.896 – 10.28  1915
10.28 – 10.66  1832
10.66 – 11.04  1463
11.04 – 11.42  950
11.42 – 11.8  599
11.8 – 12.18  348
12.18 – 12.56  100
12.56 – 12.94  15
12.94 – 13.32  2
13.32 – 13.7  1
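
The outlier count for var_max can be located with a little arithmetic on the reported quartiles. This sketch assumes the outlier rule is the usual 1.5 IQR fence, as the alert text on the lum column suggests. The report does not spell out the rule for this column.

```python
# Quartiles and extremes copied from the var_max stats table above.
q1, q3, iqr = 8.243, 10.49, 2.249
col_min, col_max = -1.523, 13.7

# Assumption: outliers are values beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# An outlier on a given side is only possible if the extreme crosses the fence.
has_low_outliers = col_min < lower_fence
has_high_outliers = col_max > upper_fence

print(f"fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print(f"low-side outliers possible:  {has_low_outliers}")
print(f"high-side outliers possible: {has_high_outliers}")
```

Under that assumption the reported maximum lies inside the upper fence, so the 325 flagged values would all sit at the low end of the range, which is consistent with the left skew.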

How to cite


BibTeX
@misc{saturn-hyg-hygdata-v41-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: hyg hygdata v41},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/hyg-hygdata_v41}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: hyg hygdata v41. Source: /home/coolhand/data/celestial/hyg/hygdata_v41.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/hyg-hygdata_v41