saturn·

exoplanets

source /home/coolhand/data/celestial/exoplanets/exoplanets.csv 6,150 rows 11 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 6,150 exoplanets across 11 columns: identifiers (pl_name, hostname), discovery metadata (discoverymethod, disc_year), sky coordinates (ra, dec), and physical measurements (pl_bmassj, pl_orbsmax, pl_rade, pl_orbper, sy_dist). The Transit method dominates discovery at 73.4% of records, with Radial Velocity a distant second; this matters because the detection method shapes which kinds of planets are represented. The physical measurement columns are all strongly right-skewed with heavy outliers: pl_orbper has a skew of ~43.8 and a maximum of 8,040,000 days, and pl_orbsmax stretches to 19,000 AU, so any analysis should use log scales or trimming. pl_bmassj is missing for 50.3% of rows and pl_orbsmax for 37.4%, which limits joint mass/orbit analyses. Discovery year ranges from 1992 to 2026 with a median of 2016, giving a clear timeline of the field's growth.

citing: row_count · column_count · discoverymethod · pl_orbper · pl_orbsmax · pl_bmassj · pl_rade · disc_year · sy_dist

Schema

11 columns
Per-column summary.
column           type         nulls   unique  alerts
pl_name          text         0.0%    6,150   near_unique, short_text
hostname         text         0.0%    4,582   one_word, allcaps, short_text, duplicates
ra               numeric      0.0%    4,579   -
dec              numeric      0.0%    4,579   -
sy_dist          numeric      2.1%    4,397   high_skew, outliers
pl_orbper        numeric      5.6%    5,791   high_skew, outliers
pl_orbsmax       numeric      37.4%   2,292   null_rate, high_skew, outliers
pl_bmassj        numeric      50.3%   1,989   null_rate, high_skew, outliers
pl_rade          numeric      25.7%   2,004   null_rate, high_skew, outliers
disc_year        numeric      0.0%    34      -
discoverymethod  categorical  0.0%    11      -
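
The null percentages in the schema follow directly from the null counts in the per-column details. The sketch below recomputes them and flags columns above a cut-off. The 20% threshold is an assumption chosen only because it separates the columns carrying a null_rate alert from those without one; the profiler's actual threshold is not stated in this report.

```python
# Null counts as listed in the per-column detail sections of this report.
N_ROWS = 6150
null_counts = {
    "sy_dist": 127,
    "pl_orbper": 346,
    "pl_orbsmax": 2301,
    "pl_bmassj": 3091,
    "pl_rade": 1582,
    "disc_year": 1,
}

# Assumed cut-off, not taken from the profiler's configuration.
NULL_RATE_THRESHOLD = 0.20

null_rates = {col: count / N_ROWS for col, count in null_counts.items()}
flagged = [col for col, rate in null_rates.items() if rate > NULL_RATE_THRESHOLD]

for col, rate in null_rates.items():
    marker = "null_rate" if col in flagged else ""
    print(f"{col:<12} {rate:6.1%} {marker}")
```

With this cut-off the flagged columns are pl_orbsmax, pl_bmassj, and pl_rade, matching the alerts in the schema.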

pl_name

text identifier near_unique short_text
This is the planet name identifier (pl_name), a fully unique short text field across all 6150 rows with zero nulls or duplicates. Values are short (mean 11.4 chars, ~2.24 words) and dominated by astronomical catalog conventions: companion letters like 'b' (4535), 'c' (1052), 'd' (338) paired with host-star prefixes such as 'hd' (815), 'gj' (147), 'hip' (81), 'epic' (43). Uniqueness equals row count, so it functions as a primary key rather than a feature. Treatment: Use as primary key for joins; drop from modelling features. high · anthropic:claude-opus-4-7
n
6,150
nulls
0 (0.0%)
unique
6,150
len_min
5
len_max
29
len_mean
11.42
len_median
11
len_p95
16.55
word_mean
2.242
word_median
2
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
4,713
readability_flesch_mean
97.69
emoji_rate
0
url_rate
0
one_word_rate
0.008293
allcaps_rate
0.008618
boilerplate_rate
0
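
Before relying on pl_name as a join key, it is worth re-checking the key property whenever the file is refreshed. The following is a minimal sketch of such a check; `primary_key_report` is a hypothetical helper, and the sample names are illustrative rather than rows read from the CSV.

```python
from collections import Counter

def primary_key_report(values):
    """Summarise whether a column can serve as a primary key."""
    counts = Counter(values)
    n_null = sum(1 for v in values if v is None or v == "")
    duplicates = {
        v: c for v, c in counts.items() if c > 1 and v not in (None, "")
    }
    return {
        "n": len(values),
        "unique": len(counts),
        "nulls": n_null,
        "duplicates": duplicates,
        # A key needs every value present and no value repeated.
        "is_key": n_null == 0 and not duplicates,
    }

# Illustrative names only (host names from this report plus companion letters).
names = ["TRAPPIST-1 b", "TRAPPIST-1 c", "HD 110067 b"]
report = primary_key_report(names)
print(report["is_key"], report["unique"])
```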

hostname

text foreign_key one_word allcaps short_text duplicates
This column holds astronomical host-star identifiers (KOI-351, TRAPPIST-1, HD 110067, HIP 41378), with the 'hd' prefix dominating at 815 occurrences, followed by catalog prefixes such as GJ, HIP, EPIC, 2MASS, and TIC. Values are short (mean length 9.4 characters, median 1 word) and 51.5% are all-caps, consistent with catalog naming conventions. Duplication is high: 4,582 unique values across 6,150 rows (25.5% duplicate rate), which suggests one row per planet with multi-planet hosts repeated. Treatment: left-join on this id to a star-level table; do not use as a model feature directly. high · anthropic:claude-opus-4-7
n
6,150
nulls
0 (0.0%)
unique
4,582
len_min
3
len_max
27
len_mean
9.424
len_median
9
len_p95
15.55
word_mean
1.254
word_median
1
n_empty
0
n_duplicates
1,568
duplicate_rate
0.255
vocab_size
4,671
readability_flesch_mean
77.19
emoji_rate
0
url_rate
0
one_word_rate
0.7579
allcaps_rate
0.5154
boilerplate_rate
0
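
The left-join treatment can be sketched in plain Python as follows. The star-level table and its `star_id` column are hypothetical, since this report does not describe one, and the planet rows are illustrative.

```python
from collections import Counter

# Illustrative planet rows; one row per planet, hosts may repeat.
planets = [
    {"pl_name": "TRAPPIST-1 b", "hostname": "TRAPPIST-1"},
    {"pl_name": "TRAPPIST-1 c", "hostname": "TRAPPIST-1"},
    {"pl_name": "HD 110067 b", "hostname": "HD 110067"},
]

# Hypothetical star-level table keyed by hostname.
stars = {"TRAPPIST-1": {"star_id": 1}}

def left_join(rows, lookup, key="hostname"):
    """Keep every planet row; fill star columns with None when unmatched."""
    star_columns = sorted({col for rec in lookup.values() for col in rec})
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in star_columns:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined

joined = left_join(planets, stars)
planets_per_host = Counter(row["hostname"] for row in planets)
```

A left join keeps planets whose host is absent from the star table, which makes gaps in the star-level data visible instead of silently dropping rows.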

ra

numeric feature
This column is almost certainly Right Ascension (ra), a celestial longitude coordinate in degrees, with values spanning 0.186 to 359.97, essentially the full 0–360° range expected for RA. The distribution is left-skewed (skew -1.08) with a median of 284.91 well above the mean of 232.89, suggesting non-uniform sky coverage concentrated toward higher RA values. With 4,579 unique values across 6,150 rows and no nulls or outliers, the column is clean but not a key. Treatment: Treat as a circular/angular feature; consider sin/cos encoding before modelling rather than using raw degrees. high · anthropic:claude-opus-4-7
n
6,150
nulls
0 (0.0%)
unique
4,579
min
0.1856
max
360
mean
232.9
median
284.9
std
91.68
q1
173.3
q3
293.2
iqr
119.9
skew
-1.078
kurtosis
-0.144
n_outliers
0
outlier_rate
0
zero_rate
0
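
The sin/cos treatment matters because raw degrees place 0° and 360° at opposite ends of the scale even though they are the same direction. A minimal sketch, where `encode_ra` is a hypothetical helper name:

```python
import math

def encode_ra(ra_deg):
    """Map an RA in degrees to a (sin, cos) pair on the unit circle."""
    theta = math.radians(ra_deg % 360.0)
    return math.sin(theta), math.cos(theta)

# The smallest and largest RA values quoted in the column description.
low = encode_ra(0.186)
high = encode_ra(359.97)

raw_gap = abs(359.97 - 0.186)        # nearly 360 in raw degrees
encoded_gap = math.dist(low, high)   # small once wrapped onto the circle

print(f"raw gap: {raw_gap:.3f} deg, encoded gap: {encoded_gap:.5f}")
```

The two extremes of the column are neighbours on the sky, and the encoded pair reflects that while the raw values do not.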

dec

numeric feature
This is almost certainly declination (dec) in degrees, an astronomical sky-coordinate: values span -88.12 to 86.86, well within the ±90° valid range. The distribution is left-skewed (skew -0.83) with a median of 39.13 sitting well above the mean of 18.05, suggesting a sample weighted toward the northern celestial hemisphere despite reaching deep southern declinations. With 4579 unique values across 6150 rows and no nulls or outliers, coverage is clean. Treatment: Use as-is for spatial joins or convert to radians/sin(dec) before modelling sky density. high · anthropic:claude-opus-4-7
n
6,150
nulls
0 (0.0%)
unique
4,579
min
-88.12
max
86.86
mean
18.05
median
39.13
std
37.07
q1
-11.17
q3
45.38
iqr
56.55
skew
-0.8327
kurtosis
-0.4469
n_outliers
0
outlier_rate
0
zero_rate
0
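
The reason sin(dec) is suggested for sky-density work is that the area of a declination band on the sphere is proportional to the difference in sin(dec), not in dec itself. A short sketch of that relationship, with `sky_fraction` as a hypothetical helper:

```python
import math

def sky_fraction(dec_lo_deg, dec_hi_deg):
    """Fraction of the celestial sphere between two declinations (degrees)."""
    lo = math.sin(math.radians(dec_lo_deg))
    hi = math.sin(math.radians(dec_hi_deg))
    return (hi - lo) / 2.0

# A 30-degree band at the equator covers as much sky as the 60-degree polar cap.
equatorial_band = sky_fraction(0.0, 30.0)
polar_cap = sky_fraction(30.0, 90.0)
whole_sky = sky_fraction(-90.0, 90.0)
```

Binning on sin(dec) therefore gives equal-area bins, so counts per bin can be compared directly as densities.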

sy_dist

numeric feature high_skew outliers
Likely the distance to the planetary system (sy_dist) in parsecs, with 4,397 unique values across 6,150 rows. The distribution is heavily right-skewed (skew 3.97, kurtosis 17.0): the median of 377.06 sits well below the mean of 713.31, and values span 1.30 to 8,340 with 321 outliers (5.3% of non-null values). The null rate is low at 2.07% and there are no zeros. Treatment: log-transform before modelling to tame the right skew and outliers. high · anthropic:claude-opus-4-7
n
6,150
nulls
127 (2.1%)
unique
4,397
min
1.301
max
8,340
mean
713.3
median
377.1
std
1212
q1
100.3
q3
836.7
iqr
736.4
skew
3.967
kurtosis
17.02
n_outliers
321
outlier_rate
0.0533
zero_rate
0
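
The report does not say which rule produced the outlier counts. A common convention is Tukey's fences at 1.5 × IQR beyond the quartiles, and the sketch below applies that rule to the quartiles listed above. Treat the rule, and the helper name `tukey_fences`, as assumptions.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles for sy_dist as reported in this section.
lower, upper = tukey_fences(100.3, 836.7)

def is_outlier(value):
    return value < lower or value > upper

# Reported outlier_rate uses non-null rows as the denominator.
outlier_rate = 321 / (6150 - 127)

print(f"fences: [{lower:.1f}, {upper:.1f}], outlier rate: {outlier_rate:.4f}")
```

Under this assumed rule the upper fence is about 1,941 parsecs, so the maximum of 8,340 is flagged while the median is not. The reported outlier_rate of 0.0533 matches 321 divided by the 6,023 non-null rows.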

pl_orbper

numeric feature high_skew outliers
This is almost certainly planetary orbital period (likely in days), with 5791 unique values across 6150 rows and a 5.63% null rate. The distribution is wildly right-skewed: median is 11.13 while mean is 4469.34 and max reaches 8,040,000, producing a skew of 43.8 and kurtosis near 1970. About 17.4% of values (1012) flag as outliers, consistent with a mix of short-period close-in planets and extreme long-period objects. Treatment: log-transform before modelling and impute the ~5.6% nulls. high · anthropic:claude-opus-4-7
n
6,150
nulls
346 (5.6%)
unique
5,791
min
0.09071
max
8.04e+06
mean
4469
median
11.13
std
1.633e+05
q1
4.352
q3
39.69
iqr
35.34
skew
43.82
kurtosis
1970
n_outliers
1,012
outlier_rate
0.1744
zero_rate
0
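
To illustrate what the log-transform does to this column, the sketch below applies log10 to the reported minimum, median, and maximum. `log10_or_none` is a hypothetical helper that leaves nulls untouched so imputation can be handled as a separate step.

```python
import math

def log10_or_none(value):
    """log10 for positive values; None for nulls or non-positive input."""
    if value is None or value <= 0:
        return None
    return math.log10(value)

# Reported min, median, and max of pl_orbper, plus one null.
periods = [0.09071, 11.13, 8.04e6, None]
log_periods = [log10_or_none(p) for p in periods]

# Width of the distribution in decades after the transform.
span_decades = log_periods[2] - log_periods[0]
print(log_periods, round(span_decades, 2))
```

On the log scale the column spans roughly eight decades, with the median near 1, which is far easier to model than a raw range running from under a tenth of a day to millions of days.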

pl_orbsmax

numeric feature null_rate high_skew outliers
This is the planet's orbital semi-major axis (pl_orbsmax), a numeric astrophysical feature spanning 0.0044 to 19,000 with a median of 0.1159. These are AU-scale values dominated by close-in planets, with a few extreme wide-orbit outliers. A skew of 34.66 and kurtosis of 1394.96 are extraordinary, and 604 outliers (15.7% of non-null values) plus a 37.4% null rate make raw use risky. The mean (21.65) sits far above the q3 of 0.812, confirming that a handful of values dominate the scale. Treatment: log-transform and impute missing values before modelling. high · anthropic:claude-opus-4-7
n
6,150
nulls
2,301 (37.4%)
unique
2,292
min
0.0044
max
19,000
mean
21.65
median
0.1159
std
412.2
q1
0.0538
q3
0.812
iqr
0.7582
skew
34.66
kurtosis
1395
n_outliers
604
outlier_rate
0.1569
zero_rate
0
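
The gap between mean and median here is the signature of a few extreme values dominating the scale. The toy sample below, which is illustrative and only borrows the reported maximum of 19,000, shows how one wide-orbit value pulls the arithmetic mean while the median and the geometric mean (the mean in log space) stay near the bulk of the data.

```python
import statistics

# Toy sample: four close-in orbits and the reported maximum (AU).
values = [0.05, 0.1, 0.12, 0.8, 19000.0]

mean = statistics.mean(values)
median = statistics.median(values)
geo_mean = statistics.geometric_mean(values)

print(f"mean={mean:.1f}  median={median}  geometric mean={geo_mean:.2f}")
```

This is one reason to summarise and model such a column in log space instead of on the raw scale.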

pl_bmassj

numeric feature null_rate high_skew outliers
This is the planet mass measured in Jupiter masses (pl_bmassj), a numeric astrophysical feature. Half the rows are null (50.3%) and the distribution is heavily right-skewed (skew 3.07, kurtosis 10.26): the median is 0.538 MJ but the mean is 2.42 MJ, and values stretch from 6.293e-05 up to 30.0, with 410 outliers (13.4% of non-null values). The dynamic range of nearly six orders of magnitude is the dominant signal. Treatment: Log-transform before modelling and decide on an imputation/missing-indicator strategy for the 50% nulls. high · anthropic:claude-opus-4-7
n
6,150
nulls
3,091 (50.3%)
unique
1,989
min
6.293e-05
max
30
mean
2.417
median
0.538
std
4.706
q1
0.03744
q3
2.197
iqr
2.16
skew
3.075
kurtosis
10.26
n_outliers
410
outlier_rate
0.134
zero_rate
0
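
One possible imputation/missing-indicator strategy is to fill nulls with the median in log space and keep a separate 0/1 flag, so a model can still tell measured masses from filled ones. This is a sketch of that one option, not a recommendation from the profiler; the helper name and the toy masses are assumptions.

```python
import math
import statistics

def impute_log_with_indicator(values):
    """Return ((log10 value, missing flag) pairs, fill value)."""
    observed = [math.log10(v) for v in values if v is not None]
    fill = statistics.median(observed)
    features = []
    for v in values:
        missing = v is None
        features.append((fill if missing else math.log10(v), int(missing)))
    return features, fill

# Toy masses in Jupiter masses; None marks a missing measurement.
masses = [0.01, None, 1.0, None, 10.0]
features, fill = impute_log_with_indicator(masses)
n_missing = sum(flag for _, flag in features)
```

With half the column missing, the indicator is likely to carry real information, because whether a mass was measured at all depends on how the planet was detected.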

pl_rade

numeric feature null_rate high_skew outliers
This is `pl_rade`, the planetary radius (in Earth radii) for confirmed exoplanets. Values span 0.31 to 87.21 with a median of 2.43, but heavy right skew (3.22) and extreme kurtosis (28.66) push the mean to 4.46 and flag 872 outliers (19.1%). About 25.7% of rows are null, so a quarter of planets lack a measured radius. Treatment: Log-transform and impute or flag the 25.7% missing before modelling. high · anthropic:claude-opus-4-7
n
6,150
nulls
1,582 (25.7%)
unique
2,004
min
0.3098
max
87.21
mean
4.456
median
2.43
std
4.952
q1
1.62
q3
4.06
iqr
2.44
skew
3.218
kurtosis
28.66
n_outliers
872
outlier_rate
0.1909
zero_rate
0

disc_year

numeric timestamp
Discovery year for each record, ranging from 1992 to 2026 with a median of 2016 and an interquartile range spanning 2014 to 2021. The distribution is left-skewed (skew -0.69), reflecting that most discoveries cluster in recent years while a long tail of earlier years produces 109 outliers (1.8%). Only 34 distinct years appear across 6,150 rows, and nulls are negligible (0.02%). Treatment: Treat as a discrete year; bin or use directly as an ordinal feature. high · anthropic:claude-opus-4-7
n
6,150
nulls
1 (0.0%)
unique
34
min
1992
max
2026
mean
2017
median
2016
std
4.965
q1
2014
q3
2021
iqr
7
skew
-0.6885
kurtosis
1.262
n_outliers
109
outlier_rate
0.01773
zero_rate
0
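
If binning is preferred over using the raw year, fixed-width bins are a simple option. In the sketch below the five-year width and the 1990 origin are assumptions, and `year_bin` is a hypothetical helper.

```python
def year_bin(year, width=5, origin=1990):
    """Label the fixed-width bin containing `year`; None stays None."""
    if year is None:
        return None
    start = origin + ((year - origin) // width) * width
    return f"{start}-{start + width - 1}"

# Reported min, median, and max of disc_year, plus the single null.
for y in (1992, 2016, 2026, None):
    print(y, year_bin(y))
```

Bin labels sort correctly as strings within this year range, so they can be used as an ordered categorical without extra mapping.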

discoverymethod

categorical feature
Categorical label recording the technique used to detect each exoplanet, with 11 distinct methods across 6,150 rows and no nulls. The distribution is heavily concentrated: 'Transit' accounts for 73.4% of records and 'Radial Velocity' for another 1,182 (19.2%), leaving the remaining 9 methods as long-tail rarities (down to 2 records for 'Pulsation Timing Variations'). Entropy ratio of 0.34 confirms the imbalance. Treatment: One-hot encode, optionally collapsing the rare methods into an 'Other' bucket given the severe imbalance. high · anthropic:claude-opus-4-7
n
6,150
nulls
0 (0.0%)
unique
11
top_value
Transit
top_rate
0.7345
cardinality
11
entropy
1.189
entropy_ratio
0.3436
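
The reported figures are consistent with entropy measured in bits and the ratio taken against log2 of the cardinality, although the profiler's exact definition is not stated here. The sketch below shows that calculation together with the rare-category collapse and one-hot encoding from the treatment. The label counts are a toy sample, and `min_count` is an assumed parameter.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def collapse_rare(labels, min_count):
    """Replace labels seen fewer than min_count times with 'Other'."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else "Other" for lab in labels]

def one_hot(labels):
    """Return (sorted categories, one 0/1 row per label)."""
    categories = sorted(set(labels))
    return categories, [[int(lab == cat) for cat in categories] for lab in labels]

# Consistency check against the reported entropy and cardinality.
implied_ratio = 1.189 / math.log2(11)

# Toy sample using method names that appear in this report.
labels = ["Transit"] * 6 + ["Radial Velocity"] * 3 + ["Pulsation Timing Variations"]
collapsed = collapse_rare(labels, min_count=2)
categories, rows = one_hot(collapsed)
```

Collapsing before encoding keeps the feature matrix from gaining columns that are almost entirely zeros.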