exoplanets
Reading
This dataset catalogs 6,150 exoplanets across 11 columns: identifiers (pl_name, hostname), discovery metadata (discoverymethod, disc_year), sky coordinates (ra, dec), and physical measurements (pl_bmassj, pl_orbsmax, pl_rade, pl_orbper, sy_dist). The Transit method accounts for 73.4% of records, with Radial Velocity a distant second at 19.2%; this matters because the detection method shapes which kinds of planets are represented. All five physical measurement columns are strongly right-skewed with heavy outliers. The most extreme are pl_orbper (skew ~43.8, max 8,040,000 days) and pl_orbsmax (skew ~34.7, max 19,000 AU), so any analysis should use log scales or trimming. Missing data is also substantial: pl_bmassj is null for 50.3% of rows, pl_orbsmax for 37.4%, and pl_rade for 25.7%, which limits joint mass/orbit analyses. Discovery years range from 1992 to 2026 with a median of 2016, which is also the busiest bin in the histogram (1,496 records), giving a clear timeline of the field's growth.
citing: row_count · column_count · discoverymethod · pl_orbper · pl_orbsmax · pl_bmassj · pl_rade · disc_year · sy_dist
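The limit on joint mass/orbit analyses can be quantified by counting rows where both columns are present. The following is a minimal sketch, assuming the catalog is loaded into a pandas DataFrame with the column names listed above; `joint_coverage` is a hypothetical helper and the four-row frame is invented illustration data, not values from the dataset.

```python
import pandas as pd

def joint_coverage(df: pd.DataFrame, cols: list[str]) -> float:
    """Share of rows in which every column in `cols` is non-null."""
    return float(df[cols].notna().all(axis=1).mean())

# Invented rows for illustration only; real values come from the catalog.
toy = pd.DataFrame({
    "pl_bmassj": [1.2, None, 0.03, None],
    "pl_orbsmax": [0.05, 0.8, None, None],
})
coverage = joint_coverage(toy, ["pl_bmassj", "pl_orbsmax"])
print(f"{coverage:.0%} of rows have both mass and semi-major axis")
```

On the real data the individual null rates (50.3% and 37.4%) only bound the answer: joint coverage must fall between 12.3% and 49.7%, depending on how much the missing rows overlap.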
Charts the summary said to look at first
discoverymethod (value counts)
| value | count | share |
|---|---|---|
| Transit | 4517 | 73.4% |
| Radial Velocity | 1182 | 19.2% |
| Microlensing | 275 | 4.5% |
| Imaging | 94 | 1.5% |
| Transit Timing Variations | 39 | 0.6% |
| Eclipse Timing Variations | 17 | 0.3% |
| Orbital Brightness Modulation | 9 | 0.1% |
| Pulsar Timing | 8 | 0.1% |
| Astrometry | 6 | 0.1% |
| Pulsation Timing Variations | 2 | 0.0% |
| Disk Kinematics | 1 | 0.0% |
disc_year (histogram)
| bin | count |
|---|---|
| 1992 – 1993 | 2 |
| 1993 – 1994 | 0 |
| 1994 – 1995 | 1 |
| 1995 – 1995 | 1 |
| 1995 – 1996 | 6 |
| 1996 – 1997 | 1 |
| 1997 – 1998 | 0 |
| 1998 – 1999 | 6 |
| 1999 – 2000 | 13 |
| 2000 – 2000 | 16 |
| 2000 – 2001 | 12 |
| 2001 – 2002 | 29 |
| 2002 – 2003 | 22 |
| 2003 – 2004 | 0 |
| 2004 – 2005 | 27 |
| 2005 – 2006 | 36 |
| 2006 – 2006 | 32 |
| 2006 – 2007 | 52 |
| 2007 – 2008 | 63 |
| 2008 – 2009 | 0 |
| 2009 – 2010 | 91 |
| 2010 – 2011 | 98 |
| 2011 – 2012 | 135 |
| 2012 – 2012 | 139 |
| 2012 – 2013 | 128 |
| 2013 – 2014 | 869 |
| 2014 – 2015 | 0 |
| 2015 – 2016 | 155 |
| 2016 – 2017 | 1496 |
| 2017 – 2018 | 152 |
| 2018 – 2018 | 315 |
| 2018 – 2019 | 196 |
| 2019 – 2020 | 234 |
| 2020 – 2021 | 0 |
| 2021 – 2022 | 564 |
| 2022 – 2023 | 369 |
| 2023 – 2023 | 324 |
| 2023 – 2024 | 259 |
| 2024 – 2025 | 243 |
| 2025 – 2026 | 63 |
pl_rade (histogram)
| bin | count |
|---|---|
| 0.3098 – 2.482 | 2355 |
| 2.482 – 4.655 | 1149 |
| 4.655 – 6.827 | 155 |
| 6.827 – 8.999 | 93 |
| 8.999 – 11.17 | 158 |
| 11.17 – 13.34 | 266 |
| 13.34 – 15.52 | 201 |
| 15.52 – 17.69 | 107 |
| 17.69 – 19.86 | 54 |
| 19.86 – 22.03 | 15 |
| 22.03 – 24.21 | 8 |
| 24.21 – 26.38 | 2 |
| 26.38 – 28.55 | 0 |
| 28.55 – 30.72 | 1 |
| 30.72 – 32.9 | 1 |
| 32.9 – 35.07 | 1 |
| 35.07 – 37.24 | 0 |
| 37.24 – 39.41 | 0 |
| 39.41 – 41.59 | 0 |
| 41.59 – 43.76 | 0 |
| 43.76 – 45.93 | 0 |
| 45.93 – 48.1 | 0 |
| 48.1 – 50.28 | 0 |
| 50.28 – 52.45 | 0 |
| 52.45 – 54.62 | 0 |
| 54.62 – 56.79 | 0 |
| 56.79 – 58.96 | 0 |
| 58.96 – 61.14 | 0 |
| 61.14 – 63.31 | 0 |
| 63.31 – 65.48 | 0 |
| 65.48 – 67.65 | 0 |
| 67.65 – 69.83 | 0 |
| 69.83 – 72 | 0 |
| 72 – 74.17 | 0 |
| 74.17 – 76.34 | 0 |
| 76.34 – 78.52 | 1 |
| 78.52 – 80.69 | 0 |
| 80.69 – 82.86 | 0 |
| 82.86 – 85.03 | 0 |
| 85.03 – 87.21 | 1 |
pl_orbper (histogram)
| bin | count |
|---|---|
| 0.09071 – 2.01e+05 | 5800 |
| 2.01e+05 – 4.02e+05 | 0 |
| 4.02e+05 – 6.03e+05 | 0 |
| 6.03e+05 – 8.04e+05 | 0 |
| 8.04e+05 – 1.005e+06 | 0 |
| 1.005e+06 – 1.206e+06 | 0 |
| 1.206e+06 – 1.407e+06 | 0 |
| 1.407e+06 – 1.608e+06 | 0 |
| 1.608e+06 – 1.809e+06 | 1 |
| 1.809e+06 – 2.01e+06 | 0 |
| 2.01e+06 – 2.211e+06 | 0 |
| 2.211e+06 – 2.412e+06 | 0 |
| 2.412e+06 – 2.613e+06 | 0 |
| 2.613e+06 – 2.814e+06 | 0 |
| 2.814e+06 – 3.015e+06 | 0 |
| 3.015e+06 – 3.216e+06 | 0 |
| 3.216e+06 – 3.417e+06 | 0 |
| 3.417e+06 – 3.618e+06 | 0 |
| 3.618e+06 – 3.819e+06 | 0 |
| 3.819e+06 – 4.02e+06 | 0 |
| 4.02e+06 – 4.221e+06 | 0 |
| 4.221e+06 – 4.422e+06 | 0 |
| 4.422e+06 – 4.623e+06 | 0 |
| 4.623e+06 – 4.824e+06 | 0 |
| 4.824e+06 – 5.025e+06 | 0 |
| 5.025e+06 – 5.226e+06 | 0 |
| 5.226e+06 – 5.427e+06 | 0 |
| 5.427e+06 – 5.628e+06 | 0 |
| 5.628e+06 – 5.829e+06 | 1 |
| 5.829e+06 – 6.03e+06 | 0 |
| 6.03e+06 – 6.231e+06 | 0 |
| 6.231e+06 – 6.432e+06 | 0 |
| 6.432e+06 – 6.633e+06 | 0 |
| 6.633e+06 – 6.834e+06 | 0 |
| 6.834e+06 – 7.035e+06 | 0 |
| 7.035e+06 – 7.236e+06 | 0 |
| 7.236e+06 – 7.437e+06 | 1 |
| 7.437e+06 – 7.638e+06 | 0 |
| 7.638e+06 – 7.839e+06 | 0 |
| 7.839e+06 – 8.04e+06 | 1 |
sy_dist (histogram)
| bin | count |
|---|---|
| 1.301 – 209.8 | 2232 |
| 209.8 – 418.2 | 969 |
| 418.2 – 626.7 | 714 |
| 626.7 – 835.2 | 594 |
| 835.2 – 1044 | 539 |
| 1044 – 1252 | 287 |
| 1252 – 1461 | 187 |
| 1461 – 1669 | 102 |
| 1669 – 1878 | 67 |
| 1878 – 2086 | 39 |
| 2086 – 2294 | 16 |
| 2294 – 2503 | 18 |
| 2503 – 2711 | 7 |
| 2711 – 2920 | 6 |
| 2920 – 3128 | 7 |
| 3128 – 3337 | 8 |
| 3337 – 3545 | 8 |
| 3545 – 3754 | 4 |
| 3754 – 3962 | 8 |
| 3962 – 4171 | 6 |
| 4171 – 4379 | 5 |
| 4379 – 4588 | 7 |
| 4588 – 4796 | 4 |
| 4796 – 5005 | 3 |
| 5005 – 5213 | 8 |
| 5213 – 5421 | 6 |
| 5421 – 5630 | 7 |
| 5630 – 5838 | 12 |
| 5838 – 6047 | 11 |
| 6047 – 6255 | 10 |
| 6255 – 6464 | 14 |
| 6464 – 6672 | 24 |
| 6672 – 6881 | 14 |
| 6881 – 7089 | 24 |
| 7089 – 7298 | 23 |
| 7298 – 7506 | 9 |
| 7506 – 7715 | 11 |
| 7715 – 7923 | 6 |
| 7923 – 8132 | 3 |
| 8132 – 8340 | 4 |
Schema
11 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| pl_name | text | 0.0% | 6,150 | near_unique, short_text |
| hostname | text | 0.0% | 4,582 | one_word, allcaps, short_text, duplicates |
| ra | numeric | 0.0% | 4,579 | |
| dec | numeric | 0.0% | 4,579 | |
| sy_dist | numeric | 2.1% | 4,397 | high_skew, outliers |
| pl_orbper | numeric | 5.6% | 5,791 | high_skew, outliers |
| pl_orbsmax | numeric | 37.4% | 2,292 | null_rate, high_skew, outliers |
| pl_bmassj | numeric | 50.3% | 1,989 | null_rate, high_skew, outliers |
| pl_rade | numeric | 25.7% | 2,004 | null_rate, high_skew, outliers |
| disc_year | numeric | 0.0% | 34 | |
| discoverymethod | categorical | 0.0% | 11 | |
pl_name
text · identifier · near_unique · short_text
This is the planet name identifier (pl_name), a fully unique short text field across all 6,150 rows with no nulls or duplicates. Values are short (mean 11.4 characters, about 2.24 words) and follow astronomical catalog conventions: companion letters such as 'b' (4,535), 'c' (1,052), and 'd' (338), paired with host-star prefixes such as 'hd' (815), 'gj' (147), 'hip' (81), and 'epic' (43). Because the unique count equals the row count, the column functions as a primary key rather than a feature. Treatment: Use as primary key for joins; drop from modelling features.
- n: 6,150
- nulls: 0 (0.0%)
- unique: 6,150
- len_min: 5
- len_max: 29
- len_mean: 11.42
- len_median: 11
- len_p95: 16.55
- word_mean: 2.242
- word_median: 2
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 4,713
- readability_flesch_mean: 97.69
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.008293
- allcaps_rate: 0.008618
- boilerplate_rate: 0
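Before relying on pl_name as a primary key, it is worth re-checking uniqueness whenever the catalog is refreshed. A minimal sketch using only the standard library; `duplicate_keys` is a hypothetical helper, and the sample names are illustrative rather than rows copied from the dataset.

```python
from collections import Counter

def duplicate_keys(values):
    """Return the values that occur more than once, with their counts."""
    return {v: n for v, n in Counter(values).items() if n > 1}

# Illustrative planet names; `dirty` adds a deliberate repeat.
clean = ["TRAPPIST-1 b", "TRAPPIST-1 c", "HD 110067 b"]
dirty = clean + ["TRAPPIST-1 b"]

print(duplicate_keys(clean))  # empty dict when the key is unique
print(duplicate_keys(dirty))  # lists the offending name and its count
```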
hostname
text · foreign_key · one_word · allcaps · short_text · duplicates
This column holds astronomical host-star identifiers (KOI-351, TRAPPIST-1, HD 110067, HIP 41378). The 'hd' prefix is the most common at 815 occurrences, followed by catalog prefixes such as GJ, HIP, EPIC, 2MASS, and TIC. Values are short (mean length 9.4 characters, median 1 word) and 51.5% are all-caps, consistent with catalog naming conventions. Duplication is high: 4,582 unique values across 6,150 rows (25.5% duplicate rate), suggesting multiple records per host star, most likely one row per planet. Treatment: left-join on this id to a star-level table; do not use as a model feature directly.
- n: 6,150
- nulls: 0 (0.0%)
- unique: 4,582
- len_min: 3
- len_max: 27
- len_mean: 9.424
- len_median: 9
- len_p95: 15.55
- word_mean: 1.254
- word_median: 1
- n_empty: 0
- n_duplicates: 1,568
- duplicate_rate: 0.255
- vocab_size: 4,671
- readability_flesch_mean: 77.19
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.7579
- allcaps_rate: 0.5154
- boilerplate_rate: 0
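The suggested left join can be sketched as follows, assuming pandas. The star-level table here is hypothetical (it is derived from the planet rows themselves, and `n_planets` is not a source column); in practice it would come from a separate stellar catalog keyed on hostname.

```python
import pandas as pd

# Illustrative planet rows; hostnames are examples quoted in the text above.
planets = pd.DataFrame({
    "pl_name": ["TRAPPIST-1 b", "TRAPPIST-1 c", "HD 110067 b"],
    "hostname": ["TRAPPIST-1", "TRAPPIST-1", "HD 110067"],
})

# Hypothetical star-level table with one row per host.
stars = (
    planets.groupby("hostname")
    .size()
    .rename("n_planets")
    .reset_index()
)

# The left join keeps every planet row; validate= raises an error if a
# host appears more than once in `stars`, which would duplicate planets.
joined = planets.merge(stars, on="hostname", how="left", validate="many_to_one")
print(joined)
```

The `validate="many_to_one"` argument is a cheap guard: without it, a duplicated host in the star table would silently inflate the planet row count.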
ra
numeric · feature
This column is almost certainly right ascension (ra), the celestial analogue of longitude, with values spanning 0.186 to 359.97, essentially the full 0 to 360° range expected for RA. The distribution is left-skewed (skew -1.08) with the median (284.91) well above the mean (232.89), suggesting non-uniform sky coverage concentrated toward higher RA values. Because RA wraps around at 360°, linear statistics such as the mean and skew should be read with caution. With 4,579 unique values across 6,150 rows and no nulls or outliers, the column is clean but not a key. Treatment: Treat as a circular (angular) feature; consider sin/cos encoding before modelling rather than using raw degrees.
- n: 6,150
- nulls: 0 (0.0%)
- unique: 4,579
- min: 0.1856
- max: 360
- mean: 232.9
- median: 284.9
- std: 91.68
- q1: 173.3
- q3: 293.2
- iqr: 119.9
- skew: -1.078
- kurtosis: -0.144
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
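The sin/cos encoding matters because raw degrees place 0.19° and 359.97° at opposite ends of the scale even though they are neighbours on the RA circle. A minimal sketch with the standard library; `encode_ra` and `chord` are hypothetical helper names, and the comparison is along the RA circle only, not a full angular separation on the sky.

```python
import math

def encode_ra(ra_deg: float) -> tuple[float, float]:
    """Map right ascension in degrees to a point on the unit circle."""
    theta = math.radians(ra_deg)
    return math.sin(theta), math.cos(theta)

def chord(a_deg: float, b_deg: float) -> float:
    """Euclidean distance between two encoded angles (0 = same, 2 = opposite)."""
    (s1, c1), (s2, c2) = encode_ra(a_deg), encode_ra(b_deg)
    return math.hypot(s1 - s2, c1 - c2)

# The reported extremes are about 0.2 degrees apart across the wrap-around.
print(chord(0.1856, 359.97))  # very small
print(chord(0.1856, 180.0))   # close to the maximum of 2
```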
dec
numeric · feature
This is almost certainly declination (dec) in degrees, an astronomical sky coordinate: values span -88.12 to 86.86, well within the valid ±90° range. The distribution is left-skewed (skew -0.83) with the median (39.13) well above the mean (18.05), suggesting a sample weighted toward the northern celestial hemisphere despite reaching deep southern declinations. With 4,579 unique values across 6,150 rows and no nulls or outliers, coverage is clean. Treatment: Use as-is for spatial joins, or convert to radians or sin(dec) before modelling sky density.
- n: 6,150
- nulls: 0 (0.0%)
- unique: 4,579
- min: -88.12
- max: 86.86
- mean: 18.05
- median: 39.13
- std: 37.07
- q1: -11.17
- q3: 45.38
- iqr: 56.55
- skew: -0.8327
- kurtosis: -0.4469
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
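The sin(dec) suggestion follows from spherical geometry: the sky area between two declinations is proportional to the difference of their sines, so bins of equal width in sin(dec) cover equal areas. A minimal sketch with the standard library; the helper names and the choice of four bands are assumptions for illustration.

```python
import math

def equal_area_dec_edges(n_bands: int) -> list[float]:
    """Declination edges (degrees) of n bands that each cover the same sky area."""
    return [math.degrees(math.asin(-1 + 2 * i / n_bands)) for i in range(n_bands + 1)]

def band_index(dec_deg: float, n_bands: int) -> int:
    """Index of the equal-area band containing dec_deg (0 = southernmost)."""
    z = math.sin(math.radians(dec_deg))
    return min(int((z + 1) / 2 * n_bands), n_bands - 1)

edges = equal_area_dec_edges(4)
print([round(e, 1) for e in edges])
print(band_index(39.13, 4))  # band holding the reported median declination
```

Counting planets per band then gives a density per unit area that is comparable across bands, which raw equal-width declination bins would not.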
sy_dist
numeric · feature · high_skew · outliers
Likely the distance to the planetary system (sy_dist) in parsecs, with 4,397 unique values across 6,150 rows. The distribution is heavily right-skewed (skew 3.97, kurtosis 17.0): the median (377.06) sits well below the mean (713.31), and values span 1.30 to 8,340 with 321 outliers (5.3% of non-null values). The null rate is low at 2.07% and there are no zeros. Treatment: log-transform before modelling to tame the right skew and outliers.
- n: 6,150
- nulls: 127 (2.1%)
- unique: 4,397
- min: 1.301
- max: 8,340
- mean: 713.3
- median: 377.1
- std: 1212
- q1: 100.3
- q3: 836.7
- iqr: 736.4
- skew: 3.967
- kurtosis: 17.02
- n_outliers: 321
- outlier_rate: 0.0533
- zero_rate: 0
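The report does not state how outliers are flagged. One common convention is Tukey's 1.5 × IQR fences, and the sketch below applies that rule to the quartiles listed above. Treat the rule itself as an assumption, not a description of the profiler's method.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper outlier fences computed from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for sy_dist above.
lower, upper = tukey_fences(q1=100.3, q3=836.7)
print(round(lower, 1), round(upper, 1))
```

Under this rule the lower fence is negative, so with a minimum of 1.301 every flagged value would sit in the high tail, beyond roughly 1,941 pc.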
pl_orbper
numeric · feature · high_skew · outliers
This is almost certainly the planetary orbital period (likely in days), with 5,791 unique values across 6,150 rows and a 5.63% null rate. The distribution is wildly right-skewed: the median is 11.13 while the mean is 4,469.34 and the max reaches 8,040,000, producing a skew of 43.8 and kurtosis near 1,970. About 17.4% of non-null values (1,012) are flagged as outliers, consistent with a mix of short-period close-in planets and extreme long-period objects. Treatment: log-transform before modelling and impute the ~5.6% nulls.
- n: 6,150
- nulls: 346 (5.6%)
- unique: 5,791
- min: 0.09071
- max: 8.04e+06
- mean: 4469
- median: 11.13
- std: 1.633e+05
- q1: 4.352
- q3: 39.69
- iqr: 35.34
- skew: 43.82
- kurtosis: 1970
- n_outliers: 1,012
- outlier_rate: 0.1744
- zero_rate: 0
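With a range this wide, equal-width bins put almost every record in the first bin, whereas bins by power of ten spread the values out. A minimal sketch with the standard library; `decade_counts` is a hypothetical helper, and the sample list reuses the summary values reported above plus one invented period (365.25) and one missing entry.

```python
import math
from collections import Counter

def decade_counts(values):
    """Count positive values per power-of-ten bin, skipping missing entries."""
    counts = Counter()
    for v in values:
        if v is None or v <= 0:
            continue  # log10 is undefined here; handle these rows separately
        counts[math.floor(math.log10(v))] += 1
    return dict(sorted(counts.items()))

# min, q1, median, q3, one invented value, a null, and the max (days).
periods = [0.09071, 4.352, 11.13, 39.69, 365.25, None, 8.04e6]
for exp, n in decade_counts(periods).items():
    print(f"1e{exp} to 1e{exp + 1} days: {n}")
```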
pl_orbsmax
numeric · feature · null_rate · high_skew · outliers
This is the planet's orbital semi-major axis (pl_orbsmax), a numeric astrophysical feature spanning 0.0044 to 19,000 with a median of 0.1159. These are typical AU-scale values, dominated by close-in planets but with extreme wide-orbit outliers. The skew of 34.66 and kurtosis of 1,394.96 are extraordinary, and 604 outliers (15.7% of non-null values) plus a 37.4% null rate make raw use risky. The mean (21.65) sits far above the third quartile (0.812), confirming that a handful of values dominate the scale. Treatment: log-transform and impute missing values before modelling.
- n: 6,150
- nulls: 2,301 (37.4%)
- unique: 2,292
- min: 0.0044
- max: 19,000
- mean: 21.65
- median: 0.1159
- std: 412.2
- q1: 0.0538
- q3: 0.812
- iqr: 0.7582
- skew: 34.66
- kurtosis: 1395
- n_outliers: 604
- outlier_rate: 0.1569
- zero_rate: 0
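Where a log transform is not wanted, percentile clipping is one way to keep a few wide-orbit values from dominating the scale. A minimal sketch assuming NumPy; `clip_to_percentiles` is a hypothetical helper, the percentile limits are arbitrary choices, and the input values are invented apart from the reported maximum.

```python
import numpy as np

def clip_to_percentiles(x, lo=1.0, hi=99.0):
    """Clip values to the [lo, hi] percentile range, leaving NaNs in place."""
    x = np.asarray(x, dtype=float)
    low, high = np.nanpercentile(x, [lo, hi])
    return np.clip(x, low, high)

# Invented semi-major axes in AU with one missing entry and the reported max.
a = np.array([0.02, 0.05, 0.1, 0.3, 0.8, 5.0, np.nan, 19000.0])
clipped = clip_to_percentiles(a, lo=0, hi=90)
print(np.nanmax(a), np.nanmax(clipped))
```

Clipping changes the data, so it suits plots and robust summaries better than physical modelling, where the wide-orbit planets may be the objects of interest.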
pl_bmassj
numeric · feature · null_rate · high_skew · outliers
This is the planet mass measured in Jupiter masses (pl_bmassj), a numeric astrophysical feature. Half the rows are null (50.3%) and the distribution is heavily right-skewed (skew 3.07, kurtosis 10.26): the median is 0.538 MJ but the mean is 2.42 MJ, and values stretch from 6.293e-05 up to 30.0, with 410 outliers (13.4% of non-null values). The dynamic range of more than five orders of magnitude is the dominant signal. Treatment: Log-transform before modelling and decide on an imputation or missing-indicator strategy for the 50% nulls.
- n: 6,150
- nulls: 3,091 (50.3%)
- unique: 1,989
- min: 6.293e-05
- max: 30
- mean: 2.417
- median: 0.538
- std: 4.706
- q1: 0.03744
- q3: 2.197
- iqr: 2.16
- skew: 3.075
- kurtosis: 10.26
- n_outliers: 410
- outlier_rate: 0.134
- zero_rate: 0
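The dynamic range can be computed directly from the reported minimum and maximum as the base-10 logarithm of their ratio. A short sketch with the standard library; `orders_of_magnitude` is a hypothetical helper name.

```python
import math

def orders_of_magnitude(lo: float, hi: float) -> float:
    """Number of decades between two positive values."""
    return math.log10(hi / lo)

# Reported min and max of pl_bmassj, in Jupiter masses.
span = orders_of_magnitude(6.293e-05, 30.0)
print(f"{span:.2f} decades")
```

The result is a little under 5.7 decades, which is why a linear axis cannot show both the lightest and the heaviest planets at once.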
pl_rade
numeric · feature · null_rate · high_skew · outliers
This is `pl_rade`, the planetary radius (in Earth radii) for confirmed exoplanets. Values span 0.31 to 87.21 with a median of 2.43, but heavy right skew (3.22) and extreme kurtosis (28.66) push the mean to 4.46 and flag 872 outliers (19.1% of non-null values). About 25.7% of rows are null, so a quarter of planets lack a measured radius. Treatment: Log-transform and impute or flag the 25.7% missing before modelling.
- n: 6,150
- nulls: 1,582 (25.7%)
- unique: 2,004
- min: 0.3098
- max: 87.21
- mean: 4.456
- median: 2.43
- std: 4.952
- q1: 1.62
- q3: 4.06
- iqr: 2.44
- skew: 3.218
- kurtosis: 28.66
- n_outliers: 872
- outlier_rate: 0.1909
- zero_rate: 0
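The "impute or flag" treatment can be done in one step: take the log, fill gaps with the median of the logged values, and keep a 0/1 column recording which rows were filled. A minimal sketch assuming pandas and NumPy; `log_impute_with_flag` is a hypothetical helper, median imputation is one choice among several, and the input radii are invented and chosen for easy logarithms.

```python
import numpy as np
import pandas as pd

def log_impute_with_flag(s: pd.Series) -> pd.DataFrame:
    """Return log10 values with median imputation plus a 0/1 missing indicator."""
    logged = np.log10(s)                     # NaN stays NaN
    filled = logged.fillna(logged.median())  # median skips NaN by default
    return pd.DataFrame({
        f"log10_{s.name}": filled,
        f"{s.name}_missing": s.isna().astype(int),
    })

# Invented radii in Earth radii; None marks an unmeasured planet.
radii = pd.Series([1.0, 10.0, None, 100.0], name="pl_rade")
out = log_impute_with_flag(radii)
print(out)
```

Keeping the indicator lets a model learn whether missingness itself carries signal, which is plausible here because radius is mostly measured for transiting planets.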
disc_year
numeric · timestamp
Discovery year for each record, ranging from 1992 to 2026 with a median of 2016 and an interquartile range of 2014 to 2021. The distribution is left-skewed (skew -0.69), reflecting that most discoveries cluster in recent years while a long tail of earlier years produces 109 outliers (1.8%). Only 34 distinct years appear across 6,150 rows, and nulls are negligible (1 row, 0.02%). Treatment: Treat as a discrete year; bin or use directly as an ordinal feature.
- n: 6,150
- nulls: 1 (0.0%)
- unique: 34
- min: 1992
- max: 2026
- mean: 2017
- median: 2016
- std: 4.965
- q1: 2014
- q3: 2021
- iqr: 7
- skew: -0.6885
- kurtosis: 1.262
- n_outliers: 109
- outlier_rate: 0.01773
- zero_rate: 0
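Treating the column as a discrete year means counting per calendar year instead of using equal-width bins. Forty equal-width bins over a 34-year span cannot align with integer years, so some empty bins in a histogram like that are likely binning artifacts, not years without discoveries. A minimal sketch with the standard library; `counts_per_year` is a hypothetical helper and the input years are invented.

```python
from collections import Counter

def counts_per_year(years):
    """Count records per calendar year, including years with zero records."""
    present = [int(y) for y in years if y is not None]
    tally = Counter(present)
    return {y: tally.get(y, 0) for y in range(min(present), max(present) + 1)}

# Invented discovery years; None stands in for the single null in the column.
years = [1992, 1992, 1994, 1995, None, 1995]
print(counts_per_year(years))
```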
discoverymethod
categorical · feature
Categorical label recording the technique used to detect each exoplanet, with 11 distinct methods across 6,150 rows and no nulls. The distribution is heavily concentrated: 'Transit' accounts for 73.4% of records (4,517) and 'Radial Velocity' for another 19.2% (1,182), leaving the remaining 9 methods as long-tail rarities that together cover 451 records (7.3%), down to a single 'Disk Kinematics' record. The entropy ratio of 0.34 confirms the imbalance. Treatment: One-hot encode, optionally collapsing the rare methods into an 'Other' bucket given the severe imbalance.
- n: 6,150
- nulls: 0 (0.0%)
- unique: 11
- top_value: Transit
- top_rate: 0.7345
- cardinality: 11
- entropy: 1.189
- entropy_ratio: 0.3436
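The entropy figures can be reproduced from the value counts in the discovery-method table: the reported values match Shannon entropy in bits, divided by log2 of the number of categories to get the ratio. The sketch below also shows one way to collapse rare methods before one-hot encoding. The helper names and the 1% threshold are assumptions for illustration, not settings taken from the report.

```python
import math

# Counts from the discovery-method table in this report.
counts = {
    "Transit": 4517, "Radial Velocity": 1182, "Microlensing": 275,
    "Imaging": 94, "Transit Timing Variations": 39,
    "Eclipse Timing Variations": 17, "Orbital Brightness Modulation": 9,
    "Pulsar Timing": 8, "Astrometry": 6,
    "Pulsation Timing Variations": 2, "Disk Kinematics": 1,
}

def entropy_bits(counts):
    """Shannon entropy of a count table, in bits."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values() if n)

def collapse_rare(counts, min_share=0.01, other="Other"):
    """Merge categories below min_share of the total into one bucket."""
    total = sum(counts.values())
    merged = {}
    for label, n in counts.items():
        key = label if n / total >= min_share else other
        merged[key] = merged.get(key, 0) + n
    return merged

h = entropy_bits(counts)
ratio = h / math.log2(len(counts))
print(round(h, 3), round(ratio, 3))
print(collapse_rare(counts))
```

With a 1% threshold, four methods keep their own category and the remaining seven merge into a single bucket, which keeps the one-hot matrix small without discarding rows.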