large meteorites
Reading
This dataset catalogues 4,871 large meteorites in 10 columns covering mass, location (latitude/longitude), discovery year, fall type, and classification. Mass is extremely right-skewed (skew ≈ 25; max 60,000,000 g against a median of 3,600 g) with 712 flagged outliers, so any mass-based analysis should use a log scale or trim the extremes. Fall_type is heavily imbalanced: 84.8% of records are 'Found' and 15.2% are 'Fell'. Meteorite_class is dominated by ordinary chondrites such as L6 (16.9%) and H5 (13.7%). Year is left-skewed, with most records in recent decades (median 1990, q1 1945) and a thin tail reaching back to 1399, reflecting a modern collection bias. The 'category' column is constant ('large_meteorites') and adds no signal.
citing: mass_g.stats.skew · mass_g.stats.median · mass_g.stats.max · mass_g.stats.n_outliers · fall_type.stats.top_rate · meteorite_class.stats.top_rate · meteorite_class.top_values · year.stats.median · year.stats.q1 · category.stats.top_rate · row_count
Charts the summary said to look at first

mass_g histogram (grams)

| bin | count |
|---|---|
| 1000 – 1.501e+06 | 4831 |
| 1.501e+06 – 3.001e+06 | 19 |
| 3.001e+06 – 4.501e+06 | 4 |
| 4.501e+06 – 6.001e+06 | 1 |
| 6.001e+06 – 7.501e+06 | 1 |
| 7.501e+06 – 9.001e+06 | 1 |
| 9.001e+06 – 1.05e+07 | 2 |
| 1.05e+07 – 1.2e+07 | 0 |
| 1.2e+07 – 1.35e+07 | 0 |
| 1.35e+07 – 1.5e+07 | 0 |
| 1.5e+07 – 1.65e+07 | 2 |
| 1.65e+07 – 1.8e+07 | 0 |
| 1.8e+07 – 1.95e+07 | 0 |
| 1.95e+07 – 2.1e+07 | 0 |
| 2.1e+07 – 2.25e+07 | 1 |
| 2.25e+07 – 2.4e+07 | 2 |
| 2.4e+07 – 2.55e+07 | 1 |
| 2.55e+07 – 2.7e+07 | 1 |
| 2.7e+07 – 2.85e+07 | 1 |
| 2.85e+07 – 3e+07 | 1 |
| 3e+07 – 3.15e+07 | 0 |
| 3.15e+07 – 3.3e+07 | 0 |
| 3.3e+07 – 3.45e+07 | 0 |
| 3.45e+07 – 3.6e+07 | 0 |
| 3.6e+07 – 3.75e+07 | 0 |
| 3.75e+07 – 3.9e+07 | 0 |
| 3.9e+07 – 4.05e+07 | 0 |
| 4.05e+07 – 4.2e+07 | 0 |
| 4.2e+07 – 4.35e+07 | 0 |
| 4.35e+07 – 4.5e+07 | 0 |
| 4.5e+07 – 4.65e+07 | 0 |
| 4.65e+07 – 4.8e+07 | 0 |
| 4.8e+07 – 4.95e+07 | 0 |
| 4.95e+07 – 5.1e+07 | 1 |
| 5.1e+07 – 5.25e+07 | 0 |
| 5.25e+07 – 5.4e+07 | 0 |
| 5.4e+07 – 5.55e+07 | 0 |
| 5.55e+07 – 5.7e+07 | 0 |
| 5.7e+07 – 5.85e+07 | 1 |
| 5.85e+07 – 6e+07 | 1 |

year histogram

| bin | count |
|---|---|
| 1399 – 1414 | 1 |
| 1414 – 1430 | 0 |
| 1430 – 1445 | 0 |
| 1445 – 1460 | 0 |
| 1460 – 1476 | 0 |
| 1476 – 1491 | 1 |
| 1491 – 1506 | 0 |
| 1506 – 1522 | 0 |
| 1522 – 1537 | 0 |
| 1537 – 1552 | 0 |
| 1552 – 1568 | 0 |
| 1568 – 1583 | 2 |
| 1583 – 1599 | 0 |
| 1599 – 1614 | 1 |
| 1614 – 1629 | 3 |
| 1629 – 1645 | 2 |
| 1645 – 1660 | 0 |
| 1660 – 1675 | 1 |
| 1675 – 1691 | 0 |
| 1691 – 1706 | 0 |
| 1706 – 1721 | 2 |
| 1721 – 1737 | 1 |
| 1737 – 1752 | 4 |
| 1752 – 1767 | 3 |
| 1767 – 1783 | 6 |
| 1783 – 1798 | 13 |
| 1798 – 1813 | 27 |
| 1813 – 1829 | 30 |
| 1829 – 1844 | 47 |
| 1844 – 1860 | 72 |
| 1860 – 1875 | 106 |
| 1875 – 1890 | 152 |
| 1890 – 1906 | 140 |
| 1906 – 1921 | 169 |
| 1921 – 1936 | 242 |
| 1936 – 1952 | 277 |
| 1952 – 1967 | 267 |
| 1967 – 1982 | 489 |
| 1982 – 1998 | 706 |
| 1998 – 2013 | 2053 |

fall_type value counts

| value | count | share |
|---|---|---|
| Found | 4129 | 84.8% |
| Fell | 742 | 15.2% |

meteorite_class top 20 values

| value | count | share |
|---|---|---|
| L6 | 825 | 16.9% |
| H5 | 665 | 13.7% |
| L5 | 408 | 8.4% |
| H4 | 349 | 7.2% |
| H6 | 309 | 6.3% |
| Iron, IIIAB | 232 | 4.8% |
| L4 | 157 | 3.2% |
| LL6 | 120 | 2.5% |
| LL5 | 93 | 1.9% |
| Iron, IIAB | 92 | 1.9% |
| Iron, ungrouped | 76 | 1.6% |
| Iron, IAB-MG | 75 | 1.5% |
| Iron, IVA | 59 | 1.2% |
| CV3 | 35 | 0.7% |
| H4/5 | 33 | 0.7% |
| Iron, IAB complex | 32 | 0.7% |
| Pallasite, PMG | 31 | 0.6% |
| Iron | 30 | 0.6% |
| Iron, IAB-ung | 30 | 0.6% |
| LL4 | 29 | 0.6% |

latitude histogram (decimal degrees)

| bin | count |
|---|---|
| -87.37 – -83.15 | 159 |
| -83.15 – -78.94 | 55 |
| -78.94 – -74.73 | 126 |
| -74.73 – -70.51 | 158 |
| -70.51 – -66.3 | 1 |
| -66.3 – -62.09 | 0 |
| -62.09 – -57.87 | 0 |
| -57.87 – -53.66 | 1 |
| -53.66 – -49.45 | 0 |
| -49.45 – -45.23 | 3 |
| -45.23 – -41.02 | 10 |
| -41.02 – -36.81 | 22 |
| -36.81 – -32.59 | 53 |
| -32.59 – -28.38 | 140 |
| -28.38 – -24.17 | 112 |
| -24.17 – -19.95 | 59 |
| -19.95 – -15.74 | 27 |
| -15.74 – -11.53 | 13 |
| -11.53 – -7.313 | 16 |
| -7.313 – -3.1 | 18 |
| -3.1 – 1.113 | 355 |
| 1.113 – 5.327 | 9 |
| 5.327 – 9.54 | 15 |
| 9.54 – 13.75 | 35 |
| 13.75 – 17.97 | 29 |
| 17.97 – 22.18 | 636 |
| 22.18 – 26.39 | 125 |
| 26.39 – 30.61 | 478 |
| 30.61 – 34.82 | 401 |
| 34.82 – 39.03 | 368 |
| 39.03 – 43.25 | 252 |
| 43.25 – 47.46 | 165 |
| 47.46 – 51.67 | 137 |
| 51.67 – 55.89 | 120 |
| 55.89 – 60.1 | 40 |
| 60.1 – 64.31 | 17 |
| 64.31 – 68.53 | 12 |
| 68.53 – 72.74 | 3 |
| 72.74 – 76.95 | 3 |
| 76.95 – 81.17 | 1 |
Schema
10 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| name | text | 0.0% | 4,871 | near_unique, one_word |
| mass_g | numeric | 0.0% | 2,850 | high_skew, outliers |
| meteorite_class | categorical | 0.0% | 261 | |
| fall_type | categorical | 0.0% | 2 | |
| latitude | numeric | 14.3% | 2,861 | outliers |
| longitude | numeric | 14.3% | 3,122 | |
| date | categorical | 1.1% | 243 | |
| year | numeric | 1.1% | 243 | high_skew |
| category | categorical | 0.0% | 1 | imbalance |
| description | text | 0.0% | 4,871 | near_unique |

name
text · identifier · near_unique · one_word
The `name` column holds a unique short label for each of the 4,871 rows (n_unique == n, duplicate_rate 0.0), averaging 13.8 characters and 2.2 words. Top tokens such as 'africa', 'northwest', 'dhofar', 'jiddat' and 'harasis' strongly suggest these are meteorite designations built from find-site names (e.g. 'Northwest Africa', 'Dhofar', 'Jiddat al Harasis'). About 32% are single-word names, and the vocabulary of 4,632 tokens across 4,871 names is consistent with heavy reuse of regional prefixes with numeric suffixes.
Treatment: Treat as a unique row identifier; drop from modelling or parse out the regional prefix as a categorical feature.
- n
- 4,871
- nulls
- 0 (0.0%)
- unique
- 4,871
- len_min
- 3
- len_max
- 28
- len_mean
- 13.8
- len_median
- 12
- len_p95
- 21
- word_mean
- 2.215
- word_median
- 2
- n_empty
- 0
- n_duplicates
- 0
- duplicate_rate
- 0
- vocab_size
- 4,632
- readability_flesch_mean
- 52.55
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.3221
- allcaps_rate
- 0
- boilerplate_rate
- 0
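
The prefix-parsing option in the treatment above could look roughly like the sketch below. It assumes names follow a "find-site prefix plus optional trailing number" pattern; the helper `regional_prefix` and the sample names (including their numeric suffixes) are illustrative and not taken from the dataset.

```python
import re

# Hypothetical helper: strip a trailing numeric designation (e.g. " 1234")
# so that names sharing a find region collapse to one categorical value.
_TRAILING_NUMBER = re.compile(r"\s+\d+\s*$")

def regional_prefix(name: str) -> str:
    """Return the name without its trailing number, if it has one."""
    return _TRAILING_NUMBER.sub("", name).strip()

# Illustrative names only; the numeric suffixes are made up for this example.
samples = ["Northwest Africa 1234", "Dhofar 56", "Jiddat al Harasis 7", "Singlename"]
prefixes = [regional_prefix(s) for s in samples]
print(prefixes)
```

Names without a numeric suffix pass through unchanged, so single-word names would each remain their own category and may still need an 'other' bucket.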
mass_g
numeric · feature · high_skew · outliers
This is the recorded mass in grams for 4,871 specimens, with 2,850 unique values and no nulls. The distribution is extraordinarily right-skewed (skew 25.1, kurtosis 723.6): the median is just 3,600 g while the mean is 123,403 g and the max reaches 60,000,000 g, producing 712 outliers (14.6% of rows). The standard deviation of 1,755,278 g dwarfs the IQR of 10,313 g, indicating that a handful of enormous specimens dominate the tail.
Treatment: Log-transform before any modelling and consider winsorising the upper tail.
- n
- 4,871
- nulls
- 0 (0.0%)
- unique
- 2,850
- min
- 1,000
- max
- 6e+07
- mean
- 1.234e+05
- median
- 3,600
- std
- 1.755e+06
- q1
- 1,686
- q3
- 12,000
- iqr
- 1.031e+04
- skew
- 25.12
- kurtosis
- 723.6
- n_outliers
- 712
- outlier_rate
- 0.1462
- zero_rate
- 0
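
As a rough illustration of the treatment above, the sketch below log-transforms and then caps a few masses. The inputs are the summary points reported in this profile (min, q1, median, q3, max); the helper names and the 1,000,000 g cap are assumptions chosen for the example, not values recommended by the report.

```python
import math

def log10_mass(values):
    """Log-transform strictly positive masses (the column minimum is 1,000 g)."""
    return [math.log10(v) for v in values]

def winsorise_upper(values, cap):
    """Clamp values above `cap` down to `cap`; smaller values are untouched."""
    return [min(v, cap) for v in values]

# min, q1, median, q3, max as reported for mass_g.
profile_points = [1_000, 1_686, 3_600, 12_000, 60_000_000]

logged = log10_mass(profile_points)
print([round(x, 2) for x in logged])

# Hypothetical cap; in practice it would come from a chosen upper percentile.
capped = winsorise_upper(profile_points, cap=1_000_000)
print(capped)
```

On the log10 scale the five summary points run from 3.0 to roughly 7.78, so the largest specimen no longer sits four orders of magnitude away from the median.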
meteorite_class
categorical · feature
This column holds meteorite classification codes (e.g., L6, H5, 'Iron, IIIAB'), a standard taxonomy for stony and iron meteorites. With 261 distinct classes across 4,871 rows and an entropy ratio of 0.65, the distribution is moderately diverse but heavily concentrated: the top class L6 covers 16.9%, and the five most common ordinary chondrite classes (L6, H5, L5, H4, H6) together account for 2,556 rows (52.5%). There are no nulls, but the long tail will be sparse: each of the 241 classes outside the top 20 covers at most 0.6% of rows.
Treatment: Group rare classes into an 'other' bucket or roll up to parent groups (H/L/LL/Iron) before one-hot encoding.
- n
- 4,871
- nulls
- 0 (0.0%)
- unique
- 261
- top_value
- L6
- top_rate
- 0.1694
- cardinality
- 261
- entropy
- 5.197
- entropy_ratio
- 0.6474
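
The roll-up to parent groups mentioned in the treatment could be sketched as follows. The mapping rule in `parent_group` is an assumption made for illustration (prefix matching on 'Iron' and on H/L/LL followed by a digit); everything else, including pallasites and carbonaceous classes, lands in 'Other'. The counts are a subset of the top-values table above.

```python
import re
from collections import Counter

def parent_group(cls: str) -> str:
    """Map a class label to a coarse group; unmatched labels become 'Other'."""
    if cls.startswith("Iron"):
        return "Iron"
    # Ordinary chondrites: group letters followed by a petrologic type digit.
    # 'LL' is listed first so it is not mistaken for 'L'.
    match = re.match(r"^(LL|L|H)\d", cls)
    if match:
        return match.group(1)
    return "Other"

# Counts copied from the top-values table (a subset of the 261 classes).
top_counts = {
    "L6": 825, "H5": 665, "L5": 408, "H4": 349, "H6": 309,
    "Iron, IIIAB": 232, "L4": 157, "LL6": 120, "LL5": 93, "CV3": 35,
}

grouped = Counter()
for cls, n in top_counts.items():
    grouped[parent_group(cls)] += n
print(dict(grouped))
```

A coarser or finer taxonomy may suit a given analysis better; the point is only that a handful of parent groups replaces hundreds of sparse one-hot columns.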
fall_type
categorical · feature
This is a binary categorical flag distinguishing meteorites that were 'Found' from those observed to have 'Fell'. The class is heavily imbalanced: 'Found' accounts for 4,129 of 4,871 records (top_rate 0.8477) while 'Fell' makes up the remaining 742, with no nulls. Entropy of 0.6156 confirms the skew, but the column is fully populated and clean.
Treatment: Encode as a binary indicator; account for class imbalance if used as a stratifier or target.
- n
- 4,871
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Found
- top_rate
- 0.8477
- cardinality
- 2
- entropy
- 0.6156
- entropy_ratio
- 0.6156
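
The entropy figures above can be checked by hand from the value counts. The sketch below assumes the profile reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2 of the cardinality; neither definition is stated in the report, but both agree with the reported numbers to within rounding.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Found / Fell counts from the fall_type table.
entropy = shannon_entropy_bits([4129, 742])

# Assumed definition: entropy divided by its maximum, log2(number of categories).
ratio = entropy / math.log2(2)
print(entropy, ratio)

# Same assumed definition applied to meteorite_class (entropy 5.197, 261 classes).
class_ratio = 5.197 / math.log2(261)
print(class_ratio)
```

For a binary column the maximum entropy is exactly 1 bit, which is why entropy and entropy_ratio coincide for fall_type.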
latitude
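numeric · feature · outliers
Geographic latitude in decimal degrees, spanning -87.37 to 81.17 with a median of 24.00, consistent with worldwide coverage. The distribution is left-skewed (skew -1.23) with 500 flagged outliers (12.0% of non-null values) and an 8.4% zero rate that may indicate placeholder or equator-rounded values. The outlier count equals the 500 records in the histogram bins south of -53.66, which is what a 1.5 × IQR lower fence (0 - 1.5 × 35.66 ≈ -53.5) would flag, so these look like high-southern-latitude (Antarctic) records rather than data errors. Roughly 14.3% of rows are null, so geographic analyses will lose a meaningful slice.
Treatment: Pair with longitude for geo-features; investigate the 8.4% zeros for sentinel encoding before modelling.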
- n
- 4,871
- nulls
- 697 (14.3%)
- unique
- 2,861
- min
- -87.37
- max
- 81.17
- mean
- 9.763
- median
- 24
- std
- 39.21
- q1
- 0
- q3
- 35.66
- iqr
- 35.66
- skew
- -1.228
- kurtosis
- 0.4314
- n_outliers
- 500
- outlier_rate
- 0.1198
- zero_rate
- 0.08385
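
To see where the 500 latitude outliers come from, the sketch below computes outlier fences from the reported quartiles and tallies the southernmost histogram bins. It assumes the profiler uses the common 1.5 × IQR rule, which the report does not state; `tukey_fences` is a name introduced here.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for latitude.
lower, upper = tukey_fences(0.0, 35.66)
print(lower, upper)

# (upper bin edge, count) copied from the southernmost rows of the
# latitude histogram.
southern_bins = [(-83.15, 159), (-78.94, 55), (-74.73, 126),
                 (-70.51, 158), (-66.3, 1), (-62.09, 0),
                 (-57.87, 0), (-53.66, 1), (-49.45, 0)]

# Count records in bins that lie entirely below the lower fence.
below_fence = sum(n for edge, n in southern_bins if edge <= lower)
print(below_fence)
```

The upper fence falls above the column maximum of 81.17, so under this assumption every flagged value sits at the southern end of the range.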
longitude
numeric · feature
Geographic longitude in decimal degrees, spanning -165.43 to 178.2 with median 12.16 and an IQR of 133.79, consistent with worldwide coverage. The distribution is nearly symmetric (skew 0.14) and platykurtic (kurtosis -0.85), with no flagged outliers. Notable concerns: 14.31% of rows are null and 8.36% of values are exactly zero, which often indicates missing coordinates encoded as 0 rather than genuine points on the prime meridian. Latitude shows a nearly identical zero rate (8.4%), which supports the placeholder reading.
Treatment: Treat zeros as missing, then pair with latitude for geospatial features.
- n
- 4,871
- nulls
- 697 (14.3%)
- unique
- 3,122
- min
- -165.4
- max
- 178.2
- mean
- 7.871
- median
- 12.16
- std
- 82.3
- q1
- -78.08
- q3
- 55.71
- iqr
- 133.8
- skew
- 0.1374
- kurtosis
- -0.845
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.08361
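
One cautious way to apply the zero-handling treatment is sketched below. It is a stricter variant than "treat zeros as missing": it blanks a coordinate pair only when latitude and longitude are both exactly 0, on the assumption that (0, 0) is the placeholder and that a lone zero could be a real equatorial or prime-meridian location. The helper `clean_coord` and the sample rows are illustrative.

```python
from typing import Optional, Tuple

Coord = Tuple[Optional[float], Optional[float]]

def clean_coord(lat: Optional[float], lon: Optional[float]) -> Coord:
    """Return (None, None) when either value is missing or both are exactly 0."""
    if lat is None or lon is None:
        return (None, None)
    if lat == 0.0 and lon == 0.0:
        # Assumption: (0, 0) is a placeholder, not a real find site.
        return (None, None)
    return (lat, lon)

# Illustrative rows, not taken from the dataset.
rows = [(24.0, 12.16), (0.0, 0.0), (None, None), (0.0, 35.5)]
cleaned = [clean_coord(lat, lon) for lat, lon in rows]
print(cleaned)
```

Whether the stricter rule is appropriate depends on how many rows have exactly one zero coordinate, which the profile does not report.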
date
categorical · timestamp
This column holds ISO-format dates stored as strings, but every observed value falls on January 1st of a year, suggesting year-only granularity padded to a date. Across 4,871 rows there are 243 distinct values with a 1.11% null rate, and the most common entry '2000-01-01' accounts for just 4.88% of records, giving a fairly flat distribution (entropy ratio 0.84). The concentration of top values in the early 2000s hints at a temporal bias in the dataset.
Treatment: Parse to datetime and extract year as the effective granularity before modelling.
- n
- 4,871
- nulls
- 54 (1.1%)
- unique
- 243
- top_value
- 2000-01-01
- top_rate
- 0.04879
- cardinality
- 243
- entropy
- 6.668
- entropy_ratio
- 0.8414
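
A minimal sketch of the parse-and-extract step, using only the standard library. It assumes every non-null value is a plain 'YYYY-MM-DD' string and that the earliest year in the profile (1399) appears in this column in the same format; `year_from_iso` is a name introduced for the example.

```python
from datetime import date
from typing import Optional

def year_from_iso(value: Optional[str]) -> Optional[int]:
    """Parse a 'YYYY-MM-DD' string and return its year; missing stays None."""
    if value is None or value == "":
        return None
    return date.fromisoformat(value).year

# '2000-01-01' is the top value reported above; '1399-01-01' is an assumed
# rendering of the earliest year in the dataset.
years = [year_from_iso(v) for v in ["2000-01-01", "1399-01-01", None]]
print(years)
```

The standard-library `date` type handles medieval years without trouble. Timestamp types with nanosecond resolution, such as the pandas default, cannot represent dates before 1677, so a straight conversion there may need a different approach for the oldest records.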
year
numeric · feature · high_skew
This column holds the year associated with each record (the discovery year, per the dataset summary), ranging from 1399 to 2013 with a median of 1990 and an IQR spanning 1945 to 2002. The distribution is heavily left-skewed (skew -2.27, kurtosis 9.38) with 216 outliers (4.48% of non-null values) trailing into earlier centuries. That count matches the 216 records in the histogram bins before about 1860, just where a 1.5 × IQR lower fence (1945 - 1.5 × 57 = 1859.5) would fall, so the outliers are simply the oldest records and may be genuine historical entries rather than errors. Null rate is low at 1.11% across 4,871 rows.
Treatment: Inspect pre-1860 outliers for data-entry errors; consider clipping or treating as ordinal before modelling.
- n
- 4,871
- nulls
- 54 (1.1%)
- unique
- 243
- min
- 1399
- max
- 2013
- mean
- 1968
- median
- 1990
- std
- 51.83
- q1
- 1945
- q3
- 2002
- iqr
- 57
- skew
- -2.271
- kurtosis
- 9.377
- n_outliers
- 216
- outlier_rate
- 0.04484
- zero_rate
- 0
category
categorical · metadata · imbalance
This is a constant categorical column where every one of the 4,871 rows holds the value "large_meteorites". Cardinality is 1, entropy is 0, and the top rate is 1.0, so the field carries no information for modelling or segmentation. It likely encodes the dataset's provenance or subset label rather than a row-level attribute.
Treatment: Drop before modelling; retain only as a dataset-level tag.
- n
- 4,871
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- large_meteorites
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
description
text · free_text · near_unique
Short, templated descriptive sentences about meteorites. Every one of the 4,871 rows is unique yet contains the tokens 'meteorite', '-', and 'mass:', indicating a generated string like 'Northwest Africa meteorite - mass: ... found.' Lengths are tight (39 to 82 characters, mean 53.6) and there are no nulls, duplicates, URLs, or emoji. Despite being unique per row, the vocabulary is small (6,296 tokens) and dominated by classification tokens (l6., h5., iron,) plus locality terms, suggesting a synthesised summary rather than free-form prose.
Treatment: Parse out the structured fields (class, locality, mass) instead of embedding the raw string.
- n
- 4,871
- nulls
- 0 (0.0%)
- unique
- 4,871
- len_min
- 39
- len_max
- 82
- len_mean
- 53.59
- len_median
- 54
- len_p95
- 65
- word_mean
- 8.396
- word_median
- 8
- n_empty
- 0
- n_duplicates
- 0
- duplicate_rate
- 0
- vocab_size
- 6,296
- readability_flesch_mean
- 58.7
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0
- allcaps_rate
- 0
- boilerplate_rate
- 0