wild nasa meteorites
Reading
This is a NASA meteorites dataset with 45,716 records and 20 columns covering each meteorite's name, classification, mass, fall type, year, and geographic coordinates. The most interesting signals are physical and categorical: mass (g) is extremely skewed (mean ~13,278g vs median 32.6g, max 60,000,000g) with ~15.5% of values flagged as outliers, and recclass is dominated by ordinary chondrites (L6 at 18.1%, followed by H5, L5, H6, H4). The fall column is heavily imbalanced, with 97.6% 'Found' vs 2.4% 'Fell', and year is concentrated in recent decades, peaking at 2003 (3,323 records). Note that Counties and States are 96% null, several columns (created_at, updated_at, position, meta) are constant and can be ignored, and 55% of non-null GeoLocation values are duplicates, led by the placeholder coordinate [None, 0.0, 0.0, None, False], which appears 6,214 times.
citing: mass (g) · recclass · fall · year · nametype · Counties · States · GeoLocation · created_at · position · meta
Charts the summary said to look at first
mass (g): histogram bin counts
| bin | count |
|---|---|
| 0 – 1.5e+06 | 45544 |
| 1.5e+06 – 3e+06 | 16 |
| 3e+06 – 4.5e+06 | 8 |
| 4.5e+06 – 6e+06 | 1 |
| 6e+06 – 7.5e+06 | 1 |
| 7.5e+06 – 9e+06 | 1 |
| 9e+06 – 1.05e+07 | 2 |
| 1.05e+07 – 1.2e+07 | 0 |
| 1.2e+07 – 1.35e+07 | 0 |
| 1.35e+07 – 1.5e+07 | 0 |
| 1.5e+07 – 1.65e+07 | 2 |
| 1.65e+07 – 1.8e+07 | 0 |
| 1.8e+07 – 1.95e+07 | 0 |
| 1.95e+07 – 2.1e+07 | 0 |
| 2.1e+07 – 2.25e+07 | 1 |
| 2.25e+07 – 2.4e+07 | 1 |
| 2.4e+07 – 2.55e+07 | 2 |
| 2.55e+07 – 2.7e+07 | 1 |
| 2.7e+07 – 2.85e+07 | 1 |
| 2.85e+07 – 3e+07 | 0 |
| 3e+07 – 3.15e+07 | 1 |
| 3.15e+07 – 3.3e+07 | 0 |
| 3.3e+07 – 3.45e+07 | 0 |
| 3.45e+07 – 3.6e+07 | 0 |
| 3.6e+07 – 3.75e+07 | 0 |
| 3.75e+07 – 3.9e+07 | 0 |
| 3.9e+07 – 4.05e+07 | 0 |
| 4.05e+07 – 4.2e+07 | 0 |
| 4.2e+07 – 4.35e+07 | 0 |
| 4.35e+07 – 4.5e+07 | 0 |
| 4.5e+07 – 4.65e+07 | 0 |
| 4.65e+07 – 4.8e+07 | 0 |
| 4.8e+07 – 4.95e+07 | 0 |
| 4.95e+07 – 5.1e+07 | 1 |
| 5.1e+07 – 5.25e+07 | 0 |
| 5.25e+07 – 5.4e+07 | 0 |
| 5.4e+07 – 5.55e+07 | 0 |
| 5.55e+07 – 5.7e+07 | 0 |
| 5.7e+07 – 5.85e+07 | 1 |
| 5.85e+07 – 6e+07 | 1 |
recclass: top 20 values
| value | count | share |
|---|---|---|
| L6 | 8285 | 18.1% |
| H5 | 7142 | 15.6% |
| L5 | 4796 | 10.5% |
| H6 | 4528 | 9.9% |
| H4 | 4211 | 9.2% |
| LL5 | 2766 | 6.1% |
| LL6 | 2043 | 4.5% |
| L4 | 1253 | 2.7% |
| H4/5 | 428 | 0.9% |
| CM2 | 416 | 0.9% |
| H3 | 386 | 0.8% |
| L3 | 365 | 0.8% |
| CO3 | 335 | 0.7% |
| Ureilite | 300 | 0.7% |
| Iron, IIIAB | 285 | 0.6% |
| LL4 | 268 | 0.6% |
| CV3 | 256 | 0.6% |
| Diogenite | 241 | 0.5% |
| Howardite | 240 | 0.5% |
| LL | 225 | 0.5% |
fall: value counts
| value | count | share |
|---|---|---|
| Found | 44609 | 97.6% |
| Fell | 1107 | 2.4% |
year: top 20 values
| value | count | share |
|---|---|---|
| 2003-01-01T00:00:00 | 3323 | 7.3% |
| 1979-01-01T00:00:00 | 3046 | 6.7% |
| 1998-01-01T00:00:00 | 2697 | 5.9% |
| 2006-01-01T00:00:00 | 2456 | 5.4% |
| 1988-01-01T00:00:00 | 2296 | 5.0% |
| 2002-01-01T00:00:00 | 2078 | 4.5% |
| 2004-01-01T00:00:00 | 1940 | 4.2% |
| 2000-01-01T00:00:00 | 1792 | 3.9% |
| 1997-01-01T00:00:00 | 1696 | 3.7% |
| 1999-01-01T00:00:00 | 1691 | 3.7% |
| 2001-01-01T00:00:00 | 1650 | 3.6% |
| 1990-01-01T00:00:00 | 1518 | 3.3% |
| 2009-01-01T00:00:00 | 1497 | 3.3% |
| 1986-01-01T00:00:00 | 1375 | 3.0% |
| 2007-01-01T00:00:00 | 1189 | 2.6% |
| 2010-01-01T00:00:00 | 1005 | 2.2% |
| 1993-01-01T00:00:00 | 979 | 2.1% |
| 2008-01-01T00:00:00 | 957 | 2.1% |
| 1987-01-01T00:00:00 | 916 | 2.0% |
| 1991-01-01T00:00:00 | 877 | 1.9% |
nametype: value counts
| value | count | share |
|---|---|---|
| Valid | 45641 | 99.8% |
| Relict | 75 | 0.2% |
Schema
20 columns

| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| sid | text | 0.0% | 45,716 | near_unique, one_word, short_text |
| id | text | 0.0% | 45,716 | near_unique, one_word, allcaps |
| position | numeric | 0.0% | 1 | constant |
| created_at | numeric | 0.0% | 1 | constant |
| created_meta | unknown | 0.0% | n/a | skipped |
| updated_at | numeric | 0.0% | 1 | constant |
| updated_meta | unknown | 0.0% | n/a | skipped |
| meta | categorical | 0.0% | 1 | imbalance |
| name | text | 0.0% | 45,716 | near_unique |
| id_1 | numeric | 0.0% | 45,716 | |
| nametype | categorical | 0.0% | 2 | imbalance |
| recclass | categorical | 0.0% | 466 | |
| mass (g) | numeric | 0.3% | 12,576 | high_skew, outliers |
| fall | categorical | 0.0% | 2 | imbalance |
| year | categorical | 0.6% | 266 | |
| reclat | numeric | 16.0% | 12,738 | |
| reclong | numeric | 16.0% | 14,640 | |
| GeoLocation | text | 16.0% | 17,100 | duplicates |
| States | numeric | 96.4% | 45 | null_rate |
| Counties | numeric | 96.4% | 662 | null_rate |
sid
text identifier near_unique one_word short_text
This is a synthetic row identifier: every one of the 45,716 values is unique, exactly 18 characters long, single-token, and follows a 'row-xxxx-xxxx-xxxx' pattern. There are no nulls, duplicates, or empties, so it functions as a primary key rather than a feature. Treatment: Drop from modelling; retain only as a join key or row index.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 45,716
- len_min: 18
- len_max: 18
- len_mean: 18
- len_median: 18
- len_p95: 18
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: -5.68
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
id
text identifier near_unique one_word allcaps
This column is a row identifier holding 36-character UUID-style strings, all uppercase and one token wide. Every one of the 45,716 values is unique with zero nulls or duplicates, and length is fixed at exactly 36 characters across min, median, and max. The shared `00000000-0000-0000-` prefix on all sampled values is notable: only the latter half of each UUID varies, suggesting a namespaced or truncated-entropy ID scheme rather than fully random v4 UUIDs. Treatment: Drop from modelling features; retain only as a join key or row reference.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 45,716
- len_min: 36
- len_max: 36
- len_mean: 36
- len_median: 36
- len_p95: 36
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: 65.38
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
position
numeric other constant
The column 'position' is numeric but holds a single value across all 45,716 rows: every entry is 0, giving a zero_rate of 1.0 and n_unique of 1. With zero variance (std 0.0, iqr 0.0), it carries no information for any downstream task. Treatment: Drop; constant column with no variance.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 1
- min: 0
- max: 0
- mean: 0
- median: 0
- std: 0
- q1: 0
- q3: 0
- iqr: 0
- skew: 0
- kurtosis: 0
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 1
created_at
numeric timestamp constant
This column appears to be a Unix epoch creation timestamp (1446143734 corresponds to a single moment in late 2015), stored as a numeric value. Across all 45,716 rows it holds exactly one value, with std 0.0 and n_unique 1, so it carries no information to differentiate records. The 'constant' alert confirms there is no variation to model or filter on. Treatment: Drop; constant column adds no signal.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 1
- min: 1.446e+09
- max: 1.446e+09
- mean: 1.446e+09
- median: 1.446e+09
- std: 0
- q1: 1.446e+09
- q3: 1.446e+09
- iqr: 0
- skew: 0
- kurtosis: 0
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
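The epoch reading of created_at can be checked with the standard library. This is a minimal sketch; it assumes the value is in seconds since 1970-01-01 UTC, which the profile infers but does not confirm. The same check applies to updated_at, which holds the identical value.

```python
from datetime import datetime, timezone

# Assumption: created_at is a Unix timestamp in seconds (UTC).
CREATED_AT = 1446143734

# Decode the single constant value into a timezone-aware datetime.
created = datetime.fromtimestamp(CREATED_AT, tz=timezone.utc)
print(created.isoformat())
```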
created_meta
unknown metadata skipped
The column `created_meta` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 45,716 and a null_rate of 0.0. The name suggests it carries creation-time metadata (e.g., a user id or system tag attached to record creation), but this cannot be confirmed from the evidence. No further signal is present to assess distribution, uniqueness, or drift. Treatment: Re-profile with parsing enabled before deciding; otherwise drop until contents are characterised.
- n: 45,716
- nulls: 0 (0.0%)
- unique: n/a
updated_at
numeric timestamp constant
This column is almost certainly a Unix epoch timestamp recording a row update time, with the single value 1446143734 (late 2015) repeated across all 45,716 rows. With n_unique=1, std=0, and identical min/median/max, it carries no information: every record was stamped at the same instant, suggesting a bulk export or a field that was never actually updated per-row. Treatment: Drop; constant column provides no signal.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 1
- min: 1.446e+09
- max: 1.446e+09
- mean: 1.446e+09
- median: 1.446e+09
- std: 0
- q1: 1.446e+09
- q3: 1.446e+09
- iqr: 0
- skew: 0
- kurtosis: 0
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
updated_meta
unknown metadata skipped
The column `updated_meta` was skipped by the profiler, so no type inference, uniqueness count, or value statistics are available. The only confirmed signals are 45,716 rows with a null rate of 0.0, but the actual content and structure remain uncharacterised. The name suggests it may hold update-related metadata (e.g., a timestamp, user, or nested struct), yet this is not supported by evidence. Treatment: Re-profile with an appropriate parser before deciding; do not feed into modelling until its type is known.
- n: 45,716
- nulls: 0 (0.0%)
- unique: n/a
meta
categorical metadata imbalance
This 'meta' column is a constant placeholder: every one of the 45,716 rows holds the same '{ }' value, giving a cardinality of 1 and entropy of 0. There is no information to extract here; it is likely a vestigial JSON metadata field that was never populated. Treatment: Drop; the column is constant and carries zero signal.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 1
- top_value: { }
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
name
text identifier near_unique
This is a short text column of meteorite names, which follow the place or geographic feature where each specimen was recovered. Every one of the 45,716 rows is unique with zero nulls, averaging 17.8 characters and 2.8 words. Top tokens like 'yamato', 'range', 'northwest', 'hills', 'mountains', 'queen alexandra', and 'grove' reflect that toponymic naming (mountain ranges, hills, regions). With n_unique equal to n, it functions as an identifier rather than a categorical feature. Treatment: Drop for modelling; retain as a label/key for joins or display.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 45,716
- len_min: 2
- len_max: 28
- len_mean: 17.78
- len_median: 19
- len_p95: 27
- word_mean: 2.772
- word_median: 3
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 17,917
- readability_flesch_mean: 63.74
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.04749
- allcaps_rate: 0
- boilerplate_rate: 0
id_1
numeric identifier
id_1 is almost certainly a row identifier: 45,716 unique values across 45,716 rows, no nulls, ranging from 1 to 57,458 with a near-uniform spread (kurtosis -1.16, mild skew 0.27). The fact that the max (57,458) exceeds the row count suggests gaps in the sequence, consistent with a primary key carried over from a larger source table. Treatment: Drop from modelling; retain only for joins or row tracing.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 45,716
- min: 1
- max: 57,458
- mean: 2.689e+04
- median: 2.426e+04
- std: 1.686e+04
- q1: 1.269e+04
- q3: 4.066e+04
- iqr: 27,968
- skew: 0.2665
- kurtosis: -1.16
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
nametype
categorical feature imbalance
This is a binary categorical flag distinguishing meteorite name types, with values 'Valid' and 'Relict'. The distribution is extremely lopsided: 45,641 of 45,716 rows (99.84%) are 'Valid' and only 75 are 'Relict', yielding an entropy ratio of just 0.018. With effectively no variation, this column carries almost no information for modelling. Treatment: Drop or retain only as a rare-event indicator; near-constant for modelling.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 2
- top_value: Valid
- top_rate: 0.9984
- cardinality: 2
- entropy: 0.01754
- entropy_ratio: 0.01754
recclass
categorical label
This column holds meteorite classification codes (recclass), with 466 distinct classes across 45,716 records and no nulls. The distribution is dominated by ordinary chondrites: L6 (18.1%), H5, L5, H6, and H4 together account for about 63% of records, while the long tail (entropy ratio 0.51) thins out quickly: CM2, the tenth most common class, has only 416 entries (0.9%). High cardinality combined with concentrated top categories suggests a classic taxonomic hierarchy (H/L/LL groups with petrologic types). Treatment: Group rare classes into an 'other' bucket or roll up to parent groups (H/L/LL/C) before modelling.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 466
- top_value: L6
- top_rate: 0.1812
- cardinality: 466
- entropy: 4.548
- entropy_ratio: 0.5131
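The roll-up suggested for recclass could look roughly like the sketch below. The function name `parent_group` and the matching rules are hypothetical illustrations, not a taxonomy taken from the dataset; a real mapping would need review against the full list of 466 classes.

```python
import re
from collections import Counter

# Hypothetical roll-up rules: ordinary chondrite groups are H, L, LL followed
# by a petrologic type (or nothing); carbonaceous groups start with C plus a
# group letter. Neither pattern comes from the dataset itself.
_ORDINARY = re.compile(r"^(LL|L|H)(?![A-Za-z])")
_CARBONACEOUS = re.compile(r"^C[IMOVKR](?![a-z])")

def parent_group(recclass: str) -> str:
    """Map a recclass code to a coarse parent group, else 'other'."""
    match = _ORDINARY.match(recclass)
    if match:
        return match.group(1)
    if _CARBONACEOUS.match(recclass):
        return "C"
    if recclass.startswith("Iron"):
        return "Iron"
    return "other"

# Class codes taken from the recclass value table in this report.
sample = ["L6", "H5", "LL5", "H4/5", "CM2", "Iron, IIIAB", "Ureilite", "LL"]
print(Counter(parent_group(code) for code in sample))
```

The negative lookahead keeps names such as Howardite from being mistaken for the H group, since the letter after the leading H rules them out.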
mass (g)
numeric feature high_skew outliers
Numeric mass measurements in grams across 45,716 rows, with a median of just 32.6g but a maximum of 60,000,000g, more than six orders of magnitude above the median. The distribution is extremely heavy-tailed (skew 76.9, kurtosis ~6796) and 15.5% of non-null values flag as outliers, while the std (574,988) dwarfs the IQR (195.4). Nulls (0.29%) and zeros (0.04%) are negligible. Treatment: Log-transform before any modelling or distance-based analysis, using log(1 + x) or excluding the zero-mass rows so the transform stays defined.
- n: 45,716
- nulls: 131 (0.3%)
- unique: 12,576
- min: 0
- max: 6e+07
- mean: 1.328e+04
- median: 32.6
- std: 5.75e+05
- q1: 7.2
- q3: 202.6
- iqr: 195.4
- skew: 76.91
- kurtosis: 6796
- n_outliers: 7,086
- outlier_rate: 0.1554
- zero_rate: 0.0004168
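A minimal sketch of the two operations this column calls for: an outlier fence from the reported quartiles and a zero-safe log transform. It assumes the profiler flags outliers with the common 1.5 x IQR rule, which the report does not state; `upper_fence` and `log_mass` are illustrative names.

```python
import math

# Quartiles as reported in the profile for mass (g).
Q1, Q3 = 7.2, 202.6

def upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper outlier fence under the k * IQR rule (k = 1.5 is an assumption)."""
    return q3 + k * (q3 - q1)

def log_mass(mass_g):
    """log(1 + x) keeps zero masses finite; missing values pass through."""
    return None if mass_g is None else math.log1p(mass_g)

print(round(upper_fence(Q1, Q3), 1))  # fence in grams under the 1.5 x IQR assumption
print(log_mass(0.0), log_mass(32.6), log_mass(60_000_000))
```

Under that assumption, any mass above roughly 496g would count as an outlier, which shows how far the bulk of the data sits below the 60,000,000g maximum.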
fall
categorical feature imbalance
Binary categorical flag distinguishing meteorites that were observed falling versus those found later, with only two values: "Found" and "Fell". The split is severely imbalanced: "Found" accounts for 44,609 of 45,716 rows (top_rate 0.9758) while "Fell" has just 1,107, yielding an entropy_ratio of 0.164. No nulls are present. Treatment: Encode as binary; stratify or rebalance before modelling given the 40:1 class skew.
- n: 45,716
- nulls: 0 (0.0%)
- unique: 2
- top_value: Found
- top_rate: 0.9758
- cardinality: 2
- entropy: 0.1645
- entropy_ratio: 0.1645
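The entropy figures in this report are consistent with Shannon entropy in bits divided by log2 of the cardinality. That reading is inferred from the reported numbers rather than from any profiler documentation, so treat the sketch below as a plausible reconstruction; `entropy_ratio` is an illustrative name.

```python
import math

def entropy_ratio(counts):
    """Shannon entropy (bits) of the value counts, scaled by log2(cardinality)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    k = len(probs)
    # A single-valued column has zero entropy and no meaningful scale.
    return entropy / math.log2(k) if k > 1 else 0.0

# Counts taken from the fall and nametype value tables in this report.
print(round(entropy_ratio([44609, 1107]), 4))  # fall
print(round(entropy_ratio([45641, 75]), 5))    # nametype
```

A ratio of 1.0 would mean perfectly balanced classes; values near 0 indicate a near-constant column.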
year
categorical timestamp
Stored as January-1 timestamps, this column encodes a year-of-record across 45,716 rows with 266 distinct values and a 0.64% null rate. Despite being labeled 'year', the values are full datetimes pinned to YYYY-01-01, which will surprise anyone expecting integer years. The distribution is moderately spread (entropy ratio 0.66) with 2003 the modal year at 7.3% of rows, followed by 1979 and 1998. Treatment: Cast to integer year (or proper date) before using as a temporal feature.
- n: 45,716
- nulls: 291 (0.6%)
- unique: 266
- top_value: 2003-01-01T00:00:00
- top_rate: 0.07315
- cardinality: 266
- entropy: 5.299
- entropy_ratio: 0.6578
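One way to cast these values to integer years is to read the leading four characters rather than parse a full datetime. The sketch assumes every non-null value has the zero-padded `YYYY-01-01T00:00:00` shape seen in the value table; `year_from_timestamp` is an illustrative name.

```python
def year_from_timestamp(value):
    """Return the integer year from a 'YYYY-01-01T00:00:00' string, or None."""
    # Assumption: non-null values always start with a four-digit year.
    if not value:
        return None
    return int(value[:4])

# Values taken from the year table in this report, plus a missing entry.
raw = ["2003-01-01T00:00:00", "1979-01-01T00:00:00", None]
print([year_from_timestamp(v) for v in raw])
```

Slicing sidesteps any date-range limits in datetime parsers, at the cost of trusting the string layout.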
reclat
numeric feature
This is the meteorite's latitude in decimal degrees, ranging from -87.37 to 81.17. The distribution leans heavily toward the southern hemisphere with a median of -71.5 and a Q3 of exactly 0.0, and 16.8% of non-null values are exactly zero, which are more likely placeholder/unknown coordinates than points on the equator. About 16% of rows are null, and the bimodal-feeling shape (kurtosis -1.48) suggests clusters in Antarctica and elsewhere. Treatment: Treat exact zeros as missing and pair with reclong for geospatial use.
- n: 45,716
- nulls: 7,315 (16.0%)
- unique: 12,738
- min: -87.37
- max: 81.17
- mean: -39.12
- median: -71.5
- std: 46.38
- q1: -76.71
- q3: 0
- iqr: 76.71
- skew: 0.4916
- kurtosis: -1.477
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.1677
reclong
numeric feature
Longitude coordinate for meteorite recovery sites, ranging from -165.43 to 354.47 with median 35.67. The maximum exceeding 180 is anomalous for standard longitude and suggests un-normalized (0 to 360) or erroneous values. The 16.2% zero rate closely tracks the 16.8% zero rate in reclat and the 6,214 GeoLocation entries at (0.0, 0.0), which suggests that unknown coordinates were coded as 0 rather than left null. Treatment: Normalize longitudes to [-180,180], treat 0/0 pairs as missing, then use as a geospatial feature.
- n: 45,716
- nulls: 7,315 (16.0%)
- unique: 14,640
- min: -165.4
- max: 354.5
- mean: 61.07
- median: 35.67
- std: 80.65
- q1: 0
- q3: 157.2
- iqr: 157.2
- skew: -0.1745
- kurtosis: -0.7312
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.1618
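The longitude normalization can be sketched with modular arithmetic. This assumes out-of-range values such as 354.47 use a 0 to 360 convention rather than being data-entry errors, which the profile cannot confirm; `normalize_longitude` is an illustrative name.

```python
def normalize_longitude(lon_deg: float) -> float:
    """Wrap a longitude in degrees into the half-open range [-180, 180)."""
    # Shift by 180, wrap into [0, 360), then shift back.
    return (lon_deg + 180.0) % 360.0 - 180.0

# The reported maximum, minimum, and median of reclong.
for lon in (354.47, -165.43, 35.67):
    print(lon, "->", round(normalize_longitude(lon), 2))
```

Values already inside the range are unchanged, so the function is safe to apply to the whole column. Note that an input of exactly 180 maps to -180 under this convention.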
GeoLocation
text feature duplicates
Serialised Python list literals encoding geolocation tuples of the form [None, lat, lon, None, False], with 45,716 rows, 16% nulls and only 17,100 distinct values. Duplication is severe: 21,301 of the 38,401 non-null values repeat an earlier one (duplicate_rate 0.55), and the top value '[None, 0.0, 0.0, None, False]' appears 6,214 times, suggesting placeholder coordinates. Lengths are tightly bounded (min 33, max 47), consistent with a fixed serialisation rather than free text. Treatment: Parse the list literal into separate latitude and longitude numeric columns and treat 0.0/0.0 as missing.
- n: 45,716
- nulls: 7,315 (16.0%)
- unique: 17,100
- len_min: 33
- len_max: 47
- len_mean: 40.3
- len_median: 41
- len_p95: 45
- word_mean: 5
- word_median: 5
- n_empty: 0
- n_duplicates: 21,301
- duplicate_rate: 0.5547
- vocab_size: 15,461
- readability_flesch_mean: 117.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
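Parsing the serialised list could be done with `ast.literal_eval`, which evaluates literals without running arbitrary code. The sketch assumes latitude and longitude sit at positions 1 and 2, as in the top value shown, and that they may arrive either as numbers or as quoted strings; `parse_geolocation` is an illustrative name.

```python
import ast

def parse_geolocation(raw):
    """Return (lat, lon) as floats, or None for missing/placeholder values."""
    if raw is None:
        return None
    parts = ast.literal_eval(raw)  # safe evaluation of the list literal
    # Assumption: positions 1 and 2 hold latitude and longitude.
    if parts[1] is None or parts[2] is None:
        return None
    lat, lon = float(parts[1]), float(parts[2])
    # Treat the (0.0, 0.0) placeholder as missing, per the treatment note.
    if lat == 0.0 and lon == 0.0:
        return None
    return lat, lon

print(parse_geolocation("[None, 0.0, 0.0, None, False]"))
print(parse_geolocation("[None, '-71.5', '35.67', None, False]"))
```

Returning None for both nulls and placeholders lets downstream code handle all missing coordinates in one place.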
States
numeric feature null_rate
Numeric column 'States' takes 45 distinct integer values between 1 and 51 with a median of 15, strongly suggesting encoded US state identifiers rather than a true quantity. The column is 96.37% null, so it carries information for fewer than 4% of rows, and the right skew (1.11) reflects uneven coverage across the encoded states. Treating the mean of 17.3 as meaningful would be a mistake given the categorical nature. Treatment: Cast to categorical state codes and flag missingness rather than imputing; with 96% nulls, imputed values would far outnumber observed ones.
- n: 45,716
- nulls: 44,057 (96.4%)
- unique: 45
- min: 1
- max: 51
- mean: 17.34
- median: 15
- std: 10.41
- q1: 9
- q3: 23
- iqr: 14
- skew: 1.115
- kurtosis: 0.6891
- n_outliers: 40
- outlier_rate: 0.02411
- zero_rate: 0
Counties
numeric feature null_rate
Numeric column 'Counties' is populated for only 3.6% of the 45,716 rows (null_rate 0.9637), with 662 unique values ranging from 5 to 3,210 and a roughly symmetric distribution (skew 0.24, kurtosis -1.19, mean 1,353 vs median 1,195). The values look like county identifier codes rather than a continuous measurement, though the encoding scheme cannot be confirmed from the profile, and the overwhelming sparsity is the headline issue. No outliers or zeros are flagged. Treatment: Add a missingness indicator rather than imputing; given 96% nulls, consider dropping unless the populated subset is analytically meaningful.
- n: 45,716
- nulls: 44,057 (96.4%)
- unique: 662
- min: 5
- max: 3,210
- mean: 1,353
- median: 1,195
- std: 994.1
- q1: 482
- q3: 2,113
- iqr: 1,631
- skew: 0.2374
- kurtosis: -1.19
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0