large meteorites

source /home/coolhand/datasets/large-meteorites/large_meteorites.json · 4,871 rows · 10 columns · profiled 2026-05-01

Reading

dataset summary · high confidence · anthropic:claude-opus-4-7

This dataset catalogues 4,871 large meteorites in 10 columns covering mass, location (latitude/longitude), discovery year, fall type, and classification. Mass is extremely right-skewed (skew ≈ 25, max 60,000,000 g vs median 3,600 g) with 712 outliers, so any mass-based analysis should use a log scale or trim extremes. fall_type is heavily imbalanced: 84.8% of rows are 'Found' and 15.2% are 'Fell'. meteorite_class is dominated by ordinary chondrites such as L6 (16.9%) and H5 (13.7%). Year is left-skewed, with most records in recent decades (median 1990, q1 1945) and a thin tail of early records, reflecting a modern collection bias. Note that the 'category' column is constant ('large_meteorites') and adds no signal.

citing: mass_g.stats.skew · mass_g.stats.median · mass_g.stats.max · mass_g.stats.n_outliers · fall_type.stats.top_rate · meteorite_class.stats.top_rate · meteorite_class.top_values · year.stats.median · year.stats.q1 · category.stats.top_rate · row_count

Charts the summary said to look at first

mass_g · Heavy right tail. Use a log scale to see the bulk below ~12 kg (q3 = 12,000 g) separately from the multi-tonne outliers.

Histogram bins for mass_g (median: 3600.0).
bin	count
1000 – 1.501e+06	4831
1.501e+06 – 3.001e+06	19
3.001e+06 – 4.501e+06	4
4.501e+06 – 6.001e+06	1
6.001e+06 – 7.501e+06	1
7.501e+06 – 9.001e+06	1
9.001e+06 – 1.05e+07	2
1.05e+07 – 1.2e+07	0
1.2e+07 – 1.35e+07	0
1.35e+07 – 1.5e+07	0
1.5e+07 – 1.65e+07	2
1.65e+07 – 1.8e+07	0
1.8e+07 – 1.95e+07	0
1.95e+07 – 2.1e+07	0
2.1e+07 – 2.25e+07	1
2.25e+07 – 2.4e+07	2
2.4e+07 – 2.55e+07	1
2.55e+07 – 2.7e+07	1
2.7e+07 – 2.85e+07	1
2.85e+07 – 3e+07	1
3e+07 – 3.15e+07	0
3.15e+07 – 3.3e+07	0
3.3e+07 – 3.45e+07	0
3.45e+07 – 3.6e+07	0
3.6e+07 – 3.75e+07	0
3.75e+07 – 3.9e+07	0
3.9e+07 – 4.05e+07	0
4.05e+07 – 4.2e+07	0
4.2e+07 – 4.35e+07	0
4.35e+07 – 4.5e+07	0
4.5e+07 – 4.65e+07	0
4.65e+07 – 4.8e+07	0
4.8e+07 – 4.95e+07	0
4.95e+07 – 5.1e+07	1
5.1e+07 – 5.25e+07	0
5.25e+07 – 5.4e+07	0
5.4e+07 – 5.55e+07	0
5.55e+07 – 5.7e+07	0
5.7e+07 – 5.85e+07	1
5.85e+07 – 6e+07	1
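
The linear bins above put 4,831 of the 4,871 rows in the first bin, so they say little about the bulk of the data. One way to follow the log-scale advice is to bin by power of ten instead. This is a minimal stdlib-only sketch: `decade_histogram` is a hypothetical helper, and the sample values are illustrative (only 1,000, 3,600 and 60,000,000 are taken from the profiled min, median and max).

```python
import math
from collections import Counter

def decade_histogram(masses_g):
    """Count positive values per power-of-ten bin; key 3 means [1e3, 1e4) grams."""
    counts = Counter(math.floor(math.log10(m)) for m in masses_g if m > 0)
    return dict(sorted(counts.items()))

# Illustrative masses in grams, not actual rows from the dataset.
sample = [1_000, 1_686, 3_600, 12_000, 250_000, 60_000_000]
hist = decade_histogram(sample)
for exponent, count in hist.items():
    print(f"1e{exponent} to 1e{exponent + 1} g: {count}")
```

Decade bins are coarse; finer log-spaced edges would work the same way if more resolution is needed.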

year · Discovery years cluster in recent decades, with a long thin tail back to 1399.

Histogram bins for year (median: 1990.0).
bin	count
1399 – 1414	1
1414 – 1430	0
1430 – 1445	0
1445 – 1460	0
1460 – 1476	0
1476 – 1491	1
1491 – 1506	0
1506 – 1522	0
1522 – 1537	0
1537 – 1552	0
1552 – 1568	0
1568 – 1583	2
1583 – 1599	0
1599 – 1614	1
1614 – 1629	3
1629 – 1645	2
1645 – 1660	0
1660 – 1675	1
1675 – 1691	0
1691 – 1706	0
1706 – 1721	2
1721 – 1737	1
1737 – 1752	4
1752 – 1767	3
1767 – 1783	6
1783 – 1798	13
1798 – 1813	27
1813 – 1829	30
1829 – 1844	47
1844 – 1860	72
1860 – 1875	106
1875 – 1890	152
1890 – 1906	140
1906 – 1921	169
1921 – 1936	242
1936 – 1952	277
1952 – 1967	267
1967 – 1982	489
1982 – 1998	706
1998 – 2013	2053

fall_type · Roughly 85% 'Found' versus 15% 'Fell' — observed falls are the rarer, more scientifically valuable subset.

Top values for fall_type (2 unique shown, of 2 total).
value	count	share
Found	4129	84.8%
Fell	742	15.2%

meteorite_class · L6 and H5 ordinary chondrites dominate; the long tail of 261 classes is sparse.

Top values for meteorite_class (20 unique shown, of 261 total).
value	count	share
L6	825	16.9%
H5	665	13.7%
L5	408	8.4%
H4	349	7.2%
H6	309	6.3%
Iron, IIIAB	232	4.8%
L4	157	3.2%
LL6	120	2.5%
LL5	93	1.9%
Iron, IIAB	92	1.9%
Iron, ungrouped	76	1.6%
Iron, IAB-MG	75	1.5%
Iron, IVA	59	1.2%
CV3	35	0.7%
H4/5	33	0.7%
Iron, IAB complex	32	0.7%
Pallasite, PMG	31	0.6%
Iron	30	0.6%
Iron, IAB-ung	30	0.6%
LL4	29	0.6%

latitude · Multimodal spread: most rows sit in mid-northern latitudes, a spike of 355 rows falls in the bin containing 0, and an Antarctic cluster of roughly 500 rows south of -66° drives the negative outliers.

Histogram bins for latitude (median: 24.00139).
bin	count
-87.37 – -83.15	159
-83.15 – -78.94	55
-78.94 – -74.73	126
-74.73 – -70.51	158
-70.51 – -66.3	1
-66.3 – -62.09	0
-62.09 – -57.87	0
-57.87 – -53.66	1
-53.66 – -49.45	0
-49.45 – -45.23	3
-45.23 – -41.02	10
-41.02 – -36.81	22
-36.81 – -32.59	53
-32.59 – -28.38	140
-28.38 – -24.17	112
-24.17 – -19.95	59
-19.95 – -15.74	27
-15.74 – -11.53	13
-11.53 – -7.313	16
-7.313 – -3.1	18
-3.1 – 1.113	355
1.113 – 5.327	9
5.327 – 9.54	15
9.54 – 13.75	35
13.75 – 17.97	29
17.97 – 22.18	636
22.18 – 26.39	125
26.39 – 30.61	478
30.61 – 34.82	401
34.82 – 39.03	368
39.03 – 43.25	252
43.25 – 47.46	165
47.46 – 51.67	137
51.67 – 55.89	120
55.89 – 60.1	40
60.1 – 64.31	17
64.31 – 68.53	12
68.53 – 72.74	3
72.74 – 76.95	3
76.95 – 81.17	1

Schema

10 columns

Per-column summary.
column	type	nulls	unique	alerts
name	text	0.0%	4,871	near_unique one_word
mass_g	numeric	0.0%	2,850	high_skew outliers
meteorite_class	categorical	0.0%	261
fall_type	categorical	0.0%	2
latitude	numeric	14.3%	2,861	outliers
longitude	numeric	14.3%	3,122
date	categorical	1.1%	243
year	numeric	1.1%	243	high_skew
category	categorical	0.0%	1	imbalance
description	text	0.0%	4,871	near_unique

name

text identifier near_unique one_word

The `name` column holds a unique short label for each of the 4871 rows (n_unique == n, duplicate_rate 0.0), averaging 13.8 characters and 2.2 words. Top tokens like 'africa', 'northwest', 'dhofar', 'jiddat' and 'harasis' strongly suggest these are meteorite or specimen designations (e.g. 'Northwest Africa', 'Dhofar', 'Jiddat al Harasis' find sites). About 32% are single-word names, and the vocabulary of 4632 tokens across 4871 names confirms heavy reuse of regional prefixes with numeric suffixes. Treatment: Treat as a unique row identifier; drop from modelling or parse out the regional prefix as a categorical feature. high · anthropic:claude-opus-4-7

n: 4,871
nulls: 0 (0.0%)
unique: 4,871
len_min: 3
len_max: 28
len_mean: 13.8
len_median: 12
len_p95: 21
word_mean: 2.215
word_median: 2
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 4,632
readability_flesch_mean: 52.55
emoji_rate: 0
url_rate: 0
one_word_rate: 0.3221
allcaps_rate: 0
boilerplate_rate: 0
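
The suggestion to parse out the regional prefix could be sketched as below. It assumes names follow a "locality plus trailing catalogue number" pattern, which the top tokens suggest but the report does not confirm. `regional_prefix` and the example names are hypothetical.

```python
import re

# Assumed pattern: an optional run of whitespace followed by digits at the end of the name.
_SUFFIX = re.compile(r"\s*\d+$")

def regional_prefix(name):
    """Strip a trailing numeric suffix; names without one are returned unchanged."""
    return _SUFFIX.sub("", name).strip()

# Hypothetical names in the style the token counts suggest, not actual rows.
examples = ["Northwest Africa 1234", "Dhofar 007", "Jiddat al Harasis 55", "Allende"]
prefixes = [regional_prefix(n) for n in examples]
print(prefixes)
```

Names with letter suffixes or other numbering schemes would need a broader pattern, so it is worth checking the distinct prefixes this produces before using them as a feature.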

mass_g

numeric feature high_skew outliers

This is the recorded mass in grams for 4,871 specimens, with 2,850 unique values and no nulls. The distribution is extraordinarily right-skewed (skew 25.1, kurtosis 723.6): the median is just 3,600 g while the mean is 123,403 g and the max reaches 60,000,000 g, producing 712 outliers (14.6% of rows). The standard deviation of 1,755,278 dwarfs the IQR of 10,313, indicating a handful of enormous specimens dominate the tail. Treatment: Log-transform before any modelling and consider winsorising the upper tail. high · anthropic:claude-opus-4-7

n: 4,871
nulls: 0 (0.0%)
unique: 2,850
min: 1,000
max: 6e+07
mean: 1.234e+05
median: 3,600
std: 1.755e+06
q1: 1,686
q3: 12,000
iqr: 1.031e+04
skew: 25.12
kurtosis: 723.6
n_outliers: 712
outlier_rate: 0.1462
zero_rate: 0
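
The treatment above (cap the upper tail, then log-transform) might look roughly like this. It is a stdlib-only sketch under stated assumptions: `winsorise_upper` and `log10_mass` are hypothetical helpers, the nearest-rank quantile is one of several possible definitions, and the cut-off quantile is arbitrary rather than something the profile recommends.

```python
import math

def winsorise_upper(values, upper_q=0.99):
    """Cap values above the upper_q quantile (nearest-rank) at that quantile."""
    ordered = sorted(values)
    # Nearest-rank quantile: smallest value with at least upper_q of the data at or below it.
    rank = max(1, math.ceil(upper_q * len(ordered)))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]

def log10_mass(values):
    """log10 of strictly positive masses (the profile reports min 1,000 g, zero_rate 0)."""
    return [math.log10(v) for v in values]

# Illustrative masses in grams, not actual rows; upper_q=0.8 is chosen only to
# make the capping visible on five values.
masses = [1_000, 1_686, 3_600, 12_000, 60_000_000]
capped = winsorise_upper(masses, upper_q=0.8)
logged = log10_mass(capped)
print(capped)
```

Because log10 is monotonic, capping before or after the transform gives the same result; the order here is only a matter of readability.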

meteorite_class

categorical feature

This column holds meteorite classification codes (e.g., L6, H5, 'Iron, IIIAB'), a standard taxonomy for stony and iron meteorites. With 261 distinct classes across 4,871 rows and an entropy ratio of 0.65, the distribution is moderately diverse but heavily concentrated: the top class L6 covers 16.9%, and the five most common ordinary chondrite classes (L6, H5, L5, H4, H6) together cover about 52% of rows. There are no nulls, but the remaining 256 classes share the other 48%, so most of the long tail will be sparse. Treatment: Group rare classes into an 'other' bucket or roll up to parent groups (H/L/LL/Iron) before one-hot encoding. high · anthropic:claude-opus-4-7

n: 4,871
nulls: 0 (0.0%)
unique: 261
top_value: L6
top_rate: 0.1694
cardinality: 261
entropy: 5.197
entropy_ratio: 0.6474
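
Both options in the treatment, lumping rare classes and rolling up to parent groups, can be sketched with the standard library. `lump_rare` and `parent_group` are hypothetical helpers, the `min_count` threshold is arbitrary, and the roll-up rule (labels starting with "Iron", or with H, L, or LL followed by a digit) is a simplifying assumption rather than a full meteorite taxonomy.

```python
import re
from collections import Counter

_CHONDRITE = re.compile(r"^(LL|L|H)\d")  # LL is listed first so it is not read as L

def lump_rare(labels, min_count, other="other"):
    """Replace labels that occur fewer than min_count times with one bucket."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]

def parent_group(label, other="other"):
    """Roll a class code up to H / L / LL / Iron; everything else goes to `other`."""
    if label.startswith("Iron"):
        return "Iron"
    match = _CHONDRITE.match(label)
    return match.group(1) if match else other

# Toy labels using class codes from the table above; the counts are made up.
labels = ["L6"] * 5 + ["H5"] * 4 + ["CV3", "Pallasite, PMG"]
lumped = lump_rare(labels, min_count=2)
groups = [parent_group(c) for c in ["L6", "LL5", "H4/5", "Iron, IIIAB", "CV3"]]
print(Counter(lumped), groups)
```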

fall_type

categorical feature

This is a binary categorical flag distinguishing meteorites that were 'Found' versus those observed to have 'Fell'. The class is heavily imbalanced — 'Found' accounts for 4129 of 4871 records (top_rate 0.8477) while 'Fell' makes up the remaining 742, with no nulls. Entropy of 0.6156 confirms the skew but the column is fully populated and clean. Treatment: Encode as a binary indicator; account for class imbalance if used as a stratifier or target. high · anthropic:claude-opus-4-7

n: 4,871
nulls: 0 (0.0%)
unique: 2
top_value: Found
top_rate: 0.8477
cardinality: 2
entropy: 0.6156
entropy_ratio: 0.6156
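
The entropy figure can be checked from the two counts in the fall_type table. This sketch assumes the profiler reports Shannon entropy in bits and normalises by log2 of the cardinality, which is consistent with the numbers shown but not stated in the report. `shannon_entropy_bits` is a hypothetical helper.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

fall_counts = {"Found": 4129, "Fell": 742}  # counts from the fall_type table
h = shannon_entropy_bits(fall_counts.values())
ratio = h / math.log2(len(fall_counts))  # maximum entropy for k classes is log2(k)
print(round(h, 4), round(ratio, 4))
```

For a binary column the maximum entropy is 1 bit, which is why entropy and entropy_ratio are identical here.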

latitude

numeric feature outliers

Geographic latitude in decimal degrees, spanning -87.37 to 81.17 with a median of 24.00, consistent with a worldwide point dataset. The distribution is left-skewed (-1.23) with 500 flagged outliers (12.0%) and an 8.4% zero rate that may indicate placeholder or equator-rounded values. Roughly 14.3% of rows are null, so geographic analyses will lose a meaningful slice. Treatment: Pair with longitude for geo-features; investigate the 8.4% zeros for sentinel encoding before modelling. high · anthropic:claude-opus-4-7

n: 4,871
nulls: 697 (14.3%)
unique: 2,861
min: -87.37
max: 81.17
mean: 9.763
median: 24
std: 39.21
q1: 0
q3: 35.66
iqr: 35.66
skew: -1.228
kurtosis: 0.4314
n_outliers: 500
outlier_rate: 0.1198
zero_rate: 0.08385

longitude

numeric feature

Geographic longitude in decimal degrees, spanning -165.43 to 178.2 with median 12.16 and an IQR of 133.79, consistent with worldwide coverage. Distribution is nearly symmetric (skew 0.14) and platykurtic (kurtosis -0.85), with no flagged outliers. Notable concerns: 14.31% of rows are null and 8.36% are exactly zero, which often indicates missing coordinates encoded as 0 rather than true Gulf-of-Guinea points. Treatment: Treat zeros as missing, then pair with latitude for geospatial features. high · anthropic:claude-opus-4-7

n: 4,871
nulls: 697 (14.3%)
unique: 3,122
min: -165.4
max: 178.2
mean: 7.871
median: 12.16
std: 82.3
q1: -78.08
q3: 55.71
iqr: 133.8
skew: 0.1374
kurtosis: -0.845
n_outliers: 0
outlier_rate: 0
zero_rate: 0.08361
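
One cautious way to apply the zeros-as-missing treatment is to null out a coordinate pair only when both latitude and longitude are exactly 0, so that a genuine point on the equator or the prime meridian survives. That pairing rule is an assumption layered on top of the report's advice, and `clean_coordinates` and the sample rows are hypothetical.

```python
def clean_coordinates(pairs):
    """Map (0, 0) and incomplete pairs to None; keep everything else as floats."""
    cleaned = []
    for lat, lon in pairs:
        if lat is None or lon is None or (lat == 0 and lon == 0):
            cleaned.append(None)
        else:
            cleaned.append((float(lat), float(lon)))
    return cleaned

# Illustrative coordinates, not actual rows from the dataset.
rows = [(24.0, 12.16), (0.0, 0.0), (None, None), (-87.37, 178.2), (0.0, 35.5)]
cleaned = clean_coordinates(rows)
print(cleaned)
```

The latitude and longitude zero rates differ slightly (0.08385 vs 0.08361), so a few rows presumably have only one coordinate equal to zero; those are kept by this rule and may deserve a manual look.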

date

categorical timestamp

This column holds ISO-format dates stored as strings, but every observed value falls on January 1st of a year, suggesting year-only granularity padded to a date. Across 4871 rows there are 243 distinct values with a 1.11% null rate, and the most common entry '2000-01-01' accounts for just 4.88% of records, giving a fairly flat distribution (entropy ratio 0.84). The concentration of top values in the early 2000s hints at a temporal bias in the dataset. Treatment: Parse to datetime and extract year as the effective granularity before modelling. high · anthropic:claude-opus-4-7

n: 4,871
nulls: 54 (1.1%)
unique: 243
top_value: 2000-01-01
top_rate: 0.04879
cardinality: 243
entropy: 6.668
entropy_ratio: 0.8414
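
Extracting the year from these strings could be done as follows, assuming every non-null value is a plain 'YYYY-MM-DD' string like the top value '2000-01-01'. `year_from_iso` is a hypothetical helper.

```python
from datetime import date

def year_from_iso(value):
    """Return the year of an ISO 'YYYY-MM-DD' string, or None if missing or invalid."""
    if not value:
        return None
    try:
        return date.fromisoformat(value).year
    except ValueError:
        return None

years = [year_from_iso(v) for v in ["2000-01-01", "1399-01-01", None, "n/a"]]
print(years)
```

The standard-library `date` type handles the earliest years in this dataset. If the column is parsed with pandas instead, note that its default nanosecond-resolution timestamps cannot represent dates before 1677, so the oldest entries would need to be coerced or kept as integer years.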

year

numeric feature high_skew

This column is the year each meteorite fell or was found, ranging from 1399 to 2013 with a median of 1990 and an IQR spanning 1945-2002. The distribution is heavily left-skewed (skew -2.27, kurtosis 9.38), with 216 outliers (4.48% of non-null rows) trailing into earlier centuries; these may be genuine historical records or data-entry errors. The null rate is low at 1.11% across 4,871 rows. Treatment: Inspect pre-1900 outliers for data-entry errors; consider clipping or treating as ordinal before modelling. high · anthropic:claude-opus-4-7

n: 4,871
nulls: 54 (1.1%)
unique: 243
min: 1399
max: 2013
mean: 1968
median: 1990
std: 51.83
q1: 1945
q3: 2002
iqr: 57
skew: -2.271
kurtosis: 9.377
n_outliers: 216
outlier_rate: 0.04484
zero_rate: 0
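
The report does not say how outliers are flagged. A common convention is Tukey's rule, which marks values more than 1.5 × IQR beyond the quartiles, and the sketch below applies it to the year quartiles as an assumption. `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = tukey_fences(1945, 2002)  # q1 and q3 from the year stats
print(low, high)  # 1859.5 2087.5
```

Under this rule every flagged year lies before 1860 and none lies above the maximum of 2013. The year histogram bins up to the one ending at 1860 sum to 216, which matches n_outliers, so the assumption looks consistent with this report.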

description

text free_text near_unique

Short, templated descriptive sentences about meteorites — every one of the 4871 rows is unique yet contains the words 'meteorite', '-', and 'mass:', indicating a generated string like 'Northwest Africa meteorite - mass: ... found.' Lengths are tight (39–82 chars, mean 53.6) and there are no nulls, duplicates, URLs, or emoji. Despite being unique per row, the vocabulary is small (6296) and dominated by classification tokens (l6., h5., iron,) plus locality terms, suggesting a synthesised summary rather than free-form prose. Treatment: Parse out the structured fields (class, locality, mass) instead of embedding the raw string. high · anthropic:claude-opus-4-7

n: 4,871
nulls: 0 (0.0%)
unique: 4,871
len_min: 39
len_max: 82
len_mean: 53.59
len_median: 54
len_p95: 65
word_mean: 8.396
word_median: 8
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 6,296
readability_flesch_mean: 58.7
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0