
large meteorites

source: /home/coolhand/datasets/large-meteorites/large_meteorites.json · 4,871 rows · 10 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogues 4,871 large meteorites in 10 columns covering mass, location (latitude/longitude), discovery year, fall type, and classification. Mass is extremely right-skewed (skew ≈ 25; max 60,000,000 g versus a median of 3,600 g) with 712 flagged outliers, so any mass-based analysis should use a log scale or trim the extremes. fall_type is heavily imbalanced: 84.8% of rows are 'Found' and the rest are 'Fell'. meteorite_class is dominated by ordinary chondrites such as L6 (16.9%) and H5. Year is left-skewed: values concentrate in recent decades (median 1990, q1 1945) with a long tail of older records, which points to a modern collection bias. The 'category' column is constant ('large_meteorites') and adds no signal.

citing: mass_g.stats.skew · mass_g.stats.median · mass_g.stats.max · mass_g.stats.n_outliers · fall_type.stats.top_rate · meteorite_class.stats.top_rate · meteorite_class.top_values · year.stats.median · year.stats.q1 · category.stats.top_rate · row_count

Schema

10 columns
Per-column summary.
column | type | nulls | unique | alerts
name | text | 0.0% | 4,871 | near_unique, one_word
mass_g | numeric | 0.0% | 2,850 | high_skew, outliers
meteorite_class | categorical | 0.0% | 261 | -
fall_type | categorical | 0.0% | 2 | -
latitude | numeric | 14.3% | 2,861 | outliers
longitude | numeric | 14.3% | 3,122 | -
date | categorical | 1.1% | 243 | -
year | numeric | 1.1% | 243 | high_skew
category | categorical | 0.0% | 1 | imbalance
description | text | 0.0% | 4,871 | near_unique

name

text identifier near_unique one_word
The `name` column holds a unique short label for each of the 4871 rows (n_unique == n, duplicate_rate 0.0), averaging 13.8 characters and 2.2 words. Top tokens like 'africa', 'northwest', 'dhofar', 'jiddat' and 'harasis' strongly suggest these are meteorite or specimen designations (e.g. 'Northwest Africa', 'Dhofar', 'Jiddat al Harasis' find sites). About 32% are single-word names, and the vocabulary of 4632 tokens across 4871 names confirms heavy reuse of regional prefixes with numeric suffixes. Treatment: Treat as a unique row identifier; drop from modelling or parse out the regional prefix as a categorical feature. high · anthropic:claude-opus-4-7
n
4,871
nulls
0 (0.0%)
unique
4,871
len_min
3
len_max
28
len_mean
13.8
len_median
12
len_p95
21
word_mean
2.215
word_median
2
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
4,632
readability_flesch_mean
52.55
emoji_rate
0
url_rate
0
one_word_rate
0.3221
allcaps_rate
0
boilerplate_rate
0
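
The suggested prefix parsing could look roughly like the sketch below. It assumes names follow a "region plus trailing number" pattern, which the token statistics suggest but do not prove; the helper `regional_prefix` and the sample names (including their numeric suffixes) are hypothetical.

```python
import re

def regional_prefix(name: str) -> str:
    """Drop a trailing all-digit designation and return what is left."""
    return re.sub(r"\s+\d+\s*$", "", name).strip()

# Hypothetical sample names; the numeric suffixes are invented for illustration.
samples = ["Northwest Africa 0001", "Dhofar 0002", "Jiddat al Harasis 0003", "Oneword"]
prefixes = [regional_prefix(s) for s in samples]
print(prefixes)
```

Names without a numeric suffix pass through unchanged, so single-word names would each form their own category unless they are bucketed separately.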

mass_g

numeric feature high_skew outliers
This is the recorded mass in grams for 4,871 specimens, with 2,850 unique values and no nulls. The distribution is extraordinarily right-skewed (skew 25.1, kurtosis 723.6): the median is just 3,600 g while the mean is 123,403 g and the max reaches 60,000,000 g, producing 712 outliers (14.6% of rows). The standard deviation of 1,755,278 dwarfs the IQR of 10,313, indicating a handful of enormous specimens dominate the tail. Treatment: Log-transform before any modelling and consider winsorising the upper tail. high · anthropic:claude-opus-4-7
n
4,871
nulls
0 (0.0%)
unique
2,850
min
1,000
max
6e+07
mean
1.234e+05
median
3,600
std
1.755e+06
q1
1,686
q3
12,000
iqr
1.031e+04
skew
25.12
kurtosis
723.6
n_outliers
712
outlier_rate
0.1462
zero_rate
0
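
A minimal sketch of the suggested treatment, using only the standard library. The sample values are the reported min, q1, median, q3 and max rather than real rows, and the nearest-rank quantile and the 80th-percentile cap are illustrative assumptions, not what the profiler does.

```python
import math

def winsorise_upper(values, upper_pct=99):
    """Cap values above the upper_pct percentile (nearest-rank definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(upper_pct * len(ordered) / 100))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]

def log10_mass(values):
    """Log-transform strictly positive masses (the reported min is 1,000 g)."""
    return [math.log10(v) for v in values]

# Reported summary statistics reused as a toy sample: min, q1, median, q3, max.
masses = [1_000, 1_686, 3_600, 12_000, 60_000_000]
capped = winsorise_upper(masses, upper_pct=80)
logged = log10_mass(masses)
print(capped)
print([round(x, 3) for x in logged])
```

On the log10 scale the reported range shrinks from roughly 1,000 to 60,000,000 g down to roughly 3 to 7.8, which is why the log transform is usually tried before trimming.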

meteorite_class

categorical feature
This column holds meteorite classification codes (e.g., L6, H5, Iron IIIAB), a standard taxonomy for stony and iron meteorites. With 261 distinct classes across 4871 rows and entropy ratio 0.65, the distribution is moderately diverse but heavily concentrated: the top class L6 covers 16.9% and the five most common ordinary chondrite classes (L6, H5, L5, H4, H6) together dominate. No nulls, but the long tail of 250+ rare classes will be sparse. Treatment: Group rare classes into an 'other' bucket or roll up to parent groups (H/L/LL/Iron) before one-hot encoding. high · anthropic:claude-opus-4-7
n
4,871
nulls
0 (0.0%)
unique
261
top_value
L6
top_rate
0.1694
cardinality
261
entropy
5.197
entropy_ratio
0.6474
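
The two options in the treatment can be sketched as follows. The roll-up rule (a leading `LL`, `L` or `H` followed by a digit, or a leading `Iron`) is an assumption about how the class codes are written, and the sample list, including 'Howardite', is hypothetical.

```python
import re
from collections import Counter

def parent_group(cls: str) -> str:
    """Roll a class code up to H / L / LL / Iron, else 'other' (assumed rule)."""
    if cls.startswith("Iron"):
        return "Iron"
    # Require a digit after the letters so names such as 'Howardite' do not match.
    m = re.match(r"(LL|L|H)(?=\d)", cls)
    return m.group(1) if m else "other"

def bucket_rare(classes, min_count=2):
    """Replace classes seen fewer than min_count times with 'other'."""
    counts = Counter(classes)
    return [c if counts[c] >= min_count else "other" for c in classes]

# Hypothetical sample, not actual rows from the dataset.
classes = ["L6", "L6", "H5", "H5", "LL5", "Iron IIIAB", "Howardite"]
groups = [parent_group(c) for c in classes]
bucketed = bucket_rare(classes, min_count=2)
print(groups)
print(bucketed)
```

Rolling up keeps information about the rare classes, whereas bucketing discards it; which is preferable depends on whether the downstream model needs the chondrite group or only the frequent codes.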

fall_type

categorical feature
This is a binary categorical flag distinguishing meteorites that were 'Found' versus those observed to have 'Fell'. The class is heavily imbalanced — 'Found' accounts for 4129 of 4871 records (top_rate 0.8477) while 'Fell' makes up the remaining 742, with no nulls. Entropy of 0.6156 confirms the skew but the column is fully populated and clean. Treatment: Encode as a binary indicator; account for class imbalance if used as a stratifier or target. high · anthropic:claude-opus-4-7
n
4,871
nulls
0 (0.0%)
unique
2
top_value
Found
top_rate
0.8477
cardinality
2
entropy
0.6156
entropy_ratio
0.6156
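
The reported entropy can be reproduced from the two counts, assuming the profiler uses Shannon entropy in bits (base-2 logarithm); that assumption is consistent with the reported 0.6156 but is not stated in the report. The sketch also shows the binary encoding; the column name `is_fell` is hypothetical.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

counts = {"Found": 4129, "Fell": 742}
entropy = shannon_entropy_bits(list(counts.values()))
# Dividing by log2(number of categories) gives the ratio; for 2 categories it is 1.
entropy_ratio = entropy / math.log2(len(counts))

def is_fell(fall_type: str) -> int:
    """Binary indicator: 1 for an observed fall, 0 for a find."""
    return 1 if fall_type == "Fell" else 0

print(round(entropy, 4), round(entropy_ratio, 4))
```

With only two categories the normalising term is 1 bit, which explains why entropy and entropy_ratio are identical for this column.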

latitude

numeric feature outliers
Geographic latitude in decimal degrees, spanning -87.37 to 81.17 with a median of 24.00, consistent with a worldwide point dataset. The distribution is left-skewed (-1.23) with 500 flagged outliers (12.0% of non-null values) and an 8.4% zero rate among non-null values; q1 is exactly 0, so the zeros may be placeholders or equator-rounded values. Roughly 14.3% of rows are null, so geographic analyses will lose a meaningful slice. Treatment: Pair with longitude for geo-features; investigate the 8.4% zeros for sentinel encoding before modelling. high · anthropic:claude-opus-4-7
n
4,871
nulls
697 (14.3%)
unique
2,861
min
-87.37
max
81.17
mean
9.763
median
24
std
39.21
q1
0
q3
35.66
iqr
35.66
skew
-1.228
kurtosis
0.4314
n_outliers
500
outlier_rate
0.1198
zero_rate
0.08385

longitude

numeric feature
Geographic longitude in decimal degrees, spanning -165.43 to 178.2 with median 12.16 and an IQR of 133.79, consistent with worldwide coverage. The distribution is nearly symmetric (skew 0.14) and platykurtic (kurtosis -0.85), with no flagged outliers. Notable concerns: 14.31% of rows are null and 8.36% of the non-null values are exactly zero, which often indicates missing coordinates encoded as 0 rather than true points on the prime meridian (a genuine (0, 0) position would lie in the Gulf of Guinea). Treatment: Treat zeros as missing, then pair with latitude for geospatial features. high · anthropic:claude-opus-4-7
n
4,871
nulls
697 (14.3%)
unique
3,122
min
-165.4
max
178.2
mean
7.871
median
12.16
std
82.3
q1
-78.08
q3
55.71
iqr
133.8
skew
0.1374
kurtosis
-0.845
n_outliers
0
outlier_rate
0
zero_rate
0.08361
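
One conservative way to apply the zero handling to both coordinate columns is sketched below. It treats only the (0, 0) pair as a sentinel and keeps rows where a single coordinate is zero; that is a design assumption, stricter variants are equally plausible, and `clean_coords` is a hypothetical helper.

```python
from typing import Optional, Tuple

def clean_coords(lat: Optional[float], lon: Optional[float]) -> Optional[Tuple[float, float]]:
    """Return (lat, lon), or None if the pair is missing, a (0, 0) sentinel, or out of range."""
    if lat is None or lon is None:
        return None
    if lat == 0 and lon == 0:
        # Assumed sentinel for 'coordinates unknown'.
        return None
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return (float(lat), float(lon))

# Hypothetical rows built from the reported medians and the suspected sentinel.
rows = [(24.0, 12.16), (0.0, 0.0), (None, None), (0.0, 12.16)]
cleaned = [clean_coords(lat, lon) for lat, lon in rows]
print(cleaned)
```

Before adopting this, it is worth counting how many rows have both coordinates equal to zero: the two zero rates are nearly identical (8.385% and 8.361%), which is what paired sentinels would produce.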

date

categorical timestamp
This column holds ISO-format dates stored as strings, but every observed value falls on January 1st of a year, suggesting year-only granularity padded to a date. Across 4871 rows there are 243 distinct values with a 1.11% null rate, and the most common entry '2000-01-01' accounts for just 4.88% of records, giving a fairly flat distribution (entropy ratio 0.84). The concentration of top values in the early 2000s hints at a temporal bias in the dataset. Treatment: Parse to datetime and extract year as the effective granularity before modelling. high · anthropic:claude-opus-4-7
n
4,871
nulls
54 (1.1%)
unique
243
top_value
2000-01-01
top_rate
0.04879
cardinality
243
entropy
6.668
entropy_ratio
0.8414
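
A sketch of the parsing step using only the standard library, assuming values are plain `YYYY-MM-DD` strings as the top value suggests; `year_from_iso` is a hypothetical helper. One practical caveat: the companion year column starts at 1399, which is earlier than the 1677 lower bound of pandas' default nanosecond timestamps, whereas Python's `datetime.date` handles such years.

```python
from datetime import date

def year_from_iso(value):
    """Parse 'YYYY-MM-DD'; return (year, is_january_first) or None for nulls."""
    if not value:
        return None
    d = date.fromisoformat(value)
    return d.year, (d.month == 1 and d.day == 1)

# The most common value from the report, the reported minimum year, and a null.
parsed = [year_from_iso(v) for v in ["2000-01-01", "1399-01-01", None]]
print(parsed)
```

The second element of each tuple makes it easy to confirm the claim that every value falls on January 1st before discarding the month and day.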

year

numeric feature high_skew
This column is the year associated with each specimen, presumably the year it fell or was found (the dataset summary describes it as the discovery year), ranging from 1399 to 2013 with a median of 1990 and an interquartile range spanning 1945 to 2002. The distribution is heavily left-skewed (skew -2.27, kurtosis 9.38) with 216 outliers (4.48% of non-null values) trailing into earlier centuries, suggesting a small set of historical or erroneous entries pulling the tail. The null rate is low at 1.11% across 4871 rows. Treatment: Inspect pre-1900 outliers for data-entry errors; consider clipping or treating as ordinal before modelling. high · anthropic:claude-opus-4-7
n
4,871
nulls
54 (1.1%)
unique
243
min
1,399
max
2,013
mean
1968
median
1,990
std
51.83
q1
1,945
q3
2,002
iqr
57
skew
-2.271
kurtosis
9.377
n_outliers
216
outlier_rate
0.04484
zero_rate
0
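
The report does not say how outliers are flagged. If the common Tukey rule (1.5 × IQR beyond the quartiles) is assumed, the cut-offs implied by the reported quartiles can be computed as below; `tukey_fences` is a hypothetical helper, not the profiler's code.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under the Tukey k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Reported quartiles for year: q1 = 1945, q3 = 2002 (IQR = 57).
low, high = tukey_fences(1945, 2002)
print(low, high)
```

Under that assumption only years before about 1860 would be flagged, since the upper fence lies beyond the reported maximum of 2013, so a pre-1900 inspection would cover every flagged row plus some unflagged ones.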

category

categorical metadata imbalance
This is a constant categorical column where every one of the 4871 rows holds the value "large_meteorites". Cardinality is 1, entropy is 0, and the top rate is 1.0, so the field carries no information for modelling or segmentation. It likely encodes the dataset's provenance or subset label rather than a row-level attribute. Treatment: Drop before modelling; retain only as a dataset-level tag. high · anthropic:claude-opus-4-7
n
4,871
nulls
0 (0.0%)
unique
1
top_value
large_meteorites
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

description

text free_text near_unique
Short, templated descriptive sentences about meteorites: every one of the 4871 rows is unique yet contains the tokens 'meteorite', '-', and 'mass:', indicating a generated string like 'Northwest Africa meteorite - mass: ... found.' Lengths are tight (39–82 chars, mean 53.6) and there are no nulls, duplicates, URLs, or emoji. Despite being unique per row, the vocabulary is limited (6296 tokens) and dominated by classification tokens (l6., h5., iron,) plus locality terms, suggesting a synthesised summary rather than free-form prose. Treatment: Parse out the structured fields (class, locality, mass) instead of embedding the raw string. high · anthropic:claude-opus-4-7
n
4,871
nulls
0 (0.0%)
unique
4,871
len_min
39
len_max
82
len_mean
53.59
len_median
54
len_p95
65
word_mean
8.396
word_median
8
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
6,296
readability_flesch_mean
58.7
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0