data-trove-large-meteorites-10kg

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/nasa_meteorites.csv

Saturn profiled 45,716 rows across 20 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

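# Install the profiler CLI, then run it against the source CSV.
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/nasa_meteorites.csv",
    "--findings", "data-trove-large-meteorites-10kg.json",
    "--llm", "anthropic:default",
], check=True)  # raise CalledProcessError if the CLI exits with a non-zero status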

Summary confidence: high

This dataset is a NASA meteorite landings catalogue covering 45,716 unique meteorite records with attributes including mass, classification, discovery year, and geographic coordinates. The most striking feature is the mass distribution: the median mass is just 32.6 g but the maximum reaches 60,000,000 g, producing extreme skew (skew=76.9). The mean of 13,278 g is pulled far above the median by a small number of enormous specimens, and 7,086 values are flagged as outliers under the 1.5 × IQR rule. A second key finding is that 97.6% of records are classified as 'Found' rather than 'Fell', meaning nearly all entries are meteorites discovered on the ground rather than witnessed falling, which has strong implications for geographic and temporal bias in the data. The meteorite classification column (recclass) spans 466 types, dominated by ordinary chondrites (L6, H5, L5). The busiest discovery years are 2003 and 1979, and 13 of the 20 most common years fall between 1997 and 2010, a concentration likely tied to Antarctic collection campaigns.

citing: mass (g).stats.median · mass (g).stats.max · mass (g).stats.skew · mass (g).stats.mean · mass (g).stats.n_outliers · fall.top_values · fall.stats.top_rate · recclass.n_unique · recclass.top_values · year.top_values · row_count

Out[4]:

saturn.schema() · 20 columns

column	kind	n	null%	unique	alerts
sid	text	45,716	0.0%	45,716	near_unique one_word short_text
id	text	45,716	0.0%	45,716	near_unique one_word allcaps
position	numeric	45,716	0.0%	1	constant
created_at	numeric	45,716	0.0%	1	constant
created_meta	unknown	45,716	0.0%	—	skipped
updated_at	numeric	45,716	0.0%	1	constant
updated_meta	unknown	45,716	0.0%	—	skipped
meta	categorical	45,716	0.0%	1	imbalance
name	text	45,716	0.0%	45,716	near_unique
id_1	numeric	45,716	0.0%	45,716
nametype	categorical	45,716	0.0%	2	imbalance
recclass	categorical	45,716	0.0%	466
mass (g)	numeric	45,716	0.3%	12,576	high_skew outliers
fall	categorical	45,716	0.0%	2	imbalance
year	categorical	45,716	0.6%	266
reclat	numeric	45,716	16.0%	12,738
reclong	numeric	45,716	16.0%	14,640
GeoLocation	text	45,716	16.0%	17,100	duplicates
States	numeric	45,716	96.4%	45	null_rate
Counties	numeric	45,716	96.4%	662	null_rate

Fig 1.

mass (g) · Expect a heavily right-skewed distribution — the vast majority of meteorites are tiny (median 32.6 g) but a few giants push the mean to 13,278 g.

Histogram bins for mass (g) (median: 32.6).
bin	count
0 – 1.5e+06	45544
1.5e+06 – 3e+06	16
3e+06 – 4.5e+06	8
4.5e+06 – 6e+06	1
6e+06 – 7.5e+06	1
7.5e+06 – 9e+06	1
9e+06 – 1.05e+07	2
1.05e+07 – 1.2e+07	0
1.2e+07 – 1.35e+07	0
1.35e+07 – 1.5e+07	0
1.5e+07 – 1.65e+07	2
1.65e+07 – 1.8e+07	0
1.8e+07 – 1.95e+07	0
1.95e+07 – 2.1e+07	0
2.1e+07 – 2.25e+07	1
2.25e+07 – 2.4e+07	1
2.4e+07 – 2.55e+07	2
2.55e+07 – 2.7e+07	1
2.7e+07 – 2.85e+07	1
2.85e+07 – 3e+07	0
3e+07 – 3.15e+07	1
3.15e+07 – 3.3e+07	0
3.3e+07 – 3.45e+07	0
3.45e+07 – 3.6e+07	0
3.6e+07 – 3.75e+07	0
3.75e+07 – 3.9e+07	0
3.9e+07 – 4.05e+07	0
4.05e+07 – 4.2e+07	0
4.2e+07 – 4.35e+07	0
4.35e+07 – 4.5e+07	0
4.5e+07 – 4.65e+07	0
4.65e+07 – 4.8e+07	0
4.8e+07 – 4.95e+07	0
4.95e+07 – 5.1e+07	1
5.1e+07 – 5.25e+07	0
5.25e+07 – 5.4e+07	0
5.4e+07 – 5.55e+07	0
5.55e+07 – 5.7e+07	0
5.7e+07 – 5.85e+07	1
5.85e+07 – 6e+07	1

Fig 2.

fall · 97.6% of records are 'Found' vs only 1,107 'Fell', revealing a strong observational bias toward ground-discovered meteorites.

Top values for fall (2 unique shown, of 2 total).
value	count	share
Found	44609	97.6%
Fell	1107	2.4%

Fig 3.

recclass · L6 and H5 ordinary chondrites dominate the 466 classification types — look for how steeply the long tail drops off.

Top values for recclass (20 unique shown, of 466 total).
value	count	share
L6	8285	18.1%
H5	7142	15.6%
L5	4796	10.5%
H6	4528	9.9%
H4	4211	9.2%
LL5	2766	6.1%
LL6	2043	4.5%
L4	1253	2.7%
H4/5	428	0.9%
CM2	416	0.9%
H3	386	0.8%
L3	365	0.8%
CO3	335	0.7%
Ureilite	300	0.7%
Iron, IIIAB	285	0.6%
LL4	268	0.6%
CV3	256	0.6%
Diogenite	241	0.5%
Howardite	240	0.5%
LL	225	0.5%

Fig 4.

year · Discovery counts peak in 2003 (3,323) and 1979 (3,046), and 13 of the 20 most common years fall between 1997 and 2010, likely reflecting organised Antarctic meteorite recovery expeditions.

Top values for year (20 unique shown, of 266 total).
value	count	share
2003-01-01T00:00:00	3323	7.3%
1979-01-01T00:00:00	3046	6.7%
1998-01-01T00:00:00	2697	5.9%
2006-01-01T00:00:00	2456	5.4%
1988-01-01T00:00:00	2296	5.0%
2002-01-01T00:00:00	2078	4.5%
2004-01-01T00:00:00	1940	4.2%
2000-01-01T00:00:00	1792	3.9%
1997-01-01T00:00:00	1696	3.7%
1999-01-01T00:00:00	1691	3.7%
2001-01-01T00:00:00	1650	3.6%
1990-01-01T00:00:00	1518	3.3%
2009-01-01T00:00:00	1497	3.3%
1986-01-01T00:00:00	1375	3.0%
2007-01-01T00:00:00	1189	2.6%
2010-01-01T00:00:00	1005	2.2%
1993-01-01T00:00:00	979	2.1%
2008-01-01T00:00:00	957	2.1%
1987-01-01T00:00:00	916	2.0%
1991-01-01T00:00:00	877	1.9%

Fig 5.

reclat · Latitude values are skewed toward negative (southern) values, confirming that Antarctic recoveries account for a large share of the dataset.

Histogram bins for reclat (median: -71.5).
bin	count
-87.37 – -83.15	7090
-83.15 – -78.94	1218
-78.94 – -74.73	4083
-74.73 – -70.51	9707
-70.51 – -66.3	1
-66.3 – -62.09	0
-62.09 – -57.87	0
-57.87 – -53.66	1
-53.66 – -49.45	0
-49.45 – -45.23	3
-45.23 – -41.02	11
-41.02 – -36.81	27
-36.81 – -32.59	91
-32.59 – -28.38	550
-28.38 – -24.17	436
-24.17 – -19.95	93
-19.95 – -15.74	35
-15.74 – -11.53	18
-11.53 – -7.313	19
-7.313 – -3.1	24
-3.1 – 1.113	6448
1.113 – 5.327	15
5.327 – 9.54	19
9.54 – 13.75	55
13.75 – 17.97	40
17.97 – 22.18	3197
22.18 – 26.39	315
26.39 – 30.61	2239
30.61 – 34.82	859
34.82 – 39.03	649
39.03 – 43.25	403
43.25 – 47.46	230
47.46 – 51.67	196
51.67 – 55.89	155
55.89 – 60.1	119
60.1 – 64.31	30
64.31 – 68.53	17
68.53 – 72.74	4
72.74 – 76.95	3
76.95 – 81.17	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
sid	text	0.0%
id	text	0.0%
position	numeric	0.0%
created_at	numeric	0.0%
created_meta	unknown	0.0%
updated_at	numeric	0.0%
updated_meta	unknown	0.0%
meta	categorical	0.0%
name	text	0.0%
id_1	numeric	0.0%
nametype	categorical	0.0%
recclass	categorical	0.0%
mass (g)	numeric	0.3%
fall	categorical	0.0%
year	categorical	0.6%
reclat	numeric	16.0%
reclong	numeric	16.0%
GeoLocation	text	16.0%
States	numeric	96.4%
Counties	numeric	96.4%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,211 detected strings).
lang	count	share
en	4209	100.0%
sh	2	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

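Pearson correlation across 9 numeric columns (values clipped to 2 decimals). The three constant columns (position, created_at, updated_at) have zero variance, so their correlations are undefined and shown as nan.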
	position	created_at	updated_at	id_1	mass (g)	reclat	reclong	States	Counties
position	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan
created_at	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan
updated_at	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan
id_1	+nan	+nan	+nan	+1.00	-0.04	+0.09	-0.18	-0.06	-0.09
mass (g)	+nan	+nan	+nan	-0.04	+1.00	+0.06	+0.02	+0.09	-0.01
reclat	+nan	+nan	+nan	+0.09	+0.06	+1.00	-0.56	+0.07	+0.02
reclong	+nan	+nan	+nan	-0.18	+0.02	-0.56	+1.00	-0.03	-0.08
States	+nan	+nan	+nan	-0.06	+0.09	+0.07	-0.03	+1.00	+0.15
Counties	+nan	+nan	+nan	-0.09	-0.01	+0.02	-0.08	+0.15	+1.00

sid text identifier

This column is a Socrata-style row identifier, recognizable from the 'row-XXXX.XXXX-XXXX' format visible in all sampled values. Every value is exactly 18 characters long (len_min = len_max = len_mean = 18.0), and all 45,716 rows are unique with a duplicate_rate of 0.0 and null_rate of 0.0 — a perfect surrogate key. No surprises in the data; it is entirely consistent with an auto-generated system identifier from a Socrata open-data platform.

Treatment: Drop before modelling; retain only if row-level traceability back to the source platform is needed.

anthropic:default · confidence high

Out[14]:

saturn.columns["sid"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	45,716
len_min	18
len_max	18
len_mean	18
len_median	18
len_p95	18
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	-5.68
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 9.

Character-length distribution for sid.

Character-length distribution for sid (mean: 18.0). Every value is exactly 18 characters long.
chars	count
18	45716

id text identifier

This column contains UUIDs (universally unique identifiers) serving as a primary key, with all 45,716 values being exactly 36 characters long and fully unique: zero duplicates, zero nulls. Notably, all sampled top values share the prefix '00000000-0000-0000-', which means the version field (the first hex digit of the third group) is zeroed out. That is atypical of standard UUID v4 generation and may indicate a custom or synthetic ID scheme. The allcaps_rate of 1.0 indicates that any hex letters are uppercase, which is worth noting if downstream systems compare IDs case-sensitively.

Treatment: Use as primary key for joins; do not encode or transform for modelling — drop or pass through as-is.

anthropic:default · confidence high
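
The zeroed-prefix observation can be checked with Python's standard `uuid` module. This is a minimal sketch: the sample ID is hypothetical (the report does not print raw IDs) and only reproduces the '00000000-0000-0000-' prefix described above.

```python
import uuid

def uuid_version_nibble(value: str) -> int:
    """Return the 4-bit version field of a UUID string (0 when zeroed out)."""
    parsed = uuid.UUID(value)  # raises ValueError on malformed input
    # The version occupies the top 4 bits of the third group (bits 76-79).
    return (parsed.int >> 76) & 0xF

# Hypothetical ID carrying the prefix noted in the commentary.
sample_id = "00000000-0000-0000-ABCD-1234567890AB"
version = uuid_version_nibble(sample_id)
is_standard_v4 = version == 4
```

A version of 0 means the IDs were not produced by a standard version-4 generator, but they remain usable as opaque join keys.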

Out[17]:

saturn.columns["id"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	45,716
len_min	36
len_max	36
len_mean	36
len_median	36
len_p95	36
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	65.38
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps

Fig 10.

Character-length distribution for id.

Character-length distribution for id (mean: 36.0). Every value is exactly 36 characters long.
chars	count
36	45716

position numeric other

This column, named 'position', is a numeric field that is entirely constant — every one of its 45,716 non-null values is exactly 0.0 (zero_rate = 1.0, n_unique = 1). It carries zero information and would contribute nothing to any model or analysis. This is flagged as a constant alert, confirming it is safe to drop.

Treatment: Drop immediately; constant value across all 45,716 rows provides no signal.

anthropic:default · confidence high
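
Several columns in this report (position, created_at, updated_at, meta) are constant. A minimal sketch of how such columns might be detected and dropped before modelling, using plain Python on a small illustrative table; the `fall` values and the three-row layout are made up, and only the constants are taken from the report.

```python
def constant_columns(table: dict[str, list]) -> list[str]:
    """Names of columns whose non-null values are all identical."""
    constant = []
    for name, values in table.items():
        distinct = {v for v in values if v is not None}
        if len(distinct) <= 1:
            constant.append(name)
    return constant

# Illustrative three-row table; not real rows from the CSV.
rows = {
    "position": [0, 0, 0],
    "created_at": [1446143734, 1446143734, 1446143734],
    "meta": ["{ }", "{ }", "{ }"],
    "fall": ["Found", "Fell", "Found"],
}
to_drop = constant_columns(rows)
kept = {name: col for name, col in rows.items() if name not in to_drop}
```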

Out[20]:

saturn.columns["position"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	1
min	0
max	0
mean	0
median	0
std	0
q1	0
q3	0
iqr	0
skew	0
kurtosis	0
n_outliers	0
outlier_rate	0
zero_rate	1
alert: constant	only one distinct value

Fig 11.

Distribution of position. Vertical dash marks the median.

Histogram bins for position (median: 0.0). All 45,716 values fall in the single bin below; the other 39 bins between -0.5 and 0.5 are empty.
bin	count
0 – 0.025	45716

created_at numeric timestamp

This column is a Unix timestamp named 'created_at', but every one of its 45,716 non-null rows holds the identical value 1446143734 (2015-10-29 18:35:34 UTC), triggering a 'constant' alert. With n_unique of 1, std of 0.0, and IQR of 0.0, the column has no variance, which suggests a bulk-load default, a data pipeline bug, or a one-time snapshot import where per-row timestamps were not captured. Whatever the cause, the column is useless as a temporal signal.

Treatment: Drop or flag as corrupted; investigate ETL pipeline for the source of the constant value before using as a temporal feature.

anthropic:default · confidence high
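
The constant epoch value can be decoded with the standard library to see exactly when the rows were stamped. A short sketch using only the value quoted in the report:

```python
from datetime import datetime, timezone

CREATED_AT = 1446143734  # the single value reported for created_at

# Interpret the integer as seconds since the Unix epoch, in UTC.
stamp = datetime.fromtimestamp(CREATED_AT, tz=timezone.utc)
iso = stamp.isoformat()
```

The result is a single instant on 29 October 2015, which is consistent with one bulk export or load rather than per-row creation times.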

Out[23]:

saturn.columns["created_at"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	1
min	1.446e+09
max	1.446e+09
mean	1.446e+09
median	1.446e+09
std	0
q1	1.446e+09
q3	1.446e+09
iqr	0
skew	0
kurtosis	0
n_outliers	0
outlier_rate	0
zero_rate	0
alert: constant	only one distinct value

Fig 12.

Distribution of created_at. Vertical dash marks the median.

Histogram bins for created_at (median: 1446143734.0). All values are identical, so the histogram has a single occupied bin.
bin	count
1446143734	45716

created_meta unknown metadata

The column 'created_meta' is likely a creation timestamp or metadata field associated with record provenance, but saturn classified it as 'unknown' kind and skipped all profiling, yielding zero stats and no uniqueness count. With 45,716 rows, zero nulls, and no further signal available, its actual dtype, distribution, and content cannot be assessed from this evidence alone.

Treatment: Inspect raw values to determine dtype (timestamp, JSON blob, string) before deciding on parsing, dropping, or feature extraction.

anthropic:default · confidence low

Out[26]:

saturn.columns["created_meta"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

updated_at numeric timestamp

This column is a Unix epoch timestamp named `updated_at`, representing a last-modified datetime for each row. Every one of the 45,716 non-null records holds the identical value 1446143734 (2015-10-29 18:35:34 UTC), the same value as created_at, so the column is constant and carries no information. This strongly suggests a bulk data load or migration event where all rows were stamped with the same timestamp rather than tracking real update times.

Treatment: Drop from modelling; flag to data owner as a likely ETL artefact — all 45,716 rows share the single value 1446143734.

anthropic:default · confidence high

Out[28]:

saturn.columns["updated_at"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	1
min	1.446e+09
max	1.446e+09
mean	1.446e+09
median	1.446e+09
std	0
q1	1.446e+09
q3	1.446e+09
iqr	0
skew	0
kurtosis	0
n_outliers	0
outlier_rate	0
zero_rate	0
alert: constant	only one distinct value

Fig 13.

Distribution of updated_at. Vertical dash marks the median.

Histogram bins for updated_at (median: 1446143734.0). All values are identical, so the histogram has a single occupied bin.
bin	count
1446143734	45716

updated_meta unknown other

The column 'updated_meta' was skipped by the profiler, yielding no stats, no uniqueness count, and no type resolution beyond 'unknown'. With 45,716 non-null rows and a null rate of 0.0, the column is fully populated, but its content, structure, and distribution are entirely opaque from the available evidence. The name suggests it may hold metadata update timestamps or serialized metadata objects (e.g., JSON blobs), but nothing in the evidence confirms this.

Treatment: Manually inspect raw values to determine type (timestamp, JSON, free text); re-profile after parsing or casting appropriately.

anthropic:default · confidence low

Out[31]:

saturn.columns["updated_meta"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

meta categorical metadata

This column is a metadata field that contains exclusively the empty object literal '{ }' across all 45,716 rows, with zero nulls and a cardinality of 1. It carries no information whatsoever — entropy is 0.0 and top_rate is 1.0, meaning every single record is identical. This is almost certainly an unfilled placeholder or a defaulted JSON field that was never populated.

Treatment: Drop before modelling; zero-variance column with no predictive or descriptive value.

anthropic:default · confidence high

Out[33]:

saturn.columns["meta"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	1
top_value	{ }
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 14.

Top values for meta.

Top values for meta (1 unique shown, of 1 total).
value	count	share
{ }	45716	100.0%

name text label

This column contains meteorite names, which by convention are taken from the place where the specimen was recovered: the top words ('range', 'hills', 'mountains', 'northwest', 'africa', 'grove', 'yamato', 'Queen Alexandra') are all typical components of named landforms or place names. Every one of the 45,716 rows has a distinct value (duplicate_rate 0.0, n_unique = 45,716) with zero nulls, so the column works as a natural identifier. The mean word count of 2.77 and median of 3.0 confirm multi-token names rather than single labels. 'yamato' is the top individual word with 3,317 occurrences, which points to a large group of specimens from one Antarctic collecting area (the Yamato Mountains) driving lexical repetition even though full names are unique.

Treatment: Use as a display label or natural key; tokenize on whitespace for gazetteer/NLP tasks, but drop before any ML feature matrix.

anthropic:default · confidence high
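
The whitespace tokenisation mentioned in the treatment can be sketched as follows. The sample names are hypothetical strings in the style of the column, not values read from the data, and `top_tokens` is an illustrative helper.

```python
from collections import Counter

def top_tokens(names: list[str], k: int = 3) -> list[tuple[str, int]]:
    """Count lowercase whitespace-separated tokens across all names."""
    counts = Counter(token.lower() for name in names for token in name.split())
    return counts.most_common(k)

# Hypothetical names; the real column has 45,716 distinct values.
sample_names = ["Yamato 0001", "Yamato 0002", "Example Hills 0003"]
tokens = top_tokens(sample_names, k=1)
```

Run against the full column, a count like this is what surfaces shared locality words such as 'yamato' even though no complete name repeats.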

Out[36]:

saturn.columns["name"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	45,716
len_min	2
len_max	28
len_mean	17.78
len_median	19
len_p95	27
word_mean	2.772
word_median	3
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	17,917
readability_flesch_mean	63.74
emoji_rate	0
url_rate	0
one_word_rate	0.04749
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings

Fig 15.

Character-length distribution for name.

Character-length distribution for name (mean: 17.78).
chars	count
2 – 3	2
3 – 3	16
3 – 4	0
4 – 5	97
5 – 5	224
5 – 6	0
6 – 7	420
7 – 7	430
7 – 8	0
8 – 8	449
8 – 9	958
9 – 10	0
10 – 10	1392
10 – 11	1235
11 – 12	0
12 – 12	3961
12 – 13	5478
13 – 14	0
14 – 14	297
14 – 15	0
15 – 16	1217
16 – 16	237
16 – 17	0
17 – 18	2772
18 – 18	2194
18 – 19	0
19 – 20	1826
20 – 20	3427
20 – 21	0
21 – 22	8562
22 – 22	5145
22 – 23	0
23 – 23	1466
23 – 24	33
24 – 25	0
25 – 25	405
25 – 26	21
26 – 27	0
27 – 27	3398
27 – 28	54

id_1 numeric identifier

This column is almost certainly a row or entity identifier: it has 45,716 unique values across 45,716 rows with zero nulls and zero duplicates, indicating a perfect 1-to-1 mapping. Values run from 1 to 57,458, suggesting either a sparse sequential ID (gaps exist since max > n) or a pre-filtered subset of a larger table. The near-uniform distribution (kurtosis −1.16, skew 0.27, zero outliers) is consistent with a sequential or pseudo-random integer key rather than a meaningful numeric feature.

Treatment: Retain as a join/lookup key; exclude from any modelling feature set.

anthropic:default · confidence high

Out[39]:

saturn.columns["id_1"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	45,716
min	1
max	57,458
mean	2.689e+04
median	2.426e+04
std	1.686e+04
q1	1.269e+04
q3	4.066e+04
iqr	27,968
skew	0.2665
kurtosis	-1.16
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 16.

Distribution of id_1. Vertical dash marks the median.

Histogram bins for id_1 (median: 24261.5).
bin	count
1 – 1437	1354
1437 – 2874	1151
2874 – 4310	814
4310 – 5747	1270
5747 – 7183	1416
7183 – 8620	1428
8620 – 1.006e+04	1433
1.006e+04 – 1.149e+04	1404
1.149e+04 – 1.293e+04	1394
1.293e+04 – 1.437e+04	1437
1.437e+04 – 1.58e+04	1415
1.58e+04 – 1.724e+04	1414
1.724e+04 – 1.867e+04	1420
1.867e+04 – 2.011e+04	1423
2.011e+04 – 2.155e+04	1437
2.155e+04 – 2.298e+04	1432
2.298e+04 – 2.442e+04	1368
2.442e+04 – 2.586e+04	1296
2.586e+04 – 2.729e+04	1368
2.729e+04 – 2.873e+04	900
2.873e+04 – 3.017e+04	1368
3.017e+04 – 3.16e+04	1078
3.16e+04 – 3.304e+04	529
3.304e+04 – 3.448e+04	763
3.448e+04 – 3.591e+04	1205
3.591e+04 – 3.735e+04	1049
3.735e+04 – 3.878e+04	582
3.878e+04 – 4.022e+04	810
4.022e+04 – 4.166e+04	392
4.166e+04 – 4.309e+04	0
4.309e+04 – 4.453e+04	186
4.453e+04 – 4.597e+04	1300
4.597e+04 – 4.74e+04	1281
4.74e+04 – 4.884e+04	1413
4.884e+04 – 5.028e+04	1129
5.028e+04 – 5.171e+04	1274
5.171e+04 – 5.315e+04	1022
5.315e+04 – 5.459e+04	1219
5.459e+04 – 5.602e+04	1275
5.602e+04 – 5.746e+04	1267

nametype categorical label

This column is a meteorite name-type flag. In the source catalogue, 'Valid' marks ordinary meteorite entries and 'Relict' marks objects that were once meteorites but have since been heavily altered by terrestrial weathering. The distribution is extremely imbalanced: 45,641 of 45,716 records (99.84%) are 'Valid', with only 75 'Relict' entries. The near-zero entropy (0.018) confirms this column carries almost no information, which triggered the imbalance alert.

Treatment: Exclude from predictive features due to near-zero variance; retain only as a filter to subset valid records if analysis should exclude relict entries.

anthropic:default · confidence high

Out[42]:

saturn.columns["nametype"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	2
top_value	Valid
top_rate	0.9984
cardinality	2
entropy	0.01754
entropy_ratio	0.01754
alert: imbalance	top value is 99.8% of rows

Fig 17.

Top values for nametype.

Top values for nametype (2 unique shown, of 2 total).
value	count	share
Valid	45641	99.8%
Relict	75	0.2%

recclass categorical label

This column contains meteorite classification codes, identifying the mineralogical and petrologic type of each recovered specimen. The top 7 values (L6, H5, L5, H6, H4, LL5, LL6) are all ordinary chondrite classes and together account for 33,771 records (73.9%), with L6 alone representing 18.1% of the 45,716 rows. Despite 466 unique classes, the entropy ratio of 0.51 indicates moderate concentration: every class from the ninth most common onward (H4/5, 428 records) holds under 1% of rows, so this long tail will create sparse dummy variables if one-hot encoded naively.

Treatment: Group rare classes (below a frequency threshold) into an 'Other' bucket before encoding, or use target/frequency encoding to handle the 466-cardinality tail.

anthropic:default · confidence high
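
The rare-class grouping suggested in the treatment can be sketched in plain Python. The threshold, the bucket name 'Other' and the sample counts are all illustrative assumptions; the report does not prescribe a cut-off.

```python
from collections import Counter

def group_rare(labels: list[str], min_share: float = 0.01, other: str = "Other") -> list[str]:
    """Replace labels whose share of rows is below min_share with one bucket."""
    counts = Counter(labels)
    total = len(labels)
    keep = {label for label, n in counts.items() if n / total >= min_share}
    return [label if label in keep else other for label in labels]

# Illustrative sample: 60 L6, 39 H5 and one rare class out of 100 rows.
sample = ["L6"] * 60 + ["H5"] * 39 + ["CV3"]
grouped = group_rare(sample, min_share=0.05)
grouped_counts = Counter(grouped)
```

With a 1% threshold on the real column, only the eight most common classes would keep their own label, so the threshold is worth tuning against the downstream task.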

Out[45]:

saturn.columns["recclass"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	466
top_value	L6
top_rate	0.1812
cardinality	466
entropy	4.548
entropy_ratio	0.5131

Fig 18.

Top values for recclass.

Top values for recclass (20 unique shown, of 466 total).
value	count	share
L6	8285	18.1%
H5	7142	15.6%
L5	4796	10.5%
H6	4528	9.9%
H4	4211	9.2%
LL5	2766	6.1%
LL6	2043	4.5%
L4	1253	2.7%
H4/5	428	0.9%
CM2	416	0.9%
H3	386	0.8%
L3	365	0.8%
CO3	335	0.7%
Ureilite	300	0.7%
Iron, IIIAB	285	0.6%
LL4	268	0.6%
CV3	256	0.6%
Diogenite	241	0.5%
Howardite	240	0.5%
LL	225	0.5%

mass (g) numeric feature

This column records the physical mass of each specimen in grams. The median is just 32.6 g while the mean reaches 13,278 g and the maximum 60,000,000 g (60 tonnes), so a tiny fraction of massive specimens dominates the aggregate statistics; skew of 76.9 and kurtosis of 6,796 confirm an extreme long tail. The outlier flag uses the 1.5 × IQR rule: with q1 = 7.2 and q3 = 202.6, the upper fence is 202.6 + 1.5 × 195.4 = 495.7 g, and 7,086 values (15.5% of the 45,585 non-null rows) lie above it. Most flagged values are therefore simply specimens heavier than about half a kilogram, not necessarily measurement errors, while only 41 specimens exceed 1.5 tonnes.

Treatment: log-transform (log1p) before modelling to compress the extreme right tail; consider capping or flagging the ~7,086 outliers separately.

anthropic:default · confidence high
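
The outlier fence and the log1p treatment can be reproduced from the reported quartiles alone. A minimal sketch, assuming the standard 1.5 × IQR rule named in the alert; `log_mass` is an illustrative helper, not part of saturn.

```python
import math

Q1, Q3 = 7.2, 202.6            # quartiles reported for mass (g)
IQR = Q3 - Q1                  # 195.4, matching the reported iqr
UPPER_FENCE = Q3 + 1.5 * IQR   # values above this are flagged as outliers
LOWER_FENCE = Q1 - 1.5 * IQR   # negative, so no mass can fall below it

def log_mass(mass_g):
    """log1p transform that passes missing values through unchanged."""
    return None if mass_g is None else math.log1p(mass_g)

median_log = log_mass(32.6)        # reported median
max_log = log_mass(60_000_000.0)   # reported maximum
```

On the log1p scale the median and the maximum sit roughly 14 units apart instead of seven orders of magnitude, which is the compression the treatment is after.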

Out[48]:

saturn.columns["mass (g)"].stats

stat	value
n	45,716
nulls	131 (0.3%)
unique	12,576
min	0
max	6e+07
mean	1.328e+04
median	32.6
std	5.75e+05
q1	7.2
q3	202.6
iqr	195.4
skew	76.91
kurtosis	6796
n_outliers	7,086
outlier_rate	0.1554
zero_rate	0.0004168
alert: high_skew	skew=+76.91
alert: outliers	15.5% rows beyond 1.5 IQR

Fig 19.

Distribution of mass (g). Vertical dash marks the median.

Histogram bins for mass (g) (median: 32.6).
bin	count
0 – 1.5e+06	45544
1.5e+06 – 3e+06	16
3e+06 – 4.5e+06	8
4.5e+06 – 6e+06	1
6e+06 – 7.5e+06	1
7.5e+06 – 9e+06	1
9e+06 – 1.05e+07	2
1.05e+07 – 1.2e+07	0
1.2e+07 – 1.35e+07	0
1.35e+07 – 1.5e+07	0
1.5e+07 – 1.65e+07	2
1.65e+07 – 1.8e+07	0
1.8e+07 – 1.95e+07	0
1.95e+07 – 2.1e+07	0
2.1e+07 – 2.25e+07	1
2.25e+07 – 2.4e+07	1
2.4e+07 – 2.55e+07	2
2.55e+07 – 2.7e+07	1
2.7e+07 – 2.85e+07	1
2.85e+07 – 3e+07	0
3e+07 – 3.15e+07	1
3.15e+07 – 3.3e+07	0
3.3e+07 – 3.45e+07	0
3.45e+07 – 3.6e+07	0
3.6e+07 – 3.75e+07	0
3.75e+07 – 3.9e+07	0
3.9e+07 – 4.05e+07	0
4.05e+07 – 4.2e+07	0
4.2e+07 – 4.35e+07	0
4.35e+07 – 4.5e+07	0
4.5e+07 – 4.65e+07	0
4.65e+07 – 4.8e+07	0
4.8e+07 – 4.95e+07	0
4.95e+07 – 5.1e+07	1
5.1e+07 – 5.25e+07	0
5.25e+07 – 5.4e+07	0
5.4e+07 – 5.55e+07	0
5.55e+07 – 5.7e+07	0
5.7e+07 – 5.85e+07	1
5.85e+07 – 6e+07	1

fall categorical label

This column captures whether a meteorite was discovered on the ground ('Found') versus observed falling ('Fell'), making it a binary classification label for meteorite recovery type. The distribution is severely imbalanced: 'Found' accounts for 97.6% of 45,716 records (44,609), while 'Fell' represents only 2.4% (1,107). The entropy ratio of 0.164 is far below the maximum of 1.0 for a balanced binary column, which is why the imbalance alert fired. A model using this as a target will likely need class weighting or resampling to avoid simply predicting 'Found'.

Treatment: Apply class-balancing (e.g., SMOTE or class weights) before using as a classification target.

anthropic:default · confidence high
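
The entropy ratio and a simple class-weighting scheme can be worked out from the two counts. This sketch assumes the ratio is Shannon entropy in bits divided by log2 of the number of classes, which reproduces the values reported here; the weights use the common "balanced" heuristic n_samples / (n_classes × class_count).

```python
import math

def entropy_ratio(counts: list[int]) -> float:
    """Shannon entropy in bits divided by its maximum, log2(number of classes)."""
    total = sum(counts)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts if n > 0)
    return entropy / math.log2(len(counts)) if len(counts) > 1 else 0.0

fall_counts = {"Found": 44609, "Fell": 1107}
fall_ratio = entropy_ratio(list(fall_counts.values()))

# "Balanced" class weights: rarer classes get proportionally larger weights.
n_rows = sum(fall_counts.values())
weights = {label: n_rows / (len(fall_counts) * n) for label, n in fall_counts.items()}
```

Under this heuristic each 'Fell' record counts about as much as forty 'Found' records, which is one alternative to synthetic resampling such as SMOTE.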

Out[51]:

saturn.columns["fall"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	2
top_value	Found
top_rate	0.9758
cardinality	2
entropy	0.1645
entropy_ratio	0.1645
alert: imbalance	top value is 97.6% of rows

Fig 20.

Top values for fall.

Top values for fall (2 unique shown, of 2 total).
value	count	share
Found	44609	97.6%
Fell	1107	2.4%

year categorical timestamp

This column represents a calendar year, stored as full ISO-8601 timestamps normalised to January 1st of each year (e.g. '2003-01-01T00:00:00'), so the month, day, and time components carry no information. Despite being profiled as categorical, it is effectively an annual time dimension. The 20 top values shown span 1979 to 2010, but cardinality is 266 distinct values, so the full range must be much wider than those 32 years, or the column contains some malformed or unexpected entries worth inspecting. The top year (2003) accounts for just 7.3% of rows, indicating a reasonably spread distribution rather than heavy concentration.

Treatment: Parse to date/integer year, check the minimum and maximum of the 266 distinct values for anomalies, then use as a time feature or group-by dimension.

anthropic:default · confidence high
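
The parsing step in the treatment can be sketched with plain string handling. Slicing the first four characters is a deliberate choice here: it sidesteps datetime types with a limited range (pandas nanosecond timestamps, for example, only cover roughly 1677 to 2262), which could matter if the 266 distinct values include very old or erroneous years. The sample list is illustrative, and `parse_year` is a hypothetical helper.

```python
def parse_year(value):
    """Extract the integer year from an ISO-8601 string; None if missing or malformed."""
    if not value:
        return None
    head = value.strip()[:4]
    return int(head) if head.isdigit() else None

# Two values from the top-values table plus some deliberately bad inputs.
raw = ["2003-01-01T00:00:00", "1979-01-01T00:00:00", "", None, "not-a-date"]
years = [parse_year(v) for v in raw]
valid = [y for y in years if y is not None]
span = (min(valid), max(valid))
```

Checking `span` on the full column is a quick way to see whether the 266 values reflect a long historical record or stray entries.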

Out[54]:

saturn.columns["year"].stats

stat	value
n	45,716
nulls	291 (0.6%)
unique	266
top_value	2003-01-01T00:00:00
top_rate	0.07315
cardinality	266
entropy	5.299
entropy_ratio	0.6578

Fig 21.

Top values for year.

Top values for year (20 unique shown, of 266 total).
value	count	share
2003-01-01T00:00:00	3323	7.3%
1979-01-01T00:00:00	3046	6.7%
1998-01-01T00:00:00	2697	5.9%
2006-01-01T00:00:00	2456	5.4%
1988-01-01T00:00:00	2296	5.0%
2002-01-01T00:00:00	2078	4.5%
2004-01-01T00:00:00	1940	4.2%
2000-01-01T00:00:00	1792	3.9%
1997-01-01T00:00:00	1696	3.7%
1999-01-01T00:00:00	1691	3.7%
2001-01-01T00:00:00	1650	3.6%
1990-01-01T00:00:00	1518	3.3%
2009-01-01T00:00:00	1497	3.3%
1986-01-01T00:00:00	1375	3.0%
2007-01-01T00:00:00	1189	2.6%
2010-01-01T00:00:00	1005	2.2%
1993-01-01T00:00:00	979	2.1%
2008-01-01T00:00:00	957	2.1%
1987-01-01T00:00:00	916	2.0%
1991-01-01T00:00:00	877	1.9%

reclat numeric feature

This column represents the recorded latitude of meteorite find/fall locations, with values ranging from -87.37° to +81.17°, within valid latitude bounds. The median of -71.5° shows that more than half of the located records sit in high southern latitudes (likely Antarctic recovery sites). Q3 is exactly 0.0° and the zero_rate is 16.8%, so about 6,440 of the 38,401 non-null values are exactly zero. A cluster that size on the equator is implausible and suggests zeros are placeholders for missing coordinates rather than true equatorial finds; reclong shows a similar zero_rate (16.2%), which fits that reading. Kurtosis of -1.48 reflects a flat, multi-peaked distribution rather than a normal one.

Treatment: Treat zero values as missing (mask alongside existing nulls before modelling); after masking, use for geospatial analysis or pair with longitude for coordinate-based features.

anthropic:default · confidence high
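
Masking placeholder coordinates might look like the sketch below. It assumes the placeholders are rows where latitude and longitude are both exactly zero; the report gives only per-column zero rates, so that pairing should be confirmed on the raw data. The coordinate pairs are hypothetical.

```python
def mask_null_island(lat, lon):
    """Return (None, None) when both coordinates are zero or either is missing."""
    if lat is None or lon is None:
        return (None, None)
    if lat == 0.0 and lon == 0.0:
        return (None, None)
    return (lat, lon)

# Hypothetical coordinate pairs for illustration.
pairs = [(-71.5, 35.67), (0.0, 0.0), (0.0, 35.67), (None, None)]
cleaned = [mask_null_island(lat, lon) for lat, lon in pairs]
```

Requiring both coordinates to be zero keeps any genuine find that happens to lie on the equator or on the prime meridian.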

Out[57]:

saturn.columns["reclat"].stats

stat	value
n	45,716
nulls	7,315 (16.0%)
unique	12,738
min	-87.37
max	81.17
mean	-39.12
median	-71.5
std	46.38
q1	-76.71
q3	0
iqr	76.71
skew	0.4916
kurtosis	-1.477
n_outliers	0
outlier_rate	0
zero_rate	0.1677

Fig 22.

Distribution of reclat. Vertical dash marks the median.

Histogram bins for reclat (median: -71.5).
bin	count
-87.37 – -83.15	7090
-83.15 – -78.94	1218
-78.94 – -74.73	4083
-74.73 – -70.51	9707
-70.51 – -66.3	1
-66.3 – -62.09	0
-62.09 – -57.87	0
-57.87 – -53.66	1
-53.66 – -49.45	0
-49.45 – -45.23	3
-45.23 – -41.02	11
-41.02 – -36.81	27
-36.81 – -32.59	91
-32.59 – -28.38	550
-28.38 – -24.17	436
-24.17 – -19.95	93
-19.95 – -15.74	35
-15.74 – -11.53	18
-11.53 – -7.313	19
-7.313 – -3.1	24
-3.1 – 1.113	6448
1.113 – 5.327	15
5.327 – 9.54	19
9.54 – 13.75	55
13.75 – 17.97	40
17.97 – 22.18	3197
22.18 – 26.39	315
26.39 – 30.61	2239
30.61 – 34.82	859
34.82 – 39.03	649
39.03 – 43.25	403
43.25 – 47.46	230
47.46 – 51.67	196
51.67 – 55.89	155
55.89 – 60.1	119
60.1 – 64.31	30
64.31 – 68.53	17
68.53 – 72.74	4
72.74 – 76.95	3
76.95 – 81.17	1

reclong numeric feature

This column records the longitude of each meteorite find or fall location, with values ranging from -165.43° to 354.47°. The maximum of 354.47° is surprising: longitude conventionally spans -180° to 180°, so at least one record appears to use a 0–360° convention and will be misplaced on a map unless normalised. The histogram shows a single record in the top bin (341.5 to 354.5) and none between 185.5 and 341.5, so the problem looks isolated. The zero_rate of 16.2% points to placeholder zeros rather than genuine locations on the prime meridian, and these come on top of the 16.0% of rows that are already null. The distribution is roughly symmetric (skew -0.17, kurtosis -0.73) with a large IQR of 157.17°, consistent with a globally spread geographic variable.

Treatment: Normalise values > 180 to the -180 to 180 range (subtract 360), then treat zero values as missing and impute or exclude them before spatial analysis.
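
The two treatment steps can be sketched as one small function. This is an illustration under stated assumptions, not Saturn's own logic: `clean_longitude` is a hypothetical helper, values above 180 are assumed to follow a 0–360° convention, and exact zeros are assumed to be placeholders.

```python
from typing import Optional


def clean_longitude(lon: Optional[float]) -> Optional[float]:
    """Normalise a longitude to the -180..180 range and mask placeholder zeros.

    Assumptions: values above 180 use a 0-360 convention, and an exact 0.0
    stands for an unknown coordinate. None passes through unchanged.
    """
    if lon is None:
        return None
    if lon > 180.0:
        lon -= 360.0  # e.g. the reported maximum 354.47 becomes about -5.53
    return None if lon == 0.0 else lon


print(clean_longitude(354.47))
print(clean_longitude(0.0))
print(clean_longitude(-165.43))
```

Normalising before masking keeps the two rules independent; a value of exactly 360 would wrap to zero and then be masked, which may or may not be desired.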

anthropic:default · confidence high

Out[60]:

saturn.columns["reclong"].stats

stat	value
n	45,716
nulls	7,315 (16.0%)
unique	14,640
min	-165.4
max	354.5
mean	61.07
median	35.67
std	80.65
q1	0
q3	157.2
iqr	157.2
skew	-0.1745
kurtosis	-0.7312
n_outliers	0
outlier_rate	0
zero_rate	0.1618

Fig 23.

Distribution of reclong. Vertical dash marks the median.

Histogram bins for reclong (median: 35.66667).
bin	count
-165.4 – -152.4	30
-152.4 – -139.4	228
-139.4 – -126.4	6
-126.4 – -113.4	444
-113.4 – -100.4	795
-100.4 – -87.45	462
-87.45 – -74.45	214
-74.45 – -61.45	1386
-61.45 – -48.45	57
-48.45 – -35.46	33
-35.46 – -22.46	2
-22.46 – -9.461	76
-9.461 – 3.536	6696
3.536 – 16.53	2208
16.53 – 29.53	1782
29.53 – 42.53	5243
42.53 – 55.53	1818
55.53 – 68.52	1420
68.52 – 81.52	2616
81.52 – 94.52	78
94.52 – 107.5	45
107.5 – 120.5	131
120.5 – 133.5	483
133.5 – 146.5	178
146.5 – 159.5	4052
159.5 – 172.5	7724
172.5 – 185.5	193
185.5 – 198.5	0
198.5 – 211.5	0
211.5 – 224.5	0
224.5 – 237.5	0
237.5 – 250.5	0
250.5 – 263.5	0
263.5 – 276.5	0
276.5 – 289.5	0
289.5 – 302.5	0
302.5 – 315.5	0
315.5 – 328.5	0
328.5 – 341.5	0
341.5 – 354.5	1

GeoLocation text feature

GeoLocation stores geographic coordinates as serialized Python list strings in the format [None, '', '', None, False], where the two quoted slots hold the coordinates as numeric strings. It appears to be a structured geo-point object flattened to text. The most common value, [None, '0.0', '0.0', None, False], appears 6,214 times and is almost certainly a null/unknown sentinel rather than a genuine location at 0°, 0°, so true missingness exceeds the 16% null rate. The duplicate rate is high at 55.5% (21,301 duplicates across 17,100 unique values), consistent with many records sharing the same coordinates. The column should be parsed into numeric latitude and longitude fields rather than used as raw text.

Treatment: Parse the serialized list string to extract latitude (index 1) and longitude (index 2) as separate numeric columns, confirming the slot order against reclat and reclong, and treat '0.0', '0.0' as missing.
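
A minimal parsing sketch using only the standard library, assuming every non-null value is a Python list literal of the shape described above. `parse_geolocation` is a hypothetical helper. It returns the two coordinate slots in list order and leaves naming them to the caller, so the result should be checked against reclat and reclong before the slots are labelled.

```python
import ast
from typing import Optional, Tuple

Pair = Tuple[Optional[float], Optional[float]]


def parse_geolocation(raw: Optional[str]) -> Pair:
    """Return the values at indices 1 and 2 of a GeoLocation string as floats.

    Returns (None, None) for nulls, for strings that do not parse, and for
    the '0.0', '0.0' pair, which is assumed to be a missing-value sentinel.
    """
    if raw is None:
        return (None, None)
    try:
        parts = ast.literal_eval(raw)  # safe evaluation of a literal list
        first, second = float(parts[1]), float(parts[2])
    except (ValueError, SyntaxError, TypeError, IndexError):
        return (None, None)
    if first == 0.0 and second == 0.0:
        return (None, None)
    return (first, second)


# Illustrative strings in the same shape as the column values.
print(parse_geolocation("[None, '-72.0', '35.5', None, False]"))
print(parse_geolocation("[None, '0.0', '0.0', None, False]"))
print(parse_geolocation(None))
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, which matters when the strings come from an external file.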

anthropic:default · confidence high

Out[63]:

saturn.columns["GeoLocation"].stats

stat	value
n	45,716
nulls	7,315 (16.0%)
unique	17,100
len_min	33
len_max	47
len_mean	40.3
len_median	41
len_p95	45
word_mean	5
word_median	5
n_empty	0
n_duplicates	21,301
duplicate_rate	0.5547
vocab_size	15,461
readability_flesch_mean	117.2
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: duplicates	55.5% duplicate strings

Fig 24.

Character-length distribution for GeoLocation.

Character-length distribution for GeoLocation (mean: 40.3).
chars	count
33	6214
34	27
35	139
36	1693
37	3283
38	527
39	488
40	5488
41	2086
42	3512
43	3760
44	2811
45	7678
46	672
47	23

States numeric

Out[66]:

saturn.columns["States"].stats

stat	value
n	45,716
nulls	44,057 (96.4%)
unique	45
min	1
max	51
mean	17.34
median	15
std	10.41
q1	9
q3	23
iqr	14
skew	1.115
kurtosis	0.6891
n_outliers	40
outlier_rate	0.02411
zero_rate	0
alert: null_rate	96.4% null

Fig 25.

Distribution of States. Vertical dash marks the median.

Histogram bins for States (median: 15.0).
bin	count
1 – 2.25	13
2.25 – 3.5	9
3.5 – 4.75	2
4.75 – 6	6
6 – 7.25	125
7.25 – 8.5	224
8.5 – 9.75	87
9.75 – 11	95
11 – 12.25	229
12.25 – 13.5	20
13.5 – 14.75	14
14.75 – 16	15
16 – 17.25	146
17.25 – 18.5	23
18.5 – 19.75	49
19.75 – 21	40
21 – 22.25	19
22.25 – 23.5	297
23.5 – 24.75	4
24.75 – 26	1
26 – 27.25	0
27.25 – 28.5	0
28.5 – 29.75	17
29.75 – 31	5
31 – 32.25	29
32.25 – 33.5	6
33.5 – 34.75	10
34.75 – 36	12
36 – 37.25	55
37.25 – 38.5	12
38.5 – 39.75	23
39.75 – 41	14
41 – 42.25	18
42.25 – 43.5	0
43.5 – 44.75	0
44.75 – 46	3
46 – 47.25	11
47.25 – 48.5	8
48.5 – 49.75	5
49.75 – 51	13

Counties numeric

Out[69]:

saturn.columns["Counties"].stats

stat	value
n	45,716
nulls	44,057 (96.4%)
unique	662
min	5
max	3,210
mean	1353
median	1,195
std	994.1
q1	482
q3	2,113
iqr	1,631
skew	0.2374
kurtosis	-1.19
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	96.4% null

Fig 26.

Distribution of Counties. Vertical dash marks the median.

Histogram bins for Counties (median: 1195.0).
bin	count
5 – 85.12	303
85.12 – 165.2	8
165.2 – 245.4	17
245.4 – 325.5	26
325.5 – 405.6	20
405.6 – 485.8	64
485.8 – 565.9	8
565.9 – 646	34
646 – 726.1	20
726.1 – 806.2	113
806.2 – 886.4	59
886.4 – 966.5	46
966.5 – 1047	49
1047 – 1127	25
1127 – 1207	40
1207 – 1287	57
1287 – 1367	28
1367 – 1447	25
1447 – 1527	16
1527 – 1608	10
1608 – 1688	13
1688 – 1768	11
1768 – 1848	8
1848 – 1928	13
1928 – 2008	198
2008 – 2088	25
2088 – 2168	21
2168 – 2248	28
2248 – 2329	28
2329 – 2409	69
2409 – 2489	18
2489 – 2569	34
2569 – 2649	11
2649 – 2729	34
2729 – 2809	19
2809 – 2890	18
2890 – 2970	28
2970 – 3050	19
3050 – 3130	24
3130 – 3210	72

How to cite