wild-nasa_meteorites · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/nasa_meteorites.csv

Saturn profiled 45,716 rows across 20 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

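# Install the profiler (notebook shell magic).
!pip install saturn-dissect

import subprocess

# Profile the CSV, write the findings to JSON, and request the opt-in LLM prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/nasa_meteorites.csv",
    "--findings", "wild-nasa_meteorites.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # check=True raises CalledProcessError on a non-zero exit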

Summary confidence: high

This is a NASA meteorites dataset with 45,716 records and 20 columns covering each meteorite's name, classification, mass, fall type, year, and geographic coordinates. The most interesting signals are physical and categorical: mass (g) is extremely skewed (mean ~13,278 g vs median 32.6 g, max 60,000,000 g) with ~15.5% of values flagged as outliers, and recclass is dominated by ordinary chondrites (L6 at 18.1%, followed by H5, L5, H6, H4). The fall column is heavily imbalanced (97.6% 'Found' vs 2.4% 'Fell'), and year is concentrated in recent decades, peaking at 2003 (3,323 records). Note that Counties and States are 96% null, several columns (position, created_at, updated_at, meta) are constant and can be ignored, and GeoLocation has 55% duplicate values driven largely by repeated coordinates, the most frequent being the placeholder (0.0, 0.0) at 6,214 rows.

citing: mass (g) · recclass · fall · year · nametype · Counties · States · GeoLocation · created_at · position · meta

Out[4]:

saturn.schema() · 20 columns

column	kind	n	null%	unique	alerts
sid	text	45,716	0.0%	45,716	near_unique one_word short_text
id	text	45,716	0.0%	45,716	near_unique one_word allcaps
position	numeric	45,716	0.0%	1	constant
created_at	numeric	45,716	0.0%	1	constant
created_meta	unknown	45,716	0.0%	—	skipped
updated_at	numeric	45,716	0.0%	1	constant
updated_meta	unknown	45,716	0.0%	—	skipped
meta	categorical	45,716	0.0%	1	imbalance
name	text	45,716	0.0%	45,716	near_unique
id_1	numeric	45,716	0.0%	45,716
nametype	categorical	45,716	0.0%	2	imbalance
recclass	categorical	45,716	0.0%	466
mass (g)	numeric	45,716	0.3%	12,576	high_skew outliers
fall	categorical	45,716	0.0%	2	imbalance
year	categorical	45,716	0.6%	266
reclat	numeric	45,716	16.0%	12,738
reclong	numeric	45,716	16.0%	14,640
GeoLocation	text	45,716	16.0%	17,100	duplicates
States	numeric	45,716	96.4%	45	null_rate
Counties	numeric	45,716	96.4%	662	null_rate

Fig 1.

mass (g) · Extreme right skew — most meteorites are tiny (median 32.6g) but a long tail reaches 60 million grams.

Histogram bins for mass (g) (median: 32.6).
bin	count
0 – 1.5e+06	45544
1.5e+06 – 3e+06	16
3e+06 – 4.5e+06	8
4.5e+06 – 6e+06	1
6e+06 – 7.5e+06	1
7.5e+06 – 9e+06	1
9e+06 – 1.05e+07	2
1.05e+07 – 1.2e+07	0
1.2e+07 – 1.35e+07	0
1.35e+07 – 1.5e+07	0
1.5e+07 – 1.65e+07	2
1.65e+07 – 1.8e+07	0
1.8e+07 – 1.95e+07	0
1.95e+07 – 2.1e+07	0
2.1e+07 – 2.25e+07	1
2.25e+07 – 2.4e+07	1
2.4e+07 – 2.55e+07	2
2.55e+07 – 2.7e+07	1
2.7e+07 – 2.85e+07	1
2.85e+07 – 3e+07	0
3e+07 – 3.15e+07	1
3.15e+07 – 3.3e+07	0
3.3e+07 – 3.45e+07	0
3.45e+07 – 3.6e+07	0
3.6e+07 – 3.75e+07	0
3.75e+07 – 3.9e+07	0
3.9e+07 – 4.05e+07	0
4.05e+07 – 4.2e+07	0
4.2e+07 – 4.35e+07	0
4.35e+07 – 4.5e+07	0
4.5e+07 – 4.65e+07	0
4.65e+07 – 4.8e+07	0
4.8e+07 – 4.95e+07	0
4.95e+07 – 5.1e+07	1
5.1e+07 – 5.25e+07	0
5.25e+07 – 5.4e+07	0
5.4e+07 – 5.55e+07	0
5.55e+07 – 5.7e+07	0
5.7e+07 – 5.85e+07	1
5.85e+07 – 6e+07	1

Fig 2.

recclass · Ordinary chondrites dominate the top classifications (L6 at 18.1%, H5 at 15.6%), with a long tail across 466 classes in total.

Top values for recclass (20 unique shown, of 466 total).
value	count	share
L6	8285	18.1%
H5	7142	15.6%
L5	4796	10.5%
H6	4528	9.9%
H4	4211	9.2%
LL5	2766	6.1%
LL6	2043	4.5%
L4	1253	2.7%
H4/5	428	0.9%
CM2	416	0.9%
H3	386	0.8%
L3	365	0.8%
CO3	335	0.7%
Ureilite	300	0.7%
Iron, IIIAB	285	0.6%
LL4	268	0.6%
CV3	256	0.6%
Diogenite	241	0.5%
Howardite	240	0.5%
LL	225	0.5%

Fig 3.

fall · Heavy imbalance: 97.6% were 'Found' versus only 2.4% observed 'Fell'.

Top values for fall (2 unique shown, of 2 total).
value	count	share
Found	44609	97.6%
Fell	1107	2.4%

Fig 4.

year · Year of record is concentrated in recent decades, peaking around 2003, 1979, and 1998.

Top values for year (20 unique shown, of 266 total).
value	count	share
2003-01-01T00:00:00	3323	7.3%
1979-01-01T00:00:00	3046	6.7%
1998-01-01T00:00:00	2697	5.9%
2006-01-01T00:00:00	2456	5.4%
1988-01-01T00:00:00	2296	5.0%
2002-01-01T00:00:00	2078	4.5%
2004-01-01T00:00:00	1940	4.2%
2000-01-01T00:00:00	1792	3.9%
1997-01-01T00:00:00	1696	3.7%
1999-01-01T00:00:00	1691	3.7%
2001-01-01T00:00:00	1650	3.6%
1990-01-01T00:00:00	1518	3.3%
2009-01-01T00:00:00	1497	3.3%
1986-01-01T00:00:00	1375	3.0%
2007-01-01T00:00:00	1189	2.6%
2010-01-01T00:00:00	1005	2.2%
1993-01-01T00:00:00	979	2.1%
2008-01-01T00:00:00	957	2.1%
1987-01-01T00:00:00	916	2.0%
1991-01-01T00:00:00	877	1.9%

Fig 5.

nametype · Nearly all entries are 'Valid' (99.8%); only 75 'Relict' meteorites stand out.

Top values for nametype (2 unique shown, of 2 total).
value	count	share
Valid	45641	99.8%
Relict	75	0.2%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
sid	text	0.0%
id	text	0.0%
position	numeric	0.0%
created_at	numeric	0.0%
created_meta	unknown	0.0%
updated_at	numeric	0.0%
updated_meta	unknown	0.0%
meta	categorical	0.0%
name	text	0.0%
id_1	numeric	0.0%
nametype	categorical	0.0%
recclass	categorical	0.0%
mass (g)	numeric	0.3%
fall	categorical	0.0%
year	categorical	0.6%
reclat	numeric	16.0%
reclong	numeric	16.0%
GeoLocation	text	16.0%
States	numeric	96.4%
Counties	numeric	96.4%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,211 detected strings).
lang	count	share
en	4209	100.0%
sh	2	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 9 numeric columns (values clipped to 2 decimals).
	position	created_at	updated_at	id_1	mass (g)	reclat	reclong	States	Counties
position	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan
created_at	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan
updated_at	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan
id_1	+nan	+nan	+nan	+1.00	-0.04	+0.09	-0.18	-0.06	-0.09
mass (g)	+nan	+nan	+nan	-0.04	+1.00	+0.06	+0.02	+0.09	-0.01
reclat	+nan	+nan	+nan	+0.09	+0.06	+1.00	-0.56	+0.07	+0.02
reclong	+nan	+nan	+nan	-0.18	+0.02	-0.56	+1.00	-0.03	-0.08
States	+nan	+nan	+nan	-0.06	+0.09	+0.07	-0.03	+1.00	+0.15
Counties	+nan	+nan	+nan	-0.09	-0.01	+0.02	-0.08	+0.15	+1.00
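
The +nan entries all involve position, created_at, or updated_at. Each of those columns holds one distinct value, so its standard deviation is zero and Pearson's r, which divides by it, is undefined. A minimal sketch of that behaviour in plain Python; the `pearson` helper is illustrative and is not saturn's implementation:

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan  # undefined: the denominator would be zero
    return cov / math.sqrt(var_x * var_y)

constant = [0, 0, 0, 0]            # like position: one distinct value
varying = [1.0, 2.0, 3.0, 4.0]     # illustrative values only
print(pearson(constant, varying))  # nan
print(pearson(varying, varying))   # 1.0
```

Dropping the three constant columns before computing the matrix would leave a 6 x 6 table with no undefined cells.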

sid text identifier

This is a synthetic row identifier: every one of the 45716 values is unique, exactly 18 characters long, single-token, and follows a 'row-xxxx-xxxx-xxxx' pattern. There are no nulls, duplicates, or empties, confirming it functions as a primary key rather than a feature.

Treatment: Drop from modelling; retain only as a join key or row index.

anthropic:claude-opus-4-7 · confidence high

Out[14]:

saturn.columns["sid"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	45,716
len_min	18
len_max	18
len_mean	18
len_median	18
len_p95	18
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	-5.68
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 9.

Character-length distribution for sid.

Character-length distribution for sid (mean: 18.0).
chars	count
18	45716

id text identifier

This column is a row identifier holding 36-character UUID-style strings, all uppercase and one token wide. Every one of the 45,716 values is unique with zero nulls or duplicates, and length is fixed at exactly 36 characters across min, median, and max. The shared `00000000-0000-0000-` prefix on all sampled values is notable — only the latter half of each UUID varies, suggesting a namespaced or truncated-entropy ID scheme rather than fully random v4 UUIDs.

Treatment: Drop from modelling features; retain only as a join key or row reference.

anthropic:claude-opus-4-7 · confidence high

Out[17]:

saturn.columns["id"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	45,716
len_min	36
len_max	36
len_mean	36
len_median	36
len_p95	36
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	65.38
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps

Fig 10.

Character-length distribution for id.

Character-length distribution for id (mean: 36.0).
chars	count
36	45716

position numeric other

The column 'position' is numeric but holds a single value across all 45716 rows: every entry is 0, giving a zero_rate of 1.0 and n_unique of 1. With zero variance (std 0.0, iqr 0.0), it carries no information for any downstream task.

Treatment: Drop, constant column with no variance.

anthropic:claude-opus-4-7 · confidence high
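
Constant columns such as position can be detected mechanically before modelling. A minimal sketch, assuming the rows are available as a list of dicts; `constant_columns` and the sample rows are hypothetical and not part of saturn:

```python
def constant_columns(rows):
    """Return names of columns that hold a single distinct value across all rows."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            seen.setdefault(name, set()).add(value)
    return [name for name, values in seen.items() if len(values) == 1]

# Tiny illustrative sample, not real rows from the CSV.
rows = [
    {"position": 0, "created_at": 1446143734, "fall": "Found"},
    {"position": 0, "created_at": 1446143734, "fall": "Fell"},
]
print(constant_columns(rows))  # ['position', 'created_at']
```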

Out[20]:

saturn.columns["position"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	1
min	0
max	0
mean	0
median	0
std	0
q1	0
q3	0
iqr	0
skew	0
kurtosis	0
n_outliers	0
outlier_rate	0
zero_rate	1
alert: constant	only one distinct value

Fig 11.

Distribution of position. Vertical dash marks the median.

Histogram bins for position (median: 0.0).
value	count
0	45716

created_at numeric timestamp

This column appears to be a Unix epoch creation timestamp (1446143734 corresponds to a single moment in late 2015), stored as a numeric value. Across all 45716 rows it holds exactly one value, with std 0.0 and n_unique 1, so it carries no information to differentiate records. The 'constant' alert confirms there is no variation to model or filter on.

Treatment: Drop; constant column adds no signal.

anthropic:claude-opus-4-7 · confidence high
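
To check the date reading, the value can be converted with the standard library. This sketch assumes the number is seconds since the Unix epoch and interprets it in UTC:

```python
from datetime import datetime, timezone

CREATED_AT = 1446143734  # the single value reported for created_at

# Assumption: seconds (not milliseconds) since 1970-01-01, read as UTC.
moment = datetime.fromtimestamp(CREATED_AT, tz=timezone.utc)
print(moment.isoformat())  # 2015-10-29T18:35:34+00:00
```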

Out[23]:

saturn.columns["created_at"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	1
min	1.446e+09
max	1.446e+09
mean	1.446e+09
median	1.446e+09
std	0
q1	1.446e+09
q3	1.446e+09
iqr	0
skew	0
kurtosis	0
n_outliers	0
outlier_rate	0
zero_rate	0
alert: constant	only one distinct value

Fig 12.

Distribution of created_at. Vertical dash marks the median.

Histogram bins for created_at (median: 1446143734.0).
value	count
1446143734	45716

created_meta unknown metadata

The column `created_meta` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 45716 and a null_rate of 0.0. The name suggests it carries creation-time metadata (e.g., a user id or system tag attached to record creation), but this cannot be confirmed from the evidence. No further signal is present to assess distribution, uniqueness, or drift.

Treatment: Re-profile with parsing enabled before deciding; otherwise drop until contents are characterised.

anthropic:claude-opus-4-7 · confidence low

Out[26]:

saturn.columns["created_meta"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

updated_at numeric timestamp

This column is almost certainly a Unix epoch timestamp recording a row update time, with the single value 1446143734 (late 2015) repeated across all 45716 rows. With n_unique=1, std=0, and identical min/median/max, it carries no information—every record was stamped at the same instant, suggesting a bulk export or a field that was never actually updated per-row.

Treatment: Drop; constant column provides no signal.

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["updated_at"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	1
min	1.446e+09
max	1.446e+09
mean	1.446e+09
median	1.446e+09
std	0
q1	1.446e+09
q3	1.446e+09
iqr	0
skew	0
kurtosis	0
n_outliers	0
outlier_rate	0
zero_rate	0
alert: constant	only one distinct value

Fig 13.

Distribution of updated_at. Vertical dash marks the median.

Histogram bins for updated_at (median: 1446143734.0).
value	count
1446143734	45716

updated_meta unknown metadata

The column `updated_meta` was skipped by the profiler, so no type inference, uniqueness count, or value statistics are available. The only confirmed signals are 45716 rows with a null rate of 0.0, but the actual content and structure remain uncharacterised. The name suggests it may hold update-related metadata (e.g., a timestamp, user, or nested struct), yet this is not supported by evidence.

Treatment: Re-profile with an appropriate parser before deciding; do not feed into modelling until its type is known.

anthropic:claude-opus-4-7 · confidence low

Out[31]:

saturn.columns["updated_meta"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

meta categorical metadata

This 'meta' column is a constant placeholder: every one of the 45,716 rows holds the same '{ }' value, giving a cardinality of 1 and entropy of 0. There is no information to extract here, likely a vestigial JSON metadata field that was never populated.

Treatment: Drop; the column is constant and carries zero signal.

anthropic:claude-opus-4-7 · confidence high

Out[33]:

saturn.columns["meta"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	1
top_value	{ }
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 14.

Top values for meta.

Top values for meta (1 unique shown, of 1 total).
value	count	share
{ }	45716	100.0%

name text identifier

This is a short text column of meteorite names, which read as place or feature names: every one of the 45,716 rows is unique with zero nulls, averaging 17.8 characters and 2.8 words. Top tokens like 'yamato', 'range', 'northwest', 'hills', 'mountains', 'queen alexandra', and 'grove' are toponyms (mountain ranges, hills, regions), consistent with names taken from find localities. With n_unique equal to n, it functions as an identifier rather than a categorical feature.

Treatment: Drop for modelling; retain as a label/key for joins or display.

anthropic:claude-opus-4-7 · confidence high

Out[36]:

saturn.columns["name"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	45,716
len_min	2
len_max	28
len_mean	17.78
len_median	19
len_p95	27
word_mean	2.772
word_median	3
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	17,917
readability_flesch_mean	63.74
emoji_rate	0
url_rate	0
one_word_rate	0.04749
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings

Fig 15.

Character-length distribution for name.

Character-length distribution for name (mean: 17.78).
chars	count
2	2
3	16
4	97
5	224
6	420
7	430
8	449
9	958
10	1392
11	1235
12	3961
13	5478
14	297
15	1217
16	237
17	2772
18	2194
19	1826
20	3427
21	8562
22	5145
23	1466
24	33
25	405
26	21
27	3398
28	54

id_1 numeric identifier

id_1 is almost certainly a row identifier: 45,716 unique values across 45,716 rows, no nulls, ranging from 1 to 57,458 with a near-uniform spread (kurtosis -1.16, mild skew 0.27). The max (57,458) exceeds the row count, so, if the ids are integers, 11,742 values in that range are unused, consistent with a primary key carried over from a larger source table.

Treatment: drop from modelling; retain only for joins or row tracing.

anthropic:claude-opus-4-7 · confidence high

Out[39]:

saturn.columns["id_1"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	45,716
min	1
max	57,458
mean	2.689e+04
median	2.426e+04
std	1.686e+04
q1	1.269e+04
q3	4.066e+04
iqr	27,968
skew	0.2665
kurtosis	-1.16
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 16.

Distribution of id_1. Vertical dash marks the median.

Histogram bins for id_1 (median: 24261.5).
bin	count
1 – 1437	1354
1437 – 2874	1151
2874 – 4310	814
4310 – 5747	1270
5747 – 7183	1416
7183 – 8620	1428
8620 – 1.006e+04	1433
1.006e+04 – 1.149e+04	1404
1.149e+04 – 1.293e+04	1394
1.293e+04 – 1.437e+04	1437
1.437e+04 – 1.58e+04	1415
1.58e+04 – 1.724e+04	1414
1.724e+04 – 1.867e+04	1420
1.867e+04 – 2.011e+04	1423
2.011e+04 – 2.155e+04	1437
2.155e+04 – 2.298e+04	1432
2.298e+04 – 2.442e+04	1368
2.442e+04 – 2.586e+04	1296
2.586e+04 – 2.729e+04	1368
2.729e+04 – 2.873e+04	900
2.873e+04 – 3.017e+04	1368
3.017e+04 – 3.16e+04	1078
3.16e+04 – 3.304e+04	529
3.304e+04 – 3.448e+04	763
3.448e+04 – 3.591e+04	1205
3.591e+04 – 3.735e+04	1049
3.735e+04 – 3.878e+04	582
3.878e+04 – 4.022e+04	810
4.022e+04 – 4.166e+04	392
4.166e+04 – 4.309e+04	0
4.309e+04 – 4.453e+04	186
4.453e+04 – 4.597e+04	1300
4.597e+04 – 4.74e+04	1281
4.74e+04 – 4.884e+04	1413
4.884e+04 – 5.028e+04	1129
5.028e+04 – 5.171e+04	1274
5.171e+04 – 5.315e+04	1022
5.315e+04 – 5.459e+04	1219
5.459e+04 – 5.602e+04	1275
5.602e+04 – 5.746e+04	1267

nametype categorical feature

This is a binary categorical flag distinguishing meteorite name types, with values 'Valid' and 'Relict'. The distribution is extremely lopsided: 45,641 of 45,716 rows (99.84%) are 'Valid' and only 75 are 'Relict', yielding an entropy ratio of just 0.018. With effectively no variation, this column carries almost no information for modelling.

Treatment: Drop or retain only as a rare-event indicator; near-constant for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[42]:

saturn.columns["nametype"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	2
top_value	Valid
top_rate	0.9984
cardinality	2
entropy	0.01754
entropy_ratio	0.01754
alert: imbalance	top value is 99.8% of rows

Fig 17.

Top values for nametype.

Top values for nametype (2 unique shown, of 2 total).
value	count	share
Valid	45641	99.8%
Relict	75	0.2%

recclass categorical label

This column holds meteorite classification codes (recclass), with 466 distinct classes across 45,716 records and no nulls. The distribution is dominated by ordinary chondrites: L6 (18.1%), H5, L5, H6, and H4 together account for about 63% of records, while the long tail (entropy ratio 0.51) thins quickly, with every class from the ninth-ranked H4/5 (428 entries) onward holding under 1% each. High cardinality combined with concentrated top categories suggests a classic taxonomic hierarchy (H/L/LL groups with petrologic types).

Treatment: Group rare classes into an 'other' bucket or roll up to parent groups (H/L/LL/C) before modelling.

anthropic:claude-opus-4-7 · confidence high
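
One way to apply the 'other' bucket is a share threshold. A minimal sketch, assuming the labels are held in a plain list; `bucket_rare`, the `min_share` cutoff, and the sample data are hypothetical choices rather than anything saturn prescribes:

```python
from collections import Counter

def bucket_rare(labels, min_share=0.01, other="other"):
    """Replace labels whose share of the data is below min_share with `other`."""
    counts = Counter(labels)
    total = len(labels)
    keep = {label for label, n in counts.items() if n / total >= min_share}
    return [label if label in keep else other for label in labels]

# Illustrative sample only: 100 labels, not the real column.
sample = ["L6"] * 60 + ["H5"] * 39 + ["CM2"] * 1
bucketed = bucket_rare(sample, min_share=0.05)
print(Counter(bucketed))  # Counter({'L6': 60, 'H5': 39, 'other': 1})
```

With a 1% cutoff on the real column, the reported shares imply that the eight classes from L6 through L4 would be kept and the remaining 458 merged.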

Out[45]:

saturn.columns["recclass"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	466
top_value	L6
top_rate	0.1812
cardinality	466
entropy	4.548
entropy_ratio	0.5131

Fig 18.

Top values for recclass.

Top values for recclass (20 unique shown, of 466 total).
value	count	share
L6	8285	18.1%
H5	7142	15.6%
L5	4796	10.5%
H6	4528	9.9%
H4	4211	9.2%
LL5	2766	6.1%
LL6	2043	4.5%
L4	1253	2.7%
H4/5	428	0.9%
CM2	416	0.9%
H3	386	0.8%
L3	365	0.8%
CO3	335	0.7%
Ureilite	300	0.7%
Iron, IIIAB	285	0.6%
LL4	268	0.6%
CV3	256	0.6%
Diogenite	241	0.5%
Howardite	240	0.5%
LL	225	0.5%

mass (g) numeric feature

Numeric mass measurements in grams across 45,716 rows, with a median of just 32.6 g but a maximum of 60,000,000 g, roughly six orders of magnitude above the median. The distribution is extremely heavy-tailed (skew 76.9, kurtosis ~6796) and 15.5% of values flag as outliers, while the std (574,988) dwarfs the IQR (195.4). Nulls (0.29%) and zeros (0.04%) are negligible in number, though the zeros rule out a plain logarithm.

Treatment: log-transform with log(1 + x) before any modelling or distance-based analysis, since a plain log is undefined for the zero-mass rows.

anthropic:claude-opus-4-7 · confidence high
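
A minimal sketch of the transform using only the standard library. `log_mass` is a hypothetical helper, and representing missing masses as `None` is an assumption about how the column was loaded:

```python
import math

def log_mass(mass_g):
    """log(1 + x) of a mass in grams; None (missing) passes through unchanged."""
    if mass_g is None:
        return None
    return math.log1p(mass_g)

# Reported min, median, and max for mass (g), plus a missing value.
for grams in (0.0, 32.6, 60_000_000.0, None):
    print(grams, log_mass(grams))
```

`math.log1p(0.0)` is 0.0, so zero-mass rows stay finite instead of mapping to negative infinity.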

Out[48]:

saturn.columns["mass (g)"].stats

stat	value
n	45,716
nulls	131 (0.3%)
unique	12,576
min	0
max	6e+07
mean	1.328e+04
median	32.6
std	5.75e+05
q1	7.2
q3	202.6
iqr	195.4
skew	76.91
kurtosis	6796
n_outliers	7,086
outlier_rate	0.1554
zero_rate	0.0004168
alert: high_skew	skew=+76.91
alert: outliers	15.5% rows beyond 1.5 IQR
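
The outlier alert can be reproduced from the quartiles above. This sketch assumes saturn applies the usual Tukey fences (Q1 - 1.5 x IQR and Q3 + 1.5 x IQR), which the alert text suggests but does not confirm:

```python
Q1, Q3 = 7.2, 202.6  # quartiles reported for mass (g)

iqr = Q3 - Q1
lower_fence = Q1 - 1.5 * iqr
upper_fence = Q3 + 1.5 * iqr

def is_outlier(mass_g):
    """True when a value falls outside the assumed 1.5 * IQR fences."""
    return mass_g < lower_fence or mass_g > upper_fence

print(round(iqr, 1), round(lower_fence, 1), round(upper_fence, 1))  # 195.4 -285.9 495.7
print(is_outlier(32.6), is_outlier(60_000_000))  # False True
```

Because mass cannot be negative, the lower fence never triggers; under this assumption all 7,086 flagged rows are simply meteorites heavier than about 496 g.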

Fig 19.

Distribution of mass (g). Vertical dash marks the median.

Histogram bins for mass (g) (median: 32.6).
bin	count
0 – 1.5e+06	45544
1.5e+06 – 3e+06	16
3e+06 – 4.5e+06	8
4.5e+06 – 6e+06	1
6e+06 – 7.5e+06	1
7.5e+06 – 9e+06	1
9e+06 – 1.05e+07	2
1.05e+07 – 1.2e+07	0
1.2e+07 – 1.35e+07	0
1.35e+07 – 1.5e+07	0
1.5e+07 – 1.65e+07	2
1.65e+07 – 1.8e+07	0
1.8e+07 – 1.95e+07	0
1.95e+07 – 2.1e+07	0
2.1e+07 – 2.25e+07	1
2.25e+07 – 2.4e+07	1
2.4e+07 – 2.55e+07	2
2.55e+07 – 2.7e+07	1
2.7e+07 – 2.85e+07	1
2.85e+07 – 3e+07	0
3e+07 – 3.15e+07	1
3.15e+07 – 3.3e+07	0
3.3e+07 – 3.45e+07	0
3.45e+07 – 3.6e+07	0
3.6e+07 – 3.75e+07	0
3.75e+07 – 3.9e+07	0
3.9e+07 – 4.05e+07	0
4.05e+07 – 4.2e+07	0
4.2e+07 – 4.35e+07	0
4.35e+07 – 4.5e+07	0
4.5e+07 – 4.65e+07	0
4.65e+07 – 4.8e+07	0
4.8e+07 – 4.95e+07	0
4.95e+07 – 5.1e+07	1
5.1e+07 – 5.25e+07	0
5.25e+07 – 5.4e+07	0
5.4e+07 – 5.55e+07	0
5.55e+07 – 5.7e+07	0
5.7e+07 – 5.85e+07	1
5.85e+07 – 6e+07	1

fall categorical feature

Binary categorical flag distinguishing meteorites that were observed falling versus those found later, with only two values: "Found" and "Fell". The split is severely imbalanced — "Found" accounts for 44609 of 45716 rows (top_rate 0.9758) while "Fell" has just 1107, yielding an entropy_ratio of 0.164. No nulls are present.

Treatment: Encode as binary; stratify or rebalance before modelling given the 40:1 class skew.

anthropic:claude-opus-4-7 · confidence high
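
The entropy figures can be checked from the two counts. This sketch assumes entropy is Shannon entropy in bits and that entropy_ratio divides it by log2 of the cardinality; the reported values are consistent with that reading, but the helper names are illustrative:

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

fall_counts = [44609, 1107]  # Found, Fell
print(round(entropy_ratio(fall_counts), 4))  # 0.1645
```

For a two-value column the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide in the stats table.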

Out[51]:

saturn.columns["fall"].stats

stat	value
n	45,716
nulls	0 (0.0%)
unique	2
top_value	Found
top_rate	0.9758
cardinality	2
entropy	0.1645
entropy_ratio	0.1645
alert: imbalance	top value is 97.6% of rows

Fig 20.

Top values for fall.

Top values for fall (2 unique shown, of 2 total).
value	count	share
Found	44609	97.6%
Fell	1107	2.4%

year categorical timestamp

Stored as January-1 timestamps, this column encodes a year-of-record across 45,716 rows with 266 distinct values and a 0.64% null rate. Despite being labeled 'year', the values are full datetimes pinned to YYYY-01-01, which will surprise anyone expecting integer years. The distribution is moderately spread (entropy ratio 0.66) with 2003 the modal year at 7.3% of rows, followed by 1979 and 1998.

Treatment: Cast to integer year (or proper date) before using as a temporal feature.

anthropic:claude-opus-4-7 · confidence high
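
A minimal sketch of the cast, assuming the values arrive as ISO strings like the ones in the top-values table and that nulls are `None` or empty strings; `to_year` is a hypothetical helper:

```python
from datetime import datetime

def to_year(value):
    """Return the integer year from an ISO timestamp string, or None if missing."""
    if value is None or value == "":
        return None
    return datetime.fromisoformat(value).year

print(to_year("2003-01-01T00:00:00"))  # 2003
print(to_year(None))                   # None
```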

Out[54]:

saturn.columns["year"].stats

stat	value
n	45,716
nulls	291 (0.6%)
unique	266
top_value	2003-01-01T00:00:00
top_rate	0.07315
cardinality	266
entropy	5.299
entropy_ratio	0.6578

Fig 21.

Top values for year.

Top values for year (20 unique shown, of 266 total).
value	count	share
2003-01-01T00:00:00	3323	7.3%
1979-01-01T00:00:00	3046	6.7%
1998-01-01T00:00:00	2697	5.9%
2006-01-01T00:00:00	2456	5.4%
1988-01-01T00:00:00	2296	5.0%
2002-01-01T00:00:00	2078	4.5%
2004-01-01T00:00:00	1940	4.2%
2000-01-01T00:00:00	1792	3.9%
1997-01-01T00:00:00	1696	3.7%
1999-01-01T00:00:00	1691	3.7%
2001-01-01T00:00:00	1650	3.6%
1990-01-01T00:00:00	1518	3.3%
2009-01-01T00:00:00	1497	3.3%
1986-01-01T00:00:00	1375	3.0%
2007-01-01T00:00:00	1189	2.6%
2010-01-01T00:00:00	1005	2.2%
1993-01-01T00:00:00	979	2.1%
2008-01-01T00:00:00	957	2.1%
1987-01-01T00:00:00	916	2.0%
1991-01-01T00:00:00	877	1.9%

reclat numeric feature

This is the recorded latitude of each meteorite in decimal degrees, ranging from -87.37 to 81.17. The distribution leans heavily toward the southern hemisphere with a median of -71.5 and a Q3 of exactly 0.0, and 16.8% of non-null values are exactly zero, which more likely marks placeholder or unknown coordinates than finds on the equator. About 16% of rows are null, and the multi-modal shape (kurtosis -1.48) reflects clusters at Antarctic latitudes, at zero, and in the northern low and mid-latitudes.

Treatment: Treat exact zeros as missing and pair with reclong for geospatial use.

anthropic:claude-opus-4-7 · confidence high

Out[57]:

saturn.columns["reclat"].stats

stat	value
n	45,716
nulls	7,315 (16.0%)
unique	12,738
min	-87.37
max	81.17
mean	-39.12
median	-71.5
std	46.38
q1	-76.71
q3	0
iqr	76.71
skew	0.4916
kurtosis	-1.477
n_outliers	0
outlier_rate	0
zero_rate	0.1677

Fig 22.

Distribution of reclat. Vertical dash marks the median.

Histogram bins for reclat (median: -71.5).
bin	count
-87.37 – -83.15	7090
-83.15 – -78.94	1218
-78.94 – -74.73	4083
-74.73 – -70.51	9707
-70.51 – -66.3	1
-66.3 – -62.09	0
-62.09 – -57.87	0
-57.87 – -53.66	1
-53.66 – -49.45	0
-49.45 – -45.23	3
-45.23 – -41.02	11
-41.02 – -36.81	27
-36.81 – -32.59	91
-32.59 – -28.38	550
-28.38 – -24.17	436
-24.17 – -19.95	93
-19.95 – -15.74	35
-15.74 – -11.53	18
-11.53 – -7.313	19
-7.313 – -3.1	24
-3.1 – 1.113	6448
1.113 – 5.327	15
5.327 – 9.54	19
9.54 – 13.75	55
13.75 – 17.97	40
17.97 – 22.18	3197
22.18 – 26.39	315
26.39 – 30.61	2239
30.61 – 34.82	859
34.82 – 39.03	649
39.03 – 43.25	403
43.25 – 47.46	230
47.46 – 51.67	196
51.67 – 55.89	155
55.89 – 60.1	119
60.1 – 64.31	30
64.31 – 68.53	17
68.53 – 72.74	4
72.74 – 76.95	3
76.95 – 81.17	1

reclong numeric feature

Longitude coordinate for meteorite recovery sites, ranging from -165.43 to 354.47 with median 35.67. The maximum exceeds 180, which is out of range for standard longitude and suggests an un-normalized or erroneous value; the histogram shows a single row above 185.5. Exactly zero accounts for 16.2% of non-null values, which matches the 6,214 rows whose GeoLocation is (0.0, 0.0), so these look like placeholder coordinates that sit on top of the 16% null rate rather than explaining it.

Treatment: Normalize longitudes to [-180,180], treat 0/0 pairs as missing, then use as a geospatial feature.

anthropic:claude-opus-4-7 · confidence high
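
A minimal sketch of that cleanup for one coordinate pair. It assumes the out-of-range value uses a 0 to 360 convention (so wrapping is the right fix) and that a 0/0 pair is a placeholder; `clean_coordinates` is a hypothetical helper:

```python
def clean_coordinates(lat, lon):
    """Return (lat, lon) with lon wrapped to [-180, 180), or None if unusable."""
    if lat is None or lon is None:
        return None
    if lat == 0.0 and lon == 0.0:
        return None  # assumed placeholder, not a real find at 0 N, 0 E
    wrapped = (lon + 180.0) % 360.0 - 180.0
    return lat, wrapped

# Illustrative latitude paired with the reported maximum longitude.
print(clean_coordinates(10.0, 354.47))  # longitude becomes approximately -5.53
print(clean_coordinates(0.0, 0.0))      # None
```

If the 354.47 value turns out to be a data-entry error rather than a convention difference, setting it to missing would be the safer choice.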

Out[60]:

saturn.columns["reclong"].stats

stat	value
n	45,716
nulls	7,315 (16.0%)
unique	14,640
min	-165.4
max	354.5
mean	61.07
median	35.67
std	80.65
q1	0
q3	157.2
iqr	157.2
skew	-0.1745
kurtosis	-0.7312
n_outliers	0
outlier_rate	0
zero_rate	0.1618

Fig 23.

Distribution of reclong. Vertical dash marks the median.

Histogram bins for reclong (median: 35.66667).
bin	count
-165.4 – -152.4	30
-152.4 – -139.4	228
-139.4 – -126.4	6
-126.4 – -113.4	444
-113.4 – -100.4	795
-100.4 – -87.45	462
-87.45 – -74.45	214
-74.45 – -61.45	1386
-61.45 – -48.45	57
-48.45 – -35.46	33
-35.46 – -22.46	2
-22.46 – -9.461	76
-9.461 – 3.536	6696
3.536 – 16.53	2208
16.53 – 29.53	1782
29.53 – 42.53	5243
42.53 – 55.53	1818
55.53 – 68.52	1420
68.52 – 81.52	2616
81.52 – 94.52	78
94.52 – 107.5	45
107.5 – 120.5	131
120.5 – 133.5	483
133.5 – 146.5	178
146.5 – 159.5	4052
159.5 – 172.5	7724
172.5 – 185.5	193
185.5 – 198.5	0
198.5 – 211.5	0
211.5 – 224.5	0
224.5 – 237.5	0
237.5 – 250.5	0
250.5 – 263.5	0
263.5 – 276.5	0
276.5 – 289.5	0
289.5 – 302.5	0
302.5 – 315.5	0
315.5 – 328.5	0
328.5 – 341.5	0
341.5 – 354.5	1

GeoLocation text feature

Serialised Python list literals encoding geolocation tuples of the form [None, lat, lon, None, False], with 45716 rows, 16% nulls and only 17100 distinct values. Duplication is severe (duplicate_rate 0.55, 21301 duplicates), and the top value '[None, 0.0, 0.0, None, False]' appears 6214 times suggesting placeholder coordinates. Lengths are tightly bounded (min 33, max 47) consistent with a fixed serialisation rather than free text.

Treatment: Parse the list literal into separate latitude and longitude numeric columns and treat 0.0/0.0 as missing.

anthropic:claude-opus-4-7 · confidence high
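
A minimal sketch of the parse using `ast.literal_eval`, which evaluates literals only and does not execute code. It assumes latitude and longitude sit at positions 1 and 2 as described above; the numbers may be stored as floats or as quoted strings, so both are passed through `float()`. `parse_geolocation` and the sample strings are hypothetical:

```python
import ast

def parse_geolocation(text):
    """Parse a '[None, lat, lon, None, False]' literal into (lat, lon) floats."""
    if text is None:
        return None
    fields = ast.literal_eval(text)
    if fields[1] is None or fields[2] is None:
        return None
    lat, lon = float(fields[1]), float(fields[2])  # float() also accepts '12.5'
    if lat == 0.0 and lon == 0.0:
        return None  # treat the 0.0/0.0 placeholder as missing
    return lat, lon

print(parse_geolocation("[None, -71.5, 35.67, None, False]"))  # (-71.5, 35.67)
print(parse_geolocation("[None, 0.0, 0.0, None, False]"))      # None
```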

Out[63]:

saturn.columns["GeoLocation"].stats

stat	value
n	45,716
nulls	7,315 (16.0%)
unique	17,100
len_min	33
len_max	47
len_mean	40.3
len_median	41
len_p95	45
word_mean	5
word_median	5
n_empty	0
n_duplicates	21,301
duplicate_rate	0.5547
vocab_size	15,461
readability_flesch_mean	117.2
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: duplicates	55.5% duplicate strings

Fig 24.

Character-length distribution for GeoLocation.

Character-length distribution for GeoLocation (mean: 40.30).
chars	count
33	6214
34	27
35	139
36	1693
37	3283
38	527
39	488
40	5488
41	2086
42	3512
43	3760
44	2811
45	7678
46	672
47	23

States numeric feature

Numeric column 'States' takes 45 distinct integer values between 1 and 51 with a median of 15, strongly suggesting encoded US state identifiers rather than a true quantity. The column is 96.37% null, so it carries information for fewer than 4% of rows, and the right skew (1.11) reflects uneven coverage across the encoded states. Treating the mean of 17.3 as meaningful would be a mistake given the categorical nature.

Treatment: Cast to categorical state codes and impute or flag the 96% missing before modelling.

anthropic:claude-opus-4-7 · confidence high
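
A minimal sketch of the cast with an explicit missing flag, assuming the codes load as floats with `None` or NaN for nulls; `encode_state` and the 'missing' label are hypothetical:

```python
import math

def encode_state(code):
    """Turn a numeric state code into a string label plus a missingness flag."""
    missing = code is None or (isinstance(code, float) and math.isnan(code))
    if missing:
        return "missing", True
    return str(int(code)), False

print(encode_state(15.0))          # ('15', False)
print(encode_state(float("nan")))  # ('missing', True)
print(encode_state(None))          # ('missing', True)
```

Keeping the codes as strings prevents downstream code from averaging or ordering them as if they were quantities.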

Out[66]:

saturn.columns["States"].stats

stat	value
n	45,716
nulls	44,057 (96.4%)
unique	45
min	1
max	51
mean	17.34
median	15
std	10.41
q1	9
q3	23
iqr	14
skew	1.115
kurtosis	0.6891
n_outliers	40
outlier_rate	0.02411
zero_rate	0
alert: null_rate	96.4% null

Fig 25.

Distribution of States. Vertical dash marks the median.

Histogram bins for States (median: 15.0).
bin	count
1 – 2.25	13
2.25 – 3.5	9
3.5 – 4.75	2
4.75 – 6	6
6 – 7.25	125
7.25 – 8.5	224
8.5 – 9.75	87
9.75 – 11	95
11 – 12.25	229
12.25 – 13.5	20
13.5 – 14.75	14
14.75 – 16	15
16 – 17.25	146
17.25 – 18.5	23
18.5 – 19.75	49
19.75 – 21	40
21 – 22.25	19
22.25 – 23.5	297
23.5 – 24.75	4
24.75 – 26	1
26 – 27.25	0
27.25 – 28.5	0
28.5 – 29.75	17
29.75 – 31	5
31 – 32.25	29
32.25 – 33.5	6
33.5 – 34.75	10
34.75 – 36	12
36 – 37.25	55
37.25 – 38.5	12
38.5 – 39.75	23
39.75 – 41	14
41 – 42.25	18
42.25 – 43.5	0
43.5 – 44.75	0
44.75 – 46	3
46 – 47.25	11
47.25 – 48.5	8
48.5 – 49.75	5
49.75 – 51	13

Counties numeric feature

Numeric column 'Counties' is populated for only 3.6% of the 45,716 rows (null_rate 0.9637), with 662 unique values ranging from 5 to 3210 and a roughly symmetric distribution (skew 0.24, kurtosis -1.19, mean 1353 vs median 1195). The values look like county counts or county FIPS-style codes rather than a continuous measurement, and the overwhelming sparsity is the headline issue. No outliers or zeros are flagged.

Treatment: Impute or add a missingness indicator; given 96% nulls, consider dropping unless the populated subset is analytically meaningful.

anthropic:claude-opus-4-7 · confidence medium

Out[69]:

saturn.columns["Counties"].stats

stat	value
n	45,716
nulls	44,057 (96.4%)
unique	662
min	5
max	3,210
mean	1353
median	1,195
std	994.1
q1	482
q3	2,113
iqr	1,631
skew	0.2374
kurtosis	-1.19
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	96.4% null

Fig 26.

Distribution of Counties. Vertical dash marks the median.

Histogram bins for Counties (median: 1195.0).
bin	count
5 – 85.12	303
85.12 – 165.2	8
165.2 – 245.4	17
245.4 – 325.5	26
325.5 – 405.6	20
405.6 – 485.8	64
485.8 – 565.9	8
565.9 – 646	34
646 – 726.1	20
726.1 – 806.2	113
806.2 – 886.4	59
886.4 – 966.5	46
966.5 – 1047	49
1047 – 1127	25
1127 – 1207	40
1207 – 1287	57
1287 – 1367	28
1367 – 1447	25
1447 – 1527	16
1527 – 1608	10
1608 – 1688	13
1688 – 1768	11
1768 – 1848	8
1848 – 1928	13
1928 – 2008	198
2008 – 2088	25
2088 – 2168	21
2168 – 2248	28
2248 – 2329	28
2329 – 2409	69
2409 – 2489	18
2489 – 2569	34
2569 – 2649	11
2649 – 2729	34
2729 – 2809	19
2809 – 2890	18
2890 – 2970	28
2970 – 3050	19
3050 – 3130	24
3130 – 3210	72
