natural_hazards-meteorites · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/natural_hazards/meteorites.json

Saturn profiled 1,097 rows across 10 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

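# Install the profiler (notebook shell escape), then run it on the source file.
!pip install saturn-dissect
import subprocess

subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/natural_hazards/meteorites.json",
    "--findings", "natural_hazards-meteorites.json",  # file name passed for the findings
    "--llm", "anthropic:claude-opus-4-7",  # model used for the opt-in prose
])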

Summary confidence: high

This is a 1,097-row catalogue of witnessed meteorite falls, with each record carrying a name, description, date, lat/long coordinates and a meteorite class. Two columns (category and fall_type) are constant — every record is a 'witnessed_meteorite_falls' event with fall_type 'Fell' — so the analytic interest sits elsewhere. Meteorite class is the most informative categorical: 125 distinct classes but heavily concentrated, with L6 alone accounting for ~24% of falls and H5 the next largest at ~15%. Latitude is skewed toward the northern hemisphere (median 36.1, mean 30.0) with ~8% flagged as outliers, while longitude spreads broadly across the globe (-157.9 to 174.4). Start with meteorite_class to understand the dominant compositions, then look at the lat/long pair to see geographic coverage.

citing: row_count · column_count · category.top_value · fall_type.top_value · meteorite_class.n_unique · meteorite_class.top_values · meteorite_class.top_rate · latitude.mean · latitude.median · latitude.outlier_rate · longitude.min · longitude.max · date.n_unique · date.top_values

Out[4]:

saturn.schema() · 10 columns

column	kind	n	null%	unique	alerts
latitude	numeric	1,097	0.0%	958	outliers
longitude	numeric	1,097	0.0%	1,030
name	text	1,097	0.0%	1,097	near_unique one_word short_text
description	text	1,097	0.0%	1,097	near_unique
category	categorical	1,097	0.0%	1	imbalance
date	categorical	1,097	1.7%	231
country	unknown	1,097	0.0%	—	skipped
mass_g	unknown	1,097	0.0%	—	skipped
meteorite_class	categorical	1,097	0.0%	125
fall_type	categorical	1,097	0.0%	1	imbalance

Fig 1.

meteorite_class · L6 dominates at ~24% of falls, with H5 and H6 the next most common — note the long tail of 125 classes.

Top values for meteorite_class (20 unique shown, of 125 total).
value	count	share
L6	260	23.7%
H5	163	14.9%
H6	91	8.3%
L5	76	6.9%
H4	50	4.6%
LL6	41	3.7%
Stone-uncl	39	3.6%
OC	24	2.2%
LL5	19	1.7%
Eucrite-mmict	18	1.6%
L4	18	1.6%
Howardite	16	1.5%
CM2	15	1.4%
H	13	1.2%
L	10	0.9%
Iron, IIIAB	10	0.9%
Aubrite	9	0.8%
Diogenite	8	0.7%
EL6	8	0.7%
CV3	7	0.6%

Fig 2.

latitude · Distribution leans toward northern mid-latitudes (median 36.1) with a left tail of southern-hemisphere outliers.

Histogram bins for latitude (median: 36.1).
bin	count
-44.12 – -40.77	2
-40.77 – -37.42	1
-37.42 – -34.07	3
-34.07 – -30.73	31
-30.73 – -27.38	14
-27.38 – -24.03	14
-24.03 – -20.68	9
-20.68 – -17.34	11
-17.34 – -13.99	6
-13.99 – -10.64	4
-10.64 – -7.295	11
-7.295 – -3.948	18
-3.948 – -0.6002	9
-0.6002 – 2.747	10
2.747 – 6.095	6
6.095 – 9.442	13
9.442 – 12.79	34
12.79 – 16.14	32
16.14 – 19.48	20
19.48 – 22.83	34
22.83 – 26.18	57
26.18 – 29.53	59
29.53 – 32.87	61
32.87 – 36.22	96
36.22 – 39.57	79
39.57 – 42.92	78
42.92 – 46.26	119
46.26 – 49.61	81
49.61 – 52.96	92
52.96 – 56.31	53
56.31 – 59.65	22
59.65 – 63	14
63 – 66.35	4

Fig 3.

longitude · Longitude spans the full globe (-158 to 174); look for clustering around populated landmasses where falls get reported.

Histogram bins for longitude (median: 18.71667).
bin	count
-157.9 – -147.8	2
-147.8 – -137.7	0
-137.7 – -127.7	1
-127.7 – -117.6	7
-117.6 – -107.5	11
-107.5 – -97.45	44
-97.45 – -87.39	49
-87.39 – -77.32	51
-77.32 – -67.25	26
-67.25 – -57.18	21
-57.18 – -47.11	14
-47.11 – -37.04	7
-37.04 – -26.97	2
-26.97 – -16.91	0
-16.91 – -6.836	23
-6.836 – 3.232	102
3.232 – 13.3	135
13.3 – 23.37	88
23.37 – 33.44	91
33.44 – 43.51	59
43.51 – 53.58	25
53.58 – 63.64	11
63.64 – 73.71	29
73.71 – 83.78	104
83.78 – 93.85	31
93.85 – 103.9	13
103.9 – 114	40
114 – 124.1	38
124.1 – 134.1	28
134.1 – 144.2	33
144.2 – 154.3	9
154.3 – 164.3	1
164.3 – 174.4	2

Fig 4.

date · 231 distinct fall dates spread fairly evenly; 1933 is the single busiest year with 17 recorded falls.

Top values for date (20 unique shown, of 231 total).
value	count	share
1933-01-01	17	1.5%
1949-01-01	13	1.2%
1950-01-01	12	1.1%
1976-01-01	11	1.0%
1930-01-01	11	1.0%
1938-01-01	11	1.0%
1910-01-01	11	1.0%
1868-01-01	11	1.0%
1977-01-01	10	0.9%
1939-01-01	10	0.9%
1984-01-01	10	0.9%
1934-01-01	10	0.9%
1916-01-01	10	0.9%
1924-01-01	10	0.9%
1917-01-01	10	0.9%
2008-01-01	9	0.8%
2003-01-01	9	0.8%
1998-01-01	9	0.8%
1890-01-01	9	0.8%
1986-01-01	9	0.8%

Fig 5.

description · Descriptions are uniformly templated (46-72 chars); useful as a sanity check that no records are truncated or malformed.

Character-length distribution for description (mean: 54.30811303555151).
chars	count
46 – 47	1
47 – 47	5
47 – 48	0
48 – 49	29
49 – 49	79
49 – 50	0
50 – 51	118
51 – 51	137
51 – 52	0
52 – 52	129
52 – 53	110
53 – 54	0
54 – 54	76
54 – 55	68
55 – 56	0
56 – 56	58
56 – 57	54
57 – 58	0
58 – 58	34
58 – 59	0
59 – 60	40
60 – 60	22
60 – 61	0
61 – 62	26
62 – 62	21
62 – 63	0
63 – 64	20
64 – 64	20
64 – 65	0
65 – 66	14
66 – 66	9
66 – 67	0
67 – 67	11
67 – 68	4
68 – 69	0
69 – 69	3
69 – 70	5
70 – 71	0
71 – 71	1
71 – 72	3

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
latitude	numeric	0.0%
longitude	numeric	0.0%
name	text	0.0%
description	text	0.0%
category	categorical	0.0%
date	categorical	1.7%
country	unknown	0.0%
mass_g	unknown	0.0%
meteorite_class	categorical	0.0%
fall_type	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	latitude	longitude
latitude	+1.00	-0.09
longitude	-0.09	+1.00

latitude numeric feature

Geographic latitude in degrees, spanning -44.12 to 66.35 and covering most inhabited latitudes from southern Australia to northern Scandinavia. The distribution is left-skewed (skew -1.28): the median of 36.1 sits well above the mean of 30.04, indicating a northern-hemisphere bias with a tail of southern-hemisphere observations, 90 of which (8.2%) are flagged as outliers. With 958 distinct values across 1,097 rows, most rows record a distinct latitude.

Treatment: Pair with longitude for spatial features; the southern-hemisphere outliers are likely legitimate, not errors.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["latitude"].stats

stat	value
n	1,097
nulls	0 (0.0%)
unique	958
min	-44.12
max	66.35
mean	30.04
median	36.1
std	23.13
q1	21.87
q3	46.07
iqr	24.2
skew	-1.276
kurtosis	1.01
n_outliers	90
outlier_rate	0.08204
zero_rate	0.001823
alert: outliers	8.2% rows beyond 1.5 IQR
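
The outlier alert can be checked against the quartiles in the table above. The following is a minimal sketch, assuming saturn applies the conventional Tukey fences (q1 - 1.5 × IQR and q3 + 1.5 × IQR); the function name `tukey_fences` is illustrative and not part of saturn.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) bounds beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported in the latitude stats table.
lower, upper = tukey_fences(q1=21.87, q3=46.07)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")
```

Under that assumption the upper fence (about 82.4) lies above the maximum latitude of 66.35, so every flagged row would sit on the southern side, below roughly -14.4.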

Fig 8.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 36.1).
bin	count
-44.12 – -40.77	2
-40.77 – -37.42	1
-37.42 – -34.07	3
-34.07 – -30.73	31
-30.73 – -27.38	14
-27.38 – -24.03	14
-24.03 – -20.68	9
-20.68 – -17.34	11
-17.34 – -13.99	6
-13.99 – -10.64	4
-10.64 – -7.295	11
-7.295 – -3.948	18
-3.948 – -0.6002	9
-0.6002 – 2.747	10
2.747 – 6.095	6
6.095 – 9.442	13
9.442 – 12.79	34
12.79 – 16.14	32
16.14 – 19.48	20
19.48 – 22.83	34
22.83 – 26.18	57
26.18 – 29.53	59
29.53 – 32.87	61
32.87 – 36.22	96
36.22 – 39.57	79
39.57 – 42.92	78
42.92 – 46.26	119
46.26 – 49.61	81
49.61 – 52.96	92
52.96 – 56.31	53
56.31 – 59.65	22
59.65 – 63	14
63 – 66.35	4

longitude numeric feature

Geographic longitude in decimal degrees, with values spanning -157.87 to 174.4 — essentially the full -180/180 range. The distribution is roughly symmetric (skew -0.23, kurtosis -0.62) and centered near 18.72, suggesting a slight Eurasian/African concentration but broad global coverage. With 1030 unique values across 1097 rows and no nulls, points are nearly all distinct.

Treatment: Pair with latitude as a geospatial coordinate; avoid scaling as a plain numeric feature.

anthropic:claude-opus-4-7 · confidence high
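
One common way to follow this treatment is to map each latitude/longitude pair onto a 3D unit vector, which removes the artificial jump at ±180°. This is a sketch of that general technique, not something the saturn output prescribes; `to_unit_vector` is an illustrative name.

```python
import math


def to_unit_vector(lat_deg: float, lon_deg: float) -> tuple:
    """Convert degrees of latitude/longitude to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Two points on either side of the antimeridian: 359 degrees apart as raw
# longitudes, but only about one degree apart on the sphere.
a = to_unit_vector(0.0, 179.5)
b = to_unit_vector(0.0, -179.5)
print(math.dist(a, b))
```

The three components can then be used as features in place of the raw longitude and latitude columns.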

Out[16]:

saturn.columns["longitude"].stats

stat	value
n	1,097
nulls	0 (0.0%)
unique	1,030
min	-157.9
max	174.4
mean	20.13
median	18.72
std	68.87
q1	-4.233
q3	76.27
iqr	80.5
skew	-0.2257
kurtosis	-0.6185
n_outliers	3
outlier_rate	0.002735
zero_rate	0.0009116

Fig 9.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: 18.71667).
bin	count
-157.9 – -147.8	2
-147.8 – -137.7	0
-137.7 – -127.7	1
-127.7 – -117.6	7
-117.6 – -107.5	11
-107.5 – -97.45	44
-97.45 – -87.39	49
-87.39 – -77.32	51
-77.32 – -67.25	26
-67.25 – -57.18	21
-57.18 – -47.11	14
-47.11 – -37.04	7
-37.04 – -26.97	2
-26.97 – -16.91	0
-16.91 – -6.836	23
-6.836 – 3.232	102
3.232 – 13.3	135
13.3 – 23.37	88
23.37 – 33.44	91
33.44 – 43.51	59
43.51 – 53.58	25
53.58 – 63.64	11
63.64 – 73.71	29
73.71 – 83.78	104
83.78 – 93.85	31
93.85 – 103.9	13
103.9 – 114	40
114 – 124.1	38
124.1 – 134.1	28
134.1 – 144.2	33
144.2 – 154.3	9
154.3 – 164.3	1
164.3 – 174.4	2

name text identifier

This is a `name` column with 1097 fully unique short strings (n_unique == n, duplicate_rate 0.0), averaging 8.56 characters and 1.21 words, with 82.95% being a single word. Top words like `st.`, `county`, `san`, `santa`, `creek`, `la`, `el`, `de` strongly suggest these are place or geographic entity names rather than person names, with a Spanish/English mix. No nulls, no URLs, no emoji — clean but effectively an identifier-grade label.

Treatment: Treat as a unique label/key; drop from modelling features or use only via geographic enrichment lookup.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["name"].stats

stat	value
n	1,097
nulls	0 (0.0%)
unique	1,097
len_min	2
len_max	28
len_mean	8.557
len_median	8
len_p95	15
word_mean	1.209
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	1,238
readability_flesch_mean	40.67
emoji_rate	0
url_rate	0
one_word_rate	0.8295
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	83.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 10.

Character-length distribution for name.

Character-length distribution for name (mean: 8.55697356426618).
chars	count
2 – 3	1
3 – 3	7
3 – 4	0
4 – 5	40
5 – 5	104
5 – 6	0
6 – 7	174
7 – 7	170
7 – 8	0
8 – 8	144
8 – 9	136
9 – 10	0
10 – 10	78
10 – 11	62
11 – 12	0
12 – 12	51
12 – 13	44
13 – 14	0
14 – 14	20
14 – 15	0
15 – 16	17
16 – 16	11
16 – 17	0
17 – 18	15
18 – 18	4
18 – 19	0
19 – 20	9
20 – 20	3
20 – 21	0
21 – 22	2
22 – 22	3
22 – 23	0
23 – 23	1
23 – 24	0
24 – 25	0
25 – 25	0
25 – 26	0
26 – 27	0
27 – 27	0
27 – 28	1

description text metadata

This appears to be a templated, machine-generated description string for meteorite records: the tokens 'meteorite', 'mass:', 'found:', and 'fell.' each appear exactly 1,097 times, matching the row count. Every value is unique (n_unique=1097, duplicate_rate=0.0) yet length is tightly bounded (46-72 chars, median 53), consistent with a fixed template in which only embedded fields such as the classification (l6. appears 260 times, h5. 163) and numeric values vary. The 'unknown.' token appears 1,099 times, slightly more than once per row on average, which signals that missing sub-fields are frequently packed into the template.

Treatment: Parse the template to extract structured fields (class, mass, found/fell) rather than embedding the raw string.

anthropic:claude-opus-4-7 · confidence high
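
The report never shows a raw description, so the exact template is unknown. The sketch below therefore only pulls out `Label: value.` pairs for the two labels named in the token list (`mass:` and `found:`). The sample string is made up to be consistent with those tokens, and both the regular expression and the function name `parse_labelled_fields` are assumptions to be adjusted once real rows have been inspected.

```python
import re

# Matches "Mass: <value>." and "Found: <value>." pairs. Assumes each labelled
# field ends with a period followed by whitespace or the end of the string.
FIELD = re.compile(r"(?P<label>Mass|Found):\s*(?P<value>.+?)\.(?=\s|$)", re.IGNORECASE)


def parse_labelled_fields(text: str) -> dict:
    """Collect labelled sub-fields, mapping the literal 'unknown' to None."""
    fields = {}
    for match in FIELD.finditer(text):
        value = match.group("value").strip()
        fields[match.group("label").lower()] = None if value.lower() == "unknown" else value
    return fields


# Hypothetical row; the real template may differ.
sample = "Meteorite of class L6. Mass: unknown. Found: Fell."
print(parse_labelled_fields(sample))
```

Mapping 'unknown' to None keeps the missing sub-fields visible as nulls instead of leaving them as a string category.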

Out[22]:

saturn.columns["description"].stats

stat	value
n	1,097
nulls	0 (0.0%)
unique	1,097
len_min	46
len_max	72
len_mean	54.31
len_median	53
len_p95	64
word_mean	8.254
word_median	8
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	1,372
readability_flesch_mean	52.62
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings

Fig 11.

Character-length distribution for description.

Character-length distribution for description (mean: 54.30811303555151).
chars	count
46 – 47	1
47 – 47	5
47 – 48	0
48 – 49	29
49 – 49	79
49 – 50	0
50 – 51	118
51 – 51	137
51 – 52	0
52 – 52	129
52 – 53	110
53 – 54	0
54 – 54	76
54 – 55	68
55 – 56	0
56 – 56	58
56 – 57	54
57 – 58	0
58 – 58	34
58 – 59	0
59 – 60	40
60 – 60	22
60 – 61	0
61 – 62	26
62 – 62	21
62 – 63	0
63 – 64	20
64 – 64	20
64 – 65	0
65 – 66	14
66 – 66	9
66 – 67	0
67 – 67	11
67 – 68	4
68 – 69	0
69 – 69	3
69 – 70	5
70 – 71	0
71 – 71	1
71 – 72	3

category categorical metadata

This column is a constant categorical tag, holding the single value "witnessed_meteorite_falls" across all 1097 rows. With cardinality of 1 and entropy of 0, it carries no information and likely identifies the dataset's provenance or scope rather than describing individual records.

Treatment: Drop before modelling; retain only as a dataset-level label.

anthropic:claude-opus-4-7 · confidence high
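
Constant columns such as category and fall_type can be detected and dropped programmatically. The sketch below assumes the records are available as a list of dictionaries; the two rows are illustrative stand-ins that reuse values from this report, and `constant_columns` is a hypothetical helper name.

```python
def constant_columns(rows: list) -> list:
    """Return the columns whose value is identical in every row."""
    if not rows:
        return []
    return [
        col for col in rows[0]
        if len({repr(row.get(col)) for row in rows}) == 1
    ]


# Illustrative rows that mimic the reported shape, not actual data.
rows = [
    {"category": "witnessed_meteorite_falls", "fall_type": "Fell", "meteorite_class": "L6"},
    {"category": "witnessed_meteorite_falls", "fall_type": "Fell", "meteorite_class": "H5"},
]
to_drop = constant_columns(rows)
slim = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
print(to_drop)
```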

Out[25]:

saturn.columns["category"].stats

stat	value
n	1,097
nulls	0 (0.0%)
unique	1
top_value	witnessed_meteorite_falls
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 12.

Top values for category.

Top values for category (1 unique shown, of 1 total).
value	count	share
witnessed_meteorite_falls	1097	100.0%

date categorical timestamp

Date values are stored as ISO strings, all snapped to January 1st of the year, so this is effectively a year-granularity field masquerading as a full date. Across 1,097 rows there are 231 distinct years with a very high entropy ratio (0.967), and the most common year (1933-01-01) accounts for only 1.6% of non-null rows (17 of 1,078, or 1.5% of all 1,097 rows). The null rate is 1.73% (19 rows).

Treatment: Parse to datetime and extract the year as an integer feature; the month/day component carries no signal.

anthropic:claude-opus-4-7 · confidence high
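
A minimal sketch of this treatment using only the standard library. It assumes missing dates arrive as `None` or an empty string, which the report does not confirm; `fall_year` is an illustrative name.

```python
from datetime import date
from typing import Optional


def fall_year(value: Optional[str]) -> Optional[int]:
    """Return the year of an ISO 'YYYY-MM-DD' string, or None when the value is missing."""
    if not value:
        return None
    return date.fromisoformat(value).year


# Two dates taken from the top-values table, plus one missing value.
values = ["1933-01-01", "1868-01-01", None]
print([fall_year(v) for v in values])
```

Keeping the nulls as `None` rather than a sentinel year avoids planting a spurious spike in the year distribution.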

Out[28]:

saturn.columns["date"].stats

stat	value
n	1,097
nulls	19 (1.7%)
unique	231
top_value	1933-01-01
top_rate	0.01577
cardinality	231
entropy	7.593
entropy_ratio	0.967

Fig 13.

Top values for date.

Top values for date (20 unique shown, of 231 total).
value	count	share
1933-01-01	17	1.5%
1949-01-01	13	1.2%
1950-01-01	12	1.1%
1976-01-01	11	1.0%
1930-01-01	11	1.0%
1938-01-01	11	1.0%
1910-01-01	11	1.0%
1868-01-01	11	1.0%
1977-01-01	10	0.9%
1939-01-01	10	0.9%
1984-01-01	10	0.9%
1934-01-01	10	0.9%
1916-01-01	10	0.9%
1924-01-01	10	0.9%
1917-01-01	10	0.9%
2008-01-01	9	0.8%
2003-01-01	9	0.8%
1998-01-01	9	0.8%
1890-01-01	9	0.8%
1986-01-01	9	0.8%

country unknown feature

This column is labeled 'country' and likely holds country names or codes, but saturn skipped detailed profiling so kind, uniqueness, and value distribution are unknown. The only confirmed signals are 1097 rows with a 0.0 null rate. No further evidence is available to characterize cardinality, format, or dominant values.

Treatment: Re-profile to determine cardinality and format, then encode as a categorical.

anthropic:claude-opus-4-7 · confidence low

Out[31]:

saturn.columns["country"].stats

stat	value
n	1,097
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

mass_g unknown other

The column is named mass_g, suggesting a mass measurement in grams across 1,097 rows with no nulls. However, saturn skipped profiling this column and reported no kind, uniqueness, or distribution stats, so its actual content and shape cannot be confirmed from the evidence.

Treatment: Re-profile or manually inspect this column before use; current evidence is insufficient to decide handling.

anthropic:claude-opus-4-7 · confidence low
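
For the two skipped columns (country and mass_g), a quick manual inspection can start with a tally of value types. This sketch assumes the source file parses to a JSON array of flat objects, which the report does not confirm. The inline records are made up purely to exercise the helper, and `value_types` is a hypothetical name.

```python
import json
from collections import Counter


def value_types(records: list, column: str) -> Counter:
    """Count the Python type name of each value stored under `column`."""
    return Counter(type(record.get(column)).__name__ for record in records)


# Made-up stand-in for the parsed file; replace with json.load() on the real source.
records = json.loads(
    '[{"country": "Exampleland", "mass_g": 21},'
    ' {"country": "Exampleland", "mass_g": "720"}]'
)
for column in ("country", "mass_g"):
    print(column, dict(value_types(records, column)))
```

A column that mixes several types, or holds only nested objects, would be one plausible reason for a profiler to report kind=unknown, though that remains a guess until the real values are checked.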

Out[33]:

saturn.columns["mass_g"].stats

stat	value
n	1,097
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

meteorite_class categorical label

This column records meteorite classification codes (L6, H5, H6, etc.), the standard taxonomy for ordinary chondrites and other meteorite types. Cardinality is high at 125 distinct classes across 1,097 rows with no nulls, but the distribution is concentrated: L6 alone covers 23.7% and the top four classes (L6, H5, H6, L5) account for over half the data. Entropy ratio of 0.67 confirms a long tail of rare classes that will be sparsely represented.

Treatment: Group rare classes into an 'other' bucket or use target encoding before modelling.

anthropic:claude-opus-4-7 · confidence high
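
The 'other' bucket option can be sketched as follows. The threshold of 50 is an arbitrary illustration, not a recommendation from the report. The counts are copied from the top-values table for a handful of classes only, and `lump_rare` is a hypothetical helper.

```python
from collections import Counter


def lump_rare(labels: list, min_count: int, other: str = "other") -> list:
    """Replace every label seen fewer than `min_count` times with `other`."""
    counts = Counter(labels)
    return [label if counts[label] >= min_count else other for label in labels]


# Counts from the top-values table; most of the 125 classes are omitted here.
top_counts = {"L6": 260, "H5": 163, "H6": 91, "L5": 76, "CV3": 7}
labels = [cls for cls, n in top_counts.items() for _ in range(n)]
lumped = Counter(lump_rare(labels, min_count=50))
print(lumped)
```

In practice the threshold should be fitted on training data only, so that the set of retained classes does not leak information from held-out rows.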

Out[35]:

saturn.columns["meteorite_class"].stats

stat	value
n	1,097
nulls	0 (0.0%)
unique	125
top_value	L6
top_rate	0.237
cardinality	125
entropy	4.639
entropy_ratio	0.666
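
The reported entropy_ratio values are consistent with Shannon entropy in bits divided by its maximum possible value, log2 of the cardinality. That relationship is inferred from the numbers in this report and is not a documented saturn definition; `entropy_ratio` below is an illustrative function.

```python
import math


def entropy_ratio(entropy_bits: float, cardinality: int) -> float:
    """Normalise entropy by log2(cardinality), the entropy of a uniform distribution."""
    if cardinality < 2:
        return 0.0  # a constant column has zero entropy
    return entropy_bits / math.log2(cardinality)


meteorite_class_ratio = round(entropy_ratio(4.639, 125), 3)
date_ratio = round(entropy_ratio(7.593, 231), 3)
print(meteorite_class_ratio, date_ratio)
```

With the entropy and cardinality figures from the stats tables, this reproduces the reported 0.666 for meteorite_class and 0.967 for date.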

Fig 14.

Top values for meteorite_class.

Top values for meteorite_class (20 unique shown, of 125 total).
value	count	share
L6	260	23.7%
H5	163	14.9%
H6	91	8.3%
L5	76	6.9%
H4	50	4.6%
LL6	41	3.7%
Stone-uncl	39	3.6%
OC	24	2.2%
LL5	19	1.7%
Eucrite-mmict	18	1.6%
L4	18	1.6%
Howardite	16	1.5%
CM2	15	1.4%
H	13	1.2%
L	10	0.9%
Iron, IIIAB	10	0.9%
Aubrite	9	0.8%
Diogenite	8	0.7%
EL6	8	0.7%
CV3	7	0.6%

fall_type categorical metadata

This column records a fall classification but contains the single value "Fell" across all 1097 rows, with no nulls. Entropy is 0 and cardinality is 1, so it carries no information for any downstream model or segmentation.

Treatment: Drop; constant column with zero entropy.

anthropic:claude-opus-4-7 · confidence high

Out[38]:

saturn.columns["fall_type"].stats

stat	value
n	1,097
nulls	0 (0.0%)
unique	1
top_value	Fell
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 15.

Top values for fall_type.

Top values for fall_type (1 unique shown, of 1 total).
value	count	share
Fell	1097	100.0%
