natural_hazards-earthquakes · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/natural_hazards/earthquakes.json

Saturn profiled 3,742 rows across 11 columns. The stats below are deterministic and machine-readable. The prose is a language-model interpretation of those stats: it is opt-in, added after the fact, and the model never sees raw rows.

[2]:

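# Install the profiler (notebook shell escape), then run it on the source file.
!pip install saturn-dissect
import subprocess

subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/natural_hazards/earthquakes.json",
        "--findings", "natural_hazards-earthquakes.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the command exits non-zero
)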

Summary confidence: high

This dataset contains 3,742 records of significant earthquakes, with numeric measurements (magnitude, depth, latitude, longitude), location text fields, and a date column stored as text. Magnitude is tightly clustered, with the middle half of values between 4.6 and 5.1 (median 4.8), but reaches up to 8.2, producing 184 high-end outliers worth a closer look. Depth is highly skewed (skew 3.07) with a median of 10 km but a max of 248.7 km and 314 outliers, suggesting a mix of shallow and deep events. Geographically, the data is heavily concentrated around Alaska: 'alaska' appears in 1,991 place names, with Canada and Mexico trailing far behind, and every longitude falls in the western hemisphere (median -144.2). Note that 'category' is a single constant value and 'earthquake_type' is 99.9% 'earthquake', so neither adds analytic signal.

citing: row_count · column_count · depth_km · magnitude · latitude · longitude · place · earthquake_type · category

Out[4]:

saturn.schema() · 11 columns

column	kind	n	null%	unique	alerts
latitude	numeric	3,742	0.0%	3,627
longitude	numeric	3,742	0.0%	3,668
name	text	3,742	0.0%	3,002	multilingual
description	text	3,742	0.0%	3,591	near_unique
category	categorical	3,742	0.0%	1	imbalance
date	text	3,742	0.0%	3,741	near_unique one_word allcaps short_text
country	unknown	3,742	0.0%	—	skipped
magnitude	numeric	3,742	0.0%	123
depth_km	numeric	3,742	0.0%	1,505	high_skew outliers
place	text	3,742	0.0%	3,002	multilingual
earthquake_type	categorical	3,742	0.0%	3	imbalance

Fig 1.

magnitude · Most quakes sit between 4.5 and 5.1; watch the long tail extending to 8.2.

Histogram bins for magnitude (median: 4.8).
bin	count
4.5 – 4.593	752
4.593 – 4.685	601
4.685 – 4.777	445
4.777 – 4.87	340
4.87 – 4.963	286
4.963 – 5.055	254
5.055 – 5.147	218
5.147 – 5.24	177
5.24 – 5.332	137
5.332 – 5.425	104
5.425 – 5.518	85
5.518 – 5.61	66
5.61 – 5.702	53
5.702 – 5.795	3
5.795 – 5.887	38
5.887 – 5.98	46
5.98 – 6.072	25
6.072 – 6.165	17
6.165 – 6.257	16
6.257 – 6.35	11
6.35 – 6.442	15
6.442 – 6.535	11
6.535 – 6.627	10
6.627 – 6.72	4
6.72 – 6.812	6
6.812 – 6.905	5
6.905 – 6.997	0
6.997 – 7.09	3
7.09 – 7.182	3
7.182 – 7.275	3
7.275 – 7.367	1
7.367 – 7.46	0
7.46 – 7.552	1
7.552 – 7.645	1
7.645 – 7.737	0
7.737 – 7.83	2
7.83 – 7.922	2
7.922 – 8.015	0
8.015 – 8.107	0
8.107 – 8.2	1

Fig 2.

depth_km · Strongly right-skewed with a median of 10 km but values reaching 248.7 km; inspect the deep-event tail.

Histogram bins for depth_km (median: 10.0).
bin	count
-2.261 – 4.013	219
4.013 – 10.29	1730
10.29 – 16.56	370
16.56 – 22.84	258
22.84 – 29.11	230
29.11 – 35.38	250
35.38 – 41.66	167
41.66 – 47.93	129
47.93 – 54.21	56
54.21 – 60.48	31
60.48 – 66.75	43
66.75 – 73.03	27
73.03 – 79.3	19
79.3 – 85.58	29
85.58 – 91.85	21
91.85 – 98.12	19
98.12 – 104.4	24
104.4 – 110.7	12
110.7 – 116.9	9
116.9 – 123.2	14
123.2 – 129.5	13
129.5 – 135.8	19
135.8 – 142	14
142 – 148.3	5
148.3 – 154.6	6
154.6 – 160.9	0
160.9 – 167.1	7
167.1 – 173.4	4
173.4 – 179.7	0
179.7 – 186	1
186 – 192.2	5
192.2 – 198.5	1
198.5 – 204.8	3
204.8 – 211.1	2
211.1 – 217.3	3
217.3 – 223.6	1
223.6 – 229.9	0
229.9 – 236.2	0
236.2 – 242.4	0
242.4 – 248.7	1

Fig 3.

latitude · Latitude clusters in the high northern hemisphere (median ~52°), consistent with an Alaska-heavy sample.

Histogram bins for latitude (median: 52.3957).
bin	count
20.02 – 21.26	35
21.26 – 22.51	28
22.51 – 23.75	42
23.75 – 25	88
25 – 26.24	91
26.24 – 27.49	35
27.49 – 28.73	43
28.73 – 29.98	39
29.98 – 31.22	52
31.22 – 32.46	83
32.46 – 33.71	52
33.71 – 34.95	26
34.95 – 36.2	67
36.2 – 37.44	43
37.44 – 38.69	54
38.69 – 39.93	22
39.93 – 41.18	131
41.18 – 42.42	52
42.42 – 43.66	96
43.66 – 44.91	177
44.91 – 46.15	4
46.15 – 47.4	7
47.4 – 48.64	34
48.64 – 49.89	117
49.89 – 51.13	151
51.13 – 52.38	288
52.38 – 53.62	423
53.62 – 54.86	349
54.86 – 56.11	223
56.11 – 57.35	201
57.35 – 58.6	75
58.6 – 59.84	104
59.84 – 61.09	116
61.09 – 62.33	117
62.33 – 63.58	120
63.58 – 64.82	33
64.82 – 66.06	35
66.06 – 67.31	25
67.31 – 68.55	29
68.55 – 69.8	35

Fig 4.

place · Top recurring locations are dominated by Alaska, Canada, and Mexico — useful for spotting regional concentration.

Character-length distribution for place (mean: 29.465793693212184).
chars	count
4 – 5	1
5 – 7	0
7 – 8	1
8 – 10	0
10 – 11	0
11 – 12	0
12 – 14	2
14 – 15	11
15 – 16	40
16 – 18	1
18 – 19	20
19 – 20	19
20 – 22	8
22 – 23	219
23 – 25	60
25 – 26	122
26 – 27	543
27 – 29	499
29 – 30	823
30 – 32	325
32 – 33	362
33 – 34	378
34 – 36	105
36 – 37	40
37 – 38	37
38 – 40	25
40 – 41	34
41 – 42	3
42 – 44	0
44 – 45	15
45 – 47	22
47 – 48	14
48 – 49	3
49 – 51	1
51 – 52	2
52 – 54	5
54 – 55	0
55 – 56	0
56 – 58	0
58 – 59	2

Fig 5.

earthquake_type · Shows that 'earthquake' accounts for 99.9% of records, with only two explosions and one landslide.

Top values for earthquake_type (3 unique shown, of 3 total).
value	count	share
earthquake	3739	99.9%
explosion	2	0.1%
landslide	1	0.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
latitude	numeric	0.0%
longitude	numeric	0.0%
name	text	0.0%
description	text	0.0%
category	categorical	0.0%
date	text	0.0%
country	unknown	0.0%
magnitude	numeric	0.0%
depth_km	numeric	0.0%
place	text	0.0%
earthquake_type	categorical	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 7,480 detected strings).
lang	count	share
en	7438	99.4%
es	32	0.4%
de	6	0.1%
ja	2	0.0%
ceb	2	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
	latitude	longitude	magnitude	depth_km
latitude	+1.00	-0.72	-0.12	+0.31
longitude	-0.72	+1.00	+0.07	-0.37
magnitude	-0.12	+0.07	+1.00	-0.03
depth_km	+0.31	-0.37	-0.03	+1.00
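
For reference, a Pearson coefficient like the ones tabulated above can be computed as in the minimal sketch below. The function name `pearson` and the toy inputs are illustrative assumptions; how saturn samples and bounds the data before computing its matrix is not shown in this notebook.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy values (not taken from the dataset): a perfectly inverse relationship.
r = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

On real columns, the same function would be applied to each pair of numeric columns (for example latitude and longitude) to fill one cell of the matrix.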

latitude numeric feature

Geographic latitude coordinates ranging from 20.02 to 69.80, with a mean of 48.53 and median of 52.40. The distribution is moderately left-skewed (-0.76), concentrated in northern mid-to-high latitudes (Q1=41.34, Q3=55.90), suggesting a North America-heavy sample with no southern hemisphere points (longitudes span roughly -170 to -65, which rules out Europe). Near-unique values (3627/3742) and zero nulls indicate clean, granular location data.

Treatment: Pair with longitude for geospatial features; consider binning or projecting rather than using raw value in linear models.

anthropic:claude-opus-4-7 · confidence high
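
One way to act on the "projecting" suggestion is to map each latitude/longitude pair onto the unit sphere, so that nearby points stay close regardless of longitude wrap-around. This is a minimal sketch, not part of saturn; the function name `latlon_to_unit_xyz` is an assumption made for illustration.

```python
import math

def latlon_to_unit_xyz(lat_deg, lon_deg):
    """Map (latitude, longitude) in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return x, y, z

# The reported median latitude and longitude, used only as example coordinates
# (the two medians do not necessarily belong to the same record).
x, y, z = latlon_to_unit_xyz(52.3957, -144.21715)
```

The three outputs can then be used together as features in place of the raw latitude and longitude scalars.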

Out[14]:

saturn.columns["latitude"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,627
min	20.02
max	69.8
mean	48.53
median	52.4
std	11.58
q1	41.34
q3	55.9
iqr	14.56
skew	-0.7591
kurtosis	-0.2887
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 9.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 52.3957).
bin	count
20.02 – 21.26	35
21.26 – 22.51	28
22.51 – 23.75	42
23.75 – 25	88
25 – 26.24	91
26.24 – 27.49	35
27.49 – 28.73	43
28.73 – 29.98	39
29.98 – 31.22	52
31.22 – 32.46	83
32.46 – 33.71	52
33.71 – 34.95	26
34.95 – 36.2	67
36.2 – 37.44	43
37.44 – 38.69	54
38.69 – 39.93	22
39.93 – 41.18	131
41.18 – 42.42	52
42.42 – 43.66	96
43.66 – 44.91	177
44.91 – 46.15	4
46.15 – 47.4	7
47.4 – 48.64	34
48.64 – 49.89	117
49.89 – 51.13	151
51.13 – 52.38	288
52.38 – 53.62	423
53.62 – 54.86	349
54.86 – 56.11	223
56.11 – 57.35	201
57.35 – 58.6	75
58.6 – 59.84	104
59.84 – 61.09	116
61.09 – 62.33	117
62.33 – 63.58	120
63.58 – 64.82	33
64.82 – 66.06	35
66.06 – 67.31	25
67.31 – 68.55	29
68.55 – 69.8	35

longitude numeric feature

Geographic longitude coordinates, all negative and ranging from -169.997 to -65.039, placing every record in the western hemisphere. The distribution is wide (IQR ~34.8°) with a slight right skew (0.449) and only 26 mild outliers (0.69%); near-uniqueness (3668 unique of 3742) suggests point-level locations rather than a small set of sites.

Treatment: Pair with latitude as a 2-D geospatial feature; avoid treating as a standalone scalar in models.

anthropic:claude-opus-4-7 · confidence high

Out[17]:

saturn.columns["longitude"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,668
min	-170
max	-65.04
mean	-140.1
median	-144.2
std	21.81
q1	-159.9
q3	-125.1
iqr	34.84
skew	0.4489
kurtosis	-0.4302
n_outliers	26
outlier_rate	0.006948
zero_rate	0

Fig 10.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: -144.21715).
bin	count
-170 – -167.4	368
-167.4 – -164.7	161
-164.7 – -162.1	243
-162.1 – -159.5	241
-159.5 – -156.9	113
-156.9 – -154.3	129
-154.3 – -151.6	241
-151.6 – -149	220
-149 – -146.4	77
-146.4 – -143.8	83
-143.8 – -141.1	29
-141.1 – -138.5	48
-138.5 – -135.9	41
-135.9 – -133.3	34
-133.3 – -130.6	129
-130.6 – -128	428
-128 – -125.4	200
-125.4 – -122.8	89
-122.8 – -120.1	44
-120.1 – -117.5	130
-117.5 – -114.9	149
-114.9 – -112.3	113
-112.3 – -109.6	163
-109.6 – -107	134
-107 – -104.4	33
-104.4 – -101.8	12
-101.8 – -99.15	10
-99.15 – -96.53	20
-96.53 – -93.9	4
-93.9 – -91.28	1
-91.28 – -88.65	2
-88.65 – -86.03	6
-86.03 – -83.41	1
-83.41 – -80.78	2
-80.78 – -78.16	3
-78.16 – -75.53	7
-75.53 – -72.91	8
-72.91 – -70.29	7
-70.29 – -67.66	5
-67.66 – -65.04	14

name text metadata

This column holds short geographic descriptions of event locations, almost certainly earthquake place names (e.g. '104 km SSW of Nikolski, Alaska', 'off the coast of Oregon'), with mean length 29 chars and 6 words. Of 3742 rows, only 3002 are unique and 740 are duplicates (19.8%), with 'off the coast of Oregon' alone repeating 151 times. The language detector flags a multilingual mix but it is overwhelmingly English (3719) with trace de/es/ja/ceb counts likely false positives on place tokens.

Treatment: Parse into structured fields (distance, bearing, place) or geocode rather than using the raw string as a feature.

anthropic:claude-opus-4-7 · confidence high
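
The duplicate figures quoted above appear to follow directly from the row and distinct counts. The sketch below assumes n_duplicates = n - unique and duplicate_rate = n_duplicates / n; that definition is inferred from the reported numbers (it reproduces 740 and 0.1978), not taken from saturn's documentation.

```python
def duplicate_stats(n_rows, n_unique):
    """Return (n_duplicates, duplicate_rate) from a row count and a distinct count."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, n_duplicates / n_rows

# Counts reported for the name column.
n_dup, rate = duplicate_stats(3742, 3002)
```

The same arithmetic also matches the description column (3,742 rows, 3,591 unique, 151 duplicates).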

Out[20]:

saturn.columns["name"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,002
len_min	4
len_max	59
len_mean	29.47
len_median	29
len_p95	36
word_mean	6.293
word_median	6
n_empty	0
n_duplicates	740
duplicate_rate	0.1978
vocab_size	1,036
readability_flesch_mean	69.91
emoji_rate	0
url_rate	0
one_word_rate	0.0005345
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	6 languages detected in sample

Fig 11.

Character-length distribution for name.

Character-length distribution for name (mean: 29.465793693212184).
chars	count
4 – 5	1
5 – 7	0
7 – 8	1
8 – 10	0
10 – 11	0
11 – 12	0
12 – 14	2
14 – 15	11
15 – 16	40
16 – 18	1
18 – 19	20
19 – 20	19
20 – 22	8
22 – 23	219
23 – 25	60
25 – 26	122
26 – 27	543
27 – 29	499
29 – 30	823
30 – 32	325
32 – 33	362
33 – 34	378
34 – 36	105
36 – 37	40
37 – 38	37
38 – 40	25
40 – 41	34
41 – 42	3
42 – 44	0
44 – 45	15
45 – 47	22
47 – 48	14
48 – 49	3
49 – 51	1
51 – 52	2
52 – 54	5
54 – 55	0
55 – 56	0
56 – 58	0
58 – 59	2

description text free_text

This is a templated text description of seismic events: every row mentions 'earthquake', 'magnitude', and 'depth:', and lengths range from 45 to 100 characters, clustering tightly around the median of 72 (95th percentile 79). Despite the formulaic structure, 3591 of 3742 values are unique because the embedded magnitude and depth numbers vary; still, 151 exact duplicates (4.0%) exist. Alaska appears in 1973 rows and a 10km depth in 1216, suggesting the corpus is dominated by shallow Alaskan quakes.

Treatment: Parse out magnitude, depth, and region with regex into structured features rather than embedding the raw string.

anthropic:claude-opus-4-7 · confidence high
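
A regex-based extraction along these lines might look like the sketch below. The exact template of the description strings is not shown in this report, so the sample sentence and both patterns are assumptions; only the tokens 'magnitude' and 'depth:' are known to occur in every row.

```python
import re

# Hypothetical patterns: a number following the word "magnitude",
# and a (possibly negative) number following "depth:".
MAG_RE = re.compile(r"magnitude\s*:?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
DEPTH_RE = re.compile(r"depth:\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)

def parse_description(text):
    """Extract magnitude and depth from a description, or None where absent."""
    mag = MAG_RE.search(text)
    depth = DEPTH_RE.search(text)
    return {
        "magnitude": float(mag.group(1)) if mag else None,
        "depth_km": float(depth.group(1)) if depth else None,
    }

# Made-up sample string, not a row from the dataset.
sample = "Magnitude 5.2 earthquake near Nikolski, Alaska (depth: 10.0 km)"
parsed = parse_description(sample)
```

Because magnitude and depth_km already exist as numeric columns, the parsed values are mainly useful as a consistency check against them.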

Out[23]:

saturn.columns["description"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,591
len_min	45
len_max	100
len_mean	71.71
len_median	72
len_p95	79
word_mean	12.29
word_median	12
n_empty	0
n_duplicates	151
duplicate_rate	0.04035
vocab_size	2,674
readability_flesch_mean	63.23
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	96.0% of rows are unique strings

Fig 12.

Character-length distribution for description.

Character-length distribution for description (mean: 71.7116515232496).
chars	count
45 – 46	1
46 – 48	1
48 – 49	0
49 – 50	0
50 – 52	0
52 – 53	0
53 – 55	4
55 – 56	4
56 – 57	19
57 – 59	15
59 – 60	20
60 – 62	9
62 – 63	17
63 – 64	158
64 – 66	45
66 – 67	73
67 – 68	380
68 – 70	350
70 – 71	726
71 – 72	389
72 – 74	367
74 – 75	605
75 – 77	204
77 – 78	91
78 – 79	90
79 – 81	21
81 – 82	33
82 – 84	8
84 – 85	25
85 – 86	31
86 – 88	24
88 – 89	10
89 – 90	7
90 – 92	1
92 – 93	3
93 – 94	5
94 – 96	1
96 – 97	2
97 – 99	1
99 – 100	2

category categorical metadata

This column is a constant tag labeling every row as "significant_earthquakes" across all 3742 records, with cardinality of 1 and entropy of 0. It carries no information for modelling or analysis since the top_rate is 1.0 with no nulls.

Treatment: Drop; constant column with a single value.

anthropic:claude-opus-4-7 · confidence high

Out[26]:

saturn.columns["category"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	1
top_value	significant_earthquakes
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 13.

Top values for category.

Top values for category (1 unique shown, of 1 total).
value	count	share
significant_earthquakes	3742	100.0%

date text identifier

Despite its name, this column holds malformed date-like strings rather than parseable timestamps: every value appears to follow a 'NNNNNNNNNNNNN-01-01' pattern, a numeric prefix (likely a millisecond epoch) glued to a literal '-01-01' suffix. Values are 18-19 chars, which with the 6-character suffix implies a 12- or 13-digit prefix (13 digits in the 3,562 nineteen-character values, 12 in the remaining 180). Each value is a single token, and 3741 of 3742 are unique with only one duplicate ('1614452365296-01-01' appears twice). Stored as text, not a date type, so no temporal ordering or range is usable as-is.

Treatment: Parse the leading 12- or 13-digit prefix as ms-since-epoch into a real timestamp, or treat as a near-unique id and drop.

anthropic:claude-opus-4-7 · confidence high
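
The suggested parse could be sketched as follows, assuming the numeric prefix really is milliseconds since the Unix epoch in UTC (the report only calls this likely). The pattern accepts 12 or 13 digits because the column contains both 18- and 19-character values; the names `parse_epoch_ms`, `DATE_RE`, and `EPOCH` are illustrative.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# A 12- or 13-digit prefix followed by the literal "-01-01" suffix.
DATE_RE = re.compile(r"^(\d{12,13})-01-01$")

def parse_epoch_ms(value):
    """Return a UTC datetime for 'NNNNNNNNNNNNN-01-01', or None if it does not match."""
    m = DATE_RE.match(value)
    if m is None:
        return None
    # Integer milliseconds avoid float rounding on the sub-second part.
    return EPOCH + timedelta(milliseconds=int(m.group(1)))

# The one duplicated value quoted in the column summary.
ts = parse_epoch_ms("1614452365296-01-01")
```

Under that assumption the duplicated value falls on 27 February 2021 (UTC). Values that do not match the pattern return None so they can be inspected separately rather than silently coerced.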

Out[29]:

saturn.columns["date"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,741
len_min	18
len_max	19
len_mean	18.95
len_median	19
len_p95	19
word_mean	1
word_median	1
n_empty	0
n_duplicates	1
duplicate_rate	0.0002672
vocab_size	3,741
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 14.

Character-length distribution for date.

Character-length distribution for date (mean: 18.951897381079636).
chars	count
18	180
19	3562

country unknown feature

The column is named "country" and has 3742 non-null entries with a null_rate of 0.0, but saturn skipped detailed profiling (kind is "unknown") so cardinality and value distribution are not available. Without n_unique or category stats, we cannot tell whether this is a clean ISO code field, free-text country names, or a near-constant. The lack of any descriptive statistics is itself the main signal.

Treatment: Re-profile or manually inspect to determine cardinality before deciding whether to one-hot encode or normalize to ISO codes.

anthropic:claude-opus-4-7 · confidence low
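
A manual inspection could start with a small helper like the one below. It assumes the source JSON loads as a list of dict records, which this report does not confirm; the helper name `summarize_column` and the sample records are hypothetical.

```python
from collections import Counter

def summarize_column(records, key):
    """Count Python value types and distinct values for one field."""
    type_counts = Counter(type(r.get(key)).__name__ for r in records)
    distinct = Counter()
    for r in records:
        value = r.get(key)
        try:
            distinct[value] += 1
        except TypeError:
            # Unhashable values (dicts, lists) are counted by their repr.
            distinct[repr(value)] += 1
    return type_counts, distinct

# Hypothetical records, only to show the shape of the output.
records = [
    {"country": "US"},
    {"country": "US"},
    {"country": None},
    {"country": {"code": "CA"}},
]
type_counts, distinct = summarize_column(records, "country")
```

A mix of types in `type_counts` (for example nested objects or nulls) would be one possible explanation for the profiler reporting kind=unknown.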

Out[32]:

saturn.columns["country"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

magnitude numeric feature

This is an earthquake-style magnitude reading, bounded between 4.5 and 8.2 with a mean of 4.92 and median of 4.8. The distribution is heavily right-skewed (skew 1.97, kurtosis 5.58) with a tight IQR of 0.5 and 184 outliers (4.9%) — consistent with a catalog truncated at magnitude 4.5 where large events are rare but extreme.

Treatment: Treat as a right-skewed numeric feature; consider modelling tail events separately rather than transforming, since magnitude is already on a log scale.

anthropic:claude-opus-4-7 · confidence high
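
The outlier count can be related to the quartiles with Tukey's fences. The sketch below assumes saturn uses the 1.5 × IQR rule for magnitude, as its depth_km alert wording ("rows beyond 1.5 IQR") suggests; the function name `tukey_fences` is illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for magnitude: q1 = 4.6, q3 = 5.1.
mag_low, mag_high = tukey_fences(4.6, 5.1)
```

With these quartiles the fences are about 3.85 and 5.85. Since the minimum magnitude is 4.5, under this rule every flagged outlier would sit on the high side, at magnitudes above roughly 5.85.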

Out[34]:

saturn.columns["magnitude"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	123
min	4.5
max	8.2
mean	4.917
median	4.8
std	0.462
q1	4.6
q3	5.1
iqr	0.5
skew	1.97
kurtosis	5.583
n_outliers	184
outlier_rate	0.04917
zero_rate	0

Fig 15.

Distribution of magnitude. Vertical dash marks the median.

Histogram bins for magnitude (median: 4.8).
bin	count
4.5 – 4.593	752
4.593 – 4.685	601
4.685 – 4.777	445
4.777 – 4.87	340
4.87 – 4.963	286
4.963 – 5.055	254
5.055 – 5.147	218
5.147 – 5.24	177
5.24 – 5.332	137
5.332 – 5.425	104
5.425 – 5.518	85
5.518 – 5.61	66
5.61 – 5.702	53
5.702 – 5.795	3
5.795 – 5.887	38
5.887 – 5.98	46
5.98 – 6.072	25
6.072 – 6.165	17
6.165 – 6.257	16
6.257 – 6.35	11
6.35 – 6.442	15
6.442 – 6.535	11
6.535 – 6.627	10
6.627 – 6.72	4
6.72 – 6.812	6
6.812 – 6.905	5
6.905 – 6.997	0
6.997 – 7.09	3
7.09 – 7.182	3
7.182 – 7.275	3
7.275 – 7.367	1
7.367 – 7.46	0
7.46 – 7.552	1
7.552 – 7.645	1
7.645 – 7.737	0
7.737 – 7.83	2
7.83 – 7.922	2
7.922 – 8.015	0
8.015 – 8.107	0
8.107 – 8.2	1

depth_km numeric feature

This is almost certainly the focal depth in kilometres for seismic events, with values ranging from -2.261 to 248.7 and a median of 10.0. The distribution is heavily right-skewed (skew 3.07, kurtosis 11.6) with 314 outliers (8.39% of rows) and a small number of negative depths that warrant inspection. Half the records sit between q1=10.0 and q3=29.1015, suggesting a strong concentration of shallow events with a long deep tail.

Treatment: Clip or investigate negative values, then log-transform (e.g., log(depth+c)) before modelling.

anthropic:claude-opus-4-7 · confidence high
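
A minimal sketch of the clip-then-log treatment, assuming negative depths are clipped to zero and the offset c is 1 (so the transform is log(1 + depth)). Both choices are illustrative; whether the negative depths are errors or genuine above-datum events should be checked first.

```python
import math

def transform_depth(depth_km, floor=0.0):
    """Clip depth at `floor`, then apply log(1 + depth)."""
    clipped = max(depth_km, floor)
    return math.log1p(clipped)

# Reported min, a zero depth, the median, and the max of depth_km.
values = [-2.261, 0.0, 10.0, 248.7]
transformed = [transform_depth(d) for d in values]
```

The transform compresses the long deep tail: the median and the maximum differ by a factor of about 25 in kilometres but by a much smaller ratio after the log.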

Out[37]:

saturn.columns["depth_km"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	1,505
min	-2.261
max	248.7
mean	23.71
median	10
std	28.79
q1	10
q3	29.1
iqr	19.1
skew	3.072
kurtosis	11.61
n_outliers	314
outlier_rate	0.08391
zero_rate	0.002672
alert: high_skew	skew=+3.07
alert: outliers	8.4% rows beyond 1.5 IQR

Fig 16.

Distribution of depth_km. Vertical dash marks the median.

Histogram bins for depth_km (median: 10.0).
bin	count
-2.261 – 4.013	219
4.013 – 10.29	1730
10.29 – 16.56	370
16.56 – 22.84	258
22.84 – 29.11	230
29.11 – 35.38	250
35.38 – 41.66	167
41.66 – 47.93	129
47.93 – 54.21	56
54.21 – 60.48	31
60.48 – 66.75	43
66.75 – 73.03	27
73.03 – 79.3	19
79.3 – 85.58	29
85.58 – 91.85	21
91.85 – 98.12	19
98.12 – 104.4	24
104.4 – 110.7	12
110.7 – 116.9	9
116.9 – 123.2	14
123.2 – 129.5	13
129.5 – 135.8	19
135.8 – 142	14
142 – 148.3	5
148.3 – 154.6	6
154.6 – 160.9	0
160.9 – 167.1	7
167.1 – 173.4	4
173.4 – 179.7	0
179.7 – 186	1
186 – 192.2	5
192.2 – 198.5	1
198.5 – 204.8	3
204.8 – 211.1	2
211.1 – 217.3	3
217.3 – 223.6	1
223.6 – 229.9	0
229.9 – 236.2	0
236.2 – 242.4	0
242.4 – 248.7	1

place text metadata

Short geographic descriptors of earthquake locations, typically formatted as 'DISTANCE km BEARING of PLACE, REGION' (e.g., '104 km SSW of Nikolski, Alaska'), with mean length 29 characters and median 6 words. Alaska dominates (1,991 of 3,742 rows mention it), and a single string 'off the coast of Oregon' repeats 151 times, contributing to a 19.8% duplicate rate across 3,002 unique values. The 'multilingual' alert is driven by a tiny non-English fringe (16 es, 3 de, 1 ja, 1 ceb) against 3,719 English rows, so it is effectively monolingual.

Treatment: Parse into structured fields (distance_km, bearing, place_name, region) rather than using the raw string as a categorical.

anthropic:claude-opus-4-7 · confidence high
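
The structured parse could be sketched with a single regular expression, as below. The pattern is an assumption based on the one example format quoted above ('104 km SSW of Nikolski, Alaska'); strings that do not fit, such as 'off the coast of Oregon', fall back to a place_name-only result. The names `PLACE_RE` and `parse_place` are illustrative.

```python
import re

PLACE_RE = re.compile(
    r"^(?P<distance_km>\d+(?:\.\d+)?)\s*km\s+"   # e.g. "104 km"
    r"(?P<bearing>[NSEW]{1,3})\s+of\s+"          # e.g. "SSW of"
    r"(?P<place_name>.+?)"                       # e.g. "Nikolski"
    r"(?:,\s*(?P<region>[^,]+))?$"               # optional ", Alaska"
)

def parse_place(text):
    """Split a place string into distance_km, bearing, place_name, region."""
    text = text.strip()
    m = PLACE_RE.match(text)
    if m is None:
        # Unrecognized format: keep the whole string as the place name.
        return {"distance_km": None, "bearing": None,
                "place_name": text, "region": None}
    fields = m.groupdict()
    fields["distance_km"] = float(fields["distance_km"])
    return fields

parsed = parse_place("104 km SSW of Nikolski, Alaska")
fallback = parse_place("off the coast of Oregon")
```

When a string contains several commas, the region is taken to be the text after the last one.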

Out[40]:

saturn.columns["place"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,002
len_min	4
len_max	59
len_mean	29.47
len_median	29
len_p95	36
word_mean	6.293
word_median	6
n_empty	0
n_duplicates	740
duplicate_rate	0.1978
vocab_size	1,036
readability_flesch_mean	69.91
emoji_rate	0
url_rate	0
one_word_rate	0.0005345
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	6 languages detected in sample

Fig 17.

Character-length distribution for place.

Character-length distribution for place (mean: 29.465793693212184).
chars	count
4 – 5	1
5 – 7	0
7 – 8	1
8 – 10	0
10 – 11	0
11 – 12	0
12 – 14	2
14 – 15	11
15 – 16	40
16 – 18	1
18 – 19	20
19 – 20	19
20 – 22	8
22 – 23	219
23 – 25	60
25 – 26	122
26 – 27	543
27 – 29	499
29 – 30	823
30 – 32	325
32 – 33	362
33 – 34	378
34 – 36	105
36 – 37	40
37 – 38	37
38 – 40	25
40 – 41	34
41 – 42	3
42 – 44	0
44 – 45	15
45 – 47	22
47 – 48	14
48 – 49	3
49 – 51	1
51 – 52	2
52 – 54	5
54 – 55	0
55 – 56	0
56 – 58	0
58 – 59	2

earthquake_type categorical label

This is a categorical event-type label with only 3 distinct values across 3742 rows and no nulls. The distribution is essentially degenerate: 'earthquake' covers 99.92% of rows, leaving just 2 'explosion' and 1 'landslide' records, yielding an entropy ratio of 0.006. The column carries almost no information as-is.

Treatment: Drop or collapse to a binary rare-event flag; near-constant for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[43]:

saturn.columns["earthquake_type"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3
top_value	earthquake
top_rate	0.9992
cardinality	3
entropy	0.01014
entropy_ratio	0.006396
alert: imbalance	top value is 99.9% of rows
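
The entropy figures above can be reproduced from the three value counts. The sketch assumes Shannon entropy in bits and an entropy ratio normalized by log2 of the cardinality; those definitions are inferred because they match the reported 0.01014 and 0.006396, not taken from saturn's documentation.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# earthquake, explosion, landslide
counts = [3739, 2, 1]
h = entropy_bits(counts)
# Normalize by the maximum possible entropy for 3 categories.
ratio = h / math.log2(len(counts))
```

A ratio this close to zero means the column is almost indistinguishable from a constant, which is what motivates the drop-or-collapse treatment.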

Fig 18.

Top values for earthquake_type.

Top values for earthquake_type (3 unique shown, of 3 total).
value	count	share
earthquake	3739	99.9%
explosion	2	0.1%
landslide	1	0.0%
