data-trove-usgs-significant-earthquakes

Overview

Source: /home/coolhand/html/datavis/data_trove/data/wild/usgs_significant_earthquakes.json

Saturn profiled 3,742 rows across 11 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/wild/usgs_significant_earthquakes.json",
    "--findings", "data-trove-usgs-significant-earthquakes.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This dataset contains 3,742 records of significant earthquakes catalogued by the USGS, each describing a seismic event with location, magnitude, depth, and type. The vast majority (99.9%) are classified as earthquakes, with just 2 explosions and 1 landslide, so event type is not a useful differentiator. Two things stand out for closer inspection: first, depth_km is heavily right-skewed (median 10 km, mean 23.7 km, max 248.7 km) with 314 outliers, suggesting a small but important subset of unusually deep earthquakes worth isolating. Second, geographic concentration is striking — Alaska dominates the place names (appearing in roughly 1,991 records) and 'off the coast of Oregon' is the single most repeated location (151 times), pointing to a strong Pacific Northwest and Alaskan bias in this 'significant' events catalog. Magnitude ranges from 4.5 to 8.2 with a median of 4.8 and a long upper tail, meaning truly destructive events are rare outliers worth flagging.

citing: depth_km.stats.median · depth_km.stats.mean · depth_km.stats.max · depth_km.n_outliers · depth_km.stats.skew · magnitude.stats.median · magnitude.stats.max · magnitude.stats.min · magnitude.n_outliers · earthquake_type.top_values · place.top_values · name.top_words · row_count

Out[4]:

saturn.schema() · 11 columns

column	kind	n	null%	unique	alerts
latitude	numeric	3,742	0.0%	3,627
longitude	numeric	3,742	0.0%	3,668
name	text	3,742	0.0%	3,002	multilingual
description	text	3,742	0.0%	3,591	near_unique
category	categorical	3,742	0.0%	1	imbalance
date	text	3,742	0.0%	3,741	near_unique one_word allcaps short_text
country	unknown	3,742	0.0%	—	skipped
magnitude	numeric	3,742	0.0%	123
depth_km	numeric	3,742	0.0%	1,505	high_skew outliers
place	text	3,742	0.0%	3,002	multilingual
earthquake_type	categorical	3,742	0.0%	3	imbalance

Fig 1.

magnitude · Look for the long right tail — most events cluster between 4.5 and 5.1, but a small number exceed 7.0 and represent the truly destructive quakes.

Histogram bins for magnitude (median: 4.8).
bin	count
4.5 – 4.593	752
4.593 – 4.685	601
4.685 – 4.777	445
4.777 – 4.87	340
4.87 – 4.963	286
4.963 – 5.055	254
5.055 – 5.147	218
5.147 – 5.24	177
5.24 – 5.332	137
5.332 – 5.425	104
5.425 – 5.518	85
5.518 – 5.61	66
5.61 – 5.702	53
5.702 – 5.795	3
5.795 – 5.887	38
5.887 – 5.98	46
5.98 – 6.072	25
6.072 – 6.165	17
6.165 – 6.257	16
6.257 – 6.35	11
6.35 – 6.442	15
6.442 – 6.535	11
6.535 – 6.627	10
6.627 – 6.72	4
6.72 – 6.812	6
6.812 – 6.905	5
6.905 – 6.997	0
6.997 – 7.09	3
7.09 – 7.182	3
7.182 – 7.275	3
7.275 – 7.367	1
7.367 – 7.46	0
7.46 – 7.552	1
7.552 – 7.645	1
7.645 – 7.737	0
7.737 – 7.83	2
7.83 – 7.922	2
7.922 – 8.015	0
8.015 – 8.107	0
8.107 – 8.2	1

Fig 2.

depth_km · The distribution is heavily skewed with a median of 10 km, but 314 outliers extend to 248.7 km, reaching the intermediate-depth range (conventionally 70 to 300 km) where events behave very differently from shallow ones.

Histogram bins for depth_km (median: 10.0).
bin	count
-2.261 – 4.013	219
4.013 – 10.29	1730
10.29 – 16.56	370
16.56 – 22.84	258
22.84 – 29.11	230
29.11 – 35.38	250
35.38 – 41.66	167
41.66 – 47.93	129
47.93 – 54.21	56
54.21 – 60.48	31
60.48 – 66.75	43
66.75 – 73.03	27
73.03 – 79.3	19
79.3 – 85.58	29
85.58 – 91.85	21
91.85 – 98.12	19
98.12 – 104.4	24
104.4 – 110.7	12
110.7 – 116.9	9
116.9 – 123.2	14
123.2 – 129.5	13
129.5 – 135.8	19
135.8 – 142	14
142 – 148.3	5
148.3 – 154.6	6
154.6 – 160.9	0
160.9 – 167.1	7
167.1 – 173.4	4
173.4 – 179.7	0
179.7 – 186	1
186 – 192.2	5
192.2 – 198.5	1
198.5 – 204.8	3
204.8 – 211.1	2
211.1 – 217.3	3
217.3 – 223.6	1
223.6 – 229.9	0
229.9 – 236.2	0
236.2 – 242.4	0
242.4 – 248.7	1

Fig 3.

place · Alaska and the Oregon coast dominate locations, revealing a strong geographic concentration in the Pacific Northwest and Aleutian arc.

Character-length distribution for place (mean: 29.47).
chars	count
4 – 5	1
5 – 7	0
7 – 8	1
8 – 10	0
10 – 11	0
11 – 12	0
12 – 14	2
14 – 15	11
15 – 16	40
16 – 18	1
18 – 19	20
19 – 20	19
20 – 22	8
22 – 23	219
23 – 25	60
25 – 26	122
26 – 27	543
27 – 29	499
29 – 30	823
30 – 32	325
32 – 33	362
33 – 34	378
34 – 36	105
36 – 37	40
37 – 38	37
38 – 40	25
40 – 41	34
41 – 42	3
42 – 44	0
44 – 45	15
45 – 47	22
47 – 48	14
48 – 49	3
49 – 51	1
51 – 52	2
52 – 54	5
54 – 55	0
55 – 56	0
56 – 58	0
58 – 59	2

Fig 4.

earthquake_type · Nearly all events (99.9%) are earthquakes; the two explosions and one landslide are negligible but worth confirming as data anomalies.

Top values for earthquake_type (3 unique shown, of 3 total).
value	count	share
earthquake	3739	99.9%
explosion	2	0.1%
landslide	1	0.0%

Fig 5.

latitude · Latitude distribution confirms geographic clustering in the 40–60° N range, consistent with Alaska and Pacific Northwest dominance.

Histogram bins for latitude (median: 52.3957).
bin	count
20.02 – 21.26	35
21.26 – 22.51	28
22.51 – 23.75	42
23.75 – 25	88
25 – 26.24	91
26.24 – 27.49	35
27.49 – 28.73	43
28.73 – 29.98	39
29.98 – 31.22	52
31.22 – 32.46	83
32.46 – 33.71	52
33.71 – 34.95	26
34.95 – 36.2	67
36.2 – 37.44	43
37.44 – 38.69	54
38.69 – 39.93	22
39.93 – 41.18	131
41.18 – 42.42	52
42.42 – 43.66	96
43.66 – 44.91	177
44.91 – 46.15	4
46.15 – 47.4	7
47.4 – 48.64	34
48.64 – 49.89	117
49.89 – 51.13	151
51.13 – 52.38	288
52.38 – 53.62	423
53.62 – 54.86	349
54.86 – 56.11	223
56.11 – 57.35	201
57.35 – 58.6	75
58.6 – 59.84	104
59.84 – 61.09	116
61.09 – 62.33	117
62.33 – 63.58	120
63.58 – 64.82	33
64.82 – 66.06	35
66.06 – 67.31	25
67.31 – 68.55	29
68.55 – 69.8	35

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
latitude	numeric	0.0%
longitude	numeric	0.0%
name	text	0.0%
description	text	0.0%
category	categorical	0.0%
date	text	0.0%
country	unknown	0.0%
magnitude	numeric	0.0%
depth_km	numeric	0.0%
place	text	0.0%
earthquake_type	categorical	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 7,480 detected strings).
lang	count	share
en	7438	99.4%
es	32	0.4%
de	6	0.1%
ja	2	0.0%
ceb	2	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
	latitude	longitude	magnitude	depth_km
latitude	+1.00	-0.72	-0.12	+0.31
longitude	-0.72	+1.00	+0.07	-0.37
magnitude	-0.12	+0.07	+1.00	-0.03
depth_km	+0.31	-0.37	-0.03	+1.00

latitude numeric feature

This column contains geographic latitude coordinates, ranging from 20.02° (subtropical) to 69.7975° (Arctic), all in the Northern Hemisphere. Read together with the longitude column, which is entirely negative, these are North American locations rather than European ones. The median of 52.4° and mean of 48.5° reflect a concentration of records at Aleutian and southern Alaskan latitudes, with the histogram peaking between roughly 51° and 56°. The distribution is moderately left-skewed (skew −0.76) and platykurtic (kurtosis −0.29), with a long, thin tail toward lower latitudes and no flagged outliers. High cardinality (3,627 unique values out of 3,742 rows) confirms these are precise geospatial coordinates rather than bucketed regions.

Treatment: Use as-is or pair with longitude for spatial modelling; consider binning into geographic zones if used as a categorical feature.

anthropic:default · confidence high

Out[14]:

saturn.columns["latitude"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,627
min	20.02
max	69.8
mean	48.53
median	52.4
std	11.58
q1	41.34
q3	55.9
iqr	14.56
skew	-0.7591
kurtosis	-0.2887
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 9.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 52.3957).
bin	count
20.02 – 21.26	35
21.26 – 22.51	28
22.51 – 23.75	42
23.75 – 25	88
25 – 26.24	91
26.24 – 27.49	35
27.49 – 28.73	43
28.73 – 29.98	39
29.98 – 31.22	52
31.22 – 32.46	83
32.46 – 33.71	52
33.71 – 34.95	26
34.95 – 36.2	67
36.2 – 37.44	43
37.44 – 38.69	54
38.69 – 39.93	22
39.93 – 41.18	131
41.18 – 42.42	52
42.42 – 43.66	96
43.66 – 44.91	177
44.91 – 46.15	4
46.15 – 47.4	7
47.4 – 48.64	34
48.64 – 49.89	117
49.89 – 51.13	151
51.13 – 52.38	288
52.38 – 53.62	423
53.62 – 54.86	349
54.86 – 56.11	223
56.11 – 57.35	201
57.35 – 58.6	75
58.6 – 59.84	104
59.84 – 61.09	116
61.09 – 62.33	117
62.33 – 63.58	120
63.58 – 64.82	33
64.82 – 66.06	35
66.06 – 67.31	25
67.31 – 68.55	29
68.55 – 69.8	35

longitude numeric feature

This column contains geographic longitude values, all negative, indicating locations exclusively in the Western Hemisphere. The range spans from approximately -170° (near the International Date Line, consistent with Alaska or Pacific islands) to -65° (eastern US/Caribbean region), with a mean near -140° and median near -144°, suggesting a heavy concentration of observations in Alaska or the North Pacific. The distribution is mildly right-skewed (skew 0.45) and platykurtic (kurtosis -0.43), implying a broad spread across this western band. The histogram shows two concentrations, one between about -170° and -160° and another between about -131° and -125°, rather than a single tight cluster. Only 26 outliers (0.69%) are flagged, and with 3,668 unique values out of 3,742 rows, coordinates appear precise and largely non-duplicated.

Treatment: Pair with latitude for spatial analysis; consider geographic binning or projection into a coordinate reference system before modelling.

anthropic:default · confidence high

Out[17]:

saturn.columns["longitude"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,668
min	-170
max	-65.04
mean	-140.1
median	-144.2
std	21.81
q1	-159.9
q3	-125.1
iqr	34.84
skew	0.4489
kurtosis	-0.4302
n_outliers	26
outlier_rate	0.006948
zero_rate	0

Fig 10.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: -144.21715).
bin	count
-170 – -167.4	368
-167.4 – -164.7	161
-164.7 – -162.1	243
-162.1 – -159.5	241
-159.5 – -156.9	113
-156.9 – -154.3	129
-154.3 – -151.6	241
-151.6 – -149	220
-149 – -146.4	77
-146.4 – -143.8	83
-143.8 – -141.1	29
-141.1 – -138.5	48
-138.5 – -135.9	41
-135.9 – -133.3	34
-133.3 – -130.6	129
-130.6 – -128	428
-128 – -125.4	200
-125.4 – -122.8	89
-122.8 – -120.1	44
-120.1 – -117.5	130
-117.5 – -114.9	149
-114.9 – -112.3	113
-112.3 – -109.6	163
-109.6 – -107	134
-107 – -104.4	33
-104.4 – -101.8	12
-101.8 – -99.15	10
-99.15 – -96.53	20
-96.53 – -93.9	4
-93.9 – -91.28	1
-91.28 – -88.65	2
-88.65 – -86.03	6
-86.03 – -83.41	1
-83.41 – -80.78	2
-80.78 – -78.16	3
-78.16 – -75.53	7
-75.53 – -72.91	8
-72.91 – -70.29	7
-70.29 – -67.66	5
-67.66 – -65.04	14

name text label

This column contains geographic location descriptions for seismic events, formatted as place names relative to known landmarks (e.g., '104 km SSW of Nikolski, Alaska'). The dominant region is Alaska, appearing in 1,991 of the 3,742 entries, with Canada and Mexico also frequent. With 3,002 unique values across 3,742 rows, 740 rows (19.8%) repeat a value seen elsewhere, driven by recurring labels such as 'off the coast of Oregon' (151 occurrences); many events cluster in the same geographic zones. A multilingual alert fires, but non-English detections are negligible (21 strings across de/es/ja/ceb vs. 3,719 en), so this is not a practical concern.

Treatment: Use as a categorical geographic label; consider parsing direction/distance/place subcomponents for structured feature engineering, or embed as text for ML.

anthropic:default · confidence high

Out[20]:

saturn.columns["name"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,002
len_min	4
len_max	59
len_mean	29.47
len_median	29
len_p95	36
word_mean	6.293
word_median	6
n_empty	0
n_duplicates	740
duplicate_rate	0.1978
vocab_size	1,036
readability_flesch_mean	69.91
emoji_rate	0
url_rate	0
one_word_rate	0.0005345
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	6 languages detected in sample

Fig 11.

Character-length distribution for name.

Character-length distribution for name (mean: 29.47).
chars	count
4 – 5	1
5 – 7	0
7 – 8	1
8 – 10	0
10 – 11	0
11 – 12	0
12 – 14	2
14 – 15	11
15 – 16	40
16 – 18	1
18 – 19	20
19 – 20	19
20 – 22	8
22 – 23	219
23 – 25	60
25 – 26	122
26 – 27	543
27 – 29	499
29 – 30	823
30 – 32	325
32 – 33	362
33 – 34	378
34 – 36	105
36 – 37	40
37 – 38	37
38 – 40	25
40 – 41	34
41 – 42	3
42 – 44	0
44 – 45	15
45 – 47	22
47 – 48	14
48 – 49	3
49 – 51	1
51 – 52	2
52 – 54	5
54 – 55	0
55 – 56	0
56 – 58	0
58 – 59	2

description text label

This column contains structured natural-language descriptions of seismic events, likely auto-generated strings of the form 'Earthquake magnitude X - depth: Y km [location]'. The top words confirm every row references 'earthquake', 'magnitude', 'depth:', and 'km', making this a templated field rather than free prose (mean length 71.7 chars, mean 12.3 words, Flesch readability 63.2). One surprising signal: 151 rows (4.0%) repeat a description found elsewhere, so distinct events share identical text, most likely through collisions on rounded magnitude, depth, and location values. The vocabulary of only 2,674 unique words across 3,742 rows further confirms the highly templated, low-diversity nature of the text.

Treatment: Parse structured fields (magnitude, depth, location) via regex rather than embedding; flag duplicate descriptions that may map to distinct events.

anthropic:default · confidence high

Out[23]:

saturn.columns["description"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,591
len_min	45
len_max	100
len_mean	71.71
len_median	72
len_p95	79
word_mean	12.29
word_median	12
n_empty	0
n_duplicates	151
duplicate_rate	0.04035
vocab_size	2,674
readability_flesch_mean	63.23
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	96.0% of rows are unique strings

Fig 12.

Character-length distribution for description.

Character-length distribution for description (mean: 71.71).
chars	count
45 – 46	1
46 – 48	1
48 – 49	0
49 – 50	0
50 – 52	0
52 – 53	0
53 – 55	4
55 – 56	4
56 – 57	19
57 – 59	15
59 – 60	20
60 – 62	9
62 – 63	17
63 – 64	158
64 – 66	45
66 – 67	73
67 – 68	380
68 – 70	350
70 – 71	726
71 – 72	389
72 – 74	367
74 – 75	605
75 – 77	204
77 – 78	91
78 – 79	90
79 – 81	21
81 – 82	33
82 – 84	8
84 – 85	25
85 – 86	31
86 – 88	24
88 – 89	10
89 – 90	7
90 – 92	1
92 – 93	3
93 – 94	5
94 – 96	1
96 – 97	2
97 – 99	1
99 – 100	2

category categorical metadata

This column is a dataset-level category tag indicating the source or filter applied to all 3,742 rows — every single record carries the value 'significant_earthquakes' with a top_rate of 1.0 and entropy of 0.0. It is a constant column with zero discriminative power. The imbalance alert confirms it: cardinality is 1, meaning this field adds no within-dataset information whatsoever.

Treatment: Drop before modelling — zero-variance constant column provides no predictive signal.

anthropic:default · confidence high
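
The drop recommendation above can be generalised into a small check for constant columns. This is a minimal sketch that assumes the rows are available as a list of dictionaries; the `constant_columns` helper and the two sample rows are hypothetical and are not part of Saturn.

```python
from typing import Dict, List


def constant_columns(rows: List[Dict[str, object]]) -> List[str]:
    """Return the names of columns whose value is identical in every row."""
    if not rows:
        return []
    constant = []
    for col in rows[0]:
        first = rows[0][col]
        # A column is constant when every row carries the first row's value.
        if all(row.get(col) == first for row in rows):
            constant.append(col)
    return constant


# Hypothetical sample rows, shaped like the profiled dataset.
rows = [
    {"category": "significant_earthquakes", "magnitude": 4.8},
    {"category": "significant_earthquakes", "magnitude": 5.1},
]
print(constant_columns(rows))
```

Columns returned by this check carry zero entropy, so they can be removed before modelling without losing information.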

Out[26]:

saturn.columns["category"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	1
top_value	significant_earthquakes
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 13.

Top values for category.

Top values for category (1 unique shown, of 1 total).
value	count	share
significant_earthquakes	3742	100.0%

date text timestamp

This column is named 'date' but does not contain valid calendar dates: the year position holds a Unix timestamp in milliseconds (e.g., '1614452365296') followed by the suffix '-01-01', which points to a malformed datetime export. Most values are 19 characters long (a 13-digit timestamp plus the 6-character suffix); the 180 values of length 18 would correspond to 12-digit timestamps. With 3,741 unique values out of 3,742 rows, the column functions almost as a row identifier. The single duplicate ('1614452365296-01-01' appearing twice) is the only anomaly in an otherwise fully unique set.

Treatment: Parse the numeric prefix as a Unix millisecond timestamp (divide by 1000 for seconds), discard the '-01-01' suffix artifact, and convert to a proper datetime before any temporal analysis.

anthropic:default · confidence high
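
The parsing step described in the treatment could look like the sketch below. It assumes every value has the form `<epoch milliseconds>-01-01`, as in the quoted example; the function name `parse_epoch_ms_date` is hypothetical. Integer arithmetic with `timedelta` is used instead of dividing by 1000 so that the millisecond part is not affected by floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_epoch_ms_date(raw: str) -> datetime:
    """Convert a value such as '1614452365296-01-01' to a UTC datetime."""
    # Split off the trailing '-01-01' artifact; the remaining prefix is
    # the timestamp in milliseconds since the Unix epoch.
    prefix = raw.rsplit("-", 2)[0]
    return EPOCH + timedelta(milliseconds=int(prefix))


sample = "1614452365296-01-01"  # the example value quoted in the text
print(parse_epoch_ms_date(sample).isoformat())
```

Splitting from the right keeps the function usable for 12-digit and 13-digit prefixes alike.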

Out[29]:

saturn.columns["date"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,741
len_min	18
len_max	19
len_mean	18.95
len_median	19
len_p95	19
word_mean	1
word_median	1
n_empty	0
n_duplicates	1
duplicate_rate	0.0002672
vocab_size	3,741
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 14.

Character-length distribution for date.

Character-length distribution for date (mean: 18.95).
chars	count
18	180
19	3562

country unknown label

This column contains country values across 3,742 rows with no nulls. The profiler skipped detailed analysis ('skipped' alert), so cardinality, value distribution, and encoding format (ISO codes vs. full names) are unknown. The absence of any stats prevents assessment of skew, dominant categories, or anomalies.

Treatment: Re-profile to obtain value counts and cardinality; standardize to ISO 3166-1 alpha-2/3 codes before encoding or joining.

anthropic:default · confidence low

Out[32]:

saturn.columns["country"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

magnitude numeric feature

This column represents seismic event magnitude, almost certainly on the Richter or moment magnitude scale, given the range of 4.5–8.2 and mean of ~4.92. Despite 3,742 records, only 123 unique values appear. A one-decimal scale spanning 4.5 to 8.2 allows at most 38 distinct values, so some magnitudes must be recorded to two or more decimal places. The distribution is strongly right-skewed (skew=1.97) with high kurtosis (5.58), meaning the vast majority of events cluster in the 4.6–5.1 IQR while 184 outliers (≈4.9% of records) pull the tail toward the extreme 8.2 maximum, a pattern consistent with the Gutenberg-Richter frequency-magnitude relationship.

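Treatment: Use as-is for seismic models (magnitude is already a logarithmic measure, so a further log transform is rarely needed); treat events above the 1.5 × IQR upper fence (about 5.85, from q3 = 5.1 and iqr = 0.5) as a separate high-impact stratum given the severe right skew.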

anthropic:default · confidence high
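
The outlier stratum can be located from the quartiles alone. The sketch below assumes the 1.5 × IQR rule that the depth_km alert names; that the same rule produced the magnitude outlier count is an assumption, and the `tukey_fences` helper is hypothetical.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the magnitude stats table.
low, high = tukey_fences(4.6, 5.1)
print(f"magnitude fences: {low:.2f} to {high:.2f}")
```

Because the minimum magnitude is 4.5, no record can fall below the lower fence, so all flagged outliers sit in the upper tail.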

Out[34]:

saturn.columns["magnitude"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	123
min	4.5
max	8.2
mean	4.917
median	4.8
std	0.462
q1	4.6
q3	5.1
iqr	0.5
skew	1.97
kurtosis	5.583
n_outliers	184
outlier_rate	0.04917
zero_rate	0

Fig 15.

Distribution of magnitude. Vertical dash marks the median.

Histogram bins for magnitude (median: 4.8).
bin	count
4.5 – 4.593	752
4.593 – 4.685	601
4.685 – 4.777	445
4.777 – 4.87	340
4.87 – 4.963	286
4.963 – 5.055	254
5.055 – 5.147	218
5.147 – 5.24	177
5.24 – 5.332	137
5.332 – 5.425	104
5.425 – 5.518	85
5.518 – 5.61	66
5.61 – 5.702	53
5.702 – 5.795	3
5.795 – 5.887	38
5.887 – 5.98	46
5.98 – 6.072	25
6.072 – 6.165	17
6.165 – 6.257	16
6.257 – 6.35	11
6.35 – 6.442	15
6.442 – 6.535	11
6.535 – 6.627	10
6.627 – 6.72	4
6.72 – 6.812	6
6.812 – 6.905	5
6.905 – 6.997	0
6.997 – 7.09	3
7.09 – 7.182	3
7.182 – 7.275	3
7.275 – 7.367	1
7.367 – 7.46	0
7.46 – 7.552	1
7.552 – 7.645	1
7.645 – 7.737	0
7.737 – 7.83	2
7.83 – 7.922	2
7.922 – 8.015	0
8.015 – 8.107	0
8.107 – 8.2	1

depth_km numeric feature

This column gives the depth of each seismic event in kilometres. The distribution is severely right-skewed (skew 3.07, kurtosis 11.61): the median is just 10.0 km, a common default depth assigned when depth is poorly constrained, and Q1 is also exactly 10.0 km, suggesting a large fraction of records are pinned to that default value. The tail extends to 248.7 km with 314 outliers (8.4%), and a small negative minimum (−2.261 km) indicates events located above the reference datum or an instrument offset, which may need review.

Treatment: Investigate Q1=median=10.0 km pile-up for catalog-assigned default depths; log-transform or use quantile binning before modelling, and flag/separate records with depth < 0 km.

anthropic:default · confidence high
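
A sketch of the flagging step follows. It treats 10.0 km as the suspected default value, which is an inference from the stats rather than something the catalogue confirms. Because the column contains negative depths, a plain logarithm is undefined for some rows; the sketch swaps in a signed `log1p` transform as one possible workaround. Both helper names are hypothetical.

```python
import math
from typing import Dict

DEFAULT_DEPTH_KM = 10.0  # suspected catalogue default (assumption)


def depth_flags(depth_km: float) -> Dict[str, bool]:
    """Flag depths that need review before modelling."""
    return {
        "negative": depth_km < 0,
        "default_depth": math.isclose(depth_km, DEFAULT_DEPTH_KM),
    }


def signed_log1p(x: float) -> float:
    """Compress large magnitudes while keeping the sign and mapping 0 to 0."""
    return math.copysign(math.log1p(abs(x)), x)


# Min, median, q3 and max from the depth_km stats table.
for depth in (-2.261, 10.0, 29.1, 248.7):
    print(depth, depth_flags(depth), round(signed_log1p(depth), 3))
```

Quantile binning, the other option named in the treatment, avoids the sign problem entirely and may be simpler when the default-depth records are kept.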

Out[37]:

saturn.columns["depth_km"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	1,505
min	-2.261
max	248.7
mean	23.71
median	10
std	28.79
q1	10
q3	29.1
iqr	19.1
skew	3.072
kurtosis	11.61
n_outliers	314
outlier_rate	0.08391
zero_rate	0.002672
alert: high_skew	skew=+3.07
alert: outliers	8.4% rows beyond 1.5 IQR

Fig 16.

Distribution of depth_km. Vertical dash marks the median.

Histogram bins for depth_km (median: 10.0).
bin	count
-2.261 – 4.013	219
4.013 – 10.29	1730
10.29 – 16.56	370
16.56 – 22.84	258
22.84 – 29.11	230
29.11 – 35.38	250
35.38 – 41.66	167
41.66 – 47.93	129
47.93 – 54.21	56
54.21 – 60.48	31
60.48 – 66.75	43
66.75 – 73.03	27
73.03 – 79.3	19
79.3 – 85.58	29
85.58 – 91.85	21
91.85 – 98.12	19
98.12 – 104.4	24
104.4 – 110.7	12
110.7 – 116.9	9
116.9 – 123.2	14
123.2 – 129.5	13
129.5 – 135.8	19
135.8 – 142	14
142 – 148.3	5
148.3 – 154.6	6
154.6 – 160.9	0
160.9 – 167.1	7
167.1 – 173.4	4
173.4 – 179.7	0
179.7 – 186	1
186 – 192.2	5
192.2 – 198.5	1
198.5 – 204.8	3
204.8 – 211.1	2
211.1 – 217.3	3
217.3 – 223.6	1
223.6 – 229.9	0
229.9 – 236.2	0
236.2 – 242.4	0
242.4 – 248.7	1

place text label

This column contains human-readable geographic location descriptions for the seismic events, predominantly structured as distance-and-bearing strings (e.g., '104 km SSW of Nikolski, Alaska'). Nearly all detected entries are in English (3,719), with a handful of Spanish, German, Cebuano, and Japanese detections triggering the multilingual alert. The dominant location is 'off the coast of Oregon', which appears 151 times, far more than any other value, and 740 duplicate entries (19.8% duplicate rate) suggest event clustering at recurring geographic hotspots. The vocabulary of directional abbreviations (SSW, SSE, SE, W) and 'km' appearing 3,220 times confirms a standardized but free-text geocoding convention. Every statistic reported for this column matches the name column, so the two appear to be duplicates of each other.

Treatment: Parse structured distance-bearing substrings (e.g., regex on 'km [NSEW]+') to extract numeric distance and cardinal direction as features; consider geocoding to lat/lon for spatial modelling.

anthropic:default · confidence high
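
The regex parse suggested in the treatment might be sketched as follows. The pattern is an assumption built from the single quoted example ('104 km SSW of Nikolski, Alaska'), and `parse_place` is a hypothetical helper. Values that do not follow the pattern, such as 'off the coast of Oregon', return `None` so they can be handled separately.

```python
import re
from typing import Optional, Tuple

# distance, 'km', a one-to-three-letter compass bearing, 'of', then the landmark
PLACE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*km\s+([NSEW]{1,3})\s+of\s+(.+?)\s*$")


def parse_place(place: str) -> Optional[Tuple[float, str, str]]:
    """Split a place string into (distance_km, bearing, landmark)."""
    match = PLACE_RE.match(place)
    if match is None:
        return None
    return float(match.group(1)), match.group(2), match.group(3)


print(parse_place("104 km SSW of Nikolski, Alaska"))
print(parse_place("off the coast of Oregon"))
```

Counting the `None` results gives a quick measure of how much of the column falls outside the distance-and-bearing convention.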

Out[40]:

saturn.columns["place"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3,002
len_min	4
len_max	59
len_mean	29.47
len_median	29
len_p95	36
word_mean	6.293
word_median	6
n_empty	0
n_duplicates	740
duplicate_rate	0.1978
vocab_size	1,036
readability_flesch_mean	69.91
emoji_rate	0
url_rate	0
one_word_rate	0.0005345
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	6 languages detected in sample

Fig 17.

Character-length distribution for place.

Character-length distribution for place (mean: 29.47).
chars	count
4 – 5	1
5 – 7	0
7 – 8	1
8 – 10	0
10 – 11	0
11 – 12	0
12 – 14	2
14 – 15	11
15 – 16	40
16 – 18	1
18 – 19	20
19 – 20	19
20 – 22	8
22 – 23	219
23 – 25	60
25 – 26	122
26 – 27	543
27 – 29	499
29 – 30	823
30 – 32	325
32 – 33	362
33 – 34	378
34 – 36	105
36 – 37	40
37 – 38	37
38 – 40	25
40 – 41	34
41 – 42	3
42 – 44	0
44 – 45	15
45 – 47	22
47 – 48	14
48 – 49	3
49 – 51	1
51 – 52	2
52 – 54	5
54 – 55	0
55 – 56	0
56 – 58	0
58 – 59	2

earthquake_type categorical label

This column classifies seismic event types, with three possible values: 'earthquake', 'explosion', and 'landslide'. It is severely imbalanced: 'earthquake' accounts for 3,739 of 3,742 records (99.92%), leaving only 2 explosions and 1 landslide. The near-zero entropy (0.0101) and entropy_ratio (0.0064) confirm the column carries almost no information variance, which will make it useless as a predictive feature without special handling.

Treatment: Drop or use only as a stratification/filter variable; minority classes (n=2, n=1) are too rare for meaningful multi-class modelling without heavy oversampling.

anthropic:default · confidence high
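
The entropy figures can be checked from the three value counts. The sketch below assumes Shannon entropy with base-2 logarithms and an entropy ratio defined as entropy divided by the log of the cardinality; under those assumptions the results agree with the reported 0.01014 and 0.006396. The helper name is hypothetical.

```python
import math
from typing import Sequence


def shannon_entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# earthquake, explosion, landslide counts from the top-values table.
counts = [3739, 2, 1]
entropy = shannon_entropy_bits(counts)
# Ratio to the maximum entropy achievable with this many categories.
entropy_ratio = entropy / math.log2(len(counts))
print(f"entropy={entropy:.5f} bits, ratio={entropy_ratio:.6f}")
```

A ratio this close to zero means the column is constant for practical purposes, which is why it is better suited to filtering than to prediction.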

Out[43]:

saturn.columns["earthquake_type"].stats

stat	value
n	3,742
nulls	0 (0.0%)
unique	3
top_value	earthquake
top_rate	0.9992
cardinality	3
entropy	0.01014
entropy_ratio	0.006396
alert: imbalance	top value is 99.9% of rows

Fig 18.

Top values for earthquake_type.

Top values for earthquake_type (3 unique shown, of 3 total).
value	count	share
earthquake	3739	99.9%
explosion	2	0.1%
landslide	1	0.0%
