saturn·

usgs significant earthquakes

source /home/coolhand/datasets/usgs-significant-earthquakes/usgs_significant_earthquakes.json 3,742 rows 11 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 3,742 records of significant earthquakes from USGS, with 11 columns covering location (latitude, longitude, place/name), magnitude, depth, and event metadata. Magnitude is tightly clustered, with an interquartile range of 4.6 to 5.1 (min 4.5, median 4.8), but has a long right tail reaching 8.2, with 184 outliers worth examining for the rare large events. `depth_km` is highly skewed (skew 3.07) with a median of 10 km but a max of 248.7 km and 314 outliers, suggesting a mix of shallow and intermediate-depth quakes (the maximum is below the roughly 300 km threshold usually used for deep-focus events). Geographically, the data is heavily concentrated around Alaska: 'alaska' appears in 1,991 place names and 'off the coast of Oregon' alone accounts for 151 records, so this is effectively a North Pacific / Alaska-dominated sample rather than a global one. Note that the `category` column is constant ('significant_earthquakes') and `earthquake_type` is 99.9% 'earthquake', so neither will be useful for segmentation.

citing: row_count · column_count · depth_km · magnitude · latitude · longitude · name · place · earthquake_type · category

Schema

11 columns
Per-column summary.
column · type · nulls · unique · alerts
latitude · numeric · 0.0% · 3,627 · (none)
longitude · numeric · 0.0% · 3,668 · (none)
name · text · 0.0% · 3,002 · multilingual
description · text · 0.0% · 3,591 · near_unique
category · categorical · 0.0% · 1 · imbalance
date · text · 0.0% · 3,741 · near_unique, one_word, allcaps, short_text
country · unknown · 0.0% · (not reported) · skipped
magnitude · numeric · 0.0% · 123 · (none)
depth_km · numeric · 0.0% · 1,505 · high_skew, outliers
place · text · 0.0% · 3,002 · multilingual
earthquake_type · categorical · 0.0% · 3 · imbalance

latitude

numeric feature
Geographic latitude coordinate, with values ranging from 20.02 to 69.7975 and a median of 52.40, consistent with locations spanning roughly the tropics to just beyond the Arctic Circle. The distribution is left-skewed (skew -0.76), concentrated in northern mid-to-high latitudes (Q1-Q3: 41.34-55.90). Because every longitude in this dataset lies in the western hemisphere, this points to a North America / North Pacific bias rather than a European one. No nulls or outliers, and 3627 unique values across 3742 rows indicate near-row-level granularity. Treatment: Pair with longitude for geospatial features; consider binning or clustering rather than treating as a plain scalar. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
3,627
min
20.02
max
69.8
mean
48.53
median
52.4
std
11.58
q1
41.34
q3
55.9
iqr
14.56
skew
-0.7591
kurtosis
-0.2887
n_outliers
0
outlier_rate
0
zero_rate
0
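
The binning suggested in the treatment could be sketched as follows. This is a minimal illustration, not part of saturn: the `grid_cell` helper, the 5-degree cell size, and the sample coordinates are all assumptions chosen for the example.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=5.0):
    """Return the (row, col) index of the cell_deg x cell_deg grid cell holding a point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


# Hypothetical coordinates for illustration only; these are not rows from the dataset.
points = [(52.4, -169.9), (53.1, -168.2), (44.3, -129.6), (20.02, -105.3)]

# Count events per grid cell so latitude and longitude act as one joint feature.
cell_counts = Counter(grid_cell(lat, lon) for lat, lon in points)
print(cell_counts.most_common(1))
```

A fixed grid is the simplest option; equal-area grids or clustering may suit high latitudes better, since a 5-degree cell narrows toward the pole.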

longitude

numeric feature
Geographic longitude coordinates, all in the western hemisphere with values ranging from -169.997 to -65.039 (mean -140.10, median -144.22). The distribution is mildly right-skewed (0.45) with 3668 unique values across 3742 rows, suggesting near-distinct location points. The 26 outliers (0.69%) likely correspond to easternmost points well outside the dense western cluster bounded by Q1=-159.93 and Q3=-125.10. Treatment: Pair with latitude for geospatial features (e.g., binning, clustering, or distance computation) rather than treating as a standalone scalar. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
3,668
min
-170
max
-65.04
mean
-140.1
median
-144.2
std
21.81
q1
-159.9
q3
-125.1
iqr
34.84
skew
0.4489
kurtosis
-0.4302
n_outliers
26
outlier_rate
0.006948
zero_rate
0
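
The claim that the 26 outliers sit on the eastern edge can be checked from the quartiles alone. The sketch below assumes the profiler uses the common 1.5 x IQR (Tukey) rule, which the report does not state; `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences under the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles, min and max as reported for longitude above.
lower, upper = tukey_fences(-159.93, -125.10)
lon_min, lon_max = -169.997, -65.039

# The lower fence lies beyond the observed minimum, so no western outliers are possible.
print("western outliers possible:", lon_min < lower)
# The upper fence lies inside the observed range, so all outliers are eastern points.
print("eastern outliers possible:", lon_max > upper)
```

Under that assumption the upper fence falls near -72.9 degrees, so only points east of it can be flagged.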

name

text metadata multilingual
This 'name' column reads as human-readable seismic event location descriptions (e.g., '104 km SSW of Nikolski, Alaska'), dominated by distance-and-bearing phrases — 'km', 'of', and cardinal directions like 'SSW'/'SSE' top the word list. Of 3742 rows there are only 3002 uniques and a 19.8% duplicate rate, with 'off the coast of Oregon' appearing 151 times, suggesting many events share the same place label. Despite a 'multilingual' flag, 3719 of 3742 strings are detected as English with only a handful tagged de/es/ja/ceb — likely false positives on short toponyms rather than real language mixing. Treatment: Parse into structured fields (distance_km, bearing, place) rather than embedding the raw string. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
3,002
len_min
4
len_max
59
len_mean
29.47
len_median
29
len_p95
36
word_mean
6.293
word_median
6
n_empty
0
n_duplicates
740
duplicate_rate
0.1978
vocab_size
1,036
readability_flesch_mean
69.91
emoji_rate
0
url_rate
0
one_word_rate
0.0005345
allcaps_rate
0
boilerplate_rate
0
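
One way to parse these strings into structured fields is a single regular expression. This is a sketch under the assumption that most values follow the 'distance km bearing of place' pattern quoted above; `parse_name` and its field names are illustrative, and values that do not match are passed through unchanged.

```python
import re

# Matches strings such as '104 km SSW of Nikolski, Alaska'.
NAME_RE = re.compile(
    r"^(?P<distance_km>\d+(?:\.\d+)?)\s*km\s+(?P<bearing>[NSEW]{1,3})\s+of\s+(?P<place>.+)$"
)


def parse_name(text):
    """Split a location label into distance_km, bearing and place."""
    text = text.strip()
    match = NAME_RE.match(text)
    if match is None:
        # Labels like 'off the coast of Oregon' carry no distance or bearing.
        return {"distance_km": None, "bearing": None, "place": text}
    return {
        "distance_km": float(match["distance_km"]),
        "bearing": match["bearing"],
        "place": match["place"],
    }


print(parse_name("104 km SSW of Nikolski, Alaska"))
print(parse_name("off the coast of Oregon"))
```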

description

text free_text near_unique
Templated earthquake event descriptions, e.g. magnitude/depth strings ending in locations like Alaska. Lengths are tightly clustered (min 45, max 100, p95 79) and every row contains 'magnitude', '-', and 'depth:', confirming a fixed format. Despite that template, 3591 of 3742 values are unique with only 151 duplicates (4.0%), so the free-text portion (location, magnitude, depth) carries the signal. Treatment: Parse out magnitude, depth, and location with a regex rather than embedding the templated string. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
3,591
len_min
45
len_max
100
len_mean
71.71
len_median
72
len_p95
79
word_mean
12.29
word_median
12
n_empty
0
n_duplicates
151
duplicate_rate
0.04035
vocab_size
2,674
readability_flesch_mean
63.23
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0
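
A tolerant regex pass could recover the numeric fields. The exact template is not shown in this report, so the sample string below is a guess at its shape, and `parse_description` relies only on the two markers the profile confirms ('magnitude' and 'depth:'). Location extraction is left out because it depends on the real template.

```python
import re

# Each pattern looks for a number right after a marker the profile says is always present.
MAG_RE = re.compile(r"magnitude\s*:?\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)
DEPTH_RE = re.compile(r"depth:\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_description(text):
    """Extract magnitude and depth from a templated description, or None if absent."""
    mag = MAG_RE.search(text)
    depth = DEPTH_RE.search(text)
    return {
        "magnitude": float(mag.group(1)) if mag else None,
        "depth_km": float(depth.group(1)) if depth else None,
    }


# Hypothetical example string; the dataset's actual wording may differ.
sample = "Magnitude 5.2 earthquake - 104 km SSW of Nikolski, Alaska (depth: 10.0 km)"
print(parse_description(sample))
```

Comparing the parsed numbers against the `magnitude` and `depth_km` columns would show whether the description adds anything beyond them.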

category

categorical metadata imbalance
This column tags every one of the 3,742 rows with the single value "significant_earthquakes", giving cardinality 1 and entropy 0. It carries no information for modelling and most likely encodes the dataset's source or slice rather than a per-row attribute. Treatment: Drop before modelling; retain only as a dataset-level provenance tag. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
1
top_value
significant_earthquakes
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0
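
Dropping constant columns can be automated rather than done by name. The sketch below assumes the JSON has been loaded as a list of flat dicts with hashable values, which this report does not confirm; `constant_columns` and the two sample rows are illustrative.

```python
def constant_columns(rows):
    """Return the names of columns that hold the same value in every row."""
    if not rows:
        return []
    return [key for key in rows[0] if len({row.get(key) for row in rows}) == 1]


# Hypothetical rows for illustration only.
rows = [
    {"category": "significant_earthquakes", "magnitude": 4.8},
    {"category": "significant_earthquakes", "magnitude": 5.1},
]

to_drop = constant_columns(rows)
# Keep the constant value once as dataset-level provenance, then drop it per row.
provenance = {key: rows[0][key] for key in to_drop}
cleaned = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
print(to_drop, provenance, cleaned)
```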

date

text identifier near_unique one_word allcaps short_text
Despite the name, this column holds 18-19 character single-token strings rather than parseable dates: every value ends in '-01-01', which leaves a 12- or 13-digit numeric prefix that appears to be an epoch-style integer (a 13-digit value would be consistent with milliseconds). With 3741 unique values across 3742 rows (one duplicate) and a 100% one-word, all-caps rate, it behaves as a near-unique identifier, not a temporal feature. Treatment: Drop as-is or parse the leading numeric prefix into a real timestamp before any time-based use. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
3,741
len_min
18
len_max
19
len_mean
18.95
len_median
19
len_p95
19
word_mean
1
word_median
1
n_empty
0
n_duplicates
1
duplicate_rate
0.0002672
vocab_size
3,741
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0
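
Parsing the numeric prefix might look like the sketch below. It assumes the prefix is a Unix epoch in milliseconds, which is only an inference from its length, and `parse_epoch_prefix` is a hypothetical helper; the input string is made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_epoch_prefix(value, suffix="-01-01"):
    """Convert '<epoch_ms>-01-01' into a UTC datetime, or None if the shape is wrong."""
    if not value.endswith(suffix):
        return None
    prefix = value[: -len(suffix)]
    if not prefix.isdigit():
        return None
    # ASSUMPTION: the prefix counts milliseconds since 1970-01-01 UTC.
    return EPOCH + timedelta(milliseconds=int(prefix))


# Hypothetical value for illustration; not taken from the dataset.
print(parse_epoch_prefix("1000000000000-01-01"))
```

Spot-checking a few parsed timestamps against known events would confirm or refute the milliseconds assumption before the column is used.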

country

unknown metadata skipped
This column is labelled "country" and has 3742 fully populated rows with no nulls, but saturn skipped detailed profiling so kind, uniqueness, and value distribution are unknown. Without cardinality or sample values, it is impossible to confirm whether entries are ISO codes, full names, or mixed representations. Treat the absence of stats as the main signal here. Treatment: Re-profile with type detection enabled to confirm cardinality and standardise country representation before use. low · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique

magnitude

numeric feature
This is a numeric magnitude field, almost certainly earthquake magnitudes given the floor at 4.5 and ceiling at 8.2. Values cluster tightly (median 4.8, IQR 0.5) but the distribution is heavily right-skewed (skew 1.97, kurtosis 5.58) with 184 outliers (4.9%) in the upper tail. Only 123 unique values across 3742 rows points to coarse reporting precision, though not strictly one decimal: one-decimal steps from 4.5 to 8.2 allow at most 38 distinct values, so some magnitudes must carry two or more decimals. Treatment: Log-transform or bucket into magnitude bands before modelling to tame the right skew. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
123
min
4.5
max
8.2
mean
4.917
median
4.8
std
0.462
q1
4.6
q3
5.1
iqr
0.5
skew
1.97
kurtosis
5.583
n_outliers
184
outlier_rate
0.04917
zero_rate
0
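
Bucketing into bands could be done with a sorted list of edges. The band names and cut points below are an assumption, following the descriptors commonly used for earthquake size (light, moderate, strong, major, great); they are not defined anywhere in this report.

```python
import bisect

# ASSUMPTION: conventional size classes; adjust edges to suit the modelling task.
BAND_EDGES = [5.0, 6.0, 7.0, 8.0]
BAND_LABELS = ["light", "moderate", "strong", "major", "great"]


def magnitude_band(magnitude):
    """Map a magnitude to its band; each edge belongs to the band above it."""
    return BAND_LABELS[bisect.bisect_right(BAND_EDGES, magnitude)]


# Reported min, median, q3 and max from the stats above, plus one edge value.
for value in (4.5, 4.8, 5.0, 5.1, 8.2):
    print(value, magnitude_band(value))
```

With the reported quartiles, at least half of the rows fall in the lowest band, so the upper bands may need merging if they turn out too sparse.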

depth_km

numeric feature high_skew outliers
Numeric depth measurements in kilometers, almost certainly earthquake hypocenter depths given the median of 10.0 and range from -2.261 to 248.7. The distribution is heavily right-skewed (skew 3.07, kurtosis 11.6) with 314 outliers (8.4%) and a Q1 equal to the median at 10.0, suggesting many events are pinned to a default 10 km depth. The negative minimum (-2.261) likely reflects events located above sea level or above the reference datum. Treatment: Log-transform (after shifting for negatives) and flag the 10 km default-depth pile-up before modelling. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
1,505
min
-2.261
max
248.7
mean
23.71
median
10
std
28.79
q1
10
q3
29.1
iqr
19.1
skew
3.072
kurtosis
11.61
n_outliers
314
outlier_rate
0.08391
zero_rate
0.002672
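
The shifted log transform and the default-depth flag could be combined in one helper. This is a sketch: `transform_depth` is hypothetical, the shift uses the reported minimum of -2.261, and treating exactly 10.0 km as the default depth is an assumption drawn from the pile-up at the median.

```python
import math


def transform_depth(depth_km, min_depth=-2.261, default_depth=10.0):
    """Return a shifted log depth and a flag for the presumed default depth."""
    # Shift so the smallest reported depth maps to 0, then use log1p so 0 stays finite.
    shifted = depth_km - min_depth
    return {
        "log_depth": math.log1p(shifted),
        # ASSUMPTION: values at exactly 10 km are catalogue defaults, not measurements.
        "is_default_depth": math.isclose(depth_km, default_depth, abs_tol=1e-9),
    }


# Reported min, median and max from the stats above.
for depth in (-2.261, 10.0, 248.7):
    print(depth, transform_depth(depth))
```

Keeping the flag as its own feature lets a model treat the 10 km pile-up separately from measured depths.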

place

text metadata multilingual
Free-text place descriptions for seismic events, typically formatted as distance + direction + landmark (e.g. '104 km SSW of Nikolski, Alaska'), averaging 29 characters and 6 words. Alaska dominates with 1991 mentions, followed by Canada (472) and Mexico (412), and 'off the coast of Oregon' alone repeats 151 times, driving a 19.8% duplicate rate across 3002 unique values. The multilingual flag is essentially noise: 3719 of 3742 rows are English, with only a handful of de/es/ja/ceb tags likely from place names misread by the detector. Treatment: Parse into structured fields (distance_km, bearing, reference_place, region) rather than embedding the raw string. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
3,002
len_min
4
len_max
59
len_mean
29.47
len_median
29
len_p95
36
word_mean
6.293
word_median
6
n_empty
0
n_duplicates
740
duplicate_rate
0.1978
vocab_size
1,036
readability_flesch_mean
69.91
emoji_rate
0
url_rate
0
one_word_rate
0.0005345
allcaps_rate
0
boilerplate_rate
0
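
For the `region` field in particular, a simple heuristic is to take the text after the last comma, falling back to the text after the last ' of '. This is a sketch, not a full gazetteer lookup; `region_of` is a hypothetical helper and the sample strings are illustrative.

```python
from collections import Counter


def region_of(place):
    """Guess the region: text after the last comma, else after the last ' of '."""
    text = place.strip()
    if "," in text:
        return text.rsplit(",", 1)[1].strip()
    if " of " in text:
        return text.rsplit(" of ", 1)[1].strip()
    return text


# Illustrative strings in the formats described above.
places = [
    "104 km SSW of Nikolski, Alaska",
    "off the coast of Oregon",
    "off the coast of Oregon",
]
print(Counter(region_of(p) for p in places))
```

Counting regions this way gives a quick check on the Alaska, Canada, and Mexico tallies quoted above, though word-level mention counts and last-token regions will not match exactly.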

earthquake_type

categorical feature imbalance
Classifies seismic events into earthquake, explosion, or landslide, but is effectively a constant: 3,739 of 3,742 rows (top_rate 0.9992) are 'earthquake', with only 2 explosions and 1 landslide. Entropy of 0.0101 (entropy_ratio 0.0064) confirms there is virtually no information content here. Treatment: Drop as a predictor, or isolate the 3 non-earthquake rows as anomalies. high · anthropic:claude-opus-4-7
n
3,742
nulls
0 (0.0%)
unique
3
top_value
earthquake
top_rate
0.9992
cardinality
3
entropy
0.01014
entropy_ratio
0.006396
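
The entropy figures can be reproduced from the class counts given above (3,739 earthquakes, 2 explosions, 1 landslide). The sketch assumes entropy is measured in bits and that `entropy_ratio` divides by the maximum entropy for three classes; under those assumptions the result agrees with the reported values to rounding.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


counts = [3739, 2, 1]  # earthquake, explosion, landslide
entropy = shannon_entropy_bits(counts)
# Normalise by log2(k), the entropy of a uniform distribution over k classes.
entropy_ratio = entropy / math.log2(len(counts))
print(round(entropy, 5), round(entropy_ratio, 6))
```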