fips_county-geology_counties · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/geographic/fips_county/geology_counties.csv

Saturn profiled 3,235 rows across 9 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

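# Install the saturn CLI (IPython shell escape), then profile the CSV.
!pip install saturn-dissect
import subprocess

# check=True raises CalledProcessError if the saturn command exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/geographic/fips_county/geology_counties.csv",
    "--findings", "fips_county-geology_counties.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)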

Summary confidence: high

This dataset links 3,235 U.S. counties and county-equivalents (by FIPS code) to their nearest mineral or fuel deposit, including the deposit's type, era, state, and distance. Coal dominates deposit_type at roughly 42% of rows, with Copper, Iron, and Oil rounding out the major categories; it is worth checking whether this reflects true geological prevalence or sampling bias. The distance_to_deposit column is heavily right-skewed (skew ~7.5, max 5652 vs. median 152), so a small number of remote counties pull the mean (230) well above the median and deserve a closer look. deposit_era has nine labels led by Pennsylvanian (~23%), and deposit_state concentrates in Missouri, Ohio, and Alabama even though the counties themselves are spread across all 56 state codes.

citing: row_count · column_count · deposit_type.top_values · deposit_type.top_rate · distance_to_deposit.skew · distance_to_deposit.median · distance_to_deposit.max · distance_to_deposit.mean · deposit_era.top_values · deposit_era.top_rate · deposit_state.top_values · state_name.top_values · state_name.cardinality

Out[4]:

saturn.schema() · 9 columns

column	kind	n	null%	unique	alerts
fips	numeric	3,235	0.0%	3,235
county_name	text	3,235	0.0%	1,973	short_text duplicates
state	categorical	3,235	0.0%	56
state_name	categorical	3,235	0.0%	56
distance_to_deposit	numeric	3,235	0.0%	2,202	high_skew
nearest_deposit	categorical	3,235	0.0%	97
deposit_type	categorical	3,235	0.0%	10
deposit_era	categorical	3,235	0.0%	9
deposit_state	categorical	3,235	0.0%	25

Fig 1.

deposit_type · Shows how heavily Coal dominates the deposit mix relative to metals and hydrocarbons.

Top values for deposit_type (10 unique shown, of 10 total).
value	count	share
Coal	1345	41.6%
Copper	485	15.0%
Iron	403	12.5%
Oil	400	12.4%
Natural Gas	235	7.3%
Lead	170	5.3%
Phosphate	81	2.5%
Gold	72	2.2%
Zinc	23	0.7%
Silver	21	0.6%

Fig 2.

distance_to_deposit · Highlights the strong right skew and the long tail of counties far from any deposit.

Histogram bins for distance_to_deposit (median: 152.0).
bin	count
1.8 – 143.1	1519
143.1 – 284.3	1181
284.3 – 425.6	343
425.6 – 566.9	64
566.9 – 708.1	4
708.1 – 849.4	2
849.4 – 990.7	4
990.7 – 1132	2
1132 – 1273	0
1273 – 1414	3
1414 – 1556	8
1556 – 1697	82
1697 – 1838	3
1838 – 1980	3
1980 – 2121	1
2121 – 2262	0
2262 – 2403	0
2403 – 2545	5
2545 – 2686	1
2686 – 2827	0
2827 – 2968	0
2968 – 3110	0
3110 – 3251	0
3251 – 3392	0
3392 – 3533	0
3533 – 3675	0
3675 – 3816	0
3816 – 3957	0
3957 – 4098	0
4098 – 4240	0
4240 – 4381	0
4381 – 4522	0
4522 – 4664	0
4664 – 4805	3
4805 – 4946	2
4946 – 5087	0
5087 – 5229	0
5229 – 5370	0
5370 – 5511	2
5511 – 5652	3

Fig 3.

deposit_era · Compares the nine geological eras, with Pennsylvanian leading and Permian trailing.

Top values for deposit_era (9 unique shown, of 9 total).
value	count	share
Pennsylvanian	732	22.6%
Devonian	422	13.0%
Paleozoic	419	13.0%
Tertiary	401	12.4%
Mississippian	401	12.4%
Precambrian	327	10.1%
Cretaceous	289	8.9%
Miocene	149	4.6%
Permian	95	2.9%

Fig 4.

deposit_state · Shows which states host the deposits that are nearest to the most counties, led by Missouri and Ohio.

Top values for deposit_state (20 unique shown, of 25 total).
value	count	share
Missouri	478	14.8%
Ohio	448	13.8%
Alabama	434	13.4%
Indiana	263	8.1%
Arkansas	257	7.9%
South Dakota	210	6.5%
New Jersey	179	5.5%
Texas	170	5.3%
Colorado	144	4.5%
Louisiana	115	3.6%
New York	99	3.1%
Oregon	71	2.2%
California	68	2.1%
Idaho	54	1.7%
New Mexico	51	1.6%
Washington	47	1.5%
Rhode Island	43	1.3%
Montana	37	1.1%
Utah	30	0.9%
Arizona	16	0.5%

Fig 5.

state_name · Confirms broad geographic coverage across states, with Texas and Georgia contributing the most county rows.

Top values for state_name (20 unique shown, of 56 total).
value	count	share
Texas	254	7.9%
Georgia	159	4.9%
Virginia	133	4.1%
Kentucky	120	3.7%
Missouri	115	3.6%
Kansas	105	3.2%
Illinois	102	3.2%
North Carolina	100	3.1%
Iowa	99	3.1%
Tennessee	95	2.9%
Nebraska	93	2.9%
Indiana	92	2.8%
Ohio	88	2.7%
Minnesota	87	2.7%
Michigan	83	2.6%
Mississippi	82	2.5%
Puerto Rico	78	2.4%
Oklahoma	77	2.4%
Arkansas	75	2.3%
Wisconsin	72	2.2%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
fips	numeric	0.0%
county_name	text	0.0%
state	categorical	0.0%
state_name	categorical	0.0%
distance_to_deposit	numeric	0.0%
nearest_deposit	categorical	0.0%
deposit_type	categorical	0.0%
deposit_era	categorical	0.0%
deposit_state	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	fips	distance_to_deposit
fips	+1.00	+0.30
distance_to_deposit	+0.30	+1.00

fips numeric identifier

This is the FIPS code identifying U.S. counties (or equivalent geographies), with all 3235 values unique and no nulls. Values span 1001 to 78030, consistent with state-prefixed county codes, and the distribution is broad (IQR 27090) rather than meaningfully skewed (skew 0.17). Treat the numeric stats as incidental — magnitude has no quantitative meaning here.

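Treatment: Cast to a zero-padded five-character string (the minimum of 1,001 shows that leading zeros were dropped, so Alabama's 01001 is stored as 1001) and use it as a join key to county-level reference data; do not model as numeric.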
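
A minimal sketch of that cast, assuming the file is loaded with pandas (an assumption; the notebook itself only calls the saturn CLI). The three rows are hypothetical sample values taken from the reported min, median, and max.

```python
import pandas as pd

# Hypothetical sample rows. Reading the real file with
# pd.read_csv(path, dtype={"fips": str}) would avoid losing leading zeros at all.
df = pd.DataFrame({"fips": [1001, 30035, 78030]})

# Cast to string and left-pad to the five-digit county FIPS width.
df["fips"] = df["fips"].astype(str).str.zfill(5)

# The first two characters are the state FIPS prefix.
df["state_fips"] = df["fips"].str[:2]
```

After padding, the key can be joined against reference tables that store FIPS as text.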

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["fips"].stats

stat	value
n	3,235
nulls	0 (0.0%)
unique	3,235
min	1,001
max	78,030
mean	3.152e+04
median	30,035
std	1.643e+04
q1	19,036
q3	46,126
iqr	27,090
skew	0.1738
kurtosis	-0.6075
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 8.

Distribution of fips. Vertical dash marks the median.

Histogram bins for fips (median: 30035.0).
bin	count
1001 – 2927	97
2927 – 4852	15
4852 – 6778	133
6778 – 8704	64
8704 – 1.063e+04	12
1.063e+04 – 1.256e+04	68
1.256e+04 – 1.448e+04	159
1.448e+04 – 1.641e+04	49
1.641e+04 – 1.833e+04	194
1.833e+04 – 2.026e+04	204
2.026e+04 – 2.218e+04	184
2.218e+04 – 2.411e+04	39
2.411e+04 – 2.604e+04	33
2.604e+04 – 2.796e+04	152
2.796e+04 – 2.989e+04	197
2.989e+04 – 3.181e+04	149
3.181e+04 – 3.374e+04	27
3.374e+04 – 3.566e+04	54
3.566e+04 – 3.759e+04	162
3.759e+04 – 3.952e+04	141
3.952e+04 – 4.144e+04	113
4.144e+04 – 4.337e+04	67
4.337e+04 – 4.529e+04	51
4.529e+04 – 4.722e+04	161
4.722e+04 – 4.914e+04	283
4.914e+04 – 5.107e+04	48
5.107e+04 – 5.3e+04	99
5.3e+04 – 5.492e+04	94
5.492e+04 – 5.685e+04	95
5.685e+04 – 5.877e+04	0
5.877e+04 – 6.07e+04	5
6.07e+04 – 6.262e+04	0
6.262e+04 – 6.455e+04	0
6.455e+04 – 6.648e+04	1
6.648e+04 – 6.84e+04	0
6.84e+04 – 7.033e+04	4
7.033e+04 – 7.225e+04	78
7.225e+04 – 7.418e+04	0
7.418e+04 – 7.61e+04	0
7.61e+04 – 7.803e+04	3

county_name text metadata

This column holds U.S. county-level place names, with 1,973 unique values across 3,235 rows. Most entries contain the word 'county' (2,999 occurrences), alongside Louisiana 'parish' (64) and Puerto Rico 'municipio' (78) variants. Names repeat heavily: the duplicate rate is 39% (1,262 rows), led by 'Washington County' (30), 'Jefferson County' (25), and 'Franklin County' (24), which is expected because the same county name recurs across states. Entries are short (mean 14.2 characters, about 2 words) and there are no nulls or empty strings.

Treatment: Pair with a state column to form a unique geographic key before joining or aggregating.
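
One way to check that pairing, sketched with pandas on hypothetical rows (the column names match the schema; the data and the county_key column are illustrative only). On the real file, the same check would confirm whether the pair is actually unique.

```python
import pandas as pd

# Hypothetical rows: the same county name appears in two states.
df = pd.DataFrame({
    "county_name": ["Washington County", "Washington County", "Jefferson County"],
    "state": ["AL", "AR", "AL"],
})

# county_name alone is ambiguous here.
name_is_unique = df["county_name"].is_unique

# The (county_name, state) pair identifies each row in this sample.
pair_is_unique = not df.duplicated(subset=["county_name", "state"]).any()

# Optional single-column key for joins (hypothetical column name).
df["county_key"] = df["county_name"] + ", " + df["state"]
```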

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["county_name"].stats

stat	value
n	3,235
nulls	0 (0.0%)
unique	1,973
len_min	4
len_max	46
len_mean	14.18
len_median	14
len_p95	18
word_mean	2.084
word_median	2
n_empty	0
n_duplicates	1,262
duplicate_rate	0.3901
vocab_size	1,973
readability_flesch_mean	33.65
emoji_rate	0
url_rate	0
one_word_rate	0.0003091
allcaps_rate	0
boilerplate_rate	0
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	39.0% duplicate strings

Fig 9.

Character-length distribution for county_name.

Character-length distribution for county_name (mean: 14.18).
chars	count
4 – 5	1
5 – 6	0
6 – 7	0
7 – 8	0
8 – 9	0
9 – 10	29
10 – 11	256
11 – 12	465
12 – 13	683
13 – 14	588
14 – 16	495
16 – 17	294
17 – 18	221
18 – 19	67
19 – 20	51
20 – 21	23
21 – 22	16
22 – 23	14
23 – 24	8
24 – 25	4
25 – 26	7
26 – 27	1
27 – 28	1
28 – 29	1
29 – 30	0
30 – 31	2
31 – 32	1
32 – 33	1
33 – 34	1
34 – 36	1
36 – 37	0
37 – 38	0
38 – 39	0
39 – 40	0
40 – 41	2
41 – 42	1
42 – 43	0
43 – 44	0
44 – 45	0
45 – 46	1

state categorical feature

This is a U.S. state code column with 56 unique values, six more than the 50 states, which suggests that DC and territories such as PR (78 rows) are included. The distribution is fairly even (entropy ratio 0.92): TX leads at 7.9% (254 of 3,235 rows), followed by GA, VA, KY, and MO, consistent with a county-level dataset in which states with more counties contribute more rows. No nulls.

Treatment: One-hot or target-encode for modelling; verify the 6 extra codes beyond 50 states.
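
The check on the extra codes can be sketched as a set difference. The observed set below is a hypothetical sample (only TX, GA, and PR are confirmed by the top-values table); on the real file it would come from the unique values of the state column.

```python
# Two-letter postal codes of the 50 states.
STATES_50 = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY".split()
)

# Hypothetical observed codes; replace with set(df["state"].unique()).
observed = {"TX", "GA", "PR", "DC"}

# Codes that are not one of the 50 states and need manual review.
extra = sorted(observed - STATES_50)
```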

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["state"].stats

stat	value
n	3,235
nulls	0 (0.0%)
unique	56
top_value	TX
top_rate	0.07852
cardinality	56
entropy	5.338
entropy_ratio	0.9192

Fig 10.

Top values for state.

Top values for state (20 unique shown, of 56 total).
value	count	share
TX	254	7.9%
GA	159	4.9%
VA	133	4.1%
KY	120	3.7%
MO	115	3.6%
KS	105	3.2%
IL	102	3.2%
NC	100	3.1%
IA	99	3.1%
TN	95	2.9%
NE	93	2.9%
IN	92	2.8%
OH	88	2.7%
MN	87	2.7%
MI	83	2.6%
MS	82	2.5%
PR	78	2.4%
OK	77	2.4%
AR	75	2.3%
WI	72	2.2%

state_name categorical feature

This column holds full U.S. state names, one row per county or county-equivalent, with 56 distinct values across 3,235 rows (the 50 states plus DC and territories). Texas leads at 254 rows (7.85%), followed by Georgia (159) and Virginia (133), which matches the known county-count ranking. The distribution is fairly even (entropy ratio 0.92) with no nulls. Its counts, cardinality, and entropy are identical to those of state, so the two columns appear to carry the same information.

Treatment: Use as a categorical grouping key; one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["state_name"].stats

stat	value
n	3,235
nulls	0 (0.0%)
unique	56
top_value	Texas
top_rate	0.07852
cardinality	56
entropy	5.338
entropy_ratio	0.9192

Fig 11.

Top values for state_name.

Top values for state_name (20 unique shown, of 56 total).
value	count	share
Texas	254	7.9%
Georgia	159	4.9%
Virginia	133	4.1%
Kentucky	120	3.7%
Missouri	115	3.6%
Kansas	105	3.2%
Illinois	102	3.2%
North Carolina	100	3.1%
Iowa	99	3.1%
Tennessee	95	2.9%
Nebraska	93	2.9%
Indiana	92	2.8%
Ohio	88	2.7%
Minnesota	87	2.7%
Michigan	83	2.6%
Mississippi	82	2.5%
Puerto Rico	78	2.4%
Oklahoma	77	2.4%
Arkansas	75	2.3%
Wisconsin	72	2.2%

distance_to_deposit numeric feature

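Numeric feature measuring the distance from each county to its nearest deposit, with all 3,235 rows populated and 2,202 distinct values. The unit is not recorded in the stats and should be confirmed against the source; a median of 152 is implausibly small for county-to-deposit distances in metres, so kilometres or miles is more likely. The distribution is severely right-skewed (skew 7.51, kurtosis 77.6): the median is 152.0 while the mean is 230.12, and the max reaches 5652.4, about 24 times the Q3 of 235.75. About 4.9% of rows (159) are flagged as outliers, and there are no zeros or nulls. The histogram also shows a secondary cluster of 82 rows in the 1556 to 1697 bin.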

Treatment: log-transform before modelling to tame the heavy right tail.
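
A minimal sketch of the transform, using NumPy and pandas on hypothetical values that span the reported range. The outlier fence uses Tukey's 1.5 × IQR rule as an assumption; the report does not say which rule saturn applies for n_outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical values spanning the reported range (min 1.8, max 5652.4).
s = pd.Series([1.8, 85.5, 152.0, 235.8, 1600.0, 5652.4], name="distance_to_deposit")

# log1p compresses the right tail and stays defined at zero.
log_dist = np.log1p(s)

# Upper Tukey fence from the reported quartiles (q1=85.5, q3=235.75).
# Assumption: saturn's outlier rule may differ from this one.
q1, q3 = 85.5, 235.75
upper_fence = q3 + 1.5 * (q3 - q1)
is_outlier = s > upper_fence
```

Inspecting the rows above the fence before transforming them is a reasonable first step, since they may correspond to a specific group of remote counties.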

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["distance_to_deposit"].stats

stat	value
n	3,235
nulls	0 (0.0%)
unique	2,202
min	1.8
max	5652
mean	230.1
median	152
std	399.9
q1	85.5
q3	235.8
iqr	150.2
skew	7.511
kurtosis	77.6
n_outliers	159
outlier_rate	0.04915
zero_rate	0
alert: high_skew	skew=+7.51

Fig 12.

Distribution of distance_to_deposit. Vertical dash marks the median.

Histogram bins for distance_to_deposit (median: 152.0).
bin	count
1.8 – 143.1	1519
143.1 – 284.3	1181
284.3 – 425.6	343
425.6 – 566.9	64
566.9 – 708.1	4
708.1 – 849.4	2
849.4 – 990.7	4
990.7 – 1132	2
1132 – 1273	0
1273 – 1414	3
1414 – 1556	8
1556 – 1697	82
1697 – 1838	3
1838 – 1980	3
1980 – 2121	1
2121 – 2262	0
2262 – 2403	0
2403 – 2545	5
2545 – 2686	1
2686 – 2827	0
2827 – 2968	0
2968 – 3110	0
3110 – 3251	0
3251 – 3392	0
3392 – 3533	0
3533 – 3675	0
3675 – 3816	0
3816 – 3957	0
3957 – 4098	0
4098 – 4240	0
4240 – 4381	0
4381 – 4522	0
4522 – 4664	0
4664 – 4805	3
4805 – 4946	2
4946 – 5087	0
5087 – 5229	0
5229 – 5370	0
5370 – 5511	2
5511 – 5652	3

nearest_deposit categorical feature

This column names the nearest mineral deposit for each county, with 97 distinct sites across 3,235 rows and no nulls. The distribution is moderately concentrated: "Hatchet Creek Copper" alone accounts for 13.4% (434 rows) and the top three deposits cover about 31% of rows (999), yet the entropy ratio of 0.76 indicates that the long tail still carries meaningful spread. Names mix mines (copper, clay, sulfur), pits, banks, quads, districts, processing plants, and at least one placeholder ("Unknown - Coal & Zn"), suggesting heterogeneous source nomenclature rather than a clean controlled vocabulary.

Treatment: Treat as a high-cardinality categorical: target- or frequency-encode, and consider grouping rare deposits into an 'other' bucket.
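
Frequency encoding and rare-category bucketing can be sketched with pandas as follows. The sample series is hypothetical (the names come from the top-values table, the counts do not), and the 15% threshold is an arbitrary illustration rather than a recommended value.

```python
import pandas as pd

# Hypothetical sample; deposit names are taken from the top-values table.
s = pd.Series(
    ["Hatchet Creek Copper"] * 5 + ["Cardonia Pit"] * 3 + ["Old Leyden Mine", "Afc Pit"],
    name="nearest_deposit",
)

# Frequency encoding: replace each category with its share of rows.
share = s.value_counts(normalize=True)
freq_encoded = s.map(share)

# Bucket categories below a (hypothetical) 15% share into "Other".
min_share = 0.15
bucketed = s.where(s.map(share) >= min_share, "Other")
```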

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["nearest_deposit"].stats

stat	value
n	3,235
nulls	0 (0.0%)
unique	97
top_value	Hatchet Creek Copper
top_rate	0.1342
cardinality	97
entropy	4.999
entropy_ratio	0.7574

Fig 13.

Top values for nearest_deposit.

Top values for nearest_deposit (20 unique shown, of 97 total).
value	count	share
Hatchet Creek Copper	434	13.4%
Chaney No 1 Clay Mine	302	9.3%
Cardonia Pit	263	8.1%
Hager Mine	179	5.5%
Lodgepole Quad	171	5.3%
Cooper Mine	164	5.1%
Stewart May	161	5.0%
Main Pass Sulfur Mine	115	3.6%
Dunn Bank	101	3.1%
Batesville District	96	3.0%
Unknown - Coal & Zn	90	2.8%
Tole and Thorp Fireclay Mine	89	2.8%
Ventech Gas Processors Sulfur Plant	84	2.6%
Midland Farms Sulfur Plant	66	2.0%
Belden Pit	65	2.0%
Afc Pit	45	1.4%
Iron Mine Hill Deposit	43	1.3%
Butte Valley, Alamo #1	42	1.3%
Santa Rosa Tar Sands	41	1.3%
Old Leyden Mine	39	1.2%

deposit_type categorical label

Categorical label identifying the type of mineral or fuel deposit, with 10 distinct values across 3,235 rows and no nulls. Coal dominates at 41.6% (1,345 rows), followed by Copper, Iron, and Oil, while Zinc (23) and Silver (21) are rare. The entropy ratio of 0.76 indicates a moderately concentrated distribution: fossil fuels (Coal, Oil, Natural Gas) together make up 61.2% of rows, whereas precious metals (Gold, Silver) total 2.9%.

Treatment: One-hot encode; consider grouping rare classes (Zinc, Silver) if used as a target.
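
The entropy figures for this column can be reproduced from the top-values table. The sketch assumes entropy_ratio is Shannon entropy in bits divided by log2 of the cardinality; the report does not state that definition, but it matches the reported values here.

```python
import math

# Counts copied from the deposit_type top-values table (all 10 categories).
counts = [1345, 485, 403, 400, 235, 170, 81, 72, 23, 21]
total = sum(counts)

# Shannon entropy in bits.
entropy = -sum((c / total) * math.log2(c / total) for c in counts)

# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(counts))
```

A ratio of 1.0 would mean all ten types are equally common; values near 0 would mean a single type accounts for almost every row.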

anthropic:claude-opus-4-7 · confidence high

Out[31]:

saturn.columns["deposit_type"].stats

stat	value
n	3,235
nulls	0 (0.0%)
unique	10
top_value	Coal
top_rate	0.4158
cardinality	10
entropy	2.536
entropy_ratio	0.7633

Fig 14.

Top values for deposit_type.

Top values for deposit_type (10 unique shown, of 10 total).
value	count	share
Coal	1345	41.6%
Copper	485	15.0%
Iron	403	12.5%
Oil	400	12.4%
Natural Gas	235	7.3%
Lead	170	5.3%
Phosphate	81	2.5%
Gold	72	2.2%
Zinc	23	0.7%
Silver	21	0.6%

deposit_era categorical feature

Categorical geological-time label for deposits, spanning 9 distinct values across 3,235 complete rows. The distribution is unusually flat for a categorical (entropy_ratio 0.945): Pennsylvanian leads at only 22.6% (732 rows) and even the smallest category, Permian, holds 95 rows. Note the mixed granularity: broad divisions (Paleozoic, Precambrian) sit alongside periods (Devonian, Mississippian, Pennsylvanian, Permian, Cretaceous, Tertiary) and an epoch (Miocene). Several labels nest inside others, for example Devonian within Paleozoic and Miocene within Tertiary, so the categories are not mutually exclusive in geological time.

Treatment: One-hot encode, but consider reconciling overlapping era/period granularity before modelling.
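
One possible reconciliation is to roll every label up to era level. The sketch below applies that mapping to the counts from the top-values table; the ERA_OF name is hypothetical, and collapsing to eras is only one option, since it discards the period-level detail.

```python
import pandas as pd

# Map each of the nine labels to a single era-level granularity.
ERA_OF = {
    "Pennsylvanian": "Paleozoic",
    "Devonian": "Paleozoic",
    "Mississippian": "Paleozoic",
    "Permian": "Paleozoic",
    "Paleozoic": "Paleozoic",
    "Cretaceous": "Mesozoic",
    "Tertiary": "Cenozoic",
    "Miocene": "Cenozoic",
    "Precambrian": "Precambrian",
}

# Counts copied from the deposit_era top-values table.
counts = pd.Series({
    "Pennsylvanian": 732, "Devonian": 422, "Paleozoic": 419,
    "Tertiary": 401, "Mississippian": 401, "Precambrian": 327,
    "Cretaceous": 289, "Miocene": 149, "Permian": 95,
})

# Group index labels through the mapping and sum within each era.
era_counts = counts.groupby(ERA_OF).sum()
```

On row-level data the same dictionary could be applied with Series.map before one-hot encoding.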

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["deposit_era"].stats

stat	value
n	3,235
nulls	0 (0.0%)
unique	9
top_value	Pennsylvanian
top_rate	0.2263
cardinality	9
entropy	2.997
entropy_ratio	0.9453

Fig 15.

Top values for deposit_era.

Top values for deposit_era (9 unique shown, of 9 total).
value	count	share
Pennsylvanian	732	22.6%
Devonian	422	13.0%
Paleozoic	419	13.0%
Tertiary	401	12.4%
Mississippian	401	12.4%
Precambrian	327	10.1%
Cretaceous	289	8.9%
Miocene	149	4.6%
Permian	95	2.9%

deposit_state categorical feature

`deposit_state` is a categorical US-state field with 25 distinct values across 3,235 rows and no nulls. The distribution is fairly even (entropy ratio 0.83), though the top three states, Missouri (478, 14.8%), Ohio (448), and Alabama (434), together account for 42% of rows. Coverage is partial: only 25 states host a nearest deposit, so the deposit inventory is not nationwide even though the counties are.

Treatment: One-hot or target-encode for modelling; verify whether the 25-state coverage is intentional.

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["deposit_state"].stats

stat	value
n	3,235
nulls	0 (0.0%)
unique	25
top_value	Missouri
top_rate	0.1478
cardinality	25
entropy	3.85
entropy_ratio	0.829

Fig 16.

Top values for deposit_state.

Top values for deposit_state (20 unique shown, of 25 total).
value	count	share
Missouri	478	14.8%
Ohio	448	13.8%
Alabama	434	13.4%
Indiana	263	8.1%
Arkansas	257	7.9%
South Dakota	210	6.5%
New Jersey	179	5.5%
Texas	170	5.3%
Colorado	144	4.5%
Louisiana	115	3.6%
New York	99	3.1%
Oregon	71	2.2%
California	68	2.1%
Idaho	54	1.7%
New Mexico	51	1.6%
Washington	47	1.5%
Rhode Island	43	1.3%
Montana	37	1.1%
Utah	30	0.9%
Arizona	16	0.5%
