data trove us geological features

saturn notebook · generated 2026-06-22

Overview

Source: /home/coolhand/html/datavis/data_trove/geographic/geology/geological_regions.geojson

Saturn profiled 14 rows across 7 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
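# Install the Saturn CLI (notebook shell escape).
!pip install saturn-dissect
import subprocess

# Profile the GeoJSON source, write findings to a JSON file, and enable the opt-in LLM prose.
subprocess.run([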
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/geographic/geology/geological_regions.geojson",
    "--findings", "data-trove-us-geological-features.json",
    "--llm", "anthropic:default",
])
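
The seven profiled columns suggest that each GeoJSON feature was flattened into one row: its properties plus a derived geometry_type. The sketch below illustrates that idea under stated assumptions. The helper name `flatten_features` and the sample values are hypothetical, and this is not confirmed to be how Saturn reads the file.

```python
import json


def flatten_features(collection: dict) -> list[dict]:
    """Turn a GeoJSON FeatureCollection into one flat row per feature."""
    rows = []
    for feature in collection.get("features", []):
        # Copy the feature's attributes (name, age, color, ...).
        row = dict(feature.get("properties") or {})
        # Assumption: geometry_type is taken from the feature's geometry.type.
        geometry = feature.get("geometry") or {}
        row["geometry_type"] = geometry.get("type")
        rows.append(row)
    return rows


# Tiny in-memory stand-in for the real file; the values are placeholders.
sample = json.loads("""
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "properties": {"name": "Example Region", "geology_type": "Example Type"},
   "geometry": {"type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}}
]}
""")

rows = flatten_features(sample)
print(rows)
```

To try it on the real source, replace `sample` with `json.load(open(path))` for the GeoJSON path given in the Overview.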

Summary confidence: medium

This dataset is a small geospatial catalogue of 14 named geological regions across the United States, each described as a polygon with attributes covering geology type, geological age, primary resources, and a short description. The most notable pattern is that Sedimentary Basin is the most common geology type (4 of 14 regions, 28.6%), which is consistent with the frequent appearance of oil and natural gas among the primary resources. The geological ages range from Precambrian to Tertiary, so the catalogue covers regions with very different formation histories; it is worth examining age alongside resource type to spot any age-resource relationships.

citing: geology_type.top_value · geology_type.top_rate · primary_resources.top_value · age.top_value · age.n_unique · row_count · column_count

Out[4]:

saturn.schema() · 7 columns

column             kind         n   null%  unique  alerts
name               categorical  14  0.0%   14      long_tail
geology_type       categorical  14  0.0%   11      long_tail
primary_resources  categorical  14  0.0%   12      long_tail
age                categorical  14  0.0%   12      long_tail
description        categorical  14  0.0%   14      long_tail
color              categorical  14  0.0%   13      long_tail
geometry_type      categorical  14  0.0%   1       imbalance
Fig 1.
geology_type · Look for how strongly Sedimentary Basin dominates compared to other geology types like Shale or Precambrian formations.
Top values for geology_type (11 unique shown, of 11 total).
value                 count  share
Sedimentary Basin     4      28.6%
Ancient Marine Basin  1      7.1%
Shale Formation       1      7.1%
Shale/Carbonate       1      7.1%
Precambrian Shield    1      7.1%
Igneous/Metamorphic   1      7.1%
Basin and Range       1      7.1%
Porphyry Copper       1      7.1%
Precambrian Uplift    1      7.1%
Metamorphic           1      7.1%
Sedimentary           1      7.1%
Fig 2.
primary_resources · Notice how often oil and gas appear — either alone or bundled with coal and other resources — versus metals like gold and iron ore.
Top values for primary_resources (12 unique shown, of 12 total).
value                               count  share
Oil, Natural Gas                    2      14.3%
Gold                                2      14.3%
Oil, Natural Gas, Coal, Rich Soils  1      7.1%
Coal, Natural Gas                   1      7.1%
Natural Gas, Oil                    1      7.1%
Oil, Natural Gas, Sulfur            1      7.1%
Coal, Oil, Natural Gas              1      7.1%
Iron Ore                            1      7.1%
Gold, Silver, Copper, Lead, Zinc    1      7.1%
Gold, Silver, Copper                1      7.1%
Copper, Molybdenum                  1      7.1%
Phosphate                           1      7.1%
Fig 3.
age · Check which geological eras are most represented and whether older eras like Precambrian cluster around different resource types.
Top values for age (12 unique shown, of 12 total).
value                                  count  share
Precambrian                            2      14.3%
Tertiary                               2      14.3%
Cretaceous (145-66 million years ago)  1      7.1%
Pennsylvanian-Permian                  1      7.1%
Devonian                               1      7.1%
Tertiary-Cretaceous                    1      7.1%
Permian                                1      7.1%
Devonian-Mississippian                 1      7.1%
Pennsylvanian                          1      7.1%
Cretaceous-Tertiary                    1      7.1%
Paleozoic                              1      7.1%
Miocene-Pliocene                       1      7.1%
Fig 4.
name · Each region name is unique — use this as a reference index to identify which specific basins and formations are included in the dataset.
Top values for name (14 unique shown, of 14 total).
value                           count  share
Cretaceous Interior Seaway      1      7.1%
Appalachian Coal Basin          1      7.1%
Marcellus-Utica Shale           1      7.1%
Gulf Coastal Plain              1      7.1%
Permian Basin                   1      7.1%
Bakken Formation                1      7.1%
Illinois Basin                  1      7.1%
Mesabi Iron Range               1      7.1%
Colorado Mineral Belt           1      7.1%
Nevada Mining District          1      7.1%
Copper Belt - Arizona           1      7.1%
Black Hills                     1      7.1%
Southern Appalachian Gold Belt  1      7.1%
Florida Phosphate District      1      7.1%
Fig 5.
Per-column null rate across the corpus. Columns are ordered by input position.
column             kind         null %
name               categorical  0.0%
geology_type       categorical  0.0%
primary_resources  categorical  0.0%
age                categorical  0.0%
description        categorical  0.0%
color              categorical  0.0%
geometry_type      categorical  0.0%

name categorical label

This column contains names of geological formations, basins, and resource districts (e.g., 'Permian Basin', 'Marcellus-Utica Shale', 'Bakken Formation'), making it a label or identifier for geological regions in a small reference dataset of 14 rows. Every value is unique (cardinality = 14, n = 14), producing a perfect entropy ratio of 1.0 — the column is essentially a primary key of human-readable names. The 'long_tail' alert is a statistical artefact of all values appearing exactly once (top_rate = 0.071), not a meaningful distribution signal. No nulls are present.

Treatment: Use as a row label or join key; drop from any ML feature set as it contributes no generalizable signal.

anthropic:default · confidence high
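
The suggestion to use name as a row label or join key relies on every value being distinct. A minimal sketch of that check, using the 14 names listed in the report; the helper `duplicate_keys` is a hypothetical name for illustration, not part of Saturn.

```python
from collections import Counter


def duplicate_keys(values):
    """Return the values that occur more than once (empty list = usable as a key)."""
    counts = Counter(values)
    return sorted(value for value, count in counts.items() if count > 1)


# The 14 region names reported for this column.
names = [
    "Cretaceous Interior Seaway", "Appalachian Coal Basin",
    "Marcellus-Utica Shale", "Gulf Coastal Plain", "Permian Basin",
    "Bakken Formation", "Illinois Basin", "Mesabi Iron Range",
    "Colorado Mineral Belt", "Nevada Mining District",
    "Copper Belt - Arizona", "Black Hills",
    "Southern Appalachian Gold Belt", "Florida Phosphate District",
]

# No duplicates, so each name can index exactly one row.
row_index = {name: position for position, name in enumerate(names)}
print(duplicate_keys(names), len(row_index))
```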
Out[11]:

saturn.columns["name"].stats

stat           value
n              14
nulls          0 (0.0%)
unique         14
top_value      Cretaceous Interior Seaway
top_rate       0.07143
cardinality    14
entropy        3.807
entropy_ratio  1
alert: long_tail · 14 singleton categories
Fig 6.
Top values for name.
Top values for name (14 unique shown, of 14 total).
value                           count  share
Cretaceous Interior Seaway      1      7.1%
Appalachian Coal Basin          1      7.1%
Marcellus-Utica Shale           1      7.1%
Gulf Coastal Plain              1      7.1%
Permian Basin                   1      7.1%
Bakken Formation                1      7.1%
Illinois Basin                  1      7.1%
Mesabi Iron Range               1      7.1%
Colorado Mineral Belt           1      7.1%
Nevada Mining District          1      7.1%
Copper Belt - Arizona           1      7.1%
Black Hills                     1      7.1%
Southern Appalachian Gold Belt  1      7.1%
Florida Phosphate District      1      7.1%

geology_type categorical feature

This column classifies the geological formation type of each region, covering 11 distinct categories across only 14 rows. 'Sedimentary Basin' is the most common value with 4 occurrences (top_rate 28.6%), while the other 10 categories each appear exactly once, a textbook long-tail distribution that triggers the 'long_tail' alert. The entropy ratio of 0.935 shows the distribution is close to uniform apart from the top value; with only 14 rows, frequency-based conclusions are unreliable. The mix of broadly defined types ('Sedimentary Basin', 'Metamorphic') alongside highly specific ones ('Porphyry Copper', 'Shale/Carbonate') suggests an inconsistent taxonomy.

Treatment: Standardise taxonomy and one-hot encode; note that n=14 is too small for robust categorical modelling.

anthropic:default · confidence medium
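
The entropy and entropy_ratio figures reported for this column can be reproduced from the value counts alone. The sketch below assumes Shannon entropy in bits (base-2 logarithm), normalised by log2 of the cardinality. That assumption matches the reported numbers, but it is inferred from them rather than taken from Saturn's documentation.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    categories = len(counts)
    if categories <= 1:
        return 0.0  # a constant column has no spread to measure
    return shannon_entropy(counts) / math.log2(categories)


# geology_type: one value seen 4 times, ten values seen once each.
geology_counts = [4] + [1] * 10

print(round(shannon_entropy(geology_counts), 3))  # 3.236
print(round(entropy_ratio(geology_counts), 4))    # 0.9354
```

The same functions give 3.807 and a ratio of 1 for a column of 14 distinct values, such as name.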
Out[14]:

saturn.columns["geology_type"].stats

stat           value
n              14
nulls          0 (0.0%)
unique         11
top_value      Sedimentary Basin
top_rate       0.2857
cardinality    11
entropy        3.236
entropy_ratio  0.9354
alert: long_tail · 10 singleton categories
Fig 7.
Top values for geology_type.
Top values for geology_type (11 unique shown, of 11 total).
value                 count  share
Sedimentary Basin     4      28.6%
Ancient Marine Basin  1      7.1%
Shale Formation       1      7.1%
Shale/Carbonate       1      7.1%
Precambrian Shield    1      7.1%
Igneous/Metamorphic   1      7.1%
Basin and Range       1      7.1%
Porphyry Copper       1      7.1%
Precambrian Uplift    1      7.1%
Metamorphic           1      7.1%
Sedimentary           1      7.1%

primary_resources categorical label

This column captures the primary natural resources of each geological region, expressed as free-form comma-separated lists. With only 14 rows and 12 unique values, the dataset is tiny; the top values 'Oil, Natural Gas' and 'Gold' each appear twice (14.3% each), while the other 10 values are singletons. The near-maximum entropy ratio (0.982) and the 'long_tail' alert confirm extreme fragmentation. Part of that fragmentation is an artefact: semantically equivalent entries such as 'Oil, Natural Gas' and 'Natural Gas, Oil' are counted as distinct values, so inconsistent ordering inflates the apparent cardinality.

Treatment: Normalize ordering, split multi-value strings into sets, then one-hot encode individual resources before modelling.

anthropic:default · confidence high
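
The treatment above can be sketched as follows: parse each string into an order-independent set, then build one indicator column per individual resource. The function names `parse_resources` and `multi_hot` are hypothetical, and the input is the list of 12 distinct strings from the report rather than the 14 raw rows.

```python
def parse_resources(value: str) -> frozenset[str]:
    """Split a comma-separated resource string into an order-independent set."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def multi_hot(values):
    """Return (vocabulary, matrix) with one 0/1 column per individual resource."""
    parsed = [parse_resources(value) for value in values]
    vocabulary = sorted(set().union(*parsed))
    matrix = [[int(resource in row) for resource in vocabulary] for row in parsed]
    return vocabulary, matrix


# The 12 distinct strings reported for primary_resources.
observed = [
    "Oil, Natural Gas", "Gold", "Oil, Natural Gas, Coal, Rich Soils",
    "Coal, Natural Gas", "Natural Gas, Oil", "Oil, Natural Gas, Sulfur",
    "Coal, Oil, Natural Gas", "Iron Ore", "Gold, Silver, Copper, Lead, Zinc",
    "Gold, Silver, Copper", "Copper, Molybdenum", "Phosphate",
]

vocabulary, matrix = multi_hot(observed)
distinct_sets = {parse_resources(value) for value in observed}
print(len(distinct_sets), vocabulary)
```

After parsing, 'Oil, Natural Gas' and 'Natural Gas, Oil' become the same set, so the 12 strings reduce to 11 distinct combinations built from 13 individual resources.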
Out[17]:

saturn.columns["primary_resources"].stats

stat           value
n              14
nulls          0 (0.0%)
unique         12
top_value      Oil, Natural Gas
top_rate       0.1429
cardinality    12
entropy        3.522
entropy_ratio  0.9823
alert: long_tail · 10 singleton categories
Fig 8.
Top values for primary_resources.
Top values for primary_resources (12 unique shown, of 12 total).
value                               count  share
Oil, Natural Gas                    2      14.3%
Gold                                2      14.3%
Oil, Natural Gas, Coal, Rich Soils  1      7.1%
Coal, Natural Gas                   1      7.1%
Natural Gas, Oil                    1      7.1%
Oil, Natural Gas, Sulfur            1      7.1%
Coal, Oil, Natural Gas              1      7.1%
Iron Ore                            1      7.1%
Gold, Silver, Copper, Lead, Zinc    1      7.1%
Gold, Silver, Copper                1      7.1%
Copper, Molybdenum                  1      7.1%
Phosphate                           1      7.1%

age categorical label

This column captures geological time period (stratigraphic age), classifying each record by the geologic era or period of its origin (e.g., 'Precambrian', 'Devonian', 'Permian'). With only 14 rows, 12 distinct values, and an entropy ratio of 0.98, the distribution is nearly flat: almost every record has a unique age label, which limits its predictive utility as a categorical feature. The 'long_tail' alert is consistent with this near-uniform spread; the top value ('Precambrian') appears only twice (14.3%), tied with 'Tertiary'. The labels are also inconsistent: 'Cretaceous-Tertiary' and 'Tertiary-Cretaceous' likely refer to the same interval, and one label carries a parenthetical date range ('Cretaceous (145-66 million years ago)') while the others do not, which suggests unstandardized entry.

Treatment: Standardize free-form period strings into canonical geologic time scale bins before using as a categorical feature.

anthropic:default · confidence high
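
One way to approach the standardization above is to strip parenthetical notes and write every range oldest-first. The sketch below is illustrative only: `CHRONOLOGICAL_ORDER` and `canonical_age` are hypothetical names, and ranking an era (Paleozoic) in the same list as periods and epochs is a simplification that only serves to order the terms within a label. It also assumes, as the interpretation above does, that reversed ranges denote the same interval.

```python
import re

# Oldest-to-youngest rank for the terms that occur in this column.
# Assumption: a single flat ordering is good enough for sorting range endpoints.
CHRONOLOGICAL_ORDER = [
    "Precambrian", "Paleozoic", "Devonian", "Mississippian", "Pennsylvanian",
    "Permian", "Cretaceous", "Tertiary", "Miocene", "Pliocene",
]
RANK = {term: position for position, term in enumerate(CHRONOLOGICAL_ORDER)}


def canonical_age(label: str) -> str:
    """Normalise an age label: drop parentheticals, order ranges oldest-first."""
    cleaned = re.sub(r"\s*\([^)]*\)", "", label).strip()
    terms = [term.strip() for term in cleaned.split("-") if term.strip()]
    unknown = [term for term in terms if term not in RANK]
    if unknown:
        raise ValueError(f"unrecognised age term(s): {unknown}")
    return "-".join(sorted(terms, key=RANK.__getitem__))


# The 12 distinct labels reported for age.
labels = [
    "Precambrian", "Tertiary", "Cretaceous (145-66 million years ago)",
    "Pennsylvanian-Permian", "Devonian", "Tertiary-Cretaceous", "Permian",
    "Devonian-Mississippian", "Pennsylvanian", "Cretaceous-Tertiary",
    "Paleozoic", "Miocene-Pliocene",
]

canonical = sorted({canonical_age(label) for label in labels})
print(len(canonical), canonical)
```

Under these assumptions the 12 labels collapse to 11, because 'Tertiary-Cretaceous' and 'Cretaceous-Tertiary' map to the same canonical string. Binning further, for example into eras, would need a mapping chosen by a domain expert.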
Out[20]:

saturn.columns["age"].stats

stat           value
n              14
nulls          0 (0.0%)
unique         12
top_value      Precambrian
top_rate       0.1429
cardinality    12
entropy        3.522
entropy_ratio  0.9823
alert: long_tail · 10 singleton categories
Fig 9.
Top values for age.
Top values for age (12 unique shown, of 12 total).
value                                  count  share
Precambrian                            2      14.3%
Tertiary                               2      14.3%
Cretaceous (145-66 million years ago)  1      7.1%
Pennsylvanian-Permian                  1      7.1%
Devonian                               1      7.1%
Tertiary-Cretaceous                    1      7.1%
Permian                                1      7.1%
Devonian-Mississippian                 1      7.1%
Pennsylvanian                          1      7.1%
Cretaceous-Tertiary                    1      7.1%
Paleozoic                              1      7.1%
Miocene-Pliocene                       1      7.1%

description categorical free_text

This column contains free-text descriptive annotations for 14 geographic or geological regions, each explaining their natural resource profile and economic significance (oil, gas, coal, mining, agriculture). Every row has a unique description (cardinality 14, entropy_ratio 1.0), meaning it functions purely as a human-readable label with no repeated values. The top_rate of 0.071 confirms perfect uniformity — no single value dominates. The 'long_tail' alert is technically triggered but is trivially explained by all values appearing exactly once.

Treatment: Tokenize and embed for semantic similarity or NLP tasks; drop before any categorical encoding or modelling.

anthropic:default · confidence high
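
The tokenization step in the treatment above might look like the sketch below. It uses a deliberately simple lowercase word tokenizer on three of the reported descriptions. The `tokenize` helper is hypothetical, and the embedding step is left out because the report does not name a model.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Three of the 14 descriptions from the report.
descriptions = [
    "Major shale gas play, modern fracking boom",
    "Major shale oil play, North Dakota boom",
    "Homestake Mine, gold rush history",
]

token_counts = Counter(
    token for description in descriptions for token in tokenize(description)
)
print(token_counts.most_common(4))
```

Even though every description is unique as a whole string, individual tokens such as 'shale' and 'boom' repeat across rows, which is what makes token-level or embedding-based features usable where categorical encoding is not.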
Out[23]:

saturn.columns["description"].stats

stat           value
n              14
nulls          0 (0.0%)
unique         14
top_value      Ancient sea divided North America; left rich sediments forming oil/gas deposits and fertile agricultural soils. Shaped settlement, agriculture, and economy across the Great Plains.
top_rate       0.07143
cardinality    14
entropy        3.807
entropy_ratio  1
alert: long_tail · 14 singleton categories
Fig 10.
Top values for description.
Top values for description (14 unique shown, of 14 total).
value  count  share
Ancient sea divided North America; left rich sediments forming oil/gas deposits and fertile agricultural soils. Shaped settlement, agriculture, and economy across the Great Plains.  1  7.1%
Major coal-producing region, historically drove industrialization  1  7.1%
Major shale gas play, modern fracking boom  1  7.1%
Major oil and gas region, petrochemical industry center  1  7.1%
One of the most productive oil regions in US history  1  7.1%
Major shale oil play, North Dakota boom  1  7.1%
Coal and oil production, agricultural region  1  7.1%
Historic iron mining, built US steel industry  1  7.1%
Rich mining district, gold rush history  1  7.1%
Comstock Lode, major silver and gold production  1  7.1%
Major copper mining, mining towns  1  7.1%
Homestake Mine, gold rush history  1  7.1%
First US gold rush, Dahlonega  1  7.1%
Major phosphate mining for fertilizers  1  7.1%

color categorical label

This column contains CSS hex color codes (e.g., '#1e3a8a', '#4a5568'), most likely the display color assigned to each region when it is drawn on a map. With 13 unique values across only 14 rows and an entropy ratio of 0.99, the distribution is nearly uniform: every color appears exactly once except '#1e3a8a', which appears twice. The 'long_tail' alert is technically triggered but is a minor artefact of the tiny dataset size; the most frequent value holds only a 14.3% share.

Treatment: Use as-is for display/join purposes; if feeding into a model, decode to RGB numeric triplets or embed as categorical with one-hot encoding.

anthropic:default · confidence high
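
Decoding the hex codes to numeric RGB triplets, as the treatment above suggests, needs only the standard library. A minimal sketch, assuming every value is a six-digit '#rrggbb' code as in the reported values; `hex_to_rgb` is a hypothetical helper name.

```python
def hex_to_rgb(code: str) -> tuple[int, int, int]:
    """Convert a '#rrggbb' color code into an (r, g, b) tuple of 0-255 integers."""
    digits = code.strip().lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected a six-digit hex color, got {code!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#1e3a8a"))  # (30, 58, 138)
print(hex_to_rgb("#4a5568"))  # (74, 85, 104)
```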
Out[26]:

saturn.columns["color"].stats

stat           value
n              14
nulls          0 (0.0%)
unique         13
top_value      #1e3a8a
top_rate       0.1429
cardinality    13
entropy        3.664
entropy_ratio  0.9903
alert: long_tail · 12 singleton categories
Fig 11.
Top values for color.
Top values for color (13 unique shown, of 13 total).
value    count  share
#1e3a8a  2      14.3%
#4a5568  1      7.1%
#2d3748  1      7.1%
#744210  1      7.1%
#92400e  1      7.1%
#374151  1      7.1%
#7c2d12  1      7.1%
#ca8a04  1      7.1%
#a16207  1      7.1%
#b45309  1      7.1%
#713f12  1      7.1%
#854d0e  1      7.1%
#065f46  1      7.1%

geometry_type categorical metadata

This column records the geometry type of spatial features and contains exactly one value, 'Polygon', across all 14 rows with no nulls. It is a constant column — zero entropy, cardinality of 1, and a top_rate of 1.0 — meaning it carries no discriminative information whatsoever. The imbalance alert is technically correct but understates the situation: this is not imbalanced, it is entirely invariant.

Treatment: Drop before modelling; zero-variance constant adds no signal and will cause issues in some encoders.

anthropic:default · confidence high
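
Dropping constant columns before modelling can be automated rather than done by name. The sketch below works on rows held as plain dictionaries. The helpers `constant_columns` and `drop_columns` are hypothetical, and the two sample rows use region names from the report with the single observed geometry type.

```python
def constant_columns(rows):
    """Return the columns whose value is identical in every row."""
    if not rows:
        return []
    return [
        column for column in rows[0]
        if len({row.get(column) for row in rows}) == 1
    ]


def drop_columns(rows, to_drop):
    """Return copies of the rows without the named columns."""
    excluded = set(to_drop)
    return [
        {key: value for key, value in row.items() if key not in excluded}
        for row in rows
    ]


rows = [
    {"name": "Permian Basin", "geometry_type": "Polygon"},
    {"name": "Bakken Formation", "geometry_type": "Polygon"},
]

invariant = constant_columns(rows)
print(invariant)
print(drop_columns(rows, invariant))
```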
Out[29]:

saturn.columns["geometry_type"].stats

stat           value
n              14
nulls          0 (0.0%)
unique         1
top_value      Polygon
top_rate       1
cardinality    1
entropy        0
entropy_ratio  0
alert: imbalance · top value is 100.0% of rows
Fig 12.
Top values for geometry_type.
Top values for geometry_type (1 unique shown, of 1 total).
value    count  share
Polygon  14     100.0%

How to cite

BibTeX
@misc{saturn-data-trove-us-geological-features-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove us geological features},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-us-geological-features}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove us geological features. Source: /home/coolhand/html/datavis/data_trove/geographic/geology/geological_regions.geojson. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-us-geological-features