saturn·

data trove us geological features

source: /home/coolhand/html/datavis/data_trove/geographic/geology/geological_regions.geojson · 14 rows · 7 columns · profiled 2026-06-22

Reading

dataset summary · medium confidence anthropic:default

This dataset is a small geospatial catalogue of 14 named geological regions across the United States, each described as a polygon with attributes covering geology type, geological age, primary resources, and a short description. The most notable pattern is the dominance of Sedimentary Basins (4 of 14 regions) as the leading geology type, which aligns with the prevalence of oil and natural gas as primary resources. The geological ages span a wide range from Precambrian to Tertiary, suggesting this catalogue captures regions of very different formation histories — worth examining alongside resource type to spot any age-resource relationships.

citing: geology_type.top_value · geology_type.top_rate · primary_resources.top_value · age.top_value · age.n_unique · row_count · column_count

Schema

7 columns
Per-column summary.

column             type         nulls  unique  alerts
name               categorical  0.0%   14      long_tail
geology_type       categorical  0.0%   11      long_tail
primary_resources  categorical  0.0%   12      long_tail
age                categorical  0.0%   12      long_tail
description        categorical  0.0%   14      long_tail
color              categorical  0.0%   13      long_tail
geometry_type      categorical  0.0%   1       imbalance
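
The per-column figures in the schema (row count, null rate, unique count) could be recomputed from the GeoJSON along the following lines. This is a minimal sketch, not the profiler's actual code: it assumes the columns come from each feature's `properties` and that `geometry_type` is derived from `geometry.type`. The helper names and the two-feature sample are hypothetical.

```python
import json


def load_collection(path):
    """Read a GeoJSON FeatureCollection from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def flatten_features(collection):
    """Turn a FeatureCollection into a list of flat row dicts."""
    rows = []
    for feature in collection.get("features", []):
        row = dict(feature.get("properties") or {})
        geometry = feature.get("geometry") or {}
        # Assumption: the geometry_type column mirrors geometry.type.
        row["geometry_type"] = geometry.get("type")
        rows.append(row)
    return rows


def column_summary(rows):
    """Compute n, null count, null percentage and unique count per column."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        nulls = len(values) - len(present)
        summary[col] = {
            "n": len(values),
            "nulls": nulls,
            "null_pct": 100.0 * nulls / len(values) if values else 0.0,
            "unique": len(set(present)),
        }
    return summary


# Hypothetical two-feature collection, for illustration only.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Region A", "color": "#111111"},
            "geometry": {"type": "Polygon",
                         "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]},
        },
        {
            "type": "Feature",
            "properties": {"name": "Region B", "color": None},
            "geometry": {"type": "Polygon",
                         "coordinates": [[[2, 2], [3, 2], [3, 3], [2, 2]]]},
        },
    ],
}

summary = column_summary(flatten_features(sample))
```

For the real file, `column_summary(flatten_features(load_collection(path)))` would be the equivalent call, with `path` pointing at the source listed above.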

name

categorical label long_tail
This column contains names of geological formations, basins, and resource districts (e.g., 'Permian Basin', 'Marcellus-Utica Shale', 'Bakken Formation'), making it a label or identifier for geological regions in a small reference dataset of 14 rows. Every value is unique (cardinality = 14, n = 14), producing a perfect entropy ratio of 1.0 — the column is essentially a primary key of human-readable names. The 'long_tail' alert is a statistical artefact of all values appearing exactly once (top_rate = 0.071), not a meaningful distribution signal. No nulls are present. Treatment: Use as a row label or join key; drop from any ML feature set as it contributes no generalizable signal. high · anthropic:default
n: 14
nulls: 0 (0.0%)
unique: 14
top_value: Cretaceous Interior Seaway
top_rate: 0.07143
cardinality: 14
entropy: 3.807
entropy_ratio: 1

geology_type

categorical feature long_tail
This column classifies the geological formation type of each region, with 11 distinct categories across only 14 rows. 'Sedimentary Basin' leads with 4 occurrences (top_rate = 0.2857, or 28.6%), while the other 10 categories each appear exactly once, which is the long-tail pattern flagged in the alerts. The entropy ratio of 0.935 shows the distribution is close to uniform apart from the top value; with so few rows per category, frequency-based conclusions are unreliable. The mix of broadly defined types ('Sedimentary Basin', 'Metamorphic') and highly specific ones ('Porphyry Copper', 'Shale/Carbonate') suggests an inconsistent taxonomy. Treatment: Standardise the taxonomy before one-hot encoding; note that n = 14 is too small for robust categorical modelling. medium · anthropic:default
n: 14
nulls: 0 (0.0%)
unique: 11
top_value: Sedimentary Basin
top_rate: 0.2857
cardinality: 11
entropy: 3.236
entropy_ratio: 0.9354
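
The entropy figures can be checked by hand. The sketch below assumes the profiler reports Shannon entropy in bits and normalises it by log2 of the cardinality; the report does not state this, but the assumption reproduces the values shown for geology_type (one category with 4 rows, ten categories with 1 row each). The function names are illustrative.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    categories = sum(1 for c in counts if c > 0)
    if categories <= 1:
        # A constant column has no spread to normalise.
        return 0.0
    return entropy_bits(counts) / math.log2(categories)


# geology_type: 'Sedimentary Basin' x4, ten other categories x1 each.
geology_counts = [4] + [1] * 10

geology_entropy = entropy_bits(geology_counts)
geology_ratio = entropy_ratio(geology_counts)

# A column where all 14 values are distinct, such as name.
unique_counts = [1] * 14
unique_entropy = entropy_bits(unique_counts)
```

Under this assumption the fully unique columns reach log2(14), about 3.807 bits, which matches the entropy reported for name and description.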

primary_resources

categorical label long_tail
This column records the primary natural resources of each geological region, expressed as free-form comma-separated lists. With only 14 rows and 12 unique values, the dataset is tiny; the top values 'Oil, Natural Gas' and 'Gold' each appear twice (14.3% each), while all other entries are singletons. The near-maximum entropy ratio (0.982) and the long-tail alert confirm extreme fragmentation. Part of this is an artefact of entry order: semantically equivalent entries such as 'Oil, Natural Gas' and 'Natural Gas, Oil' are counted as distinct values, which inflates the apparent cardinality. Treatment: Normalize ordering, split multi-value strings into sets, then one-hot encode individual resources before modelling. high · anthropic:default
n: 14
nulls: 0 (0.0%)
unique: 12
top_value: Oil, Natural Gas
top_rate: 0.1429
cardinality: 12
entropy: 3.522
entropy_ratio: 0.9823
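
The recommended treatment for primary_resources might look like the following. This is a sketch under the assumption that resources are separated by commas and that surrounding whitespace is insignificant; `parse_resources` and `one_hot` are hypothetical helper names, and the three input strings are the examples quoted above.

```python
def parse_resources(value):
    """Split a comma-separated resource string into an order-free set."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def one_hot(resource_sets):
    """Encode each set as a 0/1 vector over the sorted resource vocabulary."""
    vocabulary = sorted(set().union(*resource_sets))
    encoded = [
        [1 if resource in row else 0 for resource in vocabulary]
        for row in resource_sets
    ]
    return vocabulary, encoded


raw_values = ["Oil, Natural Gas", "Natural Gas, Oil", "Gold"]
parsed = [parse_resources(value) for value in raw_values]
vocabulary, encoded = one_hot(parsed)
```

Because sets ignore order, the first two strings collapse to the same value, so the two orderings no longer count as separate categories.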

age

categorical label long_tail
This column captures geological time period / stratigraphic age, classifying records by the geologic era or period of their origin (e.g., 'Precambrian', 'Cretaceous', 'Devonian'). With only 14 rows, 12 distinct values, and an entropy ratio of 0.98, the distribution is nearly flat — almost every record has a unique age label, which limits its predictive utility as a categorical feature. The 'long_tail' alert is consistent with this near-uniform spread, and the top value ('Precambrian') appears only twice (14.3% frequency). Label inconsistency is also present: overlapping ranges like 'Cretaceous-Tertiary' and 'Tertiary-Cretaceous' likely refer to the same interval, suggesting unstandardized entry. Treatment: Standardize free-form period strings into canonical geologic time scale bins before using as a categorical feature. high · anthropic:default
n: 14
nulls: 0 (0.0%)
unique: 12
top_value: Precambrian
top_rate: 0.1429
cardinality: 12
entropy: 3.522
entropy_ratio: 0.9823
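
One way to standardise reversed ranges such as 'Tertiary-Cretaceous' is to reorder the endpoints against a fixed time scale. The sketch below is an assumption-laden illustration: the period list is a simplified oldest-to-youngest ordering, the oldest-first convention is a choice rather than something the dataset specifies, and labels with unrecognised parts are returned unchanged. `PERIOD_ORDER` and `canonical_age` are hypothetical names.

```python
import re

# Simplified geologic periods, oldest first (assumed vocabulary).
PERIOD_ORDER = [
    "Precambrian", "Cambrian", "Ordovician", "Silurian", "Devonian",
    "Mississippian", "Pennsylvanian", "Permian", "Triassic", "Jurassic",
    "Cretaceous", "Tertiary", "Quaternary",
]
_RANK = {name: index for index, name in enumerate(PERIOD_ORDER)}


def canonical_age(label):
    """Rewrite a period or period range with the oldest period first."""
    parts = [p for p in re.split(r"\s*[-\u2013]\s*", label.strip()) if p]
    if not parts or any(part not in _RANK for part in parts):
        # Unknown vocabulary: leave the label for manual review.
        return label.strip()
    ordered = sorted(set(parts), key=_RANK.__getitem__)
    return "-".join(ordered)


labels = ["Cretaceous-Tertiary", "Tertiary-Cretaceous", "Precambrian", "Devonian"]
canonical = [canonical_age(label) for label in labels]
```

After this step the two reversed range labels map to a single value, which reduces the apparent cardinality before any encoding.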

description

categorical free_text long_tail
This column contains free-text descriptive annotations for 14 geographic or geological regions, each explaining their natural resource profile and economic significance (oil, gas, coal, mining, agriculture). Every row has a unique description (cardinality 14, entropy_ratio 1.0), meaning it functions purely as a human-readable label with no repeated values. The top_rate of 0.071 confirms perfect uniformity — no single value dominates. The 'long_tail' alert is technically triggered but is trivially explained by all values appearing exactly once. Treatment: Tokenize and embed for semantic similarity or NLP tasks; drop before any categorical encoding or modelling. high · anthropic:default
n: 14
nulls: 0 (0.0%)
unique: 14
top_value: Ancient sea divided North America; left rich sediments forming oil/gas deposits and fertile agricultural soils. Shaped settlement, agriculture, and economy across the Great Plains.
top_rate: 0.07143
cardinality: 14
entropy: 3.807
entropy_ratio: 1

color

categorical label long_tail
This column contains CSS hex color codes (e.g., '#1e3a8a', '#4a5568'). Since every row is a map polygon, these are most likely per-region fill or display colors rather than analytical attributes. With 13 unique values across only 14 rows and an entropy ratio of 0.99, the distribution is nearly uniform: every color appears exactly once except '#1e3a8a', which appears twice (a 14.3% share). The long-tail alert is a minor artefact of the tiny dataset size. Treatment: Use as-is for display purposes; if feeding into a model, decode to RGB numeric triplets or one-hot encode as a categorical. high · anthropic:default
n: 14
nulls: 0 (0.0%)
unique: 13
top_value: #1e3a8a
top_rate: 0.1429
cardinality: 13
entropy: 3.664
entropy_ratio: 0.9903
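
Decoding the hex codes into RGB triplets, as the treatment suggests, needs only the standard library. This sketch assumes values are 6-digit codes like the two quoted above, with 3-digit shorthand accepted as a convenience; `hex_to_rgb` is a hypothetical helper name.

```python
def hex_to_rgb(code):
    """Convert a CSS hex color such as '#1e3a8a' to an (r, g, b) tuple."""
    digits = code.strip().lstrip("#")
    if len(digits) == 3:
        # Expand shorthand: 'abc' becomes 'aabbcc'.
        digits = "".join(ch * 2 for ch in digits)
    if len(digits) != 6:
        raise ValueError(f"not a 3- or 6-digit hex color: {code!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


top_color = hex_to_rgb("#1e3a8a")
second_color = hex_to_rgb("#4a5568")
```

Each channel falls in the range 0 to 255, so the triplets can be scaled to the 0 to 1 range if a model expects normalised inputs.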

geometry_type

categorical metadata imbalance
This column records the geometry type of each spatial feature and contains exactly one value, 'Polygon', across all 14 rows with no nulls. It is a constant column (zero entropy, cardinality of 1, top_rate of 1.0), so it carries no discriminative information. The imbalance alert understates the situation: the column is not merely imbalanced, it is entirely invariant. Treatment: Drop before modelling; a zero-variance constant adds no signal and will cause issues in some encoders. high · anthropic:default
n: 14
nulls: 0 (0.0%)
unique: 1
top_value: Polygon
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
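
Constant columns like geometry_type can be detected and dropped generically rather than by name. The following is a minimal sketch over rows held as plain dicts; the helper names are hypothetical, and the two sample rows reuse region names quoted earlier with only the fields needed for the illustration.

```python
def constant_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    seen = {}
    for row in rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(value)
    return sorted(name for name, values in seen.items() if len(values) <= 1)


def drop_columns(rows, names):
    """Return copies of the rows without the named columns."""
    excluded = set(names)
    return [
        {key: value for key, value in row.items() if key not in excluded}
        for row in rows
    ]


# Illustrative rows only; the real dataset has 14 rows and 7 columns.
rows = [
    {"name": "Permian Basin", "geometry_type": "Polygon"},
    {"name": "Bakken Formation", "geometry_type": "Polygon"},
]

to_drop = constant_columns(rows)
cleaned = drop_columns(rows, to_drop)
```

Checking for a single distinct value catches zero-variance columns regardless of their type, so the same pass works for categorical and numeric fields.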