data trove us geological features
Reading
This dataset is a small geospatial catalogue of 14 named geological regions across the United States, each described as a polygon with attributes covering geology type, geological age, primary resources, and a short description. The most common geology type is Sedimentary Basin (4 of 14 regions, 28.6%); every other type appears exactly once. That is consistent with the resource column, where oil or natural gas appears in 7 of the 14 entries. The geological ages range from Precambrian to Tertiary, so the catalogue covers regions with very different formation histories. It is worth examining age alongside resource type to spot any age-resource relationships, bearing in mind that 14 rows can only suggest patterns, not establish them.
citing: geology_type.top_value · geology_type.top_rate · primary_resources.top_value · age.top_value · age.n_unique · row_count · column_count
Charts the summary said to look at first
Value counts: geology_type
| value | count | share |
|---|---|---|
| Sedimentary Basin | 4 | 28.6% |
| Ancient Marine Basin | 1 | 7.1% |
| Shale Formation | 1 | 7.1% |
| Shale/Carbonate | 1 | 7.1% |
| Precambrian Shield | 1 | 7.1% |
| Igneous/Metamorphic | 1 | 7.1% |
| Basin and Range | 1 | 7.1% |
| Porphyry Copper | 1 | 7.1% |
| Precambrian Uplift | 1 | 7.1% |
| Metamorphic | 1 | 7.1% |
| Sedimentary | 1 | 7.1% |
Value counts: primary_resources
| value | count | share |
|---|---|---|
| Oil, Natural Gas | 2 | 14.3% |
| Gold | 2 | 14.3% |
| Oil, Natural Gas, Coal, Rich Soils | 1 | 7.1% |
| Coal, Natural Gas | 1 | 7.1% |
| Natural Gas, Oil | 1 | 7.1% |
| Oil, Natural Gas, Sulfur | 1 | 7.1% |
| Coal, Oil, Natural Gas | 1 | 7.1% |
| Iron Ore | 1 | 7.1% |
| Gold, Silver, Copper, Lead, Zinc | 1 | 7.1% |
| Gold, Silver, Copper | 1 | 7.1% |
| Copper, Molybdenum | 1 | 7.1% |
| Phosphate | 1 | 7.1% |
Value counts: age
| value | count | share |
|---|---|---|
| Precambrian | 2 | 14.3% |
| Tertiary | 2 | 14.3% |
| Cretaceous (145-66 million years ago) | 1 | 7.1% |
| Pennsylvanian-Permian | 1 | 7.1% |
| Devonian | 1 | 7.1% |
| Tertiary-Cretaceous | 1 | 7.1% |
| Permian | 1 | 7.1% |
| Devonian-Mississippian | 1 | 7.1% |
| Pennsylvanian | 1 | 7.1% |
| Cretaceous-Tertiary | 1 | 7.1% |
| Paleozoic | 1 | 7.1% |
| Miocene-Pliocene | 1 | 7.1% |
Value counts: name
| value | count | share |
|---|---|---|
| Cretaceous Interior Seaway | 1 | 7.1% |
| Appalachian Coal Basin | 1 | 7.1% |
| Marcellus-Utica Shale | 1 | 7.1% |
| Gulf Coastal Plain | 1 | 7.1% |
| Permian Basin | 1 | 7.1% |
| Bakken Formation | 1 | 7.1% |
| Illinois Basin | 1 | 7.1% |
| Mesabi Iron Range | 1 | 7.1% |
| Colorado Mineral Belt | 1 | 7.1% |
| Nevada Mining District | 1 | 7.1% |
| Copper Belt - Arizona | 1 | 7.1% |
| Black Hills | 1 | 7.1% |
| Southern Appalachian Gold Belt | 1 | 7.1% |
| Florida Phosphate District | 1 | 7.1% |
Schema
7 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| name | categorical | 0.0% | 14 | long_tail |
| geology_type | categorical | 0.0% | 11 | long_tail |
| primary_resources | categorical | 0.0% | 12 | long_tail |
| age | categorical | 0.0% | 12 | long_tail |
| description | categorical | 0.0% | 14 | long_tail |
| color | categorical | 0.0% | 13 | long_tail |
| geometry_type | categorical | 0.0% | 1 | imbalance |
name
categorical · label · long_tail
This column holds the names of geological formations, basins, and resource districts (e.g., 'Permian Basin', 'Marcellus-Utica Shale', 'Bakken Formation') and serves as the label for each region in this 14-row reference dataset. Every value is unique (cardinality = 14, n = 14), so the entropy ratio is exactly 1.0 and the column is effectively a primary key of human-readable names. The 'long_tail' alert is an artefact of every value appearing exactly once (top_rate = 0.071), not a meaningful distribution signal. No nulls are present.
Treatment: Use as a row label or join key; drop from any ML feature set, since it carries no generalizable signal.
- n: 14
- nulls: 0 (0.0%)
- unique: 14
- top_value: Cretaceous Interior Seaway
- top_rate: 0.07143
- cardinality: 14
- entropy: 3.807
- entropy_ratio: 1
geology_type
categorical · feature · long_tail
This column classifies the geological formation type of each region, with 11 distinct categories across only 14 rows. 'Sedimentary Basin' is the most frequent value with 4 occurrences (top_rate 28.6%), while the other 10 categories each appear exactly once, which is what triggers the long_tail alert. The entropy ratio of 0.935 is close to its maximum of 1, so the distribution is nearly uniform apart from the top value, and with n = 14 any frequency-based conclusion is unreliable. The mix of broadly defined types ('Sedimentary Basin', 'Metamorphic') with highly specific ones ('Porphyry Copper', 'Shale/Carbonate') suggests an inconsistent taxonomy; for example, 'Sedimentary' and 'Sedimentary Basin' are recorded as separate categories.
Treatment: Standardise the taxonomy, then one-hot encode; note that n = 14 is too small for robust categorical modelling.
- n: 14
- nulls: 0 (0.0%)
- unique: 11
- top_value: Sedimentary Basin
- top_rate: 0.2857
- cardinality: 11
- entropy: 3.236
- entropy_ratio: 0.9354
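
The entropy and entropy_ratio figures above can be reproduced by hand. The sketch below is an assumption about how the profiler computes them (Shannon entropy in bits, divided by log2 of the cardinality); the report does not state its formula, but the published numbers are consistent with this definition. The function name `entropy_stats` is hypothetical.

```python
import math
from collections import Counter

def entropy_stats(values):
    """Return (entropy in bits, entropy ratio) for a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    # Shannon entropy in bits: H = sum(p * log2(1 / p))
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    # Assumption: the ratio normalises by the maximum entropy log2(k).
    # A constant column (k == 1) has no spread, so report 0 for it.
    k = len(counts)
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio

# geology_type counts from the table: one value four times, ten values once
geology_type = ["Sedimentary Basin"] * 4 + [
    "Ancient Marine Basin", "Shale Formation", "Shale/Carbonate",
    "Precambrian Shield", "Igneous/Metamorphic", "Basin and Range",
    "Porphyry Copper", "Precambrian Uplift", "Metamorphic", "Sedimentary",
]
h, ratio = entropy_stats(geology_type)
print(round(h, 3), round(ratio, 4))  # 3.236 0.9354
```

Under the same definition, a column with 14 unique values in 14 rows gives log2(14) ≈ 3.807 bits and a ratio of 1, matching the name and description columns.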
primary_resources
categorical · label · long_tail
This column lists the primary natural resources of each geological region as a free-form, comma-separated string. The dataset is tiny: 14 rows yield 12 unique values, the top values 'Oil, Natural Gas' and 'Gold' each appear twice (14.3% each), and the remaining 10 entries are singletons. The near-maximum entropy ratio (0.982) and the long_tail alert reflect this fragmentation. Part of it is artificial: semantically equivalent entries such as 'Oil, Natural Gas' and 'Natural Gas, Oil' are counted as distinct values, so inconsistent ordering inflates the apparent cardinality.
Treatment: Normalize ordering, split multi-value strings into sets, then one-hot encode individual resources before modelling.
- n: 14
- nulls: 0 (0.0%)
- unique: 12
- top_value: Oil, Natural Gas
- top_rate: 0.1429
- cardinality: 12
- entropy: 3.522
- entropy_ratio: 0.9823
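
The suggested treatment for primary_resources (split, normalise ordering, one-hot encode) could look roughly like the following. This is a minimal sketch using only the standard library; the helper names `parse_resources` and `one_hot_resources` are hypothetical, and the sample strings are copied from the value-count table.

```python
def parse_resources(raw):
    """Split a comma-separated resource string into an order-free set of names."""
    return frozenset(part.strip() for part in raw.split(",") if part.strip())

def one_hot_resources(rows):
    """Return (sorted resource names, one 0/1 indicator row per input string)."""
    parsed = [parse_resources(r) for r in rows]
    vocabulary = sorted(set().union(*parsed))
    matrix = [[int(name in row) for name in vocabulary] for row in parsed]
    return vocabulary, matrix

# Sample values copied from the primary_resources table
samples = ["Oil, Natural Gas", "Natural Gas, Oil", "Coal, Natural Gas", "Gold"]
vocabulary, matrix = one_hot_resources(samples)
print(vocabulary)  # ['Coal', 'Gold', 'Natural Gas', 'Oil']
for raw, row in zip(samples, matrix):
    print(f"{raw!r}: {row}")
```

Because each string becomes a set, 'Oil, Natural Gas' and 'Natural Gas, Oil' produce identical indicator rows, which removes the ordering artefact described above.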
age
categorical · label · long_tail
This column records the geological age of each region as a free-form time label (e.g., 'Precambrian', 'Devonian', 'Pennsylvanian-Permian'). With only 14 rows, 12 distinct values, and an entropy ratio of 0.98, the distribution is nearly flat: almost every record has its own age label, which limits its usefulness as a categorical feature. The 'long_tail' alert is consistent with this near-uniform spread; the most frequent values, 'Precambrian' and 'Tertiary', each appear only twice (14.3%). The labels are also inconsistent. 'Cretaceous-Tertiary' and 'Tertiary-Cretaceous' likely refer to the same interval, one entry embeds a date range ('Cretaceous (145-66 million years ago)'), and the labels mix ranks from era ('Paleozoic') down to epoch ('Miocene-Pliocene').
Treatment: Standardize free-form period strings into canonical geologic time scale bins before using as a categorical feature.
- n: 14
- nulls: 0 (0.0%)
- unique: 12
- top_value: Precambrian
- top_rate: 0.1429
- cardinality: 12
- entropy: 3.522
- entropy_ratio: 0.9823
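
One way to standardise the age labels is to map every term to a coarse bin and treat ranges as sets, so that word order no longer matters. The sketch below assumes era-level bins (with Precambrian kept as its own bin); that granularity is an illustrative choice, not something the dataset prescribes, and `BIN_BY_TERM` and `canonical_bins` are hypothetical names.

```python
import re

# Lookup covering only the terms that occur in this column. The assignments
# follow the standard geologic time scale; the choice of bin level is an assumption.
BIN_BY_TERM = {
    "Precambrian": "Precambrian",
    "Paleozoic": "Paleozoic",
    "Devonian": "Paleozoic",
    "Mississippian": "Paleozoic",
    "Pennsylvanian": "Paleozoic",
    "Permian": "Paleozoic",
    "Cretaceous": "Mesozoic",
    "Tertiary": "Cenozoic",
    "Miocene": "Cenozoic",
    "Pliocene": "Cenozoic",
}

def canonical_bins(label):
    """Map a free-form age label to the set of bins it touches."""
    # Drop parenthetical notes such as "(145-66 million years ago)"
    cleaned = re.sub(r"\(.*?\)", "", label)
    terms = [t.strip() for t in cleaned.split("-") if t.strip()]
    # Unknown terms raise KeyError so new labels are not silently mis-binned
    return frozenset(BIN_BY_TERM[t] for t in terms)

print(sorted(canonical_bins("Cretaceous-Tertiary")))   # ['Cenozoic', 'Mesozoic']
print(sorted(canonical_bins("Tertiary-Cretaceous")))   # ['Cenozoic', 'Mesozoic']
print(sorted(canonical_bins("Cretaceous (145-66 million years ago)")))  # ['Mesozoic']
```

Returning a set rather than a single label keeps ranges that straddle two bins explicit instead of forcing them into one.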
description
categorical · free_text · long_tail
This column contains a free-text description of each of the 14 geological regions, explaining its natural resource profile and economic significance (oil, gas, coal, mining, agriculture). Every row has a unique description (cardinality 14, entropy_ratio 1.0), so the column works as human-readable annotation rather than as a category. The top_rate of 0.071 confirms perfect uniformity: no single value dominates. The 'long_tail' alert is technically triggered but is trivially explained by every value appearing exactly once.
Treatment: Tokenize and embed for semantic similarity or NLP tasks; drop before any categorical encoding or modelling.
- n: 14
- nulls: 0 (0.0%)
- unique: 14
- top_value: Ancient sea divided North America; left rich sediments forming oil/gas deposits and fertile agricultural soils. Shaped settlement, agriculture, and economy across the Great Plains.
- top_rate: 0.07143
- cardinality: 14
- entropy: 3.807
- entropy_ratio: 1
color
categorical · label · long_tail
This column contains CSS hex color codes (e.g., '#1e3a8a', '#4a5568'). In a polygon dataset these are most likely display colors for rendering each region on a map rather than analytical attributes, although the data itself does not confirm this. With 13 unique values across only 14 rows and an entropy ratio of 0.99, the distribution is nearly uniform: every color appears exactly once except '#1e3a8a', which appears twice. The long_tail alert is technically triggered but is a minor artefact of the tiny dataset size; the most frequent value holds only a 14.3% share.
Treatment: Use as-is for display/join purposes; if feeding into a model, decode to RGB numeric triplets or embed as categorical with one-hot encoding.
- n: 14
- nulls: 0 (0.0%)
- unique: 13
- top_value: #1e3a8a
- top_rate: 0.1429
- cardinality: 13
- entropy: 3.664
- entropy_ratio: 0.9903
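
Decoding the hex codes into RGB triplets, as the treatment suggests, needs only the standard library. This is a small sketch that assumes every value uses the six-digit '#rrggbb' form seen in the examples; `hex_to_rgb` is a hypothetical helper name.

```python
def hex_to_rgb(code):
    """Convert a '#rrggbb' string to an (r, g, b) tuple of ints in 0..255."""
    digits = code.lstrip("#")
    # Assumption: only the six-digit form occurs; reject anything else loudly
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {code!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(hex_to_rgb("#1e3a8a"))  # (30, 58, 138)
print(hex_to_rgb("#4a5568"))  # (74, 85, 104)
```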
geometry_type
categorical · metadata · imbalance
This column records the geometry type of each spatial feature and contains exactly one value, 'Polygon', across all 14 rows with no nulls. It is a constant column (zero entropy, cardinality of 1, top_rate of 1.0), so it carries no discriminative information. The imbalance alert is technically correct but understates the situation: the column is not merely imbalanced, it is entirely invariant.
Treatment: Drop before modelling; a zero-variance constant adds no signal and will cause issues in some encoders.
- n: 14
- nulls: 0 (0.0%)
- unique: 1
- top_value: Polygon
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
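
Constant columns like geometry_type can be detected and dropped generically. The sketch below assumes the rows are available as a list of dictionaries that all share the same keys; the function names are hypothetical, and the two sample rows use values taken from this report.

```python
def constant_columns(rows):
    """Return the names of columns holding a single distinct value across all rows."""
    if not rows:
        return []
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]

def drop_columns(rows, to_drop):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]

# Two illustrative rows using values from this report
rows = [
    {"name": "Permian Basin", "geometry_type": "Polygon"},
    {"name": "Black Hills", "geometry_type": "Polygon"},
]
constants = constant_columns(rows)
print(constants)                      # ['geometry_type']
print(drop_columns(rows, constants))  # rows with only the 'name' key left
```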