quirky geothermal

source /home/coolhand/html/datavis/data_trove/data/quirky/geothermal.json 8,776 rows 13 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 8,776 geothermal features (hot springs and geysers) sourced from OpenStreetMap, with 13 columns covering location, type, and optional metadata like temperature and tourism use. The core signal is in the `type` and `osm_type` fields: roughly 80% are hot springs and 20% geysers, and most entries are point nodes rather than ways. Geographic coverage is global but skewed — latitude leans heavily toward the northern hemisphere with a long southern tail flagged as outliers, while longitude spans the full range. Be aware that nearly all the descriptive fields (`country`, `wikipedia`, `temperature`, `description`, `access`, `tourism`, `intermittent`) have null rates above 97%, so they're only useful for the small annotated subset. Within that subset, `tourism` is dominated by 'attraction' and `intermittent` is overwhelmingly 'no', which limits their analytic value.

citing: row_count · column_count · columns.type.top_values · columns.osm_type.top_values · columns.lat.stats · columns.lon.stats · columns.tourism.top_values · columns.country.null_rate · columns.temperature.top_values · columns.name.stats
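
The null rates the summary relies on can be recomputed from the raw records. The following is a minimal sketch, assuming the JSON file is a list of row objects and that a missing value is stored as `null` or as an omitted key (the report does not state the file layout). The helper `null_rates` and the sample rows are hypothetical and not taken from the dataset.

```python
from collections import Counter

def null_rates(rows, columns):
    """Return {column: fraction of rows where the value is missing or None}."""
    missing = Counter()
    for row in rows:
        for col in columns:
            # dict.get returns None for both an absent key and an explicit null.
            if row.get(col) is None:
                missing[col] += 1
    total = len(rows)
    return {col: (missing[col] / total if total else 0.0) for col in columns}

# Illustrative rows only; real rows would come from json.load on the source file.
sample = [
    {"name": "A", "type": "hot_spring", "tourism": "attraction"},
    {"name": "B", "type": "geyser", "tourism": None},
    {"name": "C", "type": "hot_spring"},
    {"name": "D", "type": "hot_spring", "tourism": None},
]
rates = null_rates(sample, ["name", "type", "tourism"])

# Columns whose null rate exceeds a chosen threshold (0.5 here, arbitrary).
sparse = [col for col, rate in rates.items() if rate > 0.5]
```

Applied to the full file with a threshold of 0.97, a check like this should single out the seven sparse descriptive fields named in the summary, provided the layout assumption holds.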

Schema

13 columns
Per-column summary.

column        type         nulls   unique  alerts
name          text         0.0%    8,316   allcaps
lat           numeric      0.0%    8,758   high_skew, outliers
lon           numeric      0.0%    8,747
country       categorical  99.9%   6       long_tail, null_rate
type          categorical  0.0%    2
temperature   categorical  98.3%   56      long_tail, null_rate
wikipedia     categorical  98.5%   124     long_tail, null_rate
description   categorical  98.1%   141     long_tail, null_rate
intermittent  categorical  97.9%   2       null_rate
access        categorical  99.0%   5       null_rate
tourism       categorical  97.5%   10      null_rate
osm_id        numeric      0.0%    8,776
osm_type      categorical  0.0%    2

name

text label allcaps
Short free-text names for places; 'hot_spring' (3393) and 'geyser' (1083) dominate top_words, so these are almost certainly hot springs and geysers. The column is highly diverse (8316 unique of 8776) and multilingual (top values mix Arabic, English, Spanish, and Turkish), and 25.3% of entries are all-caps, which would break naive case-sensitive string matching. There are 460 duplicates (5.2%), including 40 repeats of a single Arabic name, which may mean the same feature was recorded multiple times or that several features share a generic name. Treatment: Normalize case and Unicode, then keep as a descriptive label; do not use as a join key given the duplicates and language mix. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 0 (0.0%)
unique: 8,316
len_min: 1
len_max: 60
len_mean: 17.1
len_median: 19
len_p95: 24
word_mean: 2.16
word_median: 2
n_empty: 0
n_duplicates: 460
duplicate_rate: 0.05242
vocab_size: 9,610
readability_flesch_mean: 101.6
emoji_rate: 0
url_rate: 0
one_word_rate: 0.1508
allcaps_rate: 0.2525
boilerplate_rate: 0
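
The case and Unicode normalization recommended for `name` could look like the sketch below. It assumes NFKC normalization plus `str.casefold()` is acceptable for matching purposes; the report does not prescribe a specific normal form, and `normalize_name` is a hypothetical helper.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Return a matching key: NFKC-normalized, whitespace-collapsed, casefolded."""
    # NFKC folds compatibility characters (e.g. fullwidth Latin letters) to canonical forms.
    text = unicodedata.normalize("NFKC", name)
    # Collapse runs of whitespace and trim the ends.
    text = " ".join(text.split())
    # casefold() is a more aggressive lower() intended for caseless comparison.
    return text.casefold()

key_a = normalize_name("OLYMPIC  Hot Springs")
key_b = normalize_name("olympic hot springs ")
```

The normalized string is best kept alongside the original rather than replacing it, since the original casing and script are still useful for display.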

lat

numeric feature high_skew outliers
This is a latitude coordinate column, with values spanning -54.68 to 70.96 and a median of 40.86 placing most records in the northern hemisphere. The distribution is heavily left-skewed (skew -2.36, kurtosis 6.28) with 889 outliers (10.1%), reflecting a long southern-hemisphere tail relative to a northern-clustered core. Near-unique values (8758/8776) indicate point-level geolocations rather than coarse bins. Treatment: Pair with longitude for geospatial features (e.g., bin, cluster, or compute distances) rather than treating as a standalone numeric. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 0 (0.0%)
unique: 8,758
min: -54.68
max: 70.96
mean: 34.77
median: 40.86
std: 18.12
q1: 32.31
q3: 44.53
iqr: 12.22
skew: -2.36
kurtosis: 6.277
n_outliers: 889
outlier_rate: 0.1013
zero_rate: 0
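
The report does not say which rule produced the 889 flagged latitudes. As a worked example only, the sketch below assumes the common 1.5 × IQR (Tukey) rule and derives the fences from the reported `q1` and `q3`. The functions `tukey_fences` and `is_outlier` are hypothetical names.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the lat stats above.
lat_low, lat_high = tukey_fences(32.31, 44.53)

def is_outlier(lat, low=lat_low, high=lat_high):
    """True when a latitude falls outside the assumed fences."""
    return lat < low or lat > high

# Same rule applied to the lon quartiles: the lower fence lies beyond -180.
lon_low, lon_high = tukey_fences(-110.8, 44.15)
```

Under this assumption the latitude fences sit near 13.98 and 62.86, so points north of about 62.86 would be flagged along with the southern tail (the column maximum is 70.96). The longitude fences fall outside [-180, 180], which is consistent with the zero outliers reported for `lon`.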

lon

numeric feature
This column is almost certainly geographic longitude in decimal degrees: values span -176.63 to 179.36, with mean -23.99 and median -21.23 sitting plausibly within the valid [-180, 180] range. The wide IQR of 154.96 and std of 89.39 indicate global coverage rather than a regional dataset, and near-uniqueness (8747 unique of 8776) suggests each row is a distinct location. No nulls, no zeros, and no flagged outliers. Treatment: Pair with latitude for geospatial features; avoid treating as a plain scalar in models due to wraparound at ±180. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 0 (0.0%)
unique: 8,747
min: -176.6
max: 179.4
mean: -23.99
median: -21.23
std: 89.39
q1: -110.8
q3: 44.15
iqr: 155
skew: 0.4043
kurtosis: -1.179
n_outliers: 0
outlier_rate: 0
zero_rate: 0
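
One way to avoid the ±180 wraparound when pairing `lat` with `lon` is to project each point onto the unit sphere and work with the resulting 3D coordinates. This is a sketch of that idea, not something the report prescribes; `to_unit_xyz` is a hypothetical helper and the Earth is treated as a perfect sphere.

```python
import math

def to_unit_xyz(lat_deg, lon_deg):
    """Map decimal-degree latitude/longitude to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

# Two equatorial points on either side of the antimeridian.
east = to_unit_xyz(0.0, 179.9)
west = to_unit_xyz(0.0, -179.9)

# Straight-line (chord) distance between the two unit vectors.
chord = math.dist(east, west)
```

The raw longitudes differ by 359.8 degrees, yet the chord distance is small, which is the behaviour a distance-based model or clustering step needs.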

country

categorical metadata long_tail null_rate
This is a country code field (ISO 3166-1 alpha-2 style values such as IQ, TW, MX, DE, RU, JP) that is effectively empty: 99.91% of the 8776 rows are null, leaving only 8 observed values across 6 distinct codes. IQ appears 3 times and accounts for 37.5% of present values, but 8 values are too few to support any conclusion about geographic distribution. Treatment: Drop the column; a null rate of 99.91% leaves nothing to model. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 8,768 (99.9%)
unique: 6
top_value: IQ
top_rate: 0.375
cardinality: 6
entropy: 2.406
entropy_ratio: 0.9306

type

categorical label
Binary categorical column distinguishing two hydrothermal feature types: hot_spring (7082 rows, ~80.7%) and geyser (1694 rows). No nulls and only 2 unique values, so the field is clean but imbalanced roughly 4:1 toward hot_spring. Entropy ratio of 0.71 reflects that skew rather than any data quality issue. Treatment: One-hot or boolean-encode; stratify splits to preserve the ~4:1 class balance. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 0 (0.0%)
unique: 2
top_value: hot_spring
top_rate: 0.807
cardinality: 2
entropy: 0.7078
entropy_ratio: 0.7078
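
The entropy figures in this report are consistent with Shannon entropy in bits, with `entropy_ratio` dividing by log2 of the cardinality. That reading is inferred from the reported numbers rather than stated by the profiler, so treat the sketch below as a check under that assumption. The function names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of observed categories)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # A single category carries no uncertainty.
    return entropy_bits(counts) / math.log2(k)

# Counts copied from the type card: hot_spring, geyser.
type_entropy = entropy_bits([7082, 1694])

# Counts copied from the osm_type card: node, way.
osm_type_ratio = entropy_ratio([6705, 2071])
```

For a two-valued column log2(2) is 1, which is why `entropy` and `entropy_ratio` coincide for `type`, `intermittent`, and `osm_type`.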

temperature

categorical feature long_tail null_rate
A free-text temperature field that is 98.31% null, with only 148 populated rows out of 8776. The dominant value is the descriptive string 'hot' (76 occurrences, 51.35% of populated rows), while the remaining entries are numeric strings like '90', '100', '21' — indicating a mix of qualitative and quantitative encodings with no consistent unit. Cardinality is 56 with entropy ratio 0.64, so the long tail is sparse but varied. Treatment: Drop or treat as missing-by-default; if retained, normalize units and split numeric vs categorical encodings before use. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 8,628 (98.3%)
unique: 56
top_value: hot
top_rate: 0.5135
cardinality: 56
entropy: 3.734
entropy_ratio: 0.643
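
If `temperature` is retained, the numeric strings and the qualitative labels can be separated before any further cleaning. This is a minimal sketch, assuming plain decimal strings such as '90' or '21' and free-text labels such as 'hot'. It makes no attempt to infer units, which the report notes are inconsistent. `split_temperature` is a hypothetical helper.

```python
import re

# Optional sign, digits, optional decimal part; nothing else.
_NUMERIC = re.compile(r"[-+]?\d+(?:\.\d+)?")

def split_temperature(raw):
    """Return (numeric_value, label); at most one of the two is not None."""
    if raw is None:
        return None, None
    text = raw.strip()
    if _NUMERIC.fullmatch(text):
        return float(text), None
    # Anything non-numeric is kept as a lower-cased label; empty strings become None.
    return None, (text.lower() or None)

numeric_part, _ = split_temperature("90")
_, label_part = split_temperature("Hot")
```

The numeric half still needs a unit decision (Celsius versus Fahrenheit) before it can be used as a measurement.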

wikipedia

categorical metadata long_tail null_rate
This appears to be a Wikipedia article reference column, with values formatted as language-prefixed page titles (e.g., 'en:Olympic Hot Springs', 'ja:鉢形駅') spanning multiple languages including English, Japanese, Russian, and Icelandic. The column is 98.52% null, leaving 130 populated rows with 124 unique values, and the entropy ratio of 0.996 indicates the populated entries are nearly all distinct. The top value appears just 3 times (top_rate 0.023 of populated rows), confirming no meaningful concentration. Treatment: Drop or retain only as a reference link; too sparse and high-cardinality for modelling. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 8,646 (98.5%)
unique: 124
top_value: en:Olympic Hot Springs
top_rate: 0.02308
cardinality: 124
entropy: 6.924
entropy_ratio: 0.9957
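
If the column is kept as a reference link, the language-prefixed values can be turned into article URLs. The sketch assumes every populated value follows the `lang:Title` pattern shown in the examples and that the usual `https://<lang>.wikipedia.org/wiki/<Title>` address scheme applies. `wikipedia_url` is a hypothetical helper.

```python
from urllib.parse import quote

def wikipedia_url(tag):
    """Convert 'lang:Title' to an article URL, or None if the value is malformed."""
    # Split on the first colon only, so titles that contain colons survive.
    lang, sep, title = tag.partition(":")
    if not sep or not lang or not title:
        return None
    # Wikipedia page paths use underscores for spaces; quote() percent-encodes the rest.
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"

english = wikipedia_url("en:Olympic Hot Springs")
japanese = wikipedia_url("ja:鉢形駅")
```

Values without a language prefix return `None`, so they can be reviewed separately instead of producing a broken link.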

description

categorical free_text long_tail null_rate
Free-text descriptions, likely for hot spring or geothermal site entries, present on only ~1.94% of the 8776 rows (null_rate 0.9806). Among the 170 non-null entries there are 141 unique values with entropy ratio 0.966, so almost every description is bespoke; the most common string ('Mud geyser created from recent seismic activity') still only repeats 12 times (top_rate 0.0706). Languages are mixed — English, Japanese (熱海七湯), Russian, French — which will complicate any text processing. Treatment: Treat as multilingual free text: language-detect then tokenize/embed, but expect ~98% missingness to limit usefulness as a feature. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 8,606 (98.1%)
unique: 141
top_value: Mud geyser created from recent seismic activity
top_rate: 0.07059
cardinality: 141
entropy: 6.894
entropy_ratio: 0.9656

intermittent

categorical feature null_rate
Binary yes/no flag, most likely the OSM `intermittent` tag marking a feature that does not flow continuously, but it is essentially absent: 97.89% of the 8,776 rows are null, leaving only 185 populated values. Among those, 'no' dominates at 91.9% (170 vs 15 'yes'), so the column carries almost no usable signal. Treatment: Drop or treat as a sparse indicator; null rate too high for direct modelling. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 8,591 (97.9%)
unique: 2
top_value: no
top_rate: 0.9189
cardinality: 2
entropy: 0.406
entropy_ratio: 0.406

access

categorical feature null_rate
This is a low-cardinality categorical access flag (5 distinct values: yes, customers, private, no, permissive) — likely an OSM-style access tag indicating who may use a feature. It is overwhelmingly null at 98.99%, leaving only 89 observed values across 8,776 rows. The non-null distribution is unusually flat (entropy ratio 0.945), with the modal value 'yes' accounting for just 26.97% of present values. Treatment: Treat missing as its own category ('unspecified') before encoding, given 99% nulls. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 8,687 (99.0%)
unique: 5
top_value: yes
top_rate: 0.2697
cardinality: 5
entropy: 2.194
entropy_ratio: 0.9451
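
The suggested treatment for `access`, with missing values as their own category, can be sketched as a small one-hot encoder. The level list is copied from the card above. The `unspecified` label comes from the treatment note, and routing unexpected values to it is an assumption of this sketch. `encode_access` is a hypothetical helper.

```python
# The five observed values plus an explicit bucket for missing entries.
ACCESS_LEVELS = ["yes", "customers", "private", "no", "permissive", "unspecified"]

def encode_access(value):
    """One-hot encode an access value; None and unknown values map to 'unspecified'."""
    label = value if value in ACCESS_LEVELS else "unspecified"
    return {f"access_{level}": int(level == label) for level in ACCESS_LEVELS}

missing_row = encode_access(None)
private_row = encode_access("private")
```

With about 99% of rows landing in `access_unspecified`, the other five indicators will be nearly constant, so it is worth checking whether they add anything before keeping them.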

tourism

categorical metadata null_rate
This is an OpenStreetMap-style `tourism` tag classifying points of interest (attraction, hotel, camp_site, viewpoint, etc.). It is almost entirely empty (97.46% null across 8776 rows), and among the 223 populated entries `attraction` dominates at 87% (194 records), leaving the other 9 categories to share the remaining 29 rows as a long tail. A stray `yes` value (7 rows) suggests inconsistent tagging upstream. Treatment: Drop or collapse to a binary is_attraction flag; too sparse and skewed for direct use as a feature. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 8,553 (97.5%)
unique: 10
top_value: attraction
top_rate: 0.87
cardinality: 10
entropy: 0.9127
entropy_ratio: 0.2747

osm_id

numeric identifier
This is almost certainly the OpenStreetMap object identifier: every one of the 8776 rows is unique with no nulls or zeros, and values span 27750092 to 13535658843, a range typical of OSM IDs. The distribution is broad (IQR ~9.94e9) and slightly left-skewed (-0.30) with flat kurtosis (-1.43), consistent with IDs accumulated across OSM history rather than a meaningful numeric feature. No outliers were flagged, which is expected for an identifier. Treatment: Use as a join key to OSM together with `osm_type`, since OSM numbers nodes and ways in separate ID spaces; do not feed into models as a numeric feature. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 0 (0.0%)
unique: 8,776
min: 2.775e+07
max: 1.354e+10
mean: 7.186e+09
median: 8.263e+09
std: 4.374e+09
q1: 1.334e+09
q3: 1.128e+10
iqr: 9.942e+09
skew: -0.3011
kurtosis: -1.43
n_outliers: 0
outlier_rate: 0
zero_rate: 0

osm_type

categorical feature
This column records the OpenStreetMap element type, taking only two values across 8,776 rows: 'node' (6,705) and 'way' (2,071). Nodes make up 76.4% of records, giving an entropy ratio of 0.79: moderately imbalanced, with no nulls or rare categories. Treatment: One-hot or binary-encode before modelling. high · anthropic:claude-opus-4-7
n: 8,776
nulls: 0 (0.0%)
unique: 2
top_value: node
top_rate: 0.764
cardinality: 2
entropy: 0.7883
entropy_ratio: 0.7883
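
When `osm_id` is used to look a feature up in OpenStreetMap, it needs `osm_type` alongside it, because OSM assigns IDs per element type. The sketch below builds a browse URL from the pair, assuming the standard `https://www.openstreetmap.org/<type>/<id>` address scheme. `osm_url` is a hypothetical helper, and the type and ID pairing in the example is illustrative rather than a verified row.

```python
VALID_OSM_TYPES = {"node", "way", "relation"}

def osm_url(osm_type, osm_id):
    """Build the openstreetmap.org browse URL for an (element type, id) pair."""
    if osm_type not in VALID_OSM_TYPES:
        raise ValueError(f"unexpected osm_type: {osm_type!r}")
    # int() guards against IDs that were parsed as floats, e.g. 27750092.0.
    return f"https://www.openstreetmap.org/{osm_type}/{int(osm_id)}"

# Illustrative pairing: the smallest reported ID, assumed here to be a node.
example_url = osm_url("node", 27750092.0)
```

Treating the tuple (`osm_type`, `osm_id`) as the join key also protects against collisions if relations or other element types are added to the dataset later.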