quirky geothermal
Reading
This dataset catalogs 8,776 geothermal features (hot springs and geysers) sourced from OpenStreetMap, with 13 columns covering location, type, and optional metadata such as temperature and tourism use. The core signal is in the `type` and `osm_type` fields: 80.7% of rows are hot springs and 19.3% are geysers, and 76.4% are point nodes rather than ways. Geographic coverage is global but uneven: latitude leans heavily toward the northern hemisphere (median 40.86) with a long southern tail flagged as outliers, and although longitude spans nearly the full range (-176.6 to 179.4), 3,413 rows (38.9%) fall in the single bin from -114.3 to -105.4. Be aware that seven descriptive fields (`country`, `wikipedia`, `temperature`, `description`, `access`, `tourism`, `intermittent`) have null rates above 97%, so they are only useful for the small annotated subset. Within that subset, `tourism` is dominated by 'attraction' (194 of 223 populated rows) and `intermittent` is overwhelmingly 'no' (170 of 185), which limits their analytic value.
citing: row_count · column_count · columns.type.top_values · columns.osm_type.top_values · columns.lat.stats · columns.lon.stats · columns.tourism.top_values · columns.country.null_rate · columns.temperature.top_values · columns.name.stats
Charts the summary said to look at first
`type` top values
| value | count | share |
|---|---|---|
| hot_spring | 7082 | 80.7% |
| geyser | 1694 | 19.3% |
`osm_type` top values
| value | count | share |
|---|---|---|
| node | 6705 | 76.4% |
| way | 2071 | 23.6% |
`lat` histogram
| bin | count |
|---|---|
| -54.68 – -51.54 | 1 |
| -51.54 – -48.4 | 0 |
| -48.4 – -45.25 | 0 |
| -45.25 – -42.11 | 23 |
| -42.11 – -38.97 | 30 |
| -38.97 – -35.83 | 132 |
| -35.83 – -32.69 | 24 |
| -32.69 – -29.55 | 9 |
| -29.55 – -26.41 | 14 |
| -26.41 – -23.27 | 10 |
| -23.27 – -20.13 | 55 |
| -20.13 – -16.99 | 104 |
| -16.99 – -13.84 | 20 |
| -13.84 – -10.7 | 13 |
| -10.7 – -7.563 | 29 |
| -7.563 – -4.422 | 14 |
| -4.422 – -1.281 | 25 |
| -1.281 – 1.86 | 35 |
| 1.86 – 5.001 | 32 |
| 5.001 – 8.142 | 35 |
| 8.142 – 11.28 | 63 |
| 11.28 – 14.42 | 88 |
| 14.42 – 17.56 | 276 |
| 17.56 – 20.71 | 82 |
| 20.71 – 23.85 | 134 |
| 23.85 – 26.99 | 100 |
| 26.99 – 30.13 | 298 |
| 30.13 – 33.27 | 997 |
| 33.27 – 36.41 | 1011 |
| 36.41 – 39.55 | 438 |
| 39.55 – 42.69 | 597 |
| 42.69 – 45.83 | 3559 |
| 45.83 – 48.97 | 117 |
| 48.97 – 52.12 | 105 |
| 52.12 – 55.26 | 101 |
| 55.26 – 58.4 | 30 |
| 58.4 – 61.54 | 9 |
| 61.54 – 64.68 | 88 |
| 64.68 – 67.82 | 75 |
| 67.82 – 70.96 | 3 |
`lon` histogram
| bin | count |
|---|---|
| -176.6 – -167.7 | 4 |
| -167.7 – -158.8 | 2 |
| -158.8 – -149.9 | 2 |
| -149.9 – -141 | 3 |
| -141 – -132.1 | 2 |
| -132.1 – -123.2 | 18 |
| -123.2 – -114.3 | 403 |
| -114.3 – -105.4 | 3413 |
| -105.4 – -96.53 | 23 |
| -96.53 – -87.63 | 24 |
| -87.63 – -78.73 | 58 |
| -78.73 – -69.83 | 158 |
| -69.83 – -60.93 | 193 |
| -60.93 – -52.03 | 11 |
| -52.03 – -43.13 | 5 |
| -43.13 – -34.23 | 0 |
| -34.23 – -25.33 | 7 |
| -25.33 – -16.44 | 154 |
| -16.44 – -7.535 | 122 |
| -7.535 – 1.364 | 68 |
| 1.364 – 10.26 | 102 |
| 10.26 – 19.16 | 203 |
| 19.16 – 28.06 | 374 |
| 28.06 – 36.96 | 637 |
| 36.96 – 45.86 | 1145 |
| 45.86 – 54.76 | 325 |
| 54.76 – 63.66 | 48 |
| 63.66 – 72.56 | 43 |
| 72.56 – 81.46 | 114 |
| 81.46 – 90.36 | 26 |
| 90.36 – 99.26 | 99 |
| 99.26 – 108.2 | 181 |
| 108.2 – 117.1 | 45 |
| 117.1 – 126 | 190 |
| 126 – 134.9 | 80 |
| 134.9 – 143.8 | 234 |
| 143.8 – 152.7 | 41 |
| 152.7 – 161.6 | 83 |
| 161.6 – 170.5 | 5 |
| 170.5 – 179.4 | 131 |
`tourism` top values (share is of all 8,776 rows, nulls included)
| value | count | share |
|---|---|---|
| attraction | 194 | 2.2% |
| hotel | 9 | 0.1% |
| yes | 7 | 0.1% |
| camp_site | 3 | 0.0% |
| camp_site;loding | 3 | 0.0% |
| information | 2 | 0.0% |
| viewpoint | 2 | 0.0% |
| picnic_site | 1 | 0.0% |
| caravan_site | 1 | 0.0% |
| guest_house | 1 | 0.0% |
Schema
13 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| name | text | 0.0% | 8,316 | allcaps |
| lat | numeric | 0.0% | 8,758 | high_skew, outliers |
| lon | numeric | 0.0% | 8,747 | |
| country | categorical | 99.9% | 6 | long_tail, null_rate |
| type | categorical | 0.0% | 2 | |
| temperature | categorical | 98.3% | 56 | long_tail, null_rate |
| wikipedia | categorical | 98.5% | 124 | long_tail, null_rate |
| description | categorical | 98.1% | 141 | long_tail, null_rate |
| intermittent | categorical | 97.9% | 2 | null_rate |
| access | categorical | 99.0% | 5 | null_rate |
| tourism | categorical | 97.5% | 10 | null_rate |
| osm_id | numeric | 0.0% | 8,776 | |
| osm_type | categorical | 0.0% | 2 | |
name
text label allcaps
Short free-text names for places, almost certainly hot springs and geysers given that 'hot_spring' (3393) and 'geyser' (1083) dominate top_words. The column is highly diverse (8316 unique of 8776) but multilingual: top values mix Arabic, English, Spanish, and Turkish. In addition, 25.3% of entries are all-caps, which would break naive case-sensitive string matching. There are 460 duplicates (5.2%), including 40 repeats of a single Arabic name, which suggests either the same feature recorded multiple times or a generic label reused across different features. Treatment: Normalize case and Unicode, then keep as a descriptive label; do not use as a join key given duplicates and language mix.
- n: 8,776
- nulls: 0 (0.0%)
- unique: 8,316
- len_min: 1
- len_max: 60
- len_mean: 17.1
- len_median: 19
- len_p95: 24
- word_mean: 2.16
- word_median: 2
- n_empty: 0
- n_duplicates: 460
- duplicate_rate: 0.05242
- vocab_size: 9,610
- readability_flesch_mean: 101.6
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.1508
- allcaps_rate: 0.2525
- boilerplate_rate: 0
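The "normalize case and Unicode" treatment for `name` could look roughly like the sketch below. The helper name `normalize_name` is hypothetical, and the choice of NFKC plus `casefold()` is an assumption rather than something the report prescribes; scripts without letter case (such as Arabic) pass through the case step unchanged.

```python
import unicodedata


def normalize_name(raw: str) -> str:
    """Build a comparison key for a place name (hypothetical helper)."""
    # NFKC folds compatibility forms (e.g. full-width Latin letters)
    # into their canonical equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # casefold() is a stronger lower() intended for caseless matching.
    text = text.casefold()
    # Collapse runs of whitespace so spacing differences do not matter.
    return " ".join(text.split())


# Two spellings that differ only in case, width, and spacing share one key.
print(normalize_name("HOT  SPRING") == normalize_name("Ｈｏｔ Spring"))
```

The normalized key is meant for matching and de-duplication checks only; the original string is still the better display label.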
lat
numeric feature high_skew outliers
This is a latitude coordinate column, with values spanning -54.68 to 70.96 and a median of 40.86 placing most records in the northern hemisphere. The distribution is heavily left-skewed (skew -2.36, kurtosis 6.28) with 889 outliers (10.1%), reflecting a long low-latitude and southern tail relative to a northern-clustered core. The histogram puts at most 538 rows south of 1.86°N, so the flagged set must also include several hundred northern-hemisphere points that sit far from the core. Near-unique values (8758/8776) indicate point-level geolocations rather than coarse bins. Treatment: Pair with longitude for geospatial features (e.g., bin, cluster, or compute distances) rather than treating as a standalone numeric.
- n: 8,776
- nulls: 0 (0.0%)
- unique: 8,758
- min: -54.68
- max: 70.96
- mean: 34.77
- median: 40.86
- std: 18.12
- q1: 32.31
- q3: 44.53
- iqr: 12.22
- skew: -2.36
- kurtosis: 6.277
- n_outliers: 889
- outlier_rate: 0.1013
- zero_rate: 0
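The report does not say how outliers are flagged. As a sketch, assuming the common Tukey rule (points beyond 1.5 × IQR from the quartiles), the cut-offs implied by the `lat` quartiles above can be computed as follows; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier cut-offs under Tukey's rule.

    Assumption: the profiler uses k = 1.5; the report does not confirm this.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the lat stats in this report.
lower, upper = tukey_fences(32.31, 44.53)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")  # about 13.98 and 62.86
```

Under that assumption, any latitude below roughly 14°N or above roughly 63°N would be flagged, which covers low-latitude northern features and far-northern ones as well as the southern hemisphere.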
lon
numeric feature
This column is almost certainly geographic longitude in decimal degrees: values span -176.63 to 179.36, with mean -23.99 and median -21.23, all within the valid [-180, 180] range. The wide IQR of 154.96 and std of 89.39 indicate global coverage rather than a regional dataset, although the histogram shows two dense bands: -114.3 to -105.4 (3,413 rows) and 28.06 to 45.86 (1,782 rows). Near-uniqueness (8747 unique of 8776) suggests each row is a distinct location. No nulls, no zeros, and no flagged outliers. Treatment: Pair with latitude for geospatial features; avoid treating as a plain scalar in models due to wraparound at ±180.
- n: 8,776
- nulls: 0 (0.0%)
- unique: 8,747
- min: -176.6
- max: 179.4
- mean: -23.99
- median: -21.23
- std: 89.39
- q1: -110.8
- q3: 44.15
- iqr: 155
- skew: 0.4043
- kurtosis: -1.179
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
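To illustrate the wraparound issue, here is a minimal sketch of two standard ways to combine `lat` and `lon`: a 3D unit-vector encoding for model features and the haversine great-circle distance. The function names and the mean Earth radius of 6371 km are choices made for this example, not part of the dataset.

```python
import math


def latlon_to_unit_vector(lat_deg: float, lon_deg: float) -> tuple[float, float, float]:
    """Encode a coordinate as a point on the unit sphere (no jump at ±180)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float,
                 radius_km: float = 6371.0) -> float:
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))


# Two points on either side of the antimeridian are close on the globe,
# even though their raw longitudes differ by 359.8 degrees.
print(haversine_km(0.0, 179.9, 0.0, -179.9))
```

Either representation keeps neighbouring points near ±180 close together, which a raw longitude scalar does not.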
country
categorical metadata long_tail null_rate
This is a country code field (two-letter codes in the style of ISO 3166-1 alpha-2, such as IQ, TW, MX, DE, RU, JP) that is effectively empty: 99.91% of the 8776 rows are null, leaving only 8 observed values across 6 distinct codes. The non-null distribution is too sparse to be meaningful, though IQ appears 3 times and accounts for 37.5% of present values. With this null rate, any apparent signal is noise. Treatment: Drop the column; a null rate of 99.91% leaves nothing to model.
- n: 8,776
- nulls: 8,768 (99.9%)
- unique: 6
- top_value: IQ
- top_rate: 0.375
- cardinality: 6
- entropy: 2.406
- entropy_ratio: 0.9306
type
categorical label
Binary categorical column distinguishing two hydrothermal feature types: hot_spring (7082 rows, ~80.7%) and geyser (1694 rows, ~19.3%). No nulls and only 2 unique values, so the field is clean but imbalanced roughly 4:1 toward hot_spring. Entropy ratio of 0.71 reflects that skew rather than any data quality issue. Treatment: One-hot or boolean-encode; stratify splits to preserve the ~4:1 class ratio.
- n: 8,776
- nulls: 0 (0.0%)
- unique: 2
- top_value: hot_spring
- top_rate: 0.807
- cardinality: 2
- entropy: 0.7078
- entropy_ratio: 0.7078
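A stratified split, as suggested for `type`, can be sketched with the standard library alone. The function name `stratified_split`, the 20% test fraction, and the fixed seed are illustrative assumptions; a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would be the usual choice in practice.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices so each class keeps about the same share in both parts."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # seeded for reproducibility
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Class counts taken from the type column in this report.
labels = ["hot_spring"] * 7082 + ["geyser"] * 1694
train_idx, test_idx = stratified_split(labels)
geyser_share = sum(labels[i] == "geyser" for i in test_idx) / len(test_idx)
print(f"geyser share in test set: {geyser_share:.3f}")
```

Sampling within each class separately is what keeps the geyser share of the test set close to the 19.3% seen in the full data.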
temperature
categorical feature long_tail null_rate
A free-text temperature field that is 98.31% null, with only 148 populated rows out of 8776. The dominant value is the descriptive string 'hot' (76 occurrences, 51.35% of populated rows), while the remaining entries are numeric strings like '90', '100', '21', indicating a mix of qualitative and quantitative encodings with no consistent unit. Cardinality is 56 with entropy ratio 0.64, so the long tail is sparse but varied. Treatment: Drop or treat as missing-by-default; if retained, normalize units and split numeric vs categorical encodings before use.
- n: 8,776
- nulls: 8,628 (98.3%)
- unique: 56
- top_value: hot
- top_rate: 0.5135
- cardinality: 56
- entropy: 3.734
- entropy_ratio: 0.643
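Splitting the numeric and qualitative encodings of `temperature` might be done along these lines. This is a sketch: `split_temperature` is a hypothetical helper, it only recognizes bare decimal numbers, and it deliberately performs no unit conversion because the report gives no unit for the numeric strings.

```python
import re

# Plain decimal numbers only, e.g. "90", "21.5", "-3".
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def split_temperature(raw):
    """Return (numeric_value, qualitative_label); at most one is not None."""
    if raw is None:
        return (None, None)
    text = raw.strip()
    if not text:
        return (None, None)
    if _NUMBER.fullmatch(text):
        # Unit is unknown (Celsius vs Fahrenheit), so keep the raw number.
        return (float(text), None)
    return (None, text.lower())


for value in ["hot", "90", " 21 ", None]:
    print(repr(value), "->", split_temperature(value))
```

Values that carry a unit suffix would land in the qualitative column with this rule and would need a further parsing step once the unit conventions are known.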
wikipedia
categorical metadata long_tail null_rate
This appears to be a Wikipedia article reference column, with values formatted as language-prefixed page titles (e.g., 'en:Olympic Hot Springs', 'ja:鉢形駅') spanning multiple languages including English, Japanese, Russian, and Icelandic. The column is 98.52% null, with 124 unique values among the 130 populated rows, and an entropy ratio of 0.996 indicates the few populated entries are nearly all distinct. The top value appears just 3 times (0.023 rate), confirming no meaningful concentration. Treatment: Drop or retain only as a reference link; too sparse and high-cardinality for modelling.
- n: 8,776
- nulls: 8,646 (98.5%)
- unique: 124
- top_value: en:Olympic Hot Springs
- top_rate: 0.02308
- cardinality: 124
- entropy: 6.924
- entropy_ratio: 0.9957
description
categorical free_text long_tail null_rate
Free-text descriptions, likely for hot spring or geothermal site entries, present on only ~1.94% of the 8776 rows (null_rate 0.9806). Among the 170 non-null entries there are 141 unique values with entropy ratio 0.966, so almost every description is bespoke; the most common string ('Mud geyser created from recent seismic activity') still only repeats 12 times (top_rate 0.0706). Languages are mixed (English, Japanese such as 熱海七湯, Russian, French), which will complicate any text processing. Treatment: Treat as multilingual free text: language-detect then tokenize/embed, but expect ~98% missingness to limit usefulness as a feature.
- n: 8,776
- nulls: 8,606 (98.1%)
- unique: 141
- top_value: Mud geyser created from recent seismic activity
- top_rate: 0.07059
- cardinality: 141
- entropy: 6.894
- entropy_ratio: 0.9656
intermittent
categorical feature null_rate
Binary yes/no flag, presumably the OSM `intermittent` tag marking whether a spring or geyser is only periodically active, but it is essentially absent: 97.89% of the 8,776 rows are null, leaving only 185 populated values. Among those few, 'no' dominates at 91.9% (170 vs 15 'yes'), so the column carries almost no usable signal. Treatment: Drop or treat as a sparse indicator; null rate too high for direct modelling.
- n: 8,776
- nulls: 8,591 (97.9%)
- unique: 2
- top_value: no
- top_rate: 0.9189
- cardinality: 2
- entropy: 0.406
- entropy_ratio: 0.406
access
categorical feature null_rate
This is a low-cardinality categorical access flag (5 distinct values: yes, customers, private, no, permissive), likely an OSM-style access tag indicating who may use a feature. It is overwhelmingly null at 98.99%, leaving only 89 observed values across 8,776 rows. The non-null distribution is unusually flat (entropy ratio 0.945), with the modal value 'yes' accounting for just 26.97% of present values. Treatment: Treat missing as its own category ('unspecified') before encoding, given 99% nulls.
- n: 8,776
- nulls: 8,687 (99.0%)
- unique: 5
- top_value: yes
- top_rate: 0.2697
- cardinality: 5
- entropy: 2.194
- entropy_ratio: 0.9451
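One way to apply the "missing as its own category" treatment to `access` is sketched below. The helper `encode_access` and the `access_` prefix are hypothetical; mapping any unexpected value to 'unspecified' is an extra assumption made here so that the output always has the same six keys.

```python
# The five observed values from the report, plus the added missing bucket.
ACCESS_LEVELS = ["yes", "customers", "private", "no", "permissive", "unspecified"]


def encode_access(value):
    """One-hot encode an access tag, sending nulls to 'unspecified'."""
    # Assumption: values outside the known list are also treated as unspecified.
    key = value if value in ACCESS_LEVELS else "unspecified"
    return {f"access_{level}": int(level == key) for level in ACCESS_LEVELS}


print(encode_access("private"))
print(encode_access(None))
```

With about 99% of rows null, the `access_unspecified` indicator will be 1 almost everywhere, so the encoded columns add little beyond flagging the annotated subset.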
tourism
categorical metadata null_rate
This is an OpenStreetMap-style `tourism` tag classifying points of interest (attraction, hotel, camp_site, viewpoint, etc.). It is almost entirely empty (97.46% null across 8776 rows), and among the 223 populated entries `attraction` dominates at 87% (194 records), leaving the other 9 categories to share just 29 rows, none with more than 9. A stray `yes` value (7 rows) and the misspelled multi-value `camp_site;loding` (3 rows) suggest inconsistent tagging upstream. Treatment: Drop or collapse to a binary is_attraction flag; too sparse and skewed for direct use as a feature.
- n: 8,776
- nulls: 8,553 (97.5%)
- unique: 10
- top_value: attraction
- top_rate: 0.87
- cardinality: 10
- entropy: 0.9127
- entropy_ratio: 0.2747
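Collapsing `tourism` to a binary flag could be sketched as follows. The function name mirrors the `is_attraction` flag suggested above; splitting on `;` (to handle multi-value tags like the one in the top-values table) and mapping nulls to `False` are assumptions of this example.

```python
def is_attraction(tourism):
    """Return True when the tourism tag includes 'attraction'."""
    if tourism is None:
        # Assumption: a missing tag counts as "not an attraction".
        return False
    # OSM multi-value tags are separated by semicolons.
    parts = [part.strip() for part in tourism.split(";")]
    return "attraction" in parts


for value in ["attraction", "camp_site;loding", "yes", None]:
    print(repr(value), "->", is_attraction(value))
```

Treating nulls as `False` conflates "untagged" with "not an attraction"; keeping a separate missing indicator is the safer option if that distinction matters.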
osm_id
numeric identifier
This is almost certainly the OpenStreetMap object identifier: every one of the 8776 rows is unique with no nulls or zeros, and values span 27750092 to 13535658843, the range typical of OSM IDs. The maximum exceeds the 32-bit integer range, so the column needs a 64-bit integer or string type. The distribution is broad (IQR ~9.94e9) and slightly left-skewed (-0.30) with flat kurtosis (-1.43), consistent with IDs accumulated across OSM history rather than a meaningful numeric feature. No outliers were flagged, which is expected for an identifier. Treatment: Use as a join key to OSM, paired with `osm_type` because OSM IDs are only unique within an element type; do not feed into models as a numeric feature.
- n: 8,776
- nulls: 0 (0.0%)
- unique: 8,776
- min: 2.775e+07
- max: 1.354e+10
- mean: 7.186e+09
- median: 8.263e+09
- std: 4.374e+09
- q1: 1.334e+09
- q3: 1.128e+10
- iqr: 9.942e+09
- skew: -0.3011
- kurtosis: -1.43
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
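As a sketch of using `osm_id` as a join key, the snippet below builds a composite key and the matching openstreetmap.org browse URL. The helper names are hypothetical and the ID `123456` is a made-up placeholder; the pairing with `osm_type` reflects the fact that OSM numbers nodes, ways, and relations independently.

```python
# OSM's three element types; this dataset only contains node and way.
VALID_OSM_TYPES = {"node", "way", "relation"}


def osm_key(osm_type: str, osm_id: int) -> tuple[str, int]:
    """Composite key: the same number can exist once per element type."""
    if osm_type not in VALID_OSM_TYPES:
        raise ValueError(f"unknown OSM element type: {osm_type!r}")
    return (osm_type, int(osm_id))


def osm_url(osm_type: str, osm_id: int) -> str:
    """Browse URL for an element on openstreetmap.org."""
    kind, number = osm_key(osm_type, osm_id)
    return f"https://www.openstreetmap.org/{kind}/{number}"


# Placeholder ID for illustration only, not a row from this dataset.
print(osm_url("node", 123456))
```

All 8,776 IDs happen to be unique in this extract, but joining on the pair avoids collisions if more rows or element types are added later.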
osm_type
categorical feature
This column records the OpenStreetMap element type, taking only two values across 8,776 rows: 'node' (6,705) and 'way' (2,071). 76.4% of records are nodes, giving an entropy ratio of 0.79: moderately imbalanced, but with no nulls or rare categories. Treatment: One-hot or binary-encode before modelling.
- n: 8,776
- nulls: 0 (0.0%)
- unique: 2
- top_value: node
- top_rate: 0.764
- cardinality: 2
- entropy: 0.7883
- entropy_ratio: 0.7883
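The `entropy` and `entropy_ratio` figures quoted throughout can be reproduced from value counts. The sketch below assumes Shannon entropy in bits, with the ratio defined as entropy divided by log2 of the number of distinct values; the report does not state its formula, but this definition matches the reported numbers for the columns whose full counts are shown.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a single category has no uncertainty
    return entropy_bits(counts) / math.log2(k)


# Counts taken from the top-values tables in this report.
print(round(entropy_ratio([6705, 2071]), 4))   # osm_type
print(round(entropy_ratio([7082, 1694]), 4))   # type
print(round(entropy_ratio([194, 9, 7, 3, 3, 2, 2, 1, 1, 1]), 4))  # tourism
```

A ratio near 1 means the populated values are spread almost evenly (as with `access`), while a low ratio such as the one for `tourism` means a single value dominates.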