waterfalls worldwide

source: /home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json · 80,678 rows · 9 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogues 80,678 waterfalls worldwide, sourced entirely from OpenStreetMap, with latitude/longitude coordinates and minimal descriptive metadata. The most striking feature is how sparse the descriptive fields are: 'category' and 'source' are constant, 'date' is empty in every row, 'country' is blank for 80,650 of 80,678 rows, and 89.9% of 'description' entries are simply 'Waterfall'. The 'name' field is similarly thin: 'Unnamed Waterfall' accounts for 48,168 rows and the duplicate rate is 65.7%. The real analytical signal lives in the geographic coordinates, where latitude skews toward the northern hemisphere (median 40.3) and longitude spans the full globe, making this primarily a spatial dataset rather than an attribute-rich one.

citing: country.top_rate · category.top_value · source.top_value · description.top_rate · description.top_values · name.duplicate_rate · name.top_values · latitude.median · latitude.skew · longitude.median · row_count

Charts the summary said to look at first

latitude · Check the northern-hemisphere skew (median ~40°) and the long left tail toward Antarctica.

Histogram bins for latitude (median: 40.311778).
bin	count
-77.72 – -73.81	1
-73.81 – -69.9	0
-69.9 – -65.99	2
-65.99 – -62.08	0
-62.08 – -58.17	0
-58.17 – -54.26	40
-54.26 – -50.35	69
-50.35 – -46.44	231
-46.44 – -42.54	980
-42.54 – -38.63	1226
-38.63 – -34.72	1111
-34.72 – -30.81	984
-30.81 – -26.9	2936
-26.9 – -22.99	1664
-22.99 – -19.08	2204
-19.08 – -15.17	1445
-15.17 – -11.26	552
-11.26 – -7.348	652
-7.348 – -3.439	711
-3.439 – 0.4709	1326
0.4709 – 4.381	1234
4.381 – 8.29	2293
8.29 – 12.2	1440
12.2 – 16.11	2557
16.11 – 20.02	1843
20.02 – 23.93	1639
23.93 – 27.84	3183
27.84 – 31.75	1376
31.75 – 35.66	3246
35.66 – 39.57	4519
39.57 – 43.48	7862
43.48 – 47.39	12910
47.39 – 51.3	7883
51.3 – 55.21	3329
55.21 – 59.12	3266
59.12 – 63.03	2437
63.03 – 66.94	2174
66.94 – 70.84	1260
70.84 – 74.75	64
74.75 – 78.66	29

longitude · See how waterfalls spread across all longitudes, hinting at distinct continental clusters.

Histogram bins for longitude (median: 7.8029868).
bin	count
-180 – -171	21
-171 – -162	5
-162 – -153	332
-153 – -144.1	160
-144.1 – -135.1	42
-135.1 – -126.1	1803
-126.1 – -117.1	3492
-117.1 – -108.1	1627
-108.1 – -99.13	566
-99.13 – -90.14	922
-90.14 – -81.15	3019
-81.15 – -72.17	4949
-72.17 – -63.18	2531
-63.18 – -54.2	2294
-54.2 – -45.21	4760
-45.21 – -36.23	1428
-36.23 – -27.24	76
-27.24 – -18.26	959
-18.26 – -9.274	901
-9.274 – -0.2894	4266
-0.2894 – 8.696	7904
8.696 – 17.68	10863
17.68 – 26.67	4161
26.67 – 35.65	2370
35.65 – 44.64	4414
44.64 – 53.62	1659
53.62 – 62.61	555
62.61 – 71.59	512
71.59 – 80.58	785
80.58 – 89.56	865
89.56 – 98.55	833
98.55 – 107.5	2020
107.5 – 116.5	1082
116.5 – 125.5	1555
125.5 – 134.5	842
134.5 – 143.5	1654
143.5 – 152.5	1855
152.5 – 161.4	345
161.4 – 170.4	743
170.4 – 179.4	1508

description · Note that ~90% of descriptions are just 'Waterfall'; only a small tail encodes height info.

Top values for description (20 unique shown, of 775 total).
value	count	share
Waterfall	72565	89.9%
Waterfall, 3m	551	0.7%
Waterfall, 2m	520	0.6%
Waterfall, 5m	460	0.6%
Waterfall, 10m	426	0.5%
Waterfall, 4m	423	0.5%
Waterfall, 1m	358	0.4%
Waterfall, 6m	329	0.4%
Waterfall, 20m	298	0.4%
Waterfall, 15m	257	0.3%
Waterfall, 8m	240	0.3%
Waterfall, 7m	214	0.3%
Waterfall, 30m	170	0.2%
Waterfall, 12m	159	0.2%
Waterfall, 25m	125	0.2%
Waterfall, 40m	114	0.1%
Waterfall, 1.5m	103	0.1%
Waterfall, 50m	79	0.1%
Waterfall, 9m	79	0.1%
Waterfall, 60m	74	0.1%

name · Name lengths cluster tightly (mean ~16 chars), reflecting the dominance of 'Unnamed Waterfall' and short local names.

Character-length distribution for name (mean: 16.120949949180694).
chars	count
1 – 3	217
3 – 4	1507
4 – 6	685
6 – 8	1501
8 – 9	1940
9 – 11	1737
11 – 13	4325
13 – 14	4350
14 – 16	2013
16 – 18	52248
18 – 19	3568
19 – 21	1444
21 – 22	2068
22 – 24	1215
24 – 26	387
26 – 27	545
27 – 29	308
29 – 31	94
31 – 32	175
32 – 34	61
34 – 36	84
36 – 37	62
37 – 39	18
39 – 41	31
41 – 42	31
42 – 44	11
44 – 46	17
46 – 47	8
47 – 49	2
49 – 50	6
50 – 52	7
52 – 54	1
54 – 55	2
55 – 57	3
57 – 59	1
59 – 60	4
60 – 62	0
62 – 64	0
64 – 65	0
65 – 67	2

country · Almost every row has a blank country — a critical data-quality gap to flag before any geographic rollup.

Top values for country (6 unique shown, of 6 total).
value	count	share
(blank)	80650	100.0%
VE	24	0.0%
DE	1	0.0%
LB	1	0.0%
HN	1	0.0%
BR	1	0.0%

Schema

9 columns

Per-column summary.
column	type	nulls	unique	alerts
latitude	numeric	0.0%	80,650
longitude	numeric	0.0%	80,650
name	text	0.0%	27,697	duplicates
description	categorical	0.0%	775	long_tail
category	categorical	0.0%	1	imbalance
date	categorical	0.0%	1	imbalance
country	categorical	0.0%	6	long_tail imbalance
height	categorical	0.0%	775	long_tail
source	categorical	0.0%	1	imbalance

latitude

numeric feature

Geographic latitude coordinates spanning -77.72 to 78.66, covering nearly the full habitable range of Earth. The distribution is left-skewed (-0.94) with a median of 40.31 sitting well above the mean of 27.15, suggesting a concentration of records in the Northern Hemisphere with a long tail reaching toward Antarctica. Near-uniqueness (80,650 distinct of 80,678) and zero nulls indicate clean, granular point data. Treatment: Pair with longitude for geospatial features; avoid treating as a standalone scalar in models. high · anthropic:claude-opus-4-7

n: 80,678
nulls: 0 (0.0%)
unique: 80,650
min: -77.72
max: 78.66
mean: 27.15
median: 40.31
std: 30.05
q1: 9.657
q3: 47.48
iqr: 37.82
skew: -0.9359
kurtosis: -0.2827
n_outliers: 298
outlier_rate: 0.003694
zero_rate: 0
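
The n_outliers count above is consistent with the common 1.5 × IQR (Tukey fence) rule, although the profile does not state which rule it applies; that choice is an assumption here. A minimal sketch using only the reported q1 and q3 shows where such fences would fall for latitude (`tukey_fences` is a hypothetical helper name):

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the latitude stats above.
lat_lower, lat_upper = tukey_fences(9.657, 47.48)

# Valid latitudes stop at +90, so only the lower fence can flag anything.
print(f"lower fence: {lat_lower:.2f}, upper fence: {lat_upper:.2f}")
```

Under this assumption, only rows south of roughly -47° would be flagged, which fits the long southern tail in the histogram; the upper fence lies beyond +90° and cannot flag any valid latitude.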

longitude

numeric feature

This is a longitude coordinate column spanning the full global range from -179.99 to 179.41, with 80,650 unique values across 80,678 rows and no nulls. The distribution is broad (std 76.86, IQR 100.27) and only mildly skewed (0.29), with the median at 7.80 sitting east of the prime meridian, hinting at a Europe/Africa-leaning sample. No outliers were flagged, consistent with values bounded by valid geographic limits. Treatment: Pair with latitude as a 2D geospatial feature; avoid treating as a standalone scalar. high · anthropic:claude-opus-4-7

n: 80,678
nulls: 0 (0.0%)
unique: 80,650
min: -180
max: 179.4
mean: 0.9626
median: 7.803
std: 76.86
q1: -61.71
q3: 38.56
iqr: 100.3
skew: 0.2865
kurtosis: -0.4119
n_outliers: 0
outlier_rate: 0
zero_rate: 0
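
One common way to act on the advice to pair latitude with longitude is to map each coordinate pair to a 3D unit vector, which removes the artificial jump at ±180° longitude. This is an illustrative sketch, not something the profile prescribes; the function name `to_unit_vector` is hypothetical.

```python
import math


def to_unit_vector(lat_deg: float, lon_deg: float) -> tuple[float, float, float]:
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Two equatorial points on either side of the antimeridian: their raw
# longitudes differ by almost 360 degrees, but the 3D points are close.
west = to_unit_vector(0.0, -179.9)
east = to_unit_vector(0.0, 179.9)
gap = math.dist(west, east)  # straight-line (chord) distance on the unit sphere
```

The three components can be fed to a model or a nearest-neighbour index in place of the raw longitude, which would otherwise treat -179.9 and 179.9 as far apart.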

name

text label duplicates

This is the human-readable name of a waterfall, averaging 2 words and 16 characters. The column is dominated by the placeholder 'Unnamed Waterfall' (48,168 of 80,678 rows), driving a 65.7% duplicate rate; multiple languages appear in the vocabulary (cachoeira, cascada, cascata, salto, fossen) alongside English 'falls'. Treatment: Treat 'Unnamed Waterfall' as missing and avoid using this field as a unique key. high · anthropic:claude-opus-4-7

n: 80,678
nulls: 0 (0.0%)
unique: 27,697
len_min: 1
len_max: 67
len_mean: 16.12
len_median: 17
len_p95: 21
word_mean: 2.091
word_median: 2
n_empty: 0
n_duplicates: 52,981
duplicate_rate: 0.6567
vocab_size: 8,093
readability_flesch_mean: 17.61
emoji_rate: 1.239e-05
url_rate: 0
one_word_rate: 0.1112
allcaps_rate: 0.03462
boilerplate_rate: 0
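
The duplicate figures above are consistent with counting every row beyond the first occurrence of each distinct name (n minus unique). That definition is inferred from the numbers, not stated by the profile. The sketch below checks the arithmetic and shows one possible way to treat the placeholder as missing; `clean_name` is a hypothetical helper.

```python
from typing import Optional

PLACEHOLDER = "Unnamed Waterfall"  # dominant placeholder reported in the profile


def clean_name(name: str) -> Optional[str]:
    """Map the placeholder (and blank strings) to None; keep real names."""
    stripped = name.strip()
    if not stripped or stripped == PLACEHOLDER:
        return None
    return stripped


# Arithmetic check against the reported stats (n and unique from above).
n_rows, n_unique = 80_678, 27_697
n_duplicates = n_rows - n_unique
duplicate_rate = n_duplicates / n_rows
```

After mapping the placeholder to None, the duplicate rate among the remaining names would need to be recomputed, since 48,168 of the repeated rows come from the placeholder alone.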

description

categorical feature long_tail

This is a categorical descriptor column, overwhelmingly dominated by the value "Waterfall" which accounts for 72,565 of 80,678 rows (top_rate 0.899). The remaining 774 categories appear to be variants annotated with heights (e.g. "Waterfall, 3m", "Waterfall, 5m"), suggesting a structured suffix pattern rather than free text. Entropy is very low (1.14, ratio 0.119) and the long_tail alert fires, so most signal collapses into one label. Treatment: Parse the height suffix into a numeric feature and collapse the rest to a binary is_waterfall flag. high · anthropic:claude-opus-4-7

n: 80,678
nulls: 0 (0.0%)
unique: 775
top_value: Waterfall
top_rate: 0.8994
cardinality: 775
entropy: 1.14
entropy_ratio: 0.1188
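
A sketch of the suggested parsing, assuming every height-bearing value follows the "Waterfall, <number>m" pattern seen in the top values and that "m" means metres. Only 20 of the 775 distinct values are shown in the profile, so other formats may exist; this sketch returns None for them. The name `parse_height_m` is hypothetical.

```python
import re
from typing import Optional

# Matches e.g. "Waterfall, 3m" or "Waterfall, 1.5m" (pattern assumed from
# the top values listed in the profile).
_HEIGHT_RE = re.compile(r"^Waterfall,\s*(\d+(?:\.\d+)?)\s*m$")


def parse_height_m(description: str) -> Optional[float]:
    """Return the numeric height suffix, or None when no suffix is present."""
    match = _HEIGHT_RE.match(description.strip())
    return float(match.group(1)) if match else None


examples = ["Waterfall", "Waterfall, 3m", "Waterfall, 1.5m"]
heights = [parse_height_m(text) for text in examples]
```

Values that return None should be inspected before they are discarded, since an unmatched string may still carry a height in a different format.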

date

categorical metadata imbalance

This column is named 'date' but contains a single value, an empty string, across all 80,678 rows. Cardinality is 1, top_rate is 1.0, and entropy is 0.0, so the field carries no information. It looks like a date field that was never populated. Treatment: Drop; the column is constant (empty string) and has zero signal. high · anthropic:claude-opus-4-7

n: 80,678
nulls: 0 (0.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

country

categorical metadata long_tail imbalance

This is an ISO country code field that is effectively empty: 80,650 of 80,678 rows (top_rate 0.9997) hold the blank string, leaving only 28 populated rows: 24 for VE and one each for DE, LB, HN, and BR. Entropy is 0.0048 (entropy_ratio 0.0019), so the column carries almost no information despite having no nulls. The non-blank values look plausible but are far too sparse to support segmentation or modelling. Treatment: Drop or collapse to a binary has_country flag; too sparse to use as a feature. high · anthropic:claude-opus-4-7

n: 80,678
nulls: 0 (0.0%)
unique: 6
top_value: (empty string)
top_rate: 0.9997
cardinality: 6
entropy: 0.004794
entropy_ratio: 0.001854
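
The entropy and entropy_ratio values above can be reproduced from the country counts if entropy is Shannon entropy in bits and the ratio divides it by log2 of the cardinality. The profile does not document its formula, so this is an inference that happens to match the reported figures:

```python
import math


def shannon_entropy_bits(counts: list[int]) -> float:
    """Shannon entropy (base 2) of a categorical distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts from the country table: blank, VE, DE, LB, HN, BR.
country_counts = [80_650, 24, 1, 1, 1, 1]

entropy = shannon_entropy_bits(country_counts)
# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(country_counts))

print(f"entropy: {entropy:.6f}, entropy_ratio: {entropy_ratio:.6f}")
```

A ratio near 0 means one value dominates; a ratio of 1 would mean all categories are equally frequent.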

height

categorical feature long_tail

A nominally numeric height field stored as strings, but 89.9% of the 80,678 rows are empty and the remaining values spread across 774 distinct non-empty tokens (775 including the empty string) with very low entropy ratio (0.119). The populated values look like small numbers (3, 2, 5, 10...) with no unit in the field itself. The cardinality, top_rate, and entropy are identical to those of description, which suggests the two columns encode the same information and that the 'Waterfall, 3m' style suffixes there give the unit as metres. Treatment: Treat empty string as missing, cast to numeric, and expect ~90% nulls before use; likely drop unless imputation is justified. high · anthropic:claude-opus-4-7

n: 80,678
nulls: 0 (0.0%)
unique: 775
top_value: (empty string)
top_rate: 0.8994
cardinality: 775
entropy: 1.14
entropy_ratio: 0.1188
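
A minimal sketch of the suggested treatment, assuming the populated strings are plain decimal numbers as the examples suggest. Anything that fails to parse is treated as missing and not guessed at; `height_to_float` is a hypothetical helper.

```python
import math
from typing import Optional


def height_to_float(raw: str) -> Optional[float]:
    """Cast a height string to float; empty or unparseable values become None."""
    text = raw.strip()
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    # float() also accepts "nan" and "inf"; treat those as missing too.
    return value if math.isfinite(value) else None


# Made-up sample values for illustration only.
sample = ["", "3", "1.5", "n/a"]
parsed = [height_to_float(v) for v in sample]
null_share = sum(v is None for v in parsed) / len(parsed)
```

On the real column the null share should come out near 0.90 if the empty string is the only missing marker; a noticeably higher figure would point to non-numeric tokens among the 774 populated values.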

source

categorical metadata imbalance

This column records the data provenance, with every one of the 80,678 rows tagged as "OpenStreetMap". Cardinality is 1 and entropy is 0, so the field carries no information for modelling or filtering. Treatment: Drop; the column is constant with a single value. high · anthropic:claude-opus-4-7

n: 80,678
nulls: 0 (0.0%)
unique: 1
top_value: OpenStreetMap
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
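
The drop recommendations for the constant columns can be applied mechanically. The sketch below assumes the records are available as a list of dictionaries, which the profile does not confirm, and uses made-up toy rows; `constant_columns` is a hypothetical helper.

```python
def constant_columns(rows: list[dict]) -> list[str]:
    """Return column names whose value is identical in every row."""
    if not rows:
        return []
    first = rows[0]
    return [
        col
        for col in first
        if all(row.get(col) == first[col] for row in rows)
    ]


# Toy rows mimicking some of the profiled columns (values are illustrative).
rows = [
    {"name": "A Falls", "date": "", "source": "OpenStreetMap", "country": ""},
    {"name": "B Falls", "date": "", "source": "OpenStreetMap", "country": "VE"},
]

to_drop = constant_columns(rows)
trimmed = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
```

Note that country survives this check because it has six distinct values, so near-constant columns still need a separate threshold on top_rate or entropy_ratio.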