waterfalls waterfalls worldwide

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json

Saturn profiled 80,678 rows across 9 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json",
    "--findings", "waterfalls-waterfalls_worldwide.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset catalogues 80,678 waterfalls worldwide, sourced entirely from OpenStreetMap, with latitude/longitude coordinates and minimal descriptive metadata. The most striking feature is how sparse the descriptive fields are: 'category' and 'source' are constant, 'date' is an empty string in every row, 'country' is blank for 80,650 of 80,678 rows, and 89.9% of 'description' entries are simply 'Waterfall'. The 'name' field is similarly thin: 'Unnamed Waterfall' accounts for 48,168 rows and the duplicate rate is 65.7%. The real analytical signal lives in the geographic coordinates, where latitude is concentrated in the northern hemisphere (median 40.3) and longitude spans the full globe, making this primarily a spatial dataset rather than an attribute-rich one.

citing: country.top_rate · category.top_value · source.top_value · description.top_rate · description.top_values · name.duplicate_rate · name.top_values · latitude.median · latitude.skew · longitude.median · row_count

Out[4]:

saturn.schema() · 9 columns

column       kind         n       null%  unique  alerts
latitude     numeric      80,678  0.0%   80,650
longitude    numeric      80,678  0.0%   80,650
name         text         80,678  0.0%   27,697  duplicates
description  categorical  80,678  0.0%   775     long_tail
category     categorical  80,678  0.0%   1       imbalance
date         categorical  80,678  0.0%   1       imbalance
country      categorical  80,678  0.0%   6       long_tail, imbalance
height       categorical  80,678  0.0%   775     long_tail
source       categorical  80,678  0.0%   1       imbalance
Fig 1.
latitude · Check the northern-hemisphere skew (median ~40°) and the long left tail toward Antarctica.
Histogram bins for latitude (median: 40.311778).
bin               count
-77.72 – -73.81   1
-73.81 – -69.9    0
-69.9 – -65.99    2
-65.99 – -62.08   0
-62.08 – -58.17   0
-58.17 – -54.26   40
-54.26 – -50.35   69
-50.35 – -46.44   231
-46.44 – -42.54   980
-42.54 – -38.63   1226
-38.63 – -34.72   1111
-34.72 – -30.81   984
-30.81 – -26.9    2936
-26.9 – -22.99    1664
-22.99 – -19.08   2204
-19.08 – -15.17   1445
-15.17 – -11.26   552
-11.26 – -7.348   652
-7.348 – -3.439   711
-3.439 – 0.4709   1326
0.4709 – 4.381    1234
4.381 – 8.29      2293
8.29 – 12.2       1440
12.2 – 16.11      2557
16.11 – 20.02     1843
20.02 – 23.93     1639
23.93 – 27.84     3183
27.84 – 31.75     1376
31.75 – 35.66     3246
35.66 – 39.57     4519
39.57 – 43.48     7862
43.48 – 47.39     12910
47.39 – 51.3      7883
51.3 – 55.21      3329
55.21 – 59.12     3266
59.12 – 63.03     2437
63.03 – 66.94     2174
66.94 – 70.84     1260
70.84 – 74.75     64
74.75 – 78.66     29
Fig 2.
longitude · See how waterfalls spread across all longitudes, hinting at distinct continental clusters.
Histogram bins for longitude (median: 7.8029868).
bin                count
-180 – -171        21
-171 – -162        5
-162 – -153        332
-153 – -144.1      160
-144.1 – -135.1    42
-135.1 – -126.1    1803
-126.1 – -117.1    3492
-117.1 – -108.1    1627
-108.1 – -99.13    566
-99.13 – -90.14    922
-90.14 – -81.15    3019
-81.15 – -72.17    4949
-72.17 – -63.18    2531
-63.18 – -54.2     2294
-54.2 – -45.21     4760
-45.21 – -36.23    1428
-36.23 – -27.24    76
-27.24 – -18.26    959
-18.26 – -9.274    901
-9.274 – -0.2894   4266
-0.2894 – 8.696    7904
8.696 – 17.68      10863
17.68 – 26.67      4161
26.67 – 35.65      2370
35.65 – 44.64      4414
44.64 – 53.62      1659
53.62 – 62.61      555
62.61 – 71.59      512
71.59 – 80.58      785
80.58 – 89.56      865
89.56 – 98.55      833
98.55 – 107.5      2020
107.5 – 116.5      1082
116.5 – 125.5      1555
125.5 – 134.5      842
134.5 – 143.5      1654
143.5 – 152.5      1855
152.5 – 161.4      345
161.4 – 170.4      743
170.4 – 179.4      1508
Fig 3.
description · Note that ~90% of descriptions are just 'Waterfall'; only a small tail encodes height info.
Top values for description (20 unique shown, of 775 total).
value            count  share
Waterfall        72565  89.9%
Waterfall, 3m    551    0.7%
Waterfall, 2m    520    0.6%
Waterfall, 5m    460    0.6%
Waterfall, 10m   426    0.5%
Waterfall, 4m    423    0.5%
Waterfall, 1m    358    0.4%
Waterfall, 6m    329    0.4%
Waterfall, 20m   298    0.4%
Waterfall, 15m   257    0.3%
Waterfall, 8m    240    0.3%
Waterfall, 7m    214    0.3%
Waterfall, 30m   170    0.2%
Waterfall, 12m   159    0.2%
Waterfall, 25m   125    0.2%
Waterfall, 40m   114    0.1%
Waterfall, 1.5m  103    0.1%
Waterfall, 50m   79     0.1%
Waterfall, 9m    79     0.1%
Waterfall, 60m   74     0.1%
Fig 4.
name · Name lengths cluster tightly (mean ~16 chars), reflecting the dominance of 'Unnamed Waterfall' and short local names.
Character-length distribution for name (mean: 16.12).
chars     count
1 – 3     217
3 – 4     1507
4 – 6     685
6 – 8     1501
8 – 9     1940
9 – 11    1737
11 – 13   4325
13 – 14   4350
14 – 16   2013
16 – 18   52248
18 – 19   3568
19 – 21   1444
21 – 22   2068
22 – 24   1215
24 – 26   387
26 – 27   545
27 – 29   308
29 – 31   94
31 – 32   175
32 – 34   61
34 – 36   84
36 – 37   62
37 – 39   18
39 – 41   31
41 – 42   31
42 – 44   11
44 – 46   17
46 – 47   8
47 – 49   2
49 – 50   6
50 – 52   7
52 – 54   1
54 – 55   2
55 – 57   3
57 – 59   1
59 – 60   4
60 – 62   0
62 – 64   0
64 – 65   0
65 – 67   2
Fig 5.
country · Almost every row has a blank country — a critical data-quality gap to flag before any geographic rollup.
Top values for country (6 unique shown, of 6 total).
value           count  share
(empty string)  80650  100.0%
VE              24     0.0%
DE              1      0.0%
LB              1      0.0%
HN              1      0.0%
BR              1      0.0%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column       kind         null %
latitude     numeric      0.0%
longitude    numeric      0.0%
name         text         0.0%
description  categorical  0.0%
category     categorical  0.0%
date         categorical  0.0%
country      categorical  0.0%
height       categorical  0.0%
source       categorical  0.0%
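Every column reports a 0.0% null rate, yet the per-column sections show that date, country, and height consist mostly of empty strings and that name is dominated by a placeholder. The sketch below computes an "effective missing rate" that counts such sentinel values as missing. The function name `effective_missing_rate` and the sentinel set are illustrative assumptions, not part of Saturn; the counts plugged in afterwards are the ones reported in this notebook.

```python
from typing import Iterable, Optional

# Hypothetical sentinel set: values that are present but carry no information.
SENTINELS = frozenset({"", "Unnamed Waterfall"})


def effective_missing_rate(values: Iterable[Optional[str]],
                           sentinels: frozenset = SENTINELS) -> float:
    """Share of values that are None, whitespace-only, or a known placeholder."""
    total = 0
    missing = 0
    for value in values:
        total += 1
        if value is None or value.strip() in sentinels:
            missing += 1
    return missing / total if total else 0.0


# Blank or placeholder row counts as reported in this notebook.
ROW_COUNT = 80_678
reported_blank_rows = {
    "date": 80_678,     # single empty-string value in every row
    "country": 80_650,  # blank country code
    "height": 72_565,   # empty height string
    "name": 48_168,     # 'Unnamed Waterfall' placeholder
}
effective_rates = {col: n / ROW_COUNT for col, n in reported_blank_rows.items()}

for col, rate in effective_rates.items():
    print(f"{col:8s} {rate:.1%}")
```

Read this way, four of the nine columns are mostly or entirely uninformative even though none contains a literal null.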
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
           latitude  longitude
latitude   +1.00     -0.18
longitude  -0.18     +1.00

latitude numeric feature

Geographic latitude coordinates spanning -77.72 to 78.66, covering nearly the full habitable range of Earth. The distribution is left-skewed (-0.94) with a median of 40.31 sitting well above the mean of 27.15, suggesting a concentration of records in the Northern Hemisphere with a long tail reaching toward Antarctica. Near-uniqueness (80,650 distinct of 80,678) and zero nulls indicate clean, granular point data.

Treatment: Pair with longitude for geospatial features; avoid treating as a standalone scalar in models.

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["latitude"].stats

stat          value
n             80,678
nulls         0 (0.0%)
unique        80,650
min           -77.72
max           78.66
mean          27.15
median        40.31
std           30.05
q1            9.657
q3            47.48
iqr           37.82
skew          -0.9359
kurtosis      -0.2827
n_outliers    298
outlier_rate  0.003694
zero_rate     0
Fig 8.
Distribution of latitude. Vertical dash marks the median.
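The treatment above recommends pairing latitude with longitude rather than using either as a standalone scalar. One common way to do that, sketched here under stated assumptions, is to project each pair onto the unit sphere and to measure separation with the haversine formula. The function names and the mean Earth radius constant are choices made for this sketch; Saturn does not prescribe them.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def to_unit_sphere(lat_deg: float, lon_deg: float) -> tuple[float, float, float]:
    """Map a (latitude, longitude) pair in degrees to a 3D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The reported medians, used only as an example point (not necessarily a real row).
print(to_unit_sphere(40.31, 7.80))
# Two points on either side of the antimeridian are close on the sphere.
print(haversine_km(0.0, 179.5, 0.0, -179.5))
```

The unit-vector form avoids the discontinuity at ±180° longitude: the two antimeridian points in the last line are one degree apart, which a raw longitude difference of 359 would hide.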

longitude numeric feature

This is a longitude coordinate column spanning the full global range from -179.99 to 179.41, with 80,650 unique values across 80,678 rows and no nulls. The distribution is broad (std 76.86, IQR 100.27) and only mildly skewed (0.29), with the median at 7.80 sitting east of the prime meridian, hinting at a Europe/Africa-leaning sample. No outliers were flagged, consistent with values bounded by valid geographic limits.

Treatment: Pair with latitude as a 2D geospatial feature; avoid treating as a standalone scalar.

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["longitude"].stats

stat          value
n             80,678
nulls         0 (0.0%)
unique        80,650
min           -180
max           179.4
mean          0.9626
median        7.803
std           76.86
q1            -61.71
q3            38.56
iqr           100.3
skew          0.2865
kurtosis      -0.4119
n_outliers    0
outlier_rate  0
zero_rate     0
Fig 9.
Distribution of longitude. Vertical dash marks the median.
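The report does not say how outliers are flagged. If Saturn uses the conventional 1.5 × IQR (Tukey) rule, which is an assumption here, the reported quartiles explain why longitude has no outliers while latitude has 298. A minimal sketch:

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) outlier fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported in the stats tables of this notebook.
lat_low, lat_high = tukey_fences(9.657, 47.48)
lon_low, lon_high = tukey_fences(-61.71, 38.56)

print(f"latitude fences:  {lat_low:.2f} .. {lat_high:.2f}")
print(f"longitude fences: {lon_low:.2f} .. {lon_high:.2f}")
```

Under that assumption both longitude fences fall outside the valid range of -180 to 180, so no valid longitude can be flagged. The upper latitude fence lies above 90, so only far-southern records (roughly south of 47°S) could be counted as latitude outliers.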

name text label

This is the human-readable name of a waterfall, averaging 2 words and 16 characters. The column is dominated by the placeholder 'Unnamed Waterfall' (48,168 of 80,678 rows), driving a 65.7% duplicate rate; multiple languages appear in the vocabulary (cachoeira, cascada, cascata, salto, fossen) alongside English 'falls'.

Treatment: Treat 'Unnamed Waterfall' as missing and avoid using this field as a unique key.

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["name"].stats

stat                     value
n                        80,678
nulls                    0 (0.0%)
unique                   27,697
len_min                  1
len_max                  67
len_mean                 16.12
len_median               17
len_p95                  21
word_mean                2.091
word_median              2
n_empty                  0
n_duplicates             52,981
duplicate_rate           0.6567
vocab_size               8,093
readability_flesch_mean  17.61
emoji_rate               1.239e-05
url_rate                 0
one_word_rate            0.1112
allcaps_rate             0.03462
boilerplate_rate         0
alert: duplicates · 65.7% duplicate strings
Fig 10.
Character-length distribution for name.
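The treatment for name can be sketched as two small helpers: one that maps the placeholder to a missing value, and one that computes the duplicate rate. The helper names are hypothetical. The duplicate-rate formula, (n - unique) / n, is inferred from the reported figures (80,678 - 27,697 = 52,981 duplicates) rather than taken from Saturn's documentation.

```python
from typing import Optional, Sequence

PLACEHOLDER = "Unnamed Waterfall"  # placeholder value reported for the name column


def clean_name(name: str) -> Optional[str]:
    """Return None for blank names or the placeholder, else the stripped name."""
    stripped = name.strip()
    return None if stripped in ("", PLACEHOLDER) else stripped


def duplicate_rate(values: Sequence[str]) -> float:
    """Share of rows that repeat an already-seen value: (n - unique) / n."""
    if not values:
        return 0.0
    return (len(values) - len(set(values))) / len(values)


# Made-up sample rows, for illustration only.
sample = ["Unnamed Waterfall", "Example Falls", "Unnamed Waterfall", " Example Falls "]
print([clean_name(n) for n in sample])
print(duplicate_rate(sample))

# Check the inferred formula against the reported counts.
print(round((80_678 - 27_697) / 80_678, 4))
```

Because names repeat even among genuinely named waterfalls, rows are better identified by a coordinate pair or a surrogate key than by name.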

description categorical feature

This is a categorical descriptor column, overwhelmingly dominated by the value "Waterfall" which accounts for 72,565 of 80,678 rows (top_rate 0.899). The remaining 774 categories appear to be variants annotated with heights (e.g. "Waterfall, 3m", "Waterfall, 5m"), suggesting a structured suffix pattern rather than free text. Entropy is very low (1.14, ratio 0.119) and the long_tail alert fires, so most signal collapses into one label.

Treatment: Parse the height suffix into a numeric feature and collapse the rest to a binary is_waterfall flag.

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["description"].stats

stat           value
n              80,678
nulls          0 (0.0%)
unique         775
top_value      Waterfall
top_rate       0.8994
cardinality    775
entropy        1.14
entropy_ratio  0.1188
alert: long_tail · 403 singleton categories
Fig 11.
Top values for description.
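The height suffix mentioned in the treatment can be extracted with a regular expression. The pattern below is an assumption based only on the 20 top values shown ("Waterfall, 3m", "Waterfall, 1.5m"); the remaining categories may use other formats, in which case the function returns None instead of guessing. The function name is hypothetical.

```python
import re
from typing import Optional

# Assumed pattern, based on the visible top values: "<text>, <number>m".
HEIGHT_SUFFIX = re.compile(r",\s*(\d+(?:\.\d+)?)\s*m\s*$")


def parse_height_m(description: str) -> Optional[float]:
    """Return the trailing height in metres, or None when no suffix is present."""
    match = HEIGHT_SUFFIX.search(description)
    return float(match.group(1)) if match else None


for text in ["Waterfall", "Waterfall, 3m", "Waterfall, 1.5m", "Waterfall, 60m"]:
    print(f"{text!r:20} -> {parse_height_m(text)}")
```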

category categorical metadata

This column is a constant categorical tag, holding the literal value "usgs_waterfalls" for all 80,678 rows. With cardinality 1, entropy 0, and a top_rate of 1.0, it carries no information and likely just records an ingestion batch or upstream collection name. Note that the tag mentions USGS while the source column reads "OpenStreetMap" for every row.

Treatment: Drop before modelling; retain only as a provenance label if combining with other sources.

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["category"].stats

stat           value
n              80,678
nulls          0 (0.0%)
unique         1
top_value      usgs_waterfalls
top_rate       1
cardinality    1
entropy        0
entropy_ratio  0
alert: imbalance · top value is 100.0% of rows
Fig 12.
Top values for category.
Top values for category (1 unique shown, of 1 total).
value            count  share
usgs_waterfalls  80678  100.0%

date categorical metadata

This column is named 'date' but contains a single value, an empty string, across all 80,678 rows. Cardinality is 1, top_rate is 1.0, and entropy is 0.0, so the field carries no information. It looks like a date field that was never populated.

Treatment: Drop; the column is constant (empty string) and has zero signal.

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["date"].stats

stat           value
n              80,678
nulls          0 (0.0%)
unique         1
top_value      (empty string)
top_rate       1
cardinality    1
entropy        0
entropy_ratio  0
alert: imbalance · top value is 100.0% of rows
Fig 13.
Top values for date.
Top values for date (1 unique shown, of 1 total).
value           count  share
(empty string)  80678  100.0%

country categorical metadata

This is an ISO country code field that is effectively empty: 80,650 of 80,678 rows (top_rate 0.9997) hold the blank string, leaving only 28 actual codes: VE (24) and one each for DE, LB, HN, and BR. Entropy is 0.0048 (entropy_ratio 0.0019), so the column carries almost no information despite having no nulls. The non-blank values look plausible but are far too sparse to support segmentation or modelling.

Treatment: Drop or collapse to a binary has_country flag; too sparse to use as a feature.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["country"].stats

stat           value
n              80,678
nulls          0 (0.0%)
unique         6
top_value      (empty string)
top_rate       0.9997
cardinality    6
entropy        0.004794
entropy_ratio  0.001854
alert: long_tail · 4 singleton categories
alert: imbalance · top value is 100.0% of rows
Fig 14.
Top values for country.
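The entropy figures for country can be reproduced from the value counts above. The report does not state the logarithm base or how entropy_ratio is normalised; the sketch below assumes Shannon entropy in bits divided by log2(cardinality). That assumption happens to match the reported 0.004794 and 0.001854, which is suggestive but not a confirmation of Saturn's implementation.

```python
import math
from typing import Sequence


def shannon_entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a categorical distribution given by counts."""
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in positive)


def entropy_ratio(counts: Sequence[int]) -> float:
    """Entropy divided by its maximum, log2(number of observed categories)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a constant column has no spread to normalise
    return shannon_entropy_bits(counts) / math.log2(k)


# Value counts for country: blank, VE, DE, LB, HN, BR.
country_counts = [80_650, 24, 1, 1, 1, 1]
print(f"entropy       {shannon_entropy_bits(country_counts):.6f}")
print(f"entropy_ratio {entropy_ratio(country_counts):.6f}")
```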

height categorical feature

A nominally numeric height field stored as strings: 89.9% of the 80,678 rows are empty, and the populated rows spread across 774 distinct non-empty tokens, giving the column a very low entropy ratio (0.119). The populated values are small numbers (3, 2, 5, 10, ...) with no unit in the field itself. Their counts match the 'Waterfall, Nm' suffixes in the description column (551 rows for 3, 520 for 2, 460 for 5), which suggests the values are heights in metres.

Treatment: Treat empty string as missing, cast to numeric, and expect ~90% nulls before use; likely drop unless imputation is justified.

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["height"].stats

stat           value
n              80,678
nulls          0 (0.0%)
unique         775
top_value      (empty string)
top_rate       0.8994
cardinality    775
entropy        1.14
entropy_ratio  0.1188
alert: long_tail · 403 singleton categories
Fig 15.
Top values for height.
Top values for height (20 unique shown, of 775 total).
value           count  share
(empty string)  72565  89.9%
3               551    0.7%
2               520    0.6%
5               460    0.6%
10              426    0.5%
4               423    0.5%
1               358    0.4%
6               329    0.4%
20              298    0.4%
15              257    0.3%
8               240    0.3%
7               214    0.3%
30              170    0.2%
12              159    0.2%
25              125    0.2%
40              114    0.1%
1.5             103    0.1%
50              79     0.1%
9               79     0.1%
60              74     0.1%
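The treatment for height (empty string as missing, then cast to numeric) might look like the sketch below. The function name is hypothetical, and the handling of non-numeric tokens is an assumption: only 20 of the 775 distinct values are visible in this report, so unparseable strings are mapped to None rather than raising an error.

```python
import math
from typing import Optional


def parse_height(raw: Optional[str]) -> Optional[float]:
    """Cast a height string to float; empty, non-numeric, or non-finite input becomes None."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


# Made-up sample in the style of the values shown above.
sample = ["", "3", "1.5", "60", "n/a"]
heights = [parse_height(v) for v in sample]
missing_rate = heights.count(None) / len(heights)
print(heights, missing_rate)
```

On the full column the missing rate after this cast should be at least the reported 89.9% share of empty strings.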

source categorical metadata

This column records the data provenance, with every one of the 80,678 rows tagged as "OpenStreetMap". Cardinality is 1 and entropy is 0, so the field carries no information for modelling or filtering.

Treatment: Drop; the column is constant, with a single value.

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["source"].stats

stat           value
n              80,678
nulls          0 (0.0%)
unique         1
top_value      OpenStreetMap
top_rate       1
cardinality    1
entropy        0
entropy_ratio  0
alert: imbalance · top value is 100.0% of rows
Fig 16.
Top values for source.
Top values for source (1 unique shown, of 1 total).
value          count  share
OpenStreetMap  80678  100.0%
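Three columns in this report (category, date, source) hold a single value and share the same "drop" treatment. A minimal sketch of detecting and dropping constant columns from a list of records follows. It assumes the rows are flat dictionaries with the same keys; the helper names and the sample rows are made up for illustration.

```python
from typing import Any, Iterable, Mapping, Sequence


def constant_columns(rows: Sequence[Mapping[str, Any]]) -> list[str]:
    """Return the columns whose value is identical in every row."""
    if not rows:
        return []
    first = rows[0]
    return [col for col in first
            if all(row.get(col) == first[col] for row in rows)]


def drop_columns(rows: Sequence[Mapping[str, Any]],
                 cols: Iterable[str]) -> list[dict[str, Any]]:
    """Return copies of the rows without the given columns."""
    drop = set(cols)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Made-up sample rows mirroring the constant values reported above.
sample_rows = [
    {"name": "Example Falls", "category": "usgs_waterfalls", "date": "", "source": "OpenStreetMap"},
    {"name": "Unnamed Waterfall", "category": "usgs_waterfalls", "date": "", "source": "OpenStreetMap"},
]
to_drop = constant_columns(sample_rows)
print(to_drop)
print(drop_columns(sample_rows, to_drop))
```

If this dataset is later merged with other sources, category and source may be worth keeping as provenance labels, as the category treatment notes.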

How to cite

BibTeX
@misc{saturn-waterfalls-waterfalls-worldwide-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: waterfalls waterfalls worldwide},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/waterfalls-waterfalls_worldwide}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: waterfalls waterfalls worldwide. Source: /home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/waterfalls-waterfalls_worldwide