saturn

/home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json 80,678 rows sample n=80,678 seed 42 2026-06-22T00:25:28+00:00

Overview

Source	/home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json
Total rows	80,678
Profiled sample	80,678
Columns	9
Generated	2026-06-22T00:25:28+00:00

Per-column null rate across the corpus.
column	kind	null %
latitude	numeric	0.0%
longitude	numeric	0.0%
name	text	0.0%
description	categorical	0.0%
category	categorical	0.0%
date	categorical	0.0%
country	categorical	0.0%
height	categorical	0.0%
source	categorical	0.0%

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:default.

Dataset high anthropic:default

This dataset is a global catalogue of 80,678 waterfalls whose source column attributes every record to OpenStreetMap, covering geographic coordinates and basic descriptive attributes. The most striking finding is how thin the descriptive data is: 89.9% of records carry only the generic description 'Waterfall' with no height recorded, and 59.7% of entries are named 'Unnamed Waterfall', so the dataset is geographically broad but informationally sparse. Height data is worth a closer look: among the 20 most frequent non-empty values, small measurements (1 to 10 metres) dominate, which may reflect a recording bias toward easily measured falls. The geographic spread is genuinely global (latitude ranges from -77.7 to 78.7), but the country field is an empty string for 99.97% of records, so spatial analysis should rely on the raw coordinates rather than the country column.

country high anthropic:default

This column is intended to capture the country in which each waterfall lies, with only 6 distinct values across 80,678 rows. 99.97% of records (80,650 out of 80,678) contain an empty string rather than a country code, making the field effectively unpopulated. The remaining 28 records split across five two-letter country codes (VE with 24 occurrences, and DE, LB, HN, BR with 1 each), which suggests the field was filled in only occasionally rather than captured systematically.

name high anthropic:default

This column contains the names of waterfalls, drawn from a global geographic dataset (evidenced by multilingual terms: 'Cachoeira'/'Cascada'/'Cascata'/'Fossen'/'Salto'). The dominant signal is that 48,168 of 80,678 rows, nearly 60%, carry the placeholder 'Unnamed Waterfall'. That placeholder accounts for most of the 65.7% duplicate rate and leaves only 27,697 unique values out of 80,678 total. The vocabulary mixes Portuguese, Spanish, and Norwegian generic terms with English ones, and the sample values also include Icelandic and Chinese-script names, so an analyst should not assume a single language when grouping or filtering by name.
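
The duplicate figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a "duplicate" is any row beyond the first occurrence of its value (so n_duplicates = rows - unique); saturn's exact definition is not stated in this report, but the assumption reproduces the reported numbers.

```python
# Figures copied from the name column's profile.
rows = 80_678
unique = 27_697
unnamed = 48_168  # rows whose name is 'Unnamed Waterfall'

# Assumption: every row after the first occurrence of a value is a duplicate.
n_duplicates = rows - unique
duplicate_rate = n_duplicates / rows
unnamed_share = unnamed / rows

print(n_duplicates)              # 52981
print(round(duplicate_rate, 3))  # 0.657
print(round(unnamed_share, 3))   # 0.597
```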

category high anthropic:default

This column is a dataset category tag that holds the same value, 'usgs_waterfalls', on every record. With a cardinality of 1, a top_rate of 1.0, and zero nulls across all 80,678 rows, it carries no discriminative information. It is a constant column, probably a partition label added when merging multiple source datasets. Note that the label names USGS while the source column attributes every row to OpenStreetMap, so the tag should not be read as provenance without checking.

date high anthropic:default

This column is labeled 'date' but contains no date values: every one of its 80,678 rows holds an empty string, giving it a cardinality of 1 and a top_rate of 1.0. The reported null rate is zero only because missing values were stored as empty strings rather than proper nulls. The column carries no information and will contribute nothing to any analysis or model.
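
The gap between a 0.0% null rate and an entirely blank column can be closed by normalising blanks before profiling. The sketch below is one way to do that; the helper names `blank_to_none` and `null_rate` and the four-value sample are hypothetical and are not part of saturn or of the source file.

```python
from typing import List, Optional


def blank_to_none(value: Optional[str]) -> Optional[str]:
    """Map empty or whitespace-only strings to None; leave other values alone."""
    if value is None or value.strip() == "":
        return None
    return value


def null_rate(values: List[Optional[str]]) -> float:
    """Share of entries that are None (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


# Hypothetical mini-sample standing in for the date column.
date_column = ["", "", " ", ""]
cleaned = [blank_to_none(v) for v in date_column]

print(null_rate(date_column))  # 0.0
print(null_rate(cleaned))      # 1.0
```

Applied to the country and height columns, the same normalisation would raise their null rates to roughly the empty-string shares reported here (99.97% and 89.9%).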

source high anthropic:default

This column records the data source attribution for all 80,678 rows, and every record carries the value 'OpenStreetMap', making it a constant with a cardinality of 1, an entropy of 0, and a top_rate of 1.0. It provides no discriminative information and will contribute nothing to any model or analysis. The imbalance alert is technically correct but understates the situation: this is a fully degenerate column, not merely a skewed one.
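
Constant columns such as category, date, and source can be flagged mechanically before any modelling. This is a minimal sketch, assuming the file loads as a list of flat dictionaries; the function name `degenerate_columns` and the three sample records are illustrative and were assembled by hand, not read from the file.

```python
from typing import Any, Dict, List


def degenerate_columns(rows: List[Dict[str, Any]]) -> List[str]:
    """Return the columns that hold exactly one distinct value across all rows."""
    if not rows:
        return []
    return [
        column
        for column in rows[0]
        if len({row.get(column) for row in rows}) == 1
    ]


# Hand-built records shaped like the profiled columns (not real rows).
sample = [
    {"name": "Unnamed Waterfall", "category": "usgs_waterfalls", "date": "", "source": "OpenStreetMap"},
    {"name": "Little Niagara Falls", "category": "usgs_waterfalls", "date": "", "source": "OpenStreetMap"},
    {"name": "Cascada de Arriba", "category": "usgs_waterfalls", "date": "", "source": "OpenStreetMap"},
]

print(degenerate_columns(sample))  # ['category', 'date', 'source']
```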

description high anthropic:default

This column is a short text description of each feature, overwhelmingly dominated by the bare value 'Waterfall' (72,565 of 80,678 rows, ~89.9%); the remaining values are 'Waterfall' followed by a height in metres (e.g., 'Waterfall, 3m', 'Waterfall, 2m', 'Waterfall, 5m'). Its value counts mirror the height column's exactly (775 unique values, 72,565 rows in the top value, 551 rows for 3 m, and so on), so the description appears to be derived from height and adds no independent information. The extreme concentration in a single value (an entropy ratio of only 0.119) means that almost all of the mass sits in one category despite 775 unique values, and the 403 singleton categories in the tail may include inconsistently formatted variants that need normalisation.

height high anthropic:default

This column stores waterfall heights but is classified as categorical, with 775 unique string values across 80,678 rows. 72,565 rows (89.9%) contain an empty string, so the field is effectively missing for nearly 9 in 10 records despite a reported null_rate of 0.0. The most frequent non-empty values are small numbers, mostly integers with some decimals (e.g., '1', '1.5', '2', '3', '5', '10', '20'), and the 'm' suffix on the matching description values ('Waterfall, 3m') indicates that the unit is metres. With 403 singleton values in the tail, the column needs parsing to a numeric type and a check for inconsistent formats before use, and its sparsity rules out analyses that need a height for most records.
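
Converting the height strings to numbers is the first step toward using the column. The sketch below assumes the unit is metres (inferred from the 'm' suffix in the description column, not stated by the source) and treats blanks, unparseable text, and non-positive values as missing; `parse_height_m` and the sample list are hypothetical.

```python
import math
from typing import Optional


def parse_height_m(raw: str) -> Optional[float]:
    """Parse a height string into metres, or return None when it is unusable."""
    text = raw.strip()
    if not text:
        return None  # empty string stands in for a missing value
    try:
        value = float(text)
    except ValueError:
        return None  # non-numeric text, e.g. a stray unit or a range
    if not math.isfinite(value) or value <= 0:
        return None  # reject nan, inf, zero, and negative heights
    return value


# The first four values appear in the profile; 'tall' is a made-up malformed entry.
sample = ["", "3", "1.5", "60", "tall"]
parsed = [parse_height_m(v) for v in sample]

print(parsed)  # [None, 3.0, 1.5, 60.0, None]
```

Rows that come back as None should be counted rather than silently dropped, since they make up about 90% of the column.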

latitude high anthropic:default

This column contains geographic latitude coordinates, spanning from -77.72° (Antarctica) to 78.66° (high Arctic), which covers most of the valid range of -90° to 90°. With 80,650 unique values out of 80,678 rows and zero nulls, it is a high-cardinality continuous measurement. The distribution is left-skewed (skew = -0.94) with a mean of 27.1° and a median of 40.3°: records concentrate in the northern mid-latitudes, with a long tail toward the Southern Hemisphere. The IQR of 37.8° and slightly negative excess kurtosis (-0.28) indicate a broad spread, but the histogram is not uniform: the three bins between 39.57° and 51.3° hold 28,655 rows, just over a third of the total.

longitude high anthropic:default

This column is geographic longitude, with values spanning nearly the full valid range, from -179.99° to 179.41°, indicating globally distributed records. The spread is wide (IQR of 100.27°, excess kurtosis -0.41) and the skew is mild (0.29), but the histogram is not flat: the two bins between -0.29° and 17.68° hold 18,767 rows, about 23% of the total. The median of 7.80° (central and western Europe, West Africa) sits above the mean of 0.96°, so the mean is pulled west by the many records at negative (western) longitudes. Near-perfect uniqueness (80,650 unique values out of 80,678 rows) confirms these are precise coordinate readings, not bucketed regions.

Numeric correlation

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	latitude	longitude
latitude	+1.00	-0.18
longitude	-0.18	+1.00
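
For reference, a Pearson coefficient like the -0.18 above can be computed without any third-party library. This is a generic sketch of the textbook formula, not saturn's implementation; the function name `pearson` and the toy inputs are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centred products and squares; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy inputs: a perfectly increasing and a perfectly decreasing relationship.
print(round(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # 1.0
print(round(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]), 6))  # -1.0
```

A coefficient of -0.18 is weak, and for coordinates it mostly reflects where land and mappers are, so it should not be read as a physical relationship between latitude and longitude.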

latitude numeric

rows	80,678
null	0 (0.0%)
unique	80,650
min	-77.722
max	78.664
mean	27.148
median	40.312
std	30.045
q1	9.657
q3	47.477
iqr	37.820
skew	-0.936
kurtosis	-0.283
n_outliers	298
outlier_rate	3.69e-03
zero_rate	0.000
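
The report does not say how saturn counts outliers. One common convention is Tukey's rule, which flags values more than 1.5 × IQR outside the quartiles; the sketch below applies that rule to the reported quartiles as an assumption, not as a description of the tool.

```python
# Quartiles and counts copied from the latitude profile.
q1, q3 = 9.657, 47.477
rows, n_outliers = 80_678, 298

# Assumption: Tukey fences at 1.5 * IQR beyond each quartile.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(round(iqr, 3))               # 37.82
print(round(lower_fence, 3))       # -47.073
print(round(upper_fence, 3))       # 104.207
print(f"{n_outliers / rows:.2e}")  # 3.69e-03
```

Under this assumption the upper fence lies beyond 90°, so every latitude outlier would be a far-southern record below about -47.07°. The same rule applied to longitude (q1 = -61.708, q3 = 38.561) puts both fences outside ±180°, which agrees with the reported count of 0 outliers there.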

Histogram bins for latitude (median: 40.312).
bin	count
-77.72 – -73.81	1
-73.81 – -69.9	0
-69.9 – -65.99	2
-65.99 – -62.08	0
-62.08 – -58.17	0
-58.17 – -54.26	40
-54.26 – -50.35	69
-50.35 – -46.44	231
-46.44 – -42.54	980
-42.54 – -38.63	1226
-38.63 – -34.72	1111
-34.72 – -30.81	984
-30.81 – -26.9	2936
-26.9 – -22.99	1664
-22.99 – -19.08	2204
-19.08 – -15.17	1445
-15.17 – -11.26	552
-11.26 – -7.348	652
-7.348 – -3.439	711
-3.439 – 0.4709	1326
0.4709 – 4.381	1234
4.381 – 8.29	2293
8.29 – 12.2	1440
12.2 – 16.11	2557
16.11 – 20.02	1843
20.02 – 23.93	1639
23.93 – 27.84	3183
27.84 – 31.75	1376
31.75 – 35.66	3246
35.66 – 39.57	4519
39.57 – 43.48	7862
43.48 – 47.39	12910
47.39 – 51.3	7883
51.3 – 55.21	3329
55.21 – 59.12	3266
59.12 – 63.03	2437
63.03 – 66.94	2174
66.94 – 70.84	1260
70.84 – 74.75	64
74.75 – 78.66	29
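
The bin edges in this table are consistent with 40 equal-width bins between the column minimum and maximum. The sketch below reproduces them under that assumption; `bin_edges` is a hypothetical helper, and the binning rule is inferred from the printed edges rather than documented.

```python
from typing import List


def bin_edges(lo: float, hi: float, n_bins: int) -> List[float]:
    """Return n_bins + 1 equally spaced edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


# Minimum and maximum copied from the latitude profile; 40 is the row count of the table.
edges = bin_edges(-77.722, 78.664, 40)

print(len(edges))           # 41
print(round(edges[1], 2))   # -73.81
print(round(edges[31], 2))  # 43.48
print(round(edges[32], 2))  # 47.39
```

Edges 31 and 32 bound the modal bin (43.48 to 47.39), which holds 12,910 rows.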

longitude numeric

rows	80,678
null	0 (0.0%)
unique	80,650
min	-179.991
max	179.412
mean	0.963
median	7.803
std	76.859
q1	-61.708
q3	38.561
iqr	100.269
skew	0.287
kurtosis	-0.412
n_outliers	0
outlier_rate	0.000
zero_rate	0.000

Histogram bins for longitude (median: 7.8029868).
bin	count
-180 – -171	21
-171 – -162	5
-162 – -153	332
-153 – -144.1	160
-144.1 – -135.1	42
-135.1 – -126.1	1803
-126.1 – -117.1	3492
-117.1 – -108.1	1627
-108.1 – -99.13	566
-99.13 – -90.14	922
-90.14 – -81.15	3019
-81.15 – -72.17	4949
-72.17 – -63.18	2531
-63.18 – -54.2	2294
-54.2 – -45.21	4760
-45.21 – -36.23	1428
-36.23 – -27.24	76
-27.24 – -18.26	959
-18.26 – -9.274	901
-9.274 – -0.2894	4266
-0.2894 – 8.696	7904
8.696 – 17.68	10863
17.68 – 26.67	4161
26.67 – 35.65	2370
35.65 – 44.64	4414
44.64 – 53.62	1659
53.62 – 62.61	555
62.61 – 71.59	512
71.59 – 80.58	785
80.58 – 89.56	865
89.56 – 98.55	833
98.55 – 107.5	2020
107.5 – 116.5	1082
116.5 – 125.5	1555
125.5 – 134.5	842
134.5 – 143.5	1654
143.5 – 152.5	1855
152.5 – 161.4	345
161.4 – 170.4	743
170.4 – 179.4	1508

name text

65.7% duplicate strings

rows	80,678
null	0 (0.0%)
unique	27,697
len_min	1
len_max	67
len_mean	16.121
len_median	17.000
len_p95	21.000
word_mean	2.091
word_median	2.000
n_empty	0
n_duplicates	52,981
duplicate_rate	0.657
vocab_size	8,093
readability_flesch_mean	17.609
emoji_rate	1.24e-05
url_rate	0.000
one_word_rate	0.111
allcaps_rate	0.035
boilerplate_rate	0.000

Character-length distribution for name (mean: 16.121).
chars	count
1 – 3	217
3 – 4	1507
4 – 6	685
6 – 8	1501
8 – 9	1940
9 – 11	1737
11 – 13	4325
13 – 14	4350
14 – 16	2013
16 – 18	52248
18 – 19	3568
19 – 21	1444
21 – 22	2068
22 – 24	1215
24 – 26	387
26 – 27	545
27 – 29	308
29 – 31	94
31 – 32	175
32 – 34	61
34 – 36	84
36 – 37	62
37 – 39	18
39 – 41	31
41 – 42	31
42 – 44	11
44 – 46	17
46 – 47	8
47 – 49	2
49 – 50	6
50 – 52	7
52 – 54	1
54 – 55	2
55 – 57	3
57 – 59	1
59 – 60	4
60 – 62	0
62 – 64	0
64 – 65	0
65 – 67	2

Sample values (first 10)

Strømslifossen
Unnamed Waterfall
Rauðfossar
Unnamed Waterfall
Little Niagara Falls
Unnamed Waterfall
Price’s Falls
Unnamed Waterfall
彌東飛瀑
Cascada de Arriba

description categorical

403 singleton categories

rows	80,678
null	0 (0.0%)
unique	775
top_value	Waterfall
top_rate	0.899
cardinality	775
entropy	1.140
entropy_ratio	0.119

Top values for description (20 unique shown, of 775 total).
value	count	share
Waterfall	72565	89.9%
Waterfall, 3m	551	0.7%
Waterfall, 2m	520	0.6%
Waterfall, 5m	460	0.6%
Waterfall, 10m	426	0.5%
Waterfall, 4m	423	0.5%
Waterfall, 1m	358	0.4%
Waterfall, 6m	329	0.4%
Waterfall, 20m	298	0.4%
Waterfall, 15m	257	0.3%
Waterfall, 8m	240	0.3%
Waterfall, 7m	214	0.3%
Waterfall, 30m	170	0.2%
Waterfall, 12m	159	0.2%
Waterfall, 25m	125	0.2%
Waterfall, 40m	114	0.1%
Waterfall, 1.5m	103	0.1%
Waterfall, 50m	79	0.1%
Waterfall, 9m	79	0.1%
Waterfall, 60m	74	0.1%

category categorical

top value is 100.0% of rows

rows	80,678
null	0 (0.0%)
unique	1
top_value	usgs_waterfalls
top_rate	1.000
cardinality	1
entropy	0.000
entropy_ratio	0.000

Top values for category (1 unique shown, of 1 total).
value	count	share
usgs_waterfalls	80678	100.0%

date categorical

top value is 100.0% of rows

rows	80,678
null	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1.000
cardinality	1
entropy	0.000
entropy_ratio	0.000

Top values for date (1 unique shown, of 1 total).
value	count	share
(empty string)	80678	100.0%

Top values (rank 1–20)

— 80,678

country categorical

4 singleton categories; top value is 99.97% of rows (displayed as 100.0% after rounding)

rows	80,678
null	0 (0.0%)
unique	6
top_value	(empty string)
top_rate	1.000
cardinality	6
entropy	4.79e-03
entropy_ratio	1.85e-03

Top values for country (6 unique shown, of 6 total).
value	count	share
(empty string)	80650	99.97%
VE	24	0.03%
DE	1	<0.01%
LB	1	<0.01%
HN	1	<0.01%
BR	1	<0.01%
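
The entropy figures for this column can be reproduced from the six counts above. The sketch assumes Shannon entropy in bits and an entropy ratio normalised by log2 of the cardinality; neither definition is stated in the report, but both match the reported values. The function names are hypothetical.

```python
import math
from typing import List


def shannon_entropy_bits(counts: List[int]) -> float:
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: List[int]) -> float:
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column has no entropy to normalise
    return shannon_entropy_bits(counts) / math.log2(k)


# Counts copied from the country table: empty string, VE, DE, LB, HN, BR.
country_counts = [80650, 24, 1, 1, 1, 1]

print(f"{shannon_entropy_bits(country_counts):.2e}")  # 4.79e-03
print(f"{entropy_ratio(country_counts):.2e}")         # 1.85e-03
```

Both values sit close to zero because one category holds almost all of the rows, which is the same conclusion the top-value share gives.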

height categorical

403 singleton categories

rows	80,678
null	0 (0.0%)
unique	775
top_value	(empty string)
top_rate	0.899
cardinality	775
entropy	1.140
entropy_ratio	0.119

Top values for height (20 unique shown, of 775 total).
value	count	share
(empty string)	72565	89.9%
3	551	0.7%
2	520	0.6%
5	460	0.6%
10	426	0.5%
4	423	0.5%
1	358	0.4%
6	329	0.4%
20	298	0.4%
15	257	0.3%
8	240	0.3%
7	214	0.3%
30	170	0.2%
12	159	0.2%
25	125	0.2%
40	114	0.1%
1.5	103	0.1%
50	79	0.1%
9	79	0.1%
60	74	0.1%

source categorical

top value is 100.0% of rows

rows	80,678
null	0 (0.0%)
unique	1
top_value	OpenStreetMap
top_rate	1.000
cardinality	1
entropy	0.000
entropy_ratio	0.000

Top values for source (1 unique shown, of 1 total).
value	count	share
OpenStreetMap	80678	100.0%
