saturn

/home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json | 80,678 rows | sample n=80,678 | seed 42 | 2026-06-22T00:25:28+00:00

Overview

Source: /home/coolhand/html/datavis/data_trove/data/geographic/waterfalls/waterfalls_worldwide.json
Total rows: 80,678
Profiled sample: 80,678
Columns: 9
Generated: 2026-06-22T00:25:28+00:00
Per-column null rate across the corpus.

column       kind         null %
latitude     numeric      0.0%
longitude    numeric      0.0%
name         text         0.0%
description  categorical  0.0%
category     categorical  0.0%
date         categorical  0.0%
country      categorical  0.0%
height       categorical  0.0%
source       categorical  0.0%

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:default.

Dataset high anthropic:default

This dataset is a global catalogue of 80,678 waterfalls whose source column attributes every record to OpenStreetMap; it holds geographic coordinates plus a few descriptive attributes. The most striking finding is how little descriptive detail it carries: 89.9% of records have only the generic description 'Waterfall' with no height recorded, and 59.7% of entries are named 'Unnamed Waterfall', so the dataset is geographically broad but informationally thin. Height data is worth a closer look: where it exists, the most frequent values are small (1–10 metres, going by the 'm' suffix in the description column), hinting at a possible recording bias toward easily measured falls. The geographic spread is global (latitude ranges from -77.7 to 78.7), but the country field is an empty string in 99.97% of records, so spatial analysis should rely on the raw coordinates rather than the country column.

country high anthropic:default

This column is meant to hold the country in which each waterfall is located, but it has only 6 distinct values across 80,678 rows. In 99.97% of records (80,650 of 80,678) it contains an empty string rather than a country code, making the field effectively unpopulated. The remaining 28 records are split across five two-letter ISO country codes (VE with 24 occurrences; DE, LB, HN, and BR with 1 each), which suggests the field was filled in only occasionally rather than captured systematically.

name high anthropic:default

This column contains waterfall names drawn from a global, multilingual dataset (evidenced by terms such as 'Cachoeira', 'Cascada', 'Cascata', 'Fossen', and 'Salto'). The dominant signal is that 48,168 of 80,678 rows (59.7%) carry the placeholder 'Unnamed Waterfall', which accounts for most of the 65.7% duplicate rate and leaves just 27,697 unique values. The vocabulary includes Portuguese, Spanish, Norwegian, and English terms, so an analyst should expect a multilingual mix when grouping or filtering by name.

category high anthropic:default

This column is a dataset category tag that holds the same value, 'usgs_waterfalls', in every record. With a cardinality of 1, a top_rate of 1.0, and zero nulls across all 80,678 rows, it carries no discriminative information. It is a constant column, probably a provenance or partition label added when this file was combined with other datasets; note that the 'usgs' label sits oddly beside the source column, which attributes every row to OpenStreetMap.

date high anthropic:default

This column is labeled 'date' but contains no actual date values — every single one of its 80,678 rows holds an empty string, giving it a cardinality of 1 and a top_rate of 1.0. The column is entirely blank with zero nulls, meaning missing values were stored as empty strings rather than proper nulls. It carries zero information and will contribute nothing to any analysis or model.
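
The gap between the reported null rate (0.0%) and the actual emptiness of this column can be made explicit by counting empty strings as missing. The following is a minimal sketch, not saturn's own logic; the function name `effective_missing_rate` and the sample lists are hypothetical.

```python
from typing import Iterable, Optional


def effective_missing_rate(values: Iterable[Optional[str]]) -> float:
    """Share of values that are None, empty, or whitespace-only."""
    values = list(values)
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or not v.strip())
    return missing / len(values)


# Hypothetical sample: every value is an empty string, as in the date column.
date_sample = ["", "", "", ""]
# Hypothetical sample shaped like the country column: mostly empty, one code.
country_sample = ["", "", "VE", ""]

print(effective_missing_rate(date_sample))     # 1.0
print(effective_missing_rate(country_sample))  # 0.75
```

Applied to the real columns, a check like this would be expected to report roughly 100% for date, 99.97% for country, and 89.9% for height, in line with the empty-string counts in this report.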

source high anthropic:default

This column records the data source attribution for all 80,678 rows, and every single record carries the value 'OpenStreetMap' — making it a constant with cardinality of 1, entropy of 0, and a top_rate of 1.0. It provides zero discriminative information and will contribute nothing to any model or analysis. The imbalance alert is technically correct but understates the situation: this is a fully degenerate column, not merely skewed.
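
Constant columns such as category, date, and source can be detected and dropped before analysis. This is a small sketch under the assumption that the JSON file loads as a list of flat records keyed by column name, which the report does not confirm; `constant_columns` and the two sample records are hypothetical.

```python
from typing import Any, Dict, List


def constant_columns(records: List[Dict[str, Any]]) -> List[str]:
    """Return the keys whose value is identical in every record."""
    if not records:
        return []
    constant = []
    for key, first in records[0].items():
        if all(rec.get(key) == first for rec in records):
            constant.append(key)
    return constant


# Hypothetical records using this report's column names and observed values.
records = [
    {"name": "Unnamed Waterfall", "category": "usgs_waterfalls",
     "date": "", "source": "OpenStreetMap"},
    {"name": "Little Niagara Falls", "category": "usgs_waterfalls",
     "date": "", "source": "OpenStreetMap"},
]

print(constant_columns(records))  # ['category', 'date', 'source']
```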

description high anthropic:default

This column is a short text description of each feature, overwhelmingly dominated by the bare label 'Waterfall' (72,565 of 80,678 rows, 89.9%); the remaining values are 'Waterfall' followed by a height suffix in metres (e.g., 'Waterfall, 3m', 'Waterfall, 2m', 'Waterfall, 5m'). The extreme concentration in a single value (an entropy ratio of only 0.119) and the long-tail alert indicate that, despite 775 unique values, almost all of the mass sits in one category. The counts match the height column exactly (775 unique values, 72,565 rows without a height), so the description appears to be derived from height and adds no independent information; the 403 singleton values in the tail may include inconsistently formatted variants that need normalisation.
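
Splitting a description into its label and its numeric suffix makes the long tail easier to inspect. The sketch below assumes the pattern seen in the top values ('Waterfall' optionally followed by ', <number>m'); the regular expression and the name `split_description` are illustrative, and values that do not fit the pattern are returned unparsed rather than guessed at.

```python
import re
from typing import Optional, Tuple

# Assumed pattern, based only on the top values shown in this report:
# a label without commas, optionally followed by ", <number>m".
DESCRIPTION_RE = re.compile(
    r"^(?P<label>[^,]+?)(?:,\s*(?P<height>\d+(?:\.\d+)?)\s*m)?$"
)


def split_description(text: str) -> Tuple[str, Optional[float]]:
    """Return (label, height) or (original text, None) if unparseable."""
    match = DESCRIPTION_RE.match(text.strip())
    if match is None:
        return text, None
    height = match.group("height")
    return match.group("label"), float(height) if height is not None else None


print(split_description("Waterfall"))        # ('Waterfall', None)
print(split_description("Waterfall, 3m"))    # ('Waterfall', 3.0)
print(split_description("Waterfall, 1.5m"))  # ('Waterfall', 1.5)
```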

height high anthropic:default

This column stores height values but is classified as categorical, with 775 unique string values across 80,678 rows. The dominant signal is alarming: 72,565 rows (89.9%) contain an empty string, so the field is effectively missing for nearly 9 in 10 records despite a reported null_rate of 0.0. The most frequent non-empty values are small numbers, mostly integers but also decimals (e.g., '1', '1.5', '2', '3', '5', '10', '20'); the 'm' suffix in the description column suggests the unit is metres. Given the extreme sparsity and the long tail, this column is unreliable as a feature without parsing to a numeric type and either imputation or domain clarification.
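
Because the heights are stored as strings, they need to be coerced before any numeric analysis. A minimal sketch follows, assuming that unparseable or negative values should be treated as missing; that policy and the name `parse_height` are assumptions, not something the report specifies.

```python
import math
from typing import Optional


def parse_height(raw: Optional[str]) -> Optional[float]:
    """Convert a height string to a float, or None if it is unusable."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None  # empty string: the dominant case in this column
    try:
        value = float(text)
    except ValueError:
        return None  # non-numeric text from the long tail
    if not math.isfinite(value) or value < 0:
        return None  # reject nan, inf, and negative heights
    return value


# Hypothetical mix of values: empty, integer, decimal, and free text.
print([parse_height(v) for v in ["", "3", "1.5", "tall"]])
# [None, 3.0, 1.5, None]
```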

latitude high anthropic:default

This column contains geographic latitude coordinates, spanning from -77.72° (Antarctic region) to 78.66° (Arctic region), covering nearly the full terrestrial range. With 80,650 unique values out of 80,678 rows and zero nulls, it is essentially a high-cardinality continuous measurement. The distribution is left-skewed (skew = -0.94) with a mean of 27.1° and a median of 40.3°, indicating a concentration of records at mid Northern Hemisphere latitudes with a long tail toward the Southern Hemisphere. The histogram confirms this: the three bins between 39.57° and 51.3° hold 28,655 rows (35.5%), so despite the wide IQR of 37.8° the distribution is peaked rather than uniform. The kurtosis of -0.28 (an excess value, since raw kurtosis cannot be negative) is close to that of a normal distribution and is not a sign of uniformity.

longitude high anthropic:default

This column is geographic longitude, with values spanning nearly the full valid range, from -179.99° to 179.41°, indicating globally distributed records. The spread is broad (IQR of 100.27°, kurtosis -0.41) and only mildly right-skewed (skew 0.29), but it is not even: the two histogram bins between -0.29° and 17.68° hold 18,767 rows (23.3%). The median of 7.80° (the longitude of Western Europe and West Africa) sits above the mean of 0.96°, which is pulled west by the large number of records in the Americas. Near-perfect uniqueness (80,650 unique values out of 80,678 rows) indicates precise coordinate readings, not bucketed regions.

Numeric correlation

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).

            latitude   longitude
latitude    +1.00      -0.18
longitude   -0.18      +1.00
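
For reference, the Pearson coefficient can be computed as the covariance divided by the product of the standard deviations. The following is a plain-Python sketch of that standard formula, not saturn's implementation; the function name and the tiny coordinate sample are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical coordinates, only to show the call shape.
lats = [60.1, 45.2, -22.9, 10.4]
lons = [8.5, 7.1, -43.2, -84.0]
print(round(pearson(lats, lons), 2))
```

A coefficient of -0.18 indicates only a weak linear relationship, which is unsurprising for coordinates scattered across several continents.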

latitude numeric

rows: 80,678
null: 0 (0.0%)
unique: 80,650
min: -77.722
max: 78.664
mean: 27.148
median: 40.312
std: 30.045
q1: 9.657
q3: 47.477
iqr: 37.820
skew: -0.936
kurtosis: -0.283
n_outliers: 298
outlier_rate: 3.69e-03
zero_rate: 0.000
Histogram bins for latitude (median: 40.311778).

bin                count
-77.72 – -73.81    1
-73.81 – -69.9     0
-69.9 – -65.99     2
-65.99 – -62.08    0
-62.08 – -58.17    0
-58.17 – -54.26    40
-54.26 – -50.35    69
-50.35 – -46.44    231
-46.44 – -42.54    980
-42.54 – -38.63    1226
-38.63 – -34.72    1111
-34.72 – -30.81    984
-30.81 – -26.9     2936
-26.9 – -22.99     1664
-22.99 – -19.08    2204
-19.08 – -15.17    1445
-15.17 – -11.26    552
-11.26 – -7.348    652
-7.348 – -3.439    711
-3.439 – 0.4709    1326
0.4709 – 4.381     1234
4.381 – 8.29       2293
8.29 – 12.2        1440
12.2 – 16.11       2557
16.11 – 20.02      1843
20.02 – 23.93      1639
23.93 – 27.84      3183
27.84 – 31.75      1376
31.75 – 35.66      3246
35.66 – 39.57      4519
39.57 – 43.48      7862
43.48 – 47.39      12910
47.39 – 51.3       7883
51.3 – 55.21       3329
55.21 – 59.12      3266
59.12 – 63.03      2437
63.03 – 66.94      2174
66.94 – 70.84      1260
70.84 – 74.75      64
74.75 – 78.66      29
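
The report does not say how n_outliers is computed. One common rule that is consistent with the numbers here is Tukey's 1.5 × IQR fences, sketched below from the reported quartiles; treating this as saturn's method is an assumption, and `tukey_fences` is a hypothetical helper.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported for latitude and longitude in this report.
lat_low, lat_high = tukey_fences(9.657, 47.477)
lon_low, lon_high = tukey_fences(-61.708, 38.561)

print(round(lat_low, 2), round(lat_high, 2))  # -47.07 104.21
print(round(lon_low, 2), round(lon_high, 2))  # -212.11 188.96
```

Under this assumption, the latitude lower fence (about -47.07°) falls in the sparse southern tail, so a few hundred flagged rows is plausible. Both longitude fences lie outside the valid ±180° range, which is consistent with the reported longitude n_outliers of 0.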

longitude numeric

rows: 80,678
null: 0 (0.0%)
unique: 80,650
min: -179.991
max: 179.412
mean: 0.963
median: 7.803
std: 76.859
q1: -61.708
q3: 38.561
iqr: 100.269
skew: 0.287
kurtosis: -0.412
n_outliers: 0
outlier_rate: 0.000
zero_rate: 0.000
Histogram bins for longitude (median: 7.8029868).

bin                count
-180 – -171        21
-171 – -162        5
-162 – -153        332
-153 – -144.1      160
-144.1 – -135.1    42
-135.1 – -126.1    1803
-126.1 – -117.1    3492
-117.1 – -108.1    1627
-108.1 – -99.13    566
-99.13 – -90.14    922
-90.14 – -81.15    3019
-81.15 – -72.17    4949
-72.17 – -63.18    2531
-63.18 – -54.2     2294
-54.2 – -45.21     4760
-45.21 – -36.23    1428
-36.23 – -27.24    76
-27.24 – -18.26    959
-18.26 – -9.274    901
-9.274 – -0.2894   4266
-0.2894 – 8.696    7904
8.696 – 17.68      10863
17.68 – 26.67      4161
26.67 – 35.65      2370
35.65 – 44.64      4414
44.64 – 53.62      1659
53.62 – 62.61      555
62.61 – 71.59      512
71.59 – 80.58      785
80.58 – 89.56      865
89.56 – 98.55      833
98.55 – 107.5      2020
107.5 – 116.5      1082
116.5 – 125.5      1555
125.5 – 134.5      842
134.5 – 143.5      1654
143.5 – 152.5      1855
152.5 – 161.4      345
161.4 – 170.4      743
170.4 – 179.4      1508
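
The bin edges in both histograms are consistent with 40 equal-width bins between the column minimum and maximum. A short sketch of that construction follows; the bin count of 40 is inferred from the tables rather than stated by the report, and `equal_width_edges` is a hypothetical helper.

```python
from typing import List


def equal_width_edges(lo: float, hi: float, bins: int = 40) -> List[float]:
    """Return bins + 1 evenly spaced edges from lo to hi."""
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


# Minimum and maximum as reported for each column.
lat_edges = equal_width_edges(-77.722, 78.664)
lon_edges = equal_width_edges(-179.991, 179.412)

print(round(lat_edges[1], 2))  # -73.81
print(round(lon_edges[1], 2))  # -171.01
```

The first interior edges match the tables (-73.81 for latitude, -171 for longitude once shortened to four significant figures), giving bin widths of about 3.91° and 8.99° respectively.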

name text

65.7% duplicate strings
rows: 80,678
null: 0 (0.0%)
unique: 27,697
len_min: 1
len_max: 67
len_mean: 16.121
len_median: 17.000
len_p95: 21.000
word_mean: 2.091
word_median: 2.000
n_empty: 0
n_duplicates: 52,981
duplicate_rate: 0.657
vocab_size: 8,093
readability_flesch_mean: 17.609
emoji_rate: 1.24e-05
url_rate: 0.000
one_word_rate: 0.111
allcaps_rate: 0.035
boilerplate_rate: 0.000
Character-length distribution for name (mean: 16.121).

chars      count
1 – 3      217
3 – 4      1507
4 – 6      685
6 – 8      1501
8 – 9      1940
9 – 11     1737
11 – 13    4325
13 – 14    4350
14 – 16    2013
16 – 18    52248
18 – 19    3568
19 – 21    1444
21 – 22    2068
22 – 24    1215
24 – 26    387
26 – 27    545
27 – 29    308
29 – 31    94
31 – 32    175
32 – 34    61
34 – 36    84
36 – 37    62
37 – 39    18
39 – 41    31
41 – 42    31
42 – 44    11
44 – 46    17
46 – 47    8
47 – 49    2
49 – 50    6
50 – 52    7
52 – 54    1
54 – 55    2
55 – 57    3
57 – 59    1
59 – 60    4
60 – 62    0
62 – 64    0
64 – 65    0
65 – 67    2
Sample values (first 10)
  1. Strømslifossen
  2. Unnamed Waterfall
  3. Rauðfossar
  4. Unnamed Waterfall
  5. Little Niagara Falls
  6. Unnamed Waterfall
  7. Price’s Falls
  8. Unnamed Waterfall
  9. 彌東飛瀑
  10. Cascada de Arriba
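
The duplicate figures for this column are consistent with n_duplicates = rows - unique (80,678 - 27,697 = 52,981) and duplicate_rate = n_duplicates / rows. A minimal sketch of that calculation, run on the ten sample values above; the definition is inferred from the reported numbers, and `duplicate_stats` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def duplicate_stats(values: Sequence[str]) -> Tuple[int, int, float]:
    """Return (unique count, duplicate count, duplicate rate)."""
    n = len(values)
    unique = len(set(values))
    n_duplicates = n - unique
    rate = n_duplicates / n if n else 0.0
    return unique, n_duplicates, rate


# The ten sample values listed above.
sample = [
    "Strømslifossen", "Unnamed Waterfall", "Rauðfossar", "Unnamed Waterfall",
    "Little Niagara Falls", "Unnamed Waterfall", "Price’s Falls",
    "Unnamed Waterfall", "彌東飛瀑", "Cascada de Arriba",
]

print(duplicate_stats(sample))  # (7, 3, 0.3)
```

Filtering out the placeholder 'Unnamed Waterfall' before grouping by name would remove most of these duplicates.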

description categorical

403 singleton categories
rows: 80,678
null: 0 (0.0%)
unique: 775
top_value: Waterfall
top_rate: 0.899
cardinality: 775
entropy: 1.140
entropy_ratio: 0.119
Top values for description (20 unique shown, of 775 total).

rank  value            count   share
1     Waterfall        72,565  89.9%
2     Waterfall, 3m    551     0.7%
3     Waterfall, 2m    520     0.6%
4     Waterfall, 5m    460     0.6%
5     Waterfall, 10m   426     0.5%
6     Waterfall, 4m    423     0.5%
7     Waterfall, 1m    358     0.4%
8     Waterfall, 6m    329     0.4%
9     Waterfall, 20m   298     0.4%
10    Waterfall, 15m   257     0.3%
11    Waterfall, 8m    240     0.3%
12    Waterfall, 7m    214     0.3%
13    Waterfall, 30m   170     0.2%
14    Waterfall, 12m   159     0.2%
15    Waterfall, 25m   125     0.2%
16    Waterfall, 40m   114     0.1%
17    Waterfall, 1.5m  103     0.1%
18    Waterfall, 50m   79      0.1%
19    Waterfall, 9m    79      0.1%
20    Waterfall, 60m   74      0.1%

category categorical

top value is 100.0% of rows
rows: 80,678
null: 0 (0.0%)
unique: 1
top_value: usgs_waterfalls
top_rate: 1.000
cardinality: 1
entropy: 0.000
entropy_ratio: 0.000
Top values for category (1 unique shown, of 1 total).

rank  value            count   share
1     usgs_waterfalls  80,678  100.0%

date categorical

top value is 100.0% of rows
rows: 80,678
null: 0 (0.0%)
unique: 1
top_value: (empty string)
top_rate: 1.000
cardinality: 1
entropy: 0.000
entropy_ratio: 0.000
Top values for date (1 unique shown, of 1 total).

rank  value           count   share
1     (empty string)  80,678  100.0%

country categorical

4 singleton categories; top value is 100.0% of rows
rows: 80,678
null: 0 (0.0%)
unique: 6
top_value: (empty string)
top_rate: 1.000
cardinality: 6
entropy: 4.79e-03
entropy_ratio: 1.85e-03
Top values for country (6 unique shown, of 6 total).

rank  value           count   share
1     (empty string)  80,650  100.0%
2     VE              24      0.0%
3     DE              1       0.0%
4     LB              1       0.0%
5     HN              1       0.0%
6     BR              1       0.0%
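
The entropy figures in this report are consistent with Shannon entropy in bits, and entropy_ratio with that entropy divided by log2(cardinality). This reading is inferred from the numbers, not stated by saturn. The sketch below recomputes both values from the country counts; `entropy_stats` is a hypothetical helper.

```python
import math
from typing import Iterable, Tuple


def entropy_stats(counts: Iterable[int]) -> Tuple[float, float]:
    """Return (entropy in bits, entropy / log2(number of categories))."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    # Adding 0.0 turns a negative zero (single-category case) into 0.0.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts) + 0.0
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    ratio = entropy / max_entropy if max_entropy > 0 else 0.0
    return entropy, ratio


# Value counts for the country column, as listed in this report.
country_counts = [80650, 24, 1, 1, 1, 1]
h, r = entropy_stats(country_counts)
print(f"{h:.2e} {r:.2e}")  # 4.79e-03 1.85e-03
```

The same relationship holds for description and height: 1.140 / log2(775) is approximately 0.119, matching the reported entropy_ratio.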

height categorical

403 singleton categories
rows: 80,678
null: 0 (0.0%)
unique: 775
top_value: (empty string)
top_rate: 0.899
cardinality: 775
entropy: 1.140
entropy_ratio: 0.119
Top values for height (20 unique shown, of 775 total).

rank  value           count   share
1     (empty string)  72,565  89.9%
2     3               551     0.7%
3     2               520     0.6%
4     5               460     0.6%
5     10              426     0.5%
6     4               423     0.5%
7     1               358     0.4%
8     6               329     0.4%
9     20              298     0.4%
10    15              257     0.3%
11    8               240     0.3%
12    7               214     0.3%
13    30              170     0.2%
14    12              159     0.2%
15    25              125     0.2%
16    40              114     0.1%
17    1.5             103     0.1%
18    50              79      0.1%
19    9               79      0.1%
20    60              74      0.1%

source categorical

top value is 100.0% of rows
rows: 80,678
null: 0 (0.0%)
unique: 1
top_value: OpenStreetMap
top_rate: 1.000
cardinality: 1
entropy: 0.000
entropy_ratio: 0.000
Top values for source (1 unique shown, of 1 total).

rank  value          count   share
1     OpenStreetMap  80,678  100.0%