
data trove world languages endangerment silence

saturn notebook · generated 2026-06-22

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/silence_data.json

Saturn profiled 6,998 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
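# Install the profiler from within the notebook.
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source file with the arguments used for this report.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/silence_data.json",
        "--findings", "data-trove-world-languages-endangerment-silence.json",
        "--llm", "anthropic:default",
    ],
    check=True,  # raise CalledProcessError if the command exits non-zero
)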

Summary confidence: high

This dataset catalogues approximately 7,000 world languages, each with a name, geographic coordinates, speaker population, and endangerment status. The most striking finding is the extreme inequality in speaker populations: the median language has only 11,000 speakers while the maximum reaches nearly 1 billion, with over 16% of languages flagged as outliers — a classic long-tail distribution reflecting how a handful of dominant languages vastly outnumber the rest. Equally notable is the endangerment picture: while 44% of languages are classified as 'safe', a substantial share face real risk — 1,753 are 'definitely endangered', 327 are 'critically endangered', and 219 are already extinct. Top words in language names include 'sign', 'zapotec', 'mixtec', and directional qualifiers like 'southern' and 'northern', hinting at rich dialect clustering worth exploring geographically.

citing: row_count · column_count · stats.median · stats.max · stats.outlier_rate · top_values · top_words · stats.zero_rate

Out[4]:

saturn.schema() · 6 columns

column kind n null% unique alerts
n text 6,998 0.0% 6,998 near_unique one_word
lat numeric 6,998 0.0% 4,048
lng numeric 6,998 0.0% 5,560
p numeric 6,998 0.0% 1,627 high_skew outliers
s numeric 6,998 0.0% 6
ss categorical 6,998 0.0% 7
Fig 1.
ss · Compare the 'safe' plurality (43.9%) with the combined endangered and extinct categories: together they cover about 55% of all languages, so more than half are at some level of risk or already extinct.
Top values for ss (7 unique shown, of 7 total).
value                  count  share
safe                    3074  43.9%
definitely endangered   1753  25.1%
vulnerable              1160  16.6%
severely endangered      374   5.3%
critically endangered    327   4.7%
extinct                  219   3.1%
unknown                   91   1.3%
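
The shares in the table above can be recomputed from the raw counts. The following is a minimal sketch that hard-codes the counts from the ss table; grouping everything that is neither 'safe' nor 'unknown' as "at risk or extinct" is an assumption made here for illustration, not a category defined by saturn.

```python
# Counts copied from the "Top values for ss" table.
ss_counts = {
    "safe": 3074,
    "definitely endangered": 1753,
    "vulnerable": 1160,
    "severely endangered": 374,
    "critically endangered": 327,
    "extinct": 219,
    "unknown": 91,
}

total = sum(ss_counts.values())

# Assumption: every status other than 'safe' and 'unknown' counts as at risk or extinct.
at_risk = total - ss_counts["safe"] - ss_counts["unknown"]

for status, count in ss_counts.items():
    print(f"{status:<22} {count:>5} {count / total:6.1%}")
print(f"{'at risk or extinct':<22} {at_risk:>5} {at_risk / total:6.1%}")
```

With these counts the combined group comes to 3,833 of 6,998 rows, a little under 55%.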
Fig 2.
p · Expect a heavily right-skewed distribution: 6,943 of the 6,998 languages fall in the first bin (fewer than about 24 million speakers), while a handful reach hundreds of millions.
Histogram bins for p (median: 11000.0).
bin                      count
0 – 2.411e+07             6943
2.411e+07 – 4.823e+07       28
4.823e+07 – 7.234e+07        5
7.234e+07 – 9.646e+07       12
9.646e+07 – 1.206e+08        0
1.206e+08 – 1.447e+08        3
1.447e+08 – 1.688e+08        1
1.688e+08 – 1.929e+08        0
1.929e+08 – 2.17e+08         0
2.17e+08 – 2.411e+08         2
2.411e+08 – 2.653e+08        0
2.653e+08 – 2.894e+08        0
2.894e+08 – 3.135e+08        0
3.135e+08 – 3.376e+08        0
3.376e+08 – 3.617e+08        0
3.617e+08 – 3.858e+08        1
3.858e+08 – 4.099e+08        0
4.099e+08 – 4.34e+08         0
4.34e+08 – 4.582e+08         1
4.582e+08 – 4.823e+08        0
4.823e+08 – 5.064e+08        0
5.064e+08 – 5.305e+08        0
5.305e+08 – 5.546e+08        0
5.546e+08 – 5.787e+08        0
5.787e+08 – 6.028e+08        0
6.028e+08 – 6.27e+08         0
6.27e+08 – 6.511e+08         0
6.511e+08 – 6.752e+08        0
6.752e+08 – 6.993e+08        0
6.993e+08 – 7.234e+08        1
7.234e+08 – 7.475e+08        0
7.475e+08 – 7.716e+08        0
7.716e+08 – 7.958e+08        0
7.958e+08 – 8.199e+08        0
8.199e+08 – 8.44e+08         0
8.44e+08 – 8.681e+08         0
8.681e+08 – 8.922e+08        0
8.922e+08 – 9.163e+08        0
9.163e+08 – 9.404e+08        0
9.404e+08 – 9.646e+08        1
Fig 3.
s · This numeric status score (0–5) shows how languages rank on a vitality scale — look for the concentration at the higher (safer) end versus the tail of critically low scores.
Histogram bins for s (median: 4.0).
bin             count
0 – 0.125         250
0.125 – 0.25        0
0.25 – 0.375        0
0.375 – 0.5         0
0.5 – 0.625         0
0.625 – 0.75        0
0.75 – 0.875        0
0.875 – 1           0
1 – 1.125         328
1.125 – 1.25        0
1.25 – 1.375        0
1.375 – 1.5         0
1.5 – 1.625         0
1.625 – 1.75        0
1.75 – 1.875        0
1.875 – 2           0
2 – 2.125         383
2.125 – 2.25        0
2.25 – 2.375        0
2.375 – 2.5         0
2.5 – 2.625         0
2.625 – 2.75        0
2.75 – 2.875        0
2.875 – 3           0
3 – 3.125        1768
3.125 – 3.25        0
3.25 – 3.375        0
3.375 – 3.5         0
3.5 – 3.625         0
3.625 – 3.75        0
3.75 – 3.875        0
3.875 – 4           0
4 – 4.125        1176
4.125 – 4.25        0
4.25 – 4.375        0
4.375 – 4.5         0
4.5 – 4.625         0
4.625 – 4.75        0
4.75 – 4.875        0
4.875 – 5        3093
Fig 4.
lat · The distribution of latitudes skews toward tropical and equatorial regions, revealing where the world's linguistic diversity is geographically concentrated.
Histogram bins for lat (median: 6.37).
bin                 count
-55.27 – -52.06         1
-52.06 – -48.85         1
-48.85 – -45.64         1
-45.64 – -42.43         0
-42.43 – -39.22         2
-39.22 – -36.01         5
-36.01 – -32.8          9
-32.8 – -29.59         13
-29.59 – -26.38        23
-26.38 – -23.17        48
-23.17 – -19.96       100
-19.96 – -16.75       101
-16.75 – -13.54       217
-13.54 – -10.33       202
-10.33 – -7.116       461
-7.116 – -3.906       749
-3.906 – -0.6958      640
-0.6958 – 2.514       344
2.514 – 5.725         434
5.725 – 8.935         609
8.935 – 12.15         676
12.15 – 15.36         264
15.36 – 18.57         368
18.57 – 21.78         210
21.78 – 24.99         282
24.99 – 28.2          320
28.2 – 31.41          131
31.41 – 34.62         124
34.62 – 37.83         148
37.83 – 41.04          81
41.04 – 44.25         107
44.25 – 47.46          65
47.46 – 50.67          68
50.67 – 53.88          68
53.88 – 57.09          35
57.09 – 60.3           20
60.3 – 63.51           34
63.51 – 66.72          18
66.72 – 69.93          13
69.93 – 73.14           6
Fig 5.
n · Most language names are short single words, but a long tail of compound names (up to 43 characters) reflects complex regional dialect naming conventions.
Character-length distribution for n (mean: 8.972).
chars      count
1 – 2         25
2 – 3        193
3 – 4        748
4 – 5       1095
5 – 6       1082
6 – 7        833
7 – 8        521
8 – 9        333
9 – 10       271
10 – 12      231
12 – 13      208
13 – 14      213
14 – 15      188
15 – 16      199
16 – 17      144
17 – 18       91
18 – 19       75
19 – 20       68
20 – 21       63
21 – 22       69
22 – 23      150
23 – 24       48
24 – 25       26
25 – 26       30
26 – 27       21
27 – 28        8
28 – 29        7
29 – 30       10
30 – 31       14
31 – 32        6
32 – 34        8
34 – 35        2
35 – 36       10
36 – 37        1
37 – 38        4
38 – 39        0
39 – 40        1
40 – 41        0
41 – 42        1
42 – 43        1
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column  kind         null %
n       text         0.0%
lat     numeric      0.0%
lng     numeric      0.0%
p       numeric      0.0%
s       numeric      0.0%
ss      categorical  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
      lat    lng    p      s
lat   +1.00  -0.35  +0.13  -0.09
lng   -0.35  +1.00  +0.00  +0.16
p     +0.13  +0.00  +1.00  +0.09
s     -0.09  +0.16  +0.09  +1.00
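
For readers who want to reproduce an entry of this matrix, the following is a minimal sketch of the Pearson coefficient in plain Python. The report says saturn computes it on a sample but does not describe the sampling, so this version simply uses every value it is given; the example rows are hypothetical and are not taken from silence_data.json.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (lat, s) pairs for illustration only.
lat = [-6.2, 3.5, 9.1, 17.0, 45.3]
s = [5, 3, 4, 5, 1]
print(f"{pearson(lat, s):+.2f}")
```

Because p is extremely skewed, a coefficient involving p is dominated by a few very large values; a rank-based measure may be worth comparing against.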

n text label

This column appears to be a name field for human languages or dialects — top words include 'language', 'sign', 'Zapotec', 'Mixtec', and directional qualifiers like 'southern', 'northern', 'western', 'eastern', 'central', consistent with a linguistic taxonomy dataset. All 6,998 rows are unique with zero duplicates and zero nulls, confirming it functions as a label or identifier. The majority of values (73%) are single words, but multi-word entries push the mean length to ~8.97 characters and mean word count to ~1.37, reflecting compound names like 'Southern Zapotec'. The vocabulary size (7,003) slightly exceeding unique row count (6,998) is unremarkable given tokenization.

Treatment: Use as a human-readable label; encode as a categorical ID or embed via a language-name lookup for modelling.

anthropic:default · confidence high
Out[13]:

saturn.columns["n"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 6,998
len_min 1
len_max 43
len_mean 8.972
len_median 7
len_p95 21
word_mean 1.369
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 7,003
readability_flesch_mean 57.13
emoji_rate 0
url_rate 0
one_word_rate 0.7314
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 73.1% of rows are a single word
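
Several of the stats above are simple per-string summaries. The sketch below shows one plausible way to compute len_mean, word_mean and one_word_rate; splitting words on whitespace is an assumption, since the report does not describe saturn's tokenizer. The sample names are hypothetical apart from 'Southern Zapotec', which the prose above quotes.

```python
def label_stats(names):
    """Length and word-count summaries for a list of non-empty name strings."""
    lengths = [len(name) for name in names]
    # Assumption: words are separated by whitespace.
    word_counts = [len(name.split()) for name in names]
    n = len(names)
    return {
        "len_mean": sum(lengths) / n,
        "word_mean": sum(word_counts) / n,
        "one_word_rate": sum(1 for w in word_counts if w == 1) / n,
    }

# Hypothetical sample, not rows from silence_data.json.
sample = ["Southern Zapotec", "Mixtec", "Basque", "Northern Sami"]
print(label_stats(sample))
```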
Fig 8.
Character-length distribution for n.

lat numeric feature

This column contains geographic latitude values, spanning from -55.27° (southern South America) to 73.14° (Arctic latitudes). Values concentrate in tropical and subtropical regions (median 6.37°, Q1 -4.65°, Q3 18.29°), so half of all records fall within a band about 23° wide. The 4,048 unique values across 6,998 rows are consistent with coordinates stored at roughly 2 decimal places, with some locations repeated. Skew is mildly positive (0.697) and kurtosis is close to mesokurtic (0.477), which fits a central equatorial mass with a longer tail toward northern latitudes; this pattern is typical of datasets heavy in African, South/Southeast Asian, or Latin American records. Only 149 outliers (~2.1%) were flagged. Under the 1.5 IQR rule named in the p alert, these are latitudes below about -39.1° or above about 52.7°, so most of them are high-latitude locations, likely in Europe or North America, plus a few far-southern points.

Treatment: Pair with a longitude column for geospatial analysis; consider binning into regions or using as-is in spatial models — do not normalize without care for geographic interpretation.

anthropic:default · confidence high
Out[16]:

saturn.columns["lat"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 4,048
min -55.27
max 73.14
mean 8.437
median 6.37
std 18
q1 -4.65
q3 18.29
iqr 22.94
skew 0.6975
kurtosis 0.4773
n_outliers 149
outlier_rate 0.02129
zero_rate 0
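
The n_outliers figure can be related to the quartiles above. The p alert in this report describes outliers as rows "beyond 1.5 IQR"; the sketch below assumes the same rule applies to every numeric column and derives the fences from the published q1 and q3 values. The function names are illustrative and are not part of saturn.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, low, high):
    """True when value lies strictly outside the fences."""
    return value < low or value > high

# Quartiles copied from saturn.columns["lat"].stats.
lat_low, lat_high = iqr_fences(-4.65, 18.29)
print(f"lat fences: {lat_low:.2f} to {lat_high:.2f}")

# Quartiles copied from saturn.columns["lng"].stats.
lng_low, lng_high = iqr_fences(8.282, 123.9)
print(f"lng fences: {lng_low:.1f} to {lng_high:.1f}")
```

For lat the fences land near -39.06 and 52.70. For lng the upper fence lies beyond 180, so under this rule only far-western longitudes (below about -165) can be flagged, which agrees with the small outlier count reported for that column.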
Fig 9.
Distribution of lat. Vertical dash marks the median.

lng numeric feature

This column contains geographic longitude values, spanning from -178.78 to 179.31, nearly the full valid range of -180 to 180 degrees. The mean (52.46) and median (47.65) both fall at positive (eastern) longitudes, suggesting that the dataset has a higher concentration of locations in Europe, Asia, or Africa than in the Americas. The IQR of 115.65 and negative kurtosis (-0.67) indicate a broadly spread, relatively flat distribution across the globe, with only 12 outliers flagged.

Treatment: Use as-is for geospatial modelling; consider pairing with latitude and projecting to a coordinate system, or binning into geographic regions as a categorical feature.

anthropic:default · confidence high
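
One caveat when using lng as-is is that -180 and 180 describe the same meridian, so raw values place neighbouring locations at opposite ends of the scale. A common workaround, sketched below under the assumption of a spherical Earth, is to map each (lat, lng) pair to a point on the unit sphere before feeding it to a distance-based model. The function names are illustrative, and placing the two longitude extremes on the equator is purely for demonstration.

```python
import math

def to_unit_vector(lat_deg, lng_deg):
    """Map latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lng = math.radians(lng_deg)
    return (
        math.cos(lat) * math.cos(lng),
        math.cos(lat) * math.sin(lng),
        math.sin(lat),
    )

def chord_distance(a, b):
    """Straight-line distance between two points on the unit sphere."""
    return math.dist(a, b)

# The lng min and max from the stats table, both placed on the equator.
west = to_unit_vector(0.0, -178.8)
east = to_unit_vector(0.0, 179.3)

# Raw longitudes differ by 358.1 degrees, yet the points are close on the sphere.
print(f"raw difference: {179.3 - (-178.8):.1f} degrees")
print(f"chord distance: {chord_distance(west, east):.4f}")
```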
Out[19]:

saturn.columns["lng"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 5,560
min -178.8
max 179.3
mean 52.46
median 47.65
std 79.67
q1 8.282
q3 123.9
iqr 115.7
skew -0.4982
kurtosis -0.673
n_outliers 12
outlier_rate 0.001715
zero_rate 0
Fig 10.
Distribution of lng. Vertical dash marks the median.
Histogram bins for lng (median: 47.65).
bin                 count
-178.8 – -169.8        11
-169.8 – -160.9         4
-160.9 – -151.9         9
-151.9 – -143          11
-143 – -134            10
-134 – -125.1          15
-125.1 – -116.1        94
-116.1 – -107.2        37
-107.2 – -98.21        71
-98.21 – -89.26       257
-89.26 – -80.31        50
-80.31 – -71.35       176
-71.35 – -62.4        150
-62.4 – -53.45        112
-53.45 – -44.5         48
-44.5 – -35.54         21
-35.54 – -26.59         0
-26.59 – -17.64         3
-17.64 – -8.687        95
-8.687 – 0.265        261
0.265 – 9.217         420
9.217 – 18.17         687
18.17 – 27.12         297
27.12 – 36.07         402
36.07 – 45.03         211
45.03 – 53.98         113
53.98 – 62.93          27
62.93 – 71.88          73
71.88 – 80.84         212
80.84 – 89.79         191
89.79 – 98.74         242
98.74 – 107.7         386
107.7 – 116.6         202
116.6 – 125.6         437
125.6 – 134.5         266
134.5 – 143.5         512
143.5 – 152.5         597
152.5 – 161.4         106
161.4 – 170.4         170
170.4 – 179.3          12

p numeric feature

Column 'p' is a numeric count with extreme positive skew (skew = 39.45, kurtosis = 1870.01); the overview identifies it as speaker population. The median is 11,000 while the mean is 1,157,741, a gap of roughly 105×, driven by a long upper tail that reaches 964,553,200, with 1,159 outliers (16.6% of rows). An additional 13.5% of values (about 942 rows) are exactly zero, suggesting a two-population distribution (zero vs. non-zero values) that may require separate treatment. Only 219 rows carry the status 'extinct', so extinction alone cannot account for every zero.

Treatment: Separate zero and non-zero records, then log1p-transform the non-zero values before modelling; investigate whether zeros are true zeros or missing-data proxies.

anthropic:default · confidence medium
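
The treatment note can be sketched in a few lines. This is a minimal illustration, assuming p is held as a plain list of non-negative numbers; the helper name is hypothetical, and the sample values are the min, q1, median, q3 and max quoted in this section rather than actual rows.

```python
import math

def split_and_log(values):
    """Count zeros and log1p-transform the strictly positive values."""
    if any(v < 0 for v in values):
        raise ValueError("expected non-negative values")
    n_zero = sum(1 for v in values if v == 0)
    logged = [math.log1p(v) for v in values if v > 0]
    return n_zero, logged

# Summary values quoted for p: min, q1, median, q3, max.
p_sample = [0, 1_200, 11_000, 83_000, 964_553_200]
n_zero, logged = split_and_log(p_sample)
print(n_zero, [round(x, 2) for x in logged])
```

On the log scale the maximum sits only about three times higher than q1, which is why the transform tames the long tail. Whether the zeros should be modelled as a separate indicator feature depends on what they turn out to mean.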
Out[22]:

saturn.columns["p"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 1,627
min 0
max 9.646e+08
mean 1.158e+06
median 11,000
std 1.73e+07
q1 1,200
q3 83,000
iqr 81,800
skew 39.45
kurtosis 1870
n_outliers 1,159
outlier_rate 0.1656
zero_rate 0.1346
alert: high_skew skew=+39.45
alert: outliers 16.6% of rows beyond 1.5 IQR
Fig 11.
Distribution of p. Vertical dash marks the median.

s numeric feature

This column is almost certainly an ordinal score with exactly 6 discrete integer values ranging from 0 to 5; the Fig 3 caption reads it as a vitality scale on which higher means safer. The distribution is notably left-skewed (skew = -1.04), meaning high scores (4–5) dominate: they account for 61% of rows, the median is 4.0, and Q3 is 5.0. An analyst expecting a balanced scale would find this surprising. Only 3.6% of rows are zero, suggesting the lowest value is rare rather than a default or sentinel.

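Treatment: Treat as ordinal. Keep it as an integer when the order should be preserved, or one-hot encode it when the model should not assume equal spacing between levels; note that most rows sit at 4 or 5, so the low scores have comparatively few examples.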

anthropic:default · confidence high
Out[25]:

saturn.columns["s"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 6
min 0
max 5
mean 3.796
median 4
std 1.366
q1 3
q3 5
iqr 2
skew -1.041
kurtosis 0.4154
n_outliers 0
outlier_rate 0
zero_rate 0.03572
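
The six non-empty levels of s rank in the same order as the six ordered categories of ss, which suggests the two columns may encode the same status. The sketch below tests that idea on the published counts only. The score-to-label mapping is an assumption made here, not something the report states, and a row-level cross-tabulation would be needed to confirm it.

```python
# Non-empty histogram bins for s and top-value counts for ss, from this report.
s_counts = {0: 250, 1: 328, 2: 383, 3: 1768, 4: 1176, 5: 3093}
ss_counts = {
    "safe": 3074,
    "definitely endangered": 1753,
    "vulnerable": 1160,
    "severely endangered": 374,
    "critically endangered": 327,
    "extinct": 219,
    "unknown": 91,
}

# Assumed mapping, suggested only by the matching rank order of the counts.
assumed_label = {
    5: "safe",
    4: "vulnerable",
    3: "definitely endangered",
    2: "severely endangered",
    1: "critically endangered",
    0: "extinct",
}

# Rows at each score beyond what the matching label accounts for.
surplus = {score: s_counts[score] - ss_counts[label]
           for score, label in assumed_label.items()}
print(surplus)
print(sum(surplus.values()), ss_counts["unknown"])
```

Under this mapping every surplus is non-negative and the surpluses sum to 91, the number of 'unknown' rows. That is consistent with 'unknown' rows carrying a score in s, but it does not prove it.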
Fig 12.
Distribution of s. Vertical dash marks the median.

ss categorical label

This column encodes a conservation or endangerment status classification with 7 distinct categories: six ordered levels from 'safe' through 'extinct', plus 'unknown'. The dominant class is 'safe' at 43.9% of 6,998 rows, while 'extinct' accounts for 219 records (3.1%), a meaningful but minority signal. An entropy ratio of 0.756 indicates a reasonable spread across categories, though the distribution is weighted toward safer statuses: 'safe' and 'vulnerable' together cover 60.5% of rows. No nulls are present, and the label set is clean with no obvious noise, although the 91 'unknown' rows are best read as missing values.

Treatment: Ordinal-encode in conservation-severity order (safe < vulnerable < definitely endangered < severely endangered < critically endangered < extinct; treat 'unknown' as missing) before modelling.

anthropic:default · confidence high
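
The ordering in the treatment note translates directly into a lookup table. The following is a minimal sketch; the names SEVERITY_ORDER and encode_status are illustrative, and returning None for 'unknown' is one way to represent "treat as missing".

```python
# Severity order as given in the treatment note, least to most severe.
SEVERITY_ORDER = [
    "safe",
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
RANK = {label: rank for rank, label in enumerate(SEVERITY_ORDER)}

def encode_status(label):
    """Return the severity rank (0 = safe, 5 = extinct), or None for 'unknown'."""
    if label == "unknown":
        return None
    # A KeyError here signals a label outside the seven documented values.
    return RANK[label]

print([encode_status(x) for x in ["safe", "extinct", "unknown"]])
```

Note that this rank runs in the opposite direction to the s column as the Fig 3 caption describes it, where higher values are safer.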
Out[28]:

saturn.columns["ss"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 7
top_value safe
top_rate 0.4393
cardinality 7
entropy 2.122
entropy_ratio 0.7557
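
The entropy figures above can be reproduced from the category counts. The sketch below assumes Shannon entropy in bits, with the ratio taken against the maximum possible entropy for 7 categories, log2(7); the report does not state the formula, but these assumptions match the published values to the precision shown.

```python
import math

def entropy_bits(counts):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Counts from the top-values table for ss.
ss_counts = [3074, 1753, 1160, 374, 327, 219, 91]

h = entropy_bits(ss_counts)
# Normalise by the entropy of a uniform distribution over the same categories.
ratio = h / math.log2(len(ss_counts))
print(f"entropy={h:.3f} ratio={ratio:.4f}")
```

A ratio of 1.0 would mean all seven statuses are equally common; a ratio near 0 would mean a single status dominates almost completely.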
Fig 13.
Top values for ss.

How to cite


BibTeX
@misc{saturn-data-trove-world-languages-endangerment-silence-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove world languages endangerment silence},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-world-languages-endangerment-silence}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove world languages endangerment silence. Source: /home/coolhand/html/datavis/data_trove/data/quirky/silence_data.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-world-languages-endangerment-silence