saturn·

quirky silence data

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/silence_data.json

Saturn profiled 6,998 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
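# Install the profiler, then invoke its CLI on the source file.
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/silence_data.json",
    "--findings", "quirky-silence_data.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise if the CLI exits with a non-zero status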

Summary confidence: high

This dataset catalogs 6,998 world languages, each with a name (n), speaker population (p), geographic coordinates (lat/lng), a numeric score (s), and an endangerment status (ss). The most striking feature is the extreme skew in speaker population: the median language has just 11,000 speakers but the mean is over 1.1 million, with a max near 965 million and roughly 13% of entries showing zero speakers — a tell-tale signature of dying or dormant languages. The endangerment status field is also worth a close look: only about 44% of languages are 'safe', while the remaining categories span from 'vulnerable' all the way to 'extinct' (219 cases). Geography is broadly distributed (latitude centered near the tropics, longitude spanning the globe), so the dataset supports both statistical and map-based exploration.

citing: row_count · columns.p.stats.median · columns.p.stats.mean · columns.p.stats.max · columns.p.stats.zero_rate · columns.p.stats.skew · columns.ss.top_rate · columns.ss.top_values · columns.lat.stats.median · columns.lng.stats.median · columns.s.n_unique
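
The citation keys above read as dotted paths into the findings file. A minimal sketch of resolving them, assuming the findings JSON is a nested object whose keys mirror those paths. The report does not show the file's layout, so both the structure and the `resolve` helper are hypothetical:

```python
from functools import reduce

def resolve(findings: dict, dotted_path: str):
    """Walk a nested dict along a dotted path such as 'columns.p.stats.median'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), findings)

# Hand-built stand-in for the findings file, filled with values quoted in this report.
findings = {
    "row_count": 6998,
    "columns": {
        "p": {"stats": {"median": 11000, "zero_rate": 0.1346, "skew": 39.45}},
        "ss": {"top_rate": 0.4393},
    },
}

for path in ("row_count", "columns.p.stats.median", "columns.ss.top_rate"):
    print(path, "=", resolve(findings, path))
```

A missing key raises `KeyError`, which is probably the right behaviour when checking that a cited stat really exists.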

Out[4]:

saturn.schema() · 6 columns

column kind n null% unique alerts
n text 6,998 0.0% 6,998 near_unique one_word
lat numeric 6,998 0.0% 4,048
lng numeric 6,998 0.0% 5,560
p numeric 6,998 0.0% 1,627 high_skew outliers
s numeric 6,998 0.0% 6
ss categorical 6,998 0.0% 7
Fig 1.
ss · How many languages fall into each endangerment status — note that less than half are 'safe'.
Top values for ss (7 unique shown, of 7 total).
value  count  share
safe  3074  43.9%
definitely endangered  1753  25.1%
vulnerable  1160  16.6%
severely endangered  374  5.3%
critically endangered  327  4.7%
extinct  219  3.1%
unknown  91  1.3%
Fig 2.
p · Speaker population is extremely right-skewed; expect a long tail and a large spike at zero.
Histogram bins for p (median: 11000.0).
bin  count
0 – 2.411e+07  6943
2.411e+07 – 4.823e+07  28
4.823e+07 – 7.234e+07  5
7.234e+07 – 9.646e+07  12
9.646e+07 – 1.206e+08  0
1.206e+08 – 1.447e+08  3
1.447e+08 – 1.688e+08  1
1.688e+08 – 1.929e+08  0
1.929e+08 – 2.17e+08  0
2.17e+08 – 2.411e+08  2
2.411e+08 – 2.653e+08  0
2.653e+08 – 2.894e+08  0
2.894e+08 – 3.135e+08  0
3.135e+08 – 3.376e+08  0
3.376e+08 – 3.617e+08  0
3.617e+08 – 3.858e+08  1
3.858e+08 – 4.099e+08  0
4.099e+08 – 4.34e+08  0
4.34e+08 – 4.582e+08  1
4.582e+08 – 4.823e+08  0
4.823e+08 – 5.064e+08  0
5.064e+08 – 5.305e+08  0
5.305e+08 – 5.546e+08  0
5.546e+08 – 5.787e+08  0
5.787e+08 – 6.028e+08  0
6.028e+08 – 6.27e+08  0
6.27e+08 – 6.511e+08  0
6.511e+08 – 6.752e+08  0
6.752e+08 – 6.993e+08  0
6.993e+08 – 7.234e+08  1
7.234e+08 – 7.475e+08  0
7.475e+08 – 7.716e+08  0
7.716e+08 – 7.958e+08  0
7.958e+08 – 8.199e+08  0
8.199e+08 – 8.44e+08  0
8.44e+08 – 8.681e+08  0
8.681e+08 – 8.922e+08  0
8.922e+08 – 9.163e+08  0
9.163e+08 – 9.404e+08  0
9.404e+08 – 9.646e+08  1
Fig 3.
lat · Latitude distribution shows where in the world languages cluster, weighted toward the tropics and northern hemisphere.
Histogram bins for lat (median: 6.37).
bin  count
-55.27 – -52.06  1
-52.06 – -48.85  1
-48.85 – -45.64  1
-45.64 – -42.43  0
-42.43 – -39.22  2
-39.22 – -36.01  5
-36.01 – -32.8  9
-32.8 – -29.59  13
-29.59 – -26.38  23
-26.38 – -23.17  48
-23.17 – -19.96  100
-19.96 – -16.75  101
-16.75 – -13.54  217
-13.54 – -10.33  202
-10.33 – -7.116  461
-7.116 – -3.906  749
-3.906 – -0.6958  640
-0.6958 – 2.514  344
2.514 – 5.725  434
5.725 – 8.935  609
8.935 – 12.15  676
12.15 – 15.36  264
15.36 – 18.57  368
18.57 – 21.78  210
21.78 – 24.99  282
24.99 – 28.2  320
28.2 – 31.41  131
31.41 – 34.62  124
34.62 – 37.83  148
37.83 – 41.04  81
41.04 – 44.25  107
44.25 – 47.46  65
47.46 – 50.67  68
50.67 – 53.88  68
53.88 – 57.09  35
57.09 – 60.3  20
60.3 – 63.51  34
63.51 – 66.72  18
66.72 – 69.93  13
69.93 – 73.14  6
Fig 4.
s · The 's' score takes only six integer values from 0 to 5; check which level dominates.
Histogram bins for s (median: 4.0).
bin  count
0 – 0.125  250
0.125 – 0.25  0
0.25 – 0.375  0
0.375 – 0.5  0
0.5 – 0.625  0
0.625 – 0.75  0
0.75 – 0.875  0
0.875 – 1  0
1 – 1.125  328
1.125 – 1.25  0
1.25 – 1.375  0
1.375 – 1.5  0
1.5 – 1.625  0
1.625 – 1.75  0
1.75 – 1.875  0
1.875 – 2  0
2 – 2.125  383
2.125 – 2.25  0
2.25 – 2.375  0
2.375 – 2.5  0
2.5 – 2.625  0
2.625 – 2.75  0
2.75 – 2.875  0
2.875 – 3  0
3 – 3.125  1768
3.125 – 3.25  0
3.25 – 3.375  0
3.375 – 3.5  0
3.5 – 3.625  0
3.625 – 3.75  0
3.75 – 3.875  0
3.875 – 4  0
4 – 4.125  1176
4.125 – 4.25  0
4.25 – 4.375  0
4.375 – 4.5  0
4.5 – 4.625  0
4.625 – 4.75  0
4.75 – 4.875  0
4.875 – 5  3093
Fig 5.
lng · Longitude spread reveals the global reach of the dataset across continents.
Histogram bins for lng (median: 47.65).
bin  count
-178.8 – -169.8  11
-169.8 – -160.9  4
-160.9 – -151.9  9
-151.9 – -143  11
-143 – -134  10
-134 – -125.1  15
-125.1 – -116.1  94
-116.1 – -107.2  37
-107.2 – -98.21  71
-98.21 – -89.26  257
-89.26 – -80.31  50
-80.31 – -71.35  176
-71.35 – -62.4  150
-62.4 – -53.45  112
-53.45 – -44.5  48
-44.5 – -35.54  21
-35.54 – -26.59  0
-26.59 – -17.64  3
-17.64 – -8.687  95
-8.687 – 0.265  261
0.265 – 9.217  420
9.217 – 18.17  687
18.17 – 27.12  297
27.12 – 36.07  402
36.07 – 45.03  211
45.03 – 53.98  113
53.98 – 62.93  27
62.93 – 71.88  73
71.88 – 80.84  212
80.84 – 89.79  191
89.79 – 98.74  242
98.74 – 107.7  386
107.7 – 116.6  202
116.6 – 125.6  437
125.6 – 134.5  266
134.5 – 143.5  512
143.5 – 152.5  597
152.5 – 161.4  106
161.4 – 170.4  170
170.4 – 179.3  12
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column  kind  null %
n  text  0.0%
lat  numeric  0.0%
lng  numeric  0.0%
p  numeric  0.0%
s  numeric  0.0%
ss  categorical  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 4 numeric columns (values clipped to 2 decimals).
     lat    lng    p      s
lat  +1.00  -0.35  +0.13  -0.09
lng  -0.35  +1.00  +0.00  +0.16
p    +0.13  +0.00  +1.00  +0.09
s    -0.09  +0.16  +0.09  +1.00
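
To read the matrix programmatically, the sketch below copies the coefficients from the correlation table and picks the strongest off-diagonal pair. The variable names are illustrative; only the numbers come from the report.

```python
from itertools import combinations

cols = ["lat", "lng", "p", "s"]
# Pearson coefficients copied from the table above (row order matches cols).
corr = [
    [1.00, -0.35, 0.13, -0.09],
    [-0.35, 1.00, 0.00, 0.16],
    [0.13, 0.00, 1.00, 0.09],
    [-0.09, 0.16, 0.09, 1.00],
]

# Map each unordered column pair to its coefficient, then rank by magnitude.
pairs = {
    (cols[i], cols[j]): corr[i][j]
    for i, j in combinations(range(len(cols)), 2)
}
strongest = max(pairs, key=lambda pair: abs(pairs[pair]))
print(strongest, pairs[strongest])
```

Even the strongest pair (lat and lng) is a weak-to-moderate linear relationship, so none of the numeric columns looks redundant with another.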

n text identifier

Column 'n' holds short text labels, one per row, with all 6998 values unique and no nulls. About 73% are single words (word_mean 1.37, len_mean ~9), and the top tokens — 'language', 'sign', 'zapotec', 'mixtec', 'naga' plus directional modifiers — strongly suggest these are language names (e.g., 'Southern Zapotec', 'sign language' variants). Vocab_size (7003) slightly exceeds n, consistent with a small set of recurring qualifiers across otherwise unique names.

Treatment: Treat as a name key; left-join on this rather than using as a model feature.

anthropic:claude-opus-4-7 · confidence high
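
A minimal sketch of the left-join treatment, using only the standard library. The rows, the `extra` lookup table, and the `family` field are made-up placeholders for illustration, not values from the dataset:

```python
# Placeholder rows: the names and values are invented for illustration only.
languages = [
    {"n": "Alpha", "p": 100},
    {"n": "Beta", "p": 0},
]
# Hypothetical lookup table keyed by the same name column.
extra = {"Alpha": {"family": "X"}}

def left_join(rows, lookup, key="n"):
    """Keep every left row; attach lookup fields when the key matches."""
    return [{**row, **lookup.get(row[key], {})} for row in rows]

joined = left_join(languages, extra)
print(joined)
```

Because 'n' is unique per row (n_duplicates 0), a join keyed on it cannot duplicate rows on the left side.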
Out[13]:

saturn.columns["n"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 6,998
len_min 1
len_max 43
len_mean 8.972
len_median 7
len_p95 21
word_mean 1.369
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 7,003
readability_flesch_mean 57.13
emoji_rate 0
url_rate 0
one_word_rate 0.7314
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 73.1% rows are a single word
Fig 8.
Character-length distribution for n.
Character-length distribution for n (mean: 8.972).
chars  count
1 – 2  25
2 – 3  193
3 – 4  748
4 – 5  1095
5 – 6  1082
6 – 7  833
7 – 8  521
8 – 9  333
9 – 10  271
10 – 12  231
12 – 13  208
13 – 14  213
14 – 15  188
15 – 16  199
16 – 17  144
17 – 18  91
18 – 19  75
19 – 20  68
20 – 21  63
21 – 22  69
22 – 23  150
23 – 24  48
24 – 25  26
25 – 26  30
26 – 27  21
27 – 28  8
28 – 29  7
29 – 30  10
30 – 31  14
31 – 32  6
32 – 34  8
34 – 35  2
35 – 36  10
36 – 37  1
37 – 38  4
38 – 39  0
39 – 40  1
40 – 41  0
41 – 42  1
42 – 43  1

lat numeric feature

Column 'lat' holds latitude coordinates spanning -55.27 to 73.14, covering most inhabited latitudes from the sub-Antarctic to the high Arctic. The distribution leans north of the equator (mean 8.44, median 6.37) with mild positive skew (0.70). The 149 outliers (2.1%) are valid high-latitude points rather than strictly polar ones: assuming the same 1.5 × IQR rule cited in the alert for p, the fences fall near -39.1 and 52.7 degrees. With 4048 unique values across 6998 rows and no nulls, this is clean geospatial data ready for use.

Treatment: Pair with longitude for geospatial features; consider binning by hemisphere or climate zone.

anthropic:claude-opus-4-7 · confidence high
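
One way to bin by hemisphere and climate zone, as a sketch. The three-zone split and the boundary latitudes (about 23.44 degrees for the tropics, about 66.56 degrees for the polar circles) are assumptions chosen for illustration; the report does not prescribe a zoning scheme.

```python
def latitude_zone(lat: float) -> tuple[str, str]:
    """Return (hemisphere, zone) for a latitude in degrees."""
    TROPIC = 23.44  # approximate latitude of the Tropics of Cancer/Capricorn
    POLAR = 66.56   # approximate latitude of the Arctic/Antarctic Circles
    hemisphere = "N" if lat >= 0 else "S"
    magnitude = abs(lat)
    if magnitude <= TROPIC:
        zone = "tropical"
    elif magnitude < POLAR:
        zone = "temperate"
    else:
        zone = "polar"
    return hemisphere, zone

for lat in (-55.27, 6.37, 73.14):  # min, median, max from the stats
    print(lat, latitude_zone(lat))
```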
Out[16]:

saturn.columns["lat"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 4,048
min -55.27
max 73.14
mean 8.437
median 6.37
std 18
q1 -4.65
q3 18.29
iqr 22.94
skew 0.6975
kurtosis 0.4773
n_outliers 149
outlier_rate 0.02129
zero_rate 0
Fig 9.
Distribution of lat. Vertical dash marks the median.

lng numeric feature

Column 'lng' is a longitude feature spanning -178.78 to 179.31 across 6998 rows, with 5560 unique values and no nulls. The distribution is mildly left-skewed (-0.498) with a median of 47.65 and a middle 50% running from 8.28 to 123.94, which indicates an Eastern-Hemisphere bias rather than a uniform global spread. Only 12 outliers (0.17%) are flagged, as expected for a bounded geographic range.

Treatment: Pair with latitude as a geospatial coordinate; consider cyclical encoding or geohashing rather than treating as a plain scalar.

anthropic:claude-opus-4-7 · confidence high
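
Cyclical encoding can be sketched as a sine/cosine pair, so that longitudes on either side of the antimeridian end up close together. The helper name `encode_longitude` is illustrative:

```python
import math

def encode_longitude(lng_deg: float) -> tuple[float, float]:
    """Map longitude in degrees to (sin, cos) so that -180 and 180 coincide."""
    radians = math.radians(lng_deg)
    return math.sin(radians), math.cos(radians)

west = encode_longitude(-178.78)  # column minimum
east = encode_longitude(179.31)   # column maximum
gap = math.dist(west, east)       # Euclidean distance in the encoded space
print(round(gap, 4))
```

As plain scalars the two extremes sit about 358 degrees apart; in the encoded space they are near neighbours, which matches their real separation of under 2 degrees.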
Out[19]:

saturn.columns["lng"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 5,560
min -178.8
max 179.3
mean 52.46
median 47.65
std 79.67
q1 8.282
q3 123.9
iqr 115.7
skew -0.4982
kurtosis -0.673
n_outliers 12
outlier_rate 0.001715
zero_rate 0
Fig 10.
Distribution of lng. Vertical dash marks the median.

p numeric feature

Column 'p' is a heavily right-skewed numeric value spanning 0 to 964,553,200 with a median of just 11,000. The overview identifies it as speaker population, so it is a count rather than a monetary amount or size metric. The skew of 39.45 and kurtosis of 1870 are extreme, with 16.56% of rows flagged as outliers and 13.46% sitting at exactly zero. The mean (1,157,741) sits more than 100x above the median, so summary averages will be misleading.

Treatment: Apply a log1p transform before modelling and consider a separate zero-vs-nonzero indicator.

anthropic:claude-opus-4-7 · confidence high
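
A minimal sketch of the suggested treatment, pairing a log1p transform with a zero indicator. The function name `transform_population` is illustrative; the sample inputs are the min, q1, median, q3 and max reported for p.

```python
import math

def transform_population(p: float) -> tuple[float, int]:
    """Return (log1p(p), is_zero) for one speaker count."""
    if p < 0:
        raise ValueError("population cannot be negative")
    return math.log1p(p), int(p == 0)

# min, q1, median, q3 and max of p as reported in this section.
for value in (0, 1_200, 11_000, 83_000, 964_553_200):
    logged, is_zero = transform_population(value)
    print(f"{value:>11,} -> log1p={logged:6.2f} zero={is_zero}")
```

Because log1p(0) is 0, the zero rows remain valid inputs; the indicator lets a model treat them as a separate regime instead of as the low end of a continuous scale.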
Out[22]:

saturn.columns["p"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 1,627
min 0
max 9.646e+08
mean 1.158e+06
median 11,000
std 1.73e+07
q1 1,200
q3 83,000
iqr 81,800
skew 39.45
kurtosis 1870
n_outliers 1,159
outlier_rate 0.1656
zero_rate 0.1346
alert: high_skew skew=+39.45
alert: outliers 16.6% rows beyond 1.5 IQR
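
The outlier alert can be made concrete by computing the fences from the quartiles. This sketch assumes saturn applies the conventional Tukey rule (q1 - 1.5 × IQR, q3 + 1.5 × IQR); the report only says "beyond 1.5 IQR", so the exact rule is an assumption.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(q1=1_200, q3=83_000)  # quartiles of p from the stats
print(lower, upper)
```

Since p has a minimum of 0, the negative lower fence is never crossed. Under this assumption every flagged row is a language with more than about 205,700 speakers.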
Fig 11.
Distribution of p. Vertical dash marks the median.

s numeric feature

Column 's' holds an integer-valued score on a 0-5 scale with only 6 unique values across 6998 rows and no nulls. The distribution skews high (mean 3.80, median 4.0, skew -1.04), with Q1-Q3 spanning 3 to 5 and about 3.6% of rows at zero. This looks like an ordinal rating rather than a continuous measurement.

Treatment: Treat as an ordinal rating; consider ordinal encoding or bucket low scores before modelling.

anthropic:claude-opus-4-7 · confidence high
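
Bucketing low scores could look like the sketch below. The counts per level are read from the histogram for s in this report; the cut-off of 2 for the "low" bucket is an assumption chosen for illustration.

```python
# Counts per score level, read from the histogram for s in this report.
score_counts = {0: 250, 1: 328, 2: 383, 3: 1768, 4: 1176, 5: 3093}

def bucket(score: int, low_max: int = 2) -> str:
    """Collapse sparse low scores into one bucket; keep higher scores as-is."""
    return "low" if score <= low_max else str(score)

bucketed: dict[str, int] = {}
for score, count in score_counts.items():
    label = bucket(score)
    bucketed[label] = bucketed.get(label, 0) + count

print(bucketed)
```

With this cut-off the three sparsest levels merge into a single bucket of 961 rows, which is closer in size to the remaining levels.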
Out[25]:

saturn.columns["s"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 6
min 0
max 5
mean 3.796
median 4
std 1.366
q1 3
q3 5
iqr 2
skew -1.041
kurtosis 0.4154
n_outliers 0
outlier_rate 0
zero_rate 0.03572
Fig 12.
Distribution of s. Vertical dash marks the median.

ss categorical label

The column 'ss' is a categorical endangerment status with 7 distinct values: six levels ranging from 'safe' to 'extinct', plus an 'unknown' bucket. 'safe' dominates at 43.9% (3074 of 6998), 'definitely endangered' is second at 1753 (25.1%), and there are no nulls. The distribution is moderately balanced (entropy ratio 0.76), but tail categories such as 'extinct' (219) and 'unknown' (91) are sparse.

Treatment: Encode as an ordinal scale (safe → extinct) and treat 'unknown' separately or impute.

anthropic:claude-opus-4-7 · confidence high
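
A sketch of the ordinal encoding. The report states only the endpoints of the scale (safe → extinct), so the order of the intermediate levels below is an assumption, and 'unknown' is mapped to None so it can be handled separately.

```python
from typing import Optional

# Assumed severity order; the report only states the endpoints (safe -> extinct).
SEVERITY = [
    "safe",
    "vulnerable",
    "definitely endangered",
    "severely endangered",
    "critically endangered",
    "extinct",
]
RANK = {label: rank for rank, label in enumerate(SEVERITY)}

def encode_status(status: str) -> Optional[int]:
    """Return 0 (safe) .. 5 (extinct), or None for 'unknown' and unseen labels."""
    return RANK.get(status)

print([encode_status(s) for s in ("safe", "extinct", "unknown")])
```

Returning None instead of a seventh integer keeps 'unknown' from being read as more or less severe than any real level.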
Out[28]:

saturn.columns["ss"].stats

stat value
n 6,998
nulls 0 (0.0%)
unique 7
top_value safe
top_rate 0.4393
cardinality 7
entropy 2.122
entropy_ratio 0.7557
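
The entropy figures can be checked from the category counts. This sketch assumes entropy is Shannon entropy in bits and that entropy_ratio divides it by log2 of the cardinality; the report does not state either definition, but that reading agrees with the reported 2.122 and 0.7557 to within rounding.

```python
import math

# Category counts for ss, copied from the top-values table in this report.
counts = [3074, 1753, 1160, 374, 327, 219, 91]
total = sum(counts)

# Shannon entropy in bits, then normalised by the maximum log2(k) for k categories.
entropy = -sum((c / total) * math.log2(c / total) for c in counts)
entropy_ratio = entropy / math.log2(len(counts))

print(round(entropy, 3), round(entropy_ratio, 4))
```

A ratio of 1.0 would mean all seven values are equally common, so 0.76 reflects the dominance of 'safe' and the sparse tail.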
Fig 13.
Top values for ss.

How to cite


BibTeX
@misc{saturn-quirky-silence-data-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky silence data},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-silence_data}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky silence data. Source: /home/coolhand/html/datavis/data_trove/data/quirky/silence_data.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-silence_data