data raw glottolog languoid

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/data_raw/glottolog_languoid.csv

Saturn profiled 19,401 rows across 7 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
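# IPython shell escape: run this cell inside a notebook.
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source CSV and write the findings JSON.
# check=True raises CalledProcessError if the command exits with an error.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/data_raw/glottolog_languoid.csv",
    "--findings", "data_raw-glottolog_languoid.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)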

Summary confidence: high

This dataset is a Glottolog languoid catalogue with 19,401 rows and 7 columns covering identifiers (glottocode, isocodes, name), geographic coordinates (latitude, longitude), and classification fields (macroarea, level). The most striking feature is missingness: ISO codes are missing in 59.2% of rows and coordinates in 59.1%, so any geographic or ISO-based analysis will only cover about 41% of entries. Worth a closer look first: the macroarea distribution (Africa leads with 30.7% of all rows, or 32.1% of the rows that have a macroarea, followed by Eurasia and Papunesia) and the level split between dialect (56%) and language (44%). About 72% of names are single words, and the rest contain recurring qualifiers like 'nuclear', 'sign', 'central', and 'southern' that hint at naming conventions worth exploring.

citing: row_count · column_count · columns · kinds
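The coverage figures in the summary can be recomputed directly from the CSV. The following is a minimal standard-library sketch, not Saturn's own implementation: the helper name `null_rates` and the three inline rows are hypothetical, and it assumes that a missing value is stored as an empty field.

```python
import csv
import io

def null_rates(rows, columns):
    """Return the fraction of rows in which each column is empty or missing."""
    rows = list(rows)
    total = len(rows)
    rates = {}
    for col in columns:
        # Assumption: a missing value is serialised as an empty field.
        missing = sum(1 for row in rows if not (row.get(col) or "").strip())
        rates[col] = missing / total if total else 0.0
    return rates

# Hypothetical three-row sample with the same header as the profiled file.
SAMPLE = """glottocode,name,isocodes,level,macroarea,latitude,longitude
xxxx0001,Alpha,abc,language,Africa,1.5,30.0
xxxx0002,Alpha North,,dialect,Africa,,
xxxx0003,Beta,,dialect,,,
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
columns = reader.fieldnames
rates = null_rates(reader, columns)
for col in columns:
    print(f"{col:<11} {rates[col]:.1%}")
```

To run it on the real file, replace the `io.StringIO(SAMPLE)` handle with `open(path, newline="", encoding="utf-8")`, where `path` is the source path given in the overview.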

Out[4]:

saturn.schema() · 7 columns

column      kind         n       null%  unique  alerts
glottocode  text         19,401  0.0%   19,401  near_unique one_word short_text
name        text         19,401  0.0%   19,401  near_unique one_word
isocodes    text         19,401  59.2%  7,922   near_unique one_word null_rate short_text
level       categorical  19,401  0.0%   2
macroarea   categorical  19,401  4.3%   6
latitude    numeric      19,401  59.1%  7,786   null_rate
longitude   numeric      19,401  59.1%  7,745   null_rate
Fig 1.
macroarea · Shows Africa, Eurasia, and Papunesia dominate while Australia is smallest — useful for understanding geographic coverage.
Top values for macroarea (6 unique shown, of 6 total).
value          count  share
Africa         5955   30.7%
Eurasia        5028   25.9%
Papunesia      4847   25.0%
South America  1095   5.6%
North America  1035   5.3%
Australia      602    3.1%
Fig 2.
level · Reveals the dialect-vs-language split (about 56% dialects, 44% languages).
Top values for level (2 unique shown, of 2 total).
value     count  share
dialect   10920  56.3%
language  8481   43.7%
Fig 3.
latitude · Shows the latitudinal spread of languoids; note ~59% are null so the chart only reflects the geocoded subset.
Histogram bins for latitude (median: 6.2918).
bin                count
-55.27 – -52.06    5
-52.06 – -48.85    1
-48.85 – -45.64    1
-45.64 – -42.43    4
-42.43 – -39.22    7
-39.22 – -36.01    16
-36.01 – -32.8     29
-32.8 – -29.59     26
-29.59 – -26.38    47
-26.38 – -23.17    77
-23.17 – -19.96    125
-19.96 – -16.75    141
-16.75 – -13.54    280
-13.54 – -10.33    256
-10.33 – -7.121    495
-7.121 – -3.911    788
-3.911 – -0.7005   681
-0.7005 – 2.51     378
2.51 – 5.72        468
5.72 – 8.93        663
8.93 – 12.14       710
12.14 – 15.35      303
15.35 – 18.56      384
18.56 – 21.77      233
21.77 – 24.98      318
24.98 – 28.19      371
28.19 – 31.4       167
31.4 – 34.61       143
34.61 – 37.82      178
37.82 – 41.03      113
41.03 – 44.24      138
44.24 – 47.45      79
47.45 – 50.66      77
50.66 – 53.87      76
53.87 – 57.08      46
57.08 – 60.29      21
60.29 – 63.5       41
63.5 – 66.71       23
66.71 – 69.93      14
69.93 – 73.14      6
Fig 4.
longitude · Shows the bimodal east-west distribution of languoid locations across the globe.
Histogram bins for longitude (median: 47.565486).
bin                count
-178.8 – -169.8    13
-169.8 – -160.9    4
-160.9 – -151.9    10
-151.9 – -143      11
-143 – -134        10
-134 – -125.1      17
-125.1 – -116.1    123
-116.1 – -107.2    47
-107.2 – -98.21    78
-98.21 – -89.26    280
-89.26 – -80.31    59
-80.31 – -71.36    235
-71.36 – -62.41    218
-62.41 – -53.45    150
-53.45 – -44.5     60
-44.5 – -35.55     40
-35.55 – -26.6     0
-26.6 – -17.64     4
-17.64 – -8.692    105
-8.692 – 0.2605    275
0.2605 – 9.213     443
9.213 – 18.17      751
18.17 – 27.12      322
27.12 – 36.07      429
36.07 – 45.02      228
45.02 – 53.97      126
53.97 – 62.93      35
62.93 – 71.88      79
71.88 – 80.83      210
80.83 – 89.78      207
89.78 – 98.74      269
98.74 – 107.7      454
107.7 – 116.6      239
116.6 – 125.6      497
125.6 – 134.5      316
134.5 – 143.5      598
143.5 – 152.4      667
152.4 – 161.4      122
161.4 – 170.4      186
170.4 – 179.3      12
Fig 5.
name · Most names are short (median 7 characters, ~72% one word), but a long tail extends to 58 characters.
Character-length distribution for name (mean: 9.211483944126591).
chars    count
1 – 2    52
2 – 4    455
4 – 5    4355
5 – 7    2890
7 – 8    3995
8 – 10   1115
10 – 11  813
11 – 12  1417
12 – 14  724
14 – 15  1264
15 – 17  439
17 – 18  524
18 – 20  214
20 – 21  175
21 – 22  343
22 – 24  143
24 – 25  173
25 – 27  58
27 – 28  86
28 – 30  26
30 – 31  26
31 – 32  43
32 – 34  12
34 – 35  24
35 – 37  11
37 – 38  8
38 – 39  3
39 – 41  1
41 – 42  2
42 – 44  1
44 – 45  3
45 – 47  1
47 – 48  2
48 – 49  1
49 – 51  0
51 – 52  0
52 – 54  0
54 – 55  0
55 – 57  0
57 – 58  2
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column      kind         null %
glottocode  text         0.0%
name        text         0.0%
isocodes    text         59.2%
level       categorical  0.0%
macroarea   categorical  4.3%
latitude    numeric      59.1%
longitude   numeric      59.1%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
           latitude  longitude
latitude   +1.00     -0.30
longitude  -0.30     +1.00
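For readers who want to reproduce a figure like the -0.30 above, the sketch below computes Pearson's r over the rows where both coordinates are present. It is an illustration under stated assumptions: the report does not define what "sampled, bounded" means, so this version uses every complete pair and applies no sampling, and the function name `pearson_pairwise` is hypothetical.

```python
import math

def pearson_pairwise(xs, ys):
    """Pearson r over positions where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations; the 1/n factors cancel in r.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Toy values only; None marks a missing coordinate and drops that row.
r = pearson_pairwise([1.0, 2.0, None, 4.0], [2.0, 4.0, 9.0, 8.0])
print(round(r, 2))
```

Because roughly 59% of rows have no coordinates, any such coefficient describes the geocoded subset only.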

glottocode text identifier

This column holds Glottocodes: fixed 8-character identifiers (len_min=len_max=8) that uniquely tag languoids (languages and dialects) in the Glottolog catalogue. Every one of the 19,401 values is unique (n_unique=19401, duplicate_rate=0.0) and single-token (one_word_rate=1.0), matching the canonical pattern of four alphanumeric characters plus four digits visible in top_words like 'aala1237' and 'aari1239'. There are no nulls and no collisions, so this is a clean primary key rather than a feature.

Treatment: Use as the primary key; left-join on this id rather than feeding it to a model.
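Before relying on the column as a join key, it can be worth re-checking uniqueness and shape after any downstream filtering. The sketch below is one way to do that; the helper name `check_primary_key` is hypothetical, and the regular expression encodes the assumption that a Glottocode is four lowercase alphanumerics followed by four digits.

```python
import re
from collections import Counter

# Assumption: four lowercase alphanumerics followed by four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def check_primary_key(codes):
    """Return (duplicates, malformed) for a list of candidate key values."""
    counts = Counter(codes)
    duplicates = sorted(code for code, n in counts.items() if n > 1)
    malformed = sorted(code for code in counts if not GLOTTOCODE_RE.fullmatch(code))
    return duplicates, malformed

# 'aala1237' and 'aari1239' are quoted in the report; 'bad-code' is made up.
duplicates, malformed = check_primary_key(
    ["aala1237", "aari1239", "aari1239", "bad-code"]
)
print("duplicates:", duplicates)
print("malformed:", malformed)
```

On the profiled file both lists should come back empty if the reported stats (n_duplicates 0, len_min=len_max=8) still hold.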

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["glottocode"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique 19,401
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 19,401
readability_flesch_mean 93.3
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: short_text 95th-percentile length under 20 chars
Fig 8.
Character-length distribution for glottocode.
Character-length distribution for glottocode (mean: 8.0): all 19,401 values are exactly 8 characters long, so every histogram bin except one is empty.

name text identifier

The `name` column is a short text label. All 19,401 values are unique, with no nulls or duplicates, so it works as a human-readable label alongside the `glottocode` key. Entries are terse (mean 9.2 characters, median 7 characters, median 1 word) and 71.7% are single words. The top tokens (`nuclear`, `language`, `central`, `southern`, `western`) are classificatory and directional qualifiers attached to languoid names, not personal names. Vocabulary is wide (17,861 distinct words) for such short strings.

Treatment: Treat as a unique key; do not feature-engineer directly, but tokenize if used for matching or embedding.
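If the names are tokenized for matching, a simple whitespace split is enough to surface the recurring qualifiers. The sketch below assumes lower-casing and whitespace tokenization, which may differ from how Saturn derives its top_words; the function names and the sample names are made up for illustration.

```python
from collections import Counter

def top_tokens(names, k=5):
    """Lower-case, split on whitespace, and return the k most common tokens."""
    counts = Counter(token for name in names for token in name.lower().split())
    return counts.most_common(k)

def one_word_rate(names):
    """Fraction of names that consist of a single whitespace-delimited token."""
    names = list(names)
    return sum(1 for name in names if len(name.split()) == 1) / len(names)

# Made-up names that imitate the qualifier pattern described above.
names = ["Nuclear Foo", "Southern Foo", "Bar", "Central Bar", "Baz", "Nuclear Baz"]
print(top_tokens(names, k=4))
print(f"{one_word_rate(names):.1%}")
```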

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["name"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique 19,401
len_min 1
len_max 58
len_mean 9.211
len_median 7
len_p95 20
word_mean 1.369
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 17,861
readability_flesch_mean 60.53
emoji_rate 0
url_rate 0
one_word_rate 0.7169
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 71.7% rows are a single word
Fig 9.
Character-length distribution for name.
Character-length distribution for name (mean: 9.211483944126591). The bin counts are identical to the table under Fig 5.

isocodes text identifier

This column holds 3-character ISO-style codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0). It is sparsely populated: 11,479 of the 19,401 rows (59.2%) are null. The remaining 7,922 rows carry 7,922 distinct codes, so no code is repeated (vocab_size=7922, n_duplicates=0). The top_words sample (aiw, aay, aas, kbt…) suggests ISO 639-3 language codes rather than country codes.

Treatment: Treat as a foreign key rather than a category, since every populated code is unique; left-join to an ISO code lookup and flag the ~59% missing values instead of imputing them.
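A left join that keeps every languoid and records an explicit missing flag could look like the sketch below. The rows, the lookup table, and the output field names `iso_missing` and `iso_name` are all hypothetical; a real lookup would come from an external ISO code table that this report does not include.

```python
def left_join_iso(languoids, iso_lookup):
    """Left-join languoid rows to an ISO lookup, flagging rows without a code."""
    joined = []
    for row in languoids:
        code = (row.get("isocodes") or "").strip()
        joined.append({
            **row,
            "iso_missing": not code,  # explicit flag, no imputation
            "iso_name": iso_lookup.get(code) if code else None,
        })
    return joined

# Hypothetical rows and lookup table; neither comes from the profiled file.
languoids = [
    {"glottocode": "xxxx0001", "isocodes": "abc"},
    {"glottocode": "xxxx0002", "isocodes": ""},
    {"glottocode": "xxxx0003", "isocodes": "zzz"},
]
iso_lookup = {"abc": "Example language"}

result = left_join_iso(languoids, iso_lookup)
for row in result:
    print(row["glottocode"], row["iso_missing"], row["iso_name"])
```

Keeping `iso_missing` separate from `iso_name` distinguishes "no code in the catalogue" from "code present but absent from the lookup", which are different data-quality problems.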

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["isocodes"].stats

stat value
n 19,401
nulls 11,479 (59.2%)
unique 7,922
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 7,922
readability_flesch_mean 118.7
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: null_rate 59.2% null
alert: short_text 95th-percentile length under 20 chars
Fig 10.
Character-length distribution for isocodes.
Character-length distribution for isocodes (mean: 3.0): all 7,922 non-null values are exactly 3 characters long, so every histogram bin except one is empty.

level categorical label

A binary categorical flag distinguishing 'dialect' (10,920 rows, 56.3%) from 'language' (8,481 rows). The split is fairly balanced, with entropy at 0.989 of the maximum, and there are no nulls across all 19,401 rows.

Treatment: One-hot or binary-encode for modelling.
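The entropy figures quoted for this column can be checked from the two counts alone. The sketch below assumes Shannon entropy in bits and a ratio taken against log2 of the number of observed categories; the report does not state its formula, so treat this as one plausible reading rather than Saturn's definition.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

level_counts = [10920, 8481]  # dialect, language (from the level table)
print(round(entropy_bits(level_counts), 4))
print(round(entropy_ratio(level_counts), 4))
```

With two categories the maximum entropy is exactly 1 bit, which is why `entropy` and `entropy_ratio` coincide in the stats for this column. The results can be compared against the reported value of 0.9886.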

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["level"].stats

stat value
n 19,401
nulls 0 (0.0%)
unique 2
top_value dialect
top_rate 0.5629
cardinality 2
entropy 0.9886
entropy_ratio 0.9886
Fig 11.
Top values for level.
Top values for level (2 unique shown, of 2 total). The counts are identical to the table under Fig 2.

macroarea categorical feature

Categorical macro-region label with just 6 values covering the world's linguistic areas (Africa, Eurasia, Papunesia, South America, North America, Australia). Distribution is moderately balanced (entropy ratio 0.84). Africa leads with 5,955 rows, which is 32.1% of the non-null rows (top_rate 0.3208) and 30.7% of all rows (the share shown in the Fig 1 table); Australia is smallest at 602 rows, and 4.32% of rows are null. No single category dominates, but the Americas and Australia are markedly underrepresented relative to Africa, Eurasia, and Papunesia.

Treatment: One-hot or target-encode; impute or flag the ~4% missing before modelling.
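The two Africa percentages in this report (30.7% and 32.1%) differ only in their denominator. The sketch below makes that explicit using the counts from the macroarea table; the function name `shares` is hypothetical.

```python
def shares(counts, n_rows):
    """Return {value: (share_of_all_rows, share_of_non_null_rows)}."""
    non_null = sum(counts.values())
    return {value: (c / n_rows, c / non_null) for value, c in counts.items()}

# Counts from the macroarea table; 19,401 total rows, 839 of them null.
macroarea_counts = {
    "Africa": 5955, "Eurasia": 5028, "Papunesia": 4847,
    "South America": 1095, "North America": 1035, "Australia": 602,
}
macroarea_shares = shares(macroarea_counts, 19401)
for value, (of_all, of_non_null) in macroarea_shares.items():
    print(f"{value:<14} {of_all:6.1%} {of_non_null:6.1%}")
```

When quoting a share from this column, it helps to say which denominator is meant, since the 839 null rows shift every percentage by a point or so.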

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["macroarea"].stats

stat value
n 19,401
nulls 839 (4.3%)
unique 6
top_value Africa
top_rate 0.3208
cardinality 6
entropy 2.176
entropy_ratio 0.8418
Fig 12.
Top values for macroarea.
Top values for macroarea (6 unique shown, of 6 total). The counts are identical to the table under Fig 1.

latitude numeric feature

Geographic latitude coordinates ranging from -55.2748 to 73.1354, consistent with a worldwide point dataset. About 59% of rows are null, which limits any spatial analysis to the geocoded subset. The median of 6.29 and mean of 8.16 indicate a slight northern-hemisphere lean, and the moderate positive skew (0.54) reflects a longer tail toward high northern latitudes.

Treatment: Pair with longitude for geospatial features; impute or filter the 59% nulls before mapping.
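One way to apply the "filter" option is to keep only rows where both coordinates parse and fall inside valid ranges, so latitude and longitude always travel as a pair. The sketch below assumes coordinates arrive as strings with blanks for missing values; the generator name `geocoded` and the sample rows are hypothetical.

```python
def geocoded(rows):
    """Yield (glottocode, lat, lon) for rows whose coordinates parse and are in range."""
    for row in rows:
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # blank or non-numeric coordinate: drop the row, do not impute
        # NaN fails both comparisons, so it is dropped here as well.
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            yield row["glottocode"], lat, lon

# Hypothetical rows: one valid, one blank, one with an impossible latitude.
rows = [
    {"glottocode": "xxxx0001", "latitude": "1.5", "longitude": "30.0"},
    {"glottocode": "xxxx0002", "latitude": "", "longitude": ""},
    {"glottocode": "xxxx0003", "latitude": "95.0", "longitude": "10.0"},
]
points = list(geocoded(rows))
print(points)
```

Filtering is the conservative choice here: a mean-imputed coordinate would place ungeocoded languoids at a location that corresponds to none of them.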

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["latitude"].stats

stat value
n 19,401
nulls 11,472 (59.1%)
unique 7,786
min -55.27
max 73.14
mean 8.164
median 6.292
std 18.96
q1 -5.139
q3 19.27
iqr 24.41
skew 0.5425
kurtosis 0.3048
n_outliers 135
outlier_rate 0.01703
zero_rate 0
alert: null_rate 59.1% null
Fig 13.
Distribution of latitude. Vertical dash marks the median.
Histogram bins for latitude (median: 6.2918). The bin counts are identical to the table under Fig 3.

longitude numeric feature

Geographic longitude coordinates spanning nearly the full globe (min -178.785, max 179.306). The median of 47.57 and mean of 51.22 indicate a worldwide dataset weighted toward the Eastern Hemisphere. The 59.13% null rate is the dominant concern, since most records lack a location; only 13 values (0.16% of the non-null rows) are flagged as outliers, and skew is mild (-0.48).

Treatment: Impute or flag the 59% missing values before any geospatial modelling; pair with latitude for joint use.
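The report does not say how `n_outliers` is defined. One common convention is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles; the sketch below applies that rule to the rounded quartiles from the stats tables purely as an assumption, and the function name `tukey_fences` is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as printed in the stats tables (already rounded by the report).
lat_lo, lat_hi = tukey_fences(-5.139, 19.27)
lon_lo, lon_hi = tukey_fences(7.18, 124.1)
print(f"latitude fences:  {lat_lo:.2f} to {lat_hi:.2f}")
print(f"longitude fences: {lon_lo:.2f} to {lon_hi:.2f}")
```

Under this assumption the upper longitude fence lies beyond 180, so only far-western values could be flagged. The lower fence falls just inside the second histogram bin, and the first bin (-178.8 to -169.8) holds 13 values, which is at least consistent with the reported 13 outliers.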

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["longitude"].stats

stat value
n 19,401
nulls 11,472 (59.1%)
unique 7,745
min -178.8
max 179.3
mean 51.22
median 47.57
std 81.15
q1 7.18
q3 124.1
iqr 117
skew -0.4814
kurtosis -0.7765
n_outliers 13
outlier_rate 0.00164
zero_rate 0
alert: null_rate 59.1% null
Fig 14.
Distribution of longitude. Vertical dash marks the median.
Histogram bins for longitude (median: 47.565486). The bin counts are identical to the table under Fig 4.

How to cite

BibTeX
@misc{saturn-data-raw-glottolog-languoid-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data raw glottolog languoid},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data_raw-glottolog_languoid}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: data raw glottolog languoid. Source: /home/coolhand/servers/diachronica/data_raw/glottolog_languoid.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/data_raw-glottolog_languoid