data raw glottolog languoid

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/data_raw/glottolog_languoid.csv

Saturn profiled 19,401 rows across 7 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

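# Install the profiler, then run it against the source CSV.
!pip install saturn-dissect
import subprocess

# check=True raises CalledProcessError if the saturn CLI exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/data_raw/glottolog_languoid.csv",
    "--findings", "data_raw-glottolog_languoid.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)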

Summary confidence: high

This dataset is a Glottolog languoid catalogue with 19,401 rows and 7 columns covering identifiers (glottocode, isocodes, name), geographic coordinates (latitude, longitude), and classification fields (macroarea, level). The most striking feature is missingness: about 59% of rows lack an ISO code (59.2%) or coordinates (59.1%), so any geographic or ISO-based analysis covers only about 41% of entries. Worth a closer look first: the macroarea distribution (Africa leads at 30.7% of all rows, or 32.1% of rows with a macroarea, followed by Eurasia and Papunesia) and the level split between dialect (56%) and language (44%). The name field is mostly single words but contains recurring qualifiers such as 'nuclear', 'sign', 'central', and 'southern' that point to naming conventions worth exploring.

citing: row_count · column_count · columns · kinds

Out[4]:

saturn.schema() · 7 columns

column	kind	n	null%	unique	alerts
glottocode	text	19,401	0.0%	19,401	near_unique one_word short_text
name	text	19,401	0.0%	19,401	near_unique one_word
isocodes	text	19,401	59.2%	7,922	near_unique one_word null_rate short_text
level	categorical	19,401	0.0%	2
macroarea	categorical	19,401	4.3%	6
latitude	numeric	19,401	59.1%	7,786	null_rate
longitude	numeric	19,401	59.1%	7,745	null_rate
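
The null% column follows directly from the null counts reported in the per-column stats further down. A minimal sketch that recomputes it, with the counts hard-coded from this report rather than read from the CSV (the variable and function names are illustrative and not part of saturn):

```python
# Null counts as reported in the per-column stats of this notebook.
TOTAL_ROWS = 19_401
null_counts = {
    "glottocode": 0,
    "name": 0,
    "isocodes": 11_479,
    "level": 0,
    "macroarea": 839,
    "latitude": 11_472,
    "longitude": 11_472,
}


def null_pct(nulls: int, total: int = TOTAL_ROWS) -> float:
    """Share of rows that are null, as a percentage rounded to one decimal."""
    return round(100 * nulls / total, 1)


null_rates = {col: null_pct(n) for col, n in null_counts.items()}
for col, pct in null_rates.items():
    print(f"{col:<11} {pct:5.1f}%")

# Percentage of rows usable for ISO-based or coordinate-based work.
iso_coverage = 100 - null_rates["isocodes"]
geo_coverage = 100 - null_rates["latitude"]
```

The coverage figures are simply the complements of the null rates, which is where the "about 41%" usable share comes from.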

Fig 1.

macroarea · Shows Africa, Eurasia, and Papunesia dominate while Australia is smallest — useful for understanding geographic coverage.

Top values for macroarea (6 unique shown, of 6 total).
value	count	share
Africa	5955	30.7%
Eurasia	5028	25.9%
Papunesia	4847	25.0%
South America	1095	5.6%
North America	1035	5.3%
Australia	602	3.1%

Fig 2.

level · Reveals the dialect-vs-language split (about 56% dialects, 44% languages).

Top values for level (2 unique shown, of 2 total).
value	count	share
dialect	10920	56.3%
language	8481	43.7%

Fig 3.

latitude · Shows the latitudinal spread of languoids; note ~59% are null so the chart only reflects the geocoded subset.

Histogram bins for latitude (median: 6.2918).
bin	count
-55.27 – -52.06	5
-52.06 – -48.85	1
-48.85 – -45.64	1
-45.64 – -42.43	4
-42.43 – -39.22	7
-39.22 – -36.01	16
-36.01 – -32.8	29
-32.8 – -29.59	26
-29.59 – -26.38	47
-26.38 – -23.17	77
-23.17 – -19.96	125
-19.96 – -16.75	141
-16.75 – -13.54	280
-13.54 – -10.33	256
-10.33 – -7.121	495
-7.121 – -3.911	788
-3.911 – -0.7005	681
-0.7005 – 2.51	378
2.51 – 5.72	468
5.72 – 8.93	663
8.93 – 12.14	710
12.14 – 15.35	303
15.35 – 18.56	384
18.56 – 21.77	233
21.77 – 24.98	318
24.98 – 28.19	371
28.19 – 31.4	167
31.4 – 34.61	143
34.61 – 37.82	178
37.82 – 41.03	113
41.03 – 44.24	138
44.24 – 47.45	79
47.45 – 50.66	77
50.66 – 53.87	76
53.87 – 57.08	46
57.08 – 60.29	21
60.29 – 63.5	41
63.5 – 66.71	23
66.71 – 69.93	14
69.93 – 73.14	6

Fig 4.

longitude · Shows a multimodal east-west distribution with three broad clusters, roughly -125° to -45°, -18° to 54°, and 72° to 170°; like latitude, it only reflects the geocoded subset.

Histogram bins for longitude (median: 47.565486).
bin	count
-178.8 – -169.8	13
-169.8 – -160.9	4
-160.9 – -151.9	10
-151.9 – -143	11
-143 – -134	10
-134 – -125.1	17
-125.1 – -116.1	123
-116.1 – -107.2	47
-107.2 – -98.21	78
-98.21 – -89.26	280
-89.26 – -80.31	59
-80.31 – -71.36	235
-71.36 – -62.41	218
-62.41 – -53.45	150
-53.45 – -44.5	60
-44.5 – -35.55	40
-35.55 – -26.6	0
-26.6 – -17.64	4
-17.64 – -8.692	105
-8.692 – 0.2605	275
0.2605 – 9.213	443
9.213 – 18.17	751
18.17 – 27.12	322
27.12 – 36.07	429
36.07 – 45.02	228
45.02 – 53.97	126
53.97 – 62.93	35
62.93 – 71.88	79
71.88 – 80.83	210
80.83 – 89.78	207
89.78 – 98.74	269
98.74 – 107.7	454
107.7 – 116.6	239
116.6 – 125.6	497
125.6 – 134.5	316
134.5 – 143.5	598
143.5 – 152.4	667
152.4 – 161.4	122
161.4 – 170.4	186
170.4 – 179.3	12

Fig 5.

name · Most names are short (median 7 characters, ~72% one word), but a long tail extends to 58 characters.

Character-length distribution for name (mean: 9.211483944126591).
chars	count
1 – 2	52
2 – 4	455
4 – 5	4355
5 – 7	2890
7 – 8	3995
8 – 10	1115
10 – 11	813
11 – 12	1417
12 – 14	724
14 – 15	1264
15 – 17	439
17 – 18	524
18 – 20	214
20 – 21	175
21 – 22	343
22 – 24	143
24 – 25	173
25 – 27	58
27 – 28	86
28 – 30	26
30 – 31	26
31 – 32	43
32 – 34	12
34 – 35	24
35 – 37	11
37 – 38	8
38 – 39	3
39 – 41	1
41 – 42	2
42 – 44	1
44 – 45	3
45 – 47	1
47 – 48	2
48 – 49	1
49 – 51	0
51 – 52	0
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	2

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
glottocode	text	0.0%
name	text	0.0%
isocodes	text	59.2%
level	categorical	0.0%
macroarea	categorical	4.3%
latitude	numeric	59.1%
longitude	numeric	59.1%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	latitude	longitude
latitude	+1.00	-0.30
longitude	-0.30	+1.00
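
Because roughly 59% of rows have no coordinates, a correlation like the one above can only use rows where both values are present. How saturn samples and bounds its input is not stated here; the sketch below shows plain pairwise-complete Pearson r on made-up points, not the catalogue data, and the function name is an assumption.

```python
import math


def pearson(pairs):
    """Pearson r over the pairs where both values are present (not None)."""
    complete = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(complete)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    sxx = sum((x - mean_x) ** 2 for x, _ in complete)
    syy = sum((y - mean_y) ** 2 for _, y in complete)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


# Made-up points: the complete pairs lie exactly on y = 10 - 2x.
toy = [(0, 10), (1, 8), (None, 5), (2, 6), (3, None), (4, 2)]
r = pearson(toy)
print(round(r, 2))
```

Dropping incomplete pairs before computing the means matters: filling missing coordinates with a constant would pull r toward zero.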

glottocode text identifier

This column holds Glottocodes: fixed 8-character identifiers (len_min=len_max=8) that uniquely tag each languoid (language or dialect) in the Glottolog catalogue. All 19,401 values are unique (n_unique=19401, duplicate_rate=0.0) and single-token (one_word_rate=1.0), matching the four-letters-plus-four-digits pattern visible in top_words such as 'aala1237' and 'aari1239'. With no nulls and no collisions, this is a clean primary key rather than a feature.

Treatment: Use as the primary key; left-join on this id rather than feeding it to a model.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["glottocode"].stats

stat	value
n	19,401
nulls	0 (0.0%)
unique	19,401
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	19,401
readability_flesch_mean	93.3
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
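
Before relying on glottocode as a join key, the shape and uniqueness claims above can be checked mechanically. The sketch below is one way to do it; the regular expression is an assumption (four lowercase alphanumerics followed by four digits, slightly looser than "four letters"), and `check_key` is a hypothetical helper, not a saturn function.

```python
import re
from collections import Counter

# Assumed pattern: four lowercase alphanumerics followed by four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def check_key(values):
    """Return (malformed values, duplicated values) for a candidate key column."""
    malformed = [v for v in values if not GLOTTOCODE_RE.fullmatch(v)]
    duplicated = [v for v, n in Counter(values).items() if n > 1]
    return malformed, duplicated


# 'aala1237' and 'aari1239' appear in this report; the rest are made-up test inputs.
sample = ["aala1237", "aari1239", "aari1239", "AALA1237", "aal1237"]
malformed, duplicated = check_key(sample)
print("malformed:", malformed)
print("duplicated:", duplicated)
```

On the real column both lists should come back empty if the stats above hold.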

Fig 8.

Character-length distribution for glottocode.

Character-length distribution for glottocode (mean: 8.0).
chars	count
8	19401

name text identifier

The name column is a short text label: all 19,401 values are unique, with no nulls or duplicates, so it works as a human-readable label alongside the glottocode key. Entries are terse (mean 9.2 characters, median 7 characters, median 1 word) and 71.7% are single words. The top tokens ('nuclear', 'language', 'central', 'southern', 'western') are qualifiers attached to languoid names rather than free text. Vocabulary is wide (17,861 distinct words) for such short strings.

Treatment: Treat as a unique key; do not feature-engineer directly, but tokenize if used for matching or embedding.

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["name"].stats

stat	value
n	19,401
nulls	0 (0.0%)
unique	19,401
len_min	1
len_max	58
len_mean	9.211
len_median	7
len_p95	20
word_mean	1.369
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	17,861
readability_flesch_mean	60.53
emoji_rate	0
url_rate	0
one_word_rate	0.7169
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	71.7% rows are a single word
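
If the names are tokenized for matching, a simple word split is enough to surface recurring qualifiers and to recompute a one-word rate. This is a sketch on made-up names (only the qualifier words come from this report), and saturn's own tokenizer may split differently, for example on hyphens or apostrophes.

```python
import re
from collections import Counter


def tokens(name):
    """Casefold a name and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", name.casefold())


# Made-up names for illustration, not catalogue entries.
names = [
    "Placeholder",
    "Nuclear Placeholder",
    "Central Placeholder",
    "Placeholder Sign Language",
]

token_counts = Counter(t for n in names for t in tokens(n))
one_word_rate = sum(len(tokens(n)) == 1 for n in names) / len(names)
print(token_counts.most_common(3))
print(one_word_rate)
```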

Fig 9.

Character-length distribution for name.

Character-length distribution for name (mean: 9.211483944126591).
chars	count
1 – 2	52
2 – 4	455
4 – 5	4355
5 – 7	2890
7 – 8	3995
8 – 10	1115
10 – 11	813
11 – 12	1417
12 – 14	724
14 – 15	1264
15 – 17	439
17 – 18	524
18 – 20	214
20 – 21	175
21 – 22	343
22 – 24	143
24 – 25	173
25 – 27	58
27 – 28	86
28 – 30	26
30 – 31	26
31 – 32	43
32 – 34	12
34 – 35	24
35 – 37	11
37 – 38	8
38 – 39	3
39 – 41	1
41 – 42	2
42 – 44	1
44 – 45	3
45 – 47	1
47 – 48	2
48 – 49	1
49 – 51	0
51 – 52	0
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	2

isocodes text identifier

This column holds 3-character ISO-style codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0). It is sparsely populated, with a null rate of 0.5917 (11,479 of 19,401 rows), and the 7,922 populated rows carry 7,922 distinct codes with no duplicates (vocab_size=7922, n_duplicates=0). The top_words sample (aiw, aay, aas, kbt…) suggests ISO 639-3 language codes rather than country codes, each appearing only once.

Treatment: Treat as a foreign key; left-join to an ISO code lookup and flag the ~59% missing values rather than imputing them, since a missing identifier cannot be reliably inferred from the other columns.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["isocodes"].stats

stat	value
n	19,401
nulls	11,479 (59.2%)
unique	7,922
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	7,922
readability_flesch_mean	118.7
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: null_rate	59.2% null
alert: short_text	95th-percentile length under 20 chars
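
One way to handle the missing codes is to carry an explicit presence flag instead of filling anything in. A minimal sketch, assuming rows arrive as dicts (for example from csv.DictReader); the rows, the `has_iso` field, and the `flag_iso` helper are all hypothetical placeholders rather than catalogue data or saturn output.

```python
# Hypothetical rows: glottocodes and codes here are placeholders, not catalogue data.
rows = [
    {"glottocode": "xxxx0001", "isocodes": "qaa"},
    {"glottocode": "xxxx0002", "isocodes": ""},    # blank cell read from a CSV
    {"glottocode": "xxxx0003", "isocodes": None},  # value already parsed as missing
]


def flag_iso(row):
    """Return a copy of the row with isocodes normalised and a boolean has_iso."""
    code = (row.get("isocodes") or "").strip()
    return {**row, "isocodes": code or None, "has_iso": bool(code)}


flagged = [flag_iso(r) for r in rows]
coverage = sum(r["has_iso"] for r in flagged) / len(flagged)
print(f"ISO coverage: {coverage:.1%}")
```

Normalising blank strings to None first keeps a later left-join from matching on an empty key.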

Fig 10.

Character-length distribution for isocodes.

Character-length distribution for isocodes (mean: 3.0).
chars	count
3	7922

level categorical label

A binary categorical flag distinguishing 'dialect' (10,920 rows, 56.3%) from 'language' (8,481 rows, 43.7%). The split is fairly balanced, with entropy at 0.989 of the maximum, and there are no nulls across the 19,401 rows.

Treatment: One-hot or binary-encode for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["level"].stats

stat	value
n	19,401
nulls	0 (0.0%)
unique	2
top_value	dialect
top_rate	0.5629
cardinality	2
entropy	0.9886
entropy_ratio	0.9886
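
The entropy figures above can be reproduced from the two counts. The sketch assumes Shannon entropy in bits (log base 2), which is an inference from the numbers rather than something the report states; with two categories the maximum is 1 bit, so entropy and entropy_ratio coincide.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits of a categorical distribution given raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


level_counts = {"dialect": 10_920, "language": 8_481}

h = shannon_entropy(level_counts.values())
h_max = math.log2(len(level_counts))  # 1 bit for two categories
print(round(h, 4), round(h / h_max, 4))
```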

Fig 11.

Top values for level.

Top values for level (2 unique shown, of 2 total).
value	count	share
dialect	10920	56.3%
language	8481	43.7%

macroarea categorical feature

Categorical macro-region label with just 6 values covering the world's linguistic areas (Africa, Eurasia, Papunesia, South America, North America, Australia). Distribution is moderately balanced (entropy ratio 0.84): Africa leads with 5,955 rows, which is 32.1% of the 18,562 non-null rows (top_rate) or 30.7% of all 19,401 rows (the share shown in the table below), and Australia is smallest at 602 rows; 4.32% of rows are null. No single category dominates, but the Americas and Australia are far smaller than Africa, Eurasia, and Papunesia.

Treatment: one-hot or target-encode; impute or flag the ~4% missing before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["macroarea"].stats

stat	value
n	19,401
nulls	839 (4.3%)
unique	6
top_value	Africa
top_rate	0.3208
cardinality	6
entropy	2.176
entropy_ratio	0.8418
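
The top_rate of 0.3208 and the 30.7% share shown for Africa elsewhere in this report differ only in their denominator. The sketch below recomputes both from the reported counts, and checks that entropy taken over the non-null rows (in bits, an assumption) matches the stats above.

```python
import math

TOTAL_ROWS = 19_401
macroarea_counts = {
    "Africa": 5_955,
    "Eurasia": 5_028,
    "Papunesia": 4_847,
    "South America": 1_095,
    "North America": 1_035,
    "Australia": 602,
}

non_null = sum(macroarea_counts.values())  # rows that have a macroarea
nulls = TOTAL_ROWS - non_null

africa = macroarea_counts["Africa"]
share_of_all = africa / TOTAL_ROWS         # denominator includes null rows
share_of_non_null = africa / non_null      # denominator excludes null rows

# Entropy over the non-null distribution, and its ratio to log2(6).
probs = [c / non_null for c in macroarea_counts.values()]
entropy = -sum(p * math.log2(p) for p in probs)
entropy_ratio = entropy / math.log2(len(probs))

print(f"{share_of_all:.1%} of all rows, {share_of_non_null:.1%} of non-null rows")
print(round(entropy, 3), round(entropy_ratio, 4))
```

When quoting a share for this column, it helps to say which denominator is meant, since the two differ by more than a percentage point.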

Fig 12.

Top values for macroarea.

Top values for macroarea (6 unique shown, of 6 total).
value	count	share
Africa	5955	30.7%
Eurasia	5028	25.9%
Papunesia	4847	25.0%
South America	1095	5.6%
North America	1035	5.3%
Australia	602	3.1%

latitude numeric feature

Geographic latitude coordinates ranging from -55.2748 to 73.1354, consistent with a worldwide point dataset. Nearly 59% of rows are null, so any spatial analysis covers only the geocoded subset. The median of 6.29 and mean of 8.16 indicate a slight northern-hemisphere lean, and the mean sitting above the median is consistent with the moderate positive skew (0.54).

Treatment: Pair with longitude for geospatial features; impute or filter the 59% nulls before mapping.

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["latitude"].stats

stat	value
n	19,401
nulls	11,472 (59.1%)
unique	7,786
min	-55.27
max	73.14
mean	8.164
median	6.292
std	18.96
q1	-5.139
q3	19.27
iqr	24.41
skew	0.5425
kurtosis	0.3048
n_outliers	135
outlier_rate	0.01703
zero_rate	0
alert: null_rate	59.1% null
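
For mapping, the simplest treatment is to keep only rows where both coordinates are present and within valid bounds. A minimal stdlib sketch, assuming the CSV uses the column names shown in this report; the three sample rows and the helper names are made up for illustration.

```python
import csv
import io

# Hypothetical three-row extract using the column names from this report.
SAMPLE = """glottocode,name,latitude,longitude
xxxx0001,Placeholder A,6.29,47.57
xxxx0002,Placeholder B,,
xxxx0003,Placeholder C,-55.27,
"""


def parse_coord(text, limit):
    """Return a float within [-limit, limit], or None if blank, malformed, or out of range."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return value if -limit <= value <= limit else None


def geocoded(rows):
    """Yield (glottocode, lat, lon) for rows where both coordinates are usable."""
    for row in rows:
        lat = parse_coord(row["latitude"], 90)
        lon = parse_coord(row["longitude"], 180)
        if lat is not None and lon is not None:
            yield row["glottocode"], lat, lon


points = list(geocoded(csv.DictReader(io.StringIO(SAMPLE))))
print(points)
```

Requiring both values together avoids plotting a point with a latitude but no longitude.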

Fig 13.

Distribution of latitude. Vertical dash marks the median.

Histogram bins for latitude (median: 6.2918).
bin	count
-55.27 – -52.06	5
-52.06 – -48.85	1
-48.85 – -45.64	1
-45.64 – -42.43	4
-42.43 – -39.22	7
-39.22 – -36.01	16
-36.01 – -32.8	29
-32.8 – -29.59	26
-29.59 – -26.38	47
-26.38 – -23.17	77
-23.17 – -19.96	125
-19.96 – -16.75	141
-16.75 – -13.54	280
-13.54 – -10.33	256
-10.33 – -7.121	495
-7.121 – -3.911	788
-3.911 – -0.7005	681
-0.7005 – 2.51	378
2.51 – 5.72	468
5.72 – 8.93	663
8.93 – 12.14	710
12.14 – 15.35	303
15.35 – 18.56	384
18.56 – 21.77	233
21.77 – 24.98	318
24.98 – 28.19	371
28.19 – 31.4	167
31.4 – 34.61	143
34.61 – 37.82	178
37.82 – 41.03	113
41.03 – 44.24	138
44.24 – 47.45	79
47.45 – 50.66	77
50.66 – 53.87	76
53.87 – 57.08	46
57.08 – 60.29	21
60.29 – 63.5	41
63.5 – 66.71	23
66.71 – 69.93	14
69.93 – 73.14	6

longitude numeric feature

Geographic longitude coordinate spanning nearly the full globe (min -178.785, max 179.306) with median 47.565486; since q1 is 7.18, at least three quarters of the geocoded points lie east of the prime meridian. The 59.13% null rate is the dominant concern, since most records lack a location; only 13 outliers (0.16% of non-null values) appear and skew is mild (-0.48).

Treatment: Impute or flag the 59% missing values before any geospatial modelling; pair with latitude for joint use.

anthropic:claude-opus-4-7 · confidence high

Out[31]:

saturn.columns["longitude"].stats

stat	value
n	19,401
nulls	11,472 (59.1%)
unique	7,745
min	-178.8
max	179.3
mean	51.22
median	47.57
std	81.15
q1	7.18
q3	124.1
iqr	117
skew	-0.4814
kurtosis	-0.7765
n_outliers	13
outlier_rate	0.00164
zero_rate	0
alert: null_rate	59.1% null
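
The report does not say how outliers are defined. One common rule, Tukey's 1.5 × IQR fences, is at least consistent with the numbers: for longitude the lower fence falls near -168.2, and the westernmost histogram bin (-178.8 to -169.8), which lies wholly below it, holds exactly 13 values. The sketch below applies that assumed rule to the reported quartiles and also shows that outlier_rate is taken over non-null rows.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported in the latitude and longitude stats tables.
lat_lo, lat_hi = tukey_fences(-5.139, 19.27)
lon_lo, lon_hi = tukey_fences(7.18, 124.1)
print(f"latitude fences:  {lat_lo:.2f} to {lat_hi:.2f}")
print(f"longitude fences: {lon_lo:.2f} to {lon_hi:.2f}")

# Outlier rates use the non-null count as denominator, not all 19,401 rows.
non_null = 19_401 - 11_472
lat_outlier_rate = 135 / non_null
lon_outlier_rate = 13 / non_null
print(round(lat_outlier_rate, 5), round(lon_outlier_rate, 5))
```

The upper longitude fence lands beyond 180°, so under this rule only far-western points can be flagged; these are valid coordinates rather than errors, so flagging is probably more appropriate than dropping.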

Fig 14.

Distribution of longitude. Vertical dash marks the median.

Histogram bins for longitude (median: 47.565486).
bin	count
-178.8 – -169.8	13
-169.8 – -160.9	4
-160.9 – -151.9	10
-151.9 – -143	11
-143 – -134	10
-134 – -125.1	17
-125.1 – -116.1	123
-116.1 – -107.2	47
-107.2 – -98.21	78
-98.21 – -89.26	280
-89.26 – -80.31	59
-80.31 – -71.36	235
-71.36 – -62.41	218
-62.41 – -53.45	150
-53.45 – -44.5	60
-44.5 – -35.55	40
-35.55 – -26.6	0
-26.6 – -17.64	4
-17.64 – -8.692	105
-8.692 – 0.2605	275
0.2605 – 9.213	443
9.213 – 18.17	751
18.17 – 27.12	322
27.12 – 36.07	429
36.07 – 45.02	228
45.02 – 53.97	126
53.97 – 62.93	35
62.93 – 71.88	79
71.88 – 80.83	210
80.83 – 89.78	207
89.78 – 98.74	269
98.74 – 107.7	454
107.7 – 116.6	239
116.6 – 125.6	497
125.6 – 134.5	316
134.5 – 143.5	598
143.5 – 152.4	667
152.4 – 161.4	122
161.4 – 170.4	186
170.4 – 179.3	12

How to cite

BibTeX

@misc{saturn-data-raw-glottolog-languoid-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data raw glottolog languoid},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data_raw-glottolog_languoid}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: data raw glottolog languoid. Source: /home/coolhand/servers/diachronica/data_raw/glottolog_languoid.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/data_raw-glottolog_languoid