data raw glottolog languoid

source /home/coolhand/servers/diachronica/data_raw/glottolog_languoid.csv 19,401 rows 7 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Glottolog languoid catalogue with 19,401 rows and 7 columns covering identifiers (glottocode, isocodes, name), geographic coordinates (latitude, longitude), and classification fields (macroarea, level). The most striking feature is missingness: isocodes is null in 59.2% of rows and the coordinates in 59.1%, so any geographic or ISO-based analysis covers only about 41% of entries. Worth a closer look first: the macroarea distribution (Africa leads with 30.7% of all rows, or 32.1% of rows that have a macroarea, followed by Eurasia and Papunesia) and the level split between dialect (56%) and language (44%). The name field is mostly single words but contains recurring qualifiers like 'nuclear', 'sign', 'central', and 'southern' that hint at naming conventions worth exploring.

citing: row_count · column_count · columns · kinds
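
The coverage figure in the summary follows directly from the null counts listed in the per-column statistics. A minimal sketch of that arithmetic, with the counts copied from this report (the `coverage` helper is a name introduced here for illustration):

```python
# Counts copied from the per-column statistics in this report.
N_ROWS = 19_401
NULLS = {"isocodes": 11_479, "latitude": 11_472, "longitude": 11_472}


def coverage(n_rows: int, n_null: int) -> float:
    """Return the fraction of rows that hold a non-null value."""
    return (n_rows - n_null) / n_rows


for column, n_null in NULLS.items():
    print(f"{column}: {coverage(N_ROWS, n_null):.1%} populated")
```

The counts alone do not show whether the rows missing an ISO code are the same rows missing coordinates; that would need a row-level check against the CSV.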

Charts the summary said to look at first

macroarea · Shows Africa, Eurasia, and Papunesia dominate while Australia is smallest — useful for understanding geographic coverage.

Top values for macroarea (6 unique shown, of 6 total). Shares are of all 19,401 rows, so they sum to 95.7%; the remaining 4.3% of rows have no macroarea.
value	count	share
Africa	5955	30.7%
Eurasia	5028	25.9%
Papunesia	4847	25.0%
South America	1095	5.6%
North America	1035	5.3%
Australia	602	3.1%
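
The table's 30.7% for Africa and the 32.1% quoted elsewhere in this report differ only in the denominator. A short sketch, using the counts from the table above, that computes both shares and recovers the number of rows without a macroarea:

```python
# Counts copied from the macroarea table in this report.
N_ROWS = 19_401
counts = {
    "Africa": 5955,
    "Eurasia": 5028,
    "Papunesia": 4847,
    "South America": 1095,
    "North America": 1035,
    "Australia": 602,
}

n_labelled = sum(counts.values())   # rows that have a macroarea
n_null = N_ROWS - n_labelled        # rows with no macroarea

# Share of all rows (what the table reports).
share_all = {area: c / N_ROWS for area, c in counts.items()}
# Share of non-null rows (what top_rate reports).
share_labelled = {area: c / n_labelled for area, c in counts.items()}

print(n_null)
print(f"{share_all['Africa']:.1%} of all rows")
print(f"{share_labelled['Africa']:.1%} of rows with a macroarea")
```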

level · Reveals the dialect-vs-language split (about 56% dialects, 44% languages).

Top values for level (2 unique shown, of 2 total).
value	count	share
dialect	10920	56.3%
language	8481	43.7%

latitude · Shows the latitudinal spread of languoids; note ~59% are null so the chart only reflects the geocoded subset.

Histogram bins for latitude (median: 6.2918).
bin	count
-55.27 – -52.06	5
-52.06 – -48.85	1
-48.85 – -45.64	1
-45.64 – -42.43	4
-42.43 – -39.22	7
-39.22 – -36.01	16
-36.01 – -32.8	29
-32.8 – -29.59	26
-29.59 – -26.38	47
-26.38 – -23.17	77
-23.17 – -19.96	125
-19.96 – -16.75	141
-16.75 – -13.54	280
-13.54 – -10.33	256
-10.33 – -7.121	495
-7.121 – -3.911	788
-3.911 – -0.7005	681
-0.7005 – 2.51	378
2.51 – 5.72	468
5.72 – 8.93	663
8.93 – 12.14	710
12.14 – 15.35	303
15.35 – 18.56	384
18.56 – 21.77	233
21.77 – 24.98	318
24.98 – 28.19	371
28.19 – 31.4	167
31.4 – 34.61	143
34.61 – 37.82	178
37.82 – 41.03	113
41.03 – 44.24	138
44.24 – 47.45	79
47.45 – 50.66	77
50.66 – 53.87	76
53.87 – 57.08	46
57.08 – 60.29	21
60.29 – 63.5	41
63.5 – 66.71	23
66.71 – 69.93	14
69.93 – 73.14	6
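
The printed bin edges are consistent with 40 equal-width bins between the column minimum and maximum. The profiler's actual binning rule is not stated in this report, so the sketch below is an assumption that merely reproduces the edges shown; `bin_edges` and `bin_index` are illustrative names.

```python
def bin_edges(lo: float, hi: float, n_bins: int) -> list[float]:
    """Return n_bins + 1 equally spaced edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Return the 0-based bin for value; the maximum falls in the last bin."""
    width = (hi - lo) / n_bins
    return min(int((value - lo) / width), n_bins - 1)


# Latitude min and max as reported in the latitude column notes.
LAT_MIN, LAT_MAX = -55.2748, 73.1354
edges = bin_edges(LAT_MIN, LAT_MAX, 40)
print([round(e, 2) for e in edges[:3]])
```

The same rule applied to the longitude range gives a bin width of about 8.95 degrees, which matches the longitude table.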

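longitude · Shows the uneven east-west distribution of languoid locations, with three broad clusters (around -125 to -45, -18 to 54, and 72 to 170 degrees) separated by sparsely populated bands.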

Histogram bins for longitude (median: 47.565486).
bin	count
-178.8 – -169.8	13
-169.8 – -160.9	4
-160.9 – -151.9	10
-151.9 – -143	11
-143 – -134	10
-134 – -125.1	17
-125.1 – -116.1	123
-116.1 – -107.2	47
-107.2 – -98.21	78
-98.21 – -89.26	280
-89.26 – -80.31	59
-80.31 – -71.36	235
-71.36 – -62.41	218
-62.41 – -53.45	150
-53.45 – -44.5	60
-44.5 – -35.55	40
-35.55 – -26.6	0
-26.6 – -17.64	4
-17.64 – -8.692	105
-8.692 – 0.2605	275
0.2605 – 9.213	443
9.213 – 18.17	751
18.17 – 27.12	322
27.12 – 36.07	429
36.07 – 45.02	228
45.02 – 53.97	126
53.97 – 62.93	35
62.93 – 71.88	79
71.88 – 80.83	210
80.83 – 89.78	207
89.78 – 98.74	269
98.74 – 107.7	454
107.7 – 116.6	239
116.6 – 125.6	497
125.6 – 134.5	316
134.5 – 143.5	598
143.5 – 152.4	667
152.4 – 161.4	122
161.4 – 170.4	186
170.4 – 179.3	12

name · Most names are short (median 7 characters, ~72% one word), but a long tail extends to 58 characters.

Character-length distribution for name (mean: 9.211).
chars	count
1 – 2	52
2 – 4	455
4 – 5	4355
5 – 7	2890
7 – 8	3995
8 – 10	1115
10 – 11	813
11 – 12	1417
12 – 14	724
14 – 15	1264
15 – 17	439
17 – 18	524
18 – 20	214
20 – 21	175
21 – 22	343
22 – 24	143
24 – 25	173
25 – 27	58
27 – 28	86
28 – 30	26
30 – 31	26
31 – 32	43
32 – 34	12
34 – 35	24
35 – 37	11
37 – 38	8
38 – 39	3
39 – 41	1
41 – 42	2
42 – 44	1
44 – 45	3
45 – 47	1
47 – 48	2
48 – 49	1
49 – 51	0
51 – 52	0
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	2

Schema

7 columns

Per-column summary.
column	kind	null rate	unique	alerts
glottocode	text	0.0%	19,401	near_unique one_word short_text
name	text	0.0%	19,401	near_unique one_word
isocodes	text	59.2%	7,922	near_unique one_word null_rate short_text
level	categorical	0.0%	2
macroarea	categorical	4.3%	6
latitude	numeric	59.1%	7,786	null_rate
longitude	numeric	59.1%	7,745	null_rate

glottocode

text identifier near_unique one_word short_text

This column holds Glottocodes: fixed 8-character identifiers (len_min=len_max=8) that uniquely tag languoids (languages and dialects) in the Glottolog catalogue. Every one of the 19,401 rows is unique (n_unique=19401, duplicate_rate=0.0) and single-token (one_word_rate=1.0), matching the four-letters-plus-four-digits pattern visible in top_words like 'aala1237' and 'aari1239'. There are no nulls and no collisions, so this is a clean primary key rather than a feature. Treatment: Use as the primary key; left-join on this id rather than feeding it to a model. high · anthropic:claude-opus-4-7

n: 19,401
nulls: 0 (0.0%)
unique: 19,401
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 19,401
readability_flesch_mean: 93.3
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
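
Before relying on glottocode as a primary key, it is cheap to re-check the two properties the notes above depend on: shape and uniqueness. The sketch below is one way to do that. The regular expression is an assumption: it accepts four lowercase letters or digits followed by four digits, which is my understanding of the Glottolog convention and is slightly broader than the letters-only samples shown here. `check_primary_key` is an illustrative name.

```python
import re

# Assumed Glottocode shape: 4 lowercase alphanumerics + 4 digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def check_primary_key(codes):
    """Return (malformed codes in input order, sorted duplicate codes)."""
    malformed = [c for c in codes if not GLOTTOCODE_RE.fullmatch(c)]
    seen, duplicates = set(), set()
    for code in codes:
        if code in seen:
            duplicates.add(code)
        seen.add(code)
    return malformed, sorted(duplicates)


# The first two codes appear in this report; the rest are deliberately bad.
sample = ["aala1237", "aari1239", "AARI1239", "aari1239"]
malformed, duplicates = check_primary_key(sample)
print(malformed, duplicates)
```

On the full column both lists should come back empty if the profile above is accurate.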

name

text identifier near_unique one_word

The `name` column holds the human-readable languoid label. All 19,401 values are unique, with no nulls or duplicates, so it could act as a key, although glottocode is the cleaner identifier. Entries are terse (mean 9.2 characters, median 7 characters, median 1 word) and 71.7% are single words. The top tokens (`nuclear`, `language`, `central`, `southern`, `western`) are classificatory and geographic qualifiers rather than personal names. Vocabulary is wide (17,861 distinct words) for such short strings. Treatment: Treat as a unique label; do not feature-engineer directly, but tokenize if used for matching or embedding. high · anthropic:claude-opus-4-7

n: 19,401
nulls: 0 (0.0%)
unique: 19,401
len_min: 1
len_max: 58
len_mean: 9.211
len_median: 7
len_p95: 20
word_mean: 1.369
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 17,861
readability_flesch_mean: 60.53
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7169
allcaps_rate: 0
boilerplate_rate: 0
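
If the names are tokenized for matching, the one_word_rate and token counts reported above can be reproduced with a few lines. This is a sketch under simple assumptions: tokens are lowercased and split on whitespace, which may not match the profiler's tokenizer, and the sample names are made-up placeholders rather than rows from the dataset.

```python
from collections import Counter


def tokenize(name: str) -> list[str]:
    """Lowercase and split on whitespace (assumed tokenization)."""
    return name.lower().split()


def name_profile(names):
    """Return (share of single-token names, Counter of all tokens)."""
    token_lists = [tokenize(n) for n in names]
    one_word_rate = sum(len(t) == 1 for t in token_lists) / len(token_lists)
    vocab = Counter(tok for toks in token_lists for tok in toks)
    return one_word_rate, vocab


# Hypothetical placeholder names, for illustration only.
sample_names = [
    "Example",
    "Nuclear Example",
    "Central Example",
    "Southern Example Sign Language",
]
rate, vocab = name_profile(sample_names)
print(rate, vocab.most_common(2))
```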

isocodes

text identifier near_unique one_word null_rate short_text

This column holds 3-character ISO-style codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0). It is sparsely populated, with a null rate of 0.5917 (11,479 of 19,401 rows), and the 7,922 populated rows carry 7,922 distinct codes, so no code repeats (vocab_size=7922, n_duplicates=0). The top_words sample (aiw, aay, aas, kbt…) suggests ISO 639-3 language codes rather than country codes. Treatment: Treat as an optional foreign key; left-join to an ISO 639-3 lookup and flag the ~59% missing values rather than imputing them, since an identifier has no meaningful imputed value. high · anthropic:claude-opus-4-7

n: 19,401
nulls: 11,479 (59.2%)
unique: 7,922
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 7,922
readability_flesch_mean: 118.7
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
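
The flag-and-join treatment can be sketched as follows. Everything here is illustrative: `annotate_iso` is a hypothetical helper, the lookup table is a stand-in for a real ISO 639-3 code table, and the sample rows use placeholder glottocodes with codes from the ISO 639 private-use range (`qaa` to `qtz`) so that no real language is misdescribed.

```python
import re

ISO_RE = re.compile(r"[a-z]{3}")  # three lowercase letters


def annotate_iso(rows, iso_lookup):
    """Left-join rows to iso_lookup, flagging missing and malformed codes."""
    annotated = []
    for row in rows:
        code = row.get("isocodes") or None  # treat "" as missing
        annotated.append({
            **row,
            "has_iso": code is not None,
            "iso_valid": bool(code and ISO_RE.fullmatch(code)),
            "iso_name": iso_lookup.get(code) if code else None,
        })
    return annotated


# Hypothetical lookup and rows, for illustration only.
lookup = {"qaa": "Placeholder language A"}
rows = [
    {"glottocode": "xxxx0001", "isocodes": "qaa"},
    {"glottocode": "xxxx0002", "isocodes": ""},
    {"glottocode": "xxxx0003", "isocodes": "qab"},
]
result = annotate_iso(rows, lookup)
print([(r["has_iso"], r["iso_name"]) for r in result])
```

Keeping `has_iso` as its own column preserves the missingness signal, and an unmatched but well-formed code (the third row) stays visible instead of being dropped by the join.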

level

categorical label

A binary categorical flag distinguishing 'dialect' (10,920 rows, 56.3%) from 'language' (8,481 rows). The split is fairly balanced, with entropy at 0.989 of the maximum, and there are no nulls across all 19,401 rows. Treatment: One-hot or binary-encode for modelling. high · anthropic:claude-opus-4-7

n: 19,401
nulls: 0 (0.0%)
unique: 2
top_value: dialect
top_rate: 0.5629
cardinality: 2
entropy: 0.9886
entropy_ratio: 0.9886
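
The entropy figures above can be checked from the two counts, and the binary encoding is a one-liner. A minimal sketch, assuming the profiler uses base-2 Shannon entropy and divides by log2 of the cardinality to get entropy_ratio (the report does not say so explicitly, but the numbers agree):

```python
import math


def shannon_entropy(counts) -> float:
    """Base-2 Shannon entropy of a collection of category counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


# Counts copied from the level table in this report.
level_counts = {"dialect": 10_920, "language": 8_481}

entropy = shannon_entropy(level_counts.values())
entropy_ratio = entropy / math.log2(len(level_counts))
print(round(entropy, 4), round(entropy_ratio, 4))


def encode_level(level: str) -> int:
    """Binary-encode level: 1 for 'dialect', 0 for 'language'."""
    return 1 if level == "dialect" else 0
```

With only two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio are identical for this column.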

macroarea

categorical feature

Categorical macro-region label with just 6 values covering the world's linguistic areas (Africa, Eurasia, Papunesia, South America, North America, Australia). Distribution is moderately balanced (entropy ratio 0.84): Africa leads with 32.1% of non-null rows (30.7% of all rows) and Australia is smallest at 602 rows; 4.32% of rows are null. No single category dominates, but the Americas and Australia are markedly underrepresented relative to Africa and Eurasia. Treatment: one-hot or target-encode; impute or flag the ~4% missing before modelling. high · anthropic:claude-opus-4-7

n: 19,401
nulls: 839 (4.3%)
unique: 6
top_value: Africa
top_rate: 0.3208
cardinality: 6
entropy: 2.176
entropy_ratio: 0.8418
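
One way to follow the treatment above is to one-hot encode the six values and give missing rows their own indicator column instead of imputing a region. The sketch below is illustrative; `one_hot_macroarea` and the `macroarea_` column prefix are names introduced here.

```python
# The six macroarea values listed in this report.
MACROAREAS = [
    "Africa",
    "Eurasia",
    "Papunesia",
    "South America",
    "North America",
    "Australia",
]


def one_hot_macroarea(value):
    """One-hot encode a macroarea, with an explicit missing indicator."""
    is_missing = value is None or value == ""
    encoded = {f"macroarea_{a}": int(value == a) for a in MACROAREAS}
    encoded["macroarea_missing"] = int(is_missing)
    return encoded


print(one_hot_macroarea("Papunesia"))
print(one_hot_macroarea(None))
```

A value outside the six known labels encodes as all zeros with `macroarea_missing` also 0, so unexpected labels remain detectable downstream.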

latitude

numeric feature null_rate

Geographic latitude coordinates ranging from -55.2748 to 73.1354, consistent with a worldwide point dataset. Nearly 59% of rows are null, so any spatial analysis covers only the geocoded 41%. Among those, the median of 6.29 and mean of 8.16 indicate a slight northern-hemisphere lean, and the mean sitting above the median matches the moderate positive skew (0.54). Treatment: Pair with longitude for geospatial features; impute or filter the 59% nulls before mapping. high · anthropic:claude-opus-4-7

n: 19,401
nulls: 11,472 (59.1%)
unique: 7,786
min: -55.27
max: 73.14
mean: 8.164
median: 6.292
std: 18.96
q1: -5.139
q3: 19.27
iqr: 24.41
skew: 0.5425
kurtosis: 0.3048
n_outliers: 135
outlier_rate: 0.01703
zero_rate: 0
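
The report does not say how n_outliers is computed. A common choice is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles; the sketch below applies that rule to the quartiles listed above as an assumption, not as the profiler's documented method. `tukey_fences` and `count_outliers` are illustrative names.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, q1: float, q3: float, k: float = 1.5) -> int:
    """Count non-null values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return sum(1 for v in values if v is not None and (v < lower or v > upper))


# Latitude quartiles copied from the statistics above.
lat_lower, lat_upper = tukey_fences(-5.139, 19.27)
print(round(lat_lower, 2), round(lat_upper, 2))
```

Under this rule, latitudes below about -41.75 or above about 55.88 would count as outliers. The reported outlier_rate of 0.01703 equals 135 divided by the 7,929 non-null values, so the rate is taken over geocoded rows only.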

longitude

numeric feature null_rate

Geographic longitude coordinate spanning nearly the full globe (min -178.785, max 179.306) with median 47.57, so half of the geocoded languoids lie east of about 47.6°E. The 59.13% null rate is the dominant concern, since most records lack a location; only 13 outliers (0.16% of non-null values) appear and skew is mild (-0.48). Treatment: Impute or flag the 59% missing values before any geospatial modelling; pair with latitude for joint use. high · anthropic:claude-opus-4-7

n: 19,401
nulls: 11,472 (59.1%)
unique: 7,745
min: -178.8
max: 179.3
mean: 51.22
median: 47.57
std: 81.15
q1: 7.18
q3: 124.1
iqr: 117
skew: -0.4814
kurtosis: -0.7765
n_outliers: 13
outlier_rate: 0.00164
zero_rate: 0
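
Because latitude and longitude are only useful together, a practical first step is to split the rows into those with a complete, in-range coordinate pair and those without. The sketch below assumes the CSV stores missing coordinates as empty fields and uses the column names from the schema above; the inline sample uses placeholder glottocodes and is not taken from the dataset.

```python
import csv
import io


def parse_coord(text):
    """Parse a coordinate field; return None for empty or missing text."""
    text = (text or "").strip()
    return float(text) if text else None


def split_by_coordinates(rows):
    """Split rows into (located, unlocated) by a valid lat/lon pair."""
    located, unlocated = [], []
    for row in rows:
        lat = parse_coord(row.get("latitude"))
        lon = parse_coord(row.get("longitude"))
        valid = (
            lat is not None
            and lon is not None
            and -90.0 <= lat <= 90.0
            and -180.0 <= lon <= 180.0
        )
        (located if valid else unlocated).append(row)
    return located, unlocated


# Hypothetical sample standing in for glottolog_languoid.csv.
SAMPLE_CSV = """glottocode,latitude,longitude
xxxx0001,6.29,47.57
xxxx0002,,
xxxx0003,95.0,10.0
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
located, unlocated = split_by_coordinates(rows)
print(len(located), len(unlocated))
```

Filtering this way keeps the unlocated rows available for non-spatial analysis rather than discarding them.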