glottolog languages

source: /home/coolhand/html/datavis/data_trove/cache/glottolog_languages.parquet · 27,037 rows · 15 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Glottolog catalogue of 27,037 entries with 15 columns covering identifiers (Glottocode, ISO codes), geographic info (Latitude, Longitude, Countries, Macroarea), classification (Family_ID, Level, Is_Isolate), and documentation years. The Level column shows the catalogue is split across dialects (about 50%), languages (about 32%), and families (about 18%), while Macroarea is dominated by Eurasia and Africa with Papunesia close behind. The Family_ID distribution is concentrated in a few large families: the top three (atla1278, aust1307, indo1319) hold about 45% of rows out of 297 families in total. Note that the documentation-year fields are almost entirely null (Last_Year ~96%, First_Year ~99%), so they are unreliable for analysis. Is_Isolate is null for ~68% of rows; its 8,612 populated rows equal the number of language-level entries, so the gaps may be structural and the flag is best analysed within that subset. The geographic coordinates are nearly complete (about 1.8% null) and would support mapping work.

citing: Level · Macroarea · Family_ID · Countries · Is_Isolate · Last_Year_Of_Documentation · First_Year_Of_Documentation · Latitude · Longitude · Name

Charts the summary said to look at first

Level · Roughly half of all entries are dialects, with languages and families making up the rest.

Top values for Level (3 unique shown, of 3 total).
value	count	share
dialect	13593	50.3%
language	8612	31.9%
family	4832	17.9%

Macroarea · Eurasia and Africa dominate, together accounting for over half of the entries.

Top values for Macroarea (20 unique shown, of 30 total).
value	count	share
Eurasia	8060	29.8%
Africa	8020	29.7%
Papunesia	6326	23.4%
North America	1782	6.6%
South America	1524	5.6%
Australia	919	3.4%
Africa;Eurasia	29	0.1%
Eurasia;Papunesia	22	0.1%
Africa;Eurasia;North America;Papunesia;South America	18	0.1%
Africa;Australia;Eurasia;North America;Papunesia;South America	17	0.1%
North America;South America	15	0.1%
Eurasia;North America	12	0.0%
Africa;North America	12	0.0%
Eurasia;South America	11	0.0%
Eurasia;Papunesia;South America	8	0.0%
Africa;Eurasia;Papunesia;South America	7	0.0%
Eurasia;North America;South America	5	0.0%
Eurasia;North America;Papunesia;South America	4	0.0%
Africa;Australia;Eurasia;North America;Papunesia	3	0.0%
Papunesia;South America	3	0.0%

Family_ID · Three large families (atla1278, aust1307, indo1319) carry about 45% of the rows, out of 297 families.

Top values for Family_ID (20 unique shown, of 297 total).
value	count	share
atla1278	4861	18.0%
aust1307	4108	15.2%
indo1319	3173	11.7%
sino1245	1926	7.1%
afro1255	1458	5.4%
nucl1709	834	3.1%
pama1250	642	2.4%
aust1305	526	1.9%
otom1299	385	1.4%
book1242	382	1.4%
sign1238	343	1.3%
mand1469	322	1.2%
drav1251	281	1.0%
turk1311	273	1.0%
cent2225	267	1.0%
taik1256	261	1.0%
ural1272	236	0.9%
nilo1247	235	0.9%
nakh1245	190	0.7%
araw1281	188	0.7%

Latitude · Shows the geographic spread of languages, skewed toward equatorial and northern latitudes.

Histogram bins for Latitude (median: 8.52697).
bin	count
-55.27 – -52.06	20
-52.06 – -48.85	6
-48.85 – -45.64	3
-45.64 – -42.43	11
-42.43 – -39.22	17
-39.22 – -36.01	51
-36.01 – -32.8	70
-32.8 – -29.59	90
-29.59 – -26.38	139
-26.38 – -23.17	276
-23.17 – -19.96	323
-19.96 – -16.75	446
-16.75 – -13.54	670
-13.54 – -10.33	665
-10.33 – -7.121	1558
-7.121 – -3.911	2186
-3.911 – -0.7005	1999
-0.7005 – 2.51	1102
2.51 – 5.72	1636
5.72 – 8.93	2277
8.93 – 12.14	2361
12.14 – 15.35	993
15.35 – 18.56	1013
18.56 – 21.77	696
21.77 – 24.98	956
24.98 – 28.19	1358
28.19 – 31.4	620
31.4 – 34.61	733
34.61 – 37.82	938
37.82 – 41.03	558
41.03 – 44.24	736
44.24 – 47.45	374
47.45 – 50.66	418
50.66 – 53.87	527
53.87 – 57.08	185
57.08 – 60.29	143
60.29 – 63.5	204
63.5 – 66.71	109
66.71 – 69.93	66
69.93 – 73.14	25
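
The 40 bins listed above look like equal-width intervals between the column's minimum and maximum. The following is a minimal sketch that reproduces the bin edges under that assumption, using the unrounded extremes reported for Latitude (-55.2748 and 73.1354). The function name is illustrative and not part of the profiler.

```python
def equal_width_edges(lo, hi, n_bins):
    """Return the n_bins + 1 edges of equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

# Assumed inputs: the reported Latitude minimum and maximum, and 40 bins.
edges = equal_width_edges(-55.2748, 73.1354, 40)
bin_width = edges[1] - edges[0]

# Print the first few edges rounded the way the table displays them.
print([round(e, 2) for e in edges[:4]], round(bin_width, 3))
```

Each bin is about 3.21 degrees wide, which matches the spacing of the table rows.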

Countries · Papua New Guinea, Indonesia, and Nigeria lead — useful context for where linguistic diversity concentrates.

Top values for Countries (20 unique shown, of 737 total).
value	count	share
PG	905	3.3%
ID	708	2.6%
NG	512	1.9%
AU	476	1.8%
IN	402	1.5%
MX	316	1.2%
CN	315	1.2%
BR	277	1.0%
US	255	0.9%
CM	205	0.8%
PH	188	0.7%
CD	162	0.6%
VU	129	0.5%
RU	104	0.4%
TZ	103	0.4%
PE	102	0.4%
MY	88	0.3%
TD	88	0.3%
NP	82	0.3%
CO	80	0.3%

Schema

15 columns

Per-column summary.
Column	Type	Null rate	Unique	Alerts
ID	text	0.0%	27,037	near_unique one_word short_text
Name	text	0.0%	27,037	near_unique one_word
Macroarea	categorical	0.8%	30
Latitude	numeric	1.8%	13,231
Longitude	numeric	1.8%	13,203
Glottocode	text	0.0%	27,037	near_unique one_word short_text
ISO639P3code	text	69.7%	8,180	near_unique one_word null_rate short_text
Level	categorical	0.0%	3
Countries	categorical	66.4%	737	null_rate
Family_ID	categorical	1.6%	297
Language_ID	text	49.7%	3,110	one_word null_rate short_text duplicates
Closest_ISO369P3code	text	21.3%	8,180	one_word null_rate short_text duplicates
First_Year_Of_Documentation	numeric	99.2%	114	null_rate
Last_Year_Of_Documentation	numeric	96.0%	269	null_rate high_skew outliers
Is_Isolate	categorical	68.1%	2	null_rate imbalance

ID

text identifier near_unique one_word short_text

Fixed-length 8-character single-token codes (len_min=len_max=8, word_mean=1.0) that are perfectly unique across all 27037 rows with zero nulls or duplicates. Sample values like 'cent1996' and 'chan1318' look like 4-letter prefix plus 4-digit suffix codes, consistent with Glottolog-style language identifiers rather than arbitrary surrogate keys. Treatment: Use as the row key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 0 (0.0%)
unique: 27,037
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 20,000
readability_flesch_mean: 92.03
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

Name

text identifier near_unique one_word

`Name` is a fully unique short label column (27037 rows, 27037 distinct values, no nulls or duplicates), with a mean length of 10.4 characters and 66.7% of entries being a single word. The vocabulary of 18126 tokens skews toward geographic and positional descriptors ('nuclear', 'central', 'western', 'northern', 'eastern', 'southern' lead the frequency list), which fits names of languages, dialects, and family subgroups rather than personal names. Perfect uniqueness combined with short, often one-word values marks it as an identifier-like label. Treatment: Treat as a unique key; drop from modelling features or use only for joins and display. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 0 (0.0%)
unique: 27,037
len_min: 1
len_max: 109
len_mean: 10.44
len_median: 8
len_p95: 23
word_mean: 1.444
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 18,126
readability_flesch_mean: 29.91
emoji_rate: 0
url_rate: 0
one_word_rate: 0.6675
allcaps_rate: 0.0001479
boilerplate_rate: 0

Macroarea

categorical feature

Glottolog macroarea label assigning each entry to a world region. Six base regions dominate (Eurasia 8060, Africa 8020, Papunesia 6326, North America 1782, South America 1524, Australia 919), but cardinality is 30 because some rows carry semicolon-joined multi-region strings such as 'Africa;Eurasia' (29) or all six regions concatenated (17). Null rate is low at 0.83%, and the entropy_ratio of 0.46 reflects the heavy Eurasia/Africa/Papunesia concentration (top_rate 0.30). Treatment: Split the semicolon-delimited compound values into a multi-hot encoding over the six base regions before modelling. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 224 (0.8%)
unique: 30
top_value: Eurasia
top_rate: 0.3006
cardinality: 30
entropy: 2.271
entropy_ratio: 0.4628
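
The multi-hot treatment suggested for Macroarea can be sketched as follows. This is a minimal illustration, assuming the semicolon is the only delimiter and that the six base regions are exactly those seen in the top values; `macroarea_multi_hot` is a hypothetical helper name.

```python
# The six base regions observed in the Macroarea top values.
BASE_REGIONS = [
    "Africa", "Australia", "Eurasia",
    "North America", "Papunesia", "South America",
]

def macroarea_multi_hot(value):
    """Return one 0/1 flag per base region; None or '' gives all zeros."""
    parts = {p.strip() for p in value.split(";")} if value else set()
    return {region: int(region in parts) for region in BASE_REGIONS}

single = macroarea_multi_hot("Eurasia")
compound = macroarea_multi_hot("Africa;Eurasia")
missing = macroarea_multi_hot(None)
print(compound)
```

Keeping nulls as an all-zero row preserves the 0.8% of missing values without inventing a region for them.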

Latitude

numeric feature

Geographic latitude in decimal degrees, spanning -55.2748 to 73.1354, which fits the global range. The distribution is mildly right-skewed (0.42) with a median of 8.52697, consistent with land mass concentrated in the Northern Hemisphere. About 1.77% of rows are null and only 48 outliers (0.18%) sit outside the IQR fence, so the column is largely clean. Treatment: Pair with longitude for geospatial features; impute or drop the 1.77% nulls before modelling. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 479 (1.8%)
unique: 13,231
min: -55.27
max: 73.14
mean: 11.59
median: 8.527
std: 20.57
q1: -3.747
q3: 26
iqr: 29.75
skew: 0.4211
kurtosis: -0.1912
n_outliers: 48
outlier_rate: 0.001807
zero_rate: 0

Longitude

numeric feature

This column captures geographic longitude, with values spanning -178.785 to 179.43 — essentially the full -180/180 globe. The distribution is wide (std 74.05, IQR 110.17) and slightly left-skewed (-0.47), with 13,203 unique values across 27,037 rows and a 1.77% null rate. Only 51 outliers (0.19%) flag, which is expected since longitude is bounded. Treatment: Pair with latitude for geospatial features; consider sin/cos encoding to handle the -180/180 wraparound. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 479 (1.8%)
unique: 13,203
min: -178.8
max: 179.4
mean: 51.82
median: 44.07
std: 74.05
q1: 9.225
q3: 119.4
iqr: 110.2
skew: -0.468
kurtosis: -0.4518
n_outliers: 51
outlier_rate: 0.00192
zero_rate: 0
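
The sin/cos encoding mentioned in the Longitude treatment can be sketched as below. It is a minimal example, not the profiler's own code, and the helper names are illustrative. It uses the column's reported minimum and maximum to show that two points on either side of the antimeridian end up close together after encoding.

```python
import math

def encode_longitude(lon_deg):
    """Map a longitude in degrees to a (sin, cos) pair on the unit circle."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)

def encoded_distance(lon_a, lon_b):
    """Euclidean distance between two encoded longitudes (ranges 0 to 2)."""
    return math.dist(encode_longitude(lon_a), encode_longitude(lon_b))

west, east = -178.785, 179.43  # reported min and max of the column
raw_gap = abs(west - east)                 # large in raw degrees
wrapped_gap = encoded_distance(west, east)  # small after encoding
print(round(raw_gap, 3), round(wrapped_gap, 4))
```

The raw values differ by more than 358 degrees even though the two meridians are under 2 degrees apart, which is the distortion the encoding removes.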

Glottocode

text identifier near_unique one_word short_text

This column holds Glottocodes, the standard 8-character identifiers used by the Glottolog language catalogue (e.g. 'cent1996', 'chan1318'). Every one of the 27,037 rows is unique with a fixed length of 8 and exactly one word, and there are no nulls or duplicates, so it functions as a primary key for every entry (families, languages, and dialects). Its profile is identical to that of ID, down to the sample values, so the two columns probably duplicate each other; confirm this and keep one. Treatment: Use as a primary key to left-join against Glottolog metadata; do not feed into models as a feature. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 0 (0.0%)
unique: 27,037
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 20,000
readability_flesch_mean: 92.03
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

ISO639P3code

text foreign_key near_unique one_word null_rate short_text

This column holds ISO 639-3 language codes: exactly 3 characters, one word, every value lowercase alphabetic. It is 69.75% null, and all 8,180 populated rows carry a distinct code, so each code maps to exactly one entry, consistent with a language-registry foreign key rather than a feature. No duplicates or empties among the populated rows. Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and encode missingness explicitly. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 18,857 (69.7%)
unique: 8,180
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 8,180
readability_flesch_mean: 119.1
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

Level

categorical label

This is a low-cardinality categorical taxonomy field with exactly 3 levels: dialect, language, and family. The distribution is uneven but not pathological: dialect leads at 50.3% (13,593 of 27,037), followed by language (8,612, 31.9%) and family (4,832, 17.9%), yielding an entropy ratio of 0.93. There are no nulls, as expected for a curated classification scheme. Treatment: One-hot or ordinal encode for modelling; safe to use as a stratification key. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 0 (0.0%)
unique: 3
top_value: dialect
top_rate: 0.5028
cardinality: 3
entropy: 1.468
entropy_ratio: 0.9265
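
The entropy and entropy_ratio figures for Level can be reproduced from the three counts. The sketch below assumes the profiler uses Shannon entropy in bits and normalises by the base-2 logarithm of the cardinality; that is an inference from the reported numbers, not something the report states.

```python
import math

# Counts for Level as reported in the value table.
level_counts = {"dialect": 13593, "language": 8612, "family": 4832}

def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a mapping from category to count."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

entropy = shannon_entropy_bits(level_counts)
# Assumed normalisation: divide by the maximum entropy for 3 categories.
entropy_ratio = entropy / math.log2(len(level_counts))
print(round(entropy, 3), round(entropy_ratio, 4))
```

Under these assumptions the results agree with the reported 1.468 and 0.9265 to the precision shown.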

Countries

categorical feature null_rate

Two-letter ISO country codes, with 737 distinct values across 27,037 rows. Two-thirds of rows are null (null_rate 0.6641), and even among present values the distribution is broad (entropy_ratio 0.69): PG, the top value, covers just 9.97% of non-null rows (3.3% of all rows). The 737 distinct values are surprising since ISO 3166-1 alpha-2 defines only about 250 codes, which suggests multi-country concatenations or non-standard codes mixed in. Treatment: Split compound values and normalize non-standard codes, add an explicit missing indicator, then group rare levels before encoding. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 17,956 (66.4%)
unique: 737
top_value: PG
top_rate: 0.09966
cardinality: 737
entropy: 6.562
entropy_ratio: 0.6888
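
One way to test the multi-country hypothesis for Countries is to pull every two-letter uppercase token out of each value and count the distinct codes that remain. The report does not say which delimiter the compound values use, so this sketch avoids assuming one; the sample values are illustrative and not taken from the dataset.

```python
import re
from collections import Counter

# Matches standalone two-letter uppercase tokens, whatever separates them.
CODE = re.compile(r"\b[A-Z]{2}\b")

def split_countries(value):
    """Return the list of two-letter codes found in a Countries value."""
    return CODE.findall(value) if value else []

# Illustrative values only (hypothetical delimiters, hypothetical combinations).
sample = ["PG", "ID;PG", "NG CM", None]
code_counts = Counter(code for value in sample for code in split_countries(value))
print(code_counts.most_common())
```

If the number of distinct codes after splitting falls to roughly 250 or fewer, the 737 raw values are explained by concatenation rather than by non-standard codes.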

Family_ID

categorical foreign_key

Family_ID holds Glottolog-style language family codes (e.g., atla1278, aust1307, indo1319), making it a categorical grouping key across 27,037 rows with 297 distinct families. The distribution is heavily skewed: the top family atla1278 alone covers 18.27% of non-null rows (18.0% of all rows), and the top three together account for about 45%, yielding an entropy ratio of 0.60. Null rate is low at 1.59%. Treatment: Left-join on this id to a language-family reference, or group by it for stratified analysis. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 429 (1.6%)
unique: 297
top_value: atla1278
top_rate: 0.1827
cardinality: 297
entropy: 4.938
entropy_ratio: 0.6011
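
The value table reports shares of all rows, while top_rate appears to be computed over non-null rows only. A short worked calculation, using only counts from this report, shows how the two denominators give 18.0% and 18.27% for the same family, and what the top three families add up to.

```python
n_rows = 27037
n_null = 429
n_non_null = n_rows - n_null

# Counts of the three largest families from the value table.
top3 = {"atla1278": 4861, "aust1307": 4108, "indo1319": 3173}

top1_share_all = top3["atla1278"] / n_rows           # denominator: all rows
top1_share_non_null = top3["atla1278"] / n_non_null  # denominator: non-null rows
top3_share_all = sum(top3.values()) / n_rows
top3_share_non_null = sum(top3.values()) / n_non_null

print(f"{top1_share_all:.4f} {top1_share_non_null:.4f}")
print(f"{top3_share_all:.4f} {top3_share_non_null:.4f}")
```

Either way the three largest families hold a little under half of the rows.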

Language_ID

text foreign_key one_word null_rate short_text duplicates

This column holds 8-character single-token codes (len_min/max=8, one_word_rate=1.0) that look like Glottolog language identifiers (e.g., 'nucl1643', 'stan1293'). With 3110 unique values among 13,593 populated rows and a 0.7712 duplicate rate, it behaves like a categorical foreign key into a language registry. Note that 0.4972 of rows are null. The populated count (13,593) equals the number of dialect rows in Level, which suggests the field links each dialect to its parent language and that the nulls are structural rather than missing data; verify this against Level before imputing. Treatment: Left-join on this id to a language reference table; treat missing as a separate category. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 13,444 (49.7%)
unique: 3,110
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 10,483
duplicate_rate: 0.7712
vocab_size: 3,110
readability_flesch_mean: 86.53
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
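
The non-null count for Language_ID (27,037 − 13,444 = 13,593) equals the number of dialect rows in Level, so it is worth checking whether the column is populated exactly for dialects. The sketch below tallies nulls per level; the rows and IDs are placeholders for illustration, and `null_counts_by_level` is a hypothetical helper.

```python
from collections import defaultdict

def null_counts_by_level(rows, column="Language_ID"):
    """Return {level: (n_null, n_total)} for the given column."""
    tallies = defaultdict(lambda: [0, 0])
    for row in rows:
        tally = tallies[row["Level"]]
        tally[0] += int(row.get(column) is None)
        tally[1] += 1
    return {level: tuple(tally) for level, tally in tallies.items()}

# Illustrative rows only; the IDs are placeholders, not real Glottocodes.
rows = [
    {"ID": "aaaa0001", "Level": "family", "Language_ID": None},
    {"ID": "aaaa0002", "Level": "language", "Language_ID": None},
    {"ID": "aaaa0003", "Level": "dialect", "Language_ID": "aaaa0002"},
    {"ID": "aaaa0004", "Level": "dialect", "Language_ID": "aaaa0002"},
]
summary = null_counts_by_level(rows)

# Structural nulls: never null for dialects, always null for other levels.
structural = all(
    (n_null == 0) if level == "dialect" else (n_null == n_total)
    for level, (n_null, n_total) in summary.items()
)
print(summary, structural)
```

If the same check holds on the real data, the nulls carry no information beyond Level and need no imputation.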

Closest_ISO369P3code

text feature one_word null_rate short_text duplicates

This column holds ISO 639-3 three-letter language codes: every value is exactly 3 characters and one word (len_mean 3.0, one_word_rate 1.0), with 8180 unique codes led by jpn (120), eng (115), and pes (64). Notable signals: 21.28% nulls and a 61.57% duplicate rate (13103 duplicates), so coverage is partial but the field is a clean categorical. Treatment: Treat as a categorical language code; impute or flag the 21% nulls and join to an ISO 639-3 reference table for names/families. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 5,754 (21.3%)
unique: 8,180
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 13,103
duplicate_rate: 0.6157
vocab_size: 7,877
readability_flesch_mean: 117.4
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

First_Year_Of_Documentation

numeric metadata null_rate

This column appears to record the earliest year in which a language was documented, spanning from -2100 (BCE) to 1932 CE with a median of 711. Severe nullity is the headline: 99.2% of the 27,037 rows are missing, leaving only 215 populated values across 114 unique years. The wide interquartile range (q1 -300 to q3 1710.5, IQR about 2010) and negative skew (-0.46) indicate a long tail into antiquity rather than a modern-era concentration. Treatment: Drop or treat as a sparse indicator; too null to use as a feature without heavy imputation. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 26,822 (99.2%)
unique: 114
min: -2,100
max: 1,932
mean: 673.7
median: 711
std: 1055
q1: -300
q3: 1710
iqr: 2010
skew: -0.4581
kurtosis: -0.9206
n_outliers: 0
outlier_rate: 0
zero_rate: 0

Last_Year_Of_Documentation

numeric timestamp null_rate high_skew outliers

This appears to be the last year in which a language was documented, populated for only ~4% of rows (null_rate 0.9605, 1,068 non-null values). Values span a very wide range from -3100 to 2024 with a median of 1960, and the heavy left skew (-3.35) plus kurtosis of 12.3 yields 170 outliers (15.9% of non-null entries). The negative minimum suggests BCE-style dating, as in First_Year_Of_Documentation (minimum -2100), though sentinel values cannot be ruled out. Treatment: Validate the year range before clipping, since early dates may be legitimate; treat the column as mostly missing and flag presence rather than relying on the raw value. high · anthropic:claude-opus-4-7

n: 27,037
nulls: 25,969 (96.0%)
unique: 269
min: -3,100
max: 2,024
mean: 1700
median: 1,960
std: 699.3
q1: 1858
q3: 1987
iqr: 129.5
skew: -3.345
kurtosis: 12.32
n_outliers: 170
outlier_rate: 0.1592
zero_rate: 0
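
To see which years the outlier flag is catching, the fences can be recomputed from the reported quartiles. This sketch assumes the profiler uses the conventional Tukey rule with a 1.5 × IQR multiplier, which the report does not confirm. The quartiles shown are rounded, so the fences are approximate.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Rounded quartiles reported for Last_Year_Of_Documentation.
lower, upper = tukey_fences(1858, 1987)
column_max = 2024

# The maximum sits below the upper fence, so no value can be a high outlier.
high_outliers_possible = column_max > upper
print(lower, upper, high_outliers_possible)
```

Under that assumption every one of the 170 outliers is a year earlier than roughly 1664, which is what a column with many historically attested languages would produce.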

Is_Isolate

categorical feature null_rate imbalance

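Boolean flag indicating isolate status, present on only ~32% of the 27,037 rows (null_rate 0.6815). Among non-null values, 'False' dominates at 0.9789 with just 182 'True' cases, yielding very low entropy (0.148). The 8,612 populated rows equal the number of language-level rows in Level, which suggests the flag is defined only for languages and that the nulls are structural. Treatment: Add a missingness indicator rather than imputing, unless the nulls prove to be genuine gaps; near-constant, so expect little predictive lift. high · anthropic:claude-opus-4-7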

n: 27,037
nulls: 18,425 (68.1%)
unique: 2
top_value: False
top_rate: 0.9789
cardinality: 2
entropy: 0.1478
entropy_ratio: 0.1478
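
A missingness indicator for Is_Isolate can be sketched as a pair of integer flags, so that a null is never silently read as 'False'. This is a minimal illustration assuming the values arrive as the strings 'True' and 'False' or as null; `encode_is_isolate` is a hypothetical helper name.

```python
def encode_is_isolate(value):
    """Map 'True' / 'False' / None to (is_isolate, is_missing) integer flags."""
    if value is None:
        return 0, 1
    return int(str(value).strip().lower() == "true"), 0

encoded = [encode_is_isolate(v) for v in ("True", "False", None)]
print(encoded)
```

Keeping the two flags separate lets a model, or a simple group-by, distinguish non-isolates from rows where the flag was never set.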