glottolog languages
Reading
This dataset is a Glottolog catalogue of 27,037 entries in 15 columns covering identifiers (Glottocode, ISO codes), geography (Latitude, Longitude, Countries, Macroarea), classification (Family_ID, Level, Is_Isolate), and documentation years. Level splits the catalogue into dialects (50.3%), languages (31.9%), and families (17.9%). Macroarea is led by Eurasia (29.8%) and Africa (29.7%), with Papunesia (23.4%) close behind. Family_ID has 297 distinct values but is concentrated in a few large families: atla1278, aust1307, and indo1319 together cover about 45% of rows. The documentation-year fields are almost entirely null (Last_Year_Of_Documentation 96.0%, First_Year_Of_Documentation 99.2%) and Is_Isolate is missing for 68.1% of rows, so those columns are too sparse to analyse across the full catalogue. Latitude and Longitude are nearly complete (1.8% null each) and would support mapping work.
citing: Level · Macroarea · Family_ID · Countries · Is_Isolate · Last_Year_Of_Documentation · First_Year_Of_Documentation · Latitude · Longitude · Name
Charts the summary said to look at first
Level: value counts
| value | count | share |
|---|---|---|
| dialect | 13593 | 50.3% |
| language | 8612 | 31.9% |
| family | 4832 | 17.9% |
Macroarea: top 20 values
| value | count | share |
|---|---|---|
| Eurasia | 8060 | 29.8% |
| Africa | 8020 | 29.7% |
| Papunesia | 6326 | 23.4% |
| North America | 1782 | 6.6% |
| South America | 1524 | 5.6% |
| Australia | 919 | 3.4% |
| Africa;Eurasia | 29 | 0.1% |
| Eurasia;Papunesia | 22 | 0.1% |
| Africa;Eurasia;North America;Papunesia;South America | 18 | 0.1% |
| Africa;Australia;Eurasia;North America;Papunesia;South America | 17 | 0.1% |
| North America;South America | 15 | 0.1% |
| Eurasia;North America | 12 | 0.0% |
| Africa;North America | 12 | 0.0% |
| Eurasia;South America | 11 | 0.0% |
| Eurasia;Papunesia;South America | 8 | 0.0% |
| Africa;Eurasia;Papunesia;South America | 7 | 0.0% |
| Eurasia;North America;South America | 5 | 0.0% |
| Eurasia;North America;Papunesia;South America | 4 | 0.0% |
| Africa;Australia;Eurasia;North America;Papunesia | 3 | 0.0% |
| Papunesia;South America | 3 | 0.0% |
Family_ID: top 20 values
| value | count | share |
|---|---|---|
| atla1278 | 4861 | 18.0% |
| aust1307 | 4108 | 15.2% |
| indo1319 | 3173 | 11.7% |
| sino1245 | 1926 | 7.1% |
| afro1255 | 1458 | 5.4% |
| nucl1709 | 834 | 3.1% |
| pama1250 | 642 | 2.4% |
| aust1305 | 526 | 1.9% |
| otom1299 | 385 | 1.4% |
| book1242 | 382 | 1.4% |
| sign1238 | 343 | 1.3% |
| mand1469 | 322 | 1.2% |
| drav1251 | 281 | 1.0% |
| turk1311 | 273 | 1.0% |
| cent2225 | 267 | 1.0% |
| taik1256 | 261 | 1.0% |
| ural1272 | 236 | 0.9% |
| nilo1247 | 235 | 0.9% |
| nakh1245 | 190 | 0.7% |
| araw1281 | 188 | 0.7% |
Latitude: histogram bins
| bin | count |
|---|---|
| -55.27 – -52.06 | 20 |
| -52.06 – -48.85 | 6 |
| -48.85 – -45.64 | 3 |
| -45.64 – -42.43 | 11 |
| -42.43 – -39.22 | 17 |
| -39.22 – -36.01 | 51 |
| -36.01 – -32.8 | 70 |
| -32.8 – -29.59 | 90 |
| -29.59 – -26.38 | 139 |
| -26.38 – -23.17 | 276 |
| -23.17 – -19.96 | 323 |
| -19.96 – -16.75 | 446 |
| -16.75 – -13.54 | 670 |
| -13.54 – -10.33 | 665 |
| -10.33 – -7.121 | 1558 |
| -7.121 – -3.911 | 2186 |
| -3.911 – -0.7005 | 1999 |
| -0.7005 – 2.51 | 1102 |
| 2.51 – 5.72 | 1636 |
| 5.72 – 8.93 | 2277 |
| 8.93 – 12.14 | 2361 |
| 12.14 – 15.35 | 993 |
| 15.35 – 18.56 | 1013 |
| 18.56 – 21.77 | 696 |
| 21.77 – 24.98 | 956 |
| 24.98 – 28.19 | 1358 |
| 28.19 – 31.4 | 620 |
| 31.4 – 34.61 | 733 |
| 34.61 – 37.82 | 938 |
| 37.82 – 41.03 | 558 |
| 41.03 – 44.24 | 736 |
| 44.24 – 47.45 | 374 |
| 47.45 – 50.66 | 418 |
| 50.66 – 53.87 | 527 |
| 53.87 – 57.08 | 185 |
| 57.08 – 60.29 | 143 |
| 60.29 – 63.5 | 204 |
| 63.5 – 66.71 | 109 |
| 66.71 – 69.93 | 66 |
| 69.93 – 73.14 | 25 |
Countries: top 20 values
| value | count | share |
|---|---|---|
| PG | 905 | 3.3% |
| ID | 708 | 2.6% |
| NG | 512 | 1.9% |
| AU | 476 | 1.8% |
| IN | 402 | 1.5% |
| MX | 316 | 1.2% |
| CN | 315 | 1.2% |
| BR | 277 | 1.0% |
| US | 255 | 0.9% |
| CM | 205 | 0.8% |
| PH | 188 | 0.7% |
| CD | 162 | 0.6% |
| VU | 129 | 0.5% |
| RU | 104 | 0.4% |
| TZ | 103 | 0.4% |
| PE | 102 | 0.4% |
| MY | 88 | 0.3% |
| TD | 88 | 0.3% |
| NP | 82 | 0.3% |
| CO | 80 | 0.3% |
Schema
15 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| ID | text | 0.0% | 27,037 | near_unique, one_word, short_text |
| Name | text | 0.0% | 27,037 | near_unique, one_word |
| Macroarea | categorical | 0.8% | 30 | |
| Latitude | numeric | 1.8% | 13,231 | |
| Longitude | numeric | 1.8% | 13,203 | |
| Glottocode | text | 0.0% | 27,037 | near_unique, one_word, short_text |
| ISO639P3code | text | 69.7% | 8,180 | near_unique, one_word, null_rate, short_text |
| Level | categorical | 0.0% | 3 | |
| Countries | categorical | 66.4% | 737 | null_rate |
| Family_ID | categorical | 1.6% | 297 | |
| Language_ID | text | 49.7% | 3,110 | one_word, null_rate, short_text, duplicates |
| Closest_ISO369P3code | text | 21.3% | 8,180 | one_word, null_rate, short_text, duplicates |
| First_Year_Of_Documentation | numeric | 99.2% | 114 | null_rate |
| Last_Year_Of_Documentation | numeric | 96.0% | 269 | null_rate, high_skew, outliers |
| Is_Isolate | categorical | 68.1% | 2 | null_rate, imbalance |
ID
text · identifier · near_unique · one_word · short_text
Fixed-length 8-character single-token codes (len_min = len_max = 8, word_mean = 1.0), unique across all 27,037 rows with no nulls or duplicates. Sample values such as 'cent1996' and 'chan1318' look like a 4-letter prefix plus a 4-digit suffix, consistent with Glottolog-style language identifiers rather than arbitrary surrogate keys. The statistics and sample values match those reported for Glottocode, so the two columns appear to hold the same codes. Treatment: Use as the row key for joins; exclude from modelling features.
- n: 27,037
- nulls: 0 (0.0%)
- unique: 27,037
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: 92.03
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
Name
text · identifier · near_unique · one_word
`Name` is a fully unique short label column (27,037 rows, 27,037 distinct values, no nulls or duplicates), with a mean length of 10.4 characters; 66.7% of entries are a single word. The vocabulary of 18,126 tokens is led by geographic and positional descriptors ('nuclear', 'central', 'western', 'northern', 'eastern', 'southern'), which fits names of languages, dialects, and family subgroups rather than personal names. Perfect uniqueness combined with short, often one-word values makes it an identifier-like label. Treatment: Treat as a unique key; drop from modelling features or use only for joins and display.
- n: 27,037
- nulls: 0 (0.0%)
- unique: 27,037
- len_min: 1
- len_max: 109
- len_mean: 10.44
- len_median: 8
- len_p95: 23
- word_mean: 1.444
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 18,126
- readability_flesch_mean: 29.91
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.6675
- allcaps_rate: 0.0001479
- boilerplate_rate: 0
Macroarea
categorical · feature
Geographic macroarea label for each entry. Six base regions dominate (Eurasia 8,060, Africa 8,020, Papunesia 6,326, North America 1,782, South America 1,524, Australia 919), but cardinality is 30 because some rows carry semicolon-joined multi-region strings such as 'Africa;Eurasia' (29 rows) or all six regions concatenated (17 rows). The null rate is low at 0.83%, and the entropy_ratio of 0.46 reflects the concentration in Eurasia, Africa, and Papunesia (top_rate 0.30 among non-null rows). Treatment: Split the semicolon-delimited compound values into a multi-hot encoding over the six base regions before modelling.
- n: 27,037
- nulls: 224 (0.8%)
- unique: 30
- top_value: Eurasia
- top_rate: 0.3006
- cardinality: 30
- entropy: 2.271
- entropy_ratio: 0.4628
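The multi-hot treatment suggested for Macroarea can be sketched as follows. This is a minimal illustration and not the profiler's own code: the function name `multi_hot` and the fixed region order are assumptions made for the example, and it uses only the Python standard library.

```python
# Six base regions named in the Macroarea value table (order chosen for this sketch).
BASE_REGIONS = [
    "Africa", "Australia", "Eurasia",
    "North America", "Papunesia", "South America",
]

def multi_hot(macroarea):
    """Map a Macroarea cell (single, compound, or missing) to six 0/1 flags."""
    if macroarea is None:
        parts = set()
    else:
        # Compound values are semicolon-joined, e.g. 'Africa;Eurasia'.
        parts = {p.strip() for p in macroarea.split(";") if p.strip()}
    unknown = parts - set(BASE_REGIONS)
    if unknown:
        raise ValueError(f"unexpected region(s): {sorted(unknown)}")
    return [int(region in parts) for region in BASE_REGIONS]

print(multi_hot("Africa;Eurasia"))
print(multi_hot("Papunesia"))
print(multi_hot(None))
```

A missing cell becomes an all-zero vector here; whether to add a separate missing flag for the 224 null rows is a modelling choice the report leaves open.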
Latitude
numeric · feature
Geographic latitude in decimal degrees, spanning -55.2748 to 73.1354, well inside the valid -90 to 90 range. The distribution is mildly right-skewed (0.42) with a median of 8.527 and a mean of 11.59; the binned counts are densest between roughly -10° and 12°, with a longer tail toward high northern latitudes. About 1.77% of rows are null and only 48 values (0.18%) fall outside the IQR fence, so the column is largely clean. Treatment: Pair with Longitude for geospatial features; impute or drop the 1.77% nulls before modelling.
- n: 27,037
- nulls: 479 (1.8%)
- unique: 13,231
- min: -55.27
- max: 73.14
- mean: 11.59
- median: 8.527
- std: 20.57
- q1: -3.747
- q3: 26
- iqr: 29.75
- skew: 0.4211
- kurtosis: -0.1912
- n_outliers: 48
- outlier_rate: 0.001807
- zero_rate: 0
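The outlier count depends on where the IQR fence falls. The report does not state the multiplier, so the worked example below assumes the common 1.5 × IQR rule; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as reported for Latitude (rounded values from the stats list).
lower, upper = tukey_fences(q1=-3.747, q3=26.0)
print(f"lower fence: {lower:.1f}, upper fence: {upper:.1f}")
```

Under that assumption the fences sit at roughly -48.4 and 70.6, so the flagged values would be the few entries at the extreme southern and northern ends of the reported -55.27 to 73.14 range.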
Longitude
numeric · feature
This column captures geographic longitude, with values spanning -178.785 to 179.43, essentially the full -180/180 globe. The distribution is wide (std 74.05, IQR 110.17) and slightly left-skewed (-0.47), with 13,203 unique values across 27,037 rows and a 1.77% null rate. Only 51 outliers (0.19%) flag, which is expected since longitude is bounded. Treatment: Pair with latitude for geospatial features; consider sin/cos encoding to handle the -180/180 wraparound.
- n: 27,037
- nulls: 479 (1.8%)
- unique: 13,203
- min: -178.8
- max: 179.4
- mean: 51.82
- median: 44.07
- std: 74.05
- q1: 9.225
- q3: 119.4
- iqr: 110.2
- skew: -0.468
- kurtosis: -0.4518
- n_outliers: 51
- outlier_rate: 0.00192
- zero_rate: 0
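The sin/cos encoding mentioned in the treatment can be illustrated with a short standard-library sketch. The function name `encode_longitude` is hypothetical; the point is only that the two extreme longitudes in this column end up close together once encoded.

```python
import math

def encode_longitude(lon_deg):
    """Encode longitude in degrees as (sin, cos) so that -180 and 180 coincide."""
    rad = math.radians(lon_deg)
    return math.sin(rad), math.cos(rad)

# The reported min and max are numerically far apart but geographically adjacent.
west = encode_longitude(-178.785)
east = encode_longitude(179.43)
gap = math.dist(west, east)
print(f"raw difference: {179.43 - (-178.785):.3f} degrees")
print(f"encoded distance: {gap:.4f}")
```

The encoded distance is a chord length on the unit circle, so it stays small for points on either side of the antimeridian even though the raw values differ by almost 360 degrees.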
Glottocode
text · identifier · near_unique · one_word · short_text
This column holds Glottocodes, the standard 8-character identifiers used by the Glottolog language catalogue (e.g. 'cent1996', 'chan1318'). Every one of the 27,037 rows is unique with a fixed length of 8 and exactly one word, and there are no nulls or duplicates, so it functions as a primary key for languages/dialects. Nothing surprising in the distribution; it behaves exactly like a clean ID field. Treatment: Use as a primary key to left-join against Glottolog metadata; do not feed into models as a feature.
- n: 27,037
- nulls: 0 (0.0%)
- unique: 27,037
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: 92.03
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
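Before using the column as a join key, a quick shape check can catch stray values. The sketch below assumes a Glottocode is four lowercase alphanumeric characters followed by four digits, which matches the samples in this report; the pattern and the helper name `is_glottocode` are assumptions for illustration, not something the report states.

```python
import re

# Assumed shape: 4 lowercase alphanumerics, then 4 digits (e.g. 'cent1996').
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def is_glottocode(value):
    """Return True when the whole string matches the assumed Glottocode shape."""
    return GLOTTOCODE_RE.fullmatch(value) is not None

for sample in ["cent1996", "chan1318", "atla1278", "Cent1996", "eng"]:
    print(sample, is_glottocode(sample))
```

The same check applies to Family_ID and Language_ID, whose sample values follow the same 8-character pattern.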
ISO639P3code
text · foreign_key · near_unique · one_word · null_rate · short_text
This column holds ISO 639-3 language codes: exactly 3 characters, one word, every value lowercase alphabetic. It is 69.75% null, and the 8,180 populated rows carry 8,180 distinct codes, so each code appears exactly once; that is consistent with a language-registry foreign key rather than a feature. There are no duplicates or empty strings among the populated rows. Treatment: Treat as a language-code key; left-join to an ISO 639-3 reference table and encode missingness explicitly.
- n: 27,037
- nulls: 18,857 (69.7%)
- unique: 8,180
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 8,180
- readability_flesch_mean: 119.1
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
Level
categorical · label
This is a low-cardinality categorical taxonomy field with exactly 3 levels: dialect, language, and family. The distribution is uneven but not pathological: dialect leads at 50.3% (13,593 of 27,037), followed by language (8,612, 31.9%) and family (4,832, 17.9%), yielding an entropy ratio of 0.93. There are no nulls, which fits a curated classification scheme. Treatment: One-hot or ordinal encode for modelling; safe to use as a stratification key.
- n: 27,037
- nulls: 0 (0.0%)
- unique: 3
- top_value: dialect
- top_rate: 0.5028
- cardinality: 3
- entropy: 1.468
- entropy_ratio: 0.9265
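The entropy figures can be reproduced from the value counts. The sketch below assumes the profiler reports Shannon entropy in bits and divides by log2 of the cardinality to get entropy_ratio; the report does not say so explicitly, but the numbers agree under that reading.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Counts from the Level value table.
level_counts = {"dialect": 13593, "language": 8612, "family": 4832}

entropy = entropy_bits(list(level_counts.values()))
ratio = entropy / math.log2(len(level_counts))
print(f"entropy: {entropy:.3f} bits, entropy_ratio: {ratio:.4f}")
```

A ratio near 1 means the levels are close to evenly spread; Level's 0.93 is far higher than the 0.15 reported for Is_Isolate, where one value dominates.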
Countries
categorical · feature · null_rate
Two-letter ISO country codes, with 737 distinct values across 27,037 rows. Two-thirds of rows are null (null_rate 0.6641), and among the 9,081 populated rows the distribution is broad (entropy_ratio 0.69): the top value, PG, accounts for 9.97% of non-null rows (905 rows, or 3.3% of all rows). The 737 distinct values are surprising, since ISO 3166-1 alpha-2 defines only about 250 codes; this suggests multi-country concatenations or non-standard codes are mixed in. Treatment: Normalize or split such values, add an explicit missing indicator, then group rare levels before encoding.
- n: 27,037
- nulls: 17,956 (66.4%)
- unique: 737
- top_value: PG
- top_rate: 0.09966
- cardinality: 737
- entropy: 6.562
- entropy_ratio: 0.6888
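The three treatment steps (split, flag missing, group rare levels) can be sketched together. The semicolon separator is an assumption borrowed from how Macroarea joins compound values; the report does not show a compound Countries value. The helper names, the `min_count` threshold, and the toy input are likewise made up for illustration.

```python
from collections import Counter

def split_codes(value, sep=";"):
    """Split a raw Countries cell into a list of codes; None becomes an empty list."""
    if value is None:
        return []
    return [code.strip() for code in value.split(sep) if code.strip()]

def encode_countries(cells, min_count=2, other="OTHER"):
    """Return one dict per row: a missing flag plus codes, with rare codes folded into `other`."""
    counts = Counter(code for cell in cells for code in split_codes(cell))
    encoded = []
    for cell in cells:
        codes = {c if counts[c] >= min_count else other for c in split_codes(cell)}
        encoded.append({"missing": cell is None, "codes": sorted(codes)})
    return encoded

# Toy input, not rows from the dataset.
cells = ["PG", "ID;PG", None, "VU", "ID"]
for row in encode_countries(cells):
    print(row)
```

On real data the threshold would be chosen from the code frequencies after splitting, since the 737 raw values overstate the number of distinct countries if many are compounds.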
Family_ID
categorical · foreign_key
Family_ID holds Glottolog-style language family codes (e.g., atla1278, aust1307, indo1319), making it a categorical grouping key across 27,037 rows with 297 distinct families. The distribution is heavily skewed: the top family, atla1278, covers 18.27% of non-null rows (18.0% of all rows), and the top three families together account for about 45%, yielding an entropy ratio of 0.60. The null rate is low at 1.59%. Treatment: Left-join on this id to a language-family reference, or group by it for stratified analysis.
- n: 27,037
- nulls: 429 (1.6%)
- unique: 297
- top_value: atla1278
- top_rate: 0.1827
- cardinality: 297
- entropy: 4.938
- entropy_ratio: 0.6011
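The report quotes two shares for atla1278: 18.0% in the value table and 0.1827 as top_rate. A short arithmetic check, using only counts printed in this report, shows that they differ in denominator (all rows versus non-null rows). The variable names are illustrative.

```python
n_rows = 27037
n_null = 429
top3_counts = {"atla1278": 4861, "aust1307": 4108, "indo1319": 3173}

top_count = top3_counts["atla1278"]
share_all_rows = top_count / n_rows              # denominator used by the value table
share_non_null = top_count / (n_rows - n_null)   # denominator that reproduces top_rate
top3_share = sum(top3_counts.values()) / n_rows

print(f"atla1278 share of all rows:      {share_all_rows:.4f}")
print(f"atla1278 share of non-null rows: {share_non_null:.4f}")
print(f"top three share of all rows:     {top3_share:.4f}")
```

The same distinction applies to Countries, where PG is 3.3% of all rows but 9.97% of populated rows.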
Language_ID
text · foreign_key · one_word · null_rate · short_text · duplicates
This column holds 8-character single-token codes (len_min = len_max = 8, one_word_rate = 1.0) that look like Glottolog language identifiers (e.g., 'nucl1643', 'stan1293'). With 3,110 unique values and a duplicate rate of 77.1% among populated rows, it behaves like a categorical foreign key into a language registry. 49.7% of rows are null; the 13,593 populated rows equal the dialect count in Level, which suggests (but does not confirm) that the column links each dialect to its parent language and is empty for other levels. Treatment: Left-join on this id to a language reference table; treat missing as a separate category.
- n: 27,037
- nulls: 13,444 (49.7%)
- unique: 3,110
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 10,483
- duplicate_rate: 0.7712
- vocab_size: 3,110
- readability_flesch_mean: 86.53
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
Closest_ISO369P3code
text · feature · one_word · null_rate · short_text · duplicates
This column holds ISO 639-3 three-letter language codes: every value is exactly 3 characters and one word (len_mean 3.0, one_word_rate 1.0), with 8,180 unique codes led by jpn (120), eng (115), and pes (64). Notable signals: 21.28% nulls and a 61.57% duplicate rate (13,103 duplicates), so coverage is partial but the field is a clean categorical. Treatment: Treat as a categorical language code; impute or flag the 21% nulls and join to an ISO 639-3 reference table for names/families.
- n: 27,037
- nulls: 5,754 (21.3%)
- unique: 8,180
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 13,103
- duplicate_rate: 0.6157
- vocab_size: 7,877
- readability_flesch_mean: 117.4
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
First_Year_Of_Documentation
numeric · metadata · null_rate
This column appears to record the earliest year in which an entry was documented, spanning -2100 (BCE) to 1932 CE with a median of 711. Severe nullity is the headline: 99.2% of the 27,037 rows are missing, leaving only 215 populated values across 114 unique years. The wide interquartile range (-300 to 1710.5, IQR 2010) and negative skew (-0.46) indicate a long tail into antiquity rather than a modern-era concentration. Treatment: Drop or treat as a sparse indicator; too null to use as a feature without heavy imputation.
- n: 27,037
- nulls: 26,822 (99.2%)
- unique: 114
- min: -2100
- max: 1932
- mean: 673.7
- median: 711
- std: 1055
- q1: -300
- q3: 1710
- iqr: 2010
- skew: -0.4581
- kurtosis: -0.9206
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
Last_Year_Of_Documentation
numeric · timestamp · null_rate · high_skew · outliers
This appears to be the last year in which an entry was documented, populated for only about 4% of rows (null_rate 0.9605, 1,068 values). Values span -3100 to 2024 with a median of 1960, and the heavy left skew (-3.35) plus kurtosis of 12.3 yields 170 outliers (15.9% of non-null entries). The negative minimum most likely reflects BCE dating, as in First_Year_Of_Documentation, although sentinel values cannot be ruled out from the summary alone. Treatment: Validate the year range before clipping anything, treat the column as mostly missing, and impute or flag presence rather than relying on the raw value.
- n: 27,037
- nulls: 25,969 (96.0%)
- unique: 269
- min: -3100
- max: 2024
- mean: 1700
- median: 1960
- std: 699.3
- q1: 1858
- q3: 1987
- iqr: 129.5
- skew: -3.345
- kurtosis: 12.32
- n_outliers: 170
- outlier_rate: 0.1592
- zero_rate: 0
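Flagging presence instead of using the raw year can be sketched in a few lines. The example assumes negative values are BCE years, as the First_Year_Of_Documentation description reads them; that reading is not confirmed by the report, and the function name and output keys are hypothetical.

```python
def year_features(year):
    """Turn a sparse year value into a presence flag and a coarse era label."""
    if year is None:
        return {"has_year": 0, "era": "missing"}
    # Assumption: negative years denote BCE dates, not sentinel codes.
    return {"has_year": 1, "era": "BCE" if year < 0 else "CE"}

for value in [None, -3100, 1960, 2024]:
    print(value, year_features(value))
```

With about 96% of rows missing, the presence flag carries most of the usable signal; the era label mainly separates the ancient tail from the modern bulk around the 1960 median.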
Is_Isolate
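categorical · feature · null_rate · imbalance
Boolean flag indicating isolate status, present on only about 32% of the 27,037 rows (null_rate 0.6815). Among non-null values, 'False' dominates at 0.9789 with just 182 'True' cases, yielding very low entropy (0.148). The 8,612 populated rows equal the language count in Level, which suggests the flag may be recorded only for language-level entries. Treatment: Impute or add a missingness indicator, checking first whether the nulls are structural; the flag is near-constant, so expect little predictive lift.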
- n: 27,037
- nulls: 18,425 (68.1%)
- unique: 2
- top_value: False
- top_rate: 0.9789
- cardinality: 2
- entropy: 0.1478
- entropy_ratio: 0.1478