data raw wals language
Reading
This dataset is a catalogue of 3,573 world-language records from WALS, with identifiers (ISO codes, Glottocode), names, geographic coordinates, and classification fields (Family, Genus, Subfamily, Macroarea), plus reference sources and sampling flags. The geographic and genealogical breakdowns are the most informative starting point: Macroarea splits across six regions led by Eurasia (659) and Africa (606), while the largest families are Niger-Congo and Austronesian (324 rows each, about 9% of all rows apiece). Worth a closer look: 911 rows (25.5%) are missing core fields such as Family, Genus, Macroarea, and coordinates, and Subfamily is 74.5% null, which will limit any subfamily-level analysis. That count of 911 equals the combined number of distinct Family (254), Subfamily (32), and Genus (625) values, so these rows may be classification entries rather than languages; this should be verified against the ID column before imputing anything. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 True values respectively), reflecting their role as curated sub-samples rather than balanced categories.
citing: columns · row_count · kinds
Charts the summary said to look at first
Macroarea: value counts (share of all 3,573 rows)
| value | count | share |
|---|---|---|
| Eurasia | 659 | 18.4% |
| Africa | 606 | 17.0% |
| Papunesia | 560 | 15.7% |
| North America | 396 | 11.1% |
| South America | 258 | 7.2% |
| Australia | 183 | 5.1% |
Family: top 20 values (share of all 3,573 rows)
| value | count | share |
|---|---|---|
| Niger-Congo | 324 | 9.1% |
| Austronesian | 324 | 9.1% |
| Indo-European | 176 | 4.9% |
| Sino-Tibetan | 146 | 4.1% |
| Afro-Asiatic | 145 | 4.1% |
| Pama-Nyungan | 121 | 3.4% |
| Trans-New Guinea | 98 | 2.7% |
| other | 72 | 2.0% |
| Altaic | 65 | 1.8% |
| Oto-Manguean | 56 | 1.6% |
| Austro-Asiatic | 48 | 1.3% |
| Eastern Sudanic | 47 | 1.3% |
| Uto-Aztecan | 44 | 1.2% |
| Algic | 31 | 0.9% |
| Mayan | 30 | 0.8% |
| Arawakan | 29 | 0.8% |
| Nakh-Daghestanian | 28 | 0.8% |
| Mande | 28 | 0.8% |
| Uralic | 27 | 0.8% |
| Hokan | 26 | 0.7% |
Latitude: histogram bin counts
| bin | count |
|---|---|
| -55 – -51.84 | 2 |
| -51.84 – -48.69 | 1 |
| -48.69 – -45.53 | 1 |
| -45.53 – -42.38 | 2 |
| -42.38 – -39.22 | 3 |
| -39.22 – -36.06 | 5 |
| -36.06 – -32.91 | 9 |
| -32.91 – -29.75 | 18 |
| -29.75 – -26.59 | 24 |
| -26.59 – -23.44 | 29 |
| -23.44 – -20.28 | 47 |
| -20.28 – -17.12 | 64 |
| -17.12 – -13.97 | 85 |
| -13.97 – -10.81 | 86 |
| -10.81 – -7.656 | 129 |
| -7.656 – -4.5 | 187 |
| -4.5 – -1.344 | 190 |
| -1.344 – 1.812 | 130 |
| 1.812 – 4.969 | 119 |
| 4.969 – 8.125 | 196 |
| 8.125 – 11.28 | 161 |
| 11.28 – 14.44 | 104 |
| 14.44 – 17.59 | 145 |
| 17.59 – 20.75 | 82 |
| 20.75 – 23.91 | 63 |
| 23.91 – 27.06 | 74 |
| 27.06 – 30.22 | 84 |
| 30.22 – 33.38 | 66 |
| 33.38 – 36.53 | 84 |
| 36.53 – 39.69 | 67 |
| 39.69 – 42.84 | 86 |
| 42.84 – 46 | 62 |
| 46 – 49.16 | 74 |
| 49.16 – 52.31 | 52 |
| 52.31 – 55.47 | 44 |
| 55.47 – 58.62 | 18 |
| 58.62 – 61.78 | 20 |
| 61.78 – 64.94 | 23 |
| 64.94 – 68.09 | 19 |
| 68.09 – 71.25 | 7 |
Country_ID: top 20 values (share of all 3,573 rows)
| value | count | share |
|---|---|---|
| PG | 214 | 6.0% |
| AU | 185 | 5.2% |
| US | 177 | 5.0% |
| ID | 177 | 5.0% |
| IN | 120 | 3.4% |
| MX | 120 | 3.4% |
| RU | 89 | 2.5% |
| NG | 66 | 1.8% |
| BR | 66 | 1.8% |
| CN | 54 | 1.5% |
| CD | 49 | 1.4% |
| CM | 46 | 1.3% |
| CA | 45 | 1.3% |
| CO | 39 | 1.1% |
| ET | 36 | 1.0% |
| PH | 36 | 1.0% |
| PE | 35 | 1.0% |
| NP | 32 | 0.9% |
| TZ | 28 | 0.8% |
| VU | 28 | 0.8% |
Samples_100: value counts (share of all 3,573 rows)
| value | count | share |
|---|---|---|
| False | 2562 | 71.7% |
| True | 100 | 2.8% |
Schema
17 columns

| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| ID | text | 0.0% | 3,573 | near_unique, one_word, short_text |
| Name | text | 0.0% | 3,198 | one_word, short_text |
| Macroarea | categorical | 25.5% | 6 | null_rate |
| Latitude | numeric | 25.5% | 887 | null_rate |
| Longitude | numeric | 25.5% | 1,360 | null_rate |
| Glottocode | text | 26.0% | 2,502 | one_word, null_rate, short_text |
| ISO639P3code | text | 26.8% | 2,442 | one_word, null_rate, short_text |
| Family | categorical | 25.5% | 254 | null_rate |
| Subfamily | categorical | 74.5% | 32 | null_rate |
| Genus | categorical | 25.5% | 625 | null_rate |
| GenusIcon | categorical | 82.5% | 613 | long_tail, null_rate |
| ISO_codes | text | 26.1% | 2,468 | one_word, null_rate, short_text |
| Samples_100 | categorical | 25.5% | 2 | null_rate, imbalance |
| Samples_200 | categorical | 25.5% | 2 | null_rate |
| Country_ID | categorical | 25.7% | 337 | null_rate |
| Source | text | 30.1% | 2,373 | one_word, null_rate |
| Parent_ID | categorical | 7.1% | 911 | long_tail |
ID
text · identifier · near_unique · one_word · short_text
Column 'ID' is a unique row identifier: all 3,573 values are distinct (n_unique equals n), every value is a single token (one_word_rate 1.0), and there are no nulls or duplicates. Lengths range from 2 to 36 characters with a median of 3, but the mean is 5.98 and the 95th percentile is 17, so a minority of IDs are much longer than the typical three-letter code. The top tokens (aab, aar, aba, abb…) suggest short alphabetic codes rather than numeric keys.
Treatment: Use as a join key; drop from modelling features.
- n: 3,573
- nulls: 0 (0.0%)
- unique: 3,573
- len_min: 2
- len_max: 36
- len_mean: 5.982
- len_median: 3
- len_p95: 17
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 3,573
- readability_flesch_mean: 61.58
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
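Before relying on ID as a join key, the uniqueness claim can be rechecked at load time. The following is a minimal sketch in plain Python, assuming the rows are available as dictionaries (for example from `csv.DictReader`). The helper name `is_valid_key` and the sample rows are hypothetical placeholders, not values from the dataset.

```python
def is_valid_key(rows, column):
    """Return True if `column` is present, non-empty, and distinct in every row."""
    values = [row.get(column) for row in rows]
    if any(v is None or v == "" for v in values):
        return False
    return len(set(values)) == len(values)


# Placeholder rows for illustration only; real rows would come from the export.
sample_rows = [
    {"ID": "aab", "Name": "Language A"},
    {"ID": "aar", "Name": "Language B"},
    {"ID": "aba", "Name": "Language C"},
]

print(is_valid_key(sample_rows, "ID"))
```

A check like this fails fast if a later export introduces repeated or blank IDs.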
Name
text · label · one_word · short_text
This column holds short proper-noun labels, almost certainly language names: top values such as Basque, Ainu, Beothuk, and Atakapa, and frequent words such as 'sign', 'language', 'arabic', and 'mixtec', all point to a linguistic registry. About 80% of entries are single tokens (one_word_rate 0.80, word_mean 1.26, len_mean 8.7), with a 46-character maximum for the longer parenthesised variants like '(northern)'/'(southern)'. Names are not unique: with 3,198 distinct values across 3,573 rows, 375 rows (10.5%) repeat a name already seen, and names like 'Abun', 'Andoke', and 'Basque' each appear 3 times, suggesting the dataset repeats languages across some other dimension.
Treatment: Treat as a categorical language label, not a primary key; deduplicate before joining on it.
- n: 3,573
- nulls: 0 (0.0%)
- unique: 3,198
- len_min: 2
- len_max: 46
- len_mean: 8.705
- len_median: 7
- len_p95: 19
- word_mean: 1.258
- word_median: 1
- n_empty: 0
- n_duplicates: 375
- duplicate_rate: 0.105
- vocab_size: 3,383
- readability_flesch_mean: 48.16
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.8002
- allcaps_rate: 0
- boilerplate_rate: 0
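To see which names repeat before deciding how to deduplicate, a frequency count is usually enough. This is a small sketch using `collections.Counter`; the function name `repeated_values` is hypothetical, and the list below is a toy sample that only reuses names mentioned in this report.

```python
from collections import Counter


def repeated_values(values, min_count=2):
    """Map each non-empty value seen at least `min_count` times to its count."""
    counts = Counter(v for v in values if v)
    return {v: c for v, c in counts.items() if c >= min_count}


# Toy sample; the real column has 3,573 entries.
names = ["Basque", "Abun", "Basque", "Ainu", "Basque", "Abun"]
print(repeated_values(names))
```

On the full column, the number of rows beyond the first occurrence of each name should equal n minus unique, that is 3,573 - 3,198 = 375.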
Macroarea
categorical · feature · null_rate
Macroarea is a categorical geographic grouping with 6 values covering the standard continental/linguistic macroareas (Eurasia, Africa, Papunesia, North America, South America, Australia). The distribution is fairly balanced: the entropy ratio is 0.95 and the top value, Eurasia, accounts for only 24.8% of non-null rows (659 of 2,662), or 18.4% of all rows. The main concern is the 25.5% null rate, meaning a quarter of the 3,573 rows lack any macroarea assignment.
Treatment: One-hot encode and decide whether to impute or add an explicit 'unknown' bucket for the 25.5% nulls.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 6
- top_value: Eurasia
- top_rate: 0.2476
- cardinality: 6
- entropy: 2.459
- entropy_ratio: 0.9511
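The entropy figures can be reproduced from the Macroarea value counts. The sketch below assumes the profiler reports Shannon entropy in bits over non-null values and divides by log2 of the number of categories to get the ratio; the report does not state this, but the reported numbers are consistent with that definition. The function name `entropy_stats` is hypothetical.

```python
import math


def entropy_stats(counts):
    """Return (entropy in bits, entropy / log2(k)) over the k non-zero counts."""
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    entropy = -sum((c / total) * math.log2(c / total) for c in positive)
    k = len(positive)
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio


# Counts from the Macroarea table: Eurasia, Africa, Papunesia,
# North America, South America, Australia.
macroarea_counts = [659, 606, 560, 396, 258, 183]
entropy, ratio = entropy_stats(macroarea_counts)
top_rate = max(macroarea_counts) / sum(macroarea_counts)  # share of non-null rows
print(round(entropy, 3), round(ratio, 4), round(top_rate, 4))
```

The results should land close to the reported entropy of 2.459, entropy_ratio of 0.9511, and top_rate of 0.2476. Note that top_rate uses the 2,662 non-null rows as its denominator, while the share column in the chart tables uses all 3,573 rows.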
Latitude
numeric · feature · null_rate
Geographic latitude in decimal degrees, ranging from -55.0 to 71.25 with a median of 8.29, consistent with global coverage skewed slightly toward the northern hemisphere. About 25.5% of rows are null, a notable gap for a positional field, and only 887 unique values across the 2,662 non-null rows suggest coordinates are rounded or tied to a limited set of locations. The distribution is near-symmetric (skew 0.36, kurtosis -0.50) with just one outlier flagged.
Treatment: Pair with Longitude for geospatial features; impute or filter the 25.5% nulls before modelling.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 887
- min: -55
- max: 71.25
- mean: 11.88
- median: 8.292
- std: 22.72
- q1: -5
- q3: 28
- iqr: 33
- skew: 0.3562
- kurtosis: -0.5023
- n_outliers: 1
- outlier_rate: 0.0003757
- zero_rate: 0.002254
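The single flagged outlier is consistent with a standard 1.5 × IQR (Tukey) rule applied to the quartiles above. The report does not say which rule the profiler uses, so treat the following as a sketch under that assumption; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the Latitude statistics.
low, high = tukey_fences(q1=-5.0, q3=28.0)
print(low, high)  # -54.5 77.5
```

Under this rule the minimum of -55 falls just below the lower fence of -54.5, while the maximum of 71.25 stays inside the upper fence of 77.5, which would account for exactly one outlier at the southern extreme.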
Longitude
numeric · feature · null_rate
Geographic longitude in decimal degrees, spanning -178.17 to 179.17 with 1,360 distinct values across 3,573 rows. The distribution is roughly symmetric (skew -0.33) but flat (kurtosis -1.05) with an IQR of 166.75, consistent with truly global coverage rather than a regional sample. Notable concern: 25.5% of rows are null, which will silently drop a quarter of any geospatial join.
Treatment: Pair with Latitude and impute or filter the 25.5% nulls before any geospatial modelling.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 1,360
- min: -178.2
- max: 179.2
- mean: 35.17
- median: 34.79
- std: 89.35
- q1: -45.75
- q3: 121
- iqr: 166.8
- skew: -0.3259
- kurtosis: -1.047
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.001503
Glottocode
text · foreign_key · one_word · null_rate · short_text
This is a Glottocode field: fixed 8-character language identifiers from the Glottolog catalogue (e.g. basq1248, swis1247), with every value being a single token. About 26% of rows are null and 2,502 distinct codes appear across the 2,645 non-null rows, leaving 143 duplicates (5.4% of non-null values) where the same language repeats; basq1248 leads with 11 occurrences. Length is rigidly 8 for min, median, and max, consistent with a controlled-vocabulary identifier rather than free text.
Treatment: Treat as a categorical key; left-join to Glottolog metadata and handle the 26% nulls explicitly.
- n: 3,573
- nulls: 928 (26.0%)
- unique: 2,502
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 143
- duplicate_rate: 0.05406
- vocab_size: 2,502
- readability_flesch_mean: 92.88
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
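Because the field is rigidly 8 characters, a format check is cheap to run before joining. The sketch below assumes the usual Glottocode shape of four lowercase alphanumeric characters followed by four digits, which matches the examples in this report; the pattern and the helper name `is_glottocode` are assumptions, and the check says nothing about whether a code actually exists in Glottolog.

```python
import re

# Assumed shape: 4 lowercase letters/digits, then 4 digits (e.g. basq1248).
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def is_glottocode(value):
    """Return True if `value` is a string with the assumed Glottocode shape."""
    return isinstance(value, str) and GLOTTOCODE_RE.fullmatch(value) is not None


for candidate in ["basq1248", "swis1247", "eus", None]:
    print(candidate, is_glottocode(candidate))
```

Nulls return False here, so they should be counted separately from malformed codes.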
ISO639P3code
text · foreign_key · one_word · null_rate · short_text
This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0), with examples like 'eus', 'gsw', 'deu' matching the standard. Coverage is incomplete, with 26.84% of rows null, and 2,442 unique codes appear across 3,573 rows with a 6.58% duplicate rate, so most languages occur only once or twice (top value 'eus' at 12).
Treatment: Treat as a categorical key; left-join to an ISO 639-3 reference table and decide on an explicit bucket for the 26.84% nulls.
- n: 3,573
- nulls: 959 (26.8%)
- unique: 2,442
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 172
- duplicate_rate: 0.0658
- vocab_size: 2,442
- readability_flesch_mean: 119.5
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
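The duplicate figures for the text columns follow from three reported numbers: row count, null count, and distinct count. The sketch below assumes the profiler defines duplicates as non-null rows minus distinct values and the rate as duplicates over non-null rows; that definition is inferred from the figures, not stated in the report, and `duplicate_stats` is a hypothetical name.

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Return (n_duplicates, duplicate_rate) over the non-null rows."""
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null


# ISO639P3code: 3,573 rows, 959 nulls, 2,442 distinct codes.
dups, rate = duplicate_stats(3573, 959, 2442)
print(dups, round(rate, 4))  # 172 0.0658
```

The same arithmetic reproduces the Glottocode figures (3,573 rows, 928 nulls, 2,502 distinct codes give 143 duplicates).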
Family
categorical · feature · null_rate
Categorical label for the language family of each row, with 254 distinct families across 3,573 records. The distribution is long-tailed but not dominated by any one family: the top value 'Niger-Congo' covers only 12.2% of non-null rows (324 of 2,662) and ties exactly with 'Austronesian' at 324 each, with an entropy ratio of 0.70 indicating spread across many small families. Notable concern: 25.5% of rows are null, and a literal 'other' bucket already accounts for 72 rows.
Treatment: Impute or flag the 25.5% missing, then group rare families before one-hot or target encoding.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 254
- top_value: Niger-Congo
- top_rate: 0.1217
- cardinality: 254
- entropy: 5.631
- entropy_ratio: 0.7049
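Grouping rare families can be sketched with a frequency threshold. Because the column already contains a literal 'other' value, the sketch below uses a separate 'rare' label so the two buckets are not merged silently. The function name `group_rare`, the labels, and the threshold are hypothetical choices, and the list is a toy sample.

```python
from collections import Counter


def group_rare(values, min_count, rare_label="rare", missing_label="missing"):
    """Replace values seen fewer than `min_count` times; keep nulls explicit."""
    counts = Counter(v for v in values if v is not None)
    grouped = []
    for v in values:
        if v is None:
            grouped.append(missing_label)
        elif counts[v] < min_count:
            grouped.append(rare_label)
        else:
            grouped.append(v)
    return grouped


# Toy sample reusing family names from this report.
families = ["Niger-Congo"] * 3 + ["Mayan", None, "other", "other"]
print(group_rare(families, min_count=2))
```

The threshold is a modelling decision; with 254 families and a long tail, a low threshold still leaves many levels to encode.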
Subfamily
categorical · feature · null_rate
This column records the linguistic subfamily classification of each row, with 32 distinct values led by Benue-Congo (200 occurrences, 21.95% of non-nulls), Eastern Malayo-Polynesian (159), and Tibeto-Burman (139). The striking issue is the 74.5% null rate: only about a quarter of the 3,573 rows carry a subfamily label. Even so, the entropy ratio of 0.77 indicates the populated values are reasonably spread across the 32 categories rather than collapsing onto one.
Treatment: Treat missingness as its own category and group rare subfamilies before one-hot encoding.
- n: 3,573
- nulls: 2,662 (74.5%)
- unique: 32
- top_value: Benue-Congo
- top_rate: 0.2195
- cardinality: 32
- entropy: 3.856
- entropy_ratio: 0.7712
Genus
categorical · feature · null_rate
This column holds linguistic genus labels (e.g., Oceanic, Bantu, Indic, Semitic, Germanic), a mid-level grouping in language classification. Cardinality is high at 625 distinct values across 3,573 rows with an entropy ratio of 0.856, so the distribution is broad and flat: the top value 'Oceanic' covers only 5.6% of non-null rows. Note the 25.5% null rate, which is flagged and would meaningfully shrink any analysis that conditions on genus.
Treatment: Treat as a high-cardinality categorical: group rare genera into 'Other' or target-encode, and add an explicit missing indicator for the 25.5% nulls.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 625
- top_value: Oceanic
- top_rate: 0.05597
- cardinality: 625
- entropy: 7.95
- entropy_ratio: 0.856
GenusIcon
categorical · identifier · long_tail · null_rate
GenusIcon holds 613 short hex-like codes (e.g. 'c688033') across the 625 non-null rows, with 82.51% nulls and an entropy ratio of 0.9989 indicating values are nearly uniformly distributed among non-nulls. The top value appears only twice (top_rate 0.0032), so there is no dominant category; it behaves like a near-unique tag rather than a real categorical feature.
Treatment: Drop for modelling; near-unique with 82% nulls.
- n: 3,573
- nulls: 2,948 (82.5%)
- unique: 613
- top_value: c688033
- top_rate: 0.0032
- cardinality: 613
- entropy: 9.249
- entropy_ratio: 0.9989
ISO_codes
text · foreign_key · one_word · null_rate · short_text
This column holds ISO language codes: almost all values are single tokens of length 3 (len_mean 3.04, one_word_rate 0.99), matching ISO 639-3 conventions (e.g. 'eus', 'gsw', 'deu'). The remaining 1% or so are longer (up to 7 characters and more than one token), which is consistent with a few entries listing more than one code. 26.1% of rows are null and 172 duplicates exist, but with 2,468 unique values across 3,573 rows the vocabulary is wide, suggesting broad multilingual coverage rather than a few dominant languages. No value exceeds 12 occurrences, so the distribution has an extremely long tail.
Treatment: Treat as a categorical code key; impute or filter the 26% nulls before joining to a language reference table.
- n: 3,573
- nulls: 933 (26.1%)
- unique: 2,468
- len_min: 3
- len_max: 7
- len_mean: 3.039
- len_median: 3
- len_p95: 3
- word_mean: 1.01
- word_median: 1
- n_empty: 0
- n_duplicates: 172
- duplicate_rate: 0.06515
- vocab_size: 2,486
- readability_flesch_mean: 117.4
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9902
- allcaps_rate: 0
- boilerplate_rate: 0
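A len_max of 7 together with a word_mean just above 1 hints that a small number of ISO_codes values contain more than one code. If that is the case, splitting before the join avoids losing those rows. The sketch below assumes the codes are whitespace-separated, which the report does not confirm; `split_codes` is a hypothetical helper and 'abc def' is a made-up value used only to show the shape.

```python
def split_codes(value):
    """Split a possibly multi-code field into a list of codes; nulls give []."""
    if not value:
        return []
    return value.split()


print(split_codes("eus"))      # ['eus']
print(split_codes("abc def"))  # ['abc', 'def']
print(split_codes(None))       # []
```

If the real separator turns out to be something else, only the `split` call needs to change.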
Samples_100
categorical · feature · null_rate · imbalance
Boolean flag with only two values where 'False' dominates at 96.2% (2,562 of 2,662 non-null rows) and 'True' appears exactly 100 times. The null rate of 25.5% is high, suggesting the flag is only populated for a subset of records. The entropy ratio of 0.23 confirms severe imbalance.
Treatment: Treat as a rare-event boolean indicator; impute or encode nulls explicitly and avoid as a stratification key given the imbalance.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 2
- top_value: False
- top_rate: 0.9624
- cardinality: 2
- entropy: 0.231
- entropy_ratio: 0.231
Samples_200
categorical · label · null_rate
Binary flag column with only two values (False/True) and heavy class imbalance: False accounts for 92.5% of non-null rows versus 200 True observations, hinting that the name 'Samples_200' refers to a tagged 200-row subset. Roughly a quarter of rows (25.5%) are null, which is the main surprise and the reason for the null_rate alert. The entropy ratio of 0.385 confirms the distribution is far from balanced.
Treatment: Impute or explicitly encode nulls as a third category before using as a binary indicator.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 2
- top_value: False
- top_rate: 0.9249
- cardinality: 2
- entropy: 0.3848
- entropy_ratio: 0.3848
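Encoding the nulls as a third category, as suggested for both sample flags, can be as simple as a three-way mapping. This is a sketch only: it assumes the raw values arrive as booleans, the strings 'True'/'False', or null/empty, and the function name `encode_flag` and the output labels are hypothetical.

```python
def encode_flag(value):
    """Map a raw flag to 'true', 'false', or 'missing'; reject anything else."""
    if value is None or value == "":
        return "missing"
    if isinstance(value, bool):
        return "true" if value else "false"
    text = str(value).strip().lower()
    if text in ("true", "false"):
        return text
    raise ValueError(f"unexpected flag value: {value!r}")


print([encode_flag(v) for v in ["True", "False", None, True]])
```

Raising on unexpected values keeps a malformed export from being folded silently into one of the three categories.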
Country_ID
categorical · foreign_key · null_rate
Country_ID looks like an ISO-style two-letter country code, with 337 distinct values across 3,573 rows and a fairly even spread (entropy ratio 0.752). The top country PG accounts for only 8.06% of non-null rows (214 of 2,655), or 6.0% of all rows, followed by AU, US, and ID. Notably, 25.69% of values are null, and the cardinality of 337 exceeds the 249 officially assigned ISO 3166-1 alpha-2 codes, suggesting non-standard, sub-region, or combined codes are mixed in.
Treatment: Impute or flag the 25.69% nulls and reconcile non-standard codes against an ISO 3166-1 reference before joining.
- n: 3,573
- nulls: 918 (25.7%)
- unique: 337
- top_value: PG
- top_rate: 0.0806
- cardinality: 337
- entropy: 6.314
- entropy_ratio: 0.752
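A first pass at finding the non-standard Country_ID values is to list everything that is not exactly two uppercase letters. This sketch only checks the shape of each value, not whether the code is actually assigned in ISO 3166-1; the helper name `nonstandard_codes` is hypothetical, and 'US CA' and 'pg' are invented examples of what a mismatch might look like.

```python
import re

ALPHA2_SHAPE = re.compile(r"[A-Z]{2}")


def nonstandard_codes(values):
    """Return the sorted distinct non-null values that are not two uppercase letters."""
    return sorted({v for v in values if v and not ALPHA2_SHAPE.fullmatch(v)})


# PG and AU appear in this report; the other values are invented for illustration.
country_ids = ["PG", "AU", "US CA", None, "pg"]
print(nonstandard_codes(country_ids))  # ['US CA', 'pg']
```

Values that pass the shape check still need to be compared against an actual ISO 3166-1 reference list before joining.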
Source
text · metadata · one_word · null_rate
This column holds bibliographic citation tags (e.g. 'Huber-and-Reed-1992', 'Boelaars-1950'), evidently the source reference for each row in what looks like a linguistic typology dataset. Values are mostly short (median 25 characters, 2 words) and 45.5% are single tokens, consistent with author-year keys rather than prose, although the 452-character maximum shows that some rows carry much longer entries. Cardinality is high (2,373 unique values) with 5% duplicates and a 30% null rate, so coverage is uneven and no single source dominates (the top value appears only 14 times).
Treatment: Treat as a citation key: keep as categorical provenance metadata, optionally normalize casing and join to a bibliography table; do not use as a model feature.
- n: 3,573
- nulls: 1,074 (30.1%)
- unique: 2,373
- len_min: 7
- len_max: 452
- len_mean: 42.07
- len_median: 25
- len_p95: 135
- word_mean: 2.854
- word_median: 2
- n_empty: 0
- n_duplicates: 126
- duplicate_rate: 0.05042
- vocab_size: 5,899
- readability_flesch_mean: 21.33
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.4546
- allcaps_rate: 0
- boilerplate_rate: 0
Parent_ID
categorical · foreign_key · long_tail
Parent_ID looks like a foreign key pointing to a parent classification entry, typically a linguistic genus (e.g. 'genus-oceanic', 'genus-bantu'), grouping the 3,319 non-null rows under 911 distinct parents. The distribution is long-tailed but flat: the top value covers only 4.5% of non-null rows and the entropy ratio is 0.87, and 7.1% of values are null. Oceanic and Bantu lead the head, with Indic, Western Pama-Nyungan, and Semitic trailing far behind.
Treatment: Left-join on this id to a genus lookup; treat the 7.1% nulls explicitly rather than one-hot encoding the 911 levels.
- n: 3,573
- nulls: 254 (7.1%)
- unique: 911
- top_value: genus-oceanic
- top_rate: 0.04489
- cardinality: 911
- entropy: 8.554
- entropy_ratio: 0.87
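The suggested left join with explicit null handling can be sketched with a plain dictionary lookup. Everything here is illustrative: the function name `left_join_parent`, the `parent_name` and `join_status` fields, the row IDs, and the lookup table are hypothetical, with only 'genus-oceanic' and 'genus-bantu' taken from this report.

```python
def left_join_parent(rows, parents):
    """Attach the parent name to each row and record why a match is absent."""
    joined = []
    for row in rows:
        parent_id = row.get("Parent_ID")
        if parent_id is None:
            status = "no_parent"   # null Parent_ID, kept as its own case
        elif parent_id in parents:
            status = "matched"
        else:
            status = "unmatched"   # Parent_ID present but missing from the lookup
        joined.append(
            {**row, "parent_name": parents.get(parent_id), "join_status": status}
        )
    return joined


# Hypothetical lookup and rows for illustration.
parent_lookup = {"genus-oceanic": "Oceanic"}
rows = [
    {"ID": "row1", "Parent_ID": "genus-oceanic"},
    {"ID": "row2", "Parent_ID": "genus-bantu"},
    {"ID": "row3", "Parent_ID": None},
]
for joined_row in left_join_parent(rows, parent_lookup):
    print(joined_row["ID"], joined_row["parent_name"], joined_row["join_status"])
```

Keeping 'no_parent' and 'unmatched' apart makes it possible to tell the 254 null rows from genuine lookup gaps after the join.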