language data wals languages
Reading
This dataset catalogs 3,573 world languages (WALS) across 17 columns that combine identifiers (ISO codes, Glottocode), classifications (Family, Genus, Subfamily), geography (Latitude, Longitude, Macroarea, Country_ID), and sampling flags. The Family and Macroarea distributions are the most informative starting point: Niger-Congo and Austronesian are the largest families at 324 languages each, and Eurasia (659) and Africa (606) lead the six macroareas. Roughly a quarter of rows (911 of 3,573, null rate 0.255) are missing the geographic, family, genus, and sampling fields in lockstep, which suggests a shared set of unclassified entries worth investigating. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 'True' values respectively), reflecting curated WALS sub-samples. Subfamily is sparsely populated (74.5% null), so treat it as supplementary rather than primary.
citing: Family.top_values · Macroarea.top_values · Genus.top_values · Country_ID.top_values · Samples_100.stats · Samples_200.stats · Latitude.stats · Longitude.stats · Subfamily.null_rate
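
The "in lockstep" observation can be checked directly on the raw table. The following is a minimal sketch, assuming the rows are loaded as dictionaries with `None` for missing values; the function names and the three sample rows are invented for illustration and are not taken from the dataset.

```python
# Columns the profile reports with the same 911 nulls (25.5%).
LOCKSTEP_COLUMNS = ["Macroarea", "Latitude", "Longitude", "Family", "Genus",
                    "Samples_100", "Samples_200"]

def null_pattern(row, columns=LOCKSTEP_COLUMNS):
    """Return a tuple of booleans: True where the column is missing."""
    return tuple(row.get(col) is None for col in columns)

def lockstep_report(rows, columns=LOCKSTEP_COLUMNS):
    """Count rows that are all-present, all-missing, or mixed across columns."""
    report = {"all_present": 0, "all_missing": 0, "mixed": 0}
    for row in rows:
        pattern = null_pattern(row, columns)
        if all(pattern):
            report["all_missing"] += 1
        elif not any(pattern):
            report["all_present"] += 1
        else:
            report["mixed"] += 1
    return report

# Hypothetical rows, for illustration only.
rows = [
    {"ID": "x1", "Macroarea": "Eurasia", "Latitude": 43.0, "Longitude": -1.5,
     "Family": "F", "Genus": "G", "Samples_100": "False", "Samples_200": "True"},
    {"ID": "x2", "Macroarea": None, "Latitude": None, "Longitude": None,
     "Family": None, "Genus": None, "Samples_100": None, "Samples_200": None},
    {"ID": "x3", "Macroarea": "Africa", "Latitude": None, "Longitude": 10.0,
     "Family": "F", "Genus": "G", "Samples_100": "False", "Samples_200": "False"},
]
report = lockstep_report(rows)
print(report)
```

If the nulls really move together, `mixed` should be 0 on the full table and `all_missing` should be 911. One hypothesis worth testing, which the profile itself does not state: the Family, Genus, and Subfamily cardinalities sum to 254 + 625 + 32 = 911, the same as the shared null count, so the all-missing rows may be classification nodes rather than unclassified languages.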
Charts the summary said to look at first
Family: top values (share is of all 3,573 rows, nulls included)
| value | count | share |
|---|---|---|
| Niger-Congo | 324 | 9.1% |
| Austronesian | 324 | 9.1% |
| Indo-European | 176 | 4.9% |
| Sino-Tibetan | 146 | 4.1% |
| Afro-Asiatic | 145 | 4.1% |
| Pama-Nyungan | 121 | 3.4% |
| Trans-New Guinea | 98 | 2.7% |
| other | 72 | 2.0% |
| Altaic | 65 | 1.8% |
| Oto-Manguean | 56 | 1.6% |
| Austro-Asiatic | 48 | 1.3% |
| Eastern Sudanic | 47 | 1.3% |
| Uto-Aztecan | 44 | 1.2% |
| Algic | 31 | 0.9% |
| Mayan | 30 | 0.8% |
| Arawakan | 29 | 0.8% |
| Nakh-Daghestanian | 28 | 0.8% |
| Mande | 28 | 0.8% |
| Uralic | 27 | 0.8% |
| Hokan | 26 | 0.7% |
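
The 9.1% share in this table and the 0.1217 top_rate reported in the Family section describe the same 324 rows with different denominators. A short worked check, using only counts that appear in the profile:

```python
TOTAL_ROWS = 3573      # all rows in the profile
FAMILY_NULLS = 911     # rows with no Family value
count = 324            # Niger-Congo (Austronesian has the same count)

non_null_rows = TOTAL_ROWS - FAMILY_NULLS
share_of_all = count / TOTAL_ROWS          # denominator used by the table
share_of_non_null = count / non_null_rows  # denominator used by top_rate

print(f"share of all rows:      {share_of_all:.1%}")
print(f"share of non-null rows: {share_of_non_null:.1%}")
```

The same distinction applies to the Macroarea, Country_ID, and Samples tables, so it helps to state the denominator whenever a percentage is quoted.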
Macroarea: value counts (share is of all 3,573 rows, nulls included)
| value | count | share |
|---|---|---|
| Eurasia | 659 | 18.4% |
| Africa | 606 | 17.0% |
| Papunesia | 560 | 15.7% |
| North America | 396 | 11.1% |
| South America | 258 | 7.2% |
| Australia | 183 | 5.1% |
Country_ID: top values (share is of all 3,573 rows, nulls included)
| value | count | share |
|---|---|---|
| PG | 214 | 6.0% |
| AU | 185 | 5.2% |
| US | 177 | 5.0% |
| ID | 177 | 5.0% |
| IN | 120 | 3.4% |
| MX | 120 | 3.4% |
| RU | 89 | 2.5% |
| NG | 66 | 1.8% |
| BR | 66 | 1.8% |
| CN | 54 | 1.5% |
| CD | 49 | 1.4% |
| CM | 46 | 1.3% |
| CA | 45 | 1.3% |
| CO | 39 | 1.1% |
| ET | 36 | 1.0% |
| PH | 36 | 1.0% |
| PE | 35 | 1.0% |
| NP | 32 | 0.9% |
| TZ | 28 | 0.8% |
| VU | 28 | 0.8% |
Latitude: histogram (40 equal-width bins, degrees)
| bin | count |
|---|---|
| -55 – -51.84 | 2 |
| -51.84 – -48.69 | 1 |
| -48.69 – -45.53 | 1 |
| -45.53 – -42.38 | 2 |
| -42.38 – -39.22 | 3 |
| -39.22 – -36.06 | 5 |
| -36.06 – -32.91 | 9 |
| -32.91 – -29.75 | 18 |
| -29.75 – -26.59 | 24 |
| -26.59 – -23.44 | 29 |
| -23.44 – -20.28 | 47 |
| -20.28 – -17.12 | 64 |
| -17.12 – -13.97 | 85 |
| -13.97 – -10.81 | 86 |
| -10.81 – -7.656 | 129 |
| -7.656 – -4.5 | 187 |
| -4.5 – -1.344 | 190 |
| -1.344 – 1.812 | 130 |
| 1.812 – 4.969 | 119 |
| 4.969 – 8.125 | 196 |
| 8.125 – 11.28 | 161 |
| 11.28 – 14.44 | 104 |
| 14.44 – 17.59 | 145 |
| 17.59 – 20.75 | 82 |
| 20.75 – 23.91 | 63 |
| 23.91 – 27.06 | 74 |
| 27.06 – 30.22 | 84 |
| 30.22 – 33.38 | 66 |
| 33.38 – 36.53 | 84 |
| 36.53 – 39.69 | 67 |
| 39.69 – 42.84 | 86 |
| 42.84 – 46 | 62 |
| 46 – 49.16 | 74 |
| 49.16 – 52.31 | 52 |
| 52.31 – 55.47 | 44 |
| 55.47 – 58.62 | 18 |
| 58.62 – 61.78 | 20 |
| 61.78 – 64.94 | 23 |
| 64.94 – 68.09 | 19 |
| 68.09 – 71.25 | 7 |
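
The bin edges above are consistent with 40 equal-width bins between the reported minimum (-55) and maximum (71.25). A minimal sketch that reproduces the edges, assuming that binning rule (the profiler's actual implementation is not shown in the report):

```python
LAT_MIN, LAT_MAX, N_BINS = -55.0, 71.25, 40

width = (LAT_MAX - LAT_MIN) / N_BINS
edges = [LAT_MIN + i * width for i in range(N_BINS + 1)]

def bin_index(lat):
    """Index of the equal-width bin containing lat; the maximum falls in the last bin."""
    if not LAT_MIN <= lat <= LAT_MAX:
        raise ValueError("latitude outside the profiled range")
    return min(int((lat - LAT_MIN) / width), N_BINS - 1)

# The median latitude (8.292) should land in the bin that starts at 8.125.
median_bin = bin_index(8.292)
print(width, edges[median_bin], edges[median_bin + 1])
```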
Samples_200: value counts (share is of all 3,573 rows, nulls included)
| value | count | share |
|---|---|---|
| False | 2462 | 68.9% |
| True | 200 | 5.6% |
Schema
17 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| ID | text | 0.0% | 3,573 | near_unique, one_word, short_text |
| Name | text | 0.0% | 3,198 | one_word, short_text |
| Macroarea | categorical | 25.5% | 6 | null_rate |
| Latitude | numeric | 25.5% | 887 | null_rate |
| Longitude | numeric | 25.5% | 1,360 | null_rate |
| Glottocode | text | 26.0% | 2,502 | one_word, null_rate, short_text |
| ISO639P3code | text | 26.8% | 2,442 | one_word, null_rate, short_text |
| Family | categorical | 25.5% | 254 | null_rate |
| Subfamily | categorical | 74.5% | 32 | null_rate |
| Genus | categorical | 25.5% | 625 | null_rate |
| GenusIcon | categorical | 82.5% | 613 | long_tail, null_rate |
| ISO_codes | text | 26.1% | 2,468 | one_word, null_rate, short_text |
| Samples_100 | categorical | 25.5% | 2 | null_rate, imbalance |
| Samples_200 | categorical | 25.5% | 2 | null_rate |
| Country_ID | categorical | 25.7% | 337 | null_rate |
| Source | text | 30.1% | 2,373 | one_word, null_rate |
| Parent_ID | categorical | 7.1% | 911 | long_tail |
ID
text identifier near_unique one_word short_text
This is an identifier column: each of the 3,573 rows holds a unique single-token string, with no nulls or duplicates. Values are short (median length 3, max 36), and the vocabulary size equals the row count (3,573), confirming one-to-one uniqueness. Top tokens such as 'aab', 'aar', and 'aba' suggest short alphabetic codes rather than numeric keys. Treatment: drop from modelling; retain only as a join key.
- n: 3,573
- nulls: 0 (0.0%)
- unique: 3,573
- len_min: 2
- len_max: 36
- len_mean: 5.982
- len_median: 3
- len_p95: 17
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 3,573
- readability_flesch_mean: 61.58
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
Name
text label one_word short_text
This column holds short proper-noun labels, almost certainly language or ethnonym names, given top values like 'Basque', 'Ainu', 'Beothuk' and frequent words 'sign', 'language', 'arabic', 'german'. Entries are terse (mean 8.7 characters, 80% one-word) but not unique: there are only 3,198 distinct names across 3,573 rows, so 375 rows (10.5%) repeat an earlier name, and several names appear exactly 3 times. This suggests the dataset repeats some languages across multiple records or variants (e.g. '(northern)', '(southern)'). No nulls, no URLs, no emoji. Treatment: treat as a categorical key; deduplicate or join on a normalized form before aggregating.
- n: 3,573
- nulls: 0 (0.0%)
- unique: 3,198
- len_min: 2
- len_max: 46
- len_mean: 8.705
- len_median: 7
- len_p95: 19
- word_mean: 1.258
- word_median: 1
- n_empty: 0
- n_duplicates: 375
- duplicate_rate: 0.105
- vocab_size: 3,383
- readability_flesch_mean: 48.16
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.8002
- allcaps_rate: 0
- boilerplate_rate: 0
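
The "normalized form" mentioned in the treatment is not defined by the profile. One possible reading, sketched here under that assumption, is to drop parenthesised qualifiers, collapse whitespace, and case-fold, then list the names that collide. The helper names and the 'Examplish' entries are hypothetical.

```python
import re
from collections import defaultdict

# Matches a parenthesised qualifier such as " (northern)", with leading spaces.
_PAREN = re.compile(r"\s*\([^)]*\)")

def normalize_name(name):
    """Drop parenthesised qualifiers, collapse whitespace, and case-fold."""
    stripped = _PAREN.sub("", name)
    return " ".join(stripped.split()).casefold()

def group_by_normalized(names):
    """Map each normalized form to its original spellings, keeping only collisions."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Illustrative input; 'Examplish' is a made-up name.
names = ["Examplish (northern)", "Examplish (southern)", "Basque", "basque ", "Ainu"]
dupes = group_by_normalized(names)
print(dupes)
```

Stripping qualifiers merges variants that may be genuinely distinct languages, so the output is better used as a list of candidate groups to review than as an automatic merge.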
Macroarea
categorical feature null_rate
Macroarea is a coarse geographic grouping with 6 categories (Eurasia, Africa, Papunesia, North America, South America, and Australia), consistent with WALS/Glottolog-style language area labels. The distribution is relatively even (entropy ratio 0.95; the top value, Eurasia, covers 24.8% of non-null rows), so no single region dominates. The 25.5% null rate (911 rows) is substantial and flagged. Treatment: one-hot encode and add an explicit 'missing' category to preserve the 25.5% nulls.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 6
- top_value: Eurasia
- top_rate: 0.2476
- cardinality: 6
- entropy: 2.459
- entropy_ratio: 0.9511
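
The entropy figures can be reproduced from the Macroarea counts in the data table. This sketch assumes the profiler uses base-2 Shannon entropy over non-null values and divides by log2 of the cardinality; that definition is inferred from the numbers, not documented in the report.

```python
import math

# Counts from the Macroarea data table (non-null rows only).
macroarea_counts = {
    "Eurasia": 659, "Africa": 606, "Papunesia": 560,
    "North America": 396, "South America": 258, "Australia": 183,
}

def entropy_bits(counts):
    """Shannon entropy in bits of a collection of category counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

entropy = entropy_bits(macroarea_counts.values())
entropy_ratio = entropy / math.log2(len(macroarea_counts))  # 1.0 means perfectly uniform
top_rate = max(macroarea_counts.values()) / sum(macroarea_counts.values())

print(round(entropy, 3), round(entropy_ratio, 4), round(top_rate, 4))
```

A ratio near 1 means the categories are close to equally frequent, which is why 0.95 is read above as "no single region dominates".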
Latitude
numeric feature null_rate
Geographic latitude in degrees, ranging from -55.0 to 71.25 with a median of 8.29 and an IQR of 33.0, consistent with a worldwide point distribution. The 25.5% null rate is notable and flagged. Skew (0.36) and kurtosis (-0.50) indicate a fairly symmetric, slightly flat spread, with only one outlier flagged. Treatment: impute or filter the 25.5% missing values, and pair with Longitude for any geospatial modelling.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 887
- min: -55
- max: 71.25
- mean: 11.88
- median: 8.292
- std: 22.72
- q1: -5
- q3: 28
- iqr: 33
- skew: 0.3562
- kurtosis: -0.5023
- n_outliers: 1
- outlier_rate: 0.0003757
- zero_rate: 0.002254
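
The single flagged outlier is consistent with the common 1.5 × IQR (Tukey) fence rule, although the report does not say which rule the profiler applies. Under that assumption, the fences follow from the quartiles listed above:

```python
Q1, Q3 = -5.0, 28.0             # quartiles from the stats list
LAT_MIN, LAT_MAX = -55.0, 71.25
NON_NULL = 3573 - 911           # rows with a latitude

iqr = Q3 - Q1
lower_fence = Q1 - 1.5 * iqr
upper_fence = Q3 + 1.5 * iqr

def is_outlier(lat):
    """True when lat falls outside the 1.5 * IQR fences (assumed rule)."""
    return lat < lower_fence or lat > upper_fence

print(lower_fence, upper_fence)
print(is_outlier(LAT_MIN), is_outlier(LAT_MAX))
print(1 / NON_NULL)  # compare with the reported outlier_rate
```

Only values below -54.5 can be flagged, which matches a lone point at the minimum of -55, and one outlier among 2,662 non-null rows gives the reported outlier_rate of about 0.0003757.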
Longitude
numeric feature null_rate
Geographic longitude in degrees, spanning nearly the full globe from -178.17 to 179.17, with mild negative skew (-0.33) and flat kurtosis (-1.05), consistent with a worldwide point distribution. The 25.5% null rate is the main concern. Only 1,360 unique values appear among the 2,662 non-null rows, suggesting repeated locations or rounded coordinates. No outliers are flagged, as expected for a bounded angular measure. Treatment: pair with Latitude for geospatial features; impute or drop the 25.5% missing values before modelling.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 1,360
- min: -178.2
- max: 179.2
- mean: 35.17
- median: 34.79
- std: 89.35
- q1: -45.75
- q3: 121
- iqr: 166.8
- skew: -0.3259
- kurtosis: -1.047
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.001503
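
Pairing Latitude with Longitude needs some care because longitude wraps at ±180: the two extremes reported above (-178.17 and 179.17) are numerically far apart but geographically close. One common option, shown here as a sketch and not something the profile prescribes, is to map each point to unit-sphere coordinates before computing distances or feeding a model.

```python
import math

def to_unit_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def chord_distance(a, b):
    """Straight-line distance between two (lat, lon) points on the unit sphere."""
    return math.dist(to_unit_xyz(*a), to_unit_xyz(*b))

# Two equatorial points on either side of the antimeridian.
near = chord_distance((0.0, 179.17), (0.0, -178.17))
# An equatorial point at longitude 0 versus one near the antimeridian.
far = chord_distance((0.0, 0.0), (0.0, 179.17))
print(near, far)
```

Rows with missing coordinates still have to be dropped or flagged first; the conversion itself does not handle `None`.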
Glottocode
text foreign_key one_word null_rate short_text
This column holds Glottocodes, fixed 8-character language identifiers from the Glottolog catalogue (e.g. 'basq1248', 'stan1295'); every value is a single token of length exactly 8. About 26% of rows (928) are null, and the 2,645 non-null records carry 2,502 distinct codes, a 5.4% duplicate rate (143 rows). The most repeated code, 'basq1248', appears 11 times, so multiple records can share a language. Treatment: left-join on this code to a Glottolog reference table; impute or flag the 26% nulls separately.
- n: 3,573
- nulls: 928 (26.0%)
- unique: 2,502
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 143
- duplicate_rate: 0.05406
- vocab_size: 2,502
- readability_flesch_mean: 92.88
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
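
The left join and the separate handling of nulls can be sketched as follows. The shape check assumes the usual Glottocode layout of four lowercase alphanumeric characters followed by four digits; the reference dictionary, its contents, and the column names `glottocode_status` and `reference` are placeholders for illustration.

```python
import re

# Assumed Glottocode shape: 4 lowercase alphanumerics + 4 digits, e.g. 'basq1248'.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def classify_glottocode(value):
    """Label a value as 'missing', 'valid', or 'malformed' by shape alone."""
    if value is None:
        return "missing"
    return "valid" if GLOTTOCODE_RE.fullmatch(value) else "malformed"

def left_join(rows, reference, key="Glottocode"):
    """Left join: keep every row and attach the reference record, or None."""
    joined = []
    for row in rows:
        code = row.get(key)
        joined.append({
            **row,
            "glottocode_status": classify_glottocode(code),
            "reference": reference.get(code),
        })
    return joined

# Placeholder reference table and rows, for illustration only.
reference = {"basq1248": {"label": "placeholder reference record"}}
rows = [
    {"ID": "a", "Glottocode": "basq1248"},
    {"ID": "b", "Glottocode": None},
    {"ID": "c", "Glottocode": "zzzz9999"},
]
joined = left_join(rows, reference)
for row in joined:
    print(row["ID"], row["glottocode_status"], row["reference"])
```

Keeping the status next to the joined record separates three cases that would otherwise all show up as an empty join: a missing code, a malformed code, and a well-formed code absent from the reference table.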
ISO639P3code
text foreign_key one_word null_rate short_text
This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and a single word (len_mean 3.0, one_word_rate 1.0), with familiar codes like 'eus' (Basque), 'deu' (German), and 'gsw' (Swiss German) at the top. Coverage is incomplete: 26.84% of rows are null, and the 2,614 non-null rows carry 2,442 unique codes, a 6.58% duplicate rate (172 rows). Nothing in the evidence indicates which entity each code is tagging. Treatment: treat as a categorical join key to an ISO 639-3 reference table; impute or filter the 26.84% nulls before use.
- n: 3,573
- nulls: 959 (26.8%)
- unique: 2,442
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 172
- duplicate_rate: 0.0658
- vocab_size: 2,442
- readability_flesch_mean: 119.5
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
Family
categorical feature null_rate
Categorical label assigning each of the 2,662 non-null rows to one of 254 language families, headed by Niger-Congo and Austronesian (tied at 324 rows, 12.2% of non-null rows each). The long tail is heavy: an entropy ratio of 0.705 indicates the distribution is fairly spread across families rather than dominated by a few. The remaining 25.5% of rows are null, which is a substantial gap for what looks like a taxonomic feature. Treatment: impute or add an explicit 'unknown' category for the 25.5% nulls, then group rare families before encoding.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 254
- top_value: Niger-Congo
- top_rate: 0.1217
- cardinality: 254
- entropy: 5.631
- entropy_ratio: 0.7049
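
The two-step treatment (explicit 'unknown' level, then grouping of rare families) might look like the sketch below. The threshold and the label `rare_family` are assumptions; a label other than 'other' is used on purpose, because the Family data table already contains a value called 'other' (72 rows).

```python
from collections import Counter

def group_rare(values, min_count, unknown="unknown", rare_label="rare_family"):
    """Replace None with `unknown`, then fold levels seen fewer than
    `min_count` times into `rare_label`. The unknown level is never folded."""
    filled = [unknown if v is None else v for v in values]
    counts = Counter(filled)
    return [
        v if v == unknown or counts[v] >= min_count else rare_label
        for v in filled
    ]

# Illustrative values; the real column has 254 families.
families = ["Niger-Congo", "Niger-Congo", "Niger-Congo", "Mayan", None, None]
grouped = group_rare(families, min_count=2)
print(grouped)
```

The threshold is a modelling choice: the data table shows families with as few as 26 rows in the top 20, so a cut-off anywhere below that already folds most of the 254 levels.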
Subfamily
categorical feature null_rate
This column records the linguistic subfamily classification of entries, drawn from a controlled vocabulary of 32 values such as Benue-Congo, Eastern Malayo-Polynesian, and Tibeto-Burman. Coverage is the main concern: 74.5% of rows are null, leaving only 911 labelled records, with Benue-Congo accounting for 21.95% of those. Among populated rows the distribution is reasonably diverse (entropy ratio 0.77), so the signal is informative where present but sparse overall. Treatment: treat as a sparse categorical and impute an explicit 'unknown' level before encoding, since 74.5% are null.
- n: 3,573
- nulls: 2,662 (74.5%)
- unique: 32
- top_value: Benue-Congo
- top_rate: 0.2195
- cardinality: 32
- entropy: 3.856
- entropy_ratio: 0.7712
Genus
categorical feature null_rate
Genus is a linguistic genus label (a subfamily-level grouping of languages), with values like Oceanic, Bantu, Indic, and Semitic. It is highly diverse: 625 distinct genera across the 2,662 non-null rows, with an entropy ratio of 0.86 and the top value, Oceanic, covering only 5.6% of them. The 25.5% null rate is the flagged concern. Treatment: treat as a high-cardinality categorical; target- or frequency-encode, and explicitly model the 25.5% missing as its own category.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 625
- top_value: Oceanic
- top_rate: 0.05597
- cardinality: 625
- entropy: 7.95
- entropy_ratio: 0.856
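
Frequency encoding with the missing values kept as their own level can be sketched as below. Target encoding is not shown because the profile does not name a target variable. The sentinel `__missing__` and the function names are assumptions made for this example.

```python
from collections import Counter

MISSING = "__missing__"  # sentinel level for null Genus values (hypothetical name)

def fit_frequency_encoding(values):
    """Learn each level's relative frequency, counting None as its own level."""
    filled = [MISSING if v is None else v for v in values]
    counts = Counter(filled)
    total = len(filled)
    return {level: n / total for level, n in counts.items()}

def transform(values, encoding, default=0.0):
    """Map values to learned frequencies; unseen levels get `default`."""
    return [encoding.get(MISSING if v is None else v, default) for v in values]

# Illustrative training values; the real column has 625 genera.
train = ["Oceanic", "Oceanic", "Bantu", None]
encoding = fit_frequency_encoding(train)
encoded = transform(["Bantu", None, "Semitic"], encoding)
print(encoding, encoded)
```

Fitting the frequencies on the training split only, and applying them unchanged to validation data, avoids leaking information across splits.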
GenusIcon
categorical metadata long_tail null_rate
GenusIcon is a high-cardinality categorical with 613 unique values among only 625 non-null rows; 82.51% of the 3,573 rows are null. An entropy ratio of 0.9989 and a top_rate of just 0.0032 mean the non-null values are nearly uniformly distributed, with the most frequent code, 'c688033', appearing only twice. The hex-like tokens (e.g. 'c807D33') suggest icon identifiers or color/asset codes rather than a meaningful category. Treatment: drop or retain as a sparse asset reference; not useful as a modelling feature given near-unique values and 82.51% nulls.
- n: 3,573
- nulls: 2,948 (82.5%)
- unique: 613
- top_value: c688033
- top_rate: 0.0032
- cardinality: 613
- entropy: 9.249
- entropy_ratio: 0.9989
ISO_codes
text feature one_word null_rate short_text
Almost certainly ISO 639-3 language codes: 99% are single tokens, length is tightly clustered at 3 characters (min 3, max 7, p95 3), and top values like 'eus', 'deu', 'gsw', 'bod', 'roh' are recognisable three-letter language identifiers. Cardinality is high (2,468 unique among 2,640 non-null rows) with a 26.1% null rate and 172 duplicates, so coverage is partial and no single code dominates (top value 'eus' appears just 12 times). The handful of length-7 entries is anomalous for a strict ISO 639-3 field and worth inspecting. Treatment: treat as a categorical code; validate against the ISO 639-3 list and investigate entries longer than 3 characters.
- n: 3,573
- nulls: 933 (26.1%)
- unique: 2,468
- len_min: 3
- len_max: 7
- len_mean: 3.039
- len_median: 3
- len_p95: 3
- word_mean: 1.01
- word_median: 1
- n_empty: 0
- n_duplicates: 172
- duplicate_rate: 0.06515
- vocab_size: 2,486
- readability_flesch_mean: 117.4
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.9902
- allcaps_rate: 0
- boilerplate_rate: 0
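
The longer entries can be inspected with a small classifier. The profile does not show any of them, so the following rests on a guess: a length of 7 is what two 3-letter codes joined by one space would produce, and a one_word_rate of 0.9902 with a vocab_size (2,486) above the unique count (2,468) points the same way. The function name and sample values are illustrative, and the regular expression checks shape only, not membership in the ISO 639-3 list.

```python
import re

ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")  # three lowercase letters

def inspect_iso_codes(value):
    """Split a cell on whitespace and classify it by token count and shape."""
    if value is None:
        return {"tokens": [], "status": "missing"}
    tokens = value.split()
    if not tokens or not all(ISO_639_3_SHAPE.fullmatch(t) for t in tokens):
        status = "malformed"
    elif len(tokens) == 1:
        status = "single"
    else:
        status = "multiple"
    return {"tokens": tokens, "status": status}

# Illustrative cells; 'deu gsw' is a made-up combination of codes from the profile.
for cell in ["eus", "deu gsw", "EUS1", None]:
    print(repr(cell), inspect_iso_codes(cell))
```

If most long entries come back as `multiple`, exploding them into one row per code before joining to a reference table is probably more useful than treating them as errors.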
Samples_100
categorical feature null_rate imbalance
Boolean flag with only two values (False/True), where False dominates at 96.2% of non-null rows (2,562 vs 100). The name 'Samples_100' plus the exact count of 100 True values suggests this marks a curated subset of 100 sampled records. The 25.5% null rate is notable and should be reconciled before use. Treatment: treat as a boolean subset indicator; impute or exclude nulls, and avoid using it as a model feature given the severe imbalance.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 2
- top_value: False
- top_rate: 0.9624
- cardinality: 2
- entropy: 0.231
- entropy_ratio: 0.231
Samples_200
categorical metadata null_rate
Binary True/False flag, almost certainly indicating membership in a 200-row sample (the name 'Samples_200' and the exact count of 200 'True' values support this). The column is heavily imbalanced ('False' covers 92.5% of non-null rows, 2,462 vs 200), and 25.5% of values are null, which is unusual for a sampling indicator and worth investigating. Treatment: use as a boolean filter/split flag; reconcile the 25.5% nulls (treat as False or exclude) before relying on it.
- n: 3,573
- nulls: 911 (25.5%)
- unique: 2
- top_value: False
- top_rate: 0.9249
- cardinality: 2
- entropy: 0.3848
- entropy_ratio: 0.3848
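
Both sample flags can be parsed as three-state values so that the choice between "treat nulls as False" and "exclude nulls" is explicit. A minimal sketch, assuming the flags are stored as the strings 'True' and 'False' with `None` for nulls; the function names and sample rows are hypothetical.

```python
def parse_flag(value, null_as=None):
    """Parse a True/False cell. Nulls become `null_as` (None keeps them distinct)."""
    if value is None:
        return null_as
    text = str(value).strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"unexpected flag value: {value!r}")

def select_sample(rows, column="Samples_200", null_as=False):
    """Return the rows whose flag parses to True."""
    return [row for row in rows if parse_flag(row.get(column), null_as) is True]

# Illustrative rows.
rows = [
    {"ID": "a", "Samples_200": "True"},
    {"ID": "b", "Samples_200": "False"},
    {"ID": "c", "Samples_200": None},
]
sample_ids = [row["ID"] for row in select_sample(rows)]
print(sample_ids)
```

On the full table, `select_sample` should return 200 rows for Samples_200 and 100 for Samples_100 whichever way nulls are handled, since nulls never count as True.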
Country_ID
categorical foreign_key null_rate
Two-letter country codes (PG, AU, US, ID, IN...) identifying the country associated with each record, with 337 distinct values among the 2,655 non-null rows. The cardinality is suspiciously high, since ISO 3166-1 alpha-2 defines only about 250 codes, hinting at non-standard or sub-region codes mixed in. The distribution is fairly flat (entropy ratio 0.752; the top value, PG, covers only 8.06% of non-null rows), and 25.69% of rows are null. Treatment: validate codes against ISO 3166, impute or flag the 25.69% nulls, then left-join on this id.
- n: 3,573
- nulls: 918 (25.7%)
- unique: 337
- top_value: PG
- top_rate: 0.0806
- cardinality: 337
- entropy: 6.314
- entropy_ratio: 0.752
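
Before validating against the full ISO 3166-1 list, a shape check can show why there are more distinct values than assigned codes. The profile does not reveal what the non-standard values look like; one possible explanation, assumed here for illustration, is cells that list several countries. The function name and sample values are hypothetical.

```python
import re
from collections import Counter

ALPHA2_SHAPE = re.compile(r"[A-Z]{2}")  # two uppercase letters (shape only)

def country_shape(value):
    """Classify a Country_ID cell by shape, without consulting the ISO list."""
    if value is None:
        return "missing"
    tokens = value.split()
    if len(tokens) == 1 and ALPHA2_SHAPE.fullmatch(tokens[0]):
        return "single_alpha2"
    if len(tokens) > 1 and all(ALPHA2_SHAPE.fullmatch(t) for t in tokens):
        return "multiple_alpha2"
    return "other"

# Illustrative cells; 'US CA' and 'pg' are made-up examples of irregular values.
cells = ["PG", "AU", "US CA", None, "pg"]
shapes = Counter(country_shape(cell) for cell in cells)
print(shapes)
```

Cells in the `single_alpha2` bucket still need a lookup against the real code list; the shape check only narrows down where the extra cardinality comes from.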
Source
text metadata one_word null_rate
This column holds bibliographic citation tags (e.g., 'Huber-and-Reed-1992', 'Boelaars-1950'), almost certainly the source reference for each row in what appears to be a linguistic dataset. About 45% of values are a single token and the median length is 25 characters, consistent with compact Author-Year keys, but 30% of rows are null, and 2,373 of the 2,499 non-null values are unique, with only 126 duplicates. Citation keys like 'nichols-1992' and 'malherbe-and-rosenberg-1996' recur most often (113 occurrences each), suggesting a few reference works supply many entries. Treatment: normalize casing and keep as a categorical provenance tag; impute or flag the 30% nulls rather than modelling the text.
- n: 3,573
- nulls: 1,074 (30.1%)
- unique: 2,373
- len_min: 7
- len_max: 452
- len_mean: 42.07
- len_median: 25
- len_p95: 135
- word_mean: 2.854
- word_median: 2
- n_empty: 0
- n_duplicates: 126
- duplicate_rate: 0.05042
- vocab_size: 5,899
- readability_flesch_mean: 21.33
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.4546
- allcaps_rate: 0
- boilerplate_rate: 0
Parent_ID
categorical foreign_key long_tail
Parent_ID looks like a foreign key pointing to a linguistic genus (e.g. 'genus-oceanic', 'genus-bantu'), grouping the 3,319 non-null rows into 911 parent categories. The distribution is long-tailed but not dominated: the top value covers only 4.5% of non-null rows and entropy is 8.55 (ratio 0.87), so most genera carry few members. About 7.1% of values (254 rows) are null, which will need a decision before any join or grouping. Treatment: left-join on this id to a genus lookup; impute or flag the 7.1% nulls before grouping.
- n: 3,573
- nulls: 254 (7.1%)
- unique: 911
- top_value: genus-oceanic
- top_rate: 0.04489
- cardinality: 911
- entropy: 8.554
- entropy_ratio: 0.87
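
Before joining Parent_ID to a genus lookup, it may help to tally the prefix of each value. The profile only shows 'genus-' examples, but the 911 distinct parents exceed the 625 distinct Genus values, so other prefixes may be present; that is an inference, not something the report confirms. The function name and the 'family-example' value below are hypothetical.

```python
from collections import Counter

def parent_kind(parent_id):
    """Return the text before the first hyphen, e.g. 'genus' for 'genus-oceanic'."""
    if parent_id is None:
        return "no_parent"
    kind, sep, slug = parent_id.partition("-")
    return kind if sep and slug else "unparsed"

# Illustrative values; only the 'genus-' examples come from the profile.
parent_ids = ["genus-oceanic", "genus-bantu", "family-example", None, "oceanic"]
kinds = Counter(parent_kind(p) for p in parent_ids)
print(kinds)
```

If more than one kind turns up, each one needs its own lookup table, and the 254 rows without a parent can be kept as a separate `no_parent` group instead of being imputed.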