language data wals languages

source /home/coolhand/datasets/language-data/wals_languages.csv · 3,573 rows · 17 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 3,573 WALS (World Atlas of Language Structures) language records across 17 columns combining identifiers (ISO codes, Glottocode), classifications (Family, Genus, Subfamily), geography (Latitude, Longitude, Macroarea, Country_ID), and sampling flags. The Family and Macroarea distributions are the most informative starting point: Niger-Congo and Austronesian tie for the largest family at 324 languages each, and Eurasia (659) and Africa (606) lead the six macroareas. Roughly a quarter of rows (911 of 3,573, null_rate ~0.255) are missing the geographic, family, genus, and sampling fields at an identical rate, apparently in lockstep, which points to a shared set of rows worth investigating. One hypothesis to check: 911 is also the number of distinct Parent_ID values and equals 254 families + 32 subfamilies + 625 genera, so these rows may be classification nodes rather than individual languages. The Samples_100 and Samples_200 flags are highly imbalanced (exactly 100 and 200 'True' values respectively), reflecting curated WALS sub-samples. Subfamily is sparsely populated (74.5% null), so treat it as supplementary rather than primary.

citing: Family.top_values · Macroarea.top_values · Genus.top_values · Country_ID.top_values · Samples_100.stats · Samples_200.stats · Latitude.stats · Longitude.stats · Subfamily.null_rate
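
The profile only shows that several columns share the same null rate; whether the same rows are missing in each is a separate check. The sketch below counts missing/present combinations per row using only the Python standard library. The three-row sample is hypothetical and stands in for wals_languages.csv, and treating empty strings as nulls is an assumption about how the file encodes them.

```python
import csv
import io
from collections import Counter

# Columns the profile reports at the same 25.5% null rate.
LOCKSTEP_COLS = ["Macroarea", "Latitude", "Longitude", "Family", "Genus"]

def missing_patterns(rows, cols=LOCKSTEP_COLS):
    """Count how often each missing/present combination occurs across cols."""
    patterns = Counter()
    for row in rows:
        key = tuple((row.get(c) or "").strip() == "" for c in cols)
        patterns[key] += 1
    return patterns

# Hypothetical sample; replace with open(path, newline="") on the real file.
SAMPLE = """ID,Macroarea,Latitude,Longitude,Family,Genus
aab,Africa,6.9,7.9,FamilyA,GenusA
genus-oceanic,,,,,
xyz,Eurasia,40.0,,FamilyB,GenusB
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
patterns = missing_patterns(rows)
for pattern, n in patterns.most_common():
    print(pattern, n)
```

If the missingness really is in lockstep, the real file should produce essentially two patterns: all five present and all five missing.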

Charts the summary said to look at first

Family · Top language families — Niger-Congo and Austronesian tie at 324 each, with a long tail across 254 families.

Top values for Family (20 unique shown, of 254 total).
value	count	share
Niger-Congo	324	9.1%
Austronesian	324	9.1%
Indo-European	176	4.9%
Sino-Tibetan	146	4.1%
Afro-Asiatic	145	4.1%
Pama-Nyungan	121	3.4%
Trans-New Guinea	98	2.7%
other	72	2.0%
Altaic	65	1.8%
Oto-Manguean	56	1.6%
Austro-Asiatic	48	1.3%
Eastern Sudanic	47	1.3%
Uto-Aztecan	44	1.2%
Algic	31	0.9%
Mayan	30	0.8%
Arawakan	29	0.8%
Nakh-Daghestanian	28	0.8%
Mande	28	0.8%
Uralic	27	0.8%
Hokan	26	0.7%

Macroarea · Six macroareas, led by Eurasia (659) and Africa (606); shows global coverage balance.

Top values for Macroarea (6 unique shown, of 6 total).
value	count	share
Eurasia	659	18.4%
Africa	606	17.0%
Papunesia	560	15.7%
North America	396	11.1%
South America	258	7.2%
Australia	183	5.1%

Country_ID · Country distribution — Papua New Guinea, Australia, US, and Indonesia top the list, signaling linguistic hotspots.

Top values for Country_ID (20 unique shown, of 337 total).
value	count	share
PG	214	6.0%
AU	185	5.2%
US	177	5.0%
ID	177	5.0%
IN	120	3.4%
MX	120	3.4%
RU	89	2.5%
NG	66	1.8%
BR	66	1.8%
CN	54	1.5%
CD	49	1.4%
CM	46	1.3%
CA	45	1.3%
CO	39	1.1%
ET	36	1.0%
PH	36	1.0%
PE	35	1.0%
NP	32	0.9%
TZ	28	0.8%
VU	28	0.8%

Latitude · Latitude spread from -55 to 71 with median ~8°, showing a tropical/northern-hemisphere skew.

Histogram bins for Latitude (median: 8.291666666665).
bin	count
-55 – -51.84	2
-51.84 – -48.69	1
-48.69 – -45.53	1
-45.53 – -42.38	2
-42.38 – -39.22	3
-39.22 – -36.06	5
-36.06 – -32.91	9
-32.91 – -29.75	18
-29.75 – -26.59	24
-26.59 – -23.44	29
-23.44 – -20.28	47
-20.28 – -17.12	64
-17.12 – -13.97	85
-13.97 – -10.81	86
-10.81 – -7.656	129
-7.656 – -4.5	187
-4.5 – -1.344	190
-1.344 – 1.812	130
1.812 – 4.969	119
4.969 – 8.125	196
8.125 – 11.28	161
11.28 – 14.44	104
14.44 – 17.59	145
17.59 – 20.75	82
20.75 – 23.91	63
23.91 – 27.06	74
27.06 – 30.22	84
30.22 – 33.38	66
33.38 – 36.53	84
36.53 – 39.69	67
39.69 – 42.84	86
42.84 – 46	62
46 – 49.16	74
49.16 – 52.31	52
52.31 – 55.47	44
55.47 – 58.62	18
58.62 – 61.78	20
61.78 – 64.94	23
64.94 – 68.09	19
68.09 – 71.25	7

Samples_200 · Only 200 of 2,662 non-null rows are flagged True — useful for filtering to the curated WALS 200-sample set.

Top values for Samples_200 (2 unique shown, of 2 total).
value	count	share
False	2462	68.9%
True	200	5.6%

Schema

17 columns

Per-column summary.
column	type	nulls	unique	alerts
ID	text	0.0%	3,573	near_unique one_word short_text
Name	text	0.0%	3,198	one_word short_text
Macroarea	categorical	25.5%	6	null_rate
Latitude	numeric	25.5%	887	null_rate
Longitude	numeric	25.5%	1,360	null_rate
Glottocode	text	26.0%	2,502	one_word null_rate short_text
ISO639P3code	text	26.8%	2,442	one_word null_rate short_text
Family	categorical	25.5%	254	null_rate
Subfamily	categorical	74.5%	32	null_rate
Genus	categorical	25.5%	625	null_rate
GenusIcon	categorical	82.5%	613	long_tail null_rate
ISO_codes	text	26.1%	2,468	one_word null_rate short_text
Samples_100	categorical	25.5%	2	null_rate imbalance
Samples_200	categorical	25.5%	2	null_rate
Country_ID	categorical	25.7%	337	null_rate
Source	text	30.1%	2,373	one_word null_rate
Parent_ID	categorical	7.1%	911	long_tail

ID

text identifier near_unique one_word short_text

This is an identifier column: every one of the 3573 rows holds a unique single-token string with no nulls or duplicates. Values are short (median length 3, max 36) and the vocabulary equals the row count (3573), confirming one-to-one uniqueness. Top tokens like 'aab', 'aar', 'aba' suggest short alphabetic codes rather than numeric keys. Treatment: drop from modelling; retain only as a join key. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 0 (0.0%)
unique: 3,573
len_min: 2
len_max: 36
len_mean: 5.982
len_median: 3
len_p95: 17
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 3,573
readability_flesch_mean: 61.58
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

Name

text label one_word short_text

This column holds short proper-noun labels — almost certainly language or ethnonym names, given top values like 'Basque', 'Ainu', 'Beothuk' and frequent words 'sign', 'language', 'arabic', 'german'. Entries are terse (mean 8.7 chars, 80% one-word) but not unique: 375 duplicates (10.5%) and only 3,198 distinct names across 3,573 rows, with several names appearing exactly 3 times — suggesting the dataset repeats each language across multiple records or variants (e.g. '(northern)', '(southern)'). No nulls, no URLs, no emoji. Treatment: Treat as a categorical key; deduplicate or join on a normalized form before aggregating. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 0 (0.0%)
unique: 3,198
len_min: 2
len_max: 46
len_mean: 8.705
len_median: 7
len_p95: 19
word_mean: 1.258
word_median: 1
n_empty: 0
n_duplicates: 375
duplicate_rate: 0.105
vocab_size: 3,383
readability_flesch_mean: 48.16
emoji_rate: 0
url_rate: 0
one_word_rate: 0.8002
allcaps_rate: 0
boilerplate_rate: 0
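
A minimal sketch of the "join on a normalized form" treatment, assuming that variants differ only by a trailing parenthetical qualifier and by letter case. The names and IDs in the sample are hypothetical, and whether two records that collapse to the same key are really the same language remains a linguistic judgment the code cannot make.

```python
import re
from collections import defaultdict

# Matches one trailing parenthetical such as " (Northern)".
_QUALIFIER = re.compile(r"\s*\([^)]*\)\s*$")

def normalize_name(name):
    """Strip a trailing qualifier, collapse whitespace, and casefold."""
    base = _QUALIFIER.sub("", name)
    return " ".join(base.split()).casefold()

def group_by_normalized_name(rows):
    """Map each normalized name to the IDs of the rows that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize_name(row["Name"])].append(row["ID"])
    return dict(groups)

# Hypothetical rows for illustration only.
rows = [
    {"ID": "x1", "Name": "Examplish (Northern)"},
    {"ID": "x2", "Name": "Examplish (Southern)"},
    {"ID": "x3", "Name": "Basque"},
]
groups = group_by_normalized_name(rows)
print(groups)
```

Keeping the ID lists rather than dropping rows preserves the one-to-one ID key while still exposing which names collide.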

Macroarea

categorical feature null_rate

Macroarea is a coarse geographic grouping with 6 categories (Eurasia, Africa, Papunesia, North America, South America, and Australia), consistent with WALS/Glottolog-style language area labels. Distribution is relatively even (entropy ratio 0.95; top value Eurasia at 24.8% of the 2,662 non-null rows, or 18.4% of all rows), so no single region dominates. The 25.5% null rate is substantial and flagged. Treatment: One-hot encode and add an explicit 'missing' category to preserve the 25.5% nulls. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 6
top_value: Eurasia
top_rate: 0.2476
cardinality: 6
entropy: 2.459
entropy_ratio: 0.9511
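
The entropy figures above can be reproduced from the Macroarea counts in the data table. The sketch assumes the profiler reports Shannon entropy in bits and normalises it by log2 of the cardinality; that convention is not stated in the report, but it agrees with the listed values to the precision shown.

```python
import math

# Counts copied from the Macroarea data table (non-null rows only).
macroarea_counts = {
    "Eurasia": 659, "Africa": 606, "Papunesia": 560,
    "North America": 396, "South America": 258, "Australia": 183,
}

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

counts = list(macroarea_counts.values())
h = entropy_bits(counts)
ratio = h / math.log2(len(counts))      # 1.0 would mean a perfectly even split
top_rate = max(counts) / sum(counts)    # share of the most common category
print(f"entropy={h:.3f} bits, ratio={ratio:.4f}, top_rate={top_rate:.4f}")
```

Note that top_rate is computed over non-null rows, which is why it is higher than the 18.4% share shown in the data table.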

Latitude

numeric feature null_rate

Geographic latitude in degrees, ranging from -55.0 to 71.25 with a median of 8.29 and IQR of 33.0, consistent with a worldwide point distribution. The 25.5% null rate is notable and flagged, while skew (0.36) and kurtosis (-0.50) indicate a fairly symmetric, slightly flat spread with only one outlier. Treatment: Impute or filter the 25.5% missing values, and pair with longitude for any geospatial modelling. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 887
min: -55
max: 71.25
mean: 11.88
median: 8.292
std: 22.72
q1: -5
q3: 28
iqr: 33
skew: 0.3562
kurtosis: -0.5023
n_outliers: 1
outlier_rate: 0.0003757
zero_rate: 0.002254
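
The single flagged outlier is consistent with the common 1.5 × IQR (Tukey) rule, although the report does not say which rule the profiler uses; that choice is an assumption here. With q1 = -5 and q3 = 28 the fences fall at -54.5 and 77.5, so only values below -54.5 (such as the minimum, -55) would be flagged.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (low, high) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles taken from the Latitude statistics above.
low, high = tukey_fences(q1=-5.0, q3=28.0)

def is_outlier(x):
    """True when x falls outside the fences computed above."""
    return x < low or x > high

print(low, high, is_outlier(-55.0), is_outlier(71.25))
```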

Longitude

numeric feature null_rate

Geographic longitude in degrees, spanning the full globe from -178.17 to 179.17 with a mild negative skew (-0.33) and flat kurtosis (-1.05), consistent with a worldwide point distribution. The 25.5% null rate is the main concern, and the 2,662 non-null rows hold only 1,360 unique values, suggesting repeated locations or rounded coordinates. No outliers flagged, as expected for a bounded angular measure. Treatment: Pair with Latitude for geospatial features; impute or drop the 25.5% missing before modelling. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 1,360
min: -178.2
max: 179.2
mean: 35.17
median: 34.79
std: 89.35
q1: -45.75
q3: 121
iqr: 166.8
skew: -0.3259
kurtosis: -1.047
n_outliers: 0
outlier_rate: 0
zero_rate: 0.001503
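
One possible way to pair Latitude and Longitude into model features is to project each point onto the unit sphere, which avoids the artificial jump between -180 and 180 degrees. This is a sketch of one option, not something the profile prescribes; rows missing either coordinate are skipped rather than imputed, and empty strings are assumed to mark nulls.

```python
import math

def to_unit_xyz(lat_deg, lon_deg):
    """Convert degrees of latitude/longitude to a 3D point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def coordinate_features(rows):
    """Yield (ID, xyz) for rows that have both coordinates; skip the rest."""
    for row in rows:
        lat = (row.get("Latitude") or "").strip()
        lon = (row.get("Longitude") or "").strip()
        if lat and lon:
            yield row["ID"], to_unit_xyz(float(lat), float(lon))

# Two points on either side of the antimeridian end up close together.
east = to_unit_xyz(0.0, 179.9999)
west = to_unit_xyz(0.0, -179.9999)
print(math.dist(east, west))
```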

Glottocode

text foreign_key one_word null_rate short_text

This column holds Glottocodes: fixed 8-character language identifiers from the Glottolog catalogue (e.g. 'basq1248', 'stan1295'), with every value a single token of length exactly 8. About 26% of rows are null (928), and the 2,645 non-null rows hold 2,502 distinct codes, a 5.4% duplicate rate (143 rows); the most repeated code 'basq1248' appears 11 times, suggesting multiple records can share a language. Treatment: Left-join on this code to a Glottolog reference table; impute or flag the 26% nulls separately. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 928 (26.0%)
unique: 2,502
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 143
duplicate_rate: 0.05406
vocab_size: 2,502
readability_flesch_mean: 92.88
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
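
A minimal sketch of the left-join treatment, using a plain dictionary as the reference table. The `reference` mapping and its single entry are hypothetical stand-ins for a real Glottolog export, and the shape check (four lowercase alphanumerics followed by four digits) is a format test only, not proof that a code exists in Glottolog.

```python
import re

# Shape of a Glottocode: 4 alphanumeric characters followed by 4 digits.
GLOTTOCODE_RE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")

def left_join_glottolog(rows, reference):
    """Attach a reference name and a join status to every row (left join)."""
    joined = []
    for row in rows:
        code = (row.get("Glottocode") or "").strip()
        if not code:
            status = "missing"
        elif not GLOTTOCODE_RE.match(code):
            status = "malformed"
        elif code in reference:
            status = "matched"
        else:
            status = "unmatched"
        joined.append({**row, "glottolog_name": reference.get(code),
                       "glottocode_status": status})
    return joined

# Hypothetical reference table and rows for illustration.
reference = {"basq1248": "Basque"}
rows = [
    {"ID": "r1", "Glottocode": "basq1248"},
    {"ID": "r2", "Glottocode": "stan1295"},
    {"ID": "r3", "Glottocode": ""},
    {"ID": "r4", "Glottocode": "bad"},
]
joined = left_join_glottolog(rows, reference)
print([r["glottocode_status"] for r in joined])
```

Keeping a separate status column flags the nulls explicitly, so no row is silently dropped by the join.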

ISO639P3code

text foreign_key one_word null_rate short_text

This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and a single word (len_mean 3.0, one_word_rate 1.0), with familiar codes like 'eus' (Basque), 'deu' (German), and 'gsw' (Swiss German) at the top. Coverage is incomplete: 26.84% of rows are null (959), and the 2,614 non-null rows hold 2,442 unique codes, a 6.58% duplicate rate (172 rows). Nothing in the evidence indicates which entity each code is tagging. Treatment: Treat as a categorical join key to an ISO 639-3 reference table; impute or filter the 26.84% nulls before use. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 959 (26.8%)
unique: 2,442
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 172
duplicate_rate: 0.0658
vocab_size: 2,442
readability_flesch_mean: 119.5
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

Family

categorical feature null_rate

Categorical label assigning rows to one of 254 language families, headed by Niger-Congo and Austronesian (tied at 324 rows each: 12.2% of the 2,662 non-null rows, or 9.1% of all 3,573 rows). The tail is long: an entropy ratio of 0.705 indicates the distribution is spread across many families rather than dominated by a few. 25.5% of rows are null, which is a substantial gap for what looks like a taxonomic feature. Treatment: Impute or add an explicit 'unknown' category for the 25.5% nulls, then group rare families before encoding. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 254
top_value: Niger-Congo
top_rate: 0.1217
cardinality: 254
entropy: 5.631
entropy_ratio: 0.7049
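
The treatment above (explicit unknown level, then grouping rare families) might look like the following sketch. The `min_count` threshold and both sentinel labels are illustrative assumptions; the sentinels are deliberately not plain 'other', because the data already contains a Family value named 'other' (72 rows).

```python
from collections import Counter

def bucket_families(values, min_count=25,
                    unknown="__unknown__", rare="__rare__"):
    """Replace nulls with `unknown` and families below min_count with `rare`."""
    cleaned = [(v or "").strip() or unknown for v in values]
    counts = Counter(v for v in cleaned if v != unknown)
    return [
        v if v == unknown or counts[v] >= min_count else rare
        for v in cleaned
    ]

# Hypothetical values; a low threshold keeps the example small.
values = ["A", "A", "A", "B", None, ""]
print(bucket_families(values, min_count=2))
```

The bucketed column can then be one-hot encoded with far fewer than 254 levels.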

Subfamily

categorical feature null_rate

This column records the linguistic subfamily classification of entries, drawn from a controlled vocabulary of 32 values such as Benue-Congo, Eastern Malayo-Polynesian, and Tibeto-Burman. Coverage is the main concern: 74.5% of rows are null, leaving only 911 labelled records, with Benue-Congo accounting for 21.95% of those. Among populated rows the distribution is reasonably diverse (entropy ratio 0.77), so the signal is informative where present but sparse overall. Treatment: Treat as a sparse categorical: impute an explicit 'unknown' level before encoding, since 74.5% are null. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 2,662 (74.5%)
unique: 32
top_value: Benue-Congo
top_rate: 0.2195
cardinality: 32
entropy: 3.856
entropy_ratio: 0.7712

Genus

categorical feature null_rate

Genus is a linguistic genus label (subfamily-level grouping of languages), with values like Oceanic, Bantu, Indic, and Semitic. It is highly diverse: 625 distinct genera across the 2,662 non-null rows, an entropy ratio of 0.86, and a top value (Oceanic) covering only 5.6% of them. 25.5% of rows are null, which is the flagged concern. Treatment: Treat as a high-cardinality categorical: target- or frequency-encode and explicitly model the 25.5% missing as its own category. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 625
top_value: Oceanic
top_rate: 0.05597
cardinality: 625
entropy: 7.95
entropy_ratio: 0.856
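
A sketch of frequency encoding with the nulls kept as their own category. Target encoding is not shown because the dataset has no obvious target column; the `__missing__` label is an illustrative choice.

```python
from collections import Counter

def frequency_encode(values, missing="__missing__"):
    """Return (labels, frequencies); nulls become their own labelled level."""
    labels = [(v or "").strip() or missing for v in values]
    counts = Counter(labels)
    total = len(labels)
    freqs = [counts[label] / total for label in labels]
    return labels, freqs

# Hypothetical four-row column.
labels, freqs = frequency_encode(["Oceanic", "Oceanic", "Bantu", None])
print(list(zip(labels, freqs)))
```

One limitation worth keeping in mind: genera with the same row count receive the same encoded value, so the encoding does not distinguish between them.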

GenusIcon

categorical metadata long_tail null_rate

GenusIcon is a high-cardinality categorical with 613 unique values across only 625 non-null rows; 82.51% of the 3,573 rows are null. The non-null count matches the number of distinct Genus values (625), which may indicate one icon per genus. An entropy ratio of 0.9989 and a top_rate of just 0.0032 mean the non-null values are nearly uniformly distributed, with the most frequent code 'c688033' appearing only twice. The hex-like tokens (e.g. 'c807D33') suggest icon identifiers or color/asset codes rather than a meaningful category. Treatment: Drop or retain as a sparse asset reference; not useful as a modelling feature given near-unique values and 82.51% nulls. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 2,948 (82.5%)
unique: 613
top_value: c688033
top_rate: 0.0032
cardinality: 613
entropy: 9.249
entropy_ratio: 0.9989

ISO_codes

text feature one_word null_rate short_text

Almost certainly ISO 639-3 language codes: 99% are single tokens, length is tightly clustered at 3 characters (min 3, max 7, p95 3), and top values like 'eus', 'deu', 'gsw', 'bod', 'roh' are recognisable three-letter language identifiers. Cardinality is high (2,468 unique among the 2,640 non-null rows) with a 26.1% null rate and 172 duplicates, so coverage is partial and no single code dominates (top value 'eus' appears just 12 times). The handful of length-7 entries is anomalous for a strict ISO 639-3 field and worth inspecting; since about 1% of values contain more than one token, these may be two 3-letter codes joined by a separator. Treatment: Treat as a categorical code; validate against the ISO 639-3 list and investigate entries longer than 3 characters. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 933 (26.1%)
unique: 2,468
len_min: 3
len_max: 7
len_mean: 3.039
len_median: 3
len_p95: 3
word_mean: 1.01
word_median: 1
n_empty: 0
n_duplicates: 172
duplicate_rate: 0.06515
vocab_size: 2,486
readability_flesch_mean: 117.4
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9902
allcaps_rate: 0
boilerplate_rate: 0
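
To inspect the entries longer than 3 characters, a simple shape check is enough to pull them out. The sketch below flags any non-null value that is not a single lowercase 3-letter token and shows how it splits on whitespace; the whitespace separator is an assumption, and the check does not validate codes against the ISO 639-3 registry.

```python
import re

# Shape of a single ISO 639-3 code: exactly three lowercase letters.
ISO_639_3_SHAPE = re.compile(r"^[a-z]{3}$")

def suspicious_iso_values(values):
    """Map each value that is not one 3-letter code to its whitespace tokens."""
    flagged = {}
    for value in values:
        value = (value or "").strip()
        if value and not ISO_639_3_SHAPE.match(value):
            flagged[value] = value.split()
    return flagged

# Hypothetical values; 'aaa bbb' stands in for a length-7 entry.
flagged = suspicious_iso_values(["eus", "deu", "", None, "aaa bbb"])
print(flagged)
```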

Samples_100

categorical feature null_rate imbalance

Boolean flag with only two values (False/True) where False dominates at 96.2% of non-null rows (2562 vs 100). The name 'Samples_100' plus the exact count of 100 True values suggests this marks a curated subset of 100 sampled records. A 25.5% null rate is notable and should be reconciled before use. Treatment: Treat as a boolean subset indicator; impute or exclude nulls and avoid using as a model feature given severe imbalance. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 2
top_value: False
top_rate: 0.9624
cardinality: 2
entropy: 0.231
entropy_ratio: 0.231

Samples_200

categorical metadata null_rate

Binary True/False flag, almost certainly indicating membership in a 200-row sample (the name 'Samples_200' and the exact count of 200 'True' values support this). The column is heavily imbalanced — 'False' covers 92.5% of non-null rows — and 25.5% of values are null, which is unusual for a sampling indicator and worth investigating. Treatment: Use as a boolean filter/split flag; reconcile the 25.5% nulls (treat as False or exclude) before relying on it. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 2
top_value: False
top_rate: 0.9249
cardinality: 2
entropy: 0.3848
entropy_ratio: 0.3848
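
Using the flag as a filter requires deciding what a null means. The sketch below parses the column into True, False, or None and lets the caller choose how nulls are treated; the 'True'/'False' string spellings follow the values shown in the data table, while the sample rows are hypothetical.

```python
def parse_flag(raw):
    """Parse 'True'/'False' (any case) to a bool; anything else becomes None."""
    text = (raw or "").strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    return None

def sample_ids(rows, column="Samples_200", null_as=None):
    """Return IDs whose flag is True; nulls are replaced by `null_as` first."""
    ids = []
    for row in rows:
        flag = parse_flag(row.get(column))
        if flag is None:
            flag = null_as
        if flag:
            ids.append(row["ID"])
    return ids

# Hypothetical rows: one member, one non-member, one null.
rows = [
    {"ID": "a", "Samples_200": "True"},
    {"ID": "b", "Samples_200": "False"},
    {"ID": "c", "Samples_200": ""},
]
print(sample_ids(rows))
```

With the default `null_as=None`, null rows are excluded, which matches the "treat as False or exclude" options for membership filtering.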

Country_ID

categorical foreign_key null_rate

Two-letter country codes (PG, AU, US, ID, IN...) identifying the country associated with each record, with 337 distinct values across the 2,655 non-null rows. The cardinality is suspiciously high, since ISO 3166-1 alpha-2 has only 249 officially assigned codes; this hints at non-standard or sub-region codes mixed in, or possibly at values that combine several codes. Distribution is fairly flat (entropy ratio 0.752; top value PG at only 8.06% of non-null rows, 6.0% of all rows) and 25.69% of rows are null. Treatment: Validate codes against ISO 3166, impute or flag the 25.69% nulls, then left-join on this id. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 918 (25.7%)
unique: 337
top_value: PG
top_rate: 0.0806
cardinality: 337
entropy: 6.314
entropy_ratio: 0.752
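
One way to probe the high cardinality is to compare the number of distinct raw values with the number of distinct tokens after splitting. If some values hold several codes, the token count should drop toward or below the 249 officially assigned ISO 3166-1 alpha-2 codes. The whitespace separator is an assumption about the file, and the sample values are hypothetical.

```python
ISO_3166_ALPHA2_ASSIGNED = 249  # officially assigned alpha-2 codes

def country_cardinality(values):
    """Compare distinct raw values with distinct whitespace-split tokens."""
    raw = {v.strip() for v in values if v and v.strip()}
    tokens = {tok for v in raw for tok in v.split()}
    multi = sorted(v for v in raw if len(v.split()) > 1)
    return {
        "distinct_values": len(raw),
        "distinct_tokens": len(tokens),
        "multi_code_values": multi,
        "tokens_within_iso_range": len(tokens) <= ISO_3166_ALPHA2_ASSIGNED,
    }

# Hypothetical values, including one that combines two codes.
summary = country_cardinality(["PG", "AU", "AU PG", "", None, "US"])
print(summary)
```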

Source

text metadata one_word null_rate

This column holds bibliographic citation tags (e.g., 'Huber-and-Reed-1992', 'Boelaars-1950'), almost certainly the source reference for each row in what appears to be a linguistic dataset. About 45% of values are a single token and the median length is 25 chars, consistent with compact Author-Year keys, but 30% of rows are null; 2,373 of the 2,499 non-null values are unique, with only 126 duplicates. Top citations like 'nichols-1992' and 'malherbe-and-rosenberg-1996' are reported at 113 occurrences each, suggesting a few reference works supply many entries. Because two whole values repeated 113 times would alone exceed 126 duplicates, these are probably token-level counts within multi-citation values rather than whole-value counts. Treatment: Normalize casing and keep as a categorical provenance tag; impute or flag the 30% nulls rather than modelling the text. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 1,074 (30.1%)
unique: 2,373
len_min: 7
len_max: 452
len_mean: 42.07
len_median: 25
len_p95: 135
word_mean: 2.854
word_median: 2
n_empty: 0
n_duplicates: 126
duplicate_rate: 0.05042
vocab_size: 5,899
readability_flesch_mean: 21.33
emoji_rate: 0
url_rate: 0
one_word_rate: 0.4546
allcaps_rate: 0
boilerplate_rate: 0

Parent_ID

categorical foreign_key long_tail

Parent_ID looks like a foreign key pointing to a linguistic genus (e.g. 'genus-oceanic', 'genus-bantu'), grouping the 3,573 rows into 911 parent categories. The distribution is long-tailed but not dominated: the top value covers only 4.5% of non-null rows and entropy is 8.55 (ratio 0.87), so most parents carry few members. About 7.1% of values are null (254 rows), which will need a decision before any join or grouping; that count equals the number of distinct Family values, which would be consistent with top-level rows that have no parent. Treatment: Left-join on this id to a genus lookup; impute or flag the 7.1% nulls before grouping. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 254 (7.1%)
unique: 911
top_value: genus-oceanic
top_rate: 0.04489
cardinality: 911
entropy: 8.554
entropy_ratio: 0.87
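
Before joining on Parent_ID it helps to know which rows have no parent and which parents do not resolve to any row. The sketch below assumes Parent_ID refers to values of the ID column in the same file, which the profile does not confirm; the 'family-x' and 'genus-missing' identifiers are hypothetical.

```python
from collections import defaultdict

def index_by_parent(rows):
    """Return (children, roots): child IDs per parent, and IDs with no parent."""
    children = defaultdict(list)
    roots = []
    for row in rows:
        parent = (row.get("Parent_ID") or "").strip()
        if parent:
            children[parent].append(row["ID"])
        else:
            roots.append(row["ID"])
    return dict(children), roots

def dangling_parents(rows):
    """Parent_ID values that do not match any ID in the same table."""
    ids = {row["ID"] for row in rows}
    children, _ = index_by_parent(rows)
    return sorted(p for p in children if p not in ids)

# Hypothetical rows for illustration.
rows = [
    {"ID": "family-x", "Parent_ID": ""},
    {"ID": "genus-oceanic", "Parent_ID": "family-x"},
    {"ID": "aab", "Parent_ID": "genus-oceanic"},
    {"ID": "aar", "Parent_ID": "genus-missing"},
]
children, roots = index_by_parent(rows)
print(children, roots, dangling_parents(rows))
```

If the self-reference assumption holds for the real file, `dangling_parents` should return an empty list and `roots` should contain the 254 null-parent rows.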