data raw wals language

source /home/coolhand/servers/diachronica/data_raw/wals_language.csv 3,573 rows 17 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a catalogue of 3,573 world languages from WALS, with identifiers (ISO codes, Glottocode), names, geographic coordinates, and classification fields (Family, Genus, Subfamily, Macroarea) plus reference sources and sampling flags. The geographic and genealogical breakdowns are the most informative starting point: Macroarea splits cleanly across six regions led by Eurasia (659) and Africa (606), while Family is dominated by Niger-Congo and Austronesian (324 each). Worth a closer look: roughly a quarter of rows are missing core fields like Family, Genus, Macroarea, and coordinates (null rate ~0.255), and Subfamily is 74.5% null, which will limit any subfamily-level analysis. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 True values respectively), reflecting their role as curated sub-samples rather than balanced categories.

citing: columns · row_count · kinds

Charts the summary said to look at first

Macroarea · Shows the six-region geographic split, led by Eurasia and Africa.

Top values for Macroarea (6 unique shown, of 6 total).
value	count	share
Eurasia	659	18.4%
Africa	606	17.0%
Papunesia	560	15.7%
North America	396	11.1%
South America	258	7.2%
Australia	183	5.1%

Family · Top language families — Niger-Congo and Austronesian tie at the top with 324 languages each.

Top values for Family (20 unique shown, of 254 total).
value	count	share
Niger-Congo	324	9.1%
Austronesian	324	9.1%
Indo-European	176	4.9%
Sino-Tibetan	146	4.1%
Afro-Asiatic	145	4.1%
Pama-Nyungan	121	3.4%
Trans-New Guinea	98	2.7%
other	72	2.0%
Altaic	65	1.8%
Oto-Manguean	56	1.6%
Austro-Asiatic	48	1.3%
Eastern Sudanic	47	1.3%
Uto-Aztecan	44	1.2%
Algic	31	0.9%
Mayan	30	0.8%
Arawakan	29	0.8%
Nakh-Daghestanian	28	0.8%
Mande	28	0.8%
Uralic	27	0.8%
Hokan	26	0.7%

Latitude · Latitude distribution skews toward the tropics and northern hemisphere (median ~8°).

Histogram bins for Latitude (median: 8.291666666665).
bin	count
-55 – -51.84	2
-51.84 – -48.69	1
-48.69 – -45.53	1
-45.53 – -42.38	2
-42.38 – -39.22	3
-39.22 – -36.06	5
-36.06 – -32.91	9
-32.91 – -29.75	18
-29.75 – -26.59	24
-26.59 – -23.44	29
-23.44 – -20.28	47
-20.28 – -17.12	64
-17.12 – -13.97	85
-13.97 – -10.81	86
-10.81 – -7.656	129
-7.656 – -4.5	187
-4.5 – -1.344	190
-1.344 – 1.812	130
1.812 – 4.969	119
4.969 – 8.125	196
8.125 – 11.28	161
11.28 – 14.44	104
14.44 – 17.59	145
17.59 – 20.75	82
20.75 – 23.91	63
23.91 – 27.06	74
27.06 – 30.22	84
30.22 – 33.38	66
33.38 – 36.53	84
36.53 – 39.69	67
39.69 – 42.84	86
42.84 – 46	62
46 – 49.16	74
49.16 – 52.31	52
52.31 – 55.47	44
55.47 – 58.62	18
58.62 – 61.78	20
61.78 – 64.94	23
64.94 – 68.09	19
68.09 – 71.25	7
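
The bin edges in the table above are consistent with 40 equal-width bins between the reported minimum (-55) and maximum (71.25), which gives a width of 3.15625 degrees. The sketch below (Python) reproduces the edges under that assumption; the function names are illustrative and are not taken from the profiler.

```python
def equal_width_edges(lo, hi, n_bins):
    """Return the n_bins + 1 edges of equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(x, lo, hi, n_bins):
    """Return the 0-based bin that x falls into; the top edge joins the last bin."""
    if x == hi:
        return n_bins - 1
    return int((x - lo) / ((hi - lo) / n_bins))


# Minimum, maximum and bin count as they appear in the Latitude histogram.
edges = equal_width_edges(-55.0, 71.25, 40)
median_bin = bin_index(8.291666666665, -55.0, 71.25, 40)

print(edges[:3])                                  # first three edges
print(edges[median_bin], edges[median_bin + 1])   # bin that holds the median
```

Under this assumption the median falls in the 8.125 to 11.28 bin, matching the table row with that label.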

Country_ID · Country concentration — Papua New Guinea, Australia, the US, and Indonesia host the most languages.

Top values for Country_ID (20 unique shown, of 337 total).
value	count	share
PG	214	6.0%
AU	185	5.2%
US	177	5.0%
ID	177	5.0%
IN	120	3.4%
MX	120	3.4%
RU	89	2.5%
NG	66	1.8%
BR	66	1.8%
CN	54	1.5%
CD	49	1.4%
CM	46	1.3%
CA	45	1.3%
CO	39	1.1%
ET	36	1.0%
PH	36	1.0%
PE	35	1.0%
NP	32	0.9%
TZ	28	0.8%
VU	28	0.8%

Samples_100 · Highlights the strong imbalance: only 100 of 2,662 non-null rows are flagged True.

Top values for Samples_100 (2 unique shown, of 2 total).
value	count	share
False	2562	71.7%
True	100	2.8%

Schema

17 columns

Per-column summary.
column	type	null rate	unique	alerts
ID	text	0.0%	3,573	near_unique one_word short_text
Name	text	0.0%	3,198	one_word short_text
Macroarea	categorical	25.5%	6	null_rate
Latitude	numeric	25.5%	887	null_rate
Longitude	numeric	25.5%	1,360	null_rate
Glottocode	text	26.0%	2,502	one_word null_rate short_text
ISO639P3code	text	26.8%	2,442	one_word null_rate short_text
Family	categorical	25.5%	254	null_rate
Subfamily	categorical	74.5%	32	null_rate
Genus	categorical	25.5%	625	null_rate
GenusIcon	categorical	82.5%	613	long_tail null_rate
ISO_codes	text	26.1%	2,468	one_word null_rate short_text
Samples_100	categorical	25.5%	2	null_rate imbalance
Samples_200	categorical	25.5%	2	null_rate
Country_ID	categorical	25.7%	337	null_rate
Source	text	30.1%	2,373	one_word null_rate
Parent_ID	categorical	7.1%	911	long_tail
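
The null rates in the schema table could be recomputed along the following lines. This is a minimal sketch in Python using only the standard library; it assumes missing values are stored as empty cells in the CSV, which the report does not confirm, and the three-row sample is made up purely to show the shape of the call.

```python
import csv
import io


def null_rates(rows, columns):
    """Return {column: share of rows whose value is missing or empty}."""
    rows = list(rows)
    rates = {}
    for col in columns:
        n_null = sum(1 for r in rows if (r.get(col) or "").strip() == "")
        rates[col] = n_null / len(rows) if rows else 0.0
    return rates


# Made-up sample; replace with open(<path to wals_language.csv>, newline="").
sample = io.StringIO(
    "ID,Name,Macroarea,Subfamily\n"
    "x1,Alpha,Eurasia,\n"
    "x2,Beta,,\n"
    "x3,Gamma,Africa,Sub-A\n"
)
rates = null_rates(csv.DictReader(sample), ["ID", "Macroarea", "Subfamily"])
print(rates)
```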

ID

text identifier near_unique one_word short_text

Column 'ID' is a unique row identifier: all 3573 values are distinct (n_unique equals n), every value is a single token (one_word_rate 1.0), and there are no nulls or duplicates. Lengths range from 2 to 36 characters with a median of 3, and the top tokens (aab, aar, aba, abb…) suggest short alphabetic codes rather than numeric keys. Treatment: Use as a join key; drop from modelling features. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 0 (0.0%)
unique: 3,573
len_min: 2
len_max: 36
len_mean: 5.982
len_median: 3
len_p95: 17
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 3,573
readability_flesch_mean: 61.58
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

Name

text label one_word short_text

This column holds short proper-noun labels, almost certainly language names (top values like Basque, Ainu, Beothuk, Atakapa, and frequent words 'sign', 'language', 'arabic', 'mixtec' all point to a linguistic registry). Entries are overwhelmingly single tokens (one_word_rate 0.80, word_mean 1.26, len_mean 8.7) with a 46-character max for the longer parenthesised variants like '(northern)'/'(southern)'. Notably, 375 duplicates (10.5%) exist across 3,573 rows with 3,198 uniques: names like 'Abun', 'Andoke', 'Basque' each appear 3 times, suggesting the dataset repeats names across some other dimension, so Name is not a clean key. Treatment: Treat as a categorical language label; use ID rather than Name as the primary or join key, and inspect the repeated names before deduplicating. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 0 (0.0%)
unique: 3,198
len_min: 2
len_max: 46
len_mean: 8.705
len_median: 7
len_p95: 19
word_mean: 1.258
word_median: 1
n_empty: 0
n_duplicates: 375
duplicate_rate: 0.105
vocab_size: 3,383
readability_flesch_mean: 48.16
emoji_rate: 0
url_rate: 0
one_word_rate: 0.8002
allcaps_rate: 0
boilerplate_rate: 0
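
One way to inspect the repeated names before deciding how to treat them is to list the IDs that share each name. The sketch below is in Python; the rows and the helper name are hypothetical stand-ins and do not come from the dataset.

```python
from collections import defaultdict


def ids_by_repeated_name(rows):
    """Map each Name that occurs more than once to the list of IDs using it."""
    by_name = defaultdict(list)
    for row in rows:
        by_name[row["Name"]].append(row["ID"])
    return {name: ids for name, ids in by_name.items() if len(ids) > 1}


# Hypothetical rows, only to illustrate the output shape.
rows = [
    {"ID": "id-1", "Name": "Alpha"},
    {"ID": "id-2", "Name": "Beta"},
    {"ID": "id-3", "Name": "Alpha"},
]
repeated = ids_by_repeated_name(rows)
print(repeated)
```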

Macroarea

categorical feature null_rate

Macroarea is a categorical geographic grouping with 6 values covering the standard continental/linguistic macroareas (Eurasia, Africa, Papunesia, North America, South America, Australia). The distribution is fairly balanced: entropy ratio is 0.95 and the top value Eurasia accounts for only 24.8% of non-null rows (18.4% of all rows). The main concern is a 25.5% null rate, meaning a quarter of the 3,573 rows lack any macroarea assignment. Treatment: One-hot encode and decide whether to impute or add an explicit 'unknown' bucket for the 25.5% nulls. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 6
top_value: Eurasia
top_rate: 0.2476
cardinality: 6
entropy: 2.459
entropy_ratio: 0.9511
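
The report quotes two different shares for the same value: the chart tables divide by all 3,573 rows, while top_rate divides by non-null rows only. The Python sketch below recomputes both from the Macroarea counts listed earlier; only the helper name is new.

```python
# Counts copied from the Macroarea table in this report.
macroarea_counts = {
    "Eurasia": 659,
    "Africa": 606,
    "Papunesia": 560,
    "North America": 396,
    "South America": 258,
    "Australia": 183,
}
n_rows = 3573
n_non_null = sum(macroarea_counts.values())
n_null = n_rows - n_non_null


def share(count, denominator):
    """Percentage share rounded to one decimal place."""
    return round(100 * count / denominator, 1)


for value, count in macroarea_counts.items():
    print(value, share(count, n_rows), share(count, n_non_null))
```

The counts sum to 2,662 non-null rows, leaving the 911 nulls reported in the stats block.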

Latitude

numeric feature null_rate

Geographic latitude in decimal degrees, ranging from -55.0 to 71.25 with a median of 8.29, consistent with global coverage skewed slightly toward the northern hemisphere. About 25.5% of rows are null, a notable gap for a positional field, and only 887 unique values across the 2,662 non-null rows suggest coordinates are rounded or tied to a limited set of locations. Distribution is near-symmetric (skew 0.36, kurtosis -0.50) with just one outlier flagged. Treatment: Pair with Longitude for geospatial features; impute or filter the 25.5% nulls before modelling. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 887
min: -55
max: 71.25
mean: 11.88
median: 8.292
std: 22.72
q1: -5
q3: 28
iqr: 33
skew: 0.3562
kurtosis: -0.5023
n_outliers: 1
outlier_rate: 0.0003757
zero_rate: 0.002254
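
The report does not state which rule produced n_outliers. Assuming the common 1.5 × IQR (Tukey) fences, the quartiles above give the bounds sketched below in Python; the rule and the function names are assumptions, not the profiler's documented method.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x, lower, upper):
    """True when x lies strictly outside the fences."""
    return x < lower or x > upper


# q1 and q3 as reported for Latitude.
lower, upper = tukey_fences(-5.0, 28.0)
print(lower, upper)
print(is_outlier(-55.0, lower, upper), is_outlier(71.25, lower, upper))
```

Under this assumption the reported minimum of -55 falls just below the lower fence while the maximum of 71.25 stays inside the upper fence, which is consistent with a single flagged outlier at the southern extreme.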

Longitude

numeric feature null_rate

Geographic longitude in decimal degrees, spanning -178.17 to 179.17 with 1360 distinct values across 3573 rows. The distribution is roughly symmetric (skew -0.33) but flat (kurtosis -1.05) with an IQR of 166.75, consistent with truly global coverage rather than a regional sample. Notable concern: 25.5% of rows are null, which will silently drop a quarter of any geospatial join. Treatment: Pair with Latitude and impute or filter the 25.5% nulls before any geospatial modelling. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 1,360
min: -178.2
max: 179.2
mean: 35.17
median: 34.79
std: 89.35
q1: -45.75
q3: 121
iqr: 166.8
skew: -0.3259
kurtosis: -1.047
n_outliers: 0
outlier_rate: 0
zero_rate: 0.001503

Glottocode

text foreign_key one_word null_rate short_text

This is a Glottocode field: fixed 8-character language identifiers from the Glottolog catalogue (e.g. basq1248, swis1247), with every value being a single token. About 26% of rows are null, and the 2,645 non-null rows hold 2,502 distinct codes, leaving 143 duplicates (5.4% of non-null rows) where the same code repeats; basq1248 leads with 11 occurrences. Length is rigidly 8 for min, median, and max, consistent with a controlled vocabulary identifier rather than free text. Treatment: Treat as a categorical key; left-join to Glottolog metadata and handle the 26% nulls explicitly. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 928 (26.0%)
unique: 2,502
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 143
duplicate_rate: 0.05406
vocab_size: 2,502
readability_flesch_mean: 92.88
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
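
The duplicate figures above are consistent with counting duplicates as non-null rows minus distinct values, and dividing by the non-null count. That definition is inferred from the reported numbers rather than documented by the profiler; the Python sketch below applies it.

```python
def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate) over the non-null values."""
    non_null = [v for v in values if v is not None]
    n_dup = len(non_null) - len(set(non_null))
    rate = n_dup / len(non_null) if non_null else 0.0
    return n_dup, rate


# Same arithmetic using the Glottocode figures from the stats block.
n, nulls, unique = 3573, 928, 2502
non_null = n - nulls
n_duplicates = non_null - unique
duplicate_rate = n_duplicates / non_null
print(non_null, n_duplicates, round(duplicate_rate, 5))
```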

ISO639P3code

text foreign_key one_word null_rate short_text

This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0), with examples like 'eus', 'gsw', 'deu' matching the standard. Coverage is incomplete: 26.84% of rows are null. The 2,614 non-null rows hold 2,442 unique codes, a 6.58% duplicate rate, so most codes occur only once or twice (top value 'eus' at 12). Treatment: Treat as a categorical key; left-join to an ISO 639-3 reference table and decide on an explicit bucket for the 26.84% nulls. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 959 (26.8%)
unique: 2,442
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 172
duplicate_rate: 0.0658
vocab_size: 2,442
readability_flesch_mean: 119.5
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

Family

categorical feature null_rate

Categorical label for the language family of each row, with 254 distinct families across 3,573 records. Distribution is long-tailed but not dominated: top value 'Niger-Congo' covers only 12.2% of non-null rows (9.1% of all rows) and ties exactly with 'Austronesian' at 324 each, with entropy ratio 0.70 indicating spread across many small families. Notable concern: 25.5% of rows are null, and a literal 'other' bucket already accounts for 72 rows. Treatment: Impute or flag the 25.5% missing, then group rare families before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 254
top_value: Niger-Congo
top_rate: 0.1217
cardinality: 254
entropy: 5.631
entropy_ratio: 0.7049
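
The treatment above (flag missing values, then group rare families) could look roughly like this in Python. The threshold, the labels, and the sample values are all hypothetical. The rare-value label is deliberately not 'other', because the column already contains a literal 'other' value that should stay distinguishable.

```python
from collections import Counter


def bucket_categories(values, min_count,
                      rare_label="__rare__", missing_label="__missing__"):
    """Replace None with missing_label and infrequent values with rare_label."""
    counts = Counter(v for v in values if v is not None)
    bucketed = []
    for v in values:
        if v is None:
            bucketed.append(missing_label)
        elif counts[v] < min_count:
            bucketed.append(rare_label)
        else:
            bucketed.append(v)
    return bucketed


# Hypothetical values; "B" occurs once and falls below the threshold of 2.
sample = ["A", "A", "B", None, "other", "other"]
bucketed = bucket_categories(sample, min_count=2)
print(bucketed)
```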

Subfamily

categorical feature null_rate

This column records the linguistic subfamily classification of each row, with 32 distinct values dominated by Benue-Congo (200 occurrences, 21.95% of non-nulls), Eastern Malayo-Polynesian (159), and Tibeto-Burman (139). The striking issue is the 74.5% null rate — only about a quarter of the 3573 rows carry a subfamily label — yet entropy ratio of 0.77 indicates the populated values are reasonably spread across the 32 categories rather than collapsing onto one. Treatment: Treat missingness as its own category and group rare subfamilies before one-hot encoding. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 2,662 (74.5%)
unique: 32
top_value: Benue-Congo
top_rate: 0.2195
cardinality: 32
entropy: 3.856
entropy_ratio: 0.7712

Genus

categorical feature null_rate

This column holds linguistic genus labels (e.g., Oceanic, Bantu, Indic, Semitic, Germanic), a mid-level grouping in language classification. Cardinality is high at 625 distinct values with entropy ratio 0.856, so the distribution is broad and flat: the top value 'Oceanic' covers only 5.6% of the 2,662 non-null rows. Note the 25.5% null rate, which is flagged and would meaningfully shrink any analysis that conditions on genus. Treatment: Treat as a high-cardinality categorical: group rare genera into 'Other' or target-encode, and add an explicit missing indicator for the 25.5% nulls. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 625
top_value: Oceanic
top_rate: 0.05597
cardinality: 625
entropy: 7.95
entropy_ratio: 0.856

GenusIcon

categorical identifier long_tail null_rate

GenusIcon holds 613 short hex-like codes (e.g. 'c688033') in the 625 non-null rows out of 3,573, with 82.51% nulls and an entropy ratio of 0.9989 indicating values are nearly uniformly distributed among non-nulls. The top value appears only twice (top_rate 0.0032), so there is no dominant category; it behaves like a near-unique tag rather than a real categorical feature. Treatment: Drop for modelling; near-unique with 82% nulls. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 2,948 (82.5%)
unique: 613
top_value: c688033
top_rate: 0.0032
cardinality: 613
entropy: 9.249
entropy_ratio: 0.9989

ISO_codes

text foreign_key one_word null_rate short_text

This column holds ISO language codes: almost all values are single tokens of length 3 (len_mean 3.04, one_word_rate 0.99), matching ISO 639-3 conventions (e.g. 'eus', 'gsw', 'deu'). The remaining ~1% run up to 7 characters (len_max 7), which is consistent with cells holding more than one code. 26.1% of rows are null and 172 duplicates exist, but with 2,468 unique values across the 2,640 non-null rows the vocabulary is wide, suggesting broad multilingual coverage rather than a few dominant languages. No top value exceeds 12 occurrences, so the distribution has an extremely long tail. Treatment: Treat as a categorical code key; split any multi-code cells and filter or flag the 26% nulls before joining to a language reference table. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 933 (26.1%)
unique: 2,468
len_min: 3
len_max: 7
len_mean: 3.039
len_median: 3
len_p95: 3
word_mean: 1.01
word_median: 1
n_empty: 0
n_duplicates: 172
duplicate_rate: 0.06515
vocab_size: 2,486
readability_flesch_mean: 117.4
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9902
allcaps_rate: 0
boilerplate_rate: 0
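
A len_max of 7 with a word_mean just above 1 would fit cells that hold two three-letter codes and a separator. The report does not show such a cell, so the Python sketch below assumes whitespace, comma, or semicolon separators and uses made-up placeholder codes; treat it as a starting point rather than a description of the actual data.

```python
import re

# Shape check only: three lowercase ASCII letters, as in ISO 639-3.
ISO_639_3_SHAPE = re.compile(r"^[a-z]{3}$")


def split_iso_codes(cell):
    """Split a cell into individual code tokens; None or blank gives []."""
    if cell is None or not cell.strip():
        return []
    return [tok for tok in re.split(r"[\s,;]+", cell.strip()) if tok]


def malformed_codes(cell):
    """Return the tokens in a cell that do not look like ISO 639-3 codes."""
    return [tok for tok in split_iso_codes(cell) if not ISO_639_3_SHAPE.match(tok)]


# Placeholder inputs, not values from the dataset.
print(split_iso_codes("aaa bbb"))
print(split_iso_codes(None))
print(malformed_codes("aaa b1b"))
```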

Samples_100

categorical feature null_rate imbalance

Boolean flag with only two values where 'False' dominates at 96.2% (2562 of 2662 non-null rows) and 'True' appears exactly 100 times. The null rate of 25.5% is high, suggesting the flag is only populated for a subset of records. Entropy ratio of 0.23 confirms severe imbalance. Treatment: Treat as a rare-event boolean indicator; impute or encode nulls explicitly and avoid as a stratification key given the imbalance. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 2
top_value: False
top_rate: 0.9624
cardinality: 2
entropy: 0.231
entropy_ratio: 0.231

Samples_200

categorical label null_rate

Binary flag column with only two values (False/True) and heavy class imbalance — False accounts for 92.5% of non-null rows versus 200 True observations, hinting the name 'Samples_200' refers to a tagged 200-row subset. Roughly a quarter of rows (25.5%) are null, which is the main surprise and the reason for the null_rate alert. Entropy ratio of 0.385 confirms the distribution is far from balanced. Treatment: Impute or explicitly encode nulls as a third category before using as a binary indicator. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 911 (25.5%)
unique: 2
top_value: False
top_rate: 0.9249
cardinality: 2
entropy: 0.3848
entropy_ratio: 0.3848
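
The entropy figures for both sample flags can be checked directly from their counts, assuming entropy is measured in bits over non-null values and the ratio divides by the log2 of the number of categories. That reading is inferred from the reported numbers; the Python sketch below uses illustrative function names.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# False/True counts among the 2,662 non-null rows of each flag.
samples_100 = entropy_ratio([2562, 100])
samples_200 = entropy_ratio([2462, 200])
print(round(samples_100, 3), round(samples_200, 4))
```

For a two-valued column the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide in both stats blocks.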

Country_ID

categorical foreign_key null_rate

Country_ID looks like an ISO-style two-letter country code, with 337 distinct values across 3,573 rows and a fairly even spread (entropy ratio 0.752). The top country PG accounts for only 8.06% of non-null rows (6.0% of all rows), followed by AU, US, and ID. Notably, 25.69% of values are null, and the cardinality of 337 exceeds the ~250 ISO 3166-1 codes, suggesting non-standard, sub-region, or multi-country values are mixed in. Treatment: Impute or flag the 25.69% nulls and reconcile non-standard codes against an ISO 3166-1 reference before joining. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 918 (25.7%)
unique: 337
top_value: PG
top_rate: 0.0806
cardinality: 337
entropy: 6.314
entropy_ratio: 0.752

Source

text metadata one_word null_rate

This column holds bibliographic citation tags (e.g. 'Huber-and-Reed-1992', 'Boelaars-1950'), evidently the source reference for each row in what looks like a linguistic typology dataset. Values are short (median 25 chars, 2 words) and 45.5% are single tokens, consistent with author-year keys rather than prose. Cardinality is high (2373 unique of 3573) with 5% duplicates and a 30% null rate, so coverage is uneven and no single source dominates (top value appears only 14 times). Treatment: Treat as a citation key: keep as categorical provenance metadata, optionally normalize casing and join to a bibliography table; do not use as a model feature. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 1,074 (30.1%)
unique: 2,373
len_min: 7
len_max: 452
len_mean: 42.07
len_median: 25
len_p95: 135
word_mean: 2.854
word_median: 2
n_empty: 0
n_duplicates: 126
duplicate_rate: 0.05042
vocab_size: 5,899
readability_flesch_mean: 21.33
emoji_rate: 0
url_rate: 0
one_word_rate: 0.4546
allcaps_rate: 0
boilerplate_rate: 0

Parent_ID

categorical foreign_key long_tail

Parent_ID looks like a foreign key pointing to a linguistic genus (e.g. 'genus-oceanic', 'genus-bantu'), grouping the 3,573 rows into 911 parent categories. The distribution is long-tailed with no dominant level: the top value covers only 4.5% of non-null rows and entropy ratio is 0.87. 7.1% of values are null. Oceanic and Bantu lead, with Indic, Western Pama-Nyungan and Semitic trailing far behind. Treatment: Left-join on this id to a genus lookup; treat the 7.1% nulls explicitly rather than one-hot encoding the 911 levels. high · anthropic:claude-opus-4-7

n: 3,573
nulls: 254 (7.1%)
unique: 911
top_value: genus-oceanic
top_rate: 0.04489
cardinality: 911
entropy: 8.554
entropy_ratio: 0.87
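
The left join suggested in the treatment could be sketched as follows in Python. The lookup table, the row IDs, the Parent_Name column, and the missing-parent label are all hypothetical; only the two genus ids come from the report.

```python
def left_join_parent(rows, parent_lookup, missing_label="__no_parent__"):
    """Attach Parent_Name to every row, keeping rows with no or unknown parent."""
    joined = []
    for row in rows:
        parent_id = row.get("Parent_ID")
        if parent_id is None:
            parent_name = missing_label              # explicit bucket for nulls
        else:
            parent_name = parent_lookup.get(parent_id)  # None if id not in lookup
        joined.append({**row, "Parent_Name": parent_name})
    return joined


# Hypothetical lookup and rows.
parent_lookup = {"genus-oceanic": "Oceanic", "genus-bantu": "Bantu"}
rows = [
    {"ID": "id-1", "Parent_ID": "genus-oceanic"},
    {"ID": "id-2", "Parent_ID": None},
    {"ID": "id-3", "Parent_ID": "genus-unknown"},
]
joined = left_join_parent(rows, parent_lookup)
print([r["Parent_Name"] for r in joined])
```

Keeping null parents and unmatched ids as two separate outcomes makes it easier to tell rows that never had a parent from rows whose parent is absent from the lookup.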