language data wals values

source /home/coolhand/datasets/language-data/wals_values.csv · 76,475 rows · 8 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a 76,475-row export of WALS (World Atlas of Language Structures) values, with 8 columns covering language identifiers, typological parameters, coded values, sources, and comments. The core analytical fields are Parameter_ID (192 typological features, top being '83A' at ~2% of rows) and Value (a small numeric coding scheme with 28 distinct values, median 2 and a long right tail up to 28). Two things deserve a closer look first: the Value column is highly skewed (skew 3.49, kurtosis 16.4, ~3.2% outliers), which matters if you plan to aggregate it numerically; and Comment is 96.9% null and, where present, is mostly HTML markup in mixed languages (English and Chinese dominate), so it needs cleaning before any text analysis. Language_ID spans 2,660 unique codes fairly evenly (top 'eng' has just 159 rows), confirming broad cross-linguistic coverage.

citing: Value.stats.skew · Value.stats.kurtosis · Value.n_unique · Value.stats.outlier_rate · Parameter_ID.n_unique · Parameter_ID.top_value · Parameter_ID.stats.top_rate · Comment.null_rate · Comment.language_counts · Language_ID.n_unique · Code_ID.n_unique · Source.n_unique · row_count

Charts the summary said to look at first

Value · Check the heavy right skew — most values are 1-4 but the tail stretches to 28.

Histogram bins for Value (median: 2.0).
bin	count
1 – 1.675	27379
1.675 – 2.35	21173
2.35 – 3.025	6771
3.025 – 3.7	0
3.7 – 4.375	9917
4.375 – 5.05	3602
5.05 – 5.725	0
5.725 – 6.4	2730
6.4 – 7.075	1392
7.075 – 7.75	0
7.75 – 8.425	1042
8.425 – 9.1	597
9.1 – 9.775	0
9.775 – 10.45	32
10.45 – 11.12	265
11.12 – 11.8	0
11.8 – 12.48	43
12.48 – 13.15	68
13.15 – 13.83	0
13.83 – 14.5	251
14.5 – 15.18	190
15.18 – 15.85	0
15.85 – 16.52	156
16.52 – 17.2	44
17.2 – 17.88	0
17.88 – 18.55	127
18.55 – 19.23	83
19.23 – 19.9	0
19.9 – 20.58	358
20.58 – 21.25	202
21.25 – 21.93	0
21.93 – 22.6	21
22.6 – 23.28	18
23.28 – 23.95	0
23.95 – 24.62	3
24.62 – 25.3	3
25.3 – 25.98	0
25.98 – 26.65	4
26.65 – 27.33	3
27.33 – 28	1
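
The empty rows in the table above are a binning artifact rather than gaps in the data. A minimal sketch, assuming the profiler cut the range [1, 28] into 40 equal-width bins (which matches the 40 rows and the 0.675 bin width shown), maps each integer value to its bin and lists the bins that can never receive an integer:

```python
# Assumption: 40 equal-width bins over [min, max] = [1, 28], as the table suggests.
N_BINS = 40
V_MIN, V_MAX = 1, 28


def bin_index(value: int) -> int:
    """Return the 0-based bin that an integer value falls into."""
    # Integer arithmetic avoids float rounding at the bin edges.
    idx = (value - V_MIN) * N_BINS // (V_MAX - V_MIN)
    return min(idx, N_BINS - 1)  # the maximum belongs to the last bin


used = {bin_index(v) for v in range(V_MIN, V_MAX + 1)}
empty = sorted(set(range(N_BINS)) - used)

print(len(used), "bins hold exactly one integer value each")
print(len(empty), "bins are structurally empty:", empty)
```

Under that assumption each non-empty row of the table is the count for a single integer value, so the rows can be read as a frequency table for Value = 1 through 28.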

Parameter_ID · See which of the 192 typological features are most represented; the top ones cluster around 1,300-1,500 rows.

Top values for Parameter_ID (20 unique shown, of 192 total).
value	count	share
83A	1518	2.0%
82A	1496	2.0%
81A	1376	1.8%
87A	1367	1.8%
143A	1325	1.7%
143E	1325	1.7%
143F	1325	1.7%
143G	1325	1.7%
97A	1316	1.7%
86A	1249	1.6%
88A	1225	1.6%
144A	1190	1.6%
85A	1184	1.5%
112A	1157	1.5%
89A	1154	1.5%
95A	1142	1.5%
69A	1131	1.5%
33A	1066	1.4%
51A	1031	1.3%
26A	969	1.3%

Language_ID · Confirms broad coverage across 2,660 languages with no single language dominating.

Character-length distribution for Language_ID (mean: 2.996).
chars	count
2	342
3	76133

Code_ID · Shows which parameter-value codes (e.g. '143G-4', '82A-1') appear most frequently.

Character-length distribution for Code_ID (mean: 5.3).
chars	count
4	4995
5	45403
6	24205
7	1872

Comment · Among the 3% of rows with comments, lengths vary wildly (median 196, max 15,127) and contain HTML — flag for cleanup.

Character-length distribution for Comment (mean: 372.2).
chars	count
36 – 413	1870
413 – 791	361
791 – 1168	44
1168 – 1545	25
1545 – 1922	12
1922 – 2300	7
2300 – 2677	5
2677 – 3054	9
3054 – 3431	8
3431 – 3809	7
3809 – 4186	0
4186 – 4563	2
4563 – 4941	6
4941 – 5318	5
5318 – 5695	0
5695 – 6072	4
6072 – 6450	2
6450 – 6827	3
6827 – 7204	0
7204 – 7582	0
7582 – 7959	0
7959 – 8336	0
8336 – 8713	1
8713 – 9091	0
9091 – 9468	0
9468 – 9845	1
9845 – 10222	0
10222 – 10600	0
10600 – 10977	0
10977 – 11354	0
11354 – 11732	0
11732 – 12109	0
12109 – 12486	0
12486 – 12863	0
12863 – 13241	0
13241 – 13618	0
13618 – 13995	0
13995 – 14372	0
14372 – 14750	0
14750 – 15127	1

Schema

8 columns

Per-column summary.
column	type	null rate	unique	alerts
ID	text	0.0%	76,475	near_unique one_word short_text
Language_ID	text	0.0%	2,660	one_word short_text duplicates
Parameter_ID	categorical	0.0%	192
Value	numeric	0.0%	28	high_skew
Code_ID	text	0.0%	1,139	one_word allcaps short_text duplicates
Comment	text	96.9%	2,068	multilingual null_rate
Source	text	9.3%	29,715	one_word duplicates
Example_ID	text	97.9%	1,444	one_word null_rate
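
The null-rate and unique columns of this table can be recomputed from the CSV directly. The sketch below is one way to do that with the standard library; the three sample rows are made up for illustration (they are not real WALS rows), and treating an empty string as null is an assumption about how the profiler counts missing values.

```python
import csv
import io

# Hypothetical sample in the same 8-column layout; not real WALS data.
SAMPLE = """ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source,Example_ID
1a-aaa,aaa,1A,2,1A-2,,Author-2000,
1a-bbb,bbb,1A,2,1A-2,,,
2a-aaa,aaa,2A,1,2A-1,<i>note</i>,Author-2000,igt-0001
"""


def profile(rows):
    """Return {column: (null_rate, n_unique)}, treating '' as null."""
    rows = list(rows)
    out = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v != ""]
        out[col] = (1 - len(non_null) / len(values), len(set(non_null)))
    return out


summary = profile(csv.DictReader(io.StringIO(SAMPLE)))
for col, (null_rate, n_unique) in summary.items():
    print(f"{col}\t{null_rate:.1%}\t{n_unique}")
```

To run it against the real file, replace `io.StringIO(SAMPLE)` with an open handle on the CSV path given at the top of the report.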

ID

text identifier near_unique one_word short_text

This is a row identifier: every one of the 76,475 values is unique, non-null, and a single token of 5-8 characters. Sample values like '26a-mar' and '114a-yko' follow a consistent '<parameter>-<language>' pattern, apparently a lowercased Parameter_ID joined to a Language_ID, which points to a structured composite key rather than a random hash. No duplicates or empties, so it is safe to use as a primary key. Treatment: Use as primary key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 0 (0.0%)
unique: 76,475
len_min: 5
len_max: 8
len_mean: 7.271
len_median: 7
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 20,000
readability_flesch_mean: 88.65
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
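
Before relying on ID as a primary key, it is cheap to verify both uniqueness and the composite structure. The sketch below assumes, without confirmation from the report, that ID is the lowercased Parameter_ID plus '-' plus the Language_ID; the two rows reuse the sample IDs quoted above, with the other two fields filled in under that assumption. `split_id` and `id_matches_row` are hypothetical helper names.

```python
def split_id(row_id: str) -> tuple[str, str]:
    """Split an ID such as '26a-mar' into (parameter part, language part)."""
    param, _, lang = row_id.partition("-")
    return param, lang


def id_matches_row(row: dict) -> bool:
    # Assumption: ID == lowercased Parameter_ID + '-' + Language_ID.
    param, lang = split_id(row["ID"])
    return param == row["Parameter_ID"].lower() and lang == row["Language_ID"]


# Illustrative rows; the Parameter_ID and Language_ID values are assumed.
rows = [
    {"ID": "26a-mar", "Parameter_ID": "26A", "Language_ID": "mar"},
    {"ID": "114a-yko", "Parameter_ID": "114A", "Language_ID": "yko"},
]

ids = [r["ID"] for r in rows]
is_unique = len(ids) == len(set(ids))  # primary-key check
is_composite = all(id_matches_row(r) for r in rows)
print(is_unique, is_composite)
```

If `is_composite` fails on the full file, ID is still usable as a key, but it should not be parsed to recover the parameter or language.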

Language_ID

text foreign_key one_word short_text duplicates

Short codes of two or three letters (len_min 2, len_max 3, len_mean 2.996, one_word_rate 1.0), ISO 639-like in form, with 2,660 distinct values across 76,475 rows. Whether they are ISO 639-3 codes or a WALS-specific code set should be confirmed against the language lookup before joining. Distribution is remarkably flat: the top code 'eng' appears only 159 times and the next nine codes are within 6 counts of it, so the duplicate_rate of 0.965 reflects a wide catalogue rather than a dominant language. No nulls or empties. Treatment: left-join on this id to a language lookup, or one-hot only after collapsing rare codes. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 0 (0.0%)
unique: 2,660
len_min: 2
len_max: 3
len_mean: 2.996
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 73,815
duplicate_rate: 0.9652
vocab_size: 2,238
readability_flesch_mean: 117.7
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
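
The two treatments mentioned for Language_ID can be sketched as follows. This is an illustration only: the code list, the `min_count` threshold, the `__other__` label and the lookup contents are all hypothetical, and a real pipeline would likely use a dataframe library rather than plain lists.

```python
from collections import Counter


def collapse_rare(codes, min_count, other="__other__"):
    """Replace codes seen fewer than min_count times with a shared label."""
    counts = Counter(codes)
    return [c if counts[c] >= min_count else other for c in codes]


def left_join(codes, lookup, default=None):
    """Keep every code; attach the lookup value, or default when missing."""
    return [(c, lookup.get(c, default)) for c in codes]


# Hypothetical data for illustration.
codes = ["eng"] * 3 + ["mar"] * 2 + ["yko"]
lookup = {"eng": "name-for-eng", "mar": "name-for-mar"}

print(collapse_rare(codes, min_count=2))
print(left_join(["eng", "yko"], lookup))
```

Collapsing before one-hot encoding keeps the feature count manageable with 2,660 distinct codes; the left join keeps unmatched codes visible instead of silently dropping rows.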

Parameter_ID

categorical foreign_key

Parameter_ID is a categorical code with 192 distinct values across 76,475 rows and no nulls. The distribution is nearly uniform (entropy ratio 0.94), with the top value '83A' covering only 1.98% of rows; several '143x' codes share an identical count of 1,325, suggesting paired or co-emitted parameters. Treatment: left-join on this id to a parameter lookup table; one-hot only if cardinality is reduced first. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 0 (0.0%)
unique: 192
top_value: 83A
top_rate: 0.01985
cardinality: 192
entropy: 7.103
entropy_ratio: 0.9365
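
The entropy figures can be sanity-checked by hand. The sketch below assumes the profiler reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2 of the number of distinct values; that reading is consistent with the numbers above, but the report does not state the formula.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Consistency check on the reported figures (assumed definition, see above).
reported_entropy = 7.103
n_categories = 192
implied_ratio = reported_entropy / math.log2(n_categories)
print(round(implied_ratio, 4))

# A perfectly uniform distribution gives a ratio of 1.
print(entropy_ratio([5, 5, 5, 5]))
```

A ratio near 1 means rows are spread almost evenly over the 192 parameters, which agrees with the top value covering under 2% of rows.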

Value

numeric feature high_skew

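Small-integer field bounded between 1 and 28 with only 28 distinct values across 76,475 rows. The distribution is tightly packed (median 2, IQR 3) but has a heavy right tail (skew 3.49, kurtosis 16.36) producing 2,469 outliers (3.2%). No nulls or zeros, so every row carries a value of at least 1. The numbers look like category codes rather than counts or ratings: Code_ID entries such as '143G-4' and '82A-1' appear to pair a Parameter_ID with a value number, in which case the skew reflects how many categories each parameter defines and numeric aggregation across parameters is not meaningful. Treatment: If Value is confirmed to be a code, treat it as categorical within each Parameter_ID; log1p-transform or cap the upper tail only if it is genuinely used as a number. high · anthropic:claude-opus-4-7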

n: 76,475
nulls: 0 (0.0%)
unique: 28
min: 1
max: 28
mean: 2.854
median: 2
std: 2.824
q1: 1
q3: 4
iqr: 3
skew: 3.493
kurtosis: 16.36
n_outliers: 2,469
outlier_rate: 0.03229
zero_rate: 0
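
The outlier count is consistent with a Tukey 1.5 x IQR fence applied to the reported quartiles. The sketch below assumes that rule (the report does not name it) and takes per-value counts from the non-empty histogram bins, on the further assumption that each such bin holds exactly one integer value.

```python
# Counts for Value = 1..28, read off the non-empty histogram bins in order.
COUNTS = [
    27379, 21173, 6771, 9917, 3602, 2730, 1392, 1042, 597, 32,
    265, 43, 68, 251, 190, 156, 44, 127, 83, 358,
    202, 21, 18, 3, 3, 4, 3, 1,
]
value_counts = dict(zip(range(1, 29), COUNTS))

# Quartiles as reported in the stats list.
q1, q3 = 1, 4
iqr = q3 - q1

# Assumed rule: Tukey fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

total = sum(value_counts.values())
n_outliers = sum(
    n for v, n in value_counts.items() if v < lower_fence or v > upper_fence
)
outlier_rate = n_outliers / total
mean = sum(v * n for v, n in value_counts.items()) / total

print(upper_fence, n_outliers, round(outlier_rate, 5), round(mean, 3))
```

With an upper fence of 8.5, every row with Value of 9 or more is flagged, so the outlier set is determined entirely by the higher value numbers.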

Code_ID

text foreign_key one_word allcaps short_text duplicates

Code_ID holds short uppercase alphanumeric codes (length 4-7, always one word) drawn from a fixed vocabulary of 1,139 distinct values across 76,475 rows. The 98.5% duplicate rate is expected for a categorical key, with the most common code '143G-4' appearing 1,315 times. The pattern (digits + letter + dash + digit) suggests a structured taxonomy code rather than a unique row identifier. Treatment: Treat as a categorical foreign key; left-join to a code lookup table or target-encode for modelling. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 0 (0.0%)
unique: 1,139
len_min: 4
len_max: 7
len_mean: 5.3
len_median: 5
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 75,336
duplicate_rate: 0.9851
vocab_size: 911
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
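
Because Code_ID looks like a Parameter_ID followed by a dash and a number, it can be parsed and cross-checked against the other columns. The sketch assumes the pattern described above (digits, one uppercase letter, dash, digits) and, as a further unconfirmed assumption, that the trailing number equals the row's Value. `parse_code` and `is_consistent` are hypothetical helper names.

```python
import re

# Assumed shape: digits + one uppercase letter + '-' + digits, e.g. '143G-4'.
CODE_RE = re.compile(r"^(\d+[A-Z])-(\d+)$")


def parse_code(code_id):
    """Return (parameter_id, value_number), or None if the shape differs."""
    m = CODE_RE.match(code_id)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


def is_consistent(row):
    # Assumption: Code_ID == Parameter_ID + '-' + Value.
    return parse_code(row["Code_ID"]) == (row["Parameter_ID"], int(row["Value"]))


# Illustrative row built from a code quoted in the report.
row = {"Parameter_ID": "143G", "Value": "4", "Code_ID": "143G-4"}
print(parse_code(row["Code_ID"]), is_consistent(row))
```

If the check holds across the file, Code_ID is redundant with the (Parameter_ID, Value) pair and only one of the two needs to be kept as a feature.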

Comment

text free_text multilingual null_rate

Free-text commentary field carrying HTML-formatted lexicographic examples (e.g. `čaj`), with content spanning at least 7 languages including English (64), Chinese (53) and Persian (11). It is overwhelmingly empty: null_rate is 0.969 across 76,475 rows, leaving 2,373 populated rows with 2,068 distinct values and a 0.129 duplicate_rate among the populated entries. The negative Flesch mean (-127.13) and HTML-dominated top_words confirm these are markup fragments rather than natural prose, and lengths are highly skewed (median 196, max 15,127). Treatment: Strip HTML, then tokenize per-language before any embedding; given 96.9% nulls, treat presence as a sparse feature. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 74,102 (96.9%)
unique: 2,068
len_min: 36
len_max: 15,127
len_mean: 372.2
len_median: 196
len_p95: 917
word_mean: 49.54
word_median: 8
n_empty: 0
n_duplicates: 305
duplicate_rate: 0.1285
vocab_size: 7,544
readability_flesch_mean: -127.1
emoji_rate: 0
url_rate: 0.001686
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
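
The first step of the treatment, stripping HTML and keeping a presence flag, can be sketched with the standard library's `html.parser`. The sample markup is made up for illustration, since the report does not show the original tags; per-language tokenization is out of scope here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and drop all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # '&amp;' becomes '&'
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())


def comment_features(comment):
    """Return a sparse presence flag plus cleaned text."""
    if not comment:
        return {"has_comment": False, "text": ""}
    return {"has_comment": True, "text": strip_html(comment)}


# Hypothetical markup; the real comments may use different tags.
print(comment_features("<p><i>čaj</i> &amp; more</p>"))
print(comment_features(None))
```

This keeps non-ASCII text such as 'čaj' intact, which matters for a field that mixes several languages.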

Source

text metadata one_word duplicates

This column holds bibliographic source citations in Author-Year slug form (e.g. 'Bybee-et-al-1994', 'Huber-and-Reed-1992'), with 83% being a single token and a mean of 1.24 words. Of the 69,383 non-null entries (9.3% of rows are null), 29,715 are distinct and 57% repeat a value already seen, so many references are reused across rows. The 'passim]' token (386) and stray '112]' suggest page-locator fragments leaked in from inconsistent citation formatting. Treatment: Normalize citation slugs (strip '[passim]'/page fragments) and treat as a categorical reference key. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 7,092 (9.3%)
unique: 29,715
len_min: 7
len_max: 165
len_mean: 21.86
len_median: 19
len_p95: 44
word_mean: 1.239
word_median: 1
n_empty: 0
n_duplicates: 39,668
duplicate_rate: 0.5717
vocab_size: 13,953
readability_flesch_mean: -9.644
emoji_rate: 0
url_rate: 0
one_word_rate: 0.8318
allcaps_rate: 5.765e-05
boilerplate_rate: 0
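
One possible normalization is sketched below. It rests on assumptions the report does not confirm: that page locators appear in square brackets directly after a slug, and that multiple citations in one cell are separated by semicolons or whitespace. The input strings are constructed for illustration from the two slugs quoted above.

```python
import re

# Assumed locator shape: anything in square brackets, e.g. '[112]' or '[et passim]'.
_LOCATOR = re.compile(r"\[[^\]]*\]")


def normalize_sources(raw):
    """Return the list of bare citation slugs found in one Source cell."""
    if not raw:
        return []
    without_locators = _LOCATOR.sub("", raw)
    # Assumed separators between citations: ';' or whitespace.
    return [s for s in re.split(r"[;\s]+", without_locators) if s]


# Constructed examples; the real cell format may differ.
print(normalize_sources("Bybee-et-al-1994[et passim];Huber-and-Reed-1992[112]"))
print(normalize_sources("Bybee-et-al-1994"))
```

Removing the bracketed part before splitting matters, because a locator containing a space would otherwise be cut into stray tokens like the 'passim]' fragment noted above.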

Example_ID

text foreign_key one_word null_rate

This column holds space-separated lists of `igt-####` example identifiers, likely linking each row to one or more interlinear gloss text examples. It is overwhelmingly empty (97.89% null): only 1,612 of 76,475 rows carry a value, 1,444 of them distinct, with 39% of the populated values being a single token. Despite the IDs looking unique, 168 duplicate strings (10.4% duplicate rate) appear, so the same example bundle is referenced by multiple rows. Treatment: Split on whitespace and left-join each igt-id to the examples table; expect most rows to have no match. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 74,863 (97.9%)
unique: 1,444
len_min: 5
len_max: 575
len_mean: 23.57
len_median: 17
len_p95: 53
word_mean: 2.818
word_median: 2
n_empty: 0
n_duplicates: 168
duplicate_rate: 0.1042
vocab_size: 3,810
readability_flesch_mean: 119.7
emoji_rate: 0
url_rate: 0
one_word_rate: 0.3908
allcaps_rate: 0
boilerplate_rate: 0
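
The split-and-join treatment can be sketched as below. The rows, the `igt-` identifiers and the examples lookup are hypothetical placeholders, and the examples table itself is not part of this file.

```python
def explode_examples(rows):
    """Yield one (row ID, igt id) pair per example reference."""
    pairs = []
    for row in rows:
        # A null or empty cell contributes no pairs.
        for igt_id in (row.get("Example_ID") or "").split():
            pairs.append((row["ID"], igt_id))
    return pairs


def left_join_examples(pairs, examples):
    """Attach example text where available; keep unmatched ids with None."""
    return [(row_id, igt_id, examples.get(igt_id)) for row_id, igt_id in pairs]


# Hypothetical data for illustration.
rows = [
    {"ID": "r1", "Example_ID": "igt-0001 igt-0002"},
    {"ID": "r2", "Example_ID": None},
    {"ID": "r3", "Example_ID": "igt-0001"},
]
examples = {"igt-0001": "example text"}

pairs = explode_examples(rows)
print(pairs)
print(left_join_examples(pairs, examples))
```

Exploding first gives a long table with one reference per line, which handles rows that cite several examples and makes shared examples, such as 'igt-0001' here, easy to count.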