language data wals values
Reading
This dataset is a 76,475-row export of WALS (World Atlas of Language Structures) values, with 8 columns covering language identifiers, typological parameters, coded values, sources, and comments. The core analytical fields are Parameter_ID (192 typological features, top being '83A' at ~2% of rows) and Value (a small numeric coding scheme with 28 distinct values, median 2 and a long right tail up to 28). Two things deserve a closer look first: the Value column is highly skewed (skew 3.49, kurtosis 16.4, ~3.2% outliers), which matters if you plan to aggregate it numerically; and Comment is 96.9% null and, where present, is mostly HTML markup in mixed languages (English and Chinese dominate), so it needs cleaning before any text analysis. Language_ID spans 2,660 unique codes fairly evenly (top 'eng' has just 159 rows), confirming broad cross-linguistic coverage.
citing: Value.stats.skew · Value.stats.kurtosis · Value.n_unique · Value.stats.outlier_rate · Parameter_ID.n_unique · Parameter_ID.top_value · Parameter_ID.stats.top_rate · Comment.null_rate · Comment.language_counts · Language_ID.n_unique · Code_ID.n_unique · Source.n_unique · row_count
Charts the summary said to look at first
Value: distribution (histogram bins)
| bin | count |
|---|---|
| 1 – 1.675 | 27379 |
| 1.675 – 2.35 | 21173 |
| 2.35 – 3.025 | 6771 |
| 3.025 – 3.7 | 0 |
| 3.7 – 4.375 | 9917 |
| 4.375 – 5.05 | 3602 |
| 5.05 – 5.725 | 0 |
| 5.725 – 6.4 | 2730 |
| 6.4 – 7.075 | 1392 |
| 7.075 – 7.75 | 0 |
| 7.75 – 8.425 | 1042 |
| 8.425 – 9.1 | 597 |
| 9.1 – 9.775 | 0 |
| 9.775 – 10.45 | 32 |
| 10.45 – 11.12 | 265 |
| 11.12 – 11.8 | 0 |
| 11.8 – 12.48 | 43 |
| 12.48 – 13.15 | 68 |
| 13.15 – 13.83 | 0 |
| 13.83 – 14.5 | 251 |
| 14.5 – 15.18 | 190 |
| 15.18 – 15.85 | 0 |
| 15.85 – 16.52 | 156 |
| 16.52 – 17.2 | 44 |
| 17.2 – 17.88 | 0 |
| 17.88 – 18.55 | 127 |
| 18.55 – 19.23 | 83 |
| 19.23 – 19.9 | 0 |
| 19.9 – 20.58 | 358 |
| 20.58 – 21.25 | 202 |
| 21.25 – 21.93 | 0 |
| 21.93 – 22.6 | 21 |
| 22.6 – 23.28 | 18 |
| 23.28 – 23.95 | 0 |
| 23.95 – 24.62 | 3 |
| 24.62 – 25.3 | 3 |
| 25.3 – 25.98 | 0 |
| 25.98 – 26.65 | 4 |
| 26.65 – 27.33 | 3 |
| 27.33 – 28 | 1 |
Parameter_ID: top 20 values
| value | count | share |
|---|---|---|
| 83A | 1518 | 2.0% |
| 82A | 1496 | 2.0% |
| 81A | 1376 | 1.8% |
| 87A | 1367 | 1.8% |
| 143A | 1325 | 1.7% |
| 143E | 1325 | 1.7% |
| 143F | 1325 | 1.7% |
| 143G | 1325 | 1.7% |
| 97A | 1316 | 1.7% |
| 86A | 1249 | 1.6% |
| 88A | 1225 | 1.6% |
| 144A | 1190 | 1.6% |
| 85A | 1184 | 1.5% |
| 112A | 1157 | 1.5% |
| 89A | 1154 | 1.5% |
| 95A | 1142 | 1.5% |
| 69A | 1131 | 1.5% |
| 33A | 1066 | 1.4% |
| 51A | 1031 | 1.3% |
| 26A | 969 | 1.3% |
Language_ID: string length in characters
| chars | count |
|---|---|
| 2 | 342 |
| 3 | 76133 |
Code_ID: string length in characters
| chars | count |
|---|---|
| 4 | 4995 |
| 5 | 45403 |
| 6 | 24205 |
| 7 | 1872 |
Comment: string length in characters (non-null rows)
| chars | count |
|---|---|
| 36 – 413 | 1870 |
| 413 – 791 | 361 |
| 791 – 1168 | 44 |
| 1168 – 1545 | 25 |
| 1545 – 1922 | 12 |
| 1922 – 2300 | 7 |
| 2300 – 2677 | 5 |
| 2677 – 3054 | 9 |
| 3054 – 3431 | 8 |
| 3431 – 3809 | 7 |
| 3809 – 4186 | 0 |
| 4186 – 4563 | 2 |
| 4563 – 4941 | 6 |
| 4941 – 5318 | 5 |
| 5318 – 5695 | 0 |
| 5695 – 6072 | 4 |
| 6072 – 6450 | 2 |
| 6450 – 6827 | 3 |
| 6827 – 7204 | 0 |
| 7204 – 7582 | 0 |
| 7582 – 7959 | 0 |
| 7959 – 8336 | 0 |
| 8336 – 8713 | 1 |
| 8713 – 9091 | 0 |
| 9091 – 9468 | 0 |
| 9468 – 9845 | 1 |
| 9845 – 10222 | 0 |
| 10222 – 10600 | 0 |
| 10600 – 10977 | 0 |
| 10977 – 11354 | 0 |
| 11354 – 11732 | 0 |
| 11732 – 12109 | 0 |
| 12109 – 12486 | 0 |
| 12486 – 12863 | 0 |
| 12863 – 13241 | 0 |
| 13241 – 13618 | 0 |
| 13618 – 13995 | 0 |
| 13995 – 14372 | 0 |
| 14372 – 14750 | 0 |
| 14750 – 15127 | 1 |
Schema
8 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| ID | text | 0.0% | 76,475 | near_unique, one_word, short_text |
| Language_ID | text | 0.0% | 2,660 | one_word, short_text, duplicates |
| Parameter_ID | categorical | 0.0% | 192 | |
| Value | numeric | 0.0% | 28 | high_skew |
| Code_ID | text | 0.0% | 1,139 | one_word, allcaps, short_text, duplicates |
| Comment | text | 96.9% | 2,068 | multilingual, null_rate |
| Source | text | 9.3% | 29,715 | one_word, duplicates |
| Example_ID | text | 97.9% | 1,444 | one_word, null_rate |
ID
text identifier near_unique one_word short_text
This is a row identifier: every one of the 76,475 values is unique, non-null, and a single token of 5-8 characters. Sample values like '26a-mar' and '114a-yko' appear to follow a consistent '<parameter>-<language code>' pattern (a lowercased Parameter_ID such as '26A', a hyphen, then a 2-3 letter code), suggesting a structured composite key rather than a random hash. No duplicates or empties, so it is safe to use as a primary key. Treatment: Use as primary key for joins; exclude from modelling features.
- n: 76,475
- nulls: 0 (0.0%)
- unique: 76,475
- len_min: 5
- len_max: 8
- len_mean: 7.271
- len_median: 7
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 20,000
- readability_flesch_mean: 88.65
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
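The composite-key reading of ID can be checked row by row. The sketch below is illustrative only: it assumes the ID is the lowercased Parameter_ID, a hyphen, and the Language_ID, which the two sample values suggest but the report does not confirm. The helper names and the example row are hypothetical.

```python
def split_row_id(row_id):
    """Split an ID such as '26a-mar' at its last hyphen into (parameter, language)."""
    parameter, sep, language = row_id.rpartition("-")
    if not sep or not parameter or not language:
        raise ValueError(f"unexpected ID format: {row_id!r}")
    # Assumption: the parameter part is Parameter_ID in lower case.
    return parameter.upper(), language


def id_matches_columns(row):
    """True when the ID agrees with the row's Parameter_ID and Language_ID."""
    parameter, language = split_row_id(row["ID"])
    return parameter == row["Parameter_ID"] and language == row["Language_ID"]


# Hypothetical row built from a sample ID quoted in the report.
example_row = {"ID": "26a-mar", "Parameter_ID": "26A", "Language_ID": "mar"}
print(id_matches_columns(example_row))
```

Rows for which the check fails would be worth inspecting before relying on the ID to recover the other two columns.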
Language_ID
text foreign_key one_word short_text duplicates
Two- or three-letter codes (len_min 2, len_mean 2.996, len_max 3, one_word_rate 1.0) that resemble ISO 639-style language identifiers, with 2,660 distinct values across 76,475 rows. WALS uses its own language codes, which do not always coincide with ISO 639-3, so check the scheme before joining to an ISO table. The distribution is remarkably flat: the top code 'eng' appears only 159 times and the next nine codes are within 6 counts of it, so the duplicate_rate of 0.965 reflects a wide catalogue rather than a dominant language. No nulls or empties. Treatment: left-join on this id to a language lookup, or one-hot only after collapsing rare codes.
- n: 76,475
- nulls: 0 (0.0%)
- unique: 2,660
- len_min: 2
- len_max: 3
- len_mean: 2.996
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 73,815
- duplicate_rate: 0.9652
- vocab_size: 2,238
- readability_flesch_mean: 117.7
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
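Collapsing rare codes before one-hot encoding, as the treatment suggests, might look like the following minimal sketch. The threshold `min_count` and the placeholder label `__other__` are illustrative choices, not values taken from the report.

```python
from collections import Counter


def collapse_rare(codes, min_count, other="__other__"):
    """Replace codes seen fewer than min_count times with a shared placeholder."""
    counts = Counter(codes)
    return [code if counts[code] >= min_count else other for code in codes]


# Toy input; real Language_ID values would come from the dataset.
sample = ["eng", "eng", "eng", "mar", "yko"]
print(collapse_rare(sample, min_count=2))
```

With 2,660 codes and a top count of only 159, the threshold largely determines how many indicator columns survive, so it is worth choosing with the downstream model in mind.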
Parameter_ID
categorical foreign_key
Parameter_ID is a categorical code with 192 distinct values across 76,475 rows and no nulls. The distribution is nearly uniform (entropy 7.10 bits, entropy ratio 0.94), with the top value '83A' covering only 1.98% of rows; several '143x' codes share an identical count of 1,325, suggesting these parameters are coded together for the same set of languages. Treatment: left-join on this id to a parameter lookup table; one-hot only if cardinality is reduced first.
- n: 76,475
- nulls: 0 (0.0%)
- unique: 192
- top_value: 83A
- top_rate: 0.01985
- cardinality: 192
- entropy: 7.103
- entropy_ratio: 0.9365
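The reported entropy_ratio is consistent with Shannon entropy in bits divided by the maximum possible entropy, log2 of the cardinality. The sketch below assumes that definition (the profiler's exact formula is not stated) and reproduces the ratio from the two reported figures.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy of the empirical distribution, in bits."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by log2(number of distinct values); 1.0 means uniform."""
    k = len(set(values))
    if k < 2:
        return 0.0
    return entropy_bits(values) / math.log2(k)


# Reported entropy (7.103 bits) over the maximum for 192 categories.
reported_ratio = 7.103 / math.log2(192)
print(round(reported_ratio, 4))
```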
Value
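numeric feature high_skew
Small-integer field bounded between 1 and 28, with only 28 distinct values across 76,475 rows. The distribution is tightly packed (median 2, IQR 3) but has a heavy right tail (skew 3.49, kurtosis 16.36), producing 2,469 outliers (3.2%). There are no nulls or zeros. Because Code_ID values such as '143G-4' look like a Parameter_ID followed by a value number, Value is more likely an index into each parameter's code list than a count or rating; if so, the tail mostly reflects how many codes a parameter defines, and means across parameters are not meaningful. Treatment: treat Value as categorical within each Parameter_ID; if a numeric treatment is genuinely intended, log1p-transform or cap the upper tail before modelling to tame the skew.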
- n: 76,475
- nulls: 0 (0.0%)
- unique: 28
- min: 1
- max: 28
- mean: 2.854
- median: 2
- std: 2.824
- q1: 1
- q3: 4
- iqr: 3
- skew: 3.493
- kurtosis: 16.36
- n_outliers: 2,469
- outlier_rate: 0.03229
- zero_rate: 0
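The outlier count can be reproduced from the Value histogram under the common 1.5 × IQR (Tukey fence) rule. That rule is an assumption about how the profiler defines outliers, although the result matches the reported figure. The per-integer counts below are read off the non-empty histogram bins, each of which contains exactly one integer.

```python
# Count of rows per integer Value, taken from the non-empty histogram bins.
VALUE_COUNTS = {
    1: 27379, 2: 21173, 3: 6771, 4: 9917, 5: 3602, 6: 2730, 7: 1392,
    8: 1042, 9: 597, 10: 32, 11: 265, 12: 43, 13: 68, 14: 251,
    15: 190, 16: 156, 17: 44, 18: 127, 19: 83, 20: 358, 21: 202,
    22: 21, 23: 18, 24: 3, 25: 3, 26: 4, 27: 3, 28: 1,
}


def quantile_from_counts(counts, q):
    """Smallest value whose cumulative count reaches the fraction q of all rows."""
    total = sum(counts.values())
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= q * total:
            return value
    raise ValueError("q must be in (0, 1]")


total_rows = sum(VALUE_COUNTS.values())
q1 = quantile_from_counts(VALUE_COUNTS, 0.25)
q3 = quantile_from_counts(VALUE_COUNTS, 0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rows outside the fences; with these counts only the upper fence is exceeded.
n_outliers = sum(c for v, c in VALUE_COUNTS.items() if v < lower_fence or v > upper_fence)
outlier_rate = n_outliers / total_rows
print(q1, q3, upper_fence, n_outliers, round(outlier_rate, 5))
```

Under this rule every Value of 9 or above counts as an outlier, which is a statement about the coding scheme's range more than about data quality.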
Code_ID
text foreign_key one_word allcaps short_text duplicates
Code_ID holds short uppercase alphanumeric codes (length 4-7, always one word) drawn from a fixed vocabulary of 1,139 distinct values across 76,475 rows. The 98.5% duplicate rate is expected for a categorical key, with the most common code '143G-4' appearing 1,315 times. The pattern (digits + letter + dash + digit) suggests a structured taxonomy code, apparently a Parameter_ID followed by a Value, rather than a unique row identifier. Treatment: Treat as a categorical foreign key; left-join to a code lookup table or target-encode for modelling.
- n: 76,475
- nulls: 0 (0.0%)
- unique: 1,139
- len_min: 4
- len_max: 7
- len_mean: 5.3
- len_median: 5
- len_p95: 6
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 75,336
- duplicate_rate: 0.9851
- vocab_size: 911
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
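If Code_ID really is the Parameter_ID joined to the Value by a hyphen, the column is redundant and can be validated rather than encoded separately. The sketch below assumes that structure, which is inferred only from the '143G-4' sample; the function name and the example rows are hypothetical.

```python
def code_id_is_consistent(row):
    """True when Code_ID equals '<Parameter_ID>-<Value>' for the given row."""
    expected = f"{row['Parameter_ID']}-{int(row['Value'])}"
    return row["Code_ID"] == expected


# Hypothetical rows modelled on the sample code quoted in the report.
rows = [
    {"Parameter_ID": "143G", "Value": 4, "Code_ID": "143G-4"},
    {"Parameter_ID": "83A", "Value": 2, "Code_ID": "83A-3"},
]
mismatches = [row for row in rows if not code_id_is_consistent(row)]
print(len(mismatches))
```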
Comment
text free_text multilingual null_rate
Free-text commentary field carrying HTML-formatted lexicographic examples, with content spanning at least 7 languages including English (64), Chinese (53) and Persian (11). It is overwhelmingly empty: null_rate is 0.969 across 76,475 rows, leaving 2,373 populated rows with 2,068 distinct values and a 0.129 duplicate_rate among them. The negative Flesch mean (-127.13) and HTML-dominated top_words indicate these are markup fragments rather than natural prose, and lengths are highly skewed (median 196, max 15,127). Treatment: Strip HTML, then tokenize per-language before any embedding; given 96.9% nulls, treat presence as a sparse feature.
- n: 76,475
- nulls: 74,102 (96.9%)
- unique: 2,068
- len_min: 36
- len_max: 15,127
- len_mean: 372.2
- len_median: 196
- len_p95: 917
- word_mean: 49.54
- word_median: 8
- n_empty: 0
- n_duplicates: 305
- duplicate_rate: 0.1285
- vocab_size: 7,544
- readability_flesch_mean: -127.1
- emoji_rate: 0
- url_rate: 0.001686
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
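The HTML-stripping step could be sketched with the standard library's `html.parser`, as below. The sample markup is invented for illustration, since the report does not reproduce an actual Comment value, and a dedicated HTML library may be preferable for badly malformed input.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags; entities are decoded by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    if markup is None:
        return None  # keep nulls as nulls so presence can still be a feature
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Invented fragment, not a value from the dataset.
print(strip_html('<p>The word <i>čaj</i> means &quot;tea&quot;.</p>'))
```

Keeping `None` for empty comments preserves the distinction between "no comment" and "comment with no visible text", which matters when presence is used as a sparse feature.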
Source
text metadata one_word duplicates
This column holds bibliographic source citations in Author-Year slug form (e.g. 'Bybee-et-al-1994', 'Huber-and-Reed-1992'), with 83% being a single token and a mean of 1.24 words. Despite 29,715 unique values, 57% of the populated entries are duplicates and 9.3% of rows are null, indicating that many references are reused across rows. The 'passim]' token (386) and stray '112]' suggest that bracketed page locators are attached to the slugs and were split apart during tokenization. Treatment: Normalize citation slugs (strip '[passim]'/page fragments) and treat as a categorical reference key.
- n: 76,475
- nulls: 7,092 (9.3%)
- unique: 29,715
- len_min: 7
- len_max: 165
- len_mean: 21.86
- len_median: 19
- len_p95: 44
- word_mean: 1.239
- word_median: 1
- n_empty: 0
- n_duplicates: 39,668
- duplicate_rate: 0.5717
- vocab_size: 13,953
- readability_flesch_mean: -9.644
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.8318
- allcaps_rate: 5.765e-05
- boilerplate_rate: 0
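Normalizing the citation slugs might be done with a small parser like the one below. It assumes each Source cell is a list of slugs, each optionally followed by a bracketed locator such as `[passim]` or `[110, 112]`, separated by semicolons or whitespace. That layout is inferred from the 'passim]' and '112]' tokens and is not confirmed by the report; the example cell is invented.

```python
import re

# A slug (no whitespace, semicolons or brackets), optionally followed by "[locator]".
_CITATION = re.compile(r"([^\s;\[\]]+)(?:\[([^\]]*)\])?")


def parse_sources(cell):
    """Return a list of (slug, locator or None) pairs from one Source cell."""
    if not cell:
        return []
    return [(m.group(1), m.group(2)) for m in _CITATION.finditer(cell)]


def source_keys(cell):
    """Slugs only, with page locators dropped, for use as a categorical key."""
    return [slug for slug, _ in parse_sources(cell)]


# Invented cell combining the two slugs quoted in the report.
cell = "Bybee-et-al-1994[110, 112];Huber-and-Reed-1992[passim]"
print(parse_sources(cell))
print(source_keys(cell))
```

Matching the bracketed locator as a unit avoids the stray fragments that appear when cells are split on whitespace alone.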
Example_ID
text foreign_key one_word null_rate
This column holds space-separated lists of `igt-####` example identifiers, likely linking each row to one or more interlinear gloss text examples. It is overwhelmingly empty (97.89% null): only 1,612 of 76,475 rows carry a value, forming 1,444 distinct strings, and 39% of the populated cells are a single token. Despite the IDs looking unique, 168 populated cells repeat an earlier string (10.4% duplicate rate), so the same example bundle is referenced by multiple rows. Treatment: Split on whitespace and left-join each igt-id to the examples table; expect most rows to have no match.
- n: 76,475
- nulls: 74,863 (97.9%)
- unique: 1,444
- len_min: 5
- len_max: 575
- len_mean: 23.57
- len_median: 17
- len_p95: 53
- word_mean: 2.818
- word_median: 2
- n_empty: 0
- n_duplicates: 168
- duplicate_rate: 0.1042
- vocab_size: 3,810
- readability_flesch_mean: 119.7
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.3908
- allcaps_rate: 0
- boilerplate_rate: 0
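The split-then-join treatment amounts to turning each populated cell into one (row, example) pair per identifier. A minimal sketch follows; the row IDs and `igt-` identifiers in the example are invented, and the examples table itself is not part of this report.

```python
def explode_example_ids(rows):
    """Yield (row ID, example ID) pairs, one per whitespace-separated example ID."""
    for row in rows:
        cell = row.get("Example_ID")
        if not cell:
            continue  # most rows have no examples
        for example_id in cell.split():
            yield row["ID"], example_id


# Invented rows; real identifiers would come from the dataset.
rows = [
    {"ID": "26a-mar", "Example_ID": "igt-0001 igt-0002"},
    {"ID": "114a-yko", "Example_ID": None},
]
print(list(explode_example_ids(rows)))
```

The resulting pairs form a link table that can be joined to an examples table on the example ID without duplicating the value rows themselves.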