saturn·

language data wals values

source /home/coolhand/datasets/language-data/wals_values.csv · 76,475 rows · 8 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a 76,475-row export of WALS (World Atlas of Language Structures) values, with 8 columns covering language identifiers, typological parameters, coded values, sources, and comments. The core analytical fields are Parameter_ID (192 typological features, top being '83A' at ~2% of rows) and Value (a small numeric coding scheme with 28 distinct values, median 2 and a long right tail up to 28). Two things deserve a closer look first: the Value column is highly skewed (skew 3.49, kurtosis 16.4, ~3.2% outliers), which matters if you plan to aggregate it numerically; and Comment is 96.9% null and, where present, is mostly HTML markup in mixed languages (English and Chinese dominate), so it needs cleaning before any text analysis. Language_ID spans 2,660 unique codes fairly evenly (top 'eng' has just 159 rows), confirming broad cross-linguistic coverage.

citing: Value.stats.skew · Value.stats.kurtosis · Value.n_unique · Value.stats.outlier_rate · Parameter_ID.n_unique · Parameter_ID.top_value · Parameter_ID.stats.top_rate · Comment.null_rate · Comment.language_counts · Language_ID.n_unique · Code_ID.n_unique · Source.n_unique · row_count

Schema

8 columns
Per-column summary.
Column · Type · Nulls · Unique · Alerts
ID · text · 0.0% · 76,475 · near_unique, one_word, short_text
Language_ID · text · 0.0% · 2,660 · one_word, short_text, duplicates
Parameter_ID · categorical · 0.0% · 192 · (none)
Value · numeric · 0.0% · 28 · high_skew
Code_ID · text · 0.0% · 1,139 · one_word, allcaps, short_text, duplicates
Comment · text · 96.9% · 2,068 · multilingual, null_rate
Source · text · 9.3% · 29,715 · one_word, duplicates
Example_ID · text · 97.9% · 1,444 · one_word, null_rate
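The null rate and distinct count per column can be recomputed from the CSV as a sanity check on the summary above. The sketch below is a minimal, dependency-free version. It assumes empty cells stand for nulls, and the three sample rows are made up for illustration, not taken from the file.

```python
import csv
import io


def column_summary(rows, columns):
    """Return {column: (null_rate, n_unique)}, treating empty strings as nulls."""
    n = len(rows)
    summary = {}
    for col in columns:
        values = [row[col] for row in rows if row[col] != ""]
        null_rate = (n - len(values)) / n if n else 0.0
        summary[col] = (null_rate, len(set(values)))
    return summary


# Made-up rows in the same 8-column layout (not real WALS data).
SAMPLE = """ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source,Example_ID
1a-aaa,aaa,1A,2,1A-2,,Doe-2000,
1a-bbb,bbb,1A,3,1A-3,,,
2a-aaa,aaa,2A,1,2A-1,note,Doe-2000,igt-1
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
summary = column_summary(rows, reader.fieldnames)
```

To run it on the real file, replace `io.StringIO(SAMPLE)` with an open file handle on the CSV path given in the source line.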

ID

text identifier near_unique one_word short_text
This is a row identifier: every one of the 76,475 values is unique, non-null, and a single token of 5-8 characters. Sample values like '26a-mar' and '114a-yko' follow a consistent '<digits><letter>-<short code>' pattern, which looks like a lowercased parameter code joined to a language code, so this is a structured composite key rather than a random hash. No duplicates or empties, so it is safe to use as a primary key. Treatment: use as the primary key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7
n: 76,475
nulls: 0 (0.0%)
unique: 76,475
len_min: 5
len_max: 8
len_mean: 7.271
len_median: 7
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 20,000
readability_flesch_mean: 88.65
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
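Before relying on ID as a primary key, it is cheap to re-verify uniqueness and to split the composite key into its two parts. The following is a small sketch; the helper names are hypothetical, and reading the two halves as parameter and language codes is an assumption based only on the sample values shown above.

```python
from collections import Counter


def duplicate_keys(ids):
    """Return IDs occurring more than once; an empty list means the key is unique."""
    counts = Counter(ids)
    return sorted(key for key, count in counts.items() if count > 1)


def split_id(row_id):
    """Split an ID such as '26a-mar' on its first hyphen into (prefix, suffix)."""
    prefix, _, suffix = row_id.partition("-")
    return prefix, suffix


ids = ["26a-mar", "114a-yko"]
dupes = duplicate_keys(ids)
parts = [split_id(i) for i in ids]
```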

Language_ID

text foreign_key one_word short_text duplicates
Short language codes, almost all three letters long (len_min 2, len_max 3, len_mean 2.996, one_word_rate 1.0), with 2,660 distinct values across 76,475 rows. They resemble ISO 639-style identifiers, but WALS assigns its own language codes, so confirm the scheme against the language lookup before joining to ISO-keyed data. The distribution is remarkably flat: the top code 'eng' appears only 159 times and the next nine codes are within 6 counts of it, so the duplicate_rate of 0.965 reflects a wide catalogue rather than a dominant language. No nulls or empties. Treatment: left-join on this id to a language lookup, or one-hot only after collapsing rare codes. high · anthropic:claude-opus-4-7
n: 76,475
nulls: 0 (0.0%)
unique: 2,660
len_min: 2
len_max: 3
len_mean: 2.996
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 73,815
duplicate_rate: 0.9652
vocab_size: 2,238
readability_flesch_mean: 117.7
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
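The two treatments suggested for Language_ID, a left join to a lookup and collapsing rare codes before one-hot encoding, can be sketched as follows. The lookup dictionary, the `Language_Name` field, the `OTHER` bucket, and the threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def left_join(rows, lookup, key="Language_ID"):
    """Attach a 'Language_Name' to every row; unmatched codes get None (left join)."""
    joined = []
    for row in rows:
        merged = dict(row)
        merged["Language_Name"] = lookup.get(row[key])
        joined.append(merged)
    return joined


def collapse_rare(codes, min_count):
    """Replace codes seen fewer than min_count times with a single 'OTHER' bucket."""
    counts = Counter(codes)
    return [code if counts[code] >= min_count else "OTHER" for code in codes]


# Hypothetical lookup and rows; 'zzz' is deliberately missing from the lookup.
lookup = {"eng": "English"}
rows = [{"Language_ID": "eng"}, {"Language_ID": "eng"}, {"Language_ID": "zzz"}]
joined = left_join(rows, lookup)
collapsed = collapse_rare([r["Language_ID"] for r in rows], min_count=2)
```

Because the distribution is so flat, any fixed `min_count` will move many codes into the bucket at once, so the threshold is worth checking against the actual counts.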

Parameter_ID

categorical foreign_key
Parameter_ID is a categorical code with 192 distinct values across 76,475 rows and no nulls. The distribution is nearly uniform (entropy ratio 0.94), with the top value '83A' covering only 1.98% of rows; several '143x' codes share an identical count of 1,325, suggesting paired or co-emitted parameters. Treatment: left-join on this id to a parameter lookup table; one-hot only if cardinality is reduced first. high · anthropic:claude-opus-4-7
n: 76,475
nulls: 0 (0.0%)
unique: 192
top_value: 83A
top_rate: 0.01985
cardinality: 192
entropy: 7.103
entropy_ratio: 0.9365
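The reported figures are consistent with Shannon entropy measured in bits and normalised by the log2 of the cardinality, since 7.103 / log2(192) is about 0.9365. The profiler's exact formula is not stated, so the sketch below is an assumption that happens to reproduce the numbers.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy of the empirical distribution, in bits."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    if k <= 1:
        return 0.0  # a constant column carries no information
    return entropy_bits(values) / math.log2(k)


# Check against the reported statistics for Parameter_ID.
reported_ratio = 7.103 / math.log2(192)
uniform_ratio = entropy_ratio(["1A", "2A", "3A", "4A"])
```

A ratio of 1.0 means a perfectly uniform distribution, so 0.94 indicates that rows are spread fairly evenly across the 192 parameters.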

Value

numeric feature high_skew
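Small-integer field bounded between 1 and 28 with only 28 distinct values across 76,475 rows. The distribution is tightly packed (median 2, IQR 3) but has a heavy right tail (skew 3.49, kurtosis 16.36) producing 2,469 outliers (3.2%). No nulls or zeros. The profiler reads this as a count or rating, but the Code_ID pattern (for example '143G-4') suggests Value may instead be the index of a coded category within its Parameter_ID, in which case means and skew pooled across parameters are not meaningful and the column is better treated as categorical per parameter. Treatment: if Value is used numerically, log1p-transform or cap the upper tail before modelling to tame the skew; otherwise encode it as a category within each Parameter_ID. high · anthropic:claude-opus-4-7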
n: 76,475
nulls: 0 (0.0%)
unique: 28
min: 1
max: 28
mean: 2.854
median: 2
std: 2.824
q1: 1
q3: 4
iqr: 3
skew: 3.493
kurtosis: 16.36
n_outliers: 2,469
outlier_rate: 0.03229
zero_rate: 0
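The report does not say how outliers are defined. A common choice is Tukey's 1.5 × IQR rule, and the sketch below assumes that rule purely for illustration. With the reported q1 = 1 and q3 = 4 it gives fences at -3.5 and 8.5, so any Value of 9 or more would be flagged. The sample list is made up.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_upper(values, upper):
    """Clip every value above the upper fence down to the fence."""
    return [min(v, upper) for v in values]


# Fences from the reported quartiles (assumes the 1.5 * IQR rule).
low, high = tukey_fences(1, 4)

# Made-up values to show flagging and capping.
sample = [1, 1, 2, 2, 4, 9, 28]
outliers = [v for v in sample if v < low or v > high]
capped = cap_upper(sample, high)
```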

Code_ID

text foreign_key one_word allcaps short_text duplicates
Code_ID holds short uppercase alphanumeric codes (length 4-7, always one word) drawn from a fixed vocabulary of 1,139 distinct values across 76,475 rows. The 98.5% duplicate rate is expected for a categorical key, with the most common code '143G-4' appearing 1,315 times. The pattern (digits + letter + dash + digits) suggests a structured taxonomy code, apparently a Parameter_ID joined to a Value, rather than a unique row identifier. Treatment: treat as a categorical foreign key; left-join to a code lookup table or target-encode for modelling. high · anthropic:claude-opus-4-7
n: 76,475
nulls: 0 (0.0%)
unique: 1,139
len_min: 4
len_max: 7
len_mean: 5.3
len_median: 5
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 75,336
duplicate_rate: 0.9851
vocab_size: 911
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
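If Code_ID really is a parameter code followed by a dash and a number, as the '143G-4' example suggests, it can be parsed and cross-checked against the Parameter_ID and Value columns. The regular expression and the consistency rule below are assumptions inferred from that single example, not a documented format.

```python
import re

# Assumed shape: digits, one uppercase letter, a dash, then digits (e.g. '143G-4').
_CODE_RE = re.compile(r"^(\d+[A-Z])-(\d+)$")


def parse_code_id(code_id):
    """Return (parameter, index) for a well-formed Code_ID, else None."""
    match = _CODE_RE.match(code_id)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_consistent(row):
    """True when Code_ID equals Parameter_ID joined to Value (assumed relation)."""
    parsed = parse_code_id(row["Code_ID"])
    return parsed == (row["Parameter_ID"], int(row["Value"]))


row = {"Parameter_ID": "143G", "Value": "4", "Code_ID": "143G-4"}
parsed = parse_code_id(row["Code_ID"])
ok = is_consistent(row)
```

Counting the rows where `is_consistent` is false would show quickly whether the assumed relation holds across the file.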

Comment

text free_text multilingual null_rate
Free-text commentary field carrying HTML-formatted lexicographic examples (e.g. markup wrapping the form 'čaj'), with content spanning at least 7 languages including English (64), Chinese (53) and Persian (11). It is overwhelmingly empty: null_rate is 0.969 across 76,475 rows, leaving 2,373 populated rows with 2,068 distinct values and a 0.129 duplicate_rate among them. The negative Flesch mean (-127.13) and HTML-dominated top_words confirm these are markup fragments rather than natural prose, and lengths are highly skewed (median 196, max 15,127). Treatment: strip HTML, then tokenize per-language before any embedding; given 96.9% nulls, treat presence as a sparse feature. high · anthropic:claude-opus-4-7
n: 76,475
nulls: 74,102 (96.9%)
unique: 2,068
len_min: 36
len_max: 15,127
len_mean: 372.2
len_median: 196
len_p95: 917
word_mean: 49.54
word_median: 8
n_empty: 0
n_duplicates: 305
duplicate_rate: 0.1285
vocab_size: 7,544
readability_flesch_mean: -127.1
emoji_rate: 0
url_rate: 0.001686
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
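Stripping the markup and deriving a presence flag can be done with the standard library alone. This is a minimal sketch using `html.parser.HTMLParser`; the sample markup is invented, since the original tags are not visible in this report, and joining text nodes with a space is a simplification that can split a word interrupted by an inline tag.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and ignore all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def has_comment(value):
    """Sparse presence feature: 1 if the cell holds any text, else 0."""
    return int(bool(value))


# Invented fragment for illustration only.
text = strip_html("<p><i>čaj</i> &amp; tea</p>")
flags = [has_comment(v) for v in ["<p>x</p>", "", None]]
```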

Source

text metadata one_word duplicates
This column holds bibliographic source citations in Author-Year slug form (e.g. 'Bybee-et-al-1994', 'Huber-and-Reed-1992'), with 83% being a single token and a mean of 1.24 words. 9.3% of rows are null; the 69,383 populated rows contain 29,715 distinct values, so 57% of populated entries repeat a value seen elsewhere, indicating that many references are reused across rows. The 'passim]' token (386) and stray '112]' suggest page-locator fragments leaked in from inconsistent citation formatting. Treatment: normalize citation slugs (strip '[passim]'/page fragments) and treat as a categorical reference key. high · anthropic:claude-opus-4-7
n: 76,475
nulls: 7,092 (9.3%)
unique: 29,715
len_min: 7
len_max: 165
len_mean: 21.86
len_median: 19
len_p95: 44
word_mean: 1.239
word_median: 1
n_empty: 0
n_duplicates: 39,668
duplicate_rate: 0.5717
vocab_size: 13,953
readability_flesch_mean: -9.644
emoji_rate: 0
url_rate: 0
one_word_rate: 0.8318
allcaps_rate: 5.765e-05
boilerplate_rate: 0
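One way to normalize the citation slugs is to drop any bracketed locator and split cells that hold several references. The sketch below assumes locators appear in square brackets after the slug and that multiple references are separated by semicolons. Neither convention is confirmed by this report, so both should be checked against real cells first.

```python
import re

# Assumed locator shape: anything inside square brackets, e.g. '[12, passim]'.
_LOCATOR = re.compile(r"\[[^\]]*\]")


def normalize_sources(cell, sep=";"):
    """Return the bare citation slugs in a Source cell, locators removed."""
    if not cell:
        return []
    slugs = []
    for part in cell.split(sep):
        slug = _LOCATOR.sub("", part).strip()
        if slug:
            slugs.append(slug)
    return slugs


# Invented cell combining the two slugs quoted above with a made-up locator.
slugs = normalize_sources("Bybee-et-al-1994[12, passim];Huber-and-Reed-1992")
```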

Example_ID

text foreign_key one_word null_rate
This column holds space-separated lists of `igt-####` example identifiers, likely linking each row to one or more interlinear gloss text examples. It is overwhelmingly empty (97.89% null): only 1,612 of 76,475 rows carry a value, with 1,444 distinct strings among them and 39% of populated cells being a single token. Despite the IDs looking unique, 168 duplicate strings (10.4% duplicate rate) appear, so the same example bundle is referenced by multiple rows. Treatment: split on whitespace and left-join each igt-id to the examples table; expect most rows to have no match. high · anthropic:claude-opus-4-7
n: 76,475
nulls: 74,863 (97.9%)
unique: 1,444
len_min: 5
len_max: 575
len_mean: 23.57
len_median: 17
len_p95: 53
word_mean: 2.818
word_median: 2
n_empty: 0
n_duplicates: 168
duplicate_rate: 0.1042
vocab_size: 3,810
readability_flesch_mean: 119.7
emoji_rate: 0
url_rate: 0
one_word_rate: 0.3908
allcaps_rate: 0
boilerplate_rate: 0
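Splitting the cell on whitespace yields one (row ID, example ID) pair per reference, which is the shape needed for a join to an examples table. This is a small sketch; the sample rows and the `igt-` numbers in them are made up.

```python
def explode_examples(rows):
    """Return (ID, example_id) pairs, one per whitespace-separated igt id."""
    pairs = []
    for row in rows:
        cell = row.get("Example_ID") or ""
        for example_id in cell.split():
            pairs.append((row["ID"], example_id))
    return pairs


# Made-up rows: one with two examples, one with none, one with a single example.
rows = [
    {"ID": "1a-aaa", "Example_ID": "igt-0001 igt-0002"},
    {"ID": "1a-bbb", "Example_ID": ""},
    {"ID": "2a-aaa", "Example_ID": "igt-0001"},
]
pairs = explode_examples(rows)
```

Rows with an empty cell contribute no pairs, which matches the expectation that most rows have no example to join.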