language-data-wals_values · saturn notebook

Overview

Source: /home/coolhand/datasets/language-data/wals_values.csv

Saturn profiled 76,475 rows across 8 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/datasets/language-data/wals_values.csv",
    "--findings", "language-data-wals_values.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset is a 76,475-row export of WALS (World Atlas of Language Structures) values, with 8 columns covering language identifiers, typological parameters, coded values, sources, and comments. The core analytical fields are Parameter_ID (192 typological features, top being '83A' at ~2% of rows) and Value (a small numeric coding scheme with 28 distinct values, median 2 and a long right tail up to 28). Two things deserve a closer look first: the Value column is highly skewed (skew 3.49, kurtosis 16.4, ~3.2% outliers), which matters if you plan to aggregate it numerically; and Comment is 96.9% null and, where present, is mostly HTML markup in mixed languages (English and Chinese dominate), so it needs cleaning before any text analysis. Language_ID spans 2,660 unique codes fairly evenly (top 'eng' has just 159 rows), confirming broad cross-linguistic coverage.

citing: Value.stats.skew · Value.stats.kurtosis · Value.n_unique · Value.stats.outlier_rate · Parameter_ID.n_unique · Parameter_ID.top_value · Parameter_ID.stats.top_rate · Comment.null_rate · Comment.language_counts · Language_ID.n_unique · Code_ID.n_unique · Source.n_unique · row_count

Out[4]:

saturn.schema() · 8 columns

column	kind	n	null%	unique	alerts
ID	text	76,475	0.0%	76,475	near_unique one_word short_text
Language_ID	text	76,475	0.0%	2,660	one_word short_text duplicates
Parameter_ID	categorical	76,475	0.0%	192
Value	numeric	76,475	0.0%	28	high_skew
Code_ID	text	76,475	0.0%	1,139	one_word allcaps short_text duplicates
Comment	text	76,475	96.9%	2,068	multilingual null_rate
Source	text	76,475	9.3%	29,715	one_word duplicates
Example_ID	text	76,475	97.9%	1,444	one_word null_rate

Fig 1.

Value · Check the heavy right skew — most values are 1-4 but the tail stretches to 28.

Histogram bins for Value (median: 2.0).
bin	count
1 – 1.675	27379
1.675 – 2.35	21173
2.35 – 3.025	6771
3.025 – 3.7	0
3.7 – 4.375	9917
4.375 – 5.05	3602
5.05 – 5.725	0
5.725 – 6.4	2730
6.4 – 7.075	1392
7.075 – 7.75	0
7.75 – 8.425	1042
8.425 – 9.1	597
9.1 – 9.775	0
9.775 – 10.45	32
10.45 – 11.12	265
11.12 – 11.8	0
11.8 – 12.48	43
12.48 – 13.15	68
13.15 – 13.83	0
13.83 – 14.5	251
14.5 – 15.18	190
15.18 – 15.85	0
15.85 – 16.52	156
16.52 – 17.2	44
17.2 – 17.88	0
17.88 – 18.55	127
18.55 – 19.23	83
19.23 – 19.9	0
19.9 – 20.58	358
20.58 – 21.25	202
21.25 – 21.93	0
21.93 – 22.6	21
22.6 – 23.28	18
23.28 – 23.95	0
23.95 – 24.62	3
24.62 – 25.3	3
25.3 – 25.98	0
25.98 – 26.65	4
26.65 – 27.33	3
27.33 – 28	1

Fig 2.

Parameter_ID · See which of the 192 typological features are most represented; the top ones cluster around 1,300-1,500 rows.

Top values for Parameter_ID (20 unique shown, of 192 total).
value	count	share
83A	1518	2.0%
82A	1496	2.0%
81A	1376	1.8%
87A	1367	1.8%
143A	1325	1.7%
143E	1325	1.7%
143F	1325	1.7%
143G	1325	1.7%
97A	1316	1.7%
86A	1249	1.6%
88A	1225	1.6%
144A	1190	1.6%
85A	1184	1.5%
112A	1157	1.5%
89A	1154	1.5%
95A	1142	1.5%
69A	1131	1.5%
33A	1066	1.4%
51A	1031	1.3%
26A	969	1.3%

Fig 3.

Language_ID · Confirms broad coverage across 2,660 languages with no single language dominating.

Character-length distribution for Language_ID (mean: 2.995527950310559).
chars	count
2	342
3	76133

Fig 4.

Code_ID · Shows which parameter-value codes (e.g. '143G-4', '82A-1') appear most frequently.

Character-length distribution for Code_ID (mean: 5.300150375939849).
chars	count
4	4995
5	45403
6	24205
7	1872

Fig 5.

Comment · Among the 3% of rows with comments, lengths vary wildly (median 196, max 15,127) and contain HTML — flag for cleanup.

Character-length distribution for Comment (mean: 372.2237673830594).
chars	count
36 – 413	1870
413 – 791	361
791 – 1168	44
1168 – 1545	25
1545 – 1922	12
1922 – 2300	7
2300 – 2677	5
2677 – 3054	9
3054 – 3431	8
3431 – 3809	7
3809 – 4186	0
4186 – 4563	2
4563 – 4941	6
4941 – 5318	5
5318 – 5695	0
5695 – 6072	4
6072 – 6450	2
6450 – 6827	3
6827 – 7204	0
7204 – 7582	0
7582 – 7959	0
7959 – 8336	0
8336 – 8713	1
8713 – 9091	0
9091 – 9468	0
9468 – 9845	1
9845 – 10222	0
10222 – 10600	0
10600 – 10977	0
10977 – 11354	0
11354 – 11732	0
11732 – 12109	0
12109 – 12486	0
12486 – 12863	0
12863 – 13241	0
13241 – 13618	0
13618 – 13995	0
13995 – 14372	0
14372 – 14750	0
14750 – 15127	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

column	kind	null %
ID	text	0.0%
Language_ID	text	0.0%
Parameter_ID	categorical	0.0%
Value	numeric	0.0%
Code_ID	text	0.0%
Comment	text	96.9%
Source	text	9.3%
Example_ID	text	97.9%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 134 detected strings).
lang	count	share
en	64	47.8%
zh	53	39.6%
fa	11	8.2%
ar	3	2.2%
fr	1	0.7%
ru	1	0.7%
kk	1	0.7%

ID text identifier

This is a row identifier: every one of the 76,475 values is unique, non-null, and a single token of 5-8 characters. Sample values like '26a-mar' and '114a-yko' look like a lowercased Parameter_ID joined by a hyphen to a Language_ID, which suggests a structured composite key rather than a random hash. With no duplicates or empty strings, it is safe to use as a primary key.

Treatment: Use as primary key for joins; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high
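
A minimal sketch of how the composite key might be split back into its parts. It assumes, without confirmation from the stats, that the text before the first hyphen is a lowercased Parameter_ID and the rest is a Language_ID; the function name `split_wals_id` is hypothetical.

```python
def split_wals_id(row_id: str) -> tuple[str, str]:
    """Split an ID such as '26a-mar' into (parameter, language).

    Assumption: the part before the first hyphen is a lowercased
    Parameter_ID and the remainder is the Language_ID.
    """
    parameter, sep, language = row_id.partition("-")
    if not sep or not parameter or not language:
        raise ValueError(f"unexpected ID shape: {row_id!r}")
    return parameter.upper(), language


print(split_wals_id("26a-mar"))   # ('26A', 'mar')
print(split_wals_id("114a-yko"))  # ('114A', 'yko')
```

If the assumption holds, the two parts should equal the row's own Parameter_ID and Language_ID, which gives a cheap integrity check before using ID as a join key.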

Out[13]:

saturn.columns["ID"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	76,475
len_min	5
len_max	8
len_mean	7.271
len_median	7
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	20,000
readability_flesch_mean	88.65
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for ID.

Character-length distribution for ID (mean: 7.271199738476627).
chars	count
5	32
6	5156
7	45327
8	25960

Language_ID text foreign_key

Two- or three-letter codes (len_min 2, len_max 3, len_mean 2.996, one_word_rate 1.0) with 2,660 distinct values across 76,475 rows. They resemble ISO 639-style language identifiers, but the 342 two-character values mean they are not strictly ISO 639-3, so confirm the scheme against a language lookup table before joining. The distribution is remarkably flat: the top code 'eng' appears only 159 times and the next nine codes are within 6 counts of it, so the duplicate_rate of 0.965 reflects a wide catalogue rather than a dominant language. No nulls or empties.

Treatment: left-join on this id to a language lookup, or one-hot only after collapsing rare codes.

anthropic:claude-opus-4-7 · confidence high
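
The treatment above can be sketched with the standard library alone. The rows, the lookup table, and the `OTHER` bucket name are illustrative stand-ins, not part of the dataset or of Saturn.

```python
from collections import Counter

# Toy rows standing in for the real file.
rows = [
    {"ID": "26a-eng", "Language_ID": "eng"},
    {"ID": "81a-eng", "Language_ID": "eng"},
    {"ID": "26a-mar", "Language_ID": "mar"},
    {"ID": "114a-yko", "Language_ID": "yko"},
]
# Hypothetical lookup table, deliberately incomplete.
language_lookup = {"eng": "English"}

# Left join: keep every row; unmatched codes get None instead of being dropped.
joined = [
    {**row, "Language_Name": language_lookup.get(row["Language_ID"])}
    for row in rows
]


def collapse_rare(codes, min_count):
    """Replace codes seen fewer than min_count times with 'OTHER'."""
    counts = Counter(codes)
    return [code if counts[code] >= min_count else "OTHER" for code in codes]


collapsed = collapse_rare([row["Language_ID"] for row in rows], min_count=2)
print(collapsed)  # ['eng', 'eng', 'OTHER', 'OTHER']
```

With 2,660 codes and a top count of 159, any realistic `min_count` will fold a large share of languages into the catch-all bucket, so the threshold is worth choosing against the actual count distribution.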

Out[16]:

saturn.columns["Language_ID"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	2,660
len_min	2
len_max	3
len_mean	2.996
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	73,815
duplicate_rate	0.9652
vocab_size	2,238
readability_flesch_mean	117.7
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	96.5% duplicate strings

Fig 9.

Character-length distribution for Language_ID.

Character-length distribution for Language_ID (mean: 2.995527950310559).
chars	count
2	342
3	76133

Parameter_ID categorical foreign_key

Parameter_ID is a categorical code with 192 distinct values across 76,475 rows and no nulls. The distribution is nearly uniform (entropy ratio 0.94), with the top value '83A' covering only 1.98% of rows; several '143x' codes share an identical count of 1,325, suggesting paired or co-emitted parameters.

Treatment: left-join on this id to a parameter lookup table; one-hot only if cardinality is reduced first.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["Parameter_ID"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	192
top_value	83A
top_rate	0.01985
cardinality	192
entropy	7.103
entropy_ratio	0.9365
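
For readers unfamiliar with the entropy ratio, here is a small sketch of one way it can be computed. It assumes entropy is measured in bits and normalised by log2 of the cardinality; that reading is consistent with the reported figures but is not confirmed as Saturn's implementation.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy in bits divided by log2(number of distinct values)."""
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


# Reproducing the reported ratio from the table's own numbers.
reported = 7.103 / math.log2(192)
print(round(reported, 4))  # 0.9365

# A uniform column scores 1.0; a lopsided one scores much lower.
uniform = entropy_ratio(["a", "b", "c", "d"])
lopsided = entropy_ratio(["a"] * 97 + ["b", "c", "d"])
print(uniform, lopsided)
```

A ratio of 0.94 therefore means the 192 parameters are close to, but not exactly, evenly represented.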

Fig 10.

Top values for Parameter_ID.

Top values for Parameter_ID (20 unique shown, of 192 total).
value	count	share
83A	1518	2.0%
82A	1496	2.0%
81A	1376	1.8%
87A	1367	1.8%
143A	1325	1.7%
143E	1325	1.7%
143F	1325	1.7%
143G	1325	1.7%
97A	1316	1.7%
86A	1249	1.6%
88A	1225	1.6%
144A	1190	1.6%
85A	1184	1.5%
112A	1157	1.5%
89A	1154	1.5%
95A	1142	1.5%
69A	1131	1.5%
33A	1066	1.4%
51A	1031	1.3%
26A	969	1.3%

Value numeric feature

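Small-integer coded field bounded between 1 and 28, with only 28 distinct values across 76,475 rows. The distribution is tightly packed (median 2, IQR 3) but has a heavy right tail (skew 3.49, kurtosis 16.36), producing 2,469 outliers (3.2%). There are no nulls or zeros. Because Code_ID values such as '143G-4' appear to pair a Parameter_ID with this number, Value is probably a per-parameter category code rather than a count or rating, in which case its numeric summary statistics say little about the underlying features.

Treatment: If you treat Value as numeric, log1p-transform or cap the upper tail before modelling to tame the skew; if it is a category code, encode it categorically within each Parameter_ID instead.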

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["Value"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	28
min	1
max	28
mean	2.854
median	2
std	2.824
q1	1
q3	4
iqr	3
skew	3.493
kurtosis	16.36
n_outliers	2,469
outlier_rate	0.03229
zero_rate	0
alert: high_skew	skew=+3.49
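
The outlier count can be reproduced from the report's own numbers. The sketch below assumes the usual 1.5 x IQR (Tukey) fences, which happens to match the reported n_outliers exactly; the per-value counts are read off the Value histogram, where each non-empty bin holds a single integer.

```python
# Per-value counts read off the Value histogram (one integer per non-empty bin).
value_counts = {
    1: 27379, 2: 21173, 3: 6771, 4: 9917, 5: 3602, 6: 2730, 7: 1392,
    8: 1042, 9: 597, 10: 32, 11: 265, 12: 43, 13: 68, 14: 251,
    15: 190, 16: 156, 17: 44, 18: 127, 19: 83, 20: 358, 21: 202,
    22: 21, 23: 18, 24: 3, 25: 3, 26: 4, 27: 3, 28: 1,
}

q1, q3 = 1, 4                      # quartiles from the stats table
iqr = q3 - q1
lower = q1 - 1.5 * iqr             # -3.5, so no low outliers are possible
upper = q3 + 1.5 * iqr             # 8.5, so values of 9 and above are flagged

n = sum(value_counts.values())
n_outliers = sum(c for v, c in value_counts.items() if v < lower or v > upper)
print(n, n_outliers, round(n_outliers / n, 5))  # 76475 2469 0.03229
```

Every flagged row is simply a code of 9 or higher, which is worth keeping in mind before capping or transforming the column.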

Fig 11.

Distribution of Value. Vertical dash marks the median.

Histogram bins for Value (median: 2.0).
bin	count
1 – 1.675	27379
1.675 – 2.35	21173
2.35 – 3.025	6771
3.025 – 3.7	0
3.7 – 4.375	9917
4.375 – 5.05	3602
5.05 – 5.725	0
5.725 – 6.4	2730
6.4 – 7.075	1392
7.075 – 7.75	0
7.75 – 8.425	1042
8.425 – 9.1	597
9.1 – 9.775	0
9.775 – 10.45	32
10.45 – 11.12	265
11.12 – 11.8	0
11.8 – 12.48	43
12.48 – 13.15	68
13.15 – 13.83	0
13.83 – 14.5	251
14.5 – 15.18	190
15.18 – 15.85	0
15.85 – 16.52	156
16.52 – 17.2	44
17.2 – 17.88	0
17.88 – 18.55	127
18.55 – 19.23	83
19.23 – 19.9	0
19.9 – 20.58	358
20.58 – 21.25	202
21.25 – 21.93	0
21.93 – 22.6	21
22.6 – 23.28	18
23.28 – 23.95	0
23.95 – 24.62	3
24.62 – 25.3	3
25.3 – 25.98	0
25.98 – 26.65	4
26.65 – 27.33	3
27.33 – 28	1

Code_ID text foreign_key

Code_ID holds short uppercase alphanumeric codes (length 4-7, always one word) drawn from a fixed vocabulary of 1,139 distinct values across 76,475 rows. The 98.5% duplicate rate is expected for a categorical key, with the most common code '143G-4' appearing 1,315 times. The pattern (digits + letter + hyphen + digits, e.g. '82A-1') looks like a Parameter_ID followed by a Value, which suggests a structured taxonomy code rather than a unique row identifier.

Treatment: Treat as a categorical foreign key; left-join to a code lookup table or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
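
Before joining to a code lookup table, it may be worth checking that each Code_ID agrees with the row's other fields. The sketch below assumes the code is Parameter_ID, a hyphen, then Value; the regular expression and helper names are illustrative.

```python
import re

# Assumed shape: digits, one uppercase letter, a hyphen, digits (e.g. '143G-4').
CODE_RE = re.compile(r"^(\d+[A-Z])-(\d+)$")


def parse_code_id(code_id):
    """Split '143G-4' into ('143G', 4); raise on any other shape."""
    match = CODE_RE.match(code_id)
    if match is None:
        raise ValueError(f"unexpected Code_ID shape: {code_id!r}")
    return match.group(1), int(match.group(2))


def code_matches_row(row):
    """True when Code_ID equals Parameter_ID + '-' + Value for this row."""
    parameter, value = parse_code_id(row["Code_ID"])
    return parameter == row["Parameter_ID"] and value == int(row["Value"])


row = {"Parameter_ID": "143G", "Value": "4", "Code_ID": "143G-4"}
print(parse_code_id("82A-1"))  # ('82A', 1)
print(code_matches_row(row))   # True
```

If every row passes, Code_ID carries no information beyond Parameter_ID and Value and can be dropped from a feature set.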

Out[25]:

saturn.columns["Code_ID"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	1,139
len_min	4
len_max	7
len_mean	5.3
len_median	5
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	75,336
duplicate_rate	0.9851
vocab_size	911
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	98.5% duplicate strings

Fig 12.

Character-length distribution for Code_ID.

Character-length distribution for Code_ID (mean: 5.300150375939849).
chars	count
4	4995
5	45403
6	24205
7	1872

Comment text free_text

Free-text commentary field carrying HTML-formatted lexicographic examples (e.g. `čaj`), with content spanning at least 7 languages in the detection sample, including English (64), Chinese (53) and Persian (11). It is overwhelmingly empty: null_rate is 0.969 across 76,475 rows, leaving 2,373 populated rows with 2,068 distinct values and a 0.129 duplicate_rate among them. The negative Flesch mean (-127.13) and HTML-dominated top_words indicate these are markup fragments rather than natural prose, and lengths are highly skewed (median 196, max 15,127).

Treatment: Strip HTML, then tokenize per-language before any embedding; given 96.9% nulls, treat presence as a sparse feature.

anthropic:claude-opus-4-7 · confidence high
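
A minimal sketch of the cleanup step, using only the standard library's `html.parser`. The sample markup is made up for illustration (the original tags are not shown in this report), and `strip_html` and `has_comment` are hypothetical helper names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and discard tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


def has_comment(value):
    """Sparse presence feature: True only for non-null, non-blank comments."""
    return value is not None and value.strip() != ""


print(strip_html("<p><i>čaj</i> 'tea' &amp; more</p>"))  # čaj 'tea' & more
```

Language detection and tokenization would then run on the stripped text rather than on the raw markup.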

Out[28]:

saturn.columns["Comment"].stats

stat	value
n	76,475
nulls	74,102 (96.9%)
unique	2,068
len_min	36
len_max	15,127
len_mean	372.2
len_median	196
len_p95	917
word_mean	49.54
word_median	8
n_empty	0
n_duplicates	305
duplicate_rate	0.1285
vocab_size	7,544
readability_flesch_mean	-127.1
emoji_rate	0
url_rate	0.001686
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	8 languages detected in sample
alert: null_rate	96.9% null
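
The duplicate figures in these tables appear to be computed over populated (non-null) rows only: n_duplicates equals populated rows minus unique values, and duplicate_rate divides that by the populated count. The sketch below checks this reading against the numbers reported for several columns; the helper name is illustrative.

```python
# (n, nulls, unique, reported duplicate_rate) copied from the stats tables.
columns = {
    "Language_ID": (76475, 0, 2660, 0.9652),
    "Code_ID": (76475, 0, 1139, 0.9851),
    "Comment": (76475, 74102, 2068, 0.1285),
    "Source": (76475, 7092, 29715, 0.5717),
    "Example_ID": (76475, 74863, 1444, 0.1042),
}


def duplicate_stats(n, nulls, unique):
    """Return (populated rows, duplicate rows, duplicate rate)."""
    populated = n - nulls
    n_duplicates = populated - unique
    return populated, n_duplicates, n_duplicates / populated


for name, (n, nulls, unique, reported) in columns.items():
    populated, n_dup, rate = duplicate_stats(n, nulls, unique)
    print(f"{name}: populated={populated} duplicates={n_dup} "
          f"rate={rate:.4f} reported={reported}")
```

For Comment this gives 2,373 populated rows and 305 duplicates, matching the table.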

Fig 13.

Character-length distribution for Comment.

Character-length distribution for Comment (mean: 372.2237673830594).
chars	count
36 – 413	1870
413 – 791	361
791 – 1168	44
1168 – 1545	25
1545 – 1922	12
1922 – 2300	7
2300 – 2677	5
2677 – 3054	9
3054 – 3431	8
3431 – 3809	7
3809 – 4186	0
4186 – 4563	2
4563 – 4941	6
4941 – 5318	5
5318 – 5695	0
5695 – 6072	4
6072 – 6450	2
6450 – 6827	3
6827 – 7204	0
7204 – 7582	0
7582 – 7959	0
7959 – 8336	0
8336 – 8713	1
8713 – 9091	0
9091 – 9468	0
9468 – 9845	1
9845 – 10222	0
10222 – 10600	0
10600 – 10977	0
10977 – 11354	0
11354 – 11732	0
11732 – 12109	0
12109 – 12486	0
12486 – 12863	0
12863 – 13241	0
13241 – 13618	0
13618 – 13995	0
13995 – 14372	0
14372 – 14750	0
14750 – 15127	1

Source text metadata

This column holds bibliographic source citations in Author-Year slug form (e.g. 'Bybee-et-al-1994', 'Huber-and-Reed-1992'), with 83% being a single token and a mean of 1.24 words. Of the 69,383 populated rows, 29,715 values are distinct and 57% repeat an earlier value; 9.3% of all rows are null. References are therefore reused often, although the set of distinct citations is still large. The 'passim]' token (386) and stray '112]' suggest page-locator fragments leaked in from inconsistent citation formatting.

Treatment: Normalize citation slugs (strip '[passim]'/page fragments) and treat as a categorical reference key.

anthropic:claude-opus-4-7 · confidence high
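
One possible normalisation, sketched under two assumptions that the stats only hint at: page locators sit in square brackets after a slug, and multiple citations in one cell are separated by semicolons. The combined sample string and the function name are hypothetical.

```python
import re

# Bracketed page locator, e.g. '[112]' or '[passim]' (assumed format).
_LOCATOR_RE = re.compile(r"\[[^\]]*\]")


def normalize_sources(raw):
    """Return the bare citation slugs in a Source cell, locators removed.

    Assumptions: citations are separated by ';' and page locators
    appear in square brackets after the slug.
    """
    if raw is None:
        return []
    slugs = []
    for part in raw.split(";"):
        slug = _LOCATOR_RE.sub("", part).strip()
        if slug and slug not in slugs:  # drop blanks and repeated slugs
            slugs.append(slug)
    return slugs


print(normalize_sources("Bybee-et-al-1994[112];Huber-and-Reed-1992[passim]"))
# ['Bybee-et-al-1994', 'Huber-and-Reed-1992']
```

Inspect a sample of raw Source values first; if the real separators or locator syntax differ, the pattern needs adjusting.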

Out[31]:

saturn.columns["Source"].stats

stat	value
n	76,475
nulls	7,092 (9.3%)
unique	29,715
len_min	7
len_max	165
len_mean	21.86
len_median	19
len_p95	44
word_mean	1.239
word_median	1
n_empty	0
n_duplicates	39,668
duplicate_rate	0.5717
vocab_size	13,953
readability_flesch_mean	-9.644
emoji_rate	0
url_rate	0
one_word_rate	0.8318
allcaps_rate	5.765e-05
boilerplate_rate	0
alert: one_word	83.2% rows are a single word
alert: duplicates	57.2% duplicate strings

Fig 14.

Character-length distribution for Source.

Character-length distribution for Source (mean: 21.858510009656545).
chars	count
7 – 11	3154
11 – 15	11217
15 – 19	18520
19 – 23	14181
23 – 27	7564
27 – 31	4653
31 – 35	3099
35 – 39	1849
39 – 43	1332
43 – 46	1123
46 – 50	620
50 – 54	570
54 – 58	357
58 – 62	308
62 – 66	166
66 – 70	168
70 – 74	141
74 – 78	40
78 – 82	81
82 – 86	42
86 – 90	35
90 – 94	38
94 – 98	18
98 – 102	24
102 – 106	34
106 – 110	21
110 – 114	5
114 – 118	4
118 – 122	9
122 – 126	0
126 – 129	1
129 – 133	2
133 – 137	1
137 – 141	3
141 – 145	0
145 – 149	0
149 – 153	0
153 – 157	1
157 – 161	1
161 – 165	1

Example_ID text foreign_key

This column holds space-separated lists of `igt-####` example identifiers, likely linking each row to one or more interlinear glossed text examples. It is overwhelmingly empty (97.89% null): only 1,612 of 76,475 rows carry a value, of which 1,444 are distinct and 39% are a single token. Although individual IDs look unique, 168 rows repeat an earlier string (10.4% duplicate rate), so the same example bundle is referenced by multiple rows.

Treatment: Split on whitespace and left-join each igt-id to the examples table; expect most rows to have no match.

anthropic:claude-opus-4-7 · confidence high
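
The split step in the treatment above can be sketched as a generator that emits one pair per example identifier, ready to join against an examples table. The sample rows and the `igt-` numbers are made up for illustration.

```python
def explode_example_ids(rows):
    """Yield one (row ID, example ID) pair per whitespace-separated token."""
    for row in rows:
        cell = row.get("Example_ID")
        if not cell:  # None or empty string: no linked examples
            continue
        for example_id in cell.split():
            yield row["ID"], example_id


rows = [
    {"ID": "26a-mar", "Example_ID": "igt-0001 igt-0002"},  # made-up example IDs
    {"ID": "114a-yko", "Example_ID": None},
]
pairs = list(explode_example_ids(rows))
print(pairs)  # [('26a-mar', 'igt-0001'), ('26a-mar', 'igt-0002')]
```

Rows without examples produce no pairs, so a left join from the values table back to these pairs keeps them with empty example fields.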

Out[34]:

saturn.columns["Example_ID"].stats

stat	value
n	76,475
nulls	74,863 (97.9%)
unique	1,444
len_min	5
len_max	575
len_mean	23.57
len_median	17
len_p95	53
word_mean	2.818
word_median	2
n_empty	0
n_duplicates	168
duplicate_rate	0.1042
vocab_size	3,810
readability_flesch_mean	119.7
emoji_rate	0
url_rate	0
one_word_rate	0.3908
allcaps_rate	0
boilerplate_rate	0
alert: one_word	39.1% rows are a single word
alert: null_rate	97.9% null

Fig 15.

Character-length distribution for Example_ID.

Character-length distribution for Example_ID (mean: 23.574441687344912).
chars	count
5 – 19	1053
19 – 34	180
34 – 48	293
48 – 62	24
62 – 76	20
76 – 90	8
90 – 105	3
105 – 119	5
119 – 133	2
133 – 148	4
148 – 162	1
162 – 176	3
176 – 190	3
190 – 204	3
204 – 219	1
219 – 233	1
233 – 247	0
247 – 262	3
262 – 276	0
276 – 290	1
290 – 304	1
304 – 318	0
318 – 333	1
333 – 347	0
347 – 361	0
361 – 376	0
376 – 390	1
390 – 404	0
404 – 418	0
418 – 432	0
432 – 447	0
447 – 461	0
461 – 475	0
475 – 490	0
490 – 504	0
504 – 518	0
518 – 532	0
532 – 546	0
546 – 561	0
561 – 575	1
