saturn·

language data wals values

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/datasets/language-data/wals_values.csv

Saturn profiled 76,475 rows across 8 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
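# Install the profiler (notebook shell escape), then call its CLI from Python.
!pip install saturn-dissect
import subprocess

# Flags exactly as used for this report: the findings file name and the model for the opt-in prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/datasets/language-data/wals_values.csv",
    "--findings", "language-data-wals_values.json",
    "--llm", "anthropic:claude-opus-4-7",
])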

Summary confidence: high

This dataset is a 76,475-row export of WALS (World Atlas of Language Structures) values, with 8 columns covering language identifiers, typological parameters, coded values, sources, and comments. The core analytical fields are Parameter_ID (192 typological features, top being '83A' at ~2% of rows) and Value (a small numeric coding scheme with 28 distinct values, median 2 and a long right tail up to 28). Two things deserve a closer look first: the Value column is highly skewed (skew 3.49, kurtosis 16.4, ~3.2% outliers), which matters if you plan to aggregate it numerically; and Comment is 96.9% null and, where present, is mostly HTML markup in mixed languages (English and Chinese dominate), so it needs cleaning before any text analysis. Language_ID spans 2,660 unique codes fairly evenly (top 'eng' has just 159 rows), confirming broad cross-linguistic coverage.

citing: Value.stats.skew · Value.stats.kurtosis · Value.n_unique · Value.stats.outlier_rate · Parameter_ID.n_unique · Parameter_ID.top_value · Parameter_ID.stats.top_rate · Comment.null_rate · Comment.language_counts · Language_ID.n_unique · Code_ID.n_unique · Source.n_unique · row_count

Out[4]:

saturn.schema() · 8 columns

column        kind         n       null%  unique  alerts
ID            text         76,475  0.0%   76,475  near_unique one_word short_text
Language_ID   text         76,475  0.0%   2,660   one_word short_text duplicates
Parameter_ID  categorical  76,475  0.0%   192
Value         numeric      76,475  0.0%   28      high_skew
Code_ID       text         76,475  0.0%   1,139   one_word allcaps short_text duplicates
Comment       text         76,475  96.9%  2,068   multilingual null_rate
Source        text         76,475  9.3%   29,715  one_word duplicates
Example_ID    text         76,475  97.9%  1,444   one_word null_rate
Fig 1.
Value · Check the heavy right skew — most values are 1-4 but the tail stretches to 28.
Histogram bins for Value (median: 2.0); the 12 empty bins that fall between integer values are omitted.
bin              count
1 – 1.675        27379
1.675 – 2.35     21173
2.35 – 3.025     6771
3.7 – 4.375      9917
4.375 – 5.05     3602
5.725 – 6.4      2730
6.4 – 7.075      1392
7.75 – 8.425     1042
8.425 – 9.1      597
9.775 – 10.45    32
10.45 – 11.12    265
11.8 – 12.48     43
12.48 – 13.15    68
13.83 – 14.5     251
14.5 – 15.18     190
15.85 – 16.52    156
16.52 – 17.2     44
17.88 – 18.55    127
18.55 – 19.23    83
19.9 – 20.58     358
20.58 – 21.25    202
21.93 – 22.6     21
22.6 – 23.28     18
23.95 – 24.62    3
24.62 – 25.3     3
25.98 – 26.65    4
26.65 – 27.33    3
27.33 – 28       1
Fig 2.
Parameter_ID · See which of the 192 typological features are most represented; the top ones cluster around 1,300-1,500 rows.
Top values for Parameter_ID (20 unique shown, of 192 total).
value  count  share
83A    1518   2.0%
82A    1496   2.0%
81A    1376   1.8%
87A    1367   1.8%
143A   1325   1.7%
143E   1325   1.7%
143F   1325   1.7%
143G   1325   1.7%
97A    1316   1.7%
86A    1249   1.6%
88A    1225   1.6%
144A   1190   1.6%
85A    1184   1.5%
112A   1157   1.5%
89A    1154   1.5%
95A    1142   1.5%
69A    1131   1.5%
33A    1066   1.4%
51A    1031   1.3%
26A    969    1.3%
Fig 3.
Language_ID · Nearly all codes are three characters long; coverage spans 2,660 languages with no single language dominating.
Character-length distribution for Language_ID (mean: 2.995527950310559); the 38 empty bins are omitted.
chars  count
2      342
3      76133
Fig 4.
Code_ID · Lengths of the parameter-value codes (e.g. '143G-4', '82A-1'); most are five or six characters long.
Character-length distribution for Code_ID (mean: 5.300150375939849); the 36 empty bins are omitted.
chars  count
4      4995
5      45403
6      24205
7      1872
Fig 5.
Comment · Among the 3% of rows with comments, lengths vary wildly (median 196, max 15,127) and contain HTML — flag for cleanup.
Character-length distribution for Comment (mean: 372.2237673830594); the 21 empty bins are omitted.
chars            count
36 – 413         1870
413 – 791        361
791 – 1168       44
1168 – 1545      25
1545 – 1922      12
1922 – 2300      7
2300 – 2677      5
2677 – 3054      9
3054 – 3431      8
3431 – 3809      7
4186 – 4563      2
4563 – 4941      6
4941 – 5318      5
5695 – 6072      4
6072 – 6450      2
6450 – 6827      3
8336 – 8713      1
9468 – 9845      1
14750 – 15127    1
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column        kind         null %
ID            text         0.0%
Language_ID   text         0.0%
Parameter_ID  categorical  0.0%
Value         numeric      0.0%
Code_ID       text         0.0%
Comment       text         96.9%
Source        text         9.3%
Example_ID    text         97.9%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 134 detected strings).
lang  count  share
en    64     47.8%
zh    53     39.6%
fa    11     8.2%
ar    3      2.2%
fr    1      0.7%
ru    1      0.7%
kk    1      0.7%

ID text identifier

This is a row identifier: every one of the 76,475 values is unique, non-null, and a single token of 5-8 characters. Sample values like '26a-mar' and '114a-yko' follow a consistent '<number><letter>-<short suffix>' pattern, which looks like a Parameter_ID joined to a Language_ID and suggests a structured composite key rather than a random hash. No duplicates or empties, so it is safe to use as a primary key.

Treatment: Use as primary key for joins; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high
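
One way to test the composite-key reading is to split each ID at its hyphen and compare the halves with the row's other columns. The sketch below assumes an ID is a Parameter_ID and a Language_ID joined by a hyphen, with the parameter part compared case-insensitively. That layout is a guess from the two sample values, and the rows are hypothetical stand-ins, not data from the file.

```python
def split_id(row_id: str) -> tuple[str, str]:
    """Split an ID such as '26a-mar' at its first hyphen."""
    parameter_part, _, language_part = row_id.partition("-")
    return parameter_part, language_part


def id_matches_columns(row: dict) -> bool:
    """True when the ID agrees with Parameter_ID and Language_ID (assumed layout)."""
    parameter_part, language_part = split_id(row["ID"])
    return (
        parameter_part.upper() == row["Parameter_ID"].upper()
        and language_part == row["Language_ID"]
    )


# Hypothetical rows built around the sample IDs quoted in the report.
rows = [
    {"ID": "26a-mar", "Parameter_ID": "26A", "Language_ID": "mar"},
    {"ID": "114a-yko", "Parameter_ID": "114A", "Language_ID": "yko"},
]

ids = [row["ID"] for row in rows]
is_primary_key = len(set(ids)) == len(ids)          # no duplicates
mismatches = [row["ID"] for row in rows if not id_matches_columns(row)]
```

If `mismatches` stays empty on the full file, the ID carries no information beyond the two columns it is built from, which supports excluding it from modelling features.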
Out[13]:

saturn.columns["ID"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique 76,475
len_min 5
len_max 8
len_mean 7.271
len_median 7
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 20,000
readability_flesch_mean 88.65
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
Fig 8.
Character-length distribution for ID.
Character-length distribution for ID (mean: 7.271199738476627); the 36 empty bins are omitted.
chars  count
5      32
6      5156
7      45327
8      25960

Language_ID text foreign_key

Language_ID holds short codes, almost all three letters long (len_mean 2.996, len_max 3, one_word_rate 1.0; 342 rows carry two-letter codes), with 2,660 distinct values across 76,475 rows. They resemble ISO 639-style identifiers, but WALS assigns its own language codes, so confirm the mapping against a language table before joining on ISO codes. The distribution is remarkably flat: the top code 'eng' appears only 159 times and the next nine codes are within 6 counts of it, so the duplicate_rate of 0.965 reflects a wide catalogue rather than a dominant language. No nulls or empties.

Treatment: left-join on this id to a language lookup, or one-hot only after collapsing rare codes.

anthropic:claude-opus-4-7 · confidence high
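
Collapsing rare codes before one-hot encoding can be sketched with a frequency count. The function name, the threshold and the `__other__` bucket label are illustrative choices, and the input list is a toy sample rather than rows from the file.

```python
from collections import Counter


def collapse_rare(codes, min_count, other="__other__"):
    """Replace codes seen fewer than min_count times with a shared bucket label."""
    counts = Counter(codes)
    return [code if counts[code] >= min_count else other for code in codes]


# Toy sample: 'eng' is frequent, the other two codes appear once each.
sample = ["eng", "eng", "eng", "mar", "yko"]
collapsed = collapse_rare(sample, min_count=2)
n_columns_after = len(set(collapsed))  # one-hot width after collapsing
```

Because the real distribution is flat (the top code has only 159 rows), any threshold will move a large share of rows into the bucket, so it is worth checking how many rows survive before committing to one-hot features.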
Out[16]:

saturn.columns["Language_ID"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique 2,660
len_min 2
len_max 3
len_mean 2.996
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 73,815
duplicate_rate 0.9652
vocab_size 2,238
readability_flesch_mean 117.7
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 96.5% duplicate strings
Fig 9.
Character-length distribution for Language_ID.

Parameter_ID categorical foreign_key

Parameter_ID is a categorical code with 192 distinct values across 76,475 rows and no nulls. The distribution is nearly uniform (entropy ratio 0.94), with the top value '83A' covering only 1.98% of rows; several '143x' codes share an identical count of 1,325, suggesting paired or co-emitted parameters.

Treatment: left-join on this id to a parameter lookup table; one-hot only if cardinality is reduced first.

anthropic:claude-opus-4-7 · confidence high
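
The entropy ratio quoted above can be reproduced from the reported entropy and cardinality. This sketch assumes Saturn reports Shannon entropy in bits and normalizes by log2 of the number of distinct values. The report does not state this, but the published figures (7.103 and 0.9365 for 192 categories) are consistent with it.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a single category carries no uncertainty
    return entropy_bits(counts) / math.log2(k)


# Check against the reported stats: entropy 7.103 over 192 categories.
reported_ratio = 7.103 / math.log2(192)

# A perfectly uniform toy distribution reaches the maximum ratio of 1.
uniform_ratio = entropy_ratio([5, 5, 5, 5])
```

A ratio near 1 means the categories are close to equally frequent, which is why no single parameter dominates the rows.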
Out[19]:

saturn.columns["Parameter_ID"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique 192
top_value 83A
top_rate 0.01985
cardinality 192
entropy 7.103
entropy_ratio 0.9365
Fig 10.
Top values for Parameter_ID.

Value numeric feature

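Small-integer field bounded between 1 and 28 with only 28 distinct values across 76,475 rows. The Code_ID pattern (for example '143G-4') suggests that Value is the index of a category within its Parameter_ID rather than a count or rating, so the same number can mean different things under different parameters. The distribution is tightly packed (median 2, IQR 3) but has a heavy right tail (skew 3.49, kurtosis 16.36), producing 2,469 outliers (3.2%). No nulls or zeros.

Treatment: If Value is used as a number, log1p-transform or cap the upper tail before modelling to tame the skew; if it is a category index, as the Code_ID pattern suggests, encode it per Parameter_ID instead.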

anthropic:claude-opus-4-7 · confidence high
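
The outlier count can be checked by hand from the Value histogram in this report. The sketch below assumes each non-empty histogram bin holds exactly one integer value, and applies Tukey's 1.5 × IQR fences. The report does not say which rule Saturn uses, but this one reproduces the published total, mean and outlier count.

```python
# Count per integer value, read off the non-empty histogram bins for Value.
value_counts = {
    1: 27379, 2: 21173, 3: 6771, 4: 9917, 5: 3602, 6: 2730, 7: 1392,
    8: 1042, 9: 597, 10: 32, 11: 265, 12: 43, 13: 68, 14: 251,
    15: 190, 16: 156, 17: 44, 18: 127, 19: 83, 20: 358, 21: 202,
    22: 21, 23: 18, 24: 3, 25: 3, 26: 4, 27: 3, 28: 1,
}

total = sum(value_counts.values())
mean = sum(value * count for value, count in value_counts.items()) / total

# Quartiles as reported in the stats table.
q1, q3 = 1, 4
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # -3.5, so no low outliers are possible
upper_fence = q3 + 1.5 * iqr   # 8.5, so values 9..28 count as outliers

n_outliers = sum(
    count for value, count in value_counts.items()
    if value < lower_fence or value > upper_fence
)
outlier_rate = n_outliers / total
```

Under this rule every row with a Value of 9 or more is flagged, so the "outliers" are simply the rows that use a high index.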
Out[22]:

saturn.columns["Value"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique 28
min 1
max 28
mean 2.854
median 2
std 2.824
q1 1
q3 4
iqr 3
skew 3.493
kurtosis 16.36
n_outliers 2,469
outlier_rate 0.03229
zero_rate 0
alert: high_skew · skew=+3.49
Fig 11.
Distribution of Value. Vertical dash marks the median.

Code_ID text foreign_key

Code_ID holds short uppercase alphanumeric codes (length 4-7, always one word) drawn from a fixed vocabulary of 1,139 distinct values across 76,475 rows. The 98.5% duplicate rate is expected for a categorical key, with the most common code '143G-4' appearing 1,315 times. The pattern (digits + letter + dash + digits) suggests a structured taxonomy code, most plausibly a Parameter_ID followed by a Value index, rather than a unique row identifier.

Treatment: Treat as a categorical foreign key; left-join to a code lookup table or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
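
If Code_ID really is a Parameter_ID followed by a Value index, that can be verified row by row before relying on it as a key. The regular expression and helper names below are assumptions based on the two sample codes quoted in this report, and the example row is hypothetical.

```python
import re

# Assumed layout: digits, one uppercase letter, a hyphen, then digits.
CODE_RE = re.compile(r"^(\d+[A-Z])-(\d+)$")


def parse_code_id(code_id):
    """Return (parameter, value_index) or None when the code does not fit."""
    match = CODE_RE.match(code_id)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def code_is_consistent(row):
    """True when Code_ID agrees with the row's Parameter_ID and Value."""
    return parse_code_id(row["Code_ID"]) == (row["Parameter_ID"], int(row["Value"]))


# Hypothetical row built from the sample code '143G-4'.
example_row = {"Code_ID": "143G-4", "Parameter_ID": "143G", "Value": "4"}
consistent = code_is_consistent(example_row)
```

Rows where `parse_code_id` returns None, or where the check fails, are the ones worth inspecting before a join to a code lookup table.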
Out[25]:

saturn.columns["Code_ID"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique 1,139
len_min 4
len_max 7
len_mean 5.3
len_median 5
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 75,336
duplicate_rate 0.9851
vocab_size 911
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 98.5% duplicate strings
Fig 12.
Character-length distribution for Code_ID.

Comment text free_text

Free-text commentary field carrying HTML-formatted lexicographic examples (e.g. the form čaj wrapped in markup), with content spanning at least 7 languages including English (64), Chinese (53) and Persian (11). It is overwhelmingly empty: null_rate is 0.969 across 76,475 rows, leaving 2,373 populated rows with 2,068 distinct values and a 0.129 duplicate_rate among them. The negative Flesch mean (-127.13) and HTML-dominated top_words confirm these are markup fragments rather than natural prose, and lengths are highly skewed (median 196, max 15,127).

Treatment: Strip HTML, then tokenize per-language before any embedding; given 96.9% nulls, treat presence as a sparse feature.

anthropic:claude-opus-4-7 · confidence high
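
The first half of the treatment, stripping markup and keeping comment presence as a sparse flag, can be sketched with the standard library alone. The sample markup is hypothetical, since the original tags around 'čaj' are not preserved in this report, and the helper names are illustrative.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and drop all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


def has_comment(raw):
    """Sparse presence flag: 1 when a comment has visible text, else 0."""
    return int(bool(raw) and bool(strip_html(raw)))


# Hypothetical comment; the real tags are not shown in the report.
cleaned = strip_html("<p><i>čaj</i> &amp; related forms</p>")
```

Per-language tokenization would follow as a separate step on the cleaned text, ideally after re-running language detection, because detection on raw markup is likely to be noisy.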
Out[28]:

saturn.columns["Comment"].stats

stat value
n 76,475
nulls 74,102 (96.9%)
unique 2,068
len_min 36
len_max 15,127
len_mean 372.2
len_median 196
len_p95 917
word_mean 49.54
word_median 8
n_empty 0
n_duplicates 305
duplicate_rate 0.1285
vocab_size 7,544
readability_flesch_mean -127.1
emoji_rate 0
url_rate 0.001686
one_word_rate 0
allcaps_rate 0
boilerplate_rate 0
alert: multilingual · 8 languages detected in sample
alert: null_rate · 96.9% null
Fig 13.
Character-length distribution for Comment.

Source text metadata

This column holds bibliographic source citations in Author-Year slug form (e.g. 'Bybee-et-al-1994', 'Huber-and-Reed-1992'); 83% of populated entries are a single token and the mean is 1.24 words. Of the 69,383 populated entries, 29,715 are distinct and 57% are duplicates, so many references are reused across rows; 9.3% of rows are null. The 'passim]' token (386) and stray '112]' suggest bracketed page locators (for example '[passim]') attached to some citations.

Treatment: Normalize citation slugs (strip '[passim]'/page fragments) and treat as a categorical reference key.

anthropic:claude-opus-4-7 · confidence high
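
Normalizing the slugs amounts to separating each citation key from its optional bracketed locator. The sketch below assumes entries look like `key[pages]` and that multiple citations in one cell are separated by semicolons. Both are inferences from the 'passim]' and '112]' tokens, not something the report confirms, and the example strings are composed from the two sample slugs.

```python
import re

# Assumed layout: a key with no brackets, optionally followed by [locator].
SOURCE_RE = re.compile(r"^(?P<key>[^\[\]]+?)(?:\[(?P<pages>[^\]]*)\])?$")


def parse_source(field):
    """Split a Source cell into (key, pages) pairs; pages is None when absent."""
    refs = []
    for chunk in field.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        match = SOURCE_RE.match(chunk)
        if match:
            refs.append((match.group("key").strip(), match.group("pages")))
        else:
            refs.append((chunk, None))  # keep unparseable text for manual review
    return refs


single = parse_source("Bybee-et-al-1994[passim]")
several = parse_source("Huber-and-Reed-1992[112]; Bybee-et-al-1994")
reference_keys = sorted({key for key, _ in single + several})
```

Using only the keys as the categorical reference should bring the distinct count well below 29,715 if page locators are what inflates it; that is a hypothesis to test on the file, not a measured result.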
Out[31]:

saturn.columns["Source"].stats

stat value
n 76,475
nulls 7,092 (9.3%)
unique 29,715
len_min 7
len_max 165
len_mean 21.86
len_median 19
len_p95 44
word_mean 1.239
word_median 1
n_empty 0
n_duplicates 39,668
duplicate_rate 0.5717
vocab_size 13,953
readability_flesch_mean -9.644
emoji_rate 0
url_rate 0
one_word_rate 0.8318
allcaps_rate 5.765e-05
boilerplate_rate 0
alert: one_word · 83.2% rows are a single word
alert: duplicates · 57.2% duplicate strings
Fig 14.
Character-length distribution for Source.
Character-length distribution for Source (mean: 21.858510009656545); the 4 empty bins are omitted.
chars        count
7 – 11       3154
11 – 15      11217
15 – 19      18520
19 – 23      14181
23 – 27      7564
27 – 31      4653
31 – 35      3099
35 – 39      1849
39 – 43      1332
43 – 46      1123
46 – 50      620
50 – 54      570
54 – 58      357
58 – 62      308
62 – 66      166
66 – 70      168
70 – 74      141
74 – 78      40
78 – 82      81
82 – 86      42
86 – 90      35
90 – 94      38
94 – 98      18
98 – 102     24
102 – 106    34
106 – 110    21
110 – 114    5
114 – 118    4
118 – 122    9
126 – 129    1
129 – 133    2
133 – 137    1
137 – 141    3
153 – 157    1
157 – 161    1
161 – 165    1

Example_ID text foreign_key

This column holds space-separated lists of `igt-####` example identifiers, likely linking each row to one or more interlinear gloss text examples. It is overwhelmingly empty (97.89% null): only 1,612 of 76,475 rows carry a value, spread over 1,444 distinct strings, with 39% of the populated rows being a single token. Despite the IDs looking unique, 168 duplicate strings (10.4% duplicate rate) appear, so the same example bundle is referenced by multiple rows.

Treatment: Split on whitespace and left-join each igt-id to the examples table; expect most rows to have no match.

anthropic:claude-opus-4-7 · confidence high
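
The split-then-join treatment can be sketched without any dataframe library. The examples table here is a hypothetical dictionary keyed by example ID, and the IDs are invented to fit the `igt-####` shape. Unmatched IDs come back with None, which mirrors a left join.

```python
def split_example_ids(field):
    """Split an Example_ID cell on whitespace; null or empty cells give []."""
    if not field:
        return []
    return field.split()


def left_join_examples(example_ids, examples):
    """Pair each ID with its example record, or None when there is no match."""
    return [(example_id, examples.get(example_id)) for example_id in example_ids]


# Hypothetical lookup table; real example records are not part of this report.
examples = {"igt-0001": "first example", "igt-0002": "second example"}

ids = split_example_ids("igt-0001 igt-0002 igt-9999")
joined = left_join_examples(ids, examples)
unmatched = [example_id for example_id, record in joined if record is None]
```

Since only about 2% of rows carry any example ID, it is usually cheaper to filter to populated rows first and join afterwards.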
Out[34]:

saturn.columns["Example_ID"].stats

stat value
n 76,475
nulls 74,863 (97.9%)
unique 1,444
len_min 5
len_max 575
len_mean 23.57
len_median 17
len_p95 53
word_mean 2.818
word_median 2
n_empty 0
n_duplicates 168
duplicate_rate 0.1042
vocab_size 3,810
readability_flesch_mean 119.7
emoji_rate 0
url_rate 0
one_word_rate 0.3908
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 39.1% rows are a single word
alert: null_rate · 97.9% null
Fig 15.
Character-length distribution for Example_ID.
Character-length distribution for Example_ID (mean: 23.574441687344912); the 18 empty bins are omitted.
chars        count
5 – 19       1053
19 – 34      180
34 – 48      293
48 – 62      24
62 – 76      20
76 – 90      8
90 – 105     3
105 – 119    5
119 – 133    2
133 – 148    4
148 – 162    1
162 – 176    3
176 – 190    3
190 – 204    3
204 – 219    1
219 – 233    1
247 – 262    3
276 – 290    1
290 – 304    1
318 – 333    1
376 – 390    1
561 – 575    1

How to cite

BibTeX
@misc{saturn-language-data-wals-values-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: language data wals values},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/language-data-wals_values}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: language data wals values. Source: /home/coolhand/datasets/language-data/wals_values.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/language-data-wals_values