saturn·

data raw wals language

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/data_raw/wals_language.csv

Saturn profiled 3,573 rows across 17 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
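# Notebook shell escape: install the profiler into the current environment.
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the CSV (flags as in the original cell).
# check=True raises CalledProcessError if the command exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/data_raw/wals_language.csv",
    "--findings", "data_raw-wals_language.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)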

Summary confidence: high

This dataset is a catalogue of 3,573 WALS entries, with identifiers (ISO codes, Glottocode), names, geographic coordinates, and classification fields (Family, Genus, Subfamily, Macroarea) plus reference sources and sampling flags. The geographic and genealogical breakdowns are the most informative starting point: Macroarea splits across six regions led by Eurasia (659) and Africa (606), while Niger-Congo and Austronesian tie as the largest families (324 each, 9.1% of rows apiece). Worth a closer look: Macroarea, Latitude, Longitude, Family, Genus, and both sample flags are each null in exactly 911 rows (25.5%), which suggests one shared block of incomplete rows rather than scattered gaps; this is an inference from the matching counts, not a row-level check. Subfamily is 74.5% null, which will limit any subfamily-level analysis. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 True values respectively), reflecting their role as curated sub-samples rather than balanced categories.

citing: columns · row_count · kinds
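
The null rates quoted in the summary can be recomputed from the CSV with standard-library Python. The sketch below is an illustration, not Saturn's implementation: the helper name `null_rates` is hypothetical, treating an empty or whitespace-only cell as null is an assumption, and the three sample rows are made up.

```python
import csv
import io

def null_rates(rows, columns):
    """Return {column: share of rows whose cell is empty or missing}."""
    rows = list(rows)
    rates = {}
    for col in columns:
        # Assumption: an empty or whitespace-only cell counts as null.
        missing = sum(1 for r in rows if not (r.get(col) or "").strip())
        rates[col] = missing / len(rows) if rows else 0.0
    return rates

# Made-up sample shaped like wals_language.csv (not real data).
sample = io.StringIO(
    "ID,Name,Macroarea\n"
    "aaa,Alpha,Eurasia\n"
    "bbb,Beta,\n"
    "genus-x,X,\n"
)
rates = null_rates(csv.DictReader(sample), ["ID", "Name", "Macroarea"])
print(rates)
```

To run this on the real file, open the source path with `newline=""` and pass `csv.DictReader(f)` in place of the in-memory sample.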

Out[4]:

saturn.schema() · 17 columns

column kind n null% unique alerts
ID text 3,573 0.0% 3,573 near_unique one_word short_text
Name text 3,573 0.0% 3,198 one_word short_text
Macroarea categorical 3,573 25.5% 6 null_rate
Latitude numeric 3,573 25.5% 887 null_rate
Longitude numeric 3,573 25.5% 1,360 null_rate
Glottocode text 3,573 26.0% 2,502 one_word null_rate short_text
ISO639P3code text 3,573 26.8% 2,442 one_word null_rate short_text
Family categorical 3,573 25.5% 254 null_rate
Subfamily categorical 3,573 74.5% 32 null_rate
Genus categorical 3,573 25.5% 625 null_rate
GenusIcon categorical 3,573 82.5% 613 long_tail null_rate
ISO_codes text 3,573 26.1% 2,468 one_word null_rate short_text
Samples_100 categorical 3,573 25.5% 2 null_rate imbalance
Samples_200 categorical 3,573 25.5% 2 null_rate
Country_ID categorical 3,573 25.7% 337 null_rate
Source text 3,573 30.1% 2,373 one_word null_rate
Parent_ID categorical 3,573 7.1% 911 long_tail
Fig 1.
Macroarea · Shows the six-region geographic split, led by Eurasia and Africa.
Top values for Macroarea (6 unique shown, of 6 total).
value            count   share
Eurasia            659   18.4%
Africa             606   17.0%
Papunesia          560   15.7%
North America      396   11.1%
South America      258    7.2%
Australia          183    5.1%
Fig 2.
Family · Top language families — Niger-Congo and Austronesian tie at the top with 324 languages each.
Top values for Family (20 unique shown, of 254 total).
value                count   share
Niger-Congo            324    9.1%
Austronesian           324    9.1%
Indo-European          176    4.9%
Sino-Tibetan           146    4.1%
Afro-Asiatic           145    4.1%
Pama-Nyungan           121    3.4%
Trans-New Guinea        98    2.7%
other                   72    2.0%
Altaic                  65    1.8%
Oto-Manguean            56    1.6%
Austro-Asiatic          48    1.3%
Eastern Sudanic         47    1.3%
Uto-Aztecan             44    1.2%
Algic                   31    0.9%
Mayan                   30    0.8%
Arawakan                29    0.8%
Nakh-Daghestanian       28    0.8%
Mande                   28    0.8%
Uralic                  27    0.8%
Hokan                   26    0.7%
Fig 3.
Latitude · Latitude distribution skews toward the tropics and northern hemisphere (median ~8°).
Histogram bins for Latitude (median: 8.291666666665).
bin                 count
-55 – -51.84            2
-51.84 – -48.69         1
-48.69 – -45.53         1
-45.53 – -42.38         2
-42.38 – -39.22         3
-39.22 – -36.06         5
-36.06 – -32.91         9
-32.91 – -29.75        18
-29.75 – -26.59        24
-26.59 – -23.44        29
-23.44 – -20.28        47
-20.28 – -17.12        64
-17.12 – -13.97        85
-13.97 – -10.81        86
-10.81 – -7.656       129
-7.656 – -4.5         187
-4.5 – -1.344         190
-1.344 – 1.812        130
1.812 – 4.969         119
4.969 – 8.125         196
8.125 – 11.28         161
11.28 – 14.44         104
14.44 – 17.59         145
17.59 – 20.75          82
20.75 – 23.91          63
23.91 – 27.06          74
27.06 – 30.22          84
30.22 – 33.38          66
33.38 – 36.53          84
36.53 – 39.69          67
39.69 – 42.84          86
42.84 – 46             62
46 – 49.16             74
49.16 – 52.31          52
52.31 – 55.47          44
55.47 – 58.62          18
58.62 – 61.78          20
61.78 – 64.94          23
64.94 – 68.09          19
68.09 – 71.25           7
Fig 4.
Country_ID · Country concentration — Papua New Guinea, Australia, the US, and Indonesia host the most languages.
Top values for Country_ID (20 unique shown, of 337 total).
value   count   share
PG        214    6.0%
AU        185    5.2%
US        177    5.0%
ID        177    5.0%
IN        120    3.4%
MX        120    3.4%
RU         89    2.5%
NG         66    1.8%
BR         66    1.8%
CN         54    1.5%
CD         49    1.4%
CM         46    1.3%
CA         45    1.3%
CO         39    1.1%
ET         36    1.0%
PH         36    1.0%
PE         35    1.0%
NP         32    0.9%
TZ         28    0.8%
VU         28    0.8%
Fig 5.
Samples_100 · Highlights the strong imbalance: only 100 of 2,662 non-null rows are flagged True.
Top values for Samples_100 (2 unique shown, of 2 total).
value   count   share
False    2562   71.7%
True      100    2.8%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column         kind          null %
ID             text            0.0%
Name           text            0.0%
Macroarea      categorical    25.5%
Latitude       numeric        25.5%
Longitude      numeric        25.5%
Glottocode     text           26.0%
ISO639P3code   text           26.8%
Family         categorical    25.5%
Subfamily      categorical    74.5%
Genus          categorical    25.5%
GenusIcon      categorical    82.5%
ISO_codes      text           26.1%
Samples_100    categorical    25.5%
Samples_200    categorical    25.5%
Country_ID     categorical    25.7%
Source         text           30.1%
Parent_ID      categorical     7.1%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
            Latitude   Longitude
Latitude       +1.00       -0.38
Longitude      -0.38       +1.00
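
For readers who want to recompute the coefficient, here is a minimal sketch of Pearson correlation in plain Python. It is not Saturn's code: the report says its figure is "sampled, bounded", which this sketch does not reproduce, and dropping any pair with a missing value (pairwise deletion) is an assumption. The coordinates are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Made-up coordinates; None stands in for a null cell.
lat = [10.0, 20.0, None, 40.0]
lon = [30.0, 20.0, 15.0, 0.0]
r = pearson(lat, lon)
print(round(r, 2))
```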

ID text identifier

Column 'ID' is a unique row identifier: all 3,573 values are distinct (n_unique equals n), every value is a single token (one_word_rate 1.0), and there are no nulls or duplicates. Lengths range from 2 to 36 characters with a median of 3 and a mean of 5.98. Per the length histogram, 2,662 IDs are 2 or 3 characters long and the remaining 911 are 8 characters or longer. The top tokens (aab, aar, aba, abb…) suggest short alphabetic codes rather than numeric keys.

Treatment: Use as a join key; drop from modelling features.

anthropic:claude-opus-4-7 · confidence high
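
Before relying on ID as a join key, it is cheap to confirm that it has no nulls or repeats and then strip it from the feature set. A minimal sketch follows; the helper names `key_problems` and `drop_columns` are hypothetical, and the Latitude value in the example row is made up.

```python
from collections import Counter

def key_problems(values):
    """Return (number of null values, sorted list of values seen more than once)."""
    nulls = sum(1 for v in values if v is None or v == "")
    counts = Counter(v for v in values if v not in (None, ""))
    duplicates = sorted(v for v, c in counts.items() if c > 1)
    return nulls, duplicates

def drop_columns(row, to_drop):
    """Copy a row dict without the listed identifier columns."""
    return {k: v for k, v in row.items() if k not in to_drop}

ids = ["aab", "aar", "aba"]  # first three top tokens named in the profile
nulls, duplicates = key_problems(ids)
features = drop_columns({"ID": "aab", "Latitude": "1.5"}, {"ID"})
print(nulls, duplicates, features)
```
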
Out[13]:

saturn.columns["ID"].stats

stat                      value
n                         3,573
nulls                     0 (0.0%)
unique                    3,573
len_min                   2
len_max                   36
len_mean                  5.982
len_median                3
len_p95                   17
word_mean                 1
word_median               1
n_empty                   0
n_duplicates              0
duplicate_rate            0
vocab_size                3,573
readability_flesch_mean   61.58
emoji_rate                0
url_rate                  0
one_word_rate             1
allcaps_rate              0
boilerplate_rate          0
alert: near_unique        100.0% of rows are unique strings
alert: one_word           100.0% rows are a single word
alert: short_text         95th-percentile length under 20 chars
Fig 8.
Character-length distribution for ID.
Character-length distribution for ID (mean: 5.9818080044780295).
chars      count
2 – 3         11
3 – 4       2651
4 – 5          0
5 – 5          0
5 – 6          0
6 – 7          0
7 – 8          0
8 – 9          2
9 – 10        17
10 – 10       53
10 – 11       89
11 – 12      137
12 – 13      122
13 – 14        0
14 – 15      110
15 – 16       90
16 – 16       67
16 – 17       52
17 – 18       48
18 – 19        0
19 – 20       21
20 – 21       22
21 – 22       20
22 – 22       17
22 – 23       10
23 – 24       11
24 – 25        0
25 – 26        7
26 – 27        3
27 – 28        0
28 – 28        2
28 – 29        3
29 – 30        2
30 – 31        0
31 – 32        1
32 – 33        2
33 – 33        2
33 – 34        0
34 – 35        0
35 – 36        1

Name text label

This column holds short proper-noun labels, almost certainly language names: top values like Basque, Ainu, Beothuk, and Atakapa, and frequent words such as 'sign', 'language', 'arabic', and 'mixtec', all point to a linguistic registry. Entries are mostly single tokens (one_word_rate 0.80, word_mean 1.26, len_mean 8.7), with a 46-character maximum for the longer variants that carry parenthesised qualifiers like '(northern)' or '(southern)'. Notably, the 3,573 rows contain only 3,198 unique names, leaving 375 duplicates (10.5%). Names like 'Abun', 'Andoke', and 'Basque' each appear 3 times, suggesting the dataset repeats names across some other dimension, so Name is not a clean key.

Treatment: Treat as a categorical language label; deduplicate or join on it rather than using as a primary key.

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["Name"].stats

stat                      value
n                         3,573
nulls                     0 (0.0%)
unique                    3,198
len_min                   2
len_max                   46
len_mean                  8.705
len_median                7
len_p95                   19
word_mean                 1.258
word_median               1
n_empty                   0
n_duplicates              375
duplicate_rate            0.105
vocab_size                3,383
readability_flesch_mean   48.16
emoji_rate                0
url_rate                  0
one_word_rate             0.8002
allcaps_rate              0
boilerplate_rate          0
alert: one_word           80.0% rows are a single word
alert: short_text         95th-percentile length under 20 chars
Fig 9.
Character-length distribution for Name.
Character-length distribution for Name (mean: 8.705009795689897).
chars      count
2 – 3        112
3 – 4        361
4 – 5        528
5 – 6        581
6 – 8        502
8 – 9        305
9 – 10       198
10 – 11      152
11 – 12       84
12 – 13       88
13 – 14      131
14 – 15       89
15 – 16       76
16 – 17       77
17 – 18       66
18 – 20       46
20 – 21       34
21 – 22       31
22 – 23       21
23 – 24       19
24 – 25       21
25 – 26        8
26 – 27        9
27 – 28        9
28 – 30        5
30 – 31        2
31 – 32        3
32 – 33        4
33 – 34        2
34 – 35        3
35 – 36        2
36 – 37        1
37 – 38        1
38 – 39        0
39 – 40        0
40 – 42        0
42 – 43        0
43 – 44        0
44 – 45        0
45 – 46        2

Macroarea categorical feature

Macroarea is a categorical geographic grouping with 6 values covering the standard continental/linguistic macroareas (Eurasia, Africa, Papunesia, North America, South America, Australia). The distribution is fairly balanced: the entropy ratio is 0.95, and the top value Eurasia accounts for only 24.8% of non-null rows (18.4% of all rows). The main concern is a 25.5% null rate, meaning a quarter of the 3,573 rows lack any macroarea assignment.

Treatment: One-hot encode and decide whether to impute or add an explicit 'unknown' bucket for the 25.5% nulls.

anthropic:claude-opus-4-7 · confidence high
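
The treatment above can be sketched as a one-hot encoder with an explicit bucket for missing values. This is one possible encoding, not a prescribed one: the function name `one_hot_macroarea` and the label 'unknown' are hypothetical, and routing unseen labels to the same bucket as nulls is an assumption.

```python
MACROAREAS = ["Eurasia", "Africa", "Papunesia",
              "North America", "South America", "Australia"]

def one_hot_macroarea(value, categories=MACROAREAS, unknown="unknown"):
    """Return a {level: 0 or 1} dict with one extra level for nulls."""
    levels = list(categories) + [unknown]
    # Assumption: None, empty strings, and unseen labels all map to 'unknown'.
    key = value if value in categories else unknown
    return {level: int(level == key) for level in levels}

print(one_hot_macroarea("Africa"))
print(one_hot_macroarea(None))
```

Keeping the null bucket as its own level preserves the 25.5% of rows without a macroarea instead of forcing an imputed region onto them.
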
Out[19]:

saturn.columns["Macroarea"].stats

stat               value
n                  3,573
nulls              911 (25.5%)
unique             6
top_value          Eurasia
top_rate           0.2476
cardinality        6
entropy            2.459
entropy_ratio      0.9511
alert: null_rate   25.5% null
Fig 10.
Top values for Macroarea.
Top values for Macroarea (6 unique shown, of 6 total).
value            count   share
Eurasia            659   18.4%
Africa             606   17.0%
Papunesia          560   15.7%
North America      396   11.1%
South America      258    7.2%
Australia          183    5.1%
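
The entropy and entropy_ratio stats can be sanity-checked from the counts in the table above. The sketch assumes Shannon entropy in bits over non-null counts, normalised by log2 of the number of categories. That formula is an inference from the reported numbers, not something the report states, and under it the Macroarea counts come out close to the reported 2.459 and 0.9511.

```python
import math

def entropy_ratio(counts):
    """Return (Shannon entropy in bits, entropy / log2(number of categories))."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log2(p) for p in probs)
    # With a single category the entropy is 0; avoid dividing by log2(1) = 0.
    max_h = math.log2(len(probs)) if len(probs) > 1 else 1.0
    return h, h / max_h

# Macroarea counts for the 2,662 non-null rows.
h, ratio = entropy_ratio([659, 606, 560, 396, 258, 183])
print(round(h, 3), round(ratio, 4))
```

A ratio near 1 means the categories are close to equally frequent; a ratio near 0 means one category holds almost all rows.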

Latitude numeric feature

Geographic latitude in decimal degrees, ranging from -55.0 to 71.25 with a median of 8.29 and a mean of 11.88, consistent with global coverage skewed slightly toward the northern hemisphere. About 25.5% of rows are null, a notable gap for a positional field, and only 887 unique values among the 2,662 non-null rows suggest coordinates are rounded or tied to a limited set of locations. The distribution is near-symmetric (skew 0.36, kurtosis -0.50) with just one outlier flagged.

Treatment: Pair with Longitude for geospatial features; impute or filter the 25.5% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
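
One common way to pair Latitude with Longitude is a great-circle distance feature. The sketch below uses the haversine formula with a mean Earth radius of 6371 km (an approximation) and filters out rows with missing coordinates first. The helper names and the dict-per-row layout are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def with_coordinates(rows):
    """Keep only rows where both Latitude and Longitude are present."""
    return [r for r in rows
            if r.get("Latitude") not in (None, "")
            and r.get("Longitude") not in (None, "")]

# Made-up rows; the second one mimics a row with null coordinates.
rows = [{"ID": "aaa", "Latitude": "0", "Longitude": "0"},
        {"ID": "bbb", "Latitude": "", "Longitude": ""}]
usable = with_coordinates(rows)
print(len(usable))
```
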
Out[22]:

saturn.columns["Latitude"].stats

stat               value
n                  3,573
nulls              911 (25.5%)
unique             887
min                -55
max                71.25
mean               11.88
median             8.292
std                22.72
q1                 -5
q3                 28
iqr                33
skew               0.3562
kurtosis           -0.5023
n_outliers         1
outlier_rate       0.0003757
zero_rate          0.002254
alert: null_rate   25.5% null
Fig 11.
Distribution of Latitude. Vertical dash marks the median.
Histogram bins for Latitude (median: 8.291666666665).
bin                 count
-55 – -51.84            2
-51.84 – -48.69         1
-48.69 – -45.53         1
-45.53 – -42.38         2
-42.38 – -39.22         3
-39.22 – -36.06         5
-36.06 – -32.91         9
-32.91 – -29.75        18
-29.75 – -26.59        24
-26.59 – -23.44        29
-23.44 – -20.28        47
-20.28 – -17.12        64
-17.12 – -13.97        85
-13.97 – -10.81        86
-10.81 – -7.656       129
-7.656 – -4.5         187
-4.5 – -1.344         190
-1.344 – 1.812        130
1.812 – 4.969         119
4.969 – 8.125         196
8.125 – 11.28         161
11.28 – 14.44         104
14.44 – 17.59         145
17.59 – 20.75          82
20.75 – 23.91          63
23.91 – 27.06          74
27.06 – 30.22          84
30.22 – 33.38          66
33.38 – 36.53          84
36.53 – 39.69          67
39.69 – 42.84          86
42.84 – 46             62
46 – 49.16             74
49.16 – 52.31          52
52.31 – 55.47          44
55.47 – 58.62          18
58.62 – 61.78          20
61.78 – 64.94          23
64.94 – 68.09          19
68.09 – 71.25           7

Longitude numeric feature

Geographic longitude in decimal degrees, spanning -178.17 to 179.17 with 1,360 distinct values among the 2,662 non-null rows. The distribution is roughly symmetric (skew -0.33) but flat (kurtosis -1.05) with an IQR of 166.75, consistent with truly global coverage rather than a regional sample. Notable concern: 25.5% of rows are null, so any inner join or filter on coordinates will drop a quarter of the rows.

Treatment: Pair with Latitude and impute or filter the 25.5% nulls before any geospatial modelling.

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["Longitude"].stats

stat               value
n                  3,573
nulls              911 (25.5%)
unique             1,360
min                -178.2
max                179.2
mean               35.17
median             34.79
std                89.35
q1                 -45.75
q3                 121
iqr                166.8
skew               -0.3259
kurtosis           -1.047
n_outliers         0
outlier_rate       0
zero_rate          0.001503
alert: null_rate   25.5% null
Fig 12.
Distribution of Longitude. Vertical dash marks the median.
Histogram bins for Longitude (median: 34.79166666665).
bin                 count
-178.2 – -169.2        13
-169.2 – -160.3         5
-160.3 – -151.4         7
-151.4 – -142.4         8
-142.4 – -133.5         4
-133.5 – -124.6        15
-124.6 – -115.6        96
-115.6 – -106.7        36
-106.7 – -97.77        57
-97.77 – -88.83       107
-88.83 – -79.9         33
-79.9 – -70.97        114
-70.97 – -62.03        95
-62.03 – -53.1         55
-53.1 – -44.17         22
-44.17 – -35.23         7
-35.23 – -26.3          0
-26.3 – -17.37          1
-17.37 – -8.433        42
-8.433 – 0.5          109
0.5 – 9.433           108
9.433 – 18.37         170
18.37 – 27.3          107
27.3 – 36.23          159
36.23 – 45.17          94
45.17 – 54.1           62
54.1 – 63.03           17
63.03 – 71.97          20
71.97 – 80.9           67
80.9 – 89.83           73
89.83 – 98.77         102
98.77 – 107.7          83
107.7 – 116.6          56
116.6 – 125.6         129
125.6 – 134.5         112
134.5 – 143.4         190
143.4 – 152.4         177
152.4 – 161.3          48
161.3 – 170.2          51
170.2 – 179.2          11

Glottocode text foreign_key

This is a Glottocode field: fixed 8-character language identifiers from the Glottolog catalogue (e.g. basq1248, swis1247), with every value being a single token. About 26% of rows are null (928), and the 2,645 non-null values contain 2,502 distinct codes, leaving 143 duplicates (5.4%) where the same code repeats; basq1248 leads with 11 occurrences. Length is exactly 8 for min, median, and max, consistent with a controlled-vocabulary identifier rather than free text.

Treatment: Treat as a categorical key; left-join to Glottolog metadata and handle the 26% nulls explicitly.

anthropic:claude-opus-4-7 · confidence high
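
A left join that handles nulls explicitly can be sketched with a plain dictionary lookup. Everything here is illustrative: the `lookup` table stands in for Glottolog metadata that would have to be loaded separately, and the output field names `glottolog_name` and `join_status` are hypothetical.

```python
def left_join(rows, lookup, key="Glottocode"):
    """Left-join rows to a {glottocode: metadata} dict, labelling each outcome."""
    joined = []
    for row in rows:
        code = row.get(key) or None  # treat "" as null
        meta = lookup.get(code) if code else None
        out = dict(row)
        out["glottolog_name"] = meta["name"] if meta else None
        # An explicit status keeps null and unmatched rows visible.
        if code is None:
            out["join_status"] = "null_key"
        elif meta:
            out["join_status"] = "matched"
        else:
            out["join_status"] = "unmatched"
        joined.append(out)
    return joined

# Illustrative lookup with a single entry; 'xxxx0000' is a made-up code.
lookup = {"basq1248": {"name": "Basque"}}
rows = [{"Glottocode": "basq1248"}, {"Glottocode": ""}, {"Glottocode": "xxxx0000"}]
result = left_join(rows, lookup)
print([r["join_status"] for r in result])
```

Because every input row is kept, the row count before and after the join should match, which makes the 26% of null keys easy to audit.
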
Out[28]:

saturn.columns["Glottocode"].stats

stat                      value
n                         3,573
nulls                     928 (26.0%)
unique                    2,502
len_min                   8
len_max                   8
len_mean                  8
len_median                8
len_p95                   8
word_mean                 1
word_median               1
n_empty                   0
n_duplicates              143
duplicate_rate            0.05406
vocab_size                2,502
readability_flesch_mean   92.88
emoji_rate                0
url_rate                  0
one_word_rate             1
allcaps_rate              0
boilerplate_rate          0
alert: one_word           100.0% rows are a single word
alert: null_rate          26.0% null
alert: short_text         95th-percentile length under 20 chars
Fig 13.
Character-length distribution for Glottocode.
Character-length distribution for Glottocode (mean: 8.0).
chars      count
8 – 8       2645
(the other 39 bins, all labelled 8 – 8, are empty)

ISO639P3code text foreign_key

This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0), with examples like 'eus', 'gsw', 'deu' matching the standard. Coverage is incomplete: 26.84% of rows are null. The 2,614 non-null values contain 2,442 unique codes, a 6.58% duplicate rate, so most codes occur only once or twice (top value 'eus' at 12).

Treatment: Treat as a categorical key; left-join to an ISO 639-3 reference table and decide on an explicit bucket for the 26.84% nulls.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["ISO639P3code"].stats

stat                      value
n                         3,573
nulls                     959 (26.8%)
unique                    2,442
len_min                   3
len_max                   3
len_mean                  3
len_median                3
len_p95                   3
word_mean                 1
word_median               1
n_empty                   0
n_duplicates              172
duplicate_rate            0.0658
vocab_size                2,442
readability_flesch_mean   119.5
emoji_rate                0
url_rate                  0
one_word_rate             1
allcaps_rate              0
boilerplate_rate          0
alert: one_word           100.0% rows are a single word
alert: null_rate          26.8% null
alert: short_text         95th-percentile length under 20 chars
Fig 14.
Character-length distribution for ISO639P3code.
Character-length distribution for ISO639P3code (mean: 3.0).
chars      count
3 – 3       2614
(the other 39 bins, spanning 2 – 4, are empty)

Family categorical feature

Categorical label for the language family of each row, with 254 distinct families. The distribution is long-tailed with no dominant family: the top value 'Niger-Congo' covers only 12.2% of non-null rows (9.1% of all rows) and ties exactly with 'Austronesian' at 324 each, while an entropy ratio of 0.70 indicates spread across many small families. Notable concern: 25.5% of rows are null, and a literal 'other' bucket already accounts for 72 rows.

Treatment: Impute or flag the 25.5% missing, then group rare families before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
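
Grouping rare families before encoding can be sketched with a frequency threshold. The helper name `group_rare`, the threshold, and the labels 'rare' and 'missing' are all hypothetical choices. The rare bucket is deliberately not called 'other', because the column already contains a literal 'other' value (72 rows) that would otherwise be merged with it.

```python
from collections import Counter

def group_rare(values, min_count, rare="rare", missing="missing"):
    """Replace labels seen fewer than min_count times; keep nulls as their own label."""
    counts = Counter(v for v in values if v)
    grouped = []
    for v in values:
        if not v:
            grouped.append(missing)
        elif counts[v] < min_count:
            grouped.append(rare)
        else:
            grouped.append(v)
    return grouped

# Made-up toy column; real thresholds depend on the downstream model.
families = ["Niger-Congo", "Niger-Congo", "Austronesian", "Austronesian",
            "Mayan", None, "other", "other"]
print(group_rare(families, min_count=2))
```
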
Out[34]:

saturn.columns["Family"].stats

stat               value
n                  3,573
nulls              911 (25.5%)
unique             254
top_value          Niger-Congo
top_rate           0.1217
cardinality        254
entropy            5.631
entropy_ratio      0.7049
alert: null_rate   25.5% null
Fig 15.
Top values for Family.
Top values for Family (20 unique shown, of 254 total).
value                count   share
Niger-Congo            324    9.1%
Austronesian           324    9.1%
Indo-European          176    4.9%
Sino-Tibetan           146    4.1%
Afro-Asiatic           145    4.1%
Pama-Nyungan           121    3.4%
Trans-New Guinea        98    2.7%
other                   72    2.0%
Altaic                  65    1.8%
Oto-Manguean            56    1.6%
Austro-Asiatic          48    1.3%
Eastern Sudanic         47    1.3%
Uto-Aztecan             44    1.2%
Algic                   31    0.9%
Mayan                   30    0.8%
Arawakan                29    0.8%
Nakh-Daghestanian       28    0.8%
Mande                   28    0.8%
Uralic                  27    0.8%
Hokan                   26    0.7%

Subfamily categorical feature

This column records the linguistic subfamily classification of each row, with 32 distinct values led by Benue-Congo (200 occurrences, 21.95% of non-nulls), Eastern Malayo-Polynesian (159), and Tibeto-Burman (139). The striking issue is the 74.5% null rate: only 911 of the 3,573 rows carry a subfamily label. Even so, an entropy ratio of 0.77 indicates the populated values are reasonably spread across the 32 categories rather than collapsing onto one.

Treatment: Treat missingness as its own category and group rare subfamilies before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["Subfamily"].stats

stat               value
n                  3,573
nulls              2,662 (74.5%)
unique             32
top_value          Benue-Congo
top_rate           0.2195
cardinality        32
entropy            3.856
entropy_ratio      0.7712
alert: null_rate   74.5% null
Fig 16.
Top values for Subfamily.
Top values for Subfamily (20 unique shown, of 32 total).
value                        count   share
Benue-Congo                    200    5.6%
Eastern Malayo-Polynesian      159    4.5%
Tibeto-Burman                  139    3.9%
Chadic                          47    1.3%
Mon-Khmer                       38    1.1%
Adamawa-Ubangi                  30    0.8%
Gur                             27    0.8%
Daghestanian                    25    0.7%
Cushitic                        24    0.7%
Finno-Ugric                     21    0.6%
Kwa                             20    0.6%
North-Central Atlantic          20    0.6%
Nilotic                         19    0.5%
Mixtecan                        18    0.5%
Omotic                          15    0.4%
Kainantu-Goroka                 14    0.4%
Madang                          13    0.4%
Awyu-Ok                         10    0.3%
Surmic                          10    0.3%
Je                               9    0.3%

Genus categorical feature

This column holds linguistic genus labels (e.g., Oceanic, Bantu, Indic, Semitic, Germanic), a mid-level grouping in language classification. Cardinality is high at 625 distinct values with an entropy ratio of 0.856, so the distribution is broad and flat: the top value 'Oceanic' covers only 5.6% of non-null rows (4.2% of all rows). Note the 25.5% null rate, which is flagged and would meaningfully shrink any analysis that conditions on genus.

Treatment: Treat as a high-cardinality categorical: group rare genera into 'Other' or target-encode, and add an explicit missing indicator for the 25.5% nulls.

anthropic:claude-opus-4-7 · confidence high
Out[40]:

saturn.columns["Genus"].stats

stat               value
n                  3,573
nulls              911 (25.5%)
unique             625
top_value          Oceanic
top_rate           0.05597
cardinality        625
entropy            7.95
entropy_ratio      0.856
alert: null_rate   25.5% null
Fig 17.
Top values for Genus.
Top values for Genus (20 unique shown, of 625 total).
value                        count   share
Oceanic                        149    4.2%
Bantu                          141    3.9%
Indic                           50    1.4%
Western Pama-Nyungan            49    1.4%
Semitic                         43    1.2%
Turkic                          41    1.1%
Sign Languages                  40    1.1%
Bodic                           40    1.1%
Germanic                        39    1.1%
Northern Pama-Nyungan           33    0.9%
Creoles and Pidgins             32    0.9%
Mayan                           30    0.8%
Algonquian                      29    0.8%
Central Malayo-Polynesian       29    0.8%
Iranian                         26    0.7%
Romance                         24    0.7%
Biu-Mandara                     24    0.7%
Southeastern Pama-Nyungan       23    0.6%
Dravidian                       23    0.6%
Malayo-Sumbawan                 22    0.6%

GenusIcon categorical identifier

GenusIcon holds 613 distinct short hex-like codes (e.g. 'c688033') with 82.51% nulls, and an entropy ratio of 0.9989 indicates values are nearly uniformly distributed among the 625 non-null rows. The top value appears only twice (top_rate 0.0032) and 601 of the 613 codes are singletons, so there is no dominant category: the column behaves like a near-unique tag rather than a real categorical feature.

Treatment: Drop for modelling; near-unique with 82% nulls.

anthropic:claude-opus-4-7 · confidence high
Out[43]:

saturn.columns["GenusIcon"].stats

stat               value
n                  3,573
nulls              2,948 (82.5%)
unique             613
top_value          c688033
top_rate           0.0032
cardinality        613
entropy            9.249
entropy_ratio      0.9989
alert: long_tail   601 singleton categories
alert: null_rate   82.5% null
Fig 18.
Top values for GenusIcon.
Top values for GenusIcon (20 unique shown, of 613 total).
value      count   share
c688033        2    0.1%
c803E33        2    0.1%
c804733        2    0.1%
c807D33        2    0.1%
c806233        2    0.1%
c805033        2    0.1%
c7A8033        2    0.1%
c805933        2    0.1%
c807433        2    0.1%
c806B33        2    0.1%
c718033        2    0.1%
c803533        2    0.1%
cCC8C51        1    0.0%
cCC6851        1    0.0%
cCC7E51        1    0.0%
c8FCC51        1    0.0%
cCC8051        1    0.0%
c528033        1    0.0%
cCC9F51        1    0.0%
cCCB551        1    0.0%

ISO_codes text foreign_key

This column holds ISO language codes: almost all values are single tokens of length 3 (len_mean 3.04, one_word_rate 0.99), matching ISO 639-3 conventions (e.g. 'eus', 'gsw', 'deu'). A small remainder is longer (len_max 7; the length histogram shows 26 values of 7 characters), which is consistent with two codes sharing one cell. 26.1% of rows are null, and the 2,640 non-null values contain 2,468 unique entries and 172 duplicates, so the vocabulary is wide, suggesting broad multilingual coverage rather than a few dominant languages. No value occurs more than 12 times, so the distribution has an extremely long tail.

Treatment: Treat as a categorical code key; impute or filter the 26% nulls before joining to a language reference table.

anthropic:claude-opus-4-7 · confidence high
Out[46]:

saturn.columns["ISO_codes"].stats

stat                      value
n                         3,573
nulls                     933 (26.1%)
unique                    2,468
len_min                   3
len_max                   7
len_mean                  3.039
len_median                3
len_p95                   3
word_mean                 1.01
word_median               1
n_empty                   0
n_duplicates              172
duplicate_rate            0.06515
vocab_size                2,486
readability_flesch_mean   117.4
emoji_rate                0
url_rate                  0
one_word_rate             0.9902
allcaps_rate              0
boilerplate_rate          0
alert: one_word           99.0% rows are a single word
alert: null_rate          26.1% null
alert: short_text         95th-percentile length under 20 chars
Fig 19.
Character-length distribution for ISO_codes.
Character-length distribution for ISO_codes (mean: 3.0393939393939395).
chars      count
3 – 3       2614
7 – 7         26
(the 38 bins in between are empty)

Samples_100 categorical feature

Boolean flag with only two values where 'False' dominates at 96.2% (2562 of 2662 non-null rows) and 'True' appears exactly 100 times. The null rate of 25.5% is high, suggesting the flag is only populated for a subset of records. Entropy ratio of 0.23 confirms severe imbalance.

Treatment: Treat as a rare-event boolean indicator; impute or encode nulls explicitly and avoid as a stratification key given the imbalance.

anthropic:claude-opus-4-7 · confidence high
Out[49]:

saturn.columns["Samples_100"].stats

stat               value
n                  3,573
nulls              911 (25.5%)
unique             2
top_value          False
top_rate           0.9624
cardinality        2
entropy            0.231
entropy_ratio      0.231
alert: null_rate   25.5% null
alert: imbalance   top value is 96.2% of rows
Fig 20.
Top values for Samples_100.
Top values for Samples_100 (2 unique shown, of 2 total).
value   count   share
False    2562   71.7%
True      100    2.8%

Samples_200 categorical label

Binary flag column with only two values (False/True) and heavy class imbalance — False accounts for 92.5% of non-null rows versus 200 True observations, hinting the name 'Samples_200' refers to a tagged 200-row subset. Roughly a quarter of rows (25.5%) are null, which is the main surprise and the reason for the null_rate alert. Entropy ratio of 0.385 confirms the distribution is far from balanced.

Treatment: Impute or explicitly encode nulls as a third category before using as a binary indicator.

anthropic:claude-opus-4-7 · confidence high
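
Encoding nulls as a third category can be sketched as a small three-state mapping. The function name `encode_flag` and the output labels are hypothetical, and the accepted spellings ('True'/'False', '1'/'0') are an assumption about how the CSV stores the flag.

```python
def encode_flag(value):
    """Map a Samples_200 cell to 'true', 'false', or 'missing'."""
    if value is None or str(value).strip() == "":
        return "missing"
    text = str(value).strip().lower()
    if text in ("true", "1"):
        return "true"
    if text in ("false", "0"):
        return "false"
    # Fail loudly on anything unexpected rather than guessing.
    raise ValueError(f"unexpected flag value: {value!r}")

print([encode_flag(v) for v in ["True", "False", "", None]])
```
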
Out[52]:

saturn.columns["Samples_200"].stats

stat               value
n                  3,573
nulls              911 (25.5%)
unique             2
top_value          False
top_rate           0.9249
cardinality        2
entropy            0.3848
entropy_ratio      0.3848
alert: null_rate   25.5% null
Fig 21.
Top values for Samples_200.
Top values for Samples_200 (2 unique shown, of 2 total).
value   count   share
False    2462   68.9%
True      200    5.6%

Country_ID categorical foreign_key

Country_ID looks like an ISO-style two-letter country code, with 337 distinct values and a fairly even spread (entropy ratio 0.752). The top country PG accounts for only 8.06% of non-null rows (6.0% of all rows), followed by AU, US, and ID. Notably, 25.69% of values are null, and the cardinality of 337 exceeds the roughly 250 ISO 3166-1 alpha-2 codes, suggesting that non-standard codes, sub-region codes, or cells holding more than one code are mixed in.

Treatment: Impute or flag the 25.69% nulls and reconcile non-standard codes against an ISO 3166-1 reference before joining.

anthropic:claude-opus-4-7 · confidence high
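
Reconciling the codes against a reference list can start with a simple classifier that separates nulls, well-formed known codes, well-formed unknown codes, and anything that is not two uppercase letters. The function name and category labels are hypothetical, and the four-code `REFERENCE` set is only a stand-in for a full ISO 3166-1 alpha-2 list.

```python
import re

TWO_LETTER = re.compile(r"[A-Z]{2}")

def classify_country_id(value, reference):
    """Classify a cell as 'null', 'known', 'unknown_code', or 'non_standard'."""
    if value is None or value.strip() == "":
        return "null"
    value = value.strip()
    if not TWO_LETTER.fullmatch(value):
        # Longer strings, lower case, or several codes in one cell.
        return "non_standard"
    return "known" if value in reference else "unknown_code"

# Tiny illustrative reference set, not the full standard.
REFERENCE = {"PG", "AU", "US", "ID"}
cells = ["PG", "ZZ", "AU PG", ""]
print([classify_country_id(c, REFERENCE) for c in cells])
```

Counting the four categories over the whole column would show where the excess over the standard code list comes from.
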
Out[55]:

saturn.columns["Country_ID"].stats

stat               value
n                  3,573
nulls              918 (25.7%)
unique             337
top_value          PG
top_rate           0.0806
cardinality        337
entropy            6.314
entropy_ratio      0.752
alert: null_rate   25.7% null
Fig 22.
Top values for Country_ID.
Top values for Country_ID (20 unique shown, of 337 total).
value   count   share
PG        214    6.0%
AU        185    5.2%
US        177    5.0%
ID        177    5.0%
IN        120    3.4%
MX        120    3.4%
RU         89    2.5%
NG         66    1.8%
BR         66    1.8%
CN         54    1.5%
CD         49    1.4%
CM         46    1.3%
CA         45    1.3%
CO         39    1.1%
ET         36    1.0%
PH         36    1.0%
PE         35    1.0%
NP         32    0.9%
TZ         28    0.8%
VU         28    0.8%

Source text metadata

This column holds bibliographic citation tags (e.g. 'Huber-and-Reed-1992', 'Boelaars-1950'), evidently the source reference for each row in what looks like a linguistic typology dataset. Values are typically short (median 25 chars, 2 words) and 45.5% are single tokens, consistent with author-year keys rather than prose, although a long tail reaches 452 characters (len_p95 135), which suggests some cells hold several citations. Cardinality is high (2,373 unique among 2,499 non-null values) with 5% duplicates and a 30% null rate, so coverage is uneven and no single source dominates (top value appears only 14 times).

Treatment: Treat as a citation key: keep as categorical provenance metadata, optionally normalize casing and join to a bibliography table; do not use as a model feature.

anthropic:claude-opus-4-7 · confidence high
Out[58]:

saturn.columns["Source"].stats

stat                      value
n                         3,573
nulls                     1,074 (30.1%)
unique                    2,373
len_min                   7
len_max                   452
len_mean                  42.07
len_median                25
len_p95                   135
word_mean                 2.854
word_median               2
n_empty                   0
n_duplicates              126
duplicate_rate            0.05042
vocab_size                5,899
readability_flesch_mean   21.33
emoji_rate                0
url_rate                  0
one_word_rate             0.4546
allcaps_rate              0
boilerplate_rate          0
alert: one_word           45.5% rows are a single word
alert: null_rate          30.1% null
Fig 23.
Character-length distribution for Source.
Character-length distribution for Source (mean: 42.07122849139656).
chars        count
7 – 18         937
18 – 29        518
29 – 40        280
40 – 52        203
52 – 63        136
63 – 74         90
74 – 85         61
85 – 96         42
96 – 107        35
107 – 118       36
118 – 129       27
129 – 140       21
140 – 152       12
152 – 163       13
163 – 174        6
174 – 185       14
185 – 196        6
196 – 207        5
207 – 218        3
218 – 230        4
230 – 241        4
241 – 252        5
252 – 263        3
263 – 274        6
274 – 285        7
285 – 296        3
296 – 307        4
307 – 318        1
318 – 330        1
330 – 341        3
341 – 352        1
352 – 363        1
363 – 374        0
374 – 385        2
385 – 396        1
396 – 408        1
408 – 419        4
419 – 430        1
430 – 441        0
441 – 452        2

Parent_ID categorical foreign_key

Parent_ID looks like a foreign key pointing to a parent node in the classification, usually a genus (e.g. 'genus-oceanic', 'genus-bantu') and sometimes a family (e.g. 'family-austronesian'), with 911 distinct parents. The distribution is long-tailed but flat: the top value covers only 4.5% of non-null rows (4.2% of all rows), the entropy ratio is 0.87, and 501 parents occur exactly once. 7.1% of values are null. Oceanic and Bantu lead, with Indic, Western Pama-Nyungan and Semitic trailing far behind.

Treatment: Left-join on this id to a lookup of genus and family nodes; treat the 7.1% nulls explicitly rather than one-hot encoding the 911 levels.

anthropic:claude-opus-4-7 · confidence high
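
Since the values shown carry a level prefix ('genus-' or 'family-'), a join can first split each id into its level and slug and then route it to the matching lookup. A minimal sketch, assuming the prefix is always separated by the first hyphen; the function name `split_parent_id` is hypothetical.

```python
def split_parent_id(parent_id):
    """Split 'genus-oceanic' into ('genus', 'oceanic'); nulls give (None, None)."""
    if not parent_id:
        return None, None
    level, sep, slug = parent_id.partition("-")
    if not sep:
        # No hyphen: level unknown, keep the whole value as the slug.
        return None, parent_id
    return level, slug

for pid in ["genus-oceanic", "family-austronesian", None]:
    print(split_parent_id(pid))
```
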
Out[61]:

saturn.columns["Parent_ID"].stats

stat               value
n                  3,573
nulls              254 (7.1%)
unique             911
top_value          genus-oceanic
top_rate           0.04489
cardinality        911
entropy            8.554
entropy_ratio      0.87
alert: long_tail   501 singleton categories
Fig 24.
Top values for Parent_ID.
Top values for Parent_ID (20 unique shown, of 911 total).
value                            count   share
genus-oceanic                      149    4.2%
genus-bantu                        141    3.9%
genus-indic                         50    1.4%
genus-westernpamanyungan            49    1.4%
genus-semitic                       43    1.2%
genus-turkic                        41    1.1%
genus-signlanguages                 40    1.1%
genus-bodic                         40    1.1%
genus-germanic                      39    1.1%
genus-northernpamanyungan           33    0.9%
genus-creolesandpidgins             32    0.9%
genus-mayan                         30    0.8%
family-austronesian                 30    0.8%
genus-algonquian                    29    0.8%
genus-centralmalayopolynesian       29    0.8%
genus-iranian                       26    0.7%
family-transnewguinea               25    0.7%
genus-romance                       24    0.7%
genus-biumandara                    24    0.7%
genus-southeasternpamanyungan       23    0.6%

How to cite


BibTeX
@misc{saturn-data-raw-wals-language-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data raw wals language},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data_raw-wals_language}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: data raw wals language. Source: /home/coolhand/servers/diachronica/data_raw/wals_language.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/data_raw-wals_language