
language data wals languages

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/datasets/language-data/wals_languages.csv

Saturn profiled 3,573 rows across 17 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

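[2]:
# Install the profiler (notebook shell escape).
!pip install saturn-dissect
import subprocess

# Profile the CSV, name the findings JSON, and request the opt-in LLM prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/datasets/language-data/wals_languages.csv",
    "--findings", "language-data-wals_languages.json",
    "--llm", "anthropic:claude-opus-4-7",
])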

Summary confidence: high

This dataset catalogs world languages from WALS in 3,573 rows and 17 columns combining identifiers (ISO codes, Glottocode), classifications (Family, Genus, Subfamily), geography (Latitude, Longitude, Macroarea, Country_ID), and sampling flags. The Family and Macroarea distributions are the most informative starting point: Niger-Congo and Austronesian lead at 324 rows each, and Eurasia (659) and Africa (606) are the largest of the six macroareas. Roughly a quarter of rows (911 of 3,573, null_rate ~0.255) are missing the geographic, family, genus, and sampling fields in lockstep, which points to a shared set of entries worth investigating. The ID length table shows 2,662 IDs in the 2- and 3-character bins and 911 IDs of eight or more characters, and Parent_ID values carry 'genus-' and 'family-' prefixes, so these 911 rows may be genus- or family-level entries rather than languages with missing data; that reading should be checked against the raw file. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 'True' values respectively), reflecting curated WALS sub-samples. Subfamily is sparsely populated (74.5% null), so treat it as supplementary rather than primary.

citing: Family.top_values · Macroarea.top_values · Genus.top_values · Country_ID.top_values · Samples_100.stats · Samples_200.stats · Latitude.stats · Longitude.stats · Subfamily.null_rate
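
The lockstep-missingness claim in the summary can be checked directly on the raw file. The following is a minimal sketch using only the standard library; the three-row excerpt and its values are hypothetical stand-ins, and real use would read wals_languages.csv instead.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt (not real rows); replace with open("wals_languages.csv").
SAMPLE = """ID,Macroarea,Latitude,Longitude,Family,Genus
x01,AreaA,1.5,2.5,FamilyA,GenusA
x02,,,,,
x03,,,,,
"""

FIELDS = ["Macroarea", "Latitude", "Longitude", "Family", "Genus"]

def null_patterns(rows, fields):
    """Count rows by which of `fields` are empty (one boolean per field)."""
    return Counter(tuple(row[f] == "" for f in fields) for row in rows)

patterns = null_patterns(csv.DictReader(io.StringIO(SAMPLE)), FIELDS)
# Lockstep missingness: every row is either fully populated or fully empty.
lockstep = all(len(set(pattern)) == 1 for pattern in patterns)
print(patterns, lockstep)
```

If `lockstep` comes back False on the real file, the mixed patterns in `patterns` show which fields break the pattern.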

Out[4]:

saturn.schema() · 17 columns

column        kind         n      null%  unique  alerts
ID            text         3,573   0.0%   3,573  near_unique one_word short_text
Name          text         3,573   0.0%   3,198  one_word short_text
Macroarea     categorical  3,573  25.5%       6  null_rate
Latitude      numeric      3,573  25.5%     887  null_rate
Longitude     numeric      3,573  25.5%   1,360  null_rate
Glottocode    text         3,573  26.0%   2,502  one_word null_rate short_text
ISO639P3code  text         3,573  26.8%   2,442  one_word null_rate short_text
Family        categorical  3,573  25.5%     254  null_rate
Subfamily     categorical  3,573  74.5%      32  null_rate
Genus         categorical  3,573  25.5%     625  null_rate
GenusIcon     categorical  3,573  82.5%     613  long_tail null_rate
ISO_codes     text         3,573  26.1%   2,468  one_word null_rate short_text
Samples_100   categorical  3,573  25.5%       2  null_rate imbalance
Samples_200   categorical  3,573  25.5%       2  null_rate
Country_ID    categorical  3,573  25.7%     337  null_rate
Source        text         3,573  30.1%   2,373  one_word null_rate
Parent_ID     categorical  3,573   7.1%     911  long_tail
Fig 1.
Family · Top language families — Niger-Congo and Austronesian tie at 324 each, with a long tail across 254 families.
Top values for Family (20 unique shown, of 254 total).
value              count  share (of all rows)
Niger-Congo          324   9.1%
Austronesian         324   9.1%
Indo-European        176   4.9%
Sino-Tibetan         146   4.1%
Afro-Asiatic         145   4.1%
Pama-Nyungan         121   3.4%
Trans-New Guinea      98   2.7%
other                 72   2.0%
Altaic                65   1.8%
Oto-Manguean          56   1.6%
Austro-Asiatic        48   1.3%
Eastern Sudanic       47   1.3%
Uto-Aztecan           44   1.2%
Algic                 31   0.9%
Mayan                 30   0.8%
Arawakan              29   0.8%
Nakh-Daghestanian     28   0.8%
Mande                 28   0.8%
Uralic                27   0.8%
Hokan                 26   0.7%
Fig 2.
Macroarea · Six macroareas, led by Eurasia (659) and Africa (606); shows global coverage balance.
Top values for Macroarea (6 unique shown, of 6 total).
value          count  share (of all rows)
Eurasia          659  18.4%
Africa           606  17.0%
Papunesia        560  15.7%
North America    396  11.1%
South America    258   7.2%
Australia        183   5.1%
Fig 3.
Country_ID · Country distribution — Papua New Guinea, Australia, US, and Indonesia top the list, signaling linguistic hotspots.
Top values for Country_ID (20 unique shown, of 337 total).
value  count  share (of all rows)
PG       214   6.0%
AU       185   5.2%
US       177   5.0%
ID       177   5.0%
IN       120   3.4%
MX       120   3.4%
RU        89   2.5%
NG        66   1.8%
BR        66   1.8%
CN        54   1.5%
CD        49   1.4%
CM        46   1.3%
CA        45   1.3%
CO        39   1.1%
ET        36   1.0%
PH        36   1.0%
PE        35   1.0%
NP        32   0.9%
TZ        28   0.8%
VU        28   0.8%
Fig 4.
Latitude · Latitude spread from -55 to 71 with median ~8°, showing a tropical/northern-hemisphere skew.
Histogram bins for Latitude (median: 8.291666666665).
bin               count
-55 – -51.84          2
-51.84 – -48.69       1
-48.69 – -45.53       1
-45.53 – -42.38       2
-42.38 – -39.22       3
-39.22 – -36.06       5
-36.06 – -32.91       9
-32.91 – -29.75      18
-29.75 – -26.59      24
-26.59 – -23.44      29
-23.44 – -20.28      47
-20.28 – -17.12      64
-17.12 – -13.97      85
-13.97 – -10.81      86
-10.81 – -7.656     129
-7.656 – -4.5       187
-4.5 – -1.344       190
-1.344 – 1.812      130
1.812 – 4.969       119
4.969 – 8.125       196
8.125 – 11.28       161
11.28 – 14.44       104
14.44 – 17.59       145
17.59 – 20.75        82
20.75 – 23.91        63
23.91 – 27.06        74
27.06 – 30.22        84
30.22 – 33.38        66
33.38 – 36.53        84
36.53 – 39.69        67
39.69 – 42.84        86
42.84 – 46           62
46 – 49.16           74
49.16 – 52.31        52
52.31 – 55.47        44
55.47 – 58.62        18
58.62 – 61.78        20
61.78 – 64.94        23
64.94 – 68.09        19
68.09 – 71.25         7
Fig 5.
Samples_200 · Only 200 of 2,662 non-null rows are flagged True — useful for filtering to the curated WALS 200-sample set.
Top values for Samples_200 (2 unique shown, of 2 total).
value  count  share (of all rows)
False   2462  68.9%
True     200   5.6%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column        kind         null %
ID            text           0.0%
Name          text           0.0%
Macroarea     categorical   25.5%
Latitude      numeric       25.5%
Longitude     numeric       25.5%
Glottocode    text          26.0%
ISO639P3code  text          26.8%
Family        categorical   25.5%
Subfamily     categorical   74.5%
Genus         categorical   25.5%
GenusIcon     categorical   82.5%
ISO_codes     text          26.1%
Samples_100   categorical   25.5%
Samples_200   categorical   25.5%
Country_ID    categorical   25.7%
Source        text          30.1%
Parent_ID     categorical    7.1%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
           Latitude  Longitude
Latitude      +1.00      -0.38
Longitude     -0.38      +1.00

ID text identifier

This is an identifier column: each of the 3,573 rows holds a unique single-token string, with no nulls or duplicates. Values are short (median length 3, max 36), and the vocabulary size equals the row count (3,573), confirming one-to-one uniqueness. The length distribution is bimodal: 2,662 IDs fall in the 2- and 3-character bins and the remaining 911 have eight or more characters. Top tokens such as 'aab', 'aar', and 'aba' suggest short alphabetic codes rather than numeric keys.

Treatment: drop from modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["ID"].stats

stat value
n 3,573
nulls 0 (0.0%)
unique 3,573
len_min 2
len_max 36
len_mean 5.982
len_median 3
len_p95 17
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 3,573
readability_flesch_mean 61.58
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
Fig 8.
Character-length distribution for ID.
Character-length distribution for ID (mean: 5.9818080044780295). Bin edges are rounded to whole characters, so some labels repeat.
chars     count
2 – 3        11
3 – 4      2651
4 – 5         0
5 – 5         0
5 – 6         0
6 – 7         0
7 – 8         0
8 – 9         2
9 – 10       17
10 – 10      53
10 – 11      89
11 – 12     137
12 – 13     122
13 – 14       0
14 – 15     110
15 – 16      90
16 – 16      67
16 – 17      52
17 – 18      48
18 – 19       0
19 – 20      21
20 – 21      22
21 – 22      20
22 – 22      17
22 – 23      10
23 – 24      11
24 – 25       0
25 – 26       7
26 – 27       3
27 – 28       0
28 – 28       2
28 – 29       3
29 – 30       2
30 – 31       0
31 – 32       1
32 – 33       2
33 – 33       2
33 – 34       0
34 – 35       0
35 – 36       1

Name text label

This column holds short proper-noun labels, almost certainly language or ethnonym names, given top values like 'Basque', 'Ainu', 'Beothuk' and frequent words 'sign', 'language', 'arabic', 'german'. Entries are terse (mean 8.7 chars, 80% one-word) but not unique: there are 375 duplicates (10.5%) and only 3,198 distinct names across 3,573 rows, with several names appearing exactly 3 times. This suggests that some names recur across multiple records or variants (e.g. '(northern)', '(southern)'). No nulls, no URLs, no emoji.

Treatment: Treat as a categorical key; deduplicate or join on a normalized form before aggregating.
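
One way to build the normalized form mentioned in the treatment is sketched below. The normalization rules (lowercasing, accent stripping, dropping a trailing parenthetical) are illustrative choices rather than anything Saturn prescribes, and the demo rows are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and drop a trailing parenthetical qualifier."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"\s*\([^)]*\)\s*$", "", text)   # "X (northern)" -> "X"
    return re.sub(r"\s+", " ", text).strip().lower()

def group_by_name(rows):
    """Map each normalized name to the list of IDs that share it."""
    groups = defaultdict(list)
    for row_id, name in rows:
        groups[normalize_name(name)].append(row_id)
    return dict(groups)

# Hypothetical rows, not taken from the dataset.
demo = [("r1", "Example (northern)"), ("r2", "Example (southern)"), ("r3", "Sámple")]
print(group_by_name(demo))
```

Accent stripping can merge names that are genuinely distinct, so groups with more than one ID are candidates for review rather than automatic merges.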

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["Name"].stats

stat value
n 3,573
nulls 0 (0.0%)
unique 3,198
len_min 2
len_max 46
len_mean 8.705
len_median 7
len_p95 19
word_mean 1.258
word_median 1
n_empty 0
n_duplicates 375
duplicate_rate 0.105
vocab_size 3,383
readability_flesch_mean 48.16
emoji_rate 0
url_rate 0
one_word_rate 0.8002
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 80.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
Fig 9.
Character-length distribution for Name.
Character-length distribution for Name (mean: 8.705009795689897).
chars     count
2 – 3       112
3 – 4       361
4 – 5       528
5 – 6       581
6 – 8       502
8 – 9       305
9 – 10      198
10 – 11     152
11 – 12      84
12 – 13      88
13 – 14     131
14 – 15      89
15 – 16      76
16 – 17      77
17 – 18      66
18 – 20      46
20 – 21      34
21 – 22      31
22 – 23      21
23 – 24      19
24 – 25      21
25 – 26       8
26 – 27       9
27 – 28       9
28 – 30       5
30 – 31       2
31 – 32       3
32 – 33       4
33 – 34       2
34 – 35       3
35 – 36       2
36 – 37       1
37 – 38       1
38 – 39       0
39 – 40       0
40 – 42       0
42 – 43       0
43 – 44       0
44 – 45       0
45 – 46       2

Macroarea categorical feature

Macroarea is a coarse geographic grouping with 6 categories (Eurasia, Africa, Papunesia, North America, South America, and Australia), consistent with WALS/Glottolog-style language area labels. The distribution is relatively even (entropy ratio 0.95), so no single region dominates: the top value, Eurasia, covers 24.8% of the 2,662 non-null rows (18.4% of all rows, which is the share shown in the table). The 25.5% null rate is substantial and flagged.

Treatment: One-hot encode and add an explicit 'missing' category to preserve the 25.5% nulls.
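
The entropy figures and the two different percentages for Eurasia can be reproduced from the six counts in this report. This is a sketch that assumes Saturn computes Shannon entropy with base-2 logarithms over non-null rows; the report does not state this, but the assumption matches the reported values to within rounding.

```python
import math

# Counts copied from the Macroarea top-values table in this report.
counts = {"Eurasia": 659, "Africa": 606, "Papunesia": 560,
          "North America": 396, "South America": 258, "Australia": 183}
n_rows = 3573                      # all rows, nulls included
n_non_null = sum(counts.values())  # rows with a Macroarea value

# Shannon entropy in bits over the non-null distribution.
entropy = -sum((c / n_non_null) * math.log2(c / n_non_null)
               for c in counts.values())
# Normalise by the maximum possible entropy for six categories, log2(6).
entropy_ratio = entropy / math.log2(len(counts))

top_rate = counts["Eurasia"] / n_non_null   # share of non-null rows (top_rate)
top_share = counts["Eurasia"] / n_rows      # share of all rows (table column)
print(f"{entropy:.3f} {entropy_ratio:.4f} {top_rate:.4f} {top_share:.3f}")
```

The gap between `top_rate` and `top_share` is entirely due to the 911 null rows in the denominator.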

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["Macroarea"].stats

stat value
n 3,573
nulls 911 (25.5%)
unique 6
top_value Eurasia
top_rate 0.2476
cardinality 6
entropy 2.459
entropy_ratio 0.9511
alert: null_rate · 25.5% null
Fig 10.
Top values for Macroarea.
Top values for Macroarea (6 unique shown, of 6 total).
value          count  share (of all rows)
Eurasia          659  18.4%
Africa           606  17.0%
Papunesia        560  15.7%
North America    396  11.1%
South America    258   7.2%
Australia        183   5.1%

Latitude numeric feature

Geographic latitude in degrees, ranging from -55.0 to 71.25 with a median of 8.29 and IQR of 33.0, consistent with a worldwide point distribution. The 25.5% null rate is notable and flagged, while skew (0.36) and kurtosis (-0.50) indicate a fairly symmetric, slightly flat spread with only one outlier.

Treatment: Impute or filter the 25.5% missing values, and pair with longitude for any geospatial modelling.
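
The single Latitude outlier can be reproduced by hand, assuming Saturn uses the common Tukey rule of 1.5 × IQR beyond the quartiles. The report does not state its outlier rule, so this is a consistency check rather than a description of the tool.

```python
# Quartiles and extremes copied from the Latitude stats in this report.
q1, q3 = -5.0, 28.0
lat_min, lat_max = -55.0, 71.25

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this count as outliers
upper_fence = q3 + 1.5 * iqr    # values above this count as outliers

def is_outlier(value: float) -> bool:
    """True when value lies outside the 1.5 x IQR fences."""
    return value < lower_fence or value > upper_fence

print(lower_fence, upper_fence, is_outlier(lat_min), is_outlier(lat_max))
```

Under this rule the minimum of -55 falls just below the lower fence while the maximum stays inside the upper one, which agrees with n_outliers 1.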

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["Latitude"].stats

stat value
n 3,573
nulls 911 (25.5%)
unique 887
min -55
max 71.25
mean 11.88
median 8.292
std 22.72
q1 -5
q3 28
iqr 33
skew 0.3562
kurtosis -0.5023
n_outliers 1
outlier_rate 0.0003757
zero_rate 0.002254
alert: null_rate · 25.5% null
Fig 11.
Distribution of Latitude. Vertical dash marks the median.
Histogram bins for Latitude (median: 8.291666666665).
bin               count
-55 – -51.84          2
-51.84 – -48.69       1
-48.69 – -45.53       1
-45.53 – -42.38       2
-42.38 – -39.22       3
-39.22 – -36.06       5
-36.06 – -32.91       9
-32.91 – -29.75      18
-29.75 – -26.59      24
-26.59 – -23.44      29
-23.44 – -20.28      47
-20.28 – -17.12      64
-17.12 – -13.97      85
-13.97 – -10.81      86
-10.81 – -7.656     129
-7.656 – -4.5       187
-4.5 – -1.344       190
-1.344 – 1.812      130
1.812 – 4.969       119
4.969 – 8.125       196
8.125 – 11.28       161
11.28 – 14.44       104
14.44 – 17.59       145
17.59 – 20.75        82
20.75 – 23.91        63
23.91 – 27.06        74
27.06 – 30.22        84
30.22 – 33.38        66
33.38 – 36.53        84
36.53 – 39.69        67
39.69 – 42.84        86
42.84 – 46           62
46 – 49.16           74
49.16 – 52.31        52
52.31 – 55.47        44
55.47 – 58.62        18
58.62 – 61.78        20
61.78 – 64.94        23
64.94 – 68.09        19
68.09 – 71.25         7

Longitude numeric feature

Geographic longitude in degrees, spanning nearly the full globe from -178.17 to 179.17 with a mild negative skew (-0.33) and flat kurtosis (-1.05), consistent with a worldwide point distribution. The 25.5% null rate is the main concern. The 2,662 non-null rows contain only 1,360 unique values, suggesting repeated locations or rounded coordinates. No outliers are flagged, as expected for a bounded angular measure.

Treatment: Pair with Latitude for geospatial features; impute or drop the 25.5% missing before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["Longitude"].stats

stat value
n 3,573
nulls 911 (25.5%)
unique 1,360
min -178.2
max 179.2
mean 35.17
median 34.79
std 89.35
q1 -45.75
q3 121
iqr 166.8
skew -0.3259
kurtosis -1.047
n_outliers 0
outlier_rate 0
zero_rate 0.001503
alert: null_rate · 25.5% null
Fig 12.
Distribution of Longitude. Vertical dash marks the median.
Histogram bins for Longitude (median: 34.79166666665).
bin                count
-178.2 – -169.2       13
-169.2 – -160.3        5
-160.3 – -151.4        7
-151.4 – -142.4        8
-142.4 – -133.5        4
-133.5 – -124.6       15
-124.6 – -115.6       96
-115.6 – -106.7       36
-106.7 – -97.77       57
-97.77 – -88.83      107
-88.83 – -79.9        33
-79.9 – -70.97       114
-70.97 – -62.03       95
-62.03 – -53.1        55
-53.1 – -44.17        22
-44.17 – -35.23        7
-35.23 – -26.3         0
-26.3 – -17.37         1
-17.37 – -8.433       42
-8.433 – 0.5         109
0.5 – 9.433          108
9.433 – 18.37        170
18.37 – 27.3         107
27.3 – 36.23         159
36.23 – 45.17         94
45.17 – 54.1          62
54.1 – 63.03          17
63.03 – 71.97         20
71.97 – 80.9          67
80.9 – 89.83          73
89.83 – 98.77        102
98.77 – 107.7         83
107.7 – 116.6         56
116.6 – 125.6        129
125.6 – 134.5        112
134.5 – 143.4        190
143.4 – 152.4        177
152.4 – 161.3         48
161.3 – 170.2         51
170.2 – 179.2         11

Glottocode text foreign_key

This column holds Glottocodes, fixed 8-character language identifiers from the Glottolog catalogue (e.g. 'basq1248', 'stan1295'); every non-null value is a single token of length exactly 8. About 26% of rows are null. The 2,645 non-null records contain 2,502 distinct codes, i.e. 143 duplicates (5.4%); the most repeated code, 'basq1248', appears 11 times, so multiple records can share a language.

Treatment: Left-join on this code to a Glottolog reference table; impute or flag the 26% nulls separately.
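
A left join that keeps null and unmatched rows visible might look like the sketch below. The reference table and its contents are hypothetical placeholders for a Glottolog export; the format check relies on Glottocodes being four lowercase alphanumerics followed by four digits.

```python
import re

# Glottocode shape: four lowercase alphanumerics followed by four digits.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def left_join(rows, reference):
    """Attach reference info per row; keep rows whose code is missing or unmatched."""
    joined = []
    for row in rows:
        code = row.get("Glottocode") or None          # empty string counts as null
        valid = bool(code and GLOTTOCODE.fullmatch(code))
        joined.append({**row,
                       "glottocode_missing": code is None,
                       "glottocode_valid": valid,
                       "ref": reference.get(code) if valid else None})
    return joined

# Hypothetical reference table keyed by the two example codes quoted above.
reference = {"basq1248": {"label": "ref row A"}, "stan1295": {"label": "ref row B"}}
rows = [{"ID": "r1", "Glottocode": "basq1248"}, {"ID": "r2", "Glottocode": ""}]
result = left_join(rows, reference)
print(result)
```

Carrying explicit `glottocode_missing` and `glottocode_valid` flags keeps the 26% nulls distinguishable from codes that are present but absent from the reference table.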

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["Glottocode"].stats

stat value
n 3,573
nulls 928 (26.0%)
unique 2,502
len_min 8
len_max 8
len_mean 8
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 143
duplicate_rate 0.05406
vocab_size 2,502
readability_flesch_mean 92.88
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: null_rate · 26.0% null
alert: short_text · 95th-percentile length under 20 chars
Fig 13.
Character-length distribution for Glottocode.
Character-length distribution for Glottocode (mean: 8.0).
chars   count
8 – 8    2645
All 40 bins are labelled 8 – 8; the other 39 are empty, so every non-null value has length 8.

ISO639P3code text foreign_key

This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and a single word (len_mean 3.0, one_word_rate 1.0), with familiar codes like 'eus' (Basque), 'deu' (German), and 'gsw' (Swiss German) at the top. Coverage is incomplete: 26.84% of rows are null, and the 2,614 non-null rows contain 2,442 unique codes, i.e. 172 duplicates (6.58%). Nothing in the evidence indicates which entity each code is tagging.

Treatment: Treat as a categorical join key to an ISO 639-3 reference table; impute or filter the 26.84% nulls before use.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["ISO639P3code"].stats

stat value
n 3,573
nulls 959 (26.8%)
unique 2,442
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 172
duplicate_rate 0.0658
vocab_size 2,442
readability_flesch_mean 119.5
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: null_rate · 26.8% null
alert: short_text · 95th-percentile length under 20 chars
Fig 14.
Character-length distribution for ISO639P3code.
Character-length distribution for ISO639P3code (mean: 3.0).
chars   count
3 – 3    2614
The other 39 bins are empty, so every non-null value has length 3.

Family categorical feature

Categorical label assigning rows to one of 254 language families, headed by Niger-Congo and Austronesian (tied at 324 rows, which is 12.2% of non-null rows and 9.1% of all rows each). The tail is long: 254 families share the 2,662 labelled rows, and an entropy ratio of 0.705 indicates a distribution that is spread across families rather than dominated by a few. 25.5% of rows are null, a substantial gap for what looks like a taxonomic feature.

Treatment: Impute or add an explicit 'unknown' category for the 25.5% nulls, then group rare families before encoding.
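
The treatment above can be sketched as a small helper. The threshold and the two sentinel labels are hypothetical; distinct sentinels are used because Family already contains a literal value 'other' (72 rows), which a generic "other" bucket would silently absorb.

```python
from collections import Counter

def group_rare(values, min_count, rare_label="__rare__", missing_label="__unknown__"):
    """Replace nulls with missing_label and infrequent levels with rare_label."""
    counts = Counter(v for v in values if v is not None)
    out = []
    for v in values:
        if v is None:
            out.append(missing_label)      # explicit level for nulls
        elif counts[v] < min_count:
            out.append(rare_label)         # pooled long-tail families
        else:
            out.append(v)
    return out

# Toy column: the family names appear in this report, the frequencies do not.
column = ["Niger-Congo"] * 3 + ["Austronesian"] * 3 + ["Mayan", None]
print(Counter(group_rare(column, min_count=2)))
```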

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["Family"].stats

stat value
n 3,573
nulls 911 (25.5%)
unique 254
top_value Niger-Congo
top_rate 0.1217
cardinality 254
entropy 5.631
entropy_ratio 0.7049
alert: null_rate · 25.5% null
Fig 15.
Top values for Family.
Top values for Family (20 unique shown, of 254 total).
value              count  share (of all rows)
Niger-Congo          324   9.1%
Austronesian         324   9.1%
Indo-European        176   4.9%
Sino-Tibetan         146   4.1%
Afro-Asiatic         145   4.1%
Pama-Nyungan         121   3.4%
Trans-New Guinea      98   2.7%
other                 72   2.0%
Altaic                65   1.8%
Oto-Manguean          56   1.6%
Austro-Asiatic        48   1.3%
Eastern Sudanic       47   1.3%
Uto-Aztecan           44   1.2%
Algic                 31   0.9%
Mayan                 30   0.8%
Arawakan              29   0.8%
Nakh-Daghestanian     28   0.8%
Mande                 28   0.8%
Uralic                27   0.8%
Hokan                 26   0.7%

Subfamily categorical feature

This column records the linguistic subfamily classification of entries, drawn from a controlled vocabulary of 32 values such as Benue-Congo, Eastern Malayo-Polynesian, and Tibeto-Burman. Coverage is the main concern: 74.5% of rows are null, leaving only 911 labelled records, with Benue-Congo accounting for 21.95% of those (200 rows). Among populated rows the distribution is reasonably diverse (entropy ratio 0.77), so the signal is informative where present but sparse overall.

Treatment: Treat as a sparse categorical: impute an explicit 'unknown' level before encoding, since 74.5% are null.

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["Subfamily"].stats

stat value
n 3,573
nulls 2,662 (74.5%)
unique 32
top_value Benue-Congo
top_rate 0.2195
cardinality 32
entropy 3.856
entropy_ratio 0.7712
alert: null_rate · 74.5% null
Fig 16.
Top values for Subfamily.
Top values for Subfamily (20 unique shown, of 32 total).
value                      count  share (of all rows)
Benue-Congo                  200   5.6%
Eastern Malayo-Polynesian    159   4.5%
Tibeto-Burman                139   3.9%
Chadic                        47   1.3%
Mon-Khmer                     38   1.1%
Adamawa-Ubangi                30   0.8%
Gur                           27   0.8%
Daghestanian                  25   0.7%
Cushitic                      24   0.7%
Finno-Ugric                   21   0.6%
Kwa                           20   0.6%
North-Central Atlantic        20   0.6%
Nilotic                       19   0.5%
Mixtecan                      18   0.5%
Omotic                        15   0.4%
Kainantu-Goroka               14   0.4%
Madang                        13   0.4%
Awyu-Ok                       10   0.3%
Surmic                        10   0.3%
Je                             9   0.3%

Genus categorical feature

Genus is a linguistic genus label, a classification level below the family, with values like Oceanic, Bantu, Indic, and Semitic. It is highly diverse: 625 distinct genera with an entropy ratio of 0.86, and the top value, Oceanic, covers only 5.6% of the 2,662 non-null rows. 25.5% of rows are null, which is the flagged concern.

Treatment: Treat as a high-cardinality categorical: target- or frequency-encode and explicitly model the 25.5% missing as its own category.
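
Frequency encoding with missing values as their own level could be sketched as follows. The sentinel label is hypothetical, and target encoding is left out because this report defines no target variable.

```python
from collections import Counter

def frequency_encode(values, missing_label="__missing__"):
    """Map each category to its relative frequency, treating None as its own level."""
    filled = [missing_label if v is None else v for v in values]
    counts = Counter(filled)
    total = len(filled)
    return [counts[v] / total for v in filled]

# Toy column: the genus names appear in this report, the frequencies are illustrative.
column = ["Oceanic", "Oceanic", "Bantu", None]
print(frequency_encode(column))
```

With 625 genera this yields a single numeric column instead of hundreds of one-hot columns, at the cost of mapping equally frequent genera to the same value.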

anthropic:claude-opus-4-7 · confidence high
Out[40]:

saturn.columns["Genus"].stats

stat value
n 3,573
nulls 911 (25.5%)
unique 625
top_value Oceanic
top_rate 0.05597
cardinality 625
entropy 7.95
entropy_ratio 0.856
alert: null_rate · 25.5% null
Fig 17.
Top values for Genus.
Top values for Genus (20 unique shown, of 625 total).
value                      count  share (of all rows)
Oceanic                      149   4.2%
Bantu                        141   3.9%
Indic                         50   1.4%
Western Pama-Nyungan          49   1.4%
Semitic                       43   1.2%
Turkic                        41   1.1%
Sign Languages                40   1.1%
Bodic                         40   1.1%
Germanic                      39   1.1%
Northern Pama-Nyungan         33   0.9%
Creoles and Pidgins           32   0.9%
Mayan                         30   0.8%
Algonquian                    29   0.8%
Central Malayo-Polynesian     29   0.8%
Iranian                       26   0.7%
Romance                       24   0.7%
Biu-Mandara                   24   0.7%
Southeastern Pama-Nyungan     23   0.6%
Dravidian                     23   0.6%
Malayo-Sumbawan               22   0.6%

GenusIcon categorical metadata

GenusIcon is a high-cardinality categorical: 82.51% of rows are null, and the 625 non-null rows carry 613 unique values. An entropy ratio of 0.9989 and a top_rate of just 0.0032 mean the non-null values are nearly uniformly distributed, with the most frequent code, 'c688033', appearing only twice. The hex-like tokens (e.g. 'c807D33') suggest icon identifiers or color/asset codes rather than a meaningful category.

Treatment: Drop or retain as a sparse asset reference; not useful as a modelling feature given near-unique values and 82.51% nulls.

anthropic:claude-opus-4-7 · confidence high
Out[43]:

saturn.columns["GenusIcon"].stats

stat value
n 3,573
nulls 2,948 (82.5%)
unique 613
top_value c688033
top_rate 0.0032
cardinality 613
entropy 9.249
entropy_ratio 0.9989
alert: long_tail · 601 singleton categories
alert: null_rate · 82.5% null
Fig 18.
Top values for GenusIcon.
Top values for GenusIcon (20 unique shown, of 613 total).
value    count  share (of all rows)
c688033      2  0.1%
c803E33      2  0.1%
c804733      2  0.1%
c807D33      2  0.1%
c806233      2  0.1%
c805033      2  0.1%
c7A8033      2  0.1%
c805933      2  0.1%
c807433      2  0.1%
c806B33      2  0.1%
c718033      2  0.1%
c803533      2  0.1%
cCC8C51      1  0.0%
cCC6851      1  0.0%
cCC7E51      1  0.0%
c8FCC51      1  0.0%
cCC8051      1  0.0%
c528033      1  0.0%
cCC9F51      1  0.0%
cCCB551      1  0.0%

ISO_codes text feature

Almost certainly ISO 639-3 language codes: 99% are single tokens, length is tightly clustered at 3 characters (min 3, max 7, p95 3), and top values like 'eus', 'deu', 'gsw', 'bod', 'roh' are recognisable three-letter language identifiers. Cardinality is high (2,468 unique among 2,640 non-null rows, 172 duplicates) with a 26.1% null rate, so coverage is partial and no single code dominates (top value 'eus' appears just 12 times). The length table shows 26 entries of length 7 and none of length 4 to 6. That count matches the multi-word share (one_word_rate 0.9902 = 2,614/2,640), which would be consistent with two three-letter codes joined by a separator; this is anomalous for a strict ISO 639-3 field and worth inspecting.

Treatment: Treat as a categorical code; validate against the ISO 639-3 list and investigate entries longer than 3 characters.
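
A first-pass shape check for this column is sketched below. It assumes that cells holding several codes separate them with whitespace, which the report does not confirm, and it tests only the three-lowercase-letter shape; membership in the ISO 639-3 registry would need the official code table.

```python
import re

ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")

def split_codes(value):
    """Split a cell into candidate codes and report whether all have the ISO 639-3 shape."""
    if not value:
        return [], False
    parts = value.split()   # assumption: multiple codes are whitespace-separated
    ok = all(ISO_639_3_SHAPE.fullmatch(p) for p in parts)
    return parts, ok

# 'eus', 'deu', 'gsw' appear in this report; pairing two in one cell is hypothetical.
for cell in ["eus", "deu gsw", "EUS", ""]:
    print(repr(cell), split_codes(cell))
```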

anthropic:claude-opus-4-7 · confidence high
Out[46]:

saturn.columns["ISO_codes"].stats

stat value
n 3,573
nulls 933 (26.1%)
unique 2,468
len_min 3
len_max 7
len_mean 3.039
len_median 3
len_p95 3
word_mean 1.01
word_median 1
n_empty 0
n_duplicates 172
duplicate_rate 0.06515
vocab_size 2,486
readability_flesch_mean 117.4
emoji_rate 0
url_rate 0
one_word_rate 0.9902
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 99.0% rows are a single word
alert: null_rate · 26.1% null
alert: short_text · 95th-percentile length under 20 chars
Fig 19.
Character-length distribution for ISO_codes.
Character-length distribution for ISO_codes (mean: 3.0393939393939395).
chars   count
3 – 3    2614
7 – 7      26
The 38 bins between these two are empty.

Samples_100 categorical feature

Boolean flag with only two values (False/True) where False dominates at 96.2% of non-null rows (2562 vs 100). The name 'Samples_100' plus the exact count of 100 True values suggests this marks a curated subset of 100 sampled records. A 25.5% null rate is notable and should be reconciled before use.

Treatment: Treat as a boolean subset indicator; impute or exclude nulls and avoid using as a model feature given severe imbalance.

anthropic:claude-opus-4-7 · confidence high
Out[49]:

saturn.columns["Samples_100"].stats

stat value
n 3,573
nulls 911 (25.5%)
unique 2
top_value False
top_rate 0.9624
cardinality 2
entropy 0.231
entropy_ratio 0.231
alert: null_rate · 25.5% null
alert: imbalance · top value is 96.2% of rows
Fig 20.
Top values for Samples_100.
Top values for Samples_100 (2 unique shown, of 2 total).
value  count  share (of all rows)
False   2562  71.7%
True     100   2.8%

Samples_200 categorical metadata

Binary True/False flag, almost certainly indicating membership in a 200-row sample (the name 'Samples_200' and the exact count of 200 'True' values support this). The column is heavily imbalanced — 'False' covers 92.5% of non-null rows — and 25.5% of values are null, which is unusual for a sampling indicator and worth investigating.

Treatment: Use as a boolean filter/split flag; reconcile the 25.5% nulls (treat as False or exclude) before relying on it.
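
The two ways of reconciling nulls named in the treatment can be made explicit in a small parser. This sketch assumes the flag is serialised as the strings 'True' and 'False' with blank cells for nulls, which matches the values shown in this report but has not been checked against the raw file.

```python
def parse_flag(cell, null_as_false=False):
    """Parse 'True'/'False' text into a bool; blank cells become None or False."""
    if cell in ("", None):
        return False if null_as_false else None
    if cell not in ("True", "False"):
        raise ValueError(f"unexpected flag value: {cell!r}")
    return cell == "True"

cells = ["True", "False", ""]
strict = [parse_flag(c) for c in cells]                        # keeps nulls visible
lenient = [parse_flag(c, null_as_false=True) for c in cells]   # folds nulls into False
print(strict, lenient)
```

The strict form is the safer default while the origin of the 25.5% nulls is still unresolved.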

anthropic:claude-opus-4-7 · confidence high
Out[52]:

saturn.columns["Samples_200"].stats

stat value
n 3,573
nulls 911 (25.5%)
unique 2
top_value False
top_rate 0.9249
cardinality 2
entropy 0.3848
entropy_ratio 0.3848
alert: null_rate · 25.5% null
Fig 21.
Top values for Samples_200.
Top values for Samples_200 (2 unique shown, of 2 total).
value  count  share (of all rows)
False   2462  68.9%
True     200   5.6%

Country_ID categorical foreign_key

Two-letter country codes (PG, AU, US, ID, IN...) identifying the country associated with each record, with 337 distinct values among 2,655 non-null rows. The cardinality is suspiciously high, since ISO 3166-1 alpha-2 defines only about 250 codes, hinting at non-standard or sub-region codes mixed in. The distribution is fairly flat (entropy ratio 0.752; the top value, PG, covers only 8.06% of non-null rows) and 25.69% of rows are null.

Treatment: Validate codes against ISO 3166, impute or flag the 25.69% nulls, then left-join on this id.
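
Before validating against ISO 3166, it may help to see what shapes the 337 distinct values actually take. The sketch below buckets cells by shape only; the idea that some cells hold several space-separated codes is a guess to be tested, not something the report establishes.

```python
import re
from collections import Counter

ALPHA2_SHAPE = re.compile(r"[A-Z]{2}")

def classify_country_id(value):
    """Label a cell as 'missing', 'alpha2', 'multi' (several codes), or 'other'."""
    if not value:
        return "missing"
    parts = value.split()   # assumption: multi-country cells are space-separated
    if all(ALPHA2_SHAPE.fullmatch(p) for p in parts):
        return "alpha2" if len(parts) == 1 else "multi"
    return "other"

# Hypothetical cells; only the codes PG and AU are taken from this report.
cells = ["PG", "AU", "PG AU", "", "Papua"]
print(Counter(classify_country_id(c) for c in cells))
```

Cells landing in 'multi' or 'other' would be a candidate explanation for the cardinality exceeding the size of the alpha-2 code list.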

anthropic:claude-opus-4-7 · confidence high
Out[55]:

saturn.columns["Country_ID"].stats

stat value
n 3,573
nulls 918 (25.7%)
unique 337
top_value PG
top_rate 0.0806
cardinality 337
entropy 6.314
entropy_ratio 0.752
alert: null_rate · 25.7% null
Fig 22.
Top values for Country_ID.
Top values for Country_ID (20 unique shown, of 337 total).
value  count  share (of all rows)
PG       214   6.0%
AU       185   5.2%
US       177   5.0%
ID       177   5.0%
IN       120   3.4%
MX       120   3.4%
RU        89   2.5%
NG        66   1.8%
BR        66   1.8%
CN        54   1.5%
CD        49   1.4%
CM        46   1.3%
CA        45   1.3%
CO        39   1.1%
ET        36   1.0%
PH        36   1.0%
PE        35   1.0%
NP        32   0.9%
TZ        28   0.8%
VU        28   0.8%

Source text metadata

This column holds bibliographic citation tags (e.g., 'Huber-and-Reed-1992', 'Boelaars-1950'), almost certainly the source reference for each row in what appears to be a linguistic dataset. About 45% of non-null values are a single token and the median length is 25 chars, consistent with compact Author-Year keys; the mean of 2.85 words and the maximum length of 452 chars indicate that some cells list several keys. 30.1% of rows are null, and 2,373 of the 2,499 non-null values are unique, with only 126 duplicates. Top citations like 'nichols-1992' and 'malherbe-and-rosenberg-1996' (113 occurrences each) recur most often, suggesting a few reference works supply many entries; with only 126 duplicated cells, those counts are presumably per citation key rather than per cell.

Treatment: Normalize casing and keep as a categorical provenance tag; impute or flag the 30% nulls rather than modelling the text.

anthropic:claude-opus-4-7 · confidence high
Out[58]:

saturn.columns["Source"].stats

stat value
n 3,573
nulls 1,074 (30.1%)
unique 2,373
len_min 7
len_max 452
len_mean 42.07
len_median 25
len_p95 135
word_mean 2.854
word_median 2
n_empty 0
n_duplicates 126
duplicate_rate 0.05042
vocab_size 5,899
readability_flesch_mean 21.33
emoji_rate 0
url_rate 0
one_word_rate 0.4546
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 45.5% rows are a single word
alert: null_rate · 30.1% null
Fig 23.
Character-length distribution for Source.
Character-length distribution for Source (mean: 42.07122849139656).
chars       count
7 – 18        937
18 – 29       518
29 – 40       280
40 – 52       203
52 – 63       136
63 – 74        90
74 – 85        61
85 – 96        42
96 – 107       35
107 – 118      36
118 – 129      27
129 – 140      21
140 – 152      12
152 – 163      13
163 – 174       6
174 – 185      14
185 – 196       6
196 – 207       5
207 – 218       3
218 – 230       4
230 – 241       4
241 – 252       5
252 – 263       3
263 – 274       6
274 – 285       7
285 – 296       3
296 – 307       4
307 – 318       1
318 – 330       1
330 – 341       3
341 – 352       1
352 – 363       1
363 – 374       0
374 – 385       2
385 – 396       1
396 – 408       1
408 – 419       4
419 – 430       1
430 – 441       0
441 – 452       2

Parent_ID categorical foreign_key

Parent_ID looks like a foreign key pointing to the parent node in the classification, usually a genus (e.g. 'genus-oceanic', 'genus-bantu') and sometimes a family (e.g. 'family-austronesian'); the 3,319 non-null rows reference 911 distinct parents. The distribution is long-tailed but not dominated: the top value covers only 4.5% of non-null rows and entropy is 8.55 (ratio 0.87), so most parents have few members. About 7.1% of values are null, which will need a decision before any join or grouping.

Treatment: Left-join on this id to a genus lookup; impute or flag the 7.1% nulls before grouping.
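
Because the top values mix 'genus-' and 'family-' prefixes, a single genus lookup may not cover every parent. A quick way to see which lookup tables are needed is to tally the prefix of each Parent_ID, as in this sketch; the helper name and the 'unprefixed' label are hypothetical.

```python
from collections import Counter

def parent_level(parent_id):
    """Return the prefix before the first hyphen ('genus', 'family', ...) or None."""
    if not parent_id:
        return None                      # null parent
    prefix, sep, _rest = parent_id.partition("-")
    return prefix if sep else "unprefixed"

# Values copied from the Parent_ID top-values table, plus one blank cell.
cells = ["genus-oceanic", "genus-bantu", "family-austronesian", ""]
print(Counter(parent_level(c) for c in cells))
```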

anthropic:claude-opus-4-7 · confidence high
Out[61]:

saturn.columns["Parent_ID"].stats

stat value
n 3,573
nulls 254 (7.1%)
unique 911
top_value genus-oceanic
top_rate 0.04489
cardinality 911
entropy 8.554
entropy_ratio 0.87
alert: long_tail · 501 singleton categories
Fig 24.
Top values for Parent_ID.
Top values for Parent_ID (20 unique shown, of 911 total).
value                          count  share (of all rows)
genus-oceanic                    149   4.2%
genus-bantu                      141   3.9%
genus-indic                       50   1.4%
genus-westernpamanyungan          49   1.4%
genus-semitic                     43   1.2%
genus-turkic                      41   1.1%
genus-signlanguages               40   1.1%
genus-bodic                       40   1.1%
genus-germanic                    39   1.1%
genus-northernpamanyungan         33   0.9%
genus-creolesandpidgins           32   0.9%
genus-mayan                       30   0.8%
family-austronesian               30   0.8%
genus-algonquian                  29   0.8%
genus-centralmalayopolynesian     29   0.8%
genus-iranian                     26   0.7%
family-transnewguinea             25   0.7%
genus-romance                     24   0.7%
genus-biumandara                  24   0.7%
genus-southeasternpamanyungan     23   0.6%

How to cite


BibTeX
@misc{saturn-language-data-wals-languages-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: language data wals languages},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/language-data-wals_languages}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: language data wals languages. Source: /home/coolhand/datasets/language-data/wals_languages.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/language-data-wals_languages