
data raw wals language

source: /home/coolhand/servers/diachronica/data_raw/wals_language.csv · 3,573 rows · 17 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a catalogue of 3,573 world languages from WALS, with identifiers (ISO codes, Glottocode), names, geographic coordinates, and classification fields (Family, Genus, Subfamily, Macroarea) plus reference sources and sampling flags. The geographic and genealogical breakdowns are the most informative starting point: Macroarea splits across six regions led by Eurasia (659) and Africa (606), while Family is led by Niger-Congo and Austronesian (324 each, about 12% of populated rows apiece). Worth a closer look: roughly a quarter of rows are missing core fields like Family, Genus, Macroarea, and coordinates (911 of 3,573, null rate ~0.255), and Subfamily is 74.5% null, which will limit any subfamily-level analysis. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 True values respectively), reflecting their role as curated sub-samples rather than balanced categories.

citing: columns · row_count · kinds
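
The null rates quoted in the summary can be re-derived per column. The sketch below is a minimal illustration using only the standard library; it assumes missing values appear as empty strings in the CSV, and the inline sample rows are made up for the example, not taken from WALS.

```python
import csv
import io

def null_rates(rows, columns):
    """Return {column: fraction of rows whose value is empty or missing}."""
    total = len(rows)
    rates = {}
    for col in columns:
        missing = sum(1 for r in rows if (r.get(col) or "").strip() == "")
        rates[col] = missing / total if total else 0.0
    return rates

# Tiny illustrative sample (hypothetical rows, not real WALS data).
SAMPLE = """ID,Name,Family,Subfamily
aaa,Alpha,Fam1,
bbb,Beta,Fam1,Sub1
ccc,Gamma,,
ddd,Delta,Fam2,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rates = null_rates(rows, ["ID", "Family", "Subfamily"])
print(rates)
```

To run it on the real file, replace the `io.StringIO(SAMPLE)` handle with an open handle on the CSV path shown in the header.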

Schema

17 columns
Per-column summary.

Column        Kind         Nulls   Unique  Alerts
ID            text         0.0%    3,573   near_unique one_word short_text
Name          text         0.0%    3,198   one_word short_text
Macroarea     categorical  25.5%   6       null_rate
Latitude      numeric      25.5%   887     null_rate
Longitude     numeric      25.5%   1,360   null_rate
Glottocode    text         26.0%   2,502   one_word null_rate short_text
ISO639P3code  text         26.8%   2,442   one_word null_rate short_text
Family        categorical  25.5%   254     null_rate
Subfamily     categorical  74.5%   32      null_rate
Genus         categorical  25.5%   625     null_rate
GenusIcon     categorical  82.5%   613     long_tail null_rate
ISO_codes     text         26.1%   2,468   one_word null_rate short_text
Samples_100   categorical  25.5%   2       null_rate imbalance
Samples_200   categorical  25.5%   2       null_rate
Country_ID    categorical  25.7%   337     null_rate
Source        text         30.1%   2,373   one_word null_rate
Parent_ID     categorical  7.1%    911     long_tail

ID

text identifier near_unique one_word short_text
Column 'ID' is a unique row identifier: all 3573 values are distinct (n_unique equals n), every value is a single token (one_word_rate 1.0), and there are no nulls or duplicates. Lengths range from 2 to 36 characters with a median of 3, and the top tokens (aab, aar, aba, abb…) suggest short alphabetic codes rather than numeric keys. Treatment: Use as a join key; drop from modelling features. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 0 (0.0%)
unique: 3,573
len_min: 2
len_max: 36
len_mean: 5.982
len_median: 3
len_p95: 17
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 3,573
readability_flesch_mean: 61.58
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
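
Before relying on ID as a join key, it is worth re-checking the two properties the profile reports: no empty values and no repeats. The helper names below are hypothetical; this is a small sketch of that check rather than part of the profiler.

```python
from collections import Counter

def duplicated_keys(values):
    """Return the sorted list of non-empty values that occur more than once."""
    counts = Counter(v for v in values if v)
    return sorted(k for k, c in counts.items() if c > 1)

def is_valid_join_key(values):
    """True when every value is non-empty and no value repeats."""
    values = list(values)
    has_no_blanks = all(v not in (None, "") for v in values)
    return has_no_blanks and not duplicated_keys(values)

print(is_valid_join_key(["aab", "aar", "aba", "abb"]))
print(duplicated_keys(["aab", "aar", "aab"]))
```

The same check applied to Name would fail, which is consistent with the 375 repeated names reported for that column.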

Name

text label one_word short_text
This column holds short proper-noun labels, almost certainly language names (top values like Basque, Ainu, Beothuk, Atakapa, and frequent words 'sign', 'language', 'arabic', 'mixtec' all point to a linguistic registry). Entries are mostly single tokens (one_word_rate 0.80, word_mean 1.26, len_mean 8.7), with a 46-character maximum among the longer parenthesised variants such as '(northern)' and '(southern)'. Notably, 375 of the 3,573 rows (10.5%) repeat a name already seen, leaving 3,198 unique values; names like 'Abun', 'Andoke', 'Basque' each appear 3 times, suggesting the dataset repeats names across some other dimension, so Name is not a clean key. Treatment: Treat as a descriptive language label; join on ID rather than Name, and deduplicate only after checking what distinguishes the repeated rows. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 0 (0.0%)
unique: 3,198
len_min: 2
len_max: 46
len_mean: 8.705
len_median: 7
len_p95: 19
word_mean: 1.258
word_median: 1
n_empty: 0
n_duplicates: 375
duplicate_rate: 0.105
vocab_size: 3,383
readability_flesch_mean: 48.16
emoji_rate: 0
url_rate: 0
one_word_rate: 0.8002
allcaps_rate: 0
boilerplate_rate: 0

Macroarea

categorical feature null_rate
Macroarea is a categorical geographic grouping with 6 values covering the standard continental/linguistic macroareas (Eurasia, Africa, Papunesia, North America, South America, Australia). The distribution is fairly balanced: entropy ratio is 0.95 and the top value Eurasia accounts for only 24.8% of the 2,662 non-null rows. The main concern is a 25.5% null rate, meaning 911 of the 3,573 rows lack any macroarea assignment. Treatment: One-hot encode and decide whether to impute or add an explicit 'unknown' bucket for the 25.5% nulls. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 911 (25.5%)
unique: 6
top_value: Eurasia
top_rate: 0.2476
cardinality: 6
entropy: 2.459
entropy_ratio: 0.9511
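
One way to follow the treatment above is to one-hot encode the six macroareas and route nulls to an explicit extra level. The sketch below assumes nulls arrive as empty strings or None; the `unknown` label and the function name are illustrative choices, not something the report prescribes.

```python
# The six levels named in the column description, in alphabetical order.
MACROAREAS = [
    "Africa",
    "Australia",
    "Eurasia",
    "North America",
    "Papunesia",
    "South America",
]

def one_hot_macroarea(value, unknown="unknown"):
    """Return a 7-element 0/1 vector: six macroareas plus an 'unknown' slot."""
    levels = MACROAREAS + [unknown]
    label = value if value in MACROAREAS else unknown
    return [1 if level == label else 0 for level in levels]

print(one_hot_macroarea("Eurasia"))
print(one_hot_macroarea(""))
```

Keeping the unknown slot preserves the 25.5% of rows that would otherwise be dropped or silently imputed.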

Latitude

numeric feature null_rate
Geographic latitude in decimal degrees, ranging from -55.0 to 71.25 with a median of 8.29 and a mean of 11.88, consistent with global coverage skewed slightly toward the northern hemisphere. About 25.5% of rows are null, a notable gap for a positional field, and only 887 unique values across the 2,662 populated rows suggest coordinates are rounded or tied to a limited set of locations. Distribution is near-symmetric (skew 0.36, kurtosis -0.50) with just one outlier flagged. Treatment: Pair with Longitude for geospatial features; impute or filter the 25.5% nulls before modelling. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 911 (25.5%)
unique: 887
min: -55
max: 71.25
mean: 11.88
median: 8.292
std: 22.72
q1: -5
q3: 28
iqr: 33
skew: 0.3562
kurtosis: -0.5023
n_outliers: 1
outlier_rate: 0.0003757
zero_rate: 0.002254

Longitude

numeric feature null_rate
Geographic longitude in decimal degrees, spanning -178.17 to 179.17 with 1,360 distinct values across the 2,662 populated rows. The distribution is roughly symmetric (skew -0.33) but flat (kurtosis -1.05) with an IQR of 166.75, consistent with truly global coverage rather than a regional sample. Notable concern: 25.5% of rows are null, so a quarter of the rows will silently drop out of any geospatial join. Treatment: Pair with Latitude and impute or filter the 25.5% nulls before any geospatial modelling. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 911 (25.5%)
unique: 1,360
min: -178.2
max: 179.2
mean: 35.17
median: 34.79
std: 89.35
q1: -45.75
q3: 121
iqr: 166.8
skew: -0.3259
kurtosis: -1.047
n_outliers: 0
outlier_rate: 0
zero_rate: 0.001503
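
Since Latitude and Longitude are only useful as a pair, filtering can be done in one step so that the dropped rows are counted rather than lost silently. This is a sketch under the assumption that coordinates are stored as strings with empty values for nulls; the function names are hypothetical.

```python
def parse_coordinates(row):
    """Return (lat, lon) as floats, or None when either is missing or out of range."""
    try:
        lat = float(row.get("Latitude") or "")
        lon = float(row.get("Longitude") or "")
    except ValueError:
        return None
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon

def split_by_coordinates(rows):
    """Return (rows with usable coordinates, rows without)."""
    located, unlocated = [], []
    for row in rows:
        (located if parse_coordinates(row) else unlocated).append(row)
    return located, unlocated

sample = [
    {"ID": "x1", "Latitude": "71.25", "Longitude": "-178.17"},
    {"ID": "x2", "Latitude": "", "Longitude": ""},
]
located, unlocated = split_by_coordinates(sample)
print(len(located), len(unlocated))
```

Reporting the size of the second list alongside any map or spatial join makes the quarter of rows without coordinates visible.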

Glottocode

text foreign_key one_word null_rate short_text
This is a Glottocode field: fixed 8-character language identifiers from the Glottolog catalogue (e.g. basq1248, swis1247), with every value being a single token. About 26% of rows are null (928 of 3,573). The 2,645 populated rows hold 2,502 distinct codes, so 143 rows (5.4%) repeat a code already seen; basq1248 leads with 11 occurrences. Length is rigidly 8 for min, median, and max, consistent with a controlled vocabulary identifier rather than free text. Treatment: Treat as a categorical key; left-join to Glottolog metadata and handle the 26% nulls explicitly. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 928 (26.0%)
unique: 2,502
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 143
duplicate_rate: 0.05406
vocab_size: 2,502
readability_flesch_mean: 92.88
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
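
A cheap sanity check before joining to Glottolog is to confirm each value has the expected shape. The pattern below (four lowercase alphanumerics followed by four digits) is an assumption based on the examples in the description and the usual Glottocode format; it checks shape only, not whether the code exists in Glottolog.

```python
import re

# Assumed shape: 4 lowercase letters/digits + 4 digits, e.g. "basq1248".
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def is_glottocode(value):
    """True when value looks like a Glottocode; empty or None returns False."""
    return bool(value) and GLOTTOCODE_RE.fullmatch(value) is not None

for candidate in ["basq1248", "swis1247", "eus", ""]:
    print(repr(candidate), is_glottocode(candidate))
```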

ISO639P3code

text foreign_key one_word null_rate short_text
This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0), with examples like 'eus', 'gsw', 'deu' matching the standard. Coverage is incomplete: 26.84% of rows are null. The 2,614 populated rows hold 2,442 unique codes, a 6.58% duplicate rate (172 repeats), so most codes occur only once or twice (top value 'eus' at 12). Treatment: Treat as a categorical key; left-join to an ISO 639-3 reference table and decide on an explicit bucket for the 26.84% nulls. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 959 (26.8%)
unique: 2,442
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 172
duplicate_rate: 0.0658
vocab_size: 2,442
readability_flesch_mean: 119.5
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
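
The duplicate figures for the text columns can be cross-checked from the other reported counts. The sketch assumes the profiler counts duplicates as non-null rows minus unique values and divides by the non-null count; that definition is inferred, but it reproduces the numbers shown for this column and for Glottocode.

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Return (n_duplicates, duplicate_rate), both computed over non-null rows."""
    non_null = n_rows - n_nulls
    if non_null <= 0:
        return 0, 0.0
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null

# Counts taken from the ISO639P3code stats above.
dups, rate = duplicate_stats(n_rows=3573, n_nulls=959, n_unique=2442)
print(dups, round(rate, 4))
```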

Family

categorical feature null_rate
Categorical label for the language family of each row, with 254 distinct families across the 2,662 populated records. Distribution is long-tailed but not dominated: top value 'Niger-Congo' covers only 12.2% of non-null rows and ties exactly with 'Austronesian' at 324 each, with entropy ratio 0.70 indicating spread across many small families. Notable concern: 25.5% of rows are null, and a literal 'other' bucket already accounts for 72 rows. Treatment: Impute or flag the 25.5% missing, then group rare families before one-hot or target encoding. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 911 (25.5%)
unique: 254
top_value: Niger-Congo
top_rate: 0.1217
cardinality: 254
entropy: 5.631
entropy_ratio: 0.7049
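
Grouping rare families, as the treatment suggests, could look like the sketch below. The threshold and the bucket labels are illustrative assumptions; the rare bucket is deliberately named so it cannot collide with the literal 'other' value already present in the data.

```python
from collections import Counter

def group_rare(values, min_count, rare="__rare__", missing="__missing__"):
    """Replace values seen fewer than min_count times; tag nulls separately."""
    counts = Counter(v for v in values if v)
    grouped = []
    for v in values:
        if not v:
            grouped.append(missing)
        elif counts[v] < min_count:
            grouped.append(rare)
        else:
            grouped.append(v)
    return grouped

families = ["Niger-Congo", "Niger-Congo", "Austronesian", "Austronesian", "other", None, ""]
print(group_rare(families, min_count=2))
```

Counting is done once over the whole column, so the result does not depend on row order.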

Subfamily

categorical feature null_rate
This column records the linguistic subfamily classification of each row, with 32 distinct values dominated by Benue-Congo (200 occurrences, 21.95% of non-nulls), Eastern Malayo-Polynesian (159), and Tibeto-Burman (139). The striking issue is the 74.5% null rate — only about a quarter of the 3573 rows carry a subfamily label — yet entropy ratio of 0.77 indicates the populated values are reasonably spread across the 32 categories rather than collapsing onto one. Treatment: Treat missingness as its own category and group rare subfamilies before one-hot encoding. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 2,662 (74.5%)
unique: 32
top_value: Benue-Congo
top_rate: 0.2195
cardinality: 32
entropy: 3.856
entropy_ratio: 0.7712

Genus

categorical feature null_rate
This column holds linguistic genus labels (e.g., Oceanic, Bantu, Indic, Semitic, Germanic), a mid-level grouping in language classification. Cardinality is high at 625 distinct values across the 2,662 populated rows with entropy ratio 0.856, so the distribution is broad and flat: the top value 'Oceanic' covers only 5.6% of non-null rows. Note the 25.5% null rate, which is flagged and would meaningfully shrink any analysis that conditions on genus. Treatment: Treat as a high-cardinality categorical: group rare genera into 'Other' or target-encode, and add an explicit missing indicator for the 25.5% nulls. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 911 (25.5%)
unique: 625
top_value: Oceanic
top_rate: 0.05597
cardinality: 625
entropy: 7.95
entropy_ratio: 0.856

GenusIcon

categorical identifier long_tail null_rate
GenusIcon holds 613 short hex-like codes (e.g. 'c688033') in the 625 populated rows out of 3,573, with 82.51% nulls and an entropy ratio of 0.9989 indicating values are nearly uniformly distributed among non-nulls. The top value appears only twice (top_rate 0.0032), so there is no dominant category; it behaves like a near-unique tag rather than a real categorical feature. Treatment: Drop for modelling; near-unique with 82% nulls. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 2,948 (82.5%)
unique: 613
top_value: c688033
top_rate: 0.0032
cardinality: 613
entropy: 9.249
entropy_ratio: 0.9989

ISO_codes

text foreign_key one_word null_rate short_text
This column holds ISO language codes: almost all values are single tokens of length 3 (len_mean 3.04, one_word_rate 0.99), matching ISO 639-3 conventions (e.g. 'eus', 'gsw', 'deu'). About 1% of values contain more than one token (len_max 7), which may mean several codes are packed into one field. 26.1% of rows are null, and the 2,640 populated rows hold 2,468 unique values with 172 repeats, so the vocabulary is wide, suggesting broad multilingual coverage rather than a few dominant languages. No value occurs more than 12 times, so the distribution has an extremely long tail. Treatment: Treat as a categorical code key; impute or filter the 26% nulls before joining to a language reference table. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 933 (26.1%)
unique: 2,468
len_min: 3
len_max: 7
len_mean: 3.039
len_median: 3
len_p95: 3
word_mean: 1.01
word_median: 1
n_empty: 0
n_duplicates: 172
duplicate_rate: 0.06515
vocab_size: 2,486
readability_flesch_mean: 117.4
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9902
allcaps_rate: 0
boilerplate_rate: 0
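
The stats above show a len_max of 7 and a one_word_rate just under 1, so a small share of values are not a single 3-letter code. The sketch below assumes those values are whitespace-separated lists of codes; that is a guess from the length and word statistics and should be confirmed against the raw rows before use.

```python
import re

# Shape check only: three lowercase letters, as in 'eus', 'gsw', 'deu'.
ISO3_RE = re.compile(r"[a-z]{3}")

def split_iso_codes(value):
    """Split a field into its 3-letter tokens; null or empty input gives []."""
    if not value:
        return []
    return [token for token in value.split() if ISO3_RE.fullmatch(token)]

print(split_iso_codes("eus"))
print(split_iso_codes("aaa bbb"))
print(split_iso_codes(None))
```

Exploding rows on the resulting list before a join avoids losing the multi-code entries to a failed exact match.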

Samples_100

categorical feature null_rate imbalance
Boolean flag with only two values where 'False' dominates at 96.2% (2562 of 2662 non-null rows) and 'True' appears exactly 100 times. The null rate of 25.5% is high, suggesting the flag is only populated for a subset of records. Entropy ratio of 0.23 confirms severe imbalance. Treatment: Treat as a rare-event boolean indicator; impute or encode nulls explicitly and avoid as a stratification key given the imbalance. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 911 (25.5%)
unique: 2
top_value: False
top_rate: 0.9624
cardinality: 2
entropy: 0.231
entropy_ratio: 0.231
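
The entropy ratio used throughout this report appears to be Shannon entropy in bits divided by log2 of the number of observed categories. That definition is an assumption, but it reproduces the reported values from the counts given for the two sample flags, as this sketch shows.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)

# Samples_100: 2,562 False and 100 True among the non-null rows.
print(round(entropy_ratio([2562, 100]), 3))
# Samples_200: 2,462 False and 200 True among the non-null rows.
print(round(entropy_ratio([2462, 200]), 4))
```

For a two-level column log2(k) is 1, which is why entropy and entropy_ratio are identical in the stats above.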

Samples_200

categorical label null_rate
Binary flag column with only two values (False/True) and heavy class imbalance — False accounts for 92.5% of non-null rows versus 200 True observations, hinting the name 'Samples_200' refers to a tagged 200-row subset. Roughly a quarter of rows (25.5%) are null, which is the main surprise and the reason for the null_rate alert. Entropy ratio of 0.385 confirms the distribution is far from balanced. Treatment: Impute or explicitly encode nulls as a third category before using as a binary indicator. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 911 (25.5%)
unique: 2
top_value: False
top_rate: 0.9249
cardinality: 2
entropy: 0.3848
entropy_ratio: 0.3848
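
Encoding the flag with an explicit third state keeps null rows distinguishable from False, and the same helper can pull out the tagged subset. The sketch assumes the flag is stored as the strings 'True' and 'False' with empty values for nulls; the function names and sample rows are hypothetical.

```python
def encode_flag(value):
    """Map a raw flag to 'true', 'false', or 'missing'."""
    text = (value or "").strip().lower()
    if text == "true":
        return "true"
    if text == "false":
        return "false"
    return "missing"

def select_sample(rows, flag_column):
    """Return only the rows whose flag is explicitly True."""
    return [r for r in rows if encode_flag(r.get(flag_column)) == "true"]

rows = [
    {"ID": "x1", "Samples_200": "True"},
    {"ID": "x2", "Samples_200": "False"},
    {"ID": "x3", "Samples_200": ""},
]
print([encode_flag(r["Samples_200"]) for r in rows])
print([r["ID"] for r in select_sample(rows, "Samples_200")])
```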

Country_ID

categorical foreign_key null_rate
Country_ID looks like an ISO-style two-letter country code, with 337 distinct values across the 2,655 populated rows and a fairly even spread (entropy ratio 0.752). The top country PG accounts for only 8.06% of non-null rows, followed by AU, US, and ID. Notably, 25.69% of values are null, and the cardinality of 337 exceeds the roughly 250 ISO 3166-1 codes, suggesting non-standard or sub-region codes are mixed in. Treatment: Impute or flag the 25.69% nulls and reconcile non-standard codes against an ISO 3166-1 reference before joining. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 918 (25.7%)
unique: 337
top_value: PG
top_rate: 0.0806
cardinality: 337
entropy: 6.314
entropy_ratio: 0.752
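
Reconciling against a reference list can start by separating values that match from those that do not. In the sketch below the reference set is a deliberately tiny placeholder holding only the four codes named in the description; a real run would load the full ISO 3166-1 alpha-2 list, and the non-matching example value is invented.

```python
# Placeholder reference: only the codes mentioned above, NOT the full standard.
REFERENCE_ALPHA2 = {"PG", "AU", "US", "ID"}

def partition_country_ids(values, reference):
    """Return (values found in reference, values not found); nulls are skipped."""
    known, unknown = [], []
    for value in values:
        if not value:
            continue
        (known if value in reference else unknown).append(value)
    return known, unknown

known, unknown = partition_country_ids(["PG", "", "XX YY", None, "AU"], REFERENCE_ALPHA2)
print(known)
print(unknown)
```

Inspecting the second list is the quickest way to see what pushes the cardinality past the size of the standard.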

Source

text metadata one_word null_rate
This column holds bibliographic citation tags (e.g. 'Huber-and-Reed-1992', 'Boelaars-1950'), evidently the source reference for each row in what looks like a linguistic typology dataset. Values are short (median 25 chars, 2 words) and 45.5% are single tokens, consistent with author-year keys rather than prose. Cardinality is high (2,373 unique among the 2,499 populated rows) with 5% duplicates and a 30% null rate, so coverage is uneven and no single source dominates (top value appears only 14 times). Treatment: Treat as a citation key: keep as categorical provenance metadata, optionally normalize casing and join to a bibliography table; do not use as a model feature. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 1,074 (30.1%)
unique: 2,373
len_min: 7
len_max: 452
len_mean: 42.07
len_median: 25
len_p95: 135
word_mean: 2.854
word_median: 2
n_empty: 0
n_duplicates: 126
duplicate_rate: 0.05042
vocab_size: 5,899
readability_flesch_mean: 21.33
emoji_rate: 0
url_rate: 0
one_word_rate: 0.4546
allcaps_rate: 0
boilerplate_rate: 0

Parent_ID

categorical foreign_key long_tail
Parent_ID looks like a foreign key pointing to a linguistic genus (e.g. 'genus-oceanic', 'genus-bantu'), grouping the 3,319 populated rows into 911 parent categories. The distribution is long-tailed but flat: the top value covers only 4.5% of non-null rows and entropy ratio is 0.87, and 7.1% of values are null. Oceanic and Bantu lead the head, with Indic, Western Pama-Nyungan and Semitic trailing far behind. Because 911 levels exceed the 625 distinct Genus values, the parents probably include other classification levels besides genera, so check the id prefixes before assuming a genus-only lookup. Treatment: Left-join on this id to a parent lookup; treat the 7.1% nulls explicitly rather than one-hot encoding the 911 levels. high · anthropic:claude-opus-4-7
n: 3,573
nulls: 254 (7.1%)
unique: 911
top_value: genus-oceanic
top_rate: 0.04489
cardinality: 911
entropy: 8.554
entropy_ratio: 0.87
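
A left join on Parent_ID that keeps unmatched and null rows, with an explicit indicator, might look like the following. The lookup contents, the `parent_name` and `parent_missing` field names, and the sample rows are all hypothetical; only the 'genus-oceanic' id comes from the report.

```python
def left_join_parent(rows, lookup, key="Parent_ID"):
    """Attach parent metadata to each row; rows with no match are kept and flagged."""
    joined = []
    for row in rows:
        parent = lookup.get(row.get(key) or "")
        merged = dict(row)
        merged["parent_name"] = parent["name"] if parent else None
        merged["parent_missing"] = parent is None
        joined.append(merged)
    return joined

# Hypothetical lookup table keyed by parent id.
lookup = {"genus-oceanic": {"name": "Oceanic"}}
rows = [
    {"ID": "x1", "Parent_ID": "genus-oceanic"},
    {"ID": "x2", "Parent_ID": ""},
]
for merged in left_join_parent(rows, lookup):
    print(merged)
```

Because every input row survives the join, the row count before and after can be compared as a quick integrity check.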