language data wals languages

source: /home/coolhand/datasets/language-data/wals_languages.csv · 3,573 rows · 17 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset holds 3,573 rows from the World Atlas of Language Structures (WALS) across 17 columns: identifiers (ID, ISO codes, Glottocode), classifications (Family, Subfamily, Genus), geography (Latitude, Longitude, Macroarea, Country_ID), and sampling flags. Start with the Family and Macroarea distributions: Niger-Congo and Austronesian are the largest families at 324 languages each, and Eurasia (659) and Africa (606) lead the six macroareas. Seven columns (Macroarea, Latitude, Longitude, Family, Genus, Samples_100, Samples_200) each have exactly 911 nulls (25.5%), which suggests that the same rows are missing all of them. That count equals 254 + 32 + 625, the number of distinct Family, Subfamily, and Genus values, so these rows are plausibly classification nodes rather than unclassified languages; confirm this before imputing anything. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 'True' values respectively), reflecting curated WALS sub-samples. Subfamily is sparsely populated (74.5% null), so treat it as supplementary rather than primary.

citing: Family.top_values · Macroarea.top_values · Genus.top_values · Country_ID.top_values · Samples_100.stats · Samples_200.stats · Latitude.stats · Longitude.stats · Subfamily.null_rate
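
Equal null counts do not by themselves prove that the same rows are affected. A minimal sketch of how the lockstep claim could be checked, using only the standard library; the helper names and the toy rows are hypothetical and not taken from the dataset:

```python
from typing import Dict, List, Optional, Set

Row = Dict[str, Optional[str]]


def null_row_sets(rows: List[Row], columns: List[str]) -> Dict[str, Set[int]]:
    """Map each column to the set of row indices where its value is missing."""
    return {
        col: {i for i, row in enumerate(rows) if row.get(col) in (None, "")}
        for col in columns
    }


def nulls_in_lockstep(rows: List[Row], columns: List[str]) -> bool:
    """True when every listed column is missing on exactly the same rows."""
    sets = list(null_row_sets(rows, columns).values())
    return all(s == sets[0] for s in sets[1:])


# Hypothetical toy rows (assumption: empty strings stand for nulls, as in a CSV).
toy_rows: List[Row] = [
    {"ID": "lang-1", "Family": "F1", "Macroarea": "Africa", "Latitude": "6.0"},
    {"ID": "node-1", "Family": "", "Macroarea": "", "Latitude": ""},
    {"ID": "lang-2", "Family": "F2", "Macroarea": "Eurasia", "Latitude": "43.0"},
]
lockstep = nulls_in_lockstep(toy_rows, ["Family", "Macroarea", "Latitude"])
```

On the real file the same check would be run over rows read with `csv.DictReader`; if it returns False, the per-column null sets show which columns diverge.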

Schema

17 columns
Per-column summary.
Column        Type         Nulls  Unique  Alerts
ID            text         0.0%   3,573   near_unique, one_word, short_text
Name          text         0.0%   3,198   one_word, short_text
Macroarea     categorical  25.5%  6       null_rate
Latitude      numeric      25.5%  887     null_rate
Longitude     numeric      25.5%  1,360   null_rate
Glottocode    text         26.0%  2,502   one_word, null_rate, short_text
ISO639P3code  text         26.8%  2,442   one_word, null_rate, short_text
Family        categorical  25.5%  254     null_rate
Subfamily     categorical  74.5%  32      null_rate
Genus         categorical  25.5%  625     null_rate
GenusIcon     categorical  82.5%  613     long_tail, null_rate
ISO_codes     text         26.1%  2,468   one_word, null_rate, short_text
Samples_100   categorical  25.5%  2       null_rate, imbalance
Samples_200   categorical  25.5%  2       null_rate
Country_ID    categorical  25.7%  337     null_rate
Source        text         30.1%  2,373   one_word, null_rate
Parent_ID     categorical  7.1%   911     long_tail

ID

text identifier near_unique one_word short_text
This is an identifier column: every one of the 3573 rows holds a unique single-token string with no nulls or duplicates. Values are short (median length 3, max 36) and the vocabulary equals the row count (3573), confirming one-to-one uniqueness. Top tokens like 'aab', 'aar', 'aba' suggest short alphabetic codes rather than numeric keys. Treatment: drop from modelling; retain only as a join key. high · anthropic:claude-opus-4-7
n
3,573
nulls
0 (0.0%)
unique
3,573
len_min
2
len_max
36
len_mean
5.982
len_median
3
len_p95
17
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
3,573
readability_flesch_mean
61.58
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

Name

text label one_word short_text
This column holds short proper-noun labels, almost certainly language or ethnonym names, given top values like 'Basque', 'Ainu', 'Beothuk' and frequent words 'sign', 'language', 'arabic', 'german'. Entries are terse (mean 8.7 chars, 80% one-word) but not unique: there are 375 duplicates (10.5%) and only 3,198 distinct names across 3,573 rows, with several names appearing exactly 3 times. This suggests that some names recur across multiple records, either as variants (e.g. '(northern)', '(southern)') or possibly because the same name is used at more than one classification level. No nulls, no URLs, no emoji. Treatment: Treat as a categorical key; deduplicate or join on a normalized form before aggregating. high · anthropic:claude-opus-4-7
n
3,573
nulls
0 (0.0%)
unique
3,198
len_min
2
len_max
46
len_mean
8.705
len_median
7
len_p95
19
word_mean
1.258
word_median
1
n_empty
0
n_duplicates
375
duplicate_rate
0.105
vocab_size
3,383
readability_flesch_mean
48.16
emoji_rate
0
url_rate
0
one_word_rate
0.8002
allcaps_rate
0
boilerplate_rate
0
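
One way to read "a normalized form" for Name is sketched below: strip parenthetical qualifiers, collapse whitespace, and casefold. This is an assumption about what normalization is appropriate, and the example names are hypothetical. Collapsing qualifiers can merge genuinely distinct varieties, so the grouping is meant for inspection rather than automatic deduplication.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches a parenthetical qualifier such as "(northern)" with surrounding spaces.
_QUALIFIER = re.compile(r"\s*\([^)]*\)\s*")


def normalize_name(name: str) -> str:
    """Drop parenthetical qualifiers, collapse whitespace, and casefold."""
    stripped = _QUALIFIER.sub(" ", name)
    return " ".join(stripped.split()).casefold()


def group_by_normalized(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group raw names under their normalized key, keeping the originals."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in names:
        groups[normalize_name(raw)].append(raw)
    return dict(groups)


# Hypothetical names for illustration only.
groups = group_by_normalized(["Example (Northern)", "Example (Southern)", "Other"])
```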

Macroarea

categorical feature null_rate
Macroarea is a coarse geographic grouping with 6 categories (Eurasia, Africa, Papunesia, North America, South America, and Australia), consistent with WALS/Glottolog-style language area labels. The distribution is relatively even (entropy ratio 0.95, top value Eurasia at 24.8% of non-null rows), so no single region dominates. The 25.5% null rate is substantial and flagged. Treatment: One-hot encode and add an explicit 'missing' category to preserve the 25.5% nulls. high · anthropic:claude-opus-4-7
n
3,573
nulls
911 (25.5%)
unique
6
top_value
Eurasia
top_rate
0.2476
cardinality
6
entropy
2.459
entropy_ratio
0.9511
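
The entropy_ratio figure can be reproduced as Shannon entropy divided by its maximum for the observed number of categories. The sketch below assumes base-2 logarithms, which matches the report's numbers (for the two-valued Samples columns, entropy and entropy_ratio coincide, as they must when log2(2) = 1). The function name is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def entropy_ratio(values: Iterable[Optional[str]]) -> float:
    """Shannon entropy (base 2) of the non-null values over log2(cardinality)."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # a constant or empty column carries no spread
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(k)


# Cross-check against the reported Macroarea figures: entropy 2.459, 6 categories.
reported_ratio = 2.459 / math.log2(6)
```

The cross-check lands within rounding distance of the reported 0.9511; a perfectly uniform column returns exactly 1.0.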

Latitude

numeric feature null_rate
Geographic latitude in degrees, ranging from -55.0 to 71.25 with a median of 8.29 and IQR of 33.0, consistent with a worldwide point distribution. The 25.5% null rate is notable and flagged, while skew (0.36) and kurtosis (-0.50) indicate a fairly symmetric, slightly flat spread with only one outlier. Treatment: Impute or filter the 25.5% missing values, and pair with longitude for any geospatial modelling. high · anthropic:claude-opus-4-7
n
3,573
nulls
911 (25.5%)
unique
887
min
-55
max
71.25
mean
11.88
median
8.292
std
22.72
q1
-5
q3
28
iqr
33
skew
0.3562
kurtosis
-0.5023
n_outliers
1
outlier_rate
0.0003757
zero_rate
0.002254
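
The report does not say how outliers are counted. Assuming the common 1.5 × IQR (Tukey) rule, the reported quartiles reproduce both outlier counts: the latitude minimum of -55 falls just below the lower fence, and no longitude can fall outside its fences. The quartiles are the rounded values shown in the report, so this is a plausibility check rather than an exact recomputation.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Latitude: q1 = -5, q3 = 28 (IQR 33), as reported.
lat_low, lat_high = tukey_fences(-5.0, 28.0)

# Longitude: q1 = -45.75, q3 = 121, as reported.
lon_low, lon_high = tukey_fences(-45.75, 121.0)

# Reported extremes: latitude -55 .. 71.25, longitude -178.17 .. 179.17.
lat_min_is_outlier = -55.0 < lat_low
lon_has_outliers = -178.17 < lon_low or 179.17 > lon_high
```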

Longitude

numeric feature null_rate
Geographic longitude in degrees, spanning the full globe from -178.17 to 179.17 with a mild negative skew (-0.33) and flat kurtosis (-1.05), consistent with a worldwide point distribution. The 25.5% null rate is the main concern. The 2,662 non-null rows hold only 1,360 unique values, suggesting repeated locations or rounded coordinates. No outliers are flagged, as expected for a bounded angular measure. Treatment: Pair with Latitude for geospatial features; impute or drop the 25.5% missing before modelling. high · anthropic:claude-opus-4-7
n
3,573
nulls
911 (25.5%)
unique
1,360
min
-178.2
max
179.2
mean
35.17
median
34.79
std
89.35
q1
-45.75
q3
121
iqr
166.8
skew
-0.3259
kurtosis
-1.047
n_outliers
0
outlier_rate
0
zero_rate
0.001503
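
Raw longitude wraps at ±180°, so the reported extremes (-178.17 and 179.17) are numerically far apart but geographically close. One common way to pair latitude and longitude into features without that discontinuity is a 3D unit vector; this is a sketch of that encoding, not something the report prescribes, and the function names are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def to_unit_vector(lat_deg: float, lon_deg: float) -> Vec3:
    """Map a latitude/longitude pair in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def chord_distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Straight-line distance between two (lat, lon) points on the unit sphere."""
    return math.dist(to_unit_vector(*a), to_unit_vector(*b))


# Two points on the equator on either side of the antimeridian.
across_antimeridian = chord_distance((0.0, 179.17), (0.0, -178.17))
same_side = chord_distance((0.0, 179.17), (0.0, 170.0))
```

Here the pair straddling the antimeridian comes out closer than the pair 9.17° apart on the same side, which a plain difference of longitudes would get wrong.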

Glottocode

text foreign_key one_word null_rate short_text
This column holds Glottocodes: fixed 8-character language identifiers from the Glottolog catalogue (e.g. 'basq1248', 'stan1295'), with every value a single token of length exactly 8. About 26% of rows are null, and the 2,645 non-null records carry 2,502 distinct codes, a 5.4% duplicate rate; the most repeated code 'basq1248' appears 11 times, so multiple records can share a language. Treatment: Left-join on this code to a Glottolog reference table; impute or flag the 26% nulls separately. high · anthropic:claude-opus-4-7
n
3,573
nulls
928 (26.0%)
unique
2,502
len_min
8
len_max
8
len_mean
8
len_median
8
len_p95
8
word_mean
1
word_median
1
n_empty
0
n_duplicates
143
duplicate_rate
0.05406
vocab_size
2,502
readability_flesch_mean
92.88
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
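
The duplicate figures appear to be computed over non-null values only: duplicates are the non-null count minus the number of distinct values. That reading is an inference from the numbers, not something the report states. A short sketch that reproduces the reported Glottocode and ISO639P3code values (the function name is hypothetical):

```python
from typing import Tuple


def duplicate_stats(n_rows: int, n_nulls: int, n_unique: int) -> Tuple[int, int, float]:
    """Return (non-null count, duplicate count, duplicate rate over non-nulls)."""
    n_non_null = n_rows - n_nulls
    n_duplicates = n_non_null - n_unique
    return n_non_null, n_duplicates, n_duplicates / n_non_null


# Glottocode: 3,573 rows, 928 nulls, 2,502 unique.
glottocode = duplicate_stats(3573, 928, 2502)

# ISO639P3code: 3,573 rows, 959 nulls, 2,442 unique.
iso639p3 = duplicate_stats(3573, 959, 2442)
```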

ISO639P3code

text foreign_key one_word null_rate short_text
This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and a single word (len_mean 3.0, one_word_rate 1.0), with familiar codes like 'eus' (Basque), 'deu' (German), and 'gsw' (Swiss German) at the top. Coverage is incomplete: 26.84% of rows are null, and the 2,614 non-null rows carry 2,442 unique codes, a 6.58% duplicate rate. Nothing in the evidence indicates which entity each code is tagging. Treatment: Treat as a categorical join key to an ISO 639-3 reference table; impute or filter the 26.84% nulls before use. high · anthropic:claude-opus-4-7
n
3,573
nulls
959 (26.8%)
unique
2,442
len_min
3
len_max
3
len_mean
3
len_median
3
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
172
duplicate_rate
0.0658
vocab_size
2,442
readability_flesch_mean
119.5
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

Family

categorical feature null_rate
Categorical label assigning each of the 2,662 non-null rows to one of 254 language families, headed by Niger-Congo and Austronesian (tied at 324 rows, 12.2% of non-null rows each). The long tail is heavy: an entropy ratio of 0.705 indicates the distribution is fairly spread across families rather than dominated by a few. The remaining 911 rows (25.5%) are null, which is a substantial gap for what looks like a taxonomic feature. Treatment: Impute or add an explicit 'unknown' category for the 25.5% nulls, then group rare families before encoding. high · anthropic:claude-opus-4-7
n
3,573
nulls
911 (25.5%)
unique
254
top_value
Niger-Congo
top_rate
0.1217
cardinality
254
entropy
5.631
entropy_ratio
0.7049
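
The Family treatment (explicit 'unknown' level, then grouping rare families) can be sketched as below. The threshold, the 'other' label, and the placeholder family name are assumptions for illustration; a sensible cut-off depends on the downstream model.

```python
from collections import Counter
from typing import List, Optional, Sequence


def lump_rare(
    values: Sequence[Optional[str]],
    min_count: int,
    other: str = "other",
    missing: str = "unknown",
) -> List[str]:
    """Replace nulls with `missing` and categories seen < min_count times with `other`."""
    counts = Counter(v for v in values if v)
    result = []
    for v in values:
        if not v:
            result.append(missing)
        elif counts[v] < min_count:
            result.append(other)
        else:
            result.append(v)
    return result


# Toy column: two frequent families, one hypothetical rare family, one null.
toy_family = ["Niger-Congo"] * 3 + ["Austronesian"] * 3 + ["rare-family-x", None]
lumped = lump_rare(toy_family, min_count=2)
```

Keeping 'unknown' separate from 'other' preserves the distinction between a missing classification and a small family.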

Subfamily

categorical feature null_rate
This column records the linguistic subfamily classification of entries, drawn from a controlled vocabulary of 32 values such as Benue-Congo, Eastern Malayo-Polynesian, and Tibeto-Burman. Coverage is the main concern: 74.5% of rows are null, leaving only 911 labelled records, with Benue-Congo accounting for 21.95% of those (about 200 rows). Among populated rows the distribution is reasonably diverse (entropy ratio 0.77), so the signal is informative where present but sparse overall. Treatment: Treat as a sparse categorical: impute an explicit 'unknown' level before encoding, since 74.5% are null. high · anthropic:claude-opus-4-7
n
3,573
nulls
2,662 (74.5%)
unique
32
top_value
Benue-Congo
top_rate
0.2195
cardinality
32
entropy
3.856
entropy_ratio
0.7712

Genus

categorical feature null_rate
Genus is a linguistic genus label (a grouping level below Family), with values like Oceanic, Bantu, Indic, and Semitic. It is highly diverse: 625 distinct genera across the 2,662 non-null rows, with entropy ratio 0.86 and the top value Oceanic covering only 5.6% of them. The flagged concern is that 25.5% of rows are null. Treatment: Treat as a high-cardinality categorical: target- or frequency-encode and explicitly model the 25.5% missing as its own category. high · anthropic:claude-opus-4-7
n
3,573
nulls
911 (25.5%)
unique
625
top_value
Oceanic
top_rate
0.05597
cardinality
625
entropy
7.95
entropy_ratio
0.856
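
Frequency encoding, one of the two options named for Genus, replaces each category with its share of rows, with nulls treated as their own category. A minimal sketch; the sentinel label is an assumption:

```python
from collections import Counter
from typing import List, Optional, Sequence


def frequency_encode(
    values: Sequence[Optional[str]], missing: str = "__missing__"
) -> List[float]:
    """Encode each value as the fraction of rows sharing its category."""
    filled = [v if v else missing for v in values]
    counts = Counter(filled)
    n = len(filled)
    return [counts[v] / n for v in filled]


# Toy column using genus names mentioned in the report, plus one null.
encoded = frequency_encode(["Oceanic", "Oceanic", "Bantu", None])
```

Distinct genera with equal counts receive the same code, which is a known limitation of this encoding with 625 categories.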

GenusIcon

categorical metadata long_tail null_rate
GenusIcon is a high-cardinality categorical with 613 unique values, and 82.51% of rows are null, leaving only 625 populated rows (the same number as distinct Genus values, which may not be a coincidence). An entropy ratio of 0.9989 and a top_rate of just 0.0032 mean the non-null values are nearly uniformly distributed, with the most frequent code 'c688033' appearing only twice. The hex-like tokens (e.g. 'c807D33') suggest icon identifiers or color/asset codes rather than a meaningful category. Treatment: Drop or retain as a sparse asset reference; not useful as a modelling feature given near-unique values and 82.51% nulls. high · anthropic:claude-opus-4-7
n
3,573
nulls
2,948 (82.5%)
unique
613
top_value
c688033
top_rate
0.0032
cardinality
613
entropy
9.249
entropy_ratio
0.9989

ISO_codes

text feature one_word null_rate short_text
Almost certainly ISO 639-3 language codes: 99% are single tokens, length is tightly clustered at 3 characters (min 3, max 7, p95 3), and top values like 'eus', 'deu', 'gsw', 'bod', 'roh' are recognisable three-letter language identifiers. Cardinality is high (2,468 unique among 2,640 non-null rows) with a 26.1% null rate and 172 duplicates, so coverage is partial and no single code dominates (top value 'eus' appears just 12 times). The handful of length-7 entries is anomalous for a strict ISO 639-3 field; because about 1% of values contain more than one token, these are plausibly two 3-letter codes joined by a separator, which is worth confirming. Treatment: Treat as a categorical code; validate against the ISO 639-3 list and investigate entries longer than 3 characters. high · anthropic:claude-opus-4-7
n
3,573
nulls
933 (26.1%)
unique
2,468
len_min
3
len_max
7
len_mean
3.039
len_median
3
len_p95
3
word_mean
1.01
word_median
1
n_empty
0
n_duplicates
172
duplicate_rate
0.06515
vocab_size
2,486
readability_flesch_mean
117.4
emoji_rate
0
url_rate
0
one_word_rate
0.9902
allcaps_rate
0
boilerplate_rate
0
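
A sketch of how the longer ISO_codes entries could be inspected: split each value on whitespace, commas, or semicolons, then flag tokens that are not three lowercase letters. The separator set is an assumption, since the report does not show a multi-code value, and the pattern check only tests shape, not membership in the ISO 639-3 list.

```python
import re
from typing import List, Optional

_ISO3_SHAPE = re.compile(r"[a-z]{3}")
_SEPARATORS = re.compile(r"[\s,;]+")


def split_codes(value: Optional[str]) -> List[str]:
    """Split a possibly multi-code cell into individual tokens."""
    if not value:
        return []
    return [token for token in _SEPARATORS.split(value.strip()) if token]


def malformed_codes(value: Optional[str]) -> List[str]:
    """Return the tokens that do not look like a 3-letter lowercase code."""
    return [t for t in split_codes(value) if not _ISO3_SHAPE.fullmatch(t)]


# 'deu gsw' is a constructed example of a 7-character, two-code value.
two_codes = split_codes("deu gsw")
```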

Samples_100

categorical feature null_rate imbalance
Boolean flag with only two values (False/True) where False dominates at 96.2% of non-null rows (2562 vs 100). The name 'Samples_100' plus the exact count of 100 True values suggests this marks a curated subset of 100 sampled records. A 25.5% null rate is notable and should be reconciled before use. Treatment: Treat as a boolean subset indicator; impute or exclude nulls and avoid using as a model feature given severe imbalance. high · anthropic:claude-opus-4-7
n
3,573
nulls
911 (25.5%)
unique
2
top_value
False
top_rate
0.9624
cardinality
2
entropy
0.231
entropy_ratio
0.231

Samples_200

categorical metadata null_rate
Binary True/False flag, almost certainly indicating membership in a 200-row sample (the name 'Samples_200' and the exact count of 200 'True' values support this). The column is heavily imbalanced — 'False' covers 92.5% of non-null rows — and 25.5% of values are null, which is unusual for a sampling indicator and worth investigating. Treatment: Use as a boolean filter/split flag; reconcile the 25.5% nulls (treat as False or exclude) before relying on it. high · anthropic:claude-opus-4-7
n
3,573
nulls
911 (25.5%)
unique
2
top_value
False
top_rate
0.9249
cardinality
2
entropy
0.3848
entropy_ratio
0.3848
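
Using a sampling flag as a filter requires an explicit decision about nulls. The sketch below parses the 'True'/'False' text and either treats nulls as False or refuses to proceed; both the function names and the toy rows are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def parse_flag(value: Optional[str]) -> Optional[bool]:
    """Parse 'True'/'False' text into a bool; empty or missing becomes None."""
    if value is None or value.strip() == "":
        return None
    return value.strip().lower() == "true"


def select_sample(rows: List[Row], flag: str, null_as_false: bool = True) -> List[Row]:
    """Keep rows whose flag is True; nulls are dropped or rejected per null_as_false."""
    selected = []
    for row in rows:
        parsed = parse_flag(row.get(flag))
        if parsed is None and not null_as_false:
            raise ValueError(f"row {row.get('ID')!r} has no value for {flag}")
        if parsed:
            selected.append(row)
    return selected


# Hypothetical toy rows: one sampled, one not, one with a null flag.
toy_rows: List[Row] = [
    {"ID": "lang-1", "Samples_200": "True"},
    {"ID": "lang-2", "Samples_200": "False"},
    {"ID": "node-1", "Samples_200": ""},
]
sample_ids = [r["ID"] for r in select_sample(toy_rows, "Samples_200")]
```

On the real file, a correct filter on Samples_200 should return exactly 200 rows, which gives a quick sanity check.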

Country_ID

categorical foreign_key null_rate
Two-letter country codes (PG, AU, US, ID, IN...) identifying the country associated with each record, with 337 distinct values across the 2,655 non-null rows. The cardinality is suspiciously high, since ISO 3166-1 alpha-2 defines only 249 codes; this hints at non-standard codes, sub-region codes, or values that combine several countries. The distribution is fairly flat (entropy ratio 0.752, top value PG only 8.06%) and 25.69% of rows are null. Treatment: Validate codes against ISO 3166, impute or flag the 25.69% nulls, then left-join on this id. high · anthropic:claude-opus-4-7
n
3,573
nulls
918 (25.7%)
unique
337
top_value
PG
top_rate
0.0806
cardinality
337
entropy
6.314
entropy_ratio
0.752

Source

text metadata one_word null_rate
This column holds bibliographic citation tags (e.g., 'Huber-and-Reed-1992', 'Boelaars-1950'), almost certainly the source reference for each row in what appears to be a linguistic dataset. About 45% of values are a single token and the median length is 25 chars, consistent with compact Author-Year keys, while the long tail (max 452 chars, mean 2.9 words) indicates that some values list several citations. About 30% of rows are null, and 2,373 of the 2,499 non-null values are unique, with only 126 duplicates. 'nichols-1992' and 'malherbe-and-rosenberg-1996' are reported at 113 occurrences each; since only 126 values are duplicated in total, those counts presumably refer to tokens inside multi-citation values rather than whole-value repeats. Either way, a few reference works appear to supply many entries. Treatment: Normalize casing and keep as a categorical provenance tag; impute or flag the 30% nulls rather than modelling the text. high · anthropic:claude-opus-4-7
n
3,573
nulls
1,074 (30.1%)
unique
2,373
len_min
7
len_max
452
len_mean
42.07
len_median
25
len_p95
135
word_mean
2.854
word_median
2
n_empty
0
n_duplicates
126
duplicate_rate
0.05042
vocab_size
5,899
readability_flesch_mean
21.33
emoji_rate
0
url_rate
0
one_word_rate
0.4546
allcaps_rate
0
boilerplate_rate
0

Parent_ID

categorical foreign_key long_tail
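Parent_ID looks like a foreign key pointing to a parent classification node, most visibly a linguistic genus (e.g. 'genus-oceanic', 'genus-bantu'), grouping the 3,319 non-null rows under 911 distinct parents. Because 911 exceeds the 625 distinct Genus values, some parents are presumably subfamilies or families rather than genera. The distribution is long-tailed but not dominated: the top value covers only 4.5% of non-null rows and entropy is 8.55 (ratio 0.87), so most parents carry few members. About 7.1% of values (254 rows, matching the 254 distinct Family values) are null, which may simply mark top-level nodes; decide how to handle them before any join or grouping. Treatment: Left-join on this id to a lookup of parent nodes; impute or flag the 7.1% nulls before grouping. high · anthropic:claude-opus-4-7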
n
3,573
nulls
254 (7.1%)
unique
911
top_value
genus-oceanic
top_rate
0.04489
cardinality
911
entropy
8.554
entropy_ratio
0.87
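
If Parent_ID does reference other rows by ID, a row's classification can be recovered by walking the parent links until a null parent is reached. This sketch assumes that self-referential structure, which the report does not confirm; apart from 'genus-oceanic', the IDs below are hypothetical.

```python
from typing import Dict, List, Optional


def ancestors(node_id: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of parent IDs from nearest to most distant ancestor."""
    chain: List[str] = []
    seen = {node_id}
    current = parent_of.get(node_id)
    while current:  # stops at None or an empty string (a root node)
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


# Hypothetical ID -> Parent_ID mapping; a null parent marks a top-level node.
parent_of: Dict[str, Optional[str]] = {
    "lang-x": "genus-oceanic",
    "genus-oceanic": "family-hypothetical",
    "family-hypothetical": None,
}
lineage = ancestors("lang-x", parent_of)
```

A parent ID that is missing from the mapping simply ends the chain, so dangling references should be checked separately before relying on the result.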