data_raw-wals_language · saturn notebook

Overview

Source: /home/coolhand/servers/diachronica/data_raw/wals_language.csv

Saturn profiled 3,573 rows across 17 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

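# Install the package that provides the saturn command (IPython shell escape, so run this in a notebook).
!pip install saturn-dissect
import subprocess

# Profile the CSV, write the findings to JSON, and request the LLM prose layer.
# check=True raises CalledProcessError if the command exits with a non-zero status.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/data_raw/wals_language.csv",
    "--findings", "data_raw-wals_language.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)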

Summary confidence: high

This dataset is the WALS language table: 3,573 rows with identifiers (ID, ISO codes, Glottocode), names, geographic coordinates, classification fields (Family, Genus, Subfamily, Macroarea), reference sources, and sampling flags. The geographic and genealogical breakdowns are the most informative starting point: Macroarea splits across six regions led by Eurasia (659) and Africa (606), while Family is led by Niger-Congo and Austronesian (324 each). Worth a closer look: Macroarea, Latitude, Longitude, Family, Genus, Samples_100, and Samples_200 are each null in exactly 911 rows (25.5%), leaving 2,662 non-null values in each. That 911 equals 254 families + 32 subfamilies + 625 genera, and it also matches the number of distinct Parent_ID values, so these rows are probably family, subfamily, and genus entries rather than languages with missing data. The profile never sees raw rows, so confirm this against the CSV before treating the table as 3,573 languages. Subfamily is 74.5% null, which will limit any subfamily-level analysis. The Samples_100 and Samples_200 flags are highly imbalanced (100 and 200 True values respectively), reflecting their role as curated sub-samples rather than balanced categories.

citing: columns · row_count · kinds

Out[4]:

saturn.schema() · 17 columns

column	kind	n	null%	unique	alerts
ID	text	3,573	0.0%	3,573	near_unique one_word short_text
Name	text	3,573	0.0%	3,198	one_word short_text
Macroarea	categorical	3,573	25.5%	6	null_rate
Latitude	numeric	3,573	25.5%	887	null_rate
Longitude	numeric	3,573	25.5%	1,360	null_rate
Glottocode	text	3,573	26.0%	2,502	one_word null_rate short_text
ISO639P3code	text	3,573	26.8%	2,442	one_word null_rate short_text
Family	categorical	3,573	25.5%	254	null_rate
Subfamily	categorical	3,573	74.5%	32	null_rate
Genus	categorical	3,573	25.5%	625	null_rate
GenusIcon	categorical	3,573	82.5%	613	long_tail null_rate
ISO_codes	text	3,573	26.1%	2,468	one_word null_rate short_text
Samples_100	categorical	3,573	25.5%	2	null_rate imbalance
Samples_200	categorical	3,573	25.5%	2	null_rate
Country_ID	categorical	3,573	25.7%	337	null_rate
Source	text	3,573	30.1%	2,373	one_word null_rate
Parent_ID	categorical	3,573	7.1%	911	long_tail

Fig 1.

Macroarea · Shows the six-region geographic split, led by Eurasia and Africa.

Top values for Macroarea (6 unique shown, of 6 total).
value	count	share
Eurasia	659	18.4%
Africa	606	17.0%
Papunesia	560	15.7%
North America	396	11.1%
South America	258	7.2%
Australia	183	5.1%

Fig 2.

Family · Top language families — Niger-Congo and Austronesian tie at the top with 324 languages each.

Top values for Family (20 unique shown, of 254 total).
value	count	share
Niger-Congo	324	9.1%
Austronesian	324	9.1%
Indo-European	176	4.9%
Sino-Tibetan	146	4.1%
Afro-Asiatic	145	4.1%
Pama-Nyungan	121	3.4%
Trans-New Guinea	98	2.7%
other	72	2.0%
Altaic	65	1.8%
Oto-Manguean	56	1.6%
Austro-Asiatic	48	1.3%
Eastern Sudanic	47	1.3%
Uto-Aztecan	44	1.2%
Algic	31	0.9%
Mayan	30	0.8%
Arawakan	29	0.8%
Nakh-Daghestanian	28	0.8%
Mande	28	0.8%
Uralic	27	0.8%
Hokan	26	0.7%

Fig 3.

Latitude · Latitude distribution skews toward the tropics and northern hemisphere (median ~8°).

Histogram bins for Latitude (median: 8.291666666665).
bin	count
-55 – -51.84	2
-51.84 – -48.69	1
-48.69 – -45.53	1
-45.53 – -42.38	2
-42.38 – -39.22	3
-39.22 – -36.06	5
-36.06 – -32.91	9
-32.91 – -29.75	18
-29.75 – -26.59	24
-26.59 – -23.44	29
-23.44 – -20.28	47
-20.28 – -17.12	64
-17.12 – -13.97	85
-13.97 – -10.81	86
-10.81 – -7.656	129
-7.656 – -4.5	187
-4.5 – -1.344	190
-1.344 – 1.812	130
1.812 – 4.969	119
4.969 – 8.125	196
8.125 – 11.28	161
11.28 – 14.44	104
14.44 – 17.59	145
17.59 – 20.75	82
20.75 – 23.91	63
23.91 – 27.06	74
27.06 – 30.22	84
30.22 – 33.38	66
33.38 – 36.53	84
36.53 – 39.69	67
39.69 – 42.84	86
42.84 – 46	62
46 – 49.16	74
49.16 – 52.31	52
52.31 – 55.47	44
55.47 – 58.62	18
58.62 – 61.78	20
61.78 – 64.94	23
64.94 – 68.09	19
68.09 – 71.25	7

Fig 4.

Country_ID · Country concentration — Papua New Guinea, Australia, the US, and Indonesia host the most languages.

Top values for Country_ID (20 unique shown, of 337 total).
value	count	share
PG	214	6.0%
AU	185	5.2%
US	177	5.0%
ID	177	5.0%
IN	120	3.4%
MX	120	3.4%
RU	89	2.5%
NG	66	1.8%
BR	66	1.8%
CN	54	1.5%
CD	49	1.4%
CM	46	1.3%
CA	45	1.3%
CO	39	1.1%
ET	36	1.0%
PH	36	1.0%
PE	35	1.0%
NP	32	0.9%
TZ	28	0.8%
VU	28	0.8%

Fig 5.

Samples_100 · Highlights the strong imbalance: only 100 of 2,662 non-null rows are flagged True.

Top values for Samples_100 (2 unique shown, of 2 total).
value	count	share
False	2562	71.7%
True	100	2.8%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
ID	text	0.0%
Name	text	0.0%
Macroarea	categorical	25.5%
Latitude	numeric	25.5%
Longitude	numeric	25.5%
Glottocode	text	26.0%
ISO639P3code	text	26.8%
Family	categorical	25.5%
Subfamily	categorical	74.5%
Genus	categorical	25.5%
GenusIcon	categorical	82.5%
ISO_codes	text	26.1%
Samples_100	categorical	25.5%
Samples_200	categorical	25.5%
Country_ID	categorical	25.7%
Source	text	30.1%
Parent_ID	categorical	7.1%
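
Several columns in this table share the same 25.5% null rate, which raises the question of whether the gaps fall in the same rows. A minimal sketch of that check in plain Python follows. It runs on three made-up rows rather than the real CSV, and the row contents and the `CLASSIFICATION_COLS` grouping are assumptions for illustration only.

```python
import csv
import io

# Hypothetical miniature of the CSV: two language rows and one grouping row.
SAMPLE = """ID,Name,Macroarea,Latitude,Longitude,Family,Genus
aaa,Alpha,Eurasia,43.0,-3.0,FamA,GenA
bbb,Beta,Africa,9.5,12.25,FamB,GenB
genus-gena,GenA,,,,,
"""

# A subset of the columns that the profile reports at a 25.5% null rate.
CLASSIFICATION_COLS = ["Macroarea", "Latitude", "Longitude", "Family", "Genus"]

def split_by_null_pattern(rows, cols):
    """Separate rows where every listed column is empty from the rest."""
    all_null, other = [], []
    for row in rows:
        if all(row[c] == "" for c in cols):
            all_null.append(row)
        else:
            other.append(row)
    return all_null, other

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
all_null, other = split_by_null_pattern(rows, CLASSIFICATION_COLS)
print(len(all_null), "rows missing every classification field")
print(len(other), "rows with at least one field present")
```

If the first count on the real file came out at 911, the nulls would mark a distinct kind of row rather than scattered missing values, and filtering would be a better fit than imputation.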

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	Latitude	Longitude
Latitude	+1.00	-0.38
Longitude	-0.38	+1.00
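
For reference, the textbook Pearson coefficient can be sketched in a few lines of plain Python. How saturn samples rows or handles nulls for this figure is not stated in the report, so the choice below to skip incomplete pairs is an assumption, and the coordinates are made up.

```python
import math

def pearson(pairs):
    """Pearson r over (x, y) pairs, skipping pairs with a missing value."""
    complete = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(complete)
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    var_x = sum((x - mean_x) ** 2 for x, _ in complete)
    var_y = sum((y - mean_y) ** 2 for _, y in complete)
    return cov / math.sqrt(var_x * var_y)

# Made-up coordinates; None marks a row without a location.
coords = [(10.0, 100.0), (20.0, 80.0), (None, None), (30.0, 60.0)]
r = pearson(coords)
print(round(r, 2))
```

The toy data is perfectly linear with a negative slope, so it sits at the extreme of the scale on which the reported -0.38 is a moderate negative association.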

ID text identifier

Column 'ID' is a unique row identifier: all 3,573 values are distinct (n_unique equals n), every value is a single token (one_word_rate 1.0), and there are no nulls or duplicates. Lengths range from 2 to 36 characters with a median of 3. The length histogram is bimodal: 2,662 IDs are 2 or 3 characters long, matching the top tokens (aab, aar, aba, abb…), while the remaining 911 are 8 or more characters, so the column mixes short alphabetic codes with longer identifiers rather than numeric keys.

Treatment: Use as a join key; drop from modelling features.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["ID"].stats

stat	value
n	3,573
nulls	0 (0.0%)
unique	3,573
len_min	2
len_max	36
len_mean	5.982
len_median	3
len_p95	17
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	3,573
readability_flesch_mean	61.58
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for ID.

Character-length distribution for ID (mean: 5.9818080044780295).
chars	count
2 – 3	11
3 – 4	2651
4 – 5	0
5 – 5	0
5 – 6	0
6 – 7	0
7 – 8	0
8 – 9	2
9 – 10	17
10 – 10	53
10 – 11	89
11 – 12	137
12 – 13	122
13 – 14	0
14 – 15	110
15 – 16	90
16 – 16	67
16 – 17	52
17 – 18	48
18 – 19	0
19 – 20	21
20 – 21	22
21 – 22	20
22 – 22	17
22 – 23	10
23 – 24	11
24 – 25	0
25 – 26	7
26 – 27	3
27 – 28	0
28 – 28	2
28 – 29	3
29 – 30	2
30 – 31	0
31 – 32	1
32 – 33	2
33 – 33	2
33 – 34	0
34 – 35	0
35 – 36	1

Name text label

This column holds short proper-noun labels, almost certainly language names: top values like Basque, Ainu, Beothuk, and Atakapa, and frequent words such as 'sign', 'language', 'arabic', and 'mixtec', all point to a linguistic registry. Entries are mostly single tokens (one_word_rate 0.80, word_mean 1.26, len_mean 8.7), with a 46-character maximum for the longer parenthesised variants like '(northern)'/'(southern)'. Notably, the 3,573 rows contain only 3,198 unique names, leaving 375 duplicates (10.5%): names like 'Abun', 'Andoke', and 'Basque' each appear 3 times, suggesting the same name recurs along some other dimension, so the column is not a clean key.

Treatment: Treat as a display label, not a key: names repeat, so join on ID and deduplicate names before grouping on them.

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["Name"].stats

stat	value
n	3,573
nulls	0 (0.0%)
unique	3,198
len_min	2
len_max	46
len_mean	8.705
len_median	7
len_p95	19
word_mean	1.258
word_median	1
n_empty	0
n_duplicates	375
duplicate_rate	0.105
vocab_size	3,383
readability_flesch_mean	48.16
emoji_rate	0
url_rate	0
one_word_rate	0.8002
allcaps_rate	0
boilerplate_rate	0
alert: one_word	80.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
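
The n_duplicates and duplicate_rate figures appear to follow from three other counts: non-null rows minus unique values, divided by non-null rows. That relationship is inferred from the numbers in this report, not taken from saturn's documentation, so treat the sketch below as a consistency check rather than a definition.

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Return (n_duplicates, duplicate_rate) from three summary counts."""
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null

# Counts copied from the Name and Glottocode stats tables in this notebook.
name_dups, name_rate = duplicate_stats(3573, 0, 3198)
glotto_dups, glotto_rate = duplicate_stats(3573, 928, 2502)
print(name_dups, round(name_rate, 3))      # 375 0.105
print(glotto_dups, round(glotto_rate, 5))  # 143 0.05406
```

Both results agree with the reported values, which suggests the duplicate rate is computed over non-null rows only.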

Fig 9.

Character-length distribution for Name.

Character-length distribution for Name (mean: 8.705009795689897).
chars	count
2 – 3	112
3 – 4	361
4 – 5	528
5 – 6	581
6 – 8	502
8 – 9	305
9 – 10	198
10 – 11	152
11 – 12	84
12 – 13	88
13 – 14	131
14 – 15	89
15 – 16	76
16 – 17	77
17 – 18	66
18 – 20	46
20 – 21	34
21 – 22	31
22 – 23	21
23 – 24	19
24 – 25	21
25 – 26	8
26 – 27	9
27 – 28	9
28 – 30	5
30 – 31	2
31 – 32	3
32 – 33	4
33 – 34	2
34 – 35	3
35 – 36	2
36 – 37	1
37 – 38	1
38 – 39	0
39 – 40	0
40 – 42	0
42 – 43	0
43 – 44	0
44 – 45	0
45 – 46	2

Macroarea categorical feature

Macroarea is a categorical geographic grouping with 6 values covering the standard continental/linguistic macroareas (Eurasia, Africa, Papunesia, North America, South America, Australia). The distribution is fairly balanced: the entropy ratio is 0.95, and the top value, Eurasia, accounts for only 24.8% of non-null rows (18.4% of all rows). The main concern is the 25.5% null rate: 911 of the 3,573 rows lack a macroarea assignment.

Treatment: One-hot encode and decide whether to impute or add an explicit 'unknown' bucket for the 25.5% nulls.

anthropic:claude-opus-4-7 · confidence high
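
One way to read the treatment is a one-hot encoding with a seventh level for nulls. The sketch below is plain Python with no dependencies; the bucket name 'unknown' comes from the treatment text, while the function name and level order are choices made here.

```python
MACROAREAS = ["Eurasia", "Africa", "Papunesia",
              "North America", "South America", "Australia"]
LEVELS = MACROAREAS + ["unknown"]  # extra bucket so nulls are kept, not dropped

def one_hot(value):
    """Encode one Macroarea value as a 0/1 list aligned with LEVELS."""
    key = value if value in MACROAREAS else "unknown"
    return [1 if level == key else 0 for level in LEVELS]

print(one_hot("Africa"))  # [0, 1, 0, 0, 0, 0, 0]
print(one_hot(None))      # [0, 0, 0, 0, 0, 0, 1]
```

Keeping the explicit bucket preserves all 3,573 rows, whereas imputing a region would invent a location for rows that may not have one.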

Out[19]:

saturn.columns["Macroarea"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	6
top_value	Eurasia
top_rate	0.2476
cardinality	6
entropy	2.459
entropy_ratio	0.9511
alert: null_rate	25.5% null

Fig 10.

Top values for Macroarea.

Top values for Macroarea (6 unique shown, of 6 total).
value	count	share
Eurasia	659	18.4%
Africa	606	17.0%
Papunesia	560	15.7%
North America	396	11.1%
South America	258	7.2%
Australia	183	5.1%

Latitude numeric feature

Geographic latitude in decimal degrees, ranging from -55.0 to 71.25 with a median of 8.29, consistent with global coverage skewed slightly toward the northern hemisphere. About 25.5% of rows are null, a notable gap for a positional field, and only 887 unique values among the 2,662 non-null rows suggest coordinates are rounded or shared between entries. The distribution is near-symmetric (skew 0.36, kurtosis -0.50) with just one outlier flagged.

Treatment: Pair with Longitude for geospatial features; impute or filter the 25.5% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["Latitude"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	887
min	-55
max	71.25
mean	11.88
median	8.292
std	22.72
q1	-5
q3	28
iqr	33
skew	0.3562
kurtosis	-0.5023
n_outliers	1
outlier_rate	0.0003757
zero_rate	0.002254
alert: null_rate	25.5% null

Fig 11.

Distribution of Latitude. Vertical dash marks the median.

Histogram bins for Latitude (median: 8.291666666665).
bin	count
-55 – -51.84	2
-51.84 – -48.69	1
-48.69 – -45.53	1
-45.53 – -42.38	2
-42.38 – -39.22	3
-39.22 – -36.06	5
-36.06 – -32.91	9
-32.91 – -29.75	18
-29.75 – -26.59	24
-26.59 – -23.44	29
-23.44 – -20.28	47
-20.28 – -17.12	64
-17.12 – -13.97	85
-13.97 – -10.81	86
-10.81 – -7.656	129
-7.656 – -4.5	187
-4.5 – -1.344	190
-1.344 – 1.812	130
1.812 – 4.969	119
4.969 – 8.125	196
8.125 – 11.28	161
11.28 – 14.44	104
14.44 – 17.59	145
17.59 – 20.75	82
20.75 – 23.91	63
23.91 – 27.06	74
27.06 – 30.22	84
30.22 – 33.38	66
33.38 – 36.53	84
36.53 – 39.69	67
39.69 – 42.84	86
42.84 – 46	62
46 – 49.16	74
49.16 – 52.31	52
52.31 – 55.47	44
55.47 – 58.62	18
58.62 – 61.78	20
61.78 – 64.94	23
64.94 – 68.09	19
68.09 – 71.25	7

Longitude numeric feature

Geographic longitude in decimal degrees, spanning -178.17 to 179.17 with 1360 distinct values across 3573 rows. The distribution is roughly symmetric (skew -0.33) but flat (kurtosis -1.05) with an IQR of 166.75, consistent with truly global coverage rather than a regional sample. Notable concern: 25.5% of rows are null, which will silently drop a quarter of any geospatial join.

Treatment: Pair with Latitude and impute or filter the 25.5% nulls before any geospatial modelling.

anthropic:claude-opus-4-7 · confidence high
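
As one possible reading of "pair with Latitude", the sketch below filters rows without coordinates and maps each remaining pair to a point on the unit sphere, which avoids the jump at ±180° longitude. The encoding is a common choice rather than anything the report prescribes, and the rows and IDs are made up.

```python
import math

def to_unit_vector(lat_deg, lon_deg):
    """Map (latitude, longitude) in degrees to (x, y, z) on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def geo_features(rows):
    """Keep rows with both coordinates and attach the 3D encoding."""
    out = []
    for row in rows:
        lat, lon = row.get("Latitude"), row.get("Longitude")
        if lat is None or lon is None:
            continue  # filter, rather than impute, missing coordinates
        out.append({**row, "xyz": to_unit_vector(lat, lon)})
    return out

# Made-up rows; the IDs are placeholders.
rows = [
    {"ID": "aaa", "Latitude": 0.0, "Longitude": 179.0},
    {"ID": "bbb", "Latitude": 0.0, "Longitude": -179.0},
    {"ID": "ccc", "Latitude": None, "Longitude": None},
]
features = geo_features(rows)
print(len(features), "rows kept")
```

The first two points are only 2° apart on the globe and end up close together in xyz space, even though their raw longitudes differ by 358.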

Out[25]:

saturn.columns["Longitude"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	1,360
min	-178.2
max	179.2
mean	35.17
median	34.79
std	89.35
q1	-45.75
q3	121
iqr	166.8
skew	-0.3259
kurtosis	-1.047
n_outliers	0
outlier_rate	0
zero_rate	0.001503
alert: null_rate	25.5% null

Fig 12.

Distribution of Longitude. Vertical dash marks the median.

Histogram bins for Longitude (median: 34.79166666665).
bin	count
-178.2 – -169.2	13
-169.2 – -160.3	5
-160.3 – -151.4	7
-151.4 – -142.4	8
-142.4 – -133.5	4
-133.5 – -124.6	15
-124.6 – -115.6	96
-115.6 – -106.7	36
-106.7 – -97.77	57
-97.77 – -88.83	107
-88.83 – -79.9	33
-79.9 – -70.97	114
-70.97 – -62.03	95
-62.03 – -53.1	55
-53.1 – -44.17	22
-44.17 – -35.23	7
-35.23 – -26.3	0
-26.3 – -17.37	1
-17.37 – -8.433	42
-8.433 – 0.5	109
0.5 – 9.433	108
9.433 – 18.37	170
18.37 – 27.3	107
27.3 – 36.23	159
36.23 – 45.17	94
45.17 – 54.1	62
54.1 – 63.03	17
63.03 – 71.97	20
71.97 – 80.9	67
80.9 – 89.83	73
89.83 – 98.77	102
98.77 – 107.7	83
107.7 – 116.6	56
116.6 – 125.6	129
125.6 – 134.5	112
134.5 – 143.4	190
143.4 – 152.4	177
152.4 – 161.3	48
161.3 – 170.2	51
170.2 – 179.2	11

Glottocode text foreign_key

This is a Glottocode field — fixed 8-character language identifiers from the Glottolog catalogue (e.g. basq1248, swis1247), with every value being a single token. About 26% of rows are null and 2502 distinct codes appear across 3573 rows, with 143 duplicates (5.4%) where the same language repeats — basq1248 leads with 11 occurrences. Length is rigidly 8 for min, median, and max, consistent with a controlled vocabulary identifier rather than free text.

Treatment: Treat as a categorical key; left-join to Glottolog metadata and handle the 26% nulls explicitly.

anthropic:claude-opus-4-7 · confidence high
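
Before joining, it may help to confirm that every value has the expected shape and to list codes that repeat. The pattern below (four lowercase alphanumerics followed by four digits) matches the examples in this report and the Glottocode format as commonly described; treat it as an assumption to verify against Glottolog. The input list is made up.

```python
import re
from collections import Counter

# Four lowercase alphanumerics followed by four digits, e.g. "basq1248".
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def check_glottocodes(values):
    """Return (malformed values, codes that occur more than once)."""
    present = [v for v in values if v]  # skip None and empty strings
    malformed = [v for v in present if not GLOTTOCODE_RE.fullmatch(v)]
    counts = Counter(present)
    repeated = {code: n for code, n in counts.items() if n > 1}
    return malformed, repeated

malformed, repeated = check_glottocodes(
    ["basq1248", "swis1247", "basq1248", None, "Basque"]
)
print(malformed)  # ['Basque']
print(repeated)   # {'basq1248': 2}
```

Repeated codes matter for the join: a left-join from this table stays one row per input row, but a join in the other direction would fan out.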

Out[28]:

saturn.columns["Glottocode"].stats

stat	value
n	3,573
nulls	928 (26.0%)
unique	2,502
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	143
duplicate_rate	0.05406
vocab_size	2,502
readability_flesch_mean	92.88
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: null_rate	26.0% null
alert: short_text	95th-percentile length under 20 chars

Fig 13.

Character-length distribution for Glottocode.

Character-length distribution for Glottocode (mean: 8.0).
chars	count
8	2645

ISO639P3code text foreign_key

This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and one word (len_min=len_max=3, one_word_rate=1.0), with examples like 'eus', 'gsw', 'deu' matching the standard. Coverage is incomplete — 26.84% of rows are null — and 2442 unique codes appear across 3573 rows with a 6.58% duplicate rate, so most languages occur only once or twice (top value 'eus' at 12).

Treatment: Treat as a categorical key; left-join to an ISO 639-3 reference table and decide on an explicit bucket for the 26.84% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[31]:

saturn.columns["ISO639P3code"].stats

stat	value
n	3,573
nulls	959 (26.8%)
unique	2,442
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	172
duplicate_rate	0.0658
vocab_size	2,442
readability_flesch_mean	119.5
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: null_rate	26.8% null
alert: short_text	95th-percentile length under 20 chars

Fig 14.

Character-length distribution for ISO639P3code.

Character-length distribution for ISO639P3code (mean: 3.0).
chars	count
3	2614

Family categorical feature

Categorical label for the language family of each row, with 254 distinct families across 3,573 records. The distribution is long-tailed but not dominated by one value: the top value, 'Niger-Congo', covers only 12.2% of non-null rows (9.1% of all rows) and ties exactly with 'Austronesian' at 324 each, and the entropy ratio of 0.70 indicates spread across many small families. Notable concern: 25.5% of rows are null, and a literal 'other' bucket already accounts for 72 rows.

Treatment: Impute or flag the 25.5% missing, then group rare families before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["Family"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	254
top_value	Niger-Congo
top_rate	0.1217
cardinality	254
entropy	5.631
entropy_ratio	0.7049
alert: null_rate	25.5% null

Fig 15.

Top values for Family.

Top values for Family (20 unique shown, of 254 total).
value	count	share
Niger-Congo	324	9.1%
Austronesian	324	9.1%
Indo-European	176	4.9%
Sino-Tibetan	146	4.1%
Afro-Asiatic	145	4.1%
Pama-Nyungan	121	3.4%
Trans-New Guinea	98	2.7%
other	72	2.0%
Altaic	65	1.8%
Oto-Manguean	56	1.6%
Austro-Asiatic	48	1.3%
Eastern Sudanic	47	1.3%
Uto-Aztecan	44	1.2%
Algic	31	0.9%
Mayan	30	0.8%
Arawakan	29	0.8%
Nakh-Daghestanian	28	0.8%
Mande	28	0.8%
Uralic	27	0.8%
Hokan	26	0.7%
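
The share column in this table and the top_rate stat use different denominators, which is why Niger-Congo shows as 9.1% here but 0.1217 in the stats. A short worked check using only counts from this report:

```python
N_ROWS = 3573    # all rows
N_NULLS = 911    # rows with no Family value
TOP_COUNT = 324  # Niger-Congo (Austronesian has the same count)

share_of_all_rows = TOP_COUNT / N_ROWS
share_of_non_null = TOP_COUNT / (N_ROWS - N_NULLS)

print(f"{share_of_all_rows:.1%}")   # 9.1%
print(f"{share_of_non_null:.4f}")   # 0.1217
```

The same distinction applies to every "Top values" table in the notebook: shares are over all rows, so they do not sum to 100% when a column has nulls.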

Subfamily categorical feature

This column records the linguistic subfamily classification of each row, with 32 distinct values dominated by Benue-Congo (200 occurrences, 21.95% of non-nulls), Eastern Malayo-Polynesian (159), and Tibeto-Burman (139). The striking issue is the 74.5% null rate — only about a quarter of the 3573 rows carry a subfamily label — yet entropy ratio of 0.77 indicates the populated values are reasonably spread across the 32 categories rather than collapsing onto one.

Treatment: Treat missingness as its own category and group rare subfamilies before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["Subfamily"].stats

stat	value
n	3,573
nulls	2,662 (74.5%)
unique	32
top_value	Benue-Congo
top_rate	0.2195
cardinality	32
entropy	3.856
entropy_ratio	0.7712
alert: null_rate	74.5% null

Fig 16.

Top values for Subfamily.

Top values for Subfamily (20 unique shown, of 32 total).
value	count	share
Benue-Congo	200	5.6%
Eastern Malayo-Polynesian	159	4.5%
Tibeto-Burman	139	3.9%
Chadic	47	1.3%
Mon-Khmer	38	1.1%
Adamawa-Ubangi	30	0.8%
Gur	27	0.8%
Daghestanian	25	0.7%
Cushitic	24	0.7%
Finno-Ugric	21	0.6%
Kwa	20	0.6%
North-Central Atlantic	20	0.6%
Nilotic	19	0.5%
Mixtecan	18	0.5%
Omotic	15	0.4%
Kainantu-Goroka	14	0.4%
Madang	13	0.4%
Awyu-Ok	10	0.3%
Surmic	10	0.3%
Je	9	0.3%

Genus categorical feature

This column holds linguistic genus labels (e.g., Oceanic, Bantu, Indic, Semitic, Germanic), a mid-level grouping in language classification. Cardinality is high at 625 distinct values across 3,573 rows with an entropy ratio of 0.856, so the distribution is broad and flat: the top value, 'Oceanic', covers only 5.6% of non-null rows (4.2% of all rows). Note the 25.5% null rate, which is flagged and would meaningfully shrink any analysis that conditions on genus.

Treatment: Treat as a high-cardinality categorical: group rare genera into 'Other' or target-encode, and add an explicit missing indicator for the 25.5% nulls.

anthropic:claude-opus-4-7 · confidence high
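
The grouping step in the treatment could look like the sketch below: nulls get their own label and any genus seen fewer than `min_count` times is folded into 'Other'. The threshold, the 'Missing' label, and the sample list are illustrative assumptions.

```python
from collections import Counter

def group_rare(values, min_count, rare="Other", missing="Missing"):
    """Replace nulls with `missing` and infrequent labels with `rare`."""
    counts = Counter(v for v in values if v is not None)
    grouped = []
    for v in values:
        if v is None:
            grouped.append(missing)
        elif counts[v] < min_count:
            grouped.append(rare)
        else:
            grouped.append(v)
    return grouped

genera = ["Oceanic", "Oceanic", "Bantu", "Bantu", "Indic", None]
grouped = group_rare(genera, min_count=2)
print(grouped)
# ['Oceanic', 'Oceanic', 'Bantu', 'Bantu', 'Other', 'Missing']
```

Keeping 'Missing' separate from 'Other' preserves the difference between a rare genus and no genus at all.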

Out[40]:

saturn.columns["Genus"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	625
top_value	Oceanic
top_rate	0.05597
cardinality	625
entropy	7.95
entropy_ratio	0.856
alert: null_rate	25.5% null

Fig 17.

Top values for Genus.

Top values for Genus (20 unique shown, of 625 total).
value	count	share
Oceanic	149	4.2%
Bantu	141	3.9%
Indic	50	1.4%
Western Pama-Nyungan	49	1.4%
Semitic	43	1.2%
Turkic	41	1.1%
Sign Languages	40	1.1%
Bodic	40	1.1%
Germanic	39	1.1%
Northern Pama-Nyungan	33	0.9%
Creoles and Pidgins	32	0.9%
Mayan	30	0.8%
Algonquian	29	0.8%
Central Malayo-Polynesian	29	0.8%
Iranian	26	0.7%
Romance	24	0.7%
Biu-Mandara	24	0.7%
Southeastern Pama-Nyungan	23	0.6%
Dravidian	23	0.6%
Malayo-Sumbawan	22	0.6%

GenusIcon categorical identifier

GenusIcon holds 613 distinct short hex-like codes (e.g. 'c688033') among its 625 non-null rows; 82.51% of the 3,573 rows are null. An entropy ratio of 0.9989 indicates the values are nearly uniformly distributed among non-nulls: the top value appears only twice (top_rate 0.0032) and 601 codes occur exactly once, so there is no dominant category and the column behaves like a near-unique tag rather than a real categorical feature.

Treatment: Drop for modelling; near-unique with 82.5% nulls.

anthropic:claude-opus-4-7 · confidence high
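
A drop rule like this one can be applied across the whole schema by thresholding the null percentages. The sketch below copies the percentages from the schema table; the 50% cut-off is an arbitrary choice for illustration, not a saturn default.

```python
# Null percentages copied from the schema table in this notebook.
NULL_PCT = {
    "ID": 0.0, "Name": 0.0, "Macroarea": 25.5, "Latitude": 25.5,
    "Longitude": 25.5, "Glottocode": 26.0, "ISO639P3code": 26.8,
    "Family": 25.5, "Subfamily": 74.5, "Genus": 25.5, "GenusIcon": 82.5,
    "ISO_codes": 26.1, "Samples_100": 25.5, "Samples_200": 25.5,
    "Country_ID": 25.7, "Source": 30.1, "Parent_ID": 7.1,
}

def columns_above(null_pct, threshold):
    """List columns whose null percentage exceeds `threshold`."""
    return [col for col, pct in null_pct.items() if pct > threshold]

drop_candidates = columns_above(NULL_PCT, threshold=50.0)
print(drop_candidates)  # ['Subfamily', 'GenusIcon']
```

Subfamily also crosses the threshold, but its nulls may be informative, so a null rate alone is a prompt to look closer rather than a reason to drop.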

Out[43]:

saturn.columns["GenusIcon"].stats

stat	value
n	3,573
nulls	2,948 (82.5%)
unique	613
top_value	c688033
top_rate	0.0032
cardinality	613
entropy	9.249
entropy_ratio	0.9989
alert: long_tail	601 singleton categories
alert: null_rate	82.5% null

Fig 18.

Top values for GenusIcon.

Top values for GenusIcon (20 unique shown, of 613 total).
value	count	share
c688033	2	0.1%
c803E33	2	0.1%
c804733	2	0.1%
c807D33	2	0.1%
c806233	2	0.1%
c805033	2	0.1%
c7A8033	2	0.1%
c805933	2	0.1%
c807433	2	0.1%
c806B33	2	0.1%
c718033	2	0.1%
c803533	2	0.1%
cCC8C51	1	0.0%
cCC6851	1	0.0%
cCC7E51	1	0.0%
c8FCC51	1	0.0%
cCC8051	1	0.0%
c528033	1	0.0%
cCC9F51	1	0.0%
cCCB551	1	0.0%

ISO_codes text foreign_key

This column holds ISO language codes: almost all values are single tokens of length 3 (len_mean 3.04, one_word_rate 0.99), matching ISO 639-3 conventions (e.g. 'eus', 'gsw', 'deu'). The exceptions are 26 values of length 7, which is what two three-letter codes joined by a separator would produce. 26.1% of rows are null and 172 duplicates exist, but with 2,468 unique values across 3,573 rows the vocabulary is wide, suggesting broad multilingual coverage rather than a few dominant languages. No value occurs more than 12 times, so the distribution has an extremely long tail.

Treatment: Treat as a categorical code key; impute or filter the 26.1% nulls before joining to a language reference table.

anthropic:claude-opus-4-7 · confidence high
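
Because len_max is 7 and about 1% of values are not single words, some cells probably hold more than one code. A hedged sketch of splitting a cell before the join follows. It assumes whitespace separates the codes, which the word-count stats suggest but the report does not confirm, and the two-code example cell is made up.

```python
import re

ISO_639_3_RE = re.compile(r"[a-z]{3}")  # shape check only, not a registry lookup

def split_iso_codes(value):
    """Split one ISO_codes cell into a list of three-letter codes."""
    if not value:
        return []
    codes = value.split()  # assumes whitespace between multiple codes
    bad = [c for c in codes if not ISO_639_3_RE.fullmatch(c)]
    if bad:
        raise ValueError(f"unexpected code(s): {bad}")
    return codes

print(split_iso_codes("eus"))      # ['eus']
print(split_iso_codes("gsw deu"))  # ['gsw', 'deu']
print(split_iso_codes(None))       # []
```

Splitting first means a join to a reference table matches each code separately instead of silently missing the multi-code cells.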

Out[46]:

saturn.columns["ISO_codes"].stats

stat	value
n	3,573
nulls	933 (26.1%)
unique	2,468
len_min	3
len_max	7
len_mean	3.039
len_median	3
len_p95	3
word_mean	1.01
word_median	1
n_empty	0
n_duplicates	172
duplicate_rate	0.06515
vocab_size	2,486
readability_flesch_mean	117.4
emoji_rate	0
url_rate	0
one_word_rate	0.9902
allcaps_rate	0
boilerplate_rate	0
alert: one_word	99.0% rows are a single word
alert: null_rate	26.1% null
alert: short_text	95th-percentile length under 20 chars

Fig 19.

Character-length distribution for ISO_codes.

Character-length distribution for ISO_codes (mean: 3.0393939393939395).
chars	count
3	2614
7	26

Samples_100 categorical feature

Boolean flag with only two values where 'False' dominates at 96.2% (2562 of 2662 non-null rows) and 'True' appears exactly 100 times. The null rate of 25.5% is high, suggesting the flag is only populated for a subset of records. Entropy ratio of 0.23 confirms severe imbalance.

Treatment: Treat as a rare-event boolean indicator; impute or encode nulls explicitly and avoid as a stratification key given the imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[49]:

saturn.columns["Samples_100"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	2
top_value	False
top_rate	0.9624
cardinality	2
entropy	0.231
entropy_ratio	0.231
alert: null_rate	25.5% null
alert: imbalance	top value is 96.2% of rows
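
The entropy figures in these tables are consistent with Shannon entropy in bits over non-null values, divided by log2 of the number of categories. That reading is inferred from the numbers rather than from saturn's documentation. A sketch that recomputes two of the reported ratios from counts in this notebook:

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    return entropy_bits(counts) / math.log2(len(counts))

samples_100 = [2562, 100]                    # False, True
macroarea = [659, 606, 560, 396, 258, 183]   # from the Macroarea table

print(round(entropy_ratio(samples_100), 3))
print(round(entropy_ratio(macroarea), 4))
```

The results should land close to the reported 0.231 for Samples_100 and 0.9511 for Macroarea. A ratio near 1 means the categories are used about equally, and a ratio near 0 means one category dominates.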

Fig 20.

Top values for Samples_100.

Top values for Samples_100 (2 unique shown, of 2 total).
value	count	share
False	2562	71.7%
True	100	2.8%

Samples_200 categorical label

Binary flag column with only two values (False/True) and heavy class imbalance — False accounts for 92.5% of non-null rows versus 200 True observations, hinting the name 'Samples_200' refers to a tagged 200-row subset. Roughly a quarter of rows (25.5%) are null, which is the main surprise and the reason for the null_rate alert. Entropy ratio of 0.385 confirms the distribution is far from balanced.

Treatment: Impute or explicitly encode nulls as a third category before using as a binary indicator.

anthropic:claude-opus-4-7 · confidence high
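
Encoding nulls as a third category might be done as below. The sketch accepts either real booleans or the strings 'True' and 'False', since the report does not say how the CSV stores the flag; the output labels are arbitrary.

```python
def encode_flag(value):
    """Map a nullable boolean flag to one of three string categories."""
    if value is None or value == "":
        return "missing"
    if value in (True, "True"):
        return "true"
    if value in (False, "False"):
        return "false"
    raise ValueError(f"unexpected flag value: {value!r}")

flags = ["True", "False", None, "False", ""]
encoded = [encode_flag(v) for v in flags]
print(encoded)
# ['true', 'false', 'missing', 'false', 'missing']
```

Raising on anything unexpected keeps a stray value from being counted as False by accident.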

Out[52]:

saturn.columns["Samples_200"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	2
top_value	False
top_rate	0.9249
cardinality	2
entropy	0.3848
entropy_ratio	0.3848
alert: null_rate	25.5% null

Fig 21.

Top values for Samples_200.

Top values for Samples_200 (2 unique shown, of 2 total).
value	count	share
False	2462	68.9%
True	200	5.6%

Country_ID categorical foreign_key

Country_ID looks like an ISO-style two-letter country code, with 337 distinct values across 3,573 rows and a fairly even spread (entropy ratio 0.752). The top country, PG, accounts for only 8.06% of non-null rows (6.0% of all rows), followed by AU, US, and ID. Notably, 25.69% of values are null, and the cardinality of 337 exceeds the ~250 ISO 3166-1 codes, suggesting non-standard or sub-region codes are mixed in.

Treatment: Impute or flag the 25.69% nulls and reconcile non-standard codes against an ISO 3166-1 reference before joining.

anthropic:claude-opus-4-7 · confidence high

Out[55]:

saturn.columns["Country_ID"].stats

stat	value
n	3,573
nulls	918 (25.7%)
unique	337
top_value	PG
top_rate	0.0806
cardinality	337
entropy	6.314
entropy_ratio	0.752
alert: null_rate	25.7% null

Fig 22.

Top values for Country_ID.

Top values for Country_ID (20 unique shown, of 337 total).
value	count	share
PG	214	6.0%
AU	185	5.2%
US	177	5.0%
ID	177	5.0%
IN	120	3.4%
MX	120	3.4%
RU	89	2.5%
NG	66	1.8%
BR	66	1.8%
CN	54	1.5%
CD	49	1.4%
CM	46	1.3%
CA	45	1.3%
CO	39	1.1%
ET	36	1.0%
PH	36	1.0%
PE	35	1.0%
NP	32	0.9%
TZ	28	0.8%
VU	28	0.8%

Source text metadata

This column holds bibliographic citation tags (e.g. 'Huber-and-Reed-1992', 'Boelaars-1950'), evidently the source reference for each row in what looks like a linguistic typology dataset. Values are short (median 25 chars, 2 words) and 45.5% are single tokens, consistent with author-year keys rather than prose. Cardinality is high (2373 unique of 3573) with 5% duplicates and a 30% null rate, so coverage is uneven and no single source dominates (top value appears only 14 times).

Treatment: Treat as a citation key: keep as categorical provenance metadata, optionally normalize casing and join to a bibliography table; do not use as a model feature.

anthropic:claude-opus-4-7 · confidence high

Out[58]:

saturn.columns["Source"].stats

stat	value
n	3,573
nulls	1,074 (30.1%)
unique	2,373
len_min	7
len_max	452
len_mean	42.07
len_median	25
len_p95	135
word_mean	2.854
word_median	2
n_empty	0
n_duplicates	126
duplicate_rate	0.05042
vocab_size	5,899
readability_flesch_mean	21.33
emoji_rate	0
url_rate	0
one_word_rate	0.4546
allcaps_rate	0
boilerplate_rate	0
alert: one_word	45.5% rows are a single word
alert: null_rate	30.1% null

Fig 23.

Character-length distribution for Source.

Character-length distribution for Source (mean: 42.07122849139656).
chars	count
7 – 18	937
18 – 29	518
29 – 40	280
40 – 52	203
52 – 63	136
63 – 74	90
74 – 85	61
85 – 96	42
96 – 107	35
107 – 118	36
118 – 129	27
129 – 140	21
140 – 152	12
152 – 163	13
163 – 174	6
174 – 185	14
185 – 196	6
196 – 207	5
207 – 218	3
218 – 230	4
230 – 241	4
241 – 252	5
252 – 263	3
263 – 274	6
274 – 285	7
285 – 296	3
296 – 307	4
307 – 318	1
318 – 330	1
330 – 341	3
341 – 352	1
352 – 363	1
363 – 374	0
374 – 385	2
385 – 396	1
396 – 408	1
408 – 419	4
419 – 430	1
430 – 441	0
441 – 452	2

Parent_ID categorical foreign_key

Parent_ID looks like a foreign key pointing to a parent node in the classification, usually a genus (e.g. 'genus-oceanic', 'genus-bantu') but sometimes a family (e.g. 'family-austronesian'), grouping the 3,573 rows under 911 distinct parents. The distribution is long-tailed but flat: the top value covers only 4.5% of non-null rows, the entropy ratio is 0.87, and 501 parents occur exactly once; 7.1% of values are null. Oceanic (149) and Bantu (141) lead, with Indic, Western Pama-Nyungan and Semitic (43 to 50 each) well behind.

Treatment: Left-join on this id to a genus or family lookup; treat the 7.1% nulls explicitly rather than one-hot encoding the 911 levels.

anthropic:claude-opus-4-7 · confidence high
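
If the parent values are themselves IDs of other rows in this table, which the 'genus-…' and 'family-…' prefixes hint at but the report does not confirm, the classification of a row can be recovered by following the links upward. A minimal sketch under that assumption follows; the mapping, including the language ID 'fij' and the genus-to-family link, is hypothetical.

```python
def ancestors(node_id, parent_of):
    """Follow Parent_ID links from `node_id` up to a row with no parent."""
    chain, seen = [], {node_id}
    current = parent_of.get(node_id)
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain

# Hypothetical ID -> Parent_ID mapping; "fij" is a placeholder language ID.
parent_of = {
    "fij": "genus-oceanic",
    "genus-oceanic": "family-austronesian",
    "family-austronesian": None,
}
print(ancestors("fij", parent_of))
# ['genus-oceanic', 'family-austronesian']
```

Under this reading, the 7.1% of rows with a null Parent_ID would be the roots of the hierarchy rather than rows with missing data.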

Out[61]:

saturn.columns["Parent_ID"].stats

stat	value
n	3,573
nulls	254 (7.1%)
unique	911
top_value	genus-oceanic
top_rate	0.04489
cardinality	911
entropy	8.554
entropy_ratio	0.87
alert: long_tail	501 singleton categories

Fig 24.

Top values for Parent_ID.

Top values for Parent_ID (20 unique shown, of 911 total).
value	count	share
genus-oceanic	149	4.2%
genus-bantu	141	3.9%
genus-indic	50	1.4%
genus-westernpamanyungan	49	1.4%
genus-semitic	43	1.2%
genus-turkic	41	1.1%
genus-signlanguages	40	1.1%
genus-bodic	40	1.1%
genus-germanic	39	1.1%
genus-northernpamanyungan	33	0.9%
genus-creolesandpidgins	32	0.9%
genus-mayan	30	0.8%
family-austronesian	30	0.8%
genus-algonquian	29	0.8%
genus-centralmalayopolynesian	29	0.8%
genus-iranian	26	0.7%
family-transnewguinea	25	0.7%
genus-romance	24	0.7%
genus-biumandara	24	0.7%
genus-southeasternpamanyungan	23	0.6%
