language-data-wals_languages · saturn notebook

Overview

Source: /home/coolhand/datasets/language-data/wals_languages.csv

Saturn profiled 3,573 rows across 17 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/datasets/language-data/wals_languages.csv",
    "--findings", "language-data-wals_languages.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset holds 3,573 WALS records across 17 columns combining identifiers (ISO codes, Glottocode), classifications (Family, Genus, Subfamily), geography (Latitude, Longitude, Macroarea, Country_ID), and sampling flags. The Family and Macroarea distributions are the most informative starting point: Niger-Congo and Austronesian lead at 324 rows each, and Eurasia (659) and Africa (606) are the largest of the six macroareas. Roughly a quarter of rows (911, null rate ~0.255) share an identical null count across the geographic, family, genus, and sampling fields, which suggests they are missing in lockstep. That count equals 254 families + 32 subfamilies + 625 genera, so these rows may be classification nodes stored alongside the 2,662 language rows rather than unclassified languages; this is worth confirming against the raw file. The Samples_100 and Samples_200 flags are highly imbalanced (only 100 and 200 'True' values respectively), reflecting curated WALS sub-samples. Subfamily is sparsely populated (74.5% null), so treat it as supplementary rather than primary.

citing: Family.top_values · Macroarea.top_values · Genus.top_values · Country_ID.top_values · Samples_100.stats · Samples_200.stats · Latitude.stats · Longitude.stats · Subfamily.null_rate
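The lockstep claim above rests on matching null counts, not on a row-level check. A minimal sketch of such a check follows, assuming nulls are stored as empty CSV cells; the three-row sample and the helper name `null_patterns` are hypothetical and not part of saturn.

```python
import csv
import io
from collections import Counter


def null_patterns(rows, columns):
    """Count rows by the exact set of `columns` that are empty in that row."""
    patterns = Counter()
    for row in rows:
        missing = tuple(c for c in columns if not (row.get(c) or "").strip())
        patterns[missing] += 1
    return patterns


# Hypothetical three-row sample, not taken from the real file.
sample = io.StringIO(
    "ID,Macroarea,Latitude,Family\n"
    "aab,Africa,8.5,Niger-Congo\n"
    "xx1,,,\n"
    "xx2,,,\n"
)
rows = list(csv.DictReader(sample))
patterns = null_patterns(rows, ["Macroarea", "Latitude", "Family"])
for missing, count in patterns.items():
    print(count, missing)
```

If the fields really are missing together, the real file should yield only two dominant patterns: the empty tuple and the full set of checked columns.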

Out[4]:

saturn.schema() · 17 columns

column	kind	n	null%	unique	alerts
ID	text	3,573	0.0%	3,573	near_unique one_word short_text
Name	text	3,573	0.0%	3,198	one_word short_text
Macroarea	categorical	3,573	25.5%	6	null_rate
Latitude	numeric	3,573	25.5%	887	null_rate
Longitude	numeric	3,573	25.5%	1,360	null_rate
Glottocode	text	3,573	26.0%	2,502	one_word null_rate short_text
ISO639P3code	text	3,573	26.8%	2,442	one_word null_rate short_text
Family	categorical	3,573	25.5%	254	null_rate
Subfamily	categorical	3,573	74.5%	32	null_rate
Genus	categorical	3,573	25.5%	625	null_rate
GenusIcon	categorical	3,573	82.5%	613	long_tail null_rate
ISO_codes	text	3,573	26.1%	2,468	one_word null_rate short_text
Samples_100	categorical	3,573	25.5%	2	null_rate imbalance
Samples_200	categorical	3,573	25.5%	2	null_rate
Country_ID	categorical	3,573	25.7%	337	null_rate
Source	text	3,573	30.1%	2,373	one_word null_rate
Parent_ID	categorical	3,573	7.1%	911	long_tail

Fig 1.

Family · Top language families — Niger-Congo and Austronesian tie at 324 each, with a long tail across 254 families.

Top values for Family (20 unique shown, of 254 total).
value	count	share
Niger-Congo	324	9.1%
Austronesian	324	9.1%
Indo-European	176	4.9%
Sino-Tibetan	146	4.1%
Afro-Asiatic	145	4.1%
Pama-Nyungan	121	3.4%
Trans-New Guinea	98	2.7%
other	72	2.0%
Altaic	65	1.8%
Oto-Manguean	56	1.6%
Austro-Asiatic	48	1.3%
Eastern Sudanic	47	1.3%
Uto-Aztecan	44	1.2%
Algic	31	0.9%
Mayan	30	0.8%
Arawakan	29	0.8%
Nakh-Daghestanian	28	0.8%
Mande	28	0.8%
Uralic	27	0.8%
Hokan	26	0.7%

Fig 2.

Macroarea · Six macroareas, led by Eurasia (659) and Africa (606); shows global coverage balance.

Top values for Macroarea (6 unique shown, of 6 total).
value	count	share
Eurasia	659	18.4%
Africa	606	17.0%
Papunesia	560	15.7%
North America	396	11.1%
South America	258	7.2%
Australia	183	5.1%

Fig 3.

Country_ID · Country distribution — Papua New Guinea, Australia, US, and Indonesia top the list, signaling linguistic hotspots.

Top values for Country_ID (20 unique shown, of 337 total).
value	count	share
PG	214	6.0%
AU	185	5.2%
US	177	5.0%
ID	177	5.0%
IN	120	3.4%
MX	120	3.4%
RU	89	2.5%
NG	66	1.8%
BR	66	1.8%
CN	54	1.5%
CD	49	1.4%
CM	46	1.3%
CA	45	1.3%
CO	39	1.1%
ET	36	1.0%
PH	36	1.0%
PE	35	1.0%
NP	32	0.9%
TZ	28	0.8%
VU	28	0.8%

Fig 4.

Latitude · Latitude spread from -55 to 71 with median ~8°, showing a tropical/northern-hemisphere skew.

Histogram bins for Latitude (median: 8.291666666665).
bin	count
-55 – -51.84	2
-51.84 – -48.69	1
-48.69 – -45.53	1
-45.53 – -42.38	2
-42.38 – -39.22	3
-39.22 – -36.06	5
-36.06 – -32.91	9
-32.91 – -29.75	18
-29.75 – -26.59	24
-26.59 – -23.44	29
-23.44 – -20.28	47
-20.28 – -17.12	64
-17.12 – -13.97	85
-13.97 – -10.81	86
-10.81 – -7.656	129
-7.656 – -4.5	187
-4.5 – -1.344	190
-1.344 – 1.812	130
1.812 – 4.969	119
4.969 – 8.125	196
8.125 – 11.28	161
11.28 – 14.44	104
14.44 – 17.59	145
17.59 – 20.75	82
20.75 – 23.91	63
23.91 – 27.06	74
27.06 – 30.22	84
30.22 – 33.38	66
33.38 – 36.53	84
36.53 – 39.69	67
39.69 – 42.84	86
42.84 – 46	62
46 – 49.16	74
49.16 – 52.31	52
52.31 – 55.47	44
55.47 – 58.62	18
58.62 – 61.78	20
61.78 – 64.94	23
64.94 – 68.09	19
68.09 – 71.25	7

Fig 5.

Samples_200 · Only 200 of 2,662 non-null rows are flagged True — useful for filtering to the curated WALS 200-sample set.

Top values for Samples_200 (2 unique shown, of 2 total).
value	count	share
False	2462	68.9%
True	200	5.6%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
ID	text	0.0%
Name	text	0.0%
Macroarea	categorical	25.5%
Latitude	numeric	25.5%
Longitude	numeric	25.5%
Glottocode	text	26.0%
ISO639P3code	text	26.8%
Family	categorical	25.5%
Subfamily	categorical	74.5%
Genus	categorical	25.5%
GenusIcon	categorical	82.5%
ISO_codes	text	26.1%
Samples_100	categorical	25.5%
Samples_200	categorical	25.5%
Country_ID	categorical	25.7%
Source	text	30.1%
Parent_ID	categorical	7.1%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	Latitude	Longitude
Latitude	+1.00	-0.38
Longitude	-0.38	+1.00
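The report does not say how nulls are handled when computing this correlation. As a sketch only, the version below assumes pairwise-complete observations (rows where both values are present); the sample coordinates are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson r over pairs where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Made-up values; the third pair is dropped because latitude is missing.
lat = [-10.0, 0.0, None, 10.0, 20.0]
lon = [40.0, 30.0, 99.0, 20.0, 10.0]
r = pearson(lat, lon)
print(round(r, 2))
```

The function raises a ZeroDivisionError if either column is constant, which is acceptable for a sketch but would need handling in real use.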

ID text identifier

This is an identifier column: every one of the 3573 rows holds a unique single-token string with no nulls or duplicates. Values are short (median length 3, max 36) and the vocabulary equals the row count (3573), confirming one-to-one uniqueness. Top tokens like 'aab', 'aar', 'aba' suggest short alphabetic codes rather than numeric keys.

Treatment: drop from modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["ID"].stats

stat	value
n	3,573
nulls	0 (0.0%)
unique	3,573
len_min	2
len_max	36
len_mean	5.982
len_median	3
len_p95	17
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	3,573
readability_flesch_mean	61.58
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for ID.

Character-length distribution for ID (mean: 5.9818080044780295).
chars	count
2 – 3	11
3 – 4	2651
4 – 5	0
5 – 5	0
5 – 6	0
6 – 7	0
7 – 8	0
8 – 9	2
9 – 10	17
10 – 10	53
10 – 11	89
11 – 12	137
12 – 13	122
13 – 14	0
14 – 15	110
15 – 16	90
16 – 16	67
16 – 17	52
17 – 18	48
18 – 19	0
19 – 20	21
20 – 21	22
21 – 22	20
22 – 22	17
22 – 23	10
23 – 24	11
24 – 25	0
25 – 26	7
26 – 27	3
27 – 28	0
28 – 28	2
28 – 29	3
29 – 30	2
30 – 31	0
31 – 32	1
32 – 33	2
33 – 33	2
33 – 34	0
34 – 35	0
35 – 36	1

Name text label

This column holds short proper-noun labels, almost certainly language or ethnonym names, given top values like 'Basque', 'Ainu', 'Beothuk' and frequent words 'sign', 'language', 'arabic', 'german'. Entries are terse (mean 8.7 chars, 80% one-word) but not unique: 375 duplicates (10.5%) leave only 3,198 distinct names across 3,573 rows, and several names appear exactly 3 times. This suggests that some names recur across multiple records or variants (e.g. '(northern)', '(southern)'). No nulls, no URLs, no emoji.

Treatment: Treat as a categorical key; deduplicate or join on a normalized form before aggregating.

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["Name"].stats

stat	value
n	3,573
nulls	0 (0.0%)
unique	3,198
len_min	2
len_max	46
len_mean	8.705
len_median	7
len_p95	19
word_mean	1.258
word_median	1
n_empty	0
n_duplicates	375
duplicate_rate	0.105
vocab_size	3,383
readability_flesch_mean	48.16
emoji_rate	0
url_rate	0
one_word_rate	0.8002
allcaps_rate	0
boilerplate_rate	0
alert: one_word	80.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 9.

Character-length distribution for Name.

Character-length distribution for Name (mean: 8.705009795689897).
chars	count
2 – 3	112
3 – 4	361
4 – 5	528
5 – 6	581
6 – 8	502
8 – 9	305
9 – 10	198
10 – 11	152
11 – 12	84
12 – 13	88
13 – 14	131
14 – 15	89
15 – 16	76
16 – 17	77
17 – 18	66
18 – 20	46
20 – 21	34
21 – 22	31
22 – 23	21
23 – 24	19
24 – 25	21
25 – 26	8
26 – 27	9
27 – 28	9
28 – 30	5
30 – 31	2
31 – 32	3
32 – 33	4
33 – 34	2
34 – 35	3
35 – 36	2
36 – 37	1
37 – 38	1
38 – 39	0
39 – 40	0
40 – 42	0
42 – 43	0
43 – 44	0
44 – 45	0
45 – 46	2

Macroarea categorical feature

Macroarea is a coarse geographic grouping with 6 categories spanning Eurasia, Africa, Papunesia, North America, South America, and Australia, consistent with WALS/Glottolog-style language area labels. Distribution is relatively even (entropy ratio 0.95; top value Eurasia holds 24.8% of non-null rows, or 18.4% of all rows), so no single region dominates. The 25.5% null rate is substantial and flagged.
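The top_rate and entropy figures for this column can be reproduced from the six counts in the table. The sketch below assumes Shannon entropy in bits, normalised by log2 of the cardinality, and a top_rate computed over non-null rows; saturn does not document these definitions here, but they match the reported values.

```python
import math

# Counts copied from the Macroarea table in this report.
counts = {
    "Eurasia": 659, "Africa": 606, "Papunesia": 560,
    "North America": 396, "South America": 258, "Australia": 183,
}
n_rows = 3573
n_non_null = sum(counts.values())


def entropy_bits(values):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(values)
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)


entropy = entropy_bits(list(counts.values()))
entropy_ratio = entropy / math.log2(len(counts))
share_of_all = counts["Eurasia"] / n_rows
share_of_non_null = counts["Eurasia"] / n_non_null
print(n_non_null, round(entropy, 3), round(entropy_ratio, 4))
print(round(share_of_all, 3), round(share_of_non_null, 4))
```

The two shares explain why the stats block reports 0.2476 while the table's share column shows 18.4%: the denominators differ.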

Treatment: One-hot encode and add an explicit 'missing' category to preserve the 25.5% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["Macroarea"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	6
top_value	Eurasia
top_rate	0.2476
cardinality	6
entropy	2.459
entropy_ratio	0.9511
alert: null_rate	25.5% null

Fig 10.

Top values for Macroarea.

Top values for Macroarea (6 unique shown, of 6 total).
value	count	share
Eurasia	659	18.4%
Africa	606	17.0%
Papunesia	560	15.7%
North America	396	11.1%
South America	258	7.2%
Australia	183	5.1%

Latitude numeric feature

Geographic latitude in degrees, ranging from -55.0 to 71.25 with a median of 8.29 and IQR of 33.0, consistent with a worldwide point distribution. The 25.5% null rate is notable and flagged, while skew (0.36) and kurtosis (-0.50) indicate a fairly symmetric, slightly flat spread with only one outlier.
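The single outlier is consistent with a Tukey 1.5 × IQR rule, although the report does not state which rule saturn applies, so treat that as an assumption. A short check using the quartiles reported for Latitude and Longitude:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles, minimum and maximum copied from the stats tables in this report.
lat_low, lat_high = tukey_fences(-5.0, 28.0)
lon_low, lon_high = tukey_fences(-45.75, 121.0)

lat_min_is_outlier = -55.0 < lat_low    # reported minimum latitude
lat_max_is_outlier = 71.25 > lat_high   # reported maximum latitude
print(lat_low, lat_high, lat_min_is_outlier, lat_max_is_outlier)
print(lon_low, lon_high)
```

Under this rule the minimum latitude of -55 falls just below the lower fence, while the longitude fences lie outside the valid range of -180 to 180, which would explain why no longitude value can be flagged.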

Treatment: Impute or filter the 25.5% missing values, and pair with longitude for any geospatial modelling.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["Latitude"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	887
min	-55
max	71.25
mean	11.88
median	8.292
std	22.72
q1	-5
q3	28
iqr	33
skew	0.3562
kurtosis	-0.5023
n_outliers	1
outlier_rate	0.0003757
zero_rate	0.002254
alert: null_rate	25.5% null

Fig 11.

Distribution of Latitude. Vertical dash marks the median.

Histogram bins for Latitude (median: 8.291666666665).
bin	count
-55 – -51.84	2
-51.84 – -48.69	1
-48.69 – -45.53	1
-45.53 – -42.38	2
-42.38 – -39.22	3
-39.22 – -36.06	5
-36.06 – -32.91	9
-32.91 – -29.75	18
-29.75 – -26.59	24
-26.59 – -23.44	29
-23.44 – -20.28	47
-20.28 – -17.12	64
-17.12 – -13.97	85
-13.97 – -10.81	86
-10.81 – -7.656	129
-7.656 – -4.5	187
-4.5 – -1.344	190
-1.344 – 1.812	130
1.812 – 4.969	119
4.969 – 8.125	196
8.125 – 11.28	161
11.28 – 14.44	104
14.44 – 17.59	145
17.59 – 20.75	82
20.75 – 23.91	63
23.91 – 27.06	74
27.06 – 30.22	84
30.22 – 33.38	66
33.38 – 36.53	84
36.53 – 39.69	67
39.69 – 42.84	86
42.84 – 46	62
46 – 49.16	74
49.16 – 52.31	52
52.31 – 55.47	44
55.47 – 58.62	18
58.62 – 61.78	20
61.78 – 64.94	23
64.94 – 68.09	19
68.09 – 71.25	7

Longitude numeric feature

Geographic longitude in degrees, spanning nearly the full globe from -178.17 to 179.17 with a mild negative skew (-0.33) and flat kurtosis (-1.05), consistent with a worldwide point distribution. The 25.5% null rate is the main concern. Only 1,360 unique values appear among the 2,662 non-null rows, suggesting repeated locations or rounded coordinates. No outliers flagged, as expected for a bounded angular measure.

Treatment: Pair with Latitude for geospatial features; impute or drop the 25.5% missing before modelling.
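One common way to pair the two columns is to map each point onto the unit sphere, so that longitudes on either side of the ±180° line end up close together. This is a general technique offered as a sketch, not something the saturn report itself does; the function name is hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Two equatorial points near the reported longitude extremes.
east = to_unit_vector(0.0, 179.17)
west = to_unit_vector(0.0, -178.17)
dot = sum(a * b for a, b in zip(east, west))
origin = to_unit_vector(0.0, 0.0)
print(round(dot, 4), origin)
```

The dot product near 1 shows the two points are neighbours on the sphere even though their raw longitudes differ by more than 357 degrees.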

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["Longitude"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	1,360
min	-178.2
max	179.2
mean	35.17
median	34.79
std	89.35
q1	-45.75
q3	121
iqr	166.8
skew	-0.3259
kurtosis	-1.047
n_outliers	0
outlier_rate	0
zero_rate	0.001503
alert: null_rate	25.5% null

Fig 12.

Distribution of Longitude. Vertical dash marks the median.

Histogram bins for Longitude (median: 34.79166666665).
bin	count
-178.2 – -169.2	13
-169.2 – -160.3	5
-160.3 – -151.4	7
-151.4 – -142.4	8
-142.4 – -133.5	4
-133.5 – -124.6	15
-124.6 – -115.6	96
-115.6 – -106.7	36
-106.7 – -97.77	57
-97.77 – -88.83	107
-88.83 – -79.9	33
-79.9 – -70.97	114
-70.97 – -62.03	95
-62.03 – -53.1	55
-53.1 – -44.17	22
-44.17 – -35.23	7
-35.23 – -26.3	0
-26.3 – -17.37	1
-17.37 – -8.433	42
-8.433 – 0.5	109
0.5 – 9.433	108
9.433 – 18.37	170
18.37 – 27.3	107
27.3 – 36.23	159
36.23 – 45.17	94
45.17 – 54.1	62
54.1 – 63.03	17
63.03 – 71.97	20
71.97 – 80.9	67
80.9 – 89.83	73
89.83 – 98.77	102
98.77 – 107.7	83
107.7 – 116.6	56
116.6 – 125.6	129
125.6 – 134.5	112
134.5 – 143.4	190
143.4 – 152.4	177
152.4 – 161.3	48
161.3 – 170.2	51
170.2 – 179.2	11

Glottocode text foreign_key

This column holds Glottocodes, fixed 8-character language identifiers from the Glottolog catalogue (e.g. 'basq1248', 'stan1295'), with every value a single token of length exactly 8. About 26% of rows are null, and 2,502 distinct codes cover the 2,645 non-null records, a 5.4% duplicate rate (143 repeats); the most repeated code 'basq1248' appears 11 times, suggesting multiple records can share a language.

Treatment: Left-join on this code to a Glottolog reference table; impute or flag the 26% nulls separately.
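A minimal sketch of that left join, keeping null and unmatched codes visible instead of dropping them. The shape check (four lowercase alphanumerics followed by four digits) is a loose assumption about the Glottocode format, and the reference lookup, row IDs, and `glottolog_name` field are hypothetical.

```python
import re

# Loose shape check; assumed, not taken from the report.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def left_join_glottolog(rows, reference):
    """Attach reference fields by Glottocode and record a status for every row."""
    out = []
    for row in rows:
        code = (row.get("Glottocode") or "").strip()
        merged = dict(row)
        if not code:
            merged["glottocode_status"] = "missing"
        elif not GLOTTOCODE_RE.fullmatch(code):
            merged["glottocode_status"] = "malformed"
        elif code in reference:
            merged["glottocode_status"] = "matched"
            merged.update(reference[code])
        else:
            merged["glottocode_status"] = "unmatched"
        out.append(merged)
    return out


# Hypothetical reference table and rows, for illustration only.
reference = {"basq1248": {"glottolog_name": "Basque"}}
rows = [
    {"ID": "r1", "Glottocode": "basq1248"},
    {"ID": "r2", "Glottocode": ""},
    {"ID": "r3", "Glottocode": "stan1295"},
    {"ID": "r4", "Glottocode": "BAD"},
]
joined = left_join_glottolog(rows, reference)
statuses = [r["glottocode_status"] for r in joined]
print(statuses)
```

Because duplicates exist in this column, the join is many-to-one: several rows may pick up the same reference record.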

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["Glottocode"].stats

stat	value
n	3,573
nulls	928 (26.0%)
unique	2,502
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	143
duplicate_rate	0.05406
vocab_size	2,502
readability_flesch_mean	92.88
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: null_rate	26.0% null
alert: short_text	95th-percentile length under 20 chars

Fig 13.

Character-length distribution for Glottocode.

Character-length distribution for Glottocode (mean: 8.0).
chars	count
8	2645

ISO639P3code text foreign_key

This column holds ISO 639-3 language codes: every non-null value is exactly 3 characters and a single word (len_mean 3.0, one_word_rate 1.0), with familiar codes like 'eus' (Basque), 'deu' (German), and 'gsw' (Swiss German) at the top. Coverage is incomplete: 26.84% of rows are null, and the 2,614 non-null rows hold 2,442 unique codes, a 6.58% duplicate rate (172 repeats). Nothing in the evidence indicates which entity each code is tagging.
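The duplicate figures appear to be computed over non-null values only. The sketch below assumes n_duplicates is the non-null count minus the distinct count; that definition is inferred from the reported numbers and is not documented by saturn here.

```python
def duplicate_stats(values):
    """Duplicate count and rate over non-null, non-empty values."""
    non_null = [v for v in values if v not in (None, "")]
    n_unique = len(set(non_null))
    n_duplicates = len(non_null) - n_unique
    rate = n_duplicates / len(non_null) if non_null else 0.0
    return {
        "non_null": len(non_null),
        "unique": n_unique,
        "n_duplicates": n_duplicates,
        "duplicate_rate": rate,
    }


# Small illustrative sample (codes taken from the text above).
stats = duplicate_stats(["eus", "eus", "deu", None, "gsw", ""])

# Same arithmetic applied to the reported ISO639P3code counts.
reported_duplicates = 2614 - 2442
reported_rate = reported_duplicates / 2614
print(stats, reported_duplicates, round(reported_rate, 4))
```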

Treatment: Treat as a categorical join key to an ISO 639-3 reference table; impute or filter the 26.84% nulls before use.

anthropic:claude-opus-4-7 · confidence high

Out[31]:

saturn.columns["ISO639P3code"].stats

stat	value
n	3,573
nulls	959 (26.8%)
unique	2,442
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	172
duplicate_rate	0.0658
vocab_size	2,442
readability_flesch_mean	119.5
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: null_rate	26.8% null
alert: short_text	95th-percentile length under 20 chars

Fig 14.

Character-length distribution for ISO639P3code.

Character-length distribution for ISO639P3code (mean: 3.0).
chars	count
3	2614

Family categorical feature

Categorical label assigning rows to one of 254 language families, headed by Niger-Congo and Austronesian (tied at 324 rows each, which is 12.2% of non-null rows or 9.1% of all rows). The tail is long, though an entropy ratio of 0.705 indicates the distribution is fairly spread across families rather than dominated by a few. 25.5% of rows are null, a substantial gap for what looks like a taxonomic feature.

Treatment: Impute or add an explicit 'unknown' category for the 25.5% nulls, then group rare families before encoding.
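A sketch of that treatment, with an arbitrary rarity threshold chosen for illustration. Since Family already contains a literal 'other' value (72 rows), the bucket labels here are deliberately distinct from it; both label names are hypothetical.

```python
from collections import Counter


def bucket_categories(values, min_count,
                      rare_label="__rare__", missing_label="__unknown__"):
    """Replace nulls with `missing_label` and infrequent values with `rare_label`."""
    counts = Counter(v for v in values if v)
    out = []
    for v in values:
        if not v:
            out.append(missing_label)
        elif counts[v] < min_count:
            out.append(rare_label)
        else:
            out.append(v)
    return out


# Illustrative sample only; real thresholds depend on the modelling task.
families = ["Niger-Congo", "Niger-Congo", "Austronesian", "Austronesian",
            "Hokan", None, "other"]
bucketed = bucket_categories(families, min_count=2)
print(bucketed)
```

Keeping nulls and rare values in separate buckets preserves the distinction between "family unknown" and "family known but small".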

anthropic:claude-opus-4-7 · confidence high

Out[34]:

saturn.columns["Family"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	254
top_value	Niger-Congo
top_rate	0.1217
cardinality	254
entropy	5.631
entropy_ratio	0.7049
alert: null_rate	25.5% null

Fig 15.

Top values for Family.

Top values for Family (20 unique shown, of 254 total).
value	count	share
Niger-Congo	324	9.1%
Austronesian	324	9.1%
Indo-European	176	4.9%
Sino-Tibetan	146	4.1%
Afro-Asiatic	145	4.1%
Pama-Nyungan	121	3.4%
Trans-New Guinea	98	2.7%
other	72	2.0%
Altaic	65	1.8%
Oto-Manguean	56	1.6%
Austro-Asiatic	48	1.3%
Eastern Sudanic	47	1.3%
Uto-Aztecan	44	1.2%
Algic	31	0.9%
Mayan	30	0.8%
Arawakan	29	0.8%
Nakh-Daghestanian	28	0.8%
Mande	28	0.8%
Uralic	27	0.8%
Hokan	26	0.7%

Subfamily categorical feature

This column records the linguistic subfamily classification of entries, drawn from a controlled vocabulary of 32 values such as Benue-Congo, Eastern Malayo-Polynesian, and Tibeto-Burman. Coverage is the main concern: 74.5% of rows are null, leaving only 911 labelled records, with Benue-Congo accounting for 21.95% of those. Among populated rows the distribution is reasonably diverse (entropy ratio 0.77), so the signal is informative where present but sparse overall.

Treatment: Treat as a sparse categorical: impute an explicit 'unknown' level before encoding, since 74.5% are null.

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["Subfamily"].stats

stat	value
n	3,573
nulls	2,662 (74.5%)
unique	32
top_value	Benue-Congo
top_rate	0.2195
cardinality	32
entropy	3.856
entropy_ratio	0.7712
alert: null_rate	74.5% null

Fig 16.

Top values for Subfamily.

Top values for Subfamily (20 unique shown, of 32 total).
value	count	share
Benue-Congo	200	5.6%
Eastern Malayo-Polynesian	159	4.5%
Tibeto-Burman	139	3.9%
Chadic	47	1.3%
Mon-Khmer	38	1.1%
Adamawa-Ubangi	30	0.8%
Gur	27	0.8%
Daghestanian	25	0.7%
Cushitic	24	0.7%
Finno-Ugric	21	0.6%
Kwa	20	0.6%
North-Central Atlantic	20	0.6%
Nilotic	19	0.5%
Mixtecan	18	0.5%
Omotic	15	0.4%
Kainantu-Goroka	14	0.4%
Madang	13	0.4%
Awyu-Ok	10	0.3%
Surmic	10	0.3%
Je	9	0.3%

Genus categorical feature

Genus is a linguistic genus label (subfamily-level grouping of languages), with values like Oceanic, Bantu, Indic, and Semitic. It is highly diverse: 625 distinct genera with entropy ratio 0.86, and the top value Oceanic covers only 5.6% of non-null rows (4.2% of all rows). 25.5% of rows are null, which is the flagged concern.

Treatment: Treat as a high-cardinality categorical: target- or frequency-encode and explicitly model the 25.5% missing as its own category.
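Frequency encoding is the simpler of the two options, since it needs no target variable. A minimal sketch that treats missing values as their own category; the label name and the four-value sample are hypothetical.

```python
from collections import Counter


def frequency_encode(values, missing_label="__missing__"):
    """Encode each value as its relative frequency, with nulls as one category."""
    filled = [v if v else missing_label for v in values]
    counts = Counter(filled)
    total = len(filled)
    return [counts[v] / total for v in filled]


# Illustrative sample only.
genera = ["Oceanic", "Oceanic", "Bantu", None]
encoded = frequency_encode(genera)
print(encoded)
```

Note that distinct categories with equal counts receive the same code, as 'Bantu' and the missing bucket do here; that is an inherent limitation of this encoding.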

anthropic:claude-opus-4-7 · confidence high

Out[40]:

saturn.columns["Genus"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	625
top_value	Oceanic
top_rate	0.05597
cardinality	625
entropy	7.95
entropy_ratio	0.856
alert: null_rate	25.5% null

Fig 17.

Top values for Genus.

Top values for Genus (20 unique shown, of 625 total).
value	count	share
Oceanic	149	4.2%
Bantu	141	3.9%
Indic	50	1.4%
Western Pama-Nyungan	49	1.4%
Semitic	43	1.2%
Turkic	41	1.1%
Sign Languages	40	1.1%
Bodic	40	1.1%
Germanic	39	1.1%
Northern Pama-Nyungan	33	0.9%
Creoles and Pidgins	32	0.9%
Mayan	30	0.8%
Algonquian	29	0.8%
Central Malayo-Polynesian	29	0.8%
Iranian	26	0.7%
Romance	24	0.7%
Biu-Mandara	24	0.7%
Southeastern Pama-Nyungan	23	0.6%
Dravidian	23	0.6%
Malayo-Sumbawan	22	0.6%

GenusIcon categorical metadata

GenusIcon is a high-cardinality categorical with 613 unique values among just 625 non-null rows; 82.51% of the 3,573 rows are null. An entropy ratio of 0.9989 and a top_rate of just 0.0032 mean the non-null values are nearly uniformly distributed, with the most frequent code 'c688033' appearing only twice. The hex-like tokens (e.g. 'c807D33') suggest icon identifiers or color/asset codes rather than a meaningful category.

Treatment: Drop or retain as a sparse asset reference; not useful as a modelling feature given near-unique values and 82.51% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[43]:

saturn.columns["GenusIcon"].stats

stat	value
n	3,573
nulls	2,948 (82.5%)
unique	613
top_value	c688033
top_rate	0.0032
cardinality	613
entropy	9.249
entropy_ratio	0.9989
alert: long_tail	601 singleton categories
alert: null_rate	82.5% null

Fig 18.

Top values for GenusIcon.

Top values for GenusIcon (20 unique shown, of 613 total).
value	count	share
c688033	2	0.1%
c803E33	2	0.1%
c804733	2	0.1%
c807D33	2	0.1%
c806233	2	0.1%
c805033	2	0.1%
c7A8033	2	0.1%
c805933	2	0.1%
c807433	2	0.1%
c806B33	2	0.1%
c718033	2	0.1%
c803533	2	0.1%
cCC8C51	1	0.0%
cCC6851	1	0.0%
cCC7E51	1	0.0%
c8FCC51	1	0.0%
cCC8051	1	0.0%
c528033	1	0.0%
cCC9F51	1	0.0%
cCCB551	1	0.0%

ISO_codes text feature

Almost certainly ISO 639-3 language codes: 99% are single tokens, length is tightly clustered at 3 characters (min 3, max 7, p95 3), and top values like 'eus', 'deu', 'gsw', 'bod', 'roh' are recognisable three-letter language identifiers. Cardinality is high (2,468 unique among 2,640 non-null rows) with a 26.1% null rate and 172 duplicates, so coverage is partial and no single code dominates (top value 'eus' appears just 12 times). The 26 length-7 entries are anomalous for a strict ISO 639-3 field. They match the roughly 1% of rows with more than one word, which would fit two three-letter codes joined by a separator, and are worth inspecting.

Treatment: Treat as a categorical code; validate against the ISO 639-3 list and investigate entries longer than 3 characters.
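A sketch of that inspection step. It assumes multi-code entries are whitespace-separated, which is suggested by word_mean 1.01 and by 7 = 3 + 1 + 3 but is not confirmed by the stats. The regex checks shape only; membership in the ISO 639-3 list would need a real reference table.

```python
import re

# Shape check only: three lowercase ASCII letters.
ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")


def split_iso_codes(value):
    """Split a cell into (well-shaped codes, other tokens)."""
    if not value or not value.strip():
        return [], []
    parts = value.split()
    ok = [p for p in parts if ISO_639_3_SHAPE.fullmatch(p)]
    bad = [p for p in parts if not ISO_639_3_SHAPE.fullmatch(p)]
    return ok, bad


single = split_iso_codes("eus")
double = split_iso_codes("abc xyz")   # hypothetical two-code cell
odd = split_iso_codes("ab1")          # hypothetical malformed token
empty = split_iso_codes("")
print(single, double, odd, empty)
```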

anthropic:claude-opus-4-7 · confidence high

Out[46]:

saturn.columns["ISO_codes"].stats

stat	value
n	3,573
nulls	933 (26.1%)
unique	2,468
len_min	3
len_max	7
len_mean	3.039
len_median	3
len_p95	3
word_mean	1.01
word_median	1
n_empty	0
n_duplicates	172
duplicate_rate	0.06515
vocab_size	2,486
readability_flesch_mean	117.4
emoji_rate	0
url_rate	0
one_word_rate	0.9902
allcaps_rate	0
boilerplate_rate	0
alert: one_word	99.0% rows are a single word
alert: null_rate	26.1% null
alert: short_text	95th-percentile length under 20 chars

Fig 19.

Character-length distribution for ISO_codes.

Character-length distribution for ISO_codes (mean: 3.0393939393939395).
chars	count
3	2614
7	26

Samples_100 categorical feature

Boolean flag with only two values (False/True) where False dominates at 96.2% of non-null rows (2562 vs 100). The name 'Samples_100' plus the exact count of 100 True values suggests this marks a curated subset of 100 sampled records. A 25.5% null rate is notable and should be reconciled before use.

Treatment: Treat as a boolean subset indicator; impute or exclude nulls and avoid using as a model feature given severe imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[49]:

saturn.columns["Samples_100"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	2
top_value	False
top_rate	0.9624
cardinality	2
entropy	0.231
entropy_ratio	0.231
alert: null_rate	25.5% null
alert: imbalance	top value is 96.2% of rows

Fig 20.

Top values for Samples_100.

Top values for Samples_100 (2 unique shown, of 2 total).
value	count	share
False	2562	71.7%
True	100	2.8%

Samples_200 categorical metadata

Binary True/False flag, almost certainly indicating membership in a 200-row sample (the name 'Samples_200' and the exact count of 200 'True' values support this). The column is heavily imbalanced — 'False' covers 92.5% of non-null rows — and 25.5% of values are null, which is unusual for a sampling indicator and worth investigating.

Treatment: Use as a boolean filter/split flag; reconcile the 25.5% nulls (treat as False or exclude) before relying on it.
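One way to keep the null decision explicit is to parse the flag into three states (True, False, None) and filter on identity. This sketch assumes the CSV stores the strings 'True' and 'False' with empty cells for nulls; the row IDs are hypothetical.

```python
def parse_flag(value):
    """Parse 'True'/'False' into a bool; empty or None becomes None."""
    if value is None or not value.strip():
        return None
    text = value.strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"unexpected flag value: {value!r}")


# Hypothetical rows for illustration.
rows = [
    {"ID": "a", "Samples_200": "True"},
    {"ID": "b", "Samples_200": "False"},
    {"ID": "c", "Samples_200": ""},
]
in_sample = [r["ID"] for r in rows if parse_flag(r["Samples_200"]) is True]
unflagged = [r["ID"] for r in rows if parse_flag(r["Samples_200"]) is None]
print(in_sample, unflagged)
```

Filtering with `is True` means null rows are never silently counted as sample members, whichever reconciliation is chosen later.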

anthropic:claude-opus-4-7 · confidence high

Out[52]:

saturn.columns["Samples_200"].stats

stat	value
n	3,573
nulls	911 (25.5%)
unique	2
top_value	False
top_rate	0.9249
cardinality	2
entropy	0.3848
entropy_ratio	0.3848
alert: null_rate	25.5% null

Fig 21.

Top values for Samples_200.

Top values for Samples_200 (2 unique shown, of 2 total).
value	count	share
False	2462	68.9%
True	200	5.6%

Country_ID categorical foreign_key

Two-letter country codes (PG, AU, US, ID, IN...) identifying the country associated with each record, with 337 distinct values among 2,655 non-null rows. The cardinality is suspiciously high since ISO 3166-1 alpha-2 defines only about 250 codes, hinting at non-standard, sub-region, or combined codes mixed in. Distribution is fairly flat (entropy ratio 0.752; top value PG holds 8.06% of non-null rows, or 6.0% of all rows) and 25.69% of rows are null.

Treatment: Validate codes against ISO 3166, impute or flag the 25.69% nulls, then left-join on this id.
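A sketch of the validation step that sorts values into missing, valid, unknown, and multi-code buckets. The whitespace split tests one possible explanation for the high cardinality and is an assumption; `KNOWN` is a tiny illustrative subset and would be replaced by the full ISO 3166-1 alpha-2 list.

```python
from collections import Counter


def audit_country_ids(values, known_codes):
    """Classify each Country_ID cell as missing, valid, unknown, or multi_code."""
    report = Counter()
    for value in values:
        if not value or not value.strip():
            report["missing"] += 1
            continue
        parts = value.split()
        if len(parts) > 1:
            report["multi_code"] += 1
        elif parts[0] in known_codes:
            report["valid"] += 1
        else:
            report["unknown"] += 1
    return report


# Illustrative subset of codes; "US CA" and "ZZ9" are made-up test values.
KNOWN = {"PG", "AU", "US", "ID", "IN", "MX"}
report = audit_country_ids(["PG", "AU", "", "US CA", "ZZ9", None], KNOWN)
print(dict(report))
```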

anthropic:claude-opus-4-7 · confidence high

Out[55]:

saturn.columns["Country_ID"].stats

stat	value
n	3,573
nulls	918 (25.7%)
unique	337
top_value	PG
top_rate	0.0806
cardinality	337
entropy	6.314
entropy_ratio	0.752
alert: null_rate	25.7% null

Fig 22.

Top values for Country_ID.

Top values for Country_ID (20 unique shown, of 337 total).
value	count	share
PG	214	6.0%
AU	185	5.2%
US	177	5.0%
ID	177	5.0%
IN	120	3.4%
MX	120	3.4%
RU	89	2.5%
NG	66	1.8%
BR	66	1.8%
CN	54	1.5%
CD	49	1.4%
CM	46	1.3%
CA	45	1.3%
CO	39	1.1%
ET	36	1.0%
PH	36	1.0%
PE	35	1.0%
NP	32	0.9%
TZ	28	0.8%
VU	28	0.8%

Source text metadata

This column holds bibliographic citation tags (e.g., 'Huber-and-Reed-1992', 'Boelaars-1950'), almost certainly the source reference for each row in what appears to be a linguistic dataset. About 45% of values are a single token and the median length is 25 chars, consistent with compact Author-Year keys, while the long tail (max 452 chars, mean 2.9 words) points to entries that list several keys. 30% of rows are null, and 2,373 of the 2,499 non-null values are unique, with only 126 duplicates. Keys like 'nichols-1992' and 'malherbe-and-rosenberg-1996' (113 occurrences each) recur most often; since whole-value duplicates total only 126, those counts presumably tally tokens inside multi-key entries rather than repeated values. Either way, a few reference works appear to supply many entries.

Treatment: Normalize casing and keep as a categorical provenance tag; impute or flag the 30% nulls rather than modelling the text.
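A sketch of the normalisation, counting individual citation keys after lower-casing. It assumes keys within a cell are separated by whitespace, which the word counts suggest but the stats do not confirm; the sample cells reuse keys quoted above in made-up combinations.

```python
from collections import Counter


def citation_key_counts(values):
    """Count lower-cased citation keys across all non-null cells."""
    counts = Counter()
    for value in values:
        if not value:
            continue
        for key in value.lower().split():
            counts[key] += 1
    return counts


# Made-up cells built from keys mentioned in the text.
sources = ["Huber-and-Reed-1992", "nichols-1992 Boelaars-1950", "Nichols-1992", None]
counts = citation_key_counts(sources)
print(counts.most_common(1))
```

Counting keys rather than whole cells is what lets a single reference work show up many times even when few cells are exact duplicates.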

anthropic:claude-opus-4-7 · confidence high

Out[58]:

saturn.columns["Source"].stats

stat	value
n	3,573
nulls	1,074 (30.1%)
unique	2,373
len_min	7
len_max	452
len_mean	42.07
len_median	25
len_p95	135
word_mean	2.854
word_median	2
n_empty	0
n_duplicates	126
duplicate_rate	0.05042
vocab_size	5,899
readability_flesch_mean	21.33
emoji_rate	0
url_rate	0
one_word_rate	0.4546
allcaps_rate	0
boilerplate_rate	0
alert: one_word	45.5% rows are a single word
alert: null_rate	30.1% null

Fig 23.

Character-length distribution for Source.

Character-length distribution for Source (mean: 42.07122849139656).
chars	count
7 – 18	937
18 – 29	518
29 – 40	280
40 – 52	203
52 – 63	136
63 – 74	90
74 – 85	61
85 – 96	42
96 – 107	35
107 – 118	36
118 – 129	27
129 – 140	21
140 – 152	12
152 – 163	13
163 – 174	6
174 – 185	14
185 – 196	6
196 – 207	5
207 – 218	3
218 – 230	4
230 – 241	4
241 – 252	5
252 – 263	3
263 – 274	6
274 – 285	7
285 – 296	3
296 – 307	4
307 – 318	1
318 – 330	1
330 – 341	3
341 – 352	1
352 – 363	1
363 – 374	0
374 – 385	2
385 – 396	1
396 – 408	1
408 – 419	4
419 – 430	1
430 – 441	0
441 – 452	2

Parent_ID categorical foreign_key

Parent_ID looks like a foreign key pointing to a parent classification node, usually a genus (e.g. 'genus-oceanic', 'genus-bantu') and sometimes a family (e.g. 'family-austronesian'), with 911 distinct parents across the 3,319 non-null rows. The distribution is long-tailed but not dominated: the top value covers only 4.5% of non-null rows (4.2% of all rows) and entropy is 8.55 (ratio 0.87), so most parents carry few members. About 7.1% of values are null, which will need a decision before any join or grouping.

Treatment: Left-join on this id to a genus or family lookup; impute or flag the 7.1% nulls before grouping.
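Before joining, it may help to split each Parent_ID into its level prefix and slug, since the top-values table shows both 'genus-' and 'family-' prefixes. A minimal sketch, assuming the first hyphen always separates the two parts; the helper name is hypothetical.

```python
from collections import Counter


def split_parent_id(parent_id):
    """Split 'genus-oceanic' into ('genus', 'oceanic'); nulls give (None, None)."""
    if not parent_id:
        return None, None
    level, sep, slug = parent_id.partition("-")
    if not sep:
        # No hyphen found: level is unknown, keep the raw value as the slug.
        return None, parent_id
    return level, slug


# Values taken from the top-values table, plus one null.
sample = ["genus-oceanic", "genus-bantu", "family-austronesian", None]
levels = Counter(split_parent_id(p)[0] for p in sample)
print(dict(levels))
```

Grouping by level first makes it easy to route genus parents and family parents to different lookup tables.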

anthropic:claude-opus-4-7 · confidence high

Out[61]:

saturn.columns["Parent_ID"].stats

stat	value
n	3,573
nulls	254 (7.1%)
unique	911
top_value	genus-oceanic
top_rate	0.04489
cardinality	911
entropy	8.554
entropy_ratio	0.87
alert: long_tail	501 singleton categories

Fig 24.

Top values for Parent_ID.

Top values for Parent_ID (20 unique shown, of 911 total).
value	count	share
genus-oceanic	149	4.2%
genus-bantu	141	3.9%
genus-indic	50	1.4%
genus-westernpamanyungan	49	1.4%
genus-semitic	43	1.2%
genus-turkic	41	1.1%
genus-signlanguages	40	1.1%
genus-bodic	40	1.1%
genus-germanic	39	1.1%
genus-northernpamanyungan	33	0.9%
genus-creolesandpidgins	32	0.9%
genus-mayan	30	0.8%
family-austronesian	30	0.8%
genus-algonquian	29	0.8%
genus-centralmalayopolynesian	29	0.8%
genus-iranian	26	0.7%
family-transnewguinea	25	0.7%
genus-romance	24	0.7%
genus-biumandara	24	0.7%
genus-southeasternpamanyungan	23	0.6%
