language-data-phoible · saturn notebook

Overview

Source: /home/coolhand/datasets/language-data/phoible.csv

Saturn profiled 105,484 rows across 49 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/datasets/language-data/phoible.csv",
    "--findings", "language-data-phoible.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This is the PHOIBLE phonological inventory dataset: 105,484 rows describing phoneme segments across 2,716 language names, 2,177 Glottocodes, and 3,020 inventories, with each row carrying a Phoneme/GlyphID pair plus 38 phonological feature columns (e.g. consonantal, nasal, sonorant, dorsal) coded mainly as '+', '-', or '0'. Consonants dominate: SegmentClass shows 72,282 consonants against 31,052 vowels and 2,150 tones. Rows come from 8 sources, with 'ph' (36,274) and 'ea' (16,883) together supplying just over half of them (50.4%). Most feature columns are heavily imbalanced toward '-' or '0', but a handful (consonantal, sonorant, continuant, dorsal, high, front, back) are fairly balanced and are the most informative ones to explore first. Iron Ossetic (444), Dutch (395), and Chechen (309) are the language names with the most rows; because there are more inventories (3,020) than names (2,716), a high row count can reflect several inventories under one name rather than one large inventory.

citing: SegmentClass · Source · LanguageName · Phoneme · Glottocode · consonantal · sonorant · continuant · dorsal · high · Marginal

Out[4]:

saturn.schema() · 49 columns

column	kind	n	null%	unique	alerts
InventoryID	numeric	105,484	0.0%	3,020
Glottocode	text	105,484	0.0%	2,177	one_word short_text duplicates
ISO6393	text	105,484	0.0%	2,095	one_word short_text duplicates
LanguageName	text	105,484	0.0%	2,716	one_word allcaps short_text duplicates
SpecificDialect	categorical	105,484	0.0%	546
GlyphID	text	105,484	0.0%	3,142	one_word allcaps short_text duplicates
Phoneme	text	105,484	0.0%	3,142	one_word short_text duplicates
Allophones	text	105,484	0.0%	6,892	one_word short_text duplicates
Marginal	categorical	105,484	0.0%	3
SegmentClass	categorical	105,484	0.0%	3
Source	categorical	105,484	0.0%	8
tone	categorical	105,484	0.0%	2	imbalance
stress	categorical	105,484	0.0%	2	imbalance
syllabic	categorical	105,484	0.0%	8
short	categorical	105,484	0.0%	4	imbalance
long	categorical	105,484	0.0%	6
consonantal	categorical	105,484	0.0%	5
sonorant	categorical	105,484	0.0%	8
continuant	categorical	105,484	0.0%	9
delayedRelease	categorical	105,484	0.0%	7
approximant	categorical	105,484	0.0%	6
tap	categorical	105,484	0.0%	5	imbalance
trill	categorical	105,484	0.0%	6	imbalance
nasal	categorical	105,484	0.0%	8
lateral	categorical	105,484	0.0%	8
labial	categorical	105,484	0.0%	15
round	categorical	105,484	0.0%	8
labiodental	categorical	105,484	0.0%	6
coronal	categorical	105,484	0.0%	7
anterior	categorical	105,484	0.0%	6
distributed	categorical	105,484	0.0%	11
strident	categorical	105,484	0.0%	9
dorsal	categorical	105,484	0.0%	13
high	categorical	105,484	0.0%	11
low	categorical	105,484	0.0%	8
front	categorical	105,484	0.0%	13
back	categorical	105,484	0.0%	12
tense	categorical	105,484	0.0%	8
retractedTongueRoot	categorical	105,484	0.0%	7	imbalance
advancedTongueRoot	categorical	105,484	0.0%	3	imbalance
periodicGlottalSource	categorical	105,484	0.0%	7
epilaryngealSource	categorical	105,484	0.0%	3	imbalance
spreadGlottis	categorical	105,484	0.0%	10
constrictedGlottis	categorical	105,484	0.0%	7
fortis	categorical	105,484	0.0%	3
lenis	categorical	105,484	0.0%	3
raisedLarynxEjective	categorical	105,484	0.0%	6	imbalance
loweredLarynxImplosive	categorical	105,484	0.0%	5	imbalance
click	categorical	105,484	0.0%	5

Fig 1.

SegmentClass · Consonants outnumber vowels about 2.3 to 1 (68.5% vs 29.4%), with a small tone slice (2.0%); this sets the scope for any phoneme analysis.

Top values for SegmentClass (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

Fig 2.

Source · Shows how unevenly the eight source corpora contribute, with 'ph' alone supplying about a third of all rows.

Top values for Source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%

Fig 3.

LanguageName · Character-length distribution of language names (mean about 7.8 characters); the names with the most rows are Iron Ossetic, Dutch, and Chechen.

Character-length distribution for LanguageName (mean: 7.822219483523567).
chars	count
2 – 4	3915
4 – 6	29541
6 – 8	34409
8 – 10	14985
10 – 12	7358
12 – 14	4849
14 – 15	3897
15 – 17	2893
17 – 19	1292
19 – 21	993
21 – 23	198
23 – 25	224
25 – 27	218
27 – 29	39
29 – 31	93
31 – 33	130
33 – 35	64
35 – 37	23
37 – 39	37
39 – 40	57
40 – 42	0
42 – 44	52
44 – 46	0
46 – 48	23
48 – 50	0
50 – 52	0
52 – 54	20
54 – 56	0
56 – 58	40
58 – 60	0
60 – 62	87
62 – 64	0
64 – 66	0
66 – 67	0
67 – 69	0
69 – 71	0
71 – 73	0
73 – 75	0
75 – 77	0
77 – 79	47

Fig 4.

consonantal · One of the most balanced phonological features (+ vs −) — a good starting point for feature-based comparisons.

Top values for consonantal (5 unique shown, of 5 total).
value	count	share
+	64257	60.9%
-	39041	37.0%
0	2151	2.0%
+,-	34	0.0%
-,+	1	0.0%

Fig 5.

Marginal · Indicates how many phonemes are flagged as marginal vs core vs NA, useful before filtering for typological work.

Top values for Marginal (3 unique shown, of 3 total).
value	count	share
FALSE	83263	78.9%
NA	20874	19.8%
TRUE	1347	1.3%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
InventoryID	numeric	0.0%
Glottocode	text	0.0%
ISO6393	text	0.0%
LanguageName	text	0.0%
SpecificDialect	categorical	0.0%
GlyphID	text	0.0%
Phoneme	text	0.0%
Allophones	text	0.0%
Marginal	categorical	0.0%
SegmentClass	categorical	0.0%
Source	categorical	0.0%
tone	categorical	0.0%
stress	categorical	0.0%
syllabic	categorical	0.0%
short	categorical	0.0%
long	categorical	0.0%
consonantal	categorical	0.0%
sonorant	categorical	0.0%
continuant	categorical	0.0%
delayedRelease	categorical	0.0%
approximant	categorical	0.0%
tap	categorical	0.0%
trill	categorical	0.0%
nasal	categorical	0.0%
lateral	categorical	0.0%
labial	categorical	0.0%
round	categorical	0.0%
labiodental	categorical	0.0%
coronal	categorical	0.0%
anterior	categorical	0.0%
distributed	categorical	0.0%
strident	categorical	0.0%
dorsal	categorical	0.0%
high	categorical	0.0%
low	categorical	0.0%
front	categorical	0.0%
back	categorical	0.0%
tense	categorical	0.0%
retractedTongueRoot	categorical	0.0%
advancedTongueRoot	categorical	0.0%
periodicGlottalSource	categorical	0.0%
epilaryngealSource	categorical	0.0%
spreadGlottis	categorical	0.0%
constrictedGlottis	categorical	0.0%
fortis	categorical	0.0%
lenis	categorical	0.0%
raisedLarynxEjective	categorical	0.0%
loweredLarynxImplosive	categorical	0.0%
click	categorical	0.0%

InventoryID numeric foreign_key

InventoryID is an inventory key stored as an integer, with 3,020 distinct values (1 to 3,020) spread across 105,484 rows and no nulls. The distribution is roughly uniform (mean 1,479, median 1,464, skew ≈0, kurtosis ≈-1.15), which is what an enumerated identifier looks like rather than a measurement. Each ID recurs about 35 times on average (105,484 / 3,020), so the column groups segment rows into phoneme inventories: every row with the same InventoryID belongs to the same inventory.

Treatment: Treat as a categorical key; group by it to reconstruct individual inventories, or left-join to inventory-level metadata, rather than using it as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
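
A minimal sketch of using InventoryID as a grouping key rather than a number: counting rows per ID gives the number of segments in each inventory. The sample rows are invented for illustration; only the column names come from the schema above.

```python
from collections import Counter

# Hypothetical sample rows; in practice they would come from csv.DictReader
# over phoible.csv. The ID stays a string because it is a key, not a quantity.
rows = [
    {"InventoryID": "1", "Phoneme": "m"},
    {"InventoryID": "1", "Phoneme": "i"},
    {"InventoryID": "1", "Phoneme": "k"},
    {"InventoryID": "2", "Phoneme": "m"},
    {"InventoryID": "2", "Phoneme": "a"},
]


def inventory_sizes(rows):
    """Count how many segment rows belong to each inventory."""
    return Counter(row["InventoryID"] for row in rows)


sizes = inventory_sizes(rows)
mean_size = sum(sizes.values()) / len(sizes)
print(dict(sizes), mean_size)
```

On the full file the same ratio is 105,484 / 3,020, or about 35 rows per inventory.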

Out[12]:

saturn.columns["InventoryID"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3,020
min	1
max	3,020
mean	1479
median	1,464
std	843.1
q1	769
q3	2,237
iqr	1,468
skew	-0.002397
kurtosis	-1.146
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 7.

Distribution of InventoryID. Vertical dash marks the median.

Histogram bins for InventoryID (median: 1464.0).
bin	count
1 – 76.47	2729
76.47 – 151.9	2947
151.9 – 227.4	2742
227.4 – 302.9	2385
302.9 – 378.4	2399
378.4 – 453.8	2530
453.8 – 529.3	2270
529.3 – 604.8	2323
604.8 – 680.3	2533
680.3 – 755.8	3039
755.8 – 831.2	2896
831.2 – 906.7	2609
906.7 – 982.2	2693
982.2 – 1058	2513
1058 – 1133	2637
1133 – 1209	2514
1209 – 1284	2898
1284 – 1360	3314
1360 – 1435	3634
1435 – 1510	2993
1510 – 1586	3172
1586 – 1661	3290
1661 – 1737	2686
1737 – 1812	3001
1812 – 1888	1779
1888 – 1963	2046
1963 – 2039	1878
2039 – 2114	1864
2114 – 2190	2685
2190 – 2265	3387
2265 – 2341	3096
2341 – 2416	3142
2416 – 2492	3264
2492 – 2567	3557
2567 – 2643	2966
2643 – 2718	1975
2718 – 2794	1761
2794 – 2869	1659
2869 – 2945	1855
2945 – 3020	1823

Glottocode text foreign_key

This column holds Glottocodes, the standardized 8-character language identifiers from Glottolog (e.g., 'kham1282', 'dutc1256'). Every value is a single token, and 105,465 of 105,484 rows have exactly 8 characters; the remaining 19 rows have 2-character values (len_min 2), which are probably a placeholder such as 'NA' rather than real codes and should be checked. There are 2,177 unique values with a 97.9% duplicate rate, so each language appears many times; the top code 'kham1282' alone accounts for 622 rows.

Treatment: Treat as a categorical foreign key; set the 19 malformed values to missing, then left-join to a Glottolog reference table for language metadata.

anthropic:claude-opus-4-7 · confidence high
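
A small sketch of how the 2-character outliers could be separated from well-formed codes before a join. It relies on the Glottocode shape (four lowercase alphanumerics followed by four digits); the 'NA' value in the example is an assumption about what the short entries contain, not something the stats confirm.

```python
import re

# Glottocode shape: four lowercase alphanumerics followed by four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def is_glottocode(value):
    """Return True when value has the canonical Glottocode shape."""
    return GLOTTOCODE_RE.fullmatch(value) is not None


def split_valid(values):
    """Partition values into (well-formed, malformed) lists."""
    good = [v for v in values if is_glottocode(v)]
    bad = [v for v in values if not is_glottocode(v)]
    return good, bad


# 'kham1282' and 'dutc1256' are cited above; 'NA' is an assumed placeholder.
good, bad = split_valid(["kham1282", "dutc1256", "NA"])
print(good, bad)
```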

Out[15]:

saturn.columns["Glottocode"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2,177
len_min	2
len_max	8
len_mean	7.999
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	103,307
duplicate_rate	0.9794
vocab_size	2,168
readability_flesch_mean	94.15
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.9% duplicate strings

Fig 8.

Character-length distribution for Glottocode.

Character-length distribution for Glottocode (mean: 7.998919267377043).
chars	count
2	19
8	105465

ISO6393 text foreign_key

This column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,095 distinct codes across 105,484 rows. The distribution is heavy-tailed and highly repetitive (98.0% duplicate rate), led by 'mis' (828), 'khg' (622), and 'oss' (525), with familiar codes like 'eng' and 'hin' also prominent. Note that 'mis' is the ISO 639-3 special code for uncoded languages, so its 828 rows do not belong to a single language. There are no nulls or empties, and the vocabulary (2,086) closely matches n_unique (2,095), consistent with a clean controlled vocabulary.

Treatment: Treat as a categorical key; left-join to an ISO 639-3 reference table or target-encode for modelling, handling 'mis' separately because it does not identify one language.

anthropic:claude-opus-4-7 · confidence high
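
Because 'mis' is the most frequent value, a join on this column would lump unrelated languages together unless special codes are set aside first. The sketch below lists the four ISO 639-3 special-purpose codes; the helper name `join_key` is hypothetical.

```python
# ISO 639-3 reserves these identifiers for special situations rather than
# for individual languages.
SPECIAL_CODES = {
    "mis": "uncoded languages",
    "mul": "multiple languages",
    "und": "undetermined",
    "zxx": "no linguistic content",
}


def join_key(iso_code):
    """Return the code if it names a single language, otherwise None."""
    return None if iso_code in SPECIAL_CODES else iso_code


codes = ["mis", "khg", "oss", "eng"]
keys = [join_key(code) for code in codes]
print(keys)
```

Rows whose key comes back as None are better matched through Glottocode, which is populated for all but 19 rows.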

Out[18]:

saturn.columns["ISO6393"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2,095
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	103,389
duplicate_rate	0.9801
vocab_size	2,086
readability_flesch_mean	119.5
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	98.0% duplicate strings

Fig 9.

Character-length distribution for ISO6393.

Character-length distribution for ISO6393 (mean: 3.0).
chars	count
3	105484

LanguageName text label

This column holds language names, mostly single-word labels (one_word_rate 0.84, word_mean 1.20), with 2,716 unique values over 105,484 rows (vocab_size 2,670). It is highly repetitive (duplicate_rate 0.974), with top entries 'Iron Ossetic' (444), 'Dutch' (395), and 'Chechen' (309). About 13% of rows are all-caps (allcaps_rate 0.131), so casing is inconsistent and worth normalising. Lengths run from 2 to 79 characters (p95 16), so a few entries are far longer than a plain name. Compound names use directional modifiers ('northern', 'southern', 'eastern', 'western'), indicating dialect-level granularity.

Treatment: Normalise case and treat as a categorical label; consider grouping directional dialect variants before modelling.

anthropic:claude-opus-4-7 · confidence high
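
One way the case normalisation could look, as a sketch: only names that are entirely upper-case are changed, so mixed-case names keep their original spelling. The all-caps example value is hypothetical.

```python
def normalise_name(name):
    """Title-case names that are entirely upper-case; leave others untouched."""
    return name.title() if name.isupper() else name


# 'DUTCH' is a made-up all-caps variant used only for illustration.
names = ["Iron Ossetic", "DUTCH", "Chechen"]
normalised = [normalise_name(name) for name in names]
print(normalised)
```

`str.title()` also capitalises after apostrophes and hyphens, so names containing them would need a manual check.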

Out[21]:

saturn.columns["LanguageName"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2,716
len_min	2
len_max	79
len_mean	7.822
len_median	7
len_p95	16
word_mean	1.201
word_median	1
n_empty	0
n_duplicates	102,768
duplicate_rate	0.9743
vocab_size	2,670
readability_flesch_mean	53.18
emoji_rate	0
url_rate	0
one_word_rate	0.8433
allcaps_rate	0.1314
boilerplate_rate	0
alert: one_word	84.3% rows are a single word
alert: allcaps	13.1% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.4% duplicate strings

Fig 10.

Character-length distribution for LanguageName.

Character-length distribution for LanguageName (mean: 7.822219483523567).
chars	count
2 – 4	3915
4 – 6	29541
6 – 8	34409
8 – 10	14985
10 – 12	7358
12 – 14	4849
14 – 15	3897
15 – 17	2893
17 – 19	1292
19 – 21	993
21 – 23	198
23 – 25	224
25 – 27	218
27 – 29	39
29 – 31	93
31 – 33	130
33 – 35	64
35 – 37	23
37 – 39	37
39 – 40	57
40 – 42	0
42 – 44	52
44 – 46	0
46 – 48	23
48 – 50	0
50 – 52	0
52 – 54	20
54 – 56	0
56 – 58	40
58 – 60	0
60 – 62	87
62 – 64	0
64 – 66	0
66 – 67	0
67 – 69	0
69 – 71	0
71 – 73	0
73 – 75	0
75 – 77	0
77 – 79	47

SpecificDialect categorical metadata

Categorical field naming a specific dialect or sub-variety of a language, with 546 distinct values across 105,484 rows. The distribution is extremely concentrated: 71.9% of rows are the literal string "NA" and another 7,692 (7.3%) are empty strings, so about 79% of rows carry no dialect label. The remaining 544 labels (e.g., "W2", "Lezgian (Güne)", "Scottish Gaelic (Lewis)") form a long tail topping out at 120 occurrences. The low entropy ratio (0.33) reflects that concentration.

Treatment: Normalize "NA" and "" to a single missing token, then treat as high-cardinality categorical (hash or group rare levels) for any downstream use.

anthropic:claude-opus-4-7 · confidence high
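
A sketch of the first step in that treatment, folding both placeholder spellings into a single missing value. The set of placeholders is taken from the top-values table; treating whitespace-only strings as empty is an added assumption.

```python
MISSING_TOKENS = {"NA", ""}


def normalise_missing(value):
    """Map placeholder strings to None and keep real labels unchanged."""
    return None if value.strip() in MISSING_TOKENS else value


dialects = ["NA", "", "W2", "Scottish Gaelic (Lewis)"]
cleaned = [normalise_missing(d) for d in dialects]
print(cleaned)
```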

Out[24]:

saturn.columns["SpecificDialect"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	546
top_value	NA
top_rate	0.7187
cardinality	546
entropy	2.969
entropy_ratio	0.3265

Fig 11.

Top values for SpecificDialect.

Top values for SpecificDialect (20 unique shown, of 546 total).
value	count	share
NA	75807	71.9%
(empty string)	7692	7.3%
W2	120	0.1%
Lezgian (Güne)	96	0.1%
Santa	92	0.1%
Central Pakistan	83	0.1%
Babungo (Grassfields Bantu, Ring)	82	0.1%
Scottish Gaelic (Lewis)	82	0.1%
Tangari	81	0.1%
Kanga	76	0.1%
Kufa	75	0.1%
Skolt Saami (Suõʹnnʼjel)	75	0.1%
Standard Hindi (as spoken in Varanasi, Lucknow, Delhi etc.)	74	0.1%
Standard (eastern)	74	0.1%
Guovdageaidnu	74	0.1%
Nuosu (Black Yi)	74	0.1%
Northern Qiang (Yadu)	73	0.1%
Bangladeshi Standard (spoken in Dhaka and other urban aread of Bangladesh)	72	0.1%
Standard Italian	70	0.1%
Chechen (Ploskost)	70	0.1%

GlyphID text foreign_key

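GlyphID holds uppercase hexadecimal codes in a single token (allcaps_rate 1.0, one_word_rate 1.0, len_min 4, len_median 4) that look like Unicode code points: the top values 006D, 0069, 006B map to the lowercase Latin letters m, i, k. Despite the ID-sounding name it is highly non-unique: only 3,142 distinct values across 105,484 rows (duplicate rate 0.97), so it behaves as a categorical glyph reference rather than a row key. Lengths appear to fall on 4, 9, 14, and so on up to 54 characters, which is what n four-digit codes joined by a single separator character would produce, and the per-length counts match the Phoneme length distribution exactly (67,114 rows with one code point, 28,726 with two, and so on). Each GlyphID therefore appears to spell out its Phoneme code point by code point.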

Treatment: Treat as a categorical codepoint reference and left-join to a Unicode/glyph lookup table; do not use as a primary key.

anthropic:claude-opus-4-7 · confidence high
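
If GlyphID is indeed a sequence of hex code points, it can be decoded back into the string it names. The sketch below assumes each code point is written with 4 to 6 uppercase hex digits and treats any other character as a separator, since the separator itself is not shown in the stats.

```python
import re

# Assumption: code points are 4 to 6 uppercase hex digits; anything else
# between them is a separator.
CODEPOINT_RE = re.compile(r"[0-9A-F]{4,6}")


def decode_glyph_id(glyph_id):
    """Turn a GlyphID such as '006D' into the string it encodes."""
    codes = CODEPOINT_RE.findall(glyph_id)
    return "".join(chr(int(code, 16)) for code in codes)


# The three most frequent GlyphID values cited above.
decoded = [decode_glyph_id(g) for g in ["006D", "0069", "006B"]]
print(decoded)
```

Comparing the decoded string with the Phoneme column row by row would confirm or refute the one-to-one correspondence suggested by the length counts.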

Out[27]:

saturn.columns["GlyphID"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3,142
len_min	4
len_max	54
len_mean	6.503
len_median	4
len_p95	14
word_mean	1
word_median	1
n_empty	0
n_duplicates	102,342
duplicate_rate	0.9702
vocab_size	1,343
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.0% duplicate strings

Fig 12.

Character-length distribution for GlyphID.

Character-length distribution for GlyphID (mean: 6.503033635432862).
chars	count
4 – 5	67114
5 – 6	0
6 – 8	0
8 – 9	0
9 – 10	28726
10 – 12	0
12 – 13	0
13 – 14	0
14 – 15	6559
15 – 16	0
16 – 18	0
18 – 19	0
19 – 20	2225
20 – 22	0
22 – 23	0
23 – 24	0
24 – 25	401
25 – 26	0
26 – 28	0
28 – 29	0
29 – 30	267
30 – 32	0
32 – 33	0
33 – 34	0
34 – 35	104
35 – 36	0
36 – 38	0
38 – 39	0
39 – 40	5
40 – 42	0
42 – 43	0
43 – 44	0
44 – 45	70
45 – 46	0
46 – 48	0
48 – 49	0
49 – 50	1
50 – 52	0
52 – 53	0
53 – 54	12

Phoneme text feature

This column holds individual phoneme tokens, always one word (one_word_rate 1.0) and usually short: 63.6% are a single character, 27.2% are two, and the longest is 11 (len_mean 1.50, len_median 1, len_p95 3). With 105,484 rows but only 3,142 unique values (97.0% duplicate rate), a shared set of segments recurs across inventories; the top symbols 'm' (2,915), 'i' (2,779), and 'k' (2,729) dominate. The multi-character values are most likely base letters combined with diacritics or modifier letters. vocab_size (1,339) is well below the unique count (3,142); how Saturn tokenizes these strings is not stated, so that figure should not be read as a count of distinct phonemes.

Treatment: Treat as a categorical token and encode (label or embedding) before modelling.

anthropic:claude-opus-4-7 · confidence high
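
Before encoding, it can help to see what the multi-character phonemes are made of. This sketch separates base letters from combining marks and modifier letters using Unicode general categories; the example segment is hypothetical, and the split is a heuristic rather than a phonological analysis.

```python
import unicodedata


def split_segment(phoneme):
    """Separate base letters from combining marks and modifier letters."""
    base, modifiers = [], []
    for ch in phoneme:
        category = unicodedata.category(ch)
        # 'M*' covers combining marks; 'Lm' covers modifier letters.
        if category.startswith("M") or category == "Lm":
            modifiers.append(ch)
        else:
            base.append(ch)
    return base, modifiers


# Hypothetical segment: 't' + combining bridge below (U+032A)
# + modifier letter small h (U+02B0).
base, modifiers = split_segment("t\u032a\u02b0")
print(base, [f"U+{ord(m):04X}" for m in modifiers])
```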

Out[30]:

saturn.columns["Phoneme"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3,142
len_min	1
len_max	11
len_mean	1.501
len_median	1
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	102,342
duplicate_rate	0.9702
vocab_size	1,339
readability_flesch_mean	114.4
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.001754
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.0% duplicate strings

Fig 13.

Character-length distribution for Phoneme.

Character-length distribution for Phoneme (mean: 1.5006067270865724).
chars	count
1	67114
2	28726
3	6559
4	2225
5	401
6	267
7	104
8	5
9	70
10	1
11	12

Allophones text feature

Column holds short phonetic tokens (allophones), with mean length 2.08 characters; 91.3% of rows are a single word, so the remaining 8.7% list several allophones in one cell. The distribution is dominated by the literal string 'NA' (53,580 of 105,484 rows, 50.8%), which encodes missing data as text, so null_rate reads 0.0; duplicate_rate is 0.935 across 6,892 uniques. Top non-NA values are individual phoneme letters like 'm', 'j', 'w', 's', consistent with IPA-style symbols.

Treatment: Recode the literal 'NA' to missing, split multi-token cells into a list of allophones, then treat each token as a categorical phoneme symbol (one-hot or embed).

anthropic:claude-opus-4-7 · confidence high
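
A sketch of parsing one Allophones cell. It assumes multi-token cells are whitespace-separated, which is what word_mean above 1 suggests but which the stats do not confirm; the two-token example is hypothetical.

```python
def parse_allophones(value):
    """Return a list of allophone tokens, or None for the 'NA' placeholder."""
    if value == "NA":
        return None
    return value.split()


# The last cell is an invented example of a multi-token value.
cells = ["NA", "m", "t tʰ"]
parsed = [parse_allophones(cell) for cell in cells]
print(parsed)
```

Returning None rather than an empty list keeps "not recorded" distinct from "no allophones".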

Out[33]:

saturn.columns["Allophones"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6,892
len_min	1
len_max	37
len_mean	2.083
len_median	2
len_p95	4
word_mean	1.129
word_median	1
n_empty	0
n_duplicates	98,592
duplicate_rate	0.9347
vocab_size	1,263
readability_flesch_mean	116.2
emoji_rate	0
url_rate	0
one_word_rate	0.9131
allcaps_rate	0.00291
boilerplate_rate	0
alert: one_word	91.3% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	93.5% duplicate strings

Fig 14.

Character-length distribution for Allophones.

Character-length distribution for Allophones (mean: 2.0834154942929732).
chars	count
1 – 2	26762
2 – 3	66019
3 – 4	4934
4 – 5	3125
5 – 6	1246
6 – 6	1181
6 – 7	789
7 – 8	409
8 – 9	334
9 – 10	0
10 – 11	226
11 – 12	160
12 – 13	69
13 – 14	65
14 – 14	41
14 – 15	36
15 – 16	12
16 – 17	16
17 – 18	12
18 – 19	0
19 – 20	14
20 – 21	9
21 – 22	8
22 – 23	0
23 – 24	3
24 – 24	3
24 – 25	3
25 – 26	3
26 – 27	2
27 – 28	0
28 – 29	1
29 – 30	0
30 – 31	0
31 – 32	0
32 – 32	0
32 – 33	0
33 – 34	1
34 – 35	0
35 – 36	0
36 – 37	1

Marginal categorical feature

A ternary flag with values FALSE, NA, and TRUE across 105,484 rows. FALSE dominates at 78.9% while TRUE appears only 1,347 times; the 20,874 NA entries are encoded as a literal string rather than null, so null_rate is 0.0 despite roughly a fifth of rows being missing in practice.

Treatment: Recode the literal 'NA' string to a true missing value, then treat as a binary indicator.

anthropic:claude-opus-4-7 · confidence high
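
The recoding can be sketched as a three-way mapping to an optional boolean. Raising on unexpected values is a design choice made here so that a fourth spelling would not be silently treated as missing.

```python
def parse_marginal(value):
    """Map 'TRUE'/'FALSE' to bool and the literal 'NA' to None."""
    mapping = {"TRUE": True, "FALSE": False, "NA": None}
    if value not in mapping:
        raise ValueError(f"unexpected Marginal value: {value!r}")
    return mapping[value]


flags = [parse_marginal(v) for v in ["FALSE", "NA", "TRUE"]]
print(flags)
```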

Out[36]:

saturn.columns["Marginal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	FALSE
top_rate	0.7893
cardinality	3
entropy	0.8122
entropy_ratio	0.5125

Fig 15.

Top values for Marginal.

Top values for Marginal (3 unique shown, of 3 total).
value	count	share
FALSE	83263	78.9%
NA	20874	19.8%
TRUE	1347	1.3%

SegmentClass categorical label

SegmentClass is a categorical phonological label with only 3 distinct values: consonant, vowel, and tone. The distribution is heavily skewed — consonant accounts for 68.5% of 105,484 rows, vowel for 31,052, and tone for just 2,150, giving an entropy ratio of 0.64. The presence of 'tone' as a rare third class suggests the dataset spans tonal languages but those segments are sparsely represented.

Treatment: One-hot encode; consider class imbalance (especially rare 'tone' class) before stratified modelling.

anthropic:claude-opus-4-7 · confidence high

Out[39]:

saturn.columns["SegmentClass"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	consonant
top_rate	0.6852
cardinality	3
entropy	1.008
entropy_ratio	0.6357

Fig 16.

Top values for SegmentClass.

Top values for SegmentClass (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

Source categorical metadata

Categorical provenance tag with 8 distinct sources across 105,484 rows and no nulls. Distribution is fairly balanced (entropy ratio 0.90), though 'ph' leads at 34.4% followed by 'ea' (16,883) and 'upsid' (13,966). Looks like a dataset-origin code identifying which linguistic database each record came from.

Treatment: One-hot encode, or keep as categorical for stratification and source-bias checks.

anthropic:claude-opus-4-7 · confidence high

Out[42]:

saturn.columns["Source"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	ph
top_rate	0.3439
cardinality	8
entropy	2.697
entropy_ratio	0.8991
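
The entropy figures in these tables look like Shannon entropy in bits, with entropy_ratio being that value divided by log2 of the cardinality. This is an inference from the numbers rather than documented Saturn behaviour; the sketch below recomputes both from the Source counts so the reading can be checked.

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2 of the number of levels."""
    levels = len(counts)
    return entropy_bits(counts) / math.log2(levels) if levels > 1 else 0.0


# Row counts for the eight Source values listed in the overview.
source_counts = [36274, 16883, 13966, 9423, 9047, 8064, 7566, 4261]
print(round(entropy_bits(source_counts), 3), round(entropy_ratio(source_counts), 4))
```

If the inference holds, the printed values should agree with the 2.697 and 0.8991 reported above to the displayed precision.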

Fig 17.

Top values for Source.

Top values for Source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%

tone categorical label

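Binary categorical column with values "0" and "+", encoding the phonological [tone] feature: "+" marks tone segments and "0" marks everything else. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows while "+" appears only 2,150 times, yielding entropy of just 0.144. The "+" count equals the number of SegmentClass 'tone' rows (2,150), so the column appears to duplicate that class. No nulls are present.

Treatment: Verify that it duplicates SegmentClass == 'tone' and, if so, keep only one of the two. If it is used as a label, apply class-imbalance handling (resampling or class weights) before modelling.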

anthropic:claude-opus-4-7 · confidence high
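
The "+" count for tone (2,150) equals the SegmentClass 'tone' count, which suggests the two columns carry the same information. Matching totals do not prove row-level agreement, so a check along these lines could confirm it; the sample rows are invented and the last one is deliberately inconsistent.

```python
def tone_mismatches(rows):
    """Return rows where tone == '+' disagrees with SegmentClass == 'tone'."""
    return [
        row for row in rows
        if (row["tone"] == "+") != (row["SegmentClass"] == "tone")
    ]


# Hypothetical rows for illustration only.
rows = [
    {"SegmentClass": "tone", "tone": "+"},
    {"SegmentClass": "consonant", "tone": "0"},
    {"SegmentClass": "vowel", "tone": "+"},
]
mismatches = tone_mismatches(rows)
print(len(mismatches))
```

An empty result on the full file would justify dropping one of the two columns.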

Out[45]:

saturn.columns["tone"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2
top_value	0
top_rate	0.9796
cardinality	2
entropy	0.1436
entropy_ratio	0.1436
alert: imbalance	top value is 98.0% of rows

Fig 18.

Top values for tone.

Top values for tone (2 unique shown, of 2 total).
value	count	share
0	103334	98.0%
+	2150	2.0%

stress categorical feature

Categorical column with only two observed values, '-' and '0', across 105,484 rows and no nulls. '-' covers 97.96% of records (103,334) and '0' the remaining 2,150, giving an entropy ratio of just 0.144. No row is marked '+', so no segment in this extract is coded as stressed; the '0' count matches the 2,150 tone rows, consistent with '0' meaning the feature does not apply. The column therefore adds nothing beyond separating tones from other segments.

Treatment: Consider dropping; the column is near-constant and has no '+' values.

anthropic:claude-opus-4-7 · confidence high

Out[48]:

saturn.columns["stress"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2
top_value	-
top_rate	0.9796
cardinality	2
entropy	0.1436
entropy_ratio	0.1436
alert: imbalance	top value is 98.0% of rows

Fig 19.

Top values for stress.

Top values for stress (2 unique shown, of 2 total).
value	count	share
-	103334	98.0%
0	2150	2.0%

syllabic categorical feature

This is a phonological feature column encoding the [syllabic] distinctive feature, with 8 distinct values across 105,484 rows and no nulls. Most entries are '-' (68.5%) or '+' (29.1%); a further 2,150 rows are '0', and 394 rows carry comma-joined codes such as '+,-', '-,+', or '-,+,-' that suggest contour annotations for complex segments. The entropy ratio of 0.35 reflects the concentration on the negative value.

Treatment: One-hot encode, optionally collapsing rare composite codes into an 'other' bucket.

anthropic:claude-opus-4-7 · confidence high

Out[51]:

saturn.columns["syllabic"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	-
top_rate	0.6849
cardinality	8
entropy	1.042
entropy_ratio	0.3472

Fig 20.

Top values for syllabic.

Top values for syllabic (8 unique shown, of 8 total).
value	count	share
-	72248	68.5%
+	30692	29.1%
0	2150	2.0%
+,-	244	0.2%
-,+	124	0.1%
-,+,-	12	0.0%
-,+,+	12	0.0%
+,+,-	2	0.0%

short categorical feature

A 4-level categorical with values '-', '0', '+', and '-,+', encoding the phonological [short] feature. It is severely imbalanced: '-' covers 97.76% of 105,484 rows, leaving only 2,150 zeros, 204 plus signs, and 5 mixed '-,+' entries. Entropy ratio is just 0.082, so the column carries almost no information as-is.

Treatment: Consider dropping or collapsing into a binary '-' vs other indicator given the extreme imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[54]:

saturn.columns["short"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	4
top_value	-
top_rate	0.9776
cardinality	4
entropy	0.1645
entropy_ratio	0.08225
alert: imbalance	top value is 97.8% of rows

Fig 21.

Top values for short.

Top values for short (4 unique shown, of 4 total).
value	count	share
-	103125	97.8%
0	2150	2.0%
+	204	0.2%
-,+	5	0.0%

long categorical feature

Categorical column for the phonological [long] feature, with 6 distinct values dominated by '-' at 89.9% (94,844 of 105,484), followed by '+' (8,386) and '0' (2,150). The remaining three categories are comma-joined combinations ('-,+', '+,-', '-,-,+') with tiny counts (63, 40, 1); they most likely record one value per part of a complex segment rather than data-entry errors. The low entropy ratio (0.214) confirms the column is highly imbalanced toward '-'.

Treatment: Split the comma-joined values or bucket them as 'contour', then one-hot encode; expect '-' to dominate.

anthropic:claude-opus-4-7 · confidence high

Out[57]:

saturn.columns["long"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	-
top_rate	0.8991
cardinality	6
entropy	0.5537
entropy_ratio	0.2142

Fig 22.

Top values for long.

Top values for long (6 unique shown, of 6 total).
value	count	share
-	94844	89.9%
+	8386	8.0%
0	2150	2.0%
-,+	63	0.1%
+,-	40	0.0%
-,-,+	1	0.0%

consonantal categorical feature

This is a phonological feature column flagging segments as consonantal, with five distinct values across 105,484 rows and no nulls. The vast majority are binary +/- (64,257 and 39,041 respectively, with + leading at 60.9%), plus 2,151 zero/underspecified entries and just 35 rows with mixed values ('+,-' 34 times, '-,+' once). The entropy ratio of 0.47 reflects the + majority together with the nearly empty rare levels; the lone '-,+' row is worth checking against the source before treating it as an encoding artifact.

Treatment: Map to a small categorical encoding; consider collapsing the rare mixed/zero values or treating them as a separate 'underspecified' level.

anthropic:claude-opus-4-7 · confidence high

Out[60]:

saturn.columns["consonantal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	5
top_value	+
top_rate	0.6092
cardinality	5
entropy	1.085
entropy_ratio	0.4672

Fig 23.

Top values for consonantal.

Top values for consonantal (5 unique shown, of 5 total).
value	count	share
+	64257	60.9%
-	39041	37.0%
0	2151	2.0%
+,-	34	0.0%
-,+	1	0.0%

sonorant categorical feature

This is a phonological feature column encoding the [sonorant] value of a segment, dominated by binary '+'/'-' marks (53.0% '+', plus 45322 '-'). The presence of '0' (2150) and comma-joined sequences like '+,-' (1948) and rarer '+,-,-', '+,-,+', '+,-,+,-' suggests contour/complex segments where multiple values are concatenated. Cardinality is 8 with entropy ratio 0.41, so the long tail is sparse but meaningful.

Treatment: Split comma-separated contour values into ordered sub-features or treat as categorical with a rare-bucket for the contour cases.

anthropic:claude-opus-4-7 · confidence high
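
Splitting the comma-joined values into ordered components could look like the sketch below. It assumes the comma is the only delimiter and that component order is meaningful, which the value patterns suggest but the stats alone do not establish.

```python
def parse_feature(value):
    """Split a feature value such as '+,-' into an ordered tuple of components."""
    return tuple(value.split(","))


def is_contour(value):
    """True when the value has more than one component."""
    return len(parse_feature(value)) > 1


# Values taken from the sonorant top-values table.
values = ["+", "-", "0", "+,-", "+,-,+,-"]
parsed = {v: parse_feature(v) for v in values}
contours = [v for v in values if is_contour(v)]
print(parsed)
print(contours)
```

Keeping the tuple intact preserves the order of components, which a one-hot encoding of the raw string would hide.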

Out[63]:

saturn.columns["sonorant"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	+
top_rate	0.5301
cardinality	8
entropy	1.245
entropy_ratio	0.4149

Fig 24.

Top values for sonorant.

Top values for sonorant (8 unique shown, of 8 total).
value	count	share
+	55920	53.0%
-	45322	43.0%
0	2150	2.0%
+,-	1948	1.8%
-,+	89	0.1%
+,-,-	29	0.0%
+,-,+	25	0.0%
+,-,+,-	1	0.0%

continuant categorical feature

Categorical flag with 9 distinct values across 105,484 rows and no nulls, dominated by '+' (54.9%) and '-' (~42%), with '0' a distant third at 2,151 occurrences. The remaining six categories are comma-joined sequences like '-,+' or '0,0,-,+' that look like concatenated multi-step states rather than atomic labels — only 796 rows total fall into these compound buckets. Entropy ratio of 0.37 confirms heavy concentration in the two primary signs.

Treatment: Collapse rare comma-joined sequences into an 'other' bucket (or split on comma) before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
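
A sketch of the 'other' bucket option, applied to the continuant counts from this report. The cut-off of 100 rows is an arbitrary assumption chosen for illustration, not a recommendation from the profiler.

```python
def collapse_rare(counts, min_count, other="other"):
    """Merge levels rarer than min_count into a single bucket."""
    collapsed = {}
    for level, count in counts.items():
        key = level if count >= min_count else other
        collapsed[key] = collapsed.get(key, 0) + count
    return collapsed


# Counts taken from the continuant top-values table.
continuant_counts = {
    "+": 57952, "-": 44585, "0": 2151, "-,+": 728, "-,-,+": 50,
    "+,-": 9, "0,-,+": 4, "-,+,+": 4, "0,0,-,+": 1,
}
collapsed = collapse_rare(continuant_counts, min_count=100)
print(collapsed)
```

With this threshold '-,+' (728 rows) survives as its own level while the five rarer sequences are merged.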

Out[66]:

saturn.columns["continuant"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	9
top_value	+
top_rate	0.5494
cardinality	9
entropy	1.172
entropy_ratio	0.3696

Fig 25.

Top values for continuant.

Top values for continuant (9 unique shown, of 9 total).
value	count	share
+	57952	54.9%
-	44585	42.3%
0	2151	2.0%
-,+	728	0.7%
-,-,+	50	0.0%
+,-	9	0.0%
0,-,+	4	0.0%
-,+,+	4	0.0%
0,0,-,+	1	0.0%

delayedRelease categorical feature

A categorical column named delayedRelease with 7 distinct values across 105,484 rows and no nulls. Values are dominated by '0' (55.0%), followed by '-' (27,384) and '+' (19,533); four comma-joined combinations ('-,+', '0,-,+', '+,-', and '0,0,-,+', 532 rows in total) appear to record a sequence of values within one segment rather than a single label. The entropy ratio of 0.52 confirms a skewed but not single-valued distribution.

Treatment: Split the comma-joined values and one-hot encode the three base levels (0, -, +) before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[69]:

saturn.columns["delayedRelease"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	0
top_rate	0.5502
cardinality	7
entropy	1.471
entropy_ratio	0.5238

Fig 26.

Top values for delayedRelease.

Top values for delayedRelease (7 unique shown, of 7 total).
value	count	share
0	58035	55.0%
-	27384	26.0%
+	19533	18.5%
-,+	492	0.5%
0,-,+	33	0.0%
+,-	6	0.0%
0,0,-,+	1	0.0%

approximant categorical feature

A low-cardinality categorical with only 6 distinct values, dominated by the sign tokens '-' (55.9%) and '+' (42.0%), plus a small '0' bucket (2,150 rows) and rare comma-joined combinations like '-,+', '-,-,+', '+,-'. The compound values suggest this field occasionally stores several approximant values in sequence rather than a single label. There are no nulls across 105,484 rows; the entropy ratio of 0.43 is pulled down by the four sparsely populated levels rather than by the '-'/'+' split, which is fairly even.

Treatment: Split the comma-joined values or collapse rare combos into an 'other' bucket, then one-hot encode.

anthropic:claude-opus-4-7 · confidence high
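
The "collapse rare combos, then one-hot encode" option can be sketched as follows. This is an illustration on toy data, not Saturn's own implementation; the helper name `collapse_rare` and the `min_count` threshold are hypothetical choices.

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Replace values seen fewer than min_count times with one bucket label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent values as they are; everything in `rare` becomes `other`.
    return series.where(~series.isin(rare), other)

# Toy sample: frequent base levels plus three rare comma-joined combinations.
approximant = pd.Series(["-"] * 6 + ["+"] * 4 + ["0"] * 3 + ["-,+", "-,-,+", "+,-"])

collapsed = collapse_rare(approximant, min_count=3)
one_hot = pd.get_dummies(collapsed, prefix="approximant")

print(collapsed.value_counts())
print(one_hot.columns.tolist())
```

On the real column a count threshold in the low hundreds would fold '-,+' (71), '-,-,+' (25) and '+,-' (6) into a single bucket while leaving the three base levels untouched.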

Out[72]:

saturn.columns["approximant"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	-
top_rate	0.559
cardinality	6
entropy	1.12
entropy_ratio	0.4333

Fig 27.

Top values for approximant.

Top values for approximant (6 unique shown, of 6 total).
value	count	share
-	58966	55.9%
+	44266	42.0%
0	2150	2.0%
-,+	71	0.1%
-,-,+	25	0.0%
+,-	6	0.0%

tap categorical feature

A categorical flag with only 5 distinct values dominated by '-' at 96.7% of 105484 rows, with '0' and '+' as minor categories and two rare composite codes ('-,+', '-,-,+') appearing 25 and 15 times. Entropy ratio of 0.104 confirms extreme imbalance, so this column carries very little discriminative signal on its own. The composite values suggest the field occasionally concatenates multiple states, which is worth verifying against the source schema.

Treatment: Collapse rare composites and treat as a low-signal binary indicator, or drop if downstream models penalise near-constant features.

anthropic:claude-opus-4-7 · confidence high

Out[75]:

saturn.columns["tap"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	5
top_value	-
top_rate	0.9672
cardinality	5
entropy	0.2421
entropy_ratio	0.1043
alert: imbalance	top value is 96.7% of rows

Fig 28.

Top values for tap.

Top values for tap (5 unique shown, of 5 total).
value	count	share
-	102023	96.7%
0	2203	2.1%
+	1218	1.2%
-,+	25	0.0%
-,-,+	15	0.0%
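
The entropy and entropy_ratio figures reported for tap can be checked from the counts in the table above. The sketch below assumes Shannon entropy in bits, normalised by log2 of the cardinality; that definition is inferred from the reported numbers, not taken from Saturn's documentation.

```python
import math

# Counts copied from the tap value table.
tap_counts = {"-": 102023, "0": 2203, "+": 1218, "-,+": 25, "-,-,+": 15}

def shannon_entropy_bits(counts: dict) -> float:
    """Shannon entropy (base 2) of a value-count mapping."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def entropy_ratio(counts: dict) -> float:
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(counts)
    return shannon_entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

h = shannon_entropy_bits(tap_counts)
r = entropy_ratio(tap_counts)
print(round(h, 4), round(r, 4))
```

Under this assumed definition the results should land close to the reported 0.2421 and 0.1043. The ratio is small because one value holds nearly all the probability mass, which is what the imbalance alert flags.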

trill categorical feature

A categorical flag with only 6 distinct values across 105,484 rows, encoding the phonological feature [trill] as "-", "0", "+", and a few comma-joined sequences. The distribution is severely imbalanced: "-" alone covers 96.15% of rows, leaving entropy at just 0.276 (entropy ratio 0.107). The compound values "-,+", "-,-,+" and "+,-" appear fewer than 30 times each and most likely mark multi-segment or contour phonemes that may need parsing.

Treatment: Collapse rare compound codes and consider dropping or one-hot encoding given the 96% dominance of a single value.

anthropic:claude-opus-4-7 · confidence high
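
One way to act on the "consider dropping" advice is a simple top-rate filter. The sketch below is illustrative: the function name `near_constant_columns` and the 0.95 threshold are assumptions, and the toy frame only mimics the shape of the real columns.

```python
import pandas as pd

def near_constant_columns(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose most frequent value covers at least `threshold` of rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts descending, so the first entry is the top rate.
        top_rate = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_rate >= threshold:
            flagged.append(col)
    return flagged

# Toy frame: one near-constant column and one balanced column.
toy = pd.DataFrame({
    "trill": ["-"] * 97 + ["0"] * 2 + ["+"],
    "continuant": ["+"] * 55 + ["-"] * 45,
})

to_drop = near_constant_columns(toy, threshold=0.95)
reduced = toy.drop(columns=to_drop)
print(to_drop, reduced.columns.tolist())
```

Whether to drop such a column depends on the task: a rare '+' can still be informative for a specific phonological question even when the column as a whole is nearly constant.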

Out[78]:

saturn.columns["trill"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	-
top_rate	0.9615
cardinality	6
entropy	0.2762
entropy_ratio	0.1069
alert: imbalance	top value is 96.2% of rows

Fig 29.

Top values for trill.

Top values for trill (6 unique shown, of 6 total).
value	count	share
-	101427	96.2%
0	2202	2.1%
+	1819	1.7%
-,+	26	0.0%
-,-,+	8	0.0%
+,-	2	0.0%

nasal categorical feature

A categorical flag for nasal presence/absence, dominated by '-' at 80.8% with '+' a distant second (15,941 of 105,484). The presence of compound values like '+,-', '-,+', and '+,-,-' points to multi-segment or contour phonemes rather than a clean binary indicator. The '0' category covers 2,150 rows, the same count as the tone rows reported for SegmentClass, so it most likely marks segments where the feature does not apply rather than a third nasal state.

Treatment: Normalize encodings (decide how to treat '0' and split comma-joined sequences) before using as a categorical feature.

anthropic:claude-opus-4-7 · confidence high
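
Before deciding how to treat '0', it may help to check which segment classes carry it. The sketch below cross-tabulates a feature's '0' rows against SegmentClass on toy data. The class labels "consonant", "vowel" and "tone" are assumed spellings (the report only gives the counts), and `zero_by_segment_class` is a hypothetical helper.

```python
import pandas as pd

def zero_by_segment_class(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Count '0' versus non-'0' values of `feature` within each SegmentClass."""
    kind = df[feature].eq("0").map({True: "zero", False: "non_zero"})
    return pd.crosstab(df["SegmentClass"], kind.rename(feature))

# Toy rows; the real frame would be loaded from phoible.csv.
toy = pd.DataFrame({
    "SegmentClass": ["consonant", "consonant", "vowel", "tone", "tone"],
    "nasal": ["+", "-", "-", "0", "0"],
})

table = zero_by_segment_class(toy, "nasal")
print(table)
```

If all '0' rows fall in a single class on the real data, '0' can be handled as "not applicable" for that class instead of being merged with '-'.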

Out[81]:

saturn.columns["nasal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	-
top_rate	0.8084
cardinality	8
entropy	0.897
entropy_ratio	0.299

Fig 30.

Top values for nasal.

Top values for nasal (8 unique shown, of 8 total).
value	count	share
-	85269	80.8%
+	15941	15.1%
0	2150	2.0%
+,-	1973	1.9%
-,+	95	0.1%
+,-,-	54	0.1%
+,-,+,-	1	0.0%
-,+,-	1	0.0%

lateral categorical feature

Categorical flag with 8 distinct values dominated by '-' at 93.8% of 105,484 rows, followed by '+' (4,211) and '0' (2,150). The remaining categories are comma-joined composites like '-,+' or '-,-,+' with counts ranging from 1 to 135, suggesting multi-segment or contour entries rather than clean labels. Entropy ratio is just 0.134, so the column carries little information as-is.

Treatment: Collapse rare composite values into 'mixed' or split on comma before one-hot encoding; expect minimal predictive signal due to severe imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[84]:

saturn.columns["lateral"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	-
top_rate	0.9382
cardinality	8
entropy	0.4012
entropy_ratio	0.1337

Fig 31.

Top values for lateral.

Top values for lateral (8 unique shown, of 8 total).
value	count	share
-	98968	93.8%
+	4211	4.0%
0	2150	2.0%
-,+	135	0.1%
+,-	12	0.0%
-,-,+	4	0.0%
-,+,-	3	0.0%
0,-,+	1	0.0%

labial categorical feature

A categorical flag for labial articulation, dominated by '-' (68.2%) and '+' (26.8%), with '0' and various comma-joined combinations like '-,+' and '-,-,+' suggesting multi-segment or sequence-level annotations rather than atomic values. Cardinality is 15 across 105,484 rows with no nulls, and entropy ratio is only 0.30, so the signal is highly concentrated in the binary +/- distinction. The presence of compound tokens is the main surprise and indicates inconsistent encoding granularity.

Treatment: Normalise compound values (split or collapse to first segment) before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high

Out[87]:

saturn.columns["labial"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	15
top_value	-
top_rate	0.6822
cardinality	15
entropy	1.182
entropy_ratio	0.3025

Fig 32.

Top values for labial.

Top values for labial (15 unique shown, of 15 total).
value	count	share
-	71961	68.2%
+	28241	26.8%
-,+	2414	2.3%
0	2160	2.0%
+,-	531	0.5%
-,-,+	121	0.1%
+,-,-	21	0.0%
0,+,-	8	0.0%
-,+,-	6	0.0%
0,-,+	5	0.0%
-,+,+	5	0.0%
+,+,-	5	0.0%
+,-,+	4	0.0%
-,-,+,+	1	0.0%
0,+,-,-	1	0.0%

round categorical feature

Categorical column with 8 distinct values dominated by '0' (70.3% of 105,484 rows), followed by '+' (16,956) and '-' (14,082). The tokens encode the phonological feature [round] (lip rounding), not numeric rounding, and a long tail of compound values like '-,+', '-,-,+', and '0,-,+' (each with 1-269 occurrences) most likely marks multi-segment or contour phonemes. No nulls, but entropy_ratio of 0.398 confirms heavy concentration on the single dominant category.

Treatment: Collapse rare compound sequences into an 'other' bucket and one-hot encode the resulting categories.

anthropic:claude-opus-4-7 · confidence medium

Out[90]:

saturn.columns["round"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	0
top_rate	0.703
cardinality	8
entropy	1.194
entropy_ratio	0.398

Fig 33.

Top values for round.

Top values for round (8 unique shown, of 8 total).
value	count	share
0	74155	70.3%
+	16956	16.1%
-	14082	13.3%
-,+	269	0.3%
-,-,+	17	0.0%
-,0,+	3	0.0%
0,-,+	1	0.0%
+,-	1	0.0%

labiodental categorical feature

This column appears to be a phonological feature flag indicating whether a phoneme is labiodental, encoded with values like "0", "-", "+", and a few comma-separated combinations. The distribution is dominated by "0" at 70.3% of 105,484 rows, with "-" at 28,726 and "+" at only 2,574; mixed values like "+,-", "-,+", and "+,+,-" appear in only 60 rows combined and likely represent multi-segment entries. Entropy of 1.01 (ratio 0.39) confirms heavy concentration on the default "0" code.

Treatment: Treat as categorical; collapse rare comma-separated combinations or one-hot encode the three primary levels.

anthropic:claude-opus-4-7 · confidence high

Out[93]:

saturn.columns["labiodental"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	0
top_rate	0.7027
cardinality	6
entropy	1.006
entropy_ratio	0.3891

Fig 34.

Top values for labiodental.

Top values for labiodental (6 unique shown, of 6 total).
value	count	share
0	74124	70.3%
-	28726	27.2%
+	2574	2.4%
+,-	56	0.1%
-,+	3	0.0%
+,+,-	1	0.0%

coronal categorical feature

A low-cardinality categorical with 7 distinct values dominated by sign tokens '-' (62.8%) and '+' (36,955 of 105,484), plus a small '0' bucket of 2,160 rows. The presence of comma-joined combinations like '+,-', '-,+', '-,-,+' and '+,-,+' suggests multi-valued entries collapsed into a single string rather than a clean atomic category. Entropy ratio of 0.385 confirms heavy concentration on the two main signs.

Treatment: Split the comma-joined values and encode as a small categorical (or multi-hot) before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[96]:

saturn.columns["coronal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	-
top_rate	0.6279
cardinality	7
entropy	1.08
entropy_ratio	0.3848

Fig 35.

Top values for coronal.

Top values for coronal (7 unique shown, of 7 total).
value	count	share
-	66234	62.8%
+	36955	35.0%
0	2160	2.0%
+,-	87	0.1%
-,+	41	0.0%
-,-,+	6	0.0%
+,-,+	1	0.0%

anterior categorical feature

A low-cardinality categorical with 6 distinct values dominated by '0' (64.8% of 105484 rows), '+' (25704), and '-' (11391). The remaining three categories are concatenated combinations like '-,+', '+,-', and '-,-,+' totaling only 17 rows, suggesting this field originally allowed multi-valued entries that were collapsed into comma-joined strings. Entropy ratio of 0.48 confirms heavy concentration on the single mode.

Treatment: Split comma-separated values or bucket the rare combinations into 'other', then one-hot encode.

anthropic:claude-opus-4-7 · confidence high

Out[99]:

saturn.columns["anterior"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	0
top_rate	0.6482
cardinality	6
entropy	1.251
entropy_ratio	0.4839

Fig 36.

Top values for anterior.

Top values for anterior (6 unique shown, of 6 total).
value	count	share
0	68372	64.8%
+	25704	24.4%
-	11391	10.8%
-,+	9	0.0%
+,-	5	0.0%
-,-,+	3	0.0%

distributed categorical feature

A low-cardinality categorical with 11 distinct values dominated by '0' (66.0% of 105,484 rows), followed by '-' and '+'. The remaining eight categories are concatenated combinations like '-,+' or '-,-,+' that together account for fewer than 350 rows, most likely multi-segment or contour phonemes encoded in a single field. Entropy ratio of 0.368 confirms the heavy concentration on the top class.

Treatment: Split the comma-delimited combinations into separate flags or collapse rare categories before encoding.

anthropic:claude-opus-4-7 · confidence medium

Out[102]:

saturn.columns["distributed"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	11
top_value	0
top_rate	0.6602
cardinality	11
entropy	1.273
entropy_ratio	0.3681

Fig 37.

Top values for distributed.

Top values for distributed (11 unique shown, of 11 total).
value	count	share
0	69639	66.0%
-	22283	21.1%
+	13228	12.5%
-,+	296	0.3%
-,-,+	25	0.0%
+,-	5	0.0%
0,-,+	3	0.0%
+,-,+	2	0.0%
0,+,-	1	0.0%
+,+,-	1	0.0%
0,0,-,+	1	0.0%

strident categorical feature

This appears to be a phonological feature column encoding the [strident] distinctive feature, with values '0' (unspecified), '-' (non-strident), and '+' (strident) covering the vast majority of 105484 rows. About 64.9% are '0' and there are no nulls, but a long tail of 6 composite values like '-,+' and '0,0,-,+' (totaling 625 rows) suggests multi-segment entries where the feature varies across positions. Entropy ratio of 0.41 confirms heavy concentration on the unspecified category.

Treatment: Split composite comma-separated values or one-hot encode the three primary categories before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[105]:

saturn.columns["strident"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	9
top_value	0
top_rate	0.6485
cardinality	9
entropy	1.287
entropy_ratio	0.406

Fig 38.

Top values for strident.

Top values for strident (9 unique shown, of 9 total).
value	count	share
0	68410	64.9%
-	25410	24.1%
+	11039	10.5%
-,+	585	0.6%
-,-,+	26	0.0%
+,-	7	0.0%
-,+,-	3	0.0%
0,-,+	3	0.0%
0,0,-,+	1	0.0%

dorsal categorical feature

This column encodes the phonological feature [dorsal] and is dominated by two values: '+' (54,535, 51.7%) and '-' (47,052). A third value '0' appears 2,160 times, and the remaining 10 categories are compound comma-separated combinations like '-,+' or '+,-,+', most likely multi-segment or contour phonemes collapsed into one cell. Cardinality is 13 with entropy ratio 0.33, so the long tail is negligible but structurally inconsistent with a clean categorical.

Treatment: Split comma-separated compounds into list-valued or first-token features before encoding.

anthropic:claude-opus-4-7 · confidence high
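
The list-valued and first-token options named in the treatment can be sketched with pandas string methods. The derived column names (`dorsal_parts`, `dorsal_first`, `dorsal_last`, `dorsal_n_parts`) are hypothetical, and the sample values are taken from the shapes in the dorsal table.

```python
import pandas as pd

# Toy sample covering simple and compound value shapes.
dorsal = pd.Series(["+", "-", "0", "-,+", "+,-,+", "0,0,-,+"], name="dorsal")

# Split each cell into a list of tokens; simple values become 1-element lists.
parts = dorsal.str.split(",")

features = pd.DataFrame({
    "dorsal_parts": parts,               # list-valued representation
    "dorsal_first": parts.str[0],        # first token of the sequence
    "dorsal_last": parts.str[-1],        # last token of the sequence
    "dorsal_n_parts": parts.str.len(),   # sequence length (1 = simple value)
})

print(features)
```

Keeping both the first and last token preserves the direction of a two-part sequence ('-,+' versus '+,-'), which a first-token feature alone would lose.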

Out[108]:

saturn.columns["dorsal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	13
top_value	+
top_rate	0.517
cardinality	13
entropy	1.235
entropy_ratio	0.3338

Fig 39.

Top values for dorsal.

Top values for dorsal (13 unique shown, of 13 total).
value	count	share
+	54535	51.7%
-	47052	44.6%
0	2160	2.0%
-,+	1530	1.5%
+,-	144	0.1%
-,-,+	44	0.0%
0,-,+	6	0.0%
+,-,+	5	0.0%
-,+,+	4	0.0%
+,+,-,-	1	0.0%
-,+,-	1	0.0%
+,+,-	1	0.0%
0,0,-,+	1	0.0%

high categorical feature

This column encodes the phonological feature [high] (tongue height), with '0' the dominant value at 46.7% of 105,484 rows, followed by '+' and '-'. Compound values like '-,+' and '+,-' most likely mark multi-segment or contour phonemes, and they tail off sharply (845, 627, 38, then single digits). Entropy ratio of 0.46 confirms heavy concentration in the top three single-symbol categories; cardinality is 11 with no nulls.

Treatment: Collapse rare compound sequences into an 'other' bucket, then one-hot encode the three primary signs.

anthropic:claude-opus-4-7 · confidence medium

Out[111]:

saturn.columns["high"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	11
top_value	0
top_rate	0.4669
cardinality	11
entropy	1.594
entropy_ratio	0.4609

Fig 40.

Top values for high.

Top values for high (11 unique shown, of 11 total).
value	count	share
0	49247	46.7%
+	35559	33.7%
-	19156	18.2%
-,+	845	0.8%
+,-	627	0.6%
+,-,+	38	0.0%
+,+,-	6	0.0%
-,+,+	2	0.0%
-,-,+	2	0.0%
+,-,0	1	0.0%
-,+,-	1	0.0%

low categorical feature

A low-cardinality categorical with 8 distinct tokens dominated by '-' (47.3%) and '0' (49,244 of 105,484), with '+' a distant third at 5,598. The column encodes the phonological feature [low]; the remaining values are comma-joined sequences like '+,-' and '-,+,-', most likely multi-segment or contour phonemes. The long tail (down to a single '+,-,-') indicates rare composite states worth bucketing.

Treatment: Collapse rare composite sequences into an 'other' bucket and one-hot encode the remaining categories.

anthropic:claude-opus-4-7 · confidence medium

Out[114]:

saturn.columns["low"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	-
top_rate	0.4733
cardinality	8
entropy	1.305
entropy_ratio	0.4351

Fig 41.

Top values for low.

Top values for low (8 unique shown, of 8 total).
value	count	share
-	49930	47.3%
0	49244	46.7%
+	5598	5.3%
+,-	417	0.4%
-,+	270	0.3%
-,+,-	21	0.0%
-,-,+	3	0.0%
+,-,-	1	0.0%

front categorical feature

Categorical column encoding the phonological feature [front] with 13 distinct values dominated by three primitives: "0" (49,316), "-" (34,225), and "+" (20,683), together covering nearly all 105,484 rows. The remaining categories are comma-joined combinations like "-,+" (838) or "-,-,+" (24), most likely multi-segment or contour phonemes rather than clean atomic labels. Top rate is 0.468 and entropy ratio 0.43, so the distribution is skewed toward "0" but not degenerate. No nulls.

Treatment: Split the comma-separated combinations into multi-hot indicators for -, 0, + before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[117]:

saturn.columns["front"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	13
top_value	0
top_rate	0.4675
cardinality	13
entropy	1.592
entropy_ratio	0.4302

Fig 42.

Top values for front.

Top values for front (13 unique shown, of 13 total).
value	count	share
0	49316	46.8%
-	34225	32.4%
+	20683	19.6%
-,+	838	0.8%
+,-	359	0.3%
-,-,+	24	0.0%
+,-,-	14	0.0%
-,+,+	10	0.0%
+,-,+	6	0.0%
-,0,+	3	0.0%
+,+,-	2	0.0%
0,-,+	2	0.0%
-,+,-	2	0.0%

back categorical feature

A low-cardinality categorical with 12 distinct values dominated by the tokens "0" (46.7%), "-", and "+", encoding the phonological feature [back]. The remaining values are comma-separated combinations like "+,-" or "+,-,-", most likely multi-segment or contour phonemes stored in one cell, so the column is a compound encoding rather than a clean atomic category. No nulls across 105,484 rows, and entropy ratio of 0.42 confirms the heavy skew toward "0".

Treatment: Split the comma-separated tokens into a list and one-hot or count-encode the +/-/0 components before modelling.

anthropic:claude-opus-4-7 · confidence high
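
The count-encoding variant of this treatment can be sketched as below. The helper `count_encode` and the output column naming are assumptions made for illustration; the sample values follow the shapes in the back table.

```python
import pandas as pd

TOKENS = ("+", "-", "0")

def count_encode(series: pd.Series, tokens=TOKENS) -> pd.DataFrame:
    """For each cell, count how many times each base token occurs after splitting on ','."""
    parts = series.str.split(",")
    return pd.DataFrame(
        {
            f"{series.name}_n_{tok}": parts.apply(lambda p, t=tok: p.count(t))
            for tok in tokens
        },
        index=series.index,
    )

# Toy sample with simple and compound values.
back = pd.Series(["0", "-", "+,-", "+,-,-", "-,+,+"], name="back")
encoded = count_encode(back)
print(encoded)
```

Unlike a multi-hot encoding, counts distinguish '+,-' from '+,-,-', though they still ignore the order of tokens within a sequence.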

Out[120]:

saturn.columns["back"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	12
top_value	0
top_rate	0.4671
cardinality	12
entropy	1.521
entropy_ratio	0.4244

Fig 43.

Top values for back.

Top values for back (12 unique shown, of 12 total).
value	count	share
0	49270	46.7%
-	39749	37.7%
+	15547	14.7%
+,-	511	0.5%
-,+	367	0.3%
+,-,-	19	0.0%
-,-,+	8	0.0%
-,+,+	5	0.0%
-,+,-	5	0.0%
0,+,-	1	0.0%
+,-,+	1	0.0%
+,+,-	1	0.0%

tense categorical feature

This is a categorical field for the phonological feature [tense] (not grammatical tense), with only 8 distinct values across 105,484 rows and no nulls, dominated by '0' at 71.3% and '+' at ~22%. '-' accounts for 6,386 rows (6.1%), and the remaining compound sequences ('+,-', '-,+', and a tail with as few as 1-6 occurrences) most likely mark multi-segment or contour phonemes. Low entropy ratio (0.37) confirms heavy concentration in the top class.

Treatment: Collapse rare compound categories into an 'other' bucket before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high

Out[123]:

saturn.columns["tense"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	0
top_rate	0.7132
cardinality	8
entropy	1.114
entropy_ratio	0.3712

Fig 44.

Top values for tense.

Top values for tense (8 unique shown, of 8 total).
value	count	share
0	75230	71.3%
+	23411	22.2%
-	6386	6.1%
+,-	268	0.3%
-,+	179	0.2%
+,-,+	6	0.0%
+,-,-	3	0.0%
+,+,-	1	0.0%

retractedTongueRoot categorical feature

Categorical encoding of the retracted-tongue-root (RTR) phonological feature, with 7 distinct values across 105,484 rows and no nulls. The column is severely imbalanced: '-' covers 97.4% of rows, '0' another ~2%, and the remaining five values (including compound codes like '-,+' and '-,-,+') together account for under 500 rows. Entropy ratio of 0.069 confirms almost no information content as-is.

Treatment: Collapse rare codes or binarize ('-' vs other); likely low predictive value due to extreme imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[126]:

saturn.columns["retractedTongueRoot"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	-
top_rate	0.9744
cardinality	7
entropy	0.1935
entropy_ratio	0.06892
alert: imbalance	top value is 97.4% of rows

Fig 45.

Top values for retractedTongueRoot.

Top values for retractedTongueRoot (7 unique shown, of 7 total).
value	count	share
-	102788	97.4%
0	2235	2.1%
-,+	251	0.2%
+	199	0.2%
-,-,+	9	0.0%
-,+,-	1	0.0%
+,-	1	0.0%

advancedTongueRoot categorical feature

This column encodes the advanced tongue root (ATR) phonological feature, taking three values: '-', '0', and '+'. It is severely imbalanced — '-' covers 97.87% of 105,484 rows, '0' appears 2,235 times, and '+' only 11 times, yielding an entropy ratio of just 0.094. The near-absence of '+' values means this feature carries almost no discriminative signal as-is.

Treatment: Consider dropping or collapsing to a binary indicator due to extreme imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[129]:

saturn.columns["advancedTongueRoot"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	-
top_rate	0.9787
cardinality	3
entropy	0.1496
entropy_ratio	0.09438
alert: imbalance	top value is 97.9% of rows

Fig 46.

Top values for advancedTongueRoot.

Top values for advancedTongueRoot (3 unique shown, of 3 total).
value	count	share
-	103238	97.9%
0	2235	2.1%
+	11	0.0%

periodicGlottalSource categorical feature

Phonological feature flag for periodic glottal source (voicing), with 7 distinct values across 105,484 rows and no nulls. The vast majority are simple binary tags: '+' at 67.97% (71,694) and '-' (31,179), with a small '0' class (2,139) and rare comma-joined sequences like '+,-' (371) suggesting multi-segment or contour entries. Low entropy ratio (0.3745) confirms heavy concentration on '+'.

Treatment: Collapse rare composite values into an 'other' bucket and one-hot encode.

anthropic:claude-opus-4-7 · confidence high

Out[132]:

saturn.columns["periodicGlottalSource"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	+
top_rate	0.6797
cardinality	7
entropy	1.051
entropy_ratio	0.3745

Fig 47.

Top values for periodicGlottalSource.

Top values for periodicGlottalSource (7 unique shown, of 7 total).
value	count	share
+	71694	68.0%
-	31179	29.6%
0	2139	2.0%
+,-	371	0.4%
-,+	87	0.1%
+,-,-	8	0.0%
+,-,+	6	0.0%

epilaryngealSource categorical feature

A categorical phonological feature (epilaryngeal source) with only 3 distinct values: '-', '0', and '+'. The column is severely imbalanced — '-' accounts for 97.93% of the 105,484 rows, '0' for 2,150 rows, and '+' for just 31 rows, yielding a very low entropy ratio of 0.093. With no nulls but near-constant values, it carries little discriminative signal.

Treatment: Consider dropping or collapsing to a binary indicator; near-constant with extreme imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[135]:

saturn.columns["epilaryngealSource"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	-
top_rate	0.9793
cardinality	3
entropy	0.1474
entropy_ratio	0.09303
alert: imbalance	top value is 97.9% of rows

Fig 48.

Top values for epilaryngealSource.

Top values for epilaryngealSource (3 unique shown, of 3 total).
value	count	share
-	103303	97.9%
0	2150	2.0%
+	31	0.0%

spreadGlottis categorical feature

This appears to be a phonological feature column encoding the [spread glottis] distinctive feature, with values '-', '+', '0' and comma-separated combinations for segments with multiple specifications. The distribution is extremely lopsided: '-' covers 91.8% of 105,484 rows and entropy ratio is just 0.149, meaning the column carries little information on its own. The long tail of compound values like '-,+', '+,0,-', and '+,-,+' (some with only 1-5 occurrences) suggests multi-segment or contour entries that may need parsing.

Treatment: Split compound values on comma and one-hot encode; consider dropping if downstream model is sensitive to near-constant features.

anthropic:claude-opus-4-7 · confidence high

Out[138]:

saturn.columns["spreadGlottis"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	10
top_value	-
top_rate	0.9182
cardinality	10
entropy	0.4965
entropy_ratio	0.1495

Fig 49.

Top values for spreadGlottis.

Top values for spreadGlottis (10 unique shown, of 10 total).
value	count	share
-	96855	91.8%
+	6156	5.8%
0	2138	2.0%
-,+	206	0.2%
+,-	115	0.1%
-,-,+	5	0.0%
+,0,-	5	0.0%
+,-,-	2	0.0%
+,0,-,-	1	0.0%
+,-,+	1	0.0%

constrictedGlottis categorical feature

Categorical flag for a 'constricted glottis' phonological feature, with 7 distinct values across 105,484 rows and no nulls. Heavily dominated by '-' at 94.5% (99,727 rows), with '+' and '0' as minor categories and a long tail of comma-joined sequences ('+,-', '-,+', and two singletons) suggesting multi-segment annotations. Low entropy ratio (0.13) confirms the column carries little information in isolation.

Treatment: Collapse rare composite values and binarize against '-' before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[141]:

saturn.columns["constrictedGlottis"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	-
top_rate	0.9454
cardinality	7
entropy	0.3717
entropy_ratio	0.1324

Fig 50.

Top values for constrictedGlottis.

Top values for constrictedGlottis (7 unique shown, of 7 total).
value	count	share
-	99727	94.5%
+	3383	3.2%
0	2138	2.0%
+,-	141	0.1%
-,+	93	0.1%
+,-,-	1	0.0%
-,-,+	1	0.0%

fortis categorical feature

A 3-level categorical flag for the phonological feature [fortis], dominated by '-' (68.1% of 105,484 rows), with '0' covering most of the rest (33,202) and '+' appearing only 415 times. The entropy ratio of 0.589 looks relatively high because only three values are possible, but almost all of that spread comes from the '-'/'0' split; the '+' class itself is tiny. No nulls, so the encoding is complete as-is.

Treatment: One-hot encode; consider merging the rare '+' class or treating it as a minority signal.

anthropic:claude-opus-4-7 · confidence medium

Out[144]:

saturn.columns["fortis"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	-
top_rate	0.6813
cardinality	3
entropy	0.9335
entropy_ratio	0.589

Fig 51.

Top values for fortis.

Top values for fortis (3 unique shown, of 3 total).
value	count	share
-	71867	68.1%
0	33202	31.5%
+	415	0.4%

lenis categorical feature

A ternary categorical flag with values '-', '0', and '+', encoding the phonological feature [lenis]. The distribution is highly imbalanced: '-' covers 68.1% of 105,484 rows and '0' another 33,202, while '+' appears only 416 times. These counts are nearly identical to those of fortis (71,867 / 33,202 / 415). No nulls, but the rare '+' class may be too sparse for stable modelling.

Treatment: One-hot encode, but consider merging or upweighting the rare '+' class before modelling.

anthropic:claude-opus-4-7 · confidence high
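
Upweighting the rare '+' class can be done with inverse-frequency class weights. The sketch below uses the common n / (k * count) heuristic, the same formula scikit-learn documents for `class_weight="balanced"`, applied to the counts in the lenis table. The function name `balanced_class_weights` is hypothetical.

```python
# Counts copied from the lenis value table.
lenis_counts = {"-": 71866, "0": 33202, "+": 416}

def balanced_class_weights(counts: dict) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

weights = balanced_class_weights(lenis_counts)
for label, w in weights.items():
    print(label, round(w, 3))
```

With these weights every class contributes the same total weight, so the 416 '+' rows count as much in aggregate as the 71,866 '-' rows. Weights this large can make training noisy, so capping them is a reasonable precaution.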

Out[147]:

saturn.columns["lenis"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	-
top_rate	0.6813
cardinality	3
entropy	0.9336
entropy_ratio	0.589

Fig 52.

Top values for lenis.

Top values for lenis (3 unique shown, of 3 total).
value	count	share
-	71866	68.1%
0	33202	31.5%
+	416	0.4%

raisedLarynxEjective categorical feature

This is a categorical phonological feature flag for 'raisedLarynxEjective', taking values like '-', '+', '0', and a few comma-separated combinations across 105,484 rows with no nulls. The distribution is severely imbalanced: '-' covers 96.37% of rows and entropy ratio is just 0.103, with rare compound values like '-,-,+' appearing only once. The 6-way cardinality plus compound codes in both orders ('-,+' and '+,-') suggests multi-segment annotations that may need parsing.

Treatment: Treat as low-signal categorical; consider collapsing rare compound codes or dropping due to extreme imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[150]:

saturn.columns["raisedLarynxEjective"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	-
top_rate	0.9637
cardinality	6
entropy	0.2675
entropy_ratio	0.1035
alert: imbalance	top value is 96.4% of rows

Fig 53.

Top values for raisedLarynxEjective.

Top values for raisedLarynxEjective (6 unique shown, of 6 total).
value	count	share
-	101652	96.4%
0	2150	2.0%
+	1573	1.5%
-,+	85	0.1%
+,-	23	0.0%
-,-,+	1	0.0%

loweredLarynxImplosive categorical feature

Categorical phonological feature flagging lowered larynx implosives, with 5 distinct values across 105,484 rows and no nulls. The distribution is severely imbalanced: '-' covers 97.27% of rows, while '0' (2,150), '+' (716), and the mixed codes '-,+' (7) and '+,-' (2) are rare. Entropy ratio of 0.088 confirms the column carries very little information as-is.

Treatment: Collapse to a binary present/absent indicator or drop given the 97% dominance of '-'.

anthropic:claude-opus-4-7 · confidence high
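
A present/absent indicator for this column could be derived as sketched below. Treating any '+' inside a compound value as "present" is a modelling assumption, not something the report specifies, and the output name `has_implosive` is hypothetical.

```python
import pandas as pd

# Toy sample with one row per value shape in the table.
lli = pd.Series(["-", "0", "+", "-,+", "+,-"], name="loweredLarynxImplosive")

# regex=False makes "+" a literal character instead of a regex quantifier.
has_implosive = lli.str.contains("+", regex=False).astype(int).rename("has_implosive")

print(has_implosive.tolist())
```

On the real data this would flag the 716 '+' rows plus the 9 compound rows, and fold '-' and '0' together as "absent".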

Out[153]:

saturn.columns["loweredLarynxImplosive"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	5
top_value	-
top_rate	0.9727
cardinality	5
entropy	0.2034
entropy_ratio	0.08759
alert: imbalance	top value is 97.3% of rows

Fig 54.

Top values for loweredLarynxImplosive.

Top values for loweredLarynxImplosive (5 unique shown, of 5 total).
value	count	share
-	102609	97.3%
0	2150	2.0%
+	716	0.7%
-,+	7	0.0%
+,-	2	0.0%

click categorical label

Categorical flag with only 5 distinct values dominated by '-' (68.2% of 105,484 rows) and '0' (33,202). The column encodes the phonological feature [click] (whether the segment is a click consonant), not a user-interaction event: '-' marks non-click segments, and the '0' count equals the 31,052 vowel plus 2,150 tone rows reported for SegmentClass, which suggests '0' marks segments where the feature does not apply. The '+' class is rare (253), and the compound values '+,-' (52) and '-,+' (6) most likely mark multi-segment or contour phonemes that break the single-label assumption. Entropy ratio of 0.40 confirms the heavy imbalance.

Treatment: Split or collapse the compound '+,-'/'-,+' rows and decide how to treat '0' before deriving a binary click indicator.

anthropic:claude-opus-4-7 · confidence medium

Out[156]:

saturn.columns["click"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	5
top_value	-
top_rate	0.6823
cardinality	5
entropy	0.9283
entropy_ratio	0.3998

Fig 55.

Top values for click.

Top values for click (5 unique shown, of 5 total).
value	count	share
-	71971	68.2%
0	33202	31.5%
+	253	0.2%
+,-	52	0.0%
-,+	6	0.0%

How to cite