data-trove-phoible-phonetics-database

Overview

Source: /home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv

Saturn profiled 105,484 rows across 49 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

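# IPython shell escape: run this in a notebook cell, not in plain Python.
!pip install saturn-dissect
import subprocess

# Run the Saturn CLI on the CSV; check=True raises if the command fails.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv",
    "--findings", "data-trove-phoible-phonetics-database.json",
    "--llm", "anthropic:default",
], check=True)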

Summary confidence: high

This dataset is PHOIBLE, a cross-linguistic phonological inventory database containing 105,484 phoneme-level records spanning 2,177 distinct Glottocodes (2,716 distinct language names), each row describing a single phoneme and its distinctive feature values. The first thing to examine is the breakdown by SegmentClass: consonants dominate (~68.5%), followed by vowels (~29.4%) and tones (~2.0%), which shapes how almost every other feature distributes. A second focus is the Source column: the data comes from eight contributing sources ('ph' alone accounts for 34.4%), so coverage and coding conventions are uneven across the corpus and could introduce systematic biases in any cross-linguistic comparison.

citing: SegmentClass.top_values · Source.top_values · LanguageName.n_unique · Glottocode.n_unique · Phoneme.top_values · row_count · column_count
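
The class shares quoted in the summary can be recomputed from the SegmentClass value counts reported in this document. A minimal sketch of that arithmetic follows; the variable names are illustrative and not part of Saturn.

```python
# Counts copied from the SegmentClass data table in this report.
segment_counts = {"consonant": 72282, "vowel": 31052, "tone": 2150}

# The counts should add up to the profiled row count.
total = sum(segment_counts.values())

# Share of each class as a percentage of all rows.
segment_shares = {name: 100 * count / total for name, count in segment_counts.items()}

for name, share in segment_shares.items():
    print(f"{name:<10} {share:5.1f}%")
```

Rounding to one decimal place reproduces the shares shown in the data table.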

Out[4]:

saturn.schema() · 49 columns

column	kind	n	null%	unique	alerts
InventoryID	numeric	105,484	0.0%	3,020
Glottocode	text	105,484	0.0%	2,177	one_word short_text duplicates
ISO6393	text	105,484	0.0%	2,095	one_word short_text duplicates
LanguageName	text	105,484	0.0%	2,716	one_word allcaps short_text duplicates
SpecificDialect	categorical	105,484	0.0%	546
GlyphID	text	105,484	0.0%	3,142	one_word allcaps short_text duplicates
Phoneme	text	105,484	0.0%	3,142	one_word short_text duplicates
Allophones	text	105,484	0.0%	6,892	one_word short_text duplicates
Marginal	categorical	105,484	0.0%	3
SegmentClass	categorical	105,484	0.0%	3
Source	categorical	105,484	0.0%	8
tone	categorical	105,484	0.0%	2	imbalance
stress	categorical	105,484	0.0%	2	imbalance
syllabic	categorical	105,484	0.0%	8
short	categorical	105,484	0.0%	4	imbalance
long	categorical	105,484	0.0%	6
consonantal	categorical	105,484	0.0%	5
sonorant	categorical	105,484	0.0%	8
continuant	categorical	105,484	0.0%	9
delayedRelease	categorical	105,484	0.0%	7
approximant	categorical	105,484	0.0%	6
tap	categorical	105,484	0.0%	5	imbalance
trill	categorical	105,484	0.0%	6	imbalance
nasal	categorical	105,484	0.0%	8
lateral	categorical	105,484	0.0%	8
labial	categorical	105,484	0.0%	15
round	categorical	105,484	0.0%	8
labiodental	categorical	105,484	0.0%	6
coronal	categorical	105,484	0.0%	7
anterior	categorical	105,484	0.0%	6
distributed	categorical	105,484	0.0%	11
strident	categorical	105,484	0.0%	9
dorsal	categorical	105,484	0.0%	13
high	categorical	105,484	0.0%	11
low	categorical	105,484	0.0%	8
front	categorical	105,484	0.0%	13
back	categorical	105,484	0.0%	12
tense	categorical	105,484	0.0%	8
retractedTongueRoot	categorical	105,484	0.0%	7	imbalance
advancedTongueRoot	categorical	105,484	0.0%	3	imbalance
periodicGlottalSource	categorical	105,484	0.0%	7
epilaryngealSource	categorical	105,484	0.0%	3	imbalance
spreadGlottis	categorical	105,484	0.0%	10
constrictedGlottis	categorical	105,484	0.0%	7
fortis	categorical	105,484	0.0%	3
lenis	categorical	105,484	0.0%	3
raisedLarynxEjective	categorical	105,484	0.0%	6	imbalance
loweredLarynxImplosive	categorical	105,484	0.0%	5	imbalance
click	categorical	105,484	0.0%	5

Fig 1.

SegmentClass · Look at the consonant-to-vowel-to-tone split — consonants make up roughly 68.5% of all records, which will skew any feature-level analysis.

Top values for SegmentClass (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

Fig 2.

Source · Check how unevenly data is distributed across the eight source databases, with 'ph' contributing more than a third of all rows on its own.

Top values for Source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%

Fig 3.

LanguageName · See which languages contribute the most phoneme records — Iron Ossetic leads with 444 entries, hinting at uneven language-level representation.

Character-length distribution for LanguageName (mean: 7.822219483523567).
chars	count
2 – 4	3915
4 – 6	29541
6 – 8	34409
8 – 10	14985
10 – 12	7358
12 – 14	4849
14 – 15	3897
15 – 17	2893
17 – 19	1292
19 – 21	993
21 – 23	198
23 – 25	224
25 – 27	218
27 – 29	39
29 – 31	93
31 – 33	130
33 – 35	64
35 – 37	23
37 – 39	37
39 – 40	57
40 – 42	0
42 – 44	52
44 – 46	0
46 – 48	23
48 – 50	0
50 – 52	0
52 – 54	20
54 – 56	0
56 – 58	40
58 – 60	0
60 – 62	87
62 – 64	0
64 – 66	0
66 – 67	0
67 – 69	0
69 – 71	0
71 – 73	0
73 – 75	0
75 – 77	0
77 – 79	47

Fig 4.

nasal · Review the balance of the nasal feature values (+, -, 0) as a representative example of how distinctive features are coded across the full inventory.

Top values for nasal (8 unique shown, of 8 total).
value	count	share
-	85269	80.8%
+	15941	15.1%
0	2150	2.0%
+,-	1973	1.9%
-,+	95	0.1%
+,-,-	54	0.1%
+,-,+,-	1	0.0%
-,+,-	1	0.0%

Fig 5.

Marginal · Note that about 1.3% of phonemes are flagged as marginal (borrowed or rare) while ~19.8% carry NA, worth checking before any typological counts.

Top values for Marginal (3 unique shown, of 3 total).
value	count	share
FALSE	83263	78.9%
NA	20874	19.8%
TRUE	1347	1.3%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
InventoryID	numeric	0.0%
Glottocode	text	0.0%
ISO6393	text	0.0%
LanguageName	text	0.0%
SpecificDialect	categorical	0.0%
GlyphID	text	0.0%
Phoneme	text	0.0%
Allophones	text	0.0%
Marginal	categorical	0.0%
SegmentClass	categorical	0.0%
Source	categorical	0.0%
tone	categorical	0.0%
stress	categorical	0.0%
syllabic	categorical	0.0%
short	categorical	0.0%
long	categorical	0.0%
consonantal	categorical	0.0%
sonorant	categorical	0.0%
continuant	categorical	0.0%
delayedRelease	categorical	0.0%
approximant	categorical	0.0%
tap	categorical	0.0%
trill	categorical	0.0%
nasal	categorical	0.0%
lateral	categorical	0.0%
labial	categorical	0.0%
round	categorical	0.0%
labiodental	categorical	0.0%
coronal	categorical	0.0%
anterior	categorical	0.0%
distributed	categorical	0.0%
strident	categorical	0.0%
dorsal	categorical	0.0%
high	categorical	0.0%
low	categorical	0.0%
front	categorical	0.0%
back	categorical	0.0%
tense	categorical	0.0%
retractedTongueRoot	categorical	0.0%
advancedTongueRoot	categorical	0.0%
periodicGlottalSource	categorical	0.0%
epilaryngealSource	categorical	0.0%
spreadGlottis	categorical	0.0%
constrictedGlottis	categorical	0.0%
fortis	categorical	0.0%
lenis	categorical	0.0%
raisedLarynxEjective	categorical	0.0%
loweredLarynxImplosive	categorical	0.0%
click	categorical	0.0%

InventoryID numeric foreign_key

InventoryID is a numeric foreign key referencing an inventory dimension table, with exactly 3,020 distinct values spanning 1–3,020 across 105,484 rows, implying heavy repeated use of each ID (average ~35 rows per ID). The distribution is remarkably flat and symmetric (skew ≈ −0.002, kurtosis ≈ −1.15, zero outliers), consistent with a well-populated lookup identifier rather than a measured quantity.

Treatment: Left-join on this ID to the inventory dimension table; do not use raw numeric value as a feature.

anthropic:default · confidence high
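
The left-join named in the treatment note can be sketched with plain dictionaries. The inventory table and its `inventory_label` field below are hypothetical stand-ins, since this report does not show the real dimension table.

```python
# Hypothetical inventory dimension table keyed by InventoryID (made-up rows).
inventories = {
    1: {"inventory_label": "example inventory A"},
    2: {"inventory_label": "example inventory B"},
}

# A few phoneme rows shaped like the profiled CSV (values are illustrative).
phoneme_rows = [
    {"InventoryID": 1, "Phoneme": "m"},
    {"InventoryID": 2, "Phoneme": "a"},
    {"InventoryID": 3, "Phoneme": "k"},  # deliberately has no matching inventory
]

def left_join(rows, dimension, key):
    """Attach dimension fields to each row; unmatched rows are kept with None values."""
    fields = sorted({f for record in dimension.values() for f in record})
    joined = []
    for row in rows:
        match = dimension.get(row[key], {})
        joined.append({**row, **{f: match.get(f) for f in fields}})
    return joined

joined_rows = left_join(phoneme_rows, inventories, "InventoryID")
```

Every phoneme row survives the join, and rows without a matching inventory carry `None` so that gaps in the dimension table stay visible.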

Out[12]:

saturn.columns["InventoryID"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3,020
min	1
max	3,020
mean	1,479
median	1,464
std	843.1
q1	769
q3	2,237
iqr	1,468
skew	-0.002397
kurtosis	-1.146
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 7.

Distribution of InventoryID. Vertical dash marks the median.

Histogram bins for InventoryID (median: 1464.0).
bin	count
1 – 76.47	2729
76.47 – 151.9	2947
151.9 – 227.4	2742
227.4 – 302.9	2385
302.9 – 378.4	2399
378.4 – 453.8	2530
453.8 – 529.3	2270
529.3 – 604.8	2323
604.8 – 680.3	2533
680.3 – 755.8	3039
755.8 – 831.2	2896
831.2 – 906.7	2609
906.7 – 982.2	2693
982.2 – 1058	2513
1058 – 1133	2637
1133 – 1209	2514
1209 – 1284	2898
1284 – 1360	3314
1360 – 1435	3634
1435 – 1510	2993
1510 – 1586	3172
1586 – 1661	3290
1661 – 1737	2686
1737 – 1812	3001
1812 – 1888	1779
1888 – 1963	2046
1963 – 2039	1878
2039 – 2114	1864
2114 – 2190	2685
2190 – 2265	3387
2265 – 2341	3096
2341 – 2416	3142
2416 – 2492	3264
2492 – 2567	3557
2567 – 2643	2966
2643 – 2718	1975
2718 – 2794	1761
2794 – 2869	1659
2869 – 2945	1855
2945 – 3020	1823

Glottocode text foreign_key

This column contains Glottocodes, the standardized 8-character language identifiers used by the Glottolog database (e.g., 'kham1282', 'dutc1256'). Almost every value is 8 characters long (mean 7.999, median 8); the exception is 19 rows holding a 2-character value, which cannot be a valid Glottocode and should be inspected. With only 2,177 unique codes across 105,484 rows, the duplicate rate is 97.9%, meaning each code recurs ~48 times on average, as expected when every phoneme in a language's inventory is a separate row. The top code 'kham1282' (Kham) appears 622 times, suggesting uneven language coverage in the dataset.

Treatment: Left-join on this code against the Glottolog reference table to enrich with language family, geographic coordinates, and macro-area metadata.

anthropic:default · confidence high
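
Before joining against Glottolog, it may help to separate well-formed codes from the short values implied by len_min = 2 in the stats. The sketch below assumes the published Glottocode shape of four alphanumeric characters followed by four digits. The 'NA' sample is a hypothetical placeholder, because the report does not show what the 2-character values actually are.

```python
import re

# Glottocode shape: four lowercase alphanumeric characters, then four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def is_glottocode(value):
    """Return True when the whole string has the Glottocode shape."""
    return GLOTTOCODE_RE.fullmatch(value) is not None

# 'kham1282' and 'dutc1256' appear in this report; 'NA' is a hypothetical placeholder.
samples = ["kham1282", "dutc1256", "NA"]
invalid = [code for code in samples if not is_glottocode(code)]
print(invalid)
```

A shape check only shows that a code is well-formed; whether it exists in Glottolog still has to be confirmed by the join itself.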

Out[15]:

saturn.columns["Glottocode"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2,177
len_min	2
len_max	8
len_mean	7.999
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	103,307
duplicate_rate	0.9794
vocab_size	2,168
readability_flesch_mean	94.15
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.9% duplicate strings

Fig 8.

Character-length distribution for Glottocode.

Character-length distribution for Glottocode (mean: 7.998919267377043).
chars	count
2	19
3 – 7	0
8	105465

ISO6393 text label

This column contains ISO 639-3 three-letter language codes, a standardised identifier for natural languages. Every value is exactly 3 characters long (min, mean, max all equal 3) with zero nulls, so the column is at least well-formed. The duplicate rate is 98.0%: 2,095 distinct codes repeat across 105,484 rows, which is expected for a language tag applied to every phoneme row. The most frequent code, 'mis' (the ISO 639-3 special code for uncoded languages), appears 828 times and warrants attention, because those rows cannot be tied to a specific language through this column.

Treatment: Use as a categorical grouping key; consider flagging or separating records where ISO6393 equals 'mis' (uncoded) before language-based analysis.

anthropic:default · confidence high
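
The n_duplicates and duplicate_rate figures in the stats table are consistent with counting every row beyond the first occurrence of each distinct value. That definition is inferred from the reported numbers, not taken from Saturn's implementation. A short worked check:

```python
def duplicate_stats(n_rows, n_unique):
    """Count rows beyond the first occurrence of each distinct value, and their share."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, n_duplicates / n_rows

# Row count and unique count as reported for ISO6393.
n_dup, dup_rate = duplicate_stats(105_484, 2_095)
print(n_dup, round(dup_rate, 4))
```

The same arithmetic applied to the Glottocode figures (105,484 rows, 2,177 unique) gives 103,307 duplicates, matching that column's stats table as well.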

Out[18]:

saturn.columns["ISO6393"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2,095
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	103,389
duplicate_rate	0.9801
vocab_size	2,086
readability_flesch_mean	119.5
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	98.0% duplicate strings

Fig 9.

Character-length distribution for ISO6393.

Character-length distribution for ISO6393 (mean: 3.0).
chars	count
3	105484

LanguageName text label

This column contains human language names (e.g., 'Dutch', 'Chechen', 'Bengali', 'Iron Ossetic'), functioning as a categorical label with 2,716 unique values across 105,484 rows. The duplicate rate of 97.4% (102,768 duplicates) confirms it is a repeating label, not a free-text field. Notably, 13.1% of values are all-caps, which points to inconsistent casing conventions rather than a different kind of content. There are also more distinct names (2,716) than Glottocodes (2,177), so at least some Glottocodes appear under more than one name, and Glottocode is the safer grouping key. The distribution is uneven: the top value 'Iron Ossetic' appears 444 times while many languages appear rarely, indicating a long tail of sparsely represented languages.

Treatment: Encode as a categorical (label or target-encode) after normalising casing inconsistencies flagged by the 13.1% all-caps rate.

anthropic:default · confidence high
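
One simple way to normalise the casing flagged above is to title-case only the names stored entirely in uppercase. This is a sketch under that assumption; 'CHECHEN' is a hypothetical all-caps variant used for illustration, not a value quoted from the data.

```python
def normalise_name(name):
    """Title-case names stored entirely in uppercase; leave all others untouched."""
    return name.title() if name.isupper() else name

names = ["Dutch", "Iron Ossetic", "CHECHEN"]  # 'CHECHEN' is hypothetical
normalised = [normalise_name(n) for n in names]
print(normalised)
```

`str.title()` capitalises the letter after any non-letter character, so names containing apostrophes or click symbols may need a manual review after this step.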

Out[21]:

saturn.columns["LanguageName"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2,716
len_min	2
len_max	79
len_mean	7.822
len_median	7
len_p95	16
word_mean	1.201
word_median	1
n_empty	0
n_duplicates	102,768
duplicate_rate	0.9743
vocab_size	2,670
readability_flesch_mean	53.18
emoji_rate	0
url_rate	0
one_word_rate	0.8433
allcaps_rate	0.1314
boilerplate_rate	0
alert: one_word	84.3% rows are a single word
alert: allcaps	13.1% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.4% duplicate strings

Fig 10.

Character-length distribution for LanguageName.

Character-length distribution for LanguageName (mean: 7.822219483523567).
chars	count
2 – 4	3915
4 – 6	29541
6 – 8	34409
8 – 10	14985
10 – 12	7358
12 – 14	4849
14 – 15	3897
15 – 17	2893
17 – 19	1292
19 – 21	993
21 – 23	198
23 – 25	224
25 – 27	218
27 – 29	39
29 – 31	93
31 – 33	130
33 – 35	64
35 – 37	23
37 – 39	37
39 – 40	57
40 – 42	0
42 – 44	52
44 – 46	0
46 – 48	23
48 – 50	0
50 – 52	0
52 – 54	20
54 – 56	0
56 – 58	40
58 – 60	0
60 – 62	87
62 – 64	0
64 – 66	0
66 – 67	0
67 – 69	0
69 – 71	0
71 – 73	0
73 – 75	0
75 – 77	0
77 – 79	47

SpecificDialect categorical label

This column captures a specific dialect designation for each phoneme record. The dominant signal is missingness encoded as labels: 71.9% of rows carry the string 'NA' and a further 7,692 rows (≈7.3%) are empty strings, so roughly 79% of records lack a meaningful dialect value. Despite 546 unique values, the entropy ratio is only 0.33, confirming that real dialect labels are thinly and unevenly spread across the remaining ~22,000 rows.

Treatment: Treat 'NA' and empty-string as missing; consider collapsing rare dialects (below a frequency threshold) into an 'Other' bucket before encoding, and flag high missingness rate (~79%) before any modelling use.

anthropic:default · confidence high
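
The treatment above can be sketched in two steps: map the string placeholders to a real missing value, then fold rare dialect labels into a shared bucket. The `min_count` threshold and the 'Other' label are illustrative choices, not values recommended by the report.

```python
from collections import Counter

MISSING_TOKENS = {"NA", ""}  # string placeholders reported for this column

def clean_dialects(values, min_count=2, other_label="Other"):
    """Map placeholders to None and labels seen fewer than min_count times to other_label."""
    counts = Counter(v for v in values if v not in MISSING_TOKENS)
    cleaned = []
    for v in values:
        if v in MISSING_TOKENS:
            cleaned.append(None)
        elif counts[v] < min_count:
            cleaned.append(other_label)
        else:
            cleaned.append(v)
    return cleaned

sample = ["NA", "", "Standard Italian", "Standard Italian", "Kufa"]
cleaned_sample = clean_dialects(sample)
print(cleaned_sample)
```

Counting frequencies only over non-missing values keeps the dominant 'NA' label from influencing which dialects count as rare.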

Out[24]:

saturn.columns["SpecificDialect"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	546
top_value	NA
top_rate	0.7187
cardinality	546
entropy	2.969
entropy_ratio	0.3265

Fig 11.

Top values for SpecificDialect.

Top values for SpecificDialect (20 unique shown, of 546 total).
value	count	share
NA	75807	71.9%
	7692	7.3%
W2	120	0.1%
Lezgian (Güne)	96	0.1%
Santa	92	0.1%
Central Pakistan	83	0.1%
Babungo (Grassfields Bantu, Ring)	82	0.1%
Scottish Gaelic (Lewis)	82	0.1%
Tangari	81	0.1%
Kanga	76	0.1%
Kufa	75	0.1%
Skolt Saami (Suõʹnnʼjel)	75	0.1%
Standard Hindi (as spoken in Varanasi, Lucknow, Delhi etc.)	74	0.1%
Standard (eastern)	74	0.1%
Guovdageaidnu	74	0.1%
Nuosu (Black Yi)	74	0.1%
Northern Qiang (Yadu)	73	0.1%
Bangladeshi Standard (spoken in Dhaka and other urban aread of Bangladesh)	72	0.1%
Standard Italian	70	0.1%
Chechen (Ploskost)	70	0.1%

GlyphID text label

GlyphID stores each phoneme's Unicode code points as uppercase hexadecimal strings (e.g., '006D' = 'm', '0069' = 'i', '0061' = 'a'); it is a machine-readable encoding of the Phoneme column rather than a typography, OCR, or text-rendering field. Both columns have 3,142 unique values and their length histograms carry identical counts (67,114 rows at 4 characters here against 67,114 single-character phonemes, and so on), which indicates a one-to-one mapping. Lengths fall on 4, 9, 14 and so on up to 54, consistent with 4-digit hex codes joined by a one-character separator for multi-character phonemes. With 3,142 unique values across 105,484 rows the duplicate rate is 97.0%, so despite its name the column is a repeating label, not a row identifier. The example values decode to IPA symbols that coincide with Latin letters; they reflect common phonemes, not the script of an underlying text corpus.

Treatment: Map hex strings to Unicode characters for interpretability; because the column appears to duplicate Phoneme, keep only one of the two as a categorical feature.

anthropic:default · confidence high
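
Mapping the hex strings back to characters can be sketched as follows. The '+' separator is an assumption based on the length pattern in the stats (4, 9, 14, ...), and the two-code-point example is hypothetical. Only '006D', '0069' and '0061' are quoted in this report.

```python
def decode_glyph_id(glyph_id, separator="+"):
    """Turn a string of hexadecimal code points into the characters they encode."""
    return "".join(chr(int(code, 16)) for code in glyph_id.split(separator))

print(decode_glyph_id("006D"))       # single code point quoted in this report
print(decode_glyph_id("0074+0283"))  # hypothetical multi-code-point value
```

If the real separator differs, pass it through the `separator` argument; `int(code, 16)` raises `ValueError` on anything that is not valid hexadecimal, which surfaces malformed IDs early.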

Out[27]:

saturn.columns["GlyphID"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3,142
len_min	4
len_max	54
len_mean	6.503
len_median	4
len_p95	14
word_mean	1
word_median	1
n_empty	0
n_duplicates	102,342
duplicate_rate	0.9702
vocab_size	1,343
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.0% duplicate strings

Fig 12.

Character-length distribution for GlyphID.

Character-length distribution for GlyphID (mean: 6.503033635432862).
chars	count
4 – 5	67114
5 – 6	0
6 – 8	0
8 – 9	0
9 – 10	28726
10 – 12	0
12 – 13	0
13 – 14	0
14 – 15	6559
15 – 16	0
16 – 18	0
18 – 19	0
19 – 20	2225
20 – 22	0
22 – 23	0
23 – 24	0
24 – 25	401
25 – 26	0
26 – 28	0
28 – 29	0
29 – 30	267
30 – 32	0
32 – 33	0
33 – 34	0
34 – 35	104
35 – 36	0
36 – 38	0
38 – 39	0
39 – 40	5
40 – 42	0
42 – 43	0
43 – 44	0
44 – 45	70
45 – 46	0
46 – 48	0
48 – 49	0
49 – 50	1
50 – 52	0
52 – 53	0
53 – 54	12

Phoneme text label

This column contains phoneme symbols, with 3,142 unique phoneme strings across 105,484 rows. About 64% of values are a single character (67,114 rows; len_median 1, len_mean 1.5) and the rest are multi-character strings of up to 11 characters, consistent with IPA notation in which base symbols carry diacritics or combine into complex segments. The duplicate rate is 97.0% (102,342 duplicates), which is expected because the same phonemes recur across many language inventories.

Treatment: Encode as categorical (label or one-hot) for modelling; 3,142 unique values is manageable but consider grouping by manner/place of articulation if dimensionality is a concern.

anthropic:default · confidence high

Out[30]:

saturn.columns["Phoneme"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3,142
len_min	1
len_max	11
len_mean	1.501
len_median	1
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	102,342
duplicate_rate	0.9702
vocab_size	1,339
readability_flesch_mean	114.4
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.001754
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.0% duplicate strings

Fig 13.

Character-length distribution for Phoneme.

Character-length distribution for Phoneme (mean: 1.5006067270865724).
chars	count
1	67114
2	28726
3	6559
4	2225
5	401
6	267
7	104
8	5
9	70
10	1
11	12

Allophones text feature

This column contains allophone representations from a phonology or linguistics dataset, storing the phonetic variants of phonemes (e.g., 'm', 'j', 'w', 's') as very short strings with a mean length of ~2 characters. Strikingly, 53,580 of 105,484 rows (roughly 50.8%) carry the sentinel value 'NA', indicating no allophone recorded, which dominates the duplicate rate of 93.5%. The 6,892 unique values across a vocabulary of only 1,263 words suggest a modest set of phonetic symbols combined in small clusters, consistent with IPA or similar notation.

Treatment: Treat 'NA' as missing; encode remaining values as categorical or tokenize individual phoneme symbols before modelling.

anthropic:default · confidence high

Out[33]:

saturn.columns["Allophones"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6,892
len_min	1
len_max	37
len_mean	2.083
len_median	2
len_p95	4
word_mean	1.129
word_median	1
n_empty	0
n_duplicates	98,592
duplicate_rate	0.9347
vocab_size	1,263
readability_flesch_mean	116.2
emoji_rate	0
url_rate	0
one_word_rate	0.9131
allcaps_rate	0.00291
boilerplate_rate	0
alert: one_word	91.3% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	93.5% duplicate strings

Fig 14.

Character-length distribution for Allophones.

Character-length distribution for Allophones (mean: 2.0834154942929732).
chars	count
1 – 2	26762
2 – 3	66019
3 – 4	4934
4 – 5	3125
5 – 6	1246
6 – 6	1181
6 – 7	789
7 – 8	409
8 – 9	334
9 – 10	0
10 – 11	226
11 – 12	160
12 – 13	69
13 – 14	65
14 – 14	41
14 – 15	36
15 – 16	12
16 – 17	16
17 – 18	12
18 – 19	0
19 – 20	14
20 – 21	9
21 – 22	8
22 – 23	0
23 – 24	3
24 – 24	3
24 – 25	3
25 – 26	3
26 – 27	2
27 – 28	0
28 – 29	1
29 – 30	0
30 – 31	0
31 – 32	0
32 – 32	0
32 – 33	0
33 – 34	1
34 – 35	0
35 – 36	0
36 – 37	1

Marginal categorical feature

This column is a boolean flag indicating whether a record is classified as 'marginal', with three distinct string values: FALSE, NA, and TRUE. The dominant value is FALSE (78.9% of 105,484 rows), while TRUE is strikingly rare at only 1,347 records (≈1.3%). The presence of 'NA' as a literal string value in 20,874 rows (≈19.8%) is noteworthy — these are not system nulls (null_rate is 0.0) but encoded string missings, which must be handled explicitly rather than via standard null imputation.

Treatment: Encode as ternary (FALSE=0, TRUE=1, NA=missing) after converting string 'NA' to actual nulls, then decide on imputation strategy given 19.8% string-missing rate.

anthropic:default · confidence high
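
Converting the string flag into a real three-state value can be sketched with a lookup table. Raising on unexpected strings is an illustrative design choice, so that new placeholder spellings are noticed rather than silently treated as missing.

```python
# The three string values reported for Marginal, mapped to True / False / missing.
MARGINAL_MAP = {"TRUE": True, "FALSE": False, "NA": None}

def parse_marginal(raw):
    """Convert the string flag to True, False, or None (missing)."""
    if raw not in MARGINAL_MAP:
        raise ValueError(f"unexpected Marginal value: {raw!r}")
    return MARGINAL_MAP[raw]

parsed = [parse_marginal(v) for v in ["FALSE", "NA", "TRUE"]]
print(parsed)
```

Keeping `None` separate from `False` preserves the distinction between "not marginal" and "not recorded" until an imputation strategy has been chosen.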

Out[36]:

saturn.columns["Marginal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	FALSE
top_rate	0.7893
cardinality	3
entropy	0.8122
entropy_ratio	0.5125

Fig 15.

Top values for Marginal.

Top values for Marginal (3 unique shown, of 3 total).
value	count	share
FALSE	83263	78.9%
NA	20874	19.8%
TRUE	1347	1.3%

SegmentClass categorical label

SegmentClass is a phonological category label classifying 105,484 linguistic segments into exactly three classes: consonant, vowel, and tone. Consonants dominate at 68.5% (72,282 occurrences), vowels account for 29.4% (31,052), and tones are a small minority at just 2.0% (2,150) — a distribution consistent with natural language phoneme inventories but with tones notably underrepresented, suggesting the dataset skews toward non-tonal languages or tonal markings are partially absent. Zero nulls and perfect coverage make this a clean, reliable label.

Treatment: One-hot encode or use as a stratification variable; monitor class imbalance for the 'tone' minority class (2,150 samples) in any classification task.

anthropic:default · confidence high

Out[39]:

saturn.columns["SegmentClass"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	consonant
top_rate	0.6852
cardinality	3
entropy	1.008
entropy_ratio	0.6357

Fig 16.

Top values for SegmentClass.

Top values for SegmentClass (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

Source categorical label

This column identifies the source database or linguistic inventory from which each phonological record was drawn, with 8 distinct source codes across 105,484 rows and no nulls. The top source 'ph' dominates at 34.4% (36,274 rows), likely referring to PHOIBLE or a similar phoneme database, followed by 'ea', 'upsid', 'er', 'saphon', 'aa', 'spa', and 'ra'. The high entropy ratio of 0.899 indicates a relatively even spread across sources despite 'ph' leading — no single source overwhelmingly controls the data. Analysts should be aware that cross-source comparisons may introduce systematic coding differences, as each source may apply distinct phonological conventions.

Treatment: One-hot encode or use as a stratification/grouping variable; check for source-specific biases before pooling.

anthropic:default · confidence high

Out[42]:

saturn.columns["Source"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	ph
top_rate	0.3439
cardinality	8
entropy	2.697
entropy_ratio	0.8991

Fig 17.

Top values for Source.

Top values for Source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%

tone categorical label

This column is the phonological tone feature with only two values: '+' (the segment is a tone) and '0' (not applicable), across 105,484 rows with no nulls. The distribution is severely imbalanced: '0' accounts for 97.96% of records (103,334) while '+' appears in just 2,150 rows (~2%), exactly the number of rows whose SegmentClass is 'tone'. The column therefore appears to restate SegmentClass rather than add information. The entropy of 0.144 bits (out of a maximum of 1.0 for a binary variable) confirms the near-total dominance of a single value.

Treatment: Check that '+' coincides with SegmentClass = 'tone'; if it does, drop the column as redundant. If it is used as a modelling target anyway, apply class weighting or resampling.

anthropic:default · confidence high
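
The entropy and entropy_ratio figures in the stats tables can be reproduced from the value counts. The sketch below assumes base-2 logarithms and a ratio normalised by log2 of the cardinality; both are inferred from the reported numbers, not from Saturn's source.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2 of the number of distinct values."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Counts copied from the tone and Source data tables in this report.
tone_counts = [103334, 2150]
source_counts = [36274, 16883, 13966, 9423, 9047, 8064, 7566, 4261]

print(round(entropy_bits(tone_counts), 3))
print(round(entropy_ratio(source_counts), 3))
```

For a two-valued column the maximum entropy is log2(2) = 1 bit, which is why entropy and entropy_ratio coincide in the tone stats.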

Out[45]:

saturn.columns["tone"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2
top_value	0
top_rate	0.9796
cardinality	2
entropy	0.1436
entropy_ratio	0.1436
alert: imbalance	top value is 98.0% of rows

Fig 18.

Top values for tone.

Top values for tone (2 unique shown, of 2 total).
value	count	share
0	103334	98.0%
+	2150	2.0%

stress categorical label

This column is the phonological stress feature, but it contains no '+' values at all: '-' marks segments without stress and '0' marks segments to which the feature does not apply. The distribution is severely imbalanced: '-' accounts for 97.96% of all 105,484 rows (103,334 occurrences) versus only 2,150 rows for '0'. That is the same count as the tone rows in SegmentClass, so '0' most likely marks tone segments rather than a positive stress class. The entropy ratio of 0.144 confirms near-minimum information content.

Treatment: Do not recode '0' as a positive class. Treat '0' as not applicable; with no '+' values the column carries no stress information and can probably be dropped after confirming that '0' coincides with SegmentClass = 'tone'.

anthropic:default · confidence medium

Out[48]:

saturn.columns["stress"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2
top_value	-
top_rate	0.9796
cardinality	2
entropy	0.1436
entropy_ratio	0.1436
alert: imbalance	top value is 98.0% of rows

Fig 19.

Top values for stress.

Top values for stress (2 unique shown, of 2 total).
value	count	share
-	103334	98.0%
0	2150	2.0%

syllabic categorical

Out[51]:

saturn.columns["syllabic"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	-
top_rate	0.6849
cardinality	8
entropy	1.042
entropy_ratio	0.3472

Fig 20.

Top values for syllabic.

Top values for syllabic (8 unique shown, of 8 total).
value	count	share
-	72248	68.5%
+	30692	29.1%
0	2150	2.0%
+,-	244	0.2%
-,+	124	0.1%
-,+,-	12	0.0%
-,+,+	12	0.0%
+,+,-	2	0.0%

short categorical feature

This column is the phonological short feature, with only 4 distinct values: '-', '0', '+', and the compound '-,+'. It is severely imbalanced: the dominant value '-' accounts for 97.8% of rows (103,125 of 105,484), '0' appears 2,150 times, '+' just 204 times, and the compound '-,+' only 5 times. The near-zero entropy ratio (0.082) confirms the column carries very little information, and any model trained on it will be overwhelmed by the '-' class.

Treatment: Treat as ordinal or one-hot encode, but flag severe class imbalance (97.8% '-'); consider oversampling minority classes or collapsing '+' and '-,+' before modelling.

anthropic:default · confidence high

Out[54]:

saturn.columns["short"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	4
top_value	-
top_rate	0.9776
cardinality	4
entropy	0.1645
entropy_ratio	0.08225
alert: imbalance	top value is 97.8% of rows

Fig 21.

Top values for short.

Top values for short (4 unique shown, of 4 total).
value	count	share
-	103125	97.8%
0	2150	2.0%
+	204	0.2%
-,+	5	0.0%

long categorical feature

This column is the phonological long feature (segment length), not a geographic longitude, with only 6 distinct values across 105,484 rows. The dominant value '-' accounts for 89.9% of records, meaning most segments are not long; '+' marks long segments (8.0%) and '0' covers the 2,150 rows to which the feature does not apply. The compound values '-,+', '+,-', and '-,-,+' appear in just 104 rows combined; they appear to pack one value per component of a complex segment and warrant inspection before modelling.

Treatment: Keep '-', '+', and '0' as categorical levels rather than mapping them to -1, +1, and 0, because '0' is a not-applicable marker and not a midpoint; flag the 104 compound rows ('-,+', '+,-', '-,-,+') or split them into per-component values.

anthropic:default · confidence medium

Out[57]:

saturn.columns["long"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	-
top_rate	0.8991
cardinality	6
entropy	0.5537
entropy_ratio	0.2142

Fig 22.

Top values for long.

Top values for long (6 unique shown, of 6 total).
value	count	share
-	94844	89.9%
+	8386	8.0%
0	2150	2.0%
-,+	63	0.1%
+,-	40	0.0%
-,-,+	1	0.0%

consonantal categorical feature

This column encodes a binary phonological feature — consonantal — marking whether a segment is consonantal (+) or not (-), a standard distinctive feature in linguistics datasets. The dominant value is '+' at 60.9% (64,257 rows), with '-' covering 37.0% (39,041 rows). The near-zero third category '0' (2,151 rows) suggests underspecified segments, while the composite values '+,-' (34) and '-,+' (1) are anomalous multi-value entries that likely reflect data quality issues or ambiguous annotations.

Treatment: Encode as ordinal or one-hot after splitting composite entries ('+,-', '-,+') into separate flags or flagging them for review.

anthropic:default · confidence high

Out[60]:

saturn.columns["consonantal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	5
top_value	+
top_rate	0.6092
cardinality	5
entropy	1.085
entropy_ratio	0.4672

Fig 23.

Top values for consonantal.

Top values for consonantal (5 unique shown, of 5 total).
value	count	share
+	64257	60.9%
-	39041	37.0%
0	2151	2.0%
+,-	34	0.0%
-,+	1	0.0%

sonorant categorical

Out[63]:

saturn.columns["sonorant"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	+
top_rate	0.5301
cardinality	8
entropy	1.245
entropy_ratio	0.4149

Fig 24.

Top values for sonorant.

Top values for sonorant (8 unique shown, of 8 total).
value	count	share
+	55920	53.0%
-	45322	43.0%
0	2150	2.0%
+,-	1948	1.8%
-,+	89	0.1%
+,-,-	29	0.0%
+,-,+	25	0.0%
+,-,+,-	1	0.0%

continuant categorical feature

This column encodes a phonological feature, whether a speech sound is a continuant (airflow continues through the vocal tract), using the '+', '-', and '0' (unspecified/not applicable) notation standard in distinctive feature theory. The dominant values are '+' (54.9%, 57,952 rows) and '-' (42.3%, 44,585 rows), which together account for ~97% of records. Six of the 9 unique values are composite strings like '-,+' or '-,-,+', covering 796 rows; these suggest that some rows encode a sequence of values for a multi-part segment rather than a single value, which should be handled explicitly before encoding. The entropy ratio of 0.37 reflects moderate concentration driven almost entirely by the binary +/- split.

Treatment: Treat '+'/'-'/'0' as a 3-class categorical; investigate and potentially split or flag the 796 composite multi-value rows before encoding.

anthropic:default · confidence high

Out[66]:

saturn.columns["continuant"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	9
top_value	+
top_rate	0.5494
cardinality	9
entropy	1.172
entropy_ratio	0.3696

Fig 25.

Top values for continuant.

Top values for continuant (9 unique shown, of 9 total).
value	count	share
+	57952	54.9%
-	44585	42.3%
0	2151	2.0%
-,+	728	0.7%
-,-,+	50	0.0%
+,-	9	0.0%
0,-,+	4	0.0%
-,+,+	4	0.0%
0,0,-,+	1	0.0%

delayedRelease categorical feature

This column encodes the phonological feature delayed release for 105,484 records using a small set of symbolic tokens. In distinctive feature theory the feature classically distinguishes affricates ('+') from plain stops ('-'); here '-' covers 26.0% of rows and '+' 18.5%, while '0' (55.0%) most plausibly marks segments to which the feature does not apply. The compound values ('-,+', '0,-,+', '+,-', '0,0,-,+') show that the column is sometimes multi-valued, packed as comma-separated strings, which is structurally inconsistent with a simple categorical and implies sequence-like semantics. No nulls are present, and the entropy ratio of 0.52 reflects a moderately skewed but non-trivial distribution dominated by the '0' class.

Treatment: Split compound values on ',' into multi-hot binary columns ('has_0', 'has_minus', 'has_plus') before modelling.

anthropic:default · confidence high
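
The multi-hot split described in the treatment could look like the following. It is a sketch only: the function name `multi_hot` is hypothetical, and the flag names follow the ones given in the treatment line.

```python
def multi_hot(value: str) -> dict:
    """Split a cell on ',' and report which of the three symbols occur in it."""
    tokens = set(value.split(","))
    return {
        "has_0": int("0" in tokens),
        "has_minus": int("-" in tokens),
        "has_plus": int("+" in tokens),
    }


# Every value observed in the delayedRelease column.
observed = ["0", "-", "+", "-,+", "0,-,+", "+,-", "0,0,-,+"]
for value in observed:
    print(value, multi_hot(value))
```

A multi-hot encoding discards order and repetition, so '-,+' and '+,-' become identical; if the order of values within a cell matters, a positional encoding is needed instead.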

Out[69]:

saturn.columns["delayedRelease"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	0
top_rate	0.5502
cardinality	7
entropy	1.471
entropy_ratio	0.5238

Fig 26.

Top values for delayedRelease.

Top values for delayedRelease (7 unique shown, of 7 total).
value	count	share
0	58035	55.0%
-	27384	26.0%
+	19533	18.5%
-,+	492	0.5%
0,-,+	33	0.0%
+,-	6	0.0%
0,0,-,+	1	0.0%

approximant categorical feature

This column captures whether a phoneme is classified as an approximant, using a compact symbolic encoding where '+' marks approximants and '-' marks non-approximants. The dominant value is '-' at 55.9% of rows (58,966), with '+' accounting for 42.0% (44,266) and '0' for 2.0% (2,150), giving a near-binary distribution with moderate entropy (1.12 bits, entropy ratio 0.43). The three compound values ('-,+', '-,-,+', '+,-') total just 102 rows, suggesting a small minority of segments carry multi-valued approximant classifications, possibly contour segments or transcription artifacts.

Treatment: One-hot encode the three atomic values ('-', '+', '0'); bin the three compound values into an 'ambiguous' category or flag and investigate them as potential data quality issues before modelling.

anthropic:default · confidence high

Out[72]:

saturn.columns["approximant"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	-
top_rate	0.559
cardinality	6
entropy	1.12
entropy_ratio	0.4333

Fig 27.

Top values for approximant.

Top values for approximant (6 unique shown, of 6 total).
value	count	share
-	58966	55.9%
+	44266	42.0%
0	2150	2.0%
-,+	71	0.1%
-,-,+	25	0.0%
+,-	6	0.0%

tap categorical feature

This column encodes the phonological feature tap, marking whether a segment is articulated as a tap or flap ('+') or not ('-'), with '0' most plausibly used where the feature does not apply. The dominant value '-' accounts for 96.7% of all 105,484 rows, producing severe class imbalance (entropy ratio of only 0.10); '+' appears in just 1,218 rows (1.2%), so taps are rare in the corpus. The compound entries '-,+' (25 rows) and '-,-,+' (15 rows) show that a cell can hold an ordered sequence of feature values rather than a single state, plausibly for contour or complex segments. No nulls are present.

Treatment: One-hot encode single-value categories; handle the 40 compound rows ('-,+', '-,-,+') separately, either by parsing them as sequences or with a dedicated binary flag; oversample or weight minority classes before modelling.

anthropic:default · confidence medium

Out[75]:

saturn.columns["tap"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	5
top_value	-
top_rate	0.9672
cardinality	5
entropy	0.2421
entropy_ratio	0.1043
alert: imbalance	top value is 96.7% of rows

Fig 28.

Top values for tap.

Top values for tap (5 unique shown, of 5 total).
value	count	share
-	102023	96.7%
0	2203	2.1%
+	1218	1.2%
-,+	25	0.0%
-,-,+	15	0.0%

trill categorical feature

This column encodes a phonological feature, the presence or absence of a trill articulation, using a small symbol vocabulary of 6 distinct values. The dominant value is "-" (absent/negative), which accounts for 96.15% of 105,484 rows, producing an extremely low entropy of 0.276 bits and triggering an imbalance alert. The minority values ("0", "+", and compound sequences like "-,+") together cover 4,057 rows (3.8%), of which only 1,819 are "+", so trill is a rare feature in this dataset. The compound values ("-,+", "-,-,+", "+,-"), with counts of 26, 8, and 2 respectively, hint at multi-segment or allophonic annotation but are statistically negligible.

Treatment: One-hot or ordinal encode with caution; severe class imbalance (96.15% negative class) means most models will need oversampling, class weighting, or collapsing rare compound categories before use.

anthropic:default · confidence high
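
Collapsing the rare compound categories and then deriving class weights might be sketched as follows. The helper names and the 'compound' label are hypothetical, and the weighting rule, total / (number of classes × class count), is one common inverse-frequency heuristic rather than anything prescribed by this report.

```python
# Counts copied from the "Top values for trill" table.
trill_counts = {"-": 101427, "0": 2202, "+": 1819, "-,+": 26, "-,-,+": 8, "+,-": 2}


def collapse_compounds(counts: dict, label: str = "compound") -> dict:
    """Merge every comma-separated value into a single bucket."""
    collapsed = {}
    for value, n in counts.items():
        key = label if "," in value else value
        collapsed[key] = collapsed.get(key, 0) + n
    return collapsed


def balanced_weights(counts: dict) -> dict:
    """Inverse-frequency weights: total / (number of classes * class count)."""
    total = sum(counts.values())
    k = len(counts)
    return {value: total / (k * n) for value, n in counts.items()}


collapsed = collapse_compounds(trill_counts)
weights = balanced_weights(collapsed)
for value, w in weights.items():
    print(f"{value!r}: n={collapsed[value]}, weight={w:.2f}")
```

With so few compound rows, their weight becomes very large; capping weights or dropping that bucket may be more stable than weighting it in full.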

Out[78]:

saturn.columns["trill"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	-
top_rate	0.9615
cardinality	6
entropy	0.2762
entropy_ratio	0.1069
alert: imbalance	top value is 96.2% of rows

Fig 29.

Top values for trill.

Top values for trill (6 unique shown, of 6 total).
value	count	share
-	101427	96.2%
0	2202	2.1%
+	1819	1.7%
-,+	26	0.0%
-,-,+	8	0.0%
+,-	2	0.0%

nasal categorical

Out[81]:

saturn.columns["nasal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	-
top_rate	0.8084
cardinality	8
entropy	0.897
entropy_ratio	0.299

Fig 30.

Top values for nasal.

Top values for nasal (8 unique shown, of 8 total).
value	count	share
-	85269	80.8%
+	15941	15.1%
0	2150	2.0%
+,-	1973	1.9%
-,+	95	0.1%
+,-,-	54	0.1%
+,-,+,-	1	0.0%
-,+,-	1	0.0%

lateral categorical feature

This column encodes the phonological feature lateral, marking whether a segment is produced with airflow around the sides of the tongue ('+'), as in l-like sounds, or not ('-'), with '0' most plausibly used where the feature does not apply. The dominant value '-' accounts for 93.8% of all 105,484 rows, producing a very low entropy ratio of 0.134; '+' covers 4.0% (4,211 rows) and '0' 2.0% (2,150 rows). Five further values are comma-delimited sequences (e.g., '-,+', '-,-,+') totalling 155 rows, suggesting contour or multi-part segments encoded as strings rather than separate fields. No nulls are present.

Treatment: Split comma-delimited compound values into multi-hot binary flags for '-', '+', and '0' presence before modelling; expect severe class imbalance.

anthropic:default · confidence high

Out[84]:

saturn.columns["lateral"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	-
top_rate	0.9382
cardinality	8
entropy	0.4012
entropy_ratio	0.1337

Fig 31.

Top values for lateral.

Top values for lateral (8 unique shown, of 8 total).
value	count	share
-	98968	93.8%
+	4211	4.0%
0	2150	2.0%
-,+	135	0.1%
+,-	12	0.0%
-,-,+	4	0.0%
-,+,-	3	0.0%
0,-,+	1	0.0%

labial categorical feature

This column captures labial articulation features, encoding presence ('+'), absence ('-'), or non-applicability ('0') of a labial feature per segment (or per segment sequence). The dominant value is '-' (absent) at 68.2% of 105,484 rows, with '+' (present) at 26.8%, so most segments are non-labial. About 3.0% of values (3,122 rows across 12 distinct strings) are compound strings like '-,+', '+,-', '-,-,+', indicating multi-segment bundles packed into a single cell rather than a flat atomic feature, which creates a parsing challenge and implies the column is not fully normalized.

Treatment: Split compound values on ',' into separate per-segment records or encode as ordered multi-label; then one-hot or ordinal encode the atomic {'-', '+', '0'} values for modelling.

anthropic:default · confidence high
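
Splitting compound cells into per-position records could be sketched like this. The row identifiers and the `explode` helper are hypothetical; the three example values are ones that occur in the labial column.

```python
def explode(row_id: int, value: str) -> list:
    """Turn one packed cell into (row_id, position, symbol) records."""
    return [(row_id, pos, symbol) for pos, symbol in enumerate(value.split(","))]


# Hypothetical rows: (row identifier, labial value as stored in the CSV).
rows = [(1, "-"), (2, "-,+"), (3, "0,+,-,-")]

records = [rec for row_id, value in rows for rec in explode(row_id, value)]
for rec in records:
    print(rec)
```

Keeping the position index preserves the order of values within a cell, which a multi-hot encoding would lose.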

Out[87]:

saturn.columns["labial"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	15
top_value	-
top_rate	0.6822
cardinality	15
entropy	1.182
entropy_ratio	0.3025

Fig 32.

Top values for labial.

Top values for labial (15 unique shown, of 15 total).
value	count	share
-	71961	68.2%
+	28241	26.8%
-,+	2414	2.3%
0	2160	2.0%
+,-	531	0.5%
-,-,+	121	0.1%
+,-,-	21	0.0%
0,+,-	8	0.0%
-,+,-	6	0.0%
0,-,+	5	0.0%
-,+,+	5	0.0%
+,+,-	5	0.0%
+,-,+	4	0.0%
-,-,+,+	1	0.0%
0,+,-,-	1	0.0%

round categorical

Out[90]:

saturn.columns["round"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	0
top_rate	0.703
cardinality	8
entropy	1.194
entropy_ratio	0.398

Fig 33.

Top values for round.

Top values for round (8 unique shown, of 8 total).
value	count	share
0	74155	70.3%
+	16956	16.1%
-	14082	13.3%
-,+	269	0.3%
-,-,+	17	0.0%
-,0,+	3	0.0%
0,-,+	1	0.0%
+,-	1	0.0%

labiodental categorical feature

This column encodes labiodental phonetic feature annotations, where each row represents a speech sound or segment. The dominant value '0' (74,124 rows, 70.3%) most plausibly marks segments to which the feature does not apply, while '+' and '-' mark positive and negative feature values respectively, the standard binary distinctive-feature notation. Notably, multi-valued strings like '+,-', '-,+', and '+,+,-' appear (60 rows total), suggesting a small number of segments carry sequential or composite annotations, which may reflect contour segments or data entry inconsistency. The entropy ratio of 0.39 confirms moderate imbalance with '0' dominating.

Treatment: One-hot encode '+', '-', '0'; flag or inspect the 60 multi-valued rows ('+,-', '-,+', '+,+,-') for normalization before modelling.

anthropic:default · confidence high
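
One way to one-hot encode the atomic values while flagging everything else is sketched below. The output column names (`is_plus`, `is_minus`, `is_zero`, `needs_review`) are hypothetical; the counts are copied from the column's top-values table.

```python
COLUMN_FOR = {"+": "is_plus", "-": "is_minus", "0": "is_zero"}


def one_hot_with_flag(value: str) -> dict:
    """One-hot encode an atomic value; set needs_review for anything else."""
    encoded = {column: 0 for column in COLUMN_FOR.values()}
    encoded["needs_review"] = 0
    if value in COLUMN_FOR:
        encoded[COLUMN_FOR[value]] = 1
    else:
        encoded["needs_review"] = 1
    return encoded


# Counts copied from the "Top values for labiodental" table.
labiodental_counts = {"0": 74124, "-": 28726, "+": 2574, "+,-": 56, "-,+": 3, "+,+,-": 1}
flagged = sum(
    n for v, n in labiodental_counts.items() if one_hot_with_flag(v)["needs_review"]
)
print(flagged)
```

Flagged rows end up with all three one-hot columns at zero, so they stay distinguishable from every atomic class until someone decides how to normalise them.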

Out[93]:

saturn.columns["labiodental"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	0
top_rate	0.7027
cardinality	6
entropy	1.006
entropy_ratio	0.3891

Fig 34.

Top values for labiodental.

Top values for labiodental (6 unique shown, of 6 total).
value	count	share
0	74124	70.3%
-	28726	27.2%
+	2574	2.4%
+,-	56	0.1%
-,+	3	0.0%
+,+,-	1	0.0%

coronal categorical feature

This column encodes the coronal articulation feature (sounds made with the tongue tip or blade), using the '+'/'-' notation common in distinctive feature theory. The dominant value is '-' (62.8% of 105,484 rows), with '+' covering another 35.0% and '0' 2.0%, so the two primary values account for ~98% of observations. The remaining values ('+,-', '-,+', '-,-,+', '+,-,+') suggest multi-segment or compound entries, which are rare (at most 87 rows each, 135 in total) but worth flagging as potential encoding inconsistencies or multi-value cells that may need splitting.

Treatment: Encode primary values '+'/'-'/'0' as ordinal or one-hot features; isolate and inspect the 135 multi-value rows for parsing or splitting before modelling.

anthropic:default · confidence high

Out[96]:

saturn.columns["coronal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	-
top_rate	0.6279
cardinality	7
entropy	1.08
entropy_ratio	0.3848

Fig 35.

Top values for coronal.

Top values for coronal (7 unique shown, of 7 total).
value	count	share
-	66234	62.8%
+	36955	35.0%
0	2160	2.0%
+,-	87	0.1%
-,+	41	0.0%
-,-,+	6	0.0%
+,-,+	1	0.0%

anterior categorical feature

This column encodes the phonological feature anterior as a signed categorical variable, with 6 distinct values across 105,484 rows and zero nulls. The dominant value '0' (64.8% of rows) most plausibly marks segments to which the feature does not apply; in feature theory anterior is normally specified only for coronal segments, which would explain the large '0' share. '+' (24.4%) and '-' (10.8%) carry the positive and negative feature values. A small number of rows (9, 5, and 3) contain the compound multi-value strings '-,+', '+,-', and '-,-,+', suggesting occasional contour annotations or data-entry concatenation that deviates from the single-symbol schema.

Treatment: Map '0'/'+'/'-' to one-hot or ordinal encoded features; isolate and investigate the 17 compound-value rows before modelling.

anthropic:default · confidence medium

Out[99]:

saturn.columns["anterior"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	0
top_rate	0.6482
cardinality	6
entropy	1.251
entropy_ratio	0.4839

Fig 36.

Top values for anterior.

Top values for anterior (6 unique shown, of 6 total).
value	count	share
0	68372	64.8%
+	25704	24.4%
-	11391	10.8%
-,+	9	0.0%
+,-	5	0.0%
-,-,+	3	0.0%

distributed categorical label

This column encodes the phonological feature distributed using a compact symbolic notation: '+' and '-' for the two feature values, '0' most plausibly for segments to which the feature does not apply, and comma-separated sequences for multi-valued cells. The dominant value '0' covers 66.0% of 105,484 rows, with '-' (21.1%) and '+' (12.5%) accounting for most of the remainder. Eight of the 11 distinct values are combinations of these three symbols (334 rows in total), suggesting some records capture a sequence of feature values rather than a single state. The entropy ratio of 0.37 confirms moderate but uneven information content, heavily concentrated in the single '0' class.

Treatment: Split on commas and encode as a one-hot or multi-hot feature; a numeric mapping such as {'-': -1, '0': 0, '+': 1} is possible but places '0' midway between the two feature values, which is misleading if '0' means not applicable. Treat compound values as sequences.

anthropic:default · confidence medium

Out[102]:

saturn.columns["distributed"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	11
top_value	0
top_rate	0.6602
cardinality	11
entropy	1.273
entropy_ratio	0.3681

Fig 37.

Top values for distributed.

Top values for distributed (11 unique shown, of 11 total).
value	count	share
0	69639	66.0%
-	22283	21.1%
+	13228	12.5%
-,+	296	0.3%
-,-,+	25	0.0%
+,-	5	0.0%
0,-,+	3	0.0%
+,-,+	2	0.0%
0,+,-	1	0.0%
+,+,-	1	0.0%
0,0,-,+	1	0.0%

strident categorical

Out[105]:

saturn.columns["strident"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	9
top_value	0
top_rate	0.6485
cardinality	9
entropy	1.287
entropy_ratio	0.406

Fig 38.

Top values for strident.

Top values for strident (9 unique shown, of 9 total).
value	count	share
0	68410	64.9%
-	25410	24.1%
+	11039	10.5%
-,+	585	0.6%
-,-,+	26	0.0%
+,-	7	0.0%
-,+,-	3	0.0%
0,-,+	3	0.0%
0,0,-,+	1	0.0%

dorsal categorical feature

This column encodes the phonological feature dorsal, marking whether a segment is articulated with the tongue body ('+') or not ('-'), with '0' most plausibly used where the feature does not apply; some cells hold comma-separated sequences of these symbols. The dominant values are '+' (51.7%, n=54,535) and '-' (44.6%, n=47,052), together accounting for ~96.3% of rows, with '0' at 2.0% (n=2,160). Notably, 10 of the 13 categories are multi-value strings like '-,+' or '-,-,+' (1,737 rows in total), suggesting some records encode ordered sequences of values, as might be expected for contour or complex segments, rather than a single state; this structural inconsistency warrants normalization.

Treatment: Split multi-value entries on ',' into ordered sub-features or one-hot encode each position before modelling; treat '+', '-', '0' as a ternary categorical for single-value rows.

anthropic:default · confidence high

Out[108]:

saturn.columns["dorsal"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	13
top_value	+
top_rate	0.517
cardinality	13
entropy	1.235
entropy_ratio	0.3338

Fig 39.

Top values for dorsal.

Top values for dorsal (13 unique shown, of 13 total).
value	count	share
+	54535	51.7%
-	47052	44.6%
0	2160	2.0%
-,+	1530	1.5%
+,-	144	0.1%
-,-,+	44	0.0%
0,-,+	6	0.0%
+,-,+	5	0.0%
-,+,+	4	0.0%
+,+,-,-	1	0.0%
-,+,-	1	0.0%
+,+,-	1	0.0%
0,0,-,+	1	0.0%

high categorical feature

This column encodes the phonological feature high, which marks a raised tongue body (as in high vowels), using the tokens '+', '-', and '0', including comma-separated sequences like '-,+' or '+,-,+'. The dominant value is '0' at 46.7% (49,247 rows), most plausibly segments to which the feature does not apply, followed by '+' at 33.7% (35,559) and '-' at 18.2% (19,156), so among specified segments '+' outnumbers '-' by roughly 1.9:1. Eight compound values account for 1,522 rows; some appear only once or twice, so these are rare edge cases (possibly diphthongs or other contour segments) that may need consolidation or separate treatment.

Treatment: Encode the top 3 values ('0', '+', '-') as one-hot or ordinal features; collapse the rare compound values (at most 845 rows each) into an 'other' category or parse them into sequences.

anthropic:default · confidence medium

Out[111]:

saturn.columns["high"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	11
top_value	0
top_rate	0.4669
cardinality	11
entropy	1.594
entropy_ratio	0.4609

Fig 40.

Top values for high.

Top values for high (11 unique shown, of 11 total).
value	count	share
0	49247	46.7%
+	35559	33.7%
-	19156	18.2%
-,+	845	0.8%
+,-	627	0.6%
+,-,+	38	0.0%
+,+,-	6	0.0%
-,+,+	2	0.0%
-,-,+	2	0.0%
+,-,0	1	0.0%
-,+,-	1	0.0%

low categorical

Out[114]:

saturn.columns["low"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	-
top_rate	0.4733
cardinality	8
entropy	1.305
entropy_ratio	0.4351

Fig 41.

Top values for low.

Top values for low (8 unique shown, of 8 total).
value	count	share
-	49930	47.3%
0	49244	46.7%
+	5598	5.3%
+,-	417	0.4%
-,+	270	0.3%
-,+,-	21	0.0%
-,-,+	3	0.0%
+,-,-	1	0.0%

front categorical feature

This column encodes the phonological feature front, a tongue-body position feature, using '0', '-', and '+' as atomic tokens that can be combined into sequences (e.g., '-,+', '+,-,-'). The dominant value '0' accounts for 46.75% of rows, with '-' (34,225) and '+' (20,683) making up most of the remainder, so '-' outnumbers '+' by ~1.65:1. The ten compound values (1,260 rows in total) store a sequence of feature values as a single string, which requires parsing before modelling.

Treatment: Parse composite sequences (e.g., '-,+') into ordered lists of values or per-symbol counts; then one-hot or embed the atomic states before modelling.

anthropic:default · confidence medium

Out[117]:

saturn.columns["front"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	13
top_value	0
top_rate	0.4675
cardinality	13
entropy	1.592
entropy_ratio	0.4302

Fig 42.

Top values for front.

Top values for front (13 unique shown, of 13 total).
value	count	share
0	49316	46.8%
-	34225	32.4%
+	20683	19.6%
-,+	838	0.8%
+,-	359	0.3%
-,-,+	24	0.0%
+,-,-	14	0.0%
-,+,+	10	0.0%
+,-,+	6	0.0%
-,0,+	3	0.0%
+,+,-	2	0.0%
0,-,+	2	0.0%
-,+,-	2	0.0%

back categorical feature

This column encodes the phonological feature back, a tongue-body position feature, using a compact notation of '+', '-', and '0' tokens. The dominant value is '0' at 46.7% (49,270 rows), most plausibly segments to which the feature does not apply, followed by '-' at 37.7% (39,749) and '+' at 14.7% (15,547), so '-' outnumbers '+' by about 2.6:1. Compound sequences like '+,-' (511) and '-,+' (367) exist but are rare, and a handful of three-value sequences appear at the tail (918 compound rows in total), so the column has variable-length values that could cause parsing issues if treated as a simple label.

Treatment: Parse composite multi-token values (e.g., '+,-') into ordered lists of values before modelling; consider one-hot or frequency encoding for the simple single-token majority.

anthropic:default · confidence medium

Out[120]:

saturn.columns["back"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	12
top_value	0
top_rate	0.4671
cardinality	12
entropy	1.521
entropy_ratio	0.4244

Fig 43.

Top values for back.

Top values for back (12 unique shown, of 12 total).
value	count	share
0	49270	46.7%
-	39749	37.7%
+	15547	14.7%
+,-	511	0.5%
-,+	367	0.3%
+,-,-	19	0.0%
-,-,+	8	0.0%
-,+,+	5	0.0%
-,+,-	5	0.0%
0,+,-	1	0.0%
+,-,+	1	0.0%
+,+,-	1	0.0%

tense categorical

Out[123]:

saturn.columns["tense"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	0
top_rate	0.7132
cardinality	8
entropy	1.114
entropy_ratio	0.3712

Fig 44.

Top values for tense.

Top values for tense (8 unique shown, of 8 total).
value	count	share
0	75230	71.3%
+	23411	22.2%
-	6386	6.1%
+,-	268	0.3%
-,+	179	0.2%
+,-,+	6	0.0%
+,-,-	3	0.0%
+,+,-	1	0.0%

retractedTongueRoot categorical feature

This column encodes the phonetic feature 'retracted tongue root' (RTR), a linguistic annotation used in phonological databases to mark vowel or consonant articulation. The dominant value is '-' (absence of RTR) at 97.44% of 105,484 rows, making the feature extremely rare in this dataset: of the 2,696 remaining rows, 2,235 are '0', and only 461 (about 0.4%) contain a '+' at all. The compound values ('-,+', '-,-,+', etc.) suggest per-segment annotation strings rather than single-token labels, indicating a sequence or multi-segment scope.

Treatment: Flag severe class imbalance (97.44% negative); use oversampling or class-weighted models if predicting RTR; consider splitting compound strings into per-segment binary indicators before modelling.

anthropic:default · confidence high

Out[126]:

saturn.columns["retractedTongueRoot"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	-
top_rate	0.9744
cardinality	7
entropy	0.1935
entropy_ratio	0.06892
alert: imbalance	top value is 97.4% of rows

Fig 45.

Top values for retractedTongueRoot.

Top values for retractedTongueRoot (7 unique shown, of 7 total).
value	count	share
-	102788	97.4%
0	2235	2.1%
-,+	251	0.2%
+	199	0.2%
-,-,+	9	0.0%
-,+,-	1	0.0%
+,-	1	0.0%

advancedTongueRoot categorical feature

This column encodes the Advanced Tongue Root (ATR) phonological feature, a binary articulatory distinction marked as '+' (present), '-' (absent), or '0' (neutral/not applicable), typical of linguistic phoneme inventories. The distribution is severely imbalanced: '-' dominates at 97.87% (103,238 of 105,484 rows), '0' appears in only 2,235 rows, and '+' is nearly absent with just 11 occurrences. The near-zero entropy ratio (0.094) confirms that '+ATR' is a vanishingly rare feature in this dataset, which would make any model predicting '+' extremely difficult to train without resampling.

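Treatment: Treat as ordinal or one-hot encoded categorical; oversample or reweight the '+' class (n=11) before modelling, or do not use the column as a prediction target if 11 positive rows are too few to separate. Merging '+' with '0' would conflate a present feature with what is most plausibly a not-applicable marker, so it is best avoided.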

anthropic:default · confidence high
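
Random oversampling of a tiny minority class can be sketched with the standard library alone. The label vector below is toy data, not drawn from the dataset, and the `oversample` helper with its `target` and `seed` parameters is hypothetical.

```python
import random


def oversample(labels: list, minority: str, target: int, seed: int = 0) -> list:
    """Return row indices, adding minority rows drawn with replacement
    until the minority class reaches `target` rows."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    if not minority_idx:
        raise ValueError(f"no rows labelled {minority!r}")
    extra = max(0, target - len(minority_idx))
    return list(range(len(labels))) + rng.choices(minority_idx, k=extra)


# Toy label vector, not real data: 200 '-' rows, 5 '0' rows, 2 '+' rows.
labels = ["-"] * 200 + ["0"] * 5 + ["+"] * 2
indices = oversample(labels, minority="+", target=20)
resampled = [labels[i] for i in indices]
print(resampled.count("+"), len(resampled))
```

With only 11 source rows in the real column, the duplicates add no new information, so resampling is best applied to the training split only, after any train/test split has been made.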

Out[129]:

saturn.columns["advancedTongueRoot"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	-
top_rate	0.9787
cardinality	3
entropy	0.1496
entropy_ratio	0.09438
alert: imbalance	top value is 97.9% of rows

Fig 46.

Top values for advancedTongueRoot.

Top values for advancedTongueRoot (3 unique shown, of 3 total).
value	count	share
-	103238	97.9%
0	2235	2.1%
+	11	0.0%

periodicGlottalSource categorical

Out[132]:

saturn.columns["periodicGlottalSource"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	+
top_rate	0.6797
cardinality	7
entropy	1.051
entropy_ratio	0.3745

Fig 47.

Top values for periodicGlottalSource.

Top values for periodicGlottalSource (7 unique shown, of 7 total).
value	count	share
+	71694	68.0%
-	31179	29.6%
0	2139	2.0%
+,-	371	0.4%
-,+	87	0.1%
+,-,-	8	0.0%
+,-,+	6	0.0%

epilaryngealSource categorical feature

This column encodes whether an epilaryngeal phonation source is present, using a ternary scheme: absent ('-'), present ('+'), or a neutral/not-applicable state ('0'). The distribution is severely imbalanced: the '-' (absent) class dominates at 97.9% of 105,484 rows, while '+' (present) appears in only 31 records (~0.03%), making positive-class detection extremely challenging. The near-zero entropy (0.147 bits) confirms almost no informational variance in this column as-is.

Treatment: Flag severe class imbalance ('+' = 31 of 105,484); apply oversampling or class-weighted modelling, and consider binary encoding after collapsing '-' vs. non-'-'.

anthropic:default · confidence high

Out[135]:

saturn.columns["epilaryngealSource"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	-
top_rate	0.9793
cardinality	3
entropy	0.1474
entropy_ratio	0.09303
alert: imbalance	top value is 97.9% of rows

Fig 48.

Top values for epilaryngealSource.

Top values for epilaryngealSource (3 unique shown, of 3 total).
value	count	share
-	103303	97.9%
0	2150	2.0%
+	31	0.0%

spreadGlottis categorical

Out[138]:

saturn.columns["spreadGlottis"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	10
top_value	-
top_rate	0.9182
cardinality	10
entropy	0.4965
entropy_ratio	0.1495

Fig 49.

Top values for spreadGlottis.

Top values for spreadGlottis (10 unique shown, of 10 total).
value	count	share
-	96855	91.8%
+	6156	5.8%
0	2138	2.0%
-,+	206	0.2%
+,-	115	0.1%
-,-,+	5	0.0%
+,0,-	5	0.0%
+,-,-	2	0.0%
+,0,-,-	1	0.0%
+,-,+	1	0.0%

constrictedGlottis categorical feature

This column captures the phonological feature constricted glottis, encoded as presence/absence symbols across 105,484 records with no nulls and only 7 unique values. The dominant value is '-' (absent/negative), appearing in 94.5% of rows (99,727), while '+' (present) accounts for just 3,383 rows, so the positive value is rare. The compound values ('+,-', '-,+', '+,-,-', '-,-,+') suggest multi-segment sequences for a tiny fraction of records, but with counts of 141, 93, 1, and 1 respectively, these are near-negligible. The extremely low entropy (0.372 bits, entropy ratio 0.132) confirms the column is heavily imbalanced toward the negative class.

Treatment: Binarize into presence/absence flag; treat compound multi-segment values as positive; account for severe class imbalance (94.5% negative) during modelling.

anthropic:default · confidence high
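
The binarisation in the treatment, with any compound value that contains '+' counted as positive, could be sketched as follows. The helper name `has_plus` is hypothetical, and the counts are copied from the column's top-values table.

```python
# Counts copied from the "Top values for constrictedGlottis" table.
cg_counts = {
    "-": 99727, "+": 3383, "0": 2138,
    "+,-": 141, "-,+": 93, "+,-,-": 1, "-,-,+": 1,
}


def has_plus(value: str) -> int:
    """Return 1 if any comma-separated token is '+', else 0."""
    return int("+" in value.split(","))


positive_rows = sum(n for v, n in cg_counts.items() if has_plus(v))
total_rows = sum(cg_counts.values())
print(positive_rows, f"{positive_rows / total_rows:.1%}")
```

This flag folds '0' into the negative class; if '0' means the feature is not applicable, a separate indicator for those rows may be worth keeping.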

Out[141]:

saturn.columns["constrictedGlottis"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	7
top_value	-
top_rate	0.9454
cardinality	7
entropy	0.3717
entropy_ratio	0.1324

Fig 50.

Top values for constrictedGlottis.

Top values for constrictedGlottis (7 unique shown, of 7 total).
value	count	share
-	99727	94.5%
+	3383	3.2%
0	2138	2.0%
+,-	141	0.1%
-,+	93	0.1%
+,-,-	1	0.0%
-,-,+	1	0.0%

fortis categorical feature

This column is a three-valued categorical encoding the phonological feature fortis (a 'strong' consonant articulation), with values '-', '0', and '+'. The dominant value is '-' at 68.1% (71,867 rows), '0' accounts for 31.5% (33,202 rows), and '+' is strikingly rare at only 415 occurrences (~0.4%), creating a heavily imbalanced distribution. The '0' share is close to the combined vowel and tone share reported for SegmentClass (~29.5% + ~2%), which suggests '0' marks segments to which the feature does not apply. The scarcity of '+' means fortis is seldom marked in this dataset and calls for class-imbalance handling before modelling.

Treatment: Ordinal-encode as -1/0/1 or one-hot encode, and apply class-imbalance handling (e.g. oversampling '+') before modelling.

anthropic:default · confidence high
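
The -1/0/1 ordinal encoding can be sketched as a strict lookup. The `encode_ordinal` helper is hypothetical, and raising on unexpected values is a design choice made for this illustration.

```python
ORDINAL = {"-": -1, "0": 0, "+": 1}


def encode_ordinal(value: str) -> int:
    """Map a single feature symbol to -1/0/1; reject anything unexpected."""
    try:
        return ORDINAL[value]
    except KeyError:
        raise ValueError(f"unexpected feature value: {value!r}") from None


sample = ["-", "0", "+", "-"]
print([encode_ordinal(v) for v in sample])
```

Placing '0' between '-' and '+' implies an ordering that may not hold if '0' means not applicable; one-hot encoding avoids that assumption at the cost of two extra columns.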

Out[144]:

saturn.columns["fortis"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	-
top_rate	0.6813
cardinality	3
entropy	0.9335
entropy_ratio	0.589

Fig 51.

Top values for fortis.

Top values for fortis (3 unique shown, of 3 total).
value	count	share
-	71867	68.1%
0	33202	31.5%
+	415	0.4%

lenis categorical feature

This column represents a ternary phonological feature, 'lenis' (a 'weak' consonant articulation), with values '-' (not lenis), '0' (most plausibly not applicable), and '+' (lenis). The dominant value is '-' at 68.1% (71,866 rows), while the positive lenis marker '+' is strikingly rare at only 416 occurrences (~0.4%), creating a heavily imbalanced distribution. No nulls exist across all 105,484 rows, and entropy is moderate at 0.93 bits, well below the theoretical maximum of about 1.58 bits for three values, confirming the skew toward the negative class.

Treatment: Encode as ordinal or one-hot; be aware that the '+' class (416 samples) is severely underrepresented and will require oversampling or class-weight adjustment before modelling.

anthropic:default · confidence high

Out[147]:

saturn.columns["lenis"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	-
top_rate	0.6813
cardinality	3
entropy	0.9336
entropy_ratio	0.589

Fig 52.

Top values for lenis.

Top values for lenis (3 unique shown, of 3 total).
value	count	share
-	71866	68.1%
0	33202	31.5%
+	416	0.4%

raisedLarynxEjective categorical feature

This column encodes the presence or absence of the raised-larynx ejective phonological feature, using the same '+'/'-'/'0' feature-matrix notation as the other feature columns. The dominant value is '-' (absent), appearing in 96.4% of 105,484 rows, creating severe class imbalance with an entropy ratio of only 0.10. Minority values include '0' (2,150 rows), '+' (1,573 rows), and compound sequences like '-,+' and '+,-' (109 rows in total), suggesting some segments receive multi-valued or composite feature annotations. There are no nulls, so annotation coverage is complete, but the extreme skew means this feature will contribute little signal in most modelling contexts without deliberate resampling or grouping.

Treatment: Collapse compound values ('-,+', '+,-', '-,-,+') into a unified category and apply oversampling or class-weight adjustment before modelling given 96.4% imbalance.

anthropic:default · confidence high

Out[150]:

saturn.columns["raisedLarynxEjective"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	6
top_value	-
top_rate	0.9637
cardinality	6
entropy	0.2675
entropy_ratio	0.1035
alert: imbalance	top value is 96.4% of rows

Fig 53.

Top values for raisedLarynxEjective.

Top values for raisedLarynxEjective (6 unique shown, of 6 total).
value	count	share
-	101652	96.4%
0	2150	2.0%
+	1573	1.5%
-,+	85	0.1%
+,-	23	0.0%
-,-,+	1	0.0%

loweredLarynxImplosive categorical feature

This column encodes a phonological feature — specifically whether a sound has a lowered larynx implosive articulation — using a symbolic notation system ('+', '-', and combinations). The distribution is severely imbalanced: 97.3% of the 105,484 rows carry the default/absent value '-', with only 2,150 instances of '0', 716 of '+', and fewer than 10 combined entries. The near-zero entropy ratio (0.088) confirms this column carries almost no information for most records, which would make it a very weak standalone predictor.

Treatment: Flag the severe imbalance; consider binarising ('+' vs. all others) or dropping if the rare positive class is insufficient for modelling purposes.

anthropic:default · confidence high

Out[153]:

saturn.columns["loweredLarynxImplosive"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	5
top_value	-
top_rate	0.9727
cardinality	5
entropy	0.2034
entropy_ratio	0.08759
alert: imbalance	top value is 97.3% of rows

Fig 54.

Top values for loweredLarynxImplosive.

Top values for loweredLarynxImplosive (5 unique shown, of 5 total).
value	count	share
-	102609	97.3%
0	2150	2.0%
+	716	0.7%
-,+	7	0.0%
+,-	2	0.0%

click categorical label

This column encodes the phonological feature click, marking whether a segment is a click consonant ('+') or not ('-'), with '0' most plausibly used where the feature does not apply, plus the combination values '+,-' and '-,+'. The dominant value is '-' at 68.2% of 105,484 rows and '0' covers 31.5% (33,202 rows), while values containing '+' ('+', '+,-', '-,+') account for only 311 rows combined (~0.3%), so clicks are very rare in the corpus. The multi-value strings '+,-' (52 rows) and '-,+' (6 rows) imply that a sequence of feature values was stored in a single cell rather than normalized; this warrants investigation before modelling.

Treatment: One-hot encode after splitting or flagging the multi-value entries ('+,-', '-,+'); address class imbalance (~0.3% positive) before using the column as a classification target.

anthropic:default · confidence high

Out[156]:

saturn.columns["click"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	5
top_value	-
top_rate	0.6823
cardinality	5
entropy	0.9283
entropy_ratio	0.3998

Fig 55.

Top values for click.

Top values for click (5 unique shown, of 5 total).
value	count	share
-	71971	68.2%
0	33202	31.5%
+	253	0.2%
+,-	52	0.0%
-,+	6	0.0%

How to cite