data trove phoible phonetics database

source /home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv 105,484 rows 49 columns profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This dataset is PHOIBLE, a cross-linguistic phonological inventory database containing 105,484 phoneme-level records spanning roughly 2,177 languages and dialects (distinct Glottocodes); each row describes a single phoneme and its distinctive feature values. The first thing to examine is the breakdown by SegmentClass: consonants dominate (~68.5%), followed by vowels (~29.4%) and tones (~2.0%), which shapes how almost every other feature distributes. A second focus is the Source column, which shows that the data comes from eight contributing sources ('ph' alone accounts for 34.4%), so coverage and coding conventions are uneven across the corpus and could introduce systematic biases in any cross-linguistic comparison.

citing: SegmentClass.top_values · Source.top_values · LanguageName.n_unique · Glottocode.n_unique · Phoneme.top_values · row_count · column_count

Charts the summary said to look at first

SegmentClass · Look at the consonant-to-vowel-to-tone split — consonants make up roughly 68.5% of all records, which will skew any feature-level analysis.

Top values for SegmentClass (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

Source · Check how unevenly data is distributed across the eight source databases, with 'ph' contributing more than a third of all rows on its own.

Top values for Source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%

LanguageName · See which languages contribute the most phoneme records — Iron Ossetic leads with 444 entries, hinting at uneven language-level representation.

Character-length distribution for LanguageName (mean: 7.822).
chars	count
2 – 4	3915
4 – 6	29541
6 – 8	34409
8 – 10	14985
10 – 12	7358
12 – 14	4849
14 – 15	3897
15 – 17	2893
17 – 19	1292
19 – 21	993
21 – 23	198
23 – 25	224
25 – 27	218
27 – 29	39
29 – 31	93
31 – 33	130
33 – 35	64
35 – 37	23
37 – 39	37
39 – 40	57
40 – 42	0
42 – 44	52
44 – 46	0
46 – 48	23
48 – 50	0
50 – 52	0
52 – 54	20
54 – 56	0
56 – 58	40
58 – 60	0
60 – 62	87
62 – 64	0
64 – 66	0
66 – 67	0
67 – 69	0
69 – 71	0
71 – 73	0
73 – 75	0
75 – 77	0
77 – 79	47

nasal · Review the balance of the nasal feature values (+, -, 0) as a representative example of how distinctive features are coded across the full inventory.

Top values for nasal (8 unique shown, of 8 total).
value	count	share
-	85269	80.8%
+	15941	15.1%
0	2150	2.0%
+,-	1973	1.9%
-,+	95	0.1%
+,-,-	54	0.1%
+,-,+,-	1	0.0%
-,+,-	1	0.0%

Marginal · Note that about 1.3% of phonemes are flagged as marginal (borrowed or rare) while ~19.8% carry NA, worth checking before any typological counts.

Top values for Marginal (3 unique shown, of 3 total).
value	count	share
FALSE	83263	78.9%
NA	20874	19.8%
TRUE	1347	1.3%

Schema

49 columns

Per-column summary.
Column	Type	Nulls	Unique	Alerts
InventoryID	numeric	0.0%	3,020
Glottocode	text	0.0%	2,177	one_word short_text duplicates
ISO6393	text	0.0%	2,095	one_word short_text duplicates
LanguageName	text	0.0%	2,716	one_word allcaps short_text duplicates
SpecificDialect	categorical	0.0%	546
GlyphID	text	0.0%	3,142	one_word allcaps short_text duplicates
Phoneme	text	0.0%	3,142	one_word short_text duplicates
Allophones	text	0.0%	6,892	one_word short_text duplicates
Marginal	categorical	0.0%	3
SegmentClass	categorical	0.0%	3
Source	categorical	0.0%	8
tone	categorical	0.0%	2	imbalance
stress	categorical	0.0%	2	imbalance
syllabic	categorical	0.0%	8
short	categorical	0.0%	4	imbalance
long	categorical	0.0%	6
consonantal	categorical	0.0%	5
sonorant	categorical	0.0%	8
continuant	categorical	0.0%	9
delayedRelease	categorical	0.0%	7
approximant	categorical	0.0%	6
tap	categorical	0.0%	5	imbalance
trill	categorical	0.0%	6	imbalance
nasal	categorical	0.0%	8
lateral	categorical	0.0%	8
labial	categorical	0.0%	15
round	categorical	0.0%	8
labiodental	categorical	0.0%	6
coronal	categorical	0.0%	7
anterior	categorical	0.0%	6
distributed	categorical	0.0%	11
strident	categorical	0.0%	9
dorsal	categorical	0.0%	13
high	categorical	0.0%	11
low	categorical	0.0%	8
front	categorical	0.0%	13
back	categorical	0.0%	12
tense	categorical	0.0%	8
retractedTongueRoot	categorical	0.0%	7	imbalance
advancedTongueRoot	categorical	0.0%	3	imbalance
periodicGlottalSource	categorical	0.0%	7
epilaryngealSource	categorical	0.0%	3	imbalance
spreadGlottis	categorical	0.0%	10
constrictedGlottis	categorical	0.0%	7
fortis	categorical	0.0%	3
lenis	categorical	0.0%	3
raisedLarynxEjective	categorical	0.0%	6	imbalance
loweredLarynxImplosive	categorical	0.0%	5	imbalance
click	categorical	0.0%	5

InventoryID

numeric foreign_key

InventoryID is a numeric foreign key referencing an inventory dimension table, with exactly 3,020 distinct values spanning 1–3,020 across 105,484 rows, implying heavy repeated use of each ID (average ~35 rows per ID). The distribution is remarkably flat and symmetric (skew ≈ −0.002, kurtosis ≈ −1.15, zero outliers), consistent with a well-populated lookup identifier rather than a measured quantity. Treatment: Left-join on this ID to the inventory dimension table; do not use raw numeric value as a feature. high · anthropic:default
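The ~35 rows-per-ID figure can be reproduced by counting rows per InventoryID. The sketch below uses only the Python standard library and a tiny made-up sample in place of phoible.csv; the helper name inventory_sizes is hypothetical.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for phoible.csv (illustration only).
SAMPLE = """InventoryID,Phoneme
1,a
1,m
1,t
2,i
2,k
"""

def inventory_sizes(rows):
    """Count rows per InventoryID (hypothetical helper)."""
    return Counter(row["InventoryID"] for row in rows)

sizes = inventory_sizes(csv.DictReader(io.StringIO(SAMPLE)))
mean_size = sum(sizes.values()) / len(sizes)
print(dict(sizes), mean_size)

# The same arithmetic on the profile's totals: 105,484 rows over 3,020 IDs.
profile_mean = 105_484 / 3_020
print(round(profile_mean, 1))
```

On the real file, the per-ID counts would give the size of each inventory, which is usually a more useful quantity than the ID itself.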

n: 105,484
nulls: 0 (0.0%)
unique: 3,020
min: 1
max: 3,020
mean: 1479
median: 1,464
std: 843.1
q1: 769
q3: 2,237
iqr: 1,468
skew: -0.002397
kurtosis: -1.146
n_outliers: 0
outlier_rate: 0
zero_rate: 0

Glottocode

text foreign_key one_word short_text duplicates

This column contains Glottocodes, the standardized language identifiers used by the Glottolog database (e.g., 'kham1282', 'dutc1256'): four alphanumeric characters followed by four digits. Nearly every value is 8 characters long (mean 7.999, median 8), but len_min is 2, so a small number of rows hold something shorter than a valid code and should be inspected. With only 2,177 unique codes across 105,484 rows, the duplicate rate is 97.9% and each code recurs ~48 times on average, as expected when every phoneme of a language gets its own row. The top code 'kham1282' (Kham) appears 622 times, pointing to uneven language coverage in the dataset. Treatment: Left-join on this code against the Glottolog reference table to enrich with language family, geographic coordinates, and macro-area metadata, after checking the short values. high · anthropic:default
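A minimal sketch of two checks implied here: validating the Glottocode shape and recomputing the duplicate rate from the profile's totals. The sample list is made up, including the two invalid entries, and the helper names are hypothetical.

```python
import re

# Glottocode shape: four lowercase alphanumerics followed by four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}\d{4}")

def is_glottocode(value):
    """Return True when the whole string has the Glottocode shape."""
    return GLOTTOCODE_RE.fullmatch(value) is not None

def duplicate_rate(n_rows, n_unique):
    """Share of rows that repeat a value already seen elsewhere."""
    return (n_rows - n_unique) / n_rows

# Made-up sample; the last two entries are deliberately invalid.
sample = ["kham1282", "dutc1256", "NA", "kham128"]
invalid = [v for v in sample if not is_glottocode(v)]
print(invalid)

# Profile totals: 105,484 rows, 2,177 unique codes.
rate = duplicate_rate(105_484, 2_177)
print(round(rate, 4))
```

Running the validator over the real column would show which values account for the len_min of 2.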

n: 105,484
nulls: 0 (0.0%)
unique: 2,177
len_min: 2
len_max: 8
len_mean: 7.999
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 103,307
duplicate_rate: 0.9794
vocab_size: 2,168
readability_flesch_mean: 94.15
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

ISO6393

text label one_word short_text duplicates

This column contains ISO 639-3 three-letter language codes, a standardised identifier for natural languages. Every value is exactly 3 characters long (min, mean, max all equal 3) with zero nulls, confirming strict conformity to the format. The duplicate rate is 98.0%: 2,095 distinct codes repeat across 105,484 rows, which is expected for a language tag applied to every phoneme row. The most frequent code, 'mis', appears 828 times; in ISO 639-3 it is the special code for uncoded languages, so these rows belong to varieties without an ISO code of their own and should not be pooled as if they were one language. Treatment: Use as a categorical grouping key; flag or separate records where ISO6393 equals 'mis' before language-based analysis. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 2,095
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 103,389
duplicate_rate: 0.9801
vocab_size: 2,086
readability_flesch_mean: 119.5
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

LanguageName

text label one_word allcaps short_text duplicates

This column contains language names (e.g., 'Dutch', 'Chechen', 'Bengali', 'Iron Ossetic'), a repeating label with 2,716 unique values across 105,484 rows (duplicate rate 97.4%, 102,768 duplicates), not a free-text field. There are more names than Glottocodes (2,716 vs 2,177), so the same language can appear under more than one name and the column is not a reliable join key. Notably, 13.1% of values are all-caps, which points to inconsistent casing conventions across entries. The distribution is uneven: the top value 'Iron Ossetic' appears 444 times while many names appear rarely, giving a long tail. Treatment: Encode as a categorical (label or target-encode) after normalising the casing inconsistencies flagged by the 13.1% all-caps rate; prefer Glottocode as the grouping key. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 2,716
len_min: 2
len_max: 79
len_mean: 7.822
len_median: 7
len_p95: 16
word_mean: 1.201
word_median: 1
n_empty: 0
n_duplicates: 102,768
duplicate_rate: 0.9743
vocab_size: 2,670
readability_flesch_mean: 53.18
emoji_rate: 0
url_rate: 0
one_word_rate: 0.8433
allcaps_rate: 0.1314
boilerplate_rate: 0

SpecificDialect

categorical label

This column captures a specific dialect designation for each record in what appears to be a linguistic or language survey dataset. The dominant signal is extreme missingness-by-label: 71.9% of rows carry the value 'NA' and a further 7,692 rows (≈7.3%) are empty strings, meaning roughly 79% of records lack a meaningful dialect value. Despite 546 unique values, the effective entropy ratio is only 0.33, confirming that real dialect labels are thinly and unevenly spread across the remaining ~22,000 rows. Treatment: Treat 'NA' and empty-string as missing; consider collapsing rare dialects (below a frequency threshold) into an 'Other' bucket before encoding, and flag high missingness rate (~79%) before any modelling use. high · anthropic:default
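The treatment above can be sketched as follows, assuming a simple frequency threshold. The sample labels, the min_count default, and the helper name collapse_rare are made up for illustration.

```python
from collections import Counter

# String placeholders the profile reports for this column.
MISSING = {"NA", ""}

def collapse_rare(values, min_count=2, other="Other"):
    """Map placeholders to None and labels seen fewer than min_count times to `other`."""
    counts = Counter(v for v in values if v not in MISSING)
    cleaned = []
    for v in values:
        if v in MISSING:
            cleaned.append(None)
        elif counts[v] < min_count:
            cleaned.append(other)
        else:
            cleaned.append(v)
    return cleaned

# Made-up dialect labels, not taken from the dataset.
sample = ["NA", "", "Northern", "Northern", "Coastal", "NA"]
cleaned = collapse_rare(sample)
missing_rate = cleaned.count(None) / len(cleaned)
print(cleaned, missing_rate)
```

A suitable threshold depends on the downstream use; with 546 labels spread over roughly 22,000 rows, many labels will fall below almost any cutoff.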

n: 105,484
nulls: 0 (0.0%)
unique: 546
top_value: NA
top_rate: 0.7187
cardinality: 546
entropy: 2.969
entropy_ratio: 0.3265

GlyphID

text label one_word allcaps short_text duplicates

GlyphID holds the Unicode code points of the symbol in the Phoneme column, stored as uppercase hexadecimal strings (e.g., '006D' = 'm', '0069' = 'i', '0061' = 'a'). It has the same 3,142 unique values and the same 97.0% duplicate rate (102,342 duplicates) as Phoneme, which is consistent with a one-to-one mapping between the two columns, so despite its name it is a repeating label and not a row identifier. All values are uppercase (allcaps_rate 1.0) and single-token (one_word_rate 1.0). The median length is 4 characters, but the mean is 6.5 and the maximum is 54, so many values chain several code points together, as expected for symbols built from a base character plus diacritics or from several letters. The top values map to plain Latin lowercase letters, which are also the most common IPA base symbols. Treatment: Decode the hex strings to Unicode characters for interpretability and check them against Phoneme; if the one-to-one mapping holds, the column adds nothing beyond Phoneme for modelling. high · anthropic:default
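Decoding the hex strings might look like the sketch below. It assumes that multi-code-point values are joined with '+', which the profile does not confirm, so check the separator against the data; decode_glyph_id is a hypothetical helper.

```python
def decode_glyph_id(glyph_id, sep="+"):
    """Turn a hex code-point string into the characters it names.

    Assumption: several code points in one value are joined by `sep`.
    """
    return "".join(chr(int(part, 16)) for part in glyph_id.split(sep))

single = decode_glyph_id("006D")          # LATIN SMALL LETTER M
combined = decode_glyph_id("0061+0303")   # 'a' followed by COMBINING TILDE
print(single, combined, len(combined))
```

Comparing the decoded string with the Phoneme value on each row is a cheap consistency check between the two columns.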

n: 105,484
nulls: 0 (0.0%)
unique: 3,142
len_min: 4
len_max: 54
len_mean: 6.503
len_median: 4
len_p95: 14
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 102,342
duplicate_rate: 0.9702
vocab_size: 1,343
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

Phoneme

text label one_word short_text duplicates

This column contains the phoneme symbols themselves, with 3,142 unique strings across 105,484 rows. Values are mostly single characters (len_median 1, len_mean 1.5, len_max 11), consistent with IPA notation in which longer strings carry diacritics or represent complex segments. The duplicate rate is 97.0% (102,342 duplicates), as expected when common phonemes recur across many language inventories. The vocab_size of 1,339 is smaller than the 3,142 unique values; it is presumably a tokeniser-level count and should not be read as the number of distinct phonemes. Treatment: Encode as categorical (label or one-hot) for modelling; 3,142 unique values is manageable, but consider grouping by manner/place of articulation if dimensionality is a concern. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 3,142
len_min: 1
len_max: 11
len_mean: 1.501
len_median: 1
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 102,342
duplicate_rate: 0.9702
vocab_size: 1,339
readability_flesch_mean: 114.4
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.001754
boilerplate_rate: 0

Allophones

text feature one_word short_text duplicates

This column contains allophone representations from a phonology or linguistics dataset, storing the phonetic variants of phonemes (e.g., 'm', 'j', 'w', 's') as very short strings with a mean length of ~2 characters. Strikingly, 53,580 of 105,484 rows (roughly 50.8%) carry the sentinel value 'NA', indicating no allophone recorded, which dominates the duplicate rate of 93.5%. The 6,892 unique values across a vocabulary of only 1,263 words suggest a modest set of phonetic symbols combined in small clusters, consistent with IPA or similar notation. Treatment: Treat 'NA' as missing; encode remaining values as categorical or tokenize individual phoneme symbols before modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 6,892
len_min: 1
len_max: 37
len_mean: 2.083
len_median: 2
len_p95: 4
word_mean: 1.129
word_median: 1
n_empty: 0
n_duplicates: 98,592
duplicate_rate: 0.9347
vocab_size: 1,263
readability_flesch_mean: 116.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9131
allcaps_rate: 0.00291
boilerplate_rate: 0

Marginal

categorical feature

This column is a boolean flag indicating whether a record is classified as 'marginal', with three distinct string values: FALSE, NA, and TRUE. The dominant value is FALSE (78.9% of 105,484 rows), while TRUE is strikingly rare at only 1,347 records (≈1.3%). The presence of 'NA' as a literal string value in 20,874 rows (≈19.8%) is noteworthy — these are not system nulls (null_rate is 0.0) but encoded string missings, which must be handled explicitly rather than via standard null imputation. Treatment: Encode as ternary (FALSE=0, TRUE=1, NA=missing) after converting string 'NA' to actual nulls, then decide on imputation strategy given 19.8% string-missing rate. high · anthropic:default
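A minimal sketch of the conversion described above, turning the three string values into True, False, and a real missing value. The sample list is made up, and parse_marginal is a hypothetical helper.

```python
# The three literal strings reported by the profile.
MARGINAL_MAP = {"TRUE": True, "FALSE": False, "NA": None}

def parse_marginal(value):
    """Convert a Marginal string to True, False, or None (missing)."""
    if value not in MARGINAL_MAP:
        raise ValueError(f"unexpected Marginal value: {value!r}")
    return MARGINAL_MAP[value]

# Made-up sample values.
sample = ["FALSE", "NA", "TRUE", "FALSE"]
parsed = [parse_marginal(v) for v in sample]
known = [p for p in parsed if p is not None]
marginal_share = sum(known) / len(known)  # share of TRUE among non-missing rows
print(parsed, round(marginal_share, 3))
```

Raising on unexpected strings keeps a fourth value from being silently treated as missing.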

n: 105,484
nulls: 0 (0.0%)
unique: 3
top_value: FALSE
top_rate: 0.7893
cardinality: 3
entropy: 0.8122
entropy_ratio: 0.5125

SegmentClass

categorical label

SegmentClass is a phonological category label classifying 105,484 linguistic segments into exactly three classes: consonant, vowel, and tone. Consonants dominate at 68.5% (72,282 occurrences), vowels account for 29.4% (31,052), and tones are a small minority at just 2.0% (2,150) — a distribution consistent with natural language phoneme inventories but with tones notably underrepresented, suggesting the dataset skews toward non-tonal languages or tonal markings are partially absent. Zero nulls and perfect coverage make this a clean, reliable label. Treatment: One-hot encode or use as a stratification variable; monitor class imbalance for the 'tone' minority class (2,150 samples) in any classification task. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 3
top_value: consonant
top_rate: 0.6852
cardinality: 3
entropy: 1.008
entropy_ratio: 0.6357

Source

categorical label

This column identifies the source database or linguistic inventory from which each phonological record was drawn, with 8 distinct source codes across 105,484 rows and no nulls. The top source 'ph' dominates at 34.4% (36,274 rows), likely referring to PHOIBLE or a similar phoneme database, followed by 'ea', 'upsid', 'er', 'saphon', 'aa', 'spa', and 'ra'. The high entropy ratio of 0.899 indicates a relatively even spread across sources despite 'ph' leading — no single source overwhelmingly controls the data. Analysts should be aware that cross-source comparisons may introduce systematic coding differences, as each source may apply distinct phonological conventions. Treatment: One-hot encode or use as a stratification/grouping variable; check for source-specific biases before pooling. high · anthropic:default
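The entropy figures quoted in these summaries can be checked from the value counts. The sketch below assumes the profiler reports Shannon entropy in bits and divides by log2 of the number of distinct values to get the ratio; the function names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Counts from the Source table: ph, ea, upsid, er, saphon, aa, spa, ra.
source_counts = [36274, 16883, 13966, 9423, 9047, 8064, 7566, 4261]
# Counts for the tone column: '0' and '+'.
tone_counts = [103334, 2150]

print(round(entropy_bits(source_counts), 3), round(entropy_ratio(source_counts), 4))
print(round(entropy_bits(tone_counts), 4))
```

With eight categories the maximum is 3 bits, which is why a ratio near 0.9 reads as a fairly even spread even though 'ph' holds over a third of the rows.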

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: ph
top_rate: 0.3439
cardinality: 8
entropy: 2.697
entropy_ratio: 0.8991

tone

categorical label imbalance

This column is the tone feature, with only two values across 105,484 rows and no nulls: '0' (103,334 rows, 97.96%) and '+' (2,150 rows, ~2%). The '+' count equals the number of rows with SegmentClass = tone (2,150), which suggests the column simply marks tone segments and largely duplicates SegmentClass. The entropy of 0.144 bits (out of a maximum of 1.0 for a binary variable) confirms the near-total dominance of a single class, so a model trained on this label would tend to predict '0' exclusively. Treatment: Apply class-balancing techniques (e.g., oversampling '+' or class-weight adjustment) before using as a modelling target. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 2
top_value: 0
top_rate: 0.9796
cardinality: 2
entropy: 0.1436
entropy_ratio: 0.1436

stress

categorical label imbalance

This column is the stress feature, with two values: '-' (103,334 rows, 97.96% of 105,484) and '0' (2,150 rows). In this dataset's '+'/'-'/'0' feature notation, '-' is the minus value and '0' marks a feature that is unspecified or does not apply, so no row is actually marked '+' for stress; the 2,150 '0' rows match the count of tone segments. The entropy ratio of 0.144 confirms near-minimum information content. Treatment: Do not recode '0' as the positive class; the column is close to constant and can usually be dropped, or kept only as a not-applicable indicator. medium · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 2
top_value: -
top_rate: 0.9796
cardinality: 2
entropy: 0.1436
entropy_ratio: 0.1436

syllabic

categorical

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: -
top_rate: 0.6849
cardinality: 8
entropy: 1.042
entropy_ratio: 0.3472

short

categorical feature imbalance

This column is the phonological length feature [short], with 4 distinct values: '-', '0', '+', and the compound '-,+'. It is severely imbalanced: the dominant value '-' accounts for 97.76% of rows (103,125 of 105,484), while '+' appears just 204 times and the compound '-,+' only 5 times; the remaining 2,150 rows are '0'. The near-zero entropy ratio (0.082) confirms the column carries very little information, and any model trained on it will be overwhelmed by the '-' class. Treatment: One-hot encode, but flag the severe class imbalance (97.8% '-'); consider collapsing '+' and '-,+' into one category before modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 4
top_value: -
top_rate: 0.9776
cardinality: 4
entropy: 0.1645
entropy_ratio: 0.08225

long

categorical feature

This column is the phonological length feature [long], marking whether a segment is long, with only 6 distinct values across 105,484 rows. The dominant value '-' accounts for 89.9% of records, so roughly nine in ten segments are not long. The compound values '-,+', '+,-', and '-,-,+' appear in just 104 rows combined; they most plausibly give one value per sub-part of a complex segment and are worth inspecting before modelling. Treatment: Encode '-', '+', and '0' as a three-level categorical ('0' means unspecified, so it is not a midpoint between '-' and '+'), and flag the 104 compound rows for separate handling. medium · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 6
top_value: -
top_rate: 0.8991
cardinality: 6
entropy: 0.5537
entropy_ratio: 0.2142

consonantal

categorical feature

This column encodes the phonological feature [consonantal], marking whether a segment is consonantal (+) or not (-), a standard distinctive feature in linguistics datasets. The dominant value is '+' at 60.9% (64,257 rows), with '-' covering 37.0% (39,041 rows). The small third category '0' (2,151 rows) marks segments for which the feature is unspecified, while the composite values '+,-' (34) and '-,+' (1) are rare multi-value entries that may encode a change of value within a complex segment or may be annotation inconsistencies. Treatment: One-hot encode '+', '-', and '0'; flag the 35 composite rows for review or split them into per-part values. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 5
top_value: +
top_rate: 0.6092
cardinality: 5
entropy: 1.085
entropy_ratio: 0.4672

sonorant

categorical

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: +
top_rate: 0.5301
cardinality: 8
entropy: 1.245
entropy_ratio: 0.4149

continuant

categorical feature

This column encodes the phonological feature [continuant] (whether airflow continues through the vocal tract) using the '+', '-', and '0' (unspecified/N-A) notation standard in distinctive feature theory. The dominant values are '+' (54.9%, 57,952 rows) and '-' (42.3%, 44,585 rows), which together account for ~97% of records. Most of the 9 unique values are composite strings like '-,+' or '-,-,+', which suggests some rows give a sequence of values for the sub-parts of a complex segment instead of a single value. The entropy ratio of 0.37 reflects concentration driven almost entirely by the binary +/- split. Treatment: Treat '+'/'-'/'0' as a 3-class categorical; split or flag the 796 composite multi-value rows before encoding. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 9
top_value: +
top_rate: 0.5494
cardinality: 9
entropy: 1.172
entropy_ratio: 0.3696

delayedRelease

categorical feature

This column encodes the phonological feature [delayed release], classically used to separate affricates ('+') from plain stops ('-'), for 105,484 records: '0' (feature unspecified, 55% of rows), '-' (26%), '+' (19%), and compound combinations thereof. The compound values ('-,+', '0,-,+', '+,-', '0,0,-,+') show that the column is sometimes multi-valued, packed as comma-separated strings, so it is not a simple single-valued categorical. No nulls are present, and the entropy ratio of 0.52 reflects a moderately skewed distribution dominated by the '0' class. Treatment: Split compound values on ',' into multi-hot binary columns ('has_0', 'has_minus', 'has_plus') before modelling. high · anthropic:default
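The multi-hot split named in the treatment can be sketched as below. The flag names follow the ones suggested above, and multi_hot is a hypothetical helper.

```python
def multi_hot(value):
    """Split a comma-separated feature string into presence flags."""
    parts = value.split(",")
    return {
        "has_0": "0" in parts,
        "has_minus": "-" in parts,
        "has_plus": "+" in parts,
    }

simple = multi_hot("-")
compound = multi_hot("0,0,-,+")
print(simple)
print(compound)
```

This encoding discards the order of the values, so '-,+' and '+,-' become identical; if the order matters, keep the split list instead of the flags.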

n: 105,484
nulls: 0 (0.0%)
unique: 7
top_value: 0
top_rate: 0.5502
cardinality: 7
entropy: 1.471
entropy_ratio: 0.5238

approximant

categorical feature

This column records the phonological feature [approximant], where '+' marks approximant segments and '-' marks non-approximants. The dominant value is '-' at 55.9% of rows (58,966), with '+' at 41.9% (44,266), a near-binary distribution (entropy 1.12 bits, entropy ratio 0.43). The compound values ('-,+', '-,-,+', '+,-') total just 102 rows, so a small minority of segments carry multi-valued approximant codes, possibly multi-part segments or transcription artifacts. Treatment: One-hot encode '-', '+', and '0'; bin the three compound values into a separate category, or flag and investigate them before modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 6
top_value: -
top_rate: 0.559
cardinality: 6
entropy: 1.12
entropy_ratio: 0.4333

tap

categorical feature imbalance

This column is the phonological feature [tap], marking tapped or flapped articulation with '-', '+', '0', and compound sequences such as '-,+' and '-,-,+'. The dominant value '-' accounts for 96.7% of all 105,484 rows, producing severe class imbalance (entropy ratio of only 0.10): taps are rare. The compound entries ('-,+', '-,-,+') show that a cell can hold an ordered sequence of values, apparently one per sub-part of a complex segment, and not just a single state. No nulls are present. Treatment: One-hot encode single-value categories; handle compound values ('-,+', '-,-,+') separately by splitting on ',' or with a binary multi-value flag; oversample or weight minority classes before modelling. medium · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 5
top_value: -
top_rate: 0.9672
cardinality: 5
entropy: 0.2421
entropy_ratio: 0.1043

trill

categorical feature imbalance

This column encodes a phonological feature — specifically the presence or absence of a trill articulation — using a small symbol vocabulary of 6 distinct values. The dominant value is "-" (absent/negative), which accounts for 96.15% of 105,484 rows, producing an extremely low entropy of 0.276 and triggering an imbalance alert. The minority values ("0", "+", and compound sequences like "-,+") together cover fewer than 4,000 rows, suggesting trill is a rare feature in this phonological dataset. The compound values ("-,+", "-,-,+", "+,-") with counts of 26, 8, and 2 respectively hint at multi-segment or allophonic annotation, but are statistically negligible. Treatment: One-hot or ordinal encode with caution; severe class imbalance (96.15% negative class) means most models will need oversampling, class weighting, or collapsing rare compound categories before use. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 6
top_value: -
top_rate: 0.9615
cardinality: 6
entropy: 0.2762
entropy_ratio: 0.1069

nasal

categorical

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: -
top_rate: 0.8084
cardinality: 8
entropy: 0.897
entropy_ratio: 0.299

lateral

categorical feature

This column is the phonological feature [lateral], marking segments produced with airflow along the sides of the tongue (such as l-sounds): '+' for lateral, '-' for non-lateral, and '0' where the feature is unspecified. The dominant value '-' accounts for 93.8% of all 105,484 rows, producing a very low entropy ratio of 0.134, so the column is heavily skewed toward a single class. Some values are sequences of codes (e.g., '-,+', '-,-,+'), apparently one value per sub-part of a complex segment, stored as comma-delimited strings instead of separate fields. No nulls are present. Treatment: Split comma-delimited compound values into multi-hot binary flags for '-', '+', and '0' presence before modelling; expect severe class imbalance. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: -
top_rate: 0.9382
cardinality: 8
entropy: 0.4012
entropy_ratio: 0.1337

labial

categorical feature

This column captures labial articulation features in a phonological or linguistic dataset, encoding presence/absence/neutrality of a labial feature per segment (or per segment sequence). The dominant value is '-' (absent) at 68.2% of 105,484 rows, with '+' (present) at 26.8%, suggesting most segments are non-labial. Surprisingly, ~2.9% of values are compound strings like '-,+', '+,-', '-,-,+', indicating multi-segment bundles packed into a single cell rather than a flat atomic feature, which creates a parsing challenge and implies the column is not fully normalized. Treatment: Split compound values on ',' into separate per-segment records or encode as ordered multi-label; then one-hot or ordinal encode the atomic {'-', '+', '0'} values for modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 15
top_value: -
top_rate: 0.6822
cardinality: 15
entropy: 1.182
entropy_ratio: 0.3025

round

categorical

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: 0
top_rate: 0.703
cardinality: 8
entropy: 1.194
entropy_ratio: 0.398

labiodental

categorical feature

This column encodes labiodental phonetic feature annotations, likely from a linguistic or phonological dataset where each row represents a speech sound or segment. The dominant value '0' (74,124 rows, 70.3%) indicates the feature is absent or neutral, while '+' and '-' mark positive and negative feature values respectively — a standard binary distinctive-feature notation. Notably, multi-valued strings like '+,-', '-,+', and '+,+,-' appear (60 rows total), suggesting a small number of segments carry conflicting or composite annotations, which may indicate data entry inconsistency or deliberate underspecification. Entropy ratio of 0.39 confirms moderate imbalance with '0' dominating. Treatment: One-hot encode '+', '-', '0'; flag or inspect the 60 multi-valued rows ('+,-', '-,+', '+,+,-') for normalization before modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 6
top_value: 0
top_rate: 0.7027
cardinality: 6
entropy: 1.006
entropy_ratio: 0.3891

coronal

categorical feature

This column encodes coronal articulation features in what appears to be a phonological or linguistic dataset, using binary '+'/'-' notation common in distinctive feature theory. The dominant value is '-' (62.8% of 105,484 rows), with '+' covering another 35.0%, making the two primary values account for ~98% of observations. The remaining values ('+,-', '-,+', '-,-,+', '+,-,+') suggest multi-segment or compound entries, which are rare (≤87 occurrences) but worth flagging as potential encoding inconsistencies or multi-value cells that may need splitting. Treatment: Encode primary values '+'/'-'/'0' as ordinal or one-hot features; isolate and inspect the 135 multi-value rows for parsing or splitting before modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 7
top_value: -
top_rate: 0.6279
cardinality: 7
entropy: 1.08
entropy_ratio: 0.3848

anterior

categorical feature

This column is the phonological feature [anterior], a place-of-articulation feature, coded as a signed categorical with 6 distinct values across 105,484 rows and zero nulls. The dominant value '0' (64.8% of rows) marks segments for which the feature is unspecified, while '+' (24.4%) and '-' (10.8%) are the plus and minus values. A small number of rows (9, 5, and 3) contain compound strings like '-,+', '+,-', and '-,-,+', which deviate from the single-symbol pattern and may be multi-part segments or entry errors. Treatment: One-hot encode '0'/'+'/'-'; isolate and investigate the 17 compound-value rows before modelling. medium · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 6
top_value: 0
top_rate: 0.6482
cardinality: 6
entropy: 1.251
entropy_ratio: 0.4839

distributed

categorical label

This column is the phonological feature [distributed], a place-of-articulation feature, in compact symbolic notation: '0' (unspecified), '-', '+', and comma-separated sequences for multi-valued entries. The dominant value '0' covers 66% of 105,484 rows, with '-' (21.1%) and '+' (12.5%) accounting for most of the remainder. The 11 distinct values arise from combinations of these three symbols, so some records hold a sequence of values instead of a single one. The entropy ratio of 0.37 confirms uneven information content, concentrated in the '0' class. Treatment: Split on commas and encode as categorical or multi-hot; be cautious with the numeric mapping {'-': -1, '0': 0, '+': 1}, since '0' means unspecified and is not a midpoint between '-' and '+'. medium · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 11
top_value: 0
top_rate: 0.6602
cardinality: 11
entropy: 1.273
entropy_ratio: 0.3681

strident

categorical

n: 105,484
nulls: 0 (0.0%)
unique: 9
top_value: 0
top_rate: 0.6485
cardinality: 9
entropy: 1.287
entropy_ratio: 0.406

dorsal

categorical feature

This column encodes the phonological feature [dorsal], that is, whether a segment is articulated with the tongue body, using '+', '-', and '0' symbols, sometimes as comma-separated sequences. The dominant values are '+' (51.7%, n=54,535) and '-' (n=47,052), together accounting for ~96.3% of rows, with '0' (n=2,160) most likely marking segments to which the feature does not apply. Ten of the 13 categories are multi-value strings like '-,+' or '-,-,+', which suggests some records encode an ordered sequence of values across a complex segment rather than a single binary state; these need normalization before use. Treatment: Split multi-value entries on ',' into ordered sub-features or one-hot encode each position before modelling; treat '+', '-', '0' as a ternary categorical for single-value rows. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 13
top_value: +
top_rate: 0.517
cardinality: 13
entropy: 1.235
entropy_ratio: 0.3338

high

categorical feature

This column encodes the phonological feature [high] (tongue-body height), using tokens '0' (not applicable), '+', and '-', including multi-value sequences like '-,+' or '+,-,+'. The dominant value is '0' at 46.7% (49,247 rows), followed by '+' at 35,559 and '-' at 19,156, so [+high] segments outnumber [-high] ones by roughly 1.9:1. The compound sequence values (e.g., '+,-,+', '-,+,+') are low-frequency, some appearing only once or twice, and are rare edge cases that may need consolidation or separate treatment. Treatment: Encode top 3 values ('0', '+', '-') as ordinal or one-hot features; collapse rare multi-value sequences (≤845 occurrences) into an 'other' or structured sub-category. medium · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 11
top_value: 0
top_rate: 0.4669
cardinality: 11
entropy: 1.594
entropy_ratio: 0.4609
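
Collapsing the rare compound values of this column into a single bucket could look like the sketch below. The helper `collapse_rare`, the `other` label, and the toy list are all assumptions for illustration; the toy counts are not the real distribution.

```python
from collections import Counter


def collapse_rare(values, min_count, other_label="other"):
    """Replace every value seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy data standing in for the column; the real counts are far larger.
toy = ["0"] * 5 + ["+"] * 4 + ["-"] * 3 + ["-,+"] * 2 + ["+,-,+"]
collapsed = collapse_rare(toy, min_count=3)
print(Counter(collapsed))
```

On the real column, the figures quoted above (compound values at ≤845 rows, '-' at 19,156) imply that any `min_count` between 846 and 19,156 would keep '0', '+', and '-' and fold everything else into the bucket.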

low

categorical

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: -
top_rate: 0.4733
cardinality: 8
entropy: 1.305
entropy_ratio: 0.4351

front

categorical feature

This column encodes the phonological feature [front], using values '0', '-', and '+' as atomic tokens that can be combined into sequences (e.g., '-,+', '+,-,-'). The dominant value '0' accounts for 46.75% of rows, with '-' (34,225) and '+' (20,683) making up most of the remainder; [-front] segments outnumber [+front] ones by ~1.65:1. The compound values ('-,+', '+,-', etc.) pack an ordered sequence of feature values into a single string, which requires parsing before modelling. Treatment: Parse composite sequences (e.g., '-,+') into structured arrays or ordinal counts; then one-hot or embed atomic states before modelling. medium · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 13
top_value: 0
top_rate: 0.4675
cardinality: 13
entropy: 1.592
entropy_ratio: 0.4302

back

categorical feature

This column encodes the phonological feature [back], using a compact notation of '+', '-', and '0' tokens. The dominant value is '0' at 46.7% (49,270 rows), followed by '-' at 39,749 and '+' at 15,547, so [-back] segments outnumber [+back] ones by roughly 2.6:1. Compound sequences like '+,-' (511) and '-,+' (367) exist but are rare, and a handful of three-value sequences appear at the tail; this variable-length encoding could cause parsing issues if the column is treated as a simple label. Treatment: Parse composite multi-token values (e.g., '+,-') into structured sequences or ordinal scores before modelling; consider one-hot or frequency encoding for the simple single-token majority. medium · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 12
top_value: 0
top_rate: 0.4671
cardinality: 12
entropy: 1.521
entropy_ratio: 0.4244

tense

categorical

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: 0
top_rate: 0.7132
cardinality: 8
entropy: 1.114
entropy_ratio: 0.3712

retractedTongueRoot

categorical feature imbalance

This column encodes the phonological feature 'retracted tongue root' (RTR), which marks vowel or consonant articulation in phonological databases. The dominant value is '-' (absence of RTR) at 97.44% of 105,484 rows; only ~2,696 rows carry any other value, so positive marking is rare in this dataset. The compound values ('-,+', '-,-,+', etc.) suggest that a few phonemes are annotated with a sequence of values rather than a single label. Treatment: Flag severe class imbalance (97.44% negative); use oversampling or class-weighted models if predicting RTR; consider splitting compound strings into per-segment binary indicators before modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 7
top_value: -
top_rate: 0.9744
cardinality: 7
entropy: 0.1935
entropy_ratio: 0.06892

advancedTongueRoot

categorical feature imbalance

This column encodes the Advanced Tongue Root (ATR) phonological feature, a binary articulatory distinction marked as '+' (present), '-' (absent), or '0' (neutral/not applicable), typical of linguistic phoneme inventories. The distribution is severely imbalanced: '-' dominates at 97.87% (103,238 of 105,484 rows), '0' appears in only 2,235 rows, and '+' is nearly absent with just 11 occurrences. The near-zero entropy ratio (0.094) confirms that '+ATR' is a vanishingly rare feature in this dataset, which would make any model predicting '+' extremely difficult to train without resampling. Treatment: Treat as ordinal or one-hot encoded categorical; oversample or reweight the '+' class (n=11) before modelling, or collapse '+' and '0' if class separation is infeasible. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 3
top_value: -
top_rate: 0.9787
cardinality: 3
entropy: 0.1496
entropy_ratio: 0.09438
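
The entropy figures in these cards can be cross-checked from the value counts. The sketch below assumes that `entropy` is Shannon entropy in bits and that `entropy_ratio` divides it by log2 of the cardinality; that definition is inferred from the reported numbers, not documented by the profiler. The counts are the ones quoted in the description above.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of observed classes)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a single class has no uncertainty to normalise
    return entropy_bits(counts) / math.log2(k)


# Counts for advancedTongueRoot as quoted in the card.
atr_counts = {"-": 103_238, "0": 2_235, "+": 11}
h = entropy_bits(list(atr_counts.values()))
r = entropy_ratio(list(atr_counts.values()))
print(f"entropy: {h:.4f}  entropy_ratio: {r:.5f}")
```

Under that assumed definition the result agrees with the reported 0.1496 and 0.09438, which is some evidence that the ratio is normalised by log2 of the cardinality.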

periodicGlottalSource

categorical

n: 105,484
nulls: 0 (0.0%)
unique: 7
top_value: +
top_rate: 0.6797
cardinality: 7
entropy: 1.051
entropy_ratio: 0.3745

epilaryngealSource

categorical feature imbalance

This column encodes whether an epilaryngeal phonation source is present, using a ternary scheme: absent ('-'), present ('+'), or a neutral or not-applicable state ('0'). The distribution is severely imbalanced: the '-' (absent) class dominates at 97.9% of 105,484 rows, while '+' (present) appears in only 31 records (~0.03%), making positive-class detection extremely challenging. The near-zero entropy (0.147 bits) confirms that the column carries very little variation as-is. Treatment: Flag severe class imbalance ('+' = 31 of 105,484); apply oversampling or class-weighted modelling, and consider binary encoding after collapsing '-' vs. non-'-'. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 3
top_value: -
top_rate: 0.9793
cardinality: 3
entropy: 0.1474
entropy_ratio: 0.09303

spreadGlottis

categorical

n: 105,484
nulls: 0 (0.0%)
unique: 10
top_value: -
top_rate: 0.9182
cardinality: 10
entropy: 0.4965
entropy_ratio: 0.1495

constrictedGlottis

categorical feature

This column encodes the phonological feature [constricted glottis] as presence/absence symbols across 105,484 records, with no nulls and only 7 unique values. The dominant value is '-' (absent), appearing in 94.5% of rows (99,727), while '+' (present) accounts for just 3,383 rows, so positively marked segments are rare. The compound values ('+,-', '-,+', '+,-,-', '-,-,+') suggest multi-value sequences for a tiny fraction of records; with counts of 141, 93, 1, and 1 respectively, these are near-negligible. The low entropy (0.372 bits, entropy ratio 0.132) confirms the column is heavily imbalanced toward the negative class. Treatment: Binarize into presence/absence flag; treat compound multi-segment values as positive; account for severe class imbalance (94.5% negative) during modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 7
top_value: -
top_rate: 0.9454
cardinality: 7
entropy: 0.3717
entropy_ratio: 0.1324
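
The binarisation suggested for this column, with compound values counted as positive, can be sketched as follows. The helper name `has_plus` is an assumption for the example. The counts are the ones quoted in the description; the '0' rows are left out because their count is not stated, and they would map to 0 under this rule.

```python
def has_plus(cell: str) -> int:
    """Return 1 if any comma-separated token in the cell is '+', else 0."""
    return int(any(token.strip() == "+" for token in cell.split(",")))


# Value counts as quoted in the card ('0' omitted: its count is not given).
reported_counts = {
    "-": 99_727,
    "+": 3_383,
    "+,-": 141,
    "-,+": 93,
    "+,-,-": 1,
    "-,-,+": 1,
}

positives = sum(n for cell, n in reported_counts.items() if has_plus(cell))
negatives = sum(n for cell, n in reported_counts.items() if not has_plus(cell))
print(positives, negatives)  # 3619 99727
```

Every compound value listed for this column contains a '+', so treating compounds as positive and testing for a '+' token give the same flag here; on other columns the two rules could differ.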

fortis

categorical feature

This column encodes the phonological feature [fortis] as a three-valued categorical flag with values '-', '0', and '+'. The dominant value is '-' at 68.1% (71,867 rows), '0' accounts for 31.5% (33,202 rows), and '+' is strikingly rare at only 415 occurrences (~0.4%), creating a heavily imbalanced distribution. Positively marked fortis segments are therefore a small minority, which matters for class balance if the column is used as a modelling target. Treatment: Ordinal-encode as -1/0/1 or one-hot encode, and apply class-imbalance handling (e.g. oversampling '+') before modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 3
top_value: -
top_rate: 0.6813
cardinality: 3
entropy: 0.9335
entropy_ratio: 0.589
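
As an alternative to oversampling, the imbalance in this column could be handled with per-class weights. The sketch below uses the common inverse-frequency heuristic n / (k * n_c), the same formula scikit-learn applies for `class_weight="balanced"`; choosing it here is an assumption, not something the profiler prescribes. The counts are those quoted in the description.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Value counts for fortis as quoted in the card.
fortis_counts = {"-": 71_867, "0": 33_202, "+": 415}
weights = balanced_class_weights(fortis_counts)

for label, weight in weights.items():
    print(f"{label!r}: {weight:.3f}")
```

With these weights each class contributes the same total weight, so the 415 '+' rows count as much in aggregate as the 71,867 '-' rows.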

lenis

categorical feature

This column encodes the phonological feature [lenis] as a ternary flag with values '-' (absent), '0' (not applicable), and '+' (present). The dominant value is '-' at 68.1% (71,866 rows), while the positive lenis marker '+' is strikingly rare at only 416 occurrences (~0.4%), creating a heavily imbalanced distribution. No nulls exist across all 105,484 rows, and entropy is moderate at 0.93 bits, well below the theoretical maximum of about 1.58 bits for three classes, confirming the skew toward the negative class. Treatment: Encode as ordinal or one-hot; be aware that the '+' class (416 samples) is severely underrepresented and will require oversampling or class-weight adjustment before modelling. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 3
top_value: -
top_rate: 0.6813
cardinality: 3
entropy: 0.9336
entropy_ratio: 0.589

raisedLarynxEjective

categorical feature imbalance

This column encodes the presence or absence of the raised-larynx ejective phonological feature, using the '+'/'-'/'0' values of a distinctive feature matrix. The dominant value is '-' (absent), appearing in 96.4% of 105,484 rows, creating severe class imbalance with an entropy ratio of only 0.10. Minority values include '0' (2,150 rows), '+' (1,573 rows), and compound sequences like '-,+' and '+,-', suggesting some segments receive multi-valued or composite feature annotations. The absence of nulls (0.0%) indicates complete annotation coverage, but the extreme skew means this feature will contribute little signal in most modelling contexts without deliberate resampling or grouping. Treatment: Collapse compound values ('-,+', '+,-', '-,-,+') into a unified category and apply oversampling or class-weight adjustment before modelling given 96.4% imbalance. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 6
top_value: -
top_rate: 0.9637
cardinality: 6
entropy: 0.2675
entropy_ratio: 0.1035

loweredLarynxImplosive

categorical feature imbalance

This column encodes a phonological feature — specifically whether a sound has a lowered larynx implosive articulation — using a symbolic notation system ('+', '-', and combinations). The distribution is severely imbalanced: 97.3% of the 105,484 rows carry the default/absent value '-', with only 2,150 instances of '0', 716 of '+', and fewer than 10 combined entries. The near-zero entropy ratio (0.088) confirms this column carries almost no information for most records, which would make it a very weak standalone predictor. Treatment: Flag the severe imbalance; consider binarising ('+' vs. all others) or dropping if the rare positive class is insufficient for modelling purposes. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 5
top_value: -
top_rate: 0.9727
cardinality: 5
entropy: 0.2034
entropy_ratio: 0.08759

click

categorical label

This column encodes the phonological feature [click], which marks click consonants, as symbolic tokens ('-', '0', '+') with compound values ('+,-', '-,+'). The dominant value is '-' at 68.2% of 105,484 rows, while values containing '+' ('+', '+,-', '-,+') account for only ~311 rows combined (~0.3%), so the feature is heavily imbalanced. The multi-value strings '+,-' and '-,+' pack an ordered sequence of feature values into a single cell and should be split or otherwise handled before modelling. Treatment: Ordinal-encode or one-hot after splitting multi-value entries ('+,-', '-,+'); treat class imbalance (~0.3% positive) before using as a classification target. high · anthropic:default

n: 105,484
nulls: 0 (0.0%)
unique: 5
top_value: -
top_rate: 0.6823
cardinality: 5
entropy: 0.9283
entropy_ratio: 0.3998