data trove phoible phonetics database
Reading
This dataset is PHOIBLE, a cross-linguistic phonological inventory database. It holds 105,484 phoneme-level records covering 2,177 distinct Glottocodes (languages and dialects); each row describes a single phoneme and its distinctive feature values. Start with the breakdown by SegmentClass: consonants dominate (68.5%), followed by vowels (29.4%) and tones (2.0%), and this split shapes how almost every other feature is distributed. Next, look at the Source column: the data comes from eight contributing databases ('ph' alone accounts for 34.4%), so coverage and coding conventions are uneven across the corpus and could introduce systematic biases in any cross-linguistic comparison.
citing: SegmentClass.top_values · Source.top_values · LanguageName.n_unique · Glottocode.n_unique · Phoneme.top_values · row_count · column_count
Charts the summary said to look at first
SegmentClass: value counts
| value | count | share |
|---|---|---|
| consonant | 72282 | 68.5% |
| vowel | 31052 | 29.4% |
| tone | 2150 | 2.0% |
Source: value counts
| value | count | share |
|---|---|---|
| ph | 36274 | 34.4% |
| ea | 16883 | 16.0% |
| upsid | 13966 | 13.2% |
| er | 9423 | 8.9% |
| saphon | 9047 | 8.6% |
| aa | 8064 | 7.6% |
| spa | 7566 | 7.2% |
| ra | 4261 | 4.0% |
LanguageName: length in characters (histogram)
| chars | count |
|---|---|
| 2 – 4 | 3915 |
| 4 – 6 | 29541 |
| 6 – 8 | 34409 |
| 8 – 10 | 14985 |
| 10 – 12 | 7358 |
| 12 – 14 | 4849 |
| 14 – 15 | 3897 |
| 15 – 17 | 2893 |
| 17 – 19 | 1292 |
| 19 – 21 | 993 |
| 21 – 23 | 198 |
| 23 – 25 | 224 |
| 25 – 27 | 218 |
| 27 – 29 | 39 |
| 29 – 31 | 93 |
| 31 – 33 | 130 |
| 33 – 35 | 64 |
| 35 – 37 | 23 |
| 37 – 39 | 37 |
| 39 – 40 | 57 |
| 40 – 42 | 0 |
| 42 – 44 | 52 |
| 44 – 46 | 0 |
| 46 – 48 | 23 |
| 48 – 50 | 0 |
| 50 – 52 | 0 |
| 52 – 54 | 20 |
| 54 – 56 | 0 |
| 56 – 58 | 40 |
| 58 – 60 | 0 |
| 60 – 62 | 87 |
| 62 – 64 | 0 |
| 64 – 66 | 0 |
| 66 – 67 | 0 |
| 67 – 69 | 0 |
| 69 – 71 | 0 |
| 71 – 73 | 0 |
| 73 – 75 | 0 |
| 75 – 77 | 0 |
| 77 – 79 | 47 |
nasal: value counts
| value | count | share |
|---|---|---|
| - | 85269 | 80.8% |
| + | 15941 | 15.1% |
| 0 | 2150 | 2.0% |
| +,- | 1973 | 1.9% |
| -,+ | 95 | 0.1% |
| +,-,- | 54 | 0.1% |
| +,-,+,- | 1 | 0.0% |
| -,+,- | 1 | 0.0% |
Marginal: value counts
| value | count | share |
|---|---|---|
| FALSE | 83263 | 78.9% |
| NA | 20874 | 19.8% |
| TRUE | 1347 | 1.3% |
Schema
49 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| InventoryID | numeric | 0.0% | 3,020 | |
| Glottocode | text | 0.0% | 2,177 | one_word, short_text, duplicates |
| ISO6393 | text | 0.0% | 2,095 | one_word, short_text, duplicates |
| LanguageName | text | 0.0% | 2,716 | one_word, allcaps, short_text, duplicates |
| SpecificDialect | categorical | 0.0% | 546 | |
| GlyphID | text | 0.0% | 3,142 | one_word, allcaps, short_text, duplicates |
| Phoneme | text | 0.0% | 3,142 | one_word, short_text, duplicates |
| Allophones | text | 0.0% | 6,892 | one_word, short_text, duplicates |
| Marginal | categorical | 0.0% | 3 | |
| SegmentClass | categorical | 0.0% | 3 | |
| Source | categorical | 0.0% | 8 | |
| tone | categorical | 0.0% | 2 | imbalance |
| stress | categorical | 0.0% | 2 | imbalance |
| syllabic | categorical | 0.0% | 8 | |
| short | categorical | 0.0% | 4 | imbalance |
| long | categorical | 0.0% | 6 | |
| consonantal | categorical | 0.0% | 5 | |
| sonorant | categorical | 0.0% | 8 | |
| continuant | categorical | 0.0% | 9 | |
| delayedRelease | categorical | 0.0% | 7 | |
| approximant | categorical | 0.0% | 6 | |
| tap | categorical | 0.0% | 5 | imbalance |
| trill | categorical | 0.0% | 6 | imbalance |
| nasal | categorical | 0.0% | 8 | |
| lateral | categorical | 0.0% | 8 | |
| labial | categorical | 0.0% | 15 | |
| round | categorical | 0.0% | 8 | |
| labiodental | categorical | 0.0% | 6 | |
| coronal | categorical | 0.0% | 7 | |
| anterior | categorical | 0.0% | 6 | |
| distributed | categorical | 0.0% | 11 | |
| strident | categorical | 0.0% | 9 | |
| dorsal | categorical | 0.0% | 13 | |
| high | categorical | 0.0% | 11 | |
| low | categorical | 0.0% | 8 | |
| front | categorical | 0.0% | 13 | |
| back | categorical | 0.0% | 12 | |
| tense | categorical | 0.0% | 8 | |
| retractedTongueRoot | categorical | 0.0% | 7 | imbalance |
| advancedTongueRoot | categorical | 0.0% | 3 | imbalance |
| periodicGlottalSource | categorical | 0.0% | 7 | |
| epilaryngealSource | categorical | 0.0% | 3 | imbalance |
| spreadGlottis | categorical | 0.0% | 10 | |
| constrictedGlottis | categorical | 0.0% | 7 | |
| fortis | categorical | 0.0% | 3 | |
| lenis | categorical | 0.0% | 3 | |
| raisedLarynxEjective | categorical | 0.0% | 6 | imbalance |
| loweredLarynxImplosive | categorical | 0.0% | 5 | imbalance |
| click | categorical | 0.0% | 5 | |
InventoryID
numeric foreign_key
InventoryID is a numeric key identifying the phoneme inventory each row belongs to. It takes exactly 3,020 distinct values spanning 1–3,020 across 105,484 rows, so each ID recurs about 35 times on average, roughly one row per phoneme in an inventory. The distribution is flat and symmetric (skew ≈ −0.002, kurtosis ≈ −1.15, zero outliers), as expected for a sequential identifier rather than a measured quantity. Treatment: Left-join on this ID to an inventory-level table; do not use the raw numeric value as a feature.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 3,020
- min
- 1
- max
- 3,020
- mean
- 1,479
- median
- 1,464
- std
- 843.1
- q1
- 769
- q3
- 2,237
- iqr
- 1,468
- skew
- -0.002397
- kurtosis
- -1.146
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
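The left-join suggested for InventoryID can be sketched with plain dictionaries. This is a minimal illustration, not the profiler's method: the inventory lookup, its `inventory_source` field, the `inventory_found` flag, and all sample rows are hypothetical.

```python
# Hypothetical phoneme rows and inventory lookup; values are illustrative only.
phoneme_rows = [
    {"InventoryID": 1, "Phoneme": "m"},
    {"InventoryID": 1, "Phoneme": "a"},
    {"InventoryID": 2, "Phoneme": "k"},
    {"InventoryID": 99, "Phoneme": "t"},  # no matching inventory record
]
inventories = {
    1: {"inventory_source": "src_a"},
    2: {"inventory_source": "src_b"},
}


def left_join(rows, lookup, key="InventoryID"):
    """Attach lookup fields to each row; unmatched rows are kept and flagged."""
    joined = []
    for row in rows:
        match = lookup.get(row[key])
        merged = dict(row)
        merged.update(match or {})
        merged["inventory_found"] = match is not None
        joined.append(merged)
    return joined


joined = left_join(phoneme_rows, inventories)
```

Keeping unmatched rows, rather than dropping them as an inner join would, makes gaps in the inventory table visible instead of silently shrinking the data.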
Glottocode
text foreign_key one_word short_text duplicates
This column contains Glottocodes, the standardized 8-character language identifiers used by the Glottolog database (e.g., 'kham1282', 'dutc1256'). Lengths are almost uniformly 8 characters (mean 7.999, median 8), but the minimum length of 2 shows that a few rows hold a shorter, non-conforming value that should be checked. With only 2,177 unique codes across 105,484 rows, the duplicate rate is 97.9%: each code recurs about 48 times on average, which is expected when every phoneme of every inventory is tagged with its language. The top code 'kham1282' appears 622 times, so language coverage is uneven. Treatment: Left-join on this code against the Glottolog reference table to add language family, geographic coordinates, and macro-area metadata.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 2,177
- len_min
- 2
- len_max
- 8
- len_mean
- 7.999
- len_median
- 8
- len_p95
- 8
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 103,307
- duplicate_rate
- 0.9794
- vocab_size
- 2,168
- readability_flesch_mean
- 94.15
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
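Because the statistics above report a minimum length of 2, it may be worth separating well-formed Glottocodes from other values before joining. The sketch below assumes the usual Glottocode shape of four lowercase alphanumeric characters followed by four digits; the 'NA' entry is a hypothetical stand-in for whatever the short value actually is.

```python
import re

# Assumed shape: four lowercase alphanumerics followed by four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def split_glottocodes(values):
    """Return (valid, invalid) lists, preserving input order."""
    valid, invalid = [], []
    for value in values:
        if GLOTTOCODE_RE.fullmatch(value):
            valid.append(value)
        else:
            invalid.append(value)
    return valid, invalid


# 'NA' is a hypothetical example of a 2-character non-conforming value.
valid, invalid = split_glottocodes(["kham1282", "dutc1256", "NA"])
```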
ISO6393
text label one_word short_text duplicates
This column contains ISO 639-3 three-letter language codes. Every value is exactly 3 characters long (min, mean, and max all equal 3) with zero nulls. The duplicate rate is 98.0%: 2,095 distinct codes repeat across 105,484 rows, as expected for a language tag applied to every phoneme record. The most frequent code is 'mis', the ISO 639-3 special code for uncoded languages, which appears 828 times; these rows belong to varieties without a dedicated ISO code, not to a single language. Treatment: Use as a categorical grouping key; flag or separate records where ISO6393 equals 'mis' before language-based analysis.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 2,095
- len_min
- 3
- len_max
- 3
- len_mean
- 3
- len_median
- 3
- len_p95
- 3
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 103,389
- duplicate_rate
- 0.9801
- vocab_size
- 2,086
- readability_flesch_mean
- 119.5
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
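Separating the 'mis' records before grouping by language could look like the sketch below. The row dictionaries are hypothetical samples; only the column name and the 'mis' code come from the report.

```python
from collections import Counter

UNCODED = "mis"  # ISO 639-3 special code for uncoded languages


def split_uncoded(rows, key="ISO6393"):
    """Partition rows into those with a specific language code and those tagged 'mis'."""
    coded = [row for row in rows if row[key] != UNCODED]
    uncoded = [row for row in rows if row[key] == UNCODED]
    return coded, uncoded


# Hypothetical sample rows.
rows = [
    {"ISO6393": "nld"},
    {"ISO6393": "mis"},
    {"ISO6393": "che"},
    {"ISO6393": "mis"},
]
coded, uncoded = split_uncoded(rows)
counts = Counter(row["ISO6393"] for row in rows)
```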
LanguageName
text label one_word allcaps short_text duplicates
This column contains language names (e.g., 'Dutch', 'Chechen', 'Bengali', 'Iron Ossetic') and acts as a repeating label with 2,716 unique values across 105,484 rows (duplicate rate 97.4%, 102,768 duplicates). There are more distinct names than Glottocodes (2,716 vs. 2,177), so the same language can appear under more than one name or spelling. About 13.1% of values are all-caps; given a median length of 7 characters, this probably reflects the naming conventions of individual sources rather than abbreviations. The distribution is long-tailed: the top value 'Iron Ossetic' appears 444 times while many names appear rarely. Treatment: Normalise the casing inconsistencies flagged by the 13.1% all-caps rate, then encode as a categorical (label or target-encode) if the name is needed as a feature.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 2,716
- len_min
- 2
- len_max
- 79
- len_mean
- 7.822
- len_median
- 7
- len_p95
- 16
- word_mean
- 1.201
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 102,768
- duplicate_rate
- 0.9743
- vocab_size
- 2,670
- readability_flesch_mean
- 53.18
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.8433
- allcaps_rate
- 0.1314
- boilerplate_rate
- 0
SpecificDialect
categorical label
This column records a specific dialect designation for the record, where one was given. The dominant signal is missingness encoded as labels: 71.9% of rows carry the string 'NA' and a further 7,692 rows (≈7.3%) are empty strings, so roughly 79% of records have no dialect value. Despite 546 unique values, the entropy ratio is only 0.33, confirming that real dialect labels are spread thinly and unevenly across the remaining ~22,000 rows. Treatment: Treat 'NA' and the empty string as missing; consider collapsing rare dialects (below a frequency threshold) into an 'Other' bucket before encoding, and flag the ~79% missingness before any modelling use.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 546
- top_value
- NA
- top_rate
- 0.7187
- cardinality
- 546
- entropy
- 2.969
- entropy_ratio
- 0.3265
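The treatment for SpecificDialect (treat 'NA' and empty strings as missing, collapse rare labels) could be sketched as follows. The threshold `min_count`, the 'Other' label, and the sample values are illustrative assumptions, not values taken from the dataset.

```python
from collections import Counter

MISSING_TOKENS = {"NA", ""}  # string-encoded missing values named in the report


def clean_dialects(values, min_count=2, other_label="Other"):
    """Map missing tokens to None and collapse labels rarer than min_count."""
    normalized = [
        None if value.strip() in MISSING_TOKENS else value.strip()
        for value in values
    ]
    counts = Counter(value for value in normalized if value is not None)
    return [
        value if value is None or counts[value] >= min_count else other_label
        for value in normalized
    ]


# Hypothetical dialect labels.
sample = ["NA", "", "Northern", "Northern", "Coastal", "NA"]
cleaned = clean_dialects(sample)
```

With a real threshold, the cut-off would be chosen from the observed frequency distribution rather than fixed in advance.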
GlyphID
text label one_word allcaps short_text duplicates
GlyphID encodes each phoneme as Unicode code points written as uppercase hexadecimal strings (e.g., '006D' = 'm', '0069' = 'i', '0061' = 'a'). It has 3,142 unique values and 102,342 duplicates, exactly matching the Phoneme column, which suggests a one-to-one mapping between the two. All values are uppercase (allcaps_rate 1.0) and single-token (one_word_rate 1.0). The median length is 4 characters (one code point), but the mean is 6.5 and the maximum is 54, so many phonemes are sequences of several code points, such as a base letter plus diacritics. The most frequent values map to IPA symbols that coincide with common Latin lowercase letters. Treatment: Map hex strings to Unicode characters for interpretability; if the mapping to Phoneme is indeed one-to-one, GlyphID adds no information beyond Phoneme, so avoid encoding both.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 3,142
- len_min
- 4
- len_max
- 54
- len_mean
- 6.503
- len_median
- 4
- len_p95
- 14
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 102,342
- duplicate_rate
- 0.9702
- vocab_size
- 1,343
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
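The hex-to-character mapping described for GlyphID can be sketched as below. The single-code-point examples come from the description above; the '+' separator for multi-code-point values is an assumption that should be checked against the actual data before use.

```python
def decode_glyph_id(glyph_id, sep="+"):
    """Decode a GlyphID such as '006D' into its Unicode string.

    Assumes multi-code-point IDs join 4-digit hex values with `sep`
    (an unverified assumption about the file's convention).
    """
    return "".join(chr(int(part, 16)) for part in glyph_id.split(sep))


single = decode_glyph_id("006D")
# Hypothetical two-code-point ID: U+0074 (t) followed by U+0283 (esh).
sequence = decode_glyph_id("0074+0283")
```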
Phoneme
text label one_word short_text duplicates
This column contains the phoneme symbol for each record, with 3,142 unique strings across 105,484 rows. Values are mostly single characters (len_median 1, len_mean 1.5, len_max 11), consistent with IPA notation in which longer strings combine a base symbol with diacritics or join several symbols. The duplicate rate is 97.0% (102,342 duplicates), as expected when the same phonemes recur across many language inventories. Treatment: Encode as categorical (label or one-hot) only with care, since 3,142 levels is a lot for one-hot encoding; if dimensionality is a concern, group by manner or place of articulation, which the feature columns already describe.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 3,142
- len_min
- 1
- len_max
- 11
- len_mean
- 1.501
- len_median
- 1
- len_p95
- 3
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 102,342
- duplicate_rate
- 0.9702
- vocab_size
- 1,339
- readability_flesch_mean
- 114.4
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0.001754
- boilerplate_rate
- 0
Allophones
text feature one_word short_text duplicates
This column lists the allophones (phonetic variants) recorded for each phoneme, e.g. 'm', 'j', 'w', 's', as very short strings with a mean length of about 2 characters. In 53,580 of 105,484 rows (50.8%) the value is the sentinel string 'NA', meaning no allophone was recorded; this drives much of the 93.5% duplicate rate. The 6,892 unique values draw on a vocabulary of only 1,263 tokens, and about 8.7% of values contain more than one token (one_word_rate 0.913), which suggests a modest set of phonetic symbols combined into short lists. Treatment: Treat 'NA' as missing; split multi-token values into individual allophones before encoding or counting.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 6,892
- len_min
- 1
- len_max
- 37
- len_mean
- 2.083
- len_median
- 2
- len_p95
- 4
- word_mean
- 1.129
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 98,592
- duplicate_rate
- 0.9347
- vocab_size
- 1,263
- readability_flesch_mean
- 116.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.9131
- allcaps_rate
- 0.00291
- boilerplate_rate
- 0
Marginal
categorical feature
This column is a boolean flag indicating whether a phoneme is classified as marginal, stored as three distinct strings: FALSE, NA, and TRUE. FALSE dominates (78.9% of 105,484 rows), while TRUE is rare at 1,347 records (≈1.3%). The literal string 'NA' appears in 20,874 rows (≈19.8%); these are not system nulls (the null rate is 0.0%) but string-encoded missing values, so they must be converted explicitly rather than handled by standard null imputation. Treatment: Convert the string 'NA' to a real null, encode FALSE=0 and TRUE=1, then decide on an imputation strategy for the 19.8% missing.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- FALSE
- top_rate
- 0.7893
- cardinality
- 3
- entropy
- 0.8122
- entropy_ratio
- 0.5125
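Converting the three Marginal strings into a nullable boolean could be done along these lines. The function name is hypothetical; the three string values are the ones listed in the report.

```python
# The three literal strings reported for the Marginal column.
MARGINAL_MAP = {"TRUE": True, "FALSE": False, "NA": None}


def parse_marginal(value):
    """Return True/False/None for a Marginal cell; reject anything unexpected."""
    try:
        return MARGINAL_MAP[value]
    except KeyError:
        raise ValueError(f"unexpected Marginal value: {value!r}") from None


parsed = [parse_marginal(v) for v in ["FALSE", "NA", "TRUE"]]
```

Raising on unknown strings, instead of defaulting to None, keeps new or misspelled values from being mistaken for missing data.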
SegmentClass
categorical label
SegmentClass classifies each of the 105,484 segments as consonant, vowel, or tone. Consonants dominate at 68.5% (72,282 rows), vowels account for 29.4% (31,052), and tones make up just 2.0% (2,150). The small tone share suggests that the dataset skews toward non-tonal languages or that tones are not recorded by every source. With zero nulls, this is a clean, reliable label. Treatment: One-hot encode or use as a stratification variable; watch the 'tone' minority class (2,150 rows) in any classification task.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- consonant
- top_rate
- 0.6852
- cardinality
- 3
- entropy
- 1.008
- entropy_ratio
- 0.6357
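The entropy and entropy_ratio figures can be reproduced from the value counts. The sketch assumes the profiler uses Shannon entropy in bits and divides by log2 of the cardinality; that assumption is consistent with the numbers reported for SegmentClass and for the binary tone column.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Counts taken from the SegmentClass table in this report.
segment_counts = {"consonant": 72282, "vowel": 31052, "tone": 2150}

h = entropy_bits(segment_counts.values())
# Assumed normalisation: divide by the maximum entropy for this cardinality.
ratio = h / math.log2(len(segment_counts))
```

Under this definition a ratio of 1.0 means all classes are equally frequent, and values near 0 mean one class dominates.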
Source
categorical label
This column identifies the contributing database from which each phonological record was drawn, with 8 distinct source codes across 105,484 rows and no nulls. The top source 'ph' leads at 34.4% (36,274 rows), followed by 'ea', 'upsid', 'er', 'saphon', 'aa', 'spa', and 'ra'. The entropy ratio of 0.899 indicates a fairly even spread: 'ph' is the largest source but does not dominate. Because each source may apply its own phonological conventions, cross-source comparisons can carry systematic coding differences. Treatment: One-hot encode or use as a stratification/grouping variable; check for source-specific biases before pooling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- ph
- top_rate
- 0.3439
- cardinality
- 8
- entropy
- 2.697
- entropy_ratio
- 0.8991
tone
categorical label imbalance
This column is the distinctive feature [tone], with only two values across 105,484 rows and no nulls: '+' for tone segments and '0' (not applicable) for everything else. '0' accounts for 97.96% of records (103,334) and '+' for just 2,150 rows (~2.0%), the same count as the 'tone' class in SegmentClass, so the column appears to duplicate that label. The entropy of 0.144 bits (maximum 1.0 for a binary variable) reflects this concentration. Treatment: Treat as redundant with SegmentClass unless a row-level check shows otherwise; if it is used as a modelling target, apply class-balancing techniques (e.g., oversampling '+' or class-weight adjustment).
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 0
- top_rate
- 0.9796
- cardinality
- 2
- entropy
- 0.1436
- entropy_ratio
- 0.1436
stress
categorical label imbalance
This column is the distinctive feature [stress], with two values: '-' and '0'. '-' accounts for 97.96% of all 105,484 rows (103,334) and '0' for only 2,150 rows. In this dataset's feature notation '0' marks a feature as not applicable rather than present, and the count of 2,150 matches the number of tone segments, so the column most likely separates tone rows ('0') from all other segments ('-') and never records a positive stress value. The entropy ratio of 0.144 confirms how little information it carries. Treatment: Do not read '0' as the positive class; the column appears to add nothing beyond SegmentClass and can usually be dropped for modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- -
- top_rate
- 0.9796
- cardinality
- 2
- entropy
- 0.1436
- entropy_ratio
- 0.1436
syllabic
categorical
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- -
- top_rate
- 0.6849
- cardinality
- 8
- entropy
- 1.042
- entropy_ratio
- 0.3472
short
categorical feature imbalance
This column is the distinctive feature [short], with only 4 distinct values: '-', '0', '+', and the compound '-,+'. It is severely imbalanced: '-' accounts for 97.76% of rows (103,125 of 105,484), while '+' appears just 204 times and '-,+' only 5 times; the remaining 2,150 rows are '0'. The entropy ratio of 0.082 confirms that the column carries very little information. Treatment: One-hot encode if needed, but flag the severe class imbalance (97.8% '-'); consider collapsing '+' and '-,+' into one class before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 4
- top_value
- -
- top_rate
- 0.9776
- cardinality
- 4
- entropy
- 0.1645
- entropy_ratio
- 0.08225
long
categorical feature
This column is the distinctive feature [long] (phonological length), not a longitude; it has only 6 distinct symbolic values across 105,484 rows. The dominant value '-' (not long) accounts for 89.9% of records. The compound values '-,+', '+,-', and '-,-,+' appear in just 104 rows combined; comma-separated values like these recur across the feature columns and most likely describe multi-part segments whose value changes from one part to the next, so they are not necessarily malformed. Treatment: Treat '-', '+', and '0' as a three-level categorical ('0' meaning not applicable, not a midpoint); inspect the 104 compound rows and either flag them or split them on ','.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- -
- top_rate
- 0.8991
- cardinality
- 6
- entropy
- 0.5537
- entropy_ratio
- 0.2142
consonantal
categorical feature
This column encodes the distinctive feature [consonantal], marking whether a segment is consonantal ('+') or not ('-'). The dominant value is '+' at 60.9% (64,257 rows), with '-' covering 37.0% (39,041 rows). A further 2,151 rows carry '0' (not applicable), close to the number of tone segments. The composite values '+,-' (34 rows) and '-,+' (1 row) are rare; they may reflect multi-part segments or ambiguous annotations and are worth reviewing. Treatment: One-hot encode '+', '-', and '0'; flag the 35 composite rows for review or split them on ','.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- +
- top_rate
- 0.6092
- cardinality
- 5
- entropy
- 1.085
- entropy_ratio
- 0.4672
sonorant
categorical
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- +
- top_rate
- 0.5301
- cardinality
- 8
- entropy
- 1.245
- entropy_ratio
- 0.4149
continuant
categorical feature
This column encodes the distinctive feature [continuant] (whether airflow continues through the vocal tract) using '+', '-', and '0' (not applicable). The dominant values are '+' (54.9%, 57,952 rows) and '-' (44,585 rows), which together account for ~97% of records. The other 6 of the 9 unique values are composite strings such as '-,+' or '-,-,+', covering 796 rows; they suggest that some rows describe multi-part segments rather than a single feature value. The entropy ratio of 0.37 reflects a distribution driven almost entirely by the +/- split. Treatment: Treat '+', '-', and '0' as a 3-class categorical; split or flag the 796 composite rows before encoding.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 9
- top_value
- +
- top_rate
- 0.5494
- cardinality
- 9
- entropy
- 1.172
- entropy_ratio
- 0.3696
delayedRelease
categorical feature
This column encodes the distinctive feature [delayed release], conventionally used to set affricates apart from plain stops. Its values are '0' (not applicable, 55% of rows), '-' (26%), '+' (19%), and compound combinations of these. The compound values ('-,+', '0,-,+', '+,-', '0,0,-,+') show that a cell sometimes packs several values into a comma-separated string, most likely one per part of a complex segment. No nulls are present, and the entropy ratio of 0.52 reflects a moderately skewed distribution led by '0'. Treatment: Split compound values on ',' into multi-hot binary columns ('has_0', 'has_minus', 'has_plus') before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- 0
- top_rate
- 0.5502
- cardinality
- 7
- entropy
- 1.471
- entropy_ratio
- 0.5238
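The multi-hot split suggested for delayedRelease can be sketched as follows. The column names `has_0`, `has_minus`, and `has_plus` come from the treatment note; the function name is hypothetical. Note that this encoding discards the order of values within a compound cell.

```python
def multi_hot(value):
    """Split a comma-separated feature value into presence flags."""
    parts = value.split(",")
    return {
        "has_0": "0" in parts,
        "has_minus": "-" in parts,
        "has_plus": "+" in parts,
    }


# Values listed in the delayedRelease description.
encoded = {v: multi_hot(v) for v in ["0", "-", "+", "-,+", "0,-,+", "0,0,-,+"]}
```

If the order of values matters for an analysis, keeping the split list itself is the safer representation, since '-,+' and '+,-' produce identical flags here.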
approximant
categorical feature
This column records whether a segment is an approximant, with '-' meaning not an approximant and '+' meaning approximant. The dominant value is '-' at 55.9% of rows (58,966), with '+' at 41.9% (44,266), giving a near-binary distribution (entropy 1.12 bits, entropy ratio 0.43). The compound values ('-,+', '-,-,+', '+,-') total just 102 rows, so a small minority of segments carry multi-valued classifications, possibly multi-part segments or transcription artifacts. Treatment: One-hot encode the dominant values ('-', '+') and '0'; bin the three compound values into a separate category or flag them for review before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- -
- top_rate
- 0.559
- cardinality
- 6
- entropy
- 1.12
- entropy_ratio
- 0.4333
tap
categorical feature imbalance
This column is the distinctive feature [tap], marking tap or flap articulations, with the values '-', '+', '0', and the compound sequences '-,+' and '-,-,+'. The dominant value '-' accounts for 96.7% of all 105,484 rows, producing severe class imbalance (entropy ratio of only 0.10): taps are rare. The compound entries show that a cell can hold an ordered sequence of values, most likely one per part of a complex segment, not just a single value. No nulls are present. Treatment: One-hot encode single-value categories; handle the compound values ('-,+', '-,-,+') by splitting on ',' or with a dedicated flag; oversample or weight minority classes before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- -
- top_rate
- 0.9672
- cardinality
- 5
- entropy
- 0.2421
- entropy_ratio
- 0.1043
trill
categorical feature imbalance
This column encodes the distinctive feature [trill], the presence or absence of a trill articulation, with 6 distinct values. The dominant value '-' accounts for 96.15% of 105,484 rows, giving a low entropy of 0.276 bits and triggering the imbalance alert. The minority values ('0', '+', and compound sequences such as '-,+') together cover about 4,060 rows (3.85%), so trills are rare in this dataset. The compound values '-,+', '-,-,+', and '+,-' (26, 8, and 2 rows respectively) hint at multi-part segments but are statistically negligible. Treatment: One-hot or ordinal encode with caution; with 96.15% in the negative class, most models will need oversampling, class weighting, or collapsing of rare compound categories before use.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- -
- top_rate
- 0.9615
- cardinality
- 6
- entropy
- 0.2762
- entropy_ratio
- 0.1069
nasal
categorical
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- -
- top_rate
- 0.8084
- cardinality
- 8
- entropy
- 0.897
- entropy_ratio
- 0.299
lateral
categorical feature
This column encodes the distinctive feature [lateral], which marks lateral articulations such as l-sounds, using '+' (lateral), '-' (not lateral), and '0' (not applicable). The dominant value '-' accounts for 93.8% of all 105,484 rows, giving a very low entropy ratio of 0.134. Some of the 8 distinct values are comma-delimited sequences (e.g., '-,+', '-,-,+'), which suggests multi-part segments encoded in one cell rather than in separate fields. No nulls are present. Treatment: Split comma-delimited compound values into multi-hot binary flags for '-', '+', and '0' before modelling; expect severe class imbalance.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- -
- top_rate
- 0.9382
- cardinality
- 8
- entropy
- 0.4012
- entropy_ratio
- 0.1337
labial
categorical feature
This column encodes the distinctive feature [labial] for each segment. The dominant value is '-' at 68.2% of 105,484 rows, with '+' at 26.8%, so most segments are non-labial. About 2.9% of values are compound strings such as '-,+', '+,-', or '-,-,+', which accounts for the high cardinality of 15 and indicates multi-part segments packed into a single cell; the column is therefore not atomic and needs parsing. Treatment: Split compound values on ',' into per-part values or encode them as an ordered multi-label; then one-hot or ordinal encode the atomic {'-', '+', '0'} values for modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 15
- top_value
- -
- top_rate
- 0.6822
- cardinality
- 15
- entropy
- 1.182
- entropy_ratio
- 0.3025
round
categorical
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- 0
- top_rate
- 0.703
- cardinality
- 8
- entropy
- 1.194
- entropy_ratio
- 0.398
labiodental
categorical feature
This column encodes the distinctive feature [labiodental] for each segment. The dominant value '0' (74,124 rows, 70.3%) marks the feature as not applicable, while '+' and '-' are the positive and negative feature values. Multi-valued strings such as '+,-', '-,+', and '+,+,-' appear in 60 rows in total, so a small number of segments carry composite annotations, which may reflect multi-part segments or data entry inconsistency. The entropy ratio of 0.39 confirms moderate imbalance with '0' dominating. Treatment: One-hot encode '+', '-', '0'; flag or inspect the 60 multi-valued rows ('+,-', '-,+', '+,+,-') for normalization before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- 0
- top_rate
- 0.7027
- cardinality
- 6
- entropy
- 1.006
- entropy_ratio
- 0.3891
coronal
categorical feature
This column encodes the distinctive feature [coronal], using the '+'/'-' notation of distinctive feature theory. The dominant value is '-' (62.8% of 105,484 rows), with '+' covering another 35.0%, so the two together account for ~98% of observations. The four compound values ('+,-', '-,+', '-,-,+', '+,-,+') are rare (at most 87 occurrences each, 135 rows in total) and most likely mark multi-part segments, though they are worth checking for encoding inconsistencies. Treatment: One-hot encode the primary values '+', '-', and '0'; isolate and inspect the 135 multi-value rows for parsing or splitting before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- -
- top_rate
- 0.6279
- cardinality
- 7
- entropy
- 1.08
- entropy_ratio
- 0.3848
anterior
categorical feature
This column encodes the distinctive feature [anterior], a place feature that in standard feature theory applies to coronal segments, with 6 distinct values across 105,484 rows and zero nulls. The dominant value '0' (64.8% of rows) marks the feature as not applicable, while '+' (24.4%) and '-' (10.8%) are the positive and negative values. A small number of rows (9, 5, and 3) contain the compound strings '-,+', '+,-', and '-,-,+', which probably describe multi-part segments. Treatment: One-hot encode '0', '+', and '-'; isolate and inspect the 17 compound-value rows before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- 0
- top_rate
- 0.6482
- cardinality
- 6
- entropy
- 1.251
- entropy_ratio
- 0.4839
distributed
categorical label
This column encodes the distinctive feature [distributed], using '0' (not applicable), '-', and '+', plus comma-separated sequences for multi-part segments. The dominant value '0' covers 66% of 105,484 rows, with '-' (21.1%) and '+' (12.5%) accounting for most of the remainder. In total, 11 distinct values arise from combinations of these three symbols, so some records hold a sequence of values rather than a single one. The entropy ratio of 0.37 confirms an uneven distribution concentrated in '0'. Treatment: One-hot encode the atomic values and split compound values on ','; be cautious about mapping {'-': -1, '0': 0, '+': 1} as an ordinal scale, because '0' means not applicable rather than a midpoint.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 11
- top_value
- 0
- top_rate
- 0.6602
- cardinality
- 11
- entropy
- 1.273
- entropy_ratio
- 0.3681
strident
categorical
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 9
- top_value
- 0
- top_rate
- 0.6485
- cardinality
- 9
- entropy
- 1.287
- entropy_ratio
- 0.406
dorsal
categorical feature
This column encodes the phonological distinctive feature [dorsal], which marks segments articulated with the tongue body, using '+', '-', and '0' symbols, sometimes as comma-separated sequences. The dominant values are '+' (51.7%, n=54,535) and '-' (n=47,052), together accounting for ~96.3% of rows, with '0' (n=2,160) most likely meaning "not applicable". Since '+', '-', and '0' are three of the 13 categories, the other 10 are multi-value strings like '-,+' or '-,-,+'. These most plausibly record ordered per-component values for complex segments rather than a single state, so the column mixes scalar and sequence values and needs normalization. Treatment: Split multi-value entries on ',' into ordered sub-features or one-hot encode each position before modelling; treat '+', '-', '0' as a ternary categorical for single-value rows.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 13
- top_value
- +
- top_rate
- 0.517
- cardinality
- 13
- entropy
- 1.235
- entropy_ratio
- 0.3338
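Splitting multi-value entries into ordered positions might look like the sketch below. The function name `to_positions` is hypothetical, and the default width of 3 is an assumption based on the longest sequences quoted in these descriptions (for example '-,-,+'); longer cells raise an error rather than being silently truncated.

```python
def to_positions(value, width=3, pad=None):
    """Split a cell like '-,+' into a fixed-width list of symbols.

    Assumption: no cell has more than `width` components.
    """
    tokens = [token.strip() for token in value.split(",")]
    if len(tokens) > width:
        raise ValueError(f"{value!r} has more than {width} components")
    # Pad on the right so every row yields the same number of sub-features.
    return tokens + [pad] * (width - len(tokens))

for cell in ["+", "-,+", "-,-,+"]:
    print(cell, "->", to_positions(cell))
```

Each position can then be one-hot encoded independently, with the pad value treated as its own "no component" category.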
high
categorical feature
This column encodes the phonological distinctive feature [high] (tongue-body height), using the tokens '0' (most likely "not applicable"), '+', and '-', including comma-separated sequences such as '-,+' or '+,-,+'. The dominant value is '0' at 46.7% (49,247 rows), followed by '+' at 35,559 and '-' at 19,156, so [+high] rows outnumber [-high] rows by roughly 1.9:1. Compound sequence values (e.g., '+,-,+', '-,+,+') occur with very low frequency, some only once or twice, and are rare edge cases that may need consolidation or separate treatment. Treatment: Encode the top 3 values ('0', '+', '-') as ordinal or one-hot features; collapse rare multi-value sequences (≤845 occurrences) into an 'other' or structured sub-category.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 11
- top_value
- 0
- top_rate
- 0.4669
- cardinality
- 11
- entropy
- 1.594
- entropy_ratio
- 0.4609
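Collapsing rare categories could be done along these lines. The function `collapse_rare` is a hypothetical helper, and the toy list below is illustrative, not taken from the dataset; on the real column a threshold of 846 would correspond to the "≤845 occurrences" cut-off mentioned in the description.

```python
from collections import Counter

def collapse_rare(values, min_count, other="other"):
    """Replace every value seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Toy data (illustrative only): three common symbols plus two rare sequences.
toy = ["0"] * 5 + ["+"] * 4 + ["-"] * 3 + ["-,+"] + ["+,-,+"]
collapsed = collapse_rare(toy, min_count=3)
print(Counter(collapsed))
```

Collapsing loses the internal structure of the sequences, so it fits best when the compound values are too rare to model individually.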
low
categorical
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- -
- top_rate
- 0.4733
- cardinality
- 8
- entropy
- 1.305
- entropy_ratio
- 0.4351
front
categorical feature
This column encodes the phonological distinctive feature [front] (tongue-body frontness), using '0', '-', and '+' as atomic tokens that can be combined into sequences (e.g., '-,+', '+,-,-'). The dominant value '0' (most likely "not applicable") accounts for 46.75% of rows, with '-' (34,225) and '+' (20,683) making up most of the remainder, so '-' outnumbers '+' by ~1.65:1. The compound values ('-,+', '+,-', etc.) pack what are most plausibly per-component values of complex segments into a single string, which needs parsing before modelling. Treatment: Parse composite sequences (e.g., '-,+') into ordered arrays; then one-hot or embed the atomic states before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 13
- top_value
- 0
- top_rate
- 0.4675
- cardinality
- 13
- entropy
- 1.592
- entropy_ratio
- 0.4302
back
categorical feature
This column encodes the phonological distinctive feature [back] (tongue-body backness), using a compact notation of '+', '-', and '0' tokens. The dominant value is '0' (most likely "not applicable") at 46.7% (49,270 rows), followed by '-' at 39,749 and '+' at 15,547, so '-' outnumbers '+' by roughly 2.6:1. Compound sequences like '+,-' (511) and '-,+' (367) exist but are rare, and a handful of three-component sequences appear in the tail; this variable-length encoding will cause problems if the column is treated as a simple label. Treatment: Parse composite multi-token values (e.g., '+,-') into structured sequences before modelling; consider one-hot or frequency encoding for the single-token majority.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 12
- top_value
- 0
- top_rate
- 0.4671
- cardinality
- 12
- entropy
- 1.521
- entropy_ratio
- 0.4244
tense
categorical
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- 0
- top_rate
- 0.7132
- cardinality
- 8
- entropy
- 1.114
- entropy_ratio
- 0.3712
retractedTongueRoot
categorical feature imbalance
This column encodes the phonological feature "retracted tongue root" (RTR), used in phonological databases to mark vowel or consonant articulation. The dominant value is '-' (absence of RTR) at 97.44% of 105,484 rows; only ~2,696 rows carry any other value, and not all of those are necessarily '+'. The compound values ('-,+', '-,-,+', etc.) suggest per-component annotation strings for complex segments rather than single-token labels. Treatment: Flag severe class imbalance (97.44% negative); use oversampling or class-weighted models if predicting RTR; consider splitting compound strings into per-component indicators before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- -
- top_rate
- 0.9744
- cardinality
- 7
- entropy
- 0.1935
- entropy_ratio
- 0.06892
advancedTongueRoot
categorical feature imbalance
This column encodes the Advanced Tongue Root (ATR) phonological feature, marked as '+' (present), '-' (absent), or '0' (neutral/not applicable), typical of linguistic phoneme inventories. The distribution is severely imbalanced: '-' dominates at 97.87% (103,238 of 105,484 rows), '0' appears in only 2,235 rows, and '+' is nearly absent with just 11 occurrences. The low entropy ratio (0.094) reflects this concentration, and with only 11 '+' rows any model predicting '+' would be extremely difficult to train without resampling. Treatment: Treat as a one-hot encoded categorical; oversample or reweight the '+' class (n=11) before modelling. Collapsing '+' with '0' is a last resort, since it merges "present" with "not applicable".
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- -
- top_rate
- 0.9787
- cardinality
- 3
- entropy
- 0.1496
- entropy_ratio
- 0.09438
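The entropy figures in these cards appear to be Shannon entropy in bits, with the ratio taken against log2 of the cardinality; that reading is an inference from the reported numbers, not something the report states. A short check using the three ATR counts given in the description:

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return entropy_bits(counts) / math.log2(k)

# Counts for '-', '0', '+' as stated in the column description.
atr_counts = [103238, 2235, 11]
print(round(entropy_bits(atr_counts), 4))
print(round(entropy_ratio(atr_counts), 5))
```

Under that assumption the results agree with the entropy and entropy_ratio values listed for this column to the precision shown.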
periodicGlottalSource
categorical
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- +
- top_rate
- 0.6797
- cardinality
- 7
- entropy
- 1.051
- entropy_ratio
- 0.3745
epilaryngealSource
categorical feature imbalance
This column encodes whether an epilaryngeal phonation source is present, using a ternary scheme: absent ('-'), present ('+'), or a neutral/not-applicable state ('0'). The distribution is severely imbalanced: the '-' (absent) class dominates at 97.9% of 105,484 rows, while '+' (present) appears in only 31 records (~0.03%), making positive-class detection extremely challenging. The low entropy (0.147 bits) confirms that the column carries very little variation as-is. Treatment: Flag severe class imbalance ('+' = 31 of 105,484); apply oversampling or class-weighted modelling, and consider binary encoding after collapsing '-' vs. non-'-'.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- -
- top_rate
- 0.9793
- cardinality
- 3
- entropy
- 0.1474
- entropy_ratio
- 0.09303
spreadGlottis
categorical
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 10
- top_value
- -
- top_rate
- 0.9182
- cardinality
- 10
- entropy
- 0.4965
- entropy_ratio
- 0.1495
constrictedGlottis
categorical feature
This column encodes the phonological feature [constricted glottis] as presence/absence symbols across 105,484 records, with no nulls and only 7 unique values. The dominant value is '-' (absent), appearing in 94.5% of rows (99,727), while '+' (present) accounts for just 3,383 rows. The compound values ('+,-', '-,+', '+,-,-', '-,-,+') suggest per-component sequences for a tiny fraction of records; with counts of 141, 93, 1, and 1 respectively, they are near-negligible. The low entropy (0.372 bits, entropy ratio 0.132) confirms the column is heavily imbalanced toward the negative class. Treatment: Binarize into a presence/absence flag; treat compound values containing '+' as positive; account for severe class imbalance (94.5% negative) during modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- -
- top_rate
- 0.9454
- cardinality
- 7
- entropy
- 0.3717
- entropy_ratio
- 0.1324
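A presence/absence flag that counts compound values as positive might be sketched like this. The helper `has_positive` is hypothetical; the counts are the six values quoted in the description (the seventh value of the column is not listed there, so it is left out).

```python
def has_positive(value):
    """True if any comma-separated component of the cell is '+'."""
    return "+" in [token.strip() for token in value.split(",")]

# Value counts quoted in the column description (seventh value not listed).
reported_counts = {
    "-": 99727,
    "+": 3383,
    "+,-": 141,
    "-,+": 93,
    "+,-,-": 1,
    "-,-,+": 1,
}

positives = sum(n for value, n in reported_counts.items() if has_positive(value))
print(positives)  # 3383 + 141 + 93 + 1 + 1 = 3619
```

Splitting on commas before testing for '+' keeps the check exact per component, which matters if other symbols are ever added to the notation.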
fortis
categorical feature
This column is a three-valued categorical flag for the phonological feature [fortis] (strong articulation), with values '-', '0', and '+'. The dominant value is '-' at 68.1% (71,867 rows), '0' accounts for 31.5% (33,202 rows), and '+' is rare at only 415 occurrences (~0.4%), creating a heavily imbalanced distribution. The '0' count equals the combined vowel and tone rows in the SegmentClass breakdown (31,052 + 2,150), which is consistent with '0' meaning "not applicable" for non-consonants. The scarcity of '+' should be handled as class imbalance before modelling. Treatment: Ordinal-encode as -1/0/1 or one-hot encode, and apply class-imbalance handling (e.g. oversampling '+') before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- -
- top_rate
- 0.6813
- cardinality
- 3
- entropy
- 0.9335
- entropy_ratio
- 0.589
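One common way to handle this imbalance is inverse-frequency class weights. The sketch below uses the n_samples / (n_classes * class_count) heuristic, which is also the formula behind scikit-learn's `class_weight="balanced"` option; the function name `balanced_class_weights` is hypothetical, and the counts are those quoted in the description.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Counts for '-', '0', '+' as stated in the column description.
fortis_counts = {"-": 71867, "0": 33202, "+": 415}
weights = balanced_class_weights(fortis_counts)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

With 415 positive rows the '+' class receives a weight of roughly 85, so reweighted models should be checked for overfitting to those few examples.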
lenis
categorical feature
This column represents the ternary phonological feature [lenis] (weak articulation), with values '-' (not lenis), '0' (most likely "not applicable"), and '+' (lenis). The dominant value is '-' at 68.1% (71,866 rows), while '+' is rare at only 416 occurrences (~0.4%), creating a heavily imbalanced distribution. Together '-' and '+' account for 72,282 rows, matching the consonant count in the SegmentClass breakdown. No nulls exist across all 105,484 rows, and entropy is 0.93 bits, well below the theoretical maximum for three classes, confirming the skew toward the negative class. Treatment: Encode as ordinal or one-hot; the '+' class (416 samples) is severely underrepresented and will require oversampling or class-weight adjustment before modelling.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- -
- top_rate
- 0.6813
- cardinality
- 3
- entropy
- 0.9336
- entropy_ratio
- 0.589
raisedLarynxEjective
categorical feature imbalance
This column encodes the presence or absence of the raised-larynx ejective phonological feature, using the '+'/'-'/'0' value notation of distinctive-feature matrices. The dominant value is '-' (absent), appearing in 96.4% of 105,484 rows, creating severe class imbalance with an entropy ratio of only 0.10. Minority values include '0' (2,150 rows, the same as the number of tone rows), '+' (1,573 rows), and compound sequences like '-,+' and '+,-', suggesting some segments receive per-component feature annotations. There are no nulls, so annotation coverage is complete, but the extreme skew means this feature will contribute little signal in most modelling contexts without deliberate resampling or grouping. Treatment: Collapse compound values ('-,+', '+,-', '-,-,+') into a unified category and apply oversampling or class-weight adjustment before modelling given 96.4% imbalance.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- -
- top_rate
- 0.9637
- cardinality
- 6
- entropy
- 0.2675
- entropy_ratio
- 0.1035
loweredLarynxImplosive
categorical feature imbalance
This column encodes a phonological feature, specifically whether a sound has a lowered-larynx implosive articulation, using a symbolic notation ('+', '-', '0', and combinations). The distribution is severely imbalanced: 97.3% of the 105,484 rows carry the absent value '-', with only 2,150 instances of '0' (the same as the number of tone rows), 716 of '+', and fewer than 10 compound-value entries. The low entropy ratio (0.088) confirms this column carries almost no information for most records, which would make it a very weak standalone predictor. Treatment: Flag the severe imbalance; consider binarising ('+' vs. all others) or dropping if the rare positive class is insufficient for modelling purposes.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- -
- top_rate
- 0.9727
- cardinality
- 5
- entropy
- 0.2034
- entropy_ratio
- 0.08759
click
categorical label
This column encodes the phonological feature [click], which marks click consonants, using the symbolic tokens '-', '0', and '+' plus the combination values '+,-' and '-,+'. The dominant value is '-' at 68.2% of 105,484 rows, while values containing '+' ('+', '+,-', '-,+') account for only ~311 rows combined (~0.3%), so the column is heavily imbalanced. The multi-value strings '+,-' and '-,+' most plausibly hold per-component values for complex segments stored in a single cell, and should be parsed before modelling. Treatment: Ordinal-encode or one-hot after splitting multi-value entries ('+,-', '-,+'); address class imbalance (~0.3% positive) before using as a classification target.
- n
- 105,484
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- -
- top_rate
- 0.6823
- cardinality
- 5
- entropy
- 0.9283
- entropy_ratio
- 0.3998