data trove phoible phonetics database

source /home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv · 105,484 rows · 49 columns · profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This dataset is PHOIBLE, a cross-linguistic phonological inventory database containing 105,484 phoneme-level records from 3,020 inventories covering 2,177 distinct Glottocodes (languages and dialects); each row describes a single phoneme and its distinctive feature values. The first thing to examine is the breakdown by SegmentClass: consonants dominate (~68.5%), followed by vowels (~29.4%) and tones (~2.0%), which shapes how almost every other feature distributes. A second focus is the Source column, which shows that the data comes from eight different source databases ('ph' alone accounts for 34%), so coverage and coding conventions are uneven across the corpus and could introduce systematic biases in any cross-linguistic comparison.

citing: SegmentClass.top_values · Source.top_values · LanguageName.n_unique · Glottocode.n_unique · Phoneme.top_values · row_count · column_count

Schema

49 columns
Per-column summary: column name, type, null rate, unique count. Alert flags, where raised, follow on the next line.
InventoryID numeric 0.0% 3,020
Glottocode text 0.0% 2,177
one_word short_text duplicates
ISO6393 text 0.0% 2,095
one_word short_text duplicates
LanguageName text 0.0% 2,716
one_word allcaps short_text duplicates
SpecificDialect categorical 0.0% 546
GlyphID text 0.0% 3,142
one_word allcaps short_text duplicates
Phoneme text 0.0% 3,142
one_word short_text duplicates
Allophones text 0.0% 6,892
one_word short_text duplicates
Marginal categorical 0.0% 3
SegmentClass categorical 0.0% 3
Source categorical 0.0% 8
tone categorical 0.0% 2
imbalance
stress categorical 0.0% 2
imbalance
syllabic categorical 0.0% 8
short categorical 0.0% 4
imbalance
long categorical 0.0% 6
consonantal categorical 0.0% 5
sonorant categorical 0.0% 8
continuant categorical 0.0% 9
delayedRelease categorical 0.0% 7
approximant categorical 0.0% 6
tap categorical 0.0% 5
imbalance
trill categorical 0.0% 6
imbalance
nasal categorical 0.0% 8
lateral categorical 0.0% 8
labial categorical 0.0% 15
round categorical 0.0% 8
labiodental categorical 0.0% 6
coronal categorical 0.0% 7
anterior categorical 0.0% 6
distributed categorical 0.0% 11
strident categorical 0.0% 9
dorsal categorical 0.0% 13
high categorical 0.0% 11
low categorical 0.0% 8
front categorical 0.0% 13
back categorical 0.0% 12
tense categorical 0.0% 8
retractedTongueRoot categorical 0.0% 7
imbalance
advancedTongueRoot categorical 0.0% 3
imbalance
periodicGlottalSource categorical 0.0% 7
epilaryngealSource categorical 0.0% 3
imbalance
spreadGlottis categorical 0.0% 10
constrictedGlottis categorical 0.0% 7
fortis categorical 0.0% 3
lenis categorical 0.0% 3
raisedLarynxEjective categorical 0.0% 6
imbalance
loweredLarynxImplosive categorical 0.0% 5
imbalance
click categorical 0.0% 5

InventoryID

numeric foreign_key
InventoryID is a numeric key identifying the phoneme inventory each row belongs to, with exactly 3,020 distinct values spanning 1–3,020 across 105,484 rows, so each ID repeats heavily (about 35 rows, i.e. about 35 phonemes, per inventory on average). The distribution is nearly flat and symmetric (skew ≈ −0.002, kurtosis ≈ −1.15, zero outliers), consistent with a sequential identifier rather than a measured quantity. Treatment: Left-join on this ID to an inventory-level table if one is available; do not use the raw numeric value as a feature. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
3,020
min
1
max
3,020
mean
1,479
median
1,464
std
843.1
q1
769
q3
2,237
iqr
1,468
skew
-0.002397
kurtosis
-1.146
n_outliers
0
outlier_rate
0
zero_rate
0
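
Because each row is one phoneme, counting rows per InventoryID gives the size of each inventory. The following is a minimal sketch using only the standard library; the in-memory sample is hypothetical and shows just two of the 49 columns.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the same shape as phoible.csv (two columns only).
sample = io.StringIO(
    "InventoryID,Phoneme\n"
    "1,m\n1,a\n1,k\n"
    "2,i\n2,t\n"
)

# One row per phoneme, so rows per InventoryID equals the inventory size.
inventory_sizes = Counter(row["InventoryID"] for row in csv.DictReader(sample))
mean_size = sum(inventory_sizes.values()) / len(inventory_sizes)

print(inventory_sizes)
print(mean_size)
```

On the full file the same calculation should give a mean close to 105,484 / 3,020 ≈ 35.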

Glottocode

text foreign_key one_word short_text duplicates
This column contains Glottocodes, the standardized 8-character language identifiers used by the Glottolog database (e.g., 'kham1282', 'dutc1256'). Almost every value is 8 characters long (mean 7.999, median 8.0), but the minimum length of 2 shows that a small number of rows hold a non-conforming value (possibly a placeholder such as 'NA') that should be checked before joining. With only 2,177 unique codes across 105,484 rows, the duplicate rate is 97.9%: each code recurs about 48 times on average, as expected when every phoneme of every inventory is a separate row. The top code 'kham1282' (Kham) appears 622 times, far above that average, which points to uneven language coverage (some languages have several inventories). Treatment: Left-join on this code against the Glottolog reference table to enrich with language family, geographic coordinates, and macro-area metadata. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
2,177
len_min
2
len_max
8
len_mean
7.999
len_median
8
len_p95
8
word_mean
1
word_median
1
n_empty
0
n_duplicates
103,307
duplicate_rate
0.9794
vocab_size
2,168
readability_flesch_mean
94.15
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
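
A quick format check can isolate the short values before any join. The sketch below assumes the usual Glottocode shape of four lowercase alphanumeric characters followed by four digits; the helper name `is_glottocode` is illustrative, not part of any library.

```python
import re

# Four lowercase letters/digits followed by four digits, e.g. 'kham1282'.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def is_glottocode(value: str) -> bool:
    """Return True when the whole string has the Glottocode shape."""
    return GLOTTOCODE_RE.fullmatch(value) is not None

values = ["kham1282", "dutc1256", "NA"]  # 'NA' is a hypothetical short value
invalid = [v for v in values if not is_glottocode(v)]
print(invalid)
```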

ISO6393

text label one_word short_text duplicates
This column contains ISO 639-3 three-letter language codes, a standardised identifier for natural languages. Every value is exactly 3 characters long (min, mean, max all equal 3) with zero nulls, confirming strict conformity to the standard's format. The duplicate rate is 98.0%: 2,095 distinct codes repeat across 105,484 rows, which is expected for a language tag applied to every phoneme row. The most frequent code is 'mis', the ISO 639-3 special code for uncoded languages, appearing 828 times (≈0.8% of rows); these rows belong to languages that have no ISO code of their own, so 'mis' does not identify a single language. Treatment: Use as a categorical grouping key; flag or separate records where ISO6393 equals 'mis' before language-based analysis, and group those rows by Glottocode instead. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
2,095
len_min
3
len_max
3
len_mean
3
len_median
3
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
103,389
duplicate_rate
0.9801
vocab_size
2,086
readability_flesch_mean
119.5
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

LanguageName

text label one_word allcaps short_text duplicates
This column contains human language names (e.g., 'Dutch', 'Chechen', 'Bengali', 'Iron Ossetic'), functioning as a categorical label with 2,716 unique values across 105,484 rows. The duplicate rate of 97.4% (102,768 duplicates) confirms it is a repeating label, not a free-text field. There are more names (2,716) than Glottocodes (2,177), so the same language can appear under several spellings or variants. Notably, 13.1% of values are all-caps, which most likely reflects source-specific naming conventions rather than different languages. The distribution is uneven: the top value 'Iron Ossetic' appears 444 times while many languages appear rarely, giving a long tail. Treatment: Prefer Glottocode as the grouping key; if the name is needed, normalise the casing inconsistencies flagged by the 13.1% all-caps rate before encoding as a categorical. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
2,716
len_min
2
len_max
79
len_mean
7.822
len_median
7
len_p95
16
word_mean
1.201
word_median
1
n_empty
0
n_duplicates
102,768
duplicate_rate
0.9743
vocab_size
2,670
readability_flesch_mean
53.18
emoji_rate
0
url_rate
0
one_word_rate
0.8433
allcaps_rate
0.1314
boilerplate_rate
0
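
One simple way to handle the casing inconsistency is to title-case only the values written entirely in capitals. This is a rough sketch; `normalise_name` is a hypothetical helper and 'KOREAN' is an illustrative input, and title-casing will not be right for every name (for example names containing apostrophes or click symbols).

```python
def normalise_name(name: str) -> str:
    """Title-case names stored entirely in capitals; leave others untouched."""
    return name.title() if name.isupper() else name

names = ["Dutch", "KOREAN", "Iron Ossetic"]
normalised = [normalise_name(n) for n in names]
print(normalised)
```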

SpecificDialect

categorical label
This column records the specific dialect that an inventory describes, where the source provides one. The dominant signal is missingness encoded as a label: 71.9% of rows carry the string 'NA' and a further 7,692 rows (≈7.3%) are empty strings, so roughly 79% of records lack a meaningful dialect value even though the null rate is 0.0%. Despite 546 unique values, the entropy ratio is only 0.33, confirming that real dialect labels are thinly and unevenly spread across the remaining ~22,000 rows. Treatment: Treat 'NA' and empty-string as missing; consider collapsing rare dialects (below a frequency threshold) into an 'Other' bucket before encoding, and flag the high missingness rate (~79%) before any modelling use. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
546
top_value
NA
top_rate
0.7187
cardinality
546
entropy
2.969
entropy_ratio
0.3265
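
The treatment above can be sketched as follows. The function name, the `min_count` threshold, and the sample dialect labels are all hypothetical; only the 'NA' and empty-string sentinels come from the profile.

```python
from collections import Counter

MISSING = {"NA", ""}  # string sentinels reported in the profile

def clean_dialects(values, min_count=2):
    """Map 'NA'/'' to None and collapse labels seen fewer than min_count times."""
    present = Counter(v for v in values if v not in MISSING)
    cleaned = []
    for v in values:
        if v in MISSING:
            cleaned.append(None)
        elif present[v] < min_count:
            cleaned.append("Other")
        else:
            cleaned.append(v)
    return cleaned

sample = ["NA", "", "Northern", "Northern", "Coastal"]
cleaned = clean_dialects(sample)
missing_rate = cleaned.count(None) / len(cleaned)
print(cleaned, missing_rate)
```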

GlyphID

text label one_word allcaps short_text duplicates
GlyphID stores the Unicode code points of each phoneme symbol as uppercase hexadecimal strings (e.g., '006D' = 'm', '0069' = 'i', '0061' = 'a'). It has 3,142 unique values, exactly as many as the Phoneme column, with an identical duplicate count (102,342, 97.0%), which is consistent with a one-to-one encoding of the phoneme string. All values are uppercase (allcaps_rate 1.0) and single-token (one_word_rate 1.0). The median length of 4 characters corresponds to a single code point; longer values (p95 14, maximum 54 characters) presumably chain several code points for phonemes written with diacritics or more than one base character. The top values map to 'm', 'i', and 'a' because these are the most common phonemes across inventories, not because the underlying data is Latin-script text. Treatment: Map hex strings to Unicode characters for interpretability; since the column appears redundant with Phoneme, keep only one of the two, and group by phonological class rather than one-hot encoding all 3,142 levels. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
3,142
len_min
4
len_max
54
len_mean
6.503
len_median
4
len_p95
14
word_mean
1
word_median
1
n_empty
0
n_duplicates
102,342
duplicate_rate
0.9702
vocab_size
1,343
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0
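
Decoding a GlyphID back to its character(s) is a short operation. The sketch below assumes that multi-code-point values are joined with '+'; that separator is an assumption and should be checked against the actual data, and `glyph_id_to_text` is a hypothetical helper name.

```python
def glyph_id_to_text(glyph_id: str, sep: str = "+") -> str:
    """Decode hex code points (e.g. '006D') into the character(s) they name.

    The '+' separator for multi-code-point values is an assumption.
    """
    return "".join(chr(int(code, 16)) for code in glyph_id.split(sep))

print(glyph_id_to_text("006D"))       # single code point
print(glyph_id_to_text("0074+02B0"))  # base letter plus modifier letter
```

If the decoded string always equals the Phoneme value on the same row, one of the two columns can be dropped.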

Phoneme

text label one_word short_text duplicates
This column contains the phoneme symbols themselves, in IPA notation, with 3,142 unique phoneme strings across 105,484 rows. Values are mostly single characters (len_median 1.0, len_mean 1.5, len_p95 3, len_max 11); the longer strings are phonemes written with diacritics or several characters. The duplicate rate is 97.0% (102,342 duplicates), which is expected because common phonemes recur across many inventories. Treatment: Encode as categorical for modelling only with care, since 3,142 levels is high-dimensional for one-hot encoding; grouping by manner or place of articulation, or using the distinctive feature columns directly, is usually more practical. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
3,142
len_min
1
len_max
11
len_mean
1.501
len_median
1
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
102,342
duplicate_rate
0.9702
vocab_size
1,339
readability_flesch_mean
114.4
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.001754
boilerplate_rate
0

Allophones

text feature one_word short_text duplicates
This column lists the allophones of each phoneme, that is, its phonetic variants (e.g., 'm', 'j', 'w', 's'), as very short strings with a mean length of ~2 characters. Strikingly, 53,580 of 105,484 rows (roughly 50.8%) carry the sentinel string 'NA', indicating that no allophone was recorded, which accounts for most of the 93.5% duplicate rate. About 8.7% of values contain more than one token (one_word_rate 0.913), and the 6,892 unique values are built from a vocabulary of only 1,263 tokens, which suggests that a cell can list several allophones drawn from a modest set of IPA symbols. Treatment: Treat 'NA' as missing; split multi-token cells into individual allophones before encoding or counting. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
6,892
len_min
1
len_max
37
len_mean
2.083
len_median
2
len_p95
4
word_mean
1.129
word_median
1
n_empty
0
n_duplicates
98,592
duplicate_rate
0.9347
vocab_size
1,263
readability_flesch_mean
116.2
emoji_rate
0
url_rate
0
one_word_rate
0.9131
allcaps_rate
0.00291
boilerplate_rate
0
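
A possible parsing step is sketched below. It assumes that cells listing several allophones separate them with whitespace, which the word-count statistics suggest but do not prove; `parse_allophones` and the multi-allophone sample cell are hypothetical.

```python
def parse_allophones(cell: str):
    """Return None for the 'NA' sentinel, else the whitespace-separated variants."""
    if cell == "NA":
        return None
    return cell.split()

cells = ["NA", "m", "s z"]  # the last cell is a hypothetical multi-allophone value
parsed = [parse_allophones(c) for c in cells]
print(parsed)
```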

Marginal

categorical feature
This column is a boolean flag indicating whether a phoneme is classified as marginal in its inventory, stored as three distinct string values: FALSE, NA, and TRUE. The dominant value is FALSE (78.9% of 105,484 rows), while TRUE is rare at only 1,347 records (≈1.3%). The string 'NA' appears in 20,874 rows (≈19.8%); these are not system nulls (null_rate is 0.0) but string-encoded missing values, which must be converted explicitly rather than relying on standard null handling. Treatment: Convert the string 'NA' to a real null, encode FALSE=0 and TRUE=1, then decide on an imputation strategy given the 19.8% missing rate. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
3
top_value
FALSE
top_rate
0.7893
cardinality
3
entropy
0.8122
entropy_ratio
0.5125
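
The conversion described above can be written as a small lookup. This is a minimal sketch; `encode_marginal` is a hypothetical helper, and it deliberately raises `KeyError` on any value other than the three reported in the profile.

```python
MARGINAL_MAP = {"FALSE": 0, "TRUE": 1, "NA": None}

def encode_marginal(value: str):
    """Map the three string values to 0, 1, or None (missing)."""
    return MARGINAL_MAP[value]  # KeyError signals an unexpected value

encoded = [encode_marginal(v) for v in ["FALSE", "TRUE", "NA"]]
print(encoded)
```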

SegmentClass

categorical label
SegmentClass is a phonological category label classifying 105,484 linguistic segments into exactly three classes: consonant, vowel, and tone. Consonants dominate at 68.5% (72,282 occurrences), vowels account for 29.4% (31,052), and tones are a small minority at just 2.0% (2,150) — a distribution consistent with natural language phoneme inventories but with tones notably underrepresented, suggesting the dataset skews toward non-tonal languages or tonal markings are partially absent. Zero nulls and perfect coverage make this a clean, reliable label. Treatment: One-hot encode or use as a stratification variable; monitor class imbalance for the 'tone' minority class (2,150 samples) in any classification task. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
3
top_value
consonant
top_rate
0.6852
cardinality
3
entropy
1.008
entropy_ratio
0.6357
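
The class shares and one common choice of class weights can be reproduced from the three counts quoted above. The weighting formula here (total divided by number of classes times class count) is one conventional "balanced" scheme, shown as an illustration rather than a recommendation from the profile.

```python
# Counts as reported in this section.
segment_counts = {"consonant": 72_282, "vowel": 31_052, "tone": 2_150}
total = sum(segment_counts.values())

shares = {label: count / total for label, count in segment_counts.items()}

# Balanced weights: total / (n_classes * class_count); rarer classes weigh more.
n_classes = len(segment_counts)
class_weights = {
    label: total / (n_classes * count) for label, count in segment_counts.items()
}

for label in segment_counts:
    print(f"{label:<10} share={shares[label]:.1%} weight={class_weights[label]:.2f}")
```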

Source

categorical label
This column identifies the source database or linguistic inventory from which each phonological record was drawn, with 8 distinct source codes across 105,484 rows and no nulls. The top source 'ph' dominates at 34.4% (36,274 rows), likely referring to PHOIBLE or a similar phoneme database, followed by 'ea', 'upsid', 'er', 'saphon', 'aa', 'spa', and 'ra'. The high entropy ratio of 0.899 indicates a relatively even spread across sources despite 'ph' leading — no single source overwhelmingly controls the data. Analysts should be aware that cross-source comparisons may introduce systematic coding differences, as each source may apply distinct phonological conventions. Treatment: One-hot encode or use as a stratification/grouping variable; check for source-specific biases before pooling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
8
top_value
ph
top_rate
0.3439
cardinality
8
entropy
2.697
entropy_ratio
0.8991

tone

categorical label imbalance
This column is the distinctive feature [tone], with only two values across 105,484 rows and no nulls: '+' and '0' (not applicable). The distribution is severely imbalanced: '0' accounts for 97.96% of records (103,334) while '+' appears in just 2,150 rows (~2%). That is exactly the number of rows in the 'tone' class of SegmentClass, so the column most likely just flags tone segments and duplicates information already in SegmentClass. The entropy of 0.144 bits (out of a maximum of 1.0 for a binary variable) confirms the near-total dominance of a single class. Treatment: Check whether the column is fully determined by SegmentClass and drop it if so; if it is used as a modelling target, apply class-balancing techniques (e.g., oversampling '+' or class-weight adjustment). high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
2
top_value
0
top_rate
0.9796
cardinality
2
entropy
0.1436
entropy_ratio
0.1436
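
The entropy and entropy_ratio figures in these stat blocks can be checked from the value counts. The sketch below assumes entropy is measured in bits and that the ratio divides by log2 of the number of distinct values, which matches the numbers reported for this column.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

tone_counts = [103_334, 2_150]  # '0' and '+' as reported above
print(round(entropy_bits(tone_counts), 4))
print(round(entropy_ratio(tone_counts), 4))
```

For a two-valued column the maximum is log2(2) = 1, which is why entropy and entropy_ratio coincide here.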

stress

categorical label imbalance
This column is the distinctive feature [stress], with two values: '-' and '0'. In the feature notation used throughout this dataset '0' means unspecified or not applicable, not "present", so no row is actually marked as stressed. The distribution is severely imbalanced: '-' accounts for 97.96% of all 105,484 rows (103,334 occurrences) versus only 2,150 rows for '0', the same count as the 'tone' class in SegmentClass, which suggests that tones are simply unspecified for stress. The entropy ratio of 0.144 confirms near-minimum information content. Treatment: Treat '0' as not applicable rather than as a positive class; because the column is constant apart from those 2,150 rows, it probably adds nothing beyond SegmentClass and can likely be dropped before modelling. medium · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
2
top_value
-
top_rate
0.9796
cardinality
2
entropy
0.1436
entropy_ratio
0.1436

syllabic

categorical
n
105,484
nulls
0 (0.0%)
unique
8
top_value
-
top_rate
0.6849
cardinality
8
entropy
1.042
entropy_ratio
0.3472

short

categorical feature imbalance
This column is the distinctive feature [short], which marks segments that are phonologically extra-short, with only 4 distinct values: '-', '0', '+', and the compound '-,+'. It is severely imbalanced: the dominant value '-' accounts for 97.76% of rows (103,125 of 105,484), while '+' appears just 204 times and the compound '-,+' only 5 times; the remaining 2,150 rows are '0' (not applicable). The near-zero entropy ratio (0.082) confirms the column carries very little information, and any model trained on it will be dominated by the '-' class. Treatment: One-hot encode, but flag the severe class imbalance (97.8% '-'); consider collapsing '+' and '-,+' into one class or dropping the column before modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
4
top_value
-
top_rate
0.9776
cardinality
4
entropy
0.1645
entropy_ratio
0.08225

long

categorical feature
This column is the distinctive feature [long], which marks phonologically long segments (such as long vowels and geminate consonants); it has nothing to do with longitude. It has only 6 distinct values across 105,484 rows. The dominant value '-' accounts for 89.9% of records, meaning most segments are not long. The compound values '-,+', '+,-', and '-,-,+' appear in just 104 rows combined and most likely give one value per component of a multi-part segment rather than being malformed. Treatment: Encode '-', '+', and '0' as a three-level categorical ('0' meaning not applicable, not a midpoint); split the 104 compound rows into per-component values or flag them separately. medium · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
6
top_value
-
top_rate
0.8991
cardinality
6
entropy
0.5537
entropy_ratio
0.2142

consonantal

categorical feature
This column encodes the distinctive feature [consonantal], marking whether a segment is consonantal (+) or not (-). The dominant value is '+' at 60.9% (64,257 rows), with '-' covering 37.0% (39,041 rows). A third value, '0' (2,151 rows, ≈2.0%), marks segments that are unspecified for the feature. The composite values '+,-' (34 rows) and '-,+' (1 row) probably give one value per component of a multi-part segment; they are rare enough to review by hand. Treatment: One-hot encode '+', '-', and '0' (the values are not ordered); split the composite entries ('+,-', '-,+') into per-component values or flag them for review. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
5
top_value
+
top_rate
0.6092
cardinality
5
entropy
1.085
entropy_ratio
0.4672

sonorant

categorical
n
105,484
nulls
0 (0.0%)
unique
8
top_value
+
top_rate
0.5301
cardinality
8
entropy
1.245
entropy_ratio
0.4149

continuant

categorical feature
This column encodes the distinctive feature [continuant], that is, whether airflow continues through the vocal tract during a speech sound, using the '+', '-', and '0' (unspecified or not applicable) notation standard in distinctive feature theory. The dominant values are '+' (54.9%, 57,952 rows) and '-' (42.3%, 44,585 rows), which together account for ~97% of records. Six of the 9 unique values are composite strings like '-,+' or '-,-,+' (796 rows in total), which indicates that some rows describe multi-part segments with one value per component rather than a single value. The entropy ratio of 0.37 reflects moderate concentration driven almost entirely by the +/- split. Treatment: Treat '+'/'-'/'0' as a 3-class categorical; split or flag the 796 composite rows before encoding. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
9
top_value
+
top_rate
0.5494
cardinality
9
entropy
1.172
entropy_ratio
0.3696

delayedRelease

categorical feature
This column encodes the distinctive feature [delayed release], which distinguishes affricates ('+') from plain stops ('-'); it is not a timing or scheduling flag. Across 105,484 records the values are '0' (not applicable, 55% of rows), '-' (26%), '+' (19%), and compound combinations thereof. The compound values ('-,+', '0,-,+', '+,-', '0,0,-,+') show that the column is sometimes multi-valued, packed as comma-separated strings with one value per component of a multi-part segment, which is structurally inconsistent with a simple categorical. No nulls are present, and the entropy ratio of 0.52 reflects a moderately skewed distribution dominated by the '0' class. Treatment: Split compound values on ',' into multi-hot binary columns ('has_0', 'has_minus', 'has_plus') before modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
7
top_value
0
top_rate
0.5502
cardinality
7
entropy
1.471
entropy_ratio
0.5238
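
The multi-hot treatment above might look like this. It is a minimal sketch: the column names `has_0`, `has_minus`, and `has_plus` come from the treatment note, while the function name `multi_hot` is hypothetical. Note that this representation discards the order of components.

```python
def multi_hot(cell: str) -> dict:
    """Split a cell such as '0,-,+' on commas and record which symbols occur."""
    parts = set(cell.split(","))
    return {
        "has_0": int("0" in parts),
        "has_minus": int("-" in parts),
        "has_plus": int("+" in parts),
    }

for cell in ["0", "-", "-,+", "0,0,-,+"]:
    print(cell, multi_hot(cell))
```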

approximant

categorical feature
This column encodes the distinctive feature [approximant], where '-' means the segment is not an approximant and '+' means it is. The dominant value is '-' at 55.9% of rows (58,966), with '+' accounting for 42.0% (44,266), giving a near-binary distribution (entropy 1.12 bits, entropy ratio 0.43). Three compound values ('-,+', '-,-,+', '+,-') total just 102 rows and most likely describe multi-part segments with one value per component; the remaining 2,150 rows fall under the sixth value (presumably '0', unspecified). Treatment: One-hot encode the dominant values; split the compound values into per-component values or bin them into a separate category, and review them before modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
6
top_value
-
top_rate
0.559
cardinality
6
entropy
1.12
entropy_ratio
0.4333

tap

categorical feature imbalance
This column encodes the distinctive feature [tap], which marks tapped or flapped articulations, using the values '-', '+', '0', and the compound strings '-,+' and '-,-,+'. The dominant value '-' accounts for 96.7% of all 105,484 rows, producing severe class imbalance (entropy ratio of only 0.10): taps are rare across inventories. The compound entries ('-,+', '-,-,+') most likely give one value per component of a multi-part segment, in order, rather than recording any kind of change over time. No nulls are present. Treatment: One-hot encode the single-value categories; split compound values into per-component values or add a binary flag for multi-part segments; weight or oversample the minority classes before modelling. medium · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
5
top_value
-
top_rate
0.9672
cardinality
5
entropy
0.2421
entropy_ratio
0.1043

trill

categorical feature imbalance
This column encodes the distinctive feature [trill], the presence or absence of a trilled articulation, using 6 distinct values. The dominant value is "-" (not a trill), which accounts for 96.15% of 105,484 rows, producing a low entropy of 0.276 bits and triggering an imbalance alert. The minority values ("0", "+", and compound sequences like "-,+") together cover only about 4,060 rows, so trills are rare in this dataset. The compound values ("-,+", "-,-,+", "+,-"), with counts of 26, 8, and 2 respectively, point to multi-part segments annotated per component, but are statistically negligible. Treatment: One-hot encode with caution; the severe class imbalance (96.15% negative class) means most models will need oversampling, class weighting, or collapsing of the rare compound categories before use. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
6
top_value
-
top_rate
0.9615
cardinality
6
entropy
0.2762
entropy_ratio
0.1069

nasal

categorical
n
105,484
nulls
0 (0.0%)
unique
8
top_value
-
top_rate
0.8084
cardinality
8
entropy
0.897
entropy_ratio
0.299

lateral

categorical feature
This column encodes the distinctive feature [lateral], which marks sounds produced with airflow along the sides of the tongue (such as l-like sounds); it is not a clinical or anatomical laterality code. The values are '-' (not lateral), '+' (lateral), and '0' (unspecified or not applicable). The dominant value '-' accounts for 93.8% of all 105,484 rows, producing a very low entropy ratio of 0.134, so the column is heavily skewed toward a single class. Some values are comma-delimited sequences (e.g., '-,+', '-,-,+'), which most likely give one value per component of a multi-part segment rather than being stored in separate fields. No nulls are present. Treatment: Split comma-delimited compound values into multi-hot binary flags for '-', '+', and '0' presence before modelling; expect severe class imbalance. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
8
top_value
-
top_rate
0.9382
cardinality
8
entropy
0.4012
entropy_ratio
0.1337

labial

categorical feature
This column captures labial articulation features in a phonological or linguistic dataset, encoding presence/absence/neutrality of a labial feature per segment (or per segment sequence). The dominant value is '-' (absent) at 68.2% of 105,484 rows, with '+' (present) at 26.8%, suggesting most segments are non-labial. Surprisingly, ~2.9% of values are compound strings like '-,+', '+,-', '-,-,+', indicating multi-segment bundles packed into a single cell rather than a flat atomic feature, which creates a parsing challenge and implies the column is not fully normalized. Treatment: Split compound values on ',' into separate per-segment records or encode as ordered multi-label; then one-hot or ordinal encode the atomic {'-', '+', '0'} values for modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
15
top_value
-
top_rate
0.6822
cardinality
15
entropy
1.182
entropy_ratio
0.3025
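
When component order matters, a compound cell can be parsed into an ordered tuple instead of multi-hot flags. The sketch below is one possible approach; the numeric mapping and the helper names `parse_feature` and `is_compound` are illustrative assumptions, and the numbers are labels, not an ordinal scale.

```python
VALUE_MAP = {"-": -1, "0": 0, "+": 1}  # illustrative numeric labels

def parse_feature(cell: str) -> tuple:
    """Turn '-,-,+' into (-1, -1, 1), keeping the order of components."""
    return tuple(VALUE_MAP[part] for part in cell.split(","))

def is_compound(cell: str) -> bool:
    """True when the cell holds more than one component value."""
    return "," in cell

cells = ["-", "+", "-,+", "-,-,+"]
parsed = [parse_feature(c) for c in cells]
compound_rate = sum(is_compound(c) for c in cells) / len(cells)
print(parsed, compound_rate)
```

The same helpers apply unchanged to the other feature columns that contain comma-separated values.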

round

categorical
n
105,484
nulls
0 (0.0%)
unique
8
top_value
0
top_rate
0.703
cardinality
8
entropy
1.194
entropy_ratio
0.398

labiodental

categorical feature
This column encodes labiodental phonetic feature annotations, likely from a linguistic or phonological dataset where each row represents a speech sound or segment. The dominant value '0' (74,124 rows, 70.3%) indicates the feature is absent or neutral, while '+' and '-' mark positive and negative feature values respectively — a standard binary distinctive-feature notation. Notably, multi-valued strings like '+,-', '-,+', and '+,+,-' appear (60 rows total), suggesting a small number of segments carry conflicting or composite annotations, which may indicate data entry inconsistency or deliberate underspecification. Entropy ratio of 0.39 confirms moderate imbalance with '0' dominating. Treatment: One-hot encode '+', '-', '0'; flag or inspect the 60 multi-valued rows ('+,-', '-,+', '+,+,-') for normalization before modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
6
top_value
0
top_rate
0.7027
cardinality
6
entropy
1.006
entropy_ratio
0.3891

coronal

categorical feature
This column encodes coronal articulation features in what appears to be a phonological or linguistic dataset, using binary '+'/'-' notation common in distinctive feature theory. The dominant value is '-' (62.8% of 105,484 rows), with '+' covering another 35.0%, making the two primary values account for ~98% of observations. The remaining values ('+,-', '-,+', '-,-,+', '+,-,+') suggest multi-segment or compound entries, which are rare (≤87 occurrences) but worth flagging as potential encoding inconsistencies or multi-value cells that may need splitting. Treatment: Encode primary values '+'/'-'/'0' as ordinal or one-hot features; isolate and inspect the 135 multi-value rows for parsing or splitting before modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
7
top_value
-
top_rate
0.6279
cardinality
7
entropy
1.08
entropy_ratio
0.3848

anterior

categorical feature
This column encodes the distinctive feature [anterior], which separates coronal sounds made at or in front of the alveolar ridge ('+') from those made further back ('-'); it is not an anatomical finding. It has 6 distinct values across 105,484 rows and zero nulls. The dominant value '0' (64.8% of rows) most plausibly marks segments to which the feature does not apply, such as non-coronals, while '+' (24.4%) and '-' (10.8%) are the specified values. A small number of rows (9, 5, and 3) contain compound strings like '-,+', '+,-', and '-,-,+', which probably give one value per component of a multi-part segment. Treatment: One-hot encode '0', '+', and '-' (the values are not ordered); isolate and inspect the 17 compound-value rows before modelling. medium · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
6
top_value
0
top_rate
0.6482
cardinality
6
entropy
1.251
entropy_ratio
0.4839

distributed

categorical label
This column encodes the distinctive feature [distributed], which describes how far a coronal constriction extends along the vocal tract (broadly, laminal versus apical articulation); it is not a change-direction indicator. The values are '0' (unspecified or not applicable), '-', '+', and comma-separated sequences. The dominant value '0' covers 66% of 105,484 rows, with '-' (21.1%) and '+' (12.5%) accounting for most of the remainder. The 11 distinct values arise from combinations of these three symbols, which indicates that some records describe multi-part segments with one value per component. The entropy ratio of 0.37 confirms moderate but uneven information content, concentrated in the '0' class. Treatment: Split on commas and encode the atomic values as a three-level categorical or as multi-hot flags; a numeric mapping such as {'-': -1, '0': 0, '+': 1} is only a labelling convenience, since '0' is not a midpoint between '-' and '+'. medium · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
11
top_value
0
top_rate
0.6602
cardinality
11
entropy
1.273
entropy_ratio
0.3681

strident

categorical
n
105,484
nulls
0 (0.0%)
unique
9
top_value
0
top_rate
0.6485
cardinality
9
entropy
1.287
entropy_ratio
0.406

dorsal

categorical feature
This column encodes the distinctive feature [dorsal], which marks sounds articulated with the body of the tongue (vowels and consonants such as velars); it has nothing to do with biological specimens. Values use the '+', '-', and '0' symbols, sometimes as comma-separated sequences. The dominant values are '+' (51.7%, n=54,535) and '-' (n=47,052), together accounting for ~96.3% of rows, with '0' (unspecified, n=2,160) next. Ten of the 13 categories are multi-value strings like '-,+' or '-,-,+' (1,737 rows in total), which indicates that some records give one value per component of a multi-part segment rather than a single state; this structural inconsistency warrants normalization. Treatment: Split multi-value entries on ',' into ordered per-component values or multi-hot flags before modelling; treat '+', '-', '0' as a ternary categorical for single-value rows. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
13
top_value
+
top_rate
0.517
cardinality
13
entropy
1.235
entropy_ratio
0.3338

high

categorical feature
This column encodes the distinctive feature [high], which marks sounds made with the tongue body raised (for example high vowels); it is not a price or time-series signal. Values are '0' (unspecified or not applicable), '+', and '-', plus comma-separated sequences like '-,+' or '+,-,+'. The dominant value is '0' at 46.7% (49,247 rows), followed by '+' at 35,559 and '-' at 19,156, so among specified segments high outnumbers non-high by almost two to one. The compound values (e.g., '+,-,+', '-,+,+') cover the remaining 1,522 rows, some appearing only once or twice, and probably describe multi-part segments such as diphthongs with one value per component. Treatment: One-hot encode the top 3 values ('0', '+', '-'); split the compound values (each ≤845 occurrences) into per-component values or collapse them into an 'other' category. medium · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
11
top_value
0
top_rate
0.4669
cardinality
11
entropy
1.594
entropy_ratio
0.4609

low

categorical
n
105,484
nulls
0 (0.0%)
unique
8
top_value
-
top_rate
0.4733
cardinality
8
entropy
1.305
entropy_ratio
0.4351

front

categorical feature
This column encodes the distinctive feature [front], which marks sounds made with the tongue body advanced (for example front vowels), using '0', '-', and '+' as atomic values that can be combined into comma-separated sequences (e.g., '-,+', '+,-,-'). The dominant value '0' (unspecified or not applicable) accounts for 46.75% of rows, with '-' (34,225) and '+' (20,683) making up most of the remainder, so non-front segments outnumber front ones by ~1.65:1. The compound values ('-,+', '+,-', etc.) probably give one value per component of a multi-part segment such as a diphthong, which means the column needs parsing before modelling. Treatment: Split composite values (e.g., '-,+') into ordered per-component values; then one-hot encode the atomic states before modelling. medium · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
13
top_value
0
top_rate
0.4675
cardinality
13
entropy
1.592
entropy_ratio
0.4302
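Compound values such as '-,+' can be split into their atomic parts before any encoding. The sketch below assumes that the comma is the only separator and that '+', '-', and '0' are the only atomic tokens, which matches the values listed for this column; `parse_feature` and `summarize` are hypothetical helper names.

```python
ATOMIC = {"+", "-", "0"}

def parse_feature(value):
    """Split a feature string such as '-,+' into a tuple of atomic values."""
    parts = tuple(p.strip() for p in value.split(","))
    for p in parts:
        if p not in ATOMIC:
            raise ValueError(f"unexpected feature token: {p!r}")
    return parts

def summarize(value):
    """Derive simple per-row descriptors from a parsed feature string."""
    parts = parse_feature(value)
    return {
        "n_parts": len(parts),      # 1 for simple values, >1 for compounds
        "any_plus": "+" in parts,   # True if any component is '+'
        "first": parts[0],
        "last": parts[-1],
    }

print(parse_feature("-,+"))
print(summarize("+,-,-"))
```

Keeping `first` and `last` separately preserves the order of the components, which a plain count of '+' values would lose.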

back

categorical feature
This column encodes the PHOIBLE distinctive feature [back] (tongue-body backness), using a compact notation of '+', '-', and '0' values. The dominant value is '0' at 46.7% (49,270 rows), which most likely marks segments for which the feature is not applicable, followed by '-' at 39,749 and '+' at 15,547, so [-back] segments outnumber [+back] ones by roughly 2.6:1. Compound values like '+,-' (511) and '-,+' (367) exist but are rare, and a handful of three-part values appear at the tail; this variable-length encoding could cause parsing issues if treated as a simple label. Treatment: Parse compound values (e.g., '+,-') into per-component sequences before modelling; consider one-hot or frequency encoding for the simple single-value majority. medium · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
12
top_value
0
top_rate
0.4671
cardinality
12
entropy
1.521
entropy_ratio
0.4244

tense

categorical
n
105,484
nulls
0 (0.0%)
unique
8
top_value
0
top_rate
0.7132
cardinality
8
entropy
1.114
entropy_ratio
0.3712

retractedTongueRoot

categorical feature imbalance
This column encodes the phonological feature 'retracted tongue root' (RTR), a distinctive-feature annotation used in phonological databases to mark vowel or consonant articulation. The dominant value is '-' (absence of RTR) at 97.44% of 105,484 rows, so only about 2,700 rows carry any other value and positive RTR marking is extremely rare in this dataset. The compound values ('-,+', '-,-,+', etc.) suggest per-component annotation strings for complex segments rather than single-value labels. Treatment: Flag severe class imbalance (97.44% negative); use oversampling or class-weighted models if predicting RTR; consider splitting compound strings into per-component binary indicators before modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
7
top_value
-
top_rate
0.9744
cardinality
7
entropy
0.1935
entropy_ratio
0.06892

advancedTongueRoot

categorical feature imbalance
This column encodes the Advanced Tongue Root (ATR) phonological feature, a binary articulatory distinction marked as '+' (present), '-' (absent), or '0' (neutral/not applicable), typical of linguistic phoneme inventories. The distribution is severely imbalanced: '-' dominates at 97.87% (103,238 of 105,484 rows), '0' appears in only 2,235 rows, and '+' is nearly absent with just 11 occurrences. The near-zero entropy ratio (0.094) confirms that '+ATR' is a vanishingly rare feature in this dataset, which would make any model predicting '+' extremely difficult to train without resampling. Treatment: Treat as ordinal or one-hot encoded categorical; oversample or reweight the '+' class (n=11) before modelling, or collapse '+' and '0' if class separation is infeasible. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
3
top_value
-
top_rate
0.9787
cardinality
3
entropy
0.1496
entropy_ratio
0.09438
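The entropy figures in these cards can be reproduced from the value counts. The sketch below assumes that entropy is Shannon entropy in bits and that entropy_ratio divides it by log2 of the cardinality; the report does not state either definition, but both are consistent with the numbers shown here. The function names are hypothetical, and the counts are the ones quoted in the description above.

```python
import math

def entropy_bits(counts):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(counts)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)

# Counts for '-', '0', '+' as quoted in the advancedTongueRoot description.
atr_counts = [103238, 2235, 11]

h = entropy_bits(atr_counts)
r = entropy_ratio(atr_counts)
print(f"entropy = {h:.4f} bits, entropy_ratio = {r:.5f}")
```

A ratio near 0 means almost all rows share one value; a ratio of 1 would mean all values are equally frequent.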

periodicGlottalSource

categorical
n
105,484
nulls
0 (0.0%)
unique
7
top_value
+
top_rate
0.6797
cardinality
7
entropy
1.051
entropy_ratio
0.3745

epilaryngealSource

categorical feature imbalance
This column encodes whether an epilaryngeal phonation source is present, using a ternary scheme: absent ('-'), present ('+'), or not applicable ('0'). The distribution is severely imbalanced: the '-' (absent) class dominates at 97.9% of 105,484 rows, while '+' (present) appears in only 31 records (~0.03%), making positive-class detection extremely challenging. The near-zero entropy (0.147 bits) confirms that the column carries almost no information as-is. Treatment: Flag severe class imbalance ('+' = 31 of 105,484); apply oversampling or class-weighted modelling, and consider binary encoding after collapsing '-' vs. non-'-'. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
3
top_value
-
top_rate
0.9793
cardinality
3
entropy
0.1474
entropy_ratio
0.09303

spreadGlottis

categorical
n
105,484
nulls
0 (0.0%)
unique
10
top_value
-
top_rate
0.9182
cardinality
10
entropy
0.4965
entropy_ratio
0.1495

constrictedGlottis

categorical feature
This column encodes the phonological feature [constricted glottis], recorded as presence/absence symbols across 105,484 records with no nulls and only 7 unique values. The dominant value is '-' (absent), appearing in 94.5% of rows (99,727), while '+' (present) accounts for just 3,383 rows, so glottal constriction is a rare positive feature. The compound values ('+,-', '-,+', '+,-,-', '-,-,+') most likely give one value per component of a complex segment, but with counts of 141, 93, 1, and 1 respectively they are near-negligible. The low entropy (0.372 bits, entropy ratio 0.132) confirms the column is heavily imbalanced toward the negative class. Treatment: Binarize into a presence/absence flag; treat compound values as positive; account for severe class imbalance (94.5% negative) during modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
7
top_value
-
top_rate
0.9454
cardinality
7
entropy
0.3717
entropy_ratio
0.1324
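The binarization suggested for this column could look like the sketch below. It assumes that a row counts as positive when any comma-separated component is '+', and that '0' is treated as negative; `has_plus` is a hypothetical helper name and the sample values are illustrative.

```python
def has_plus(value):
    """Return 1 if any comma-separated component of `value` is '+', else 0."""
    return int("+" in (p.strip() for p in value.split(",")))

# The compound values listed for this column, plus the simple ones.
sample = ["-", "+", "0", "+,-", "-,+", "+,-,-", "-,-,+"]
flags = [has_plus(v) for v in sample]
print(dict(zip(sample, flags)))
```

Every compound value listed for this column contains a '+', so under this rule all of them map to 1.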

fortis

categorical feature
This column is a three-valued categorical flag for the phonological feature [fortis] (strong articulation), with values '-' (absent), '0' (most likely not applicable), and '+' (present). The dominant value is '-' at 68.1% (71,867 rows), '0' accounts for 31.5% (33,202 rows), and '+' is strikingly rare at only 415 occurrences (~0.4%), creating a heavily imbalanced distribution. The near-absence of '+' values means fortis is explicitly marked on very few segments, which needs class-imbalance handling before modelling. Treatment: Ordinal-encode as -1/0/1 or one-hot encode, and apply class-imbalance handling (e.g. oversampling '+') before modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
3
top_value
-
top_rate
0.6813
cardinality
3
entropy
0.9335
entropy_ratio
0.589
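One common way to handle this kind of imbalance is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, weight = n_rows / (n_classes * class_count), which is the same formula scikit-learn documents for `class_weight="balanced"`; it is one option among several, not a recommendation taken from the report. The counts are the ones quoted in the fortis description, and `balanced_weights` is a hypothetical helper name.

```python
def balanced_weights(counts):
    """Weight each class by n_rows / (n_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

# Value counts quoted in the fortis description.
fortis_counts = {"-": 71867, "0": 33202, "+": 415}

weights = balanced_weights(fortis_counts)
for label, w in weights.items():
    print(f"{label!r}: {w:.4f}")
```

With these counts the '+' class receives a weight of roughly 85, against about 0.5 for '-', which shows how strongly a weighted model would have to lean on just 415 rows.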

lenis

categorical feature
This column represents the ternary phonological feature [lenis] (weak articulation), with values '-' (absent), '0' (most likely not applicable), and '+' (present/lenis). The dominant value is '-' at 68.1% (71,866 rows), while the positive lenis marker '+' is strikingly rare at only 416 occurrences (~0.4%), creating a heavily imbalanced distribution. No nulls exist across all 105,484 rows, and entropy is moderate at 0.93 bits, well below the theoretical maximum of about 1.58 bits for three values, confirming the skew toward the negative class. Treatment: Encode as ordinal or one-hot; be aware that the '+' class (416 samples) is severely underrepresented and will require oversampling or class-weight adjustment before modelling. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
3
top_value
-
top_rate
0.6813
cardinality
3
entropy
0.9336
entropy_ratio
0.589

raisedLarynxEjective

categorical feature imbalance
This column encodes the presence or absence of the raised-larynx ejective phonological feature, using the '+'/'-'/'0' feature-value notation common in phonological feature matrices. The dominant value is '-' (absent), appearing in 96.4% of 105,484 rows, creating severe class imbalance with an entropy ratio of only 0.10. Minority values include '0' (2,150 rows), '+' (1,573 rows), and compound values like '-,+' and '+,-', suggesting that some complex segments receive one value per component. There are no nulls, so annotation coverage is complete, but the extreme skew means this feature will contribute negligible signal in most modelling contexts without deliberate resampling or grouping. Treatment: Collapse compound values ('-,+', '+,-', '-,-,+') into a unified category and apply oversampling or class-weight adjustment before modelling given 96.4% imbalance. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
6
top_value
-
top_rate
0.9637
cardinality
6
entropy
0.2675
entropy_ratio
0.1035

loweredLarynxImplosive

categorical feature imbalance
This column encodes a phonological feature — specifically whether a sound has a lowered larynx implosive articulation — using a symbolic notation system ('+', '-', and combinations). The distribution is severely imbalanced: 97.3% of the 105,484 rows carry the default/absent value '-', with only 2,150 instances of '0', 716 of '+', and fewer than 10 combined entries. The near-zero entropy ratio (0.088) confirms this column carries almost no information for most records, which would make it a very weak standalone predictor. Treatment: Flag the severe imbalance; consider binarising ('+' vs. all others) or dropping if the rare positive class is insufficient for modelling purposes. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
5
top_value
-
top_rate
0.9727
cardinality
5
entropy
0.2034
entropy_ratio
0.08759

click

categorical label
This column encodes the phonological feature [click], which marks click consonants, using the symbolic values '-', '0', and '+' plus the compound values '+,-' and '-,+'. The dominant value is '-' at 68.2% of 105,484 rows, while values containing '+' ('+', '+,-', '-,+') account for only ~311 rows combined (~0.3%), so click segments are a heavily underrepresented class. The compound strings '+,-' and '-,+' most likely give one value per component of a complex segment, stored in a single cell, and should be split before modelling. Treatment: Ordinal-encode or one-hot after splitting compound entries ('+,-', '-,+'); treat class imbalance (~0.3% positive) before using as a classification target. high · anthropic:default
n
105,484
nulls
0 (0.0%)
unique
5
top_value
-
top_rate
0.6823
cardinality
5
entropy
0.9283
entropy_ratio
0.3998