saturn

/home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv 105,484 rows sample n=105,484 seed 42 2026-06-22T00:14:33+00:00

Overview

Source	/home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv
Total rows	105,484
Profiled sample	105,484
Columns	49
Generated	2026-06-22T00:14:33+00:00

Per-column null rate across the corpus.
column	kind	null %
InventoryID	numeric	0.0%
Glottocode	text	0.0%
ISO6393	text	0.0%
LanguageName	text	0.0%
SpecificDialect	categorical	0.0%
GlyphID	text	0.0%
Phoneme	text	0.0%
Allophones	text	0.0%
Marginal	categorical	0.0%
SegmentClass	categorical	0.0%
Source	categorical	0.0%
tone	categorical	0.0%
stress	categorical	0.0%
syllabic	categorical	0.0%
short	categorical	0.0%
long	categorical	0.0%
consonantal	categorical	0.0%
sonorant	categorical	0.0%
continuant	categorical	0.0%
delayedRelease	categorical	0.0%
approximant	categorical	0.0%
tap	categorical	0.0%
trill	categorical	0.0%
nasal	categorical	0.0%
lateral	categorical	0.0%
labial	categorical	0.0%
round	categorical	0.0%
labiodental	categorical	0.0%
coronal	categorical	0.0%
anterior	categorical	0.0%
distributed	categorical	0.0%
strident	categorical	0.0%
dorsal	categorical	0.0%
high	categorical	0.0%
low	categorical	0.0%
front	categorical	0.0%
back	categorical	0.0%
tense	categorical	0.0%
retractedTongueRoot	categorical	0.0%
advancedTongueRoot	categorical	0.0%
periodicGlottalSource	categorical	0.0%
epilaryngealSource	categorical	0.0%
spreadGlottis	categorical	0.0%
constrictedGlottis	categorical	0.0%
fortis	categorical	0.0%
lenis	categorical	0.0%
raisedLarynxEjective	categorical	0.0%
loweredLarynxImplosive	categorical	0.0%
click	categorical	0.0%

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:default.

Dataset high anthropic:default

This dataset is PHOIBLE, a cross-linguistic phonological inventory database. It contains 105,484 phoneme-level records from 3,020 inventories covering 2,177 distinct Glottocodes, and each row describes a single phoneme and its distinctive feature values. The first thing to examine is the breakdown by SegmentClass: consonants dominate (~68.5%), followed by vowels (~29.4%) and tones (~2%), which shapes how almost every other feature distributes. A second focus is the Source column: the data comes from eight contributing sources ('ph' alone accounts for 34%), so coverage and coding conventions are uneven across the corpus and could introduce systematic biases in any cross-linguistic comparison.

GlyphID high anthropic:default

GlyphID stores the Unicode code points of each phoneme's characters as uppercase 4-digit hexadecimal strings (e.g., '006D' = 'm', '0069' = 'i', '0061' = 'a'), joined with '+' when a phoneme spans several characters (e.g., '006E+0064+007A'). It has 3,142 unique values across 105,484 rows, the same count as the Phoneme column, so the 97.0% duplicate rate reflects the same phonemes recurring across inventories; despite its name, the column identifies a phoneme's glyph sequence, not a row. All values are uppercase (allcaps_rate 1.0) and single-token (one_word_rate 1.0). The median length is 4 characters (a single code point), while the mean of 6.5 and the maximum of 54 come from multi-code-point sequences for segments with diacritics or several base characters.

LanguageName high anthropic:default

This column contains human language names (e.g., 'Dutch', 'Chechen', 'Bengali', 'Iron Ossetic'), a categorical label with 2,716 unique values across 105,484 rows. The duplicate rate of 97.4% (102,768 duplicates) confirms it is a repeating label, not a free-text field. With 2,716 names against 2,177 Glottocodes, names do not map one-to-one onto languages, so they should not be used as a join key. Notably, 13.1% of values are all-caps, which may reflect source-specific capitalization conventions rather than abbreviations; case should be normalized before grouping by name. The distribution is uneven: the top value 'Iron Ossetic' appears 444 times while many languages appear rarely, indicating a long-tail spread across languages.

Allophones high anthropic:default

This column contains allophone representations from a phonology or linguistics dataset, storing the phonetic variants of phonemes (e.g., 'm', 'j', 'w', 's') as very short strings with a mean length of ~2 characters. Strikingly, 53,580 of 105,484 rows (roughly 50.8%) carry the sentinel value 'NA', indicating no allophone recorded, which dominates the duplicate rate of 93.5%. The 6,892 unique values across a vocabulary of only 1,263 words suggest a modest set of phonetic symbols combined in small clusters, consistent with IPA or similar notation.

Glottocode high anthropic:default

This column contains Glottocodes, the standardized 8-character language identifiers used by the Glottolog database (e.g., 'kham1282', 'dutc1256'); the format shows in the near-uniform length of 8 characters (mean 7.999, median 8.0). The exception is 19 rows with a 2-character value (len_min 2), which cannot be a valid Glottocode and is probably a missing-value placeholder. With only 2,177 unique codes across 105,484 rows, the duplicate rate is extremely high at 97.9%, meaning each code recurs on average ~48 times, as expected when every phoneme of every inventory is annotated with its language. The top code 'kham1282' (Kham) appears 622 times, suggesting uneven language coverage in the dataset.

ISO6393 high anthropic:default

This column contains ISO 639-3 three-letter language codes, a standardised identifier for natural languages. Every value is exactly 3 characters long (min, mean, max all equal 3) with zero nulls, confirming strict conformity to the format. The duplicate rate is extremely high at 98.0%: 2,095 distinct codes repeat across 105,484 rows, which is expected for a language tag applied to many records. The most frequent code, 'mis', is the ISO 639-3 special code for uncoded languages; its 828 occurrences may warrant attention, as they mark records whose language has no ISO 639-3 code of its own.

Phoneme high anthropic:default

This column contains phoneme symbols in IPA-style notation, with 3,142 unique phoneme strings across 105,484 rows. Values are overwhelmingly single characters (len_median 1.0, len_mean 1.5, len_max 11). The duplicate rate is 97.0% (102,342 duplicates), which is expected given that the same phonemes recur across many inventories. The vocab_size of 1,339 distinct tokens against 3,142 unique values suggests that multi-character phoneme strings (e.g., base symbols with diacritics) are present alongside single-character ones.

advancedTongueRoot high anthropic:default

This column encodes the Advanced Tongue Root (ATR) phonological feature, a binary articulatory distinction marked as '+' (present), '-' (absent), or '0' (neutral/not applicable), typical of linguistic phoneme inventories. The distribution is severely imbalanced: '-' dominates at 97.87% (103,238 of 105,484 rows), '0' appears in only 2,235 rows, and '+' is nearly absent with just 11 occurrences. The near-zero entropy ratio (0.094) confirms that '+ATR' is a vanishingly rare feature in this dataset, which would make any model predicting '+' extremely difficult to train without resampling.

epilaryngealSource high anthropic:default

This column encodes whether an epilaryngeal phonation source is present, using a ternary scheme: absent ('-'), present ('+'), or not applicable ('0'). The distribution is severely imbalanced: the '-' (absent) class dominates at 97.9% of 105,484 rows, while '+' (present) appears in only 31 records (~0.03%), making positive-class detection extremely challenging. The near-zero entropy (0.147) confirms almost no informational variance in this column as-is.

loweredLarynxImplosive high anthropic:default

This column encodes a phonological feature — specifically whether a sound has a lowered larynx implosive articulation — using a symbolic notation system ('+', '-', and combinations). The distribution is severely imbalanced: 97.3% of the 105,484 rows carry the default/absent value '-', with only 2,150 instances of '0', 716 of '+', and fewer than 10 combined entries. The near-zero entropy ratio (0.088) confirms this column carries almost no information for most records, which would make it a very weak standalone predictor.

raisedLarynxEjective high anthropic:default

This column encodes the presence or absence of a raised-larynx ejective phonological feature, using the '+', '-', '0' notation of distinctive-feature matrices. The dominant value is '-' (absent), appearing in 96.4% of 105,484 rows, creating severe class imbalance with an entropy ratio of only 0.10. Minority values include '0' (2,150 rows), '+' (1,573 rows), and compound sequences like '-,+' and '+,-', consistent with one value per component of a complex (contour) segment. The absence of nulls (0.0%) indicates complete annotation coverage, but the extreme skew means this feature will contribute negligible signal in most modelling contexts without deliberate resampling or grouping.

retractedTongueRoot high anthropic:default

This column encodes the phonological feature 'retracted tongue root' (RTR), used in phonological databases to mark vowel or consonant articulation. The dominant value is '-' (absence of RTR) at 97.44% of 105,484 rows, leaving only ~2,696 rows with any other value; that remainder presumably includes '0' (not applicable) rows as in the neighbouring feature columns, so positively marked segments are rarer still. The compound values ('-,+', '-,-,+', etc.) are per-component annotation strings, not single-token labels, consistent with contour values for complex segments.

short high anthropic:default

This column encodes the phonological feature 'short', with only 4 distinct values: '-', '0', '+', and the compound '-,+'. It is severely imbalanced: the dominant value '-' accounts for 97.764% of rows (103,125 of 105,484), while '+' appears just 204 times and the compound '-,+' only 5 times. The near-zero entropy ratio (0.082) confirms that the column carries very little information, and any model trained on it will be overwhelmed by the '-' class.

stress medium anthropic:default

This column encodes the phonological feature 'stress' and takes only two values: '-' (103,334 rows, 97.96% of 105,484) and '0' (2,150 rows). The '0' count equals the number of tone segments reported for SegmentClass, which suggests that '0' marks rows where the feature does not apply, not rows where stress is present; no row is marked '+'. The entropy ratio of 0.144 confirms near-minimum information content, so the column adds little beyond what SegmentClass already conveys.

tap medium anthropic:default

This column encodes the phonological feature 'tap', marking whether a segment is a tap or flap, with values '-', '+', '0', and compound sequences such as '-,+' and '-,-,+'. The dominant value '-' accounts for 96.7% of all 105,484 rows, producing severe class imbalance (entropy ratio of only 0.10), so taps are rare in the corpus. The compound entries are notable: they are consistent with PHOIBLE's convention of listing one value per component of a complex (contour) segment, so the field should be parsed as an ordered sequence and not treated as a single label. No nulls are present.

tone high anthropic:default

This column is a binary tone indicator with only two values: '0' (neutral/absent) and '+' (positive), across 105,484 rows with no nulls. The distribution is severely imbalanced — '0' accounts for 97.96% of records (103,334) while '+' appears in just 2,150 rows (~2%). The entropy of 0.144 (out of a maximum of 1.0 for a binary variable) confirms the near-total dominance of a single class, which would make any model trained on this label prone to predicting '0' exclusively.
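
The entropy figures quoted in these notes can be reproduced from value counts alone. The sketch below assumes saturn reports Shannon entropy in bits and obtains the ratio by dividing by log2 of the cardinality; that reading is inferred from the SpecificDialect stats in this report (2.969 / log2(546) ≈ 0.326), not taken from saturn's documentation, and the function names are illustrative.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (in bits) of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Counts for the tone column as reported above: '0' and '+'.
tone_counts = [103_334, 2_150]
print(round(entropy_bits(tone_counts), 3))   # 0.144
print(round(entropy_ratio(tone_counts), 3))  # 0.144 (log2(2) = 1 for a binary column)
```

For a two-valued column the entropy and the ratio coincide, which is why the tone paragraph can quote 0.144 against a maximum of 1.0.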

trill high anthropic:default

This column encodes a phonological feature — specifically the presence or absence of a trill articulation — using a small symbol vocabulary of 6 distinct values. The dominant value is "-" (absent/negative), which accounts for 96.15% of 105,484 rows, producing an extremely low entropy of 0.276 and triggering an imbalance alert. The minority values ("0", "+", and compound sequences like "-,+") together cover fewer than 4,000 rows, suggesting trill is a rare feature in this phonological dataset. The compound values ("-,+", "-,-,+", "+,-") with counts of 26, 8, and 2 respectively hint at multi-segment or allophonic annotation, but are statistically negligible.

InventoryID high anthropic:default

InventoryID is a numeric foreign key referencing an inventory dimension table, with exactly 3,020 distinct values spanning 1–3,020 across 105,484 rows, implying heavy repeated use of each ID (average ~35 rows per ID). The distribution is remarkably flat and symmetric (skew ≈ −0.002, kurtosis ≈ −1.15, zero outliers), consistent with a well-populated lookup identifier rather than a measured quantity.

Marginal high anthropic:default

This column is a boolean flag indicating whether a record is classified as 'marginal', with three distinct string values: FALSE, NA, and TRUE. The dominant value is FALSE (78.9% of 105,484 rows), while TRUE is strikingly rare at only 1,347 records (≈1.3%). The presence of 'NA' as a literal string value in 20,874 rows (≈19.8%) is noteworthy — these are not system nulls (null_rate is 0.0) but encoded string missings, which must be handled explicitly rather than via standard null imputation.
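
Because the missing values here are literal strings, a null check will not find them. The following is a minimal sketch of counting them explicitly with the standard library; the token set and the tiny in-memory sample are assumptions for illustration, not part of the profiled file.

```python
import csv
import io
from collections import Counter

# Strings treated as missing: an assumption based on the 'NA' and
# empty-string values that this report shows for Marginal and SpecificDialect.
MISSING_TOKENS = {"NA", ""}


def missing_summary(rows, column, missing_tokens=MISSING_TOKENS):
    """Return (missing_count, total, rate), counting sentinel strings as missing."""
    flags = Counter(row[column].strip() in missing_tokens for row in rows)
    total = flags[True] + flags[False]
    rate = flags[True] / total if total else 0.0
    return flags[True], total, rate


# Hypothetical four-row sample standing in for phoible.csv.
sample = io.StringIO("Phoneme,Marginal\nm,FALSE\nk,NA\na,TRUE\ni,\n")
rows = list(csv.DictReader(sample))
missing, total, rate = missing_summary(rows, "Marginal")
print(missing, total, rate)  # 2 4 0.5
```

If the file is loaded with pandas instead, note that `read_csv` converts 'NA' to NaN by default unless `keep_default_na=False` is passed, so the null rate seen there can differ from the 0.0% reported here.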

SegmentClass high anthropic:default

SegmentClass is a phonological category label classifying 105,484 linguistic segments into exactly three classes: consonant, vowel, and tone. Consonants dominate at 68.5% (72,282 occurrences), vowels account for 29.4% (31,052), and tones are a small minority at just 2.0% (2,150) — a distribution consistent with natural language phoneme inventories but with tones notably underrepresented, suggesting the dataset skews toward non-tonal languages or tonal markings are partially absent. Zero nulls and perfect coverage make this a clean, reliable label.

Source high anthropic:default

This column identifies the contributing source from which each phonological record was drawn, with 8 distinct source codes across 105,484 rows and no nulls. The top source 'ph' leads at 34.4% (36,274 rows); since the whole file is PHOIBLE, 'ph' is best read as one contributor code within it, followed by 'ea', 'upsid', 'er', 'saphon', 'aa', 'spa', and 'ra'. The high entropy ratio of 0.899 indicates a relatively even spread across sources despite 'ph' leading, so no single source overwhelmingly controls the data. Analysts should be aware that cross-source comparisons may introduce systematic coding differences, as each source may apply distinct phonological conventions.

SpecificDialect high anthropic:default

This column captures a specific dialect designation for each record in what appears to be a linguistic or language survey dataset. The dominant signal is extreme missingness-by-label: 71.9% of rows carry the value 'NA' and a further 7,692 rows (≈7.3%) are empty strings, meaning roughly 79% of records lack a meaningful dialect value. Despite 546 unique values, the effective entropy ratio is only 0.33, confirming that real dialect labels are thinly and unevenly spread across the remaining ~22,000 rows.

anterior medium anthropic:default

This column encodes the phonological feature 'anterior', a place feature of coronal segments, with 6 distinct values across 105,484 rows and zero nulls. The dominant value '0' (64.8% of rows) marks segments for which the feature is not applicable, while '+' (24.4%) and '-' (10.8%) mark anterior and non-anterior segments. A small number of rows (9, 5, and 3) contain compound strings like '-,+', '+,-', and '-,-,+'; these are more plausibly contour values for complex segments than data-entry errors, but they still deviate from a single-symbol schema and need parsing.

approximant high anthropic:default

This column captures whether a phoneme or linguistic segment is classified as an approximant, using a compact symbolic encoding where '-' means absent and '+' means present. The dominant value is '-' at 55.9% of rows (58,966), with '+' accounting for 41.9% (44,266), giving a near-binary distribution with very low entropy (1.12). Surprising are the small number of compound values ('-,+', '-,-,+', '+,-') totalling just 102 rows, suggesting a minority of segments carry ambiguous or multi-valued approximant classifications — possibly transcription artifacts or multi-segment bundles.

back medium anthropic:default

This column encodes the phonological feature 'back', a tongue-body feature, using '+', '-', and '0' tokens. The dominant value is '0' (not applicable) at 46.7% (49,270 rows), followed by '-' at 39,749 and '+' at 15,547, so among specified segments non-back values outnumber back values by roughly 2.6 to 1. Compound sequences like '+,-' (511) and '-,+' (367) exist but are rare, and a handful of three-value sequences appear at the tail; this variable-length encoding, consistent with contour values for complex segments such as diphthongs, could cause parsing issues if treated as a simple label.

click high anthropic:default

This column encodes the phonological feature 'click', marking click consonants, as symbolic tokens ('-', '0', '+') with compound values ('+,-', '-,+'). The dominant value is '-' at 68.2% of 105,484 rows, while the values containing '+' ('+', '+,-', '-,+') account for only ~311 rows combined (~0.3%), so clicks are very rare in the corpus. The compound strings '+,-' and '-,+' are consistent with contour values for complex segments, with one value per component; they should be split before modelling.

consonantal high anthropic:default

This column encodes the phonological feature 'consonantal', marking whether a segment is consonantal (+) or not (-), a standard distinctive feature in linguistics datasets. The dominant value is '+' at 60.9% (64,257 rows), with '-' covering 37.0% (39,041 rows). The small third category '0' (2,151 rows) marks segments for which the feature is not specified, while the composite values '+,-' (34) and '-,+' (1) are multi-value entries, consistent with contour values for complex segments, that need separate handling.

constrictedGlottis high anthropic:default

This column encodes the phonological feature 'constricted glottis' as presence/absence symbols across 105,484 records, with no nulls and only 7 unique values. The dominant value is '-' (absent), appearing in 94.5% of rows (99,727), while '+' (present) accounts for just 3,383 rows, so the feature is rare. The compound values ('+,-', '-,+', '+,-,-', '-,-,+') indicate multi-component segments for a tiny fraction of records; with counts of 141, 93, 1, and 1 respectively, they are near-negligible. The extremely low entropy (0.372, entropy ratio 0.132) confirms the column is heavily imbalanced toward the negative class.

continuant high anthropic:default

This column encodes a phonological feature — whether a speech sound is a continuant (airflow continues through the vocal tract) — using '+', '-', and '0' (unspecified/N-A) notation standard in distinctive feature theory. The dominant values are '+' (54.9%, 57,952 rows) and '-' (44,585 rows), which together account for ~97% of records. Surprisingly, 7 of the 9 unique values are composite strings like '-,+' or '-,-,+', suggesting some rows encode sequences or multi-segment entries rather than single-segment features, which is atypical and may indicate data quality or schema inconsistency. Entropy ratio of 0.37 reflects moderate concentration driven almost entirely by the binary +/- split.

coronal high anthropic:default

This column encodes coronal articulation features in what appears to be a phonological or linguistic dataset, using binary '+'/'-' notation common in distinctive feature theory. The dominant value is '-' (62.8% of 105,484 rows), with '+' covering another 35.0%, making the two primary values account for ~98% of observations. The remaining values ('+,-', '-,+', '-,-,+', '+,-,+') suggest multi-segment or compound entries, which are rare (≤87 occurrences) but worth flagging as potential encoding inconsistencies or multi-value cells that may need splitting.

delayedRelease high anthropic:default

This column encodes the phonological feature 'delayed release', which distinguishes affricates from plain stops, for 105,484 records using a small set of symbolic tokens: '0' (not applicable, 55% of rows), '-' (26%), '+' (19%), and compound combinations thereof. The compound values ('-,+', '0,-,+', '+,-', '0,0,-,+') show that the column is sometimes multi-valued, packed as comma-separated strings; this is structurally inconsistent with a simple categorical and is consistent with one value per component of a complex segment. No nulls are present, and the entropy ratio of 0.52 reflects a moderately skewed but non-trivial distribution dominated by the '0' class.

distributed medium anthropic:default

This column encodes the phonological feature 'distributed', a place feature of coronal segments, using '0' (not applicable), '-', '+', and comma-separated sequences. The dominant value '0' covers 66% of 105,484 rows, with '-' (21.1%) and '+' (12.5%) accounting for most of the remainder. There are 11 distinct values in total, all built from these three symbols, so some records carry a sequence of values, consistent with contour values for complex segments, instead of a single state. The entropy ratio of 0.37 confirms moderate but uneven information content, heavily concentrated in the single '0' class.

dorsal high anthropic:default

This column encodes the phonological feature 'dorsal', marking articulation with the tongue body, using '+', '-', and '0' symbols, sometimes as comma-separated sequences. The dominant values are '+' (51.7%, n=54,535) and '-' (n=47,052), together accounting for ~96.3% of rows, with '0' (n=2,160) marking segments for which the feature is not applicable. The remaining categories, out of 13 in total, are multi-value strings like '-,+' or '-,-,+', consistent with one value per component of a complex segment; this mixed encoding warrants normalization before analysis.

fortis high anthropic:default

This column encodes the phonological feature 'fortis' as a three-valued flag with values '-', '0', and '+'. The dominant value is '-' at 68.1% (71,867 rows), '0' accounts for 31.5% (33,202 rows), and '+' is strikingly rare at only 415 occurrences (~0.4%), creating a heavily imbalanced distribution. The '0' count equals the combined number of vowel and tone segments reported for SegmentClass (31,052 + 2,150), which suggests the feature is specified only for consonants; the rarity of '+' is worth checking for class imbalance before modelling.

front medium anthropic:default

This column encodes the phonological feature 'front', a tongue-body feature, using values '0', '-', and '+' as atomic tokens that can be combined into sequences (e.g., '-,+', '+,-,-'). The dominant value '0' (not applicable) accounts for 46.75% of rows, with '-' (34,225) and '+' (20,683) making up most of the remainder; non-front values outnumber front values by ~1.65:1. The compound values ('-,+', '+,-', etc.) are consistent with contour values for complex segments such as diphthongs, stored as a single string that requires parsing before modelling.

high medium anthropic:default

This column encodes the phonological feature 'high', a tongue-body height feature, using tokens '0' (not applicable), '+', and '-', including multi-value sequences like '-,+' or '+,-,+'. The dominant value is '0' at 46.7% (49,247 rows), followed by '+' at 35,559 and '-' at 19,156, so among specified segments high values outnumber non-high values by roughly 1.9 to 1. The compound values (e.g., '+,-,+', '-,+,+') have very low frequency, some appearing only once or twice; these rare edge cases may need consolidation or separate treatment.

labial high anthropic:default

This column captures labial articulation features in a phonological or linguistic dataset, encoding presence/absence/neutrality of a labial feature per segment (or per segment sequence). The dominant value is '-' (absent) at 68.2% of 105,484 rows, with '+' (present) at 26.8%, suggesting most segments are non-labial. Surprisingly, ~2.9% of values are compound strings like '-,+', '+,-', '-,-,+', indicating multi-segment bundles packed into a single cell rather than a flat atomic feature, which creates a parsing challenge and implies the column is not fully normalized.
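
One way to handle the compound strings is to split every feature value into an ordered list of atomic symbols before analysis. The sketch below assumes, as the notes above suggest, that each comma-separated part is one of '+', '-', or '0' and corresponds to one component of a complex segment; the helper names are illustrative.

```python
# Atomic symbols observed in the feature columns of this report.
ATOMIC_VALUES = {"+", "-", "0"}


def split_feature(value):
    """Split a feature cell such as '-,+' into its ordered atomic values."""
    parts = [part.strip() for part in value.split(",")]
    unknown = [part for part in parts if part not in ATOMIC_VALUES]
    if unknown:
        raise ValueError(f"unexpected feature token(s) {unknown!r} in {value!r}")
    return parts


def is_contour(value):
    """True when the cell holds more than one atomic value."""
    return len(split_feature(value)) > 1


for raw in ["-", "+", "0", "-,+", "-,-,+"]:
    print(repr(raw), split_feature(raw), is_contour(raw))
```

Keeping the parts as a list preserves their order, which matters if '+,-' and '-,+' describe different segments; collapsing them to a set would discard that distinction.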

labiodental high anthropic:default

This column encodes labiodental phonetic feature annotations, likely from a linguistic or phonological dataset where each row represents a speech sound or segment. The dominant value '0' (74,124 rows, 70.3%) indicates the feature is absent or neutral, while '+' and '-' mark positive and negative feature values respectively — a standard binary distinctive-feature notation. Notably, multi-valued strings like '+,-', '-,+', and '+,+,-' appear (60 rows total), suggesting a small number of segments carry conflicting or composite annotations, which may indicate data entry inconsistency or deliberate underspecification. Entropy ratio of 0.39 confirms moderate imbalance with '0' dominating.

lateral high anthropic:default

This column encodes the phonological feature 'lateral', marking segments produced with airflow along the sides of the tongue, using '-', '+', and '0' (not applicable). The dominant value '-' accounts for 93.8% of all 105,484 rows, producing a very low entropy ratio of 0.134, so the column is heavily skewed toward a single class. Some values are comma-delimited sequences (e.g., '-,+', '-,-,+'), consistent with one value per component of a complex segment, stored in one field instead of separate fields. No nulls are present.

lenis high anthropic:default

This column represents a ternary linguistic or phonological feature, almost certainly a 'lenis' (weak/voiced) marker with values '-' (absent/irrelevant), '0' (neutral), and '+' (present/lenis). The dominant value is '-' at 68.1% (71,866 rows), while the positive lenis marker '+' is strikingly rare at only 416 occurrences (~0.4%), creating a heavily imbalanced distribution. No nulls exist across all 105,484 rows, and entropy is moderate at 0.93 bits, well below the theoretical maximum, confirming the skew toward the negative class.

long medium anthropic:default

This column encodes the phonological feature 'long', marking segment length, with only 6 distinct values across 105,484 rows. The dominant value '-' accounts for 89.9% of records, so most segments are not marked as long. The compound values '-,+', '+,-', and '-,-,+' appear in just 104 rows combined; they are consistent with contour values for complex segments and warrant inspection before modelling.

InventoryID numeric

rows	105,484
null	0 (0.0%)
unique	3,020
min	1.000
max	3,020
mean	1,479
median	1,464
std	843.111
q1	769.000
q3	2,237
iqr	1,468
skew	-2.40e-03
kurtosis	-1.146
n_outliers	0
outlier_rate	0.000
zero_rate	0.000
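
The zero outlier count follows from the quartiles if outliers are flagged with Tukey's 1.5 × IQR fences. Whether saturn uses that rule is an assumption; the sketch below only shows that the reported quartiles, minimum, and maximum are consistent with it.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and range as reported for InventoryID above.
q1, q3 = 769.0, 2237.0
col_min, col_max = 1.0, 3020.0

lower, upper = tukey_fences(q1, q3)
print(lower, upper)  # -1433.0 4439.0

# Every value lies inside the fences, so no row would be flagged.
print(lower <= col_min and col_max <= upper)  # True
```

Since InventoryID is an identifier and not a measurement, the outlier statistics carry no analytical meaning here beyond confirming that the IDs form a contiguous range.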

Histogram bins for InventoryID (median: 1464.0).
bin	count
1 – 76.47	2729
76.47 – 151.9	2947
151.9 – 227.4	2742
227.4 – 302.9	2385
302.9 – 378.4	2399
378.4 – 453.8	2530
453.8 – 529.3	2270
529.3 – 604.8	2323
604.8 – 680.3	2533
680.3 – 755.8	3039
755.8 – 831.2	2896
831.2 – 906.7	2609
906.7 – 982.2	2693
982.2 – 1058	2513
1058 – 1133	2637
1133 – 1209	2514
1209 – 1284	2898
1284 – 1360	3314
1360 – 1435	3634
1435 – 1510	2993
1510 – 1586	3172
1586 – 1661	3290
1661 – 1737	2686
1737 – 1812	3001
1812 – 1888	1779
1888 – 1963	2046
1963 – 2039	1878
2039 – 2114	1864
2114 – 2190	2685
2190 – 2265	3387
2265 – 2341	3096
2341 – 2416	3142
2416 – 2492	3264
2492 – 2567	3557
2567 – 2643	2966
2643 – 2718	1975
2718 – 2794	1761
2794 – 2869	1659
2869 – 2945	1855
2945 – 3020	1823
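
The bin edges above look like 40 equal-width bins between the column minimum and maximum. That reading is inferred from the printed edges, not from saturn's documentation; the sketch reproduces the edges under that assumption.

```python
def equal_width_edges(lo, hi, n_bins):
    """Return n_bins + 1 edges dividing [lo, hi] into equal-width bins."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


# InventoryID spans 1 to 3,020 and the table lists 40 bins.
edges = equal_width_edges(1, 3020, 40)
print(len(edges) - 1)                 # 40 bins
print(edges[1] - edges[0])            # bin width, about 75.475
print([round(e, 2) for e in edges[:4]])
```

With a width of about 75.5 and IDs that are integers, each bin covers 75 or 76 inventories, so differences in bin counts mostly reflect inventory size and not uneven ID coverage.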

Glottocode text

100.0% rows are a single word; 95th-percentile length under 20 chars; 97.9% duplicate strings

rows	105,484
null	0 (0.0%)
unique	2,177
len_min	2
len_max	8
len_mean	7.999
len_median	8.000
len_p95	8.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	103,307
duplicate_rate	0.979
vocab_size	2,168
readability_flesch_mean	94.148
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.000
boilerplate_rate	0.000
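
The duplicate figures in the text cards are consistent with n_duplicates = rows − unique and duplicate_rate = n_duplicates / rows. That definition is inferred from the reported numbers, not from saturn's documentation; the sketch checks it against the four text columns profiled in this report.

```python
ROWS = 105_484


def duplicate_stats(rows, unique):
    """Return (n_duplicates, duplicate_rate) under the rows - unique definition."""
    n_duplicates = rows - unique
    return n_duplicates, n_duplicates / rows


# Unique counts as reported in this document.
unique_counts = {
    "Glottocode": 2_177,
    "ISO6393": 2_095,
    "LanguageName": 2_716,
    "GlyphID": 3_142,
}

for name, unique in unique_counts.items():
    n_dup, rate = duplicate_stats(ROWS, unique)
    print(f"{name}: n_duplicates={n_dup:,} duplicate_rate={rate:.3f}")
```

Under this definition a high duplicate rate only says that values repeat, which is expected for language identifiers attached to every phoneme row.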

Character-length distribution for Glottocode (mean: 7.998919267377043).
chars	count
2	19
8	105465

Sample values (first 10)

lakk1252
lazz1240
kham1282
cebu1242
west2456
gurd1238
copi1238
kham1282
gata1239
paez1247

ISO6393 text

100.0% rows are a single word; 95th-percentile length under 20 chars; 98.0% duplicate strings

rows	105,484
null	0 (0.0%)
unique	2,095
len_min	3
len_max	3
len_mean	3.000
len_median	3.000
len_p95	3.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	103,389
duplicate_rate	0.980
vocab_size	2,086
readability_flesch_mean	119.528
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.000
boilerplate_rate	0.000

Character-length distribution for ISO6393 (mean: 3.0).
chars	count
3	105484

Sample values (first 10)

LanguageName text

84.3% rows are a single word; 13.1% rows are all-caps; 95th-percentile length under 20 chars; 97.4% duplicate strings

rows	105,484
null	0 (0.0%)
unique	2,716
len_min	2
len_max	79
len_mean	7.822
len_median	7.000
len_p95	16.000
word_mean	1.201
word_median	1.000
n_empty	0
n_duplicates	102,768
duplicate_rate	0.974
vocab_size	2,670
readability_flesch_mean	53.181
emoji_rate	0.000
url_rate	0.000
one_word_rate	0.843
allcaps_rate	0.131
boilerplate_rate	0.000

Character-length distribution for LanguageName (mean: 7.822219483523567).
chars	count
2 – 4	3915
4 – 6	29541
6 – 8	34409
8 – 10	14985
10 – 12	7358
12 – 14	4849
14 – 15	3897
15 – 17	2893
17 – 19	1292
19 – 21	993
21 – 23	198
23 – 25	224
25 – 27	218
27 – 29	39
29 – 31	93
31 – 33	130
33 – 35	64
35 – 37	23
37 – 39	37
39 – 40	57
40 – 42	0
42 – 44	52
44 – 46	0
46 – 48	23
48 – 50	0
50 – 52	0
52 – 54	20
54 – 56	0
56 – 58	40
58 – 60	0
60 – 62	87
62 – 64	0
64 – 66	0
66 – 67	0
67 – 69	0
69 – 71	0
71 – 73	0
73 – 75	0
75 – 77	0
77 – 79	47

Sample values (first 10)

Lak
Laz
Kami Tibetan
Cebuano
western xwla
Kurtjar
Copi
Kham Tibetan
Gtaʔ
Paez

SpecificDialect categorical

rows	105,484
null	0 (0.0%)
unique	546
top_value	NA
top_rate	0.719
cardinality	546
entropy	2.969
entropy_ratio	0.326

Top values for SpecificDialect (20 unique shown, of 546 total).
value	count	share
NA	75807	71.9%
	7692	7.3%
W2	120	0.1%
Lezgian (Güne)	96	0.1%
Santa	92	0.1%
Central Pakistan	83	0.1%
Babungo (Grassfields Bantu, Ring)	82	0.1%
Scottish Gaelic (Lewis)	82	0.1%
Tangari	81	0.1%
Kanga	76	0.1%
Kufa	75	0.1%
Skolt Saami (Suõʹnnʼjel)	75	0.1%
Standard Hindi (as spoken in Varanasi, Lucknow, Delhi etc.)	74	0.1%
Standard (eastern)	74	0.1%
Guovdageaidnu	74	0.1%
Nuosu (Black Yi)	74	0.1%
Northern Qiang (Yadu)	73	0.1%
Bangladeshi Standard (spoken in Dhaka and other urban aread of Bangladesh)	72	0.1%
Standard Italian	70	0.1%
Chechen (Ploskost)	70	0.1%


GlyphID text

100.0% rows are a single word; 100.0% rows are all-caps; 95th-percentile length under 20 chars; 97.0% duplicate strings

rows	105,484
null	0 (0.0%)
unique	3,142
len_min	4
len_max	54
len_mean	6.503
len_median	4.000
len_p95	14.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	102,342
duplicate_rate	0.970
vocab_size	1,343
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Character-length distribution for GlyphID (mean: 6.503).
chars	count
4	67114
9	28726
14	6559
19	2225
24	401
29	267
34	104
39	5
44	70
49	1
54	12

Sample values (first 10)

03C7
0075
006E+0064+007A
0072
02E6
0268
0064+0324+026E+0324
0075
0069+0303
0061+0303
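
The GlyphID samples look like four-digit hexadecimal Unicode code points joined by `+`, and they line up one-to-one with the Phoneme samples (03C7 with χ, 006E+0064+007A with ndz). The sketch below decodes and encodes a GlyphID under that assumption. The format is inferred from the samples rather than documented in this report, and both function names are hypothetical.

```python
def glyph_id_to_text(glyph_id: str) -> str:
    """Decode '+'-joined hex code points (assumed format) into a string."""
    return "".join(chr(int(code_point, 16)) for code_point in glyph_id.split("+"))


def text_to_glyph_id(text: str) -> str:
    """Encode each code point as four-digit uppercase hex, joined by '+'."""
    return "+".join(f"{ord(char):04X}" for char in text)


print(glyph_id_to_text("006E+0064+007A"))  # ndz
print(text_to_glyph_id("ndz"))             # 006E+0064+007A
print(len(text_to_glyph_id("ndz")))        # 14
```

Under this reading, a phoneme of k code points (each fitting in four hex digits) has a GlyphID of 5k - 1 characters, which is consistent with GlyphID lengths 4, 9, 14 and so on carrying the same counts as Phoneme lengths 1, 2, 3 and so on.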

Phoneme text

100.0% rows are a single word; 95th-percentile length under 20 chars; 97.0% duplicate strings

rows	105,484
null	0 (0.0%)
unique	3,142
len_min	1
len_max	11
len_mean	1.501
len_median	1.000
len_p95	3.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	102,342
duplicate_rate	0.970
vocab_size	1,339
readability_flesch_mean	114.439
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.75e-03
boilerplate_rate	0.000

Character-length distribution for Phoneme (mean: 1.501).
chars	count
1	67114
2	28726
3	6559
4	2225
5	401
6	267
7	104
8	5
9	70
10	1
11	12

Sample values (first 10)

χ
u
ndz
r
˦
ɨ
d̤ɮ̤
u
ĩ
ã

Allophones text

91.3% rows are a single word; 95th-percentile length under 20 chars; 93.5% duplicate strings

rows	105,484
null	0 (0.0%)
unique	6,892
len_min	1
len_max	37
len_mean	2.083
len_median	2.000
len_p95	4.000
word_mean	1.129
word_median	1.000
n_empty	0
n_duplicates	98,592
duplicate_rate	0.935
vocab_size	1,263
readability_flesch_mean	116.186
emoji_rate	0.000
url_rate	0.000
one_word_rate	0.913
allcaps_rate	2.91e-03
boilerplate_rate	0.000

Character-length distribution for Allophones (mean: 2.083).
chars	count
1 – 2	26762
2 – 3	66019
3 – 4	4934
4 – 5	3125
5 – 6	1246
6 – 6	1181
6 – 7	789
7 – 8	409
8 – 9	334
9 – 10	0
10 – 11	226
11 – 12	160
12 – 13	69
13 – 14	65
14 – 14	41
14 – 15	36
15 – 16	12
16 – 17	16
17 – 18	12
18 – 19	0
19 – 20	14
20 – 21	9
21 – 22	8
22 – 23	0
23 – 24	3
24 – 24	3
24 – 25	3
25 – 26	3
26 – 27	2
27 – 28	0
28 – 29	1
29 – 30	0
30 – 31	0
31 – 32	0
32 – 32	0
32 – 33	0
33 – 34	1
34 – 35	0
35 – 36	0
36 – 37	1

Sample values (first 10)

χ
NA
NA
NA
˦
NA
d̤ɮ̤
NA
NA
ã ə̃

Marginal categorical

rows	105,484
null	0 (0.0%)
unique	3
top_value	FALSE
top_rate	0.789
cardinality	3
entropy	0.812
entropy_ratio	0.512

Top values for Marginal (3 unique shown, of 3 total).
value	count	share
FALSE	83263	78.9%
NA	20874	19.8%
TRUE	1347	1.3%
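
The profile reports a 0.0% null rate for Marginal, yet 20,874 rows hold the literal string NA. The sketch below recomputes a missing-value share that treats such sentinel strings as missing. The sentinel set and the helper name `sentinel_share` are assumptions for illustration; the report does not state whether NA means "not recorded" in this column.

```python
from typing import Dict, Iterable


def sentinel_share(counts: Dict[str, int], sentinels: Iterable[str] = ("NA", "")) -> float:
    """Share of rows whose value is one of the assumed missing-value sentinels."""
    sentinel_set = set(sentinels)
    total = sum(counts.values())
    missing = sum(n for value, n in counts.items() if value in sentinel_set)
    return missing / total


# Value counts copied from the Marginal table above.
marginal = {"FALSE": 83263, "NA": 20874, "TRUE": 1347}
print(f"{sentinel_share(marginal):.3f}")  # 0.198
```

The same helper applies to SpecificDialect, where NA and blank values together account for most rows.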


SegmentClass categorical

rows	105,484
null	0 (0.0%)
unique	3
top_value	consonant
top_rate	0.685
cardinality	3
entropy	1.008
entropy_ratio	0.636

Top values for SegmentClass (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%
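
The entropy and entropy_ratio figures for SegmentClass can be reproduced from the value counts if entropy is Shannon entropy in bits and entropy_ratio divides it by log2 of the cardinality. That definition is inferred from the numbers (it also fits tone, where a cardinality of 2 gives identical entropy and ratio), not taken from saturn's documentation, and the function names are hypothetical.

```python
import math
from typing import Iterable


def shannon_entropy_bits(counts: Iterable[int]) -> float:
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    return -sum((c / total) * math.log2(c / total) for c in positive)


def entropy_ratio(counts: Iterable[int]) -> float:
    """Entropy divided by its maximum, log2(number of distinct values)."""
    positive = [c for c in counts if c > 0]
    if len(positive) < 2:
        return 0.0  # a single value carries no uncertainty
    return shannon_entropy_bits(positive) / math.log2(len(positive))


# Counts copied from the SegmentClass table; the report shows 1.008 and 0.636.
segment_class = [72282, 31052, 2150]
print(f"{shannon_entropy_bits(segment_class):.3f}")
print(f"{entropy_ratio(segment_class):.3f}")
```

A ratio near 1 means the values are spread almost evenly (Source, at 0.899), while a ratio near 0 means one value dominates (short, at 0.082).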


Source categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	ph
top_rate	0.344
cardinality	8
entropy	2.697
entropy_ratio	0.899

Top values for Source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%


tone categorical

top value is 98.0% of rows

rows	105,484
null	0 (0.0%)
unique	2
top_value	0
top_rate	0.980
cardinality	2
entropy	0.144
entropy_ratio	0.144

Top values for tone (2 unique shown, of 2 total).
value	count	share
0	103334	98.0%
+	2150	2.0%


stress categorical

top value is 98.0% of rows

rows	105,484
null	0 (0.0%)
unique	2
top_value	-
top_rate	0.980
cardinality	2
entropy	0.144
entropy_ratio	0.144

Top values for stress (2 unique shown, of 2 total).
value	count	share
-	103334	98.0%
0	2150	2.0%


syllabic categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	-
top_rate	0.685
cardinality	8
entropy	1.042
entropy_ratio	0.347

Top values for syllabic (8 unique shown, of 8 total).
value	count	share
-	72248	68.5%
+	30692	29.1%
0	2150	2.0%
+,-	244	0.2%
-,+	124	0.1%
-,+,-	12	0.0%
-,+,+	12	0.0%
+,+,-	2	0.0%
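
Values such as `+,-` and `-,+,-` appear to carry one feature value per component of a multi-part segment. That reading is an assumption based on the comma-separated pattern; the report itself does not explain it. Under that assumption, the sketch below splits such values and counts how many syllabic rows are multi-valued. The helper names are hypothetical.

```python
from typing import Dict, Tuple


def split_feature(value: str) -> Tuple[str, ...]:
    """Split a feature string like '+,-' into its per-component values."""
    return tuple(value.split(","))


def is_multi_valued(value: str) -> bool:
    """True when the feature string holds more than one component value."""
    return len(split_feature(value)) > 1


# Value counts copied from the syllabic table above.
syllabic: Dict[str, int] = {
    "-": 72248, "+": 30692, "0": 2150, "+,-": 244,
    "-,+": 124, "-,+,-": 12, "-,+,+": 12, "+,+,-": 2,
}

multi_rows = sum(n for value, n in syllabic.items() if is_multi_valued(value))
print(split_feature("-,+,-"))  # ('-', '+', '-')
print(multi_rows)              # 394
```

Splitting before aggregation keeps a column's effective alphabet at `+`, `-` and `0`, which avoids treating each rare combination as its own category.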


short categorical

top value is 97.8% of rows

rows	105,484
null	0 (0.0%)
unique	4
top_value	-
top_rate	0.978
cardinality	4
entropy	0.164
entropy_ratio	0.082

Top values for short (4 unique shown, of 4 total).
value	count	share
-	103125	97.8%
0	2150	2.0%
+	204	0.2%
-,+	5	0.0%


long categorical

rows	105,484
null	0 (0.0%)
unique	6
top_value	-
top_rate	0.899
cardinality	6
entropy	0.554
entropy_ratio	0.214

Top values for long (6 unique shown, of 6 total).
value	count	share
-	94844	89.9%
+	8386	8.0%
0	2150	2.0%
-,+	63	0.1%
+,-	40	0.0%
-,-,+	1	0.0%


consonantal categorical

rows	105,484
null	0 (0.0%)
unique	5
top_value	+
top_rate	0.609
cardinality	5
entropy	1.085
entropy_ratio	0.467

Top values for consonantal (5 unique shown, of 5 total).
value	count	share
+	64257	60.9%
-	39041	37.0%
0	2151	2.0%
+,-	34	0.0%
-,+	1	0.0%


sonorant categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	+
top_rate	0.530
cardinality	8
entropy	1.245
entropy_ratio	0.415

Top values for sonorant (8 unique shown, of 8 total).
value	count	share
+	55920	53.0%
-	45322	43.0%
0	2150	2.0%
+,-	1948	1.8%
-,+	89	0.1%
+,-,-	29	0.0%
+,-,+	25	0.0%
+,-,+,-	1	0.0%


continuant categorical

rows	105,484
null	0 (0.0%)
unique	9
top_value	+
top_rate	0.549
cardinality	9
entropy	1.172
entropy_ratio	0.370

Top values for continuant (9 unique shown, of 9 total).
value	count	share
+	57952	54.9%
-	44585	42.3%
0	2151	2.0%
-,+	728	0.7%
-,-,+	50	0.0%
+,-	9	0.0%
0,-,+	4	0.0%
-,+,+	4	0.0%
0,0,-,+	1	0.0%


delayedRelease categorical

rows	105,484
null	0 (0.0%)
unique	7
top_value	0
top_rate	0.550
cardinality	7
entropy	1.471
entropy_ratio	0.524

Top values for delayedRelease (7 unique shown, of 7 total).
value	count	share
0	58035	55.0%
-	27384	26.0%
+	19533	18.5%
-,+	492	0.5%
0,-,+	33	0.0%
+,-	6	0.0%
0,0,-,+	1	0.0%


approximant categorical

rows	105,484
null	0 (0.0%)
unique	6
top_value	-
top_rate	0.559
cardinality	6
entropy	1.120
entropy_ratio	0.433

Top values for approximant (6 unique shown, of 6 total).
value	count	share
-	58966	55.9%
+	44266	42.0%
0	2150	2.0%
-,+	71	0.1%
-,-,+	25	0.0%
+,-	6	0.0%


tap categorical

top value is 96.7% of rows

rows	105,484
null	0 (0.0%)
unique	5
top_value	-
top_rate	0.967
cardinality	5
entropy	0.242
entropy_ratio	0.104

Top values for tap (5 unique shown, of 5 total).
value	count	share
-	102023	96.7%
0	2203	2.1%
+	1218	1.2%
-,+	25	0.0%
-,-,+	15	0.0%


trill categorical

top value is 96.2% of rows

rows	105,484
null	0 (0.0%)
unique	6
top_value	-
top_rate	0.962
cardinality	6
entropy	0.276
entropy_ratio	0.107

Top values for trill (6 unique shown, of 6 total).
value	count	share
-	101427	96.2%
0	2202	2.1%
+	1819	1.7%
-,+	26	0.0%
-,-,+	8	0.0%
+,-	2	0.0%


nasal categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	-
top_rate	0.808
cardinality	8
entropy	0.897
entropy_ratio	0.299

Top values for nasal (8 unique shown, of 8 total).
value	count	share
-	85269	80.8%
+	15941	15.1%
0	2150	2.0%
+,-	1973	1.9%
-,+	95	0.1%
+,-,-	54	0.1%
+,-,+,-	1	0.0%
-,+,-	1	0.0%


lateral categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	-
top_rate	0.938
cardinality	8
entropy	0.401
entropy_ratio	0.134

Top values for lateral (8 unique shown, of 8 total).
value	count	share
-	98968	93.8%
+	4211	4.0%
0	2150	2.0%
-,+	135	0.1%
+,-	12	0.0%
-,-,+	4	0.0%
-,+,-	3	0.0%
0,-,+	1	0.0%


labial categorical

rows	105,484
null	0 (0.0%)
unique	15
top_value	-
top_rate	0.682
cardinality	15
entropy	1.182
entropy_ratio	0.302

Top values for labial (15 unique shown, of 15 total).
value	count	share
-	71961	68.2%
+	28241	26.8%
-,+	2414	2.3%
0	2160	2.0%
+,-	531	0.5%
-,-,+	121	0.1%
+,-,-	21	0.0%
0,+,-	8	0.0%
-,+,-	6	0.0%
0,-,+	5	0.0%
-,+,+	5	0.0%
+,+,-	5	0.0%
+,-,+	4	0.0%
-,-,+,+	1	0.0%
0,+,-,-	1	0.0%


round categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	0
top_rate	0.703
cardinality	8
entropy	1.194
entropy_ratio	0.398

Top values for round (8 unique shown, of 8 total).
value	count	share
0	74155	70.3%
+	16956	16.1%
-	14082	13.3%
-,+	269	0.3%
-,-,+	17	0.0%
-,0,+	3	0.0%
0,-,+	1	0.0%
+,-	1	0.0%


labiodental categorical

rows	105,484
null	0 (0.0%)
unique	6
top_value	0
top_rate	0.703
cardinality	6
entropy	1.006
entropy_ratio	0.389

Top values for labiodental (6 unique shown, of 6 total).
value	count	share
0	74124	70.3%
-	28726	27.2%
+	2574	2.4%
+,-	56	0.1%
-,+	3	0.0%
+,+,-	1	0.0%


coronal categorical

rows	105,484
null	0 (0.0%)
unique	7
top_value	-
top_rate	0.628
cardinality	7
entropy	1.080
entropy_ratio	0.385

Top values for coronal (7 unique shown, of 7 total).
value	count	share
-	66234	62.8%
+	36955	35.0%
0	2160	2.0%
+,-	87	0.1%
-,+	41	0.0%
-,-,+	6	0.0%
+,-,+	1	0.0%


anterior categorical

rows	105,484
null	0 (0.0%)
unique	6
top_value	0
top_rate	0.648
cardinality	6
entropy	1.251
entropy_ratio	0.484

Top values for anterior (6 unique shown, of 6 total).
value	count	share
0	68372	64.8%
+	25704	24.4%
-	11391	10.8%
-,+	9	0.0%
+,-	5	0.0%
-,-,+	3	0.0%


distributed categorical

rows	105,484
null	0 (0.0%)
unique	11
top_value	0
top_rate	0.660
cardinality	11
entropy	1.273
entropy_ratio	0.368

Top values for distributed (11 unique shown, of 11 total).
value	count	share
0	69639	66.0%
-	22283	21.1%
+	13228	12.5%
-,+	296	0.3%
-,-,+	25	0.0%
+,-	5	0.0%
0,-,+	3	0.0%
+,-,+	2	0.0%
0,+,-	1	0.0%
+,+,-	1	0.0%
0,0,-,+	1	0.0%


strident categorical

rows	105,484
null	0 (0.0%)
unique	9
top_value	0
top_rate	0.649
cardinality	9
entropy	1.287
entropy_ratio	0.406

Top values for strident (9 unique shown, of 9 total).
value	count	share
0	68410	64.9%
-	25410	24.1%
+	11039	10.5%
-,+	585	0.6%
-,-,+	26	0.0%
+,-	7	0.0%
-,+,-	3	0.0%
0,-,+	3	0.0%
0,0,-,+	1	0.0%


dorsal categorical

rows	105,484
null	0 (0.0%)
unique	13
top_value	+
top_rate	0.517
cardinality	13
entropy	1.235
entropy_ratio	0.334

Top values for dorsal (13 unique shown, of 13 total).
value	count	share
+	54535	51.7%
-	47052	44.6%
0	2160	2.0%
-,+	1530	1.5%
+,-	144	0.1%
-,-,+	44	0.0%
0,-,+	6	0.0%
+,-,+	5	0.0%
-,+,+	4	0.0%
+,+,-,-	1	0.0%
-,+,-	1	0.0%
+,+,-	1	0.0%
0,0,-,+	1	0.0%


high categorical

rows	105,484
null	0 (0.0%)
unique	11
top_value	0
top_rate	0.467
cardinality	11
entropy	1.594
entropy_ratio	0.461

Top values for high (11 unique shown, of 11 total).
value	count	share
0	49247	46.7%
+	35559	33.7%
-	19156	18.2%
-,+	845	0.8%
+,-	627	0.6%
+,-,+	38	0.0%
+,+,-	6	0.0%
-,+,+	2	0.0%
-,-,+	2	0.0%
+,-,0	1	0.0%
-,+,-	1	0.0%


low categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	-
top_rate	0.473
cardinality	8
entropy	1.305
entropy_ratio	0.435

Top values for low (8 unique shown, of 8 total).
value	count	share
-	49930	47.3%
0	49244	46.7%
+	5598	5.3%
+,-	417	0.4%
-,+	270	0.3%
-,+,-	21	0.0%
-,-,+	3	0.0%
+,-,-	1	0.0%


front categorical

rows	105,484
null	0 (0.0%)
unique	13
top_value	0
top_rate	0.468
cardinality	13
entropy	1.592
entropy_ratio	0.430

Top values for front (13 unique shown, of 13 total).
value	count	share
0	49316	46.8%
-	34225	32.4%
+	20683	19.6%
-,+	838	0.8%
+,-	359	0.3%
-,-,+	24	0.0%
+,-,-	14	0.0%
-,+,+	10	0.0%
+,-,+	6	0.0%
-,0,+	3	0.0%
+,+,-	2	0.0%
0,-,+	2	0.0%
-,+,-	2	0.0%


back categorical

rows	105,484
null	0 (0.0%)
unique	12
top_value	0
top_rate	0.467
cardinality	12
entropy	1.521
entropy_ratio	0.424

Top values for back (12 unique shown, of 12 total).
value	count	share
0	49270	46.7%
-	39749	37.7%
+	15547	14.7%
+,-	511	0.5%
-,+	367	0.3%
+,-,-	19	0.0%
-,-,+	8	0.0%
-,+,+	5	0.0%
-,+,-	5	0.0%
0,+,-	1	0.0%
+,-,+	1	0.0%
+,+,-	1	0.0%


tense categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	0
top_rate	0.713
cardinality	8
entropy	1.114
entropy_ratio	0.371

Top values for tense (8 unique shown, of 8 total).
value	count	share
0	75230	71.3%
+	23411	22.2%
-	6386	6.1%
+,-	268	0.3%
-,+	179	0.2%
+,-,+	6	0.0%
+,-,-	3	0.0%
+,+,-	1	0.0%


retractedTongueRoot categorical

top value is 97.4% of rows

rows	105,484
null	0 (0.0%)
unique	7
top_value	-
top_rate	0.974
cardinality	7
entropy	0.193
entropy_ratio	0.069

Top values for retractedTongueRoot (7 unique shown, of 7 total).
value	count	share
-	102788	97.4%
0	2235	2.1%
-,+	251	0.2%
+	199	0.2%
-,-,+	9	0.0%
-,+,-	1	0.0%
+,-	1	0.0%


advancedTongueRoot categorical

top value is 97.9% of rows

rows	105,484
null	0 (0.0%)
unique	3
top_value	-
top_rate	0.979
cardinality	3
entropy	0.150
entropy_ratio	0.094

Top values for advancedTongueRoot (3 unique shown, of 3 total).
value	count	share
-	103238	97.9%
0	2235	2.1%
+	11	0.0%


periodicGlottalSource categorical

rows	105,484
null	0 (0.0%)
unique	7
top_value	+
top_rate	0.680
cardinality	7
entropy	1.051
entropy_ratio	0.375

Top values for periodicGlottalSource (7 unique shown, of 7 total).
value	count	share
+	71694	68.0%
-	31179	29.6%
0	2139	2.0%
+,-	371	0.4%
-,+	87	0.1%
+,-,-	8	0.0%
+,-,+	6	0.0%


epilaryngealSource categorical

top value is 97.9% of rows

rows	105,484
null	0 (0.0%)
unique	3
top_value	-
top_rate	0.979
cardinality	3
entropy	0.147
entropy_ratio	0.093

Top values for epilaryngealSource (3 unique shown, of 3 total).
value	count	share
-	103303	97.9%
0	2150	2.0%
+	31	0.0%


spreadGlottis categorical

rows	105,484
null	0 (0.0%)
unique	10
top_value	-
top_rate	0.918
cardinality	10
entropy	0.497
entropy_ratio	0.149

Top values for spreadGlottis (10 unique shown, of 10 total).
value	count	share
-	96855	91.8%
+	6156	5.8%
0	2138	2.0%
-,+	206	0.2%
+,-	115	0.1%
-,-,+	5	0.0%
+,0,-	5	0.0%
+,-,-	2	0.0%
+,0,-,-	1	0.0%
+,-,+	1	0.0%


constrictedGlottis categorical

rows	105,484
null	0 (0.0%)
unique	7
top_value	-
top_rate	0.945
cardinality	7
entropy	0.372
entropy_ratio	0.132

Top values for constrictedGlottis (7 unique shown, of 7 total).
value	count	share
-	99727	94.5%
+	3383	3.2%
0	2138	2.0%
+,-	141	0.1%
-,+	93	0.1%
+,-,-	1	0.0%
-,-,+	1	0.0%


fortis categorical

rows	105,484
null	0 (0.0%)
unique	3
top_value	-
top_rate	0.681
cardinality	3
entropy	0.934
entropy_ratio	0.589

Top values for fortis (3 unique shown, of 3 total).
value	count	share
-	71867	68.1%
0	33202	31.5%
+	415	0.4%


lenis categorical

rows	105,484
null	0 (0.0%)
unique	3
top_value	-
top_rate	0.681
cardinality	3
entropy	0.934
entropy_ratio	0.589

Top values for lenis (3 unique shown, of 3 total).
value	count	share
-	71866	68.1%
0	33202	31.5%
+	416	0.4%


raisedLarynxEjective categorical

top value is 96.4% of rows

rows	105,484
null	0 (0.0%)
unique	6
top_value	-
top_rate	0.964
cardinality	6
entropy	0.267
entropy_ratio	0.103

Top values for raisedLarynxEjective (6 unique shown, of 6 total).
value	count	share
-	101652	96.4%
0	2150	2.0%
+	1573	1.5%
-,+	85	0.1%
+,-	23	0.0%
-,-,+	1	0.0%


loweredLarynxImplosive categorical

top value is 97.3% of rows

rows	105,484
null	0 (0.0%)
unique	5
top_value	-
top_rate	0.973
cardinality	5
entropy	0.203
entropy_ratio	0.088

Top values for loweredLarynxImplosive (5 unique shown, of 5 total).
value	count	share
-	102609	97.3%
0	2150	2.0%
+	716	0.7%
-,+	7	0.0%
+,-	2	0.0%


click categorical

rows	105,484
null	0 (0.0%)
unique	5
top_value	-
top_rate	0.682
cardinality	5
entropy	0.928
entropy_ratio	0.400

Top values for click (5 unique shown, of 5 total).
value	count	share
-	71971	68.2%
0	33202	31.5%
+	253	0.2%
+,-	52	0.0%
-,+	6	0.0%
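
One pattern in these counts is that fortis, lenis and click each have exactly 33,202 rows coded `0`, which equals the vowel and tone rows in SegmentClass combined. The sketch below checks that arithmetic. Matching totals suggest, but do not prove, that these three features are left unspecified for non-consonant segments; confirming that would need a row-level check against the CSV, which this report does not include.

```python
# Counts copied from the SegmentClass, fortis, lenis and click tables in this report.
segment_class = {"consonant": 72282, "vowel": 31052, "tone": 2150}
unspecified = {"fortis": 33202, "lenis": 33202, "click": 33202}

non_consonant_rows = segment_class["vowel"] + segment_class["tone"]

# Compare each feature's "0" count with the number of non-consonant rows.
matches = {feature: n == non_consonant_rows for feature, n in unspecified.items()}

print(non_consonant_rows)     # 33202
print(all(matches.values()))  # True
```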
