saturn

/home/coolhand/datasets/language-data/phoible.csv | 105,484 rows | sample n=105,484 | seed 42 | 2026-05-01T23:20:29+00:00

Overview

Source: /home/coolhand/datasets/language-data/phoible.csv
Total rows: 105,484
Profiled sample: 105,484
Columns: 49
Generated: 2026-05-01T23:20:29+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This is the PHOIBLE phonological inventory dataset: 105,484 rows describing phoneme segments across roughly 2,716 language names and 2,177 Glottocodes, with each row carrying a Phoneme/GlyphID plus 38 phonological feature columns (e.g. consonantal, nasal, sonorant, dorsal) valued '+', '-', or '0', with occasional comma-joined sequences. The dataset is dominated by consonants (SegmentClass shows 72,282 consonants vs 31,052 vowels and 2,150 tones) and pulls from 8 sources, with 'ph' (36,274) and 'ea' (16,883) together accounting for just over half the rows. Most feature columns are heavily imbalanced toward '-' or '0', but a handful (consonantal, sonorant, continuant, dorsal, high, front, back) are fairly balanced and are the most informative to explore. Top language names like Iron Ossetic (444), Dutch (395), and Chechen (309) have the most rows, which may reflect large inventories, several inventories sharing one name, or both.

InventoryID high anthropic:claude-opus-4-7

InventoryID is an integer identifier for a phoneme inventory, with 3,020 distinct values spread across 105,484 rows and no nulls. The distribution is essentially uniform from 1 to 3,020 (mean 1479, median 1464, skew ≈0, kurtosis ≈-1.15), confirming it's an enumerated identifier rather than a measurement. Each ID recurs roughly 35 times on average (105,484 / 3,020), so it works as a grouping key: the rows sharing an ID are the segments of one inventory.

Glottocode high anthropic:claude-opus-4-7

This column holds Glottocodes: standardized 8-character language identifiers from Glottolog (e.g., 'kham1282', 'dutc1256'). Values are uniformly one word with length tightly clustered at 8 (mean 7.999, max 8); the minimum length of 2 means a small number of values are not well-formed Glottocodes, possibly a placeholder such as 'NA'. There are 2,177 unique codes across 105,484 rows with a 97.9% duplicate rate, indicating each language appears many times. The top code 'kham1282' alone accounts for 622 rows.

ISO6393 high anthropic:claude-opus-4-7

This column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,095 distinct codes across 105,484 rows. The distribution is heavy-tailed and highly repetitive (98.0% duplicate rate), led by 'mis' (828), the ISO 639-3 code reserved for uncoded languages, then 'khg' (622) and 'oss' (525), with familiar codes like 'eng' and 'hin' also prominent. No nulls or empties, and the vocabulary (2,086) closely matches n_unique (2,095), consistent with a clean controlled vocabulary.

LanguageName high anthropic:claude-opus-4-7

This column holds language names, mostly single-word labels (one_word_rate 0.84, word_mean 1.20) with a vocab of 2670 across 2716 unique values over 105484 rows. It is highly repetitive (duplicate_rate 0.974) with top entries like 'Iron Ossetic' (444), 'Dutch' (395), and 'Chechen' (309), and roughly 13% of values are uppercase, suggesting inconsistent casing worth normalising. Compound names use directional modifiers ('northern', 'southern', 'eastern', 'western') indicating dialect-level granularity.

SpecificDialect high anthropic:claude-opus-4-7

Categorical field naming a specific dialect/sub-variety of a language, with 546 distinct values across 105,484 rows. The distribution is extremely concentrated: 71.9% are the literal string "NA" and another 7,692 rows are empty strings, leaving the remaining 544 dialect labels (e.g., "W2", "Lezgian (Güne)", "Scottish Gaelic (Lewis)") in a long tail topping out at 120 occurrences. Entropy ratio of 0.33 confirms most signal lives in that small tail.

GlyphID high anthropic:claude-opus-4-7

GlyphID holds single-token uppercase hexadecimal codes (allcaps_rate 1.0, one_word_rate 1.0, len_min 4, len_median 4) that look like Unicode codepoints: the top values 006D, 0069, 006B map to lowercase Latin letters m, i, k. Despite the ID-sounding name it is highly non-unique: only 3,142 distinct values across 105,484 rows with a 0.97 duplicate rate, so it behaves as a categorical glyph reference rather than a row key. The unique count matches Phoneme's (3,142), consistent with GlyphID being the codepoint spelling of the Phoneme column. Lengths stretch up to 54 characters (p95 14) because some entries join several codepoints with '+', as in the sample value 006E+0064+007A.

Phoneme high anthropic:claude-opus-4-7

This column holds individual phoneme tokens, most often a single character (len_mean 1.50, len_median 1, p95 3, max 11) and always one word (one_word_rate 1.0). With 105,484 rows but only 3,142 unique values and a 97.0% duplicate rate, the same phoneme symbols repeat heavily; top symbols like 'm' (2,915), 'i' (2,779), and 'k' (2,729) dominate. Vocab_size (1,339) is well below n_unique (3,142), which probably reflects how the profiler tokenises strings (for example by dropping diacritics or modifier characters) rather than a property of the data.

Allophones high anthropic:claude-opus-4-7

Column holds short phonetic tokens (allophones), with mean length 2.08 characters and 91.3% being a single word. The distribution is dominated by the literal string 'NA' (53,580 of 105,484 rows, ~50.8%), which likely encodes missing rather than true null since null_rate is 0.0; duplicate_rate is 0.935 across only 6,892 uniques. Top non-NA values are individual phoneme letters like 'm', 'j', 'w', 's', consistent with IPA-style symbols.

Marginal high anthropic:claude-opus-4-7

A ternary flag with values FALSE, NA, and TRUE across 105,484 rows. FALSE dominates at 78.9% while TRUE appears only 1,347 times; the 20,874 NA entries are encoded as a literal string rather than null, so null_rate is 0.0 despite roughly a fifth of rows being missing in practice.

SegmentClass high anthropic:claude-opus-4-7

SegmentClass is a categorical phonological label with only 3 distinct values: consonant, vowel, and tone. The distribution is heavily skewed — consonant accounts for 68.5% of 105,484 rows, vowel for 31,052, and tone for just 2,150, giving an entropy ratio of 0.64. The presence of 'tone' as a rare third class suggests the dataset spans tonal languages but those segments are sparsely represented.

Source high anthropic:claude-opus-4-7

Categorical provenance tag with 8 distinct sources across 105,484 rows and no nulls. Distribution is fairly balanced (entropy ratio 0.90), though 'ph' leads at 34.4% followed by 'ea' (16,883) and 'upsid' (13,966). Looks like a dataset-origin code identifying which linguistic database each record came from.

tone high anthropic:claude-opus-4-7

Two-valued flag with values "0" and "+" encoding the phonological [tone] feature, not sentiment. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows while "+" appears only 2,150 times, exactly the number of rows that SegmentClass labels as tone, yielding entropy of just 0.144. No nulls are present.

stress high anthropic:claude-opus-4-7

Two-valued flag with only '-' and '0' observed across 105484 rows and no nulls. The column is severely imbalanced: '-' covers 97.96% of records (103334) while '0' accounts for the remaining 2150, yielding an entropy ratio of just 0.144. The 2,150 '0' rows match the tone-segment count, which suggests '0' marks the feature as not applicable to tones and '-' is the ordinary negative value; with no '+' observed, the column is near-constant.

syllabic high anthropic:claude-opus-4-7

This is a phonological feature column encoding the [syllabic] distinctive feature, with 8 distinct values across 105,484 rows and no nulls. Most entries are simple '-' (68.5%) or '+' (30,692); a further 2,150 rows are '0', and 394 rows carry comma-joined codes like '+,-' or '-,+,-' that suggest contour/multi-segment annotations. Entropy ratio of 0.35 confirms the distribution is heavily concentrated on the negative value.

short high anthropic:claude-opus-4-7

A 4-level categorical with values '-', '0', '+', and '-,+', encoding the phonological [short] feature. It is severely imbalanced: '-' covers 97.76% of 105,484 rows, leaving only 2,150 zeros, 204 plus signs, and 5 mixed '-,+' entries. Entropy ratio is just 0.082, so the column carries almost no information as-is.

long high anthropic:claude-opus-4-7

Categorical flag with only 6 distinct values dominated by '-' at 89.9% (94844/105484), followed by '+' (8386) and '0' (2150). The remaining three categories are comma-joined combinations like '-,+' and '+,-' with tiny counts (63, 40, 1), suggesting concatenated multi-record values rather than clean single labels. Low entropy ratio (0.214) confirms the column is highly imbalanced toward '-'.

consonantal high anthropic:claude-opus-4-7

This is a phonological feature column flagging segments as consonantal, with five distinct values across 105,484 rows and no nulls. The vast majority are binary +/- (64,257 and 39,041 respectively, with + dominating at 60.9%), plus 2,151 zero/underspecified entries and just 35 rows with mixed values like '+,-' or '-,+'. Entropy ratio of 0.47 reflects the heavy + skew, and the lone '-,+' singleton is worth noting as a likely encoding artifact.

sonorant high anthropic:claude-opus-4-7

This is a phonological feature column encoding the [sonorant] value of a segment, dominated by binary '+'/'-' marks (53.0% '+', plus 45322 '-'). The presence of '0' (2150) and comma-joined sequences like '+,-' (1948) and rarer '+,-,-', '+,-,+', '+,-,+,-' suggests contour/complex segments where multiple values are concatenated. Cardinality is 8 with entropy ratio 0.41, so the long tail is sparse but meaningful.

continuant high anthropic:claude-opus-4-7

Categorical flag with 9 distinct values across 105,484 rows and no nulls, dominated by '+' (54.9%) and '-' (~42%), with '0' a distant third at 2,151 occurrences. The remaining six categories are comma-joined sequences like '-,+' or '0,0,-,+' that look like concatenated multi-step states rather than atomic labels — only 796 rows total fall into these compound buckets. Entropy ratio of 0.37 confirms heavy concentration in the two primary signs.

delayedRelease high anthropic:claude-opus-4-7

A categorical flag named delayedRelease with 7 distinct values across 105484 rows and no nulls. Values are dominated by '0' (55.0%), followed by '-' (27384) and '+' (19533), but a long tail of comma-joined combinations like '-,+', '0,-,+', '+,-', and '0,0,-,+' suggests rows where multiple labels were concatenated rather than normalized. Entropy ratio of 0.52 confirms a skewed but not single-valued distribution.

approximant high anthropic:claude-opus-4-7

A low-cardinality categorical with only 6 distinct values dominated by sign tokens '-' (55.9%) and '+' (≈42%), plus a small '0' bucket (2,150 rows) and rare comma-joined combinations like '-,+', '-,-,+', '+,-'. The compound values suggest this field occasionally stores multiple approximant signs concatenated rather than a single label. No nulls across 105,484 rows, and entropy ratio is 0.43, reflecting the heavy '-'/'+' imbalance.

tap high anthropic:claude-opus-4-7

A categorical flag with only 5 distinct values dominated by '-' at 96.7% of 105484 rows, with '0' and '+' as minor categories and two rare composite codes ('-,+', '-,-,+') appearing 25 and 15 times. Entropy ratio of 0.104 confirms extreme imbalance, so this column carries very little discriminative signal on its own. The composite values suggest the field occasionally concatenates multiple states, which is worth verifying against the source schema.

trill high anthropic:claude-opus-4-7

A categorical flag with only 6 distinct values across 105,484 rows, encoding the phonological [trill] feature ("-", "0", "+", and a few comma-joined sequences). The distribution is severely imbalanced: "-" alone covers 96.15% of rows, leaving entropy at just 0.276 (entropy ratio 0.107). The compound values "-,+", "-,-,+" and "+,-" appear fewer than 30 times each and are consistent with contour segments whose feature value changes partway through; they may need parsing.

nasal high anthropic:claude-opus-4-7

A categorical flag for nasal presence/absence, dominated by '-' at 80.8% with '+' a distant second (15,941 of 105,484). Compound values like '+,-' (1,973), '-,+', and '+,-,-' are consistent with contour segments, plausibly prenasalised consonants, rather than a clean binary indicator. The '0' category (2,150) matches the tone-segment count, suggesting it marks the feature as not applicable.

lateral high anthropic:claude-opus-4-7

Categorical flag with 8 distinct values dominated by '-' at 93.8% of 105484 rows, followed by '+' (4211) and '0' (2150). The remaining categories are comma-joined composites like '-,+' or '-,-,+' with counts between 1 and 135, consistent with contour segments rather than clean single labels. Entropy ratio is just 0.134, so the column carries little information as-is.

labial high anthropic:claude-opus-4-7

A categorical flag for labial articulation, dominated by '-' (68.2%) and '+' (27%), with '0' and various comma-joined combinations like '-,+' and '-,-,+' suggesting multi-segment or sequence-level annotations rather than atomic values. Cardinality is 15 across 105,484 rows with no nulls, and entropy ratio is only 0.30, so the signal is highly concentrated in the binary +/- distinction. The presence of compound tokens is the main surprise and indicates inconsistent encoding granularity.

round medium anthropic:claude-opus-4-7

Categorical column with 8 distinct values dominated by '0' (70.3% of 105484 rows), followed by '+' (16956) and '-' (14082). The tokens encode the phonological [round] (lip rounding) feature, not numeric rounding, and a long tail of compound values like '-,+', '-,-,+', and '0,-,+' (each with 1-269 occurrences) is consistent with contour segments whose rounding changes partway through. No nulls, but entropy_ratio of 0.398 confirms heavy concentration on the single dominant category.

labiodental high anthropic:claude-opus-4-7

This column appears to be a phonological feature flag indicating whether a phoneme is labiodental, encoded with values like "0", "-", "+", and a few comma-separated combinations. The distribution is dominated by "0" at 70.3% of 105,484 rows, with "-" at 28,726 and "+" at only 2,574; mixed values like "+,-", "-,+", and "+,+,-" appear in only 60 rows combined and likely represent multi-segment entries. Entropy of 1.01 (ratio 0.39) confirms heavy concentration on the default "0" code.

coronal high anthropic:claude-opus-4-7

A low-cardinality categorical with 7 distinct values dominated by sign tokens '-' (62.8%) and '+' (about 36,955 of 105,484), plus a small '0' bucket of 2,160 rows. The presence of comma-joined combinations like '+,-', '-,+', '-,-,+' and '+,-,+' suggests multi-valued entries collapsed into a single string rather than a clean atomic category. Entropy ratio of 0.385 confirms heavy concentration on the two main signs.

anterior high anthropic:claude-opus-4-7

A low-cardinality categorical with 6 distinct values dominated by '0' (64.8% of 105484 rows), '+' (25704), and '-' (11391). The remaining three categories are concatenated combinations like '-,+', '+,-', and '-,-,+' totaling only 17 rows, suggesting this field originally allowed multi-valued entries that were collapsed into comma-joined strings. Entropy ratio of 0.48 confirms heavy concentration on the single mode.

distributed medium anthropic:claude-opus-4-7

A low-cardinality categorical with 11 distinct values dominated by '0' (66.0% of 105,484 rows), followed by '-' and '+'. The remaining eight categories are concatenated combinations like '-,+' or '-,-,+' that together account for fewer than 350 rows, suggesting multi-event encoding squeezed into a single field. Entropy ratio of 0.368 confirms the heavy concentration on the top class.

strident high anthropic:claude-opus-4-7

This appears to be a phonological feature column encoding the [strident] distinctive feature, with values '0' (unspecified), '-' (non-strident), and '+' (strident) covering the vast majority of 105484 rows. About 64.9% are '0' and there are no nulls, but a long tail of 6 composite values like '-,+' and '0,0,-,+' (totaling 625 rows) suggests multi-segment entries where the feature varies across positions. Entropy ratio of 0.41 confirms heavy concentration on the unspecified category.

dorsal high anthropic:claude-opus-4-7

This column encodes the phonological [dorsal] feature, dominated by two values: '+' (54,535, 51.7%) and '-' (47,052). A third value '0' appears 2,160 times, and the remaining 10 categories are compound comma-separated combinations like '-,+' or '+,-,+', consistent with contour or complex segments recorded in one cell. Cardinality is 13 with entropy ratio 0.33, so the long tail is small but needs parsing before the column can be treated as a clean categorical.

high medium anthropic:claude-opus-4-7

This column encodes the phonological [high] (tongue height) feature, with '0' the dominant value at 46.7% of 105,484 rows, followed by '+' and '-'. Compound values like '-,+' and '+,-' are consistent with contour segments, for example diphthongs, and they tail off sharply (845, 627, then single digits). Entropy ratio of 0.46 confirms heavy concentration in the top three single-symbol categories; cardinality is 11 with no nulls.

low medium anthropic:claude-opus-4-7

A low-cardinality categorical with 8 distinct tokens dominated by '-' (47.3%) and '0' (~49,244 of 105,484), with '+' a distant third at 5,598. The column encodes the phonological [low] (tongue height) feature; the remaining values are comma-joined sequences like '+,-' and '-,+,-', consistent with contour segments whose value changes partway through. The long tail (down to a single '+,-,-') indicates rare composite states worth bucketing.

front high anthropic:claude-opus-4-7

Categorical column encoding the phonological [front] feature, with 13 distinct values dominated by three primitives: "0" (49,316), "-" (34,225), and "+" (20,683), together covering nearly all 105,484 rows. The remaining categories are comma-joined combinations like "-,+" (838) or "-,-,+" (24), consistent with contour segments rather than clean atomic labels. Top rate is 0.468 and entropy ratio 0.43, so the distribution is skewed toward "0" but not degenerate. No nulls.

back high anthropic:claude-opus-4-7

A low-cardinality categorical with 12 distinct values dominated by the tokens "0" (46.7%), "-", and "+", encoding the phonological [back] feature. The remaining values are comma-separated combinations like "+,-" or "+,-,-", consistent with contour segments recorded in one cell: a compound encoding rather than a clean atomic category. No nulls across 105,484 rows, and entropy ratio of 0.42 confirms the heavy skew toward "0".

tense high anthropic:claude-opus-4-7

This is a categorical 'tense' field with only 8 distinct values across 105,484 rows and no nulls, dominated by '0' at 71.3% and '+' at ~22%. The remaining categories are sparse markers ('-', '+,-', '-,+') and a long tail of compound sequences with as few as 1-6 occurrences. In this dataset the name refers to the phonological [tense] feature (the tense/lax distinction), not grammatical tense. Low entropy ratio (0.37) confirms heavy concentration in the top class.

retractedTongueRoot high anthropic:claude-opus-4-7

Categorical encoding of the retracted-tongue-root (RTR) phonological feature, with 7 distinct values across 105,484 rows and no nulls. The column is severely imbalanced: '-' covers 97.4% of rows, '0' another ~2%, and the remaining five values (including compound codes like '-,+' and '-,-,+') together account for under 500 rows. Entropy ratio of 0.069 confirms almost no information content as-is.

advancedTongueRoot high anthropic:claude-opus-4-7

This column encodes the advanced tongue root (ATR) phonological feature, taking three values: '-', '0', and '+'. It is severely imbalanced — '-' covers 97.87% of 105,484 rows, '0' appears 2,235 times, and '+' only 11 times, yielding an entropy ratio of just 0.094. The near-absence of '+' values means this feature carries almost no discriminative signal as-is.

periodicGlottalSource high anthropic:claude-opus-4-7

Phonological feature flag for periodic glottal source (voicing), with 7 distinct values across 105,484 rows and no nulls. The vast majority are simple binary tags: '+' at 67.97% (71,694) and '-' (31,179), with a small '0' class (2,139) and rare comma-joined sequences like '+,-' (371) suggesting multi-segment or contour entries. Low entropy ratio (0.3745) confirms heavy concentration on '+'.

epilaryngealSource high anthropic:claude-opus-4-7

A categorical phonological feature (epilaryngeal source) with only 3 distinct values: '-', '0', and '+'. The column is severely imbalanced — '-' accounts for 97.93% of the 105,484 rows, '0' for 2,150 rows, and '+' for just 31 rows, yielding a very low entropy ratio of 0.093. With no nulls but near-constant values, it carries little discriminative signal.

spreadGlottis high anthropic:claude-opus-4-7

This appears to be a phonological feature column encoding the [spread glottis] distinctive feature, with values '-', '+', '0' and comma-separated combinations for segments with multiple specifications. The distribution is extremely lopsided: '-' covers 91.8% of 105,484 rows and entropy ratio is just 0.149, meaning the column carries little information on its own. The long tail of compound values like '-,+', '+,0,-', and '+,-,+' (some with only 1-5 occurrences) suggests multi-segment or contour entries that may need parsing.

constrictedGlottis high anthropic:claude-opus-4-7

Categorical flag for a 'constricted glottis' phonological feature, with 7 distinct values across 105,484 rows and no nulls. Heavily dominated by '-' at 94.5% (99,727 rows), with '+' and '0' as minor categories and a long tail of comma-joined sequences ('+,-', '-,+', and two singletons) suggesting multi-segment annotations. Low entropy ratio (0.13) confirms the column carries little information in isolation.

fortis medium anthropic:claude-opus-4-7

A 3-level categorical flag for the phonological [fortis] feature, dominated by '-' (68.1% of 105,484 rows), with '0' covering most of the rest and '+' appearing only 415 times. The entropy ratio (0.589) comes almost entirely from the '-'/'0' split, since the '+' class is tiny. No nulls, so the encoding is complete as-is.

lenis high anthropic:claude-opus-4-7

A ternary categorical flag with values '-', '0', and '+', encoding the phonological [lenis] feature, the counterpart of the fortis column. The distribution is highly imbalanced: '-' covers 68.1% of 105,484 rows and '0' another 33,202, while '+' appears only 416 times. No nulls, but the rare '+' class may be too sparse for stable modelling.

raisedLarynxEjective high anthropic:claude-opus-4-7

This is a categorical phonological feature flag for 'raisedLarynxEjective', taking values like '-', '+', '0', and a few comma-separated combinations across 105484 rows with no nulls. The distribution is severely imbalanced: '-' covers 96.37% of rows and entropy ratio is just 0.103, with rare compound values like '-,-,+' appearing only once. The 6-way cardinality, including order-sensitive codes ('-,+' vs '+,-'), suggests multi-segment annotations that may need parsing.

loweredLarynxImplosive high anthropic:claude-opus-4-7

Categorical phonological feature flagging lowered larynx implosives, with 5 distinct values across 105,484 rows and no nulls. The distribution is severely imbalanced: '-' covers 97.27% of rows, while '0' (2,150), '+' (716), and the mixed codes '-,+' (7) and '+,-' (2) are rare. Entropy ratio of 0.088 confirms the column carries very little information as-is.

click medium anthropic:claude-opus-4-7

Categorical flag with only 5 distinct values dominated by '-' (68.2% of 105,484 rows) and '0' (33,202). It encodes the phonological [click] feature (click consonants), not user interaction: '-' marks non-click segments and '0' most likely marks segments where the feature does not apply. The '+' class is rare (253) and the compound values '+,-' (52) and '-,+' (6) are consistent with complex segments that have a click phase and a non-click phase. Entropy ratio of 0.40 confirms the heavy imbalance.

InventoryID numeric

rows: 105,484
null: 0 (0.0%)
unique: 3,020
min: 1.000
max: 3,020
mean: 1,479
median: 1,464
std: 843.111
q1: 769.000
q3: 2,237
iqr: 1,468
skew: -2.40e-03
kurtosis: -1.146
n_outliers: 0
outlier_rate: 0.000
zero_rate: 0.000

Glottocode text

100.0% rows are a single word; 95th-percentile length under 20 chars; 97.9% duplicate strings
rows: 105,484
null: 0 (0.0%)
unique: 2,177
len_min: 2
len_max: 8
len_mean: 7.999
len_median: 8.000
len_p95: 8.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 103,307
duplicate_rate: 0.979
vocab_size: 2,168
readability_flesch_mean: 94.148
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. lakk1252
  2. lazz1240
  3. kham1282
  4. cebu1242
  5. west2456
  6. gurd1238
  7. copi1238
  8. kham1282
  9. gata1239
  10. paez1247
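
The duplicate figures in this block are consistent with n_duplicates = rows - unique (105,484 - 2,177 = 103,307) and duplicate_rate = n_duplicates / rows. That definition is inferred from the reported numbers, not taken from saturn's documentation, and the helper name below is hypothetical. A minimal sketch:

```python
from collections import Counter


def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate) for an iterable of strings.

    Assumption: a "duplicate" is any occurrence beyond the first of a value,
    so n_duplicates = rows - unique. This matches the figures in the report
    but is not confirmed by the profiler's own documentation.
    """
    counts = Counter(values)
    n_rows = sum(counts.values())
    n_duplicates = n_rows - len(counts)
    rate = n_duplicates / n_rows if n_rows else 0.0
    return n_duplicates, rate


# Re-derive the Glottocode figures from rows and unique alone.
rows, unique = 105_484, 2_177
print(rows - unique, round((rows - unique) / rows, 3))
```

The same arithmetic reproduces the ISO6393 and LanguageName blocks, which is why the definition seems a safe reading.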

ISO6393 text

100.0% rows are a single word; 95th-percentile length under 20 chars; 98.0% duplicate strings
rows: 105,484
null: 0 (0.0%)
unique: 2,095
len_min: 3
len_max: 3
len_mean: 3.000
len_median: 3.000
len_p95: 3.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 103,389
duplicate_rate: 0.980
vocab_size: 2,086
readability_flesch_mean: 119.528
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. lbe
  2. lzz
  3. khg
  4. ceb
  5. xwl
  6. gdj
  7. cce
  8. khg
  9. gaq
  10. pbb

LanguageName text

84.3% rows are a single word; 13.1% rows are all-caps; 95th-percentile length under 20 chars; 97.4% duplicate strings
rows: 105,484
null: 0 (0.0%)
unique: 2,716
len_min: 2
len_max: 79
len_mean: 7.822
len_median: 7.000
len_p95: 16.000
word_mean: 1.201
word_median: 1.000
n_empty: 0
n_duplicates: 102,768
duplicate_rate: 0.974
vocab_size: 2,670
readability_flesch_mean: 53.181
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.843
allcaps_rate: 0.131
boilerplate_rate: 0.000
Sample values (first 10)
  1. Lak
  2. Laz
  3. Kami Tibetan
  4. Cebuano
  5. western xwla
  6. Kurtjar
  7. Copi
  8. Kham Tibetan
  9. Gtaʔ
  10. Paez
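
With about 13% of names in all-caps and at least one lowercase sample ('western xwla'), the same language may appear under several spellings. One way to check is to group names by a case-insensitive key before deciding on any rewrite. The sketch below is only an illustration: the function names are hypothetical, and 'DUTCH' is an invented spelling used for the demo, not a value taken from the report.

```python
from collections import defaultdict


def name_key(name: str) -> str:
    """Case-insensitive, whitespace-normalised comparison key."""
    return " ".join(name.split()).casefold()


def group_case_variants(names):
    """Map each key to its distinct spellings, keeping only keys with 2+ spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


# 'DUTCH' is a made-up all-caps variant for illustration only.
print(group_case_variants(["Dutch", "DUTCH", "Lak", "western xwla"]))
```

Grouping on a key leaves the original strings untouched, which matters for names such as 'Gtaʔ' that a blanket title-casing step could distort.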

SpecificDialect categorical

rows105,484
null0 (0.0%)
unique546
top_valueNA
top_rate0.719
cardinality546
entropy2.969
entropy_ratio0.326
Top values (rank 1–20)
  1. NA — 75,807
  2. (empty string) — 7,692
  3. W2 — 120
  4. Lezgian (Güne) — 96
  5. Santa — 92
  6. Central Pakistan — 83
  7. Babungo (Grassfields Bantu, Ring) — 82
  8. Scottish Gaelic (Lewis) — 82
  9. Tangari — 81
  10. Kanga — 76
  11. Kufa — 75
  12. Skolt Saami (Suõʹnnʼjel) — 75
  13. Standard Hindi (as spoken in Varanasi, Lucknow, Delhi etc.) — 74
  14. Standard (eastern) — 74
  15. Guovdageaidnu — 74
  16. Nuosu (Black Yi) — 74
  17. Northern Qiang (Yadu) — 73
  18. Bangladeshi Standard (spoken in Dhaka and other urban aread of Bangladesh) — 72
  19. Standard Italian — 70
  20. Chechen (Ploskost) — 70
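
Because both the literal 'NA' and the empty string occupy the top two ranks, the reported null rate of 0.0% understates how much of this column is effectively missing. The sketch below recomputes an "effective" missing rate from the counts listed above. Treating 'NA' and '' as missing is an assumption about the data's conventions, and the function name is hypothetical.

```python
# Assumption: these two tokens stand for "no value" in this column.
MISSING_TOKENS = frozenset({"NA", ""})


def effective_missing_rate(value_counts, missing_tokens=MISSING_TOKENS):
    """Share of rows whose value is a placeholder rather than real content."""
    total = sum(value_counts.values())
    missing = sum(n for value, n in value_counts.items() if value.strip() in missing_tokens)
    return missing / total if total else 0.0


# Counts from the SpecificDialect block; all other labels are pooled together.
specific_dialect = {"NA": 75_807, "": 7_692, "<all other labels>": 105_484 - 75_807 - 7_692}
print(round(effective_missing_rate(specific_dialect), 3))
```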

GlyphID text

100.0% rows are a single word; 100.0% rows are all-caps; 95th-percentile length under 20 chars; 97.0% duplicate strings
rows: 105,484
null: 0 (0.0%)
unique: 3,142
len_min: 4
len_max: 54
len_mean: 6.503
len_median: 4.000
len_p95: 14.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 102,342
duplicate_rate: 0.970
vocab_size: 1,343
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. 03C7
  2. 0075
  3. 006E+0064+007A
  4. 0072
  5. 02E6
  6. 0268
  7. 0064+0324+026E+0324
  8. 0075
  9. 0069+0303
  10. 0061+0303
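
The sample values read as hexadecimal Unicode codepoints joined by '+'. Decoding them reproduces the first Phoneme samples (03C7 gives χ, 006E+0064+007A gives ndz), so a decoder is a cheap way to cross-check the two columns. This is a minimal sketch under that reading; the function name is hypothetical.

```python
def decode_glyph_id(glyph_id: str) -> str:
    """Turn a '+'-joined run of hex codepoints into the string it spells."""
    return "".join(chr(int(part, 16)) for part in glyph_id.split("+"))


print(decode_glyph_id("03C7"))            # χ
print(decode_glyph_id("006E+0064+007A"))  # ndz
```

Comparing decode_glyph_id(GlyphID) with Phoneme row by row would show whether the two columns ever disagree, for instance because of Unicode normalisation differences.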

Phoneme text

100.0% rows are a single word; 95th-percentile length under 20 chars; 97.0% duplicate strings
rows: 105,484
null: 0 (0.0%)
unique: 3,142
len_min: 1
len_max: 11
len_mean: 1.501
len_median: 1.000
len_p95: 3.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 102,342
duplicate_rate: 0.970
vocab_size: 1,339
readability_flesch_mean: 114.439
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.75e-03
boilerplate_rate: 0.000
Sample values (first 10)
  1. χ
  2. u
  3. ndz
  4. r
  5. ˦
  6. ɨ
  7. d̤ɮ̤
  8. u

Allophones text

91.3% rows are a single word; 95th-percentile length under 20 chars; 93.5% duplicate strings
rows: 105,484
null: 0 (0.0%)
unique: 6,892
len_min: 1
len_max: 37
len_mean: 2.083
len_median: 2.000
len_p95: 4.000
word_mean: 1.129
word_median: 1.000
n_empty: 0
n_duplicates: 98,592
duplicate_rate: 0.935
vocab_size: 1,263
readability_flesch_mean: 116.186
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 0.913
allcaps_rate: 2.91e-03
boilerplate_rate: 0.000
Sample values (first 10)
  1. χ
  2. NA
  3. NA
  4. NA
  5. ˦
  6. NA
  7. d̤ɮ̤
  8. NA
  9. NA
  10. ã ə̃

Marginal categorical

rows105,484
null0 (0.0%)
unique3
top_valueFALSE
top_rate0.789
cardinality3
entropy0.812
entropy_ratio0.512
Top values (rank 1–20)
  1. FALSE — 83,263
  2. NA — 20,874
  3. TRUE — 1,347
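
Since the three values arrive as strings, a loader has to decide what 'NA' means before the column can be used as a boolean. The sketch below maps the strings to True, False, and None and rejects anything else. Reading 'NA' as "unknown" is an assumption, and the function name is hypothetical.

```python
def parse_tristate(raw: str):
    """Map 'TRUE'/'FALSE'/'NA' to True/False/None; raise on anything else."""
    mapping = {"TRUE": True, "FALSE": False, "NA": None}
    key = raw.strip().upper()
    if key not in mapping:
        raise ValueError(f"unexpected Marginal value: {raw!r}")
    return mapping[key]


print([parse_tristate(v) for v in ("FALSE", "NA", "TRUE")])
```

Failing loudly on unknown tokens keeps a future fourth value from being silently coerced.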

SegmentClass categorical

rows105,484
null0 (0.0%)
unique3
top_valueconsonant
top_rate0.685
cardinality3
entropy1.008
entropy_ratio0.636
Top values (rank 1–20)
  1. consonant — 72,282
  2. vowel — 31,052
  3. tone — 2,150
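
The entropy figures in this block are consistent with Shannon entropy in bits and with entropy_ratio = entropy / log2(cardinality). That reading is inferred from the numbers, not from saturn's documentation. A minimal sketch that recomputes both values from the three counts above (function names are hypothetical):

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a single category carries no information
    return entropy_bits(counts) / math.log2(k)


segment_class = [72_282, 31_052, 2_150]  # consonant, vowel, tone
print(entropy_bits(segment_class), entropy_ratio(segment_class))
```

Both results land within rounding distance of the reported 1.008 and 0.636. A ratio near 1 means the categories are close to evenly used; a ratio near 0 means one value dominates.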

Source categorical

rows105,484
null0 (0.0%)
unique8
top_valueph
top_rate0.344
cardinality8
entropy2.697
entropy_ratio0.899
Top values (rank 1–20)
  1. ph — 36,274
  2. ea — 16,883
  3. upsid — 13,966
  4. er — 9,423
  5. saphon — 9,047
  6. aa — 8,064
  7. spa — 7,566
  8. ra — 4,261

tone categorical

top value is 98.0% of rows
rows105,484
null0 (0.0%)
unique2
top_value0
top_rate0.980
cardinality2
entropy0.144
entropy_ratio0.144
Top values (rank 1–20)
  1. 0 — 103,334
  2. + — 2,150
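
The '+' count here (2,150) equals the number of rows SegmentClass labels as tone, which suggests the two columns encode the same fact. Matching totals do not prove row-level agreement, so a direct check is worthwhile. The sketch below assumes the CSV headers are named exactly 'SegmentClass' and 'tone' as in this report; the three sample rows are invented for the demo.

```python
import csv
import io


def tone_mismatches(rows):
    """Rows where SegmentClass == 'tone' and tone == '+' disagree."""
    return [row for row in rows if (row["SegmentClass"] == "tone") != (row["tone"] == "+")]


# Hypothetical rows mirroring value shapes seen in the report.
sample_csv = "Phoneme,SegmentClass,tone\nm,consonant,0\n\u02e6,tone,+\ni,vowel,0\n"
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(len(tone_mismatches(sample_rows)))
```

On the real file, passing csv.DictReader over the opened CSV instead of the inline sample would list any disagreeing rows.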

stress categorical

top value is 98.0% of rows
rows105,484
null0 (0.0%)
unique2
top_value-
top_rate0.980
cardinality2
entropy0.144
entropy_ratio0.144
Top values (rank 1–20)
  1. - — 103,334
  2. 0 — 2,150

syllabic categorical

rows105,484
null0 (0.0%)
unique8
top_value-
top_rate0.685
cardinality8
entropy1.042
entropy_ratio0.347
Top values (rank 1–20)
  1. - — 72,248
  2. + — 30,692
  3. 0 — 2,150
  4. +,- — 244
  5. -,+ — 124
  6. -,+,- — 12
  7. -,+,+ — 12
  8. +,+,- — 2
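
The comma-joined values look like sequences of feature values for contour or complex segments. Before using such a column as a categorical, it helps to split each value into its parts and to know how many rows are affected. A minimal sketch using the counts listed above (function names are hypothetical):

```python
def split_contour(value: str):
    """Split a feature value such as '-,+,-' into its component values."""
    return tuple(value.split(","))


def is_contour(value: str) -> bool:
    """True when the value holds more than one component."""
    return len(split_contour(value)) > 1


# Counts copied from the syllabic block.
syllabic_counts = {
    "-": 72_248, "+": 30_692, "0": 2_150, "+,-": 244,
    "-,+": 124, "-,+,-": 12, "-,+,+": 12, "+,+,-": 2,
}
contour_rows = sum(n for value, n in syllabic_counts.items() if is_contour(value))
print(contour_rows, sum(syllabic_counts.values()))  # 394 105484
```

The same two helpers apply unchanged to the other feature columns, where the set of composite values differs but the encoding is the same.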

short categorical

top value is 97.8% of rows
rows105,484
null0 (0.0%)
unique4
top_value-
top_rate0.978
cardinality4
entropy0.164
entropy_ratio0.082
Top values (rank 1–20)
  1. - — 103,125
  2. 0 — 2,150
  3. + — 204
  4. -,+ — 5

long categorical

rows105,484
null0 (0.0%)
unique6
top_value-
top_rate0.899
cardinality6
entropy0.554
entropy_ratio0.214
Top values (rank 1–20)
  1. - — 94,844
  2. + — 8,386
  3. 0 — 2,150
  4. -,+ — 63
  5. +,- — 40
  6. -,-,+ — 1

consonantal categorical

rows105,484
null0 (0.0%)
unique5
top_value+
top_rate0.609
cardinality5
entropy1.085
entropy_ratio0.467
Top values (rank 1–20)
  1. + — 64,257
  2. - — 39,041
  3. 0 — 2,151
  4. +,- — 34
  5. -,+ — 1

sonorant categorical

rows105,484
null0 (0.0%)
unique8
top_value+
top_rate0.530
cardinality8
entropy1.245
entropy_ratio0.415
Top values (rank 1–20)
  1. + — 55,920
  2. - — 45,322
  3. 0 — 2,150
  4. +,- — 1,948
  5. -,+ — 89
  6. +,-,- — 29
  7. +,-,+ — 25
  8. +,-,+,- — 1

continuant categorical

rows105,484
null0 (0.0%)
unique9
top_value+
top_rate0.549
cardinality9
entropy1.172
entropy_ratio0.370
Top values (rank 1–20)
  1. + — 57,952
  2. - — 44,585
  3. 0 — 2,151
  4. -,+ — 728
  5. -,-,+ — 50
  6. +,- — 9
  7. 0,-,+ — 4
  8. -,+,+ — 4
  9. 0,0,-,+ — 1

delayedRelease categorical

rows105,484
null0 (0.0%)
unique7
top_value0
top_rate0.550
cardinality7
entropy1.471
entropy_ratio0.524
Top values (rank 1–20)
  1. 0 — 58,035
  2. - — 27,384
  3. + — 19,533
  4. -,+ — 492
  5. 0,-,+ — 33
  6. +,- — 6
  7. 0,0,-,+ — 1

approximant categorical

rows105,484
null0 (0.0%)
unique6
top_value-
top_rate0.559
cardinality6
entropy1.120
entropy_ratio0.433
Top values (rank 1–20)
  1. - — 58,966
  2. + — 44,266
  3. 0 — 2,150
  4. -,+ — 71
  5. -,-,+ — 25
  6. +,- — 6

tap categorical

top value is 96.7% of rows
rows105,484
null0 (0.0%)
unique5
top_value-
top_rate0.967
cardinality5
entropy0.242
entropy_ratio0.104
Top values (rank 1–20)
  1. - — 102,023
  2. 0 — 2,203
  3. + — 1,218
  4. -,+ — 25
  5. -,-,+ — 15

trill categorical

top value is 96.2% of rows
rows105,484
null0 (0.0%)
unique6
top_value-
top_rate0.962
cardinality6
entropy0.276
entropy_ratio0.107
Top values (rank 1–20)
  1. - — 101,427
  2. 0 — 2,202
  3. + — 1,819
  4. -,+ — 26
  5. -,-,+ — 8
  6. +,- — 2

nasal categorical

rows105,484
null0 (0.0%)
unique8
top_value-
top_rate0.808
cardinality8
entropy0.897
entropy_ratio0.299
Top values (rank 1–20)
  1. - — 85,269
  2. + — 15,941
  3. 0 — 2,150
  4. +,- — 1,973
  5. -,+ — 95
  6. +,-,- — 54
  7. +,-,+,- — 1
  8. -,+,- — 1

lateral categorical

rows: 105,484
null: 0 (0.0%)
unique: 8
top_value: -
top_rate: 0.938
cardinality: 8
entropy: 0.401
entropy_ratio: 0.134
Top values (rank 1–20)
  1. - — 98,968
  2. + — 4,211
  3. 0 — 2,150
  4. -,+ — 135
  5. +,- — 12
  6. -,-,+ — 4
  7. -,+,- — 3
  8. 0,-,+ — 1

labial categorical

rows: 105,484
null: 0 (0.0%)
unique: 15
top_value: -
top_rate: 0.682
cardinality: 15
entropy: 1.182
entropy_ratio: 0.302
Top values (rank 1–20)
  1. - — 71,961
  2. + — 28,241
  3. -,+ — 2,414
  4. 0 — 2,160
  5. +,- — 531
  6. -,-,+ — 121
  7. +,-,- — 21
  8. 0,+,- — 8
  9. -,+,- — 6
  10. 0,-,+ — 5
  11. -,+,+ — 5
  12. +,+,- — 5
  13. +,-,+ — 4
  14. -,-,+,+ — 1
  15. 0,+,-,- — 1

round categorical

rows: 105,484
null: 0 (0.0%)
unique: 8
top_value: 0
top_rate: 0.703
cardinality: 8
entropy: 1.194
entropy_ratio: 0.398
Top values (rank 1–20)
  1. 0 — 74,155
  2. + — 16,956
  3. - — 14,082
  4. -,+ — 269
  5. -,-,+ — 17
  6. -,0,+ — 3
  7. 0,-,+ — 1
  8. +,- — 1

labiodental categorical

rows: 105,484
null: 0 (0.0%)
unique: 6
top_value: 0
top_rate: 0.703
cardinality: 6
entropy: 1.006
entropy_ratio: 0.389
Top values (rank 1–20)
  1. 0 — 74,124
  2. - — 28,726
  3. + — 2,574
  4. +,- — 56
  5. -,+ — 3
  6. +,+,- — 1

coronal categorical

rows: 105,484
null: 0 (0.0%)
unique: 7
top_value: -
top_rate: 0.628
cardinality: 7
entropy: 1.080
entropy_ratio: 0.385
Top values (rank 1–20)
  1. - — 66,234
  2. + — 36,955
  3. 0 — 2,160
  4. +,- — 87
  5. -,+ — 41
  6. -,-,+ — 6
  7. +,-,+ — 1

anterior categorical

rows: 105,484
null: 0 (0.0%)
unique: 6
top_value: 0
top_rate: 0.648
cardinality: 6
entropy: 1.251
entropy_ratio: 0.484
Top values (rank 1–20)
  1. 0 — 68,372
  2. + — 25,704
  3. - — 11,391
  4. -,+ — 9
  5. +,- — 5
  6. -,-,+ — 3

distributed categorical

rows: 105,484
null: 0 (0.0%)
unique: 11
top_value: 0
top_rate: 0.660
cardinality: 11
entropy: 1.273
entropy_ratio: 0.368
Top values (rank 1–20)
  1. 0 — 69,639
  2. - — 22,283
  3. + — 13,228
  4. -,+ — 296
  5. -,-,+ — 25
  6. +,- — 5
  7. 0,-,+ — 3
  8. +,-,+ — 2
  9. 0,+,- — 1
  10. +,+,- — 1
  11. 0,0,-,+ — 1

strident categorical

rows: 105,484
null: 0 (0.0%)
unique: 9
top_value: 0
top_rate: 0.649
cardinality: 9
entropy: 1.287
entropy_ratio: 0.406
Top values (rank 1–20)
  1. 0 — 68,410
  2. - — 25,410
  3. + — 11,039
  4. -,+ — 585
  5. -,-,+ — 26
  6. +,- — 7
  7. -,+,- — 3
  8. 0,-,+ — 3
  9. 0,0,-,+ — 1

dorsal categorical

rows: 105,484
null: 0 (0.0%)
unique: 13
top_value: +
top_rate: 0.517
cardinality: 13
entropy: 1.235
entropy_ratio: 0.334
Top values (rank 1–20)
  1. + — 54,535
  2. - — 47,052
  3. 0 — 2,160
  4. -,+ — 1,530
  5. +,- — 144
  6. -,-,+ — 44
  7. 0,-,+ — 6
  8. +,-,+ — 5
  9. -,+,+ — 4
  10. +,+,-,- — 1
  11. -,+,- — 1
  12. +,+,- — 1
  13. 0,0,-,+ — 1

high categorical

rows: 105,484
null: 0 (0.0%)
unique: 11
top_value: 0
top_rate: 0.467
cardinality: 11
entropy: 1.594
entropy_ratio: 0.461
Top values (rank 1–20)
  1. 0 — 49,247
  2. + — 35,559
  3. - — 19,156
  4. -,+ — 845
  5. +,- — 627
  6. +,-,+ — 38
  7. +,+,- — 6
  8. -,+,+ — 2
  9. -,-,+ — 2
  10. +,-,0 — 1
  11. -,+,- — 1

low categorical

rows: 105,484
null: 0 (0.0%)
unique: 8
top_value: -
top_rate: 0.473
cardinality: 8
entropy: 1.305
entropy_ratio: 0.435
Top values (rank 1–20)
  1. - — 49,930
  2. 0 — 49,244
  3. + — 5,598
  4. +,- — 417
  5. -,+ — 270
  6. -,+,- — 21
  7. -,-,+ — 3
  8. +,-,- — 1

front categorical

rows: 105,484
null: 0 (0.0%)
unique: 13
top_value: 0
top_rate: 0.468
cardinality: 13
entropy: 1.592
entropy_ratio: 0.430
Top values (rank 1–20)
  1. 0 — 49,316
  2. - — 34,225
  3. + — 20,683
  4. -,+ — 838
  5. +,- — 359
  6. -,-,+ — 24
  7. +,-,- — 14
  8. -,+,+ — 10
  9. +,-,+ — 6
  10. -,0,+ — 3
  11. +,+,- — 2
  12. 0,-,+ — 2
  13. -,+,- — 2

back categorical

rows: 105,484
null: 0 (0.0%)
unique: 12
top_value: 0
top_rate: 0.467
cardinality: 12
entropy: 1.521
entropy_ratio: 0.424
Top values (rank 1–20)
  1. 0 — 49,270
  2. - — 39,749
  3. + — 15,547
  4. +,- — 511
  5. -,+ — 367
  6. +,-,- — 19
  7. -,-,+ — 8
  8. -,+,+ — 5
  9. -,+,- — 5
  10. 0,+,- — 1
  11. +,-,+ — 1
  12. +,+,- — 1

tense categorical

rows: 105,484
null: 0 (0.0%)
unique: 8
top_value: 0
top_rate: 0.713
cardinality: 8
entropy: 1.114
entropy_ratio: 0.371
Top values (rank 1–20)
  1. 0 — 75,230
  2. + — 23,411
  3. - — 6,386
  4. +,- — 268
  5. -,+ — 179
  6. +,-,+ — 6
  7. +,-,- — 3
  8. +,+,- — 1

retractedTongueRoot categorical

top value is 97.4% of rows
rows: 105,484
null: 0 (0.0%)
unique: 7
top_value: -
top_rate: 0.974
cardinality: 7
entropy: 0.193
entropy_ratio: 0.069
Top values (rank 1–20)
  1. - — 102,788
  2. 0 — 2,235
  3. -,+ — 251
  4. + — 199
  5. -,-,+ — 9
  6. -,+,- — 1
  7. +,- — 1

advancedTongueRoot categorical

top value is 97.9% of rows
rows: 105,484
null: 0 (0.0%)
unique: 3
top_value: -
top_rate: 0.979
cardinality: 3
entropy: 0.150
entropy_ratio: 0.094
Top values (rank 1–20)
  1. - — 103,238
  2. 0 — 2,235
  3. + — 11

periodicGlottalSource categorical

rows: 105,484
null: 0 (0.0%)
unique: 7
top_value: +
top_rate: 0.680
cardinality: 7
entropy: 1.051
entropy_ratio: 0.375
Top values (rank 1–20)
  1. + — 71,694
  2. - — 31,179
  3. 0 — 2,139
  4. +,- — 371
  5. -,+ — 87
  6. +,-,- — 8
  7. +,-,+ — 6

epilaryngealSource categorical

top value is 97.9% of rows
rows: 105,484
null: 0 (0.0%)
unique: 3
top_value: -
top_rate: 0.979
cardinality: 3
entropy: 0.147
entropy_ratio: 0.093
Top values (rank 1–20)
  1. - — 103,303
  2. 0 — 2,150
  3. + — 31

spreadGlottis categorical

rows: 105,484
null: 0 (0.0%)
unique: 10
top_value: -
top_rate: 0.918
cardinality: 10
entropy: 0.497
entropy_ratio: 0.149
Top values (rank 1–20)
  1. - — 96,855
  2. + — 6,156
  3. 0 — 2,138
  4. -,+ — 206
  5. +,- — 115
  6. -,-,+ — 5
  7. +,0,- — 5
  8. +,-,- — 2
  9. +,0,-,- — 1
  10. +,-,+ — 1

constrictedGlottis categorical

rows: 105,484
null: 0 (0.0%)
unique: 7
top_value: -
top_rate: 0.945
cardinality: 7
entropy: 0.372
entropy_ratio: 0.132
Top values (rank 1–20)
  1. - — 99,727
  2. + — 3,383
  3. 0 — 2,138
  4. +,- — 141
  5. -,+ — 93
  6. +,-,- — 1
  7. -,-,+ — 1

fortis categorical

rows: 105,484
null: 0 (0.0%)
unique: 3
top_value: -
top_rate: 0.681
cardinality: 3
entropy: 0.934
entropy_ratio: 0.589
Top values (rank 1–20)
  1. - — 71,867
  2. 0 — 33,202
  3. + — 415

lenis categorical

rows: 105,484
null: 0 (0.0%)
unique: 3
top_value: -
top_rate: 0.681
cardinality: 3
entropy: 0.934
entropy_ratio: 0.589
Top values (rank 1–20)
  1. - — 71,866
  2. 0 — 33,202
  3. + — 416

raisedLarynxEjective categorical

top value is 96.4% of rows
rows: 105,484
null: 0 (0.0%)
unique: 6
top_value: -
top_rate: 0.964
cardinality: 6
entropy: 0.267
entropy_ratio: 0.103
Top values (rank 1–20)
  1. - — 101,652
  2. 0 — 2,150
  3. + — 1,573
  4. -,+ — 85
  5. +,- — 23
  6. -,-,+ — 1

loweredLarynxImplosive categorical

top value is 97.3% of rows
rows: 105,484
null: 0 (0.0%)
unique: 5
top_value: -
top_rate: 0.973
cardinality: 5
entropy: 0.203
entropy_ratio: 0.088
Top values (rank 1–20)
  1. - — 102,609
  2. 0 — 2,150
  3. + — 716
  4. -,+ — 7
  5. +,- — 2
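
The "top value is N% of rows" banner appears in this section on columns whose top_rate is 0.962 or higher and is absent on columns at 0.945 or lower, so the cutoff saturn applies presumably lies somewhere between those two values. The sketch below reproduces the flagging with an assumed threshold of 0.95; both the threshold and the name `near_constant` are hypothetical, and the top_rate values are copied from the blocks above.

```python
# Assumed cutoff; saturn's actual threshold is not stated in this report.
NEAR_CONSTANT_THRESHOLD = 0.95

# top_rate values copied from the column blocks in this section.
top_rates = {
    "tap": 0.967,
    "trill": 0.962,
    "lateral": 0.938,
    "retractedTongueRoot": 0.974,
    "advancedTongueRoot": 0.979,
    "epilaryngealSource": 0.979,
    "spreadGlottis": 0.918,
    "constrictedGlottis": 0.945,
    "raisedLarynxEjective": 0.964,
    "loweredLarynxImplosive": 0.973,
}

def near_constant(rates, threshold=NEAR_CONSTANT_THRESHOLD):
    """Return the sorted names of columns whose top value reaches the threshold."""
    return sorted(name for name, rate in rates.items() if rate >= threshold)

flagged = near_constant(top_rates)
for name in flagged:
    print(f"{name}: top value is {top_rates[name]:.1%} of rows")
```

Columns flagged this way carry little information on their own, which matches their low entropy_ratio values (all below 0.11 in this section).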

click categorical

rows: 105,484
null: 0 (0.0%)
unique: 5
top_value: -
top_rate: 0.682
cardinality: 5
entropy: 0.928
entropy_ratio: 0.400
Top values (rank 1–20)
  1. - — 71,971
  2. 0 — 33,202
  3. + — 253
  4. +,- — 52
  5. -,+ — 6