parquet phonemes
Reading
This dataset is a phoneme inventory table with 105,484 rows and 8 columns. It indexes phonemes by language (via iso_639_3 and glottocode) and records phonological features (segment_class, syllabic, stress, tone) plus a source attribution. Coverage spans 2,094 ISO 639-3 codes and 2,176 Glottolog codes, with 'mis' (828 rows) and 'kham1282' (622 rows) the most represented. Worth a closer look first: the segment_class and source distributions. segment_class is consonant-heavy (72,282 consonants vs 31,052 vowels vs 2,150 tones), and source is led by 'ph' at 34% but spreads across 8 datasets, which shows where the data density comes from. The phoneme column is also informative: common segments like /m/, /i/, /k/, /j/ top the list, matching well-known cross-linguistic frequencies. Note that stress and tone are highly imbalanced (~98% one value) and largely redundant with the 'tone' segment_class.
citing: row_count · column_count · columns.iso_639_3.n_unique · columns.glottocode.n_unique · columns.segment_class.top_values · columns.source.top_values · columns.source.top_rate · columns.phoneme.top_values · columns.stress.top_rate · columns.tone.top_rate · columns.syllabic.top_values
Charts the summary said to look at first
segment_class value counts
| value | count | share |
|---|---|---|
| consonant | 72282 | 68.5% |
| vowel | 31052 | 29.4% |
| tone | 2150 | 2.0% |
source value counts
| value | count | share |
|---|---|---|
| ph | 36274 | 34.4% |
| ea | 16883 | 16.0% |
| upsid | 13966 | 13.2% |
| er | 9423 | 8.9% |
| saphon | 9047 | 8.6% |
| aa | 8064 | 7.6% |
| spa | 7566 | 7.2% |
| ra | 4261 | 4.0% |
phoneme length in characters
| chars | count |
|---|---|
| 1 | 67114 |
| 2 | 28726 |
| 3 | 6559 |
| 4 | 2225 |
| 5 | 401 |
| 6 | 267 |
| 7 | 104 |
| 8 | 5 |
| 9 | 70 |
| 10 | 1 |
| 11 | 12 |
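As a quick consistency check, the non-empty length bins can be used to recompute the mean phoneme length and the share of single-character tokens. This is a minimal sketch in plain Python. The name `length_counts` is illustrative, and the counts are copied from the phoneme length histogram.

```python
# Counts of phoneme tokens by character length (non-empty histogram bins only).
length_counts = {
    1: 67114, 2: 28726, 3: 6559, 4: 2225, 5: 401, 6: 267,
    7: 104, 8: 5, 9: 70, 10: 1, 11: 12,
}

total = sum(length_counts.values())
# Weighted mean: each length contributes in proportion to its row count.
mean_length = sum(length * count for length, count in length_counts.items()) / total
single_char_share = length_counts[1] / total

print(total)
print(round(mean_length, 3))
print(round(single_char_share, 3))
```

The bins sum to 105,484 rows and the mean comes out at about 1.50, in line with the reported len_mean of 1.501.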
syllabic value counts
| value | count | share |
|---|---|---|
| - | 72248 | 68.5% |
| + | 30692 | 29.1% |
| 0 | 2150 | 2.0% |
| +,- | 244 | 0.2% |
| -,+ | 124 | 0.1% |
| -,+,- | 12 | 0.0% |
| -,+,+ | 12 | 0.0% |
| +,+,- | 2 | 0.0% |
iso_639_3 length in characters
| chars | count |
|---|---|
| 3 | 105459 |
Schema
8 columns

| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| glottocode | text | 0.0% | 2,176 | one_word, short_text, duplicates |
| iso_639_3 | text | 0.0% | 2,094 | one_word, short_text, duplicates |
| phoneme | text | 0.0% | 3,142 | one_word, short_text, duplicates |
| segment_class | categorical | 0.0% | 3 | |
| tone | categorical | 0.0% | 2 | imbalance |
| stress | categorical | 0.0% | 2 | imbalance |
| syllabic | categorical | 0.0% | 8 | |
| source | categorical | 0.0% | 8 | |
glottocode
text · foreign_key · one_word · short_text · duplicates
This column holds Glottocodes: fixed 8-character language identifiers (len_min/max/mean all 8, one_word_rate 1.0) from the Glottolog catalog. Across 105,484 rows there are only 2,176 unique codes with a 97.9% duplicate rate, so each code tags many records; the most frequent codes include kham1282 (622) and osse1243 (483). Null rate is negligible (0.0002) and vocabulary (2,166) closely matches unique count, indicating clean categorical data rather than free text.
Treatment: Treat as a categorical key; left-join to a Glottolog reference table for language metadata.
- n: 105,484
- nulls: 19 (0.0%)
- unique: 2,176
- len_min: 8
- len_max: 8
- len_mean: 8
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 103,289
- duplicate_rate: 0.9794
- vocab_size: 2,166
- readability_flesch_mean: 97.11
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
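The duplicate figures follow from the counts in this list. The sketch below assumes that nulls are excluded, that n_duplicates is the number of non-null rows minus the number of distinct values, and that duplicate_rate divides by the non-null row count. These definitions are inferred from the reported numbers, not documented by the profiler, and the function name is illustrative.

```python
def duplicate_stats(n_rows: int, n_nulls: int, n_unique: int) -> tuple[int, float]:
    """Return (n_duplicates, duplicate_rate) under the assumed definitions."""
    non_null = n_rows - n_nulls
    # Every non-null row beyond the first occurrence of a value counts as a duplicate.
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null


n_dup, rate = duplicate_stats(105_484, 19, 2_176)
print(n_dup, round(rate, 4))
```

With the glottocode counts this gives 103,289 duplicates and a rate of 0.9794, matching the reported values.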
iso_639_3
text · feature · one_word · short_text · duplicates
This column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,094 distinct codes across 105,484 rows. The distribution is heavy-tailed: the top code 'mis' (the ISO 'uncoded languages' placeholder) leads at 828 occurrences, followed by 'khg' and 'oss', and duplicates account for 98.01% of non-null rows by design. Null rate is negligible (0.0002). The prevalence of 'mis' is worth flagging, since it marks languages without an assigned ISO code rather than a real language label.
Treatment: Treat as a categorical code; one-hot or target-encode, and decide whether to drop or bucket the 'mis' placeholder.
- n: 105,484
- nulls: 25 (0.0%)
- unique: 2,094
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 103,365
- duplicate_rate: 0.9801
- vocab_size: 2,089
- readability_flesch_mean: 120
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
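One way to act on the 'mis' placeholder before encoding is to recode it to a missing value, after which those rows can be filtered out or bucketed. This is a minimal sketch with plain Python lists. The helper name `recode_placeholder` is hypothetical, and the short sample list is made up from codes mentioned in this report.

```python
from typing import Optional

PLACEHOLDER = "mis"  # ISO 639-3 code for "uncoded languages"


def recode_placeholder(code: Optional[str]) -> Optional[str]:
    """Map the 'mis' placeholder to None so it is handled like a null."""
    return None if code == PLACEHOLDER else code


codes = ["khg", "mis", "oss", None, "mis"]  # toy sample, not the real column
cleaned = [recode_placeholder(c) for c in codes]
kept = [c for c in cleaned if c is not None]
print(cleaned)
print(kept)
```

Recoding keeps the row count intact, so the decision to drop or bucket the affected rows can be made later in the pipeline.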
phoneme
text · feature · one_word · short_text · duplicates
This column holds short phoneme tokens: every entry is a single word, mean length 1.5 characters and max 11, with top values being single-character segments like 'm', 'i', 'k', 'j'. Despite 105,484 rows there are only 3,142 distinct values and a 97.0% duplicate rate, so this behaves as a categorical symbol inventory rather than free text. Multi-character tokens exist alongside the single-character majority (len_p95 is 3 and len_max is 11). The vocab_size of 1,339 is lower than the unique count, so it should not be read as the inventory size.
Treatment: Treat as a categorical phoneme code; label-encode or one-hot rather than tokenizing as text.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 3,142
- len_min: 1
- len_max: 11
- len_mean: 1.501
- len_median: 1
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 102,342
- duplicate_rate: 0.9702
- vocab_size: 1,339
- readability_flesch_mean: 114.4
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.001754
- boilerplate_rate: 0
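Label encoding, as suggested in the treatment, can be sketched as a lookup from each distinct phoneme string to an integer id. The function name and the tiny sample are illustrative only. Ids are assigned in sorted order so that the mapping is reproducible across runs.

```python
def build_label_encoder(values):
    """Map each distinct value to a stable integer id (sorted order)."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


phonemes = ["m", "i", "k", "j", "m", "k"]  # toy sample, not the real column
encoder = build_label_encoder(phonemes)
encoded = [encoder[p] for p in phonemes]
print(encoder)
print(encoded)
```

Whole strings are used as keys, so multi-character phonemes are kept as single categories instead of being split into characters.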
segment_class
categorical · label
This is a 3-level categorical labeling each row as a phonological segment class: consonant, vowel, or tone. The distribution is imbalanced: consonants account for 68.5% (72,282), vowels 29.4% (31,052), and tones only 2.0% (2,150), yielding entropy of 1.01 (entropy ratio 0.64). No nulls across 105,484 rows.
Treatment: One-hot encode (the classes have no natural order, so ordinal encoding is not recommended); consider class-weighting or stratified sampling given the rare 'tone' class.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 3
- top_value: consonant
- top_rate: 0.6852
- cardinality: 3
- entropy: 1.008
- entropy_ratio: 0.6357
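The entropy figures can be reproduced from the class counts. The sketch below assumes Shannon entropy in bits and that entropy_ratio is entropy divided by log2 of the number of classes. Both assumptions are consistent with the reported values but are not stated in the report itself.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


counts = [72282, 31052, 2150]  # consonant, vowel, tone
h = entropy_bits(counts)
# Normalize by the maximum possible entropy for this many classes.
ratio = h / math.log2(len(counts))
print(round(h, 3), round(ratio, 4))
```

This yields roughly 1.01 bits and a ratio of about 0.64, in agreement with the entropy and entropy_ratio entries.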
tone
categorical · label · imbalance
Binary tone flag with values "0" and "+". In this phonological table it most likely marks tone segments: the 2,150 "+" cases equal the number of rows whose segment_class is 'tone', so "+" appears to flag tonal segments and "0" everything else. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows, and the entropy ratio is just 0.144. No nulls are present.
Treatment: Check whether tone is fully determined by segment_class and, if so, drop it as redundant; otherwise stratify splits and apply class weighting or resampling.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 2
- top_value: 0
- top_rate: 0.9796
- cardinality: 2
- entropy: 0.1436
- entropy_ratio: 0.1436
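The matching counts (2,150 "+" values here and 2,150 rows with segment_class 'tone') suggest that the two columns carry the same information, but only a row-level check can confirm it. This is a minimal sketch, assuming rows are available as dictionaries keyed by column name. The function name is hypothetical and the sample rows are made up for illustration.

```python
def tone_is_redundant(rows):
    """True if tone == '+' exactly when segment_class == 'tone' in every row."""
    return all(
        (row["tone"] == "+") == (row["segment_class"] == "tone")
        for row in rows
    )


sample_rows = [  # toy rows, not taken from the dataset
    {"segment_class": "consonant", "tone": "0"},
    {"segment_class": "vowel", "tone": "0"},
    {"segment_class": "tone", "tone": "+"},
]
print(tone_is_redundant(sample_rows))
```

If the check holds on the full table, the tone column can be dropped without losing information.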
stress
categorical · feature · imbalance
Binary categorical flag with only two values, '-' and '0', where '-' dominates at 97.96% of 105,484 rows and '0' covers the remaining 2,150. In phonological feature notation '-' is the minus value of the feature (here, unstressed), not a missing-data placeholder, and '0' likely means the feature is unspecified; the 2,150 count matches the number of 'tone' rows. No '+' value occurs, and the entropy ratio of 0.14 confirms the column carries almost no information beyond what segment_class already provides.
Treatment: Drop, or keep as-is for completeness; do not recode '-' as missing. A near-constant column is unlikely to help modelling.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 2
- top_value: -
- top_rate: 0.9796
- cardinality: 2
- entropy: 0.1436
- entropy_ratio: 0.1436
syllabic
categorical · feature
This appears to be a phonological feature column encoding the [syllabic] distinctive feature, dominated by the simple values '-' (68.5%) and '+' (29.1%), with 2,150 entries marked '0' (likely unspecified; the count matches the 'tone' rows). Notably, 394 rows carry comma-separated compound values like '+,-' or '-,+,-', suggesting complex segments whose feature value changes over their duration (possibly diphthongs or other contour segments). Entropy ratio of 0.347 reflects the concentration in the top two values.
Treatment: One-hot encode, or split compound comma values into sequence positions before modelling.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 8
- top_value: -
- top_rate: 0.6849
- cardinality: 8
- entropy: 1.042
- entropy_ratio: 0.3472
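Splitting the compound values into sequence positions could look like the sketch below. It pads every value to three positions, the longest length among the values listed for this column. The padding token and the function name are assumptions made for illustration.

```python
def split_feature(value: str, width: int = 3, pad: str = "") -> list[str]:
    """Split a comma-separated feature value into a fixed number of positions."""
    parts = value.split(",")
    if len(parts) > width:
        raise ValueError(f"{value!r} has more than {width} positions")
    # Right-pad short values so every row yields the same number of columns.
    return parts + [pad] * (width - len(parts))


for v in ["-", "+", "0", "+,-", "-,+,-"]:
    print(v, split_feature(v))
```

Each position can then be one-hot encoded separately, which keeps the ordering information that a single one-hot over the 8 raw values would discard.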
source
categorical · metadata
Categorical provenance tag with 8 distinct values across 105,484 rows and no nulls, almost certainly indicating which source database or inventory each record came from (e.g., 'ph', 'ea', 'upsid', 'saphon'). Distribution is moderately concentrated: 'ph' accounts for 34.4% of rows while the smallest source 'ra' contributes only 4,261, yielding a high entropy ratio of 0.90. No single source dominates outright, so this is a usable stratification key rather than a near-constant flag.
Treatment: Keep as a categorical grouping/stratification variable; one-hot encode if used as a model feature.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 8
- top_value: ph
- top_rate: 0.3439
- cardinality: 8
- entropy: 2.697
- entropy_ratio: 0.8991
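Using source as a stratification key can be sketched with the standard library alone: group row indices by source, then sample the same fraction from each group. The function name, the fraction, and the seed are illustrative choices. Group sizes are rounded up so that small sources such as 'ra' are never dropped entirely.

```python
import math
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Return sorted row indices sampled at `fraction` within each label group."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    picked = []
    for label in sorted(groups):
        members = groups[label]
        k = math.ceil(len(members) * fraction)
        picked.extend(rng.sample(members, k))
    return sorted(picked)


labels = ["ph"] * 8 + ["ea"] * 4 + ["ra"] * 2  # toy labels, not real proportions
sample = stratified_sample(labels, fraction=0.5)
print(sample)
```

Sampling within groups preserves each source's share of the data, which a plain random sample only does approximately.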