parquet phonemes

source /home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet · 105,484 rows · 8 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a phoneme inventory table with 105,484 rows and 8 columns. Each row indexes a phoneme by language (iso_639_3 and glottocode) and carries phonological features (segment_class, syllabic, stress, tone) plus a source attribution. Coverage spans 2,094 ISO 639-3 codes and 2,176 Glottocodes; the most frequent values are the ISO placeholder 'mis' (828 rows) and the Glottocode 'kham1282' (622 rows). Worth a closer look first: the segment_class and source distributions. segment_class is consonant-heavy (72,282 consonants vs 31,052 vowels vs 2,150 tones), and source is led by 'ph' at 34% but spread across 8 datasets, which shows where data density comes from. The phoneme column is also informative: common segments like /m/, /i/, /k/, /j/ top the list, matching well-known cross-linguistic frequencies. Note that stress and tone are highly imbalanced (~98% one value); in each, the minority value has 2,150 rows, the same count as the 'tone' segment_class, so both look largely redundant with it.

citing: row_count · column_count · columns.iso_639_3.n_unique · columns.glottocode.n_unique · columns.segment_class.top_values · columns.source.top_values · columns.source.top_rate · columns.phoneme.top_values · columns.stress.top_rate · columns.tone.top_rate · columns.syllabic.top_values

Charts the summary said to look at first

segment_class · Shows the consonant/vowel/tone split; consonants dominate at 68.5% of rows.


Top values for segment_class (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

source · Compares the 8 contributing inventories; 'ph' supplies about a third of all rows.


Top values for source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%

phoneme · Top phoneme symbols across languages — /m/, /i/, /k/ lead, matching typological expectations.


Character-length distribution for phoneme (mean: 1.501).
chars	count
1	67114
2	28726
3	6559
4	2225
5	401
6	267
7	104
8	5
9	70
10	1
11	12

syllabic · Distribution of the syllabic feature; mostly '-' vs '+' with a long tail of rare combined values.


Top values for syllabic (8 unique shown, of 8 total).
value	count	share
-	72248	68.5%
+	30692	29.1%
0	2150	2.0%
+,-	244	0.2%
-,+	124	0.1%
-,+,-	12	0.0%
-,+,+	12	0.0%
+,+,-	2	0.0%

iso_639_3 · Language coverage by ISO code — check which languages contribute the most phoneme rows.


Character-length distribution for iso_639_3 (mean: 3.0).
chars	count
3	105459

Schema

8 columns

Per-column summary. Click column name to jump to its detail.
column	type	nulls	unique	alerts
glottocode	text	0.0%	2,176	one_word short_text duplicates
iso_639_3	text	0.0%	2,094	one_word short_text duplicates
phoneme	text	0.0%	3,142	one_word short_text duplicates
segment_class	categorical	0.0%	3
tone	categorical	0.0%	2	imbalance
stress	categorical	0.0%	2	imbalance
syllabic	categorical	0.0%	8
source	categorical	0.0%	8

glottocode

text foreign_key one_word short_text duplicates

This column holds Glottocodes: fixed 8-character language identifiers (len_min/max/mean all 8, one_word_rate 1.0) from the Glottolog catalog. Across 105,484 rows there are only 2,176 unique codes with a 97.9% duplicate rate, so each code tags many records; the most frequent codes are kham1282 (622 rows) and osse1243 (483 rows). Null rate is negligible (19 rows, 0.0002) and vocabulary (2,166) closely matches unique count, indicating clean categorical data rather than free text. Treatment: Treat as a categorical key; left-join to a Glottolog reference table for language metadata. high · anthropic:claude-opus-4-7

n: 105,484
nulls: 19 (0.0%)
unique: 2,176
len_min: 8
len_max: 8
len_mean: 8
len_median: 8
len_p95: 8
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 103,289
duplicate_rate: 0.9794
vocab_size: 2,166
readability_flesch_mean: 97.11
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
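
The left-join suggested for glottocode can be sketched as follows. This is a minimal illustration on toy rows, not the atlas's actual pipeline: the `glottolog` reference frame, its `name` column, and the placeholder names are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for phonemes.parquet; the real table has 105,484 rows
# and 19 null glottocodes.
phonemes = pd.DataFrame({
    "glottocode": ["kham1282", "kham1282", "osse1243", None],
    "phoneme": ["m", "i", "k", "j"],
})

# Hypothetical reference table; column names and values are placeholders.
glottolog = pd.DataFrame({
    "glottocode": ["kham1282", "osse1243"],
    "name": ["placeholder A", "placeholder B"],
})

# how="left" keeps every phoneme row, validate="many_to_one" fails loudly if
# the reference table repeats a code, and indicator=True adds a `_merge`
# column that marks rows with no reference match.
joined = phonemes.merge(
    glottolog,
    on="glottocode",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows whose glottocode is null or absent from the reference table.
unmatched = joined[joined["_merge"] == "left_only"]
```

Inspecting `unmatched` before dropping `_merge` is a cheap way to see how many rows would lose language metadata.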

iso_639_3

text feature one_word short_text duplicates

This column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,094 distinct codes across 105,484 rows. The distribution is heavy-tailed — the top code 'mis' (the ISO 'uncoded languages' placeholder) leads at 828 occurrences, followed by 'khg' and 'oss', and duplicates account for 98.01% of rows by design. Null rate is negligible (0.0002), and the prevalence of 'mis' is worth flagging since it signals unidentified languages rather than a real language label. Treatment: Treat as a categorical code; one-hot or target-encode, and decide whether to drop or bucket the 'mis' placeholder. high · anthropic:claude-opus-4-7

n: 105,484
nulls: 25 (0.0%)
unique: 2,094
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 103,365
duplicate_rate: 0.9801
vocab_size: 2,089
readability_flesch_mean: 120
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
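
One way to handle the 'mis' placeholder before encoding is to blank it out and keep a separate flag, so that uncoded languages are not treated as one language. The sketch below uses toy values; the `iso_uncoded` column name is an assumption.

```python
import pandas as pd

# Toy values; in the real column 'mis' appears 828 times and 25 rows are null.
iso = pd.Series(["mis", "khg", "oss", "mis", None], name="iso_639_3")

# isin returns plain booleans and treats missing values as False.
is_uncoded = iso.isin(["mis"])

# Replace the placeholder with a missing value so it cannot be mistaken
# for a real language label, and keep the flag as its own feature.
features = pd.DataFrame({
    "iso_639_3": iso.mask(is_uncoded),
    "iso_uncoded": is_uncoded,
})
```

Whether to drop these rows instead depends on the downstream task; the flag keeps that decision reversible.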

phoneme

text feature one_word short_text duplicates

This column holds short phoneme tokens — every entry is a single word, mean length 1.5 characters and max 11, with top values being individual letters like 'm', 'i', 'k', 'j'. Despite 105,484 rows there are only 3,142 distinct values and a 97.0% duplicate rate, so this behaves as a small categorical alphabet rather than free text. Vocab size of 1,339 suggests multi-character phoneme codes exist alongside the single-letter majority. Treatment: Treat as a categorical phoneme code — label-encode or one-hot rather than tokenizing as text. high · anthropic:claude-opus-4-7

n: 105,484
nulls: 0 (0.0%)
unique: 3,142
len_min: 1
len_max: 11
len_mean: 1.501
len_median: 1
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 102,342
duplicate_rate: 0.9702
vocab_size: 1,339
readability_flesch_mean: 114.4
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.001754
boilerplate_rate: 0
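
As a rough illustration of treating phoneme as a categorical code rather than text, the sketch below label-encodes a few toy symbols with a pandas categorical. The sample values are illustrative, not drawn from the file.

```python
import pandas as pd

# Toy sample: single-character symbols plus one two-character segment.
phoneme = pd.Series(["m", "i", "k", "j", "m", "tʃ"], name="phoneme")

# A categorical stores each distinct symbol once and an integer code per row.
phoneme_cat = phoneme.astype("category")
codes = phoneme_cat.cat.codes           # integer label per row
n_levels = len(phoneme_cat.cat.categories)

# Character length per symbol, the quantity behind the length distribution.
lengths = phoneme.str.len()
```

With 3,142 distinct values in the real column, integer codes (or an embedding keyed on them) are likely more practical than a full one-hot matrix.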

segment_class

categorical label

This is a 3-level categorical labeling each row as a phonological segment class: consonant, vowel, or tone. The distribution is imbalanced: consonants account for 68.5% (72,282), vowels 29.4% (31,052), and tones only 2.0% (2,150), yielding entropy of 1.01 (entropy ratio 0.64). No nulls across 105,484 rows. Treatment: One-hot encode; the classes are nominal, so ordinal encoding would impose an artificial order. Consider class-weighting or stratified sampling given the rare 'tone' class. high · anthropic:claude-opus-4-7

n: 105,484
nulls: 0 (0.0%)
unique: 3
top_value: consonant
top_rate: 0.6852
cardinality: 3
entropy: 1.008
entropy_ratio: 0.6357
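
The entropy figures can be reproduced from the class counts. The sketch assumes the profiler reports Shannon entropy in bits (log base 2) and defines entropy_ratio as entropy divided by log2 of the cardinality; the reported values are consistent with that reading, but the profiler's definition is not stated in this report.

```python
import math

# Counts from the segment_class table.
counts = {"consonant": 72282, "vowel": 31052, "tone": 2150}
total = sum(counts.values())  # 105,484 rows

# Shannon entropy in bits: H = -sum(p * log2(p)).
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

# Normalise by the maximum possible entropy for this many classes.
entropy_ratio = entropy / math.log2(len(counts))

print(f"entropy={entropy:.4f} ratio={entropy_ratio:.4f}")
```

A ratio of 1.0 would mean three equally frequent classes; the lower value here reflects the small 'tone' class.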

tone

categorical label imbalance

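Binary tone flag with values "0" and "+". In a phoneme inventory this is most plausibly the phonological [tone] feature, with "+" marking tonal segments and "0" marking segments to which the feature does not apply; it is not a sentiment or polarity label. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows, leaving only 2,150 "+" cases, and entropy ratio is just 0.144. The 2,150 "+" rows equal the count of segment_class 'tone' rows, so the flag appears to duplicate that class. No nulls are present. Treatment: Verify the overlap with segment_class; if it is exact, drop the column as redundant rather than modelling it as a target. high · anthropic:claude-opus-4-7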

n: 105,484
nulls: 0 (0.0%)
unique: 2
top_value: 0
top_rate: 0.9796
cardinality: 2
entropy: 0.1436
entropy_ratio: 0.1436

stress

categorical feature imbalance

Binary categorical flag with only two values, '-' and '0', where '-' dominates at 97.96% of 105,484 rows and '0' covers the remaining 2,150. Entropy ratio of 0.14 confirms the column carries almost no information. Read alongside the other feature columns, '-' is most plausibly the feature value [-stress] rather than a missing-value placeholder, and '0' most plausibly means the feature does not apply; the 2,150 '0' rows equal the count of segment_class 'tone' rows. No segment is marked '+', so the column never records stress itself. Treatment: Drop as near-constant and likely redundant with segment_class; do not recode '-' as missing. high · anthropic:claude-opus-4-7

n: 105,484
nulls: 0 (0.0%)
unique: 2
top_value: -
top_rate: 0.9796
cardinality: 2
entropy: 0.1436
entropy_ratio: 0.1436
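
The minority values of tone ('+') and stress ('0') each cover 2,150 rows, the same count as segment_class 'tone'. Matching counts do not prove the rows coincide, so a row-level check is worth running. The sketch below shows one on toy rows; `matches_tone_class` is a hypothetical helper.

```python
import pandas as pd

# Toy rows standing in for phonemes.parquet (the real file is not loaded here).
df = pd.DataFrame({
    "segment_class": ["consonant", "vowel", "tone", "consonant", "tone"],
    "tone":          ["0", "0", "+", "0", "+"],
    "stress":        ["-", "-", "0", "-", "0"],
})

def matches_tone_class(frame: pd.DataFrame, column: str, marker: str) -> bool:
    """True if `column == marker` on exactly the rows where segment_class is 'tone'."""
    flagged = frame[column].eq(marker)
    is_tone = frame["segment_class"].eq("tone")
    return bool((flagged == is_tone).all())

tone_redundant = matches_tone_class(df, "tone", "+")
stress_redundant = matches_tone_class(df, "stress", "0")

# A crosstab makes any disagreeing rows visible as unexpected non-zero cells.
tone_table = pd.crosstab(df["segment_class"], df["tone"])
```

If both checks return True on the real data, tone and stress add nothing beyond segment_class and can be dropped.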

syllabic

categorical feature

This appears to be a phonological feature column encoding the [syllabic] distinctive feature, dominated by binary values '-' (68.5%) and '+' (29.1%), with 2,150 entries marked '0' (likely unspecified; the count equals the 2,150 tone rows). Notably, 394 rows carry comma-separated compound values like '+,-' or '-,+,-', suggesting segments whose feature value changes over their duration (e.g., diphthongs or other contour segments). Entropy ratio of 0.347 confirms heavy concentration in the top category. Treatment: One-hot encode, or split compound comma values into sequence positions before modelling. high · anthropic:claude-opus-4-7

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: -
top_rate: 0.6849
cardinality: 8
entropy: 1.042
entropy_ratio: 0.3472
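
Splitting the compound syllabic values into sequence positions could look like the sketch below. The width of 3 comes from the longest values in the table ('-,+,-' and similar). The helper name and the use of None for absent positions are assumptions; None keeps padding distinct from the '0' value already in the column.

```python
from typing import List, Optional

def split_syllabic(value: str, width: int = 3) -> List[Optional[str]]:
    """Split a comma-separated feature value into `width` positional slots."""
    parts = value.split(",")
    if len(parts) > width:
        raise ValueError(f"{value!r} has more than {width} positions")
    # Pad with None so padding is not confused with the '0' feature value.
    return parts + [None] * (width - len(parts))

# The eight values observed in the syllabic column.
observed = ["-", "+", "0", "+,-", "-,+", "-,+,-", "-,+,+", "+,+,-"]
positions = {value: split_syllabic(value) for value in observed}
```

Each slot can then be encoded as its own categorical feature with the levels '+', '-', '0', and missing.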

source

categorical metadata

Categorical provenance tag with 8 distinct values across 105,484 rows and no nulls, almost certainly indicating which source database or inventory each record came from (e.g., 'ph', 'ea', 'upsid', 'saphon'). Distribution is moderately concentrated: 'ph' accounts for 34.4% of rows while the smallest source 'ra' contributes only 4,261, yielding a high entropy ratio of 0.90. No single source dominates outright, so this is a usable stratification key rather than a near-constant flag. Treatment: Keep as a categorical grouping/stratification variable; one-hot encode if used as a model feature. high · anthropic:claude-opus-4-7

n: 105,484
nulls: 0 (0.0%)
unique: 8
top_value: ph
top_rate: 0.3439
cardinality: 8
entropy: 2.697
entropy_ratio: 0.8991
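
Using source as a stratification key can be sketched with a grouped sample, which draws the same fraction from every inventory so that small sources such as 'ra' are not lost. The rows and the 50% fraction below are illustrative only.

```python
import pandas as pd

# Toy rows: three of the eight sources, with unequal sizes.
df = pd.DataFrame({
    "source": ["ph"] * 6 + ["ea"] * 4 + ["ra"] * 2,
    "phoneme": list("mikjptaeouns"),
})

# Sample the same fraction within each source; random_state makes it repeatable.
sample = df.groupby("source").sample(frac=0.5, random_state=0)

# Per-source row counts in the sample.
sample_sizes = sample["source"].value_counts().to_dict()
```

The remaining rows (`df.drop(sample.index)`) form the complementary split with the same per-source proportions.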