saturn·

parquet phonemes

source /home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet · 105,484 rows · 8 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a phoneme inventory table with 105,484 rows and 8 columns. It indexes phonemes by language (via iso_639_3 and glottocode) and records phonological features (segment_class, syllabic, stress, tone) plus a source attribution. Coverage spans 2,094 distinct ISO 639-3 codes and 2,176 distinct Glottocodes, with 'mis' (828 rows) and 'kham1282' (622 rows) the most frequent values in their respective columns. Worth a closer look first: the segment_class and source distributions. segment_class is consonant-heavy (72,282 consonants vs 31,052 vowels vs 2,150 tones), and source is led by 'ph' at 34% but spreads across 8 datasets, which hints at where data density comes from. The phoneme column is also informative: common segments like /m/, /i/, /k/, /j/ top the list, matching well-known cross-linguistic frequencies. Note that stress and tone are highly imbalanced (~98% one value), and their minority counts (2,150 each) equal the number of 'tone' rows in segment_class, so both are largely redundant with it.

citing: row_count · column_count · columns.iso_639_3.n_unique · columns.glottocode.n_unique · columns.segment_class.top_values · columns.source.top_values · columns.source.top_rate · columns.phoneme.top_values · columns.stress.top_rate · columns.tone.top_rate · columns.syllabic.top_values

Schema

8 columns
Per-column summary.
column · type · nulls · unique · alerts
glottocode · text · 0.0% · 2,176 · one_word, short_text, duplicates
iso_639_3 · text · 0.0% · 2,094 · one_word, short_text, duplicates
phoneme · text · 0.0% · 3,142 · one_word, short_text, duplicates
segment_class · categorical · 0.0% · 3 · (none)
tone · categorical · 0.0% · 2 · imbalance
stress · categorical · 0.0% · 2 · imbalance
syllabic · categorical · 0.0% · 8 · (none)
source · categorical · 0.0% · 8 · (none)

glottocode

text foreign_key one_word short_text duplicates
This column holds Glottocodes: fixed 8-character language identifiers (len_min/max/mean all 8, one_word_rate 1.0) from the Glottolog catalog. Across 105,484 rows there are only 2,176 unique codes with a 97.9% duplicate rate, so each code tags many records; kham1282 (622) and osse1243 (483) are the most frequent, though each covers under 1% of rows. The null rate is negligible (19 rows, about 0.0002), and vocab_size (2,166) closely matches the unique count, indicating clean categorical data rather than free text. Treatment: Treat as a categorical key; left-join to a Glottolog reference table for language metadata. high · anthropic:claude-opus-4-7
n
105,484
nulls
19 (0.0%)
unique
2,176
len_min
8
len_max
8
len_mean
8
len_median
8
len_p95
8
word_mean
1
word_median
1
n_empty
0
n_duplicates
103,289
duplicate_rate
0.9794
vocab_size
2,166
readability_flesch_mean
97.11
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
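
The left-join treatment suggested above could look roughly like the following sketch. It uses plain dictionaries rather than any particular dataframe library. The `reference` table and its `label` values are hypothetical placeholders, not real Glottolog metadata, and `left_join_glottolog` is an illustrative name. The format check assumes the documented Glottocode shape of four lowercase alphanumerics followed by four digits.

```python
import re

# Glottocodes: four lowercase alphanumeric characters followed by four digits.
GLOTTOCODE_RE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")


def left_join_glottolog(rows, reference):
    """Attach reference metadata to every row; null, malformed, or
    unmatched codes keep the row and get None (left-join semantics)."""
    joined = []
    for row in rows:
        code = row.get("glottocode")
        valid = isinstance(code, str) and GLOTTOCODE_RE.match(code) is not None
        meta = reference.get(code) if valid else None
        joined.append({**row, "glottolog": meta})
    return joined


# Hypothetical reference table: the "label" values are placeholders only.
reference = {
    "kham1282": {"label": "placeholder A"},
    "osse1243": {"label": "placeholder B"},
}

# Toy rows standing in for the parquet data.
rows = [
    {"glottocode": "kham1282", "phoneme": "m"},
    {"glottocode": None, "phoneme": "i"},        # like the 19 null rows
    {"glottocode": "zzzz0000", "phoneme": "k"},  # well-formed, not in reference
]

joined = left_join_glottolog(rows, reference)
for row in joined:
    print(row["glottocode"], row["glottolog"])
```

Keeping unmatched rows with a `None` payload, rather than dropping them, makes it easy to count how many codes fail to resolve against the reference.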

iso_639_3

text feature one_word short_text duplicates
This column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,094 distinct codes across 105,484 rows. The distribution is heavy-tailed: the top code 'mis' (the ISO 639-3 placeholder for uncoded languages) leads at 828 occurrences, followed by 'khg' and 'oss', and duplicates account for 98.01% of non-null rows, as expected for a language key. The null rate is negligible (25 rows, about 0.0002). The prevalence of 'mis' is worth flagging: it marks languages that have no ISO 639-3 code assigned, so its 828 rows pool an unknown number of distinct languages rather than describing one. Treatment: Treat as a categorical code; one-hot or target-encode, and decide whether to drop or bucket the 'mis' placeholder. high · anthropic:claude-opus-4-7
n
105,484
nulls
25 (0.0%)
unique
2,094
len_min
3
len_max
3
len_mean
3
len_median
3
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
103,365
duplicate_rate
0.9801
vocab_size
2,089
readability_flesch_mean
120
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
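
One way to bucket the 'mis' placeholder before encoding is sketched below. The set of special-purpose codes (mis, mul, und, zxx) comes from the ISO 639-3 standard. The bucket labels `<special>` and `<missing>` and the helper name `bucket_iso` are illustrative choices, and the sample codes are a toy list, not the real column.

```python
from collections import Counter

# ISO 639-3 special-purpose codes: uncoded languages, multiple languages,
# undetermined, and no linguistic content.
SPECIAL_CODES = {"mis", "mul", "und", "zxx"}


def bucket_iso(code):
    """Collapse nulls and special-purpose codes into explicit buckets;
    real language codes pass through unchanged."""
    if code is None:
        return "<missing>"
    if code in SPECIAL_CODES:
        return "<special>"
    return code


# Toy sample standing in for the iso_639_3 column.
codes = ["mis", "khg", "oss", None, "mis"]
bucketed = Counter(bucket_iso(c) for c in codes)
print(bucketed.most_common())
```

Whether to drop the `<special>` bucket or keep it as its own level depends on the downstream task; the sketch only makes the decision explicit.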

phoneme

text feature one_word short_text duplicates
This column holds short phoneme tokens: every entry is a single word, with mean length 1.5 characters and max 11, and the top values are single-character segments like 'm', 'i', 'k', 'j'. Despite 105,484 rows there are only 3,142 distinct values and a 97.0% duplicate rate, so this behaves as a categorical alphabet rather than free text. The length statistics (len_p95 of 3, len_max of 11) show that multi-character phoneme codes exist alongside the single-character majority; vocab_size (1,339) is well below the unique count, which probably reflects how the profiler tokenizes or normalizes symbols rather than the number of categories. Treatment: Treat as a categorical phoneme code; label-encode or one-hot rather than tokenizing as text. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
3,142
len_min
1
len_max
11
len_mean
1.501
len_median
1
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
102,342
duplicate_rate
0.9702
vocab_size
1,339
readability_flesch_mean
114.4
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.001754
boilerplate_rate
0
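
A minimal sketch of the label-encoding treatment, using only the standard library. The helpers `fit_label_encoder` and `encode` are illustrative names, the frequency-first id order is one arbitrary convention among several, and the sample tokens are a toy list rather than the real column.

```python
from collections import Counter


def fit_label_encoder(tokens):
    """Map each distinct token to an integer id, most frequent first.
    Ties are broken by the token's code-point order so ids are stable."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate(ordered)}


def encode(tokens, mapping, unknown=-1):
    """Encode tokens with a fitted mapping; unseen tokens get `unknown`."""
    return [mapping.get(tok, unknown) for tok in tokens]


# Toy sample: multi-character tokens such as "tʃ" stay whole.
sample = ["m", "i", "k", "m", "tʃ", "i", "m"]
mapping = fit_label_encoder(sample)
ids = encode(sample + ["ǃ"], mapping)  # "ǃ" was not seen during fitting
print(mapping)
print(ids)
```

Encoding whole tokens, instead of splitting them into characters, preserves multi-character phonemes as single categories.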

segment_class

categorical label
This is a 3-level categorical labeling each row as a phonological segment class: consonant, vowel, or tone. The distribution is imbalanced: consonants account for 68.5% (72,282), vowels 29.4% (31,052), and tones only 2.0% (2,150), yielding an entropy of 1.01 (entropy ratio 0.64). No nulls across 105,484 rows. Treatment: One-hot encode (the classes are nominal, so an ordinal encoding would impose an order that does not exist); consider class-weighting or stratified sampling given the rare 'tone' class. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
3
top_value
consonant
top_rate
0.6852
cardinality
3
entropy
1.008
entropy_ratio
0.6357
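
The entropy and entropy_ratio figures can be reproduced from the class counts. The sketch below assumes the profiler reports Shannon entropy in bits and normalizes by log2 of the cardinality. The report does not state this, but the numbers are consistent with it (for the two-level columns, entropy and entropy_ratio coincide, as they would with a log2(2) = 1 divisor).

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    if k < 2:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)


# Counts taken from the segment_class summary above.
segment_counts = {"consonant": 72282, "vowel": 31052, "tone": 2150}
counts = list(segment_counts.values())

h = shannon_entropy_bits(counts)
ratio = entropy_ratio(counts)
print(f"entropy={h:.4f} bits, ratio={ratio:.4f}")
```

A ratio of 1.0 would mean a uniform distribution over the three classes; the reported value of roughly 0.64 reflects how little the rare 'tone' class contributes.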

tone

categorical label imbalance
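Binary tone flag with values "0" and "+". In this phonological table it most plausibly encodes the [tone] feature, not sentiment or polarity: the 2,150 "+" rows equal the number of rows labeled 'tone' in segment_class, so "+" likely marks tonal segments and "0" marks segments to which tone does not apply. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows, and the entropy ratio is just 0.144. No nulls are present. Treatment: Check whether the column is fully determined by segment_class; if so, drop it as redundant rather than treating it as a prediction target. high · anthropic:claude-opus-4-7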
n
105,484
nulls
0 (0.0%)
unique
2
top_value
0
top_rate
0.9796
cardinality
2
entropy
0.1436
entropy_ratio
0.1436
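
Whether tone is redundant with segment_class can be tested as a functional-dependency check: every segment_class value should map to exactly one tone value. The sketch below is a minimal version with an illustrative helper name, `is_determined_by`. The toy rows encode the pattern the matching counts suggest; that pattern is an assumption, not something verified against the file. The same check applies unchanged to the stress column.

```python
from collections import defaultdict


def is_determined_by(rows, target, key):
    """True if each value of `key` co-occurs with exactly one value of
    `target`, i.e. `target` adds no information beyond `key`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[target])
    return all(len(values) == 1 for values in seen.values())


# Toy rows (assumed pattern, not real data).
rows = [
    {"segment_class": "consonant", "tone": "0", "stress": "-"},
    {"segment_class": "vowel", "tone": "0", "stress": "-"},
    {"segment_class": "tone", "tone": "+", "stress": "0"},
]

tone_redundant = is_determined_by(rows, "tone", "segment_class")
stress_redundant = is_determined_by(rows, "stress", "segment_class")

# A single contradicting row is enough to break the dependency.
noisy = rows + [{"segment_class": "vowel", "tone": "+", "stress": "-"}]
tone_redundant_noisy = is_determined_by(noisy, "tone", "segment_class")
print(tone_redundant, stress_redundant, tone_redundant_noisy)
```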

stress

categorical feature imbalance
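Binary categorical flag with only two values, '-' and '0', where '-' dominates at 97.96% of 105,484 rows and '0' covers the remaining 2,150. In distinctive-feature notation '-' is a real value (the segment is [-stress]), not a missing-data placeholder, and '0' conventionally means the feature does not apply; the 2,150 '0' rows equal the number of 'tone' rows in segment_class, which suggests '0' is assigned to tonal segments. The entropy ratio of 0.14 confirms the column carries very little information, and no row is marked '+', so it does not distinguish stressed from unstressed segments. Treatment: Do not recode '-' as missing; the column is near-constant and likely determined by segment_class, so drop it for modelling or keep it only as a consistency check. high · anthropic:claude-opus-4-7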
n
105,484
nulls
0 (0.0%)
unique
2
top_value
-
top_rate
0.9796
cardinality
2
entropy
0.1436
entropy_ratio
0.1436

syllabic

categorical feature
This appears to be a phonological feature column encoding the [syllabic] distinctive feature, dominated by the simple values '-' (68.5%) and '+' (29.1%), with 2,150 entries marked '0' (likely not applicable, since the count equals the number of 'tone' rows in segment_class). Notably, 394 rows carry comma-separated compound values like '+,-' or '-,+,-', suggesting segments whose feature value changes over their duration (for example diphthongs or other contour segments). The entropy ratio of 0.347 confirms heavy concentration in the top category. Treatment: One-hot encode, or split compound comma values into sequence positions before modelling. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
8
top_value
-
top_rate
0.6849
cardinality
8
entropy
1.042
entropy_ratio
0.3472
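
Splitting compound values into sequence positions could be done as in the sketch below. The fixed width of 3 is an assumption based on the longest example quoted above ('-,+,-'); the real maximum should be checked against the data. `split_contour` is an illustrative name. Padding uses `None` rather than '0' so that padded positions are not confused with the '0' feature value.

```python
def split_contour(value, width=3, pad=None):
    """Split a comma-separated feature value into `width` positions,
    padding short values on the right with `pad`."""
    parts = value.split(",")
    if len(parts) > width:
        raise ValueError(f"{value!r} has more than {width} positions")
    return parts + [pad] * (width - len(parts))


# Simple values occupy only the first position.
simple = split_contour("-")
two_step = split_contour("+,-")
three_step = split_contour("-,+,-")
print(simple, two_step, three_step)
```

Each position can then be one-hot encoded separately, which keeps '+,-' and '-,+' distinct while sharing structure with the simple '+' and '-' values.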

source

categorical metadata
Categorical provenance tag with 8 distinct values across 105,484 rows and no nulls, almost certainly indicating which source database or inventory each record came from (e.g., 'ph', 'ea', 'upsid', 'saphon'). Distribution is moderately concentrated: 'ph' accounts for 34.4% of rows while the smallest source 'ra' contributes only 4,261, yielding a high entropy ratio of 0.90. No single source dominates outright, so this is a usable stratification key rather than a near-constant flag. Treatment: Keep as a categorical grouping/stratification variable; one-hot encode if used as a model feature. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
8
top_value
ph
top_rate
0.3439
cardinality
8
entropy
2.697
entropy_ratio
0.8991
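
Using source as a stratification key might look like the sketch below, which draws the same fraction from every source so that small sources such as 'ra' are not crowded out by 'ph'. The function name `stratified_sample`, the fixed seed, and the toy rows are all illustrative assumptions; only the source labels 'ph' and 'ra' come from the report.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample `fraction` of the rows within each group of `key`,
    taking at least one row per group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for name in sorted(groups):  # sorted for reproducible ordering
        members = groups[name]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy rows: 80 from 'ph' and 20 from 'ra'.
rows = [{"source": "ph", "id": i} for i in range(80)]
rows += [{"source": "ra", "id": i} for i in range(20)]

picked = stratified_sample(rows, "source", 0.25)
picked_counts = Counter(row["source"] for row in picked)
print(picked_counts)
```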