This dataset is a phoneme inventory table with 105,484 rows and 8 columns. It indexes phonemes by language (via iso_639_3 and glottocode) and records phonological features (segment_class, syllabic, stress, and tone) plus a source attribution. Coverage spans 2,094 distinct ISO 639-3 codes and 2,176 Glottolog codes; the most frequent values are 'mis' (828 rows) among ISO codes and 'kham1282' (622 rows) among Glottocodes. Worth a closer look first are the segment_class and source distributions: segment_class is consonant-heavy (72,282 consonants vs 31,052 vowels vs 2,150 tones), and source is led by 'ph' at 34% but spreads across 8 datasets, which hints at where data density comes from. The phoneme column is also informative: common segments like /m/, /i/, /k/, /j/ top the list, matching well-known cross-linguistic frequencies. Note that stress and tone are highly imbalanced (~98% one value) and appear largely redundant with the 'tone' segment_class, since each has exactly 2,150 minority rows, the same as the tone segment count.
saturn
/home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet; 105,484 rows; sample n=105,484; seed 42; 2026-05-01T17:11:31+00:00
Overview
| Field | Value |
| --- | --- |
| Source | /home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet |
| Total rows | 105,484 |
| Profiled sample | 105,484 |
| Columns | 8 |
| Generated | 2026-05-01T17:11:31+00:00 |
Insights opt-in
Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.
The glottocode column holds Glottocodes: fixed 8-character language identifiers from the Glottolog catalog (len_min, len_max, and len_mean are all 8; one_word_rate is 1.0). Across 105,484 rows there are only 2,176 unique codes, a 97.9% duplicate rate, so each code tags many records. The most frequent codes are kham1282 (622 rows) and osse1243 (483 rows), each well under 1% of all rows, so no single language dominates. The null rate is negligible (0.0002), and the vocabulary size (2,166) closely matches the unique count, indicating clean categorical data rather than free text.
The iso_639_3 column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,094 distinct codes across 105,484 rows. The distribution is heavy-tailed. The top code 'mis' (the ISO 639-3 placeholder for uncoded languages) leads at 828 occurrences, followed by 'khg' and 'oss'. Duplicates account for 98.01% of rows, which is expected because each language contributes many phoneme rows. The null rate is negligible (0.0002). The prevalence of 'mis' is worth flagging, since it marks languages without an assigned ISO code rather than a real language label.
The phoneme column holds short phoneme tokens. Every entry is a single word, with a mean length of 1.5 characters and a maximum of 11; the most frequent values are single letters such as 'm', 'i', 'k', and 'j'. Despite 105,484 rows there are only 3,142 distinct values and a 97.0% duplicate rate, so the column behaves as a small categorical alphabet rather than free text. The maximum length of 11 and the reported vocab size of 1,339 suggest that multi-character phoneme tokens exist alongside the single-letter majority.
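The duplicate rates quoted for the three text columns can be reproduced from the distinct-value counts. The sketch below assumes that the duplicate rate is defined as 1 minus distinct values over total rows; the profile does not state saturn's exact definition, and `duplicate_rate` is an illustrative helper, not part of the tool.

```python
TOTAL_ROWS = 105_484

# Distinct-value counts as reported in the column insights.
DISTINCT = {"glottocode": 2_176, "iso_639_3": 2_094, "phoneme": 3_142}


def duplicate_rate(distinct: int, total: int) -> float:
    """Share of rows that repeat an already-seen value (assumed definition)."""
    return 1 - distinct / total


for column, distinct in DISTINCT.items():
    # Print each rate as a percentage with two decimals.
    print(f"{column}: {duplicate_rate(distinct, TOTAL_ROWS):.2%}")
```

Under this assumption the results round to the reported 97.9%, 98.01%, and 97.0%. Null rows are ignored here, which is a simplification given the small reported null rate.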
The segment_class column is a 3-level categorical labeling each row as a phonological segment class: consonant, vowel, or tone. The distribution is imbalanced: consonants account for 68.5% (72,282), vowels for 29.4% (31,052), and tones for only 2.0% (2,150), yielding an entropy of 1.01 (entropy ratio 0.64). There are no nulls across 105,484 rows.
Binary tone flag with values "0" and "+". In a phoneme inventory this is a phonological feature, not sentiment or polarity: "+" most plausibly marks tone segments, and its 2,150 rows equal the number of rows with segment_class tone. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows, leaving only 2,150 "+" cases, and the entropy ratio is just 0.144. No nulls are present.
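Two-valued stress feature with values '-' and '0', where '-' dominates at 97.96% of 105,484 rows and '0' covers the remaining 2,150. An entropy ratio of 0.14 confirms the column carries almost no information. In feature notation '-' most plausibly denotes the minus value of the feature and '0' an unspecified value, rather than '-' being a placeholder; the 2,150 '0' rows equal the count of tone segments, which suggests '0' marks rows where stress does not apply. As a stress indicator the column is severely imbalanced and likely unusable as-is.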
This appears to be a phonological feature column encoding the [syllabic] distinctive feature, dominated by the binary values '-' (68.5%) and '+' (29.1%), with 2,150 entries marked '0' (likely unspecified, matching the tone segment count). Notably, 394 rows carry comma-separated compound values like '+,-' or '-,+,-', suggesting segments whose feature value changes in sequence (e.g., diphthongs or other contour segments). An entropy ratio of 0.347 confirms heavy concentration in the top category.
Categorical provenance tag with 8 distinct values across 105,484 rows and no nulls, almost certainly indicating which source database or inventory each record came from (e.g., 'ph', 'ea', 'upsid', 'saphon'). Distribution is moderately concentrated: 'ph' accounts for 34.4% of rows while the smallest source 'ra' contributes only 4,261, yielding a high entropy ratio of 0.90. No single source dominates outright, so this is a usable stratification key rather than a near-constant flag.
glottocode text
Sample values (first 10)
- lakk1252
- lazz1240
- kham1282
- nort2722
- west2456
- gurd1238
- suga1248
- kham1282
- gata1239
- paez1247
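Because every Glottocode has a fixed 8-character shape, a simple pattern check can catch malformed values before joining against Glottolog. The sketch below assumes the usual Glottocode layout of four lowercase alphanumeric characters followed by four digits; `is_glottocode` is an illustrative helper, and a matching pattern does not prove the code exists in the catalog.

```python
import re

# Assumed layout: four lowercase alphanumerics, then four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

# The ten sample values listed above.
SAMPLE = [
    "lakk1252", "lazz1240", "kham1282", "nort2722", "west2456",
    "gurd1238", "suga1248", "kham1282", "gata1239", "paez1247",
]


def is_glottocode(value: str) -> bool:
    """Return True if the whole string matches the assumed Glottocode shape."""
    return GLOTTOCODE_RE.fullmatch(value) is not None


# Collect any sample values that fail the shape check.
invalid = [value for value in SAMPLE if not is_glottocode(value)]
print(invalid)
```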
iso_639_3 text
Sample values (first 10)
- lbe
- dlg
- khg
- ceb
- xwl
- gdj
- cce
- gla
- gaq
- pbb
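Since 'mis' is a placeholder rather than a language, it can help to separate special codes from ordinary ones before counting languages. The sketch below is one way to do that, assuming the four ISO 639-3 special codes (mis, mul, und, zxx); `classify_iso` and the appended 'mis' entry are illustrative and not part of the sample above.

```python
import re
from collections import Counter

ISO_RE = re.compile(r"[a-z]{3}")

# ISO 639-3 special codes: uncoded, multiple, undetermined, no linguistic content.
SPECIAL_CODES = {"mis", "mul", "und", "zxx"}


def classify_iso(code):
    """Label a value as 'null', 'malformed', 'special', or 'language'."""
    if code is None:
        return "null"
    if ISO_RE.fullmatch(code) is None:
        return "malformed"
    if code in SPECIAL_CODES:
        return "special"
    return "language"


# The ten sample values, plus 'mis' appended for illustration.
sample = ["lbe", "dlg", "khg", "ceb", "xwl", "gdj", "cce", "gla", "gaq", "pbb", "mis"]
summary = Counter(classify_iso(code) for code in sample)
print(summary)
```

Rows classified as special would then be better identified through their glottocode, where one is present.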
phoneme text
Sample values (first 10)
- χ
- u
- ndz
- r
- ˦
- ɨ
- d̤ɮ̤
- u
- ĩ
- ã
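Several sample phonemes carry diacritics, so their length depends on how characters are counted. The profile does not say whether saturn counts code points or something closer to visible letters; the sketch below shows the difference under that uncertainty. `base_length` is an illustrative helper, and the escapes spell out one possible encoding of two sample values.

```python
import unicodedata


def base_length(token: str) -> int:
    """Count code points that are not combining marks, after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", token)
    return sum(
        1 for ch in decomposed
        if not unicodedata.category(ch).startswith("M")  # Mn, Mc, Me are marks
    )


# d + combining diaeresis below + lezh + combining diaeresis below.
breathy = "d\u0324\u026e\u0324"
# Precomposed i with tilde; it could equally be stored as i + combining tilde.
nasal_i = "\u0129"

for token in (breathy, nasal_i, "ndz"):
    print(repr(token), len(token), base_length(token))
```

With this encoding the breathy-voiced segment is four code points but two base letters, which is one way a mean length of 1.5 and a maximum of 11 could arise from mostly single-letter phonemes.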
segment_class categorical
Top values (rank 1–20)
- consonant — 72,282
- vowel — 31,052
- tone — 2,150
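The entropy figures reported for segment_class can be recomputed from these three counts. The sketch below assumes Shannon entropy with base-2 logarithms and an entropy ratio defined as entropy divided by the log of the number of categories; the profile does not state the base, but this choice reproduces the reported 1.01 and 0.64.

```python
import math

SEGMENT_CLASS_COUNTS = {"consonant": 72_282, "vowel": 31_052, "tone": 2_150}


def entropy_bits(counts: dict) -> float:
    """Shannon entropy in bits of a categorical distribution given as counts."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def entropy_ratio(counts: dict) -> float:
    """Entropy divided by its maximum, log2 of the number of categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


print(round(entropy_bits(SEGMENT_CLASS_COUNTS), 2))
print(round(entropy_ratio(SEGMENT_CLASS_COUNTS), 2))
```

A ratio of 1.0 would mean three equally frequent classes; 0.64 reflects the small tone class pulling the distribution away from uniform.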
tone categorical
Top values (rank 1–20)
- 0 — 103,334
- + — 2,150
stress categorical
Top values (rank 1–20)
- - — 103,334
- 0 — 2,150
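The matching minority counts (2,150 for tone '+', stress '0', and segment_class tone) suggest these columns are redundant, but equal totals do not prove that the same rows are involved. A row-level check is needed; the sketch below shows one way to do it with a cross-tabulation. The three rows are hypothetical examples in the table's column layout, not values read from the file, and both helpers are illustrative.

```python
from collections import Counter

# Hypothetical rows for illustration only; not taken from phonemes.parquet.
rows = [
    {"segment_class": "consonant", "tone": "0", "stress": "-"},
    {"segment_class": "vowel", "tone": "0", "stress": "-"},
    {"segment_class": "tone", "tone": "+", "stress": "0"},
]


def cross_tab(rows, a, b):
    """Count how often each (a, b) value pair occurs together."""
    return Counter((row[a], row[b]) for row in rows)


def is_determined_by(rows, key, target):
    """True if every value of `key` maps to exactly one value of `target`."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[key], row[target]) != row[target]:
            return False
    return True


print(cross_tab(rows, "segment_class", "tone"))
print(is_determined_by(rows, "segment_class", "stress"))
```

If `is_determined_by` holds on the full table for both tone and stress, those two columns add nothing beyond segment_class and could be dropped from downstream models.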
syllabic categorical
Top values (rank 1–20)
- - — 72,248
- + — 30,692
- 0 — 2,150
- +,- — 244
- -,+ — 124
- -,+,- — 12
- -,+,+ — 12
- +,+,- — 2
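The compound values can be separated from the simple ones by splitting on the comma. The sketch below tallies them from the counts listed above; `split_feature` is an illustrative helper, and reading each comma-separated part as one step in a sequence is an assumption about the encoding.

```python
SYLLABIC_COUNTS = {
    "-": 72_248, "+": 30_692, "0": 2_150,
    "+,-": 244, "-,+": 124, "-,+,-": 12, "-,+,+": 12, "+,+,-": 2,
}


def split_feature(value: str) -> list:
    """Split a possibly compound feature value into its sequential parts."""
    return value.split(",")


# Rows whose value has more than one part.
compound_rows = sum(
    count for value, count in SYLLABIC_COUNTS.items()
    if len(split_feature(value)) > 1
)
total_rows = sum(SYLLABIC_COUNTS.values())

print(compound_rows, total_rows)
```

The compound rows sum to the 394 mentioned in the insight, and all eight values together account for the full 105,484 rows.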
source categorical
Top values (rank 1–20)
- ph — 36,274
- ea — 16,883
- upsid — 13,966
- er — 9,423
- saphon — 9,047
- aa — 8,064
- spa — 7,566
- ra — 4,261
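If source is used as a stratification key, a sample can be split across sources in proportion to these counts. The sketch below is a minimal example using largest-remainder rounding; `proportional_allocation` and the sample size of 1,000 are illustrative choices, not something the profile prescribes.

```python
SOURCE_COUNTS = {
    "ph": 36_274, "ea": 16_883, "upsid": 13_966, "er": 9_423,
    "saphon": 9_047, "aa": 8_064, "spa": 7_566, "ra": 4_261,
}


def proportional_allocation(counts: dict, n: int) -> dict:
    """Split n sample slots across strata in proportion to their counts."""
    total = sum(counts.values())
    exact = {key: n * count / total for key, count in counts.items()}
    alloc = {key: int(value) for key, value in exact.items()}  # round down first
    leftover = n - sum(alloc.values())
    # Hand the remaining slots to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda key: exact[key] - alloc[key], reverse=True)
    for key in by_fraction[:leftover]:
        alloc[key] += 1
    return alloc


alloc = proportional_allocation(SOURCE_COUNTS, 1_000)
ph_share = SOURCE_COUNTS["ph"] / sum(SOURCE_COUNTS.values())
print(alloc)
print(f"{ph_share:.1%}")
```

Proportional allocation keeps the sample's source mix equal to the table's; an equal allocation per source would instead give the smaller inventories such as 'ra' more weight.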