parquet phonemes

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet

Saturn profiled 105,484 rows across 8 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

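# Install the profiler (notebook shell escape).
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the Parquet file. Going by the flag names,
# --findings names the JSON output and --llm selects the model for the prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet",
    "--findings", "parquet-phonemes.json",
    "--llm", "anthropic:claude-opus-4-7",
])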

Summary confidence: high

This dataset is a phoneme inventory table with 105,484 rows and 8 columns, indexing phonemes by language (via iso_639_3 and glottocode) along with phonological features like segment_class, syllabic, stress, and tone, plus a source attribution. Coverage spans 2,094 ISO 639-3 codes and 2,176 Glottocodes, with 'mis' (828 rows) and 'kham1282' (622 rows) the most represented. Worth a closer look first: the segment_class and source distributions. segment_class shows a clear consonant-heavy mix (72,282 consonants vs 31,052 vowels vs 2,150 tones), and 'ph' is the largest of the 8 sources at 34% of rows, which hints at where data density comes from. The phoneme column is also informative: common segments like /m/, /i/, /k/, /j/ top the list, matching well-known cross-linguistic frequencies. Note that stress and tone are highly imbalanced (~98% one value); the minority value of each covers exactly 2,150 rows, the same count as the 'tone' segment_class, so both are probably redundant with it.

citing: row_count · column_count · columns.iso_639_3.n_unique · columns.glottocode.n_unique · columns.segment_class.top_values · columns.source.top_values · columns.source.top_rate · columns.phoneme.top_values · columns.stress.top_rate · columns.tone.top_rate · columns.syllabic.top_values

Out[4]:

saturn.schema() · 8 columns

column	kind	n	null%	unique	alerts
glottocode	text	105,484	0.0%	2,176	one_word short_text duplicates
iso_639_3	text	105,484	0.0%	2,094	one_word short_text duplicates
phoneme	text	105,484	0.0%	3,142	one_word short_text duplicates
segment_class	categorical	105,484	0.0%	3
tone	categorical	105,484	0.0%	2	imbalance
stress	categorical	105,484	0.0%	2	imbalance
syllabic	categorical	105,484	0.0%	8
source	categorical	105,484	0.0%	8

Fig 1.

segment_class · Shows the consonant/vowel/tone split; consonants dominate at 68.5% of rows.

Top values for segment_class (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

Fig 2.

source · Compares the 8 contributing inventories; 'ph' supplies about a third of all rows.

Top values for source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%

Fig 3.

phoneme · Top phoneme symbols across languages — /m/, /i/, /k/ lead, matching typological expectations.

Character-length distribution for phoneme (mean: 1.501).
chars	count
1	67114
2	28726
3	6559
4	2225
5	401
6	267
7	104
8	5
9	70
10	1
11	12

Fig 4.

syllabic · Distribution of the syllabic feature; mostly '-' vs '+' with a long tail of rare combined values.

Top values for syllabic (8 unique shown, of 8 total).
value	count	share
-	72248	68.5%
+	30692	29.1%
0	2150	2.0%
+,-	244	0.2%
-,+	124	0.1%
-,+,-	12	0.0%
-,+,+	12	0.0%
+,+,-	2	0.0%

Fig 5.

iso_639_3 · Language coverage by ISO code — check which languages contribute the most phoneme rows.

Character-length distribution for iso_639_3 (mean: 3.0).
chars	count
3	105459

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
glottocode	text	0.0%
iso_639_3	text	0.0%
phoneme	text	0.0%
segment_class	categorical	0.0%
tone	categorical	0.0%
stress	categorical	0.0%
syllabic	categorical	0.0%
source	categorical	0.0%

glottocode text foreign_key

This column holds Glottocodes: fixed 8-character language identifiers (len_min/max/mean all 8, one_word_rate 1.0) from the Glottolog catalog. Across 105,484 rows there are only 2,176 unique codes with a 97.9% duplicate rate, so each code tags many records; the most frequent codes, kham1282 (622) and osse1243 (483), each account for well under 1% of rows. The null rate is negligible (19 rows, 0.0002), and the vocabulary size (2,166) closely matches the unique count, indicating clean categorical data rather than free text.

Treatment: Treat as a categorical key; left-join to a Glottolog reference table for language metadata.

anthropic:claude-opus-4-7 · confidence high

Out[12]:

saturn.columns["glottocode"].stats

stat	value
n	105,484
nulls	19 (0.0%)
unique	2,176
len_min	8
len_max	8
len_mean	8
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	103,289
duplicate_rate	0.9794
vocab_size	2,166
readability_flesch_mean	97.11
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.9% duplicate strings
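
The treatment above suggests a left join against a Glottolog reference table. A minimal sketch of that join with plain dictionaries follows. The `reference` mapping, its `name` field, and the sample rows are hypothetical stand-ins rather than data from this file, and the pattern check assumes the usual Glottocode shape of four lowercase alphanumerics followed by four digits.

```python
import re
from typing import Optional

# Assumed Glottocode shape: four lowercase alphanumerics, then four digits.
GLOTTOCODE_RE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")


def is_glottocode(value: Optional[str]) -> bool:
    """Return True when value looks like a well-formed Glottocode."""
    return value is not None and GLOTTOCODE_RE.fullmatch(value) is not None


def left_join(rows, reference):
    """Attach reference metadata to each row, keeping unmatched rows.

    Rows whose glottocode is null or absent from the reference get
    language_name=None, mirroring SQL LEFT JOIN semantics.
    """
    joined = []
    for row in rows:
        meta = reference.get(row.get("glottocode"), {})
        joined.append({**row, "language_name": meta.get("name")})
    return joined


# Hypothetical reference table and rows, for illustration only.
reference = {"kham1282": {"name": "Example language"}}
rows = [
    {"glottocode": "kham1282", "phoneme": "m"},
    {"glottocode": None, "phoneme": "i"},  # stands in for a null-coded row
]
joined = left_join(rows, reference)
```

Keeping unmatched rows (rather than an inner join) means the 19 null-coded records stay visible downstream instead of silently disappearing.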

Fig 7.

Character-length distribution for glottocode.

Character-length distribution for glottocode (mean: 8.0).
chars	count
8	105465

iso_639_3 text feature

This column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,094 distinct codes across 105,484 rows. The distribution is heavy-tailed — the top code 'mis' (the ISO 'uncoded languages' placeholder) leads at 828 occurrences, followed by 'khg' and 'oss', and duplicates account for 98.01% of rows by design. Null rate is negligible (0.0002), and the prevalence of 'mis' is worth flagging since it signals unidentified languages rather than a real language label.

Treatment: Treat as a categorical code; one-hot or target-encode, and decide whether to drop or bucket the 'mis' placeholder.

anthropic:claude-opus-4-7 · confidence high

Out[15]:

saturn.columns["iso_639_3"].stats

stat	value
n	105,484
nulls	25 (0.0%)
unique	2,094
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	103,365
duplicate_rate	0.9801
vocab_size	2,089
readability_flesch_mean	120
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	98.0% duplicate strings
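
Deciding whether to drop or bucket the 'mis' placeholder starts with knowing how much of the column it covers. The sketch below computes that share and folds the placeholder and nulls into one explicit label. The label `"unknown"` and the helper names are hypothetical choices, not part of saturn.

```python
from collections import Counter

PLACEHOLDER = "mis"  # ISO 639-3 code for uncoded languages


def placeholder_share(codes):
    """Fraction of non-null codes equal to the 'mis' placeholder."""
    non_null = [c for c in codes if c is not None]
    if not non_null:
        return 0.0
    return Counter(non_null)[PLACEHOLDER] / len(non_null)


def bucket_placeholder(code, bucket="unknown"):
    """Map the placeholder (and nulls) to one explicit bucket label."""
    return bucket if code is None or code == PLACEHOLDER else code


# Share implied by the reported counts: 828 'mis' rows of 105,459 non-null.
reported_share = 828 / (105_484 - 25)
```

At under 1% of rows, either choice changes little in aggregate, but bucketing keeps those rows available for analyses that do not depend on language identity.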

Fig 8.

Character-length distribution for iso_639_3.

Character-length distribution for iso_639_3 (mean: 3.0).
chars	count
3	105459

phoneme text feature

This column holds short phoneme tokens: every entry is a single word, with a mean length of 1.5 characters and a max of 11, and the top values are individual letters like 'm', 'i', 'k', 'j'. Despite 105,484 rows there are only 3,142 distinct values and a 97.0% duplicate rate, so this behaves as a categorical symbol inventory rather than free text. About 36% of rows (38,370) are longer than one character, so multi-character symbols sit alongside the single-letter majority; the reported vocab_size is 1,339.

Treatment: Treat as a categorical phoneme code — label-encode or one-hot rather than tokenizing as text.

anthropic:claude-opus-4-7 · confidence high

Out[18]:

saturn.columns["phoneme"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3,142
len_min	1
len_max	11
len_mean	1.501
len_median	1
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	102,342
duplicate_rate	0.9702
vocab_size	1,339
readability_flesch_mean	114.4
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.001754
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.0% duplicate strings
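
Label-encoding, as the treatment recommends, can be sketched as a frequency-ordered vocabulary. The sample list and the `unknown_id` convention below are hypothetical; the point is that each symbol, including multi-character ones, maps to a single integer instead of being tokenized character by character.

```python
from collections import Counter


def build_vocab(phonemes):
    """Map each distinct phoneme to an integer id, most frequent first.

    Ties are broken by the symbol itself so ids are reproducible.
    """
    counts = Counter(phonemes)
    ordered = sorted(counts, key=lambda p: (-counts[p], p))
    return {p: i for i, p in enumerate(ordered)}


def encode(phonemes, vocab, unknown_id=-1):
    """Replace each phoneme with its id; unseen symbols get unknown_id."""
    return [vocab.get(p, unknown_id) for p in phonemes]


# Hypothetical sample; the real column has 3,142 distinct values.
sample = ["m", "i", "k", "m", "tʃ", "i", "m"]
vocab = build_vocab(sample)
ids = encode(sample + ["ǃ"], vocab)  # the last symbol is unseen
```

Ordering ids by frequency is a design choice, not a requirement: it keeps the common segments in the low id range, which makes truncating the vocabulary straightforward.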

Fig 9.

Character-length distribution for phoneme.

Character-length distribution for phoneme (mean: 1.501).
chars	count
1	67114
2	28726
3	6559
4	2225
5	401
6	267
7	104
8	5
9	70
10	1
11	12

segment_class categorical label

This is a 3-level categorical labeling each row as a phonological segment class: consonant, vowel, or tone. The distribution is imbalanced: consonants account for 68.5% (72,282), vowels 29.4% (31,052), and tones only 2.0% (2,150), yielding an entropy of 1.01 bits (entropy ratio 0.64). No nulls across 105,484 rows.

Treatment: One-hot encode (the classes are unordered, so ordinal codes would impose a false ranking); consider class-weighting or stratified sampling given the rare 'tone' class.

anthropic:claude-opus-4-7 · confidence high

Out[21]:

saturn.columns["segment_class"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	consonant
top_rate	0.6852
cardinality	3
entropy	1.008
entropy_ratio	0.6357
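
The entropy figures above can be reproduced from the counts alone. This sketch assumes saturn reports base-2 Shannon entropy and divides by log2 of the cardinality for the ratio; that assumption is not confirmed from the tool itself, but it matches the reported 1.008 and 0.6357 to rounding.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Counts from the segment_class table: consonant, vowel, tone.
segment_counts = [72282, 31052, 2150]
h = entropy_bits(segment_counts)
ratio = entropy_ratio(segment_counts)
```

A ratio of 1.0 would mean all three classes are equally common; the gap down to 0.64 is driven mostly by the rare tone class.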

Fig 10.

Top values for segment_class.

Top values for segment_class (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

tone categorical label

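Binary tone flag with values "0" and "+". In a phoneme inventory this is a phonological feature, not sentiment: "+" most likely marks tonal segments and "0" marks segments for which tone is not applicable. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows, leaving only 2,150 "+" cases, exactly the number of rows with segment_class 'tone', and the entropy ratio is just 0.144. No nulls are present.

Treatment: Likely redundant with segment_class; confirm that "+" coincides with the 'tone' class, then drop the column or keep it as a plain binary feature rather than a prediction target.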

anthropic:claude-opus-4-7 · confidence high

Out[24]:

saturn.columns["tone"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2
top_value	0
top_rate	0.9796
cardinality	2
entropy	0.1436
entropy_ratio	0.1436
alert: imbalance	top value is 98.0% of rows

Fig 11.

Top values for tone.

Top values for tone (2 unique shown, of 2 total).
value	count	share
0	103334	98.0%
+	2150	2.0%
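
The 2,150 "+" rows equal the count of the 'tone' segment_class, which suggests the two columns encode the same fact. Matching totals do not prove row-level agreement, so a cross-tabulation is the natural check. The sketch below uses hypothetical rows; the helper names are illustrative only.

```python
from collections import Counter


def crosstab(rows, col_a, col_b):
    """Count co-occurrences of values in two columns."""
    return Counter((row[col_a], row[col_b]) for row in rows)


def is_redundant(rows, flag_col, flag_value, class_col, class_value):
    """True when flag_col == flag_value exactly where class_col == class_value."""
    return all(
        (row[flag_col] == flag_value) == (row[class_col] == class_value)
        for row in rows
    )


# Hypothetical rows; the real check would run over all 105,484 records.
rows = [
    {"segment_class": "consonant", "tone": "0"},
    {"segment_class": "vowel", "tone": "0"},
    {"segment_class": "tone", "tone": "+"},
]
table = crosstab(rows, "segment_class", "tone")
redundant = is_redundant(rows, "tone", "+", "segment_class", "tone")
```

The same check applies to the stress column by testing whether its "0" value coincides with the 'tone' class.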

stress categorical feature

Binary categorical flag with only two values, '-' and '0', where '-' dominates at 97.96% of 105,484 rows and '0' covers the remaining 2,150. In feature notation '-' is a real value (unstressed) and '0' most likely means not applicable; the 2,150 '0' rows match the count of the 'tone' segment_class, which suggests '0' is confined to tone rows. An entropy ratio of 0.14 confirms the column carries very little information, and no segment is marked '+', so it cannot separate stressed from unstressed segments.

Treatment: Drop for modelling, or recode '0' (not '-') as not applicable; the near-constant column is unlikely to help.

anthropic:claude-opus-4-7 · confidence high

Out[27]:

saturn.columns["stress"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2
top_value	-
top_rate	0.9796
cardinality	2
entropy	0.1436
entropy_ratio	0.1436
alert: imbalance	top value is 98.0% of rows

Fig 12.

Top values for stress.

Top values for stress (2 unique shown, of 2 total).
value	count	share
-	103334	98.0%
0	2150	2.0%

syllabic categorical feature

This appears to be a phonological feature column encoding the [syllabic] distinctive feature, dominated by binary values '-' (68.5%) and '+' (29.1%) with 2,150 entries marked '0' (likely not applicable; the count matches the 'tone' segment_class). Notably, 394 rows carry comma-separated compound values like '+,-' or '-,+,-', suggesting segments whose value changes over their duration (likely diphthongs or other contour segments). An entropy ratio of 0.347 confirms heavy concentration in the top category.

Treatment: One-hot encode, or split compound comma values into sequence positions before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[30]:

saturn.columns["syllabic"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	-
top_rate	0.6849
cardinality	8
entropy	1.042
entropy_ratio	0.3472

Fig 13.

Top values for syllabic.

Top values for syllabic (8 unique shown, of 8 total).
value	count	share
-	72248	68.5%
+	30692	29.1%
0	2150	2.0%
+,-	244	0.2%
-,+	124	0.1%
-,+,-	12	0.0%
-,+,+	12	0.0%
+,+,-	2	0.0%
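
Splitting the compound values into sequence positions, as the treatment suggests, might look like the sketch below. The fixed width of 3 comes from the longest observed value; padding short values with `None` is an assumption, chosen so that padding cannot be confused with the column's own '0' symbol.

```python
def split_contour(value, width=3, pad=None):
    """Split a comma-separated feature value into fixed-width positions.

    '+,-' becomes ('+', '-', None); shorter values are right-padded.
    """
    parts = value.split(",")
    if len(parts) > width:
        raise ValueError(f"{value!r} has more than {width} positions")
    return tuple(parts + [pad] * (width - len(parts)))


# The eight values observed in the syllabic column.
observed = ["-", "+", "0", "+,-", "-,+", "-,+,-", "-,+,+", "+,+,-"]
positions = {v: split_contour(v) for v in observed}

# Rows carrying compound values, from the table above.
compound_rows = 244 + 124 + 12 + 12 + 2
```

Each position can then be encoded as its own categorical column, which keeps the 394 compound rows without adding five rare levels to a one-hot encoding.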

source categorical metadata

Categorical provenance tag with 8 distinct values across 105,484 rows and no nulls, almost certainly indicating which source database or inventory each record came from (e.g., 'ph', 'ea', 'upsid', 'saphon'). Distribution is moderately concentrated: 'ph' accounts for 34.4% of rows while the smallest source 'ra' contributes only 4,261, yielding a high entropy ratio of 0.90. No single source dominates outright, so this is a usable stratification key rather than a near-constant flag.

Treatment: Keep as a categorical grouping/stratification variable; one-hot encode if used as a model feature.

anthropic:claude-opus-4-7 · confidence high

Out[33]:

saturn.columns["source"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	ph
top_rate	0.3439
cardinality	8
entropy	2.697
entropy_ratio	0.8991

Fig 14.

Top values for source.

Top values for source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%
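
Using source as a stratification key means sampling each inventory at the same rate, so that 'ra' (4.0% of rows) is not crowded out by 'ph' (34.4%). A minimal sketch with the standard library follows; the miniature dataset, the `id` field, and the minimum-of-one rule are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every group so small sources survive.

    Each non-empty group contributes at least one row.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)  # seeded for reproducibility
    sample = []
    for name in sorted(groups):
        members = groups[name]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical miniature of the source column (not the real proportions).
rows = [{"source": "ph", "id": i} for i in range(80)]
rows += [{"source": "ra", "id": i} for i in range(20)]
picked = stratified_sample(rows, "source", fraction=0.1)
```

Iterating over the groups in sorted order keeps the result stable across runs with the same seed.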

How to cite

BibTeX

@misc{saturn-parquet-phonemes-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: parquet phonemes},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/parquet-phonemes}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: parquet phonemes. Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/parquet-phonemes