
parquet phonemes

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet

Saturn profiled 105,484 rows across 8 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
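# Notebook shell escape: install the saturn-dissect package.
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source Parquet file with the options used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet",
    "--findings", "parquet-phonemes.json",
    "--llm", "anthropic:claude-opus-4-7",
])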

Summary confidence: high

This dataset is a phoneme inventory table with 105,484 rows and 8 columns. It indexes phonemes by language (via iso_639_3 and glottocode) and records phonological features (segment_class, syllabic, stress, tone) plus a source attribution. Coverage spans 2,094 distinct ISO 639-3 codes and 2,176 Glottocodes, with 'mis' (828 rows) and 'kham1282' (622 rows) the most frequent. Worth a closer look first: the segment_class and source distributions. segment_class is consonant-heavy (72,282 consonants vs 31,052 vowels vs 2,150 tones), and 'ph' is the largest of the 8 sources at 34% of rows, which hints at where the data density comes from. The phoneme column is also informative: common segments like /m/, /i/, /k/, /j/ top the list, matching well-known cross-linguistic frequencies. Note that stress and tone are highly imbalanced (~98% one value). Each has a minority count of 2,150, the same as the 'tone' segment_class, so both are likely redundant with it.

citing: row_count · column_count · columns.iso_639_3.n_unique · columns.glottocode.n_unique · columns.segment_class.top_values · columns.source.top_values · columns.source.top_rate · columns.phoneme.top_values · columns.stress.top_rate · columns.tone.top_rate · columns.syllabic.top_values

Out[4]:

saturn.schema() · 8 columns

column         kind         n        null%  unique  alerts
glottocode     text         105,484  0.0%   2,176   one_word short_text duplicates
iso_639_3      text         105,484  0.0%   2,094   one_word short_text duplicates
phoneme        text         105,484  0.0%   3,142   one_word short_text duplicates
segment_class  categorical  105,484  0.0%   3
tone           categorical  105,484  0.0%   2       imbalance
stress         categorical  105,484  0.0%   2       imbalance
syllabic       categorical  105,484  0.0%   8
source         categorical  105,484  0.0%   8
Fig 1.
segment_class · Shows the consonant/vowel/tone split — consonants dominate at ~68% of rows.
Top values for segment_class (3 unique shown, of 3 total).
value      count   share
consonant  72,282  68.5%
vowel      31,052  29.4%
tone        2,150   2.0%
Fig 2.
source · Compares the 8 contributing inventories; 'ph' supplies about a third of all rows.
Top values for source (8 unique shown, of 8 total).
value   count   share
ph      36,274  34.4%
ea      16,883  16.0%
upsid   13,966  13.2%
er       9,423   8.9%
saphon   9,047   8.6%
aa       8,064   7.6%
spa      7,566   7.2%
ra       4,261   4.0%
Fig 3.
phoneme · Top phoneme symbols across languages — /m/, /i/, /k/ lead, matching typological expectations.
Character-length distribution for phoneme (mean: 1.501). Empty bins omitted.
chars  count
1      67,114
2      28,726
3       6,559
4       2,225
5         401
6         267
7         104
8           5
9          70
10          1
11         12
Fig 4.
syllabic · Distribution of the syllabic feature; mostly '-' vs '+' with a long tail of rare combined values.
Top values for syllabic (8 unique shown, of 8 total).
value   count   share
-       72,248  68.5%
+       30,692  29.1%
0        2,150   2.0%
+,-        244   0.2%
-,+        124   0.1%
-,+,-       12   0.0%
-,+,+       12   0.0%
+,+,-        2   0.0%
Fig 5.
iso_639_3 · Language coverage by ISO code — check which languages contribute the most phoneme rows.
Character-length distribution for iso_639_3 (mean: 3.0). Empty bins omitted.
chars  count
3      105,459
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column         kind         null %
glottocode     text         0.0%
iso_639_3      text         0.0%
phoneme        text         0.0%
segment_class  categorical  0.0%
tone           categorical  0.0%
stress         categorical  0.0%
syllabic       categorical  0.0%
source         categorical  0.0%

glottocode text foreign_key

This column holds Glottocodes: fixed 8-character language identifiers (len_min, len_max, and len_mean are all 8; one_word_rate is 1.0) from the Glottolog catalog. Across 105,484 rows there are only 2,176 unique codes and a 97.9% duplicate rate, so each code tags many records. The most frequent codes, kham1282 (622) and osse1243 (483), each cover under 1% of rows. The null rate is negligible (19 rows, 0.0002), and vocab_size (2,166) closely matches the unique count, which indicates clean categorical data rather than free text.

Treatment: Treat as a categorical key; left-join to a Glottolog reference table for language metadata.
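
A minimal sketch of the left join suggested in the treatment, assuming pandas is available. The reference table `glottolog_ref` and its `name` column are hypothetical stand-ins for whatever Glottolog export is used; the two Glottocodes come from the stats in this report, and the name strings are placeholders, not real language names.

```python
import pandas as pd

# Toy stand-in for the phoneme table; only the key column matters here.
phonemes = pd.DataFrame({
    "glottocode": ["kham1282", "kham1282", "osse1243", None],
    "phoneme": ["m", "i", "k", "j"],
})

# Hypothetical reference table: one row per Glottocode.
glottolog_ref = pd.DataFrame({
    "glottocode": ["kham1282", "osse1243"],
    "name": ["name-for-kham1282", "name-for-osse1243"],
})

# validate="m:1" raises if the reference table repeats a key;
# indicator=True adds a _merge column so unmatched rows are easy to count.
joined = phonemes.merge(
    glottolog_ref,
    on="glottocode",
    how="left",
    validate="m:1",
    indicator=True,
)

unmatched = int((joined["_merge"] == "left_only").sum())
print(joined[["glottocode", "phoneme", "name"]])
print("rows without reference metadata:", unmatched)
```

A left join keeps every phoneme row, including the few with a null glottocode, so the row count stays the same and the unmatched count shows how much metadata is missing.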

anthropic:claude-opus-4-7 · confidence high
Out[12]:

saturn.columns["glottocode"].stats

stat                     value
n                        105,484
nulls                    19 (0.0%)
unique                   2,176
len_min                  8
len_max                  8
len_mean                 8
len_median               8
len_p95                  8
word_mean                1
word_median              1
n_empty                  0
n_duplicates             103,289
duplicate_rate           0.9794
vocab_size               2,166
readability_flesch_mean  97.11
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             0
boilerplate_rate         0
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.9% duplicate strings
Fig 7.
Character-length distribution for glottocode.
Character-length distribution for glottocode (mean: 8.0). Empty bins omitted.
chars  count
8      105,465

iso_639_3 text feature

This column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,094 distinct codes across 105,484 rows. The distribution is heavy-tailed: the top code 'mis' (the ISO 639-3 placeholder for uncoded languages) leads at 828 occurrences, followed by 'khg' and 'oss'. Duplicates account for 98.01% of non-null rows, as expected for a table with many phonemes per language. The null rate is negligible (25 rows, 0.0002). The prevalence of 'mis' is worth flagging, since it marks languages without their own ISO code rather than one real language.

Treatment: Treat as a categorical code; one-hot or target-encode, and decide whether to drop or bucket the 'mis' placeholder.
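
The two options for the 'mis' placeholder can be sketched as follows, assuming pandas. The five-row series is made up for illustration, and the bucket label "uncoded" is a hypothetical name, not something Saturn emits.

```python
import pandas as pd

# Toy column; 'mis', 'khg' and 'oss' are the top codes named in this report.
iso = pd.Series(["mis", "khg", "oss", "mis", None], name="iso_639_3")

PLACEHOLDER = "mis"  # ISO 639-3 code for uncoded languages

# isin() returns plain booleans and treats missing values as False.
is_placeholder = iso.isin([PLACEHOLDER])

# Option A: drop placeholder rows (nulls are kept and handled separately).
kept = iso[~is_placeholder]

# Option B: keep the rows but move them into an explicit bucket.
bucketed = iso.mask(is_placeholder, "uncoded")

print("placeholder rows:", int(is_placeholder.sum()))
print(kept.tolist())
print(bucketed.tolist())
```

Option B preserves the row count, which matters if the placeholder rows still carry useful phoneme data.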

anthropic:claude-opus-4-7 · confidence high
Out[15]:

saturn.columns["iso_639_3"].stats

stat                     value
n                        105,484
nulls                    25 (0.0%)
unique                   2,094
len_min                  3
len_max                  3
len_mean                 3
len_median               3
len_p95                  3
word_mean                1
word_median              1
n_empty                  0
n_duplicates             103,365
duplicate_rate           0.9801
vocab_size               2,089
readability_flesch_mean  120
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             0
boilerplate_rate         0
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 98.0% duplicate strings
Fig 8.
Character-length distribution for iso_639_3.
Character-length distribution for iso_639_3 (mean: 3.0). Empty bins omitted.
chars  count
3      105,459

phoneme text feature

This column holds short phoneme tokens: every entry is a single word, with a mean length of 1.5 characters and a maximum of 11, and the top values are single symbols like 'm', 'i', 'k', 'j'. Despite 105,484 rows there are only 3,142 distinct values and a 97.0% duplicate rate, so this behaves as a small categorical alphabet rather than free text. The length distribution shows 67,114 single-character rows (63.6%), so multi-character phoneme symbols make up the remaining 36.4% alongside the single-character majority.

Treatment: Treat as a categorical phoneme code — label-encode or one-hot rather than tokenizing as text.
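
A small sketch of label-encoding the phoneme column, assuming pandas. The sample values are illustrative; 'kʰ' is a made-up example of a multi-character symbol and is not taken from the profiled data.

```python
import pandas as pd

# Toy column mixing single-character and multi-character symbols.
phoneme = pd.Series(["m", "i", "k", "j", "m", "kʰ", "i"], name="phoneme")

# factorize() assigns integer codes in order of first appearance and
# returns the lookup table needed to map codes back to symbols.
codes, uniques = pd.factorize(phoneme)

print(codes.tolist())
print(list(uniques))
```

Each whole symbol gets one code, so a multi-character phoneme is never split into its component characters.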

anthropic:claude-opus-4-7 · confidence high
Out[18]:

saturn.columns["phoneme"].stats

stat                     value
n                        105,484
nulls                    0 (0.0%)
unique                   3,142
len_min                  1
len_max                  11
len_mean                 1.501
len_median               1
len_p95                  3
word_mean                1
word_median              1
n_empty                  0
n_duplicates             102,342
duplicate_rate           0.9702
vocab_size               1,339
readability_flesch_mean  114.4
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             0.001754
boilerplate_rate         0
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.0% duplicate strings
Fig 9.
Character-length distribution for phoneme.
Character-length distribution for phoneme (mean: 1.501). Empty bins omitted.
chars  count
1      67,114
2      28,726
3       6,559
4       2,225
5         401
6         267
7         104
8           5
9          70
10          1
11         12

segment_class categorical label

This is a 3-level categorical labeling each row as a phonological segment class: consonant, vowel, or tone. The distribution is skewed: consonants account for 68.5% (72,282), vowels for 29.4% (31,052), and tones for only 2.0% (2,150), which yields an entropy of 1.01 bits (entropy ratio 0.64). No nulls across 105,484 rows.

Treatment: One-hot or ordinal encode; consider class-weighting or stratified sampling given the rare 'tone' class.
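
One way to act on the class-weighting suggestion is inverse-frequency ("balanced") weighting, sketched below with the standard library only. The formula n_total / (n_classes * class_count) is a common heuristic chosen here for illustration; the report does not prescribe a specific scheme.

```python
# Counts copied from the segment_class table in this report.
counts = {"consonant": 72282, "vowel": 31052, "tone": 2150}


def balanced_weights(class_counts):
    """Weight each class by n_total / (n_classes * class_count)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * c) for label, c in class_counts.items()}


weights = balanced_weights(counts)
for label, w in weights.items():
    print(f"{label:<10} {w:.3f}")
```

Under this scheme the rare 'tone' class gets the largest weight, and the weighted counts still sum to the original row count.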

anthropic:claude-opus-4-7 · confidence high
Out[21]:

saturn.columns["segment_class"].stats

stat           value
n              105,484
nulls          0 (0.0%)
unique         3
top_value      consonant
top_rate       0.6852
cardinality    3
entropy        1.008
entropy_ratio  0.6357
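
The entropy and entropy_ratio values in these stats tables are consistent with Shannon entropy in bits, divided by log2 of the cardinality. That definition is an assumption inferred from the numbers, not taken from Saturn's documentation. A short sketch that recomputes both from the counts in this report:

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Counts copied from the segment_class and tone tables in this report.
segment_class = [72282, 31052, 2150]
tone = [103334, 2150]

print(round(entropy_bits(segment_class), 3), round(entropy_ratio(segment_class), 4))
print(round(entropy_bits(tone), 4), round(entropy_ratio(tone), 4))
```

For a two-valued column log2(2) is 1, which is why entropy and entropy_ratio are identical for tone and stress.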
Fig 10.
Top values for segment_class.
Top values for segment_class (3 unique shown, of 3 total).
value      count   share
consonant  72,282  68.5%
vowel      31,052  29.4%
tone        2,150   2.0%

tone categorical label

Binary flag with values "0" and "+". In a phoneme inventory this most plausibly marks tonal segments rather than anything like sentiment: the 2,150 "+" rows equal the number of rows with segment_class 'tone'. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows, leaving only 2,150 "+" cases, and the entropy ratio is just 0.144. No nulls are present.

Treatment: Treat as a binary feature that is likely redundant with segment_class; confirm the overlap before dropping it, and stratify splits if it is kept.
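
The overview notes that tone is largely redundant with the 'tone' segment_class. A quick way to check that on the real table is a cross-tabulation, sketched here with pandas on a made-up five-row frame. The column names and value codes match this report, but the rows are illustrative.

```python
import pandas as pd

# Toy rows using the value codes seen in this report.
df = pd.DataFrame({
    "segment_class": ["consonant", "vowel", "tone", "consonant", "tone"],
    "tone": ["0", "0", "+", "0", "+"],
})

# Rows are segment classes, columns are tone values.
table = pd.crosstab(df["segment_class"], df["tone"])

# The flag is fully redundant if '+' occurs exactly on the 'tone' rows.
redundant = bool(((df["segment_class"] == "tone") == (df["tone"] == "+")).all())

print(table)
print("tone flag redundant with segment_class:", redundant)
```

If the check holds on the full data, the column adds nothing beyond segment_class and can be dropped without losing information.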

anthropic:claude-opus-4-7 · confidence high
Out[24]:

saturn.columns["tone"].stats

stat           value
n              105,484
nulls          0 (0.0%)
unique         2
top_value      0
top_rate       0.9796
cardinality    2
entropy        0.1436
entropy_ratio  0.1436
alert: imbalance · top value is 98.0% of rows
Fig 11.
Top values for tone.
Top values for tone (2 unique shown, of 2 total).
value  count    share
0      103,334  98.0%
+        2,150   2.0%

stress categorical feature

Binary categorical flag with only two values, '-' and '0': '-' dominates at 97.96% of 105,484 rows and '0' covers the remaining 2,150. The entropy ratio of 0.14 confirms the column carries almost no information. The 2,150 '0' rows equal the count of 'tone' segments, so '0' likely means stress is not applicable there, while '-' reads as the negative feature value (as in syllabic) rather than a missing-data placeholder. No row is marked '+', so the column never records a stressed segment.

Treatment: Likely safe to drop as a near-constant column; recoding '-' as missing is not advisable if it is the negative feature value.

anthropic:claude-opus-4-7 · confidence high
Out[27]:

saturn.columns["stress"].stats

stat           value
n              105,484
nulls          0 (0.0%)
unique         2
top_value      -
top_rate       0.9796
cardinality    2
entropy        0.1436
entropy_ratio  0.1436
alert: imbalance · top value is 98.0% of rows
Fig 12.
Top values for stress.
Top values for stress (2 unique shown, of 2 total).
value  count    share
-      103,334  98.0%
0        2,150   2.0%

syllabic categorical feature

This appears to be a phonological feature column encoding the [syllabic] distinctive feature, dominated by binary values '-' (68.5%) and '+' (29.1%), with 2,150 entries marked '0' (likely unspecified). Notably, 394 rows carry comma-separated compound values like '+,-' or '-,+,-', suggesting segments with sequential feature changes (e.g., affricates or contour segments). The entropy ratio of 0.347 reflects heavy concentration in the top two values, which together cover 97.6% of rows.

Treatment: One-hot encode, or split compound comma values into sequence positions before modelling.
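
A minimal sketch of splitting the compound values into sequence positions, using only the standard library. The three-slot width matches the longest value in the table; padding short values with None is an illustrative choice, made so the padding cannot be confused with the '0' value.

```python
# Values and counts copied from the syllabic table in this report.
syllabic_counts = {
    "-": 72248, "+": 30692, "0": 2150,
    "+,-": 244, "-,+": 124, "-,+,-": 12, "-,+,+": 12, "+,+,-": 2,
}


def split_positions(value, width=3):
    """Split a comma-separated feature value into `width` slots, padded with None."""
    parts = value.split(",")
    if len(parts) > width:
        raise ValueError(f"{value!r} has more than {width} positions")
    return parts + [None] * (width - len(parts))


# Rows whose value contains a comma are the compound ones.
compound_rows = sum(c for v, c in syllabic_counts.items() if "," in v)

print(split_positions("-,+,-"))
print(split_positions("+"))
print("compound rows:", compound_rows)
```

Each slot can then be encoded as its own categorical column, which keeps the ordering information that a plain one-hot of the raw string would lose.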

anthropic:claude-opus-4-7 · confidence high
Out[30]:

saturn.columns["syllabic"].stats

stat           value
n              105,484
nulls          0 (0.0%)
unique         8
top_value      -
top_rate       0.6849
cardinality    8
entropy        1.042
entropy_ratio  0.3472
Fig 13.
Top values for syllabic.
Top values for syllabic (8 unique shown, of 8 total).
value   count   share
-       72,248  68.5%
+       30,692  29.1%
0        2,150   2.0%
+,-        244   0.2%
-,+        124   0.1%
-,+,-       12   0.0%
-,+,+       12   0.0%
+,+,-        2   0.0%

source categorical metadata

Categorical provenance tag with 8 distinct values across 105,484 rows and no nulls, almost certainly indicating which source database or inventory each record came from (e.g., 'ph', 'ea', 'upsid', 'saphon'). Distribution is moderately concentrated: 'ph' accounts for 34.4% of rows while the smallest source 'ra' contributes only 4,261, yielding a high entropy ratio of 0.90. No single source dominates outright, so this is a usable stratification key rather than a near-constant flag.

Treatment: Keep as a categorical grouping/stratification variable; one-hot encode if used as a model feature.
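
Using source as a stratification key could look like the sketch below, assuming pandas 1.1 or later for `groupby(...).sample`. The source labels come from this report; the rows and the 50% fraction are made up for illustration.

```python
import pandas as pd

# Toy frame: three of the eight sources, with invented rows.
df = pd.DataFrame({
    "source": ["ph"] * 6 + ["ea"] * 4 + ["ra"] * 2,
    "phoneme": list("mikjptaeoubd"),
})

# Sampling the same fraction within every source keeps the per-source
# proportions of the full table in the sample.
sample = df.groupby("source").sample(frac=0.5, random_state=0)

print(sample["source"].value_counts())
```

Sampling per group, rather than from the whole table, guarantees that the smallest source ('ra' in this report) is still represented.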

anthropic:claude-opus-4-7 · confidence high
Out[33]:

saturn.columns["source"].stats

stat           value
n              105,484
nulls          0 (0.0%)
unique         8
top_value      ph
top_rate       0.3439
cardinality    8
entropy        2.697
entropy_ratio  0.8991
Fig 14.
Top values for source.
Top values for source (8 unique shown, of 8 total).
value   count   share
ph      36,274  34.4%
ea      16,883  16.0%
upsid   13,966  13.2%
er       9,423   8.9%
saphon   9,047   8.6%
aa       8,064   7.6%
spa      7,566   7.2%
ra       4,261   4.0%

How to cite


BibTeX
@misc{saturn-parquet-phonemes-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: parquet phonemes},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/parquet-phonemes}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: parquet phonemes. Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/parquet-phonemes