saturn

/home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet | 105,484 rows | sample n=105,484 | seed 42 | 2026-05-01T17:11:31+00:00

Overview

Source	/home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet
Total rows	105,484
Profiled sample	105,484
Columns	8
Generated	2026-05-01T17:11:31+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset is a phoneme inventory table with 105,484 rows and 8 columns. Each row indexes a phoneme by language (via iso_639_3 and glottocode) and carries the phonological features segment_class, syllabic, stress, and tone, plus a source attribution. Coverage spans 2,094 distinct ISO 639-3 codes and 2,176 distinct Glottocodes, with 'mis' (828 rows) and 'kham1282' (622 rows) the most frequent values. Worth a closer look first: the segment_class and source distributions. segment_class is consonant-heavy (72,282 consonants vs 31,052 vowels vs 2,150 tones), and source is led by 'ph' at 34% but spread across 8 datasets, which shows where data density comes from. The phoneme column is also informative: common segments like /m/, /i/, /k/, /j/ top the list, matching well-known cross-linguistic frequencies. Note that stress and tone are highly imbalanced (~98% one value). Each has a 2,150-row minority value, the same count as the 'tone' segment_class, so both are likely redundant with it.

glottocode high anthropic:claude-opus-4-7

This column holds Glottocodes: fixed 8-character language identifiers (len_min/max/mean all 8, one_word_rate 1.0) from the Glottolog catalog. Across 105,484 rows there are only 2,176 unique codes and a 97.9% duplicate rate, so each code tags many records. The most frequent codes, such as kham1282 (622) and osse1243 (483), each account for well under 1% of rows. Null rate is negligible (19 rows, 0.0002) and vocabulary (2,166) closely matches the unique count, indicating clean categorical data rather than free text.
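The duplicate figures in the text-column stats are consistent with a simple definition: non-null rows minus distinct values, divided by total rows. The sketch below assumes that definition (saturn's actual implementation is not shown in this report) and checks it against the glottocode numbers; `duplicate_stats` is a hypothetical helper name.

```python
def duplicate_stats(rows: int, nulls: int, unique: int) -> tuple[int, float]:
    """Return (n_duplicates, duplicate_rate).

    Assumed definition: every non-null row beyond the first occurrence
    of its value counts as a duplicate; the rate is taken over all rows.
    """
    n_duplicates = rows - nulls - unique
    return n_duplicates, n_duplicates / rows


# Figures copied from the glottocode stats in this report.
n_dup, rate = duplicate_stats(rows=105_484, nulls=19, unique=2_176)
print(n_dup, round(rate, 3))  # 103289 0.979
```

The same arithmetic reproduces the iso_639_3 (103,365) and phoneme (102,342) duplicate counts reported further down.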

iso_639_3 high anthropic:claude-opus-4-7

This column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,094 distinct codes across 105,484 rows. The distribution is heavy-tailed: the top code 'mis' (the ISO 639-3 special code for uncoded languages) leads at 828 occurrences, followed by 'khg' and 'oss'. Duplicates account for 98.0% of rows, which is expected because each language contributes many phoneme rows. Null rate is negligible (25 rows, 0.0002). The prevalence of 'mis' is worth flagging, since it marks languages without an ISO code of their own rather than one specific language.

phoneme high anthropic:claude-opus-4-7

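This column holds short phoneme tokens: every entry is a single word, mean length 1.5 characters and max 11, with top values being single characters like 'm', 'i', 'k', 'j'. Despite 105,484 rows there are only 3,142 distinct values and a 97.0% duplicate rate, so this behaves as a categorical symbol inventory rather than free text. The 95th-percentile length of 3 and the maximum of 11 show that multi-character phoneme symbols exist alongside the single-character majority. The reported vocab size (1,339) is lower than the unique count (3,142); the profiler's tokenization is not documented here, so that figure is best not read as a count of distinct phonemes.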

segment_class high anthropic:claude-opus-4-7

This is a 3-level categorical labeling each row as a phonological segment class: consonant, vowel, or tone. The distribution is skewed toward consonants, which account for 68.5% (72,282) of rows, against 29.4% vowels (31,052) and 2.0% tones (2,150). That yields an entropy of 1.008 bits (entropy ratio 0.636). No nulls across 105,484 rows.
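The entropy and entropy_ratio figures can be checked by hand. The sketch below assumes Shannon entropy in bits, normalized by log2 of the number of observed categories. That assumption matches the reported numbers for this column, but saturn's exact formula is not stated in the report, and the function names are illustrative.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of observed categories)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a constant column carries no information
    return entropy_bits(counts) / math.log2(k)


# Counts copied from the segment_class top-values list in this report.
segment_class = {"consonant": 72_282, "vowel": 31_052, "tone": 2_150}
counts = list(segment_class.values())
print(f"entropy={entropy_bits(counts):.3f} ratio={entropy_ratio(counts):.3f}")
```

Under these assumptions the output should land close to the reported 1.008 and 0.636. A ratio of 1.0 would mean all three classes are equally frequent.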

tone high anthropic:claude-opus-4-7

Binary tone flag with values "0" and "+". In this phonological context it most likely marks whether the segment is a tone ("+") or not ("0"), not sentiment or polarity: the 2,150 "+" rows match the 2,150 rows whose segment_class is 'tone'. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows, and the entropy ratio is just 0.144. No nulls are present.

stress high anthropic:claude-opus-4-7

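Binary categorical flag with only two values, '-' and '0', where '-' covers 97.96% of 105,484 rows and '0' the remaining 2,150. The entropy ratio of 0.144 confirms the column carries almost no information. In distinctive-feature notation '-' usually means the feature is absent and '0' means it is unspecified, so '-' is probably a genuine [-stress] value rather than a placeholder. The 2,150 '0' rows match the count of tone segments, for which stress would not apply. As a stress indicator the column is severely imbalanced and adds little beyond segment_class.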
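The tone and stress columns each have a 2,150-row minority value, the same count as the 'tone' segment class, which suggests that both are determined by segment_class. Matching counts alone do not prove this, so a row-level check is needed. The sketch below is one way to run it; the helper names are hypothetical and the three toy rows are illustrative, not taken from the dataset.

```python
from collections import Counter


def cross_tab(rows, col_a, col_b):
    """Count co-occurrences of (row[col_a], row[col_b]) pairs."""
    return Counter((row[col_a], row[col_b]) for row in rows)


def is_determined_by(rows, key_col, value_col):
    """True if every key_col value maps to exactly one value_col value."""
    seen = {}
    for row in rows:
        key, value = row[key_col], row[value_col]
        if seen.setdefault(key, value) != value:
            return False
    return True


# Toy rows for illustration only; real rows would come from the parquet file.
toy_rows = [
    {"segment_class": "consonant", "tone": "0", "stress": "-"},
    {"segment_class": "vowel", "tone": "0", "stress": "-"},
    {"segment_class": "tone", "tone": "+", "stress": "0"},
]

print(cross_tab(toy_rows, "segment_class", "tone"))
print(is_determined_by(toy_rows, "segment_class", "tone"))
print(is_determined_by(toy_rows, "segment_class", "stress"))
```

If both checks return True on the full table, tone and stress can be dropped or derived from segment_class without losing information.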

syllabic high anthropic:claude-opus-4-7

This appears to be a phonological feature column encoding the [syllabic] distinctive feature, dominated by the binary values '-' (68.5%) and '+' (29.1%), with 2,150 entries marked '0' (likely unspecified). Notably, 394 rows carry comma-separated compound values like '+,-' or '-,+,-', suggesting segments whose feature value changes over their duration (e.g., diphthongs or other contour segments). The entropy ratio of 0.347 confirms heavy concentration in the top category.
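The compound values need splitting before the column can be treated as a plain binary feature. The sketch below uses the counts from the syllabic top-values list in this report to tally how many rows carry more than one feature value; `split_feature` is a hypothetical helper, and treating the comma as a sequence separator is an assumption about the encoding.

```python
# Counts copied from the syllabic top-values list in this report.
syllabic_counts = {
    "-": 72_248,
    "+": 30_692,
    "0": 2_150,
    "+,-": 244,
    "-,+": 124,
    "-,+,-": 12,
    "-,+,+": 12,
    "+,+,-": 2,
}


def split_feature(value):
    """Split a possibly compound feature value into its ordered parts."""
    return value.split(",")


compound_rows = sum(
    n for value, n in syllabic_counts.items() if len(split_feature(value)) > 1
)
print(compound_rows)                  # 394
print(sum(syllabic_counts.values()))  # 105484
```

At 394 of 105,484 rows, compound values are rare enough to handle as a separate case rather than as extra categories.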

source high anthropic:claude-opus-4-7

Categorical provenance tag with 8 distinct values across 105,484 rows and no nulls, almost certainly indicating which source database or inventory each record came from (e.g., 'ph', 'ea', 'upsid', 'saphon'). Distribution is moderately concentrated: 'ph' accounts for 34.4% of rows while the smallest source 'ra' contributes only 4,261, yielding a high entropy ratio of 0.90. No single source dominates outright, so this is a usable stratification key rather than a near-constant flag.
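If source is used as a stratification key, a proportional allocation keeps each source's share in a subsample close to its share in the full table. The sketch below uses the largest-remainder method, which is one common choice and not something saturn prescribes. The sample size of 1,000 and the function name are illustrative.

```python
# Counts copied from the source top-values list in this report.
source_counts = {
    "ph": 36_274, "ea": 16_883, "upsid": 13_966, "er": 9_423,
    "saphon": 9_047, "aa": 8_064, "spa": 7_566, "ra": 4_261,
}


def proportional_allocation(counts, n):
    """Split a sample of size n across strata in proportion to their counts.

    Uses the largest-remainder method so the allocations sum exactly to n.
    """
    total = sum(counts.values())
    exact = {key: n * count / total for key, count in counts.items()}
    alloc = {key: int(value) for key, value in exact.items()}
    leftover = n - sum(alloc.values())
    # Give the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for key in by_remainder[:leftover]:
        alloc[key] += 1
    return alloc


alloc = proportional_allocation(source_counts, 1000)
print(alloc)
```

With the high entropy ratio reported above, even the smallest source ('ra') receives a usable number of rows in a sample of this size.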

glottocode text

100.0% of rows are a single word | 95th-percentile length under 20 chars | 97.9% duplicate strings

rows	105,484
null	19 (0.0%)
unique	2,176
len_min	8
len_max	8
len_mean	8.000
len_median	8.000
len_p95	8.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	103,289
duplicate_rate	0.979
vocab_size	2,166
readability_flesch_mean	97.109
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.000
boilerplate_rate	0.000

Sample values (first 10)

lakk1252
lazz1240
kham1282
nort2722
west2456
gurd1238
suga1248
kham1282
gata1239
paez1247

iso_639_3 text

100.0% of rows are a single word | 95th-percentile length under 20 chars | 98.0% duplicate strings

rows	105,484
null	25 (0.0%)
unique	2,094
len_min	3
len_max	3
len_mean	3.000
len_median	3.000
len_p95	3.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	103,365
duplicate_rate	0.980
vocab_size	2,089
readability_flesch_mean	119.951
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.000
boilerplate_rate	0.000

Sample values (first 10)

phoneme text

100.0% of rows are a single word | 95th-percentile length under 20 chars | 97.0% duplicate strings

rows	105,484
null	0 (0.0%)
unique	3,142
len_min	1
len_max	11
len_mean	1.501
len_median	1.000
len_p95	3.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	102,342
duplicate_rate	0.970
vocab_size	1,339
readability_flesch_mean	114.439
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.75e-03
boilerplate_rate	0.000

Sample values (first 10)

χ
u
ndz
r
˦
ɨ
d̤ɮ̤
u
ĩ
ã

segment_class categorical

rows	105,484
null	0 (0.0%)
unique	3
top_value	consonant
top_rate	0.685
cardinality	3
entropy	1.008
entropy_ratio	0.636

Top values (rank 1–20)

consonant — 72,282
vowel — 31,052
tone — 2,150

tone categorical

top value is 98.0% of rows

rows	105,484
null	0 (0.0%)
unique	2
top_value	0
top_rate	0.980
cardinality	2
entropy	0.144
entropy_ratio	0.144

Top values (rank 1–20)

0 — 103,334
+ — 2,150

stress categorical

top value is 98.0% of rows

rows	105,484
null	0 (0.0%)
unique	2
top_value	-
top_rate	0.980
cardinality	2
entropy	0.144
entropy_ratio	0.144

Top values (rank 1–20)

- — 103,334
0 — 2,150

syllabic categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	-
top_rate	0.685
cardinality	8
entropy	1.042
entropy_ratio	0.347

Top values (rank 1–20)

- — 72,248
+ — 30,692
0 — 2,150
+,- — 244
-,+ — 124
-,+,- — 12
-,+,+ — 12
+,+,- — 2

source categorical

rows	105,484
null	0 (0.0%)
unique	8
top_value	ph
top_rate	0.344
cardinality	8
entropy	2.697
entropy_ratio	0.899

Top values (rank 1–20)

ph — 36,274
ea — 16,883
upsid — 13,966
er — 9,423
saphon — 9,047
aa — 8,064
spa — 7,566
ra — 4,261