saturn

/home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet | 105,484 rows | sample n=105,484 | seed 42 | 2026-05-01T17:11:31+00:00

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/phonemes.parquet
Total rows: 105,484
Profiled sample: 105,484
Columns: 8
Generated: 2026-05-01T17:11:31+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset is a phoneme inventory table with 105,484 rows and 8 columns, indexing phonemes by language (via iso_639_3 and glottocode) along with phonological features like segment_class, syllabic, stress, and tone, plus a source attribution. Coverage spans roughly 2,094 ISO languages and 2,176 Glottolog codes, with 'mis' (828 rows) and 'kham1282' (622 rows) being the most represented. Worth a closer look first: the segment_class and source distributions, since segment_class shows a clear consonant-heavy mix (72,282 consonants vs 31,052 vowels vs 2,150 tones) and source is dominated by 'ph' at 34% but spreads across 8 datasets, hinting at where data density comes from. The phoneme column itself is also informative — common segments like /m/, /i/, /k/, /j/ top the list, matching well-known cross-linguistic frequencies. Note that stress and tone are highly imbalanced (~98% one value) and largely redundant with the 'tone' segment_class.

glottocode high anthropic:claude-opus-4-7

This column holds Glottocodes — fixed 8-character language identifiers (len_min/max/mean all 8, one_word_rate 1.0) from the Glottolog catalog. Across 105,484 rows there are only 2,176 unique codes with a 97.9% duplicate rate, so each code tags many records; top codes like kham1282 (622) and osse1243 (483) dominate. Null rate is negligible (0.0002) and vocabulary (2,166) closely matches unique count, indicating clean categorical data rather than free text.
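The duplicate figures for the three text columns can be reproduced from the row, null, and unique counts reported further down. The sketch below assumes that saturn defines n_duplicates as non-null rows minus distinct values and divides by total rows to get duplicate_rate; both definitions are inferred from the numbers in this report, not taken from saturn's documentation, and `duplicate_stats` is a hypothetical helper.

```python
# Figures copied from the per-column stats in this report.
COLUMNS = {
    "glottocode": {"rows": 105_484, "null": 19, "unique": 2_176},
    "iso_639_3": {"rows": 105_484, "null": 25, "unique": 2_094},
    "phoneme": {"rows": 105_484, "null": 0, "unique": 3_142},
}


def duplicate_stats(rows, null, unique):
    """Return (n_duplicates, duplicate_rate) under the assumed definitions."""
    # Assumption: a duplicate is any non-null row beyond the first
    # occurrence of its value.
    n_duplicates = rows - null - unique
    # Assumption: the rate is taken over all rows, nulls included.
    return n_duplicates, n_duplicates / rows


for name, stats in COLUMNS.items():
    n_dup, rate = duplicate_stats(**stats)
    print(f"{name}: n_duplicates={n_dup:,} duplicate_rate={rate:.3f}")
```

Under these assumptions the computed counts agree with the reported n_duplicates values for all three columns.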

iso_639_3 high anthropic:claude-opus-4-7

This column holds ISO 639-3 language codes: every value is exactly 3 characters and one word, with 2,094 distinct codes across 105,484 rows. The distribution is heavy-tailed — the top code 'mis' (the ISO 'uncoded languages' placeholder) leads at 828 occurrences, followed by 'khg' and 'oss', and duplicates account for 98.01% of rows by design. Null rate is negligible (0.0002), and the prevalence of 'mis' is worth flagging since it signals unidentified languages rather than a real language label.
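Because both identifier columns have a fixed width, a simple shape check can catch malformed codes and separate out the 'mis' placeholder. The following is a minimal sketch: the regular expressions assume a glottocode is four lowercase alphanumerics followed by four digits and an ISO 639-3 code is three lowercase letters, and the helper names are hypothetical. The sample values are the ones listed in this report.

```python
import re

# Assumed shapes; adjust if the dataset uses a different convention.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")
ISO_639_3_RE = re.compile(r"[a-z]{3}")

# 'mis' is the ISO 639-3 code reserved for uncoded languages.
ISO_PLACEHOLDERS = {"mis"}


def is_glottocode(value):
    """True if the value has the assumed 4+4 glottocode shape."""
    return bool(GLOTTOCODE_RE.fullmatch(value))


def is_iso_639_3(value):
    """True if the value is three lowercase ASCII letters."""
    return bool(ISO_639_3_RE.fullmatch(value))


def is_real_language(value):
    """True for well-formed ISO codes that are not placeholders."""
    return is_iso_639_3(value) and value not in ISO_PLACEHOLDERS


glottocodes = ["lakk1252", "lazz1240", "kham1282", "nort2722", "west2456"]
iso_codes = ["lbe", "dlg", "khg", "ceb", "xwl", "mis"]

print([g for g in glottocodes if not is_glottocode(g)])  # malformed glottocodes
print([c for c in iso_codes if not is_real_language(c)])  # placeholders or malformed
```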

phoneme high anthropic:claude-opus-4-7

This column holds short phoneme tokens: every entry is a single word, with a mean length of 1.5 characters and a maximum of 11, and the top values are single letters such as 'm', 'i', 'k', 'j'. Despite 105,484 rows there are only 3,142 distinct values and a 97.0% duplicate rate, so this behaves as a small categorical alphabet rather than free text. The length statistics (95th-percentile length 3, maximum 11) show that multi-character phoneme codes exist alongside the single-letter majority; the reported vocab size is 1,339.
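Length statistics for IPA strings depend on what counts as a character, since diacritics are often stored as separate combining code points. The sketch below contrasts the two counts using Python's standard `unicodedata` module. It assumes the length figures count Unicode code points, which the report does not state, and that the sample value shown as d̤ɮ̤ consists of two base letters each followed by U+0324 (combining diaeresis below).

```python
import unicodedata


def code_points(text):
    """Number of Unicode code points, which is what len() counts in Python."""
    return len(text)


def base_characters(text):
    """Number of code points that are not combining marks."""
    return sum(1 for ch in text if not unicodedata.combining(ch))


# Assumed encoding of the sample value: d + U+0324 + U+026E + U+0324.
breathy_affricate = "d\u0324\u026e\u0324"

for token in ["u", "ndz", breathy_affricate]:
    print(token, code_points(token), base_characters(token))
```

If the profile counts code points, a segment with several diacritics contributes more to the length statistics than its number of base letters would suggest.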

segment_class high anthropic:claude-opus-4-7

This is a 3-level categorical labeling each row as a phonological segment class: consonant, vowel, or tone. The distribution is heavily imbalanced — consonants account for 68.5% (72,282), vowels 31,052, and tones only 2,150 — yielding entropy of 1.01 (entropy ratio 0.64). No nulls across 105,484 rows.
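The entropy figures can be checked directly from the value counts. This sketch assumes Shannon entropy in bits (base-2 logarithm) and an entropy ratio defined as entropy divided by the log2 of the cardinality; that reading is inferred from the reported numbers rather than from saturn's documentation, and the function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2 of the number of categories."""
    k = len(counts)
    if k < 2:
        return 0.0  # a constant column carries no information
    return entropy_bits(counts) / math.log2(k)


# Counts copied from the segment_class stats in this report.
SEGMENT_CLASS = {"consonant": 72_282, "vowel": 31_052, "tone": 2_150}

counts = list(SEGMENT_CLASS.values())
print(f"entropy={entropy_bits(counts):.3f} ratio={entropy_ratio(counts):.3f}")
```

The same two functions apply unchanged to the tone, stress, syllabic, and source counts.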

tone high anthropic:claude-opus-4-7

Binary tone flag with values "0" and "+". The 2,150 "+" rows match the row count of the 'tone' segment_class exactly, so "+" most likely marks tone segments, while "0" marks consonants and vowels, for which the feature does not apply. The distribution is severely imbalanced: "0" covers 97.96% of 105,484 rows, and the entropy ratio is just 0.144. No nulls are present.

stress high anthropic:claude-opus-4-7

Binary categorical flag with only two values, '-' and '0', where '-' dominates at 97.96% of 105,484 rows and '0' covers the remaining 2,150. The '0' count equals the number of rows in the 'tone' segment_class, so '0' most likely means the feature is unspecified for tone segments, while '-' marks consonants and vowels as unstressed. An entropy ratio of 0.14 confirms that the column carries little information beyond segment_class, and because no '+' value occurs it cannot distinguish stressed from unstressed segments.
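The tone and stress flags each single out exactly 2,150 rows, the same count as the 'tone' segment class, which suggests both are determined by segment_class. The counts alone cannot prove that, so a row-level check is needed. Below is a minimal sketch of such a check using only the standard library; the rows are made up for illustration and are not taken from the file, and `crosstab` and `is_determined_by` are hypothetical helpers.

```python
from collections import Counter

# Hypothetical rows for illustration only; not read from phonemes.parquet.
ROWS = [
    {"segment_class": "consonant", "tone": "0", "stress": "-"},
    {"segment_class": "vowel", "tone": "0", "stress": "-"},
    {"segment_class": "tone", "tone": "+", "stress": "0"},
    {"segment_class": "consonant", "tone": "0", "stress": "-"},
]


def crosstab(rows, a, b):
    """Count co-occurrences of the values in columns a and b."""
    return Counter((row[a], row[b]) for row in rows)


def is_determined_by(rows, key, target):
    """True if every value of `key` maps to exactly one value of `target`."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[key], row[target]) != row[target]:
            return False
    return True


print(crosstab(ROWS, "segment_class", "tone"))
print(is_determined_by(ROWS, "segment_class", "tone"))
print(is_determined_by(ROWS, "segment_class", "stress"))
```

If both checks hold on the real table, tone and stress could be dropped or derived from segment_class without losing information.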

syllabic high anthropic:claude-opus-4-7

This appears to be a phonological feature column encoding the [syllabic] distinctive feature, dominated by binary values '-' (68.5%) and '+' (29.1%), with 2,150 entries marked '0' (likely unspecified, since the count matches the 'tone' segment_class). Notably, 394 rows carry comma-separated compound values like '+,-' or '-,+,-', suggesting segments whose feature value changes over their duration (e.g., diphthongs or other contour segments). Entropy ratio of 0.347 confirms heavy concentration in the top category.
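The compound values can be handled by splitting each one on commas into a sequence of feature values. The sketch below does this over the value counts reported for the syllabic column and totals the rows that carry more than one value. Treating the comma as a sequence separator is an interpretation of the data, and the helper names are hypothetical.

```python
# Counts copied from the syllabic stats in this report.
SYLLABIC = {
    "-": 72_248,
    "+": 30_692,
    "0": 2_150,
    "+,-": 244,
    "-,+": 124,
    "-,+,-": 12,
    "-,+,+": 12,
    "+,+,-": 2,
}


def split_contour(value):
    """Split a possibly compound feature value into its parts."""
    return tuple(value.split(","))


def compound_rows(counts):
    """Total rows whose value has more than one part."""
    return sum(n for value, n in counts.items() if len(split_contour(value)) > 1)


print(compound_rows(SYLLABIC))
print(sum(SYLLABIC.values()))
```

A downstream model that expects a strictly binary feature would need a policy for these rows, such as taking the first part or keeping them as a separate category.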

source high anthropic:claude-opus-4-7

Categorical provenance tag with 8 distinct values across 105,484 rows and no nulls, almost certainly indicating which source database or inventory each record came from (e.g., 'ph', 'ea', 'upsid', 'saphon'). Distribution is moderately concentrated: 'ph' accounts for 34.4% of rows while the smallest source 'ra' contributes only 4,261, yielding a high entropy ratio of 0.90. No single source dominates outright, so this is a usable stratification key rather than a near-constant flag.
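To use source as a stratification key it helps to see each source's share of rows. The sketch below computes those shares from the counts in this report; `shares` is a hypothetical helper, and the source abbreviations are reproduced as they appear in the data without expansion.

```python
# Counts copied from the source stats in this report.
SOURCE = {
    "ph": 36_274,
    "ea": 16_883,
    "upsid": 13_966,
    "er": 9_423,
    "saphon": 9_047,
    "aa": 8_064,
    "spa": 7_566,
    "ra": 4_261,
}


def shares(counts):
    """Map each category to its fraction of the total row count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


source_shares = shares(SOURCE)
for name, share in sorted(source_shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<7}{share:6.1%}")
```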

glottocode text

100.0% rows are a single word | 95th-percentile length under 20 chars | 97.9% duplicate strings
rows: 105,484
null: 19 (0.0%)
unique: 2,176
len_min: 8
len_max: 8
len_mean: 8.000
len_median: 8.000
len_p95: 8.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 103,289
duplicate_rate: 0.979
vocab_size: 2,166
readability_flesch_mean: 97.109
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. lakk1252
  2. lazz1240
  3. kham1282
  4. nort2722
  5. west2456
  6. gurd1238
  7. suga1248
  8. kham1282
  9. gata1239
  10. paez1247

iso_639_3 text

100.0% rows are a single word | 95th-percentile length under 20 chars | 98.0% duplicate strings
rows: 105,484
null: 25 (0.0%)
unique: 2,094
len_min: 3
len_max: 3
len_mean: 3.000
len_median: 3.000
len_p95: 3.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 103,365
duplicate_rate: 0.980
vocab_size: 2,089
readability_flesch_mean: 119.951
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. lbe
  2. dlg
  3. khg
  4. ceb
  5. xwl
  6. gdj
  7. cce
  8. gla
  9. gaq
  10. pbb

phoneme text

100.0% rows are a single word | 95th-percentile length under 20 chars | 97.0% duplicate strings
rows: 105,484
null: 0 (0.0%)
unique: 3,142
len_min: 1
len_max: 11
len_mean: 1.501
len_median: 1.000
len_p95: 3.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 102,342
duplicate_rate: 0.970
vocab_size: 1,339
readability_flesch_mean: 114.439
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.75e-03
boilerplate_rate: 0.000
Sample values (first 10)
  1. χ
  2. u
  3. ndz
  4. r
  5. ˦
  6. ɨ
  7. d̤ɮ̤
  8. u

segment_class categorical

rows: 105,484
null: 0 (0.0%)
unique: 3
top_value: consonant
top_rate: 0.685
cardinality: 3
entropy: 1.008
entropy_ratio: 0.636
Top values (rank 1–20)
  1. consonant — 72,282
  2. vowel — 31,052
  3. tone — 2,150

tone categorical

top value is 98.0% of rows
rows: 105,484
null: 0 (0.0%)
unique: 2
top_value: 0
top_rate: 0.980
cardinality: 2
entropy: 0.144
entropy_ratio: 0.144
Top values (rank 1–20)
  1. 0 — 103,334
  2. + — 2,150

stress categorical

top value is 98.0% of rows
rows: 105,484
null: 0 (0.0%)
unique: 2
top_value: -
top_rate: 0.980
cardinality: 2
entropy: 0.144
entropy_ratio: 0.144
Top values (rank 1–20)
  1. - — 103,334
  2. 0 — 2,150

syllabic categorical

rows: 105,484
null: 0 (0.0%)
unique: 8
top_value: -
top_rate: 0.685
cardinality: 8
entropy: 1.042
entropy_ratio: 0.347
Top values (rank 1–20)
  1. - — 72,248
  2. + — 30,692
  3. 0 — 2,150
  4. +,- — 244
  5. -,+ — 124
  6. -,+,- — 12
  7. -,+,+ — 12
  8. +,+,- — 2

source categorical

rows: 105,484
null: 0 (0.0%)
unique: 8
top_value: ph
top_rate: 0.344
cardinality: 8
entropy: 2.697
entropy_ratio: 0.899
Top values (rank 1–20)
  1. ph — 36,274
  2. ea — 16,883
  3. upsid — 13,966
  4. er — 9,423
  5. saphon — 9,047
  6. aa — 8,064
  7. spa — 7,566
  8. ra — 4,261