saturn

/home/coolhand/data/linguistic.db | 105,484 rows | sample n=105,484 | seed 42 | 2026-05-01T23:25:14+00:00

Overview

Source: /home/coolhand/data/linguistic.db
Total rows: 105,484
Profiled sample: 105,484
Columns: 6
Generated: 2026-05-01T23:25:14+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset contains 105,484 rows of phoneme records linked to languages by glottocode, drawn from 8 different sources. Each row pairs a language identifier (2,177 unique glottocodes) with a phoneme (3,142 unique values, mostly 1-character IPA-like symbols) and a segment class. The segment_class breakdown is the most informative summary: consonants dominate at 72,282 rows, vowels account for 31,052, and tones only 2,150. Source coverage is uneven — 'ph' alone supplies about 34% of records, while the long tail (ra, spa, aa) is much smaller, which matters if you compare across sources. Glottocode frequency is also skewed: kham1282 and osse1243 each appear hundreds of times, suggesting some languages have far richer phoneme inventories recorded than others.

phoneme_id high anthropic:claude-opus-4-7

This is a sequential row identifier: every one of the 105,484 rows has a unique value, running from 1 to 105,484, with mean and median both at 52,742.5 and skew 0.0. The uniform distribution and the zero null and outlier rates indicate that it carries no analytic signal beyond row ordering.
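
The reported statistics can be cross-checked against the closed-form values for the integers 1..n. The sketch below is only an illustration: the function name `sequential_id_stats` is hypothetical, and whether saturn reports a sample or a population standard deviation is not stated in the report, so both are computed.

```python
import math


def sequential_id_stats(n: int) -> dict:
    """Closed-form statistics for the integers 1..n, each appearing once."""
    mean = (n + 1) / 2  # also the median, by symmetry
    population_std = math.sqrt((n * n - 1) / 12)
    sample_std = math.sqrt(n * (n + 1) / 12)  # ddof=1 variant
    # Excess kurtosis of a discrete uniform distribution on n points.
    excess_kurtosis = -6 * (n * n + 1) / (5 * (n * n - 1))
    return {
        "mean": mean,
        "population_std": population_std,
        "sample_std": sample_std,
        "excess_kurtosis": excess_kurtosis,
    }


stats = sequential_id_stats(105_484)
for name, value in stats.items():
    print(f"{name}: {value:,.3f}")
```

Both standard-deviation variants round to the 30,451 shown in the phoneme_id table, and the excess kurtosis is very close to the reported -1.200, which is what a gap-free sequence would produce.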

glottocode high anthropic:claude-opus-4-7

This column holds Glottolog language codes: nominally fixed 8-character identifiers (len_mean 7.999, len_max 8) with 2,177 unique values across 105,484 rows (the separately reported vocab_size is 2,168). With a 97.94% duplicate rate and top codes like kham1282 (622) and osse1243 (483) repeating heavily, each code labels many rows rather than identifying a single row. The 2-character minimum length (len_min 2) suggests a small number of malformed or truncated entries worth inspecting.
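
One way to inspect the suspected malformed entries is to filter the column against the expected glottocode shape. This is a minimal sketch under stated assumptions: the pattern (four lowercase alphanumerics followed by four digits) is an assumed canonical form, `find_malformed` is a hypothetical helper, and the `"xx"` entry is made up for illustration and is not taken from the dataset.

```python
import re
from typing import Iterable, List

# Assumed canonical shape: four lowercase alphanumerics, then four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def find_malformed(codes: Iterable[str]) -> List[str]:
    """Return each distinct code that does not match the assumed pattern."""
    seen = set()
    malformed = []
    for code in codes:
        if code not in seen and GLOTTOCODE_RE.fullmatch(code) is None:
            malformed.append(code)
        seen.add(code)
    return malformed


# The first five values come from the report's sample; "xx" is invented.
sample = ["lakk1252", "lazz1240", "kham1282", "cebu1242", "west2456", "xx"]
print(find_malformed(sample))
```

Running the same filter over the full glottocode column would list the short entries behind len_min 2, if that reading of the statistic is right.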

phoneme high anthropic:claude-opus-4-7

This column holds phoneme tokens: every row is a single word, averaging 1.5 characters with a maximum length of 11. Across 105,484 rows there are only 3,142 unique values, so 97.0% of rows are duplicates, and single letters like 'm', 'i', 'k', 'j' dominate the top values. The small vocabulary (vocab_size 1,339) and short token lengths suggest these are IPA-like phonetic units rather than full words.
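
The duplicate figures in this report are consistent with counting every row beyond the first occurrence of a value as a duplicate. That definition is an assumption inferred from the numbers, not something saturn documents here, and `duplicate_stats` is a hypothetical helper.

```python
from typing import Tuple


def duplicate_stats(rows: int, unique: int) -> Tuple[int, float]:
    """Count rows beyond the first occurrence of each value, and their share."""
    n_duplicates = rows - unique
    return n_duplicates, n_duplicates / rows


# rows and unique counts as listed in the phoneme and glottocode tables
phoneme_dupes, phoneme_rate = duplicate_stats(105_484, 3_142)
glotto_dupes, glotto_rate = duplicate_stats(105_484, 2_177)

print(f"phoneme:    {phoneme_dupes:,} duplicates ({phoneme_rate:.3f})")
print(f"glottocode: {glotto_dupes:,} duplicates ({glotto_rate:.3f})")
```

Under that assumption the results match the n_duplicates and duplicate_rate rows for both text columns.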

segment_class high anthropic:claude-opus-4-7

A 3-level categorical tag classifying segments as consonant, vowel, or tone, with no nulls across 105,484 rows. The distribution is heavily skewed: consonant dominates at 68.5% (72,282 rows), vowel accounts for 29.4% (31,052), and tone is rare at 2.0% (2,150). The entropy ratio of 0.636 reflects the imbalance.
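
The entropy figures appear to be Shannon entropy in bits, with the ratio normalised by log2 of the number of categories. That reading is an assumption, although it reproduces the reported values to three decimals. The function names below are hypothetical.

```python
import math
from typing import Sequence


def entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a categorical distribution given counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: Sequence[int]) -> float:
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    if k < 2:
        return 0.0  # a single category has no uncertainty to normalise
    return entropy_bits(counts) / math.log2(k)


# Counts from the segment_class table: consonant, vowel, tone.
segment_counts = [72_282, 31_052, 2_150]
print(f"entropy:       {entropy_bits(segment_counts):.3f}")
print(f"entropy_ratio: {entropy_ratio(segment_counts):.3f}")
```

A ratio of 1.0 would mean all categories are equally frequent. Applied to the eight source counts in this report, the same functions give values close to the reported 2.697 and 0.899.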

source high anthropic:claude-opus-4-7

Categorical provenance tag with 8 distinct codes (ph, ea, upsid, er, saphon, aa, spa, ra) marking which source each of the 105,484 rows came from. Distribution is fairly balanced for a source field — entropy ratio 0.899 and the top code 'ph' covers only 34.4% — suggesting the dataset is a merge of multiple comparably-sized corpora rather than one dominant source with minor supplements. No nulls, so every row is attributable.

created_at high anthropic:claude-opus-4-7

Despite its name, created_at holds only 2 distinct timestamp values across 105,484 rows, both within one second of each other on 2026-01-06. This looks like a batch ingestion or load timestamp rather than per-record creation time, with 71.6% of rows sharing the dominant value. There is no temporal variation to exploit as a feature.

phoneme_id numeric

rows: 105,484
null: 0 (0.0%)
unique: 105,484
min: 1.000
max: 105,484
mean: 52,742
median: 52,742
std: 30,451
q1: 26,372
q3: 79,113
iqr: 52,742
skew: 0.000
kurtosis: -1.200
n_outliers: 0
outlier_rate: 0.000
zero_rate: 0.000

glottocode text

100.0% of rows are a single word | 95th-percentile length under 20 chars | 97.9% duplicate strings
rows: 105,484
null: 0 (0.0%)
unique: 2,177
len_min: 2
len_max: 8
len_mean: 7.999
len_median: 8.000
len_p95: 8.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 103,307
duplicate_rate: 0.979
vocab_size: 2,168
readability_flesch_mean: 94.148
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. lakk1252
  2. lazz1240
  3. kham1282
  4. cebu1242
  5. west2456
  6. gurd1238
  7. copi1238
  8. kham1282
  9. gata1239
  10. paez1247

phoneme text

100.0% of rows are a single word | 95th-percentile length under 20 chars | 97.0% duplicate strings
rows: 105,484
null: 0 (0.0%)
unique: 3,142
len_min: 1
len_max: 11
len_mean: 1.501
len_median: 1.000
len_p95: 3.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 102,342
duplicate_rate: 0.970
vocab_size: 1,339
readability_flesch_mean: 114.439
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.75e-03
boilerplate_rate: 0.000
Sample values (first 10)
  1. χ
  2. u
  3. ndz
  4. r
  5. ˦
  6. ɨ
  7. d̤ɮ̤
  8. u

segment_class categorical

rows: 105,484
null: 0 (0.0%)
unique: 3
top_value: consonant
top_rate: 0.685
cardinality: 3
entropy: 1.008
entropy_ratio: 0.636
Top values (rank 1–20)
  1. consonant — 72,282
  2. vowel — 31,052
  3. tone — 2,150

source categorical

rows: 105,484
null: 0 (0.0%)
unique: 8
top_value: ph
top_rate: 0.344
cardinality: 8
entropy: 2.697
entropy_ratio: 0.899
Top values (rank 1–20)
  1. ph — 36,274
  2. ea — 16,883
  3. upsid — 13,966
  4. er — 9,423
  5. saphon — 9,047
  6. aa — 8,064
  7. spa — 7,566
  8. ra — 4,261

created_at categorical

rows: 105,484
null: 0 (0.0%)
unique: 2
top_value: 2026-01-06 05:13:20
top_rate: 0.716
cardinality: 2
entropy: 0.861
entropy_ratio: 0.861
Top values (rank 1–20)
  1. 2026-01-06 05:13:20 — 75,484
  2. 2026-01-06 05:13:19 — 30,000