saturn·

linguistic

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/data/linguistic.db

Saturn profiled 105,484 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
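# Install the profiler (IPython shell escape; run this cell inside a notebook).
!pip install saturn-dissect
import subprocess

# Profile the SQLite database with the saturn CLI, using the flags recorded for this run.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/data/linguistic.db",
    "--findings", "linguistic.json",
    "--llm", "anthropic:claude-opus-4-7",
])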

Summary confidence: high

This dataset contains 105,484 phoneme records linked to languages by glottocode and drawn from 8 sources. Each row pairs a language identifier (2,177 unique glottocodes) with a phoneme (3,142 unique values, mostly 1-character IPA-like symbols) and a segment class. The segment_class breakdown is the most informative summary: consonants dominate at 72,282 rows (68.5%), vowels account for 31,052 (29.4%), and tones for only 2,150 (2.0%). Source coverage is uneven: 'ph' alone supplies about 34% of records, while the smallest sources (aa, spa, ra) contribute between 4% and 8% each, which matters when comparing across sources. Glottocode frequency is also skewed: kham1282 (622 rows) and osse1243 (483 rows) each appear hundreds of times, suggesting that some languages have far richer recorded phoneme inventories than others.

citing: row_count · column_count · segment_class.top_values · source.top_values · glottocode.n_unique · glottocode.top_values · phoneme.n_unique · phoneme.top_values · phoneme.stats.len_mean
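
The segment_class breakdown quoted in the summary could be re-derived directly from the SQLite file with a grouped count. The following is a minimal sketch, not part of saturn: the table name `phonemes` is an assumption (the report does not name the table), and the rows inserted into the in-memory database are illustrative stand-ins rather than records from linguistic.db.

```python
import sqlite3

# Hypothetical table name; the report does not say what the table is called.
TABLE = "phonemes"

def class_counts(conn, table=TABLE):
    """Return (segment_class, count, share) tuples, largest class first."""
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    rows = conn.execute(
        f"SELECT segment_class, COUNT(*) AS n FROM {table} "
        "GROUP BY segment_class ORDER BY n DESC"
    ).fetchall()
    return [(name, n, n / total) for name, n in rows]

# Tiny in-memory stand-in so the sketch runs without the real database.
# These pairings are made up for illustration only.
demo = sqlite3.connect(":memory:")
demo.execute(f"CREATE TABLE {TABLE} (glottocode TEXT, phoneme TEXT, segment_class TEXT)")
demo.executemany(
    f"INSERT INTO {TABLE} VALUES (?, ?, ?)",
    [
        ("kham1282", "m", "consonant"),
        ("kham1282", "k", "consonant"),
        ("kham1282", "i", "vowel"),
        ("osse1243", "a", "vowel"),
        ("osse1243", "j", "consonant"),
    ],
)

result = class_counts(demo)
for name, n, share in result:
    print(f"{name:<10} {n:>3} {share:6.1%}")
```

To run it against the real file, replace the in-memory connection with `sqlite3.connect("/home/coolhand/data/linguistic.db")` and set `TABLE` to the actual table name.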

Out[4]:

saturn.schema() · 6 columns

column kind n null% unique alerts
phoneme_id numeric 105,484 0.0% 105,484
glottocode text 105,484 0.0% 2,177 one_word short_text duplicates
phoneme text 105,484 0.0% 3,142 one_word short_text duplicates
segment_class categorical 105,484 0.0% 3
source categorical 105,484 0.0% 8
created_at categorical 105,484 0.0% 2
Fig 1.
segment_class · Shows that consonants make up the bulk of records, with tones a small minority.
Top values for segment_class (3 unique shown, of 3 total).
value count share
consonant 72,282 68.5%
vowel 31,052 29.4%
tone 2,150 2.0%
Fig 2.
source · Reveals heavy reliance on the 'ph' source (~34%) versus a long tail of smaller contributors.
Top values for source (8 unique shown, of 8 total).
value count share
ph 36,274 34.4%
ea 16,883 16.0%
upsid 13,966 13.2%
er 9,423 8.9%
saphon 9,047 8.6%
aa 8,064 7.6%
spa 7,566 7.2%
ra 4,261 4.0%
Fig 3.
phoneme · Highlights the most common phoneme symbols (m, i, k, j, u, a) across the dataset.
Fig 4.
glottocode · Top languages like kham1282 and osse1243 dominate, indicating uneven per-language coverage.
Character-length distribution for glottocode (mean: 7.999).
chars count
2 19
3 to 7 0
8 105,465
Fig 5.
phoneme · Most phoneme strings are 1 character; check the tail up to 11 characters for compound symbols.
Character-length distribution for phoneme (mean: 1.501).
chars count
1 67,114
2 28,726
3 6,559
4 2,225
5 401
6 267
7 104
8 5
9 70
10 1
11 12
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column kind null %
phoneme_id numeric 0.0%
glottocode text 0.0%
phoneme text 0.0%
segment_class categorical 0.0%
source categorical 0.0%
created_at categorical 0.0%

phoneme_id numeric identifier

This is a sequential row identifier: each of the 105,484 rows has a unique value, running from 1 to 105,484, with mean and median both at 52,742.5 and zero skew. The uniform distribution (kurtosis -1.2) and the absence of nulls and outliers indicate that it carries no analytic signal beyond row ordering.

Treatment: drop from modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high
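
The reported statistics are what one would expect from the integers 1 to n with no gaps. As a rough check, and only as a sketch, the closed forms for a run of consecutive integers can be compared with the stats table; the helper `is_sequential` is a hypothetical name introduced here, not a saturn function.

```python
import math

n = 105_484  # row count reported for phoneme_id

# Closed-form statistics for the integers 1..n.
mean = (n + 1) / 2                            # also the median
std_population = math.sqrt((n * n - 1) / 12)  # divisor n
std_sample = math.sqrt(n * (n + 1) / 12)      # divisor n - 1

print(mean)
print(f"{std_population:.4g}", f"{std_sample:.4g}")

def is_sequential(ids):
    """True when ids are exactly the integers 1..len(ids), in any order."""
    ids = list(ids)
    return (
        bool(ids)
        and len(set(ids)) == len(ids)
        and min(ids) == 1
        and max(ids) == len(ids)
    )
```

Both standard-deviation variants round to the same four significant figures at this n, so the check does not depend on which divisor the profiler uses.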
Out[12]:

saturn.columns["phoneme_id"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 105,484
min 1
max 105,484
mean 5.274e+04
median 5.274e+04
std 3.045e+04
q1 2.637e+04
q3 7.911e+04
iqr 5.274e+04
skew 0
kurtosis -1.2
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 7.
Distribution of phoneme_id. Vertical dash marks the median.
Histogram bins for phoneme_id (median: 52742.5).
bin                       count
1 – 2638                  2,638
2638 – 5275               2,637
5275 – 7912               2,637
7912 – 1.055e+04          2,637
1.055e+04 – 1.319e+04     2,637
1.319e+04 – 1.582e+04     2,637
1.582e+04 – 1.846e+04     2,637
1.846e+04 – 2.11e+04      2,637
2.11e+04 – 2.373e+04      2,637
2.373e+04 – 2.637e+04     2,637
2.637e+04 – 2.901e+04     2,637
2.901e+04 – 3.165e+04     2,637
3.165e+04 – 3.428e+04     2,637
3.428e+04 – 3.692e+04     2,638
3.692e+04 – 3.956e+04     2,637
3.956e+04 – 4.219e+04     2,637
4.219e+04 – 4.483e+04     2,637
4.483e+04 – 4.747e+04     2,637
4.747e+04 – 5.011e+04     2,637
5.011e+04 – 5.274e+04     2,637
5.274e+04 – 5.538e+04     2,637
5.538e+04 – 5.802e+04     2,637
5.802e+04 – 6.065e+04     2,637
6.065e+04 – 6.329e+04     2,637
6.329e+04 – 6.593e+04     2,637
6.593e+04 – 6.856e+04     2,637
6.856e+04 – 7.12e+04      2,638
7.12e+04 – 7.384e+04      2,637
7.384e+04 – 7.648e+04     2,637
7.648e+04 – 7.911e+04     2,637
7.911e+04 – 8.175e+04     2,637
8.175e+04 – 8.439e+04     2,637
8.439e+04 – 8.702e+04     2,637
8.702e+04 – 8.966e+04     2,637
8.966e+04 – 9.23e+04      2,637
9.23e+04 – 9.494e+04      2,637
9.494e+04 – 9.757e+04     2,637
9.757e+04 – 1.002e+05     2,637
1.002e+05 – 1.028e+05     2,637
1.028e+05 – 1.055e+05     2,638

glottocode text foreign_key

This column holds Glottolog language codes: 8-character identifiers in all but 19 rows (len_mean 7.999, len_max 8), with 2,177 distinct values across 105,484 rows (the reported vocab_size is 2,168). With a 97.94% duplicate rate and top codes like kham1282 (622) and osse1243 (483) repeating heavily, each code labels many rows rather than identifying them. The 2-character minimum length, which the length distribution traces to 19 rows, suggests malformed or truncated entries worth inspecting.

Treatment: Treat as a categorical join key to Glottolog metadata; verify the short (len=2) entries.

anthropic:claude-opus-4-7 · confidence high
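
One way to carry out the suggested inspection of short entries is to test each value against the usual Glottocode shape. This is a sketch under an assumption: that a well-formed Glottocode is four lowercase alphanumeric characters followed by four digits, as in kham1282. The function name and the sample list are illustrative; in particular, "xx" is a made-up stand-in for a 2-character entry, since the report does not show the actual short values.

```python
import re

# Assumed shape of a Glottocode: four lowercase alphanumerics, then four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def split_by_validity(codes):
    """Partition codes into (well-formed, suspect) lists, preserving order."""
    valid, suspect = [], []
    for code in codes:
        if GLOTTOCODE_RE.fullmatch(code):
            valid.append(code)
        else:
            suspect.append(code)
    return valid, suspect

# Two codes named in the report plus one hypothetical malformed value.
sample = ["kham1282", "osse1243", "xx"]
valid, suspect = split_by_validity(sample)
print("valid:", valid)
print("suspect:", suspect)
```

Rows landing in the suspect list would be the ones to check against Glottolog before joining on this column.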
Out[15]:

saturn.columns["glottocode"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 2,177
len_min 2
len_max 8
len_mean 7.999
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 103,307
duplicate_rate 0.9794
vocab_size 2,168
readability_flesch_mean 94.15
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% of rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.9% duplicate strings
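
The duplicate figures in this table are consistent with a simple relation: rows minus distinct values, divided by rows. The sketch below reproduces the reported numbers under that assumption; whether saturn computes them exactly this way is not stated in the report, and `duplicate_stats` is a hypothetical helper name.

```python
def duplicate_stats(n_rows, n_unique):
    """Return (n_duplicates, duplicate_rate), assuming duplicates = rows - distinct values."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, n_duplicates / n_rows

# Row and unique counts taken from the stats tables in this report.
glotto = duplicate_stats(105_484, 2_177)
phoneme = duplicate_stats(105_484, 3_142)

print(glotto[0], round(glotto[1], 4))
print(phoneme[0], round(phoneme[1], 4))
```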
Fig 8.
Character-length distribution for glottocode.
Character-length distribution for glottocode (mean: 7.999).
chars count
2 19
3 to 7 0
8 105,465

phoneme text feature

This column holds phoneme tokens: one-word strings averaging 1.5 characters, with a maximum length of 11. Across 105,484 rows there are only 3,142 unique values, so 97.0% of rows are duplicates, and single letters such as 'm', 'i', 'k', and 'j' dominate the top values. The small vocabulary (vocab_size 1,339) and tiny token sizes suggest that these are IPA-like phonetic units rather than full words.

Treatment: Treat as a categorical token and label-encode or embed before modelling.

anthropic:claude-opus-4-7 · confidence high
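
The label-encoding treatment can be sketched with the standard library alone. This is one possible encoding, not a prescribed one: it assigns integer ids by descending frequency, with ties broken by the token itself so the mapping is reproducible. The token list is illustrative and is not drawn from the dataset.

```python
from collections import Counter

def build_label_encoder(tokens):
    """Map each distinct token to an integer id, most frequent token first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then by token, so the mapping is deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}

# Illustrative tokens only; multi-character symbols are treated as single tokens.
tokens = ["m", "i", "k", "m", "tʃ", "i", "m"]
encoder = build_label_encoder(tokens)
encoded = [encoder[tok] for tok in tokens]

print(encoder)
print(encoded)
```

With 3,142 distinct values, integer ids like these would typically feed an embedding layer; one-hot encoding at that cardinality produces a very wide, sparse matrix.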
Out[18]:

saturn.columns["phoneme"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3,142
len_min 1
len_max 11
len_mean 1.501
len_median 1
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 102,342
duplicate_rate 0.9702
vocab_size 1,339
readability_flesch_mean 114.4
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.001754
boilerplate_rate 0
alert: one_word · 100.0% of rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.0% duplicate strings
Fig 9.
Character-length distribution for phoneme.
Character-length distribution for phoneme (mean: 1.501).
chars count
1 67,114
2 28,726
3 6,559
4 2,225
5 401
6 267
7 104
8 5
9 70
10 1
11 12

segment_class categorical feature

A 3-level categorical tag classifying segments as consonant, vowel, or tone, with no nulls across 105,484 rows. The distribution is heavily skewed: consonant dominates at 68.5%, vowel takes most of the rest (29.4%), and tone is rare at only 2,150 occurrences (2.0%). The entropy ratio of 0.64 (entropy 1.008 bits against a maximum of log2 3 ≈ 1.585) confirms the imbalance.

Treatment: One-hot encode; consider class-imbalance handling if predicting the rare 'tone' class.

anthropic:claude-opus-4-7 · confidence high
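
The entropy figures can be recomputed from the category counts. The sketch below assumes Shannon entropy in bits and a ratio defined as entropy divided by log2 of the number of categories; the report does not state the formula, but these definitions agree with the reported values to about three decimal places. The function names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Counts for consonant, vowel, tone as listed in this report.
segment_class = [72_282, 31_052, 2_150]

h = entropy_bits(segment_class)
ratio = entropy_ratio(segment_class)
print(f"entropy = {h:.3f} bits, ratio = {ratio:.4f}")
```

A ratio of 1.0 would mean three equally frequent classes; values well below 1.0, as here, indicate that one class carries most of the mass.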
Out[21]:

saturn.columns["segment_class"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3
top_value consonant
top_rate 0.6852
cardinality 3
entropy 1.008
entropy_ratio 0.6357
Fig 10.
Top values for segment_class.
Top values for segment_class (3 unique shown, of 3 total).
value count share
consonant 72,282 68.5%
vowel 31,052 29.4%
tone 2,150 2.0%

source categorical metadata

Categorical provenance tag with 8 distinct codes (ph, ea, upsid, er, saphon, aa, spa, ra) marking which source each of the 105,484 rows came from. The distribution is moderately balanced for a source field: the entropy ratio is 0.899 and the top code 'ph' covers 34.4%, so no single source holds a majority. The sources are not equal in size, however; they range from 4,261 rows (ra) to 36,274 (ph), so the dataset reads as a merge of several corpora led by one larger contributor. No nulls, so every row is attributable.

Treatment: Keep as a categorical grouping/stratification key; one-hot encode if used as a feature.

anthropic:claude-opus-4-7 · confidence high
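
Using source as a stratification key might look like the following sketch, which draws the same fraction from every source so that small contributors such as ra are not lost in a sample. The function is a hypothetical helper, and the rows are synthetic: the source codes come from the report, but the group sizes are scaled-down stand-ins.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        k = max(1, round(len(members) * fraction))  # keep at least one row per group
        sample.extend(rng.sample(members, k))
    return sample

# Synthetic rows: 340 from 'ph' and 40 from 'ra'.
rows = [{"source": "ph", "id": i} for i in range(340)]
rows += [{"source": "ra", "id": i} for i in range(40)]

sample = stratified_sample(rows, key=lambda r: r["source"], fraction=0.1)
per_source = {s: sum(1 for r in sample if r["source"] == s) for s in ("ph", "ra")}
print(per_source)
```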
Out[24]:

saturn.columns["source"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 8
top_value ph
top_rate 0.3439
cardinality 8
entropy 2.697
entropy_ratio 0.8991
Fig 11.
Top values for source.
Top values for source (8 unique shown, of 8 total).
value count share
ph 36,274 34.4%
ea 16,883 16.0%
upsid 13,966 13.2%
er 9,423 8.9%
saphon 9,047 8.6%
aa 8,064 7.6%
spa 7,566 7.2%
ra 4,261 4.0%

created_at categorical metadata

Despite its name, created_at holds only 2 distinct timestamp values across 105,484 rows, both within one second of each other on 2026-01-06. This looks like a batch ingestion or load timestamp rather than per-record creation time, with 71.6% of rows sharing the dominant value. There is no temporal variation to exploit as a feature.

Treatment: Drop; no temporal signal beyond a batch-load marker.

anthropic:claude-opus-4-7 · confidence high
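
The one-second spread and the 71.6% share can be confirmed from the two values listed for this column. This is a minimal sketch that assumes the timestamps follow the `YYYY-MM-DD HH:MM:SS` layout shown in the report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed layout, matching the values as printed

# The two distinct created_at values and their row counts from this report.
values = {
    "2026-01-06 05:13:20": 75_484,
    "2026-01-06 05:13:19": 30_000,
}

stamps = sorted(datetime.strptime(v, FMT) for v in values)
spread_seconds = (stamps[-1] - stamps[0]).total_seconds()
total = sum(values.values())
top_share = max(values.values()) / total

print(spread_seconds, total, round(top_share, 4))
```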
Out[27]:

saturn.columns["created_at"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 2
top_value 2026-01-06 05:13:20
top_rate 0.7156
cardinality 2
entropy 0.8614
entropy_ratio 0.8614
Fig 12.
Top values for created_at.
Top values for created_at (2 unique shown, of 2 total).
value count share
2026-01-06 05:13:20 75,484 71.6%
2026-01-06 05:13:19 30,000 28.4%

How to cite


BibTeX
@misc{saturn-linguistic-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: linguistic},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/linguistic}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: linguistic. Source: /home/coolhand/data/linguistic.db. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/linguistic