linguistic

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/data/linguistic.db

Saturn profiled 105,484 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

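# Install the profiler (IPython shell escape; run inside a notebook cell).
!pip install saturn-dissect

import subprocess

# Profile the database, write findings to JSON, and request the LLM prose.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/data/linguistic.db",
        "--findings", "linguistic.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits non-zero
)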

Summary confidence: high

This dataset contains 105,484 rows of phoneme records linked to languages by glottocode, drawn from 8 sources. Each row pairs a language identifier (2,177 unique glottocodes) with a phoneme (3,142 unique values; about 64% of rows are single-character IPA-like symbols) and a segment class. The segment_class breakdown is the most informative summary: consonants account for 72,282 rows, vowels for 31,052, and tones for only 2,150. Source coverage is uneven: 'ph' alone supplies about 34% of records, while the smallest sources (aa, spa, ra) contribute between 4% and 8% each, which matters when comparing across sources. Glottocode frequency is also skewed: kham1282 (622 rows) and osse1243 (483 rows) appear far more often than the average of about 48 rows per code, which suggests that some languages have larger recorded inventories or are covered by several sources.

citing: row_count · column_count · segment_class.top_values · source.top_values · glottocode.n_unique · glottocode.top_values · phoneme.n_unique · phoneme.top_values · phoneme.stats.len_mean

Out[4]:

saturn.schema() · 6 columns

column	kind	n	null%	unique	alerts
phoneme_id	numeric	105,484	0.0%	105,484
glottocode	text	105,484	0.0%	2,177	one_word short_text duplicates
phoneme	text	105,484	0.0%	3,142	one_word short_text duplicates
segment_class	categorical	105,484	0.0%	3
source	categorical	105,484	0.0%	8
created_at	categorical	105,484	0.0%	2

Fig 1.

segment_class · Shows that consonants make up the bulk of records, with tones a small minority.

Top values for segment_class (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

Fig 2.

source · Reveals heavy reliance on the 'ph' source (~34%) versus a long tail of smaller contributors.

Top values for source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%

Fig 3.

phoneme · Highlights the most common phoneme symbols (m, i, k, j, u, a) across the dataset.

Character-length distribution for phoneme (mean: 1.5006067270865724).
chars	count
1	67114
2	28726
3	6559
4	2225
5	401
6	267
7	104
8	5
9	70
10	1
11	12
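
As a quick consistency check, the reported mean length can be recomputed from the non-zero counts in the table above. This is a minimal sketch; the variable names are illustrative and are not part of saturn's API.

```python
# Length -> row count, copied from the character-length table for phoneme.
length_counts = {
    1: 67114, 2: 28726, 3: 6559, 4: 2225, 5: 401, 6: 267,
    7: 104, 8: 5, 9: 70, 10: 1, 11: 12,
}

total_rows = sum(length_counts.values())
total_chars = sum(length * count for length, count in length_counts.items())

# Weighted mean of the lengths, and the share of single-character rows.
mean_length = total_chars / total_rows
share_single = length_counts[1] / total_rows

print(total_rows)               # 105484
print(round(mean_length, 4))    # 1.5006
print(round(share_single, 3))   # 0.636
```

The counts sum to the full row count, so the table accounts for every row, and roughly 64% of rows are a single character.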

Fig 4.

glottocode · Top languages like kham1282 and osse1243 dominate, indicating uneven per-language coverage.

Character-length distribution for glottocode (mean: 7.998919267377043).
chars	count
2	19
3 – 7	0
8	105465

Fig 5.

phoneme · Most phoneme strings are 1 character; check the tail up to 11 characters for compound symbols.

Character-length distribution for phoneme (mean: 1.5006067270865724).
chars	count
1	67114
2	28726
3	6559
4	2225
5	401
6	267
7	104
8	5
9	70
10	1
11	12

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
phoneme_id	numeric	0.0%
glottocode	text	0.0%
phoneme	text	0.0%
segment_class	categorical	0.0%
source	categorical	0.0%
created_at	categorical	0.0%

phoneme_id numeric identifier

This is a sequential row identifier: every one of the 105,484 rows has a unique value, running from 1 to 105,484, with mean and median both at 52,742.5 and skew 0. The perfectly uniform distribution and zero null and outlier rates confirm it carries no analytic signal beyond row ordering.

Treatment: drop from modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high
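
The claim that phoneme_id is a plain 1..n sequence can be checked against closed-form values for consecutive integers. The sketch below assumes only the reported row count; `is_sequential` is a hypothetical helper for checking the real column, not a saturn function.

```python
import math

n = 105_484  # row count reported for phoneme_id

# For the integers 1..n, the mean and the median are both (n + 1) / 2.
expected_mean = (n + 1) / 2

# The sample standard deviation (ddof=1) of 1..n is sqrt(n * (n + 1) / 12).
expected_std = math.sqrt(n * (n + 1) / 12)


def is_sequential(ids):
    """Return True if ids are exactly the integers 1..len(ids), in any order."""
    return sorted(ids) == list(range(1, len(ids) + 1))


print(expected_mean)                   # 52742.5
print(is_sequential([3, 1, 2]))        # True
print(is_sequential([1, 2, 4]))        # False
```

The expected mean equals the reported 52742.5, and the expected standard deviation is about 30,451, in line with the reported 3.045e+04.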

Out[12]:

saturn.columns["phoneme_id"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	105,484
min	1
max	105,484
mean	5.274e+04
median	5.274e+04
std	3.045e+04
q1	2.637e+04
q3	7.911e+04
iqr	5.274e+04
skew	0
kurtosis	-1.2
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 7.

Distribution of phoneme_id. Vertical dash marks the median.

Histogram bins for phoneme_id (median: 52742.5).
bin	count
1 – 2638	2638
2638 – 5275	2637
5275 – 7912	2637
7912 – 1.055e+04	2637
1.055e+04 – 1.319e+04	2637
1.319e+04 – 1.582e+04	2637
1.582e+04 – 1.846e+04	2637
1.846e+04 – 2.11e+04	2637
2.11e+04 – 2.373e+04	2637
2.373e+04 – 2.637e+04	2637
2.637e+04 – 2.901e+04	2637
2.901e+04 – 3.165e+04	2637
3.165e+04 – 3.428e+04	2637
3.428e+04 – 3.692e+04	2638
3.692e+04 – 3.956e+04	2637
3.956e+04 – 4.219e+04	2637
4.219e+04 – 4.483e+04	2637
4.483e+04 – 4.747e+04	2637
4.747e+04 – 5.011e+04	2637
5.011e+04 – 5.274e+04	2637
5.274e+04 – 5.538e+04	2637
5.538e+04 – 5.802e+04	2637
5.802e+04 – 6.065e+04	2637
6.065e+04 – 6.329e+04	2637
6.329e+04 – 6.593e+04	2637
6.593e+04 – 6.856e+04	2637
6.856e+04 – 7.12e+04	2638
7.12e+04 – 7.384e+04	2637
7.384e+04 – 7.648e+04	2637
7.648e+04 – 7.911e+04	2637
7.911e+04 – 8.175e+04	2637
8.175e+04 – 8.439e+04	2637
8.439e+04 – 8.702e+04	2637
8.702e+04 – 8.966e+04	2637
8.966e+04 – 9.23e+04	2637
9.23e+04 – 9.494e+04	2637
9.494e+04 – 9.757e+04	2637
9.757e+04 – 1.002e+05	2637
1.002e+05 – 1.028e+05	2637
1.028e+05 – 1.055e+05	2638

glottocode text foreign_key

This column holds Glottolog language codes: 8-character identifiers in all but 19 rows (len_mean 7.999, len_max 8), with 2,177 distinct values across 105,484 rows. The stats table also reports a vocab_size of 2,168, which is a different statistic from the unique count. With a 97.94% duplicate rate and top codes like kham1282 (622) and osse1243 (483) repeating heavily, each code labels many rows rather than identifying one. The 19 rows at the 2-character minimum length are likely malformed or truncated entries and are worth inspecting.

Treatment: Treat as a categorical join key to Glottolog metadata; verify the short (len=2) entries.

anthropic:claude-opus-4-7 · confidence high
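
One way to follow up on the short entries is to flag values that do not match the usual Glottocode shape. The sketch below assumes the common convention of four lowercase alphanumeric characters followed by four digits; the sample list is hypothetical, and "xx" is an invented stand-in for a short entry, not a value taken from the data.

```python
import re

# Assumed Glottocode shape: four lowercase alphanumerics, then four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def malformed_codes(codes):
    """Return the distinct values that do not look like a Glottocode."""
    return sorted({c for c in codes if not GLOTTOCODE_RE.fullmatch(c)})


# Hypothetical sample: two codes named in the report plus an invented short entry.
sample = ["kham1282", "osse1243", "xx", "kham1282"]
print(malformed_codes(sample))  # ['xx']
```

Running the same check over the real column would list the distinct offending values, which is usually enough to decide whether they can be repaired or should be dropped before joining to Glottolog metadata.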

Out[15]:

saturn.columns["glottocode"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2,177
len_min	2
len_max	8
len_mean	7.999
len_median	8
len_p95	8
word_mean	1
word_median	1
n_empty	0
n_duplicates	103,307
duplicate_rate	0.9794
vocab_size	2,168
readability_flesch_mean	94.15
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.9% duplicate strings
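
The duplicate figures above are consistent with counting every row beyond the first occurrence of each value. The sketch below reproduces them under that assumption; `duplicate_stats` is an illustrative helper, and saturn's own definition may differ.

```python
def duplicate_stats(n_rows, n_unique):
    """Return (rows beyond each value's first occurrence, their share of all rows)."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, round(n_duplicates / n_rows, 4)


print(duplicate_stats(105_484, 2_177))  # (103307, 0.9794)  glottocode
print(duplicate_stats(105_484, 3_142))  # (102342, 0.9702)  phoneme
```

Under this definition, a high duplicate rate simply reflects low cardinality relative to the row count, which is expected for a key that labels many rows.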

Fig 8.

Character-length distribution for glottocode.

Character-length distribution for glottocode (mean: 7.998919267377043).
chars	count
2	19
3 – 7	0
8	105465

phoneme text feature

This column holds phoneme tokens: single-word strings averaging 1.5 characters, with a maximum length of 11. Across 105,484 rows there are only 3,142 unique values, so 97.0% of rows repeat an earlier value, and single letters such as 'm', 'i', 'k', and 'j' dominate the top values. The short strings and low cardinality (3,142 unique values; vocab_size 1,339) are consistent with IPA-like phonetic units rather than full words.

Treatment: Treat as a categorical token and label-encode or embed before modelling.

anthropic:claude-opus-4-7 · confidence high
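
The label-encoding treatment suggested above can be sketched with the standard library alone. The helper name and the mini-sample are hypothetical; a real pipeline would fit the encoder on the full column and decide how to handle unseen tokens.

```python
from collections import Counter


def build_label_encoder(tokens):
    """Map each distinct token to an integer id, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, then by token, so ties are deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {token: idx for idx, token in enumerate(ordered)}


# Hypothetical mini-sample; the real column has 3,142 distinct values.
phonemes = ["m", "i", "k", "m", "tʃ", "i", "m"]
encoder = build_label_encoder(phonemes)
encoded = [encoder[p] for p in phonemes]

print(encoded)  # [0, 1, 2, 0, 3, 1, 0]
```

Ordering ids by frequency keeps the most common symbols at small integers, which is convenient if rare tokens are later bucketed together.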

Out[18]:

saturn.columns["phoneme"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3,142
len_min	1
len_max	11
len_mean	1.501
len_median	1
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	102,342
duplicate_rate	0.9702
vocab_size	1,339
readability_flesch_mean	114.4
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.001754
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.0% duplicate strings

Fig 9.

Character-length distribution for phoneme.

Character-length distribution for phoneme (mean: 1.5006067270865724).
chars	count
1	67114
2	28726
3	6559
4	2225
5	401
6	267
7	104
8	5
9	70
10	1
11	12

segment_class categorical feature

A 3-level categorical tag classifying segments as consonant, vowel, or tone, with no nulls across 105,484 rows. The distribution is heavily skewed: consonant dominates at 68.5%, vowel accounts for 29.4%, and tone is rare at only 2,150 occurrences (2.0%). An entropy ratio of 0.64 confirms the imbalance.

Treatment: One-hot encode; consider class-imbalance handling if predicting the rare 'tone' class.

anthropic:claude-opus-4-7 · confidence high
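
If the rare 'tone' class is a prediction target, one common form of imbalance handling is inverse-frequency class weighting. The sketch below uses the n / (k * count) heuristic (the same formula scikit-learn applies for class_weight="balanced") with the counts reported for segment_class; it is one option under that assumption, not a recommendation made by saturn.

```python
# Row counts per class, as reported for segment_class.
counts = {"consonant": 72282, "vowel": 31052, "tone": 2150}

n_rows = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: a perfectly balanced class would get weight 1.0.
weights = {label: n_rows / (n_classes * c) for label, c in counts.items()}

for label, weight in weights.items():
    print(label, round(weight, 2))
# consonant 0.49
# vowel 1.13
# tone 16.35
```

With these weights each class contributes equally to a weighted loss, so a 'tone' row counts roughly 34 times as much as a 'consonant' row.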

Out[21]:

saturn.columns["segment_class"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	3
top_value	consonant
top_rate	0.6852
cardinality	3
entropy	1.008
entropy_ratio	0.6357
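
The entropy and entropy_ratio values above can be reproduced from the top-value counts, assuming Shannon entropy in bits normalised by log2 of the number of categories. The function names are illustrative and are not saturn's API.

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# consonant, vowel, tone
segment_class = [72282, 31052, 2150]
print(round(entropy_bits(segment_class), 3))   # 1.008
print(round(entropy_ratio(segment_class), 4))  # 0.6357
```

A ratio of 1.0 would mean all three classes are equally frequent; 0.64 reflects how much of the mass sits in the consonant class.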

Fig 10.

Top values for segment_class.

Top values for segment_class (3 unique shown, of 3 total).
value	count	share
consonant	72282	68.5%
vowel	31052	29.4%
tone	2150	2.0%

source categorical metadata

Categorical provenance tag with 8 distinct codes (ph, ea, upsid, er, saphon, aa, spa, ra) marking which source each of the 105,484 rows came from. No single source holds a majority: the entropy ratio is 0.899 and the top code 'ph' covers 34.4%. Sizes still differ by a factor of about 8.5, from 4,261 rows (ra) to 36,274 (ph), so the dataset reads as a merge of several corpora of unequal size rather than one dominant source with minor supplements. No nulls, so every row is attributable.

Treatment: Keep as a categorical grouping/stratification key; one-hot encode if used as a feature.

anthropic:claude-opus-4-7 · confidence high
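
Using source as a stratification key can be sketched as sampling the same fraction from each source, so that small sources are not lost in a random subset. The helper and the toy rows below are hypothetical; only the idea of grouping by source comes from the report.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction of rows from every group defined by key(row)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Keep at least one row per group so no source disappears entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical rows: (source, phoneme) pairs in a 9:1 ratio.
rows = [("ph", "m")] * 90 + [("ra", "k")] * 10
picked = stratified_sample(rows, key=lambda r: r[0], fraction=0.2)
print(len(picked))  # 20
```

The sample preserves the 9:1 ratio between the two toy sources, which is the property that matters when per-source comparisons are made on a subset.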

Out[24]:

saturn.columns["source"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	8
top_value	ph
top_rate	0.3439
cardinality	8
entropy	2.697
entropy_ratio	0.8991

Fig 11.

Top values for source.

Top values for source (8 unique shown, of 8 total).
value	count	share
ph	36274	34.4%
ea	16883	16.0%
upsid	13966	13.2%
er	9423	8.9%
saphon	9047	8.6%
aa	8064	7.6%
spa	7566	7.2%
ra	4261	4.0%

created_at categorical metadata

Despite its name, created_at holds only 2 distinct timestamp values across 105,484 rows, one second apart on 2026-01-06 (05:13:19 and 05:13:20). This looks like a batch ingestion or load timestamp rather than a per-record creation time; 71.6% of rows share the later value. There is no temporal variation to exploit as a feature.

Treatment: Drop; no temporal signal beyond a batch-load marker.

anthropic:claude-opus-4-7 · confidence high

Out[27]:

saturn.columns["created_at"].stats

stat	value
n	105,484
nulls	0 (0.0%)
unique	2
top_value	2026-01-06 05:13:20
top_rate	0.7156
cardinality	2
entropy	0.8614
entropy_ratio	0.8614

Fig 12.

Top values for created_at.

Top values for created_at (2 unique shown, of 2 total).
value	count	share
2026-01-06 05:13:20	75484	71.6%
2026-01-06 05:13:19	30000	28.4%

How to cite

BibTeX

@misc{saturn-linguistic-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: linguistic},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/linguistic}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: linguistic. Source: /home/coolhand/data/linguistic.db. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/linguistic