saturn·

linguistic

source /home/coolhand/data/linguistic.db · 105,484 rows · 6 columns · profiled 2026-05-01

Reading

dataset summary · high confidence · anthropic:claude-opus-4-7

This dataset contains 105,484 rows of phoneme records linked to languages by glottocode, drawn from 8 sources. Each row pairs a language identifier (2,177 unique glottocodes) with a phoneme (3,142 unique values, mostly 1-character IPA-like symbols) and a segment class. The segment_class breakdown is the most informative summary: consonants account for 72,282 rows (68.5%), vowels for 31,052 (29.4%), and tones for only 2,150 (2.0%). Source coverage is uneven: 'ph' alone supplies about 34% of records, while the tail (ra, spa, aa) is much smaller, which matters when comparing across sources. Glottocode frequency is also skewed: kham1282 (622 rows) and osse1243 (483 rows) appear far more often than the average of roughly 48 rows per code, which suggests either richer recorded inventories for those languages or coverage of the same language by several sources.

citing: row_count · column_count · segment_class.top_values · source.top_values · glottocode.n_unique · glottocode.top_values · phoneme.n_unique · phoneme.top_values · phoneme.stats.len_mean
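The segment_class breakdown cited above can be re-derived directly from the SQLite file. The following is a minimal sketch, not the profiler's own query: the table name `phonemes` is a hypothetical placeholder, since the report does not name the table, and the in-memory rows are stand-in data.

```python
import sqlite3


def segment_class_counts(conn, table="phonemes"):
    """Return {segment_class: row_count}, largest class first.

    `table` is a hypothetical name; the report does not state it.
    """
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(
        f"SELECT segment_class, COUNT(*) FROM {quoted} "
        "GROUP BY segment_class ORDER BY COUNT(*) DESC"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    # Tiny in-memory stand-in so the sketch runs without the real file.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE phonemes (phoneme TEXT, segment_class TEXT)")
    conn.executemany(
        "INSERT INTO phonemes VALUES (?, ?)",
        [("m", "consonant"), ("k", "consonant"), ("i", "vowel")],
    )
    print(segment_class_counts(conn))
```

Against the real database, the connection would instead be opened on the source path shown in the header, with `table` set to whatever the file actually contains.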

Schema

6 columns
Per-column summary.
column         type         nulls  unique   alerts
phoneme_id     numeric      0.0%   105,484
glottocode     text         0.0%   2,177    one_word, short_text, duplicates
phoneme        text         0.0%   3,142    one_word, short_text, duplicates
segment_class  categorical  0.0%   3
source         categorical  0.0%   8
created_at     categorical  0.0%   2

phoneme_id

numeric identifier
This is a sequential row identifier: each of the 105,484 rows has a unique value, running from 1 to 105,484, with mean and median both at 52,742.5 and skew 0. The exactly uniform distribution (kurtosis -1.2, the value expected for a uniform variable) and the absence of nulls and outliers confirm that it carries no analytic signal beyond row ordering. Treatment: drop from modelling; retain only as a join key. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
105,484
min
1
max
105,484
mean
5.274e+04
median
5.274e+04
std
3.045e+04
q1
2.637e+04
q3
7.911e+04
iqr
5.274e+04
skew
0
kurtosis
-1.2
n_outliers
0
outlier_rate
0
zero_rate
0
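The statistics above are what the integers 1..n produce when each appears exactly once, which can be checked in closed form. This is a small sketch of that check; the function name is illustrative, and whether the profiler reports a population or a sample standard deviation is not stated, so both are computed.

```python
import math


def sequential_id_stats(n):
    """Closed-form statistics for the integers 1..n, each appearing once."""
    mean = (n + 1) / 2
    pop_std = math.sqrt((n * n - 1) / 12)      # variance divides by n
    sample_std = math.sqrt(n * (n + 1) / 12)   # variance divides by n - 1
    # Excess kurtosis of a discrete uniform distribution on n points.
    excess_kurtosis = -6 * (n * n + 1) / (5 * (n * n - 1))
    return mean, pop_std, sample_std, excess_kurtosis


mean, pop_std, sample_std, kurt = sequential_id_stats(105_484)
print(mean, round(pop_std, 1), round(sample_std, 1), round(kurt, 2))
```

For n = 105,484 both standard deviations round to 3.045e+04 and the kurtosis to -1.2, matching the reported values, so the column behaves exactly like a gap-free sequence.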

glottocode

text foreign_key one_word short_text duplicates
This column holds Glottolog language codes: nominally 8-character identifiers (len_median 8, len_max 8, len_mean 7.999) with 2,177 distinct values across 105,484 rows. The profiler's tokenized vocab_size of 2,168 is slightly lower and should not be read as the number of distinct codes. With a 97.94% duplicate rate and top codes like kham1282 (622) and osse1243 (483) repeating heavily, each code labels many rows rather than identifying them. The 2-character minimum length, together with a mean just below 8, suggests a small number of malformed or truncated entries worth inspecting. Treatment: treat as a categorical join key to Glottolog metadata; verify the short (len=2) entries. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
2,177
len_min
2
len_max
8
len_mean
7.999
len_median
8
len_p95
8
word_mean
1
word_median
1
n_empty
0
n_duplicates
103,307
duplicate_rate
0.9794
vocab_size
2,168
readability_flesch_mean
94.15
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
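One way to find the short entries flagged above is to test each distinct code against the glottocode shape: four lowercase alphanumeric characters followed by four digits. A minimal sketch follows; the function name and the sample values are illustrative, and it only checks the shape, not whether a code exists in Glottolog.

```python
import re

# Glottocode shape: four lowercase letters or digits, then four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def malformed_glottocodes(codes):
    """Return the distinct values that do not match the glottocode shape."""
    return sorted({c for c in codes if not GLOTTOCODE_RE.fullmatch(c)})


# Illustrative values only: two codes from the report plus a made-up short one.
sample = ["kham1282", "osse1243", "xx", "kham1282"]
print(malformed_glottocodes(sample))
```

Running this over the distinct values of the column, rather than all rows, keeps the output short enough to review by hand.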

phoneme

text feature one_word short_text duplicates
This column holds phoneme tokens: single-word strings averaging 1.5 characters (median 1, 95th percentile 3, maximum 11). Across 105,484 rows there are only 3,142 unique values, so 97.0% of rows repeat an earlier value, and single letters such as 'm', 'i', 'k', and 'j' dominate the top values. The short lengths and the small inventory relative to the row count suggest these are IPA-like phonetic units rather than full words. The profiler's vocab_size of 1,339 is lower than the unique count, presumably because its tokenizer does not treat every IPA symbol or diacritic as a distinct word, so prefer the unique count. Treatment: treat as a categorical token and label-encode or embed before modelling. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
3,142
len_min
1
len_max
11
len_mean
1.501
len_median
1
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
102,342
duplicate_rate
0.9702
vocab_size
1,339
readability_flesch_mean
114.4
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.001754
boilerplate_rate
0
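The label-encoding treatment suggested above can be sketched without any modelling library. This is one possible scheme, assuming frequency-ordered integer ids and a reserved id for unseen symbols; the function names and the `unknown_id` convention are illustrative choices, not part of the report.

```python
from collections import Counter


def build_phoneme_vocab(phonemes):
    """Map each distinct phoneme to an integer id, most frequent first."""
    counts = Counter(phonemes)
    # Ties are broken by the symbol itself so the mapping is deterministic.
    ordered = sorted(counts, key=lambda p: (-counts[p], p))
    return {p: i for i, p in enumerate(ordered)}


def encode(phonemes, vocab, unknown_id=-1):
    """Replace each phoneme with its id; unseen symbols get `unknown_id`."""
    return [vocab.get(p, unknown_id) for p in phonemes]


vocab = build_phoneme_vocab(["m", "i", "m", "k"])
print(vocab)
print(encode(["k", "m", "ŋ"], vocab))
```

With 3,142 distinct values, the resulting ids are better suited as input to an embedding layer or a tree-based model than to one-hot encoding.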

segment_class

categorical feature
A 3-level categorical tag classifying segments as consonant, vowel, or tone, with no nulls across 105,484 rows. The distribution is heavily skewed: consonant dominates at 68.5%, vowel takes most of the rest, and tone is rare at only 2,150 occurrences (about 2%). The entropy of 1.008 bits is 0.64 of the 1.585-bit maximum for three classes, which quantifies the imbalance. Treatment: one-hot encode; consider class-imbalance handling if predicting the rare 'tone' class. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
3
top_value
consonant
top_rate
0.6852
cardinality
3
entropy
1.008
entropy_ratio
0.6357
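The entropy and entropy_ratio figures can be reproduced from the three class counts given in the dataset summary. The sketch below assumes the profiler uses Shannon entropy in bits and normalises by log2 of the cardinality, which is consistent with the reported numbers but not stated in the report.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2 of the number of categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# consonant, vowel, tone row counts from the dataset summary
segment_counts = [72_282, 31_052, 2_150]
print(entropy_bits(segment_counts), entropy_ratio(segment_counts))
```

A ratio of 1.0 would mean three equally sized classes; the value here sits well below that because the tone class contributes very little probability mass.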

source

categorical metadata
Categorical provenance tag with 8 distinct codes (ph, ea, upsid, er, saphon, aa, spa, ra) marking which source each of the 105,484 rows came from. The distribution is moderately balanced for a source field but not uniform: the entropy ratio is 0.899 (2.697 of a possible 3 bits), the top code 'ph' covers 34.4% of rows, and the last few codes contribute much less. This suggests a merge of several sizeable corpora plus a few smaller ones, rather than one dominant source with minor supplements. No nulls, so every row is attributable. Treatment: keep as a categorical grouping/stratification key; one-hot encode if used as a feature. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
8
top_value
ph
top_rate
0.3439
cardinality
8
entropy
2.697
entropy_ratio
0.8991

created_at

categorical metadata
Despite its name, created_at holds only 2 distinct timestamp values across 105,484 rows, both within one second of each other on 2026-01-06. This looks like a batch ingestion or load timestamp rather than a per-record creation time, with 71.6% of rows sharing the dominant value (2026-01-06 05:13:20). There is no temporal variation to exploit as a feature. Treatment: drop; no temporal signal beyond a batch-load marker. high · anthropic:claude-opus-4-7
n
105,484
nulls
0 (0.0%)
unique
2
top_value
2026-01-06 05:13:20
top_rate
0.7156
cardinality
2
entropy
0.8614
entropy_ratio
0.8614