linguistic
Reading
This dataset contains 105,484 phoneme records, each linked to a language by glottocode and drawn from one of 8 sources. Each row pairs a language identifier (2,177 unique glottocodes) with a phoneme (3,142 unique values, mostly 1-character IPA-like symbols) and a segment class. The segment_class breakdown is the most informative summary: consonants dominate at 72,282 rows (68.5%), vowels account for 31,052 (29.4%), and tones for only 2,150 (2.0%). Source coverage is uneven: 'ph' alone supplies about 34% of records, while the smallest sources (aa, spa, ra) each contribute under 8%, which matters when comparing across sources. Glottocode frequency is also skewed: kham1282 (622 rows) and osse1243 (483 rows) appear far more often than most codes, which may reflect larger recorded inventories, coverage by several sources, or both.
citing: row_count · column_count · segment_class.top_values · source.top_values · glottocode.n_unique · glottocode.top_values · phoneme.n_unique · phoneme.top_values · phoneme.stats.len_mean
Charts the summary said to look at first
segment_class value counts
| value | count | share |
|---|---|---|
| consonant | 72282 | 68.5% |
| vowel | 31052 | 29.4% |
| tone | 2150 | 2.0% |
source value counts
| value | count | share |
|---|---|---|
| ph | 36274 | 34.4% |
| ea | 16883 | 16.0% |
| upsid | 13966 | 13.2% |
| er | 9423 | 8.9% |
| saphon | 9047 | 8.6% |
| aa | 8064 | 7.6% |
| spa | 7566 | 7.2% |
| ra | 4261 | 4.0% |
phoneme length in characters (empty histogram bins omitted)
| chars | count |
|---|---|
| 1 | 67114 |
| 2 | 28726 |
| 3 | 6559 |
| 4 | 2225 |
| 5 | 401 |
| 6 | 267 |
| 7 | 104 |
| 8 | 5 |
| 9 | 70 |
| 10 | 1 |
| 11 | 12 |
glottocode length in characters (empty histogram bins omitted)
| chars | count |
|---|---|
| 2 | 19 |
| 8 | 105465 |
Schema
6 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| phoneme_id | numeric | 0.0% | 105,484 | |
| glottocode | text | 0.0% | 2,177 | one_word, short_text, duplicates |
| phoneme | text | 0.0% | 3,142 | one_word, short_text, duplicates |
| segment_class | categorical | 0.0% | 3 | |
| source | categorical | 0.0% | 8 | |
| created_at | categorical | 0.0% | 2 | |
phoneme_id
numeric · identifier
This is a sequential row identifier: each of the 105,484 rows has a unique value, running from 1 to 105,484, with mean and median both 52,742.5 and skew 0.0. The uniform distribution and the absence of nulls and outliers confirm that it carries no analytic signal beyond row order. Treatment: drop from modelling; retain only as a join key.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 105,484
- min: 1
- max: 105,484
- mean: 5.274e+04
- median: 5.274e+04
- std: 3.045e+04
- q1: 2.637e+04
- q3: 7.911e+04
- iqr: 5.274e+04
- skew: 0
- kurtosis: -1.2
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
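The claim that phoneme_id is a plain 1..n counter can be checked directly, and the reported mean and std follow from that alone. The sketch below is a minimal illustration using only the standard library; the helper name `is_sequential_id` is hypothetical, and the std uses the sample convention (the report does not say which convention it used, and both agree at the reported precision).

```python
import math

def is_sequential_id(ids, start=1):
    """True if ids are exactly start..start+n-1 in any order (no gaps, no repeats)."""
    ids = list(ids)
    n = len(ids)
    if n == 0:
        return False
    # Unique integers whose min and max span exactly n values must cover the full range.
    return len(set(ids)) == n and min(ids) == start and max(ids) == start + n - 1

n = 105_484                              # row count from the report
expected_mean = (n + 1) / 2              # mean of 1..n
expected_std = math.sqrt(n * (n + 1) / 12)  # sample std of 1..n

print(is_sequential_id([3, 1, 2]))   # small stand-in for the real column
print(expected_mean, round(expected_std))
```

The computed mean is 52742.5 and the std is close to 3.045e+04, consistent with the figures listed above.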
glottocode
text · foreign_key · one_word · short_text · duplicates
This column holds Glottolog language codes: 8-character identifiers (len_mean 7.999, len_max 8) with 2,177 distinct values across 105,484 rows. With a 97.94% duplicate rate and top codes such as kham1282 (622 rows) and osse1243 (483 rows) repeating heavily, each code labels many rows rather than identifying one. The 2-character minimum length comes from 19 rows whose values are too short to be valid codes; these look malformed or truncated and are worth inspecting. Treatment: treat as a categorical join key to Glottolog metadata; verify the 19 short (len=2) entries.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 2,177
- len_min: 2
- len_max: 8
- len_mean: 7.999
- len_median: 8
- len_p95: 8
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 103,307
- duplicate_rate: 0.9794
- vocab_size: 2,168
- readability_flesch_mean: 94.15
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
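One way to inspect the short entries is to separate values that fit the usual glottocode shape from those that do not. The sketch below assumes the shape is four lowercase alphanumeric characters followed by four digits; the function name `split_glottocodes` and the sample values (including the malformed `"xx"`) are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# Assumed glottocode shape: 4 lowercase alphanumerics + 4 digits, e.g. "kham1282".
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def split_glottocodes(codes):
    """Return (valid codes in input order, Counter of values that fail the pattern)."""
    valid, suspect = [], Counter()
    for code in codes:
        if GLOTTOCODE_RE.fullmatch(code):
            valid.append(code)
        else:
            suspect[code] += 1
    return valid, suspect

sample = ["kham1282", "osse1243", "xx", "kham1282"]  # "xx" is an invented bad value
valid, suspect = split_glottocodes(sample)
print(suspect)  # Counter({'xx': 1})
```

Counting the suspect values rather than just listing them shows whether the short entries are one repeated placeholder or several distinct truncations.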
phoneme
text · feature · one_word · short_text · duplicates
This column holds phoneme tokens: single-word strings averaging 1.5 characters, with a maximum length of 11. Across 105,484 rows there are only 3,142 unique values, so 97.0% of rows repeat an earlier value, and single letters such as 'm', 'i', 'k', and 'j' dominate the top values. The small vocabulary (vocab_size 1,339) and short token lengths (median 1, 95th percentile 3) indicate IPA-like phonetic units rather than full words. Treatment: treat as a categorical token and label-encode or embed before modelling.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 3,142
- len_min: 1
- len_max: 11
- len_mean: 1.501
- len_median: 1
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 102,342
- duplicate_rate: 0.9702
- vocab_size: 1,339
- readability_flesch_mean: 114.4
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.001754
- boilerplate_rate: 0
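The label-encoding treatment can be sketched without any external library. The example below is one possible encoding, not the profiler's: it assigns integer ids by descending frequency, and the helper name `build_phoneme_index` and the sample tokens are hypothetical.

```python
from collections import Counter

def build_phoneme_index(phonemes):
    """Map each distinct phoneme to an integer id, most frequent first."""
    counts = Counter(phonemes)
    # Ties are broken by the phoneme string so the mapping is deterministic.
    ordered = sorted(counts, key=lambda p: (-counts[p], p))
    return {p: i for i, p in enumerate(ordered)}

sample = ["m", "i", "k", "m", "kʰ", "i", "m"]  # invented tokens for illustration
index = build_phoneme_index(sample)
encoded = [index[p] for p in sample]

print(index)    # {'m': 0, 'i': 1, 'k': 2, 'kʰ': 3}
print(encoded)  # [0, 1, 2, 0, 3, 1, 0]
# Python's len() counts Unicode code points, so a base letter plus a modifier has length 2.
print(len("kʰ"))  # 2
```

Frequency-ordered ids keep the common single-letter phonemes at the low end of the range, which is convenient if rare tokens are later bucketed together.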
segment_class
categorical · feature
A 3-level categorical tag classifying segments as consonant, vowel, or tone, with no nulls across 105,484 rows. The distribution is heavily skewed: consonant dominates at 68.5%, vowel accounts for 29.4%, and tone is rare at 2,150 rows (2.0%). The entropy ratio of 0.64 reflects this imbalance (1.0 would mean three equally frequent classes). Treatment: one-hot encode; consider class-imbalance handling if predicting the rare 'tone' class.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 3
- top_value: consonant
- top_rate: 0.6852
- cardinality: 3
- entropy: 1.008
- entropy_ratio: 0.6357
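The entropy and entropy_ratio figures can be recomputed from the class counts. The report does not define either statistic, so the sketch below assumes Shannon entropy in bits and a ratio normalized by log2 of the number of levels; under that assumption the results agree with the reported values to about three decimal places. The function names are illustrative.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    counts = list(counts)
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Counts taken from the segment_class table in this report.
segment_counts = {"consonant": 72282, "vowel": 31052, "tone": 2150}
h = entropy_bits(segment_counts.values())
ratio = entropy_ratio(segment_counts.values())
print(h, ratio)  # compare with the reported 1.008 and 0.6357
```

A ratio of 1.0 would correspond to three equally sized classes, so the gap down to roughly 0.64 is a compact measure of how far the consonant class dominates.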
source
categorical · metadata
Categorical provenance tag with 8 distinct codes (ph, ea, upsid, er, saphon, aa, spa, ra) marking which source each of the 105,484 rows came from. The distribution is fairly balanced for a source field: the entropy ratio is 0.899 and the top code 'ph' covers only 34.4%, suggesting a merge of several sizeable corpora rather than one dominant source with minor supplements. The sources are not equal in size, however: 'ph' (36,274 rows) is about 8.5 times the size of 'ra' (4,261 rows). No nulls, so every row is attributable. Treatment: keep as a categorical grouping/stratification key; one-hot encode if used as a feature.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 8
- top_value: ph
- top_rate: 0.3439
- cardinality: 8
- entropy: 2.697
- entropy_ratio: 0.8991
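Using source as a grouping key might look like the sketch below, which counts distinct phonemes per language within each source so that sources of different sizes can be compared on a per-language basis. It assumes rows are available as (glottocode, phoneme, source) tuples; the function name and the sample rows are invented, and the dataset may hold more than one inventory per language and source, which this simple count would merge.

```python
from collections import defaultdict
from statistics import median

def inventory_sizes_by_source(rows):
    """rows: iterable of (glottocode, phoneme, source).
    Returns {source: {glottocode: number of distinct phonemes}}."""
    seen = defaultdict(lambda: defaultdict(set))
    for glottocode, phoneme, source in rows:
        seen[source][glottocode].add(phoneme)
    return {
        src: {lang: len(phonemes) for lang, phonemes in langs.items()}
        for src, langs in seen.items()
    }

rows = [  # made-up rows, not real records
    ("kham1282", "m", "ph"), ("kham1282", "i", "ph"), ("kham1282", "k", "ph"),
    ("osse1243", "m", "ph"),
    ("kham1282", "m", "ea"), ("kham1282", "i", "ea"),
]
sizes = inventory_sizes_by_source(rows)
for src, langs in sizes.items():
    # Languages covered and median distinct phonemes per language, per source.
    print(src, len(langs), median(langs.values()))
```

Comparing medians per source, rather than raw row counts, avoids mistaking a large source for one that records richer inventories.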
created_at
categorical · metadata
Despite its name, created_at holds only 2 distinct timestamp values across 105,484 rows, both within one second of each other on 2026-01-06. This looks like a batch-ingestion or load timestamp rather than a per-record creation time; 71.6% of rows share the dominant value, 2026-01-06 05:13:20. There is no temporal variation to exploit as a feature. Treatment: drop; it carries no temporal signal beyond a batch-load marker.
- n: 105,484
- nulls: 0 (0.0%)
- unique: 2
- top_value: 2026-01-06 05:13:20
- top_rate: 0.7156
- cardinality: 2
- entropy: 0.8614
- entropy_ratio: 0.8614
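The per-column treatments (drop phoneme_id and created_at for modelling, one-hot encode segment_class) can be combined into a single row-level transform. This is a minimal sketch assuming each row is a plain dict keyed by column name; `prepare_row` is a hypothetical helper, and the example row is invented rather than copied from the data.

```python
SEGMENT_CLASSES = ("consonant", "vowel", "tone")  # the three levels in the report
DROPPED = ("phoneme_id", "created_at")            # columns with no modelling signal

def prepare_row(row):
    """Drop identifier/load-marker columns and one-hot encode segment_class."""
    features = {
        k: v for k, v in row.items()
        if k not in DROPPED and k != "segment_class"
    }
    for cls in SEGMENT_CLASSES:
        features[f"segment_class_{cls}"] = int(row["segment_class"] == cls)
    return features

row = {  # invented example row
    "phoneme_id": 1, "glottocode": "kham1282", "phoneme": "m",
    "segment_class": "consonant", "source": "ph",
    "created_at": "2026-01-06 05:13:20",
}
prepared = prepare_row(row)
print(prepared)
```

phoneme_id is removed here only from the feature set; it can still be kept alongside the features as a join key.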