data trove phoible phonetics database

saturn notebook · generated 2026-06-22

Overview

Source: /home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv

Saturn profiled 105,484 rows across 49 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv",
    "--findings", "data-trove-phoible-phonetics-database.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This dataset is PHOIBLE, a cross-linguistic phonological inventory database containing 105,484 phoneme-level records spanning 2,177 distinct Glottocodes (languages and dialects), each row describing a single phoneme and its distinctive feature values. The first thing to examine is the breakdown by SegmentClass: consonants dominate (~68.5%), followed by vowels (~29.4%) and tones (~2.0%), which shapes how almost every other feature distributes. A second focus is the Source column, which shows that the data comes from eight contributing sources ('ph' alone accounts for 34.4%), so coverage and coding conventions are uneven across the corpus and could introduce systematic biases in any cross-linguistic comparison.

citing: SegmentClass.top_values · Source.top_values · LanguageName.n_unique · Glottocode.n_unique · Phoneme.top_values · row_count · column_count

Out[4]:

saturn.schema() · 49 columns

column kind n null% unique alerts
InventoryID numeric 105,484 0.0% 3,020
Glottocode text 105,484 0.0% 2,177 one_word short_text duplicates
ISO6393 text 105,484 0.0% 2,095 one_word short_text duplicates
LanguageName text 105,484 0.0% 2,716 one_word allcaps short_text duplicates
SpecificDialect categorical 105,484 0.0% 546
GlyphID text 105,484 0.0% 3,142 one_word allcaps short_text duplicates
Phoneme text 105,484 0.0% 3,142 one_word short_text duplicates
Allophones text 105,484 0.0% 6,892 one_word short_text duplicates
Marginal categorical 105,484 0.0% 3
SegmentClass categorical 105,484 0.0% 3
Source categorical 105,484 0.0% 8
tone categorical 105,484 0.0% 2 imbalance
stress categorical 105,484 0.0% 2 imbalance
syllabic categorical 105,484 0.0% 8
short categorical 105,484 0.0% 4 imbalance
long categorical 105,484 0.0% 6
consonantal categorical 105,484 0.0% 5
sonorant categorical 105,484 0.0% 8
continuant categorical 105,484 0.0% 9
delayedRelease categorical 105,484 0.0% 7
approximant categorical 105,484 0.0% 6
tap categorical 105,484 0.0% 5 imbalance
trill categorical 105,484 0.0% 6 imbalance
nasal categorical 105,484 0.0% 8
lateral categorical 105,484 0.0% 8
labial categorical 105,484 0.0% 15
round categorical 105,484 0.0% 8
labiodental categorical 105,484 0.0% 6
coronal categorical 105,484 0.0% 7
anterior categorical 105,484 0.0% 6
distributed categorical 105,484 0.0% 11
strident categorical 105,484 0.0% 9
dorsal categorical 105,484 0.0% 13
high categorical 105,484 0.0% 11
low categorical 105,484 0.0% 8
front categorical 105,484 0.0% 13
back categorical 105,484 0.0% 12
tense categorical 105,484 0.0% 8
retractedTongueRoot categorical 105,484 0.0% 7 imbalance
advancedTongueRoot categorical 105,484 0.0% 3 imbalance
periodicGlottalSource categorical 105,484 0.0% 7
epilaryngealSource categorical 105,484 0.0% 3 imbalance
spreadGlottis categorical 105,484 0.0% 10
constrictedGlottis categorical 105,484 0.0% 7
fortis categorical 105,484 0.0% 3
lenis categorical 105,484 0.0% 3
raisedLarynxEjective categorical 105,484 0.0% 6 imbalance
loweredLarynxImplosive categorical 105,484 0.0% 5 imbalance
click categorical 105,484 0.0% 5
Fig 1.
SegmentClass · Look at the consonant-to-vowel-to-tone split — consonants make up roughly 68.5% of all records, which will skew any feature-level analysis.
Top values for SegmentClass (3 unique shown, of 3 total).
value  count  share
consonant  72,282  68.5%
vowel  31,052  29.4%
tone  2,150  2.0%
Fig 2.
Source · Check how unevenly data is distributed across the eight source databases, with 'ph' contributing more than a third of all rows on its own.
Top values for Source (8 unique shown, of 8 total).
value  count  share
ph  36,274  34.4%
ea  16,883  16.0%
upsid  13,966  13.2%
er  9,423  8.9%
saphon  9,047  8.6%
aa  8,064  7.6%
spa  7,566  7.2%
ra  4,261  4.0%
Fig 3.
LanguageName · See which languages contribute the most phoneme records — Iron Ossetic leads with 444 entries, hinting at uneven language-level representation.
Character-length distribution for LanguageName (mean: 7.822219483523567).
chars  count
2 – 4  3,915
4 – 6  29,541
6 – 8  34,409
8 – 10  14,985
10 – 12  7,358
12 – 14  4,849
14 – 15  3,897
15 – 17  2,893
17 – 19  1,292
19 – 21  993
21 – 23  198
23 – 25  224
25 – 27  218
27 – 29  39
29 – 31  93
31 – 33  130
33 – 35  64
35 – 37  23
37 – 39  37
39 – 40  57
40 – 42  0
42 – 44  52
44 – 46  0
46 – 48  23
48 – 50  0
50 – 52  0
52 – 54  20
54 – 56  0
56 – 58  40
58 – 60  0
60 – 62  87
62 – 64  0
64 – 66  0
66 – 67  0
67 – 69  0
69 – 71  0
71 – 73  0
73 – 75  0
75 – 77  0
77 – 79  47
Fig 4.
nasal · Review the balance of the nasal feature values (+, -, 0) as a representative example of how distinctive features are coded across the full inventory.
Top values for nasal (8 unique shown, of 8 total).
value  count  share
-  85,269  80.8%
+  15,941  15.1%
0  2,150  2.0%
+,-  1,973  1.9%
-,+  95  0.1%
+,-,-  54  0.1%
+,-,+,-  1  0.0%
-,+,-  1  0.0%
Fig 5.
Marginal · Note that about 1.3% of phonemes are flagged as marginal (borrowed or rare) while ~19.8% carry NA, worth checking before any typological counts.
Top values for Marginal (3 unique shown, of 3 total).
value  count  share
FALSE  83,263  78.9%
NA  20,874  19.8%
TRUE  1,347  1.3%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column  kind  null %
InventoryID  numeric  0.0%
Glottocode  text  0.0%
ISO6393  text  0.0%
LanguageName  text  0.0%
SpecificDialect  categorical  0.0%
GlyphID  text  0.0%
Phoneme  text  0.0%
Allophones  text  0.0%
Marginal  categorical  0.0%
SegmentClass  categorical  0.0%
Source  categorical  0.0%
tone  categorical  0.0%
stress  categorical  0.0%
syllabic  categorical  0.0%
short  categorical  0.0%
long  categorical  0.0%
consonantal  categorical  0.0%
sonorant  categorical  0.0%
continuant  categorical  0.0%
delayedRelease  categorical  0.0%
approximant  categorical  0.0%
tap  categorical  0.0%
trill  categorical  0.0%
nasal  categorical  0.0%
lateral  categorical  0.0%
labial  categorical  0.0%
round  categorical  0.0%
labiodental  categorical  0.0%
coronal  categorical  0.0%
anterior  categorical  0.0%
distributed  categorical  0.0%
strident  categorical  0.0%
dorsal  categorical  0.0%
high  categorical  0.0%
low  categorical  0.0%
front  categorical  0.0%
back  categorical  0.0%
tense  categorical  0.0%
retractedTongueRoot  categorical  0.0%
advancedTongueRoot  categorical  0.0%
periodicGlottalSource  categorical  0.0%
epilaryngealSource  categorical  0.0%
spreadGlottis  categorical  0.0%
constrictedGlottis  categorical  0.0%
fortis  categorical  0.0%
lenis  categorical  0.0%
raisedLarynxEjective  categorical  0.0%
loweredLarynxImplosive  categorical  0.0%
click  categorical  0.0%

InventoryID numeric foreign_key

InventoryID is a numeric foreign key referencing an inventory dimension table, with exactly 3,020 distinct values spanning 1–3,020 across 105,484 rows, implying heavy repeated use of each ID (average ~35 rows per ID). The distribution is remarkably flat and symmetric (skew ≈ −0.002, kurtosis ≈ −1.15, zero outliers), consistent with a well-populated lookup identifier rather than a measured quantity.

Treatment: Left-join on this ID to the inventory dimension table; do not use raw numeric value as a feature.

anthropic:default · confidence high
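
The left-join treatment can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the `inventories` lookup and its `inventory_label` field are hypothetical stand-ins for an inventory dimension table, which is not part of the profiled CSV.

```python
# Hypothetical inventory dimension table keyed by InventoryID (not in the CSV).
inventories = {
    1: {"inventory_label": "inventory-1"},
    2: {"inventory_label": "inventory-2"},
}

# Minimal phoneme rows; InventoryID is used only as a join key, never as a number.
rows = [
    {"InventoryID": 1, "Phoneme": "m"},
    {"InventoryID": 2, "Phoneme": "a"},
    {"InventoryID": 3, "Phoneme": "i"},  # no match in the dimension table
]


def left_join(rows, dimension, key="InventoryID"):
    """Keep every row and add the dimension fields, or None where unmatched."""
    fields = sorted({f for meta in dimension.values() for f in meta})
    joined = []
    for row in rows:
        meta = dimension.get(row[key], {})
        joined.append({**row, **{f: meta.get(f) for f in fields}})
    return joined


joined = left_join(rows, inventories)
```

Unmatched IDs keep their row and receive `None` for each dimension field, which makes gaps in the lookup table easy to count afterwards.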
Out[12]:

saturn.columns["InventoryID"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3,020
min 1
max 3,020
mean 1479
median 1,464
std 843.1
q1 769
q3 2,237
iqr 1,468
skew -0.002397
kurtosis -1.146
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 7.
Distribution of InventoryID. Vertical dash marks the median.
Histogram bins for InventoryID (median: 1464.0).
bin  count
1 – 76.47  2,729
76.47 – 151.9  2,947
151.9 – 227.4  2,742
227.4 – 302.9  2,385
302.9 – 378.4  2,399
378.4 – 453.8  2,530
453.8 – 529.3  2,270
529.3 – 604.8  2,323
604.8 – 680.3  2,533
680.3 – 755.8  3,039
755.8 – 831.2  2,896
831.2 – 906.7  2,609
906.7 – 982.2  2,693
982.2 – 1058  2,513
1058 – 1133  2,637
1133 – 1209  2,514
1209 – 1284  2,898
1284 – 1360  3,314
1360 – 1435  3,634
1435 – 1510  2,993
1510 – 1586  3,172
1586 – 1661  3,290
1661 – 1737  2,686
1737 – 1812  3,001
1812 – 1888  1,779
1888 – 1963  2,046
1963 – 2039  1,878
2039 – 2114  1,864
2114 – 2190  2,685
2190 – 2265  3,387
2265 – 2341  3,096
2341 – 2416  3,142
2416 – 2492  3,264
2492 – 2567  3,557
2567 – 2643  2,966
2643 – 2718  1,975
2718 – 2794  1,761
2794 – 2869  1,659
2869 – 2945  1,855
2945 – 3020  1,823

Glottocode text foreign_key

This column contains Glottocodes, the standardized 8-character language identifiers used by the Glottolog database (e.g., 'kham1282', 'dutc1256'). Length is near-uniform at 8 characters (mean 7.999, median 8); the exceptions are 19 rows holding a 2-character value (len_min 2), which do not fit the Glottocode format and should be inspected. With only 2,177 unique codes across 105,484 rows, the duplicate rate is 97.9%, so each code recurs about 48 times on average, as expected when every row is one phoneme of a language's inventory. The top code 'kham1282' (Kham) appears 622 times, which points to uneven language coverage in the dataset.

Treatment: Left-join on this code against the Glottolog reference table to enrich with language family, geographic coordinates, and macro-area metadata.

anthropic:default · confidence high
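
Before joining against a Glottolog reference table, it may help to separate well-formed codes from the short outliers. The sketch below assumes a Glottocode is four lowercase alphanumerics followed by four digits; the pattern and the sample values 'NA' and 'xx' are illustrative assumptions, not values read from the file.

```python
import re

# Assumed shape of a Glottocode: 4 lowercase alphanumerics + 4 digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def split_by_validity(codes):
    """Return (valid, invalid) lists, preserving input order."""
    valid, invalid = [], []
    for code in codes:
        (valid if GLOTTOCODE_RE.fullmatch(code) else invalid).append(code)
    return valid, invalid


# 'NA' and 'xx' are made-up stand-ins for the 2-character values.
valid, invalid = split_by_validity(["kham1282", "dutc1256", "NA", "xx"])
```

Rows in `invalid` would then be excluded from the join or mapped to missing, depending on what the 19 short values turn out to be.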
Out[15]:

saturn.columns["Glottocode"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 2,177
len_min 2
len_max 8
len_mean 7.999
len_median 8
len_p95 8
word_mean 1
word_median 1
n_empty 0
n_duplicates 103,307
duplicate_rate 0.9794
vocab_size 2,168
readability_flesch_mean 94.15
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.9% duplicate strings
Fig 8.
Character-length distribution for Glottocode.
Character-length distribution for Glottocode (mean: 7.998919267377043).
chars  count
2  19
8  105,465
(All other histogram bins, covering lengths 3 to 7, are empty.)

ISO6393 text label

This column contains ISO 639-3 three-letter language codes, a standardised identifier for natural languages. Every value is exactly 3 characters long (min, mean, and max all equal 3) with zero nulls, so the column conforms to the standard's format. The duplicate rate is 98.0%: 2,095 distinct codes repeat across 105,484 rows, which is expected for a language tag applied to many records. The most frequent code, 'mis', is the ISO 639-3 special code for uncoded languages; it appears 828 times (about 0.8% of rows), so a small but non-trivial share of records has no assigned language code.

Treatment: Use as a categorical grouping key; consider flagging or separating records where ISO6393 equals 'mis' (uncoded) before language-based analysis.

anthropic:default · confidence high
Out[18]:

saturn.columns["ISO6393"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 2,095
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 103,389
duplicate_rate 0.9801
vocab_size 2,086
readability_flesch_mean 119.5
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 98.0% duplicate strings
Fig 9.
Character-length distribution for ISO6393.
Character-length distribution for ISO6393 (mean: 3.0).
chars  count
3  105,484
(All other histogram bins are empty.)

LanguageName text label

This column contains human language names (e.g., 'Dutch', 'Chechen', 'Bengali', 'Iron Ossetic'), functioning as a categorical label with 2,716 unique values across 105,484 rows. The duplicate rate of 97.4% (102,768 duplicates) confirms it is a repeating label, not a free-text field. Notably, 13.1% of values are all-caps, which suggests that casing conventions differ between entries, possibly by contributing source. The distribution is uneven: the top value 'Iron Ossetic' appears 444 times while many languages appear rarely, indicating a long tail of sparsely represented languages.

Treatment: Encode as a categorical (label or target-encode) after normalising casing inconsistencies flagged by the 13.1% all-caps rate.

anthropic:default · confidence high
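
One simple way to normalise the casing is to title-case only the names that are entirely upper-case. This is a rough sketch, not a vetted rule: `normalise_name` is a hypothetical helper, 'KOREAN' is a made-up example, and `str.title()` can mis-handle apostrophes, hyphens, and click symbols, so results should be spot-checked.

```python
def normalise_name(name):
    """Title-case names written entirely in capitals; leave others unchanged."""
    return name.title() if name.isupper() else name


examples = ["KOREAN", "Iron Ossetic", "Dutch"]
normalised = [normalise_name(n) for n in examples]
```

Counting distinct names before and after normalisation shows how many of the 2,716 labels differ only by casing.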
Out[21]:

saturn.columns["LanguageName"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 2,716
len_min 2
len_max 79
len_mean 7.822
len_median 7
len_p95 16
word_mean 1.201
word_median 1
n_empty 0
n_duplicates 102,768
duplicate_rate 0.9743
vocab_size 2,670
readability_flesch_mean 53.18
emoji_rate 0
url_rate 0
one_word_rate 0.8433
allcaps_rate 0.1314
boilerplate_rate 0
alert: one_word · 84.3% rows are a single word
alert: allcaps · 13.1% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.4% duplicate strings
Fig 10.
Character-length distribution for LanguageName.
Character-length distribution for LanguageName (mean: 7.822219483523567).
chars  count
2 – 4  3,915
4 – 6  29,541
6 – 8  34,409
8 – 10  14,985
10 – 12  7,358
12 – 14  4,849
14 – 15  3,897
15 – 17  2,893
17 – 19  1,292
19 – 21  993
21 – 23  198
23 – 25  224
25 – 27  218
27 – 29  39
29 – 31  93
31 – 33  130
33 – 35  64
35 – 37  23
37 – 39  37
39 – 40  57
40 – 42  0
42 – 44  52
44 – 46  0
46 – 48  23
48 – 50  0
50 – 52  0
52 – 54  20
54 – 56  0
56 – 58  40
58 – 60  0
60 – 62  87
62 – 64  0
64 – 66  0
66 – 67  0
67 – 69  0
69 – 71  0
71 – 73  0
73 – 75  0
75 – 77  0
77 – 79  47

SpecificDialect categorical label

This column captures a specific dialect designation for each record in what appears to be a linguistic or language survey dataset. The dominant signal is extreme missingness-by-label: 71.9% of rows carry the value 'NA' and a further 7,692 rows (≈7.3%) are empty strings, meaning roughly 79% of records lack a meaningful dialect value. Despite 546 unique values, the effective entropy ratio is only 0.33, confirming that real dialect labels are thinly and unevenly spread across the remaining ~22,000 rows.

Treatment: Treat 'NA' and empty-string as missing; consider collapsing rare dialects (below a frequency threshold) into an 'Other' bucket before encoding, and flag high missingness rate (~79%) before any modelling use.

anthropic:default · confidence high
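
The treatment above (map 'NA' and the empty string to missing, then fold rare dialect labels into one bucket) could look like this. `clean_dialects`, the `min_count` threshold, and the 'Other' label are illustrative choices, not part of the report.

```python
from collections import Counter

MISSING = {"NA", ""}  # string sentinels the report treats as missing


def clean_dialects(values, min_count=2, other="Other"):
    """Map sentinels to None and fold labels rarer than min_count into `other`."""
    counts = Counter(v for v in values if v not in MISSING)
    cleaned = []
    for v in values:
        if v in MISSING:
            cleaned.append(None)
        elif counts[v] < min_count:
            cleaned.append(other)
        else:
            cleaned.append(v)
    return cleaned


sample = ["NA", "", "Standard Italian", "Standard Italian", "Kufa"]
cleaned = clean_dialects(sample, min_count=2)
```

With the real column the threshold would need tuning, since even the most frequent dialect labels cover only about 0.1% of rows each.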
Out[24]:

saturn.columns["SpecificDialect"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 546
top_value NA
top_rate 0.7187
cardinality 546
entropy 2.969
entropy_ratio 0.3265
Fig 11.
Top values for SpecificDialect.
Top values for SpecificDialect (20 unique shown, of 546 total).
value  count  share
NA  75,807  71.9%
(empty string)  7,692  7.3%
W2  120  0.1%
Lezgian (Güne)  96  0.1%
Santa  92  0.1%
Central Pakistan  83  0.1%
Babungo (Grassfields Bantu, Ring)  82  0.1%
Scottish Gaelic (Lewis)  82  0.1%
Tangari  81  0.1%
Kanga  76  0.1%
Kufa  75  0.1%
Skolt Saami (Suõʹnnʼjel)  75  0.1%
Standard Hindi (as spoken in Varanasi, Lucknow, Delhi etc.)  74  0.1%
Standard (eastern)  74  0.1%
Guovdageaidnu  74  0.1%
Nuosu (Black Yi)  74  0.1%
Northern Qiang (Yadu)  73  0.1%
Bangladeshi Standard (spoken in Dhaka and other urban aread of Bangladesh)  72  0.1%
Standard Italian  70  0.1%
Chechen (Ploskost)  70  0.1%

GlyphID text label

GlyphID is an identifier column containing Unicode code points stored as uppercase hexadecimal strings (e.g., '006D' = 'm', '0069' = 'i', '0061' = 'a'). In this dataset it most likely encodes the characters of the Phoneme symbol rather than glyphs from a typography, OCR, or text-rendering corpus: it has 3,142 unique values and 102,342 duplicates (97.0%), exactly the same counts as Phoneme, which is consistent with a one-to-one mapping between the two columns. All values are fully uppercase (allcaps_rate 1.0) and single-token (one_word_rate 1.0). The median length is 4 characters, but lengths run up to 54 and cluster at 4, 9, 14, 19, and so on, the pattern expected when 4-digit hex code points are joined by a single separator character for multi-character phonemes. The top values map to common Latin lowercase letters because frequent phoneme symbols such as 'm', 'i', and 'a' use those letters, not because the underlying corpus is Latin-script text.

Treatment: Map hex strings to Unicode characters for interpretability and check them against Phoneme; if the two columns are redundant, keep one and encode it as a categorical (3,142 levels).

anthropic:default · confidence high
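
Mapping the hex strings back to characters takes only a few lines. The sketch assumes that multi-character GlyphIDs join their 4-digit code points with '+'; that separator is an assumption to verify against the data, and `glyph_id_to_text` is a hypothetical helper name.

```python
def glyph_id_to_text(glyph_id, sep="+"):
    """Decode a GlyphID such as '006D' into the characters it names.

    `sep` is the assumed separator between code points in longer IDs.
    """
    return "".join(chr(int(part, 16)) for part in glyph_id.split(sep))


single = glyph_id_to_text("006D")        # one code point
double = glyph_id_to_text("0074+0283")   # two code points, 9 characters long
```

Comparing the decoded string with the Phoneme column row by row is a direct way to confirm whether the two columns carry the same information.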
Out[27]:

saturn.columns["GlyphID"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3,142
len_min 4
len_max 54
len_mean 6.503
len_median 4
len_p95 14
word_mean 1
word_median 1
n_empty 0
n_duplicates 102,342
duplicate_rate 0.9702
vocab_size 1,343
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.0% duplicate strings
Fig 12.
Character-length distribution for GlyphID.
Character-length distribution for GlyphID (mean: 6.503033635432862).
chars  count
4 – 5  67,114
5 – 6  0
6 – 8  0
8 – 9  0
9 – 10  28,726
10 – 12  0
12 – 13  0
13 – 14  0
14 – 15  6,559
15 – 16  0
16 – 18  0
18 – 19  0
19 – 20  2,225
20 – 22  0
22 – 23  0
23 – 24  0
24 – 25  401
25 – 26  0
26 – 28  0
28 – 29  0
29 – 30  267
30 – 32  0
32 – 33  0
33 – 34  0
34 – 35  104
35 – 36  0
36 – 38  0
38 – 39  0
39 – 40  5
40 – 42  0
42 – 43  0
43 – 44  0
44 – 45  70
45 – 46  0
46 – 48  0
48 – 49  0
49 – 50  1
50 – 52  0
52 – 53  0
53 – 54  12

Phoneme text label

This column contains the phoneme symbols themselves, with 3,142 unique phoneme strings across 105,484 rows. Values are mostly single characters (len_median 1, len_mean 1.5, len_max 11), consistent with IPA-style notation in which diacritics and multi-part segments add extra characters. The duplicate rate is 97.0% (102,342 duplicates), which is expected because common phonemes recur across many language inventories. The vocab_size of 1,339 tokens is well below the 3,142 unique values, which suggests that many phoneme strings are multi-character (e.g., a base symbol plus diacritics) rather than single symbols.

Treatment: Encode as categorical (label or one-hot) for modelling; 3,142 unique values is manageable but consider grouping by manner/place of articulation if dimensionality is a concern.

anthropic:default · confidence high
Out[30]:

saturn.columns["Phoneme"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3,142
len_min 1
len_max 11
len_mean 1.501
len_median 1
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 102,342
duplicate_rate 0.9702
vocab_size 1,339
readability_flesch_mean 114.4
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.001754
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 97.0% duplicate strings
Fig 13.
Character-length distribution for Phoneme.
Character-length distribution for Phoneme (mean: 1.5006067270865724).
chars  count
1  67,114
2  28,726
3  6,559
4  2,225
5  401
6  267
7  104
8  5
9  70
10  1
11  12
(Histogram bins that fall between these whole-number lengths are empty and omitted.)

Allophones text feature

This column contains allophone representations from a phonology or linguistics dataset, storing the phonetic variants of phonemes (e.g., 'm', 'j', 'w', 's') as very short strings with a mean length of ~2 characters. Strikingly, 53,580 of 105,484 rows (roughly 50.8%) carry the sentinel value 'NA', indicating no allophone recorded, which dominates the duplicate rate of 93.5%. The 6,892 unique values across a vocabulary of only 1,263 words suggest a modest set of phonetic symbols combined in small clusters, consistent with IPA or similar notation.

Treatment: Treat 'NA' as missing; encode remaining values as categorical or tokenize individual phoneme symbols before modelling.

anthropic:default · confidence high
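
A small parser can separate the 'NA' sentinel from real allophone lists. The sketch assumes that multiple allophones in one cell are separated by whitespace, which the word_mean of 1.129 hints at but does not prove; `parse_allophones` is a hypothetical helper.

```python
def parse_allophones(cell):
    """Return None for the 'NA' sentinel, else the whitespace-separated variants."""
    if cell == "NA":
        return None
    return cell.split()


parsed = [parse_allophones(c) for c in ["NA", "m", "s z"]]
```

Keeping `None` distinct from an empty list preserves the difference between "no allophones recorded" and a cell that was present but blank.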
Out[33]:

saturn.columns["Allophones"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 6,892
len_min 1
len_max 37
len_mean 2.083
len_median 2
len_p95 4
word_mean 1.129
word_median 1
n_empty 0
n_duplicates 98,592
duplicate_rate 0.9347
vocab_size 1,263
readability_flesch_mean 116.2
emoji_rate 0
url_rate 0
one_word_rate 0.9131
allcaps_rate 0.00291
boilerplate_rate 0
alert: one_word · 91.3% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 93.5% duplicate strings
Fig 14.
Character-length distribution for Allophones.
Character-length distribution for Allophones (mean: 2.0834154942929732).
chars  count
1 – 2  26,762
2 – 3  66,019
3 – 4  4,934
4 – 5  3,125
5 – 6  1,246
6 – 6  1,181
6 – 7  789
7 – 8  409
8 – 9  334
9 – 10  0
10 – 11  226
11 – 12  160
12 – 13  69
13 – 14  65
14 – 14  41
14 – 15  36
15 – 16  12
16 – 17  16
17 – 18  12
18 – 19  0
19 – 20  14
20 – 21  9
21 – 22  8
22 – 23  0
23 – 24  3
24 – 24  3
24 – 25  3
25 – 26  3
26 – 27  2
27 – 28  0
28 – 29  1
29 – 30  0
30 – 31  0
31 – 32  0
32 – 32  0
32 – 33  0
33 – 34  1
34 – 35  0
35 – 36  0
36 – 37  1

Marginal categorical feature

This column is a boolean flag indicating whether a record is classified as 'marginal', with three distinct string values: FALSE, NA, and TRUE. The dominant value is FALSE (78.9% of 105,484 rows), while TRUE is strikingly rare at only 1,347 records (≈1.3%). The presence of 'NA' as a literal string value in 20,874 rows (≈19.8%) is noteworthy — these are not system nulls (null_rate is 0.0) but encoded string missings, which must be handled explicitly rather than via standard null imputation.

Treatment: Encode as ternary (FALSE=0, TRUE=1, NA=missing) after converting string 'NA' to actual nulls, then decide on imputation strategy given 19.8% string-missing rate.

anthropic:default · confidence high
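
The ternary encoding described in the treatment can be sketched with a lookup table that turns the literal string 'NA' into a real missing value. `MARGINAL_MAP` and `encode_marginal` are illustrative names; the counts are copied from the Marginal top-values table.

```python
# String-to-code mapping; the literal 'NA' becomes a real missing value (None).
MARGINAL_MAP = {"FALSE": 0, "TRUE": 1, "NA": None}


def encode_marginal(cell):
    """Encode one Marginal cell; unexpected strings raise KeyError on purpose."""
    return MARGINAL_MAP[cell]


# Counts from the report's Marginal table.
counts = {"FALSE": 83263, "NA": 20874, "TRUE": 1347}
missing_share = counts["NA"] / sum(counts.values())
```

Raising on unknown strings is a deliberate choice here, so that a fourth value appearing in a later data release is noticed rather than silently coerced.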
Out[36]:

saturn.columns["Marginal"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3
top_value FALSE
top_rate 0.7893
cardinality 3
entropy 0.8122
entropy_ratio 0.5125
Fig 15.
Top values for Marginal.
Top values for Marginal (3 unique shown, of 3 total).
value  count  share
FALSE  83,263  78.9%
NA  20,874  19.8%
TRUE  1,347  1.3%

SegmentClass categorical label

SegmentClass is a phonological category label classifying 105,484 linguistic segments into exactly three classes: consonant, vowel, and tone. Consonants dominate at 68.5% (72,282 occurrences), vowels account for 29.4% (31,052), and tones are a small minority at just 2.0% (2,150) — a distribution consistent with natural language phoneme inventories but with tones notably underrepresented, suggesting the dataset skews toward non-tonal languages or tonal markings are partially absent. Zero nulls and perfect coverage make this a clean, reliable label.

Treatment: One-hot encode or use as a stratification variable; monitor class imbalance for the 'tone' minority class (2,150 samples) in any classification task.

anthropic:default · confidence high
Out[39]:

saturn.columns["SegmentClass"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3
top_value consonant
top_rate 0.6852
cardinality 3
entropy 1.008
entropy_ratio 0.6357
Fig 16.
Top values for SegmentClass.
Top values for SegmentClass (3 unique shown, of 3 total).
value  count  share
consonant  72,282  68.5%
vowel  31,052  29.4%
tone  2,150  2.0%

Source categorical label

This column identifies the source database or linguistic inventory from which each phonological record was drawn, with 8 distinct source codes across 105,484 rows and no nulls. The top source 'ph' dominates at 34.4% (36,274 rows), likely referring to PHOIBLE or a similar phoneme database, followed by 'ea', 'upsid', 'er', 'saphon', 'aa', 'spa', and 'ra'. The high entropy ratio of 0.899 indicates a relatively even spread across sources despite 'ph' leading — no single source overwhelmingly controls the data. Analysts should be aware that cross-source comparisons may introduce systematic coding differences, as each source may apply distinct phonological conventions.

Treatment: One-hot encode or use as a stratification/grouping variable; check for source-specific biases before pooling.

anthropic:default · confidence high
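
The entropy figures in these stat tables are consistent with Shannon entropy in bits, and with entropy_ratio being that entropy divided by its maximum, log2 of the number of categories. That reading is inferred from the reported numbers rather than from saturn's documentation; the sketch below recomputes both from the published counts.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Counts copied from the report's Source and tone tables.
source_counts = [36274, 16883, 13966, 9423, 9047, 8064, 7566, 4261]
tone_counts = [103334, 2150]

source_entropy = entropy_bits(source_counts)
source_ratio = entropy_ratio(source_counts)
tone_entropy = entropy_bits(tone_counts)
```

A ratio near 1 means the categories are close to evenly used; a ratio near 0 means one category dominates, as in the tone column.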
Out[42]:

saturn.columns["Source"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 8
top_value ph
top_rate 0.3439
cardinality 8
entropy 2.697
entropy_ratio 0.8991
Fig 17.
Top values for Source.
Top values for Source (8 unique shown, of 8 total).
value  count  share
ph  36,274  34.4%
ea  16,883  16.0%
upsid  13,966  13.2%
er  9,423  8.9%
saphon  9,047  8.6%
aa  8,064  7.6%
spa  7,566  7.2%
ra  4,261  4.0%

tone categorical label

This column is a binary tone indicator with only two values, '0' and '+', across 105,484 rows with no nulls. The '+' count of 2,150 equals the number of rows with SegmentClass 'tone', so the column most likely just marks tone segments, with '0' on every consonant and vowel. The distribution is therefore severely imbalanced: '0' accounts for 97.96% of records (103,334) while '+' appears in just 2,150 rows (~2%). The entropy of 0.144 bits (out of a maximum of 1.0 for a binary variable) confirms the near-total dominance of a single class, so any model trained on this label would tend to predict '0' exclusively.

Treatment: Apply class-balancing techniques (e.g., oversampling '+' or class-weight adjustment) before using as a modelling target.

anthropic:default · confidence high
Out[45]:

saturn.columns["tone"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 2
top_value 0
top_rate 0.9796
cardinality 2
entropy 0.1436
entropy_ratio 0.1436
alert: imbalance · top value is 98.0% of rows
Fig 18.
Top values for tone.
Top values for tone (2 unique shown, of 2 total).
value  count  share
0  103,334  98.0%
+  2,150  2.0%

stress categorical label

This column is the phonological stress feature with two values, '-' and '0'. In the feature notation used throughout this dataset, '-' means the feature is absent and '0' means it is unspecified or not applicable; no row carries '+', so no segment is marked as stressed. The distribution is severely imbalanced: '-' accounts for 97.96% of all 105,484 rows (103,334 occurrences) versus only 2,150 rows for '0', a count that matches the tone segments in SegmentClass. The entropy ratio of 0.144 confirms near-minimum information content, and the column appears to add little beyond the tone/non-tone split.

Treatment: Treat '0' as not applicable rather than as a positive class; because no segment is marked '+', consider dropping the column or checking it against SegmentClass before modelling.

anthropic:default · confidence medium
Out[48]:

saturn.columns["stress"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 2
top_value -
top_rate 0.9796
cardinality 2
entropy 0.1436
entropy_ratio 0.1436
alert: imbalance · top value is 98.0% of rows
Fig 19.
Top values for stress.
Top values for stress (2 unique shown, of 2 total).
value  count  share
-  103,334  98.0%
0  2,150  2.0%

syllabic categorical

Out[51]:

saturn.columns["syllabic"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 8
top_value -
top_rate 0.6849
cardinality 8
entropy 1.042
entropy_ratio 0.3472
Fig 20.
Top values for syllabic.
Top values for syllabic (8 unique shown, of 8 total).
value  count  share
-  72,248  68.5%
+  30,692  29.1%
0  2,150  2.0%
+,-  244  0.2%
-,+  124  0.1%
-,+,-  12  0.0%
-,+,+  12  0.0%
+,+,-  2  0.0%

short categorical feature

This column is the phonological feature short, with only 4 distinct values: '-', '0', '+', and the compound '-,+'. It is severely imbalanced: the dominant value '-' accounts for 97.76% of rows (103,125 of 105,484) and '0' for 2,150 rows, while '+' appears just 204 times and the mixed label '-,+' only 5 times. The near-zero entropy ratio (0.082) confirms that the column carries very little information, and any model trained on it will be dominated by the '-' class.

Treatment: Treat as ordinal or one-hot encode, but flag severe class imbalance (97.8% '-'); consider oversampling minority classes or collapsing '+' and '-,+' before modelling.

anthropic:default · confidence high
Out[54]:

saturn.columns["short"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 4
top_value -
top_rate 0.9776
cardinality 4
entropy 0.1645
entropy_ratio 0.08225
alert: imbalance · top value is 97.8% of rows
Fig 21.
Top values for short.
Top values for short (4 unique shown, of 4 total).
value  count  share
-  103,125  97.8%
0  2,150  2.0%
+  204  0.2%
-,+  5  0.0%

long categorical feature

This column is the phonological feature long (segment length), not a geographic longitude; it has only 6 distinct values across 105,484 rows. The dominant value '-' (not long) accounts for 89.9% of records, '+' (long) for 8.0%, and '0' for the 2,150 rows (2.0%) where the feature does not apply. The compound values '-,+', '+,-', and '-,-,+' appear in just 104 rows combined; as in the other feature columns, they likely describe segments with more than one part and warrant inspection before modelling.

Treatment: Inspect the compound values ('-,+', '+,-', '-,-,+') before encoding; encode '-' as -1, '+' as +1, '0' as 0, and flag the 104 compound rows for separate handling.

anthropic:default · confidence medium
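
The numeric encoding with a separate flag for compound cells could be sketched as below. `encode_long` and `LONG_CODES` are hypothetical names; the counts are copied from the long top-values table.

```python
LONG_CODES = {"-": -1, "0": 0, "+": 1}


def encode_long(cell):
    """Return (code, is_compound); compound cells get code None and are flagged."""
    if "," in cell:
        return None, True
    return LONG_CODES[cell], False


# Counts from the report's table for the long column.
long_counts = {"-": 94844, "+": 8386, "0": 2150, "-,+": 63, "+,-": 40, "-,-,+": 1}
n_compound = sum(n for value, n in long_counts.items() if encode_long(value)[1])
```

Returning the flag alongside the code keeps the compound rows available for later review instead of dropping or imputing them up front.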
Out[57]:

saturn.columns["long"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 6
top_value -
top_rate 0.8991
cardinality 6
entropy 0.5537
entropy_ratio 0.2142
Fig 22.
Top values for long.
Top values for long (6 unique shown, of 6 total).
value  count  share
-  94,844  89.9%
+  8,386  8.0%
0  2,150  2.0%
-,+  63  0.1%
+,-  40  0.0%
-,-,+  1  0.0%

consonantal categorical feature

This column encodes the phonological feature consonantal, marking whether a segment is consonantal (+) or not (-), a standard distinctive feature in linguistics datasets. The dominant value is '+' at 60.9% (64,257 rows), with '-' covering 37.0% (39,041 rows). The third category '0' (2,151 rows, 2.0%) suggests segments to which the feature does not apply, while the composite values '+,-' (34) and '-,+' (1) are rare multi-value entries that may describe multi-part segments or reflect ambiguous annotations.

Treatment: Encode as ordinal or one-hot after splitting composite entries ('+,-', '-,+') into separate flags or flagging them for review.

anthropic:default · confidence high
Out[60]:

saturn.columns["consonantal"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 5
top_value +
top_rate 0.6092
cardinality 5
entropy 1.085
entropy_ratio 0.4672
Fig 23.
Top values for consonantal.
Top values for consonantal (5 unique shown, of 5 total).
value  count  share
+  64,257  60.9%
-  39,041  37.0%
0  2,151  2.0%
+,-  34  0.0%
-,+  1  0.0%

sonorant categorical

Out[63]:

saturn.columns["sonorant"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 8
top_value +
top_rate 0.5301
cardinality 8
entropy 1.245
entropy_ratio 0.4149
Fig 24.
Top values for sonorant.
Top values for sonorant (8 unique shown, of 8 total).
value  count  share
+  55,920  53.0%
-  45,322  43.0%
0  2,150  2.0%
+,-  1,948  1.8%
-,+  89  0.1%
+,-,-  29  0.0%
+,-,+  25  0.0%
+,-,+,-  1  0.0%

continuant categorical feature

This column encodes a phonological feature, whether a speech sound is a continuant (airflow continues through the vocal tract), using the '+', '-', and '0' (unspecified/N-A) notation standard in distinctive feature theory. The dominant values are '+' (54.9%, 57,952 rows) and '-' (42.3%, 44,585 rows), which together account for ~97% of records. Notably, 6 of the 9 unique values are composite strings like '-,+' or '-,-,+', covering 796 rows in total, which suggests that some rows encode sequences or multi-segment entries rather than single-segment features; this is worth checking against the sources' coding conventions before treating it as a data quality problem. The entropy ratio of 0.37 reflects moderate concentration driven almost entirely by the binary +/- split.

Treatment: Treat '+'/'-'/'0' as a 3-class categorical; investigate and potentially split or flag the 796 composite multi-value rows before encoding.

anthropic:default · confidence high
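
To size the composite-value problem before encoding, the rows can be grouped by how many comma-separated parts each value has. The counts are copied from the continuant top-values table; `n_parts` is a hypothetical helper.

```python
from collections import Counter

# Counts from the report's table for the continuant column.
continuant_counts = {
    "+": 57952, "-": 44585, "0": 2151, "-,+": 728, "-,-,+": 50,
    "+,-": 9, "0,-,+": 4, "-,+,+": 4, "0,0,-,+": 1,
}


def n_parts(value):
    """Number of comma-separated feature values packed into one cell."""
    return len(value.split(","))


rows_by_parts = Counter()
for value, n in continuant_counts.items():
    rows_by_parts[n_parts(value)] += n

n_composite = sum(n for parts, n in rows_by_parts.items() if parts > 1)
```

Most composite rows have exactly two parts, so a two-slot representation would cover the bulk of them if the parts turn out to be ordered sub-segments.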
Out[66]:

saturn.columns["continuant"].stats

statvalue
n105,484
nulls0 (0.0%)
unique9
top_value +
top_rate 0.5494
cardinality 9
entropy 1.172
entropy_ratio 0.3696
Fig 25.
Top values for continuant.
Top values for continuant (9 unique shown, of 9 total).
value count share
+ 57952 54.9%
- 44585 42.3%
0 2151 2.0%
-,+ 728 0.7%
-,-,+ 50 0.0%
+,- 9 0.0%
0,-,+ 4 0.0%
-,+,+ 4 0.0%
0,0,-,+ 1 0.0%
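
Composite values can be tallied directly from the top-values table before deciding how to treat them. The following is a minimal sketch that treats any value containing a comma as composite; the helper name is illustrative.

```python
# Counts copied from the continuant top-values table.
continuant_counts = {
    "+": 57952, "-": 44585, "0": 2151, "-,+": 728, "-,-,+": 50,
    "+,-": 9, "0,-,+": 4, "-,+,+": 4, "0,0,-,+": 1,
}

def is_composite(value):
    """A value is composite when it packs several comma-separated tokens."""
    return "," in value

composite = {v: c for v, c in continuant_counts.items() if is_composite(v)}
n_composite_values = len(composite)
n_composite_rows = sum(composite.values())
share = n_composite_rows / sum(continuant_counts.values())
print(n_composite_values, n_composite_rows, f"{share:.2%}")
```

The same tally applies to any other feature column by swapping in its counts.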

delayedRelease categorical feature

This column encodes the phonological feature delayed release, which in distinctive feature theory typically separates affricates from plain stops, for all 105,484 records. The symbols follow the same convention as the other feature columns: '0' (not applicable or unspecified, 55.0% of rows), '-' (feature absent, 26.0%), '+' (feature present, 18.5%), and compound combinations thereof. The compound values ('-,+', '0,-,+', '+,-', '0,0,-,+') show that the column is sometimes multi-valued, packed as comma-separated strings, which is structurally inconsistent with a simple categorical and most likely reflects multi-part segments with one value per component. No nulls are present, and the entropy ratio of 0.52 reflects a skewed but non-trivial distribution dominated by the '0' class.

Treatment: Split compound values on ',' into multi-hot binary columns ('has_0', 'has_minus', 'has_plus') before modelling.

anthropic:default · confidence high
Out[69]:

saturn.columns["delayedRelease"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 7
top_value 0
top_rate 0.5502
cardinality 7
entropy 1.471
entropy_ratio 0.5238
Fig 26.
Top values for delayedRelease.
Top values for delayedRelease (7 unique shown, of 7 total).
value count share
0 58035 55.0%
- 27384 26.0%
+ 19533 18.5%
-,+ 492 0.5%
0,-,+ 33 0.0%
+,- 6 0.0%
0,0,-,+ 1 0.0%
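
The multi-hot split named in the treatment for this column can be sketched as follows. The flag names 'has_0', 'has_minus', and 'has_plus' come from the treatment line; the function itself is an illustrative assumption, not part of Saturn.

```python
def multi_hot(value):
    """Map a feature string such as '0,-,+' to presence flags for each token."""
    tokens = set(value.split(","))
    unknown = tokens - {"0", "-", "+"}
    if unknown:
        raise ValueError(f"unexpected token(s): {sorted(unknown)}")
    return {
        "has_0": int("0" in tokens),
        "has_minus": int("-" in tokens),
        "has_plus": int("+" in tokens),
    }

# The seven values observed in delayedRelease.
for v in ["0", "-", "+", "-,+", "0,-,+", "+,-", "0,0,-,+"]:
    print(f"{v!r:>10} -> {multi_hot(v)}")
```

This encoding discards token order and repetition, so '-,+' and '+,-' map to the same flags; if order matters, a positional split is the safer choice.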

approximant categorical feature

This column records whether a segment is classified as an approximant, using the compact symbolic encoding shared by the feature columns: '-' for absent, '+' for present, and '0' for unspecified or not applicable. The dominant value is '-' at 55.9% of rows (58,966), with '+' accounting for 42.0% (44,266) and '0' for 2.0% (2,150), giving a near-binary distribution (entropy 1.12 bits, entropy ratio 0.43). The compound values ('-,+', '-,-,+', '+,-') total just 102 rows, suggesting that a small minority of segments carry ambiguous or multi-valued approximant classifications, possibly transcription artifacts or multi-segment bundles.

Treatment: One-hot encode the three atomic values ('-', '+', '0'); bin the three compound values into an 'ambiguous' category, or flag and investigate them as potential data quality issues before modelling.

anthropic:default · confidence high
Out[72]:

saturn.columns["approximant"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 6
top_value -
top_rate 0.559
cardinality 6
entropy 1.12
entropy_ratio 0.4333
Fig 27.
Top values for approximant.
Top values for approximant (6 unique shown, of 6 total).
value count share
- 58966 55.9%
+ 44266 42.0%
0 2150 2.0%
-,+ 71 0.1%
-,-,+ 25 0.0%
+,- 6 0.0%

tap categorical feature

This column encodes the phonological feature tap, marking whether a segment is articulated as a tap or flap ('+') or not ('-'), with '0' for unspecified or not applicable. The dominant value '-' accounts for 96.7% of all 105,484 rows, producing severe class imbalance (entropy ratio of only 0.10); '+' appears in just 1,218 rows (1.2%), so taps are rare in these inventories. The compound entries ('-,+' in 25 rows and '-,-,+' in 15 rows) show that a cell can hold an ordered sequence of values, most plausibly one per component of a multi-part segment, rather than a single state. No nulls are present.

Treatment: One-hot encode single-value categories; handle the compound values ('-,+', '-,-,+') separately, either by splitting them on ',' or with a dedicated binary flag for multi-valued segments; oversample or weight minority classes before modelling.

anthropic:default · confidence medium
Out[75]:

saturn.columns["tap"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 5
top_value -
top_rate 0.9672
cardinality 5
entropy 0.2421
entropy_ratio 0.1043
alert: imbalance (top value is 96.7% of rows)
Fig 28.
Top values for tap.
Top values for tap (5 unique shown, of 5 total).
value count share
- 102023 96.7%
0 2203 2.1%
+ 1218 1.2%
-,+ 25 0.0%
-,-,+ 15 0.0%

trill categorical feature

This column encodes a phonological feature, the presence or absence of a trill articulation, using a small symbol vocabulary of 6 distinct values. The dominant value is "-" (absent/negative), which accounts for 96.15% of 105,484 rows, producing an extremely low entropy of 0.276 bits and triggering an imbalance alert. The minority values ("0", "+", and compound sequences like "-,+") together cover 4,057 rows (3.8%), of which only 1,819 are plain "+", so trill is a rare feature in this phonological dataset. The compound values ("-,+", "-,-,+", "+,-"), with counts of 26, 8, and 2 respectively, hint at multi-segment or allophonic annotation but are statistically negligible.

Treatment: One-hot or ordinal encode with caution; severe class imbalance (96.15% negative class) means most models will need oversampling, class weighting, or collapsing rare compound categories before use.

anthropic:default · confidence high
Out[78]:

saturn.columns["trill"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 6
top_value -
top_rate 0.9615
cardinality 6
entropy 0.2762
entropy_ratio 0.1069
alert: imbalance (top value is 96.2% of rows)
Fig 29.
Top values for trill.
Top values for trill (6 unique shown, of 6 total).
value count share
- 101427 96.2%
0 2202 2.1%
+ 1819 1.7%
-,+ 26 0.0%
-,-,+ 8 0.0%
+,- 2 0.0%
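
Collapsing rare compound categories and weighting classes, as the treatment for this column suggests, might look like the sketch below. The 'composite' bucket label is hypothetical, and the weight formula n_rows / (n_classes * class_count) is one common inverse-frequency heuristic, not a method prescribed by Saturn.

```python
# Counts copied from the trill top-values table.
trill_counts = {"-": 101427, "0": 2202, "+": 1819, "-,+": 26, "-,-,+": 8, "+,-": 2}

def collapse_composites(counts, label="composite"):
    """Merge every comma-separated value into one bucket (label is hypothetical)."""
    merged = {}
    for value, count in counts.items():
        key = label if "," in value else value
        merged[key] = merged.get(key, 0) + count
    return merged

def balanced_weights(counts):
    """Inverse-frequency weights: n_rows / (n_classes * class_count)."""
    n_rows = sum(counts.values())
    n_classes = len(counts)
    return {value: n_rows / (n_classes * count) for value, count in counts.items()}

collapsed = collapse_composites(trill_counts)
weights = balanced_weights(collapsed)
for value, weight in weights.items():
    print(f"{value:>9}  count={collapsed[value]:>6}  weight={weight:.3f}")
```

With this heuristic every class contributes the same total weight, so the rarest bucket receives by far the largest per-row weight; whether that is desirable depends on the model.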

nasal categorical

Out[81]:

saturn.columns["nasal"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 8
top_value -
top_rate 0.8084
cardinality 8
entropy 0.897
entropy_ratio 0.299
Fig 30.
Top values for nasal.
Top values for nasal (8 unique shown, of 8 total).
value count share
- 85269 80.8%
+ 15941 15.1%
0 2150 2.0%
+,- 1973 1.9%
-,+ 95 0.1%
+,-,- 54 0.1%
+,-,+,- 1 0.0%
-,+,- 1 0.0%

lateral categorical feature

This column encodes the phonological feature lateral, marking whether a segment is produced with airflow along the sides of the tongue ('+'), as in l-like sounds, or not ('-'), with '0' for unspecified or not applicable. The dominant value '-' accounts for 93.8% of all 105,484 rows, producing a very low entropy ratio of 0.134, so the column is heavily skewed toward a single class; '+' covers 4.0% (4,211 rows). Five of the 8 values are comma-delimited sequences (e.g., '-,+', '-,-,+'; 155 rows in total), suggesting multi-part segments whose components are encoded in one string rather than in separate fields. No nulls are present.

Treatment: Split comma-delimited compound values into multi-hot binary flags for '-', '+', and '0' presence before modelling; expect severe class imbalance.

anthropic:default · confidence high
Out[84]:

saturn.columns["lateral"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 8
top_value -
top_rate 0.9382
cardinality 8
entropy 0.4012
entropy_ratio 0.1337
Fig 31.
Top values for lateral.
Top values for lateral (8 unique shown, of 8 total).
value count share
- 98968 93.8%
+ 4211 4.0%
0 2150 2.0%
-,+ 135 0.1%
+,- 12 0.0%
-,-,+ 4 0.0%
-,+,- 3 0.0%
0,-,+ 1 0.0%

labial categorical feature

This column captures the labial articulation feature, encoding presence, absence, or non-applicability of labial involvement per segment (or per component of a segment sequence). The dominant value is '-' (absent) at 68.2% of 105,484 rows, with '+' (present) at 26.8%, so most segments are non-labial. About 3.0% of rows (3,122) hold compound strings like '-,+', '+,-', or '-,-,+', indicating multi-segment bundles packed into a single cell rather than a flat atomic feature; this creates a parsing challenge and means the column is not fully normalized.

Treatment: Split compound values on ',' into separate per-segment records or encode as ordered multi-label; then one-hot or ordinal encode the atomic {'-', '+', '0'} values for modelling.

anthropic:default · confidence high
Out[87]:

saturn.columns["labial"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 15
top_value -
top_rate 0.6822
cardinality 15
entropy 1.182
entropy_ratio 0.3025
Fig 32.
Top values for labial.
Top values for labial (15 unique shown, of 15 total).
value count share
- 71961 68.2%
+ 28241 26.8%
-,+ 2414 2.3%
0 2160 2.0%
+,- 531 0.5%
-,-,+ 121 0.1%
+,-,- 21 0.0%
0,+,- 8 0.0%
-,+,- 6 0.0%
0,-,+ 5 0.0%
-,+,+ 5 0.0%
+,+,- 5 0.0%
+,-,+ 4 0.0%
-,-,+,+ 1 0.0%
0,+,-,- 1 0.0%
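
An order-preserving alternative to presence flags is to split each value into fixed-width positional slots. The sketch below assumes a width of 4, the longest sequence seen in the labial table, and pads shorter values with None; the function name and padding choice are illustrative.

```python
def split_positions(value, width=4, pad=None):
    """Split '-,-,+' into ordered slots, padding on the right up to `width`."""
    tokens = value.split(",")
    if len(tokens) > width:
        raise ValueError(f"{value!r} has more than {width} tokens")
    return tokens + [pad] * (width - len(tokens))

for v in ["-", "-,+", "0,+,-", "-,-,+,+"]:
    print(f"{v!r:>10} -> {split_positions(v)}")
```

Each slot can then be one-hot encoded over {'-', '+', '0'}, with the padding value marking positions that do not exist for that segment.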

round categorical

Out[90]:

saturn.columns["round"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 8
top_value 0
top_rate 0.703
cardinality 8
entropy 1.194
entropy_ratio 0.398
Fig 33.
Top values for round.
Top values for round (8 unique shown, of 8 total).
value count share
0 74155 70.3%
+ 16956 16.1%
- 14082 13.3%
-,+ 269 0.3%
-,-,+ 17 0.0%
-,0,+ 3 0.0%
0,-,+ 1 0.0%
+,- 1 0.0%

labiodental categorical feature

This column encodes labiodental phonetic feature annotations, likely from a linguistic or phonological dataset where each row represents a speech sound or segment. The dominant value '0' (74,124 rows, 70.3%) indicates the feature is absent or neutral, while '+' and '-' mark positive and negative feature values respectively — a standard binary distinctive-feature notation. Notably, multi-valued strings like '+,-', '-,+', and '+,+,-' appear (60 rows total), suggesting a small number of segments carry conflicting or composite annotations, which may indicate data entry inconsistency or deliberate underspecification. Entropy ratio of 0.39 confirms moderate imbalance with '0' dominating.

Treatment: One-hot encode '+', '-', '0'; flag or inspect the 60 multi-valued rows ('+,-', '-,+', '+,+,-') for normalization before modelling.

anthropic:default · confidence high
Out[93]:

saturn.columns["labiodental"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 6
top_value 0
top_rate 0.7027
cardinality 6
entropy 1.006
entropy_ratio 0.3891
Fig 34.
Top values for labiodental.
Top values for labiodental (6 unique shown, of 6 total).
value count share
0 74124 70.3%
- 28726 27.2%
+ 2574 2.4%
+,- 56 0.1%
-,+ 3 0.0%
+,+,- 1 0.0%

coronal categorical feature

This column encodes coronal articulation features in what appears to be a phonological or linguistic dataset, using binary '+'/'-' notation common in distinctive feature theory. The dominant value is '-' (62.8% of 105,484 rows), with '+' covering another 35.0%, making the two primary values account for ~98% of observations. The remaining values ('+,-', '-,+', '-,-,+', '+,-,+') suggest multi-segment or compound entries, which are rare (≤87 occurrences) but worth flagging as potential encoding inconsistencies or multi-value cells that may need splitting.

Treatment: Encode primary values '+'/'-'/'0' as ordinal or one-hot features; isolate and inspect the 135 multi-value rows for parsing or splitting before modelling.

anthropic:default · confidence high
Out[96]:

saturn.columns["coronal"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 7
top_value -
top_rate 0.6279
cardinality 7
entropy 1.08
entropy_ratio 0.3848
Fig 35.
Top values for coronal.
Top values for coronal (7 unique shown, of 7 total).
value count share
- 66234 62.8%
+ 36955 35.0%
0 2160 2.0%
+,- 87 0.1%
-,+ 41 0.0%
-,-,+ 6 0.0%
+,-,+ 1 0.0%

anterior categorical feature

This column encodes the phonological feature anterior, which in distinctive feature theory subdivides coronal sounds by whether the constriction is at or in front of the alveolar ridge, as a signed categorical with 6 distinct values across 105,484 rows and zero nulls. The dominant value '0' (64.8% of rows) most likely marks segments to which the feature does not apply, which fits the share of rows that are '-' or '0' for coronal; '+' (24.4%) and '-' (10.8%) mark anterior and non-anterior segments. A small number of rows (9, 5, and 3) contain the compound multi-value strings '-,+', '+,-', and '-,-,+', which deviate from the single-symbol schema and probably describe multi-part segments, though data-entry concatenation cannot be ruled out.

Treatment: Map '0'/'+'/'-' to ordinal or one-hot encoded features; isolate and investigate the 17 compound-value rows before modelling.

anthropic:default · confidence medium
Out[99]:

saturn.columns["anterior"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 6
top_value 0
top_rate 0.6482
cardinality 6
entropy 1.251
entropy_ratio 0.4839
Fig 36.
Top values for anterior.
Top values for anterior (6 unique shown, of 6 total).
value count share
0 68372 64.8%
+ 25704 24.4%
- 11391 10.8%
-,+ 9 0.0%
+,- 5 0.0%
-,-,+ 3 0.0%

distributed categorical label

This column encodes the phonological feature distributed, which distinguishes coronal sounds made with a long constriction along the vocal tract ('+') from those made with a short one ('-'), using '0' where the feature is not applicable and comma-separated sequences for multi-valued segments. The dominant value '0' covers 66.0% of 105,484 rows, with '-' (21.1%) and '+' (12.5%) accounting for most of the remainder. Of the 11 distinct values, 8 are combinations of these three symbols (334 rows in total), suggesting that some records hold one value per component of a multi-part segment rather than a single state. The entropy ratio of 0.37 confirms moderate but uneven information content, heavily concentrated in the '0' class.

Treatment: Encode as a one-hot or multi-hot feature by splitting on commas; use an ordinal mapping such as {'-': -1, '0': 0, '+': 1} only if '0' is meant as a midpoint, since here it most likely means not applicable; treat compound values as sequences.

anthropic:default · confidence medium
Out[102]:

saturn.columns["distributed"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 11
top_value 0
top_rate 0.6602
cardinality 11
entropy 1.273
entropy_ratio 0.3681
Fig 37.
Top values for distributed.
Top values for distributed (11 unique shown, of 11 total).
value count share
0 69639 66.0%
- 22283 21.1%
+ 13228 12.5%
-,+ 296 0.3%
-,-,+ 25 0.0%
+,- 5 0.0%
0,-,+ 3 0.0%
+,-,+ 2 0.0%
0,+,- 1 0.0%
+,+,- 1 0.0%
0,0,-,+ 1 0.0%

strident categorical

Out[105]:

saturn.columns["strident"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 9
top_value 0
top_rate 0.6485
cardinality 9
entropy 1.287
entropy_ratio 0.406
Fig 38.
Top values for strident.
Top values for strident (9 unique shown, of 9 total).
value count share
0 68410 64.9%
- 25410 24.1%
+ 11039 10.5%
-,+ 585 0.6%
-,-,+ 26 0.0%
+,- 7 0.0%
-,+,- 3 0.0%
0,-,+ 3 0.0%
0,0,-,+ 1 0.0%

dorsal categorical feature

This column encodes the phonological feature dorsal, marking whether the tongue body is involved in articulating a segment, using '+', '-', and '0' symbols, sometimes as comma-separated sequences covering several components of a segment. The dominant values are '+' (51.7%, n=54,535) and '-' (44.6%, n=47,052), together accounting for ~96.3% of rows, with unspecified or not applicable coded as '0' (n=2,160). Ten of the 13 categories are multi-value strings like '-,+' or '-,-,+' (1,737 rows in total), suggesting that some records encode ordered sequences of values rather than a single binary state, an implicit structural inconsistency that warrants normalization.

Treatment: Split multi-value entries on ',' into ordered sub-features or one-hot encode each position before modelling; treat '+', '-', '0' as a ternary categorical for single-value rows.

anthropic:default · confidence high
Out[108]:

saturn.columns["dorsal"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 13
top_value +
top_rate 0.517
cardinality 13
entropy 1.235
entropy_ratio 0.3338
Fig 39.
Top values for dorsal.
Top values for dorsal (13 unique shown, of 13 total).
value count share
+ 54535 51.7%
- 47052 44.6%
0 2160 2.0%
-,+ 1530 1.5%
+,- 144 0.1%
-,-,+ 44 0.0%
0,-,+ 6 0.0%
+,-,+ 5 0.0%
-,+,+ 4 0.0%
+,+,-,- 1 0.0%
-,+,- 1 0.0%
+,+,- 1 0.0%
0,0,-,+ 1 0.0%

high categorical feature

This column encodes the phonological feature high, marking whether the tongue body is raised ('+') or not ('-'), with '0' where the feature is unspecified or not applicable and multi-value sequences like '-,+' or '+,-,+' for some segments. The dominant value is '0' at 46.7% (49,247 rows), followed by '+' at 33.7% (35,559) and '-' at 18.2% (19,156), so among specified segments '+' outnumbers '-' by roughly 1.9 to 1. The compound values (e.g., '+,-,+', '-,+,+') are infrequent, with some appearing only once or twice, and may need consolidation or separate treatment.

Treatment: Encode the top 3 values ('0', '+', '-') as one-hot features; collapse the rare multi-value sequences (each ≤845 rows, 1,522 rows in total) into an 'other' category or a structured sub-category.

anthropic:default · confidence medium
Out[111]:

saturn.columns["high"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 11
top_value 0
top_rate 0.4669
cardinality 11
entropy 1.594
entropy_ratio 0.4609
Fig 40.
Top values for high.
Top values for high (11 unique shown, of 11 total).
value count share
0 49247 46.7%
+ 35559 33.7%
- 19156 18.2%
-,+ 845 0.8%
+,- 627 0.6%
+,-,+ 38 0.0%
+,+,- 6 0.0%
-,+,+ 2 0.0%
-,-,+ 2 0.0%
+,-,0 1 0.0%
-,+,- 1 0.0%
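
Collapsing the rare sequences into a single bucket can be sketched as below. The threshold of 1,000 rows and the 'other' label are illustrative assumptions chosen so that only the three atomic values survive.

```python
# Counts copied from the high top-values table.
high_counts = {
    "0": 49247, "+": 35559, "-": 19156, "-,+": 845, "+,-": 627, "+,-,+": 38,
    "+,+,-": 6, "-,+,+": 2, "-,-,+": 2, "+,-,0": 1, "-,+,-": 1,
}

def collapse_rare(counts, min_count=1000, other="other"):
    """Keep values with at least `min_count` rows; merge the rest into `other`."""
    collapsed = {}
    for value, count in counts.items():
        key = value if count >= min_count else other
        collapsed[key] = collapsed.get(key, 0) + count
    return collapsed

collapsed_high = collapse_rare(high_counts)
print(collapsed_high)
```

A count threshold is a blunt rule; for columns where an atomic value is itself rare, filtering on the presence of a comma may be the better criterion.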

low categorical

Out[114]:

saturn.columns["low"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 8
top_value -
top_rate 0.4733
cardinality 8
entropy 1.305
entropy_ratio 0.4351
Fig 41.
Top values for low.
Top values for low (8 unique shown, of 8 total).
value count share
- 49930 47.3%
0 49244 46.7%
+ 5598 5.3%
+,- 417 0.4%
-,+ 270 0.3%
-,+,- 21 0.0%
-,-,+ 3 0.0%
+,-,- 1 0.0%

front categorical feature

This column encodes the phonological feature front, marking whether the tongue body is fronted, using '0', '-', and '+' as atomic tokens that can be combined into sequences (e.g., '-,+', '+,-,-'). The dominant value '0' accounts for 46.75% of rows, with '-' (34,225) and '+' (20,683) making up most of the remainder, so '-' outnumbers '+' by ~1.65:1. The compound sequence values ('-,+', '+,-', etc.) show that multi-part segments are encoded as a single string, an encoding that requires parsing before modelling.

Treatment: Parse composite sequences (e.g., '-,+') into per-component token lists or presence flags; then one-hot or embed atomic states before modelling.

anthropic:default · confidence medium
Out[117]:

saturn.columns["front"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 13
top_value 0
top_rate 0.4675
cardinality 13
entropy 1.592
entropy_ratio 0.4302
Fig 42.
Top values for front.
Top values for front (13 unique shown, of 13 total).
value count share
0 49316 46.8%
- 34225 32.4%
+ 20683 19.6%
-,+ 838 0.8%
+,- 359 0.3%
-,-,+ 24 0.0%
+,-,- 14 0.0%
-,+,+ 10 0.0%
+,-,+ 6 0.0%
-,0,+ 3 0.0%
+,+,- 2 0.0%
0,-,+ 2 0.0%
-,+,- 2 0.0%

back categorical feature

This column encodes the phonological feature back, marking whether the tongue body is retracted, using a compact notation of '+', '-', and '0' tokens. The dominant value is '0' at 46.7% (49,270 rows), followed by '-' at 37.7% (39,749) and '+' at 14.7% (15,547), so '-' outnumbers '+' by about 2.6 to 1. Compound sequences like '+,-' (511) and '-,+' (367) exist but are rare, and a handful of three-token sequences appear at the tail, a variable-length encoding that could cause parsing issues if the column is treated as a simple label.

Treatment: Parse composite multi-token values (e.g., '+,-') into per-component token lists or presence flags before modelling; consider one-hot or frequency encoding for the simple single-token majority.

anthropic:default · confidence medium
Out[120]:

saturn.columns["back"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 12
top_value 0
top_rate 0.4671
cardinality 12
entropy 1.521
entropy_ratio 0.4244
Fig 43.
Top values for back.
Top values for back (12 unique shown, of 12 total).
value count share
0 49270 46.7%
- 39749 37.7%
+ 15547 14.7%
+,- 511 0.5%
-,+ 367 0.3%
+,-,- 19 0.0%
-,-,+ 8 0.0%
-,+,+ 5 0.0%
-,+,- 5 0.0%
0,+,- 1 0.0%
+,-,+ 1 0.0%
+,+,- 1 0.0%

tense categorical

Out[123]:

saturn.columns["tense"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 8
top_value 0
top_rate 0.7132
cardinality 8
entropy 1.114
entropy_ratio 0.3712
Fig 44.
Top values for tense.
Top values for tense (8 unique shown, of 8 total).
value count share
0 75230 71.3%
+ 23411 22.2%
- 6386 6.1%
+,- 268 0.3%
-,+ 179 0.2%
+,-,+ 6 0.0%
+,-,- 3 0.0%
+,+,- 1 0.0%

retractedTongueRoot categorical feature

This column encodes the phonetic feature 'retracted tongue root' (RTR), a distinctive-feature annotation used in phonological databases to mark vowel or consonant articulation. The dominant value is '-' (absence of RTR) at 97.44% of 105,484 rows. Of the remaining 2,696 rows, 2,235 are '0' (unspecified or not applicable) and only 461 contain a '+' token (199 plain '+' plus 262 compound strings), so positive RTR marking is extremely rare in this dataset. The compound values ('-,+', '-,-,+', etc.) suggest per-segment annotation strings rather than single-token labels, indicating a sequence or multi-segment scope.

Treatment: Flag severe class imbalance (97.44% negative); use oversampling or class-weighted models if predicting RTR; consider splitting compound strings into per-segment binary indicators before modelling.

anthropic:default · confidence high
Out[126]:

saturn.columns["retractedTongueRoot"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 7
top_value -
top_rate 0.9744
cardinality 7
entropy 0.1935
entropy_ratio 0.06892
alert: imbalance (top value is 97.4% of rows)
Fig 45.
Top values for retractedTongueRoot.
Top values for retractedTongueRoot (7 unique shown, of 7 total).
value count share
- 102788 97.4%
0 2235 2.1%
-,+ 251 0.2%
+ 199 0.2%
-,-,+ 9 0.0%
-,+,- 1 0.0%
+,- 1 0.0%
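
How many rows carry any positive RTR marking can be counted by checking each value's comma-separated tokens. This is a minimal sketch over the table counts; the helper name is illustrative.

```python
# Counts copied from the retractedTongueRoot top-values table.
rtr_counts = {
    "-": 102788, "0": 2235, "-,+": 251, "+": 199,
    "-,-,+": 9, "-,+,-": 1, "+,-": 1,
}

def rows_with_token(counts, token):
    """Total rows whose value contains `token` as one of its comma-separated parts."""
    return sum(count for value, count in counts.items() if token in value.split(","))

total = sum(rtr_counts.values())
plus_rows = rows_with_token(rtr_counts, "+")
zero_rows = rows_with_token(rtr_counts, "0")
print(f"rows with '+': {plus_rows} ({plus_rows / total:.2%}), rows with '0': {zero_rows}")
```

Separating '+' rows from '0' rows matters here because both fall outside the dominant '-' class but mean different things.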

advancedTongueRoot categorical feature

This column encodes the Advanced Tongue Root (ATR) phonological feature, a binary articulatory distinction marked as '+' (present), '-' (absent), or '0' (neutral/not applicable), typical of linguistic phoneme inventories. The distribution is severely imbalanced: '-' dominates at 97.87% (103,238 of 105,484 rows), '0' appears in only 2,235 rows, and '+' is nearly absent with just 11 occurrences. The near-zero entropy ratio (0.094) confirms that '+ATR' is a vanishingly rare feature in this dataset, which would make any model predicting '+' extremely difficult to train without resampling.

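Treatment: Treat as a one-hot encoded categorical; with only 11 '+' rows, oversampling or reweighting is unlikely to yield a reliable positive class, so consider binarising ('+' vs. all others) or dropping the column rather than merging '+' with '0', which would conflate present with not applicable.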

anthropic:default · confidence high
Out[129]:

saturn.columns["advancedTongueRoot"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3
top_value -
top_rate 0.9787
cardinality 3
entropy 0.1496
entropy_ratio 0.09438
alert: imbalance (top value is 97.9% of rows)
Fig 46.
Top values for advancedTongueRoot.
Top values for advancedTongueRoot (3 unique shown, of 3 total).
value count share
- 103238 97.9%
0 2235 2.1%
+ 11 0.0%

periodicGlottalSource categorical

Out[132]:

saturn.columns["periodicGlottalSource"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 7
top_value +
top_rate 0.6797
cardinality 7
entropy 1.051
entropy_ratio 0.3745
Fig 47.
Top values for periodicGlottalSource.
Top values for periodicGlottalSource (7 unique shown, of 7 total).
value count share
+ 71694 68.0%
- 31179 29.6%
0 2139 2.0%
+,- 371 0.4%
-,+ 87 0.1%
+,-,- 8 0.0%
+,-,+ 6 0.0%

epilaryngealSource categorical feature

This column encodes whether an epilaryngeal phonation source is present, using a ternary scheme: absent ('-'), present ('+'), or unspecified/not applicable ('0'). The distribution is severely imbalanced: the '-' (absent) class dominates at 97.9% of 105,484 rows, while '+' (present) appears in only 31 records (~0.03%), making positive-class detection extremely challenging. The near-zero entropy (0.147 bits) confirms almost no informational variance in this column as-is.

Treatment: Flag severe class imbalance ('+' = 31 of 105,484); apply oversampling or class-weighted modelling, and consider binary encoding after collapsing '-' vs. non-'-'.

anthropic:default · confidence high
Out[135]:

saturn.columns["epilaryngealSource"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3
top_value -
top_rate 0.9793
cardinality 3
entropy 0.1474
entropy_ratio 0.09303
alert: imbalance (top value is 97.9% of rows)
Fig 48.
Top values for epilaryngealSource.
Top values for epilaryngealSource (3 unique shown, of 3 total).
value count share
- 103303 97.9%
0 2150 2.0%
+ 31 0.0%

spreadGlottis categorical

Out[138]:

saturn.columns["spreadGlottis"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 10
top_value -
top_rate 0.9182
cardinality 10
entropy 0.4965
entropy_ratio 0.1495
Fig 49.
Top values for spreadGlottis.
Top values for spreadGlottis (10 unique shown, of 10 total).
value count share
- 96855 91.8%
+ 6156 5.8%
0 2138 2.0%
-,+ 206 0.2%
+,- 115 0.1%
-,-,+ 5 0.0%
+,0,- 5 0.0%
+,-,- 2 0.0%
+,0,-,- 1 0.0%
+,-,+ 1 0.0%

constrictedGlottis categorical feature

This column captures the phonological feature constricted glottis, encoded as presence/absence symbols across 105,484 records with no nulls and only 7 unique values. The dominant value is '-' (absent), appearing in 94.5% of rows (99,727), while '+' (present) accounts for just 3,383 rows (3.2%), so glottal constriction is uncommon. The compound values ('+,-', '-,+', '+,-,-', '-,-,+') suggest multi-segment sequences for a small fraction of records; with counts of 141, 93, 1, and 1 respectively, these are near-negligible. The low entropy (0.372 bits, entropy ratio 0.132) confirms the column is heavily imbalanced toward the negative class.

Treatment: Binarize into presence/absence flag; treat compound multi-segment values as positive; account for severe class imbalance (94.5% negative) during modelling.

anthropic:default · confidence high
Out[141]:

saturn.columns["constrictedGlottis"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 7
top_value -
top_rate 0.9454
cardinality 7
entropy 0.3717
entropy_ratio 0.1324
Fig 50.
Top values for constrictedGlottis.
Top values for constrictedGlottis (7 unique shown, of 7 total).
value count share
- 99727 94.5%
+ 3383 3.2%
0 2138 2.0%
+,- 141 0.1%
-,+ 93 0.1%
+,-,- 1 0.0%
-,-,+ 1 0.0%

fortis categorical feature

This column is a three-valued categorical encoding the phonological feature fortis (strong articulation), with values '-', '0', and '+'. The dominant value is '-' at 68.1% (71,867 rows), '0' (most likely unspecified or not applicable) accounts for 31.5% (33,202 rows), and '+' is strikingly rare at only 415 occurrences (~0.4%), creating a heavily imbalanced distribution. The near-absence of '+' values means fortis is marked for very few segments, and the class imbalance should be addressed before modelling.

Treatment: One-hot encode, or ordinal-encode as -1/0/1 only if '0' is meant as a midpoint rather than not applicable, and apply class-imbalance handling (e.g. oversampling '+') before modelling.

anthropic:default · confidence high
Out[144]:

saturn.columns["fortis"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3
top_value -
top_rate 0.6813
cardinality 3
entropy 0.9335
entropy_ratio 0.589
Fig 51.
Top values for fortis.
Top values for fortis (3 unique shown, of 3 total).
value count share
- 71867 68.1%
0 33202 31.5%
+ 415 0.4%
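
For a purely ternary column like this one, one-hot encoding avoids imposing an order on '-', '0', and '+'. The sketch below is illustrative; the category order and function name are assumptions.

```python
CATEGORIES = ("-", "0", "+")

def one_hot(value, categories=CATEGORIES):
    """Return a 0/1 indicator list with exactly one position set."""
    if value not in categories:
        raise ValueError(f"unexpected value: {value!r}")
    return [int(value == c) for c in categories]

# Counts copied from the fortis top-values table.
fortis_counts = {"-": 71867, "0": 33202, "+": 415}
for value in fortis_counts:
    print(f"{value!r} -> {one_hot(value)}")
```

Raising on unexpected values is deliberate: a compound string appearing in a column assumed to be ternary should be surfaced, not silently encoded.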

lenis categorical feature

This column represents a ternary linguistic or phonological feature, almost certainly a 'lenis' (weak/voiced) marker with values '-' (absent/irrelevant), '0' (neutral), and '+' (present/lenis). The dominant value is '-' at 68.1% (71,866 rows), while the positive lenis marker '+' is strikingly rare at only 416 occurrences (~0.4%), creating a heavily imbalanced distribution. No nulls exist across all 105,484 rows, and entropy is moderate at 0.93 bits, well below the theoretical maximum, confirming the skew toward the negative class.

Treatment: Encode as ordinal or one-hot; be aware that the '+' class (416 samples) is severely underrepresented and will require oversampling or class-weight adjustment before modelling.

anthropic:default · confidence high
Out[147]:

saturn.columns["lenis"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 3
top_value -
top_rate 0.6813
cardinality 3
entropy 0.9336
entropy_ratio 0.589
Fig 52.
Top values for lenis.
Top values for lenis (3 unique shown, of 3 total).
value count share
- 71866 68.1%
0 33202 31.5%
+ 416 0.4%

raisedLarynxEjective categorical feature

This column encodes the presence or absence of a raised-larynx ejective phonological feature, using the '+'/'-'/'0' notation of distinctive feature matrices. The dominant value is '-' (absent), appearing in 96.4% of 105,484 rows, creating severe class imbalance with an entropy ratio of only 0.10. Minority values include '0' (2,150 rows), '+' (1,573 rows), and compound sequences like '-,+' and '+,-', suggesting some segments receive multi-valued or composite feature annotations. The absence of nulls (0.0%) indicates complete annotation coverage, but the extreme skew means this feature will contribute little signal in most modelling contexts without deliberate resampling or grouping.

Treatment: Collapse compound values ('-,+', '+,-', '-,-,+') into a unified category and apply oversampling or class-weight adjustment before modelling given 96.4% imbalance.

anthropic:default · confidence high
Out[150]:

saturn.columns["raisedLarynxEjective"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 6
top_value -
top_rate 0.9637
cardinality 6
entropy 0.2675
entropy_ratio 0.1035
alert: imbalance (top value is 96.4% of rows)
Fig 53.
Top values for raisedLarynxEjective.
Top values for raisedLarynxEjective (6 unique shown, of 6 total).
value count share
- 101652 96.4%
0 2150 2.0%
+ 1573 1.5%
-,+ 85 0.1%
+,- 23 0.0%
-,-,+ 1 0.0%

loweredLarynxImplosive categorical feature

This column encodes a phonological feature — specifically whether a sound has a lowered larynx implosive articulation — using a symbolic notation system ('+', '-', and combinations). The distribution is severely imbalanced: 97.3% of the 105,484 rows carry the default/absent value '-', with only 2,150 instances of '0', 716 of '+', and fewer than 10 combined entries. The near-zero entropy ratio (0.088) confirms this column carries almost no information for most records, which would make it a very weak standalone predictor.

Treatment: Flag the severe imbalance; consider binarising ('+' vs. all others) or dropping if the rare positive class is insufficient for modelling purposes.

anthropic:default · confidence high
Out[153]:

saturn.columns["loweredLarynxImplosive"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 5
top_value -
top_rate 0.9727
cardinality 5
entropy 0.2034
entropy_ratio 0.08759
alert: imbalance (top value is 97.3% of rows)
Fig 54.
Top values for loweredLarynxImplosive.
Top values for loweredLarynxImplosive (5 unique shown, of 5 total).
value count share
- 102609 97.3%
0 2150 2.0%
+ 716 0.7%
-,+ 7 0.0%
+,- 2 0.0%

click categorical label

This column encodes the phonological feature click, marking whether a segment is a click consonant, with symbolic tokens ('-', '0', '+') and combination values ('+,-', '-,+'). The dominant value is '-' at 68.2% of 105,484 rows and '0' covers another 31.5% (33,202 rows), while values containing '+' ('+', '+,-', '-,+') account for only 311 rows combined (~0.3%), making this a heavily imbalanced feature. The multi-value strings '+,-' and '-,+' imply that ordered sequences of values, probably one per component of a multi-part segment, were packed into a single cell rather than normalized; this warrants investigation before modelling.

Treatment: Ordinal-encode or one-hot after splitting multi-value entries ('+,-', '-,+'); treat class imbalance (~0.3% positive) before using as a classification target.

anthropic:default · confidence high
Out[156]:

saturn.columns["click"].stats

stat value
n 105,484
nulls 0 (0.0%)
unique 5
top_value -
top_rate 0.6823
cardinality 5
entropy 0.9283
entropy_ratio 0.3998
Fig 55.
Top values for click.
Top values for click (5 unique shown, of 5 total).
value count share
- 71971 68.2%
0 33202 31.5%
+ 253 0.2%
+,- 52 0.0%
-,+ 6 0.0%

How to cite

BibTeX
@misc{saturn-data-trove-phoible-phonetics-database-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove phoible phonetics database},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-phoible-phonetics-database}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove phoible phonetics database. Source: /home/coolhand/html/datavis/data_trove/data/linguistic/phoible.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-phoible-phonetics-database