saturn·

data trove wals world atlas of language structures

saturn notebook · generated 2026-06-22

Overview

Source: /home/coolhand/html/datavis/data_trove/data/linguistic/wals_parameters.csv

Saturn profiled 192 rows across 5 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/linguistic/wals_parameters.csv",
    "--findings", "data-trove-wals-world-atlas-of-language-structures.json",
    "--llm", "anthropic:default",
])

Summary confidence: medium

This dataset is a catalogue of 192 linguistic parameters from the World Atlas of Language Structures (WALS), each representing a distinct typological feature of human languages (e.g., 'Consonant Inventories', 'Vowel Quality Inventories'). Every row is uniquely identified by both an ID (e.g., '1A', '2A') and a Name, so neither column contains duplicates or groups to aggregate. The most analytically interesting column is Chapter_ID, which maps these 192 parameters onto 144 distinct chapter values, so some chapters contain multiple parameters worth investigating. The summary statistics (mean ~84.5, median ~89.5, skew about -0.2) describe a broadly even spread across the 1–144 range, but the histogram shows one clear concentration: the 133–144 bin holds 44 of the 192 rows (about 23%), while every other bin holds between 11 and 17.

citing: row_count · column_count · n_unique · null_rate · mean · median · skew · max · min

Out[4]:

saturn.schema() · 5 columns

column | kind | n | null% | unique | alerts
ID | categorical | 192 | 0.0% | 192 | long_tail
Name | categorical | 192 | 0.0% | 192 | long_tail
Description | unknown | 192 | 0.0% | | skipped
ColumnSpec | unknown | 192 | 0.0% | | skipped
Chapter_ID | numeric | 192 | 0.0% | 144 |
Fig 1.
Chapter_ID · Look for whether parameters cluster in certain chapter ranges or are evenly spread across the 1–144 chapter span.
Histogram bins for Chapter_ID (median: 89.5).
bin | count
1 – 12 | 12
12 – 23 | 12
23 – 34 | 12
34 – 45 | 12
45 – 56 | 11
56 – 67 | 12
67 – 78 | 11
78 – 89 | 13
89 – 100 | 17
100 – 111 | 13
111 – 122 | 11
122 – 133 | 12
133 – 144 | 44
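The bin table can be checked with a few lines of arithmetic. The sketch below copies the 13 counts from the table, confirms that they sum to the 192 profiled rows, and reports each bin's share. The equal-width edge formula (width 11, starting at 1) is inferred from the printed bin labels, not taken from saturn's documentation.

```python
# Bin counts copied from the Chapter_ID histogram table above.
counts = [12, 12, 12, 12, 11, 12, 11, 13, 17, 13, 11, 12, 44]

# Edges inferred from the printed labels: 13 equal-width bins from 1 to 144.
width = (144 - 1) // len(counts)          # 11
edges = [1 + width * i for i in range(len(counts) + 1)]

total = sum(counts)
shares = [c / total for c in counts]

for lo, hi, c, s in zip(edges, edges[1:], counts, shares):
    print(f"{lo:>3} – {hi:<3} {c:>3}  {s:6.1%}")

print("total rows:", total)                        # 192
print("share in last bin:", f"{shares[-1]:.1%}")   # 22.9%
```

The last bin holds nearly a quarter of all rows, which is the clustering the figure caption asks the reader to look for.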
Fig 2.
Name · Browse the top parameter names to get a sense of which linguistic features are catalogued — useful for a quick domain orientation.
Top values for Name (20 unique shown, of 192 total).
value | count | share
Consonant Inventories | 1 | 0.5%
Vowel Quality Inventories | 1 | 0.5%
Consonant-Vowel Ratio | 1 | 0.5%
Voicing in Plosives and Fricatives | 1 | 0.5%
Voicing and Gaps in Plosive Systems | 1 | 0.5%
Uvular Consonants | 1 | 0.5%
Glottalized Consonants | 1 | 0.5%
Lateral Consonants | 1 | 0.5%
The Velar Nasal | 1 | 0.5%
Vowel Nasalization | 1 | 0.5%
Nasal Vowels in West Africa | 1 | 0.5%
Front Rounded Vowels | 1 | 0.5%
Syllable Structure | 1 | 0.5%
Tone | 1 | 0.5%
Fixed Stress Locations | 1 | 0.5%
Weight-Sensitive Stress | 1 | 0.5%
Weight Factors in Weight-Sensitive Stress Systems | 1 | 0.5%
Rhythm Types | 1 | 0.5%
Absence of Common Consonants | 1 | 0.5%
Presence of Uncommon Consonants | 1 | 0.5%
Fig 3.
ID · Check the ID naming pattern (e.g., '1A', '2A') to understand how parameters are coded and whether letter suffixes imply sub-groupings.
Top values for ID (20 unique shown, of 192 total).
value | count | share
1A | 1 | 0.5%
2A | 1 | 0.5%
3A | 1 | 0.5%
4A | 1 | 0.5%
5A | 1 | 0.5%
6A | 1 | 0.5%
7A | 1 | 0.5%
8A | 1 | 0.5%
9A | 1 | 0.5%
10A | 1 | 0.5%
10B | 1 | 0.5%
11A | 1 | 0.5%
12A | 1 | 0.5%
13A | 1 | 0.5%
14A | 1 | 0.5%
15A | 1 | 0.5%
16A | 1 | 0.5%
17A | 1 | 0.5%
18A | 1 | 0.5%
19A | 1 | 0.5%
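One way to test whether the letter suffix marks sub-groupings is to split each ID into a numeric prefix and a letter, then group the letters by prefix. The sketch below does this for the 20 IDs listed above; the regular expression and the reading of the prefix as a shared group number are assumptions, not something the profile confirms.

```python
import re
from collections import defaultdict

# The 20 IDs shown in the table above.
ids = ["1A", "2A", "3A", "4A", "5A", "6A", "7A", "8A", "9A", "10A",
       "10B", "11A", "12A", "13A", "14A", "15A", "16A", "17A", "18A", "19A"]

# Assumed pattern: one or more digits followed by a single capital letter.
ID_PATTERN = re.compile(r"^(\d+)([A-Z])$")

suffixes_by_prefix = defaultdict(list)
unparsed = []
for value in ids:
    match = ID_PATTERN.match(value)
    if match is None:
        unparsed.append(value)      # keep anything that breaks the assumed pattern
        continue
    prefix, letter = match.groups()
    suffixes_by_prefix[int(prefix)].append(letter)

# Prefixes that carry more than one letter suffix.
multi = {p: s for p, s in suffixes_by_prefix.items() if len(s) > 1}
print(multi)        # {10: ['A', 'B']}
print(unparsed)     # []
```

Run against all 192 IDs, a non-empty `unparsed` list would show where the assumed pattern fails.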
Fig 4.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column | kind | null %
ID | categorical | 0.0%
Name | categorical | 0.0%
Description | unknown | 0.0%
ColumnSpec | unknown | 0.0%
Chapter_ID | numeric | 0.0%

ID categorical identifier

This column is a row identifier, with 192 unique values across 192 rows, zero nulls, and an entropy ratio of 1.0 (the maximum): every value appears exactly once. The alphanumeric pattern (e.g., '1A', '2A', …, '10A', '10B') suggests a structured code made of a number followed by a letter suffix, where the suffix appears to distinguish multiple parameters that share a number. The long_tail alert is a statistical artifact of perfect uniqueness rather than a genuine distributional concern.

Treatment: Exclude from modelling features; retain as a row key for traceability or join operations.

anthropic:default · confidence high
Out[10]:

saturn.columns["ID"].stats

stat | value
n | 192
nulls | 0 (0.0%)
unique | 192
top_value | 1A
top_rate | 0.005208
cardinality | 192
entropy | 7.585
entropy_ratio | 1
alert: long_tail | 192 singleton categories
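The entropy figures follow directly from perfect uniqueness. As a minimal sketch, assuming saturn reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2(cardinality) (an assumption that reproduces the values above but is not confirmed by the source), 192 values that each appear once give:

```python
import math
from collections import Counter

def entropy_stats(values):
    """Return (entropy in bits, entropy ratio, top rate) for a non-empty list."""
    counts = Counter(values)
    n = len(values)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    # Maximum entropy for this cardinality; zero when only one category exists.
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    ratio = entropy / max_entropy if max_entropy else 0.0
    top_rate = max(counts.values()) / n
    return entropy, ratio, top_rate

# Stand-in for the column: 192 distinct placeholder values, each appearing once.
values = [f"id-{i}" for i in range(192)]
entropy, ratio, top_rate = entropy_stats(values)

print(round(entropy, 3))    # 7.585
print(round(ratio, 3))      # 1.0
print(round(top_rate, 6))   # 0.005208
```

Any column in which every value is unique produces the same three numbers for n = 192, which is why the ID and Name columns report identical statistics.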
Fig 5.
Top values for ID.
Top values for ID (20 unique shown, of 192 total).
value | count | share
1A | 1 | 0.5%
2A | 1 | 0.5%
3A | 1 | 0.5%
4A | 1 | 0.5%
5A | 1 | 0.5%
6A | 1 | 0.5%
7A | 1 | 0.5%
8A | 1 | 0.5%
9A | 1 | 0.5%
10A | 1 | 0.5%
10B | 1 | 0.5%
11A | 1 | 0.5%
12A | 1 | 0.5%
13A | 1 | 0.5%
14A | 1 | 0.5%
15A | 1 | 0.5%
16A | 1 | 0.5%
17A | 1 | 0.5%
18A | 1 | 0.5%
19A | 1 | 0.5%

Name categorical label

This column contains the names of linguistic typology features or chapters, almost certainly from the World Atlas of Language Structures (WALS) or a similar comparative linguistics dataset. Every one of the 192 rows has a unique name (cardinality 192, entropy_ratio 1.0), meaning this column is a perfect natural-language identifier with no repetition. The top values — 'Consonant Inventories', 'Vowel Quality Inventories', 'Consonant-Vowel Ratio' — confirm these are named phonological/typological feature categories, not free-form text.

Treatment: Use as a human-readable row label or index; do not encode as a categorical feature — treat as an identifier or join key on feature name.

anthropic:default · confidence high
Out[13]:

saturn.columns["Name"].stats

stat | value
n | 192
nulls | 0 (0.0%)
unique | 192
top_value | Consonant Inventories
top_rate | 0.005208
cardinality | 192
entropy | 7.585
entropy_ratio | 1
alert: long_tail | 192 singleton categories
Fig 6.
Top values for Name.
Top values for Name (20 unique shown, of 192 total).
value | count | share
Consonant Inventories | 1 | 0.5%
Vowel Quality Inventories | 1 | 0.5%
Consonant-Vowel Ratio | 1 | 0.5%
Voicing in Plosives and Fricatives | 1 | 0.5%
Voicing and Gaps in Plosive Systems | 1 | 0.5%
Uvular Consonants | 1 | 0.5%
Glottalized Consonants | 1 | 0.5%
Lateral Consonants | 1 | 0.5%
The Velar Nasal | 1 | 0.5%
Vowel Nasalization | 1 | 0.5%
Nasal Vowels in West Africa | 1 | 0.5%
Front Rounded Vowels | 1 | 0.5%
Syllable Structure | 1 | 0.5%
Tone | 1 | 0.5%
Fixed Stress Locations | 1 | 0.5%
Weight-Sensitive Stress | 1 | 0.5%
Weight Factors in Weight-Sensitive Stress Systems | 1 | 0.5%
Rhythm Types | 1 | 0.5%
Absence of Common Consonants | 1 | 0.5%
Presence of Uncommon Consonants | 1 | 0.5%

Description unknown free_text

This column contains textual descriptions with 192 non-null rows and a null rate of 0.0%, suggesting complete population coverage. Profiling was skipped entirely (alert: 'skipped'), so no uniqueness, frequency, or length statistics are available — the column's content distribution is opaque. The 'unknown' kind designation indicates the profiler could not classify it, which is common for free-text or mixed-content fields. Downstream treatment should proceed cautiously given the absence of any statistical evidence.

Treatment: Inspect raw values manually, then tokenize and embed or apply NLP preprocessing before modelling.

anthropic:default · confidence low
Out[16]:

saturn.columns["Description"].stats

stat | value
n | 192
nulls | 0 (0.0%)
unique |
alert: skipped | no profiler for kind=unknown
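Because the profiler produced no statistics for this column, a first manual pass can be as simple as measuring length and token counts. The sketch below uses only the standard library; the sample strings are hypothetical placeholders, not values from the dataset, and the simple regex tokenizer is a stand-in for whatever NLP preprocessing is eventually chosen.

```python
import re
import statistics

# Hypothetical placeholder texts; replace with the real Description values.
descriptions = [
    "First placeholder description.",
    "A second, slightly longer placeholder description.",
    "",
]

def tokenize(text):
    """Lowercase and split on anything that is not a letter, digit, or apostrophe."""
    return [tok for tok in re.split(r"[^a-z0-9']+", text.lower()) if tok]

char_lengths = [len(d) for d in descriptions]
token_counts = [len(tokenize(d)) for d in descriptions]
n_empty = sum(1 for d in descriptions if not d.strip())

print("empty strings:", n_empty)
print("median chars:", statistics.median(char_lengths))
print("median tokens:", statistics.median(token_counts))
print("distinct values:", len(set(descriptions)))
```

Counting empty strings separately matters here: a 0.0% null rate does not rule out cells that are present but blank.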

ColumnSpec unknown other

The column 'ColumnSpec' was skipped by the profiler, so no distributional statistics, cardinality, or type classification are available beyond 192 non-null rows. Without further evidence it is impossible to characterise the column's content, range, or role. The alert text ('no profiler for kind=unknown') indicates that the column was skipped because its kind could not be classified; since the source is a CSV file, the raw cells are text, possibly holding structured or mixed content that the profiler did not recognise.

Treatment: Inspect raw values manually to determine type and content before assigning a profiling strategy or modelling treatment.

anthropic:default · confidence low
Out[18]:

saturn.columns["ColumnSpec"].stats

stat | value
n | 192
nulls | 0 (0.0%)
unique |
alert: skipped | no profiler for kind=unknown
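A quick way to start the manual inspection is to bucket raw cell values by what they parse as. The sketch below is one possible approach and is not part of saturn; the sample values are hypothetical placeholders, and the guess that some cells might hold JSON is only an assumption to be tested against the real column.

```python
import json
from collections import Counter

def classify(cell):
    """Bucket a raw CSV cell as empty, number, json, or text."""
    text = cell.strip()
    if not text:
        return "empty"
    try:
        float(text)
        return "number"
    except ValueError:
        pass
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return "text"
    # Only containers count as JSON here; bare literals such as true or null stay text.
    return "json" if isinstance(parsed, (dict, list)) else "text"

# Hypothetical placeholder cells; replace with the real ColumnSpec values.
cells = ["", "  ", "42", '{"key": "value"}', "[1, 2]", "plain words"]
print(Counter(classify(c) for c in cells))
```

If one bucket dominates on the real data, that bucket suggests which profiling strategy to assign.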

Chapter_ID numeric foreign_key

This column is a chapter identifier, most plausibly a foreign key linking each parameter to one of 144 distinct chapters; it cannot serve as a primary key for this table because values repeat. With 192 rows but only 144 unique values, each chapter ID appears on average 1.33 times, so some chapters have multiple associated records. Skew (-0.195) and kurtosis (-1.262) are close to what a flat distribution would give, and the middle half of the rows spans 44.75 to 129.2 (IQR 84.5) within the 1–144 range. The histogram qualifies this picture: the 133–144 bin contains 44 rows, far more than any other bin, which suggests that most of the repeated chapter IDs sit at the top of the range. Whether that reflects intentional one-to-many relationships or a data quality issue still needs checking.

Treatment: Left-join on this ID to the chapters reference table; investigate the 48 rows beyond the first occurrence of each chapter (192 − 144) to confirm they represent legitimate one-to-many relationships.

anthropic:default · confidence high
Out[20]:

saturn.columns["Chapter_ID"].stats

stat | value
n | 192
nulls | 0 (0.0%)
unique | 144
min | 1
max | 144
mean | 84.52
median | 89.5
std | 45.75
q1 | 44.75
q3 | 129.2
iqr | 84.5
skew | -0.195
kurtosis | -1.262
n_outliers | 0
outlier_rate | 0
zero_rate | 0
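The zero outlier count is consistent with the quartiles. As a sketch, assuming saturn flags outliers with the common 1.5 × IQR (Tukey) fence rule, which the source does not state, the reported quartiles give fences far outside the observed 1 to 144 range:

```python
# Quartiles and range as reported in the stats table above.
q1, q3 = 44.75, 129.2
observed_min, observed_max = 1, 144

# About 84.45 from the rounded quartiles; the table reports 84.5,
# presumably computed from unrounded values.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr:.2f}")
print(f"fences: [{lower_fence:.1f}, {upper_fence:.1f}]")

# No value can fall outside the fences if min and max are both inside them.
no_outliers_possible = lower_fence <= observed_min and observed_max <= upper_fence
print("outliers impossible under this rule:", no_outliers_possible)   # True
```

Under this rule an outlier would need a chapter ID below roughly -82 or above roughly 256, so the heavy last bin cannot register as outliers even though it dominates the histogram.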
Fig 7.
Distribution of Chapter_ID. Vertical dash marks the median.
Histogram bins for Chapter_ID (median: 89.5).
bin | count
1 – 12 | 12
12 – 23 | 12
23 – 34 | 12
34 – 45 | 12
45 – 56 | 11
56 – 67 | 12
67 – 78 | 11
78 – 89 | 13
89 – 100 | 17
100 – 111 | 13
111 – 122 | 11
122 – 133 | 12
133 – 144 | 44
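The duplication pattern can be bounded from the published numbers alone, and then checked row by row once the CSV is loaded. The first part below uses only figures from this report; it assumes Chapter_ID holds whole numbers and that the last histogram bin covers IDs 133 to 144. The second part is a sketch of the per-chapter count, run here on a small hypothetical sample (the ID-to-chapter pairing is assumed) rather than the real file.

```python
from collections import Counter

# Part 1: arithmetic on the reported figures.
n_rows, n_unique = 192, 144
extra_rows = n_rows - n_unique               # rows beyond the first per chapter
mean_per_chapter = n_rows / n_unique

# The last bin holds 44 rows but spans at most 12 whole-number IDs (133..144),
# so at least 44 - 12 of the extra rows must sit in that bin.
last_bin_rows, last_bin_ids = 44, 12
min_extra_in_last_bin = last_bin_rows - last_bin_ids

print(extra_rows, round(mean_per_chapter, 2), min_extra_in_last_bin)   # 48 1.33 32

# Part 2: per-chapter counts on hypothetical (ID, Chapter_ID) pairs.
sample = [("1A", 1), ("2A", 2), ("10A", 10), ("10B", 10)]
per_chapter = Counter(chapter for _, chapter in sample)
multi = {chapter: n for chapter, n in per_chapter.items() if n > 1}
print(multi)                                                          # {10: 2}
```

On these assumptions, at least 32 of the 48 repeated rows belong to the last dozen chapters, so that range is the natural place to start the one-to-many check.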

How to cite


BibTeX
@misc{saturn-data-trove-wals-world-atlas-of-language-structures-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove wals world atlas of language structures},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-wals-world-atlas-of-language-structures}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}
APA
Steuber, L. (2026). Saturn reading: data trove wals world atlas of language structures. Source: /home/coolhand/html/datavis/data_trove/data/linguistic/wals_parameters.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-wals-world-atlas-of-language-structures