This dataset contains 76,475 rows of linguistic feature observations, all sourced from WALS (World Atlas of Language Structures). Each row pairs a language (2,659 unique language IDs) with one of 192 typological features (e.g., 'Order of Object and Verb') and a categorical value encoded as both a short code (value_name) and a small integer (value). The most common features cluster around word order and negation, with 'Order of Object and Verb' being the top feature at 1,518 rows. Worth a closer look: the `value` column is highly skewed (skew 3.49, kurtosis 16.4) with ~3.2% outliers reaching up to 28, suggesting most features have only a few possible values but a handful have many categories. The `source` column is constant ('WALS') and can be ignored as a variable.
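One way to check the reading above (most features have few categories, a handful have many) is to count the distinct `value` codes per `feature_id`. The following is a minimal stdlib-only sketch. The function name `categories_per_feature` and the toy rows, with made-up feature IDs `F1` to `F3`, are assumptions for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def categories_per_feature(rows):
    """Map each feature_id to the number of distinct `value` codes seen for it.

    `rows` is any iterable of (feature_id, value) pairs.
    """
    seen = defaultdict(set)
    for feature_id, value in rows:
        seen[feature_id].add(value)
    return {fid: len(vals) for fid, vals in seen.items()}


# Toy rows with hypothetical feature IDs (not real WALS data).
toy_rows = [
    ("F1", 1), ("F1", 2), ("F1", 3), ("F1", 1),
    ("F2", 1), ("F2", 2),
    ("F3", 1), ("F3", 5), ("F3", 9), ("F3", 12),
]
counts = categories_per_feature(toy_rows)
for fid, k in sorted(counts.items()):
    print(fid, k)
```

On the real table, a long right tail in these per-feature counts would support the interpretation of the skew in `value`.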
saturn
/home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet · 76,475 rows · sample n=76,475 · seed 42 · 2026-05-01T17:23:21+00:00
Overview
| Field | Value |
| --- | --- |
| Source | /home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet |
| Total rows | 76,475 |
| Profiled sample | 76,475 |
| Columns | 6 |
| Generated | 2026-05-01T17:23:21+00:00 |
Insights opt-in
Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.
`language_id` holds 2-3 character language codes (eng, fre, ger, rus, ...), with len_mean 2.996 and one_word_rate 1.0. Since `source` is constant 'WALS', these are most likely WALS language codes, which resemble ISO 639 codes but do not always match them. With 76,475 rows but only 2,659 unique values and a 96.5% duplicate_rate, the column behaves as a categorical key rather than free text. The top codes are closely balanced (eng 159, fre 158, ger 157), suggesting a curated multilingual catalogue rather than organic usage.
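The duplicate_rate quoted above is consistent with the simple definition 1 - unique / total. The sketch below assumes that definition (the profile does not state how saturn computes it) and plugs in the row and unique counts reported in this document. The function name `duplicate_rate` is illustrative.

```python
def duplicate_rate(n_rows: int, n_unique: int) -> float:
    """Share of rows that repeat an already-seen value: 1 - unique / total.

    Assumed definition; it reproduces the figures reported in the profile.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return 1.0 - n_unique / n_rows


# Counts as reported in the profile.
language_id_rate = duplicate_rate(76_475, 2_659)
value_name_rate = duplicate_rate(76_475, 1_139)

print(f"language_id: {language_id_rate:.3f}")  # prints "language_id: 0.965"
print(f"value_name:  {value_name_rate:.3f}")   # prints "value_name:  0.985"
```

Both results agree with the reported 96.5% and 0.985, which lends some support to the assumed definition.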
`feature_id` is a categorical code with 192 distinct values across 76,475 rows and no nulls. Codes like '83A', '82A', '143E' follow a numeric prefix plus letter suffix scheme. The distribution is fairly flat: the entropy ratio is 0.937, the top value '83A' covers only 1.98% of rows, and four codes sharing the '143' prefix ('143A', '143E', '143F', '143G') tie at exactly 1,325 occurrences, hinting that they were coded together for the same set of languages rather than sampled independently.
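The entropy ratio can be read as Shannon entropy normalised by its maximum, log(k) for k distinct values, so 1.0 means perfectly uniform and 0.0 means constant. The sketch below assumes that definition; saturn's exact formula is not given in the profile, and the function name `entropy_ratio` is illustrative.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy of the value counts divided by log(k), k = distinct values.

    Returns 0.0 for an empty or constant column, where the ratio is undefined.
    """
    counts = Counter(values)
    k = len(counts)
    if k <= 1:
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


uniform = entropy_ratio(["a", "b", "c", "d"])        # every value equally common
skewed = entropy_ratio(["a"] * 97 + ["b", "c", "d"])  # one value dominates
constant = entropy_ratio(["WALS"] * 5)                # a single repeated value
print(uniform, skewed, constant)
```

A value near 0.94 across 192 codes therefore indicates that no small group of features dominates the table.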
`feature_name` holds the human-readable labels of the typological features (e.g., 'Order of Object and Verb', 'Order of Subject and Verb'). With 192 distinct values across 76,475 rows and a top rate of just 1.98%, the distribution is fairly flat (entropy ratio 0.94), suggesting each feature is sampled across many languages rather than a few features dominating. The top-20 counts match those of `feature_id` rank for rank, which is consistent with a one-to-one mapping between the two columns. There are no nulls and no obvious duplicates among the top values.
`value` is a small-integer category code rather than a count or rating: within each feature, the integer indexes one of that feature's possible values. It has only 28 distinct values ranging from 1 to 28, with mean 2.85 and median 2. The distribution is heavily right-skewed (skew 3.49, kurtosis 16.36), with 2,469 outliers (3.23%) stretching the tail to 28 while the interquartile range spans 1 to 4. There are no nulls or zeros, so every row carries a positive code.
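To make the outlier and skew figures more concrete, the sketch below shows two standard calculations: Tukey's 1.5 × IQR fences and moment-based skewness and excess kurtosis. Whether saturn uses these exact rules is an assumption: the profile does not say how it flags outliers, nor whether the reported kurtosis is raw or excess. The function names are illustrative.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Lower and upper outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def skew_and_excess_kurtosis(xs):
    """Population (biased) skewness and excess kurtosis from central moments.

    Libraries such as pandas apply bias corrections, so their results can
    differ slightly from these on small samples.
    """
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        raise ValueError("variance is zero; skewness is undefined")
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Quartiles as reported in the profile (Q1 = 1, Q3 = 4).
low, high = tukey_fences(1, 4)
print(low, high)  # prints "-3.5 8.5"

# Toy samples (not dataset values) to show the sign of the skew.
sym_skew, sym_kurt = skew_and_excess_kurtosis([1, 2, 3, 4, 5])
right_skew, _ = skew_and_excess_kurtosis([1, 1, 1, 2, 10])
```

Under the 1.5 × IQR rule, any `value` of 9 or more would fall above the upper fence of 8.5.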
`value_name` holds short alphanumeric codes (e.g. '143G-4', '82A-1'): uppercase, single-token, 4-7 characters long, with 1.0 one_word_rate and 1.0 allcaps_rate. Across 76,475 rows there are only 1,139 distinct values and a 0.985 duplicate_rate, so it behaves as a categorical key rather than free text. The pattern (digits + letter + dash + number) suggests a feature-value identifier that joins a feature code to a value index, drawn from a fixed vocabulary of 911 tokens.
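The pattern described above can be checked with a regular expression. The sketch below assumes, based only on the sample codes shown in this profile, that each `value_name` has the form `<feature_id>-<value>`; that relationship should be verified against the full table. The names `parse_value_name` and `is_consistent` are illustrative.

```python
import re

# One or more digits, one uppercase letter, a dash, then one or more digits.
VALUE_NAME_RE = re.compile(r"(\d+[A-Z])-(\d+)")


def parse_value_name(value_name: str):
    """Split a code like '143G-4' into ('143G', 4); return None if it does not fit."""
    match = VALUE_NAME_RE.fullmatch(value_name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_consistent(feature_id: str, value: int, value_name: str) -> bool:
    """True when value_name equals feature_id + '-' + value (assumed convention)."""
    return parse_value_name(value_name) == (feature_id, value)


print(parse_value_name("143G-4"))         # prints "('143G', 4)"
print(parse_value_name("WALS"))           # prints "None"
print(is_consistent("82A", 1, "82A-1"))   # prints "True"
```

Rows where `is_consistent` returns False would indicate either a different naming convention or a data-quality issue worth inspecting.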
`source` records the data origin and holds the constant value "WALS" across all 76,475 rows. With a cardinality of 1 and an entropy of 0.0, it carries no information for modelling and merely tags the dataset's provenance.
language_id text
Sample values (first 10)
- abi
- rus
- tmn
- tag
- kmz
- yim
- kom
- tuy
- tag
- aml
feature_id categorical
Top values (rank 1–20)
- 83A — 1,518
- 82A — 1,496
- 81A — 1,376
- 87A — 1,367
- 143A — 1,325
- 143E — 1,325
- 143F — 1,325
- 143G — 1,325
- 97A — 1,316
- 86A — 1,249
- 88A — 1,225
- 144A — 1,190
- 85A — 1,184
- 112A — 1,157
- 89A — 1,154
- 95A — 1,142
- 69A — 1,131
- 33A — 1,066
- 51A — 1,031
- 26A — 969
feature_name categorical
Top values (rank 1–20)
- Order of Object and Verb — 1,518
- Order of Subject and Verb — 1,496
- Order of Subject, Object and Verb — 1,376
- Order of Adjective and Noun — 1,367
- Order of Negative Morpheme and Verb — 1,325
- Preverbal Negative Morphemes — 1,325
- Postverbal Negative Morphemes — 1,325
- Minor morphological means of signaling negation — 1,325
- Relationship between the Order of Object and Verb and the Order of Adjective and Noun — 1,316
- Order of Genitive and Noun — 1,249
- Order of Demonstrative and Noun — 1,225
- Position of Negative Word With Respect to Subject, Object, and Verb — 1,190
- Order of Adposition and Noun Phrase — 1,184
- Negative Morphemes — 1,157
- Order of Numeral and Noun — 1,154
- Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase — 1,142
- Position of Tense-Aspect Affixes — 1,131
- Coding of Nominal Plurality — 1,066
- Position of Case Affixes — 1,031
- Prefixing vs. Suffixing in Inflectional Morphology — 969
value numeric
value_name text
Sample values (first 10)
- 18A-1
- 96A-2
- 87A-2
- 144J-7
- 144P-4
- 95A-1
- 144B-3
- 114A-7
- 131A-1
- 96A-4
source categorical
Top values (rank 1–20)
- WALS — 76,475