parquet linguistic features
Reading
This dataset contains 76,475 rows of linguistic feature observations, all sourced from WALS (World Atlas of Language Structures). Each row pairs a language (2,659 unique language IDs) with one of 192 typological features (e.g., 'Order of Object and Verb') and a categorical value encoded as both a short code (value_name) and a small integer (value). The most common features cluster around word order and negation, with 'Order of Object and Verb' being the top feature at 1,518 rows. Worth a closer look: the `value` column is highly skewed (skew 3.49, kurtosis 16.4) with ~3.2% outliers reaching up to 28, suggesting most features have only a few possible values but a handful have many categories. The `source` column is constant ('WALS') and can be ignored as a variable.
citing: row_count · column_count · columns.value_name.n_unique · columns.value_name.top_values · columns.language_id.n_unique · columns.language_id.top_values · columns.source.top_value · columns.source.top_rate · columns.value.skew · columns.value.kurtosis · columns.value.outlier_rate · columns.value.max · columns.feature_id.n_unique · columns.feature_name.top_values · columns.feature_name.top_rate
Charts the summary said to look at first
feature_name: top 20 values
| value | count | share |
|---|---|---|
| Order of Object and Verb | 1518 | 2.0% |
| Order of Subject and Verb | 1496 | 2.0% |
| Order of Subject, Object and Verb | 1376 | 1.8% |
| Order of Adjective and Noun | 1367 | 1.8% |
| Order of Negative Morpheme and Verb | 1325 | 1.7% |
| Preverbal Negative Morphemes | 1325 | 1.7% |
| Postverbal Negative Morphemes | 1325 | 1.7% |
| Minor morphological means of signaling negation | 1325 | 1.7% |
| Relationship between the Order of Object and Verb and the Order of Adjective and Noun | 1316 | 1.7% |
| Order of Genitive and Noun | 1249 | 1.6% |
| Order of Demonstrative and Noun | 1225 | 1.6% |
| Position of Negative Word With Respect to Subject, Object, and Verb | 1190 | 1.6% |
| Order of Adposition and Noun Phrase | 1184 | 1.5% |
| Negative Morphemes | 1157 | 1.5% |
| Order of Numeral and Noun | 1154 | 1.5% |
| Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase | 1142 | 1.5% |
| Position of Tense-Aspect Affixes | 1131 | 1.5% |
| Coding of Nominal Plurality | 1066 | 1.4% |
| Position of Case Affixes | 1031 | 1.3% |
| Prefixing vs. Suffixing in Inflectional Morphology | 969 | 1.3% |
value: histogram (40 equal-width bins; each non-empty bin holds a single integer value)
| bin | count |
|---|---|
| 1 – 1.675 | 27379 |
| 1.675 – 2.35 | 21173 |
| 2.35 – 3.025 | 6771 |
| 3.025 – 3.7 | 0 |
| 3.7 – 4.375 | 9917 |
| 4.375 – 5.05 | 3602 |
| 5.05 – 5.725 | 0 |
| 5.725 – 6.4 | 2730 |
| 6.4 – 7.075 | 1392 |
| 7.075 – 7.75 | 0 |
| 7.75 – 8.425 | 1042 |
| 8.425 – 9.1 | 597 |
| 9.1 – 9.775 | 0 |
| 9.775 – 10.45 | 32 |
| 10.45 – 11.12 | 265 |
| 11.12 – 11.8 | 0 |
| 11.8 – 12.48 | 43 |
| 12.48 – 13.15 | 68 |
| 13.15 – 13.83 | 0 |
| 13.83 – 14.5 | 251 |
| 14.5 – 15.18 | 190 |
| 15.18 – 15.85 | 0 |
| 15.85 – 16.52 | 156 |
| 16.52 – 17.2 | 44 |
| 17.2 – 17.88 | 0 |
| 17.88 – 18.55 | 127 |
| 18.55 – 19.23 | 83 |
| 19.23 – 19.9 | 0 |
| 19.9 – 20.58 | 358 |
| 20.58 – 21.25 | 202 |
| 21.25 – 21.93 | 0 |
| 21.93 – 22.6 | 21 |
| 22.6 – 23.28 | 18 |
| 23.28 – 23.95 | 0 |
| 23.95 – 24.62 | 3 |
| 24.62 – 25.3 | 3 |
| 25.3 – 25.98 | 0 |
| 25.98 – 26.65 | 4 |
| 26.65 – 27.33 | 3 |
| 27.33 – 28 | 1 |
language_id: length in characters
| chars | count |
|---|---|
| 2 | 342 |
| 3 | 76059 |
value_name: length in characters
| chars | count |
|---|---|
| 4 | 4995 |
| 5 | 45403 |
| 6 | 24205 |
| 7 | 1872 |
source: top values
| value | count | share |
|---|---|---|
| WALS | 76475 | 100.0% |
Schema
6 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| language_id | text | 0.1% | 2,659 | one_word, short_text, duplicates |
| feature_id | categorical | 0.0% | 192 | |
| feature_name | categorical | 0.0% | 192 | |
| value | numeric | 0.0% | 28 | high_skew |
| value_name | text | 0.0% | 1,139 | one_word, allcaps, short_text, duplicates |
| source | categorical | 0.0% | 1 | imbalance |
language_id
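text foreign_key one_word short_text duplicates
This column holds 2-3 character language codes (e.g. eng, fre, ger, rus), with len_mean 2.996 and one_word_rate 1.0. The codes resemble ISO 639 identifiers, but because every row is sourced from WALS they are more likely WALS's own language codes; confirm this before joining against an ISO table. Of the 76,475 rows, 74 (0.1%) are null; the remaining 76,401 hold only 2,659 unique values (96.5% duplicate_rate), so the column behaves as a categorical key rather than free text. The top codes are closely matched (eng 159, fre 158, ger 157), consistent with a curated catalogue in which well-documented languages are coded for most of the 192 features. Treatment: Treat as a categorical code; left-join to a language lookup table, and decide how to handle the 74 null rows first.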
- n: 76,475
- nulls: 74 (0.1%)
- unique: 2,659
- len_min: 2
- len_max: 3
- len_mean: 2.996
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 73,742
- duplicate_rate: 0.9652
- vocab_size: 2,209
- readability_flesch_mean: 118.3
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
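The duplicate_rate above appears to be computed over non-null rows only, and the suggested left join needs a policy for null or unmatched codes. The following is a minimal sketch under those assumptions: `LANGUAGE_LOOKUP`, `left_join_language` and the sample rows are hypothetical and only illustrate the shape of the join.

```python
# Figures copied from the language_id stats above.
N_ROWS, N_NULLS, N_UNIQUE = 76_475, 74, 2_659

non_null = N_ROWS - N_NULLS
n_duplicates = non_null - N_UNIQUE        # rows that repeat an already-seen code
duplicate_rate = n_duplicates / non_null  # assumed definition; matches the reported 0.9652

# Hypothetical lookup table; a real one would come from a language reference list.
LANGUAGE_LOOKUP = {"eng": "English", "fre": "French", "ger": "German", "rus": "Russian"}


def left_join_language(rows, lookup):
    """Left join: keep every row; language_name is None when there is no match."""
    return [{**row, "language_name": lookup.get(row["language_id"])} for row in rows]


# Made-up sample rows in the dataset's long format.
sample = [
    {"language_id": "eng", "feature_id": "83A", "value": 2},
    {"language_id": None, "feature_id": "83A", "value": 1},   # null key survives the join
    {"language_id": "zzz", "feature_id": "82A", "value": 1},  # made-up unmatched code
]
joined = left_join_language(sample, LANGUAGE_LOOKUP)
```

Because the join is a left join, the row count is unchanged; rows whose `language_name` is None can then be counted and inspected rather than silently dropped.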
feature_id
categorical feature
feature_id is a categorical code with 192 distinct values across 76,475 rows and no nulls. Codes such as '83A', '82A' and '143E' follow a numeric-prefix-plus-letter-suffix scheme. The distribution is flat: the entropy ratio is 0.937 and the top value '83A' covers only 1.98% of rows. Four '143' codes ('143A', '143E', '143F', '143G') tie at exactly 1,325 occurrences, which suggests they were coded for the same set of languages rather than sampled independently. Treatment: Treat as a categorical feature; consider splitting the numeric prefix and letter suffix into separate fields before encoding.
- n: 76,475
- nulls: 0 (0.0%)
- unique: 192
- top_value: 83A
- top_rate: 0.01985
- cardinality: 192
- entropy: 7.103
- entropy_ratio: 0.9365
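The entropy_ratio figure is consistent with Shannon entropy in bits divided by its maximum, log2(cardinality). That definition is an assumption inferred from the numbers, not something the report states. A short sketch with hypothetical helper names:

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column carries no information
    return shannon_entropy_bits(counts) / math.log2(k)


# Check against the reported figures: 7.103 bits over 192 categories.
reported_ratio = 7.103 / math.log2(192)

# Toy distributions (made up): perfectly flat versus dominated by one code.
flat_ratio = entropy_ratio([10, 10, 10, 10])
skewed_ratio = entropy_ratio([97, 1, 1, 1])
```

A ratio near 1 means the categories are used almost equally often, which is the sense in which the feature distribution is described as flat.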
feature_name
categorical feature
This column names linguistic typology features using WALS feature labels (e.g., 'Order of Object and Verb', 'Order of Subject and Verb'). With 192 distinct values across 76,475 rows and a top rate of just 1.98%, the distribution is flat (entropy ratio 0.94): each feature is recorded for many languages rather than a few features dominating. Its cardinality, top rate and entropy match feature_id exactly, so the two columns appear to map one-to-one and only one of them needs encoding. There are no nulls and no obvious duplicates among the top values. Treatment: Use as a categorical key; one-hot or target-encode, or pivot to wide form keyed by feature_name.
- n: 76,475
- nulls: 0 (0.0%)
- unique: 192
- top_value: Order of Object and Verb
- top_rate: 0.01985
- cardinality: 192
- entropy: 7.103
- entropy_ratio: 0.9365
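The pivot to wide form mentioned in the treatment can be sketched in plain Python. This is only an illustration: `pivot_wide` is a hypothetical helper, the sample rows are made up, and skipping rows with a null language_id is one possible policy among several.

```python
from collections import defaultdict


def pivot_wide(rows, index="language_id", columns="feature_name", values="value"):
    """Pivot long rows into {language_id: {feature_name: value}}.

    Rows with a null index are skipped; a repeated (index, column) pair
    raises instead of silently overwriting an earlier value.
    """
    wide = defaultdict(dict)
    for row in rows:
        key = row[index]
        if key is None:
            continue
        if row[columns] in wide[key]:
            raise ValueError(f"duplicate entry for {key!r} / {row[columns]!r}")
        wide[key][row[columns]] = row[values]
    return dict(wide)


# Made-up sample rows in the dataset's long format.
long_rows = [
    {"language_id": "aaa", "feature_name": "Order of Object and Verb", "value": 2},
    {"language_id": "aaa", "feature_name": "Order of Adjective and Noun", "value": 1},
    {"language_id": "bbb", "feature_name": "Order of Object and Verb", "value": 1},
    {"language_id": None, "feature_name": "Order of Object and Verb", "value": 1},
]
wide = pivot_wide(long_rows)
```

With 2,659 languages and 192 features the full wide table would have about 510,000 cells against 76,475 observed rows, so most cells would be missing and need an explicit missing-value strategy.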
value
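numeric feature high_skew
A small-integer code with 28 distinct values ranging from 1 to 28, mean 2.85 and median 2. Per the dataset summary it is the integer form of a categorical value rather than a count or rating, so its magnitude is only meaningful within a single feature. The distribution is heavily right-skewed (skew 3.49, kurtosis 16.36), with 2,469 outliers (3.23%) stretching the tail to 28 while the IQR sits tight at 1–4; the tail reflects the few features that have many categories. There are no nulls or zeros. Treatment: Encode as a categorical value per feature; a log1p transform or upper-tail cap only makes sense if the integers are deliberately treated as ordinal.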
- n: 76,475
- nulls: 0 (0.0%)
- unique: 28
- min: 1
- max: 28
- mean: 2.854
- median: 2
- std: 2.824
- q1: 1
- q3: 4
- iqr: 3
- skew: 3.493
- kurtosis: 16.36
- n_outliers: 2,469
- outlier_rate: 0.03229
- zero_rate: 0
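The summary statistics above can be cross-checked against the value histogram, since each non-empty bin there covers exactly one integer. The sketch below assumes the outlier flag uses Tukey's 1.5 × IQR fences; the report does not say so, but that rule reproduces the reported count. `quantile` is a hypothetical helper.

```python
# Counts per integer value 1..28, read off the non-empty histogram bins.
VALUE_COUNTS = [27379, 21173, 6771, 9917, 3602, 2730, 1392, 1042, 597, 32,
                265, 43, 68, 251, 190, 156, 44, 127, 83, 358,
                202, 21, 18, 3, 3, 4, 3, 1]
counts = dict(enumerate(VALUE_COUNTS, start=1))


def quantile(counts, q):
    """Smallest value whose cumulative count reaches a fraction q of the total."""
    target = q * sum(counts.values())
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= target:
            return value


n = sum(counts.values())
mean = sum(v * c for v, c in counts.items()) / n
q1, q3 = quantile(counts, 0.25), quantile(counts, 0.75)

# Tukey fences: an assumed rule, not confirmed by the report.
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
n_outliers = sum(c for v, c in counts.items() if v < lower_fence or v > upper_fence)
```

Under this rule the upper fence is 8.5, so every row with a value of 9 or more is counted as an outlier.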
value_name
text feature one_word allcaps short_text duplicates
This column holds short alphanumeric codes (e.g. '143G-4', '82A-1'): uppercase, single-token, 4-7 characters long, with one_word_rate 1.0 and allcaps_rate 1.0. Across 76,475 rows there are only 1,139 distinct values and a duplicate_rate of 0.985, so it behaves as a categorical key rather than free text. The pattern (digits + letter + dash + digits) looks like a feature_id joined to the integer in value, i.e. a feature-value identifier drawn from a fixed vocabulary (vocab_size 911). Treatment: Treat as a categorical code and encode (e.g. target or one-hot) rather than tokenizing as text.
- n: 76,475
- nulls: 0 (0.0%)
- unique: 1,139
- len_min: 4
- len_max: 7
- len_mean: 5.3
- len_median: 5
- len_p95: 6
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 75,336
- duplicate_rate: 0.9851
- vocab_size: 911
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
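If the codes really follow the digits + letter + dash + digits pattern, they can be split into their parts with a regular expression. The pattern and the `parse_value_name` helper below are assumptions based on the two example codes, not a confirmed specification; codes that do not match return None so they can be inspected.

```python
import re

# Assumed shape: feature number, one uppercase letter, dash, value index.
VALUE_NAME_RE = re.compile(r"^(\d+)([A-Z])-(\d+)$")


def parse_value_name(code):
    """Split a code such as '143G-4' into its parts, or return None if it does not match."""
    match = VALUE_NAME_RE.match(code)
    if match is None:
        return None
    number, letter, value = match.groups()
    return {
        "feature_number": int(number),
        "feature_letter": letter,
        "feature_id": number + letter,
        "value": int(value),
    }


parsed = parse_value_name("143G-4")
```

Comparing the parsed `feature_id` and `value` with the columns of the same name would show whether value_name is fully redundant with them.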
source
categorical metadata imbalance
This column records the data origin and holds the constant value "WALS" across all 76,475 rows. With a cardinality of 1 and an entropy of 0.0, it carries no information for modelling and merely tags the dataset's provenance. Treatment: Drop before modelling; retain only as a provenance tag.
- n: 76,475
- nulls: 0 (0.0%)
- unique: 1
- top_value: WALS
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
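Dropping constant columns can be automated rather than hard-coded. A minimal sketch, with hypothetical helper names and made-up sample rows, that detects any column holding a single distinct value and removes it:

```python
def constant_columns(rows):
    """Names of columns that take exactly one distinct value across all rows."""
    if not rows:
        return []
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]


def drop_columns(rows, to_drop):
    """Return copies of the rows without the named columns."""
    drop = set(to_drop)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Made-up sample rows; only `source` is constant here.
sample = [
    {"language_id": "aaa", "feature_id": "83A", "value": 2, "source": "WALS"},
    {"language_id": "bbb", "feature_id": "82A", "value": 1, "source": "WALS"},
]
to_drop = constant_columns(sample)
modelling_rows = drop_columns(sample, to_drop)
```

Keeping the dropped value in dataset-level metadata preserves the provenance tag without carrying it on every row.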