saturn·

parquet linguistic features

source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet · 76,475 rows · 6 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 76,475 rows of linguistic feature observations, all sourced from WALS (World Atlas of Language Structures). Each row pairs a language (2,659 unique language IDs) with one of 192 typological features (e.g., 'Order of Object and Verb') and a categorical value encoded as both a short code (value_name) and a small integer (value). The most common features cluster around word order and negation, with 'Order of Object and Verb' being the top feature at 1,518 rows. Worth a closer look: the `value` column is highly skewed (skew 3.49, kurtosis 16.4) with ~3.2% outliers reaching up to 28, suggesting most features have only a few possible values but a handful have many categories. The `source` column is constant ('WALS') and can be ignored as a variable.

citing: row_count · column_count · columns.value_name.n_unique · columns.value_name.top_values · columns.language_id.n_unique · columns.language_id.top_values · columns.source.top_value · columns.source.top_rate · columns.value.skew · columns.value.kurtosis · columns.value.outlier_rate · columns.value.max · columns.feature_id.n_unique · columns.feature_name.top_values · columns.feature_name.top_rate
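The summary's reading of the skew (most features have few categories, a handful have many) can be checked by counting distinct values per feature. The sketch below uses invented toy rows rather than the real parquet contents, and `categories_per_feature` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy (feature_id, value) pairs; the values are invented for illustration
# and are not taken from the profiled file.
rows = [
    ("83A", 1), ("83A", 2), ("83A", 3),
    ("82A", 1), ("82A", 2),
    ("143A", 1), ("143A", 4), ("143A", 15), ("143A", 28),
]

def categories_per_feature(pairs):
    """Map each feature_id to the number of distinct values observed for it."""
    seen = defaultdict(set)
    for feature_id, value in pairs:
        seen[feature_id].add(value)
    return {fid: len(vals) for fid, vals in seen.items()}

counts = categories_per_feature(rows)

# Rank features from most to fewest observed categories.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

On the real data, a long tail in `ranked` with a few large counts at the top would support the summary's interpretation.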

Schema

6 columns
Per-column summary.

column        type         nulls  unique  alerts
language_id   text         0.1%   2,659   one_word, short_text, duplicates
feature_id    categorical  0.0%   192     -
feature_name  categorical  0.0%   192     -
value         numeric      0.0%   28      high_skew
value_name    text         0.0%   1,139   one_word, allcaps, short_text, duplicates
source        categorical  0.0%   1       imbalance

language_id

text foreign_key one_word short_text duplicates
This column holds 2-3 character language codes (eng, fre, ger, rus, ...), with len_mean 2.996 and one_word_rate 1.0. Because every row is sourced from WALS, these are most likely WALS language codes, which resemble ISO 639 codes but do not always match them. With 76,475 rows, 74 nulls (0.1%), only 2,659 unique values, and a 96.5% duplicate_rate, the column behaves as a categorical key rather than free text. The top codes have nearly equal counts (eng 159, fre 158, ger 157), which suggests that the best-documented languages are coded for most of the 192 features. Treatment: Treat as a categorical code; left-join to a language lookup table, and decide explicitly how to handle the 74 null rows. high · anthropic:claude-opus-4-7
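A minimal sketch of the suggested left join, using pandas. The lookup table, its `language_name` column, and the toy rows are all assumptions made for illustration; the real lookup source is not specified in this report.

```python
import pandas as pd

# Toy observations; the None row mimics the null language_id values,
# and "xx" stands in for a code missing from the lookup table.
obs = pd.DataFrame({
    "language_id": ["eng", "fre", "xx", None],
    "feature_id": ["83A", "83A", "83A", "83A"],
    "value": [2, 2, 1, 1],  # invented values
})

# Hypothetical lookup table; column names are assumptions.
languages = pd.DataFrame({
    "language_id": ["eng", "fre", "ger"],
    "language_name": ["English", "French", "German"],
})

# Left join keeps every observation. validate guards against duplicate
# keys in the lookup, and indicator records which rows found a match.
joined = obs.merge(
    languages,
    on="language_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows with no lookup match, including rows whose key is null.
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched[["language_id", "feature_id"]])
```

Inspecting `unmatched` before dropping anything makes it clear whether a row failed to join because its code is null or because the lookup table is incomplete.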
n: 76,475
nulls: 74 (0.1%)
unique: 2,659
len_min: 2
len_max: 3
len_mean: 2.996
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 73,742
duplicate_rate: 0.9652
vocab_size: 2,209
readability_flesch_mean: 118.3
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

feature_id

categorical feature
feature_id is a categorical code with 192 distinct values across 76,475 rows and no nulls, with codes like '83A', '82A', '143E' suggesting a numeric prefix plus letter suffix scheme. The distribution is close to uniform: entropy ratio 0.9365, top value '83A' covers only 1.98% of rows, and four '143' suffixes ('143A', '143E', '143F', '143G') tie at exactly 1,325 occurrences, which suggests these features were coded together for the same set of languages. Treatment: Treat as a categorical feature; consider splitting the numeric prefix and letter suffix into separate fields before encoding. high · anthropic:claude-opus-4-7
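The prefix and suffix split can be sketched with a regular expression. This assumes every code is one or more digits followed by a single uppercase letter, as in the examples quoted above; `split_feature_id` is a hypothetical helper, and codes that do not fit the assumed pattern raise an error rather than being silently mangled.

```python
import re

# Assumed pattern: digits followed by exactly one uppercase letter.
FEATURE_ID = re.compile(r"(\d+)([A-Z])")

def split_feature_id(feature_id):
    """Split a code such as '143E' into (143, 'E')."""
    match = FEATURE_ID.fullmatch(feature_id)
    if match is None:
        raise ValueError(f"unexpected feature_id format: {feature_id!r}")
    return int(match.group(1)), match.group(2)

for code in ["83A", "82A", "143E"]:
    print(code, split_feature_id(code))
```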
n: 76,475
nulls: 0 (0.0%)
unique: 192
top_value: 83A
top_rate: 0.01985
cardinality: 192
entropy: 7.103
entropy_ratio: 0.9365
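The reported entropy figures are consistent with Shannon entropy in bits divided by its maximum, log2(cardinality). That formula is an assumption about how the profiler computes entropy_ratio, not something the report states; the sketch below checks the arithmetic and shows the calculation on a toy column.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy of a sequence of category labels, in bits."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A perfectly uniform column over 4 categories has entropy log2(4) = 2 bits.
uniform = entropy_bits(["a", "b", "c", "d"])

# Reported values for feature_id: entropy 7.103 over 192 categories.
max_entropy = math.log2(192)
ratio = 7.103 / max_entropy
print(round(max_entropy, 3), round(ratio, 4))
```

A ratio near 1 means the rows are spread almost evenly across the 192 features.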

feature_name

categorical feature
This column holds the human-readable labels of WALS typological features (e.g., 'Order of Object and Verb', 'Order of Subject and Verb'). With 192 distinct values across 76,475 rows and a top rate of just 1.98%, the distribution is close to uniform (entropy ratio 0.9365), so each feature is sampled across many languages rather than a few features dominating. Its cardinality, top rate, and entropy are identical to those of feature_id, which suggests a one-to-one mapping between the two columns, so only one of them needs encoding. There are no nulls and no obvious duplicates among the top values. Treatment: Use as a categorical key; one-hot or target-encode, or pivot to wide form keyed by feature_name. high · anthropic:claude-opus-4-7
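The wide-form option can be sketched with pandas. The toy rows and their values below are invented, and the sketch assumes each (language_id, feature_name) pair occurs at most once; `DataFrame.pivot` raises an error if that assumption fails, which is itself a useful check.

```python
import pandas as pd

# Toy long-format rows; values are invented for illustration.
long_df = pd.DataFrame({
    "language_id": ["eng", "eng", "fre", "fre", "ger"],
    "feature_name": [
        "Order of Object and Verb",
        "Order of Subject and Verb",
        "Order of Object and Verb",
        "Order of Subject and Verb",
        "Order of Object and Verb",
    ],
    "value": [2, 1, 2, 1, 1],
})

# One row per language, one column per feature.
# Features not coded for a language become NaN.
wide = long_df.pivot(index="language_id", columns="feature_name", values="value")
print(wide)
```

With 2,659 languages and 192 features the real wide table would be sparse, since the long table has far fewer than 2,659 × 192 rows.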
n: 76,475
nulls: 0 (0.0%)
unique: 192
top_value: Order of Object and Verb
top_rate: 0.01985
cardinality: 192
entropy: 7.103
entropy_ratio: 0.9365

value

numeric feature high_skew
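A small-integer category code with only 28 distinct values ranging from 1 to 28, mean 2.85 and median 2. As the dataset summary notes, each value indexes a category within its feature, so it is not a count or rating, and the same integer can mean different things under different features. The distribution is heavily right-skewed (skew 3.49, kurtosis 16.36), with 2,469 flagged outliers (3.23%) stretching the tail to 28 while the IQR spans 1 to 4; the tail reflects the few features that have many categories rather than extreme measurements. There are no nulls or zeros. Treatment: Treat as a nominal code scoped to feature_id (for example, encode feature and value together); a log1p transform or a cap on the upper tail would impose an ordering and distance that the codes do not necessarily have. high · anthropic:claude-opus-4-7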
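The outlier count can be related to the quartiles with Tukey's 1.5 × IQR rule. Whether the profiler uses this rule is an assumption, since the report does not say how n_outliers is computed; the sketch only works through the arithmetic on the reported q1 and q3.

```python
# Reported quartiles for the value column.
q1, q3 = 1, 4
iqr = q3 - q1  # 3

# Tukey fences (assumed rule, not confirmed by the report).
lower_fence = q1 - 1.5 * iqr  # -3.5
upper_fence = q3 + 1.5 * iqr  # 8.5

def is_outlier(value):
    """True if the value lies outside the assumed Tukey fences."""
    return value < lower_fence or value > upper_fence

# Under this rule any value of 9 or more is flagged.
print([v for v in range(1, 29) if is_outlier(v)][:3])

# The reported rate matches n_outliers / n.
rate = 2469 / 76475
print(round(rate, 5))
```

Under this assumed rule the flagged rows are simply those whose category index is 9 or higher.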
n: 76,475
nulls: 0 (0.0%)
unique: 28
min: 1
max: 28
mean: 2.854
median: 2
std: 2.824
q1: 1
q3: 4
iqr: 3
skew: 3.493
kurtosis: 16.36
n_outliers: 2,469
outlier_rate: 0.03229
zero_rate: 0

value_name

text feature one_word allcaps short_text duplicates
This column holds short alphanumeric codes (e.g. '143G-4', '82A-1'): uppercase, single-token, 4-7 characters long, with 1.0 one_word_rate and 1.0 allcaps_rate. Across 76,475 rows there are only 1,139 distinct values and a 0.985 duplicate_rate, so it behaves as a categorical key rather than free text. The pattern (digits + letter + dash + one or two digits) looks like a feature_id joined to a value by a dash, which would make the column a composite of two others. The reported vocab_size of 911 is lower than the 1,139 distinct values, which likely reflects how the profiler tokenizes the codes rather than the size of the code set. Treatment: Treat as a categorical code and encode (e.g. target or one-hot) rather than tokenizing as text. high · anthropic:claude-opus-4-7
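The example codes suggest that value_name may be feature_id and value joined by a dash. That is a hypothesis drawn from the examples, not a fact confirmed by the profile, so the sketch below shows one way to test it. The toy rows are invented, with one deliberate mismatch, and `compose_value_name` is a hypothetical helper.

```python
def compose_value_name(feature_id, value):
    """Build the code the hypothesis predicts, e.g. ('143G', 4) -> '143G-4'."""
    return f"{feature_id}-{value}"

# Toy (feature_id, value, value_name) rows; the last one is a planted mismatch.
rows = [
    ("143G", 4, "143G-4"),
    ("82A", 1, "82A-1"),
    ("83A", 2, "83A-3"),
]

# Rows where the observed value_name differs from the predicted one.
mismatches = [
    row for row in rows
    if compose_value_name(row[0], row[1]) != row[2]
]
print(mismatches)
```

If the real data produces no mismatches, value_name carries no information beyond feature_id and value and can be dropped before encoding.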
n: 76,475
nulls: 0 (0.0%)
unique: 1,139
len_min: 4
len_max: 7
len_mean: 5.3
len_median: 5
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 75,336
duplicate_rate: 0.9851
vocab_size: 911
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

source

categorical metadata imbalance
This column records the data origin and holds the constant value "WALS" across all 76,475 rows. With cardinality of 1 and entropy of 0.0, it carries no information for modelling and merely tags the dataset's provenance. Treatment: Drop before modelling; retain only as a provenance tag. high · anthropic:claude-opus-4-7
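A minimal pandas sketch of the drop-but-retain treatment: detect constant columns, keep their single value as a provenance record, and remove them from the modelling frame. The toy frame and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy frame with one constant column, mirroring the source column.
df = pd.DataFrame({
    "feature_id": ["83A", "82A", "83A"],
    "value": [1, 2, 2],
    "source": ["WALS", "WALS", "WALS"],
})

# A column is constant if it has exactly one distinct value, counting nulls.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

# Keep the constant values as provenance metadata before dropping them.
provenance = {c: df[c].iloc[0] for c in constant_cols}

model_df = df.drop(columns=constant_cols)
print(provenance, list(model_df.columns))
```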
n: 76,475
nulls: 0 (0.0%)
unique: 1
top_value: WALS
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0