parquet linguistic features

source /home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet 76,475 rows 6 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 76,475 rows of linguistic feature observations, all sourced from WALS (World Atlas of Language Structures). Each row pairs a language (2,659 unique language IDs) with one of 192 typological features (e.g., 'Order of Object and Verb') and a categorical value encoded as both a short code (value_name) and a small integer (value). The most common features cluster around word order and negation, with 'Order of Object and Verb' being the top feature at 1,518 rows. Worth a closer look: the `value` column is highly skewed (skew 3.49, kurtosis 16.4) with ~3.2% outliers reaching up to 28, suggesting most features have only a few possible values but a handful have many categories. The `source` column is constant ('WALS') and can be ignored as a variable.

citing: row_count · column_count · columns.value_name.n_unique · columns.value_name.top_values · columns.language_id.n_unique · columns.language_id.top_values · columns.source.top_value · columns.source.top_rate · columns.value.skew · columns.value.kurtosis · columns.value.outlier_rate · columns.value.max · columns.feature_id.n_unique · columns.feature_name.top_values · columns.feature_name.top_rate

Charts the summary said to look at first

feature_name · See which typological features are most heavily represented; word-order and negation features dominate the top of the distribution.

Top values for feature_name (20 unique shown, of 192 total).
value	count	share
Order of Object and Verb	1518	2.0%
Order of Subject and Verb	1496	2.0%
Order of Subject, Object and Verb	1376	1.8%
Order of Adjective and Noun	1367	1.8%
Order of Negative Morpheme and Verb	1325	1.7%
Preverbal Negative Morphemes	1325	1.7%
Postverbal Negative Morphemes	1325	1.7%
Minor morphological means of signaling negation	1325	1.7%
Relationship between the Order of Object and Verb and the Order of Adjective and Noun	1316	1.7%
Order of Genitive and Noun	1249	1.6%
Order of Demonstrative and Noun	1225	1.6%
Position of Negative Word With Respect to Subject, Object, and Verb	1190	1.6%
Order of Adposition and Noun Phrase	1184	1.5%
Negative Morphemes	1157	1.5%
Order of Numeral and Noun	1154	1.5%
Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase	1142	1.5%
Position of Tense-Aspect Affixes	1131	1.5%
Coding of Nominal Plurality	1066	1.4%
Position of Case Affixes	1031	1.3%
Prefixing vs. Suffixing in Inflectional Morphology	969	1.3%

value · Check the heavy right skew — most observations fall at values 1–4, but a long tail extends up to 28.

Histogram bins for value (median: 2.0).
bin	count
1 – 1.675	27379
1.675 – 2.35	21173
2.35 – 3.025	6771
3.025 – 3.7	0
3.7 – 4.375	9917
4.375 – 5.05	3602
5.05 – 5.725	0
5.725 – 6.4	2730
6.4 – 7.075	1392
7.075 – 7.75	0
7.75 – 8.425	1042
8.425 – 9.1	597
9.1 – 9.775	0
9.775 – 10.45	32
10.45 – 11.12	265
11.12 – 11.8	0
11.8 – 12.48	43
12.48 – 13.15	68
13.15 – 13.83	0
13.83 – 14.5	251
14.5 – 15.18	190
15.18 – 15.85	0
15.85 – 16.52	156
16.52 – 17.2	44
17.2 – 17.88	0
17.88 – 18.55	127
18.55 – 19.23	83
19.23 – 19.9	0
19.9 – 20.58	358
20.58 – 21.25	202
21.25 – 21.93	0
21.93 – 22.6	21
22.6 – 23.28	18
23.28 – 23.95	0
23.95 – 24.62	3
24.62 – 25.3	3
25.3 – 25.98	0
25.98 – 26.65	4
26.65 – 27.33	3
27.33 – 28	1

language_id · Top languages by number of feature observations are fairly even (around 150–160 each), reflecting WALS's broad coverage.

Character-length distribution for language_id (mean: 2.996).
chars	count
2	342
3	76059

value_name · The most frequent value codes (e.g., '143G-4', '82A-1') reveal which feature-value combinations recur across many languages.

Character-length distribution for value_name (mean: 5.300).
chars	count
4	4995
5	45403
6	24205
7	1872

source · Confirms that 100% of rows come from WALS — the column is constant and adds no variation.

Top values for source (1 unique shown, of 1 total).
value	count	share
WALS	76475	100.0%

Schema

6 columns

Per-column summary.
column	type	nulls	unique	alerts
language_id	text	0.1%	2,659	one_word short_text duplicates
feature_id	categorical	0.0%	192
feature_name	categorical	0.0%	192
value	numeric	0.0%	28	high_skew
value_name	text	0.0%	1,139	one_word allcaps short_text duplicates
source	categorical	0.0%	1	imbalance

language_id

text foreign_key one_word short_text duplicates

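This column holds 2-3 character language codes (eng, fre, ger, rus...), with len_mean 2.996 and one_word_rate 1.0. The codes resemble ISO 639 identifiers, but since every row is sourced from WALS they are more likely WALS language codes; confirm against the lookup table before joining. Of the 76,475 rows, 74 (0.1%) are null; the remaining 76,401 carry only 2,659 unique values (duplicate_rate 96.5%), so the column behaves as a categorical key rather than free text. The top codes are closely balanced (eng 159, fre 158, ger 157), suggesting a curated multilingual catalogue rather than organic usage. Treatment: Treat as a categorical code; left-join to a language lookup table, and decide explicitly how to handle the 74 null rows. high · anthropic:claude-opus-4-7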

n: 76,475
nulls: 74 (0.1%)
unique: 2,659
len_min: 2
len_max: 3
len_mean: 2.996
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 73,742
duplicate_rate: 0.9652
vocab_size: 2,209
readability_flesch_mean: 118.3
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
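
The left-join treatment above can be sketched as follows. This is a minimal illustration, not the report's own pipeline: the `LANGUAGE_LOOKUP` table, the `join_language` helper, and the sample rows are all hypothetical. The point is that rows with a null or unknown `language_id` are kept with a missing name instead of being dropped.

```python
# Hypothetical lookup table; real names would come from a language reference file.
LANGUAGE_LOOKUP = {"eng": "English", "fre": "French", "ger": "German"}


def join_language(rows, lookup):
    """Left-join: keep every observation, attach a name when the code is known."""
    joined = []
    for row in rows:
        code = row.get("language_id")  # None for the null rows in the profile
        name = lookup.get(code) if code is not None else None
        joined.append({**row, "language_name": name})
    return joined


# Made-up sample rows, for illustration only.
rows = [
    {"language_id": "eng", "feature_id": "83A", "value": 2},
    {"language_id": None, "feature_id": "82A", "value": 1},
    {"language_id": "xx", "feature_id": "143E", "value": 4},
]

result = join_language(rows, LANGUAGE_LOOKUP)
unmatched = sum(1 for r in result if r["language_name"] is None)
print(len(result), unmatched)
```

Counting unmatched rows after the join gives a quick check on whether the lookup table uses the same code scheme as the dataset.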

feature_id

categorical feature

feature_id is a categorical code with 192 distinct values across 76,475 rows and no nulls, with codes like '83A', '82A', '143E' suggesting a numeric prefix plus letter suffix scheme. The distribution is close to uniform: entropy ratio 0.937, and the top value '83A' covers only 1.98% of rows. Four '143' codes ('143A', '143E', '143F', '143G') tie at exactly 1325 occurrences, which suggests they were coded together for the same set of languages rather than sampled independently. Treatment: Treat as a categorical feature; consider splitting the numeric prefix and letter suffix into separate fields before encoding. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 0 (0.0%)
unique: 192
top_value: 83A
top_rate: 0.01985
cardinality: 192
entropy: 7.103
entropy_ratio: 0.9365
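
The `entropy_ratio` figure can be reproduced under one assumption that the report does not state: that the profiler divides Shannon entropy in bits by log2 of the cardinality. The sketch below implements that definition and cross-checks it against the reported numbers (7.103 bits over 192 categories). The `entropy_ratio` function name is illustrative only.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy in bits divided by log2(cardinality).

    Returns 0.0 for a constant or empty column, where the ratio is undefined.
    """
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 0.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(k)


# Cross-check with the reported statistics for feature_id.
reported_ratio = 7.103 / math.log2(192)
print(round(reported_ratio, 4))
```

A ratio near 1 means the categories are close to equally frequent; a constant column such as `source` gives 0.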

feature_name

categorical feature

This column names linguistic typology features using WALS feature labels (e.g., 'Order of Object and Verb', 'Order of Subject and Verb'); it has the same cardinality and entropy as feature_id, so it appears to be that column's human-readable label. With 192 distinct values across 76,475 rows and a top rate of just 1.98%, the distribution is close to uniform (entropy ratio 0.94), suggesting each feature is sampled across many languages rather than a few features dominating. There are no nulls and no obvious duplicates among the top values. Treatment: Use as a categorical key; one-hot or target-encode, or pivot to wide form keyed by feature_name. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 0 (0.0%)
unique: 192
top_value: Order of Object and Verb
top_rate: 0.01985
cardinality: 192
entropy: 7.103
entropy_ratio: 0.9365
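
The pivot-to-wide treatment can be sketched with plain dictionaries. The `pivot_wide` helper and the sample rows are hypothetical, and the sample values are made up for illustration. The sketch assumes each (language, feature) pair occurs at most once; if a pair repeats, the last row wins.

```python
from collections import defaultdict


def pivot_wide(rows, index="language_id", column="feature_name", cell="value"):
    """Turn long (language, feature, value) rows into one dict per language."""
    wide = defaultdict(dict)
    for row in rows:
        key = row[index]
        if key is None:  # observations without a language key cannot be placed
            continue
        wide[key][row[column]] = row[cell]
    return dict(wide)


# Made-up sample rows, for illustration only.
rows = [
    {"language_id": "eng", "feature_name": "Order of Object and Verb", "value": 2},
    {"language_id": "eng", "feature_name": "Order of Adjective and Noun", "value": 1},
    {"language_id": "ger", "feature_name": "Order of Object and Verb", "value": 3},
    {"language_id": None, "feature_name": "Order of Object and Verb", "value": 1},
]

wide = pivot_wide(rows)

# Upper bound on how full the wide table can be, from the profile's own counts.
fill_rate = 76_475 / (2_659 * 192)
print(wide["eng"], round(fill_rate, 3))
```

With 2,659 languages and 192 features, the wide table has 510,528 cells but at most 76,475 observations to fill them, so roughly 85% of cells would be missing. Any model built on the wide form needs a strategy for that sparsity.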

value

numeric feature high_skew

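A small-integer feature with only 28 distinct values ranging from 1 to 28, mean 2.85 and median 2. Although profiled as numeric, it appears to be the index of a categorical value within each feature (compare the trailing number in value_name codes such as '143G-4') rather than a count or rating. The distribution is heavily right-skewed (skew 3.49, kurtosis 16.36), with 2,469 outliers (3.23%) stretching the tail to 28 while the IQR spans 1–4. There are no nulls or zeros, so every row carries a positive value. Treatment: if the column is used as a numeric input, log1p-transform or cap the upper tail to tame the skew; if it is a category index, as seems likely, encode it jointly with feature_id instead. high · anthropic:claude-opus-4-7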

n: 76,475
nulls: 0 (0.0%)
unique: 28
min: 1
max: 28
mean: 2.854
median: 2
std: 2.824
q1: 1
q3: 4
iqr: 3
skew: 3.493
kurtosis: 16.36
n_outliers: 2,469
outlier_rate: 0.03229
zero_rate: 0
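
The outlier figures can be cross-checked from the histogram earlier in the report. The sketch assumes the profiler uses the Tukey rule (points beyond 1.5 × IQR from the quartiles), which the report does not state; under that assumption the counts reproduce the reported `n_outliers`. The per-value counts are read off the non-empty histogram bins, each of which contains exactly one integer. The `tukey_outliers` helper is illustrative only.

```python
# Counts per integer value, taken from the non-empty bins of the value histogram.
VALUE_COUNTS = {
    1: 27379, 2: 21173, 3: 6771, 4: 9917, 5: 3602, 6: 2730, 7: 1392,
    8: 1042, 9: 597, 10: 32, 11: 265, 12: 43, 13: 68, 14: 251,
    15: 190, 16: 156, 17: 44, 18: 127, 19: 83, 20: 358, 21: 202,
    22: 21, 23: 18, 24: 3, 25: 3, 26: 4, 27: 3, 28: 1,
}


def tukey_outliers(counts, q1, q3, k=1.5):
    """Return the lower fence, upper fence, and number of points outside them."""
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    n_out = sum(c for v, c in counts.items() if v < lower or v > upper)
    return lower, upper, n_out


n = sum(VALUE_COUNTS.values())
mean = sum(v * c for v, c in VALUE_COUNTS.items()) / n
lower, upper, n_outliers = tukey_outliers(VALUE_COUNTS, q1=1, q3=4)
outlier_rate = n_outliers / n
print(n, round(mean, 3), upper, n_outliers, round(outlier_rate, 5))
```

The upper fence lands at 8.5, so every observation with a value of 9 or more is flagged. If `value` is a category index, these are rows belonging to features with many categories, not anomalous measurements.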

value_name

text feature one_word allcaps short_text duplicates

This column holds short alphanumeric codes (e.g. '143G-4', '82A-1'): uppercase, single-token, 4-7 characters long, with 1.0 one_word_rate and 1.0 allcaps_rate. Across 76,475 rows there are only 1,139 distinct values and a 0.985 duplicate_rate, so it behaves as a categorical key rather than free text. The pattern (digits + letter + dash + one or more digits) suggests a feature-value identifier drawn from a fixed set of codes. Note that the reported vocab_size of 911 is lower than the 1,139 distinct values; this presumably reflects how the profiler tokenizes the codes, and the distinct-value count is the one to rely on. Treatment: Treat as a categorical code and encode (e.g. target or one-hot) rather than tokenizing as text. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 0 (0.0%)
unique: 1,139
len_min: 4
len_max: 7
len_mean: 5.3
len_median: 5
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 75,336
duplicate_rate: 0.9851
vocab_size: 911
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
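
The code pattern described above can be split into its parts with a regular expression. This is a sketch under an assumption inferred from the two example codes only: that every `value_name` has the layout number, single uppercase letter, dash, value index. The `parse_value_name` helper and its output field names are hypothetical. Codes that do not match return `None` so they can be counted and inspected.

```python
import re

# Assumed layout: feature number, feature letter, dash, value index (e.g. '143G-4').
VALUE_NAME_RE = re.compile(r"^(\d+)([A-Z])-(\d+)$")


def parse_value_name(code):
    """Split a code like '143G-4' into feature_id, number, letter and value index."""
    match = VALUE_NAME_RE.match(code)
    if match is None:
        return None  # unexpected layout; inspect these rows separately
    number, letter, index = match.groups()
    return {
        "feature_id": number + letter,
        "number": int(number),
        "letter": letter,
        "value": int(index),
    }


parsed = [parse_value_name(c) for c in ("143G-4", "82A-1", "WALS")]
print(parsed)
```

Running this over the full column and comparing the parsed `feature_id` and `value` with the dataset's own columns would confirm or refute the reading that `value_name` is simply those two columns joined by a dash.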

source

categorical metadata imbalance

This column records the data origin and holds the constant value "WALS" across all 76,475 rows. With cardinality of 1 and entropy of 0.0, it carries no information for modelling and merely tags the dataset's provenance. Treatment: Drop before modelling; retain only as a provenance tag. high · anthropic:claude-opus-4-7

n: 76,475
nulls: 0 (0.0%)
unique: 1
top_value: WALS
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0