parquet linguistic features

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet

Saturn profiled 76,475 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats. That interpretation is opt-in, is added after the fact, and is generated from the stats alone, without access to raw rows.

[2]:

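# IPython cell: "!" runs a shell command, so this line is not plain Python.
!pip install saturn-dissect
import subprocess

# Profile the parquet file and write the findings to JSON.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet",
    "--findings", "parquet-linguistic_features.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if saturn exits non-zero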

Summary confidence: high

This dataset contains 76,475 rows of linguistic feature observations, all sourced from WALS (World Atlas of Language Structures). Each row pairs a language (2,659 unique language IDs) with one of 192 typological features (e.g., 'Order of Object and Verb') and a categorical value encoded as both a short code (value_name) and a small integer (value). The most common features cluster around word order and negation, with 'Order of Object and Verb' being the top feature at 1,518 rows. Worth a closer look: the `value` column is highly skewed (skew 3.49, kurtosis 16.4) with ~3.2% outliers reaching up to 28, suggesting most features have only a few possible values but a handful have many categories. The `source` column is constant ('WALS') and can be ignored as a variable.

citing: row_count · column_count · columns.value_name.n_unique · columns.value_name.top_values · columns.language_id.n_unique · columns.language_id.top_values · columns.source.top_value · columns.source.top_rate · columns.value.skew · columns.value.kurtosis · columns.value.outlier_rate · columns.value.max · columns.feature_id.n_unique · columns.feature_name.top_values · columns.feature_name.top_rate
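
The citing line names each supporting stat as a dotted path. A minimal sketch of how such a path could be looked up, assuming (the report does not confirm this) that the findings file is a nested JSON object whose keys mirror those paths. The `resolve` helper and the inline `sample_findings` excerpt are hypothetical; only the values are copied from this report.

```python
from functools import reduce

def resolve(findings: dict, dotted_path: str):
    """Walk a nested dict along a dotted path such as 'columns.value.skew'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), findings)

# Hypothetical excerpt shaped like the cited paths; values copied from this report.
sample_findings = {
    "row_count": 76475,
    "column_count": 6,
    "columns": {
        "value": {"skew": 3.493, "kurtosis": 16.36, "max": 28},
        "source": {"top_value": "WALS", "top_rate": 1.0},
    },
}

cited = ["row_count", "columns.value.skew", "columns.source.top_value"]
resolved = {path: resolve(sample_findings, path) for path in cited}
print(resolved)
```

With the real file, `sample_findings` would be replaced by the result of `json.load` on parquet-linguistic_features.json, provided the layout matches this assumption.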

Out[4]:

saturn.schema() · 6 columns

column	kind	n	null%	unique	alerts
language_id	text	76,475	0.1%	2,659	one_word short_text duplicates
feature_id	categorical	76,475	0.0%	192
feature_name	categorical	76,475	0.0%	192
value	numeric	76,475	0.0%	28	high_skew
value_name	text	76,475	0.0%	1,139	one_word allcaps short_text duplicates
source	categorical	76,475	0.0%	1	imbalance

Fig 1.

feature_name · See which typological features are most heavily represented; word-order and negation features dominate the top of the distribution.

Top values for feature_name (20 unique shown, of 192 total).
value	count	share
Order of Object and Verb	1518	2.0%
Order of Subject and Verb	1496	2.0%
Order of Subject, Object and Verb	1376	1.8%
Order of Adjective and Noun	1367	1.8%
Order of Negative Morpheme and Verb	1325	1.7%
Preverbal Negative Morphemes	1325	1.7%
Postverbal Negative Morphemes	1325	1.7%
Minor morphological means of signaling negation	1325	1.7%
Relationship between the Order of Object and Verb and the Order of Adjective and Noun	1316	1.7%
Order of Genitive and Noun	1249	1.6%
Order of Demonstrative and Noun	1225	1.6%
Position of Negative Word With Respect to Subject, Object, and Verb	1190	1.6%
Order of Adposition and Noun Phrase	1184	1.5%
Negative Morphemes	1157	1.5%
Order of Numeral and Noun	1154	1.5%
Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase	1142	1.5%
Position of Tense-Aspect Affixes	1131	1.5%
Coding of Nominal Plurality	1066	1.4%
Position of Case Affixes	1031	1.3%
Prefixing vs. Suffixing in Inflectional Morphology	969	1.3%

Fig 2.

value · Check the heavy right skew — most observations fall at values 1–4, but a long tail extends up to 28.

Histogram bins for value (median: 2.0).
bin	count
1 – 1.675	27379
1.675 – 2.35	21173
2.35 – 3.025	6771
3.025 – 3.7	0
3.7 – 4.375	9917
4.375 – 5.05	3602
5.05 – 5.725	0
5.725 – 6.4	2730
6.4 – 7.075	1392
7.075 – 7.75	0
7.75 – 8.425	1042
8.425 – 9.1	597
9.1 – 9.775	0
9.775 – 10.45	32
10.45 – 11.12	265
11.12 – 11.8	0
11.8 – 12.48	43
12.48 – 13.15	68
13.15 – 13.83	0
13.83 – 14.5	251
14.5 – 15.18	190
15.18 – 15.85	0
15.85 – 16.52	156
16.52 – 17.2	44
17.2 – 17.88	0
17.88 – 18.55	127
18.55 – 19.23	83
19.23 – 19.9	0
19.9 – 20.58	358
20.58 – 21.25	202
21.25 – 21.93	0
21.93 – 22.6	21
22.6 – 23.28	18
23.28 – 23.95	0
23.95 – 24.62	3
24.62 – 25.3	3
25.3 – 25.98	0
25.98 – 26.65	4
26.65 – 27.33	3
27.33 – 28	1

Fig 3.

language_id · Top languages by number of feature observations are fairly even (around 150–160 each), reflecting WALS's broad coverage. The data table for this figure reports code lengths rather than per-language counts.

Character-length distribution for language_id (mean: 2.996).
chars	count
2	342
3	76059

Fig 4.

value_name · The most frequent value codes (e.g., '143G-4', '82A-1') reveal which feature-value combinations recur across many languages.

Character-length distribution for value_name (mean: 5.300).
chars	count
4	4995
5	45403
6	24205
7	1872

Fig 5.

source · Confirms that 100% of rows come from WALS — the column is constant and adds no variation.

Top values for source (1 unique shown, of 1 total).
value	count	share
WALS	76475	100.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
language_id	text	0.1%
feature_id	categorical	0.0%
feature_name	categorical	0.0%
value	numeric	0.0%
value_name	text	0.0%
source	categorical	0.0%

language_id text foreign_key

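This column holds 2-3 character language codes (e.g. eng, fre, ger, rus), with len_mean 2.996 and one_word_rate 1.0. The codes resemble ISO 639-2 identifiers, but because every row is sourced from WALS they are more likely WALS's own language codes, so confirm the scheme before joining against an ISO table. Of the 76,475 rows, 74 (0.1%) are null; the remaining 76,401 hold only 2,659 unique values (duplicate_rate 96.5%), so the column behaves as a categorical key rather than free text. The top codes are closely balanced (eng 159, fre 158, ger 157), which suggests a curated multilingual catalogue rather than organic usage.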

Treatment: Treat as a categorical code; left-join to a language lookup table.

anthropic:claude-opus-4-7 · confidence high
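
A minimal sketch of the left join suggested in the treatment, written with plain dicts so it runs without extra packages. The `language_lookup` table is hypothetical (no lookup table ships with the profiled file), the observation rows are invented, and `None` stands in for a null language_id.

```python
# Observations as (language_id, feature_id) pairs; None stands in for a null key.
observations = [("eng", "83A"), ("fre", "83A"), (None, "82A"), ("zzz", "81A")]

# Hypothetical lookup table: code -> language name (not part of the profiled file).
language_lookup = {"eng": "English", "fre": "French", "ger": "German"}

def left_join(rows, lookup):
    """Keep every observation; attach the name, or None when the key has no match."""
    return [(lang, feat, lookup.get(lang)) for lang, feat in rows]

joined = left_join(observations, language_lookup)
unmatched = [row for row in joined if row[2] is None]
print(len(joined), len(unmatched))
```

A left join keeps the null-keyed rows rather than dropping them, so counting `unmatched` afterwards is a cheap way to see how many observations lack a language record.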

Out[12]:

saturn.columns["language_id"].stats

stat	value
n	76,475
nulls	74 (0.1%)
unique	2,659
len_min	2
len_max	3
len_mean	2.996
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	73,742
duplicate_rate	0.9652
vocab_size	2,209
readability_flesch_mean	118.3
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	96.5% duplicate strings
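
The duplicate figures in this table can be reproduced from n, nulls, and unique. The sketch below assumes that nulls are excluded before counting and that a duplicate is any repeat beyond the first occurrence of a string; both are inferences from the numbers, not something the report states.

```python
n_rows = 76_475
n_nulls = 74
n_unique = 2_659

n_non_null = n_rows - n_nulls          # 76,401 strings actually present
n_duplicates = n_non_null - n_unique   # repeats beyond the first occurrence
duplicate_rate = n_duplicates / n_non_null

print(n_duplicates, round(duplicate_rate, 4))
```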

Fig 7.

Character-length distribution for language_id.

Character-length distribution for language_id (mean: 2.996).
chars	count
2	342
3	76059

feature_id categorical feature

feature_id is a categorical code with 192 distinct values across 76,475 rows and no nulls; codes like '83A', '82A', '143E' suggest a numeric prefix plus a letter suffix. The distribution is close to uniform: entropy ratio 0.937, and the top value '83A' covers only 1.98% of rows. Four '143' codes ('143A', '143E', '143F', '143G') tie at exactly 1,325 occurrences, which suggests these features were coded together for the same set of languages rather than sampled independently.

Treatment: Treat as a categorical feature; consider splitting the numeric prefix and letter suffix into separate fields before encoding.

anthropic:claude-opus-4-7 · confidence high
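
A minimal sketch of the prefix/suffix split mentioned in the treatment. It assumes every feature_id is one or more digits followed by a single uppercase letter, which matches the codes shown in this report but has not been checked against all 192 values; `split_feature_id` is a hypothetical helper.

```python
import re

# Assumed shape: one or more digits followed by a single uppercase letter, e.g. '143E'.
FEATURE_ID = re.compile(r"(\d+)([A-Z])")

def split_feature_id(feature_id: str) -> tuple[int, str]:
    """Return (numeric prefix, letter suffix); raise if the code has another shape."""
    match = FEATURE_ID.fullmatch(feature_id)
    if match is None:
        raise ValueError(f"unexpected feature_id: {feature_id!r}")
    return int(match.group(1)), match.group(2)

parts = [split_feature_id(code) for code in ["83A", "82A", "143E", "143G"]]
print(parts)
```

Raising on an unexpected shape, rather than returning a partial result, surfaces any code that breaks the assumed scheme before it reaches an encoder.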

Out[15]:

saturn.columns["feature_id"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	192
top_value	83A
top_rate	0.01985
cardinality	192
entropy	7.103
entropy_ratio	0.9365
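
The entropy_ratio above is consistent with Shannon entropy in bits divided by its maximum, log2(cardinality). The sketch below checks that reading against the reported figures; the normalisation is inferred from the numbers, and treating a single-valued column as ratio 0 is an assumption chosen to match the source column's stats.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(entropy, cardinality):
    """Entropy divided by its maximum, log2(cardinality); 0 for a constant column."""
    return entropy / math.log2(cardinality) if cardinality > 1 else 0.0

reported_ratio = entropy_ratio(7.103, 192)               # figures from the stats table
uniform = entropy_ratio(entropy_bits([10] * 192), 192)   # perfectly flat reference
constant = entropy_ratio(entropy_bits([76_475]), 1)      # single-valued column
print(round(reported_ratio, 4), uniform, constant)
```

A perfectly flat 192-way distribution scores 1.0, so a ratio near 0.94 places feature_id close to uniform.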

Fig 8.

Top values for feature_id.

Top values for feature_id (20 unique shown, of 192 total).
value	count	share
83A	1518	2.0%
82A	1496	2.0%
81A	1376	1.8%
87A	1367	1.8%
143A	1325	1.7%
143E	1325	1.7%
143F	1325	1.7%
143G	1325	1.7%
97A	1316	1.7%
86A	1249	1.6%
88A	1225	1.6%
144A	1190	1.6%
85A	1184	1.5%
112A	1157	1.5%
89A	1154	1.5%
95A	1142	1.5%
69A	1131	1.5%
33A	1066	1.4%
51A	1031	1.3%
26A	969	1.3%

feature_name categorical feature

This column names linguistic typology features, almost certainly WALS feature labels (e.g., 'Order of Object and Verb', 'Order of Subject and Verb'). With 192 distinct values across 76,475 rows and a top rate of just 1.98%, the distribution is close to uniform (entropy ratio 0.94), suggesting each feature is sampled across many languages rather than a few features dominating. Its cardinality, entropy, and top rate match feature_id exactly, which is consistent with a one-to-one mapping between the two columns, so only one of them needs to be encoded. There are no nulls and no obvious duplicates among the top values.

Treatment: Use as a categorical key; one-hot or target-encode, or pivot to wide form keyed by feature_name.

anthropic:claude-opus-4-7 · confidence high
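
A minimal sketch of the wide-form pivot mentioned in the treatment, using plain dicts. The three long-form rows carry invented values, and the fill-rate estimate assumes each (language, feature) pair appears at most once, which the report does not verify.

```python
from collections import defaultdict

# Long-form rows: (language_id, feature_name, value). Values here are invented.
long_rows = [
    ("eng", "Order of Object and Verb", 2),
    ("eng", "Order of Adjective and Noun", 1),
    ("fre", "Order of Object and Verb", 2),
]

def pivot_wide(rows):
    """One dict per language, keyed by feature_name; absent features stay missing."""
    wide = defaultdict(dict)
    for language_id, feature_name, value in rows:
        wide[language_id][feature_name] = value
    return dict(wide)

wide = pivot_wide(long_rows)

# Upper bound on how full the real wide table could be: rows / (languages * features).
fill_rate = 76_475 / (2_659 * 192)
print(wide["fre"].get("Order of Adjective and Noun"), round(fill_rate, 3))
```

Under that assumption the wide table would be at most about 15% filled, so any model consuming it needs a plan for missing cells.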

Out[18]:

saturn.columns["feature_name"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	192
top_value	Order of Object and Verb
top_rate	0.01985
cardinality	192
entropy	7.103
entropy_ratio	0.9365

Fig 9.

Top values for feature_name.

Top values for feature_name (20 unique shown, of 192 total).
value	count	share
Order of Object and Verb	1518	2.0%
Order of Subject and Verb	1496	2.0%
Order of Subject, Object and Verb	1376	1.8%
Order of Adjective and Noun	1367	1.8%
Order of Negative Morpheme and Verb	1325	1.7%
Preverbal Negative Morphemes	1325	1.7%
Postverbal Negative Morphemes	1325	1.7%
Minor morphological means of signaling negation	1325	1.7%
Relationship between the Order of Object and Verb and the Order of Adjective and Noun	1316	1.7%
Order of Genitive and Noun	1249	1.6%
Order of Demonstrative and Noun	1225	1.6%
Position of Negative Word With Respect to Subject, Object, and Verb	1190	1.6%
Order of Adposition and Noun Phrase	1184	1.5%
Negative Morphemes	1157	1.5%
Order of Numeral and Noun	1154	1.5%
Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase	1142	1.5%
Position of Tense-Aspect Affixes	1131	1.5%
Coding of Nominal Plurality	1066	1.4%
Position of Case Affixes	1031	1.3%
Prefixing vs. Suffixing in Inflectional Morphology	969	1.3%

value numeric feature

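A small-integer column with only 28 distinct values ranging from 1 to 28, mean 2.85 and median 2. The distribution is heavily right-skewed (skew 3.49, kurtosis 16.36), with 2,469 outliers (3.23%) stretching the tail to 28 while the IQR spans 1–4. There are no nulls or zeros. As the summary notes, the integers encode a feature's categorical value rather than a count or rating, so the skew reflects how many categories each feature defines, not a measured quantity.

Treatment: if value is used as a numeric input, log1p-transform or cap the upper tail before modelling to tame the skew; because the integers are category codes, encoding value_name categorically is likely the safer choice.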

anthropic:claude-opus-4-7 · confidence high
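
A minimal sketch of the two options named in the treatment, which only apply if value is fed to a model as a number. Capping at the Tukey upper fence (q3 + 1.5 × IQR) is an assumption made here for illustration; the report does not say where the cap should sit. The `sample` list is invented.

```python
import math

q1, q3 = 1, 4                       # quartiles from the stats table
upper_fence = q3 + 1.5 * (q3 - q1)  # Tukey fence, 8.5 for this column

def cap(value, ceiling=upper_fence):
    """Winsorise: clamp anything above the ceiling down to it."""
    return min(value, ceiling)

sample = [1, 2, 4, 9, 28]
capped = [cap(v) for v in sample]
logged = [round(math.log1p(v), 3) for v in sample]
print(capped, logged)
```

Capping flattens every value above the fence to the same number, while log1p keeps their order but compresses the gaps; neither changes the fact that the underlying integers are labels.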

Out[21]:

saturn.columns["value"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	28
min	1
max	28
mean	2.854
median	2
std	2.824
q1	1
q3	4
iqr	3
skew	3.493
kurtosis	16.36
n_outliers	2,469
outlier_rate	0.03229
zero_rate	0
alert: high_skew	skew=+3.49

Fig 10.

Distribution of value. Vertical dash marks the median.

Histogram bins for value (median: 2.0).
bin	count
1 – 1.675	27379
1.675 – 2.35	21173
2.35 – 3.025	6771
3.025 – 3.7	0
3.7 – 4.375	9917
4.375 – 5.05	3602
5.05 – 5.725	0
5.725 – 6.4	2730
6.4 – 7.075	1392
7.075 – 7.75	0
7.75 – 8.425	1042
8.425 – 9.1	597
9.1 – 9.775	0
9.775 – 10.45	32
10.45 – 11.12	265
11.12 – 11.8	0
11.8 – 12.48	43
12.48 – 13.15	68
13.15 – 13.83	0
13.83 – 14.5	251
14.5 – 15.18	190
15.18 – 15.85	0
15.85 – 16.52	156
16.52 – 17.2	44
17.2 – 17.88	0
17.88 – 18.55	127
18.55 – 19.23	83
19.23 – 19.9	0
19.9 – 20.58	358
20.58 – 21.25	202
21.25 – 21.93	0
21.93 – 22.6	21
22.6 – 23.28	18
23.28 – 23.95	0
23.95 – 24.62	3
24.62 – 25.3	3
25.3 – 25.98	0
25.98 – 26.65	4
26.65 – 27.33	3
27.33 – 28	1
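
Because value takes only integer values, each non-empty histogram bin holds exactly one integer, so the bins can be read as per-value counts. The sketch below recomputes the headline stats from those counts. The quantile rule (smallest value whose cumulative count reaches the target fraction) and the Tukey fence for outliers are assumptions about how Saturn computes them, though both reproduce the reported figures.

```python
# Count per integer value, read off the histogram (each non-empty bin holds one integer).
counts = {
    1: 27379, 2: 21173, 3: 6771, 4: 9917, 5: 3602, 6: 2730, 7: 1392,
    8: 1042, 9: 597, 10: 32, 11: 265, 12: 43, 13: 68, 14: 251,
    15: 190, 16: 156, 17: 44, 18: 127, 19: 83, 20: 358, 21: 202,
    22: 21, 23: 18, 24: 3, 25: 3, 26: 4, 27: 3, 28: 1,
}

n = sum(counts.values())
mean = sum(v * c for v, c in counts.items()) / n

def quantile(q):
    """Smallest value whose cumulative count reaches a fraction q of the rows."""
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= q * n:
            return value

q1, q3 = quantile(0.25), quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)   # 8.5, so values 9 and above count as outliers
n_outliers = sum(c for v, c in counts.items() if v > upper_fence)

print(n, round(mean, 3), q1, q3, n_outliers)
```

The recomputed row count, mean, quartiles, and outlier count agree with the stats table, which supports reading the outliers as values of 9 and above.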

value_name text feature

This column holds short alphanumeric codes (e.g. '143G-4', '82A-1'): uppercase, single-token, 4-7 characters long, with one_word_rate 1.0 and allcaps_rate 1.0. Across 76,475 rows there are only 1,139 distinct values and a duplicate_rate of 0.985, so it behaves as a categorical key rather than free text. The pattern (digits + letter + dash + one or two digits) suggests a feature-value identifier drawn from a fixed set of 1,139 codes. The reported vocab_size of 911 appears to be a token-level statistic and should not be read as the number of distinct codes.

Treatment: Treat as a categorical code and encode (e.g. target or one-hot) rather than tokenizing as text.

anthropic:claude-opus-4-7 · confidence high
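
The example codes look like a feature_id, a dash, and the integer value. A minimal sketch of parsing that shape and checking it against the other two columns; the composition rule is an inference from the examples in this report, and `parse_value_name`, `is_consistent`, and the sample row are hypothetical.

```python
import re

# Assumed shape: feature_id, a dash, then the integer value, e.g. '143G-4'.
VALUE_NAME = re.compile(r"(\d+[A-Z])-(\d+)")

def parse_value_name(value_name: str) -> tuple[str, int]:
    """Split a code into (feature_id, value); raise on any other shape."""
    match = VALUE_NAME.fullmatch(value_name)
    if match is None:
        raise ValueError(f"unexpected value_name: {value_name!r}")
    return match.group(1), int(match.group(2))

def is_consistent(row: dict) -> bool:
    """True when value_name agrees with the row's feature_id and value columns."""
    return parse_value_name(row["value_name"]) == (row["feature_id"], row["value"])

row = {"feature_id": "143G", "value": 4, "value_name": "143G-4"}
print(parse_value_name("82A-1"), is_consistent(row))
```

If the check holds for every row, value_name carries no information beyond feature_id and value, and a single categorical encoding of value_name would cover both.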

Out[24]:

saturn.columns["value_name"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	1,139
len_min	4
len_max	7
len_mean	5.3
len_median	5
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	75,336
duplicate_rate	0.9851
vocab_size	911
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	98.5% duplicate strings

Fig 11.

Character-length distribution for value_name.

Character-length distribution for value_name (mean: 5.300).
chars	count
4	4995
5	45403
6	24205
7	1872

source categorical metadata

This column records the data origin and holds the constant value "WALS" across all 76,475 rows. With cardinality of 1 and entropy of 0.0, it carries no information for modelling and merely tags the dataset's provenance.

Treatment: Drop before modelling; retain only as a provenance tag.

anthropic:claude-opus-4-7 · confidence high
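
A minimal sketch of detecting and dropping single-valued columns such as source before modelling. The two rows are an invented miniature of the table; only the column names come from the report, and `constant_columns` is a hypothetical helper.

```python
def constant_columns(rows: list[dict]) -> list[str]:
    """Names of columns that take a single distinct value across all rows."""
    if not rows:
        return []
    return [name for name in rows[0] if len({row[name] for row in rows}) == 1]

# Invented miniature of the table; only the column names come from the report.
rows = [
    {"language_id": "eng", "feature_id": "83A", "source": "WALS"},
    {"language_id": "fre", "feature_id": "82A", "source": "WALS"},
]
to_drop = constant_columns(rows)
trimmed = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
print(to_drop, trimmed[0])
```

Recording `to_drop` alongside the trimmed data keeps the provenance tag recoverable even though the column itself no longer reaches the model.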

Out[27]:

saturn.columns["source"].stats

stat	value
n	76,475
nulls	0 (0.0%)
unique	1
top_value	WALS
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 12.

Top values for source.

Top values for source (1 unique shown, of 1 total).
value	count	share
WALS	76475	100.0%

How to cite

BibTeX

@misc{saturn-parquet-linguistic-features-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: parquet linguistic features},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/parquet-linguistic_features}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: parquet linguistic features. Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/parquet-linguistic_features