
parquet linguistic features

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet

Saturn profiled 76,475 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
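# Install the profiler (notebook shell escape), then run it on the Parquet file.
!pip install saturn-dissect
import subprocess

subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet",
        "--findings", "parquet-linguistic_features.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the saturn command exits non-zero
)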

Summary confidence: high

This dataset contains 76,475 rows of linguistic feature observations, all sourced from WALS (World Atlas of Language Structures). Each row pairs a language (2,659 unique language IDs) with one of 192 typological features (e.g., 'Order of Object and Verb') and a categorical value encoded as both a short code (value_name) and a small integer (value). The most common features cluster around word order and negation, with 'Order of Object and Verb' being the top feature at 1,518 rows. Worth a closer look: the `value` column is highly skewed (skew 3.49, kurtosis 16.4) with ~3.2% outliers reaching up to 28, suggesting most features have only a few possible values but a handful have many categories. The `source` column is constant ('WALS') and can be ignored as a variable.

citing: row_count · column_count · columns.value_name.n_unique · columns.value_name.top_values · columns.language_id.n_unique · columns.language_id.top_values · columns.source.top_value · columns.source.top_rate · columns.value.skew · columns.value.kurtosis · columns.value.outlier_rate · columns.value.max · columns.feature_id.n_unique · columns.feature_name.top_values · columns.feature_name.top_rate
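
The `citing:` keys read like dotted paths into the findings file. A minimal sketch of resolving such paths, assuming (the report does not confirm this) that `parquet-linguistic_features.json` is nested JSON whose layout mirrors the dotted keys. The `findings` dict is a hand-built stand-in filled with values quoted in this report, and `resolve` is a hypothetical helper, not part of saturn:

```python
from functools import reduce

# Stand-in for the findings JSON. The nesting is an assumption;
# the numbers are the ones quoted in this report.
findings = {
    "row_count": 76475,
    "column_count": 6,
    "columns": {
        "value": {"skew": 3.493, "kurtosis": 16.36, "outlier_rate": 0.03229, "max": 28},
        "source": {"top_value": "WALS", "top_rate": 1.0},
    },
}

def resolve(doc, dotted_key):
    """Walk a nested dict one path segment at a time; KeyError if a segment is missing."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), doc)

for key in ("row_count", "columns.value.skew", "columns.source.top_value"):
    print(key, "=", resolve(findings, key))
```

If the real file turns out to be flat (keys containing literal dots), a plain `findings[key]` lookup would be the equivalent.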

Out[4]:

saturn.schema() · 6 columns

column        kind         n       null%  unique  alerts
language_id   text         76,475  0.1%   2,659   one_word short_text duplicates
feature_id    categorical  76,475  0.0%   192
feature_name  categorical  76,475  0.0%   192
value         numeric      76,475  0.0%   28      high_skew
value_name    text         76,475  0.0%   1,139   one_word allcaps short_text duplicates
source        categorical  76,475  0.0%   1       imbalance
Fig 1.
feature_name · See which typological features are most heavily represented; word-order and negation features dominate the top of the distribution.
Top values for feature_name (20 unique shown, of 192 total).
count  share  value
1,518  2.0%   Order of Object and Verb
1,496  2.0%   Order of Subject and Verb
1,376  1.8%   Order of Subject, Object and Verb
1,367  1.8%   Order of Adjective and Noun
1,325  1.7%   Order of Negative Morpheme and Verb
1,325  1.7%   Preverbal Negative Morphemes
1,325  1.7%   Postverbal Negative Morphemes
1,325  1.7%   Minor morphological means of signaling negation
1,316  1.7%   Relationship between the Order of Object and Verb and the Order of Adjective and Noun
1,249  1.6%   Order of Genitive and Noun
1,225  1.6%   Order of Demonstrative and Noun
1,190  1.6%   Position of Negative Word With Respect to Subject, Object, and Verb
1,184  1.5%   Order of Adposition and Noun Phrase
1,157  1.5%   Negative Morphemes
1,154  1.5%   Order of Numeral and Noun
1,142  1.5%   Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase
1,131  1.5%   Position of Tense-Aspect Affixes
1,066  1.4%   Coding of Nominal Plurality
1,031  1.3%   Position of Case Affixes
  969  1.3%   Prefixing vs. Suffixing in Inflectional Morphology
Fig 2.
value · Check the heavy right skew — most observations fall at values 1–4, but a long tail extends up to 28.
Histogram bins for value (median: 2.0). 40 equal-width bins from 1 to 28; the 12 bins with a count of 0 are omitted.
bin            count
1 – 1.675      27,379
1.675 – 2.35   21,173
2.35 – 3.025   6,771
3.7 – 4.375    9,917
4.375 – 5.05   3,602
5.725 – 6.4    2,730
6.4 – 7.075    1,392
7.75 – 8.425   1,042
8.425 – 9.1    597
9.775 – 10.45  32
10.45 – 11.12  265
11.8 – 12.48   43
12.48 – 13.15  68
13.83 – 14.5   251
14.5 – 15.18   190
15.85 – 16.52  156
16.52 – 17.2   44
17.88 – 18.55  127
18.55 – 19.23  83
19.9 – 20.58   358
20.58 – 21.25  202
21.93 – 22.6   21
22.6 – 23.28   18
23.95 – 24.62  3
24.62 – 25.3   3
25.98 – 26.65  4
26.65 – 27.33  3
27.33 – 28     1
Fig 3.
language_id · Top languages by number of feature observations are fairly even (around 150–160 each), reflecting WALS's broad coverage.
Character-length distribution for language_id (mean: 2.995523618800801). Only non-empty bins are shown.
chars  count
2      342
3      76,059
Fig 4.
value_name · The most frequent value codes (e.g., '143G-4', '82A-1') reveal which feature-value combinations recur across many languages.
Character-length distribution for value_name (mean: 5.300150375939849). Only non-empty bins are shown.
chars  count
4      4,995
5      45,403
6      24,205
7      1,872
Fig 5.
source · Confirms that 100% of rows come from WALS — the column is constant and adds no variation.
Top values for source (1 unique shown, of 1 total).
count   share   value
76,475  100.0%  WALS
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column        kind         null %
language_id   text         0.1%
feature_id    categorical  0.0%
feature_name  categorical  0.0%
value         numeric      0.0%
value_name    text         0.0%
source        categorical  0.0%

language_id text foreign_key

This column holds 2–3 character language codes (eng, fre, ger, rus...), with len_mean 2.996 and one_word_rate 1.0. The codes look ISO 639-style, but WALS assigns its own language codes, so they should not be assumed to be ISO identifiers without checking. With 76,475 rows (74 of them null) but only 2,659 unique values and a 96.5% duplicate_rate, the column behaves as a categorical key rather than free text. The top codes are closely balanced (eng 159, fre 158, ger 157), suggesting a curated multilingual catalogue rather than organic usage.

Treatment: Treat as a categorical code; left-join to a language lookup table.
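
A minimal sketch of the left join suggested above, in plain Python. The lookup table, the sample rows, and the `left_join` helper are all hypothetical and only illustrate the shape of the operation; the real lookup would come from a language reference table.

```python
# Hypothetical lookup table: names are illustrative, not taken from the dataset.
language_lookup = {"eng": "English", "fre": "French", "ger": "German"}

# Made-up sample rows in the shape of the profiled columns.
rows = [
    {"language_id": "eng", "feature_id": "83A", "value": 2},
    {"language_id": "xyz", "feature_id": "83A", "value": 1},  # code absent from the lookup
    {"language_id": None, "feature_id": "82A", "value": 1},   # null key, as in 0.1% of rows
]

def left_join(rows, lookup, key="language_id", new_field="language_name"):
    """Keep every input row; attach the lookup value, or None when there is no match."""
    return [{**row, new_field: lookup.get(row[key])} for row in rows]

joined = left_join(rows, language_lookup)
for row in joined:
    print(row)
```

Because it is a left join, the 74 rows with a null `language_id` survive with an empty name instead of being dropped.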

anthropic:claude-opus-4-7 · confidence high
Out[12]:

saturn.columns["language_id"].stats

stat value
n 76,475
nulls 74 (0.1%)
unique 2,659
len_min 2
len_max 3
len_mean 2.996
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 73,742
duplicate_rate 0.9652
vocab_size 2,209
readability_flesch_mean 118.3
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 96.5% duplicate strings
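
The duplicate and length figures for language_id can be re-derived from the other numbers in this report. The definitions below are an assumption about how saturn computes them (duplicates counted over non-null strings); they are used here only because they reproduce the reported values.

```python
# Figures quoted from the stats table for language_id.
n_rows, n_nulls, n_unique = 76_475, 74, 2_659

non_null = n_rows - n_nulls          # strings actually present
n_duplicates = non_null - n_unique   # rows repeating an already-seen string
duplicate_rate = n_duplicates / non_null

# Mean length from the two non-empty length bins (342 two-char, 76,059 three-char codes).
len_mean = (2 * 342 + 3 * 76_059) / non_null

print(n_duplicates, round(duplicate_rate, 4), round(len_mean, 3))
# 73742 0.9652 2.996
```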
Fig 7.
Character-length distribution for language_id.
Character-length distribution for language_id (mean: 2.995523618800801). Only non-empty bins are shown.
chars  count
2      342
3      76,059

feature_id categorical feature

feature_id is a categorical code with 192 distinct values across 76,475 rows and no nulls, with codes like '83A', '82A', '143E' suggesting a numeric prefix plus letter suffix scheme. The distribution is remarkably flat: entropy ratio 0.937, top value '83A' covers only 1.98% of rows, and four '143' suffixes ('143A','143E','143F','143G') tie at exactly 1325 occurrences, hinting at a structured co-occurrence rather than random sampling.

Treatment: Treat as a categorical feature; consider splitting the numeric prefix and letter suffix into separate fields before encoding.
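
A minimal sketch of the prefix/suffix split, assuming every code is one or more digits followed by a single uppercase letter, as in the examples quoted above. `split_feature_id` is a hypothetical helper; codes that do not fit the assumed pattern raise an error rather than being silently mangled.

```python
import re

# Assumed shape: numeric prefix + one uppercase letter, e.g. "83A" or "143E".
FEATURE_ID = re.compile(r"^(\d+)([A-Z])$")

def split_feature_id(feature_id):
    """Return (numeric_prefix, letter_suffix) for a code such as '143E'."""
    match = FEATURE_ID.match(feature_id)
    if match is None:
        raise ValueError(f"unexpected feature_id: {feature_id!r}")
    return int(match.group(1)), match.group(2)

for code in ("83A", "82A", "143E"):
    print(code, "->", split_feature_id(code))
```

Keeping the prefix as an integer lets the four `143` codes that tie at 1,325 rows be grouped together.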

anthropic:claude-opus-4-7 · confidence high
Out[15]:

saturn.columns["feature_id"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique 192
unique192
top_value 83A
top_rate 0.01985
cardinality 192
entropy 7.103
entropy_ratio 0.9365
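
The entropy ratio can be checked by hand: it is consistent with Shannon entropy in bits divided by the maximum possible entropy for 192 categories, log2(192). That reading is an assumption about saturn's definition, adopted because it reproduces the reported 0.9365. `shannon_entropy` is a hypothetical helper showing how the entropy itself would be computed from counts.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

entropy_bits = 7.103   # reported entropy for feature_id
cardinality = 192      # reported number of distinct values

max_entropy = math.log2(cardinality)        # entropy of a perfectly uniform column
entropy_ratio = entropy_bits / max_entropy

print(round(max_entropy, 3), round(entropy_ratio, 4))
# 7.585 0.9365
```

A ratio of 1.0 would mean all 192 features appear equally often; for example, `shannon_entropy([10] * 192)` equals `math.log2(192)`.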
Fig 8.
Top values for feature_id.
Top values for feature_id (20 unique shown, of 192 total).
count  share  value
1,518  2.0%   83A
1,496  2.0%   82A
1,376  1.8%   81A
1,367  1.8%   87A
1,325  1.7%   143A
1,325  1.7%   143E
1,325  1.7%   143F
1,325  1.7%   143G
1,316  1.7%   97A
1,249  1.6%   86A
1,225  1.6%   88A
1,190  1.6%   144A
1,184  1.5%   85A
1,157  1.5%   112A
1,154  1.5%   89A
1,142  1.5%   95A
1,131  1.5%   69A
1,066  1.4%   33A
1,031  1.3%   51A
  969  1.3%   26A

feature_name categorical feature

This column names linguistic typology features, almost certainly WALS-style feature labels (e.g., 'Order of Object and Verb', 'Order of Subject and Verb'). With 192 distinct values across 76,475 rows and a top rate of just 1.98%, the distribution is remarkably flat (entropy ratio 0.94), suggesting each feature is sampled across many languages rather than a few features dominating. There are no nulls and no obvious duplicates among the top values.

Treatment: Use as a categorical key; one-hot or target-encode, or pivot to wide form with one row per language_id and one column per feature_name.
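
A minimal sketch of the long-to-wide pivot in plain Python: one entry per language, one key per feature. The sample rows and the `pivot_wide` helper are hypothetical, and the value codes are placeholders rather than actual WALS assignments.

```python
from collections import defaultdict

# Made-up long-form rows: (language_id, feature_name, value_name).
long_rows = [
    ("eng", "Order of Object and Verb", "83A-2"),
    ("eng", "Order of Subject and Verb", "82A-1"),
    ("fre", "Order of Object and Verb", "83A-2"),
]

def pivot_wide(rows):
    """Group long rows into {language_id: {feature_name: value_name}}."""
    wide = defaultdict(dict)
    for language_id, feature_name, value_name in rows:
        wide[language_id][feature_name] = value_name  # a later duplicate pair overwrites
    return dict(wide)

wide = pivot_wide(long_rows)
features = sorted({feature for _, feature, _ in long_rows})
for language_id, cells in wide.items():
    # Features a language lacks come out as None, i.e. an empty cell.
    print(language_id, [cells.get(feature) for feature in features])
```

The wide table would be sparse: 76,475 observations over a 2,659 × 192 grid fill at most about 15% of the cells, assuming each language-feature pair occurs at most once.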

anthropic:claude-opus-4-7 · confidence high
Out[18]:

saturn.columns["feature_name"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique192
top_value Order of Object and Verb
top_rate 0.01985
cardinality 192
entropy 7.103
entropy_ratio 0.9365
Fig 9.
Top values for feature_name.
Top values for feature_name (20 unique shown, of 192 total).
count  share  value
1,518  2.0%   Order of Object and Verb
1,496  2.0%   Order of Subject and Verb
1,376  1.8%   Order of Subject, Object and Verb
1,367  1.8%   Order of Adjective and Noun
1,325  1.7%   Order of Negative Morpheme and Verb
1,325  1.7%   Preverbal Negative Morphemes
1,325  1.7%   Postverbal Negative Morphemes
1,325  1.7%   Minor morphological means of signaling negation
1,316  1.7%   Relationship between the Order of Object and Verb and the Order of Adjective and Noun
1,249  1.6%   Order of Genitive and Noun
1,225  1.6%   Order of Demonstrative and Noun
1,190  1.6%   Position of Negative Word With Respect to Subject, Object, and Verb
1,184  1.5%   Order of Adposition and Noun Phrase
1,157  1.5%   Negative Morphemes
1,154  1.5%   Order of Numeral and Noun
1,142  1.5%   Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase
1,131  1.5%   Position of Tense-Aspect Affixes
1,066  1.4%   Coding of Nominal Plurality
1,031  1.3%   Position of Case Affixes
  969  1.3%   Prefixing vs. Suffixing in Inflectional Morphology

value numeric feature

A small-integer feature with only 28 distinct values ranging from 1 to 28, mean 2.85 and median 2. Given the Summary, these integers are most plausibly per-feature category codes rather than counts or ratings. The distribution is heavily right-skewed (skew 3.49, kurtosis 16.36), with 2,469 outliers (3.23%) stretching the tail to 28 while the middle half of the data sits between q1 = 1 and q3 = 4 (IQR 3). There are no nulls or zeros, so every row carries a positive integer.

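Treatment: If the integers are used as magnitudes, log1p-transform or cap the upper tail before modelling to tame the skew; if they are category codes, as the Summary suggests, encode them as categoricals within each feature instead.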
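
A minimal sketch of the two numeric options named in the treatment, log1p and capping the upper tail. Both only make sense if `value` is treated as a magnitude. The sample values are illustrative, and the ceiling of q3 + 1.5 × IQR = 8.5 is one possible choice of cap, not something the report prescribes.

```python
import math

def log1p_transform(values):
    """Compress the right tail with log(1 + v)."""
    return [math.log1p(v) for v in values]

def cap_upper(values, ceiling):
    """Clip anything above the ceiling; smaller values pass through unchanged."""
    return [min(v, ceiling) for v in values]

sample = [1, 2, 4, 9, 28]      # illustrative values inside the reported 1..28 range
ceiling = 4 + 1.5 * (4 - 1)    # q3 + 1.5 * IQR from the stats table (assumed cap)

print([round(x, 3) for x in log1p_transform(sample)])
print(cap_upper(sample, ceiling))
```

log1p keeps the ordering of every value, while capping collapses everything above the ceiling into a single level.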

anthropic:claude-opus-4-7 · confidence high
Out[21]:

saturn.columns["value"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique28
min 1
max 28
mean 2.854
median 2
std 2.824
q1 1
q3 4
iqr 3
skew 3.493
kurtosis 16.36
n_outliers 2,469
outlier_rate 0.03229
zero_rate 0
alert: high_skew · skew=+3.49
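
The outlier count and the mean can be reproduced from the histogram for `value`. The sketch below assumes that each non-empty histogram bin holds exactly one integer (the bins are 0.675 wide) and that outliers follow the Tukey 1.5 × IQR rule. Neither is stated by the report, but together they reproduce the reported n_outliers, outlier_rate, and mean.

```python
# Rows per integer value, read off the non-empty histogram bins for `value`.
# Assumption: each non-empty bin contains exactly one integer.
counts = {
    1: 27379, 2: 21173, 3: 6771, 4: 9917, 5: 3602, 6: 2730, 7: 1392,
    8: 1042, 9: 597, 10: 32, 11: 265, 12: 43, 13: 68, 14: 251,
    15: 190, 16: 156, 17: 44, 18: 127, 19: 83, 20: 358, 21: 202,
    22: 21, 23: 18, 24: 3, 25: 3, 26: 4, 27: 3, 28: 1,
}

q1, q3 = 1, 4                                   # quartiles from the stats table
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences: -3.5 and 8.5

n = sum(counts.values())
n_outliers = sum(c for v, c in counts.items() if v < lower or v > upper)
mean = sum(v * c for v, c in counts.items()) / n

print(n, n_outliers, round(n_outliers / n, 5), round(mean, 3))
# 76475 2469 0.03229 2.854
```

Under these assumptions, every row with a value of 9 or more counts as an outlier.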
Fig 10.
Distribution of value. Vertical dash marks the median.
Histogram bins for value (median: 2.0). 40 equal-width bins from 1 to 28; the 12 bins with a count of 0 are omitted.
bin            count
1 – 1.675      27,379
1.675 – 2.35   21,173
2.35 – 3.025   6,771
3.7 – 4.375    9,917
4.375 – 5.05   3,602
5.725 – 6.4    2,730
6.4 – 7.075    1,392
7.75 – 8.425   1,042
8.425 – 9.1    597
9.775 – 10.45  32
10.45 – 11.12  265
11.8 – 12.48   43
12.48 – 13.15  68
13.83 – 14.5   251
14.5 – 15.18   190
15.85 – 16.52  156
16.52 – 17.2   44
17.88 – 18.55  127
18.55 – 19.23  83
19.9 – 20.58   358
20.58 – 21.25  202
21.93 – 22.6   21
22.6 – 23.28   18
23.95 – 24.62  3
24.62 – 25.3   3
25.98 – 26.65  4
26.65 – 27.33  3
27.33 – 28     1

value_name text feature

This column holds short alphanumeric codes (e.g. '143G-4', '82A-1'): uppercase, single-token, 4–7 characters long, with one_word_rate 1.0 and allcaps_rate 1.0. Across 76,475 rows there are only 1,139 distinct values and a 0.985 duplicate_rate, so it behaves as a categorical key rather than free text. The pattern (digits + letter + dash + digits) suggests a typology or feature-value identifier drawn from a fixed set of codes. Note that saturn reports a vocab_size of 911, which differs from the 1,139 distinct values; the report does not say how vocab_size is computed.

Treatment: Treat as a categorical code and encode (e.g. target or one-hot) rather than tokenizing as text.
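
A minimal sketch of handling these codes as categories. It assumes every code follows the shape of the two examples above (a feature_id, a dash, then an integer); both helpers are hypothetical. `parse_value_name` splits a code into its parts and `index_encode` assigns each distinct code a stable integer index.

```python
import re

# Assumed shape, inferred from '143G-4' and '82A-1': <digits><LETTER>-<digits>.
VALUE_NAME = re.compile(r"^(\d+[A-Z])-(\d+)$")

def parse_value_name(code):
    """Split a code such as '143G-4' into ('143G', 4)."""
    match = VALUE_NAME.match(code)
    if match is None:
        raise ValueError(f"unexpected value_name: {code!r}")
    return match.group(1), int(match.group(2))

def index_encode(codes):
    """Map each distinct code to an integer index, in sorted order."""
    return {code: i for i, code in enumerate(sorted(set(codes)))}

sample = ["82A-1", "143G-4", "82A-1"]
print([parse_value_name(code) for code in sample])
print(index_encode(sample))
```

If the parsed parts always match the row's `feature_id` and `value`, the column is redundant with those two and could be dropped after validation.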

anthropic:claude-opus-4-7 · confidence high
Out[24]:

saturn.columns["value_name"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique1,139
len_min 4
len_max 7
len_mean 5.3
len_median 5
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 75,336
duplicate_rate 0.9851
vocab_size 911
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 98.5% duplicate strings
Fig 11.
Character-length distribution for value_name.
Character-length distribution for value_name (mean: 5.300150375939849). Only non-empty bins are shown.
chars  count
4      4,995
5      45,403
6      24,205
7      1,872

source categorical metadata

This column records the data origin and holds the constant value "WALS" across all 76,475 rows. With cardinality of 1 and entropy of 0.0, it carries no information for modelling and merely tags the dataset's provenance.

Treatment: Drop before modelling; retain only as a provenance tag.
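
A minimal sketch of detecting and dropping constant columns while keeping the provenance value on the side. The sample rows and both helpers are hypothetical; nulls are ignored when deciding whether a column is constant.

```python
def constant_columns(rows):
    """Names of columns whose non-null values are all identical."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            if value is not None:
                seen.setdefault(column, set()).add(value)
    return sorted(column for column, values in seen.items() if len(values) == 1)

def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]

# Made-up rows in the shape of the profiled table.
rows = [
    {"language_id": "eng", "value": 2, "source": "WALS"},
    {"language_id": "fre", "value": 1, "source": "WALS"},
]

constant = constant_columns(rows)
provenance = {column: rows[0][column] for column in constant}  # kept as a dataset-level tag
model_rows = drop_columns(rows, constant)
print(constant, provenance, model_rows)
```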

anthropic:claude-opus-4-7 · confidence high
Out[27]:

saturn.columns["source"].stats

stat value
n 76,475
nulls 0 (0.0%)
unique1
top_value WALS
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 12.
Top values for source.
Top values for source (1 unique shown, of 1 total).
count   share   value
76,475  100.0%  WALS

How to cite


BibTeX
@misc{saturn-parquet-linguistic-features-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: parquet linguistic features},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/parquet-linguistic_features}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: parquet linguistic features. Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/parquet-linguistic_features