saturn

/home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet | 76,475 rows | sample n=76,475 | seed 42 | 2026-05-01T17:23:21+00:00

Overview

Source	/home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet
Total rows	76,475
Profiled sample	76,475
Columns	6
Generated	2026-05-01T17:23:21+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset contains 76,475 rows of linguistic feature observations, all sourced from WALS (World Atlas of Language Structures). Each row pairs a language (2,659 unique language IDs; language_id is null in 74 rows) with one of 192 typological features (e.g., 'Order of Object and Verb') and a categorical value encoded as both a short code (value_name) and a small integer (value). The most common features cluster around word order and negation, with 'Order of Object and Verb' being the top feature at 1,518 rows. The `value` column is highly skewed (skew 3.49, kurtosis 16.4), with about 3.2% of rows flagged as outliers and a maximum of 28, which suggests that most features have only a few possible values while a handful have many categories. The `source` column is constant ('WALS') and can be ignored as a variable.

language_id high anthropic:claude-opus-4-7

This column holds 2-3 character language codes (eng, fre, ger, rus...), with len_mean 2.996 and one_word_rate 1.0. The codes resemble ISO 639 codes, but since every row is sourced from WALS they are more likely WALS language codes. It is the only column with nulls: 74 rows (0.1%). With 76,475 rows but only 2,659 unique values and a 96.5% duplicate_rate, it behaves as a categorical key rather than free text. The top codes are closely balanced (eng 159, fre 158, ger 157), suggesting a curated multilingual catalogue rather than organic usage.
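The duplicate figures for this column can be reproduced from the other reported counts. The sketch below assumes, without confirmation from saturn itself, that a duplicate is any non-null row beyond the first occurrence of its value and that the rate is taken over non-null rows; the variable names are illustrative.

```python
# Figures copied from the language_id stats in this report.
rows = 76_475
nulls = 74
unique = 2_659

non_null = rows - nulls            # rows that actually carry a code
n_duplicates = non_null - unique   # rows beyond the first occurrence of each code
duplicate_rate = n_duplicates / non_null

print(n_duplicates)                # 73742, matching the reported n_duplicates
print(round(duplicate_rate, 3))    # 0.965, matching the reported duplicate_rate

# Dividing by all rows (nulls included) rounds to a different figure,
# which is why the non-null denominator is the assumed one here.
print(round(n_duplicates / rows, 3))  # 0.964
```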

feature_id high anthropic:claude-opus-4-7

feature_id is a categorical code with 192 distinct values across 76,475 rows and no nulls, with codes like '83A', '82A', '143E' suggesting a numeric prefix plus letter suffix scheme. The distribution is close to uniform: entropy ratio 0.937, top value '83A' covers only 1.98% of rows, and four '143' codes ('143A', '143E', '143F', '143G') tie at exactly 1,325 occurrences, hinting that these four features were coded for the same set of languages rather than sampled independently.
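For readers unfamiliar with the entropy ratio, here is a minimal sketch of how such a figure is usually computed: Shannon entropy in bits divided by the maximum possible entropy, log2 of the number of distinct values. That saturn uses exactly this definition is an assumption, and the function names are hypothetical.

```python
import math

def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column carries no information
    return entropy_bits(counts) / math.log2(k)

# Maximum entropy for 192 categories, and the ratio implied by the
# reported entropy of 7.103 bits.
max_bits = math.log2(192)
implied_ratio = 7.103 / max_bits
print(round(max_bits, 3), implied_ratio)
```

The implied ratio comes out at roughly 0.936, which agrees with the reported 0.937 to within the rounding of the entropy figure.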

feature_name high anthropic:claude-opus-4-7

This column names linguistic typology features, almost certainly WALS-style feature labels (e.g., 'Order of Object and Verb', 'Order of Subject and Verb'). With 192 distinct values across 76,475 rows and a top rate of just 1.98%, the distribution is close to uniform (entropy ratio 0.937), suggesting each feature is sampled across many languages rather than a few features dominating. There are no nulls, and the top 20 labels are all distinct.
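feature_name reports the same cardinality (192), entropy (7.103) and top-value counts as feature_id, which is what a one-to-one mapping between the two columns would produce. The sketch below shows one way to check that on the actual rows; `is_one_to_one` is a hypothetical helper, and the sample pairs are inferred from the matching counts in the two top-values lists rather than read from the data.

```python
from collections import defaultdict

def is_one_to_one(pairs):
    """Return True if every left value maps to exactly one right value and vice versa."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)
    return (all(len(v) == 1 for v in left_to_right.values())
            and all(len(v) == 1 for v in right_to_left.values()))

# Illustrative (feature_id, feature_name) pairs; repeats are expected in real rows.
pairs = [
    ("83A", "Order of Object and Verb"),
    ("82A", "Order of Subject and Verb"),
    ("83A", "Order of Object and Verb"),
]
print(is_one_to_one(pairs))  # True

# A feature_id carrying two different names breaks the mapping.
print(is_one_to_one(pairs + [("83A", "Order of Subject and Verb")]))  # False
```

If the check holds on the full table, one of the two columns can be dropped or moved to a lookup table without losing information.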

value high anthropic:claude-opus-4-7

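A small-integer column with 28 distinct values ranging from 1 to 28, mean 2.85 and median 2. Given the value_name codes (e.g. '143G-4'), it is more plausibly the index of a category within each feature than a count or rating, so the mean and skew say more about how many categories features tend to have than about any measured quantity. The distribution is heavily right-skewed (skew 3.49, kurtosis 16.36), with 2,469 outliers (3.23%) stretching the tail to 28 while the interquartile range runs from 1 to 4. There are no nulls or zeros.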
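The report does not state how outliers are defined. A common convention is Tukey's rule, which flags anything more than 1.5 times the IQR beyond the quartiles; the sketch below applies that rule to the reported quartiles, purely as an assumption about what saturn does.

```python
# Quartiles copied from the value stats in this report.
q1, q3 = 1.0, 4.0
iqr = q3 - q1            # 3.0, matching the reported iqr

k = 1.5                  # Tukey multiplier; assumed, not confirmed for saturn
lower_fence = q1 - k * iqr
upper_fence = q3 + k * iqr

print(lower_fence, upper_fence)  # -3.5 8.5
```

Under this assumption no value can fall below the lower fence (the minimum is 1), so every flagged row would be one with a value of 9 or more.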

value_name high anthropic:claude-opus-4-7

This column holds short alphanumeric codes (e.g. '143G-4', '82A-1'): uppercase, single-token, 4-7 characters long, with one_word_rate 1.0 and allcaps_rate 1.0. Across 76,475 rows there are only 1,139 distinct values and a duplicate_rate of 0.985, so it behaves as a categorical key rather than free text. The pattern (digits + letter + dash + digits) looks like a feature_id followed by a value index, which would make each code a feature-value identifier. The profiler also reports a vocab_size of 911, lower than the 1,139 distinct strings; the stats do not say how tokens are counted, so the two figures should not be treated as interchangeable.
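If the codes really are a feature_id joined to a value index, they can be split with a small regular expression and compared against the feature_id and value columns. This is a sketch under that assumption; `split_value_name` is a hypothetical helper, and the inputs are taken from the sample values and examples shown in this report.

```python
import re

# Digits, one uppercase letter, a dash, then the value index.
VALUE_NAME = re.compile(r"(\d+[A-Z])-(\d+)")

def split_value_name(code):
    """Split a code such as '143G-4' into ('143G', 4); return None if it does not fit."""
    match = VALUE_NAME.fullmatch(code)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

for code in ["18A-1", "96A-2", "144J-7", "143G-4"]:
    print(code, split_value_name(code))
```

Rows where the split feature part differs from feature_id, or the index differs from value, would be worth inspecting before relying on either encoding.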

source high anthropic:claude-opus-4-7

This column records the data origin and holds the constant value "WALS" across all 76,475 rows. With cardinality of 1 and entropy of 0.0, it carries no information for modelling and merely tags the dataset's provenance.
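The stats for this column print the entropy with a minus sign (-0.000). One plausible explanation, offered as an assumption rather than a confirmed description of saturn, is floating-point negative zero: negating a sum that is exactly 0.0 yields -0.0, which formats with a sign. The function name below is illustrative.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits; the leading minus turns an exact 0.0 into -0.0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

h = entropy_bits([1.0])   # a constant column: one value with probability 1
print(f"{h:.3f}")         # -0.000

# Adding positive zero normalises the sign without changing the value.
print(f"{h + 0.0:.3f}")   # 0.000
```

Either way the value is zero, so the sign carries no meaning for this column.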

language_id text

100.0% rows are a single word | 95th-percentile length under 20 chars | 96.5% duplicate strings

rows	76,475
null	74 (0.1%)
unique	2,659
len_min	2
len_max	3
len_mean	2.996
len_median	3.000
len_p95	3.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	73,742
duplicate_rate	0.965
vocab_size	2,209
readability_flesch_mean	118.259
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	0.000
boilerplate_rate	0.000

Sample values (first 10)

feature_id categorical

rows	76,475
null	0 (0.0%)
unique	192
top_value	83A
top_rate	0.020
cardinality	192
entropy	7.103
entropy_ratio	0.937

Top values (rank 1–20)

83A — 1,518
82A — 1,496
81A — 1,376
87A — 1,367
143A — 1,325
143E — 1,325
143F — 1,325
143G — 1,325
97A — 1,316
86A — 1,249
88A — 1,225
144A — 1,190
85A — 1,184
112A — 1,157
89A — 1,154
95A — 1,142
69A — 1,131
33A — 1,066
51A — 1,031
26A — 969

feature_name categorical

rows	76,475
null	0 (0.0%)
unique	192
top_value	Order of Object and Verb
top_rate	0.020
cardinality	192
entropy	7.103
entropy_ratio	0.937

Top values (rank 1–20)

Order of Object and Verb — 1,518
Order of Subject and Verb — 1,496
Order of Subject, Object and Verb — 1,376
Order of Adjective and Noun — 1,367
Order of Negative Morpheme and Verb — 1,325
Preverbal Negative Morphemes — 1,325
Postverbal Negative Morphemes — 1,325
Minor morphological means of signaling negation — 1,325
Relationship between the Order of Object and Verb and the Order of Adjective and Noun — 1,316
Order of Genitive and Noun — 1,249
Order of Demonstrative and Noun — 1,225
Position of Negative Word With Respect to Subject, Object, and Verb — 1,190
Order of Adposition and Noun Phrase — 1,184
Negative Morphemes — 1,157
Order of Numeral and Noun — 1,154
Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase — 1,142
Position of Tense-Aspect Affixes — 1,131
Coding of Nominal Plurality — 1,066
Position of Case Affixes — 1,031
Prefixing vs. Suffixing in Inflectional Morphology — 969

value numeric

skew=+3.49

rows	76,475
null	0 (0.0%)
unique	28
min	1.000
max	28.000
mean	2.854
median	2.000
std	2.824
q1	1.000
q3	4.000
iqr	3.000
skew	3.493
kurtosis	16.361
n_outliers	2,469
outlier_rate	0.032
zero_rate	0.000

value_name text

100.0% rows are a single word | 100.0% rows are all-caps | 95th-percentile length under 20 chars | 98.5% duplicate strings

rows	76,475
null	0 (0.0%)
unique	1,139
len_min	4
len_max	7
len_mean	5.300
len_median	5.000
len_p95	6.000
word_mean	1.000
word_median	1.000
n_empty	0
n_duplicates	75,336
duplicate_rate	0.985
vocab_size	911
readability_flesch_mean	121.220
emoji_rate	0.000
url_rate	0.000
one_word_rate	1.000
allcaps_rate	1.000
boilerplate_rate	0.000

Sample values (first 10)

18A-1
96A-2
87A-2
144J-7
144P-4
95A-1
144B-3
114A-7
131A-1
96A-4

source categorical

top value is 100.0% of rows

rows	76,475
null	0 (0.0%)
unique	1
top_value	WALS
top_rate	1.000
cardinality	1
entropy	-0.000
entropy_ratio	0.000

Top values (rank 1–20)

WALS — 76,475