saturn

/home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet | 76,475 rows | sample n=76,475 | seed 42 | 2026-05-01T17:23:21+00:00

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/linguistic_features.parquet
Total rows: 76,475
Profiled sample: 76,475
Columns: 6
Generated: 2026-05-01T17:23:21+00:00

Insights opt-in

Model-generated narrative. These are opinions, not facts — the stats below are what saturn measured. Generated by: anthropic:claude-opus-4-7.

Dataset high anthropic:claude-opus-4-7

This dataset contains 76,475 rows of linguistic feature observations, all sourced from WALS (World Atlas of Language Structures). Each row pairs a language (2,659 unique language IDs) with one of 192 typological features (e.g., 'Order of Object and Verb') and a categorical value encoded as both a short code (value_name) and a small integer (value). The most common features cluster around word order and negation, with 'Order of Object and Verb' being the top feature at 1,518 rows. Worth a closer look: the `value` column is highly skewed (skew 3.49, kurtosis 16.4) with ~3.2% outliers reaching up to 28, suggesting most features have only a few possible values but a handful have many categories. The `source` column is constant ('WALS') and can be ignored as a variable.

language_id high anthropic:claude-opus-4-7

This column holds 2-3 character language codes (eng, fre, ger, rus, ...), with len_mean 2.996 and one_word_rate 1.0. They look ISO 639-style, but since every row is sourced from WALS they are more likely WALS's own language identifiers. With 76,475 rows (74 of them null), only 2,659 unique values and a 96.5% duplicate_rate, the column behaves as a categorical key rather than free text. The top codes are closely balanced (eng 159, fre 158, ger 157), suggesting a curated multilingual catalogue rather than organic usage.

feature_id high anthropic:claude-opus-4-7

feature_id is a categorical code with 192 distinct values across 76,475 rows and no nulls. Codes like '83A', '82A' and '143E' follow a numeric-prefix-plus-letter scheme. The distribution is fairly flat: the entropy ratio is 0.937 and the top value '83A' covers only 1.98% of rows. Four '143' features ('143A', '143E', '143F', '143G') tie at exactly 1,325 occurrences, hinting that they were coded together for the same set of languages rather than sampled independently.
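
The tie between the four '143' features could be checked directly by comparing the sets of languages each feature covers. The sketch below is one way to do that, assuming the rows are available as (language_id, feature_id) pairs; the helper names and the toy rows are hypothetical and are not taken from the parquet file or from saturn.

```python
from collections import defaultdict


def languages_by_feature(rows):
    """Map each feature_id to the set of language_ids recorded for it."""
    coverage = defaultdict(set)
    for language_id, feature_id in rows:
        coverage[feature_id].add(language_id)
    return dict(coverage)


def share_same_languages(rows, feature_ids):
    """Return True if every listed feature covers the identical language set."""
    coverage = languages_by_feature(rows)
    sets = [coverage.get(fid, set()) for fid in feature_ids]
    return all(s == sets[0] for s in sets)


# Hypothetical toy rows for illustration only.
toy_rows = [
    ("abi", "143A"), ("abi", "143E"),
    ("rus", "143A"), ("rus", "143E"),
    ("rus", "83A"),
]
same_143 = share_same_languages(toy_rows, ["143A", "143E"])
same_83 = share_same_languages(toy_rows, ["143A", "83A"])
```

If the real '143A', '143E', '143F' and '143G' rows pass this check, the identical counts come from shared language coverage rather than coincidence.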

feature_name high anthropic:claude-opus-4-7

This column names linguistic typology features; given the constant source column, these are WALS feature labels (e.g., 'Order of Object and Verb', 'Order of Subject and Verb'). With 192 distinct values across 76,475 rows and a top rate of just 1.98%, the distribution is fairly flat (entropy ratio 0.94), suggesting each feature is sampled across many languages rather than a few features dominating. There are no nulls and no obvious duplicates among the top values.
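
The entropy ratio quoted above can be read as Shannon entropy divided by its maximum for the observed number of categories. The sketch below assumes entropy is measured in bits and normalized by log2 of the category count; whether saturn computes it exactly this way is an assumption. Under that assumption, 7.103 / log2(192) comes out at roughly 0.936, within rounding distance of the reported 0.937.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given raw counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by log2(k), where k is the number of non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column carries no information
    return entropy_bits(counts) / math.log2(k)


# Reported figures for feature_id / feature_name: entropy 7.103, 192 categories.
max_entropy = math.log2(192)
reported_ratio = 7.103 / max_entropy
```

A ratio of 1.0 would mean all 192 features occur equally often; a constant column such as source gives 0.0.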

value high anthropic:claude-opus-4-7

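A small-integer category code rather than a count or rating: each value indexes one of the possible categories of the row's feature, and apparently matches the number that ends value_name (e.g. '82A-1'). There are 28 distinct values ranging from 1 to 28, with mean 2.85 and median 2. The distribution is heavily right-skewed (skew 3.49, kurtosis 16.36), with 2,469 outliers (3.23%) stretching the tail to 28 while the quartiles sit at 1 and 4 (IQR 3). There are no nulls or zeros, so every row carries a positive code, and the skew reflects that most features have few categories while a handful have many.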
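
To make the outlier figure concrete, the sketch below applies Tukey's 1.5 x IQR rule to the reported quartiles. The report does not state which rule saturn uses, so the rule and the multiplier k=1.5 are assumptions here. With q1 = 1 and q3 = 4 the fences are -3.5 and 8.5, so under this rule any value of 9 or more would be flagged.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x, q1, q3, k=1.5):
    """True if x falls outside the Tukey fences."""
    low, high = tukey_fences(q1, q3, k)
    return x < low or x > high


# Quartiles reported for the value column: q1 = 1, q3 = 4.
low, high = tukey_fences(1.0, 4.0)
```

Because the minimum is 1, nothing falls below the lower fence; every flagged row sits in the upper tail.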

value_name high anthropic:claude-opus-4-7

This column holds short alphanumeric codes (e.g. '143G-4', '82A-1'): uppercase, single-token, 4-7 characters long, with 1.0 one_word_rate and 1.0 allcaps_rate. Across 76,475 rows there are only 1,139 distinct values and a 0.985 duplicate_rate, so it behaves as a categorical key rather than free text. The pattern (digits + letter + dash + digits) suggests a feature-value identifier: a feature_id followed by the value number. The fixed vocabulary is those 1,139 codes; the smaller vocab_size of 911 in the stats is presumably a separate tokenizer-level count.
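
The code pattern described above can be split into its two parts with a regular expression. This is a minimal sketch, assuming every code has the shape digits + one uppercase letter + dash + digits, as the sample values suggest; the function name is hypothetical.

```python
import re

# digits, one uppercase letter, a dash, then the value number
VALUE_NAME_RE = re.compile(r"^(\d+[A-Z])-(\d+)$")


def split_value_name(code):
    """Split a code like '143G-4' into (feature_id, value)."""
    match = VALUE_NAME_RE.match(code)
    if match is None:
        raise ValueError(f"unexpected value_name format: {code!r}")
    return match.group(1), int(match.group(2))


parsed = [split_value_name(c) for c in ["18A-1", "144J-7", "143G-4"]]
```

Comparing the parsed pair against the feature_id and value columns of the same row would confirm whether value_name is fully redundant with them.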

source high anthropic:claude-opus-4-7

This column records the data origin and holds the constant value "WALS" across all 76,475 rows. With cardinality of 1 and entropy of 0.0, it carries no information for modelling and merely tags the dataset's provenance.

language_id text

100.0% of rows are a single word; 95th-percentile length under 20 chars; 96.5% duplicate strings
rows: 76,475
null: 74 (0.1%)
unique: 2,659
len_min: 2
len_max: 3
len_mean: 2.996
len_median: 3.000
len_p95: 3.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 73,742
duplicate_rate: 0.965
vocab_size: 2,209
readability_flesch_mean: 118.259
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 0.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. abi
  2. rus
  3. tmn
  4. tag
  5. kmz
  6. yim
  7. kom
  8. tuy
  9. tag
  10. aml
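
The duplicate figures for language_id are consistent with counting over non-null rows only: 76,475 rows minus 74 nulls minus 2,659 unique values leaves 73,742 duplicates, and 73,742 / 76,401 rounds to 0.965. The sketch below reproduces that arithmetic; that saturn excludes nulls in this way is inferred from the numbers, not documented in the report.

```python
def duplicate_stats(rows, nulls, unique):
    """Return (n_duplicates, duplicate_rate) computed over non-null rows."""
    non_null = rows - nulls
    n_duplicates = non_null - unique
    return n_duplicates, n_duplicates / non_null


# Figures reported for language_id.
n_dup, rate = duplicate_stats(rows=76_475, nulls=74, unique=2_659)

# Figures reported for value_name (no nulls).
n_dup_vn, rate_vn = duplicate_stats(rows=76_475, nulls=0, unique=1_139)
```

The same formula gives 75,336 duplicates and a rate of about 0.985 for value_name, matching its stats block.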

feature_id categorical

rows: 76,475
null: 0 (0.0%)
unique: 192
top_value: 83A
top_rate: 0.020
cardinality: 192
entropy: 7.103
entropy_ratio: 0.937
Top values (rank 1–20)
  1. 83A — 1,518
  2. 82A — 1,496
  3. 81A — 1,376
  4. 87A — 1,367
  5. 143A — 1,325
  6. 143E — 1,325
  7. 143F — 1,325
  8. 143G — 1,325
  9. 97A — 1,316
  10. 86A — 1,249
  11. 88A — 1,225
  12. 144A — 1,190
  13. 85A — 1,184
  14. 112A — 1,157
  15. 89A — 1,154
  16. 95A — 1,142
  17. 69A — 1,131
  18. 33A — 1,066
  19. 51A — 1,031
  20. 26A — 969

feature_name categorical

rows: 76,475
null: 0 (0.0%)
unique: 192
top_value: Order of Object and Verb
top_rate: 0.020
cardinality: 192
entropy: 7.103
entropy_ratio: 0.937
Top values (rank 1–20)
  1. Order of Object and Verb — 1,518
  2. Order of Subject and Verb — 1,496
  3. Order of Subject, Object and Verb — 1,376
  4. Order of Adjective and Noun — 1,367
  5. Order of Negative Morpheme and Verb — 1,325
  6. Preverbal Negative Morphemes — 1,325
  7. Postverbal Negative Morphemes — 1,325
  8. Minor morphological means of signaling negation — 1,325
  9. Relationship between the Order of Object and Verb and the Order of Adjective and Noun — 1,316
  10. Order of Genitive and Noun — 1,249
  11. Order of Demonstrative and Noun — 1,225
  12. Position of Negative Word With Respect to Subject, Object, and Verb — 1,190
  13. Order of Adposition and Noun Phrase — 1,184
  14. Negative Morphemes — 1,157
  15. Order of Numeral and Noun — 1,154
  16. Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase — 1,142
  17. Position of Tense-Aspect Affixes — 1,131
  18. Coding of Nominal Plurality — 1,066
  19. Position of Case Affixes — 1,031
  20. Prefixing vs. Suffixing in Inflectional Morphology — 969

value numeric

skew=+3.49
rows: 76,475
null: 0 (0.0%)
unique: 28
min: 1.000
max: 28.000
mean: 2.854
median: 2.000
std: 2.824
q1: 1.000
q3: 4.000
iqr: 3.000
skew: 3.493
kurtosis: 16.361
n_outliers: 2,469
outlier_rate: 0.032
zero_rate: 0.000

value_name text

100.0% of rows are a single word; 100.0% of rows are all-caps; 95th-percentile length under 20 chars; 98.5% duplicate strings
rows: 76,475
null: 0 (0.0%)
unique: 1,139
len_min: 4
len_max: 7
len_mean: 5.300
len_median: 5.000
len_p95: 6.000
word_mean: 1.000
word_median: 1.000
n_empty: 0
n_duplicates: 75,336
duplicate_rate: 0.985
vocab_size: 911
readability_flesch_mean: 121.220
emoji_rate: 0.000
url_rate: 0.000
one_word_rate: 1.000
allcaps_rate: 1.000
boilerplate_rate: 0.000
Sample values (first 10)
  1. 18A-1
  2. 96A-2
  3. 87A-2
  4. 144J-7
  5. 144P-4
  6. 95A-1
  7. 144B-3
  8. 114A-7
  9. 131A-1
  10. 96A-4

source categorical

top value is 100.0% of rows
rows: 76,475
null: 0 (0.0%)
unique: 1
top_value: WALS
top_rate: 1.000
cardinality: 1
entropy: 0.000
entropy_ratio: 0.000
Top values (rank 1–20)
  1. WALS — 76,475