saturn

processed word forms

source: /home/coolhand/servers/diachronica/etymology_atlas/processed/word_forms.csv · 25,731 rows · 8 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 25,731 word forms drawn from a single source ('iecor'), each tagged with a concept, a language, and a cognate identifier: in effect a comparative wordlist covering 160 language varieties and 170 concepts. The 'form' column is mostly single-word entries (94.6% one-word, mean length ~5.4 characters), and about 24.9% of rows repeat a form already seen, so many forms are shared across languages. Language coverage is broad and well balanced (entropy ratio ~0.99 across 142 ISO codes), led by Greek (ell), Slovenian (slv), and Macedonian (mkd). Worth a closer look: the concept distribution is remarkably even (about 151 forms per concept on average, 170 for the most frequent), and the language_name distribution shows which languages are most densely sampled (Bakhtiari, Nepali, Italiot Greek). The 'source_dataset' column is constant and can be ignored.

citing: row_count · column_count · form.stats.one_word_rate · form.stats.duplicate_rate · form.stats.len_mean · iso_639_3.n_unique · iso_639_3.top_values · concept.n_unique · concept.top_values · language_name.top_values · source_dataset.top_rate

Schema

8 columns
Per-column summary.

column          type         nulls  unique  alerts
form            text         0.0%   19,334  one_word, short_text, duplicates
language_id     numeric      0.0%   160
language_name   categorical  0.0%   160
glottocode      categorical  0.0%   152
iso_639_3       categorical  0.7%   142
concept         categorical  0.0%   170
cognate_id      numeric      0.0%   4,979
source_dataset  categorical  0.0%   1       imbalance
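The null-rate and unique-count figures in the schema summary can be recomputed with a few lines of standard-library Python. This is a minimal sketch, not the profiler's actual implementation: the function name `profile_columns`, the rule that an empty string counts as a null, and the inline sample rows are all assumptions made for illustration.

```python
import csv
import io


def profile_columns(rows):
    """Return {column: (null_rate, n_unique)} for a list of dict rows.

    Assumption: an empty string or None counts as a null.
    """
    if not rows:
        return {}
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        non_null = [v for v in values if v not in ("", None)]
        null_rate = 1 - len(non_null) / len(values)
        summary[column] = (null_rate, len(set(non_null)))
    return summary


# Tiny inline sample standing in for word_forms.csv (rows are illustrative only).
sample_csv = """form,iso_639_3,source_dataset
noga,slv,iecor
voda,,iecor
noga,mkd,iecor
"""
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))
schema = profile_columns(sample_rows)
```

To run this against the real file, replace the `io.StringIO` sample with an open handle on the CSV path given in the source line.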

form

text feature one_word short_text duplicates
This column holds short lexical forms: 94.6% are one-word entries, with a mean length of 5.4 characters and a median word count of 1. There are 19,334 distinct forms (16,219 distinct word tokens) across 25,731 rows, and top values such as 'noga', 'pā', 'dūr', 'voda', and 'bitter' point to a multilingual mix, plausibly Slavic, Iranian, and Germanic given the Indo-European source ('iecor'). Notably, 6,397 rows (24.9%) repeat a form that already occurs elsewhere, yet no single form dominates: the most frequent ('noga') appears only 24 times. Treatment: Treat as a categorical lexical token; normalize Unicode and condition on the existing language columns, rather than running language detection, before embedding. high · anthropic:claude-opus-4-7
n: 25,731
nulls: 0 (0.0%)
unique: 19,334
len_min: 1
len_max: 63
len_mean: 5.373
len_median: 5
len_p95: 9
word_mean: 1.067
word_median: 1
n_empty: 0
n_duplicates: 6,397
duplicate_rate: 0.2486
vocab_size: 16,219
readability_flesch_mean: 86.62
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9464
allcaps_rate: 0
boilerplate_rate: 0
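The reported figures satisfy n_duplicates = n - unique (25,731 - 19,334 = 6,397), which gives duplicate_rate = 6,397 / 25,731 ≈ 0.2486. The sketch below reproduces that arithmetic and shows why Unicode normalization matters before counting. It is an illustration under assumptions: the report only says "normalize unicode", so the choice of NFC and the helper names `normalize_form` and `duplicate_stats` are not taken from the profiler.

```python
import unicodedata


def normalize_form(form):
    """Normalize to NFC (an assumed choice) and trim surrounding whitespace."""
    return unicodedata.normalize("NFC", form).strip()


def duplicate_stats(forms):
    """Return (n_duplicates, duplicate_rate), where n_duplicates = n - unique."""
    n = len(forms)
    n_duplicates = n - len(set(forms))
    return n_duplicates, (n_duplicates / n if n else 0.0)


# The same string written two ways: precomposed U+0101 vs. "a" + combining macron U+0304.
raw_forms = ["p\u0101", "pa\u0304", "noga", "voda"]
clean_forms = [normalize_form(f) for f in raw_forms]

raw_stats = duplicate_stats(raw_forms)      # the two spellings count as different
clean_stats = duplicate_stats(clean_forms)  # after NFC they collapse into one form
```

Without normalization, visually identical forms can be counted as distinct, which would understate the duplicate rate and inflate the unique count.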

language_id

numeric foreign_key
Numeric code with 160 distinct values across 25,731 rows and zero nulls, ranging from 3 to 317 with a near-symmetric distribution (skew -0.05) and a flat shape (kurtosis -1.47). The flat, wide spread and the integer-valued q1, median, and q3 (65, 174, 266) suggest a categorical language identifier stored as an integer rather than a true numeric measurement. The absence of outliers and zeros is consistent with a lookup key. Treatment: Treat as categorical and left-join to a language lookup table; do not model as a continuous variable. high · anthropic:claude-opus-4-7
n: 25,731
nulls: 0 (0.0%)
unique: 160
min: 3
max: 317
mean: 166
median: 174
std: 101.4
q1: 65
q3: 266
iqr: 201
skew: -0.04885
kurtosis: -1.471
n_outliers: 0
outlier_rate: 0
zero_rate: 0
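The suggested treatment (cast to a categorical label, then left-join to a lookup table) might look like the sketch below. The lookup table, its `language_label` field, and the labels in it are hypothetical; no such table is described in the report.

```python
def left_join_language(rows, lookup):
    """Attach lookup fields to each row by language_id, keeping unmatched rows."""
    joined = []
    for row in rows:
        key = str(row["language_id"])  # treat the id as a label, not a number
        extra = lookup.get(key, {"language_label": None})
        joined.append({**row, "language_id": key, **extra})
    return joined


# Hypothetical lookup table; ids and labels are illustrative only.
lookup = {
    "3": {"language_label": "Language A"},
    "317": {"language_label": "Language B"},
}
rows = [
    {"form": "noga", "language_id": 3},
    {"form": "voda", "language_id": 174},
]
joined = left_join_language(rows, lookup)
```

A left join keeps every word-form row even when the lookup has no entry for its id, so gaps in the lookup show up as `None` instead of silently dropping data.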

language_name

categorical feature
This column holds language names, with 160 distinct values across 25,731 rows and no nulls. The distribution is remarkably flat: entropy_ratio is 0.996 and the most common value, 'Bakhtiari', appears just 178 times (0.69%), suggesting near-uniform sampling of languages rather than a natural population. Several entries use a 'Language: Variety' convention (e.g., 'Greek: Italiot', 'Breton: Treger'), indicating dialect-level granularity mixed with top-level language names. Treatment: Use as a categorical feature; consider splitting on ':' to separate the language from the variety before encoding. high · anthropic:claude-opus-4-7
n: 25,731
nulls: 0 (0.0%)
unique: 160
top_value: Bakhtiari
top_rate: 0.006918
cardinality: 160
entropy: 7.294
entropy_ratio: 0.9962
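Splitting on ':' as suggested can be sketched as follows. The helper name `split_language_name` is illustrative, and the code assumes that only the first colon acts as the separator and that names without a colon have no variety.

```python
def split_language_name(name):
    """Split 'Language: Variety' into (language, variety); variety is None if absent."""
    language, sep, variety = name.partition(":")
    return language.strip(), (variety.strip() if sep else None)


pairs = [split_language_name(n) for n in ["Greek: Italiot", "Breton: Treger", "Bakhtiari"]]
```

Encoding the two parts separately lets a model share information across varieties of the same language while still distinguishing them.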

glottocode

categorical foreign_key
Glottocodes are Glottolog's stable identifiers for languages and dialects, so this column tags each of the 25,731 rows with one of 152 distinct codes. The distribution is flat overall: the entropy ratio is 0.991 and the most frequent code, 'mace1250', covers only 1.93% of rows (about 497), with several Slavic and Germanic codes clustered around 340–350 occurrences. There are no nulls. The visible drop-off after the top seven codes (down to ~177) may reflect a tiered sampling design, or several language varieties sharing one glottocode, since 160 language_id values map onto only 152 codes. Treatment: left-join on this id to Glottolog metadata, or one-hot/target-encode for modelling. high · anthropic:claude-opus-4-7
n: 25,731
nulls: 0 (0.0%)
unique: 152
top_value: mace1250
top_rate: 0.01932
cardinality: 152
entropy: 7.184
entropy_ratio: 0.9912
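Because 160 language_id values map onto only 152 glottocodes, at least some codes must be shared by more than one variety. One way to check which ones is to collect the language_id values seen under each code, as in this sketch. The sample rows are made up (including the placeholder code 'abcd1234'), and `varieties_per_glottocode` is a hypothetical helper name.

```python
from collections import defaultdict


def varieties_per_glottocode(rows):
    """Map each glottocode to the set of language_id values that use it."""
    mapping = defaultdict(set)
    for row in rows:
        mapping[row["glottocode"]].add(row["language_id"])
    return dict(mapping)


# Illustrative rows; the pairing of ids and codes here is invented.
rows = [
    {"language_id": 10, "glottocode": "mace1250"},
    {"language_id": 11, "glottocode": "mace1250"},
    {"language_id": 12, "glottocode": "abcd1234"},
    {"language_id": 10, "glottocode": "mace1250"},
]
shared = {
    code: ids
    for code, ids in varieties_per_glottocode(rows).items()
    if len(ids) > 1
}
```

Codes that appear in `shared` would need a decision before joining to Glottolog metadata: either aggregate their varieties or keep language_id as the finer-grained key.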

iso_639_3

categorical feature
This column holds ISO 639-3 language codes, with 142 distinct codes spread across 25,731 rows. The distribution is remarkably flat, with an entropy ratio of 0.985 and the top code 'ell' covering only 2.04% of rows, so no single language dominates. The null rate is low at 0.67% (173 rows); the glottocode column, which has no nulls, can serve as a fallback key for those rows. Treatment: Treat as a categorical feature; one-hot or target-encode given the 142 levels, or group rare codes. high · anthropic:claude-opus-4-7
n: 25,731
nulls: 173 (0.7%)
unique: 142
top_value: ell
top_rate: 0.02042
cardinality: 142
entropy: 7.044
entropy_ratio: 0.9853
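The entropy and entropy_ratio figures reported for the categorical columns are consistent with Shannon entropy in bits divided by log2 of the cardinality; here, 7.044 / log2(142) ≈ 0.985. The sketch below implements that reading. Excluding nulls from the counts and returning 0 for a constant column are assumptions about the profiler's behaviour, not documented facts.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Return (entropy_bits, entropy_bits / log2(cardinality)) over non-null values."""
    counts = Counter(v for v in values if v not in ("", None))
    total = sum(counts.values())
    if total == 0 or len(counts) == 1:
        return 0.0, 0.0  # empty or constant column: no uncertainty
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy, entropy / math.log2(len(counts))


uniform = entropy_ratio(["ell", "slv", "mkd", "ell", "slv", "mkd"])
skewed = entropy_ratio(["ell"] * 9 + ["slv"])
constant = entropy_ratio(["iecor"] * 5)
```

A ratio near 1 means the values are close to evenly spread across the observed levels, while a ratio near 0 means one level dominates.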

concept

categorical foreign_key
This column holds 170 distinct concept labels (e.g., 'say', 'man', 'big', 'stone', 'house') spread almost perfectly evenly across 25,731 rows, with the top value covering only 0.66% of the data and entropy at 99.98% of the maximum. The vocabulary resembles a Swadesh-style basic concept list. The average is about 151 forms per concept (25,731 / 170), below the 160 languages, while the top concept 'say' has 170 forms, above it; so each concept appears roughly once per language, with some gaps and some languages contributing more than one form. Treatment: Treat as a categorical key; group or pivot on it rather than one-hot encoding given 170 balanced levels. high · anthropic:claude-opus-4-7
n: 25,731
nulls: 0 (0.0%)
unique: 170
top_value: say
top_rate: 0.006607
cardinality: 170
entropy: 7.408
entropy_ratio: 0.9998
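Pivoting on concept, as the treatment suggests, turns the long-format rows into a concept-by-language table in which gaps and multiple forms per cell become visible. The following is a rough sketch: `pivot_by_concept` and `coverage` are hypothetical helper names, and the language names and forms in the sample are placeholders.

```python
from collections import defaultdict


def pivot_by_concept(rows):
    """Build {concept: {language_name: [forms]}} from long-format rows."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row["concept"]][row["language_name"]].append(row["form"])
    return {concept: dict(langs) for concept, langs in table.items()}


def coverage(table, n_languages):
    """Fraction of languages with at least one form, per concept."""
    return {concept: len(langs) / n_languages for concept, langs in table.items()}


# Placeholder rows; "Lang A", "Lang B" and the forms are not real data.
rows = [
    {"concept": "stone", "language_name": "Lang A", "form": "x1"},
    {"concept": "stone", "language_name": "Lang A", "form": "x2"},
    {"concept": "stone", "language_name": "Lang B", "form": "x3"},
    {"concept": "house", "language_name": "Lang A", "form": "x4"},
]
table = pivot_by_concept(rows)
concept_coverage = coverage(table, n_languages=2)
```

Cells holding more than one form correspond to synonyms within a language, and concepts with coverage below 1 have languages with no recorded form.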

cognate_id

numeric foreign_key
This is almost certainly a cognate group identifier: 4,979 distinct integer values spread across 25,731 rows (about 5.2 rows per identifier on average) with no nulls and no zeros. Although it is stored as numeric, the wide range (3 to 9,982), moderate skew (0.73), and negative kurtosis (-0.90) suggest arbitrary group labels rather than a measured quantity. The lack of outliers is consistent with categorical-style codes packed into the integer space. Treatment: Treat as a categorical group key; join or group-by rather than feeding as a numeric feature. high · anthropic:claude-opus-4-7
n: 25,731
nulls: 0 (0.0%)
unique: 4,979
min: 3
max: 9,982
mean: 3,086
median: 1,610
std: 3,024
q1: 411
q3: 5,640
iqr: 5,229
skew: 0.7307
kurtosis: -0.9048
n_outliers: 0
outlier_rate: 0
zero_rate: 0
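Using cognate_id as a group key rather than a number amounts to collecting the rows that share an identifier. Here is one way to sketch it, with `cognate_sets` as a hypothetical helper and placeholder rows in place of real data.

```python
from collections import defaultdict


def cognate_sets(rows):
    """Group (language_name, form) pairs by cognate_id, keyed as a string label."""
    groups = defaultdict(list)
    for row in rows:
        groups[str(row["cognate_id"])].append((row["language_name"], row["form"]))
    return dict(groups)


# Placeholder rows; ids, language names, and forms are illustrative only.
rows = [
    {"cognate_id": 411, "language_name": "Lang A", "form": "x1"},
    {"cognate_id": 411, "language_name": "Lang B", "form": "x2"},
    {"cognate_id": 5640, "language_name": "Lang A", "form": "x3"},
]
sets_by_id = cognate_sets(rows)
set_sizes = {cid: len(members) for cid, members in sets_by_id.items()}
```

Casting the key to a string guards against accidental arithmetic on it, such as averaging or scaling, in later modelling steps.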

source_dataset

categorical metadata imbalance
This column is a constant provenance tag identifying the source dataset: every one of the 25,731 rows is labelled "iecor". Cardinality is 1 and entropy is 0, so it carries no discriminative signal. Treatment: Drop before modelling; retain only as a provenance flag. high · anthropic:claude-opus-4-7
n: 25,731
nulls: 0 (0.0%)
unique: 1
top_value: iecor
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
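Dropping a constant column while keeping its value as provenance can be done generically, as in this minimal sketch. The helper names `constant_columns` and `drop_columns` are illustrative and the two sample rows are placeholders.

```python
def constant_columns(rows):
    """Return {column: value} for columns holding a single distinct value."""
    if not rows:
        return {}
    return {
        column: rows[0][column]
        for column in rows[0]
        if len({row[column] for row in rows}) == 1
    }


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


rows = [
    {"form": "noga", "source_dataset": "iecor"},
    {"form": "voda", "source_dataset": "iecor"},
]
provenance = constant_columns(rows)       # kept once, as dataset-level metadata
features = drop_columns(rows, provenance)  # rows without the constant column
```

Storing the constant once as metadata preserves the provenance without carrying a zero-entropy feature into a model.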