processed word forms

source /home/coolhand/servers/diachronica/etymology_atlas/processed/word_forms.csv · 25,731 rows · 8 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 25,731 word forms drawn from a single source ('iecor'), each tagged with a concept, language, and cognate identifier: in effect a comparative wordlist across 160 languages and 170 concepts. The 'form' column is mostly single-word entries (94.6% one-word, mean length ~5.4 characters), and about 24.9% of rows repeat a form that occurs elsewhere in the column, suggesting many shared or repeated forms across languages. Language coverage is broad and well balanced (entropy ratio ~0.99 across 142 ISO codes), led by Greek (ell), Slovenian (slv), and Macedonian (mkd); because 160 language names map onto only 142 ISO codes, some codes aggregate more than one named variety, which likely explains their higher counts. Worth a closer look: the concept distribution is remarkably even (about 151 forms per concept on average, with the 20 most frequent concepts at 159 to 170), and the language_name distribution shows which varieties have the most rows (Bakhtiari, Nepali, Italiot Greek), although the spread there is narrow (174 to 178 among the top 20). The 'source_dataset' column is constant and can be ignored.

citing: row_count · column_count · form.stats.one_word_rate · form.stats.duplicate_rate · form.stats.len_mean · iso_639_3.n_unique · iso_639_3.top_values · concept.n_unique · concept.top_values · language_name.top_values · source_dataset.top_rate

Charts the summary said to look at first

iso_639_3 · Top languages by ISO code — Greek, Slovenian, and Macedonian lead the sample.

Top values for iso_639_3 (20 unique shown, of 142 total).
value	count	share
ell	522	2.0%
slv	509	2.0%
mkd	497	1.9%
bre	353	1.4%
swe	347	1.3%
ces	346	1.3%
pol	345	1.3%
sdh	345	1.3%
src	343	1.3%
por	341	1.3%
oss	341	1.3%
cat	340	1.3%
grc	332	1.3%
bsh	289	1.1%
bqi	178	0.7%
nep	177	0.7%
osp	177	0.7%
pnt	177	0.7%
hin	176	0.7%
ron	176	0.7%

concept · Concept distribution is highly uniform: the 20 most frequent of the 170 concepts have 159 to 170 forms each, against a mean of about 151 per concept.

Top values for concept (20 unique shown, of 170 total).
value	count	share
say	170	0.7%
man	166	0.6%
big	163	0.6%
stone	163	0.6%
house	163	0.6%
foot	161	0.6%
hand	161	0.6%
head	161	0.6%
see	161	0.6%
woman	161	0.6%
year	161	0.6%
day	160	0.6%
good	160	0.6%
name	160	0.6%
water	160	0.6%
do	160	0.6%
come	159	0.6%
give	159	0.6%
know	159	0.6%
red	159	0.6%

language_name · Which languages are most densely sampled by name (Bakhtiari, Nepali, Italiot Greek).

Top values for language_name (20 unique shown, of 160 total).
value	count	share
Bakhtiari	178	0.7%
Nepali	177	0.7%
Greek: Italiot	177	0.7%
Old Spanish	177	0.7%
Greek: Pontic	177	0.7%
Breton: Treger	177	0.7%
Hindi	176	0.7%
Romanian	176	0.7%
Greek: Cappadocian	176	0.7%
Breton: Gwened	176	0.7%
Middle Welsh	176	0.7%
Ladin	175	0.7%
Old Church Slavonic	175	0.7%
Elfdalian	175	0.7%
Old Swedish	175	0.7%
Welsh: North	175	0.7%
Old Polish	175	0.7%
Lithuanian	174	0.7%
Sinhalese	174	0.7%
Urdu	174	0.7%

form · Form lengths cluster tightly around 5 characters; nearly all entries are single words.

Character-length distribution for form (mean: 5.373).
chars	count
1 – 3	530
3 – 4	9411
4 – 6	6280
6 – 7	6745
7 – 9	1056
9 – 10	823
10 – 12	222
12 – 13	284
13 – 15	100
15 – 16	128
16 – 18	65
18 – 20	14
20 – 21	27
21 – 23	5
23 – 24	7
24 – 26	10
26 – 27	5
27 – 29	2
29 – 30	3
30 – 32	0
32 – 34	4
34 – 35	6
35 – 37	0
37 – 38	0
38 – 40	0
40 – 41	0
41 – 43	0
43 – 44	0
44 – 46	0
46 – 48	0
48 – 49	0
49 – 51	1
51 – 52	1
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	0
58 – 60	0
60 – 61	0
61 – 63	2

cognate_id · Spread of cognate IDs across the 4,979 cognate sets — useful for spotting clustering.

Histogram bins for cognate_id (median: 1610.0).
bin	count
3 – 252.5	3663
252.5 – 501.9	3046
501.9 – 751.4	1164
751.4 – 1001	2020
1001 – 1250	1934
1250 – 1500	677
1500 – 1749	687
1749 – 1999	370
1999 – 2248	629
2248 – 2498	897
2498 – 2747	263
2747 – 2997	436
2997 – 3246	392
3246 – 3496	234
3496 – 3745	129
3745 – 3995	141
3995 – 4244	158
4244 – 4494	105
4494 – 4743	99
4743 – 4992	175
4992 – 5242	1554
5242 – 5491	368
5491 – 5741	274
5741 – 5990	296
5990 – 6240	373
6240 – 6489	695
6489 – 6739	602
6739 – 6988	318
6988 – 7238	281
7238 – 7487	504
7487 – 7737	574
7737 – 7986	263
7986 – 8236	275
8236 – 8485	652
8485 – 8735	273
8735 – 8984	106
8984 – 9234	199
9234 – 9483	285
9483 – 9733	314
9733 – 9982	306

Schema

8 columns

Per-column summary.
column	type	nulls	unique	alerts
form	text	0.0%	19,334	one_word short_text duplicates
language_id	numeric	0.0%	160
language_name	categorical	0.0%	160
glottocode	categorical	0.0%	152
iso_639_3	categorical	0.7%	142
concept	categorical	0.0%	170
cognate_id	numeric	0.0%	4,979
source_dataset	categorical	0.0%	1	imbalance

form

text feature one_word short_text duplicates

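This column holds short lexical forms: 94.6% are one-word entries, with a mean length of 5.4 characters and a median word count of 1. There are 19,334 distinct forms and a word-level vocabulary of 16,219 distinct words across 25,731 rows. Top values such as 'noga', 'pā', 'dūr', 'voda', and 'bitter' point to a multilingual mix; since the languages named in this report are all Indo-European, 'pā' and 'dūr' are more plausibly Indo-Iranian than Polynesian, alongside Slavic and Germanic forms. 24.9% of rows (6,397, i.e. 25,731 minus 19,334 distinct forms) repeat a form that already occurs in the column, yet no single form dominates: the most frequent ('noga') appears only 24 times. Treatment: Treat as a categorical lexical token; normalize Unicode before comparing or embedding, and take the language from the language columns rather than running language detection. high · anthropic:claude-opus-4-7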

n: 25,731
nulls: 0 (0.0%)
unique: 19,334
len_min: 1
len_max: 63
len_mean: 5.373
len_median: 5
len_p95: 9
word_mean: 1.067
word_median: 1
n_empty: 0
n_duplicates: 6,397
duplicate_rate: 0.2486
vocab_size: 16,219
readability_flesch_mean: 86.62
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9464
allcaps_rate: 0
boilerplate_rate: 0
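
The duplicate figures above are consistent with n_duplicates = n minus unique (25,731 - 19,334 = 6,397) and duplicate_rate = n_duplicates / n. The following is a minimal sketch of that arithmetic combined with the suggested Unicode normalization. It assumes NFC is the intended normal form and that "one word" means a single whitespace-separated token; the helper names and the sample list are illustrative and are not taken from word_forms.csv.

```python
import unicodedata

def normalize_form(form: str) -> str:
    """Return the form in NFC with surrounding whitespace removed."""
    return unicodedata.normalize("NFC", form).strip()

def form_stats(forms):
    """Recompute the duplicate and one-word figures on normalized forms."""
    cleaned = [normalize_form(f) for f in forms]
    n = len(cleaned)
    n_unique = len(set(cleaned))
    one_word = sum(1 for f in cleaned if len(f.split()) == 1)
    return {
        "n": n,
        "unique": n_unique,
        "n_duplicates": n - n_unique,
        "duplicate_rate": (n - n_unique) / n,
        "one_word_rate": one_word / n,
    }

# Toy sample: the fourth item uses a precomposed u-with-macron, the fifth uses
# "u" plus a combining macron. NFC folds both into the same string.
sample = ["noga", "noga", "voda", "d\u016br", "du\u0304r", "pa ti"]
stats = form_stats(sample)
```

Without normalization the two spellings of the fourth and fifth items would count as distinct forms, so the duplicate rate of the real column could shift slightly after this step.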

language_id

numeric foreign_key

Numeric code with 160 distinct values across 25,731 rows and zero nulls, ranging from 3 to 317 with a near-symmetric distribution (skew -0.05) and flat shape (kurtosis -1.47). The flat, wide spread, the whole-number quartiles (q1 65, median 174, q3 266), and the match between its 160 distinct values and the 160 distinct language_name values suggest this is a categorical language identifier stored as an integer rather than a true numeric measurement. No outliers and no zeros, consistent with a lookup key. Treatment: Treat as categorical and left-join to a language lookup table; do not model as a continuous variable. high · anthropic:claude-opus-4-7

n: 25,731
nulls: 0 (0.0%)
unique: 160
min: 3
max: 317
mean: 166
median: 174
std: 101.4
q1: 65
q3: 266
iqr: 201
skew: -0.04885
kurtosis: -1.471
n_outliers: 0
outlier_rate: 0
zero_rate: 0
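
As a rough illustration of the suggested treatment, the sketch below left-joins rows to a lookup table and stores the id as a string label. The lookup contents, the "branch" field, and the function name are hypothetical placeholders; the real lookup would come from the source's own language metadata.

```python
# Hypothetical lookup keyed by language_id; the entries and the "branch"
# field are placeholders, not values taken from the dataset.
LANGUAGE_LOOKUP = {
    3: {"branch": "branch-x"},
    317: {"branch": "branch-y"},
}

def left_join_language(rows, lookup):
    """Left-join lookup fields onto rows; unmatched ids keep the row with branch=None."""
    joined = []
    for row in rows:
        language_id = int(row["language_id"])
        extra = lookup.get(language_id, {"branch": None})
        # Store the id as a string so downstream code treats it as a label,
        # not as a quantity to average or scale.
        joined.append({**row, "language_id": str(language_id), **extra})
    return joined

rows = [
    {"form": "noga", "language_id": 3},
    {"form": "voda", "language_id": 42},  # no lookup entry, on purpose
]
joined = left_join_language(rows, LANGUAGE_LOOKUP)
```

A left join keeps every word-form row even when the lookup is incomplete, which makes missing metadata visible as None instead of silently dropping rows.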

language_name

categorical feature

This column holds language names, with 160 distinct values across 25,731 rows and no nulls. The distribution is remarkably flat: entropy_ratio is 0.996 and the most common value 'Bakhtiari' appears just 178 times (0.69%), suggesting a near-uniform sampling of languages rather than a natural population. Several entries use a 'Language: Variety' convention (e.g., 'Greek: Italiot', 'Breton: Treger'), indicating dialect-level granularity mixed with plain language names. Treatment: Use as a categorical feature; consider splitting on ':' to separate the language from the variety before encoding. high · anthropic:claude-opus-4-7

n: 25,731
nulls: 0 (0.0%)
unique: 160
top_value: Bakhtiari
top_rate: 0.006918
cardinality: 160
entropy: 7.294
entropy_ratio: 0.9962
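
The split on ':' can be sketched as follows. This assumes at most one meaningful colon per name and that names without a colon have no variety part; the function name is illustrative.

```python
def split_language_name(name: str):
    """Split 'Language: Variety' into (language, variety); variety is None when absent."""
    language, sep, variety = name.partition(":")
    return language.strip(), (variety.strip() if sep else None)

examples = ["Greek: Italiot", "Breton: Treger", "Bakhtiari", "Old Spanish"]
parts = [split_language_name(name) for name in examples]
```

Period labels such as 'Old Spanish' or 'Middle Welsh' contain no colon, so this split leaves them whole; separating historical stages would need a different rule.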

glottocode

categorical foreign_key

Glottocodes are Glottolog's stable language identifiers, so this column tags each of the 25,731 rows with one of 152 distinct languages. The distribution is remarkably flat: entropy ratio is 0.991 and the most frequent code 'mace1250' covers only 1.93% of rows, with several Slavic and Germanic codes clustered around 340–350 occurrences. No nulls, and a visible drop-off after the top seven codes (down to ~177) suggests a tiered sampling design rather than a long tail. Treatment: left-join on this id to Glottolog metadata, or one-hot/target-encode for modelling. high · anthropic:claude-opus-4-7

n: 25,731
nulls: 0 (0.0%)
unique: 152
top_value: mace1250
top_rate: 0.01932
cardinality: 152
entropy: 7.184
entropy_ratio: 0.9912
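
The entropy and entropy_ratio figures in this report are consistent with Shannon entropy in bits divided by log2 of the number of distinct values: here 7.184 / log2(152) is about 0.9912. The sketch below shows that calculation, on the assumption that this is how the profiler defines the ratio; the function name and the zero result for constant columns are choices made for the example.

```python
import math
from collections import Counter

def entropy_and_ratio(values):
    """Return (entropy in bits, entropy / log2(number of distinct values))."""
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts)
    if k < 2:
        # A constant (or empty) column carries no information.
        return 0.0, 0.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy, entropy / math.log2(k)

# Four equally frequent codes give the maximum entropy for k = 4.
uniform = entropy_and_ratio(["a", "b", "c", "d"])

# Check against the figures reported for glottocode.
reported_ratio = round(7.184 / math.log2(152), 4)
```

A ratio near 1 means the codes are close to equally frequent; it says nothing about whether the number of codes itself is large or small.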

iso_639_3

categorical feature

This column holds ISO 639-3 language codes, with 142 distinct languages spread across 25,731 rows. The distribution is remarkably flat — entropy ratio of 0.985 and the top code 'ell' covering only 2.04% — so no single language dominates. Null rate is negligible at 0.67%. Treatment: Treat as a categorical feature; one-hot or target-encode given the 142 levels, or group rare codes. high · anthropic:claude-opus-4-7

n: 25,731
nulls: 173 (0.7%)
unique: 142
top_value: ell
top_rate: 0.02042
cardinality: 142
entropy: 7.044
entropy_ratio: 0.9853
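
Because 160 language names map onto 142 ISO codes and 173 rows have no code, it can help to see which names share a code and which names fall under the null before encoding. The sketch below groups names by code and treats empty strings as null; the rows are toy placeholders, not pairings taken from the dataset.

```python
from collections import defaultdict

def names_per_iso(rows):
    """Map each ISO code to the set of language names using it; missing codes go under None."""
    groups = defaultdict(set)
    for row in rows:
        code = row.get("iso_639_3") or None  # treat "" and a missing key as null
        groups[code].add(row["language_name"])
    return dict(groups)

# Toy rows with made-up names and a made-up code.
rows = [
    {"language_name": "Lang A: North", "iso_639_3": "aaa"},
    {"language_name": "Lang A: South", "iso_639_3": "aaa"},
    {"language_name": "Lang B", "iso_639_3": ""},
]
groups = names_per_iso(rows)
shared = {code for code, names in groups.items() if code and len(names) > 1}
```

Keeping the null as its own key avoids silently merging uncoded varieties into another language during one-hot or target encoding.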

concept

categorical foreign_key

This column holds 170 distinct concept labels (e.g., 'say', 'man', 'big', 'stone', 'house') spread almost evenly across 25,731 rows, with the top value covering only 0.66% of the data and entropy at 99.98% of the maximum. The vocabulary resembles a Swadesh-style basic concept list. With 160 languages and about 151 rows per concept on average (170 for 'say'), each concept appears roughly once per language: some languages lack a form for a given concept, and some contribute more than one. Treatment: Treat as a categorical key; group or pivot on it rather than one-hot encoding given 170 balanced levels. high · anthropic:claude-opus-4-7

n: 25,731
nulls: 0 (0.0%)
unique: 170
top_value: say
top_rate: 0.006607
cardinality: 170
entropy: 7.408
entropy_ratio: 0.9998
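
Pivoting on concept can be sketched as a nested mapping from language to concept to a list of forms. Keeping a list per cell is an assumption made here so that a language with several forms for one concept, or none at all, stays visible; the rows are toy data.

```python
from collections import defaultdict

def pivot_by_concept(rows):
    """Build {language_name: {concept: [forms]}} from long-format rows."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row["language_name"]][row["concept"]].append(row["form"])
    return {language: dict(cells) for language, cells in table.items()}

# Toy rows: Lang A has two forms for "say"; Lang B has no form for "say".
rows = [
    {"language_name": "Lang A", "concept": "say", "form": "f1"},
    {"language_name": "Lang A", "concept": "say", "form": "f2"},
    {"language_name": "Lang A", "concept": "water", "form": "f3"},
    {"language_name": "Lang B", "concept": "water", "form": "f4"},
]
wide = pivot_by_concept(rows)
missing_say = [language for language, cells in wide.items() if "say" not in cells]
```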

cognate_id

numeric foreign_key

This is almost certainly a cognate group identifier: 4,979 distinct integer values spread across 25,731 rows (roughly 5x repetition) with no nulls and no zeros. Despite being stored as numeric, the wide range (3 to 9,982), moderate skew (0.73) and negative kurtosis (-0.90) suggest these are arbitrary group labels rather than a measured quantity. The lack of outliers is consistent with categorical-style codes packed into the integer space. Treatment: Treat as a categorical group key; join or group-by rather than feeding as a numeric feature. high · anthropic:claude-opus-4-7

n: 25,731
nulls: 0 (0.0%)
unique: 4,979
min: 3
max: 9,982
mean: 3,086
median: 1,610
std: 3,024
q1: 411
q3: 5,640
iqr: 5,229
skew: 0.7307
kurtosis: -0.9048
n_outliers: 0
outlier_rate: 0
zero_rate: 0
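
Grouping by cognate_id, rather than using it as a number, can be sketched as below. The ids are cast to strings so they behave as labels; the rows and ids are toy values, not actual cognate sets from the dataset.

```python
from collections import defaultdict

def group_by_cognate(rows):
    """Return {cognate_id (as string): [(language_name, form), ...]}."""
    sets = defaultdict(list)
    for row in rows:
        sets[str(row["cognate_id"])].append((row["language_name"], row["form"]))
    return dict(sets)

# Toy rows with made-up ids.
rows = [
    {"language_name": "Lang A", "form": "f1", "cognate_id": 10},
    {"language_name": "Lang B", "form": "f2", "cognate_id": 10},
    {"language_name": "Lang C", "form": "f3", "cognate_id": 11},
]
cognate_sets = group_by_cognate(rows)
set_sizes = {cid: len(members) for cid, members in cognate_sets.items()}
```

On the full table the mean set size would be about 25,731 / 4,979, or roughly 5.2 forms per cognate set, matching the "roughly 5x repetition" noted above.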

source_dataset

categorical metadata imbalance

This column is a constant provenance tag identifying the source dataset, with every one of the 25,731 rows labelled "iecor". Cardinality is 1 and entropy is 0, so it carries no discriminative signal. Treatment: Drop before modelling; retain only as a provenance flag. high · anthropic:claude-opus-4-7

n: 25,731
nulls: 0 (0.0%)
unique: 1
top_value: iecor
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
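
One way to drop the column while retaining the provenance flag is to strip every constant column from the rows and return the constants separately. This is a sketch under the assumption that rows are dictionaries sharing the same keys; the function name and sample rows are illustrative.

```python
def drop_constant_columns(rows):
    """Return (rows without constant columns, {column: its single value})."""
    if not rows:
        return [], {}
    first = rows[0]
    constants = {
        column: first[column]
        for column in first
        if all(row[column] == first[column] for row in rows)
    }
    trimmed = [
        {column: value for column, value in row.items() if column not in constants}
        for row in rows
    ]
    return trimmed, constants

rows = [
    {"form": "noga", "source_dataset": "iecor"},
    {"form": "voda", "source_dataset": "iecor"},
]
trimmed, provenance = drop_constant_columns(rows)
```

With a single-row input every column counts as constant, so this helper is only meaningful on tables with more than one row.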