processed-word_forms · saturn notebook

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/processed/word_forms.csv

Saturn profiled 25,731 rows across 8 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/processed/word_forms.csv",
    "--findings", "processed-word_forms.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset contains 25,731 word forms drawn from a single source ('iecor'), each tagged with a concept, language, and cognate identifier: in effect, a comparative wordlist across 160 named languages and 170 concepts. The 'form' column is mostly single-word entries (94.6% one-word, mean length ~5.4 characters), and about 24.9% of rows repeat a form already seen, suggesting many shared forms across languages. Language coverage is broad and well balanced (entropy ratio ~0.99 across 142 ISO codes), led by Greek (ell), Slovenian (slv), and Macedonian (mkd). Worth a closer look: the concept distribution is remarkably even (the 20 most frequent concepts have 159 to 170 forms each, against a mean of about 151 per concept), and the language_name ranking (Bakhtiari, Nepali, Italiot Greek) differs from the ISO ranking, presumably because the 160 named varieties collapse into 142 ISO codes. The 'source_dataset' column is constant and can be ignored.

citing: row_count · column_count · form.stats.one_word_rate · form.stats.duplicate_rate · form.stats.len_mean · iso_639_3.n_unique · iso_639_3.top_values · concept.n_unique · concept.top_values · language_name.top_values · source_dataset.top_rate

Out[4]:

saturn.schema() · 8 columns

column	kind	n	null%	unique	alerts
form	text	25,731	0.0%	19,334	one_word short_text duplicates
language_id	numeric	25,731	0.0%	160
language_name	categorical	25,731	0.0%	160
glottocode	categorical	25,731	0.0%	152
iso_639_3	categorical	25,731	0.7%	142
concept	categorical	25,731	0.0%	170
cognate_id	numeric	25,731	0.0%	4,979
source_dataset	categorical	25,731	0.0%	1	imbalance
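
The n, null, and unique counts in the schema table can be recomputed independently of Saturn. The following is a minimal sketch using only the standard library; the three-row sample is hypothetical (invented language names and codes), and treating an empty string as a null is an assumption about how the CSV encodes missing values.

```python
import csv
import io

# Hypothetical sample with the same header as word_forms.csv.
# For the real file, replace io.StringIO(SAMPLE) with
# open(path, newline="", encoding="utf-8").
SAMPLE = (
    "form,language_id,language_name,glottocode,iso_639_3,concept,cognate_id,source_dataset\n"
    "noga,1,Lang A,aaaa0001,,foot,101,iecor\n"
    "noga,2,Lang B,bbbb0002,bbb,foot,101,iecor\n"
    "voda,2,Lang B,bbbb0002,bbb,water,202,iecor\n"
)

def schema(rows):
    """Return {column: (n, null_count, unique_non_null)} for a list of dict rows."""
    out = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v != ""]  # assumption: "" means null
        out[col] = (len(values), len(values) - len(non_null), len(set(non_null)))
    return out

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = schema(rows)
for col, (n, nulls, unique) in summary.items():
    print(f"{col}\t{n}\t{nulls}\t{unique}")
```

Run against the real file, the output should match the n, null%, and unique columns above if the null convention assumed here is the one Saturn uses.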

Fig 1.

iso_639_3 · Top languages by ISO code — Greek, Slovenian, and Macedonian lead the sample.

Top values for iso_639_3 (20 unique shown, of 142 total).
value	count	share
ell	522	2.0%
slv	509	2.0%
mkd	497	1.9%
bre	353	1.4%
swe	347	1.3%
ces	346	1.3%
pol	345	1.3%
sdh	345	1.3%
src	343	1.3%
por	341	1.3%
oss	341	1.3%
cat	340	1.3%
grc	332	1.3%
bsh	289	1.1%
bqi	178	0.7%
nep	177	0.7%
osp	177	0.7%
pnt	177	0.7%
hin	176	0.7%
ron	176	0.7%

Fig 2.

concept · Concept distribution is highly uniform: the 20 most frequent concepts have 159 to 170 forms each, and the mean across all 170 concepts is about 151.

Top values for concept (20 unique shown, of 170 total).
value	count	share
say	170	0.7%
man	166	0.6%
big	163	0.6%
stone	163	0.6%
house	163	0.6%
foot	161	0.6%
hand	161	0.6%
head	161	0.6%
see	161	0.6%
woman	161	0.6%
year	161	0.6%
day	160	0.6%
good	160	0.6%
name	160	0.6%
water	160	0.6%
do	160	0.6%
come	159	0.6%
give	159	0.6%
know	159	0.6%
red	159	0.6%

Fig 3.

language_name · Which languages are most densely sampled by name (Bakhtiari, Nepali, Italiot Greek).

Top values for language_name (20 unique shown, of 160 total).
value	count	share
Bakhtiari	178	0.7%
Nepali	177	0.7%
Greek: Italiot	177	0.7%
Old Spanish	177	0.7%
Greek: Pontic	177	0.7%
Breton: Treger	177	0.7%
Hindi	176	0.7%
Romanian	176	0.7%
Greek: Cappadocian	176	0.7%
Breton: Gwened	176	0.7%
Middle Welsh	176	0.7%
Ladin	175	0.7%
Old Church Slavonic	175	0.7%
Elfdalian	175	0.7%
Old Swedish	175	0.7%
Welsh: North	175	0.7%
Old Polish	175	0.7%
Lithuanian	174	0.7%
Sinhalese	174	0.7%
Urdu	174	0.7%

Fig 4.

form · Form lengths cluster tightly around 5 characters; nearly all entries are single words.

Character-length distribution for form (mean: 5.373).
chars	count
1 – 3	530
3 – 4	9411
4 – 6	6280
6 – 7	6745
7 – 9	1056
9 – 10	823
10 – 12	222
12 – 13	284
13 – 15	100
15 – 16	128
16 – 18	65
18 – 20	14
20 – 21	27
21 – 23	5
23 – 24	7
24 – 26	10
26 – 27	5
27 – 29	2
29 – 30	3
30 – 32	0
32 – 34	4
34 – 35	6
35 – 37	0
37 – 38	0
38 – 40	0
40 – 41	0
41 – 43	0
43 – 44	0
44 – 46	0
46 – 48	0
48 – 49	0
49 – 51	1
51 – 52	1
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	0
58 – 60	0
60 – 61	0
61 – 63	2

Fig 5.

cognate_id · Spread of cognate IDs across the 4,979 cognate sets — useful for spotting clustering.

Histogram bins for cognate_id (median: 1610.0).
bin	count
3 – 252.5	3663
252.5 – 501.9	3046
501.9 – 751.4	1164
751.4 – 1001	2020
1001 – 1250	1934
1250 – 1500	677
1500 – 1749	687
1749 – 1999	370
1999 – 2248	629
2248 – 2498	897
2498 – 2747	263
2747 – 2997	436
2997 – 3246	392
3246 – 3496	234
3496 – 3745	129
3745 – 3995	141
3995 – 4244	158
4244 – 4494	105
4494 – 4743	99
4743 – 4992	175
4992 – 5242	1554
5242 – 5491	368
5491 – 5741	274
5741 – 5990	296
5990 – 6240	373
6240 – 6489	695
6489 – 6739	602
6739 – 6988	318
6988 – 7238	281
7238 – 7487	504
7487 – 7737	574
7737 – 7986	263
7986 – 8236	275
8236 – 8485	652
8485 – 8735	273
8735 – 8984	106
8984 – 9234	199
9234 – 9483	285
9483 – 9733	314
9733 – 9982	306

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
form	text	0.0%
language_id	numeric	0.0%
language_name	categorical	0.0%
glottocode	categorical	0.0%
iso_639_3	categorical	0.7%
concept	categorical	0.0%
cognate_id	numeric	0.0%
source_dataset	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
	language_id	cognate_id
language_id	+1.00	+0.10
cognate_id	+0.10	+1.00

form text feature

This column holds short single-word lexical forms: 94.6% are one-word entries, with a mean length of 5.4 characters and a median word count of 1. The vocabulary spans 16,219 distinct words across 25,731 rows, and top values like 'noga', 'pā', 'dūr', 'voda', 'bitter' suggest a multilingual mix (Slavic, Germanic, and probably Indo-Iranian, given languages such as Bakhtiari, Hindi, and Urdu in the sample). Notably, 24.9% of rows (6,397) repeat a form that already appears elsewhere, yet no single form dominates: the most frequent ('noga') appears only 24 times.

Treatment: Treat as a categorical lexical token; normalize unicode and consider language detection before embedding.

anthropic:claude-opus-4-7 · confidence high
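
The Unicode normalization suggested in the treatment matters for forms with diacritics such as 'pā' and 'dūr', which can be encoded either as one precomposed character or as a base letter plus a combining mark. A minimal sketch, assuming NFC is the target form (the report does not say which normalization form to use) and using a hypothetical helper name:

```python
import unicodedata

def normalize_form(form: str) -> str:
    """Trim whitespace and compose characters into NFC."""
    return unicodedata.normalize("NFC", form.strip())

composed = "p\u0101"      # 'pā' with a precomposed a-macron
decomposed = "pa\u0304"   # 'pā' as 'a' followed by a combining macron

# The two strings render identically but compare unequal until normalized.
print(composed == decomposed)
print(normalize_form(decomposed) == composed)
```

Without this step, visually identical forms would be counted as distinct values, which would understate the duplicate rate.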

Out[13]:

saturn.columns["form"].stats

stat	value
n	25,731
nulls	0 (0.0%)
unique	19,334
len_min	1
len_max	63
len_mean	5.373
len_median	5
len_p95	9
word_mean	1.067
word_median	1
n_empty	0
n_duplicates	6,397
duplicate_rate	0.2486
vocab_size	16,219
readability_flesch_mean	86.62
emoji_rate	0
url_rate	0
one_word_rate	0.9464
allcaps_rate	0
boilerplate_rate	0
alert: one_word	94.6% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	24.9% duplicate strings
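
The duplicate figures follow directly from n and unique in the table above. As a short worked check, assuming a duplicate is any row whose form has already occurred (which is consistent with the reported numbers):

```python
n = 25_731        # rows in the form column
unique = 19_334   # distinct form strings

n_duplicates = n - unique          # rows beyond the first occurrence of each form
duplicate_rate = n_duplicates / n

print(n_duplicates)
print(round(duplicate_rate, 4))
```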

Fig 8.

Character-length distribution for form.

Character-length distribution for form (mean: 5.373).
chars	count
1 – 3	530
3 – 4	9411
4 – 6	6280
6 – 7	6745
7 – 9	1056
9 – 10	823
10 – 12	222
12 – 13	284
13 – 15	100
15 – 16	128
16 – 18	65
18 – 20	14
20 – 21	27
21 – 23	5
23 – 24	7
24 – 26	10
26 – 27	5
27 – 29	2
29 – 30	3
30 – 32	0
32 – 34	4
34 – 35	6
35 – 37	0
37 – 38	0
38 – 40	0
40 – 41	0
41 – 43	0
43 – 44	0
44 – 46	0
46 – 48	0
48 – 49	0
49 – 51	1
51 – 52	1
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	0
58 – 60	0
60 – 61	0
61 – 63	2

language_id numeric foreign_key

Numeric code with 160 distinct values across 25,731 rows and zero nulls, ranging from 3 to 317 with a near-symmetric distribution (skew -0.05) and flat shape (kurtosis -1.47). The flat, wide spread and integer-valued quartiles (q1 65, median 174, q3 266) suggest this is a categorical language identifier stored as an integer rather than a true numeric measurement. No outliers and no zeros, consistent with a lookup key.

Treatment: Treat as categorical and left-join to a language lookup table; do not model as a continuous variable.

anthropic:claude-opus-4-7 · confidence high
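
The left join recommended in the treatment can be sketched with a plain dictionary lookup. The lookup table, its 'branch' field, and the sample rows below are hypothetical; the point is only that the id is cast to a string (treated as a label, not a number) and that rows without a match are kept.

```python
# Hypothetical lookup keyed by language_id; field names are illustrative only.
LANGUAGE_LOOKUP = {
    "1": {"branch": "Branch A"},
    "2": {"branch": "Branch B"},
}

def left_join(rows, lookup, key="language_id"):
    """Attach lookup fields to each row; unmatched rows pass through unchanged."""
    joined = []
    for row in rows:
        extra = lookup.get(str(row[key]), {})  # str(): the id is a label, not a quantity
        joined.append({**row, **extra})
    return joined

rows = [
    {"form": "noga", "language_id": 1},
    {"form": "voda", "language_id": 2},
    {"form": "bitter", "language_id": 3},  # no lookup entry
]
joined = left_join(rows, LANGUAGE_LOOKUP)
for row in joined:
    print(row)
```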

Out[16]:

saturn.columns["language_id"].stats

stat	value
n	25,731
nulls	0 (0.0%)
unique	160
min	3
max	317
mean	166
median	174
std	101.4
q1	65
q3	266
iqr	201
skew	-0.04885
kurtosis	-1.471
n_outliers	0
outlier_rate	0
zero_rate	0
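
The zero outlier count is what the common 1.5 × IQR rule would give for these quartiles. Whether Saturn uses that rule is an assumption; the sketch below only shows that the reported min and max fall inside the resulting fences.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# q1, q3, min, and max are taken from the stats table above.
lower, upper = tukey_fences(65, 266)
has_outliers = 3 < lower or 317 > upper

print(lower, upper, has_outliers)
```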

Fig 9.

Distribution of language_id. Vertical dash marks the median.

Histogram bins for language_id (median: 174.0).
bin	count
3 – 10.85	683
10.85 – 18.7	684
18.7 – 26.55	1190
26.55 – 34.4	344
34.4 – 42.25	1036
42.25 – 50.1	862
50.1 – 57.95	691
57.95 – 65.8	1033
65.8 – 73.65	686
73.65 – 81.5	603
81.5 – 89.35	155
89.35 – 97.2	513
97.2 – 105	345
105 – 112.9	632
112.9 – 120.8	686
120.8 – 128.6	833
128.6 – 136.4	375
136.4 – 144.3	415
144.3 – 152.2	511
152.2 – 160	175
160 – 167.8	174
167.8 – 175.7	550
175.7 – 183.5	626
183.5 – 191.4	340
191.4 – 199.2	173
199.2 – 207.1	673
207.1 – 214.9	655
214.9 – 222.8	510
222.8 – 230.6	346
230.6 – 238.5	674
238.5 – 246.3	621
246.3 – 254.2	500
254.2 – 262.1	517
262.1 – 269.9	513
269.9 – 277.8	839
277.8 – 285.6	1253
285.6 – 293.4	1345
293.4 – 301.3	1115
301.3 – 309.1	966
309.1 – 317	889

language_name categorical feature

This column holds language names, with 160 distinct values across 25,731 rows and no nulls. The distribution is remarkably flat: entropy_ratio is 0.996 and the most common value 'Bakhtiari' appears just 178 times (0.69%), suggesting a near-uniform sampling of languages rather than a natural population. Several entries use a 'Language: Variety' convention (e.g., 'Greek: Italiot', 'Breton: Treger'), indicating dialect-level granularity mixed with top-level language names.

Treatment: Use as a categorical feature; consider splitting on ':' to separate the language from the variety before encoding.

anthropic:claude-opus-4-7 · confidence high
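
Splitting on ':' as suggested in the treatment can be done with str.partition, which leaves names without a colon intact. A minimal sketch with a hypothetical helper name:

```python
def split_variety(name: str):
    """Split 'Greek: Italiot' into ('Greek', 'Italiot'); plain names get variety None."""
    head, sep, tail = name.partition(":")
    return head.strip(), (tail.strip() if sep else None)

for name in ["Greek: Italiot", "Breton: Treger", "Hindi"]:
    print(split_variety(name))
```

Returning None for the variety, rather than an empty string, keeps "no variety recorded" distinguishable from a blank value.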

Out[19]:

saturn.columns["language_name"].stats

stat	value
n	25,731
nulls	0 (0.0%)
unique	160
top_value	Bakhtiari
top_rate	0.006918
cardinality	160
entropy	7.294
entropy_ratio	0.9962
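
The reported entropy_ratio is consistent with Shannon entropy in bits divided by its maximum, log2 of the number of distinct values. That definition is inferred from the numbers, not taken from Saturn's documentation. A minimal sketch:

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy of the value distribution, in bits."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_ratio(values):
    """Entropy divided by log2(k), the entropy of a uniform distribution over k values."""
    k = len(set(values))
    return entropy_bits(values) / math.log2(k) if k > 1 else 0.0

# Check against the table above: entropy 7.294 over 160 distinct names.
reported_ratio = 7.294 / math.log2(160)
print(round(reported_ratio, 4))
```

A ratio of 1 means every value is equally frequent; a constant column has ratio 0.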

Fig 10.

Top values for language_name.

Top values for language_name (20 unique shown, of 160 total).
value	count	share
Bakhtiari	178	0.7%
Nepali	177	0.7%
Greek: Italiot	177	0.7%
Old Spanish	177	0.7%
Greek: Pontic	177	0.7%
Breton: Treger	177	0.7%
Hindi	176	0.7%
Romanian	176	0.7%
Greek: Cappadocian	176	0.7%
Breton: Gwened	176	0.7%
Middle Welsh	176	0.7%
Ladin	175	0.7%
Old Church Slavonic	175	0.7%
Elfdalian	175	0.7%
Old Swedish	175	0.7%
Welsh: North	175	0.7%
Old Polish	175	0.7%
Lithuanian	174	0.7%
Sinhalese	174	0.7%
Urdu	174	0.7%

glottocode categorical foreign_key

Glottocodes are Glottolog's stable language identifiers, so this column tags each of the 25,731 rows with one of 152 distinct languages. The distribution is remarkably flat: entropy ratio is 0.991 and the most frequent code 'mace1250' covers only 1.93% of rows, with several Slavic and Germanic codes clustered around 340–350 occurrences. No nulls, and a visible drop-off after the top seven codes (down to ~177) suggests a tiered sampling design rather than a long tail.

Treatment: left-join on this id to Glottolog metadata, or one-hot/target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["glottocode"].stats

stat	value
n	25,731
nulls	0 (0.0%)
unique	152
top_value	mace1250
top_rate	0.01932
cardinality	152
entropy	7.184
entropy_ratio	0.9912

Fig 11.

Top values for glottocode.

Top values for glottocode (20 unique shown, of 152 total).
value	count	share
mace1250	497	1.9%
swed1254	347	1.3%
czec1258	346	1.3%
poli1260	345	1.3%
sout2640	345	1.3%
slov1268	342	1.3%
oldc1252	317	1.2%
bakh1245	178	0.7%
east1436	177	0.7%
apul1236	177	0.7%
olds1249	177	0.7%
pont1253	177	0.7%
treg1244	177	0.7%
hind1269	176	0.7%
roma1327	176	0.7%
capp1239	176	0.7%
vann1244	176	0.7%
midd1363	176	0.7%
ladi1250	175	0.7%
chur1257	175	0.7%

iso_639_3 categorical feature

This column holds ISO 639-3 language codes, with 142 distinct languages spread across 25,731 rows. The distribution is remarkably flat — entropy ratio of 0.985 and the top code 'ell' covering only 2.04% — so no single language dominates. Null rate is negligible at 0.67%.

Treatment: Treat as a categorical feature; one-hot or target-encode given the 142 levels, or group rare codes.

anthropic:claude-opus-4-7 · confidence high
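
Grouping rare codes, as the treatment suggests, can be sketched as follows. The threshold, the 'other' and 'missing' labels, and the sample codes are illustrative assumptions; missing codes get their own label so the 0.7% of null rows are not silently merged with rare languages.

```python
from collections import Counter

def group_rare(codes, min_count, other="other", missing="missing"):
    """Replace codes seen fewer than min_count times with a shared label."""
    counts = Counter(c for c in codes if c)
    grouped = []
    for c in codes:
        if not c:
            grouped.append(missing)   # null or empty code
        elif counts[c] < min_count:
            grouped.append(other)     # rare code
        else:
            grouped.append(c)
    return grouped

codes = ["ell", "ell", "ell", "slv", "slv", "xxx", None]
grouped = group_rare(codes, min_count=2)
print(grouped)
```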

Out[25]:

saturn.columns["iso_639_3"].stats

stat	value
n	25,731
nulls	173 (0.7%)
unique	142
top_value	ell
top_rate	0.02042
cardinality	142
entropy	7.044
entropy_ratio	0.9853

Fig 12.

Top values for iso_639_3.

Top values for iso_639_3 (20 unique shown, of 142 total).
value	count	share
ell	522	2.0%
slv	509	2.0%
mkd	497	1.9%
bre	353	1.4%
swe	347	1.3%
ces	346	1.3%
pol	345	1.3%
sdh	345	1.3%
src	343	1.3%
por	341	1.3%
oss	341	1.3%
cat	340	1.3%
grc	332	1.3%
bsh	289	1.1%
bqi	178	0.7%
nep	177	0.7%
osp	177	0.7%
pnt	177	0.7%
hin	176	0.7%
ron	176	0.7%

concept categorical foreign_key

This column holds 170 distinct concept labels (e.g., 'say', 'man', 'big', 'stone', 'house') spread almost perfectly evenly across 25,731 rows, with the top value covering only 0.66% of the data and entropy at 99.98% of the maximum. The vocabulary resembles a Swadesh-style basic concept list, and the near-uniform distribution suggests each concept appears a fixed number of times — likely once per language or source.

Treatment: Treat as a categorical key; group or pivot on it rather than one-hot encoding given 170 balanced levels.

anthropic:claude-opus-4-7 · confidence high
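
Pivoting on concept, as recommended in the treatment, can be sketched as a nested dictionary from concept to language to forms. The sample rows are hypothetical. Each cell holds a list because a language can have more than one form per concept: 'say' has 170 forms across at most 160 languages.

```python
def pivot_forms(rows):
    """Build {concept: {language_name: [forms]}} from row dicts."""
    table = {}
    for r in rows:
        cell = table.setdefault(r["concept"], {}).setdefault(r["language_name"], [])
        cell.append(r["form"])
    return table

# Hypothetical rows; language names and forms are placeholders.
rows = [
    {"concept": "foot", "language_name": "Lang A", "form": "noga"},
    {"concept": "foot", "language_name": "Lang B", "form": "noga"},
    {"concept": "say", "language_name": "Lang A", "form": "x"},
    {"concept": "say", "language_name": "Lang A", "form": "y"},
]
table = pivot_forms(rows)
print(table)
```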

Out[28]:

saturn.columns["concept"].stats

stat	value
n	25,731
nulls	0 (0.0%)
unique	170
top_value	say
top_rate	0.006607
cardinality	170
entropy	7.408
entropy_ratio	0.9998

Fig 13.

Top values for concept.

Top values for concept (20 unique shown, of 170 total).
value	count	share
say	170	0.7%
man	166	0.6%
big	163	0.6%
stone	163	0.6%
house	163	0.6%
foot	161	0.6%
hand	161	0.6%
head	161	0.6%
see	161	0.6%
woman	161	0.6%
year	161	0.6%
day	160	0.6%
good	160	0.6%
name	160	0.6%
water	160	0.6%
do	160	0.6%
come	159	0.6%
give	159	0.6%
know	159	0.6%
red	159	0.6%

cognate_id numeric foreign_key

This is almost certainly a cognate group identifier: 4,979 distinct integer values spread across 25,731 rows (roughly 5x repetition) with no nulls and no zeros. Despite being stored as numeric, the wide range (3 to 9,982), moderate skew (0.73) and negative kurtosis (-0.90) suggest these are arbitrary group labels rather than a measured quantity. The lack of outliers is consistent with categorical-style codes packed into the integer space.

Treatment: Treat as a categorical group key; join or group-by rather than feeding as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
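
Using cognate_id as a group key might look like the sketch below, which collects the members of each cognate set and computes the mean set size. The sample rows are hypothetical; the last lines reproduce the "roughly 5x repetition" figure from the reported counts.

```python
from collections import defaultdict

def cognate_sets(rows):
    """Group (language_name, form) pairs by cognate_id, treated as a label."""
    sets = defaultdict(list)
    for r in rows:
        sets[str(r["cognate_id"])].append((r["language_name"], r["form"]))
    return dict(sets)

# Hypothetical rows; language names are placeholders.
rows = [
    {"cognate_id": 101, "language_name": "Lang A", "form": "noga"},
    {"cognate_id": 101, "language_name": "Lang B", "form": "noga"},
    {"cognate_id": 202, "language_name": "Lang B", "form": "voda"},
]
sets = cognate_sets(rows)
mean_size = sum(len(members) for members in sets.values()) / len(sets)
print(sets, mean_size)

# Mean set size implied by the report: 25,731 rows over 4,979 cognate sets.
reported_mean_size = 25_731 / 4_979
print(round(reported_mean_size, 2))
```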

Out[31]:

saturn.columns["cognate_id"].stats

stat	value
n	25,731
nulls	0 (0.0%)
unique	4,979
min	3
max	9,982
mean	3086
median	1,610
std	3024
q1	411
q3	5,640
iqr	5,229
skew	0.7307
kurtosis	-0.9048
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 14.

Distribution of cognate_id. Vertical dash marks the median.

Histogram bins for cognate_id (median: 1610.0).
bin	count
3 – 252.5	3663
252.5 – 501.9	3046
501.9 – 751.4	1164
751.4 – 1001	2020
1001 – 1250	1934
1250 – 1500	677
1500 – 1749	687
1749 – 1999	370
1999 – 2248	629
2248 – 2498	897
2498 – 2747	263
2747 – 2997	436
2997 – 3246	392
3246 – 3496	234
3496 – 3745	129
3745 – 3995	141
3995 – 4244	158
4244 – 4494	105
4494 – 4743	99
4743 – 4992	175
4992 – 5242	1554
5242 – 5491	368
5491 – 5741	274
5741 – 5990	296
5990 – 6240	373
6240 – 6489	695
6489 – 6739	602
6739 – 6988	318
6988 – 7238	281
7238 – 7487	504
7487 – 7737	574
7737 – 7986	263
7986 – 8236	275
8236 – 8485	652
8485 – 8735	273
8735 – 8984	106
8984 – 9234	199
9234 – 9483	285
9483 – 9733	314
9733 – 9982	306

source_dataset categorical metadata

This column is a constant provenance tag identifying the source dataset, with every one of the 25,731 rows labelled "iecor". Cardinality is 1 and entropy is 0, so it carries no discriminative signal.

Treatment: Drop before modelling; retain only as a provenance flag.

anthropic:claude-opus-4-7 · confidence high
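
Dropping constant columns before modelling can be automated and does not need source_dataset to be hard-coded. A minimal sketch over hypothetical rows, returning the dropped column names so the provenance value can still be recorded separately:

```python
def drop_constant_columns(rows):
    """Return (rows without single-valued columns, sorted names of dropped columns)."""
    constant = {col for col in rows[0] if len({r[col] for r in rows}) == 1}
    trimmed = [{k: v for k, v in r.items() if k not in constant} for r in rows]
    return trimmed, sorted(constant)

rows = [
    {"form": "noga", "source_dataset": "iecor"},
    {"form": "voda", "source_dataset": "iecor"},
]
trimmed, dropped = drop_constant_columns(rows)
print(trimmed, dropped)
```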

Out[34]:

saturn.columns["source_dataset"].stats

stat	value
n	25,731
nulls	0 (0.0%)
unique	1
top_value	iecor
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 15.

Top values for source_dataset.

Top values for source_dataset (1 unique shown, of 1 total).
value	count	share
iecor	25,731	100.0%
