
processed word forms

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/processed/word_forms.csv

Saturn profiled 25,731 rows across 8 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
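# Install the profiler (notebook shell escape; not valid in a plain .py file).
!pip install saturn-dissect

import subprocess

# Profile the CSV, write the findings to JSON, and request the opt-in LLM prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/processed/word_forms.csv",
    "--findings", "processed-word_forms.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise instead of failing silently if the CLI exits non-zero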

Summary confidence: high

This dataset contains 25,731 word forms drawn from a single source ('iecor'), each tagged with a concept, language, and cognate identifier: essentially a comparative wordlist across 160 languages and 170 concepts. The 'form' column is mostly single-word entries (94.6% one-word, mean length ~5.4 characters) with about 24.9% duplicate strings, suggesting many shared or repeated forms across languages. The language coverage is broad and well-balanced (entropy ratio ~0.99 across 142 ISO codes), led by Greek (ell), Slovenian (slv), and Macedonian (mkd). Worth a closer look: the concept distribution is remarkably even (about 151 forms per concept on average, 159 to 170 for the 20 most frequent), and the language_name distribution is almost flat at the top (Bakhtiari 178; Nepali and Italiot Greek 177 each). The 'source_dataset' column is constant and can be ignored.

citing: row_count · column_count · form.stats.one_word_rate · form.stats.duplicate_rate · form.stats.len_mean · iso_639_3.n_unique · iso_639_3.top_values · concept.n_unique · concept.top_values · language_name.top_values · source_dataset.top_rate

Out[4]:

saturn.schema() · 8 columns

column kind n null% unique alerts
form text 25,731 0.0% 19,334 one_word short_text duplicates
language_id numeric 25,731 0.0% 160
language_name categorical 25,731 0.0% 160
glottocode categorical 25,731 0.0% 152
iso_639_3 categorical 25,731 0.7% 142
concept categorical 25,731 0.0% 170
cognate_id numeric 25,731 0.0% 4,979
source_dataset categorical 25,731 0.0% 1 imbalance
Fig 1.
iso_639_3 · Top languages by ISO code — Greek, Slovenian, and Macedonian lead the sample.
Top values for iso_639_3 (20 unique shown, of 142 total).
value  count  share
ell    522    2.0%
slv    509    2.0%
mkd    497    1.9%
bre    353    1.4%
swe    347    1.3%
ces    346    1.3%
pol    345    1.3%
sdh    345    1.3%
src    343    1.3%
por    341    1.3%
oss    341    1.3%
cat    340    1.3%
grc    332    1.3%
bsh    289    1.1%
bqi    178    0.7%
nep    177    0.7%
osp    177    0.7%
pnt    177    0.7%
hin    176    0.7%
ron    176    0.7%
Fig 2.
concept · Concept distribution is highly uniform: the 20 most frequent concepts each have 159 to 170 forms (about 151 on average across all 170).
Top values for concept (20 unique shown, of 170 total).
value  count  share
say    170    0.7%
man    166    0.6%
big    163    0.6%
stone  163    0.6%
house  163    0.6%
foot   161    0.6%
hand   161    0.6%
head   161    0.6%
see    161    0.6%
woman  161    0.6%
year   161    0.6%
day    160    0.6%
good   160    0.6%
name   160    0.6%
water  160    0.6%
do     160    0.6%
come   159    0.6%
give   159    0.6%
know   159    0.6%
red    159    0.6%
Fig 3.
language_name · Which languages are most densely sampled by name (Bakhtiari, Nepali, Italiot Greek).
Top values for language_name (20 unique shown, of 160 total).
value                count  share
Bakhtiari            178    0.7%
Nepali               177    0.7%
Greek: Italiot       177    0.7%
Old Spanish          177    0.7%
Greek: Pontic        177    0.7%
Breton: Treger       177    0.7%
Hindi                176    0.7%
Romanian             176    0.7%
Greek: Cappadocian   176    0.7%
Breton: Gwened       176    0.7%
Middle Welsh         176    0.7%
Ladin                175    0.7%
Old Church Slavonic  175    0.7%
Elfdalian            175    0.7%
Old Swedish          175    0.7%
Welsh: North         175    0.7%
Old Polish           175    0.7%
Lithuanian           174    0.7%
Sinhalese            174    0.7%
Urdu                 174    0.7%
Fig 4.
form · Form lengths cluster tightly around 5 characters; nearly all entries are single words.
Character-length distribution for form (mean: 5.373).
chars    count
1 – 3    530
3 – 4    9411
4 – 6    6280
6 – 7    6745
7 – 9    1056
9 – 10   823
10 – 12  222
12 – 13  284
13 – 15  100
15 – 16  128
16 – 18  65
18 – 20  14
20 – 21  27
21 – 23  5
23 – 24  7
24 – 26  10
26 – 27  5
27 – 29  2
29 – 30  3
30 – 32  0
32 – 34  4
34 – 35  6
35 – 37  0
37 – 38  0
38 – 40  0
40 – 41  0
41 – 43  0
43 – 44  0
44 – 46  0
46 – 48  0
48 – 49  0
49 – 51  1
51 – 52  1
52 – 54  0
54 – 55  0
55 – 57  0
57 – 58  0
58 – 60  0
60 – 61  0
61 – 63  2
Fig 5.
cognate_id · Spread of cognate IDs across the 4,979 cognate sets — useful for spotting clustering.
Histogram bins for cognate_id (median: 1610.0).
bin              count
3 – 252.5        3663
252.5 – 501.9    3046
501.9 – 751.4    1164
751.4 – 1001     2020
1001 – 1250      1934
1250 – 1500      677
1500 – 1749      687
1749 – 1999      370
1999 – 2248      629
2248 – 2498      897
2498 – 2747      263
2747 – 2997      436
2997 – 3246      392
3246 – 3496      234
3496 – 3745      129
3745 – 3995      141
3995 – 4244      158
4244 – 4494      105
4494 – 4743      99
4743 – 4992      175
4992 – 5242      1554
5242 – 5491      368
5491 – 5741      274
5741 – 5990      296
5990 – 6240      373
6240 – 6489      695
6489 – 6739      602
6739 – 6988      318
6988 – 7238      281
7238 – 7487      504
7487 – 7737      574
7737 – 7986      263
7986 – 8236      275
8236 – 8485      652
8485 – 8735      273
8735 – 8984      106
8984 – 9234      199
9234 – 9483      285
9483 – 9733      314
9733 – 9982      306
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column          kind         null %
form            text         0.0%
language_id     numeric      0.0%
language_name   categorical  0.0%
glottocode      categorical  0.0%
iso_639_3       categorical  0.7%
concept         categorical  0.0%
cognate_id      numeric      0.0%
source_dataset  categorical  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 2 numeric columns (values clipped to 2 decimals).
             language_id  cognate_id
language_id  +1.00        +0.10
cognate_id   +0.10        +1.00

form text feature

This column holds short single-word lexical forms: 94.6% are one-word entries, with a mean length of 5.4 characters and a median word count of 1. The vocabulary spans 16,219 distinct words across 25,731 rows, and top values like 'noga', 'pā', 'dūr', 'voda', 'bitter' suggest a multilingual mix (Slavic, probably Indo-Iranian, Germanic). Notably, 24.9% of rows are duplicates (6,397), yet no single form dominates: the most frequent ('noga') appears only 24 times.

Treatment: Treat as a categorical lexical token; normalize unicode and consider language detection before embedding.

anthropic:claude-opus-4-7 · confidence high
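
The Unicode normalization step in the treatment above could look like the following minimal sketch. It assumes NFC is the target form, which the report does not specify, and `normalize_form` is a hypothetical helper name.

```python
import unicodedata


def normalize_form(form: str) -> str:
    """Return the form in NFC with surrounding whitespace removed.

    NFC is an assumption; NFD may suit better if diacritics should stay
    separable from their base letters.
    """
    return unicodedata.normalize("NFC", form.strip())


# "pā" written two ways: precomposed U+0101 vs. "a" + combining macron U+0304.
precomposed = "p\u0101"
decomposed = "pa\u0304"

print(precomposed == decomposed)                                  # False
print(normalize_form(precomposed) == normalize_form(decomposed))  # True
```

Normalizing before counting matters here because visually identical forms in different encodings would otherwise be counted as distinct strings.
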
Out[13]:

saturn.columns["form"].stats

stat value
n 25,731
nulls 0 (0.0%)
unique 19,334
len_min 1
len_max 63
len_mean 5.373
len_median 5
len_p95 9
word_mean 1.067
word_median 1
n_empty 0
n_duplicates 6,397
duplicate_rate 0.2486
vocab_size 16,219
readability_flesch_mean 86.62
emoji_rate 0
url_rate 0
one_word_rate 0.9464
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 94.6% of rows are a single word
alert: short_text · 95th-percentile length under 20 chars
alert: duplicates · 24.9% duplicate strings
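
The duplicate figures in the table above are consistent with counting every occurrence of a string beyond its first, i.e. n_duplicates = n - unique. The sketch below assumes that definition (the report does not state it explicitly); `duplicate_stats` is a hypothetical helper.

```python
from collections import Counter


def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate), counting repeats beyond the first."""
    counts = Counter(values)
    n = sum(counts.values())
    n_duplicates = n - len(counts)
    return n_duplicates, (n_duplicates / n if n else 0.0)


# Toy sample, not rows from the dataset.
print(duplicate_stats(["noga", "voda", "noga", "dūr", "noga"]))  # (2, 0.4)

# Check against the reported figures: n = 25,731 and unique = 19,334.
print(25_731 - 19_334, round((25_731 - 19_334) / 25_731, 4))     # 6397 0.2486
```
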
Fig 8.
Character-length distribution for form.

language_id numeric foreign_key

Numeric code with 160 distinct values across 25,731 rows and zero nulls, ranging from 3 to 317 with a near-symmetric distribution (skew -0.05) and flat shape (kurtosis -1.47). The flat, wide spread and integer-looking quartiles (65, 174, 266) suggest this is a categorical language identifier stored as an integer rather than a true numeric measurement. No outliers and no zeros, consistent with a lookup key.

Treatment: Treat as categorical and left-join to a language lookup table; do not model as a continuous variable.

anthropic:claude-opus-4-7 · confidence high
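
A minimal, standard-library sketch of the left join suggested above. The lookup table and its contents are hypothetical placeholders; the real one would come from the source's language metadata. IDs are kept as strings so that nothing downstream can treat them as quantities.

```python
# Hypothetical lookup keyed by language_id (placeholder names, not real data).
language_lookup = {
    "3": {"language_label": "Language A"},
    "65": {"language_label": "Language B"},
}

rows = [
    {"form": "noga", "language_id": "3"},
    {"form": "voda", "language_id": "65"},
    {"form": "pā", "language_id": "317"},  # no lookup entry: kept, as in a left join
]


def left_join(rows, lookup, key="language_id"):
    """Attach lookup fields to each row; unmatched rows get None for those fields."""
    fields = sorted({f for entry in lookup.values() for f in entry})
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        joined.append({**row, **{f: match.get(f) for f in fields}})
    return joined


for row in left_join(rows, language_lookup):
    print(row)
```

Rows without a match survive with `None` in the joined fields, which makes gaps in the lookup table easy to spot.
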
Out[16]:

saturn.columns["language_id"].stats

stat value
n 25,731
nulls 0 (0.0%)
unique 160
min 3
max 317
mean 166
median 174
std 101.4
q1 65
q3 266
iqr 201
skew -0.04885
kurtosis -1.471
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 9.
Distribution of language_id. Vertical dash marks the median.
Histogram bins for language_id (median: 174.0).
bin              count
3 – 10.85        683
10.85 – 18.7     684
18.7 – 26.55     1190
26.55 – 34.4     344
34.4 – 42.25     1036
42.25 – 50.1     862
50.1 – 57.95     691
57.95 – 65.8     1033
65.8 – 73.65     686
73.65 – 81.5     603
81.5 – 89.35     155
89.35 – 97.2     513
97.2 – 105       345
105 – 112.9      632
112.9 – 120.8    686
120.8 – 128.6    833
128.6 – 136.4    375
136.4 – 144.3    415
144.3 – 152.2    511
152.2 – 160      175
160 – 167.8      174
167.8 – 175.7    550
175.7 – 183.5    626
183.5 – 191.4    340
191.4 – 199.2    173
199.2 – 207.1    673
207.1 – 214.9    655
214.9 – 222.8    510
222.8 – 230.6    346
230.6 – 238.5    674
238.5 – 246.3    621
246.3 – 254.2    500
254.2 – 262.1    517
262.1 – 269.9    513
269.9 – 277.8    839
277.8 – 285.6    1253
285.6 – 293.4    1345
293.4 – 301.3    1115
301.3 – 309.1    966
309.1 – 317      889

language_name categorical feature

This column holds language names, with 160 distinct values across 25,731 rows and no nulls. The distribution is remarkably flat: entropy_ratio is 0.996 and the most common value 'Bakhtiari' appears just 178 times (0.69%), suggesting a near-uniform sampling of languages rather than a natural population. Several entries use a 'Language: Variety' convention (e.g., 'Greek: Italiot', 'Breton: Treger'), indicating dialect-level granularity mixed with top-level language names.

Treatment: Use as a categorical feature; consider splitting on ':' to separate the language from the variety before encoding.

anthropic:claude-opus-4-7 · confidence high
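
The split on ':' suggested above can be sketched as follows. `split_language_name` is a hypothetical helper; names without a colon are returned with no variety.

```python
from typing import Optional, Tuple


def split_language_name(name: str) -> Tuple[str, Optional[str]]:
    """Split 'Greek: Italiot' into ('Greek', 'Italiot'); no colon gives (name, None)."""
    head, sep, tail = name.partition(":")
    if not sep:
        return name.strip(), None
    return head.strip(), tail.strip()


for name in ["Greek: Italiot", "Breton: Treger", "Bakhtiari", "Old Church Slavonic"]:
    print(split_language_name(name))
# ('Greek', 'Italiot')
# ('Breton', 'Treger')
# ('Bakhtiari', None)
# ('Old Church Slavonic', None)
```

`str.partition` splits on the first colon only, so a name with more than one colon keeps the remainder in the variety part.
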
Out[19]:

saturn.columns["language_name"].stats

stat value
n 25,731
nulls 0 (0.0%)
unique 160
top_value Bakhtiari
top_rate 0.006918
cardinality 160
entropy 7.294
entropy_ratio 0.9962
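
The entropy_ratio values in this report match Shannon entropy in bits divided by log2 of the cardinality. That definition is inferred from the reported numbers and is not confirmed by the tool's documentation here, so the sketch below should be read under that assumption; `entropy_ratio` is a hypothetical helper.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy in bits divided by log2(number of distinct values)."""
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # a constant (or empty) column carries no entropy
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(k)


print(entropy_ratio(["a", "b", "c", "d"]))     # 1.0 for a perfectly even split
print(entropy_ratio(["a"] * 9 + ["b"]) < 0.5)  # True: one value dominates

# Reported for language_name: entropy 7.294 over 160 distinct values.
print(round(7.294 / math.log2(160), 4))        # 0.9962
```

A ratio near 1 means the values are spread almost evenly; a ratio near 0 means one value dominates.
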
Fig 10.
Top values for language_name.

glottocode categorical foreign_key

Glottocodes are Glottolog's stable language identifiers, so this column tags each of the 25,731 rows with one of 152 distinct codes. The distribution is remarkably flat: entropy ratio is 0.991 and the most frequent code 'mace1250' covers only 1.93% of rows, with several Slavic and Germanic codes clustered around 340–350 occurrences. There are no nulls. The visible drop-off after the top seven codes (down to ~177) points to a tiered structure rather than a long tail; because 160 language_id values map onto only 152 glottocodes, the higher counts may simply reflect several varieties sharing one code.

Treatment: left-join on this id to Glottolog metadata, or one-hot/target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
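
The schema lists 160 language_id values but only 152 glottocodes, so some glottocodes presumably cover more than one variety. Before joining on glottocode, a quick check like the sketch below can list the shared codes. The rows are toy data and the pairing of IDs to codes is invented for illustration; `ids_per_glottocode` is a hypothetical helper.

```python
from collections import defaultdict


def ids_per_glottocode(rows):
    """Map each glottocode to the set of language_id values that use it."""
    mapping = defaultdict(set)
    for row in rows:
        mapping[row["glottocode"]].add(row["language_id"])
    return mapping


# Toy rows; the ID-to-code pairings are invented, not taken from the dataset.
rows = [
    {"language_id": "10", "glottocode": "mace1250"},
    {"language_id": "11", "glottocode": "mace1250"},
    {"language_id": "12", "glottocode": "swed1254"},
]

shared = {g: sorted(ids) for g, ids in ids_per_glottocode(rows).items() if len(ids) > 1}
print(shared)  # {'mace1250': ['10', '11']}
```

Any code that appears in `shared` would fan out a join from Glottolog metadata to several varieties at once.
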
Out[22]:

saturn.columns["glottocode"].stats

stat value
n 25,731
nulls 0 (0.0%)
unique 152
top_value mace1250
top_rate 0.01932
cardinality 152
entropy 7.184
entropy_ratio 0.9912
Fig 11.
Top values for glottocode.
Top values for glottocode (20 unique shown, of 152 total).
value     count  share
mace1250  497    1.9%
swed1254  347    1.3%
czec1258  346    1.3%
poli1260  345    1.3%
sout2640  345    1.3%
slov1268  342    1.3%
oldc1252  317    1.2%
bakh1245  178    0.7%
east1436  177    0.7%
apul1236  177    0.7%
olds1249  177    0.7%
pont1253  177    0.7%
treg1244  177    0.7%
hind1269  176    0.7%
roma1327  176    0.7%
capp1239  176    0.7%
vann1244  176    0.7%
midd1363  176    0.7%
ladi1250  175    0.7%
chur1257  175    0.7%

iso_639_3 categorical feature

This column holds ISO 639-3 language codes, with 142 distinct codes across the 25,558 non-null rows. The distribution is remarkably flat: the entropy ratio is 0.985 and the top code 'ell' covers only 2.04% of non-null rows, so no single language dominates. The null rate is negligible at 0.67% (173 rows).

Treatment: Treat as a categorical feature; one-hot or target-encode given the 142 levels, or group rare codes.

anthropic:claude-opus-4-7 · confidence high
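
Grouping rare codes, as suggested above, might look like this sketch. The threshold and the "other" and "missing" labels are illustrative choices, not values from the report, and `group_rare` is a hypothetical helper. Nulls get their own label so the 173 missing codes are not silently merged with rare ones.

```python
from collections import Counter


def group_rare(values, min_count, other="other", missing="missing"):
    """Replace values seen fewer than min_count times; label nulls explicitly."""
    counts = Counter(v for v in values if v)
    out = []
    for v in values:
        if not v:
            out.append(missing)
        elif counts[v] < min_count:
            out.append(other)
        else:
            out.append(v)
    return out


# Toy sample, including an empty string and None to stand in for nulls.
codes = ["ell", "ell", "ell", "slv", "slv", "bqi", "", None]
print(group_rare(codes, min_count=2))
# ['ell', 'ell', 'ell', 'slv', 'slv', 'other', 'missing', 'missing']
```
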
Out[25]:

saturn.columns["iso_639_3"].stats

stat value
n 25,731
nulls 173 (0.7%)
unique 142
top_value ell
top_rate 0.02042
cardinality 142
entropy 7.044
entropy_ratio 0.9853
Fig 12.
Top values for iso_639_3.

concept categorical foreign_key

This column holds 170 distinct concept labels (e.g., 'say', 'man', 'big', 'stone', 'house') spread almost perfectly evenly across 25,731 rows, with the top value covering only 0.66% of the data and entropy at 99.98% of the maximum. The vocabulary resembles a Swadesh-style basic concept list. The mean of about 151 forms per concept (25,731 / 170), against 160 languages, suggests each concept is recorded roughly once per language, with some gaps and occasional multiple forms.

Treatment: Treat as a categorical key; group or pivot on it rather than one-hot encoding given 170 balanced levels.

anthropic:claude-opus-4-7 · confidence high
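
Pivoting on concept, as the treatment suggests, can be sketched with nested dictionaries. The rows below are toy data with invented language labels and pairings; `pivot_forms` is a hypothetical helper. Each cell holds a list because a language may record more than one form per concept.

```python
from collections import defaultdict


def pivot_forms(rows):
    """Build {concept: {language_name: [forms]}} from long-format rows."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row["concept"]][row["language_name"]].append(row["form"])
    # Convert to plain dicts so missing keys raise instead of silently appearing.
    return {concept: dict(langs) for concept, langs in table.items()}


# Toy rows: language labels and pairings are invented for illustration.
rows = [
    {"concept": "water", "language_name": "Language A", "form": "voda"},
    {"concept": "water", "language_name": "Language B", "form": "water"},
    {"concept": "foot", "language_name": "Language A", "form": "noga"},
    {"concept": "foot", "language_name": "Language A", "form": "stopa"},
]

table = pivot_forms(rows)
print(table["foot"]["Language A"])  # ['noga', 'stopa']
print(sorted(table["water"]))       # ['Language A', 'Language B']
```
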
Out[28]:

saturn.columns["concept"].stats

stat value
n 25,731
nulls 0 (0.0%)
unique 170
top_value say
top_rate 0.006607
cardinality 170
entropy 7.408
entropy_ratio 0.9998
Fig 13.
Top values for concept.

cognate_id numeric foreign_key

This is almost certainly a cognate group identifier: 4,979 distinct integer values spread across 25,731 rows (roughly 5x repetition) with no nulls and no zeros. Despite being stored as numeric, the wide range (3 to 9,982), moderate skew (0.73) and negative kurtosis (-0.90) suggest these are arbitrary group labels rather than a measured quantity. The lack of outliers is consistent with categorical-style codes packed into the integer space.

Treatment: Treat as a categorical group key; join or group-by rather than feeding as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
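
Treating cognate_id as a group key rather than a number could start with set sizes, as in this sketch. The rows are toy data with invented IDs and forms; `cognate_set_sizes` is a hypothetical helper. The ID is cast to a string so it cannot be averaged or scaled by accident.

```python
from collections import Counter


def cognate_set_sizes(rows):
    """Count rows per cognate_id, treating the ID as an opaque string label."""
    return Counter(str(row["cognate_id"]) for row in rows)


# Toy rows; IDs and forms are invented for illustration.
rows = [
    {"cognate_id": 3, "form": "voda"},
    {"cognate_id": 3, "form": "woda"},
    {"cognate_id": 3, "form": "water"},
    {"cognate_id": 9982, "form": "noga"},
]

sizes = cognate_set_sizes(rows)
print(sizes.most_common(1))                     # [('3', 3)]
singletons = [cid for cid, n in sizes.items() if n == 1]
print(singletons)                               # ['9982']

# Mean set size implied by the reported totals (25,731 rows, 4,979 sets).
print(round(25_731 / 4_979, 2))                 # 5.17
```
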
Out[31]:

saturn.columns["cognate_id"].stats

stat value
n 25,731
nulls 0 (0.0%)
unique 4,979
min 3
max 9,982
mean 3,086
median 1,610
std 3,024
q1 411
q3 5,640
iqr 5,229
skew 0.7307
kurtosis -0.9048
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 14.
Distribution of cognate_id. Vertical dash marks the median.

source_dataset categorical metadata

This column is a constant provenance tag identifying the source dataset, with every one of the 25,731 rows labelled "iecor". Cardinality is 1 and entropy is 0, so it carries no discriminative signal.

Treatment: Drop before modelling; retain only as a provenance flag.

anthropic:claude-opus-4-7 · confidence high
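
Dropping a constant column while keeping its value as provenance might be done as in this sketch. The rows are toy data, and `constant_columns` and `drop_columns` are hypothetical helpers.

```python
def constant_columns(rows):
    """Return the names of columns with exactly one distinct value."""
    if not rows:
        return []
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]


def drop_columns(rows, to_drop):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


# Toy rows, not taken from the dataset.
rows = [
    {"form": "noga", "concept": "foot", "source_dataset": "iecor"},
    {"form": "voda", "concept": "water", "source_dataset": "iecor"},
]

constant = constant_columns(rows)
provenance = {col: rows[0][col] for col in constant}  # keep the tag once, outside the features
features = drop_columns(rows, constant)

print(constant)     # ['source_dataset']
print(provenance)   # {'source_dataset': 'iecor'}
print(features[0])  # {'form': 'noga', 'concept': 'foot'}
```
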
Out[34]:

saturn.columns["source_dataset"].stats

stat value
n 25,731
nulls 0 (0.0%)
unique 1
top_value iecor
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 15.
Top values for source_dataset.
Top values for source_dataset (1 unique shown, of 1 total).
value  count  share
iecor  25731  100.0%

How to cite


BibTeX
@misc{saturn-processed-word-forms-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: processed word forms},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/processed-word_forms}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: processed word forms. Source: /home/coolhand/servers/diachronica/etymology_atlas/processed/word_forms.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/processed-word_forms