parquet cognate sets

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/cognate_sets.parquet

Saturn profiled 4,981 rows across 7 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

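# Install the profiler (IPython shell escape; this line only works inside a notebook cell).
!pip install saturn-dissect

import subprocess

# Run the saturn CLI on the Parquet file with the flags used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/parquet/cognate_sets.parquet",
    "--findings", "parquet-cognate_sets.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # check=True raises CalledProcessError if the command exits non-zero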

Summary confidence: high

This dataset catalogs 4,981 cognate sets from the IE-CoR source, with each row identified by a unique cognate_id and accompanied by a JSON 'words' payload listing language entries. The numeric columns language_count and word_count are near-duplicates: both are highly skewed (skew ~6.84), with a median of 2, a max of 157, and ~13% outliers, so a small set of cognate groups is dramatically larger than the rest. Three columns (concept, confidence, source_dataset) are constant or empty and carry no analytic signal. Start by examining the distribution of language_count to understand the long tail of cross-linguistic coverage, and inspect the longest 'words' entries (len_max 14,956) to see which cognate sets dominate.
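
The two suggested starting points can be sketched in plain Python. This is a minimal illustration on hypothetical rows, not the real file: the `rows` list, its ids and payloads, and the helper names `tail_share` and `longest_words` are assumptions, while the column names `language_count` and `words` come from the schema.

```python
# Hypothetical stand-in rows; the real values live in cognate_sets.parquet.
rows = [
    {"cognate_id": "iecor:1", "language_count": 2,
     "words": '[{"form": "a"}]'},
    {"cognate_id": "iecor:2", "language_count": 3,
     "words": '[{"form": "b"}, {"form": "c"}]'},
    {"cognate_id": "iecor:3", "language_count": 40,
     "words": '[{"form": "d"}, {"form": "e"}, {"form": "f"}]'},
]


def tail_share(rows, threshold):
    """Fraction of rows whose language_count exceeds the threshold."""
    return sum(r["language_count"] > threshold for r in rows) / len(rows)


def longest_words(rows, k):
    """cognate_id values of the k rows with the longest 'words' payload."""
    ranked = sorted(rows, key=lambda r: len(r["words"]), reverse=True)
    return [r["cognate_id"] for r in ranked[:k]]


print(tail_share(rows, threshold=8))
print(longest_words(rows, k=2))
```

On the real table the same two calls would show how much of the corpus sits in the long tail and which cognate sets carry the largest payloads.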

citing: language_count · word_count · words · cognate_id · concept · confidence · source_dataset

Out[4]:

saturn.schema() · 7 columns

column	kind	n	null%	unique	alerts
cognate_id	text	4,981	0.0%	4,981	near_unique one_word short_text
concept	categorical	4,981	0.0%	1	imbalance
word_count	numeric	4,981	0.0%	93	high_skew outliers
language_count	numeric	4,981	0.0%	94	high_skew outliers
words	text	4,981	0.0%	4,963	near_unique
source_dataset	categorical	4,981	0.0%	1	imbalance
confidence	numeric	4,981	0.0%	1	constant

Fig 1.

language_count · Highly skewed: most cognate sets cover just 2 languages but a few reach 157.

Histogram bins for language_count (median: 2.0).
bin	count
1 – 4.9	3849
4.9 – 8.8	483
8.8 – 12.7	194
12.7 – 16.6	96
16.6 – 20.5	90
20.5 – 24.4	107
24.4 – 28.3	32
28.3 – 32.2	25
32.2 – 36.1	15
36.1 – 40	4
40 – 43.9	8
43.9 – 47.8	8
47.8 – 51.7	6
51.7 – 55.6	5
55.6 – 59.5	3
59.5 – 63.4	3
63.4 – 67.3	8
67.3 – 71.2	3
71.2 – 75.1	3
75.1 – 79	4
79 – 82.9	0
82.9 – 86.8	1
86.8 – 90.7	2
90.7 – 94.6	4
94.6 – 98.5	5
98.5 – 102.4	2
102.4 – 106.3	3
106.3 – 110.2	2
110.2 – 114.1	2
114.1 – 118	3
118 – 121.9	0
121.9 – 125.8	1
125.8 – 129.7	1
129.7 – 133.6	0
133.6 – 137.5	2
137.5 – 141.4	2
141.4 – 145.3	0
145.3 – 149.2	0
149.2 – 153.1	1
153.1 – 157	4

Fig 2.

word_count · Mirrors language_count closely — check whether the two are effectively redundant.

Histogram bins for word_count (median: 2.0).
bin	count
1 – 4.9	3848
4.9 – 8.8	484
8.8 – 12.7	194
12.7 – 16.6	96
16.6 – 20.5	90
20.5 – 24.4	107
24.4 – 28.3	32
28.3 – 32.2	25
32.2 – 36.1	15
36.1 – 40	4
40 – 43.9	8
43.9 – 47.8	8
47.8 – 51.7	6
51.7 – 55.6	5
55.6 – 59.5	3
59.5 – 63.4	3
63.4 – 67.3	8
67.3 – 71.2	3
71.2 – 75.1	3
75.1 – 79	3
79 – 82.9	1
82.9 – 86.8	1
86.8 – 90.7	2
90.7 – 94.6	4
94.6 – 98.5	5
98.5 – 102.4	2
102.4 – 106.3	3
106.3 – 110.2	2
110.2 – 114.1	2
114.1 – 118	3
118 – 121.9	0
121.9 – 125.8	1
125.8 – 129.7	1
129.7 – 133.6	0
133.6 – 137.5	2
137.5 – 141.4	2
141.4 – 145.3	0
145.3 – 149.2	0
149.2 – 153.1	1
153.1 – 157	4
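
One way to check the redundancy is to compare the two histograms bin by bin. The sketch below copies the bin counts from the Fig 1 and Fig 2 data tables; the helper name `differing_bins` is hypothetical, and the bin edges are recomputed on the assumption of 40 equal-width bins over [1, 157], which matches the printed edges.

```python
# Bin counts copied from the Fig 1 (language_count) and Fig 2 (word_count) tables.
language_bins = [3849, 483, 194, 96, 90, 107, 32, 25, 15, 4,
                 8, 8, 6, 5, 3, 3, 8, 3, 3, 4,
                 0, 1, 2, 4, 5, 2, 3, 2, 2, 3,
                 0, 1, 1, 0, 2, 2, 0, 0, 1, 4]
word_bins = [3848, 484, 194, 96, 90, 107, 32, 25, 15, 4,
             8, 8, 6, 5, 3, 3, 8, 3, 3, 3,
             1, 1, 2, 4, 5, 2, 3, 2, 2, 3,
             0, 1, 1, 0, 2, 2, 0, 0, 1, 4]


def differing_bins(a, b, lo=1.0, hi=157.0):
    """Return (bin_start, bin_end, count_a, count_b) for every bin that differs."""
    width = (hi - lo) / len(a)
    return [
        (round(lo + i * width, 1), round(lo + (i + 1) * width, 1), x, y)
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


for start, end, lang, word in differing_bins(language_bins, word_bins):
    print(f"{start} to {end}: language_count={lang}, word_count={word}")
```

Only four of the 40 bins differ, each by a single row, so the columns are close but not identical. That pattern is consistent with a handful of cognate sets containing more than one word from the same language, though the histograms alone cannot confirm it; a row-level comparison would.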

Fig 3.

words · Text payload length ranges from 83 to ~15k characters; the long tail marks the richest cognate sets.

Character-length distribution for words (mean: 498.9).
chars	count
83 – 455	3857
455 – 827	475
827 – 1198	191
1198 – 1570	99
1570 – 1942	93
1942 – 2314	98
2314 – 2686	36
2686 – 3058	27
3058 – 3429	14
3429 – 3801	6
3801 – 4173	8
4173 – 4545	7
4545 – 4917	6
4917 – 5289	4
5289 – 5660	4
5660 – 6032	4
6032 – 6404	4
6404 – 6776	6
6776 – 7148	3
7148 – 7520	4
7520 – 7891	1
7891 – 8263	0
8263 – 8635	3
8635 – 9007	3
9007 – 9379	4
9379 – 9750	2
9750 – 10122	3
10122 – 10494	3
10494 – 10866	2
10866 – 11238	2
11238 – 11610	1
11610 – 11981	0
11981 – 12353	2
12353 – 12725	1
12725 – 13097	1
13097 – 13469	2
13469 – 13841	0
13841 – 14212	0
14212 – 14584	2
14584 – 14956	3

Fig 4.

source_dataset · Confirms every row comes from the single 'iecor' source.

Top values for source_dataset (1 unique shown, of 1 total).
value	count	share
iecor	4981	100.0%

Fig 5.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
cognate_id	text	0.0%
concept	categorical	0.0%
word_count	numeric	0.0%
language_count	numeric	0.0%
words	text	0.0%
source_dataset	categorical	0.0%
confidence	numeric	0.0%

Fig 6.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
	word_count	language_count	confidence
word_count	+1.00	+1.00	+nan
language_count	+1.00	+1.00	+nan
confidence	+nan	+nan	+nan
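
The +nan entries follow from the definition of Pearson's r, which divides by the product of the two standard deviations: confidence has a std of 0, so every coefficient involving it is undefined. A minimal sketch on hypothetical values (the three toy lists and the `pearson` helper are illustrative, not saturn's implementation):

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns nan when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan  # the denominator would be zero
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy columns shaped like the real ones.
word_count = [1, 2, 4, 157]
language_count = [1, 2, 4, 157]
confidence = [1.0, 1.0, 1.0, 1.0]

print(pearson(word_count, language_count))
print(pearson(word_count, confidence))
```

Dropping constant columns before computing the matrix avoids the nan row and column entirely.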

cognate_id text identifier

This is a unique cognate identifier column, with every one of the 4981 rows carrying a distinct single-token value (vocab_size 4981, one_word_rate 1.0, null_rate 0). Values follow a fixed `iecor:` scheme with lengths between 7 and 10 characters, consistent with a namespaced primary key from the IE-CoR lexical database. There is nothing to model here — it is pure row identity with zero duplicates.

Treatment: Use as a join key; drop before modelling.
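
Before relying on the column as a join key, it may be worth asserting the two properties the profile reports: the `iecor:` prefix with a total length of 7 to 10 characters, and zero duplicates. The sketch below does that on hypothetical ids; the `check_key` helper and the sample values are assumptions for illustration.

```python
PREFIX = "iecor:"


def check_key(ids):
    """Return (malformed ids, repeated ids) for a candidate join key."""
    malformed = [
        v for v in ids
        if not (v.startswith(PREFIX) and 7 <= len(v) <= 10)
    ]
    seen, repeated = set(), []
    for v in ids:
        if v in seen:
            repeated.append(v)
        seen.add(v)
    return malformed, repeated


# Hypothetical sample with one malformed and one repeated id.
sample = ["iecor:1", "iecor:42", "iecor:4981", "cognate-7", "iecor:42"]
malformed, repeated = check_key(sample)
print(malformed, repeated)
```

On the profiled data both lists should come back empty, given vocab_size 4,981 and n_duplicates 0.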

anthropic:claude-opus-4-7 · confidence high

Out[12]:

saturn.columns["cognate_id"].stats

stat	value
n	4,981
nulls	0 (0.0%)
unique	4,981
len_min	7
len_max	10
len_mean	9.884
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	4,981
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 7.

Character-length distribution for cognate_id.

Character-length distribution for cognate_id (mean: 9.884).
chars	count
7	5
8	44
9	477
10	4455

concept categorical metadata

The 'concept' column is constant: all 4981 rows hold the empty string, with cardinality 1, entropy 0, and a top_rate of 1.0. There is no variation to exploit and no non-empty category was observed.

Treatment: Drop the column; it carries zero information.
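
Constant columns like this one can be detected mechanically. The sketch below computes Shannon entropy and lists columns with a single distinct value; the toy `table`, the helper names, and the choice of base-2 logarithm are assumptions, since the report does not state which base saturn uses (the result is 0 for a constant column in any base).

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    n = len(values)
    return sum(
        (c / n) * math.log2(n / c) for c in Counter(values).values()
    )


def constant_columns(table):
    """Names of columns that hold at most one distinct value."""
    return [name for name, values in table.items() if len(set(values)) <= 1]


# Hypothetical toy table mirroring three of the profiled columns.
table = {
    "concept": ["", "", ""],
    "language_count": [1, 2, 2],
    "confidence": [1.0, 1.0, 1.0],
}

print(constant_columns(table))
print(shannon_entropy(table["concept"]))
```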

anthropic:claude-opus-4-7 · confidence high

Out[15]:

saturn.columns["concept"].stats

stat	value
n	4,981
nulls	0 (0.0%)
unique	1
top_value	(empty string)
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 8.

Top values for concept.

Top values for concept (1 unique shown, of 1 total).
value	count	share
(empty string)	4981	100.0%

word_count numeric feature

Counts word entries per cognate set, with 4981 non-null integer values ranging from 1 to 157. The distribution is heavily right-skewed (skew 6.84, kurtosis 59.74): the median is 2 and Q3 is 4, yet the max reaches 157, dragging the mean to 5.17 with a std of 12.13. About 13% of rows (649) are flagged as outliers, indicating a long tail of unusually large cognate sets.

Treatment: Log-transform or clip before modelling to tame the long right tail.

anthropic:claude-opus-4-7 · confidence high

Out[18]:

saturn.columns["word_count"].stats

stat	value
n	4,981
nulls	0 (0.0%)
unique	93
min	1
max	157
mean	5.168
median	2
std	12.13
q1	1
q3	4
iqr	3
skew	6.837
kurtosis	59.74
n_outliers	649
outlier_rate	0.1303
zero_rate	0
alert: high_skew	skew=+6.84
alert: outliers	13.0% rows beyond 1.5 IQR
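
The outlier count can be reproduced from the reported quartiles and the histogram. With q1 = 1 and q3 = 4, the 1.5 × IQR fences are -3.5 and 8.5; because the values are integers, the first two histogram bins (1 to 4.9 and 4.9 to 8.8) hold exactly the values 1 through 8, so everything outside them lies beyond the upper fence. A short sketch of that arithmetic, with `tukey_fences` as a hypothetical helper name:

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = tukey_fences(q1=1, q3=4)

n_rows = 4981
rows_in_first_two_bins = 3848 + 484  # word_count values 1 through 8
n_outliers = n_rows - rows_in_first_two_bins

print(lower, upper)
print(n_outliers, round(n_outliers / n_rows, 4))
```

The minimum is 1, above the lower fence, so all flagged rows sit in the right tail.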

Fig 9.

Distribution of word_count. Vertical dash marks the median.

Histogram bins for word_count (median: 2.0).
bin	count
1 – 4.9	3848
4.9 – 8.8	484
8.8 – 12.7	194
12.7 – 16.6	96
16.6 – 20.5	90
20.5 – 24.4	107
24.4 – 28.3	32
28.3 – 32.2	25
32.2 – 36.1	15
36.1 – 40	4
40 – 43.9	8
43.9 – 47.8	8
47.8 – 51.7	6
51.7 – 55.6	5
55.6 – 59.5	3
59.5 – 63.4	3
63.4 – 67.3	8
67.3 – 71.2	3
71.2 – 75.1	3
75.1 – 79	3
79 – 82.9	1
82.9 – 86.8	1
86.8 – 90.7	2
90.7 – 94.6	4
94.6 – 98.5	5
98.5 – 102.4	2
102.4 – 106.3	3
106.3 – 110.2	2
110.2 – 114.1	2
114.1 – 118	3
118 – 121.9	0
121.9 – 125.8	1
125.8 – 129.7	1
129.7 – 133.6	0
133.6 – 137.5	2
137.5 – 141.4	2
141.4 – 145.3	0
145.3 – 149.2	0
149.2 – 153.1	1
153.1 – 157	4

language_count numeric feature

`language_count` is a positive integer feature counting languages per record, ranging from 1 to 157 with a median of 2 and Q3 of 4. The distribution is severely right-skewed (skew 6.84, kurtosis 59.77) with 649 outliers (13.0% outlier rate) stretching the mean to 5.17 against a std of 12.13. No nulls or zeros, and only 94 distinct values across 4981 rows.

Treatment: Log1p-transform or cap at a high quantile before modelling to tame the heavy right tail.
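
Both options in the treatment can be sketched with the standard library. The values below are hypothetical, and the cap uses a nearest-rank quantile, which is one of several quantile conventions and not necessarily the one a given library would apply.

```python
import math


def log1p_transform(values):
    """Apply log(1 + x), which compresses the right tail and maps 0 to 0."""
    return [math.log1p(v) for v in values]


def cap_at_quantile(values, q=0.99):
    """Clip values above the nearest-rank q-quantile down to that quantile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


# Hypothetical counts with one extreme value, like the real tail.
counts = [1, 1, 2, 2, 4, 157]

print(log1p_transform(counts))
print(cap_at_quantile(counts, q=0.8))
```

The log transform keeps the ordering of every row, while capping makes all rows above the cap indistinguishable; which is preferable depends on whether the size of the largest sets matters downstream.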

anthropic:claude-opus-4-7 · confidence high

Out[21]:

saturn.columns["language_count"].stats

stat	value
n	4,981
nulls	0 (0.0%)
unique	94
min	1
max	157
mean	5.166
median	2
std	12.13
q1	1
q3	4
iqr	3
skew	6.838
kurtosis	59.77
n_outliers	649
outlier_rate	0.1303
zero_rate	0
alert: high_skew	skew=+6.84
alert: outliers	13.0% rows beyond 1.5 IQR

Fig 10.

Distribution of language_count. Vertical dash marks the median.

Histogram bins for language_count (median: 2.0).
bin	count
1 – 4.9	3849
4.9 – 8.8	483
8.8 – 12.7	194
12.7 – 16.6	96
16.6 – 20.5	90
20.5 – 24.4	107
24.4 – 28.3	32
28.3 – 32.2	25
32.2 – 36.1	15
36.1 – 40	4
40 – 43.9	8
43.9 – 47.8	8
47.8 – 51.7	6
51.7 – 55.6	5
55.6 – 59.5	3
59.5 – 63.4	3
63.4 – 67.3	8
67.3 – 71.2	3
71.2 – 75.1	3
75.1 – 79	4
79 – 82.9	0
82.9 – 86.8	1
86.8 – 90.7	2
90.7 – 94.6	4
94.6 – 98.5	5
98.5 – 102.4	2
102.4 – 106.3	3
106.3 – 110.2	2
110.2 – 114.1	2
114.1 – 118	3
118 – 121.9	0
121.9 – 125.8	1
125.8 – 129.7	1
129.7 – 133.6	0
133.6 – 137.5	2
137.5 – 141.4	2
141.4 – 145.3	0
145.3 – 149.2	0
149.2 – 153.1	1
153.1 – 157	4

words text free_text

This column holds serialized JSON arrays of word entries, each carrying fields like "form", "language", "iso_639_3", and "glottocode"; every one of the 4981 rows starts with `[{"form":`. Values are nearly unique (4963/4981) with 18 duplicates, and lengths vary widely from 83 to 14956 characters (mean 499, median 184), indicating variable-size nested records rather than free prose. Top tokens point to a multilingual etymology dataset spanning Greek, Armenian, Albanian, and languages whose names begin with "Old" (the token alone does not say which), so the Flesch readability score (48.4) is meaningless here.

Treatment: Parse as JSON and explode into a child table of word entries before any modelling.
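
A minimal sketch of that treatment, using only the standard library. The field names "form", "language", "iso_639_3", and "glottocode" are the ones reported above; the sample rows, their placeholder values, the `position` column, and the `explode_words` helper are hypothetical.

```python
import json


def explode_words(rows):
    """Turn each JSON 'words' array into one child record per word entry."""
    child = []
    for row in rows:
        for position, entry in enumerate(json.loads(row["words"])):
            child.append(
                {"cognate_id": row["cognate_id"], "position": position, **entry}
            )
    return child


# Hypothetical rows with placeholder values, shaped like the real payload.
rows = [
    {"cognate_id": "iecor:1", "words": json.dumps([
        {"form": "form_a", "language": "Language A",
         "iso_639_3": "xxx", "glottocode": "xxxx0000"},
        {"form": "form_b", "language": "Language B",
         "iso_639_3": "yyy", "glottocode": "yyyy0000"},
    ])},
    {"cognate_id": "iecor:2", "words": json.dumps([
        {"form": "form_c", "language": "Language A",
         "iso_639_3": "xxx", "glottocode": "xxxx0000"},
    ])},
]

for record in explode_words(rows):
    print(record)
```

Keeping cognate_id on every child record preserves the link back to the parent table, so per-set counts can be recomputed and compared against word_count and language_count.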

anthropic:claude-opus-4-7 · confidence high

Out[24]:

saturn.columns["words"].stats

stat	value
n	4,981
nulls	0 (0.0%)
unique	4,963
len_min	83
len_max	14,956
len_mean	498.9
len_median	184
len_p95	1,988
word_mean	44.53
word_median	16
n_empty	0
n_duplicates	18
duplicate_rate	0.003614
vocab_size	12,094
readability_flesch_mean	48.43
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	99.6% of rows are unique strings
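
The duplicate figures are consistent with counting duplicates as n minus the number of distinct values: 4,981 - 4,963 = 18, and 18 / 4,981 ≈ 0.003614. Whether saturn uses exactly this definition is an assumption. A small sketch that applies it and also lists which payloads repeat (the sample values and the `duplicate_summary` helper are hypothetical):

```python
from collections import Counter


def duplicate_summary(values):
    """Return (n_duplicates, {value: count}) for values seen more than once."""
    counts = Counter(values)
    n_duplicates = len(values) - len(counts)
    repeated = {v: c for v, c in counts.items() if c > 1}
    return n_duplicates, repeated


# Hypothetical payloads: one value appears three times, so two rows are duplicates.
sample = ['[{"form": "a"}]', '[{"form": "b"}]', '[{"form": "a"}]', '[{"form": "a"}]']
n_duplicates, repeated = duplicate_summary(sample)

print(n_duplicates, repeated)
print(round((4981 - 4963) / 4981, 6))
```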

Fig 11.

Character-length distribution for words.

Character-length distribution for words (mean: 498.9).
chars	count
83 – 455	3857
455 – 827	475
827 – 1198	191
1198 – 1570	99
1570 – 1942	93
1942 – 2314	98
2314 – 2686	36
2686 – 3058	27
3058 – 3429	14
3429 – 3801	6
3801 – 4173	8
4173 – 4545	7
4545 – 4917	6
4917 – 5289	4
5289 – 5660	4
5660 – 6032	4
6032 – 6404	4
6404 – 6776	6
6776 – 7148	3
7148 – 7520	4
7520 – 7891	1
7891 – 8263	0
8263 – 8635	3
8635 – 9007	3
9007 – 9379	4
9379 – 9750	2
9750 – 10122	3
10122 – 10494	3
10494 – 10866	2
10866 – 11238	2
11238 – 11610	1
11610 – 11981	0
11981 – 12353	2
12353 – 12725	1
12725 – 13097	1
13097 – 13469	2
13469 – 13841	0
13841 – 14212	0
14212 – 14584	2
14584 – 14956	3

source_dataset categorical metadata

This column records the source dataset provenance, but every one of the 4981 rows carries the single value "iecor". With cardinality 1 and entropy 0, it conveys no information and serves only as a constant tag.

Treatment: Drop before modelling; retain only as a provenance note.

anthropic:claude-opus-4-7 · confidence high

Out[27]:

saturn.columns["source_dataset"].stats

stat	value
n	4,981
nulls	0 (0.0%)
unique	1
top_value	iecor
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 12.

Top values for source_dataset.

Top values for source_dataset (1 unique shown, of 1 total).
value	count	share
iecor	4981	100.0%

confidence numeric metadata

This column is labelled 'confidence' and appears to be a numeric score, but every one of the 4981 rows holds the value 1.0 with zero standard deviation. It carries no information and was flagged constant. Likely an upstream default or placeholder rather than a measured confidence.

Treatment: Drop; the column is constant and has no variance.

anthropic:claude-opus-4-7 · confidence high

Out[30]:

saturn.columns["confidence"].stats

stat	value
n	4,981
nulls	0 (0.0%)
unique	1
min	1
max	1
mean	1
median	1
std	0
q1	1
q3	1
iqr	0
skew	0
kurtosis	0
n_outliers	0
outlier_rate	0
zero_rate	0
alert: constant	only one distinct value

Fig 13.

Distribution of confidence. Vertical dash marks the median.

Histogram bins for confidence (median: 1.0).
bin	count
0.5 – 0.525	0
0.525 – 0.55	0
0.55 – 0.575	0
0.575 – 0.6	0
0.6 – 0.625	0
0.625 – 0.65	0
0.65 – 0.675	0
0.675 – 0.7	0
0.7 – 0.725	0
0.725 – 0.75	0
0.75 – 0.775	0
0.775 – 0.8	0
0.8 – 0.825	0
0.825 – 0.85	0
0.85 – 0.875	0
0.875 – 0.9	0
0.9 – 0.925	0
0.925 – 0.95	0
0.95 – 0.975	0
0.975 – 1	0
1 – 1.025	4981
1.025 – 1.05	0
1.05 – 1.075	0
1.075 – 1.1	0
1.1 – 1.125	0
1.125 – 1.15	0
1.15 – 1.175	0
1.175 – 1.2	0
1.2 – 1.225	0
1.225 – 1.25	0
1.25 – 1.275	0
1.275 – 1.3	0
1.3 – 1.325	0
1.325 – 1.35	0
1.35 – 1.375	0
1.375 – 1.4	0
1.4 – 1.425	0
1.425 – 1.45	0
1.45 – 1.475	0
1.475 – 1.5	0

How to cite

BibTeX

@misc{saturn-parquet-cognate-sets-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: parquet cognate sets},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/parquet-cognate_sets}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: parquet cognate sets. Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/cognate_sets.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/parquet-cognate_sets