parquet cognate sets

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/cognate_sets.parquet

Saturn profiled 4,981 rows across 7 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
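# Install the profiler (IPython shell escape; run this inside a notebook cell).
!pip install saturn-dissect
import subprocess

# Profile the parquet file, write the findings to JSON, and request LLM prose.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/parquet/cognate_sets.parquet",
        "--findings", "parquet-cognate_sets.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if saturn exits non-zero
)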

Summary confidence: high

This dataset catalogs 4,981 cognate sets from the IE-CoR source, each row identified by a unique cognate_id and carrying a serialized JSON 'words' payload that lists the word entries per language. The numeric columns language_count and word_count are almost interchangeable (Pearson +1.00): both are highly skewed (skew ~6.84), with a median of 2, a max of 157, and ~13% outliers, so a small number of cognate groups is far larger than the rest. Three columns carry no analytic signal: concept is an empty string in every row, while confidence and source_dataset are constant. Start by examining the distribution of language_count to understand the long tail of cross-linguistic coverage, then inspect the longest 'words' entries (len_max 14,956) to see which cognate sets dominate.

citing: language_count · word_count · words · cognate_id · concept · confidence · source_dataset

Out[4]:

saturn.schema() · 7 columns

| column | kind | n | null% | unique | alerts |
| --- | --- | ---: | ---: | ---: | --- |
| cognate_id | text | 4,981 | 0.0% | 4,981 | near_unique, one_word, short_text |
| concept | categorical | 4,981 | 0.0% | 1 | imbalance |
| word_count | numeric | 4,981 | 0.0% | 93 | high_skew, outliers |
| language_count | numeric | 4,981 | 0.0% | 94 | high_skew, outliers |
| words | text | 4,981 | 0.0% | 4,963 | near_unique |
| source_dataset | categorical | 4,981 | 0.0% | 1 | imbalance |
| confidence | numeric | 4,981 | 0.0% | 1 | constant |

Fig 1.
language_count · Highly skewed: most cognate sets cover just 2 languages but a few reach 157.
Histogram bins for language_count (median: 2.0).

| bin | count |
| --- | ---: |
| 1 – 4.9 | 3849 |
| 4.9 – 8.8 | 483 |
| 8.8 – 12.7 | 194 |
| 12.7 – 16.6 | 96 |
| 16.6 – 20.5 | 90 |
| 20.5 – 24.4 | 107 |
| 24.4 – 28.3 | 32 |
| 28.3 – 32.2 | 25 |
| 32.2 – 36.1 | 15 |
| 36.1 – 40 | 4 |
| 40 – 43.9 | 8 |
| 43.9 – 47.8 | 8 |
| 47.8 – 51.7 | 6 |
| 51.7 – 55.6 | 5 |
| 55.6 – 59.5 | 3 |
| 59.5 – 63.4 | 3 |
| 63.4 – 67.3 | 8 |
| 67.3 – 71.2 | 3 |
| 71.2 – 75.1 | 3 |
| 75.1 – 79 | 4 |
| 79 – 82.9 | 0 |
| 82.9 – 86.8 | 1 |
| 86.8 – 90.7 | 2 |
| 90.7 – 94.6 | 4 |
| 94.6 – 98.5 | 5 |
| 98.5 – 102.4 | 2 |
| 102.4 – 106.3 | 3 |
| 106.3 – 110.2 | 2 |
| 110.2 – 114.1 | 2 |
| 114.1 – 118 | 3 |
| 118 – 121.9 | 0 |
| 121.9 – 125.8 | 1 |
| 125.8 – 129.7 | 1 |
| 129.7 – 133.6 | 0 |
| 133.6 – 137.5 | 2 |
| 137.5 – 141.4 | 2 |
| 141.4 – 145.3 | 0 |
| 145.3 – 149.2 | 0 |
| 149.2 – 153.1 | 1 |
| 153.1 – 157 | 4 |

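
The bin edges in the table above are consistent with 40 equal-width bins spanning the column's min and max: (157 − 1) / 40 = 3.9. That Saturn bins this way is an inference from the table, not something the report states. A minimal sketch of such binning, with hypothetical helper names and illustrative values:

```python
def equal_width_edges(lo, hi, n_bins=40):
    """Return n_bins + 1 equally spaced bin edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_counts(values, lo, hi, n_bins=40):
    """Count values per bin; the last bin is closed on the right."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that v == hi falls into the final bin instead of overflowing.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# language_count min and max as reported (1 and 157).
edges = equal_width_edges(1, 157)
print([round(e, 1) for e in edges[:4]])

# Illustrative values only, not rows from the dataset.
sample = [1, 2, 2, 3, 9, 41, 157]
counts = bin_counts(sample, 1, 157)
print(counts[:3], counts[-1])
```

With a bin width of 3.9, integer counts of 1 to 4 fall into the first bin and 5 to 8 into the second, which is why those two bins hold most of the 4,981 rows.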
Fig 2.
word_count · Mirrors language_count closely — check whether the two are effectively redundant.
Histogram bins for word_count (median: 2.0).

| bin | count |
| --- | ---: |
| 1 – 4.9 | 3848 |
| 4.9 – 8.8 | 484 |
| 8.8 – 12.7 | 194 |
| 12.7 – 16.6 | 96 |
| 16.6 – 20.5 | 90 |
| 20.5 – 24.4 | 107 |
| 24.4 – 28.3 | 32 |
| 28.3 – 32.2 | 25 |
| 32.2 – 36.1 | 15 |
| 36.1 – 40 | 4 |
| 40 – 43.9 | 8 |
| 43.9 – 47.8 | 8 |
| 47.8 – 51.7 | 6 |
| 51.7 – 55.6 | 5 |
| 55.6 – 59.5 | 3 |
| 59.5 – 63.4 | 3 |
| 63.4 – 67.3 | 8 |
| 67.3 – 71.2 | 3 |
| 71.2 – 75.1 | 3 |
| 75.1 – 79 | 3 |
| 79 – 82.9 | 1 |
| 82.9 – 86.8 | 1 |
| 86.8 – 90.7 | 2 |
| 90.7 – 94.6 | 4 |
| 94.6 – 98.5 | 5 |
| 98.5 – 102.4 | 2 |
| 102.4 – 106.3 | 3 |
| 106.3 – 110.2 | 2 |
| 110.2 – 114.1 | 2 |
| 114.1 – 118 | 3 |
| 118 – 121.9 | 0 |
| 121.9 – 125.8 | 1 |
| 125.8 – 129.7 | 1 |
| 129.7 – 133.6 | 0 |
| 133.6 – 137.5 | 2 |
| 137.5 – 141.4 | 2 |
| 141.4 – 145.3 | 0 |
| 145.3 – 149.2 | 0 |
| 149.2 – 153.1 | 1 |
| 153.1 – 157 | 4 |

Fig 3.
words · Text payload length ranges from 83 to ~15k characters; the long tail marks the richest cognate sets.
Character-length distribution for words (mean: 498.88094760088336).

| chars | count |
| --- | ---: |
| 83 – 455 | 3857 |
| 455 – 827 | 475 |
| 827 – 1198 | 191 |
| 1198 – 1570 | 99 |
| 1570 – 1942 | 93 |
| 1942 – 2314 | 98 |
| 2314 – 2686 | 36 |
| 2686 – 3058 | 27 |
| 3058 – 3429 | 14 |
| 3429 – 3801 | 6 |
| 3801 – 4173 | 8 |
| 4173 – 4545 | 7 |
| 4545 – 4917 | 6 |
| 4917 – 5289 | 4 |
| 5289 – 5660 | 4 |
| 5660 – 6032 | 4 |
| 6032 – 6404 | 4 |
| 6404 – 6776 | 6 |
| 6776 – 7148 | 3 |
| 7148 – 7520 | 4 |
| 7520 – 7891 | 1 |
| 7891 – 8263 | 0 |
| 8263 – 8635 | 3 |
| 8635 – 9007 | 3 |
| 9007 – 9379 | 4 |
| 9379 – 9750 | 2 |
| 9750 – 10122 | 3 |
| 10122 – 10494 | 3 |
| 10494 – 10866 | 2 |
| 10866 – 11238 | 2 |
| 11238 – 11610 | 1 |
| 11610 – 11981 | 0 |
| 11981 – 12353 | 2 |
| 12353 – 12725 | 1 |
| 12725 – 13097 | 1 |
| 13097 – 13469 | 2 |
| 13469 – 13841 | 0 |
| 13841 – 14212 | 0 |
| 14212 – 14584 | 2 |
| 14584 – 14956 | 3 |

Fig 4.
source_dataset · Confirms every row comes from the single 'iecor' source.
Top values for source_dataset (1 unique shown, of 1 total).

| value | count | share |
| --- | ---: | ---: |
| iecor | 4981 | 100.0% |

Fig 5.
Per-column null rate across the corpus. Columns are ordered by input position.

| column | kind | null % |
| --- | --- | ---: |
| cognate_id | text | 0.0% |
| concept | categorical | 0.0% |
| word_count | numeric | 0.0% |
| language_count | numeric | 0.0% |
| words | text | 0.0% |
| source_dataset | categorical | 0.0% |
| confidence | numeric | 0.0% |

Fig 6.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).

| | word_count | language_count | confidence |
| --- | ---: | ---: | ---: |
| word_count | +1.00 | +1.00 | nan |
| language_count | +1.00 | +1.00 | nan |
| confidence | nan | nan | nan |
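
The nan entries follow from the definition of Pearson correlation: the denominator is the product of the two standard deviations, and confidence has a standard deviation of 0. The sketch below is a plain-Python illustration with a hypothetical `pearson` helper and made-up values; it is not Saturn's implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns nan when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only, not rows from the dataset.
word_count = [1, 2, 2, 4, 9, 157]
language_count = [1, 2, 2, 4, 9, 156]
confidence = [1.0] * 6

r = pearson(word_count, language_count)
print(round(r, 2))
print(pearson(word_count, confidence))
```

A coefficient that rounds to +1.00 between word_count and language_count suggests keeping only one of the two as a model feature.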

cognate_id text identifier

This is a unique cognate identifier column, with every one of the 4981 rows carrying a distinct single-token value (vocab_size 4981, one_word_rate 1.0, null_rate 0). Values follow a fixed `iecor:` scheme with lengths between 7 and 10 characters, consistent with a namespaced primary key from the IE-CoR lexical database. There is nothing to model here — it is pure row identity with zero duplicates.

Treatment: Use as a join key; drop before modelling.

anthropic:claude-opus-4-7 · confidence high
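
Before using cognate_id as a join key, it may be worth confirming the scheme and uniqueness directly. The sketch below assumes the ids are the literal prefix `iecor:` followed by 1 to 4 digits, which would account for the reported lengths of 7 to 10 characters; the digit suffix is an assumption, and `check_ids` is a hypothetical helper.

```python
import re

# Assumed shape: "iecor:" plus 1 to 4 digits (6 + 1 = 7 up to 6 + 4 = 10 chars).
ID_PATTERN = re.compile(r"iecor:\d{1,4}")


def check_ids(ids):
    """Return (malformed ids, duplicated ids) for a list of cognate_id strings."""
    malformed = [i for i in ids if not ID_PATTERN.fullmatch(i)]
    seen, duplicated = set(), set()
    for i in ids:
        if i in seen:
            duplicated.add(i)
        seen.add(i)
    return malformed, sorted(duplicated)


# Hypothetical ids for illustration; real values come from the parquet file.
malformed, duplicated = check_ids(["iecor:7", "iecor:1234", "iecor:1234", "ie:9"])
print(malformed, duplicated)
```

On the real column both lists should come back empty if the stats above hold (n_duplicates 0, every value a single token).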
Out[12]:

saturn.columns["cognate_id"].stats

| stat | value |
| --- | --- |
| n | 4,981 |
| nulls | 0 (0.0%) |
| unique | 4,981 |
| len_min | 7 |
| len_max | 10 |
| len_mean | 9.884 |
| len_median | 10 |
| len_p95 | 10 |
| word_mean | 1 |
| word_median | 1 |
| n_empty | 0 |
| n_duplicates | 0 |
| duplicate_rate | 0 |
| vocab_size | 4,981 |
| readability_flesch_mean | 121.2 |
| emoji_rate | 0 |
| url_rate | 0 |
| one_word_rate | 1 |
| allcaps_rate | 0 |
| boilerplate_rate | 0 |
| alert: near_unique | 100.0% of rows are unique strings |
| alert: one_word | 100.0% rows are a single word |
| alert: short_text | 95th-percentile length under 20 chars |

Fig 7.
Character-length distribution for cognate_id.
Character-length distribution for cognate_id (mean: 9.883557518570568).

| chars | count |
| --- | ---: |
| 7 | 5 |
| 8 | 44 |
| 9 | 477 |
| 10 | 4455 |

Bin edges are rounded to whole characters in the source table; the other 36 of its 40 bins are empty.

concept categorical metadata

The 'concept' column is constant: all 4981 rows hold the empty string, with cardinality 1, entropy 0, and a top_rate of 1.0. There is no variation to exploit and no non-empty category was observed.

Treatment: Drop the column; it carries zero information.

anthropic:claude-opus-4-7 · confidence high
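
The cardinality, top_rate, and entropy figures quoted above can be reproduced with a few lines of Python. This is a minimal sketch using hypothetical helper names and a toy table; the entropy is computed in bits, which is an assumption about Saturn's log base (for a constant column the result is 0 in any base).

```python
import math
from collections import Counter


def categorical_profile(values):
    """Return cardinality, top_rate, and Shannon entropy (bits) for a column."""
    counts = Counter(values)
    n = len(values)
    probs = [c / n for c in counts.values()]
    entropy = sum(-p * math.log2(p) for p in probs)
    return {
        "cardinality": len(counts),
        "top_rate": max(probs),
        "entropy": entropy,
    }


def constant_columns(table):
    """Names of columns in a dict-of-lists table that hold a single value."""
    return [name for name, col in table.items() if len(set(col)) == 1]


# Toy table with illustrative values, not real rows.
table = {
    "cognate_id": ["iecor:1", "iecor:2", "iecor:3"],
    "concept": ["", "", ""],
    "source_dataset": ["iecor", "iecor", "iecor"],
    "confidence": [1.0, 1.0, 1.0],
}
profile = categorical_profile(table["concept"])
print(profile)
print(constant_columns(table))
```

The same check flags source_dataset and confidence, the other two columns this report recommends dropping.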
Out[15]:

saturn.columns["concept"].stats

| stat | value |
| --- | --- |
| n | 4,981 |
| nulls | 0 (0.0%) |
| unique | 1 |
| top_value | (empty string) |
| top_rate | 1 |
| cardinality | 1 |
| entropy | 0 |
| entropy_ratio | 0 |
| alert: imbalance | top value is 100.0% of rows |

Fig 8.
Top values for concept.
Top values for concept (1 unique shown, of 1 total).

| value | count | share |
| --- | ---: | ---: |
| (empty string) | 4981 | 100.0% |

word_count numeric feature

`word_count` appears to count the word entries in each cognate set (it tracks `language_count` almost exactly, Pearson +1.00), with 4981 non-null integer values ranging from 1 to 157. The distribution is heavily right-skewed (skew 6.84, kurtosis 59.74): the median is 2 and Q3 is 4, yet the max reaches 157, pulling the mean up to 5.17 with a std of 12.13. About 13% of rows (649) are flagged as outliers, indicating a long tail of unusually large cognate sets.

Treatment: Log-transform or clip before modelling to tame the long right tail.

anthropic:claude-opus-4-7 · confidence high
Out[18]:

saturn.columns["word_count"].stats

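| stat | value |
| --- | --- |
| n | 4,981 |
| nulls | 0 (0.0%) |
| unique | 93 |
| min | 1 |
| max | 157 |
| mean | 5.168 |
| median | 2 |
| std | 12.13 |
| q1 | 1 |
| q3 | 4 |
| iqr | 3 |
| skew | 6.837 |
| kurtosis | 59.74 |
| n_outliers | 649 |
| outlier_rate | 0.1303 |
| zero_rate | 0 |
| alert: high_skew | skew=+6.84 |
| alert: outliers | 13.0% rows beyond 1.5 IQR |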
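
The outlier alert reads like the standard 1.5 × IQR (Tukey) rule. Assuming that is what Saturn applies, the fences follow directly from the reported quartiles; `tukey_fences` is a hypothetical helper and the sample values are illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for word_count: q1 = 1, q3 = 4.
lower, upper = tukey_fences(1, 4)
print(lower, upper)  # -3.5 8.5

# Illustrative values only. Counts are at least 1, so only the upper
# fence can ever flag a row.
sample = [1, 1, 2, 2, 4, 8, 9, 157]
outliers = [v for v in sample if v < lower or v > upper]
print(outliers)  # [9, 157]
```

With an upper fence of 8.5, any cognate set with 9 or more words is flagged. This agrees with the histogram under Fig 2: the bins from 8.8 upward sum to 649 rows, the reported n_outliers.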
Fig 9.
Distribution of word_count. Vertical dash marks the median.
Bin counts are the same as in the Fig 2 data table.

language_count numeric feature

`language_count` is a positive integer feature counting languages per record, ranging from 1 to 157 with a median of 2 and Q3 of 4. The distribution is severely right-skewed (skew 6.84, kurtosis 59.77) with 649 outliers (13.0% outlier rate) stretching the mean to 5.17 against a std of 12.13. No nulls or zeros, and only 94 distinct values across 4981 rows.

Treatment: Log1p-transform or cap at a high quantile before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
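
The two treatments named above can be sketched in plain Python. The helper names are hypothetical, the values are illustrative, and the quantile uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def log1p_transform(values):
    """Apply log(1 + x) to compress the right tail of a count feature."""
    return [math.log1p(v) for v in values]


def cap_at_quantile(values, q=0.99):
    """Clip values above the q-th quantile (nearest-rank definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based nearest rank
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


# Illustrative counts, not real rows.
language_count = [1, 1, 2, 2, 3, 4, 9, 25, 80, 157]
transformed = log1p_transform(language_count)
capped = cap_at_quantile(language_count, q=0.9)
print([round(x, 2) for x in transformed])
print(capped)
```

The log transform keeps the ordering of every row and only compresses the scale, while capping discards the difference between the largest sets, so the choice depends on whether those extreme sizes matter to the model.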
Out[21]:

saturn.columns["language_count"].stats

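| stat | value |
| --- | --- |
| n | 4,981 |
| nulls | 0 (0.0%) |
| unique | 94 |
| min | 1 |
| max | 157 |
| mean | 5.166 |
| median | 2 |
| std | 12.13 |
| q1 | 1 |
| q3 | 4 |
| iqr | 3 |
| skew | 6.838 |
| kurtosis | 59.77 |
| n_outliers | 649 |
| outlier_rate | 0.1303 |
| zero_rate | 0 |
| alert: high_skew | skew=+6.84 |
| alert: outliers | 13.0% rows beyond 1.5 IQR |
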
Fig 10.
Distribution of language_count. Vertical dash marks the median.
Bin counts are the same as in the Fig 1 data table.

words text free_text

This column holds serialized JSON arrays of word entries, each carrying fields like "form", "language", "iso_639_3", and "glottocode"; every one of the 4981 rows starts with `[{"form":`. Values are nearly unique (4963 distinct among 4981 rows, 18 duplicates), and lengths vary widely from 83 to 14956 characters (mean 499, median 184), indicating variable-size nested records rather than free prose. The top tokens are language names such as Greek, Armenian, and Albanian, plus a bare "Old" (presumably the leading word of a language name, e.g. Old English), as expected for a multilingual etymology dataset, so Flesch readability (48.4) is meaningless here.

Treatment: Parse as JSON and explode into a child table of word entries before any modelling.

anthropic:claude-opus-4-7 · confidence high
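
One way to carry out the parse-and-explode step is sketched below with the standard `json` module. The field names come from the description above; the rows, forms, and codes are made up for illustration, and `explode_words` is a hypothetical helper.

```python
import json


def explode_words(rows):
    """Flatten (cognate_id, words JSON) pairs into one dict per word entry."""
    child = []
    for cognate_id, payload in rows:
        for entry in json.loads(payload):
            # Keep the parent key on every child row so it can be joined back.
            child.append({"cognate_id": cognate_id, **entry})
    return child


# Hypothetical rows; the forms and codes are invented for illustration.
rows = [
    ("iecor:1", '[{"form": "aaa", "language": "Lang A", '
                '"iso_639_3": "xxa", "glottocode": "aaaa1234"}]'),
    ("iecor:2", '[{"form": "bbb", "language": "Lang B", '
                '"iso_639_3": "xxb", "glottocode": "bbbb1234"}, '
                '{"form": "ccc", "language": "Lang C", '
                '"iso_639_3": "xxc", "glottocode": "cccc1234"}]'),
]
child = explode_words(rows)
print(len(child))
print(child[0])
```

Once the child table exists, the number of child rows per cognate_id can be compared against word_count, and the number of distinct languages against language_count, to check how those two columns were derived.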
Out[24]:

saturn.columns["words"].stats

| stat | value |
| --- | --- |
| n | 4,981 |
| nulls | 0 (0.0%) |
| unique | 4,963 |
| len_min | 83 |
| len_max | 14,956 |
| len_mean | 498.9 |
| len_median | 184 |
| len_p95 | 1,988 |
| word_mean | 44.53 |
| word_median | 16 |
| n_empty | 0 |
| n_duplicates | 18 |
| duplicate_rate | 0.003614 |
| vocab_size | 12,094 |
| readability_flesch_mean | 48.43 |
| emoji_rate | 0 |
| url_rate | 0 |
| one_word_rate | 0 |
| allcaps_rate | 0 |
| boilerplate_rate | 0 |
| alert: near_unique | 99.6% of rows are unique strings |

Fig 11.
Character-length distribution for words.
Bin counts are the same as in the Fig 3 data table.

source_dataset categorical metadata

This column records the source dataset provenance, but every one of the 4981 rows carries the single value "iecor". With cardinality 1 and entropy 0, it conveys no information and serves only as a constant tag.

Treatment: Drop before modelling; retain only as a provenance note.

anthropic:claude-opus-4-7 · confidence high
Out[27]:

saturn.columns["source_dataset"].stats

| stat | value |
| --- | --- |
| n | 4,981 |
| nulls | 0 (0.0%) |
| unique | 1 |
| top_value | iecor |
| top_rate | 1 |
| cardinality | 1 |
| entropy | 0 |
| entropy_ratio | 0 |
| alert: imbalance | top value is 100.0% of rows |

Fig 12.
Top values for source_dataset.
The value table is the same as under Fig 4.

confidence numeric metadata

This column is labelled 'confidence' and appears to be a numeric score, but every one of the 4981 rows holds the value 1.0 with zero standard deviation. It carries no information and was flagged constant. Likely an upstream default or placeholder rather than a measured confidence.

Treatment: Drop; the column is constant and has no variance.

anthropic:claude-opus-4-7 · confidence high
Out[30]:

saturn.columns["confidence"].stats

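| stat | value |
| --- | --- |
| n | 4,981 |
| nulls | 0 (0.0%) |
| unique | 1 |
| min | 1 |
| max | 1 |
| mean | 1 |
| median | 1 |
| std | 0 |
| q1 | 1 |
| q3 | 1 |
| iqr | 0 |
| skew | 0 |
| kurtosis | 0 |
| n_outliers | 0 |
| outlier_rate | 0 |
| zero_rate | 0 |
| alert: constant | only one distinct value |
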
Fig 13.
Distribution of confidence. Vertical dash marks the median.
Histogram bins for confidence (median: 1.0).

| bin | count |
| --- | ---: |
| 1 – 1.025 | 4981 |

The other 39 bins, which span 0.5 to 1.5 in steps of 0.025, are empty.

How to cite

BibTeX
@misc{saturn-parquet-cognate-sets-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: parquet cognate sets},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/parquet-cognate_sets}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: parquet cognate sets. Source: /home/coolhand/servers/diachronica/etymology_atlas/parquet/cognate_sets.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/parquet-cognate_sets