processed cognate sets

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/processed/cognate_sets.json

Saturn profiled 4,981 rows across 8 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
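# Install the profiler (IPython shell escape, so this line only works in a notebook cell).
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source file with the options used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/processed/cognate_sets.json",
    "--findings", "processed-cognate_sets.json",
    "--llm", "anthropic:claude-opus-4-7",
])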

Summary confidence: high

This dataset contains 4,981 cognate sets, all from the 'iecor' source_dataset, each identified by a unique cognate_id. The two main numeric signals are language_count and word_count, which are nearly identical in distribution: both have a median of 2 and a mean of about 5.17, but reach a maximum of 157, with skew above 6.8 and roughly 13% of rows flagged as outliers. That long tail is the most interesting feature: most cognate sets are small, but a minority span many languages and words and deserve a closer look. Three columns carry no analytic signal: concept is empty for every row, confidence is constant at 1.0, and source_dataset has only one value.

citing: row_count · column_count · language_count · word_count · concept · confidence · source_dataset · cognate_id

Out[4]:

saturn.schema() · 8 columns

column          kind         n      null%  unique  alerts
cognate_id      text         4,981  0.0%   4,981   near_unique one_word short_text
concept         categorical  4,981  0.0%   1       imbalance
words           unknown      4,981  0.0%   -       skipped
source_dataset  categorical  4,981  0.0%   1       imbalance
confidence      numeric      4,981  0.0%   1       constant
word_count      numeric      4,981  0.0%   93      high_skew outliers
language_count  numeric      4,981  0.0%   94      high_skew outliers
sources         unknown      4,981  0.0%   -       skipped
Fig 1.
language_count · Look at the long right tail — most sets cover just 1-2 languages but a few reach 157.
Histogram bins for language_count (median: 2.0).
bin            count
1 – 4.9        3849
4.9 – 8.8      483
8.8 – 12.7     194
12.7 – 16.6    96
16.6 – 20.5    90
20.5 – 24.4    107
24.4 – 28.3    32
28.3 – 32.2    25
32.2 – 36.1    15
36.1 – 40      4
40 – 43.9      8
43.9 – 47.8    8
47.8 – 51.7    6
51.7 – 55.6    5
55.6 – 59.5    3
59.5 – 63.4    3
63.4 – 67.3    8
67.3 – 71.2    3
71.2 – 75.1    3
75.1 – 79      4
79 – 82.9      0
82.9 – 86.8    1
86.8 – 90.7    2
90.7 – 94.6    4
94.6 – 98.5    5
98.5 – 102.4   2
102.4 – 106.3  3
106.3 – 110.2  2
110.2 – 114.1  2
114.1 – 118    3
118 – 121.9    0
121.9 – 125.8  1
125.8 – 129.7  1
129.7 – 133.6  0
133.6 – 137.5  2
137.5 – 141.4  2
141.4 – 145.3  0
145.3 – 149.2  0
149.2 – 153.1  1
153.1 – 157    4
Fig 2.
word_count · Mirrors language_count almost exactly; check whether the two are effectively duplicates.
Histogram bins for word_count (median: 2.0).
bin            count
1 – 4.9        3848
4.9 – 8.8      484
8.8 – 12.7     194
12.7 – 16.6    96
16.6 – 20.5    90
20.5 – 24.4    107
24.4 – 28.3    32
28.3 – 32.2    25
32.2 – 36.1    15
36.1 – 40      4
40 – 43.9      8
43.9 – 47.8    8
47.8 – 51.7    6
51.7 – 55.6    5
55.6 – 59.5    3
59.5 – 63.4    3
63.4 – 67.3    8
67.3 – 71.2    3
71.2 – 75.1    3
75.1 – 79      3
79 – 82.9      1
82.9 – 86.8    1
86.8 – 90.7    2
90.7 – 94.6    4
94.6 – 98.5    5
98.5 – 102.4   2
102.4 – 106.3  3
106.3 – 110.2  2
110.2 – 114.1  2
114.1 – 118    3
118 – 121.9    0
121.9 – 125.8  1
125.8 – 129.7  1
129.7 – 133.6  0
133.6 – 137.5  2
137.5 – 141.4  2
141.4 – 145.3  0
145.3 – 149.2  0
149.2 – 153.1  1
153.1 – 157    4
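
One way to act on the Fig 2 note is to compare the two histograms bin by bin. The sketch below uses the 40 bin counts transcribed from the Fig 1 and Fig 2 data tables; the variable and function names are illustrative and not part of saturn. It only compares binned counts, so it can show where the distributions differ but not which rows are responsible.

```python
# Bin counts transcribed from the Fig 1 and Fig 2 data tables
# (40 equal-width bins covering 1 to 157).
LANGUAGE_COUNT_BINS = [
    3849, 483, 194, 96, 90, 107, 32, 25, 15, 4,
    8, 8, 6, 5, 3, 3, 8, 3, 3, 4,
    0, 1, 2, 4, 5, 2, 3, 2, 2, 3,
    0, 1, 1, 0, 2, 2, 0, 0, 1, 4,
]
WORD_COUNT_BINS = [
    3848, 484, 194, 96, 90, 107, 32, 25, 15, 4,
    8, 8, 6, 5, 3, 3, 8, 3, 3, 3,
    1, 1, 2, 4, 5, 2, 3, 2, 2, 3,
    0, 1, 1, 0, 2, 2, 0, 0, 1, 4,
]


def differing_bins(a, b):
    """Return (bin_index, count_a, count_b) for each bin where the histograms disagree."""
    if len(a) != len(b):
        raise ValueError("histograms must have the same number of bins")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


diffs = differing_bins(LANGUAGE_COUNT_BINS, WORD_COUNT_BINS)
for index, lang_rows, word_rows in diffs:
    print(f"bin {index}: language_count={lang_rows}, word_count={word_rows}")
```

Only four bins differ, each by a single row, which fits the reading that the two columns are near duplicates at this resolution. A row-level comparison on the source file would still be needed before dropping either column.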
Fig 3.
source_dataset · Confirms every row comes from a single source ('iecor'), so no cross-source comparison is possible.
Top values for source_dataset (1 unique shown, of 1 total).
value  count  share
iecor  4981   100.0%
Fig 4.
cognate_id · ID strings cluster tightly at length 10, useful as a sanity check on identifier formatting.
Character-length distribution for cognate_id (mean: 9.883557518570568).
chars  count
7      5
8      44
9      477
10     4455
(The other 36 of the 40 bins between 7 and 10 are empty.)
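
The reported mean length can be checked against the four non-empty bins of the Fig 4 data table (5, 44, 477 and 4,455 identifiers of length 7, 8, 9 and 10). This is a small arithmetic sketch; the dictionary name is illustrative.

```python
# Non-empty bins from the cognate_id length distribution: length -> row count.
length_counts = {7: 5, 8: 44, 9: 477, 10: 4455}

total = sum(length_counts.values())
mean_length = sum(length * count for length, count in length_counts.items()) / total
share_at_10 = length_counts[10] / total

print(total)                  # 4981
print(round(mean_length, 3))  # 9.884
print(f"{share_at_10:.1%}")   # 89.4%
```

The weighted mean agrees with len_mean, and about nine in ten identifiers have the maximum length of 10.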
Fig 5.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column          kind         null %
cognate_id      text         0.0%
concept         categorical  0.0%
words           unknown      0.0%
source_dataset  categorical  0.0%
confidence      numeric      0.0%
word_count      numeric      0.0%
language_count  numeric      0.0%
sources         unknown      0.0%
Fig 6.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
                confidence  word_count  language_count
confidence      nan         nan         nan
word_count      nan         +1.00       +1.00
language_count  nan         +1.00       +1.00
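
The nan entries for confidence are what Pearson correlation gives for a zero-variance column: the denominator contains that column's standard deviation, which is 0 here. The sketch below is a plain-Python implementation for illustration, not saturn's own code, and the three short lists are made-up values that only mimic the shape of the real columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; nan if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # A constant column has no variance, so the ratio is undefined.
        return math.nan
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy values, not rows from the dataset.
confidence = [1.0] * 6
language_count = [1, 1, 2, 4, 9, 157]
word_count = [1, 1, 2, 4, 10, 157]

r_constant = pearson(confidence, word_count)
r_counts = pearson(word_count, language_count)
print(r_constant)          # nan
print(round(r_counts, 2))  # 1.0
```

A correlation that rounds to 1.00 does not prove the two count columns are equal row for row; it only says they move together almost perfectly.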

cognate_id text identifier

This is a primary identifier column: every one of the 4981 values is unique, non-null, single-token, and follows an `iecor:` pattern (length 7-10 chars). With n_unique == n and duplicate_rate 0, it functions as a cognate-set key from the IECoR resource rather than a modelling feature.

Treatment: Use as the join key to cognate metadata; exclude from any model features.

anthropic:claude-opus-4-7 · confidence high
Out[12]:

saturn.columns["cognate_id"].stats

stat                     value
n                        4,981
nulls                    0 (0.0%)
unique                   4,981
len_min                  7
len_max                  10
len_mean                 9.884
len_median               10
len_p95                  10
word_mean                1
word_median              1
n_empty                  0
n_duplicates             0
duplicate_rate           0
vocab_size               4,981
readability_flesch_mean  121.2
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             0
boilerplate_rate         0
alert: near_unique       100.0% of rows are unique strings
alert: one_word          100.0% rows are a single word
alert: short_text        95th-percentile length under 20 chars
Fig 7.
Character-length distribution for cognate_id.
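
Before using cognate_id as a join key, the properties reported for it (unique, single token, `iecor:` prefix, 7 to 10 characters) can be re-checked on the raw values. The sketch below is one possible check; the function name and the sample identifiers are hypothetical, and the form of the part after the prefix is not stated in this report, so the check deliberately does not constrain it.

```python
def check_ids(ids, prefix="iecor:", min_len=7, max_len=10):
    """Return (duplicates, malformed) for a list of identifier strings."""
    seen = set()
    duplicates = []
    malformed = []
    for identifier in ids:
        if identifier in seen:
            duplicates.append(identifier)
        seen.add(identifier)
        well_formed = (
            identifier.startswith(prefix)
            and min_len <= len(identifier) <= max_len
            and len(identifier.split()) == 1  # single token, no whitespace
        )
        if not well_formed:
            malformed.append(identifier)
    return duplicates, malformed


# Hypothetical identifiers for illustration; the suffix format is a guess.
sample = ["iecor:1", "iecor:42", "iecor:42", "bad id"]
duplicates, malformed = check_ids(sample)
print(duplicates)  # ['iecor:42']
print(malformed)   # ['bad id']
```

On the real column both lists should come back empty if the profile above still holds.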

concept categorical other

The 'concept' column is a categorical field that is entirely constant: all 4981 rows hold the same empty-string value, giving cardinality 1 and entropy 0. It carries no information and was flagged for imbalance with a top_rate of 1.0.

Treatment: Drop; the column is constant and contributes no signal.

anthropic:claude-opus-4-7 · confidence high
Out[15]:

saturn.columns["concept"].stats

stat              value
n                 4,981
nulls             0 (0.0%)
unique            1
top_value         (empty string)
top_rate          1
cardinality       1
entropy           0
entropy_ratio     0
alert: imbalance  top value is 100.0% of rows
Fig 8.
Top values for concept.
Top values for concept (1 unique shown, of 1 total).
value           count  share
(empty string)  4981   100.0%
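
The cardinality, top_rate and entropy figures for a constant column follow directly from its value counts. The sketch below computes them with Shannon entropy in bits and takes entropy_ratio to be entropy divided by the maximum possible for that cardinality; both choices are assumptions about how saturn defines these stats, and the function name is illustrative.

```python
import math
from collections import Counter


def categorical_stats(values):
    """Cardinality, top-value share and Shannon entropy (bits) of a categorical column."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    n = len(values)
    # sum p * log2(1/p); written this way a constant column yields 0.0, not -0.0
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    top_value, top_count = counts.most_common(1)[0]
    return {
        "cardinality": len(counts),
        "top_value": top_value,
        "top_rate": top_count / n,
        "entropy": entropy,
        "entropy_ratio": entropy / max_entropy if max_entropy else 0.0,
    }


# A column of 4,981 empty strings, as reported for concept.
concept_stats = categorical_stats([""] * 4981)
print(concept_stats)
```

With a single distinct value the only probability is 1, so the entropy term is 1 × log2(1) = 0, matching the stats table.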

words unknown other

The column is named "words" but saturn skipped profiling it, so its kind is unknown and no descriptive statistics were computed. The only confirmed signals are 4981 non-null rows with a null rate of 0.0; uniqueness, type, and value distribution are all unavailable.

Treatment: Re-profile or manually inspect this column before any downstream use, since saturn skipped it.

anthropic:claude-opus-4-7 · confidence low
Out[18]:

saturn.columns["words"].stats

stat            value
n               4,981
nulls           0 (0.0%)
unique          (not computed)
alert: skipped  no profiler for kind=unknown
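
A manual inspection of a skipped column can start by counting the shapes of its values. The sketch below assumes the source JSON has already been loaded into a list of dicts and that the column holds nested JSON values; neither is confirmed by this report, and the sample records and key names are made up.

```python
from collections import Counter


def describe_shape(value):
    """Return a short type signature such as 'list[dict{form,language}]'."""
    if isinstance(value, dict):
        return "dict{" + ",".join(sorted(value)) + "}"
    if isinstance(value, list):
        inner = sorted({describe_shape(item) for item in value})
        return "list[" + "|".join(inner) + "]"
    return type(value).__name__


def shape_counts(records, column):
    """Count how often each value shape occurs in one column."""
    return Counter(describe_shape(record.get(column)) for record in records)


# Hypothetical records; the real structure of "words" is unknown.
sample = [
    {"words": [{"language": "x", "form": "y"}]},
    {"words": []},
]
shapes = shape_counts(sample, "words")
print(shapes)
```

If one shape dominates, the column can be flattened into profiled fields (for example a length column) and re-run through the profiler.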

source_dataset categorical metadata

This column records the originating dataset for each row, but every one of the 4981 records carries the single value "iecor". Cardinality is 1 and entropy is 0, so the field carries no information for modelling or grouping.

Treatment: Drop; constant column with no discriminative signal.

anthropic:claude-opus-4-7 · confidence high
Out[20]:

saturn.columns["source_dataset"].stats

stat              value
n                 4,981
nulls             0 (0.0%)
unique            1
top_value         iecor
top_rate          1
cardinality       1
entropy           0
entropy_ratio     0
alert: imbalance  top value is 100.0% of rows
Fig 9.
Top values for source_dataset.

confidence numeric metadata

The column 'confidence' is a numeric field that is entirely constant: all 4981 rows hold the value 1.0, with zero standard deviation and a single unique value. It carries no information for downstream modelling and likely reflects a default or hard-coded score rather than a measured probability.

Treatment: Drop; the column is constant and has no variance.

anthropic:claude-opus-4-7 · confidence high
Out[23]:

saturn.columns["confidence"].stats

stat             value
n                4,981
nulls            0 (0.0%)
unique           1
min              1
max              1
mean             1
median           1
std              0
q1               1
q3               1
iqr              0
skew             0
kurtosis         0
n_outliers       0
outlier_rate     0
zero_rate        0
alert: constant  only one distinct value
Fig 10.
Distribution of confidence. Vertical dash marks the median.
Histogram bins for confidence (median: 1.0).
bin        count
1 – 1.025  4981
(The 40 bins span 0.5 to 1.5; the other 39 are empty.)
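
The drop treatment recommended for concept, source_dataset and confidence can be applied generically by finding every column with a single distinct value. This is a minimal sketch over a list of dicts; the function names and the two sample rows are illustrative, not part of saturn or the dataset.

```python
def constant_columns(records):
    """Return the names of columns that hold exactly one distinct value."""
    distinct = {}
    for row in records:
        for name, value in row.items():
            # repr() lets unhashable values such as lists be compared too
            distinct.setdefault(name, set()).add(repr(value))
    return sorted(name for name, seen in distinct.items() if len(seen) == 1)


def drop_columns(records, names):
    """Return copies of the records without the named columns."""
    excluded = set(names)
    return [{k: v for k, v in row.items() if k not in excluded} for row in records]


# Hypothetical rows shaped like the profiled columns.
rows = [
    {"cognate_id": "iecor:1", "concept": "", "source_dataset": "iecor",
     "confidence": 1.0, "language_count": 2},
    {"cognate_id": "iecor:2", "concept": "", "source_dataset": "iecor",
     "confidence": 1.0, "language_count": 5},
]
to_drop = constant_columns(rows)
cleaned = drop_columns(rows, to_drop)
print(to_drop)     # ['concept', 'confidence', 'source_dataset']
print(cleaned[0])  # {'cognate_id': 'iecor:1', 'language_count': 2}
```

Keys missing from some rows are ignored here, so a column that is absent in part of the data would need separate handling.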

word_count numeric feature

Counts of words per cognate set, ranging from 1 to 157 with a median of 2 and an IQR of 3. The distribution is severely right-skewed (skew 6.84, kurtosis 59.74), with 649 outliers (13.0% of rows) pulling the mean to 5.17 against a standard deviation of 12.13. Most sets contain only a few words, while a long tail of very large sets dominates the variance.

Treatment: Apply a log1p transform before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
Out[26]:

saturn.columns["word_count"].stats

stat              value
n                 4,981
nulls             0 (0.0%)
unique            93
min               1
max               157
mean              5.168
median            2
std               12.13
q1                1
q3                4
iqr               3
skew              6.837
kurtosis          59.74
n_outliers        649
outlier_rate      0.1303
zero_rate         0
alert: high_skew  skew=+6.84
alert: outliers   13.0% rows beyond 1.5 IQR
Fig 11.
Distribution of word_count. Vertical dash marks the median.
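
The outlier count can be reproduced from the reported quartiles, assuming the "beyond 1.5 IQR" alert uses the usual Tukey fences. With q1 = 1 and q3 = 4 the upper fence is 8.5, so for an integer column every value of 9 or more is an outlier; those are exactly the rows outside the first two word_count histogram bins (3,848 and 484 rows). The sketch also shows the suggested log1p transform on a few illustrative values.

```python
import math

# Quartiles from the word_count stats table.
Q1, Q3 = 1, 4
IQR = Q3 - Q1
LOWER_FENCE = Q1 - 1.5 * IQR  # -3.5, never reached because min is 1
UPPER_FENCE = Q3 + 1.5 * IQR  # 8.5

# Rows in the first two histogram bins (1 to 8.8) are all at or below 8.
TOTAL_ROWS = 4981
ROWS_UP_TO_8 = 3848 + 484
n_outliers = TOTAL_ROWS - ROWS_UP_TO_8
outlier_rate = n_outliers / TOTAL_ROWS


def log1p_transform(values):
    """Apply log(1 + x) to compress a heavy right tail."""
    return [math.log1p(v) for v in values]


print(UPPER_FENCE, n_outliers, round(outlier_rate, 4))  # 8.5 649 0.1303
# min, median, q3 and max of word_count after the transform
print([round(x, 2) for x in log1p_transform([1, 2, 4, 157])])  # [0.69, 1.1, 1.61, 5.06]
```

After the transform the range shrinks from 1 to 157 down to roughly 0.69 to 5.06, which keeps the ordering of sets while reducing the pull of the largest ones.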

language_count numeric feature

A count of languages per record, ranging from 1 to 157 with a median of just 2 and IQR of 3. The distribution is severely right-skewed (skew 6.84, kurtosis 59.77) with 649 outliers (13.0%), meaning a small number of records list dozens of languages while most list only a handful.

Treatment: Log-transform or cap before modelling to tame the heavy tail.

anthropic:claude-opus-4-7 · confidence high
Out[29]:

saturn.columns["language_count"].stats

stat              value
n                 4,981
nulls             0 (0.0%)
unique            94
min               1
max               157
mean              5.166
median            2
std               12.13
q1                1
q3                4
iqr               3
skew              6.838
kurtosis          59.77
n_outliers        649
outlier_rate      0.1303
zero_rate         0
alert: high_skew  skew=+6.84
alert: outliers   13.0% rows beyond 1.5 IQR
Fig 12.
Distribution of language_count. Vertical dash marks the median.
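
Capping is the alternative to a log transform mentioned in the treatment. A minimal sketch is below: it caps values at a nearest-rank percentile of the column itself. The percentile, the function name and the ten sample values are all illustrative choices, not recommendations derived from the data.

```python
def cap_at_percentile(values, pct=95):
    """Cap values at the nearest-rank pct-th percentile; return (capped, ceiling)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest-rank percentile using integer ceiling division.
    rank = max(1, -(-pct * len(ordered) // 100))
    ceiling = ordered[rank - 1]
    return [min(v, ceiling) for v in values], ceiling


# Hypothetical language_count values with one extreme set.
sample = [1, 1, 1, 2, 2, 2, 3, 4, 9, 157]
capped, ceiling = cap_at_percentile(sample, pct=90)
print(ceiling)  # 9
print(capped)   # [1, 1, 1, 2, 2, 2, 3, 4, 9, 9]
```

Capping keeps the original units, which can be easier to interpret than log values, but it discards the differences among the largest sets.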

sources unknown other

The column 'sources' was skipped by the profiler, so its kind is unknown and no descriptive statistics are available. We can only confirm it has 4981 rows with a null rate of 0.0 and no recorded unique count. Without further evidence, the content and structure cannot be characterised.

Treatment: Re-profile or manually inspect this column before deciding on downstream handling.

anthropic:claude-opus-4-7 · confidence low
Out[32]:

saturn.columns["sources"].stats

stat            value
n               4,981
nulls           0 (0.0%)
unique          (not computed)
alert: skipped  no profiler for kind=unknown

How to cite

BibTeX
@misc{saturn-processed-cognate-sets-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: processed cognate sets},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/processed-cognate_sets}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: processed cognate sets. Source: /home/coolhand/servers/diachronica/etymology_atlas/processed/cognate_sets.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/processed-cognate_sets