
processed lexibank references

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/processed/lexibank_references.json

Saturn profiled 11,359 rows across 9 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
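# Notebook shell escape: installs the saturn-dissect package.
!pip install saturn-dissect
import subprocess

# Invoke the saturn CLI on the source JSON with the flags used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/processed/lexibank_references.json",
    "--findings", "processed-lexibank_references.json",
    "--llm", "anthropic:claude-opus-4-7",
])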

Summary confidence: high

This dataset is a bibliographic reference list with 11,359 rows and 9 columns (key, author, citation, title, year, plus mostly-empty editor/publisher/journal/url fields). The most informative columns are author, citation, title, and year — the rest are either unique IDs or near-empty categoricals. Note that author has a 66% duplicate rate and 1,277 empty values, while citation and title both show heavy duplication (54% and 50%) driven by a handful of large source collections like Koelle's Polyglotta africana and the Africa Museum and Austronesian web archives. The year column spans 271 distinct values with reasonable spread (entropy ratio 0.75), though about 11% of rows have no year and another 574 are marked 'n.d.'. Author and title are also multilingual, with English dominant but meaningful German, French, Spanish, and Chinese subsets.

citing: row_count · column_count · columns.author.stats.duplicate_rate · columns.author.stats.n_empty · columns.author.language_counts · columns.author.top_values · columns.citation.stats.duplicate_rate · columns.citation.top_values · columns.title.stats.duplicate_rate · columns.title.top_values · columns.title.language_counts · columns.year.n_unique · columns.year.stats.entropy_ratio · columns.year.top_values · columns.editor.stats.top_rate · columns.publisher.stats.top_rate · columns.url.stats.top_rate

Out[4]:

saturn.schema() · 9 columns

column     kind         n       null%  unique  alerts
key        text         11,359  0.0%   11,359  near_unique, one_word, allcaps, short_text
author     text         11,359  0.0%   3,830   multilingual, duplicates
year       categorical  11,359  0.0%   271     -
title      text         11,359  0.0%   5,663   multilingual, duplicates
journal    categorical  11,359  0.0%   4       imbalance
publisher  categorical  11,359  0.0%   15      long_tail, imbalance
editor     categorical  11,359  0.0%   4       long_tail, imbalance
url        categorical  11,359  0.0%   1       imbalance
citation   text         11,359  0.0%   5,178   multilingual, duplicates
Fig 1.
year · Distribution of publication years; watch for the large empty/'n.d.' share and peaks around 1971, 1979, and 1992.
Top values for year (20 unique shown, of 271 total).
value           count  share
(empty string)  1,300  11.4%
n.d.            574    5.1%
1992            338    3.0%
2007            282    2.5%
1971            271    2.4%
1979            254    2.2%
1980            225    2.0%
2005            221    1.9%
2015            217    1.9%
1986            208    1.8%
1997            208    1.8%
2006            204    1.8%
2009            202    1.8%
2011            196    1.7%
1975            195    1.7%
1963 [1854]     193    1.7%
1981            188    1.7%
2016            188    1.7%
2004            185    1.6%
2000            185    1.6%
Fig 2.
author · Top authors are dominated by a few prolific contributors (Koelle, Tryon, Blench), reflecting the 66% duplicate rate.
Character-length distribution for author (mean: 20.43).
chars      count
0 – 11     2,178
11 – 23    6,163
23 – 34    1,293
34 – 45    1,034
45 – 57    188
57 – 68    206
68 – 79    126
79 – 91    30
91 – 102   75
102 – 113  31
113 – 125  7
125 – 136  4
136 – 147  2
147 – 159  5
159 – 170  2
170 – 181  2
181 – 193  1
193 – 204  0
204 – 215  1
215 – 226  0
226 – 238  0
238 – 249  0
249 – 260  0
260 – 272  0
272 – 283  0
283 – 294  3
294 – 306  0
306 – 317  6
317 – 328  1
328 – 340  0
340 – 351  0
351 – 362  0
362 – 374  0
374 – 385  0
385 – 396  0
396 – 408  0
408 – 419  0
419 – 430  0
430 – 442  0
442 – 453  1
Fig 3.
citation · Citation string lengths cluster around 70 characters — useful for spotting truncated or anomalously short entries.
Character-length distribution for citation (mean: 67.94).
chars      count
8 – 11     7
11 – 15    22
15 – 18    20
18 – 21    8
21 – 25    54
25 – 28    128
28 – 31    99
31 – 35    99
35 – 38    133
38 – 42    125
42 – 45    61
45 – 48    269
48 – 52    86
52 – 55    119
55 – 58    95
58 – 62    71
62 – 65    105
65 – 68    312
68 – 72    5,296
72 – 75    3,139
75 – 78    854
78 – 82    168
82 – 85    64
85 – 88    4
88 – 92    6
92 – 95    3
95 – 98    6
98 – 102   0
102 – 105  2
105 – 108  0
108 – 112  0
112 – 115  1
115 – 119  1
119 – 122  0
122 – 125  0
125 – 129  1
129 – 132  0
132 – 135  0
135 – 139  0
139 – 142  1
Fig 4.
title · Title lengths are highly skewed (median 104, max 1562); long outliers often contain embedded URLs.
Character-length distribution for title (mean: 120.62).
chars        count
0 – 39       1,194
39 – 78      2,283
78 – 117     3,425
117 – 156    1,892
156 – 195    1,094
195 – 234    646
234 – 273    344
273 – 312    178
312 – 351    81
351 – 390    56
390 – 430    39
430 – 469    28
469 – 508    23
508 – 547    24
547 – 586    13
586 – 625    14
625 – 664    5
664 – 703    3
703 – 742    3
742 – 781    2
781 – 820    6
820 – 859    1
859 – 898    0
898 – 937    2
937 – 976    1
976 – 1015   0
1015 – 1054  0
1054 – 1093  0
1093 – 1132  0
1132 – 1172  0
1172 – 1211  0
1211 – 1250  0
1250 – 1289  0
1289 – 1328  0
1328 – 1367  0
1367 – 1406  0
1406 – 1445  1
1445 – 1484  0
1484 – 1523  0
1523 – 1562  1
Fig 5.
title · Language mix of titles shows English dominant but with sizeable German, French, Spanish, and Chinese minorities.
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column     kind         null %
key        text         0.0%
author     text         0.0%
year       categorical  0.0%
title      text         0.0%
journal    categorical  0.0%
publisher  categorical  0.0%
editor     categorical  0.0%
url        categorical  0.0%
citation   text         0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 14,079 detected strings).
lang  count   share
en    10,335  73.4%
fr    919     6.5%
de    857     6.1%
zh    647     4.6%
es    527     3.7%
nl    175     1.2%
it    139     1.0%
pt    115     0.8%
ja    82      0.6%
ru    56      0.4%
id    33      0.2%
eo    30      0.2%
pl    28      0.2%
fi    17      0.1%
no    12      0.1%
ms    12      0.1%
hu    11      0.1%
uk    11      0.1%
sl    11      0.1%
da    9       0.1%
sv    8       0.1%
cs    8       0.1%
ca    8       0.1%
sk    5       0.0%
ceb   3       0.0%
hr    3       0.0%
eu    2       0.0%
vi    2       0.0%
et    2       0.0%
sr    2       0.0%
sq    2       0.0%
bs    2       0.0%
lt    1       0.0%
tl    1       0.0%
oc    1       0.0%
os    1       0.0%
pam   1       0.0%
war   1       0.0%

key text identifier

This column is almost certainly a primary identifier: every one of the 11,359 rows holds a unique, single-token value (n_unique=11359, one_word_rate=1.0, duplicate_rate=0.0). Values are short (len_mean 4.07, len_max 5) and 99.1% are uppercase, consistent with short alphanumeric codes rather than natural text. The top_words sample shows purely numeric tokens (47, 49, 50, ...), so the 'allcaps' signal may be a side effect of digit-only strings rather than true letters.

Treatment: Use as a row key or left-join key; drop from modelling features.
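
As a minimal sketch of the row-key use, assuming the JSON file has already been loaded into a list of dicts that each carry a `key` field, the index can be built with a uniqueness check. The `index_by_key` helper and the sample rows are hypothetical, not part of Saturn.

```python
from typing import Dict, Iterable, Mapping


def index_by_key(records: Iterable[Mapping[str, str]]) -> Dict[str, Mapping[str, str]]:
    """Build a key -> record index, failing loudly on a repeated key."""
    index: Dict[str, Mapping[str, str]] = {}
    for record in records:
        key = record["key"]
        if key in index:
            raise ValueError(f"duplicate key: {key!r}")
        index[key] = record
    return index


# Illustrative rows only; the profiled file has 11,359 of them.
sample = [
    {"key": "47", "year": "1992"},
    {"key": "49", "year": "n.d."},
]
by_key = index_by_key(sample)
```

Raising on a repeated key keeps a later left join from silently fanning out if the identifier ever stops being unique.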

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["key"].stats

stat                     value
n                        11,359
nulls                    0 (0.0%)
unique                   11,359
len_min                  1
len_max                  5
len_mean                 4.074
len_median               4
len_p95                  5
word_mean                1
word_median              1
n_empty                  0
n_duplicates             0
duplicate_rate           0
vocab_size               11,359
readability_flesch_mean  121.2
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             0.9913
boilerplate_rate         0
alert: near_unique       100.0% of rows are unique strings
alert: one_word          100.0% rows are a single word
alert: allcaps           99.1% rows are all-caps
alert: short_text        95th-percentile length under 20 chars
Fig 8.
Character-length distribution for key.
Character-length distribution for key (mean: 4.074); empty bins omitted.
chars  count
1      9
2      90
3      873
4      8,465
5      1,922

author text metadata

This is an author/contributor name field, mostly formatted 'Surname, Given Name', with frequent multi-author strings joined by 'and' (4049 occurrences). Duplication is severe: 66.3% of rows (n_duplicates 7,529) are repeats, with prolific contributors such as 'Koelle, Sigismund Wilhelm' (193) and 'Tryon, Darrell T.' (191) dominating. A further 1,277 rows are empty strings rather than nulls, so the 0.0 null rate understates missingness. Names span about 30 detected languages (the multilingual alert reports 31), predominantly English (3125) and German (384) but also 113 rows of CJK script, so naive string matching will fragment identities.

Treatment: Normalize to a canonical name form and split multi-author strings on ' and ' before any join or grouping.
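
The splitting step in this treatment can be sketched as follows. This is an illustration under stated assumptions, not Saturn's own logic: `split_authors` is a hypothetical helper, Unicode NFC plus whitespace collapsing stands in for "canonical name form", and the separator is assumed to be the word 'and' surrounded by whitespace.

```python
import re
import unicodedata
from typing import List


def split_authors(raw: str) -> List[str]:
    """Split a multi-author string on ' and ' and tidy each name."""
    # NFC makes visually identical names compare equal byte for byte.
    text = unicodedata.normalize("NFC", raw).strip()
    if not text:
        return []  # empty author strings yield no names
    parts = re.split(r"\s+and\s+", text)
    # Collapse internal runs of whitespace inside each name.
    return [" ".join(part.split()) for part in parts if part.strip()]


# Illustrative input: two names taken from the report, joined for the example.
names = split_authors("Tryon, Darrell T.  and Koelle, Sigismund Wilhelm")
```

A plain split on 'and' would also break any corporate author whose name contains the word, so a real pipeline would likely need an exception list; real canonicalization (initials versus full given names, transliteration) goes well beyond this sketch.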

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["author"].stats

stat                     value
n                        11,359
nulls                    0 (0.0%)
unique                   3,830
len_min                  0
len_max                  453
len_mean                 20.43
len_median               17
len_p95                  50
word_mean                3.382
word_median              3
n_empty                  1,277
n_duplicates             7,529
duplicate_rate           0.6628
vocab_size               6,656
readability_flesch_mean  53.21
emoji_rate               0
url_rate                 0
one_word_rate            0.1836
allcaps_rate             0.0589
boilerplate_rate         0
alert: multilingual      31 languages detected in sample
alert: duplicates        66.3% duplicate strings
Fig 9.
Character-length distribution for author.

year categorical feature

This is a year field stored as strings rather than integers, with 271 distinct values across 11,359 rows. The most common entry is an empty string (1,300 rows, 11.4%), followed by the literal "n.d." (574 rows, 5.1%), so roughly 16.5% of records carry no usable year despite a 0.0 null rate. Among the top 20 values, plain years run from 1971 to 2016, with 1992 the most frequent real year at 338 occurrences. One frequent value, "1963 [1854]" (193 rows), is a bracketed form rather than a plain four-digit year.

Treatment: Coerce to integer before any temporal analysis, mapping empty strings and "n.d." to missing and handling bracketed forms such as "1963 [1854]" explicitly.
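
A minimal sketch of that coercion, assuming the first standalone four-digit number in the string is the year to keep. The `parse_year` helper is hypothetical, and whether "1963 [1854]" should resolve to 1963 or to 1854 is a modelling decision the report does not settle.

```python
import re
from typing import Optional

# A run of exactly four digits, not embedded in a longer number.
_YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")


def parse_year(raw: str) -> Optional[int]:
    """Return the first four-digit year found in raw, or None if there is none."""
    match = _YEAR.search(raw)
    return int(match.group()) if match else None


# Values taken from the top-values table for this column.
parsed = {raw: parse_year(raw) for raw in ["", "n.d.", "1992", "1963 [1854]"]}
```

Empty strings and "n.d." contain no four-digit run, so they fall through to `None` without being listed as special cases.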

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["year"].stats

stat           value
n              11,359
nulls          0 (0.0%)
unique         271
top_value      (empty string)
top_rate       0.1144
cardinality    271
entropy        6.1
entropy_ratio  0.7548
Fig 10.
Top values for year.

title text metadata

This column holds bibliographic reference strings (book and article titles, URLs, and access notes for linguistic sources), dominated by English (3620) but mixing 24 other detected languages, including French (382), Chinese (300), German (224), and Spanish (207); the multilingual alert reports 26 languages in total. Half the values are duplicates (duplicate_rate 0.50, 5696 repeats across 11359 rows), with a single africamuseum.be URL appearing 355 times and 17.4% of entries containing URLs. Length varies widely (median 104 chars, max 1562), and the top words ('of', 'the', 'languages', 'linguistics', '(accessed') confirm that these are reference entries rather than free-form titles.

Treatment: Normalize and deduplicate citations into a source-reference lookup table rather than treating as a modelling feature.

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["title"].stats

stat                     value
n                        11,359
nulls                    0 (0.0%)
unique                   5,663
len_min                  0
len_max                  1,562
len_mean                 120.6
len_median               104
len_p95                  261
word_mean                14.66
word_median              12
n_empty                  7
n_duplicates             5,696
duplicate_rate           0.5015
vocab_size               21,846
readability_flesch_mean  8.003
emoji_rate               0
url_rate                 0.1741
one_word_rate            0.07008
allcaps_rate             0.06154
boilerplate_rate         0
alert: multilingual      26 languages detected in sample
alert: duplicates        50.1% duplicate strings
Fig 11.
Character-length distribution for title.

journal categorical metadata

This appears to be a journal-name field for bibliographic records, but it is effectively empty: 11,355 of 11,359 rows (top_rate 0.9996) hold an empty string, leaving only 4 populated rows across 3 distinct German linguistics journals. Entropy is 0.005 (entropy_ratio 0.0025), so the column carries virtually no information. The null rate is 0.0 only because the blanks are stored as empty strings rather than nulls.

Treatment: Drop; near-constant empty string with only 4 populated rows.
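
The top_rate, entropy, and entropy_ratio figures for this column can be reproduced from the four value counts. The sketch below assumes Shannon entropy in bits, with the ratio defined as entropy divided by log2 of the cardinality. That definition is an inference from the reported numbers, not something the report states, and `entropy_stats` is a hypothetical helper.

```python
import math
from typing import Sequence, Tuple


def entropy_stats(counts: Sequence[int]) -> Tuple[float, float, float]:
    """Return (top_rate, entropy in bits, entropy / log2(cardinality))."""
    total = sum(counts)
    probs = [count / total for count in counts if count > 0]
    top_rate = max(probs)
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(probs))  # entropy of a uniform distribution
    ratio = entropy / max_entropy if max_entropy > 0 else 0.0
    return top_rate, entropy, ratio


# Counts from the journal top-values table: the empty string, then three titles.
top_rate, entropy, ratio = entropy_stats([11355, 2, 1, 1])
```

With these counts the results agree with the reported 0.9996, 0.005076, and 0.002538 to the precision shown, which supports the assumed definition without confirming it.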

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["journal"].stats

stat              value
n                 11,359
nulls             0 (0.0%)
unique            4
top_value         (empty string)
top_rate          0.9996
cardinality       4
entropy           0.005076
entropy_ratio     0.002538
alert: imbalance  top value is 100.0% of rows
Fig 12.
Top values for journal.
Top values for journal (4 unique shown, of 4 total).
value                                          count   share
(empty string)                                 11,355  100.0%
Münchener Studien zur Sprachwissenschaft       2       0.0%
Zeitschrift für vergleichende Sprachforschung  1       0.0%
Historische Sprachforschung                    1       0.0%

publisher categorical metadata

Publisher name field, but it is effectively empty: 11,344 of 11,359 rows (top_rate 0.9987) carry an empty string. The 15 distinct values include that empty string, so only 15 rows are populated, spread over 14 publisher names, and the entropy_ratio is 0.005. The handful of populated entries (Brill with 2, then Winter, Reichert, Rodopi, Harrassowitz and others with 1 each) hint at academic/humanities publishers but are too sparse to be useful. Note that null_rate is 0.0 because the blanks are stored as empty strings rather than true nulls.

Treatment: Drop; the column is ~99.9% empty strings and carries almost no signal.
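
Because blanks are stored as empty strings, a plain null check reports 0.0% for every column. A rough sketch of a check that counts blanks as missing, assuming the records are loaded as a list of dicts (the `empty_rates` helper and the sample rows are hypothetical):

```python
from typing import Dict, Iterable, Mapping, Optional


def empty_rates(records: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, float]:
    """Share of rows per column whose value is None, absent, or blank."""
    rows = list(records)
    columns = sorted({name for row in rows for name in row})
    rates: Dict[str, float] = {}
    for name in columns:
        # Treat None, a missing field, and whitespace-only strings alike.
        missing = sum(1 for row in rows if not (row.get(name) or "").strip())
        rates[name] = missing / len(rows)
    return rates


# Illustrative rows only.
sample = [
    {"publisher": "Brill", "url": ""},
    {"publisher": "", "url": ""},
]
rates = empty_rates(sample)
```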

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["publisher"].stats

stat              value
n                 11,359
nulls             0 (0.0%)
unique            15
top_value         (empty string)
top_rate          0.9987
cardinality       15
entropy           0.01952
entropy_ratio     0.004996
alert: long_tail  13 singleton categories
alert: imbalance  top value is 99.9% of rows
Fig 13.
Top values for publisher.
Top values for publisher (15 unique shown, of 15 total).
value                                                        count   share
(empty string)                                               11,344  99.9%
Brill                                                        2       0.0%
Winter                                                       1       0.0%
Reichert                                                     1       0.0%
Vostočnaja Literatura                                        1       0.0%
Rodopi                                                       1       0.0%
Harrassowitz                                                 1       0.0%
K. J. Trübner                                                1       0.0%
Belaruskaâ navuka                                            1       0.0%
Fitzroy Dearborn Publishers                                  1       0.0%
Karl J. Trübner                                              1       0.0%
Institut für Sprachwissenschaft der Universität Innsbruck    1       0.0%
Vandenhoeck & Ruprecht                                       1       0.0%
Walter de Gruyter & Co.                                      1       0.0%
Nova Fronteira                                               1       0.0%

editor categorical metadata

This appears to be an editor name field for bibliographic records, but it is effectively empty: 11356 of 11359 rows (top_rate 0.9997) hold the empty string, with only three distinct named editors each appearing once. Entropy is essentially zero (0.0039) and cardinality is just 4, so the column carries almost no information despite a 0.0 null rate (blanks are encoded as ''). The long_tail and imbalance alerts simply reflect that three singleton names sit beside one dominant blank.

Treatment: Drop; near-constant blank with only three populated rows.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["editor"].stats

stat              value
n                 11,359
nulls             0 (0.0%)
unique            4
top_value         (empty string)
top_rate          0.9997
cardinality       4
entropy           0.003939
entropy_ratio     0.001969
alert: long_tail  3 singleton categories
alert: imbalance  top value is 100.0% of rows
Fig 14.
Top values for editor.
Top values for editor (4 unique shown, of 4 total).
value                       count   share
(empty string)              11,356  100.0%
Tischler, Johann            1       0.0%
Martynaǔ, V. and G., Cyhun  1       0.0%
Mallory, James P.           1       0.0%

url categorical metadata

This column is labelled 'url' but contains a single value—an empty string—across all 11,359 rows. Cardinality is 1, entropy is 0, and the top_rate is 1.0, so it carries no information whatsoever. Likely a placeholder field that was never populated during ingestion.

Treatment: Drop; the column is constant and has zero predictive value.
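
Dropping a constant column like this one can be sketched in plain Python, again assuming list-of-dict records. Both helpers below are hypothetical and only handle hashable values such as strings.

```python
from typing import Dict, Iterable, List, Mapping


def constant_columns(records: Iterable[Mapping[str, str]]) -> List[str]:
    """Names of columns that hold a single distinct value across all rows."""
    rows = list(records)
    columns = sorted({name for row in rows for name in row})
    return [name for name in columns if len({row.get(name) for row in rows}) <= 1]


def drop_columns(records: Iterable[Mapping[str, str]], names: List[str]) -> List[Dict[str, str]]:
    """Copy each row without the named columns."""
    skip = set(names)
    return [{k: v for k, v in row.items() if k not in skip} for row in records]


# Illustrative rows only: url is always blank, journal is not.
sample = [
    {"journal": "", "url": ""},
    {"journal": "Historische Sprachforschung", "url": ""},
]
to_drop = constant_columns(sample)
cleaned = drop_columns(sample, to_drop)
```

This catches only exactly constant columns such as url; near-constant ones (journal, publisher, editor) would need a top_rate threshold instead.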

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["url"].stats

stat              value
n                 11,359
nulls             0 (0.0%)
unique            1
top_value         (empty string)
top_rate          1
cardinality       1
entropy           0
entropy_ratio     0
alert: imbalance  top value is 100.0% of rows
Fig 15.
Top values for url.
Top values for url (1 unique shown, of 1 total).
value           count   share
(empty string)  11,359  100.0%

citation text metadata

Bibliographic citation strings for linguistic sources, mostly English (3590) with substantial French (401), Chinese (335), and German (249) entries. Heavy duplication is the headline: 6181 duplicates (54.4%) across only 5178 unique values, with the top value 'Anon. n.d.. http://www.africamuseum.be/...' repeating 462 times. URLs appear in 13.3% of rows and Flesch readability is low (29.3), consistent with reference-style text rather than prose.

Treatment: Normalize and deduplicate to a citation lookup table, then reference by key.
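
One way to read this treatment is a small lookup table plus a per-row reference. The sketch below is an assumption-laden illustration: `build_citation_lookup` is a hypothetical helper, normalization is limited to collapsing whitespace, and the sample strings are made up.

```python
from typing import Dict, List, Tuple


def build_citation_lookup(citations: List[str]) -> Tuple[Dict[int, str], List[int]]:
    """Deduplicate citations; return (id -> citation, citation id per row)."""
    ids_by_text: Dict[str, int] = {}
    row_ids: List[int] = []
    for raw in citations:
        normalized = " ".join(raw.split())  # collapse whitespace only
        if normalized not in ids_by_text:
            ids_by_text[normalized] = len(ids_by_text)  # next free id
        row_ids.append(ids_by_text[normalized])
    lookup = {cid: text for text, cid in ids_by_text.items()}
    return lookup, row_ids


# Made-up citation strings, for illustration only.
lookup, row_ids = build_citation_lookup(["Smith 1992.", "Smith  1992.", "Lee n.d."])
```

Ids are assigned in order of first appearance, so the same input always yields the same table; stronger normalization (case, punctuation, URL canonicalization) would merge more near-duplicates at the risk of false matches.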

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["citation"].stats

stat                     value
n                        11,359
nulls                    0 (0.0%)
unique                   5,178
len_min                  8
len_max                  142
len_mean                 67.94
len_median               71
len_p95                  77
word_mean                8.827
word_median              10
n_empty                  0
n_duplicates             6,181
duplicate_rate           0.5442
vocab_size               15,535
readability_flesch_mean  29.26
emoji_rate               0
url_rate                 0.1334
one_word_rate            0
allcaps_rate             0.06004
boilerplate_rate         0
alert: multilingual      31 languages detected in sample
alert: duplicates        54.4% duplicate strings
Fig 16.
Character-length distribution for citation.

How to cite

BibTeX
@misc{saturn-processed-lexibank-references-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: processed lexibank references},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/processed-lexibank_references}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: processed lexibank references. Source: /home/coolhand/servers/diachronica/etymology_atlas/processed/lexibank_references.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/processed-lexibank_references