processed-lexibank_references

Overview

Source: /home/coolhand/servers/diachronica/etymology_atlas/processed/lexibank_references.json

Saturn profiled 11,359 rows across 9 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

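# Install the saturn-dissect package (IPython shell escape; run inside a notebook cell)
!pip install saturn-dissect
import subprocess

# Profile the source JSON through the saturn CLI, with the arguments used for this report
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/etymology_atlas/processed/lexibank_references.json",
    "--findings", "processed-lexibank_references.json",
    "--llm", "anthropic:claude-opus-4-7",
])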

Summary confidence: high

This dataset is a bibliographic reference list with 11,359 rows and 9 columns: key, author, citation, title, and year, plus four empty or near-empty fields (editor, publisher, journal, url). The most informative columns are author, citation, title, and year; key is a unique ID and the remaining four are near-constant categoricals. Author has a 66% duplicate rate and 1,277 empty values, while citation and title are also heavily duplicated (54% and 50%), driven by a handful of large source collections such as Koelle's Polyglotta africana and the Africa Museum and Austronesian web archives. The year column has 271 distinct values with reasonable spread (entropy ratio 0.75), though 11.4% of rows have an empty year and another 574 (5.1%) are marked 'n.d.'. Author and title are also multilingual: English dominates, with meaningful German, French, Spanish, and Chinese subsets.

citing: row_count · column_count · columns.author.stats.duplicate_rate · columns.author.stats.n_empty · columns.author.language_counts · columns.author.top_values · columns.citation.stats.duplicate_rate · columns.citation.top_values · columns.title.stats.duplicate_rate · columns.title.top_values · columns.title.language_counts · columns.year.n_unique · columns.year.stats.entropy_ratio · columns.year.top_values · columns.editor.stats.top_rate · columns.publisher.stats.top_rate · columns.url.stats.top_rate

Out[4]:

saturn.schema() · 9 columns

column	kind	n	null%	unique	alerts
key	text	11,359	0.0%	11,359	near_unique one_word allcaps short_text
author	text	11,359	0.0%	3,830	multilingual duplicates
year	categorical	11,359	0.0%	271
title	text	11,359	0.0%	5,663	multilingual duplicates
journal	categorical	11,359	0.0%	4	imbalance
publisher	categorical	11,359	0.0%	15	long_tail imbalance
editor	categorical	11,359	0.0%	4	long_tail imbalance
url	categorical	11,359	0.0%	1	imbalance
citation	text	11,359	0.0%	5,178	multilingual duplicates

Fig 1.

year · Distribution of publication years; watch for the large empty/'n.d.' share (16.5% combined) and the peaks at 1992, 2007, 1971, and 1979.

Top values for year (20 unique shown, of 271 total).
value	count	share
	1300	11.4%
n.d.	574	5.1%
1992	338	3.0%
2007	282	2.5%
1971	271	2.4%
1979	254	2.2%
1980	225	2.0%
2005	221	1.9%
2015	217	1.9%
1986	208	1.8%
1997	208	1.8%
2006	204	1.8%
2009	202	1.8%
2011	196	1.7%
1975	195	1.7%
1963 [1854]	193	1.7%
1981	188	1.7%
2016	188	1.7%
2004	185	1.6%
2000	185	1.6%

Fig 2.

author · Top authors are dominated by a few prolific contributors (Koelle, Tryon, Blench), reflecting the 66% duplicate rate.

Fig 3.

citation · Citation string lengths cluster around 70 characters — useful for spotting truncated or anomalously short entries.

Character-length distribution for citation (mean: 67.93811074918567).
chars	count
8 – 11	7
11 – 15	22
15 – 18	20
18 – 21	8
21 – 25	54
25 – 28	128
28 – 31	99
31 – 35	99
35 – 38	133
38 – 42	125
42 – 45	61
45 – 48	269
48 – 52	86
52 – 55	119
55 – 58	95
58 – 62	71
62 – 65	105
65 – 68	312
68 – 72	5296
72 – 75	3139
75 – 78	854
78 – 82	168
82 – 85	64
85 – 88	4
88 – 92	6
92 – 95	3
95 – 98	6
98 – 102	0
102 – 105	2
105 – 108	0
108 – 112	0
112 – 115	1
115 – 119	1
119 – 122	0
122 – 125	0
125 – 129	1
129 – 132	0
132 – 135	0
135 – 139	0
139 – 142	1

Fig 4.

title · Title lengths are highly skewed (median 104, max 1562); long outliers often contain embedded URLs.

Character-length distribution for title (mean: 120.61598732282772).
chars	count
0 – 39	1194
39 – 78	2283
78 – 117	3425
117 – 156	1892
156 – 195	1094
195 – 234	646
234 – 273	344
273 – 312	178
312 – 351	81
351 – 390	56
390 – 430	39
430 – 469	28
469 – 508	23
508 – 547	24
547 – 586	13
586 – 625	14
625 – 664	5
664 – 703	3
703 – 742	3
742 – 781	2
781 – 820	6
820 – 859	1
859 – 898	0
898 – 937	2
937 – 976	1
976 – 1015	0
1015 – 1054	0
1054 – 1093	0
1093 – 1132	0
1132 – 1172	0
1172 – 1211	0
1211 – 1250	0
1250 – 1289	0
1289 – 1328	0
1328 – 1367	0
1367 – 1406	0
1406 – 1445	1
1445 – 1484	0
1484 – 1523	0
1523 – 1562	1

Fig 5.

title · Language mix of titles shows English dominant but with sizeable German, French, Spanish, and Chinese minorities.

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
key	text	0.0%
author	text	0.0%
year	categorical	0.0%
title	text	0.0%
journal	categorical	0.0%
publisher	categorical	0.0%
editor	categorical	0.0%
url	categorical	0.0%
citation	text	0.0%
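
The 0.0% null rates above sit alongside large counts of empty strings (for example 1,277 empty authors and 1,300 empty years), so an effective missing rate is more informative than the null rate alone. The following is a minimal sketch of one way to compute it; the function name, the `missing_tokens` parameter and the sample list are illustrative assumptions, not part of Saturn.

```python
def effective_missing_rate(values, missing_tokens=("",)):
    """Share of values that are None or, after stripping, one of missing_tokens."""
    if not values:
        return 0.0
    missing = sum(
        1
        for v in values
        if v is None or (isinstance(v, str) and v.strip() in missing_tokens)
    )
    return missing / len(values)


# Made-up sample shaped like the year column (not real rows from the dataset)
sample_year = ["1992", "", "n.d.", "2007", ""]

rate_blank = effective_missing_rate(sample_year)
rate_blank_or_nd = effective_missing_rate(sample_year, missing_tokens=("", "n.d."))
print(rate_blank, rate_blank_or_nd)
```

Treating "n.d." as missing is a choice that only makes sense for the year column; other columns would need their own token list.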

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 14,079 detected strings).
lang	count	share
en	10335	73.4%
fr	919	6.5%
de	857	6.1%
zh	647	4.6%
es	527	3.7%
nl	175	1.2%
it	139	1.0%
pt	115	0.8%
ja	82	0.6%
ru	56	0.4%
id	33	0.2%
eo	30	0.2%
pl	28	0.2%
fi	17	0.1%
no	12	0.1%
ms	12	0.1%
hu	11	0.1%
uk	11	0.1%
sl	11	0.1%
da	9	0.1%
sv	8	0.1%
cs	8	0.1%
ca	8	0.1%
sk	5	0.0%
ceb	3	0.0%
hr	3	0.0%
eu	2	0.0%
vi	2	0.0%
et	2	0.0%
sr	2	0.0%
sq	2	0.0%
bs	2	0.0%
lt	1	0.0%
tl	1	0.0%
oc	1	0.0%
os	1	0.0%
pam	1	0.0%
war	1	0.0%

key text identifier

This column is almost certainly a primary identifier: every one of the 11,359 rows holds a unique, single-token value (n_unique=11359, one_word_rate=1.0, duplicate_rate=0.0). Values are short (len_mean 4.07, len_max 5) and 99.1% are uppercase, consistent with short alphanumeric codes rather than natural text. The top_words sample shows purely numeric tokens (47, 49, 50, ...), so the 'allcaps' signal may be a side effect of digit-only strings rather than true letters.

Treatment: Use as a row key or left-join key; drop from modelling features.

anthropic:claude-opus-4-7 · confidence high
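
The suspicion that the 'allcaps' alert is a side effect of digit-only keys can be illustrated with two plausible definitions of "all caps". How Saturn actually computes allcaps_rate is not stated in this report, so both helper functions below are hypothetical.

```python
def looks_allcaps_naive(s):
    # True whenever uppercasing changes nothing, which includes digit-only strings
    return s == s.upper()


def looks_allcaps_strict(s):
    # str.isupper() is False unless the string has at least one cased character
    return s.isupper()


samples = ["47", "ABC", "abc"]
naive = [looks_allcaps_naive(s) for s in samples]
strict = [looks_allcaps_strict(s) for s in samples]
print(naive)   # [True, True, False]
print(strict)  # [False, True, False]
```

Under the naive rule a purely numeric key such as "47" counts as all-caps, which would be consistent with a 99.1% rate on a column of numeric codes.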

Out[13]:

saturn.columns["key"].stats

stat	value
n	11,359
nulls	0 (0.0%)
unique	11,359
len_min	1
len_max	5
len_mean	4.074
len_median	4
len_p95	5
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	11,359
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.9913
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	99.1% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for key.

Character-length distribution for key (mean: 4.074126243507351).
chars	count
1	9
2	90
3	873
4	8465
5	1922

author text metadata

This is an author/contributor name field, mostly formatted 'Surname, Given Name', with frequent multi-author strings joined by 'and' (4049 occurrences). Duplication is severe: 66.3% of rows are duplicate strings, with prolific contributors like 'Koelle, Sigismund Wilhelm' (193) and 'Tryon, Darrell T.' (191) dominating, and 1277 rows are empty strings despite a 0.0 null rate. Names span about 30 detected languages (the multilingual alert reports 31 in the sample), predominantly English (3125) and German (384) but also 113 rows of CJK script, so naive string matching will fragment identities.

Treatment: Normalize to a canonical name form and split multi-author strings on ' and ' before any join or grouping.

anthropic:claude-opus-4-7 · confidence high
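
The treatment above (split on ' and ', then normalize) could be sketched as follows. This is a minimal illustration using only the standard library; the function names and the choice of NFKC plus casefolding as the "canonical form" are assumptions, and real name matching would likely need more (initials, transliteration, diacritics).

```python
import re
import unicodedata


def split_authors(raw):
    """Split a multi-author string on the word 'and' and drop empty pieces."""
    parts = re.split(r"\s+and\s+", raw.strip())
    return [p.strip() for p in parts if p.strip()]


def normalize_name(name):
    """NFKC-normalize, collapse whitespace, and casefold to get a matching key."""
    name = unicodedata.normalize("NFKC", name)
    name = re.sub(r"\s+", " ", name).strip()
    return name.casefold()


# Two author names from this report, joined the way multi-author rows are described
authors = split_authors("Koelle, Sigismund Wilhelm and Tryon, Darrell T.")
name_keys = [normalize_name(a) for a in authors]
print(authors)
print(name_keys)
```

An empty author string yields an empty list, so the 1277 blank rows simply contribute no author entries.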

Out[16]:

saturn.columns["author"].stats

stat	value
n	11,359
nulls	0 (0.0%)
unique	3,830
len_min	0
len_max	453
len_mean	20.43
len_median	17
len_p95	50
word_mean	3.382
word_median	3
n_empty	1,277
n_duplicates	7,529
duplicate_rate	0.6628
vocab_size	6,656
readability_flesch_mean	53.21
emoji_rate	0
url_rate	0
one_word_rate	0.1836
allcaps_rate	0.0589
boilerplate_rate	0
alert: multilingual	31 languages detected in sample
alert: duplicates	66.3% duplicate strings

Fig 9.

Character-length distribution for author.

Character-length distribution for author (mean: 20.427150277313142).
chars	count
0 – 11	2178
11 – 23	6163
23 – 34	1293
34 – 45	1034
45 – 57	188
57 – 68	206
68 – 79	126
79 – 91	30
91 – 102	75
102 – 113	31
113 – 125	7
125 – 136	4
136 – 147	2
147 – 159	5
159 – 170	2
170 – 181	2
181 – 193	1
193 – 204	0
204 – 215	1
215 – 226	0
226 – 238	0
238 – 249	0
249 – 260	0
260 – 272	0
272 – 283	0
283 – 294	3
294 – 306	0
306 – 317	6
317 – 328	1
328 – 340	0
340 – 351	0
351 – 362	0
362 – 374	0
374 – 385	0
385 – 396	0
396 – 408	0
408 – 419	0
419 – 430	0
430 – 442	0
442 – 453	1

year categorical feature

This is a year field stored as strings rather than integers, with 271 distinct values across 11,359 rows. The most common entry is an empty string (1,300 rows, 11.4%) followed by the literal "n.d." (574 rows, 5.1%), so roughly 16.5% of records carry no usable year despite a 0.0 null rate. Among the top 20 values, real years range from 1963 (stored as the composite string "1963 [1854]", 193 rows) to 2016, with 1992 the most frequent real year at 338 occurrences.

Treatment: Map empty strings and "n.d." to missing, extract a four-digit year from composite values such as "1963 [1854]", then coerce to integer before any temporal analysis.

anthropic:claude-opus-4-7 · confidence high
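
A minimal sketch of the year coercion, assuming the only missing markers are the empty string and "n.d." and that the first four-digit run is the year to keep. For "1963 [1854]" that means keeping 1963; the bracketed value is presumably the original publication year, and keeping it instead would be an equally defensible choice. The names `parse_year` and `MISSING_YEAR_TOKENS` are illustrative.

```python
import re

MISSING_YEAR_TOKENS = {"", "n.d."}
YEAR_RE = re.compile(r"\d{4}")


def parse_year(raw):
    """Return the first four-digit year in raw as an int, or None if missing."""
    text = (raw or "").strip()
    if text in MISSING_YEAR_TOKENS:
        return None
    match = YEAR_RE.search(text)
    return int(match.group()) if match else None


# Values taken from the top-values table for year
parsed = [parse_year(v) for v in ["1992", "", "n.d.", "1963 [1854]"]]
print(parsed)  # [1992, None, None, 1963]
```

Strings with no four-digit run also map to None, so it is worth counting how many rows fall into that bucket before trusting any temporal analysis.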

Out[19]:

saturn.columns["year"].stats

stat	value
n	11,359
nulls	0 (0.0%)
unique	271
top_value
top_rate	0.1144
cardinality	271
entropy	6.1
entropy_ratio	0.7548

Fig 10.

Top values for year.

Top values for year (20 unique shown, of 271 total).
value	count	share
	1300	11.4%
n.d.	574	5.1%
1992	338	3.0%
2007	282	2.5%
1971	271	2.4%
1979	254	2.2%
1980	225	2.0%
2005	221	1.9%
2015	217	1.9%
1986	208	1.8%
1997	208	1.8%
2006	204	1.8%
2009	202	1.8%
2011	196	1.7%
1975	195	1.7%
1963 [1854]	193	1.7%
1981	188	1.7%
2016	188	1.7%
2004	185	1.6%
2000	185	1.6%

title text metadata

This column holds bibliographic citation strings — book/article titles, URLs, and access notes for linguistic sources, dominated by English (3620) but mixing 24 other languages including French (382), Chinese (300), German (224), and Spanish (207). Half the values are duplicates (duplicate_rate 0.50, 5696 repeats across 11359 rows), with a single africamuseum.be URL appearing 355 times and 17.4% of entries containing URLs. Length varies wildly (median 104 chars, max 1562) and top words ('of','the','languages','linguistics','(accessed') confirm these are reference citations rather than free-form titles.

Treatment: Normalize and deduplicate citations into a source-reference lookup table rather than treating as a modelling feature.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["title"].stats

stat	value
n	11,359
nulls	0 (0.0%)
unique	5,663
len_min	0
len_max	1,562
len_mean	120.6
len_median	104
len_p95	261
word_mean	14.66
word_median	12
n_empty	7
n_duplicates	5,696
duplicate_rate	0.5015
vocab_size	21,846
readability_flesch_mean	8.003
emoji_rate	0
url_rate	0.1741
one_word_rate	0.07008
allcaps_rate	0.06154
boilerplate_rate	0
alert: multilingual	26 languages detected in sample
alert: duplicates	50.1% duplicate strings

Fig 11.

Character-length distribution for title.

Character-length distribution for title (mean: 120.61598732282772).
chars	count
0 – 39	1194
39 – 78	2283
78 – 117	3425
117 – 156	1892
156 – 195	1094
195 – 234	646
234 – 273	344
273 – 312	178
312 – 351	81
351 – 390	56
390 – 430	39
430 – 469	28
469 – 508	23
508 – 547	24
547 – 586	13
586 – 625	14
625 – 664	5
664 – 703	3
703 – 742	3
742 – 781	2
781 – 820	6
820 – 859	1
859 – 898	0
898 – 937	2
937 – 976	1
976 – 1015	0
1015 – 1054	0
1054 – 1093	0
1093 – 1132	0
1132 – 1172	0
1172 – 1211	0
1211 – 1250	0
1250 – 1289	0
1289 – 1328	0
1328 – 1367	0
1367 – 1406	0
1406 – 1445	1
1445 – 1484	0
1484 – 1523	0
1523 – 1562	1

journal categorical metadata

This appears to be a journal-name field for bibliographic records, but it is effectively empty: 11,355 of 11,359 rows (top_rate 0.9996) hold an empty string, leaving only 4 populated rows spread across 3 distinct German linguistics journals. Entropy is 0.005 (entropy_ratio 0.0025), so the column carries virtually no information. The 0.0 null rate is misleading because the blanks are stored as empty strings rather than nulls.

Treatment: Drop; near-constant empty string with only 4 populated rows.

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["journal"].stats

stat	value
n	11,359
nulls	0 (0.0%)
unique	4
top_value
top_rate	0.9996
cardinality	4
entropy	0.005076
entropy_ratio	0.002538
alert: imbalance	top value is 100.0% of rows

Fig 12.

Top values for journal.

Top values for journal (4 unique shown, of 4 total).
value	count	share
	11355	100.0%
Münchener Studien zur Sprachwissenschaft	2	0.0%
Zeitschrift für vergleichende Sprachforschung	1	0.0%
Historische Sprachforschung	1	0.0%
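
The entropy and entropy_ratio reported for journal are consistent with Shannon entropy in bits divided by log2 of the cardinality. That reading is inferred from the numbers in this report, not from Saturn's documentation; the sketch below recomputes both values from the four counts in the table above.

```python
import math


def entropy_stats(counts):
    """Return (entropy in bits, entropy / log2(number of non-zero categories))."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    k = len(probs)
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio


# Counts from the journal top-values table: one blank value plus three journal names
journal_counts = [11355, 2, 1, 1]
entropy, ratio = entropy_stats(journal_counts)
print(round(entropy, 6), round(ratio, 6))
```

The same relation holds for the year column, where 6.1 / log2(271) is roughly 0.755. A single-category column such as url gets a ratio of 0 by convention here, since log2(1) is zero.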

publisher categorical metadata

Publisher name field, but it is effectively empty: 11,344 of 11,359 rows (top_rate 0.9987) carry an empty string, leaving 15 populated rows across 14 distinct publisher strings, with an entropy_ratio of 0.005. The handful of populated entries (Brill with 2, then Winter, Reichert, Rodopi, Harrassowitz and others with 1 each) hint at academic/humanities publishers but are too sparse to be useful; 'K. J. Trübner' and 'Karl J. Trübner' also look like spelling variants of one publisher. Note that null_rate is 0.0 because the blanks are stored as empty strings rather than true nulls.

Treatment: Drop; the column is ~99.9% empty strings and carries almost no signal.

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["publisher"].stats

stat	value
n	11,359
nulls	0 (0.0%)
unique	15
top_value
top_rate	0.9987
cardinality	15
entropy	0.01952
entropy_ratio	0.004996
alert: long_tail	13 singleton categories
alert: imbalance	top value is 99.9% of rows

Fig 13.

Top values for publisher.

Top values for publisher (15 unique shown, of 15 total).
value	count	share
	11344	99.9%
Brill	2	0.0%
Winter	1	0.0%
Reichert	1	0.0%
Vostočnaja Literatura	1	0.0%
Rodopi	1	0.0%
Harrassowitz	1	0.0%
K. J. Trübner	1	0.0%
Belaruskaâ navuka	1	0.0%
Fitzroy Dearborn Publishers	1	0.0%
Karl J. Trübner	1	0.0%
Institut für Sprachwissenschaft der Universität Innsbruck	1	0.0%
Vandenhoeck & Ruprecht	1	0.0%
Walter de Gruyter & Co.	1	0.0%
Nova Fronteira	1	0.0%

editor categorical metadata

This appears to be an editor name field for bibliographic records, but it is effectively empty: 11,356 of 11,359 rows (top_rate 0.9997) hold the empty string, with only three distinct non-empty values, each appearing once (one of them, 'Martynaǔ, V. and G., Cyhun', names two editors). Entropy is essentially zero (0.0039) and cardinality is just 4, so the column carries almost no information despite a 0.0 null rate (blanks are encoded as ''). The long_tail and imbalance alerts simply reflect that three singleton values sit beside one dominant blank.

Treatment: Drop; near-constant blank with only three populated rows.

anthropic:claude-opus-4-7 · confidence high

Out[31]:

saturn.columns["editor"].stats

stat	value
n	11,359
nulls	0 (0.0%)
unique	4
top_value
top_rate	0.9997
cardinality	4
entropy	0.003939
entropy_ratio	0.001969
alert: long_tail	3 singleton categories
alert: imbalance	top value is 100.0% of rows

Fig 14.

Top values for editor.

Top values for editor (4 unique shown, of 4 total).
value	count	share
	11356	100.0%
Tischler, Johann	1	0.0%
Martynaǔ, V. and G., Cyhun	1	0.0%
Mallory, James P.	1	0.0%

url categorical metadata

This column is labelled 'url' but contains a single value—an empty string—across all 11,359 rows. Cardinality is 1, entropy is 0, and the top_rate is 1.0, so it carries no information whatsoever. Likely a placeholder field that was never populated during ingestion.

Treatment: Drop; the column is constant and has zero predictive value.

anthropic:claude-opus-4-7 · confidence high
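
The "Drop" treatments for journal, publisher, editor and url all follow the same rule of thumb: the most common value covers nearly every row. A minimal sketch of that check, assuming rows are plain dicts; the function name, the 0.99 threshold and the synthetic rows are illustrative choices.

```python
from collections import Counter


def near_constant_columns(rows, threshold=0.99):
    """Return column names whose most common value covers >= threshold of rows."""
    if not rows:
        return []
    flagged = []
    for col in rows[0]:
        counts = Counter(row.get(col) for row in rows)
        top_share = counts.most_common(1)[0][1] / len(rows)
        if top_share >= threshold:
            flagged.append(col)
    return flagged


# Synthetic rows: unique key, always-empty url, journal populated in one row only
rows = [
    {"key": str(i), "url": "", "journal": "" if i else "Historische Sprachforschung"}
    for i in range(200)
]
to_drop = near_constant_columns(rows)
print(to_drop)  # ['url', 'journal']
```

With the top_rate values reported here (0.9987 to 1.0), all four sparse columns would be flagged at this threshold, while year (top_rate 0.1144) would not.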

Out[34]:

saturn.columns["url"].stats

stat	value
n	11,359
nulls	0 (0.0%)
unique	1
top_value
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 15.

Top values for url.

Top values for url (1 unique shown, of 1 total).
value	count	share
	11359	100.0%

citation text metadata

Bibliographic citation strings for linguistic sources, mostly English (3590) with substantial French (401), Chinese (335), and German (249) entries. Heavy duplication is the headline: 6181 duplicates (54.4%) across only 5178 unique values, with the top value 'Anon. n.d.. http://www.africamuseum.be/...' repeating 462 times. URLs appear in 13.3% of rows and Flesch readability is low (29.3), consistent with reference-style text rather than prose.

Treatment: Normalize and deduplicate to a citation lookup table, then reference by key.

anthropic:claude-opus-4-7 · confidence high
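
One way to realize the lookup-table treatment is to assign each distinct normalized citation an integer id and map every row key to that id. The sketch below assumes records are dicts with "key" and "citation" fields, as the column list suggests; the normalization rule, the function names and the sample records are made up for illustration.

```python
import re
import unicodedata


def normalize_citation(text):
    """NFKC-normalize, collapse whitespace, and casefold for duplicate detection."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().casefold()


def build_citation_lookup(records):
    """Return (lookup, key_to_id): distinct citations and each row key's citation id."""
    lookup = {}      # normalized citation -> {"id": int, "citation": first raw string seen}
    key_to_id = {}   # row key -> citation id
    for rec in records:
        norm = normalize_citation(rec["citation"])
        if norm not in lookup:
            lookup[norm] = {"id": len(lookup) + 1, "citation": rec["citation"]}
        key_to_id[rec["key"]] = lookup[norm]["id"]
    return lookup, key_to_id


# Made-up records; the second differs from the first only in spacing and case
records = [
    {"key": "1", "citation": "Doe, Jane. 1999. Example wordlist."},
    {"key": "2", "citation": "doe, jane.  1999. Example Wordlist."},
    {"key": "3", "citation": "Roe, Richard. 2005. Another source."},
]
lookup, key_to_id = build_citation_lookup(records)
print(len(lookup), key_to_id)
```

Because the reported 54.4% duplicate rate is based on exact string matches, a normalizing lookup like this one can only merge more rows than that, never fewer.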

Out[37]:

saturn.columns["citation"].stats

stat	value
n	11,359
nulls	0 (0.0%)
unique	5,178
len_min	8
len_max	142
len_mean	67.94
len_median	71
len_p95	77
word_mean	8.827
word_median	10
n_empty	0
n_duplicates	6,181
duplicate_rate	0.5442
vocab_size	15,535
readability_flesch_mean	29.26
emoji_rate	0
url_rate	0.1334
one_word_rate	0
allcaps_rate	0.06004
boilerplate_rate	0
alert: multilingual	31 languages detected in sample
alert: duplicates	54.4% duplicate strings

Fig 16.

Character-length distribution for citation.

Character-length distribution for citation (mean: 67.93811074918567).
chars	count
8 – 11	7
11 – 15	22
15 – 18	20
18 – 21	8
21 – 25	54
25 – 28	128
28 – 31	99
31 – 35	99
35 – 38	133
38 – 42	125
42 – 45	61
45 – 48	269
48 – 52	86
52 – 55	119
55 – 58	95
58 – 62	71
62 – 65	105
65 – 68	312
68 – 72	5296
72 – 75	3139
75 – 78	854
78 – 82	168
82 – 85	64
85 – 88	4
88 – 92	6
92 – 95	3
95 – 98	6
98 – 102	0
102 – 105	2
105 – 108	0
108 – 112	0
112 – 115	1
115 – 119	1
119 – 122	0
122 – 125	0
125 – 129	1
129 – 132	0
132 – 135	0
135 – 139	0
139 – 142	1
