saturn·

processed lexibank references

source /home/coolhand/servers/diachronica/etymology_atlas/processed/lexibank_references.json · 11,359 rows · 9 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a bibliographic reference list with 11,359 rows and 9 columns (key, author, citation, title, year, plus mostly-empty editor/publisher/journal/url fields). The most informative columns are author, citation, title, and year — the rest are either unique IDs or near-empty categoricals. Note that author has a 66% duplicate rate and 1,277 empty values, while citation and title both show heavy duplication (54% and 50%) driven by a handful of large source collections like Koelle's Polyglotta africana and the Africa Museum and Austronesian web archives. The year column spans 271 distinct values with reasonable spread (entropy ratio 0.75), though about 11% of rows have no year and another 574 are marked 'n.d.'. Author and title are also multilingual, with English dominant but meaningful German, French, Spanish, and Chinese subsets.

citing: row_count · column_count · columns.author.stats.duplicate_rate · columns.author.stats.n_empty · columns.author.language_counts · columns.author.top_values · columns.citation.stats.duplicate_rate · columns.citation.top_values · columns.title.stats.duplicate_rate · columns.title.top_values · columns.title.language_counts · columns.year.n_unique · columns.year.stats.entropy_ratio · columns.year.top_values · columns.editor.stats.top_rate · columns.publisher.stats.top_rate · columns.url.stats.top_rate
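
Several columns report a 0.0% null rate while being mostly empty strings, so a null check alone understates missingness. The following is a minimal sketch of counting blanks per column. It assumes the JSON loads as a list of row dictionaries (the file's actual layout is not shown in this report); `empty_string_rates` and the sample rows are illustrative and not part of the profiler.

```python
from collections import Counter


def empty_string_rates(records):
    """Return {column: fraction of rows whose value is '' or only whitespace}."""
    if not records:
        return {}
    blanks = Counter()
    columns = set()
    for row in records:
        for column, value in row.items():
            columns.add(column)
            if isinstance(value, str) and not value.strip():
                blanks[column] += 1
    return {column: blanks[column] / len(records) for column in sorted(columns)}


# Made-up rows that only mimic the shape described in the summary.
sample = [
    {"key": "47", "author": "Koelle, Sigismund Wilhelm", "year": "n.d.", "url": ""},
    {"key": "49", "author": "", "year": "1992", "url": ""},
    {"key": "50", "author": "Tryon, Darrell T.", "year": "", "url": ""},
    {"key": "51", "author": "", "year": "2015", "url": ""},
]

rates = empty_string_rates(sample)
print(rates)
```

Run against the real file, a function like this should surface the same picture the cards below describe: blanks encoded as `''` rather than true nulls.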

Schema

9 columns
Per-column summary.

column     type         nulls  unique  alerts
key        text         0.0%   11,359  near_unique, one_word, allcaps, short_text
author     text         0.0%    3,830  multilingual, duplicates
year       categorical  0.0%      271  (none)
title      text         0.0%    5,663  multilingual, duplicates
journal    categorical  0.0%        4  imbalance
publisher  categorical  0.0%       15  long_tail, imbalance
editor     categorical  0.0%        4  long_tail, imbalance
url        categorical  0.0%        1  imbalance
citation   text         0.0%    5,178  multilingual, duplicates

key

text identifier near_unique one_word allcaps short_text
This column is almost certainly a primary identifier: every one of the 11,359 rows holds a unique, single-token value (n_unique=11359, one_word_rate=1.0, duplicate_rate=0.0). Values are short (len_mean 4.07, len_max 5) and 99.1% are uppercase, consistent with short alphanumeric codes rather than natural text. The top_words sample shows purely numeric tokens (47, 49, 50, ...), so the 'allcaps' signal may be a side effect of digit-only strings rather than true letters. Treatment: Use as a row key or left-join key; drop from modelling features. high · anthropic:claude-opus-4-7
n
11,359
nulls
0 (0.0%)
unique
11,359
len_min
1
len_max
5
len_mean
4.074
len_median
4
len_p95
5
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
11,359
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.9913
boilerplate_rate
0
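
The treatment for `key` (use it as a row or join key) depends on the uniqueness the stats report. A hedged sketch of guarding that assumption before indexing follows; `duplicate_keys`, `index_by_key`, and the sample rows are hypothetical names and data, and records are assumed to be dictionaries with a `key` field.

```python
from collections import Counter


def duplicate_keys(records, key_field="key"):
    """Return key values that occur more than once, sorted for stable output."""
    counts = Counter(row[key_field] for row in records)
    return sorted(k for k, n in counts.items() if n > 1)


def index_by_key(records, key_field="key"):
    """Build {key: row}; raise if the key turns out not to be unique."""
    dupes = duplicate_keys(records, key_field)
    if dupes:
        raise ValueError(f"{key_field!r} is not unique: {dupes[:5]}")
    return {row[key_field]: row for row in records}


# Illustrative rows only.
rows = [{"key": "47", "year": "1992"}, {"key": "49", "year": "n.d."}]
index = index_by_key(rows)
print(index["49"])
```

Failing loudly here is a design choice: a silent overwrite in the dictionary would hide exactly the duplication the profile says does not exist.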

author

text metadata multilingual duplicates
This is an author/contributor name field, mostly formatted 'Surname, Given Name' with frequent multi-author strings joined by 'and' (4049 occurrences). Duplication is severe: 66.3% of rows repeat, with prolific contributors like 'Koelle, Sigismund Wilhelm' (193) and 'Tryon, Darrell T.' (191) dominating, and 1277 rows are empty strings despite a 0.0 null rate. Names span 30 detected languages — predominantly English (3125) and German (384), but also 113 rows of CJK script — so naive string matching will fragment identities. Treatment: Normalize to a canonical name form and split multi-author strings on ' and ' before any join or grouping. high · anthropic:claude-opus-4-7
n
11,359
nulls
0 (0.0%)
unique
3,830
len_min
0
len_max
453
len_mean
20.43
len_median
17
len_p95
50
word_mean
3.382
word_median
3
n_empty
1,277
n_duplicates
7,529
duplicate_rate
0.6628
vocab_size
6,656
readability_flesch_mean
53.21
emoji_rate
0
url_rate
0
one_word_rate
0.1836
allcaps_rate
0.0589
boilerplate_rate
0
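
The author treatment (split on ' and ', then normalize to a canonical form) can be sketched as below. This is one possible normalization, not the profiler's: it applies Unicode NFKC, collapses whitespace, and casefolds to build a matching key. `split_authors` and `author_key` are hypothetical helper names.

```python
import re
import unicodedata


def split_authors(raw):
    """Split a multi-author string on the word 'and'; blank input gives []."""
    parts = re.split(r"\s+and\s+", raw.strip())
    return [part.strip() for part in parts if part.strip()]


def author_key(name):
    """Matching key: NFKC-normalize, collapse whitespace, casefold."""
    normalized = unicodedata.normalize("NFKC", name)
    return " ".join(normalized.split()).casefold()


authors = split_authors("Koelle, Sigismund Wilhelm and Tryon, Darrell T.")
print(authors)
print([author_key(a) for a in authors])
```

Splitting on a bare 'and' will also break institutional names that contain the word, and a casefolded key does not reconcile transliterations or CJK versus romanized forms, so the output is best treated as candidate groupings to review.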

year

categorical feature
This is a year field stored as strings rather than integers, with 271 distinct values across 11,359 rows. The most common entry is an empty string (1,300 rows, 11.4%) followed by the literal "n.d." (574 rows), so roughly 16.5% of records carry no usable year despite a 0.0 null rate. Actual years span at least 1971 to 2015 in the top values, with 1992 the most frequent real year at 338 occurrences. Treatment: Coerce to integer, mapping empty strings and "n.d." to missing before any temporal analysis. high · anthropic:claude-opus-4-7
n
11,359
nulls
0 (0.0%)
unique
271
top_value
(empty string)
top_rate
0.1144
cardinality
271
entropy
6.1
entropy_ratio
0.7548
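
A minimal sketch of the coercion described above, mapping empty strings and "n.d." to missing and four-digit strings to integers. How the column encodes anything else (ranges, suffixed years, and so on) is not shown in this report, so the sketch returns `None` for every other form rather than guessing; `parse_year` is a hypothetical helper.

```python
import re

# Markers the report identifies as "no usable year".
MISSING_YEAR_MARKERS = {"", "n.d."}
FOUR_DIGITS = re.compile(r"[0-9]{4}")


def parse_year(raw):
    """Return the year as an int, or None when it is missing or unparseable."""
    text = raw.strip()
    if text.casefold() in MISSING_YEAR_MARKERS:
        return None
    if FOUR_DIGITS.fullmatch(text):
        return int(text)
    # Unknown format: treat as missing and inspect separately.
    return None


print([parse_year(y) for y in ["1992", "", "n.d.", " 2015 ", "1992a"]])
```

Values that are neither a missing marker nor four digits are worth listing and reviewing by hand before any temporal analysis.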

title

text metadata multilingual duplicates
This column holds bibliographic citation strings — book/article titles, URLs, and access notes for linguistic sources, dominated by English (3620) but mixing 24 other languages including French (382), Chinese (300), German (224), and Spanish (207). Half the values are duplicates (duplicate_rate 0.50, 5696 repeats across 11359 rows), with a single africamuseum.be URL appearing 355 times and 17.4% of entries containing URLs. Length varies wildly (median 104 chars, max 1562) and top words ('of','the','languages','linguistics','(accessed') confirm these are reference citations rather than free-form titles. Treatment: Normalize and deduplicate citations into a source-reference lookup table rather than treating as a modelling feature. high · anthropic:claude-opus-4-7
n
11,359
nulls
0 (0.0%)
unique
5,663
len_min
0
len_max
1,562
len_mean
120.6
len_median
104
len_p95
261
word_mean
14.66
word_median
12
n_empty
7
n_duplicates
5,696
duplicate_rate
0.5015
vocab_size
21,846
readability_flesch_mean
8.003
emoji_rate
0
url_rate
0.1741
one_word_rate
0.07008
allcaps_rate
0.06154
boilerplate_rate
0
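
The duplicate figures on the text cards are consistent with `n_duplicates = n - unique` and `duplicate_rate = n_duplicates / n`. That relationship is inferred from the reported numbers, not taken from the profiler's documentation. A short check, with `duplicate_stats` as a hypothetical helper:

```python
def duplicate_stats(n_rows, n_unique):
    """Return (n_duplicates, duplicate_rate rounded to 4 decimals)."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, round(n_duplicates / n_rows, 4)


# (n, unique) exactly as reported on the author, title, and citation cards.
reported = {
    "author": (11359, 3830),
    "title": (11359, 5663),
    "citation": (11359, 5178),
}

for column, (n, unique) in reported.items():
    print(column, duplicate_stats(n, unique))
```

Under this reading, "duplicates" counts every row beyond the first occurrence of a value, so the first copy of each repeated title is not counted.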

journal

categorical metadata imbalance
This appears to be a journal-name field for bibliographic records, but it is effectively empty: 11,355 of 11,359 rows (top_rate 0.9996) hold an empty string, leaving only 4 populated rows that share 3 distinct journal names, all German linguistics titles. Entropy is 0.005 (entropy_ratio 0.0025), so the column carries virtually no information despite a 0.0 null rate; the blanks are stored as empty strings rather than nulls. Treatment: Drop; near-constant empty string with only 4 populated rows. high · anthropic:claude-opus-4-7
n
11,359
nulls
0 (0.0%)
unique
4
top_value
(empty string)
top_rate
0.9996
cardinality
4
entropy
0.005076
entropy_ratio
0.002538

publisher

categorical metadata long_tail imbalance
Publisher name field, but it is effectively empty: 11,344 of 11,359 rows (top_rate 0.9987) carry an empty string, leaving only 15 populated rows spread over 14 distinct publisher names (the cardinality of 15 counts the empty string as a value), with an entropy_ratio of 0.005. The handful of populated entries (Brill with 2, then Winter, Reichert, Rodopi, Harrassowitz and others with 1 each) hint at academic/humanities publishers but are too sparse to be useful. Note that null_rate is 0.0 because the blanks are stored as empty strings rather than true nulls. Treatment: Drop; the column is ~99.9% empty strings and carries almost no signal. high · anthropic:claude-opus-4-7
n
11,359
nulls
0 (0.0%)
unique
15
top_value
(empty string)
top_rate
0.9987
cardinality
15
entropy
0.01952
entropy_ratio
0.004996

editor

categorical metadata long_tail imbalance
This appears to be an editor name field for bibliographic records, but it is effectively empty: 11356 of 11359 rows (top_rate 0.9997) hold the empty string, with only three distinct named editors each appearing once. Entropy is essentially zero (0.0039) and cardinality is just 4, so the column carries almost no information despite a 0.0 null rate (blanks are encoded as ''). The long_tail and imbalance alerts simply reflect that three singleton names sit beside one dominant blank. Treatment: Drop; near-constant blank with only three populated rows. high · anthropic:claude-opus-4-7
n
11,359
nulls
0 (0.0%)
unique
4
top_value
(empty string)
top_rate
0.9997
cardinality
4
entropy
0.003939
entropy_ratio
0.001969
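
The editor card's entropy figures can be reproduced from its value counts (one blank value on 11,356 rows plus three singleton names) if entropy is taken as Shannon entropy in bits and entropy_ratio as entropy divided by log2 of the cardinality. That definition is inferred from the numbers in this report, not confirmed by the profiler; the function names are hypothetical.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of a list of value counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a constant column has no spread to normalize
    return shannon_entropy_bits(counts) / math.log2(k)


# Editor column: 11,356 empty strings and three names appearing once each.
editor_counts = [11356, 1, 1, 1]
print(round(shannon_entropy_bits(editor_counts), 6))
print(round(entropy_ratio(editor_counts), 6))
```

The same two functions applied to a single-value column give a ratio of 0, matching the url card.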

url

categorical metadata imbalance
This column is labelled 'url' but contains a single value—an empty string—across all 11,359 rows. Cardinality is 1, entropy is 0, and the top_rate is 1.0, so it carries no information whatsoever. Likely a placeholder field that was never populated during ingestion. Treatment: Drop; the column is constant and has zero predictive value. high · anthropic:claude-opus-4-7
n
11,359
nulls
0 (0.0%)
unique
1
top_value
(empty string)
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0
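
The "Drop" treatment recurs for journal, publisher, editor, and url because a single value dominates each column. A minimal sketch of flagging such columns by top_rate follows. The 0.99 threshold, the function name, and the synthetic rows are assumptions chosen for illustration.

```python
from collections import Counter


def near_constant_columns(records, threshold=0.99):
    """Return {column: top_rate} for columns whose most common value
    covers at least `threshold` of the rows."""
    flagged = {}
    columns = sorted({column for row in records for column in row})
    for column in columns:
        counts = Counter(row.get(column, "") for row in records)
        top_rate = counts.most_common(1)[0][1] / len(records)
        if top_rate >= threshold:
            flagged[column] = top_rate
    return flagged


# Synthetic data: url is always blank, journal is blank on 199 of 200 rows.
sample = [
    {"year": str(1990 + i % 20), "journal": "", "url": ""} for i in range(200)
]
sample[0]["journal"] = "Hypothetical Journal"

flagged = near_constant_columns(sample)
print(flagged)
```

A threshold below the observed top rates (0.9987 to 1.0 in this report) would flag all four columns while leaving year, whose top_rate is 0.1144, untouched.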

citation

text metadata multilingual duplicates
Bibliographic citation strings for linguistic sources, mostly English (3590) with substantial French (401), Chinese (335), and German (249) entries. Heavy duplication is the headline: 6181 duplicates (54.4%) across only 5178 unique values, with the top value 'Anon. n.d.. http://www.africamuseum.be/...' repeating 462 times. URLs appear in 13.3% of rows and Flesch readability is low (29.3), consistent with reference-style text rather than prose. Treatment: Normalize and deduplicate to a citation lookup table, then reference by key. high · anthropic:claude-opus-4-7
n
11,359
nulls
0 (0.0%)
unique
5,178
len_min
8
len_max
142
len_mean
67.94
len_median
71
len_p95
77
word_mean
8.827
word_median
10
n_empty
0
n_duplicates
6,181
duplicate_rate
0.5442
vocab_size
15,535
readability_flesch_mean
29.26
emoji_rate
0
url_rate
0.1334
one_word_rate
0
allcaps_rate
0.06004
boilerplate_rate
0
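
The citation treatment (deduplicate into a lookup table, then reference by key) can be sketched as follows. The normalization here is deliberately light (Unicode NFKC plus whitespace collapsing) and is an assumption; the integer ids, function names, and placeholder citations are all hypothetical.

```python
import unicodedata


def normalize_citation(text):
    """NFKC-normalize and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def build_citation_lookup(citations):
    """Return (lookup, row_ids): lookup maps integer id -> citation text,
    row_ids holds one id per input row, in input order."""
    ids_by_text = {}
    row_ids = []
    for raw in citations:
        text = normalize_citation(raw)
        if text not in ids_by_text:
            ids_by_text[text] = len(ids_by_text)
        row_ids.append(ids_by_text[text])
    lookup = {cid: text for text, cid in ids_by_text.items()}
    return lookup, row_ids


# Placeholder strings, not taken from the dataset.
citations = [
    "Author A 1992. Title one",
    "Author A  1992. Title one",  # differs only by a double space
    "Author B n.d.. Title two",
]
lookup, row_ids = build_citation_lookup(citations)
print(lookup)
print(row_ids)
```

With 5,178 unique values across 11,359 rows, a table like `lookup` would hold at most 5,178 entries, and fewer if normalization merges near-identical strings.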