processed lexibank references
Reading
This dataset is a bibliographic reference list with 11,359 rows and 9 columns: key, author, citation, title, and year, plus the mostly empty editor, publisher, journal, and url fields. The most informative columns are author, citation, title, and year; key is a unique ID and the remaining four are near-empty categoricals. Note that author has a 66% duplicate rate and 1,277 empty values, while citation and title both show heavy duplication (54% and 50%) driven by a handful of large source collections like Koelle's Polyglotta africana and the Africa Museum and Austronesian web archives. The year column has 271 distinct values with a reasonable spread (entropy ratio 0.75), though 1,300 rows (11.4%) have an empty year and another 574 (5.1%) are marked 'n.d.'. Author and title are also multilingual, with English dominant but meaningful German, French, Spanish, and Chinese subsets.
citing: row_count · column_count · columns.author.stats.duplicate_rate · columns.author.stats.n_empty · columns.author.language_counts · columns.author.top_values · columns.citation.stats.duplicate_rate · columns.citation.top_values · columns.title.stats.duplicate_rate · columns.title.top_values · columns.title.language_counts · columns.year.n_unique · columns.year.stats.entropy_ratio · columns.year.top_values · columns.editor.stats.top_rate · columns.publisher.stats.top_rate · columns.url.stats.top_rate
Charts the summary said to look at first
year: most frequent values
| value | count | share |
|---|---|---|
| (empty string) | 1300 | 11.4% |
| n.d. | 574 | 5.1% |
| 1992 | 338 | 3.0% |
| 2007 | 282 | 2.5% |
| 1971 | 271 | 2.4% |
| 1979 | 254 | 2.2% |
| 1980 | 225 | 2.0% |
| 2005 | 221 | 1.9% |
| 2015 | 217 | 1.9% |
| 1986 | 208 | 1.8% |
| 1997 | 208 | 1.8% |
| 2006 | 204 | 1.8% |
| 2009 | 202 | 1.8% |
| 2011 | 196 | 1.7% |
| 1975 | 195 | 1.7% |
| 1963 [1854] | 193 | 1.7% |
| 1981 | 188 | 1.7% |
| 2016 | 188 | 1.7% |
| 2004 | 185 | 1.6% |
| 2000 | 185 | 1.6% |
author: length in characters
| chars | count |
|---|---|
| 0 – 11 | 2178 |
| 11 – 23 | 6163 |
| 23 – 34 | 1293 |
| 34 – 45 | 1034 |
| 45 – 57 | 188 |
| 57 – 68 | 206 |
| 68 – 79 | 126 |
| 79 – 91 | 30 |
| 91 – 102 | 75 |
| 102 – 113 | 31 |
| 113 – 125 | 7 |
| 125 – 136 | 4 |
| 136 – 147 | 2 |
| 147 – 159 | 5 |
| 159 – 170 | 2 |
| 170 – 181 | 2 |
| 181 – 193 | 1 |
| 193 – 204 | 0 |
| 204 – 215 | 1 |
| 215 – 226 | 0 |
| 226 – 238 | 0 |
| 238 – 249 | 0 |
| 249 – 260 | 0 |
| 260 – 272 | 0 |
| 272 – 283 | 0 |
| 283 – 294 | 3 |
| 294 – 306 | 0 |
| 306 – 317 | 6 |
| 317 – 328 | 1 |
| 328 – 340 | 0 |
| 340 – 351 | 0 |
| 351 – 362 | 0 |
| 362 – 374 | 0 |
| 374 – 385 | 0 |
| 385 – 396 | 0 |
| 396 – 408 | 0 |
| 408 – 419 | 0 |
| 419 – 430 | 0 |
| 430 – 442 | 0 |
| 442 – 453 | 1 |
citation: length in characters
| chars | count |
|---|---|
| 8 – 11 | 7 |
| 11 – 15 | 22 |
| 15 – 18 | 20 |
| 18 – 21 | 8 |
| 21 – 25 | 54 |
| 25 – 28 | 128 |
| 28 – 31 | 99 |
| 31 – 35 | 99 |
| 35 – 38 | 133 |
| 38 – 42 | 125 |
| 42 – 45 | 61 |
| 45 – 48 | 269 |
| 48 – 52 | 86 |
| 52 – 55 | 119 |
| 55 – 58 | 95 |
| 58 – 62 | 71 |
| 62 – 65 | 105 |
| 65 – 68 | 312 |
| 68 – 72 | 5296 |
| 72 – 75 | 3139 |
| 75 – 78 | 854 |
| 78 – 82 | 168 |
| 82 – 85 | 64 |
| 85 – 88 | 4 |
| 88 – 92 | 6 |
| 92 – 95 | 3 |
| 95 – 98 | 6 |
| 98 – 102 | 0 |
| 102 – 105 | 2 |
| 105 – 108 | 0 |
| 108 – 112 | 0 |
| 112 – 115 | 1 |
| 115 – 119 | 1 |
| 119 – 122 | 0 |
| 122 – 125 | 0 |
| 125 – 129 | 1 |
| 129 – 132 | 0 |
| 132 – 135 | 0 |
| 135 – 139 | 0 |
| 139 – 142 | 1 |
title: length in characters
| chars | count |
|---|---|
| 0 – 39 | 1194 |
| 39 – 78 | 2283 |
| 78 – 117 | 3425 |
| 117 – 156 | 1892 |
| 156 – 195 | 1094 |
| 195 – 234 | 646 |
| 234 – 273 | 344 |
| 273 – 312 | 178 |
| 312 – 351 | 81 |
| 351 – 390 | 56 |
| 390 – 430 | 39 |
| 430 – 469 | 28 |
| 469 – 508 | 23 |
| 508 – 547 | 24 |
| 547 – 586 | 13 |
| 586 – 625 | 14 |
| 625 – 664 | 5 |
| 664 – 703 | 3 |
| 703 – 742 | 3 |
| 742 – 781 | 2 |
| 781 – 820 | 6 |
| 820 – 859 | 1 |
| 859 – 898 | 0 |
| 898 – 937 | 2 |
| 937 – 976 | 1 |
| 976 – 1015 | 0 |
| 1015 – 1054 | 0 |
| 1054 – 1093 | 0 |
| 1093 – 1132 | 0 |
| 1132 – 1172 | 0 |
| 1172 – 1211 | 0 |
| 1211 – 1250 | 0 |
| 1250 – 1289 | 0 |
| 1289 – 1328 | 0 |
| 1328 – 1367 | 0 |
| 1367 – 1406 | 0 |
| 1406 – 1445 | 1 |
| 1445 – 1484 | 0 |
| 1484 – 1523 | 0 |
| 1523 – 1562 | 1 |
Schema
9 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| key | text | 0.0% | 11,359 | near_unique, one_word, allcaps, short_text |
| author | text | 0.0% | 3,830 | multilingual, duplicates |
| year | categorical | 0.0% | 271 | |
| title | text | 0.0% | 5,663 | multilingual, duplicates |
| journal | categorical | 0.0% | 4 | imbalance |
| publisher | categorical | 0.0% | 15 | long_tail, imbalance |
| editor | categorical | 0.0% | 4 | long_tail, imbalance |
| url | categorical | 0.0% | 1 | imbalance |
| citation | text | 0.0% | 5,178 | multilingual, duplicates |
key
text identifier near_unique one_word allcaps short_text
This column is almost certainly a primary identifier: every one of the 11,359 rows holds a unique, single-token value (n_unique=11359, one_word_rate=1.0, duplicate_rate=0.0). Values are short (len_mean 4.07, len_max 5) and 99.1% are uppercase, consistent with short alphanumeric codes rather than natural text. The top_words sample shows purely numeric tokens (47, 49, 50, ...), so the 'allcaps' signal may be a side effect of digit-only strings rather than true letters. Treatment: Use as a row key or left-join key; drop from modelling features.
- n: 11,359
- nulls: 0 (0.0%)
- unique: 11,359
- len_min: 1
- len_max: 5
- len_mean: 4.074
- len_median: 4
- len_p95: 5
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 11,359
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.9913
- boilerplate_rate: 0
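The suspicion that the 'allcaps' alert comes from digit-only keys can be illustrated with a small sketch. It assumes, without confirmation from the report, that the profiler treats a value as all-caps when upper-casing leaves it unchanged; the function names and the sample keys are hypothetical.

```python
def naive_allcaps(value: str) -> bool:
    """True when upper-casing changes nothing; digit-only strings pass."""
    return value != "" and value == value.upper()


def strict_allcaps(value: str) -> bool:
    """True only when the value has at least one cased character and all
    cased characters are uppercase (the behaviour of str.isupper)."""
    return value.isupper()


def rate(values, predicate) -> float:
    """Share of values for which the predicate holds."""
    return sum(predicate(v) for v in values) / len(values)


# Hypothetical keys in the style of the top_words sample (47, 49, 50, ...).
sample_keys = ["47", "49", "50", "A12", "b7"]
naive_rate = rate(sample_keys, naive_allcaps)    # "47", "49", "50", "A12" pass
strict_rate = rate(sample_keys, strict_allcaps)  # only "A12" passes
```

If the real keys behave like this sample, a high allcaps_rate says little about letter case and the alert can probably be ignored for this column.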
year
categorical feature
This is a year field stored as strings rather than integers, with 271 distinct values across 11,359 rows. The most common entry is an empty string (1,300 rows, 11.4%), followed by the literal "n.d." (574 rows, 5.1%), so roughly 16.5% of records carry no usable year despite a 0.0 null rate. Among the most frequent values, plain years run from 1971 to 2016, with 1992 the most frequent real year at 338 occurrences; the composite value "1963 [1854]" (193 rows) shows that not every non-empty entry is a bare four-digit year. Treatment: Coerce to integer, mapping empty strings and "n.d." to missing and deciding how to handle composite values such as "1963 [1854]" before any temporal analysis.
- n: 11,359
- nulls: 0 (0.0%)
- unique: 271
- top_value: (empty string)
- top_rate: 0.1144
- cardinality: 271
- entropy: 6.1
- entropy_ratio: 0.7548
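The coercion suggested for this column could look roughly like the sketch below. The name `parse_year` and the rule of taking the first four-digit run are assumptions made for illustration: for a value like "1963 [1854]" that rule keeps 1963 and discards the bracketed year, which may or may not be the right choice for a given analysis.

```python
import re
from typing import Optional

# Markers the report describes as carrying no usable year.
MISSING_MARKERS = {"", "n.d."}
FOUR_DIGITS = re.compile(r"\d{4}")


def parse_year(raw: str) -> Optional[int]:
    """Return the first four-digit year in raw, or None if there is none.

    Assumption: in composite values the first year is the one to keep.
    """
    text = raw.strip()
    if text.lower() in MISSING_MARKERS:
        return None
    match = FOUR_DIGITS.search(text)
    return int(match.group()) if match else None


years = [parse_year(v) for v in ["1992", "", "n.d.", "1963 [1854]"]]
```

Values that contain no four-digit run also come back as None, so it is worth counting how many rows fall into that bucket before trusting any trend over time.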
title
text metadata multilingual duplicates
This column holds bibliographic citation strings (book and article titles, URLs, and access notes for linguistic sources), dominated by English (3,620) but mixing 24 other languages including French (382), Chinese (300), German (224), and Spanish (207). Half the values are duplicates (duplicate_rate 0.50, 5,696 repeats across 11,359 rows), with a single africamuseum.be URL appearing 355 times and 17.4% of entries containing URLs. Length varies widely (median 104 chars, max 1,562) and top words ('of', 'the', 'languages', 'linguistics', '(accessed') confirm these are reference citations rather than free-form titles. Treatment: Normalize and deduplicate citations into a source-reference lookup table rather than treating as a modelling feature.
- n: 11,359
- nulls: 0 (0.0%)
- unique: 5,663
- len_min: 0
- len_max: 1,562
- len_mean: 120.6
- len_median: 104
- len_p95: 261
- word_mean: 14.66
- word_median: 12
- n_empty: 7
- n_duplicates: 5,696
- duplicate_rate: 0.5015
- vocab_size: 21,846
- readability_flesch_mean: 8.003
- emoji_rate: 0
- url_rate: 0.1741
- one_word_rate: 0.07008
- allcaps_rate: 0.06154
- boilerplate_rate: 0
journal
categorical metadata imbalance
This appears to be a journal-name field for bibliographic records, but it is effectively empty: 11,355 of 11,359 rows (top_rate 0.9996) hold an empty string, leaving only 4 populated rows that share 3 distinct German linguistics titles. Entropy is 0.005 (entropy_ratio 0.0025), so the column carries virtually no information despite a 0.0 null rate, because the blanks are stored as empty strings rather than nulls. Treatment: Drop; near-constant empty string with only 4 populated rows.
- n: 11,359
- nulls: 0 (0.0%)
- unique: 4
- top_value: (empty string)
- top_rate: 0.9996
- cardinality: 4
- entropy: 0.005076
- entropy_ratio: 0.002538
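The entropy figures for this column can be reproduced from the counts alone. The report does not define entropy_ratio; the sketch assumes it is the Shannon entropy in bits divided by log2 of the cardinality, which matches the numbers listed here. Four populated rows spread over three titles can only split as 2, 1, and 1, so the counts below follow from the description; the function name is hypothetical.

```python
import math


def entropy_stats(counts):
    """Return (entropy in bits, entropy / log2(number of distinct values)).

    The ratio definition is an assumption chosen to match the report.
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    ratio = entropy / math.log2(len(probs)) if len(probs) > 1 else 0.0
    return entropy, ratio


# journal: 11,355 empty strings, then three titles with 2, 1 and 1 rows.
journal_entropy, journal_ratio = entropy_stats([11355, 2, 1, 1])
```

A ratio this close to zero means the column is nearly constant, which is what the drop recommendation rests on.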
publisher
categorical metadata long_tail imbalance
Publisher name field, but it is effectively empty: 11,344 of 11,359 rows (top_rate 0.9987) carry an empty string, leaving only 15 populated rows spread over 14 distinct publisher names (15 distinct values once the empty string is counted) and an entropy_ratio of 0.005. The handful of populated entries (Brill with 2, then Winter, Reichert, Rodopi, Harrassowitz and others with 1 each) hint at academic/humanities publishers but are too sparse to be useful. Note that null_rate is 0.0 because the blanks are stored as empty strings rather than true nulls. Treatment: Drop; the column is ~99.9% empty strings and carries almost no signal.
- n: 11,359
- nulls: 0 (0.0%)
- unique: 15
- top_value: (empty string)
- top_rate: 0.9987
- cardinality: 15
- entropy: 0.01952
- entropy_ratio: 0.004996
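Because blanks are stored as empty strings, the reported null rate understates how much is missing. One way to measure the effective rate is sketched below; `missing_rate` is a hypothetical helper, and treating whitespace-only strings as blank is an added assumption, since the report only mentions empty strings.

```python
def missing_rate(values) -> float:
    """Share of values that are None or blank after stripping whitespace."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v.strip() == "")
    return missing / len(values)


# Toy column: two publisher names from the report and three blank entries.
toy_publishers = ["", "Brill", " ", None, "Winter"]
toy_rate = missing_rate(toy_publishers)  # 3 of the 5 entries are blank
```

Applied to the real column, this kind of check should land near the top_rate of 0.9987 rather than the 0.0% shown under nulls.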
editor
categorical metadata long_tail imbalance
This appears to be an editor name field for bibliographic records, but it is effectively empty: 11,356 of 11,359 rows (top_rate 0.9997) hold the empty string, with only three distinct named editors each appearing once. Entropy is essentially zero (0.0039) and cardinality is just 4, so the column carries almost no information despite a 0.0 null rate (blanks are encoded as ''). The long_tail and imbalance alerts simply reflect that three singleton names sit beside one dominant blank. Treatment: Drop; near-constant blank with only three populated rows.
- n: 11,359
- nulls: 0 (0.0%)
- unique: 4
- top_value: (empty string)
- top_rate: 0.9997
- cardinality: 4
- entropy: 0.003939
- entropy_ratio: 0.001969
url
categorical metadata imbalance
This column is labelled 'url' but contains a single value, an empty string, across all 11,359 rows. Cardinality is 1, entropy is 0, and the top_rate is 1.0, so it carries no information whatsoever. Likely a placeholder field that was never populated during ingestion. Treatment: Drop; the column is constant and has zero predictive value.
- n: 11,359
- nulls: 0 (0.0%)
- unique: 1
- top_value: (empty string)
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
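The drop recommendations for journal, publisher, editor, and url all rest on a very high top_rate. A minimal sketch of that rule follows, assuming rows are available as a list of dicts; the function name and the 0.99 threshold are illustrative choices, not values taken from the report.

```python
from collections import Counter


def near_constant_columns(rows, threshold=0.99):
    """Return names of columns whose most common value covers at least
    `threshold` of the rows. Rows are dicts sharing the same keys."""
    if not rows:
        return []
    flagged = []
    for name in rows[0]:
        counts = Counter(row[name] for row in rows)
        top_count = counts.most_common(1)[0][1]
        if top_count / len(rows) >= threshold:
            flagged.append(name)
    return flagged


# Toy rows: unique keys, varied years, and a url column that is always empty.
toy_rows = [{"key": str(i), "year": str(1990 + i % 30), "url": ""}
            for i in range(200)]
to_drop = near_constant_columns(toy_rows)
```

With the top_rate values reported here (0.9996, 0.9987, 0.9997, and 1.0), a 0.99 threshold would flag all four sparse columns, while year at 0.1144 would be kept.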
citation
text metadata multilingual duplicates
Bibliographic citation strings for linguistic sources, mostly English (3,590) with substantial French (401), Chinese (335), and German (249) entries. Heavy duplication is the headline: 6,181 duplicates (54.4%) across only 5,178 unique values, with the top value 'Anon. n.d.. http://www.africamuseum.be/...' repeating 462 times. URLs appear in 13.3% of rows and Flesch readability is low (29.3), consistent with reference-style text rather than prose. Treatment: Normalize and deduplicate to a citation lookup table, then reference by key.
- n: 11,359
- nulls: 0 (0.0%)
- unique: 5,178
- len_min: 8
- len_max: 142
- len_mean: 67.94
- len_median: 71
- len_p95: 77
- word_mean: 8.827
- word_median: 10
- n_empty: 0
- n_duplicates: 6,181
- duplicate_rate: 0.5442
- vocab_size: 15,535
- readability_flesch_mean: 29.26
- emoji_rate: 0
- url_rate: 0.1334
- one_word_rate: 0
- allcaps_rate: 0.06004
- boilerplate_rate: 0
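The lookup-table treatment could be sketched as follows. Here normalization only collapses runs of whitespace, which is an assumption: real citation cleanup may need more, such as punctuation or case handling. The function name, the integer ids, and the sample strings are all made up for illustration.

```python
def build_citation_lookup(rows):
    """Split rows into a citation lookup and per-row references.

    Returns (lookup, refs): lookup maps each normalized citation string to
    an integer id, and refs lists (row key, citation id) pairs.
    """
    lookup = {}
    refs = []
    for row in rows:
        normalized = " ".join(row["citation"].split())
        # Assign the next free id the first time a citation is seen.
        citation_id = lookup.setdefault(normalized, len(lookup))
        refs.append((row["key"], citation_id))
    return lookup, refs


# Placeholder rows; the citation strings are invented, not from the dataset.
toy_rows = [
    {"key": "1", "citation": "Source A.  2001"},
    {"key": "2", "citation": "Source A. 2001"},
    {"key": "3", "citation": "Source B. n.d."},
]
lookup, refs = build_citation_lookup(toy_rows)
```

On the full column this should shrink 11,359 strings to at most the 5,178 unique values reported, and possibly fewer once whitespace variants collapse.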