aif 2022
Reading
This dataset catalogues 4,970 historical letters (the PCEEC corpus metadata) in 13 columns describing each letter's reference code, author, recipient, their genders, dates of birth, social roles (the API columns), and kinship relations. The social skew is striking: authors are 83% male and 17% female, and recipients are 82% male and 18% female, so any analysis of women's correspondence works from a much smaller base. Roles and relations are concentrated too: 'SIR' tops both the author and recipient API fields, and 'FRIEND', 'BROTHER', and 'SON' are the most frequent kinship labels, though both API fields have long tails of 250+ distinct values worth scanning. Note also that 'Order of Gardiner letters in file' is 98.8% null (it applies only to a 58-letter subset) and 'Change from 2006?' is 95.5% 'ok', so neither carries much analytic signal.
citing: row_count · column_count · Author gender · Recipient gender · Author API · Recipient API · Relation to author · Relation to recipient · Order of Gardiner letters in file · Change from 2006?
Charts the summary said to look at first
Author gender: value counts
| value | count | share |
|---|---|---|
| MALE | 4130 | 83.1% |
| FEMALE | 840 | 16.9% |
Recipient gender: value counts
| value | count | share |
|---|---|---|
| MALE | 4074 | 82.0% |
| FEMALE | 892 | 17.9% |
| MALE/FEMALE | 4 | 0.1% |
Relation to recipient: top 20 values (share of all rows)
| value | count | share |
|---|---|---|
| FRIEND | 387 | 7.8% |
| BROTHER | 324 | 6.5% |
| SON | 295 | 5.9% |
| BROTHER-IN-LAW | 192 | 3.9% |
| KIN | 188 | 3.8% |
| WIFE | 160 | 3.2% |
| MOTHER | 160 | 3.2% |
| FATHER | 138 | 2.8% |
| HUSBAND | 126 | 2.5% |
| COUSIN | 115 | 2.3% |
| MOTHER-IN-LAW | 81 | 1.6% |
| FUTURE_HUSBAND | 78 | 1.6% |
| SISTER | 67 | 1.3% |
| UNCLE | 59 | 1.2% |
| FAMILY_SERVANT | 53 | 1.1% |
| FATHER-IN-LAW | 43 | 0.9% |
| NEPHEW | 29 | 0.6% |
| SON-IN-LAW | 27 | 0.5% |
| SISTER-IN-LAW | 26 | 0.5% |
| AUNT | 25 | 0.5% |
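The share column in the table above uses all 4,970 rows as its denominator, while the profiler's top_rate for 'Relation to recipient' (0.14) appears to use non-null rows only. The sketch below reproduces both figures for FRIEND from counts quoted in this report; the variable names are illustrative, and the link between this table and the 'Relation to recipient' column is inferred from the matching count.

```python
# Counts quoted in this report; variable names are illustrative.
total_rows = 4970
null_rows = 2205      # nulls reported for 'Relation to recipient'
friend_rows = 387     # FRIEND count from the table above

# Denominator 1: every row, as in the table's share column.
share_of_all = friend_rows / total_rows
# Denominator 2: non-null rows only, as top_rate appears to be computed.
share_of_known = friend_rows / (total_rows - null_rows)

print(f"{share_of_all:.1%} of all rows, {share_of_known:.1%} of non-null rows")
# prints "7.8% of all rows, 14.0% of non-null rows"
```

When comparing a table share against a top_rate elsewhere in the report, it helps to check which denominator each one uses.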
Author API: top 20 values (share of all rows)
| value | count | share |
|---|---|---|
| SIR | 560 | 11.3% |
| LADY | 278 | 5.6% |
| MERCHANT | 149 | 3.0% |
| KING_OF_ENGLAND | 140 | 2.8% |
| 1ST_EARL_OF_CLARE/POLITICIAN(DNB) | 136 | 2.7% |
| BISHOP_OF_WINCHESTER | 111 | 2.2% |
| CLERK | 103 | 2.1% |
| EARL_OF_ESSEX/ROYAL_MINISTER(DNB) | 93 | 1.9% |
| SIR/LOCAL_POLITICIAN(DNB) | 79 | 1.6% |
| BISHOP_OF_NORWICH | 77 | 1.5% |
| 1ST_EARL_OF_STRAFFORD/LORD_LIEUTENANT_OF_IRELAND | 67 | 1.3% |
| PUBLIC_SERVANT | 58 | 1.2% |
| COLONEL | 55 | 1.1% |
| 1ST_LORD_BURGHLEY/ROYAL_MINISTER(DNB) | 48 | 1.0% |
| EARL_OF_LEICESTER/COURTIER/MAGNATE(DNB) | 47 | 0.9% |
| 2ND_EARL_OF_ARUNDEL_AND_SURREY/POLITICIAN(DNB) | 44 | 0.9% |
| 3RD_EARL_OF_DERBY | 40 | 0.8% |
| SIR/NATURAL_PHILOSOPHER/ADMINISTRATOR(DNB) | 40 | 0.8% |
| SIR/2ND_BART/SCHOLAR/POLITICIAN(DNB) | 38 | 0.8% |
| SIR/LORD_CHANCELLOR(DNB) | 38 | 0.8% |
Recipient API: top 20 values (share of all rows)
| value | count | share |
|---|---|---|
| SIR | 542 | 10.9% |
| LADY | 444 | 8.9% |
| SIR/LOCAL_POLITICIAN(DNB) | 253 | 5.1% |
| MERCHANT | 154 | 3.1% |
| SIR/ANTIQUARY | 100 | 2.0% |
| BARONET/DIPLOMAT/AUTHOR(DNB) | 85 | 1.7% |
| SIR/MERCHANT | 82 | 1.6% |
| BISHOP_OF_DURHAM/LORD_CHANCELLOR | 75 | 1.5% |
| ROYAL_MINISTER/ARCHBISHOP_OF_YORK/CARDINAL(DNB) | 71 | 1.4% |
| KING_OF_ENGLAND | 68 | 1.4% |
| BISHOP_OF_WINCHESTER | 68 | 1.4% |
| 1ST_EARL_OF_CUMBERLAND | 64 | 1.3% |
| VISCOUNT | 63 | 1.3% |
| BISHOP_OF_DURHAM | 59 | 1.2% |
| CAPTAIN | 55 | 1.1% |
| EARL_OF_LEICESTER/COURTIER/MAGNATE(DNB) | 49 | 1.0% |
| SIR/PRINCIPLE_SECRETARY(DNB) | 45 | 0.9% |
| VISCOUNTESS | 45 | 0.9% |
| SIR/LORD_KEEPER_OF_THE_GREAT_SEAL | 43 | 0.9% |
| VISCOUNT_DORCHESTER/DIPLOMNAT(DNB) | 43 | 0.9% |
Schema
13 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| Letter reference | text | 0.0% | 4,970 | near_unique, one_word, allcaps, short_text |
| Author name | categorical | 0.0% | 695 | |
| Author API | categorical | 25.0% | 252 | null_rate |
| Author gender | categorical | 0.0% | 2 | |
| Author DOB | categorical | 31.2% | 217 | null_rate |
| Relation to recipient | categorical | 44.4% | 45 | null_rate |
| Recipient | categorical | 0.0% | 623 | long_tail |
| Recipient API | categorical | 18.9% | 326 | |
| Recipient gender | categorical | 0.0% | 3 | |
| Recipient DOB | categorical | 19.6% | 210 | |
| Relation to author | categorical | 46.2% | 43 | null_rate |
| Change from 2006? | categorical | 0.0% | 4 | imbalance |
| Order of Gardiner letters in file | numeric | 98.8% | 58 | null_rate |
Letter reference
text identifier near_unique one_word allcaps short_text
This column holds a unique letter reference code, formatted as an all-caps single token combining a name and a zero-padded sequence number (e.g. ALLEN_001, ARUNDEL_006). All 4,970 rows are distinct with no nulls or duplicates, and lengths fall between 7 and 11 characters. The vocabulary size equals the row count, confirming this is a primary identifier rather than a feature. Treatment: Use as a row key for joins; exclude from modelling features.
- n: 4,970
- nulls: 0 (0.0%)
- unique: 4,970
- len_min: 7
- len_max: 11
- len_mean: 10.09
- len_median: 10
- len_p95: 11
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 4,970
- readability_flesch_mean: 49.31
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
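Before using the reference code as a join key, it may be worth confirming uniqueness and format on the actual file. The sketch below is one way to do that; the regular expression is an assumption inferred from the two examples quoted above (ALLEN_001, ARUNDEL_006), and `check_row_keys` is a hypothetical helper, not part of any profiler.

```python
import re

# Assumed pattern: uppercase name token, underscore, digits.
# Inferred from ALLEN_001 and ARUNDEL_006 only; adjust if real codes differ.
REF_PATTERN = re.compile(r"[A-Z0-9]+_\d+")

def check_row_keys(refs):
    """Return (duplicates, malformed) for a list of reference codes."""
    seen, duplicates, malformed = set(), [], []
    for ref in refs:
        if ref in seen:
            duplicates.append(ref)      # second and later occurrences
        seen.add(ref)
        if not REF_PATTERN.fullmatch(ref):
            malformed.append(ref)       # does not fit the assumed pattern
    return duplicates, malformed

duplicates, malformed = check_row_keys(
    ["ALLEN_001", "ARUNDEL_006", "ALLEN_001", "allen_2"]
)
print(duplicates, malformed)
```

Both lists should be empty on the full column if the profile's figures (0 duplicates, allcaps_rate 1) hold.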
Author name
categorical feature
Categorical column naming letter authors, with 695 distinct individuals across 4,970 rows and zero nulls. Distribution is moderately concentrated: JOHN_HOLLES_SR tops the list at 136 occurrences (2.7%), followed by THOMAS_CROMWELL (93) and DOROTHY_OSBORNE/TEMPLE (85), and an entropy ratio of 0.87 indicates a fairly even spread across the long tail. The naming convention uses uppercase tokens with underscores, and some entries carry annotations like [N.MAUTBY] or compound surnames (OSBORNE/TEMPLE) that an analyst should normalise before grouping. Treatment: Normalise bracketed annotations and treat as a high-cardinality categorical (target-encode or group rare authors).
- n: 4,970
- nulls: 0 (0.0%)
- unique: 695
- top_value: JOHN_HOLLES_SR
- top_rate: 0.02736
- cardinality: 695
- entropy: 8.192
- entropy_ratio: 0.8677
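One possible way to normalise the bracketed annotations is to move them into a separate field so that grouping works on the bare name. The sketch below assumes annotations always appear in square brackets; `normalise_name` is a hypothetical helper, and the bracketed example value is taken from the Recipient column of this report, which follows the same convention.

```python
import re

# Matches one bracketed annotation such as [N.MEAUTYS].
BRACKET = re.compile(r"\[([^\]]*)\]")

def normalise_name(raw):
    """Split a name into (base_name, annotations), dropping the brackets."""
    annotations = BRACKET.findall(raw)
    base = BRACKET.sub("", raw).strip("_ ")
    return base, annotations

print(normalise_name("JANE_CORNWALLIS/BACON[N.MEAUTYS]"))
print(normalise_name("DOROTHY_OSBORNE/TEMPLE"))
```

Compound surnames such as OSBORNE/TEMPLE are left intact here; whether to split them is a separate decision that depends on how the corpus encodes name changes.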
Author API
categorical metadata null_rate
Appears to be an authority/role tag for the author of each record, with 252 distinct titles or office descriptors (e.g. SIR, LADY, MERCHANT, KING_OF_ENGLAND, BISHOP_OF_WINCHESTER), consistent with historical/prosopographical data. Distribution is moderately diffuse (entropy ratio 0.76; the top value SIR covers only 15.0% of non-null rows), but a quarter of rows are null (null_rate 0.2503) and several values mix role with DNB-style annotations like '1ST_EARL_OF_CLARE/POLITICIAN(DNB)', indicating inconsistent encoding. Treatment: Normalise the compound 'ROLE/SUBROLE(DNB)' strings and treat missingness explicitly before using as a categorical feature.
- n: 4,970
- nulls: 1,244 (25.0%)
- unique: 252
- top_value: SIR
- top_rate: 0.1503
- cardinality: 252
- entropy: 6.06
- entropy_ratio: 0.7596
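A minimal sketch of the normalisation suggested above, assuming roles are separated by '/' and that a trailing '(DNB)' is a marker rather than part of the role name. The report does not define that marker, so it is only recorded as a boolean flag; `parse_api` is a hypothetical helper.

```python
def parse_api(raw):
    """Split a compound role string into (roles, has_dnb_marker).

    Missing values (None or empty string) yield an empty role list so
    that missingness stays explicit rather than becoming a fake role.
    """
    if raw is None or raw == "":
        return [], False
    has_dnb = raw.endswith("(DNB)")
    if has_dnb:
        raw = raw[: -len("(DNB)")]     # drop the marker, keep the roles
    roles = [part for part in raw.split("/") if part]
    return roles, has_dnb

print(parse_api("1ST_EARL_OF_CLARE/POLITICIAN(DNB)"))
print(parse_api("SIR"))
print(parse_api(None))
```

The resulting role lists can then be encoded per role instead of per compound string, which should reduce the 252 distinct values considerably.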
Author gender
categorical feature
Binary gender label for the author of each record, fully populated across all 4,970 rows. The split is heavily skewed: MALE accounts for 83.1% (4,130) versus FEMALE at 16.9% (840), giving an entropy ratio of 0.655. No nulls or unexpected categories appear. Treatment: One-hot or binary-encode; consider class-imbalance handling if used as a stratifier or predictor.
- n: 4,970
- nulls: 0 (0.0%)
- unique: 2
- top_value: MALE
- top_rate: 0.831
- cardinality: 2
- entropy: 0.6554
- entropy_ratio: 0.6554
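The entropy figures in this report are consistent with Shannon entropy in bits, with entropy_ratio obtained by dividing by log2 of the number of categories. That is an inference from the reported numbers, not a documented definition. The sketch below recomputes both gender columns from the counts given in the tables.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

author_gender = [4130, 840]          # MALE, FEMALE
recipient_gender = [4074, 892, 4]    # MALE, FEMALE, MALE/FEMALE

print(round(entropy_bits(author_gender), 4), round(entropy_ratio(author_gender), 4))
print(round(entropy_bits(recipient_gender), 4), round(entropy_ratio(recipient_gender), 4))
```

With two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for the author column but not for the three-valued recipient column.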
Author DOB
categorical metadata null_rate
This appears to be the author's year of birth, stored as a categorical string rather than a number. Many values carry a trailing '?' (e.g. '1565?', '1546?') indicating uncertain dates from historical records. 31.17% of rows are null and the top value covers only 3.98% of non-null entries, with 217 distinct values and high entropy (ratio 0.87) suggesting a wide spread across the early modern centuries. The mix of clean years ('1633', '1593') and questioned years will break naive numeric parsing. Treatment: Strip '?' markers and cast to integer year (with an 'uncertain' flag column) before any temporal analysis.
- n: 4,970
- nulls: 1,549 (31.2%)
- unique: 217
- top_value: 1565?
- top_rate: 0.03975
- cardinality: 217
- entropy: 6.746
- entropy_ratio: 0.8692
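The treatment above can be sketched as a small parser that returns the year together with an uncertainty flag. It assumes the only non-numeric decoration is a trailing '?'; anything else is left as missing rather than guessed. `parse_year` is a hypothetical helper.

```python
def parse_year(raw):
    """Return (year, uncertain) from strings like '1565?' or '1633'.

    Nulls and unparseable values give year None; the uncertain flag
    records whether a trailing '?' was present.
    """
    if raw is None:
        return None, False
    text = raw.strip()
    uncertain = text.endswith("?")
    digits = text.rstrip("?").strip()
    if not digits.isdigit():
        return None, uncertain    # leave unexpected formats as missing
    return int(digits), uncertain

for value in ["1565?", "1633", None, ""]:
    print(repr(value), "->", parse_year(value))
```

Keeping the flag as its own column lets later analysis include or exclude uncertain dates without re-parsing the strings.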
Relation to recipient
categorical feature null_rate
Categorical describing the relationship between a letter's author and its recipient, with 45 distinct kinship or social labels such as FRIEND, BROTHER, SON, KIN, and FAMILY_SERVANT. The distribution is fairly flat: the top value FRIEND covers only 13.99% of non-null rows and the entropy ratio is 0.77, so no single relation dominates. Most striking is the 44.37% null rate, meaning nearly half of records have no recorded relation. Treatment: Impute nulls as an explicit 'UNKNOWN' category and consider grouping the 45 levels into broader buckets (immediate family, extended kin, non-kin) before modelling.
- n: 4,970
- nulls: 2,205 (44.4%)
- unique: 45
- top_value: FRIEND
- top_rate: 0.14
- cardinality: 45
- entropy: 4.211
- entropy_ratio: 0.7668
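As a rough illustration of the bucketing idea, the sketch below maps a handful of the labels listed in this report to three broad groups and sends nulls to 'UNKNOWN'. The bucket assignments are hypothetical and cover only some of the 45 levels; the real mapping is a modelling decision, and in-law relations in particular could reasonably go either way.

```python
# Hypothetical bucket assignments for some labels seen in this report.
BUCKETS = {
    "BROTHER": "immediate_family", "SISTER": "immediate_family",
    "SON": "immediate_family", "MOTHER": "immediate_family",
    "FATHER": "immediate_family", "WIFE": "immediate_family",
    "HUSBAND": "immediate_family",
    "KIN": "extended_kin", "COUSIN": "extended_kin",
    "UNCLE": "extended_kin", "BROTHER-IN-LAW": "extended_kin",
    "FRIEND": "non_kin", "FAMILY_SERVANT": "non_kin",
}

def bucket_relation(raw):
    """Map a relation label to a broad bucket; nulls become 'UNKNOWN'."""
    if raw is None or raw == "":
        return "UNKNOWN"
    # Labels not yet assigned fall into a visible catch-all bucket.
    return BUCKETS.get(raw, "unmapped")

print([bucket_relation(v) for v in ["FRIEND", "SON", None, "FUTURE_HUSBAND"]])
```

Using a visible catch-all for unassigned labels makes it easy to count how much of the column the mapping still misses.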
Recipient
categorical label long_tail
Recipient is a categorical field naming the addressee of each letter, with 623 distinct names across 4,970 rows and no nulls. The distribution has a long tail but is concentrated on a few major figures: JOHN_PASTON_I tops at 5.27% (262), followed by NATHANIEL_BACON_I (251), JOAN_BARRINGTON (182), and JANE_CORNWALLIS/BACON[N.MEAUTYS] (180). A high entropy ratio (0.79) confirms breadth, while the naming convention (uppercase with underscores, roman numerals, bracketed aliases) suggests a curated historical-letters dataset. Treatment: Group rare recipients into an 'other' bucket before encoding, given the long tail across 623 categories.
- n: 4,970
- nulls: 0 (0.0%)
- unique: 623
- top_value: JOHN_PASTON_I
- top_rate: 0.05272
- cardinality: 623
- entropy: 7.351
- entropy_ratio: 0.7918
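Grouping rare recipients can be done with a simple frequency threshold. The sketch below is one option; the threshold `min_count` and the label `OTHER` are arbitrary illustrative choices, and `RARE_RECIPIENT` is a placeholder rather than a real corpus value.

```python
from collections import Counter

def group_rare(values, min_count=5, other_label="OTHER"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Tiny illustrative sample; RARE_RECIPIENT is a placeholder name.
sample = ["JOHN_PASTON_I"] * 3 + ["NATHANIEL_BACON_I"] * 2 + ["RARE_RECIPIENT"]
print(group_rare(sample, min_count=2))
```

The threshold should be set on the training split only if the encoded column feeds a model, so that rare-category decisions do not leak from held-out data.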
Recipient API
categorical feature
This column appears to capture the recipient's title or role (API) descriptor, with 326 distinct values across 4,970 rows. The distribution is moderately diverse (entropy_ratio 0.73): the most common value 'SIR' covers only 13.4% of non-null records, followed by 'LADY' and compound role strings like 'SIR/LOCAL_POLITICIAN(DNB)'. Notable surprises: 18.9% nulls, and many values are concatenated multi-role strings rather than atomic titles, suggesting inconsistent encoding. Treatment: Split compound '/'-delimited roles and one-hot or target-encode atomic titles before modelling.
- n: 4,970
- nulls: 940 (18.9%)
- unique: 326
- top_value: SIR
- top_rate: 0.1345
- cardinality: 326
- entropy: 6.128
- entropy_ratio: 0.734
Recipient gender
categorical feature
Categorical recipient gender field with three values and no nulls across 4,970 rows. Heavily skewed toward MALE at 81.97% (4,074), with FEMALE at 892 and a rare MALE/FEMALE combination at just 4 records. A low entropy ratio of 0.43 confirms the imbalance, and the mixed category is small enough to need a handling decision. Treatment: One-hot encode and decide whether to merge or drop the 4 MALE/FEMALE rows before modelling.
- n: 4,970
- nulls: 0 (0.0%)
- unique: 3
- top_value: MALE
- top_rate: 0.8197
- cardinality: 3
- entropy: 0.6881
- entropy_ratio: 0.4342
Recipient DOB
categorical timestamp
This is the recipient's date of birth recorded as a year only, with values clustered between the 15th and 17th centuries (e.g. 1421, 1546?, 1581), consistent with a historical correspondence or archival dataset. Roughly 19.6% of rows are null and many entries carry trailing '?' marks indicating archivist uncertainty, which inflates cardinality (210 distinct values for 4,970 rows) and means '1546?' and '1546' would be treated as different categories. An entropy ratio of 0.82 shows the distribution is fairly spread across years rather than dominated by one cohort, with the top year '1421' covering only 6.6% of non-null rows. Treatment: Strip '?' uncertainty markers, cast to integer year, and bucket into centuries or decades before modelling.
- n: 4,970
- nulls: 975 (19.6%)
- unique: 210
- top_value: 1421
- top_rate: 0.06608
- cardinality: 210
- entropy: 6.31
- entropy_ratio: 0.8179
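Once the year is an integer, the bucketing step is simple arithmetic. The sketch below assumes the conventional definition in which the 16th century runs from 1501 to 1600; `year_buckets` is a hypothetical helper and expects a year that has already had any '?' marker removed.

```python
def year_buckets(year):
    """Return (decade_start, century_number) for an integer year."""
    decade = (year // 10) * 10         # 1581 -> 1580
    century = (year - 1) // 100 + 1    # 1581 -> 16, 1600 -> 16, 1601 -> 17
    return decade, century

for year in (1421, 1581, 1600, 1601):
    print(year, year_buckets(year))
```

Decades keep more temporal resolution, while centuries give only three or so levels for this corpus; which one suits depends on how many letters fall in each bucket.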
Change from 2006?
categorical metadata imbalance
A data-quality flag tracking whether each row changed since 2006, with four categories dominated by 'ok' at 95.5% of 4,970 rows. The remaining values ('corrected', 'corrected in spreadsheet', 'ok sic') indicate manual review notes, and the inconsistent labels suggest free-form curator entries rather than a controlled vocabulary. An entropy ratio of 0.15 confirms severe imbalance. Treatment: Drop or collapse to a binary corrected/ok flag; too imbalanced to be a useful feature.
- n: 4,970
- nulls: 0 (0.0%)
- unique: 4
- top_value: ok
- top_rate: 0.9549
- cardinality: 4
- entropy: 0.3025
- entropy_ratio: 0.1513
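Collapsing the four labels to a binary flag could look like the sketch below. It assumes that both 'corrected' variants mean the row changed and that 'ok sic' counts as unchanged; the report does not define these labels, so that reading should be checked with the curators. `was_corrected` is a hypothetical helper.

```python
def was_corrected(raw):
    """True for the 'corrected...' labels, False for 'ok' and 'ok sic'."""
    return raw.strip().lower().startswith("corrected")

labels = ["ok", "corrected", "corrected in spreadsheet", "ok sic"]
print({label: was_corrected(label) for label in labels})
```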
Order of Gardiner letters in file
numeric metadata null_rate
This column appears to be an ordinal index giving each letter in the corpus's Gardiner subset its position within the source file, running from 1 to 58 with a perfectly symmetric distribution (mean and median both 29.5, skew 0). Here "letters" means items of correspondence, not hieroglyph signs. The striking signal is a 98.83% null rate: only 58 of 4,970 rows carry a value, exactly matching n_unique, so each populated row holds a distinct position and the column is irrelevant to the rest of the table. No outliers and a uniform spread confirm it is a sequence, not a measurement. Treatment: Drop from modelling; retain only if you need to preserve the original ordering of the Gardiner letters, keyed on Letter reference.
- n: 4,970
- nulls: 4,912 (98.8%)
- unique: 58
- min: 1
- max: 58
- mean: 29.5
- median: 29.5
- std: 16.89
- q1: 15.25
- q3: 43.75
- iqr: 28.5
- skew: 0
- kurtosis: -1.201
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
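The reported statistics are what one would expect if the 58 populated values are exactly the integers 1 to 58, each appearing once. The sketch below checks that reading with Python's standard library; it assumes the profiler reports the sample standard deviation and linearly interpolated quartiles, which is inferred from the numbers rather than documented.

```python
import statistics

# Assumed content of the populated cells: the sequence 1..58.
order = list(range(1, 59))

mean = statistics.mean(order)
median = statistics.median(order)
stdev = statistics.stdev(order)    # sample standard deviation (n - 1)
# The inclusive method interpolates linearly between data points.
q1, _, q3 = statistics.quantiles(order, n=4, method="inclusive")

print(mean, median, round(stdev, 2), q1, q3, q3 - q1)
```

If the real column reproduces these values, it confirms the sequence interpretation and supports dropping the column from modelling.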