aif 2022

source /home/coolhand/servers/diachronica/corpus/historical-corpora/pceec/data/aif_2022.csv · 4,970 rows · 13 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogues 4,970 historical letters (the PCEEC corpus metadata), with 13 columns describing each letter's reference code, author, recipient, their genders, dates of birth, social roles (API), and kinship relations. The gender skew is striking: authors are 83% male versus 17% female, and recipients are 82% male versus 18% female, so any analysis of women's correspondence will work from a much smaller base (840 letters by female authors). Roles and relations are concentrated at the top too: 'SIR' is the most common value in both the author and recipient API fields, and 'FRIEND', 'BROTHER', and 'SON' lead the kinship columns, though both API fields have long tails (252 and 326 distinct values) worth scanning. Note also that 'Order of Gardiner letters in file' is 98.8% null (only relevant to a 58-letter subset) and 'Change from 2006?' is 95.5% 'ok', so neither carries much analytic signal.

citing: row_count · column_count · Author gender · Recipient gender · Author API · Recipient API · Relation to author · Relation to recipient · Order of Gardiner letters in file · Change from 2006?

Charts the summary said to look at first

Author gender · Shows the strong male skew among letter authors (83% male, 17% female).

Top values for Author gender (2 unique shown, of 2 total).
value	count	share
MALE	4130	83.1%
FEMALE	840	16.9%

Recipient gender · Recipients are similarly male-dominated; check the small 'MALE/FEMALE' slice for joint-addressed letters.

Top values for Recipient gender (3 unique shown, of 3 total).
value	count	share
MALE	4074	82.0%
FEMALE	892	17.9%
MALE/FEMALE	4	0.1%

Relation to author · FRIEND, BROTHER, and SON lead — useful for seeing what kinship ties drove correspondence.

Top values for Relation to author (20 unique shown, of 43 total).
value	count	share
FRIEND	387	7.8%
BROTHER	324	6.5%
SON	295	5.9%
BROTHER-IN-LAW	192	3.9%
KIN	188	3.8%
WIFE	160	3.2%
MOTHER	160	3.2%
FATHER	138	2.8%
HUSBAND	126	2.5%
COUSIN	115	2.3%
MOTHER-IN-LAW	81	1.6%
FUTURE_HUSBAND	78	1.6%
SISTER	67	1.3%
UNCLE	59	1.2%
FAMILY_SERVANT	53	1.1%
FATHER-IN-LAW	43	0.9%
NEPHEW	29	0.6%
SON-IN-LAW	27	0.5%
SISTER-IN-LAW	26	0.5%
AUNT	25	0.5%

Author API · Top social roles of authors (SIR, LADY, MERCHANT) reveal the elite tilt of the corpus.

Top values for Author API (20 unique shown, of 252 total).
value	count	share
SIR	560	11.3%
LADY	278	5.6%
MERCHANT	149	3.0%
KING_OF_ENGLAND	140	2.8%
1ST_EARL_OF_CLARE/POLITICIAN(DNB)	136	2.7%
BISHOP_OF_WINCHESTER	111	2.2%
CLERK	103	2.1%
EARL_OF_ESSEX/ROYAL_MINISTER(DNB)	93	1.9%
SIR/LOCAL_POLITICIAN(DNB)	79	1.6%
BISHOP_OF_NORWICH	77	1.5%
1ST_EARL_OF_STRAFFORD/LORD_LIEUTENANT_OF_IRELAND	67	1.3%
PUBLIC_SERVANT	58	1.2%
COLONEL	55	1.1%
1ST_LORD_BURGHLEY/ROYAL_MINISTER(DNB)	48	1.0%
EARL_OF_LEICESTER/COURTIER/MAGNATE(DNB)	47	0.9%
2ND_EARL_OF_ARUNDEL_AND_SURREY/POLITICIAN(DNB)	44	0.9%
3RD_EARL_OF_DERBY	40	0.8%
SIR/NATURAL_PHILOSOPHER/ADMINISTRATOR(DNB)	40	0.8%
SIR/2ND_BART/SCHOLAR/POLITICIAN(DNB)	38	0.8%
SIR/LORD_CHANCELLOR(DNB)	38	0.8%

Recipient API · Compare against author roles to see whether letters mostly travelled within the same social strata.

Top values for Recipient API (20 unique shown, of 326 total).
value	count	share
SIR	542	10.9%
LADY	444	8.9%
SIR/LOCAL_POLITICIAN(DNB)	253	5.1%
MERCHANT	154	3.1%
SIR/ANTIQUARY	100	2.0%
BARONET/DIPLOMAT/AUTHOR(DNB)	85	1.7%
SIR/MERCHANT	82	1.6%
BISHOP_OF_DURHAM/LORD_CHANCELLOR	75	1.5%
ROYAL_MINISTER/ARCHBISHOP_OF_YORK/CARDINAL(DNB)	71	1.4%
KING_OF_ENGLAND	68	1.4%
BISHOP_OF_WINCHESTER	68	1.4%
1ST_EARL_OF_CUMBERLAND	64	1.3%
VISCOUNT	63	1.3%
BISHOP_OF_DURHAM	59	1.2%
CAPTAIN	55	1.1%
EARL_OF_LEICESTER/COURTIER/MAGNATE(DNB)	49	1.0%
SIR/PRINCIPLE_SECRETARY(DNB)	45	0.9%
VISCOUNTESS	45	0.9%
SIR/LORD_KEEPER_OF_THE_GREAT_SEAL	43	0.9%
VISCOUNT_DORCHESTER/DIPLOMNAT(DNB)	43	0.9%

Schema

13 columns

Per-column summary.
Column	Type	Nulls	Unique	Alerts
Letter reference	text	0.0%	4,970	near_unique one_word allcaps short_text
Author name	categorical	0.0%	695
Author API	categorical	25.0%	252	null_rate
Author gender	categorical	0.0%	2
Author DOB	categorical	31.2%	217	null_rate
Relation to recipient	categorical	44.4%	45	null_rate
Recipient	categorical	0.0%	623	long_tail
Recipient API	categorical	18.9%	326
Recipient gender	categorical	0.0%	3
Recipient DOB	categorical	19.6%	210
Relation to author	categorical	46.2%	43	null_rate
Change from 2006?	categorical	0.0%	4	imbalance
Order of Gardiner letters in file	numeric	98.8%	58	null_rate

Letter reference

text identifier near_unique one_word allcaps short_text

This column holds a unique letter reference code, formatted as an all-caps single token combining a name and a zero-padded sequence number (e.g. ALLEN_001, ARUNDEL_006). Every one of the 4970 rows is distinct with no nulls or duplicates, and lengths cluster tightly between 7 and 11 characters. The vocabulary equals the row count, confirming this is a primary identifier rather than a feature. Treatment: Use as a row key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 0 (0.0%)
unique: 4,970
len_min: 7
len_max: 11
len_mean: 10.09
len_median: 10
len_p95: 11
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 4,970
readability_flesch_mean: 49.31
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

Author name

categorical feature

Categorical column naming letter authors, with 695 distinct individuals across 4,970 rows and zero nulls. The distribution is only mildly concentrated: JOHN_HOLLES_SR tops the list at 136 occurrences (2.7%), followed by THOMAS_CROMWELL (93) and DOROTHY_OSBORNE/TEMPLE (85), and an entropy ratio of 0.87 indicates a fairly even spread across the long tail. The naming convention uses uppercase tokens with underscores, and some entries carry bracketed annotations like [N.MAUTBY] or compound surnames (OSBORNE/TEMPLE) that an analyst should normalise before grouping. Treatment: Normalise bracketed annotations and treat as a high-cardinality categorical (target-encode or group rare authors). high · anthropic:claude-opus-4-7

n: 4,970
nulls: 0 (0.0%)
unique: 695
top_value: JOHN_HOLLES_SR
top_rate: 0.02736
cardinality: 695
entropy: 8.192
entropy_ratio: 0.8677
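
The normalisation step suggested above could look roughly like the following. This is a minimal sketch, assuming annotations always appear as a single square-bracketed suffix; `split_annotation` is a hypothetical helper name, not part of the profiler. Compound surnames such as OSBORNE/TEMPLE are deliberately left untouched, since merging them is an editorial decision.

```python
import re

# Matches one bracketed annotation, e.g. "[N.MEAUTYS]" (assumed format).
BRACKET = re.compile(r"\[([^\]]*)\]")

def split_annotation(name):
    """Return (clean_name, annotation), where annotation is None if absent."""
    match = BRACKET.search(name)
    annotation = match.group(1) if match else None
    clean = BRACKET.sub("", name).strip()
    return clean, annotation

print(split_annotation("JANE_CORNWALLIS/BACON[N.MEAUTYS]"))
print(split_annotation("JOHN_HOLLES_SR"))
```

Keeping the annotation as a separate value, rather than discarding it, preserves the information for later review.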

Author API

categorical metadata null_rate

Appears to be an authority/role tag for the author of each record, with 252 distinct titles or office descriptors (e.g. SIR, LADY, MERCHANT, KING_OF_ENGLAND, BISHOP_OF_WINCHESTER), suggesting historical/prosopographical data. The distribution is moderately diffuse (entropy ratio 0.76): the top value SIR accounts for 15.0% of non-null values, or 11.3% of all rows. A quarter of rows are null (null_rate 0.2503), and several values combine a title with further roles and a '(DNB)' suffix, as in '1ST_EARL_OF_CLARE/POLITICIAN(DNB)', indicating inconsistent encoding. Treatment: Normalise the compound 'ROLE/SUBROLE(DNB)' strings and treat missingness explicitly before using as a categorical feature. medium · anthropic:claude-opus-4-7

n: 4,970
nulls: 1,244 (25.0%)
unique: 252
top_value: SIR
top_rate: 0.1503
cardinality: 252
entropy: 6.06
entropy_ratio: 0.7596
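
The top_rate of 0.1503 here and the 11.3% share shown for SIR in the Author API chart table are both consistent with the same count of 560. The difference appears to be the denominator: the first excludes nulls, the second does not. A short check using only figures from this report:

```python
n_rows = 4970      # total rows
n_null = 1244      # null Author API values
sir_count = 560    # rows where Author API is SIR

share_of_all_rows = sir_count / n_rows
share_of_non_null = sir_count / (n_rows - n_null)

print(f"share of all rows: {share_of_all_rows:.1%}")   # 11.3%
print(f"share of non-null: {share_of_non_null:.1%}")   # 15.0%
```

The same pattern seems to hold for the other columns with nulls, so a top_rate should be read as a share of populated values.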

Author gender

categorical feature

Binary gender label for the author of each record, fully populated across all 4970 rows. The split is heavily skewed: MALE accounts for 83.1% (4130) versus FEMALE at 840, giving an entropy ratio of 0.655. No nulls or unexpected categories appear. Treatment: One-hot or binary-encode; consider class-imbalance handling if used as a stratifier or predictor. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 0 (0.0%)
unique: 2
top_value: MALE
top_rate: 0.831
cardinality: 2
entropy: 0.6554
entropy_ratio: 0.6554
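
For readers unfamiliar with the entropy figures, they can be reproduced from the value counts. The sketch below assumes entropy is Shannon entropy in bits and that entropy_ratio divides it by log2 of the number of categories; the report does not state its formula, but this assumption reproduces the reported values for both gender columns to within rounding.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

author_gender = [4130, 840]        # MALE, FEMALE
recipient_gender = [4074, 892, 4]  # MALE, FEMALE, MALE/FEMALE

print(round(entropy_ratio(author_gender), 3))     # close to the reported 0.6554
print(round(entropy_ratio(recipient_gender), 3))  # close to the reported 0.4342
```

A ratio of 1 would mean all categories are equally frequent; values well below 1 indicate imbalance.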

Author DOB

categorical metadata null_rate

This appears to be the author's year of birth, stored as a categorical string rather than a number: many values carry a trailing '?' (e.g. '1565?', '1546?') indicating uncertain dates from historical records. 31.17% of rows are null and the top value covers only 3.98% of non-null entries, with 217 distinct values and high entropy (ratio 0.87) suggesting a wide spread across early modern centuries. The mix of clean years ('1633', '1593') and questioned years will break naive numeric parsing. Treatment: Strip '?' markers and cast to integer year (with an 'uncertain' flag column) before any temporal analysis. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 1,549 (31.2%)
unique: 217
top_value: 1565?
top_rate: 0.03975
cardinality: 217
entropy: 6.746
entropy_ratio: 0.8692
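
One way to apply the treatment above is to parse each value into a year plus an uncertainty flag. This is a sketch only: it assumes every populated value is a four-digit year with an optional trailing '?', which the samples suggest but the profile does not guarantee. `ParsedYear` and `parse_year` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

class ParsedYear(NamedTuple):
    year: Optional[int]   # None when the value is missing or unparseable
    uncertain: bool       # True when the source carried a '?' marker

# Assumed format: four digits, optionally followed by '?'.
YEAR = re.compile(r"^\s*(\d{4})\s*(\?)?\s*$")

def parse_year(raw):
    """Parse '1565?' -> (1565, True), '1633' -> (1633, False), null -> (None, False)."""
    if raw is None or not str(raw).strip():
        return ParsedYear(None, False)
    match = YEAR.match(str(raw))
    if match is None:
        # Unrecognised format: keep the row but flag it for manual review.
        return ParsedYear(None, True)
    return ParsedYear(int(match.group(1)), match.group(2) is not None)

print(parse_year("1565?"))
print(parse_year("1633"))
```

Flagging unrecognised values instead of raising keeps a full pass over the column from failing on a single odd entry.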

Relation to recipient

categorical feature null_rate

Categorical describing how the letter's author and recipient are related, with 45 distinct kinship or social labels such as FRIEND, BROTHER, SON, KIN, and FAMILY_SERVANT. The distribution is fairly flat: the top value FRIEND covers only 13.99% of non-null values and the entropy ratio is 0.77, so no single relation dominates. Most striking is the 44.37% null rate, meaning nearly half of records have no recorded relation. Treatment: Impute nulls as an explicit 'UNKNOWN' category and consider grouping the 45 levels into broader buckets (immediate family, extended kin, non-kin) before modelling. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 2,205 (44.4%)
unique: 45
top_value: FRIEND
top_rate: 0.14
cardinality: 45
entropy: 4.211
entropy_ratio: 0.7668
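
The bucketing suggested above might be sketched as follows. The assignment of labels to buckets is an illustrative assumption, not something the source defines, and it covers only labels that appear in this report; anything unlisted falls into 'OTHER' so it can be reviewed. `BUCKETS` and `bucket_relation` are hypothetical names.

```python
# Illustrative mapping only; adjust to the research question at hand.
BUCKETS = {
    "IMMEDIATE_FAMILY": {"BROTHER", "SISTER", "SON", "MOTHER", "FATHER",
                         "WIFE", "HUSBAND"},
    "EXTENDED_KIN": {"KIN", "COUSIN", "UNCLE", "AUNT", "NEPHEW",
                     "BROTHER-IN-LAW", "SISTER-IN-LAW", "MOTHER-IN-LAW",
                     "FATHER-IN-LAW", "SON-IN-LAW"},
    "NON_KIN": {"FRIEND", "FAMILY_SERVANT"},
}

def bucket_relation(value):
    """Map a relation label to a broad bucket; nulls become 'UNKNOWN'."""
    if value is None or not str(value).strip():
        return "UNKNOWN"
    label = str(value).strip().upper()
    for bucket, members in BUCKETS.items():
        if label in members:
            return bucket
    return "OTHER"  # e.g. FUTURE_HUSBAND, left for a manual decision

print(bucket_relation("FRIEND"), bucket_relation(None))
```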

Recipient

categorical label long_tail

Recipient is a categorical field naming the addressee of each record (likely letters in a historical correspondence corpus), with 623 distinct names across 4970 rows and no nulls. The distribution has a long tail but is concentrated on a few major figures: JOHN_PASTON_I tops at 5.27% (262), followed by NATHANIEL_BACON_I (251), JOAN_BARRINGTON (182), and JANE_CORNWALLIS/BACON[N.MEAUTYS] (180). High entropy ratio (0.79) confirms breadth, while the naming convention (uppercase with underscores, roman numerals, bracketed aliases) suggests a curated historical-letters dataset. Treatment: Group rare recipients into an 'other' bucket before encoding, given the long tail across 623 categories. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 0 (0.0%)
unique: 623
top_value: JOHN_PASTON_I
top_rate: 0.05272
cardinality: 623
entropy: 7.351
entropy_ratio: 0.7918
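
Grouping rare recipients, as the treatment suggests, can be done with a simple frequency threshold. A minimal sketch; the threshold and the toy sample below are illustrative assumptions, and `bucket_rare` is a hypothetical helper.

```python
from collections import Counter

def bucket_rare(values, min_count, other_label="OTHER"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy sample, not real counts from the corpus.
sample = ["JOHN_PASTON_I", "JOHN_PASTON_I", "JOHN_PASTON_I",
          "NATHANIEL_BACON_I", "NATHANIEL_BACON_I", "JOAN_BARRINGTON"]
print(bucket_rare(sample, min_count=2))
```

With 623 recipients, the threshold determines how many encoded categories remain, so it is worth checking the resulting category count before modelling.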

Recipient API

categorical feature

This column appears to capture the recipient's title or social role in correspondence records, with 326 distinct values across 4970 rows. The distribution is moderately diverse (entropy_ratio 0.73): the most common value 'SIR' covers only 13.4% of non-null records (10.9% of all rows), followed by 'LADY' and compound role strings like 'SIR/LOCAL_POLITICIAN(DNB)'. Notable surprises: 18.9% nulls, and many values are concatenated multi-role strings rather than atomic titles, suggesting inconsistent encoding. Treatment: Split compound '/'-delimited roles and one-hot or target-encode atomic titles before modelling. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 940 (18.9%)
unique: 326
top_value: SIR
top_rate: 0.1345
cardinality: 326
entropy: 6.128
entropy_ratio: 0.734
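
Splitting the compound strings could be approached as below. This sketch assumes '/' always separates roles and that a trailing parenthesised uppercase tag such as '(DNB)' marks the source of the description rather than a role; neither assumption is confirmed by the profile. `split_roles` is a hypothetical helper.

```python
import re

# Trailing tag such as "(DNB)"; assumed to be a source marker, not a role.
SOURCE_TAG = re.compile(r"\(([A-Z]+)\)\s*$")

def split_roles(value):
    """Return (list_of_roles, source_tag_or_None) for one API string."""
    if not value:
        return [], None
    match = SOURCE_TAG.search(value)
    tag = match.group(1) if match else None
    body = SOURCE_TAG.sub("", value)
    roles = [role for role in body.split("/") if role]
    return roles, tag

print(split_roles("SIR/LOCAL_POLITICIAN(DNB)"))
print(split_roles("BISHOP_OF_DURHAM/LORD_CHANCELLOR"))
```

Each atomic role can then be encoded as its own indicator, so that 'SIR' and 'SIR/MERCHANT' share the SIR signal.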

Recipient gender

categorical feature

Categorical recipient gender field with three values and no nulls across 4970 rows. Heavily skewed toward MALE at 81.97% (4074), with FEMALE at 892 and a rare MALE/FEMALE combo at just 4 records. Low entropy ratio of 0.43 confirms the imbalance, and the mixed category is small enough to need a handling decision. Treatment: One-hot encode and decide whether to merge or drop the 4 MALE/FEMALE rows before modelling. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 0 (0.0%)
unique: 3
top_value: MALE
top_rate: 0.8197
cardinality: 3
entropy: 0.6881
entropy_ratio: 0.4342

Recipient DOB

categorical timestamp

This is the recipient's date of birth recorded as a year only, with values clustered between the 15th and 17th centuries (e.g. 1421, 1546?, 1581), consistent with a historical correspondence or archival dataset. Roughly 19.6% of rows are null and many entries carry trailing '?' marks indicating archivist uncertainty, which inflates cardinality (210 distinct values for 4970 rows) and means '1546?' and '1546' would be treated as different categories. An entropy ratio of 0.82 shows the distribution is fairly spread across years rather than dominated by one cohort, with the top year '1421' covering only 6.6% of non-null entries. Treatment: Strip '?' uncertainty markers, cast to integer year, and bucket into centuries or decades before modelling. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 975 (19.6%)
unique: 210
top_value: 1421
top_rate: 0.06608
cardinality: 210
entropy: 6.31
entropy_ratio: 0.8179
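
The decade and century bucketing mentioned in the treatment reduces to integer arithmetic once the '?' marker is stripped. A minimal sketch, assuming populated values are plain years with an optional trailing '?'; the function names are hypothetical.

```python
def to_year(raw):
    """Strip a trailing '?' and cast to int; return None for null or odd input."""
    if raw is None:
        return None
    text = str(raw).strip().rstrip("?").strip()
    return int(text) if text.isdigit() else None

def decade(year):
    """1421 -> 1420."""
    return year // 10 * 10

def century(year):
    """Ordinal century: 1421 -> 15, 1600 -> 16, 1601 -> 17."""
    return (year - 1) // 100 + 1

for raw in ["1421", "1546?", "1581"]:
    year = to_year(raw)
    print(raw, year, decade(year), century(year))
```

After this step '1546?' and '1546' fall into the same bucket, which removes the artificial inflation of cardinality noted above.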

Relation to author

categorical feature null_rate

Categorical relationship label between the letter's author and its recipient, with 43 distinct values led by family and social ties (FRIEND at 14.5% of non-null values, or 7.8% of all rows, then BROTHER, SON, BROTHER-IN-LAW, KIN). Nearly half the rows (46.2%) are null, which is the main concern. An entropy ratio of 0.76 indicates the non-null values are spread fairly evenly across the top categories rather than collapsing onto one. Treatment: Impute or add an explicit 'unknown' category for the 46% nulls, then group rare levels before one-hot encoding. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 2,297 (46.2%)
unique: 43
top_value: FRIEND
top_rate: 0.1448
cardinality: 43
entropy: 4.127
entropy_ratio: 0.7606

Change from 2006?

categorical metadata imbalance

A data-quality flag tracking whether each row changed since 2006, with four categories dominated by 'ok' at 95.5% of 4,970 rows. The remaining values ('corrected', 'corrected in spreadsheet', 'ok sic') indicate manual review notes, and the inconsistent labels suggest free-form curator entries rather than a controlled vocabulary. Entropy ratio of 0.15 confirms severe imbalance. Treatment: Drop or collapse to a binary corrected/ok flag; too imbalanced to be a useful feature. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 0 (0.0%)
unique: 4
top_value: ok
top_rate: 0.9549
cardinality: 4
entropy: 0.3025
entropy_ratio: 0.1513
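
Collapsing the four labels to a binary flag could look like this. The sketch assumes 'ok sic' means the original reading was confirmed and therefore counts as unchanged; the source does not define the labels, so that reading should be checked with the curators. `was_corrected` is a hypothetical helper.

```python
def was_corrected(label):
    """Map the four curator labels to True (corrected) or False (ok)."""
    text = label.strip().lower()
    if text.startswith("corrected"):   # 'corrected', 'corrected in spreadsheet'
        return True
    if text.startswith("ok"):          # 'ok', 'ok sic' (assumed unchanged)
        return False
    raise ValueError(f"unexpected label: {label!r}")

labels = ["ok", "corrected", "corrected in spreadsheet", "ok sic"]
print([was_corrected(label) for label in labels])
```

Raising on unknown labels is deliberate: with free-form curator entries, a new spelling should surface rather than be silently miscounted.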

Order of Gardiner letters in file

numeric metadata null_rate

This column appears to be an ordinal index giving each letter in a Gardiner subset of the correspondence (the 58-letter subset noted in the dataset summary) its position within a source file, running from 1 to 58 with a perfectly symmetric distribution (mean and median both 29.5, skew 0). The striking signal is a 98.83% null rate: only 58 of 4970 rows carry a value, exactly matching n_unique, so each of those rows holds a distinct position and the remaining 4,912 rows fall outside the subset. No outliers and uniform spread confirm it is a sequence, not a measurement. Treatment: Drop from modelling; retain only if you need to preserve the original Gardiner ordering, keyed on Letter reference. high · anthropic:claude-opus-4-7

n: 4,970
nulls: 4,912 (98.8%)
unique: 58
min: 1
max: 58
mean: 29.5
median: 29.5
std: 16.89
q1: 15.25
q3: 43.75
iqr: 28.5
skew: 0
kurtosis: -1.201
n_outliers: 0
outlier_rate: 0
zero_rate: 0
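
The statistics above are what a complete sequence 1 to 58 would produce, which supports reading the column as a plain position index. A quick check with the standard library, assuming the profiler uses the sample standard deviation and linearly interpolated quartiles (an assumption, though it reproduces the reported figures):

```python
import statistics

sequence = list(range(1, 59))  # the integers 1..58

mean = statistics.mean(sequence)
median = statistics.median(sequence)
std = statistics.stdev(sequence)  # sample standard deviation
q1, _, q3 = statistics.quantiles(sequence, n=4, method="inclusive")

print(mean, median, round(std, 2), q1, q3, q3 - q1)
```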