saturn·

aif 2022

source /home/coolhand/servers/diachronica/corpus/historical-corpora/pceec/data/aif_2022.csv · 4,970 rows · 13 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogues 4,970 historical letters (the PCEEC corpus metadata), with 13 columns describing each letter's reference code, author, recipient, their genders, dates of birth, social roles (API), and kinship relations. The gender skew is striking: authors are 83% male versus 17% female, and recipients are 82% male versus 18% female, so any analysis of women's correspondence will work from a much smaller base. Roles and relations are concentrated at the top too: 'SIR' is the most common value in both the author and recipient API fields (15.0% and 13.4%), and 'FRIEND' leads both relation columns at about 14%, with 'BROTHER' and 'SON' among the next most frequent values. Both API fields also have long tails (252 and 326 distinct values) worth scanning. Note also that 'Order of Gardiner letters in file' is 98.8% null (only relevant to a 58-letter subset) and 'Change from 2006?' is 95.5% 'ok', so neither carries much analytic signal.

citing: row_count · column_count · Author gender · Recipient gender · Author API · Recipient API · Relation to author · Relation to recipient · Order of Gardiner letters in file · Change from 2006?

Schema

13 columns
Per-column summary.
Column | Type | Nulls | Unique | Alerts
Letter reference | text | 0.0% | 4,970 | near_unique, one_word, allcaps, short_text
Author name | categorical | 0.0% | 695 | (none)
Author API | categorical | 25.0% | 252 | null_rate
Author gender | categorical | 0.0% | 2 | (none)
Author DOB | categorical | 31.2% | 217 | null_rate
Relation to recipient | categorical | 44.4% | 45 | null_rate
Recipient | categorical | 0.0% | 623 | long_tail
Recipient API | categorical | 18.9% | 326 | (none)
Recipient gender | categorical | 0.0% | 3 | (none)
Recipient DOB | categorical | 19.6% | 210 | (none)
Relation to author | categorical | 46.2% | 43 | null_rate
Change from 2006? | categorical | 0.0% | 4 | imbalance
Order of Gardiner letters in file | numeric | 98.8% | 58 | null_rate

Letter reference

text identifier near_unique one_word allcaps short_text
This column holds a unique letter reference code, formatted as an all-caps single token combining a name and a zero-padded sequence number (e.g. ALLEN_001, ARUNDEL_006). Every one of the 4970 rows is distinct with no nulls or duplicates, and lengths cluster tightly between 7 and 11 characters. The vocabulary equals the row count, confirming this is a primary identifier rather than a feature. Treatment: Use as a row key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7
n
4,970
nulls
0 (0.0%)
unique
4,970
len_min
7
len_max
11
len_mean
10.09
len_median
10
len_p95
11
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
4,970
readability_flesch_mean
49.31
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0
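
Before using the column as a join key, the uniqueness and format claims can be re-checked on load. A minimal sketch, assuming the codes follow the shape seen in the two quoted examples (uppercase name, underscore, three-digit sequence number); the pattern and the function name are illustrative guesses, not taken from the corpus documentation.

```python
import re
from collections import Counter

# Assumed pattern, inferred only from examples such as ALLEN_001 and ARUNDEL_006.
REFERENCE_PATTERN = re.compile(r"[A-Z][A-Z0-9]*_\d{3}")

def check_reference_keys(references):
    """Return (duplicates, malformed) for a list of letter reference codes."""
    counts = Counter(references)
    duplicates = sorted(ref for ref, n in counts.items() if n > 1)
    malformed = sorted(
        ref for ref in counts if not REFERENCE_PATTERN.fullmatch(ref)
    )
    return duplicates, malformed

# Toy input: one repeated key and one code that breaks the assumed pattern.
duplicates, malformed = check_reference_keys(
    ["ALLEN_001", "ARUNDEL_006", "ALLEN_001", "allen_2"]
)
```

On the real file both lists should come back empty if the profile above holds and the assumed pattern is right.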

Author name

categorical feature
Categorical column naming letter authors, with 695 distinct individuals across 4,970 rows and zero nulls. No single author dominates: JOHN_HOLLES_SR tops the list at 136 occurrences (2.7%), followed by THOMAS_CROMWELL (93) and DOROTHY_OSBORNE/TEMPLE (85), and an entropy ratio of 0.87 indicates a fairly even spread across the long tail. The naming convention uses uppercase tokens joined by underscores, and some entries carry bracketed annotations like [N.MAUTBY] or compound surnames (OSBORNE/TEMPLE) that an analyst should normalise before grouping. Treatment: Normalise bracketed annotations and treat as a high-cardinality categorical (target-encode or group rare authors). high · anthropic:claude-opus-4-7
n
4,970
nulls
0 (0.0%)
unique
695
top_value
JOHN_HOLLES_SR
top_rate
0.02736
cardinality
695
entropy
8.192
entropy_ratio
0.8677
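
One way to normalise the bracketed annotations is to split them off into their own field. A minimal sketch, assuming an annotation only ever appears as a trailing `[...]` group; the function name is hypothetical, and the sample value is the full bracketed name quoted in the Recipient section of this report. Compound surnames such as OSBORNE/TEMPLE are deliberately left untouched, since how to resolve them is a separate editorial decision.

```python
import re

# Matches a trailing bracketed annotation such as "[N.MEAUTYS]".
ANNOTATION = re.compile(r"\[([^\]]*)\]$")

def split_name(raw):
    """Split a name into (base_name, annotation), annotation is None if absent."""
    match = ANNOTATION.search(raw)
    if match is None:
        return raw, None
    return raw[:match.start()], match.group(1)

annotated = split_name("JANE_CORNWALLIS/BACON[N.MEAUTYS]")
plain = split_name("JOHN_HOLLES_SR")
```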

Author API

categorical metadata null_rate
Appears to be an authority/role tag for the author of each record, with 252 distinct titles or office descriptors (e.g. SIR, LADY, MERCHANT, KING_OF_ENGLAND, BISHOP_OF_WINCHESTER) suggesting historical/prosopographical data. Distribution is moderately diffuse (entropy ratio 0.76, top value SIR only 15.0%), but a quarter of rows are null (null_rate 0.2503) and several values mix role with DNB-style annotations like '1ST_EARL_OF_CLARE/POLITICIAN(DNB)', indicating inconsistent encoding. Treatment: Normalise the compound 'ROLE/SUBROLE(DNB)' strings and treat missingness explicitly before using as a categorical feature. medium · anthropic:claude-opus-4-7
n
4,970
nulls
1,244 (25.0%)
unique
252
top_value
SIR
top_rate
0.1503
cardinality
252
entropy
6.06
entropy_ratio
0.7596

Author gender

categorical feature
Binary gender label for the author of each record, fully populated across all 4970 rows. The split is heavily skewed: MALE accounts for 83.1% (4130) versus FEMALE at 840, giving an entropy ratio of 0.655. No nulls or unexpected categories appear. Treatment: One-hot or binary-encode; consider class-imbalance handling if used as a stratifier or predictor. high · anthropic:claude-opus-4-7
n
4,970
nulls
0 (0.0%)
unique
2
top_value
MALE
top_rate
0.831
cardinality
2
entropy
0.6554
entropy_ratio
0.6554
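
The entropy figures can be reproduced from the category counts. A minimal sketch, assuming the profiler measures Shannon entropy in bits and defines entropy_ratio as entropy divided by log2 of the number of categories; the report does not state its formula, but this definition agrees with the values shown for both gender columns. Function names are illustrative.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    if k < 2:
        return 0.0  # a single category carries no uncertainty
    return entropy_bits(counts) / math.log2(k)

# Author gender: 4,130 MALE and 840 FEMALE.
author_entropy = entropy_bits([4130, 840])
author_ratio = entropy_ratio([4130, 840])

# Recipient gender: 4,074 MALE, 892 FEMALE, 4 MALE/FEMALE.
recipient_ratio = entropy_ratio([4074, 892, 4])
```

With two categories the maximum entropy is log2(2) = 1 bit, which is why entropy and entropy_ratio coincide for this column.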

Author DOB

categorical metadata null_rate
This appears to be the author's year of birth, stored as a categorical string rather than a number — many values carry a trailing '?' (e.g. '1565?', '1546?') indicating uncertain dates from historical records. 31.17% of rows are null and the top value covers only 3.98% of entries, with 217 distinct values and high entropy (ratio 0.87) suggesting a wide spread across early modern centuries. The mix of clean years ('1633', '1593') and questioned years signals inconsistent provenance that will break naive numeric parsing. Treatment: Strip '?' markers and cast to integer year (with an 'uncertain' flag column) before any temporal analysis. high · anthropic:claude-opus-4-7
n
4,970
nulls
1,549 (31.2%)
unique
217
top_value
1565?
top_rate
0.03975
cardinality
217
entropy
6.746
entropy_ratio
0.8692
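
The suggested treatment can be sketched as a small parser that returns the year and an uncertainty flag separately. This assumes every non-null value is a four-digit year with an optional trailing '?', which is all the report shows; anything else is returned as missing rather than guessed at. The function name is hypothetical.

```python
import re

# Four-digit year, optionally followed by a single '?'.
YEAR = re.compile(r"(\d{4})(\?)?")

def parse_birth_year(raw):
    """Return (year, uncertain); (None, False) for missing or unparseable input."""
    if raw is None:
        return None, False
    match = YEAR.fullmatch(raw.strip())
    if match is None:
        return None, False
    return int(match.group(1)), match.group(2) is not None

uncertain_year = parse_birth_year("1565?")
clean_year = parse_birth_year("1633")
missing_year = parse_birth_year(None)
```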

Relation to recipient

categorical feature null_rate
Categorical describing the relationship between the two parties to each letter (author and recipient), with 45 distinct kinship or social labels such as FRIEND, BROTHER, SON, KIN, and FAMILY_SERVANT. The distribution is fairly flat: the top value FRIEND covers only 13.99% and the entropy ratio is 0.77, so no single relation dominates. Most striking is the 44.37% null rate, meaning nearly half of records have no recorded relation. Treatment: Impute nulls as an explicit 'UNKNOWN' category and consider grouping the 45 levels into broader buckets (immediate family, extended kin, non-kin) before modelling. high · anthropic:claude-opus-4-7
n
4,970
nulls
2,205 (44.4%)
unique
45
top_value
FRIEND
top_rate
0.14
cardinality
45
entropy
4.211
entropy_ratio
0.7668
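
The imputation and bucketing step might look like the sketch below. The bucket assignments are hypothetical and cover only the labels named in this report; the full 45-level mapping would need to be decided by someone who knows the corpus. Labels without a mapping fall into 'OTHER' so they stay visible.

```python
# Hypothetical grouping: these assignments are illustrative, not from the dataset.
RELATION_BUCKETS = {
    "BROTHER": "IMMEDIATE_FAMILY",
    "SON": "IMMEDIATE_FAMILY",
    "BROTHER-IN-LAW": "EXTENDED_KIN",
    "KIN": "EXTENDED_KIN",
    "FRIEND": "NON_KIN",
    "FAMILY_SERVANT": "NON_KIN",
}

def bucket_relation(raw):
    """Map a relation label to a broad bucket; nulls become 'UNKNOWN'."""
    if raw is None or not raw.strip():
        return "UNKNOWN"
    return RELATION_BUCKETS.get(raw.strip(), "OTHER")

buckets = [bucket_relation(v) for v in ["FRIEND", None, "SON", "KIN", "PATRON"]]
```

'PATRON' in the usage line is a made-up label used only to show the fallback.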

Recipient

categorical label long_tail
Recipient is a categorical field naming the addressee of each record (likely letters in a historical correspondence corpus), with 623 distinct names across 4970 rows and no nulls. The distribution has a long tail but is concentrated on a few major figures: JOHN_PASTON_I tops at 5.27% (262), followed by NATHANIEL_BACON_I (251), JOAN_BARRINGTON (182), and JANE_CORNWALLIS/BACON[N.MEAUTYS] (180). High entropy ratio (0.79) confirms breadth, while the naming convention (uppercase with underscores, roman numerals, bracketed aliases) suggests a curated historical-letters dataset. Treatment: Group rare recipients into an 'other' bucket before encoding, given the long tail across 623 categories. high · anthropic:claude-opus-4-7
n
4,970
nulls
0 (0.0%)
unique
623
top_value
JOHN_PASTON_I
top_rate
0.05272
cardinality
623
entropy
7.351
entropy_ratio
0.7918
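
Grouping rare recipients can be done with a simple frequency threshold. A minimal sketch; the threshold value and the 'OTHER' label are assumptions to tune, not recommendations from the profile.

```python
from collections import Counter

def group_rare(values, min_count, other_label="OTHER"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy sample: only JOHN_PASTON_I reaches the threshold of 2.
sample = ["JOHN_PASTON_I", "JOHN_PASTON_I", "JOAN_BARRINGTON", "NATHANIEL_BACON_I"]
grouped = group_rare(sample, min_count=2)
```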

Recipient API

categorical feature
This column appears to capture the recipient's title or role/API descriptor in correspondence records, with 326 distinct values across 4970 rows. The distribution is moderately diverse (entropy_ratio 0.73) — the most common value 'SIR' covers only 13.4% of records, followed by 'LADY' and compound role strings like 'SIR/LOCAL_POLITICIAN(DNB)'. Notable surprises: 18.9% nulls, and many values are concatenated multi-role strings rather than atomic titles, suggesting inconsistent encoding. Treatment: Split compound '/'-delimited roles and one-hot or target-encode atomic titles before modelling. high · anthropic:claude-opus-4-7
n
4,970
nulls
940 (18.9%)
unique
326
top_value
SIR
top_rate
0.1345
cardinality
326
entropy
6.128
entropy_ratio
0.734
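
Splitting the compound strings could work as follows. This sketch assumes '/' always separates roles and that a trailing '(DNB)' is a source annotation applying to the whole value; neither is confirmed by the report, and the function name is hypothetical. Nulls return an empty role list so missingness stays explicit.

```python
import re

# Trailing source annotation, as seen in 'SIR/LOCAL_POLITICIAN(DNB)'.
DNB_SUFFIX = re.compile(r"\(DNB\)$")

def split_roles(raw):
    """Return (roles, has_dnb) for an API string; ([], False) for nulls."""
    if raw is None or not raw.strip():
        return [], False
    text = raw.strip()
    has_dnb = DNB_SUFFIX.search(text) is not None
    text = DNB_SUFFIX.sub("", text)
    roles = [part for part in text.split("/") if part]
    return roles, has_dnb

compound = split_roles("SIR/LOCAL_POLITICIAN(DNB)")
atomic = split_roles("SIR")
```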

Recipient gender

categorical feature
Categorical recipient gender field with three values and no nulls across 4970 rows. Heavily skewed toward MALE at 81.97% (4074), with FEMALE at 892 and a rare MALE/FEMALE combo at just 4 records. Low entropy ratio of 0.43 confirms the imbalance, and the mixed category is small enough to need a handling decision. Treatment: One-hot encode and decide whether to merge or drop the 4 MALE/FEMALE rows before modelling. high · anthropic:claude-opus-4-7
n
4,970
nulls
0 (0.0%)
unique
3
top_value
MALE
top_rate
0.8197
cardinality
3
entropy
0.6881
entropy_ratio
0.4342

Recipient DOB

categorical timestamp
This is the recipient's date of birth recorded as a year only, with values clustered between the 15th and 17th centuries (e.g. 1421, 1546?, 1581) — consistent with a historical correspondence or archival dataset. Roughly 19.6% of rows are null and many entries carry trailing '?' marks indicating archivist uncertainty, which inflates cardinality (210 distinct values for 4970 rows) and means '1546?' and '1546' would be treated as different categories. Entropy ratio of 0.82 shows the distribution is fairly spread across years rather than dominated by one cohort, with the top year '1421' covering only 6.6%. Treatment: Strip '?' uncertainty markers, cast to integer year, and bucket into centuries or decades before modelling. high · anthropic:claude-opus-4-7
n
4,970
nulls
975 (19.6%)
unique
210
top_value
1421
top_rate
0.06608
cardinality
210
entropy
6.31
entropy_ratio
0.8179
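
Bucketing into decades and centuries is plain integer arithmetic once the '?' is stripped. A minimal sketch, assuming values are four-digit years with an optional trailing '?' and using the conventional numbering in which 1401 to 1500 is the 15th century; the function name is illustrative.

```python
def birth_cohort(raw):
    """Return (decade, century) for a year string such as '1546?', else None."""
    if raw is None:
        return None
    text = raw.strip().rstrip("?")
    if not text.isdigit():
        return None
    year = int(text)
    decade = year // 10 * 10          # 1546 -> 1540
    century = (year - 1) // 100 + 1   # 1546 -> 16 (16th century)
    return decade, century

cohorts = [birth_cohort(v) for v in ["1421", "1546?", "1581", None]]
```

Because '1546?' and '1546' land in the same bucket, this also removes the cardinality inflation noted above.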

Relation to author

categorical feature null_rate
Categorical relationship label between the letter's author and its recipient, with 43 distinct values dominated by family and social ties (FRIEND 14.5%, BROTHER, SON, BROTHER-IN-LAW, KIN). Nearly half the rows (46.2%) are null, which is the main concern. Entropy ratio of 0.76 indicates the non-null values are spread fairly evenly across the top categories rather than collapsing onto one. Treatment: Impute or add an explicit 'unknown' category for the 46% nulls, then group rare levels before one-hot encoding. high · anthropic:claude-opus-4-7
n
4,970
nulls
2,297 (46.2%)
unique
43
top_value
FRIEND
top_rate
0.1448
cardinality
43
entropy
4.127
entropy_ratio
0.7606

Change from 2006?

categorical metadata imbalance
A data-quality flag tracking whether each row changed since 2006, with four categories dominated by 'ok' at 95.5% of 4,970 rows. The remaining values ('corrected', 'corrected in spreadsheet', 'ok sic') indicate manual review notes, and the inconsistent labels suggest free-form curator entries rather than a controlled vocabulary. Entropy ratio of 0.15 confirms severe imbalance. Treatment: Drop or collapse to a binary corrected/ok flag; too imbalanced to be a useful feature. high · anthropic:claude-opus-4-7
n
4,970
nulls
0 (0.0%)
unique
4
top_value
ok
top_rate
0.9549
cardinality
4
entropy
0.3025
entropy_ratio
0.1513
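
Collapsing the four labels to a binary flag is a one-line lookup. The sketch assumes 'ok sic' marks a value that was reviewed and left as it stands, so it counts as unchanged; that reading is a guess and should be checked against the curators' notes.

```python
# Assumption: 'ok sic' is treated as unchanged; only these two labels count as corrected.
CORRECTED_LABELS = {"corrected", "corrected in spreadsheet"}

def was_corrected(label):
    """True if the curator label records a correction."""
    return label.strip().lower() in CORRECTED_LABELS

flags = [was_corrected(v) for v in ["ok", "corrected", "corrected in spreadsheet", "ok sic"]]
```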

Order of Gardiner letters in file

numeric metadata null_rate
This column appears to be an ordinal index recording the position of each Gardiner letter within its source file, running from 1 to 58 with a perfectly symmetric distribution (mean and median both 29.5, skew 0). Here 'letters' means items of correspondence, presumably a Gardiner sub-collection within the corpus, not characters or signs. The striking signal is a 98.83% null rate: only 58 of 4970 rows carry a value, exactly matching the unique count, so each populated row holds a distinct position and the column applies only to that 58-letter subset. No outliers and uniform spread confirm it is a sequence, not a measurement. Treatment: Drop from modelling; retain only if you need to restore the original Gardiner ordering, keyed on Letter reference. high · anthropic:claude-opus-4-7
n
4,970
nulls
4,912 (98.8%)
unique
58
min
1
max
58
mean
29.5
median
29.5
std
16.89
q1
15.25
q3
43.75
iqr
28.5
skew
0
kurtosis
-1.201
n_outliers
0
outlier_rate
0
zero_rate
0
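
The summary statistics are exactly what a complete sequence 1 to 58 produces, which can be checked with the standard library. The sketch assumes the profiler reports the sample standard deviation and linearly interpolated quartiles; those conventions are not stated in the report, but they reproduce the figures listed.

```python
import statistics

# The 58 non-null values, if they form the full sequence 1..58.
order = list(range(1, 59))

mean = statistics.mean(order)
median = statistics.median(order)
std = statistics.stdev(order)  # sample standard deviation (n - 1 denominator)
q1, q2, q3 = statistics.quantiles(order, n=4, method="inclusive")
iqr = q3 - q1
```

A match on all of mean, median, std, q1, q3, and iqr supports reading the column as a gap-free ordering rather than a measurement.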