aif 2022

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/servers/diachronica/corpus/historical-corpora/pceec/data/aif_2022.csv

Saturn profiled 4,970 rows across 13 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
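# Install the profiler (IPython shell escape; works only inside a notebook cell)
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source CSV with the flags recorded for this report
subprocess.run([
    "saturn", "analyze", "/home/coolhand/servers/diachronica/corpus/historical-corpora/pceec/data/aif_2022.csv",
    "--findings", "aif_2022.json",
    "--llm", "anthropic:claude-opus-4-7",
])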

Summary confidence: high

This dataset catalogues 4,970 historical letters (the PCEEC corpus metadata), with 13 columns describing each letter's reference code, author, recipient, their genders, dates of birth, social roles (API), and kinship relations. The social skew is striking: authors are 83% male versus 17% female, and recipients are 82% male versus 18% female, so any analysis of women's correspondence will work from a much smaller base. Roles and relations are heavily concentrated too — 'SIR' tops both author and recipient API fields, and 'FRIEND', 'BROTHER', and 'SON' dominate the kinship columns — though both API fields have long tails of 250+ distinct values worth scanning. Note also that 'Order of Gardiner letters in file' is 98.8% null (only relevant to a 58-letter subset) and 'Change from 2006?' is 95% 'ok', so neither carries much analytic signal.

citing: row_count · column_count · Author gender · Recipient gender · Author API · Recipient API · Relation to author · Relation to recipient · Order of Gardiner letters in file · Change from 2006?

Out[4]:

saturn.schema() · 13 columns

column                             kind         n      null%  unique  alerts
Letter reference                   text         4,970  0.0%   4,970   near_unique one_word allcaps short_text
Author name                        categorical  4,970  0.0%   695
Author API                         categorical  4,970  25.0%  252     null_rate
Author gender                      categorical  4,970  0.0%   2
Author DOB                         categorical  4,970  31.2%  217     null_rate
Relation to recipient              categorical  4,970  44.4%  45      null_rate
Recipient                          categorical  4,970  0.0%   623     long_tail
Recipient API                      categorical  4,970  18.9%  326
Recipient gender                   categorical  4,970  0.0%   3
Recipient DOB                      categorical  4,970  19.6%  210
Relation to author                 categorical  4,970  46.2%  43      null_rate
Change from 2006?                  categorical  4,970  0.0%   4       imbalance
Order of Gardiner letters in file  numeric      4,970  98.8%  58      null_rate
Fig 1.
Author gender · Shows the strong male skew among letter authors (83% male, 17% female).
Top values for Author gender (2 unique shown, of 2 total).
value   count  share
MALE    4130   83.1%
FEMALE  840    16.9%
Fig 2.
Recipient gender · Recipients are similarly male-dominated; check the small 'MALE/FEMALE' slice for joint-addressed letters.
Top values for Recipient gender (3 unique shown, of 3 total).
value        count  share
MALE         4074   82.0%
FEMALE       892    17.9%
MALE/FEMALE  4      0.1%
Fig 3.
Relation to author · FRIEND, BROTHER, and SON lead — useful for seeing what kinship ties drove correspondence.
Top values for Relation to author (20 unique shown, of 43 total).
value           count  share
FRIEND          387    7.8%
BROTHER         324    6.5%
SON             295    5.9%
BROTHER-IN-LAW  192    3.9%
KIN             188    3.8%
WIFE            160    3.2%
MOTHER          160    3.2%
FATHER          138    2.8%
HUSBAND         126    2.5%
COUSIN          115    2.3%
MOTHER-IN-LAW   81     1.6%
FUTURE_HUSBAND  78     1.6%
SISTER          67     1.3%
UNCLE           59     1.2%
FAMILY_SERVANT  53     1.1%
FATHER-IN-LAW   43     0.9%
NEPHEW          29     0.6%
SON-IN-LAW      27     0.5%
SISTER-IN-LAW   26     0.5%
AUNT            25     0.5%
Fig 4.
Author API · Top social roles of authors (SIR, LADY, MERCHANT) reveal the elite tilt of the corpus.
Top values for Author API (20 unique shown, of 252 total).
value                                             count  share
SIR                                               560    11.3%
LADY                                              278    5.6%
MERCHANT                                          149    3.0%
KING_OF_ENGLAND                                   140    2.8%
1ST_EARL_OF_CLARE/POLITICIAN(DNB)                 136    2.7%
BISHOP_OF_WINCHESTER                              111    2.2%
CLERK                                             103    2.1%
EARL_OF_ESSEX/ROYAL_MINISTER(DNB)                 93     1.9%
SIR/LOCAL_POLITICIAN(DNB)                         79     1.6%
BISHOP_OF_NORWICH                                 77     1.5%
1ST_EARL_OF_STRAFFORD/LORD_LIEUTENANT_OF_IRELAND  67     1.3%
PUBLIC_SERVANT                                    58     1.2%
COLONEL                                           55     1.1%
1ST_LORD_BURGHLEY/ROYAL_MINISTER(DNB)             48     1.0%
EARL_OF_LEICESTER/COURTIER/MAGNATE(DNB)           47     0.9%
2ND_EARL_OF_ARUNDEL_AND_SURREY/POLITICIAN(DNB)    44     0.9%
3RD_EARL_OF_DERBY                                 40     0.8%
SIR/NATURAL_PHILOSOPHER/ADMINISTRATOR(DNB)        40     0.8%
SIR/2ND_BART/SCHOLAR/POLITICIAN(DNB)              38     0.8%
SIR/LORD_CHANCELLOR(DNB)                          38     0.8%
Fig 5.
Recipient API · Compare against author roles to see whether letters mostly travelled within the same social strata.
Top values for Recipient API (20 unique shown, of 326 total).
value                                            count  share
SIR                                              542    10.9%
LADY                                             444    8.9%
SIR/LOCAL_POLITICIAN(DNB)                        253    5.1%
MERCHANT                                         154    3.1%
SIR/ANTIQUARY                                    100    2.0%
BARONET/DIPLOMAT/AUTHOR(DNB)                     85     1.7%
SIR/MERCHANT                                     82     1.6%
BISHOP_OF_DURHAM/LORD_CHANCELLOR                 75     1.5%
ROYAL_MINISTER/ARCHBISHOP_OF_YORK/CARDINAL(DNB)  71     1.4%
KING_OF_ENGLAND                                  68     1.4%
BISHOP_OF_WINCHESTER                             68     1.4%
1ST_EARL_OF_CUMBERLAND                           64     1.3%
VISCOUNT                                         63     1.3%
BISHOP_OF_DURHAM                                 59     1.2%
CAPTAIN                                          55     1.1%
EARL_OF_LEICESTER/COURTIER/MAGNATE(DNB)          49     1.0%
SIR/PRINCIPLE_SECRETARY(DNB)                     45     0.9%
VISCOUNTESS                                      45     0.9%
SIR/LORD_KEEPER_OF_THE_GREAT_SEAL                43     0.9%
VISCOUNT_DORCHESTER/DIPLOMNAT(DNB)               43     0.9%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column                             kind         null %
Letter reference                   text         0.0%
Author name                        categorical  0.0%
Author API                         categorical  25.0%
Author gender                      categorical  0.0%
Author DOB                         categorical  31.2%
Relation to recipient              categorical  44.4%
Recipient                          categorical  0.0%
Recipient API                      categorical  18.9%
Recipient gender                   categorical  0.0%
Recipient DOB                      categorical  19.6%
Relation to author                 categorical  46.2%
Change from 2006?                  categorical  0.0%
Order of Gardiner letters in file  numeric      98.8%

Letter reference · text · identifier

This column holds a unique letter reference code, formatted as an all-caps single token combining a name and a zero-padded sequence number (e.g. ALLEN_001, ARUNDEL_006). Every one of the 4970 rows is distinct with no nulls or duplicates, and lengths cluster tightly between 7 and 11 characters. The vocabulary equals the row count, confirming this is a primary identifier rather than a feature.

Treatment: Use as a row key for joins; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high
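
Before relying on the reference code as a join key, the uniqueness and format claims can be checked directly. The sketch below is illustrative only: the regular expression is an assumption inferred from the two examples quoted above (ALLEN_001, ARUNDEL_006), the sample list is made up, and `check_row_keys` is a hypothetical helper, not part of saturn.

```python
import re
from collections import Counter

# Assumed shape: upper-case token, underscore, digits. Inferred from the
# examples in the report; the real corpus may contain other forms.
REF_PATTERN = re.compile(r"^[A-Z0-9]+_\d+$")


def check_row_keys(refs):
    """Return (duplicates, malformed) for a list of reference codes."""
    counts = Counter(refs)
    duplicates = sorted(ref for ref, n in counts.items() if n > 1)
    malformed = sorted(ref for ref in counts if not REF_PATTERN.match(ref))
    return duplicates, malformed


# Made-up sample: one repeated key and one key that breaks the assumed format.
sample = ["ALLEN_001", "ARUNDEL_006", "ALLEN_001", "allen_2"]
duplicates, malformed = check_row_keys(sample)
print(duplicates)  # ['ALLEN_001']
print(malformed)   # ['allen_2']
```

On the real column both lists should come back empty if the stats below hold (n_duplicates 0, allcaps_rate 1).
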
Out[12]:

saturn.columns["Letter reference"].stats

stat                     value
n                        4,970
nulls                    0 (0.0%)
unique                   4,970
len_min                  7
len_max                  11
len_mean                 10.09
len_median               10
len_p95                  11
word_mean                1
word_median              1
n_empty                  0
n_duplicates             0
duplicate_rate           0
vocab_size               4,970
readability_flesch_mean  49.31
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             1
boilerplate_rate         0
alert: near_unique       100.0% of rows are unique strings
alert: one_word          100.0% rows are a single word
alert: allcaps           100.0% rows are all-caps
alert: short_text        95th-percentile length under 20 chars
Fig 7.
Character-length distribution for Letter reference.
Character-length distribution for Letter reference (mean: 10.09); empty histogram bins omitted.
chars  count
7      22
8      207
9      1000
10     1805
11     1936

Author name · categorical · feature

Categorical column naming letter authors, with 695 distinct individuals across 4,970 rows and zero nulls. The distribution is only mildly concentrated: JOHN_HOLLES_SR tops the list at 136 occurrences (2.7%), followed by THOMAS_CROMWELL (93) and DOROTHY_OSBORNE/TEMPLE (85), and an entropy ratio of 0.87 indicates a fairly even spread across the long tail. The naming convention uses uppercase tokens joined by underscores, and some entries carry bracketed annotations such as [N.MAUTBY] or compound surnames (OSBORNE/TEMPLE) that an analyst should normalise before grouping.

Treatment: Normalise bracketed annotations and treat as a high-cardinality categorical (target-encode or group rare authors).

anthropic:claude-opus-4-7 · confidence high
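
One way to approach the normalisation is to split a trailing bracketed annotation from the base name and keep both parts. This is a minimal sketch, assuming the brackets always trail the name and never nest; `split_author` is a hypothetical helper, and what the annotation means is not stated in the report.

```python
import re

# Assumption: at most one trailing [...] annotation per name, with no nesting.
ANNOTATION = re.compile(r"^(?P<base>[^\[\]]+)(?:\[(?P<note>[^\[\]]*)\])?$")


def split_author(name):
    """Split 'NAME[NOTE]' into (base, note); note is None when absent."""
    match = ANNOTATION.match(name)
    if match is None:
        return name, None  # leave unexpected shapes untouched
    return match.group("base"), match.group("note")


print(split_author("MARGARET_PASTON[N.MAUTBY]"))  # ('MARGARET_PASTON', 'N.MAUTBY')
print(split_author("JOHN_HOLLES_SR"))             # ('JOHN_HOLLES_SR', None)
```

Compound surnames such as DOROTHY_OSBORNE/TEMPLE pass through unchanged here; whether to split on '/' is a separate decision.
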
Out[15]:

saturn.columns["Author name"].stats

stat              value
n                 4,970
nulls             0 (0.0%)
unique            695
top_value         JOHN_HOLLES_SR
top_rate          0.02736
cardinality       695
entropy           8.192
entropy_ratio     0.8677
Fig 8.
Top values for Author name.
Top values for Author name (20 unique shown, of 695 total).
value                        count  share
JOHN_HOLLES_SR               136    2.7%
THOMAS_CROMWELL              93     1.9%
DOROTHY_OSBORNE/TEMPLE       85     1.7%
NATHANIEL_BACON_I            77     1.5%
JOHN_CHAMBERLAIN             71     1.4%
THOMAS_WENTWORTH             67     1.3%
MARGARET_PASTON[N.MAUTBY]    66     1.3%
ARABELLA_STUART              65     1.3%
BRILLIANA_HARLEY[N.CONWAY]   61     1.2%
STEPHEN_GARDINER             58     1.2%
SAMUEL_PEPYS                 58     1.2%
JOHN_PARKHURST               55     1.1%
JOHN_JONES                   53     1.1%
ANTHONY_ANTONIE              51     1.0%
ROBERT_DUDLEY                47     0.9%
KATHERINE_PASTON[N.KNYVETT]  47     0.9%
RICHARD_CELY_JR              46     0.9%
WILLIAM_CECIL                45     0.9%
THOMAS_KNYVETT               45     0.9%
THOMAS_HOWARD_III            44     0.9%

Author API · categorical · metadata

Appears to be an authority/role tag for the author of each letter, with 252 distinct titles or office descriptors (e.g. SIR, LADY, MERCHANT, KING_OF_ENGLAND, BISHOP_OF_WINCHESTER), which points to historical, prosopographical data. The distribution is moderately diffuse (entropy ratio 0.76; the top value SIR covers 15.0% of non-null rows, or 11.3% of all rows), but a quarter of rows are null (null_rate 0.2503), and several values combine a title with role annotations, as in '1ST_EARL_OF_CLARE/POLITICIAN(DNB)', so the encoding is inconsistent.

Treatment: Normalise the compound 'ROLE/SUBROLE(DNB)' strings and treat missingness explicitly before using as a categorical feature.

anthropic:claude-opus-4-7 · confidence medium
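
The treatment above can be sketched as a small parser that flags missing values and separates the parts of a compound string. The reading of '/' as a separator and of a trailing '(DNB)' as a source marker is an assumption based on the values shown in this report; `normalise_role` is a hypothetical helper.

```python
DNB_MARKER = "(DNB)"  # assumed to be a trailing source marker, not part of the role


def normalise_role(value):
    """Split one role string into its parts and flag missing values."""
    if value is None or value.strip() == "":
        return {"missing": True, "parts": [], "dnb": False}
    text = value.strip()
    dnb = text.endswith(DNB_MARKER)
    if dnb:
        text = text[: -len(DNB_MARKER)]
    return {"missing": False, "parts": text.split("/"), "dnb": dnb}


print(normalise_role("1ST_EARL_OF_CLARE/POLITICIAN(DNB)"))
# {'missing': False, 'parts': ['1ST_EARL_OF_CLARE', 'POLITICIAN'], 'dnb': True}
print(normalise_role(None))
# {'missing': True, 'parts': [], 'dnb': False}
```

Keeping `missing` as its own field lets the 25% of null rows be modelled explicitly instead of being dropped.
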
Out[18]:

saturn.columns["Author API"].stats

stat              value
n                 4,970
nulls             1,244 (25.0%)
unique            252
top_value         SIR
top_rate          0.1503
cardinality       252
entropy           6.06
entropy_ratio     0.7596
alert: null_rate  25.0% null
Fig 9.
Top values for Author API.
Top values for Author API (20 unique shown, of 252 total).
value                                             count  share
SIR                                               560    11.3%
LADY                                              278    5.6%
MERCHANT                                          149    3.0%
KING_OF_ENGLAND                                   140    2.8%
1ST_EARL_OF_CLARE/POLITICIAN(DNB)                 136    2.7%
BISHOP_OF_WINCHESTER                              111    2.2%
CLERK                                             103    2.1%
EARL_OF_ESSEX/ROYAL_MINISTER(DNB)                 93     1.9%
SIR/LOCAL_POLITICIAN(DNB)                         79     1.6%
BISHOP_OF_NORWICH                                 77     1.5%
1ST_EARL_OF_STRAFFORD/LORD_LIEUTENANT_OF_IRELAND  67     1.3%
PUBLIC_SERVANT                                    58     1.2%
COLONEL                                           55     1.1%
1ST_LORD_BURGHLEY/ROYAL_MINISTER(DNB)             48     1.0%
EARL_OF_LEICESTER/COURTIER/MAGNATE(DNB)           47     0.9%
2ND_EARL_OF_ARUNDEL_AND_SURREY/POLITICIAN(DNB)    44     0.9%
3RD_EARL_OF_DERBY                                 40     0.8%
SIR/NATURAL_PHILOSOPHER/ADMINISTRATOR(DNB)        40     0.8%
SIR/2ND_BART/SCHOLAR/POLITICIAN(DNB)              38     0.8%
SIR/LORD_CHANCELLOR(DNB)                          38     0.8%

Author gender · categorical · feature

Binary gender label for the author of each record, fully populated across all 4970 rows. The split is heavily skewed: MALE accounts for 83.1% (4130) versus FEMALE at 840, giving an entropy ratio of 0.655. No nulls or unexpected categories appear.

Treatment: One-hot or binary-encode; consider class-imbalance handling if used as a stratifier or predictor.

anthropic:claude-opus-4-7 · confidence high
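
The entropy ratio quoted above can be reproduced from the two counts. The sketch assumes that saturn defines the ratio as Shannon entropy in bits divided by log2 of the number of categories; that definition is inferred from the reported numbers (it also matches the other columns in this report) rather than taken from saturn's documentation.

```python
import math


def entropy_ratio(counts):
    """Return (entropy_bits, entropy / log2(k)) for a list of category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    k = len(probs)
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio


# MALE 4130, FEMALE 840: with two categories log2(k) = 1, so ratio == entropy.
author_entropy, author_ratio = entropy_ratio([4130, 840])
print(round(author_entropy, 4), round(author_ratio, 4))  # 0.6554 0.6554
```

A perfectly balanced binary column would score 1.0, so 0.655 quantifies how far the 83/17 split is from parity.
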
Out[21]:

saturn.columns["Author gender"].stats

stat              value
n                 4,970
nulls             0 (0.0%)
unique            2
top_value         MALE
top_rate          0.831
cardinality       2
entropy           0.6554
entropy_ratio     0.6554
Fig 10.
Top values for Author gender.
Top values for Author gender (2 unique shown, of 2 total).
value   count  share
MALE    4130   83.1%
FEMALE  840    16.9%

Author DOB · categorical · metadata

This appears to be the author's year of birth, stored as a categorical string rather than a number: many values carry a trailing '?' (e.g. '1565?', '1546?') indicating uncertain dates from historical records. 31.17% of rows are null, and the top value covers only 3.98% of non-null entries (2.7% of all rows); 217 distinct values and high entropy (ratio 0.87) suggest a wide spread across the early modern centuries. The mix of plain years ('1633', '1593') and questioned years signals uneven provenance and will break naive numeric parsing.

Treatment: Strip '?' markers and cast to integer year (with an 'uncertain' flag column) before any temporal analysis.

anthropic:claude-opus-4-7 · confidence high
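
A minimal sketch of that treatment follows, assuming every populated value is a four-digit year with at most one trailing '?'. Only the top 20 of 217 values are visible in this report, so other shapes may exist; the hypothetical `parse_year` helper returns None for anything it does not recognise rather than guessing.

```python
import re

# Assumed shape: four digits, optionally followed by a single '?'.
YEAR = re.compile(r"^(\d{4})(\?)?$")


def parse_year(raw):
    """Return (year, uncertain); year is None for missing or unrecognised input."""
    if raw is None:
        return None, False
    match = YEAR.match(raw.strip())
    if match is None:
        return None, False
    return int(match.group(1)), match.group(2) is not None


print(parse_year("1565?"))  # (1565, True)
print(parse_year("1633"))   # (1633, False)
print(parse_year(""))       # (None, False)
```
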
Out[24]:

saturn.columns["Author DOB"].stats

stat              value
n                 4,970
nulls             1,549 (31.2%)
unique            217
top_value         1565?
top_rate          0.03975
cardinality       217
entropy           6.746
entropy_ratio     0.8692
alert: null_rate  31.2% null
Fig 11.
Top values for Author DOB.
Top values for Author DOB (20 unique shown, of 217 total).
value  count  share
1565?  136    2.7%
1546?  120    2.4%
1633   99     2.0%
1485?  97     2.0%
1627   85     1.7%
1593   79     1.6%
1533   79     1.6%
1585   73     1.5%
1554   71     1.4%
1575   66     1.3%
1600?  61     1.2%
1623   60     1.2%
1497?  58     1.2%
1631   55     1.1%
1511   55     1.1%
1596   53     1.1%
1597?  53     1.1%
1442   52     1.0%
1614   50     1.0%
1608   48     1.0%

Relation to recipient · categorical · feature

Categorical field that apparently gives the author's relation to the recipient (FUTURE_WIFE appears here, while FUTURE_HUSBAND appears under 'Relation to author'), with 45 distinct kinship or social labels such as FRIEND, BROTHER, SON, KIN, and FAMILY_SERVANT. The distribution is fairly flat: the top value FRIEND covers only 13.99% of non-null rows (7.8% of all rows) and the entropy ratio is 0.77, so no single relation dominates. Most striking is the 44.37% null rate, meaning nearly half of the letters have no recorded relation.

Treatment: Impute nulls as an explicit 'UNKNOWN' category and consider grouping the 45 levels into broader buckets (immediate family, extended kin, non-kin) before modelling.

anthropic:claude-opus-4-7 · confidence high
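
The bucketing step might look like the sketch below. The bucket membership is a hypothetical assignment covering only labels visible in this report; where a label such as FUTURE_WIFE belongs is an analyst's call, so unlisted labels fall through to 'OTHER' instead of being guessed.

```python
# Hypothetical bucket membership; adjust to the research question.
IMMEDIATE = {"BROTHER", "SISTER", "SON", "DAUGHTER",
             "MOTHER", "FATHER", "HUSBAND", "WIFE"}
EXTENDED = {"KIN", "COUSIN", "NEPHEW", "NIECE", "UNCLE", "AUNT"}
NON_KIN = {"FRIEND", "FAMILY_SERVANT"}


def bucket_relation(label):
    """Map a relation label to a coarse bucket; nulls become 'UNKNOWN'."""
    if label is None or label == "":
        return "UNKNOWN"
    if label in IMMEDIATE:
        return "IMMEDIATE_FAMILY"
    if label in EXTENDED or label.endswith("-IN-LAW"):
        return "EXTENDED_KIN"
    if label in NON_KIN:
        return "NON_KIN"
    return "OTHER"


print(bucket_relation("BROTHER-IN-LAW"))  # EXTENDED_KIN
print(bucket_relation(None))              # UNKNOWN
```
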
Out[27]:

saturn.columns["Relation to recipient"].stats

stat              value
n                 4,970
nulls             2,205 (44.4%)
unique            45
top_value         FRIEND
top_rate          0.14
cardinality       45
entropy           4.211
entropy_ratio     0.7668
alert: null_rate  44.4% null
Fig 12.
Top values for Relation to recipient.
Top values for Relation to recipient (20 unique shown, of 45 total).
value           count  share
FRIEND          387    7.8%
BROTHER         353    7.1%
SON             250    5.0%
KIN             187    3.8%
BROTHER-IN-LAW  169    3.4%
MOTHER          162    3.3%
HUSBAND         160    3.2%
FATHER          149    3.0%
FAMILY_SERVANT  140    2.8%
WIFE            126    2.5%
COUSIN          115    2.3%
SON-IN-LAW      93     1.9%
FUTURE_WIFE     78     1.6%
DAUGHTER        47     0.9%
NEPHEW          47     0.9%
SISTER-IN-LAW   42     0.8%
SISTER          38     0.8%
NIECE           37     0.7%
FATHER-IN-LAW   29     0.6%
UNCLE           26     0.5%

Recipient · categorical · label

Recipient is a categorical field naming the addressee of each record (likely letters in a historical correspondence corpus), with 623 distinct names across 4970 rows and no nulls. The distribution has a long tail but is concentrated on a few major figures: JOHN_PASTON_I tops at 5.27% (262), followed by NATHANIEL_BACON_I (251), JOAN_BARRINGTON (182), and JANE_CORNWALLIS/BACON[N.MEAUTYS] (180). High entropy ratio (0.79) confirms breadth, while the naming convention (uppercase with underscores, roman numerals, bracketed aliases) suggests a curated historical-letters dataset.

Treatment: Group rare recipients into an 'other' bucket before encoding, given the long tail across 623 categories.

anthropic:claude-opus-4-7 · confidence high
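
Grouping rare recipients can be sketched with a frequency threshold. The threshold of 2 (which folds only singletons) and the 'OTHER' label are illustrative assumptions, and `lump_rare` is a hypothetical helper; the sample names are taken from the figure table but the sample counts are made up.

```python
from collections import Counter


def lump_rare(values, min_count=2, other="OTHER"):
    """Replace values seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


sample = ["JOHN_PASTON_I", "JOHN_PASTON_I", "GEORGE_CELY",
          "JOAN_BARRINGTON", "JOAN_BARRINGTON"]
print(lump_rare(sample))
# ['JOHN_PASTON_I', 'JOHN_PASTON_I', 'OTHER', 'JOAN_BARRINGTON', 'JOAN_BARRINGTON']
```

With min_count=2 this would fold the 317 singleton categories flagged by the long_tail alert into one level; a higher threshold trades detail for fewer encoded columns.
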
Out[30]:

saturn.columns["Recipient"].stats

stat              value
n                 4,970
nulls             0 (0.0%)
unique            623
top_value         JOHN_PASTON_I
top_rate          0.05272
cardinality       623
entropy           7.351
entropy_ratio     0.7918
alert: long_tail  317 singleton categories
Fig 13.
Top values for Recipient.
Top values for Recipient (20 unique shown, of 623 total).
value                             count  share
JOHN_PASTON_I                     262    5.3%
NATHANIEL_BACON_I                 251    5.1%
JOAN_BARRINGTON                   182    3.7%
JANE_CORNWALLIS/BACON[N.MEAUTYS]  180    3.6%
GEORGE_CELY                       119    2.4%
HENRY_OXINDEN[BARHAM]             107    2.2%
DANIEL_FLEMING                    100    2.0%
ROBERT_PLUMPTON_I                 87     1.8%
WILLIAM_TEMPLE                    85     1.7%
JOHN_PASTON_III                   84     1.7%
WILLIAM_STONOR                    81     1.6%
THOMAS_LANGLEY                    75     1.5%
THOMAS_STOCKWELL                  73     1.5%
THOMAS_WOLSEY                     65     1.3%
EDWARD_HARLEY                     64     1.3%
HENRY_CLIFFORD_II                 63     1.3%
CHRISTOPHER_HATTON_III            63     1.3%
HENRY_TUDOR_VIII                  57     1.1%
JOHN_PASTON_II                    56     1.1%
MARGARET_PASTON[N.MAUTBY]         50     1.0%

Recipient API · categorical · feature

This column appears to capture the recipient's title or social role, with 326 distinct values across 4,970 rows. The distribution is moderately diverse (entropy_ratio 0.73): the most common value, 'SIR', covers only 13.4% of non-null records (10.9% of all rows), followed by 'LADY' and compound role strings like 'SIR/LOCAL_POLITICIAN(DNB)'. Two things stand out: 18.9% of rows are null, and many values are concatenated multi-role strings rather than atomic titles, which suggests inconsistent encoding.

Treatment: Split compound '/'-delimited roles and one-hot or target-encode atomic titles before modelling.

anthropic:claude-opus-4-7 · confidence high
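
Because one recipient can hold several roles at once, a multi-hot encoding (one 0/1 column per atomic role) is a natural reading of "split and one-hot". The sketch below assumes '/' is the only delimiter and that a trailing '(DNB)' is a marker to strip; the three-role vocabulary is a made-up subset, and `split_roles` and `multi_hot` are hypothetical helpers.

```python
DNB_MARKER = "(DNB)"  # assumed trailing marker, stripped before splitting


def split_roles(value):
    """Return the set of atomic roles in one '/'-delimited string."""
    if not value:
        return set()
    if value.endswith(DNB_MARKER):
        value = value[: -len(DNB_MARKER)]
    return set(value.split("/"))


def multi_hot(values, vocabulary):
    """Encode each value as a 0/1 row, one column per role in `vocabulary`."""
    return [[int(role in split_roles(v)) for role in vocabulary] for v in values]


vocabulary = ["SIR", "MERCHANT", "LOCAL_POLITICIAN"]
rows = multi_hot(["SIR", "SIR/MERCHANT", "SIR/LOCAL_POLITICIAN(DNB)", None], vocabulary)
print(rows)  # [[1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 0, 0]]
```

Null rows encode as all zeros here; add a separate indicator column if missingness should be visible to a model.
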
Out[33]:

saturn.columns["Recipient API"].stats

stat              value
n                 4,970
nulls             940 (18.9%)
unique            326
top_value         SIR
top_rate          0.1345
cardinality       326
entropy           6.128
entropy_ratio     0.734
Fig 14.
Top values for Recipient API.
Top values for Recipient API (20 unique shown, of 326 total).
value                                            count  share
SIR                                              542    10.9%
LADY                                             444    8.9%
SIR/LOCAL_POLITICIAN(DNB)                        253    5.1%
MERCHANT                                         154    3.1%
SIR/ANTIQUARY                                    100    2.0%
BARONET/DIPLOMAT/AUTHOR(DNB)                     85     1.7%
SIR/MERCHANT                                     82     1.6%
BISHOP_OF_DURHAM/LORD_CHANCELLOR                 75     1.5%
ROYAL_MINISTER/ARCHBISHOP_OF_YORK/CARDINAL(DNB)  71     1.4%
KING_OF_ENGLAND                                  68     1.4%
BISHOP_OF_WINCHESTER                             68     1.4%
1ST_EARL_OF_CUMBERLAND                           64     1.3%
VISCOUNT                                         63     1.3%
BISHOP_OF_DURHAM                                 59     1.2%
CAPTAIN                                          55     1.1%
EARL_OF_LEICESTER/COURTIER/MAGNATE(DNB)          49     1.0%
SIR/PRINCIPLE_SECRETARY(DNB)                     45     0.9%
VISCOUNTESS                                      45     0.9%
SIR/LORD_KEEPER_OF_THE_GREAT_SEAL                43     0.9%
VISCOUNT_DORCHESTER/DIPLOMNAT(DNB)               43     0.9%

Recipient gender · categorical · feature

Categorical recipient gender field with three values and no nulls across 4970 rows. Heavily skewed toward MALE at 81.97% (4074), with FEMALE at 892 and a rare MALE/FEMALE combo at just 4 records. Low entropy ratio of 0.43 confirms the imbalance, and the mixed category is small enough to need a handling decision.

Treatment: One-hot encode and decide whether to merge or drop the 4 MALE/FEMALE rows before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[36]:

saturn.columns["Recipient gender"].stats

stat              value
n                 4,970
nulls             0 (0.0%)
unique            3
top_value         MALE
top_rate          0.8197
cardinality       3
entropy           0.6881
entropy_ratio     0.4342
Fig 15.
Top values for Recipient gender.
Top values for Recipient gender (3 unique shown, of 3 total).
value        count  share
MALE         4074   82.0%
FEMALE       892    17.9%
MALE/FEMALE  4      0.1%

Recipient DOB · categorical · timestamp

This is the recipient's date of birth recorded as a year only, with values clustered between the 15th and 17th centuries (e.g. 1421, 1546?, 1581), consistent with a historical correspondence or archival dataset. Roughly 19.6% of rows are null and many entries carry trailing '?' marks indicating archivist uncertainty, which inflates cardinality (210 distinct values for 4,970 rows) and means '1546?' and '1546' would be treated as different categories. An entropy ratio of 0.82 shows the distribution is fairly spread across years rather than dominated by one cohort, with the top year '1421' covering only 6.6% of non-null rows (5.3% of all rows).

Treatment: Strip '?' uncertainty markers, cast to integer year, and bucket into centuries or decades before modelling.

anthropic:claude-opus-4-7 · confidence high
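
The bucketing step can be sketched as follows. It assumes values are plain years with an optional trailing '?', and it uses the convention that a century runs from year 01 to year 00 (so 1600 falls in the 16th century); `to_year` and `cohort` are hypothetical helpers.

```python
def to_year(raw):
    """'1546?' -> 1546; returns None for missing or non-numeric values."""
    text = (raw or "").strip().rstrip("?")
    return int(text) if text.isdecimal() else None


def cohort(raw):
    """Return (decade_start, century_number), or None if the year is unusable."""
    year = to_year(raw)
    if year is None:
        return None
    return year - year % 10, (year - 1) // 100 + 1


print(cohort("1546?"))  # (1540, 16)
print(cohort("1421"))   # (1420, 15)
print(cohort(None))     # None
```

After this step '1546?' and '1546' land in the same bucket, which removes the cardinality inflation described above.
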
Out[39]:

saturn.columns["Recipient DOB"].stats

stat              value
n                 4,970
nulls             975 (19.6%)
unique            210
top_value         1421
top_rate          0.06608
cardinality       210
entropy           6.31
entropy_ratio     0.8179
Fig 16.
Top values for Recipient DOB.
Top values for Recipient DOB (20 unique shown, of 210 total).
value  count  share
1421   264    5.3%
1546?  256    5.2%
1581   185    3.7%
1558?  182    3.7%
1633   140    2.8%
1608   108    2.2%
1628   90     1.8%
1453   88     1.8%
1444   86     1.7%
1449?  82     1.6%
1360?  75     1.5%
1595   71     1.4%
1473?  65     1.3%
1632?  64     1.3%
1624   64     1.3%
1631   62     1.2%
1644   61     1.2%
1493?  61     1.2%
1533   57     1.1%
1491   57     1.1%

Relation to author · categorical · feature

Categorical label that apparently gives the recipient's relation to the author, with 43 distinct values dominated by family and social ties (FRIEND at 14.5% of non-null rows, then BROTHER, SON, BROTHER-IN-LAW, KIN). Nearly half the rows (46.2%) are null, which is the main concern. An entropy ratio of 0.76 indicates the non-null values are spread fairly evenly across categories rather than collapsing onto one.

Treatment: Impute or add an explicit 'unknown' category for the 46% nulls, then group rare levels before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
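
The 14.5% quoted for FRIEND and the 7.8% share in the figure table differ only in their denominator. The arithmetic below uses the counts reported for this column; that top_rate excludes nulls is an inference from the numbers, not something the report states.

```python
n_rows = 4970   # total rows
n_null = 2297   # rows with no recorded relation
friend = 387    # rows labelled FRIEND

share_of_all_rows = friend / n_rows
share_of_non_null = friend / (n_rows - n_null)

print(f"{share_of_all_rows:.1%}")  # 7.8%
print(f"{share_of_non_null:.2%}")  # 14.48%
```

The same distinction applies to every column with nulls in this report: "share" is out of all 4,970 rows, while top_rate matches the share among populated rows.
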
Out[42]:

saturn.columns["Relation to author"].stats

stat              value
n                 4,970
nulls             2,297 (46.2%)
unique            43
top_value         FRIEND
top_rate          0.1448
cardinality       43
entropy           4.127
entropy_ratio     0.7606
alert: null_rate  46.2% null
Fig 17.
Top values for Relation to author.
Top values for Relation to author (20 unique shown, of 43 total).
value           count  share
FRIEND          387    7.8%
BROTHER         324    6.5%
SON             295    5.9%
BROTHER-IN-LAW  192    3.9%
KIN             188    3.8%
WIFE            160    3.2%
MOTHER          160    3.2%
FATHER          138    2.8%
HUSBAND         126    2.5%
COUSIN          115    2.3%
MOTHER-IN-LAW   81     1.6%
FUTURE_HUSBAND  78     1.6%
SISTER          67     1.3%
UNCLE           59     1.2%
FAMILY_SERVANT  53     1.1%
FATHER-IN-LAW   43     0.9%
NEPHEW          29     0.6%
SON-IN-LAW      27     0.5%
SISTER-IN-LAW   26     0.5%
AUNT            25     0.5%

Change from 2006? · categorical · metadata

A data-quality flag tracking whether each row changed since 2006, with four categories dominated by 'ok' at 95.5% of 4,970 rows. The remaining values ('corrected', 'corrected in spreadsheet', 'ok sic') indicate manual review notes, and the inconsistent labels suggest free-form curator entries rather than a controlled vocabulary. Entropy ratio of 0.15 confirms severe imbalance.

Treatment: Drop or collapse to a binary corrected/ok flag; too imbalanced to be a useful feature.

anthropic:claude-opus-4-7 · confidence high
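
Collapsing the four labels to a binary flag could look like this. Treating 'ok sic' as 'ok' is an assumption (it reads as "original spelling retained, no change"), and `collapse_change_flag` is a hypothetical helper; the counts are the ones reported for this column.

```python
def collapse_change_flag(value):
    """Map the four curator labels to 'ok' or 'corrected'."""
    text = value.strip().lower()
    return "corrected" if text.startswith("corrected") else "ok"


counts = {"ok": 4746, "corrected": 175, "corrected in spreadsheet": 46, "ok sic": 3}
collapsed = {}
for label, n in counts.items():
    key = collapse_change_flag(label)
    collapsed[key] = collapsed.get(key, 0) + n

print(collapsed)  # {'ok': 4749, 'corrected': 221}
```

Even after collapsing, the corrected class is about 4.4% of rows, so the flag is more useful for auditing than as a model feature.
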
Out[45]:

saturn.columns["Change from 2006?"].stats

stat              value
n                 4,970
nulls             0 (0.0%)
unique            4
top_value         ok
top_rate          0.9549
cardinality       4
entropy           0.3025
entropy_ratio     0.1513
alert: imbalance  top value is 95.5% of rows
Fig 18.
Top values for Change from 2006?.
Top values for Change from 2006? (4 unique shown, of 4 total).
value                     count  share
ok                        4746   95.5%
corrected                 175    3.5%
corrected in spreadsheet  46     0.9%
ok sic                    3      0.1%

Order of Gardiner letters in file · numeric · metadata

This column appears to be an ordinal index giving each Gardiner letter its position within the source file, running from 1 to 58 with a perfectly symmetric distribution (mean and median both 29.5, skew 0). 'Gardiner' most plausibly refers to Stephen Gardiner, who appears under Author name with exactly 58 letters. The striking signal is a 98.83% null rate: only 58 of 4,970 rows carry a value, exactly matching n_unique, so each populated row holds a distinct position and the column applies only to that 58-letter subset. No outliers and a uniform spread confirm it is a sequence, not a measurement.

Treatment: Drop from modelling; retain only if you need to restore the original ordering of the Gardiner letters, keyed on Letter reference.

anthropic:claude-opus-4-7 · confidence high
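
The reported stats are what a plain 1 to 58 sequence would produce, which can be checked with the standard library. The sketch assumes the 58 populated values are exactly the integers 1 through 58 and that saturn reports the sample standard deviation and linearly interpolated quartiles; both assumptions are inferred from the numbers below, not documented here.

```python
import statistics

order = list(range(1, 59))  # assumed values: the integers 1..58

print(statistics.mean(order))             # 29.5
print(statistics.median(order))           # 29.5
print(round(statistics.stdev(order), 2))  # 16.89
print(statistics.quantiles(order, n=4, method="inclusive"))  # [15.25, 29.5, 43.75]
```

Mean, median, std, q1 and q3 all match the stats table, so the column carries no information beyond position.
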
Out[48]:

saturn.columns["Order of Gardiner letters in file"].stats

stat              value
n                 4,970
nulls             4,912 (98.8%)
unique            58
min               1
max               58
mean              29.5
median            29.5
std               16.89
q1                15.25
q3                43.75
iqr               28.5
skew              0
kurtosis          -1.201
n_outliers        0
outlier_rate      0
zero_rate         0
alert: null_rate  98.8% null
Fig 19.
Distribution of Order of Gardiner letters in file. Vertical dash marks the median.
Histogram bins for Order of Gardiner letters in file (median: 29.5).
bin            count
1 – 9.143      9
9.143 – 17.29  8
17.29 – 25.43  8
25.43 – 33.57  8
33.57 – 41.71  8
41.71 – 49.86  8
49.86 – 58     9

How to cite

BibTeX
@misc{saturn-aif-2022-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: aif 2022},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/aif_2022}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: aif 2022. Source: /home/coolhand/servers/diachronica/corpus/historical-corpora/pceec/data/aif_2022.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/aif_2022