quirky social norms 20260121

source /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms_20260121.parquet · 355,922 rows · 25 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a social-norms annotation dataset of 355,922 rows and 25 columns. Each entry pairs a real-life 'situation' (mostly from Reddit confessions, AmItheAsshole, Dear Abby, and ROCStories) with an 'action', a rule-of-thumb ('rot'), and a battery of moral judgments by crowd workers. The most striking shape feature is heavy duplication in the text fields: 'rot-judgment' is 97% duplicated and 'characters' 91%, because they collapse to short controlled vocabularies, while 'situation' and 'rot' themselves repeat ~71% and ~27% of the time across annotators. Worth a closer look first: the moral-foundation distribution, which is dominated by 'care-harm' (~39% of non-null rows, 30.5% of all rows), and the 'action-legal' field, where 93% of non-null actions (81.3% of all rows) are tagged 'legal'. Both suggest class imbalance that will matter for any modeling. Also note that 'area' is reasonably balanced across the four source corpora, but 'split' is heavily skewed toward 'train' (66%).

citing: rot-moral-foundations · action-legal · area · split · rot-judgment · characters · situation · rot · action-hypothetical · rot-categorization · action-agree · row_count · column_count

Charts the summary said to look at first

rot-moral-foundations · See how 'care-harm' dominates the moral foundation labels at ~39% of non-null rows (30.5% of all rows, which is what the share column reports).


Top values for rot-moral-foundations (20 unique shown, of 30 total).
value	count	share
care-harm	108535	30.5%
fairness-cheating	37666	10.6%
loyalty-betrayal	34581	9.7%
care-harm|loyalty-betrayal	21125	5.9%
authority-subversion	19087	5.4%
sanctity-degradation	14657	4.1%
care-harm|fairness-cheating	10787	3.0%
care-harm|authority-subversion	7958	2.2%
care-harm|sanctity-degradation	6328	1.8%
fairness-cheating|loyalty-betrayal	6222	1.7%
fairness-cheating|authority-subversion	3738	1.1%
loyalty-betrayal|authority-subversion	2122	0.6%
fairness-cheating|sanctity-degradation	1218	0.3%
authority-subversion|sanctity-degradation	1042	0.3%
loyalty-betrayal|sanctity-degradation	885	0.2%
care-harm|fairness-cheating|loyalty-betrayal	834	0.2%
care-harm|loyalty-betrayal|authority-subversion	623	0.2%
care-harm|authority-subversion|sanctity-degradation	423	0.1%
care-harm|fairness-cheating|authority-subversion	416	0.1%
care-harm|loyalty-betrayal|sanctity-degradation	183	0.1%

area · Check the source mix across confessions, ROCStories, AmItheAsshole, and Dear Abby.


Top values for area (4 unique shown, of 4 total).
value	count	share
confessions	107749	30.3%
rocstories	101791	28.6%
amitheasshole	96082	27.0%
dearabby	50300	14.1%

action-legal · Notice the strong skew: ~93% of non-null actions are labeled 'legal' (81.3% of all rows, as in the share column), with 'illegal' rare.


Top values for action-legal (3 unique shown, of 3 total).
value	count	share
legal	289316	81.3%
tolerated	15064	4.2%
illegal	5934	1.7%

rot-categorization · Compare how rules-of-thumb split between advice, social-norms, morality-ethics, and description.


Top values for rot-categorization (15 unique shown, of 15 total).
value	count	share
advice	82786	23.3%
social-norms	72934	20.5%
morality-ethics	58564	16.5%
description	58537	16.4%
social-norms|advice	27657	7.8%
morality-ethics|social-norms	27118	7.6%
morality-ethics|advice	11498	3.2%
social-norms|description	6078	1.7%
advice|description	4785	1.3%
morality-ethics|description	2023	0.6%
morality-ethics|social-norms|advice	790	0.2%
morality-ethics|social-norms|description	55	0.0%
morality-ethics|advice|description	26	0.0%
social-norms|advice|description	24	0.0%
morality-ethics|social-norms|advice|description	1	0.0%

action-agree · Workers cluster at agreement values 3-4, indicating most actions get strong consensus.


Value counts for action-agree (median: 3.0). The column takes only the integers 0 to 4, so the empty fractional histogram bins are omitted.
value	count
0	1172
1	5874
2	44800
3	168120
4	91554

Schema

25 columns

Per-column summary.
column	type	nulls	unique	alerts
area	categorical	0.0%	4
m	numeric	0.0%	4	high_skew outliers
split	categorical	0.0%	7
rot-agree	numeric	0.3%	5
rot-categorization	categorical	0.9%	15
rot-moral-foundations	categorical	21.7%	30	null_rate
rot-char-targeting	categorical	0.5%	7
rot-bad	numeric	0.0%	2	high_skew
rot-judgment	text	0.0%	10,589	short_text duplicates
action	text	0.0%	260,627	multilingual duplicates
action-agency	categorical	1.8%	2
action-moral-judgment	numeric	12.7%	5
action-agree	numeric	12.5%	5
action-legal	categorical	12.8%	3
action-pressure	numeric	13.1%	5
action-char-involved	categorical	13.0%	7
action-hypothetical	categorical	19.0%	5
situation	text	0.0%	103,296	multilingual duplicates
situation-short-id	text	0.0%	103,692	one_word duplicates
rot	text	0.0%	259,614	duplicates
rot-id	text	0.0%	291,974	one_word
rot-worker-id	numeric	0.0%	89
breakdown-worker-id	numeric	0.0%	117
n-characters	numeric	0.0%	8
characters	text	0.0%	31,782	one_word duplicates

area

categorical label

This column tags each record with one of four source areas, with 'confessions' the modal value at 30.3% of 355,922 rows. The distribution is fairly balanced — entropy_ratio of 0.97 indicates near-uniform spread across the four categories, though 'dearabby' (50,300) is roughly half the size of the other three. No nulls and only 4 unique values make this a clean grouping key. Treatment: one-hot encode or use directly as a stratification/grouping variable. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 4
top_value: confessions
top_rate: 0.3027
cardinality: 4
entropy: 1.947
entropy_ratio: 0.9737
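
The report does not say how entropy and entropy_ratio are computed. A minimal sketch, assuming Shannon entropy in bits divided by its maximum log2(k) for k categories, lands very close to the figures reported for this column:

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Counts copied from the 'area' value table earlier in the report.
area_counts = {
    "confessions": 107749,
    "rocstories": 101791,
    "amitheasshole": 96082,
    "dearabby": 50300,
}

h = entropy_bits(list(area_counts.values()))
# Assumption: the ratio normalises by the maximum possible entropy, log2(k).
ratio = h / math.log2(len(area_counts))
print(f"entropy={h:.3f} bits, entropy_ratio={ratio:.3f}")
```

The same assumption also matches the two-level and three-level columns, where the reported ratio equals entropy / log2(2) and entropy / log2(3) respectively.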

m

numeric feature high_skew outliers

Column 'm' is a numeric feature with only 4 distinct values across 355,922 rows, ranging from 1 to 50 with a median of 1 and both Q1 and Q3 equal to 1. The distribution is severely right-skewed (skew 3.77, kurtosis 12.45), with 18.5% of rows flagged as outliers despite a mean of just 4.24. The tiny cardinality combined with extreme spread suggests this is a categorical-like multiplier or count where most records sit at 1 and a few jump to much larger values. Treatment: Treat as low-cardinality categorical or bin the rare large values before modelling. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 4
min: 1
max: 50
mean: 4.237
median: 1
std: 11.24
q1: 1
q3: 1
iqr: 0
skew: 3.774
kurtosis: 12.45
n_outliers: 65,947
outlier_rate: 0.1853
zero_rate: 0
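
Why 18.5% of rows count as outliers becomes clearer once the fences are written out. The sketch below assumes the common 1.5 x IQR (Tukey) rule, which the report does not state explicitly. With Q1 = Q3 = 1 the fences collapse onto 1, so every value other than 1 is flagged. The sample values are hypothetical apart from the reported min and max.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for 'm': both equal 1, so IQR is 0.
low, high = tukey_fences(q1=1, q3=1)

# Hypothetical sample: only the values 1 (min) and 50 (max) are known from the stats.
sample = [1, 1, 1, 1, 1, 1, 1, 1, 50, 50]
outliers = [v for v in sample if v < low or v > high]
print(low, high, outliers)
```

Under that assumption the 65,947 flagged rows are presumably just the rows where m is not 1, which supports treating the column as categorical rather than trimming it.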

split

categorical metadata

Column holds the dataset split assignment across 355,922 rows with 7 distinct values and no nulls. 'train' dominates at 65.6% (233,501 rows), followed by near-equal 'test' (29,239) and 'dev' (29,234), plus auxiliary 'test-extra', 'analysis', and 'dev-extra' partitions; a 'none' bucket of 3,913 rows is unusual and likely indicates unassigned examples. Treatment: Use as a row filter to separate train/dev/test; investigate or exclude the 'none' rows before modelling. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 7
top_value: train
top_rate: 0.656
cardinality: 7
entropy: 1.763
entropy_ratio: 0.6281
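
A row filter along the lines of the treatment could look like the sketch below. It uses plain Python on toy rows (the 'rot' values are placeholders) and treats everything outside train/dev/test, including 'none', as held back for inspection. That grouping is an assumption, not something the report prescribes.

```python
from collections import Counter

# Toy rows standing in for the real table; only the 'split' field matters here.
rows = [
    {"rot": "a", "split": "train"},
    {"rot": "b", "split": "train"},
    {"rot": "c", "split": "dev"},
    {"rot": "d", "split": "test"},
    {"rot": "e", "split": "none"},
    {"rot": "f", "split": "analysis"},
]

MAIN_SPLITS = ("train", "dev", "test")

# One list of rows per main partition.
partitions = {name: [r for r in rows if r["split"] == name] for name in MAIN_SPLITS}

# Everything else ('none', 'analysis', the '-extra' partitions) is set aside.
held_back = [r for r in rows if r["split"] not in MAIN_SPLITS]

print({name: len(part) for name, part in partitions.items()})
print(Counter(r["split"] for r in held_back))
```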

rot-agree

numeric feature

This is a 5-value ordinal score (0–4) that appears to rate agreement with the rule-of-thumb ('rot'), with a mean of 3.10 and median 3.0, so answers cluster at the high end. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4, and 2.4% of rows fall outside the IQR fence as low-end outliers. Nulls are negligible (0.35%) and zeros are rare (0.38%). Treatment: Treat as an ordinal Likert feature; keep as-is or one-hot encode rather than log-transform. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 1,236 (0.3%)
unique: 5
min: 0
max: 4
mean: 3.1
median: 3
std: 0.7437
q1: 3
q3: 4
iqr: 1
skew: -0.6805
kurtosis: 0.7872
n_outliers: 8,547
outlier_rate: 0.0241
zero_rate: 0.003786

rot-categorization

categorical label

Categorical tag describing the type of 'rule of thumb' (RoT), with 15 distinct values drawn from a small base vocabulary (advice, social-norms, morality-ethics, description) plus pipe-delimited combinations. Distribution is moderately balanced (entropy ratio 0.72); 'advice' leads at 23.5% of non-null rows (23.3% of all rows) and the four single-tag values dominate, while compound tags like 'social-norms|advice' (27,657) indicate multi-label encoding stuffed into one string. Null rate is low at 0.86%. Treatment: split on '|' and one-hot encode the four base categories for multi-label modelling. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 3,046 (0.9%)
unique: 15
top_value: advice
top_rate: 0.2346
cardinality: 15
entropy: 2.806
entropy_ratio: 0.7181
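
The split-and-encode treatment can be sketched in a few lines of plain Python. The helper name `multi_hot` is illustrative. Keeping nulls as `None` rather than an all-zero row is a design choice made here, so missing annotations stay distinguishable from "no tag applies".

```python
# The four base tags named in the column description.
BASE_TAGS = ("morality-ethics", "social-norms", "advice", "description")

def multi_hot(value, tags=BASE_TAGS):
    """Turn 'social-norms|advice' into {tag: 0 or 1}; None (null) stays None."""
    if value is None:
        return None
    present = set(value.split("|"))
    return {tag: int(tag in present) for tag in tags}

encoded = multi_hot("social-norms|advice")
print(encoded)
```

With pandas, `Series.str.get_dummies(sep="|")` is the usual shortcut for the same transformation.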

rot-moral-foundations

categorical label null_rate

This column tags each row with one or more of Haidt's moral foundations (care-harm, fairness-cheating, loyalty-betrayal, authority-subversion, sanctity-degradation), with multi-label combinations encoded as pipe-delimited strings yielding 30 distinct values. Distribution is heavily skewed: 'care-harm' alone covers 38.9% of non-null rows (30.5% of all rows), and 21.7% of rows are null. Entropy ratio of 0.60 confirms the long tail collapses quickly after the top few categories. Treatment: Split on '|' and binarize into five multi-hot foundation indicators; treat nulls as an explicit missing flag. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 77,120 (21.7%)
unique: 30
top_value: care-harm
top_rate: 0.3893
cardinality: 30
entropy: 2.962
entropy_ratio: 0.6037
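
The value table for this column shows 'care-harm' at 30.5% while top_rate is 0.3893. The two figures differ only in the denominator, as this short check on the reported counts shows:

```python
n_rows = 355_922
n_null = 77_120
care_harm = 108_535  # rows tagged exactly 'care-harm' (combinations excluded)

share_of_all = care_harm / n_rows                  # denominator: every row
share_of_non_null = care_harm / (n_rows - n_null)  # denominator: labeled rows only

print(f"{share_of_all:.1%} of all rows, {share_of_non_null:.1%} of non-null rows")
# prints: 30.5% of all rows, 38.9% of non-null rows
```

The same relationship appears to hold for the other categorical columns with nulls: top_rate is taken over non-null rows, while the share column in the value tables is taken over all rows.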

rot-char-targeting

categorical feature

Categorical tag identifying which character slot the rule-of-thumb (RoT) targets, with 7 distinct values dominated by 'char-0' (51.9% of non-null rows) and 'char-1' (123,396 rows). The distribution is sharply long-tailed: 'char-4' and 'char-5' appear only 46 and 15 times respectively, and 23,781 rows are explicitly 'char-none'. Entropy ratio is 0.557, confirming most mass sits in the first two categories. Null rate is low at 0.46%. Treatment: One-hot encode and consider collapsing rare 'char-3/4/5' levels into an 'other' bucket. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 1,622 (0.5%)
unique: 7
top_value: char-0
top_rate: 0.5192
cardinality: 7
entropy: 1.564
entropy_ratio: 0.5571

rot-bad

numeric label high_skew

Binary 0/1 flag (n_unique=2, min=0, max=1) indicating a rare 'rot-bad' condition. Positives occur at 2.0% (mean=0.0201, zero_rate=0.9799), producing the flagged high skew (6.84) and heavy kurtosis (44.78). The 7,153 'outliers' are simply the positive class, not anomalies. Treatment: Treat as a binary target; address class imbalance with stratified sampling or class weights rather than outlier removal. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 2
min: 0
max: 1
mean: 0.0201
median: 0
std: 0.1403
q1: 0
q3: 0
iqr: 0
skew: 6.84
kurtosis: 44.78
n_outliers: 7,153
outlier_rate: 0.0201
zero_rate: 0.9799
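
For a positive class this rare, a plain random split can leave a fold with almost no positives. The sketch below shows one way to do stratified sampling with the standard library only. The function name, the 80/20 ratio, and the toy labels are all illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # shuffle within the class only
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# Toy labels with 2% positives, mirroring the reported rate.
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Class weights are the alternative the treatment mentions; they avoid resampling but need support from the model's loss function.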

rot-judgment

text label short_text duplicates

Short moral-judgment phrases (mean 2 words, median 9 chars, max 94) drawn from a tight vocabulary of 797 tokens, dominated by verdicts like "It's good", "shouldn't", "It's okay", and "It's wrong". With 97.0% duplicate rate and only 10,589 uniques across 355,922 rows, this behaves as a categorical label rather than free text. Casing is inconsistent (e.g. "It's good" vs "it's good" appear as separate top values), which will inflate cardinality unless normalized. Treatment: Lowercase and strip punctuation, then treat as a categorical target rather than free text. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 1 (0.0%)
unique: 10,589
len_min: 1
len_max: 94
len_mean: 10.46
len_median: 9
len_p95: 19
word_mean: 2.001
word_median: 2
n_empty: 0
n_duplicates: 345,332
duplicate_rate: 0.9702
vocab_size: 797
readability_flesch_mean: 83.27
emoji_rate: 0
url_rate: 0
one_word_rate: 0.2083
allcaps_rate: 5.619e-06
boilerplate_rate: 0
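
A minimal normalisation sketch for the casing issue follows. It lowercases and strips punctuation only at the ends of the string, which is a slightly narrower choice than the treatment's "strip punctuation", so that contractions such as "shouldn't" survive intact. The trailing-period variant in the sample is hypothetical.

```python
import string
from collections import Counter

def normalize_judgment(text):
    """Lowercase, trim whitespace, and drop leading/trailing punctuation."""
    return text.strip().lower().strip(string.punctuation + string.whitespace)

# Casing variants as described above; "It's good." is a made-up punctuation variant.
raw = ["It's good", "it's good", "It's good.", "shouldn't", "It's wrong"]

counts = Counter(normalize_judgment(t) for t in raw)
print(counts.most_common())
```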

action

text free_text multilingual duplicates

Short English phrases describing an action or behaviour (mean 41.8 chars, median 7 words), e.g. 'being yourself.' or 'cheating on your partner.' — likely the subject of a moral/judgement prompt. Roughly 27% of the 355,922 rows are duplicates (95,292), with the top phrase repeating 461 times, so the same actions recur heavily across records. Language detection flags multilingual but only 2 non-English rows (1 la, 1 no) out of ~5k sampled, so effectively monolingual. Treatment: Normalize casing/punctuation and tokenize or embed before modelling; expect heavy phrase repetition. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 3 (0.0%)
unique: 260,627
len_min: 1
len_max: 221
len_mean: 41.76
len_median: 40
len_p95: 73
word_mean: 6.969
word_median: 7
n_empty: 0
n_duplicates: 95,292
duplicate_rate: 0.2677
vocab_size: 9,430
readability_flesch_mean: 57.88
emoji_rate: 0
url_rate: 2.81e-06
one_word_rate: 0.006013
allcaps_rate: 0
boilerplate_rate: 2.81e-06

action-agency

categorical feature

Binary categorical flag with only two levels, 'agency' (88.5% of non-null rows) and 'experience' (the remainder). Going by the label names, it likely distinguishes actions a character performs from events a character merely experiences, though the profile itself does not confirm this. Class imbalance is heavy and 1.84% of rows are null, so any modelling needs to account for both. Entropy ratio of 0.515 reflects the dominance of the majority class. Treatment: Encode as a binary indicator and impute or flag the ~1.8% nulls. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 6,551 (1.8%)
unique: 2
top_value: agency
top_rate: 0.8849
cardinality: 2
entropy: 0.5152
entropy_ratio: 0.5152

action-moral-judgment

numeric label

A discrete moral-judgment rating on a 5-point scale from -2 to 2 (likely Likert-style: very wrong → very right). The mean of -0.178 and median of 0 indicate a slight lean toward negative judgments, with 43.7% of values exactly zero and 12.7% missing. The distribution is nearly symmetric (skew -0.011) and platykurtic, so most ratings cluster near neutral with modest spread (std 0.857). Treatment: Treat as an ordinal label; impute or filter the 12.7% nulls before modelling. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 45,215 (12.7%)
unique: 5
min: -2
max: 2
mean: -0.1785
median: 0
std: 0.8565
q1: -1
q3: 0
iqr: 1
skew: -0.01137
kurtosis: -0.3466
n_outliers: 4,592
outlier_rate: 0.01478
zero_rate: 0.4367

action-agree

numeric feature

This is a 5-level ordinal feature (values 0-4), almost certainly a Likert-style agreement rating for an 'action' item, with mean 3.10 and median 3.0 indicating a tilt toward agreement. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4, so most respondents pick 3 or 4; only 0.38% give zero. Note the 12.48% null rate, which is substantial and likely reflects non-response or skipped items. Treatment: Treat as ordinal (0-4); impute or flag the ~12% missing values before modelling. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 44,402 (12.5%)
unique: 5
min: 0
max: 4
mean: 3.101
median: 3
std: 0.7326
q1: 3
q3: 4
iqr: 1
skew: -0.6768
kurtosis: 0.8829
n_outliers: 7,046
outlier_rate: 0.02262
zero_rate: 0.003762
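
The summary statistics above can be reproduced from the five value counts shown in the chart data for this column. The sketch assumes the outlier flag uses 1.5 x IQR fences around the reported quartiles; under that assumption, the 0 and 1 ratings are exactly the rows flagged.

```python
# Value counts copied from the action-agree chart data (non-null rows only).
counts = {0: 1172, 1: 5874, 2: 44800, 3: 168120, 4: 91554}

n = sum(counts.values())
mean = sum(v * c for v, c in counts.items()) / n
zero_rate = counts[0] / n

# Fences from the reported quartiles, assuming the 1.5 * IQR rule.
q1, q3 = 3, 4
low = q1 - 1.5 * (q3 - q1)   # 1.5
high = q3 + 1.5 * (q3 - q1)  # 5.5
n_outliers = sum(c for v, c in counts.items() if v < low or v > high)

print(n, round(mean, 3), n_outliers, round(n_outliers / n, 5))
```

Note that the denominator is the 311,520 non-null rows, not the full 355,922.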

action-legal

categorical feature

This is a categorical legal-status flag with only 3 distinct values ('legal', 'tolerated', 'illegal') across 355,922 rows. The distribution is severely imbalanced: 'legal' accounts for 93.2% of non-null records (81.3% of all rows) while 'illegal' represents just 5,934 rows, and entropy ratio is only 0.26. Note also that 12.81% of rows are null, so absence of a status may itself be a meaningful signal. Treatment: One-hot encode with an explicit 'missing' category; consider class-weighting due to severe imbalance. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 45,608 (12.8%)
unique: 3
top_value: legal
top_rate: 0.9323
cardinality: 3
entropy: 0.4153
entropy_ratio: 0.262
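
As a rough illustration of class-weighting with an explicit 'missing' category, the sketch below applies the inverse-frequency "balanced" heuristic (the same formula scikit-learn uses for `class_weight="balanced"`) to the reported counts. Whether 'missing' should be a class at all depends on the modelling task, so treat it as an assumption.

```python
# Reported counts, with nulls folded into an explicit 'missing' category.
counts = {"legal": 289316, "tolerated": 15064, "illegal": 5934, "missing": 45608}

n_total = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic: weight = n_total / (n_classes * class_count).
class_weight = {label: n_total / (n_classes * c) for label, c in counts.items()}

for label, w in sorted(class_weight.items(), key=lambda kv: kv[1]):
    print(f"{label:10s} {w:.2f}")
```

Rare classes get proportionally larger weights, so each class contributes equally to a weighted loss in aggregate.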

action-pressure

numeric feature

A discrete ordinal feature taking only 5 values across the symmetric range -2 to 2, almost certainly a Likert-style pressure rating. The distribution is balanced (mean -0.04, skew -0.05) and centered on 0, which accounts for 35.3% of rows. Notable: 13.08% of rows are null, and despite being numeric there are just 5 unique values. Treatment: Treat as ordinal categorical; impute or flag the 13% missing before modelling. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 46,537 (13.1%)
unique: 5
min: -2
max: 2
mean: -0.03992
median: 0
std: 1.114
q1: -1
q3: 1
iqr: 2
skew: -0.04985
kurtosis: -0.6595
n_outliers: 0
outlier_rate: 0
zero_rate: 0.3534

action-char-involved

categorical feature

Categorical pointer to which character slot is involved in an action, with 7 distinct values dominated by char-0 (51.5%) and char-1 (~31%). The long tail collapses fast: char-3 through char-5 together account for fewer than 1,400 rows, and 13.05% of rows are null while another 22,100 are explicitly tagged 'char-none'. Entropy ratio of 0.56 confirms the heavy concentration on the first two slots. Treatment: Treat as a low-cardinality categorical; collapse char-3/4/5 into an 'other' bucket and decide whether null and char-none should be merged. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 46,441 (13.0%)
unique: 7
top_value: char-0
top_rate: 0.515
cardinality: 7
entropy: 1.578
entropy_ratio: 0.562
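
The collapsing step might be sketched as follows. Here nulls are kept separate from the explicit 'char-none' tag, which is one of the two options the treatment leaves open. The 'missing' and 'other' bucket names are illustrative.

```python
# Rare slots named in the column description.
RARE = {"char-3", "char-4", "char-5"}

def collapse_char(value):
    """Map rare slots to 'other'; keep null distinct from the explicit 'char-none'."""
    if value is None:
        return "missing"
    return "other" if value in RARE else value

raw = ["char-0", "char-1", "char-4", "char-none", None, "char-3"]
collapsed = [collapse_char(v) for v in raw]
print(collapsed)
```

Merging null into 'char-none' instead would only require returning "char-none" in the first branch.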

action-hypothetical

categorical label

A 5-level categorical label indicating whether an action is stated explicitly or only hypothetically/probably, with negative variants ('explicit-no', 'probable-no'). 'explicit' dominates at 47.1% of non-null rows, but 19.04% of values are null, which is substantial. Entropy ratio of 0.84 indicates the remaining classes are reasonably spread rather than collapsed onto a single mode. Treatment: One-hot encode the five levels and add an explicit missing indicator for the 19% nulls. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 67,776 (19.0%)
unique: 5
top_value: explicit
top_rate: 0.4711
cardinality: 5
entropy: 1.956
entropy_ratio: 0.8423

situation

text free_text multilingual duplicates

Short first-person descriptions of personal situations or dilemmas, averaging 55 characters / 10 words and topping out at 300 chars, with high readability (Flesch 78.5). Massive duplication is the headline issue: 252,626 of 355,922 rows are duplicates (71%), leaving only 103,296 uniques, and several distinct strings repeat exactly 130 times. Because the duplicate rate closely tracks that of situation-short-id (70.9%), the repetition most likely reflects many annotation rows per situation rather than accidental copies, although the 130-repeat pattern still suggests templated or oversampled records. Language detection is overwhelmingly English (4,989 sampled rows) with a tiny multilingual tail (6 de, 1 each es/it/nl/pt/ru) that is unlikely to matter at scale. Treatment: Deduplicate the text before tokenizing and embedding, while keeping the per-row annotations; the 130-repeat pattern warrants checking the upstream sampling. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 103,296
len_min: 10
len_max: 300
len_mean: 55.42
len_median: 52
len_p95: 100
word_mean: 10.52
word_median: 10
n_empty: 0
n_duplicates: 252,626
duplicate_rate: 0.7098
vocab_size: 16,855
readability_flesch_mean: 78.53
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0.0005282
boilerplate_rate: 0.0001292
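
The duplicate figures follow directly from the row and unique counts, assuming a "duplicate" is every repeat after a value's first occurrence. A small sketch with a toy column (strings are hypothetical), followed by the same arithmetic on the reported totals:

```python
def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate), counting every repeat after the first."""
    n = len(values)
    n_dup = n - len(set(values))
    return n_dup, n_dup / n

# Toy column; the strings are made up for illustration.
toy = ["lost my keys", "lost my keys", "skipped class", "lost my keys"]
toy_dup, toy_rate = duplicate_stats(toy)

# Same arithmetic on the totals reported for 'situation'.
reported_dup = 355_922 - 103_296
reported_rate = reported_dup / 355_922

print(toy_dup, toy_rate)
print(reported_dup, round(reported_rate, 4))
```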

situation-short-id

text foreign_key one_word duplicates

Slash-delimited source identifiers pointing back to situations on Reddit (confessions, amitheasshole), Dear Abby columns, and ROCStories sentences. Despite the 'id' name, only 103,692 values are unique across 355,922 rows — a 70.9% duplicate rate, with the most repeated key appearing 180 times, so each situation evidently surfaces in many rows. Every value is a single token (one_word_rate 1.0) up to 99 chars long. Treatment: Use as a join key to the situation source table; do not treat as a row-level primary key. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 103,692
len_min: 24
len_max: 99
len_mean: 41.29
len_median: 27
len_p95: 74
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 252,230
duplicate_rate: 0.7087
vocab_size: 16,878
readability_flesch_mean: -443.9
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

rot

text free_text duplicates

Short English moral/normative statements (mean 10.1 words, max 221 chars), almost certainly rule-of-thumb (RoT) annotations describing what one should or shouldn't do. The vocabulary is small (9,659 types) and readability is high (Flesch 79.4), consistent with simple declarative templates like 'It is good/okay to...' and 'You shouldn't...'. Notable: 27.1% of rows are duplicates (96,308), with single phrases like 'It is good to be yourself.' repeating 287 times, indicating heavy template reuse rather than free generation. Treatment: Deduplicate or weight by frequency before tokenizing and embedding for modelling. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 259,614
len_min: 6
len_max: 221
len_mean: 54.66
len_median: 53
len_p95: 87
word_mean: 10.1
word_median: 10
n_empty: 0
n_duplicates: 96,308
duplicate_rate: 0.2706
vocab_size: 9,659
readability_flesch_mean: 79.38
emoji_rate: 0
url_rate: 2.81e-06
one_word_rate: 5.619e-06
allcaps_rate: 0
boilerplate_rate: 5.619e-06

rot-id

text identifier one_word

Structured path-like identifiers for 'rule of thumb' records, sourced from Reddit (amitheasshole, confessions), Dear Abby columns, and ROCStories. Despite being IDs, only 291,974 of 355,922 rows are unique (duplicate_rate 0.18, with top values repeating up to 58 times), so this is not a primary key. Every value is a single token (one_word_rate 1.0, word_mean 1.0) with length 63-140 characters, which triggered the one_word alert but is expected for slash-delimited identifiers. Treatment: Treat as a composite identifier; split on '/' to extract source/subreddit/post fields rather than embedding the raw string. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 291,974
len_min: 63
len_max: 140
len_mean: 81.61
len_median: 68
len_p95: 114
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 63,948
duplicate_rate: 0.1797
vocab_size: 18,736
readability_flesch_mean: -755.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

rot-worker-id

numeric foreign_key

Despite being typed numeric, `rot-worker-id` looks like a categorical worker identifier: only 89 unique values across 355,922 rows, no nulls, and no zeros. The distribution is broad and nearly symmetric (skew -0.08, kurtosis -1.37) spanning 2 to 144 with median 83, consistent with arbitrary id codes rather than a measured quantity. No outliers were flagged. Treatment: Treat as a categorical worker key; encode or join rather than aggregating numerically. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 89
min: 2
max: 144
mean: 72.54
median: 83
std: 39.61
q1: 42
q3: 105
iqr: 63
skew: -0.0804
kurtosis: -1.366
n_outliers: 0
outlier_rate: 0
zero_rate: 0

breakdown-worker-id

numeric foreign_key

This appears to be a worker identifier encoded as integers, with 117 distinct values spread roughly uniformly between 0 and 146 (mean 71.75, median 71, skew 0.05, kurtosis -1.26). Despite being stored as numeric, the low cardinality relative to 355,922 rows and the near-flat distribution suggest a categorical key rather than a measurement. No nulls and no outliers, but about 1.4% of rows carry worker id 0, which may be a sentinel. Treatment: Cast to categorical and left-join to a worker dimension table; investigate whether id 0 is a placeholder. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 117
min: 0
max: 146
mean: 71.75
median: 71
std: 42.17
q1: 33
q3: 106
iqr: 73
skew: 0.05124
kurtosis: -1.255
n_outliers: 0
outlier_rate: 0
zero_rate: 0.01435

n-characters

numeric feature

A small-integer count column ranging 1-10 with only 8 unique values across 355,922 rows, mean 2.13 and median 2. The tight IQR (2-3) and low std (0.78) suggest most records cluster around 2-3 characters, with a mild right tail (skew 0.44) producing 1,132 outliers (~0.32%). Treatment: Treat as a low-cardinality ordinal/discrete count; usable directly or one-hot encoded. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 8
min: 1
max: 10
mean: 2.128
median: 2
std: 0.779
q1: 2
q3: 3
iqr: 1
skew: 0.4378
kurtosis: 0.4591
n_outliers: 1,132
outlier_rate: 0.00318
zero_rate: 0

characters

text feature one_word duplicates

This column lists the characters/speakers in each record, encoded as pipe-delimited role strings (e.g. 'narrator|He', 'narrator|my girlfriend'). It's extremely repetitive — 91.1% of 355,922 rows are duplicates, only 31,782 unique values exist, and 'narrator' alone appears 71,601 times. Half the entries are a single token (one_word_rate 0.50, word_median 1), so this behaves more like a low-cardinality categorical tag than free text despite the text kind. Treatment: Treat as a categorical/multi-label field: split on '|' and one-hot encode the top roles rather than tokenizing as prose. high · anthropic:claude-opus-4-7

n: 355,922
nulls: 0 (0.0%)
unique: 31,782
len_min: 8
len_max: 165
len_mean: 18.84
len_median: 17
len_p95: 38
word_mean: 1.907
word_median: 1
n_empty: 0
n_duplicates: 324,140
duplicate_rate: 0.9107
vocab_size: 5,682
readability_flesch_mean: -63.18
emoji_rate: 0
url_rate: 0
one_word_rate: 0.5012
allcaps_rate: 0
boilerplate_rate: 0
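
One way to encode only the top roles is sketched below. The first two sample values are the examples quoted in the column description; the others are hypothetical, and the cutoff of two roles is arbitrary.

```python
from collections import Counter

def split_roles(value):
    """Split a pipe-delimited role string into a list of non-empty roles."""
    return [role.strip() for role in value.split("|") if role.strip()]

# First two values come from the description above; the rest are made up.
raw = ["narrator|He", "narrator|my girlfriend", "narrator", "narrator|my mom", "narrator|He"]

# Count each role across all rows and keep the most frequent ones.
role_counts = Counter(role for v in raw for role in split_roles(v))
top_roles = [role for role, _ in role_counts.most_common(2)]

# Multi-hot indicators for the top roles only; rarer roles are dropped.
encoded = [{role: int(role in split_roles(v)) for role in top_roles} for v in raw]

print(top_roles)
print(encoded[1])
```

On the real column, near-constant roles such as 'narrator' carry little signal on their own, so the cutoff and any exclusions would need tuning.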