quirky-social_norms_20260121 · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms_20260121.parquet

Saturn profiled 355,922 rows across 25 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/quirky/social_norms_20260121.parquet",
    "--findings", "quirky-social_norms_20260121.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This is a social-norms annotation dataset of 355,922 rows and 25 columns, where each entry pairs a real-life 'situation' (drawn from Reddit confessions, AmItheAsshole, Dear Abby, and ROCStories) with an 'action', a rule-of-thumb ('rot'), and a battery of moral judgments by crowd workers. The most striking shape feature is heavy duplication in the text fields: 'rot-judgment' is 97% duplicated and 'characters' 91%, because they collapse to short controlled vocabularies, while 'situation' and 'rot' themselves repeat ~71% and ~27% of the time across annotators. Worth a closer look first: the moral-foundation distribution, which is dominated by 'care-harm' (~39% of non-null rows), and the 'action-legal' field, where 93% of non-null actions are tagged 'legal'. Both suggest class imbalance that will matter for any modeling. Also note that 'area' is reasonably balanced across the four source corpora, although 'dearabby' is about half the size of the others, while 'split' is heavily skewed toward 'train' (66%).

citing: rot-moral-foundations · action-legal · area · split · rot-judgment · characters · situation · rot · action-hypothetical · rot-categorization · action-agree · row_count · column_count

Out[4]:

saturn.schema() · 25 columns

column	kind	n	null%	unique	alerts
area	categorical	355,922	0.0%	4
m	numeric	355,922	0.0%	4	high_skew outliers
split	categorical	355,922	0.0%	7
rot-agree	numeric	355,922	0.3%	5
rot-categorization	categorical	355,922	0.9%	15
rot-moral-foundations	categorical	355,922	21.7%	30	null_rate
rot-char-targeting	categorical	355,922	0.5%	7
rot-bad	numeric	355,922	0.0%	2	high_skew
rot-judgment	text	355,922	0.0%	10,589	short_text duplicates
action	text	355,922	0.0%	260,627	multilingual duplicates
action-agency	categorical	355,922	1.8%	2
action-moral-judgment	numeric	355,922	12.7%	5
action-agree	numeric	355,922	12.5%	5
action-legal	categorical	355,922	12.8%	3
action-pressure	numeric	355,922	13.1%	5
action-char-involved	categorical	355,922	13.0%	7
action-hypothetical	categorical	355,922	19.0%	5
situation	text	355,922	0.0%	103,296	multilingual duplicates
situation-short-id	text	355,922	0.0%	103,692	one_word duplicates
rot	text	355,922	0.0%	259,614	duplicates
rot-id	text	355,922	0.0%	291,974	one_word
rot-worker-id	numeric	355,922	0.0%	89
breakdown-worker-id	numeric	355,922	0.0%	117
n-characters	numeric	355,922	0.0%	8
characters	text	355,922	0.0%	31,782	one_word duplicates

Fig 1.

rot-moral-foundations · See how 'care-harm' dominates the moral foundation labels at ~39% of non-null rows (30.5% of all rows, which is what the table's share column reports).

Top values for rot-moral-foundations (20 unique shown, of 30 total).
value	count	share
care-harm	108535	30.5%
fairness-cheating	37666	10.6%
loyalty-betrayal	34581	9.7%
care-harm|loyalty-betrayal	21125	5.9%
authority-subversion	19087	5.4%
sanctity-degradation	14657	4.1%
care-harm|fairness-cheating	10787	3.0%
care-harm|authority-subversion	7958	2.2%
care-harm|sanctity-degradation	6328	1.8%
fairness-cheating|loyalty-betrayal	6222	1.7%
fairness-cheating|authority-subversion	3738	1.1%
loyalty-betrayal|authority-subversion	2122	0.6%
fairness-cheating|sanctity-degradation	1218	0.3%
authority-subversion|sanctity-degradation	1042	0.3%
loyalty-betrayal|sanctity-degradation	885	0.2%
care-harm|fairness-cheating|loyalty-betrayal	834	0.2%
care-harm|loyalty-betrayal|authority-subversion	623	0.2%
care-harm|authority-subversion|sanctity-degradation	423	0.1%
care-harm|fairness-cheating|authority-subversion	416	0.1%
care-harm|loyalty-betrayal|sanctity-degradation	183	0.1%

Fig 2.

area · Check the source mix across confessions, ROCStories, AmItheAsshole, and Dear Abby.

Top values for area (4 unique shown, of 4 total).
value	count	share
confessions	107749	30.3%
rocstories	101791	28.6%
amitheasshole	96082	27.0%
dearabby	50300	14.1%

Fig 3.

action-legal · Notice the strong skew: ~93% of non-null actions are labeled 'legal' (81.3% of all rows), with 'illegal' rare.

Top values for action-legal (3 unique shown, of 3 total).
value	count	share
legal	289316	81.3%
tolerated	15064	4.2%
illegal	5934	1.7%
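
The ~93% quoted for 'legal' and the 81.3% in the share column describe the same count against different denominators. The arithmetic below is a small check using only the counts in this table and the total row count; the variable names are illustrative.

```python
# Counts copied from the action-legal table; n_rows is the profiled row count.
counts = {"legal": 289_316, "tolerated": 15_064, "illegal": 5_934}
n_rows = 355_922

non_null = sum(counts.values())                  # rows that carry any label
share_of_all = counts["legal"] / n_rows          # basis of the "share" column
share_of_non_null = counts["legal"] / non_null   # basis of the quoted top rate

print(f"non-null rows: {non_null:,}")
print(f"legal, share of all rows: {share_of_all:.1%}")
print(f"legal, share of non-null rows: {share_of_non_null:.1%}")
```

The same two-denominator reading applies to every categorical column with nulls in this report: the share column counts nulls in the denominator, while top_rate does not.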

Fig 4.

rot-categorization · Compare how rules-of-thumb split between advice, social-norms, morality-ethics, and description.

Top values for rot-categorization (15 unique shown, of 15 total).
value	count	share
advice	82786	23.3%
social-norms	72934	20.5%
morality-ethics	58564	16.5%
description	58537	16.4%
social-norms|advice	27657	7.8%
morality-ethics|social-norms	27118	7.6%
morality-ethics|advice	11498	3.2%
social-norms|description	6078	1.7%
advice|description	4785	1.3%
morality-ethics|description	2023	0.6%
morality-ethics|social-norms|advice	790	0.2%
morality-ethics|social-norms|description	55	0.0%
morality-ethics|advice|description	26	0.0%
social-norms|advice|description	24	0.0%
morality-ethics|social-norms|advice|description	1	0.0%

Fig 5.

action-agree · Workers cluster at agreement values 3-4, indicating most actions get strong consensus.

Histogram bins for action-agree (median: 3.0); bins with a count of 0 are omitted.
bin	count
0 – 0.1	1172
1 – 1.1	5874
2 – 2.1	44800
3 – 3.1	168120
3.9 – 4	91554

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
area	categorical	0.0%
m	numeric	0.0%
split	categorical	0.0%
rot-agree	numeric	0.3%
rot-categorization	categorical	0.9%
rot-moral-foundations	categorical	21.7%
rot-char-targeting	categorical	0.5%
rot-bad	numeric	0.0%
rot-judgment	text	0.0%
action	text	0.0%
action-agency	categorical	1.8%
action-moral-judgment	numeric	12.7%
action-agree	numeric	12.5%
action-legal	categorical	12.8%
action-pressure	numeric	13.1%
action-char-involved	categorical	13.0%
action-hypothetical	categorical	19.0%
situation	text	0.0%
situation-short-id	text	0.0%
rot	text	0.0%
rot-id	text	0.0%
rot-worker-id	numeric	0.0%
breakdown-worker-id	numeric	0.0%
n-characters	numeric	0.0%
characters	text	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 14,991 detected strings).
lang	count	share
en	14978	99.9%
de	6	0.0%
la	1	0.0%
no	1	0.0%
nl	1	0.0%
pt	1	0.0%
es	1	0.0%
it	1	0.0%
ru	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 9 numeric columns (values clipped to 2 decimals).
	m	rot-agree	rot-bad	action-moral-judgment	action-agree	action-pressure	rot-worker-id	breakdown-worker-id	n-characters
m	+1.00	-0.10	-0.02	+0.03	+0.04	-0.02	+0.01	-0.04	-0.01
rot-agree	-0.10	+1.00	+0.08	+0.03	-0.00	+0.03	+0.02	+0.02	-0.04
rot-bad	-0.02	+0.08	+1.00	-0.07	-0.04	+0.04	+0.03	-0.07	-0.05
action-moral-judgment	+0.03	+0.03	-0.07	+1.00	-0.04	+0.06	+0.00	+0.01	+0.12
action-agree	+0.04	-0.00	-0.04	-0.04	+1.00	-0.05	+0.02	-0.12	+0.01
action-pressure	-0.02	+0.03	+0.04	+0.06	-0.05	+1.00	-0.01	-0.03	+0.04
rot-worker-id	+0.01	+0.02	+0.03	+0.00	+0.02	-0.01	+1.00	-0.03	-0.01
breakdown-worker-id	-0.04	+0.02	-0.07	+0.01	-0.12	-0.03	-0.03	+1.00	-0.01
n-characters	-0.01	-0.04	-0.05	+0.12	+0.01	+0.04	-0.01	-0.01	+1.00

area categorical label

This column tags each record with one of four source areas, with 'confessions' the modal value at 30.3% of 355,922 rows. The distribution is fairly balanced — entropy_ratio of 0.97 indicates near-uniform spread across the four categories, though 'dearabby' (50,300) is roughly half the size of the other three. No nulls and only 4 unique values make this a clean grouping key.

Treatment: one-hot encode or use directly as a stratification/grouping variable.

anthropic:claude-opus-4-7 · confidence high

Out[14]:

saturn.columns["area"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	4
top_value	confessions
top_rate	0.3027
cardinality	4
entropy	1.947
entropy_ratio	0.9737
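
The entropy figures above can be reproduced from the four category counts. This is a minimal sketch that assumes Saturn reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2 of the cardinality; that definition is inferred from the numbers, not taken from Saturn's documentation. The function name is illustrative.

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# Counts for confessions, rocstories, amitheasshole, dearabby.
area_counts = [107_749, 101_791, 96_082, 50_300]

h = entropy_bits(area_counts)
ratio = h / math.log2(len(area_counts))  # 1.0 would mean perfectly uniform

print(round(h, 3), round(ratio, 4))
```

Under that assumption the result lands on the reported entropy (1.947) and entropy_ratio (0.9737) to the precision shown.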

Fig 9.

Top values for area.

Top values for area (4 unique shown, of 4 total).
value	count	share
confessions	107749	30.3%
rocstories	101791	28.6%
amitheasshole	96082	27.0%
dearabby	50300	14.1%

m numeric feature

Column 'm' is a numeric feature with only 4 distinct values across 355,922 rows, ranging from 1 to 50 with a median of 1 and both Q1 and Q3 equal to 1. The distribution is severely right-skewed (skew 3.77, kurtosis 12.45). Because the IQR is 0, the 1.5 × IQR fence collapses onto the value 1, so every row with m > 1 (18.5% of rows) is flagged as an outlier; these are minority levels rather than anomalies. The mean of 4.24 is pulled up by the 20,000 rows in the top bin near 50. The tiny cardinality combined with the wide range suggests a categorical-like multiplier or count where most records sit at 1 and a few jump to much larger values.

Treatment: Treat as low-cardinality categorical or bin the rare large values before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[17]:

saturn.columns["m"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	4
min	1
max	50
mean	4.237
median	1
std	11.24
q1	1
q3	1
iqr	0
skew	3.774
kurtosis	12.45
n_outliers	65,947
outlier_rate	0.1853
zero_rate	0
alert: high_skew	skew=+3.77
alert: outliers	18.5% rows beyond 1.5 IQR
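
The outlier alert follows directly from the quartiles. The sketch below applies the 1.5 × IQR rule named in the alert to the reported q1 and q3, then sums the three non-modal histogram bins for m; the helper name is illustrative.

```python
def iqr_fence(q1, q3, k=1.5):
    """Return the (low, high) bounds of the k * IQR outlier fence."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for m: both equal 1, so the IQR is 0.
low, high = iqr_fence(q1=1, q3=1)
print(low, high)  # the fence collapses to a single point

# Counts of the three histogram bins that lie above the modal value 1.
non_modal_bins = [5_912, 40_035, 20_000]
n_outliers = sum(non_modal_bins)
print(n_outliers, round(n_outliers / 355_922, 4))
```

The sum matches the reported n_outliers of 65,947, which supports reading the alert as "any value other than 1" rather than as evidence of bad data.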

Fig 10.

Distribution of m. Vertical dash marks the median.

Histogram bins for m (median: 1.0); bins with a count of 0 are omitted.
bin	count
1 – 2.225	289975
2.225 – 3.45	5912
4.675 – 5.9	40035
48.78 – 50	20000

split categorical metadata

This column holds the dataset split assignment across 355,922 rows with 7 distinct values and no nulls. 'train' dominates at 65.6% (233,501 rows), followed by near-equal 'test' (29,239) and 'dev' (29,234), plus auxiliary 'test-extra', 'analysis', and 'dev-extra' partitions of roughly 20,000 rows each; a 'none' bucket of 3,913 rows is unusual and likely indicates unassigned examples.

Treatment: Use as a row filter to separate train/dev/test; investigate or exclude the 'none' rows before modelling.

anthropic:claude-opus-4-7 · confidence high
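
A row filter of the kind described in the treatment can be sketched in plain Python. The helper name and the toy rows are hypothetical; only the 'split' column name and its 'none' value come from the profile.

```python
from collections import defaultdict


def partition_by_split(rows, drop=("none",)):
    """Group row dicts by their 'split' value, skipping unwanted buckets."""
    parts = defaultdict(list)
    for row in rows:
        if row["split"] not in drop:
            parts[row["split"]].append(row)
    return dict(parts)


# Toy rows for illustration only; not taken from the dataset.
toy = [
    {"split": "train", "rot": "a"},
    {"split": "dev", "rot": "b"},
    {"split": "none", "rot": "c"},
    {"split": "train", "rot": "d"},
]

parts = partition_by_split(toy)
print({name: len(rows) for name, rows in parts.items()})
```

Dropping 'none' by default is a conservative choice; passing drop=() keeps those rows for inspection instead.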

Out[20]:

saturn.columns["split"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	7
top_value	train
top_rate	0.656
cardinality	7
entropy	1.763
entropy_ratio	0.6281

Fig 11.

Top values for split.

Top values for split (7 unique shown, of 7 total).
value	count	share
train	233501	65.6%
test	29239	8.2%
dev	29234	8.2%
test-extra	20050	5.6%
analysis	20000	5.6%
dev-extra	19985	5.6%
none	3913	1.1%

rot-agree numeric feature

This is a 5-value ordinal score (0-4) giving an agreement rating for the rule-of-thumb ('rot'), with a mean of 3.10 and median 3.0, so answers cluster at the high end. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4; the 2.4% of non-null rows flagged as outliers are the ratings of 0 and 1, which fall below the lower IQR fence of 1.5. Nulls are negligible (0.35%) and zeros are rare (0.38%).

Treatment: Treat as an ordinal Likert feature; keep as-is or one-hot encode rather than log-transform.

anthropic:claude-opus-4-7 · confidence high

Out[23]:

saturn.columns["rot-agree"].stats

stat	value
n	355,922
nulls	1,236 (0.3%)
unique	5
min	0
max	4
mean	3.1
median	3
std	0.7437
q1	3
q3	4
iqr	1
skew	-0.6805
kurtosis	0.7872
n_outliers	8,547
outlier_rate	0.0241
zero_rate	0.003786

Fig 12.

Distribution of rot-agree. Vertical dash marks the median.

Histogram bins for rot-agree (median: 3.0); bins with a count of 0 are omitted.
bin	count
0 – 0.1	1343
1 – 1.1	7204
2 – 2.1	52396
3 – 3.1	187322
3.9 – 4	106421

rot-categorization categorical label

Categorical tag describing the type of 'rule of thumb' (RoT), with 15 distinct values: every non-empty combination of four base tags (advice, social-norms, morality-ethics, description), joined with pipes. Distribution is moderately balanced (entropy ratio 0.72); 'advice' leads at 23.5% of non-null rows (23.3% of all rows) and the four single-tag values together dominate, while compound tags like 'social-norms|advice' (27,657) indicate a multi-label encoding packed into one string. Null rate is low at 0.86%.

Treatment: split on '|' and one-hot encode the four base categories for multi-label modelling.

anthropic:claude-opus-4-7 · confidence high
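
The split-and-binarize treatment can be sketched without any dependencies. The function name and the extra 'missing' flag are illustrative choices, not part of Saturn's output; pandas users may prefer `Series.str.get_dummies(sep="|")`, which performs a similar expansion.

```python
BASE_TAGS = ("morality-ethics", "social-norms", "advice", "description")


def multi_hot(value, tags=BASE_TAGS, sep="|"):
    """Map a pipe-delimited label string to a dict of 0/1 indicators.

    A null (None) yields all-zero tag indicators and a 'missing' flag of 1.
    """
    present = set(value.split(sep)) if value is not None else set()
    encoded = {tag: int(tag in present) for tag in tags}
    encoded["missing"] = int(value is None)
    return encoded


print(multi_hot("social-norms|advice"))
print(multi_hot(None))
```

Keeping the missing flag separate from the tag indicators avoids conflating "no tag applies" with "not annotated".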

Out[26]:

saturn.columns["rot-categorization"].stats

stat	value
n	355,922
nulls	3,046 (0.9%)
unique	15
top_value	advice
top_rate	0.2346
cardinality	15
entropy	2.806
entropy_ratio	0.7181

Fig 13.

Top values for rot-categorization.

Top values for rot-categorization (15 unique shown, of 15 total).
value	count	share
advice	82786	23.3%
social-norms	72934	20.5%
morality-ethics	58564	16.5%
description	58537	16.4%
social-norms|advice	27657	7.8%
morality-ethics|social-norms	27118	7.6%
morality-ethics|advice	11498	3.2%
social-norms|description	6078	1.7%
advice|description	4785	1.3%
morality-ethics|description	2023	0.6%
morality-ethics|social-norms|advice	790	0.2%
morality-ethics|social-norms|description	55	0.0%
morality-ethics|advice|description	26	0.0%
social-norms|advice|description	24	0.0%
morality-ethics|social-norms|advice|description	1	0.0%

rot-moral-foundations categorical label

This column tags each row with one or more of Haidt's moral foundations (care-harm, fairness-cheating, loyalty-betrayal, authority-subversion, sanctity-degradation), with multi-label combinations encoded as pipe-delimited strings yielding 30 distinct values. Distribution is heavily skewed: 'care-harm' alone covers 38.9% of non-null rows (30.5% of all rows), and 21.7% of rows are null. Entropy ratio of 0.60 confirms the long tail collapses quickly after the top few categories.

Treatment: Split on '|' and binarize into five multi-hot foundation indicators; treat nulls as an explicit missing flag.

anthropic:claude-opus-4-7 · confidence high

Out[29]:

saturn.columns["rot-moral-foundations"].stats

stat	value
n	355,922
nulls	77,120 (21.7%)
unique	30
top_value	care-harm
top_rate	0.3893
cardinality	30
entropy	2.962
entropy_ratio	0.6037
alert: null_rate	21.7% null

Fig 14.

Top values for rot-moral-foundations.

Top values for rot-moral-foundations (20 unique shown, of 30 total).
value	count	share
care-harm	108535	30.5%
fairness-cheating	37666	10.6%
loyalty-betrayal	34581	9.7%
care-harm|loyalty-betrayal	21125	5.9%
authority-subversion	19087	5.4%
sanctity-degradation	14657	4.1%
care-harm|fairness-cheating	10787	3.0%
care-harm|authority-subversion	7958	2.2%
care-harm|sanctity-degradation	6328	1.8%
fairness-cheating|loyalty-betrayal	6222	1.7%
fairness-cheating|authority-subversion	3738	1.1%
loyalty-betrayal|authority-subversion	2122	0.6%
fairness-cheating|sanctity-degradation	1218	0.3%
authority-subversion|sanctity-degradation	1042	0.3%
loyalty-betrayal|sanctity-degradation	885	0.2%
care-harm|fairness-cheating|loyalty-betrayal	834	0.2%
care-harm|loyalty-betrayal|authority-subversion	623	0.2%
care-harm|authority-subversion|sanctity-degradation	423	0.1%
care-harm|fairness-cheating|authority-subversion	416	0.1%
care-harm|loyalty-betrayal|sanctity-degradation	183	0.1%

rot-char-targeting categorical feature

Categorical tag identifying which character slot a rule-of-thumb ('rot') targets, with 7 distinct values dominated by 'char-0' (51.9% of non-null rows) and 'char-1' (123,396 rows). The distribution is sharply long-tailed: 'char-4' and 'char-5' appear only 46 and 15 times respectively, and 23,781 rows are explicitly 'char-none'. Entropy ratio is 0.557, confirming most mass sits in the first two categories. Null rate is low at 0.46%.

Treatment: One-hot encode and consider collapsing rare 'char-3/4/5' levels into an 'other' bucket.

anthropic:claude-opus-4-7 · confidence high

Out[32]:

saturn.columns["rot-char-targeting"].stats

stat	value
n	355,922
nulls	1,622 (0.5%)
unique	7
top_value	char-0
top_rate	0.5192
cardinality	7
entropy	1.564
entropy_ratio	0.5571

Fig 15.

Top values for rot-char-targeting.

Top values for rot-char-targeting (7 unique shown, of 7 total).
value	count	share
char-0	183950	51.7%
char-1	123396	34.7%
char-none	23781	6.7%
char-2	21648	6.1%
char-3	1464	0.4%
char-4	46	0.0%
char-5	15	0.0%
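
Collapsing the rare levels, as the treatment suggests, can be sketched with the counts from the table above. The helper name and the min_count threshold of 5,000 are illustrative assumptions; any cutoff between 1,464 and 21,648 gives the same grouping here.

```python
from collections import Counter


def collapse_rare(counts, min_count, other="other"):
    """Merge levels with fewer than min_count rows into a single bucket."""
    merged = Counter()
    for level, n in counts.items():
        merged[level if n >= min_count else other] += n
    return dict(merged)


# Counts copied from the rot-char-targeting table.
counts = {
    "char-0": 183_950,
    "char-1": 123_396,
    "char-none": 23_781,
    "char-2": 21_648,
    "char-3": 1_464,
    "char-4": 46,
    "char-5": 15,
}

collapsed = collapse_rare(counts, min_count=5_000)
print(collapsed)
```

The merged bucket holds 1,525 rows, still under half a percent of the column, so it mainly serves to keep the encoding compact.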

rot-bad numeric label

Binary 0/1 flag (n_unique=2, min=0, max=1) indicating a rare 'rot-bad' condition. Positives occur at 2.0% (mean=0.0201, zero_rate=0.9799), producing the flagged high skew (6.84) and heavy kurtosis (44.78). The 7,153 'outliers' are simply the positive class, not anomalies.

Treatment: Treat as a binary target; address class imbalance with stratified sampling or class weights rather than outlier removal.

anthropic:claude-opus-4-7 · confidence high

Out[35]:

saturn.columns["rot-bad"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	2
min	0
max	1
mean	0.0201
median	0
std	0.1403
q1	0
q3	0
iqr	0
skew	6.84
kurtosis	44.78
n_outliers	7,153
outlier_rate	0.0201
zero_rate	0.9799
alert: high_skew	skew=+6.84
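
One common way to set class weights for a flag this imbalanced is the "balanced" heuristic, weight = n_total / (n_classes × n_class), which is also the formula scikit-learn uses for class_weight="balanced". The sketch below applies it to the counts implied by the stats above (7,153 positives out of 355,922 rows); the function name is illustrative, and the heuristic is one option rather than a recommendation from the profile.

```python
def balanced_class_weights(counts):
    """Return weight = total / (n_classes * class_count) for each class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Class counts for rot-bad: negatives are all rows minus the 7,153 positives.
counts = {0: 355_922 - 7_153, 1: 7_153}

weights = balanced_class_weights(counts)
print({label: round(w, 3) for label, w in weights.items()})
```

The positive class ends up weighted roughly 49 times more heavily than the negative class, mirroring the 98:2 imbalance.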

Fig 16.

Distribution of rot-bad. Vertical dash marks the median.

Histogram bins for rot-bad (median: 0.0); bins with a count of 0 are omitted.
bin	count
0 – 0.025	348769
0.975 – 1	7153

rot-judgment text label

Short moral-judgment phrases (mean 2 words, median 9 chars, max 94) drawn from a tight vocabulary of 797 tokens, dominated by verdicts like "It's good", "shouldn't", "It's okay", and "It's wrong". With 97.0% duplicate rate and only 10,589 uniques across 355,922 rows, this behaves as a categorical label rather than free text. Casing is inconsistent (e.g. "It's good" vs "it's good" appear as separate top values), which will inflate cardinality unless normalized.

Treatment: Lowercase and strip punctuation, then treat as a categorical target rather than free text.

anthropic:claude-opus-4-7 · confidence high
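
The normalization step can be sketched as follows. The function name is illustrative, and the sample strings are the verdicts quoted above plus one hypothetical variant with a trailing period. Note that `string.punctuation` covers ASCII punctuation only, so typographic apostrophes would need separate handling.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_judgment(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())


samples = ["It's good", "it's good", "It's wrong.", "shouldn't"]
normalized = Counter(normalize_judgment(s) for s in samples)
print(normalized)
```

Stripping the apostrophe turns "it's" into "its", which is harmless for a categorical key but worth knowing if the normalized strings are later shown to readers.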

Out[38]:

saturn.columns["rot-judgment"].stats

stat	value
n	355,922
nulls	1 (0.0%)
unique	10,589
len_min	1
len_max	94
len_mean	10.46
len_median	9
len_p95	19
word_mean	2.001
word_median	2
n_empty	0
n_duplicates	345,332
duplicate_rate	0.9702
vocab_size	797
readability_flesch_mean	83.27
emoji_rate	0
url_rate	0
one_word_rate	0.2083
allcaps_rate	5.619e-06
boilerplate_rate	0
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	97.0% duplicate strings

Fig 17.

Character-length distribution for rot-judgment.

Character-length distribution for rot-judgment (mean: 10.46).
chars	count
1 – 3	8736
3 – 6	11861
6 – 8	31229
8 – 10	187513
10 – 13	22320
13 – 15	51718
15 – 17	16171
17 – 20	20003
20 – 22	2587
22 – 24	1287
24 – 27	496
27 – 29	412
29 – 31	487
31 – 34	256
34 – 36	187
36 – 38	183
38 – 41	91
41 – 43	76
43 – 45	103
45 – 48	39
48 – 50	42
50 – 52	32
52 – 54	22
54 – 57	15
57 – 59	12
59 – 61	7
61 – 64	6
64 – 66	12
66 – 68	3
68 – 71	2
71 – 73	4
73 – 75	2
75 – 78	0
78 – 80	2
80 – 82	1
82 – 85	0
85 – 87	1
87 – 89	0
89 – 92	1
92 – 94	2

action text free_text

Short English phrases describing an action or behaviour (mean 41.8 chars, median 7 words), e.g. 'being yourself.' or 'cheating on your partner.'; likely the subject of a moral-judgment prompt. Roughly 27% of the 355,922 rows are duplicates (95,292), with the top phrase repeating 461 times, so the same actions recur heavily across records. The multilingual alert rests on only 2 non-English strings (1 la, 1 no) out of ~5k sampled, so the column is effectively monolingual.

Treatment: Normalize casing/punctuation and tokenize or embed before modelling; expect heavy phrase repetition.

anthropic:claude-opus-4-7 · confidence high

Out[41]:

saturn.columns["action"].stats

stat	value
n	355,922
nulls	3 (0.0%)
unique	260,627
len_min	1
len_max	221
len_mean	41.76
len_median	40
len_p95	73
word_mean	6.969
word_median	7
n_empty	0
n_duplicates	95,292
duplicate_rate	0.2677
vocab_size	9,430
readability_flesch_mean	57.88
emoji_rate	0
url_rate	2.81e-06
one_word_rate	0.006013
allcaps_rate	0
boilerplate_rate	2.81e-06
alert: multilingual	4 languages detected in sample
alert: duplicates	26.8% duplicate strings

Fig 18.

Character-length distribution for action.

Character-length distribution for action (mean: 41.76).
chars	count
1 – 6	456
6 – 12	2368
12 – 18	18362
18 – 23	23877
23 – 28	40347
28 – 34	39906
34 – 40	49900
40 – 45	40358
45 – 50	40757
50 – 56	27618
56 – 62	25084
62 – 67	15079
67 – 72	12798
72 – 78	6840
78 – 84	5004
84 – 89	2500
89 – 94	1855
94 – 100	1061
100 – 106	731
106 – 111	448
111 – 116	223
116 – 122	111
122 – 128	79
128 – 133	63
133 – 138	40
138 – 144	19
144 – 150	13
150 – 155	4
155 – 160	10
160 – 166	2
166 – 172	0
172 – 177	2
177 – 182	3
182 – 188	0
188 – 194	0
194 – 199	0
199 – 204	0
204 – 210	0
210 – 216	0
216 – 221	1

action-agency categorical feature

Binary categorical flag with only two levels, 'agency' (88.5% of non-null rows) and 'experience' (the remaining 11.5%). The labels suggest a distinction between actions a character performs and things a character experiences, though the stats alone cannot confirm this. Class imbalance is heavy and 1.84% of rows are null, so any modelling needs to account for both. Entropy ratio of 0.515 reflects the dominance of the majority class.

Treatment: Encode as a binary indicator and impute or flag the ~1.8% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[44]:

saturn.columns["action-agency"].stats

stat	value
n	355,922
nulls	6,551 (1.8%)
unique	2
top_value	agency
top_rate	0.8849
cardinality	2
entropy	0.5152
entropy_ratio	0.5152

Fig 19.

Top values for action-agency.

Top values for action-agency (2 unique shown, of 2 total).
value	count	share
agency	309148	86.9%
experience	40223	11.3%

action-moral-judgment numeric label

A discrete moral-judgment rating on a 5-point scale from -2 to 2 (likely Likert-style: very wrong → very right). The mean of -0.178 and median of 0 indicate a slight lean toward negative judgments, with 43.7% of non-null values exactly zero and 12.7% of rows missing. The distribution is nearly symmetric (skew -0.011) with modest spread (std 0.857) and slightly lighter tails than a normal distribution (kurtosis -0.35); the 4,592 flagged outliers are simply the +2 ratings, which sit above the upper IQR fence of 1.5.

Treatment: Treat as an ordinal label; impute or filter the 12.7% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[47]:

saturn.columns["action-moral-judgment"].stats

stat	value
n	355,922
nulls	45,215 (12.7%)
unique	5
min	-2
max	2
mean	-0.1785
median	0
std	0.8565
q1	-1
q3	0
iqr	1
skew	-0.01137
kurtosis	-0.3466
n_outliers	4,592
outlier_rate	0.01478
zero_rate	0.4367

Fig 20.

Distribution of action-moral-judgment. Vertical dash marks the median.

Histogram bins for action-moral-judgment (median: 0.0); bins with a count of 0 are omitted.
bin	count
-2 – -1.9	16356
-1 – -0.9	92992
0 – 0.1	135698
1 – 1.1	61069
1.9 – 2	4592

action-agree numeric feature

This is a 5-level ordinal feature (values 0-4), almost certainly a Likert-style agreement rating for an 'action' item, with mean 3.10 and median 3.0 indicating a tilt toward agreement. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4, so most respondents pick 3 or 4; only 0.38% give zero. Note the 12.48% null rate, which is substantial and likely reflects non-response or skipped items.

Treatment: Treat as ordinal (0-4); impute or flag the ~12% missing values before modelling.

anthropic:claude-opus-4-7 · confidence high
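
"Impute or flag" can be combined: fill the gaps and keep a parallel indicator so a model can still see which values were missing. The sketch below uses median imputation, which is one reasonable choice for an ordinal 0-4 scale and not something the profile prescribes; the function name and toy values are illustrative.

```python
from statistics import median


def impute_with_flag(values):
    """Return (filled, flags): median-filled values and 0/1 missing flags."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [int(v is None) for v in values]
    return filled, flags


# Toy ratings for illustration only; None marks a missing value.
toy = [3, 4, None, 3, 2, None, 4]

filled, flags = impute_with_flag(toy)
print(filled)
print(flags)
```

With a median of 3 for this column, imputation places every missing row at the modal rating, so the flag carries most of the usable information about missingness.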

Out[50]:

saturn.columns["action-agree"].stats

stat	value
n	355,922
nulls	44,402 (12.5%)
unique	5
min	0
max	4
mean	3.101
median	3
std	0.7326
q1	3
q3	4
iqr	1
skew	-0.6768
kurtosis	0.8829
n_outliers	7,046
outlier_rate	0.02262
zero_rate	0.003762

Fig 21.

Distribution of action-agree. Vertical dash marks the median.

Histogram bins for action-agree (median: 3.0); bins with a count of 0 are omitted.
bin	count
0 – 0.1	1172
1 – 1.1	5874
2 – 2.1	44800
3 – 3.1	168120
3.9 – 4	91554

action-legal categorical feature

This is a categorical legal-status flag with only 3 distinct values ('legal', 'tolerated', 'illegal') across 355,922 rows. The distribution is severely imbalanced: 'legal' accounts for 93.2% of non-null records (81.3% of all rows) while 'illegal' represents just 5,934 rows, and the entropy ratio is only 0.26. Note also that 12.81% of rows are null, so the absence of a status may itself carry signal.

Treatment: One-hot encode with an explicit 'missing' category; consider class-weighting due to severe imbalance.

anthropic:claude-opus-4-7 · confidence high
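
One-hot encoding with an explicit missing category can be sketched as below. The three level names come from the profile; the 'missing' label, the level order, and the function name are illustrative choices.

```python
LEVELS = ("legal", "tolerated", "illegal", "missing")


def one_hot_legal(value):
    """Encode an action-legal value as a 4-element 0/1 list.

    None is mapped to the explicit 'missing' level; unknown labels raise.
    """
    level = "missing" if value is None else value
    if level not in LEVELS:
        raise ValueError(f"unexpected action-legal value: {value!r}")
    return [int(level == candidate) for candidate in LEVELS]


print(one_hot_legal("legal"))
print(one_hot_legal(None))
```

Raising on unexpected labels, rather than silently encoding all zeros, makes upstream vocabulary changes visible early.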

Out[53]:

saturn.columns["action-legal"].stats

stat	value
n	355,922
nulls	45,608 (12.8%)
unique	3
top_value	legal
top_rate	0.9323
cardinality	3
entropy	0.4153
entropy_ratio	0.262

Fig 22.

Top values for action-legal.

Top values for action-legal (3 unique shown, of 3 total).
value	count	share
legal	289316	81.3%
tolerated	15064	4.2%
illegal	5934	1.7%

action-pressure numeric feature

A discrete ordinal feature taking only 5 values across the symmetric range -2 to 2, almost certainly a Likert-style pressure rating. The distribution is balanced (mean -0.04, skew -0.05) and centered on 0, which accounts for 35.3% of non-null rows. Notable: 13.08% of rows are null, and despite being numeric there are just 5 unique values.

Treatment: Treat as ordinal categorical; impute or flag the 13% missing before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[56]:

saturn.columns["action-pressure"].stats

stat	value
n	355,922
nulls	46,537 (13.1%)
unique	5
min	-2
max	2
mean	-0.03992
median	0
std	1.114
q1	-1
q3	1
iqr	2
skew	-0.04985
kurtosis	-0.6595
n_outliers	0
outlier_rate	0
zero_rate	0.3534

Fig 23.

Distribution of action-pressure. Vertical dash marks the median.

Histogram bins for action-pressure (median: 0.0); bins with a count of 0 are omitted.
bin	count
-2 – -1.9	35266
-1 – -0.9	66353
0 – 0.1	109344
1 – 1.1	72309
1.9 – 2	26113

action-char-involved categorical feature

Categorical pointer to which character slot is involved in an action, with 7 distinct values dominated by char-0 (51.5% of non-null rows) and char-1 (34.8%). The long tail collapses fast: char-3 through char-5 together account for 1,309 rows, and 13.05% of rows are null while another 22,100 are explicitly tagged 'char-none'. Entropy ratio of 0.56 confirms the heavy concentration on the first two slots.

Treatment: Treat as a low-cardinality categorical; collapse char-3/4/5 into an 'other' bucket and decide whether null and char-none should be merged.

anthropic:claude-opus-4-7 · confidence high

Out[59]:

saturn.columns["action-char-involved"].stats

stat	value
n	355,922
nulls	46,441 (13.0%)
unique	7
top_value	char-0
top_rate	0.515
cardinality	7
entropy	1.578
entropy_ratio	0.562

Fig 24.

Top values for action-char-involved.

Top values for action-char-involved (7 unique shown, of 7 total).
value	count	share
char-0	159381	44.8%
char-1	107567	30.2%
char-none	22100	6.2%
char-2	19124	5.4%
char-3	1252	0.4%
char-4	41	0.0%
char-5	16	0.0%

action-hypothetical categorical label

A 5-level categorical labeling whether an action is stated explicitly or only hypothetically/probably, with negative variants ('explicit-no', 'probable-no'). 'explicit' dominates at 47.1% of non-null rows, but 19.04% of values are null, which is substantial. Entropy ratio of 0.84 indicates the remaining classes are reasonably spread rather than collapsed onto a single mode.

Treatment: One-hot encode the five levels and add an explicit missing indicator for the 19% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[62]:

saturn.columns["action-hypothetical"].stats

stat	value
n	355,922
nulls	67,776 (19.0%)
unique	5
top_value	explicit
top_rate	0.4711
cardinality	5
entropy	1.956
entropy_ratio	0.8423

Fig 25.

Top values for action-hypothetical.

Top values for action-hypothetical (5 unique shown, of 5 total).
value	count	share
explicit	135744	38.1%
hypothetical	62689	17.6%
probable	49670	14.0%
explicit-no	25022	7.0%
probable-no	15021	4.2%

situation text free_text

Short first-person descriptions of personal situations or dilemmas, averaging 55 characters / 10 words and topping out at 300 chars, with high readability (Flesch 78.5). Massive duplication is the headline issue: 252,626 of 355,922 rows are duplicates (71%), leaving only 103,296 uniques, and several distinct strings repeat exactly 130 times — suggesting templated or oversampled records. Language detection is overwhelmingly English (4,989) with a tiny multilingual tail (6 de, 1 each es/it/nl/pt/ru) that is unlikely to matter at scale.

Treatment: Deduplicate before modelling, then tokenize and embed; the 130-repeat pattern warrants checking the upstream sampling.

anthropic:claude-opus-4-7 · confidence high
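
Before deduplicating, it helps to see how the repeats are distributed, since a spike at one repeat count (such as the 130 mentioned above) would point at upstream sampling rather than organic repetition. A minimal sketch, with an illustrative function name and toy strings:

```python
from collections import Counter


def repeat_profile(texts):
    """Return (unique texts in first-seen order, per-text repeat counts)."""
    counts = Counter(texts)
    unique = list(dict.fromkeys(texts))  # dicts preserve insertion order
    return unique, counts


# Toy strings for illustration only; not taken from the dataset.
toy = ["lost my keys", "skipped class", "lost my keys", "lost my keys"]

unique, counts = repeat_profile(toy)
print(unique)
print(counts.most_common(1))

# How many distinct strings occur exactly n times, keyed by n.
repeats_histogram = Counter(counts.values())
print(dict(repeats_histogram))
```

On the real column, an unusually large entry at key 130 in the last histogram would be the pattern worth tracing upstream.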

Out[65]:

saturn.columns["situation"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	103,296
len_min	10
len_max	300
len_mean	55.42
len_median	52
len_p95	100
word_mean	10.52
word_median	10
n_empty	0
n_duplicates	252,626
duplicate_rate	0.7098
vocab_size	16,855
readability_flesch_mean	78.53
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0.0005282
boilerplate_rate	0.0001292
alert: multilingual	8 languages detected in sample
alert: duplicates	71.0% duplicate strings

Fig 26.

Character-length distribution for situation.

Character-length distribution for situation (mean: 55.42).
chars	count
10 – 17	3408
17 – 24	13315
24 – 32	21793
32 – 39	34803
39 – 46	55004
46 – 54	62695
54 – 61	63829
61 – 68	37281
68 – 75	22416
75 – 82	8768
82 – 90	7451
90 – 97	5256
97 – 104	4592
104 – 112	2828
112 – 119	2283
119 – 126	1878
126 – 133	1729
133 – 140	1054
140 – 148	930
148 – 155	655
155 – 162	734
162 – 170	439
170 – 177	336
177 – 184	320
184 – 191	296
191 – 198	228
198 – 206	196
206 – 213	137
213 – 220	207
220 – 228	201
228 – 235	165
235 – 242	69
242 – 249	147
249 – 256	48
256 – 264	75
264 – 271	66
271 – 278	60
278 – 286	65
286 – 293	56
293 – 300	109

situation-short-id text foreign_key

Slash-delimited source identifiers pointing back to situations on Reddit (confessions, amitheasshole), Dear Abby columns, and ROCStories sentences. Despite the 'id' name, only 103,692 values are unique across 355,922 rows, a 70.9% duplicate rate and about 3.4 rows per key on average; the most repeated key appears 180 times, so each situation evidently surfaces in many rows. Every value is a single token (one_word_rate 1.0) up to 99 chars long.

Treatment: Use as a join key to the situation source table; do not treat as a row-level primary key.

anthropic:claude-opus-4-7 · confidence high
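
One way to confirm that the column is a join key rather than a primary key is to group rows by it and compare the number of groups with the number of rows. The sketch below uses made-up key strings (`key-a`, `key-b`) that do not follow the real identifier format, and the helper `group_by_key` is hypothetical.

```python
from collections import defaultdict

# Toy rows; the key strings are made up and do not follow the real ID format.
rows = [
    {"situation-short-id": "key-a", "rot": "It's good to be honest."},
    {"situation-short-id": "key-a", "rot": "You shouldn't lie to friends."},
    {"situation-short-id": "key-b", "rot": "It is good to be yourself."},
]

def group_by_key(rows, key):
    """Map each key value to the list of row indices that carry it."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[key]].append(i)
    return dict(groups)

groups = group_by_key(rows, "situation-short-id")
is_primary_key = len(groups) == len(rows)   # False when any key repeats
max_repeats = max(len(idx) for idx in groups.values())
```

The same grouping is a reasonable basis for train/test splitting, since it keeps all rows for one situation on the same side of the split.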

Out[68]:

saturn.columns["situation-short-id"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	103,692
len_min	24
len_max	99
len_mean	41.29
len_median	27
len_p95	74
word_mean	1
word_median	1
n_empty	0
n_duplicates	252,230
duplicate_rate	0.7087
vocab_size	16,878
readability_flesch_mean	-443.9
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: duplicates	70.9% duplicate strings

Fig 27.

Character-length distribution for situation-short-id.

Character-length distribution for situation-short-id (mean: 41.29).
chars	count
24 – 26	107749
26 – 28	96082
28 – 30	0
30 – 32	0
32 – 33	0
33 – 35	0
35 – 37	0
37 – 39	0
39 – 41	0
41 – 43	0
43 – 45	0
45 – 46	0
46 – 48	0
48 – 50	0
50 – 52	0
52 – 54	0
54 – 56	14
56 – 58	101845
58 – 60	196
60 – 62	893
62 – 63	1880
63 – 65	3550
65 – 67	5362
67 – 69	3151
69 – 71	7061
71 – 73	6866
73 – 75	6129
75 – 76	5145
76 – 78	3985
78 – 80	2407
80 – 82	1681
82 – 84	336
84 – 86	930
86 – 88	316
88 – 90	137
90 – 92	151
92 – 93	28
93 – 95	5
95 – 97	7
97 – 99	16

rot text free_text

Short English moral/normative statements (mean 10.1 words, max 221 chars), almost certainly rule-of-thumb (RoT) annotations describing what one should or shouldn't do. The vocabulary is small (9,659 types) and readability is high (Flesch 79.4), consistent with simple declarative templates like 'It is good/okay to...' and 'You shouldn't...'. Notable: 27.1% of rows are duplicates (96,308), with single phrases like 'It is good to be yourself.' repeating 287 times, indicating heavy template reuse rather than free generation.

Treatment: Deduplicate or weight by frequency before tokenizing and embedding for modelling.

anthropic:claude-opus-4-7 · confidence high
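
If dropping repeats outright loses too much signal, frequency weighting is the alternative the treatment mentions. A minimal sketch, assuming inverse-frequency weights so that each distinct RoT contributes a total weight of 1; the function name is illustrative and other weighting schemes are equally defensible.

```python
from collections import Counter

def inverse_frequency_weights(texts):
    """Weight each row by 1/count so every distinct text sums to 1."""
    counts = Counter(texts)
    return [1.0 / counts[t] for t in texts]

rots = [
    "It is good to be yourself.",
    "It is good to be yourself.",
    "You shouldn't lie.",   # illustrative string, not taken from the data
]
weights = inverse_frequency_weights(rots)
```

With these weights the total equals the number of distinct strings, which on the full column would be 259,614.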

Out[71]:

saturn.columns["rot"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	259,614
len_min	6
len_max	221
len_mean	54.66
len_median	53
len_p95	87
word_mean	10.1
word_median	10
n_empty	0
n_duplicates	96,308
duplicate_rate	0.2706
vocab_size	9,659
readability_flesch_mean	79.38
emoji_rate	0
url_rate	2.81e-06
one_word_rate	5.619e-06
allcaps_rate	0
boilerplate_rate	5.619e-06
alert: duplicates	27.1% duplicate strings

Fig 28.

Character-length distribution for rot.

Character-length distribution for rot (mean: 54.66).
chars	count
6 – 11	4
11 – 17	114
17 – 22	2576
22 – 28	12669
28 – 33	19758
33 – 38	34865
38 – 44	35016
44 – 49	38131
49 – 54	47869
54 – 60	36856
60 – 65	37718
65 – 70	25208
70 – 76	19264
76 – 81	17161
81 – 87	10039
87 – 92	6486
92 – 97	4880
97 – 103	2558
103 – 108	2004
108 – 114	1000
114 – 119	610
119 – 124	446
124 – 130	213
130 – 135	231
135 – 140	87
140 – 146	58
146 – 151	30
151 – 156	28
156 – 162	20
162 – 167	11
167 – 173	3
173 – 178	2
178 – 183	3
183 – 189	1
189 – 194	2
194 – 200	0
200 – 205	0
205 – 210	0
210 – 216	0
216 – 221	1

rot-id text identifier

Structured path-like identifiers for 'rule of thumb' records, sourced from Reddit (amitheasshole, confessions), Dear Abby columns, and ROCStories. Despite being IDs, only 291,974 of 355,922 values are unique (duplicate_rate 0.18, with top values repeating up to 58 times), so this is not a primary key. Every value is a single token (one_word_rate 1.0, word_mean 1.0) of 63 to 140 characters, which triggered the one_word alert but is expected for slash-delimited identifiers.

Treatment: Treat as a composite identifier; split on '/' to extract source/subreddit/post fields rather than embedding the raw string.

anthropic:claude-opus-4-7 · confidence high
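
Splitting the identifier is a one-liner, but the meaning of each segment is not shown in this report. The sketch below therefore keeps the segments positional (`seg_0`, `seg_1`, ...) and runs on a hypothetical ID; inspect a few real values before giving the segments names such as source or subreddit.

```python
def split_identifier(raw_id, sep="/"):
    """Split a path-like ID into positional segments, dropping empty ones."""
    return [segment for segment in raw_id.split(sep) if segment]

# Hypothetical ID: the real segment layout is not shown in this report.
example = "rot/reddit/amitheasshole/abc123/worker-7/0"
segments = split_identifier(example)
features = {f"seg_{i}": seg for i, seg in enumerate(segments)}
```

The two clusters in the length histogram suggest that the number or size of segments differs by source, so a fixed positional schema may not fit every row.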

Out[74]:

saturn.columns["rot-id"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	291,974
len_min	63
len_max	140
len_mean	81.61
len_median	68
len_p95	114
word_mean	1
word_median	1
n_empty	0
n_duplicates	63,948
duplicate_rate	0.1797
vocab_size	18,736
readability_flesch_mean	-755.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word

Fig 29.

Character-length distribution for rot-id.

Character-length distribution for rot-id (mean: 81.61).
chars	count
63 – 65	4555
65 – 67	111051
67 – 69	88225
69 – 71	0
71 – 73	0
73 – 75	0
75 – 76	0
76 – 78	0
78 – 80	0
80 – 82	0
82 – 84	0
84 – 86	0
86 – 88	0
88 – 90	0
90 – 92	0
92 – 94	2
94 – 96	12
96 – 98	65795
98 – 100	36190
100 – 102	806
102 – 103	1665
103 – 105	3057
105 – 107	5065
107 – 109	6209
109 – 111	7337
111 – 113	6570
113 – 115	3187
115 – 117	5314
117 – 119	4016
119 – 121	2809
121 – 123	1847
123 – 125	925
125 – 127	700
127 – 128	296
128 – 130	169
130 – 132	72
132 – 134	22
134 – 136	8
136 – 138	14
138 – 140	4

rot-worker-id numeric foreign_key

Despite being typed numeric, `rot-worker-id` looks like a categorical worker identifier: only 89 unique values across 355,922 rows, no nulls, and no zeros. Values span 2 to 144 with median 83 and mean 72.5; the summary moments (skew -0.08, kurtosis -1.37) look flat, but they carry little meaning for arbitrary id codes. The histogram is uneven: the 41.05 to 44.6 bin alone holds 79,576 rows (22.4%), so a handful of ids account for a large share of the data. No outliers were flagged.

Treatment: Treat as a categorical worker key; encode or join rather than aggregating numerically.

anthropic:claude-opus-4-7 · confidence high
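
A minimal sketch of the categorical encoding, assuming the ids are available as a list of integers. The helper `build_code_map` is illustrative; it maps the sparse id values onto dense codes that can index an embedding table or drive a one-hot encoder.

```python
def build_code_map(ids):
    """Assign dense codes 0..k-1 to the distinct IDs, in sorted order."""
    return {worker: code for code, worker in enumerate(sorted(set(ids)))}

# Toy values within the reported 2..144 range.
rot_worker_ids = [42, 2, 144, 42, 83]
code_map = build_code_map(rot_worker_ids)
codes = [code_map[w] for w in rot_worker_ids]
```

With 89 distinct ids the codes would run from 0 to 88. They remain labels, so they should not be fed to a model as a numeric magnitude.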

Out[77]:

saturn.columns["rot-worker-id"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	89
min	2
max	144
mean	72.54
median	83
std	39.61
q1	42
q3	105
iqr	63
skew	-0.0804
kurtosis	-1.366
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 30.

Distribution of rot-worker-id. Vertical dash marks the median.

Histogram bins for rot-worker-id (median: 83.0).
bin	count
2 – 5.55	19114
5.55 – 9.1	1190
9.1 – 12.65	57
12.65 – 16.2	8207
16.2 – 19.75	14204
19.75 – 23.3	0
23.3 – 26.85	0
26.85 – 30.4	10922
30.4 – 33.95	7449
33.95 – 37.5	506
37.5 – 41.05	10560
41.05 – 44.6	79576
44.6 – 48.15	3134
48.15 – 51.7	4
51.7 – 55.25	314
55.25 – 58.8	326
58.8 – 62.35	12019
62.35 – 65.9	19
65.9 – 69.45	425
69.45 – 73	1691
73 – 76.55	4917
76.55 – 80.1	2039
80.1 – 83.65	3998
83.65 – 87.2	18079
87.2 – 90.75	9881
90.75 – 94.3	13440
94.3 – 97.85	6
97.85 – 101.4	16569
101.4 – 104.9	9005
104.9 – 108.5	31406
108.5 – 112	6515
112 – 115.6	4230
115.6 – 119.1	2114
119.1 – 122.7	26140
122.7 – 126.2	2735
126.2 – 129.8	29427
129.8 – 133.3	4465
133.3 – 136.9	1
136.9 – 140.4	206
140.4 – 144	1032

breakdown-worker-id numeric foreign_key

This appears to be a worker identifier encoded as integers, with 117 distinct values spread roughly uniformly between 0 and 146 (mean 71.75, median 71, skew 0.05, kurtosis -1.26). Despite being stored as numeric, the low cardinality relative to 355,922 rows and the near-flat distribution suggest a categorical key rather than a measurement. No nulls and no outliers, but about 1.4% of rows carry worker id 0, which may be a sentinel.

Treatment: Cast to categorical and left-join to a worker dimension table; investigate whether id 0 is a placeholder.

anthropic:claude-opus-4-7 · confidence high
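
The left join and the id 0 check might look like the sketch below. The worker dimension table is hypothetical (none accompanies this report), as are the `cohort` attribute and the function name; the point is only that unmatched keys keep their row and that id 0 can be tested against the dimension table.

```python
# Hypothetical worker dimension table; no such table ships with this report.
workers = {
    17: {"cohort": "a"},
    58: {"cohort": "b"},
}

def left_join_worker(rows, workers, key="breakdown-worker-id"):
    """Attach worker attributes; unmatched keys keep the row with None."""
    joined = []
    for row in rows:
        match = workers.get(row[key])
        joined.append({**row, "cohort": match["cohort"] if match else None})
    return joined

rows = [{"breakdown-worker-id": w} for w in (17, 0, 58, 0)]
joined = left_join_worker(rows, workers)

# Share of rows with id 0, and whether id 0 exists in the dimension table.
zero_rate = sum(r["breakdown-worker-id"] == 0 for r in rows) / len(rows)
zero_is_known_worker = 0 in workers
```

If id 0 has no match in the real worker table, that supports reading it as a placeholder rather than a genuine worker.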

Out[80]:

saturn.columns["breakdown-worker-id"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	117
min	0
max	146
mean	71.75
median	71
std	42.17
q1	33
q3	106
iqr	73
skew	0.05124
kurtosis	-1.255
n_outliers	0
outlier_rate	0
zero_rate	0.01435

Fig 31.

Distribution of breakdown-worker-id. Vertical dash marks the median.

Histogram bins for breakdown-worker-id (median: 71.0).
bin	count
0 – 3.65	10977
3.65 – 7.3	6826
7.3 – 10.95	295
10.95 – 14.6	5408
14.6 – 18.25	17438
18.25 – 21.9	5661
21.9 – 25.55	14918
25.55 – 29.2	17405
29.2 – 32.85	8194
32.85 – 36.5	6075
36.5 – 40.15	14864
40.15 – 43.8	14057
43.8 – 47.45	4300
47.45 – 51.1	9634
51.1 – 54.75	9782
54.75 – 58.4	3497
58.4 – 62.05	18405
62.05 – 65.7	4440
65.7 – 69.35	2196
69.35 – 73	4907
73 – 76.65	8096
76.65 – 80.3	10266
80.3 – 83.95	766
83.95 – 87.6	6916
87.6 – 91.25	10729
91.25 – 94.9	14625
94.9 – 98.55	7752
98.55 – 102.2	14249
102.2 – 105.8	7135
105.8 – 109.5	21787
109.5 – 113.1	4903
113.1 – 116.8	27
116.8 – 120.5	6283
120.5 – 124.1	11729
124.1 – 127.8	8202
127.8 – 131.4	10912
131.4 – 135	8184
135 – 138.7	9522
138.7 – 142.3	5775
142.3 – 146	8785

n-characters numeric feature

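A small-integer count ranging from 1 to 10 with only 8 distinct values across 355,922 rows (mean 2.13, median 2). It appears to count the entries in the `characters` column rather than text length: the 71,601 rows with a value of 1 match the 71,601 rows whose `characters` value is just 'narrator'. Values 1 to 3 cover 96.2% of rows, and the spread is small (std 0.78, IQR from 2 to 3). A mild right tail (skew 0.44) produces 1,132 outliers (~0.32%), which are exactly the rows with 5 or more characters.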

Treatment: Treat as a low-cardinality ordinal/discrete count; usable directly or one-hot encoded.

anthropic:claude-opus-4-7 · confidence high

Out[83]:

saturn.columns["n-characters"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	8
min	1
max	10
mean	2.128
median	2
std	0.779
q1	2
q3	3
iqr	1
skew	0.4378
kurtosis	0.4591
n_outliers	1,132
outlier_rate	0.00318
zero_rate	0

Fig 32.

Distribution of n-characters. Vertical dash marks the median.

Value counts for n-characters (median: 2.0); empty histogram bins omitted.
value	count
1	71601
2	181769
3	89195
4	12225
5	918
6	178
7	32
10	4
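
The outlier count can be reproduced from the counts in the table above. The sketch assumes Saturn flags outliers with the usual 1.5 x IQR fences; that rule is not stated in the report, but it yields the reported n_outliers exactly.

```python
# Counts per value, from the table above.
value_counts = {1: 71601, 2: 181769, 3: 89195, 4: 12225,
                5: 918, 6: 178, 7: 32, 10: 4}
q1, q3 = 2, 3                      # from the stats table

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr       # 0.5
upper_fence = q3 + 1.5 * iqr       # 4.5

n = sum(value_counts.values())
n_outliers = sum(c for v, c in value_counts.items()
                 if v < lower_fence or v > upper_fence)
outlier_rate = n_outliers / n
share_1_to_3 = sum(value_counts[v] for v in (1, 2, 3)) / n
```

Because the upper fence sits at 4.5, every row with 5 or more characters is flagged: 918 + 178 + 32 + 4 = 1,132.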

characters text feature

This column lists the characters/speakers in each record, encoded as pipe-delimited role strings (e.g. 'narrator|He', 'narrator|my girlfriend'). It is extremely repetitive: 91.1% of 355,922 rows are duplicates, only 31,782 unique values exist, and 'narrator' alone appears 71,601 times (20.1% of rows). Half the entries are a single whitespace-delimited token (one_word_rate 0.50, word_median 1), so the column behaves more like a multi-label categorical field than free text, despite the text kind.

Treatment: Treat as a categorical/multi-label field: split on '|' and one-hot encode the top roles rather than tokenizing as prose.

anthropic:claude-opus-4-7 · confidence high
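
A minimal sketch of the split-and-encode treatment, using the example strings quoted above. The helper names and the `top_k` cut-off are illustrative; on the real column `top_k` would be chosen from the role frequency table rather than fixed at 2.

```python
from collections import Counter

def split_roles(cell, sep="|"):
    """Split one `characters` cell into its role strings."""
    return [role.strip() for role in cell.split(sep) if role.strip()]

def multi_hot(cells, top_k=2):
    """Encode each cell as 0/1 flags for the top_k most frequent roles."""
    role_lists = [split_roles(c) for c in cells]
    freq = Counter(role for roles in role_lists for role in set(roles))
    top = [role for role, _ in freq.most_common(top_k)]
    rows = [{role: int(role in roles) for role in top} for roles in role_lists]
    return top, rows

cells = ["narrator", "narrator|He", "narrator|my girlfriend", "narrator|He"]
top, rows = multi_hot(cells)
```

Roles outside the top set are simply dropped here; an extra "other" flag would preserve the fact that a rarer role was present.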

Out[86]:

saturn.columns["characters"].stats

stat	value
n	355,922
nulls	0 (0.0%)
unique	31,782
len_min	8
len_max	165
len_mean	18.84
len_median	17
len_p95	38
word_mean	1.907
word_median	1
n_empty	0
n_duplicates	324,140
duplicate_rate	0.9107
vocab_size	5,682
readability_flesch_mean	-63.18
emoji_rate	0
url_rate	0
one_word_rate	0.5012
allcaps_rate	0
boilerplate_rate	0
alert: one_word	50.1% rows are a single word
alert: duplicates	91.1% duplicate strings

Fig 33.

Character-length distribution for characters.

Character-length distribution for characters (mean: 18.84).
chars	count
8 – 12	85583
12 – 16	69950
16 – 20	63497
20 – 24	49403
24 – 28	30408
28 – 32	20287
32 – 35	13108
35 – 39	8478
39 – 43	5267
43 – 47	3757
47 – 51	2325
51 – 55	1441
55 – 59	1008
59 – 63	349
63 – 67	425
67 – 71	144
71 – 75	124
75 – 79	94
79 – 83	62
83 – 86	51
86 – 90	34
90 – 94	34
94 – 98	26
98 – 102	12
102 – 106	14
106 – 110	7
110 – 114	4
114 – 118	8
118 – 122	3
122 – 126	3
126 – 130	0
130 – 134	2
134 – 138	2
138 – 141	0
141 – 145	0
145 – 149	10
149 – 153	0
153 – 157	0
157 – 161	0
161 – 165	2


How to cite