saturn·

quirky social norms 20260121

source /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms_20260121.parquet · 355,922 rows · 25 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a social-norms annotation dataset of 355,922 rows and 25 columns. Each row pairs a real-life 'situation' (mostly from Reddit confessions, AmItheAsshole, Dear Abby, and ROCStories) with an 'action', a rule of thumb ('rot'), and a battery of moral judgments from crowd workers. The most striking shape feature is heavy duplication in the text fields: 'rot-judgment' is 97% duplicated and 'characters' 91% because both collapse to short, repetitive vocabularies, while 'situation' and 'rot' repeat about 71% and 27% of the time because the same items recur across annotators. Two fields deserve a first look: 'rot-moral-foundations', which is dominated by 'care-harm' (~39% of non-null values), and 'action-legal', where 93% of actions are tagged 'legal'. Both signal class imbalance that will matter for any modelling. Note also that 'area' is reasonably balanced across the four source corpora, whereas 'split' is heavily skewed toward 'train' (66%).

citing: rot-moral-foundations · action-legal · area · split · rot-judgment · characters · situation · rot · action-hypothetical · rot-categorization · action-agree · row_count · column_count

Schema

25 columns
Per-column summary.

column                 type         nulls   unique   alerts
area                   categorical  0.0%    4
m                      numeric      0.0%    4        high_skew, outliers
split                  categorical  0.0%    7
rot-agree              numeric      0.3%    5
rot-categorization     categorical  0.9%    15
rot-moral-foundations  categorical  21.7%   30       null_rate
rot-char-targeting     categorical  0.5%    7
rot-bad                numeric      0.0%    2        high_skew
rot-judgment           text         0.0%    10,589   short_text, duplicates
action                 text         0.0%    260,627  multilingual, duplicates
action-agency          categorical  1.8%    2
action-moral-judgment  numeric      12.7%   5
action-agree           numeric      12.5%   5
action-legal           categorical  12.8%   3
action-pressure        numeric      13.1%   5
action-char-involved   categorical  13.0%   7
action-hypothetical    categorical  19.0%   5
situation              text         0.0%    103,296  multilingual, duplicates
situation-short-id     text         0.0%    103,692  one_word, duplicates
rot                    text         0.0%    259,614  duplicates
rot-id                 text         0.0%    291,974  one_word
rot-worker-id          numeric      0.0%    89
breakdown-worker-id    numeric      0.0%    117
n-characters           numeric      0.0%    8
characters             text         0.0%    31,782   one_word, duplicates

area

categorical label
This column tags each record with one of four source areas, with 'confessions' the modal value at 30.3% of 355,922 rows. The distribution is fairly balanced — entropy_ratio of 0.97 indicates near-uniform spread across the four categories, though 'dearabby' (50,300) is roughly half the size of the other three. No nulls and only 4 unique values make this a clean grouping key. Treatment: one-hot encode or use directly as a stratification/grouping variable. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
4
top_value
confessions
top_rate
0.3027
cardinality
4
entropy
1.947
entropy_ratio
0.9737
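
The entropy_ratio above appears to be the Shannon entropy in bits divided by log2 of the cardinality. That definition is inferred from the reported numbers (1.947 / log2(4) is about 0.97), not stated by the report. A minimal sketch under that assumption, where `entropy_ratio` is an illustrative helper name:

```python
import math
from collections import Counter

def entropy_ratio(values):
    """Shannon entropy (bits) of the value counts, divided by log2(#distinct)."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    distinct = len(counts)
    # A single-valued column has zero entropy; avoid dividing by log2(1) = 0.
    return entropy / math.log2(distinct) if distinct > 1 else 0.0

# Check against the figures reported for 'area': entropy 1.947, 4 categories.
reported = 1.947 / math.log2(4)

# A perfectly uniform toy column scores the maximum ratio.
uniform = entropy_ratio(["a", "b", "c", "d"])
```

A ratio near 1 means the categories are close to equally frequent; the small gap between the recomputed value and 0.9737 comes from the entropy being rounded to three decimals.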

m

numeric feature high_skew outliers
Column 'm' is a numeric feature with only 4 distinct values across 355,922 rows, ranging from 1 to 50 with a median of 1 and both Q1 and Q3 equal to 1. The distribution is severely right-skewed (skew 3.77, kurtosis 12.45). Because the IQR is 0, the outlier fences collapse onto 1, so every row with a value other than 1 is flagged: 65,947 rows, or 18.5%. Those same rows pull the mean up to 4.24. The tiny cardinality combined with the wide range suggests a categorical-like multiplier or count where most records sit at 1 and a few jump to much larger values. Treatment: Treat as low-cardinality categorical or bin the rare large values before modelling. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
4
min
1
max
50
mean
4.237
median
1
std
11.24
q1
1
q3
1
iqr
0
skew
3.774
kurtosis
12.45
n_outliers
65,947
outlier_rate
0.1853
zero_rate
0
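
The outlier counts in this report are consistent with the usual 1.5 × IQR fence rule, although the profiler's exact rule is not stated. A small sketch under that assumption shows why a zero IQR flags every value other than the quartile; the toy values (including the 10) are illustrative, not taken from the data:

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_rate(values, q1, q3):
    """Share of values lying strictly outside the fences."""
    low, high = tukey_fences(q1, q3)
    flagged = sum(1 for v in values if v < low or v > high)
    return flagged / len(values)

# Column 'm': Q1 = Q3 = 1, so both fences sit exactly at 1.
m_fences = tukey_fences(1, 1)

# Toy sample shaped like 'm': mostly 1, a few large values.
toy_m = [1] * 8 + [10, 50]
toy_rate = outlier_rate(toy_m, 1, 1)

# Column 'rot-agree': Q1 = 3, Q3 = 4, so only scores 0 and 1 fall outside.
rot_agree_fences = tukey_fences(3, 4)
```

For a column like this, the outlier flag is better read as "not the modal value" than as evidence of anomalous records.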

split

categorical metadata
Column holds the dataset split assignment across 355,922 rows, with 7 distinct values and no nulls. 'train' dominates at 65.6% (233,501 rows), followed by near-equal 'test' (29,239) and 'dev' (29,234), plus auxiliary 'test-extra', 'analysis', and 'dev-extra' partitions. A 'none' bucket of 3,913 rows is unusual and likely indicates unassigned examples. Treatment: Use as a row filter to separate train/dev/test; investigate or exclude the 'none' rows before modelling. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
7
top_value
train
top_rate
0.656
cardinality
7
entropy
1.763
entropy_ratio
0.6281
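
The row-filter treatment can be sketched as follows. It assumes rows are plain dictionaries with a 'split' key; `partition_by_split` and the toy rows are hypothetical, and which auxiliary partitions to keep is a choice the report leaves open:

```python
from collections import defaultdict

MAIN_SPLITS = ("train", "dev", "test")

def partition_by_split(rows, keep=MAIN_SPLITS):
    """Group rows by split; rows in any other partition are set aside."""
    parts = defaultdict(list)
    set_aside = []
    for row in rows:
        if row["split"] in keep:
            parts[row["split"]].append(row)
        else:
            # 'none', 'analysis', 'dev-extra', 'test-extra' land here for review.
            set_aside.append(row)
    return dict(parts), set_aside

# Toy rows; only the 'split' values mirror the real column.
toy_rows = [
    {"row": 0, "split": "train"},
    {"row": 1, "split": "train"},
    {"row": 2, "split": "dev"},
    {"row": 3, "split": "test"},
    {"row": 4, "split": "none"},
]
parts, set_aside = partition_by_split(toy_rows)
```

Keeping the set-aside rows in a separate list, instead of dropping them, makes it easy to inspect the 'none' bucket before deciding what to do with it.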

rot-agree

numeric feature
This is a 5-value ordinal score (0–4) capturing agreement with the rule of thumb ('rot-agree'), with a mean of 3.10 and median 3.0, so answers cluster at the high end. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4, and 2.4% of non-null rows fall below the lower IQR fence as low-end outliers. Nulls are negligible (0.35%) and zeros are rare (0.38%). Treatment: Treat as an ordinal Likert feature; keep as-is or one-hot encode rather than log-transform. high · anthropic:claude-opus-4-7
n
355,922
nulls
1,236 (0.3%)
unique
5
min
0
max
4
mean
3.1
median
3
std
0.7437
q1
3
q3
4
iqr
1
skew
-0.6805
kurtosis
0.7872
n_outliers
8,547
outlier_rate
0.0241
zero_rate
0.003786

rot-categorization

categorical label
Categorical tag describing the type of 'rule of thumb' (RoT), with 15 distinct values drawn from a small base vocabulary (advice, social-norms, morality-ethics, description) plus pipe-delimited combinations. Distribution is moderately balanced (entropy ratio 0.72); 'advice' leads at 23.5% and the four single-tag values dominate, while compound tags like 'social-norms|advice' (27,657) indicate multi-label encoding stuffed into one string. Null rate is low at 0.86%. Treatment: split on '|' and one-hot encode the four base categories for multi-label modelling. high · anthropic:claude-opus-4-7
n
355,922
nulls
3,046 (0.9%)
unique
15
top_value
advice
top_rate
0.2346
cardinality
15
entropy
2.806
entropy_ratio
0.7181
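
The split-and-encode treatment might look like the sketch below. The four base tags come from the description above; `multi_hot` is an illustrative name, and treating a null as "no tags" is an assumption:

```python
BASE_TAGS = ("advice", "social-norms", "morality-ethics", "description")

def multi_hot(value, tags=BASE_TAGS):
    """Turn a pipe-delimited tag string into one 0/1 indicator per base tag."""
    present = set(value.split("|")) if value else set()
    return {tag: int(tag in present) for tag in tags}

single = multi_hot("advice")
compound = multi_hot("social-norms|advice")
missing = multi_hot(None)  # assumption: nulls map to all zeros
```

This reduces the 15 observed string values to four independent binary columns, so compound tags no longer compete with their own components as separate classes.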

rot-moral-foundations

categorical label null_rate
This column tags each row with one or more of Haidt's moral foundations (care-harm, fairness-cheating, loyalty-betrayal, authority-subversion, sanctity-degradation), with multi-label combinations encoded as pipe-delimited strings, yielding 30 distinct values. The distribution is heavily skewed: 'care-harm' alone covers 38.9% of non-null rows, and 21.7% of all rows are null. An entropy ratio of 0.60 confirms that most mass sits in the top few categories, leaving a long, thin tail. Treatment: Split on '|' and binarize into five multi-hot foundation indicators; treat nulls as an explicit missing flag. high · anthropic:claude-opus-4-7
n
355,922
nulls
77,120 (21.7%)
unique
30
top_value
care-harm
top_rate
0.3893
cardinality
30
entropy
2.962
entropy_ratio
0.6037
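
With more than a fifth of the rows null, the missing flag matters as much as the five indicators. A minimal sketch, assuming nulls arrive as `None`; `encode_foundations` and the `foundation_missing` key are illustrative names:

```python
FOUNDATIONS = (
    "care-harm",
    "fairness-cheating",
    "loyalty-betrayal",
    "authority-subversion",
    "sanctity-degradation",
)

def encode_foundations(value):
    """Five 0/1 foundation indicators plus an explicit missing flag."""
    present = set(value.split("|")) if value is not None else set()
    encoded = {name: int(name in present) for name in FOUNDATIONS}
    # Keep "no label recorded" distinguishable from "no foundation applies".
    encoded["foundation_missing"] = int(value is None)
    return encoded

labelled = encode_foundations("care-harm|fairness-cheating")
unlabelled = encode_foundations(None)
```

Without the flag, a null row and a row with no applicable foundation would both encode as five zeros.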

rot-char-targeting

categorical feature
Categorical tag identifying which character slot a rule of thumb ('rot') targets, with 7 distinct values dominated by 'char-0' (51.9%) and 'char-1' (123,396 rows). The distribution is sharply long-tailed: 'char-4' and 'char-5' appear only 46 and 15 times respectively, and 23,781 rows are explicitly 'char-none'. Entropy ratio is 0.557, confirming most mass sits in the first two categories. Null rate is low at 0.46%. Treatment: One-hot encode and consider collapsing rare 'char-3/4/5' levels into an 'other' bucket. high · anthropic:claude-opus-4-7
n
355,922
nulls
1,622 (0.5%)
unique
7
top_value
char-0
top_rate
0.5192
cardinality
7
entropy
1.564
entropy_ratio
0.5571
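
Collapsing the rare levels can be done by frequency instead of by hard-coded names. In this sketch `collapse_rare`, the `min_count` threshold, and the toy values are illustrative; only the level names come from the column:

```python
from collections import Counter

def collapse_rare(values, min_count, other="other"):
    """Replace levels seen fewer than min_count times with a shared bucket."""
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other
        for v in values
    ]

# Toy column: two frequent slots, two rare slots, one null.
toy_targets = ["char-0"] * 5 + ["char-1"] * 3 + ["char-4", "char-5", None]
collapsed = collapse_rare(toy_targets, min_count=2)
```

Nulls pass through unchanged so that the decision about missing values stays separate from the decision about rare levels.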

rot-bad

numeric label high_skew
Binary 0/1 flag (n_unique=2, min=0, max=1), presumably marking rules of thumb that were flagged as bad. Positives occur at 2.0% (mean=0.0201, zero_rate=0.9799), producing the flagged high skew (6.84) and heavy kurtosis (44.78). The 7,153 'outliers' are simply the positive class, not anomalies. Treatment: Treat as a binary target; address class imbalance with stratified sampling or class weights rather than outlier removal. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
2
min
0
max
1
mean
0.0201
median
0
std
0.1403
q1
0
q3
0
iqr
0
skew
6.84
kurtosis
44.78
n_outliers
7,153
outlier_rate
0.0201
zero_rate
0.9799
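
The skew and kurtosis above follow directly from the positive rate, since a 0/1 column is a Bernoulli variable. The sketch below recomputes them, taking the 7,153 flagged rows as the positive count, and adds the common "balanced" class-weight heuristic (n / (n_classes × class_count)) as one possible way to handle the imbalance; both helper names are illustrative:

```python
import math

def bernoulli_shape(p):
    """Skewness and excess kurtosis of a Bernoulli(p) variable."""
    variance = p * (1 - p)
    skew = (1 - 2 * p) / math.sqrt(variance)
    excess_kurtosis = (1 - 6 * variance) / variance
    return skew, excess_kurtosis

def balanced_class_weights(p):
    """Weights n / (2 * class_count) for the negative and positive class."""
    return {0: 1 / (2 * (1 - p)), 1: 1 / (2 * p)}

p = 7153 / 355922  # positive rate implied by the reported counts
skew, excess_kurtosis = bernoulli_shape(p)
weights = balanced_class_weights(p)
```

Both moments land on the reported 6.84 and 44.78 after rounding, which confirms that the high_skew alert reflects nothing more than the 2% positive rate.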

rot-judgment

text label short_text duplicates
Short moral-judgment phrases (mean 2 words, median 9 chars, max 94) drawn from a tight vocabulary of 797 tokens, dominated by verdicts like "It's good", "shouldn't", "It's okay", and "It's wrong". With 97.0% duplicate rate and only 10,589 uniques across 355,922 rows, this behaves as a categorical label rather than free text. Casing is inconsistent (e.g. "It's good" vs "it's good" appear as separate top values), which will inflate cardinality unless normalized. Treatment: Lowercase and strip punctuation, then treat as a categorical target rather than free text. high · anthropic:claude-opus-4-7
n
355,922
nulls
1 (0.0%)
unique
10,589
len_min
1
len_max
94
len_mean
10.46
len_median
9
len_p95
19
word_mean
2.001
word_median
2
n_empty
0
n_duplicates
345,332
duplicate_rate
0.9702
vocab_size
797
readability_flesch_mean
83.27
emoji_rate
0
url_rate
0
one_word_rate
0.2083
allcaps_rate
5.619e-06
boilerplate_rate
0
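
A possible normalization pass is sketched below. Keeping apostrophes is a deliberate deviation from "strip punctuation", so that "shouldn't" and "it's" stay intact; `normalize_judgment` is an illustrative name and the toy list reuses the verdicts quoted above:

```python
import re
from collections import Counter

def normalize_judgment(text):
    """Lowercase, drop punctuation except apostrophes, collapse whitespace."""
    text = text.replace("\u2019", "'").lower()  # unify curly apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

raw = ["It's good", "it's good", "It's good.", "shouldn't", "It's okay"]
normalized_counts = Counter(normalize_judgment(j) for j in raw)
```

After normalization the casing and trailing-period variants merge into one label, which is what reduces the inflated cardinality.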

action

text free_text multilingual duplicates
Short English phrases describing an action or behaviour (mean 41.8 chars, median 7 words), e.g. 'being yourself.' or 'cheating on your partner.', likely the subject of a moral judgment prompt. Roughly 27% of the 355,922 rows are duplicates (95,292), with the top phrase repeating 461 times, so the same actions recur heavily across records. The multilingual alert rests on only 2 rows detected as non-English (1 'la' for Latin, 1 'no' for Norwegian) out of ~5k sampled, so the column is effectively monolingual. Treatment: Normalize casing/punctuation and tokenize or embed before modelling; expect heavy phrase repetition. high · anthropic:claude-opus-4-7
n
355,922
nulls
3 (0.0%)
unique
260,627
len_min
1
len_max
221
len_mean
41.76
len_median
40
len_p95
73
word_mean
6.969
word_median
7
n_empty
0
n_duplicates
95,292
duplicate_rate
0.2677
vocab_size
9,430
readability_flesch_mean
57.88
emoji_rate
0
url_rate
2.81e-06
one_word_rate
0.006013
allcaps_rate
0
boilerplate_rate
2.81e-06

action-agency

categorical feature
Binary categorical flag with only two levels, 'agency' (88.5% of non-null rows) and 'experience' (the remainder). The labels suggest a distinction between actions a character actively performs ('agency') and things that happen to them ('experience'). Class imbalance is heavy and 1.84% of rows are null, so any modelling needs to account for both. The entropy ratio of 0.515 reflects the dominance of the majority class. Treatment: Encode as a binary indicator and impute or flag the ~1.8% nulls. high · anthropic:claude-opus-4-7
n
355,922
nulls
6,551 (1.8%)
unique
2
top_value
agency
top_rate
0.8849
cardinality
2
entropy
0.5152
entropy_ratio
0.5152

action-moral-judgment

numeric label
A discrete moral-judgment rating on a 5-point scale from -2 to 2 (likely Likert-style: very wrong → very right). The mean of -0.178 and median of 0 indicate a slight lean toward negative judgments, with 43.7% of values exactly zero and 12.7% missing. The distribution is nearly symmetric (skew -0.011) and platykurtic, so most ratings cluster near neutral with modest spread (std 0.857). Treatment: Treat as an ordinal label; impute or filter the 12.7% nulls before modelling. high · anthropic:claude-opus-4-7
n
355,922
nulls
45,215 (12.7%)
unique
5
min
-2
max
2
mean
-0.1785
median
0
std
0.8565
q1
-1
q3
0
iqr
1
skew
-0.01137
kurtosis
-0.3466
n_outliers
4,592
outlier_rate
0.01478
zero_rate
0.4367

action-agree

numeric feature
This is a 5-level ordinal feature (values 0-4), almost certainly a Likert-style agreement rating for an 'action' item, with mean 3.10 and median 3.0 indicating a tilt toward agreement. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4, so most respondents pick 3 or 4; only 0.38% give zero. Note the 12.48% null rate, which is substantial and likely reflects non-response or skipped items. Treatment: Treat as ordinal (0-4); impute or flag the ~12% missing values before modelling. high · anthropic:claude-opus-4-7
n
355,922
nulls
44,402 (12.5%)
unique
5
min
0
max
4
mean
3.101
median
3
std
0.7326
q1
3
q3
4
iqr
1
skew
-0.6768
kurtosis
0.8829
n_outliers
7,046
outlier_rate
0.02262
zero_rate
0.003762

action-pressure

numeric feature
A discrete ordinal feature taking only 5 values across the symmetric range -2 to 2, almost certainly a Likert-style pressure rating. The distribution is balanced (mean -0.04, skew -0.05) and centered on 0, which accounts for 35.3% of rows. Notable: 13.08% of rows are null, and despite being numeric there are just 5 unique values. Treatment: Treat as ordinal categorical; impute or flag the 13% missing before modelling. high · anthropic:claude-opus-4-7
n
355,922
nulls
46,537 (13.1%)
unique
5
min
-2
max
2
mean
-0.03992
median
0
std
1.114
q1
-1
q3
1
iqr
2
skew
-0.04985
kurtosis
-0.6595
n_outliers
0
outlier_rate
0
zero_rate
0.3534

action-char-involved

categorical feature
Categorical pointer to which character slot is involved in an action, with 7 distinct values dominated by char-0 (51.5%) and char-1 (~31%). The long tail collapses fast: char-3 through char-5 together account for fewer than 1,400 rows, and 13.05% of rows are null while another 22,100 are explicitly tagged 'char-none'. Entropy ratio of 0.56 confirms the heavy concentration on the first two slots. Treatment: Treat as a low-cardinality categorical; collapse char-3/4/5 into an 'other' bucket and decide whether null and char-none should be merged. high · anthropic:claude-opus-4-7
n
355,922
nulls
46,441 (13.0%)
unique
7
top_value
char-0
top_rate
0.515
cardinality
7
entropy
1.578
entropy_ratio
0.562

action-hypothetical

categorical label
A 5-level categorical label indicating whether an action is stated explicitly or only hypothetically/probably, including negated variants ('explicit-no', 'probable-no'). 'explicit' dominates at 47.1% of non-null rows, but 19.04% of values are null, which is substantial. An entropy ratio of 0.84 indicates the remaining classes are reasonably spread rather than collapsed onto a single mode. Treatment: One-hot encode the five levels and add an explicit missing indicator for the 19% nulls. high · anthropic:claude-opus-4-7
n
355,922
nulls
67,776 (19.0%)
unique
5
top_value
explicit
top_rate
0.4711
cardinality
5
entropy
1.956
entropy_ratio
0.8423
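
One-hot encoding with a missing indicator can be sketched as below. The levels are read from the data instead of being hard-coded, because only three of the five level names appear in this report; `one_hot_with_missing` and the `is_` prefix are illustrative:

```python
def one_hot_with_missing(values):
    """One 0/1 column per observed level plus an 'is_missing' column."""
    levels = sorted({v for v in values if v is not None})
    encoded = []
    for v in values:
        row = {f"is_{level}": int(v == level) for level in levels}
        row["is_missing"] = int(v is None)
        encoded.append(row)
    return encoded

# Toy column using level names quoted in the description.
toy_values = ["explicit", None, "explicit-no", "explicit"]
encoded = one_hot_with_missing(toy_values)
```

A null row encodes as all zeros except `is_missing`, so a model can learn from the absence of a label instead of having it silently imputed.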

situation

text free_text multilingual duplicates
Short first-person descriptions of personal situations or dilemmas, averaging 55 characters / 10 words and topping out at 300 chars, with high readability (Flesch 78.5). Massive duplication is the headline issue: 252,626 of 355,922 rows are duplicates (71%), leaving only 103,296 uniques, and several distinct strings repeat exactly 130 times. The near-identical duplicate rate of 'situation-short-id' (70.9%) indicates that most of this comes from one situation spanning many annotation rows, though the fixed 130-repeat count may also reflect templated or oversampled records. Language detection is overwhelmingly English (4,989 sampled rows) with a tiny multilingual tail (6 de, 1 each es/it/nl/pt/ru) that is unlikely to matter at scale. Treatment: Deduplicate the text before tokenizing and embedding, and check the upstream sampling behind the 130-repeat pattern. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
103,296
len_min
10
len_max
300
len_mean
55.42
len_median
52
len_p95
100
word_mean
10.52
word_median
10
n_empty
0
n_duplicates
252,626
duplicate_rate
0.7098
vocab_size
16,855
readability_flesch_mean
78.53
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0.0005282
boilerplate_rate
0.0001292
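
The duplicate figures can be reproduced from the unique count alone, and the same counting step exposes which strings repeat most. A minimal sketch; the helper names and the toy situations are illustrative, not taken from the data:

```python
from collections import Counter

def duplication_profile(texts):
    """Unique count, duplicate count and rate, and the most repeated string."""
    counts = Counter(texts)
    n_unique = len(counts)
    n_duplicates = len(texts) - n_unique
    return {
        "unique": n_unique,
        "duplicates": n_duplicates,
        "duplicate_rate": n_duplicates / len(texts),
        "top": counts.most_common(1)[0],
    }

def unique_in_order(texts):
    """Drop repeats while keeping first-seen order, e.g. before embedding."""
    return list(dict.fromkeys(texts))

# Reported figures: 355,922 rows, 103,296 unique situations.
reported_rate = (355_922 - 103_296) / 355_922

toy_situations = ["lost my keys", "skipped class", "lost my keys", "lost my keys"]
profile = duplication_profile(toy_situations)
deduplicated = unique_in_order(toy_situations)
```

Embedding only the deduplicated strings and mapping the vectors back to rows avoids recomputing the same text many times while keeping every annotation row.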

situation-short-id

text foreign_key one_word duplicates
Slash-delimited source identifiers pointing back to situations on Reddit (confessions, amitheasshole), Dear Abby columns, and ROCStories sentences. Despite the 'id' name, only 103,692 values are unique across 355,922 rows — a 70.9% duplicate rate, with the most repeated key appearing 180 times, so each situation evidently surfaces in many rows. Every value is a single token (one_word_rate 1.0) up to 99 chars long. Treatment: Use as a join key to the situation source table; do not treat as a row-level primary key. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
103,692
len_min
24
len_max
99
len_mean
41.29
len_median
27
len_p95
74
word_mean
1
word_median
1
n_empty
0
n_duplicates
252,230
duplicate_rate
0.7087
vocab_size
16,878
readability_flesch_mean
-443.9
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

rot

text free_text duplicates
Short English moral/normative statements (mean 10.1 words, max 221 chars), almost certainly rule-of-thumb (RoT) annotations describing what one should or shouldn't do. The vocabulary is small (9,659 types) and readability is high (Flesch 79.4), consistent with simple declarative templates like 'It is good/okay to...' and 'You shouldn't...'. Notable: 27.1% of rows are duplicates (96,308), with single phrases like 'It is good to be yourself.' repeating 287 times, indicating heavy template reuse rather than free generation. Treatment: Deduplicate or weight by frequency before tokenizing and embedding for modelling. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
259,614
len_min
6
len_max
221
len_mean
54.66
len_median
53
len_p95
87
word_mean
10.1
word_median
10
n_empty
0
n_duplicates
96,308
duplicate_rate
0.2706
vocab_size
9,659
readability_flesch_mean
79.38
emoji_rate
0
url_rate
2.81e-06
one_word_rate
5.619e-06
allcaps_rate
0
boilerplate_rate
5.619e-06

rot-id

text identifier one_word
Structured path-like identifiers for 'rule of thumb' records, sourced from Reddit (amitheasshole, confessions), Dear Abby columns, and ROCStories. Despite being IDs, only 291,974 of 355,922 values are unique (duplicate_rate 0.18, with top values repeating up to 58 times), so this is not a primary key. Every value is a single token (one_word_rate 1.0, word_mean 1.0) of 63 to 140 characters, which triggered the one_word alert but is expected for slash-delimited identifiers. Treatment: Treat as a composite identifier; split on '/' to extract source/subreddit/post fields rather than embedding the raw string. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
291,974
len_min
63
len_max
140
len_mean
81.61
len_median
68
len_p95
114
word_mean
1
word_median
1
n_empty
0
n_duplicates
63,948
duplicate_rate
0.1797
vocab_size
18,736
readability_flesch_mean
-755.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
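
Splitting the identifier might look like the sketch below. The report does not show an actual rot-id, so the toy identifiers are hypothetical and the fields are left unnamed; the real number and meaning of path segments would need to be checked against the data:

```python
from collections import Counter

def split_id(identifier, sep="/"):
    """Break a path-like identifier into its non-empty segments."""
    return [part for part in identifier.split(sep) if part]

def leading_segment_counts(identifiers, depth=1):
    """Count identifiers by their first `depth` segments."""
    return Counter(tuple(split_id(i)[:depth]) for i in identifiers)

# Hypothetical identifiers; real rot-id values are 63 to 140 characters long.
toy_ids = ["alpha/one/0001", "alpha/two/0002", "beta/one/0003"]
segments = split_id(toy_ids[0])
by_source = leading_segment_counts(toy_ids)
```

Counting by leading segments is a quick way to confirm that the first fields really do correspond to the four source corpora before relying on them as features.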

rot-worker-id

numeric foreign_key
Despite being typed numeric, `rot-worker-id` looks like a categorical worker identifier: only 89 unique values across 355,922 rows, no nulls, and no zeros. The distribution is broad and nearly symmetric (skew -0.08, kurtosis -1.37) spanning 2 to 144 with median 83, consistent with arbitrary id codes rather than a measured quantity. No outliers were flagged. Treatment: Treat as a categorical worker key; encode or join rather than aggregating numerically. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
89
min
2
max
144
mean
72.54
median
83
std
39.61
q1
42
q3
105
iqr
63
skew
-0.0804
kurtosis
-1.366
n_outliers
0
outlier_rate
0
zero_rate
0

breakdown-worker-id

numeric foreign_key
This appears to be a worker identifier encoded as integers, with 117 distinct values spread roughly uniformly between 0 and 146 (mean 71.75, median 71, skew 0.05, kurtosis -1.26). Despite being stored as numeric, the low cardinality relative to 355,922 rows and the near-flat distribution suggest a categorical key rather than a measurement. No nulls and no outliers, but about 1.4% of rows carry worker id 0, which may be a sentinel. Treatment: Cast to categorical and left-join to a worker dimension table; investigate whether id 0 is a placeholder. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
117
min
0
max
146
mean
71.75
median
71
std
42.17
q1
33
q3
106
iqr
73
skew
0.05124
kurtosis
-1.255
n_outliers
0
outlier_rate
0
zero_rate
0.01435

n-characters

numeric feature
A small-integer count column ranging from 1 to 10 with only 8 unique values across 355,922 rows, mean 2.13 and median 2. Judging by the column name and the pipe-delimited 'characters' field, it counts the people involved in a situation, not text length. The tight IQR (2 to 3) and low std (0.78) show that most records involve 2 or 3 characters, with a mild right tail (skew 0.44) producing 1,132 outliers (~0.32%). Treatment: Treat as a low-cardinality ordinal/discrete count; usable directly or one-hot encoded. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
8
min
1
max
10
mean
2.128
median
2
std
0.779
q1
2
q3
3
iqr
1
skew
0.4378
kurtosis
0.4591
n_outliers
1,132
outlier_rate
0.00318
zero_rate
0

characters

text feature one_word duplicates
This column lists the characters/speakers in each record, encoded as pipe-delimited role strings (e.g. 'narrator|He', 'narrator|my girlfriend'). It is extremely repetitive: 91.1% of 355,922 rows are duplicates, only 31,782 unique values exist, and 'narrator' alone appears 71,601 times. Half the entries contain no whitespace (one_word_rate 0.50, word_median 1), although such an entry can still hold several pipe-separated roles, so the column behaves more like a multi-label tag than free text despite the text kind. Treatment: Treat as a categorical/multi-label field: split on '|' and one-hot encode the top roles rather than tokenizing as prose. high · anthropic:claude-opus-4-7
n
355,922
nulls
0 (0.0%)
unique
31,782
len_min
8
len_max
165
len_mean
18.84
len_median
17
len_p95
38
word_mean
1.907
word_median
1
n_empty
0
n_duplicates
324,140
duplicate_rate
0.9107
vocab_size
5,682
readability_flesch_mean
-63.18
emoji_rate
0
url_rate
0
one_word_rate
0.5012
allcaps_rate
0
boilerplate_rate
0
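
Encoding only the most frequent roles keeps the feature space small despite the 31,782 distinct strings. A minimal sketch, where `top_roles`, `encode_roles`, and the cutoff `k` are illustrative and the toy values reuse the examples quoted above:

```python
from collections import Counter

def top_roles(values, k):
    """The k most frequent roles after splitting each entry on '|'."""
    counts = Counter(role for value in values for role in value.split("|"))
    return [role for role, _ in counts.most_common(k)]

def encode_roles(value, roles):
    """0/1 indicators for the chosen roles; other roles are ignored."""
    present = set(value.split("|"))
    return [int(role in present) for role in roles]

toy_characters = ["narrator", "narrator|He", "narrator|my girlfriend", "narrator|He"]
roles = top_roles(toy_characters, k=2)
encoded = [encode_roles(value, roles) for value in toy_characters]
```

Roles outside the top k are silently dropped here; adding a count of such roles as one more column would be a reasonable extension if the tail turns out to matter.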