quirky social norms 20260121

saturn notebook · generated 2026-05-01 Report Notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms_20260121.parquet

Saturn profiled 355,922 rows across 25 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
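# Install the profiler CLI (notebook shell escape).
!pip install saturn-dissect
import subprocess

# Profile the parquet file, write the findings to JSON, and request the opt-in LLM prose.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/quirky/social_norms_20260121.parquet",
        "--findings", "quirky-social_norms_20260121.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits with a non-zero status
)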

Summary confidence: high

This is a social-norms annotation dataset of 355,922 rows and 25 columns. Each entry pairs a real-life 'situation' (drawn from Reddit confessions, AmItheAsshole, Dear Abby, and ROCStories) with an 'action', a rule-of-thumb ('rot'), and a battery of moral judgments by crowd workers. The most striking shape feature is heavy duplication in the text fields: 'rot-judgment' is 97% duplicated and 'characters' 91%, because both collapse to short controlled vocabularies, while 'situation' and 'rot' themselves repeat ~71% and ~27% of the time across annotators. Worth a closer look first: the moral-foundation distribution, where 'care-harm' is the largest label (~39% of non-null rows, 30.5% of all rows), and the 'action-legal' field, where 93% of non-null actions (81.3% of all rows) are tagged 'legal'. Both suggest class imbalance that will matter for any modeling. Also note that 'area' is reasonably balanced across the four source corpora, although 'dearabby' is about half the size of the others, while 'split' is heavily skewed toward 'train' (66%).

citing: rot-moral-foundations · action-legal · area · split · rot-judgment · characters · situation · rot · action-hypothetical · rot-categorization · action-agree · row_count · column_count
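
The 'legal' share quoted in the summary depends on whether nulls are counted in the denominator. The following is a minimal sketch that reproduces both figures from the counts reported in this notebook; the helper name `shares` is illustrative and not part of Saturn.

```python
# Counts copied from the action-legal table in this report.
counts = {"legal": 289316, "tolerated": 15064, "illegal": 5934}
n_rows = 355922
n_null = 45608


def shares(counts, n_rows, n_null):
    """Return (share of all rows, share of non-null rows) for each value."""
    n_non_null = n_rows - n_null
    return {
        value: (count / n_rows, count / n_non_null)
        for value, count in counts.items()
    }


legal_all, legal_non_null = shares(counts, n_rows, n_null)["legal"]
print(f"legal: {legal_all:.1%} of all rows, {legal_non_null:.1%} of non-null rows")
```

The same distinction applies to every column with a non-trivial null rate: the `top_rate` stat matches the non-null share, while the `share` column in the data tables matches the all-rows share.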

Out[4]:

saturn.schema() · 25 columns

column kind n null% unique alerts
area categorical 355,922 0.0% 4
m numeric 355,922 0.0% 4 high_skew outliers
split categorical 355,922 0.0% 7
rot-agree numeric 355,922 0.3% 5
rot-categorization categorical 355,922 0.9% 15
rot-moral-foundations categorical 355,922 21.7% 30 null_rate
rot-char-targeting categorical 355,922 0.5% 7
rot-bad numeric 355,922 0.0% 2 high_skew
rot-judgment text 355,922 0.0% 10,589 short_text duplicates
action text 355,922 0.0% 260,627 multilingual duplicates
action-agency categorical 355,922 1.8% 2
action-moral-judgment numeric 355,922 12.7% 5
action-agree numeric 355,922 12.5% 5
action-legal categorical 355,922 12.8% 3
action-pressure numeric 355,922 13.1% 5
action-char-involved categorical 355,922 13.0% 7
action-hypothetical categorical 355,922 19.0% 5
situation text 355,922 0.0% 103,296 multilingual duplicates
situation-short-id text 355,922 0.0% 103,692 one_word duplicates
rot text 355,922 0.0% 259,614 duplicates
rot-id text 355,922 0.0% 291,974 one_word
rot-worker-id numeric 355,922 0.0% 89
breakdown-worker-id numeric 355,922 0.0% 117
n-characters numeric 355,922 0.0% 8
characters text 355,922 0.0% 31,782 one_word duplicates
Fig 1.
rot-moral-foundations · See how 'care-harm' dominates the moral foundation labels at ~39% of non-null rows.
Top values for rot-moral-foundations (20 unique shown, of 30 total).
value count share
care-harm 108535 30.5%
fairness-cheating 37666 10.6%
loyalty-betrayal 34581 9.7%
care-harm|loyalty-betrayal 21125 5.9%
authority-subversion 19087 5.4%
sanctity-degradation 14657 4.1%
care-harm|fairness-cheating 10787 3.0%
care-harm|authority-subversion 7958 2.2%
care-harm|sanctity-degradation 6328 1.8%
fairness-cheating|loyalty-betrayal 6222 1.7%
fairness-cheating|authority-subversion 3738 1.1%
loyalty-betrayal|authority-subversion 2122 0.6%
fairness-cheating|sanctity-degradation 1218 0.3%
authority-subversion|sanctity-degradation 1042 0.3%
loyalty-betrayal|sanctity-degradation 885 0.2%
care-harm|fairness-cheating|loyalty-betrayal 834 0.2%
care-harm|loyalty-betrayal|authority-subversion 623 0.2%
care-harm|authority-subversion|sanctity-degradation 423 0.1%
care-harm|fairness-cheating|authority-subversion 416 0.1%
care-harm|loyalty-betrayal|sanctity-degradation 183 0.1%
Fig 2.
area · Check the source mix across confessions, ROCStories, AmItheAsshole, and Dear Abby.
Top values for area (4 unique shown, of 4 total).
value count share
confessions 107749 30.3%
rocstories 101791 28.6%
amitheasshole 96082 27.0%
dearabby 50300 14.1%
Fig 3.
action-legal · Notice the strong skew: ~93% of non-null actions are labeled 'legal' (81.3% of all rows), with 'illegal' rare.
Top values for action-legal (3 unique shown, of 3 total).
value count share
legal 289316 81.3%
tolerated 15064 4.2%
illegal 5934 1.7%
Fig 4.
rot-categorization · Compare how rules-of-thumb split between advice, social-norms, morality-ethics, and description.
Top values for rot-categorization (15 unique shown, of 15 total).
value count share
advice 82786 23.3%
social-norms 72934 20.5%
morality-ethics 58564 16.5%
description 58537 16.4%
social-norms|advice 27657 7.8%
morality-ethics|social-norms 27118 7.6%
morality-ethics|advice 11498 3.2%
social-norms|description 6078 1.7%
advice|description 4785 1.3%
morality-ethics|description 2023 0.6%
morality-ethics|social-norms|advice 790 0.2%
morality-ethics|social-norms|description 55 0.0%
morality-ethics|advice|description 26 0.0%
social-norms|advice|description 24 0.0%
morality-ethics|social-norms|advice|description 1 0.0%
Fig 5.
action-agree · Workers cluster at agreement values 3-4, indicating most actions get strong consensus.
Histogram bins for action-agree (median: 3.0). 40 bins of width 0.1; bins not listed have count 0.
bin count
0 – 0.1 1172
1 – 1.1 5874
2 – 2.1 44800
3 – 3.1 168120
3.9 – 4 91554
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column kind null %
area categorical 0.0%
m numeric 0.0%
split categorical 0.0%
rot-agree numeric 0.3%
rot-categorization categorical 0.9%
rot-moral-foundations categorical 21.7%
rot-char-targeting categorical 0.5%
rot-bad numeric 0.0%
rot-judgment text 0.0%
action text 0.0%
action-agency categorical 1.8%
action-moral-judgment numeric 12.7%
action-agree numeric 12.5%
action-legal categorical 12.8%
action-pressure numeric 13.1%
action-char-involved categorical 13.0%
action-hypothetical categorical 19.0%
situation text 0.0%
situation-short-id text 0.0%
rot text 0.0%
rot-id text 0.0%
rot-worker-id numeric 0.0%
breakdown-worker-id numeric 0.0%
n-characters numeric 0.0%
characters text 0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 14,991 detected strings).
lang count share
en 14978 99.9%
de 6 0.0%
la 1 0.0%
no 1 0.0%
nl 1 0.0%
pt 1 0.0%
es 1 0.0%
it 1 0.0%
ru 1 0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 9 numeric columns (values clipped to 2 decimals).
column m rot-agree rot-bad action-moral-judgment action-agree action-pressure rot-worker-id breakdown-worker-id n-characters
m +1.00 -0.10 -0.02 +0.03 +0.04 -0.02 +0.01 -0.04 -0.01
rot-agree -0.10 +1.00 +0.08 +0.03 -0.00 +0.03 +0.02 +0.02 -0.04
rot-bad -0.02 +0.08 +1.00 -0.07 -0.04 +0.04 +0.03 -0.07 -0.05
action-moral-judgment +0.03 +0.03 -0.07 +1.00 -0.04 +0.06 +0.00 +0.01 +0.12
action-agree +0.04 -0.00 -0.04 -0.04 +1.00 -0.05 +0.02 -0.12 +0.01
action-pressure -0.02 +0.03 +0.04 +0.06 -0.05 +1.00 -0.01 -0.03 +0.04
rot-worker-id +0.01 +0.02 +0.03 +0.00 +0.02 -0.01 +1.00 -0.03 -0.01
breakdown-worker-id -0.04 +0.02 -0.07 +0.01 -0.12 -0.03 -0.03 +1.00 -0.01
n-characters -0.01 -0.04 -0.05 +0.12 +0.01 +0.04 -0.01 -0.01 +1.00

area categorical label

This column tags each record with one of four source areas, with 'confessions' the modal value at 30.3% of 355,922 rows. The distribution is fairly balanced — entropy_ratio of 0.97 indicates near-uniform spread across the four categories, though 'dearabby' (50,300) is roughly half the size of the other three. No nulls and only 4 unique values make this a clean grouping key.

Treatment: one-hot encode or use directly as a stratification/grouping variable.

anthropic:claude-opus-4-7 · confidence high
Out[14]:

saturn.columns["area"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 4
top_value confessions
top_rate 0.3027
cardinality 4
entropy 1.947
entropy_ratio 0.9737
Fig 9.
Top values for area.
Top values for area (4 unique shown, of 4 total).
value count share
confessions 107749 30.3%
rocstories 101791 28.6%
amitheasshole 96082 27.0%
dearabby 50300 14.1%
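
The `entropy` and `entropy_ratio` stats for area can be reproduced from the value counts. This sketch assumes that Saturn reports Shannon entropy in bits and divides by log2 of the cardinality to get the ratio. That definition is an assumption, though it matches the reported 1.947 and 0.9737. The function names are illustrative.

```python
import math

# Counts copied from the area table in this report.
area_counts = [107749, 101791, 96082, 50300]


def entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution implied by the counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a single category carries no uncertainty
    return entropy_bits(counts) / math.log2(k)


print(f"entropy={entropy_bits(area_counts):.3f} ratio={entropy_ratio(area_counts):.4f}")
```

A ratio near 1 means the categories are close to uniform; a ratio near 0 means one category absorbs almost all rows.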

m numeric feature

Column 'm' is a numeric feature with only 4 distinct values across 355,922 rows, ranging from 1 to 50 with a median of 1 and both Q1 and Q3 equal to 1. The distribution is severely right-skewed (skew 3.77, kurtosis 12.45), with 18.5% of rows flagged as outliers despite a mean of just 4.24. The tiny cardinality combined with extreme spread suggests this is a categorical-like multiplier or count where most records sit at 1 and a few jump to much larger values.

Treatment: Treat as low-cardinality categorical or bin the rare large values before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[17]:

saturn.columns["m"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 4
min 1
max 50
mean 4.237
median 1
std 11.24
q1 1
q3 1
iqr 0
skew 3.774
kurtosis 12.45
n_outliers 65,947
outlier_rate 0.1853
zero_rate 0
alert: high_skew skew=+3.77
alert: outliers 18.5% rows beyond 1.5 IQR
Fig 10.
Distribution of m. Vertical dash marks the median.
Histogram bins for m (median: 1.0). 40 bins of width 1.225; bins not listed have count 0.
bin count
1 – 2.225 289975
2.225 – 3.45 5912
4.675 – 5.9 40035
48.78 – 50 20000
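
The 18.5% outlier rate follows directly from the IQR rule: with Q1 = Q3 = 1 the fence has zero width, so every value other than 1 is flagged. A minimal sketch of that arithmetic follows. The exact values 3 and 5 are inferred from the histogram bins and are an assumption, and the nearest-rank quantile is one of several possible conventions.

```python
# value -> count. The values 3 and 5 are inferred from the bins
# 2.225-3.45 and 4.675-5.9 and are an assumption, not reported stats.
m_counts = {1: 289975, 3: 5912, 5: 40035, 50: 20000}


def quantile_from_counts(counts, q):
    """Nearest-rank quantile of a distribution given as value -> count."""
    target = q * sum(counts.values())
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= target:
            return value
    return max(counts)


q1 = quantile_from_counts(m_counts, 0.25)
q3 = quantile_from_counts(m_counts, 0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Every row whose value falls outside [low, high] counts as an outlier.
n_outliers = sum(c for v, c in m_counts.items() if v < low or v > high)
outlier_rate = n_outliers / sum(m_counts.values())
print(q1, q3, iqr, n_outliers, round(outlier_rate, 4))
```

Because the fence collapses to a single point, the outlier flag here says more about the low cardinality of the column than about anomalous rows.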

split categorical metadata

Column holds the dataset split assignment across 355922 rows with 7 distinct values and no nulls. 'train' dominates at 65.6% (233501 rows), followed by near-equal 'test' (29239) and 'dev' (29234), plus auxiliary 'test-extra', 'analysis', and 'dev-extra' partitions; a 'none' bucket of 3913 rows is unusual and likely indicates unassigned examples.

Treatment: Use as a row filter to separate train/dev/test; investigate or exclude the 'none' rows before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[20]:

saturn.columns["split"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 7
top_value train
top_rate 0.656
cardinality 7
entropy 1.763
entropy_ratio 0.6281
Fig 11.
Top values for split.
Top values for split (7 unique shown, of 7 total).
value count share
train 233501 65.6%
test 29239 8.2%
dev 29234 8.2%
test-extra 20050 5.6%
analysis 20000 5.6%
dev-extra 19985 5.6%
none 3913 1.1%
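
The row-filter treatment suggested for split can be sketched as follows. The rows and the `rot-id` values are hypothetical; only the split names come from the table above. Which splits to keep is a modelling choice, not something the report prescribes.

```python
def partition_by_split(rows, keep=("train", "dev", "test")):
    """Group rows by their 'split' value; rows in any other split are set aside."""
    parts = {name: [] for name in keep}
    set_aside = []
    for row in rows:
        # Rows whose split is not in `keep` (e.g. 'none') land in set_aside.
        parts.get(row["split"], set_aside).append(row)
    return parts, set_aside


# Hypothetical rows for illustration.
rows = [
    {"rot-id": "a", "split": "train"},
    {"rot-id": "b", "split": "dev"},
    {"rot-id": "c", "split": "none"},
    {"rot-id": "d", "split": "test"},
    {"rot-id": "e", "split": "test-extra"},
]
parts, set_aside = partition_by_split(rows)
print({name: len(items) for name, items in parts.items()}, len(set_aside))
```

Keeping the set-aside rows in their own list, rather than dropping them silently, makes it easy to inspect the 'none' bucket before deciding what to do with it.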

rot-agree numeric feature

This is a 5-value ordinal score (0–4) capturing agreement with the rule-of-thumb ('rot-agree'), with a mean of 3.10 and median 3.0, so answers cluster at the high end. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4, and 2.4% of non-null rows (the 8,547 scores of 0 or 1) fall below the lower IQR fence of 1.5. Nulls are negligible (0.35%) and zeros are rare (0.38%).

Treatment: Treat as an ordinal Likert feature; keep as-is or one-hot encode rather than log-transform.

anthropic:claude-opus-4-7 · confidence high
Out[23]:

saturn.columns["rot-agree"].stats

stat value
n 355,922
nulls 1,236 (0.3%)
unique 5
min 0
max 4
mean 3.1
median 3
std 0.7437
q1 3
q3 4
iqr 1
skew -0.6805
kurtosis 0.7872
n_outliers 8,547
outlier_rate 0.0241
zero_rate 0.003786
Fig 12.
Distribution of rot-agree. Vertical dash marks the median.
Histogram bins for rot-agree (median: 3.0). 40 bins of width 0.1; bins not listed have count 0.
bin count
0 – 0.1 1343
1 – 1.1 7204
2 – 2.1 52396
3 – 3.1 187322
3.9 – 4 106421

rot-categorization categorical label

Categorical tag describing the type of 'rule of thumb' (RoT), with 15 distinct values drawn from a small base vocabulary (advice, social-norms, morality-ethics, description) plus pipe-delimited combinations. Distribution is moderately balanced (entropy ratio 0.72); 'advice' leads at 23.5% of non-null rows (23.3% of all rows) and the four single-tag values dominate, while compound tags like 'social-norms|advice' (27,657) show that a multi-label annotation is packed into one string. Null rate is low at 0.86%.

Treatment: split on '|' and one-hot encode the four base categories for multi-label modelling.

anthropic:claude-opus-4-7 · confidence high
Out[26]:

saturn.columns["rot-categorization"].stats

stat value
n 355,922
nulls 3,046 (0.9%)
unique 15
top_value advice
top_rate 0.2346
cardinality 15
entropy 2.806
entropy_ratio 0.7181
Fig 13.
Top values for rot-categorization.
Top values for rot-categorization (15 unique shown, of 15 total).
value count share
advice 82786 23.3%
social-norms 72934 20.5%
morality-ethics 58564 16.5%
description 58537 16.4%
social-norms|advice 27657 7.8%
morality-ethics|social-norms 27118 7.6%
morality-ethics|advice 11498 3.2%
social-norms|description 6078 1.7%
advice|description 4785 1.3%
morality-ethics|description 2023 0.6%
morality-ethics|social-norms|advice 790 0.2%
morality-ethics|social-norms|description 55 0.0%
morality-ethics|advice|description 26 0.0%
social-norms|advice|description 24 0.0%
morality-ethics|social-norms|advice|description 1 0.0%

rot-moral-foundations categorical label

This column tags each row with one or more of Haidt's moral foundations (care-harm, fairness-cheating, loyalty-betrayal, authority-subversion, sanctity-degradation), with multi-label combinations encoded as pipe-delimited strings yielding 30 distinct values. Distribution is heavily skewed: 'care-harm' alone covers 38.9% of non-null rows (30.5% of all rows), and 21.7% of rows are null. Entropy ratio of 0.60 confirms that the mass sits in the top few categories, followed by a long tail of rare combinations.

Treatment: Split on '|' and binarize into five multi-hot foundation indicators; treat nulls as an explicit missing flag.

anthropic:claude-opus-4-7 · confidence high
Out[29]:

saturn.columns["rot-moral-foundations"].stats

stat value
n 355,922
nulls 77,120 (21.7%)
unique 30
top_value care-harm
top_rate 0.3893
cardinality 30
entropy 2.962
entropy_ratio 0.6037
alert: null_rate 21.7% null
Fig 14.
Top values for rot-moral-foundations.
Top values for rot-moral-foundations (20 unique shown, of 30 total).
value count share
care-harm 108535 30.5%
fairness-cheating 37666 10.6%
loyalty-betrayal 34581 9.7%
care-harm|loyalty-betrayal 21125 5.9%
authority-subversion 19087 5.4%
sanctity-degradation 14657 4.1%
care-harm|fairness-cheating 10787 3.0%
care-harm|authority-subversion 7958 2.2%
care-harm|sanctity-degradation 6328 1.8%
fairness-cheating|loyalty-betrayal 6222 1.7%
fairness-cheating|authority-subversion 3738 1.1%
loyalty-betrayal|authority-subversion 2122 0.6%
fairness-cheating|sanctity-degradation 1218 0.3%
authority-subversion|sanctity-degradation 1042 0.3%
loyalty-betrayal|sanctity-degradation 885 0.2%
care-harm|fairness-cheating|loyalty-betrayal 834 0.2%
care-harm|loyalty-betrayal|authority-subversion 623 0.2%
care-harm|authority-subversion|sanctity-degradation 423 0.1%
care-harm|fairness-cheating|authority-subversion 416 0.1%
care-harm|loyalty-betrayal|sanctity-degradation 183 0.1%
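
The split-and-binarize treatment for rot-moral-foundations can be sketched in plain Python. The function name `multi_hot` and the `missing` indicator column are illustrative choices; the five foundation names come from the table above, and a null is assumed to arrive as `None`.

```python
FOUNDATIONS = (
    "care-harm",
    "fairness-cheating",
    "loyalty-betrayal",
    "authority-subversion",
    "sanctity-degradation",
)


def multi_hot(label):
    """Encode one pipe-delimited label (or None for null) as 0/1 indicators."""
    present = set(label.split("|")) if label else set()
    unknown = present - set(FOUNDATIONS)
    if unknown:
        raise ValueError(f"unexpected foundation(s): {sorted(unknown)}")
    row = {name: int(name in present) for name in FOUNDATIONS}
    row["missing"] = int(label is None)  # explicit flag for null rows
    return row


print(multi_hot("care-harm|loyalty-betrayal"))
print(multi_hot(None))
```

Encoding nulls as all-zero indicators plus a separate `missing` flag keeps them distinguishable from rows that were annotated, which matters when 21.7% of the column is null.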

rot-char-targeting categorical feature

Categorical tag identifying which character slot a rule-of-thumb (RoT) targets, with 7 distinct values dominated by 'char-0' (51.9% of non-null rows) and 'char-1' (123,396 rows). The distribution is sharply long-tailed: 'char-4' and 'char-5' appear only 46 and 15 times respectively, and 23,781 rows are explicitly 'char-none'. Entropy ratio is 0.557, confirming most mass sits in the first two categories. Null rate is low at 0.46%.

Treatment: One-hot encode and consider collapsing rare 'char-3/4/5' levels into an 'other' bucket.

anthropic:claude-opus-4-7 · confidence high
Out[32]:

saturn.columns["rot-char-targeting"].stats

stat value
n 355,922
nulls 1,622 (0.5%)
unique 7
top_value char-0
top_rate 0.5192
cardinality 7
entropy 1.564
entropy_ratio 0.5571
Fig 15.
Top values for rot-char-targeting.
Top values for rot-char-targeting (7 unique shown, of 7 total).
value count share
char-0 183950 51.7%
char-1 123396 34.7%
char-none 23781 6.7%
char-2 21648 6.1%
char-3 1464 0.4%
char-4 46 0.0%
char-5 15 0.0%

rot-bad numeric label

Binary 0/1 flag (n_unique=2, min=0, max=1) indicating a rare 'rot-bad' condition. Positives occur at 2.0% (mean=0.0201, zero_rate=0.9799), producing the flagged high skew (6.84) and heavy kurtosis (44.78). The 7,153 'outliers' are simply the positive class, not anomalies.

Treatment: Treat as a binary target; address class imbalance with stratified sampling or class weights rather than outlier removal.

anthropic:claude-opus-4-7 · confidence high
Out[35]:

saturn.columns["rot-bad"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 2
min 0
max 1
mean 0.0201
median 0
std 0.1403
q1 0
q3 0
iqr 0
skew 6.84
kurtosis 44.78
n_outliers 7,153
outlier_rate 0.0201
zero_rate 0.9799
alert: high_skew skew=+6.84
Fig 16.
Distribution of rot-bad. Vertical dash marks the median.
Histogram bins for rot-bad (median: 0.0). 40 bins of width 0.025; bins not listed have count 0.
bin count
0 – 0.025 348769
0.975 – 1 7153
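
One way to act on the class-weight suggestion for rot-bad is the common "balanced" heuristic, weight = n / (k · n_c), which is the same formula scikit-learn applies for `class_weight="balanced"`. The sketch below uses the counts from the histogram above; the function name is illustrative, and whether these weights suit a given model is left open.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n / (k * n_c) so every class has equal total weight."""
    n = sum(class_counts.values())
    k = len(class_counts)
    return {label: n / (k * count) for label, count in class_counts.items()}


# Counts copied from the rot-bad histogram in this report.
rot_bad_counts = {0: 348769, 1: 7153}
weights = balanced_class_weights(rot_bad_counts)
print({label: round(w, 4) for label, w in weights.items()})
```

With a 2% positive rate, each positive row ends up weighted roughly 49 times as heavily as a negative row.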

rot-judgment text label

Short moral-judgment phrases (mean 2 words, median 9 chars, max 94) drawn from a tight vocabulary of 797 tokens, dominated by verdicts like "It's good", "shouldn't", "It's okay", and "It's wrong". With 97.0% duplicate rate and only 10,589 uniques across 355,922 rows, this behaves as a categorical label rather than free text. Casing is inconsistent (e.g. "It's good" vs "it's good" appear as separate top values), which will inflate cardinality unless normalized.

Treatment: Lowercase and strip punctuation, then treat as a categorical target rather than free text.

anthropic:claude-opus-4-7 · confidence high
Out[38]:

saturn.columns["rot-judgment"].stats

stat value
n 355,922
nulls 1 (0.0%)
unique 10,589
len_min 1
len_max 94
len_mean 10.46
len_median 9
len_p95 19
word_mean 2.001
word_median 2
n_empty 0
n_duplicates 345,332
duplicate_rate 0.9702
vocab_size 797
readability_flesch_mean 83.27
emoji_rate 0
url_rate 0
one_word_rate 0.2083
allcaps_rate 5.619e-06
boilerplate_rate 0
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 97.0% duplicate strings
Fig 17.
Character-length distribution for rot-judgment.
Character-length distribution for rot-judgment (mean: 10.46).
chars count
1 – 3 8736
3 – 6 11861
6 – 8 31229
8 – 10 187513
10 – 13 22320
13 – 15 51718
15 – 17 16171
17 – 20 20003
20 – 22 2587
22 – 24 1287
24 – 27 496
27 – 29 412
29 – 31 487
31 – 34 256
34 – 36 187
36 – 38 183
38 – 41 91
41 – 43 76
43 – 45 103
45 – 48 39
48 – 50 42
50 – 52 32
52 – 54 22
54 – 57 15
57 – 59 12
59 – 61 7
61 – 64 6
64 – 66 12
66 – 68 3
68 – 71 2
71 – 73 4
73 – 75 2
75 – 78 0
78 – 80 2
80 – 82 1
82 – 85 0
85 – 87 1
87 – 89 0
89 – 92 1
92 – 94 2
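
The normalization suggested for rot-judgment can be sketched as below. It lowercases each string and strips leading and trailing punctuation and whitespace, keeping internal apostrophes so that "shouldn't" stays intact. The sample strings are hypothetical variants of the verdicts named above, and the function name is illustrative.

```python
import string
from collections import Counter

# Characters removed from both ends of each judgment string.
_EDGE_CHARS = string.punctuation + string.whitespace


def normalize_judgment(text):
    """Lowercase and trim edge punctuation/whitespace; internal characters are kept."""
    return text.strip(_EDGE_CHARS).lower()


# Hypothetical casing and punctuation variants of the same verdicts.
samples = ["It's good", "it's good", "It's good.", "shouldn't", " Shouldn't "]
normalized = Counter(normalize_judgment(s) for s in samples)
print(normalized)
```

After normalization the five variants collapse to two labels, which is the effect the treatment is after: a smaller, cleaner categorical vocabulary.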

action text free_text

Short English phrases describing an action or behaviour (mean 41.8 chars, median 7 words), e.g. 'being yourself.' or 'cheating on your partner.' — likely the subject of a moral/judgement prompt. Roughly 27% of the 355,922 rows are duplicates (95,292), with the top phrase repeating 461 times, so the same actions recur heavily across records. Language detection flags multilingual but only 2 non-English rows (1 la, 1 no) out of ~5k sampled, so effectively monolingual.

Treatment: Normalize casing/punctuation and tokenize or embed before modelling; expect heavy phrase repetition.

anthropic:claude-opus-4-7 · confidence high
Out[41]:

saturn.columns["action"].stats

stat value
n 355,922
nulls 3 (0.0%)
unique 260,627
len_min 1
len_max 221
len_mean 41.76
len_median 40
len_p95 73
word_mean 6.969
word_median 7
n_empty 0
n_duplicates 95,292
duplicate_rate 0.2677
vocab_size 9,430
readability_flesch_mean 57.88
emoji_rate 0
url_rate 2.81e-06
one_word_rate 0.006013
allcaps_rate 0
boilerplate_rate 2.81e-06
alert: multilingual 4 languages detected in sample
alert: duplicates 26.8% duplicate strings
Fig 18.
Character-length distribution for action.
Character-length distribution for action (mean: 41.76).
chars count
1 – 6 456
6 – 12 2368
12 – 18 18362
18 – 23 23877
23 – 28 40347
28 – 34 39906
34 – 40 49900
40 – 45 40358
45 – 50 40757
50 – 56 27618
56 – 62 25084
62 – 67 15079
67 – 72 12798
72 – 78 6840
78 – 84 5004
84 – 89 2500
89 – 94 1855
94 – 100 1061
100 – 106 731
106 – 111 448
111 – 116 223
116 – 122 111
122 – 128 79
128 – 133 63
133 – 138 40
138 – 144 19
144 – 150 13
150 – 155 4
155 – 160 10
160 – 166 2
166 – 172 0
172 – 177 2
177 – 182 3
182 – 188 0
188 – 194 0
194 – 199 0
199 – 204 0
204 – 210 0
210 – 216 0
216 – 221 1

action-agency categorical feature

Binary categorical flag with only two levels, 'agency' (88.5% of non-null rows) and 'experience' (the remaining 11.5%). Going by the level names, it likely distinguishes actions a character performs from events a character experiences. Class imbalance is heavy and 1.84% of rows are null, so any modelling needs to account for both. Entropy ratio of 0.515 reflects the dominance of the majority class.

Treatment: Encode as a binary indicator and impute or flag the ~1.8% nulls.

anthropic:claude-opus-4-7 · confidence high
Out[44]:

saturn.columns["action-agency"].stats

stat value
n 355,922
nulls 6,551 (1.8%)
unique 2
top_value agency
top_rate 0.8849
cardinality 2
entropy 0.5152
entropy_ratio 0.5152
Fig 19.
Top values for action-agency.
Top values for action-agency (2 unique shown, of 2 total).
value count share
agency 309148 86.9%
experience 40223 11.3%

action-moral-judgment numeric label

A discrete moral-judgment rating on a 5-point scale from -2 to 2 (likely Likert-style: very wrong → very right). The mean of -0.178 and median of 0 indicate a slight lean toward negative judgments, with 43.7% of values exactly zero and 12.7% missing. The distribution is nearly symmetric (skew -0.011) and platykurtic, so most ratings cluster near neutral with modest spread (std 0.857).

Treatment: Treat as an ordinal label; impute or filter the 12.7% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[47]:

saturn.columns["action-moral-judgment"].stats

stat value
n 355,922
nulls 45,215 (12.7%)
unique 5
min -2
max 2
mean -0.1785
median 0
std 0.8565
q1 -1
q3 0
iqr 1
skew -0.01137
kurtosis -0.3466
n_outliers 4,592
outlier_rate 0.01478
zero_rate 0.4367
Fig 20.
Distribution of action-moral-judgment. Vertical dash marks the median.
Histogram bins for action-moral-judgment (median: 0.0). 40 bins of width 0.1; bins not listed have count 0.
bin count
-2 – -1.9 16356
-1 – -0.9 92992
0 – 0.1 135698
1 – 1.1 61069
1.9 – 2 4592
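
Because action-moral-judgment takes only five integer values, its summary stats can be recomputed from the histogram counts alone. The sketch below does that arithmetic, using the population standard deviation (an assumption; at this sample size it is indistinguishable from the sample version at four decimals). The variable names are illustrative.

```python
import math

# value -> count, copied from the action-moral-judgment histogram in this report.
judgment_counts = {-2: 16356, -1: 92992, 0: 135698, 1: 61069, 2: 4592}

n_non_null = sum(judgment_counts.values())
mean = sum(value * count for value, count in judgment_counts.items()) / n_non_null
variance = sum(
    count * (value - mean) ** 2 for value, count in judgment_counts.items()
) / n_non_null
std = math.sqrt(variance)
zero_rate = judgment_counts[0] / n_non_null  # share of non-null rows equal to 0

print(f"n={n_non_null} mean={mean:.4f} std={std:.4f} zero_rate={zero_rate:.4f}")
```

The non-null count equals 355,922 minus the 45,215 nulls, which confirms that `zero_rate` and `outlier_rate` for this column are computed over non-null rows.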

action-agree numeric feature

This is a 5-level ordinal feature (values 0-4), almost certainly a Likert-style agreement rating for an 'action' item, with mean 3.10 and median 3.0 indicating a tilt toward agreement. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4, so most respondents pick 3 or 4; only 0.38% give zero. Note the 12.48% null rate, which is substantial and likely reflects non-response or skipped items.

Treatment: Treat as ordinal (0-4); impute or flag the ~12% missing values before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[50]:

saturn.columns["action-agree"].stats

stat value
n 355,922
nulls 44,402 (12.5%)
unique 5
min 0
max 4
mean 3.101
median 3
std 0.7326
q1 3
q3 4
iqr 1
skew -0.6768
kurtosis 0.8829
n_outliers 7,046
outlier_rate 0.02262
zero_rate 0.003762
Fig 21.
Distribution of action-agree. Vertical dash marks the median.
Histogram bins for action-agree (median: 3.0). 40 bins of width 0.1; bins not listed have count 0.
bin count
0 – 0.1 1172
1 – 1.1 5874
2 – 2.1 44800
3 – 3.1 168120
3.9 – 4 91554
Out[53]:

saturn.columns["action-legal"].stats

stat value
n 355,922
nulls 45,608 (12.8%)
unique 3
top_value legal
top_rate 0.9323
cardinality 3
entropy 0.4153
entropy_ratio 0.262
Fig 22.
Top values for action-legal.
Top values for action-legal (3 unique shown, of 3 total).
value count share
legal 289316 81.3%
tolerated 15064 4.2%
illegal 5934 1.7%

action-pressure numeric feature

A discrete ordinal feature taking only 5 values across the symmetric range -2 to 2, almost certainly a Likert-style pressure rating. The distribution is balanced (mean -0.04, skew -0.05) and centered on 0, which accounts for 35.3% of non-null rows. Notable: 13.08% of rows are null, and despite being numeric there are just 5 unique values.

Treatment: Treat as ordinal categorical; impute or flag the 13% missing before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[56]:

saturn.columns["action-pressure"].stats

stat value
n 355,922
nulls 46,537 (13.1%)
unique 5
min -2
max 2
mean -0.03992
median 0
std 1.114
q1 -1
q3 1
iqr 2
skew -0.04985
kurtosis -0.6595
n_outliers 0
outlier_rate 0
zero_rate 0.3534
Fig 23.
Distribution of action-pressure. Vertical dash marks the median.
Histogram bins for action-pressure (median: 0.0). 40 bins of width 0.1; bins not listed have count 0.
bin count
-2 – -1.9 35266
-1 – -0.9 66353
0 – 0.1 109344
1 – 1.1 72309
1.9 – 2 26113

action-char-involved categorical feature

Categorical pointer to which character slot is involved in an action, with 7 distinct values dominated by char-0 (51.5% of non-null rows) and char-1 (~34.8% of non-null rows). The long tail collapses fast: char-3 through char-5 together account for fewer than 1,400 rows, and 13.05% of rows are null while another 22,100 are explicitly tagged 'char-none'. Entropy ratio of 0.56 confirms the heavy concentration on the first two slots.

Treatment: Treat as a low-cardinality categorical; collapse char-3/4/5 into an 'other' bucket and decide whether null and char-none should be merged.

anthropic:claude-opus-4-7 · confidence high
Out[59]:

saturn.columns["action-char-involved"].stats

stat value
n 355,922
nulls 46,441 (13.0%)
unique 7
top_value char-0
top_rate 0.515
cardinality 7
entropy 1.578
entropy_ratio 0.562
Fig 24.
Top values for action-char-involved.
Top values for action-char-involved (7 unique shown, of 7 total).
value count share
char-0 159381 44.8%
char-1 107567 30.2%
char-none 22100 6.2%
char-2 19124 5.4%
char-3 1252 0.4%
char-4 41 0.0%
char-5 16 0.0%
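
Collapsing the rare character slots into an 'other' bucket can be sketched with the counts above. The threshold of 5,000 rows is a hypothetical cut-off chosen so that char-3, char-4, and char-5 fall below it; the function name and the bucket label are also illustrative.

```python
def collapse_rare(level_counts, min_count, other="other"):
    """Merge every level with fewer than min_count rows into a single bucket."""
    collapsed = {}
    for level, count in level_counts.items():
        key = level if count >= min_count else other
        collapsed[key] = collapsed.get(key, 0) + count
    return collapsed


# Counts copied from the action-char-involved table in this report.
char_counts = {
    "char-0": 159381,
    "char-1": 107567,
    "char-none": 22100,
    "char-2": 19124,
    "char-3": 1252,
    "char-4": 41,
    "char-5": 16,
}
collapsed = collapse_rare(char_counts, min_count=5000)  # hypothetical threshold
print(collapsed)
```

Nulls are not part of these counts, so the separate question of whether to merge null with 'char-none' is unaffected by this step.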

action-hypothetical categorical label

A 5-level categorical labeling whether an action is stated explicitly or only hypothetically/probably, with negative variants ('explicit-no', 'probable-no'). 'explicit' dominates at 47.1% of non-null rows, but 19.04% of values are null, which is substantial. Entropy ratio of 0.84 indicates the remaining classes are reasonably spread rather than collapsed onto a single mode.

Treatment: One-hot encode the five levels and add an explicit missing indicator for the 19% nulls.

anthropic:claude-opus-4-7 · confidence high
Out[62]:

saturn.columns["action-hypothetical"].stats

stat value
n 355,922
nulls 67,776 (19.0%)
unique 5
top_value explicit
top_rate 0.4711
cardinality 5
entropy 1.956
entropy_ratio 0.8423
Fig 25.
Top values for action-hypothetical.
Top values for action-hypothetical (5 unique shown, of 5 total).
value count share
explicit 135744 38.1%
hypothetical 62689 17.6%
probable 49670 14.0%
explicit-no 25022 7.0%
probable-no 15021 4.2%

situation text free_text

Short first-person descriptions of personal situations or dilemmas, averaging 55 characters / 10 words and topping out at 300 chars, with high readability (Flesch 78.5). Massive duplication is the headline issue: 252,626 of 355,922 rows are duplicates (71%), leaving only 103,296 uniques, and several distinct strings repeat exactly 130 times — suggesting templated or oversampled records. Language detection is overwhelmingly English (4,989) with a tiny multilingual tail (6 de, 1 each es/it/nl/pt/ru) that is unlikely to matter at scale.

Treatment: Deduplicate before modelling, then tokenize and embed; the 130-repeat pattern warrants checking the upstream sampling.

anthropic:claude-opus-4-7 · confidence high
Out[65]:

saturn.columns["situation"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 103,296
len_min 10
len_max 300
len_mean 55.42
len_median 52
len_p95 100
word_mean 10.52
word_median 10
n_empty 0
n_duplicates 252,626
duplicate_rate 0.7098
vocab_size 16,855
readability_flesch_mean 78.53
emoji_rate 0
url_rate 0
one_word_rate 0
allcaps_rate 0.0005282
boilerplate_rate 0.0001292
alert: multilingual 8 languages detected in sample
alert: duplicates 71.0% duplicate strings
Fig 26.
Character-length distribution for situation.
Character-length distribution for situation (mean: 55.42).
chars count
10 – 17 3408
17 – 24 13315
24 – 32 21793
32 – 39 34803
39 – 46 55004
46 – 54 62695
54 – 61 63829
61 – 68 37281
68 – 75 22416
75 – 82 8768
82 – 90 7451
90 – 97 5256
97 – 104 4592
104 – 112 2828
112 – 119 2283
119 – 126 1878
126 – 133 1729
133 – 140 1054
140 – 148 930
148 – 155 655
155 – 162 734
162 – 170 439
170 – 177 336
177 – 184 320
184 – 191 296
191 – 198 228
198 – 206 196
206 – 213 137
213 – 220 207
220 – 228 201
228 – 235 165
235 – 242 69
242 – 249 147
249 – 256 48
256 – 264 75
264 – 271 66
271 – 278 60
278 – 286 65
286 – 293 56
293 – 300 109
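
The duplicate stats for situation are consistent with counting every repeat beyond the first occurrence, that is, n_duplicates = n − unique (355,922 − 103,296 = 252,626). The sketch below assumes that definition and shows an order-preserving deduplication; the sample strings are hypothetical.

```python
def duplicate_stats(values):
    """Return (n_duplicates, duplicate_rate), counting repeats beyond the first."""
    n = len(values)
    n_duplicates = n - len(set(values))
    return n_duplicates, (n_duplicates / n if n else 0.0)


def deduplicate(values):
    """Drop repeated strings while keeping first-occurrence order."""
    return list(dict.fromkeys(values))


# Hypothetical sample: one situation annotated three times, another once.
sample = ["lost my keys", "missed the bus", "lost my keys", "lost my keys"]
print(duplicate_stats(sample), deduplicate(sample))

# The same arithmetic applied to the reported totals for 'situation'.
reported_duplicates = 355922 - 103296
reported_rate = reported_duplicates / 355922
print(reported_duplicates, round(reported_rate, 4))
```

Since each situation is shared by several annotation rows, deduplicating on this column alone discards the per-row judgments; grouping by situation may be the better fit when those labels are needed.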

situation-short-id text foreign_key

Slash-delimited source identifiers pointing back to situations on Reddit (confessions, amitheasshole), Dear Abby columns, and ROCStories sentences. Despite the 'id' name, only 103,692 values are unique across 355,922 rows — a 70.9% duplicate rate, with the most repeated key appearing 180 times, so each situation evidently surfaces in many rows. Every value is a single token (one_word_rate 1.0) up to 99 chars long.

Treatment: Use as a join key to the situation source table; do not treat as a row-level primary key.

anthropic:claude-opus-4-7 · confidence high
Out[68]:

saturn.columns["situation-short-id"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 103,692
len_min 24
len_max 99
len_mean 41.29
len_median 27
len_p95 74
word_mean 1
word_median 1
n_empty 0
n_duplicates 252,230
duplicate_rate 0.7087
vocab_size 16,878
readability_flesch_mean -443.9
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: duplicates 70.9% duplicate strings
Fig 27. Character-length distribution for situation-short-id (mean: 41.29).
chars count
24 – 26 107749
26 – 28 96082
28 – 30 0
30 – 32 0
32 – 33 0
33 – 35 0
35 – 37 0
37 – 39 0
39 – 41 0
41 – 43 0
43 – 45 0
45 – 46 0
46 – 48 0
48 – 50 0
50 – 52 0
52 – 54 0
54 – 56 14
56 – 58 101845
58 – 60 196
60 – 62 893
62 – 63 1880
63 – 65 3550
65 – 67 5362
67 – 69 3151
69 – 71 7061
71 – 73 6866
73 – 75 6129
75 – 76 5145
76 – 78 3985
78 – 80 2407
80 – 82 1681
82 – 84 336
84 – 86 930
86 – 88 316
88 – 90 137
90 – 92 151
92 – 93 28
93 – 95 5
95 – 97 7
97 – 99 16
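
One way to act on the treatment for `situation-short-id` is to confirm that the key repeats across annotation rows before joining on it. The sketch below uses plain Python and made-up keys (`src/a`, `src/b`); the real identifiers and the situation source table are not shown in this report, so every name and value here is hypothetical.

```python
from collections import Counter

# Hypothetical annotation rows: (situation-short-id, rot).
rows = [
    ("src/a", "rot 1"),
    ("src/a", "rot 2"),
    ("src/b", "rot 3"),
]
# Hypothetical situation source table keyed by the same id.
situations = {"src/a": "situation text A", "src/b": "situation text B"}

# The key repeats across rows, so it cannot serve as a row-level primary key.
key_counts = Counter(key for key, _ in rows)
is_primary_key = all(count == 1 for count in key_counts.values())

# Many-to-one join: each row picks up its situation text via the key.
joined = [(key, rot, situations.get(key)) for key, rot in rows]

print(is_primary_key)             # False
print(key_counts.most_common(1))  # [('src/a', 2)]
```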

rot text free_text

Short English moral/normative statements (mean 10.1 words, max 221 chars), almost certainly rule-of-thumb (RoT) annotations describing what one should or shouldn't do. The vocabulary is small (9,659 types) and readability is high (Flesch 79.4), consistent with simple declarative templates like 'It is good/okay to...' and 'You shouldn't...'. Notable: 27.1% of rows are duplicates (96,308), with single phrases like 'It is good to be yourself.' repeating 287 times, indicating heavy template reuse rather than free generation.

Treatment: Deduplicate or weight by frequency before tokenizing and embedding for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[71]:

saturn.columns["rot"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 259,614
len_min 6
len_max 221
len_mean 54.66
len_median 53
len_p95 87
word_mean 10.1
word_median 10
n_empty 0
n_duplicates 96,308
duplicate_rate 0.2706
vocab_size 9,659
readability_flesch_mean 79.38
emoji_rate 0
url_rate 2.81e-06
one_word_rate 5.619e-06
allcaps_rate 0
boilerplate_rate 5.619e-06
alert: duplicates 27.1% duplicate strings
Fig 28. Character-length distribution for rot (mean: 54.66).
chars count
6 – 11 4
11 – 17 114
17 – 22 2576
22 – 28 12669
28 – 33 19758
33 – 38 34865
38 – 44 35016
44 – 49 38131
49 – 54 47869
54 – 60 36856
60 – 65 37718
65 – 70 25208
70 – 76 19264
76 – 81 17161
81 – 87 10039
87 – 92 6486
92 – 97 4880
97 – 103 2558
103 – 108 2004
108 – 114 1000
114 – 119 610
119 – 124 446
124 – 130 213
130 – 135 231
135 – 140 87
140 – 146 58
146 – 151 30
151 – 156 28
156 – 162 20
162 – 167 11
167 – 173 3
173 – 178 2
178 – 183 3
183 – 189 1
189 – 194 2
194 – 200 0
200 – 205 0
205 – 210 0
210 – 216 0
216 – 221 1
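
The treatment for `rot` (deduplicate or weight by frequency) can be sketched with the standard library alone. Only the first string below is quoted in this report; the second is invented for illustration, and how the weights are consumed downstream is left open.

```python
from collections import Counter

# Toy sample: the first string is quoted in the report, the other is made up.
rots = [
    "It is good to be yourself.",
    "It is good to be yourself.",
    "You shouldn't lie to your friends.",
]

# Collapse exact duplicates while remembering how often each string occurred.
weights = Counter(rots)
unique_rots = list(weights)  # tokenize/embed each distinct string once
sample_weights = [weights[r] for r in unique_rots]

print(unique_rots)
print(sample_weights)  # [2, 1]
```

Keeping the counts means the frequency signal is not lost when each distinct string is embedded only once.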

rot-id text identifier

Structured path-like identifiers for 'rule of thumb' records, sourced from Reddit (amitheasshole, confessions), Dear Abby columns, and ROCStories. Despite being IDs, only 291,974 of 355,922 values are unique (duplicate_rate 0.18, with top values repeating up to 58 times), so this is not a primary key. Every value is a single token (one_word_rate 1.0, word_mean 1.0) of 63 to 140 characters; the single-token shape triggered the one_word alert, which is expected for slash-delimited identifiers.

Treatment: Treat as a composite identifier; split on '/' to extract source/subreddit/post fields rather than embedding the raw string.

anthropic:claude-opus-4-7 · confidence high
Out[74]:

saturn.columns["rot-id"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 291,974
len_min 63
len_max 140
len_mean 81.61
len_median 68
len_p95 114
word_mean 1
word_median 1
n_empty 0
n_duplicates 63,948
duplicate_rate 0.1797
vocab_size 18,736
readability_flesch_mean -755.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
Fig 29. Character-length distribution for rot-id (mean: 81.61).
chars count
63 – 65 4555
65 – 67 111051
67 – 69 88225
69 – 71 0
71 – 73 0
73 – 75 0
75 – 76 0
76 – 78 0
78 – 80 0
80 – 82 0
82 – 84 0
84 – 86 0
86 – 88 0
88 – 90 0
90 – 92 0
92 – 94 2
94 – 96 12
96 – 98 65795
98 – 100 36190
100 – 102 806
102 – 103 1665
103 – 105 3057
105 – 107 5065
107 – 109 6209
109 – 111 7337
111 – 113 6570
113 – 115 3187
115 – 117 5314
117 – 119 4016
119 – 121 2809
121 – 123 1847
123 – 125 925
125 – 127 700
127 – 128 296
128 – 130 169
130 – 132 72
132 – 134 22
134 – 136 8
136 – 138 14
138 – 140 4
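
The treatment for `rot-id` suggests splitting on '/' rather than embedding the raw string. A minimal sketch follows, assuming nothing about the layout beyond the slash delimiter: the example id and the positional field names (`part_0`, `part_1`, ...) are hypothetical, since the report does not show a real value.

```python
def split_id(raw: str) -> dict:
    """Split a slash-delimited id into positional fields."""
    parts = raw.split("/")
    # Field names are positional placeholders, not the dataset's own names.
    return {f"part_{i}": part for i, part in enumerate(parts)}


# Hypothetical id; the real layout is not shown in this report.
example = "rot/reddit/amitheasshole/abc123/1"
fields = split_id(example)

print(fields["part_1"], fields["part_2"])  # reddit amitheasshole
```

Once the real layout is known, the placeholders can be renamed to source, subreddit, and post fields as the treatment describes.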

rot-worker-id numeric foreign_key

Despite being typed numeric, `rot-worker-id` looks like a categorical worker identifier: only 89 unique values across 355,922 rows, no nulls, and no zeros. The ids span 2 to 144 (median 83) with near-zero skew (-0.08) and flat kurtosis (-1.37), consistent with arbitrary id codes rather than a measured quantity. Rows per id are uneven, though: one histogram bin (41.05 to 44.6) alone holds 79,576 rows. No outliers were flagged.

Treatment: Treat as a categorical worker key; encode or join rather than aggregating numerically.

anthropic:claude-opus-4-7 · confidence high
Out[77]:

saturn.columns["rot-worker-id"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 89
min 2
max 144
mean 72.54
median 83
std 39.61
q1 42
q3 105
iqr 63
skew -0.0804
kurtosis -1.366
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 30. Distribution of rot-worker-id. Vertical dash marks the median (83).
bin count
2 – 5.55 19114
5.55 – 9.1 1190
9.1 – 12.65 57
12.65 – 16.2 8207
16.2 – 19.75 14204
19.75 – 23.3 0
23.3 – 26.85 0
26.85 – 30.4 10922
30.4 – 33.95 7449
33.95 – 37.5 506
37.5 – 41.05 10560
41.05 – 44.6 79576
44.6 – 48.15 3134
48.15 – 51.7 4
51.7 – 55.25 314
55.25 – 58.8 326
58.8 – 62.35 12019
62.35 – 65.9 19
65.9 – 69.45 425
69.45 – 73 1691
73 – 76.55 4917
76.55 – 80.1 2039
80.1 – 83.65 3998
83.65 – 87.2 18079
87.2 – 90.75 9881
90.75 – 94.3 13440
94.3 – 97.85 6
97.85 – 101.4 16569
101.4 – 104.9 9005
104.9 – 108.5 31406
108.5 – 112 6515
112 – 115.6 4230
115.6 – 119.1 2114
119.1 – 122.7 26140
122.7 – 126.2 2735
126.2 – 129.8 29427
129.8 – 133.3 4465
133.3 – 136.9 1
136.9 – 140.4 206
140.4 – 144 1032
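
Treating `rot-worker-id` as a categorical key, as the treatment recommends, amounts to replacing the raw integers with category codes and counting rows per worker instead of averaging ids. The sketch below uses a short made-up list of ids; the real column has 89 distinct values.

```python
# Hypothetical sample of worker ids (the real column has 89 distinct values).
worker_ids = [42, 105, 42, 2, 144, 105]

# Map each distinct id to a dense category code; the code order has no meaning.
categories = sorted(set(worker_ids))
code_of = {wid: code for code, wid in enumerate(categories)}
codes = [code_of[wid] for wid in worker_ids]

# Per-worker row counts are meaningful; the mean of the ids is not.
rows_per_worker = {wid: worker_ids.count(wid) for wid in categories}

print(codes)            # [1, 2, 1, 0, 3, 2]
print(rows_per_worker)  # {2: 1, 42: 2, 105: 2, 144: 1}
```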

breakdown-worker-id numeric foreign_key

This appears to be a worker identifier encoded as integers, with 117 distinct values between 0 and 146 (mean 71.75, median 71, skew 0.05, kurtosis -1.26). The moments resemble a uniform spread of ids, although rows per histogram bin range from 27 to 21,787. Despite being stored as numeric, the low cardinality relative to 355,922 rows suggests a categorical key rather than a measurement. No nulls and no outliers, but about 1.4% of rows carry worker id 0, which may be a sentinel.

Treatment: Cast to categorical and left-join to a worker dimension table; investigate whether id 0 is a placeholder.

anthropic:claude-opus-4-7 · confidence high
Out[80]:

saturn.columns["breakdown-worker-id"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 117
min 0
max 146
mean 71.75
median 71
std 42.17
q1 33
q3 106
iqr 73
skew 0.05124
kurtosis -1.255
n_outliers 0
outlier_rate 0
zero_rate 0.01435
Fig 31. Distribution of breakdown-worker-id. Vertical dash marks the median (71).
bin count
0 – 3.65 10977
3.65 – 7.3 6826
7.3 – 10.95 295
10.95 – 14.6 5408
14.6 – 18.25 17438
18.25 – 21.9 5661
21.9 – 25.55 14918
25.55 – 29.2 17405
29.2 – 32.85 8194
32.85 – 36.5 6075
36.5 – 40.15 14864
40.15 – 43.8 14057
43.8 – 47.45 4300
47.45 – 51.1 9634
51.1 – 54.75 9782
54.75 – 58.4 3497
58.4 – 62.05 18405
62.05 – 65.7 4440
65.7 – 69.35 2196
69.35 – 73 4907
73 – 76.65 8096
76.65 – 80.3 10266
80.3 – 83.95 766
83.95 – 87.6 6916
87.6 – 91.25 10729
91.25 – 94.9 14625
94.9 – 98.55 7752
98.55 – 102.2 14249
102.2 – 105.8 7135
105.8 – 109.5 21787
109.5 – 113.1 4903
113.1 – 116.8 27
116.8 – 120.5 6283
120.5 – 124.1 11729
124.1 – 127.8 8202
127.8 – 131.4 10912
131.4 – 135 8184
135 – 138.7 9522
138.7 – 142.3 5775
142.3 – 146 8785
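
To investigate whether id 0 in `breakdown-worker-id` is a placeholder, a left join against a worker table will surface ids with no matching record. The sketch below first converts the reported zero_rate into an approximate row count, then runs the join on made-up data; the worker dimension table and its contents are hypothetical, as none is included in this report.

```python
# Reported values from the breakdown-worker-id stats table.
n = 355_922
zero_rate = 0.01435  # rounded in the report, so the count is approximate
print(round(n * zero_rate))  # about 5107 rows carry id 0

# Hypothetical worker dimension table; id 0 is deliberately absent.
workers = {17: "worker-17", 108: "worker-108"}
row_worker_ids = [17, 0, 108, 0, 17]

# Left join: keep every row, attach None where no worker record exists.
joined = [(wid, workers.get(wid)) for wid in row_worker_ids]
unmatched = sorted({wid for wid, info in joined if info is None})

print(unmatched)  # [0]
```

If 0 turns out to be the only unmatched id, that supports reading it as a sentinel rather than a real worker.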

n-characters numeric feature

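A small-integer count column ranging 1-10 with only 8 unique values across 355,922 rows, mean 2.13 and median 2. The quartiles (q1 2, q3 3, IQR 1) and low std (0.78) show that most records have 2 or 3 characters, with a mild right tail (skew 0.44) producing 1,132 outliers (~0.32%). The count appears to be the number of entries in `characters`, not a text length: the 71,601 rows with value 1 match the 71,601 occurrences of the lone value 'narrator'.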

Treatment: Treat as a low-cardinality ordinal/discrete count; usable directly or one-hot encoded.

anthropic:claude-opus-4-7 · confidence high
Out[83]:

saturn.columns["n-characters"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 8
min 1
max 10
mean 2.128
median 2
std 0.779
q1 2
q3 3
iqr 1
skew 0.4378
kurtosis 0.4591
n_outliers 1,132
outlier_rate 0.00318
zero_rate 0
Fig 32. Distribution of n-characters. Vertical dash marks the median (2).
bin count
1 – 1.225 71601
1.225 – 1.45 0
1.45 – 1.675 0
1.675 – 1.9 0
1.9 – 2.125 181769
2.125 – 2.35 0
2.35 – 2.575 0
2.575 – 2.8 0
2.8 – 3.025 89195
3.025 – 3.25 0
3.25 – 3.475 0
3.475 – 3.7 0
3.7 – 3.925 0
3.925 – 4.15 12225
4.15 – 4.375 0
4.375 – 4.6 0
4.6 – 4.825 0
4.825 – 5.05 918
5.05 – 5.275 0
5.275 – 5.5 0
5.5 – 5.725 0
5.725 – 5.95 0
5.95 – 6.175 178
6.175 – 6.4 0
6.4 – 6.625 0
6.625 – 6.85 0
6.85 – 7.075 32
7.075 – 7.3 0
7.3 – 7.525 0
7.525 – 7.75 0
7.75 – 7.975 0
7.975 – 8.2 0
8.2 – 8.425 0
8.425 – 8.65 0
8.65 – 8.875 0
8.875 – 9.1 0
9.1 – 9.325 0
9.325 – 9.55 0
9.55 – 9.775 0
9.775 – 10 4
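
The outlier count for `n-characters` can be reproduced from the histogram if one assumes the common Tukey rule (values beyond 1.5 × IQR from the quartiles). Saturn's actual outlier rule is not stated in this report, so this is a consistency check under that assumption, using the non-empty bins as integer value counts.

```python
# Value counts read off the n-characters histogram (non-empty bins only).
counts = {1: 71_601, 2: 181_769, 3: 89_195, 4: 12_225,
          5: 918, 6: 178, 7: 32, 10: 4}

n = sum(counts.values())
mean = sum(value * c for value, c in counts.items()) / n

# Assumed outlier rule: Tukey fences at 1.5 * IQR beyond the quartiles.
q1, q3 = 2, 3
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 0.5 and 4.5
n_outliers = sum(c for value, c in counts.items() if value < low or value > high)

print(n, round(mean, 3), n_outliers)  # 355922 2.128 1132
```

Under this rule every row with five or more characters is an outlier, which matches the reported 1,132.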

characters text feature

This column lists the characters/speakers in each record, encoded as pipe-delimited role strings (e.g. 'narrator|He', 'narrator|my girlfriend'). It is extremely repetitive: 91.1% of 355,922 rows are duplicates, only 31,782 unique values exist, and 'narrator' alone appears 71,601 times. Half the entries are a single whitespace-delimited token (one_word_rate 0.50, word_median 1), so this behaves more like a repetitive categorical tag than free text despite the text kind.

Treatment: Treat as a categorical/multi-label field: split on '|' and one-hot encode the top roles rather than tokenizing as prose.

anthropic:claude-opus-4-7 · confidence high
Out[86]:

saturn.columns["characters"].stats

stat value
n 355,922
nulls 0 (0.0%)
unique 31,782
len_min 8
len_max 165
len_mean 18.84
len_median 17
len_p95 38
word_mean 1.907
word_median 1
n_empty 0
n_duplicates 324,140
duplicate_rate 0.9107
vocab_size 5,682
readability_flesch_mean -63.18
emoji_rate 0
url_rate 0
one_word_rate 0.5012
allcaps_rate 0
boilerplate_rate 0
alert: one_word 50.1% rows are a single word
alert: duplicates 91.1% duplicate strings
Fig 33. Character-length distribution for characters (mean: 18.84).
chars count
8 – 12 85583
12 – 16 69950
16 – 20 63497
20 – 24 49403
24 – 28 30408
28 – 32 20287
32 – 35 13108
35 – 39 8478
39 – 43 5267
43 – 47 3757
47 – 51 2325
51 – 55 1441
55 – 59 1008
59 – 63 349
63 – 67 425
67 – 71 144
71 – 75 124
75 – 79 94
79 – 83 62
83 – 86 51
86 – 90 34
90 – 94 34
94 – 98 26
98 – 102 12
102 – 106 14
106 – 110 7
110 – 114 4
114 – 118 8
118 – 122 3
122 – 126 3
126 – 130 0
130 – 134 2
134 – 138 2
138 – 141 0
141 – 145 0
145 – 149 10
149 – 153 0
153 – 157 0
157 – 161 0
161 – 165 2
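
The treatment for `characters` (split on '|' and one-hot encode the top roles) can be sketched as a multi-hot encoding over the k most frequent roles. The sample values reuse the two formats quoted above; the choice of k = 2 is arbitrary and only for illustration.

```python
from collections import Counter

# Sample values in the formats quoted in the report.
characters = ["narrator", "narrator|He", "narrator|my girlfriend", "narrator|He"]

# Split each pipe-delimited string into its roles.
role_lists = [value.split("|") for value in characters]

# Keep the k most frequent roles as indicator columns.
k = 2
role_counts = Counter(role for roles in role_lists for role in roles)
top_roles = [role for role, _ in role_counts.most_common(k)]

# Multi-hot encode: one 0/1 flag per top role for every row.
encoded = [[int(role in roles) for role in top_roles] for roles in role_lists]

print(top_roles)  # ['narrator', 'He']
print(encoded)    # [[1, 0], [1, 1], [1, 0], [1, 1]]
```

Roles outside the top k are simply dropped here; an extra "other" column would be one way to retain them.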

How to cite

BibTeX
@misc{saturn-quirky-social-norms-20260121-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky social norms 20260121},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-social_norms_20260121}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky social norms 20260121. Source: /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms_20260121.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-social_norms_20260121