quirky social norms 20260121
Reading
This is a social-norms annotation dataset of 355,922 rows and 25 columns. Each entry pairs a real-life 'situation' (drawn from Reddit confessions, AmItheAsshole, Dear Abby, and ROCStories) with an 'action', a rule-of-thumb ('rot'), and a battery of moral judgments by crowd workers. The most striking shape feature is heavy duplication in the text fields: 'rot-judgment' is 97% duplicated and 'characters' 91%, because both collapse to short, repetitive vocabularies, while 'situation' and 'rot' themselves repeat ~71% and ~27% of the time across annotators. Worth a closer look first: the moral-foundation distribution, where 'care-harm' alone is the top value (~39% of non-null rows, ~30% of all rows), and the 'action-legal' field, where 93% of non-null actions are tagged 'legal' (81% of all rows, since 12.8% are null). Both suggest class imbalance that will matter for any modeling. Also note that 'area' is reasonably balanced across the four source corpora, although 'dearabby' is about half the size of the others, while 'split' is heavily skewed toward 'train' (66%).
citing: rot-moral-foundations · action-legal · area · split · rot-judgment · characters · situation · rot · action-hypothetical · rot-categorization · action-agree · row_count · column_count
Charts the summary said to look at first
rot-moral-foundations: top 20 of 30 values (share is of all 355,922 rows, nulls included in the denominator)

| value | count | share |
|---|---|---|
| care-harm | 108535 | 30.5% |
| fairness-cheating | 37666 | 10.6% |
| loyalty-betrayal | 34581 | 9.7% |
| care-harm\|loyalty-betrayal | 21125 | 5.9% |
| authority-subversion | 19087 | 5.4% |
| sanctity-degradation | 14657 | 4.1% |
| care-harm\|fairness-cheating | 10787 | 3.0% |
| care-harm\|authority-subversion | 7958 | 2.2% |
| care-harm\|sanctity-degradation | 6328 | 1.8% |
| fairness-cheating\|loyalty-betrayal | 6222 | 1.7% |
| fairness-cheating\|authority-subversion | 3738 | 1.1% |
| loyalty-betrayal\|authority-subversion | 2122 | 0.6% |
| fairness-cheating\|sanctity-degradation | 1218 | 0.3% |
| authority-subversion\|sanctity-degradation | 1042 | 0.3% |
| loyalty-betrayal\|sanctity-degradation | 885 | 0.2% |
| care-harm\|fairness-cheating\|loyalty-betrayal | 834 | 0.2% |
| care-harm\|loyalty-betrayal\|authority-subversion | 623 | 0.2% |
| care-harm\|authority-subversion\|sanctity-degradation | 423 | 0.1% |
| care-harm\|fairness-cheating\|authority-subversion | 416 | 0.1% |
| care-harm\|loyalty-betrayal\|sanctity-degradation | 183 | 0.1% |
area: all 4 values

| value | count | share |
|---|---|---|
| confessions | 107749 | 30.3% |
| rocstories | 101791 | 28.6% |
| amitheasshole | 96082 | 27.0% |
| dearabby | 50300 | 14.1% |
action-legal: all 3 values (share is of all 355,922 rows; the remaining 12.8% are null)

| value | count | share |
|---|---|---|
| legal | 289316 | 81.3% |
| tolerated | 15064 | 4.2% |
| illegal | 5934 | 1.7% |
rot-categorization: all 15 values (share is of all 355,922 rows)

| value | count | share |
|---|---|---|
| advice | 82786 | 23.3% |
| social-norms | 72934 | 20.5% |
| morality-ethics | 58564 | 16.5% |
| description | 58537 | 16.4% |
| social-norms\|advice | 27657 | 7.8% |
| morality-ethics\|social-norms | 27118 | 7.6% |
| morality-ethics\|advice | 11498 | 3.2% |
| social-norms\|description | 6078 | 1.7% |
| advice\|description | 4785 | 1.3% |
| morality-ethics\|description | 2023 | 0.6% |
| morality-ethics\|social-norms\|advice | 790 | 0.2% |
| morality-ethics\|social-norms\|description | 55 | 0.0% |
| morality-ethics\|advice\|description | 26 | 0.0% |
| social-norms\|advice\|description | 24 | 0.0% |
| morality-ethics\|social-norms\|advice\|description | 1 | 0.0% |
action-agree: count per rating value (the source histogram used 40 bins of width 0.1; the 35 empty bins are omitted, and the counts sum to the 311,520 non-null rows of this column)

| value | count |
|---|---|
| 0 | 1172 |
| 1 | 5874 |
| 2 | 44800 |
| 3 | 168120 |
| 4 | 91554 |
Schema
25 columns

| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| area | categorical | 0.0% | 4 | |
| m | numeric | 0.0% | 4 | high_skew, outliers |
| split | categorical | 0.0% | 7 | |
| rot-agree | numeric | 0.3% | 5 | |
| rot-categorization | categorical | 0.9% | 15 | |
| rot-moral-foundations | categorical | 21.7% | 30 | null_rate |
| rot-char-targeting | categorical | 0.5% | 7 | |
| rot-bad | numeric | 0.0% | 2 | high_skew |
| rot-judgment | text | 0.0% | 10,589 | short_text, duplicates |
| action | text | 0.0% | 260,627 | multilingual, duplicates |
| action-agency | categorical | 1.8% | 2 | |
| action-moral-judgment | numeric | 12.7% | 5 | |
| action-agree | numeric | 12.5% | 5 | |
| action-legal | categorical | 12.8% | 3 | |
| action-pressure | numeric | 13.1% | 5 | |
| action-char-involved | categorical | 13.0% | 7 | |
| action-hypothetical | categorical | 19.0% | 5 | |
| situation | text | 0.0% | 103,296 | multilingual, duplicates |
| situation-short-id | text | 0.0% | 103,692 | one_word, duplicates |
| rot | text | 0.0% | 259,614 | duplicates |
| rot-id | text | 0.0% | 291,974 | one_word |
| rot-worker-id | numeric | 0.0% | 89 | |
| breakdown-worker-id | numeric | 0.0% | 117 | |
| n-characters | numeric | 0.0% | 8 | |
| characters | text | 0.0% | 31,782 | one_word, duplicates |
area
categorical label
This column tags each record with one of four source areas, with 'confessions' the modal value at 30.3% of 355,922 rows. The distribution is fairly balanced: an entropy_ratio of 0.97 indicates a near-uniform spread across the four categories, though 'dearabby' (50,300) is roughly half the size of each of the other three. No nulls and only 4 unique values make this a clean grouping key. Treatment: one-hot encode or use directly as a stratification/grouping variable.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 4
- top_value: confessions
- top_rate: 0.3027
- cardinality: 4
- entropy: 1.947
- entropy_ratio: 0.9737
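
The entropy figures for `area` can be reproduced from the four value counts in the chart table. The sketch below assumes the profiler uses Shannon entropy in bits and normalizes by log2 of the cardinality; that assumption is consistent with the reported 1.947 and 0.9737 but is not stated in the report. `entropy_bits` is an illustrative helper name.

```python
import math

# Value counts for `area`, copied from the chart table in this report.
area_counts = {
    "confessions": 107749,
    "rocstories": 101791,
    "amitheasshole": 96082,
    "dearabby": 50300,
}

def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

entropy = entropy_bits(list(area_counts.values()))
# Assumption: the ratio divides by the maximum possible entropy, log2(k).
entropy_ratio = entropy / math.log2(len(area_counts))
print(round(entropy, 3), round(entropy_ratio, 4))
```

A ratio near 1 means the categories are close to uniform; the gap from 1 here comes almost entirely from the smaller 'dearabby' share.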
m
numeric feature · high_skew · outliers
Column 'm' is a numeric feature with only 4 distinct values across 355,922 rows, ranging from 1 to 50 with a median of 1 and both Q1 and Q3 equal to 1. The distribution is severely right-skewed (skew 3.77, kurtosis 12.45). Because Q1 = Q3 = 1 gives an IQR of 0, every row with a value above 1 falls outside the IQR fence, which is why 18.5% of rows are flagged as outliers even though the mean is just 4.24. The tiny cardinality combined with the wide range suggests a categorical-like multiplier or count where most records sit at 1 and a few jump to much larger values. Treatment: treat as low-cardinality categorical or bin the rare large values before modelling.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 4
- min: 1
- max: 50
- mean: 4.237
- median: 1
- std: 11.24
- q1: 1
- q3: 1
- iqr: 0
- skew: 3.774
- kurtosis: 12.45
- n_outliers: 65,947
- outlier_rate: 0.1853
- zero_rate: 0
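
To see why a zero IQR turns every non-modal value of `m` into an "outlier", here is a minimal sketch of the fence arithmetic. It assumes the profiler uses the common 1.5 × IQR rule, which the report does not confirm for this column; the sample values are synthetic and the function names are illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3, k=1.5):
    low, high = tukey_fences(q1, q3, k)
    return value < low or value > high

# Quartiles reported for `m`: Q1 = Q3 = 1, so both fences collapse onto 1.
low, high = tukey_fences(q1=1, q3=1)

# Synthetic sample: mostly 1, with one large value.
sample = [1, 1, 1, 1, 50]
flags = [is_outlier(v, q1=1, q3=1) for v in sample]
print(low, high, flags)
```

Under this rule the outlier flag on `m` carries no more information than `m > 1`, so it is better read as a marker of the minority levels than as evidence of anomalous rows.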
split
categorical metadata
This column holds the dataset split assignment across 355,922 rows, with 7 distinct values and no nulls. 'train' dominates at 65.6% (233,501 rows), followed by near-equal 'test' (29,239) and 'dev' (29,234), plus auxiliary 'test-extra', 'analysis', and 'dev-extra' partitions; a 'none' bucket of 3,913 rows is unusual and likely indicates unassigned examples. Treatment: use as a row filter to separate train/dev/test; investigate or exclude the 'none' rows before modelling.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 7
- top_value: train
- top_rate: 0.656
- cardinality: 7
- entropy: 1.763
- entropy_ratio: 0.6281
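
One way to apply the row-filter treatment is to route each record by its `split` value and keep everything outside train/dev/test in a separate list for inspection. This is a sketch over plain dictionaries with synthetic rows; `partition_by_split` and `STANDARD_SPLITS` are illustrative names, and treating the auxiliary partitions the same way as 'none' is an assumption, not something the report prescribes.

```python
STANDARD_SPLITS = ("train", "dev", "test")

def partition_by_split(rows, split_key="split"):
    """Group rows by standard split; all other rows are returned separately."""
    parts = {name: [] for name in STANDARD_SPLITS}
    held_out = []
    for row in rows:
        name = row.get(split_key)
        if name in parts:
            parts[name].append(row)
        else:
            # 'none', 'analysis', 'dev-extra', 'test-extra', or a missing key.
            held_out.append(row)
    return parts, held_out

# Synthetic rows for illustration only.
rows = [
    {"rot": "a", "split": "train"},
    {"rot": "b", "split": "dev"},
    {"rot": "c", "split": "none"},
    {"rot": "d", "split": "test-extra"},
]
parts, held_out = partition_by_split(rows)
print({k: len(v) for k, v in parts.items()}, len(held_out))
```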
rot-agree
numeric feature
This is a 5-value ordinal score (0 to 4) recording an agreement rating attached to the rule of thumb ('rot'), with a mean of 3.10 and median 3.0, so answers cluster at the high end. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4, and 2.4% of rows fall outside the IQR fence as low-end outliers. Nulls are negligible (0.35%) and zeros are rare (0.38%). Treatment: treat as an ordinal Likert feature; keep as-is or one-hot encode rather than log-transform.
- n: 355,922
- nulls: 1,236 (0.3%)
- unique: 5
- min: 0
- max: 4
- mean: 3.1
- median: 3
- std: 0.7437
- q1: 3
- q3: 4
- iqr: 1
- skew: -0.6805
- kurtosis: 0.7872
- n_outliers: 8,547
- outlier_rate: 0.0241
- zero_rate: 0.003786
rot-categorization
categorical label
Categorical tag describing the type of 'rule of thumb' (RoT), with 15 distinct values drawn from a small base vocabulary (advice, social-norms, morality-ethics, description) plus pipe-delimited combinations. The distribution is moderately balanced (entropy ratio 0.72); 'advice' leads at 23.5% of non-null rows (23.3% of all rows) and the four single-tag values dominate, while compound tags like 'social-norms|advice' (27,657) show a multi-label encoding packed into one string. The null rate is low at 0.86%. Treatment: split on '|' and one-hot encode the four base categories for multi-label modelling.
- n: 355,922
- nulls: 3,046 (0.9%)
- unique: 15
- top_value: advice
- top_rate: 0.2346
- cardinality: 15
- entropy: 2.806
- entropy_ratio: 0.7181
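
The split-on-'|' treatment can be sketched in a few lines. The four base tags come from this report; `multi_hot` is an illustrative helper, and mapping a null value to all zeros is an assumption (a separate missing indicator may be preferable).

```python
BASE_TAGS = ("advice", "social-norms", "morality-ethics", "description")

def multi_hot(value, tags=BASE_TAGS, sep="|"):
    """Turn a pipe-delimited tag string into one 0/1 indicator per base tag."""
    present = set(value.split(sep)) if value else set()
    return {tag: int(tag in present) for tag in tags}

print(multi_hot("social-norms|advice"))
print(multi_hot(None))  # null row: every indicator is 0 under this sketch
```

If the data is held in a pandas DataFrame, `Series.str.get_dummies(sep="|")` performs a comparable split and produces one indicator column per tag.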
rot-moral-foundations
categorical label · null_rate
This column tags each row with one or more of Haidt's moral foundations (care-harm, fairness-cheating, loyalty-betrayal, authority-subversion, sanctity-degradation), with multi-label combinations encoded as pipe-delimited strings, yielding 30 distinct values. The distribution is heavily skewed: 'care-harm' on its own covers 38.9% of non-null rows (30.5% of all rows), and 21.7% of rows are null. An entropy ratio of 0.60 confirms that the long tail collapses quickly after the top few categories. Treatment: split on '|' and binarize into five multi-hot foundation indicators; treat nulls as an explicit missing flag.
- n: 355,922
- nulls: 77,120 (21.7%)
- unique: 30
- top_value: care-harm
- top_rate: 0.3893
- cardinality: 30
- entropy: 2.962
- entropy_ratio: 0.6037
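
Because the values are multi-label, the share of the single string 'care-harm' understates how often that foundation is tagged. The sketch below sums counts per foundation over the 20 values listed in the chart table; since the column has 30 distinct values and only 20 are listed, the totals are lower bounds. `top20` is copied from that table and `marginal` is an illustrative name.

```python
from collections import Counter

# Top 20 of 30 values for rot-moral-foundations, copied from the chart table.
top20 = {
    "care-harm": 108535,
    "fairness-cheating": 37666,
    "loyalty-betrayal": 34581,
    "care-harm|loyalty-betrayal": 21125,
    "authority-subversion": 19087,
    "sanctity-degradation": 14657,
    "care-harm|fairness-cheating": 10787,
    "care-harm|authority-subversion": 7958,
    "care-harm|sanctity-degradation": 6328,
    "fairness-cheating|loyalty-betrayal": 6222,
    "fairness-cheating|authority-subversion": 3738,
    "loyalty-betrayal|authority-subversion": 2122,
    "fairness-cheating|sanctity-degradation": 1218,
    "authority-subversion|sanctity-degradation": 1042,
    "loyalty-betrayal|sanctity-degradation": 885,
    "care-harm|fairness-cheating|loyalty-betrayal": 834,
    "care-harm|loyalty-betrayal|authority-subversion": 623,
    "care-harm|authority-subversion|sanctity-degradation": 423,
    "care-harm|fairness-cheating|authority-subversion": 416,
    "care-harm|loyalty-betrayal|sanctity-degradation": 183,
}

# Credit each row count to every foundation named in the combined value.
marginal = Counter()
for value, count in top20.items():
    for foundation in value.split("|"):
        marginal[foundation] += count

n_non_null = 355922 - 77120  # rows minus nulls, from the stats for this column
for foundation, count in marginal.most_common():
    print(f"{foundation}: at least {count} rows ({count / n_non_null:.1%} of non-null)")
```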
rot-char-targeting
categorical feature
Categorical tag identifying which character slot the rule of thumb ('rot') targets, with 7 distinct values dominated by 'char-0' (51.9%) and 'char-1' (123,396 rows). The distribution is sharply long-tailed: 'char-4' and 'char-5' appear only 46 and 15 times respectively, and 23,781 rows are explicitly 'char-none'. The entropy ratio is 0.557, confirming that most mass sits in the first two categories. The null rate is low at 0.46%. Treatment: one-hot encode and consider collapsing rare 'char-3/4/5' levels into an 'other' bucket.
- n: 355,922
- nulls: 1,622 (0.5%)
- unique: 7
- top_value: char-0
- top_rate: 0.5192
- cardinality: 7
- entropy: 1.564
- entropy_ratio: 0.5571
rot-bad
numeric label · high_skew
Binary 0/1 flag (n_unique=2, min=0, max=1) indicating a rare 'rot-bad' condition. Positives occur at 2.0% (mean=0.0201, zero_rate=0.9799), producing the flagged high skew (6.84) and heavy kurtosis (44.78). The 7,153 'outliers' are simply the positive class, not anomalies. Treatment: treat as a binary target; address class imbalance with stratified sampling or class weights rather than outlier removal.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 2
- min: 0
- max: 1
- mean: 0.0201
- median: 0
- std: 0.1403
- q1: 0
- q3: 0
- iqr: 0
- skew: 6.84
- kurtosis: 44.78
- n_outliers: 7,153
- outlier_rate: 0.0201
- zero_rate: 0.9799
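
As a rough illustration of the class-weight option, the sketch below applies the inverse-frequency heuristic n_samples / (n_classes * class_count), the same formula scikit-learn documents for `class_weight="balanced"`. The counts are derived from the stats above (7,153 positives out of 355,922 rows); `balanced_class_weights` is an illustrative helper, not part of the report.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Counts derived from the rot-bad stats: 7,153 positives among 355,922 rows.
rot_bad_counts = {0: 355922 - 7153, 1: 7153}
weights = balanced_class_weights(rot_bad_counts)
print(weights)
```

With these weights each class contributes the same total weight to a loss function, so the rare positive class is not drowned out by the roughly 98% of negative rows.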
rot-judgment
text label · short_text · duplicates
Short moral-judgment phrases (mean 2 words, median 9 chars, max 94 chars) drawn from a tight vocabulary of 797 tokens, dominated by verdicts like "It's good", "shouldn't", "It's okay", and "It's wrong". With a 97.0% duplicate rate and only 10,589 uniques across 355,922 rows, this behaves as a categorical label rather than free text. Casing is inconsistent (e.g. "It's good" and "it's good" appear as separate top values), which will inflate cardinality unless normalized. Treatment: lowercase and strip punctuation, then treat as a categorical target rather than free text.
- n: 355,922
- nulls: 1 (0.0%)
- unique: 10,589
- len_min: 1
- len_max: 94
- len_mean: 10.46
- len_median: 9
- len_p95: 19
- word_mean: 2.001
- word_median: 2
- n_empty: 0
- n_duplicates: 345,332
- duplicate_rate: 0.9702
- vocab_size: 797
- readability_flesch_mean: 83.27
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.2083
- allcaps_rate: 5.619e-06
- boilerplate_rate: 0
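
A minimal sketch of the normalization step follows. It lowercases, trims surrounding punctuation and whitespace, and collapses internal runs of spaces; keeping internal apostrophes (so "it's" stays "it's") is a design choice made here, not something the report specifies. The input strings are synthetic variants of the verdicts quoted above, and `normalize_judgment` is an illustrative name.

```python
import re
import string
from collections import Counter

def normalize_judgment(text):
    """Lowercase, trim edge punctuation/whitespace, collapse inner whitespace."""
    text = text.strip().lower()
    text = text.strip(string.punctuation + string.whitespace)
    return re.sub(r"\s+", " ", text)

# Synthetic variants for illustration only.
raw = ["It's good", "it's good", "It's good.", "shouldn't", "It's  okay"]
labels = Counter(normalize_judgment(t) for t in raw)
print(labels)
```

Counting unique values before and after normalization gives a quick measure of how much of the 10,589 cardinality is due to surface variation alone.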
action
text free_text · multilingual · duplicates
Short English phrases describing an action or behaviour (mean 41.8 chars, median 7 words), e.g. 'being yourself.' or 'cheating on your partner.', likely the subject of a moral-judgment prompt. Roughly 27% of the 355,922 rows are duplicates (95,292), with the top phrase repeating 461 times, so the same actions recur heavily across records. Language detection raised the multilingual flag, but only 2 non-English rows (1 la, 1 no) were found out of ~5k sampled, so the column is effectively monolingual. Treatment: normalize casing/punctuation and tokenize or embed before modelling; expect heavy phrase repetition.
- n: 355,922
- nulls: 3 (0.0%)
- unique: 260,627
- len_min: 1
- len_max: 221
- len_mean: 41.76
- len_median: 40
- len_p95: 73
- word_mean: 6.969
- word_median: 7
- n_empty: 0
- n_duplicates: 95,292
- duplicate_rate: 0.2677
- vocab_size: 9,430
- readability_flesch_mean: 57.88
- emoji_rate: 0
- url_rate: 2.81e-06
- one_word_rate: 0.006013
- allcaps_rate: 0
- boilerplate_rate: 2.81e-06
action-agency
categorical feature
Binary categorical flag with only two levels, 'agency' (88.5% of non-null rows) and 'experience' (the remainder). The level names suggest a distinction between actions a character performs and events a character experiences, but the profile itself does not define them. Class imbalance is heavy and 1.84% of rows are null, so any modelling needs to account for both. The entropy ratio of 0.515 reflects the dominance of the majority class. Treatment: encode as a binary indicator and impute or flag the ~1.8% nulls.
- n: 355,922
- nulls: 6,551 (1.8%)
- unique: 2
- top_value: agency
- top_rate: 0.8849
- cardinality: 2
- entropy: 0.5152
- entropy_ratio: 0.5152
action-moral-judgment
numeric label
A discrete moral-judgment rating on a 5-point scale from -2 to 2 (likely Likert-style: very wrong to very right). The mean of -0.178 and median of 0 indicate a slight lean toward negative judgments, with 43.7% of values exactly zero and 12.7% missing. The distribution is nearly symmetric (skew -0.011) and platykurtic, so most ratings cluster near neutral with modest spread (std 0.857). Treatment: treat as an ordinal label; impute or filter the 12.7% nulls before modelling.
- n: 355,922
- nulls: 45,215 (12.7%)
- unique: 5
- min: -2
- max: 2
- mean: -0.1785
- median: 0
- std: 0.8565
- q1: -1
- q3: 0
- iqr: 1
- skew: -0.01137
- kurtosis: -0.3466
- n_outliers: 4,592
- outlier_rate: 0.01478
- zero_rate: 0.4367
action-agree
numeric feature
This is a 5-level ordinal feature (values 0-4), almost certainly a Likert-style agreement rating for an 'action' item, with mean 3.10 and median 3.0 indicating a tilt toward agreement. The distribution is left-skewed (skew -0.68) with Q1=3 and Q3=4, so most respondents pick 3 or 4; only 0.38% give zero. Note the 12.48% null rate, which is substantial and likely reflects non-response or skipped items. Treatment: treat as ordinal (0-4); impute or flag the ~12% missing values before modelling.
- n: 355,922
- nulls: 44,402 (12.5%)
- unique: 5
- min: 0
- max: 4
- mean: 3.101
- median: 3
- std: 0.7326
- q1: 3
- q3: 4
- iqr: 1
- skew: -0.6768
- kurtosis: 0.8829
- n_outliers: 7,046
- outlier_rate: 0.02262
- zero_rate: 0.003762
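
The 0-to-4 histogram in the charts section lines up with this column: its five non-empty bins sum to 311,520, which equals 355,922 rows minus the 44,402 nulls. The sketch below recomputes the mean and the outlier count from those bin counts. The 1.5 × IQR fence is an assumption about the profiler's rule, and Q1 and Q3 are taken from the stats above rather than recomputed.

```python
# Non-empty histogram bins, read as counts per rating value.
agree_counts = {0: 1172, 1: 5874, 2: 44800, 3: 168120, 4: 91554}

n_non_null = sum(agree_counts.values())
mean = sum(value * count for value, count in agree_counts.items()) / n_non_null

# Assumed 1.5 * IQR fences, using the reported quartiles.
q1, q3 = 3, 4
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
n_outliers = sum(count for value, count in agree_counts.items()
                 if value < low or value > high)

print(n_non_null, round(mean, 3), n_outliers)
```

Under that assumption the flagged outliers are exactly the rows rated 0 or 1, so they are legitimate low ratings rather than data errors.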
action-legal
categorical feature
This is a categorical legal-status flag with only 3 distinct values ('legal', 'tolerated', 'illegal') across 355,922 rows. The distribution is severely imbalanced: 'legal' accounts for 93.2% of non-null records (81.3% of all rows) while 'illegal' represents just 5,934 rows, and the entropy ratio is only 0.26. Note also that 12.81% of rows are null, so the absence of a status may itself carry signal and is worth encoding explicitly. Treatment: one-hot encode with an explicit 'missing' category; consider class-weighting due to the severe imbalance.
- n: 355,922
- nulls: 45,608 (12.8%)
- unique: 3
- top_value: legal
- top_rate: 0.9323
- cardinality: 3
- entropy: 0.4153
- entropy_ratio: 0.262
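
Two different shares for 'legal' appear in this report (0.9323 as top_rate, 81.3% in the chart table). The sketch below shows that they differ only in the denominator: top_rate appears to exclude nulls, while the chart share uses all rows. The counts are copied from the chart table and the stats above; the variable names are illustrative.

```python
n_rows = 355922
n_nulls = 45608
legal_counts = {"legal": 289316, "tolerated": 15064, "illegal": 5934}

n_non_null = n_rows - n_nulls
# Sanity check: the three levels account for every non-null row.
assert sum(legal_counts.values()) == n_non_null

share_all_rows = {k: c / n_rows for k, c in legal_counts.items()}
share_non_null = {k: c / n_non_null for k, c in legal_counts.items()}

for level in legal_counts:
    print(f"{level}: {share_all_rows[level]:.1%} of all rows, "
          f"{share_non_null[level]:.1%} of non-null rows")
```

When comparing columns with different null rates, quoting both shares (or stating the denominator) avoids overstating how dominant the top value is.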
action-pressure
numeric feature
A discrete ordinal feature taking only 5 values across the symmetric range -2 to 2, almost certainly a Likert-style pressure rating. The distribution is balanced (mean -0.04, skew -0.05) and centered on 0, which accounts for 35.3% of rows. Notable: 13.08% of rows are null, and despite being numeric there are just 5 unique values. Treatment: treat as ordinal categorical; impute or flag the 13% missing before modelling.
- n: 355,922
- nulls: 46,537 (13.1%)
- unique: 5
- min: -2
- max: 2
- mean: -0.03992
- median: 0
- std: 1.114
- q1: -1
- q3: 1
- iqr: 2
- skew: -0.04985
- kurtosis: -0.6595
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.3534
action-char-involved
categorical feature
Categorical pointer to which character slot is involved in an action, with 7 distinct values dominated by char-0 (51.5%) and char-1 (~31%). The long tail collapses fast: char-3 through char-5 together account for fewer than 1,400 rows, 13.05% of rows are null, and another 22,100 are explicitly tagged 'char-none'. The entropy ratio of 0.56 confirms the heavy concentration on the first two slots. Treatment: treat as a low-cardinality categorical; collapse char-3/4/5 into an 'other' bucket and decide whether null and char-none should be merged.
- n: 355,922
- nulls: 46,441 (13.0%)
- unique: 7
- top_value: char-0
- top_rate: 0.515
- cardinality: 7
- entropy: 1.578
- entropy_ratio: 0.562
action-hypothetical
categorical label
A 5-level categorical label indicating whether an action is stated explicitly or only hypothetically/probably, with negative variants ('explicit-no', 'probable-no'). 'explicit' dominates at 47.1% of non-null rows, but 19.04% of values are null, which is substantial. The entropy ratio of 0.84 indicates that the remaining classes are reasonably spread rather than collapsed onto a single mode. Treatment: one-hot encode the five levels and add an explicit missing indicator for the 19% nulls.
- n: 355,922
- nulls: 67,776 (19.0%)
- unique: 5
- top_value: explicit
- top_rate: 0.4711
- cardinality: 5
- entropy: 1.956
- entropy_ratio: 0.8423
situation
text free_text · multilingual · duplicates
Short first-person descriptions of personal situations or dilemmas, averaging 55 characters / 10 words and topping out at 300 chars, with high readability (Flesch 78.5). Massive duplication is the headline issue: 252,626 of 355,922 rows are duplicates (71%), leaving only 103,296 uniques, and several distinct strings repeat exactly 130 times. That pattern is consistent with each situation being annotated many times, or with templated or oversampled records. Language detection on a 5,000-row sample is overwhelmingly English (4,989) with a tiny multilingual tail (6 de, 1 each es/it/nl/pt/ru) that is unlikely to matter at scale. Treatment: deduplicate before modelling, then tokenize and embed; the 130-repeat pattern warrants checking the upstream sampling.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 103,296
- len_min: 10
- len_max: 300
- len_mean: 55.42
- len_median: 52
- len_p95: 100
- word_mean: 10.52
- word_median: 10
- n_empty: 0
- n_duplicates: 252,626
- duplicate_rate: 0.7098
- vocab_size: 16,855
- readability_flesch_mean: 78.53
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0.0005282
- boilerplate_rate: 0.0001292
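
A simple way to deduplicate while keeping the repetition information is to collapse identical strings into (text, count) pairs, so the count can later serve as a sample weight. The sketch below uses synthetic strings; `dedupe_with_counts` is an illustrative name. The last lines check that the reported duplicate_rate matches 1 - unique / n for the figures above.

```python
from collections import Counter

def dedupe_with_counts(texts):
    """Collapse exact duplicates into (text, count) pairs, first-seen order."""
    return list(Counter(texts).items())

# Synthetic situations for illustration only.
situations = [
    "I forgot my friend's birthday",
    "I lied to my boss",
    "I forgot my friend's birthday",
    "I forgot my friend's birthday",
]
unique_situations = dedupe_with_counts(situations)
print(unique_situations)

# Reported figures for `situation`: 103,296 uniques among 355,922 rows.
duplicate_rate = 1 - 103296 / 355922
print(round(duplicate_rate, 4))
```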
situation-short-id
text foreign_key · one_word · duplicates
Slash-delimited source identifiers pointing back to situations on Reddit (confessions, amitheasshole), Dear Abby columns, and ROCStories sentences. Despite the 'id' name, only 103,692 values are unique across 355,922 rows, a 70.9% duplicate rate, and the most repeated key appears 180 times, so each situation evidently surfaces in many rows. Every value is a single token (one_word_rate 1.0) up to 99 chars long. Treatment: use as a join key to the situation source table; do not treat as a row-level primary key.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 103,692
- len_min: 24
- len_max: 99
- len_mean: 41.29
- len_median: 27
- len_p95: 74
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 252,230
- duplicate_rate: 0.7087
- vocab_size: 16,878
- readability_flesch_mean: -443.9
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
rot
text free_text · duplicates
Short English moral/normative statements (mean 10.1 words, max 221 chars), almost certainly rule-of-thumb (RoT) annotations describing what one should or shouldn't do. The vocabulary is small (9,659 types) and readability is high (Flesch 79.4), consistent with simple declarative templates like 'It is good/okay to...' and 'You shouldn't...'. Notable: 27.1% of rows are duplicates (96,308), with single phrases like 'It is good to be yourself.' repeating 287 times, indicating heavy template reuse rather than free generation. Treatment: deduplicate or weight by frequency before tokenizing and embedding for modelling.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 259,614
- len_min: 6
- len_max: 221
- len_mean: 54.66
- len_median: 53
- len_p95: 87
- word_mean: 10.1
- word_median: 10
- n_empty: 0
- n_duplicates: 96,308
- duplicate_rate: 0.2706
- vocab_size: 9,659
- readability_flesch_mean: 79.38
- emoji_rate: 0
- url_rate: 2.81e-06
- one_word_rate: 5.619e-06
- allcaps_rate: 0
- boilerplate_rate: 5.619e-06
rot-id
text identifier · one_word
Structured path-like identifiers for 'rule of thumb' records, sourced from Reddit (amitheasshole, confessions), Dear Abby columns, and ROCStories. Despite being IDs, only 291,974 of 355,922 rows are unique (duplicate_rate 0.18, with top values repeating up to 58 times), so this is not a primary key. Every value is a single token (one_word_rate 1.0, word_mean 1.0) with a length of 63-140 characters, which triggered the one_word alert but is expected for slash-delimited identifiers. Treatment: treat as a composite identifier; split on '/' to extract source/subreddit/post fields rather than embedding the raw string.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 291,974
- len_min: 63
- len_max: 140
- len_mean: 81.61
- len_median: 68
- len_p95: 114
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 63,948
- duplicate_rate: 0.1797
- vocab_size: 18,736
- readability_flesch_mean: -755.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
rot-worker-id
numeric foreign_key
Despite being typed numeric, `rot-worker-id` looks like a categorical worker identifier: only 89 unique values across 355,922 rows, no nulls, and no zeros. The distribution is broad and nearly symmetric (skew -0.08, kurtosis -1.37), spanning 2 to 144 with median 83, consistent with arbitrary id codes rather than a measured quantity. No outliers were flagged. Treatment: treat as a categorical worker key; encode or join rather than aggregating numerically.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 89
- min: 2
- max: 144
- mean: 72.54
- median: 83
- std: 39.61
- q1: 42
- q3: 105
- iqr: 63
- skew: -0.0804
- kurtosis: -1.366
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
breakdown-worker-id
numeric foreign_key
This appears to be a worker identifier encoded as integers, with 117 distinct values spread roughly uniformly between 0 and 146 (mean 71.75, median 71, skew 0.05, kurtosis -1.26). Despite being stored as numeric, the low cardinality relative to 355,922 rows and the near-flat distribution suggest a categorical key rather than a measurement. There are no nulls and no outliers, but about 1.4% of rows carry worker id 0, which may be a sentinel. Treatment: cast to categorical and left-join to a worker dimension table; investigate whether id 0 is a placeholder.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 117
- min: 0
- max: 146
- mean: 71.75
- median: 71
- std: 42.17
- q1: 33
- q3: 106
- iqr: 73
- skew: 0.05124
- kurtosis: -1.255
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.01435
n-characters
numeric feature
A small-integer count column ranging 1-10 with only 8 unique values across 355,922 rows, mean 2.13 and median 2. Given the name and the companion 'characters' column, it presumably counts the people involved in a record rather than text length. The tight IQR (2-3) and low std (0.78) suggest most records involve 2-3 characters, with a mild right tail (skew 0.44) producing 1,132 outliers (~0.32%). Treatment: treat as a low-cardinality ordinal/discrete count; usable directly or one-hot encoded.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 8
- min: 1
- max: 10
- mean: 2.128
- median: 2
- std: 0.779
- q1: 2
- q3: 3
- iqr: 1
- skew: 0.4378
- kurtosis: 0.4591
- n_outliers: 1,132
- outlier_rate: 0.00318
- zero_rate: 0
characters
text feature · one_word · duplicates
This column lists the characters/speakers in each record, encoded as pipe-delimited role strings (e.g. 'narrator|He', 'narrator|my girlfriend'). It is extremely repetitive: 91.1% of 355,922 rows are duplicates, only 31,782 unique values exist, and 'narrator' alone appears 71,601 times. Half the entries are a single token (one_word_rate 0.50, word_median 1), so this behaves more like a low-cardinality categorical tag than free text despite the text kind. Treatment: treat as a categorical/multi-label field: split on '|' and one-hot encode the top roles rather than tokenizing as prose.
- n: 355,922
- nulls: 0 (0.0%)
- unique: 31,782
- len_min: 8
- len_max: 165
- len_mean: 18.84
- len_median: 17
- len_p95: 38
- word_mean: 1.907
- word_median: 1
- n_empty: 0
- n_duplicates: 324,140
- duplicate_rate: 0.9107
- vocab_size: 5,682
- readability_flesch_mean: -63.18
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.5012
- allcaps_rate: 0
- boilerplate_rate: 0
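
Since the role vocabulary is open-ended, one-hot encoding every role is impractical; a common compromise is to keep indicators for the most frequent roles and fold the rest into a single 'other' flag. The sketch below uses a synthetic sample built from the example strings quoted above; `role_counts`, `encode_roles`, and the choice of keeping only the top role are illustrative assumptions.

```python
from collections import Counter

def role_counts(values, sep="|"):
    """Count in how many rows each role appears."""
    counts = Counter()
    for value in values:
        counts.update(set(value.split(sep)))
    return counts

def encode_roles(value, top_roles, sep="|"):
    """Indicators for the kept roles, plus an 'other' flag for anything else."""
    roles = set(value.split(sep))
    row = {role: int(role in roles) for role in top_roles}
    row["other"] = int(any(role not in top_roles for role in roles))
    return row

# Synthetic sample for illustration only.
sample = ["narrator", "narrator|He", "narrator|my girlfriend", "narrator"]
counts = role_counts(sample)
top_roles = [role for role, _ in counts.most_common(1)]

encoded = [encode_roles(value, top_roles) for value in sample]
print(top_roles, encoded)
```

On the real column the cutoff would be chosen from the actual role frequencies, for example by keeping roles above a minimum row count.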