saturn

quirky social norms

source: /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms.parquet · 12,383 rows · 6 columns · profiled 2026-05-01

Reading

dataset summary · high confidence · anthropic:claude-opus-4-7

This dataset contains 12,383 multiple-choice questions tagged by subject, grade, and skill, likely from an educational platform. Content is heavily skewed toward language arts (10,068 rows, 81.3%) over social studies (2,315 rows, 18.7%), and grade-5 is the largest grade bucket at 2,537 rows (20.5%). Question text has a 24.3% duplicate rate (3,008 repeated rows), so deduplication is worth considering before any modelling. Answer indices range from 0 to 3 but concentrate at 0 and 1 (43% are 0, and the median and third quartile are both 1), suggesting possible position bias in the correct-answer distribution. Skill coverage is broad: 402 distinct skills, with the top skill covering only 1.9% of rows.

citing: row_count · columns.subject.top_values · columns.grade.top_values · columns.question.stats.duplicate_rate · columns.question.stats.n_duplicates · columns.answer_idx.stats.zero_rate · columns.answer_idx.n_unique · columns.skill.n_unique · columns.skill.stats.top_rate

Schema

6 columns
Per-column summary.

column      type         nulls  unique  alerts
subject     categorical  0.0%   2
grade       categorical  0.0%   14
skill       categorical  0.0%   402
question    text         0.0%   9,375   duplicates
choices     unknown      0.0%           skipped
answer_idx  numeric      0.0%   4

subject

categorical label
Two-valued subject label: 'language arts' (10,068 rows, 81.3%) and 'social studies' (2,315 rows, 18.7%). No nulls across 12,383 rows. The classes are imbalanced at roughly 4.3:1 in favor of language arts, which will skew any per-subject aggregation or modelling.
Treatment: One-hot or binary encode; account for the roughly 4.3:1 imbalance when stratifying or modelling.
high confidence · anthropic:claude-opus-4-7
n: 12,383
nulls: 0 (0.0%)
unique: 2
top_value: language arts
top_rate: 0.8131
cardinality: 2
entropy: 0.695
entropy_ratio: 0.695
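
The entropy and imbalance figures for this column can be checked by hand from the two row counts. The sketch below assumes the profiler reports Shannon entropy in bits (base-2 logarithm); the report does not state the base, but base 2 is the one that reproduces a value close to 0.695. The function name `shannon_entropy_bits` is illustrative only.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Row counts quoted in the subject reading above.
subject_counts = {"language arts": 10068, "social studies": 2315}

entropy = shannon_entropy_bits(subject_counts.values())

# With two classes the maximum entropy is log2(2) = 1 bit, so the
# normalised ratio is numerically the same as the entropy itself.
entropy_ratio = entropy / math.log2(len(subject_counts))

# Majority-to-minority ratio behind the "roughly 4.3:1" remark.
imbalance = subject_counts["language arts"] / subject_counts["social studies"]

print(f"entropy={entropy:.3f} ratio={entropy_ratio:.3f} imbalance={imbalance:.2f}:1")
```

Because the column has only two values, `entropy` and `entropy_ratio` coincide, which explains why the report lists the same number twice.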

grade

categorical feature
Categorical grade level with 14 unique values across 12,383 rows and no nulls. The distribution is fairly even (entropy ratio 0.92), though grade-5 leads at 20.5%. The top 10 values shown span grade-2 through grade-12, which suggests a K-12 scheme with a few additional buckets not displayed.
Treatment: Encode as ordinal (extract the numeric grade level) before modelling.
high confidence · anthropic:claude-opus-4-7
n: 12,383
nulls: 0 (0.0%)
unique: 14
top_value: grade-5
top_rate: 0.2049
cardinality: 14
entropy: 3.513
entropy_ratio: 0.9227
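
The ordinal treatment suggested for this column could look like the following minimal sketch. It assumes every numeric label follows the `grade-N` pattern seen in `grade-5`; the labels of the remaining buckets are not shown in this report, so `kindergarten` below is a hypothetical example of a non-matching value, and the helper name `grade_to_ordinal` is illustrative.

```python
import re

# Matches labels such as "grade-5" or "grade-12" (pattern assumed from top_value).
_GRADE_PATTERN = re.compile(r"^grade-(\d+)$")

def grade_to_ordinal(label):
    """Return the integer grade for labels like 'grade-5', else None."""
    match = _GRADE_PATTERN.match(label.strip().lower())
    return int(match.group(1)) if match else None

# "kindergarten" is a hypothetical label, used only to show the fallback.
for label in ["grade-2", "grade-5", "grade-12", "kindergarten"]:
    print(label, "->", grade_to_ordinal(label))
```

Returning `None` for unmatched labels keeps the unknown buckets visible, so they can be mapped explicitly once the actual values have been inspected rather than being silently coerced.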

skill

categorical label
Slug-style identifiers for educational skills (e.g., "understand-overall-supply-and-demand", "choose-between-adjectives-and-adverbs"), spanning topics from economics to grammar and vocabulary. With 402 unique values across 12,383 rows and an entropy ratio of 0.919, the distribution is nearly flat: the most common skill accounts for only 1.92% of rows. The content mixes domains (ELA and economics), suggesting this column tags items from a multi-subject curriculum.
Treatment: Treat as a high-cardinality categorical label; target-encode or embed rather than one-hot.
high confidence · anthropic:claude-opus-4-7
n: 12,383
nulls: 0 (0.0%)
unique: 402
top_value: understand-overall-supply-and-demand
top_rate: 0.01922
cardinality: 402
entropy: 7.949
entropy_ratio: 0.9188
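
To make "nearly flat" concrete, the entropy ratio can be recomputed from the reported entropy, and the top skill's share can be compared with the share each skill would have under a perfectly uniform distribution. This is a sketch that assumes the ratio is entropy divided by log2 of the number of distinct values, which the report does not state explicitly; `normalised_entropy` is an illustrative name.

```python
import math

def normalised_entropy(entropy_bits, n_unique):
    """Entropy divided by its maximum, log2(n_unique); 1.0 means uniform."""
    if n_unique < 2:
        return 0.0
    return entropy_bits / math.log2(n_unique)

# Figures reported for the skill column.
skill_ratio = normalised_entropy(7.949, 402)

# Under a uniform spread each of the 402 skills would hold 1/402 of the rows.
uniform_share = 1 / 402
top_vs_uniform = 0.01922 / uniform_share

print(f"entropy_ratio={skill_ratio:.4f}")
print(f"top skill is {top_vs_uniform:.1f}x its uniform share")
```

The ratio stays close to 1 even though the top skill is several times larger than a uniform share, so a high entropy ratio does not rule out a handful of noticeably over-represented skills.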

question

text free_text duplicates
Free-text question prompts, almost entirely English (4379 en vs 2 es detected). Lengths are widely spread and right-skewed: min 3, median 120, mean 313, max 6,441 chars, with a mean of 52 words against a median of 20. High duplication is the standout: 3,008 rows (24.3%) repeat an earlier question text (12,383 rows minus 9,375 distinct values), with stems like 'Which sentence is correct?' (171) and 'Which sentence states a fact?' (146) dominating, suggesting templated educational/grammar items. Readability is fairly easy (Flesch 73.9) and vocabulary is moderate (31,660 unique tokens across 9,375 distinct questions).
Treatment: Tokenize/embed for modelling and decide whether to dedupe the 3,008 repeated prompts before train/test splitting to avoid leakage.
high confidence · anthropic:claude-opus-4-7
n: 12,383
nulls: 0 (0.0%)
unique: 9,375
len_min: 3
len_max: 6,441
len_mean: 313.4
len_median: 120
len_p95: 1,502
word_mean: 52.2
word_median: 20
n_empty: 0
n_duplicates: 3,008
duplicate_rate: 0.2429
vocab_size: 31,660
readability_flesch_mean: 73.85
emoji_rate: 0
url_rate: 0
one_word_rate: 0.0002423
allcaps_rate: 0.000323
boilerplate_rate: 0
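
One way to avoid the leakage mentioned above without discarding rows is to assign the split from a hash of the normalised question text, so that every copy of a repeated prompt lands on the same side. The sketch below is one possible approach, not the profiler's own method; `normalize_question`, `assign_split`, and the 20% test fraction are all assumptions chosen for illustration.

```python
import hashlib

def normalize_question(text):
    """Lower-case and collapse whitespace so trivial variants share a key."""
    return " ".join(text.lower().split())

def assign_split(question, test_fraction=0.2):
    """Deterministically map a question to 'train' or 'test' via its hash."""
    key = normalize_question(question).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"

questions = [
    "Which sentence is correct?",
    "which  sentence is correct?",   # same prompt, different case and spacing
    "Which sentence states a fact?",
]
splits = [assign_split(q) for q in questions]
print(list(zip(questions, splits)))
```

Keying on the question text alone may be too coarse for templated stems such as 'Which sentence is correct?', where the answer options probably differ between rows; including the parsed `choices` in the key would be a natural refinement once that column has been inspected.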

choices

unknown other skipped
The column 'choices' was skipped by the profiler (kind 'unknown'), so no descriptive statistics, cardinality, or value samples are available beyond a row count of 12,383 and a null rate of 0.0%. Without type inference or distributional signals, the contents cannot be characterised from this evidence alone. The name suggests structured option sets (e.g. lists or JSON), which would explain why the dissector could not coerce the column into a primitive type.
Treatment: Inspect raw values manually and parse (likely JSON/list) before deciding on a downstream representation.
low confidence · anthropic:claude-opus-4-7
n: 12,383
nulls: 0 (0.0%)
unique: not computed (column skipped)
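
Since the actual storage format of `choices` is unknown, a first inspection pass could use a best-effort parser like the sketch below. It assumes each cell is either already a sequence (including array-like objects exposing `tolist()`), a JSON array string, or a Python-literal list string; none of these formats is confirmed by the report, and `parse_choices` is a hypothetical helper.

```python
import ast
import json

def parse_choices(raw):
    """Best-effort coercion of one raw `choices` cell to a list of strings.

    Returns None when the value cannot be interpreted as a list.
    """
    if raw is None:
        return None
    # Array-like cells (e.g. from a nested Parquet column) often expose tolist().
    if hasattr(raw, "tolist"):
        raw = raw.tolist()
    if isinstance(raw, (list, tuple)):
        return [str(item) for item in raw]
    if isinstance(raw, str):
        # Try strict JSON first, then a safe Python-literal parse.
        for loader in (json.loads, ast.literal_eval):
            try:
                value = loader(raw)
            except (ValueError, SyntaxError):
                continue
            if isinstance(value, (list, tuple)):
                return [str(item) for item in value]
    return None

print(parse_choices('["a cat", "a dog"]'))
print(parse_choices("['a cat', 'a dog']"))
print(parse_choices("not a list"))
```

Counting how many cells return `None`, and how many options the parsed lists contain, would show whether every question really offers four choices, which matters for interpreting `answer_idx`.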

answer_idx

numeric label
This is almost certainly a categorical answer index encoded as an integer: it takes only 4 distinct values (0 to 3) across 12,383 rows, with no nulls. The distribution is weighted toward the low end: 43% of rows are 0, the median and third quartile are both 1, and the positive skew (0.94) reflects a thin upper tail in which answer index 3 is comparatively rare. The 577 flagged outliers (4.7%) are an artifact of applying numeric outlier rules to what is really a small categorical set.
Treatment: Treat as a categorical class label, not a numeric value.
high confidence · anthropic:claude-opus-4-7
n: 12,383
nulls: 0 (0.0%)
unique: 4
min: 0
max: 3
mean: 0.7734
median: 1
std: 0.821
q1: 0
q3: 1
iqr: 1
skew: 0.9448
kurtosis: 0.4078
n_outliers: 577
outlier_rate: 0.0466
zero_rate: 0.4294
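
The outlier flag can be traced back to the quartiles. The sketch below assumes the profiler uses the common Tukey rule (values beyond 1.5 × IQR from the quartiles); the report does not name its rule, so this is a plausible reconstruction rather than a description of the tool. `tukey_fences` is an illustrative name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for answer_idx.
lower, upper = tukey_fences(q1=0, q3=1)

# Check each of the four possible answer indices against the fences.
flagged = [value for value in (0, 1, 2, 3) if value < lower or value > upper]

# Row count implied by the reported outlier_rate.
implied_count = round(0.0466 * 12383)

print(f"fences=({lower}, {upper}) flagged={flagged} implied_count={implied_count}")
```

Under that assumption the upper fence sits at 2.5, so the only value that can be flagged is index 3, and the 577 "outliers" would simply be the rows whose correct answer is the fourth option.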