quirky social norms

source /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms.parquet · 12,383 rows · 6 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 12,383 multiple-choice questions tagged by subject, grade, and skill, likely from an educational platform. The content is heavily skewed toward language arts (10,068 rows) over social studies (2,315), and grade-5 is the single largest grade bucket at 2,537 rows. The question text has a 24.3% duplicate rate (3,008 repeated rows), so deduplication is worth considering before any modeling. Answer indices range 0-3 but are concentrated at 0 and 1 (42.9% and 41.4% of rows), suggesting possible position bias in the correct-answer distribution. Skill coverage is broad with 402 distinct skills, none dominating (top skill is only 1.9% of rows).

citing: row_count · columns.subject.top_values · columns.grade.top_values · columns.question.stats.duplicate_rate · columns.question.stats.n_duplicates · columns.answer_idx.stats.zero_rate · columns.answer_idx.n_unique · columns.skill.n_unique · columns.skill.stats.top_rate

Charts the summary said to look at first

subject · Shows the heavy tilt toward language arts over social studies.

Top values for subject (2 unique shown, of 2 total).
value	count	share
language arts	10068	81.3%
social studies	2315	18.7%

grade · Reveals which grade levels are best represented, with grade-5 leading.

Top values for grade (14 unique shown, of 14 total).
value	count	share
grade-5	2537	20.5%
grade-3	1559	12.6%
grade-8	1389	11.2%
grade-9	1025	8.3%
grade-6	1017	8.2%
grade-7	807	6.5%
grade-11	790	6.4%
grade-2	653	5.3%
grade-10	649	5.2%
grade-12	564	4.6%
grade-4	541	4.4%
kindergarten	446	3.6%
grade-1	284	2.3%
pre-k	122	1.0%

answer_idx · Check whether correct-answer positions are evenly spread or biased toward 0 and 1.

Value counts for answer_idx (median: 1.0).
value	count	share
0	5317	42.9%
1	5132	41.4%
2	1357	11.0%
3	577	4.7%
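
One way to quantify the spread of correct-answer positions is to compare the observed counts with a uniform split. The sketch below uses the four counts from the table above and computes per-position shares plus a Pearson chi-square statistic. The uniform expectation is an assumption: it only makes sense if every question offers four choices, which this profile cannot confirm because the choices column was skipped.

```python
# Counts per answer_idx value, copied from the table above.
counts = {0: 5317, 1: 5132, 2: 1357, 3: 577}

total = sum(counts.values())
shares = {value: n / total for value, n in counts.items()}

# Pearson chi-square statistic against a uniform expectation.
# ASSUMPTION: all questions have four choices (not verifiable from this profile).
expected = total / len(counts)
chi_square = sum((n - expected) ** 2 / expected for n in counts.values())

for value, share in shares.items():
    print(f"answer_idx={value}: {share:.1%}")
print(f"chi-square vs uniform (3 degrees of freedom): {chi_square:.1f}")
```

If many questions turn out to have only two choices, the concentration at 0 and 1 would be expected, and the comparison should be repeated per choice-count group.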

question · Question lengths span from 3 to 6,441 characters; look for a long tail of unusually long prompts.

Character-length distribution for question (mean: 313.4).
chars	count
3 – 164	8306
164 – 325	2142
325 – 486	439
486 – 647	337
647 – 808	253
808 – 969	150
969 – 1130	65
1130 – 1291	30
1291 – 1452	32
1452 – 1612	52
1612 – 1773	38
1773 – 1934	20
1934 – 2095	30
2095 – 2256	28
2256 – 2417	43
2417 – 2578	42
2578 – 2739	26
2739 – 2900	67
2900 – 3061	51
3061 – 3222	27
3222 – 3383	12
3383 – 3544	13
3544 – 3705	22
3705 – 3866	13
3866 – 4027	13
4027 – 4188	5
4188 – 4349	10
4349 – 4510	21
4510 – 4671	15
4671 – 4832	23
4832 – 4992	12
4992 – 5153	13
5153 – 5314	11
5314 – 5475	4
5475 – 5636	9
5636 – 5797	6
5797 – 5958	1
5958 – 6119	0
6119 – 6280	0
6280 – 6441	2

skill · Top skills are fairly evenly distributed across 402 categories — no single skill dominates.

Top values for skill (20 unique shown, of 402 total).
value	count	share
understand-overall-supply-and-demand	238	1.9%
choose-between-adjectives-and-adverbs	176	1.4%
determine-the-meanings-of-greek-and-latin-roots	165	1.3%
describe-the-difference-between-related-words	162	1.3%
costs-and-benefits	160	1.3%
is-it-a-complete-sentence-or-a-fragment	156	1.3%
use-greek-and-latin-roots-as-clues-to-the-meanings-of-words	137	1.1%
identify-vague-pronoun-references	136	1.1%
what-does-the-punctuation-suggest	131	1.1%
determine-the-meanings-of-words-with-greek-and-latin-roots	128	1.0%
use-the-correct-homophone	128	1.0%
analogies	127	1.0%
classify-logical-fallacies	123	1.0%
is-it-a-phrase-or-a-clause	117	0.9%
is-the-sentence-declarative-interrogative-imperative-or-exclamatory	116	0.9%
analogies-challenge	113	0.9%
is-the-sentence-simple-compound-complex-or-compound-complex	108	0.9%
is-it-a-complete-sentence-or-a-run-on	107	0.9%
use-guide-words	106	0.9%
recall-the-source-of-an-allusion	105	0.8%

Schema

6 columns

Per-column summary.
column	type	nulls	unique	alerts
subject	categorical	0.0%	2
grade	categorical	0.0%	14
skill	categorical	0.0%	402
question	text	0.0%	9,375	duplicates
choices	unknown	0.0%	—	skipped
answer_idx	numeric	0.0%	4

subject

categorical label

Binary subject indicator: 'language arts' (10068 rows, 81.3%) and 'social studies' (2315 rows, 18.7%). No nulls across 12383 rows. The class imbalance is notable, roughly 4.3:1 in favor of language arts, and it will skew any per-subject aggregation or modelling. Treatment: One-hot or binary encode; account for the 4.3:1 imbalance when stratifying or modelling. high · anthropic:claude-opus-4-7

n: 12,383
nulls: 0 (0.0%)
unique: 2
top_value: language arts
top_rate: 0.8131
cardinality: 2
entropy: 0.695
entropy_ratio: 0.695
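
The entropy figures can be reproduced from the value counts. The sketch below assumes the profiler reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2 of the number of unique values; the report does not state this, but both assumptions match the numbers shown. The function name entropy_bits is illustrative.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Counts from the subject table: language arts, social studies.
subject_counts = [10068, 2315]

subject_entropy = entropy_bits(subject_counts)
# ASSUMPTION: entropy_ratio = entropy / log2(number of unique values).
subject_ratio = subject_entropy / math.log2(len(subject_counts))

print(f"entropy: {subject_entropy:.3f}")
print(f"entropy_ratio: {subject_ratio:.3f}")
```

For a two-value column log2(2) is 1, which is why entropy and entropy_ratio coincide here.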

grade

categorical feature

Categorical grade level with 14 unique values across 12,383 rows and no nulls. Distribution is fairly even (entropy ratio 0.92), though grade-5 leads at 20.5%. The ten most common values fall between grade-2 and grade-12; the remaining four (grade-4, kindergarten, grade-1, pre-k) complete a pre-K through grade-12 schema. Treatment: Encode as ordinal before modelling, extracting the numeric level from grade-N labels and assigning explicit ranks to pre-k and kindergarten. high · anthropic:claude-opus-4-7

n: 12,383
nulls: 0 (0.0%)
unique: 14
top_value: grade-5
top_rate: 0.2049
cardinality: 14
entropy: 3.513
entropy_ratio: 0.9227
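
A minimal sketch of the ordinal encoding, assuming the 14 labels listed in the grade table are the only ones present. The ranks -1 for pre-k and 0 for kindergarten are an illustrative convention, not something the profile prescribes, and grade_to_ordinal is a hypothetical helper name.

```python
import re

# ASSUMPTION: ranks chosen for the two non-numeric labels.
SPECIAL_GRADES = {"pre-k": -1, "kindergarten": 0}

def grade_to_ordinal(label: str) -> int:
    """Map a grade label such as 'grade-5' to an integer rank."""
    if label in SPECIAL_GRADES:
        return SPECIAL_GRADES[label]
    match = re.fullmatch(r"grade-(\d+)", label)
    if match is None:
        # Fail loudly rather than silently mis-rank an unseen label.
        raise ValueError(f"unrecognised grade label: {label!r}")
    return int(match.group(1))

labels = ["pre-k", "kindergarten", "grade-1", "grade-5", "grade-12"]
print([grade_to_ordinal(label) for label in labels])
```

Raising on unknown labels keeps new buckets from slipping into the encoded column unnoticed.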

skill

categorical label

Slug-style identifiers for educational skills (e.g., "understand-overall-supply-and-demand", "choose-between-adjectives-and-adverbs"), spanning topics from economics to grammar and vocabulary. With 402 unique values across 12,383 rows and entropy ratio 0.919, the distribution is nearly flat — the most common skill accounts for only 1.92% of rows. Content mixes domains (ELA and economics) suggesting this column tags items from a multi-subject curriculum. Treatment: Treat as a high-cardinality categorical label; target-encode or embed rather than one-hot. high · anthropic:claude-opus-4-7

n: 12,383
nulls: 0 (0.0%)
unique: 402
top_value: understand-overall-supply-and-demand
top_rate: 0.01922
cardinality: 402
entropy: 7.949
entropy_ratio: 0.9188
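
As an illustration of the target-encoding option, the sketch below computes a smoothed per-category mean. The target values, the smoothing weight, and the function name smoothed_target_encoding are all hypothetical; the profile does not say which target a model would predict.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, smoothing=10.0):
    """Per-category target mean, shrunk toward the global mean.

    Categories with few rows stay close to the global mean; categories
    with many rows approach their own observed mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }

# Toy data: two skill slugs with a hypothetical binary target.
skills = ["analogies", "analogies", "use-guide-words"]
toy_targets = [1, 1, 0]
print(smoothed_target_encoding(skills, toy_targets, smoothing=1.0))
```

The mapping should be fitted on training rows only, since fitting on all rows leaks target information into evaluation data.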

question

text free_text duplicates

Free-text question prompts, almost entirely English (language detection returned 4,379 en vs 2 es) with a wide length spread (min 3, median 120, max 6,441 chars; mean 52 words). High duplication is the standout: 3,008 rows (24.3%) repeat an earlier question, leaving 9,375 distinct prompts. Stems like 'Which sentence is correct?' (171) and 'Which sentence states a fact?' (146) dominate the repeats, suggesting templated educational/grammar items. Readability is easy (Flesch 73.9) and vocabulary is moderate (31,660 unique tokens across 9,375 distinct questions). Treatment: Tokenize/embed for modelling, and decide whether to dedupe the 3,008 repeated prompts before train/test splitting to avoid leakage. high · anthropic:claude-opus-4-7

n: 12,383
nulls: 0 (0.0%)
unique: 9,375
len_min: 3
len_max: 6,441
len_mean: 313.4
len_median: 120
len_p95: 1502
word_mean: 52.2
word_median: 20
n_empty: 0
n_duplicates: 3,008
duplicate_rate: 0.2429
vocab_size: 31,660
readability_flesch_mean: 73.85
emoji_rate: 0
url_rate: 0
one_word_rate: 0.0002423
allcaps_rate: 0.000323
boilerplate_rate: 0
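
One way to avoid leakage without dropping rows is to assign the split from a hash of the question text, so identical prompts always land on the same side. This is a sketch under assumptions: the normalisation (lower-casing, collapsing whitespace) and the 20% test fraction are illustrative choices, and the helper names are hypothetical. Whether the profiler's duplicate count used any normalisation is not stated.

```python
import hashlib

def normalize(text: str) -> str:
    """Lower-case and collapse runs of whitespace (an assumed normalisation)."""
    return " ".join(text.lower().split())

def assign_split(question: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a question to 'train' or 'test' via its hash."""
    digest = hashlib.sha256(normalize(question).encode("utf-8")).digest()
    # The first 8 bytes give an integer in [0, 2**64); scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"

for q in ["Which sentence is correct?", "Which sentence states a fact?"]:
    print(q, "->", assign_split(q))
```

Grouping rather than deleting may matter here: rows sharing a stem such as 'Which sentence is correct?' may still differ in their choices, which this profile cannot show.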

choices

unknown other skipped

The column 'choices' was skipped by the profiler (kind 'unknown'), so no descriptive statistics, cardinality, or value samples are available beyond a row count of 12383 and a null rate of 0.0. Without type inference or distributional signals it is impossible to characterise the contents from this evidence alone. The name suggests it may hold structured option sets (e.g. lists or JSON), which would explain why the dissector could not coerce it into a primitive type. Treatment: Inspect raw values manually and parse (likely JSON/list) before deciding on a downstream representation. low · anthropic:claude-opus-4-7

n: 12,383
nulls: 0 (0.0%)
unique: —
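
A hedged sketch of the manual inspection step follows. It assumes each choices value is either an array-like of options or a JSON-encoded list; neither is confirmed by the profile, and parse_choices and answer_in_range are hypothetical helpers.

```python
import json

def parse_choices(raw):
    """Return the options as a list of strings, or raise on an unexpected shape."""
    # Array-like values (for example NumPy arrays) expose tolist().
    if hasattr(raw, "tolist"):
        raw = raw.tolist()
    if isinstance(raw, str):
        raw = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(raw, (list, tuple)):
        raise TypeError(f"unexpected choices type: {type(raw).__name__}")
    return [str(item) for item in raw]

def answer_in_range(raw_choices, answer_idx: int) -> bool:
    """Check that answer_idx points at an existing option."""
    return 0 <= answer_idx < len(parse_choices(raw_choices))

print(parse_choices('["a fact", "an opinion"]'))
print(answer_in_range(["a fact", "an opinion"], 1))
```

Once parsed, the number of options per row would also show how many questions have fewer than four choices.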

answer_idx

numeric label

This is almost certainly a categorical answer index encoded as an integer, taking only 4 distinct values from 0 to 3 across 12,383 rows with no nulls. The distribution is heavily weighted toward the low end: 43% are zero and the median is 1, with skew 0.94 indicating answer choice 3 is comparatively rare. The 577 flagged outliers (4.7%) are exactly the rows with value 3, an artifact of applying numeric outlier rules to what is really a small categorical set. Treatment: Treat as a categorical class label, not a numeric value. high · anthropic:claude-opus-4-7

n: 12,383
nulls: 0 (0.0%)
unique: 4
min: 0
max: 3
mean: 0.7734
median: 1
std: 0.821
q1: 0
q3: 1
iqr: 1
skew: 0.9448
kurtosis: 0.4078
n_outliers: 577
outlier_rate: 0.0466
zero_rate: 0.4294
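
The outlier count can be reproduced from the quartiles above. The sketch assumes the profiler applies the common 1.5 × IQR fence rule; the report does not name its rule, but this one yields the reported figures. The per-value counts are taken from the answer_idx chart data.

```python
# Quartiles as reported in the stats above.
q1, q3 = 0, 1
iqr = q3 - q1

# ASSUMPTION: Tukey fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr  # -1.5
upper_fence = q3 + 1.5 * iqr  # 2.5

# Rows per answer_idx value, from the chart data.
counts = {0: 5317, 1: 5132, 2: 1357, 3: 577}

n_outliers = sum(
    n for value, n in counts.items()
    if value < lower_fence or value > upper_fence
)
outlier_rate = n_outliers / sum(counts.values())

print(f"n_outliers: {n_outliers}")
print(f"outlier_rate: {outlier_rate:.4f}")
```

Only the value 3 falls outside the fences, so every row with that answer is flagged, which supports treating the column as a class label.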