quirky social norms
Reading
This dataset contains 12,383 multiple-choice questions tagged by subject, grade, and skill, likely from an educational platform. The content is heavily skewed toward language arts (10,068 rows) over social studies (2,315), and grade-5 is the single largest grade bucket at 2,537 rows. Question text has a 24.3% duplicate rate (3,008 repeated rows), so deduplication is worth considering before any modeling. Answer indices range from 0 to 3 but are concentrated at 0 and 1 (43% are zero), which may indicate position bias in the correct-answer distribution or simply that many questions offer fewer than four choices; the skipped choices column would settle this. Skill coverage is broad, with 402 distinct skills and none dominating (the top skill is only 1.9% of rows).
citing: row_count · columns.subject.top_values · columns.grade.top_values · columns.question.stats.duplicate_rate · columns.question.stats.n_duplicates · columns.answer_idx.stats.zero_rate · columns.answer_idx.n_unique · columns.skill.n_unique · columns.skill.stats.top_rate
Charts the summary said to look at first
subject (top values)
| value | count | share |
|---|---|---|
| language arts | 10068 | 81.3% |
| social studies | 2315 | 18.7% |
grade (all 14 values)
| value | count | share |
|---|---|---|
| grade-5 | 2537 | 20.5% |
| grade-3 | 1559 | 12.6% |
| grade-8 | 1389 | 11.2% |
| grade-9 | 1025 | 8.3% |
| grade-6 | 1017 | 8.2% |
| grade-7 | 807 | 6.5% |
| grade-11 | 790 | 6.4% |
| grade-2 | 653 | 5.3% |
| grade-10 | 649 | 5.2% |
| grade-12 | 564 | 4.6% |
| grade-4 | 541 | 4.4% |
| kindergarten | 446 | 3.6% |
| grade-1 | 284 | 2.3% |
| pre-k | 122 | 1.0% |
answer_idx (value counts; the histogram's empty bins are omitted)
| value | count | share |
|---|---|---|
| 0 | 5317 | 42.9% |
| 1 | 5132 | 41.4% |
| 2 | 1357 | 11.0% |
| 3 | 577 | 4.7% |
question length in characters (histogram)
| chars | count |
|---|---|
| 3 – 164 | 8306 |
| 164 – 325 | 2142 |
| 325 – 486 | 439 |
| 486 – 647 | 337 |
| 647 – 808 | 253 |
| 808 – 969 | 150 |
| 969 – 1130 | 65 |
| 1130 – 1291 | 30 |
| 1291 – 1452 | 32 |
| 1452 – 1612 | 52 |
| 1612 – 1773 | 38 |
| 1773 – 1934 | 20 |
| 1934 – 2095 | 30 |
| 2095 – 2256 | 28 |
| 2256 – 2417 | 43 |
| 2417 – 2578 | 42 |
| 2578 – 2739 | 26 |
| 2739 – 2900 | 67 |
| 2900 – 3061 | 51 |
| 3061 – 3222 | 27 |
| 3222 – 3383 | 12 |
| 3383 – 3544 | 13 |
| 3544 – 3705 | 22 |
| 3705 – 3866 | 13 |
| 3866 – 4027 | 13 |
| 4027 – 4188 | 5 |
| 4188 – 4349 | 10 |
| 4349 – 4510 | 21 |
| 4510 – 4671 | 15 |
| 4671 – 4832 | 23 |
| 4832 – 4992 | 12 |
| 4992 – 5153 | 13 |
| 5153 – 5314 | 11 |
| 5314 – 5475 | 4 |
| 5475 – 5636 | 9 |
| 5636 – 5797 | 6 |
| 5797 – 5958 | 1 |
| 5958 – 6119 | 0 |
| 6119 – 6280 | 0 |
| 6280 – 6441 | 2 |
skill (top 20 values)
| value | count | share |
|---|---|---|
| understand-overall-supply-and-demand | 238 | 1.9% |
| choose-between-adjectives-and-adverbs | 176 | 1.4% |
| determine-the-meanings-of-greek-and-latin-roots | 165 | 1.3% |
| describe-the-difference-between-related-words | 162 | 1.3% |
| costs-and-benefits | 160 | 1.3% |
| is-it-a-complete-sentence-or-a-fragment | 156 | 1.3% |
| use-greek-and-latin-roots-as-clues-to-the-meanings-of-words | 137 | 1.1% |
| identify-vague-pronoun-references | 136 | 1.1% |
| what-does-the-punctuation-suggest | 131 | 1.1% |
| determine-the-meanings-of-words-with-greek-and-latin-roots | 128 | 1.0% |
| use-the-correct-homophone | 128 | 1.0% |
| analogies | 127 | 1.0% |
| classify-logical-fallacies | 123 | 1.0% |
| is-it-a-phrase-or-a-clause | 117 | 0.9% |
| is-the-sentence-declarative-interrogative-imperative-or-exclamatory | 116 | 0.9% |
| analogies-challenge | 113 | 0.9% |
| is-the-sentence-simple-compound-complex-or-compound-complex | 108 | 0.9% |
| is-it-a-complete-sentence-or-a-run-on | 107 | 0.9% |
| use-guide-words | 106 | 0.9% |
| recall-the-source-of-an-allusion | 105 | 0.8% |
Schema
6 columns
| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| subject | categorical | 0.0% | 2 | |
| grade | categorical | 0.0% | 14 | |
| skill | categorical | 0.0% | 402 | |
| question | text | 0.0% | 9,375 | duplicates |
| choices | unknown | 0.0% | — | skipped |
| answer_idx | numeric | 0.0% | 4 | |
subject
categorical · label
Binary subject indicator with only two values: 'language arts' (10,068 rows, 81.3%) and 'social studies' (2,315 rows, 18.7%). No nulls across 12,383 rows. The class imbalance is notable, roughly 4.3:1 in favor of language arts, and will skew any per-subject aggregation or modelling.
Treatment: One-hot or binary encode; account for the roughly 4:1 imbalance when stratifying or modelling.
- n: 12,383
- nulls: 0 (0.0%)
- unique: 2
- top_value: language arts
- top_rate: 0.8131
- cardinality: 2
- entropy: 0.695
- entropy_ratio: 0.695
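The entropy figures can be reproduced from the subject counts. The sketch below assumes the profiler reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2 of the number of categories; both are inferences from the reported values, not documented behaviour, and the function names are illustrative.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a discrete distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories.

    Assumption: this is how the profiler normalises entropy.
    """
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Counts taken from the subject table: language arts, social studies.
subject_counts = [10068, 2315]
print(round(entropy_bits(subject_counts), 3))
print(round(entropy_ratio(subject_counts), 3))
```

With two categories the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for this column.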
grade
categorical · feature
Categorical grade level with 14 unique values across 12,383 rows and no nulls. Distribution is fairly even (entropy ratio 0.92), though grade-5 leads at 20.5%. The ten most frequent values span grade-2 through grade-12; the remaining four are grade-4, kindergarten, grade-1, and pre-k, so the column covers pre-K through grade 12.
Treatment: Encode as ordinal (extract the numeric grade level and assign positions to pre-k and kindergarten) before modelling.
- n: 12,383
- nulls: 0 (0.0%)
- unique: 14
- top_value: grade-5
- top_rate: 0.2049
- cardinality: 14
- entropy: 3.513
- entropy_ratio: 0.9227
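A minimal sketch of the ordinal encoding suggested for grade. The label formats are taken from the grade table; placing pre-k at -1 and kindergarten at 0 is an assumption made here for illustration, and grade_to_ordinal is a hypothetical helper name.

```python
import re

# Assumed positions for the two non-numeric labels.
SPECIAL_GRADES = {"pre-k": -1, "kindergarten": 0}

def grade_to_ordinal(label: str) -> int:
    """Map 'pre-k' -> -1, 'kindergarten' -> 0, and 'grade-N' -> N."""
    if label in SPECIAL_GRADES:
        return SPECIAL_GRADES[label]
    match = re.fullmatch(r"grade-(\d+)", label)
    if match is None:
        # Fail loudly rather than silently encoding an unexpected label.
        raise ValueError(f"unrecognised grade label: {label!r}")
    return int(match.group(1))

labels = ["pre-k", "kindergarten", "grade-1", "grade-5", "grade-12"]
print([grade_to_ordinal(label) for label in labels])
```

Raising on unknown labels keeps new or misspelled buckets from being mapped to an arbitrary position.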
skill
categorical · label
Slug-style identifiers for educational skills (e.g., "understand-overall-supply-and-demand", "choose-between-adjectives-and-adverbs"), spanning topics from economics to grammar and vocabulary. With 402 unique values across 12,383 rows and an entropy ratio of 0.919, the distribution is nearly flat: the most common skill accounts for only 1.92% of rows. The content mixes domains (ELA and economics), suggesting this column tags items from a multi-subject curriculum.
Treatment: Treat as a high-cardinality categorical label; target-encode or embed rather than one-hot.
- n: 12,383
- nulls: 0 (0.0%)
- unique: 402
- top_value: understand-overall-supply-and-demand
- top_rate: 0.01922
- cardinality: 402
- entropy: 7.949
- entropy_ratio: 0.9188
question
text · free_text · duplicates
Free-text question prompts, almost entirely English (4,379 en vs 2 es detected) with a wide length spread (min 3, median 120, max 6,441 chars; mean 52 words). High duplication is the standout: 3,008 rows (24.3%) repeat, with stems like 'Which sentence is correct?' (171) and 'Which sentence states a fact?' (146) dominating, suggesting templated educational/grammar items. Readability is easy (Flesch 73.9) and vocabulary is moderate (31,660 unique tokens across 9,375 distinct questions).
Treatment: Tokenize/embed for modelling, and decide whether to dedupe the 3,008 repeated prompts before train/test splitting to avoid leakage.
- n: 12,383
- nulls: 0 (0.0%)
- unique: 9,375
- len_min: 3
- len_max: 6,441
- len_mean: 313.4
- len_median: 120
- len_p95: 1,502
- word_mean: 52.2
- word_median: 20
- n_empty: 0
- n_duplicates: 3,008
- duplicate_rate: 0.2429
- vocab_size: 31,660
- readability_flesch_mean: 73.85
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.0002423
- allcaps_rate: 0.000323
- boilerplate_rate: 0
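One way to avoid the leakage concern without dropping rows is to keep every copy of a prompt on the same side of the split. The sketch below is a hash-based group split, an alternative to outright deduplication rather than anything the profiler prescribes. It keys on the question text alone; templated stems such as 'Which sentence is correct?' may carry different choices, so keying on question plus choices may be more appropriate once that column is parsed. The helper names and the 0.2 test fraction are illustrative assumptions.

```python
import hashlib
import re

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def split_bucket(question: str, test_fraction: float = 0.2) -> str:
    """Assign 'train' or 'test' from a hash of the normalised question.

    Identical prompts always hash to the same value, so all copies of a
    repeated question land in the same split.
    """
    digest = hashlib.sha256(normalise(question).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if position < test_fraction else "train"

questions = [
    "Which sentence is correct?",
    "  which sentence is   correct? ",
    "Which sentence states a fact?",
]
for q in questions:
    print(repr(q), "->", split_bucket(q))
```

Because the assignment depends only on the text, it is stable across runs and needs no stored random seed.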
choices
unknown · other · skipped
The column 'choices' was skipped by the profiler (kind 'unknown'), so no descriptive statistics, cardinality, or value samples are available beyond a row count of 12,383 and a null rate of 0.0. Without type inference or distributional signals, the contents cannot be characterised from this evidence alone. The name suggests it may hold structured option sets (e.g. lists or JSON), which would explain why the profiler could not coerce it into a primitive type.
Treatment: Inspect raw values manually and parse (likely JSON/list) before deciding on a downstream representation.
- n: 12,383
- nulls: 0 (0.0%)
- unique: —
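Since the actual storage format of choices is unknown, the following is only a best-effort parsing sketch. It assumes each cell is either already a list/tuple or a string holding a JSON array or a Python list literal; parse_choices is a hypothetical helper, and real values should be inspected before relying on it.

```python
import ast
import json

def parse_choices(raw):
    """Best-effort conversion of a raw 'choices' cell into a list of strings."""
    if isinstance(raw, (list, tuple)):
        return [str(item) for item in raw]
    if isinstance(raw, str):
        # Try strict JSON first, then a Python literal (e.g. single-quoted lists).
        for loader in (json.loads, ast.literal_eval):
            try:
                value = loader(raw)
            except (ValueError, SyntaxError):
                continue
            if isinstance(value, (list, tuple)):
                return [str(item) for item in value]
    raise ValueError(f"could not parse choices: {raw!r}")

print(parse_choices('["a fact", "an opinion"]'))
print(parse_choices("['run-on', 'complete sentence']"))
```

ast.literal_eval only evaluates literals, so it is a safer fallback than eval for untrusted cell contents.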
answer_idx
numeric · label
This is almost certainly a categorical answer index encoded as an integer, taking only 4 distinct values from 0 to 3 across 12,383 rows with no nulls. The distribution is heavily weighted toward the low end: 43% are zero and the median is 1, with skew 0.94; answer index 3 is comparatively rare. The 577 flagged outliers (4.7%) are an artifact of applying numeric outlier rules to what is really a small categorical set, and match the count of rows with index 3.
Treatment: Treat as a categorical class label, not a numeric value.
- n: 12,383
- nulls: 0 (0.0%)
- unique: 4
- min: 0
- max: 3
- mean: 0.7734
- median: 1
- std: 0.821
- q1: 0
- q3: 1
- iqr: 1
- skew: 0.9448
- kurtosis: 0.4078
- n_outliers: 577
- outlier_rate: 0.0466
- zero_rate: 0.4294
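The concentration at indices 0 and 1 only points to position bias if questions have comparable numbers of choices, and the choices column was skipped, so that is unknown here. A hedged sketch of the check, assuming the number of choices per row has already been recovered: tabulate the answer-index distribution separately for each choice count. The function name and the input shape (pairs of n_choices and answer_idx) are illustrative.

```python
from collections import Counter, defaultdict

def answer_position_rates(rows):
    """Share of each answer index, grouped by number of choices.

    rows: iterable of (n_choices, answer_idx) pairs.
    Returns {n_choices: {answer_idx: share}} with every index 0..n_choices-1 present.
    """
    counts = defaultdict(Counter)
    for n_choices, answer_idx in rows:
        counts[n_choices][answer_idx] += 1
    rates = {}
    for n_choices, counter in counts.items():
        total = sum(counter.values())
        rates[n_choices] = {
            idx: counter.get(idx, 0) / total for idx in range(n_choices)
        }
    return rates

# Toy rows for illustration only; not drawn from the dataset.
toy_rows = [(2, 0), (2, 1), (4, 0), (4, 0), (4, 1), (4, 3)]
for n_choices, shares in sorted(answer_position_rates(toy_rows).items()):
    print(n_choices, shares)
```

Within each group, shares far from 1/n_choices would suggest position bias; roughly uniform shares would suggest the overall skew comes from questions with fewer options.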