quirky social norms

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms.parquet

Saturn profiled 12,383 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

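!pip install saturn-dissect
import subprocess

# Run the profiler on the parquet source, with the same arguments as the original cell.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/quirky/social_norms.parquet",
    "--findings", "quirky-social_norms.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if saturn exits with a non-zero status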

Summary confidence: high

This dataset contains 12,383 multiple-choice questions tagged by subject, grade, and skill, likely from an educational platform. The content is heavily skewed toward language arts (10,068 rows) over social studies (2,315), and grade-5 is the single largest grade bucket at 2,537 rows. The question text shows a notable 24.3% duplicate rate with 3,008 repeats, so deduplication is worth considering before any modeling. Answer indices range 0-3 but are concentrated at 0 and 1 (43% are zero), suggesting possible position bias in the correct-answer distribution. Skill coverage is broad with 402 distinct skills, none dominating (top skill is only 1.9% of rows).

citing: row_count · columns.subject.top_values · columns.grade.top_values · columns.question.stats.duplicate_rate · columns.question.stats.n_duplicates · columns.answer_idx.stats.zero_rate · columns.answer_idx.n_unique · columns.skill.n_unique · columns.skill.stats.top_rate

Out[4]:

saturn.schema() · 6 columns

column	kind	n	null%	unique	alerts
subject	categorical	12,383	0.0%	2
grade	categorical	12,383	0.0%	14
skill	categorical	12,383	0.0%	402
question	text	12,383	0.0%	9,375	duplicates
choices	unknown	12,383	0.0%	—	skipped
answer_idx	numeric	12,383	0.0%	4

Fig 1.

subject · Shows the heavy tilt toward language arts over social studies.

Top values for subject (2 unique shown, of 2 total).
value	count	share
language arts	10068	81.3%
social studies	2315	18.7%

Fig 2.

grade · Reveals which grade levels are best represented, with grade-5 leading.

Top values for grade (14 unique shown, of 14 total).
value	count	share
grade-5	2537	20.5%
grade-3	1559	12.6%
grade-8	1389	11.2%
grade-9	1025	8.3%
grade-6	1017	8.2%
grade-7	807	6.5%
grade-11	790	6.4%
grade-2	653	5.3%
grade-10	649	5.2%
grade-12	564	4.6%
grade-4	541	4.4%
kindergarten	446	3.6%
grade-1	284	2.3%
pre-k	122	1.0%

Fig 3.

answer_idx · Check whether correct-answer positions are evenly spread or biased toward 0 and 1.

Histogram bins for answer_idx (median: 1.0). Only the 4 non-empty bins are listed; the remaining 36 of 40 bins have count 0.
bin	count
0 – 0.075	5317
0.975 – 1.05	5132
1.95 – 2.025	1357
2.925 – 3	577

Fig 4.

question · Question lengths span from 3 to 6,441 characters; look for a long tail of unusually long prompts.

Character-length distribution for question (mean: 313.4).
chars	count
3 – 164	8306
164 – 325	2142
325 – 486	439
486 – 647	337
647 – 808	253
808 – 969	150
969 – 1130	65
1130 – 1291	30
1291 – 1452	32
1452 – 1612	52
1612 – 1773	38
1773 – 1934	20
1934 – 2095	30
2095 – 2256	28
2256 – 2417	43
2417 – 2578	42
2578 – 2739	26
2739 – 2900	67
2900 – 3061	51
3061 – 3222	27
3222 – 3383	12
3383 – 3544	13
3544 – 3705	22
3705 – 3866	13
3866 – 4027	13
4027 – 4188	5
4188 – 4349	10
4349 – 4510	21
4510 – 4671	15
4671 – 4832	23
4832 – 4992	12
4992 – 5153	13
5153 – 5314	11
5314 – 5475	4
5475 – 5636	9
5636 – 5797	6
5797 – 5958	1
5958 – 6119	0
6119 – 6280	0
6280 – 6441	2

Fig 5.

skill · Top skills are fairly evenly distributed across 402 categories — no single skill dominates.

Top values for skill (20 unique shown, of 402 total).
value	count	share
understand-overall-supply-and-demand	238	1.9%
choose-between-adjectives-and-adverbs	176	1.4%
determine-the-meanings-of-greek-and-latin-roots	165	1.3%
describe-the-difference-between-related-words	162	1.3%
costs-and-benefits	160	1.3%
is-it-a-complete-sentence-or-a-fragment	156	1.3%
use-greek-and-latin-roots-as-clues-to-the-meanings-of-words	137	1.1%
identify-vague-pronoun-references	136	1.1%
what-does-the-punctuation-suggest	131	1.1%
determine-the-meanings-of-words-with-greek-and-latin-roots	128	1.0%
use-the-correct-homophone	128	1.0%
analogies	127	1.0%
classify-logical-fallacies	123	1.0%
is-it-a-phrase-or-a-clause	117	0.9%
is-the-sentence-declarative-interrogative-imperative-or-exclamatory	116	0.9%
analogies-challenge	113	0.9%
is-the-sentence-simple-compound-complex-or-compound-complex	108	0.9%
is-it-a-complete-sentence-or-a-run-on	107	0.9%
use-guide-words	106	0.9%
recall-the-source-of-an-allusion	105	0.8%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
subject	categorical	0.0%
grade	categorical	0.0%
skill	categorical	0.0%
question	text	0.0%
choices	unknown	0.0%
answer_idx	numeric	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,381 detected strings).
lang	count	share
en	4379	100.0%
es	2	0.0%

subject categorical label

Binary subject indicator with only two values: 'language arts' (10,068 rows, 81.3%) and 'social studies' (2,315 rows, 18.7%). No nulls across 12,383 rows. The class imbalance is notable, roughly 4.3:1 in favor of language arts, and will skew any per-subject aggregation or modelling.

Treatment: One-hot or binary encode; account for the roughly 4.3:1 imbalance when stratifying or modelling.

anthropic:claude-opus-4-7 · confidence high
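
One way to account for the subject imbalance at modelling time is inverse-frequency class weighting. The sketch below is illustrative only and is not part of saturn: the function name `balanced_class_weights` is hypothetical, the counts are the ones reported above, and the n / (k · count) formula is the common "balanced" heuristic, assumed here rather than prescribed by the report.

```python
def balanced_class_weights(counts):
    """Return n / (k * count) per class, the common 'balanced' heuristic."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Counts taken from the subject table in this report.
subject_counts = {"language arts": 10068, "social studies": 2315}
weights = balanced_class_weights(subject_counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

The ratio of the two weights equals the class ratio (about 4.3), so each subject contributes equally in aggregate. Whether that is desirable depends on the downstream task.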

Out[13]:

saturn.columns["subject"].stats

stat	value
n	12,383
nulls	0 (0.0%)
unique	2
top_value	language arts
top_rate	0.8131
cardinality	2
entropy	0.695
entropy_ratio	0.695
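
The entropy figures in these stats tables are consistent with Shannon entropy measured in bits, with entropy_ratio equal to entropy divided by log2 of the number of categories. That reading is inferred from the reported numbers, not from saturn's documentation, and the function names below are hypothetical. A minimal sketch that recomputes both values from the published counts:

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by log2(k), the maximum possible for k categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Counts copied from the subject and grade tables in this report.
subject = [10068, 2315]
grade = [2537, 1559, 1389, 1025, 1017, 807, 790, 653, 649, 564, 541, 446, 284, 122]

print(round(entropy_bits(subject), 3), round(entropy_ratio(subject), 3))
print(round(entropy_bits(grade), 3), round(entropy_ratio(grade), 4))
```

For a two-valued column log2(2) is 1, which is why entropy and entropy_ratio coincide for subject.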

Fig 8.

Top values for subject.

grade categorical feature

Categorical grade level with 14 unique values across 12,383 rows and no nulls. Distribution is fairly even (entropy ratio 0.92), though grade-5 leads at 20.5%. The 14 values run from pre-k and kindergarten through grade-12, so the column follows a pre-K-12 schema; pre-k is the smallest bucket at 1.0%.

Treatment: Encode as ordinal (extract numeric grade level) before modelling.

anthropic:claude-opus-4-7 · confidence high
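
The ordinal treatment above needs a decision for the two labels that carry no number. The sketch below is one possible mapping, not saturn output: the function name `grade_to_ordinal` is hypothetical, and placing kindergarten at 0 and pre-k at -1 is an assumption chosen only to preserve order.

```python
import re

# Assumed positions for the two non-numeric buckets, both below grade-1.
SPECIAL_GRADES = {"pre-k": -1, "kindergarten": 0}


def grade_to_ordinal(label):
    """Map 'grade-5' to 5, 'kindergarten' to 0 and 'pre-k' to -1."""
    key = label.strip().lower()
    if key in SPECIAL_GRADES:
        return SPECIAL_GRADES[key]
    match = re.fullmatch(r"grade-(\d+)", key)
    if match is None:
        # Fail loudly instead of silently guessing a rank for unknown labels.
        raise ValueError(f"unrecognised grade label: {label!r}")
    return int(match.group(1))


labels = ["grade-5", "pre-k", "grade-12", "kindergarten", "grade-1"]
print(sorted(labels, key=grade_to_ordinal))
```

Sorting by the mapped value orders the labels from pre-k up to grade-12, which a plain string sort would not do ("grade-10" sorts before "grade-2").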

Out[16]:

saturn.columns["grade"].stats

stat	value
n	12,383
nulls	0 (0.0%)
unique	14
top_value	grade-5
top_rate	0.2049
cardinality	14
entropy	3.513
entropy_ratio	0.9227

Fig 9.

Top values for grade.

skill categorical label

Slug-style identifiers for educational skills (e.g., "understand-overall-supply-and-demand", "choose-between-adjectives-and-adverbs"), spanning topics from economics to grammar and vocabulary. With 402 unique values across 12,383 rows and entropy ratio 0.919, the distribution is nearly flat: the most common skill accounts for only 1.92% of rows. Content mixes domains (ELA and economics), which suggests this column tags items from a multi-subject curriculum.

Treatment: Treat as a high-cardinality categorical label; target-encode or embed rather than one-hot.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["skill"].stats

stat	value
n	12,383
nulls	0 (0.0%)
unique	402
top_value	understand-overall-supply-and-demand
top_rate	0.01922
cardinality	402
entropy	7.949
entropy_ratio	0.9188

Fig 10.

Top values for skill.

question text free_text

Free-text question prompts, almost entirely English (4379 en vs 2 es detected) with a wide length spread (min 3, median 120, max 6441 chars; mean 52 words). High duplication is the standout: 3008 rows (24.3%) repeat, with stems like 'Which sentence is correct?' (171) and 'Which sentence states a fact?' (146) dominating, suggesting templated educational/grammar items. Readability is easy (Flesch 73.9) and vocab is moderate (31660 unique tokens across 9375 distinct questions).

Treatment: Tokenize/embed for modelling and decide whether to dedupe the 3008 repeated prompts before train/test splitting to avoid leakage.

anthropic:claude-opus-4-7 · confidence high
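
Because the choices column was not profiled, rows that share a question stem may still differ in their options, so dropping repeats outright could discard distinct items. An alternative to deduplication is to keep every row but send all copies of a prompt to the same side of the split. The sketch below illustrates that idea under stated assumptions: `normalise` and `split_bucket` are hypothetical names, the whitespace and case normalisation goes beyond the exact-string duplicates saturn counted, and the 0.2 test fraction is arbitrary.

```python
import hashlib


def normalise(text):
    """Collapse whitespace and case so trivially different copies match."""
    return " ".join(text.lower().split())


def split_bucket(question, test_fraction=0.2):
    """Assign a question to 'train' or 'test' from a hash of its text.

    Identical prompts always hash to the same side, so a repeated stem
    cannot appear in both splits.
    """
    digest = hashlib.sha256(normalise(question).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1].
    position = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if position < test_fraction else "train"


questions = [
    "Which sentence is correct?",
    "Which sentence states a fact?",
    "Which sentence is correct?",
    "which  sentence is correct?",
]
print([split_bucket(q) for q in questions])
```

The assignment is deterministic across runs and machines, so the split can be reproduced without storing an index. The realised test share only approximates `test_fraction`, and heavily repeated stems move as a block.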

Out[22]:

saturn.columns["question"].stats

stat	value
n	12,383
nulls	0 (0.0%)
unique	9,375
len_min	3
len_max	6,441
len_mean	313.4
len_median	120
len_p95	1502
word_mean	52.2
word_median	20
n_empty	0
n_duplicates	3,008
duplicate_rate	0.2429
vocab_size	31,660
readability_flesch_mean	73.85
emoji_rate	0
url_rate	0
one_word_rate	0.0002423
allcaps_rate	0.000323
boilerplate_rate	0
alert: duplicates	24.3% duplicate strings

Fig 11.

Character-length distribution for question.

choices unknown other

The column 'choices' was skipped by the profiler (kind 'unknown'), so no descriptive statistics, cardinality, or value samples are available beyond a row count of 12383 and a null rate of 0.0. Without type inference or distributional signals it is impossible to characterise the contents from this evidence alone. The name suggests it may hold structured option sets (e.g. lists or JSON), which would explain why the dissector could not coerce it into a primitive type.

Treatment: Inspect raw values manually and parse (likely JSON/list) before deciding on a downstream representation.

anthropic:claude-opus-4-7 · confidence low
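
A first pass at the manual inspection suggested above could normalise each cell to a list of strings. The sketch below is speculative, since the report contains no sample of the column: `parse_choices` is a hypothetical helper, the handled encodings (native list, array-like, JSON text, Python-literal text) are guesses, and the example values are invented for illustration and do not come from the dataset.

```python
import ast
import json


def parse_choices(raw):
    """Best-effort conversion of one raw `choices` cell to a list of strings."""
    if hasattr(raw, "tolist"):
        raw = raw.tolist()  # array-like cells, e.g. from a list-typed column
    if isinstance(raw, (list, tuple)):
        return [str(item) for item in raw]
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")  # assumes UTF-8 text
    if isinstance(raw, str):
        # Try strict JSON first, then a Python literal such as "['a', 'b']".
        for loader in (json.loads, ast.literal_eval):
            try:
                value = loader(raw)
            except (ValueError, SyntaxError):
                continue
            if isinstance(value, (list, tuple)):
                return [str(item) for item in value]
    raise TypeError(f"unrecognised choices value: {raw!r}")


# Invented examples of the encodings the helper accepts.
print(parse_choices('["a cat", "a dog"]'))
print(parse_choices("['yes', 'no']"))
print(parse_choices(("x", "y", "z")))
```

Raising on anything unrecognised keeps unexpected encodings visible during inspection instead of silently coercing them.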

Out[25]:

saturn.columns["choices"].stats

stat	value
n	12,383
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

answer_idx numeric label

This is almost certainly a categorical answer index encoded as an integer, taking only 4 distinct values from 0 to 3 across 12,383 rows with no nulls. The distribution is heavily weighted toward the low end: 43% are zero and the median is 1, with skew 0.94 indicating answer choice 3 is comparatively rare. The 577 flagged outliers (4.7%) are an artifact of applying numeric outlier rules to what is really a small categorical set.

Treatment: Treat as a categorical class label, not a numeric value.

anthropic:claude-opus-4-7 · confidence high

Out[27]:

saturn.columns["answer_idx"].stats

stat	value
n	12,383
nulls	0 (0.0%)
unique	4
min	0
max	3
mean	0.7734
median	1
std	0.821
q1	0
q3	1
iqr	1
skew	0.9448
kurtosis	0.4078
n_outliers	577
outlier_rate	0.0466
zero_rate	0.4294
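
Several of these stats can be recomputed from the four non-empty histogram bins alone, which also shows why the 577 "outliers" are exactly the rows with answer index 3. The sketch below assumes saturn uses the usual 1.5 × IQR (Tukey) fence; that assumption fits the reported numbers but is not confirmed by the report. The chi-square statistic against a uniform spread is an added illustration, not a saturn stat.

```python
# Counts per answer index, taken from the non-empty histogram bins.
counts = {0: 5317, 1: 5132, 2: 1357, 3: 577}
n = sum(counts.values())

mean = sum(value * c for value, c in counts.items()) / n
zero_rate = counts[0] / n

# Tukey fences from the reported quartiles (q1 = 0, q3 = 1).
# Assumption: the profiler flags values outside q1 - 1.5*iqr .. q3 + 1.5*iqr.
q1, q3 = 0, 1
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
n_outliers = sum(c for value, c in counts.items() if value < lower or value > upper)

# Chi-square goodness-of-fit statistic against equal use of all four positions.
expected = n / len(counts)
chi_square = sum((c - expected) ** 2 / expected for c in counts.values())

print(n, round(mean, 4), round(zero_rate, 4))
print(upper, n_outliers, round(n_outliers / n, 4))
print(round(chi_square, 1))
```

With an upper fence of 2.5, every row with index 3 is flagged and nothing else is, matching n_outliers and outlier_rate. The chi-square statistic is far above the 3-degrees-of-freedom critical value of about 16.27 at the 0.001 level, but the uniform baseline only applies if every item offers four options. Since choices was skipped, items with two or three options could account for much of the skew toward 0 and 1.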

Fig 12.

Distribution of answer_idx. Vertical dash marks the median.

How to cite

BibTeX

@misc{saturn-quirky-social-norms-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky social norms},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-social_norms}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}

APA

Steuber, L. (2026). Saturn reading: quirky social norms. Source: /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-social_norms