quirky social norms

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms.parquet

Saturn profiled 12,383 rows across 6 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
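# Notebook cell: "!" runs a shell command under IPython/Jupyter.
!pip install saturn-dissect
import subprocess

# Profile the Parquet file with the saturn CLI.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/quirky/social_norms.parquet",
        "--findings", "quirky-social_norms.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if saturn exits with a non-zero status
)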

Summary confidence: high

This dataset contains 12,383 multiple-choice questions tagged by subject, grade, and skill, likely from an educational platform. The content is heavily skewed toward language arts (10,068 rows) over social studies (2,315), and grade-5 is the single largest grade bucket at 2,537 rows. The question text shows a notable 24.3% duplicate rate with 3,008 repeats, so deduplication is worth considering before any modeling. Answer indices range 0-3 but are concentrated at 0 and 1 (43% are zero), suggesting possible position bias in the correct-answer distribution. Skill coverage is broad with 402 distinct skills, none dominating (top skill is only 1.9% of rows).

citing: row_count · columns.subject.top_values · columns.grade.top_values · columns.question.stats.duplicate_rate · columns.question.stats.n_duplicates · columns.answer_idx.stats.zero_rate · columns.answer_idx.n_unique · columns.skill.n_unique · columns.skill.stats.top_rate
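
The citing line names dotted keys such as columns.question.stats.duplicate_rate. As a rough sketch of how such keys could be looked up and cross-checked, the Python below walks a nested dict. The nesting of the findings file is an assumption (the report does not show it), the `resolve` helper is hypothetical, and the dict is hand-built from numbers quoted in this report, not loaded from quirky-social_norms.json.

```python
from functools import reduce

# Hand-built stand-in for the findings file; the nesting is assumed, not documented.
findings = {
    "row_count": 12383,
    "columns": {
        "question": {
            "n_unique": 9375,
            "stats": {"n_duplicates": 3008, "duplicate_rate": 0.2429},
        },
        "answer_idx": {"n_unique": 4, "stats": {"zero_rate": 0.4294}},
        "skill": {"n_unique": 402, "stats": {"top_rate": 0.01922}},
    },
}


def resolve(data, dotted_key):
    """Follow a dotted key such as 'columns.skill.n_unique' through nested dicts."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), data)


rows = resolve(findings, "row_count")
distinct = resolve(findings, "columns.question.n_unique")
dupes = resolve(findings, "columns.question.stats.n_duplicates")
rate = resolve(findings, "columns.question.stats.duplicate_rate")

# Cross-check the cited numbers against each other.
assert dupes == rows - distinct          # 12383 - 9375
assert round(dupes / rows, 4) == rate    # 3008 / 12383
```

The two assertions confirm that the duplicate count and duplicate rate in the summary are consistent with the row count and the number of distinct questions.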

Out[4]:

saturn.schema() · 6 columns

column | kind | n | null% | unique | alerts
subject | categorical | 12,383 | 0.0% | 2 |
grade | categorical | 12,383 | 0.0% | 14 |
skill | categorical | 12,383 | 0.0% | 402 |
question | text | 12,383 | 0.0% | 9,375 | duplicates
choices | unknown | 12,383 | 0.0% |  | skipped
answer_idx | numeric | 12,383 | 0.0% | 4 |
Fig 1.
subject · Shows the heavy tilt toward language arts over social studies.
Top values for subject (2 unique shown, of 2 total).
value | count | share
language arts | 10068 | 81.3%
social studies | 2315 | 18.7%
Fig 2.
grade · Reveals which grade levels are best represented, with grade-5 leading.
Top values for grade (14 unique shown, of 14 total).
value | count | share
grade-5 | 2537 | 20.5%
grade-3 | 1559 | 12.6%
grade-8 | 1389 | 11.2%
grade-9 | 1025 | 8.3%
grade-6 | 1017 | 8.2%
grade-7 | 807 | 6.5%
grade-11 | 790 | 6.4%
grade-2 | 653 | 5.3%
grade-10 | 649 | 5.2%
grade-12 | 564 | 4.6%
grade-4 | 541 | 4.4%
kindergarten | 446 | 3.6%
grade-1 | 284 | 2.3%
pre-k | 122 | 1.0%
Fig 3.
answer_idx · Check whether correct-answer positions are evenly spread or biased toward 0 and 1.
Histogram bins for answer_idx (median: 1.0). The histogram uses 40 equal-width bins of 0.075 over the range 0 to 3; only the four bins containing the integer values are non-empty, and every other bin has a count of 0.
bin | count
0 – 0.075 | 5317
0.975 – 1.05 | 5132
1.95 – 2.025 | 1357
2.925 – 3 | 577
Fig 4.
question · Question lengths span from 3 to 6,441 characters; look for a long tail of unusually long prompts.
Character-length distribution for question (mean: 313.4).
chars | count
3 – 164 | 8306
164 – 325 | 2142
325 – 486 | 439
486 – 647 | 337
647 – 808 | 253
808 – 969 | 150
969 – 1130 | 65
1130 – 1291 | 30
1291 – 1452 | 32
1452 – 1612 | 52
1612 – 1773 | 38
1773 – 1934 | 20
1934 – 2095 | 30
2095 – 2256 | 28
2256 – 2417 | 43
2417 – 2578 | 42
2578 – 2739 | 26
2739 – 2900 | 67
2900 – 3061 | 51
3061 – 3222 | 27
3222 – 3383 | 12
3383 – 3544 | 13
3544 – 3705 | 22
3705 – 3866 | 13
3866 – 4027 | 13
4027 – 4188 | 5
4188 – 4349 | 10
4349 – 4510 | 21
4510 – 4671 | 15
4671 – 4832 | 23
4832 – 4992 | 12
4992 – 5153 | 13
5153 – 5314 | 11
5314 – 5475 | 4
5475 – 5636 | 9
5636 – 5797 | 6
5797 – 5958 | 1
5958 – 6119 | 0
6119 – 6280 | 0
6280 – 6441 | 2
Fig 5.
skill · Top skills are fairly evenly distributed across 402 categories — no single skill dominates.
Top values for skill (20 unique shown, of 402 total).
value | count | share
understand-overall-supply-and-demand | 238 | 1.9%
choose-between-adjectives-and-adverbs | 176 | 1.4%
determine-the-meanings-of-greek-and-latin-roots | 165 | 1.3%
describe-the-difference-between-related-words | 162 | 1.3%
costs-and-benefits | 160 | 1.3%
is-it-a-complete-sentence-or-a-fragment | 156 | 1.3%
use-greek-and-latin-roots-as-clues-to-the-meanings-of-words | 137 | 1.1%
identify-vague-pronoun-references | 136 | 1.1%
what-does-the-punctuation-suggest | 131 | 1.1%
determine-the-meanings-of-words-with-greek-and-latin-roots | 128 | 1.0%
use-the-correct-homophone | 128 | 1.0%
analogies | 127 | 1.0%
classify-logical-fallacies | 123 | 1.0%
is-it-a-phrase-or-a-clause | 117 | 0.9%
is-the-sentence-declarative-interrogative-imperative-or-exclamatory | 116 | 0.9%
analogies-challenge | 113 | 0.9%
is-the-sentence-simple-compound-complex-or-compound-complex | 108 | 0.9%
is-it-a-complete-sentence-or-a-run-on | 107 | 0.9%
use-guide-words | 106 | 0.9%
recall-the-source-of-an-allusion | 105 | 0.8%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column | kind | null %
subject | categorical | 0.0%
grade | categorical | 0.0%
skill | categorical | 0.0%
question | text | 0.0%
choices | unknown | 0.0%
answer_idx | numeric | 0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 4,381 detected strings).
lang | count | share
en | 4379 | 100.0%
es | 2 | 0.0%

subject categorical label

Binary subject indicator with only two values: 'language arts' (10,068 rows, 81.3%) and 'social studies' (2,315 rows, 18.7%). No nulls across 12,383 rows. The class imbalance is notable, roughly 4.3:1 in favor of language arts, and it will skew any per-subject aggregation or modelling.

Treatment: One-hot or binary encode; account for the roughly 4.3:1 imbalance when stratifying or modelling.
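
One way to act on this treatment is sketched below: a binary code for the two subjects plus inverse-frequency ("balanced") class weights computed from the counts in this report. Which class gets code 1 and the choice of weighting scheme are assumptions for illustration; the report itself does not prescribe either.

```python
# Counts are taken from the report; everything else is an illustrative choice.
counts = {"language arts": 10068, "social studies": 2315}

total = sum(counts.values())
n_classes = len(counts)

# Binary encoding: 0 for the majority class, 1 for the minority class (assumed convention).
encode = {"language arts": 0, "social studies": 1}

# Balanced weights: total / (n_classes * class_count), so that each class
# contributes the same total weight to a loss or a weighted aggregate.
weights = {label: total / (n_classes * count) for label, count in counts.items()}

ratio = counts["language arts"] / counts["social studies"]
print(f"imbalance ratio: {ratio:.2f}:1")
for label, weight in weights.items():
    print(f"{label}: code={encode[label]} weight={weight:.3f}")
```

With these weights, the weighted row totals of the two classes are equal, which is the property a stratified or reweighted analysis usually wants.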

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["subject"].stats

stat | value
n | 12,383
nulls | 0 (0.0%)
unique | 2
top_value | language arts
top_rate | 0.8131
cardinality | 2
entropy | 0.695
entropy_ratio | 0.695
Fig 8.
Top values for subject.

grade categorical feature

Categorical grade level with 14 unique values across 12,383 rows and no nulls. The distribution is fairly even (entropy ratio 0.92), though grade-5 leads at 20.5%. The 14 values cover pre-k, kindergarten, and grade-1 through grade-12, so the column follows a pre-K to 12 schema; pre-k (1.0%) and grade-1 (2.3%) are the smallest buckets.
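
The entropy figures can be checked by hand. The sketch below assumes that `entropy` is Shannon entropy in bits and that `entropy_ratio` is that entropy divided by log2 of the number of unique values; the report does not define either stat, but this reading agrees with the reported values for both grade (3.513, 0.9227) and subject (0.695, 0.695).

```python
import math

# Counts from the report's top-value tables.
grade_counts = [2537, 1559, 1389, 1025, 1017, 807, 790,
                653, 649, 564, 541, 446, 284, 122]
subject_counts = [10068, 2315]


def entropy_bits(counts):
    """Shannon entropy in bits of the distribution implied by the counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories (assumed definition)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


for name, counts in [("grade", grade_counts), ("subject", subject_counts)]:
    print(f"{name}: entropy={entropy_bits(counts):.3f} ratio={entropy_ratio(counts):.4f}")
```

A ratio of 1.0 would mean every category is equally common, so 0.92 for grade indicates a fairly flat distribution despite grade-5 being the largest bucket.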

Treatment: Encode as ordinal (extract numeric grade level) before modelling.
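
A minimal sketch of that ordinal encoding follows. The label spellings come from the grade table; placing pre-k at -1 and kindergarten at 0 is an assumption made so the ordering stays monotonic, and the `grade_to_ordinal` helper is hypothetical.

```python
import re


def grade_to_ordinal(label: str) -> int:
    """Map 'pre-k' -> -1, 'kindergarten' -> 0, and 'grade-N' -> N."""
    special = {"pre-k": -1, "kindergarten": 0}  # assumed positions below grade-1
    if label in special:
        return special[label]
    match = re.fullmatch(r"grade-(\d{1,2})", label)
    if match is None:
        raise ValueError(f"unrecognised grade label: {label!r}")
    return int(match.group(1))


labels = ["grade-5", "grade-12", "kindergarten", "pre-k", "grade-1"]
print(sorted(labels, key=grade_to_ordinal))
# ['pre-k', 'kindergarten', 'grade-1', 'grade-5', 'grade-12']
```

Raising on unknown labels, instead of silently mapping them to a default, makes any spelling not seen in this profile surface immediately.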

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["grade"].stats

stat | value
n | 12,383
nulls | 0 (0.0%)
unique | 14
top_value | grade-5
top_rate | 0.2049
cardinality | 14
entropy | 3.513
entropy_ratio | 0.9227
Fig 9.
Top values for grade.

skill categorical label

Slug-style identifiers for educational skills (e.g., "understand-overall-supply-and-demand", "choose-between-adjectives-and-adverbs"), spanning topics from economics to grammar and vocabulary. With 402 unique values across 12,383 rows and entropy ratio 0.919, the distribution is nearly flat: the most common skill accounts for only 1.92% of rows. The content mixes domains (ELA and economics), suggesting this column tags items from a multi-subject curriculum.

Treatment: Treat as a high-cardinality categorical label; target-encode or embed rather than one-hot.
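
For illustration, target encoding with additive smoothing could look like the sketch below. The dataset has no obvious numeric target, so the 0/1 target here is made up, as are the `target_encode` helper and the smoothing strength `m`; only the skill slugs come from the report.

```python
from collections import defaultdict


def target_encode(categories, targets, m=10.0):
    """Return {category: smoothed mean target} using an m-estimate prior.

    Each category mean is pulled toward the global mean; rare categories
    are pulled hardest, which matters when there are 402 sparse levels.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + m * global_mean) / (counts[category] + m)
        for category in counts
    }


# Toy rows: slugs are real, the binary target is hypothetical.
skills = ["analogies", "analogies", "analogies", "use-guide-words"]
target = [1, 1, 0, 0]
encoding = target_encode(skills, target, m=2.0)
print(encoding)
```

In practice the encoding should be fitted on the training fold only and then applied to held-out rows, otherwise the target leaks into the feature.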

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["skill"].stats

stat | value
n | 12,383
nulls | 0 (0.0%)
unique | 402
top_value | understand-overall-supply-and-demand
top_rate | 0.01922
cardinality | 402
entropy | 7.949
entropy_ratio | 0.9188
Fig 10.
Top values for skill.

question text free_text

Free-text question prompts, almost entirely English (4,379 en vs 2 es among the sampled strings) with a wide length spread (min 3, median 120, max 6,441 characters; mean 52 words, median 20). High duplication is the standout: 3,008 rows (24.3%) are repeats, which is the 12,383 rows minus the 9,375 distinct questions. Stems like 'Which sentence is correct?' (171) and 'Which sentence states a fact?' (146) dominate, suggesting templated educational or grammar items. Readability is fairly easy (mean Flesch 73.85) and the vocabulary is moderate (31,660 unique tokens across 9,375 distinct questions).

Treatment: Tokenize/embed for modelling and decide whether to dedupe the 3008 repeated prompts before train/test splitting to avoid leakage.
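
Two ways to handle the repeats are sketched below: dropping them, or keeping them but routing every copy of a question to the same split by hashing its text. The normalisation (lower-casing and collapsing whitespace) is an assumption and may be looser than the exact-string matching the profiler appears to use; the helper names and the 0.2 test fraction are hypothetical.

```python
import hashlib


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedupe(texts):
    """Keep the first occurrence of each normalised question, in order."""
    seen, kept = set(), []
    for text in texts:
        key = normalise(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def assign_split(text: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a question to 'train' or 'test' from its hash."""
    digest = hashlib.sha256(normalise(text).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


questions = [
    "Which sentence is correct?",
    "Which sentence states a fact?",
    "which sentence is   correct?",
]
print(len(dedupe(questions)))
print([assign_split(q) for q in questions])
```

Because the split depends only on the normalised text, identical prompts can never straddle train and test, even if the rows are shuffled or the data is re-exported.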

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["question"].stats

stat | value
n | 12,383
nulls | 0 (0.0%)
unique | 9,375
len_min | 3
len_max | 6,441
len_mean | 313.4
len_median | 120
len_p95 | 1502
word_mean | 52.2
word_median | 20
n_empty | 0
n_duplicates | 3,008
duplicate_rate | 0.2429
vocab_size | 31,660
readability_flesch_mean | 73.85
emoji_rate | 0
url_rate | 0
one_word_rate | 0.0002423
allcaps_rate | 0.000323
boilerplate_rate | 0
alert: duplicates (24.3% duplicate strings)
Fig 11.
Character-length distribution for question.

choices unknown other

The column 'choices' was skipped by the profiler (kind 'unknown'), so no descriptive statistics, cardinality, or value samples are available beyond a row count of 12383 and a null rate of 0.0. Without type inference or distributional signals it is impossible to characterise the contents from this evidence alone. The name suggests it may hold structured option sets (e.g. lists or JSON), which would explain why the dissector could not coerce it into a primitive type.

Treatment: Inspect raw values manually and parse (likely JSON/list) before deciding on a downstream representation.
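
Since the profile gives no evidence about how choices is stored, a tolerant parser is a reasonable first step. The sketch below is hypothetical: it accepts a native list or tuple, an array-like with `tolist()` (list columns often arrive that way when Parquet is read through pandas), a JSON string, or a Python-literal string, and raises on anything else so unexpected shapes are noticed.

```python
import ast
import json


def parse_choices(raw):
    """Best-effort conversion of one raw 'choices' cell to a list of strings."""
    if hasattr(raw, "tolist"):  # e.g. a NumPy array from a Parquet list column
        raw = raw.tolist()
    if isinstance(raw, (list, tuple)):
        return [str(item) for item in raw]
    if isinstance(raw, str):
        for loader in (json.loads, ast.literal_eval):
            try:
                value = loader(raw)
            except (ValueError, SyntaxError):
                continue  # not valid for this loader; try the next one
            if isinstance(value, (list, tuple)):
                return [str(item) for item in value]
    raise TypeError(f"cannot parse choices cell of type {type(raw).__name__}")


def answer_in_range(raw_choices, answer_idx: int) -> bool:
    """Check that answer_idx points at one of the parsed choices."""
    return 0 <= answer_idx < len(parse_choices(raw_choices))


print(parse_choices('["a cat", "a dog"]'))
print(answer_in_range("['yes', 'no']", 3))
```

Running `answer_in_range` over all rows would also show how many questions have fewer than four options, which is useful context for the answer_idx distribution.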

anthropic:claude-opus-4-7 · confidence low
Out[25]:

saturn.columns["choices"].stats

stat | value
n | 12,383
nulls | 0 (0.0%)
unique |
alert: skipped (no profiler for kind=unknown)

answer_idx numeric label

This is almost certainly a categorical answer index encoded as an integer: it takes only 4 distinct values, 0 to 3, across 12,383 rows with no nulls. The distribution is weighted toward the low end: 42.9% of rows are 0 and 41.4% are 1, the median is 1, and the positive skew (0.94) reflects how rare indices 2 (11.0%) and 3 (4.7%) are. The 577 flagged outliers (4.7%) are exactly the rows with index 3, an artifact of applying numeric outlier rules to what is really a small categorical set.

Treatment: Treat as a categorical class label, not a numeric value.
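
The outlier artifact can be reproduced from the four non-empty histogram bins. The sketch below assumes the profiler uses Tukey's 1.5 × IQR fences; the report does not say which rule it applies, but that rule flags exactly the 577 rows with index 3. The `quantile_from_counts` helper is hypothetical and uses no interpolation.

```python
from collections import Counter

# Non-empty histogram bins from the report: value -> row count.
counts = Counter({0: 5317, 1: 5132, 2: 1357, 3: 577})
n = sum(counts.values())


def quantile_from_counts(value_counts, q):
    """Smallest value whose cumulative share reaches q (no interpolation)."""
    total = sum(value_counts.values())
    running = 0
    for value in sorted(value_counts):
        running += value_counts[value]
        if running / total >= q:
            return value
    return max(value_counts)


q1 = quantile_from_counts(counts, 0.25)
q3 = quantile_from_counts(counts, 0.75)
upper_fence = q3 + 1.5 * (q3 - q1)  # Tukey fence (assumed rule)
outliers = sum(c for value, c in counts.items() if value > upper_fence)

# Per-class shares are the more useful view for a categorical label.
shares = {value: round(c / n, 4) for value, c in sorted(counts.items())}
mean = sum(value * c for value, c in counts.items()) / n

print(q1, q3, upper_fence, outliers)
print(shares, round(mean, 4))
```

The same counts also recover the reported mean (0.7734), zero_rate (0.4294), and outlier_rate (0.0466), which suggests the histogram and the stats table describe the same four-class distribution.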

anthropic:claude-opus-4-7 · confidence high
Out[27]:

saturn.columns["answer_idx"].stats

stat | value
n | 12,383
nulls | 0 (0.0%)
unique | 4
min | 0
max | 3
mean | 0.7734
median | 1
std | 0.821
q1 | 0
q3 | 1
iqr | 1
skew | 0.9448
kurtosis | 0.4078
n_outliers | 577
outlier_rate | 0.0466
zero_rate | 0.4294
Fig 12.
Distribution of answer_idx. Vertical dash marks the median.

How to cite

BibTeX
@misc{saturn-quirky-social-norms-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky social norms},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-social_norms}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky social norms. Source: /home/coolhand/html/datavis/data_trove/cache/quirky/social_norms.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-social_norms