scars master dataset

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv

Saturn profiled 3,221 rows across 20 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv",
    "--findings", "scars-master_dataset.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This is a county-level US dataset of 3,221 rows and 20 columns combining demographics (population by race, poverty, income), 2016 and 2020 presidential vote shares, and geographic identifiers (FIPS, state, county). Two data-quality issues stand out and should be addressed first. First, median_household_income contains a sentinel value of -666,666,666 (a single row, per its histogram) that drags the mean negative. Second, margin_2016 is stored as text percentages (e.g. '15.17%') while margin_2020 is numeric, so the two election cycles aren't directly comparable without cleaning. The political columns themselves are well-formed and show a Republican-leaning county distribution (mean republican_pct_2020 ≈ 0.65 vs democratic_pct_2020 ≈ 0.33). Population and demographic counts are heavily right-skewed with many outliers, as expected when mixing rural counties with metros up to ~10M people, so log scales or percentage shares (already provided as pct_white, pct_black, pct_hispanic) will be more informative than raw counts.

citing: median_household_income · margin_2016 · margin_2020 · republican_pct_2020 · democratic_pct_2020 · total_population · pct_white · pct_black · pct_hispanic · poverty_rate
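
One way to put the two election cycles on a common scale is to parse the margin_2016 strings into fractions. The sketch below is plain Python and only illustrative: it assumes that a value such as '15.17%' corresponds to 0.1517 on the same scale that margin_2020 appears to use (its histogram spans roughly -0.87 to +0.93). The helper name parse_percent is hypothetical and is not part of saturn.

```python
from typing import Optional


def parse_percent(text: Optional[str]) -> Optional[float]:
    """Convert a string like '15.17%' to a fraction (0.1517).

    Assumption: margin_2016 holds percent strings and margin_2020 holds
    fractions, so dividing by 100 puts both on the same scale.
    Empty or missing values are returned as None.
    """
    if text is None:
        return None
    cleaned = text.strip().rstrip("%").replace(",", "")
    if not cleaned:
        return None
    return float(cleaned) / 100.0


# Illustrative values only; the last two mimic missing entries.
raw_margin_2016 = ["15.17%", "-3.5%", "", None]
margin_2016 = [parse_percent(v) for v in raw_margin_2016]
print(margin_2016)
```

Returning None for blanks keeps the 2.6% of missing margin_2016 values distinguishable from a true margin of zero.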

Out[4]:

saturn.schema() · 20 columns

column kind n null% unique alerts
NAME text 3,221 0.0% 3,221 near_unique
total_population numeric 3,221 0.0% 3,160 high_skew outliers
black_population numeric 3,221 0.0% 2,066 high_skew outliers
white_population numeric 3,221 0.0% 3,143 high_skew outliers
hispanic_population numeric 3,221 0.0% 2,331 high_skew outliers
state numeric 3,221 0.0% 52
county numeric 3,221 0.0% 326 high_skew outliers
FIPS numeric 3,221 0.0% 3,221
pct_black numeric 3,221 0.0% 3,128 high_skew outliers
pct_white numeric 3,221 0.0% 3,218
pct_hispanic numeric 3,221 0.0% 3,205 high_skew outliers
poverty_rate numeric 3,221 0.0% 3,219 high_skew
below_poverty_level numeric 3,221 0.0% 2,824 high_skew outliers
median_household_income numeric 3,221 0.0% 3,099 high_skew outliers
margin_2020 numeric 3,221 3.4% 3,112
democratic_pct_2020 numeric 3,221 3.4% 3,111
republican_pct_2020 numeric 3,221 3.4% 3,111
margin_2016 text 3,221 2.6% 2,554 one_word allcaps short_text
democratic_pct_2016 numeric 3,221 2.6% 3,111
republican_pct_2016 numeric 3,221 2.6% 3,111
Fig 1.
republican_pct_2020 · Distribution of 2020 Republican vote share across counties. Most counties sit at high Republican shares (median near 0.68), with a thinner tail toward low shares.
Histogram bins for republican_pct_2020 (median: 0.6829120557612961).
bin count
0.05397 – 0.07667 1
0.07667 – 0.09937 2
0.09937 – 0.1221 2
0.1221 – 0.1448 6
0.1448 – 0.1675 6
0.1675 – 0.1901 15
0.1901 – 0.2128 5
0.2128 – 0.2355 13
0.2355 – 0.2582 12
0.2582 – 0.2809 25
0.2809 – 0.3036 26
0.3036 – 0.3263 32
0.3263 – 0.349 32
0.349 – 0.3717 40
0.3717 – 0.3944 46
0.3944 – 0.4171 52
0.4171 – 0.4398 64
0.4398 – 0.4625 78
0.4625 – 0.4852 61
0.4852 – 0.5079 78
0.5079 – 0.5306 66
0.5306 – 0.5533 97
0.5533 – 0.576 126
0.576 – 0.5987 122
0.5987 – 0.6214 139
0.6214 – 0.6441 143
0.6441 – 0.6668 154
0.6668 – 0.6895 173
0.6895 – 0.7122 176
0.7122 – 0.7349 200
0.7349 – 0.7576 213
0.7576 – 0.7802 195
0.7802 – 0.8029 203
0.8029 – 0.8256 169
0.8256 – 0.8483 132
0.8483 – 0.871 103
0.871 – 0.8937 65
0.8937 – 0.9164 27
0.9164 – 0.9391 9
0.9391 – 0.9618 4
Fig 2.
margin_2020 · 2020 margin ranges from -0.87 to +0.93; check the bimodality and how many counties fall on each side of zero.
Histogram bins for margin_2020 (median: 0.3843813151543954).
bin count
-0.8675 – -0.8226 1
-0.8226 – -0.7776 2
-0.7776 – -0.7326 3
-0.7326 – -0.6877 5
-0.6877 – -0.6427 8
-0.6427 – -0.5978 12
-0.5978 – -0.5528 6
-0.5528 – -0.5078 12
-0.5078 – -0.4629 14
-0.4629 – -0.4179 22
-0.4179 – -0.373 28
-0.373 – -0.328 33
-0.328 – -0.283 29
-0.283 – -0.2381 46
-0.2381 – -0.1931 41
-0.1931 – -0.1482 54
-0.1482 – -0.1032 63
-0.1032 – -0.05823 67
-0.05823 – -0.01327 69
-0.01327 – 0.03169 73
0.03169 – 0.07665 70
0.07665 – 0.1216 90
0.1216 – 0.1666 131
0.1666 – 0.2115 117
0.2115 – 0.2565 129
0.2565 – 0.3015 141
0.3015 – 0.3464 159
0.3464 – 0.3914 165
0.3914 – 0.4363 181
0.4363 – 0.4813 206
0.4813 – 0.5263 195
0.5263 – 0.5712 197
0.5712 – 0.6162 213
0.6162 – 0.6611 175
0.6611 – 0.7061 140
0.7061 – 0.7511 102
0.7511 – 0.796 69
0.796 – 0.841 29
0.841 – 0.8859 11
0.8859 – 0.9309 4
Fig 3.
pct_white · County racial composition is left-skewed with a median of ~88% white; watch the long tail of diverse counties.
Histogram bins for pct_white (median: 87.65979926043318).
bin count
3.29 – 5.708 2
5.708 – 8.125 1
8.125 – 10.54 4
10.54 – 12.96 2
12.96 – 15.38 10
15.38 – 17.8 6
17.8 – 20.21 4
20.21 – 22.63 7
22.63 – 25.05 12
25.05 – 27.47 12
27.47 – 29.89 7
29.89 – 32.3 4
32.3 – 34.72 10
34.72 – 37.14 13
37.14 – 39.56 24
39.56 – 41.97 18
41.97 – 44.39 24
44.39 – 46.81 30
46.81 – 49.23 24
49.23 – 51.64 34
51.64 – 54.06 33
54.06 – 56.48 37
56.48 – 58.9 58
58.9 – 61.32 43
61.32 – 63.73 69
63.73 – 66.15 72
66.15 – 68.57 77
68.57 – 70.99 76
70.99 – 73.4 88
73.4 – 75.82 100
75.82 – 78.24 98
78.24 – 80.66 143
80.66 – 83.08 132
83.08 – 85.49 163
85.49 – 87.91 191
87.91 – 90.33 246
90.33 – 92.75 302
92.75 – 95.16 453
95.16 – 97.58 525
97.58 – 100 67
Fig 4.
poverty_rate · Poverty rate centers around 14% with a right tail past 60% — useful for spotting economically distressed counties.
Histogram bins for poverty_rate (median: 13.807805224676027).
bin count
0 – 1.655 2
1.655 – 3.31 9
3.31 – 4.964 48
4.964 – 6.619 123
6.619 – 8.274 212
8.274 – 9.929 317
9.929 – 11.58 378
11.58 – 13.24 392
13.24 – 14.89 353
14.89 – 16.55 338
16.55 – 18.2 235
18.2 – 19.86 192
19.86 – 21.51 157
21.51 – 23.17 108
23.17 – 24.82 77
24.82 – 26.48 53
26.48 – 28.13 41
28.13 – 29.79 37
29.79 – 31.44 29
31.44 – 33.1 13
33.1 – 34.75 10
34.75 – 36.41 11
36.41 – 38.06 7
38.06 – 39.72 2
39.72 – 41.37 8
41.37 – 43.03 6
43.03 – 44.68 7
44.68 – 46.33 11
46.33 – 47.99 6
47.99 – 49.64 9
49.64 – 51.3 8
51.3 – 52.95 5
52.95 – 54.61 6
54.61 – 56.26 2
56.26 – 57.92 2
57.92 – 59.57 3
59.57 – 61.23 1
61.23 – 62.88 1
62.88 – 64.54 0
64.54 – 66.19 2
Fig 5.
median_household_income · Flag the data-quality issue: a sentinel value of -666,666,666 distorts the mean; filter to positives before plotting.
Histogram bins for median_household_income (median: 52380.0).
bin count
-6.667e+08 – -6.5e+08 1
-6.5e+08 – -6.333e+08 0
-6.333e+08 – -6.167e+08 0
-6.167e+08 – -6e+08 0
-6e+08 – -5.833e+08 0
-5.833e+08 – -5.666e+08 0
-5.666e+08 – -5.5e+08 0
-5.5e+08 – -5.333e+08 0
-5.333e+08 – -5.166e+08 0
-5.166e+08 – -5e+08 0
-5e+08 – -4.833e+08 0
-4.833e+08 – -4.666e+08 0
-4.666e+08 – -4.5e+08 0
-4.5e+08 – -4.333e+08 0
-4.333e+08 – -4.166e+08 0
-4.166e+08 – -3.999e+08 0
-3.999e+08 – -3.833e+08 0
-3.833e+08 – -3.666e+08 0
-3.666e+08 – -3.499e+08 0
-3.499e+08 – -3.333e+08 0
-3.333e+08 – -3.166e+08 0
-3.166e+08 – -2.999e+08 0
-2.999e+08 – -2.832e+08 0
-2.832e+08 – -2.666e+08 0
-2.666e+08 – -2.499e+08 0
-2.499e+08 – -2.332e+08 0
-2.332e+08 – -2.166e+08 0
-2.166e+08 – -1.999e+08 0
-1.999e+08 – -1.832e+08 0
-1.832e+08 – -1.666e+08 0
-1.666e+08 – -1.499e+08 0
-1.499e+08 – -1.332e+08 0
-1.332e+08 – -1.165e+08 0
-1.165e+08 – -9.987e+07 0
-9.987e+07 – -8.32e+07 0
-8.32e+07 – -6.653e+07 0
-6.653e+07 – -4.986e+07 0
-4.986e+07 – -3.319e+07 0
-3.319e+07 – -1.652e+07 0
-1.652e+07 – 1.471e+05 3220
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column kind null %
NAME text 0.0%
total_population numeric 0.0%
black_population numeric 0.0%
white_population numeric 0.0%
hispanic_population numeric 0.0%
state numeric 0.0%
county numeric 0.0%
FIPS numeric 0.0%
pct_black numeric 0.0%
pct_white numeric 0.0%
pct_hispanic numeric 0.0%
poverty_rate numeric 0.0%
below_poverty_level numeric 0.0%
median_household_income numeric 0.0%
margin_2020 numeric 3.4%
democratic_pct_2020 numeric 3.4%
republican_pct_2020 numeric 3.4%
margin_2016 text 2.6%
democratic_pct_2016 numeric 2.6%
republican_pct_2016 numeric 2.6%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
column total_population black_population white_population hispanic_population state county FIPS pct_black pct_white pct_hispanic poverty_rate below_poverty_level
total_population +1.00 +0.69 +0.98 +0.88 -0.06 -0.10 -0.06 +0.04 -0.19 +0.08 -0.15 +0.93
black_population +0.69 +1.00 +0.63 +0.44 -0.01 -0.03 -0.01 +0.27 -0.29 +0.01 -0.04 +0.77
white_population +0.98 +0.63 +1.00 +0.86 -0.06 -0.11 -0.06 +0.01 -0.12 +0.07 -0.17 +0.89
hispanic_population +0.88 +0.44 +0.86 +1.00 -0.06 -0.07 -0.06 -0.01 -0.15 +0.22 -0.02 +0.86
state -0.06 -0.01 -0.06 -0.06 +1.00 +0.07 +1.00 -0.07 +0.02 +0.37 +0.22 -0.05
county -0.10 -0.03 -0.11 -0.07 +0.07 +1.00 +0.07 +0.09 -0.07 +0.07 +0.11 -0.08
FIPS -0.06 -0.01 -0.06 -0.06 +1.00 +0.07 +1.00 -0.07 +0.02 +0.37 +0.22 -0.05
pct_black +0.04 +0.27 +0.01 -0.01 -0.07 +0.09 -0.07 +1.00 -0.79 -0.02 +0.39 +0.10
pct_white -0.19 -0.29 -0.12 -0.15 +0.02 -0.07 +0.02 -0.79 +1.00 -0.24 -0.48 -0.23
pct_hispanic +0.08 +0.01 +0.07 +0.22 +0.37 +0.07 +0.37 -0.02 -0.24 +1.00 +0.53 +0.14
poverty_rate -0.15 -0.04 -0.17 -0.02 +0.22 +0.11 +0.22 +0.39 -0.48 +0.53 +1.00 -0.01
below_poverty_level +0.93 +0.77 +0.89 +0.86 -0.05 -0.08 -0.05 +0.10 -0.23 +0.14 -0.01 +1.00
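
The +1.00 between state and FIPS in this matrix is what one would expect if FIPS is built as state code times 1000 plus county code, so the state part dominates the variance. The sketch below is a plain-Python Pearson coefficient applied to made-up code pairs; the FIPS layout is an assumption based on the usual five-digit convention, and the function name pearson is illustrative rather than saturn's implementation.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up (state, county) code pairs, not rows from the dataset.
states = [1, 1, 6, 6, 48, 48, 72, 72]
counties = [1, 3, 37, 73, 201, 453, 1, 153]
# Assumption: FIPS = state * 1000 + county, the usual 5-digit layout.
fips = [s * 1000 + c for s, c in zip(states, counties)]

r_state_fips = pearson(states, fips)
print(round(r_state_fips, 4))
```

A correlation this close to 1 between two identifier columns says nothing about the data itself, which is one reason to drop code columns before reading a correlation matrix.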

NAME text identifier

This column appears to hold US county names with state qualifiers: 'county,' appears in 3,007 of 3,221 rows, followed by state tokens like Texas (256), Virginia (189), and Georgia (159). Every value is unique (n_unique = 3221, duplicate_rate = 0.0) with no nulls, and lengths cluster tightly around 24 characters (min 16, max 42), consistent with a canonical 'X County, State' format in which the state is spelled out. The near_unique alert confirms this behaves as an identifier rather than a categorical feature.

Treatment: Use as a join key to county-level reference tables; do not feed as a categorical feature.

anthropic:claude-opus-4-7 · confidence high
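
If the county and state parts are needed separately, for example to match against a reference table, the name can be split at its last comma. This is a minimal sketch assuming the 'X County, State' shape described above holds; split_county_name is a hypothetical helper and the sample strings are illustrative.

```python
from typing import Tuple


def split_county_name(name: str) -> Tuple[str, str]:
    """Split 'X County, State' at the last ', ' into (county, state)."""
    county, sep, state = name.rpartition(", ")
    if not sep:
        # No separator found: treat the whole string as the county part.
        return name, ""
    return county, state


for sample in ["Autauga County, Alabama", "Los Angeles County, California"]:
    print(split_county_name(sample))
```

Splitting on the last separator rather than the first keeps any comma inside the county part intact.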
Out[13]:

saturn.columns["NAME"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 3,221
len_min 16
len_max 42
len_mean 24.27
len_median 24
len_p95 31
word_mean 3.243
word_median 3
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 1,983
readability_flesch_mean 7.581
emoji_rate 0
url_rate 0
one_word_rate 0
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
Fig 8.
Character-length distribution for NAME.
Character-length distribution for NAME (mean: 24.26637690158336).
chars count
16 – 17 3
17 – 17 23
17 – 18 0
18 – 19 72
19 – 19 121
19 – 20 0
20 – 21 190
21 – 21 264
21 – 22 0
22 – 22 407
22 – 23 420
23 – 24 0
24 – 24 363
24 – 25 320
25 – 26 0
26 – 26 240
26 – 27 233
27 – 28 0
28 – 28 153
28 – 29 0
29 – 30 142
30 – 30 86
30 – 31 0
31 – 32 81
32 – 32 41
32 – 33 0
33 – 34 28
34 – 34 16
34 – 35 0
35 – 36 10
36 – 36 4
36 – 37 0
37 – 37 0
37 – 38 1
38 – 39 0
39 – 39 1
39 – 40 0
40 – 41 0
41 – 41 1
41 – 42 1

total_population numeric feature

Likely a county- or area-level total population count, given 3,221 rows with no nulls and a minimum of 117 alongside a maximum of 10,040,682. The distribution is severely right-skewed (skew 13.67, kurtosis 311.9) with the mean (102,398) nearly four times the median (25,981) and 441 outliers (13.7%). A few mega-population areas dominate while most are small.

Treatment: log-transform before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
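
The outlier alert for this column is worded as "rows beyond 1.5 IQR", which reads like the usual Tukey fence rule. Assuming that is what saturn applies (the source does not spell out the formula), the fences can be worked out from the q1 and q3 values in this column's stats table:

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 for total_population as reported in the stats table.
lower, upper = tukey_fences(11_125, 66_969)
print(lower, upper)  # -72641.0 150735.0
```

Under that assumption, any county above roughly 150,735 residents is flagged, and the lower fence is negative, so no small county can be flagged at all. That fits the description of a heavy right tail rather than genuinely anomalous rows.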
Out[16]:

saturn.columns["total_population"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 3,160
min 117
max 1.004e+07
mean 1.024e+05
median 25,981
std 3.283e+05
q1 11,125
q3 66,969
iqr 55,844
skew 13.67
kurtosis 311.9
n_outliers 441
outlier_rate 0.1369
zero_rate 0
alert: high_skew · skew=+13.67
alert: outliers · 13.7% rows beyond 1.5 IQR
Fig 9.
Distribution of total_population. Vertical dash marks the median.
Histogram bins for total_population (median: 25981.0).
bin count
117 – 2.511e+05 2946
2.511e+05 – 5.021e+05 135
5.021e+05 – 7.532e+05 53
7.532e+05 – 1.004e+06 42
1.004e+06 – 1.255e+06 12
1.255e+06 – 1.506e+06 9
1.506e+06 – 1.757e+06 6
1.757e+06 – 2.008e+06 3
2.008e+06 – 2.259e+06 4
2.259e+06 – 2.51e+06 2
2.51e+06 – 2.761e+06 3
2.761e+06 – 3.012e+06 0
3.012e+06 – 3.263e+06 1
3.263e+06 – 3.514e+06 1
3.514e+06 – 3.765e+06 0
3.765e+06 – 4.016e+06 0
4.016e+06 – 4.267e+06 0
4.267e+06 – 4.518e+06 1
4.518e+06 – 4.769e+06 1
4.769e+06 – 5.02e+06 0
5.02e+06 – 5.271e+06 1
5.271e+06 – 5.522e+06 0
5.522e+06 – 5.773e+06 0
5.773e+06 – 6.024e+06 0
6.024e+06 – 6.275e+06 0
6.275e+06 – 6.526e+06 0
6.526e+06 – 6.777e+06 0
6.777e+06 – 7.029e+06 0
7.029e+06 – 7.28e+06 0
7.28e+06 – 7.531e+06 0
7.531e+06 – 7.782e+06 0
7.782e+06 – 8.033e+06 0
8.033e+06 – 8.284e+06 0
8.284e+06 – 8.535e+06 0
8.535e+06 – 8.786e+06 0
8.786e+06 – 9.037e+06 0
9.037e+06 – 9.288e+06 0
9.288e+06 – 9.539e+06 0
9.539e+06 – 9.79e+06 0
9.79e+06 – 1.004e+07 1

black_population numeric feature

Numeric count of Black residents per record (likely US county-level given n=3221), ranging from 0 to 1,202,260 with a median of just 859. The distribution is extremely right-skewed (skew 10.46, kurtosis 148.2) with 13.6% flagged as outliers and a std (54,952) over four times the mean (12,914), reflecting a few major metros dominating a long tail of small counties. About 2.8% of rows are zero and there are no nulls.

Treatment: Log-transform (log1p) before modelling or normalise as a share of total population.

anthropic:claude-opus-4-7 · confidence high
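
The reason to prefer log1p over a plain log here is the 2.8% of rows that are exactly zero: log(0) is undefined, while log1p(0) is 0. A short sketch using the min, q1, median, q3 and max reported for this column as sample inputs (they stand in for real rows):

```python
import math

# min, q1, median, q3, max of black_population from the stats table.
black_population = [0, 114, 859, 5_553, 1_202_260]

# log1p(x) = log(1 + x): defined at zero and close to log(x) for large x.
log1p_values = [math.log1p(v) for v in black_population]

for raw, transformed in zip(black_population, log1p_values):
    print(f"{raw:>9,d} -> {transformed:.3f}")
```

On this scale the gap between the median and the maximum shrinks from three orders of magnitude to a factor of about two, which is usually what "taming the tail" means in practice.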
Out[19]:

saturn.columns["black_population"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 2,066
min 0
max 1.202e+06
mean 1.291e+04
median 859
std 5.495e+04
q1 114
q3 5,553
iqr 5,439
skew 10.46
kurtosis 148.2
n_outliers 438
outlier_rate 0.136
zero_rate 0.02825
alert: high_skew · skew=+10.46
alert: outliers · 13.6% rows beyond 1.5 IQR
Fig 10.
Distribution of black_population. Vertical dash marks the median.
Histogram bins for black_population (median: 859.0).
bin count
0 – 3.006e+04 2968
3.006e+04 – 6.011e+04 108
6.011e+04 – 9.017e+04 41
9.017e+04 – 1.202e+05 33
1.202e+05 – 1.503e+05 12
1.503e+05 – 1.803e+05 14
1.803e+05 – 2.104e+05 8
2.104e+05 – 2.405e+05 3
2.405e+05 – 2.705e+05 7
2.705e+05 – 3.006e+05 6
3.006e+05 – 3.306e+05 2
3.306e+05 – 3.607e+05 2
3.607e+05 – 3.907e+05 2
3.907e+05 – 4.208e+05 2
4.208e+05 – 4.508e+05 0
4.508e+05 – 4.809e+05 2
4.809e+05 – 5.11e+05 2
5.11e+05 – 5.41e+05 0
5.41e+05 – 5.711e+05 2
5.711e+05 – 6.011e+05 1
6.011e+05 – 6.312e+05 0
6.312e+05 – 6.612e+05 1
6.612e+05 – 6.913e+05 1
6.913e+05 – 7.214e+05 0
7.214e+05 – 7.514e+05 0
7.514e+05 – 7.815e+05 0
7.815e+05 – 8.115e+05 2
8.115e+05 – 8.416e+05 0
8.416e+05 – 8.716e+05 0
8.716e+05 – 9.017e+05 1
9.017e+05 – 9.318e+05 0
9.318e+05 – 9.618e+05 0
9.618e+05 – 9.919e+05 0
9.919e+05 – 1.022e+06 0
1.022e+06 – 1.052e+06 0
1.052e+06 – 1.082e+06 0
1.082e+06 – 1.112e+06 0
1.112e+06 – 1.142e+06 0
1.142e+06 – 1.172e+06 0
1.172e+06 – 1.202e+06 1

white_population numeric feature

Counts of the white population per record (likely US counties given n=3221), ranging from 58 to 4,795,186 with a median of 21,282 but a mean of 72,000. The distribution is extremely right-skewed (skew 10.35, kurtosis 175.65) with 407 outliers (12.6%), reflecting a few very populous counties dwarfing the rest. No nulls or zeros, and near-unique values (3143/3221).

Treatment: log-transform before regression to tame the heavy right skew.

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["white_population"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 3,143
min 58
max 4.795e+06
mean 7.2e+04
median 21,282
std 1.918e+05
q1 8,855
q3 56,553
iqr 47,698
skew 10.35
kurtosis 175.7
n_outliers 407
outlier_rate 0.1264
zero_rate 0
alert: high_skew · skew=+10.35
alert: outliers · 12.6% rows beyond 1.5 IQR
Fig 11.
Distribution of white_population. Vertical dash marks the median.
Histogram bins for white_population (median: 21282.0).
bin count
58 – 1.199e+05 2795
1.199e+05 – 2.398e+05 216
2.398e+05 – 3.597e+05 74
3.597e+05 – 4.796e+05 47
4.796e+05 – 5.994e+05 29
5.994e+05 – 7.193e+05 24
7.193e+05 – 8.392e+05 6
8.392e+05 – 9.591e+05 9
9.591e+05 – 1.079e+06 3
1.079e+06 – 1.199e+06 3
1.199e+06 – 1.319e+06 4
1.319e+06 – 1.439e+06 3
1.439e+06 – 1.558e+06 1
1.558e+06 – 1.678e+06 0
1.678e+06 – 1.798e+06 1
1.798e+06 – 1.918e+06 1
1.918e+06 – 2.038e+06 0
2.038e+06 – 2.158e+06 0
2.158e+06 – 2.278e+06 1
2.278e+06 – 2.398e+06 0
2.398e+06 – 2.518e+06 0
2.518e+06 – 2.637e+06 0
2.637e+06 – 2.757e+06 1
2.757e+06 – 2.877e+06 1
2.877e+06 – 2.997e+06 0
2.997e+06 – 3.117e+06 0
3.117e+06 – 3.237e+06 0
3.237e+06 – 3.357e+06 1
3.357e+06 – 3.477e+06 0
3.477e+06 – 3.596e+06 0
3.596e+06 – 3.716e+06 0
3.716e+06 – 3.836e+06 0
3.836e+06 – 3.956e+06 0
3.956e+06 – 4.076e+06 0
4.076e+06 – 4.196e+06 0
4.196e+06 – 4.316e+06 0
4.316e+06 – 4.436e+06 0
4.436e+06 – 4.555e+06 0
4.555e+06 – 4.675e+06 0
4.675e+06 – 4.795e+06 1

hispanic_population numeric feature

Counts of Hispanic population per record (likely county- or tract-level given n=3221), ranging from 0 to 4,851,344 with a median of just 1,209. The distribution is extraordinarily right-skewed (skew 22.75, kurtosis 744.79) and 15.3% of rows flag as outliers, indicating a handful of very large jurisdictions dwarf the rest. Mean (19,427) sits far above the Q3 of 5,875, confirming a long heavy tail.

Treatment: Apply a log1p transform before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["hispanic_population"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 2,331
min 0
max 4.851e+06
mean 1.943e+04
median 1,209
std 1.251e+05
q1 377
q3 5,875
iqr 5,498
skew 22.75
kurtosis 744.8
n_outliers 492
outlier_rate 0.1527
zero_rate 0.004967
alert: high_skew · skew=+22.75
alert: outliers · 15.3% rows beyond 1.5 IQR
Fig 12.
Distribution of hispanic_population. Vertical dash marks the median.
Histogram bins for hispanic_population (median: 1209.0).
bin count
0 – 1.213e+05 3124
1.213e+05 – 2.426e+05 55
2.426e+05 – 3.639e+05 13
3.639e+05 – 4.851e+05 9
4.851e+05 – 6.064e+05 4
6.064e+05 – 7.277e+05 3
7.277e+05 – 8.49e+05 2
8.49e+05 – 9.703e+05 0
9.703e+05 – 1.092e+06 2
1.092e+06 – 1.213e+06 4
1.213e+06 – 1.334e+06 1
1.334e+06 – 1.455e+06 1
1.455e+06 – 1.577e+06 0
1.577e+06 – 1.698e+06 0
1.698e+06 – 1.819e+06 0
1.819e+06 – 1.941e+06 1
1.941e+06 – 2.062e+06 1
2.062e+06 – 2.183e+06 0
2.183e+06 – 2.304e+06 0
2.304e+06 – 2.426e+06 0
2.426e+06 – 2.547e+06 0
2.547e+06 – 2.668e+06 0
2.668e+06 – 2.79e+06 0
2.79e+06 – 2.911e+06 0
2.911e+06 – 3.032e+06 0
3.032e+06 – 3.153e+06 0
3.153e+06 – 3.275e+06 0
3.275e+06 – 3.396e+06 0
3.396e+06 – 3.517e+06 0
3.517e+06 – 3.639e+06 0
3.639e+06 – 3.76e+06 0
3.76e+06 – 3.881e+06 0
3.881e+06 – 4.002e+06 0
4.002e+06 – 4.124e+06 0
4.124e+06 – 4.245e+06 0
4.245e+06 – 4.366e+06 0
4.366e+06 – 4.487e+06 0
4.487e+06 – 4.609e+06 0
4.609e+06 – 4.73e+06 0
4.73e+06 – 4.851e+06 1

state numeric feature

Stored as numeric but with only 52 unique integer values across 3,221 rows, ranging from 1 to 72 with no nulls or zeros, this is almost certainly a FIPS-style state code rather than a true quantity. The near-symmetric spread (skew 0.157, kurtosis -0.626, mean 31.28 vs median 30) reflects how counties happen to be spread across state codes, not a meaningful distribution. The max of 72 is consistent with FIPS codes that extend past the 50 states to cover territories (72 is Puerto Rico).

Treatment: Cast to categorical and one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
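
As a rough illustration of the cast-then-encode step, the sketch below turns numeric state codes into zero-padded string labels and one-hot encodes them without any third-party library. The function name one_hot and the sample codes are illustrative; in practice a dataframe library's categorical encoder would do the same job.

```python
def one_hot(codes):
    """Return (categories, rows), one 0/1 list per input code."""
    # Zero-pad so that code 1 becomes the label '01', as in FIPS tables.
    labels = [f"{int(c):02d}" for c in codes]
    categories = sorted(set(labels))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        rows.append(row)
    return categories, rows


categories, rows = one_hot([1, 6, 48, 6, 72])
print(categories)
print(rows)
```

With 52 distinct codes this produces 52 indicator columns, which is where target encoding can become the more compact choice.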
Out[28]:

saturn.columns["state"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 52
min 1
max 72
mean 31.28
median 30
std 16.28
q1 19
q3 46
iqr 27
skew 0.157
kurtosis -0.6261
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 13.
Distribution of state. Vertical dash marks the median.
Histogram bins for state (median: 30.0).
bin count
1 – 2.775 97
2.775 – 4.55 15
4.55 – 6.325 133
6.325 – 8.1 64
8.1 – 9.875 8
9.875 – 11.65 4
11.65 – 13.42 226
13.42 – 15.2 5
15.2 – 16.98 44
16.98 – 18.75 194
18.75 – 20.52 204
20.52 – 22.3 184
22.3 – 24.07 40
24.07 – 25.85 14
25.85 – 27.62 170
27.62 – 29.4 197
29.4 – 31.17 149
31.17 – 32.95 17
32.95 – 34.73 31
34.73 – 36.5 95
36.5 – 38.27 153
38.27 – 40.05 165
40.05 – 41.82 36
41.82 – 43.6 67
43.6 – 45.38 51
45.38 – 47.15 161
47.15 – 48.92 254
48.92 – 50.7 43
50.7 – 52.47 133
52.47 – 54.25 94
54.25 – 56.02 95
56.02 – 57.8 0
57.8 – 59.57 0
59.57 – 61.35 0
61.35 – 63.12 0
63.12 – 64.9 0
64.9 – 66.67 0
66.67 – 68.45 0
68.45 – 70.22 0
70.22 – 72 78

county numeric foreign_key

Stored as numeric, but with 326 unique integer values from 1 to 840 across 3,221 rows and zero nulls, this is almost certainly a county FIPS or county-code identifier rather than a measurement. The high_skew alert (skew 2.87, kurtosis 11.6) and the 178 flagged 'outliers' simply reflect that codes are not uniformly distributed; they are real codes, not anomalies. Treating mean=102.8 or std=106.6 as meaningful would be misleading.

Treatment: Cast to categorical/string code and join to a county lookup together with state, since county codes repeat across states (326 unique values over 3,221 rows); do not use as a continuous feature.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["county"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 326
min 1
max 840
mean 102.8
median 79
std 106.6
q1 35
q3 133
iqr 98
skew 2.868
kurtosis 11.64
n_outliers 178
outlier_rate 0.05526
zero_rate 0
alert: high_skew · skew=+2.87
alert: outliers · 5.5% rows beyond 1.5 IQR
Fig 14.
Distribution of county. Vertical dash marks the median.
Histogram bins for county (median: 79.0).
bin count
1 – 21.98 531
21.98 – 42.95 418
42.95 – 63.93 411
63.93 – 84.9 345
84.9 – 105.9 352
105.9 – 126.9 279
126.9 – 147.8 234
147.8 – 168.8 166
168.8 – 189.8 138
189.8 – 210.8 70
210.8 – 231.7 45
231.7 – 252.7 25
252.7 – 273.7 22
273.7 – 294.7 23
294.7 – 315.6 22
315.6 – 336.6 13
336.6 – 357.6 11
357.6 – 378.6 10
378.6 – 399.5 11
399.5 – 420.5 10
420.5 – 441.5 11
441.5 – 462.5 10
462.5 – 483.4 11
483.4 – 504.4 10
504.4 – 525.4 7
525.4 – 546.4 2
546.4 – 567.3 1
567.3 – 588.3 2
588.3 – 609.3 3
609.3 – 630.2 3
630.2 – 651.2 2
651.2 – 672.2 2
672.2 – 693.2 5
693.2 – 714.2 2
714.2 – 735.1 3
735.1 – 756.1 2
756.1 – 777.1 3
777.1 – 798.1 1
798.1 – 819 2
819 – 840 3

FIPS numeric identifier

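FIPS is the standard U.S. Federal Information Processing Standards county code, with all 3,221 rows unique and no nulls. Values span 1001 to 72153, consistent with state-prefixed county identifiers (Alabama through Puerto Rico), and the distribution is near-symmetric (skew 0.157) with no outliers flagged. Because the column is stored as a number, codes for states below 10 have lost their leading zero (1001 rather than 01001).

Treatment: Treat as a categorical key, zero-padded to five characters if the reference table stores codes as strings; left-join on this code rather than using it as a numeric feature.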

anthropic:claude-opus-4-7 · confidence high
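
A small sketch of the key handling, assuming the usual layout of a two-digit state code followed by a three-digit county code. It restores the leading zero that a numeric column drops and splits the code into its parts; split_fips is a hypothetical helper name.

```python
def split_fips(fips: int):
    """Zero-pad a numeric FIPS to 5 characters and split it into
    (state_code, county_code) strings."""
    code = f"{int(fips):05d}"
    return code[:2], code[2:]


# The minimum and maximum FIPS values reported for this column.
for value in (1001, 72153):
    state_code, county_code = split_fips(value)
    # Recombining the parts should give back the original number.
    assert int(state_code) * 1000 + int(county_code) == value
    print(value, "->", state_code, county_code)
```

Keeping the key as a five-character string avoids silent mismatches when the other side of a join stores '01001' rather than 1001.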
Out[34]:

saturn.columns["FIPS"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 3,221
min 1,001
max 72,153
mean 3.138e+04
median 30,023
std 1.63e+04
q1 19,031
q3 46,105
iqr 27,074
skew 0.1569
kurtosis -0.6308
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 15.
Distribution of FIPS. Vertical dash marks the median.
Histogram bins for FIPS (median: 30023.0).
bin count
1001 – 2780 97
2780 – 4559 15
4559 – 6337 133
6337 – 8116 59
8116 – 9895 13
9895 – 1.167e+04 4
1.167e+04 – 1.345e+04 226
1.345e+04 – 1.523e+04 5
1.523e+04 – 1.701e+04 49
1.701e+04 – 1.879e+04 189
1.879e+04 – 2.057e+04 204
2.057e+04 – 2.235e+04 184
2.235e+04 – 2.413e+04 39
2.413e+04 – 2.59e+04 15
2.59e+04 – 2.768e+04 170
2.768e+04 – 2.946e+04 196
2.946e+04 – 3.124e+04 150
3.124e+04 – 3.302e+04 27
3.302e+04 – 3.48e+04 21
3.48e+04 – 3.658e+04 95
3.658e+04 – 3.836e+04 153
3.836e+04 – 4.013e+04 155
4.013e+04 – 4.191e+04 46
4.191e+04 – 4.369e+04 67
4.369e+04 – 4.547e+04 51
4.547e+04 – 4.725e+04 161
4.725e+04 – 4.903e+04 268
4.903e+04 – 5.081e+04 29
5.081e+04 – 5.259e+04 133
5.259e+04 – 5.436e+04 94
5.436e+04 – 5.614e+04 95
5.614e+04 – 5.792e+04 0
5.792e+04 – 5.97e+04 0
5.97e+04 – 6.148e+04 0
6.148e+04 – 6.326e+04 0
6.326e+04 – 6.504e+04 0
6.504e+04 – 6.682e+04 0
6.682e+04 – 6.86e+04 0
6.86e+04 – 7.037e+04 0
7.037e+04 – 7.215e+04 78

pct_black numeric feature

This is a numeric percentage of Black population per record (likely a county or tract), ranging from 0 to 87.79 with a median of just 2.38% but a mean of 9.08%. The distribution is heavily right-skewed (skew 2.33, kurtosis 5.45) with 422 outliers (13.1%) and 2.8% exact zeros, indicating a long tail of high-percentage areas above an otherwise low-share majority. No nulls, and 3,128 of 3,221 values are unique.

Treatment: Apply a log1p or similar transform before regression to tame the right skew.

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["pct_black"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 3,128
min 0
max 87.79
mean 9.085
median 2.383
std 14.5
q1 0.6919
q3 10.21
iqr 9.513
skew 2.326
kurtosis 5.451
n_outliers 422
outlier_rate 0.131
zero_rate 0.02825
alert: high_skew · skew=+2.33
alert: outliers · 13.1% rows beyond 1.5 IQR
Fig 16.
Distribution of pct_black. Vertical dash marks the median.
Histogram bins for pct_black (median: 2.382606279522089).
bin count
0 – 2.195 1568
2.195 – 4.39 402
4.39 – 6.584 218
6.584 – 8.779 153
8.779 – 10.97 112
10.97 – 13.17 86
13.17 – 15.36 70
15.36 – 17.56 54
17.56 – 19.75 46
19.75 – 21.95 48
21.95 – 24.14 36
24.14 – 26.34 42
26.34 – 28.53 36
28.53 – 30.73 41
30.73 – 32.92 34
32.92 – 35.12 32
35.12 – 37.31 28
37.31 – 39.51 19
39.51 – 41.7 25
41.7 – 43.9 25
43.9 – 46.09 18
46.09 – 48.28 17
48.28 – 50.48 14
50.48 – 52.67 9
52.67 – 54.87 13
54.87 – 57.06 11
57.06 – 59.26 13
59.26 – 61.45 7
61.45 – 63.65 8
63.65 – 65.84 4
65.84 – 68.04 1
68.04 – 70.23 6
70.23 – 72.43 8
72.43 – 74.62 5
74.62 – 76.82 2
76.82 – 79.01 5
79.01 – 81.21 1
81.21 – 83.4 1
83.4 – 85.6 1
85.6 – 87.79 2

pct_white numeric feature

This column reports the percentage of a population that is white, ranging from 3.29 to 100.0 with a mean of 81.20 and median of 87.66. The distribution is heavily left-skewed (skew -1.56) with 145 low-end outliers (4.5% outlier rate), indicating most records are predominantly white but a long tail of diverse populations exists. No nulls or zeros are present, and near-unique values across 3221 rows suggest one row per geographic unit.

Treatment: Consider a logit transform (on the 0 to 1 scale, with clipping because the maximum is exactly 100) or a reflected-log transform to address the strong left skew before modelling.

anthropic:claude-opus-4-7 · confidence high
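
A logit maps a proportion p to log(p / (1 - p)), which is undefined at exactly 0 or 1. Since this column reaches 100, a guarded version is needed. The sketch below is one way to do that; the clipping constant eps is an arbitrary illustrative choice and logit_pct is a hypothetical helper name.

```python
import math


def logit_pct(pct: float, eps: float = 1e-6) -> float:
    """Logit of a percentage on the 0-100 scale, clipped away from 0 and 1."""
    p = pct / 100.0
    # Clip so that 0% and 100% map to large finite values, not infinities.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


# min, median and max of pct_white from the stats table.
for pct in (3.29, 87.66, 100.0):
    print(pct, round(logit_pct(pct), 3))
```

The choice of eps decides how far the clipped endpoints sit from the rest of the data, so it is worth checking that the handful of 100% rows do not become the new outliers.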
Out[40]:

saturn.columns["pct_white"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 3,218
min 3.29
max 100
mean 81.2
median 87.66
std 17.35
q1 73.62
q3 93.99
iqr 20.37
skew -1.562
kurtosis 2.301
n_outliers 145
outlier_rate 0.04502
zero_rate 0
Fig 17.
Distribution of pct_white. Vertical dash marks the median.
Histogram bins for pct_white (median: 87.65979926043318).
bin count
3.29 – 5.708 2
5.708 – 8.125 1
8.125 – 10.54 4
10.54 – 12.96 2
12.96 – 15.38 10
15.38 – 17.8 6
17.8 – 20.21 4
20.21 – 22.63 7
22.63 – 25.05 12
25.05 – 27.47 12
27.47 – 29.89 7
29.89 – 32.3 4
32.3 – 34.72 10
34.72 – 37.14 13
37.14 – 39.56 24
39.56 – 41.97 18
41.97 – 44.39 24
44.39 – 46.81 30
46.81 – 49.23 24
49.23 – 51.64 34
51.64 – 54.06 33
54.06 – 56.48 37
56.48 – 58.9 58
58.9 – 61.32 43
61.32 – 63.73 69
63.73 – 66.15 72
66.15 – 68.57 77
68.57 – 70.99 76
70.99 – 73.4 88
73.4 – 75.82 100
75.82 – 78.24 98
78.24 – 80.66 143
80.66 – 83.08 132
83.08 – 85.49 163
85.49 – 87.91 191
87.91 – 90.33 246
90.33 – 92.75 302
92.75 – 95.16 453
95.16 – 97.58 525
97.58 – 100 67

pct_hispanic numeric feature

This is a numeric percentage of Hispanic population per row, ranging from 0 to 99.996 with a median of just 4.52 but a mean of 11.74, indicating a long right tail. Skew of 3.11 and kurtosis of 9.89 confirm heavy concentration at low values, with 420 outliers (13.0% of rows) stretching toward 100. No nulls and only 0.5% exact zeros suggest the values are continuously measured rather than sparsely populated.

Treatment: Apply a log1p or similar transform before modelling to tame the right skew and outliers.

anthropic:claude-opus-4-7 · confidence high
Out[43]:

saturn.columns["pct_hispanic"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 3,205
min 0
max 100
mean 11.74
median 4.516
std 19.4
q1 2.363
q3 10.66
iqr 8.294
skew 3.113
kurtosis 9.888
n_outliers 420
outlier_rate 0.1304
zero_rate 0.004967
alert: high_skew · skew=+3.11
alert: outliers · 13.0% rows beyond 1.5 IQR
Fig 18.
Distribution of pct_hispanic. Vertical dash marks the median.
Histogram bins for pct_hispanic (median: 4.51638689048761).
bin count
0 – 2.5 882
2.5 – 5 850
5 – 7.5 412
7.5 – 10 213
10 – 12.5 148
12.5 – 15 102
15 – 17.5 75
17.5 – 20 64
20 – 22.5 45
22.5 – 25 49
25 – 27.5 39
27.5 – 30 30
30 – 32.5 26
32.5 – 35 18
35 – 37.5 17
37.5 – 40 15
40 – 42.5 17
42.5 – 45 15
45 – 47.5 12
47.5 – 50 10
50 – 52.5 12
52.5 – 55 12
55 – 57.5 9
57.5 – 60 12
60 – 62.5 10
62.5 – 65 9
65 – 67.5 3
67.5 – 70 6
70 – 72.5 3
72.5 – 75 3
75 – 77.5 0
77.5 – 80 3
80 – 82.5 4
82.5 – 85 5
85 – 87.5 1
87.5 – 90 3
90 – 92.5 4
92.5 – 95 5
95 – 97.5 8
97.5 – 100 70

poverty_rate numeric feature

Continuous percentage values ranging from 0 to 66.19 with mean 15.38 and median 13.81, almost certainly a county- or area-level poverty rate. The distribution is right-skewed (skew 2.11, kurtosis 6.92) with 143 high-end outliers (4.4%) stretching well beyond the Q3 of 18.25. Values are near-unique across 3,221 rows (3,219 distinct), with no nulls and a single exact zero (zero_rate 0.0003105, i.e. 1 of 3,221 rows).

Treatment: Log- or Box-Cox-transform before regression to tame the right skew; both require strictly positive input, so shift the values (or use log1p) to handle the zero.

anthropic:claude-opus-4-7 · confidence high
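
For reference, the one-parameter Box-Cox transform is (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0, defined only for x > 0. Because this column has a minimum of 0, the sketch below adds an explicit shift argument. The value of λ would normally be estimated from the data; the values used here are placeholders, and box_cox is an illustrative helper rather than a library call.

```python
import math


def box_cox(x: float, lam: float, shift: float = 0.0) -> float:
    """One-parameter Box-Cox transform of (x + shift)."""
    y = x + shift
    if y <= 0:
        raise ValueError("Box-Cox needs strictly positive input; increase shift")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


# min, median and max of poverty_rate, shifted by 1 so that 0 is allowed.
for rate in (0.0, 13.81, 66.19):
    print(rate, round(box_cox(rate, lam=0.5, shift=1.0), 3))
```

With λ = 0 and a shift of 1 this reduces to log1p, so the two options named in the treatment are points on the same family.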
Out[46]:

saturn.columns["poverty_rate"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 3,219
min 0
max 66.19
mean 15.38
median 13.81
std 7.97
q1 10.34
q3 18.25
iqr 7.91
skew 2.111
kurtosis 6.922
n_outliers 143
outlier_rate 0.0444
zero_rate 0.0003105
alert: high_skew · skew=+2.11
Fig 19.
Distribution of poverty_rate. Vertical dash marks the median.
Histogram bins for poverty_rate (median: 13.807805224676027).
bin count
0 – 1.655 2
1.655 – 3.31 9
3.31 – 4.964 48
4.964 – 6.619 123
6.619 – 8.274 212
8.274 – 9.929 317
9.929 – 11.58 378
11.58 – 13.24 392
13.24 – 14.89 353
14.89 – 16.55 338
16.55 – 18.2 235
18.2 – 19.86 192
19.86 – 21.51 157
21.51 – 23.17 108
23.17 – 24.82 77
24.82 – 26.48 53
26.48 – 28.13 41
28.13 – 29.79 37
29.79 – 31.44 29
31.44 – 33.1 13
33.1 – 34.75 10
34.75 – 36.41 11
36.41 – 38.06 7
38.06 – 39.72 2
39.72 – 41.37 8
41.37 – 43.03 6
43.03 – 44.68 7
44.68 – 46.33 11
46.33 – 47.99 6
47.99 – 49.64 9
49.64 – 51.3 8
51.3 – 52.95 5
52.95 – 54.61 6
54.61 – 56.26 2
56.26 – 57.92 2
57.92 – 59.57 3
59.57 – 61.23 1
61.23 – 62.88 1
62.88 – 64.54 0
64.54 – 66.19 2

below_poverty_level numeric feature

This column appears to be a count of residents below the poverty level per geographic unit, ranging from 0 to 1,401,656 with a median of 3,831. The distribution is severely right-skewed (skew 15.1, kurtosis 360.7) with the mean (13,136) more than three times the median and 351 outliers (10.9% of rows). Standard deviation (44,284) dwarfs the IQR (8,390), consistent with a few very large jurisdictions dominating the tail.

Treatment: Log-transform (or normalize per population) before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
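
The "normalize per population" option amounts to dividing the count by a population denominator. The sketch below uses total_population as that denominator, which is an assumption: the dataset's own poverty_rate column may be computed against a different base, so the two need not match exactly. The helper name and sample numbers are illustrative.

```python
from typing import Optional


def rate_per_100(count: float, population: float) -> Optional[float]:
    """Count expressed per 100 residents; None when population is not positive."""
    if population <= 0:
        return None
    return 100.0 * count / population


# Made-up (below_poverty_level, total_population) pairs.
samples = [(3_800, 26_000), (1_400_000, 10_000_000), (0, 117)]
for count, population in samples:
    print(count, population, rate_per_100(count, population))
```

The correlation matrix shows why this matters: the raw count correlates +0.93 with total_population, so without normalising it mostly measures county size.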
Out[49]:

saturn.columns["below_poverty_level"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 2,824
min 0
max 1.402e+06
mean 1.314e+04
median 3,831
std 4.428e+04
q1 1,547
q3 9,937
iqr 8,390
skew 15.11
kurtosis 360.7
n_outliers 351
outlier_rate 0.109
zero_rate 0.0003105
alert: high_skew · skew=+15.11
alert: outliers · 10.9% rows beyond 1.5 IQR
Fig 20.
Distribution of below_poverty_level. Vertical dash marks the median.
Histogram bins for below_poverty_level (median: 3831.0).
bin count
0 – 3.504e+04 2980
3.504e+04 – 7.008e+04 136
7.008e+04 – 1.051e+05 50
1.051e+05 – 1.402e+05 19
1.402e+05 – 1.752e+05 7
1.752e+05 – 2.102e+05 8
2.102e+05 – 2.453e+05 3
2.453e+05 – 2.803e+05 2
2.803e+05 – 3.154e+05 3
3.154e+05 – 3.504e+05 2
3.504e+05 – 3.855e+05 5
3.855e+05 – 4.205e+05 0
4.205e+05 – 4.555e+05 1
4.555e+05 – 4.906e+05 1
4.906e+05 – 5.256e+05 0
5.256e+05 – 5.607e+05 1
5.607e+05 – 5.957e+05 0
5.957e+05 – 6.307e+05 0
6.307e+05 – 6.658e+05 0
6.658e+05 – 7.008e+05 1
7.008e+05 – 7.359e+05 1
7.359e+05 – 7.709e+05 0
7.709e+05 – 8.06e+05 0
8.06e+05 – 8.41e+05 0
8.41e+05 – 8.76e+05 0
8.76e+05 – 9.111e+05 0
9.111e+05 – 9.461e+05 0
9.461e+05 – 9.812e+05 0
9.812e+05 – 1.016e+06 0
1.016e+06 – 1.051e+06 0
1.051e+06 – 1.086e+06 0
1.086e+06 – 1.121e+06 0
1.121e+06 – 1.156e+06 0
1.156e+06 – 1.191e+06 0
1.191e+06 – 1.226e+06 0
1.226e+06 – 1.261e+06 0
1.261e+06 – 1.297e+06 0
1.297e+06 – 1.332e+06 0
1.332e+06 – 1.367e+06 0
1.367e+06 – 1.402e+06 1

median_household_income numeric feature

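Median household income per record (n=3,221, 3,099 unique, no nulls) with a typical value near the median of 52,380 and an IQR of 16,300. The mean of -152,820 and min of -666,666,666 betray a sentinel value masquerading as data. The histogram places exactly one row in that lowest bin, and that single value is enough to produce the extreme skew (-56.73) and kurtosis (3,215.99). The 182 flagged outliers are a separate matter: they come from the 1.5 IQR rule, so at most one of them is the sentinel. The median and quartiles are barely affected by the sentinel, and the q1 to q3 range of 44,939 to 61,239 looks like plausible US county-level income.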

Treatment: Replace the -666666666 sentinel with null, then consider winsorizing or log-transforming before modelling.
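
A minimal sketch of the sentinel replacement, plus a back-of-envelope check of what the mean would be without it. The helper names are hypothetical, and the check assumes exactly one sentinel row (as the single count in the lowest histogram bin suggests) and uses the rounded mean from the prose, so the result is approximate.

```python
SENTINEL = -666666666  # min of median_household_income reported for this column

def null_out_sentinel(values, sentinel=SENTINEL):
    """Replace the sentinel with None so it is treated as missing."""
    return [None if v == sentinel else v for v in values]

def mean_ignoring_none(values):
    """Mean of the non-missing values; None if every value is missing."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else None

# Back-of-envelope: remove one sentinel from the reported total (assumption: one row)
n, reported_mean = 3221, -152820
implied_clean_mean = (reported_mean * n - SENTINEL) / (n - 1)

# Toy rows (q1, sentinel, q3), for illustration only
cleaned = null_out_sentinel([44939, SENTINEL, 61239])
toy_mean = mean_ignoring_none(cleaned)
```

Under those assumptions the implied mean lands in the mid-50,000s, close to the reported median, which supports treating the sentinel as the sole cause of the negative mean.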

anthropic:claude-opus-4-7 · confidence high
Out[52]:

saturn.columns["median_household_income"].stats

stat value
n 3,221
nulls 0 (0.0%)
unique 3,099
min -6.667e+08
max 147,111
mean -1.528e+05
median 52,380
std 1.175e+07
q1 44,939
q3 61,239
iqr 16,300
skew -56.73
kurtosis 3216
n_outliers 182
outlier_rate 0.0565
zero_rate 0
alert: high_skew skew=-56.73
alert: outliers 5.7% rows beyond 1.5 IQR
Fig 21.
Distribution of median_household_income. Vertical dash marks the median.
Histogram bins for median_household_income (median: 52380.0).
bin count
-6.667e+08 – -6.5e+08 1
-6.5e+08 – -6.333e+08 0
-6.333e+08 – -6.167e+08 0
-6.167e+08 – -6e+08 0
-6e+08 – -5.833e+08 0
-5.833e+08 – -5.666e+08 0
-5.666e+08 – -5.5e+08 0
-5.5e+08 – -5.333e+08 0
-5.333e+08 – -5.166e+08 0
-5.166e+08 – -5e+08 0
-5e+08 – -4.833e+08 0
-4.833e+08 – -4.666e+08 0
-4.666e+08 – -4.5e+08 0
-4.5e+08 – -4.333e+08 0
-4.333e+08 – -4.166e+08 0
-4.166e+08 – -3.999e+08 0
-3.999e+08 – -3.833e+08 0
-3.833e+08 – -3.666e+08 0
-3.666e+08 – -3.499e+08 0
-3.499e+08 – -3.333e+08 0
-3.333e+08 – -3.166e+08 0
-3.166e+08 – -2.999e+08 0
-2.999e+08 – -2.832e+08 0
-2.832e+08 – -2.666e+08 0
-2.666e+08 – -2.499e+08 0
-2.499e+08 – -2.332e+08 0
-2.332e+08 – -2.166e+08 0
-2.166e+08 – -1.999e+08 0
-1.999e+08 – -1.832e+08 0
-1.832e+08 – -1.666e+08 0
-1.666e+08 – -1.499e+08 0
-1.499e+08 – -1.332e+08 0
-1.332e+08 – -1.165e+08 0
-1.165e+08 – -9.987e+07 0
-9.987e+07 – -8.32e+07 0
-8.32e+07 – -6.653e+07 0
-6.653e+07 – -4.986e+07 0
-4.986e+07 – -3.319e+07 0
-3.319e+07 – -1.652e+07 0
-1.652e+07 – 1.471e+05 3220

margin_2020 numeric feature

Numeric margin values for 2020, almost entirely unique across 3,221 rows (3,112 distinct), ranging from -0.87 to 0.93 with a mean of 0.317 and median 0.384. The distribution is left-skewed (skew -0.82), suggesting most observations cluster on the positive side while a tail of negative margins pulls the mean down. About 3.4% of rows are null and only 1.5% are flagged as outliers, with no zero values at all.

Treatment: Use directly as a signed numeric feature; impute the 3.4% nulls and retain sign since negatives are meaningful.
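
The report does not state the sign convention, but the reported means suggest one: the mean Republican share (0.6497) minus the mean Democratic share (0.3327) is 0.317, which is the reported mean of margin_2020, and all three columns have 109 nulls. A minimal sketch of that check, assuming the margin is Republican share minus Democratic share (an inference from the stats, not something the report confirms); signed_margin is a hypothetical helper.

```python
def signed_margin(republican_share, democratic_share):
    """Republican minus Democratic share; None if either input is missing."""
    if republican_share is None or democratic_share is None:
        return None
    return republican_share - democratic_share

# Means reported in the stats tables for the 2020 columns
mean_rep, mean_dem, mean_margin = 0.6497, 0.3327, 0.317

implied = signed_margin(mean_rep, mean_dem)
# True when the implied margin agrees with the reported mean to three decimals
matches = abs(implied - mean_margin) < 1e-3
```

If the assumption holds, positive values mark a Republican lead and negative values a Democratic one.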

anthropic:claude-opus-4-7 · confidence high
Out[55]:

saturn.columns["margin_2020"].stats

stat value
n 3,221
nulls 109 (3.4%)
unique 3,112
min -0.8675
max 0.9309
mean 0.317
median 0.3844
std 0.321
q1 0.1348
q3 0.5662
iqr 0.4314
skew -0.8212
kurtosis 0.2286
n_outliers 48
outlier_rate 0.01542
zero_rate 0
Fig 22.
Distribution of margin_2020. Vertical dash marks the median.
Histogram bins for margin_2020 (median: 0.3843813151543954).
bin count
-0.8675 – -0.8226 1
-0.8226 – -0.7776 2
-0.7776 – -0.7326 3
-0.7326 – -0.6877 5
-0.6877 – -0.6427 8
-0.6427 – -0.5978 12
-0.5978 – -0.5528 6
-0.5528 – -0.5078 12
-0.5078 – -0.4629 14
-0.4629 – -0.4179 22
-0.4179 – -0.373 28
-0.373 – -0.328 33
-0.328 – -0.283 29
-0.283 – -0.2381 46
-0.2381 – -0.1931 41
-0.1931 – -0.1482 54
-0.1482 – -0.1032 63
-0.1032 – -0.05823 67
-0.05823 – -0.01327 69
-0.01327 – 0.03169 73
0.03169 – 0.07665 70
0.07665 – 0.1216 90
0.1216 – 0.1666 131
0.1666 – 0.2115 117
0.2115 – 0.2565 129
0.2565 – 0.3015 141
0.3015 – 0.3464 159
0.3464 – 0.3914 165
0.3914 – 0.4363 181
0.4363 – 0.4813 206
0.4813 – 0.5263 195
0.5263 – 0.5712 197
0.5712 – 0.6162 213
0.6162 – 0.6611 175
0.6611 – 0.7061 140
0.7061 – 0.7511 102
0.7511 – 0.796 69
0.796 – 0.841 29
0.841 – 0.8859 11
0.8859 – 0.9309 4

democratic_pct_2020 numeric feature

This is the share of votes cast for the Democratic candidate in 2020, recorded per row (likely county-level given n=3221). Values range from 0.031 to 0.921 with a median of 0.300 and mean of 0.333, indicating most units lean Republican while a long right tail of heavily Democratic units pulls the mean up (skew 0.83). About 3.4% of rows are null and 49 outliers (1.6%) sit beyond the whiskers; no zeros are present.

Treatment: Use as-is as a proportion feature; impute the 3.4% nulls or drop those rows before modelling.
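
A minimal sketch of the two null-handling options named in the treatment, using only the standard library. The helper names and the toy rows are hypothetical; median imputation is one common choice here, not something the report prescribes.

```python
from statistics import median

def drop_missing(rows, key):
    """Keep only the rows (dicts) whose value for `key` is present."""
    return [row for row in rows if row.get(key) is not None]

def impute_median(values):
    """Fill missing values with the median of the known ones."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to impute from
    fill = median(known)
    return [fill if v is None else v for v in values]

# Toy rows, for illustration only (not taken from the dataset)
rows = [
    {"fips": "A", "democratic_pct_2020": 0.25},
    {"fips": "B", "democratic_pct_2020": None},
    {"fips": "C", "democratic_pct_2020": 0.5},
    {"fips": "D", "democratic_pct_2020": 0.75},
]
kept = drop_missing(rows, "democratic_pct_2020")
filled = impute_median([row["democratic_pct_2020"] for row in rows])
```

Dropping is the simpler option when the same 109 rows are missing across all 2020 columns; imputing keeps those rows available for the demographic features.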

anthropic:claude-opus-4-7 · confidence high
Out[58]:

saturn.columns["democratic_pct_2020"].stats

stat value
n 3,221
nulls 109 (3.4%)
unique 3,111
min 0.03091
max 0.9215
mean 0.3327
median 0.2998
std 0.1598
q1 0.2091
q3 0.4236
iqr 0.2145
skew 0.8326
kurtosis 0.2523
n_outliers 49
outlier_rate 0.01575
zero_rate 0
Fig 23.
Distribution of democratic_pct_2020. Vertical dash marks the median.
Histogram bins for democratic_pct_2020 (median: 0.29977253358402933).
bin count
0.03091 – 0.05317 5
0.05317 – 0.07544 11
0.07544 – 0.0977 31
0.0977 – 0.12 76
0.12 – 0.1422 104
0.1422 – 0.1645 145
0.1645 – 0.1868 182
0.1868 – 0.209 224
0.209 – 0.2313 192
0.2313 – 0.2536 199
0.2536 – 0.2758 200
0.2758 – 0.2981 174
0.2981 – 0.3204 164
0.3204 – 0.3426 158
0.3426 – 0.3649 130
0.3649 – 0.3871 132
0.3871 – 0.4094 121
0.4094 – 0.4317 125
0.4317 – 0.4539 84
0.4539 – 0.4762 79
0.4762 – 0.4985 66
0.4985 – 0.5207 73
0.5207 – 0.543 67
0.543 – 0.5653 58
0.5653 – 0.5875 50
0.5875 – 0.6098 42
0.6098 – 0.6321 42
0.6321 – 0.6543 34
0.6543 – 0.6766 28
0.6766 – 0.6988 29
0.6988 – 0.7211 22
0.7211 – 0.7434 14
0.7434 – 0.7656 13
0.7656 – 0.7879 7
0.7879 – 0.8102 8
0.8102 – 0.8324 11
0.8324 – 0.8547 5
0.8547 – 0.877 3
0.877 – 0.8992 3
0.8992 – 0.9215 1

republican_pct_2020 numeric feature

This is the 2020 Republican vote share by what looks like a U.S. county-level unit, with 3,221 rows, a mean of 0.65 and a median of 0.68. The distribution is left-skewed (skew -0.81): most counties sit at high Republican shares, with a thinner tail toward low ones. Values range from 0.054 to 0.962, and 3.38% of rows are null. The low outlier count (47, 1.5%) and near-unique values (3,111 distinct) are consistent with continuous geographic shares.

Treatment: Use as-is as a continuous feature; impute or drop the 3.38% missing rows before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[61]:

saturn.columns["republican_pct_2020"].stats

stat value
n 3,221
nulls 109 (3.4%)
unique 3,111
min 0.05397
max 0.9618
mean 0.6497
median 0.6829
std 0.1613
q1 0.5576
q3 0.7747
iqr 0.2171
skew -0.8091
kurtosis 0.2063
n_outliers 47
outlier_rate 0.0151
zero_rate 0
Fig 24.
Distribution of republican_pct_2020. Vertical dash marks the median.
Histogram bins for republican_pct_2020 (median: 0.6829120557612961).
bin count
0.05397 – 0.07667 1
0.07667 – 0.09937 2
0.09937 – 0.1221 2
0.1221 – 0.1448 6
0.1448 – 0.1675 6
0.1675 – 0.1901 15
0.1901 – 0.2128 5
0.2128 – 0.2355 13
0.2355 – 0.2582 12
0.2582 – 0.2809 25
0.2809 – 0.3036 26
0.3036 – 0.3263 32
0.3263 – 0.349 32
0.349 – 0.3717 40
0.3717 – 0.3944 46
0.3944 – 0.4171 52
0.4171 – 0.4398 64
0.4398 – 0.4625 78
0.4625 – 0.4852 61
0.4852 – 0.5079 78
0.5079 – 0.5306 66
0.5306 – 0.5533 97
0.5533 – 0.576 126
0.576 – 0.5987 122
0.5987 – 0.6214 139
0.6214 – 0.6441 143
0.6441 – 0.6668 154
0.6668 – 0.6895 173
0.6895 – 0.7122 176
0.7122 – 0.7349 200
0.7349 – 0.7576 213
0.7576 – 0.7802 195
0.7802 – 0.8029 203
0.8029 – 0.8256 169
0.8256 – 0.8483 132
0.8483 – 0.871 103
0.871 – 0.8937 65
0.8937 – 0.9164 27
0.9164 – 0.9391 9
0.9391 – 0.9618 4

margin_2016 text feature

This column stores a 2016 margin as a short percentage string (e.g. '15.17%', '26.55%'), 5 or 6 characters long, with exactly one 'word' per row. Despite the percent formatting it's stored as text, and 18.6% of values are duplicates with '15.17%' alone appearing 29 times, which is worth checking as a possible placeholder rather than a genuine repeat. No value is longer than 6 characters, so a signed double-digit margin such as '-15.17%' (7 characters) never occurs; the column may therefore hold unsigned margins, unlike the signed margin_2020. Null rate is 2.58% and there are 2,554 unique values across 3,221 rows.

Treatment: Strip the '%' and cast to float before any numeric analysis; divide by 100 to match the proportion scale of margin_2020.
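
A minimal sketch of that conversion, assuming every non-null value has the form shown in the examples (digits, a decimal point, and a trailing '%'). parse_percent is a hypothetical helper; it divides by 100 so the result is on the same proportion scale as margin_2020.

```python
def parse_percent(text):
    """Convert a string like '15.17%' to a proportion (0.1517); missing stays None."""
    if text is None:
        return None
    cleaned = text.strip()
    if not cleaned:
        return None
    # float() raises ValueError on anything that is not a plain number,
    # which surfaces unexpected formats instead of hiding them
    return float(cleaned.rstrip("%")) / 100.0

# The two example values quoted for this column, plus a missing one
parsed = [parse_percent(v) for v in ["15.17%", "26.55%", None]]
```

Letting malformed strings raise is a deliberate choice in this sketch: with 83 nulls and a suspicious repeated value, silent coercion could mask further data problems.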

anthropic:claude-opus-4-7 · confidence high
Out[64]:

saturn.columns["margin_2016"].stats

stat value
n 3,221
nulls 83 (2.6%)
unique 2,554
len_min 5
len_max 6
len_mean 5.896
len_median 6
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 584
duplicate_rate 0.1861
vocab_size 2,554
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 25.
Character-length distribution for margin_2016.
Character-length distribution for margin_2016 (mean: 5.895793499043977).
chars count
5 327
6 2811

democratic_pct_2016 numeric feature

Share of the 2016 vote going Democratic, recorded per row (likely county-level given n=3221). Values range from 0.031 to 0.928 with a mean of 0.317 and median 0.286, and the right skew of 0.94 reflects a long tail of heavily Democratic jurisdictions amid a mass of Republican-leaning ones. About 2.6% of rows are null and 75 high-side outliers (2.4%) sit above the IQR fence.

Treatment: Use as-is as a proportion feature; impute or drop the 2.6% nulls before modelling.
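
The "IQR fence" can be reproduced from the reported quartiles. A minimal sketch, assuming the profiler uses the conventional 1.5 × IQR rule (the alerts on other columns say "beyond 1.5 IQR", but the exact rule is not documented here); tukey_fences is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# q1 and q3 of democratic_pct_2016 as reported for this column
lower, upper = tukey_fences(0.2054, 0.3982)

# A vote share cannot be negative, so a negative lower fence rules out low-side outliers
only_high_side_possible = lower < 0
```

Under that assumption the lower fence falls below zero, so every flagged value must lie above the upper fence, which agrees with the "high-side outliers" wording.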

anthropic:claude-opus-4-7 · confidence high
Out[67]:

saturn.columns["democratic_pct_2016"].stats

stat value
n 3,221
nulls 83 (2.6%)
unique 3,111
min 0.03145
max 0.9285
mean 0.3174
median 0.2861
std 0.1527
q1 0.2054
q3 0.3982
iqr 0.1928
skew 0.9371
kurtosis 0.666
n_outliers 75
outlier_rate 0.0239
zero_rate 0
Fig 26.
Distribution of democratic_pct_2016. Vertical dash marks the median.
Histogram bins for democratic_pct_2016 (median: 0.2861345852895).
bin count
0.03145 – 0.05387 8
0.05387 – 0.0763 16
0.0763 – 0.09872 52
0.09872 – 0.1211 74
0.1211 – 0.1436 116
0.1436 – 0.166 146
0.166 – 0.1884 203
0.1884 – 0.2109 226
0.2109 – 0.2333 240
0.2333 – 0.2557 218
0.2557 – 0.2781 200
0.2781 – 0.3006 205
0.3006 – 0.323 153
0.323 – 0.3454 147
0.3454 – 0.3678 153
0.3678 – 0.3903 150
0.3903 – 0.4127 106
0.4127 – 0.4351 111
0.4351 – 0.4575 77
0.4575 – 0.48 78
0.48 – 0.5024 56
0.5024 – 0.5248 72
0.5248 – 0.5472 45
0.5472 – 0.5697 42
0.5697 – 0.5921 36
0.5921 – 0.6145 36
0.6145 – 0.6369 34
0.6369 – 0.6594 25
0.6594 – 0.6818 30
0.6818 – 0.7042 16
0.7042 – 0.7266 12
0.7266 – 0.7491 12
0.7491 – 0.7715 14
0.7715 – 0.7939 9
0.7939 – 0.8163 6
0.8163 – 0.8388 4
0.8388 – 0.8612 4
0.8612 – 0.8836 3
0.8836 – 0.906 2
0.906 – 0.9285 1

republican_pct_2016 numeric feature

This column captures the Republican vote share by unit (likely US county) in the 2016 election, expressed as a proportion between 0.041 and 0.953. The distribution is left-skewed (skew -0.81) with a median of 0.666 above the mean of 0.635, indicating most units leaned Republican while a smaller tail of strongly Democratic units pulls the mean down. Near-unique values (3111 of 3221) and a 2.58% null rate are consistent with one row per geographic unit.

Treatment: Use as-is as a proportion feature; impute or drop the ~2.6% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[70]:

saturn.columns["republican_pct_2016"].stats

stat value
n 3,221
nulls 83 (2.6%)
unique 3,111
min 0.04122
max 0.9527
mean 0.6354
median 0.6656
std 0.1559
q1 0.5463
q3 0.7503
iqr 0.2041
skew -0.8145
kurtosis 0.3566
n_outliers 62
outlier_rate 0.01976
zero_rate 0
Fig 27.
Distribution of republican_pct_2016. Vertical dash marks the median.
Histogram bins for republican_pct_2016 (median: 0.6655515136155).
bin count
0.04122 – 0.06401 1
0.06401 – 0.0868 1
0.0868 – 0.1096 5
0.1096 – 0.1324 2
0.1324 – 0.1552 6
0.1552 – 0.1779 11
0.1779 – 0.2007 7
0.2007 – 0.2235 17
0.2235 – 0.2463 17
0.2463 – 0.2691 17
0.2691 – 0.2919 23
0.2919 – 0.3147 32
0.3147 – 0.3375 34
0.3375 – 0.3602 30
0.3602 – 0.383 43
0.383 – 0.4058 41
0.4058 – 0.4286 64
0.4286 – 0.4514 71
0.4514 – 0.4742 63
0.4742 – 0.497 89
0.497 – 0.5198 78
0.5198 – 0.5425 115
0.5425 – 0.5653 116
0.5653 – 0.5881 147
0.5881 – 0.6109 147
0.6109 – 0.6337 156
0.6337 – 0.6565 165
0.6565 – 0.6793 193
0.6793 – 0.7021 190
0.7021 – 0.7249 215
0.7249 – 0.7476 223
0.7476 – 0.7704 213
0.7704 – 0.7932 187
0.7932 – 0.816 142
0.816 – 0.8388 113
0.8388 – 0.8616 74
0.8616 – 0.8844 49
0.8844 – 0.9072 29
0.9072 – 0.9299 9
0.9299 – 0.9527 3

How to cite


BibTeX
@misc{saturn-scars-master-dataset-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: scars master dataset},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/scars-master_dataset}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: scars master dataset. Source: /home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/scars-master_dataset