scars-master_dataset · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv

Saturn profiled 3,221 rows across 20 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv",
    "--findings", "scars-master_dataset.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if the saturn CLI exits non-zero

Summary confidence: high

This is a county-level US dataset of 3,221 rows and 20 columns combining demographics (population by race, poverty, income), 2016 and 2020 presidential vote shares, and geographic identifiers (FIPS, state, county). Two data-quality issues stand out and should be addressed first. First, median_household_income contains a sentinel value of -666,666,666 (a single row, per the Fig 5 histogram) that drags the mean negative. Second, margin_2016 is stored as text percentages (e.g. '15.17%') while margin_2020 is numeric, so the two election cycles aren't directly comparable without cleaning. The remaining political columns are well-formed and show a Republican-leaning county distribution (mean republican_pct_2020 ≈ 0.65 vs democratic_pct_2020 ≈ 0.33). Population and demographic counts are heavily right-skewed with many outliers, as expected when mixing rural counties with metros of up to ~10M people, so log scales or percentage shares (already provided as pct_white, pct_black, pct_hispanic) will be more informative than raw counts.

citing: median_household_income · margin_2016 · margin_2020 · republican_pct_2020 · democratic_pct_2020 · total_population · pct_white · pct_black · pct_hispanic · poverty_rate
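
To put the two election cycles on one scale, the text margins have to be parsed into numbers first. The following is a minimal sketch, assuming margin_2016 values look like '15.17%' (the example quoted above), that margin_2020 is a fraction (its observed range is -0.87 to +0.93), and that blank strings stand for missing values. `parse_pct` is a hypothetical helper, not part of saturn.

```python
from typing import Optional


def parse_pct(text: Optional[str]) -> Optional[float]:
    """Convert a string such as '15.17%' to the fraction 0.1517.

    Returns None for missing or blank input (assumed to mean "no result").
    """
    if text is None:
        return None
    cleaned = text.strip().rstrip("%").replace(",", "")
    if not cleaned:
        return None
    return float(cleaned) / 100.0


# Illustrative values only, not rows from the dataset.
raw_margins = ["15.17%", "-3.2%", "", None]
margins_2016 = [parse_pct(value) for value in raw_margins]
print(margins_2016)
```

The sign convention of margin_2016 is not visible in the profile, so it is worth confirming that it matches margin_2020 before differencing the two columns.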

Out[4]:

saturn.schema() · 20 columns

column	kind	n	null%	unique	alerts
NAME	text	3,221	0.0%	3,221	near_unique
total_population	numeric	3,221	0.0%	3,160	high_skew outliers
black_population	numeric	3,221	0.0%	2,066	high_skew outliers
white_population	numeric	3,221	0.0%	3,143	high_skew outliers
hispanic_population	numeric	3,221	0.0%	2,331	high_skew outliers
state	numeric	3,221	0.0%	52
county	numeric	3,221	0.0%	326	high_skew outliers
FIPS	numeric	3,221	0.0%	3,221
pct_black	numeric	3,221	0.0%	3,128	high_skew outliers
pct_white	numeric	3,221	0.0%	3,218
pct_hispanic	numeric	3,221	0.0%	3,205	high_skew outliers
poverty_rate	numeric	3,221	0.0%	3,219	high_skew
below_poverty_level	numeric	3,221	0.0%	2,824	high_skew outliers
median_household_income	numeric	3,221	0.0%	3,099	high_skew outliers
margin_2020	numeric	3,221	3.4%	3,112
democratic_pct_2020	numeric	3,221	3.4%	3,111
republican_pct_2020	numeric	3,221	3.4%	3,111
margin_2016	text	3,221	2.6%	2,554	one_word allcaps short_text
democratic_pct_2016	numeric	3,221	2.6%	3,111
republican_pct_2016	numeric	3,221	2.6%	3,111

Fig 1.

republican_pct_2020 · Distribution of 2020 Republican vote share across counties. Most counties sit at high Republican shares (median near 0.68), with a long tail toward low shares.

Histogram bins for republican_pct_2020 (median: 0.6829120557612961).
bin	count
0.05397 – 0.07667	1
0.07667 – 0.09937	2
0.09937 – 0.1221	2
0.1221 – 0.1448	6
0.1448 – 0.1675	6
0.1675 – 0.1901	15
0.1901 – 0.2128	5
0.2128 – 0.2355	13
0.2355 – 0.2582	12
0.2582 – 0.2809	25
0.2809 – 0.3036	26
0.3036 – 0.3263	32
0.3263 – 0.349	32
0.349 – 0.3717	40
0.3717 – 0.3944	46
0.3944 – 0.4171	52
0.4171 – 0.4398	64
0.4398 – 0.4625	78
0.4625 – 0.4852	61
0.4852 – 0.5079	78
0.5079 – 0.5306	66
0.5306 – 0.5533	97
0.5533 – 0.576	126
0.576 – 0.5987	122
0.5987 – 0.6214	139
0.6214 – 0.6441	143
0.6441 – 0.6668	154
0.6668 – 0.6895	173
0.6895 – 0.7122	176
0.7122 – 0.7349	200
0.7349 – 0.7576	213
0.7576 – 0.7802	195
0.7802 – 0.8029	203
0.8029 – 0.8256	169
0.8256 – 0.8483	132
0.8483 – 0.871	103
0.871 – 0.8937	65
0.8937 – 0.9164	27
0.9164 – 0.9391	9
0.9391 – 0.9618	4

Fig 2.

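margin_2020 · 2020 margin ranges from -0.87 to +0.93; positive values appear to favour Republicans, given the median of +0.38 alongside a median Republican share of 0.68. The binned counts show a single broad peak around +0.4 to +0.6 with a long tail toward negative margins rather than two modes; check how many counties fall on each side of zero.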

Histogram bins for margin_2020 (median: 0.3843813151543954).
bin	count
-0.8675 – -0.8226	1
-0.8226 – -0.7776	2
-0.7776 – -0.7326	3
-0.7326 – -0.6877	5
-0.6877 – -0.6427	8
-0.6427 – -0.5978	12
-0.5978 – -0.5528	6
-0.5528 – -0.5078	12
-0.5078 – -0.4629	14
-0.4629 – -0.4179	22
-0.4179 – -0.373	28
-0.373 – -0.328	33
-0.328 – -0.283	29
-0.283 – -0.2381	46
-0.2381 – -0.1931	41
-0.1931 – -0.1482	54
-0.1482 – -0.1032	63
-0.1032 – -0.05823	67
-0.05823 – -0.01327	69
-0.01327 – 0.03169	73
0.03169 – 0.07665	70
0.07665 – 0.1216	90
0.1216 – 0.1666	131
0.1666 – 0.2115	117
0.2115 – 0.2565	129
0.2565 – 0.3015	141
0.3015 – 0.3464	159
0.3464 – 0.3914	165
0.3914 – 0.4363	181
0.4363 – 0.4813	206
0.4813 – 0.5263	195
0.5263 – 0.5712	197
0.5712 – 0.6162	213
0.6162 – 0.6611	175
0.6611 – 0.7061	140
0.7061 – 0.7511	102
0.7511 – 0.796	69
0.796 – 0.841	29
0.841 – 0.8859	11
0.8859 – 0.9309	4

Fig 3.

pct_white · County racial composition is left-skewed with a median of ~88% white; watch the long tail of diverse counties.

Histogram bins for pct_white (median: 87.65979926043318).
bin	count
3.29 – 5.708	2
5.708 – 8.125	1
8.125 – 10.54	4
10.54 – 12.96	2
12.96 – 15.38	10
15.38 – 17.8	6
17.8 – 20.21	4
20.21 – 22.63	7
22.63 – 25.05	12
25.05 – 27.47	12
27.47 – 29.89	7
29.89 – 32.3	4
32.3 – 34.72	10
34.72 – 37.14	13
37.14 – 39.56	24
39.56 – 41.97	18
41.97 – 44.39	24
44.39 – 46.81	30
46.81 – 49.23	24
49.23 – 51.64	34
51.64 – 54.06	33
54.06 – 56.48	37
56.48 – 58.9	58
58.9 – 61.32	43
61.32 – 63.73	69
63.73 – 66.15	72
66.15 – 68.57	77
68.57 – 70.99	76
70.99 – 73.4	88
73.4 – 75.82	100
75.82 – 78.24	98
78.24 – 80.66	143
80.66 – 83.08	132
83.08 – 85.49	163
85.49 – 87.91	191
87.91 – 90.33	246
90.33 – 92.75	302
92.75 – 95.16	453
95.16 – 97.58	525
97.58 – 100	67

Fig 4.

poverty_rate · Poverty rate centers around 14% with a right tail past 60% — useful for spotting economically distressed counties.

Histogram bins for poverty_rate (median: 13.807805224676027).
bin	count
0 – 1.655	2
1.655 – 3.31	9
3.31 – 4.964	48
4.964 – 6.619	123
6.619 – 8.274	212
8.274 – 9.929	317
9.929 – 11.58	378
11.58 – 13.24	392
13.24 – 14.89	353
14.89 – 16.55	338
16.55 – 18.2	235
18.2 – 19.86	192
19.86 – 21.51	157
21.51 – 23.17	108
23.17 – 24.82	77
24.82 – 26.48	53
26.48 – 28.13	41
28.13 – 29.79	37
29.79 – 31.44	29
31.44 – 33.1	13
33.1 – 34.75	10
34.75 – 36.41	11
36.41 – 38.06	7
38.06 – 39.72	2
39.72 – 41.37	8
41.37 – 43.03	6
43.03 – 44.68	7
44.68 – 46.33	11
46.33 – 47.99	6
47.99 – 49.64	9
49.64 – 51.3	8
51.3 – 52.95	5
52.95 – 54.61	6
54.61 – 56.26	2
56.26 – 57.92	2
57.92 – 59.57	3
59.57 – 61.23	1
61.23 – 62.88	1
62.88 – 64.54	0
64.54 – 66.19	2

Fig 5.

median_household_income · Flag the data-quality issue: a sentinel value of -666,666,666 distorts the mean; filter to positives before plotting.

Histogram bins for median_household_income (median: 52380.0).
bin	count
-6.667e+08 – -6.5e+08	1
-6.5e+08 – -6.333e+08	0
-6.333e+08 – -6.167e+08	0
-6.167e+08 – -6e+08	0
-6e+08 – -5.833e+08	0
-5.833e+08 – -5.666e+08	0
-5.666e+08 – -5.5e+08	0
-5.5e+08 – -5.333e+08	0
-5.333e+08 – -5.166e+08	0
-5.166e+08 – -5e+08	0
-5e+08 – -4.833e+08	0
-4.833e+08 – -4.666e+08	0
-4.666e+08 – -4.5e+08	0
-4.5e+08 – -4.333e+08	0
-4.333e+08 – -4.166e+08	0
-4.166e+08 – -3.999e+08	0
-3.999e+08 – -3.833e+08	0
-3.833e+08 – -3.666e+08	0
-3.666e+08 – -3.499e+08	0
-3.499e+08 – -3.333e+08	0
-3.333e+08 – -3.166e+08	0
-3.166e+08 – -2.999e+08	0
-2.999e+08 – -2.832e+08	0
-2.832e+08 – -2.666e+08	0
-2.666e+08 – -2.499e+08	0
-2.499e+08 – -2.332e+08	0
-2.332e+08 – -2.166e+08	0
-2.166e+08 – -1.999e+08	0
-1.999e+08 – -1.832e+08	0
-1.832e+08 – -1.666e+08	0
-1.666e+08 – -1.499e+08	0
-1.499e+08 – -1.332e+08	0
-1.332e+08 – -1.165e+08	0
-1.165e+08 – -9.987e+07	0
-9.987e+07 – -8.32e+07	0
-8.32e+07 – -6.653e+07	0
-6.653e+07 – -4.986e+07	0
-4.986e+07 – -3.319e+07	0
-3.319e+07 – -1.652e+07	0
-1.652e+07 – 1.471e+05	3220

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
NAME	text	0.0%
total_population	numeric	0.0%
black_population	numeric	0.0%
white_population	numeric	0.0%
hispanic_population	numeric	0.0%
state	numeric	0.0%
county	numeric	0.0%
FIPS	numeric	0.0%
pct_black	numeric	0.0%
pct_white	numeric	0.0%
pct_hispanic	numeric	0.0%
poverty_rate	numeric	0.0%
below_poverty_level	numeric	0.0%
median_household_income	numeric	0.0%
margin_2020	numeric	3.4%
democratic_pct_2020	numeric	3.4%
republican_pct_2020	numeric	3.4%
margin_2016	text	2.6%
democratic_pct_2016	numeric	2.6%
republican_pct_2016	numeric	2.6%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
	total_population	black_population	white_population	hispanic_population	state	county	FIPS	pct_black	pct_white	pct_hispanic	poverty_rate	below_poverty_level
total_population	+1.00	+0.69	+0.98	+0.88	-0.06	-0.10	-0.06	+0.04	-0.19	+0.08	-0.15	+0.93
black_population	+0.69	+1.00	+0.63	+0.44	-0.01	-0.03	-0.01	+0.27	-0.29	+0.01	-0.04	+0.77
white_population	+0.98	+0.63	+1.00	+0.86	-0.06	-0.11	-0.06	+0.01	-0.12	+0.07	-0.17	+0.89
hispanic_population	+0.88	+0.44	+0.86	+1.00	-0.06	-0.07	-0.06	-0.01	-0.15	+0.22	-0.02	+0.86
state	-0.06	-0.01	-0.06	-0.06	+1.00	+0.07	+1.00	-0.07	+0.02	+0.37	+0.22	-0.05
county	-0.10	-0.03	-0.11	-0.07	+0.07	+1.00	+0.07	+0.09	-0.07	+0.07	+0.11	-0.08
FIPS	-0.06	-0.01	-0.06	-0.06	+1.00	+0.07	+1.00	-0.07	+0.02	+0.37	+0.22	-0.05
pct_black	+0.04	+0.27	+0.01	-0.01	-0.07	+0.09	-0.07	+1.00	-0.79	-0.02	+0.39	+0.10
pct_white	-0.19	-0.29	-0.12	-0.15	+0.02	-0.07	+0.02	-0.79	+1.00	-0.24	-0.48	-0.23
pct_hispanic	+0.08	+0.01	+0.07	+0.22	+0.37	+0.07	+0.37	-0.02	-0.24	+1.00	+0.53	+0.14
poverty_rate	-0.15	-0.04	-0.17	-0.02	+0.22	+0.11	+0.22	+0.39	-0.48	+0.53	+1.00	-0.01
below_poverty_level	+0.93	+0.77	+0.89	+0.86	-0.05	-0.08	-0.05	+0.10	-0.23	+0.14	-0.01	+1.00

NAME text identifier

This column appears to hold US county names with state qualifiers: 'county,' appears in 3,007 of 3,221 rows, followed by state tokens like Texas (256), Virginia (189), and Georgia (159). Every value is unique (n_unique = 3221, duplicate_rate = 0.0) with no nulls, and lengths cluster around 24 characters (min 16, max 42), consistent with a canonical 'X County, State' format in which the state name is spelled out. The near_unique alert confirms this behaves as an identifier rather than a categorical feature.

Treatment: Use as a display label or as a fallback join key to county-level reference tables (the FIPS column is the more robust key); do not feed as a categorical feature.

anthropic:claude-opus-4-7 · confidence high
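
If the county and state parts are needed separately, the label can be split at its last comma. This is a small sketch that assumes the 'X County, State' format inferred above holds for the value being split. `split_name` is a hypothetical helper and the sample string is illustrative.

```python
def split_name(name):
    """Split 'X County, State' into (county, state) at the last comma."""
    county, sep, state = name.rpartition(",")
    if not sep:
        # No comma found: treat the whole string as the county part.
        return name.strip(), ""
    return county.strip(), state.strip()


# Illustrative value only.
county_part, state_part = split_name("Autauga County, Alabama")
print(county_part, "|", state_part)
```

Splitting at the last comma rather than the first keeps any county name that itself contains a comma intact.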

Out[13]:

saturn.columns["NAME"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,221
len_min	16
len_max	42
len_mean	24.27
len_median	24
len_p95	31
word_mean	3.243
word_median	3
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	1,983
readability_flesch_mean	7.581
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings

Fig 8.

Character-length distribution for NAME.

Character-length distribution for NAME (mean: 24.26637690158336).
chars	count
16 – 17	3
17 – 17	23
17 – 18	0
18 – 19	72
19 – 19	121
19 – 20	0
20 – 21	190
21 – 21	264
21 – 22	0
22 – 22	407
22 – 23	420
23 – 24	0
24 – 24	363
24 – 25	320
25 – 26	0
26 – 26	240
26 – 27	233
27 – 28	0
28 – 28	153
28 – 29	0
29 – 30	142
30 – 30	86
30 – 31	0
31 – 32	81
32 – 32	41
32 – 33	0
33 – 34	28
34 – 34	16
34 – 35	0
35 – 36	10
36 – 36	4
36 – 37	0
37 – 37	0
37 – 38	1
38 – 39	0
39 – 39	1
39 – 40	0
40 – 41	0
41 – 41	1
41 – 42	1

total_population numeric feature

Likely a county-level total population count, given 3,221 rows with no nulls and a minimum of 117 alongside a maximum of 10,040,682. The distribution is severely right-skewed (skew 13.67, kurtosis 311.9) with the mean (102,398) nearly four times the median (25,981) and 441 outliers (13.7%). A few very populous counties dominate while most are small.

Treatment: Log-transform before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
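
To see what the suggested log transform does to this range, the sketch below applies log10 to the min, median, and max reported for total_population. It works on the summary values only, not on the raw rows, and the choice of base 10 is an assumption made for readability.

```python
import math

# Summary values taken from the stats reported for total_population.
summary = {"min": 117, "median": 25_981, "max": 10_040_682}

# log10 is safe here because the minimum is strictly positive.
log_summary = {name: math.log10(value) for name, value in summary.items()}

for name, value in log_summary.items():
    print(f"{name:>6}: {value:.2f}")
```

On the log10 scale the values fall between roughly 2 and 7, so the largest county no longer sits five orders of magnitude away from the smallest.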

Out[16]:

saturn.columns["total_population"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,160
min	117
max	1.004e+07
mean	1.024e+05
median	25,981
std	3.283e+05
q1	11,125
q3	66,969
iqr	55,844
skew	13.67
kurtosis	311.9
n_outliers	441
outlier_rate	0.1369
zero_rate	0
alert: high_skew	skew=+13.67
alert: outliers	13.7% rows beyond 1.5 IQR

Fig 9.

Distribution of total_population. Vertical dash marks the median.

Histogram bins for total_population (median: 25981.0).
bin	count
117 – 2.511e+05	2946
2.511e+05 – 5.021e+05	135
5.021e+05 – 7.532e+05	53
7.532e+05 – 1.004e+06	42
1.004e+06 – 1.255e+06	12
1.255e+06 – 1.506e+06	9
1.506e+06 – 1.757e+06	6
1.757e+06 – 2.008e+06	3
2.008e+06 – 2.259e+06	4
2.259e+06 – 2.51e+06	2
2.51e+06 – 2.761e+06	3
2.761e+06 – 3.012e+06	0
3.012e+06 – 3.263e+06	1
3.263e+06 – 3.514e+06	1
3.514e+06 – 3.765e+06	0
3.765e+06 – 4.016e+06	0
4.016e+06 – 4.267e+06	0
4.267e+06 – 4.518e+06	1
4.518e+06 – 4.769e+06	1
4.769e+06 – 5.02e+06	0
5.02e+06 – 5.271e+06	1
5.271e+06 – 5.522e+06	0
5.522e+06 – 5.773e+06	0
5.773e+06 – 6.024e+06	0
6.024e+06 – 6.275e+06	0
6.275e+06 – 6.526e+06	0
6.526e+06 – 6.777e+06	0
6.777e+06 – 7.029e+06	0
7.029e+06 – 7.28e+06	0
7.28e+06 – 7.531e+06	0
7.531e+06 – 7.782e+06	0
7.782e+06 – 8.033e+06	0
8.033e+06 – 8.284e+06	0
8.284e+06 – 8.535e+06	0
8.535e+06 – 8.786e+06	0
8.786e+06 – 9.037e+06	0
9.037e+06 – 9.288e+06	0
9.288e+06 – 9.539e+06	0
9.539e+06 – 9.79e+06	0
9.79e+06 – 1.004e+07	1

black_population numeric feature

Numeric count of Black residents per record (likely US county-level given n=3221), ranging from 0 to 1,202,260 with a median of just 859. The distribution is extremely right-skewed (skew 10.46, kurtosis 148.2) with 13.6% flagged as outliers and a std (54,952) over four times the mean (12,914), reflecting a few major metros dominating a long tail of small counties. About 2.8% of rows are zero and there are no nulls.

Treatment: Log-transform (log1p) before modelling or normalise as a share of total population.

anthropic:claude-opus-4-7 · confidence high
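
The second option in the treatment, normalising a count as a share of total population, can be sketched as follows. `share_pct` is a hypothetical helper, and the numbers passed to it are illustrative rather than taken from any single row. The dataset's pct_black column presumably already holds this quantity.

```python
def share_pct(count, total):
    """Return count as a percentage of total, or None when total is zero."""
    if total == 0:
        return None
    return 100.0 * count / total


# Hypothetical county with 859 Black residents out of 25,981 people.
example_share = share_pct(859, 25_981)
print(f"{example_share:.2f}%")
```

Returning None for a zero denominator keeps an undefined share distinct from a genuine 0%.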

Out[19]:

saturn.columns["black_population"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	2,066
min	0
max	1.202e+06
mean	1.291e+04
median	859
std	5.495e+04
q1	114
q3	5,553
iqr	5,439
skew	10.46
kurtosis	148.2
n_outliers	438
outlier_rate	0.136
zero_rate	0.02825
alert: high_skew	skew=+10.46
alert: outliers	13.6% rows beyond 1.5 IQR

Fig 10.

Distribution of black_population. Vertical dash marks the median.

Histogram bins for black_population (median: 859.0).
bin	count
0 – 3.006e+04	2968
3.006e+04 – 6.011e+04	108
6.011e+04 – 9.017e+04	41
9.017e+04 – 1.202e+05	33
1.202e+05 – 1.503e+05	12
1.503e+05 – 1.803e+05	14
1.803e+05 – 2.104e+05	8
2.104e+05 – 2.405e+05	3
2.405e+05 – 2.705e+05	7
2.705e+05 – 3.006e+05	6
3.006e+05 – 3.306e+05	2
3.306e+05 – 3.607e+05	2
3.607e+05 – 3.907e+05	2
3.907e+05 – 4.208e+05	2
4.208e+05 – 4.508e+05	0
4.508e+05 – 4.809e+05	2
4.809e+05 – 5.11e+05	2
5.11e+05 – 5.41e+05	0
5.41e+05 – 5.711e+05	2
5.711e+05 – 6.011e+05	1
6.011e+05 – 6.312e+05	0
6.312e+05 – 6.612e+05	1
6.612e+05 – 6.913e+05	1
6.913e+05 – 7.214e+05	0
7.214e+05 – 7.514e+05	0
7.514e+05 – 7.815e+05	0
7.815e+05 – 8.115e+05	2
8.115e+05 – 8.416e+05	0
8.416e+05 – 8.716e+05	0
8.716e+05 – 9.017e+05	1
9.017e+05 – 9.318e+05	0
9.318e+05 – 9.618e+05	0
9.618e+05 – 9.919e+05	0
9.919e+05 – 1.022e+06	0
1.022e+06 – 1.052e+06	0
1.052e+06 – 1.082e+06	0
1.082e+06 – 1.112e+06	0
1.112e+06 – 1.142e+06	0
1.142e+06 – 1.172e+06	0
1.172e+06 – 1.202e+06	1

white_population numeric feature

Counts of the white population per record (likely US counties given n=3221), ranging from 58 to 4,795,186 with a median of 21,282 but a mean of 72,000. The distribution is extremely right-skewed (skew 10.35, kurtosis 175.65) with 407 outliers (12.6%), reflecting a few very populous counties dwarfing the rest. No nulls or zeros, and near-unique values (3143/3221).

Treatment: Log-transform before regression to tame the heavy right skew.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["white_population"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,143
min	58
max	4.795e+06
mean	7.2e+04
median	21,282
std	1.918e+05
q1	8,855
q3	56,553
iqr	47,698
skew	10.35
kurtosis	175.7
n_outliers	407
outlier_rate	0.1264
zero_rate	0
alert: high_skew	skew=+10.35
alert: outliers	12.6% rows beyond 1.5 IQR

Fig 11.

Distribution of white_population. Vertical dash marks the median.

Histogram bins for white_population (median: 21282.0).
bin	count
58 – 1.199e+05	2795
1.199e+05 – 2.398e+05	216
2.398e+05 – 3.597e+05	74
3.597e+05 – 4.796e+05	47
4.796e+05 – 5.994e+05	29
5.994e+05 – 7.193e+05	24
7.193e+05 – 8.392e+05	6
8.392e+05 – 9.591e+05	9
9.591e+05 – 1.079e+06	3
1.079e+06 – 1.199e+06	3
1.199e+06 – 1.319e+06	4
1.319e+06 – 1.439e+06	3
1.439e+06 – 1.558e+06	1
1.558e+06 – 1.678e+06	0
1.678e+06 – 1.798e+06	1
1.798e+06 – 1.918e+06	1
1.918e+06 – 2.038e+06	0
2.038e+06 – 2.158e+06	0
2.158e+06 – 2.278e+06	1
2.278e+06 – 2.398e+06	0
2.398e+06 – 2.518e+06	0
2.518e+06 – 2.637e+06	0
2.637e+06 – 2.757e+06	1
2.757e+06 – 2.877e+06	1
2.877e+06 – 2.997e+06	0
2.997e+06 – 3.117e+06	0
3.117e+06 – 3.237e+06	0
3.237e+06 – 3.357e+06	1
3.357e+06 – 3.477e+06	0
3.477e+06 – 3.596e+06	0
3.596e+06 – 3.716e+06	0
3.716e+06 – 3.836e+06	0
3.836e+06 – 3.956e+06	0
3.956e+06 – 4.076e+06	0
4.076e+06 – 4.196e+06	0
4.196e+06 – 4.316e+06	0
4.316e+06 – 4.436e+06	0
4.436e+06 – 4.555e+06	0
4.555e+06 – 4.675e+06	0
4.675e+06 – 4.795e+06	1

hispanic_population numeric feature

Counts of Hispanic population per record (county-level, given n=3221), ranging from 0 to 4,851,344 with a median of just 1,209. The distribution is extraordinarily right-skewed (skew 22.75, kurtosis 744.79) and 15.3% of rows flag as outliers, indicating a handful of very large jurisdictions dwarf the rest. Mean (19,427) sits far above the Q3 of 5,875, confirming a long heavy tail.

Treatment: Apply a log1p transform before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
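
The reason for log1p rather than a plain log is that this column contains zeros (min 0, zero_rate 0.004967). The sketch below applies `math.log1p` to the min, q1, median, q3, and max from the stats and shows that a plain `math.log` fails on the zero.

```python
import math

# min, q1, median, q3, max as reported for hispanic_population.
counts = [0, 377, 1_209, 5_875, 4_851_344]

# log1p(x) = log(1 + x), so a count of 0 maps to 0.0 instead of failing.
transformed = [math.log1p(c) for c in counts]

try:
    math.log(0)
except ValueError:
    plain_log_fails_on_zero = True
else:
    plain_log_fails_on_zero = False

print(transformed, plain_log_fails_on_zero)
```

`math.expm1` inverts the transform, which is useful when predictions need to be reported back on the count scale.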

Out[25]:

saturn.columns["hispanic_population"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	2,331
min	0
max	4.851e+06
mean	1.943e+04
median	1,209
std	1.251e+05
q1	377
q3	5,875
iqr	5,498
skew	22.75
kurtosis	744.8
n_outliers	492
outlier_rate	0.1527
zero_rate	0.004967
alert: high_skew	skew=+22.75
alert: outliers	15.3% rows beyond 1.5 IQR

Fig 12.

Distribution of hispanic_population. Vertical dash marks the median.

Histogram bins for hispanic_population (median: 1209.0).
bin	count
0 – 1.213e+05	3124
1.213e+05 – 2.426e+05	55
2.426e+05 – 3.639e+05	13
3.639e+05 – 4.851e+05	9
4.851e+05 – 6.064e+05	4
6.064e+05 – 7.277e+05	3
7.277e+05 – 8.49e+05	2
8.49e+05 – 9.703e+05	0
9.703e+05 – 1.092e+06	2
1.092e+06 – 1.213e+06	4
1.213e+06 – 1.334e+06	1
1.334e+06 – 1.455e+06	1
1.455e+06 – 1.577e+06	0
1.577e+06 – 1.698e+06	0
1.698e+06 – 1.819e+06	0
1.819e+06 – 1.941e+06	1
1.941e+06 – 2.062e+06	1
2.062e+06 – 2.183e+06	0
2.183e+06 – 2.304e+06	0
2.304e+06 – 2.426e+06	0
2.426e+06 – 2.547e+06	0
2.547e+06 – 2.668e+06	0
2.668e+06 – 2.79e+06	0
2.79e+06 – 2.911e+06	0
2.911e+06 – 3.032e+06	0
3.032e+06 – 3.153e+06	0
3.153e+06 – 3.275e+06	0
3.275e+06 – 3.396e+06	0
3.396e+06 – 3.517e+06	0
3.517e+06 – 3.639e+06	0
3.639e+06 – 3.76e+06	0
3.76e+06 – 3.881e+06	0
3.881e+06 – 4.002e+06	0
4.002e+06 – 4.124e+06	0
4.124e+06 – 4.245e+06	0
4.245e+06 – 4.366e+06	0
4.366e+06 – 4.487e+06	0
4.487e+06 – 4.609e+06	0
4.609e+06 – 4.73e+06	0
4.73e+06 – 4.851e+06	1

state numeric feature

Stored as numeric but with only 52 unique integer values across 3221 rows ranging 1–72 with no nulls or zeros, this is almost certainly a FIPS-style state code rather than a true quantity. The near-symmetric spread (skew 0.157, kurtosis -0.626, mean 31.28 vs median 30) reflects how the codes are assigned and how many counties each state has, not a meaningful distribution. The max of 72 is the FIPS code for Puerto Rico, and 52 distinct codes would match the 50 states plus the District of Columbia and Puerto Rico.

Treatment: Cast to categorical and one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
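
One-hot encoding treats each state code as a label, so that code 48 is not read as "larger" than code 1. The following is a dependency-free sketch with a hypothetical `one_hot` helper and illustrative codes. In practice a dataframe or ML library would usually do this step.

```python
def one_hot(codes):
    """Return (categories, rows); each row is a 0/1 indicator list."""
    categories = sorted(set(codes))
    index = {code: position for position, code in enumerate(categories)}
    rows = []
    for code in codes:
        row = [0] * len(categories)
        row[index[code]] = 1
        rows.append(row)
    return categories, rows


# Illustrative FIPS state codes (1 = Alabama, 48 = Texas, 72 = Puerto Rico).
state_codes = [1, 48, 72, 48]
categories, encoded = one_hot(state_codes)
print(categories)
print(encoded)
```

With the 52 distinct codes in this column the same approach yields 52 indicator columns, which is why target encoding is offered as the alternative.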

Out[28]:

saturn.columns["state"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	52
min	1
max	72
mean	31.28
median	30
std	16.28
q1	19
q3	46
iqr	27
skew	0.157
kurtosis	-0.6261
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 13.

Distribution of state. Vertical dash marks the median.

Histogram bins for state (median: 30.0).
bin	count
1 – 2.775	97
2.775 – 4.55	15
4.55 – 6.325	133
6.325 – 8.1	64
8.1 – 9.875	8
9.875 – 11.65	4
11.65 – 13.42	226
13.42 – 15.2	5
15.2 – 16.98	44
16.98 – 18.75	194
18.75 – 20.52	204
20.52 – 22.3	184
22.3 – 24.07	40
24.07 – 25.85	14
25.85 – 27.62	170
27.62 – 29.4	197
29.4 – 31.17	149
31.17 – 32.95	17
32.95 – 34.73	31
34.73 – 36.5	95
36.5 – 38.27	153
38.27 – 40.05	165
40.05 – 41.82	36
41.82 – 43.6	67
43.6 – 45.38	51
45.38 – 47.15	161
47.15 – 48.92	254
48.92 – 50.7	43
50.7 – 52.47	133
52.47 – 54.25	94
54.25 – 56.02	95
56.02 – 57.8	0
57.8 – 59.57	0
59.57 – 61.35	0
61.35 – 63.12	0
63.12 – 64.9	0
64.9 – 66.67	0
66.67 – 68.45	0
68.45 – 70.22	0
70.22 – 72	78

county numeric foreign_key

Stored as numeric, but with 326 unique integer values from 1 to 840 across 3221 rows and zero nulls, this is almost certainly the county portion of a FIPS code rather than a measurement. The heavy right skew (2.87), the kurtosis (11.6), and the 178 flagged 'outliers' simply reflect that codes are not uniformly distributed; the flagged rows are real codes, not anomalies. Because only 326 codes cover 3,221 rows, the code is not unique on its own and identifies a county only in combination with state. Treating mean=102.8 or std=106.6 as meaningful would be misleading.

Treatment: Cast to categorical/string code and join to a county lookup; do not use as a continuous feature.

anthropic:claude-opus-4-7 · confidence high
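
Since a county code repeats across states, a lookup key needs both parts. The sketch below combines a state code and a county code into a zero-padded five-character string, assuming the usual two-digit state plus three-digit county layout. `make_fips` is a hypothetical helper.

```python
def make_fips(state: int, county: int) -> str:
    """Combine state and county codes into a zero-padded 5-character key."""
    if not (0 < state < 100 and 0 < county < 1000):
        raise ValueError(f"code out of range: state={state}, county={county}")
    return f"{state:02d}{county:03d}"


# The smallest and largest FIPS values in this dataset are 1001 and 72153.
print(make_fips(1, 1), make_fips(72, 153))
```

Keeping the key as a string preserves the leading zero that a numeric column drops for states with codes below 10.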

Out[31]:

saturn.columns["county"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	326
min	1
max	840
mean	102.8
median	79
std	106.6
q1	35
q3	133
iqr	98
skew	2.868
kurtosis	11.64
n_outliers	178
outlier_rate	0.05526
zero_rate	0
alert: high_skew	skew=+2.87
alert: outliers	5.5% rows beyond 1.5 IQR

Fig 14.

Distribution of county. Vertical dash marks the median.

Histogram bins for county (median: 79.0).
bin	count
1 – 21.98	531
21.98 – 42.95	418
42.95 – 63.93	411
63.93 – 84.9	345
84.9 – 105.9	352
105.9 – 126.9	279
126.9 – 147.8	234
147.8 – 168.8	166
168.8 – 189.8	138
189.8 – 210.8	70
210.8 – 231.7	45
231.7 – 252.7	25
252.7 – 273.7	22
273.7 – 294.7	23
294.7 – 315.6	22
315.6 – 336.6	13
336.6 – 357.6	11
357.6 – 378.6	10
378.6 – 399.5	11
399.5 – 420.5	10
420.5 – 441.5	11
441.5 – 462.5	10
462.5 – 483.4	11
483.4 – 504.4	10
504.4 – 525.4	7
525.4 – 546.4	2
546.4 – 567.3	1
567.3 – 588.3	2
588.3 – 609.3	3
609.3 – 630.2	3
630.2 – 651.2	2
651.2 – 672.2	2
672.2 – 693.2	5
693.2 – 714.2	2
714.2 – 735.1	3
735.1 – 756.1	2
756.1 – 777.1	3
777.1 – 798.1	1
798.1 – 819	2
819 – 840	3

FIPS numeric identifier

FIPS is the standard U.S. Federal Information Processing Standards county code, with all 3221 rows unique and no nulls. Values span 1001 to 72153, consistent with state-prefixed county identifiers (Alabama through Puerto Rico), and the distribution is near-symmetric (skew 0.157) with no outliers flagged.

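Treatment: Treat as a categorical key; left-join on this code rather than using it as a numeric feature, and zero-pad to five characters first if the reference table stores FIPS as text (numeric storage drops the leading zero, so Alabama's 01001 appears here as 1001).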

anthropic:claude-opus-4-7 · confidence high
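
A left join on FIPS can be sketched with a plain dictionary lookup. The reference table, its `ref_name` field, and the `fips_key` helper are all hypothetical, and the sketch assumes the reference side stores FIPS as five-character text.

```python
def fips_key(fips: int) -> str:
    """Render a numeric FIPS value as a zero-padded 5-character string."""
    return str(fips).zfill(5)


# Hypothetical reference table keyed by text FIPS codes.
reference = {"01001": "Autauga County, Alabama"}

# Two illustrative rows using the min and max FIPS values from the stats.
rows = [{"FIPS": 1001}, {"FIPS": 72153}]

# Left join: every input row is kept; unmatched rows get ref_name = None.
joined = [
    {**row, "ref_name": reference.get(fips_key(row["FIPS"]))}
    for row in rows
]
print(joined)
```

Counting the rows where `ref_name` is None after the join is a quick way to spot codes missing from the reference table.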

Out[34]:

saturn.columns["FIPS"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,221
min	1,001
max	72,153
mean	3.138e+04
median	30,023
std	1.63e+04
q1	19,031
q3	46,105
iqr	27,074
skew	0.1569
kurtosis	-0.6308
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 15.

Distribution of FIPS. Vertical dash marks the median.

Histogram bins for FIPS (median: 30023.0).
bin	count
1001 – 2780	97
2780 – 4559	15
4559 – 6337	133
6337 – 8116	59
8116 – 9895	13
9895 – 1.167e+04	4
1.167e+04 – 1.345e+04	226
1.345e+04 – 1.523e+04	5
1.523e+04 – 1.701e+04	49
1.701e+04 – 1.879e+04	189
1.879e+04 – 2.057e+04	204
2.057e+04 – 2.235e+04	184
2.235e+04 – 2.413e+04	39
2.413e+04 – 2.59e+04	15
2.59e+04 – 2.768e+04	170
2.768e+04 – 2.946e+04	196
2.946e+04 – 3.124e+04	150
3.124e+04 – 3.302e+04	27
3.302e+04 – 3.48e+04	21
3.48e+04 – 3.658e+04	95
3.658e+04 – 3.836e+04	153
3.836e+04 – 4.013e+04	155
4.013e+04 – 4.191e+04	46
4.191e+04 – 4.369e+04	67
4.369e+04 – 4.547e+04	51
4.547e+04 – 4.725e+04	161
4.725e+04 – 4.903e+04	268
4.903e+04 – 5.081e+04	29
5.081e+04 – 5.259e+04	133
5.259e+04 – 5.436e+04	94
5.436e+04 – 5.614e+04	95
5.614e+04 – 5.792e+04	0
5.792e+04 – 5.97e+04	0
5.97e+04 – 6.148e+04	0
6.148e+04 – 6.326e+04	0
6.326e+04 – 6.504e+04	0
6.504e+04 – 6.682e+04	0
6.682e+04 – 6.86e+04	0
6.86e+04 – 7.037e+04	0
7.037e+04 – 7.215e+04	78

pct_black numeric feature

This is a numeric percentage of Black population per record (county-level, given n=3221), ranging from 0 to 87.79 with a median of just 2.38% but a mean of 9.08%. The distribution is heavily right-skewed (skew 2.33, kurtosis 5.45) with 422 outliers (13.1%) and 2.8% exact zeros, so most counties have a low Black share while a long tail reaches very high shares. No nulls, and 3,128 of 3,221 values are unique.

Treatment: Apply a log1p or similar transform before regression to tame the right skew.

anthropic:claude-opus-4-7 · confidence high

Out[37]:

saturn.columns["pct_black"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,128
min	0
max	87.79
mean	9.085
median	2.383
std	14.5
q1	0.6919
q3	10.21
iqr	9.513
skew	2.326
kurtosis	5.451
n_outliers	422
outlier_rate	0.131
zero_rate	0.02825
alert: high_skew	skew=+2.33
alert: outliers	13.1% rows beyond 1.5 IQR

Fig 16.

Distribution of pct_black. Vertical dash marks the median.

Histogram bins for pct_black (median: 2.382606279522089).
bin	count
0 – 2.195	1568
2.195 – 4.39	402
4.39 – 6.584	218
6.584 – 8.779	153
8.779 – 10.97	112
10.97 – 13.17	86
13.17 – 15.36	70
15.36 – 17.56	54
17.56 – 19.75	46
19.75 – 21.95	48
21.95 – 24.14	36
24.14 – 26.34	42
26.34 – 28.53	36
28.53 – 30.73	41
30.73 – 32.92	34
32.92 – 35.12	32
35.12 – 37.31	28
37.31 – 39.51	19
39.51 – 41.7	25
41.7 – 43.9	25
43.9 – 46.09	18
46.09 – 48.28	17
48.28 – 50.48	14
50.48 – 52.67	9
52.67 – 54.87	13
54.87 – 57.06	11
57.06 – 59.26	13
59.26 – 61.45	7
61.45 – 63.65	8
63.65 – 65.84	4
65.84 – 68.04	1
68.04 – 70.23	6
70.23 – 72.43	8
72.43 – 74.62	5
74.62 – 76.82	2
76.82 – 79.01	5
79.01 – 81.21	1
81.21 – 83.4	1
83.4 – 85.6	1
85.6 – 87.79	2

pct_white numeric feature

This column reports the percentage of a population that is white, ranging from 3.29 to 100.0 with a mean of 81.20 and median of 87.66. The distribution is heavily left-skewed (skew -1.56) with 145 low-end outliers (4.5% outlier rate), indicating most records are predominantly white but a long tail of diverse populations exists. No nulls or zeros are present, and near-unique values across 3221 rows suggest one row per geographic unit.

Treatment: Consider a logit or reflected-log transform to address the strong left skew before modelling.

anthropic:claude-opus-4-7 · confidence high
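
A logit transform is undefined at exactly 0% or 100%, and this column reaches 100, so some clipping is needed first. The sketch below is one way to do it. The `logit_pct` helper and the clipping margin `eps=0.5` (in percentage points) are assumptions chosen for illustration, not values from the profile.

```python
import math


def logit_pct(pct, eps=0.5):
    """Logit of a percentage, clipped to [eps, 100 - eps] to stay finite."""
    clipped = min(max(pct, eps), 100.0 - eps)
    p = clipped / 100.0
    return math.log(p / (1.0 - p))


# min, q1, median, q3, max as reported for pct_white.
values = [3.29, 73.62, 87.66, 93.99, 100.0]
logits = [logit_pct(v) for v in values]
print([round(x, 2) for x in logits])
```

The transform stretches the crowded region near 100% and maps 50% to zero; the choice of `eps` affects only values within that margin of the bounds.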

Out[40]:

saturn.columns["pct_white"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,218
min	3.29
max	100
mean	81.2
median	87.66
std	17.35
q1	73.62
q3	93.99
iqr	20.37
skew	-1.562
kurtosis	2.301
n_outliers	145
outlier_rate	0.04502
zero_rate	0

Fig 17.

Distribution of pct_white. Vertical dash marks the median.

Histogram bins for pct_white (median: 87.65979926043318).
bin	count
3.29 – 5.708	2
5.708 – 8.125	1
8.125 – 10.54	4
10.54 – 12.96	2
12.96 – 15.38	10
15.38 – 17.8	6
17.8 – 20.21	4
20.21 – 22.63	7
22.63 – 25.05	12
25.05 – 27.47	12
27.47 – 29.89	7
29.89 – 32.3	4
32.3 – 34.72	10
34.72 – 37.14	13
37.14 – 39.56	24
39.56 – 41.97	18
41.97 – 44.39	24
44.39 – 46.81	30
46.81 – 49.23	24
49.23 – 51.64	34
51.64 – 54.06	33
54.06 – 56.48	37
56.48 – 58.9	58
58.9 – 61.32	43
61.32 – 63.73	69
63.73 – 66.15	72
66.15 – 68.57	77
68.57 – 70.99	76
70.99 – 73.4	88
73.4 – 75.82	100
75.82 – 78.24	98
78.24 – 80.66	143
80.66 – 83.08	132
83.08 – 85.49	163
85.49 – 87.91	191
87.91 – 90.33	246
90.33 – 92.75	302
92.75 – 95.16	453
95.16 – 97.58	525
97.58 – 100	67

pct_hispanic numeric feature

This is a numeric percentage of Hispanic population per row, ranging 0 to 99.996 with a median of just 4.52 but a mean of 11.74, indicating a long right tail. Skew of 3.11 and kurtosis of 9.89 confirm heavy concentration at low values with 420 outliers (13.0% of rows) stretching toward 100. There are no nulls and only 0.5% exact zeros, so nearly every row carries a positive share.

Treatment: Apply a log1p or similar transform before modelling to tame the right skew and outliers.

anthropic:claude-opus-4-7 · confidence high

Out[43]:

saturn.columns["pct_hispanic"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,205
min	0
max	100
mean	11.74
median	4.516
std	19.4
q1	2.363
q3	10.66
iqr	8.294
skew	3.113
kurtosis	9.888
n_outliers	420
outlier_rate	0.1304
zero_rate	0.004967
alert: high_skew	skew=+3.11
alert: outliers	13.0% rows beyond 1.5 IQR

Fig 18.

Distribution of pct_hispanic. Vertical dash marks the median.

Histogram bins for pct_hispanic (median: 4.51638689048761).
bin	count
0 – 2.5	882
2.5 – 5	850
5 – 7.5	412
7.5 – 10	213
10 – 12.5	148
12.5 – 15	102
15 – 17.5	75
17.5 – 20	64
20 – 22.5	45
22.5 – 25	49
25 – 27.5	39
27.5 – 30	30
30 – 32.5	26
32.5 – 35	18
35 – 37.5	17
37.5 – 40	15
40 – 42.5	17
42.5 – 45	15
45 – 47.5	12
47.5 – 50	10
50 – 52.5	12
52.5 – 55	12
55 – 57.5	9
57.5 – 60	12
60 – 62.5	10
62.5 – 65	9
65 – 67.5	3
67.5 – 70	6
70 – 72.5	3
72.5 – 75	3
75 – 77.5	0
77.5 – 80	3
80 – 82.5	4
82.5 – 85	5
85 – 87.5	1
87.5 – 90	3
90 – 92.5	4
92.5 – 95	5
95 – 97.5	8
97.5 – 100	70

poverty_rate numeric feature

Continuous percentage values ranging from 0 to 66.19 with mean 15.38 and median 13.81, almost certainly a county-level poverty rate. Distribution is right-skewed (skew 2.11, kurtosis 6.92) with 143 high-end outliers (4.4%) stretching well beyond Q3 of 18.25. Values are near-unique across 3,221 rows (3,219 distinct), with no nulls and a zero_rate of 0.0003105, which corresponds to a single zero row.

Treatment: Apply a log1p or Box-Cox transform before regression to tame the right skew; Box-Cox needs strictly positive input, so shift or drop the single zero first.

anthropic:claude-opus-4-7 · confidence high
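
The outlier counts in this report are described as rows beyond 1.5 IQR. The sketch below works that rule through for poverty_rate using the reported q1 and q3, assuming the standard Tukey fences (q1 - 1.5·IQR and q3 + 1.5·IQR). Whether saturn computes its fences in exactly this way is an assumption.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# q1 and q3 as reported for poverty_rate.
lower, upper = tukey_fences(10.34, 18.25)
print(f"lower fence: {lower:.3f}, upper fence: {upper:.3f}")
```

The lower fence is negative, which a rate cannot reach, so under this rule every flagged row lies above the upper fence of about 30.1. That is consistent with the description of the 143 outliers as high-end.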

Out[46]:

saturn.columns["poverty_rate"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,219
min	0
max	66.19
mean	15.38
median	13.81
std	7.97
q1	10.34
q3	18.25
iqr	7.91
skew	2.111
kurtosis	6.922
n_outliers	143
outlier_rate	0.0444
zero_rate	0.0003105
alert: high_skew	skew=+2.11

Fig 19.

Distribution of poverty_rate. Vertical dash marks the median.

Histogram bins for poverty_rate (median: 13.807805224676027).
bin	count
0 – 1.655	2
1.655 – 3.31	9
3.31 – 4.964	48
4.964 – 6.619	123
6.619 – 8.274	212
8.274 – 9.929	317
9.929 – 11.58	378
11.58 – 13.24	392
13.24 – 14.89	353
14.89 – 16.55	338
16.55 – 18.2	235
18.2 – 19.86	192
19.86 – 21.51	157
21.51 – 23.17	108
23.17 – 24.82	77
24.82 – 26.48	53
26.48 – 28.13	41
28.13 – 29.79	37
29.79 – 31.44	29
31.44 – 33.1	13
33.1 – 34.75	10
34.75 – 36.41	11
36.41 – 38.06	7
38.06 – 39.72	2
39.72 – 41.37	8
41.37 – 43.03	6
43.03 – 44.68	7
44.68 – 46.33	11
46.33 – 47.99	6
47.99 – 49.64	9
49.64 – 51.3	8
51.3 – 52.95	5
52.95 – 54.61	6
54.61 – 56.26	2
56.26 – 57.92	2
57.92 – 59.57	3
59.57 – 61.23	1
61.23 – 62.88	1
62.88 – 64.54	0
64.54 – 66.19	2

below_poverty_level numeric feature

This column appears to be a count of residents below the poverty level per geographic unit, ranging from 0 to 1,401,656 with a median of 3,831. The distribution is severely right-skewed (skew 15.1, kurtosis 360.7) with the mean (13,136) more than three times the median and 351 outliers (10.9% of rows). Standard deviation (44,284) dwarfs the IQR (8,390), consistent with a few very large jurisdictions dominating the tail.

Treatment: Apply log1p (the column contains a zero) or normalize per population before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high

Out[49]:

saturn.columns["below_poverty_level"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	2,824
min	0
max	1.402e+06
mean	1.314e+04
median	3,831
std	4.428e+04
q1	1,547
q3	9,937
iqr	8,390
skew	15.11
kurtosis	360.7
n_outliers	351
outlier_rate	0.109
zero_rate	0.0003105
alert: high_skew	skew=+15.11
alert: outliers	10.9% rows beyond 1.5 IQR
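
The outliers alert refers to rows beyond 1.5 IQR. Assuming saturn applies the standard Tukey fences to the quartiles reported above (an assumption, since the notebook does not spell out the rule), the cut-offs can be reproduced as follows. `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported above for below_poverty_level.
lower, upper = tukey_fences(1_547, 9_937)
```

Under that assumption the lower fence is negative, so a count column with a minimum of 0 can only have outliers on the high side, above roughly 22,500 residents below the poverty level.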

Fig 20.

Distribution of below_poverty_level. Vertical dash marks the median.

Histogram bins for below_poverty_level (median: 3831.0).
bin	count
0 – 3.504e+04	2980
3.504e+04 – 7.008e+04	136
7.008e+04 – 1.051e+05	50
1.051e+05 – 1.402e+05	19
1.402e+05 – 1.752e+05	7
1.752e+05 – 2.102e+05	8
2.102e+05 – 2.453e+05	3
2.453e+05 – 2.803e+05	2
2.803e+05 – 3.154e+05	3
3.154e+05 – 3.504e+05	2
3.504e+05 – 3.855e+05	5
3.855e+05 – 4.205e+05	0
4.205e+05 – 4.555e+05	1
4.555e+05 – 4.906e+05	1
4.906e+05 – 5.256e+05	0
5.256e+05 – 5.607e+05	1
5.607e+05 – 5.957e+05	0
5.957e+05 – 6.307e+05	0
6.307e+05 – 6.658e+05	0
6.658e+05 – 7.008e+05	1
7.008e+05 – 7.359e+05	1
7.359e+05 – 7.709e+05	0
7.709e+05 – 8.06e+05	0
8.06e+05 – 8.41e+05	0
8.41e+05 – 8.76e+05	0
8.76e+05 – 9.111e+05	0
9.111e+05 – 9.461e+05	0
9.461e+05 – 9.812e+05	0
9.812e+05 – 1.016e+06	0
1.016e+06 – 1.051e+06	0
1.051e+06 – 1.086e+06	0
1.086e+06 – 1.121e+06	0
1.121e+06 – 1.156e+06	0
1.156e+06 – 1.191e+06	0
1.191e+06 – 1.226e+06	0
1.226e+06 – 1.261e+06	0
1.261e+06 – 1.297e+06	0
1.297e+06 – 1.332e+06	0
1.332e+06 – 1.367e+06	0
1.367e+06 – 1.402e+06	1

median_household_income numeric feature

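Median household income per record (n=3,221, 3,099 unique, no nulls) with a typical value near the median of 52,380 and IQR of 16,300. The mean of -152,820 and min of -666,666,666 betray a sentinel value masquerading as data; the histogram places a single row in the lowest bin, and that one row is enough to produce the extreme skew (-56.73) and kurtosis (3,215.99). The 182 flagged outliers are a separate matter: quartiles are robust to one extreme value, so apart from the sentinel they are presumably real incomes beyond the 1.5 IQR fences. The q1/q3 range of 44,939-61,239 looks like plausible US county-level income.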

Treatment: Replace the -666666666 sentinel with null, then consider winsorizing or log-transforming before modelling.
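
A minimal sketch of the sentinel replacement, using only the standard library. The sample list is hypothetical and mixes a few plausible incomes with one sentinel row; `clean_income` is an illustrative helper, not part of saturn.

```python
import statistics

SENTINEL = -666_666_666  # placeholder value reported as the column minimum

def clean_income(value):
    """Map the sentinel to None; pass every other value through unchanged."""
    return None if value == SENTINEL else value

# Hypothetical incomes plus one sentinel row; not actual dataset rows.
raw = [44_939, 52_380, 61_239, SENTINEL, 147_111]
cleaned = [clean_income(v) for v in raw]
valid = [v for v in cleaned if v is not None]

mean_before = statistics.mean(raw)       # dominated by the sentinel
mean_after = statistics.mean(valid)      # mean of the remaining incomes
median_after = statistics.median(valid)  # median of the remaining incomes
```

Even in this five-row sample a single sentinel drags the mean far below zero, which mirrors what the column-level mean of -152,820 shows.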

anthropic:claude-opus-4-7 · confidence high

Out[52]:

saturn.columns["median_household_income"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,099
min	-6.667e+08
max	147,111
mean	-1.528e+05
median	52,380
std	1.175e+07
q1	44,939
q3	61,239
iqr	16,300
skew	-56.73
kurtosis	3216
n_outliers	182
outlier_rate	0.0565
zero_rate	0
alert: high_skew	skew=-56.73
alert: outliers	5.7% rows beyond 1.5 IQR

Fig 21.

Distribution of median_household_income. Vertical dash marks the median.

Histogram bins for median_household_income (median: 52380.0).
bin	count
-6.667e+08 – -6.5e+08	1
-6.5e+08 – -6.333e+08	0
-6.333e+08 – -6.167e+08	0
-6.167e+08 – -6e+08	0
-6e+08 – -5.833e+08	0
-5.833e+08 – -5.666e+08	0
-5.666e+08 – -5.5e+08	0
-5.5e+08 – -5.333e+08	0
-5.333e+08 – -5.166e+08	0
-5.166e+08 – -5e+08	0
-5e+08 – -4.833e+08	0
-4.833e+08 – -4.666e+08	0
-4.666e+08 – -4.5e+08	0
-4.5e+08 – -4.333e+08	0
-4.333e+08 – -4.166e+08	0
-4.166e+08 – -3.999e+08	0
-3.999e+08 – -3.833e+08	0
-3.833e+08 – -3.666e+08	0
-3.666e+08 – -3.499e+08	0
-3.499e+08 – -3.333e+08	0
-3.333e+08 – -3.166e+08	0
-3.166e+08 – -2.999e+08	0
-2.999e+08 – -2.832e+08	0
-2.832e+08 – -2.666e+08	0
-2.666e+08 – -2.499e+08	0
-2.499e+08 – -2.332e+08	0
-2.332e+08 – -2.166e+08	0
-2.166e+08 – -1.999e+08	0
-1.999e+08 – -1.832e+08	0
-1.832e+08 – -1.666e+08	0
-1.666e+08 – -1.499e+08	0
-1.499e+08 – -1.332e+08	0
-1.332e+08 – -1.165e+08	0
-1.165e+08 – -9.987e+07	0
-9.987e+07 – -8.32e+07	0
-8.32e+07 – -6.653e+07	0
-6.653e+07 – -4.986e+07	0
-4.986e+07 – -3.319e+07	0
-3.319e+07 – -1.652e+07	0
-1.652e+07 – 1.471e+05	3220

margin_2020 numeric feature

Numeric margin values for 2020, almost entirely unique across 3,221 rows (3,112 distinct), ranging from -0.87 to 0.93 with a mean of 0.317 and median 0.384. The distribution is left-skewed (skew -0.82), suggesting most observations cluster on the positive side while a tail of negative margins pulls the mean down. About 3.4% of rows are null and only 1.5% are flagged as outliers, with no zero values at all.

Treatment: Use directly as a signed numeric feature; impute the 3.4% nulls and retain sign since negatives are meaningful.
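
To make the sign concrete, the summary statistics in this notebook can be checked against the hypothesis that margin_2020 = republican_pct_2020 - democratic_pct_2020. This is a consistency check on aggregates, not proof; it also assumes the same rows are null in all three columns (each reports 109 nulls). A row-level comparison would be needed to confirm it.

```python
# Summary statistics reported in this notebook for the 2020 columns.
rep = {"min": 0.05397, "max": 0.9618, "mean": 0.6497}
dem = {"min": 0.03091, "max": 0.9215, "mean": 0.3327}
margin = {"min": -0.8675, "max": 0.9309, "mean": 0.317}

# If margin = republican - democratic, the means subtract directly, and the
# extremes are bounded by pairing one party's minimum with the other's maximum.
implied_mean = rep["mean"] - dem["mean"]
lowest_possible = rep["min"] - dem["max"]
highest_possible = rep["max"] - dem["min"]
```

All three implied values agree with the reported margin statistics to about three decimal places, which is consistent with positive margins meaning a Republican lead and negative margins a Democratic one.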

anthropic:claude-opus-4-7 · confidence high

Out[55]:

saturn.columns["margin_2020"].stats

stat	value
n	3,221
nulls	109 (3.4%)
unique	3,112
min	-0.8675
max	0.9309
mean	0.317
median	0.3844
std	0.321
q1	0.1348
q3	0.5662
iqr	0.4314
skew	-0.8212
kurtosis	0.2286
n_outliers	48
outlier_rate	0.01542
zero_rate	0

Fig 22.

Distribution of margin_2020. Vertical dash marks the median.

Histogram bins for margin_2020 (median: 0.3843813151543954).
bin	count
-0.8675 – -0.8226	1
-0.8226 – -0.7776	2
-0.7776 – -0.7326	3
-0.7326 – -0.6877	5
-0.6877 – -0.6427	8
-0.6427 – -0.5978	12
-0.5978 – -0.5528	6
-0.5528 – -0.5078	12
-0.5078 – -0.4629	14
-0.4629 – -0.4179	22
-0.4179 – -0.373	28
-0.373 – -0.328	33
-0.328 – -0.283	29
-0.283 – -0.2381	46
-0.2381 – -0.1931	41
-0.1931 – -0.1482	54
-0.1482 – -0.1032	63
-0.1032 – -0.05823	67
-0.05823 – -0.01327	69
-0.01327 – 0.03169	73
0.03169 – 0.07665	70
0.07665 – 0.1216	90
0.1216 – 0.1666	131
0.1666 – 0.2115	117
0.2115 – 0.2565	129
0.2565 – 0.3015	141
0.3015 – 0.3464	159
0.3464 – 0.3914	165
0.3914 – 0.4363	181
0.4363 – 0.4813	206
0.4813 – 0.5263	195
0.5263 – 0.5712	197
0.5712 – 0.6162	213
0.6162 – 0.6611	175
0.6611 – 0.7061	140
0.7061 – 0.7511	102
0.7511 – 0.796	69
0.796 – 0.841	29
0.841 – 0.8859	11
0.8859 – 0.9309	4

democratic_pct_2020 numeric feature

This is the share of votes cast for the Democratic candidate in 2020, recorded per row (likely county-level given n=3221). Values range from 0.031 to 0.921 with a median of 0.300 and mean of 0.333, indicating most units lean Republican while a long right tail of heavily Democratic units pulls the mean up (skew 0.83). About 3.4% of rows are null and 49 outliers (1.6%) sit beyond the whiskers; no zeros are present.

Treatment: Use as-is as a proportion feature; impute the 3.4% nulls or drop those rows before modelling.
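
One way to carry out the imputation option is median fill with a companion missing-value flag, sketched below. Choosing the median is an assumption made for this example, and the sample values and the `impute_median` helper are hypothetical. Since the 2020 vote columns each report 109 nulls, it is worth checking whether they are the same rows, in which case dropping them may be simpler.

```python
import statistics

def impute_median(values):
    """Fill None with the median of observed values; also return a missing flag per row."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# Hypothetical vote shares with one missing row; not dataset rows.
shares = [0.21, None, 0.30, 0.42]
filled, was_missing = impute_median(shares)
```

Keeping the flag lets a downstream model distinguish imputed rows from observed ones.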

anthropic:claude-opus-4-7 · confidence high

Out[58]:

saturn.columns["democratic_pct_2020"].stats

stat	value
n	3,221
nulls	109 (3.4%)
unique	3,111
min	0.03091
max	0.9215
mean	0.3327
median	0.2998
std	0.1598
q1	0.2091
q3	0.4236
iqr	0.2145
skew	0.8326
kurtosis	0.2523
n_outliers	49
outlier_rate	0.01575
zero_rate	0

Fig 23.

Distribution of democratic_pct_2020. Vertical dash marks the median.

Histogram bins for democratic_pct_2020 (median: 0.29977253358402933).
bin	count
0.03091 – 0.05317	5
0.05317 – 0.07544	11
0.07544 – 0.0977	31
0.0977 – 0.12	76
0.12 – 0.1422	104
0.1422 – 0.1645	145
0.1645 – 0.1868	182
0.1868 – 0.209	224
0.209 – 0.2313	192
0.2313 – 0.2536	199
0.2536 – 0.2758	200
0.2758 – 0.2981	174
0.2981 – 0.3204	164
0.3204 – 0.3426	158
0.3426 – 0.3649	130
0.3649 – 0.3871	132
0.3871 – 0.4094	121
0.4094 – 0.4317	125
0.4317 – 0.4539	84
0.4539 – 0.4762	79
0.4762 – 0.4985	66
0.4985 – 0.5207	73
0.5207 – 0.543	67
0.543 – 0.5653	58
0.5653 – 0.5875	50
0.5875 – 0.6098	42
0.6098 – 0.6321	42
0.6321 – 0.6543	34
0.6543 – 0.6766	28
0.6766 – 0.6988	29
0.6988 – 0.7211	22
0.7211 – 0.7434	14
0.7434 – 0.7656	13
0.7656 – 0.7879	7
0.7879 – 0.8102	8
0.8102 – 0.8324	11
0.8324 – 0.8547	5
0.8547 – 0.877	3
0.877 – 0.8992	3
0.8992 – 0.9215	1

republican_pct_2020 numeric feature

This is the 2020 Republican vote share per unit (likely US county), with 3,221 rows, a mean of 0.65 and a median of 0.68. Values range from 0.054 to 0.962, and the distribution is left-skewed (skew -0.81): most counties sit at high Republican shares, while a thinner tail of low-share counties pulls the mean below the median. About 3.4% of rows are null. The low outlier count (47, 1.5%) and near-unique values (3,111 distinct) are consistent with continuous geographic shares.

Treatment: Use as-is as a continuous feature; impute or drop the 3.38% missing rows before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[61]:

saturn.columns["republican_pct_2020"].stats

stat	value
n	3,221
nulls	109 (3.4%)
unique	3,111
min	0.05397
max	0.9618
mean	0.6497
median	0.6829
std	0.1613
q1	0.5576
q3	0.7747
iqr	0.2171
skew	-0.8091
kurtosis	0.2063
n_outliers	47
outlier_rate	0.0151
zero_rate	0

Fig 24.

Distribution of republican_pct_2020. Vertical dash marks the median.

Histogram bins for republican_pct_2020 (median: 0.6829120557612961).
bin	count
0.05397 – 0.07667	1
0.07667 – 0.09937	2
0.09937 – 0.1221	2
0.1221 – 0.1448	6
0.1448 – 0.1675	6
0.1675 – 0.1901	15
0.1901 – 0.2128	5
0.2128 – 0.2355	13
0.2355 – 0.2582	12
0.2582 – 0.2809	25
0.2809 – 0.3036	26
0.3036 – 0.3263	32
0.3263 – 0.349	32
0.349 – 0.3717	40
0.3717 – 0.3944	46
0.3944 – 0.4171	52
0.4171 – 0.4398	64
0.4398 – 0.4625	78
0.4625 – 0.4852	61
0.4852 – 0.5079	78
0.5079 – 0.5306	66
0.5306 – 0.5533	97
0.5533 – 0.576	126
0.576 – 0.5987	122
0.5987 – 0.6214	139
0.6214 – 0.6441	143
0.6441 – 0.6668	154
0.6668 – 0.6895	173
0.6895 – 0.7122	176
0.7122 – 0.7349	200
0.7349 – 0.7576	213
0.7576 – 0.7802	195
0.7802 – 0.8029	203
0.8029 – 0.8256	169
0.8256 – 0.8483	132
0.8483 – 0.871	103
0.871 – 0.8937	65
0.8937 – 0.9164	27
0.9164 – 0.9391	9
0.9391 – 0.9618	4

margin_2016 text feature

This column stores a 2016 margin as a short percentage string (e.g. '15.17%', '26.55%'), always 5 or 6 characters long and exactly one 'word' per row. Because it is stored as text despite the percent formatting, the one_word, allcaps and short_text alerts are likely artifacts of text profiling rather than data problems. 18.6% of non-null values are duplicates, with '15.17%' alone appearing 29 times; it is worth checking whether that is a placeholder or a genuine repeat. Null rate is 2.58% and there are 2,554 unique values across 3,221 rows.

Treatment: Strip the '%' and cast to float before any numeric analysis.
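
A minimal parsing sketch follows. Dividing by 100 is an assumption made so the result lands on the same proportion scale as margin_2020 (which ranges from about -0.87 to 0.93); whether margin_2016 follows the same sign convention as margin_2020 is not established by these stats. `parse_percent` is a hypothetical helper.

```python
def parse_percent(text):
    """Convert a string such as '15.17%' to a proportion (0.1517); None stays None."""
    if text is None:
        return None
    cleaned = text.strip().rstrip("%")
    if not cleaned:
        return None
    return float(cleaned) / 100.0

# The two example strings quoted above, plus a null.
samples = ["15.17%", "26.55%", None]
parsed = [parse_percent(s) for s in samples]
```

Nulls are passed through as None so the 2.6% missing rows stay missing instead of becoming zeros.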

anthropic:claude-opus-4-7 · confidence high

Out[64]:

saturn.columns["margin_2016"].stats

stat	value
n	3,221
nulls	83 (2.6%)
unique	2,554
len_min	5
len_max	6
len_mean	5.896
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	584
duplicate_rate	0.1861
vocab_size	2,554
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
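
The duplicate figures in this table can be reproduced from n, nulls and unique, under the assumption (inferred from the numbers, not documented here) that duplicates are counted among non-null values only:

```python
n_rows, n_nulls, n_unique = 3_221, 83, 2_554

# Assumption: duplicates are counted among non-null values only.
n_non_null = n_rows - n_nulls          # 3138
n_duplicates = n_non_null - n_unique   # 584
duplicate_rate = n_duplicates / n_non_null
```

The result matches the reported n_duplicates of 584 and duplicate_rate of 0.1861, so the 18.6% figure is a share of the 3,138 non-null values, not of all 3,221 rows.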

Fig 25.

Character-length distribution for margin_2016.

Character-length distribution for margin_2016 (mean: 5.895793499043977).
chars	count
5	327
6	2811

democratic_pct_2016 numeric feature

Share of the 2016 vote going Democratic, recorded per row (likely county-level given n=3221). Values range from 0.031 to 0.928 with a mean of 0.317 and median 0.286, and the right skew of 0.94 reflects a long tail of heavily Democratic jurisdictions amid a mass of Republican-leaning ones. About 2.6% of rows are null and 75 high-side outliers (2.4%) sit above the IQR fence.

Treatment: Use as-is as a proportion feature; impute or drop the 2.6% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[67]:

saturn.columns["democratic_pct_2016"].stats

stat	value
n	3,221
nulls	83 (2.6%)
unique	3,111
min	0.03145
max	0.9285
mean	0.3174
median	0.2861
std	0.1527
q1	0.2054
q3	0.3982
iqr	0.1928
skew	0.9371
kurtosis	0.666
n_outliers	75
outlier_rate	0.0239
zero_rate	0

Fig 26.

Distribution of democratic_pct_2016. Vertical dash marks the median.

Histogram bins for democratic_pct_2016 (median: 0.2861345852895).
bin	count
0.03145 – 0.05387	8
0.05387 – 0.0763	16
0.0763 – 0.09872	52
0.09872 – 0.1211	74
0.1211 – 0.1436	116
0.1436 – 0.166	146
0.166 – 0.1884	203
0.1884 – 0.2109	226
0.2109 – 0.2333	240
0.2333 – 0.2557	218
0.2557 – 0.2781	200
0.2781 – 0.3006	205
0.3006 – 0.323	153
0.323 – 0.3454	147
0.3454 – 0.3678	153
0.3678 – 0.3903	150
0.3903 – 0.4127	106
0.4127 – 0.4351	111
0.4351 – 0.4575	77
0.4575 – 0.48	78
0.48 – 0.5024	56
0.5024 – 0.5248	72
0.5248 – 0.5472	45
0.5472 – 0.5697	42
0.5697 – 0.5921	36
0.5921 – 0.6145	36
0.6145 – 0.6369	34
0.6369 – 0.6594	25
0.6594 – 0.6818	30
0.6818 – 0.7042	16
0.7042 – 0.7266	12
0.7266 – 0.7491	12
0.7491 – 0.7715	14
0.7715 – 0.7939	9
0.7939 – 0.8163	6
0.8163 – 0.8388	4
0.8388 – 0.8612	4
0.8612 – 0.8836	3
0.8836 – 0.906	2
0.906 – 0.9285	1

republican_pct_2016 numeric feature

This column captures the Republican vote share by unit (likely US county) in the 2016 election, expressed as a proportion between 0.041 and 0.953. The distribution is left-skewed (skew -0.81) with a median of 0.666 above the mean of 0.635, indicating most units leaned Republican while a smaller tail of strongly Democratic units pulls the mean down. Near-unique values (3111 of 3221) and a 2.58% null rate are consistent with one row per geographic unit.

Treatment: Use as-is as a proportion feature; impute or drop the ~2.6% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[70]:

saturn.columns["republican_pct_2016"].stats

stat	value
n	3,221
nulls	83 (2.6%)
unique	3,111
min	0.04122
max	0.9527
mean	0.6354
median	0.6656
std	0.1559
q1	0.5463
q3	0.7503
iqr	0.2041
skew	-0.8145
kurtosis	0.3566
n_outliers	62
outlier_rate	0.01976
zero_rate	0

Fig 27.

Distribution of republican_pct_2016. Vertical dash marks the median.

Histogram bins for republican_pct_2016 (median: 0.6655515136155).
bin	count
0.04122 – 0.06401	1
0.06401 – 0.0868	1
0.0868 – 0.1096	5
0.1096 – 0.1324	2
0.1324 – 0.1552	6
0.1552 – 0.1779	11
0.1779 – 0.2007	7
0.2007 – 0.2235	17
0.2235 – 0.2463	17
0.2463 – 0.2691	17
0.2691 – 0.2919	23
0.2919 – 0.3147	32
0.3147 – 0.3375	34
0.3375 – 0.3602	30
0.3602 – 0.383	43
0.383 – 0.4058	41
0.4058 – 0.4286	64
0.4286 – 0.4514	71
0.4514 – 0.4742	63
0.4742 – 0.497	89
0.497 – 0.5198	78
0.5198 – 0.5425	115
0.5425 – 0.5653	116
0.5653 – 0.5881	147
0.5881 – 0.6109	147
0.6109 – 0.6337	156
0.6337 – 0.6565	165
0.6565 – 0.6793	193
0.6793 – 0.7021	190
0.7021 – 0.7249	215
0.7249 – 0.7476	223
0.7476 – 0.7704	213
0.7704 – 0.7932	187
0.7932 – 0.816	142
0.816 – 0.8388	113
0.8388 – 0.8616	74
0.8616 – 0.8844	49
0.8844 – 0.9072	29
0.9072 – 0.9299	9
0.9299 – 0.9527	3


How to cite