scars master dataset

source: /home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv · 3,221 rows · 20 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a county-level US dataset of 3,221 rows and 20 columns combining demographics (population by race, poverty, income), 2016 and 2020 presidential vote shares, and geographic identifiers (FIPS, state, county). Two data-quality issues stand out and should be addressed first. First, median_household_income contains a sentinel value that pulls its minimum to -666,666,666 and yields a negative mean. Second, margin_2016 is stored as text percentages (e.g. '15.17%') while margin_2020 is a numeric proportion, so the two election cycles aren't directly comparable until margin_2016 is parsed and rescaled. The remaining political columns are well-formed and show a Republican-leaning county distribution (mean republican_pct_2020 ≈ 0.65 vs democratic_pct_2020 ≈ 0.33). Population and demographic counts are heavily right-skewed with many outliers, as expected when mixing rural counties with metros up to ~10M people, so log scales or the percentage shares already provided (pct_white, pct_black, pct_hispanic) will be more informative than raw counts.

citing: median_household_income · margin_2016 · margin_2020 · republican_pct_2020 · democratic_pct_2020 · total_population · pct_white · pct_black · pct_hispanic · poverty_rate

Charts the summary said to look at first

republican_pct_2020 · Distribution of 2020 Republican vote share across counties. Note the left skew (skew -0.81): most counties sit at high Republican shares, with a median near 0.68.

Histogram bins for republican_pct_2020 (median: 0.6829120557612961).
bin	count
0.05397 – 0.07667	1
0.07667 – 0.09937	2
0.09937 – 0.1221	2
0.1221 – 0.1448	6
0.1448 – 0.1675	6
0.1675 – 0.1901	15
0.1901 – 0.2128	5
0.2128 – 0.2355	13
0.2355 – 0.2582	12
0.2582 – 0.2809	25
0.2809 – 0.3036	26
0.3036 – 0.3263	32
0.3263 – 0.349	32
0.349 – 0.3717	40
0.3717 – 0.3944	46
0.3944 – 0.4171	52
0.4171 – 0.4398	64
0.4398 – 0.4625	78
0.4625 – 0.4852	61
0.4852 – 0.5079	78
0.5079 – 0.5306	66
0.5306 – 0.5533	97
0.5533 – 0.576	126
0.576 – 0.5987	122
0.5987 – 0.6214	139
0.6214 – 0.6441	143
0.6441 – 0.6668	154
0.6668 – 0.6895	173
0.6895 – 0.7122	176
0.7122 – 0.7349	200
0.7349 – 0.7576	213
0.7576 – 0.7802	195
0.7802 – 0.8029	203
0.8029 – 0.8256	169
0.8256 – 0.8483	132
0.8483 – 0.871	103
0.871 – 0.8937	65
0.8937 – 0.9164	27
0.9164 – 0.9391	9
0.9391 – 0.9618	4
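
The bin edges above are consistent with 40 equal-width bins between the column's minimum and maximum. The following is a minimal sketch of that binning in plain Python, assuming (the report does not say) that the profiler uses equal-width bins with the maximum value placed in the last bin. `equal_width_bins` is a hypothetical helper name, not part of any profiling tool.

```python
def equal_width_bins(values, n_bins=40):
    """Count values in n_bins equal-width bins spanning [min, max].

    Returns (edges, counts); None entries (nulls) are ignored.
    """
    data = [v for v in values if v is not None]
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in data:
        # The maximum value belongs to the last bin, not to a 41st one.
        idx = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        counts[idx] += 1
    return edges, counts
```

With the reported minimum of 0.05397 and maximum of 0.9618, the bin width is about 0.0227, which reproduces the first edge of 0.07667 in the table.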

margin_2020 · 2020 margin ranges from -0.87 to +0.93. The binned counts show a single peak near +0.6 rather than clear bimodality; check how many counties fall on each side of zero.

Histogram bins for margin_2020 (median: 0.3843813151543954).
bin	count
-0.8675 – -0.8226	1
-0.8226 – -0.7776	2
-0.7776 – -0.7326	3
-0.7326 – -0.6877	5
-0.6877 – -0.6427	8
-0.6427 – -0.5978	12
-0.5978 – -0.5528	6
-0.5528 – -0.5078	12
-0.5078 – -0.4629	14
-0.4629 – -0.4179	22
-0.4179 – -0.373	28
-0.373 – -0.328	33
-0.328 – -0.283	29
-0.283 – -0.2381	46
-0.2381 – -0.1931	41
-0.1931 – -0.1482	54
-0.1482 – -0.1032	63
-0.1032 – -0.05823	67
-0.05823 – -0.01327	69
-0.01327 – 0.03169	73
0.03169 – 0.07665	70
0.07665 – 0.1216	90
0.1216 – 0.1666	131
0.1666 – 0.2115	117
0.2115 – 0.2565	129
0.2565 – 0.3015	141
0.3015 – 0.3464	159
0.3464 – 0.3914	165
0.3914 – 0.4363	181
0.4363 – 0.4813	206
0.4813 – 0.5263	195
0.5263 – 0.5712	197
0.5712 – 0.6162	213
0.6162 – 0.6611	175
0.6611 – 0.7061	140
0.7061 – 0.7511	102
0.7511 – 0.796	69
0.796 – 0.841	29
0.841 – 0.8859	11
0.8859 – 0.9309	4
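
The bin from -0.01327 to 0.03169 straddles zero, so the exact split by sign cannot be read off the table and has to come from the raw values. A minimal sketch of that tally follows, assuming nulls are represented as `None`; `count_by_sign` is a hypothetical helper name.

```python
def count_by_sign(margins):
    """Tally negative, zero, positive and null margins."""
    tally = {"negative": 0, "zero": 0, "positive": 0, "null": 0}
    for m in margins:
        if m is None:
            tally["null"] += 1
        elif m < 0:
            tally["negative"] += 1
        elif m > 0:
            tally["positive"] += 1
        else:
            tally["zero"] += 1
    return tally
```

From the table alone, the bins entirely below zero hold 515 counties and the bins entirely above zero hold 2,524; the 73 counties in the straddling bin are undetermined until the raw values are counted.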

pct_white · County racial composition is left-skewed with a median of ~88% white; watch the long tail of diverse counties.

Histogram bins for pct_white (median: 87.65979926043318).
bin	count
3.29 – 5.708	2
5.708 – 8.125	1
8.125 – 10.54	4
10.54 – 12.96	2
12.96 – 15.38	10
15.38 – 17.8	6
17.8 – 20.21	4
20.21 – 22.63	7
22.63 – 25.05	12
25.05 – 27.47	12
27.47 – 29.89	7
29.89 – 32.3	4
32.3 – 34.72	10
34.72 – 37.14	13
37.14 – 39.56	24
39.56 – 41.97	18
41.97 – 44.39	24
44.39 – 46.81	30
46.81 – 49.23	24
49.23 – 51.64	34
51.64 – 54.06	33
54.06 – 56.48	37
56.48 – 58.9	58
58.9 – 61.32	43
61.32 – 63.73	69
63.73 – 66.15	72
66.15 – 68.57	77
68.57 – 70.99	76
70.99 – 73.4	88
73.4 – 75.82	100
75.82 – 78.24	98
78.24 – 80.66	143
80.66 – 83.08	132
83.08 – 85.49	163
85.49 – 87.91	191
87.91 – 90.33	246
90.33 – 92.75	302
92.75 – 95.16	453
95.16 – 97.58	525
97.58 – 100	67

poverty_rate · Poverty rate centers around 14% with a right tail past 60% — useful for spotting economically distressed counties.

Histogram bins for poverty_rate (median: 13.807805224676027).
bin	count
0 – 1.655	2
1.655 – 3.31	9
3.31 – 4.964	48
4.964 – 6.619	123
6.619 – 8.274	212
8.274 – 9.929	317
9.929 – 11.58	378
11.58 – 13.24	392
13.24 – 14.89	353
14.89 – 16.55	338
16.55 – 18.2	235
18.2 – 19.86	192
19.86 – 21.51	157
21.51 – 23.17	108
23.17 – 24.82	77
24.82 – 26.48	53
26.48 – 28.13	41
28.13 – 29.79	37
29.79 – 31.44	29
31.44 – 33.1	13
33.1 – 34.75	10
34.75 – 36.41	11
36.41 – 38.06	7
38.06 – 39.72	2
39.72 – 41.37	8
41.37 – 43.03	6
43.03 – 44.68	7
44.68 – 46.33	11
46.33 – 47.99	6
47.99 – 49.64	9
49.64 – 51.3	8
51.3 – 52.95	5
52.95 – 54.61	6
54.61 – 56.26	2
56.26 – 57.92	2
57.92 – 59.57	3
59.57 – 61.23	1
61.23 – 62.88	1
62.88 – 64.54	0
64.54 – 66.19	2

median_household_income · Flag the data-quality issue: a sentinel value of -666,666,666 distorts the mean; filter to positives before plotting.

Histogram bins for median_household_income (median: 52380.0).
bin	count
-6.667e+08 – -6.5e+08	1
-6.5e+08 – -6.333e+08	0
-6.333e+08 – -6.167e+08	0
-6.167e+08 – -6e+08	0
-6e+08 – -5.833e+08	0
-5.833e+08 – -5.666e+08	0
-5.666e+08 – -5.5e+08	0
-5.5e+08 – -5.333e+08	0
-5.333e+08 – -5.166e+08	0
-5.166e+08 – -5e+08	0
-5e+08 – -4.833e+08	0
-4.833e+08 – -4.666e+08	0
-4.666e+08 – -4.5e+08	0
-4.5e+08 – -4.333e+08	0
-4.333e+08 – -4.166e+08	0
-4.166e+08 – -3.999e+08	0
-3.999e+08 – -3.833e+08	0
-3.833e+08 – -3.666e+08	0
-3.666e+08 – -3.499e+08	0
-3.499e+08 – -3.333e+08	0
-3.333e+08 – -3.166e+08	0
-3.166e+08 – -2.999e+08	0
-2.999e+08 – -2.832e+08	0
-2.832e+08 – -2.666e+08	0
-2.666e+08 – -2.499e+08	0
-2.499e+08 – -2.332e+08	0
-2.332e+08 – -2.166e+08	0
-2.166e+08 – -1.999e+08	0
-1.999e+08 – -1.832e+08	0
-1.832e+08 – -1.666e+08	0
-1.666e+08 – -1.499e+08	0
-1.499e+08 – -1.332e+08	0
-1.332e+08 – -1.165e+08	0
-1.165e+08 – -9.987e+07	0
-9.987e+07 – -8.32e+07	0
-8.32e+07 – -6.653e+07	0
-6.653e+07 – -4.986e+07	0
-4.986e+07 – -3.319e+07	0
-3.319e+07 – -1.652e+07	0
-1.652e+07 – 1.471e+05	3220

Schema

20 columns

Per-column summary.
column	type	null %	unique	alerts
NAME	text	0.0%	3,221	near_unique
total_population	numeric	0.0%	3,160	high_skew outliers
black_population	numeric	0.0%	2,066	high_skew outliers
white_population	numeric	0.0%	3,143	high_skew outliers
hispanic_population	numeric	0.0%	2,331	high_skew outliers
state	numeric	0.0%	52
county	numeric	0.0%	326	high_skew outliers
FIPS	numeric	0.0%	3,221
pct_black	numeric	0.0%	3,128	high_skew outliers
pct_white	numeric	0.0%	3,218
pct_hispanic	numeric	0.0%	3,205	high_skew outliers
poverty_rate	numeric	0.0%	3,219	high_skew
below_poverty_level	numeric	0.0%	2,824	high_skew outliers
median_household_income	numeric	0.0%	3,099	high_skew outliers
margin_2020	numeric	3.4%	3,112
democratic_pct_2020	numeric	3.4%	3,111
republican_pct_2020	numeric	3.4%	3,111
margin_2016	text	2.6%	2,554	one_word allcaps short_text
democratic_pct_2016	numeric	2.6%	3,111
republican_pct_2016	numeric	2.6%	3,111

NAME

text identifier near_unique

This column appears to hold US county names with state qualifiers: 'county,' appears in 3,007 of 3,221 rows, followed by state tokens like Texas (256), Virginia (189), and Georgia (159). Every value is unique (n_unique = 3221, duplicate_rate = 0.0) with no nulls, and lengths cluster tightly around 24 characters (min 16, max 42), consistent with a canonical 'X County, State' format that spells out the state name. The near_unique alert confirms this behaves as an identifier rather than a categorical feature. Treatment: Use as a join key to county-level reference tables; do not feed as a categorical feature. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 3,221
len_min: 16
len_max: 42
len_mean: 24.27
len_median: 24
len_p95: 31
word_mean: 3.243
word_median: 3
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 1,983
readability_flesch_mean: 7.581
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0

total_population

numeric feature high_skew outliers

Likely a county- or area-level total population count, given 3,221 rows with no nulls and a minimum of 117 alongside a maximum of 10,040,682. The distribution is severely right-skewed (skew 13.67, kurtosis 311.9) with the mean (102,398) nearly four times the median (25,981) and 441 outliers (13.7%). A few mega-population areas dominate while most are small. Treatment: log-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 3,160
min: 117
max: 1.004e+07
mean: 1.024e+05
median: 25,981
std: 3.283e+05
q1: 11,125
q3: 66,969
iqr: 55,844
skew: 13.67
kurtosis: 311.9
n_outliers: 441
outlier_rate: 0.1369
zero_rate: 0

black_population

numeric feature high_skew outliers

Numeric count of Black residents per record (likely US county-level given n=3221), ranging from 0 to 1,202,260 with a median of just 859. The distribution is extremely right-skewed (skew 10.46, kurtosis 148.2) with 13.6% flagged as outliers and a std (54,952) over four times the mean (12,914), reflecting a few major metros dominating a long tail of small counties. About 2.8% of rows are zero and there are no nulls. Treatment: Log-transform (log1p) before modelling or normalise as a share of total population. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 2,066
min: 0
max: 1.202e+06
mean: 1.291e+04
median: 859
std: 5.495e+04
q1: 114
q3: 5,553
iqr: 5,439
skew: 10.46
kurtosis: 148.2
n_outliers: 438
outlier_rate: 0.136
zero_rate: 0.02825
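
Because about 2.8% of rows are zero, a plain logarithm is undefined for part of this column, which is why the treatment names log1p. A minimal sketch in plain Python follows, assuming nulls are represented as `None`; `log1p_counts` is a hypothetical helper name.

```python
import math


def log1p_counts(counts):
    """Apply log(1 + x) to non-negative counts; zeros map to 0.0."""
    out = []
    for c in counts:
        if c is None:
            out.append(None)  # keep nulls as nulls
        elif c < 0:
            raise ValueError(f"count must be non-negative, got {c}")
        else:
            out.append(math.log1p(c))
    return out
```

The alternative named in the treatment, normalising by total population, is already available in the dataset as pct_black.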

white_population

numeric feature high_skew outliers

Counts of the white population per record (likely US counties given n=3221), ranging from 58 to 4,795,186 with a median of 21,282 but a mean of 72,000. The distribution is extremely right-skewed (skew 10.35, kurtosis 175.65) with 407 outliers (12.6%), reflecting a few very populous counties dwarfing the rest. No nulls or zeros, and near-unique values (3143/3221). Treatment: log-transform before regression to tame the heavy right skew. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 3,143
min: 58
max: 4.795e+06
mean: 7.2e+04
median: 21,282
std: 1.918e+05
q1: 8,855
q3: 56,553
iqr: 47,698
skew: 10.35
kurtosis: 175.7
n_outliers: 407
outlier_rate: 0.1264
zero_rate: 0

hispanic_population

numeric feature high_skew outliers

Counts of Hispanic population per record (likely US county-level given n=3221), ranging from 0 to 4,851,344 with a median of just 1,209. The distribution is extraordinarily right-skewed (skew 22.75, kurtosis 744.79) and 15.3% of rows flag as outliers, indicating a handful of very large counties dwarf the rest. The mean (19,427) sits far above the Q3 of 5,875, confirming a long heavy tail. Treatment: Apply a log1p transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 2,331
min: 0
max: 4.851e+06
mean: 1.943e+04
median: 1,209
std: 1.251e+05
q1: 377
q3: 5,875
iqr: 5,498
skew: 22.75
kurtosis: 744.8
n_outliers: 492
outlier_rate: 0.1527
zero_rate: 0.004967

state

numeric feature

Stored as numeric but with only 52 unique integer values across 3221 rows ranging 1–72 with no nulls or zeros, this is almost certainly a FIPS state code rather than a true quantity. The near-symmetric spread (skew 0.157, kurtosis -0.626, mean 31.28 vs median 30) reflects how counties are spread across state codes, not a meaningful numeric distribution. The 52 values are consistent with the 50 states, the District of Columbia, and Puerto Rico, whose FIPS state code is 72. Treatment: Cast to categorical and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 52
min: 1
max: 72
mean: 31.28
median: 30
std: 16.28
q1: 19
q3: 46
iqr: 27
skew: 0.157
kurtosis: -0.6261
n_outliers: 0
outlier_rate: 0
zero_rate: 0
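
As an illustration of the cast-to-categorical treatment, here is a minimal one-hot sketch in plain Python. It assumes the codes are integers and renders them as two-character labels; `one_hot_state` and the `state_` prefix are hypothetical names chosen for this example.

```python
def one_hot_state(codes):
    """One-hot encode integer state codes as zero-padded categorical labels.

    Returns (categories, rows), where each row is a dict of 0/1 indicators.
    """
    labels = [f"{int(c):02d}" for c in codes]
    categories = sorted(set(labels))
    rows = [
        {f"state_{cat}": int(label == cat) for cat in categories}
        for label in labels
    ]
    return categories, rows
```

With 52 distinct codes this produces 52 indicator columns, which is one reason the treatment also mentions target encoding as an alternative.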

county

numeric foreign_key high_skew outliers

Stored as numeric, but with 326 unique integer values from 1 to 840 across 3221 rows and zero nulls, this is almost certainly the within-state county FIPS code rather than a measurement. The right skew (2.87), kurtosis (11.6), and 178 flagged 'outliers' simply reflect that codes are not uniformly distributed; the flagged values are real codes, not anomalies. Treating mean=102.8 or std=106.6 as meaningful would be misleading. Treatment: Cast to a categorical/string code and join to a county lookup together with the state code, since the 326 distinct values repeat across states; do not use as a continuous feature. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 326
min: 1
max: 840
mean: 102.8
median: 79
std: 106.6
q1: 35
q3: 133
iqr: 98
skew: 2.868
kurtosis: 11.64
n_outliers: 178
outlier_rate: 0.05526
zero_rate: 0

FIPS

numeric identifier

FIPS is the standard U.S. Federal Information Processing Standards county code, with all 3221 rows unique and no nulls. Values span 1001 to 72153, consistent with state-prefixed county identifiers (Alabama through Puerto Rico); because the column is numeric, the leading zero of codes below 10000 is lost (01001 is stored as 1001). The distribution is near-symmetric (skew 0.157) with no outliers flagged. Treatment: Treat as a categorical key, zero-padded to five characters where the reference table stores codes as strings; left-join on this code rather than using it as a numeric feature. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 3,221
min: 1,001
max: 72,153
mean: 3.138e+04
median: 30,023
std: 1.63e+04
q1: 19,031
q3: 46,105
iqr: 27,074
skew: 0.1569
kurtosis: -0.6308
n_outliers: 0
outlier_rate: 0
zero_rate: 0
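
County FIPS codes are conventionally written as five characters, so a numeric column needs zero-padding before it can be joined against string keys. The sketch below shows that, plus a consistency check against the state and county columns. The relation FIPS = state × 1000 + county is an assumption: the reported minimum (1001) and maximum (72153) fit it, but the report does not confirm it row by row. Both function names are hypothetical.

```python
def fips_to_str(fips):
    """Render a numeric county FIPS as a 5-character, zero-padded string."""
    return f"{int(fips):05d}"


def compose_fips(state, county):
    """Combine a state code and a county code into a numeric FIPS.

    Assumes FIPS == state * 1000 + county; verify against the data.
    """
    return int(state) * 1000 + int(county)
```

Comparing `compose_fips(state, county)` with the FIPS column on every row would confirm or refute the assumed relation before either column is dropped as redundant.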

pct_black

numeric feature high_skew outliers

This is a numeric percentage of Black population per record (likely a US county given n=3221), ranging from 0 to 87.79 with a median of just 2.38% but a mean of 9.08%. The distribution is heavily right-skewed (skew 2.33, kurtosis 5.45) with 422 outliers (13.1%) and 2.8% exact zeros, indicating a long tail of high-percentage counties above an otherwise low-share majority. No nulls, and 3,128 of 3,221 values are unique. Treatment: Apply a log1p or similar transform before regression to tame the right skew. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 3,128
min: 0
max: 87.79
mean: 9.085
median: 2.383
std: 14.5
q1: 0.6919
q3: 10.21
iqr: 9.513
skew: 2.326
kurtosis: 5.451
n_outliers: 422
outlier_rate: 0.131
zero_rate: 0.02825

pct_white

numeric feature

This column reports the percentage of a population that is white, ranging from 3.29 to 100.0 with a mean of 81.20 and median of 87.66. The distribution is heavily left-skewed (skew -1.56) with 145 low-end outliers (4.5% outlier rate), indicating most records are predominantly white but a long tail of diverse populations exists. No nulls or zeros are present, and near-unique values across 3221 rows suggest one row per geographic unit. Treatment: Consider a logit or reflected-log transform to address the strong left skew before modelling. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 3,218
min: 3.29
max: 100
mean: 81.2
median: 87.66
std: 17.35
q1: 73.62
q3: 93.99
iqr: 20.37
skew: -1.562
kurtosis: 2.301
n_outliers: 145
outlier_rate: 0.04502
zero_rate: 0
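
The logit transform named in the treatment is undefined at exactly 0% or 100%, and this column's maximum is 100, so some clipping is needed. A minimal sketch follows; the clipping constant `eps` is an arbitrary assumption and `logit_pct` is a hypothetical helper name.

```python
import math


def logit_pct(pct, eps=1e-4):
    """Logit of a percentage in [0, 100], clipped away from 0 and 1."""
    p = min(max(pct / 100.0, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))
```

The choice of `eps` sets how far the 100% rows land from the rest of the data after transformation, so it is worth checking how sensitive downstream results are to it.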

pct_hispanic

numeric feature high_skew outliers

This is a numeric percentage of Hispanic population per row, ranging from 0 to 99.996 with a median of just 4.52 but a mean of 11.74, indicating a long right tail. Skew of 3.11 and kurtosis of 9.89 confirm heavy concentration at low values with 420 outliers (13.0% of rows) stretching toward 100. There are no nulls and only 0.5% exact zeros, so the column is densely populated rather than sparse. Treatment: Apply a log1p or similar transform before modelling to tame the right skew and outliers. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 3,205
min: 0
max: 100
mean: 11.74
median: 4.516
std: 19.4
q1: 2.363
q3: 10.66
iqr: 8.294
skew: 3.113
kurtosis: 9.888
n_outliers: 420
outlier_rate: 0.1304
zero_rate: 0.004967

poverty_rate

numeric feature high_skew

Continuous percentage values ranging from 0 to 66.19 with mean 15.38 and median 13.81, almost certainly a county-level poverty rate. The distribution is right-skewed (skew 2.11, kurtosis 6.92) with 143 high-end outliers (4.4%) stretching well beyond Q3 of 18.25. Values are near-unique (3,219 distinct across 3,221 rows), with no nulls and a single zero (zero_rate 0.0003105). Treatment: Apply log1p before regression to tame the right skew; a plain log or Box-Cox transform would need an offset because of the zero. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 3,219
min: 0
max: 66.19
mean: 15.38
median: 13.81
std: 7.97
q1: 10.34
q3: 18.25
iqr: 7.91
skew: 2.111
kurtosis: 6.922
n_outliers: 143
outlier_rate: 0.0444
zero_rate: 0.0003105
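
The outlier counts in this report appear to follow the conventional 1.5 × IQR rule; the report does not state its rule, so that is an assumption, though other column notes refer to whiskers and an "IQR fence". A minimal sketch of the fence calculation, with `tukey_fences` as a hypothetical helper name:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences computed from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

For poverty_rate, q1 = 10.34 and q3 = 18.25 give an upper fence of about 30.1 and a lower fence of about -1.5. No rate can fall below the lower fence, so every flagged value is on the high side. The histogram agrees: bins above 31.44 hold 120 counties and the 29.79 – 31.44 bin that straddles the fence holds 29, which brackets the reported 143.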

below_poverty_level

numeric feature high_skew outliers

This column appears to be a count of residents below the poverty level per geographic unit, ranging from 0 to 1,401,656 with a median of 3,831. The distribution is severely right-skewed (skew 15.1, kurtosis 360.7) with the mean (13,136) more than three times the median and 351 outliers (10.9% of rows). Standard deviation (44,284) dwarfs the IQR (8,390), consistent with a few very large jurisdictions dominating the tail. Treatment: Log-transform (or normalize per population) before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 2,824
min: 0
max: 1.402e+06
mean: 1.314e+04
median: 3,831
std: 4.428e+04
q1: 1,547
q3: 9,937
iqr: 8,390
skew: 15.11
kurtosis: 360.7
n_outliers: 351
outlier_rate: 0.109
zero_rate: 0.0003105

median_household_income

numeric feature high_skew outliers

Median household income per record (n=3,221, 3,099 unique, no nulls) with a typical value near the median of 52,380 and an IQR of 16,300. The mean of -152,820 and min of -666,666,666 betray a sentinel value masquerading as data, producing extreme skew (-56.73) and kurtosis (3215.99) plus 182 flagged outliers. The histogram places a single row at the sentinel. Once it is removed, the q1/q3 range of 44,939 to 61,239 looks like plausible US county-level income. Treatment: Replace the -666,666,666 sentinel with null, then consider winsorizing or log-transforming before modelling. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 0 (0.0%)
unique: 3,099
min: -6.667e+08
max: 147,111
mean: -1.528e+05
median: 52,380
std: 1.175e+07
q1: 44,939
q3: 61,239
iqr: 16,300
skew: -56.73
kurtosis: 3216
n_outliers: 182
outlier_rate: 0.0565
zero_rate: 0
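
A minimal sketch of the sentinel treatment in plain Python, assuming nulls are represented as `None`. It replaces only the exact sentinel rather than filtering to positive values, so any other suspicious value stays visible for separate review. The function names are hypothetical.

```python
SENTINEL = -666_666_666  # value reported as the column minimum


def clean_income(values, sentinel=SENTINEL):
    """Replace the sentinel with None so it is treated as missing."""
    return [None if v == sentinel else v for v in values]


def mean_ignoring_nulls(values):
    """Arithmetic mean over non-null entries; None if nothing remains."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None
```

A single sentinel among 3,221 rows shifts the mean by about -666,666,666 / 3,221 ≈ -207,000, which is enough to explain the reported mean of -152,820; working backwards from the rounded statistics, the mean of the remaining rows should be roughly 54,000.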

margin_2020

numeric feature

Numeric margin values for 2020, almost entirely unique across 3,221 rows (3,112 distinct), ranging from -0.87 to 0.93 with a mean of 0.317 and median 0.384. The column means suggest the margin is Republican minus Democratic share (0.650 - 0.333 ≈ 0.317), so positive values indicate a Republican lead. The distribution is left-skewed (skew -0.82): most counties cluster on the positive side while a tail of negative margins pulls the mean below the median. About 3.4% of rows are null and only 1.5% are flagged as outliers, with no zero values at all. Treatment: Use directly as a signed numeric feature; impute the 3.4% nulls and retain sign since negatives are meaningful. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 109 (3.4%)
unique: 3,112
min: -0.8675
max: 0.9309
mean: 0.317
median: 0.3844
std: 0.321
q1: 0.1348
q3: 0.5662
iqr: 0.4314
skew: -0.8212
kurtosis: 0.2286
n_outliers: 48
outlier_rate: 0.01542
zero_rate: 0
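
The reported means are consistent with margin_2020 = republican_pct_2020 - democratic_pct_2020 (0.6497 - 0.3327 = 0.3170), but that is inferred from summary statistics rather than stated by the source. The sketch below checks the relation row by row; the tuple layout, the tolerance, and the name `margin_mismatches` are all assumptions made for illustration.

```python
def margin_mismatches(rows, tol=1e-6):
    """Return indices where margin differs from republican - democratic.

    Each row is a (margin, republican_pct, democratic_pct) tuple; rows with
    any null (None) field are skipped.
    """
    bad = []
    for i, (margin, rep, dem) in enumerate(rows):
        if None in (margin, rep, dem):
            continue
        if abs(margin - (rep - dem)) > tol:
            bad.append(i)
    return bad
```

An empty result on the full dataset would confirm the sign convention; a non-empty one would point to rows where third-party votes or rounding enter the calculation differently.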

democratic_pct_2020

numeric feature

This is the share of votes cast for the Democratic candidate in 2020, recorded per row (likely county-level given n=3221). Values range from 0.031 to 0.921 with a median of 0.300 and mean of 0.333, indicating most units lean Republican while a long right tail of heavily Democratic units pulls the mean up (skew 0.83). About 3.4% of rows are null and 49 outliers (1.6%) sit beyond the whiskers; no zeros are present. Treatment: Use as-is as a proportion feature; impute the 3.4% nulls or drop those rows before modelling. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 109 (3.4%)
unique: 3,111
min: 0.03091
max: 0.9215
mean: 0.3327
median: 0.2998
std: 0.1598
q1: 0.2091
q3: 0.4236
iqr: 0.2145
skew: 0.8326
kurtosis: 0.2523
n_outliers: 49
outlier_rate: 0.01575
zero_rate: 0

republican_pct_2020

numeric feature

This is the 2020 Republican vote share per row (likely US county-level given n=3221), with a mean of 0.65 and median of 0.68. The distribution is left-skewed (skew -0.81): most counties sit at high Republican shares, with a tail toward low shares. Values range from 0.054 to 0.962, and 3.38% of rows are null. The low outlier count (47, or 1.5%) and near-unique values (3,111 distinct) are consistent with continuous geographic shares. Treatment: Use as-is as a continuous feature; impute or drop the 3.38% missing rows before modelling. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 109 (3.4%)
unique: 3,111
min: 0.05397
max: 0.9618
mean: 0.6497
median: 0.6829
std: 0.1613
q1: 0.5576
q3: 0.7747
iqr: 0.2171
skew: -0.8091
kurtosis: 0.2063
n_outliers: 47
outlier_rate: 0.0151
zero_rate: 0
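
The treatment offers two options for the missing rows: drop them or impute them. A minimal sketch of both in plain Python, assuming nulls are represented as `None`; median imputation is one possible choice, not something the report prescribes, and the function names are hypothetical.

```python
import statistics


def drop_nulls(values):
    """Listwise deletion: keep only non-null entries."""
    return [v for v in values if v is not None]


def impute_median(values):
    """Fill nulls with the median of the observed values."""
    med = statistics.median(drop_nulls(values))
    return [med if v is None else v for v in values]
```

All three 2020 columns report 109 nulls, so the same rows are probably missing in each; if so, dropping them once is simpler than imputing three related columns independently, which could break the relation between the shares and the margin.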

margin_2016

text feature one_word allcaps short_text

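This column stores the 2016 margin as a short percentage string (e.g. '15.17%', '26.55%'), 5 to 6 characters long with exactly one 'word' per row. The one_word and allcaps alerts are side effects of running text heuristics on numeric strings, not findings in their own right. Among non-null values 18.6% are duplicates, and '15.17%' alone appears 29 times, which is worth checking as a possible placeholder. The 6-character cap leaves no room for a minus sign on a two-digit margin, so the values may be unsigned; confirm this before comparing with the signed margin_2020. The null rate is 2.58% and there are 2,554 unique values across 3,221 rows. Treatment: Strip the '%', cast to float, and divide by 100 to match the 0-1 scale of margin_2020 before any numeric analysis. high · anthropic:claude-opus-4-7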

n: 3,221
nulls: 83 (2.6%)
unique: 2,554
len_min: 5
len_max: 6
len_mean: 5.896
len_median: 6
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 584
duplicate_rate: 0.1861
vocab_size: 2,554
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
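
A minimal sketch of the parsing treatment in plain Python. It converts strings like '15.17%' to a proportion so the result shares the 0-1 scale of margin_2020. Returning `None` for unparseable strings is a design choice made here, and `parse_percent` is a hypothetical helper name.

```python
def parse_percent(text):
    """Convert a string such as '15.17%' to a proportion (about 0.1517).

    Returns None for nulls and for strings that do not parse as a number.
    """
    if text is None:
        return None
    cleaned = text.strip().rstrip("%").strip()
    try:
        return float(cleaned) / 100.0
    except ValueError:
        return None
```

Counting how many non-null inputs come back as `None` is a quick way to find entries that are not plain percentages. Parsing does not settle whether the 2016 values are signed, so that still needs checking against democratic_pct_2016 and republican_pct_2016.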

democratic_pct_2016

numeric feature

Share of the 2016 vote going Democratic, recorded per row (likely county-level given n=3221). Values range from 0.031 to 0.928 with a mean of 0.317 and median 0.286, and the right skew of 0.94 reflects a long tail of heavily Democratic jurisdictions amid a mass of Republican-leaning ones. About 2.6% of rows are null and 75 high-side outliers (2.4%) sit above the IQR fence. Treatment: Use as-is as a proportion feature; impute or drop the 2.6% nulls before modelling. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 83 (2.6%)
unique: 3,111
min: 0.03145
max: 0.9285
mean: 0.3174
median: 0.2861
std: 0.1527
q1: 0.2054
q3: 0.3982
iqr: 0.1928
skew: 0.9371
kurtosis: 0.666
n_outliers: 75
outlier_rate: 0.0239
zero_rate: 0

republican_pct_2016

numeric feature

This column captures the Republican vote share by unit (likely US county) in the 2016 election, expressed as a proportion between 0.041 and 0.953. The distribution is left-skewed (skew -0.81) with a median of 0.666 above the mean of 0.635, indicating most units leaned Republican while a smaller tail of strongly Democratic units pulls the mean down. Near-unique values (3111 of 3221) and a 2.58% null rate are consistent with one row per geographic unit. Treatment: Use as-is as a proportion feature; impute or drop the ~2.6% nulls before modelling. high · anthropic:claude-opus-4-7

n: 3,221
nulls: 83 (2.6%)
unique: 3,111
min: 0.04122
max: 0.9527
mean: 0.6354
median: 0.6656
std: 0.1559
q1: 0.5463
q3: 0.7503
iqr: 0.2041
skew: -0.8145
kurtosis: 0.3566
n_outliers: 62
outlier_rate: 0.01976
zero_rate: 0