data trove scars standardized county analysis research system

source /home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv · 3,221 rows · 20 columns · profiled 2026-06-21

Reading

dataset summary · high confidence anthropic:default

This dataset covers 3,221 U.S. counties with demographic, economic, and electoral variables for the 2016 and 2020 presidential elections. The most striking finding is that Republican candidates dominated the majority of counties in both cycles — the median Republican share was roughly 67% in 2016 and 68% in 2020, while the Democratic median hovered near 29–30%, reflecting the well-known rural-county skew in U.S. politics. A data quality issue worth flagging immediately is the median_household_income column, which contains a minimum value of -666,666,666 — almost certainly a sentinel/error value — dragging the column mean to -$152,820 despite a plausible median of $52,380. Poverty rate averages about 15% across counties but reaches as high as 66%, and racial composition variables (pct_white, pct_black, pct_hispanic) are highly skewed, suggesting a small number of majority-minority counties sit at the extremes.

citing: republican_pct_2016.stats.median · republican_pct_2020.stats.median · democratic_pct_2016.stats.median · democratic_pct_2020.stats.median · median_household_income.stats.min · median_household_income.stats.median · poverty_rate.stats.mean · poverty_rate.stats.max · pct_white.stats.mean · pct_white.stats.skew · row_count

Charts the summary said to look at first

republican_pct_2020 · Look for the strong peak above 0.5 with a long left tail (skew −0.81), confirming Republican majorities in most U.S. counties in 2020.

Histogram bins for republican_pct_2020 (median: 0.6829120557612961).
bin	count
0.05397 – 0.07667	1
0.07667 – 0.09937	2
0.09937 – 0.1221	2
0.1221 – 0.1448	6
0.1448 – 0.1675	6
0.1675 – 0.1901	15
0.1901 – 0.2128	5
0.2128 – 0.2355	13
0.2355 – 0.2582	12
0.2582 – 0.2809	25
0.2809 – 0.3036	26
0.3036 – 0.3263	32
0.3263 – 0.349	32
0.349 – 0.3717	40
0.3717 – 0.3944	46
0.3944 – 0.4171	52
0.4171 – 0.4398	64
0.4398 – 0.4625	78
0.4625 – 0.4852	61
0.4852 – 0.5079	78
0.5079 – 0.5306	66
0.5306 – 0.5533	97
0.5533 – 0.576	126
0.576 – 0.5987	122
0.5987 – 0.6214	139
0.6214 – 0.6441	143
0.6441 – 0.6668	154
0.6668 – 0.6895	173
0.6895 – 0.7122	176
0.7122 – 0.7349	200
0.7349 – 0.7576	213
0.7576 – 0.7802	195
0.7802 – 0.8029	203
0.8029 – 0.8256	169
0.8256 – 0.8483	132
0.8483 – 0.871	103
0.871 – 0.8937	65
0.8937 – 0.9164	27
0.9164 – 0.9391	9
0.9391 – 0.9618	4

margin_2020 · The distribution of victory margins reveals how lopsided most county-level results are, with few truly competitive counties near zero.

Histogram bins for margin_2020 (median: 0.3843813151543954).
bin	count
-0.8675 – -0.8226	1
-0.8226 – -0.7776	2
-0.7776 – -0.7326	3
-0.7326 – -0.6877	5
-0.6877 – -0.6427	8
-0.6427 – -0.5978	12
-0.5978 – -0.5528	6
-0.5528 – -0.5078	12
-0.5078 – -0.4629	14
-0.4629 – -0.4179	22
-0.4179 – -0.373	28
-0.373 – -0.328	33
-0.328 – -0.283	29
-0.283 – -0.2381	46
-0.2381 – -0.1931	41
-0.1931 – -0.1482	54
-0.1482 – -0.1032	63
-0.1032 – -0.05823	67
-0.05823 – -0.01327	69
-0.01327 – 0.03169	73
0.03169 – 0.07665	70
0.07665 – 0.1216	90
0.1216 – 0.1666	131
0.1666 – 0.2115	117
0.2115 – 0.2565	129
0.2565 – 0.3015	141
0.3015 – 0.3464	159
0.3464 – 0.3914	165
0.3914 – 0.4363	181
0.4363 – 0.4813	206
0.4813 – 0.5263	195
0.5263 – 0.5712	197
0.5712 – 0.6162	213
0.6162 – 0.6611	175
0.6611 – 0.7061	140
0.7061 – 0.7511	102
0.7511 – 0.796	69
0.796 – 0.841	29
0.841 – 0.8859	11
0.8859 – 0.9309	4

poverty_rate · Poverty rate peaks around 12–13% (median 13.8%) but has a long right tail extending to 66%; watch for the outlier counties driving extreme values.

Histogram bins for poverty_rate (median: 13.807805224676027).
bin	count
0 – 1.655	2
1.655 – 3.31	9
3.31 – 4.964	48
4.964 – 6.619	123
6.619 – 8.274	212
8.274 – 9.929	317
9.929 – 11.58	378
11.58 – 13.24	392
13.24 – 14.89	353
14.89 – 16.55	338
16.55 – 18.2	235
18.2 – 19.86	192
19.86 – 21.51	157
21.51 – 23.17	108
23.17 – 24.82	77
24.82 – 26.48	53
26.48 – 28.13	41
28.13 – 29.79	37
29.79 – 31.44	29
31.44 – 33.1	13
33.1 – 34.75	10
34.75 – 36.41	11
36.41 – 38.06	7
38.06 – 39.72	2
39.72 – 41.37	8
41.37 – 43.03	6
43.03 – 44.68	7
44.68 – 46.33	11
46.33 – 47.99	6
47.99 – 49.64	9
49.64 – 51.3	8
51.3 – 52.95	5
52.95 – 54.61	6
54.61 – 56.26	2
56.26 – 57.92	2
57.92 – 59.57	3
59.57 – 61.23	1
61.23 – 62.88	1
62.88 – 64.54	0
64.54 – 66.19	2

pct_white · Most counties are majority-white with a left-skewed distribution, but the lower tail highlights a distinct cluster of majority-minority counties.

Histogram bins for pct_white (median: 87.65979926043318).
bin	count
3.29 – 5.708	2
5.708 – 8.125	1
8.125 – 10.54	4
10.54 – 12.96	2
12.96 – 15.38	10
15.38 – 17.8	6
17.8 – 20.21	4
20.21 – 22.63	7
22.63 – 25.05	12
25.05 – 27.47	12
27.47 – 29.89	7
29.89 – 32.3	4
32.3 – 34.72	10
34.72 – 37.14	13
37.14 – 39.56	24
39.56 – 41.97	18
41.97 – 44.39	24
44.39 – 46.81	30
46.81 – 49.23	24
49.23 – 51.64	34
51.64 – 54.06	33
54.06 – 56.48	37
56.48 – 58.9	58
58.9 – 61.32	43
61.32 – 63.73	69
63.73 – 66.15	72
66.15 – 68.57	77
68.57 – 70.99	76
70.99 – 73.4	88
73.4 – 75.82	100
75.82 – 78.24	98
78.24 – 80.66	143
80.66 – 83.08	132
83.08 – 85.49	163
85.49 – 87.91	191
87.91 – 90.33	246
90.33 – 92.75	302
92.75 – 95.16	453
95.16 – 97.58	525
97.58 – 100	67

pct_black · Heavily right-skewed with a median near 2.4% but a maximum near 88%; nearly half of all counties (1,568 of 3,221) fall in the lowest bin, below 2.2%, while a small number of counties are majority-Black.

Histogram bins for pct_black (median: 2.382606279522089).
bin	count
0 – 2.195	1568
2.195 – 4.39	402
4.39 – 6.584	218
6.584 – 8.779	153
8.779 – 10.97	112
10.97 – 13.17	86
13.17 – 15.36	70
15.36 – 17.56	54
17.56 – 19.75	46
19.75 – 21.95	48
21.95 – 24.14	36
24.14 – 26.34	42
26.34 – 28.53	36
28.53 – 30.73	41
30.73 – 32.92	34
32.92 – 35.12	32
35.12 – 37.31	28
37.31 – 39.51	19
39.51 – 41.7	25
41.7 – 43.9	25
43.9 – 46.09	18
46.09 – 48.28	17
48.28 – 50.48	14
50.48 – 52.67	9
52.67 – 54.87	13
54.87 – 57.06	11
57.06 – 59.26	13
59.26 – 61.45	7
61.45 – 63.65	8
63.65 – 65.84	4
65.84 – 68.04	1
68.04 – 70.23	6
70.23 – 72.43	8
72.43 – 74.62	5
74.62 – 76.82	2
76.82 – 79.01	5
79.01 – 81.21	1
81.21 – 83.4	1
83.4 – 85.6	1
85.6 – 87.79	2

Schema

20 columns

Per-column summary.
column	type	nulls	unique	alerts
NAME	text	0.0%	3,221	near_unique
total_population	numeric	0.0%	3,160	high_skew outliers
black_population	numeric	0.0%	2,066	high_skew outliers
white_population	numeric	0.0%	3,143	high_skew outliers
hispanic_population	numeric	0.0%	2,331	high_skew outliers
state	numeric	0.0%	52
county	numeric	0.0%	326	high_skew outliers
FIPS	numeric	0.0%	3,221
pct_black	numeric	0.0%	3,128	high_skew outliers
pct_white	numeric	0.0%	3,218
pct_hispanic	numeric	0.0%	3,205	high_skew outliers
poverty_rate	numeric	0.0%	3,219	high_skew
below_poverty_level	numeric	0.0%	2,824	high_skew outliers
median_household_income	numeric	0.0%	3,099	high_skew outliers
margin_2020	numeric	3.4%	3,112
democratic_pct_2020	numeric	3.4%	3,111
republican_pct_2020	numeric	3.4%	3,111
margin_2016	text	2.6%	2,554	one_word allcaps short_text
democratic_pct_2016	numeric	2.6%	3,111
republican_pct_2016	numeric	2.6%	3,111

NAME

text label near_unique

This column contains U.S. county names in a 'Name County, State' pattern (e.g., 'Jefferson County, Texas'); the token 'county,' appears in 3,007 of 3,221 rows, and state names dominate the top words. The remaining 214 rows presumably hold county-equivalents that use a different term, such as parishes, boroughs, or municipios. Every value is unique (3,221 distinct entries, 0 duplicates, 0 nulls), so the column can identify county-level records. The mean string length of ~24 characters and mean word count of ~3.2 fit the same pattern, and the small vocabulary (1,983 distinct words across 3,221 rows) points to structured, standardized naming rather than free text. Treatment: Use as a human-readable label; if it must serve as a join key, normalize casing and strip the trailing state suffix before joining to external county tables. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 3,221
len_min: 16
len_max: 42
len_mean: 24.27
len_median: 24
len_p95: 31
word_mean: 3.243
word_median: 3
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 1,983
readability_flesch_mean: 7.581
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
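
The 'Name County, State' pattern described for NAME suggests a simple way to separate the two parts before any join. The following is a minimal sketch, assuming every label places the state after its last comma; the helper name `split_county_label` is hypothetical and not part of the profiled pipeline.

```python
def split_county_label(label: str) -> tuple[str, str]:
    """Split 'Jefferson County, Texas' into ('Jefferson County', 'Texas').

    Assumes the state name follows the last comma in the label.
    Raises ValueError when the label contains no comma at all.
    """
    county, sep, state = label.rpartition(",")
    if not sep:
        raise ValueError(f"no comma in label: {label!r}")
    return county.strip(), state.strip()


# Example label taken from the column description.
print(split_county_label("Jefferson County, Texas"))
```

Splitting on the last comma rather than the first keeps any county name that itself contains a comma intact, although no such name is confirmed in this dataset.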

total_population

numeric feature high_skew outliers

This column is the total population of each county, ranging from 117 to 10,040,682. The distribution is severely right-skewed (skew = 13.67, kurtosis = 311.91): the median of 25,981 is only about a quarter of the mean of 102,398, indicating a long tail driven by a small number of very large population centers. An outlier rate of 13.7% (441 of 3,221 rows) is high and shows that large urban counties coexist with many small rural ones in the same dataset. Treatment: Log-transform before regression or distance-based modelling to reduce skew and outlier influence. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 3,160
min: 117
max: 1.004e+07
mean: 1.024e+05
median: 25,981
std: 3.283e+05
q1: 11,125
q3: 66,969
iqr: 55,844
skew: 13.67
kurtosis: 311.9
n_outliers: 441
outlier_rate: 0.1369
zero_rate: 0
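
To illustrate why a log transform is recommended for total_population, the sketch below applies a base-10 logarithm to the reported minimum, median, and maximum. It uses only the summary statistics listed above, not the underlying rows, and the choice of base 10 is an assumption made for readability.

```python
import math

# Summary values reported for total_population.
reported = {"min": 117, "median": 25_981, "max": 10_040_682}

# log10 turns multiplicative gaps into additive ones, so the
# five-orders-of-magnitude raw range collapses to a span of about 5 units.
logged = {name: math.log10(value) for name, value in reported.items()}

for name, value in logged.items():
    print(f"{name}: raw={reported[name]:>10,}  log10={value:.2f}")
```

On the log scale the median sits roughly midway between the minimum and maximum, whereas on the raw scale it is crowded against the minimum.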

black_population

numeric feature high_skew outliers

This column represents the count of Black residents per geographic unit (likely U.S. counties or census tracts). The distribution is extremely right-skewed (skew=10.46, kurtosis=148.22), with a median of just 859 versus a mean of 12,913 and a maximum of 1,202,260 — indicating a small number of high-population urban areas dominating the tail. 438 outliers (13.6% of rows) and a std of 54,951 against a median of 859 confirm the vast majority of units are small while a few are very large; 2.8% of records are zero, likely rural or sparsely populated geographies. Treatment: Log-transform (log1p) before modelling to compress the extreme right tail; consider per-capita normalisation if total population is available. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 2,066
min: 0
max: 1.202e+06
mean: 1.291e+04
median: 859
std: 5.495e+04
q1: 114
q3: 5,553
iqr: 5,439
skew: 10.46
kurtosis: 148.2
n_outliers: 438
outlier_rate: 0.136
zero_rate: 0.02825
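
The per-capita normalisation suggested for black_population can be sketched as a share of total population. The function name `percent_of_total` and the input values are hypothetical; whether the dataset's own pct_black column was derived exactly this way is not confirmed by the report.

```python
from typing import Optional


def percent_of_total(count: float, total: float) -> Optional[float]:
    """Return count as a percentage of total, or None if total is not positive."""
    if total <= 0:
        return None
    return 100.0 * count / total


# Hypothetical county with 50 residents in the group out of 200 in total.
print(percent_of_total(50, 200))
```

Returning None for a non-positive denominator keeps a bad population value from producing a division error or a misleading share.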

white_population

numeric feature high_skew outliers

This column represents the white population count for geographic units (likely counties or census tracts), with 3,221 non-null records spanning a wide range from 58 to 4,795,186. The distribution is severely right-skewed (skew = 10.35, kurtosis = 175.65): the median is only 21,282 while the mean is 72,000, indicating most units are small but a long tail of large urban areas dominates — 407 records (12.6%) are flagged as outliers. The near-unique value count (3,143 of 3,221) confirms this is a raw count feature, not a category or ID. Treatment: Log-transform (log1p) before modelling to reduce skew and compress the extreme outlier range. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 3,143
min: 58
max: 4.795e+06
mean: 7.2e+04
median: 21,282
std: 1.918e+05
q1: 8,855
q3: 56,553
iqr: 47,698
skew: 10.35
kurtosis: 175.7
n_outliers: 407
outlier_rate: 0.1264
zero_rate: 0

hispanic_population

numeric feature high_skew outliers

This column represents the Hispanic population count for geographic units (e.g., counties, census tracts, or ZIP codes) across 3,221 records. The distribution is extremely right-skewed (skew = 22.75, kurtosis = 744.79), with a median of only 1,209 but a mean of 19,427 and a maximum of 4,851,344 — indicating a small number of large urban areas dominate the distribution. 15.3% of records (492 rows) are flagged as outliers, and the IQR spans just 377–5,875 while the std is 125,108, confirming the extreme concentration of values at the low end with a long heavy tail. Treatment: Log-transform (log1p) before modelling to reduce extreme skew; consider per-capita normalization if total population is available. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 2,331
min: 0
max: 4.851e+06
mean: 1.943e+04
median: 1,209
std: 1.251e+05
q1: 377
q3: 5,875
iqr: 5,498
skew: 22.75
kurtosis: 744.8
n_outliers: 492
outlier_rate: 0.1527
zero_rate: 0.004967

state

numeric foreign_key

This column named 'state' is almost certainly a numeric state code (e.g., FIPS state codes or similar enumeration), with 52 distinct integer values ranging from 1 to 72 — consistent with US FIPS codes covering 50 states plus DC and outlying territories such as Puerto Rico (72). The distribution is remarkably flat and near-uniform (low kurtosis of -0.63, near-zero skew of 0.16, IQR of 27 across a 1–72 range), with zero nulls and zero outliers, indicating a clean, fully-populated categorical-as-integer field. The presence of 52 unique values rather than 50 or 51 suggests territorial codes are included, which may surprise analysts expecting only the 50 US states. Treatment: Treat as a categorical nominal code; do not use raw numeric value in regression — one-hot encode or left-join to a state reference table for geographic attributes. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 52
min: 1
max: 72
mean: 31.28
median: 30
std: 16.28
q1: 19
q3: 46
iqr: 27
skew: 0.157
kurtosis: -0.6261
n_outliers: 0
outlier_rate: 0
zero_rate: 0

county

numeric foreign_key high_skew outliers

This column is almost certainly the three-digit county portion of the FIPS code (the county number within its state), not a continuous measure. Its 326 unique values across 3,221 rows repeat from state to state, so the value identifies a county only in combination with state. The distribution is right-skewed (skew 2.87, kurtosis 11.64) with values ranging from 1 to 840 and 178 outliers (5.5%); this reflects how codes are assigned rather than any numeric magnitude, since low codes occur in nearly every state while high codes occur only in states with many counties or specially numbered county-equivalents. The mean (102.85) sitting well above the median (79.0) is a consequence of that sparse upper range. Treatment: Cast to categorical/string and combine with state (or use FIPS) as the geographic key; do not use the raw numeric value in any regression or distance-based model. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 326
min: 1
max: 840
mean: 102.8
median: 79
std: 106.6
q1: 35
q3: 133
iqr: 98
skew: 2.868
kurtosis: 11.64
n_outliers: 178
outlier_rate: 0.05526
zero_rate: 0

FIPS

numeric identifier

This column contains U.S. FIPS (Federal Information Processing Standards) county codes. These are five-digit identifiers, but they are stored here as integers, so leading zeros are dropped and values appear with 4–5 digits. Every row has a distinct value (n_unique = 3,221, matching n exactly) with no nulls, confirming this is a primary identifier for U.S. counties; there are 3,221 counties and county-equivalents in the U.S. (including Puerto Rico), matching this count. The distribution is nearly uniform (low skew of 0.157, mild platykurtosis of -0.63), consistent with the sequential-but-gapped structure of FIPS codes across states. The range of 1001 to 72153 fits U.S. county FIPS codes (Alabama's first county to Puerto Rico's last). Treatment: Treat as a categorical geographic identifier; do not use numerically. Left-join to FIPS reference tables for geographic enrichment or spatial analysis. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 3,221
min: 1,001
max: 72,153
mean: 3.138e+04
median: 30,023
std: 1.63e+04
q1: 19,031
q3: 46,105
iqr: 27,074
skew: 0.1569
kurtosis: -0.6308
n_outliers: 0
outlier_rate: 0
zero_rate: 0
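
Because FIPS is stored as a number, joining it to reference tables that keep the code as text usually requires restoring the leading zero. The sketch below shows one way to do that and to split a code into its state and county parts. The helper names are hypothetical, and the assumption that the state and county columns in this file equal those two parts is consistent with their reported ranges but not verified row by row.

```python
def fips_to_str(fips: int) -> str:
    """Format a numeric county FIPS code as a zero-padded five-character string."""
    return f"{fips:05d}"


def split_fips(fips: int) -> tuple[int, int]:
    """Split a county FIPS code into (state_code, county_code)."""
    return divmod(fips, 1000)


# The reported minimum and maximum of the FIPS column.
for code in (1001, 72153):
    print(fips_to_str(code), split_fips(code))
```

The reported minimum 1001 splits into state 1 and county 1, and the maximum 72153 into state 72 and county 153, which agrees with the 1 to 72 range reported for state.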

pct_black

numeric feature high_skew outliers

This column represents the percentage of Black residents in a geographic unit (e.g., census tract, county, or zip code), with 3,221 rows and no nulls. The distribution is heavily right-skewed (skew=2.33, kurtosis=5.45): the median is just 2.38% while the mean is pulled to 9.08%, and 422 rows (13.1%) are flagged as outliers reaching up to 87.79%. The IQR spans only 0.69–10.21%, meaning most units are predominantly non-Black, with a long tail of majority-Black geographies. Treatment: Apply log1p or quantile transformation before regression to address severe right skew and outlier influence. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 3,128
min: 0
max: 87.79
mean: 9.085
median: 2.383
std: 14.5
q1: 0.6919
q3: 10.21
iqr: 9.513
skew: 2.326
kurtosis: 5.451
n_outliers: 422
outlier_rate: 0.131
zero_rate: 0.02825

pct_white

numeric feature

This column represents the percentage of white population in a geographic or demographic unit, ranging from 3.29% to 100% across 3,221 records. The distribution is strongly left-skewed (skew = -1.56) with a mean of 81.2% and median of 87.7%, indicating the dataset is dominated by majority-white units — likely U.S. counties, census tracts, or similar jurisdictions. The gap between mean and median signals a long lower tail of more diverse units, and 145 outliers (4.5%) likely represent highly diverse areas pulling the distribution downward. Near-perfect uniqueness (3,218 of 3,221 values) confirms this is a continuous ratio measure, not a binned or rounded variable. Treatment: Use as-is or apply a reflection-log transform to address left skew before regression; consider interactions with other demographic features. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 3,218
min: 3.29
max: 100
mean: 81.2
median: 87.66
std: 17.35
q1: 73.62
q3: 93.99
iqr: 20.37
skew: -1.562
kurtosis: 2.301
n_outliers: 145
outlier_rate: 0.04502
zero_rate: 0
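
The reflection-log transform mentioned for pct_white can be sketched as follows: reflect each value about the upper bound so the long lower tail becomes an upper tail, then apply log1p. The function name `reflect_log` and the choice of 100 as the reflection point (the reported maximum) are assumptions for illustration.

```python
import math


def reflect_log(value: float, upper: float = 100.0) -> float:
    """Reflect a left-skewed percentage about `upper`, then apply log1p.

    The result is 0 at the upper bound and grows as the value falls,
    so the ordering of the original values is reversed.
    """
    return math.log1p(upper - value)


# Reported min, median, and max of pct_white.
for pct in (3.29, 87.66, 100.0):
    print(f"{pct:6.2f} -> {reflect_log(pct):.3f}")
```

Since the transform reverses direction, the sign of any fitted coefficient has to be read accordingly.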

pct_hispanic

numeric feature high_skew outliers

This column represents the percentage of Hispanic population in a geographic or demographic unit, ranging from 0% to nearly 100%. The distribution is severely right-skewed (skew=3.11, kurtosis=9.89): the median is only 4.52% while the mean is 11.74%, indicating most units have low Hispanic shares but a long tail of high-concentration areas drives the average up. A notable 13% of rows (420 out of 3221) are flagged as outliers, consistent with areas of heavy Hispanic concentration. The near-zero zero_rate (0.5%) and zero null_rate suggest good data completeness. Treatment: Log-transform or apply a square-root transformation before regression/modelling to reduce skew and diminish outlier leverage. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 3,205
min: 0
max: 100
mean: 11.74
median: 4.516
std: 19.4
q1: 2.363
q3: 10.66
iqr: 8.294
skew: 3.113
kurtosis: 9.888
n_outliers: 420
outlier_rate: 0.1304
zero_rate: 0.004967

poverty_rate

numeric feature high_skew

This column represents a poverty rate (percentage) measured across 3,221 counties, with complete coverage (0 nulls) and near-unique values (3,219 distinct). The distribution is right-skewed (skew 2.11, kurtosis 6.92), with a median of 13.8% and a mean pulled up to 15.4% by a long upper tail reaching 66.2%; 143 outliers (4.4% of records) make up this tail, a minority of counties with very high poverty concentration that will disproportionately influence linear models. Treatment: Apply a log1p or square-root transform to reduce right skew before regression (a plain log is undefined for the one zero-valued row); investigate the 143 outlier counties separately for data quality or structural differences. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 3,219
min: 0
max: 66.19
mean: 15.38
median: 13.81
std: 7.97
q1: 10.34
q3: 18.25
iqr: 7.91
skew: 2.111
kurtosis: 6.922
n_outliers: 143
outlier_rate: 0.0444
zero_rate: 0.0003105
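
The report does not state how outliers are counted. One common rule is Tukey's fences at 1.5 × IQR beyond the quartiles, and the sketch below applies it to the reported poverty_rate quartiles. Treating this as the profiler's rule is an assumption; the function name `tukey_fences` is hypothetical.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for poverty_rate.
lower, upper = tukey_fences(10.34, 18.25)
print(f"lower fence: {lower:.3f}, upper fence: {upper:.3f}")
```

The upper fence lands near 30.1%. In the poverty_rate histogram, the bins above 31.44 hold 120 counties and the 29.79–31.44 bin holds 29 more, so a cut near 30.1 is at least consistent with the reported 143 outliers. The lower fence is negative, so no county can fall below it.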

below_poverty_level

numeric feature high_skew outliers

This column represents a count of people living below the poverty level, likely aggregated at some geographic unit (e.g., census tract, county, or ZIP code). The distribution is extremely right-skewed (skew=15.1, kurtosis=360.7): the median is 3,831 but the mean is 13,136, and the maximum reaches 1,401,656 — almost certainly a large urban area or county-level aggregate pulling the tail hard. With 351 outliers (~10.9% of rows) and a standard deviation of 44,284 against a median of 3,831, a small number of high-population jurisdictions dominate the raw counts entirely. Treatment: Log-transform (log1p) before regression or clustering; consider normalizing by total population to create a poverty rate for more comparable cross-unit modelling. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 2,824
min: 0
max: 1.402e+06
mean: 1.314e+04
median: 3,831
std: 4.428e+04
q1: 1,547
q3: 9,937
iqr: 8,390
skew: 15.11
kurtosis: 360.7
n_outliers: 351
outlier_rate: 0.109
zero_rate: 0.0003105

median_household_income

numeric feature high_skew outliers

This column represents median household income, likely sourced from census or demographic data tied to geographic units. The median of 52,380 and IQR of 16,300 look plausible for household income, but the column is severely compromised by sentinel/error values: a minimum of -666,666,666 drags the mean to -152,820 and produces a kurtosis of 3,215 and skew of -56.73, all flagged as alerts. With 182 outliers (5.65% of rows) and a std of 11,747,597, the negative extremes are almost certainly coded null-substitutes or data-entry errors rather than real income values. Treatment: Replace -666666666 and any negative values with NaN, investigate remaining outliers above q3, then consider log-transform after cleaning before modelling. high · anthropic:default

n: 3,221
nulls: 0 (0.0%)
unique: 3,099
min: -6.667e+08
max: 147,111
mean: -1.528e+05
median: 52,380
std: 1.175e+07
q1: 44,939
q3: 61,239
iqr: 16,300
skew: -56.73
kurtosis: 3216
n_outliers: 182
outlier_rate: 0.0565
zero_rate: 0
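
The cleaning step recommended for median_household_income can be sketched as below, together with a back-of-envelope check on how many rows the sentinel might affect. The check assumes that exactly one row carries -666666666, which the report does not state; the names `SENTINEL` and `clean_income` are hypothetical.

```python
SENTINEL = -666_666_666


def clean_income(values):
    """Replace the sentinel and any other negative income with None."""
    return [None if v is None or v < 0 else v for v in values]


# Hypothetical rows: the reported median, the sentinel, and the reported max.
print(clean_income([52_380, SENTINEL, 147_111]))

# If a single sentinel row explains the reported mean, removing it
# should leave a plausible mean for the remaining n - 1 rows.
n = 3_221
reported_mean = -152_820
implied_mean_of_rest = (reported_mean * n - SENTINEL) / (n - 1)
print(round(implied_mean_of_rest))
```

Under the single-row assumption the remaining rows average roughly 54,000, close to the reported median of 52,380, so one sentinel row is enough to account for the distorted mean. Confirming the actual count requires inspecting the data.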

margin_2020

numeric feature

This column is the 2020 election margin expressed as a proportion (roughly −0.87 to +0.93). The summary statistics are consistent with it being the Republican share minus the Democratic share: the difference of the two reported means (0.650 − 0.333) equals the margin mean of 0.317. The distribution is moderately left-skewed (skew −0.82), with the mean of 0.317 sitting below the median of 0.384, indicating a tail of strongly negative values pulling the average down. Negative values (minimum −0.868) are meaningful and mark counties where the Democratic share exceeded the Republican share, while 48 outliers (1.54%) sit at the distributional extremes. The null rate of 3.38% is modest but worth investigating for systematic missingness. Treatment: Use as-is for modelling; consider investigating left-tail outliers and whether nulls are structurally missing before imputation. high · anthropic:default

n: 3,221
nulls: 109 (3.4%)
unique: 3,112
min: -0.8675
max: 0.9309
mean: 0.317
median: 0.3844
std: 0.321
q1: 0.1348
q3: 0.5662
iqr: 0.4314
skew: -0.8212
kurtosis: 0.2286
n_outliers: 48
outlier_rate: 0.01542
zero_rate: 0
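
The reported statistics suggest that margin_2020 is the Republican share minus the Democratic share. The sketch below checks that reading against the summary values only. It is not a row-level verification, and the minimum check further assumes that the lowest Republican share and the highest Democratic share occur in the same county.

```python
def margin(republican_share: float, democratic_share: float) -> float:
    """Two-party margin as Republican share minus Democratic share."""
    return republican_share - democratic_share


# Difference of the reported 2020 means; the reported margin mean is 0.317.
mean_check = margin(0.6497, 0.3327)

# Lowest Republican share against highest Democratic share;
# the reported margin minimum is -0.8675.
min_check = margin(0.05397, 0.9215)

print(f"mean check: {mean_check:.4f}")
print(f"min check:  {min_check:.4f}")
```

Both values agree with the reported margin statistics to the precision shown, which supports the interpretation without proving it for every row.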

democratic_pct_2020

numeric feature

This column represents the Democratic vote share (as a proportion 0–1) in the 2020 election at the county level across 3,221 rows. The distribution is right-skewed (skew=0.83) with a mean of 0.333 and median of 0.300, indicating most counties lean Republican; the typical county gave Democrats roughly 30% of the vote. The range spans 0.031 to 0.921, capturing both deep-red and deep-blue areas, with only 49 outliers (1.57%) and a modest null rate (3.38%, 109 rows), suggesting a largely clean electoral feature once the missing rows are accounted for. Treatment: Use as-is or apply a logit transform to stretch the bounded 0–1 proportion before regression or clustering. high · anthropic:default

n: 3,221
nulls: 109 (3.4%)
unique: 3,111
min: 0.03091
max: 0.9215
mean: 0.3327
median: 0.2998
std: 0.1598
q1: 0.2091
q3: 0.4236
iqr: 0.2145
skew: 0.8326
kurtosis: 0.2523
n_outliers: 49
outlier_rate: 0.01575
zero_rate: 0
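
The logit transform suggested for the vote-share columns maps a proportion in (0, 1) onto the whole real line. A minimal sketch follows; the clipping constant `eps` is an assumption added so that exact 0 or 1 values, should any occur, do not produce infinities.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of a proportion, clipped to [eps, 1 - eps] for safety."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


# Reported min, median, and max of democratic_pct_2020.
for share in (0.03091, 0.2998, 0.9215):
    print(f"{share:.4f} -> {logit(share):+.3f}")
```

A share of 0.5 maps to 0, shares below 0.5 map to negative values, and the extremes are stretched far more than the middle of the range.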

republican_pct_2020

numeric feature

This column represents the Republican vote share (as a proportion 0–1) in the 2020 U.S. election, most likely at the county or precinct level. The mean of 0.650 and median of 0.683 indicate a right-leaning dataset — the majority of geographic units recorded Republican majorities, which is consistent with county-level data where rural areas outnumber urban ones by count. The distribution is notably left-skewed (skew = −0.809), meaning a tail of strongly Democratic units pulls the mean below the median, while the near-mesokurtic kurtosis (0.206) and only 47 outliers suggest no extreme concentration at the tails. The null rate of 3.38% warrants investigation to confirm whether missing values reflect unreported results or data gaps. Treatment: Use as-is for modeling after imputing or flagging the 3.38% nulls; consider logit-transform if used as a continuous predictor in a linear model. high · anthropic:default

n: 3,221
nulls: 109 (3.4%)
unique: 3,111
min: 0.05397
max: 0.9618
mean: 0.6497
median: 0.6829
std: 0.1613
q1: 0.5576
q3: 0.7747
iqr: 0.2171
skew: -0.8091
kurtosis: 0.2063
n_outliers: 47
outlier_rate: 0.0151
zero_rate: 0

margin_2016

text feature one_word allcaps short_text

This column stores the 2016 election margin as a percentage string (e.g., '15.17%') rather than a numeric type. All 3,138 non-null values are single tokens of 5–6 characters, confirming a uniform percentage format; the allcaps flag is presumably an artifact of the strings containing no lowercase letters. '15.17%' appears 29 times, far more than any other value, suggesting it may be a default, imputed, or boundary value worth investigating. The duplicate rate of 18.6% (584 duplicates across 2,554 unique values) is notable for what should otherwise be a near-continuous numeric measure. Treatment: Strip the '%' suffix and cast to float; investigate the 29 occurrences of '15.17%' for data quality issues before modelling. high · anthropic:default

n: 3,221
nulls: 83 (2.6%)
unique: 2,554
len_min: 5
len_max: 6
len_mean: 5.896
len_median: 6
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 584
duplicate_rate: 0.1861
vocab_size: 2,554
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
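
The conversion recommended for margin_2016 can be sketched as a small parser. Dividing by 100 so the result is on the same proportion scale as margin_2020 is an assumption about intended use, and the function name `parse_percent` is hypothetical.

```python
from typing import Optional


def parse_percent(text: Optional[str]) -> Optional[float]:
    """Convert a string such as '15.17%' to a proportion (0.1517).

    Returns None for missing or empty input so nulls stay nulls.
    """
    if text is None:
        return None
    cleaned = text.strip()
    if not cleaned:
        return None
    return float(cleaned.rstrip("%")) / 100.0


# The most frequent value in the column, plus a missing entry.
print(parse_percent("15.17%"))
print(parse_percent(None))
```

Malformed strings raise ValueError from `float`, which surfaces unexpected formats instead of silently converting them.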

democratic_pct_2016

numeric feature

This column represents the Democratic party vote share (as a proportion 0–1) in the 2016 U.S. presidential election, most likely aggregated at the county level given 3,221 rows. The distribution is right-skewed (skew=0.94) with a mean of 0.317 and median of 0.286, indicating that most geographic units lean Republican, with a long tail of heavily Democratic areas reaching up to 0.928. The spread is moderate (IQR=0.193, std=0.153) and 75 outliers exist on the high end, likely dense urban counties. Treatment: Use as-is or apply logit-transform to unbound the [0,1] proportion before linear modelling. high · anthropic:default

n: 3,221
nulls: 83 (2.6%)
unique: 3,111
min: 0.03145
max: 0.9285
mean: 0.3174
median: 0.2861
std: 0.1527
q1: 0.2054
q3: 0.3982
iqr: 0.1928
skew: 0.9371
kurtosis: 0.666
n_outliers: 75
outlier_rate: 0.0239
zero_rate: 0

republican_pct_2016

numeric feature

This column represents the Republican vote share (as a proportion, 0–1) in the 2016 U.S. presidential election at the county level across 3,221 rows. The distribution is left-skewed (skew = -0.81) with a median of 0.666 and mean of 0.635, indicating that most counties leaned heavily Republican in 2016, which is consistent with county-level data where Republicans dominate by count even if not by population. The range spans 0.041 to 0.953, covering overwhelmingly Democratic to overwhelmingly Republican counties, with only 62 outliers (1.98%) and a low null rate (2.58%, 83 rows), suggesting a largely clean field. Treatment: Use directly as a continuous feature; consider pairing with the Democratic equivalent or computing a two-party margin; mild left skew does not require transformation for most models. high · anthropic:default

n: 3,221
nulls: 83 (2.6%)
unique: 3,111
min: 0.04122
max: 0.9527
mean: 0.6354
median: 0.6656
std: 0.1559
q1: 0.5463
q3: 0.7503
iqr: 0.2041
skew: -0.8145
kurtosis: 0.3566
n_outliers: 62
outlier_rate: 0.01976
zero_rate: 0