nationwide-2016_election · saturn notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/data/geographic/nationwide/2016_election.csv

Saturn profiled 3,141 rows across 11 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

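# Install the profiler (IPython shell escape; valid in a notebook cell only).
!pip install saturn-dissect
import subprocess

# Profile the CSV, write the findings to JSON, and request the LLM-written prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/geographic/nationwide/2016_election.csv",
    "--findings", "nationwide-2016_election.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if the CLI exits non-zero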

Summary confidence: high

This dataset contains 3,141 rows and 11 columns covering 2016 U.S. presidential election results at the county level, including total votes, Democratic and Republican vote counts and shares, and county/state identifiers. Vote-count columns (total_votes, votes_dem, votes_gop) are extremely right-skewed with high kurtosis and many outliers, reflecting a few very populous counties dominating the totals, so a log-scale or filtered view is worthwhile. The per_gop and per_dem share columns tell a clearer story: per_gop has a mean of about 0.64 versus per_dem at 0.32. These are unweighted county means, so they describe the typical county rather than the national vote split; together with a median per_gop near 0.67, they indicate that the Republican share was larger in most counties. State coverage is broad (51 categories) with Texas (254 counties) and Georgia (159) most represented, so any state-level aggregation should account for that imbalance.

citing: row_count · column_count · total_votes · votes_dem · votes_gop · per_dem · per_gop · state_abbr · county_name
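The gap between a per-county mean share and a pooled share can be illustrated with a small sketch. The three rows below are invented for illustration and are not taken from the dataset; the function names are hypothetical.

```python
# Hypothetical rows: (total_votes, votes_gop). Values are invented for illustration.
rows = [
    (1_000, 700),       # small county, 70% GOP
    (2_000, 1_300),     # small county, 65% GOP
    (100_000, 35_000),  # large county, 35% GOP
]

def unweighted_mean_share(rows):
    """Average of per-county shares; every county counts equally."""
    return sum(gop / total for total, gop in rows) / len(rows)

def weighted_share(rows):
    """Pooled share; counties count in proportion to their votes."""
    return sum(gop for _, gop in rows) / sum(total for total, _ in rows)

print(round(unweighted_mean_share(rows), 3))
print(round(weighted_share(rows), 3))
```

With many small counties and a few very large ones, the two figures can differ substantially, which is why the per-county means should not be read as a national result.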

Out[4]:

saturn.schema() · 11 columns

column	kind	n	null%	unique	alerts
(unnamed)	numeric	3,141	0.0%	3,141
votes_dem	numeric	3,141	0.0%	2,688	high_skew outliers
votes_gop	numeric	3,141	0.0%	2,901	high_skew outliers
total_votes	numeric	3,141	0.0%	2,966	high_skew outliers
per_dem	numeric	3,141	0.0%	3,112
per_gop	numeric	3,141	0.0%	3,112
diff	text	3,141	0.0%	2,738	one_word allcaps short_text
per_point_diff	text	3,141	0.0%	2,555	one_word allcaps short_text
state_abbr	categorical	3,141	0.0%	51
county_name	text	3,141	0.0%	1,848	short_text duplicates
combined_fips	numeric	3,141	0.0%	3,141

Fig 1.

per_gop · Distribution of Republican vote share across counties — note the left-skewed shape with a median near 0.67.

Histogram bins for per_gop (median: 0.665352643757).
bin	count
0.04122 – 0.06401	1
0.06401 – 0.0868	2
0.0868 – 0.1096	5
0.1096 – 0.1324	2
0.1324 – 0.1552	6
0.1552 – 0.1779	11
0.1779 – 0.2007	7
0.2007 – 0.2235	17
0.2235 – 0.2463	17
0.2463 – 0.2691	17
0.2691 – 0.2919	23
0.2919 – 0.3147	32
0.3147 – 0.3375	34
0.3375 – 0.3602	30
0.3602 – 0.383	43
0.383 – 0.4058	41
0.4058 – 0.4286	64
0.4286 – 0.4514	71
0.4514 – 0.4742	63
0.4742 – 0.497	89
0.497 – 0.5198	78
0.5198 – 0.5425	117
0.5425 – 0.5653	116
0.5653 – 0.5881	147
0.5881 – 0.6109	147
0.6109 – 0.6337	156
0.6337 – 0.6565	165
0.6565 – 0.6793	193
0.6793 – 0.7021	190
0.7021 – 0.7249	215
0.7249 – 0.7476	223
0.7476 – 0.7704	213
0.7704 – 0.7932	187
0.7932 – 0.816	142
0.816 – 0.8388	113
0.8388 – 0.8616	74
0.8616 – 0.8844	49
0.8844 – 0.9072	29
0.9072 – 0.9299	9
0.9299 – 0.9527	3

Fig 2.

per_dem · Democratic vote share by county; compare against per_gop to see the partisan tilt of most counties.

Histogram bins for per_dem (median: 0.2864).
bin	count
0.03145 – 0.05387	8
0.05387 – 0.0763	16
0.0763 – 0.09872	52
0.09872 – 0.1211	74
0.1211 – 0.1436	116
0.1436 – 0.166	146
0.166 – 0.1884	203
0.1884 – 0.2109	226
0.2109 – 0.2333	240
0.2333 – 0.2557	218
0.2557 – 0.2781	200
0.2781 – 0.3006	205
0.3006 – 0.323	153
0.323 – 0.3454	147
0.3454 – 0.3678	153
0.3678 – 0.3903	152
0.3903 – 0.4127	106
0.4127 – 0.4351	111
0.4351 – 0.4575	77
0.4575 – 0.48	78
0.48 – 0.5024	56
0.5024 – 0.5248	72
0.5248 – 0.5472	45
0.5472 – 0.5697	42
0.5697 – 0.5921	36
0.5921 – 0.6145	36
0.6145 – 0.6369	34
0.6369 – 0.6594	25
0.6594 – 0.6818	30
0.6818 – 0.7042	16
0.7042 – 0.7266	12
0.7266 – 0.7491	12
0.7491 – 0.7715	14
0.7715 – 0.7939	9
0.7939 – 0.8163	6
0.8163 – 0.8388	4
0.8388 – 0.8612	4
0.8612 – 0.8836	4
0.8836 – 0.906	2
0.906 – 0.9285	1

Fig 3.

total_votes · Total votes per county is heavily right-skewed (skew ~8.9); a handful of large counties dwarf the rest.

Histogram bins for total_votes (median: 11144.0).
bin	count
64 – 6.636e+04	2699
6.636e+04 – 1.327e+05	192
1.327e+05 – 1.99e+05	77
1.99e+05 – 2.653e+05	70
2.653e+05 – 3.316e+05	33
3.316e+05 – 3.979e+05	20
3.979e+05 – 4.642e+05	12
4.642e+05 – 5.305e+05	4
5.305e+05 – 5.968e+05	7
5.968e+05 – 6.631e+05	9
6.631e+05 – 7.294e+05	4
7.294e+05 – 7.957e+05	5
7.957e+05 – 8.62e+05	1
8.62e+05 – 9.283e+05	1
9.283e+05 – 9.946e+05	1
9.946e+05 – 1.061e+06	1
1.061e+06 – 1.127e+06	1
1.127e+06 – 1.193e+06	0
1.193e+06 – 1.26e+06	1
1.26e+06 – 1.326e+06	1
1.326e+06 – 1.392e+06	0
1.392e+06 – 1.459e+06	0
1.459e+06 – 1.525e+06	0
1.525e+06 – 1.591e+06	0
1.591e+06 – 1.658e+06	0
1.658e+06 – 1.724e+06	0
1.724e+06 – 1.79e+06	0
1.79e+06 – 1.856e+06	0
1.856e+06 – 1.923e+06	0
1.923e+06 – 1.989e+06	0
1.989e+06 – 2.055e+06	1
2.055e+06 – 2.122e+06	0
2.122e+06 – 2.188e+06	0
2.188e+06 – 2.254e+06	0
2.254e+06 – 2.321e+06	0
2.321e+06 – 2.387e+06	0
2.387e+06 – 2.453e+06	0
2.453e+06 – 2.519e+06	0
2.519e+06 – 2.586e+06	0
2.586e+06 – 2.652e+06	1

Fig 4.

state_abbr · Counties per state — Texas and Georgia dominate the row counts, which matters for any aggregation.

Top values for state_abbr (20 unique shown, of 51 total).
value	count	share
TX	254	8.1%
GA	159	5.1%
VA	133	4.2%
KY	120	3.8%
MO	115	3.7%
KS	105	3.3%
IL	102	3.2%
NC	100	3.2%
IA	99	3.2%
TN	95	3.0%
NE	93	3.0%
IN	92	2.9%
OH	88	2.8%
MN	87	2.8%
MI	83	2.6%
MS	82	2.6%
OK	77	2.5%
AR	75	2.4%
WI	72	2.3%
AL	67	2.1%

Fig 5.

votes_gop · Republican vote totals per county; check the long tail of outlier counties versus the median of ~7,268.

Histogram bins for votes_gop (median: 7268.0).
bin	count
57 – 1.556e+04	2260
1.556e+04 – 3.107e+04	381
3.107e+04 – 4.657e+04	153
4.657e+04 – 6.208e+04	100
6.208e+04 – 7.759e+04	54
7.759e+04 – 9.309e+04	32
9.309e+04 – 1.086e+05	27
1.086e+05 – 1.241e+05	23
1.241e+05 – 1.396e+05	47
1.396e+05 – 1.551e+05	13
1.551e+05 – 1.706e+05	14
1.706e+05 – 1.861e+05	4
1.861e+05 – 2.016e+05	8
2.016e+05 – 2.171e+05	2
2.171e+05 – 2.326e+05	2
2.326e+05 – 2.481e+05	2
2.481e+05 – 2.637e+05	4
2.637e+05 – 2.792e+05	3
2.792e+05 – 2.947e+05	1
2.947e+05 – 3.102e+05	1
3.102e+05 – 3.257e+05	1
3.257e+05 – 3.412e+05	2
3.412e+05 – 3.567e+05	1
3.567e+05 – 3.722e+05	0
3.722e+05 – 3.877e+05	1
3.877e+05 – 4.032e+05	0
4.032e+05 – 4.187e+05	0
4.187e+05 – 4.342e+05	0
4.342e+05 – 4.497e+05	1
4.497e+05 – 4.652e+05	0
4.652e+05 – 4.807e+05	1
4.807e+05 – 4.962e+05	0
4.962e+05 – 5.117e+05	0
5.117e+05 – 5.273e+05	0
5.273e+05 – 5.428e+05	0
5.428e+05 – 5.583e+05	1
5.583e+05 – 5.738e+05	0
5.738e+05 – 5.893e+05	0
5.893e+05 – 6.048e+05	1
6.048e+05 – 6.203e+05	1

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
(unnamed)	numeric	0.0%
votes_dem	numeric	0.0%
votes_gop	numeric	0.0%
total_votes	numeric	0.0%
per_dem	numeric	0.0%
per_gop	numeric	0.0%
diff	text	0.0%
per_point_diff	text	0.0%
state_abbr	categorical	0.0%
county_name	text	0.0%
combined_fips	numeric	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 7 numeric columns (values clipped to 2 decimals).
	(unnamed)	votes_dem	votes_gop	total_votes	per_dem	per_gop	combined_fips
(unnamed)	+1.00	-0.11	-0.12	-0.11	-0.10	+0.08	+0.99
votes_dem	-0.11	+1.00	+0.86	+0.98	+0.37	-0.37	-0.11
votes_gop	-0.12	+0.86	+1.00	+0.93	+0.36	-0.37	-0.12
total_votes	-0.11	+0.98	+0.93	+1.00	+0.37	-0.38	-0.12
per_dem	-0.10	+0.37	+0.36	+0.37	+1.00	-0.98	-0.09
per_gop	+0.08	-0.37	-0.37	-0.38	-0.98	+1.00	+0.07
combined_fips	+0.99	-0.11	-0.12	-0.12	-0.09	+0.07	+1.00
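The -0.98 between per_dem and per_gop is what one would expect when the two shares nearly sum to one. A minimal sketch of the Pearson coefficient on invented values follows; the `other` list is a hypothetical third-party share, not a column in this dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented shares for four counties.
per_dem = [0.20, 0.30, 0.45, 0.60]
other = [0.05, 0.03, 0.06, 0.04]  # hypothetical third-party share
per_gop = [1 - d - o for d, o in zip(per_dem, other)]

r = pearson(per_dem, per_gop)
print(round(r, 3))
```

Because the two columns carry nearly the same information, a linear model would usually include only one of them.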

(unnamed) numeric identifier

This unnamed numeric column runs from 0 to 3140 with exactly 3141 unique values across 3141 rows, mean and median both 1570, and zero skew — the hallmarks of a row index rather than a measured feature. There are no nulls and no outliers, and the only zero is the single index-0 row (zero_rate ≈ 0.00032).

Treatment: Drop before modelling; it is a sequential row index.

anthropic:claude-opus-4-7 · confidence high
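One way to drop the index column while reading the file is sketched below with the standard library. The two-row excerpt is hypothetical; only the empty header name mirrors the real file.

```python
import csv
import io

# Hypothetical excerpt with the same leading unnamed column as the source CSV.
sample = io.StringIO(
    ",votes_dem,votes_gop,state_abbr\n"
    "0,100,200,AA\n"
    "1,300,150,BB\n"
)

reader = csv.DictReader(sample)
# DictReader keys the unnamed column by the empty string "".
rows = [{k: v for k, v in row.items() if k != ""} for row in reader]

print(rows[0])
```

With pandas, passing `index_col=0` to `read_csv` has a similar effect by treating the first column as the row index.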

Out[13]:

saturn.columns[""].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	3,141
min	0
max	3,140
mean	1,570
median	1,570
std	906.9
q1	785
q3	2,355
iqr	1,570
skew	0
kurtosis	-1.2
n_outliers	0
outlier_rate	0
zero_rate	0.0003184

Fig 8.

Distribution of the unnamed index column. Vertical dash marks the median.

Histogram bins for the unnamed index column (median: 1570.0).
bin	count
0 – 78.5	79
78.5 – 157	78
157 – 235.5	79
235.5 – 314	78
314 – 392.5	79
392.5 – 471	78
471 – 549.5	79
549.5 – 628	78
628 – 706.5	79
706.5 – 785	78
785 – 863.5	79
863.5 – 942	78
942 – 1020	79
1020 – 1099	78
1099 – 1178	79
1178 – 1256	78
1256 – 1334	79
1334 – 1413	78
1413 – 1492	79
1492 – 1570	78
1570 – 1648	79
1648 – 1727	78
1727 – 1806	79
1806 – 1884	78
1884 – 1962	79
1962 – 2041	78
2041 – 2120	79
2120 – 2198	78
2198 – 2276	79
2276 – 2355	78
2355 – 2434	79
2434 – 2512	78
2512 – 2590	79
2590 – 2669	78
2669 – 2748	79
2748 – 2826	78
2826 – 2904	79
2904 – 2983	78
2983 – 3062	79
3062 – 3140	79

votes_dem numeric feature

Counts of Democratic votes per county, ranging from 4 to 1,893,770 with a median of 3,194 but a mean of 20,734. The distribution is extremely right-skewed (skew 11.65, kurtosis 224.4), and 468 rows (14.9%) flag as outliers, consistent with a few large urban counties dwarfing the rest. No nulls or zeros, and 2,688 unique values across 3,141 rows.

Treatment: Log-transform before regression or convert to a share/per-capita rate to tame the skew.

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["votes_dem"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	2,688
min	4
max	1.894e+06
mean	2.073e+04
median	3,194
std	7.2e+04
q1	1,175
q3	10,047
iqr	8,872
skew	11.65
kurtosis	224.4
n_outliers	468
outlier_rate	0.149
zero_rate	0
alert: high_skew	skew=+11.65
alert: outliers	14.9% rows beyond 1.5 IQR
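The outlier alert can be reproduced from the quartiles alone, assuming "beyond 1.5 IQR" means the usual Tukey fences (the report does not spell out the rule). The sketch uses the q1 and q3 values from the table above; `tukey_fences` and `is_outlier` are illustrative names.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for votes_dem in the stats table.
lower, upper = tukey_fences(1_175, 10_047)

def is_outlier(value):
    """True when the value falls outside the fences."""
    return value < lower or value > upper

print(lower, upper)
```

Under this assumption the lower fence is negative, so for a vote count every flagged row must sit on the upper side, above roughly 23,000 votes.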

Fig 9.

Distribution of votes_dem. Vertical dash marks the median.

Histogram bins for votes_dem (median: 3194.0).
bin	count
4 – 4.735e+04	2844
4.735e+04 – 9.469e+04	145
9.469e+04 – 1.42e+05	52
1.42e+05 – 1.894e+05	28
1.894e+05 – 2.367e+05	19
2.367e+05 – 2.841e+05	11
2.841e+05 – 3.314e+05	15
3.314e+05 – 3.788e+05	6
3.788e+05 – 4.261e+05	2
4.261e+05 – 4.734e+05	3
4.734e+05 – 5.208e+05	5
5.208e+05 – 5.681e+05	5
5.681e+05 – 6.155e+05	1
6.155e+05 – 6.628e+05	2
6.628e+05 – 7.102e+05	1
7.102e+05 – 7.575e+05	0
7.575e+05 – 8.049e+05	0
8.049e+05 – 8.522e+05	0
8.522e+05 – 8.995e+05	0
8.995e+05 – 9.469e+05	0
9.469e+05 – 9.942e+05	0
9.942e+05 – 1.042e+06	0
1.042e+06 – 1.089e+06	0
1.089e+06 – 1.136e+06	0
1.136e+06 – 1.184e+06	0
1.184e+06 – 1.231e+06	0
1.231e+06 – 1.278e+06	0
1.278e+06 – 1.326e+06	0
1.326e+06 – 1.373e+06	0
1.373e+06 – 1.42e+06	0
1.42e+06 – 1.468e+06	0
1.468e+06 – 1.515e+06	0
1.515e+06 – 1.562e+06	1
1.562e+06 – 1.61e+06	0
1.61e+06 – 1.657e+06	0
1.657e+06 – 1.704e+06	0
1.704e+06 – 1.752e+06	0
1.752e+06 – 1.799e+06	0
1.799e+06 – 1.846e+06	0
1.846e+06 – 1.894e+06	1

votes_gop numeric feature

Per-county GOP vote totals across 3,141 rows, almost all distinct (2,901 unique) and never null or zero. The distribution is heavily right-skewed (skew 5.78, kurtosis 51.78) with a median of 7,268 but a max of 620,285, and 394 rows (12.5%) flagged as outliers — consistent with a few very populous counties dwarfing the rest.

Treatment: Log-transform or normalize by county population before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["votes_gop"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	2,901
min	57
max	620,285
mean	2.065e+04
median	7,268
std	4.163e+04
q1	3,241
q3	18,130
iqr	14,889
skew	5.78
kurtosis	51.78
n_outliers	394
outlier_rate	0.1254
zero_rate	0
alert: high_skew	skew=+5.78
alert: outliers	12.5% rows beyond 1.5 IQR

Fig 10.

Distribution of votes_gop. Vertical dash marks the median.

Histogram bins for votes_gop (median: 7268.0): same 40 bins as the table under Fig 5.

total_votes numeric feature

Per-row vote totals across 3,141 records, almost all distinct (2,966 unique) with no nulls or zeros. The distribution is severely right-skewed (skew 8.89, kurtosis 136.17): the median is 11,144 yet the mean is 43,636 and the max reaches 2,652,072, far above Q3 of 29,799. About 14% of rows (442) flag as outliers, consistent with a few very high-vote jurisdictions dominating the tail.

Treatment: log-transform before regression or aggregation to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
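A rough sketch of what the log transform does to this column, using only the five summary values quoted in this section. Base 10 is an arbitrary choice here; any base compresses the tail in the same way.

```python
import math

# Five-number summary of total_votes, taken from the stats reported in this section.
summary = {"min": 64, "q1": 4_870, "median": 11_144, "q3": 29_799, "max": 2_652_072}

# log10 compresses the right tail: each unit is a tenfold change in votes.
logged = {name: math.log10(value) for name, value in summary.items()}

for name, value in summary.items():
    print(f"{name:>6}: {value:>9,d} -> {logged[name]:.2f}")

# Ratio of max to median before the transform, and the gap between them after it.
raw_ratio = summary["max"] / summary["median"]
log_gap = logged["max"] - logged["median"]
```

On the raw scale the largest county is a few hundred times the median; on the log scale the same distance is under three units, which keeps the tail from dominating a fit or a plot axis.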

Out[22]:

saturn.columns["total_votes"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	2,966
min	64
max	2.652e+06
mean	4.364e+04
median	11,144
std	1.146e+05
q1	4,870
q3	29,799
iqr	24,929
skew	8.894
kurtosis	136.2
n_outliers	442
outlier_rate	0.1407
zero_rate	0
alert: high_skew	skew=+8.89
alert: outliers	14.1% rows beyond 1.5 IQR

Fig 11.

Distribution of total_votes. Vertical dash marks the median.

Histogram bins for total_votes (median: 11144.0): same 40 bins as the table under Fig 3.

per_dem numeric feature

Values are continuous proportions bounded between 0.031 and 0.928 with mean 0.318 and median 0.286, consistent with a per-county Democratic vote share across 3,141 rows (close to the number of U.S. counties and county-equivalents). The distribution is right-skewed (skew 0.94) with 76 outliers (2.4%) on the upper tail, reflecting a minority of heavily Democratic counties. Near-unique values (3,112/3,141) and zero null/zero rates indicate a clean, fully populated feature.

Treatment: Use as-is as a bounded proportion; consider a logit transform if feeding a linear model due to right skew.

anthropic:claude-opus-4-7 · confidence high
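The logit transform mentioned in the treatment maps a share in (0, 1) onto the whole real line. A minimal sketch follows; the `eps` clamp is an added safeguard for shares of exactly 0 or 1, which this column does not contain according to the stats.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of a proportion, clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def inv_logit(x):
    """Map a log-odds value back to a proportion."""
    return 1 / (1 + math.exp(-x))

# Min, median and max of per_dem as reported in this section.
for share in (0.03145, 0.2864, 0.9285):
    print(f"{share:.4f} -> {logit(share):+.3f}")
```

Predictions made on the logit scale can be converted back with `inv_logit`, which always returns a value strictly between 0 and 1.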

Out[25]:

saturn.columns["per_dem"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	3,112
min	0.03145
max	0.9285
mean	0.3176
median	0.2864
std	0.153
q1	0.2054
q3	0.3982
iqr	0.1929
skew	0.9422
kurtosis	0.6859
n_outliers	76
outlier_rate	0.0242
zero_rate	0

Fig 12.

Distribution of per_dem. Vertical dash marks the median.

Histogram bins for per_dem (median: 0.2864): same 40 bins as the table under Fig 2.

per_gop numeric feature

Likely the Republican (GOP) vote share per geographic unit (e.g., county), bounded between 0.041 and 0.953 with no nulls and 3112 unique values across 3141 rows. The distribution is left-skewed (skew -0.82) with a median of 0.665 above the mean of 0.635, indicating most units lean Republican while a tail of low-GOP units pulls the mean down. 63 outliers (2.0%) sit outside the IQR fence, consistent with strongly Democratic enclaves.

Treatment: Use as-is as a proportion feature; consider a logit transform if feeding a linear model.

anthropic:claude-opus-4-7 · confidence high

Out[28]:

saturn.columns["per_gop"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	3,112
min	0.04122
max	0.9527
mean	0.6351
median	0.6654
std	0.1561
q1	0.5458
q3	0.7503
iqr	0.2045
skew	-0.8193
kurtosis	0.376
n_outliers	63
outlier_rate	0.02006
zero_rate	0

Fig 13.

Distribution of per_gop. Vertical dash marks the median.

Histogram bins for per_gop (median: 0.665352643757): same 40 bins as the table under Fig 1.

diff text feature

Despite being typed as text, `diff` is a single-token numeric field stored as comma-formatted strings (one_word_rate 1.0, len_mean ~4.9, max length 9). All 3,141 rows are populated with 2,738 unique values and 403 duplicates (12.8%); the value '37,410' appears 29 times, far above any other, suggesting either a sentinel or a result repeated across rows. That count equals the 29 rows whose county_name is the bare string 'Alaska', so a statewide figure copied onto each Alaska row is a plausible explanation to check. The allcaps and short_text alerts are artefacts of digits-only content rather than real prose.

Treatment: strip commas and cast to numeric before modelling, and investigate the spike at 37,410.

anthropic:claude-opus-4-7 · confidence high
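The cast described in the treatment can be sketched as below. `parse_count` is an illustrative helper; '37,410' is the value quoted above, and the other inputs are made up.

```python
def parse_count(text):
    """Convert a comma-formatted integer string such as '37,410' to int."""
    return int(text.strip().replace(",", ""))

# '37,410' is quoted in this section; the other strings are invented examples.
samples = ["37,410", "7", "1,234,567"]
parsed = [parse_count(s) for s in samples]
print(parsed)
```

`int` raises `ValueError` on anything that is not a plain integer after the commas are removed, which makes unexpected cell contents visible instead of silently coercing them.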

Out[31]:

saturn.columns["diff"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	2,738
len_min	1
len_max	9
len_mean	4.935
len_median	5
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	403
duplicate_rate	0.1283
vocab_size	2,738
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.9924
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	99.2% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 14.

Character-length distribution for diff.

Character-length distribution for diff (mean: 4.935370900986947).
chars	count
1	2
2	22
3	440
4	0
5	1978
6	651
7	46
8	0
9	2

per_point_diff text feature

This column stores a percentage-point differential as a percentage string (e.g. '15.17%', '63.21%'), with lengths of either 5 or 6 characters and exactly one token per cell. With 2,555 unique values across 3,141 rows, the duplicate rate is 18.7%, and '15.17%' alone appears 31 times, far more than any other value, which is worth checking. The values are stored as text with a trailing '%', not as numbers.

Treatment: Strip the '%' and cast to float before any numeric modelling.

anthropic:claude-opus-4-7 · confidence high
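A small sketch of that conversion, using the two example strings quoted above. Returning a fraction rather than a percentage is a choice made here so the result sits on the same 0 to 1 scale as per_dem and per_gop; `parse_percent` is an illustrative name.

```python
def parse_percent(text):
    """Convert a string such as '15.17%' to a fraction (0.1517)."""
    return float(text.strip().rstrip("%")) / 100

# Both strings are the examples quoted in this section.
values = [parse_percent(s) for s in ("15.17%", "63.21%")]
print(values)
```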

Out[34]:

saturn.columns["per_point_diff"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	2,555
len_min	5
len_max	6
len_mean	5.896
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	586
duplicate_rate	0.1866
vocab_size	2,555
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 15.

Character-length distribution for per_point_diff.

Character-length distribution for per_point_diff (mean: 5.895893027698185).
chars	count
5	327
6	2814

state_abbr categorical foreign_key

This column holds US state abbreviations, with 51 unique values across 3141 rows and no nulls — consistent with one row per US county (50 states plus DC). The distribution tracks county counts rather than population: TX leads at 254 (8.1%), followed by GA (159), VA (133), and KY (120). Entropy ratio of 0.93 indicates a fairly even spread across states.

Treatment: Use as a categorical grouping key or left-join to a state-level reference table.

anthropic:claude-opus-4-7 · confidence high
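The entropy ratio can be sketched as Shannon entropy divided by its maximum for the observed number of categories. This assumes base-2 entropy normalised by log2(cardinality); the report does not state its formula, but the reported 5.275 divided by log2(51) does come out at about 0.93. The label lists are invented.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy of the label frequencies, in bits."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_ratio(labels):
    """Entropy divided by its maximum, log2(number of distinct labels)."""
    k = len(set(labels))
    if k < 2:
        return 0.0  # a single category carries no spread
    return entropy_bits(labels) / math.log2(k)

even = ["TX", "GA", "VA", "KY"] * 5              # perfectly balanced
lopsided = ["TX"] * 17 + ["GA", "VA", "KY"]      # one dominant category
reported_ratio = 5.275 / math.log2(51)           # figures from the stats table

print(entropy_ratio(even), round(entropy_ratio(lopsided), 3), round(reported_ratio, 3))
```

A ratio near 1 means rows are spread almost evenly across categories; values well below 1 mean a few categories dominate.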

Out[37]:

saturn.columns["state_abbr"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	51
top_value	TX
top_rate	0.08087
cardinality	51
entropy	5.275
entropy_ratio	0.9299

Fig 16.

Top values for state_abbr.

Top values for state_abbr (20 unique shown, of 51 total): same rows as the table under Fig 4.

county_name text feature

This column holds US county names — 3,006 of 3,141 rows contain the word 'county', with 'parish' (64) and 'city' (43) covering Louisiana and Virginia equivalents. Names repeat heavily across states: 1,293 duplicates (41.2%) leave only 1,848 unique values, with 'Washington County' (30), 'Jefferson County' (25), and 'Franklin County' (24) leading. One oddity: 'Alaska' appears 29 times as a bare state name, breaking the county/parish/city pattern.

Treatment: Pair with a state column before joining or grouping; the name alone is not unique.

anthropic:claude-opus-4-7 · confidence high
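Why the name needs a state alongside it can be shown with a few illustrative pairs. The rows below are typed by hand, not read from the file.

```python
from collections import Counter

# Illustrative (state_abbr, county_name) pairs; not read from the dataset.
rows = [
    ("AL", "Washington County"),
    ("AR", "Washington County"),
    ("AL", "Jefferson County"),
]

# Counting names alone shows a collision; counting the pair does not.
name_counts = Counter(name for _, name in rows)
key_counts = Counter(rows)

print(name_counts.most_common(1))
print(max(key_counts.values()))
```

The same check on the full data would confirm whether the (state_abbr, county_name) pair is unique; the repeated 'Alaska' rows noted above suggest it may not be for that state.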

Out[40]:

saturn.columns["county_name"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	1,848
len_min	6
len_max	27
len_mean	13.87
len_median	14
len_p95	17
word_mean	2.054
word_median	2
n_empty	0
n_duplicates	1,293
duplicate_rate	0.4117
vocab_size	1,840
readability_flesch_mean	38.38
emoji_rate	0
url_rate	0
one_word_rate	0.009233
allcaps_rate	0
boilerplate_rate	0
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	41.2% duplicate strings

Fig 17.

Character-length distribution for county_name.

Character-length distribution for county_name (mean: 13.869149952244507).
chars	count
6	29
7 – 9	0
10	29
11	255
12	465
13	683
14	585
15	485
16	280
17	202
18	52
19	40
20	15
21	11
22	6
23	2
24	1
25 – 26	0
27	1

combined_fips numeric identifier

This is almost certainly the combined state+county FIPS code (state*1000 + county), with all 3,141 values unique and no nulls, close to the count of US counties and county-equivalents. Because it is stored as a number, leading zeros are dropped: the minimum 1001 corresponds to the 5-digit code 01001. The range 1001 to 56045 spans Alabama (01) through Wyoming (56), and the near-zero skew reflects roughly uniform numeric county codes across states rather than a meaningful distribution.

Treatment: treat as a categorical key; left-join on this code rather than using as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
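Treating the code as a key usually means restoring it to a zero-padded 5-character string first, since many reference tables store FIPS codes that way. A minimal sketch, with hypothetical helper names, using the minimum and maximum values reported for this column:

```python
def fips_key(code):
    """Render a numeric FIPS code as a zero-padded 5-character string."""
    return f"{int(code):05d}"

def split_fips(key):
    """Split a 5-character key into (state, county) parts."""
    return key[:2], key[2:]

# Minimum and maximum values reported for combined_fips.
for code in (1001, 56045):
    key = fips_key(code)
    print(code, key, split_fips(key))
```

Whether the join target uses padded strings or integers is something to confirm against that table before merging.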

Out[43]:

saturn.columns["combined_fips"].stats

stat	value
n	3,141
nulls	0 (0.0%)
unique	3,141
min	1,001
max	56,045
mean	3.039e+04
median	29,177
std	1.516e+04
q1	18,179
q3	45,081
iqr	26,902
skew	-0.08027
kurtosis	-1.098
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 18.

Distribution of combined_fips. Vertical dash marks the median.

Histogram bins for combined_fips (median: 29177.0).
bin	count
1001 – 2377	96
2377 – 3753	0
3753 – 5129	80
5129 – 6505	68
6505 – 7882	0
7882 – 9258	72
9258 – 1.063e+04	3
1.063e+04 – 1.201e+04	6
1.201e+04 – 1.339e+04	221
1.339e+04 – 1.476e+04	0
1.476e+04 – 1.614e+04	48
1.614e+04 – 1.751e+04	102
1.751e+04 – 1.889e+04	92
1.889e+04 – 2.027e+04	204
2.027e+04 – 2.164e+04	120
2.164e+04 – 2.302e+04	73
2.302e+04 – 2.439e+04	30
2.439e+04 – 2.577e+04	15
2.577e+04 – 2.715e+04	156
2.715e+04 – 2.852e+04	96
2.852e+04 – 2.99e+04	115
2.99e+04 – 3.128e+04	149
3.128e+04 – 3.265e+04	17
3.265e+04 – 3.403e+04	24
3.403e+04 – 3.54e+04	40
3.54e+04 – 3.678e+04	62
3.678e+04 – 3.816e+04	153
3.816e+04 – 3.953e+04	88
3.953e+04 – 4.091e+04	77
4.091e+04 – 4.228e+04	103
4.228e+04 – 4.366e+04	0
4.366e+04 – 4.504e+04	23
4.504e+04 – 4.641e+04	94
4.641e+04 – 4.779e+04	95
4.779e+04 – 4.916e+04	283
4.916e+04 – 5.054e+04	14
5.054e+04 – 5.192e+04	133
5.192e+04 – 5.329e+04	39
5.329e+04 – 5.467e+04	55
5.467e+04 – 5.604e+04	95
