data-trove-scars-standardized-county-analysis-research-system

Overview

Source: /home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv

Saturn profiled 3,221 rows across 20 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess

# Run the Saturn CLI on the source CSV; check=True raises if the command fails.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv",
    "--findings", "data-trove-scars-standardized-county-analysis-research-system.json",
    "--llm", "anthropic:default",
], check=True)

Summary confidence: high

This dataset covers 3,221 U.S. counties and county equivalents with demographic, economic, and electoral variables for the 2016 and 2020 presidential elections. The most striking finding is that the Republican candidate led in most counties in both cycles: the median Republican share was roughly 67% in 2016 and 68% in 2020, while the Democratic median hovered near 29–30%, reflecting the well-known rural-county skew in U.S. politics. A data quality issue worth flagging immediately is the median_household_income column, whose minimum value of -666,666,666 is almost certainly a sentinel or error value; it drags the column mean to -$152,820 despite a plausible median of $52,380. A second issue is visible in the schema: margin_2016 is typed as text rather than numeric, unlike margin_2020, so it needs parsing before the two cycles can be compared. Poverty rate averages about 15% across counties but reaches as high as 66%, and the racial composition variables (pct_white, pct_black, pct_hispanic) are highly skewed, with a comparatively small number of majority-minority counties at the extremes.

citing: republican_pct_2016.stats.median · republican_pct_2020.stats.median · democratic_pct_2016.stats.median · democratic_pct_2020.stats.median · median_household_income.stats.min · median_household_income.stats.median · poverty_rate.stats.mean · poverty_rate.stats.max · pct_white.stats.mean · pct_white.stats.skew · row_count
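
The way a single sentinel distorts the mean while leaving the median nearly untouched can be reproduced on a toy column. This is a minimal sketch with made-up income values; only the sentinel -666,666,666 comes from the report, and the helper name `mask_sentinel` is hypothetical.

```python
from statistics import mean, median

SENTINEL = -666_666_666  # minimum reported for median_household_income

def mask_sentinel(values, sentinel=SENTINEL):
    """Replace the sentinel with None so it can be treated as missing."""
    return [None if v == sentinel else v for v in values]

# Made-up incomes for illustration; one row carries the sentinel.
incomes = [41_000, 48_500, 52_380, 57_900, 66_200, SENTINEL]

cleaned = [v for v in mask_sentinel(incomes) if v is not None]

raw_mean = mean(incomes)      # dominated by the sentinel, strongly negative
raw_median = median(incomes)  # shifts only slightly
clean_mean = mean(cleaned)    # plausible again once the sentinel is masked
```

Masking the sentinel as missing, rather than dropping the rows, keeps the other columns of those counties available for analysis.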

Out[4]:

saturn.schema() · 20 columns

column	kind	n	null%	unique	alerts
NAME	text	3,221	0.0%	3,221	near_unique
total_population	numeric	3,221	0.0%	3,160	high_skew outliers
black_population	numeric	3,221	0.0%	2,066	high_skew outliers
white_population	numeric	3,221	0.0%	3,143	high_skew outliers
hispanic_population	numeric	3,221	0.0%	2,331	high_skew outliers
state	numeric	3,221	0.0%	52
county	numeric	3,221	0.0%	326	high_skew outliers
FIPS	numeric	3,221	0.0%	3,221
pct_black	numeric	3,221	0.0%	3,128	high_skew outliers
pct_white	numeric	3,221	0.0%	3,218
pct_hispanic	numeric	3,221	0.0%	3,205	high_skew outliers
poverty_rate	numeric	3,221	0.0%	3,219	high_skew
below_poverty_level	numeric	3,221	0.0%	2,824	high_skew outliers
median_household_income	numeric	3,221	0.0%	3,099	high_skew outliers
margin_2020	numeric	3,221	3.4%	3,112
democratic_pct_2020	numeric	3,221	3.4%	3,111
republican_pct_2020	numeric	3,221	3.4%	3,111
margin_2016	text	3,221	2.6%	2,554	one_word allcaps short_text
democratic_pct_2016	numeric	3,221	2.6%	3,111
republican_pct_2016	numeric	3,221	2.6%	3,111

Fig 1.

republican_pct_2020 · Look for the left-skewed distribution: the peak sits well above 0.5 (around 0.73 to 0.76) with a long tail toward lower shares, confirming Republican dominance across most U.S. counties in 2020.

Histogram bins for republican_pct_2020 (median: 0.6829120557612961).
bin	count
0.05397 – 0.07667	1
0.07667 – 0.09937	2
0.09937 – 0.1221	2
0.1221 – 0.1448	6
0.1448 – 0.1675	6
0.1675 – 0.1901	15
0.1901 – 0.2128	5
0.2128 – 0.2355	13
0.2355 – 0.2582	12
0.2582 – 0.2809	25
0.2809 – 0.3036	26
0.3036 – 0.3263	32
0.3263 – 0.349	32
0.349 – 0.3717	40
0.3717 – 0.3944	46
0.3944 – 0.4171	52
0.4171 – 0.4398	64
0.4398 – 0.4625	78
0.4625 – 0.4852	61
0.4852 – 0.5079	78
0.5079 – 0.5306	66
0.5306 – 0.5533	97
0.5533 – 0.576	126
0.576 – 0.5987	122
0.5987 – 0.6214	139
0.6214 – 0.6441	143
0.6441 – 0.6668	154
0.6668 – 0.6895	173
0.6895 – 0.7122	176
0.7122 – 0.7349	200
0.7349 – 0.7576	213
0.7576 – 0.7802	195
0.7802 – 0.8029	203
0.8029 – 0.8256	169
0.8256 – 0.8483	132
0.8483 – 0.871	103
0.871 – 0.8937	65
0.8937 – 0.9164	27
0.9164 – 0.9391	9
0.9391 – 0.9618	4
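
The direction of skew can be checked directly from the binned counts above. The sketch below assumes equal-width bins between the first and last edges in the table and approximates each bin by its midpoint, so the moments are estimates rather than Saturn's own statistics. A negative skew value means the longer tail points toward lower shares.

```python
# Counts copied from the republican_pct_2020 histogram table (40 bins).
counts = [1, 2, 2, 6, 6, 15, 5, 13, 12, 25, 26, 32, 32, 40, 46, 52, 64, 78, 61, 78,
          66, 97, 126, 122, 139, 143, 154, 173, 176, 200, 213, 195, 203, 169, 132,
          103, 65, 27, 9, 4]
lo, hi = 0.05397, 0.9618  # first and last bin edges from the table

width = (hi - lo) / len(counts)  # assumes equal-width bins
mids = [lo + (i + 0.5) * width for i in range(len(counts))]  # bin midpoints

n = sum(counts)
mean = sum(c * m for c, m in zip(counts, mids)) / n
var = sum(c * (m - mean) ** 2 for c, m in zip(counts, mids)) / n
skew = sum(c * (m - mean) ** 3 for c, m in zip(counts, mids)) / n / var ** 1.5

# Index of the tallest bin, to compare the peak location with the mean.
mode_bin = max(range(len(counts)), key=counts.__getitem__)
```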

Fig 2.

margin_2020 · The distribution of victory margins reveals how lopsided most county-level results are, with few truly competitive counties near zero.

Histogram bins for margin_2020 (median: 0.3843813151543954).
bin	count
-0.8675 – -0.8226	1
-0.8226 – -0.7776	2
-0.7776 – -0.7326	3
-0.7326 – -0.6877	5
-0.6877 – -0.6427	8
-0.6427 – -0.5978	12
-0.5978 – -0.5528	6
-0.5528 – -0.5078	12
-0.5078 – -0.4629	14
-0.4629 – -0.4179	22
-0.4179 – -0.373	28
-0.373 – -0.328	33
-0.328 – -0.283	29
-0.283 – -0.2381	46
-0.2381 – -0.1931	41
-0.1931 – -0.1482	54
-0.1482 – -0.1032	63
-0.1032 – -0.05823	67
-0.05823 – -0.01327	69
-0.01327 – 0.03169	73
0.03169 – 0.07665	70
0.07665 – 0.1216	90
0.1216 – 0.1666	131
0.1666 – 0.2115	117
0.2115 – 0.2565	129
0.2565 – 0.3015	141
0.3015 – 0.3464	159
0.3464 – 0.3914	165
0.3914 – 0.4363	181
0.4363 – 0.4813	206
0.4813 – 0.5263	195
0.5263 – 0.5712	197
0.5712 – 0.6162	213
0.6162 – 0.6611	175
0.6611 – 0.7061	140
0.7061 – 0.7511	102
0.7511 – 0.796	69
0.796 – 0.841	29
0.841 – 0.8859	11
0.8859 – 0.9309	4

Fig 3.

poverty_rate · Poverty rate peaks around 12 to 13% (median 13.8%) but has a long right tail extending to 66%; watch for the outlier counties driving extreme values.

Histogram bins for poverty_rate (median: 13.807805224676027).
bin	count
0 – 1.655	2
1.655 – 3.31	9
3.31 – 4.964	48
4.964 – 6.619	123
6.619 – 8.274	212
8.274 – 9.929	317
9.929 – 11.58	378
11.58 – 13.24	392
13.24 – 14.89	353
14.89 – 16.55	338
16.55 – 18.2	235
18.2 – 19.86	192
19.86 – 21.51	157
21.51 – 23.17	108
23.17 – 24.82	77
24.82 – 26.48	53
26.48 – 28.13	41
28.13 – 29.79	37
29.79 – 31.44	29
31.44 – 33.1	13
33.1 – 34.75	10
34.75 – 36.41	11
36.41 – 38.06	7
38.06 – 39.72	2
39.72 – 41.37	8
41.37 – 43.03	6
43.03 – 44.68	7
44.68 – 46.33	11
46.33 – 47.99	6
47.99 – 49.64	9
49.64 – 51.3	8
51.3 – 52.95	5
52.95 – 54.61	6
54.61 – 56.26	2
56.26 – 57.92	2
57.92 – 59.57	3
59.57 – 61.23	1
61.23 – 62.88	1
62.88 – 64.54	0
64.54 – 66.19	2

Fig 4.

pct_white · Most counties are majority-white with a left-skewed distribution; the long lower tail marks the majority-minority counties.

Histogram bins for pct_white (median: 87.65979926043318).
bin	count
3.29 – 5.708	2
5.708 – 8.125	1
8.125 – 10.54	4
10.54 – 12.96	2
12.96 – 15.38	10
15.38 – 17.8	6
17.8 – 20.21	4
20.21 – 22.63	7
22.63 – 25.05	12
25.05 – 27.47	12
27.47 – 29.89	7
29.89 – 32.3	4
32.3 – 34.72	10
34.72 – 37.14	13
37.14 – 39.56	24
39.56 – 41.97	18
41.97 – 44.39	24
44.39 – 46.81	30
46.81 – 49.23	24
49.23 – 51.64	34
51.64 – 54.06	33
54.06 – 56.48	37
56.48 – 58.9	58
58.9 – 61.32	43
61.32 – 63.73	69
63.73 – 66.15	72
66.15 – 68.57	77
68.57 – 70.99	76
70.99 – 73.4	88
73.4 – 75.82	100
75.82 – 78.24	98
78.24 – 80.66	143
80.66 – 83.08	132
83.08 – 85.49	163
85.49 – 87.91	191
87.91 – 90.33	246
90.33 – 92.75	302
92.75 – 95.16	453
95.16 – 97.58	525
97.58 – 100	67

Fig 5.

pct_black · Heavily right-skewed, with a median near 2.4% but a maximum of 88%: the Black population share is high in only a small number of counties.

Histogram bins for pct_black (median: 2.382606279522089).
bin	count
0 – 2.195	1568
2.195 – 4.39	402
4.39 – 6.584	218
6.584 – 8.779	153
8.779 – 10.97	112
10.97 – 13.17	86
13.17 – 15.36	70
15.36 – 17.56	54
17.56 – 19.75	46
19.75 – 21.95	48
21.95 – 24.14	36
24.14 – 26.34	42
26.34 – 28.53	36
28.53 – 30.73	41
30.73 – 32.92	34
32.92 – 35.12	32
35.12 – 37.31	28
37.31 – 39.51	19
39.51 – 41.7	25
41.7 – 43.9	25
43.9 – 46.09	18
46.09 – 48.28	17
48.28 – 50.48	14
50.48 – 52.67	9
52.67 – 54.87	13
54.87 – 57.06	11
57.06 – 59.26	13
59.26 – 61.45	7
61.45 – 63.65	8
63.65 – 65.84	4
65.84 – 68.04	1
68.04 – 70.23	6
70.23 – 72.43	8
72.43 – 74.62	5
74.62 – 76.82	2
76.82 – 79.01	5
79.01 – 81.21	1
81.21 – 83.4	1
83.4 – 85.6	1
85.6 – 87.79	2

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

column	kind	null %
NAME	text	0.0%
total_population	numeric	0.0%
black_population	numeric	0.0%
white_population	numeric	0.0%
hispanic_population	numeric	0.0%
state	numeric	0.0%
county	numeric	0.0%
FIPS	numeric	0.0%
pct_black	numeric	0.0%
pct_white	numeric	0.0%
pct_hispanic	numeric	0.0%
poverty_rate	numeric	0.0%
below_poverty_level	numeric	0.0%
median_household_income	numeric	0.0%
margin_2020	numeric	3.4%
democratic_pct_2020	numeric	3.4%
republican_pct_2020	numeric	3.4%
margin_2016	text	2.6%
democratic_pct_2016	numeric	2.6%
republican_pct_2016	numeric	2.6%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
	total_population	black_population	white_population	hispanic_population	state	county	FIPS	pct_black	pct_white	pct_hispanic	poverty_rate	below_poverty_level
total_population	+1.00	+0.69	+0.98	+0.88	-0.06	-0.10	-0.06	+0.04	-0.19	+0.08	-0.15	+0.93
black_population	+0.69	+1.00	+0.63	+0.44	-0.01	-0.03	-0.01	+0.27	-0.29	+0.01	-0.04	+0.77
white_population	+0.98	+0.63	+1.00	+0.86	-0.06	-0.11	-0.06	+0.01	-0.12	+0.07	-0.17	+0.89
hispanic_population	+0.88	+0.44	+0.86	+1.00	-0.06	-0.07	-0.06	-0.01	-0.15	+0.22	-0.02	+0.86
state	-0.06	-0.01	-0.06	-0.06	+1.00	+0.07	+1.00	-0.07	+0.02	+0.37	+0.22	-0.05
county	-0.10	-0.03	-0.11	-0.07	+0.07	+1.00	+0.07	+0.09	-0.07	+0.07	+0.11	-0.08
FIPS	-0.06	-0.01	-0.06	-0.06	+1.00	+0.07	+1.00	-0.07	+0.02	+0.37	+0.22	-0.05
pct_black	+0.04	+0.27	+0.01	-0.01	-0.07	+0.09	-0.07	+1.00	-0.79	-0.02	+0.39	+0.10
pct_white	-0.19	-0.29	-0.12	-0.15	+0.02	-0.07	+0.02	-0.79	+1.00	-0.24	-0.48	-0.23
pct_hispanic	+0.08	+0.01	+0.07	+0.22	+0.37	+0.07	+0.37	-0.02	-0.24	+1.00	+0.53	+0.14
poverty_rate	-0.15	-0.04	-0.17	-0.02	+0.22	+0.11	+0.22	+0.39	-0.48	+0.53	+1.00	-0.01
below_poverty_level	+0.93	+0.77	+0.89	+0.86	-0.05	-0.08	-0.05	+0.10	-0.23	+0.14	-0.01	+1.00
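
The +1.00 correlation between state and FIPS is what one would expect if FIPS is built as state × 1000 + county, which is an assumption here, though it is consistent with the reported FIPS range of 1001 to 72153. A small sketch with illustrative code pairs (not rows from the dataset) shows how the state part dominates:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (state, county) code pairs; FIPS assumed to be state * 1000 + county.
pairs = [(1, 1), (6, 37), (48, 507), (51, 840), (72, 153)]
state = [s for s, _ in pairs]
county = [c for _, c in pairs]
fips = [s * 1000 + c for s, c in pairs]

r_state_fips = pearson(state, fips)    # very close to +1: state supplies the leading digits
r_county_fips = pearson(county, fips)  # much weaker: county only fills the last three digits
```

Because the two columns carry the same information at this scale, correlations of state or FIPS with other variables describe code numbering, not geography.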

NAME text label

This column contains U.S. county names that include the word 'County' (e.g., 'Jefferson County, Texas'), as evidenced by 'county,' appearing in 3,007 of 3,221 rows and U.S. state names dominating the top words. Every value is unique (3,221 distinct entries, 0 duplicates, 0 nulls), making this a natural identifier for county-level records. The mean string length of ~24 characters and mean word count of ~3.2 are consistent with a 'Name County, State' pattern. The small vocabulary of 1,983 distinct words across 3,221 rows suggests structured, standardized naming rather than free text.

Treatment: Use as a human-readable label or join key; normalize casing and strip trailing state suffix if joining to external county tables.

anthropic:default · confidence high
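
The normalization suggested in the treatment can be sketched as follows. The function name `split_county_name` is hypothetical, and the code assumes the 'Name County, State' pattern described above, splitting on the last comma.

```python
def split_county_name(name):
    """Split 'Name County, State' into (county_part, state_part), lowercased."""
    county_part, sep, state_part = name.rpartition(",")
    if not sep:  # no comma: treat the whole string as the county part
        return name.strip().casefold(), ""
    return county_part.strip().casefold(), state_part.strip().casefold()

example = split_county_name("Jefferson County, Texas")
```

Splitting on the last comma keeps any commas inside the county part intact, and `casefold()` gives a case-insensitive key for joining.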

Out[13]:

saturn.columns["NAME"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,221
len_min	16
len_max	42
len_mean	24.27
len_median	24
len_p95	31
word_mean	3.243
word_median	3
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	1,983
readability_flesch_mean	7.581
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings

Fig 8.

Character-length distribution for NAME.

Character-length distribution for NAME (mean: 24.26637690158336).
chars	count
16	3
17	23
18	72
19	121
20	190
21	264
22	407
23	420
24	363
25	320
26	240
27	233
28	153
29	142
30	86
31	81
32	41
33	28
34	16
35	10
36	4
37	0
38	1
39	1
40	0
41	1
42	1

total_population numeric feature

This column is the total population of each county or county equivalent, ranging from 117 to 10,040,682. The distribution is severely right-skewed (skew = 13.67, kurtosis = 311.91): the median of 25,981 is only about a quarter of the mean of 102,398, indicating a long tail driven by a small number of very large population centers. An outlier rate of 13.7% (441 of 3,221 rows) is high and signals that large urban counties coexist with many small rural ones in the same dataset.

Treatment: Log-transform before regression or distance-based modelling to reduce skew and outlier influence.

anthropic:default · confidence high
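
To see how much a log transform compresses this column, the reported summary values can be put on a log scale. This is a sketch using base-10 logs for readability; the choice of base is an assumption and does not change the shape of the result.

```python
from math import log10

# Summary values reported for total_population.
summary = {"min": 117, "q1": 11_125, "median": 25_981, "q3": 66_969, "max": 10_040_682}

logged = {k: log10(v) for k, v in summary.items()}

raw_ratio = summary["max"] / summary["median"]  # how far the max sits above the median
log_gap = logged["max"] - logged["median"]      # the same gap in orders of magnitude
```

On the raw scale the largest county is several hundred times the median; on the log scale the gap is under three units, which is why distance-based models behave better after the transform. A plain log is safe here because the reported minimum is 117 and zero_rate is 0.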

Out[16]:

saturn.columns["total_population"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,160
min	117
max	1.004e+07
mean	1.024e+05
median	25,981
std	3.283e+05
q1	11,125
q3	66,969
iqr	55,844
skew	13.67
kurtosis	311.9
n_outliers	441
outlier_rate	0.1369
zero_rate	0
alert: high_skew	skew=+13.67
alert: outliers	13.7% rows beyond 1.5 IQR
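
The outlier alert can be read as a fence rule. The sketch below assumes the conventional Tukey fences (q1 − 1.5 × IQR and q3 + 1.5 × IQR); the report only says "beyond 1.5 IQR", so the exact rule Saturn applies is an assumption.

```python
q1, q3 = 11_125, 66_969  # quartiles reported for total_population
iqr = q3 - q1            # 55,844, matching the reported iqr

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def is_outlier(value):
    """True if value lies outside the 1.5 * IQR fences (assumed rule)."""
    return value < lower_fence or value > upper_fence
```

Under this rule the lower fence is negative, so every flagged county would be on the high side: any county with more than about 150,000 residents counts as an outlier.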

Fig 9.

Distribution of total_population. Vertical dash marks the median.

Histogram bins for total_population (median: 25981.0).
bin	count
117 – 2.511e+05	2946
2.511e+05 – 5.021e+05	135
5.021e+05 – 7.532e+05	53
7.532e+05 – 1.004e+06	42
1.004e+06 – 1.255e+06	12
1.255e+06 – 1.506e+06	9
1.506e+06 – 1.757e+06	6
1.757e+06 – 2.008e+06	3
2.008e+06 – 2.259e+06	4
2.259e+06 – 2.51e+06	2
2.51e+06 – 2.761e+06	3
2.761e+06 – 3.012e+06	0
3.012e+06 – 3.263e+06	1
3.263e+06 – 3.514e+06	1
3.514e+06 – 3.765e+06	0
3.765e+06 – 4.016e+06	0
4.016e+06 – 4.267e+06	0
4.267e+06 – 4.518e+06	1
4.518e+06 – 4.769e+06	1
4.769e+06 – 5.02e+06	0
5.02e+06 – 5.271e+06	1
5.271e+06 – 5.522e+06	0
5.522e+06 – 5.773e+06	0
5.773e+06 – 6.024e+06	0
6.024e+06 – 6.275e+06	0
6.275e+06 – 6.526e+06	0
6.526e+06 – 6.777e+06	0
6.777e+06 – 7.029e+06	0
7.029e+06 – 7.28e+06	0
7.28e+06 – 7.531e+06	0
7.531e+06 – 7.782e+06	0
7.782e+06 – 8.033e+06	0
8.033e+06 – 8.284e+06	0
8.284e+06 – 8.535e+06	0
8.535e+06 – 8.786e+06	0
8.786e+06 – 9.037e+06	0
9.037e+06 – 9.288e+06	0
9.288e+06 – 9.539e+06	0
9.539e+06 – 9.79e+06	0
9.79e+06 – 1.004e+07	1

black_population numeric feature

This column is the count of Black residents per county. The distribution is extremely right-skewed (skew = 10.46, kurtosis = 148.22), with a median of just 859 versus a mean of 12,913 and a maximum of 1,202,260, indicating that a small number of high-population urban counties dominate the tail. The 438 outliers (13.6% of rows) and a standard deviation of 54,951 against a median of 859 confirm that the vast majority of counties have small counts while a few have very large ones; 2.8% of records are zero, likely rural or sparsely populated counties.

Treatment: Log-transform (log1p) before modelling to compress the extreme right tail; for per-capita normalisation, divide by total_population or use the existing pct_black column.

anthropic:default · confidence high
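
Both treatments are short to express. The sketch below uses hypothetical (black_population, total_population) pairs; `pct_share` is an illustrative helper, not part of Saturn.

```python
from math import log1p

def pct_share(group_count, total):
    """Group count as a percentage of total population."""
    return 100.0 * group_count / total

# Hypothetical counties: (black_population, total_population).
rows = [(0, 1_200), (859, 25_981), (120_000, 400_000)]

logged = [log1p(b) for b, _ in rows]         # log1p(0) == 0.0, so zero counts are safe
shares = [pct_share(b, t) for b, t in rows]  # removes the dependence on county size
```

`log1p` is preferred over a plain log here because 2.8% of rows are zero, where `log` would be undefined.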

Out[19]:

saturn.columns["black_population"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	2,066
min	0
max	1.202e+06
mean	1.291e+04
median	859
std	5.495e+04
q1	114
q3	5,553
iqr	5,439
skew	10.46
kurtosis	148.2
n_outliers	438
outlier_rate	0.136
zero_rate	0.02825
alert: high_skew	skew=+10.46
alert: outliers	13.6% rows beyond 1.5 IQR

Fig 10.

Distribution of black_population. Vertical dash marks the median.

Histogram bins for black_population (median: 859.0).
bin	count
0 – 3.006e+04	2968
3.006e+04 – 6.011e+04	108
6.011e+04 – 9.017e+04	41
9.017e+04 – 1.202e+05	33
1.202e+05 – 1.503e+05	12
1.503e+05 – 1.803e+05	14
1.803e+05 – 2.104e+05	8
2.104e+05 – 2.405e+05	3
2.405e+05 – 2.705e+05	7
2.705e+05 – 3.006e+05	6
3.006e+05 – 3.306e+05	2
3.306e+05 – 3.607e+05	2
3.607e+05 – 3.907e+05	2
3.907e+05 – 4.208e+05	2
4.208e+05 – 4.508e+05	0
4.508e+05 – 4.809e+05	2
4.809e+05 – 5.11e+05	2
5.11e+05 – 5.41e+05	0
5.41e+05 – 5.711e+05	2
5.711e+05 – 6.011e+05	1
6.011e+05 – 6.312e+05	0
6.312e+05 – 6.612e+05	1
6.612e+05 – 6.913e+05	1
6.913e+05 – 7.214e+05	0
7.214e+05 – 7.514e+05	0
7.514e+05 – 7.815e+05	0
7.815e+05 – 8.115e+05	2
8.115e+05 – 8.416e+05	0
8.416e+05 – 8.716e+05	0
8.716e+05 – 9.017e+05	1
9.017e+05 – 9.318e+05	0
9.318e+05 – 9.618e+05	0
9.618e+05 – 9.919e+05	0
9.919e+05 – 1.022e+06	0
1.022e+06 – 1.052e+06	0
1.052e+06 – 1.082e+06	0
1.082e+06 – 1.112e+06	0
1.112e+06 – 1.142e+06	0
1.142e+06 – 1.172e+06	0
1.172e+06 – 1.202e+06	1

white_population numeric feature

This column is the white population count per county, with 3,221 non-null records spanning a wide range from 58 to 4,795,186. The distribution is severely right-skewed (skew = 10.35, kurtosis = 175.65): the median is only 21,282 while the mean is about 72,000, indicating that most counties are small while a long tail of large urban counties stretches the range; 407 records (12.6%) are flagged as outliers. The near-unique value count (3,143 of 3,221) confirms this is a raw count feature, not a category or ID.

Treatment: Log-transform (log1p) before modelling to reduce skew and compress the extreme outlier range.

anthropic:default · confidence high

Out[22]:

saturn.columns["white_population"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,143
min	58
max	4.795e+06
mean	7.2e+04
median	21,282
std	1.918e+05
q1	8,855
q3	56,553
iqr	47,698
skew	10.35
kurtosis	175.7
n_outliers	407
outlier_rate	0.1264
zero_rate	0
alert: high_skew	skew=+10.35
alert: outliers	12.6% rows beyond 1.5 IQR

Fig 11.

Distribution of white_population. Vertical dash marks the median.

Histogram bins for white_population (median: 21282.0).
bin	count
58 – 1.199e+05	2795
1.199e+05 – 2.398e+05	216
2.398e+05 – 3.597e+05	74
3.597e+05 – 4.796e+05	47
4.796e+05 – 5.994e+05	29
5.994e+05 – 7.193e+05	24
7.193e+05 – 8.392e+05	6
8.392e+05 – 9.591e+05	9
9.591e+05 – 1.079e+06	3
1.079e+06 – 1.199e+06	3
1.199e+06 – 1.319e+06	4
1.319e+06 – 1.439e+06	3
1.439e+06 – 1.558e+06	1
1.558e+06 – 1.678e+06	0
1.678e+06 – 1.798e+06	1
1.798e+06 – 1.918e+06	1
1.918e+06 – 2.038e+06	0
2.038e+06 – 2.158e+06	0
2.158e+06 – 2.278e+06	1
2.278e+06 – 2.398e+06	0
2.398e+06 – 2.518e+06	0
2.518e+06 – 2.637e+06	0
2.637e+06 – 2.757e+06	1
2.757e+06 – 2.877e+06	1
2.877e+06 – 2.997e+06	0
2.997e+06 – 3.117e+06	0
3.117e+06 – 3.237e+06	0
3.237e+06 – 3.357e+06	1
3.357e+06 – 3.477e+06	0
3.477e+06 – 3.596e+06	0
3.596e+06 – 3.716e+06	0
3.716e+06 – 3.836e+06	0
3.836e+06 – 3.956e+06	0
3.956e+06 – 4.076e+06	0
4.076e+06 – 4.196e+06	0
4.196e+06 – 4.316e+06	0
4.316e+06 – 4.436e+06	0
4.436e+06 – 4.555e+06	0
4.555e+06 – 4.675e+06	0
4.675e+06 – 4.795e+06	1

hispanic_population numeric feature

This column is the Hispanic population count per county across 3,221 records. The distribution is extremely right-skewed (skew = 22.75, kurtosis = 744.79), with a median of only 1,209 but a mean of 19,427 and a maximum of 4,851,344, indicating that a small number of large urban counties dominate the distribution. 15.3% of records (492 rows) are flagged as outliers, and the interquartile range spans just 377 to 5,875 while the standard deviation is 125,108, confirming the extreme concentration of values at the low end with a long heavy tail.

Treatment: Log-transform (log1p) before modelling to reduce extreme skew; for per-capita normalization, divide by total_population or use the existing pct_hispanic column.

anthropic:default · confidence high

Out[25]:

saturn.columns["hispanic_population"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	2,331
min	0
max	4.851e+06
mean	1.943e+04
median	1,209
std	1.251e+05
q1	377
q3	5,875
iqr	5,498
skew	22.75
kurtosis	744.8
n_outliers	492
outlier_rate	0.1527
zero_rate	0.004967
alert: high_skew	skew=+22.75
alert: outliers	15.3% rows beyond 1.5 IQR

Fig 12.

Distribution of hispanic_population. Vertical dash marks the median.

Histogram bins for hispanic_population (median: 1209.0).
bin	count
0 – 1.213e+05	3124
1.213e+05 – 2.426e+05	55
2.426e+05 – 3.639e+05	13
3.639e+05 – 4.851e+05	9
4.851e+05 – 6.064e+05	4
6.064e+05 – 7.277e+05	3
7.277e+05 – 8.49e+05	2
8.49e+05 – 9.703e+05	0
9.703e+05 – 1.092e+06	2
1.092e+06 – 1.213e+06	4
1.213e+06 – 1.334e+06	1
1.334e+06 – 1.455e+06	1
1.455e+06 – 1.577e+06	0
1.577e+06 – 1.698e+06	0
1.698e+06 – 1.819e+06	0
1.819e+06 – 1.941e+06	1
1.941e+06 – 2.062e+06	1
2.062e+06 – 2.183e+06	0
2.183e+06 – 2.304e+06	0
2.304e+06 – 2.426e+06	0
2.426e+06 – 2.547e+06	0
2.547e+06 – 2.668e+06	0
2.668e+06 – 2.79e+06	0
2.79e+06 – 2.911e+06	0
2.911e+06 – 3.032e+06	0
3.032e+06 – 3.153e+06	0
3.153e+06 – 3.275e+06	0
3.275e+06 – 3.396e+06	0
3.396e+06 – 3.517e+06	0
3.517e+06 – 3.639e+06	0
3.639e+06 – 3.76e+06	0
3.76e+06 – 3.881e+06	0
3.881e+06 – 4.002e+06	0
4.002e+06 – 4.124e+06	0
4.124e+06 – 4.245e+06	0
4.245e+06 – 4.366e+06	0
4.366e+06 – 4.487e+06	0
4.487e+06 – 4.609e+06	0
4.609e+06 – 4.73e+06	0
4.73e+06 – 4.851e+06	1

state numeric foreign_key

This column is almost certainly a numeric state FIPS code: it has 52 distinct integer values ranging from 1 to 72, consistent with the 50 states plus DC and Puerto Rico (72). It has zero nulls and zero outliers, indicating a clean, fully populated categorical-as-integer field. The reported shape statistics (skew of 0.16, kurtosis of -0.63, IQR of 27 across a 1–72 range) describe how the codes are numbered and how many counties each state has, not a meaningful numeric distribution. The presence of 52 unique values rather than 50 or 51 shows that a territorial code is included, which may surprise analysts expecting only the 50 U.S. states.

Treatment: Treat as a categorical nominal code; do not use raw numeric value in regression — one-hot encode or left-join to a state reference table for geographic attributes.

anthropic:default · confidence high
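
One-hot encoding, as recommended in the treatment, can be sketched without any libraries. The helper `one_hot` is illustrative, and the input codes are just a few values within the reported 1 to 72 range.

```python
def one_hot(codes):
    """Return (sorted categories, one-hot rows) for a list of nominal codes."""
    categories = sorted(set(codes))
    index = {code: i for i, code in enumerate(categories)}
    rows = []
    for code in codes:
        row = [0] * len(categories)
        row[index[code]] = 1  # mark the column that matches this row's code
        rows.append(row)
    return categories, rows

# A few state codes for illustration.
categories, encoded = one_hot([1, 48, 72, 48])
```

Each code gets its own indicator column, so a model no longer treats code 72 as "larger" than code 1.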

Out[28]:

saturn.columns["state"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	52
min	1
max	72
mean	31.28
median	30
std	16.28
q1	19
q3	46
iqr	27
skew	0.157
kurtosis	-0.6261
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 13.

Distribution of state. Vertical dash marks the median.

Histogram bins for state (median: 30.0).
bin	count
1 – 2.775	97
2.775 – 4.55	15
4.55 – 6.325	133
6.325 – 8.1	64
8.1 – 9.875	8
9.875 – 11.65	4
11.65 – 13.42	226
13.42 – 15.2	5
15.2 – 16.98	44
16.98 – 18.75	194
18.75 – 20.52	204
20.52 – 22.3	184
22.3 – 24.07	40
24.07 – 25.85	14
25.85 – 27.62	170
27.62 – 29.4	197
29.4 – 31.17	149
31.17 – 32.95	17
32.95 – 34.73	31
34.73 – 36.5	95
36.5 – 38.27	153
38.27 – 40.05	165
40.05 – 41.82	36
41.82 – 43.6	67
43.6 – 45.38	51
45.38 – 47.15	161
47.15 – 48.92	254
48.92 – 50.7	43
50.7 – 52.47	133
52.47 – 54.25	94
54.25 – 56.02	95
56.02 – 57.8	0
57.8 – 59.57	0
59.57 – 61.35	0
61.35 – 63.12	0
63.12 – 64.9	0
64.9 – 66.67	0
66.67 – 68.45	0
68.45 – 70.22	0
70.22 – 72	78

county numeric foreign_key

This column is almost certainly the county portion of the FIPS code, not a true continuous measure: the 326 unique values across 3,221 rows show that codes repeat across states, so the column identifies a county only in combination with state. The distribution is heavily right-skewed (skew 2.87, kurtosis 11.64) with values ranging from 1 to 840 and 178 outliers (5.5%), which reflects how codes are assigned rather than any meaningful numeric magnitude: every state uses the low codes, while high codes occur only in states with many counties or specially numbered county equivalents. The mean (102.85) sitting well above the median (79.0) reflects that thin tail of high codes.

Treatment: Cast to categorical/string and combine with state (or use FIPS) to form a geographic grouping key; do not use raw numeric value in any regression or distance-based model.

anthropic:default · confidence high

Out[31]:

saturn.columns["county"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	326
min	1
max	840
mean	102.8
median	79
std	106.6
q1	35
q3	133
iqr	98
skew	2.868
kurtosis	11.64
n_outliers	178
outlier_rate	0.05526
zero_rate	0
alert: high_skew	skew=+2.87
alert: outliers	5.5% rows beyond 1.5 IQR

Fig 14.

Distribution of county. Vertical dash marks the median.

Histogram bins for county (median: 79.0).
bin	count
1 – 21.98	531
21.98 – 42.95	418
42.95 – 63.93	411
63.93 – 84.9	345
84.9 – 105.9	352
105.9 – 126.9	279
126.9 – 147.8	234
147.8 – 168.8	166
168.8 – 189.8	138
189.8 – 210.8	70
210.8 – 231.7	45
231.7 – 252.7	25
252.7 – 273.7	22
273.7 – 294.7	23
294.7 – 315.6	22
315.6 – 336.6	13
336.6 – 357.6	11
357.6 – 378.6	10
378.6 – 399.5	11
399.5 – 420.5	10
420.5 – 441.5	11
441.5 – 462.5	10
462.5 – 483.4	11
483.4 – 504.4	10
504.4 – 525.4	7
525.4 – 546.4	2
546.4 – 567.3	1
567.3 – 588.3	2
588.3 – 609.3	3
609.3 – 630.2	3
630.2 – 651.2	2
651.2 – 672.2	2
672.2 – 693.2	5
693.2 – 714.2	2
714.2 – 735.1	3
735.1 – 756.1	2
756.1 – 777.1	3
777.1 – 798.1	1
798.1 – 819	2
819 – 840	3

FIPS numeric identifier

This column contains U.S. county FIPS (Federal Information Processing Standards) codes: five-digit identifiers (a two-digit state code followed by a three-digit county code) that appear here as 4- or 5-digit integers because numeric storage drops the leading zero. Every row has a distinct value (unique = 3,221, matching n exactly) with no nulls, confirming this is a primary identifier for the county-level records; the row count is consistent with the number of counties and county equivalents across the 50 states, DC, and Puerto Rico. The reported skew of 0.157 and kurtosis of -0.63 mirror those of the state column, as expected when the state code supplies the leading digits. The range of 1001 to 72153 is correct for U.S. county FIPS codes (Alabama's first county to Puerto Rico's last).

Treatment: Treat as a categorical geographic identifier; do not use numerically. Zero-pad to five characters, then left-join to FIPS reference tables for geographic enrichment or spatial analysis.

anthropic:default · confidence high
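
Before joining to a reference table, the numeric codes usually need to be turned back into fixed-width strings. A minimal sketch, assuming the standard layout of a two-digit state code followed by a three-digit county code; the function names are hypothetical.

```python
def fips_to_str(fips):
    """Zero-pad a numeric county FIPS code to five characters."""
    return str(int(fips)).zfill(5)

def split_fips(fips):
    """Split a county FIPS code into (state, county) string parts."""
    s = fips_to_str(fips)
    return s[:2], s[2:]

def join_fips(state, county):
    """Build the five-character code from numeric state and county codes."""
    return f"{int(state):02d}{int(county):03d}"
```

For example, `fips_to_str(1001)` gives `"01001"`, and `join_fips(72, 153)` gives `"72153"`, the two ends of the range reported for this column.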

Out[34]:

saturn.columns["FIPS"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,221
min	1,001
max	72,153
mean	3.138e+04
median	30,023
std	1.63e+04
q1	19,031
q3	46,105
iqr	27,074
skew	0.1569
kurtosis	-0.6308
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 15.

Distribution of FIPS. Vertical dash marks the median.

Histogram bins for FIPS (median: 30023.0).
bin	count
1001 – 2780	97
2780 – 4559	15
4559 – 6337	133
6337 – 8116	59
8116 – 9895	13
9895 – 1.167e+04	4
1.167e+04 – 1.345e+04	226
1.345e+04 – 1.523e+04	5
1.523e+04 – 1.701e+04	49
1.701e+04 – 1.879e+04	189
1.879e+04 – 2.057e+04	204
2.057e+04 – 2.235e+04	184
2.235e+04 – 2.413e+04	39
2.413e+04 – 2.59e+04	15
2.59e+04 – 2.768e+04	170
2.768e+04 – 2.946e+04	196
2.946e+04 – 3.124e+04	150
3.124e+04 – 3.302e+04	27
3.302e+04 – 3.48e+04	21
3.48e+04 – 3.658e+04	95
3.658e+04 – 3.836e+04	153
3.836e+04 – 4.013e+04	155
4.013e+04 – 4.191e+04	46
4.191e+04 – 4.369e+04	67
4.369e+04 – 4.547e+04	51
4.547e+04 – 4.725e+04	161
4.725e+04 – 4.903e+04	268
4.903e+04 – 5.081e+04	29
5.081e+04 – 5.259e+04	133
5.259e+04 – 5.436e+04	94
5.436e+04 – 5.614e+04	95
5.614e+04 – 5.792e+04	0
5.792e+04 – 5.97e+04	0
5.97e+04 – 6.148e+04	0
6.148e+04 – 6.326e+04	0
6.326e+04 – 6.504e+04	0
6.504e+04 – 6.682e+04	0
6.682e+04 – 6.86e+04	0
6.86e+04 – 7.037e+04	0
7.037e+04 – 7.215e+04	78

pct_black numeric feature

This column is the percentage of Black residents in each county, with 3,221 rows and no nulls. The distribution is heavily right-skewed (skew = 2.33, kurtosis = 5.45): the median is just 2.38% while the mean is pulled to 9.08%, and 422 rows (13.1%) are flagged as outliers reaching up to 87.79%. The interquartile range spans only 0.69% to 10.21%, meaning most counties are predominantly non-Black, with a long tail of majority-Black counties.

Treatment: Apply log1p or quantile transformation before regression to address severe right skew and outlier influence.

anthropic:default · confidence high
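
A quantile transformation replaces each value with its rank position, which removes the influence of extreme values entirely. The sketch below is a simple rank-based version with average ranks for ties; it is one possible implementation, not the method Saturn or any particular library uses. The sample values echo the reported summary statistics plus one tie at zero.

```python
def quantile_transform(values):
    """Map each value to its average-rank position in (0, 1)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values so they share one average rank.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n = len(values)
    return [(r - 0.5) / n for r in ranks]

# Illustrative values: min, min (tie), q1, median, q3, max from the stats table.
sample = [0.0, 0.0, 0.69, 2.38, 10.21, 87.79]
uniform = quantile_transform(sample)
```

After the transform the maximum of 87.79 sits at about 0.92, only one rank step above the third quartile, so it no longer carries outsized leverage in a regression.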

Out[37]:

saturn.columns["pct_black"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,128
min	0
max	87.79
mean	9.085
median	2.383
std	14.5
q1	0.6919
q3	10.21
iqr	9.513
skew	2.326
kurtosis	5.451
n_outliers	422
outlier_rate	0.131
zero_rate	0.02825
alert: high_skew	skew=+2.33
alert: outliers	13.1% rows beyond 1.5 IQR

Fig 16.

Distribution of pct_black. Vertical dash marks the median.

Histogram bins for pct_black (median: 2.382606279522089).
bin	count
0 – 2.195	1568
2.195 – 4.39	402
4.39 – 6.584	218
6.584 – 8.779	153
8.779 – 10.97	112
10.97 – 13.17	86
13.17 – 15.36	70
15.36 – 17.56	54
17.56 – 19.75	46
19.75 – 21.95	48
21.95 – 24.14	36
24.14 – 26.34	42
26.34 – 28.53	36
28.53 – 30.73	41
30.73 – 32.92	34
32.92 – 35.12	32
35.12 – 37.31	28
37.31 – 39.51	19
39.51 – 41.7	25
41.7 – 43.9	25
43.9 – 46.09	18
46.09 – 48.28	17
48.28 – 50.48	14
50.48 – 52.67	9
52.67 – 54.87	13
54.87 – 57.06	11
57.06 – 59.26	13
59.26 – 61.45	7
61.45 – 63.65	8
63.65 – 65.84	4
65.84 – 68.04	1
68.04 – 70.23	6
70.23 – 72.43	8
72.43 – 74.62	5
74.62 – 76.82	2
76.82 – 79.01	5
79.01 – 81.21	1
81.21 – 83.4	1
83.4 – 85.6	1
85.6 – 87.79	2

pct_white numeric feature

This column is the percentage of white residents in each county, ranging from 3.29% to 100% across 3,221 records. The distribution is strongly left-skewed (skew = -1.56) with a mean of 81.2% and median of 87.7%, indicating the dataset is dominated by majority-white counties. The gap between mean and median signals a long lower tail of more diverse counties, and the 145 outliers (4.5%) are likely the most diverse counties at the low end of that tail. Near-perfect uniqueness (3,218 of 3,221 values) confirms this is a continuous ratio measure, not a binned or rounded variable.

Treatment: Use as-is or apply a reflection-log transform to address left skew before regression; consider interactions with other demographic features.

anthropic:default · confidence high
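
A reflection-log transform first flips a left-skewed variable around its upper bound and then applies a log. The sketch below assumes an upper bound of 100 (the reported maximum) and uses `log1p` so that a value of exactly 100 maps to 0; the function name is hypothetical.

```python
from math import log1p

def reflect_log(x, upper=100.0):
    """Reflect around `upper`, then apply log1p: log(1 + (upper - x))."""
    return log1p(upper - x)

# Reported pct_white summary values: min, q1, median, q3, max.
summary = [3.29, 73.62, 87.66, 93.99, 100.0]
transformed = [reflect_log(v) for v in summary]
```

The transform reverses the ordering, so high pct_white values become small numbers; coefficient signs in a downstream regression flip accordingly.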

Out[40]:

saturn.columns["pct_white"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,218
min	3.29
max	100
mean	81.2
median	87.66
std	17.35
q1	73.62
q3	93.99
iqr	20.37
skew	-1.562
kurtosis	2.301
n_outliers	145
outlier_rate	0.04502
zero_rate	0

Fig 17.

Distribution of pct_white. Vertical dash marks the median.

Histogram bins for pct_white (median: 87.65979926043318).
bin	count
3.29 – 5.708	2
5.708 – 8.125	1
8.125 – 10.54	4
10.54 – 12.96	2
12.96 – 15.38	10
15.38 – 17.8	6
17.8 – 20.21	4
20.21 – 22.63	7
22.63 – 25.05	12
25.05 – 27.47	12
27.47 – 29.89	7
29.89 – 32.3	4
32.3 – 34.72	10
34.72 – 37.14	13
37.14 – 39.56	24
39.56 – 41.97	18
41.97 – 44.39	24
44.39 – 46.81	30
46.81 – 49.23	24
49.23 – 51.64	34
51.64 – 54.06	33
54.06 – 56.48	37
56.48 – 58.9	58
58.9 – 61.32	43
61.32 – 63.73	69
63.73 – 66.15	72
66.15 – 68.57	77
68.57 – 70.99	76
70.99 – 73.4	88
73.4 – 75.82	100
75.82 – 78.24	98
78.24 – 80.66	143
80.66 – 83.08	132
83.08 – 85.49	163
85.49 – 87.91	191
87.91 – 90.33	246
90.33 – 92.75	302
92.75 – 95.16	453
95.16 – 97.58	525
97.58 – 100	67

pct_hispanic numeric feature

This column is the percentage of Hispanic residents in each county, ranging from 0% to 100%. The distribution is severely right-skewed (skew = 3.11, kurtosis = 9.89): the median is only 4.52% while the mean is 11.74%, indicating most counties have low Hispanic shares but a long tail of high-concentration areas drives the average up. A notable 13% of rows (420 of 3,221) are flagged as outliers, consistent with areas of heavy Hispanic concentration; the histogram shows 70 counties in the 97.5–100% bin alone. The near-zero zero_rate (0.5%) and zero null rate suggest good data completeness.

Treatment: Apply log1p (the column contains zeros, so a plain log is undefined) or a square-root transformation before regression/modelling to reduce skew and diminish outlier leverage.

anthropic:default · confidence high

Out[43]:

saturn.columns["pct_hispanic"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,205
min	0
max	100
mean	11.74
median	4.516
std	19.4
q1	2.363
q3	10.66
iqr	8.294
skew	3.113
kurtosis	9.888
n_outliers	420
outlier_rate	0.1304
zero_rate	0.004967
alert: high_skew	skew=+3.11
alert: outliers	13.0% rows beyond 1.5 IQR

Fig 18.

Distribution of pct_hispanic. Vertical dash marks the median.

Histogram bins for pct_hispanic (median: 4.51638689048761).
bin	count
0 – 2.5	882
2.5 – 5	850
5 – 7.5	412
7.5 – 10	213
10 – 12.5	148
12.5 – 15	102
15 – 17.5	75
17.5 – 20	64
20 – 22.5	45
22.5 – 25	49
25 – 27.5	39
27.5 – 30	30
30 – 32.5	26
32.5 – 35	18
35 – 37.5	17
37.5 – 40	15
40 – 42.5	17
42.5 – 45	15
45 – 47.5	12
47.5 – 50	10
50 – 52.5	12
52.5 – 55	12
55 – 57.5	9
57.5 – 60	12
60 – 62.5	10
62.5 – 65	9
65 – 67.5	3
67.5 – 70	6
70 – 72.5	3
72.5 – 75	3
75 – 77.5	0
77.5 – 80	3
80 – 82.5	4
82.5 – 85	5
85 – 87.5	1
87.5 – 90	3
90 – 92.5	4
92.5 – 95	5
95 – 97.5	8
97.5 – 100	70

poverty_rate numeric feature

This column is the poverty rate (in percent) for each of the 3,221 counties, with complete coverage (0 nulls) and near-unique values (3,219 distinct). The distribution is right-skewed (skew 2.11, kurtosis 6.92), with a median of 13.8% and a mean pulled up to 15.4% by a long upper tail reaching 66.2%. The 143 outliers (4.4% of records) make up this tail: a minority of counties with very high poverty rates that will disproportionately influence linear models.

Treatment: Apply a log1p or square-root transform to reduce right skew before regression (the minimum is 0, so use log1p rather than a plain log); investigate the 143 outlier counties separately for data quality or structural differences.

anthropic:default · confidence high
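
To pull out the outlier counties for separate review, a cutoff is needed. The sketch below derives one from the reported quartiles, assuming Saturn counts outliers with the usual 1.5 × IQR rule (its outlier alerts on other columns are worded "rows beyond 1.5 IQR"); the exact fence definition Saturn uses is not stated in this report.

```python
# Quartiles copied from the poverty_rate stats table.
Q1, Q3 = 10.34, 18.25

def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = tukey_fences(Q1, Q3)
print(f"lower fence: {lower:.3f}")
print(f"upper fence: {upper:.3f}")
```

The lower fence falls below zero, so under this assumption every flagged county sits in the upper tail, above roughly 30% poverty.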

Out[46]:

saturn.columns["poverty_rate"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,219
min	0
max	66.19
mean	15.38
median	13.81
std	7.97
q1	10.34
q3	18.25
iqr	7.91
skew	2.111
kurtosis	6.922
n_outliers	143
outlier_rate	0.0444
zero_rate	0.0003105
alert: high_skew	skew=+2.11

Fig 19.

Distribution of poverty_rate. Vertical dash marks the median.

Histogram bins for poverty_rate (median: 13.807805224676027).
bin	count
0 – 1.655	2
1.655 – 3.31	9
3.31 – 4.964	48
4.964 – 6.619	123
6.619 – 8.274	212
8.274 – 9.929	317
9.929 – 11.58	378
11.58 – 13.24	392
13.24 – 14.89	353
14.89 – 16.55	338
16.55 – 18.2	235
18.2 – 19.86	192
19.86 – 21.51	157
21.51 – 23.17	108
23.17 – 24.82	77
24.82 – 26.48	53
26.48 – 28.13	41
28.13 – 29.79	37
29.79 – 31.44	29
31.44 – 33.1	13
33.1 – 34.75	10
34.75 – 36.41	11
36.41 – 38.06	7
38.06 – 39.72	2
39.72 – 41.37	8
41.37 – 43.03	6
43.03 – 44.68	7
44.68 – 46.33	11
46.33 – 47.99	6
47.99 – 49.64	9
49.64 – 51.3	8
51.3 – 52.95	5
52.95 – 54.61	6
54.61 – 56.26	2
56.26 – 57.92	2
57.92 – 59.57	3
59.57 – 61.23	1
61.23 – 62.88	1
62.88 – 64.54	0
64.54 – 66.19	2

below_poverty_level numeric feature

This column is the count of people living below the poverty level in each county. The distribution is extremely right-skewed (skew=15.1, kurtosis=360.7): the median is 3,831 but the mean is 13,136, and the maximum reaches 1,401,656, almost certainly a very populous county. With 351 outliers (~10.9% of rows) and a standard deviation of 44,284 against a median of 3,831, a small number of high-population counties dominate the raw counts.

Treatment: Log-transform (log1p) before regression or clustering; for cross-county comparison, prefer a per-capita measure such as the existing poverty_rate column, or normalize by total_population.

anthropic:default · confidence high
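
A minimal sketch of both treatments, using made-up rows (the column names are the dataset's, the values are hypothetical). Dividing by total_population gives a share that may differ slightly from the dataset's own poverty_rate if that column uses a different denominator.

```python
import math

# Hypothetical rows for illustration only; real values live in master_dataset.csv.
rows = [
    {"NAME": "County A", "total_population": 12_000, "below_poverty_level": 1_800},
    {"NAME": "County B", "total_population": 950_000, "below_poverty_level": 114_000},
    {"NAME": "County C", "total_population": 0, "below_poverty_level": 0},
]

def poverty_share(row):
    """Percent of residents below the poverty level, or None when population is zero."""
    population = row["total_population"]
    if population == 0:
        return None
    return 100.0 * row["below_poverty_level"] / population

for row in rows:
    row["poverty_share"] = poverty_share(row)
    # log1p keeps zero counts finite: log1p(0) == 0.
    row["log_below_poverty"] = math.log1p(row["below_poverty_level"])
    print(row["NAME"], row["poverty_share"], round(row["log_below_poverty"], 3))
```

County B has over sixty times County A's count but a lower share, which is the comparison the raw counts hide.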

Out[49]:

saturn.columns["below_poverty_level"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	2,824
min	0
max	1.402e+06
mean	1.314e+04
median	3,831
std	4.428e+04
q1	1,547
q3	9,937
iqr	8,390
skew	15.11
kurtosis	360.7
n_outliers	351
outlier_rate	0.109
zero_rate	0.0003105
alert: high_skew	skew=+15.11
alert: outliers	10.9% rows beyond 1.5 IQR

Fig 20.

Distribution of below_poverty_level. Vertical dash marks the median.

Histogram bins for below_poverty_level (median: 3831.0).
bin	count
0 – 3.504e+04	2980
3.504e+04 – 7.008e+04	136
7.008e+04 – 1.051e+05	50
1.051e+05 – 1.402e+05	19
1.402e+05 – 1.752e+05	7
1.752e+05 – 2.102e+05	8
2.102e+05 – 2.453e+05	3
2.453e+05 – 2.803e+05	2
2.803e+05 – 3.154e+05	3
3.154e+05 – 3.504e+05	2
3.504e+05 – 3.855e+05	5
3.855e+05 – 4.205e+05	0
4.205e+05 – 4.555e+05	1
4.555e+05 – 4.906e+05	1
4.906e+05 – 5.256e+05	0
5.256e+05 – 5.607e+05	1
5.607e+05 – 5.957e+05	0
5.957e+05 – 6.307e+05	0
6.307e+05 – 6.658e+05	0
6.658e+05 – 7.008e+05	1
7.008e+05 – 7.359e+05	1
7.359e+05 – 7.709e+05	0
7.709e+05 – 8.06e+05	0
8.06e+05 – 8.41e+05	0
8.41e+05 – 8.76e+05	0
8.76e+05 – 9.111e+05	0
9.111e+05 – 9.461e+05	0
9.461e+05 – 9.812e+05	0
9.812e+05 – 1.016e+06	0
1.016e+06 – 1.051e+06	0
1.051e+06 – 1.086e+06	0
1.086e+06 – 1.121e+06	0
1.121e+06 – 1.156e+06	0
1.156e+06 – 1.191e+06	0
1.191e+06 – 1.226e+06	0
1.226e+06 – 1.261e+06	0
1.261e+06 – 1.297e+06	0
1.297e+06 – 1.332e+06	0
1.332e+06 – 1.367e+06	0
1.367e+06 – 1.402e+06	1

median_household_income numeric feature

This column is the median household income of each county. The median of 52,380 and IQR of 16,300 look plausible, but the moment-based statistics are distorted by a sentinel/error value: a minimum of -666,666,666 drags the mean to -152,820 and produces a std of 11,747,597, a skew of -56.73, and a kurtosis of 3,216. The histogram shows exactly one row in the lowest bin and the other 3,220 in the top bin, so a single row appears to account for the distortion (-666,666,666 / 3,221 ≈ -206,975 is enough to push a mean in the mid-50,000s down to the reported value). The value is almost certainly a coded null-substitute rather than a real income; it matches the code that Census ACS tables use for an estimate that could not be computed. The 182 flagged outliers (5.65% of rows) are therefore likely to be mostly ordinary high- or low-income counties beyond the 1.5 IQR fences, not additional sentinels.

Treatment: Replace -666666666 and any other negative values with NaN, recompute the statistics, then review the remaining outliers and consider a log-transform before modelling.

anthropic:default · confidence high
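
A minimal sketch of the cleaning step, assuming negative incomes are never valid. Only the sentinel and the row count come from this report; the other income values are hypothetical.

```python
import math
import statistics

SENTINEL = -666_666_666   # minimum reported in the stats table
N_ROWS = 3_221            # row count reported in the stats table

# Hypothetical incomes for illustration; one row carries the sentinel.
incomes = [44_900.0, 52_400.0, 61_200.0, float(SENTINEL), 147_000.0]

def clean_income(value):
    """Map the sentinel and any other negative value to NaN."""
    return math.nan if value < 0 else value

cleaned = [clean_income(v) for v in incomes]
valid = [v for v in cleaned if not math.isnan(v)]

print("mean with sentinel:   ", statistics.fmean(incomes))
print("mean without sentinel:", statistics.fmean(valid))

# How far one sentinel row moves the mean of a 3,221-row column.
shift = SENTINEL / N_ROWS
print(f"mean shift from one sentinel row: {shift:,.0f}")
```

In pandas the same idea is usually written with `Series.mask` or `Series.where` on a `< 0` condition, so that later calls to `mean()` and `std()` skip the missing values.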

Out[52]:

saturn.columns["median_household_income"].stats

stat	value
n	3,221
nulls	0 (0.0%)
unique	3,099
min	-6.667e+08
max	147,111
mean	-1.528e+05
median	52,380
std	1.175e+07
q1	44,939
q3	61,239
iqr	16,300
skew	-56.73
kurtosis	3216
n_outliers	182
outlier_rate	0.0565
zero_rate	0
alert: high_skew	skew=-56.73
alert: outliers	5.7% rows beyond 1.5 IQR

Fig 21.

Distribution of median_household_income. Vertical dash marks the median.

Histogram bins for median_household_income (median: 52380.0).
bin	count
-6.667e+08 – -6.5e+08	1
-6.5e+08 – -6.333e+08	0
-6.333e+08 – -6.167e+08	0
-6.167e+08 – -6e+08	0
-6e+08 – -5.833e+08	0
-5.833e+08 – -5.666e+08	0
-5.666e+08 – -5.5e+08	0
-5.5e+08 – -5.333e+08	0
-5.333e+08 – -5.166e+08	0
-5.166e+08 – -5e+08	0
-5e+08 – -4.833e+08	0
-4.833e+08 – -4.666e+08	0
-4.666e+08 – -4.5e+08	0
-4.5e+08 – -4.333e+08	0
-4.333e+08 – -4.166e+08	0
-4.166e+08 – -3.999e+08	0
-3.999e+08 – -3.833e+08	0
-3.833e+08 – -3.666e+08	0
-3.666e+08 – -3.499e+08	0
-3.499e+08 – -3.333e+08	0
-3.333e+08 – -3.166e+08	0
-3.166e+08 – -2.999e+08	0
-2.999e+08 – -2.832e+08	0
-2.832e+08 – -2.666e+08	0
-2.666e+08 – -2.499e+08	0
-2.499e+08 – -2.332e+08	0
-2.332e+08 – -2.166e+08	0
-2.166e+08 – -1.999e+08	0
-1.999e+08 – -1.832e+08	0
-1.832e+08 – -1.666e+08	0
-1.666e+08 – -1.499e+08	0
-1.499e+08 – -1.332e+08	0
-1.332e+08 – -1.165e+08	0
-1.165e+08 – -9.987e+07	0
-9.987e+07 – -8.32e+07	0
-8.32e+07 – -6.653e+07	0
-6.653e+07 – -4.986e+07	0
-4.986e+07 – -3.319e+07	0
-3.319e+07 – -1.652e+07	0
-1.652e+07 – 1.471e+05	3220

margin_2020 numeric feature

This column is the 2020 presidential vote margin for each county, expressed as a proportion (roughly −0.87 to +0.93). It is consistent with Republican share minus Democratic share (the reported means satisfy 0.6497 − 0.3327 = 0.317), so positive values are Republican leads and negative values (minimum −0.868) are Democratic leads. The distribution is moderately left-skewed (skew −0.82), with the mean of 0.317 sitting below the median of 0.384 because a tail of strongly Democratic counties pulls the average down. 48 outliers (1.54%) sit at the distributional extremes. The null rate of 3.38% (109 rows) is modest but worth investigating for systematic missingness.

Treatment: Use as-is for modelling; inspect the left-tail outliers and check whether the nulls are structurally missing before imputing.

anthropic:default · confidence high
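
The sign convention of the margin can be checked against the two vote-share columns. The sketch below tests the hypothesis margin = Republican share − Democratic share on the reported means; this works because means are linear and all three 2020 columns report the same 109 nulls. The helper name `margin_matches` is an assumption for this example, and the same function could be applied row by row to the raw data.

```python
# Means copied from the 2020 stats tables in this report.
REP_MEAN = 0.6497      # republican_pct_2020
DEM_MEAN = 0.3327      # democratic_pct_2020
MARGIN_MEAN = 0.317    # margin_2020

def margin_matches(rep, dem, margin, tol=5e-4):
    """True when margin agrees with rep - dem within tol (half the last reported digit)."""
    return abs((rep - dem) - margin) <= tol

print("R - D matches:", margin_matches(REP_MEAN, DEM_MEAN, MARGIN_MEAN))
print("D - R matches:", margin_matches(DEM_MEAN, REP_MEAN, MARGIN_MEAN))
```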

Out[55]:

saturn.columns["margin_2020"].stats

stat	value
n	3,221
nulls	109 (3.4%)
unique	3,112
min	-0.8675
max	0.9309
mean	0.317
median	0.3844
std	0.321
q1	0.1348
q3	0.5662
iqr	0.4314
skew	-0.8212
kurtosis	0.2286
n_outliers	48
outlier_rate	0.01542
zero_rate	0

Fig 22.

Distribution of margin_2020. Vertical dash marks the median.

Histogram bins for margin_2020 (median: 0.3843813151543954).
bin	count
-0.8675 – -0.8226	1
-0.8226 – -0.7776	2
-0.7776 – -0.7326	3
-0.7326 – -0.6877	5
-0.6877 – -0.6427	8
-0.6427 – -0.5978	12
-0.5978 – -0.5528	6
-0.5528 – -0.5078	12
-0.5078 – -0.4629	14
-0.4629 – -0.4179	22
-0.4179 – -0.373	28
-0.373 – -0.328	33
-0.328 – -0.283	29
-0.283 – -0.2381	46
-0.2381 – -0.1931	41
-0.1931 – -0.1482	54
-0.1482 – -0.1032	63
-0.1032 – -0.05823	67
-0.05823 – -0.01327	69
-0.01327 – 0.03169	73
0.03169 – 0.07665	70
0.07665 – 0.1216	90
0.1216 – 0.1666	131
0.1666 – 0.2115	117
0.2115 – 0.2565	129
0.2565 – 0.3015	141
0.3015 – 0.3464	159
0.3464 – 0.3914	165
0.3914 – 0.4363	181
0.4363 – 0.4813	206
0.4813 – 0.5263	195
0.5263 – 0.5712	197
0.5712 – 0.6162	213
0.6162 – 0.6611	175
0.6611 – 0.7061	140
0.7061 – 0.7511	102
0.7511 – 0.796	69
0.796 – 0.841	29
0.841 – 0.8859	11
0.8859 – 0.9309	4

democratic_pct_2020 numeric feature

This column is the Democratic vote share (as a proportion 0–1) in the 2020 presidential election for each county. The distribution is right-skewed (skew=0.83) with a mean of 0.333 and median of 0.300: the typical county gave Democrats roughly 30% of the vote. The range spans 0.031 to 0.921, capturing both deep-red and deep-blue counties, with only 49 outliers (1.57%). The null rate is 3.38% (109 rows), the same count as the other 2020 columns, so the feature is well populated but not complete.

Treatment: Use as-is or apply a logit transform to stretch the bounded 0–1 proportion before regression or clustering.

anthropic:default · confidence high
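
A minimal sketch of the logit transform mentioned in the treatment. The clipping constant `eps` is an assumption added so that shares of exactly 0 or 1 would stay finite; the reported range (0.031 to 0.921) does not actually reach those bounds.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of a proportion, clipped away from 0 and 1 to keep the result finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

# Min, median, and max copied from the democratic_pct_2020 stats table.
for label, share in [("min", 0.03091), ("median", 0.2998), ("max", 0.9215)]:
    print(f"{label:>6}: share={share:.4f}  logit={logit(share):+.3f}")
```

A share of 0.5 maps to 0, so the sign of the transformed value shows which side of an even split a county falls on.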

Out[58]:

saturn.columns["democratic_pct_2020"].stats

stat	value
n	3,221
nulls	109 (3.4%)
unique	3,111
min	0.03091
max	0.9215
mean	0.3327
median	0.2998
std	0.1598
q1	0.2091
q3	0.4236
iqr	0.2145
skew	0.8326
kurtosis	0.2523
n_outliers	49
outlier_rate	0.01575
zero_rate	0

Fig 23.

Distribution of democratic_pct_2020. Vertical dash marks the median.

Histogram bins for democratic_pct_2020 (median: 0.29977253358402933).
bin	count
0.03091 – 0.05317	5
0.05317 – 0.07544	11
0.07544 – 0.0977	31
0.0977 – 0.12	76
0.12 – 0.1422	104
0.1422 – 0.1645	145
0.1645 – 0.1868	182
0.1868 – 0.209	224
0.209 – 0.2313	192
0.2313 – 0.2536	199
0.2536 – 0.2758	200
0.2758 – 0.2981	174
0.2981 – 0.3204	164
0.3204 – 0.3426	158
0.3426 – 0.3649	130
0.3649 – 0.3871	132
0.3871 – 0.4094	121
0.4094 – 0.4317	125
0.4317 – 0.4539	84
0.4539 – 0.4762	79
0.4762 – 0.4985	66
0.4985 – 0.5207	73
0.5207 – 0.543	67
0.543 – 0.5653	58
0.5653 – 0.5875	50
0.5875 – 0.6098	42
0.6098 – 0.6321	42
0.6321 – 0.6543	34
0.6543 – 0.6766	28
0.6766 – 0.6988	29
0.6988 – 0.7211	22
0.7211 – 0.7434	14
0.7434 – 0.7656	13
0.7656 – 0.7879	7
0.7879 – 0.8102	8
0.8102 – 0.8324	11
0.8324 – 0.8547	5
0.8547 – 0.877	3
0.877 – 0.8992	3
0.8992 – 0.9215	1

republican_pct_2020 numeric feature

This column is the Republican vote share (as a proportion 0–1) in the 2020 U.S. presidential election for each county. The mean of 0.650 and median of 0.683 show that most counties recorded Republican majorities, which is expected in county-level data where rural counties outnumber urban ones by count. The distribution is left-skewed (skew = −0.809): a tail of strongly Democratic counties pulls the mean below the median. The excess kurtosis is close to zero (0.206) and only 47 rows are flagged as outliers, so neither tail is unusually heavy. The null rate of 3.38% (109 rows) warrants investigation to confirm whether missing values reflect unreported results or data gaps.

Treatment: Use as-is for modeling after imputing or flagging the 3.38% nulls; consider logit-transform if used as a continuous predictor in a linear model.

anthropic:default · confidence high
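
One way to "impute or flag" is to do both: fill the gaps and keep an indicator of which rows were filled. The sketch below uses median imputation on hypothetical shares; the choice of the median is an assumption, not something this report prescribes.

```python
import statistics

# Hypothetical shares; None stands in for a null republican_pct_2020 value.
shares = [0.71, None, 0.55, 0.68, None, 0.32]

observed = [s for s in shares if s is not None]
fill = statistics.median(observed)

# Keep an explicit indicator so a model can tell imputed rows from observed ones.
is_missing = [s is None for s in shares]
imputed = [fill if s is None else s for s in shares]

print("fill value:", fill)
print("imputed:   ", imputed)
print("is_missing:", is_missing)
```

If the nulls turn out to be structural (for example, areas that do not report results by county), dropping those rows may be more honest than imputing them.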

Out[61]:

saturn.columns["republican_pct_2020"].stats

stat	value
n	3,221
nulls	109 (3.4%)
unique	3,111
min	0.05397
max	0.9618
mean	0.6497
median	0.6829
std	0.1613
q1	0.5576
q3	0.7747
iqr	0.2171
skew	-0.8091
kurtosis	0.2063
n_outliers	47
outlier_rate	0.0151
zero_rate	0

Fig 24.

Distribution of republican_pct_2020. Vertical dash marks the median.

Histogram bins for republican_pct_2020 (median: 0.6829120557612961).
bin	count
0.05397 – 0.07667	1
0.07667 – 0.09937	2
0.09937 – 0.1221	2
0.1221 – 0.1448	6
0.1448 – 0.1675	6
0.1675 – 0.1901	15
0.1901 – 0.2128	5
0.2128 – 0.2355	13
0.2355 – 0.2582	12
0.2582 – 0.2809	25
0.2809 – 0.3036	26
0.3036 – 0.3263	32
0.3263 – 0.349	32
0.349 – 0.3717	40
0.3717 – 0.3944	46
0.3944 – 0.4171	52
0.4171 – 0.4398	64
0.4398 – 0.4625	78
0.4625 – 0.4852	61
0.4852 – 0.5079	78
0.5079 – 0.5306	66
0.5306 – 0.5533	97
0.5533 – 0.576	126
0.576 – 0.5987	122
0.5987 – 0.6214	139
0.6214 – 0.6441	143
0.6441 – 0.6668	154
0.6668 – 0.6895	173
0.6895 – 0.7122	176
0.7122 – 0.7349	200
0.7349 – 0.7576	213
0.7576 – 0.7802	195
0.7802 – 0.8029	203
0.8029 – 0.8256	169
0.8256 – 0.8483	132
0.8483 – 0.871	103
0.871 – 0.8937	65
0.8937 – 0.9164	27
0.9164 – 0.9391	9
0.9391 – 0.9618	4

margin_2016 text feature

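This column stores the 2016 presidential vote margin as a percentage string (e.g., '15.17%') rather than as a number. All 3,138 non-null values (83 rows are null) are single tokens of 5–6 characters, confirming a uniform percentage format; the all-caps and one-word alerts are presumably side effects of that format, since the strings contain no letters or spaces. Assuming two-decimal formatting throughout, no value can be a negative margin of 10 points or more (e.g., '-15.17%' is 7 characters), even though democratic_pct_2016 reaches 0.93, so this column appears to hold an unsigned margin and is not directly comparable to the signed margin_2020. The value '15.17%' appears 29 times, far more than any other, which suggests a default, imputed, or boundary value worth investigating. The duplicate rate of 18.6% (584 duplicates across 2,554 unique values) is partly expected, since rounding to two decimals makes collisions likely.

Treatment: Strip the '%' suffix and cast to float (divide by 100 to match the proportion scale of margin_2020); confirm the sign convention and investigate the 29 occurrences of '15.17%' before modelling.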

anthropic:default · confidence high
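
A minimal sketch of the cast, assuming every non-null value looks like '15.17%'. The function name `parse_pct` and the choice to return a proportion (dividing by 100, to match the 0–1 scale of the numeric margin and share columns) are assumptions for this example.

```python
def parse_pct(text):
    """Convert a string such as '15.17%' to a proportion; return None if it cannot be parsed."""
    if text is None:
        return None
    cleaned = text.strip().rstrip("%")
    try:
        return float(cleaned) / 100.0
    except ValueError:
        return None

# '15.17%' is quoted in the prose above; the other inputs are hypothetical.
samples = ["15.17%", "9.80%", None, "n/a"]
print([parse_pct(s) for s in samples])
```

Returning None for unparseable input, rather than raising, makes it easy to count how many rows fail the cast before deciding how to handle them.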

Out[64]:

saturn.columns["margin_2016"].stats

stat	value
n	3,221
nulls	83 (2.6%)
unique	2,554
len_min	5
len_max	6
len_mean	5.896
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	584
duplicate_rate	0.1861
vocab_size	2,554
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 25.

Character-length distribution for margin_2016.

Character-length distribution for margin_2016 (mean: 5.895793499043977).
chars	count
5	327
6	2811

democratic_pct_2016 numeric feature

This column represents the Democratic party vote share (as a proportion 0–1) in the 2016 U.S. presidential election, most likely aggregated at the county level given 3,221 rows. The distribution is right-skewed (skew=0.94) with a mean of 0.317 and median of 0.286, indicating that most geographic units lean Republican, with a long tail of heavily Democratic areas reaching up to 0.928. The spread is moderate (IQR=0.193, std=0.153) and 75 outliers exist on the high end, likely dense urban counties.

Treatment: Use as-is or apply logit-transform to unbound the [0,1] proportion before linear modelling.

anthropic:default · confidence high

Out[67]:

saturn.columns["democratic_pct_2016"].stats

stat	value
n	3,221
nulls	83 (2.6%)
unique	3,111
min	0.03145
max	0.9285
mean	0.3174
median	0.2861
std	0.1527
q1	0.2054
q3	0.3982
iqr	0.1928
skew	0.9371
kurtosis	0.666
n_outliers	75
outlier_rate	0.0239
zero_rate	0

Fig 26.

Distribution of democratic_pct_2016. Vertical dash marks the median.

Histogram bins for democratic_pct_2016 (median: 0.2861345852895).
bin	count
0.03145 – 0.05387	8
0.05387 – 0.0763	16
0.0763 – 0.09872	52
0.09872 – 0.1211	74
0.1211 – 0.1436	116
0.1436 – 0.166	146
0.166 – 0.1884	203
0.1884 – 0.2109	226
0.2109 – 0.2333	240
0.2333 – 0.2557	218
0.2557 – 0.2781	200
0.2781 – 0.3006	205
0.3006 – 0.323	153
0.323 – 0.3454	147
0.3454 – 0.3678	153
0.3678 – 0.3903	150
0.3903 – 0.4127	106
0.4127 – 0.4351	111
0.4351 – 0.4575	77
0.4575 – 0.48	78
0.48 – 0.5024	56
0.5024 – 0.5248	72
0.5248 – 0.5472	45
0.5472 – 0.5697	42
0.5697 – 0.5921	36
0.5921 – 0.6145	36
0.6145 – 0.6369	34
0.6369 – 0.6594	25
0.6594 – 0.6818	30
0.6818 – 0.7042	16
0.7042 – 0.7266	12
0.7266 – 0.7491	12
0.7491 – 0.7715	14
0.7715 – 0.7939	9
0.7939 – 0.8163	6
0.8163 – 0.8388	4
0.8388 – 0.8612	4
0.8612 – 0.8836	3
0.8836 – 0.906	2
0.906 – 0.9285	1

republican_pct_2016 numeric feature

This column is the Republican vote share (as a proportion, 0–1) in the 2016 U.S. presidential election for each county. The distribution is left-skewed (skew = -0.81) with a median of 0.666 and mean of 0.635, indicating that most counties leaned heavily Republican in 2016, which is expected in county-level data where Republicans dominate by count even if not by population. The range spans 0.041 to 0.953, from overwhelmingly Democratic to overwhelmingly Republican counties, with only 62 outliers (1.98%). The null rate is 2.58% (83 rows), matching the other 2016 columns, so the field is well populated but not complete.

Treatment: Use directly as a continuous feature; consider pairing with democratic equivalent or computing a two-party margin; mild left skew does not require transformation for most models.

anthropic:default · confidence high
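
The two-party margin mentioned in the treatment can be sketched as follows. The definition used here, (R − D) / (R + D), is one common convention and is an assumption; it rescales the margin to the two-party vote so that third-party share does not shrink it.

```python
def two_party_margin(rep_share, dem_share):
    """Republican minus Democratic share, rescaled to the two-party vote.

    Returns None when either share is missing or both are zero.
    """
    if rep_share is None or dem_share is None:
        return None
    total = rep_share + dem_share
    if total == 0:
        return None
    return (rep_share - dem_share) / total

# Medians from the 2016 stats tables, used only as plausible inputs; the medians
# of two columns need not belong to the same county.
print(two_party_margin(0.6656, 0.2861))
print(two_party_margin(None, 0.2861))
```

Positive values indicate a Republican lead and negative values a Democratic lead, with the result bounded between −1 and +1.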

Out[70]:

saturn.columns["republican_pct_2016"].stats

stat	value
n	3,221
nulls	83 (2.6%)
unique	3,111
min	0.04122
max	0.9527
mean	0.6354
median	0.6656
std	0.1559
q1	0.5463
q3	0.7503
iqr	0.2041
skew	-0.8145
kurtosis	0.3566
n_outliers	62
outlier_rate	0.01976
zero_rate	0

Fig 27.

Distribution of republican_pct_2016. Vertical dash marks the median.

Histogram bins for republican_pct_2016 (median: 0.6655515136155).
bin	count
0.04122 – 0.06401	1
0.06401 – 0.0868	1
0.0868 – 0.1096	5
0.1096 – 0.1324	2
0.1324 – 0.1552	6
0.1552 – 0.1779	11
0.1779 – 0.2007	7
0.2007 – 0.2235	17
0.2235 – 0.2463	17
0.2463 – 0.2691	17
0.2691 – 0.2919	23
0.2919 – 0.3147	32
0.3147 – 0.3375	34
0.3375 – 0.3602	30
0.3602 – 0.383	43
0.383 – 0.4058	41
0.4058 – 0.4286	64
0.4286 – 0.4514	71
0.4514 – 0.4742	63
0.4742 – 0.497	89
0.497 – 0.5198	78
0.5198 – 0.5425	115
0.5425 – 0.5653	116
0.5653 – 0.5881	147
0.5881 – 0.6109	147
0.6109 – 0.6337	156
0.6337 – 0.6565	165
0.6565 – 0.6793	193
0.6793 – 0.7021	190
0.7021 – 0.7249	215
0.7249 – 0.7476	223
0.7476 – 0.7704	213
0.7704 – 0.7932	187
0.7932 – 0.816	142
0.816 – 0.8388	113
0.8388 – 0.8616	74
0.8616 – 0.8844	49
0.8844 – 0.9072	29
0.9072 – 0.9299	9
0.9299 – 0.9527	3

How to cite