saturn·

data trove · SCARS: Standardized County Analysis Research System

source /home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv · 3,221 rows · 20 columns · profiled 2026-06-21

Reading

dataset summary · high confidence anthropic:default

This dataset covers 3,221 U.S. counties with demographic, economic, and electoral variables for the 2016 and 2020 presidential elections. The most striking finding is that Republican candidates dominated the majority of counties in both cycles — the median Republican share was roughly 67% in 2016 and 68% in 2020, while the Democratic median hovered near 29–30%, reflecting the well-known rural-county skew in U.S. politics. A data quality issue worth flagging immediately is the median_household_income column, which contains a minimum value of -666,666,666 — almost certainly a sentinel/error value — dragging the column mean to -$152,820 despite a plausible median of $52,380. Poverty rate averages about 15% across counties but reaches as high as 66%, and racial composition variables (pct_white, pct_black, pct_hispanic) are highly skewed, suggesting a small number of majority-minority counties sit at the extremes.

citing: republican_pct_2016.stats.median · republican_pct_2020.stats.median · democratic_pct_2016.stats.median · democratic_pct_2020.stats.median · median_household_income.stats.min · median_household_income.stats.median · poverty_rate.stats.mean · poverty_rate.stats.max · pct_white.stats.mean · pct_white.stats.skew · row_count

Schema

20 columns
Per-column summary.
column · type · nulls · unique · alerts
NAME · text · 0.0% · 3,221 · near_unique
total_population · numeric · 0.0% · 3,160 · high_skew, outliers
black_population · numeric · 0.0% · 2,066 · high_skew, outliers
white_population · numeric · 0.0% · 3,143 · high_skew, outliers
hispanic_population · numeric · 0.0% · 2,331 · high_skew, outliers
state · numeric · 0.0% · 52 · none
county · numeric · 0.0% · 326 · high_skew, outliers
FIPS · numeric · 0.0% · 3,221 · none
pct_black · numeric · 0.0% · 3,128 · high_skew, outliers
pct_white · numeric · 0.0% · 3,218 · none
pct_hispanic · numeric · 0.0% · 3,205 · high_skew, outliers
poverty_rate · numeric · 0.0% · 3,219 · high_skew
below_poverty_level · numeric · 0.0% · 2,824 · high_skew, outliers
median_household_income · numeric · 0.0% · 3,099 · high_skew, outliers
margin_2020 · numeric · 3.4% · 3,112 · none
democratic_pct_2020 · numeric · 3.4% · 3,111 · none
republican_pct_2020 · numeric · 3.4% · 3,111 · none
margin_2016 · text · 2.6% · 2,554 · one_word, allcaps, short_text
democratic_pct_2016 · numeric · 2.6% · 3,111 · none
republican_pct_2016 · numeric · 2.6% · 3,111 · none

NAME

text label near_unique
This column contains US county names in a 'Name County, State' pattern (e.g., 'Jefferson County, Texas'): 'county,' appears in 3,007 of 3,221 rows, and US state names dominate the top words. The remaining rows are presumably county-equivalents with other designations. Every value is unique (3,221 distinct entries, 0 duplicates, 0 nulls), so the full string works as an identifier for county-level records. The mean string length of ~24 characters and mean word count of ~3.2 are consistent with that pattern, and a vocabulary of only 1,983 distinct words across 3,221 rows indicates structured, standardized naming rather than free text. Treatment: Use as a human-readable label or join key; when joining to external county tables, normalize casing and split the trailing state suffix into its own column, since county names alone repeat across states. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
3,221
len_min
16
len_max
42
len_mean
24.27
len_median
24
len_p95
31
word_mean
3.243
word_median
3
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
1,983
readability_flesch_mean
7.581
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0
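The split recommended in the treatment can be sketched as follows. This is a minimal illustration with pandas, assuming the 'Name County, State' pattern holds; only the first sample value comes from the report, the others are made up for the example.

```python
import pandas as pd

# Sample values in the 'Name County, State' pattern.
# Only the first string is taken from the report; the rest are illustrative.
names = pd.Series([
    "Jefferson County, Texas",
    "Orleans Parish, Louisiana",
    "District of Columbia, District of Columbia",
])

# Split on the LAST comma so the state suffix is separated cleanly.
parts = names.str.rsplit(",", n=1, expand=True)
county_name = parts[0].str.strip().str.lower()
state_name = parts[1].str.strip().str.lower()

print(county_name.tolist())
print(state_name.tolist())
```

Keeping the state in its own column preserves uniqueness, because the county part alone is not unique nationwide.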

total_population

numeric feature high_skew outliers
This column is the total population count per county or county-equivalent, ranging from 117 to 10,040,682. The distribution is severely right-skewed (skew = 13.67, kurtosis = 311.91): the median of 25,981 is only about a quarter of the mean of 102,398, indicating a long tail driven by a small number of very large population centers. An outlier rate of 13.7% (441 of 3,221 rows) is unusually high and signals that large urban counties coexist with many small rural ones in the same dataset. Treatment: Log-transform before regression or distance-based modelling to reduce skew and outlier influence. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
3,160
min
117
max
1.004e+07
mean
1.024e+05
median
25,981
std
3.283e+05
q1
11,125
q3
66,969
iqr
55,844
skew
13.67
kurtosis
311.9
n_outliers
441
outlier_rate
0.1369
zero_rate
0
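A small sketch of the log transform suggested above, using only the five summary values reported for this column (min, q1, median, q3, max) as stand-in data. The real column has 3,221 rows, so the skew figures printed here are illustrative, not the report's.

```python
import numpy as np
import pandas as pd

# Stand-in data: the reported min, q1, median, q3 and max of total_population.
pop = pd.Series([117, 11_125, 25_981, 66_969, 10_040_682], dtype="float64")

# A plain log is safe here because the reported minimum (117) is positive.
log_pop = np.log10(pop)

print("skew before:", pop.skew())
print("skew after: ", log_pop.skew())
```

On the log10 scale the values span roughly 2 to 7 instead of seven orders of magnitude, which is what reduces the leverage of the largest counties.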

black_population

numeric feature high_skew outliers
This column represents the count of Black residents per geographic unit (likely U.S. counties or census tracts). The distribution is extremely right-skewed (skew=10.46, kurtosis=148.22), with a median of just 859 versus a mean of 12,913 and a maximum of 1,202,260 — indicating a small number of high-population urban areas dominating the tail. 438 outliers (13.6% of rows) and a std of 54,951 against a median of 859 confirm the vast majority of units are small while a few are very large; 2.8% of records are zero, likely rural or sparsely populated geographies. Treatment: Log-transform (log1p) before modelling to compress the extreme right tail; consider per-capita normalisation if total population is available. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
2,066
min
0
max
1.202e+06
mean
1.291e+04
median
859
std
5.495e+04
q1
114
q3
5,553
iqr
5,439
skew
10.46
kurtosis
148.2
n_outliers
438
outlier_rate
0.136
zero_rate
0.02825
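Because 2.8% of the rows are zero, a plain logarithm would produce negative infinity for them. The sketch below contrasts `log` with `log1p` on the five reported summary values, which are used here as stand-in data only.

```python
import numpy as np
import pandas as pd

# Stand-in data: the reported min, q1, median, q3 and max of black_population.
counts = pd.Series([0, 114, 859, 5_553, 1_202_260], dtype="float64")

# A plain log maps zero counts to -inf (the divide warning is silenced on purpose).
with np.errstate(divide="ignore"):
    plain_log = np.log(counts)

# log1p computes log(1 + x), so zero stays at zero and every value is finite.
safe_log = np.log1p(counts)

print(plain_log.tolist())
print(safe_log.tolist())
```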

white_population

numeric feature high_skew outliers
This column represents the white population count for geographic units (likely counties or census tracts), with 3,221 non-null records spanning a wide range from 58 to 4,795,186. The distribution is severely right-skewed (skew = 10.35, kurtosis = 175.65): the median is only 21,282 while the mean is 72,000, indicating most units are small but a long tail of large urban areas dominates — 407 records (12.6%) are flagged as outliers. The near-unique value count (3,143 of 3,221) confirms this is a raw count feature, not a category or ID. Treatment: Log-transform (log1p) before modelling to reduce skew and compress the extreme outlier range. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
3,143
min
58
max
4.795e+06
mean
7.2e+04
median
21,282
std
1.918e+05
q1
8,855
q3
56,553
iqr
47,698
skew
10.35
kurtosis
175.7
n_outliers
407
outlier_rate
0.1264
zero_rate
0

hispanic_population

numeric feature high_skew outliers
This column represents the Hispanic population count per county across 3,221 records. The distribution is extremely right-skewed (skew = 22.75, kurtosis = 744.79), with a median of only 1,209 but a mean of 19,427 and a maximum of 4,851,344, indicating that a small number of large urban areas dominate the distribution. 15.3% of records (492 rows) are flagged as outliers, and the middle half of counties fall between just 377 and 5,875 (IQR = 5,498) while the std is 125,108, confirming the extreme concentration of values at the low end with a long heavy tail. Treatment: Log-transform (log1p) before modelling to reduce extreme skew; consider per-capita normalization using total_population. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
2,331
min
0
max
4.851e+06
mean
1.943e+04
median
1,209
std
1.251e+05
q1
377
q3
5,875
iqr
5,498
skew
22.75
kurtosis
744.8
n_outliers
492
outlier_rate
0.1527
zero_rate
0.004967

state

numeric foreign_key
This column named 'state' is almost certainly a numeric state code (e.g., FIPS state codes or similar enumeration), with 52 distinct integer values ranging from 1 to 72 — consistent with US FIPS codes covering 50 states plus DC and outlying territories such as Puerto Rico (72). The distribution is remarkably flat and near-uniform (low kurtosis of -0.63, near-zero skew of 0.16, IQR of 27 across a 1–72 range), with zero nulls and zero outliers, indicating a clean, fully-populated categorical-as-integer field. The presence of 52 unique values rather than 50 or 51 suggests territorial codes are included, which may surprise analysts expecting only the 50 US states. Treatment: Treat as a categorical nominal code; do not use raw numeric value in regression — one-hot encode or left-join to a state reference table for geographic attributes. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
52
min
1
max
72
mean
31.28
median
30
std
16.28
q1
19
q3
46
iqr
27
skew
0.157
kurtosis
-0.6261
n_outliers
0
outlier_rate
0
zero_rate
0
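The two options named in the treatment, one-hot encoding and a join to a state reference table, might look like this in pandas. The rows and the three-entry reference table are hypothetical; a real lookup would cover all 52 codes.

```python
import pandas as pd

# Hypothetical rows using three valid FIPS state codes.
rows = pd.DataFrame({"state": [1, 48, 72, 48]})

# Hypothetical, deliberately tiny reference table.
state_ref = pd.DataFrame({
    "state": [1, 48, 72],
    "state_name": ["Alabama", "Texas", "Puerto Rico"],
})

# Option 1: one-hot encode the code so no numeric ordering is implied.
one_hot = pd.get_dummies(rows["state"].astype("category"), prefix="state")

# Option 2: left-join to attach readable attributes while keeping every row.
labelled = rows.merge(state_ref, on="state", how="left")

print(one_hot.columns.tolist())
print(labelled)
```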

county

numeric foreign_key high_skew outliers
This column is almost certainly the within-state county FIPS code, not a true continuous measure: only 326 distinct values occur across 3,221 rows, so the same code is reused in many states and identifies a county only in combination with state. The distribution is right-skewed (skew 2.87, kurtosis 11.64), with values from 1 to 840 and 178 flagged outliers (5.5%); this reflects how codes are assigned rather than any meaningful numeric magnitude. Low codes occur in every state, while high codes likely occur only in states with many counties or specially numbered county-equivalents, which is why the mean (102.85) sits well above the median (79). Treatment: Cast to categorical/string and treat as a geographic grouping key together with state; do not use the raw numeric value in any regression or distance-based model. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
326
min
1
max
840
mean
102.8
median
79
std
106.6
q1
35
q3
133
iqr
98
skew
2.868
kurtosis
11.64
n_outliers
178
outlier_rate
0.05526
zero_rate
0
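If `county` is the three-digit within-state code, then `state * 1000 + county` should reproduce the `FIPS` column. That layout is an assumption, not something the profile states, so a check along these lines is worth running. The rows below are hypothetical, apart from the first and last FIPS values, which are the reported minimum and maximum.

```python
import pandas as pd

# Hypothetical rows; 1001 and 72153 are the reported FIPS min and max.
df = pd.DataFrame({
    "state":  [1, 48, 72],
    "county": [1, 245, 153],
    "FIPS":   [1001, 48245, 72153],
})

# Assumed layout: two-digit state code followed by three-digit county code.
df["fips_from_parts"] = df["state"] * 1000 + df["county"]
df["parts_match"] = df["fips_from_parts"] == df["FIPS"]

print(df)
print("all rows consistent:", bool(df["parts_match"].all()))
```

Any row where `parts_match` is false would point to a mis-keyed state, county, or FIPS value.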

FIPS

numeric identifier
This column contains US county FIPS (Federal Information Processing Standards) codes. These are five-digit identifiers (a two-digit state code followed by a three-digit county code); stored as integers they lose leading zeros, which is why they appear here with 4–5 digits. Every row has a distinct value (3,221 unique, matching n exactly) with no nulls, confirming this is a primary identifier, and the row count is in line with the number of counties and county-equivalents in the 50 states, DC, and Puerto Rico. The distribution is nearly uniform (skew 0.157, kurtosis -0.63), consistent with the sequential-but-gapped structure of FIPS codes across states. The range of 1001 to 72153 is correct for US county FIPS codes (Alabama's first county to Puerto Rico's last). Treatment: Treat as a categorical geographic identifier and do not use numerically; zero-pad to five characters before joining, and left-join to FIPS reference tables for geographic enrichment or spatial analysis. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
3,221
min
1,001
max
72,153
mean
3.138e+04
median
30,023
std
1.63e+04
q1
19,031
q3
46,105
iqr
27,074
skew
0.1569
kurtosis
-0.6308
n_outliers
0
outlier_rate
0
zero_rate
0
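Since the codes are stored as integers, Alabama's first county appears as 1001 rather than 01001. Many external tables key on the five-character string, so a join usually needs zero-padding first. A minimal sketch, with a hypothetical reference table:

```python
import pandas as pd

counties = pd.DataFrame({"FIPS": [1001, 48245, 72153]})

# Restore the leading zero lost by integer storage.
counties["fips_str"] = counties["FIPS"].astype(str).str.zfill(5)

# Hypothetical reference table keyed by five-character FIPS strings.
reference = pd.DataFrame({
    "fips_str": ["01001", "72153"],
    "label": ["first code in range", "last code in range"],
})

# indicator=True adds a _merge column that shows which rows found a match.
joined = counties.merge(reference, on="fips_str", how="left", indicator=True)
unmatched = joined.loc[joined["_merge"] == "left_only", "fips_str"].tolist()

print(joined)
print("unmatched:", unmatched)
```

Listing the unmatched keys after a left join is a cheap way to catch codes that the reference table does not cover.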

pct_black

numeric feature high_skew outliers
This column represents the percentage of Black residents in a geographic unit (e.g., census tract, county, or zip code), with 3,221 rows and no nulls. The distribution is heavily right-skewed (skew=2.33, kurtosis=5.45): the median is just 2.38% while the mean is pulled to 9.08%, and 422 rows (13.1%) are flagged as outliers reaching up to 87.79%. The IQR spans only 0.69–10.21%, meaning most units are predominantly non-Black, with a long tail of majority-Black geographies. Treatment: Apply log1p or quantile transformation before regression to address severe right skew and outlier influence. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
3,128
min
0
max
87.79
mean
9.085
median
2.383
std
14.5
q1
0.6919
q3
10.21
iqr
9.513
skew
2.326
kurtosis
5.451
n_outliers
422
outlier_rate
0.131
zero_rate
0.02825
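The report does not say which rule produced the outlier flags. A common choice is Tukey's rule, which flags values more than 1.5 × IQR beyond the quartiles. Assuming that rule, the reported quartiles imply the following fences for pct_black:

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported for pct_black (already rounded by the report).
lower, upper = tukey_fences(0.6919, 10.21)

print(f"lower fence: {lower:.2f}")
print(f"upper fence: {upper:.2f}")
```

The lower fence falls below zero, so under this assumption every flagged row would be a county with a Black population share above roughly 24.5%, not only majority-Black counties.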

pct_white

numeric feature
This column represents the percentage of white population in a geographic or demographic unit, ranging from 3.29% to 100% across 3,221 records. The distribution is strongly left-skewed (skew = -1.56) with a mean of 81.2% and median of 87.7%, indicating the dataset is dominated by majority-white units — likely U.S. counties, census tracts, or similar jurisdictions. The gap between mean and median signals a long lower tail of more diverse units, and 145 outliers (4.5%) likely represent highly diverse areas pulling the distribution downward. Near-perfect uniqueness (3,218 of 3,221 values) confirms this is a continuous ratio measure, not a binned or rounded variable. Treatment: Use as-is or apply a reflection-log transform to address left skew before regression; consider interactions with other demographic features. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
3,218
min
3.29
max
100
mean
81.2
median
87.66
std
17.35
q1
73.62
q3
93.99
iqr
20.37
skew
-1.562
kurtosis
2.301
n_outliers
145
outlier_rate
0.04502
zero_rate
0
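A reflection-log transform handles left skew by flipping the variable so the long tail points right, then applying a log. The sketch below uses the five reported summary values as stand-in data; the choice of `max - x` as the reflection is one common convention, not something the report prescribes.

```python
import numpy as np
import pandas as pd

# Stand-in data: the reported min, q1, median, q3 and max of pct_white.
pct_white = pd.Series([3.29, 73.62, 87.66, 93.99, 100.0])

# Reflect around the maximum so the left tail becomes a right tail.
reflected = pct_white.max() - pct_white

# log1p keeps the reflected zero (the maximum itself) finite.
transformed = np.log1p(reflected)

print("skew original: ", pct_white.skew())
print("skew reflected:", reflected.skew())
print(transformed.round(3).tolist())
```

Reflection reverses the direction of the variable: a higher pct_white gives a lower transformed value, so coefficient signs flip unless the result is negated.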

pct_hispanic

numeric feature high_skew outliers
This column represents the percentage of Hispanic population in a geographic or demographic unit, ranging from 0% to nearly 100%. The distribution is severely right-skewed (skew=3.11, kurtosis=9.89): the median is only 4.52% while the mean is 11.74%, indicating most units have low Hispanic shares but a long tail of high-concentration areas drives the average up. A notable 13% of rows (420 out of 3221) are flagged as outliers, consistent with areas of heavy Hispanic concentration. The near-zero zero_rate (0.5%) and zero null_rate suggest good data completeness. Treatment: Log-transform or apply a square-root transformation before regression/modelling to reduce skew and diminish outlier leverage. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
3,205
min
0
max
100
mean
11.74
median
4.516
std
19.4
q1
2.363
q3
10.66
iqr
8.294
skew
3.113
kurtosis
9.888
n_outliers
420
outlier_rate
0.1304
zero_rate
0.004967

poverty_rate

numeric feature high_skew
This column represents the poverty rate (percentage) for each of the 3,221 counties, with complete coverage (no nulls) and near-unique values (3,219 distinct). The distribution is right-skewed (skew 2.11, kurtosis 6.92), with a median of 13.8% and a mean pulled up to 15.4% by a long upper tail reaching 66.2%; 143 outliers (4.4% of records) drive this tail, a minority of counties with very high poverty concentration that will disproportionately influence linear models. Treatment: Apply a log1p or square-root transform to reduce right skew before regression (the minimum is 0, so a plain log is undefined for at least one row); investigate the 143 outlier counties separately for data quality or structural differences. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
3,219
min
0
max
66.19
mean
15.38
median
13.81
std
7.97
q1
10.34
q3
18.25
iqr
7.91
skew
2.111
kurtosis
6.922
n_outliers
143
outlier_rate
0.0444
zero_rate
0.0003105

below_poverty_level

numeric feature high_skew outliers
This column represents a count of people living below the poverty level in each county. The distribution is extremely right-skewed (skew=15.1, kurtosis=360.7): the median is 3,831 but the mean is 13,136, and the maximum reaches 1,401,656, almost certainly a very large urban county pulling the tail hard. With 351 outliers (~10.9% of rows) and a standard deviation of 44,284 against a median of 3,831, a small number of high-population counties dominate the raw counts. Treatment: Log-transform (log1p) before regression or clustering; for comparisons across counties, prefer the existing poverty_rate column or normalize by total_population. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
2,824
min
0
max
1.402e+06
mean
1.314e+04
median
3,831
std
4.428e+04
q1
1,547
q3
9,937
iqr
8,390
skew
15.11
kurtosis
360.7
n_outliers
351
outlier_rate
0.109
zero_rate
0.0003105
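Normalizing the count by population can be sketched as below. The rows are hypothetical, and the result will not necessarily equal the dataset's poverty_rate column, because official poverty rates typically use the population for whom poverty status is determined as the denominator rather than total population.

```python
import pandas as pd

# Hypothetical counties of very different sizes.
df = pd.DataFrame({
    "total_population":    [1_000, 50_000, 2_000_000],
    "below_poverty_level": [150, 5_000, 300_000],
})

# Share of residents below the poverty level, in percent.
df["poverty_share_pct"] = 100 * df["below_poverty_level"] / df["total_population"]

print(df)
```

The first and third rows differ by a factor of 2,000 in raw count but have the same share, which is the comparison the raw counts hide.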

median_household_income

numeric feature high_skew outliers
This column represents median household income per county, likely sourced from census data. The median of 52,380 and IQR of 16,300 look plausible for household income, but the summary statistics are severely compromised by a sentinel/error value: a minimum of -666,666,666 drags the mean to -152,820 and produces a kurtosis of about 3,216 and a skew of -56.73. Those figures are roughly what a single extreme row among 3,221 would produce, so the problem may be confined to one county, though this should be verified against the data. The std of 11,747,597 is likewise an artifact of the sentinel, and most of the 182 flagged outliers (5.65% of rows) are likely ordinary high- or low-income counties rather than errors. Treatment: Replace -666666666 and any other negative values with NaN, recompute the statistics, review the remaining flagged outliers, and consider a log-transform after cleaning. high · anthropic:default
n
3,221
nulls
0 (0.0%)
unique
3,099
min
-6.667e+08
max
147,111
mean
-1.528e+05
median
52,380
std
1.175e+07
q1
44,939
q3
61,239
iqr
16,300
skew
-56.73
kurtosis
3216
n_outliers
182
outlier_rate
0.0565
zero_rate
0
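The cleaning step in the treatment can be sketched as follows. The sentinel value is the reported minimum; the other four values are the reported q1, median, q3 and max, used as stand-in data.

```python
import pandas as pd

SENTINEL = -666_666_666  # the reported minimum of median_household_income

# Stand-in data: reported q1, median, q3 and max, plus one sentinel row.
income = pd.Series([44_939, 52_380, 61_239, SENTINEL, 147_111], dtype="float64")

# Income cannot be negative, so mask every negative value (sentinel included) as NaN.
cleaned = income.mask(income < 0)

print("mean with sentinel:   ", income.mean())
print("mean after cleaning:  ", cleaned.mean())
print("median with sentinel: ", income.median())
print("rows set to NaN:      ", int(cleaned.isna().sum()))
```

The median barely moves when the sentinel is present while the mean becomes meaningless, which matches the pattern in the reported statistics.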

margin_2020

numeric feature
This column is the 2020 election margin expressed as a proportion (roughly −0.87 to +0.93). The reported mean and extremes match republican_pct_2020 minus democratic_pct_2020 (for example, the means give 0.650 − 0.333 = 0.317), so positive values indicate a Republican lead and negative values a Democratic lead. The distribution is moderately left-skewed (skew −0.82), with a mean of 0.317 sitting below the median of 0.384, indicating a tail of strongly Democratic counties pulling the average down. 48 values (1.54% of non-null rows) are flagged as outliers at the distributional extremes. The null rate of 3.38% (109 rows) matches the other 2020 columns and is worth investigating for systematic missingness. Treatment: Use as-is for modelling; check whether the nulls are structurally missing before imputation, and because this column appears to be a linear combination of the two 2020 share columns, avoid including all three in one linear model. high · anthropic:default
n
3,221
nulls
109 (3.4%)
unique
3,112
min
-0.8675
max
0.9309
mean
0.317
median
0.3844
std
0.321
q1
0.1348
q3
0.5662
iqr
0.4314
skew
-0.8212
kurtosis
0.2286
n_outliers
48
outlier_rate
0.01542
zero_rate
0
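The statistics reported for the three 2020 columns allow a quick consistency check on how the margin is defined. The sketch below tests the hypothesis margin = Republican share − Democratic share against the reported mean, min and max. The min and max comparisons additionally assume the most Democratic county is also the least Republican one, which is plausible but not guaranteed.

```python
import math

# Summary statistics copied from this report (2020 columns).
rep = {"min": 0.05397, "max": 0.9618, "mean": 0.6497}
dem = {"min": 0.03091, "max": 0.9215, "mean": 0.3327}
margin = {"min": -0.8675, "max": 0.9309, "mean": 0.317}

TOL = 5e-4  # the report rounds to about four significant digits

# Hypothesis under test: margin = republican share - democratic share.
checks = {
    "mean": math.isclose(rep["mean"] - dem["mean"], margin["mean"], abs_tol=TOL),
    "min": math.isclose(rep["min"] - dem["max"], margin["min"], abs_tol=TOL),
    "max": math.isclose(rep["max"] - dem["min"], margin["max"], abs_tol=TOL),
}

print(checks)
```

All three checks agree within rounding, which supports reading positive margins as Republican leads. A row-level comparison on the actual data would settle it.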

democratic_pct_2020

numeric feature
This column represents the Democratic vote share (as a proportion 0–1) in the 2020 presidential election for each county. The distribution is right-skewed (skew=0.83) with a mean of 0.333 and median of 0.300, indicating most counties lean Republican: the typical county gave Democrats roughly 30% of the vote. The range spans 0.031 to 0.921, capturing both deep-red and deep-blue areas, with only 49 outliers (1.57%). The null rate of 3.38% (109 rows) is modest but not negligible and should be explained before modelling. Treatment: Use as-is or apply a logit transform to stretch the bounded 0–1 proportion before regression or clustering. high · anthropic:default
n
3,221
nulls
109 (3.4%)
unique
3,111
min
0.03091
max
0.9215
mean
0.3327
median
0.2998
std
0.1598
q1
0.2091
q3
0.4236
iqr
0.2145
skew
0.8326
kurtosis
0.2523
n_outliers
49
outlier_rate
0.01575
zero_rate
0
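A logit transform maps a proportion in (0, 1) onto the whole real line, which stretches values near the bounds. The sketch below applies it to the reported min, median and max plus 0.5 as a reference point. The clipping constant is an arbitrary choice for this example; it only matters if exact 0 or 1 shares occur, which the reported range suggests they do not.

```python
import numpy as np
import pandas as pd

# Reported min, median and max of democratic_pct_2020, plus 0.5 for reference.
share = pd.Series([0.03091, 0.2998, 0.5, 0.9215])

# Clip away from 0 and 1 so the logit stays finite (EPS is an arbitrary choice).
EPS = 1e-6
clipped = share.clip(EPS, 1 - EPS)

# logit(p) = log(p / (1 - p)); 0.5 maps to 0, small shares to large negatives.
logit = np.log(clipped / (1 - clipped))

print(logit.round(3).tolist())
```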

republican_pct_2020

numeric feature
This column represents the Republican vote share (as a proportion 0–1) in the 2020 U.S. election, most likely at the county or precinct level. The mean of 0.650 and median of 0.683 indicate a right-leaning dataset — the majority of geographic units recorded Republican majorities, which is consistent with county-level data where rural areas outnumber urban ones by count. The distribution is notably left-skewed (skew = −0.809), meaning a tail of strongly Democratic units pulls the mean below the median, while the near-mesokurtic kurtosis (0.206) and only 47 outliers suggest no extreme concentration at the tails. The null rate of 3.38% warrants investigation to confirm whether missing values reflect unreported results or data gaps. Treatment: Use as-is for modeling after imputing or flagging the 3.38% nulls; consider logit-transform if used as a continuous predictor in a linear model. high · anthropic:default
n
3,221
nulls
109 (3.4%)
unique
3,111
min
0.05397
max
0.9618
mean
0.6497
median
0.6829
std
0.1613
q1
0.5576
q3
0.7747
iqr
0.2171
skew
-0.8091
kurtosis
0.2063
n_outliers
47
outlier_rate
0.0151
zero_rate
0
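All three 2020 columns report 109 nulls. One way to investigate whether those are the same rows, which would suggest counties without reported results rather than scattered gaps, is to compare the missingness masks. The rows below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second has no 2020 results at all.
df = pd.DataFrame({
    "margin_2020":         [0.30, np.nan, -0.10],
    "democratic_pct_2020": [0.34, np.nan, 0.54],
    "republican_pct_2020": [0.64, np.nan, 0.44],
})

cols = ["margin_2020", "democratic_pct_2020", "republican_pct_2020"]
missing = df[cols].isna()

all_missing = missing.all(axis=1)  # rows where every 2020 column is null
any_missing = missing.any(axis=1)  # rows where at least one 2020 column is null

# If the two masks coincide, the nulls always occur together.
nulls_cooccur = bool((all_missing == any_missing).all())

print("rows with no 2020 data:", int(all_missing.sum()))
print("nulls always co-occur: ", nulls_cooccur)
```

If the nulls do co-occur, grouping the affected rows by `state` would show whether they cluster in particular states or territories.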

margin_2016

text feature one_word allcaps short_text
This column stores the 2016 election margin as a percentage string (e.g., '15.17%') rather than a numeric type. All 3,138 non-null values are single tokens of 5–6 characters, confirming a uniform percentage format; the allcaps flag is presumably an artifact of the values containing no letters. '15.17%' appears 29 times, far more than any other value, suggesting it may be a default, imputed, or boundary value worth investigating. The duplicate rate of 18.6% (584 duplicates across 2,554 unique values) is less alarming on its own, since values rounded to two decimals will collide by chance. The format also differs from margin_2020, which is a signed proportion, so the two columns need a common scale and a verified sign convention before they are compared. Treatment: Strip the '%' suffix, cast to float, and divide by 100 to match margin_2020; investigate the 29 occurrences of '15.17%' for data quality issues before modelling. high · anthropic:default
n
3,221
nulls
83 (2.6%)
unique
2,554
len_min
5
len_max
6
len_mean
5.896
len_median
6
len_p95
6
word_mean
1
word_median
1
n_empty
0
n_duplicates
584
duplicate_rate
0.1861
vocab_size
2,554
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0
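Parsing the percentage strings can be sketched as below. The first value is the example quoted in the report; the second value and the missing entry are illustrative. Dividing by 100 puts the result on the same proportion scale as margin_2020, though it does not resolve any difference in sign convention.

```python
import pandas as pd

# '15.17%' is quoted in the report; the other entries are illustrative.
raw = pd.Series(["15.17%", "5.40%", None])

# Drop the trailing '%', convert to float, and rescale from percent to proportion.
# errors="coerce" turns anything unparseable into NaN instead of raising.
margin_2016 = pd.to_numeric(raw.str.rstrip("%"), errors="coerce") / 100

print(margin_2016.tolist())
```

Comparing the NaN count before and after parsing shows whether any non-null string failed to convert.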

democratic_pct_2016

numeric feature
This column represents the Democratic party vote share (as a proportion 0–1) in the 2016 U.S. presidential election, most likely aggregated at the county level given 3,221 rows. The distribution is right-skewed (skew=0.94) with a mean of 0.317 and median of 0.286, indicating that most geographic units lean Republican, with a long tail of heavily Democratic areas reaching up to 0.928. The spread is moderate (IQR=0.193, std=0.153) and 75 outliers exist on the high end, likely dense urban counties. Treatment: Use as-is or apply logit-transform to unbound the [0,1] proportion before linear modelling. high · anthropic:default
n
3,221
nulls
83 (2.6%)
unique
3,111
min
0.03145
max
0.9285
mean
0.3174
median
0.2861
std
0.1527
q1
0.2054
q3
0.3982
iqr
0.1928
skew
0.9371
kurtosis
0.666
n_outliers
75
outlier_rate
0.0239
zero_rate
0

republican_pct_2016

numeric feature
This column represents the Republican vote share (as a proportion, 0–1) in the 2016 U.S. presidential election for each county. The distribution is left-skewed (skew = -0.81) with a median of 0.666 and mean of 0.635, indicating that most counties leaned heavily Republican in 2016, which is consistent with county-level data where Republicans dominate by count even if not by population. The range spans 0.041 to 0.953, from overwhelmingly Democratic to overwhelmingly Republican counties, with only 62 outliers (1.98%). The null rate of 2.58% (83 rows) is small but not zero and matches the other 2016 columns. Treatment: Use directly as a continuous feature; consider pairing with the Democratic equivalent or computing a two-party margin; mild left skew does not require transformation for most models. high · anthropic:default
n
3,221
nulls
83 (2.6%)
unique
3,111
min
0.04122
max
0.9527
mean
0.6354
median
0.6656
std
0.1559
q1
0.5463
q3
0.7503
iqr
0.2041
skew
-0.8145
kurtosis
0.3566
n_outliers
62
outlier_rate
0.01976
zero_rate
0
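The two-party margin mentioned in the treatment rescales both shares so they sum to one, removing the effect of third-party votes. A minimal sketch with hypothetical rows; the derived column names are made up for the example.

```python
import pandas as pd

# Hypothetical counties where 5% of the vote went to other candidates.
df = pd.DataFrame({
    "democratic_pct_2016": [0.30, 0.60],
    "republican_pct_2016": [0.65, 0.35],
})

dem = df["democratic_pct_2016"]
rep = df["republican_pct_2016"]
two_party_total = dem + rep

# Republican share of the two-party vote, and the signed two-party margin.
df["rep_two_party_share_2016"] = rep / two_party_total
df["two_party_margin_2016"] = (rep - dem) / two_party_total

print(df.round(4))
```

Positive margins indicate a Republican lead here, the same sign convention that the reported statistics suggest for margin_2020.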