data trove scars standardized county analysis research system
Reading
This dataset covers 3,221 U.S. counties and county-equivalents with demographic, economic, and electoral variables for the 2016 and 2020 presidential elections. The most striking finding is that Republican candidates carried the typical county by a wide margin in both cycles: the median Republican share was roughly 67% in 2016 and 68% in 2020, while the Democratic median was about 29% in 2016 and 30% in 2020. Because every county counts equally regardless of population, this reflects the well-known rural-county skew in U.S. politics. A data quality issue worth flagging immediately is the median_household_income column, whose minimum of -666,666,666 is almost certainly a sentinel/error value; it drags the column mean to -$152,820 despite a plausible median of $52,380. Poverty rate averages about 15% across counties but reaches as high as 66%. The racial composition variables are skewed (pct_black and pct_hispanic to the right, pct_white to the left), suggesting a small number of majority-minority counties sit at the extremes.
citing: republican_pct_2016.stats.median · republican_pct_2020.stats.median · democratic_pct_2016.stats.median · democratic_pct_2020.stats.median · median_household_income.stats.min · median_household_income.stats.median · poverty_rate.stats.mean · poverty_rate.stats.max · pct_white.stats.mean · pct_white.stats.skew · row_count
Charts the summary said to look at first
Histogram: republican_pct_2020 (40 equal-width bins)
| bin | count |
|---|---|
| 0.05397 – 0.07667 | 1 |
| 0.07667 – 0.09937 | 2 |
| 0.09937 – 0.1221 | 2 |
| 0.1221 – 0.1448 | 6 |
| 0.1448 – 0.1675 | 6 |
| 0.1675 – 0.1901 | 15 |
| 0.1901 – 0.2128 | 5 |
| 0.2128 – 0.2355 | 13 |
| 0.2355 – 0.2582 | 12 |
| 0.2582 – 0.2809 | 25 |
| 0.2809 – 0.3036 | 26 |
| 0.3036 – 0.3263 | 32 |
| 0.3263 – 0.349 | 32 |
| 0.349 – 0.3717 | 40 |
| 0.3717 – 0.3944 | 46 |
| 0.3944 – 0.4171 | 52 |
| 0.4171 – 0.4398 | 64 |
| 0.4398 – 0.4625 | 78 |
| 0.4625 – 0.4852 | 61 |
| 0.4852 – 0.5079 | 78 |
| 0.5079 – 0.5306 | 66 |
| 0.5306 – 0.5533 | 97 |
| 0.5533 – 0.576 | 126 |
| 0.576 – 0.5987 | 122 |
| 0.5987 – 0.6214 | 139 |
| 0.6214 – 0.6441 | 143 |
| 0.6441 – 0.6668 | 154 |
| 0.6668 – 0.6895 | 173 |
| 0.6895 – 0.7122 | 176 |
| 0.7122 – 0.7349 | 200 |
| 0.7349 – 0.7576 | 213 |
| 0.7576 – 0.7802 | 195 |
| 0.7802 – 0.8029 | 203 |
| 0.8029 – 0.8256 | 169 |
| 0.8256 – 0.8483 | 132 |
| 0.8483 – 0.871 | 103 |
| 0.871 – 0.8937 | 65 |
| 0.8937 – 0.9164 | 27 |
| 0.9164 – 0.9391 | 9 |
| 0.9391 – 0.9618 | 4 |
Histogram: margin_2020 (40 equal-width bins)
| bin | count |
|---|---|
| -0.8675 – -0.8226 | 1 |
| -0.8226 – -0.7776 | 2 |
| -0.7776 – -0.7326 | 3 |
| -0.7326 – -0.6877 | 5 |
| -0.6877 – -0.6427 | 8 |
| -0.6427 – -0.5978 | 12 |
| -0.5978 – -0.5528 | 6 |
| -0.5528 – -0.5078 | 12 |
| -0.5078 – -0.4629 | 14 |
| -0.4629 – -0.4179 | 22 |
| -0.4179 – -0.373 | 28 |
| -0.373 – -0.328 | 33 |
| -0.328 – -0.283 | 29 |
| -0.283 – -0.2381 | 46 |
| -0.2381 – -0.1931 | 41 |
| -0.1931 – -0.1482 | 54 |
| -0.1482 – -0.1032 | 63 |
| -0.1032 – -0.05823 | 67 |
| -0.05823 – -0.01327 | 69 |
| -0.01327 – 0.03169 | 73 |
| 0.03169 – 0.07665 | 70 |
| 0.07665 – 0.1216 | 90 |
| 0.1216 – 0.1666 | 131 |
| 0.1666 – 0.2115 | 117 |
| 0.2115 – 0.2565 | 129 |
| 0.2565 – 0.3015 | 141 |
| 0.3015 – 0.3464 | 159 |
| 0.3464 – 0.3914 | 165 |
| 0.3914 – 0.4363 | 181 |
| 0.4363 – 0.4813 | 206 |
| 0.4813 – 0.5263 | 195 |
| 0.5263 – 0.5712 | 197 |
| 0.5712 – 0.6162 | 213 |
| 0.6162 – 0.6611 | 175 |
| 0.6611 – 0.7061 | 140 |
| 0.7061 – 0.7511 | 102 |
| 0.7511 – 0.796 | 69 |
| 0.796 – 0.841 | 29 |
| 0.841 – 0.8859 | 11 |
| 0.8859 – 0.9309 | 4 |
Histogram: poverty_rate (40 equal-width bins)
| bin | count |
|---|---|
| 0 – 1.655 | 2 |
| 1.655 – 3.31 | 9 |
| 3.31 – 4.964 | 48 |
| 4.964 – 6.619 | 123 |
| 6.619 – 8.274 | 212 |
| 8.274 – 9.929 | 317 |
| 9.929 – 11.58 | 378 |
| 11.58 – 13.24 | 392 |
| 13.24 – 14.89 | 353 |
| 14.89 – 16.55 | 338 |
| 16.55 – 18.2 | 235 |
| 18.2 – 19.86 | 192 |
| 19.86 – 21.51 | 157 |
| 21.51 – 23.17 | 108 |
| 23.17 – 24.82 | 77 |
| 24.82 – 26.48 | 53 |
| 26.48 – 28.13 | 41 |
| 28.13 – 29.79 | 37 |
| 29.79 – 31.44 | 29 |
| 31.44 – 33.1 | 13 |
| 33.1 – 34.75 | 10 |
| 34.75 – 36.41 | 11 |
| 36.41 – 38.06 | 7 |
| 38.06 – 39.72 | 2 |
| 39.72 – 41.37 | 8 |
| 41.37 – 43.03 | 6 |
| 43.03 – 44.68 | 7 |
| 44.68 – 46.33 | 11 |
| 46.33 – 47.99 | 6 |
| 47.99 – 49.64 | 9 |
| 49.64 – 51.3 | 8 |
| 51.3 – 52.95 | 5 |
| 52.95 – 54.61 | 6 |
| 54.61 – 56.26 | 2 |
| 56.26 – 57.92 | 2 |
| 57.92 – 59.57 | 3 |
| 59.57 – 61.23 | 1 |
| 61.23 – 62.88 | 1 |
| 62.88 – 64.54 | 0 |
| 64.54 – 66.19 | 2 |
Histogram: pct_white (40 equal-width bins)
| bin | count |
|---|---|
| 3.29 – 5.708 | 2 |
| 5.708 – 8.125 | 1 |
| 8.125 – 10.54 | 4 |
| 10.54 – 12.96 | 2 |
| 12.96 – 15.38 | 10 |
| 15.38 – 17.8 | 6 |
| 17.8 – 20.21 | 4 |
| 20.21 – 22.63 | 7 |
| 22.63 – 25.05 | 12 |
| 25.05 – 27.47 | 12 |
| 27.47 – 29.89 | 7 |
| 29.89 – 32.3 | 4 |
| 32.3 – 34.72 | 10 |
| 34.72 – 37.14 | 13 |
| 37.14 – 39.56 | 24 |
| 39.56 – 41.97 | 18 |
| 41.97 – 44.39 | 24 |
| 44.39 – 46.81 | 30 |
| 46.81 – 49.23 | 24 |
| 49.23 – 51.64 | 34 |
| 51.64 – 54.06 | 33 |
| 54.06 – 56.48 | 37 |
| 56.48 – 58.9 | 58 |
| 58.9 – 61.32 | 43 |
| 61.32 – 63.73 | 69 |
| 63.73 – 66.15 | 72 |
| 66.15 – 68.57 | 77 |
| 68.57 – 70.99 | 76 |
| 70.99 – 73.4 | 88 |
| 73.4 – 75.82 | 100 |
| 75.82 – 78.24 | 98 |
| 78.24 – 80.66 | 143 |
| 80.66 – 83.08 | 132 |
| 83.08 – 85.49 | 163 |
| 85.49 – 87.91 | 191 |
| 87.91 – 90.33 | 246 |
| 90.33 – 92.75 | 302 |
| 92.75 – 95.16 | 453 |
| 95.16 – 97.58 | 525 |
| 97.58 – 100 | 67 |
Histogram: pct_black (40 equal-width bins)
| bin | count |
|---|---|
| 0 – 2.195 | 1568 |
| 2.195 – 4.39 | 402 |
| 4.39 – 6.584 | 218 |
| 6.584 – 8.779 | 153 |
| 8.779 – 10.97 | 112 |
| 10.97 – 13.17 | 86 |
| 13.17 – 15.36 | 70 |
| 15.36 – 17.56 | 54 |
| 17.56 – 19.75 | 46 |
| 19.75 – 21.95 | 48 |
| 21.95 – 24.14 | 36 |
| 24.14 – 26.34 | 42 |
| 26.34 – 28.53 | 36 |
| 28.53 – 30.73 | 41 |
| 30.73 – 32.92 | 34 |
| 32.92 – 35.12 | 32 |
| 35.12 – 37.31 | 28 |
| 37.31 – 39.51 | 19 |
| 39.51 – 41.7 | 25 |
| 41.7 – 43.9 | 25 |
| 43.9 – 46.09 | 18 |
| 46.09 – 48.28 | 17 |
| 48.28 – 50.48 | 14 |
| 50.48 – 52.67 | 9 |
| 52.67 – 54.87 | 13 |
| 54.87 – 57.06 | 11 |
| 57.06 – 59.26 | 13 |
| 59.26 – 61.45 | 7 |
| 61.45 – 63.65 | 8 |
| 63.65 – 65.84 | 4 |
| 65.84 – 68.04 | 1 |
| 68.04 – 70.23 | 6 |
| 70.23 – 72.43 | 8 |
| 72.43 – 74.62 | 5 |
| 74.62 – 76.82 | 2 |
| 76.82 – 79.01 | 5 |
| 79.01 – 81.21 | 1 |
| 81.21 – 83.4 | 1 |
| 83.4 – 85.6 | 1 |
| 85.6 – 87.79 | 2 |
Schema
20 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| NAME | text | 0.0% | 3,221 | near_unique |
| total_population | numeric | 0.0% | 3,160 | high_skew, outliers |
| black_population | numeric | 0.0% | 2,066 | high_skew, outliers |
| white_population | numeric | 0.0% | 3,143 | high_skew, outliers |
| hispanic_population | numeric | 0.0% | 2,331 | high_skew, outliers |
| state | numeric | 0.0% | 52 | |
| county | numeric | 0.0% | 326 | high_skew, outliers |
| FIPS | numeric | 0.0% | 3,221 | |
| pct_black | numeric | 0.0% | 3,128 | high_skew, outliers |
| pct_white | numeric | 0.0% | 3,218 | |
| pct_hispanic | numeric | 0.0% | 3,205 | high_skew, outliers |
| poverty_rate | numeric | 0.0% | 3,219 | high_skew |
| below_poverty_level | numeric | 0.0% | 2,824 | high_skew, outliers |
| median_household_income | numeric | 0.0% | 3,099 | high_skew, outliers |
| margin_2020 | numeric | 3.4% | 3,112 | |
| democratic_pct_2020 | numeric | 3.4% | 3,111 | |
| republican_pct_2020 | numeric | 3.4% | 3,111 | |
| margin_2016 | text | 2.6% | 2,554 | one_word, allcaps, short_text |
| democratic_pct_2016 | numeric | 2.6% | 3,111 | |
| republican_pct_2016 | numeric | 2.6% | 3,111 | |
NAME
text label near_unique
This column contains U.S. county names in a 'Name County, State' pattern (e.g., 'Jefferson County, Texas'): the token 'county,' appears in 3,007 of 3,221 rows, and U.S. state names dominate the top words. Every value is unique (3,221 distinct entries, 0 duplicates, 0 nulls), making this a natural identifier for county-level records. The mean string length of ~24 characters and mean word count of ~3.2 are consistent with that pattern. A vocabulary of only 1,983 words across 3,221 rows suggests structured, standardized naming rather than free text. Treatment: Use as a human-readable label or join key; normalize casing and strip the trailing state suffix if joining to external county tables.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 3,221
- len_min
- 16
- len_max
- 42
- len_mean
- 24.27
- len_median
- 24
- len_p95
- 31
- word_mean
- 3.243
- word_median
- 3
- n_empty
- 0
- n_duplicates
- 0
- duplicate_rate
- 0
- vocab_size
- 1,983
- readability_flesch_mean
- 7.581
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0
- allcaps_rate
- 0
- boilerplate_rate
- 0
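The treatment above (normalize casing, strip the state suffix) can be sketched in plain Python. This is a minimal illustration, not the profiler's method: the function names `split_county_name` and `join_key` are hypothetical, and the key format (lower case, trailing " county" removed) is an assumption about what an external county table might expect.

```python
def split_county_name(name):
    """Split 'Jefferson County, Texas' into ('Jefferson County', 'Texas')."""
    # Partition on the LAST comma so any comma inside the county part survives.
    county, _, state = name.rpartition(",")
    if not county:
        # No comma found: treat the whole string as the county part.
        return name.strip(), ""
    return county.strip(), state.strip()


def join_key(name):
    """Hypothetical join key: lower-cased county part without a trailing ' county'."""
    county, _ = split_county_name(name)
    key = county.lower()
    suffix = " county"
    if key.endswith(suffix):
        key = key[: -len(suffix)]
    return key


county_part, state_part = split_county_name("Jefferson County, Texas")
key = join_key("Jefferson County, Texas")
```

Names that do not end in "County" pass through `join_key` with only the casing changed, so the roughly 200 rows without the 'county,' token would need their own suffix rules.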
total_population
numeric feature high_skew outliers
This column is the total population of each county, ranging from 117 to 10,040,682. The distribution is severely right-skewed (skew = 13.67, kurtosis = 311.91): the median of 25,981 is only about a quarter of the mean of 102,398, indicating a long tail driven by a small number of very large population centers. The outlier rate of 13.7% (441 of 3,221 rows) is high and reflects large urban counties coexisting with many small rural ones in the same dataset. Treatment: Log-transform before regression or distance-based modelling to reduce skew and outlier influence.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 3,160
- min
- 117
- max
- 1.004e+07
- mean
- 1.024e+05
- median
- 25,981
- std
- 3.283e+05
- q1
- 11,125
- q3
- 66,969
- iqr
- 55,844
- skew
- 13.67
- kurtosis
- 311.9
- n_outliers
- 441
- outlier_rate
- 0.1369
- zero_rate
- 0
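The report does not state how n_outliers is computed. A common convention is Tukey's 1.5 × IQR rule, and the sketch below assumes that rule purely for illustration, applying it to the q1 and q3 values reported for total_population. The function names are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences under the k * IQR rule (k=1.5 is an assumption)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True when value lies outside the Tukey fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return value < lower or value > upper


# Quartiles reported above for total_population.
lower, upper = tukey_fences(11_125, 66_969)

# The reported maximum is far above the upper fence; the median is inside it.
max_flagged = is_outlier(10_040_682, 11_125, 66_969)
median_flagged = is_outlier(25_981, 11_125, 66_969)
```

Under this assumption every county above roughly 150,000 residents would be flagged, which shows why a high outlier rate here describes urban counties rather than bad data.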
black_population
numeric feature high_skew outliers
This column is the count of Black residents per county. The distribution is extremely right-skewed (skew = 10.46, kurtosis = 148.22), with a median of just 859 versus a mean of 12,913 and a maximum of 1,202,260, so a small number of populous urban counties dominate the tail. The 438 outliers (13.6% of rows) and a std of 54,951 against a median of 859 confirm that most counties have small counts while a few have very large ones; 2.8% of records are zero, likely rural or sparsely populated counties. Treatment: Log-transform (log1p) before modelling to compress the extreme right tail; for per-capita comparisons, normalise by total_population or use pct_black.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 2,066
- min
- 0
- max
- 1.202e+06
- mean
- 1.291e+04
- median
- 859
- std
- 5.495e+04
- q1
- 114
- q3
- 5,553
- iqr
- 5,439
- skew
- 10.46
- kurtosis
- 148.2
- n_outliers
- 438
- outlier_rate
- 0.136
- zero_rate
- 0.02825
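As a small illustration of the log1p treatment, the sketch below applies `math.log1p` to the five summary values reported for black_population. log1p is used instead of a plain logarithm because the column contains zeros, and log(0) is undefined. The variable names are illustrative only.

```python
import math

# Summary values reported above: min, q1, median, q3, max.
reported = [0, 114, 859, 5_553, 1_202_260]

# log1p(x) = log(1 + x), so a zero count maps to exactly 0.0.
transformed = [math.log1p(v) for v in reported]

# expm1 is the exact inverse, useful for reporting results on the original scale.
recovered_median = math.expm1(transformed[2])

# Ratio of max to median before and after the transform.
raw_ratio = reported[-1] / reported[2]
log_ratio = transformed[-1] / transformed[2]
```

The maximum is about 1,400 times the median on the raw scale but only about twice the median after the transform, which is the compression the treatment is after.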
white_population
numeric feature high_skew outliers
This column is the count of white residents per county, with 3,221 non-null records spanning a wide range from 58 to 4,795,186. The distribution is severely right-skewed (skew = 10.35, kurtosis = 175.65): the median is only 21,282 while the mean is 72,000, so most counties are small and a long tail of large urban counties dominates; 407 records (12.6%) are flagged as outliers. The near-unique value count (3,143 of 3,221) confirms this is a raw count feature, not a category or ID. Treatment: Log-transform (log1p) before modelling to reduce skew and compress the extreme outlier range.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 3,143
- min
- 58
- max
- 4.795e+06
- mean
- 7.2e+04
- median
- 21,282
- std
- 1.918e+05
- q1
- 8,855
- q3
- 56,553
- iqr
- 47,698
- skew
- 10.35
- kurtosis
- 175.7
- n_outliers
- 407
- outlier_rate
- 0.1264
- zero_rate
- 0
hispanic_population
numeric feature high_skew outliers
This column is the count of Hispanic residents per county across 3,221 records. The distribution is extremely right-skewed (skew = 22.75, kurtosis = 744.79), with a median of only 1,209 but a mean of 19,427 and a maximum of 4,851,344, so a small number of large urban counties dominate the distribution. 15.3% of records (492 rows) are flagged as outliers, and the middle half of counties falls between 377 and 5,875 while the std is 125,108, confirming that values concentrate at the low end with a long, heavy tail. Treatment: Log-transform (log1p) before modelling to reduce extreme skew; for per-capita comparisons, normalize by total_population or use pct_hispanic.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 2,331
- min
- 0
- max
- 4.851e+06
- mean
- 1.943e+04
- median
- 1,209
- std
- 1.251e+05
- q1
- 377
- q3
- 5,875
- iqr
- 5,498
- skew
- 22.75
- kurtosis
- 744.8
- n_outliers
- 492
- outlier_rate
- 0.1527
- zero_rate
- 0.004967
state
numeric foreign_key
This column is almost certainly a numeric FIPS state code: it has 52 distinct integer values ranging from 1 to 72, consistent with the 50 states plus the District of Columbia and Puerto Rico (72). The distribution is fairly flat (kurtosis of -0.63, skew of 0.16, IQR of 27 across a 1–72 range), with zero nulls and zero outliers, indicating a clean, fully populated categorical-as-integer field. Having 52 unique values rather than 50 or 51 means territorial codes are included, which may surprise analysts expecting only the 50 U.S. states. Treatment: Treat as a categorical nominal code; do not use the raw numeric value in regression. One-hot encode it, or left-join to a state reference table for geographic attributes.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 52
- min
- 1
- max
- 72
- mean
- 31.28
- median
- 30
- std
- 16.28
- q1
- 19
- q3
- 46
- iqr
- 27
- skew
- 0.157
- kurtosis
- -0.6261
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
county
numeric foreign_key high_skew outliers
This column is almost certainly the county portion of the FIPS code, not a continuous measure: only 326 unique values occur across 3,221 rows, so codes repeat across states and identify a county only in combination with state. The distribution is right-skewed (skew 2.87, kurtosis 11.64), with values ranging from 1 to 840 and 178 outliers (5.5%). This reflects how codes are assigned rather than any numeric magnitude: low codes occur in nearly every state, while high codes occur only in the few states that use them, which is also why the mean (102.85) sits well above the median (79.0). Treatment: Cast to categorical/string and combine with state (or use FIPS) as a geographic grouping key; do not use the raw numeric value in any regression or distance-based model.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 326
- min
- 1
- max
- 840
- mean
- 102.8
- median
- 79
- std
- 106.6
- q1
- 35
- q3
- 133
- iqr
- 98
- skew
- 2.868
- kurtosis
- 11.64
- n_outliers
- 178
- outlier_rate
- 0.05526
- zero_rate
- 0
FIPS
numeric identifier
This column contains U.S. county FIPS (Federal Information Processing Standards) codes. These are five-digit identifiers (a two-digit state code followed by a three-digit county code) that appear here as 4–5 digit numbers because the leading zero is lost in numeric storage. Every row has a distinct value (n_unique = 3,221, matching n exactly) with no nulls, confirming this is the primary identifier for the county records; the count is consistent with the counties and county-equivalents of the 50 states, DC, and Puerto Rico. The distribution is nearly uniform (skew of 0.157, kurtosis of -0.63), consistent with the sequential-but-gapped structure of FIPS codes across states. The range of 1001 to 72153 is correct for county FIPS codes (Alabama's first county to Puerto Rico's last). Treatment: Treat as a categorical geographic identifier, ideally stored as a zero-padded five-character string; do not use it numerically. Left-join to FIPS reference tables for geographic enrichment or spatial analysis.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 3,221
- min
- 1,001
- max
- 72,153
- mean
- 3.138e+04
- median
- 30,023
- std
- 1.63e+04
- q1
- 19,031
- q3
- 46,105
- iqr
- 27,074
- skew
- 0.1569
- kurtosis
- -0.6308
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
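Because the code is stored as a number, restoring the leading zero is usually the first step before joining to reference tables. The sketch below is a minimal illustration with hypothetical helper names. It also assumes that the state and county columns are the two components of FIPS, which the reported summary statistics are consistent with but which the report itself does not confirm.

```python
def format_fips(code):
    """Render a numeric county FIPS code as a zero-padded five-character string."""
    return f"{code:05d}"


def split_fips(code):
    """Split a county FIPS code into (state part, county part) as strings."""
    text = format_fips(code)
    return text[:2], text[2:]


def compose_fips(state, county):
    """Assumed relationship: FIPS = state * 1000 + county."""
    return state * 1000 + county


first = format_fips(1001)        # reported minimum of the FIPS column
last_parts = split_fips(72153)   # reported maximum of the FIPS column
rebuilt = compose_fips(72, 153)
```

If the assumed relationship holds, `compose_fips(state, county) == FIPS` should be true for every row, which makes a cheap consistency check on the three columns.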
pct_black
numeric feature high_skew outliers
This column is the percentage of Black residents per county, with 3,221 rows and no nulls. The distribution is heavily right-skewed (skew = 2.33, kurtosis = 5.45): the median is just 2.38% while the mean is pulled up to 9.08%, and 422 rows (13.1%) are flagged as outliers, reaching up to 87.79%. The middle half of counties falls between 0.69% and 10.21%, so Black residents are a small minority in most counties, with a long tail of majority-Black counties. Treatment: Apply a log1p or quantile transformation before regression to address the right skew and outlier influence.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 3,128
- min
- 0
- max
- 87.79
- mean
- 9.085
- median
- 2.383
- std
- 14.5
- q1
- 0.6919
- q3
- 10.21
- iqr
- 9.513
- skew
- 2.326
- kurtosis
- 5.451
- n_outliers
- 422
- outlier_rate
- 0.131
- zero_rate
- 0.02825
pct_white
numeric feature
This column is the percentage of white residents per county, ranging from 3.29% to 100% across 3,221 records. The distribution is left-skewed (skew = -1.56) with a mean of 81.2% and a median of 87.7%, indicating that majority-white counties dominate the dataset. The gap between mean and median signals a long lower tail of more diverse counties, and the 145 outliers (4.5%) likely represent highly diverse counties pulling the distribution downward. Near-perfect uniqueness (3,218 of 3,221 values) confirms this is a continuous ratio measure, not a binned or rounded variable. Treatment: Use as-is or apply a reflection-log transform to address left skew before regression; consider interactions with other demographic features.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 3,218
- min
- 3.29
- max
- 100
- mean
- 81.2
- median
- 87.66
- std
- 17.35
- q1
- 73.62
- q3
- 93.99
- iqr
- 20.37
- skew
- -1.562
- kurtosis
- 2.301
- n_outliers
- 145
- outlier_rate
- 0.04502
- zero_rate
- 0
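A reflection-log transform turns a left-skewed variable into a right-skewed one by subtracting it from its upper bound, then compresses the new right tail with a logarithm. The sketch below is one common form, assuming 100 as the bound since the column is a percentage; the function name is hypothetical.

```python
import math


def reflect_log(value, upper=100.0):
    """Reflection-log transform: log1p(upper - value).

    Note that this reverses the ordering, so larger pct_white values
    map to SMALLER transformed values.
    """
    if value > upper:
        raise ValueError("value exceeds the assumed upper bound")
    return math.log1p(upper - value)


# Reported summary values for pct_white: min, median, max.
at_min = reflect_log(3.29)
at_median = reflect_log(87.66)
at_max = reflect_log(100.0)
```

Because the ordering is reversed, coefficient signs in a downstream model flip as well; multiplying the result by -1 restores the original direction if that matters for interpretation.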
pct_hispanic
numeric feature high_skew outliers
This column is the percentage of Hispanic residents per county, ranging from 0% to a reported maximum of 100%. The distribution is severely right-skewed (skew = 3.11, kurtosis = 9.89): the median is only 4.52% while the mean is 11.74%, so most counties have low Hispanic shares and a long tail of high-concentration counties drives the average up. About 13% of rows (420 of 3,221) are flagged as outliers, consistent with counties of heavy Hispanic concentration. The column has no nulls, and only 0.5% of values are exactly zero. Treatment: Apply a log1p or square-root transformation before regression/modelling to reduce skew and diminish outlier leverage; a plain log is undefined for the zero values.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 3,205
- min
- 0
- max
- 100
- mean
- 11.74
- median
- 4.516
- std
- 19.4
- q1
- 2.363
- q3
- 10.66
- iqr
- 8.294
- skew
- 3.113
- kurtosis
- 9.888
- n_outliers
- 420
- outlier_rate
- 0.1304
- zero_rate
- 0.004967
poverty_rate
numeric feature high_skew
This column is the poverty rate (percentage) for each of the 3,221 counties, with complete coverage (0 nulls) and near-unique values (3,219 distinct). The distribution is right-skewed (skew 2.11, kurtosis 6.92), with a median of 13.8% and a mean pulled up to 15.4% by a long upper tail reaching 66.2%. The 143 outliers (4.4% of records) form this tail: a minority of counties with extremely high poverty concentration that will disproportionately influence linear models. Treatment: Apply a log1p or square-root transform to reduce right skew before regression (the minimum is 0, so a plain log is undefined there); investigate the 143 outlier counties separately for data quality or structural differences.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 3,219
- min
- 0
- max
- 66.19
- mean
- 15.38
- median
- 13.81
- std
- 7.97
- q1
- 10.34
- q3
- 18.25
- iqr
- 7.91
- skew
- 2.111
- kurtosis
- 6.922
- n_outliers
- 143
- outlier_rate
- 0.0444
- zero_rate
- 0.0003105
below_poverty_level
numeric feature high_skew outliers
This column is the count of people living below the poverty level in each county. The distribution is extremely right-skewed (skew = 15.1, kurtosis = 360.7): the median is 3,831 but the mean is 13,136, and the maximum reaches 1,401,656, almost certainly a very populous urban county pulling the tail hard. With 351 outliers (~10.9% of rows) and a standard deviation of 44,284 against a median of 3,831, a small number of high-population counties dominate the raw counts. Treatment: Log-transform (log1p) before regression or clustering; for comparable cross-county modelling, normalize by total_population or use the existing poverty_rate column.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 2,824
- min
- 0
- max
- 1.402e+06
- mean
- 1.314e+04
- median
- 3,831
- std
- 4.428e+04
- q1
- 1,547
- q3
- 9,937
- iqr
- 8,390
- skew
- 15.11
- kurtosis
- 360.7
- n_outliers
- 351
- outlier_rate
- 0.109
- zero_rate
- 0.0003105
median_household_income
numeric feature high_skew outliers
This column is the median household income per county, likely sourced from census data. The median of 52,380 and IQR of 16,300 look plausible for household income, but the column is compromised by a sentinel/error value: a minimum of -666,666,666 drags the mean to -152,820 and produces a kurtosis of 3,216 and a skew of -56.73. The figures are consistent with a single affected row: one value of -666,666,666 among 3,221 rows shifts the mean by about -207,000 and yields a std near 11.7 million, matching the reported std of 11,747,597. The value also matches the annotation code that U.S. Census Bureau American Community Survey tables use for an estimate that could not be computed, so it is almost certainly a coded null-substitute rather than a real income. The 182 flagged outliers (5.65% of rows) are therefore mostly ordinary high-income counties. Treatment: Replace -666666666 and any other negative values with NaN, investigate remaining outliers above q3, then consider a log-transform after cleaning and before modelling.
- n
- 3,221
- nulls
- 0 (0.0%)
- unique
- 3,099
- min
- -6.667e+08
- max
- 147,111
- mean
- -1.528e+05
- median
- 52,380
- std
- 1.175e+07
- q1
- 44,939
- q3
- 61,239
- iqr
- 16,300
- skew
- -56.73
- kurtosis
- 3216
- n_outliers
- 182
- outlier_rate
- 0.0565
- zero_rate
- 0
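The cleaning step can be sketched without any third-party library. The toy rows below are made up for illustration (they reuse the reported quartiles plus one sentinel) and are not the actual data; `clean_income` is a hypothetical helper name.

```python
import statistics

SENTINEL = -666_666_666  # minimum value reported for median_household_income


def clean_income(values):
    """Replace the sentinel and any other negative value with None (missing)."""
    return [None if (v is None or v < 0) else v for v in values]


# Toy rows: the reported q1, median, q3, and one sentinel row.
raw = [44_939, 52_380, 61_239, SENTINEL]
cleaned = clean_income(raw)

# Summary statistics must skip the missing entries explicitly.
valid = [v for v in cleaned if v is not None]
mean_before = statistics.mean(raw)
mean_after = statistics.mean(valid)
median_after = statistics.median(valid)
```

Even in this four-row toy example the single sentinel makes the raw mean hugely negative, while the cleaned mean lands back inside the plausible income range.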
margin_2020
numeric feature
This column is the 2020 presidential election margin per county, expressed as a proportion (roughly −0.87 to +0.93). The summary statistics indicate it is the Republican share minus the Democratic share: the mean of 0.317 equals the republican_pct_2020 mean (0.650) minus the democratic_pct_2020 mean (0.333). The distribution is moderately left-skewed (skew −0.82), with the mean of 0.317 sitting noticeably below the median of 0.384 because a tail of strongly negative values pulls the average down. Negative values (minimum −0.868) are meaningful: under this sign convention they mark counties carried by the Democratic candidate. The 48 outliers (1.54%) sit at the distributional extremes. The null rate of 3.38% (109 rows) is modest but worth investigating for systematic missingness; Puerto Rico's municipios, which cast no presidential votes, are an obvious candidate. Treatment: Use as-is for modelling; investigate left-tail outliers and whether nulls are structurally missing before imputation.
- n
- 3,221
- nulls
- 109 (3.4%)
- unique
- 3,112
- min
- -0.8675
- max
- 0.9309
- mean
- 0.317
- median
- 0.3844
- std
- 0.321
- q1
- 0.1348
- q3
- 0.5662
- iqr
- 0.4314
- skew
- -0.8212
- kurtosis
- 0.2286
- n_outliers
- 48
- outlier_rate
- 0.01542
- zero_rate
- 0
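The sign convention can be checked against the reported means. The sketch below assumes that the margin is the Republican share minus the Democratic share, an inference from the summary statistics rather than something the report states; the function name `margin` is hypothetical.

```python
def margin(republican_pct, democratic_pct):
    """Assumed definition: Republican share minus Democratic share.

    Returns None when either input is missing, mirroring the shared nulls
    in the three 2020 columns.
    """
    if republican_pct is None or democratic_pct is None:
        return None
    return republican_pct - democratic_pct


# Means reported for republican_pct_2020 and democratic_pct_2020.
mean_margin = margin(0.6497, 0.3327)

# A county where the Democratic candidate led gets a negative margin.
blue_county = margin(0.20, 0.70)
missing = margin(None, 0.30)
```

The difference of the two reported means reproduces the reported margin_2020 mean of 0.317, which is what the assumed definition predicts when all three columns are null on the same rows.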
democratic_pct_2020
numeric feature
This column is the Democratic vote share (as a proportion, 0–1) in the 2020 presidential election for each county. The distribution is right-skewed (skew = 0.83) with a mean of 0.333 and a median of 0.300, indicating that most counties lean Republican: the typical county gave Democrats roughly 30% of the vote. The range spans 0.031 to 0.921, capturing both deep-red and deep-blue counties, with only 49 outliers (1.57%). The null rate of 3.38% (109 rows) is modest but not negligible and matches the other 2020 columns. Treatment: Use as-is or apply a logit transform to stretch the bounded 0–1 proportion before regression or clustering.
- n
- 3,221
- nulls
- 109 (3.4%)
- unique
- 3,111
- min
- 0.03091
- max
- 0.9215
- mean
- 0.3327
- median
- 0.2998
- std
- 0.1598
- q1
- 0.2091
- q3
- 0.4236
- iqr
- 0.2145
- skew
- 0.8326
- kurtosis
- 0.2523
- n_outliers
- 49
- outlier_rate
- 0.01575
- zero_rate
- 0
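The logit transform mentioned in the treatment maps a proportion in (0, 1) onto the whole real line. The sketch below is a minimal version; the clipping constant `eps` is an assumption added to guard against exact 0 or 1, which do not occur in the reported ranges but would otherwise raise an error.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a proportion, with clipping to keep the result finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of logit: maps a real number back to a proportion."""
    return 1.0 / (1.0 + math.exp(-z))


# Reported median of democratic_pct_2020.
z_median = logit(0.2998)
round_trip = inv_logit(z_median)
even_split = logit(0.5)
```

A share below 0.5 maps to a negative value and an even split maps to zero, so the transformed variable is centered on a tied vote.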
republican_pct_2020
numeric feature
This column is the Republican vote share (as a proportion, 0–1) in the 2020 presidential election for each county. The mean of 0.650 and median of 0.683 show that most counties recorded Republican majorities, which is expected in county-level data where rural counties outnumber urban ones by count. The distribution is left-skewed (skew = −0.809), meaning a tail of strongly Democratic counties pulls the mean below the median, while an excess kurtosis near zero (0.206) and only 47 outliers suggest no extreme concentration in the tails. The null rate of 3.38% (109 rows) warrants investigation to confirm whether missing values reflect unreported results or data gaps. Treatment: Use as-is for modeling after imputing or flagging the 3.38% nulls; consider a logit transform if used as a continuous predictor in a linear model.
- n
- 3,221
- nulls
- 109 (3.4%)
- unique
- 3,111
- min
- 0.05397
- max
- 0.9618
- mean
- 0.6497
- median
- 0.6829
- std
- 0.1613
- q1
- 0.5576
- q3
- 0.7747
- iqr
- 0.2171
- skew
- -0.8091
- kurtosis
- 0.2063
- n_outliers
- 47
- outlier_rate
- 0.0151
- zero_rate
- 0
margin_2016
text feature one_word allcaps short_text
This column stores the 2016 election margin as a percentage string (e.g., '15.17%') rather than as a numeric type. All 3,138 non-null values are single tokens of 5–6 characters, confirming a uniform percentage format; the allcaps flag is an artifact, since the strings contain no letters. The scale also differs from margin_2020, which is a proportion rather than a percentage. Surprisingly, '15.17%' appears 29 times, far more than any other value, suggesting it may be a default, imputed, or boundary value worth investigating. The duplicate rate of 18.6% (584 duplicates across 2,554 unique values) is notable for what should otherwise be a near-continuous measure, although rounding to two decimal places is expected to produce many repeats on its own. Treatment: Strip the '%' suffix, cast to float, and divide by 100 to match the scale of margin_2020; investigate the 29 occurrences of '15.17%' for data quality issues before modelling.
- n
- 3,221
- nulls
- 83 (2.6%)
- unique
- 2,554
- len_min
- 5
- len_max
- 6
- len_mean
- 5.896
- len_median
- 6
- len_p95
- 6
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 584
- duplicate_rate
- 0.1861
- vocab_size
- 2,554
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
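The conversion described in the treatment can be sketched as a small parser. The function name `parse_percent` is hypothetical, and dividing by 100 reflects the assumption that the result should share the proportion scale of margin_2020.

```python
def parse_percent(text):
    """Convert a string such as '15.17%' to a proportion (0.1517).

    Missing or empty input returns None so that nulls stay nulls.
    """
    if text is None:
        return None
    cleaned = text.strip()
    if not cleaned:
        return None
    if cleaned.endswith("%"):
        cleaned = cleaned[:-1]
    return float(cleaned) / 100.0


values = ["15.17%", " 5.00% ", None]
parsed = [parse_percent(v) for v in values]
```

Whether the strings carry a sign is not visible from the profile, so a comparison with margin_2020 should first confirm that both columns use the same sign convention.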
democratic_pct_2016
numeric feature
This column is the Democratic vote share (as a proportion, 0–1) in the 2016 U.S. presidential election for each county. The distribution is right-skewed (skew = 0.94) with a mean of 0.317 and a median of 0.286, indicating that most counties lean Republican, with a long tail of heavily Democratic counties reaching up to 0.928. The spread is moderate (IQR = 0.193, std = 0.153), and 75 outliers exist on the high end, likely dense urban counties. There are 83 nulls (2.6%), matching the other 2016 columns. Treatment: Use as-is or apply a logit transform to unbound the [0,1] proportion before linear modelling.
- n
- 3,221
- nulls
- 83 (2.6%)
- unique
- 3,111
- min
- 0.03145
- max
- 0.9285
- mean
- 0.3174
- median
- 0.2861
- std
- 0.1527
- q1
- 0.2054
- q3
- 0.3982
- iqr
- 0.1928
- skew
- 0.9371
- kurtosis
- 0.666
- n_outliers
- 75
- outlier_rate
- 0.0239
- zero_rate
- 0
republican_pct_2016
numeric feature
This column is the Republican vote share (as a proportion, 0–1) in the 2016 U.S. presidential election for each county. The distribution is left-skewed (skew = -0.81) with a median of 0.666 and a mean of 0.635, indicating that most counties leaned heavily Republican in 2016, which is consistent with county-level data where Republicans dominate by count even if not by population. The range spans 0.041 to 0.953, from overwhelmingly Democratic to overwhelmingly Republican counties, with only 62 outliers (1.98%) and few nulls (83 rows, 2.58%). Treatment: Use directly as a continuous feature; consider pairing with the Democratic equivalent or computing a two-party margin; the mild left skew does not require transformation for most models.
- n
- 3,221
- nulls
- 83 (2.6%)
- unique
- 3,111
- min
- 0.04122
- max
- 0.9527
- mean
- 0.6354
- median
- 0.6656
- std
- 0.1559
- q1
- 0.5463
- q3
- 0.7503
- iqr
- 0.2041
- skew
- -0.8145
- kurtosis
- 0.3566
- n_outliers
- 62
- outlier_rate
- 0.01976
- zero_rate
- 0