scars master dataset
Reading
This is a county-level US dataset of 3,221 rows and 20 columns combining demographics (population by race, poverty, income), 2016 and 2020 presidential vote shares, and geographic identifiers (FIPS, state, county). Two data-quality issues should be addressed first. First, median_household_income contains a sentinel value of -666,666,666 that sets its minimum and drags its mean negative. Second, margin_2016 is stored as text percentages (e.g. '15.17%') while margin_2020 is a numeric fraction between -1 and 1, so the two election cycles aren't directly comparable without cleaning. The remaining political columns are well-formed and show a Republican-leaning distribution across counties (mean republican_pct_2020 ≈ 0.65 vs democratic_pct_2020 ≈ 0.33); these are unweighted county means, not shares of the national vote. Population and demographic counts are heavily right-skewed with many outliers, as expected when mixing rural counties with metros of up to ~10M people, so log scales or population shares (already provided as pct_white, pct_black, pct_hispanic) will be more informative than raw counts.
citing: median_household_income · margin_2016 · margin_2020 · republican_pct_2020 · democratic_pct_2020 · total_population · pct_white · pct_black · pct_hispanic · poverty_rate
Charts the summary said to look at first
Histogram of republican_pct_2020
| bin | count |
|---|---|
| 0.05397 – 0.07667 | 1 |
| 0.07667 – 0.09937 | 2 |
| 0.09937 – 0.1221 | 2 |
| 0.1221 – 0.1448 | 6 |
| 0.1448 – 0.1675 | 6 |
| 0.1675 – 0.1901 | 15 |
| 0.1901 – 0.2128 | 5 |
| 0.2128 – 0.2355 | 13 |
| 0.2355 – 0.2582 | 12 |
| 0.2582 – 0.2809 | 25 |
| 0.2809 – 0.3036 | 26 |
| 0.3036 – 0.3263 | 32 |
| 0.3263 – 0.349 | 32 |
| 0.349 – 0.3717 | 40 |
| 0.3717 – 0.3944 | 46 |
| 0.3944 – 0.4171 | 52 |
| 0.4171 – 0.4398 | 64 |
| 0.4398 – 0.4625 | 78 |
| 0.4625 – 0.4852 | 61 |
| 0.4852 – 0.5079 | 78 |
| 0.5079 – 0.5306 | 66 |
| 0.5306 – 0.5533 | 97 |
| 0.5533 – 0.576 | 126 |
| 0.576 – 0.5987 | 122 |
| 0.5987 – 0.6214 | 139 |
| 0.6214 – 0.6441 | 143 |
| 0.6441 – 0.6668 | 154 |
| 0.6668 – 0.6895 | 173 |
| 0.6895 – 0.7122 | 176 |
| 0.7122 – 0.7349 | 200 |
| 0.7349 – 0.7576 | 213 |
| 0.7576 – 0.7802 | 195 |
| 0.7802 – 0.8029 | 203 |
| 0.8029 – 0.8256 | 169 |
| 0.8256 – 0.8483 | 132 |
| 0.8483 – 0.871 | 103 |
| 0.871 – 0.8937 | 65 |
| 0.8937 – 0.9164 | 27 |
| 0.9164 – 0.9391 | 9 |
| 0.9391 – 0.9618 | 4 |
Histogram of margin_2020
| bin | count |
|---|---|
| -0.8675 – -0.8226 | 1 |
| -0.8226 – -0.7776 | 2 |
| -0.7776 – -0.7326 | 3 |
| -0.7326 – -0.6877 | 5 |
| -0.6877 – -0.6427 | 8 |
| -0.6427 – -0.5978 | 12 |
| -0.5978 – -0.5528 | 6 |
| -0.5528 – -0.5078 | 12 |
| -0.5078 – -0.4629 | 14 |
| -0.4629 – -0.4179 | 22 |
| -0.4179 – -0.373 | 28 |
| -0.373 – -0.328 | 33 |
| -0.328 – -0.283 | 29 |
| -0.283 – -0.2381 | 46 |
| -0.2381 – -0.1931 | 41 |
| -0.1931 – -0.1482 | 54 |
| -0.1482 – -0.1032 | 63 |
| -0.1032 – -0.05823 | 67 |
| -0.05823 – -0.01327 | 69 |
| -0.01327 – 0.03169 | 73 |
| 0.03169 – 0.07665 | 70 |
| 0.07665 – 0.1216 | 90 |
| 0.1216 – 0.1666 | 131 |
| 0.1666 – 0.2115 | 117 |
| 0.2115 – 0.2565 | 129 |
| 0.2565 – 0.3015 | 141 |
| 0.3015 – 0.3464 | 159 |
| 0.3464 – 0.3914 | 165 |
| 0.3914 – 0.4363 | 181 |
| 0.4363 – 0.4813 | 206 |
| 0.4813 – 0.5263 | 195 |
| 0.5263 – 0.5712 | 197 |
| 0.5712 – 0.6162 | 213 |
| 0.6162 – 0.6611 | 175 |
| 0.6611 – 0.7061 | 140 |
| 0.7061 – 0.7511 | 102 |
| 0.7511 – 0.796 | 69 |
| 0.796 – 0.841 | 29 |
| 0.841 – 0.8859 | 11 |
| 0.8859 – 0.9309 | 4 |
Histogram of pct_white
| bin | count |
|---|---|
| 3.29 – 5.708 | 2 |
| 5.708 – 8.125 | 1 |
| 8.125 – 10.54 | 4 |
| 10.54 – 12.96 | 2 |
| 12.96 – 15.38 | 10 |
| 15.38 – 17.8 | 6 |
| 17.8 – 20.21 | 4 |
| 20.21 – 22.63 | 7 |
| 22.63 – 25.05 | 12 |
| 25.05 – 27.47 | 12 |
| 27.47 – 29.89 | 7 |
| 29.89 – 32.3 | 4 |
| 32.3 – 34.72 | 10 |
| 34.72 – 37.14 | 13 |
| 37.14 – 39.56 | 24 |
| 39.56 – 41.97 | 18 |
| 41.97 – 44.39 | 24 |
| 44.39 – 46.81 | 30 |
| 46.81 – 49.23 | 24 |
| 49.23 – 51.64 | 34 |
| 51.64 – 54.06 | 33 |
| 54.06 – 56.48 | 37 |
| 56.48 – 58.9 | 58 |
| 58.9 – 61.32 | 43 |
| 61.32 – 63.73 | 69 |
| 63.73 – 66.15 | 72 |
| 66.15 – 68.57 | 77 |
| 68.57 – 70.99 | 76 |
| 70.99 – 73.4 | 88 |
| 73.4 – 75.82 | 100 |
| 75.82 – 78.24 | 98 |
| 78.24 – 80.66 | 143 |
| 80.66 – 83.08 | 132 |
| 83.08 – 85.49 | 163 |
| 85.49 – 87.91 | 191 |
| 87.91 – 90.33 | 246 |
| 90.33 – 92.75 | 302 |
| 92.75 – 95.16 | 453 |
| 95.16 – 97.58 | 525 |
| 97.58 – 100 | 67 |
Histogram of poverty_rate
| bin | count |
|---|---|
| 0 – 1.655 | 2 |
| 1.655 – 3.31 | 9 |
| 3.31 – 4.964 | 48 |
| 4.964 – 6.619 | 123 |
| 6.619 – 8.274 | 212 |
| 8.274 – 9.929 | 317 |
| 9.929 – 11.58 | 378 |
| 11.58 – 13.24 | 392 |
| 13.24 – 14.89 | 353 |
| 14.89 – 16.55 | 338 |
| 16.55 – 18.2 | 235 |
| 18.2 – 19.86 | 192 |
| 19.86 – 21.51 | 157 |
| 21.51 – 23.17 | 108 |
| 23.17 – 24.82 | 77 |
| 24.82 – 26.48 | 53 |
| 26.48 – 28.13 | 41 |
| 28.13 – 29.79 | 37 |
| 29.79 – 31.44 | 29 |
| 31.44 – 33.1 | 13 |
| 33.1 – 34.75 | 10 |
| 34.75 – 36.41 | 11 |
| 36.41 – 38.06 | 7 |
| 38.06 – 39.72 | 2 |
| 39.72 – 41.37 | 8 |
| 41.37 – 43.03 | 6 |
| 43.03 – 44.68 | 7 |
| 44.68 – 46.33 | 11 |
| 46.33 – 47.99 | 6 |
| 47.99 – 49.64 | 9 |
| 49.64 – 51.3 | 8 |
| 51.3 – 52.95 | 5 |
| 52.95 – 54.61 | 6 |
| 54.61 – 56.26 | 2 |
| 56.26 – 57.92 | 2 |
| 57.92 – 59.57 | 3 |
| 59.57 – 61.23 | 1 |
| 61.23 – 62.88 | 1 |
| 62.88 – 64.54 | 0 |
| 64.54 – 66.19 | 2 |
Histogram of median_household_income
| bin | count |
|---|---|
| -6.667e+08 – -6.5e+08 | 1 |
| -6.5e+08 – -6.333e+08 | 0 |
| -6.333e+08 – -6.167e+08 | 0 |
| -6.167e+08 – -6e+08 | 0 |
| -6e+08 – -5.833e+08 | 0 |
| -5.833e+08 – -5.666e+08 | 0 |
| -5.666e+08 – -5.5e+08 | 0 |
| -5.5e+08 – -5.333e+08 | 0 |
| -5.333e+08 – -5.166e+08 | 0 |
| -5.166e+08 – -5e+08 | 0 |
| -5e+08 – -4.833e+08 | 0 |
| -4.833e+08 – -4.666e+08 | 0 |
| -4.666e+08 – -4.5e+08 | 0 |
| -4.5e+08 – -4.333e+08 | 0 |
| -4.333e+08 – -4.166e+08 | 0 |
| -4.166e+08 – -3.999e+08 | 0 |
| -3.999e+08 – -3.833e+08 | 0 |
| -3.833e+08 – -3.666e+08 | 0 |
| -3.666e+08 – -3.499e+08 | 0 |
| -3.499e+08 – -3.333e+08 | 0 |
| -3.333e+08 – -3.166e+08 | 0 |
| -3.166e+08 – -2.999e+08 | 0 |
| -2.999e+08 – -2.832e+08 | 0 |
| -2.832e+08 – -2.666e+08 | 0 |
| -2.666e+08 – -2.499e+08 | 0 |
| -2.499e+08 – -2.332e+08 | 0 |
| -2.332e+08 – -2.166e+08 | 0 |
| -2.166e+08 – -1.999e+08 | 0 |
| -1.999e+08 – -1.832e+08 | 0 |
| -1.832e+08 – -1.666e+08 | 0 |
| -1.666e+08 – -1.499e+08 | 0 |
| -1.499e+08 – -1.332e+08 | 0 |
| -1.332e+08 – -1.165e+08 | 0 |
| -1.165e+08 – -9.987e+07 | 0 |
| -9.987e+07 – -8.32e+07 | 0 |
| -8.32e+07 – -6.653e+07 | 0 |
| -6.653e+07 – -4.986e+07 | 0 |
| -4.986e+07 – -3.319e+07 | 0 |
| -3.319e+07 – -1.652e+07 | 0 |
| -1.652e+07 – 1.471e+05 | 3220 |
Schema
20 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| NAME | text | 0.0% | 3,221 | near_unique |
| total_population | numeric | 0.0% | 3,160 | high_skew, outliers |
| black_population | numeric | 0.0% | 2,066 | high_skew, outliers |
| white_population | numeric | 0.0% | 3,143 | high_skew, outliers |
| hispanic_population | numeric | 0.0% | 2,331 | high_skew, outliers |
| state | numeric | 0.0% | 52 | |
| county | numeric | 0.0% | 326 | high_skew, outliers |
| FIPS | numeric | 0.0% | 3,221 | |
| pct_black | numeric | 0.0% | 3,128 | high_skew, outliers |
| pct_white | numeric | 0.0% | 3,218 | |
| pct_hispanic | numeric | 0.0% | 3,205 | high_skew, outliers |
| poverty_rate | numeric | 0.0% | 3,219 | high_skew |
| below_poverty_level | numeric | 0.0% | 2,824 | high_skew, outliers |
| median_household_income | numeric | 0.0% | 3,099 | high_skew, outliers |
| margin_2020 | numeric | 3.4% | 3,112 | |
| democratic_pct_2020 | numeric | 3.4% | 3,111 | |
| republican_pct_2020 | numeric | 3.4% | 3,111 | |
| margin_2016 | text | 2.6% | 2,554 | one_word, allcaps, short_text |
| democratic_pct_2016 | numeric | 2.6% | 3,111 | |
| republican_pct_2016 | numeric | 2.6% | 3,111 | |
NAME
text identifier near_unique
This column holds US county names with state qualifiers: 'county,' appears in 3,007 of 3,221 rows, followed by state tokens such as Texas (256), Virginia (189), and Georgia (159). Every value is unique (n_unique = 3221, duplicate_rate = 0.0) with no nulls, and lengths cluster tightly around 24 characters (min 16, max 42), consistent with a canonical 'X County, State' format with the state name spelled out. The near_unique alert confirms this behaves as an identifier rather than a categorical feature.
Treatment: Use as a join key to county-level reference tables; do not feed as a categorical feature.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 3,221
- len_min: 16
- len_max: 42
- len_mean: 24.27
- len_median: 24
- len_p95: 31
- word_mean: 3.243
- word_median: 3
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 1,983
- readability_flesch_mean: 7.581
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 0
- boilerplate_rate: 0
total_population
numeric feature high_skew outliers
County-level total population count (3,221 rows, no nulls), ranging from 117 to 10,040,682. The distribution is severely right-skewed (skew 13.67, kurtosis 311.9) with the mean (102,398) nearly four times the median (25,981) and 441 outliers (13.7%). A few very populous counties dominate while most are small.
Treatment: Log-transform before modelling to tame the heavy right tail.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 3,160
- min: 117
- max: 1.004e+07
- mean: 1.024e+05
- median: 25,981
- std: 3.283e+05
- q1: 11,125
- q3: 66,969
- iqr: 55,844
- skew: 13.67
- kurtosis: 311.9
- n_outliers: 441
- outlier_rate: 0.1369
- zero_rate: 0
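The log transform suggested above can be sketched as follows. This is a minimal illustration in plain Python, not the profiler's own code: the helper name `log10_1p` is hypothetical, and using log10(1 + x) instead of a bare logarithm is an assumption made so the same helper also works for count columns that contain zeros.

```python
import math

def log10_1p(values):
    """Return log10(1 + v) for each non-negative count (hypothetical helper)."""
    return [math.log10(1 + v) for v in values]

# Minimum, median, and maximum of total_population as reported above.
counts = [117, 25_981, 10_040_682]
logged = log10_1p(counts)

# Raw counts span about five orders of magnitude; logged values span about five units.
print([round(x, 2) for x in logged])
```

On the logged scale the largest county no longer dwarfs the rest, which is the point of the transform.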
black_population
numeric feature high_skew outliers
Count of Black residents per county, ranging from 0 to 1,202,260 with a median of just 859. The distribution is extremely right-skewed (skew 10.46, kurtosis 148.2) with 13.6% flagged as outliers and a std (54,952) over four times the mean (12,914), reflecting a few major metros dominating a long tail of small counties. About 2.8% of rows are zero and there are no nulls.
Treatment: Log-transform (log1p) before modelling or normalise as a share of total population.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 2,066
- min: 0
- max: 1.202e+06
- mean: 1.291e+04
- median: 859
- std: 5.495e+04
- q1: 114
- q3: 5,553
- iqr: 5,439
- skew: 10.46
- kurtosis: 148.2
- n_outliers: 438
- outlier_rate: 0.136
- zero_rate: 0.02825
white_population
numeric feature high_skew outliers
Count of white residents per county, ranging from 58 to 4,795,186 with a median of 21,282 but a mean of about 72,000. The distribution is extremely right-skewed (skew 10.35, kurtosis 175.65) with 407 outliers (12.6%), reflecting a few very populous counties dwarfing the rest. There are no nulls or zeros, and values are near-unique (3,143 of 3,221).
Treatment: Log-transform before regression to tame the heavy right skew.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 3,143
- min: 58
- max: 4.795e+06
- mean: 7.2e+04
- median: 21,282
- std: 1.918e+05
- q1: 8,855
- q3: 56,553
- iqr: 47,698
- skew: 10.35
- kurtosis: 175.7
- n_outliers: 407
- outlier_rate: 0.1264
- zero_rate: 0
hispanic_population
numeric feature high_skew outliers
Count of Hispanic residents per county, ranging from 0 to 4,851,344 with a median of just 1,209. The distribution is extraordinarily right-skewed (skew 22.75, kurtosis 744.79) and 15.3% of rows are flagged as outliers, indicating that a handful of very large counties dwarf the rest. The mean (19,427) sits far above the Q3 of 5,875, confirming a long heavy tail.
Treatment: Apply a log1p transform before modelling to tame the heavy right tail.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 2,331
- min: 0
- max: 4.851e+06
- mean: 1.943e+04
- median: 1,209
- std: 1.251e+05
- q1: 377
- q3: 5,875
- iqr: 5,498
- skew: 22.75
- kurtosis: 744.8
- n_outliers: 492
- outlier_rate: 0.1527
- zero_rate: 0.004967
state
numeric feature
Stored as numeric but with only 52 unique integer values across 3,221 rows, ranging 1–72 with no nulls or zeros, this is almost certainly a FIPS state code rather than a true quantity. The near-symmetric spread (skew 0.157, kurtosis -0.626, mean 31.28 vs median 30) is a by-product of how the codes are assigned, not a meaningful distribution. The max of 72 is the FIPS code for Puerto Rico, and 52 distinct codes are consistent with the 50 states plus the District of Columbia and Puerto Rico.
Treatment: Cast to categorical and one-hot or target-encode before modelling.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 52
- min: 1
- max: 72
- mean: 31.28
- median: 30
- std: 16.28
- q1: 19
- q3: 46
- iqr: 27
- skew: 0.157
- kurtosis: -0.6261
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
county
numeric foreign_key high_skew outliers
Stored as numeric, but with 326 unique integer values from 1 to 840 across 3,221 rows and zero nulls, this is almost certainly the county portion of the FIPS code rather than a measurement. The right skew (2.87), kurtosis (11.6), and 178 flagged 'outliers' simply reflect that codes are not uniformly distributed; those values are real codes, not anomalies. Treating mean=102.8 or std=106.6 as meaningful would be misleading.
Treatment: Cast to a categorical/string code and join to a county lookup together with state, since county codes repeat across states; do not use as a continuous feature.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 326
- min: 1
- max: 840
- mean: 102.8
- median: 79
- std: 106.6
- q1: 35
- q3: 133
- iqr: 98
- skew: 2.868
- kurtosis: 11.64
- n_outliers: 178
- outlier_rate: 0.05526
- zero_rate: 0
FIPS
numeric identifier
FIPS is the standard US Federal Information Processing Standards county code, with all 3,221 rows unique and no nulls. Values span 1001 to 72153, consistent with state-prefixed county identifiers (Alabama through Puerto Rico), and the distribution is near-symmetric (skew 0.157) with no outliers flagged. Because the column is stored as numeric, leading zeros are lost (Alabama's 01001 appears as 1001).
Treatment: Treat as a categorical key, zero-padded to five characters if the other table stores it as text; left-join on this code rather than using it as a numeric feature.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 3,221
- min: 1,001
- max: 72,153
- mean: 3.138e+04
- median: 30,023
- std: 1.63e+04
- q1: 19,031
- q3: 46,105
- iqr: 27,074
- skew: 0.1569
- kurtosis: -0.6308
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
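A minimal sketch of preparing the identifier columns for a join, in plain Python. It assumes that `state` and `county` are the two-digit and three-digit components of the five-digit FIPS code, which the reported ranges (state 1–72, county 1–840, FIPS 1001–72153) are consistent with but do not prove. The helper names `county_fips` and `pad_fips` are hypothetical.

```python
def county_fips(state: int, county: int) -> str:
    """Combine numeric state and county codes into a 5-character FIPS string."""
    return f"{state:02d}{county:03d}"

def pad_fips(fips: int) -> str:
    """Restore the leading zero that a numeric FIPS column drops."""
    return f"{fips:05d}"

# The smallest and largest FIPS values reported above.
print(pad_fips(1001))        # 01001
print(county_fips(72, 153))  # 72153
```

Keeping the key as a string avoids silent mismatches when the reference table stores codes such as 01001 as text.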
pct_black
numeric feature high_skew outliers
Percentage of each county's population that is Black, ranging from 0 to 87.79 with a median of just 2.38 but a mean of 9.08. The distribution is heavily right-skewed (skew 2.33, kurtosis 5.45) with 422 outliers (13.1%) and 2.8% exact zeros, indicating a long tail of high-share counties above a low-share majority. There are no nulls, and 3,128 of 3,221 values are unique.
Treatment: Apply a log1p or similar transform before regression to tame the right skew.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 3,128
- min: 0
- max: 87.79
- mean: 9.085
- median: 2.383
- std: 14.5
- q1: 0.6919
- q3: 10.21
- iqr: 9.513
- skew: 2.326
- kurtosis: 5.451
- n_outliers: 422
- outlier_rate: 0.131
- zero_rate: 0.02825
pct_white
numeric feature
Percentage of each county's population that is white, ranging from 3.29 to 100.0 with a mean of 81.20 and median of 87.66. The distribution is heavily left-skewed (skew -1.56) with 145 low-end outliers (4.5% outlier rate): most counties are predominantly white, with a long tail of more diverse ones. No nulls or zeros are present, and values are near-unique across 3,221 rows.
Treatment: Consider a logit or reflected-log transform to address the strong left skew before modelling; a logit needs values clipped away from 100 first, since the column reaches 100.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 3,218
- min: 3.29
- max: 100
- mean: 81.2
- median: 87.66
- std: 17.35
- q1: 73.62
- q3: 93.99
- iqr: 20.37
- skew: -1.562
- kurtosis: 2.301
- n_outliers: 145
- outlier_rate: 0.04502
- zero_rate: 0
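The logit option mentioned in the treatment can be sketched as below. This is one possible illustration rather than a recommended recipe: the helper name `logit_pct` and the clipping width `eps` (0.5 percentage points) are assumptions, chosen only so that a value of exactly 100 does not produce a division by zero.

```python
import math

def logit_pct(pct: float, eps: float = 0.5) -> float:
    """Logit of a 0-100 percentage, clipped to [eps, 100 - eps] first."""
    clipped = min(max(pct, eps), 100.0 - eps)
    p = clipped / 100.0
    return math.log(p / (1.0 - p))

# Minimum, median, and maximum of pct_white as reported above.
for pct in (3.29, 87.66, 100.0):
    print(pct, round(logit_pct(pct), 3))
```

The choice of `eps` changes how far the extreme counties are pulled in, so it is worth treating as a tunable parameter.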
pct_hispanic
numeric feature high_skew outliers
Percentage of each county's population that is Hispanic, ranging 0 to 99.996 with a median of just 4.52 but a mean of 11.74, indicating a long right tail. Skew of 3.11 and kurtosis of 9.89 confirm heavy concentration at low values, with 420 outliers (13.0% of rows) stretching toward 100. There are no nulls and only 0.5% exact zeros, so the column is densely populated rather than sparse.
Treatment: Apply a log1p or similar transform before modelling to tame the right skew and outliers.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 3,205
- min: 0
- max: 100
- mean: 11.74
- median: 4.516
- std: 19.4
- q1: 2.363
- q3: 10.66
- iqr: 8.294
- skew: 3.113
- kurtosis: 9.888
- n_outliers: 420
- outlier_rate: 0.1304
- zero_rate: 0.004967
poverty_rate
numeric feature high_skew
County-level poverty rate as a continuous percentage, ranging from 0 to 66.19 with mean 15.38 and median 13.81. The distribution is right-skewed (skew 2.11, kurtosis 6.92) with 143 high-end outliers (4.4%) stretching well beyond the Q3 of 18.25. Values are near-unique (3,219 distinct across 3,221 rows), with no nulls and a single exact zero.
Treatment: Apply log1p, or Box-Cox after adding a small offset (Box-Cox requires strictly positive input), before regression to tame the right skew.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 3,219
- min: 0
- max: 66.19
- mean: 15.38
- median: 13.81
- std: 7.97
- q1: 10.34
- q3: 18.25
- iqr: 7.91
- skew: 2.111
- kurtosis: 6.922
- n_outliers: 143
- outlier_rate: 0.0444
- zero_rate: 0.0003105
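The report does not state how `n_outliers` is computed. A common convention, and the one assumed in this sketch, is Tukey's rule: flag values more than 1.5 × IQR beyond the quartiles. The helper name `tukey_fences` and the multiplier `k = 1.5` are assumptions, not something the profiler output confirms.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# q1 and q3 of poverty_rate as reported above.
lower, upper = tukey_fences(10.34, 18.25)
print(round(lower, 3), round(upper, 3))
```

Under this assumption the lower fence is negative, so only high-poverty counties can be flagged, which agrees with the outliers being described as high-end.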
below_poverty_level
numeric feature high_skew outliers
Count of residents below the poverty level per county, ranging from 0 to 1,401,656 with a median of 3,831. The distribution is severely right-skewed (skew 15.1, kurtosis 360.7) with the mean (13,136) more than three times the median and 351 outliers (10.9% of rows). The standard deviation (44,284) dwarfs the IQR (8,390), consistent with a few very large counties dominating the tail.
Treatment: Log-transform with log1p (one row is zero) or normalise per population before modelling to tame the heavy right tail.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 2,824
- min: 0
- max: 1.402e+06
- mean: 1.314e+04
- median: 3,831
- std: 4.428e+04
- q1: 1,547
- q3: 9,937
- iqr: 8,390
- skew: 15.11
- kurtosis: 360.7
- n_outliers: 351
- outlier_rate: 0.109
- zero_rate: 0.0003105
median_household_income
numeric feature high_skew outliers
Median household income per county (n=3221, 3099 unique, no nulls) with a typical value near the median of 52,380 and an IQR of 16,300. The mean of -152,820 and min of -666,666,666 betray a sentinel value masquerading as data, producing extreme skew (-56.73) and kurtosis (3215.99) plus 182 flagged outliers. The binned counts for this column place a single row in the lowest bin, so one sentinel row appears to be enough to cause all of this. Once it is removed, the q1/q3 range of 44,939–61,239 looks like plausible US county-level income.
Treatment: Replace the -666666666 sentinel with null, then consider winsorizing or log-transforming before modelling.
- n: 3,221
- nulls: 0 (0.0%)
- unique: 3,099
- min: -6.667e+08
- max: 147,111
- mean: -1.528e+05
- median: 52,380
- std: 1.175e+07
- q1: 44,939
- q3: 61,239
- iqr: 16,300
- skew: -56.73
- kurtosis: 3216
- n_outliers: 182
- outlier_rate: 0.0565
- zero_rate: 0
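The sentinel replacement described in the treatment can be sketched in plain Python as follows. The sample list is a toy built from the quartiles and maximum reported above plus one sentinel row, not real rows from the dataset, and the helper name `mask_sentinel` is hypothetical.

```python
import statistics

SENTINEL = -666_666_666  # value reported as the column minimum

def mask_sentinel(values, sentinel=SENTINEL):
    """Replace the sentinel with None so it is treated as missing."""
    return [None if v == sentinel else v for v in values]

# Toy sample: q1, median, q3, and max from the report, plus one sentinel row.
incomes = [44_939, 52_380, SENTINEL, 61_239, 147_111]
cleaned = [v for v in mask_sentinel(incomes) if v is not None]

print(statistics.mean(incomes) < 0)  # True: the sentinel drags the mean negative
print(statistics.mean(cleaned) > 0)  # True: the cleaned mean is positive again
```

Masking rather than dropping keeps the row available for the other columns, which are unaffected by the sentinel.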
margin_2020
numeric feature
Numeric 2020 margin, almost entirely unique (3,112 distinct values across 3,221 rows), ranging from -0.87 to 0.93 with a mean of 0.317 and median of 0.384. The summary statistics are consistent with margin = republican_pct_2020 - democratic_pct_2020 (0.6497 - 0.3327 ≈ 0.317), so positive values indicate a Republican lead. The distribution is left-skewed (skew -0.82): most counties sit on the positive side while a tail of negative margins pulls the mean below the median. About 3.4% of rows are null and only 1.5% are flagged as outliers, with no zero values at all.
Treatment: Use directly as a signed numeric feature; impute or drop the 3.4% nulls and retain the sign, since negatives are meaningful.
- n: 3,221
- nulls: 109 (3.4%)
- unique: 3,112
- min: -0.8675
- max: 0.9309
- mean: 0.317
- median: 0.3844
- std: 0.321
- q1: 0.1348
- q3: 0.5662
- iqr: 0.4314
- skew: -0.8212
- kurtosis: 0.2286
- n_outliers: 48
- outlier_rate: 0.01542
- zero_rate: 0
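The sign convention of this column is not documented in the report. The sketch below checks one plausible reading, Republican share minus Democratic share, against the summary statistics reported for the 2020 columns. The function name `signed_margin` is hypothetical, and a match on summary statistics is supporting evidence, not proof; a row-level comparison on the actual data would settle it.

```python
def signed_margin(republican: float, democratic: float) -> float:
    """Republican share minus Democratic share; positive means a Republican lead."""
    return republican - democratic

# Summary statistics reported for the 2020 columns.
mean_check = signed_margin(0.6497, 0.3327)  # column means
min_check = signed_margin(0.05397, 0.9215)  # lowest Republican vs highest Democratic share
max_check = signed_margin(0.9618, 0.03091)  # highest Republican vs lowest Democratic share

# Compare with the reported margin_2020 mean, min, and max.
print(round(mean_check, 4), round(min_check, 4), round(max_check, 4))
```

The same subtraction applied to republican_pct_2016 and democratic_pct_2016 would give a 2016 margin on the same signed, fractional scale.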
democratic_pct_2020
numeric feature
Share of votes cast for the Democratic candidate in 2020, recorded per county. Values range from 0.031 to 0.921 with a median of 0.300 and mean of 0.333, indicating most counties lean Republican while a long right tail of heavily Democratic counties pulls the mean up (skew 0.83). About 3.4% of rows are null and 49 outliers (1.6%) sit beyond the whiskers; no zeros are present.
Treatment: Use as-is as a proportion feature; impute the 3.4% nulls or drop those rows before modelling.
- n: 3,221
- nulls: 109 (3.4%)
- unique: 3,111
- min: 0.03091
- max: 0.9215
- mean: 0.3327
- median: 0.2998
- std: 0.1598
- q1: 0.2091
- q3: 0.4236
- iqr: 0.2145
- skew: 0.8326
- kurtosis: 0.2523
- n_outliers: 49
- outlier_rate: 0.01575
- zero_rate: 0
republican_pct_2020
numeric feature
2020 Republican vote share per county, with a mean of 0.65 and median of 0.68 across 3,221 rows. The distribution is left-skewed (skew -0.81): most counties sit at high Republican shares, with a thinner tail toward low shares. Values range from 0.054 to 0.962, and 3.38% of rows are null. The low outlier count (47, or 1.5%) and near-unique values (3,111 distinct) are consistent with continuous geographic shares.
Treatment: Use as-is as a continuous feature; impute or drop the 3.38% missing rows before modelling.
- n: 3,221
- nulls: 109 (3.4%)
- unique: 3,111
- min: 0.05397
- max: 0.9618
- mean: 0.6497
- median: 0.6829
- std: 0.1613
- q1: 0.5576
- q3: 0.7747
- iqr: 0.2171
- skew: -0.8091
- kurtosis: 0.2063
- n_outliers: 47
- outlier_rate: 0.0151
- zero_rate: 0
margin_2016
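text feature one_word allcaps short_text
This column stores the 2016 margin as a short percentage string (e.g. '15.17%', '26.55%'), 5 to 6 characters long with exactly one 'word' per row. The one_word, allcaps, and short_text alerts appear to be by-products of profiling a numeric string as text. 18.6% of values are duplicates, with '15.17%' alone appearing 29 times, which is worth checking as a possible placeholder rather than a genuine repeat. A 6-character cap also leaves no room for a minus sign on two-digit margins, so the values may be unsigned, unlike margin_2020. The null rate is 2.58% and there are 2,554 unique values across 3,221 rows.
Treatment: Strip the '%', cast to float, and divide by 100 to match margin_2020's fractional scale before any numeric analysis; confirm the sign convention against republican_pct_2016 and democratic_pct_2016.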
- n: 3,221
- nulls: 83 (2.6%)
- unique: 2,554
- len_min: 5
- len_max: 6
- len_mean: 5.896
- len_median: 6
- len_p95: 6
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 584
- duplicate_rate: 0.1861
- vocab_size: 2,554
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
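Parsing the percentage strings could look like the sketch below. The helper name `parse_pct` is hypothetical, and dividing by 100 is an assumption made to put the result on the same fractional scale as the numeric 2020 margin. Missing values are passed through as `None`, and anything without a trailing '%' raises an error instead of being silently coerced.

```python
from typing import Optional

def parse_pct(text: Optional[str]) -> Optional[float]:
    """Convert a string such as '15.17%' to a fraction near 0.1517; pass None through."""
    if text is None:
        return None
    stripped = text.strip()
    if not stripped.endswith("%"):
        raise ValueError(f"expected a percent string, got {text!r}")
    return float(stripped[:-1]) / 100.0

# The two example values quoted above, plus a missing entry.
print([parse_pct(v) for v in ["15.17%", "26.55%", None]])
```

Failing loudly on unexpected formats makes it easier to spot placeholder or malformed entries during cleaning.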
democratic_pct_2016
numeric feature
Share of the 2016 vote going Democratic, recorded per county. Values range from 0.031 to 0.928 with a mean of 0.317 and median of 0.286, and the right skew of 0.94 reflects a long tail of heavily Democratic counties amid a mass of Republican-leaning ones. About 2.6% of rows are null and 75 high-side outliers (2.4%) sit above the IQR fence.
Treatment: Use as-is as a proportion feature; impute or drop the 2.6% nulls before modelling.
- n: 3,221
- nulls: 83 (2.6%)
- unique: 3,111
- min: 0.03145
- max: 0.9285
- mean: 0.3174
- median: 0.2861
- std: 0.1527
- q1: 0.2054
- q3: 0.3982
- iqr: 0.1928
- skew: 0.9371
- kurtosis: 0.666
- n_outliers: 75
- outlier_rate: 0.0239
- zero_rate: 0
republican_pct_2016
numeric feature
Republican vote share per county in the 2016 election, expressed as a proportion between 0.041 and 0.953. The distribution is left-skewed (skew -0.81) with a median of 0.666 above the mean of 0.635, indicating most counties leaned Republican while a smaller tail of strongly Democratic counties pulls the mean down. Near-unique values (3,111 of 3,221) and a 2.58% null rate are consistent with one row per county.
Treatment: Use as-is as a proportion feature; impute or drop the ~2.6% nulls before modelling.
- n: 3,221
- nulls: 83 (2.6%)
- unique: 3,111
- min: 0.04122
- max: 0.9527
- mean: 0.6354
- median: 0.6656
- std: 0.1559
- q1: 0.5463
- q3: 0.7503
- iqr: 0.2041
- skew: -0.8145
- kurtosis: 0.3566
- n_outliers: 62
- outlier_rate: 0.01976
- zero_rate: 0