
scars master dataset

source: /home/coolhand/html/datavis/data_trove/data/geographic/scars/master_dataset.csv · 3,221 rows · 20 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a county-level US dataset of 3,221 rows and 20 columns combining demographics (population by race, poverty, income), 2016 and 2020 presidential vote shares, and geographic identifiers (FIPS, state, county). Two data-quality issues stand out and should be addressed first. First, median_household_income contains at least one sentinel/error value that pulls its minimum to -666,666,666 and yields a negative mean. Second, margin_2016 is stored as text percentages (e.g. '15.17%') while margin_2020 is a numeric proportion (range -0.87 to 0.93), so the two election cycles aren't directly comparable until margin_2016 is parsed and rescaled. The vote-share columns themselves are well-formed and show a Republican-leaning county distribution (mean republican_pct_2020 ≈ 0.65 vs democratic_pct_2020 ≈ 0.33); these are unweighted county means, not national vote shares. Population and demographic counts are heavily right-skewed with many outliers, as expected when mixing rural counties with metros of up to ~10M people, so log scales or the percentage shares already provided (pct_white, pct_black, pct_hispanic) will be more informative than raw counts.

citing: median_household_income · margin_2016 · margin_2020 · republican_pct_2020 · democratic_pct_2020 · total_population · pct_white · pct_black · pct_hispanic · poverty_rate

Schema

20 columns
Per-column summary.
column · type · nulls · unique · alerts
NAME · text · 0.0% · 3,221 · near_unique
total_population · numeric · 0.0% · 3,160 · high_skew outliers
black_population · numeric · 0.0% · 2,066 · high_skew outliers
white_population · numeric · 0.0% · 3,143 · high_skew outliers
hispanic_population · numeric · 0.0% · 2,331 · high_skew outliers
state · numeric · 0.0% · 52 · none
county · numeric · 0.0% · 326 · high_skew outliers
FIPS · numeric · 0.0% · 3,221 · none
pct_black · numeric · 0.0% · 3,128 · high_skew outliers
pct_white · numeric · 0.0% · 3,218 · none
pct_hispanic · numeric · 0.0% · 3,205 · high_skew outliers
poverty_rate · numeric · 0.0% · 3,219 · high_skew
below_poverty_level · numeric · 0.0% · 2,824 · high_skew outliers
median_household_income · numeric · 0.0% · 3,099 · high_skew outliers
margin_2020 · numeric · 3.4% · 3,112 · none
democratic_pct_2020 · numeric · 3.4% · 3,111 · none
republican_pct_2020 · numeric · 3.4% · 3,111 · none
margin_2016 · text · 2.6% · 2,554 · one_word allcaps short_text
democratic_pct_2016 · numeric · 2.6% · 3,111 · none
republican_pct_2016 · numeric · 2.6% · 3,111 · none

NAME

text identifier near_unique
This column appears to hold US county names with state qualifiers: 'county,' appears in 3,007 of 3,221 rows, followed by state tokens like Texas (256), Virginia (189), and Georgia (159). Every value is unique (n_unique = 3221, duplicate_rate = 0.0) with no nulls, and lengths cluster tightly around 24 characters (min 16, max 42), consistent with a canonical 'X County, State' format with the state name spelled out. The near_unique alert confirms this behaves as an identifier rather than a categorical feature. Treatment: Use as a join key to county-level reference tables, or prefer FIPS where the other table carries it; do not feed as a categorical feature. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 3,221
len_min: 16 · len_max: 42 · len_mean: 24.27 · len_median: 24 · len_p95: 31
word_mean: 3.243 · word_median: 3 · n_empty: 0 · n_duplicates: 0 · duplicate_rate: 0
vocab_size: 1,983 · readability_flesch_mean: 7.581
emoji_rate: 0 · url_rate: 0 · one_word_rate: 0 · allcaps_rate: 0 · boilerplate_rate: 0

total_population

numeric feature high_skew outliers
Likely a county- or area-level total population count, given 3,221 rows with no nulls and a minimum of 117 alongside a maximum of 10,040,682. The distribution is severely right-skewed (skew 13.67, kurtosis 311.9) with the mean (102,398) nearly four times the median (25,981) and 441 outliers (13.7%). A few mega-population areas dominate while most are small. Treatment: log-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 3,160
min: 117 · max: 1.004e+07 · mean: 1.024e+05 · median: 25,981 · std: 3.283e+05
q1: 11,125 · q3: 66,969 · iqr: 55,844 · skew: 13.67 · kurtosis: 311.9
n_outliers: 441 · outlier_rate: 0.1369 · zero_rate: 0
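The log-transform recommended for total_population can be sketched in plain Python. The five sample values are the min, q1, median, q3 and max reported in the profile; the function name log10_population is hypothetical and only for illustration.

```python
import math

# Five summary values taken from the profile: min, q1, median, q3, max.
population = [117, 11_125, 25_981, 66_969, 10_040_682]

def log10_population(values):
    """Return base-10 logs; safe here because the column has no zeros (min is 117)."""
    return [math.log10(v) for v in values]

logged = log10_population(population)

# The raw values span about five orders of magnitude;
# after the transform they fall roughly between 2 and 7.
print([round(x, 2) for x in logged])
```

On the log scale the quartiles sit within a unit or so of each other, which is why the heavy right tail stops dominating plots and regressions.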

black_population

numeric feature high_skew outliers
Numeric count of Black residents per record (likely US county-level given n=3221), ranging from 0 to 1,202,260 with a median of just 859. The distribution is extremely right-skewed (skew 10.46, kurtosis 148.2) with 13.6% flagged as outliers and a std (54,952) over four times the mean (12,914), reflecting a few major metros dominating a long tail of small counties. About 2.8% of rows are zero and there are no nulls. Treatment: Log-transform (log1p) before modelling or normalise as a share of total population. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 2,066
min: 0 · max: 1.202e+06 · mean: 1.291e+04 · median: 859 · std: 5.495e+04
q1: 114 · q3: 5,553 · iqr: 5,439 · skew: 10.46 · kurtosis: 148.2
n_outliers: 438 · outlier_rate: 0.136 · zero_rate: 0.02825
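Because about 2.8% of black_population values are zero, a plain log is undefined for them. A minimal sketch of the two suggested treatments (log1p, or a share of total population) follows; the three rows are hypothetical and are not taken from the dataset.

```python
import math

# Hypothetical (total_population, black_population) rows, for illustration only.
rows = [(11_125, 0), (25_981, 859), (66_969, 5_553)]

def transform(total, black):
    """Return (log1p of the count, share of total population in percent)."""
    log_count = math.log1p(black)    # log(1 + x): defined at 0, unlike log(x)
    share_pct = 100 * black / total  # same 0-to-100 scale as pct_black
    return log_count, share_pct

results = [transform(total, black) for total, black in rows]
for (total, black), (log_count, share_pct) in zip(rows, results):
    print(total, black, round(log_count, 3), round(share_pct, 2))
```

The share is usually the easier of the two to interpret, and the profile already carries it as pct_black.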

white_population

numeric feature high_skew outliers
Counts of the white population per record (likely US counties given n=3221), ranging from 58 to 4,795,186 with a median of 21,282 but a mean of 72,000. The distribution is extremely right-skewed (skew 10.35, kurtosis 175.65) with 407 outliers (12.6%), reflecting a few very populous counties dwarfing the rest. No nulls or zeros, and near-unique values (3143/3221). Treatment: log-transform before regression to tame the heavy right skew. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 3,143
min: 58 · max: 4.795e+06 · mean: 7.2e+04 · median: 21,282 · std: 1.918e+05
q1: 8,855 · q3: 56,553 · iqr: 47,698 · skew: 10.35 · kurtosis: 175.7
n_outliers: 407 · outlier_rate: 0.1264 · zero_rate: 0

hispanic_population

numeric feature high_skew outliers
Counts of Hispanic population per record (likely county-level given n=3221), ranging from 0 to 4,851,344 with a median of just 1,209. The distribution is extraordinarily right-skewed (skew 22.75, kurtosis 744.79) and 15.3% of rows flag as outliers, indicating a handful of very large jurisdictions dwarf the rest. Mean (19,427) sits far above the Q3 of 5,875, confirming a long heavy tail. Treatment: Apply a log1p transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 2,331
min: 0 · max: 4.851e+06 · mean: 1.943e+04 · median: 1,209 · std: 1.251e+05
q1: 377 · q3: 5,875 · iqr: 5,498 · skew: 22.75 · kurtosis: 744.8
n_outliers: 492 · outlier_rate: 0.1527 · zero_rate: 0.004967

state

numeric feature
Stored as numeric but with only 52 unique integer values across 3221 rows, ranging 1 to 72 with no nulls or zeros, this is almost certainly a FIPS state code rather than a true quantity. The near-symmetric spread (skew 0.157, kurtosis -0.626, mean 31.28 vs median 30) reflects how counties are spread across the code range, not a meaningful distribution. The max of 72 is the FIPS code for Puerto Rico, and 52 distinct values is consistent with the 50 states plus the District of Columbia and Puerto Rico. Treatment: Cast to categorical and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 52
min: 1 · max: 72 · mean: 31.28 · median: 30 · std: 16.28
q1: 19 · q3: 46 · iqr: 27 · skew: 0.157 · kurtosis: -0.6261
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0
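The categorical treatment for state can be illustrated without any libraries. This is a sketch under stated assumptions: the five codes are hypothetical rows, and the two-digit zero-padded labels are a formatting choice, not something the profile specifies.

```python
# Hypothetical state codes for five rows; 72 is the Puerto Rico code seen as the column max.
state_codes = [1, 6, 48, 72, 6]

# Cast each numeric code to a zero-padded label so it is no longer treated as a quantity.
labels = [f"{code:02d}" for code in state_codes]

# One indicator column per distinct label, in sorted order.
categories = sorted(set(labels))
one_hot = [[int(label == cat) for cat in categories] for label in labels]

print(categories)
for label, row in zip(labels, one_hot):
    print(label, row)
```

With 52 distinct codes, one-hot encoding adds 52 indicator columns; target encoding, the other option named above, keeps a single column instead.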

county

numeric foreign_key high_skew outliers
Stored as numeric, but with 326 unique integer values from 1 to 840 across 3221 rows and zero nulls, this is almost certainly the county portion of a FIPS code rather than a measurement. The right skew (2.87), kurtosis (11.6), and 178 flagged 'outliers' simply reflect that codes are not uniformly distributed; the flagged values are real codes, not anomalies. Treating mean=102.8 or std=106.6 as meaningful would be misleading. Because only 326 codes cover 3,221 rows, the same code recurs across states, so it identifies a county only in combination with state. Treatment: Cast to categorical/string code and join to a county lookup on state plus county (or on FIPS); do not use as a continuous feature. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 326
min: 1 · max: 840 · mean: 102.8 · median: 79 · std: 106.6
q1: 35 · q3: 133 · iqr: 98 · skew: 2.868 · kurtosis: 11.64
n_outliers: 178 · outlier_rate: 0.05526 · zero_rate: 0

FIPS

numeric identifier
FIPS is the standard U.S. Federal Information Processing Standards county code, with all 3221 rows unique and no nulls. Values span 1001 to 72153, consistent with state-prefixed county identifiers (Alabama through Puerto Rico), and the distribution is near-symmetric (skew 0.157) with no outliers flagged. Treatment: Treat as a categorical key, cast to a zero-padded five-character string if the other table stores FIPS as text (numeric storage drops the leading zero, so Alabama's 01001 appears as 1001); left-join on this code rather than using it as a numeric feature. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 3,221
min: 1,001 · max: 72,153 · mean: 3.138e+04 · median: 30,023 · std: 1.63e+04
q1: 19,031 · q3: 46,105 · iqr: 27,074 · skew: 0.1569 · kurtosis: -0.6308
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0
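The relationship between state, county and FIPS can be checked with a short sketch. It assumes the usual five-digit layout (two-digit state code followed by a three-digit county code), which matches the reported min of 1001 and max of 72153; the helper names compose_fips and pad_fips are hypothetical.

```python
def compose_fips(state: int, county: int) -> str:
    """Build a 5-character FIPS key from numeric state and county codes."""
    return f"{state:02d}{county:03d}"

def pad_fips(fips: int) -> str:
    """Restore the leading zero that numeric storage drops (1001 -> '01001')."""
    return str(fips).zfill(5)

# The column min and max from the profile, rebuilt from their parts.
print(compose_fips(1, 1), pad_fips(1001))
print(compose_fips(72, 153), pad_fips(72153))
```

Joining on the padded string avoids silent mismatches when one table stores FIPS as an integer and the other as text.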

pct_black

numeric feature high_skew outliers
This is a numeric percentage of Black population per record (likely a county), ranging from 0 to 87.79 with a median of just 2.38% but a mean of 9.08%. The distribution is heavily right-skewed (skew 2.33, kurtosis 5.45) with 422 outliers (13.1%) and 2.8% exact zeros, indicating a long tail of high-percentage areas above an otherwise low-share majority. No nulls, and 3,128 of 3,221 values are unique. Treatment: Apply a log1p or similar transform before regression to tame the right skew. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 3,128
min: 0 · max: 87.79 · mean: 9.085 · median: 2.383 · std: 14.5
q1: 0.6919 · q3: 10.21 · iqr: 9.513 · skew: 2.326 · kurtosis: 5.451
n_outliers: 422 · outlier_rate: 0.131 · zero_rate: 0.02825

pct_white

numeric feature
This column reports the percentage of a population that is white, ranging from 3.29 to 100.0 with a mean of 81.20 and median of 87.66. The distribution is heavily left-skewed (skew -1.56) with 145 low-end outliers (4.5% outlier rate), indicating most records are predominantly white but a long tail of diverse populations exists. No nulls or zeros are present, and near-unique values across 3221 rows suggest one row per geographic unit. Treatment: Consider a logit or reflected-log transform to address the strong left skew before modelling. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 3,218
min: 3.29 · max: 100 · mean: 81.2 · median: 87.66 · std: 17.35
q1: 73.62 · q3: 93.99 · iqr: 20.37 · skew: -1.562 · kurtosis: 2.301
n_outliers: 145 · outlier_rate: 0.04502 · zero_rate: 0
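A logit transform of pct_white needs care at the column maximum of 100, where the plain logit is infinite. The sketch below clips the proportion before transforming; the clipping width eps = 0.001 is an arbitrary assumption, and logit_pct is a hypothetical helper name.

```python
import math

def logit_pct(pct: float, eps: float = 1e-3) -> float:
    """Logit of a 0-to-100 percentage, clipped to [eps, 1 - eps] to stay finite."""
    p = min(max(pct / 100.0, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

# Min, median and max of pct_white from the profile.
for pct in (3.29, 87.66, 100.0):
    print(pct, round(logit_pct(pct), 3))
```

The choice of eps sets how far the clipped extremes sit from the rest of the data, so it is worth reporting alongside any model that uses the transformed column.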

pct_hispanic

numeric feature high_skew outliers
This is a numeric percentage of Hispanic population per row, ranging 0 to 99.996 with a median of just 4.52 but a mean of 11.74, indicating a long right tail. Skew of 3.11 and kurtosis of 9.89 confirm heavy concentration at low values with 420 outliers (13.0% of rows) stretching toward 100. No nulls and only 0.5% exact zeros suggest the values are continuously measured rather than sparsely populated. Treatment: Apply a log1p or similar transform before modelling to tame the right skew and outliers. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 3,205
min: 0 · max: 100 · mean: 11.74 · median: 4.516 · std: 19.4
q1: 2.363 · q3: 10.66 · iqr: 8.294 · skew: 3.113 · kurtosis: 9.888
n_outliers: 420 · outlier_rate: 0.1304 · zero_rate: 0.004967

poverty_rate

numeric feature high_skew
Continuous percentage values ranging from 0 to 66.19 with mean 15.38 and median 13.81, almost certainly a county- or area-level poverty rate. Distribution is right-skewed (skew 2.11, kurtosis 6.92) with 143 high-end outliers (4.4%) stretching well beyond Q3 of 18.25. Values are near-unique across 3,221 rows (3,219 distinct), with no nulls and a single zero row (zero_rate 0.00031). Treatment: Apply log1p, or Box-Cox after a small positive shift since the minimum is 0, before regression to tame the right skew. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 3,219
min: 0 · max: 66.19 · mean: 15.38 · median: 13.81 · std: 7.97
q1: 10.34 · q3: 18.25 · iqr: 7.91 · skew: 2.111 · kurtosis: 6.922
n_outliers: 143 · outlier_rate: 0.0444 · zero_rate: 0.0003105

below_poverty_level

numeric feature high_skew outliers
This column appears to be a count of residents below the poverty level per geographic unit, ranging from 0 to 1,401,656 with a median of 3,831. The distribution is severely right-skewed (skew 15.1, kurtosis 360.7) with the mean (13,136) more than three times the median and 351 outliers (10.9% of rows). Standard deviation (44,284) dwarfs the IQR (8,390), consistent with a few very large jurisdictions dominating the tail. Treatment: Log-transform (or normalize per population) before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 2,824
min: 0 · max: 1.402e+06 · mean: 1.314e+04 · median: 3,831 · std: 4.428e+04
q1: 1,547 · q3: 9,937 · iqr: 8,390 · skew: 15.11 · kurtosis: 360.7
n_outliers: 351 · outlier_rate: 0.109 · zero_rate: 0.0003105

median_household_income

numeric feature high_skew outliers
Median household income per record (n=3221, 3099 unique, no nulls) with a typical value near the median of 52380 and IQR of 16300. The mean of -152820 and min of -666666666 betray a sentinel value masquerading as data, producing extreme skew (-56.73) and kurtosis (3215.99) plus 182 flagged outliers. Once those sentinels are removed, the q1/q3 range of 44939-61239 looks like plausible US county-level income. Treatment: Replace the -666666666 sentinel with null, then consider winsorizing or log-transforming before modelling. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 0 (0.0%) · unique: 3,099
min: -6.667e+08 · max: 147,111 · mean: -1.528e+05 · median: 52,380 · std: 1.175e+07
q1: 44,939 · q3: 61,239 · iqr: 16,300 · skew: -56.73 · kurtosis: 3216
n_outliers: 182 · outlier_rate: 0.0565 · zero_rate: 0
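The sentinel replacement can be sketched as follows. The sample list is hypothetical (the profile's q1, median and q3 plus one sentinel), and clean_income is an illustrative helper name. The last lines are a rough consistency check, not a result from the profiler: a single extreme value d among n rows gives a standard deviation of roughly |d|/√n, and that figure is close to the reported std of 1.175e+07, which suggests (but does not prove) that only one row carries the sentinel.

```python
import math
import statistics

SENTINEL = -666_666_666  # the minimum reported for median_household_income

def clean_income(values):
    """Replace the sentinel with None so it is treated as missing, not as data."""
    return [None if v == SENTINEL else v for v in values]

# Hypothetical sample: q1, median and q3 from the profile, plus one sentinel row.
sample = [44_939, 52_380, 61_239, SENTINEL]
cleaned = clean_income(sample)
valid = [v for v in cleaned if v is not None]
clean_median = statistics.median(valid)
print(cleaned, clean_median)

# Rough check: std contributed by one sentinel among 3,221 rows, about 1.17e7.
approx_std = abs(SENTINEL) / math.sqrt(3221)
print(f"{approx_std:.3e}")
```

Counting the rows equal to the sentinel in the actual file is the direct way to confirm how many are affected.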

margin_2020

numeric feature
Numeric margin values for 2020, almost entirely unique across 3,221 rows (3,112 distinct), ranging from -0.87 to 0.93 with a mean of 0.317 and median 0.384. The distribution is left-skewed (skew -0.82), suggesting most observations cluster on the positive side while a tail of negative margins pulls the mean down. About 3.4% of rows are null and only 1.5% are flagged as outliers, with no zero values at all. Treatment: Use directly as a signed numeric feature; impute the 3.4% nulls and retain sign since negatives are meaningful. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 109 (3.4%) · unique: 3,112
min: -0.8675 · max: 0.9309 · mean: 0.317 · median: 0.3844 · std: 0.321
q1: 0.1348 · q3: 0.5662 · iqr: 0.4314 · skew: -0.8212 · kurtosis: 0.2286
n_outliers: 48 · outlier_rate: 0.01542 · zero_rate: 0
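The sign convention of margin_2020 is not stated in the profile, but the reported means allow a quick check. The sketch below assumes the three 2020 columns are null on the same 109 rows, so that the mean of the differences equals the difference of the means; under that assumption the numbers are consistent with margin = Republican share minus Democratic share, meaning positive values indicate a Republican lead.

```python
# Column means reported in the profile.
mean_republican_2020 = 0.6497
mean_democratic_2020 = 0.3327
mean_margin_2020 = 0.317

# If margin = republican - democratic on the same rows, the means must agree.
implied_margin = mean_republican_2020 - mean_democratic_2020
matches = abs(implied_margin - mean_margin_2020) < 0.001

print(round(implied_margin, 4), matches)
```

A row-level comparison against the actual file would settle the question; this only shows the summary statistics do not contradict the reading.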

democratic_pct_2020

numeric feature
This is the share of votes cast for the Democratic candidate in 2020, recorded per row (likely county-level given n=3221). Values range from 0.031 to 0.921 with a median of 0.300 and mean of 0.333, indicating most units lean Republican while a long right tail of heavily Democratic units pulls the mean up (skew 0.83). About 3.4% of rows are null and 49 outliers (1.6%) sit beyond the whiskers; no zeros are present. Treatment: Use as-is as a proportion feature; impute the 3.4% nulls or drop those rows before modelling. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 109 (3.4%) · unique: 3,111
min: 0.03091 · max: 0.9215 · mean: 0.3327 · median: 0.2998 · std: 0.1598
q1: 0.2091 · q3: 0.4236 · iqr: 0.2145 · skew: 0.8326 · kurtosis: 0.2523
n_outliers: 49 · outlier_rate: 0.01575 · zero_rate: 0

republican_pct_2020

numeric feature
This is the 2020 Republican vote share by what looks like a U.S. county-level unit, with 3221 rows and a mean of 0.65 and median 0.68. The distribution is left-skewed (skew -0.81): most counties sit at high Republican shares, with a thinner tail of low-share counties. Values range from 0.054 to 0.962, and 3.38% of rows are null. With only 47 outliers (1.5%) and near-unique values (3111 distinct), the column is consistent with continuous geographic shares. Treatment: Use as-is as a continuous feature; impute or drop the 3.38% missing rows before modelling. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 109 (3.4%) · unique: 3,111
min: 0.05397 · max: 0.9618 · mean: 0.6497 · median: 0.6829 · std: 0.1613
q1: 0.5576 · q3: 0.7747 · iqr: 0.2171 · skew: -0.8091 · kurtosis: 0.2063
n_outliers: 47 · outlier_rate: 0.0151 · zero_rate: 0

margin_2016

text feature one_word allcaps short_text
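This column stores a 2016 margin as a short percentage string (e.g. '15.17%', '26.55%'), with lengths of 5 to 6 characters and exactly one 'word' per row. The one_word, allcaps, and short_text alerts are side effects of running text checks on numeric strings, not content problems. Among non-null values, 18.6% are duplicates, with '15.17%' alone appearing 29 times; it is worth checking whether that's a placeholder or a genuine repeat. The 6-character cap also leaves no room for a minus sign on two-digit margins, so the column may be unsigned, unlike margin_2020. Null rate is 2.58% and there are 2554 unique values across 3221 rows. Treatment: Strip the '%', cast to float, and divide by 100 to match margin_2020's proportion scale before any numeric analysis. high · anthropic:claude-opus-4-7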
n: 3,221 · nulls: 83 (2.6%) · unique: 2,554
len_min: 5 · len_max: 6 · len_mean: 5.896 · len_median: 6 · len_p95: 6
word_mean: 1 · word_median: 1 · n_empty: 0 · n_duplicates: 584 · duplicate_rate: 0.1861
vocab_size: 2,554 · readability_flesch_mean: 121.2
emoji_rate: 0 · url_rate: 0 · one_word_rate: 1 · allcaps_rate: 1 · boilerplate_rate: 0
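Parsing margin_2016 into a number can be sketched as below. The helper name parse_pct_margin is hypothetical; treating blank strings as missing and rescaling to a 0-to-1 proportion are assumptions made so the result lines up with the numeric 2020 columns.

```python
from typing import Optional

def parse_pct_margin(text: Optional[str]) -> Optional[float]:
    """Convert a string like '15.17%' to a proportion like 0.1517; keep missing as None."""
    if text is None or not text.strip():
        return None
    # Drop surrounding whitespace and the trailing percent sign, then rescale.
    return float(text.strip().rstrip("%")) / 100.0

# Two example strings quoted in the profile, plus a missing value.
for raw in ("15.17%", "26.55%", None):
    print(raw, parse_pct_margin(raw))
```

Any value that fails to parse raises ValueError here, which is deliberate: unexpected formats are better surfaced than silently coerced.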

democratic_pct_2016

numeric feature
Share of the 2016 vote going Democratic, recorded per row (likely county-level given n=3221). Values range from 0.031 to 0.928 with a mean of 0.317 and median 0.286, and the right skew of 0.94 reflects a long tail of heavily Democratic jurisdictions amid a mass of Republican-leaning ones. About 2.6% of rows are null and 75 high-side outliers (2.4%) sit above the IQR fence. Treatment: Use as-is as a proportion feature; impute or drop the 2.6% nulls before modelling. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 83 (2.6%) · unique: 3,111
min: 0.03145 · max: 0.9285 · mean: 0.3174 · median: 0.2861 · std: 0.1527
q1: 0.2054 · q3: 0.3982 · iqr: 0.1928 · skew: 0.9371 · kurtosis: 0.666
n_outliers: 75 · outlier_rate: 0.0239 · zero_rate: 0
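The "IQR fence" mentioned above can be worked through with the reported quartiles. The profile does not state which rule it uses, so the 1.5 × IQR multiplier (Tukey's convention) is an assumption; the function name iqr_fences is hypothetical.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and extremes of democratic_pct_2016 from the profile.
q1, q3 = 0.2054, 0.3982
col_min, col_max = 0.03145, 0.9285

lower, upper = iqr_fences(q1, q3)
print(round(lower, 4), round(upper, 4))

# The lower fence is below the column minimum, so under this rule
# only high-side values can be flagged.
print(col_min > lower, col_max > upper)
```

Under this assumption the upper fence is about 0.687, which fits the description of all 75 outliers sitting on the high side.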

republican_pct_2016

numeric feature
This column captures the Republican vote share by unit (likely US county) in the 2016 election, expressed as a proportion between 0.041 and 0.953. The distribution is left-skewed (skew -0.81) with a median of 0.666 above the mean of 0.635, indicating most units leaned Republican while a smaller tail of strongly Democratic units pulls the mean down. Near-unique values (3111 of 3221) and a 2.58% null rate are consistent with one row per geographic unit. Treatment: Use as-is as a proportion feature; impute or drop the ~2.6% nulls before modelling. high · anthropic:claude-opus-4-7
n: 3,221 · nulls: 83 (2.6%) · unique: 3,111
min: 0.04122 · max: 0.9527 · mean: 0.6354 · median: 0.6656 · std: 0.1559
q1: 0.5463 · q3: 0.7503 · iqr: 0.2041 · skew: -0.8145 · kurtosis: 0.3566
n_outliers: 62 · outlier_rate: 0.01976 · zero_rate: 0