saturn

nationwide census counties nationwide

source /home/coolhand/html/datavis/data_trove/data/geographic/nationwide/census_counties_nationwide.csv · 3,144 rows · 8 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset covers 3,144 U.S. counties with demographic and socioeconomic indicators (population, median income, college attainment rate, and poverty rate), identified by state and county FIPS codes and state name. The most urgent issue is median_income: its minimum of -666,666,666 is a sentinel for missing data stored as a number, and it drags the mean to -148,752, so it must be replaced with a null before any analysis. Population is extremely right-skewed (skew ~13, max ~9.9M vs median ~25,785), so log-scaling is advisable for visualization and modeling. State coverage is uneven, with Texas (254 counties), Georgia (159), and Virginia (133) contributing the most rows. College and poverty rates are the cleanest fields and behave roughly as expected for county-level distributions.

citing: median_income · population · state_name · college_rate · poverty_rate · county_fips

Schema

8 columns
Per-column summary.

Column         Type         Nulls  Unique  Alerts
name           text         0.0%   3,144   near_unique
state_fips     numeric      0.0%   51
county_fips    numeric      0.0%   329     high_skew, outliers
state_name     categorical  0.0%   51
median_income  numeric      0.0%   3,021   high_skew
poverty_rate   numeric      0.0%   3,144
college_rate   numeric      0.0%   3,143
population     numeric      0.0%   3,080   high_skew, outliers

name

text identifier near_unique
This is the full name of a US county-state pair: 2999 of 3144 rows contain the word 'county,' and the remaining top tokens are state names (Texas 256, Virginia 189, Georgia 159). Every value is unique (n_unique=3144, duplicate_rate=0.0) with no nulls and a tight length band (min 16, mean 24.2, max 59). It functions as a row identifier rather than a modelling feature. Treatment: Use as the row key for joins; do not feed into models. high · anthropic:claude-opus-4-7
n: 3,144
nulls: 0 (0.0%)
unique: 3,144
len_min: 16
len_max: 59
len_mean: 24.16
len_median: 24
len_p95: 30.85
word_mean: 3.224
word_median: 3
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 1,910
readability_flesch_mean: 6.826
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0

state_fips

numeric foreign_key
Numeric column with exactly 51 unique values across 3,144 rows, ranging from 1 to 56 with no nulls: this is the U.S. state FIPS code (50 states plus DC), and the row count is consistent with one row per U.S. county. The mean (30.26), median (29), and near-zero skew (-0.08) describe the code numbers themselves, not a measured quantity, so they carry no analytic meaning; the per-state row counts are in fact uneven (see state_name). Despite being stored as numeric, the values are categorical identifiers, not measurements. Treatment: Cast to categorical or zero-padded string and use as a join key to state-level reference tables; do not treat as a continuous feature. high · anthropic:claude-opus-4-7
n: 3,144
nulls: 0 (0.0%)
unique: 51
min: 1
max: 56
mean: 30.26
median: 29
std: 15.15
q1: 18
q3: 45
iqr: 27
skew: -0.08128
kurtosis: -1.099
n_outliers: 0
outlier_rate: 0
zero_rate: 0

county_fips

numeric identifier high_skew outliers
This is the county-level component of a FIPS code stored as an integer. Only 329 unique values occur across 3,144 rows because county codes restart within each state, so the column does not identify a county on its own. Values run from 1 to 840 with a median of 79; the high skew (2.84) and 176 outliers (5.6%) reflect the long tail of larger county codes used in a few states rather than a measured distribution. There are no nulls or zeros. Treatment: Treat as a categorical code; zero-pad to three digits and concatenate with the two-digit state_fips to form a unique five-digit county key for joins. high · anthropic:claude-opus-4-7
n: 3,144
nulls: 0 (0.0%)
unique: 329
min: 1
max: 840
mean: 103.9
median: 79
std: 107.6
q1: 35
q3: 133.5
iqr: 98.5
skew: 2.841
kurtosis: 11.38
n_outliers: 176
outlier_rate: 0.05598
zero_rate: 0
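
The county_fips treatment above can be sketched in plain Python. This is a minimal sketch, assuming the standard FIPS layout of a two-digit state code followed by a three-digit county code; the helper name `county_key` and the sample rows are illustrative and not part of the profiled pipeline.

```python
def county_key(state_fips: int, county_fips: int) -> str:
    """Build a 5-character county key from integer FIPS components.

    Assumes the standard layout: 2-digit state code + 3-digit county code.
    """
    if not (0 < state_fips < 100 and 0 < county_fips < 1000):
        raise ValueError(f"FIPS component out of range: {state_fips}, {county_fips}")
    return f"{state_fips:02d}{county_fips:03d}"


# Illustrative (state_fips, county_fips) pairs. County code 1 repeats across
# states, which is why county_fips alone has only 329 unique values.
rows = [(1, 1), (48, 1), (51, 840)]
keys = [county_key(s, c) for s, c in rows]
print(keys)  # ['01001', '48001', '51840']

# The combined key stays unique even though county_fips repeats.
assert len(set(keys)) == len(keys)
```

With pandas, the same padding is available through `Series.astype(str).str.zfill(width)` on each component before concatenating.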

state_name

categorical feature
This column holds US state names, with 51 distinct values across 3,144 rows and no nulls, consistent with a county-level dataset covering all 50 states plus DC. The distribution mirrors county counts: Texas leads at 254 (8.08%), followed by Georgia (159) and Virginia (133). The entropy ratio of 0.93 indicates that no single state dominates, although Texas's share is about four times the 1.96% that a perfectly even spread over 51 states would give. No anomalies flagged. Treatment: Use as a categorical grouping key, or one-hot/target-encode for modelling. high · anthropic:claude-opus-4-7
n: 3,144
nulls: 0 (0.0%)
unique: 51
top_value: Texas
top_rate: 0.08079
cardinality: 51
entropy: 5.277
entropy_ratio: 0.9304
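
The entropy figures above are consistent with Shannon entropy in bits divided by its maximum, log2 of the cardinality. That reading is inferred from the reported numbers, not from a documented definition of the profiler, so the function below is a sketch under that assumption; the name `entropy_ratio` and the toy labels are illustrative.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Return (entropy_bits, ratio) for a sequence of category labels.

    ratio = H / log2(k), where k is the number of distinct labels;
    1.0 means a perfectly uniform spread.
    """
    counts = Counter(values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    k = len(counts)
    ratio = entropy / math.log2(k) if k > 1 else 1.0
    return entropy, ratio


# Check against the reported figures: entropy 5.277 bits, cardinality 51.
# The result is about 0.930, in line with the reported entropy_ratio of 0.9304.
print(5.277 / math.log2(51))

# Toy example with hypothetical labels: a uniform spread scores 1.0.
print(entropy_ratio(["A", "B", "C", "D"]))  # (2.0, 1.0)
```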

median_income

numeric feature high_skew
This column appears to be county-level median household income in dollars, with a median of 60,931 and quartiles at 52,544.5 and 70,605.25. The minimum of -666,666,666 is a sentinel for missing data rather than a real income; it drags the mean to -148,752.33 and produces a skew of -56.04 and kurtosis of 3,138.99. Apart from that contamination, 3,021 unique values across 3,144 rows and 135 outliers (4.29%) suggest an otherwise plausible distribution topping out at 170,463. Treatment: Replace the -666,666,666 sentinel with null and recompute the summary statistics, then consider log or robust scaling before modelling. high · anthropic:claude-opus-4-7
n: 3,144
nulls: 0 (0.0%)
unique: 3,021
min: -6.667e+08
max: 170,463
mean: -1.488e+05
median: 60,931
std: 1.189e+07
q1: 5.254e+04
q3: 7.061e+04
iqr: 1.806e+04
skew: -56.04
kurtosis: 3139
n_outliers: 135
outlier_rate: 0.04294
zero_rate: 0
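
The reported statistics for median_income also allow a rough check of how many rows carry the sentinel. The sketch below assumes -666,666,666 is the only non-income value in the column; the helper `sentinel_to_none` and the sample list are illustrative. It backs out the mean of the remaining rows for each candidate count and keeps only counts whose implied mean fits within the observed maximum.

```python
SENTINEL = -666_666_666  # minimum reported for median_income


def sentinel_to_none(values, sentinel=SENTINEL):
    """Replace the sentinel with None so it is treated as missing."""
    return [None if v == sentinel else v for v in values]


# Summary statistics reported above.
n, mean, max_income = 3144, -148_752.33, 170_463
total = n * mean

# For k sentinel rows, the remaining n - k rows must average to a value
# that lies inside the plausible income range.
for k in range(1, 4):
    implied_mean = (total - k * SENTINEL) / (n - k)
    feasible = 0 < implied_mean <= max_income
    print(k, round(implied_mean), feasible)
# Only k = 1 yields an implied mean (about 63,300) within the observed range.

# Hypothetical sample values to show the replacement.
sample = [60_931, SENTINEL, 170_463]
print(sentinel_to_none(sample))  # [60931, None, 170463]
```

Under that assumption, a single contaminated row accounts for the distorted mean, skew, and kurtosis. With pandas, `Series.replace(SENTINEL, float("nan"))` performs the same substitution.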

poverty_rate

numeric feature
Numeric poverty_rate spanning 1.60 to 55.10 with mean 13.82 and median 12.95, suggesting a percentage-style measure across 3144 rows (no nulls, no zeros). Distribution is right-skewed (skew 1.15, kurtosis 2.90) with 74 high-end outliers (2.35%) stretching the tail well past the Q3 of 16.77. Every one of the 3144 values is unique, consistent with a per-geography rate (e.g., one row per US county). Treatment: Consider a log or winsorising transform before regression to tame the right tail. high · anthropic:claude-opus-4-7
n: 3,144
nulls: 0 (0.0%)
unique: 3,144
min: 1.603
max: 55.1
mean: 13.82
median: 12.95
std: 5.702
q1: 9.699
q3: 16.77
iqr: 7.074
skew: 1.15
kurtosis: 2.901
n_outliers: 74
outlier_rate: 0.02354
zero_rate: 0
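
The winsorising option for poverty_rate can be sketched as follows. The report does not state which rule the profiler uses to flag outliers, so the conventional 1.5 × IQR Tukey fence used here is an assumption, and the function names and sample values are illustrative.

```python
def tukey_upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper fence of the conventional Tukey rule: q3 + k * (q3 - q1)."""
    return q3 + k * (q3 - q1)


def winsorise_upper(values, cap):
    """Clip values above `cap` down to `cap`; lower values are unchanged."""
    return [min(v, cap) for v in values]


# Quartiles reported above for poverty_rate.
fence = tukey_upper_fence(q1=9.699, q3=16.77)
print(round(fence, 2))  # 27.38

# Hypothetical sample including the reported min, median, and max.
sample = [1.603, 12.95, 30.0, 55.1]
print(winsorise_upper(sample, fence))
```

Capping only the upper tail matches the description above, where the flagged outliers sit at the high end.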

college_rate

numeric feature
Likely the percentage of college-educated residents in each county (the row count of 3,144 matches the other county-level columns). Values range from 0.0 to 56.35 with mean 16.26 and median 14.60; the distribution is right-skewed (skew 1.42) with 134 outliers (4.26%) on the high tail. The column is near-unique (3,143 of 3,144 values) with no nulls and a single zero observation. Treatment: Use as-is, or apply a mild sqrt or log1p transform to dampen the right skew before regression; a plain log is undefined for the zero row. high · anthropic:claude-opus-4-7
n: 3,144
nulls: 0 (0.0%)
unique: 3,143
min: 0
max: 56.35
mean: 16.26
median: 14.6
std: 7.005
q1: 11.48
q3: 19.37
iqr: 7.892
skew: 1.422
kurtosis: 2.751
n_outliers: 134
outlier_rate: 0.04262
zero_rate: 0.0003181

population

numeric feature high_skew outliers
This column reports a population count for 3,144 rows with no nulls and 3,080 unique values, consistent with one row per US county. The distribution is extremely right-skewed (skew 13.17, kurtosis 289.76): the median is 25,784.5 yet the mean is 105,310.94 and the max reaches 9,936,690, with 440 rows (14.0%) flagged as outliers. The std of 333,792 dwarfs the IQR of 57,244, confirming a heavy upper tail driven by a few very large jurisdictions. Treatment: log-transform before regression or modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 3,144
nulls: 0 (0.0%)
unique: 3,080
min: 50
max: 9.937e+06
mean: 1.053e+05
median: 2.578e+04
std: 3.338e+05
q1: 1.084e+04
q3: 6.808e+04
iqr: 57,244
skew: 13.17
kurtosis: 289.8
n_outliers: 440
outlier_rate: 0.1399
zero_rate: 0
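
The log transform recommended for population can be illustrated with the reported summary values alone. This is a minimal sketch; base 10 is an arbitrary choice made for readability, and the helper name `log10_population` is illustrative. A plain log is safe here because the reported minimum is 50 and zero_rate is 0.

```python
import math


def log10_population(values):
    """Return log10 of each population count; counts must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log10 requires strictly positive counts")
    return [math.log10(v) for v in values]


# Reported min, median, mean, and max for population.
reported = [50, 25_784.5, 105_310.94, 9_936_690]
for raw, logged in zip(reported, log10_population(reported)):
    print(f"{raw:>12,.1f} -> {logged:.2f}")
# The raw values span more than five orders of magnitude; after the
# transform they fall between roughly 1.7 and 7.0.
```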