saturn

data trove healthcare deserts

source: /home/coolhand/html/datavis/data_trove/data/healthcare/healthcare_desert_merged.csv
3,222 rows · 10 columns · profiled 2026-06-22

Reading

dataset summary · high confidence anthropic:default

This dataset covers healthcare access indicators for 3,222 U.S. counties and county equivalents, combining population size, uninsured counts and rates, poverty rates, and hospital closure risk scores. The most striking pattern is the extreme right skew in both total population and uninsured population: the median total population is 25,328 and the median uninsured count is 36, yet the maxima reach nearly 10 million people and over 20,000 uninsured, so a small number of large counties dominate the raw counts. Two things warrant a closer look. First, 84% of counties are rated 'Low' hospital closure risk, and the underlying risk score takes only 3 unique values, with nearly 29% of counties scoring exactly zero, so the scoring is coarser than its numeric type suggests. Second, 69% of counties are classified as Rural, and uninsured_rate ranges from 0 to 3.7 (median 0.12) with heavy right skew. The unit of that column is not stated, so the scale should be confirmed before reading the maximum as a percentage; either way, the upper tail points to pockets of comparatively severe coverage gaps worth isolating geographically.

citing: row_count · column_count · total_pop.stats.median · total_pop.stats.max · uninsured_pop.stats.median · uninsured_pop.stats.max · uninsured_rate.stats.max · uninsured_rate.stats.median · hospital_closure_risk_score.n_unique · hospital_closure_risk_score.stats.zero_rate · risk_category.top_value · risk_category.top_rate · rural_category.top_value · rural_category.top_rate · poverty_rate.stats.median · poverty_rate.stats.max

Schema

10 columns
Per-column summary.

column                        type         nulls  unique  alerts
fips                          numeric      0.0%   3,222
county_name                   text         0.0%   3,222   near_unique
total_pop                     numeric      0.0%   3,141   high_skew, outliers
uninsured_pop                 numeric      0.0%   584     high_skew, outliers
uninsured_rate                numeric      0.0%   152     high_skew, outliers
poverty_rate                  numeric      0.0%   1,719   high_skew
rural                         categorical  0.0%   2
rural_category                categorical  0.0%   2
hospital_closure_risk_score   numeric      0.0%   3
risk_category                 categorical  0.0%   2

fips

numeric identifier
This column contains US county FIPS (Federal Information Processing Standard) codes. County FIPS codes are five characters long (a two-digit state code followed by a three-digit county code); because the column is stored as a number, leading zeros are dropped, so the minimum appears as 1,001 rather than 01001. Every one of the 3,222 rows has a unique code with no nulls. The row count exceeds the roughly 3,143 counties and county equivalents usually counted for the states, and the maximum of 72153 carries the state prefix 72 (Puerto Rico), so territory records are included. FIPS codes are categorical identifiers, so arithmetic on them is meaningless, and the reported mean, skew (0.157), and kurtosis (-0.63) describe only how the codes happen to be numbered. Treatment: Cast to string, left-pad with zeros to five characters, and use as a geographic join key; do not use as a numeric feature. high · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 3,222
min: 1,001
max: 72,153
mean: 3.138e+04
median: 30,022
std: 1.63e+04
q1: 1.903e+04
q3: 4.61e+04
iqr: 27,075
skew: 0.1574
kurtosis: -0.6314
n_outliers: 0
outlier_rate: 0
zero_rate: 0
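The treatment above can be sketched as follows. This is a minimal illustration, not the profiler's own code; the helper names `normalize_fips` and `split_fips` are hypothetical.

```python
def normalize_fips(code) -> str:
    """Return a county FIPS code as a zero-padded 5-character string."""
    text = str(int(code))  # int() also accepts strings such as "1001"
    if len(text) > 5:
        raise ValueError(f"not a county FIPS code: {code!r}")
    return text.zfill(5)


def split_fips(code) -> tuple[str, str]:
    """Split a county FIPS code into (state code, county code)."""
    fips = normalize_fips(code)
    return fips[:2], fips[2:]


print(normalize_fips(1001))  # the minimum value in the profile
print(split_fips(72153))     # the maximum value in the profile
```

Padding before joining matters because other sources commonly store these codes as text, where "01001" and "1001" would not match.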

county_name

text label near_unique
This column contains fully qualified county name strings, most likely formatted as '<Name> County, <State>': the token 'county,' appears in 2,999 of 3,222 rows, the mean string length is about 24 characters, and the mean word count is about 3.25. The remaining 223 rows presumably use other designations (for example Parish, Borough, Municipio, or an independent city), so a parser should not assume the word 'County' is present. Every row is unique (n_unique = 3,222, duplicate_rate = 0.0), which triggers the near_unique alert; that is expected for a label combining county and state. The most frequent state names are Texas (256), Virginia (189), and Georgia (159). These counts mostly reflect how many counties each state has (Texas has 254 and Georgia 159) rather than overrepresentation, and the Virginia figure probably also picks up West Virginia rows. Treatment: Parse into county and state components for join or groupby operations; do not treat as a free-text feature. high · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 3,222
len_min: 16
len_max: 59
len_mean: 24.32
len_median: 24
len_p95: 31
word_mean: 3.248
word_median: 3
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 1,990
readability_flesch_mean: 10.28
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
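A minimal parsing sketch for the treatment above. It assumes every label ends in ', <State>', which the profile suggests but does not confirm. The function name `split_county_name` and the example labels are hypothetical, not rows read from the file.

```python
def split_county_name(label: str) -> tuple[str, str]:
    """Split '<Name> County, <State>' at the last comma into (county, state)."""
    county, sep, state = label.rpartition(",")
    if not sep:
        raise ValueError(f"no comma in label: {label!r}")
    return county.strip(), state.strip()


# Illustrative labels only; the second shows a non-'County' designation.
examples = ["Autauga County, Alabama", "Yauco Municipio, Puerto Rico"]
for label in examples:
    print(split_county_name(label))
```

Splitting at the last comma rather than on the word 'County' keeps the parser working for parishes, boroughs, municipios, and independent cities.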

total_pop

numeric feature high_skew outliers
This column holds the total population of each county. The distribution is severely right-skewed (skew = 13.38, kurtosis = 298.69): the median is 25,328 while the mean is 102,232, a gap driven by a small number of very large counties. The maximum of 9,866,623 is roughly 390× the median. With 453 flagged outliers (14.1% of rows) and a standard deviation of 326,934, raw values will dominate any distance- or variance-sensitive model. Treatment: Log-transform (log1p) before modelling to compress the extreme right tail. high · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 3,141
min: 47
max: 9.867e+06
mean: 1.022e+05
median: 25,328
std: 3.269e+05
q1: 1.061e+04
q3: 65,190
iqr: 5.458e+04
skew: 13.38
kurtosis: 298.7
n_outliers: 453
outlier_rate: 0.1406
zero_rate: 0
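To illustrate how much the recommended log1p transform compresses the tail, the sketch below applies it to the summary values reported above. It uses only those three numbers, not the underlying rows.

```python
import math

# Summary values copied from the total_pop profile above.
pop_min, pop_median, pop_max = 47, 25_328, 9_866_623

raw_ratio = pop_max / pop_median
log_ratio = math.log1p(pop_max) / math.log1p(pop_median)

print(f"max/median on the raw scale:   {raw_ratio:.1f}")
print(f"max/median on the log1p scale: {log_ratio:.2f}")

for label, value in [("min", pop_min), ("median", pop_median), ("max", pop_max)]:
    print(f"{label:>6}: {value:>9,} -> {math.log1p(value):.2f}")
```

On the transformed scale the largest county is less than twice the median rather than hundreds of times larger, which is the property distance-based models need.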

uninsured_pop

numeric feature high_skew outliers
This column holds the count of uninsured individuals in each county. The distribution is extremely right-skewed (skew = 17.81, kurtosis = 462.87): the median is just 36, the mean is 159.95, and the maximum is 20,915, so a small number of large counties dominate the tail. 17.2% of rows are zero and 368 rows (11.4%) are flagged as outliers. The counts are also very small relative to total_pop (a median of 36 against a median population of 25,328), so it is worth confirming which population or survey base the count refers to. Treatment: Log-transform with log1p, which handles the zeros, before regression or clustering; for comparisons across counties, prefer a population-normalized measure and check it against the existing uninsured_rate column. high · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 584
min: 0
max: 20,915
mean: 159.9
median: 36
std: 627.2
q1: 7
q3: 120
iqr: 113
skew: 17.81
kurtosis: 462.9
n_outliers: 368
outlier_rate: 0.1142
zero_rate: 0.1723
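Because 17.2% of the rows are zero, the choice of log1p over a plain logarithm matters. The sketch below shows why, using the five-number summary reported above as stand-in values; it is an illustration, not a transformation of the real rows.

```python
import math

# min, q1, median, q3, max copied from the uninsured_pop profile above.
counts = [0, 7, 36, 120, 20_915]

transformed = [math.log1p(c) for c in counts]          # log(1 + x), defined at 0
restored = [round(math.expm1(t)) for t in transformed]  # expm1 inverts log1p

print([f"{t:.2f}" for t in transformed])
print(restored)

try:
    math.log(0)
except ValueError as exc:
    print("plain log fails on zero:", exc)
```

Using expm1 to invert the transform lets model outputs be reported back on the original count scale.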

uninsured_rate

numeric feature high_skew outliers
This column is the uninsured rate for each county, presumably the share of residents without health insurance. With only 152 unique values across 3,222 rows, the values appear to be rounded. The distribution is strongly right-skewed (skew 4.10, kurtosis 27.70), with a median of 0.12, a mean of 0.20, and a max of 3.7. A maximum above 1.0 rules out a pure 0-to-1 proportion. One plausible reading is that the values are percentages: the ratio of the median uninsured_pop to the median total_pop (36 / 25,328 ≈ 0.14%) is close to the median rate of 0.12, in which case 3.7 would mean 3.7%, not 370%. That is a consistency check on summary statistics, not proof, and 230 rows (7.1%) are flagged as outliers under either reading. The 17.5% zero rate is close to the 17.2% zero rate in uninsured_pop, so most zeros probably mirror zero counts rather than missing data, though the small gap deserves a row-level check. Treatment: Confirm the unit against the source documentation, cross-check zeros against uninsured_pop, then apply log1p or another transform suited to the confirmed scale before modelling. medium · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 152
min: 0
max: 3.7
mean: 0.2002
median: 0.12
std: 0.2829
q1: 0.04
q3: 0.25
iqr: 0.21
skew: 4.095
kurtosis: 27.7
n_outliers: 230
outlier_rate: 0.07138
zero_rate: 0.1754
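One way to investigate the scale and the zeros is sketched below. The first part compares the reported medians; a ratio of medians is not the median of ratios, so it is only a rough indication of the unit. The second part is a hypothetical row-level helper, `classify_zero`, that separates zeros backed by a zero count from zeros that are not.

```python
# Summary values copied from the profiles above.
median_uninsured_pop = 36
median_total_pop = 25_328
median_rate = 0.12

implied_fraction = median_uninsured_pop / median_total_pop
implied_percent = 100 * implied_fraction

print(f"implied fraction: {implied_fraction:.4f}")
print(f"implied percent:  {implied_percent:.2f}")
print(f"reported median uninsured_rate: {median_rate}")


def classify_zero(uninsured_pop: int, uninsured_rate: float) -> str:
    """Label a row by whether a zero rate is backed by a zero count."""
    if uninsured_rate != 0:
        return "nonzero rate"
    if uninsured_pop == 0:
        return "true zero"
    return "rounded to zero or inconsistent"


# Made-up rows for illustration: (uninsured_pop, uninsured_rate).
for row in [(0, 0.0), (3, 0.0), (36, 0.12)]:
    print(row, "->", classify_zero(*row))
```

Rows in the last category are the ones to inspect before deciding whether any zeros should be recoded as missing.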

poverty_rate

numeric feature high_skew
This column is the poverty rate, most likely the percentage of each county's population below the poverty threshold, with no nulls and 1,719 unique values across 3,222 rows. The distribution is right-skewed (skew = 2.10, kurtosis = 6.89): the median is 13.55% and the mean is 15.10%, pulled up by an upper tail that reaches 66.32%, nearly 5× the median. The 137 flagged outliers (4.25% of rows) are concentrated in that upper tail and likely represent unusually deprived counties, which warrant attention in any model. Treatment: Log-transform or apply a Box-Cox transformation before regression to reduce skew (all values are positive, with a minimum of 1.6, so both are defined); inspect the 137 outliers above the upper fence to separate data-quality problems from genuine extremes. high · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 1,719
min: 1.6
max: 66.32
mean: 15.1
median: 13.55
std: 7.706
q1: 10.16
q3: 17.91
iqr: 7.75
skew: 2.096
kurtosis: 6.891
n_outliers: 137
outlier_rate: 0.04252
zero_rate: 0
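The "upper fence" mentioned in the treatment can be reproduced from the quartiles above. The sketch assumes the common Tukey rule (1.5 × IQR beyond the quartiles); the report does not state which rule the profiler uses.

```python
# Quartiles copied from the poverty_rate profile above.
q1, q3 = 10.16, 17.91
iqr = q3 - q1

# Tukey fences (assumed rule: 1.5 * IQR beyond each quartile).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_outlier(value: float) -> bool:
    """True when a value lies outside the Tukey fences."""
    return value < lower_fence or value > upper_fence


print(f"lower fence: {lower_fence:.3f}")
print(f"upper fence: {upper_fence:.3f}")
print("max flagged:", is_outlier(66.32))
print("median flagged:", is_outlier(13.55))
```

Under this rule the lower fence falls below the column minimum of 1.6, so every flagged value would sit in the upper tail.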

rural

categorical feature
This column is a binary flag indicating whether a county is rural, stored as the strings 'True' and 'False' rather than as a native boolean. 'True' is the dominant class (2,212 of 3,222 rows, about 68.7%), so analysts should account for the roughly 2:1 rural to non-rural imbalance in any comparative or predictive analysis. The class shares are identical to those of rural_category, so the two columns probably encode the same information. Treatment: Parse the strings explicitly into booleans, then use as a binary feature or stratification variable; monitor the class imbalance during modelling. high · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 2
top_value: True
top_rate: 0.6865
cardinality: 2
entropy: 0.8971
entropy_ratio: 0.8971
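Casting these strings needs an explicit mapping, because Python treats any non-empty string as true. A minimal sketch, with a hypothetical helper `parse_bool` and made-up sample values:

```python
def parse_bool(text: str) -> bool:
    """Map the strings 'True'/'False' to booleans, rejecting anything else."""
    normalized = text.strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    raise ValueError(f"unexpected value: {text!r}")


raw = ["True", "False", "True"]  # illustrative values, not rows from the file
flags = [parse_bool(value) for value in raw]
print(flags)

# The pitfall: bool() on a non-empty string is always True.
print(bool("False"))
```

Raising on unexpected values surfaces stray encodings such as blanks or 'Yes'/'No' instead of silently miscoding them.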

rural_category

categorical label
This column is a binary geographic classification indicating whether a county is Rural or Urban/Suburban. Across 3,222 records with no nulls, 'Rural' accounts for 68.7% (2,212 records) and 'Urban/Suburban' for 31.3% (1,010 records). The entropy ratio of 0.897 indicates a moderate rather than extreme imbalance (a 50/50 split would give 1.0). The shares match the rural flag exactly, so this column is probably a relabelling of it, and including both in a model would add a redundant feature. Analysts should also be aware that the imbalance could bias models trained without stratification or reweighting. Treatment: Binary encode, keeping only one of rural and rural_category if they prove identical; consider stratified sampling or class weighting to address the 69/31 imbalance. high · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 2
top_value: Rural
top_rate: 0.6865
cardinality: 2
entropy: 0.8971
entropy_ratio: 0.8971
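The entropy figures above can be reproduced from the class counts. The sketch assumes entropy is measured in bits and that entropy_ratio is the entropy divided by log2 of the cardinality; the report does not define either, but both assumptions are consistent with the reported values.

```python
import math


def entropy_bits(counts) -> float:
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Class counts taken from the rural_category description above.
counts = {"Rural": 2_212, "Urban/Suburban": 1_010}

entropy = entropy_bits(counts.values())
entropy_ratio = entropy / math.log2(len(counts))  # assumed definition

print(f"entropy:       {entropy:.4f}")
print(f"entropy_ratio: {entropy_ratio:.4f}")
```

With two classes the maximum entropy is 1 bit, which is why the entropy and the ratio coincide for this column.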

hospital_closure_risk_score

numeric feature
This column is presented as a numeric risk score but contains only 3 unique values across 3,222 rows, almost certainly 0, 25, and 50: the minimum and Q1 are 0, the median and Q3 are 25, and the maximum is 50. That makes it a de facto ordinal variable with three tiers rather than a continuous score. 28.8% of records are zero, and the low skew (0.14) and absence of outliers are what a three-level variable would be expected to produce. Treatment: Treat as ordinal categorical with three levels (0/25/50); one-hot encode or ordinal-encode before modelling rather than using raw numeric values. high · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 3
min: 0
max: 50
mean: 21.69
median: 25
std: 16.34
q1: 0
q3: 25
iqr: 25
skew: 0.1414
kurtosis: -0.6949
n_outliers: 0
outlier_rate: 0
zero_rate: 0.2883
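If the three values really are 0, 25, and 50 (an assumption, since the report lists only the count of unique values), the share of each tier follows from the reported mean and zero rate. The sketch below solves for the shares and checks them against the reported standard deviation.

```python
import math

# Summary values copied from the profile above.
mean, zero_rate, n_rows = 21.69, 0.2883, 3_222

# Assumed tiers: 0, 25, 50. Two equations in two unknowns:
#   25 * p25 + 50 * p50 = mean
#   p25 + p50           = 1 - zero_rate
nonzero = 1 - zero_rate
p50 = mean / 25 - nonzero
p25 = nonzero - p50

# Population standard deviation implied by those shares.
implied_std = math.sqrt(25**2 * p25 + 50**2 * p50 - mean**2)

print(f"share at 0:  {zero_rate:.4f}")
print(f"share at 25: {p25:.4f}")
print(f"share at 50: {p50:.4f} (about {round(p50 * n_rows)} rows)")
print(f"implied std: {implied_std:.2f} (reported: 16.34)")
```

The implied standard deviation agrees closely with the reported one, which supports the 0/25/50 reading. The implied number of rows at 50 is also close to the 503 'Moderate' rows in risk_category, which is consistent with, but does not prove, a mapping between the two columns.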

risk_category

categorical label
This column is a binary risk classification label with exactly two categories, 'Low' and 'Moderate'. The distribution is heavily imbalanced: 'Low' accounts for 84.4% of all 3,222 rows (2,719 records), leaving only 503 'Moderate' cases. The complete absence of higher risk tiers (e.g., 'High', 'Critical') is notable and may indicate data filtering, a low-risk population, or an incomplete taxonomy. It is also worth checking how the label maps onto hospital_closure_risk_score, since a three-level score and a two-level label cannot correspond one to one. With zero nulls and only two values, the column is clean but class-imbalanced. Treatment: Encode as binary (0/1); if it is used as a prediction target, apply class-imbalance handling (e.g., class weights or SMOTE) before modelling. high · anthropic:default
n: 3,222
nulls: 0 (0.0%)
unique: 2
top_value: Low
top_rate: 0.8439
cardinality: 2
entropy: 0.6249
entropy_ratio: 0.6249
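As one option for the class-imbalance handling mentioned above, the sketch below computes inverse-frequency class weights from the reported counts. It uses the n / (k × count) heuristic, the same formula scikit-learn documents for class_weight="balanced"; whether weighting or resampling suits a given model is a separate decision.

```python
# Class counts taken from the risk_category description above.
counts = {"Low": 2_719, "Moderate": 503}

n_rows = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: each class contributes equally in total.
weights = {label: n_rows / (n_classes * count) for label, count in counts.items()}

# Binary encoding for the label itself (1 marks the minority class).
encoding = {"Low": 0, "Moderate": 1}

for label, weight in weights.items():
    print(f"{label:>8}: code={encoding[label]} weight={weight:.4f}")
```

With these weights a 'Moderate' row counts for a little over five times as much as a 'Low' row, so both classes carry the same total weight in the loss.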