saturn·

us-housing-affordability-crisis / housing_crisis_merged

source /home/coolhand/datasets/us-housing-affordability-crisis/housing_crisis_merged.csv 3,222 rows 16 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset covers 3,222 U.S. counties and county equivalents with 16 columns describing rental affordability (rents, incomes, renter shares, and burden percentages), keyed by FIPS code and county name. Six numeric fields (median_gross_rent, annual_rent, median_household_income, rent_to_income_ratio, hours_at_min_wage_for_rent, weeks_at_min_wage_for_rent) carry impossible negative values traceable to a -666666666 sentinel, which appears directly in median_gross_rent and median_household_income and in scaled form elsewhere (for example -7999999992 in annual_rent). These values drag the means deeply negative and produce skew between roughly -17 and -57, so they must be nulled before any analysis. The affordability_category field is also extremely imbalanced: 3,192 of 3,222 counties are labeled 'Affordable' (top_rate 0.99), so it offers little discriminatory signal as-is. The cleaner fields to start with are pct_rent_burdened_30plus (median 37.36%), pct_rent_burdened_50plus (median 17.62%), and pct_renter (median 26.07%), which are free of sentinels and reasonably well-behaved.

citing: annual_rent.stats.min · annual_rent.stats.skew · median_household_income.stats.min · median_household_income.stats.skew · rent_to_income_ratio.stats.skew · affordability_category.stats.top_rate · affordability_category.top_values · pct_rent_burdened_30plus.stats.median · pct_rent_burdened_50plus.stats.median · pct_renter.stats.median · row_count · column_count
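
A quick way to confirm which columns are affected is to scan for negative values in fields that should never be negative. The sketch below is a minimal illustration in plain Python; the function name `columns_with_negatives` and the toy rows are hypothetical, and it assumes the file has already been parsed into dictionaries of numbers.

```python
def columns_with_negatives(rows, columns):
    """Return {column: count of negative values} for columns that contain any."""
    counts = {}
    for col in columns:
        n_negative = sum(1 for row in rows if row[col] is not None and row[col] < 0)
        if n_negative:
            counts[col] = n_negative
    return counts


# Toy rows standing in for the real file (values are illustrative only).
rows = [
    {"median_gross_rent": 818, "median_household_income": 60458, "pct_renter": 26.07},
    {"median_gross_rent": -666666666, "median_household_income": 52000, "pct_renter": 31.2},
    {"median_gross_rent": -666666666, "median_household_income": -666666666, "pct_renter": 18.4},
]
checked = ["median_gross_rent", "median_household_income", "pct_renter"]
flagged = columns_with_negatives(rows, checked)
print(flagged)
```

Columns that come back flagged are candidates for sentinel cleaning; columns that do not (here `pct_renter`) can be used directly.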

Schema

16 columns
Per-column summary.

column                       type         nulls  unique  alerts
fips                         numeric      0.0%   3,222
county_name                  text         0.0%   3,222   near_unique
total_renters                numeric      0.0%   2,709   high_skew, outliers
pct_rent_burdened_30plus     numeric      0.0%   2,146
pct_rent_burdened_50plus     numeric      0.0%   1,769
median_gross_rent            numeric      0.0%   984     high_skew, outliers
median_household_income      numeric      0.0%   3,099   high_skew, outliers
total_housing_units          numeric      0.0%   3,074   high_skew, outliers
owner_occupied               numeric      0.0%   3,001   high_skew, outliers
renter_occupied              numeric      0.0%   2,709   high_skew, outliers
pct_renter                   numeric      0.0%   1,925
annual_rent                  numeric      0.0%   984     high_skew, outliers
rent_to_income_ratio         numeric      0.0%   1,278   high_skew
affordability_category       categorical  0.0%   3       imbalance
hours_at_min_wage_for_rent   numeric      0.0%   230     high_skew, outliers
weeks_at_min_wage_for_rent   numeric      0.0%   72      high_skew, outliers

fips

numeric identifier
This is the U.S. county FIPS code, a 5-digit geographic identifier whose first two digits encode the state. All 3,222 values are unique with no nulls, and the range 1001 to 72153 spans Alabama through Puerto Rico, consistent with a full national county roster. Because the column is stored as a number, leading zeros are lost (Alabama's 01001 appears as 1001). Distribution stats (mean 31377.89, skew 0.157) are not meaningful here since the values are codes, not quantities. Treatment: Treat as a categorical key, zero-padded to five characters; left-join on this code to bring in geographic attributes rather than using it as a numeric feature. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 3,222
min: 1,001
max: 72,153
mean: 3.138e+04
median: 30,022
std: 1.63e+04
q1: 1.903e+04
q3: 4.61e+04
iqr: 27,075
skew: 0.1574
kurtosis: -0.6314
n_outliers: 0
outlier_rate: 0
zero_rate: 0
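
Since the code is stored as a number, the reported minimum of 1001 has lost its leading zero. A minimal sketch of restoring the canonical 5-character key before joining follows; the helper names `normalize_fips` and `state_fips` are hypothetical.

```python
def normalize_fips(code):
    """Zero-pad a county FIPS code to its canonical 5-character string form."""
    return f"{int(code):05d}"


def state_fips(code):
    """The first two characters of the padded code identify the state."""
    return normalize_fips(code)[:2]


print(normalize_fips(1001), state_fips(1001))
print(normalize_fips(72153), state_fips(72153))
```

Joining on the padded string avoids silent mismatches against reference tables that store FIPS codes as text.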

county_name

text identifier near_unique
This column holds U.S. county names formatted as 'County, State' strings: 'county,' appears in 2,999 of 3,222 rows, and the top state tokens (texas 256, virginia 189, georgia 159) match the U.S. counties-by-state distribution. All 3,222 values are unique with no nulls or duplicates, and lengths cluster tightly (min 16, median 24, max 59), consistent with a clean canonical name field. The 223 rows not containing 'county,' likely reflect Louisiana parishes, Alaska boroughs, independent cities, and Puerto Rico municipios rather than data quality issues. Treatment: Use as a display label or as a secondary join key against county-level reference tables (fips is the more reliable key); do not feed raw into models. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 3,222
len_min: 16
len_max: 59
len_mean: 24.32
len_median: 24
len_p95: 31
word_mean: 3.248
word_median: 3
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 1,990
readability_flesch_mean: 10.28
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
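
If the state is needed as its own field, the 'County, State' format can be split on the last comma. This is a small sketch; `split_county_name` is a hypothetical helper, and the two example names are real counties used only for illustration.

```python
def split_county_name(name):
    """Split 'Area, State' on the last ', '; return (name, None) if there is no comma."""
    area, sep, state = name.rpartition(", ")
    if not sep:
        return name, None
    return area, state


print(split_county_name("Autauga County, Alabama"))
print(split_county_name("Orleans Parish, Louisiana"))
```

Splitting on the last separator keeps any comma inside the area name with the area rather than the state.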

total_renters

numeric feature high_skew outliers
This is a count of renters (most likely renter-occupied households) per county, ranging from 28 to 1,810,929 with a median of 2,579.5. The distribution is severely right-skewed (skew 15.82, kurtosis 398.15) with 449 outliers (13.9% of rows) and a mean (13,851) more than five times the median, indicating that a few very large counties dominate. There are no nulls or zeros, and 2,709 of the 3,222 values are distinct. Every summary statistic matches renter_occupied exactly, so the two columns are very likely duplicates and only one should be kept. Treatment: log-transform before regression to tame the extreme right skew. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 2,709
min: 28
max: 1.811e+06
mean: 1.385e+04
median: 2580
std: 5.535e+04
q1: 1004
q3: 7396
iqr: 6,392
skew: 15.82
kurtosis: 398.2
n_outliers: 449
outlier_rate: 0.1394
zero_rate: 0

pct_rent_burdened_30plus

numeric feature
This appears to be the percentage of renter households spending 30% or more of income on rent, reported per county. Values span 0 to 64.96 with a median of 37.36 and an IQR of 30.67 to 43.48, indicating rent burden is widespread rather than rare. The mild left skew (-0.57) and 58 outliers (1.8%) suggest a few areas with unusually low burden pull the tail down, and 0.25% of rows (8 counties) report exactly zero. Treatment: Use as-is as a continuous feature; the skew is mild enough that no transformation is needed. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 2,146
min: 0
max: 64.96
mean: 36.44
median: 37.36
std: 10.01
q1: 30.67
q3: 43.48
iqr: 12.81
skew: -0.5673
kurtosis: 0.5032
n_outliers: 58
outlier_rate: 0.018
zero_rate: 0.002483
pct_rent_burdened_50plus

numeric feature
This is a county-level feature giving the percentage of renter households spending 50% or more of income on rent, i.e. the severely rent-burdened. Values span 0 to 64.96 with a near-symmetric distribution (skew 0.05) centered at a median of 17.62%; the IQR of 8.56 and mean of 17.35 indicate a tight spread. The maximum equals the pct_rent_burdened_30plus maximum (64.96), which can only happen if every burdened renter in that county is severely burdened, so that row is worth a spot check. Only 47 outliers (1.46%) and a 0.93% zero rate (30 counties), with no nulls across 3,222 rows, suggest complete coverage. Treatment: Use as-is in modelling; standardize if combining with other scaled features. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 1,769
min: 0
max: 64.96
mean: 17.35
median: 17.62
std: 6.577
q1: 13.07
q3: 21.63
iqr: 8.557
skew: 0.05436
kurtosis: 0.9823
n_outliers: 47
outlier_rate: 0.01459
zero_rate: 0.009311
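
Because households paying 50% or more are a subset of those paying 30% or more, the 50plus share should never exceed the 30plus share in the same row. A hedged validation sketch follows; the function name and the toy rows are hypothetical.

```python
def inconsistent_burden_rows(rows):
    """Rows where the 50%+ share exceeds the 30%+ share, which should be impossible."""
    return [
        row for row in rows
        if row["pct_rent_burdened_50plus"] > row["pct_rent_burdened_30plus"]
    ]


# Illustrative rows only; the fips values are placeholders.
rows = [
    {"fips": "00001", "pct_rent_burdened_30plus": 37.36, "pct_rent_burdened_50plus": 17.62},
    {"fips": "00002", "pct_rent_burdened_30plus": 64.96, "pct_rent_burdened_50plus": 64.96},
    {"fips": "00003", "pct_rent_burdened_30plus": 20.00, "pct_rent_burdened_50plus": 25.00},
]
bad = inconsistent_burden_rows(rows)
print([row["fips"] for row in bad])
```

Equal values pass the check; only rows where the subset is larger than its superset are returned.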

median_gross_rent

numeric feature high_skew outliers
Likely the median gross monthly rent in dollars for each county, with a median of $817.50 and an interquartile range of $718 to $978. The column is contaminated by a sentinel: the minimum is -666666666, the placeholder the Census Bureau's American Community Survey uses when an estimate is unavailable, and the resulting mean of -2,068,220 (std about 3.7e7) is impossible for rent. The extreme negative skew (-17.87) and kurtosis (317.2) are what roughly ten sentinel rows out of 3,222 would produce. 235 outliers (7.3%) are flagged, but the central distribution looks plausible once sentinels are removed. Treatment: Replace the -666666666 sentinel with NaN before any modelling or aggregation. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 984
min: -6.667e+08
max: 2,805
mean: -2.068e+06
median: 817.5
std: 3.709e+07
q1: 718
q3: 978
iqr: 260
skew: -17.87
kurtosis: 317.2
n_outliers: 235
outlier_rate: 0.07294
zero_rate: 0
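
The treatment above amounts to mapping the sentinel to a missing value and recomputing statistics on what remains. A minimal standard-library sketch follows; `null_sentinels` is a hypothetical helper and the sample values are borrowed from the quartiles reported here, not from actual rows.

```python
from statistics import mean, median

SENTINEL = -666666666  # value reported as this column's minimum


def null_sentinels(values, sentinel=SENTINEL):
    """Replace the sentinel with None so it cannot leak into aggregates."""
    return [None if v == sentinel else v for v in values]


raw = [718, 817.5, 978, 2805, SENTINEL]
clean = null_sentinels(raw)
valid = [v for v in clean if v is not None]

print("mean with sentinel:", mean(raw))
print("mean without sentinel:", mean(valid))
print("median without sentinel:", median(valid))
```

A single sentinel among five values is enough to push the raw mean far below zero, which is the same effect seen in the column-level mean.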

median_household_income

numeric feature high_skew outliers
County-level median household income in dollars, with 3,099 unique values across 3,222 rows and no nulls. The distribution is contaminated by the same sentinel: the minimum is -666666666, and the mean of -144,603 is implausible against a median of 60,458.5. The extreme skew (-56.74) and kurtosis (3216.99) are what a single sentinel row out of 3,222 would produce. The IQR (51,814.75 to 70,376.25) looks like a typical US income distribution, and 188 outliers (5.83%) are flagged. Treatment: Replace -666666666 sentinels with nulls, then consider log-transform or winsorization before modelling. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 3,099
min: -6.667e+08
max: 170,463
mean: -1.446e+05
median: 6.046e+04
std: 1.175e+07
q1: 5.181e+04
q3: 7.038e+04
iqr: 1.856e+04
skew: -56.74
kurtosis: 3217
n_outliers: 188
outlier_rate: 0.05835
zero_rate: 0
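
The reported skew and kurtosis can be used to estimate how many rows hold the sentinel. When a sentinel is so far from the real data that everything else looks like a single point, the column behaves like a two-point distribution, and its skewness and excess kurtosis depend only on the share of sentinel rows. The sketch below applies that approximation; `sentinel_shape` is a hypothetical helper, and treating the valid rows as one point is a simplifying assumption.

```python
import math


def sentinel_shape(k, n):
    """Skewness and excess kurtosis when k of n rows sit at one far-lower value.

    Two-point approximation: the other n - k rows are treated as identical.
    """
    p = k / n
    variance = p * (1 - p)
    skew = -(1 - 2 * p) / math.sqrt(variance)
    excess_kurtosis = (1 - 6 * variance) / variance
    return skew, excess_kurtosis


for k in (1, 10):
    skew, kurt = sentinel_shape(k, 3222)
    print(k, round(skew, 2), round(kurt, 1))
```

With n = 3,222, one sentinel row reproduces this column's skew and kurtosis, and ten sentinel rows reproduce the -17.87 and 317.2 reported for median_gross_rent.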

total_housing_units

numeric feature high_skew outliers
Counts of total housing units across 3,222 rows with no nulls and 3,074 distinct values, consistent with one record per geographic area (e.g., county). The distribution is severely right-skewed (skew 12.05, kurtosis 240.5): the median is 10,021 but the mean is 39,403 and the max reaches 3,363,093, with 443 outliers (13.7%) above the IQR fence. Range spans 32 to 3.36M, indicating a mix of very small and very large jurisdictions. Treatment: log-transform before regression to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 3,074
min: 32
max: 3.363e+06
mean: 3.94e+04
median: 10,021
std: 1.201e+05
q1: 4211
q3: 25,939
iqr: 2.173e+04
skew: 12.05
kurtosis: 240.5
n_outliers: 443
outlier_rate: 0.1375
zero_rate: 0
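
One cheap sanity check on this column is that occupied units cannot exceed total units. The sketch below assumes total_housing_units includes vacant units and that owner_occupied plus renter_occupied covers all occupied units; neither definition is confirmed by the report, and `vacant_units` is a hypothetical helper.

```python
def vacant_units(row):
    """Implied vacant units; a negative result signals inconsistent counts."""
    return row["total_housing_units"] - row["owner_occupied"] - row["renter_occupied"]


# Illustrative row only (not an actual county).
row = {"total_housing_units": 10021, "owner_occupied": 7326, "renter_occupied": 2580}
print(vacant_units(row))
```

Rows with a negative result would deserve a closer look before the counts are used as features.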

owner_occupied

numeric feature high_skew outliers
Likely a count of owner-occupied housing units per county, given the integer-like range from 0 to 1,552,164 and median of 7,325.5. The distribution is severely right-skewed (skew 9.52, kurtosis 146.9) with 429 outliers (13.3%) and a mean (25,551.7) far above the median, indicating a few very large areas dominate. Near-unique values (3,001 of 3,222) and a single zero (zero_rate 0.03%) suggest one row per area rather than a categorical flag. Treatment: log-transform before regression to tame the heavy right tail, using log(1 + x) because of the zero. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 3,001
min: 0
max: 1.552e+06
mean: 2.555e+04
median: 7326
std: 6.755e+04
q1: 3148
q3: 1.886e+04
iqr: 1.572e+04
skew: 9.516
kurtosis: 146.9
n_outliers: 429
outlier_rate: 0.1331
zero_rate: 0.0003104
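
Because the minimum here is 0, a plain logarithm is undefined for at least one row. A small sketch of the log-transform using `math.log1p`, which computes log(1 + x), follows; the sample values are rounded from the summary statistics above.

```python
import math

# min, q1, median, q3, max (rounded from the stats above)
values = [0, 3148, 7326, 18860, 1552164]

logged = [math.log1p(v) for v in values]
print([round(x, 2) for x in logged])

try:
    math.log(0)
except ValueError as exc:
    print("plain log fails on zero:", exc)
```

The transform preserves order while compressing a range of more than a million units into a span of roughly 14 log units.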

renter_occupied

numeric feature high_skew outliers
Counts of renter-occupied housing units per county, ranging from 28 to 1,810,929. The distribution is extremely right-skewed (skew 15.82, kurtosis 398.15) with a median of 2,579.5 sitting far below the mean of 13,851, and 449 outliers (13.9% of rows) likely representing dense urban areas. Near-unique values (2,709 of 3,222) and the absence of nulls and zeros suggest a clean per-area tally rather than a categorical feature. Every summary statistic matches total_renters exactly, so the two columns are very likely duplicates. Treatment: log-transform before regression to tame the heavy right tail, and drop either this column or total_renters. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 2,709
min: 28
max: 1.811e+06
mean: 1.385e+04
median: 2580
std: 5.535e+04
q1: 1004
q3: 7396
iqr: 6,392
skew: 15.82
kurtosis: 398.2
n_outliers: 449
outlier_rate: 0.1394
zero_rate: 0
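
The statistics for this column are identical to those listed for total_renters, which suggests the two columns hold the same values. Matching summaries do not prove that, so a row-by-row comparison is the safer check. A minimal sketch with a hypothetical helper and toy rows:

```python
def columns_identical(rows, col_a, col_b):
    """True only if the two columns agree in every row."""
    return all(row[col_a] == row[col_b] for row in rows)


# Illustrative rows only.
rows = [
    {"total_renters": 28, "renter_occupied": 28},
    {"total_renters": 2580, "renter_occupied": 2580},
    {"total_renters": 1810929, "renter_occupied": 1810929},
]
print(columns_identical(rows, "total_renters", "renter_occupied"))
```

If the check passes on the real data, one of the two columns can be dropped without losing information.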

pct_renter

numeric feature
County-level share of renters (most likely renter-occupied units as a percentage of occupied units), spanning 3.01 to 100.0 with mean 27.35 and median 26.07. The distribution is right-skewed (skew 1.32, kurtosis 4.41) with 88 outliers (2.7%) and a long upper tail reaching 100, while the IQR is a tight 10.02 around the mid-20s. The 100% maximum is consistent with the one county whose owner_occupied count is 0. No nulls or zeros, and 1,925 unique values across 3,222 rows indicate granular but not unique measurements. Treatment: Consider a log or Yeo-Johnson transform before regression to tame the right tail. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 1,925
min: 3.01
max: 100
mean: 27.35
median: 26.07
std: 8.564
q1: 21.64
q3: 31.66
iqr: 10.02
skew: 1.317
kurtosis: 4.412
n_outliers: 88
outlier_rate: 0.02731
zero_rate: 0

annual_rent

numeric feature high_skew outliers
This is an annual rent figure in dollars, with the middle half of counties between 8,616 and 11,736 (median 9,810). Its quartiles, median, and maximum are exactly 12 times those of median_gross_rent, so the column appears to be derived from it. The minimum of -7999999992 is likewise 12 x -666666666, meaning the sentinel was multiplied through rather than treated as missing; it drags the mean to -24,818,640 and produces extreme skew (-17.87) and kurtosis (317.20). 235 outliers (7.3%) sit outside the IQR fence, so the tail is not a single rogue record. Treatment: Null the negative values (or recompute from a cleaned median_gross_rent) rather than clipping them, then log-transform before modelling. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 984
min: -8e+09
max: 33,660
mean: -2.482e+07
median: 9,810
std: 4.451e+08
q1: 8,616
q3: 11,736
iqr: 3,120
skew: -17.87
kurtosis: 317.2
n_outliers: 235
outlier_rate: 0.07294
zero_rate: 0
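
The statistics here line up with those of median_gross_rent in a way that is easy to verify: each reported value is 12 times its monthly counterpart. The check below uses only numbers quoted in this report; the dictionary names are for illustration.

```python
# Summary statistics as reported for the two columns.
monthly = {"min": -666666666, "q1": 718, "median": 817.5, "q3": 978, "max": 2805}
annual = {"min": -7999999992, "q1": 8616, "median": 9810, "q3": 11736, "max": 33660}

# Compare each annual statistic with 12 x the monthly one.
matches = {name: monthly[name] * 12 == annual[name] for name in monthly}
print(matches)
```

Every statistic matches, including the minimum, which supports reading the negative minimum as a scaled sentinel rather than an independent data-entry error.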

rent_to_income_ratio

numeric feature high_skew
This is a numeric feature capturing rent as a percentage of income, with the bulk of values clustered tightly between q1 15.07 and q3 19.39 around a median of 17.05. However, the column is badly corrupted: the minimum is -24,357,569.09, driving a mean of -37,244.13 and a std of 752,361.70, with skew -22.74 and kurtosis 570.21. Negative ratios are nonsensical here, and both extremes look like sentinel arithmetic rather than genuine variation: a sentinel rent divided by a real income gives a huge negative, and the maximum of 1200 equals 100 x (-7999999992 / -666666666), i.e. a row where rent and income are both sentinels. 114 outliers (3.54%) are flagged. Treatment: Recompute from cleaned rent and income, or filter to plausible non-negative ratios, before any modelling. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 1,278
min: -2.436e+07
max: 1,200
mean: -3.724e+04
median: 17.05
std: 7.524e+05
q1: 15.07
q3: 19.39
iqr: 4.317
skew: -22.74
kurtosis: 570.2
n_outliers: 114
outlier_rate: 0.03538
zero_rate: 0
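
One way to repair this column is to recompute it from validated inputs instead of filtering the stored values. The sketch below assumes the ratio is defined as 100 x annual_rent / median_household_income, which fits the reported magnitudes but is not stated in the report; `rent_to_income_pct` is a hypothetical helper.

```python
def rent_to_income_pct(annual_rent, income):
    """Rent as a percentage of income, or None when either input is unusable."""
    if annual_rent is None or income is None or annual_rent < 0 or income <= 0:
        return None
    return 100 * annual_rent / income


# What happens when two sentinels are divided without validation:
naive = 100 * -7999999992 / -666666666
print("naive ratio from two sentinels:", naive)

# Guarded version on a plausible pair and on the sentinel pair:
print(rent_to_income_pct(9810, 60458.5))
print(rent_to_income_pct(-7999999992, -666666666))
```

The unguarded division yields a positive, plausible-looking number from two invalid inputs, which is why a simple "drop negatives" filter would miss that row.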

affordability_category

categorical label imbalance
A categorical affordability bucket with three levels: Affordable, Moderately Burdened, and Extremely Burdened. The distribution is extremely imbalanced: 'Affordable' covers 3,192 of 3,222 rows (top_rate 0.9907), leaving only 29 'Moderately Burdened' and a single 'Extremely Burdened' record. An entropy ratio of 0.049 confirms there is almost no information in this column as-is. Treatment: Collapse to a binary affordable-vs-burdened flag or drop the column, since it is near-constant for modelling. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 3
top_value: Affordable
top_rate: 0.9907
cardinality: 3
entropy: 0.07815
entropy_ratio: 0.04931
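
The entropy figures can be reproduced from the three category counts given in the description. The sketch below assumes Shannon entropy in bits, with the ratio taken against the maximum entropy for three categories; those conventions are inferred from the numbers rather than documented.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


counts = {"Affordable": 3192, "Moderately Burdened": 29, "Extremely Burdened": 1}
h = entropy_bits(counts.values())
ratio = h / math.log2(len(counts))  # 1.0 would mean perfectly balanced classes
print(round(h, 5), round(ratio, 5))
```

Under those assumptions the result agrees with the reported entropy and entropy_ratio to the precision shown.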

hours_at_min_wage_for_rent

numeric feature high_skew outliers
This column appears to capture how many minimum-wage hours are needed to cover a month's rent, with a typical value around 113 hours (IQR 99 to 135). However, the data is corrupted by extreme negative values: the minimum is -91,954,023 and the mean is -285,271, despite a max of 387, producing severe negative skew (-17.87) and kurtosis of 317. The minimum equals -666666666 / 7.25, so the column looks derived from median_gross_rent at the $7.25 federal minimum wage, with the rent sentinel carried through. About 7.2% of rows (232) are flagged as outliers. Treatment: Null the negative values before use, then consider a log or robust transform. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 230
min: -9.195e+07
max: 387
mean: -2.853e+05
median: 113
std: 5.116e+06
q1: 99
q3: 135
iqr: 36
skew: -17.87
kurtosis: 317.2
n_outliers: 232
outlier_rate: 0.072
zero_rate: 0
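
The reported values here are consistent with monthly rent divided by an hourly wage of $7.25 and rounded to whole hours. That formula is an inference from the numbers, not something the report states; `hours_at_min_wage` is a hypothetical reconstruction of it.

```python
FEDERAL_MIN_WAGE = 7.25  # assumed USD per hour, inferred from the reported values


def hours_at_min_wage(monthly_rent, wage=FEDERAL_MIN_WAGE):
    """Whole hours of minimum-wage work needed to cover one month's rent."""
    return round(monthly_rent / wage)


# median_gross_rent statistics: min (sentinel), q1, median, q3, max
for rent in (-666666666, 718, 817.5, 978, 2805):
    print(rent, hours_at_min_wage(rent))
```

Feeding in the median_gross_rent statistics gives back this column's minimum, quartiles, median, and maximum, so cleaning the rent column first and recomputing should remove the negatives.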

weeks_at_min_wage_for_rent

numeric feature high_skew outliers
This column appears to capture how many weeks of full-time minimum-wage work are needed to cover a month's rent, with a typical value around 2.8 weeks (IQR 2.5 to 3.4). However, the data is severely corrupted: the minimum is -2,298,850.6 and the mean is -7,131.79, while the max is only 9.7, producing extreme skew (-17.87) and kurtosis (317.20) with 232 outliers (7.2% of rows). The minimum is the hours_at_min_wage_for_rent minimum divided by 40, so the negatives are the rent sentinel carried through a 40-hour-week conversion rather than independent errors. Treatment: Null the negative values as invalid before any modelling, then consider winsorizing the upper tail. high · anthropic:claude-opus-4-7
n: 3,222
nulls: 0 (0.0%)
unique: 72
min: -2.299e+06
max: 9.7
mean: -7132
median: 2.8
std: 1.279e+05
q1: 2.5
q3: 3.4
iqr: 0.9
skew: -17.87
kurtosis: 317.2
n_outliers: 232
outlier_rate: 0.072
zero_rate: 0
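
The two-step treatment (null the negatives, then tame the upper tail) might look like the sketch below. It caps values at the upper 1.5 x IQR fence as a simple stand-in for winsorizing, which is usually done at a chosen percentile; the helper name, the choice of cap, and the sample values are all assumptions.

```python
def clean_weeks(values, upper):
    """Null negative values, then cap what remains at `upper`."""
    cleaned = []
    for v in values:
        if v < 0:
            cleaned.append(None)           # invalid: sentinel carried through
        else:
            cleaned.append(min(v, upper))  # cap the upper tail
    return cleaned


q3, iqr = 3.4, 0.9                # from the stats above
upper_fence = q3 + 1.5 * iqr      # about 4.75 weeks

sample = [2.5, 2.8, 3.4, 9.7, -2298850.6]
cleaned = clean_weeks(sample, upper_fence)
print(cleaned)
```

Values inside the fence pass through unchanged, the 9.7 maximum is pulled down to the fence, and the negative becomes missing rather than being clipped to zero.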