data trove nyc housing analysis

source /home/coolhand/html/datavis/data_trove/economic/housing/nyc/nyc_housing_metrics_merged.csv 2,327 rows 23 columns profiled 2026-06-21

Reading

dataset summary · high confidence anthropic:default

This dataset covers housing affordability metrics for 2,327 census tracts across New York City's five boroughs, with variables spanning rent burden, household income, gross rent, and tenure type. The most urgent data quality issue is that both `median_gross_rent` and `median_household_income` contain extreme negative sentinel values (min of -666,666,666), which wildly distort their means and standard deviations; these columns must be filtered or recoded before any analysis. Substantively, rent burden is the headline story: the median tract has 50% of renter households paying 30% or more of income on rent (`pct_rent_burdened` median = 50.0), and severe burden (≥50% of income) affects a median of 26.2% of renters per tract. Brooklyn leads in tract count (805 tracts, 34.6% of the dataset), followed by Queens (725) and the Bronx (361), so borough-level comparisons are feasible but uneven in sample size.

citing: median_gross_rent.stats.min · median_household_income.stats.min · pct_rent_burdened.stats.median · pct_severe_burden.stats.median · county_name.top_values · county_name.stats.top_rate · pct_rent_burdened.stats.mean · pct_owner_occupied.stats.median

Charts the summary said to look at first

pct_rent_burdened · Look for the concentration around 50%: about half of the tracts with data have at least half of their renter households rent-burdened, so the burden is widespread rather than confined to a few outlier tracts.

Histogram bins for pct_rent_burdened (median: 50.0).
bin	count
0 – 2.5	8
2.5 – 5	3
5 – 7.5	1
7.5 – 10	5
10 – 12.5	7
12.5 – 15	8
15 – 17.5	12
17.5 – 20	14
20 – 22.5	14
22.5 – 25	24
25 – 27.5	35
27.5 – 30	42
30 – 32.5	53
32.5 – 35	80
35 – 37.5	91
37.5 – 40	119
40 – 42.5	129
42.5 – 45	144
45 – 47.5	146
47.5 – 50	177
50 – 52.5	139
52.5 – 55	178
55 – 57.5	162
57.5 – 60	131
60 – 62.5	117
62.5 – 65	97
65 – 67.5	60
67.5 – 70	57
70 – 72.5	54
72.5 – 75	28
75 – 77.5	20
77.5 – 80	25
80 – 82.5	8
82.5 – 85	4
85 – 87.5	12
87.5 – 90	5
90 – 92.5	5
92.5 – 95	3
95 – 97.5	0
97.5 – 100	8
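
The bin counts above allow a quick consistency check on the reported median. The following is a minimal sketch in Python; it assumes each bin includes its left edge, which the profile does not state.

```python
# Bin counts copied from the pct_rent_burdened table above (40 bins, 2.5 points wide).
counts = [8, 3, 1, 5, 7, 8, 12, 14, 14, 24, 35, 42, 53, 80, 91, 119, 129, 144,
          146, 177, 139, 178, 162, 131, 117, 97, 60, 57, 54, 28, 20, 25, 8, 4,
          12, 5, 5, 3, 0, 8]
BIN_WIDTH = 2.5

total = sum(counts)  # number of tracts with a non-null value


def median_bin(counts, width):
    """Return the (low, high) edges of the bin holding the middle observation."""
    middle = (sum(counts) + 1) / 2
    running = 0
    for i, c in enumerate(counts):
        running += c
        if running >= middle:
            return (i * width, (i + 1) * width)
    raise ValueError("empty histogram")


# Share of tracts in bins that start at 30 or higher (index 12 onward).
share_30_plus = sum(counts[12:]) / total

print(total, median_bin(counts, BIN_WIDTH), round(share_30_plus, 3))
```

The total of 2,225 equals the 2,327 rows minus the 102 nulls reported for this column, and the middle observation falls in the bin that starts at 50, which agrees with the reported median of 50.0.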

pct_severe_burden · The distribution of severe rent burden (≥50% of income) shows how many tracts have a large share of deeply cost-stressed renters.

Histogram bins for pct_severe_burden (median: 26.2).
bin	count
0 – 2.5	45
2.5 – 5	14
5 – 7.5	41
7.5 – 10	53
10 – 12.5	94
12.5 – 15	115
15 – 17.5	131
17.5 – 20	160
20 – 22.5	170
22.5 – 25	188
25 – 27.5	188
27.5 – 30	168
30 – 32.5	173
32.5 – 35	157
35 – 37.5	115
37.5 – 40	97
40 – 42.5	73
42.5 – 45	62
45 – 47.5	44
47.5 – 50	35
50 – 52.5	29
52.5 – 55	19
55 – 57.5	18
57.5 – 60	12
60 – 62.5	6
62.5 – 65	4
65 – 67.5	4
67.5 – 70	2
70 – 72.5	1
72.5 – 75	1
75 – 77.5	1
77.5 – 80	1
80 – 82.5	0
82.5 – 85	1
85 – 87.5	1
87.5 – 90	0
90 – 92.5	1
92.5 – 95	0
95 – 97.5	0
97.5 – 100	1

county_name · Compare tract counts across the five boroughs — Brooklyn and Queens dominate, so borough-level averages need weighting.

Top values for county_name (5 unique shown, of 5 total).
value	count	share
Brooklyn (Kings)	805	34.6%
Queens	725	31.2%
Bronx	361	15.5%
Manhattan (New York)	310	13.3%
Staten Island (Richmond)	126	5.4%
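
Because tract counts differ this much, a plain average of five borough figures gives Staten Island the same weight as Brooklyn. The sketch below contrasts an unweighted and a tract-weighted mean. The tract counts come from the table above; the per-borough rates are made-up placeholders, not values from the dataset.

```python
# Tract counts from the county_name table above.
tracts = {
    "Brooklyn (Kings)": 805,
    "Queens": 725,
    "Bronx": 361,
    "Manhattan (New York)": 310,
    "Staten Island (Richmond)": 126,
}

# HYPOTHETICAL borough-level rates, for illustration only.
example_rate = {
    "Brooklyn (Kings)": 50.0,
    "Queens": 50.0,
    "Bronx": 60.0,
    "Manhattan (New York)": 40.0,
    "Staten Island (Richmond)": 45.0,
}


def weighted_mean(values, weights):
    """Mean of values[k], each weighted by weights[k]."""
    total_weight = sum(weights[k] for k in values)
    return sum(values[k] * weights[k] for k in values) / total_weight


unweighted = sum(example_rate.values()) / len(example_rate)
weighted = weighted_mean(example_rate, tracts)
print(unweighted, round(weighted, 2))
```

Weighting by tract count treats every tract equally; weighting by total_renter_households instead would answer a different question (the rate for the average renter household), so the choice of weight should follow the question being asked.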

pct_renter_occupied · Most tracts skew heavily toward renting (median 65.6%), but the wide spread reveals pockets of higher ownership worth cross-tabbing by borough.

Histogram bins for pct_renter_occupied (median: 65.6).
bin	count
0 – 2.5	9
2.5 – 5	6
5 – 7.5	7
7.5 – 10	23
10 – 12.5	24
12.5 – 15	26
15 – 17.5	27
17.5 – 20	32
20 – 22.5	47
22.5 – 25	31
25 – 27.5	48
27.5 – 30	40
30 – 32.5	34
32.5 – 35	50
35 – 37.5	39
37.5 – 40	46
40 – 42.5	41
42.5 – 45	53
45 – 47.5	57
47.5 – 50	72
50 – 52.5	54
52.5 – 55	54
55 – 57.5	73
57.5 – 60	62
60 – 62.5	69
62.5 – 65	72
65 – 67.5	60
67.5 – 70	65
70 – 72.5	72
72.5 – 75	75
75 – 77.5	91
77.5 – 80	94
80 – 82.5	92
82.5 – 85	73
85 – 87.5	64
87.5 – 90	83
90 – 92.5	65
92.5 – 95	71
95 – 97.5	83
97.5 – 100	147

pct_owner_occupied · Owner-occupancy ranges from 0 to 100% with a median of 34.4% — identify tracts with unusually high ownership as potential contrast cases.

Histogram bins for pct_owner_occupied (median: 34.4).
bin	count
0 – 2.5	141
2.5 – 5	86
5 – 7.5	71
7.5 – 10	63
10 – 12.5	86
12.5 – 15	65
15 – 17.5	72
17.5 – 20	88
20 – 22.5	98
22.5 – 25	88
25 – 27.5	79
27.5 – 30	67
30 – 32.5	70
32.5 – 35	56
35 – 37.5	76
37.5 – 40	68
40 – 42.5	59
42.5 – 45	72
45 – 47.5	58
47.5 – 50	54
50 – 52.5	70
52.5 – 55	59
55 – 57.5	55
57.5 – 60	39
60 – 62.5	44
62.5 – 65	40
65 – 67.5	52
67.5 – 70	34
70 – 72.5	40
72.5 – 75	47
75 – 77.5	32
77.5 – 80	48
80 – 82.5	31
82.5 – 85	27
85 – 87.5	26
87.5 – 90	24
90 – 92.5	23
92.5 – 95	8
95 – 97.5	6
97.5 – 100	9

Schema

23 columns

Per-column summary.
column	type	null %	unique	alerts
total_renter_households	numeric	0.0%	1,418
rent_30_to_34_9_pct	numeric	0.0%	355	high_skew outliers
rent_35_to_39_9_pct	numeric	0.0%	270	high_skew
rent_40_to_49_9_pct	numeric	0.0%	322	high_skew
rent_50_pct_or_more	numeric	0.0%	706
NAME	text	0.0%	2,327	near_unique
state	numeric	0.0%	1	constant
county	numeric	0.0%	5
tract	numeric	0.0%	1,530	high_skew
county_name	categorical	0.0%	5
moderate_burden	numeric	0.0%	639
severe_burden	numeric	0.0%	706
pct_moderate_burden	numeric	4.4%	461
pct_severe_burden	numeric	4.4%	518
rent_burdened	numeric	0.0%	1,013
pct_rent_burdened	numeric	4.4%	596
median_gross_rent	numeric	0.0%	1,232	high_skew outliers
median_household_income	numeric	0.0%	2,106	high_skew outliers
total_households	numeric	0.0%	1,495
owner_occupied	numeric	0.0%	1,001	outliers
renter_occupied	numeric	0.0%	1,418
pct_owner_occupied	numeric	4.1%	823
pct_renter_occupied	numeric	4.1%	823

total_renter_households

numeric feature

This column counts the renter-occupied households in each census tract, with values ranging from 0 to 8,209. The distribution is notably right-skewed (skew 1.59) with excess kurtosis of 4.63, indicating a long tail of high-density rental areas pulling the mean (946) well above the median (726). About 4.4% of records are zero, meaning some tracts have no renters at all, and 69 outliers (~3%) represent unusually dense rental markets that may warrant separate treatment. Every statistic here matches renter_occupied exactly, so the two columns appear to be duplicates and only one should enter a model. Treatment: Log-transform (or log1p to handle zeros) before regression or distance-based modelling to reduce skew impact. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 1,418
min: 0
max: 8,209
mean: 946.1
median: 726
std: 815.4
q1: 346
q3: 1,357
iqr: 1,011
skew: 1.595
kurtosis: 4.627
n_outliers: 69
outlier_rate: 0.02965
zero_rate: 0.04383
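
A minimal sketch of the log1p treatment, applied to the summary statistics reported above rather than to the raw column (which is not reproduced here). The helper names are illustrative.

```python
import math


def log1p_transform(values):
    """Apply log(1 + x), so a zero count maps to 0.0 instead of failing."""
    return [math.log1p(v) for v in values]


def upper_to_lower_spread(q1, median, maximum):
    """Ratio of the upper tail length to the lower-quartile gap."""
    return (maximum - median) / (median - q1)


# Reported q1, median, and max for total_renter_households.
q1, med, mx = 346, 726, 8209

raw_ratio = upper_to_lower_spread(q1, med, mx)
log_ratio = upper_to_lower_spread(*log1p_transform([q1, med, mx]))
print(round(raw_ratio, 1), round(log_ratio, 1))
```

On the raw scale the upper tail is roughly twenty times longer than the gap between Q1 and the median; after log1p the ratio drops to about three, which is the compression the treatment is after.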

rent_30_to_34_9_pct

numeric feature high_skew outliers

This column represents a count (or weighted count) of households paying 30–34.9% of income toward rent — a standard housing cost-burden threshold bracket. The distribution is severely right-skewed (skew=2.76, kurtosis=13.86), with a median of 51 but a mean of 83 and a max of 1205, indicating a long tail driven by high-density geographies. About 16.2% of rows are zero (areas with no such households or suppressed data), and 124 observations (~5.3%) are flagged as outliers, suggesting a mix of very small and very large geographic units in the dataset. Treatment: Log-transform (log1p) before modelling to address severe right skew; consider normalising by total renter households to convert counts to rates. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 355
min: 0
max: 1,205
mean: 83.05
median: 51
std: 100.3
q1: 15
q3: 116
iqr: 101
skew: 2.755
kurtosis: 13.86
n_outliers: 124
outlier_rate: 0.05329
zero_rate: 0.1616
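
Converting a bracket count to a rate, as the treatment suggests, needs a rule for tracts with no renter households. The sketch below returns None in that case so that an undefined rate stays distinct from a true zero. The sample rows are hypothetical.

```python
def to_rate(count, denominator):
    """Percentage share, or None when the denominator is zero (rate undefined)."""
    if denominator == 0:
        return None
    return round(100.0 * count / denominator, 1)


# HYPOTHETICAL tracts: (rent_30_to_34_9_pct, total_renter_households)
rows = [(51, 726), (0, 0), (116, 1357)]
rates = [to_rate(count, renters) for count, renters in rows]
print(rates)
```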

rent_35_to_39_9_pct

numeric feature high_skew

This column represents the count (or estimated count) of renter households paying 35–39.9% of their income on rent — a standard housing cost-burden bracket used in Census/ACS tabulations. The distribution is severely right-skewed (skew 2.40, kurtosis 9.27), with a median of 35 but a mean of 58 and a maximum of 633, indicating that most geographic units have low counts while a small number of populous areas drive extreme values. Nearly 20% of rows are zero, suggesting many small geographies have no households in this rent-burden band, and 110 observations (4.7%) qualify as outliers. The IQR of 73 against a median of 35 confirms substantial spread relative to the typical value. Treatment: Log-transform (log1p) before regression or clustering to reduce skew and compress outlier influence. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 270
min: 0
max: 633
mean: 58.35
median: 35
std: 69.85
q1: 10
q3: 83
iqr: 73
skew: 2.395
kurtosis: 9.275
n_outliers: 110
outlier_rate: 0.04727
zero_rate: 0.1964

rent_40_to_49_9_pct

numeric feature high_skew

This column represents a count (or estimate) of renter households spending 40–49.9% of their income on rent in each census tract. The distribution is heavily right-skewed (skew=2.14, kurtosis=7.14) with a median of 49 but a mean of 74.68 and a maximum of 740, indicating that a small number of high-density tracts pull the mean far above the typical value. About 15.6% of rows are zero (tracts with no households in this rent-burden band), and 111 observations (4.8%) are flagged as outliers, consistent with urban concentration effects. Treatment: Log-transform (log1p) to reduce skew before regression or clustering; retain zeros as valid observations representing absence of this rent-burden group. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 322
min: 0
max: 740
mean: 74.68
median: 49
std: 83.79
q1: 14
q3: 106
iqr: 92
skew: 2.137
kurtosis: 7.139
n_outliers: 111
outlier_rate: 0.0477
zero_rate: 0.1556

rent_50_pct_or_more

numeric feature

This column is a count of renter households spending 50% or more of their income on rent (severe rent burden) in each census tract. The distribution is right-skewed (skew 1.60) with a mean of 253 and a median of 184, indicating many lower-burden tracts alongside a long tail of high-burden tracts reaching up to 1,918. Kurtosis of 3.44 and 87 outliers (3.7% of rows) confirm a heavy upper tail, and ~6.3% of records are zero, representing tracts with no severely rent-burdened households. Its statistics are identical to those of severe_burden, so the two columns appear to carry the same values. Treatment: Log-transform or apply square-root before regression to reduce right skew; consider normalizing by total renter households to derive a rate. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 706
min: 0
max: 1,918
mean: 253.2
median: 184
std: 236.6
q1: 82
q3: 360
iqr: 278
skew: 1.603
kurtosis: 3.435
n_outliers: 87
outlier_rate: 0.03739
zero_rate: 0.06274

NAME

text label near_unique

This column contains fully-qualified U.S. Census tract names for New York City, following the pattern 'Census Tract [number], [County] County; New York'; every one of the 2,327 rows carries this format. All 2,327 values are unique with zero nulls or duplicates, and the extremely tight string-length range (min 38, max 46, mean 41.6) confirms a highly templated naming convention. The top words show 'census', 'tract', and 'county;' each appear exactly 2,327 times, and the county tokens (Kings 805, Queens 725, Bronx 361, Richmond 126, with New York County accounting for the remaining 310 per the county_name table) cover NYC's five boroughs. The vocabulary size of 1,539 relative to 2,327 rows reflects the shared template words plus varying tract numbers and county names. Treatment: Use as a human-readable label or parse to extract county and tract number as separate structured features; do not use as a model input directly. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 2,327
len_min: 38
len_max: 46
len_mean: 41.65
len_median: 41
len_p95: 46
word_mean: 7.133
word_median: 7
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 1,539
readability_flesch_mean: 91.45
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
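
Parsing the label into structured fields could look like the sketch below. The regular expression accepts either a comma or a semicolon between parts, because the exact separators are an assumption based on the pattern described above, and the sample label is illustrative rather than copied from the file.

```python
import re

# Accepts "," or ";" as separators, since the exact punctuation is an assumption.
NAME_RE = re.compile(
    r"^Census Tract (?P<tract>[\d.]+)[,;] (?P<county>.+?) County[,;] (?P<state>.+)$"
)


def parse_name(name):
    """Split a tract label into (tract, county, state); None if it does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return match.group("tract"), match.group("county"), match.group("state")


# ILLUSTRATIVE label built from the described pattern.
sample = "Census Tract 1, Bronx County; New York"
print(parse_name(sample), len(sample))
```

The sample label is 38 characters long, the same as the reported len_min, which is at least consistent with the described template. Rows that return None would be worth inspecting by hand.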

state

numeric feature constant

This column encodes the state as a numeric code, and every one of its 2,327 non-null rows holds the identical value of 36.0, making it a zero-variance constant. The dataset was evidently filtered or exported for a single state: 36 is the FIPS code for New York, which is consistent with the 'New York' suffix in every NAME value. With n_unique = 1 and std = 0.0, this column carries no predictive or discriminative information. Treatment: Drop; the constant value across all 2,327 rows provides zero variance and no modelling signal. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 1
min: 36
max: 36
mean: 36
median: 36
std: 0
q1: 36
q3: 36
iqr: 0
skew: 0
kurtosis: 0
n_outliers: 0
outlier_rate: 0
zero_rate: 0

county

numeric foreign_key

Despite being named 'county', this column is stored as a numeric type with only 5 distinct values in the range 5–85, so it is a numeric county FIPS code rather than a free-text name. The reported minimum, quartiles, and maximum (5, 47, 81, 85) match the FIPS codes for Bronx, Kings, Queens, and Richmond counties; the fifth value is presumably 61 (New York County), since weighting 5, 47, 61, 81, and 85 by their tract counts reproduces the reported mean of 55. Skew (−0.72) and kurtosis (−0.45) only reflect how many tracts each code has and carry no meaning for a nominal identifier. Analysts should treat the 5 numeric values as nominal category labels, not ordinal or continuous quantities. Treatment: Cast to categorical; map numeric codes to county names via a reference table before modelling or reporting. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 5
min: 5
max: 85
mean: 55
median: 47
std: 25.97
q1: 47
q3: 81
iqr: 34
skew: -0.72
kurtosis: -0.4531
n_outliers: 0
outlier_rate: 0
zero_rate: 0
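
A minimal sketch of the code-to-name mapping, assuming the five values are the standard county FIPS codes for the NYC counties (5, 47, 61, 81, 85). This mapping is an assumption, not something the profile states outright. As a sanity check, weighting each code by its tract count from the county_name table should reproduce the reported mean.

```python
# Standard county FIPS codes for the five NYC counties (assumed to be the
# five distinct values in this column).
COUNTY_FIPS = {5: "Bronx", 47: "Kings", 61: "New York", 81: "Queens", 85: "Richmond"}

# Tract counts per county, from the county_name table in this profile.
TRACTS = {"Bronx": 361, "Kings": 805, "New York": 310, "Queens": 725, "Richmond": 126}


def mean_code(fips_to_name, counts):
    """Mean of the numeric codes, weighted by the number of tracts per county."""
    total = sum(counts.values())
    return sum(code * counts[name] for code, name in fips_to_name.items()) / total


print(mean_code(COUNTY_FIPS, TRACTS))
```

The weighted mean comes out at 55, matching the mean reported above, which supports the assumed mapping.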

tract

numeric foreign_key high_skew

This column contains U.S. Census tract codes stored as integers, serving as a geographic identifier rather than a true numeric measure. The values range from 100 to 990100 with 1,530 unique codes across 2,327 rows, so 797 rows (about 34%) repeat a code that also appears elsewhere. Because every NAME value is unique, this does not indicate multiple records per tract: tract codes are unique only within a county, and the same code recurs across boroughs. The extreme skew (10.14) and kurtosis (189.82) reflect the integer encoding scheme (the code carries two implied decimal places, so 100 is tract 1 and 990100 is tract 9901), not a meaningful numeric distribution; the 63 outliers are simply the highest-numbered codes. Treatment: Treat as a categorical geographic key in combination with county; do not use the raw numeric value in models, and join to a census tract reference table on the state, county, and tract codes together for geographic features. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 1,530
min: 100
max: 990,100
mean: 4.225e+04
median: 30,100
std: 4.827e+04
q1: 15,200
q3: 5.79e+04
iqr: 4.27e+04
skew: 10.14
kurtosis: 189.8
n_outliers: 63
outlier_rate: 0.02707
zero_rate: 0
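
Since a tract code is only meaningful together with its state and county, a join key can be built by concatenating the three. The sketch below assumes the usual 11-digit tract GEOID layout (2-digit state, 3-digit county, 6-digit tract); the specific county and tract combinations are illustrative.

```python
def tract_geoid(state, county, tract):
    """Build an 11-character key: 2-digit state + 3-digit county + 6-digit tract."""
    return f"{int(state):02d}{int(county):03d}{int(tract):06d}"


# The state column is stored as 36.0, so values are cast to int first.
print(tract_geoid(36.0, 47, 100))
print(tract_geoid(36, 85, 990100))
```

Keeping the key as a zero-padded string avoids losing the leading zeros that the numeric storage of county and tract has already dropped.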

county_name

categorical label

This column contains the five boroughs of New York City, functioning as a geographic region label for each record. With only 5 unique values across 2,327 rows and zero nulls, it is clean and complete. Brooklyn (Kings) is the most frequent borough at 34.6% (805 records), while Staten Island (Richmond) is the least represented at 126 records — a roughly 6:1 imbalance worth noting for any borough-level modelling. Entropy ratio of 0.898 confirms the distribution is moderately spread but not uniform. Treatment: One-hot encode or use as a grouping/stratification variable; be aware of class imbalance between Brooklyn and Staten Island. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 5
top_value: Brooklyn (Kings)
top_rate: 0.3459
cardinality: 5
entropy: 2.086
entropy_ratio: 0.8985
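
The entropy figures can be reproduced from the five tract counts. This sketch assumes the profiler uses base-2 Shannon entropy and divides by log2 of the cardinality to get the ratio; the profile does not document either choice.

```python
import math

# Tract counts per borough, from the county_name table.
counts = [805, 725, 361, 310, 126]


def entropy_bits(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


h = entropy_bits(counts)
ratio = h / math.log2(len(counts))  # 1.0 would mean a perfectly uniform split
print(round(h, 3), round(ratio, 4))
```

Under those assumptions the results agree with the reported entropy of 2.086 and entropy_ratio of 0.8985.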

moderate_burden

numeric feature

This column is the count of renter households with a moderate rent burden (30–49.9% of income on rent) in each census tract; its mean matches the combined means of rent_30_to_34_9_pct, rent_35_to_39_9_pct, and rent_40_to_49_9_pct. The distribution is heavily right-skewed (skew=1.93) with high excess kurtosis (6.05), meaning a long upper tail pulls the mean (216.1) well above the median (159.0), and 86 outliers reach as high as 1,732. The middle half of tracts runs from 64 to 311 (IQR 247), while ~6.4% of rows are zero, so a meaningful minority of tracts have no moderately burdened renters. Treatment: Log-transform (or apply sqrt) before regression/modelling to address skew; consider zero-inflated treatment given the 6.4% zero mass. medium · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 639
min: 0
max: 1,732
mean: 216.1
median: 159
std: 210.4
q1: 64
q3: 311
iqr: 247
skew: 1.934
kurtosis: 6.052
n_outliers: 86
outlier_rate: 0.03696
zero_rate: 0.06403
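
If moderate_burden is the sum of the three 30–49.9% bracket columns, and rent_burdened is moderate_burden plus severe_burden (assumptions the profile does not state), the reported means should add up, because the mean of a sum equals the sum of the means. A minimal check using only the means reported in this profile:

```python
# Reported column means from this profile.
bracket_means = {
    "rent_30_to_34_9_pct": 83.05,
    "rent_35_to_39_9_pct": 58.35,
    "rent_40_to_49_9_pct": 74.68,
}
moderate_mean = 216.1
severe_mean = 253.2
rent_burdened_mean = 469.3

moderate_from_brackets = round(sum(bracket_means.values()), 1)
burdened_from_parts = round(moderate_mean + severe_mean, 1)
print(moderate_from_brackets, burdened_from_parts)
```

Both sums agree with the reported means to the precision shown, which is consistent with the assumed definitions. A row-level comparison on the raw file would be needed to confirm them.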

severe_burden

numeric numeric_target

This column is the count of renter households with a severe rent burden (50% or more of income on rent) in each census tract, with an integer range of 0–1,918. Its statistics are identical to those of rent_50_pct_or_more, so the two columns appear to be duplicates. The distribution is notably right-skewed (skew = 1.60) with excess kurtosis (3.44), meaning a long upper tail pulls the mean (253.2) well above the median (184.0); 87 outliers (3.7% of rows) extend toward the maximum of 1,918. About 6.3% of values are exactly zero, which may represent tracts with no severely burdened renters or tracts with no renters at all. Treatment: Log-transform (or apply Yeo-Johnson) before regression to reduce skew; check whether the zeros coincide with zero total_renter_households. medium · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 706
min: 0
max: 1,918
mean: 253.2
median: 184
std: 236.6
q1: 82
q3: 360
iqr: 278
skew: 1.603
kurtosis: 3.435
n_outliers: 87
outlier_rate: 0.03739
zero_rate: 0.06274

pct_moderate_burden

numeric feature

This column represents the percentage of renter households in each tract with a moderate rent burden (30–49.9% of income on rent). The distribution is right-skewed (skew = 1.51) with a heavy tail and high kurtosis (6.70): most observations cluster around a median of 21.8%, but a minority push toward the maximum of 100.0, with 59 outliers at the upper extreme. The 102 nulls (4.38%) equal the number of tracts with zero renter households (zero_rate 0.04383 × 2,327 ≈ 102 in total_renter_households), so they most likely mark an undefined ratio rather than missing data; the 2.11% zero rate reflects tracts that have renters but none in this band. With only 461 unique values across 2,327 rows, values appear to be rounded percentages rather than continuous measurements. Treatment: Investigate and possibly winsorize or cap the 59 upper outliers (including any 100.0 values) before regression; flag the 102 nulls rather than imputing them if they do correspond to renter-free tracts. high · anthropic:default

n: 2,327
nulls: 102 (4.4%)
unique: 461
min: 0
max: 100
mean: 22.74
median: 21.8
std: 11.36
q1: 15.9
q3: 28.2
iqr: 12.3
skew: 1.509
kurtosis: 6.704
n_outliers: 59
outlier_rate: 0.02652
zero_rate: 0.02112

pct_severe_burden

numeric feature

This column represents the percentage of renter households in each tract with a severe rent burden (rent of 50% or more of income, a standard HUD threshold). Values range from 0 to 100 with a mean of 27.1% and median of 26.2%, indicating a fairly symmetric distribution with mild right skew (0.57) and modest kurtosis (1.22). The 518 unique values across 2,327 rows are consistent with rounded tract-level percentages rather than raw microdata. The 4.38% null rate (102 rows) and 30 outliers (reaching 100%) warrant attention but are not extreme. Treatment: Use as-is or apply mild log-transform if residuals show heteroscedasticity; impute or flag the 4.38% nulls before modelling. high · anthropic:default

n: 2,327
nulls: 102 (4.4%)
unique: 518
min: 0
max: 100
mean: 27.12
median: 26.2
std: 12.68
q1: 18.7
q3: 34.6
iqr: 15.9
skew: 0.5663
kurtosis: 1.222
n_outliers: 30
outlier_rate: 0.01348
zero_rate: 0.01978

rent_burdened

numeric feature

This column is the count of rent-burdened households (those spending 30% or more of income on rent) in each census tract; its mean is consistent with moderate_burden plus severe_burden. The distribution is right-skewed (skew 1.49, kurtosis 3.00) with a mean of 469 well above the median of 358, indicating a long tail driven by 82 outliers reaching up to 3,153, consistent with dense urban tracts pulling the upper end. The middle half of tracts runs from 164.5 to 670, indicating high variability, and 4.7% of rows are zero, plausibly reflecting tracts with negligible renter populations. Treatment: Log-transform or apply robust scaling before regression to address right skew and outlier influence. medium · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 1,013
min: 0
max: 3,153
mean: 469.3
median: 358
std: 415.3
q1: 164.5
q3: 670
iqr: 505.5
skew: 1.494
kurtosis: 3.005
n_outliers: 82
outlier_rate: 0.03524
zero_rate: 0.04727

pct_rent_burdened

numeric feature

This column represents the percentage of renter households that are rent-burdened (spending ≥30% of income on rent) in each of the 2,327 census tracts, 102 of which are null. The distribution is almost perfectly symmetric around the median of 50.0 and mean of 49.87, with near-zero skew (−0.038) and a moderate IQR of 17.9; this is unusually bell-shaped for a percentage metric, which normally skews in one direction. The full 0–100 range is present, 62 outliers exist at the tails, and the zero rate is negligible at 0.36%, so very few tracts have no rent burden at all. Treatment: Use as-is or apply mild clipping at tails (62 outliers); no transformation needed given near-normal distribution. high · anthropic:default

n: 2,327
nulls: 102 (4.4%)
unique: 596
min: 0
max: 100
mean: 49.87
median: 50
std: 14.62
q1: 40.9
q3: 58.8
iqr: 17.9
skew: -0.03839
kurtosis: 0.7849
n_outliers: 62
outlier_rate: 0.02787
zero_rate: 0.003596

median_gross_rent

numeric feature high_skew outliers

This column represents median gross rent (likely in USD) across 2,327 census tracts, with a plausible central range of roughly $1,442–$2,049 (Q1 to Q3, IQR 607.5) and a median of $1,735. The mean of –$41.5M and minimum of –666,666,666 are severe red flags: a sentinel/placeholder value (–666666666) has been used for missing or N/A records, which is driving the extreme negative skew (–3.62), the kurtosis of 11.1, and part of the 289 outliers (12.4% of rows). These sentinel values must be treated as nulls before any analysis, and the reported 0.0% null rate therefore understates the true missingness. Treatment: Replace –666666666 (and any other negative values) with NaN, then validate the remaining distribution before using as a regression feature or for aggregation. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 1,232
min: -6.667e+08
max: 3,501
mean: -4.154e+07
median: 1,735
std: 1.612e+08
q1: 1,442
q3: 2,049
iqr: 607.5
skew: -3.621
kurtosis: 11.11
n_outliers: 289
outlier_rate: 0.1242
zero_rate: 0
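
A minimal sketch of the recoding step described in the treatment, using only the Python standard library. In Census ACS API output, -666666666 is an annotation value meaning the estimate could not be computed; whether this file came from that API is an assumption. The sample values are hypothetical.

```python
from statistics import mean, median

SENTINEL = -666_666_666


def recode_sentinel(values, sentinel=SENTINEL):
    """Replace the sentinel, and any other negative value, with None."""
    return [None if (v == sentinel or v < 0) else v for v in values]


# HYPOTHETICAL sample of median_gross_rent values, including two sentinels.
raw = [1735, 1442, SENTINEL, 2049, 3501, SENTINEL]

clean = recode_sentinel(raw)
valid = [v for v in clean if v is not None]

print("raw mean:", mean(raw))
print("clean median:", median(valid), "clean mean:", mean(valid))
```

The raw mean is dominated by the two sentinel rows, while the cleaned median and mean land in a plausible rent range. Counting the None values afterwards gives the real missingness for the column.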

median_household_income

numeric feature high_skew outliers

This column represents median household income, likely drawn from census or demographic data, with a plausible central distribution (median $76,833, Q1 about $53,240, Q3 about $102,360, IQR $49,117) that aligns with typical US household income ranges. However, the column is severely corrupted by sentinel values: a minimum of -666,666,666 drags the mean to -$36,017,397 and produces extreme negative skew (-3.94) and kurtosis of 13.5. With 208 outliers (8.9% of rows) and a maximum of $250,001 (which suggests a top-coded value), a meaningful subset of records contain placeholder negative values that must be treated before any analysis. Treatment: Filter or null-out records where value < 0 (sentinel values like -666666666), then apply log-transform for regression after cleaning. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 2,106
min: -6.667e+08
max: 250,001
mean: -3.602e+07
median: 76,833
std: 1.509e+08
q1: 5.324e+04
q3: 1.024e+05
iqr: 49,117
skew: -3.94
kurtosis: 13.53
n_outliers: 208
outlier_rate: 0.08939
zero_rate: 0
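
The reported means also allow a back-of-envelope estimate of how many rows hold the sentinel. The sketch below assumes that -666,666,666 is the only negative value and uses each column's median as a stand-in for the typical valid value; both are simplifying assumptions, so the results are rough estimates rather than counts from the file.

```python
SENTINEL = -666_666_666
N_ROWS = 2327


def estimate_sentinel_rows(reported_mean, typical_valid, n=N_ROWS, sentinel=SENTINEL):
    """Solve n * mean = k * sentinel + (n - k) * typical_valid for k."""
    return n * (reported_mean - typical_valid) / (sentinel - typical_valid)


# Means and medians as reported in this profile.
income_k = estimate_sentinel_rows(-36_017_397, 76_833)
rent_k = estimate_sentinel_rows(-4.154e7, 1_735)
print(round(income_k), round(rent_k))
```

Under these assumptions roughly 126 income rows and 145 rent rows carry the sentinel. Both figures are below the respective outlier counts (208 and 289), which would mean the remaining flagged rows are genuine extreme values; a direct count on the raw file is the way to confirm this.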

total_households

numeric feature

This column is the count of total households in each census tract. The distribution is right-skewed (skew = 1.479) with leptokurtic tails (kurtosis = 4.38), meaning a minority of tracts contain disproportionately large household counts, as confirmed by 70 outliers reaching up to 8,209 while the median sits at 1,252. A notable 4.1% of rows carry a zero value, which may indicate uninhabited areas, data gaps, or boundary artifacts worth investigating before modelling. Treatment: Investigate zero values for validity, then log-transform or apply a power transform before regression to address right skew. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 1,495
min: 0
max: 8,209
mean: 1,411
median: 1,252
std: 923.3
q1: 773.5
q3: 1,850
iqr: 1,076
skew: 1.479
kurtosis: 4.377
n_outliers: 70
outlier_rate: 0.03008
zero_rate: 0.04125

owner_occupied

numeric feature outliers

This column represents a count of owner-occupied housing units per geographic area (e.g., census tract or block group), with values ranging from 0 to 3,052 and a mean of ~465. The distribution is notably right-skewed (skew = 1.76) with high kurtosis (4.25), and 143 outliers (~6.1% of rows) pull the tail well above the median of 371. A 7.2% zero rate is worth investigating — these may be areas with no owner-occupied units or data gaps. Treatment: Log-transform or apply a robust scaler before modelling; investigate zero values for data-quality issues. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 1,001
min: 0
max: 3,052
mean: 464.6
median: 371
std: 422.6
q1: 177
q3: 608
iqr: 431
skew: 1.761
kurtosis: 4.254
n_outliers: 143
outlier_rate: 0.06145
zero_rate: 0.0722

renter_occupied

numeric feature

This column is the count of renter-occupied housing units in each census tract. The distribution is right-skewed (skew 1.59) with a mean of 946 well above the median of 726, indicating many mid-range tracts pulled up by a long upper tail reaching 8,209. Kurtosis of 4.63 and 69 outliers (3.0%) confirm a heavy tail of high-density rental areas. A 4.4% zero rate is plausible for owner-dominated or low-density tracts. Its statistics are identical to those of total_renter_households, so the two columns appear to be duplicates. Treatment: Log-transform or apply square-root before regression to reduce skew and compress outlier influence. high · anthropic:default

n: 2,327
nulls: 0 (0.0%)
unique: 1,418
min: 0
max: 8,209
mean: 946.1
median: 726
std: 815.4
q1: 346
q3: 1,357
iqr: 1,011
skew: 1.595
kurtosis: 4.627
n_outliers: 69
outlier_rate: 0.02965
zero_rate: 0.04383

pct_owner_occupied

numeric feature

This column represents the percentage of owner-occupied housing units in each census tract, ranging from 0% to 100% with a mean of 37.5% and median of 34.4%. The distribution is flat and broad: the IQR alone spans 39.7 percentage points (Q1=16.4, Q3=56.1), indicating high variability across tracts rather than a clustered norm. Negative kurtosis (−0.85) confirms a platykurtic, spread-out distribution with no heavy tails or outliers. A 3.2% zero rate may reflect tracts that are entirely renter-occupied or institutionalized populations, which is worth investigating before modelling. The 96 nulls (4.1%) equal the number of tracts with zero total_households (zero_rate 0.04125 × 2,327 ≈ 96), so they likely mark an undefined ratio rather than missing data. Treatment: Use as-is or apply mild clipping at boundaries; consider spatial context before imputing the 4.1% nulls from geographic neighbours, since a tract with no households has no meaningful ownership rate. high · anthropic:default

n: 2,327
nulls: 96 (4.1%)
unique: 823
min: 0
max: 100
mean: 37.51
median: 34.4
std: 25.65
q1: 16.4
q3: 56.1
iqr: 39.7
skew: 0.3948
kurtosis: -0.854
n_outliers: 0
outlier_rate: 0
zero_rate: 0.03227
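
The zero outlier count follows directly from the quartiles if the profiler flags outliers with the common 1.5 × IQR (Tukey) rule. That rule is an assumption here; the profile does not say which method it uses.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Reported quartiles for pct_owner_occupied.
low, high = tukey_fences(16.4, 56.1)
print(round(low, 2), round(high, 2))
```

Both fences fall outside the 0–100 range that a percentage can take, so under this rule no value in the column can be flagged, which agrees with n_outliers: 0.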

pct_renter_occupied

numeric feature

This column represents the percentage of housing units that are renter-occupied in each census tract. The distribution is notably broad (IQR of 39.7 points, from 43.9 to 83.6, with a std of 25.65), indicating high variability in renter rates across tracts. The mean (62.5) and median (65.6) both sit well above 50%, as expected for a dataset limited to New York City tracts, where renting is more common than owning. The platykurtic shape (kurtosis −0.85) and mild negative skew (−0.39) describe a relatively flat distribution with a longer tail toward low renter shares and no outliers flagged. Treatment: Use as-is or normalize to [0,1]; null_rate of 4.13% warrants median or model-based imputation before modelling. high · anthropic:default

n: 2,327
nulls: 96 (4.1%)
unique: 823
min: 0
max: 100
mean: 62.49
median: 65.6
std: 25.65
q1: 43.9
q3: 83.6
iqr: 39.7
skew: -0.3948
kurtosis: -0.854
n_outliers: 0
outlier_rate: 0
zero_rate: 0.002689
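
The statistics for pct_renter_occupied and pct_owner_occupied look like mirror images of each other. A minimal check, using only the values reported in this profile, of whether the two columns behave as complements that sum to 100:

```python
# Reported statistics for the two tenure-share columns.
owner = {"mean": 37.51, "median": 34.4, "q1": 16.4, "q3": 56.1, "skew": 0.3948}
renter = {"mean": 62.49, "median": 65.6, "q1": 43.9, "q3": 83.6, "skew": -0.3948}


def is_complement(a, b, tol=0.05):
    """True if b's summary statistics look like those of 100 - a."""
    return (
        abs(a["mean"] + b["mean"] - 100) <= tol
        and abs(a["median"] + b["median"] - 100) <= tol
        and abs(a["q1"] + b["q3"] - 100) <= tol  # quartiles swap under 100 - x
        and abs(a["q3"] + b["q1"] - 100) <= tol
        and abs(a["skew"] + b["skew"]) <= tol    # skew flips sign
    )


print(is_complement(owner, renter))
```

If the complement relationship holds row by row as well, the two columns are perfectly collinear, and a regression should include only one of them.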