saturn·

nyc housing nyc housing metrics merged

source /home/coolhand/html/datavis/data_trove/data/urban/nyc_housing/nyc_housing_metrics_merged.csv 2,327 rows 23 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset covers 2,327 NYC census tracts with 23 columns describing housing tenure, rent burden, income, and rent levels across the five boroughs. The most urgent issue is data hygiene: median_gross_rent and median_household_income both contain a sentinel value of -666666666, which drags their means to roughly -41.5M and -36M respectively despite sensible medians (~$1,735 rent, ~$76,833 income). The sentinel needs to be replaced with null or filtered out before any analysis. Beyond that, the substantive story is rent burden: pct_rent_burdened has a median of 50% with a Q1–Q3 range of 40.9–58.8, meaning that in about half of the tracts with a defined value, at least half of renter households pay 30% or more of income on rent. Brooklyn (Kings) has the largest share of tracts at 35%, followed by Queens (31%), the Bronx (16%), Manhattan (13%), and Staten Island (5%), so any borough-level comparison should weight accordingly. The state column is constant (all 36, the FIPS code for New York) and can be dropped.

citing: median_gross_rent · median_household_income · pct_rent_burdened · pct_severe_burden · pct_owner_occupied · county_name · state · total_households
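The sentinel cleanup recommended in the summary can be sketched as follows. This is a minimal illustration using only the standard library; the helper names `clean_sentinel` and `summarize` and the five-value sample are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

SENTINEL = -666666666  # placeholder value reported in the profile


def clean_sentinel(values, sentinel=SENTINEL):
    """Return a copy of values with the sentinel replaced by None."""
    return [None if v == sentinel else v for v in values]


def summarize(values):
    """Mean and median over the non-null entries only."""
    valid = [v for v in values if v is not None]
    return {"n_valid": len(valid), "mean": mean(valid), "median": median(valid)}


# Hypothetical sample: four plausible rents plus one sentinel row.
raw_rent = [1500, 1735, SENTINEL, 2049, 1900]
raw_mean = mean(raw_rent)            # dominated by the sentinel, hugely negative
cleaned = clean_sentinel(raw_rent)
stats = summarize(cleaned)
print(raw_mean, stats)
```

The same two steps apply to median_household_income. Computing statistics only over non-null entries keeps the sentinel rows visible as missing instead of silently dropping them.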

Schema

23 columns
Per-column summary.
column type nulls unique alerts
total_renter_households numeric 0.0% 1,418
rent_30_to_34_9_pct numeric 0.0% 355 high_skew outliers
rent_35_to_39_9_pct numeric 0.0% 270 high_skew
rent_40_to_49_9_pct numeric 0.0% 322 high_skew
rent_50_pct_or_more numeric 0.0% 706
NAME text 0.0% 2,327 near_unique
state numeric 0.0% 1 constant
county numeric 0.0% 5
tract numeric 0.0% 1,530 high_skew
county_name categorical 0.0% 5
moderate_burden numeric 0.0% 639
severe_burden numeric 0.0% 706
pct_moderate_burden numeric 4.4% 461
pct_severe_burden numeric 4.4% 518
rent_burdened numeric 0.0% 1,013
pct_rent_burdened numeric 4.4% 596
median_gross_rent numeric 0.0% 1,232 high_skew outliers
median_household_income numeric 0.0% 2,106 high_skew outliers
total_households numeric 0.0% 1,495
owner_occupied numeric 0.0% 1,001 outliers
renter_occupied numeric 0.0% 1,418
pct_owner_occupied numeric 4.1% 823
pct_renter_occupied numeric 4.1% 823

total_renter_households

numeric feature
This column counts renter households per tract, ranging from 0 to 8209 with a median of 726 and mean of 946. The distribution is right-skewed (skew 1.59, kurtosis 4.63) with 69 outliers (2.97%) on the high end and 4.38% zero values. There are no nulls, and 1418 unique values across 2327 rows is consistent with counts aggregated per tract. Every summary statistic matches renter_occupied, so the two columns are probably duplicates. Treatment: Apply a log1p transform before regression to tame the right skew (a plain log is undefined for the zero rows), and decide whether zero-count rows should be modelled separately. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
1,418
min
0
max
8,209
mean
946.1
median
726
std
815.4
q1
346
q3
1,357
iqr
1,011
skew
1.595
kurtosis
4.627
n_outliers
69
outlier_rate
0.02965
zero_rate
0.04383
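Because about 4.4% of the rows are zero, a plain logarithm would fail on them, which is why log1p is the safer choice. A minimal sketch, using the five-number summary above as sample inputs (the function names are illustrative):

```python
import math


def log1p_transform(counts):
    """Apply log(1 + x); defined at zero, unlike a plain log."""
    return [math.log1p(c) for c in counts]


def inverse_log1p(values):
    """Undo the transform with exp(x) - 1."""
    return [math.expm1(v) for v in values]


# min, q1, median, q3, max of total_renter_households from the profile
summary_points = [0, 346, 726, 1357, 8209]
transformed = log1p_transform(summary_points)
restored = inverse_log1p(transformed)
print(transformed)
```

The transform maps zero to zero and compresses the long upper tail, and `math.expm1` recovers the original counts when predictions need to be reported on the count scale.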

rent_30_to_34_9_pct

numeric feature high_skew outliers
Likely a count of households paying 30%-34.9% of income on rent within some geographic unit, given the integer-like values, zero floor, and max of 1205. The distribution is heavily right-skewed (skew 2.76, kurtosis 13.86) with a median of 51 against a mean of 83.05, and 16.2% of rows are exactly zero. 124 outliers (5.33%) extend far above the Q3 of 116, consistent with a few large areas dominating. Treatment: log1p-transform before modelling to tame the skew and zero inflation. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
355
min
0
max
1,205
mean
83.05
median
51
std
100.3
q1
15
q3
116
iqr
101
skew
2.755
kurtosis
13.86
n_outliers
124
outlier_rate
0.05329
zero_rate
0.1616

rent_35_to_39_9_pct

numeric feature high_skew
Likely a count of households (or housing units) paying 35% to 39.9% of income on rent within some geographic unit. The distribution is heavily right-skewed (skew 2.40, kurtosis 9.27) with a median of 35 but a max of 633, and nearly 20% of rows are zero (zero_rate 0.196), suggesting many small areas have no households in this rent burden bracket. 110 outliers (4.7%) sit well above the Q3 of 83. Treatment: Log1p-transform before regression to tame the right skew and zero inflation. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
270
min
0
max
633
mean
58.35
median
35
std
69.85
q1
10
q3
83
iqr
73
skew
2.395
kurtosis
9.275
n_outliers
110
outlier_rate
0.04727
zero_rate
0.1964

rent_40_to_49_9_pct

numeric feature high_skew
Likely a count of renter households paying 40% to 49.9% of income on rent per tract. The distribution is heavily right-skewed (skew 2.14, kurtosis 7.14) with a median of 49 but a max of 740 and 111 outliers (4.77%), and 15.6% of rows are zero, consistent with small tracts sitting alongside dense ones. Treatment: log1p-transform before modelling to tame the right skew and zero mass. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
322
min
0
max
740
mean
74.68
median
49
std
83.79
q1
14
q3
106
iqr
92
skew
2.137
kurtosis
7.139
n_outliers
111
outlier_rate
0.0477
zero_rate
0.1556

rent_50_pct_or_more

numeric feature
Counts of households spending 50% or more of income on rent, aggregated per geographic unit across 2327 rows with no nulls. The distribution is right-skewed (skew 1.60, kurtosis 3.44) with a median of 184 well below the mean of 253.18 and a max of 1918, and 6.27% of rows are zero. About 3.74% of values fall outside the Tukey fence. Treatment: Log1p-transform before modelling to tame the right skew. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
706
min
0
max
1,918
mean
253.2
median
184
std
236.6
q1
82
q3
360
iqr
278
skew
1.603
kurtosis
3.435
n_outliers
87
outlier_rate
0.03739
zero_rate
0.06274
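The Tukey fence mentioned above can be reproduced from the reported quartiles. The sketch below assumes the conventional multiplier of 1.5; the profile does not state which multiplier the profiler actually uses, so that value is an assumption.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3, k=1.5):
    """True for every value that falls outside the fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return [v < lower or v > upper for v in values]


# q1 and q3 for rent_50_pct_or_more as reported in the profile
lower, upper = tukey_fences(82, 360)
flags = flag_outliers([0, 184, 777, 778, 1918], 82, 360)
print(lower, upper, flags)
```

Since the counts cannot be negative, only the upper fence matters in practice for this column.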

NAME

text identifier near_unique
This column holds fully qualified Census tract names for New York City. All 2327 rows are unique with zero nulls, and lengths are tightly bounded (38-46 chars, median 41). The vocabulary is formulaic: 'new', 'york', 'census', 'tract', 'county;' appear in essentially every row, and the borough split is Kings (805), Queens (725), Bronx (361), New York (310), and Richmond (126). Because each value is a one-to-one tract label, it functions as a geographic key rather than a modelling feature. Treatment: Treat as a tract-level key; parse out borough or join to a geo table rather than feeding the raw string to a model. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
2,327
len_min
38
len_max
46
len_mean
41.65
len_median
41
len_p95
46
word_mean
7.133
word_median
7
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
1,539
readability_flesch_mean
91.45
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0
boilerplate_rate
0
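Parsing the borough out of NAME could look like the sketch below. It assumes the values follow the pattern "Census Tract <number>; <County> County; New York", which is inferred from the vocabulary tokens and string lengths above and not confirmed by the profile; the function name is hypothetical.

```python
def parse_tract_name(name):
    """Split an assumed 'Census Tract N; X County; State' label into parts."""
    parts = [p.strip() for p in name.split(";")]
    if len(parts) != 3:
        raise ValueError(f"unexpected NAME format: {name!r}")
    tract_label, county_label, state = parts
    prefix, suffix = "Census Tract ", " County"
    if not tract_label.startswith(prefix) or not county_label.endswith(suffix):
        raise ValueError(f"unexpected NAME format: {name!r}")
    return {
        "tract": tract_label[len(prefix):],
        "county": county_label[: -len(suffix)],
        "state": state,
    }


# Illustrative value built to match the assumed pattern
parsed = parse_tract_name("Census Tract 319.02; Richmond County; New York")
print(parsed)
```

Raising on an unexpected shape is deliberate: a silent mis-parse of a geographic key is harder to detect downstream than an early error.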

state

numeric metadata constant
The column 'state' is numeric but holds the single value 36 across all 2327 rows, with zero variance and no nulls. The value 36 is the FIPS code for New York State, which is expected for a dataset restricted to NYC tracts, and it carries no information for modelling. Treatment: Drop as a constant column (retain the value 36 if a full tract GEOID is needed). high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
1
min
36
max
36
mean
36
median
36
std
0
q1
36
q3
36
iqr
0
skew
0
kurtosis
0
n_outliers
0
outlier_rate
0
zero_rate
0
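Constant columns like this one can be detected mechanically before modelling. A small sketch over rows represented as dictionaries; the two sample rows are hypothetical.

```python
def constant_columns(rows):
    """Return the names of columns that hold at most one distinct value."""
    distinct = {}
    for row in rows:
        for col, value in row.items():
            distinct.setdefault(col, set()).add(value)
    return sorted(col for col, vals in distinct.items() if len(vals) <= 1)


# Hypothetical rows: state never varies, county does.
rows = [
    {"state": 36, "county": 5},
    {"state": 36, "county": 47},
]
to_drop = constant_columns(rows)
print(to_drop)
```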

county

numeric feature
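Encoded county identifier stored as a numeric code, with only 5 distinct values across 2327 rows and no nulls. The values (min 5, max 85, median 47) are categorical codes rather than a continuous measurement, consistent with the county FIPS codes for the five boroughs (005 Bronx, 047 Kings, 061 New York, 081 Queens, 085 Richmond), so the column carries the same information as county_name. The negative skew (-0.72) only reflects uneven tract counts across those 5 codes. Treatment: Cast to categorical and one-hot or target-encode before modelling, and use either this column or county_name, not both. high · anthropic:claude-opus-4-7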
n
2,327
nulls
0 (0.0%)
unique
5
min
5
max
85
mean
55
median
47
std
25.97
q1
47
q3
81
iqr
34
skew
-0.72
kurtosis
-0.4531
n_outliers
0
outlier_rate
0
zero_rate
0
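Casting the code to a categorical and one-hot encoding it might be sketched as below. The mapping uses the standard county FIPS codes for the five boroughs; only 5, 47, 81, and 85 are directly visible in the statistics above, so the full set of codes is an assumption about this file.

```python
# Standard county FIPS codes for the five NYC boroughs (assumed to be
# the five distinct values in this column).
COUNTY_FIPS = {
    5: "Bronx",
    47: "Kings",
    61: "New York",
    81: "Queens",
    85: "Richmond",
}


def one_hot_county(code):
    """Return a dict of 0/1 indicators, one per known county code."""
    if code not in COUNTY_FIPS:
        raise KeyError(f"unknown county code: {code}")
    return {f"county_{fips:03d}": int(fips == code) for fips in COUNTY_FIPS}


encoded = one_hot_county(47)
print(COUNTY_FIPS[47], encoded)
```

For linear models one indicator is usually dropped to avoid perfect collinearity with the intercept.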

tract

numeric identifier high_skew
This is almost certainly a U.S. Census tract code stored as a number, with 1530 unique values across 2327 rows and no nulls. Tract codes are unique only within a county, which explains why there are fewer unique values than rows; the column is a key only in combination with state and county. The distribution is severely right-skewed (skew 10.14, kurtosis 189.8) with a median of 30100 but a max of 990100, which is expected for identifiers rather than true magnitudes. The 63 flagged outliers (2.7%) are most likely high-numbered tract codes (codes in the 9900 range are conventionally reserved for special-purpose tracts such as water-only areas), not data errors. Treatment: Treat as a categorical geographic code; cast to a zero-padded 6-character string and join to tract-level reference data rather than using as a numeric feature. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
1,530
min
100
max
990,100
mean
4.225e+04
median
30,100
std
4.827e+04
q1
15,200
q3
5.79e+04
iqr
4.27e+04
skew
10.14
kurtosis
189.8
n_outliers
63
outlier_rate
0.02707
zero_rate
0
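A zero-padded key for joining to tract-level reference data can be built from the three numeric columns. The sketch assumes the usual 11-character tract GEOID layout (2-digit state, 3-digit county, 6-digit tract); the function name is illustrative.

```python
def tract_geoid(state, county, tract):
    """Concatenate zero-padded state (2), county (3), and tract (6) codes."""
    return f"{int(state):02d}{int(county):03d}{int(tract):06d}"


# Values taken from the profile: state 36, county median 47, tract median 30100
geoid = tract_geoid(36, 47, 30100)
print(geoid)  # 36047030100
```

Combining the three fields also restores uniqueness, since the tract code alone repeats across counties.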

county_name

categorical feature
This column lists New York City borough/county names across 2327 rows, with exactly 5 unique values and no nulls. Distribution mirrors NYC borough sizes: Brooklyn (Kings) leads at 805 (34.6%), followed by Queens (725), Bronx (361), Manhattan (310), and Staten Island (126). Entropy ratio of 0.90 indicates a fairly balanced spread across the five categories with no extreme concentration. Treatment: One-hot or target-encode for modelling. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
5
top_value
Brooklyn (Kings)
top_rate
0.3459
cardinality
5
entropy
2.086
entropy_ratio
0.8985
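The entropy figures can be recomputed from the five borough counts quoted above. The sketch assumes entropy is measured in bits and that the ratio divides by log2 of the cardinality, which reproduces the reported 2.086 and 0.8985.

```python
import math

# Tract counts per borough as listed in the description above
counts = {
    "Brooklyn (Kings)": 805,
    "Queens": 725,
    "Bronx": 361,
    "Manhattan": 310,
    "Staten Island": 126,
}


def shannon_entropy_bits(freqs):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)


entropy = shannon_entropy_bits(list(counts.values()))
entropy_ratio = entropy / math.log2(len(counts))  # 1.0 would be perfectly even
print(round(entropy, 3), round(entropy_ratio, 4))
```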

moderate_burden

numeric feature
A non-negative integer count ranging from 0 to 1732 with a median of 159 and mean of 216 across 2327 rows, no nulls. Its mean (216.1) equals the sum of the means of the three 30%-49.9% rent brackets (83.05 + 58.35 + 74.68 = 216.08), so it most likely counts renter households paying 30% to 49.9% of income on rent. The distribution is right-skewed (skew 1.93, kurtosis 6.05) with 86 outliers (3.7%) and 6.4% zeros. Treatment: Apply a log1p transform before regression to tame the right skew and outliers. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
639
min
0
max
1,732
mean
216.1
median
159
std
210.4
q1
64
q3
311
iqr
247
skew
1.934
kurtosis
6.052
n_outliers
86
outlier_rate
0.03696
zero_rate
0.06403
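The reported means suggest that the burden columns are sums of the rent-bracket counts. The sketch below checks that relationship, first on the means from the profile and then per row; the relationship itself is an inference from the summary statistics, and the sample row is hypothetical.

```python
def close(a, b, tol=0.05):
    """True when two rounded summary figures agree within tol."""
    return abs(a - b) <= tol


# Means as reported in the profile
moderate_from_brackets = 83.05 + 58.35 + 74.68   # three 30%-49.9% brackets
means_consistent = (
    close(moderate_from_brackets, 216.1)          # moderate_burden mean
    and close(216.1 + 253.2, 469.3)               # rent_burdened mean
)


def check_row(row):
    """Row-level version of the same identities."""
    moderate = (row["rent_30_to_34_9_pct"] + row["rent_35_to_39_9_pct"]
                + row["rent_40_to_49_9_pct"])
    return {
        "moderate_ok": moderate == row["moderate_burden"],
        "severe_ok": row["rent_50_pct_or_more"] == row["severe_burden"],
        "total_ok": row["moderate_burden"] + row["severe_burden"]
                    == row["rent_burdened"],
    }


# Hypothetical row that satisfies all three identities
sample = {
    "rent_30_to_34_9_pct": 51, "rent_35_to_39_9_pct": 35,
    "rent_40_to_49_9_pct": 49, "rent_50_pct_or_more": 184,
    "moderate_burden": 135, "severe_burden": 184, "rent_burdened": 319,
}
result = check_row(sample)
print(means_consistent, result)
```

If the row-level check holds across the file, the derived columns add no information beyond the brackets and should not be fed to a model alongside them.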

severe_burden

numeric feature
Count column with 2327 rows, no nulls, and 706 unique integer values ranging from 0 to 1918 (median 184, mean 253.18). Every summary statistic matches rent_50_pct_or_more, so this is most likely the same count of households paying 50% or more of income on rent under a different name. The distribution is right-skewed (skew 1.60, kurtosis 3.44) with 6.27% zeros and 87 outliers (3.74%) above the upper whisker. The wide IQR (278) and std (236.60) relative to the median indicate substantial dispersion across tracts. Treatment: Apply a log1p transform before regression to tame the right skew and outliers, and keep only one of the two matching columns. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
706
min
0
max
1,918
mean
253.2
median
184
std
236.6
q1
82
q3
360
iqr
278
skew
1.603
kurtosis
3.435
n_outliers
87
outlier_rate
0.03739
zero_rate
0.06274

pct_moderate_burden

numeric feature
This is a percentage feature, most likely the share of renter households under moderate rent burden, ranging 0–100 with mean 22.74 and median 21.8. The distribution is right-skewed (skew 1.51, kurtosis 6.70) with 59 outliers (2.65% of non-null rows) and a 4.38% null rate. The 102 nulls match the 4.38% of tracts with zero renter households, so they are probably undefined ratios rather than missing observations. About 2.1% of rows are exact zeros and the IQR is tight at 12.3, while the upper tail past q3=28.2 stretches all the way to 100. Treatment: Treat the ~4% nulls as structurally undefined (flag them, or impute only if the model requires it) and consider a mild transform or winsorization to tame the right tail before modelling. high · anthropic:claude-opus-4-7
n
2,327
nulls
102 (4.4%)
unique
461
min
0
max
100
mean
22.74
median
21.8
std
11.36
q1
15.9
q3
28.2
iqr
12.3
skew
1.509
kurtosis
6.704
n_outliers
59
outlier_rate
0.02652
zero_rate
0.02112
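The null pattern is what a percentage column produces when the denominator is zero. A minimal sketch of such a calculation, assuming the denominator is the renter household count and the result is rounded to one decimal (both are assumptions about how the file was built):

```python
def pct(numerator, denominator, digits=1):
    """Percentage rounded to `digits`, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return round(100 * numerator / denominator, digits)


# Hypothetical tracts: (moderate_burden, total_renter_households)
examples = [(50, 200), (159, 726), (0, 0)]
shares = [pct(n, d) for n, d in examples]
print(shares)
```

Returning None for empty tracts keeps them distinguishable from tracts where the share is a genuine zero.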

pct_severe_burden

numeric feature
A percentage metric (0–100 range), most likely the share of renter households under severe rent burden, with a mean of 27.12 and median of 26.2 and a mildly right-skewed distribution (skew 0.57). Spread is moderate (std 12.68, IQR 15.9) and only 1.35% of rows are flagged as outliers, though a max of 100 alongside a 1.98% zero rate points to a few extreme records worth inspecting. Note the 4.38% null rate, which matches the share of tracts with zero renter households. Treatment: Handle the 4.4% nulls explicitly and otherwise use as-is; the mild skew does not require transformation. high · anthropic:claude-opus-4-7
n
2,327
nulls
102 (4.4%)
unique
518
min
0
max
100
mean
27.12
median
26.2
std
12.68
q1
18.7
q3
34.6
iqr
15.9
skew
0.5663
kurtosis
1.222
n_outliers
30
outlier_rate
0.01348
zero_rate
0.01978

rent_burdened

numeric feature
A count of rent-burdened households per tract, ranging from 0 to 3153 with a median of 358 and mean of 469.26. The mean equals the sum of the moderate_burden and severe_burden means (216.1 + 253.2 = 469.3), so this is most likely the total number of renter households paying 30% or more of income on rent, not a dollar amount. The distribution is right-skewed (skew 1.49, kurtosis 3.00) with 82 outliers (3.5%) and 4.7% exact zeros. Treatment: Apply a log1p transform before regression to tame the right skew and zero mass. medium · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
1,013
min
0
max
3,153
mean
469.3
median
358
std
415.3
q1
164.5
q3
670
iqr
505.5
skew
1.494
kurtosis
3.005
n_outliers
82
outlier_rate
0.03524
zero_rate
0.04727

pct_rent_burdened

numeric feature
This is a numeric percentage indicating the share of rent-burdened renter households per tract, ranging from 0 to 100 with a mean of 49.87 and median of 50.0. The distribution is nearly symmetric (skew -0.04) and reasonably tight around the middle (IQR 17.9, std 14.6), with 4.38% nulls and only 0.36% zeros. 62 outliers (2.79%) sit beyond the whiskers, but no severe tail is evident. Treatment: Handle the ~4% nulls explicitly (they coincide with the share of tracts that have zero renter households) and otherwise use as-is; no transform is needed for a near-symmetric bounded percentage. high · anthropic:claude-opus-4-7
n
2,327
nulls
102 (4.4%)
unique
596
min
0
max
100
mean
49.87
median
50
std
14.62
q1
40.9
q3
58.8
iqr
17.9
skew
-0.03839
kurtosis
0.7849
n_outliers
62
outlier_rate
0.02787
zero_rate
0.003596

median_gross_rent

numeric feature high_skew outliers
This is a numeric feature for median gross rent, with 2327 non-null values and 1232 unique levels. The middle of the distribution looks plausible (median 1735, Q1–Q3 range 1441.5–2049, IQR 607.5, max 3501), but the minimum is -666666666 and the mean is -41539608.8 with std 161182638.7. These indicate a sentinel value standing in for missing data, which produces severe negative skew (-3.62) and inflates the outlier count to 289 (12.4%). The max of 3501 may itself be a top-code rather than an observed rent. Treatment: Replace the -666666666 sentinel with null before any modelling or aggregation. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
1,232
min
-6.667e+08
max
3,501
mean
-4.154e+07
median
1,735
std
1.612e+08
q1
1,441.5
q3
2,049
iqr
607.5
skew
-3.621
kurtosis
11.11
n_outliers
289
outlier_rate
0.1242
zero_rate
0
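The reported mean is enough to estimate how many rows carry the sentinel. The sketch below solves n * mean = k * sentinel + (n - k) * valid_mean for k, using the median as a stand-in for the unknown mean of the valid rows; that substitution is an approximation, though the result is insensitive to it because the sentinel is so large.

```python
SENTINEL = -666666666


def estimate_sentinel_rows(n_rows, reported_mean, typical_valid_value,
                           sentinel=SENTINEL):
    """Back-solve the number of sentinel rows from the contaminated mean."""
    return (n_rows * (typical_valid_value - reported_mean)
            / (typical_valid_value - sentinel))


# Figures from the profile: n, mean, and median of each column
k_rent = estimate_sentinel_rows(2327, -41539608.8, 1735)
k_income = estimate_sentinel_rows(2327, -36017397, 76833)
print(round(k_rent), round(k_income))
```

Under these assumptions the estimate is roughly 145 sentinel rows for median_gross_rent and roughly 126 for median_household_income, so the sentinel accounts for only part of the flagged outliers in each column.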

median_household_income

numeric feature high_skew outliers
Median household income in dollars per tract, fully populated across 2,327 rows with 2,106 unique values and a sensible median of 76,833 and IQR of 49,117. The minimum of -666,666,666 is a sentinel standing in for missing data, and it drags the mean to -36,017,397, skew to -3.94, and kurtosis to 13.53. Roughly 8.9% of rows (208) are flagged as outliers; the sentinel rows account for many of them, and the rest are probably genuine high-income tracts. The max of 250,001 looks like a top-code rather than an observed value. Treatment: Replace the -666666666 sentinel with null, then consider log-transform or winsorisation before modelling. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
2,106
min
-6.667e+08
max
250,001
mean
-3.602e+07
median
76,833
std
1.509e+08
q1
5.324e+04
q3
1.024e+05
iqr
49,117
skew
-3.94
kurtosis
13.53
n_outliers
208
outlier_rate
0.08939
zero_rate
0

total_households

numeric feature
Counts of households per tract, ranging from 0 to 8209 with a median of 1252 and mean of 1410.7. The distribution is right-skewed (skew 1.48, kurtosis 4.38) with 70 outliers (3.0%) on the high end, and 4.1% of rows are zero, which may indicate unpopulated or placeholder tracts. Treatment: Apply log1p (a plain log is undefined for the zero rows) or winsorize before modelling, and decide whether zero-household rows should be filtered. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
1,495
min
0
max
8,209
mean
1,411
median
1,252
std
923.3
q1
773.5
q3
1,850
iqr
1,076
skew
1.479
kurtosis
4.377
n_outliers
70
outlier_rate
0.03008
zero_rate
0.04125
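Winsorizing, the alternative to a log transform named in the treatment, clips extreme values to chosen percentiles instead of rescaling everything. A dependency-free sketch using a simple nearest-rank percentile; the 5%/95% cut-offs and the sample data are illustrative choices, not recommendations from the profile.

```python
def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clip values to the nearest-rank lower_q and upper_q quantiles."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lo = ordered[round(lower_q * last)]
    hi = ordered[round(upper_q * last)]
    return [min(max(v, lo), hi) for v in values]


# Illustrative data: the integers 0..100, so quantiles are easy to read off
data = list(range(101))
clipped = winsorize(data)
print(min(clipped), max(clipped))
```

Unlike filtering, winsorizing keeps every row, which matters here because the tracts are also the unit of geographic coverage.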

owner_occupied

numeric feature outliers
Despite the boolean-sounding name 'owner_occupied', this is a numeric count column with 1001 unique values ranging from 0 to 3052 and a mean of 464.6, most likely a tally of owner-occupied units per tract. The distribution is right-skewed (skew 1.76, kurtosis 4.25) with 143 outliers (6.1%) and 7.2% zeros. No nulls are present. Treatment: Apply log1p (to handle the 7% zeros) before modelling to tame the right skew. medium · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
1,001
min
0
max
3,052
mean
464.6
median
371
std
422.6
q1
177
q3
608
iqr
431
skew
1.761
kurtosis
4.254
n_outliers
143
outlier_rate
0.06145
zero_rate
0.0722

renter_occupied

numeric feature
Counts of renter-occupied units per tract, ranging from 0 to 8209 with a median of 726 and mean of 946. The distribution is right-skewed (skew 1.59, kurtosis 4.63) with 69 outliers (2.97%) and 4.38% zeros, consistent with area-level housing tallies rather than a per-household flag. Every summary statistic matches total_renter_households, so the two columns are probably duplicates and only one should be kept. Treatment: Apply log1p (a plain log is undefined for the zero rows) or scale before regression to tame the right skew. high · anthropic:claude-opus-4-7
n
2,327
nulls
0 (0.0%)
unique
1,418
min
0
max
8,209
mean
946.1
median
726
std
815.4
q1
346
q3
1,357
iqr
1,011
skew
1.595
kurtosis
4.627
n_outliers
69
outlier_rate
0.02965
zero_rate
0.04383

pct_owner_occupied

numeric feature
Percentage of owner-occupied housing per tract, spanning the full 0-100 scale with a mean of 37.5 and median of 34.4. The distribution is wide (std 25.7, IQR 39.7) and slightly right-skewed (0.39) with negative kurtosis (-0.85), indicating a flat, near-uniform spread rather than a tight central mass. About 3.2% of rows are exactly zero and 4.1% are null (96 rows, matching the 4.1% of tracts with zero total households), and no statistical outliers were flagged. Treatment: Use as-is as a bounded percentage feature; impute the 4% nulls with the median or a missingness flag. high · anthropic:claude-opus-4-7
n
2,327
nulls
96 (4.1%)
unique
823
min
0
max
100
mean
37.51
median
34.4
std
25.65
q1
16.4
q3
56.1
iqr
39.7
skew
0.3948
kurtosis
-0.854
n_outliers
0
outlier_rate
0
zero_rate
0.03227
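Median imputation with a missingness flag can be sketched as follows. The helper name and the four-value sample are hypothetical; the sample reuses the reported quartiles only to keep the numbers familiar.

```python
from statistics import median


def impute_with_flag(values):
    """Fill None with the median of the valid values and flag what was filled."""
    valid = [v for v in values if v is not None]
    fill = median(valid)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing


# Hypothetical column slice with one null
imputed, was_missing = impute_with_flag([16.4, None, 34.4, 56.1])
print(imputed, was_missing)
```

Keeping the flag lets a model learn whether empty tracts behave differently instead of treating them as typical mid-range tracts.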

pct_renter_occupied

numeric feature
Numeric percentage of renter-occupied units, spanning the full 0–100 scale with mean 62.5 and median 65.6, so most tracts are majority-renter. It appears to be the complement of pct_owner_occupied (the means sum to 100 and the skews mirror each other), so only one of the two should enter a model. Spread is wide (std 25.7, IQR 39.7) and the distribution is mildly left-skewed (-0.39) and flat (kurtosis -0.85), with no outliers flagged. About 4.1% of rows are null and only 0.27% are exact zeros, with 823 distinct values across 2,327 rows. Treatment: Use as-is as a bounded percentage feature; impute the 4.1% nulls before modelling. high · anthropic:claude-opus-4-7
n
2,327
nulls
96 (4.1%)
unique
823
min
0
max
100
mean
62.49
median
65.6
std
25.65
q1
43.9
q3
83.6
iqr
39.7
skew
-0.3948
kurtosis
-0.854
n_outliers
0
outlier_rate
0
zero_rate
0.002689
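The owner and renter percentages look like exact complements, which can be checked directly. The sketch below tests the summary figures from the profile and offers a row-level helper; the tolerance of 0.1 is an assumed allowance for one-decimal rounding.

```python
def is_complement(a, b, tol=0.1):
    """True when two percentages sum to 100 within a rounding tolerance."""
    return abs((a + b) - 100) <= tol


# (pct_owner_occupied, pct_renter_occupied) pairs from the profile:
# mean with mean, median with median, and each q1 with the other's q3
summary_pairs = [(37.51, 62.49), (34.4, 65.6), (16.4, 83.6), (56.1, 43.9)]
all_complementary = all(is_complement(o, r) for o, r in summary_pairs)


def redundant_rows(rows):
    """Fraction of non-null rows where the two columns are complements."""
    pairs = [(r["pct_owner_occupied"], r["pct_renter_occupied"]) for r in rows
             if r["pct_owner_occupied"] is not None
             and r["pct_renter_occupied"] is not None]
    return sum(is_complement(o, r) for o, r in pairs) / len(pairs)


# Hypothetical rows, including one null pair that is skipped
rows = [
    {"pct_owner_occupied": 34.4, "pct_renter_occupied": 65.6},
    {"pct_owner_occupied": None, "pct_renter_occupied": None},
    {"pct_owner_occupied": 0.0, "pct_renter_occupied": 100.0},
]
share = redundant_rows(rows)
print(all_complementary, share)
```

If the row-level share is 1.0 on the real file, including both columns in a linear model would introduce perfect collinearity.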