healthcare cms hospitals 2025

source: /home/coolhand/datasets/us-inequality-atlas/healthcare/cms_hospitals_2025.csv · 5,421 rows · 38 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a CMS hospital directory covering 5,421 U.S. hospitals across 56 state/territory codes, with 38 columns mixing facility identifiers (Facility ID, Name, Address, Phone), location fields, and a battery of CMS quality-measure summaries (Mortality, Readmission, Safety, Patient Experience, Timely & Effective care). Two things are worth a closer look first: the Hospital overall rating is 'Not Available' for 47% of facilities, and the 'Meets criteria for birthing friendly designation' field is 58% null with only 'Y' as a value, so any rating- or designation-based analysis will be heavily constrained by missingness. Beyond that, the mix is dominated by Acute Care Hospitals (3,120 / ~58%) and 'Voluntary non-profit - Private' ownership (2,291 / ~42%), with Texas, California, and Florida holding the largest state shares. The 'Count of … Measures Worse/Better' fields are highly skewed toward 0, suggesting most hospitals look 'no different than national average' on CMS comparisons, which is a useful framing before drilling into outliers.

citing: row_count · column_count · Hospital overall rating · Meets criteria for birthing friendly designation · Hospital Type · Hospital Ownership · State · Emergency Services · Count of MORT Measures Worse · Count of READM Measures Better

Charts the summary said to look at first

Hospital overall rating · Distribution of CMS star ratings — note that 'Not Available' is the single largest bucket (47%).

Top values for Hospital overall rating (6 unique shown, of 6 total).
value	count	share
Not Available	2552	47.1%
3	937	17.3%
4	765	14.1%
2	649	12.0%
5	289	5.3%
1	229	4.2%

Hospital Type · Acute Care Hospitals dominate at ~58%, with Critical Access and Psychiatric making up most of the remainder.

Top values for Hospital Type (8 unique shown, of 8 total).
value	count	share
Acute Care Hospitals	3120	57.6%
Critical Access Hospitals	1375	25.4%
Psychiatric	626	11.5%
Acute Care - Veterans Administration	132	2.4%
Childrens	94	1.7%
Rural Emergency Hospital	38	0.7%
Acute Care - Department of Defense	32	0.6%
Long-term	4	0.1%

Hospital Ownership · 'Voluntary non-profit - Private' leads at ~42%; useful for segmenting quality outcomes by ownership model.

Top values for Hospital Ownership (12 unique shown, of 12 total).
value	count	share
Voluntary non-profit - Private	2291	42.3%
Proprietary	1067	19.7%
Government - Hospital District or Authority	521	9.6%
Government - Local	400	7.4%
Voluntary non-profit - Other	361	6.7%
Voluntary non-profit - Church	275	5.1%
Government - State	210	3.9%
Veterans Health Administration	132	2.4%
Physician	74	1.4%
Government - Federal	44	0.8%
Department of Defense	32	0.6%
Tribal	14	0.3%

State · Hospital counts by state — TX, CA, and FL top the list; helpful for normalizing any state-level comparisons.

Top values for State (20 unique shown, of 56 total).
value	count	share
TX	462	8.5%
CA	378	7.0%
FL	221	4.1%
IL	194	3.6%
OH	194	3.6%
NY	191	3.5%
PA	187	3.4%
LA	160	3.0%
GA	149	2.7%
IN	149	2.7%
MI	147	2.7%
WI	142	2.6%
KS	139	2.6%
MN	136	2.5%
OK	135	2.5%
TN	123	2.3%
MO	121	2.2%
NC	120	2.2%
IA	118	2.2%
AZ	106	2.0%

Emergency Services · 83% of facilities offer emergency services, a quick sanity check on facility mix before quality analysis.

Top values for Emergency Services (2 unique shown, of 2 total).
value	count	share
Yes	4505	83.1%
No	916	16.9%

Schema

38 columns

Per-column summary.
Column	Type	Null rate	Unique	Alerts
Facility ID	text	0.0%	5,421	near_unique one_word allcaps short_text
Facility Name	text	0.0%	5,286	near_unique allcaps
Address	text	0.0%	5,387	near_unique allcaps
City/Town	text	0.0%	3,049	one_word allcaps short_text duplicates
State	categorical	0.0%	56
ZIP Code	numeric	0.0%	4,721
County/Parish	text	0.0%	1,555	one_word allcaps short_text duplicates
Telephone Number	text	0.0%	5,383	near_unique allcaps short_text
Hospital Type	categorical	0.0%	8
Hospital Ownership	categorical	0.0%	12
Emergency Services	categorical	0.0%	2
Meets criteria for birthing friendly designation	categorical	58.2%	1	null_rate imbalance
Hospital overall rating	categorical	0.0%	6
Hospital overall rating footnote	categorical	52.7%	7	null_rate
MORT Group Measure Count	categorical	0.0%	2
Count of Facility MORT Measures	categorical	0.0%	8
Count of MORT Measures Better	categorical	0.0%	9
Count of MORT Measures No Different	categorical	0.0%	9
Count of MORT Measures Worse	categorical	0.0%	7
MORT Group Footnote	numeric	67.2%	4	null_rate
Safety Group Measure Count	categorical	0.0%	2
Count of Facility Safety Measures	categorical	0.0%	9
Count of Safety Measures Better	categorical	0.0%	8
Count of Safety Measures No Different	categorical	0.0%	10
Count of Safety Measures Worse	categorical	0.0%	5
Safety Group Footnote	numeric	61.8%	4	null_rate
READM Group Measure Count	categorical	0.0%	2
Count of Facility READM Measures	categorical	0.0%	12
Count of READM Measures Better	categorical	0.0%	7
Count of READM Measures No Different	categorical	0.0%	13
Count of READM Measures Worse	categorical	0.0%	9
READM Group Footnote	numeric	78.8%	3	null_rate
Pt Exp Group Measure Count	categorical	0.0%	2
Count of Facility Pt Exp Measures	categorical	0.0%	2
Pt Exp Group Footnote	numeric	58.2%	3	null_rate
TE Group Measure Count	categorical	0.0%	2
Count of Facility TE Measures	categorical	0.0%	13
TE Group Footnote	numeric	82.9%	3	null_rate high_skew outliers

Facility ID

text identifier near_unique one_word allcaps short_text

Facility ID is a fixed-width 6-character single-token code, unique across all 5421 rows with zero nulls or duplicates. Every value is one word in all caps with length exactly 6, and the sampled tokens (e.g., 010001, 010005) are zero-padded numeric strings consistent with a primary facility identifier. Treatment: Treat as primary key; left-join on this id and exclude from modelling features. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 5,421
len_min: 6
len_max: 6
len_mean: 6
len_median: 6
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 5,421
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

Facility Name

text identifier near_unique allcaps

This column holds names of healthcare facilities, dominated by terms like 'hospital' (2740), 'center' (1579), and 'medical' (1444), with a typical length of 4 words / 28 characters. Values are near-unique (5286 distinct out of 5421) yet 135 duplicates remain, suggesting either multi-site chains or repeated reporting rows for the same facility. Nearly everything is uppercase (99.3% allcaps rate), which will trip case-sensitive joins. Treatment: Normalize case and whitespace before using as a join key on facility. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 5,286
len_min: 3
len_max: 74
len_mean: 29.21
len_median: 28
len_p95: 45
word_mean: 3.995
word_median: 4
n_empty: 0
n_duplicates: 135
duplicate_rate: 0.0249
vocab_size: 3,942
readability_flesch_mean: 6.842
emoji_rate: 0
url_rate: 0
one_word_rate: 0.001845
allcaps_rate: 0.9932
boilerplate_rate: 0
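
The case and whitespace normalization suggested above could look like the following. This is a minimal sketch, not the profiler's own method: `normalize_name` is a hypothetical helper, and the sample name is made up for illustration.

```python
def normalize_name(name: str) -> str:
    """Collapse runs of whitespace and casefold so names compare consistently."""
    # str.split() with no arguments drops leading/trailing whitespace and
    # splits on any run of whitespace; casefold() is a stricter lower().
    return " ".join(name.split()).casefold()


# Hypothetical facility name with inconsistent spacing and case.
print(normalize_name("  ST  MARY'S   Hospital "))  # st mary's hospital
```

Because 135 names are duplicated, a normalized name is still not unique on its own; Facility ID remains the safer join key.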

Address

text identifier near_unique allcaps

Free-text street addresses, with 5387 unique values across 5421 rows (near-unique) and no nulls. 99.2% are all-caps and the top tokens are 'street', 'st', 'avenue', 'drive', 'road', confirming postal-style entries averaging 3.75 words and 19 characters. 34 duplicates exist but no boilerplate, URLs, or emoji. Treatment: Drop or geocode/parse into structured components before modelling; do not feed raw. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 5,387
len_min: 7
len_max: 50
len_mean: 19.37
len_median: 19
len_p95: 29
word_mean: 3.754
word_median: 4
n_empty: 0
n_duplicates: 34
duplicate_rate: 0.006272
vocab_size: 4,996
readability_flesch_mean: 79.27
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0.9921
boilerplate_rate: 0

City/Town

text feature one_word allcaps short_text duplicates

This is a US city/town name field, almost entirely uppercased (allcaps_rate 0.994) and dominated by single-word entries (one_word_rate 0.771, word_mean 1.24). Of 5421 rows there are 3049 unique values with a 0.438 duplicate rate, led by CHICAGO (34), HOUSTON (31), and COLUMBUS (23). Lengths are short and tight (len_mean 8.6, len_max 24) and there are no nulls or empties. Treatment: Normalize case and standardize against a city gazetteer (ideally with state) before grouping or joining. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 3,049
len_min: 3
len_max: 24
len_mean: 8.611
len_median: 8
len_p95: 13
word_mean: 1.241
word_median: 1
n_empty: 0
n_duplicates: 2,372
duplicate_rate: 0.4376
vocab_size: 2,890
readability_flesch_mean: 18.29
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7709
allcaps_rate: 0.9943
boilerplate_rate: 0

State

categorical feature

US state codes covering 56 distinct values across 5,421 rows with no nulls — likely all 50 states plus territories/DC. Distribution is broad and near-uniform (entropy ratio 0.917), with TX leading at only 8.5% and CA, FL, IL, OH following. The 56 cardinality is worth noting since it exceeds the standard 50 states. Treatment: one-hot or target-encode for modelling; optionally group rare territories. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 56
top_value: TX
top_rate: 0.08522
cardinality: 56
entropy: 5.328
entropy_ratio: 0.9174
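
The entropy ratio can be reproduced from the two stats listed above, assuming it is Shannon entropy in bits divided by log2 of the cardinality. That definition is an assumption about the profiler, although the reported figures are consistent with it; `entropy_ratio` is a hypothetical helper name.

```python
import math


def entropy_ratio(entropy_bits: float, cardinality: int) -> float:
    """Observed entropy divided by the maximum entropy for this cardinality."""
    if cardinality < 2:
        return 0.0  # a single-valued column has no spread to measure
    return entropy_bits / math.log2(cardinality)


# State: 5.328 bits of entropy over 56 distinct codes.
print(round(entropy_ratio(5.328, 56), 3))  # 0.917
```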

ZIP Code

numeric identifier

This is a US ZIP code field stored as a numeric column, with 4721 unique values across 5421 rows and no nulls. The range (603 to 99929) and broad IQR (32771-76104) are consistent with national ZIP coverage, and leading-zero ZIPs (e.g., the min of 603) have already been corrupted by numeric storage. Treating the mean (53780) or std (27064) as meaningful is misleading since ZIPs are categorical identifiers, not quantities. Treatment: Cast to zero-padded 5-character strings and treat as a categorical/geographic key rather than a numeric feature. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 4,721
min: 603
max: 99,929
mean: 5.378e+04
median: 55,066
std: 2.706e+04
q1: 32,771
q3: 76,104
iqr: 43,333
skew: -0.1646
kurtosis: -0.9879
n_outliers: 0
outlier_rate: 0
zero_rate: 0
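
A minimal sketch of the zero-padding step, assuming the values arrive as plain integers (as the numeric stats suggest). `zip_to_key` is a hypothetical helper name.

```python
def zip_to_key(zip_value: int) -> str:
    """Render a numerically stored ZIP as a 5-character, zero-padded string."""
    if not 0 <= zip_value <= 99999:
        raise ValueError(f"not a 5-digit ZIP: {zip_value!r}")
    return f"{zip_value:05d}"


print(zip_to_key(603))    # 00603 (the column minimum, leading zeros restored)
print(zip_to_key(99929))  # 99929 (the column maximum, unchanged)
```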

County/Parish

text feature one_word allcaps short_text duplicates

This column holds U.S. county or parish names, stored uppercase and almost always as a single token (one_word_rate 0.87, allcaps_rate 1.0). With 1,555 uniques across 5,421 rows and a 71.3% duplicate rate, common counties repeat heavily: LOS ANGELES (88), JEFFERSON and COOK (59 each) lead. Note that vocab_size (1591) exceeds n_unique (1555), which implies that multi-word names (those starting with SAN, LOS, or ST., for example) contribute extra tokens. Treatment: Normalize casing and pair with the State column before using as a categorical geographic feature, since the same county name can occur in more than one state. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 1,555
len_min: 3
len_max: 25
len_mean: 7.34
len_median: 7
len_p95: 11
word_mean: 1.135
word_median: 1
n_empty: 0
n_duplicates: 3,866
duplicate_rate: 0.7132
vocab_size: 1,591
readability_flesch_mean: 34.44
emoji_rate: 0
url_rate: 0
one_word_rate: 0.8733
allcaps_rate: 1
boilerplate_rate: 0
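
One way to keep same-named counties apart is a composite (state, county) key. The sketch below assumes both fields are available per row; `county_key` is a hypothetical helper, and the state codes in the example are illustrative rather than taken from the data.

```python
from typing import Tuple


def county_key(state: str, county: str) -> Tuple[str, str]:
    """Build a (state, county) key with consistent case and spacing."""
    return (state.strip().upper(), " ".join(county.split()).upper())


# The same county name in two different states yields two distinct keys.
print(county_key("al", "Jefferson") == county_key("ky", "JEFFERSON"))  # False
```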

Telephone Number

text identifier near_unique allcaps short_text

This column holds US-style telephone numbers, every value exactly 14 characters long with a mean of 2 words, consistent with a `(NPA) NNN-NNNN` format. Of 5421 rows, 5383 are unique and 38 duplicates appear (0.7%), with no nulls; the most common tokens are area codes like (406), (605), and (402). The near-unique cardinality and rigid length make this an identifier rather than a feature. Treatment: Drop for modelling; retain as a contact identifier or parse out the area code if geography is useful. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 5,383
len_min: 14
len_max: 14
len_mean: 14
len_median: 14
len_p95: 14
word_mean: 2
word_median: 2
n_empty: 0
n_duplicates: 38
duplicate_rate: 0.00701
vocab_size: 5,550
readability_flesch_mean: 120.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 1
boilerplate_rate: 0
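
If the area code is wanted as a coarse geographic feature, it can be pulled out with a strict pattern match. This sketch assumes every value follows the 14-character `(NPA) NNN-NNNN` layout described above; `area_code` is a hypothetical helper and the phone number is made up.

```python
import re
from typing import Optional

# Three digits in parentheses, a space, then NNN-NNNN.
PHONE_RE = re.compile(r"\((\d{3})\) \d{3}-\d{4}")


def area_code(phone: str) -> Optional[str]:
    """Return the 3-digit area code, or None if the layout does not match."""
    match = PHONE_RE.fullmatch(phone.strip())
    return match.group(1) if match else None


print(area_code("(406) 555-0100"))  # 406
```

Returning None for a non-matching value, rather than raising, makes it easy to count rows that break the expected layout.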

Hospital Type

categorical feature

Categorical classifier of hospital facilities across 8 types, with no nulls in 5421 rows. Acute Care Hospitals dominate at 57.6% (3120 records), followed by Critical Access (1375) and Psychiatric (626); the long tail is sparse, with only 4 Long-term facilities. Entropy ratio of 0.55 confirms the distribution is heavily concentrated in the top category. Treatment: One-hot encode, optionally collapsing the four rarest types into an 'Other' bucket. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 8
top_value: Acute Care Hospitals
top_rate: 0.5755
cardinality: 8
entropy: 1.654
entropy_ratio: 0.5513
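
The 'Other' bucket can be sketched with the counts from the Hospital Type table above. The threshold `min_count=100` is an assumption picked so that exactly the four rarest types are collapsed; `collapse_rare` is a hypothetical helper.

```python
TYPE_COUNTS = {
    "Acute Care Hospitals": 3120,
    "Critical Access Hospitals": 1375,
    "Psychiatric": 626,
    "Acute Care - Veterans Administration": 132,
    "Childrens": 94,
    "Rural Emergency Hospital": 38,
    "Acute Care - Department of Defense": 32,
    "Long-term": 4,
}


def collapse_rare(value: str, counts: dict, min_count: int = 100) -> str:
    """Map categories seen fewer than min_count times to 'Other'."""
    return value if counts.get(value, 0) >= min_count else "Other"


# Re-aggregate the table under the collapsed labels.
collapsed = {}
for name, count in TYPE_COUNTS.items():
    key = collapse_rare(name, TYPE_COUNTS)
    collapsed[key] = collapsed.get(key, 0) + count

print(collapsed["Other"])  # 168, i.e. 94 + 38 + 32 + 4
```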

Hospital Ownership

categorical feature

Categorical descriptor of hospital ownership type across 5421 records, with 12 distinct categories and no nulls. Distribution is dominated by 'Voluntary non-profit - Private' at 42.26% of rows, followed by 'Proprietary' at 1067 and a long tail down to 'Tribal' at 14. Entropy ratio of 0.72 indicates moderate concentration but reasonable spread across the taxonomy. Treatment: One-hot or target-encode for modelling; consider collapsing rare tiers like 'Physician', 'Government - Federal', 'Department of Defense', and 'Tribal'. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 12
top_value: Voluntary non-profit - Private
top_rate: 0.4226
cardinality: 12
entropy: 2.586
entropy_ratio: 0.7215

Emergency Services

categorical feature

Binary Yes/No flag indicating whether the facility offers emergency services, with no nulls across 5421 rows. The distribution is imbalanced: 'Yes' accounts for 83.1% (4505) versus 916 'No' responses, giving an entropy ratio of 0.655. Treatment: Encode as a binary 0/1 indicator; be mindful of class imbalance if used as a target. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 2
top_value: Yes
top_rate: 0.831
cardinality: 2
entropy: 0.6553
entropy_ratio: 0.6553

Meets criteria for birthing friendly designation

categorical feature null_rate imbalance

This appears to be a binary flag indicating whether a hospital meets criteria for a 'birthing friendly' designation, but it functions as a presence indicator only. Of 5421 rows, 58.24% are null and the remaining 2264 are all 'Y' — there are no 'N' values, giving cardinality 1 and zero entropy. Nulls effectively mean 'not designated' rather than missing data. Treatment: Recode nulls as 'N' to form a usable binary indicator, or drop since it carries the same information as the null mask. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 3,157 (58.2%)
unique: 1
top_value: Y
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
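
Recoding this column as a boolean is a one-liner once nulls are read as "not designated". That reading is the interpretation given above rather than something the data confirms; `birthing_friendly_flag` is a hypothetical helper, and nulls are assumed to arrive as None or an empty string.

```python
from typing import Optional


def birthing_friendly_flag(value: Optional[str]) -> bool:
    """True only for an explicit 'Y'; nulls and empty strings become False."""
    return value is not None and value.strip() == "Y"


print(birthing_friendly_flag("Y"))   # True
print(birthing_friendly_flag(None))  # False
```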

Hospital overall rating

categorical label

This is the CMS-style 1-5 hospital star rating stored as a string, with a sixth bucket 'Not Available' for unrated facilities. The striking issue is that 'Not Available' dominates at 47.1% of 5,421 rows, outnumbering every actual rating; among rated hospitals, 3 stars (937) is most common and 1 star (229) least. Entropy ratio of 0.83 across 6 categories confirms the distribution is spread but heavily anchored by missingness-as-category. Treatment: Recode 'Not Available' to NaN and treat the remainder as an ordinal 1-5 scale. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 6
top_value: Not Available
top_rate: 0.4708
cardinality: 6
entropy: 2.133
entropy_ratio: 0.8252
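
A minimal sketch of the recode, assuming the only non-numeric value is the literal 'Not Available' (as the value table for this column shows). `parse_star_rating` is a hypothetical helper; None stands in for NaN.

```python
from typing import Optional


def parse_star_rating(raw: str) -> Optional[int]:
    """Turn a rating string into an ordinal 1-5 integer, or None if unrated."""
    text = raw.strip()
    if text == "Not Available":
        return None
    rating = int(text)
    if not 1 <= rating <= 5:
        raise ValueError(f"unexpected star rating: {raw!r}")
    return rating


print(parse_star_rating("4"))              # 4
print(parse_star_rating("Not Available"))  # None
```

Raising on anything outside 1-5 surfaces unexpected values early instead of silently treating them as valid ratings.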

Hospital overall rating footnote

categorical metadata null_rate

Footnote codes attached to the Hospital overall rating, with only 7 distinct values across 5,421 rows. Over half the column is null (null_rate 0.527), and among populated rows code '16' dominates at 65.4% followed by '19', so the field carries low information (entropy_ratio 0.41). One compound entry '16, 23' appears twice, indicating values can be multi-coded. Treatment: Keep as a categorical flag alongside the rating; split the rare compound codes if you need per-footnote analysis. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 2,857 (52.7%)
unique: 7
top_value: 16
top_rate: 0.6537
cardinality: 7
entropy: 1.158
entropy_ratio: 0.4126
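
For per-footnote analysis, compound entries such as '16, 23' can be split into individual codes. The sketch assumes codes are comma-separated integers and that nulls arrive as None; `split_footnotes` is a hypothetical helper.

```python
from typing import List, Optional


def split_footnotes(raw: Optional[str]) -> List[int]:
    """Split a footnote cell into integer codes; nulls give an empty list."""
    if raw is None:
        return []
    # int() tolerates surrounding whitespace, so ' 23' parses as 23.
    return [int(part) for part in raw.split(",") if part.strip()]


print(split_footnotes("16, 23"))  # [16, 23]
print(split_footnotes("16"))      # [16]
print(split_footnotes(None))      # []
```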

MORT Group Measure Count

categorical feature

This is a binary categorical field indicating the count of measures in a mortality (MORT) group, holding only the value '7' (84.1% of rows) or 'Not Available' (the remaining 863 rows). Despite being labeled a count, it functions as a presence flag: when data exists, the count is always 7. No nulls are recorded, since missingness is encoded as the literal string 'Not Available'. Treatment: Recode to a boolean availability flag ('7' vs 'Not Available') before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 2
top_value: 7
top_rate: 0.8408
cardinality: 2
entropy: 0.6324
entropy_ratio: 0.6324

Count of Facility MORT Measures

categorical feature

Likely the count of facility mortality (MORT) measures reported per hospital, stored as strings 1-7 alongside a 'Not Available' sentinel. The dominant value is 'Not Available' at 32.8% (1777/5421), which makes the column nominally categorical with 8 levels even though 7 of them are integers. Entropy ratio of 0.92 indicates the non-null counts are spread fairly evenly across 1-7. Treatment: Replace 'Not Available' with NaN, cast remaining values to integer, and optionally add a missingness indicator. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 8
top_value: Not Available
top_rate: 0.3278
cardinality: 8
entropy: 2.765
entropy_ratio: 0.9217
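
The same sentinel handling applies to every 'Count of …' column in this report, so a single parser that returns both the integer and a missingness indicator may be enough. This is a sketch under that assumption; `parse_count` and `SENTINEL` are hypothetical names.

```python
from typing import Optional, Tuple

SENTINEL = "Not Available"


def parse_count(raw: str) -> Tuple[Optional[int], bool]:
    """Return (count, was_missing) for a count stored as a string."""
    text = raw.strip()
    if text == SENTINEL:
        return None, True
    return int(text), False


print(parse_count("7"))              # (7, False)
print(parse_count("Not Available"))  # (None, True)
```

Keeping the indicator separate lets a model see that a count was unavailable even after the count itself has been imputed or dropped.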

Count of MORT Measures Better

categorical feature

Categorical count of mortality measures on which a hospital performed better than the national rate, encoded as small integers stored as strings (0–7) plus a 'Not Available' sentinel. The distribution is heavily concentrated at 0 (57.8% of 5421 rows) with another 1777 rows marked 'Not Available', leaving only ~10% of hospitals showing any 'better' measure. Entropy ratio of 0.458 confirms the skew, and the sentinel string blocks direct numeric use. Treatment: Replace 'Not Available' with NaN, cast remaining values to int, and consider binarising (any-better vs none) given the skew. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 9
top_value: 0
top_rate: 0.5785
cardinality: 9
entropy: 1.453
entropy_ratio: 0.4583

Count of MORT Measures No Different

categorical feature

This column counts how many mortality (MORT) measures a hospital scored 'No Different' on, encoded as a small integer 0-7 but stored as strings alongside a 'Not Available' sentinel. Cardinality is just 9 with high entropy ratio (0.885), meaning values are spread fairly evenly across the integers — except 'Not Available' dominates at 32.8% (1777/5421) and '0' is rare at only 12 rows. Analysts should note the heavy missing-as-string encoding and the near-empty '0' bucket. Treatment: Recode 'Not Available' to null and cast remaining values to integer before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 9
top_value: Not Available
top_rate: 0.3278
cardinality: 9
entropy: 2.806
entropy_ratio: 0.8852

Count of MORT Measures Worse

categorical feature

This column counts how many mortality (MORT) measures a hospital scored worse than the national average, encoded as small integers 0-5 with a 'Not Available' sentinel mixed in. Roughly 60% of 5421 rows are '0' and another 1777 are 'Not Available', leaving only 378 hospitals flagged on any MORT measure and just 1 hospital at the maximum of 5. The mixed string/integer encoding is the main gotcha. Treatment: Replace 'Not Available' with NaN and cast to integer before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 7
top_value: 0
top_rate: 0.6025
cardinality: 7
entropy: 1.294
entropy_ratio: 0.4608

MORT Group Footnote

numeric metadata null_rate

This appears to be a footnote code attached to MORT (mortality) group records, stored as a small numeric category with only 4 distinct values ranging from 5 to 23. Two-thirds of rows are null (null_rate 0.672), suggesting the footnote applies only to a minority of records. The strongly negative kurtosis (-1.96) and bimodal-looking quartiles (Q1=5, Q3=19) confirm these are discrete codes rather than a continuous measurement. Treatment: Cast to categorical and treat nulls as 'no footnote' rather than imputing numerically. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 3,643 (67.2%)
unique: 4
min: 5
max: 23
mean: 11.58
median: 5
std: 7.057
q1: 5
q3: 19
iqr: 14
skew: 0.1488
kurtosis: -1.959
n_outliers: 0
outlier_rate: 0
zero_rate: 0
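
Because the codes were read as numbers, they may arrive as floats with NaN for nulls. A minimal sketch of turning them into categorical labels under that assumption follows; `footnote_label` is a hypothetical helper, and the same approach would fit the other group footnote columns.

```python
import math
from typing import Optional


def footnote_label(code: Optional[float]) -> str:
    """Map a numeric footnote code to a label; None or NaN means no footnote."""
    if code is None or math.isnan(code):
        return "no footnote"
    return str(int(code))


print(footnote_label(19.0))          # 19
print(footnote_label(float("nan")))  # no footnote
```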

Safety Group Measure Count

categorical feature

Binary categorical with only two values across 5421 rows: the literal string "8" (84.1%) and "Not Available" (15.9%). Despite the name suggesting a count, it functions as a flag indicating either a fixed measure count of 8 or missing data encoded as text. Treatment: Recode "Not Available" to null and binarize, or drop given near-constant value. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 2
top_value: 8
top_rate: 0.8408
cardinality: 2
entropy: 0.6324
entropy_ratio: 0.6324

Count of Facility Safety Measures

categorical feature

This column reports the count of facility safety measures, stored as a categorical with 9 distinct values (1-8 plus 'Not Available'). The dominant value is 'Not Available' at 38.1% (2,065 of 5,421 rows), so missingness is encoded as a string rather than as a null. Among reported counts, 7 is most common (733), and the distribution across 1-8 is uneven rather than monotonic. Treatment: Recode 'Not Available' to null and cast remaining values to integer before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 9
top_value: Not Available
top_rate: 0.3809
cardinality: 9
entropy: 2.753
entropy_ratio: 0.8684

Count of Safety Measures Better

categorical feature

This is a small-integer count (0-6) of safety measures rated 'Better', stored as strings alongside a 'Not Available' sentinel. The sentinel dominates at 38.1% (2065/5421) of rows, and the distribution is heavily right-skewed: 0 and 1 cover 2600 rows while 5 and 6 appear only 14 and 3 times. Cardinality is 8 with entropy ratio 0.70, so most signal sits in the lower bins. Treatment: Recode 'Not Available' to missing, cast remaining values to integer, and consider binning the sparse 4-6 tail. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 8
top_value: Not Available
top_rate: 0.3809
cardinality: 8
entropy: 2.11
entropy_ratio: 0.7033
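
Binning the sparse upper tail can be done by capping the count once the sentinel has been handled. The cap of 4 follows the "4-6 tail" suggestion above; `bin_better_count` is a hypothetical helper.

```python
def bin_better_count(count: int, cap: int = 4) -> str:
    """Label counts at or above cap as '<cap>+'; keep smaller counts as-is."""
    if count < 0:
        raise ValueError(f"count cannot be negative: {count}")
    return f"{cap}+" if count >= cap else str(count)


print([bin_better_count(c) for c in range(7)])
# ['0', '1', '2', '3', '4+', '4+', '4+']
```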

Count of Safety Measures No Different

categorical feature

A small-integer count (0-8) of safety measures rated 'no different', stored as strings alongside a 'Not Available' sentinel. The sentinel dominates at 38.1% of 5421 rows (2065), making it the modal value ahead of any actual count. Among real values, 5, 2, 4, and 1 cluster between 509-656 occurrences while 0 and 8 are rare (20 and 10). Treatment: Cast numeric levels to int and treat 'Not Available' as an explicit missing indicator before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 10
top_value: Not Available
top_rate: 0.3809
cardinality: 10
entropy: 2.685
entropy_ratio: 0.8083

Count of Safety Measures Worse

categorical feature

This is a low-cardinality count column (5 distinct values) recording how many safety measures were rated 'Worse' in CMS comparisons, taking integer values 0-3 plus a 'Not Available' sentinel. Most rows (54.3%) are 0 and another 2065 are 'Not Available', leaving only 415 rows with any measure rated worse (1-3). The mix of numeric strings and a textual missing-marker means it isn't cleanly numeric as stored. Treatment: Recode 'Not Available' to NaN and cast to integer before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 5
top_value: 0
top_rate: 0.5425
cardinality: 5
entropy: 1.338
entropy_ratio: 0.5764

Safety Group Footnote

numeric metadata null_rate

This appears to be a footnote code column tied to a 'Safety Group' classification, encoded numerically but with only 4 distinct values ranging from 5 to 23. 61.8% of rows are null, suggesting footnotes apply only to a minority of records. The bimodal-looking distribution (median 5, q3 19, kurtosis -1.81) indicates these are categorical flags rather than a true measurement. Treatment: Cast to categorical and treat nulls as 'no footnote' rather than imputing numerically. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 3,350 (61.8%)
unique: 4
min: 5
max: 23
mean: 10.69
median: 5
std: 6.95
q1: 5
q3: 19
iqr: 14
skew: 0.4116
kurtosis: -1.809
n_outliers: 0
outlier_rate: 0
zero_rate: 0

READM Group Measure Count

categorical metadata

A categorical column with only two distinct values: the count '11' covers 84.1% of 5421 rows, with the remaining 863 rows marked 'Not Available'. Functionally this is a binary availability flag rather than a true count, since '11' is the only numeric value present. The mixing of a numeric string with a sentinel 'Not Available' is the main surprise. Treatment: Recode to a binary available/not-available indicator before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 2
top_value: 11
top_rate: 0.8408
cardinality: 2
entropy: 0.6324
entropy_ratio: 0.6324

Count of Facility READM Measures

categorical feature

This column reports the count of facility readmission (READM) measures available per facility, stored as strings rather than integers. With 12 distinct values and entropy ratio 0.965, the values are spread fairly evenly across small integers, but the modal value is the literal string "Not Available" at 21.2% (1150 of 5421 rows), which masks missingness as a category. Numeric values like "11", "8", "6", and "9" dominate the rest. Treatment: Replace "Not Available" with null and cast to integer before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 12
top_value: Not Available
top_rate: 0.2121
cardinality: 12
entropy: 3.459
entropy_ratio: 0.965

Count of READM Measures Better

categorical feature

Counts how many readmission measures a hospital scored 'better than national rate' on, encoded as small integers 0-5 with 'Not Available' as a seventh category. The distribution is heavily concentrated at 0 (61.5% of 5421 rows) and tails off rapidly: only 28 rows hit 3, 10 hit 4, and 3 hit 5. Notably, 'Not Available' (1150 rows) is the second most common value, larger than every nonzero count combined. Treatment: Cast numeric levels to int and treat 'Not Available' as an explicit missing-category flag before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 7
top_value: 0
top_rate: 0.6146
cardinality: 7
entropy: 1.51
entropy_ratio: 0.5379

Count of READM Measures No Different

categorical feature

This column counts how many readmission (READM) measures at a hospital scored 'No Different' than the national rate, encoded as integer strings (12 distinct counts) alongside a 'Not Available' sentinel. With 13 unique values and entropy ratio 0.92, the distribution is fairly flat across counts 1-9 (each ~370-500 rows), but 'Not Available' is the single largest bucket at 21.2% (1,150 of 5,421). That missingness-as-string is the main surprise and would corrupt any numeric aggregation if not handled. Treatment: Cast to integer with 'Not Available' converted to NaN before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 13
top_value: Not Available
top_rate: 0.2121
cardinality: 13
entropy: 3.408
entropy_ratio: 0.9211

Count of READM Measures Worse

categorical feature

This appears to be a hospital-level count of readmission measures performing worse than the national rate, stored as a categorical/string field with values 0-7 plus a 'Not Available' sentinel. The distribution is heavily concentrated at '0' (55.1% of 5421 rows) with another 1150 rows marked 'Not Available', leaving genuine non-zero counts as a long tail that thins to just 1-2 hospitals at values 5, 6, and 7. The mixing of numeric strings with 'Not Available' is the main gotcha. Treatment: Replace 'Not Available' with NaN and cast remaining values to integer before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 9
top_value: 0
top_rate: 0.5512
cardinality: 9
entropy: 1.758
entropy_ratio: 0.5545

READM Group Footnote

numeric metadata null_rate

Likely a footnote/flag code attached to a hospital readmission group metric, encoded as a small integer rather than a true numeric quantity. Only 3 distinct values appear (min 5, max 22, median 19) and 78.79% of rows are null, so the column is sparse and categorical in nature. The non-null distribution is concentrated at the higher codes, with q1=5 and q3=19 indicating a bimodal split between two code clusters. Treatment: Cast to categorical footnote code and treat nulls as 'no footnote' rather than imputing numerically. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 4,271 (78.8%)
unique: 3
min: 5
max: 22
mean: 15.15
median: 19
std: 6.366
q1: 5
q3: 19
iqr: 14
skew: -0.9528
kurtosis: -1.051
n_outliers: 0
outlier_rate: 0
zero_rate: 0

Pt Exp Group Measure Count

categorical metadata

A binary categorical with only two values: the string "8" covering 84.08% of 5,421 rows and "Not Available" for the remaining 863. This looks like a measure-count field that is fixed at 8 when reported, otherwise marked as missing via a sentinel rather than a true null (null_rate is 0.0). Entropy ratio of 0.63 reflects that imbalance. Treatment: Convert "Not Available" to a true null and treat as a binary missingness indicator; the column carries little signal otherwise. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 2
top_value: 8
top_rate: 0.8408
cardinality: 2
entropy: 0.6324
entropy_ratio: 0.6324

Count of Facility Pt Exp Measures

categorical feature

This column is a binary categorical indicator with only two distinct values across all 5,421 rows: the literal string "8" (58.2%) and "Not Available" (41.8%). Despite being labeled as a count, it never varies numerically — every reporting facility either has exactly 8 measures or none reported, making this effectively a presence/absence flag with high entropy ratio (0.98). The 41.8% "Not Available" rate is a substantial missingness signal masquerading as a category. Treatment: Recode as a binary available/not-available flag rather than treating as a numeric count. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 2
top_value: 8
top_rate: 0.5818
cardinality: 2
entropy: 0.9806
entropy_ratio: 0.9806

Pt Exp Group Footnote

numeric metadata null_rate

This appears to be a footnote code attached to a 'Pt Exp Group' (patient experience group) measure, stored as a numeric but functioning as a categorical flag. Only 3 distinct values appear (min 5, median 5, max 22) and 58.18% of rows are null, so footnotes apply to a sizeable minority (about 42%) of records. The bimodal-looking spread (Q1=5, Q3=19, kurtosis -1.66) confirms this is a small set of discrete codes rather than a true numeric measurement. Treatment: Cast to categorical footnote code and treat nulls as 'no footnote' rather than missing. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 3,154 (58.2%)
unique: 3
min: 5
max: 22
mean: 10.15
median: 5
std: 6.806
q1: 5
q3: 19
iqr: 14
skew: 0.571
kurtosis: -1.658
n_outliers: 0
outlier_rate: 0
zero_rate: 0

TE Group Measure Count

categorical feature

Binary categorical column with only two values: '12' (84.1% of rows) and 'Not Available' (the remaining 863 rows). The literal '12' suggests this captures a count of measures per TE group, but it is stored as a string and effectively constant when present. The 'Not Available' sentinel functions as the only real source of variation. Treatment: Convert to a boolean is_available flag; drop the constant '12' value. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 2
top_value: 12
top_rate: 0.8408
cardinality: 2
entropy: 0.6324
entropy_ratio: 0.6324

Count of Facility TE Measures

categorical feature

This column reports the count of facility TE (timely & effective care) measures available per record, but it's stored as strings mixing 12 distinct numeric values (up to "12") with the sentinel "Not Available". The most common value is "Not Available" at 17.1% (928 of 5421), making missingness the modal category despite a reported null_rate of 0. Entropy ratio of 0.93 across 13 categories shows the non-null counts are spread fairly evenly, with "10" and "11" the dominant numeric values. Treatment: Recode "Not Available" to null and cast remaining values to integer before modelling. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 0 (0.0%)
unique: 13
top_value: Not Available
top_rate: 0.1712
cardinality: 13
entropy: 3.458
entropy_ratio: 0.9343

TE Group Footnote

numeric metadata null_rate high_skew outliers

This appears to be a footnote reference code for a 'TE Group', stored numerically but functioning as a categorical pointer with only 3 distinct values across 5421 rows. It is overwhelmingly empty (82.88% null) and heavily concentrated at 19 (both q1 and q3 equal 19, iqr=0), producing a strong left skew (-2.43) and 133 outliers (14.33%) at the low end down to 5. The numeric stats like mean=17.58 are misleading given the column is effectively a small code set. Treatment: Cast to categorical and treat nulls as 'no footnote'; do not use as a numeric feature. high · anthropic:claude-opus-4-7

n: 5,421
nulls: 4,493 (82.9%)
unique: 3
min: 5
max: 22
mean: 17.58
median: 19
std: 4.432
q1: 19
q3: 19
iqr: 0
skew: -2.43
kurtosis: 4.12
n_outliers: 133
outlier_rate: 0.1433
zero_rate: 0
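
The outlier count here follows from the zero IQR rather than from anything unusual in the data. The sketch below assumes the profiler uses the common Tukey rule (values beyond 1.5 × IQR from the quartiles), which is not confirmed by the report; `tukey_fences` is a hypothetical helper.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper outlier fences under the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# With q1 == q3 == 19 the IQR is 0, so both fences sit at 19 and every
# other code is flagged as an outlier under this rule.
low, high = tukey_fences(19, 19)
print(low == 19 and high == 19)  # True

# The reported outlier_rate is taken over non-null rows only.
non_null = 5421 - 4493
print(non_null, round(133 / non_null, 4))  # 928 0.1433
```

This is one more reason to treat the column as a categorical code set: the "outliers" are simply the less common footnote codes.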