healthcare cms hospitals 2025
Reading
This dataset is a CMS hospital directory covering 5,421 U.S. hospitals across 56 state/territory codes, with 38 columns mixing facility identifiers (Facility ID, Name, Address, Phone), location fields, and a battery of CMS quality-measure summaries (Mortality, Readmission, Safety, Patient Experience, Timely & Effective care). Two things are worth a closer look first: the Hospital overall rating is 'Not Available' for 47% of facilities, and the 'Meets criteria for birthing friendly designation' field is 58% null with only 'Y' as a value, so any rating- or designation-based analysis will be heavily gated by missingness. Beyond that, the mix is dominated by Acute Care Hospitals (3,120) and Voluntary non-profit – Private ownership (2,291 / ~42%), with Texas, California, and Florida holding the largest state shares. The 'Count of … Measures Worse/Better' fields are highly skewed toward 0, suggesting most hospitals look 'no different than national average' on CMS comparisons — a useful framing before drilling into outliers.
citing: row_count · column_count · Hospital overall rating · Meets criteria for birthing friendly designation · Hospital Type · Hospital Ownership · State · Emergency Services · Count of MORT Measures Worse · Count of READM Measures Better
Charts the summary said to look at first
Hospital overall rating
| value | count | share |
|---|---|---|
| Not Available | 2552 | 47.1% |
| 3 | 937 | 17.3% |
| 4 | 765 | 14.1% |
| 2 | 649 | 12.0% |
| 5 | 289 | 5.3% |
| 1 | 229 | 4.2% |
Hospital Type
| value | count | share |
|---|---|---|
| Acute Care Hospitals | 3120 | 57.6% |
| Critical Access Hospitals | 1375 | 25.4% |
| Psychiatric | 626 | 11.5% |
| Acute Care - Veterans Administration | 132 | 2.4% |
| Childrens | 94 | 1.7% |
| Rural Emergency Hospital | 38 | 0.7% |
| Acute Care - Department of Defense | 32 | 0.6% |
| Long-term | 4 | 0.1% |
Hospital Ownership
| value | count | share |
|---|---|---|
| Voluntary non-profit - Private | 2291 | 42.3% |
| Proprietary | 1067 | 19.7% |
| Government - Hospital District or Authority | 521 | 9.6% |
| Government - Local | 400 | 7.4% |
| Voluntary non-profit - Other | 361 | 6.7% |
| Voluntary non-profit - Church | 275 | 5.1% |
| Government - State | 210 | 3.9% |
| Veterans Health Administration | 132 | 2.4% |
| Physician | 74 | 1.4% |
| Government - Federal | 44 | 0.8% |
| Department of Defense | 32 | 0.6% |
| Tribal | 14 | 0.3% |
State (top 20 by count)
| value | count | share |
|---|---|---|
| TX | 462 | 8.5% |
| CA | 378 | 7.0% |
| FL | 221 | 4.1% |
| IL | 194 | 3.6% |
| OH | 194 | 3.6% |
| NY | 191 | 3.5% |
| PA | 187 | 3.4% |
| LA | 160 | 3.0% |
| GA | 149 | 2.7% |
| IN | 149 | 2.7% |
| MI | 147 | 2.7% |
| WI | 142 | 2.6% |
| KS | 139 | 2.6% |
| MN | 136 | 2.5% |
| OK | 135 | 2.5% |
| TN | 123 | 2.3% |
| MO | 121 | 2.2% |
| NC | 120 | 2.2% |
| IA | 118 | 2.2% |
| AZ | 106 | 2.0% |
Emergency Services
| value | count | share |
|---|---|---|
| Yes | 4505 | 83.1% |
| No | 916 | 16.9% |
Schema
38 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| Facility ID | text | 0.0% | 5,421 | near_unique, one_word, allcaps, short_text |
| Facility Name | text | 0.0% | 5,286 | near_unique, allcaps |
| Address | text | 0.0% | 5,387 | near_unique, allcaps |
| City/Town | text | 0.0% | 3,049 | one_word, allcaps, short_text, duplicates |
| State | categorical | 0.0% | 56 | |
| ZIP Code | numeric | 0.0% | 4,721 | |
| County/Parish | text | 0.0% | 1,555 | one_word, allcaps, short_text, duplicates |
| Telephone Number | text | 0.0% | 5,383 | near_unique, allcaps, short_text |
| Hospital Type | categorical | 0.0% | 8 | |
| Hospital Ownership | categorical | 0.0% | 12 | |
| Emergency Services | categorical | 0.0% | 2 | |
| Meets criteria for birthing friendly designation | categorical | 58.2% | 1 | null_rate, imbalance |
| Hospital overall rating | categorical | 0.0% | 6 | |
| Hospital overall rating footnote | categorical | 52.7% | 7 | null_rate |
| MORT Group Measure Count | categorical | 0.0% | 2 | |
| Count of Facility MORT Measures | categorical | 0.0% | 8 | |
| Count of MORT Measures Better | categorical | 0.0% | 9 | |
| Count of MORT Measures No Different | categorical | 0.0% | 9 | |
| Count of MORT Measures Worse | categorical | 0.0% | 7 | |
| MORT Group Footnote | numeric | 67.2% | 4 | null_rate |
| Safety Group Measure Count | categorical | 0.0% | 2 | |
| Count of Facility Safety Measures | categorical | 0.0% | 9 | |
| Count of Safety Measures Better | categorical | 0.0% | 8 | |
| Count of Safety Measures No Different | categorical | 0.0% | 10 | |
| Count of Safety Measures Worse | categorical | 0.0% | 5 | |
| Safety Group Footnote | numeric | 61.8% | 4 | null_rate |
| READM Group Measure Count | categorical | 0.0% | 2 | |
| Count of Facility READM Measures | categorical | 0.0% | 12 | |
| Count of READM Measures Better | categorical | 0.0% | 7 | |
| Count of READM Measures No Different | categorical | 0.0% | 13 | |
| Count of READM Measures Worse | categorical | 0.0% | 9 | |
| READM Group Footnote | numeric | 78.8% | 3 | null_rate |
| Pt Exp Group Measure Count | categorical | 0.0% | 2 | |
| Count of Facility Pt Exp Measures | categorical | 0.0% | 2 | |
| Pt Exp Group Footnote | numeric | 58.2% | 3 | null_rate |
| TE Group Measure Count | categorical | 0.0% | 2 | |
| Count of Facility TE Measures | categorical | 0.0% | 13 | |
| TE Group Footnote | numeric | 82.9% | 3 | null_rate, high_skew, outliers |
Facility ID
text identifier near_unique one_word allcaps short_text
Facility ID is a fixed-width 6-character single-token code, unique across all 5421 rows with zero nulls or duplicates. Every value is one word in all caps with length exactly 6, and the sampled tokens (e.g., 010001, 010005) are zero-padded numeric strings consistent with a primary facility identifier. Treatment: Treat as primary key; left-join on this id and exclude from modelling features.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 5,421
- len_min
- 6
- len_max
- 6
- len_mean
- 6
- len_median
- 6
- len_p95
- 6
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 0
- duplicate_rate
- 0
- vocab_size
- 5,421
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
Facility Name
text identifier near_unique allcaps
This column holds names of healthcare facilities, dominated by terms like 'hospital' (2740), 'center' (1579), and 'medical' (1444), with a typical length of 4 words / 28 characters. Values are near-unique (5286 distinct out of 5421) yet 135 duplicates remain. Because Facility ID is unique across all rows, these are separate facility records that share a name (for example multi-site chains), not repeated rows. Nearly everything is uppercase (99.3% allcaps rate), which will trip case-sensitive joins. Treatment: Normalize case and whitespace before using as a join key on facility.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 5,286
- len_min
- 3
- len_max
- 74
- len_mean
- 29.21
- len_median
- 28
- len_p95
- 45
- word_mean
- 3.995
- word_median
- 4
- n_empty
- 0
- n_duplicates
- 135
- duplicate_rate
- 0.0249
- vocab_size
- 3,942
- readability_flesch_mean
- 6.842
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.001845
- allcaps_rate
- 0.9932
- boilerplate_rate
- 0
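The normalization suggested for Facility Name can be sketched in plain Python as follows. This is a minimal illustration, not the profiler's own method: `normalize_name` is a hypothetical helper, and the sample strings are invented rather than taken from the dataset.

```python
def normalize_name(name: str) -> str:
    """Collapse runs of whitespace and upper-case the result."""
    return " ".join(name.split()).upper()


# Hypothetical raw values (not from the dataset) that should map to one key.
raw_names = ["  Mercy   Hospital ", "MERCY HOSPITAL", "mercy hospital"]
keys = {normalize_name(n) for n in raw_names}
print(keys)
```

Since 135 names are duplicated, a normalized name is probably not a unique key on its own; pairing it with State or falling back to Facility ID is likely safer.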
Address
text identifier near_unique allcaps
Free-text street addresses, with 5387 unique values across 5421 rows (near-unique) and no nulls. 99.2% are all-caps and the top tokens are 'street', 'st', 'avenue', 'drive', 'road', confirming postal-style entries averaging 3.75 words and 19 characters. 34 duplicates exist but no boilerplate, URLs, or emoji. Treatment: Drop or geocode/parse into structured components before modelling; do not feed raw.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 5,387
- len_min
- 7
- len_max
- 50
- len_mean
- 19.37
- len_median
- 19
- len_p95
- 29
- word_mean
- 3.754
- word_median
- 4
- n_empty
- 0
- n_duplicates
- 34
- duplicate_rate
- 0.006272
- vocab_size
- 4,996
- readability_flesch_mean
- 79.27
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0
- allcaps_rate
- 0.9921
- boilerplate_rate
- 0
City/Town
text feature one_word allcaps short_text duplicates
This is a US city/town name field, almost entirely uppercased (allcaps_rate 0.994) and dominated by single-word entries (one_word_rate 0.771, word_mean 1.24). Of 5421 rows there are 3049 unique values with a 0.438 duplicate rate, led by CHICAGO (34), HOUSTON (31), and COLUMBUS (23). Lengths are short and tight (len_mean 8.6, len_max 24) and there are no nulls or empties. Treatment: Normalize case and standardize against a city gazetteer (ideally with state) before grouping or joining.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 3,049
- len_min
- 3
- len_max
- 24
- len_mean
- 8.611
- len_median
- 8
- len_p95
- 13
- word_mean
- 1.241
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 2,372
- duplicate_rate
- 0.4376
- vocab_size
- 2,890
- readability_flesch_mean
- 18.29
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.7709
- allcaps_rate
- 0.9943
- boilerplate_rate
- 0
State
categorical feature
US state codes covering 56 distinct values across 5,421 rows with no nulls, likely all 50 states plus DC and territories. Distribution is broad and near-uniform (entropy ratio 0.917), with TX leading at only 8.5% and CA, FL, IL, OH following. The 56 cardinality is worth noting since it exceeds the standard 50 states. Treatment: one-hot or target-encode for modelling; optionally group rare territories.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 56
- top_value
- TX
- top_rate
- 0.08522
- cardinality
- 56
- entropy
- 5.328
- entropy_ratio
- 0.9174
ZIP Code
numeric identifier
This is a US ZIP code field stored as a numeric column, with 4721 unique values across 5421 rows and no nulls. The range (603 to 99929) and broad IQR (32771-76104) are consistent with national ZIP coverage, and leading-zero ZIPs (e.g., the min of 603) have already been corrupted by numeric storage. Treating the mean (53780) or std (27064) as meaningful is misleading since ZIPs are categorical identifiers, not quantities. Treatment: Cast to zero-padded 5-character strings and treat as a categorical/geographic key rather than a numeric feature.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 4,721
- min
- 603
- max
- 99,929
- mean
- 5.378e+04
- median
- 55,066
- std
- 2.706e+04
- q1
- 32,771
- q3
- 76,104
- iqr
- 43,333
- skew
- -0.1646
- kurtosis
- -0.9879
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
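The ZIP Code treatment (zero-padding back to five characters) might look like the sketch below. `zip_to_str` is a hypothetical helper name; the two inputs are the reported minimum and maximum of the column.

```python
def zip_to_str(zip_code: int) -> str:
    """Render a numerically stored ZIP as a zero-padded 5-character string."""
    if not 0 <= zip_code <= 99999:
        raise ValueError(f"not a 5-digit ZIP: {zip_code}")
    return f"{zip_code:05d}"


# The column's reported min (603) regains its leading zeros; the max is unchanged.
print(zip_to_str(603), zip_to_str(99929))
```

Keeping the result as a string prevents downstream tools from computing means or standard deviations on it.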
County/Parish
text feature one_word allcaps short_text duplicates
This column holds U.S. county or parish names, stored uppercase and almost always as a single token (one_word_rate 0.87, allcaps_rate 1.0). With 1,555 uniques across 5,421 rows and a 71.3% duplicate rate, common counties repeat heavily: LOS ANGELES (88), JEFFERSON and COOK (59 each) lead. Note that vocab_size (1591) exceeds n_unique (1555), implying multi-word names like SAN/LOS/ST. variants contribute extra tokens. Treatment: Normalize casing and combine with the State column before using as a categorical geographic feature, since the same county name can occur in more than one state.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 1,555
- len_min
- 3
- len_max
- 25
- len_mean
- 7.34
- len_median
- 7
- len_p95
- 11
- word_mean
- 1.135
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 3,866
- duplicate_rate
- 0.7132
- vocab_size
- 1,591
- readability_flesch_mean
- 34.44
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.8733
- allcaps_rate
- 1
- boilerplate_rate
- 0
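Why County/Parish should be paired with State can be shown with a small sketch. The rows below are hypothetical (state, county) pairs invented for illustration; only the idea of a composite key comes from the treatment above.

```python
from collections import Counter

# Hypothetical (State, County/Parish) pairs, not taken from the dataset.
rows = [
    ("AL", "JEFFERSON"),
    ("KY", "JEFFERSON"),
    ("AL", "JEFFERSON"),
    ("IL", "COOK"),
]

# Grouping on the county name alone merges counties from different states.
by_county = Counter(county for _, county in rows)

# Grouping on the (state, county) tuple keeps them apart.
by_state_county = Counter(rows)

print(by_county["JEFFERSON"], by_state_county[("AL", "JEFFERSON")])
```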
Telephone Number
text identifier near_unique allcaps short_text
This column holds US-style telephone numbers, every value exactly 14 characters long with a mean of 2 words, consistent with a `(NPA) NNN-NNNN` format. Of 5421 rows, 5383 are unique and 38 duplicates appear (0.7%), with no nulls; the most common tokens are area codes like (406), (605), and (402). The near-unique cardinality and rigid length make this an identifier rather than a feature. Treatment: Drop for modelling; retain as a contact identifier or parse out the area code if geography is useful.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 5,383
- len_min
- 14
- len_max
- 14
- len_mean
- 14
- len_median
- 14
- len_p95
- 14
- word_mean
- 2
- word_median
- 2
- n_empty
- 0
- n_duplicates
- 38
- duplicate_rate
- 0.00701
- vocab_size
- 5,550
- readability_flesch_mean
- 120.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0
- allcaps_rate
- 1
- boilerplate_rate
- 0
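If the area code is wanted as a coarse geographic feature, a regular expression over the 14-character format is one option. This is a sketch under the assumption that every value really follows `(NPA) NNN-NNNN`; `area_code` is a hypothetical helper and the phone number is made up.

```python
import re
from typing import Optional

# Three digits in parentheses, a space, three digits, a hyphen, four digits.
PHONE_RE = re.compile(r"\((\d{3})\) (\d{3})-(\d{4})")


def area_code(phone: str) -> Optional[str]:
    """Return the 3-digit area code, or None if the format does not match."""
    match = PHONE_RE.fullmatch(phone)
    return match.group(1) if match else None


print(area_code("(406) 555-0100"))  # invented number in the expected format
```

Returning `None` for non-matching input makes format violations easy to count before any rows are dropped.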
Hospital Type
categorical feature
Categorical classifier of hospital facilities across 8 types, with no nulls in 5421 rows. Acute Care Hospitals dominate at 57.6% (3120 records), followed by Critical Access (1375) and Psychiatric (626); the long tail is sparse, with only 4 Long-term facilities. Entropy ratio of 0.55 confirms the distribution is heavily concentrated in the top category. Treatment: One-hot encode, optionally collapsing the four rarest types into an 'Other' bucket.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- Acute Care Hospitals
- top_rate
- 0.5755
- cardinality
- 8
- entropy
- 1.654
- entropy_ratio
- 0.5513
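The entropy and entropy_ratio figures can be reproduced from the Hospital Type counts shown in the chart table. The sketch assumes, as the reported numbers suggest but the report does not state, that entropy is Shannon entropy in bits and that the ratio divides by log2 of the cardinality. `entropy_bits` is a hypothetical helper name.

```python
import math

# Counts copied from the Hospital Type table in this report.
type_counts = {
    "Acute Care Hospitals": 3120,
    "Critical Access Hospitals": 1375,
    "Psychiatric": 626,
    "Acute Care - Veterans Administration": 132,
    "Childrens": 94,
    "Rural Emergency Hospital": 38,
    "Acute Care - Department of Defense": 32,
    "Long-term": 4,
}


def entropy_bits(counts):
    """Shannon entropy in bits of a collection of category counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


h = entropy_bits(type_counts.values())
ratio = h / math.log2(len(type_counts))  # assumed normalisation: log2(cardinality)
print(round(h, 3), round(ratio, 4))
```

Under this definition a ratio of 1 would mean all 8 types are equally common, so a value near 0.55 reflects the dominance of the top category.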
Hospital Ownership
categorical feature
Categorical descriptor of hospital ownership type across 5421 records, with 12 distinct categories and no nulls. Distribution is dominated by 'Voluntary non-profit - Private' at 42.26% of rows, followed by 'Proprietary' at 1067 and a long tail down to 'Tribal' at 14. Entropy ratio of 0.72 indicates moderate concentration but reasonable spread across the taxonomy. Treatment: One-hot or target-encode for modelling; consider collapsing rare tiers like 'Physician', 'Government - Federal', 'Department of Defense', and 'Tribal'.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 12
- top_value
- Voluntary non-profit - Private
- top_rate
- 0.4226
- cardinality
- 12
- entropy
- 2.586
- entropy_ratio
- 0.7215
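Collapsing rare ownership tiers could be done with a share threshold, as sketched below using the counts from the Hospital Ownership table. The 1% cutoff and the helper name `collapse_rare` are assumptions chosen for illustration, not values recommended by the report.

```python
# Counts copied from the Hospital Ownership table in this report.
ownership_counts = {
    "Voluntary non-profit - Private": 2291,
    "Proprietary": 1067,
    "Government - Hospital District or Authority": 521,
    "Government - Local": 400,
    "Voluntary non-profit - Other": 361,
    "Voluntary non-profit - Church": 275,
    "Government - State": 210,
    "Veterans Health Administration": 132,
    "Physician": 74,
    "Government - Federal": 44,
    "Department of Defense": 32,
    "Tribal": 14,
}


def collapse_rare(counts, min_share=0.01, other="Other"):
    """Merge categories whose share is below min_share into one bucket."""
    total = sum(counts.values())
    collapsed = {}
    for label, count in counts.items():
        key = label if count / total >= min_share else other
        collapsed[key] = collapsed.get(key, 0) + count
    return collapsed


collapsed = collapse_rare(ownership_counts)  # 1% threshold is hypothetical
print(collapsed["Other"], len(collapsed))
```

With a 1% threshold 'Physician' (1.4%) survives as its own level; a higher cutoff would be needed to fold it into the bucket.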
Emergency Services
categorical feature
Binary Yes/No flag indicating whether the hospital provides emergency services, with no nulls across 5421 rows. The distribution is imbalanced: 'Yes' accounts for 83.1% (4505) versus 916 'No' responses, giving an entropy ratio of 0.655. Treatment: Encode as a binary 0/1 indicator; be mindful of class imbalance if used as a target.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Yes
- top_rate
- 0.831
- cardinality
- 2
- entropy
- 0.6553
- entropy_ratio
- 0.6553
Meets criteria for birthing friendly designation
categorical feature null_rate imbalance
This appears to be a binary flag indicating whether a hospital meets criteria for a 'birthing friendly' designation, but it functions as a presence indicator only. Of 5421 rows, 58.24% are null and the remaining 2264 are all 'Y'; there are no 'N' values, giving cardinality 1 and zero entropy. Nulls therefore most likely mean 'not designated' rather than missing data, although the data alone cannot confirm this. Treatment: Recode nulls as 'N' to form a usable binary indicator, or drop since it carries the same information as the null mask.
- n
- 5,421
- nulls
- 3,157 (58.2%)
- unique
- 1
- top_value
- Y
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
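Recoding the birthing-friendly field into a boolean is a one-liner. The sketch assumes that null means 'not designated', which is the report's inference rather than a documented fact; the sample values are invented.

```python
# Hypothetical raw column values: 'Y' or a null (None).
raw_flags = ["Y", None, "Y", None, None]

# Assumption: a null means the hospital is not designated.
birthing_friendly = [value == "Y" for value in raw_flags]

share_designated = sum(birthing_friendly) / len(birthing_friendly)
print(birthing_friendly, share_designated)
```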
Hospital overall rating
categorical label
This is the CMS-style 1-5 hospital star rating stored as a string, with a sixth bucket 'Not Available' for unrated facilities. The striking issue is that 'Not Available' dominates at 47.1% of 5,421 rows, outnumbering every actual rating; among rated hospitals, 3 stars (937) is most common and 1 star (229) least. Entropy ratio of 0.83 across 6 categories confirms the distribution is spread but heavily anchored by missingness-as-category. Treatment: Recode 'Not Available' to NaN and treat the remainder as an ordinal 1-5 scale.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- Not Available
- top_rate
- 0.4708
- cardinality
- 6
- entropy
- 2.133
- entropy_ratio
- 0.8252
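The rating treatment (sentinel to missing, remainder to an ordinal integer) can be sketched with the counts from the Hospital overall rating table. `parse_rating` is a hypothetical helper; `None` stands in for NaN in this plain-Python version.

```python
from typing import Optional


def parse_rating(raw: str) -> Optional[int]:
    """Map 'Not Available' to None and '1'..'5' to an integer rating."""
    raw = raw.strip()
    if raw == "Not Available":
        return None
    rating = int(raw)
    if not 1 <= rating <= 5:
        raise ValueError(f"unexpected rating: {raw!r}")
    return rating


# Counts copied from the Hospital overall rating table in this report.
rating_counts = {"Not Available": 2552, "3": 937, "4": 765, "2": 649, "5": 289, "1": 229}

rated = {}
for label, count in rating_counts.items():
    rating = parse_rating(label)
    if rating is not None:
        rated[rating] = count

n_rated = sum(rated.values())
mean_rating = sum(r * c for r, c in rated.items()) / n_rated
print(n_rated, round(mean_rating, 2))
```

The mean is shown only as a sanity check on the recoding; for an ordinal scale the median or the full distribution is usually the better summary.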
Hospital overall rating footnote
categorical metadata null_rate
Footnote codes attached to the Hospital overall rating, with only 7 distinct values across 5,421 rows. Over half the column is null (null_rate 0.527), and among populated rows code '16' dominates at 65.4% followed by '19', so the field carries low information (entropy_ratio 0.41). One compound entry '16, 23' appears twice, indicating values can be multi-coded. Treatment: Keep as a categorical flag alongside the rating; split the rare compound codes if you need per-footnote analysis.
- n
- 5,421
- nulls
- 2,857 (52.7%)
- unique
- 7
- top_value
- 16
- top_rate
- 0.6537
- cardinality
- 7
- entropy
- 1.158
- entropy_ratio
- 0.4126
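Splitting compound footnote entries such as '16, 23' into individual codes might be done as below. `split_codes` is a hypothetical helper, and the comma separator is assumed from the single compound value the report mentions.

```python
from typing import List, Optional


def split_codes(raw: Optional[str]) -> List[int]:
    """Split a footnote cell like '16, 23' into integer codes; null gives []."""
    if raw is None:
        return []
    return [int(part) for part in raw.split(",") if part.strip()]


print(split_codes("16, 23"), split_codes("16"), split_codes(None))
```

Each hospital then carries a list of codes, which can be exploded into one row per footnote for per-code counts.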
MORT Group Measure Count
categorical feature
This is a binary categorical field indicating the count of measures in a mortality (MORT) group, holding only the value '7' (84.1% of rows) or 'Not Available' (the remaining 863 rows). Despite being labeled a count, it functions as a presence flag: when data exists, the count is always 7. No nulls are recorded, since missingness is encoded as the literal string 'Not Available'. Treatment: Recode to a boolean availability flag ('7' vs 'Not Available') before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 7
- top_rate
- 0.8408
- cardinality
- 2
- entropy
- 0.6324
- entropy_ratio
- 0.6324
Count of Facility MORT Measures
categorical feature
Likely the count of facility mortality (MORT) measures reported per hospital, stored as strings 1-7 alongside a 'Not Available' sentinel. The dominant value is 'Not Available' at 32.8% (1777/5421), which makes the column nominally categorical with 8 levels even though 7 of them are integers. Entropy ratio of 0.92 indicates the non-null counts are spread fairly evenly across 1-7. Treatment: Replace 'Not Available' with NaN, cast remaining values to integer, and optionally add a missingness indicator.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- Not Available
- top_rate
- 0.3278
- cardinality
- 8
- entropy
- 2.765
- entropy_ratio
- 0.9217
Count of MORT Measures Better
categorical feature
Categorical count of mortality measures on which a hospital performed better than the national rate, encoded as small integers stored as strings (0-7) plus a 'Not Available' sentinel. The distribution is heavily concentrated at 0 (57.8% of 5421 rows) with another 1777 rows marked 'Not Available', leaving only about 9% of hospitals showing any 'better' measure. Entropy ratio of 0.458 confirms the skew, and the sentinel string blocks direct numeric use. Treatment: Replace 'Not Available' with NaN, cast remaining values to int, and consider binarising (any-better vs none) given the skew.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 9
- top_value
- 0
- top_rate
- 0.5785
- cardinality
- 9
- entropy
- 1.453
- entropy_ratio
- 0.4583
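The same sentinel handling applies to all of the 'Count of ... Measures' columns. A minimal sketch for this one, including the suggested any-better binarisation, is shown below; `parse_count` is a hypothetical helper and the sample values are invented.

```python
from typing import Optional

SENTINEL = "Not Available"


def parse_count(raw: str) -> Optional[int]:
    """Return the integer count, or None when the sentinel is present."""
    raw = raw.strip()
    return None if raw == SENTINEL else int(raw)


# Hypothetical raw cells, not taken from the dataset.
raw_values = ["0", "2", "Not Available", "0", "1"]
counts = [parse_count(v) for v in raw_values]

# Binarise: True if any measure is better, False if none, None if unknown.
any_better = [None if c is None else c > 0 for c in counts]
print(counts, any_better)
```

Keeping `None` distinct from `False` preserves the difference between 'no better measures' and 'not reported'.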
Count of MORT Measures No Different
categorical feature
This column counts how many mortality (MORT) measures a hospital scored 'No Different' on, encoded as a small integer 0-7 but stored as strings alongside a 'Not Available' sentinel. Cardinality is just 9 with high entropy ratio (0.885), meaning values are spread fairly evenly across the integers, although 'Not Available' dominates at 32.8% (1777/5421) and '0' is rare at only 12 rows. Analysts should note the heavy missing-as-string encoding and the near-empty '0' bucket. Treatment: Recode 'Not Available' to null and cast remaining values to integer before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 9
- top_value
- Not Available
- top_rate
- 0.3278
- cardinality
- 9
- entropy
- 2.806
- entropy_ratio
- 0.8852
Count of MORT Measures Worse
categorical feature
This column counts how many mortality (MORT) measures a hospital scored worse than the national average, encoded as small integers 0-5 with a 'Not Available' sentinel mixed in. Roughly 60% of 5421 rows are '0' and another 1777 are 'Not Available', leaving only 378 hospitals flagged on any MORT measure and just 1 hospital at the maximum of 5. The mixed string/integer encoding is the main gotcha. Treatment: Replace 'Not Available' with NaN and cast to integer before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- 0
- top_rate
- 0.6025
- cardinality
- 7
- entropy
- 1.294
- entropy_ratio
- 0.4608
MORT Group Footnote
numeric metadata null_rate
This appears to be a footnote code attached to MORT (mortality) group records, stored as a small numeric category with only 4 distinct values ranging from 5 to 23. Two-thirds of rows are null (null_rate 0.672), suggesting the footnote applies only to a minority of records. The strongly negative kurtosis (-1.96) and bimodal-looking quartiles (Q1=5, Q3=19) confirm these are discrete codes rather than a continuous measurement. Treatment: Cast to categorical and treat nulls as 'no footnote' rather than imputing numerically.
- n
- 5,421
- nulls
- 3,643 (67.2%)
- unique
- 4
- min
- 5
- max
- 23
- mean
- 11.58
- median
- 5
- std
- 7.057
- q1
- 5
- q3
- 19
- iqr
- 14
- skew
- 0.1488
- kurtosis
- -1.959
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
Safety Group Measure Count
categorical feature
Binary categorical with only two values across 5421 rows: the literal string "8" (84.1%) and "Not Available" (15.9%). Despite the name suggesting a count, it functions as a flag indicating either a fixed measure count of 8 or missing data encoded as text. Treatment: Recode "Not Available" to null and binarize, or drop given near-constant value.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 8
- top_rate
- 0.8408
- cardinality
- 2
- entropy
- 0.6324
- entropy_ratio
- 0.6324
Count of Facility Safety Measures
categorical feature
This column reports the count of facility safety measures, stored as a categorical with 9 distinct values (1-8 plus 'Not Available'). The dominant value is 'Not Available' at 38.1% (2,065 of 5,421 rows), so missingness is encoded as a string rather than as null. Among reported counts, 7 is most common (733), and the distribution across 1-8 is uneven rather than monotonic. Treatment: Recode 'Not Available' to null and cast remaining values to integer before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 9
- top_value
- Not Available
- top_rate
- 0.3809
- cardinality
- 9
- entropy
- 2.753
- entropy_ratio
- 0.8684
Count of Safety Measures Better
categorical feature
This is a small-integer count (0-6) of safety measures rated 'Better', stored as strings alongside a 'Not Available' sentinel. The sentinel dominates at 38.1% (2065/5421) of rows, and the distribution is heavily right-skewed: 0 and 1 cover 2600 rows while 5 and 6 appear only 14 and 3 times. Cardinality is 8 with entropy ratio 0.70, so most signal sits in the lower bins. Treatment: Recode 'Not Available' to missing, cast remaining values to integer, and consider binning the sparse 4-6 tail.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- Not Available
- top_rate
- 0.3809
- cardinality
- 8
- entropy
- 2.11
- entropy_ratio
- 0.7033
Count of Safety Measures No Different
categorical feature
A small-integer count (0-8) of safety measures rated 'no different', stored as strings alongside a 'Not Available' sentinel. The sentinel dominates at 38.1% of 5421 rows (2065), making it the modal value ahead of any actual count. Among real values, 5, 2, 4, and 1 cluster between 509-656 occurrences while 0 and 8 are rare (20 and 10). Treatment: Cast numeric levels to int and treat 'Not Available' as an explicit missing indicator before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 10
- top_value
- Not Available
- top_rate
- 0.3809
- cardinality
- 10
- entropy
- 2.685
- entropy_ratio
- 0.8083
Count of Safety Measures Worse
categorical feature
This is a low-cardinality count column (5 distinct values) recording how many safety measures were rated 'Worse' in the CMS comparison, taking integer values 0-3 plus a 'Not Available' sentinel. Most rows (54.3%) are 0 and another 2065 are 'Not Available', leaving only 415 rows with any worse-rated measures (1-3). The mix of numeric strings and a textual missing-marker means it isn't cleanly numeric as stored. Treatment: Recode 'Not Available' to NaN and cast to integer before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- 0
- top_rate
- 0.5425
- cardinality
- 5
- entropy
- 1.338
- entropy_ratio
- 0.5764
Safety Group Footnote
numeric metadata null_rate
This appears to be a footnote code column tied to a 'Safety Group' classification, encoded numerically but with only 4 distinct values ranging from 5 to 23. 61.8% of rows are null, suggesting footnotes apply only to a minority of records. The bimodal-looking distribution (median 5, q3 19, kurtosis -1.81) indicates these are categorical flags rather than a true measurement. Treatment: Cast to categorical and treat nulls as 'no footnote' rather than imputing numerically.
- n
- 5,421
- nulls
- 3,350 (61.8%)
- unique
- 4
- min
- 5
- max
- 23
- mean
- 10.69
- median
- 5
- std
- 6.95
- q1
- 5
- q3
- 19
- iqr
- 14
- skew
- 0.4116
- kurtosis
- -1.809
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
READM Group Measure Count
categorical metadata
A categorical column with only two distinct values: the count '11' covers 84.1% of 5421 rows, with the remaining 863 rows marked 'Not Available'. Functionally this is a binary availability flag rather than a true count, since '11' is the only numeric value present. The mixing of a numeric string with a sentinel 'Not Available' is the main surprise. Treatment: Recode to a binary available/not-available indicator before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 11
- top_rate
- 0.8408
- cardinality
- 2
- entropy
- 0.6324
- entropy_ratio
- 0.6324
Count of Facility READM Measures
categorical feature
This column reports the count of facility readmission (READM) measures available per facility, stored as strings rather than integers. With 12 distinct values and entropy ratio 0.965, the values are spread fairly evenly across small integers, but the modal value is the literal string "Not Available" at 21.2% (1150 of 5421 rows), which masks missingness as a category. Numeric values like "11", "8", "6", and "9" dominate the rest. Treatment: Replace "Not Available" with null and cast to integer before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 12
- top_value
- Not Available
- top_rate
- 0.2121
- cardinality
- 12
- entropy
- 3.459
- entropy_ratio
- 0.965
Count of READM Measures Better
categorical feature
Counts how many readmission measures a hospital scored 'better than national rate' on, encoded as small integers 0-5 with 'Not Available' as a seventh category. The distribution is heavily concentrated at 0 (61.5% of 5421 rows) and tails off rapidly: only 28 rows hit 3, 10 hit 4, and 3 hit 5. Notably, 'Not Available' (1150 rows) is the second most common value, larger than all nonzero counts combined. Treatment: Cast numeric levels to int and treat 'Not Available' as an explicit missing-category flag before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- 0
- top_rate
- 0.6146
- cardinality
- 7
- entropy
- 1.51
- entropy_ratio
- 0.5379
Count of READM Measures No Different
categorical feature
This column counts how many readmission (READM) measures at a hospital scored 'No Different' than the national rate, encoded as small integers (12 distinct levels) alongside a 'Not Available' sentinel. With 13 unique values and entropy ratio 0.92, the distribution is fairly flat across counts 1-9 (each ~370-500 rows), but 'Not Available' is the single largest bucket at 21.2% (1,150 of 5,421). That missingness-as-string is the main surprise and would corrupt any numeric aggregation if not handled. Treatment: Cast to integer with 'Not Available' converted to NaN before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 13
- top_value
- Not Available
- top_rate
- 0.2121
- cardinality
- 13
- entropy
- 3.408
- entropy_ratio
- 0.9211
Count of READM Measures Worse
categorical feature
This appears to be a hospital-level count of readmission measures performing worse than the national rate, stored as a categorical/string field with values 0-7 plus a 'Not Available' sentinel. The distribution is heavily concentrated at '0' (55.1% of 5421 rows) with another 1150 rows marked 'Not Available', leaving genuine non-zero counts as a long tail that thins to just 1-2 hospitals at values 5, 6, and 7. The mixing of numeric strings with 'Not Available' is the main gotcha. Treatment: Replace 'Not Available' with NaN and cast remaining values to integer before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 9
- top_value
- 0
- top_rate
- 0.5512
- cardinality
- 9
- entropy
- 1.758
- entropy_ratio
- 0.5545
READM Group Footnote
numeric metadata null_rate
Likely a footnote/flag code attached to a hospital readmission group metric, encoded as a small integer rather than a true numeric quantity. Only 3 distinct values appear (min 5, max 22, median 19) and 78.79% of rows are null, so the column is sparse and categorical in nature. The non-null distribution is concentrated at the higher codes, with q1=5 and q3=19 indicating a bimodal split between two code clusters. Treatment: Cast to categorical footnote code and treat nulls as 'no footnote' rather than imputing numerically.
- n
- 5,421
- nulls
- 4,271 (78.8%)
- unique
- 3
- min
- 5
- max
- 22
- mean
- 15.15
- median
- 19
- std
- 6.366
- q1
- 5
- q3
- 19
- iqr
- 14
- skew
- -0.9528
- kurtosis
- -1.051
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
Pt Exp Group Measure Count
categorical metadata
A binary categorical with only two values: the string "8" covering 84.08% of 5,421 rows and "Not Available" for the remaining 863. This looks like a measure-count field that is fixed at 8 when reported, otherwise marked as missing via a sentinel rather than a true null (null_rate is 0.0). Entropy ratio of 0.63 reflects that imbalance. Treatment: Convert "Not Available" to a true null and treat as a binary missingness indicator; the column carries little signal otherwise.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 8
- top_rate
- 0.8408
- cardinality
- 2
- entropy
- 0.6324
- entropy_ratio
- 0.6324
Count of Facility Pt Exp Measures
categorical feature
This column is a binary categorical indicator with only two distinct values across all 5,421 rows: the literal string "8" (58.2%) and "Not Available" (41.8%). Despite being labeled as a count, it never varies numerically: every facility either reports exactly 8 measures or reports none, making this effectively a presence/absence flag with high entropy ratio (0.98). The 41.8% "Not Available" rate is a substantial missingness signal masquerading as a category. Treatment: Recode as a binary available/not-available flag rather than treating as a numeric count.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 8
- top_rate
- 0.5818
- cardinality
- 2
- entropy
- 0.9806
- entropy_ratio
- 0.9806
Pt Exp Group Footnote
numeric metadata null_rate
This appears to be a footnote code attached to a 'Pt Exp Group' (patient experience group) measure, stored as a numeric but functioning as a categorical flag. Only 3 distinct values appear (min 5, median 5, max 22) and 58.18% of rows are null, indicating footnotes apply to a minority (about 42%) of records. The bimodal-looking spread (Q1=5, Q3=19, kurtosis -1.66) confirms this is a small set of discrete codes rather than a true numeric measurement. Treatment: Cast to categorical footnote code and treat nulls as 'no footnote' rather than missing.
- n
- 5,421
- nulls
- 3,154 (58.2%)
- unique
- 3
- min
- 5
- max
- 22
- mean
- 10.15
- median
- 5
- std
- 6.806
- q1
- 5
- q3
- 19
- iqr
- 14
- skew
- 0.571
- kurtosis
- -1.658
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
TE Group Measure Count
categorical feature
Binary categorical column with only two values: '12' (84.1% of rows) and 'Not Available' (the remaining 863 rows). The literal '12' suggests this captures a count of measures per TE group, but it is stored as a string and effectively constant when present. The 'Not Available' sentinel functions as the only real source of variation. Treatment: Convert to a boolean is_available flag; drop the constant '12' value.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 12
- top_rate
- 0.8408
- cardinality
- 2
- entropy
- 0.6324
- entropy_ratio
- 0.6324
Count of Facility TE Measures
categorical feature
This column reports the count of facility TE (timely & effective care) measures available per record, but it's stored as strings mixing numeric count values (12 distinct levels, up to "12") with the sentinel "Not Available". The most common value is "Not Available" at 17.1% (928 of 5421), making missingness the modal category despite a reported null_rate of 0. Entropy ratio of 0.93 across 13 categories shows the non-null counts are spread fairly evenly, with "10" and "11" the dominant numeric values. Treatment: Recode "Not Available" to null and cast remaining values to integer before modelling.
- n
- 5,421
- nulls
- 0 (0.0%)
- unique
- 13
- top_value
- Not Available
- top_rate
- 0.1712
- cardinality
- 13
- entropy
- 3.458
- entropy_ratio
- 0.9343
TE Group Footnote
numeric metadata null_rate high_skew outliers
This appears to be a footnote reference code for a 'TE Group', stored numerically but functioning as a categorical pointer with only 3 distinct values across 5421 rows. It is overwhelmingly empty (82.88% null) and heavily concentrated at 19 (both q1 and q3 equal 19, iqr=0), producing a strong left skew (-2.43) and 133 flagged outliers (14.33% of non-null rows) ranging down to 5. With iqr=0, an IQR-based rule presumably flags every code other than 19, so the outlier alert reflects the code mix rather than anomalous data. The numeric stats like mean=17.58 are misleading given the column is effectively a small code set. Treatment: Cast to categorical and treat nulls as 'no footnote'; do not use as a numeric feature.
- n
- 5,421
- nulls
- 4,493 (82.9%)
- unique
- 3
- min
- 5
- max
- 22
- mean
- 17.58
- median
- 19
- std
- 4.432
- q1
- 19
- q3
- 19
- iqr
- 0
- skew
- -2.43
- kurtosis
- 4.12
- n_outliers
- 133
- outlier_rate
- 0.1433
- zero_rate
- 0
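The outlier alert on TE Group Footnote can be illustrated with a degenerate-IQR sketch. Two things here are assumptions, not facts from the report: that outliers are flagged with the common 1.5 x IQR fence rule, and the per-code counts (101, 795, 32), which were chosen only because they are consistent with the reported n, mean, and std.

```python
import statistics

# Hypothetical code counts consistent with the reported n=928, mean and std.
values = [5] * 101 + [19] * 795 + [22] * 32

q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule: Tukey fences at 1.5 * IQR beyond the quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

print(iqr, len(outliers), round(len(outliers) / len(values), 4))
```

When the IQR collapses to zero both fences sit on the modal code, so every other code is flagged. That is a property of the rule applied to a categorical code, which supports casting the column to categorical instead of interpreting the alert.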