saturn

healthcare cms hospitals 2025

source: /home/coolhand/datasets/us-inequality-atlas/healthcare/cms_hospitals_2025.csv · 5,421 rows · 38 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a CMS hospital directory covering 5,421 U.S. hospitals across 56 state/territory codes, with 38 columns mixing facility identifiers (Facility ID, Name, Address, Phone), location fields, and a battery of CMS quality-measure summaries (Mortality, Readmission, Safety, Patient Experience, Timely & Effective care). Two things are worth a closer look first: the Hospital overall rating is 'Not Available' for 47% of facilities, and the 'Meets criteria for birthing friendly designation' field is 58% null with only 'Y' as a value, so any rating- or designation-based analysis will be heavily gated by missingness. Beyond that, the mix is dominated by Acute Care Hospitals (3,120) and Voluntary non-profit – Private ownership (2,291 / ~42%), with Texas, California, and Florida holding the largest state shares. The 'Count of … Measures Worse/Better' fields are highly skewed toward 0, suggesting most hospitals look 'no different than national average' on CMS comparisons — a useful framing before drilling into outliers.

citing: row_count · column_count · Hospital overall rating · Meets criteria for birthing friendly designation · Hospital Type · Hospital Ownership · State · Emergency Services · Count of MORT Measures Worse · Count of READM Measures Better

Schema

38 columns
Per-column summary. Each entry lists column name, type, null rate, and distinct count; alert flags, if any, follow on the next line.
Facility ID text 0.0% 5,421
near_unique one_word allcaps short_text
Facility Name text 0.0% 5,286
near_unique allcaps
Address text 0.0% 5,387
near_unique allcaps
City/Town text 0.0% 3,049
one_word allcaps short_text duplicates
State categorical 0.0% 56
ZIP Code numeric 0.0% 4,721
County/Parish text 0.0% 1,555
one_word allcaps short_text duplicates
Telephone Number text 0.0% 5,383
near_unique allcaps short_text
Hospital Type categorical 0.0% 8
Hospital Ownership categorical 0.0% 12
Emergency Services categorical 0.0% 2
Meets criteria for birthing friendly designation categorical 58.2% 1
null_rate imbalance
Hospital overall rating categorical 0.0% 6
Hospital overall rating footnote categorical 52.7% 7
null_rate
MORT Group Measure Count categorical 0.0% 2
Count of Facility MORT Measures categorical 0.0% 8
Count of MORT Measures Better categorical 0.0% 9
Count of MORT Measures No Different categorical 0.0% 9
Count of MORT Measures Worse categorical 0.0% 7
MORT Group Footnote numeric 67.2% 4
null_rate
Safety Group Measure Count categorical 0.0% 2
Count of Facility Safety Measures categorical 0.0% 9
Count of Safety Measures Better categorical 0.0% 8
Count of Safety Measures No Different categorical 0.0% 10
Count of Safety Measures Worse categorical 0.0% 5
Safety Group Footnote numeric 61.8% 4
null_rate
READM Group Measure Count categorical 0.0% 2
Count of Facility READM Measures categorical 0.0% 12
Count of READM Measures Better categorical 0.0% 7
Count of READM Measures No Different categorical 0.0% 13
Count of READM Measures Worse categorical 0.0% 9
READM Group Footnote numeric 78.8% 3
null_rate
Pt Exp Group Measure Count categorical 0.0% 2
Count of Facility Pt Exp Measures categorical 0.0% 2
Pt Exp Group Footnote numeric 58.2% 3
null_rate
TE Group Measure Count categorical 0.0% 2
Count of Facility TE Measures categorical 0.0% 13
TE Group Footnote numeric 82.9% 3
null_rate high_skew outliers

Facility ID

text identifier near_unique one_word allcaps short_text
Facility ID is a fixed-width 6-character single-token code, unique across all 5421 rows with zero nulls or duplicates. Every value is one word in all caps with length exactly 6, and the sampled tokens (e.g., 010001, 010005) are zero-padded numeric strings consistent with a primary facility identifier. Treatment: Treat as primary key; left-join on this id and exclude from modelling features. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
5,421
len_min
6
len_max
6
len_mean
6
len_median
6
len_p95
6
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
5,421
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0
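
Before relying on Facility ID as a primary key, it can help to re-verify the two properties the profile reports: uniqueness and a fixed width of 6. The sketch below is a minimal check under the assumption that the column was read as strings; `check_facility_ids` is a hypothetical helper name, not part of any profiler.

```python
from collections import Counter

def check_facility_ids(ids, width=6):
    """Return (duplicated_ids, wrong_length_ids) for a list of ID strings."""
    counts = Counter(ids)
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    wrong_length = sorted(i for i in counts if len(i) != width)
    return duplicated, wrong_length

# "010001" and "010005" are the sampled tokens quoted above; "10001" is a
# made-up example of an ID that lost its leading zero in a numeric read.
print(check_facility_ids(["010001", "010005", "10001", "010005"]))
```

Both lists should come back empty on a clean load; a non-empty second list usually means the column was parsed as a number somewhere upstream.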

Facility Name

text identifier near_unique allcaps
This column holds names of healthcare facilities, dominated by terms like 'hospital' (2740), 'center' (1579), and 'medical' (1444), with a typical length of 4 words / 28 characters. Values are near-unique (5286 distinct out of 5421) yet 135 duplicates remain, suggesting either multi-site chains or repeated reporting rows for the same facility. Nearly everything is uppercase (99.3% allcaps rate), which will trip case-sensitive joins. Treatment: Normalize case and whitespace before using as a join key on facility. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
5,286
len_min
3
len_max
74
len_mean
29.21
len_median
28
len_p95
45
word_mean
3.995
word_median
4
n_empty
0
n_duplicates
135
duplicate_rate
0.0249
vocab_size
3,942
readability_flesch_mean
6.842
emoji_rate
0
url_rate
0
one_word_rate
0.001845
allcaps_rate
0.9932
boilerplate_rate
0
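
The case-and-whitespace normalization recommended above could look roughly like this. It is a sketch only: `normalize_name` is a hypothetical helper, and the sample strings are invented for illustration rather than taken from the file.

```python
import re

def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace, trim the ends, and fold case."""
    return re.sub(r"\s+", " ", raw).strip().casefold()

# Two spellings that differ only in case and spacing compare equal afterwards.
print(normalize_name("  Mercy   HOSPITAL ") == normalize_name("MERCY HOSPITAL"))
```

Because 135 names are duplicated, a normalized name is still not a unique key; Facility ID remains the safer join column.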

Address

text identifier near_unique allcaps
Free-text street addresses, with 5387 unique values across 5421 rows (near-unique) and no nulls. 99.2% are all-caps and the top tokens are 'street', 'st', 'avenue', 'drive', 'road', confirming postal-style entries averaging 3.75 words and 19 characters. 34 duplicates exist but no boilerplate, URLs, or emoji. Treatment: Drop or geocode/parse into structured components before modelling; do not feed raw. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
5,387
len_min
7
len_max
50
len_mean
19.37
len_median
19
len_p95
29
word_mean
3.754
word_median
4
n_empty
0
n_duplicates
34
duplicate_rate
0.006272
vocab_size
4,996
readability_flesch_mean
79.27
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
0.9921
boilerplate_rate
0

City/Town

text feature one_word allcaps short_text duplicates
This is a US city/town name field, almost entirely uppercased (allcaps_rate 0.994) and dominated by single-word entries (one_word_rate 0.771, word_mean 1.24). Of 5421 rows there are 3049 unique values with a 0.438 duplicate rate, led by CHICAGO (34), HOUSTON (31), and COLUMBUS (23). Lengths are short and tight (len_mean 8.6, len_max 24) and there are no nulls or empties. Treatment: Normalize case and standardize against a city gazetteer (ideally with state) before grouping or joining. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
3,049
len_min
3
len_max
24
len_mean
8.611
len_median
8
len_p95
13
word_mean
1.241
word_median
1
n_empty
0
n_duplicates
2,372
duplicate_rate
0.4376
vocab_size
2,890
readability_flesch_mean
18.29
emoji_rate
0
url_rate
0
one_word_rate
0.7709
allcaps_rate
0.9943
boilerplate_rate
0

State

categorical feature
US state codes covering 56 distinct values across 5,421 rows with no nulls. That exceeds the 50 states, so the extra codes are likely DC and territories. The distribution is broad (entropy ratio 0.917): TX leads at only 8.5%, followed by CA, FL, IL, and OH. Treatment: one-hot or target-encode for modelling; optionally group rare territories. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
56
top_value
TX
top_rate
0.08522
cardinality
56
entropy
5.328
entropy_ratio
0.9174

ZIP Code

numeric identifier
This is a US ZIP code field stored as a numeric column, with 4721 unique values across 5421 rows and no nulls. The range (603 to 99929) and broad IQR (32771-76104) are consistent with national ZIP coverage, and leading-zero ZIPs (e.g., the min of 603) have already been corrupted by numeric storage. Treating the mean (53780) or std (27064) as meaningful is misleading since ZIPs are categorical identifiers, not quantities. Treatment: Cast to zero-padded 5-character strings and treat as a categorical/geographic key rather than a numeric feature. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
4,721
min
603
max
99,929
mean
5.378e+04
median
55,066
std
2.706e+04
q1
32,771
q3
76,104
iqr
43,333
skew
-0.1646
kurtosis
-0.9879
n_outliers
0
outlier_rate
0
zero_rate
0
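
The zero-padding treatment is a one-liner in most tools. A minimal sketch, assuming the ZIP values arrive as integers (or integer-valued floats); `zip_to_str` is a hypothetical name:

```python
def zip_to_str(value) -> str:
    """Restore a 5-character ZIP string from a numerically stored ZIP."""
    return str(int(value)).zfill(5)

# The column minimum reported above is 603, i.e. a ZIP that lost two zeros.
print(zip_to_str(603), zip_to_str(99929))
```

Re-reading the CSV with the column typed as text avoids the problem at the source and is preferable when that is an option.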

County/Parish

text feature one_word allcaps short_text duplicates
This column holds U.S. county or parish names, stored uppercase and almost always as a single token (one_word_rate 0.87, allcaps_rate 1.0). With 1,555 uniques across 5,421 rows and a 71.3% duplicate rate, common counties repeat heavily: LOS ANGELES (88) leads, followed by JEFFERSON and COOK (59 each). Note that vocab_size (1,591) exceeds the unique count (1,555) because multi-word names (those starting with SAN, LOS, or ST., for example) contribute more than one token each. Treatment: Normalize casing and pair with the State column before using as a categorical geographic feature, since the same county name can occur in more than one state. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
1,555
len_min
3
len_max
25
len_mean
7.34
len_median
7
len_p95
11
word_mean
1.135
word_median
1
n_empty
0
n_duplicates
3,866
duplicate_rate
0.7132
vocab_size
1,591
readability_flesch_mean
34.44
emoji_rate
0
url_rate
0
one_word_rate
0.8733
allcaps_rate
1
boilerplate_rate
0
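
Pairing county with state can be sketched as a composite key. `county_key` is a hypothetical helper, and the state codes in the example are illustrative inputs rather than values read from the file.

```python
def county_key(state: str, county: str) -> tuple:
    """Build a (STATE, COUNTY) key with consistent case and spacing."""
    return (state.strip().upper(), " ".join(county.split()).upper())

# The same county name under two different states yields two distinct keys.
print(county_key("al", " jefferson ") != county_key("CO", "JEFFERSON"))
```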

Telephone Number

text identifier near_unique allcaps short_text
This column holds US-style telephone numbers, every value exactly 14 characters long with a mean of 2 words, consistent with a `(NPA) NNN-NNNN` format. Of 5421 rows, 5383 are unique and 38 duplicates appear (0.7%), with no nulls; the most common tokens are area codes like (406), (605), and (402). The near-unique cardinality and rigid length make this an identifier rather than a feature. Treatment: Drop for modelling; retain as a contact identifier or parse out the area code if geography is useful. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
5,383
len_min
14
len_max
14
len_mean
14
len_median
14
len_p95
14
word_mean
2
word_median
2
n_empty
0
n_duplicates
38
duplicate_rate
0.00701
vocab_size
5,550
readability_flesch_mean
120.2
emoji_rate
0
url_rate
0
one_word_rate
0
allcaps_rate
1
boilerplate_rate
0
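
If the area code is wanted as a coarse geographic feature, it can be pulled out with a strict pattern. This sketch assumes the 14-character `(NPA) NNN-NNNN` layout described above holds for every row; `area_code` is a hypothetical helper and the phone number in the example is invented.

```python
import re
from typing import Optional

# Exactly: "(", three digits, ")", space, three digits, "-", four digits.
_PHONE = re.compile(r"^\((\d{3})\) \d{3}-\d{4}$")

def area_code(phone: str) -> Optional[str]:
    """Return the 3-digit area code, or None if the format does not match."""
    match = _PHONE.match(phone.strip())
    return match.group(1) if match else None

print(area_code("(406) 555-0100"))
```

Returning None for non-matching values, instead of guessing, keeps any formatting drift visible.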

Hospital Type

categorical feature
Categorical classifier of hospital facilities across 8 types, with no nulls in 5421 rows. Acute Care Hospitals dominate at 57.6% (3120 records), followed by Critical Access (1375) and Psychiatric (626); the long tail is sparse, with only 4 Long-term facilities. Entropy ratio of 0.55 confirms the distribution is heavily concentrated in the top category. Treatment: One-hot encode, optionally collapsing the four rarest types into an 'Other' bucket. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
8
top_value
Acute Care Hospitals
top_rate
0.5755
cardinality
8
entropy
1.654
entropy_ratio
0.5513
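
Collapsing the rarest types into an 'Other' bucket might be sketched as follows. `collapse_rare` is a hypothetical helper; with 8 types, `keep=4` corresponds to folding the four rarest. The toy labels in the example are placeholders, not the real category names.

```python
from collections import Counter

def collapse_rare(values, keep=4, other="Other"):
    """Keep the `keep` most frequent labels and map the rest to `other`."""
    top = {label for label, _ in Counter(values).most_common(keep)}
    return [v if v in top else other for v in values]

toy = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]
print(Counter(collapse_rare(toy, keep=2)))
```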

Hospital Ownership

categorical feature
Categorical descriptor of hospital ownership type across 5421 records, with 12 distinct categories and no nulls. Distribution is dominated by 'Voluntary non-profit - Private' at 42.26% of rows, followed by 'Proprietary' at 1067 and a long tail down to 'Government - Federal' at 44. Entropy ratio of 0.72 indicates moderate concentration but reasonable spread across the taxonomy. Treatment: One-hot or target-encode for modelling; consider collapsing rare tiers like 'Physician' and 'Government - Federal'. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
12
top_value
Voluntary non-profit - Private
top_rate
0.4226
cardinality
12
entropy
2.586
entropy_ratio
0.7215

Emergency Services

categorical feature
Binary Yes/No flag indicating whether the hospital provides emergency services, with no nulls across 5421 rows. The distribution is imbalanced: 'Yes' accounts for 83.1% (4505) versus 916 'No' values, giving an entropy ratio of 0.655. Treatment: Encode as a binary 0/1 indicator; be mindful of class imbalance if used as a target. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
2
top_value
Yes
top_rate
0.831
cardinality
2
entropy
0.6553
entropy_ratio
0.6553
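
The report does not define entropy_ratio, but the published figures are consistent with Shannon entropy in bits divided by log2 of the cardinality (for example, 1.654 / log2(8) ≈ 0.551 for Hospital Type). The sketch below works under that assumption and recomputes the value for this column from the Yes/No counts quoted above; `entropy_ratio` is a hypothetical function name.

```python
import math

def entropy_ratio(counts):
    """Shannon entropy (bits) of the counts, divided by log2(#categories)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    k = len(probs)
    # A single-category column has zero entropy; avoid dividing by log2(1) = 0.
    return entropy / math.log2(k) if k > 1 else 0.0

# 4505 'Yes' and 916 'No', as reported for Emergency Services.
print(round(entropy_ratio([4505, 916]), 3))
```

A ratio near 1 means the categories are close to equally frequent; a ratio near 0 means one category dominates.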

Meets criteria for birthing friendly designation

categorical feature null_rate imbalance
This is a flag for whether a hospital meets the criteria for a 'birthing friendly' designation, but as stored it is a presence indicator only. Of 5421 rows, 58.24% are null and the remaining 2264 are all 'Y'; there are no 'N' values, giving cardinality 1 and zero entropy. Nulls most likely mean 'not designated' rather than missing data, though the file itself does not confirm this. Treatment: Recode nulls as 'N' to form a usable binary indicator, or drop the column if a null mask is already derived, since the two carry the same information. high · anthropic:claude-opus-4-7
n
5,421
nulls
3,157 (58.2%)
unique
1
top_value
Y
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0
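
Recoding the column into a boolean is straightforward. The sketch assumes that a null cell means "not designated", which is an interpretation and not something the file states; `birthing_friendly` is a hypothetical helper.

```python
def birthing_friendly(value) -> bool:
    """True only for the literal 'Y'; None, NaN, and empty strings map to False."""
    return value == "Y"

print([birthing_friendly(v) for v in ["Y", None, float("nan"), ""]])
```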

Hospital overall rating

categorical label
This is the CMS-style 1-5 hospital star rating stored as a string, with a sixth bucket 'Not Available' for unrated facilities. The striking issue is that 'Not Available' dominates at 47.1% of 5,421 rows, outnumbering every actual rating; among rated hospitals, 3 stars (937) is most common and 1 star (229) least. Entropy ratio of 0.83 across 6 categories confirms the distribution is spread but heavily anchored by missingness-as-category. Treatment: Recode 'Not Available' to NaN and treat the remainder as an ordinal 1-5 scale. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
6
top_value
Not Available
top_rate
0.4708
cardinality
6
entropy
2.133
entropy_ratio
0.8252
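
A minimal sketch of the recode described above, turning the string rating into an ordinal integer with a true missing value. `parse_rating` is a hypothetical helper; the range check is an added safeguard, not something the profile requires.

```python
from typing import Optional

def parse_rating(raw: str) -> Optional[int]:
    """Return the 1-5 star rating as an int, or None for 'Not Available'."""
    text = raw.strip()
    if text == "Not Available":
        return None
    rating = int(text)
    if not 1 <= rating <= 5:
        raise ValueError(f"unexpected rating: {raw!r}")
    return rating

print([parse_rating(v) for v in ["3", "Not Available", "5"]])
```

With 47.1% of rows unrated, any average over the parsed ratings describes rated hospitals only.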

Hospital overall rating footnote

categorical metadata null_rate
Footnote codes attached to the Hospital overall rating, with only 7 distinct values across 5,421 rows. Over half the column is null (null_rate 0.527), and among populated rows code '16' dominates at 65.4% followed by '19', so the field carries low information (entropy_ratio 0.41). One compound entry '16, 23' appears twice, indicating values can be multi-coded. Treatment: Keep as a categorical flag alongside the rating; split the rare compound codes if you need per-footnote analysis. high · anthropic:claude-opus-4-7
n
5,421
nulls
2,857 (52.7%)
unique
7
top_value
16
top_rate
0.6537
cardinality
7
entropy
1.158
entropy_ratio
0.4126
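
For per-footnote analysis, compound entries such as '16, 23' need to be split into individual codes. A sketch under the assumption that codes are comma-separated integers; `split_footnotes` is a hypothetical helper.

```python
def split_footnotes(raw) -> list:
    """Split a footnote cell into integer codes; a null cell gives an empty list."""
    if raw is None:
        return []
    return [int(part) for part in str(raw).split(",") if part.strip()]

print(split_footnotes("16, 23"), split_footnotes("16"), split_footnotes(None))
```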

MORT Group Measure Count

categorical feature
This is a binary categorical field indicating the count of measures in a mortality (MORT) group, holding only the value '7' (84.1% of rows) or 'Not Available' (the remaining 863 rows). Despite being labeled a count, it functions as a presence flag: when data exists, the count is always 7. No nulls are recorded, since missingness is encoded as the literal string 'Not Available'. Treatment: Recode to a boolean availability flag ('7' vs 'Not Available') before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
2
top_value
7
top_rate
0.8408
cardinality
2
entropy
0.6324
entropy_ratio
0.6324

Count of Facility MORT Measures

categorical feature
Likely the count of facility mortality (MORT) measures reported per hospital, stored as strings 1-7 alongside a 'Not Available' sentinel. The dominant value is 'Not Available' at 32.8% (1777/5421), which makes the column nominally categorical with 8 levels even though 7 of them are integers. Entropy ratio of 0.92 indicates the non-null counts are spread fairly evenly across 1-7. Treatment: Replace 'Not Available' with NaN, cast remaining values to integer, and optionally add a missingness indicator. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
8
top_value
Not Available
top_rate
0.3278
cardinality
8
entropy
2.765
entropy_ratio
0.9217
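
The same sentinel pattern recurs across every 'Count of …' column, so a single parser can serve them all. This is a sketch, assuming 'Not Available' is the only non-numeric value; `parse_count` is a hypothetical helper that returns both the integer and a missingness indicator.

```python
from typing import Optional, Tuple

def parse_count(raw: str, sentinel: str = "Not Available") -> Tuple[Optional[int], bool]:
    """Return (count, is_missing); the sentinel becomes (None, True)."""
    text = raw.strip()
    if text == sentinel:
        return None, True
    return int(text), False

print([parse_count(v) for v in ["7", "Not Available", "0"]])
```

Keeping the indicator alongside the count preserves the information that 32.8% of hospitals have no reported value here.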

Count of MORT Measures Better

categorical feature
Categorical count of mortality measures on which a hospital performed better than the national rate, encoded as small integers stored as strings (0–7) plus a 'Not Available' sentinel. The distribution is heavily concentrated at 0 (57.8% of 5421 rows) with another 1777 rows marked 'Not Available', leaving only about 9% of hospitals (roughly 508) showing any 'better' measure. Entropy ratio of 0.458 confirms the skew, and the sentinel string blocks direct numeric use. Treatment: Replace 'Not Available' with NaN, cast remaining values to int, and consider binarising (any-better vs none) given the skew. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
9
top_value
0
top_rate
0.5785
cardinality
9
entropy
1.453
entropy_ratio
0.4583

Count of MORT Measures No Different

categorical feature
This column counts how many mortality (MORT) measures a hospital scored 'No Different' on, encoded as a small integer 0-7 but stored as strings alongside a 'Not Available' sentinel. Cardinality is just 9 with high entropy ratio (0.885), meaning values are spread fairly evenly across the integers — except 'Not Available' dominates at 32.8% (1777/5421) and '0' is rare at only 12 rows. Analysts should note the heavy missing-as-string encoding and the near-empty '0' bucket. Treatment: Recode 'Not Available' to null and cast remaining values to integer before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
9
top_value
Not Available
top_rate
0.3278
cardinality
9
entropy
2.806
entropy_ratio
0.8852

Count of MORT Measures Worse

categorical feature
This column counts how many mortality (MORT) measures a hospital scored worse than the national average, encoded as small integers 0-5 with a 'Not Available' sentinel mixed in. Roughly 60% of 5421 rows are '0' and another 1777 are 'Not Available', leaving only 378 hospitals flagged on any MORT measure and just 1 hospital at the maximum of 5. The mixed string/integer encoding is the main gotcha. Treatment: Replace 'Not Available' with NaN and cast to integer before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
7
top_value
0
top_rate
0.6025
cardinality
7
entropy
1.294
entropy_ratio
0.4608

MORT Group Footnote

numeric metadata null_rate
This appears to be a footnote code attached to MORT (mortality) group records, stored as a small numeric category with only 4 distinct values ranging from 5 to 23. Two-thirds of rows are null (null_rate 0.672), suggesting the footnote applies only to a minority of records. The strongly negative kurtosis (-1.96) and bimodal-looking quartiles (Q1=5, Q3=19) confirm these are discrete codes rather than a continuous measurement. Treatment: Cast to categorical and treat nulls as 'no footnote' rather than imputing numerically. high · anthropic:claude-opus-4-7
n
5,421
nulls
3,643 (67.2%)
unique
4
min
5
max
23
mean
11.58
median
5
std
7.057
q1
5
q3
19
iqr
14
skew
0.1488
kurtosis
-1.959
n_outliers
0
outlier_rate
0
zero_rate
0

Safety Group Measure Count

categorical feature
Binary categorical with only two values across 5421 rows: the literal string "8" (84.1%) and "Not Available" (15.9%). Despite the name suggesting a count, it functions as a flag indicating either a fixed measure count of 8 or missing data encoded as text. Treatment: Recode "Not Available" to null and binarize, or drop given near-constant value. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
2
top_value
8
top_rate
0.8408
cardinality
2
entropy
0.6324
entropy_ratio
0.6324

Count of Facility Safety Measures

categorical feature
This column reports the count of facility safety measures, stored as a categorical with 9 distinct values (1-8 plus 'Not Available'). The dominant value is 'Not Available' at 38.1% (2,065 of 5,421 rows), so missingness is encoded as a string rather than as a null. Among reported counts, 7 is most common (733), and the distribution across 1-8 is uneven rather than monotonic. Treatment: Recode 'Not Available' to null and cast remaining values to integer before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
9
top_value
Not Available
top_rate
0.3809
cardinality
9
entropy
2.753
entropy_ratio
0.8684

Count of Safety Measures Better

categorical feature
This is a small-integer count (0-6) of safety measures rated 'Better', stored as strings alongside a 'Not Available' sentinel. The sentinel dominates at 38.1% (2065/5421) of rows, and the distribution is heavily right-skewed: 0 and 1 cover 2600 rows while 5 and 6 appear only 14 and 3 times. Cardinality is 8 with entropy ratio 0.70, so most signal sits in the lower bins. Treatment: Recode 'Not Available' to missing, cast remaining values to integer, and consider binning the sparse 4-6 tail. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
8
top_value
Not Available
top_rate
0.3809
cardinality
8
entropy
2.11
entropy_ratio
0.7033
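
Binning the sparse upper tail can be sketched as below, assuming the counts have already been parsed to integers with None for 'Not Available'. `bin_better_count` and the '4+' label are hypothetical choices.

```python
from typing import Optional

def bin_better_count(count: Optional[int], cap: int = 4) -> Optional[str]:
    """Fold every count at or above `cap` into one 'cap+' bucket."""
    if count is None:
        return None
    return f"{cap}+" if count >= cap else str(count)

print([bin_better_count(c) for c in [0, 3, 4, 6, None]])
```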

Count of Safety Measures No Different

categorical feature
A small-integer count (0-8) of safety measures rated 'no different', stored as strings alongside a 'Not Available' sentinel. The sentinel dominates at 38.1% of 5421 rows (2065), making it the modal value ahead of any actual count. Among real values, 5, 2, 4, and 1 cluster between 509-656 occurrences while 0 and 8 are rare (20 and 10). Treatment: Cast numeric levels to int and treat 'Not Available' as an explicit missing indicator before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
10
top_value
Not Available
top_rate
0.3809
cardinality
10
entropy
2.685
entropy_ratio
0.8083

Count of Safety Measures Worse

categorical feature
This is a low-cardinality count column (5 distinct values) recording how many safety measures were rated worse than the national benchmark, taking integer values 0-3 plus a 'Not Available' sentinel. Most rows (54.3%) are 0 and another 2065 are 'Not Available', leaving only 415 rows with at least one 'worse' measure (1-3). The mix of numeric strings and a textual missing-marker means it isn't cleanly numeric as stored. Treatment: Recode 'Not Available' to NaN and cast to integer before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
5
top_value
0
top_rate
0.5425
cardinality
5
entropy
1.338
entropy_ratio
0.5764

Safety Group Footnote

numeric metadata null_rate
This appears to be a footnote code column tied to a 'Safety Group' classification, encoded numerically but with only 4 distinct values ranging from 5 to 23. 61.8% of rows are null, suggesting footnotes apply only to a minority of records. The bimodal-looking distribution (median 5, q3 19, kurtosis -1.81) indicates these are categorical flags rather than a true measurement. Treatment: Cast to categorical and treat nulls as 'no footnote' rather than imputing numerically. high · anthropic:claude-opus-4-7
n
5,421
nulls
3,350 (61.8%)
unique
4
min
5
max
23
mean
10.69
median
5
std
6.95
q1
5
q3
19
iqr
14
skew
0.4116
kurtosis
-1.809
n_outliers
0
outlier_rate
0
zero_rate
0

READM Group Measure Count

categorical metadata
A categorical column with only two distinct values: the count '11' covers 84.1% of 5421 rows, with the remaining 863 rows marked 'Not Available'. Functionally this is a binary availability flag rather than a true count, since '11' is the only numeric value present. The mixing of a numeric string with a sentinel 'Not Available' is the main surprise. Treatment: Recode to a binary available/not-available indicator before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
2
top_value
11
top_rate
0.8408
cardinality
2
entropy
0.6324
entropy_ratio
0.6324

Count of Facility READM Measures

categorical feature
This column reports the count of facility readmission (READM) measures available per facility, stored as strings rather than integers. With 12 distinct values and entropy ratio 0.965, the values are spread fairly evenly across small integers, but the modal value is the literal string "Not Available" at 21.2% (1150 of 5421 rows), which masks missingness as a category. Numeric values like "11", "8", "6", and "9" dominate the rest. Treatment: Replace "Not Available" with null and cast to integer before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
12
top_value
Not Available
top_rate
0.2121
cardinality
12
entropy
3.459
entropy_ratio
0.965

Count of READM Measures Better

categorical feature
Counts how many readmission measures a hospital scored 'better than national rate' on, encoded as small integers 0-5 with 'Not Available' as a seventh category. The distribution is heavily concentrated at 0 (61.5% of 5421 rows) and tails off rapidly: only 28 rows hit 3, 10 hit 4, and 3 hit 5. Notably, 'Not Available' (1150 rows) is the second most common value, larger than every nonzero count combined. Treatment: Cast numeric levels to int and treat 'Not Available' as an explicit missing-category flag before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
7
top_value
0
top_rate
0.6146
cardinality
7
entropy
1.51
entropy_ratio
0.5379

Count of READM Measures No Different

categorical feature
This column counts how many readmission (READM) measures at a hospital scored 'No Different' than the national rate, encoded as small integers (12 distinct levels) alongside a 'Not Available' sentinel. With 13 unique values and entropy ratio 0.92, the distribution is fairly flat across counts 1-9 (each ~370-500 rows), but 'Not Available' is the single largest bucket at 21.2% (1,150 of 5,421). That missingness-as-string is the main surprise and would corrupt any numeric aggregation if not handled. Treatment: Cast to integer with 'Not Available' converted to NaN before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
13
top_value
Not Available
top_rate
0.2121
cardinality
13
entropy
3.408
entropy_ratio
0.9211

Count of READM Measures Worse

categorical feature
This appears to be a hospital-level count of readmission measures performing worse than the national rate, stored as a categorical/string field with values 0-7 plus a 'Not Available' sentinel. The distribution is heavily concentrated at '0' (55.1% of 5421 rows) with another 1150 rows marked 'Not Available', leaving genuine non-zero counts as a long tail that thins to just 1-2 hospitals at values 5, 6, and 7. The mixing of numeric strings with 'Not Available' is the main gotcha. Treatment: Replace 'Not Available' with NaN and cast remaining values to integer before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
9
top_value
0
top_rate
0.5512
cardinality
9
entropy
1.758
entropy_ratio
0.5545

READM Group Footnote

numeric metadata null_rate
Likely a footnote/flag code attached to a hospital readmission group metric, encoded as a small integer rather than a true numeric quantity. Only 3 distinct values appear (min 5, max 22, median 19) and 78.79% of rows are null, so the column is sparse and categorical in nature. The non-null distribution is concentrated at the higher codes, with q1=5 and q3=19 indicating a bimodal split between two code clusters. Treatment: Cast to categorical footnote code and treat nulls as 'no footnote' rather than imputing numerically. high · anthropic:claude-opus-4-7
n
5,421
nulls
4,271 (78.8%)
unique
3
min
5
max
22
mean
15.15
median
19
std
6.366
q1
5
q3
19
iqr
14
skew
-0.9528
kurtosis
-1.051
n_outliers
0
outlier_rate
0
zero_rate
0

Pt Exp Group Measure Count

categorical metadata
A binary categorical with only two values: the string "8" covering 84.08% of 5,421 rows and "Not Available" for the remaining 863. This looks like a measure-count field that is fixed at 8 when reported, otherwise marked as missing via a sentinel rather than a true null (null_rate is 0.0). Entropy ratio of 0.63 reflects that imbalance. Treatment: Convert "Not Available" to a true null and treat as a binary missingness indicator; the column carries little signal otherwise. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
2
top_value
8
top_rate
0.8408
cardinality
2
entropy
0.6324
entropy_ratio
0.6324

Count of Facility Pt Exp Measures

categorical feature
This column is a binary categorical indicator with only two distinct values across all 5,421 rows: the literal string "8" (58.2%) and "Not Available" (41.8%). Despite being labeled as a count, it never varies numerically — every reporting facility either has exactly 8 measures or none reported, making this effectively a presence/absence flag with high entropy ratio (0.98). The 41.8% "Not Available" rate is a substantial missingness signal masquerading as a category. Treatment: Recode as a binary available/not-available flag rather than treating as a numeric count. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
2
top_value
8
top_rate
0.5818
cardinality
2
entropy
0.9806
entropy_ratio
0.9806

Pt Exp Group Footnote

numeric metadata null_rate
This appears to be a footnote code attached to a 'Pt Exp Group' (patient experience group) measure, stored as a numeric but functioning as a categorical flag. Only 3 distinct values appear (min 5, median 5, max 22) and 58.18% of rows are null, so footnotes apply to a minority (about 42%) of records. The bimodal-looking spread (Q1=5, Q3=19, kurtosis -1.66) confirms this is a small set of discrete codes rather than a true numeric measurement. Treatment: Cast to categorical footnote code and treat nulls as 'no footnote' rather than missing. high · anthropic:claude-opus-4-7
n
5,421
nulls
3,154 (58.2%)
unique
3
min
5
max
22
mean
10.15
median
5
std
6.806
q1
5
q3
19
iqr
14
skew
0.571
kurtosis
-1.658
n_outliers
0
outlier_rate
0
zero_rate
0

TE Group Measure Count

categorical feature
Binary categorical column with only two values: '12' (84.1% of rows) and 'Not Available' (the remaining 863 rows). The literal '12' suggests this captures a count of measures per TE group, but it is stored as a string and effectively constant when present. The 'Not Available' sentinel functions as the only real source of variation. Treatment: Convert to a boolean is_available flag; drop the constant '12' value. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
2
top_value
12
top_rate
0.8408
cardinality
2
entropy
0.6324
entropy_ratio
0.6324

Count of Facility TE Measures

categorical feature
This column reports the count of facility TE (timely & effective care) measures available per record, but it's stored as strings mixing numeric values ("4" through "12") with the sentinel "Not Available". The most common value is "Not Available" at 17.1% (928 of 5421), making missingness the modal category despite a reported null_rate of 0. Entropy ratio of 0.93 across 13 categories shows the non-null counts are spread fairly evenly, with "10" and "11" the dominant numeric values. Treatment: Recode "Not Available" to null and cast remaining values to integer before modelling. high · anthropic:claude-opus-4-7
n
5,421
nulls
0 (0.0%)
unique
13
top_value
Not Available
top_rate
0.1712
cardinality
13
entropy
3.458
entropy_ratio
0.9343

TE Group Footnote

numeric metadata null_rate high_skew outliers
This appears to be a footnote reference code for a 'TE Group', stored numerically but functioning as a categorical pointer with only 3 distinct values across 5421 rows. It is overwhelmingly empty (82.88% null) and heavily concentrated at 19 (both q1 and q3 equal 19, iqr=0), producing a strong left skew (-2.43) and 133 outliers (14.33% of the 928 non-null rows) at the low end down to 5. The numeric stats like mean=17.58 are misleading given the column is effectively a small code set. Treatment: Cast to categorical and treat nulls as 'no footnote'; do not use as a numeric feature. high · anthropic:claude-opus-4-7
n
5,421
nulls
4,493 (82.9%)
unique
3
min
5
max
22
mean
17.58
median
19
std
4.432
q1
19
q3
19
iqr
0
skew
-2.43
kurtosis
4.12
n_outliers
133
outlier_rate
0.1433
zero_rate
0
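
The outlier alert on this column is largely an artifact of iqr=0. The profile does not say which outlier rule it uses; the sketch below assumes the common Tukey fences (q1 - 1.5 * IQR, q3 + 1.5 * IQR), under which a zero IQR collapses both fences onto 19 and flags every other code. `iqr_outliers` is a hypothetical helper and the sample values are illustrative.

```python
def iqr_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# With q1 == q3 == 19 the fences are both 19, so 5 and 22 are flagged.
print(iqr_outliers([19, 19, 5, 22, 19], q1=19, q3=19))
```

For a code column like this one, the flagged rows are simply the less common footnote codes, not anomalous measurements.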