healthcare-cms_hospitals_2025

Overview

Source: /home/coolhand/datasets/us-inequality-atlas/healthcare/cms_hospitals_2025.csv

Saturn profiled 5,421 rows across 38 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

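!pip install saturn-dissect
import subprocess

# Run the Saturn CLI on the source CSV and write the findings JSON.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/datasets/us-inequality-atlas/healthcare/cms_hospitals_2025.csv",
        "--findings", "healthcare-cms_hospitals_2025.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError if the CLI exits non-zero
)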

Summary confidence: high

This dataset is a CMS hospital directory covering 5,421 U.S. hospitals across 56 state/territory codes, with 38 columns mixing facility identifiers (Facility ID, Name, Address, Phone), location fields, and a battery of CMS quality-measure summaries (Mortality, Readmission, Safety, Patient Experience, Timely & Effective care). Two things are worth a closer look first: the Hospital overall rating is 'Not Available' for 47% of facilities, and the 'Meets criteria for birthing friendly designation' field is 58% null with only 'Y' as a value, so any rating- or designation-based analysis will be heavily gated by missingness. Beyond that, the mix is dominated by Acute Care Hospitals (3,120 / ~58%) and 'Voluntary non-profit - Private' ownership (2,291 / ~42%), with Texas, California, and Florida holding the largest state shares. The 'Count of … Measures Worse/Better' fields are highly skewed toward 0, suggesting that most hospitals with reported measures look 'no different than national average' on CMS comparisons, which is a useful framing before drilling into outliers.

citing: row_count · column_count · Hospital overall rating · Meets criteria for birthing friendly designation · Hospital Type · Hospital Ownership · State · Emergency Services · Count of MORT Measures Worse · Count of READM Measures Better

Out[4]:

saturn.schema() · 38 columns

column	kind	n	null%	unique	alerts
Facility ID	text	5,421	0.0%	5,421	near_unique one_word allcaps short_text
Facility Name	text	5,421	0.0%	5,286	near_unique allcaps
Address	text	5,421	0.0%	5,387	near_unique allcaps
City/Town	text	5,421	0.0%	3,049	one_word allcaps short_text duplicates
State	categorical	5,421	0.0%	56
ZIP Code	numeric	5,421	0.0%	4,721
County/Parish	text	5,421	0.0%	1,555	one_word allcaps short_text duplicates
Telephone Number	text	5,421	0.0%	5,383	near_unique allcaps short_text
Hospital Type	categorical	5,421	0.0%	8
Hospital Ownership	categorical	5,421	0.0%	12
Emergency Services	categorical	5,421	0.0%	2
Meets criteria for birthing friendly designation	categorical	5,421	58.2%	1	null_rate imbalance
Hospital overall rating	categorical	5,421	0.0%	6
Hospital overall rating footnote	categorical	5,421	52.7%	7	null_rate
MORT Group Measure Count	categorical	5,421	0.0%	2
Count of Facility MORT Measures	categorical	5,421	0.0%	8
Count of MORT Measures Better	categorical	5,421	0.0%	9
Count of MORT Measures No Different	categorical	5,421	0.0%	9
Count of MORT Measures Worse	categorical	5,421	0.0%	7
MORT Group Footnote	numeric	5,421	67.2%	4	null_rate
Safety Group Measure Count	categorical	5,421	0.0%	2
Count of Facility Safety Measures	categorical	5,421	0.0%	9
Count of Safety Measures Better	categorical	5,421	0.0%	8
Count of Safety Measures No Different	categorical	5,421	0.0%	10
Count of Safety Measures Worse	categorical	5,421	0.0%	5
Safety Group Footnote	numeric	5,421	61.8%	4	null_rate
READM Group Measure Count	categorical	5,421	0.0%	2
Count of Facility READM Measures	categorical	5,421	0.0%	12
Count of READM Measures Better	categorical	5,421	0.0%	7
Count of READM Measures No Different	categorical	5,421	0.0%	13
Count of READM Measures Worse	categorical	5,421	0.0%	9
READM Group Footnote	numeric	5,421	78.8%	3	null_rate
Pt Exp Group Measure Count	categorical	5,421	0.0%	2
Count of Facility Pt Exp Measures	categorical	5,421	0.0%	2
Pt Exp Group Footnote	numeric	5,421	58.2%	3	null_rate
TE Group Measure Count	categorical	5,421	0.0%	2
Count of Facility TE Measures	categorical	5,421	0.0%	13
TE Group Footnote	numeric	5,421	82.9%	3	null_rate high_skew outliers

Fig 1.

Hospital overall rating · Distribution of CMS star ratings — note that 'Not Available' is the single largest bucket (47%).

Top values for Hospital overall rating (6 unique shown, of 6 total).
value	count	share
Not Available	2552	47.1%
3	937	17.3%
4	765	14.1%
2	649	12.0%
5	289	5.3%
1	229	4.2%

Fig 2.

Hospital Type · Acute Care Hospitals dominate at ~58%, with Critical Access and Psychiatric making up most of the remainder.

Top values for Hospital Type (8 unique shown, of 8 total).
value	count	share
Acute Care Hospitals	3120	57.6%
Critical Access Hospitals	1375	25.4%
Psychiatric	626	11.5%
Acute Care - Veterans Administration	132	2.4%
Childrens	94	1.7%
Rural Emergency Hospital	38	0.7%
Acute Care - Department of Defense	32	0.6%
Long-term	4	0.1%

Fig 3.

Hospital Ownership · 'Voluntary non-profit - Private' leads at ~42%; useful for segmenting quality outcomes by ownership model.

Top values for Hospital Ownership (12 unique shown, of 12 total).
value	count	share
Voluntary non-profit - Private	2291	42.3%
Proprietary	1067	19.7%
Government - Hospital District or Authority	521	9.6%
Government - Local	400	7.4%
Voluntary non-profit - Other	361	6.7%
Voluntary non-profit - Church	275	5.1%
Government - State	210	3.9%
Veterans Health Administration	132	2.4%
Physician	74	1.4%
Government - Federal	44	0.8%
Department of Defense	32	0.6%
Tribal	14	0.3%

Fig 4.

State · Hospital counts by state — TX, CA, and FL top the list; helpful for normalizing any state-level comparisons.

Top values for State (20 unique shown, of 56 total).
value	count	share
TX	462	8.5%
CA	378	7.0%
FL	221	4.1%
IL	194	3.6%
OH	194	3.6%
NY	191	3.5%
PA	187	3.4%
LA	160	3.0%
GA	149	2.7%
IN	149	2.7%
MI	147	2.7%
WI	142	2.6%
KS	139	2.6%
MN	136	2.5%
OK	135	2.5%
TN	123	2.3%
MO	121	2.2%
NC	120	2.2%
IA	118	2.2%
AZ	106	2.0%

Fig 5.

Emergency Services · 83% of facilities offer emergency services, a quick sanity check on facility mix before quality analysis.

Top values for Emergency Services (2 unique shown, of 2 total).
value	count	share
Yes	4505	83.1%
No	916	16.9%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
Facility ID	text	0.0%
Facility Name	text	0.0%
Address	text	0.0%
City/Town	text	0.0%
State	categorical	0.0%
ZIP Code	numeric	0.0%
County/Parish	text	0.0%
Telephone Number	text	0.0%
Hospital Type	categorical	0.0%
Hospital Ownership	categorical	0.0%
Emergency Services	categorical	0.0%
Meets criteria for birthing friendly designation	categorical	58.2%
Hospital overall rating	categorical	0.0%
Hospital overall rating footnote	categorical	52.7%
MORT Group Measure Count	categorical	0.0%
Count of Facility MORT Measures	categorical	0.0%
Count of MORT Measures Better	categorical	0.0%
Count of MORT Measures No Different	categorical	0.0%
Count of MORT Measures Worse	categorical	0.0%
MORT Group Footnote	numeric	67.2%
Safety Group Measure Count	categorical	0.0%
Count of Facility Safety Measures	categorical	0.0%
Count of Safety Measures Better	categorical	0.0%
Count of Safety Measures No Different	categorical	0.0%
Count of Safety Measures Worse	categorical	0.0%
Safety Group Footnote	numeric	61.8%
READM Group Measure Count	categorical	0.0%
Count of Facility READM Measures	categorical	0.0%
Count of READM Measures Better	categorical	0.0%
Count of READM Measures No Different	categorical	0.0%
Count of READM Measures Worse	categorical	0.0%
READM Group Footnote	numeric	78.8%
Pt Exp Group Measure Count	categorical	0.0%
Count of Facility Pt Exp Measures	categorical	0.0%
Pt Exp Group Footnote	numeric	58.2%
TE Group Measure Count	categorical	0.0%
Count of Facility TE Measures	categorical	0.0%
TE Group Footnote	numeric	82.9%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 6 numeric columns (values clipped to 2 decimals).
	ZIP Code	MORT Group Footnote	Safety Group Footnote	READM Group Footnote	Pt Exp Group Footnote	TE Group Footnote
ZIP Code	+1.00	-0.01	-0.04	-0.09	-0.02	+0.03
MORT Group Footnote	-0.01	+1.00	+0.13	+0.24	+0.07	+0.09
Safety Group Footnote	-0.04	+0.13	+1.00	+0.10	+0.27	-0.00
READM Group Footnote	-0.09	+0.24	+0.10	+1.00	+0.06	+0.07
Pt Exp Group Footnote	-0.02	+0.07	+0.27	+0.06	+1.00	+0.03
TE Group Footnote	+0.03	+0.09	-0.00	+0.07	+0.03	+1.00

Facility ID text identifier

Facility ID is a fixed-width 6-character single-token code, unique across all 5421 rows with zero nulls or duplicates. Every value is one word in all caps with length exactly 6, and the sampled tokens (e.g., 010001, 010005) are zero-padded numeric strings consistent with a primary facility identifier.

Treatment: Treat as primary key; left-join on this id and exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high
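
As a minimal sketch of the primary-key treatment, the check below assumes the IDs were read as strings (so zero-padding survives) and confirms fixed width and uniqueness before any join. The helper name `check_facility_ids` and the sample list are illustrative and not part of Saturn; only 010001 and 010005 are tokens quoted in the profile.

```python
from collections import Counter

def check_facility_ids(ids, width=6):
    """Return (bad_width, duplicated) lists for a sequence of ID strings."""
    bad_width = [i for i in ids if len(i) != width]
    duplicated = [i for i, n in Counter(ids).items() if n > 1]
    return bad_width, duplicated

# Illustrative values only: one ID has lost its leading zero, one repeats.
sample = ["010001", "010005", "10006", "010005"]
bad_width, duplicated = check_facility_ids(sample)
print(bad_width)    # IDs that are not exactly 6 characters
print(duplicated)   # IDs that would fan out a left join
```

On the profiled file both lists would be expected to come back empty, since the stats report a fixed length of 6 and zero duplicates.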

Out[13]:

saturn.columns["Facility ID"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5,421
len_min	6
len_max	6
len_mean	6
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	5,421
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for Facility ID.

Character-length distribution for Facility ID (mean: 6.0).
chars	count
6	5421

Facility Name text identifier

This column holds names of healthcare facilities, dominated by terms like 'hospital' (2740), 'center' (1579), and 'medical' (1444), with a typical length of 4 words / 28 characters. Values are near-unique (5286 distinct out of 5421) yet 135 duplicates remain, suggesting either multi-site chains or repeated reporting rows for the same facility. Nearly everything is uppercase (99.3% allcaps rate), which will trip case-sensitive joins.

Treatment: Normalize case and whitespace before using as a join key on facility.

anthropic:claude-opus-4-7 · confidence high
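
One way to apply the case-and-whitespace normalization is sketched below. The helper `normalize_name` and the sample strings are hypothetical; upper-casing is chosen only because the column is already 99.3% all-caps. Because 135 names repeat, a normalized name is probably best paired with Facility ID or State rather than used as a key on its own.

```python
import re

def normalize_name(name):
    """Trim, collapse internal whitespace, and upper-case a facility name."""
    return re.sub(r"\s+", " ", name).strip().upper()

# Hypothetical variants of one name, not taken from the file.
print(normalize_name("  Example  Medical Center "))
print(normalize_name("EXAMPLE MEDICAL CENTER"))
```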

Out[16]:

saturn.columns["Facility Name"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5,286
len_min	3
len_max	74
len_mean	29.21
len_median	28
len_p95	45
word_mean	3.995
word_median	4
n_empty	0
n_duplicates	135
duplicate_rate	0.0249
vocab_size	3,942
readability_flesch_mean	6.842
emoji_rate	0
url_rate	0
one_word_rate	0.001845
allcaps_rate	0.9932
boilerplate_rate	0
alert: near_unique	97.5% of rows are unique strings
alert: allcaps	99.3% rows are all-caps

Fig 9.

Character-length distribution for Facility Name.

Character-length distribution for Facility Name (mean: 29.21).
chars	count
3 – 5	2
5 – 7	1
7 – 8	1
8 – 10	11
10 – 12	12
12 – 14	43
14 – 15	91
15 – 17	178
17 – 19	93
19 – 21	258
21 – 23	424
23 – 24	526
24 – 26	572
26 – 28	258
28 – 30	548
30 – 31	525
31 – 33	460
33 – 35	160
35 – 37	308
37 – 38	232
38 – 40	170
40 – 42	149
42 – 44	47
44 – 46	105
46 – 47	82
47 – 49	79
49 – 51	55
51 – 53	7
53 – 54	4
54 – 56	2
56 – 58	1
58 – 60	3
60 – 62	4
62 – 63	1
63 – 65	2
65 – 67	1
67 – 69	2
69 – 70	0
70 – 72	3
72 – 74	1

Address text identifier

Free-text street addresses, with 5387 unique values across 5421 rows (near-unique) and no nulls. 99.2% are all-caps and the top tokens are 'street', 'st', 'avenue', 'drive', 'road', confirming postal-style entries averaging 3.75 words and 19 characters. 34 duplicates exist but no boilerplate, URLs, or emoji.

Treatment: Drop or geocode/parse into structured components before modelling; do not feed raw.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["Address"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5,387
len_min	7
len_max	50
len_mean	19.37
len_median	19
len_p95	29
word_mean	3.754
word_median	4
n_empty	0
n_duplicates	34
duplicate_rate	0.006272
vocab_size	4,996
readability_flesch_mean	79.27
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0.9921
boilerplate_rate	0
alert: near_unique	99.4% of rows are unique strings
alert: allcaps	99.2% rows are all-caps

Fig 10.

Character-length distribution for Address.

Character-length distribution for Address (mean: 19.37).
chars	count
7 – 8	3
8 – 9	8
9 – 10	32
10 – 11	66
11 – 12	115
12 – 13	227
13 – 15	288
15 – 16	380
16 – 17	481
17 – 18	502
18 – 19	541
19 – 20	510
20 – 21	413
21 – 22	691
22 – 23	257
23 – 24	218
24 – 25	151
25 – 26	121
26 – 27	73
27 – 28	72
28 – 30	52
30 – 31	47
31 – 32	34
32 – 33	27
33 – 34	15
34 – 35	12
35 – 36	34
36 – 37	12
37 – 38	8
38 – 39	7
39 – 40	6
40 – 41	3
41 – 42	2
42 – 44	3
44 – 45	0
45 – 46	3
46 – 47	2
47 – 48	0
48 – 49	0
49 – 50	5

City/Town text feature

This is a US city/town name field, almost entirely uppercased (allcaps_rate 0.994) and dominated by single-word entries (one_word_rate 0.771, word_mean 1.24). Of 5421 rows there are 3049 unique values with a 0.438 duplicate rate, led by CHICAGO (34), HOUSTON (31), and COLUMBUS (23). Lengths are short and tight (len_mean 8.6, len_max 24) and there are no nulls or empties.

Treatment: Normalize case and standardize against a city gazetteer (ideally with state) before grouping or joining.

anthropic:claude-opus-4-7 · confidence high

Out[22]:

saturn.columns["City/Town"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	3,049
len_min	3
len_max	24
len_mean	8.611
len_median	8
len_p95	13
word_mean	1.241
word_median	1
n_empty	0
n_duplicates	2,372
duplicate_rate	0.4376
vocab_size	2,890
readability_flesch_mean	18.29
emoji_rate	0
url_rate	0
one_word_rate	0.7709
allcaps_rate	0.9943
boilerplate_rate	0
alert: one_word	77.1% rows are a single word
alert: allcaps	99.4% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	43.8% duplicate strings

Fig 11.

Character-length distribution for City/Town.

Character-length distribution for City/Town (mean: 8.61).
chars	count
3 – 4	10
4 – 4	122
4 – 5	0
5 – 5	332
5 – 6	0
6 – 6	761
6 – 7	0
7 – 7	895
7 – 8	0
8 – 8	737
8 – 9	0
9 – 9	694
9 – 10	0
10 – 10	705
10 – 11	0
11 – 11	446
11 – 12	0
12 – 12	295
12 – 13	0
13 – 14	191
14 – 14	98
14 – 15	0
15 – 15	50
15 – 16	0
16 – 16	54
16 – 17	0
17 – 17	15
17 – 18	0
18 – 18	6
18 – 19	0
19 – 19	3
19 – 20	0
20 – 20	6
20 – 21	0
21 – 21	0
21 – 22	0
22 – 22	0
22 – 23	0
23 – 23	0
23 – 24	1

State categorical feature

US state codes covering 56 distinct values across 5,421 rows with no nulls — likely all 50 states plus territories/DC. Distribution is broad and near-uniform (entropy ratio 0.917), with TX leading at only 8.5% and CA, FL, IL, OH following. The 56 cardinality is worth noting since it exceeds the standard 50 states.

Treatment: one-hot or target-encode for modelling; optionally group rare territories.

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["State"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	56
top_value	TX
top_rate	0.08522
cardinality	56
entropy	5.328
entropy_ratio	0.9174

Fig 12.

Top values for State.

Top values for State (20 unique shown, of 56 total).
value	count	share
TX	462	8.5%
CA	378	7.0%
FL	221	4.1%
IL	194	3.6%
OH	194	3.6%
NY	191	3.5%
PA	187	3.4%
LA	160	3.0%
GA	149	2.7%
IN	149	2.7%
MI	147	2.7%
WI	142	2.6%
KS	139	2.6%
MN	136	2.5%
OK	135	2.5%
TN	123	2.3%
MO	121	2.2%
NC	120	2.2%
IA	118	2.2%
AZ	106	2.0%

ZIP Code numeric identifier

This is a US ZIP code field stored as a numeric column, with 4721 unique values across 5421 rows and no nulls. The range (603 to 99929) and broad IQR (32771-76104) are consistent with national ZIP coverage, and leading-zero ZIPs (e.g., the min of 603) have already been corrupted by numeric storage. Treating the mean (53780) or std (27064) as meaningful is misleading since ZIPs are categorical identifiers, not quantities.

Treatment: Cast to zero-padded 5-character strings and treat as a categorical/geographic key rather than a numeric feature.

anthropic:claude-opus-4-7 · confidence high
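
The zero-padding cast can be sketched in one line of Python. The helper name `zip_to_str` is illustrative, and the sketch assumes every value is a 5-digit ZIP that was read as a number (no ZIP+4 suffixes).

```python
def zip_to_str(value):
    """Render a ZIP that was read as a number back into a 5-character string."""
    return str(int(value)).zfill(5)

print(zip_to_str(603))      # the column minimum reported above
print(zip_to_str(99929))    # the column maximum
```

Reading the column as text in the first place avoids the round trip; the cast is only a repair for data that has already been loaded as numeric.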

Out[28]:

saturn.columns["ZIP Code"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	4,721
min	603
max	99,929
mean	5.378e+04
median	55,066
std	2.706e+04
q1	32,771
q3	76,104
iqr	43,333
skew	-0.1646
kurtosis	-0.9879
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 13.

Distribution of ZIP Code. Vertical dash marks the median.

Histogram bins for ZIP Code (median: 55066.0).
bin	count
603 – 3086	161
3086 – 5569	73
5569 – 8052	92
8052 – 1.054e+04	57
1.054e+04 – 1.302e+04	90
1.302e+04 – 1.55e+04	101
1.55e+04 – 1.799e+04	81
1.799e+04 – 2.047e+04	108
2.047e+04 – 2.295e+04	79
2.295e+04 – 2.543e+04	81
2.543e+04 – 2.792e+04	82
2.792e+04 – 3.04e+04	191
3.04e+04 – 3.288e+04	171
3.288e+04 – 3.537e+04	168
3.537e+04 – 3.785e+04	151
3.785e+04 – 4.033e+04	180
4.033e+04 – 4.282e+04	87
4.282e+04 – 4.53e+04	154
4.53e+04 – 4.778e+04	174
4.778e+04 – 5.027e+04	180
5.027e+04 – 5.275e+04	98
5.275e+04 – 5.523e+04	160
5.523e+04 – 5.772e+04	175
5.772e+04 – 6.02e+04	147
6.02e+04 – 6.268e+04	140
6.268e+04 – 6.516e+04	120
6.516e+04 – 6.765e+04	133
6.765e+04 – 7.013e+04	145
7.013e+04 – 7.261e+04	205
7.261e+04 – 7.51e+04	191
7.51e+04 – 7.758e+04	236
7.758e+04 – 8.006e+04	204
8.006e+04 – 8.255e+04	103
8.255e+04 – 8.503e+04	132
8.503e+04 – 8.751e+04	106
8.751e+04 – 9e+04	72
9e+04 – 9.248e+04	140
9.248e+04 – 9.496e+04	146
9.496e+04 – 9.745e+04	159
9.745e+04 – 9.993e+04	148

County/Parish text feature

This column holds U.S. county or parish names, stored uppercase and almost always as a single token (one_word_rate 0.87, allcaps_rate 1.0). With 1,555 uniques across 5,421 rows and a 71.3% duplicate rate, common counties repeat heavily — LOS ANGELES (88), JEFFERSON and COOK (59 each) lead. Note that vocab_size (1591) exceeds n_unique (1555), implying multi-word names like SAN/LOS/ST. variants contribute extra tokens.

Treatment: Normalize casing and join to a state column before using as a categorical geographic feature.

anthropic:claude-opus-4-7 · confidence high
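
Because the same county name can occur in more than one state, a composite key is a reasonable way to follow the treatment above. The sketch below uses a hypothetical helper, `county_key`, and made-up rows; it is not how Saturn groups values.

```python
def county_key(state, county):
    """Build a (state, county) key with consistent casing and spacing."""
    return (state.strip().upper(), " ".join(county.split()).upper())

# Hypothetical rows: the same county name under two different states.
rows = [("AL", "JEFFERSON"), ("KY", "Jefferson "), ("AL", "jefferson")]
keys = {county_key(s, c) for s, c in rows}
print(sorted(keys))
```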

Out[31]:

saturn.columns["County/Parish"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	1,555
len_min	3
len_max	25
len_mean	7.34
len_median	7
len_p95	11
word_mean	1.135
word_median	1
n_empty	0
n_duplicates	3,866
duplicate_rate	0.7132
vocab_size	1,591
readability_flesch_mean	34.44
emoji_rate	0
url_rate	0
one_word_rate	0.8733
allcaps_rate	1
boilerplate_rate	0
alert: one_word	87.3% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	71.3% duplicate strings

Fig 14.

Character-length distribution for County/Parish.

Character-length distribution for County/Parish (mean: 7.34).
chars	count
3 – 4	36
4 – 4	472
4 – 5	0
5 – 5	659
5 – 6	0
6 – 6	1008
6 – 7	0
7 – 7	956
7 – 8	0
8 – 8	834
8 – 9	575
9 – 10	0
10 – 10	430
10 – 11	0
11 – 11	195
11 – 12	0
12 – 12	101
12 – 13	0
13 – 13	40
13 – 14	0
14 – 15	75
15 – 15	18
15 – 16	0
16 – 16	2
16 – 17	0
17 – 17	5
17 – 18	0
18 – 18	3
18 – 19	0
19 – 20	5
20 – 20	4
20 – 21	0
21 – 21	1
21 – 22	0
22 – 22	0
22 – 23	0
23 – 23	1
23 – 24	0
24 – 24	0
24 – 25	1

Telephone Number text identifier

This column holds US-style telephone numbers, every value exactly 14 characters long with a mean of 2 words, consistent with a `(NPA) NNN-NNNN` format. Of 5421 rows, 5383 are unique and 38 duplicates appear (0.7%), with no nulls; the most common tokens are area codes like (406), (605), and (402). The near-unique cardinality and rigid length make this an identifier rather than a feature.

Treatment: Drop for modelling; retain as a contact identifier or parse out the area code if geography is useful.

anthropic:claude-opus-4-7 · confidence high
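
If the area code is wanted as a coarse geographic feature, a strict pattern for the 14-character `(NPA) NNN-NNNN` layout is one option. The sketch below is illustrative: `area_code` is a hypothetical helper and the phone number is made up.

```python
import re

# Matches exactly "(NPA) NNN-NNNN", the 14-character layout described above.
PHONE_RE = re.compile(r"^\((\d{3})\) \d{3}-\d{4}$")

def area_code(phone):
    """Return the 3-digit area code, or None if the format does not match."""
    m = PHONE_RE.match(phone)
    return m.group(1) if m else None

# Made-up number, not a value from the file.
print(area_code("(406) 555-0100"))
print(area_code("406-555-0100"))   # different layout, so no match
```

Returning None instead of raising keeps malformed rows visible for review rather than silently dropping them.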

Out[34]:

saturn.columns["Telephone Number"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5,383
len_min	14
len_max	14
len_mean	14
len_median	14
len_p95	14
word_mean	2
word_median	2
n_empty	0
n_duplicates	38
duplicate_rate	0.00701
vocab_size	5,550
readability_flesch_mean	120.2
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	99.3% of rows are unique strings
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 15.

Character-length distribution for Telephone Number.

Character-length distribution for Telephone Number (mean: 14.0).
chars	count
14	5421

Hospital Type categorical feature

Categorical classifier of hospital facilities across 8 types, with no nulls in 5421 rows. Acute Care Hospitals dominate at 57.6% (3120 records), followed by Critical Access (1375) and Psychiatric (626); the long tail is sparse, with only 4 Long-term facilities. Entropy ratio of 0.55 confirms the distribution is heavily concentrated in the top category.

Treatment: One-hot encode, optionally collapsing the four rarest types into an 'Other' bucket.

anthropic:claude-opus-4-7 · confidence high
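
Collapsing the four rarest types can be sketched as follows, using the counts from the Hospital Type table in this report. The helper `collapse_rare` and the bucket label 'Other' are illustrative choices.

```python
def collapse_rare(counts, n_rare, other="Other"):
    """Merge the n_rare least frequent categories into a single bucket."""
    if n_rare <= 0:
        return dict(counts)
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    keep, rare = ordered[:-n_rare], ordered[-n_rare:]
    merged = dict(keep)
    merged[other] = sum(n for _, n in rare)
    return merged

# Counts copied from the Hospital Type table in this report.
hospital_type = {
    "Acute Care Hospitals": 3120,
    "Critical Access Hospitals": 1375,
    "Psychiatric": 626,
    "Acute Care - Veterans Administration": 132,
    "Childrens": 94,
    "Rural Emergency Hospital": 38,
    "Acute Care - Department of Defense": 32,
    "Long-term": 4,
}
collapsed = collapse_rare(hospital_type, n_rare=4)
print(collapsed)
```

With these counts the 'Other' bucket holds 168 facilities (94 + 38 + 32 + 4), which leaves five levels to one-hot encode instead of eight.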

Out[37]:

saturn.columns["Hospital Type"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	8
top_value	Acute Care Hospitals
top_rate	0.5755
cardinality	8
entropy	1.654
entropy_ratio	0.5513
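
The entropy figures in these stats tables appear consistent with Shannon entropy in bits divided by log2 of the cardinality (for State, 5.328 / log2(56) ≈ 0.917). That formula is inferred from the reported numbers, not taken from Saturn's documentation. A sketch that recomputes it from the Hospital Type counts:

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Counts from the Hospital Type table in this report.
counts = [3120, 1375, 626, 132, 94, 38, 32, 4]
print(round(entropy_bits(counts), 3), round(entropy_ratio(counts), 4))
```

A ratio near 1 means the categories are close to equally frequent; a ratio near 0 means one category holds almost all rows.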

Fig 16.

Top values for Hospital Type.

Top values for Hospital Type (8 unique shown, of 8 total).
value	count	share
Acute Care Hospitals	3120	57.6%
Critical Access Hospitals	1375	25.4%
Psychiatric	626	11.5%
Acute Care - Veterans Administration	132	2.4%
Childrens	94	1.7%
Rural Emergency Hospital	38	0.7%
Acute Care - Department of Defense	32	0.6%
Long-term	4	0.1%

Hospital Ownership categorical feature

Categorical descriptor of hospital ownership type across 5421 records, with 12 distinct categories and no nulls. Distribution is dominated by 'Voluntary non-profit - Private' at 42.26% of rows (2291), followed by 'Proprietary' at 19.7% (1067) and a long tail down to 'Tribal' at 14. Entropy ratio of 0.72 indicates moderate concentration but reasonable spread across the taxonomy.

Treatment: One-hot or target-encode for modelling; consider collapsing rare tiers like 'Physician' and 'Government - Federal'.

anthropic:claude-opus-4-7 · confidence high

Out[40]:

saturn.columns["Hospital Ownership"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	12
top_value	Voluntary non-profit - Private
top_rate	0.4226
cardinality	12
entropy	2.586
entropy_ratio	0.7215

Fig 17.

Top values for Hospital Ownership.

Top values for Hospital Ownership (12 unique shown, of 12 total).
value	count	share
Voluntary non-profit - Private	2291	42.3%
Proprietary	1067	19.7%
Government - Hospital District or Authority	521	9.6%
Government - Local	400	7.4%
Voluntary non-profit - Other	361	6.7%
Voluntary non-profit - Church	275	5.1%
Government - State	210	3.9%
Veterans Health Administration	132	2.4%
Physician	74	1.4%
Government - Federal	44	0.8%
Department of Defense	32	0.6%
Tribal	14	0.3%

Emergency Services categorical feature

Binary Yes/No flag indicating whether the facility offers emergency services, with no nulls across 5421 rows. The distribution is imbalanced: 'Yes' accounts for 83.1% (4505) versus 16.9% (916) 'No', giving an entropy ratio of 0.655.

Treatment: Encode as a binary 0/1 indicator; be mindful of class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[43]:

saturn.columns["Emergency Services"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	Yes
top_rate	0.831
cardinality	2
entropy	0.6553
entropy_ratio	0.6553

Fig 18.

Top values for Emergency Services.

Top values for Emergency Services (2 unique shown, of 2 total).
value	count	share
Yes	4505	83.1%
No	916	16.9%

Meets criteria for birthing friendly designation categorical feature

This appears to be a binary flag indicating whether a hospital meets criteria for a 'birthing friendly' designation, but it functions as a presence indicator only. Of 5421 rows, 58.24% (3157) are null and the remaining 2264 are all 'Y'; there are no 'N' values, giving cardinality 1 and zero entropy. Nulls therefore most likely mean 'not designated' rather than missing data, though the profile alone cannot distinguish the two.

Treatment: Recode nulls as 'N' to form a usable binary indicator, or drop since it carries the same information as the null mask.

anthropic:claude-opus-4-7 · confidence high
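
The recode can be sketched as below, under the assumption (suggested by the profile but not confirmed by it) that a null means 'not designated'. The helper name `birthing_friendly_flag` is illustrative.

```python
def birthing_friendly_flag(value):
    """Map 'Y' to 1 and null/empty to 0; reject anything unexpected."""
    if value is None or value.strip() == "":
        return 0
    if value.strip().upper() == "Y":
        return 1
    raise ValueError(f"unexpected value: {value!r}")

print([birthing_friendly_flag(v) for v in ["Y", None, "", "Y"]])
```

Raising on any other value is deliberate: if a later release of the file starts carrying 'N', the recode should fail loudly rather than fold it into 0 unnoticed.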

Out[46]:

saturn.columns["Meets criteria for birthing friendly designation"].stats

stat	value
n	5,421
nulls	3,157 (58.2%)
unique	1
top_value	Y
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	58.2% null
alert: imbalance	top value is 100.0% of rows

Fig 19.

Top values for Meets criteria for birthing friendly designation.

Top values for Meets criteria for birthing friendly designation (1 unique shown, of 1 total).
value	count	share
Y	2264	41.8%

Hospital overall rating categorical label

This is the CMS-style 1-5 hospital star rating stored as a string, with a sixth bucket 'Not Available' for unrated facilities. The striking issue is that 'Not Available' dominates at 47.1% of 5,421 rows, outnumbering every actual rating; among rated hospitals, 3 stars (937) is most common and 1 star (229) least. Entropy ratio of 0.83 across 6 categories confirms the distribution is spread but heavily anchored by missingness-as-category.

Treatment: Recode 'Not Available' to NaN and treat the remainder as an ordinal 1-5 scale.

anthropic:claude-opus-4-7 · confidence high
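
A minimal sketch of that recode, using the counts from the rating table in this report, is shown below. `parse_rating` is a hypothetical helper; Python's None stands in for NaN here.

```python
def parse_rating(value):
    """Return the star rating as an int 1-5, or None for 'Not Available'."""
    if value is None or value.strip() == "Not Available":
        return None
    rating = int(value)
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {value!r}")
    return rating

# Counts from the Hospital overall rating table in this report.
table = {"Not Available": 2552, "3": 937, "4": 765, "2": 649, "5": 289, "1": 229}
rated = {parse_rating(k): n for k, n in table.items() if parse_rating(k) is not None}
n_rated = sum(rated.values())
print(n_rated)   # hospitals with an actual star rating
```

Keeping the result as integers preserves the ordering of the scale, which a one-hot encoding of the original strings would discard.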

Out[49]:

saturn.columns["Hospital overall rating"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	6
top_value	Not Available
top_rate	0.4708
cardinality	6
entropy	2.133
entropy_ratio	0.8252

Fig 20.

Top values for Hospital overall rating.

Top values for Hospital overall rating (6 unique shown, of 6 total).
value	count	share
Not Available	2552	47.1%
3	937	17.3%
4	765	14.1%
2	649	12.0%
5	289	5.3%
1	229	4.2%

Hospital overall rating footnote categorical metadata

Footnote codes attached to the Hospital overall rating, with only 7 distinct values across 5,421 rows. Over half the column is null (null_rate 0.527), and among populated rows code '16' dominates at 65.4% followed by '19', so the field carries low information (entropy_ratio 0.41). One compound entry '16, 23' appears twice, indicating values can be multi-coded.

Treatment: Keep as a categorical flag alongside the rating; split the rare compound codes if you need per-footnote analysis.

anthropic:claude-opus-4-7 · confidence high
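
Splitting compound entries such as '16, 23' can be sketched as follows. The helper `split_footnotes` is illustrative, and it assumes codes are always integers separated by commas, which matches the seven values listed for this column.

```python
def split_footnotes(value):
    """Split a footnote cell such as '16, 23' into a list of int codes."""
    if value is None or value.strip() == "":
        return []
    return [int(part) for part in value.split(",")]

print(split_footnotes("16, 23"))
print(split_footnotes("16"))
print(split_footnotes(None))
```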

Out[52]:

saturn.columns["Hospital overall rating footnote"].stats

stat	value
n	5,421
nulls	2,857 (52.7%)
unique	7
top_value	16
top_rate	0.6537
cardinality	7
entropy	1.158
entropy_ratio	0.4126
alert: null_rate	52.7% null

Fig 21.

Top values for Hospital overall rating footnote.

Top values for Hospital overall rating footnote (7 unique shown, of 7 total).
value	count	share
16	1676	30.9%
19	795	14.7%
5	47	0.9%
22	32	0.6%
17	7	0.1%
23	5	0.1%
16, 23	2	0.0%

MORT Group Measure Count categorical feature

This is a binary categorical field indicating the count of measures in a mortality (MORT) group, holding only the value '7' (84.1% of rows) or 'Not Available' (the remaining 863 rows). Despite being labeled a count, it functions as a presence flag: when data exists, the count is always 7. No nulls are recorded, since missingness is encoded as the literal string 'Not Available'.

Treatment: Recode to a boolean availability flag ('7' vs 'Not Available') before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[55]:

saturn.columns["MORT Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	7
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 22.

Top values for MORT Group Measure Count.

Top values for MORT Group Measure Count (2 unique shown, of 2 total).
value	count	share
7	4558	84.1%
Not Available	863	15.9%

Count of Facility MORT Measures categorical feature

Likely the count of facility mortality (MORT) measures reported per hospital, stored as strings 1-7 alongside a 'Not Available' sentinel. The dominant value is 'Not Available' at 32.8% (1777/5421), which makes the column nominally categorical with 8 levels even though 7 of them are integers. Entropy ratio of 0.92, computed over all 8 levels including the sentinel, indicates that no single level dominates; among available counts, 7 (850) is the most common and 4 (393) the least.

Treatment: Replace 'Not Available' with NaN, cast remaining values to integer, and optionally add a missingness indicator.

anthropic:claude-opus-4-7 · confidence high

Out[58]:

saturn.columns["Count of Facility MORT Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	8
top_value	Not Available
top_rate	0.3278
cardinality	8
entropy	2.765
entropy_ratio	0.9217

Fig 23.

Top values for Count of Facility MORT Measures.

Top values for Count of Facility MORT Measures (8 unique shown, of 8 total).
value	count	share
Not Available	1777	32.8%
7	850	15.7%
6	587	10.8%
1	495	9.1%
5	455	8.4%
3	444	8.2%
2	420	7.7%
4	393	7.2%

Count of MORT Measures Better categorical feature

Categorical count of mortality measures on which a hospital performed better than the national rate, encoded as small integers stored as strings (0–7) plus a 'Not Available' sentinel. The distribution is heavily concentrated at 0 (57.8% of 5421 rows) with another 1777 rows (32.8%) marked 'Not Available', leaving only 508 hospitals (9.4%) with at least one 'better' measure. Entropy ratio of 0.458 confirms the skew, and the sentinel string blocks direct numeric use.

Treatment: Replace 'Not Available' with NaN, cast remaining values to int, and consider binarising (any-better vs none) given the skew.

anthropic:claude-opus-4-7 · confidence high
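
The sentinel handling and the any-better binarisation can be sketched together, using the counts from the table for this column. `parse_count` and `any_better` are hypothetical helpers; None stands in for NaN so that unavailable rows stay distinct from a true 0.

```python
def parse_count(value, sentinel="Not Available"):
    """Return the count as an int, or None when the sentinel is present."""
    return None if value.strip() == sentinel else int(value)

def any_better(value):
    """True/False for 'at least one better measure'; None stays None."""
    n = parse_count(value)
    return None if n is None else n > 0

# Counts from the 'Count of MORT Measures Better' table in this report.
table = {"0": 3136, "Not Available": 1777, "1": 297, "2": 133,
         "3": 53, "4": 15, "5": 7, "7": 2, "6": 1}
n_any = sum(n for k, n in table.items() if any_better(k))
print(n_any, round(100 * n_any / sum(table.values()), 1))
```

The same two helpers would apply unchanged to the 'No Different' and 'Worse' count columns, which share the sentinel.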

Out[61]:

saturn.columns["Count of MORT Measures Better"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	9
top_value	0
top_rate	0.5785
cardinality	9
entropy	1.453
entropy_ratio	0.4583

Fig 24.

Top values for Count of MORT Measures Better.

Top values for Count of MORT Measures Better (9 unique shown, of 9 total).
value	count	share
0	3136	57.8%
Not Available	1777	32.8%
1	297	5.5%
2	133	2.5%
3	53	1.0%
4	15	0.3%
5	7	0.1%
7	2	0.0%
6	1	0.0%

Count of MORT Measures No Different categorical feature

This column counts how many mortality (MORT) measures a hospital scored 'No Different' on, encoded as a small integer 0-7 but stored as strings alongside a 'Not Available' sentinel. Cardinality is just 9 with high entropy ratio (0.885), meaning values are spread fairly evenly across the integers — except 'Not Available' dominates at 32.8% (1777/5421) and '0' is rare at only 12 rows. Analysts should note the heavy missing-as-string encoding and the near-empty '0' bucket.

Treatment: Recode 'Not Available' to null and cast remaining values to integer before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[64]:

saturn.columns["Count of MORT Measures No Different"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	9
top_value	Not Available
top_rate	0.3278
cardinality	9
entropy	2.806
entropy_ratio	0.8852

Fig 25.

Top values for Count of MORT Measures No Different.

Top values for Count of MORT Measures No Different (9 unique shown, of 9 total).
value	count	share
Not Available	1777	32.8%
6	672	12.4%
5	541	10.0%
1	513	9.5%
3	509	9.4%
4	503	9.3%
2	472	8.7%
7	422	7.8%
0	12	0.2%

Count of MORT Measures Worse categorical feature

This column counts how many mortality (MORT) measures a hospital scored worse than the national rate on, encoded as small integers 0-5 with a 'Not Available' sentinel mixed in. Roughly 60% of 5421 rows are '0' and another 1777 are 'Not Available', leaving only 378 hospitals flagged as worse on any MORT measure and just 1 hospital at the maximum of 5. The mixed string/integer encoding is the main gotcha.

Treatment: Replace 'Not Available' with NaN and cast to integer before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[67]:

saturn.columns["Count of MORT Measures Worse"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	7
top_value	0
top_rate	0.6025
cardinality	7
entropy	1.294
entropy_ratio	0.4608

Fig 26.

Top values for Count of MORT Measures Worse.

Top values for Count of MORT Measures Worse (7 unique shown, of 7 total).
value	count	share
0	3266	60.2%
Not Available	1777	32.8%
1	310	5.7%
2	57	1.1%
3	7	0.1%
4	3	0.1%
5	1	0.0%

MORT Group Footnote numeric metadata

This appears to be a footnote code attached to MORT (mortality) group records, stored as a small numeric category with only 4 distinct values ranging from 5 to 23. Two-thirds of rows are null (null_rate 0.672), suggesting the footnote applies only to a minority of records. The strongly negative kurtosis (-1.96) and bimodal-looking quartiles (Q1=5, Q3=19) confirm these are discrete codes rather than a continuous measurement.

Treatment: Cast to categorical and treat nulls as 'no footnote' rather than imputing numerically.

anthropic:claude-opus-4-7 · confidence high
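
One way to apply this treatment in pandas is sketched below. The sample values and the "none" label are assumptions made for illustration; the report does not say what each footnote code means.

```python
import pandas as pd

# Illustrative values only; a CSV read typically yields floats with NaN here.
raw = pd.Series([5.0, None, 19.0, 22.0, None], name="MORT Group Footnote")

# Map each code to a string label so no arithmetic is implied, and give
# missing rows their own explicit level instead of imputing a number.
# "none" is a hypothetical label chosen for this sketch.
labels = raw.map(lambda v: "none" if pd.isna(v) else str(int(v)))
footnote = labels.astype("category")
```

Treating the codes as labels keeps summary statistics such as the mean (11.58) out of downstream analysis, where they carry no meaning.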

Out[70]:

saturn.columns["MORT Group Footnote"].stats

stat	value
n	5,421
nulls	3,643 (67.2%)
unique	4
min	5
max	23
mean	11.58
median	5
std	7.057
q1	5
q3	19
iqr	14
skew	0.1488
kurtosis	-1.959
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	67.2% null

Fig 27.

Distribution of MORT Group Footnote. Vertical dash marks the median.

Histogram bins for MORT Group Footnote (median: 5.0).
bin	count
5 – 5.45	950
18.95 – 19.4	795
21.65 – 22.1	32
22.55 – 23	1
all 36 other bins	0

Safety Group Measure Count categorical feature

Binary categorical with only two values across 5421 rows: the literal string "8" (84.1%) and "Not Available" (15.9%). Despite the name suggesting a count, it functions as a flag indicating either a fixed measure count of 8 or missing data encoded as text.

Treatment: Recode "Not Available" to null and binarize, or drop given near-constant value.

anthropic:claude-opus-4-7 · confidence high

Out[73]:

saturn.columns["Safety Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	8
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 28.

Top values for Safety Group Measure Count.

Top values for Safety Group Measure Count (2 unique shown, of 2 total).
value	count	share
8	4558	84.1%
Not Available	863	15.9%

Count of Facility Safety Measures categorical feature

This column reports the count of facility safety measures, stored as a categorical with 9 distinct values (1-8 plus 'Not Available'). The dominant value is 'Not Available' at 38.1% (2,065 of 5,421 rows), so missingness is encoded as a string rather than as a null. Among reported counts, 7 is most common (733), and the distribution across 1-8 is uneven rather than monotonic.

Treatment: Recode 'Not Available' to null and cast remaining values to integer before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[76]:

saturn.columns["Count of Facility Safety Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	9
top_value	Not Available
top_rate	0.3809
cardinality	9
entropy	2.753
entropy_ratio	0.8684

Fig 29.

Top values for Count of Facility Safety Measures.

Top values for Count of Facility Safety Measures (9 unique shown, of 9 total).
value	count	share
Not Available	2065	38.1%
7	733	13.5%
2	519	9.6%
6	460	8.5%
8	453	8.4%
1	443	8.2%
3	290	5.3%
5	235	4.3%
4	223	4.1%

Count of Safety Measures Better categorical feature

This is a small-integer count (0-6) of safety measures rated 'Better', stored as strings alongside a 'Not Available' sentinel. The sentinel dominates at 38.1% (2065/5421) of rows, and the distribution is heavily right-skewed: 0 and 1 cover 2600 rows while 5 and 6 appear only 14 and 3 times. Cardinality is 8 with entropy ratio 0.70, so most signal sits in the lower bins.

Treatment: Recode 'Not Available' to missing, cast remaining values to integer, and consider binning the sparse 4-6 tail.

anthropic:claude-opus-4-7 · confidence high
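
Binning the sparse tail can be as simple as capping the count. The sketch below assumes the sentinel has already been recoded and the column cast to a nullable integer dtype; the cut-off of 4 follows the suggestion above, and the sample values are illustrative.

```python
import pandas as pd

# Illustrative values only, already recoded to a nullable integer dtype.
counts = pd.Series([0, 1, 2, 4, 5, 6, pd.NA], dtype="Int64")

# Cap at 4 so the sparse 4, 5 and 6 levels share one "4 or more" bin;
# missing values pass through unchanged.
capped = counts.clip(upper=4)
```

With the full column this would merge the 93, 14 and 3 rows at levels 4, 5 and 6 into a single bin of 110.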

Out[79]:

saturn.columns["Count of Safety Measures Better"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	8
top_value	Not Available
top_rate	0.3809
cardinality	8
entropy	2.11
entropy_ratio	0.7033

Fig 30.

Top values for Count of Safety Measures Better.

Top values for Count of Safety Measures Better (8 unique shown, of 8 total).
value	count	share
Not Available	2065	38.1%
0	1548	28.6%
1	1052	19.4%
2	430	7.9%
3	216	4.0%
4	93	1.7%
5	14	0.3%
6	3	0.1%

Count of Safety Measures No Different categorical feature

A small-integer count (0-8) of safety measures rated 'no different', stored as strings alongside a 'Not Available' sentinel. The sentinel dominates at 38.1% of 5421 rows (2065), making it the modal value ahead of any actual count. Among real values, 5, 2, 4, and 1 cluster between 509-656 occurrences while 0 and 8 are rare (20 and 10).

Treatment: Cast numeric levels to int and treat 'Not Available' as an explicit missing indicator before modelling.

anthropic:claude-opus-4-7 · confidence high
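
A hedged sketch of keeping missingness as its own indicator while casting the count, assuming pandas and a string-typed column. The name of the indicator column is hypothetical and the rows are illustrative.

```python
import pandas as pd

SENTINEL = "Not Available"
COL = "Count of Safety Measures No Different"
FLAG = COL + " missing"  # hypothetical name for the indicator column

# Illustrative rows only.
df = pd.DataFrame({COL: ["5", "Not Available", "0", "8"]})

# Record missingness first, so the information survives the numeric cast.
df[FLAG] = df[COL].eq(SENTINEL)

# Then blank out the sentinel and cast what remains to nullable integers.
df[COL] = pd.to_numeric(df[COL].where(~df[FLAG])).astype("Int64")
```

The separate flag lets a model learn from the fact that a hospital was not scored, which a plain NaN may hide once values are imputed.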

Out[82]:

saturn.columns["Count of Safety Measures No Different"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	10
top_value	Not Available
top_rate	0.3809
cardinality	10
entropy	2.685
entropy_ratio	0.8083

Fig 31.

Top values for Count of Safety Measures No Different.

Top values for Count of Safety Measures No Different (10 unique shown, of 10 total).
value	count	share
Not Available	2065	38.1%
5	656	12.1%
2	551	10.2%
4	527	9.7%
1	509	9.4%
6	482	8.9%
3	434	8.0%
7	167	3.1%
0	20	0.4%
8	10	0.2%

Count of Safety Measures Worse categorical feature

This is a low-cardinality count column (5 distinct values) recording how many safety measures a hospital was rated 'Worse' on, taking integer values 0-3 plus a 'Not Available' sentinel. Most rows (54.3%) are 0 and another 2065 are 'Not Available', leaving only 415 rows with one or more 'Worse' measures (1-3). The mix of numeric strings and a textual missing-marker means it isn't cleanly numeric as stored.

Treatment: Recode 'Not Available' to NaN and cast to integer before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[85]:

saturn.columns["Count of Safety Measures Worse"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5
top_value	0
top_rate	0.5425
cardinality	5
entropy	1.338
entropy_ratio	0.5764

Fig 32.

Top values for Count of Safety Measures Worse.

Top values for Count of Safety Measures Worse (5 unique shown, of 5 total).
value	count	share
0	2941	54.3%
Not Available	2065	38.1%
1	365	6.7%
2	44	0.8%
3	6	0.1%

Safety Group Footnote numeric metadata

This appears to be a footnote code column tied to a 'Safety Group' classification, encoded numerically but with only 4 distinct values ranging from 5 to 23. 61.8% of rows are null, suggesting footnotes apply only to a minority of records. The bimodal-looking distribution (median 5, q3 19, kurtosis -1.81) indicates these are categorical flags rather than a true measurement.

Treatment: Cast to categorical and treat nulls as 'no footnote' rather than imputing numerically.

anthropic:claude-opus-4-7 · confidence high

Out[88]:

saturn.columns["Safety Group Footnote"].stats

stat	value
n	5,421
nulls	3,350 (61.8%)
unique	4
min	5
max	23
mean	10.69
median	5
std	6.95
q1	5
q3	19
iqr	14
skew	0.4116
kurtosis	-1.809
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	61.8% null

Fig 33.

Distribution of Safety Group Footnote. Vertical dash marks the median.

Histogram bins for Safety Group Footnote (median: 5.0).
bin	count
5 – 5.45	1238
18.95 – 19.4	795
21.65 – 22.1	32
22.55 – 23	6
all 36 other bins	0

READM Group Measure Count categorical metadata

A categorical column with only two distinct values: the count '11' covers 84.1% of 5421 rows, with the remaining 863 rows marked 'Not Available'. Functionally this is a binary availability flag rather than a true count, since '11' is the only numeric value present. The mixing of a numeric string with a sentinel 'Not Available' is the main surprise.

Treatment: Recode to a binary available/not-available indicator before modelling.

anthropic:claude-opus-4-7 · confidence high
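
A minimal sketch of that recode, assuming pandas and a string-typed column; the sample values are illustrative. The check on the reported values is an extra safeguard added here, not part of the report's recommendation.

```python
import pandas as pd

SENTINEL = "Not Available"

# Illustrative values only.
raw = pd.Series(["11", "11", "Not Available", "11"],
                name="READM Group Measure Count")

# True where the group count is reported, False where the sentinel appears.
is_available = raw.ne(SENTINEL)

# Before discarding the number, confirm it really is constant when present.
reported_values = set(raw[is_available])
assert reported_values == {"11"}, reported_values
```

If a later release of the file ever carries a second numeric value, the assertion fails loudly instead of silently collapsing real variation into a flag.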

Out[91]:

saturn.columns["READM Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	11
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 34.

Top values for READM Group Measure Count.

Top values for READM Group Measure Count (2 unique shown, of 2 total).
value	count	share
11	4558	84.1%
Not Available	863	15.9%

Count of Facility READM Measures categorical feature

This column reports the count of facility readmission (READM) measures available per facility, stored as strings rather than integers. With 12 distinct values and entropy ratio 0.965, the values are spread fairly evenly across small integers, but the modal value is the literal string "Not Available" at 21.2% (1150 of 5421 rows), which masks missingness as a category. Numeric values like "11", "8", "6", and "9" dominate the rest.

Treatment: Replace "Not Available" with null and cast to integer before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[94]:

saturn.columns["Count of Facility READM Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	12
top_value	Not Available
top_rate	0.2121
cardinality	12
entropy	3.459
entropy_ratio	0.965

Fig 35.

Top values for Count of Facility READM Measures.

Top values for Count of Facility READM Measures (12 unique shown, of 12 total).
value	count	share
Not Available	1150	21.2%
11	498	9.2%
8	466	8.6%
6	438	8.1%
9	425	7.8%
3	375	6.9%
2	374	6.9%
7	358	6.6%
5	347	6.4%
4	335	6.2%
10	333	6.1%
1	322	5.9%

Count of READM Measures Better categorical feature

Counts how many readmission measures a hospital scored 'better than national rate' on, encoded as small integers 0-5 with 'Not Available' as a seventh category. The distribution is heavily concentrated at 0 (61.5% of 5421 rows) and tails off rapidly: only 28 rows hit 3, 10 hit 4, and 3 hit 5. Notably, 'Not Available' (1150 rows) is the second most common value, larger than all nonzero counts combined (939 rows).

Treatment: Cast numeric levels to int and treat 'Not Available' as an explicit missing-category flag before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[97]:

saturn.columns["Count of READM Measures Better"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	7
top_value	0
top_rate	0.6146
cardinality	7
entropy	1.51
entropy_ratio	0.5379

Fig 36.

Top values for Count of READM Measures Better.

Top values for Count of READM Measures Better (7 unique shown, of 7 total).
value	count	share
0	3332	61.5%
Not Available	1150	21.2%
1	737	13.6%
2	161	3.0%
3	28	0.5%
4	10	0.2%
5	3	0.1%

Count of READM Measures No Different categorical feature

This column counts how many readmission (READM) measures at a hospital scored 'No Different' than the national rate, encoded as integers 0-11 alongside a 'Not Available' sentinel. With 13 unique values and entropy ratio 0.92, the distribution is fairly flat across counts 1-9 (each ~370-500 rows), thinner at 10 and 11 (249 and 81 rows), and almost empty at 0 (3 rows), but 'Not Available' is the single largest bucket at 21.2% (1,150 of 5,421). That missingness-as-string is the main surprise and would corrupt any numeric aggregation if not handled.

Treatment: Cast to integer with 'Not Available' converted to NaN before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[100]:

saturn.columns["Count of READM Measures No Different"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	13
top_value	Not Available
top_rate	0.2121
cardinality	13
entropy	3.408
entropy_ratio	0.9211

Fig 37.

Top values for Count of READM Measures No Different.

Top values for Count of READM Measures No Different (13 unique shown, of 13 total).
value	count	share
Not Available	1150	21.2%
7	497	9.2%
8	491	9.1%
6	480	8.9%
2	428	7.9%
3	428	7.9%
5	426	7.9%
9	418	7.7%
4	398	7.3%
1	372	6.9%
10	249	4.6%
11	81	1.5%
0	3	0.1%

Count of READM Measures Worse categorical feature

This appears to be a hospital-level count of readmission measures performing worse than the national rate, stored as a categorical/string field with values 0-7 plus a 'Not Available' sentinel. The distribution is heavily concentrated at '0' (55.1% of 5421 rows) with another 1150 rows marked 'Not Available', leaving genuine non-zero counts as a long tail that thins to just 1-2 hospitals at values 5, 6, and 7. The mixing of numeric strings with 'Not Available' is the main gotcha.

Treatment: Replace 'Not Available' with NaN and cast remaining values to integer before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[103]:

saturn.columns["Count of READM Measures Worse"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	9
top_value	0
top_rate	0.5512
cardinality	9
entropy	1.758
entropy_ratio	0.5545

Fig 38.

Top values for Count of READM Measures Worse.

Top values for Count of READM Measures Worse (9 unique shown, of 9 total).
value	count	share
0	2988	55.1%
Not Available	1150	21.2%
1	839	15.5%
2	308	5.7%
3	105	1.9%
4	26	0.5%
6	2	0.0%
5	2	0.0%
7	1	0.0%

READM Group Footnote numeric metadata

Likely a footnote/flag code attached to a hospital readmission group metric, encoded as a small integer rather than a true numeric quantity. Only 3 distinct values appear (min 5, max 22, median 19) and 78.79% of rows are null, so the column is sparse and categorical in nature. The non-null distribution is concentrated at the higher codes, with q1=5 and q3=19 indicating a bimodal split between two code clusters.

Treatment: Cast to categorical footnote code and treat nulls as 'no footnote' rather than imputing numerically.

anthropic:claude-opus-4-7 · confidence high

Out[106]:

saturn.columns["READM Group Footnote"].stats

stat	value
n	5,421
nulls	4,271 (78.8%)
unique	3
min	5
max	22
mean	15.15
median	19
std	6.366
q1	5
q3	19
iqr	14
skew	-0.9528
kurtosis	-1.051
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	78.8% null

Fig 39.

Distribution of READM Group Footnote. Vertical dash marks the median.

Histogram bins for READM Group Footnote (median: 19.0).
bin	count
5 – 5.515	323
18.91 – 19.42	795
21.48 – 22	32
all 30 other bins	0

Pt Exp Group Measure Count categorical metadata

A binary categorical with only two values: the string "8" covering 84.08% of 5,421 rows and "Not Available" for the remaining 863. This looks like a measure-count field that is fixed at 8 when reported, otherwise marked as missing via a sentinel rather than a true null (null_rate is 0.0). Entropy ratio of 0.63 reflects that imbalance.

Treatment: Convert "Not Available" to a true null and treat as a binary missingness indicator; the column carries little signal otherwise.

anthropic:claude-opus-4-7 · confidence high

Out[109]:

saturn.columns["Pt Exp Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	8
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 40.

Top values for Pt Exp Group Measure Count.

Top values for Pt Exp Group Measure Count (2 unique shown, of 2 total).
value	count	share
8	4558	84.1%
Not Available	863	15.9%

Count of Facility Pt Exp Measures categorical feature

This column is a binary categorical indicator with only two distinct values across all 5,421 rows: the literal string "8" (58.2%) and "Not Available" (41.8%). Despite being labeled as a count, it never varies numerically: every facility either reports exactly 8 measures or is marked "Not Available", making this effectively a presence/absence flag with high entropy ratio (0.98). The 41.8% "Not Available" rate is a substantial missingness signal masquerading as a category.

Treatment: Recode as a binary available/not-available flag rather than treating as a numeric count.

anthropic:claude-opus-4-7 · confidence high

Out[112]:

saturn.columns["Count of Facility Pt Exp Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	8
top_rate	0.5818
cardinality	2
entropy	0.9806
entropy_ratio	0.9806

Fig 41.

Top values for Count of Facility Pt Exp Measures.

Top values for Count of Facility Pt Exp Measures (2 unique shown, of 2 total).
value	count	share
8	3154	58.2%
Not Available	2267	41.8%

Pt Exp Group Footnote numeric metadata

This appears to be a footnote code attached to a 'Pt Exp Group' (patient experience group) measure, stored as a numeric but functioning as a categorical flag. Only 3 distinct values appear (min 5, median 5, max 22) and 58.18% of rows are null, so 2,267 rows (41.8%) carry a footnote. The bimodal-looking spread (Q1=5, Q3=19, kurtosis -1.66) confirms this is a small set of discrete codes rather than a true numeric measurement.

Treatment: Cast to categorical footnote code and treat nulls as 'no footnote' rather than missing.

anthropic:claude-opus-4-7 · confidence high

Out[115]:

saturn.columns["Pt Exp Group Footnote"].stats

stat	value
n	5,421
nulls	3,154 (58.2%)
unique	3
min	5
max	22
mean	10.15
median	5
std	6.806
q1	5
q3	19
iqr	14
skew	0.571
kurtosis	-1.658
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	58.2% null

Fig 42.

Distribution of Pt Exp Group Footnote. Vertical dash marks the median.

Histogram bins for Pt Exp Group Footnote (median: 5.0).
bin	count
5 – 5.425	1440
18.6 – 19.02	795
21.57 – 22	32
all 37 other bins	0

TE Group Measure Count categorical feature

Binary categorical column with only two values: '12' (84.1% of rows) and 'Not Available' (the remaining 863 rows). The literal '12' suggests this captures a count of measures per TE group, but it is stored as a string and effectively constant when present. The 'Not Available' sentinel functions as the only real source of variation.

Treatment: Convert to a boolean is_available flag; drop the constant '12' value.

anthropic:claude-opus-4-7 · confidence high

Out[118]:

saturn.columns["TE Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	12
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 43.

Top values for TE Group Measure Count.

Top values for TE Group Measure Count (2 unique shown, of 2 total).
value	count	share
12	4558	84.1%
Not Available	863	15.9%

Count of Facility TE Measures categorical feature

This column reports the count of facility TE (timely & effective care) measures available per record, but it's stored as strings mixing numeric values ("1" through "12") with the sentinel "Not Available". The most common value is "Not Available" at 17.1% (928 of 5421), making missingness the modal category despite a reported null_rate of 0. Entropy ratio of 0.93 across 13 categories shows the non-null counts are spread fairly evenly, with "10" and "11" the dominant numeric values.

Treatment: Recode "Not Available" to null and cast remaining values to integer before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[121]:

saturn.columns["Count of Facility TE Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	13
top_value	Not Available
top_rate	0.1712
cardinality	13
entropy	3.458
entropy_ratio	0.9343

Fig 44.

Top values for Count of Facility TE Measures.

Top values for Count of Facility TE Measures (13 unique shown, of 13 total).
value	count	share
Not Available	928	17.1%
10	759	14.0%
11	724	13.4%
9	543	10.0%
8	391	7.2%
5	351	6.5%
12	347	6.4%
6	337	6.2%
7	284	5.2%
4	272	5.0%
3	269	5.0%
2	163	3.0%
1	53	1.0%

TE Group Footnote numeric metadata

This appears to be a footnote reference code for a 'TE Group', stored numerically but functioning as a categorical pointer with only 3 distinct values across 5421 rows. It is overwhelmingly empty (82.88% null) and heavily concentrated at 19 (both q1 and q3 equal 19, iqr=0), producing a strong left skew (-2.43). Because the IQR is 0, every non-19 value is flagged as an outlier: 133 rows (14.33% of non-null rows), made up of 101 at code 5 and 32 at code 22. The numeric stats like mean=17.58 are misleading given the column is effectively a small code set.

Treatment: Cast to categorical and treat nulls as 'no footnote'; do not use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[124]:

saturn.columns["TE Group Footnote"].stats

stat	value
n	5,421
nulls	4,493 (82.9%)
unique	3
min	5
max	22
mean	17.58
median	19
std	4.432
q1	19
q3	19
iqr	0
skew	-2.43
kurtosis	4.12
n_outliers	133
outlier_rate	0.1433
zero_rate	0
alert: null_rate	82.9% null
alert: high_skew	skew=-2.43
alert: outliers	14.3% rows beyond 1.5 IQR
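
The outlier alert is an artefact of the zero IQR rather than a sign of extreme values. The sketch below rebuilds the non-null values from this column's histogram counts (101, 795 and 32 rows at codes 5, 19 and 22) and applies the usual 1.5 × IQR fences. Saturn's exact quartile method is not stated; the "inclusive" method is an assumption, though with this much mass at 19 the choice should not matter.

```python
import statistics

# Non-null values rebuilt from this column's histogram: 101 rows at code 5,
# 795 at code 19 and 32 at code 22 (928 non-null rows in total).
values = [5] * 101 + [19] * 795 + [22] * 32

q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Standard 1.5 * IQR fences; with iqr == 0 both fences sit exactly on 19.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

outlier_rate = len(outliers) / len(values)
```

Every row that is not code 19 lands outside the fences, which reproduces the reported 133 outliers and is one more reason to treat the column as categorical.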

Fig 45.

Distribution of TE Group Footnote. Vertical dash marks the median.

Histogram bins for TE Group Footnote (median: 19.0).
bin	count
5 – 5.567	101
18.6 – 19.17	795
21.43 – 22	32
all 27 other bins	0

How to cite