data-trove-cms-hospital-database

Overview

Source: /home/coolhand/html/datavis/data_trove/data/healthcare/cms_hospitals_2025.csv

Saturn profiled 5,421 rows across 38 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/healthcare/cms_hospitals_2025.csv",
    "--findings", "data-trove-cms-hospital-database.json",
    "--llm", "anthropic:default",
])

Summary confidence: high

This dataset is a 2025 CMS (Centers for Medicare & Medicaid Services) registry of 5,421 U.S. hospitals, covering identity, location, ownership type, and performance ratings across mortality, readmission, safety, patient experience, and timely care measures. The most striking feature is that 47.1% of hospitals (2,552) have an overall star rating of 'Not Available', which severely limits any headline quality comparison and warrants investigation into which hospital types or states are disproportionately unrated. A second area worth scrutiny is ownership structure: voluntary non-profit private hospitals dominate at 42.3%, yet proprietary hospitals (19.7%) and the four 'Government' categories (21.7% combined) also make up substantial shares. Cross-referencing ownership against star ratings could reveal systematic quality differences.

citing: Hospital overall rating.top_values · Hospital Ownership.top_values · Hospital Type.top_values · State.top_values · Meets criteria for birthing friendly designation.null_rate · Count of Facility MORT Measures.top_values · Emergency Services.top_values
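
The cross-reference suggested in the summary could start from the share of unrated hospitals within each ownership category. The following is a minimal standard-library sketch, assuming the CSV has been read into dictionaries keyed by the column names in the schema (for example with csv.DictReader). The function name and the three sample rows are made up for illustration and are not part of Saturn.

```python
from collections import Counter

def unrated_share_by(rows, group_col, rating_col="Hospital overall rating"):
    """Return {group: share of rows whose rating is 'Not Available'}."""
    totals, unrated = Counter(), Counter()
    for row in rows:
        group = row[group_col]
        totals[group] += 1
        if row[rating_col] == "Not Available":
            unrated[group] += 1
    return {group: unrated[group] / totals[group] for group in totals}

# Illustrative rows only; real rows would come from the source CSV.
sample = [
    {"Hospital Ownership": "Proprietary", "Hospital overall rating": "Not Available"},
    {"Hospital Ownership": "Proprietary", "Hospital overall rating": "3"},
    {"Hospital Ownership": "Tribal", "Hospital overall rating": "Not Available"},
]
shares = unrated_share_by(sample, "Hospital Ownership")
```

The same function works with "Hospital Type" or "State" as the grouping column, which covers the other two breakdowns the summary asks about.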

Out[4]:

saturn.schema() · 38 columns

column	kind	n	null%	unique	alerts
Facility ID	text	5,421	0.0%	5,421	near_unique one_word allcaps short_text
Facility Name	text	5,421	0.0%	5,286	near_unique allcaps
Address	text	5,421	0.0%	5,387	near_unique allcaps
City/Town	text	5,421	0.0%	3,049	one_word allcaps short_text duplicates
State	categorical	5,421	0.0%	56
ZIP Code	numeric	5,421	0.0%	4,721
County/Parish	text	5,421	0.0%	1,555	one_word allcaps short_text duplicates
Telephone Number	text	5,421	0.0%	5,383	near_unique allcaps short_text
Hospital Type	categorical	5,421	0.0%	8
Hospital Ownership	categorical	5,421	0.0%	12
Emergency Services	categorical	5,421	0.0%	2
Meets criteria for birthing friendly designation	categorical	5,421	58.2%	1	null_rate imbalance
Hospital overall rating	categorical	5,421	0.0%	6
Hospital overall rating footnote	categorical	5,421	52.7%	7	null_rate
MORT Group Measure Count	categorical	5,421	0.0%	2
Count of Facility MORT Measures	categorical	5,421	0.0%	8
Count of MORT Measures Better	categorical	5,421	0.0%	9
Count of MORT Measures No Different	categorical	5,421	0.0%	9
Count of MORT Measures Worse	categorical	5,421	0.0%	7
MORT Group Footnote	numeric	5,421	67.2%	4	null_rate
Safety Group Measure Count	categorical	5,421	0.0%	2
Count of Facility Safety Measures	categorical	5,421	0.0%	9
Count of Safety Measures Better	categorical	5,421	0.0%	8
Count of Safety Measures No Different	categorical	5,421	0.0%	10
Count of Safety Measures Worse	categorical	5,421	0.0%	5
Safety Group Footnote	numeric	5,421	61.8%	4	null_rate
READM Group Measure Count	categorical	5,421	0.0%	2
Count of Facility READM Measures	categorical	5,421	0.0%	12
Count of READM Measures Better	categorical	5,421	0.0%	7
Count of READM Measures No Different	categorical	5,421	0.0%	13
Count of READM Measures Worse	categorical	5,421	0.0%	9
READM Group Footnote	numeric	5,421	78.8%	3	null_rate
Pt Exp Group Measure Count	categorical	5,421	0.0%	2
Count of Facility Pt Exp Measures	categorical	5,421	0.0%	2
Pt Exp Group Footnote	numeric	5,421	58.2%	3	null_rate
TE Group Measure Count	categorical	5,421	0.0%	2
Count of Facility TE Measures	categorical	5,421	0.0%	13
TE Group Footnote	numeric	5,421	82.9%	3	null_rate high_skew outliers

Fig 1.

Hospital overall rating · Look for how large the 'Not Available' bar is relative to rated hospitals, and whether ratings cluster around 3 stars.

Top values for Hospital overall rating (6 unique shown, of 6 total).
value	count	share
Not Available	2552	47.1%
3	937	17.3%
4	765	14.1%
2	649	12.0%
5	289	5.3%
1	229	4.2%

Fig 2.

Hospital Ownership · Voluntary non-profit private hospitals dominate — compare the relative size of proprietary vs. government categories.

Top values for Hospital Ownership (12 unique shown, of 12 total).
value	count	share
Voluntary non-profit - Private	2291	42.3%
Proprietary	1067	19.7%
Government - Hospital District or Authority	521	9.6%
Government - Local	400	7.4%
Voluntary non-profit - Other	361	6.7%
Voluntary non-profit - Church	275	5.1%
Government - State	210	3.9%
Veterans Health Administration	132	2.4%
Physician	74	1.4%
Government - Federal	44	0.8%
Department of Defense	32	0.6%
Tribal	14	0.3%

Fig 3.

Hospital Type · Acute Care and Critical Access hospitals account for the vast majority; check how smaller specialty types like Psychiatric and Children's are represented.

Top values for Hospital Type (8 unique shown, of 8 total).
value	count	share
Acute Care Hospitals	3120	57.6%
Critical Access Hospitals	1375	25.4%
Psychiatric	626	11.5%
Acute Care - Veterans Administration	132	2.4%
Childrens	94	1.7%
Rural Emergency Hospital	38	0.7%
Acute Care - Department of Defense	32	0.6%
Long-term	4	0.1%

Fig 4.

State · TX and CA lead with 462 and 378 hospitals respectively — look for which states have disproportionately high or low facility counts.

Top values for State (20 unique shown, of 56 total).
value	count	share
TX	462	8.5%
CA	378	7.0%
FL	221	4.1%
IL	194	3.6%
OH	194	3.6%
NY	191	3.5%
PA	187	3.4%
LA	160	3.0%
GA	149	2.7%
IN	149	2.7%
MI	147	2.7%
WI	142	2.6%
KS	139	2.6%
MN	136	2.5%
OK	135	2.5%
TN	123	2.3%
MO	121	2.2%
NC	120	2.2%
IA	118	2.2%
AZ	106	2.0%

Fig 5.

Count of MORT Measures Worse · Most hospitals score 0 worse-than-expected mortality measures, but look for the tail of facilities with 1 or more to flag potential outliers.

Top values for Count of MORT Measures Worse (7 unique shown, of 7 total).
value	count	share
0	3266	60.2%
Not Available	1777	32.8%
1	310	5.7%
2	57	1.1%
3	7	0.1%
4	3	0.1%
5	1	0.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
Facility ID	text	0.0%
Facility Name	text	0.0%
Address	text	0.0%
City/Town	text	0.0%
State	categorical	0.0%
ZIP Code	numeric	0.0%
County/Parish	text	0.0%
Telephone Number	text	0.0%
Hospital Type	categorical	0.0%
Hospital Ownership	categorical	0.0%
Emergency Services	categorical	0.0%
Meets criteria for birthing friendly designation	categorical	58.2%
Hospital overall rating	categorical	0.0%
Hospital overall rating footnote	categorical	52.7%
MORT Group Measure Count	categorical	0.0%
Count of Facility MORT Measures	categorical	0.0%
Count of MORT Measures Better	categorical	0.0%
Count of MORT Measures No Different	categorical	0.0%
Count of MORT Measures Worse	categorical	0.0%
MORT Group Footnote	numeric	67.2%
Safety Group Measure Count	categorical	0.0%
Count of Facility Safety Measures	categorical	0.0%
Count of Safety Measures Better	categorical	0.0%
Count of Safety Measures No Different	categorical	0.0%
Count of Safety Measures Worse	categorical	0.0%
Safety Group Footnote	numeric	61.8%
READM Group Measure Count	categorical	0.0%
Count of Facility READM Measures	categorical	0.0%
Count of READM Measures Better	categorical	0.0%
Count of READM Measures No Different	categorical	0.0%
Count of READM Measures Worse	categorical	0.0%
READM Group Footnote	numeric	78.8%
Pt Exp Group Measure Count	categorical	0.0%
Count of Facility Pt Exp Measures	categorical	0.0%
Pt Exp Group Footnote	numeric	58.2%
TE Group Measure Count	categorical	0.0%
Count of Facility TE Measures	categorical	0.0%
TE Group Footnote	numeric	82.9%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 6 numeric columns (values clipped to 2 decimals).
	ZIP Code	MORT Group Footnote	Safety Group Footnote	READM Group Footnote	Pt Exp Group Footnote	TE Group Footnote
ZIP Code	+1.00	-0.01	-0.04	-0.09	-0.02	+0.03
MORT Group Footnote	-0.01	+1.00	+0.13	+0.24	+0.07	+0.09
Safety Group Footnote	-0.04	+0.13	+1.00	+0.10	+0.27	-0.00
READM Group Footnote	-0.09	+0.24	+0.10	+1.00	+0.06	+0.07
Pt Exp Group Footnote	-0.02	+0.07	+0.27	+0.06	+1.00	+0.03
TE Group Footnote	+0.03	+0.09	-0.00	+0.07	+0.03	+1.00

Facility ID text identifier

This column is a fixed-length, zero-padded 6-character alphanumeric facility identifier — likely a government or regulatory facility code (e.g., CMS Facility ID or similar). Every one of the 5,421 rows has a distinct value with zero nulls and zero duplicates, confirming it functions as a primary key. All values are exactly 6 characters long (min, mean, max = 6) and fully uppercase, consistent with a structured code scheme. The top visible values follow a numeric sequence starting with '010001', suggesting geographic or administrative ordering.

Treatment: Retain as a primary key for row identification and use as a join key to link facility-level metadata tables.

anthropic:default · confidence high
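
Before relying on Facility ID as a join key, the uniqueness and fixed-length claims above can be re-checked directly. This is a small sketch, assuming the IDs were loaded as strings (loading them as integers would strip the leading zero seen in '010001'). The helper name is hypothetical, and every ID other than '010001' is made up.

```python
from collections import Counter

def key_violations(values, expected_len=6):
    """Return (duplicated values, values with the wrong length)."""
    counts = Counter(values)
    duplicates = sorted(v for v, c in counts.items() if c > 1)
    bad_length = sorted(v for v in counts if len(v) != expected_len)
    return duplicates, bad_length

# '010001' appears in the profile; the other IDs are invented for illustration.
ids = ["010001", "010005", "010005", "1234"]
duplicates, bad_length = key_violations(ids)
```

On the real column both lists should come back empty if the profile's numbers (5,421 unique values, all of length 6) hold.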

Out[13]:

saturn.columns["Facility ID"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5,421
len_min	6
len_max	6
len_mean	6
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	5,421
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for Facility ID.

Character-length distribution for Facility ID (mean: 6.0).
chars	count
6	5421

Facility Name text label

This column contains healthcare facility names, overwhelmingly hospitals and medical centers: 'hospital' appears 2,740 times, and 'medical', 'health', and 'center' dominate the top-word list across 5,421 rows. Nearly all values are ALL-CAPS (99.3%), suggesting a standardized administrative data entry convention. The near-unique alert is noteworthy: with 5,286 unique values out of 5,421 rows (duplicate_rate 2.5%, 135 duplicates), some names appear more than once. Because Facility ID is unique on every row, these repeats are more likely distinct facilities that share a name than repeated records for one facility, but that should be confirmed by comparing IDs and locations. An average name length of ~29 characters and ~4 words is consistent with formal institutional naming patterns.

Treatment: Use as a human-readable label or grouping key; normalize case before display; investigate 135 duplicate entries for potential record linkage issues before modelling.

anthropic:default · confidence high
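
One way to investigate the repeated names is to list every name that is attached to more than one Facility ID. The sketch below assumes rows are dictionaries keyed by the schema's column names. The function name and all sample rows are invented for illustration.

```python
from collections import defaultdict

def names_shared_by_multiple_ids(rows):
    """Map each Facility Name used by more than one Facility ID to those IDs."""
    ids_by_name = defaultdict(set)
    for row in rows:
        ids_by_name[row["Facility Name"]].add(row["Facility ID"])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}

# Made-up rows for illustration.
rows = [
    {"Facility ID": "000001", "Facility Name": "MEMORIAL HOSPITAL", "State": "TX"},
    {"Facility ID": "000002", "Facility Name": "MEMORIAL HOSPITAL", "State": "OH"},
    {"Facility ID": "000003", "Facility Name": "EXAMPLE MEDICAL CENTER", "State": "CA"},
]
shared = names_shared_by_multiple_ids(rows)
```

If the IDs behind a shared name sit in different states or cities, the repeat is probably a common name and not a record linkage problem.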

Out[16]:

saturn.columns["Facility Name"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5,286
len_min	3
len_max	74
len_mean	29.21
len_median	28
len_p95	45
word_mean	3.995
word_median	4
n_empty	0
n_duplicates	135
duplicate_rate	0.0249
vocab_size	3,942
readability_flesch_mean	6.842
emoji_rate	0
url_rate	0
one_word_rate	0.001845
allcaps_rate	0.9932
boilerplate_rate	0
alert: near_unique	97.5% of rows are unique strings
alert: allcaps	99.3% rows are all-caps

Fig 9.

Character-length distribution for Facility Name.

Character-length distribution for Facility Name (mean: 29.20605054418004).
chars	count
3 – 5	2
5 – 7	1
7 – 8	1
8 – 10	11
10 – 12	12
12 – 14	43
14 – 15	91
15 – 17	178
17 – 19	93
19 – 21	258
21 – 23	424
23 – 24	526
24 – 26	572
26 – 28	258
28 – 30	548
30 – 31	525
31 – 33	460
33 – 35	160
35 – 37	308
37 – 38	232
38 – 40	170
40 – 42	149
42 – 44	47
44 – 46	105
46 – 47	82
47 – 49	79
49 – 51	55
51 – 53	7
53 – 54	4
54 – 56	2
56 – 58	1
58 – 60	3
60 – 62	4
62 – 63	1
63 – 65	2
65 – 67	1
67 – 69	2
69 – 70	0
70 – 72	3
72 – 74	1

Address text metadata

This column contains physical street addresses, confirmed by dominant top words: 'street' (1,036), 'avenue' (580), 'drive' (511), 'road' (507), and directional prefixes like 'north', 'east', 'west', 'south'. Nearly all values are uppercased (99.2% allcaps_rate), consistent with address data sourced from a government registry or standardised mailing system. With 5,387 unique values out of 5,421 rows and only 34 duplicates (0.6% duplicate rate), the column is near-unique. The few duplicates may reflect facilities that share a campus or building, or data entry issues; both are worth investigating.

Treatment: Do not use as a predictive feature directly; parse into structured components (number, street, direction, suffix) or geocode for spatial modelling.

anthropic:default · confidence high

Out[19]:

saturn.columns["Address"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5,387
len_min	7
len_max	50
len_mean	19.37
len_median	19
len_p95	29
word_mean	3.754
word_median	4
n_empty	0
n_duplicates	34
duplicate_rate	0.006272
vocab_size	4,996
readability_flesch_mean	79.27
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0.9921
boilerplate_rate	0
alert: near_unique	99.4% of rows are unique strings
alert: allcaps	99.2% rows are all-caps

Fig 10.

Character-length distribution for Address.

Character-length distribution for Address (mean: 19.371702637889687).
chars	count
7 – 8	3
8 – 9	8
9 – 10	32
10 – 11	66
11 – 12	115
12 – 13	227
13 – 15	288
15 – 16	380
16 – 17	481
17 – 18	502
18 – 19	541
19 – 20	510
20 – 21	413
21 – 22	691
22 – 23	257
23 – 24	218
24 – 25	151
25 – 26	121
26 – 27	73
27 – 28	72
28 – 30	52
30 – 31	47
31 – 32	34
32 – 33	27
33 – 34	15
34 – 35	12
35 – 36	34
36 – 37	12
37 – 38	8
38 – 39	7
39 – 40	6
40 – 41	3
41 – 42	2
42 – 44	3
44 – 45	0
45 – 46	3
46 – 47	2
47 – 48	0
48 – 49	0
49 – 50	5

City/Town text label

This column contains US city and town names, stored in ALL-CAPS format (99.4% allcaps rate), consistent with a standardized postal or administrative data source. With 3,049 unique values across 5,421 rows, the duplicate rate of 43.8% (2,372 duplicates) is expected for a geographic field: major cities like CHICAGO (34) and HOUSTON (31) naturally repeat. The one-word rate of 77.1%, alongside multi-word entries like 'OKLAHOMA CITY' and 'LOS ANGELES', is normal for city names. No nulls are present, indicating good completeness.

Treatment: Standardize casing, then use as a grouping/aggregation key or encode as a high-cardinality categorical feature.

anthropic:default · confidence high

Out[22]:

saturn.columns["City/Town"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	3,049
len_min	3
len_max	24
len_mean	8.611
len_median	8
len_p95	13
word_mean	1.241
word_median	1
n_empty	0
n_duplicates	2,372
duplicate_rate	0.4376
vocab_size	2,890
readability_flesch_mean	18.29
emoji_rate	0
url_rate	0
one_word_rate	0.7709
allcaps_rate	0.9943
boilerplate_rate	0
alert: one_word	77.1% rows are a single word
alert: allcaps	99.4% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	43.8% duplicate strings

Fig 11.

Character-length distribution for City/Town.

Character-length distribution for City/Town (mean: 8.610957387935805).
chars	count
3 – 4	10
4 – 4	122
4 – 5	0
5 – 5	332
5 – 6	0
6 – 6	761
6 – 7	0
7 – 7	895
7 – 8	0
8 – 8	737
8 – 9	0
9 – 9	694
9 – 10	0
10 – 10	705
10 – 11	0
11 – 11	446
11 – 12	0
12 – 12	295
12 – 13	0
13 – 14	191
14 – 14	98
14 – 15	0
15 – 15	50
15 – 16	0
16 – 16	54
16 – 17	0
17 – 17	15
17 – 18	0
18 – 18	6
18 – 19	0
19 – 19	3
19 – 20	0
20 – 20	6
20 – 21	0
21 – 21	0
21 – 22	0
22 – 22	0
22 – 23	0
23 – 23	0
23 – 24	1

State categorical feature

This column contains US state abbreviations, covering 56 distinct values (50 states plus likely DC and US territories). The distribution is notably spread — entropy ratio of 0.917 indicates near-uniform coverage — but TX leads with 462 occurrences (8.5% of 5,421 rows), followed by CA (378) and FL (221), reflecting population-weighted representation. The cardinality of 56 slightly exceeding 50 states warrants a quick audit to confirm no invalid or duplicate codes exist.

Treatment: One-hot encode or use target encoding depending on model type; verify the 6 non-standard codes (beyond 50 states) are valid territories or DC.

anthropic:default · confidence high
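
The audit of the 56 codes can be sketched as a set comparison. The profile does not list all 56 values, so treating the six extra codes as DC plus Puerto Rico, Guam, the U.S. Virgin Islands, American Samoa, and the Northern Mariana Islands is an assumption here. The function name is hypothetical and 'XX' is a deliberately invalid code.

```python
STATES = set("""AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
SD TN TX UT VT VA WA WV WI WY""".split())
# Assumed extras: DC plus the five inhabited territories with USPS abbreviations.
DC_AND_TERRITORIES = {"DC", "PR", "GU", "VI", "AS", "MP"}

def audit_state_codes(codes):
    """Split observed codes into states, DC/territories, and unrecognised."""
    observed = set(codes)
    return {
        "states": observed & STATES,
        "dc_and_territories": observed & DC_AND_TERRITORIES,
        "unrecognised": observed - STATES - DC_AND_TERRITORIES,
    }

result = audit_state_codes(["TX", "CA", "PR", "DC", "XX"])
```

An empty "unrecognised" set on the real column would confirm that the cardinality of 56 is explained by valid codes.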

Out[25]:

saturn.columns["State"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	56
top_value	TX
top_rate	0.08522
cardinality	56
entropy	5.328
entropy_ratio	0.9174

Fig 12.

Top values for State.

Top values for State (20 unique shown, of 56 total).
value	count	share
TX	462	8.5%
CA	378	7.0%
FL	221	4.1%
IL	194	3.6%
OH	194	3.6%
NY	191	3.5%
PA	187	3.4%
LA	160	3.0%
GA	149	2.7%
IN	149	2.7%
MI	147	2.7%
WI	142	2.6%
KS	139	2.6%
MN	136	2.5%
OK	135	2.5%
TN	123	2.3%
MO	121	2.2%
NC	120	2.2%
IA	118	2.2%
AZ	106	2.0%

ZIP Code numeric feature

This column contains US postal (ZIP) codes stored as integers rather than strings, ranging from 603 (Puerto Rico, once the leading zeros are restored) to 99929 (Alaska), with 4,721 distinct values across 5,421 rows. The numeric treatment is problematic: leading zeros are silently dropped (the minimum of 603 almost certainly represents 00603), making string-based joins or lookups unreliable. Summary statistics such as the mean, skew (−0.16), and kurtosis (−0.99) have no real interpretation for an identifier like this; at most, together with the wide IQR of 43,333, they show that facilities are spread across the whole ZIP range and not concentrated in one region.

Treatment: Cast to zero-padded 5-character string before any join, geocoding, or feature engineering to restore lost leading zeros.

anthropic:default · confidence high
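
The zero-padding step can be illustrated with the minimum, median, and maximum values from the stats table. This is a minimal sketch with a hypothetical helper name. Reading the column as text when the file is first loaded would avoid the problem altogether.

```python
def zip_to_string(value):
    """Render an integer-parsed ZIP code as a zero-padded 5-character string."""
    text = str(int(value))
    if len(text) > 5:
        raise ValueError(f"not a 5-digit ZIP code: {value!r}")
    return text.zfill(5)

# min, median, and max from the ZIP Code stats table
padded = [zip_to_string(z) for z in (603, 55066, 99929)]
```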

Out[28]:

saturn.columns["ZIP Code"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	4,721
min	603
max	99,929
mean	5.378e+04
median	55,066
std	2.706e+04
q1	32,771
q3	76,104
iqr	43,333
skew	-0.1646
kurtosis	-0.9879
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 13.

Distribution of ZIP Code. Vertical dash marks the median.

Histogram bins for ZIP Code (median: 55066.0).
bin	count
603 – 3086	161
3086 – 5569	73
5569 – 8052	92
8052 – 10540	57
10540 – 13020	90
13020 – 15500	101
15500 – 17990	81
17990 – 20470	108
20470 – 22950	79
22950 – 25430	81
25430 – 27920	82
27920 – 30400	191
30400 – 32880	171
32880 – 35370	168
35370 – 37850	151
37850 – 40330	180
40330 – 42820	87
42820 – 45300	154
45300 – 47780	174
47780 – 50270	180
50270 – 52750	98
52750 – 55230	160
55230 – 57720	175
57720 – 60200	147
60200 – 62680	140
62680 – 65160	120
65160 – 67650	133
67650 – 70130	145
70130 – 72610	205
72610 – 75100	191
75100 – 77580	236
77580 – 80060	204
80060 – 82550	103
82550 – 85030	132
85030 – 87510	106
87510 – 90000	72
90000 – 92480	140
92480 – 94960	146
94960 – 97450	159
97450 – 99930	148

County/Parish text label

This column contains US county (or parish) names, stored entirely in uppercase (allcaps_rate 1.0), consistent with a standardised geographic reference field. With 1,555 unique values across 5,421 rows and a duplicate_rate of 0.7132, the same county names recur frequently, which is expected given that many records share common county names like LOS ANGELES (88), JEFFERSON (59), and COOK (59). Note that a name such as JEFFERSON is used by counties in several states, so the column identifies a county only in combination with State. The high one_word_rate (0.873) reflects that most US county names are single tokens, while multi-word entries like 'LOS ANGELES' and 'SAN ...' account for the remainder.

Treatment: Encode as a categorical geographic grouping variable, keyed together with State; consider joining to a FIPS code lookup on the state and county pair for spatial analysis.

anthropic:default · confidence high

Out[31]:

saturn.columns["County/Parish"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	1,555
len_min	3
len_max	25
len_mean	7.34
len_median	7
len_p95	11
word_mean	1.135
word_median	1
n_empty	0
n_duplicates	3,866
duplicate_rate	0.7132
vocab_size	1,591
readability_flesch_mean	34.44
emoji_rate	0
url_rate	0
one_word_rate	0.8733
allcaps_rate	1
boilerplate_rate	0
alert: one_word	87.3% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	71.3% duplicate strings

Fig 14.

Character-length distribution for County/Parish.

Character-length distribution for County/Parish (mean: 7.3399741745065485).
chars	count
3 – 4	36
4 – 4	472
4 – 5	0
5 – 5	659
5 – 6	0
6 – 6	1008
6 – 7	0
7 – 7	956
7 – 8	0
8 – 8	834
8 – 9	575
9 – 10	0
10 – 10	430
10 – 11	0
11 – 11	195
11 – 12	0
12 – 12	101
12 – 13	0
13 – 13	40
13 – 14	0
14 – 15	75
15 – 15	18
15 – 16	0
16 – 16	2
16 – 17	0
17 – 17	5
17 – 18	0
18 – 18	3
18 – 19	0
19 – 20	5
20 – 20	4
20 – 21	0
21 – 21	1
21 – 22	0
22 – 22	0
22 – 23	0
23 – 23	1
23 – 24	0
24 – 24	0
24 – 25	1

Telephone Number text feature

This column contains North American telephone numbers, all formatted uniformly at exactly 14 characters (consistent with the '(NXX) NXX-XXXX' pattern) with zero nulls across 5,421 records. The allcaps_rate of 1.0 is an artifact of digit/punctuation-only strings rather than alphabetic content. There are 38 duplicate phone numbers (duplicate_rate ≈ 0.007), meaning distinct records share the same telephone number, which warrants investigation for data quality issues. Top area codes such as (406), (605), and (402) serve Montana, South Dakota, and Nebraska, which hints at many rural facilities, although states covered by a single area code will naturally rank high, so this is weak evidence of a rural skew.

Treatment: Flag 38 duplicate values for deduplication review; treat as a categorical or string identifier — do not embed numerically.

anthropic:default · confidence high
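
The format check and the duplicate review can be sketched with a regular expression and a counter. The pattern below encodes the 14-character '(NXX) NXX-XXXX' layout described above. The helper name is hypothetical and the phone numbers are made up; only the area codes come from the profile.

```python
import re
from collections import Counter

PHONE_RE = re.compile(r"\((\d{3})\) \d{3}-\d{4}")

def area_code(phone):
    """Return the 3-digit area code, or None if the value breaks the pattern."""
    match = PHONE_RE.fullmatch(phone)
    return match.group(1) if match else None

# Invented numbers for illustration.
phones = ["(406) 555-0100", "(605) 555-0101", "(406) 555-0100", "406-555-0102"]
codes = [area_code(p) for p in phones]
repeated = [p for p, count in Counter(phones).items() if count > 1]
```

Keeping the value as a string and deriving the area code as a separate categorical field follows the treatment note: nothing here is converted to a number.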

Out[34]:

saturn.columns["Telephone Number"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5,383
len_min	14
len_max	14
len_mean	14
len_median	14
len_p95	14
word_mean	2
word_median	2
n_empty	0
n_duplicates	38
duplicate_rate	0.00701
vocab_size	5,550
readability_flesch_mean	120.2
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	99.3% of rows are unique strings
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 15.

Character-length distribution for Telephone Number.

Character-length distribution for Telephone Number (mean: 14.0).
chars	count
14	5421

Hospital Type categorical label

This column classifies each hospital record into one of 8 facility-type categories, making it a standard categorical label for hospital segmentation. 'Acute Care Hospitals' dominates at 57.6% of records (3,120 of 5,421), creating notable class imbalance. The extreme tail is striking: 'Long-term' hospitals appear only 4 times, and 'Acute Care - Department of Defense' just 32 times, meaning minority classes may be statistically unreliable for subgroup analysis. There are no nulls and no unexpected values.

Treatment: One-hot encode for modelling; be cautious with rare classes ('Long-term' n=4, 'DoD' n=32) which may need grouping or exclusion.

anthropic:default · confidence high
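
Grouping rare classes before encoding could look like the sketch below, which uses the counts from the Hospital Type table. The helper name, the 'Other' label, and the threshold of 50 are arbitrary illustrative choices, not recommendations from the profile.

```python
def group_rare(counts, min_count):
    """Merge categories with fewer than min_count rows into an 'Other' bucket."""
    grouped = {}
    for label, n in counts.items():
        key = label if n >= min_count else "Other"
        grouped[key] = grouped.get(key, 0) + n
    return grouped

# Counts copied from the Hospital Type top-values table.
hospital_type = {
    "Acute Care Hospitals": 3120,
    "Critical Access Hospitals": 1375,
    "Psychiatric": 626,
    "Acute Care - Veterans Administration": 132,
    "Childrens": 94,
    "Rural Emergency Hospital": 38,
    "Acute Care - Department of Defense": 32,
    "Long-term": 4,
}
grouped = group_rare(hospital_type, min_count=50)  # threshold is an arbitrary choice
```

With this threshold the three smallest types (38, 32, and 4 rows) collapse into a single bucket of 74 rows, while the row total stays at 5,421.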

Out[37]:

saturn.columns["Hospital Type"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	8
top_value	Acute Care Hospitals
top_rate	0.5755
cardinality	8
entropy	1.654
entropy_ratio	0.5513

Fig 16.

Top values for Hospital Type.

Top values for Hospital Type (8 unique shown, of 8 total).
value	count	share
Acute Care Hospitals	3120	57.6%
Critical Access Hospitals	1375	25.4%
Psychiatric	626	11.5%
Acute Care - Veterans Administration	132	2.4%
Childrens	94	1.7%
Rural Emergency Hospital	38	0.7%
Acute Care - Department of Defense	32	0.6%
Long-term	4	0.1%

Hospital Ownership categorical feature

This column captures the legal/operational ownership classification of hospitals, with 12 distinct categories across 5,421 records and no nulls. The dominant category is 'Voluntary non-profit - Private' at 42.3% (2,291 records), creating notable imbalance: the top value alone accounts for more than twice the second-largest category ('Proprietary' at 1,067). The entropy ratio of 0.72 indicates moderate but uneven spread across categories, with the smallest classes, 'Tribal' (14), 'Department of Defense' (32), and 'Government - Federal' (44), being quite sparse.

Treatment: One-hot encode or target-encode for modelling, but consider grouping sparse categories (e.g., 'Tribal' with n=14, 'Department of Defense' with n=32, 'Government - Federal' with n=44) to avoid instability.

anthropic:default · confidence high

Out[40]:

saturn.columns["Hospital Ownership"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	12
top_value	Voluntary non-profit - Private
top_rate	0.4226
cardinality	12
entropy	2.586
entropy_ratio	0.7215

Fig 17.

Top values for Hospital Ownership.

Top values for Hospital Ownership (12 unique shown, of 12 total).
value	count	share
Voluntary non-profit - Private	2291	42.3%
Proprietary	1067	19.7%
Government - Hospital District or Authority	521	9.6%
Government - Local	400	7.4%
Voluntary non-profit - Other	361	6.7%
Voluntary non-profit - Church	275	5.1%
Government - State	210	3.9%
Veterans Health Administration	132	2.4%
Physician	74	1.4%
Government - Federal	44	0.8%
Department of Defense	32	0.6%
Tribal	14	0.3%

Emergency Services categorical feature

This column is a binary flag indicating whether emergency services are present or available for a given record. The dominant value is 'Yes' at 83.1% (4505 of 5421 rows), making the distribution notably imbalanced — 'No' accounts for only 916 records (roughly 1-in-6). No nulls exist, and entropy of 0.655 reflects the skew away from a balanced 50/50 split.

Treatment: Encode as binary (1/0); note class imbalance (83/17 split) and apply appropriate weighting or stratified sampling if used as a target or predictor.

anthropic:default · confidence high
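
The entropy figures quoted for categorical columns can be reproduced from the value counts. The sketch below assumes entropy is Shannon entropy in bits and that entropy_ratio is entropy divided by log2 of the number of categories. The report does not state these formulas, but its numbers are consistent with them. The function names are not Saturn's.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

emergency = [4505, 916]  # Yes / No counts from the Emergency Services table
h = entropy_bits(emergency)
ratio = entropy_ratio(emergency)
```

For a two-category column log2(k) is 1, which is why entropy and entropy_ratio are both reported as 0.6553 for Emergency Services.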

Out[43]:

saturn.columns["Emergency Services"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	Yes
top_rate	0.831
cardinality	2
entropy	0.6553
entropy_ratio	0.6553

Fig 18.

Top values for Emergency Services.

Top values for Emergency Services (2 unique shown, of 2 total).
value	count	share
Yes	4505	83.1%
No	916	16.9%

Meets criteria for birthing friendly designation categorical label

This column flags whether a facility meets criteria for a 'birthing friendly' designation, and every non-null value is 'Y' (top_rate = 1.0, cardinality = 1). That means it carries zero discriminative information among observed records — there are no 'N' values at all. More striking, 58.24% of rows are null, leaving only 2,264 usable observations out of 5,421; the nulls likely represent facilities that were not assessed or do not offer birthing services.

Treatment: Treat non-null as a binary indicator of assessed-and-qualifying facilities; nulls should be coded as a distinct category (e.g. 'not assessed') rather than imputed, as they almost certainly reflect structural missingness rather than random dropout.

anthropic:default · confidence high

Out[46]:

saturn.columns["Meets criteria for birthing friendly designation"].stats

stat	value
n	5,421
nulls	3,157 (58.2%)
unique	1
top_value	Y
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	58.2% null
alert: imbalance	top value is 100.0% of rows

Fig 19.

Top values for Meets criteria for birthing friendly designation.

Top values for Meets criteria for birthing friendly designation (1 unique shown, of 1 total).
value	count	share
Y	2264	41.8%

Hospital overall rating categorical label

This column represents a CMS-style 1–5 star overall hospital rating, stored as a categorical field with 6 distinct values. The most striking finding is that 47.1% of all 5,421 rows carry the value 'Not Available', making missing ratings the single largest 'category'—outnumbering any numeric rating tier. Among hospitals that do have a rating, scores cluster around 3 and 4 (937 and 765 occurrences respectively), with the extremes (1 and 5) being relatively rare.

Treatment: Convert numeric strings to ordinal integers; treat 'Not Available' as a distinct missing indicator (do not encode as a numeric level) and consider imputation or a separate missingness flag before modelling.

anthropic:default · confidence high
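
The recommended split into an ordinal value and a missingness flag can be sketched as follows. The helper name is hypothetical. Returning None for 'Not Available' keeps the sentinel from being mistaken for a sixth rating level.

```python
def parse_rating(raw):
    """Return (stars, is_missing): stars is an int 1-5, or None for the sentinel."""
    if raw == "Not Available":
        return None, True
    stars = int(raw)
    if not 1 <= stars <= 5:
        raise ValueError(f"unexpected rating: {raw!r}")
    return stars, False

parsed = [parse_rating(v) for v in ["3", "Not Available", "5"]]
```

Whether to impute the missing ratings afterwards is a modelling decision; with 47.1% of rows affected, keeping the flag as its own feature is the more cautious option.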

Out[49]:

saturn.columns["Hospital overall rating"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	6
top_value	Not Available
top_rate	0.4708
cardinality	6
entropy	2.133
entropy_ratio	0.8252

Fig 20.

Top values for Hospital overall rating.

Top values for Hospital overall rating (6 unique shown, of 6 total).
value	count	share
Not Available	2552	47.1%
3	937	17.3%
4	765	14.1%
2	649	12.0%
5	289	5.3%
1	229	4.2%

Hospital overall rating footnote categorical metadata

This column contains footnote codes attached to hospital overall ratings, where each numeric code references a specific explanatory note (e.g., data suppression reason, insufficient data, or special methodology). The 52.7% null rate is expected since most hospitals with valid ratings require no footnote. However, among non-null rows, value '16' dominates at 65.4% of present values (1,676 occurrences), suggesting a single suppression or caveat reason is overwhelmingly common; the presence of a compound value '16, 23' indicates some records carry multiple footnote codes, which could complicate downstream parsing.

Treatment: Treat nulls as 'no footnote'; split multi-code entries on comma and one-hot encode each footnote code separately before modelling.

anthropic:default · confidence high
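
Splitting compound cells such as '16, 23' and building one indicator per footnote code could be done as in this sketch. The function name and the 'footnote_' prefix are illustrative, and None stands in for a null cell, read here as 'no footnote'.

```python
def footnote_indicators(raw_values):
    """Turn footnote cells such as '16, 23' into one 0/1 dict per row."""
    split_rows = [
        [code.strip() for code in raw.split(",")] if raw else []
        for raw in raw_values
    ]
    # Collect every code seen, ordered numerically for stable column order.
    all_codes = sorted({code for codes in split_rows for code in codes}, key=int)
    return [
        {f"footnote_{code}": int(code in codes) for code in all_codes}
        for codes in split_rows
    ]

encoded = footnote_indicators(["16", None, "16, 23", "19"])
```

A row with a compound value sets two indicators at once, so the seven raw categories reduce to one column per distinct code.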

Out[52]:

saturn.columns["Hospital overall rating footnote"].stats

stat	value
n	5,421
nulls	2,857 (52.7%)
unique	7
top_value	16
top_rate	0.6537
cardinality	7
entropy	1.158
entropy_ratio	0.4126
alert: null_rate	52.7% null

Fig 21.

Top values for Hospital overall rating footnote.

Top values for Hospital overall rating footnote (7 unique shown, of 7 total).
value	count	share
16	1676	30.9%
19	795	14.7%
5	47	0.9%
22	32	0.6%
17	7	0.1%
23	5	0.1%
16, 23	2	0.0%

MORT Group Measure Count categorical feature

This column records the count of MORT (mortality) group measures evaluated per entity, with only two distinct values across 5,421 rows: '7' (dominant, 84.1%) and 'Not Available' (15.9%). The cardinality of 2 and the presence of 'Not Available' as the only alternative to '7' suggests this is effectively a binary flag — entities either have a full complement of 7 mortality measures or are excluded from measurement entirely. The 'Not Available' string rather than a true null (null_rate is 0.0) means missingness is encoded as a sentinel value and must be handled explicitly.

Treatment: Recode 'Not Available' to null or a binary indicator before modelling; if '7' is constant for valid records, consider dropping or using only the missingness flag.

anthropic:default · confidence high

Out[55]:

saturn.columns["MORT Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	7
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 22.

Top values for MORT Group Measure Count.

Top values for MORT Group Measure Count (2 unique shown, of 2 total).
value	count	share
7	4558	84.1%
Not Available	863	15.9%

Count of Facility MORT Measures categorical feature

This column records how many mortality quality measures are reported for a given facility, taking integer values 1–7 plus a sentinel string 'Not Available'. Despite being numeric in nature, it was parsed as categorical, likely because of that mixed-type sentinel. 'Not Available' is the single most frequent value at 32.8% (1,777 of 5,421 rows), meaning roughly a third of facilities have no mortality measure count on record. The same count of 1,777 'Not Available' rows appears in the Better and Worse mortality columns, which suggests the missingness is structural, tied to whether a facility reports mortality measures at all, and not random. Among reporting facilities, counts range from 393 (4 measures) to 850 (all 7 measures).

Treatment: Separate 'Not Available' into a boolean missingness indicator, then cast the remaining values to integer for use as an ordinal or numeric feature.
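
A minimal sketch of this treatment, assuming the column arrives as a list of raw strings. The function name `split_sentinel` is illustrative only.

```python
from typing import List, Optional, Tuple

SENTINEL = "Not Available"


def split_sentinel(raw_values: List[str]) -> Tuple[List[Optional[int]], List[bool]]:
    """Split a mixed string column into integer counts and a missingness flag."""
    counts: List[Optional[int]] = []
    missing: List[bool] = []
    for value in raw_values:
        is_missing = value.strip() == SENTINEL
        missing.append(is_missing)
        # int() raises on any unexpected string, so new sentinels are not hidden.
        counts.append(None if is_missing else int(value))
    return counts, missing


# Hypothetical sample rows, not taken from the dataset.
counts, missing = split_sentinel(["7", "Not Available", "3"])
```

In pandas, `pd.to_numeric(column, errors="coerce")` gives a similar result, though it would also silently coerce any other non-numeric string to NaN.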

anthropic:default · confidence high

Out[58]:

saturn.columns["Count of Facility MORT Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	8
top_value	Not Available
top_rate	0.3278
cardinality	8
entropy	2.765
entropy_ratio	0.9217

Fig 23.

Top values for Count of Facility MORT Measures.

Top values for Count of Facility MORT Measures (8 unique shown, of 8 total).
value	count	share
Not Available	1777	32.8%
7	850	15.7%
6	587	10.8%
1	495	9.1%
5	455	8.4%
3	444	8.2%
2	420	7.7%
4	393	7.2%

Count of MORT Measures Better categorical feature

This column encodes how many mortality (MORT) measures a hospital performs better than the national benchmark on, stored as a categorical string despite being an ordinal count ranging from 0 to 7. The dominant value is '0' (3,136 rows, 57.8%), meaning most hospitals beat no mortality benchmarks, while 1,777 rows (32.8%) are 'Not Available', leaving only about 9.4% of hospitals (508 rows) with at least one better-than-benchmark mortality measure. Counts fall off sharply above 1: only 25 hospitals score 4 or higher, with '6' appearing once and '7' twice. The result is a highly right-skewed, near-zero distribution with a substantial missingness class masquerading as a category.

Treatment: Split 'Not Available' into a binary missing indicator, then cast remaining values to integer for ordinal or numeric modelling.

anthropic:default · confidence high

Out[61]:

saturn.columns["Count of MORT Measures Better"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	9
top_value	0
top_rate	0.5785
cardinality	9
entropy	1.453
entropy_ratio	0.4583

Fig 24.

Top values for Count of MORT Measures Better.

Top values for Count of MORT Measures Better (9 unique shown, of 9 total).
value	count	share
0	3136	57.8%
Not Available	1777	32.8%
1	297	5.5%
2	133	2.5%
3	53	1.0%
4	15	0.3%
5	7	0.1%
7	2	0.0%
6	1	0.0%

Count of MORT Measures No Different categorical feature

This column represents a count of mortality (MORT) measures where a hospital's performance was rated 'No Different' from the national benchmark, stored as a categorical string rather than a numeric type. The most striking signal is that 32.8% of rows (1,777 of 5,421) carry 'Not Available', effectively a masked null despite the reported null_rate of 0.0. The numeric values range from 0 to 7, with '0' being extremely rare (only 12 occurrences), suggesting nearly all reporting hospitals have at least one mortality measure in this category.

Treatment: Recode 'Not Available' as a true null or a separate indicator flag, then cast remaining values to integer for modelling.

anthropic:default · confidence high

Out[64]:

saturn.columns["Count of MORT Measures No Different"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	9
top_value	Not Available
top_rate	0.3278
cardinality	9
entropy	2.806
entropy_ratio	0.8852

Fig 25.

Top values for Count of MORT Measures No Different.

Top values for Count of MORT Measures No Different (9 unique shown, of 9 total).
value	count	share
Not Available	1777	32.8%
6	672	12.4%
5	541	10.0%
1	513	9.5%
3	509	9.4%
4	503	9.3%
2	472	8.7%
7	422	7.8%
0	12	0.2%

Count of MORT Measures Worse categorical feature

This column counts how many mortality (MORT) quality measures a hospital scored 'worse than expected' on, stored as a categorical string rather than an integer. The dominant value is '0' (3,266 rows, 60.2%), indicating most hospitals have no worse-than-expected mortality flags, but 1,777 rows (32.8%) carry 'Not Available', which is a substantial missingness proxy masked by a zero null_rate. Only 378 hospitals have one or more worse measures, and the counts are right-skewed, topping out at 5.

Treatment: Convert numeric strings to integers, recode 'Not Available' as NaN, then treat as an ordinal count feature; consider a binary flag for any worse measure given extreme skew toward 0.
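
The binary "any worse measure" flag suggested above might look like the sketch below. It keeps 'Not Available' as `None` so that "not assessed" is not confused with "assessed, none worse". The helper name and the use of a value-to-count dictionary are assumptions for illustration.

```python
SENTINEL = "Not Available"


def any_worse_flag(raw_values):
    """True if at least one measure is worse, False if zero, None if not assessed."""
    flags = []
    for value in raw_values:
        if value.strip() == SENTINEL:
            flags.append(None)
        else:
            flags.append(int(value) > 0)
    return flags


# Value counts copied from the top-values table for this column.
worse_counts = {"0": 3266, "Not Available": 1777, "1": 310,
                "2": 57, "3": 7, "4": 3, "5": 1}

# Number of hospitals with at least one worse-than-expected mortality measure.
n_any_worse = sum(n for value, n in worse_counts.items()
                  if value != SENTINEL and int(value) > 0)
sample_flags = any_worse_flag(["0", "2", "Not Available"])
```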

anthropic:default · confidence high

Out[67]:

saturn.columns["Count of MORT Measures Worse"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	7
top_value	0
top_rate	0.6025
cardinality	7
entropy	1.294
entropy_ratio	0.4608

Fig 26.

Top values for Count of MORT Measures Worse.

Top values for Count of MORT Measures Worse (7 unique shown, of 7 total).
value	count	share
0	3266	60.2%
Not Available	1777	32.8%
1	310	5.7%
2	57	1.1%
3	7	0.1%
4	3	0.1%
5	1	0.0%

MORT Group Footnote numeric label

This column contains footnote codes associated with the mortality (MORT) measure group, encoded as numeric values despite functioning as a categorical label. The histogram bins show exactly 4 codes (5, 19, 22, and 23), so this is a small enumeration of footnote categories rather than a true numeric measure. The 67.2% null rate is the dominant signal: footnotes apply to fewer than a third of rows, suggesting they flag exceptional or conditional cases. Codes 5 (950 rows) and 19 (795 rows) account for all but 33 of the 1,778 non-null rows, which explains the strongly negative kurtosis (−1.96) and the IQR of 14 (Q1 = 5, Q3 = 19): the distribution is essentially two spikes, not a spread of values.

Treatment: Cast to categorical/string; treat nulls as 'no footnote' class; one-hot encode if used as a feature.

anthropic:default · confidence high

Out[70]:

saturn.columns["MORT Group Footnote"].stats

stat	value
n	5,421
nulls	3,643 (67.2%)
unique	4
min	5
max	23
mean	11.58
median	5
std	7.057
q1	5
q3	19
iqr	14
skew	0.1488
kurtosis	-1.959
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	67.2% null

Fig 27.

Distribution of MORT Group Footnote. Vertical dash marks the median.

Histogram bins for MORT Group Footnote (median: 5.0).
bin	count
5 – 5.45	950
5.45 – 5.9	0
5.9 – 6.35	0
6.35 – 6.8	0
6.8 – 7.25	0
7.25 – 7.7	0
7.7 – 8.15	0
8.15 – 8.6	0
8.6 – 9.05	0
9.05 – 9.5	0
9.5 – 9.95	0
9.95 – 10.4	0
10.4 – 10.85	0
10.85 – 11.3	0
11.3 – 11.75	0
11.75 – 12.2	0
12.2 – 12.65	0
12.65 – 13.1	0
13.1 – 13.55	0
13.55 – 14	0
14 – 14.45	0
14.45 – 14.9	0
14.9 – 15.35	0
15.35 – 15.8	0
15.8 – 16.25	0
16.25 – 16.7	0
16.7 – 17.15	0
17.15 – 17.6	0
17.6 – 18.05	0
18.05 – 18.5	0
18.5 – 18.95	0
18.95 – 19.4	795
19.4 – 19.85	0
19.85 – 20.3	0
20.3 – 20.75	0
20.75 – 21.2	0
21.2 – 21.65	0
21.65 – 22.1	32
22.1 – 22.55	0
22.55 – 23	1
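
The discrete codes behind this histogram can be recovered from its non-empty bins and checked against the summary stats. The sketch assumes each non-empty bin holds a single integer code (5, 19, 22, and 23); that assumption is supported by the unique count of 4 but is not stated outright in the profile.

```python
from statistics import mean, median

# (code -> row count) read from the four non-empty histogram bins.
code_counts = {5: 950, 19: 795, 22: 32, 23: 1}

# Expand to one value per non-null row.
values = [code for code, n in code_counts.items() for _ in range(n)]

n_non_null = len(values)  # should equal n minus nulls: 5,421 - 3,643
avg = mean(values)        # compare with the reported mean of 11.58
med = median(values)      # compare with the reported median of 5
```

The reconstructed row count, mean, and median agree with the stats table, which is what one would expect if the column holds only these four codes.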

Safety Group Measure Count categorical feature

This column represents a count of safety group measures, but is typed as categorical with only two distinct values: '8' (a fixed numeric string) and 'Not Available'. The dominance of a single value '8' across 84.1% of 5,421 rows (4,558 occurrences) suggests this may be a schema-constrained field or a denormalized count that rarely varies. The presence of 'Not Available' as the only alternative (863 rows, ~15.9%) rather than a true numeric null is surprising and indicates missing data was encoded as a string sentinel rather than left null.

Treatment: Binarize into a flag (count_available vs. not_available); treat 'Not Available' as missing before any numeric use.

anthropic:default · confidence high

Out[73]:

saturn.columns["Safety Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	8
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 28.

Top values for Safety Group Measure Count.

Top values for Safety Group Measure Count (2 unique shown, of 2 total).
value	count	share
8	4558	84.1%
Not Available	863	15.9%

Count of Facility Safety Measures categorical feature

This column is an ordinal count of safety measures reported for a facility, stored as a categorical/string type with numeric values ranging from 1 to 8. The most striking signal is that 38.1% of the 5,421 rows (2,065 records) carry the value 'Not Available', which is the single largest category and likely reflects missing or unreported data rather than a meaningful category. The numeric values are non-uniform and roughly bimodal, with a smaller peak at '2' (519) and a larger one at '7' (733), separated by a trough at '4' (223, the rarest count). The column should be treated as ordinal numeric after separating out 'Not Available' as a distinct missingness indicator.

Treatment: Cast numeric strings to integer, encode 'Not Available' as a separate missingness flag or impute, then use as ordinal feature.

anthropic:default · confidence high

Out[76]:

saturn.columns["Count of Facility Safety Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	9
top_value	Not Available
top_rate	0.3809
cardinality	9
entropy	2.753
entropy_ratio	0.8684

Fig 29.

Top values for Count of Facility Safety Measures.

Top values for Count of Facility Safety Measures (9 unique shown, of 9 total).
value	count	share
Not Available	2065	38.1%
7	733	13.5%
2	519	9.6%
6	460	8.5%
8	453	8.4%
1	443	8.2%
3	290	5.3%
5	235	4.3%
4	223	4.1%

Count of Safety Measures Better categorical feature

This column is a count of safety measures rated 'better' for a hospital, stored as a categorical rather than numeric type. The most frequent value is 'Not Available' at 38.1% of rows (2,065 of 5,421), which masks the underlying numeric distribution; among records with actual counts, the distribution is strongly right-skewed, with 1,548 zeros and only 3 records reaching the maximum value of 6. The high 'Not Available' rate means that coercing this field to numeric turns more than a third of rows into missing values.

Treatment: Separate 'Not Available' into a missingness indicator, then cast remaining values to integer for ordinal or numeric modelling.

anthropic:default · confidence high

Out[79]:

saturn.columns["Count of Safety Measures Better"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	8
top_value	Not Available
top_rate	0.3809
cardinality	8
entropy	2.11
entropy_ratio	0.7033

Fig 30.

Top values for Count of Safety Measures Better.

Top values for Count of Safety Measures Better (8 unique shown, of 8 total).
value	count	share
Not Available	2065	38.1%
0	1548	28.6%
1	1052	19.4%
2	430	7.9%
3	216	4.0%
4	93	1.7%
5	14	0.3%
6	3	0.1%

Count of Safety Measures No Different categorical feature

This column represents a count of safety measures rated as 'no different' (presumably compared to a benchmark), stored as a categorical type despite containing numeric-looking values 0–8. The most frequent value is 'Not Available' at 38.1% of 5,421 rows (2,065 records), a significant missingness proxy masquerading as a category rather than a true null, which is notable since the null_rate is 0.0. The remaining values are spread broadly across 1–6 with a mode at 5 (656 rows), and the extremes are very sparse (8 appears only 10 times, 0 only 20 times).

Treatment: Recode 'Not Available' as NaN, cast remaining values to integer, then treat as ordinal or numeric feature.

anthropic:default · confidence high

Out[82]:

saturn.columns["Count of Safety Measures No Different"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	10
top_value	Not Available
top_rate	0.3809
cardinality	10
entropy	2.685
entropy_ratio	0.8083

Fig 31.

Top values for Count of Safety Measures No Different.

Top values for Count of Safety Measures No Different (10 unique shown, of 10 total).
value	count	share
Not Available	2065	38.1%
5	656	12.1%
2	551	10.2%
4	527	9.7%
1	509	9.4%
6	482	8.9%
3	434	8.0%
7	167	3.1%
0	20	0.4%
8	10	0.2%

Count of Safety Measures Worse categorical feature

This column is a count of safety measures on which a hospital is rated 'worse', stored as a categorical field with only 5 distinct values (0, 1, 2, 3, and 'Not Available'). The dominant value is '0' at 54.3% of rows (2,941 records), indicating most hospitals have no worse-rated safety measures, but 38.1% of rows (2,065) carry 'Not Available' rather than a numeric zero, a meaningful distinction between 'measured and none found' and 'not assessed'. Nonzero counts (1–3) are rare and drop off sharply, with only 6 records reaching a count of 3.

Treatment: Encode 'Not Available' as a separate indicator flag, then cast numeric strings to integer for ordinal or count-based modelling.

anthropic:default · confidence high

Out[85]:

saturn.columns["Count of Safety Measures Worse"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	5
top_value	0
top_rate	0.5425
cardinality	5
entropy	1.338
entropy_ratio	0.5764

Fig 32.

Top values for Count of Safety Measures Worse.

Top values for Count of Safety Measures Worse (5 unique shown, of 5 total).
value	count	share
0	2941	54.3%
Not Available	2065	38.1%
1	365	6.7%
2	44	0.8%
3	6	0.1%

Safety Group Footnote numeric label

This column encodes footnote reference numbers attached to the safety measure group, with only 4 distinct integer values (5, 19, 22, and 23, per the histogram bins) acting as categorical codes rather than true continuous quantities. The 61.8% null rate is a major signal: most records carry no footnote, making non-null values the exception. Codes 5 (1,238 rows) and 19 (795 rows) hold all but 38 of the 2,071 non-null rows, which produces the strongly negative kurtosis (≈ −1.81) and mild positive skew (0.41) of a two-spike distribution, not a numeric measurement. Despite being stored as numeric, this should be treated as a sparse categorical indicator.

Treatment: Cast to nullable categorical; one-hot or ordinal encode the 4 distinct values, and treat nulls as a fifth 'no footnote' category.
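
One way to realise this encoding is sketched below, assuming the codes are 5, 19, 22, and 23 (the values of the non-empty histogram bins) and that nulls arrive as `None` or NaN. The function name and the `footnote_*` column names are hypothetical.

```python
import math


def one_hot_footnote(values, codes=(5, 19, 22, 23)):
    """One-hot encode footnote codes, with nulls as an explicit 'none' category."""
    labels = [str(code) for code in codes] + ["none"]
    rows = []
    for value in values:
        is_null = value is None or (isinstance(value, float) and math.isnan(value))
        key = "none" if is_null else str(int(value))
        if key not in labels:
            # Fail loudly on a code that was not in the profiled data.
            raise ValueError(f"unexpected footnote code: {value!r}")
        rows.append({f"footnote_{label}": int(label == key) for label in labels})
    return rows


# Hypothetical sample rows: floats as stored, None for a null.
encoded = one_hot_footnote([5.0, None, 19.0, 23.0])
```

Raising on unknown codes is a deliberate choice here: footnote legends change between releases, and a silent all-zero row would be easy to miss.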

anthropic:default · confidence high

Out[88]:

saturn.columns["Safety Group Footnote"].stats

stat	value
n	5,421
nulls	3,350 (61.8%)
unique	4
min	5
max	23
mean	10.69
median	5
std	6.95
q1	5
q3	19
iqr	14
skew	0.4116
kurtosis	-1.809
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	61.8% null

Fig 33.

Distribution of Safety Group Footnote. Vertical dash marks the median.

Histogram bins for Safety Group Footnote (median: 5.0).
bin	count
5 – 5.45	1238
5.45 – 5.9	0
5.9 – 6.35	0
6.35 – 6.8	0
6.8 – 7.25	0
7.25 – 7.7	0
7.7 – 8.15	0
8.15 – 8.6	0
8.6 – 9.05	0
9.05 – 9.5	0
9.5 – 9.95	0
9.95 – 10.4	0
10.4 – 10.85	0
10.85 – 11.3	0
11.3 – 11.75	0
11.75 – 12.2	0
12.2 – 12.65	0
12.65 – 13.1	0
13.1 – 13.55	0
13.55 – 14	0
14 – 14.45	0
14.45 – 14.9	0
14.9 – 15.35	0
15.35 – 15.8	0
15.8 – 16.25	0
16.25 – 16.7	0
16.7 – 17.15	0
17.15 – 17.6	0
17.6 – 18.05	0
18.05 – 18.5	0
18.5 – 18.95	0
18.95 – 19.4	795
19.4 – 19.85	0
19.85 – 20.3	0
20.3 – 20.75	0
20.75 – 21.2	0
21.2 – 21.65	0
21.65 – 22.1	32
22.1 – 22.55	0
22.55 – 23	6

READM Group Measure Count categorical feature

This column represents the count of readmission group measures associated with a hospital record, stored as a categorical type despite being a numeric concept. It has only two distinct values across 5,421 rows: '11' (84.1% of records) and 'Not Available' (15.9%), with zero nulls. Two things are worth flagging: '11' is the only numeric value, and missingness is encoded with the sentinel string 'Not Available' rather than a true null. Once the sentinel is recoded, the column is constant, so it carries no discriminative power beyond an availability flag.

Treatment: Convert 'Not Available' to null, cast to integer, then assess whether near-zero variance warrants dropping before modelling.
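
The variance check in this treatment can be sketched by counting distinct non-missing levels after the sentinel is removed. The column is rebuilt here from the two value counts in the profile; the helper name is illustrative.

```python
SENTINEL = "Not Available"


def distinct_non_missing(raw_values):
    """Set of integer levels left once the sentinel string is dropped."""
    return {int(value) for value in raw_values if value.strip() != SENTINEL}


# Rebuild the column from the value counts reported for this feature.
value_counts = {"11": 4558, "Not Available": 863}
column = [value for value, n in value_counts.items() for _ in range(n)]

levels = distinct_non_missing(column)
is_constant = len(levels) <= 1  # True means zero variance among valid rows
```

If `is_constant` is true, only the missingness flag is worth keeping as a feature.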

anthropic:default · confidence high

Out[91]:

saturn.columns["READM Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	11
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 34.

Top values for READM Group Measure Count.

Top values for READM Group Measure Count (2 unique shown, of 2 total).
value	count	share
11	4558	84.1%
Not Available	863	15.9%

Count of Facility READM Measures categorical feature

This column represents the count of readmission measures available for a healthcare facility, stored as a categorical/string type despite containing numeric values. The most striking signal is that 'Not Available' is the most frequent value at 21.2% of all 5,421 rows (1,150 records), meaning roughly one in five facilities has no reportable readmission measure count. The remaining values span the integers 1–11 and are close to uniformly distributed (entropy ratio 0.965), from 322 rows for '1' to 498 rows for '11', so facilities report widely varying numbers of measures.

Treatment: Cast numeric strings to integer, encode 'Not Available' as NaN or a separate missingness indicator flag before modelling.

anthropic:default · confidence high

Out[94]:

saturn.columns["Count of Facility READM Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	12
top_value	Not Available
top_rate	0.2121
cardinality	12
entropy	3.459
entropy_ratio	0.965

Fig 35.

Top values for Count of Facility READM Measures.

Top values for Count of Facility READM Measures (12 unique shown, of 12 total).
value	count	share
Not Available	1150	21.2%
11	498	9.2%
8	466	8.6%
6	438	8.1%
9	425	7.8%
3	375	6.9%
2	374	6.9%
7	358	6.6%
5	347	6.4%
4	335	6.2%
10	333	6.1%
1	322	5.9%

Count of READM Measures Better categorical feature

This column represents a small integer count (0–5) of how many readmission measures a hospital performed 'better' than a national benchmark, stored as a categorical/string field. The dominant value is '0' at 61.5% (3,332 rows), meaning most hospitals beat no readmission benchmarks at all. Notably, 1,150 rows (21.2%) carry 'Not Available' rather than a numeric zero, signalling a data-availability distinction that must be resolved before any numeric analysis.

Treatment: Separate 'Not Available' into a missingness indicator flag, then cast remaining values to integer for ordinal or numeric modelling.

anthropic:default · confidence high

Out[97]:

saturn.columns["Count of READM Measures Better"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	7
top_value	0
top_rate	0.6146
cardinality	7
entropy	1.51
entropy_ratio	0.5379

Fig 36.

Top values for Count of READM Measures Better.

Top values for Count of READM Measures Better (7 unique shown, of 7 total).
value	count	share
0	3332	61.5%
Not Available	1150	21.2%
1	737	13.6%
2	161	3.0%
3	28	0.5%
4	10	0.2%
5	3	0.1%

Count of READM Measures No Different categorical feature

This column represents the count of readmission measures rated 'No Different' from the national rate for a given hospital, with integer values ranging from 0 to 11 (12 numeric levels plus 'Not Available' gives the cardinality of 13). The most striking signal is that 'Not Available' is the single most frequent value at 21.2% (1,150 of 5,421 rows), meaning roughly one in five hospitals has no reportable count; this non-numeric sentinel is stored as a string, making the column categorical despite its ordinal numeric nature. Among reportable hospitals, the distribution across counts 1–9 is remarkably flat (372–497 each), with lighter tails at 10 (249), 11 (81), and 0 (3). The high entropy ratio (0.92) reflects this near-uniform spread across most of the 13 distinct values.

Treatment: Separate 'Not Available' into a missingness indicator flag, then cast remaining values to integer for ordinal/numeric modelling.

anthropic:default · confidence high

Out[100]:

saturn.columns["Count of READM Measures No Different"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	13
top_value	Not Available
top_rate	0.2121
cardinality	13
entropy	3.408
entropy_ratio	0.9211

Fig 37.

Top values for Count of READM Measures No Different.

Top values for Count of READM Measures No Different (13 unique shown, of 13 total).
value	count	share
Not Available	1150	21.2%
7	497	9.2%
8	491	9.1%
6	480	8.9%
2	428	7.9%
3	428	7.9%
5	426	7.9%
9	418	7.7%
4	398	7.3%
1	372	6.9%
10	249	4.6%
11	81	1.5%
0	3	0.1%

Count of READM Measures Worse categorical feature

This column records the number of readmission quality measures on which a hospital performed worse than the national benchmark, stored as a categorical/string field despite being fundamentally ordinal-numeric. The dominant value is '0' (2,988 of 5,421 rows, 55.1%), indicating most facilities had no worse-than-average readmission measures, while 'Not Available' accounts for 1,150 rows (21.2%) — a substantial missingness masked by a zero null_rate. The numeric range spans 0–7, but the distribution is heavily right-skewed with only 3 hospitals scoring 6 or higher.

Treatment: Recode 'Not Available' as NaN, cast remaining values to integer, then treat as an ordinal feature or apply count-based encoding before modelling.

anthropic:default · confidence high

Out[103]:

saturn.columns["Count of READM Measures Worse"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	9
top_value	0
top_rate	0.5512
cardinality	9
entropy	1.758
entropy_ratio	0.5545

Fig 38.

Top values for Count of READM Measures Worse.

Top values for Count of READM Measures Worse (9 unique shown, of 9 total).
value	count	share
0	2988	55.1%
Not Available	1150	21.2%
1	839	15.5%
2	308	5.7%
3	105	1.9%
4	26	0.5%
6	2	0.0%
5	2	0.0%
7	1	0.0%

READM Group Footnote numeric metadata

This column is a coded footnote indicator for a readmission (READM) metric group, taking only 3 distinct numeric values (5, 19, 22) despite 5,421 rows. The 78.79% null rate is the dominant signal — nulls almost certainly mean 'no footnote applies', making non-null values the exception rather than the rule. The three values are likely lookup codes referencing a footnote legend (e.g., CMS suppression or reliability flags), not a continuous numeric measure, despite being stored as numeric.

Treatment: Treat as categorical; map the 3 codes (5, 19, 22) to their footnote definitions and one-hot encode or use as a suppression/reliability flag.

anthropic:default · confidence high

Out[106]:

saturn.columns["READM Group Footnote"].stats

stat	value
n	5,421
nulls	4,271 (78.8%)
unique	3
min	5
max	22
mean	15.15
median	19
std	6.366
q1	5
q3	19
iqr	14
skew	-0.9528
kurtosis	-1.051
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	78.8% null

Fig 39.

Distribution of READM Group Footnote. Vertical dash marks the median.

Histogram bins for READM Group Footnote (median: 19.0).
bin	count
5 – 5.515	323
5.515 – 6.03	0
6.03 – 6.545	0
6.545 – 7.061	0
7.061 – 7.576	0
7.576 – 8.091	0
8.091 – 8.606	0
8.606 – 9.121	0
9.121 – 9.636	0
9.636 – 10.15	0
10.15 – 10.67	0
10.67 – 11.18	0
11.18 – 11.7	0
11.7 – 12.21	0
12.21 – 12.73	0
12.73 – 13.24	0
13.24 – 13.76	0
13.76 – 14.27	0
14.27 – 14.79	0
14.79 – 15.3	0
15.3 – 15.82	0
15.82 – 16.33	0
16.33 – 16.85	0
16.85 – 17.36	0
17.36 – 17.88	0
17.88 – 18.39	0
18.39 – 18.91	0
18.91 – 19.42	795
19.42 – 19.94	0
19.94 – 20.45	0
20.45 – 20.97	0
20.97 – 21.48	0
21.48 – 22	32

Pt Exp Group Measure Count categorical feature

This column records the count of patient experience group measures, but is stored as a categorical with only two distinct values: '8' (the actual measure count) and 'Not Available'. The dominant value '8' appears in 84.1% of rows (4,558 of 5,421), while 'Not Available' accounts for the remaining 15.9% (863 rows) — suggesting a structured survey instrument with a fixed number of measures for eligible facilities, and a meaningful missingness pattern for ineligible or non-reporting ones.

Treatment: Treat 'Not Available' as a distinct eligibility/reporting flag; convert '8' to integer if numeric operations are needed, or encode as binary eligible/not-eligible indicator.

anthropic:default · confidence high

Out[109]:

saturn.columns["Pt Exp Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	8
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 40.

Top values for Pt Exp Group Measure Count.

Top values for Pt Exp Group Measure Count (2 unique shown, of 2 total).
value	count	share
8	4558	84.1%
Not Available	863	15.9%

Count of Facility Pt Exp Measures categorical feature

This column represents a count of facility patient experience measures, but despite its numeric-sounding name it is stored as a categorical with only 2 distinct values: '8' (58.2% of rows, n=3,154) and 'Not Available' (41.8%, n=2,267). Every facility that reports a count reports '8', so the count is fixed across reporting facilities and the column is effectively a binary indicator of data availability rather than a meaningful numeric measure. No nulls exist; missingness is encoded as the string 'Not Available'.

Treatment: Re-encode as binary flag (1 = '8', 0 = 'Not Available') since the numeric value is invariant; drop or use as availability indicator.

anthropic:default · confidence high

Out[112]:

saturn.columns["Count of Facility Pt Exp Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	8
top_rate	0.5818
cardinality	2
entropy	0.9806
entropy_ratio	0.9806

Fig 41.

Top values for Count of Facility Pt Exp Measures.

Top values for Count of Facility Pt Exp Measures (2 unique shown, of 2 total).
value	count	share
8	3154	58.2%
Not Available	2267	41.8%

Pt Exp Group Footnote numeric label

This column stores numeric footnote codes attached to the patient experience measure group, with only 3 distinct values (5, 19, and 22, confirmed by the histogram bins) acting as categorical tags rather than true measurements. The 58.18% null rate is flagged as an alert: footnotes are absent for the majority of records, consistent with footnotes being exception annotations rather than universal attributes. Code 5 accounts for 1,440 of the 2,267 non-null rows, which is why the minimum, Q1, and median are all 5; code 19 (795 rows) sets Q3, giving the IQR of 14, and the two-spike shape produces the negative kurtosis (−1.66).

Treatment: Treat as a low-cardinality categorical; one-hot encode the 3 footnote codes and model nulls as a fourth explicit category ('no footnote').

anthropic:default · confidence medium

Out[115]:

saturn.columns["Pt Exp Group Footnote"].stats

stat	value
n	5,421
nulls	3,154 (58.2%)
unique	3
min	5
max	22
mean	10.15
median	5
std	6.806
q1	5
q3	19
iqr	14
skew	0.571
kurtosis	-1.658
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	58.2% null

Fig 42.

Distribution of Pt Exp Group Footnote. Vertical dash marks the median.

Histogram bins for Pt Exp Group Footnote (median: 5.0).
bin	count
5 – 5.425	1440
5.425 – 5.85	0
5.85 – 6.275	0
6.275 – 6.7	0
6.7 – 7.125	0
7.125 – 7.55	0
7.55 – 7.975	0
7.975 – 8.4	0
8.4 – 8.825	0
8.825 – 9.25	0
9.25 – 9.675	0
9.675 – 10.1	0
10.1 – 10.52	0
10.52 – 10.95	0
10.95 – 11.38	0
11.38 – 11.8	0
11.8 – 12.22	0
12.22 – 12.65	0
12.65 – 13.07	0
13.07 – 13.5	0
13.5 – 13.92	0
13.92 – 14.35	0
14.35 – 14.78	0
14.78 – 15.2	0
15.2 – 15.62	0
15.62 – 16.05	0
16.05 – 16.48	0
16.48 – 16.9	0
16.9 – 17.32	0
17.32 – 17.75	0
17.75 – 18.17	0
18.17 – 18.6	0
18.6 – 19.02	795
19.02 – 19.45	0
19.45 – 19.88	0
19.88 – 20.3	0
20.3 – 20.73	0
20.73 – 21.15	0
21.15 – 21.57	0
21.57 – 22	32

TE Group Measure Count categorical feature

This column represents a count of TE (timely and effective care) group measures, stored as a categorical rather than numeric type because of the 'Not Available' sentinel. It has only two distinct values across 5,421 rows: '12' (84.1% of rows) and 'Not Available' (15.9%), so it is effectively a binary availability flag masquerading as a count. With '12' as the only numeric value, the measure count is fixed across records rather than truly variable.

Treatment: Convert '12' to numeric and 'Not Available' to null, then treat as a binary missingness indicator or drop if constant after imputation.

anthropic:default · confidence medium

Out[118]:

saturn.columns["TE Group Measure Count"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	2
top_value	12
top_rate	0.8408
cardinality	2
entropy	0.6324
entropy_ratio	0.6324

Fig 43.

Top values for TE Group Measure Count.

Top values for TE Group Measure Count (2 unique shown, of 2 total).
value	count	share
12	4558	84.1%
Not Available	863	15.9%

Count of Facility TE Measures categorical feature

This column represents the number of timely and effective care (TE) measures recorded per facility, stored as a categorical/string type despite being fundamentally numeric with integer values ranging from 1 to 12. 'Not Available' is the single most frequent value at 17.1% (928 of 5,421 rows), a data-availability gap that is encoded as a string sentinel rather than a null (zero nulls are reported). The entropy ratio of 0.93 across 13 categories indicates a fairly even spread, but the numeric values lean toward the high end: counts of 9, 10, and 11 together cover 2,026 rows, while '1' and '2' cover only 216.

Treatment: Convert numeric string values to integer, recode 'Not Available' as NaN, then decide whether to impute or flag as a separate missingness indicator before modelling.

anthropic:default · confidence high

Out[121]:

saturn.columns["Count of Facility TE Measures"].stats

stat	value
n	5,421
nulls	0 (0.0%)
unique	13
top_value	Not Available
top_rate	0.1712
cardinality	13
entropy	3.458
entropy_ratio	0.9343

Fig 44.

Top values for Count of Facility TE Measures.

Top values for Count of Facility TE Measures (13 unique shown, of 13 total).
value	count	share
Not Available	928	17.1%
10	759	14.0%
11	724	13.4%
9	543	10.0%
8	391	7.2%
5	351	6.5%
12	347	6.4%
6	337	6.2%
7	284	5.2%
4	272	5.0%
3	269	5.0%
2	163	3.0%
1	53	1.0%

TE Group Footnote numeric metadata

This column encodes footnote reference numbers attached to the TE (timely and effective care) measure group, taking only 3 distinct numeric values (5, 19, 22) across the entire dataset. 82.88% of rows are null, meaning footnotes apply to a small minority of records. Code 19 holds 795 of the 928 non-null rows, so Q1, the median, and Q3 all equal 19 and the IQR is 0. With a zero IQR, the 1.5 × IQR rule flags every value other than 19: the 101 rows with code 5 and the 32 rows with code 22 make up the 133 reported outliers (14.3% of non-null rows) and drive the strong negative skew (−2.43). These are category labels, not anomalous measurements.

Treatment: Treat as a sparse categorical flag; convert non-null values to a one-hot or ordinal indicator and consider a separate binary 'has_footnote' feature given the 82.88% null rate.
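
The outlier alert on this column can be reproduced from the histogram counts, which shows why it should not be read as a data-quality problem. The sketch assumes the three non-empty bins correspond to the codes 5, 19, and 22 and uses the conventional 1.5 × IQR fences; Saturn's exact quartile method is not documented here, but any common method gives Q1 = Q3 = 19 for these counts.

```python
from statistics import quantiles

# (code -> row count) read from the non-empty histogram bins.
code_counts = {5: 101, 19: 795, 22: 32}
values = sorted(code for code, n in code_counts.items() for _ in range(n))

q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # zero, because code 19 covers both quartiles

# With iqr == 0 both fences collapse onto 19.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]
outlier_rate = len(outliers) / len(values)
```

Every row carrying code 5 or 22 lands outside the collapsed fences, so the alert simply restates that two minority codes exist.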

anthropic:default · confidence high

Out[124]:

saturn.columns["TE Group Footnote"].stats

stat	value
n	5,421
nulls	4,493 (82.9%)
unique	3
min	5
max	22
mean	17.58
median	19
std	4.432
q1	19
q3	19
iqr	0
skew	-2.43
kurtosis	4.12
n_outliers	133
outlier_rate	0.1433
zero_rate	0
alert: null_rate	82.9% null
alert: high_skew	skew=-2.43
alert: outliers	14.3% rows beyond 1.5 IQR

Fig 45.

Distribution of TE Group Footnote. Vertical dash marks the median.

Histogram bins for TE Group Footnote (median: 19.0).
bin	count
5 – 5.567	101
5.567 – 6.133	0
6.133 – 6.7	0
6.7 – 7.267	0
7.267 – 7.833	0
7.833 – 8.4	0
8.4 – 8.967	0
8.967 – 9.533	0
9.533 – 10.1	0
10.1 – 10.67	0
10.67 – 11.23	0
11.23 – 11.8	0
11.8 – 12.37	0
12.37 – 12.93	0
12.93 – 13.5	0
13.5 – 14.07	0
14.07 – 14.63	0
14.63 – 15.2	0
15.2 – 15.77	0
15.77 – 16.33	0
16.33 – 16.9	0
16.9 – 17.47	0
17.47 – 18.03	0
18.03 – 18.6	0
18.6 – 19.17	795
19.17 – 19.73	0
19.73 – 20.3	0
20.3 – 20.87	0
20.87 – 21.43	0
21.43 – 22	32


How to cite