veterans-merged_county_analysis

Overview

Source: /home/coolhand/html/datavis/data_trove/data/policy/veterans/merged_county_analysis.csv

Saturn profiled 3,144 rows across 18 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/policy/veterans/merged_county_analysis.csv",
    "--findings", "veterans-merged_county_analysis.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset contains 3,144 rows, one per U.S. county or county-equivalent, combining Census geographic identifiers (GEOID, STATE_NAME, NAMELSAD, ALAND, AWATER) with veteran and active-duty military estimates and rate-normalized fields. The raw count columns (total_pop, active_duty_est, veterans_est, ALAND) are extremely right-skewed, with skew between 8.0 and 26.8 and 362 to 449 outliers each, so any analysis on them should use logs or per-capita versions. The rate columns tell a cleaner story: active_duty_per_10k is roughly symmetric (skew -0.38, mean ~4,694 per 10k) while veterans_per_100 is mildly right-skewed (mean 6.19, max 18.09) and is the better candidate for ranking counties. State coverage is uneven: Texas alone supplies 254 counties (8.1%), followed by Georgia (159) and Virginia (133), so unweighted county averages over-represent states with many counties. Note also that LSAD is heavily imbalanced (95.4% code '06') and that GEOID and fips duplicate each other (identical summary statistics, correlation +1.00).

citing: row_count · column_count · ALAND · total_pop · active_duty_est · veterans_est · active_duty_per_10k · veterans_per_100 · STATE_NAME · LSAD · NAMELSAD
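
The summary recommends per-capita versions of the raw counts. A minimal sketch of how such rates could be derived, assuming (the report does not state this) that veterans_per_100 is 100 × veterans_est / total_pop and active_duty_per_10k is 10,000 × active_duty_est / total_pop. The three rows are made-up illustration values, not rows from the CSV.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the CSV.
toy = pd.DataFrame({
    "total_pop": [50_000, 200_000, 1_000],
    "veterans_est": [3_000, 10_000, 80],
    "active_duty_est": [20_000, 100_000, 450],
})

# Assumed definitions of the two rate columns (not confirmed by the report).
toy["veterans_per_100"] = 100 * toy["veterans_est"] / toy["total_pop"]
toy["active_duty_per_10k"] = 10_000 * toy["active_duty_est"] / toy["total_pop"]

print(toy[["veterans_per_100", "active_duty_per_10k"]])
```

Recomputing the rates this way and comparing them with the stored columns is a cheap check that the assumed definitions actually match the file.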

Out[4]:

saturn.schema() · 18 columns

column	kind	n	null%	unique	alerts
STATEFP	numeric	3,144	0.0%	51
COUNTYFP	numeric	3,144	0.0%	329	high_skew outliers
COUNTYNS	numeric	3,144	0.0%	3,144
GEOIDFQ	text	3,144	0.0%	3,144	near_unique one_word allcaps short_text
GEOID	numeric	3,144	0.0%	3,144
NAME	text	3,144	0.0%	1,838	one_word short_text duplicates
NAMELSAD	text	3,144	0.0%	1,882	short_text duplicates
STUSPS	categorical	3,144	0.0%	51
STATE_NAME	categorical	3,144	0.0%	51
LSAD	categorical	3,144	0.0%	9	imbalance
ALAND	numeric	3,144	0.0%	3,144	high_skew outliers
AWATER	numeric	3,144	0.0%	3,144	high_skew outliers
fips	numeric	3,144	0.0%	3,144
active_duty_est	numeric	3,144	0.0%	3,028	high_skew outliers
veterans_est	numeric	3,144	0.0%	2,424	high_skew outliers
total_pop	numeric	3,144	0.0%	3,080	high_skew outliers
active_duty_per_10k	numeric	3,144	0.0%	3,144
veterans_per_100	numeric	3,144	0.0%	3,143

Fig 1.

STATE_NAME · Counties per state. Texas, Georgia and Virginia have the most rows (546 of 3,144, or 17.4%, combined).

Top values for STATE_NAME (20 unique shown, of 51 total).
value	count	share
Texas	254	8.1%
Georgia	159	5.1%
Virginia	133	4.2%
Kentucky	120	3.8%
Missouri	115	3.7%
Kansas	105	3.3%
Illinois	102	3.2%
North Carolina	100	3.2%
Iowa	99	3.1%
Tennessee	95	3.0%
Nebraska	93	3.0%
Indiana	92	2.9%
Ohio	88	2.8%
Minnesota	87	2.8%
Michigan	83	2.6%
Mississippi	82	2.6%
Oklahoma	77	2.4%
Arkansas	75	2.4%
Wisconsin	72	2.3%
Alabama	67	2.1%

Fig 2.

veterans_per_100 · Distribution of veterans as a share of population; the distribution peaks between 5 and 7 per 100 (median 5.98) with a long right tail reaching 18.09.

Histogram bins for veterans_per_100 (median: 5.984985213609011).
bin	count
0 – 0.4522	1
0.4522 – 0.9043	0
0.9043 – 1.357	6
1.357 – 1.809	13
1.809 – 2.261	21
2.261 – 2.713	34
2.713 – 3.165	60
3.165 – 3.617	84
3.617 – 4.07	137
4.07 – 4.522	187
4.522 – 4.974	271
4.974 – 5.426	319
5.426 – 5.878	346
5.878 – 6.33	358
6.33 – 6.783	313
6.783 – 7.235	259
7.235 – 7.687	173
7.687 – 8.139	128
8.139 – 8.591	94
8.591 – 9.043	88
9.043 – 9.496	60
9.496 – 9.948	38
9.948 – 10.4	39
10.4 – 10.85	26
10.85 – 11.3	16
11.3 – 11.76	21
11.76 – 12.21	14
12.21 – 12.66	15
12.66 – 13.11	4
13.11 – 13.57	6
13.57 – 14.02	6
14.02 – 14.47	1
14.47 – 14.92	1
14.92 – 15.37	3
15.37 – 15.83	0
15.83 – 16.28	0
16.28 – 16.73	1
16.73 – 17.18	0
17.18 – 17.63	0
17.63 – 18.09	1

Fig 3.

active_duty_per_10k · Active-duty rate per 10k is roughly symmetric around ~4,700, useful as a normalized comparison metric.

Histogram bins for active_duty_per_10k (median: 4733.279942644007).
bin	count
1708 – 1845	1
1845 – 1983	0
1983 – 2120	1
2120 – 2257	0
2257 – 2395	1
2395 – 2532	1
2532 – 2669	1
2669 – 2807	6
2807 – 2944	7
2944 – 3081	9
3081 – 3218	21
3218 – 3356	20
3356 – 3493	29
3493 – 3630	57
3630 – 3768	52
3768 – 3905	92
3905 – 4042	111
4042 – 4179	165
4179 – 4317	193
4317 – 4454	224
4454 – 4591	267
4591 – 4729	303
4729 – 4866	280
4866 – 5003	301
5003 – 5141	277
5141 – 5278	268
5278 – 5415	177
5415 – 5552	136
5552 – 5690	61
5690 – 5827	37
5827 – 5964	18
5964 – 6102	8
6102 – 6239	5
6239 – 6376	5
6376 – 6514	3
6514 – 6651	2
6651 – 6788	2
6788 – 6925	0
6925 – 7063	0
7063 – 7200	3

Fig 4.

total_pop · County population is extremely right-skewed (skew ~13); plot on a log scale to see structure.

Histogram bins for total_pop (median: 25784.5).
bin	count
50 – 2.485e+05	2863
2.485e+05 – 4.969e+05	137
4.969e+05 – 7.453e+05	57
7.453e+05 – 9.937e+05	37
9.937e+05 – 1.242e+06	14
1.242e+06 – 1.491e+06	10
1.491e+06 – 1.739e+06	7
1.739e+06 – 1.987e+06	3
1.987e+06 – 2.236e+06	3
2.236e+06 – 2.484e+06	4
2.484e+06 – 2.733e+06	3
2.733e+06 – 2.981e+06	0
2.981e+06 – 3.229e+06	1
3.229e+06 – 3.478e+06	1
3.478e+06 – 3.726e+06	0
3.726e+06 – 3.975e+06	0
3.975e+06 – 4.223e+06	0
4.223e+06 – 4.472e+06	1
4.472e+06 – 4.72e+06	0
4.72e+06 – 4.968e+06	1
4.968e+06 – 5.217e+06	0
5.217e+06 – 5.465e+06	1
5.465e+06 – 5.714e+06	0
5.714e+06 – 5.962e+06	0
5.962e+06 – 6.21e+06	0
6.21e+06 – 6.459e+06	0
6.459e+06 – 6.707e+06	0
6.707e+06 – 6.956e+06	0
6.956e+06 – 7.204e+06	0
7.204e+06 – 7.453e+06	0
7.453e+06 – 7.701e+06	0
7.701e+06 – 7.949e+06	0
7.949e+06 – 8.198e+06	0
8.198e+06 – 8.446e+06	0
8.446e+06 – 8.695e+06	0
8.695e+06 – 8.943e+06	0
8.943e+06 – 9.191e+06	0
9.191e+06 – 9.44e+06	0
9.44e+06 – 9.688e+06	0
9.688e+06 – 9.937e+06	1

Fig 5.

LSAD · Legal/statistical area descriptor is dominated by code '06' (~95%), so this column adds little signal.

Top values for LSAD (9 unique shown, of 9 total).
value	count	share
06	2999	95.4%
15	64	2.0%
25	40	1.3%
04	13	0.4%
05	11	0.3%
PL	9	0.3%
03	4	0.1%
00	2	0.1%
12	2	0.1%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
STATEFP	numeric	0.0%
COUNTYFP	numeric	0.0%
COUNTYNS	numeric	0.0%
GEOIDFQ	text	0.0%
GEOID	numeric	0.0%
NAME	text	0.0%
NAMELSAD	text	0.0%
STUSPS	categorical	0.0%
STATE_NAME	categorical	0.0%
LSAD	categorical	0.0%
ALAND	numeric	0.0%
AWATER	numeric	0.0%
fips	numeric	0.0%
active_duty_est	numeric	0.0%
veterans_est	numeric	0.0%
total_pop	numeric	0.0%
active_duty_per_10k	numeric	0.0%
veterans_per_100	numeric	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
	STATEFP	COUNTYFP	COUNTYNS	GEOID	ALAND	AWATER	fips	active_duty_est	veterans_est	total_pop	active_duty_per_10k	veterans_per_100
STATEFP	+1.00	+0.21	+0.76	+1.00	-0.11	-0.21	+1.00	-0.05	-0.04	-0.05	+0.06	+0.14
COUNTYFP	+0.21	+1.00	+0.17	+0.22	-0.11	-0.03	+0.22	-0.02	+0.01	-0.02	+0.05	+0.09
COUNTYNS	+0.76	+0.17	+1.00	+0.76	+0.10	+0.09	+0.76	-0.06	-0.05	-0.06	+0.09	+0.19
GEOID	+1.00	+0.22	+0.76	+1.00	-0.11	-0.21	+1.00	-0.05	-0.04	-0.05	+0.06	+0.14
ALAND	-0.11	-0.11	+0.10	-0.11	+1.00	+0.43	-0.11	+0.07	+0.10	+0.07	-0.05	+0.03
AWATER	-0.21	-0.03	+0.09	-0.21	+0.43	+1.00	-0.21	+0.01	+0.02	+0.01	+0.20	-0.01
fips	+1.00	+0.22	+0.76	+1.00	-0.11	-0.21	+1.00	-0.05	-0.04	-0.05	+0.06	+0.14
active_duty_est	-0.05	-0.02	-0.06	-0.05	+0.07	+0.01	-0.05	+1.00	+0.94	+1.00	+0.27	-0.11
veterans_est	-0.04	+0.01	-0.05	-0.04	+0.10	+0.02	-0.04	+0.94	+1.00	+0.95	+0.24	+0.02
total_pop	-0.05	-0.02	-0.06	-0.05	+0.07	+0.01	-0.05	+1.00	+0.95	+1.00	+0.26	-0.11
active_duty_per_10k	+0.06	+0.05	+0.09	+0.06	-0.05	+0.20	+0.06	+0.27	+0.24	+0.26	+1.00	-0.04
veterans_per_100	+0.14	+0.09	+0.19	+0.14	+0.03	-0.01	+0.14	-0.11	+0.02	-0.11	-0.04	+1.00
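
The matrix above contains pairs that move in lockstep. A small sketch that lists them, using coefficients copied from the table for five of the twelve columns; the 0.99 cut-off is a hypothetical choice, not a Saturn setting.

```python
from itertools import combinations

# Coefficients copied from the correlation table above (2-decimal values).
cols = ["GEOID", "fips", "active_duty_est", "veterans_est", "total_pop"]
r = {
    ("GEOID", "fips"): 1.00,
    ("GEOID", "active_duty_est"): -0.05,
    ("GEOID", "veterans_est"): -0.04,
    ("GEOID", "total_pop"): -0.05,
    ("fips", "active_duty_est"): -0.05,
    ("fips", "veterans_est"): -0.04,
    ("fips", "total_pop"): -0.05,
    ("active_duty_est", "veterans_est"): 0.94,
    ("active_duty_est", "total_pop"): 1.00,
    ("veterans_est", "total_pop"): 0.95,
}

THRESHOLD = 0.99  # hypothetical cut-off for "effectively redundant"
redundant = [pair for pair in combinations(cols, 2) if abs(r[pair]) >= THRESHOLD]
print(redundant)  # [('GEOID', 'fips'), ('active_duty_est', 'total_pop')]
```

Pairs at this level carry almost no independent information, so a model would normally keep only one column from each pair.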

STATEFP numeric foreign_key

This is the US Census STATEFP code, the two-digit state FIPS identifier, stored here as a number so leading zeros are dropped. Values range from 1 to 56 with 51 unique entries across 3,144 rows, matching the 50 US states plus DC, and the row count aligns with the number of US counties. The near-uniform spread (skew -0.08, kurtosis -1.10) and zero outliers are consistent with a categorical state code rather than a measured quantity.

Treatment: Cast to zero-padded string and treat as a categorical state key for joins, not a numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["STATEFP"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	51
min	1
max	56
mean	30.26
median	29
std	15.15
q1	18
q3	45
iqr	27
skew	-0.08128
kurtosis	-1.099
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 8.

Distribution of STATEFP. Vertical dash marks the median.

Histogram bins for STATEFP (median: 29.0).
bin	count
1 – 2.375	97
2.375 – 3.75	0
3.75 – 5.125	90
5.125 – 6.5	58
6.5 – 7.875	0
7.875 – 9.25	73
9.25 – 10.62	3
10.62 – 12	1
12 – 13.38	226
13.38 – 14.75	0
14.75 – 16.12	49
16.12 – 17.5	102
17.5 – 18.88	92
18.88 – 20.25	204
20.25 – 21.62	120
21.62 – 23	64
23 – 24.38	40
24.38 – 25.75	14
25.75 – 27.12	170
27.12 – 28.5	82
28.5 – 29.88	115
29.88 – 31.25	149
31.25 – 32.62	17
32.62 – 34	10
34 – 35.38	54
35.38 – 36.75	62
36.75 – 38.12	153
38.12 – 39.5	88
39.5 – 40.88	77
40.88 – 42.25	103
42.25 – 43.62	0
43.62 – 45	5
45 – 46.38	112
46.38 – 47.75	95
47.75 – 49.12	283
49.12 – 50.5	14
50.5 – 51.88	133
51.88 – 53.25	39
53.25 – 54.62	55
54.62 – 56	95

COUNTYFP numeric identifier

COUNTYFP is the 3-digit FIPS county code, stored numerically across 3,144 rows with no nulls and 329 unique values. The distribution is heavily right-skewed (skew 2.84, kurtosis 11.4) with a max of 840 well beyond the Q3 of 133.5, which flags 176 rows (5.6%) as outliers. That is expected: FIPS codes are categorical identifiers, not measurements, and the highest codes (510 and above) are conventionally assigned to independent cities, most of them in Virginia.

Treatment: Cast to zero-padded string and combine with STATEFP to form a 5-digit GEOID join key; do not treat as numeric.

anthropic:claude-opus-4-7 · confidence high
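
The treatment above can be sketched in pandas as follows. The three code pairs are illustrative inputs and `geoid_str` is a hypothetical column name; the padding widths (2 for state, 3 for county) follow the FIPS layout described in this report.

```python
import pandas as pd

# Illustrative numeric codes as they would look after a plain CSV read.
toy = pd.DataFrame({"STATEFP": [1, 48, 51], "COUNTYFP": [1, 201, 840]})

# Zero-pad to the fixed FIPS widths: 2 digits for state, 3 for county.
state = toy["STATEFP"].astype(str).str.zfill(2)
county = toy["COUNTYFP"].astype(str).str.zfill(3)
toy["geoid_str"] = state + county

print(toy["geoid_str"].tolist())  # ['01001', '48201', '51840']
```

Reading the code columns as strings in the first place (for example with the `dtype` argument of `pd.read_csv`) would avoid the padding step, provided the file itself stores the leading zeros.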

Out[16]:

saturn.columns["COUNTYFP"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	329
min	1
max	840
mean	103.9
median	79
std	107.6
q1	35
q3	133.5
iqr	98.5
skew	2.841
kurtosis	11.38
n_outliers	176
outlier_rate	0.05598
zero_rate	0
alert: high_skew	skew=+2.84
alert: outliers	5.6% rows beyond 1.5 IQR
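
The outlier alert refers to rows "beyond 1.5 IQR". Assuming this means the usual Tukey fences (Saturn's exact rule is not shown in this report), the cut-offs follow directly from the quartiles in the table above:

```python
# Quartiles copied from the COUNTYFP stats table above.
q1, q3 = 35.0, 133.5
iqr = q3 - q1                      # 98.5, matching the reported iqr

# Tukey's rule (assumed to be what the alert applies).
lower_fence = q1 - 1.5 * iqr       # -112.75
upper_fence = q3 + 1.5 * iqr       # 281.25

print(lower_fence, upper_fence)
```

Because the minimum code is 1, only the upper fence can be crossed here: under this assumption, every county code above 281.25 is counted as an outlier.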

Fig 9.

Distribution of COUNTYFP. Vertical dash marks the median.

Histogram bins for COUNTYFP (median: 79.0).
bin	count
1 – 21.98	512
21.98 – 42.95	408
42.95 – 63.93	399
63.93 – 84.9	335
84.9 – 105.9	341
105.9 – 126.9	271
126.9 – 147.8	225
147.8 – 168.8	165
168.8 – 189.8	140
189.8 – 210.8	71
210.8 – 231.7	45
231.7 – 252.7	25
252.7 – 273.7	22
273.7 – 294.7	23
294.7 – 315.6	22
315.6 – 336.6	13
336.6 – 357.6	11
357.6 – 378.6	10
378.6 – 399.5	11
399.5 – 420.5	10
420.5 – 441.5	11
441.5 – 462.5	10
462.5 – 483.4	11
483.4 – 504.4	10
504.4 – 525.4	7
525.4 – 546.4	2
546.4 – 567.3	1
567.3 – 588.3	2
588.3 – 609.3	3
609.3 – 630.2	3
630.2 – 651.2	2
651.2 – 672.2	2
672.2 – 693.2	5
693.2 – 714.2	2
714.2 – 735.1	3
735.1 – 756.1	2
756.1 – 777.1	3
777.1 – 798.1	1
798.1 – 819	2
819 – 840	3

COUNTYNS numeric identifier

COUNTYNS is the Census Bureau's permanent numeric ANSI/GNIS identifier for U.S. counties: all 3144 values are unique with no nulls or zeros, and the range (23901 to 2830254) matches the GNIS ID space. The distribution is broad but unremarkable (skew 0.17, kurtosis -0.80), as expected for an ID code rather than a measurement.

Treatment: Treat as a county-level key for joins; do not use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["COUNTYNS"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,144
min	23,901
max	2.83e+06
mean	9.503e+05
median	9.741e+05
std	5.168e+05
q1	4.85e+05
q3	1.384e+06
iqr	8.99e+05
skew	0.1721
kurtosis	-0.8015
n_outliers	11
outlier_rate	0.003499
zero_rate	0

Fig 10.

Distribution of COUNTYNS. Vertical dash marks the median.

Histogram bins for COUNTYNS (median: 974128.5).
bin	count
2.39e+04 – 9.406e+04	90
9.406e+04 – 1.642e+05	66
1.642e+05 – 2.344e+05	67
2.344e+05 – 3.045e+05	86
3.045e+05 – 3.747e+05	161
3.747e+05 – 4.449e+05	114
4.449e+05 – 5.15e+05	296
5.15e+05 – 5.852e+05	191
5.852e+05 – 6.553e+05	21
6.553e+05 – 7.255e+05	169
7.255e+05 – 7.956e+05	116
7.956e+05 – 8.658e+05	110
8.658e+05 – 9.36e+05	53
9.36e+05 – 1.006e+06	64
1.006e+06 – 1.076e+06	241
1.076e+06 – 1.146e+06	99
1.146e+06 – 1.217e+06	81
1.217e+06 – 1.287e+06	117
1.287e+06 – 1.357e+06	0
1.357e+06 – 1.427e+06	278
1.427e+06 – 1.497e+06	100
1.497e+06 – 1.567e+06	121
1.567e+06 – 1.638e+06	187
1.638e+06 – 1.708e+06	193
1.708e+06 – 1.778e+06	63
1.778e+06 – 1.848e+06	43
1.848e+06 – 1.918e+06	0
1.918e+06 – 1.988e+06	1
1.988e+06 – 2.059e+06	1
2.059e+06 – 2.129e+06	0
2.129e+06 – 2.199e+06	0
2.199e+06 – 2.269e+06	0
2.269e+06 – 2.339e+06	0
2.339e+06 – 2.409e+06	2
2.409e+06 – 2.479e+06	0
2.479e+06 – 2.55e+06	2
2.55e+06 – 2.62e+06	0
2.62e+06 – 2.69e+06	0
2.69e+06 – 2.76e+06	0
2.76e+06 – 2.83e+06	11

GEOIDFQ text identifier

This is the Census Bureau's fully-qualified GEOID (GEOIDFQ) for U.S. counties: every value is exactly 14 characters, single-token and all-caps, made up of the `0500000US` summary-level prefix (9 characters) followed by the 5-digit state+county FIPS code. All 3,144 values are unique with no nulls, matching the count of U.S. counties. Vocab size equals row count (3,144), confirming it is a pure identifier with no analytical signal of its own.

Treatment: Use as a left-join key against Census geographies; do not feed into models.

anthropic:claude-opus-4-7 · confidence high
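
To join GEOIDFQ against tables keyed on the plain 5-digit code, the prefix has to be removed first. A minimal sketch; `geoid_from_geoidfq` is a hypothetical helper name and the example value is constructed from the prefix and the lowest GEOID reported in this file.

```python
PREFIX = "0500000US"  # county summary-level prefix described above

def geoid_from_geoidfq(geoidfq: str) -> str:
    """Return the 5-digit state+county code from a 14-character county GEOIDFQ."""
    if len(geoidfq) != 14 or not geoidfq.startswith(PREFIX):
        raise ValueError(f"not a county GEOIDFQ: {geoidfq!r}")
    return geoidfq[len(PREFIX):]

# Example value constructed for illustration.
print(geoid_from_geoidfq("0500000US01001"))  # 01001
```

Raising on anything that is not 14 characters with the county prefix keeps rows from other summary levels from slipping silently into a county join.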

Out[22]:

saturn.columns["GEOIDFQ"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,144
len_min	14
len_max	14
len_mean	14
len_median	14
len_p95	14
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	3,144
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 11.

Character-length distribution for GEOIDFQ.

Character-length distribution for GEOIDFQ (mean: 14.0).
chars	count
14	3144

GEOID numeric identifier

GEOID is the 5-digit FIPS code identifying US counties: every one of the 3,144 rows is unique with no nulls, and the range 1001 to 56045 matches the state+county FIPS convention (Alabama through Wyoming) once the leading zero lost in numeric storage is restored (1001 is 01001). The near-zero skew (-0.08) and flat kurtosis (-1.10) reflect roughly uniform coverage across state codes rather than any meaningful distribution. Treating these as numbers is misleading: they are categorical keys.

Treatment: Cast to zero-padded string and use as a join key to county-level geographies; do not model as numeric.

anthropic:claude-opus-4-7 · confidence high

Out[25]:

saturn.columns["GEOID"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,144
min	1,001
max	56,045
mean	3.037e+04
median	29,174
std	1.517e+04
q1	1.817e+04
q3	4.508e+04
iqr	26,905
skew	-0.07923
kurtosis	-1.099
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 12.

Distribution of GEOID. Vertical dash marks the median.

Histogram bins for GEOID (median: 29174.0).
bin	count
1001 – 2377	97
2377 – 3753	0
3753 – 5129	80
5129 – 6505	68
6505 – 7882	0
7882 – 9258	73
9258 – 1.063e+04	3
1.063e+04 – 1.201e+04	6
1.201e+04 – 1.339e+04	221
1.339e+04 – 1.476e+04	0
1.476e+04 – 1.614e+04	49
1.614e+04 – 1.751e+04	102
1.751e+04 – 1.889e+04	92
1.889e+04 – 2.027e+04	204
2.027e+04 – 2.164e+04	120
2.164e+04 – 2.302e+04	73
2.302e+04 – 2.439e+04	30
2.439e+04 – 2.577e+04	15
2.577e+04 – 2.715e+04	156
2.715e+04 – 2.852e+04	96
2.852e+04 – 2.99e+04	115
2.99e+04 – 3.128e+04	149
3.128e+04 – 3.265e+04	17
3.265e+04 – 3.403e+04	24
3.403e+04 – 3.54e+04	40
3.54e+04 – 3.678e+04	62
3.678e+04 – 3.816e+04	153
3.816e+04 – 3.953e+04	88
3.953e+04 – 4.091e+04	77
4.091e+04 – 4.228e+04	103
4.228e+04 – 4.366e+04	0
4.366e+04 – 4.504e+04	23
4.504e+04 – 4.641e+04	94
4.641e+04 – 4.779e+04	95
4.779e+04 – 4.916e+04	283
4.916e+04 – 5.054e+04	14
5.054e+04 – 5.192e+04	133
5.192e+04 – 5.329e+04	39
5.329e+04 – 5.467e+04	55
5.467e+04 – 5.604e+04	95

NAME text label

This column holds short place names — almost certainly US county names, given the dominance of 'Washington' (31), 'Franklin' (26), 'Jefferson' (26), 'Lincoln' (24) and 'Madison' (20), all classic county namesakes. Values are overwhelmingly single-word (one_word_rate 0.934, word_mean 1.07) and short (len_mean 7.0, len_max 30), with no nulls. The 41.5% duplicate_rate is expected here: the same county name recurs across states, so 3144 rows collapse to 1838 unique strings.

Treatment: Treat as a non-unique name; pair with a state/FIPS column before joining or grouping.

anthropic:claude-opus-4-7 · confidence high
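
A short sketch of why NAME needs a companion column before grouping or joining. The rows are hypothetical; only the column names come from this file.

```python
import pandas as pd

# Hypothetical rows: the same county name in two different states.
toy = pd.DataFrame({
    "NAME": ["Washington", "Washington", "Franklin"],
    "STUSPS": ["AL", "GA", "AL"],
})

# NAME alone collides; NAME plus state does not (in this toy frame).
dupes_by_name = int(toy.duplicated(subset=["NAME"], keep=False).sum())
dupes_by_pair = int(toy.duplicated(subset=["NAME", "STUSPS"], keep=False).sum())

print(dupes_by_name, dupes_by_pair)  # 2 0
```

Even the name-plus-state pair may not be unique in the real file, for instance where a state has both a county and an independent city with the same name, so a FIPS-based key remains the safer choice for joins.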

Out[28]:

saturn.columns["NAME"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	1,838
len_min	3
len_max	30
len_mean	7.05
len_median	7
len_p95	11
word_mean	1.072
word_median	1
n_empty	0
n_duplicates	1,306
duplicate_rate	0.4154
vocab_size	1,875
readability_flesch_mean	36.74
emoji_rate	0
url_rate	0
one_word_rate	0.9342
allcaps_rate	0
boilerplate_rate	0
alert: one_word	93.4% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	41.5% duplicate strings

Fig 13.

Character-length distribution for NAME.

Character-length distribution for NAME (mean: 7.049618320610687).
chars	count
3 – 4	27
4 – 4	254
4 – 5	460
5 – 6	0
6 – 6	680
6 – 7	592
7 – 8	0
8 – 8	486
8 – 9	282
9 – 10	0
10 – 10	203
10 – 11	55
11 – 12	0
12 – 12	44
12 – 13	18
13 – 14	0
14 – 14	14
14 – 15	9
15 – 16	0
16 – 16	5
16 – 17	3
17 – 18	0
18 – 19	2
19 – 19	2
19 – 20	0
20 – 21	3
21 – 21	1
21 – 22	0
22 – 23	0
23 – 23	0
23 – 24	0
24 – 25	2
25 – 25	1
25 – 26	0
26 – 27	0
27 – 27	0
27 – 28	0
28 – 29	0
29 – 29	0
29 – 30	1

NAMELSAD text label

This is the full legal name of US county-equivalents (NAMELSAD from Census TIGER), with 'county' appearing 2999 times alongside 64 'parish' and 47 'city' entries reflecting Louisiana and independent-city conventions. Names are short (mean 14.08 chars, ~2 words) and heavily duplicated across states: 1262 duplicates (40.1%) driven by repeated names like 'Washington County' (30), 'Jefferson County' (25), and 'Franklin County' (24). Only 1882 unique values across 3144 rows, so this field alone does not identify a county.

Treatment: Use as a display label only; join on a state+FIPS code rather than this name to avoid duplicate collisions.

anthropic:claude-opus-4-7 · confidence high

Out[31]:

saturn.columns["NAMELSAD"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	1,882
len_min	10
len_max	46
len_mean	14.08
len_median	14
len_p95	18
word_mean	2.08
word_median	2
n_empty	0
n_duplicates	1,262
duplicate_rate	0.4014
vocab_size	1,883
readability_flesch_mean	35.29
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	0
boilerplate_rate	0
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	40.1% duplicate strings

Fig 14.

Character-length distribution for NAMELSAD.

Character-length distribution for NAMELSAD (mean: 14.083651399491094).
chars	count
10 – 11	29
11 – 12	255
12 – 13	465
13 – 14	682
14 – 14	587
14 – 15	483
15 – 16	278
16 – 17	200
17 – 18	54
18 – 19	0
19 – 20	43
20 – 21	18
21 – 22	12
22 – 23	10
23 – 24	5
24 – 24	4
24 – 25	5
25 – 26	2
26 – 27	1
27 – 28	0
28 – 29	1
29 – 30	0
30 – 31	0
31 – 32	2
32 – 32	1
32 – 33	1
33 – 34	1
34 – 35	1
35 – 36	0
36 – 37	0
37 – 38	0
38 – 39	0
39 – 40	0
40 – 41	2
41 – 42	1
42 – 42	0
42 – 43	0
43 – 44	0
44 – 45	0
45 – 46	1

STUSPS categorical foreign_key

STUSPS is the USPS two-letter state abbreviation, with 51 distinct values across 3,144 rows — consistent with US states plus DC at the county grain. Distribution matches county counts: TX leads at 254 (8.08%), followed by GA (159), VA (133), and KY (120). No nulls and high entropy ratio (0.93) indicate clean, well-spread categorical coverage.

Treatment: left-join on this code to state-level reference tables, or one-hot/target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
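
A sketch of the left join suggested above, with guards against the two usual failure modes (duplicate keys in the reference table, and county rows with no match). Both frames and the `region` column are hypothetical.

```python
import pandas as pd

# Hypothetical county rows and a hypothetical state-level reference table.
counties = pd.DataFrame({"STUSPS": ["TX", "TX", "GA"], "fips": [48001, 48003, 13001]})
states = pd.DataFrame({"STUSPS": ["TX", "GA"], "region": ["South", "South"]})

# validate="many_to_one" makes pandas raise if the state table has duplicate keys;
# indicator=True adds a _merge column that exposes unmatched county rows.
merged = counties.merge(states, on="STUSPS", how="left",
                        validate="many_to_one", indicator=True)

assert len(merged) == len(counties)          # the join must not add rows
assert (merged["_merge"] == "both").all()    # every county found its state
```

With 51 state codes on one side and 3,144 county rows on the other, the many-to-one check is the one most likely to catch a malformed reference table early.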

Out[34]:

saturn.columns["STUSPS"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	51
top_value	TX
top_rate	0.08079
cardinality	51
entropy	5.277
entropy_ratio	0.9304

Fig 15.

Top values for STUSPS.

Top values for STUSPS (20 unique shown, of 51 total).
value	count	share
TX	254	8.1%
GA	159	5.1%
VA	133	4.2%
KY	120	3.8%
MO	115	3.7%
KS	105	3.3%
IL	102	3.2%
NC	100	3.2%
IA	99	3.1%
TN	95	3.0%
NE	93	3.0%
IN	92	2.9%
OH	88	2.8%
MN	87	2.8%
MI	83	2.6%
MS	82	2.6%
OK	77	2.4%
AR	75	2.4%
WI	72	2.3%
AL	67	2.1%

STATE_NAME categorical feature

STATE_NAME holds US state labels across 3,144 rows with exactly 51 unique values (50 states plus likely DC) and zero nulls. The distribution mirrors county counts per state: Texas leads at 254 (8.1%), followed by Georgia (159) and Virginia (133), consistent with this being one row per US county. Entropy ratio of 0.93 indicates a fairly even spread across states given their natural county-count differences.

Treatment: use as a categorical grouping key or one-hot/target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
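
When STATE_NAME is used as a grouping key, the choice between averaging county rates and pooling county counts changes the answer. A minimal sketch with two hypothetical counties, assuming veterans_per_100 is 100 × veterans_est / total_pop (the report does not state the formula):

```python
import pandas as pd

# Hypothetical counties: one large, one small, in the same state.
toy = pd.DataFrame({
    "STATE_NAME": ["Texas", "Texas"],
    "total_pop": [1_000_000, 10_000],
    "veterans_est": [50_000, 1_000],
})
toy["veterans_per_100"] = 100 * toy["veterans_est"] / toy["total_pop"]

# Unweighted: every county counts equally, so the small county pulls the mean up.
unweighted = toy.groupby("STATE_NAME")["veterans_per_100"].mean()

# Population-weighted: sum the counts first, then form the rate.
sums = toy.groupby("STATE_NAME")[["veterans_est", "total_pop"]].sum()
weighted = 100 * sums["veterans_est"] / sums["total_pop"]

print(unweighted["Texas"], weighted["Texas"])
```

In this toy case the unweighted mean is 7.5 per 100 while the pooled rate is about 5.05, which is the kind of gap the heavily skewed total_pop column can produce.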

Out[37]:

saturn.columns["STATE_NAME"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	51
top_value	Texas
top_rate	0.08079
cardinality	51
entropy	5.277
entropy_ratio	0.9304

Fig 16.

Top values for STATE_NAME.

Top values for STATE_NAME (20 unique shown, of 51 total).
value	count	share
Texas	254	8.1%
Georgia	159	5.1%
Virginia	133	4.2%
Kentucky	120	3.8%
Missouri	115	3.7%
Kansas	105	3.3%
Illinois	102	3.2%
North Carolina	100	3.2%
Iowa	99	3.1%
Tennessee	95	3.0%
Nebraska	93	3.0%
Indiana	92	2.9%
Ohio	88	2.8%
Minnesota	87	2.8%
Michigan	83	2.6%
Mississippi	82	2.6%
Oklahoma	77	2.4%
Arkansas	75	2.4%
Wisconsin	72	2.3%
Alabama	67	2.1%

LSAD categorical metadata

LSAD is the Census Legal/Statistical Area Description code identifying the type of geographic entity for each of the 3,144 rows. The distribution is extremely imbalanced: code '06' (county) accounts for 95.39% of rows, so the other 8 of the 9 distinct codes share the remaining 4.6% and the entropy ratio is only 0.117. The minor categories tail off quickly, from '15' (64 rows) and '25' (40) through '04' (13) and '05' (11) down to single-digit counts for 'PL' (9), '03' (4), '00' (2) and '12' (2).

Treatment: Collapse rare codes into an 'other' bucket or drop, since one value dominates.

anthropic:claude-opus-4-7 · confidence high
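
One way to collapse the rare codes, sketched with the counts reported for LSAD in this file. The threshold of 40 rows and the helper name `collapse` are hypothetical choices.

```python
# Counts copied from the LSAD table in this report.
lsad_counts = {"06": 2999, "15": 64, "25": 40, "04": 13, "05": 11,
               "PL": 9, "03": 4, "00": 2, "12": 2}

MIN_COUNT = 40  # hypothetical threshold; choose to suit the analysis

def collapse(code: str) -> str:
    """Map codes seen fewer than MIN_COUNT times to a single 'other' bucket."""
    return code if lsad_counts.get(code, 0) >= MIN_COUNT else "other"

collapsed = {}
for code, n in lsad_counts.items():
    key = collapse(code)
    collapsed[key] = collapsed.get(key, 0) + n

print(collapsed)  # {'06': 2999, '15': 64, '25': 40, 'other': 41}
```

Applied to a DataFrame column, the same function could be passed to `Series.map`.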

Out[40]:

saturn.columns["LSAD"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	9
top_value	06
top_rate	0.9539
cardinality	9
entropy	0.3707
entropy_ratio	0.1169
alert: imbalance	top value is 95.4% of rows
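
The entropy and entropy_ratio values above appear consistent with Shannon entropy in bits, divided by log2 of the cardinality. This is an inference from the reported numbers, not documented Saturn behaviour. Recomputing from the LSAD counts in this report:

```python
import math

# Counts copied from the LSAD table in this report.
counts = [2999, 64, 40, 13, 11, 9, 4, 2, 2]
n = sum(counts)                                   # 3,144 rows

# Shannon entropy in bits, then normalised by the maximum log2(k) for k categories.
entropy = -sum((c / n) * math.log2(c / n) for c in counts)
entropy_ratio = entropy / math.log2(len(counts))

print(round(entropy, 4), round(entropy_ratio, 4))
```

A ratio near 1 means rows are spread almost evenly across categories; a value near 0.12, as here, means one category holds nearly everything.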

Fig 17.

Top values for LSAD.

Top values for LSAD (9 unique shown, of 9 total).
value	count	share
06	2999	95.4%
15	64	2.0%
25	40	1.3%
04	13	0.4%
05	11	0.3%
PL	9	0.3%
03	4	0.1%
00	2	0.1%
12	2	0.1%

ALAND numeric feature

ALAND looks like land-area in square meters for 3,144 unique geographic units (matching the U.S. county count), with no nulls or zeros. The distribution is extremely right-skewed (skew 26.8, kurtosis 953) — the median is 1.59B while the max reaches 377B, and 11.5% of rows flag as outliers. A handful of very large areas dominate the mean (2.91B) versus the median.

Treatment: log-transform before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
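
A quick illustration of what the log transform does to this column, using only the summary values reported below for ALAND and assuming the unit is square metres as the description suggests.

```python
import math

# Summary values copied from the ALAND stats table (square metres assumed).
aland = {"min": 5.3e6, "median": 1.594e9, "max": 3.771e11}

for label, m2 in aland.items():
    km2 = m2 / 1e6                 # 1 km^2 = 1,000,000 m^2
    print(f"{label:>6}: {km2:>12,.1f} km^2   log10(m^2) = {math.log10(m2):.2f}")

# Compare the gap between max and median on the raw and the log10 scale.
spread_raw = aland["max"] / aland["median"]
spread_log = math.log10(aland["max"]) - math.log10(aland["median"])
print(round(spread_raw, 1), round(spread_log, 2))
```

On the raw scale the largest area is roughly 237 times the median; on the log10 scale the two sit about 2.4 units apart, which is far easier to plot and to model.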

Out[43]:

saturn.columns["ALAND"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,144
min	5.3e+06
max	3.771e+11
mean	2.911e+09
median	1.594e+09
std	9.306e+09
q1	1.116e+09
q3	2.394e+09
iqr	1.277e+09
skew	26.82
kurtosis	953.2
n_outliers	362
outlier_rate	0.1151
zero_rate	0
alert: high_skew	skew=+26.82
alert: outliers	11.5% rows beyond 1.5 IQR

Fig 18.

Distribution of ALAND. Vertical dash marks the median.

Histogram bins for ALAND (median: 1594401059.0).
bin	count
5.3e+06 – 9.432e+09	3006
9.432e+09 – 1.886e+10	97
1.886e+10 – 2.828e+10	22
2.828e+10 – 3.771e+10	3
3.771e+10 – 4.714e+10	4
4.714e+10 – 5.656e+10	3
5.656e+10 – 6.599e+10	5
6.599e+10 – 7.542e+10	0
7.542e+10 – 8.484e+10	0
8.484e+10 – 9.427e+10	1
9.427e+10 – 1.037e+11	0
1.037e+11 – 1.131e+11	1
1.131e+11 – 1.225e+11	0
1.225e+11 – 1.32e+11	0
1.32e+11 – 1.414e+11	0
1.414e+11 – 1.508e+11	0
1.508e+11 – 1.603e+11	0
1.603e+11 – 1.697e+11	0
1.697e+11 – 1.791e+11	0
1.791e+11 – 1.885e+11	0
1.885e+11 – 1.98e+11	0
1.98e+11 – 2.074e+11	0
2.074e+11 – 2.168e+11	0
2.168e+11 – 2.262e+11	0
2.262e+11 – 2.357e+11	1
2.357e+11 – 2.451e+11	0
2.451e+11 – 2.545e+11	0
2.545e+11 – 2.639e+11	0
2.639e+11 – 2.734e+11	0
2.734e+11 – 2.828e+11	0
2.828e+11 – 2.922e+11	0
2.922e+11 – 3.016e+11	0
3.016e+11 – 3.111e+11	0
3.111e+11 – 3.205e+11	0
3.205e+11 – 3.299e+11	0
3.299e+11 – 3.394e+11	0
3.394e+11 – 3.488e+11	0
3.488e+11 – 3.582e+11	0
3.582e+11 – 3.676e+11	0
3.676e+11 – 3.771e+11	1

AWATER numeric feature

AWATER is the standard Census TIGER field for water-area in square meters, here at what looks like county granularity given n=3144 unique values. The distribution is extremely right-skewed (skew 13.18, kurtosis 210.8): the median is 19.4M but the mean is 222M and the max reaches 25.99B, with 440 outliers (14.0% of rows). One row is zero, and all 3144 values are unique, so this behaves like a continuous geographic feature rather than a key.

Treatment: Apply a log1p transform before any modelling to tame the 13.2 skew and heavy outlier tail.

anthropic:claude-opus-4-7 · confidence high
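
The reason for log1p rather than a plain log is the single zero-valued row. A minimal sketch using the min, median and max reported for AWATER:

```python
import math

# Values copied from the AWATER stats table: min, median, max.
awater = [0.0, 1.939e7, 2.599e10]

# A plain log is undefined at the single zero row.
try:
    math.log(awater[0])
except ValueError:
    print("log(0) is undefined")

# log1p(x) = log(1 + x) is defined at zero and nearly identical to log(x) for large x.
transformed = [math.log1p(x) for x in awater]
print([round(t, 2) for t in transformed])
```

With NumPy the plain log of zero would yield negative infinity instead of raising, which can propagate silently into a model, so the same substitution applies there.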

Out[46]:

saturn.columns["AWATER"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,144
min	0
max	2.599e+10
mean	2.22e+08
median	1.939e+07
std	1.241e+09
q1	7.132e+06
q3	5.946e+07
iqr	5.233e+07
skew	13.18
kurtosis	210.8
n_outliers	440
outlier_rate	0.1399
zero_rate	0.0003181
alert: high_skew	skew=+13.18
alert: outliers	14.0% rows beyond 1.5 IQR

Fig 19.

Distribution of AWATER. Vertical dash marks the median.

Histogram bins for AWATER (median: 19389010.5).
bin	count
0 – 6.497e+08	2954
6.497e+08 – 1.299e+09	89
1.299e+09 – 1.949e+09	30
1.949e+09 – 2.599e+09	25
2.599e+09 – 3.249e+09	13
3.249e+09 – 3.898e+09	3
3.898e+09 – 4.548e+09	3
4.548e+09 – 5.198e+09	7
5.198e+09 – 5.848e+09	1
5.848e+09 – 6.497e+09	3
6.497e+09 – 7.147e+09	1
7.147e+09 – 7.797e+09	0
7.797e+09 – 8.447e+09	1
8.447e+09 – 9.096e+09	0
9.096e+09 – 9.746e+09	0
9.746e+09 – 1.04e+10	0
1.04e+10 – 1.105e+10	1
1.105e+10 – 1.17e+10	1
1.17e+10 – 1.235e+10	0
1.235e+10 – 1.299e+10	2
1.299e+10 – 1.364e+10	0
1.364e+10 – 1.429e+10	3
1.429e+10 – 1.494e+10	2
1.494e+10 – 1.559e+10	1
1.559e+10 – 1.624e+10	0
1.624e+10 – 1.689e+10	0
1.689e+10 – 1.754e+10	0
1.754e+10 – 1.819e+10	0
1.819e+10 – 1.884e+10	0
1.884e+10 – 1.949e+10	0
1.949e+10 – 2.014e+10	0
2.014e+10 – 2.079e+10	0
2.079e+10 – 2.144e+10	1
2.144e+10 – 2.209e+10	0
2.209e+10 – 2.274e+10	1
2.274e+10 – 2.339e+10	0
2.339e+10 – 2.404e+10	0
2.404e+10 – 2.469e+10	0
2.469e+10 – 2.534e+10	1
2.534e+10 – 2.599e+10	1

fips numeric identifier

This is the 5-digit US county FIPS code: every one of the 3144 rows is unique, there are no nulls, and the range 1001–56045 spans the standard state+county encoding. The distribution is essentially uniform across the code space (skew −0.08, kurtosis −1.10), as expected for an identifier rather than a measurement.

Treatment: left-join on this id; do not use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
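
The stats for fips match those reported for GEOID, which suggests one of the two columns is redundant. A sketch of an explicit row-by-row check before dropping one; the rows are hypothetical values inside the reported 1001 to 56045 range.

```python
import pandas as pd

# Hypothetical rows; the real file has 3,144 of them.
toy = pd.DataFrame({"GEOID": [1001, 48201, 56045], "fips": [1001, 48201, 56045]})

# Element-wise comparison is a stricter test than identical summary stats.
if (toy["GEOID"] == toy["fips"]).all():
    toy = toy.drop(columns=["fips"])   # keep a single copy of the key

print(list(toy.columns))  # ['GEOID']
```

Identical means, quartiles and histograms make equality very likely, but only the element-wise comparison rules out two columns that hold the same values in a different row order.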

Out[49]:

saturn.columns["fips"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,144
min	1,001
max	56,045
mean	3.037e+04
median	29,174
std	1.517e+04
q1	1.817e+04
q3	4.508e+04
iqr	26,905
skew	-0.07923
kurtosis	-1.099
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 20.

Distribution of fips. Vertical dash marks the median.

Histogram bins for fips (median: 29174.0).
bin	count
1001 – 2377	97
2377 – 3753	0
3753 – 5129	80
5129 – 6505	68
6505 – 7882	0
7882 – 9258	73
9258 – 1.063e+04	3
1.063e+04 – 1.201e+04	6
1.201e+04 – 1.339e+04	221
1.339e+04 – 1.476e+04	0
1.476e+04 – 1.614e+04	49
1.614e+04 – 1.751e+04	102
1.751e+04 – 1.889e+04	92
1.889e+04 – 2.027e+04	204
2.027e+04 – 2.164e+04	120
2.164e+04 – 2.302e+04	73
2.302e+04 – 2.439e+04	30
2.439e+04 – 2.577e+04	15
2.577e+04 – 2.715e+04	156
2.715e+04 – 2.852e+04	96
2.852e+04 – 2.99e+04	115
2.99e+04 – 3.128e+04	149
3.128e+04 – 3.265e+04	17
3.265e+04 – 3.403e+04	24
3.403e+04 – 3.54e+04	40
3.54e+04 – 3.678e+04	62
3.678e+04 – 3.816e+04	153
3.816e+04 – 3.953e+04	88
3.953e+04 – 4.091e+04	77
4.091e+04 – 4.228e+04	103
4.228e+04 – 4.366e+04	0
4.366e+04 – 4.504e+04	23
4.504e+04 – 4.641e+04	94
4.641e+04 – 4.779e+04	95
4.779e+04 – 4.916e+04	283
4.916e+04 – 5.054e+04	14
5.054e+04 – 5.192e+04	133
5.192e+04 – 5.329e+04	39
5.329e+04 – 5.467e+04	55
5.467e+04 – 5.604e+04	95

active_duty_est numeric feature

Numeric estimate of active-duty population per record, with 3,144 rows and 3,028 unique values suggesting one row per geographic unit (likely county-level given the row count). The distribution is severely right-skewed (skew 13.14, kurtosis 288.57): the median is 11,698 but the mean is 53,782.95 and the max reaches 5,240,842, with 449 outliers (14.3%). There are no nulls or zeros, and the IQR of 27,868 is dwarfed by a std of 176,262.59.

Treatment: log-transform before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high

Out[52]:

saturn.columns["active_duty_est"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,028
min	36
max	5.241e+06
mean	5.378e+04
median	11,698
std	1.763e+05
q1	4722
q3	3.259e+04
iqr	27,868
skew	13.14
kurtosis	288.6
n_outliers	449
outlier_rate	0.1428
zero_rate	0
alert: high_skew	skew=+13.14
alert: outliers	14.3% rows beyond 1.5 IQR

Fig 21.

Distribution of active_duty_est. Vertical dash marks the median.

Histogram bins for active_duty_est (median: 11698.0).
bin	count
36 – 1.311e+05	2867
1.311e+05 – 2.621e+05	135
2.621e+05 – 3.931e+05	52
3.931e+05 – 5.241e+05	41
5.241e+05 – 6.551e+05	14
6.551e+05 – 7.862e+05	10
7.862e+05 – 9.172e+05	5
9.172e+05 – 1.048e+06	5
1.048e+06 – 1.179e+06	4
1.179e+06 – 1.31e+06	2
1.31e+06 – 1.441e+06	3
1.441e+06 – 1.572e+06	0
1.572e+06 – 1.703e+06	1
1.703e+06 – 1.834e+06	1
1.834e+06 – 1.965e+06	0
1.965e+06 – 2.096e+06	0
2.096e+06 – 2.227e+06	0
2.227e+06 – 2.358e+06	1
2.358e+06 – 2.489e+06	1
2.489e+06 – 2.62e+06	0
2.62e+06 – 2.751e+06	0
2.751e+06 – 2.882e+06	1
2.882e+06 – 3.013e+06	0
3.013e+06 – 3.145e+06	0
3.145e+06 – 3.276e+06	0
3.276e+06 – 3.407e+06	0
3.407e+06 – 3.538e+06	0
3.538e+06 – 3.669e+06	0
3.669e+06 – 3.8e+06	0
3.8e+06 – 3.931e+06	0
3.931e+06 – 4.062e+06	0
4.062e+06 – 4.193e+06	0
4.193e+06 – 4.324e+06	0
4.324e+06 – 4.455e+06	0
4.455e+06 – 4.586e+06	0
4.586e+06 – 4.717e+06	0
4.717e+06 – 4.848e+06	0
4.848e+06 – 4.979e+06	0
4.979e+06 – 5.11e+06	0
5.11e+06 – 5.241e+06	1

veterans_est numeric feature

Estimated veteran counts per row (likely US counties given n=3,144), ranging from 0 to 244,160 with a median of 1,547.5 but a mean of 5,419.5. The distribution is heavily right-skewed (skew 8.01, kurtosis 100.0) with 408 outliers (12.98%), reflecting a few highly populous areas that dwarf the rest. There are no nulls and only one zero-valued row (zero_rate 0.0003), so coverage is essentially complete.

Treatment: log1p-transform before modelling to tame the heavy right skew.

anthropic:claude-opus-4-7 · confidence high

Out[55]:

saturn.columns["veterans_est"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	2,424
min	0
max	244,160
mean	5419
median	1548
std	1.311e+04
q1	634.8
q3	4428
iqr	3,793
skew	8.014
kurtosis	100
n_outliers	408
outlier_rate	0.1298
zero_rate	0.0003181
alert: high_skew	skew=+8.01
alert: outliers	13.0% rows beyond 1.5 IQR

Fig 22.

Distribution of veterans_est. Vertical dash marks the median.

Histogram bins for veterans_est (median: 1547.5).
bin	count
0 – 6104	2534
6104 – 1.221e+04	271
1.221e+04 – 1.831e+04	116
1.831e+04 – 2.442e+04	68
2.442e+04 – 3.052e+04	49
3.052e+04 – 3.662e+04	30
3.662e+04 – 4.273e+04	18
4.273e+04 – 4.883e+04	20
4.883e+04 – 5.494e+04	7
5.494e+04 – 6.104e+04	3
6.104e+04 – 6.714e+04	4
6.714e+04 – 7.325e+04	6
7.325e+04 – 7.935e+04	0
7.935e+04 – 8.546e+04	6
8.546e+04 – 9.156e+04	2
9.156e+04 – 9.766e+04	1
9.766e+04 – 1.038e+05	0
1.038e+05 – 1.099e+05	1
1.099e+05 – 1.16e+05	1
1.16e+05 – 1.221e+05	0
1.221e+05 – 1.282e+05	0
1.282e+05 – 1.343e+05	0
1.343e+05 – 1.404e+05	1
1.404e+05 – 1.465e+05	2
1.465e+05 – 1.526e+05	0
1.526e+05 – 1.587e+05	1
1.587e+05 – 1.648e+05	0
1.648e+05 – 1.709e+05	0
1.709e+05 – 1.77e+05	0
1.77e+05 – 1.831e+05	0
1.831e+05 – 1.892e+05	0
1.892e+05 – 1.953e+05	1
1.953e+05 – 2.014e+05	0
2.014e+05 – 2.075e+05	0
2.075e+05 – 2.136e+05	0
2.136e+05 – 2.197e+05	0
2.197e+05 – 2.258e+05	0
2.258e+05 – 2.32e+05	1
2.32e+05 – 2.381e+05	0
2.381e+05 – 2.442e+05	1
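
The bin edges in this table are consistent with 40 equal-width bins between the column's min and max (244,160 / 40 = 6,104). That Saturn bins this way is inferred from the table, not stated in the report, and the helper name below is hypothetical. A minimal sketch that reproduces the first edge and shows how much of the data lands in the first bin:

```python
def equal_width_edges(lo, hi, n_bins):
    """Return n_bins + 1 evenly spaced edges from lo to hi."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

# min, max and bin count taken from the veterans_est histogram table.
edges = equal_width_edges(0, 244160, 40)

# Rows in the first bin divided by total rows, both read from the report.
first_bin_share = 2534 / 3144

print(edges[1])                  # upper edge of the first bin
print(f"{first_bin_share:.1%}")  # share of rows falling in that bin
```

With most rows in a single bin, an equal-width histogram says little about the bulk of the distribution, which is another reason to look at the column on a log scale.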

total_pop numeric feature

A per-row population total across 3,144 rows (suggestive of US counties), with no nulls and 3,080 unique values. The distribution is severely right-skewed (skew 13.17, kurtosis 289.76): the median is 25,784.5, but the mean is 105,310.94 and the max reaches 9,936,690, with 440 rows (14.0%) flagged as outliers. The min is 50 and zero_rate is 0, so every row carries a nonzero count.

Treatment: log-transform before regression to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high

Out[58]:

saturn.columns["total_pop"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,080
min	50
max	9.937e+06
mean	1.053e+05
median	2.578e+04
std	3.338e+05
q1	1.084e+04
q3	6.808e+04
iqr	57,244
skew	13.17
kurtosis	289.8
n_outliers	440
outlier_rate	0.1399
zero_rate	0
alert: high_skew	skew=+13.17
alert: outliers	14.0% rows beyond 1.5 IQR
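
The outlier alert reads like the usual Tukey rule, where a row is flagged if it lies more than 1.5 IQR below q1 or above q3. That reading is an assumption; the report does not spell out the rule. A short sketch of the fences implied by the quartiles printed above (the function name is hypothetical):

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as printed in the total_pop table (rounded to 4 significant figures).
lower, upper = tukey_fences(1.084e4, 6.808e4)

print(f"lower fence: {lower:,.0f}")
print(f"upper fence: {upper:,.0f}")

# The reported minimum is 50; if it is above the lower fence,
# no row can be a low-side outlier.
low_side_possible = 50 < lower
print(low_side_possible)
```

Under that assumption the lower fence is negative, so every flagged row sits on the high side, above roughly 154,000 residents.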

Fig 23.

Distribution of total_pop. Vertical dash marks the median.

Histogram bins for total_pop (median: 25784.5).
bin	count
50 – 2.485e+05	2863
2.485e+05 – 4.969e+05	137
4.969e+05 – 7.453e+05	57
7.453e+05 – 9.937e+05	37
9.937e+05 – 1.242e+06	14
1.242e+06 – 1.491e+06	10
1.491e+06 – 1.739e+06	7
1.739e+06 – 1.987e+06	3
1.987e+06 – 2.236e+06	3
2.236e+06 – 2.484e+06	4
2.484e+06 – 2.733e+06	3
2.733e+06 – 2.981e+06	0
2.981e+06 – 3.229e+06	1
3.229e+06 – 3.478e+06	1
3.478e+06 – 3.726e+06	0
3.726e+06 – 3.975e+06	0
3.975e+06 – 4.223e+06	0
4.223e+06 – 4.472e+06	1
4.472e+06 – 4.72e+06	0
4.72e+06 – 4.968e+06	1
4.968e+06 – 5.217e+06	0
5.217e+06 – 5.465e+06	1
5.465e+06 – 5.714e+06	0
5.714e+06 – 5.962e+06	0
5.962e+06 – 6.21e+06	0
6.21e+06 – 6.459e+06	0
6.459e+06 – 6.707e+06	0
6.707e+06 – 6.956e+06	0
6.956e+06 – 7.204e+06	0
7.204e+06 – 7.453e+06	0
7.453e+06 – 7.701e+06	0
7.701e+06 – 7.949e+06	0
7.949e+06 – 8.198e+06	0
8.198e+06 – 8.446e+06	0
8.446e+06 – 8.695e+06	0
8.695e+06 – 8.943e+06	0
8.943e+06 – 9.191e+06	0
9.191e+06 – 9.44e+06	0
9.44e+06 – 9.688e+06	0
9.688e+06 – 9.937e+06	1

active_duty_per_10k numeric feature

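A per-capita rate (active-duty personnel per 10,000 residents) reported across 3,144 rows with no nulls, no zeros, and every value unique. The distribution is tight around a mean of 4,693.79 and a median of 4,733.28 (std 592.22) and mildly left-skewed (skew -0.38). Values range from 1,708.13 to 7,200.00, and 57 rows (1.81%) are flagged as outliers. A mean near 4,700 per 10,000 would put roughly 47% of residents on active duty, which is implausibly high, so the column's definition and denominator should be checked against the source before the values are interpreted. The 3,144 row count strongly suggests one record per US county.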

Treatment: Use as-is as a continuous feature; the mild skew does not require transformation.

anthropic:claude-opus-4-7 · confidence high
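
A per-10,000 rate is a count divided by a population, multiplied by 10,000. The sketch below assumes the column was derived as active_duty_est / total_pop × 10,000, which the report does not state; both function names are hypothetical. It plugs in the medians reported for the two count columns as a rough check, then converts the reported column mean to a percentage.

```python
def per_10k(count, population):
    """Rate per 10,000 residents; population must be positive."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 10_000

def per_10k_to_percent(rate):
    """Convert a per-10,000 rate into a percentage of residents."""
    return rate / 100

# Illustration only: the medians reported for active_duty_est and total_pop.
# A ratio of medians is not the median of ratios, so this is a rough check.
rate = per_10k(11_698, 25_784.5)
print(f"{rate:,.0f} per 10k")

# The reported column mean of 4,693.79 per 10k, expressed as a percentage.
print(f"{per_10k_to_percent(4693.79):.1f}%")
```

Reading the rate as a percentage is a quick way to judge whether the magnitudes fit the quantity the column name implies.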

Out[61]:

saturn.columns["active_duty_per_10k"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,144
min	1708
max	7,200
mean	4694
median	4733
std	592.2
q1	4331
q3	5102
iqr	771.6
skew	-0.3768
kurtosis	0.8418
n_outliers	57
outlier_rate	0.01813
zero_rate	0

Fig 24.

Distribution of active_duty_per_10k. Vertical dash marks the median.

Histogram bins for active_duty_per_10k (median: 4733.28).
bin	count
1708 – 1845	1
1845 – 1983	0
1983 – 2120	1
2120 – 2257	0
2257 – 2395	1
2395 – 2532	1
2532 – 2669	1
2669 – 2807	6
2807 – 2944	7
2944 – 3081	9
3081 – 3218	21
3218 – 3356	20
3356 – 3493	29
3493 – 3630	57
3630 – 3768	52
3768 – 3905	92
3905 – 4042	111
4042 – 4179	165
4179 – 4317	193
4317 – 4454	224
4454 – 4591	267
4591 – 4729	303
4729 – 4866	280
4866 – 5003	301
5003 – 5141	277
5141 – 5278	268
5278 – 5415	177
5415 – 5552	136
5552 – 5690	61
5690 – 5827	37
5827 – 5964	18
5964 – 6102	8
6102 – 6239	5
6239 – 6376	5
6376 – 6514	3
6514 – 6651	2
6651 – 6788	2
6788 – 6925	0
6925 – 7063	0
7063 – 7200	3

veterans_per_100 numeric feature

This column reports veterans per 100 residents, with 3,143 unique values across 3,144 rows (likely one row per US county). Values range from 0 to 18.09 with a mean of 6.19 and a median of 5.98, showing a mild right skew (0.88). There are 125 outliers (~3.98%), mostly on the high end, with a handful below the lower fence of about 1.6. Only one row is zero, so the distribution is effectively continuous and well-populated.

Treatment: Use as-is for modelling; optionally winsorize the upper ~4% outliers.

anthropic:claude-opus-4-7 · confidence high
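
Winsorizing caps extreme values at a chosen percentile instead of dropping the rows. A minimal standard-library sketch follows; the function name, the default fractions, and the sample rates are all made up for illustration and do not come from the dataset or from Saturn.

```python
def winsorize(values, lower_frac=0.0, upper_frac=0.04):
    """Cap the lowest and highest fractions of values at the nearest kept value."""
    ordered = sorted(values)
    n = len(ordered)
    lo_k = int(n * lower_frac)  # number of values capped at the bottom
    hi_k = int(n * upper_frac)  # number of values capped at the top
    if n == 0 or lo_k + hi_k >= n:
        raise ValueError("fractions leave no values to cap against")
    lo_cap = ordered[lo_k]
    hi_cap = ordered[n - 1 - hi_k]
    # Original order is preserved; only out-of-range values change.
    return [min(max(v, lo_cap), hi_cap) for v in values]

# Made-up rates for illustration, not rows from the dataset.
rates = [4.9, 5.2, 5.6, 6.0, 6.3, 6.8, 7.1, 7.5, 9.0, 18.09]
capped = winsorize(rates, upper_frac=0.10)  # cap the top 10% of this tiny sample
print(capped)
```

Capping keeps every county in the analysis while limiting the pull of the most extreme rates. Whether 4% is the right cut-off depends on the model and is a judgment call.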

Out[64]:

saturn.columns["veterans_per_100"].stats

stat	value
n	3,144
nulls	0 (0.0%)
unique	3,143
min	0
max	18.09
mean	6.19
median	5.985
std	1.998
q1	4.92
q3	7.136
iqr	2.216
skew	0.8797
kurtosis	2.029
n_outliers	125
outlier_rate	0.03976
zero_rate	0.0003181

Fig 25.

Distribution of veterans_per_100. Vertical dash marks the median.

Histogram bins for veterans_per_100 (median: 5.985).
bin	count
0 – 0.4522	1
0.4522 – 0.9043	0
0.9043 – 1.357	6
1.357 – 1.809	13
1.809 – 2.261	21
2.261 – 2.713	34
2.713 – 3.165	60
3.165 – 3.617	84
3.617 – 4.07	137
4.07 – 4.522	187
4.522 – 4.974	271
4.974 – 5.426	319
5.426 – 5.878	346
5.878 – 6.33	358
6.33 – 6.783	313
6.783 – 7.235	259
7.235 – 7.687	173
7.687 – 8.139	128
8.139 – 8.591	94
8.591 – 9.043	88
9.043 – 9.496	60
9.496 – 9.948	38
9.948 – 10.4	39
10.4 – 10.85	26
10.85 – 11.3	16
11.3 – 11.76	21
11.76 – 12.21	14
12.21 – 12.66	15
12.66 – 13.11	4
13.11 – 13.57	6
13.57 – 14.02	6
14.02 – 14.47	1
14.47 – 14.92	1
14.92 – 15.37	3
15.37 – 15.83	0
15.83 – 16.28	0
16.28 – 16.73	1
16.73 – 17.18	0
17.18 – 17.63	0
17.63 – 18.09	1

How to cite