veterans merged county analysis

source: /home/coolhand/html/datavis/data_trove/data/policy/veterans/merged_county_analysis.csv · 3,144 rows · 18 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 3,144 rows, one per U.S. county or county equivalent, combining Census geographic identifiers (GEOID, STATE_NAME, NAMELSAD, ALAND, AWATER) with veteran and active-duty military estimates and rate-normalized fields. The raw count columns (total_pop, active_duty_est, veterans_est, ALAND) are extremely right-skewed, with skew values above 8 and hundreds of outliers each, so any analysis on them should use log transforms or the per-capita versions. The rate columns tell a cleaner story: active_duty_per_10k is roughly symmetric (skew -0.38, mean ~4,694 per 10k), while veterans_per_100 is mildly right-skewed (skew 0.88, mean 6.19, max 18.09) and is the better candidate for ranking counties. State coverage is uneven: Texas alone supplies 254 counties (8.1%), followed by Georgia (159) and Virginia (133), so unweighted county-level averages over-represent states with many counties. Note also that LSAD is heavily imbalanced (95.4% code '06') and that GEOID and fips are duplicates of each other.

citing: row_count · column_count · ALAND · total_pop · active_duty_est · veterans_est · active_duty_per_10k · veterans_per_100 · STATE_NAME · LSAD · NAMELSAD
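
The effect of a log transform on a right-skewed count column can be illustrated with a minimal sketch. The population values below are hypothetical, chosen only to mimic a heavy right tail, and the `skewness` helper uses the simple moment definition, which may differ slightly from whatever estimator the profiler used.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical county populations (not taken from the file).
total_pop = [50, 4_000, 11_000, 26_000, 68_000, 250_000, 1_200_000, 9_900_000]

raw_skew = skewness(total_pop)
log_skew = skewness([math.log1p(v) for v in total_pop])
print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
```

On data shaped like this, the transformed values should show a much smaller absolute skew than the raw counts.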

Charts the summary said to look at first

STATE_NAME · Counties per state — Texas, Georgia and Virginia dominate the row counts.

Top values for STATE_NAME (20 unique shown, of 51 total).
value	count	share
Texas	254	8.1%
Georgia	159	5.1%
Virginia	133	4.2%
Kentucky	120	3.8%
Missouri	115	3.7%
Kansas	105	3.3%
Illinois	102	3.2%
North Carolina	100	3.2%
Iowa	99	3.1%
Tennessee	95	3.0%
Nebraska	93	3.0%
Indiana	92	2.9%
Ohio	88	2.8%
Minnesota	87	2.8%
Michigan	83	2.6%
Mississippi	82	2.6%
Oklahoma	77	2.4%
Arkansas	75	2.4%
Wisconsin	72	2.3%
Alabama	67	2.1%
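
Because county counts per state are so uneven, a state-level rate depends on how counties are weighted. The sketch below contrasts a pooled rate (sum of veterans over sum of population) with an unweighted mean of county rates. The rows and the `state_rates` helper are hypothetical illustrations, not values or code from the dataset.

```python
from collections import defaultdict

# Hypothetical rows: (STATE_NAME, veterans_est, total_pop).
rows = [
    ("Texas", 500, 10_000),
    ("Texas", 300, 5_000),
    ("Texas", 40_000, 1_000_000),
    ("Delaware", 30_000, 500_000),
]

def state_rates(rows):
    # Per state: [veterans, population, sum of county rates, county count]
    sums = defaultdict(lambda: [0, 0, 0.0, 0])
    for state, vets, pop in rows:
        s = sums[state]
        s[0] += vets
        s[1] += pop
        s[2] += 100 * vets / pop
        s[3] += 1
    return {
        state: {
            "pooled_per_100": 100 * v / p,      # every resident counts equally
            "mean_of_counties": rate_sum / n,   # every county counts equally
        }
        for state, (v, p, rate_sum, n) in sums.items()
    }

rates = state_rates(rows)
for state, r in rates.items():
    print(state, r)
```

In this toy example the two small Texas counties pull the unweighted mean above the pooled rate, which is the distortion to watch for when many small counties sit alongside a few large ones.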

veterans_per_100 · Distribution of veterans as a share of population; most counties cluster between 5 and 7 per 100 with a long right tail.

Histogram bins for veterans_per_100 (median: 5.984985213609011).
bin	count
0 – 0.4522	1
0.4522 – 0.9043	0
0.9043 – 1.357	6
1.357 – 1.809	13
1.809 – 2.261	21
2.261 – 2.713	34
2.713 – 3.165	60
3.165 – 3.617	84
3.617 – 4.07	137
4.07 – 4.522	187
4.522 – 4.974	271
4.974 – 5.426	319
5.426 – 5.878	346
5.878 – 6.33	358
6.33 – 6.783	313
6.783 – 7.235	259
7.235 – 7.687	173
7.687 – 8.139	128
8.139 – 8.591	94
8.591 – 9.043	88
9.043 – 9.496	60
9.496 – 9.948	38
9.948 – 10.4	39
10.4 – 10.85	26
10.85 – 11.3	16
11.3 – 11.76	21
11.76 – 12.21	14
12.21 – 12.66	15
12.66 – 13.11	4
13.11 – 13.57	6
13.57 – 14.02	6
14.02 – 14.47	1
14.47 – 14.92	1
14.92 – 15.37	3
15.37 – 15.83	0
15.83 – 16.28	0
16.28 – 16.73	1
16.73 – 17.18	0
17.18 – 17.63	0
17.63 – 18.09	1
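
The bin edges above appear consistent with 40 equal-width bins spanning the column's minimum (0) to its maximum (18.09). A minimal sketch of that binning follows; the bin count and the helper names are assumptions inferred from the table, not documented profiler settings.

```python
def equal_width_edges(lo, hi, n_bins):
    """Edges of n_bins equal-width bins covering [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def bin_index(value, lo, hi, n_bins):
    """Zero-based bin for a value in [lo, hi]; the top edge goes in the last bin."""
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)

vet_edges = equal_width_edges(0.0, 18.09, 40)
median_bin = bin_index(5.985, 0.0, 18.09, 40)
print(round(vet_edges[1], 4), median_bin)
```

Under these assumptions the median (about 5.985) falls in the bin starting near 5.878, which is the tallest bin in the table.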

active_duty_per_10k · Active-duty rate per 10k is roughly symmetric around ~4,700, useful as a normalized comparison metric.

Histogram bins for active_duty_per_10k (median: 4733.279942644007).
bin	count
1708 – 1845	1
1845 – 1983	0
1983 – 2120	1
2120 – 2257	0
2257 – 2395	1
2395 – 2532	1
2532 – 2669	1
2669 – 2807	6
2807 – 2944	7
2944 – 3081	9
3081 – 3218	21
3218 – 3356	20
3356 – 3493	29
3493 – 3630	57
3630 – 3768	52
3768 – 3905	92
3905 – 4042	111
4042 – 4179	165
4179 – 4317	193
4317 – 4454	224
4454 – 4591	267
4591 – 4729	303
4729 – 4866	280
4866 – 5003	301
5003 – 5141	277
5141 – 5278	268
5278 – 5415	177
5415 – 5552	136
5552 – 5690	61
5690 – 5827	37
5827 – 5964	18
5964 – 6102	8
6102 – 6239	5
6239 – 6376	5
6376 – 6514	3
6514 – 6651	2
6651 – 6788	2
6788 – 6925	0
6925 – 7063	0
7063 – 7200	3

total_pop · County population is extremely right-skewed (skew ~13); plot on a log scale to see structure.

Histogram bins for total_pop (median: 25784.5).
bin	count
50 – 2.485e+05	2863
2.485e+05 – 4.969e+05	137
4.969e+05 – 7.453e+05	57
7.453e+05 – 9.937e+05	37
9.937e+05 – 1.242e+06	14
1.242e+06 – 1.491e+06	10
1.491e+06 – 1.739e+06	7
1.739e+06 – 1.987e+06	3
1.987e+06 – 2.236e+06	3
2.236e+06 – 2.484e+06	4
2.484e+06 – 2.733e+06	3
2.733e+06 – 2.981e+06	0
2.981e+06 – 3.229e+06	1
3.229e+06 – 3.478e+06	1
3.478e+06 – 3.726e+06	0
3.726e+06 – 3.975e+06	0
3.975e+06 – 4.223e+06	0
4.223e+06 – 4.472e+06	1
4.472e+06 – 4.72e+06	0
4.72e+06 – 4.968e+06	1
4.968e+06 – 5.217e+06	0
5.217e+06 – 5.465e+06	1
5.465e+06 – 5.714e+06	0
5.714e+06 – 5.962e+06	0
5.962e+06 – 6.21e+06	0
6.21e+06 – 6.459e+06	0
6.459e+06 – 6.707e+06	0
6.707e+06 – 6.956e+06	0
6.956e+06 – 7.204e+06	0
7.204e+06 – 7.453e+06	0
7.453e+06 – 7.701e+06	0
7.701e+06 – 7.949e+06	0
7.949e+06 – 8.198e+06	0
8.198e+06 – 8.446e+06	0
8.446e+06 – 8.695e+06	0
8.695e+06 – 8.943e+06	0
8.943e+06 – 9.191e+06	0
9.191e+06 – 9.44e+06	0
9.44e+06 – 9.688e+06	0
9.688e+06 – 9.937e+06	1
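
With linear bins, nearly every county lands in the first bin, so the table above shows little structure. Bins spaced evenly in log10 spread the counts out. The sketch below builds such edges between the reported minimum (50) and maximum (9,936,690); the choice of 10 bins and the `log_edges` helper are illustrative assumptions.

```python
import math

def log_edges(lo, hi, n_bins):
    """Bin edges equally spaced in log10 between lo and hi (both must be > 0)."""
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    step = (log_hi - log_lo) / n_bins
    return [10 ** (log_lo + i * step) for i in range(n_bins + 1)]

pop_edges = log_edges(50, 9_936_690, 10)
for left, right in zip(pop_edges, pop_edges[1:]):
    print(f"{left:>12,.0f} to {right:>12,.0f}")
```

Each bin covers the same multiplicative range, so small and large counties get comparable resolution.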

LSAD · Legal/statistical area descriptor is dominated by code '06' (~95%), so this column adds little signal.

Top values for LSAD (9 unique shown, of 9 total).
value	count	share
06	2999	95.4%
15	64	2.0%
25	40	1.3%
04	13	0.4%
05	11	0.3%
PL	9	0.3%
03	4	0.1%
00	2	0.1%
12	2	0.1%
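
If LSAD is kept at all, the rare codes can be folded into a single bucket. The sketch below uses the counts from the table above; the 1% threshold and the `collapse_rare` helper are hypothetical choices for illustration.

```python
# Counts copied from the LSAD table above.
lsad_counts = {"06": 2999, "15": 64, "25": 40, "04": 13, "05": 11,
               "PL": 9, "03": 4, "00": 2, "12": 2}

def collapse_rare(counts, min_share=0.01, other_label="other"):
    """Merge every code whose share of rows is below min_share into other_label."""
    total = sum(counts.values())
    collapsed = {}
    for code, n in counts.items():
        key = code if n / total >= min_share else other_label
        collapsed[key] = collapsed.get(key, 0) + n
    return collapsed

collapsed = collapse_rare(lsad_counts)
print(collapsed)  # prints {'06': 2999, '15': 64, '25': 40, 'other': 41}
```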

Schema

18 columns

Per-column summary.
column	type	nulls	unique	alerts
STATEFP	numeric	0.0%	51
COUNTYFP	numeric	0.0%	329	high_skew outliers
COUNTYNS	numeric	0.0%	3,144
GEOIDFQ	text	0.0%	3,144	near_unique one_word allcaps short_text
GEOID	numeric	0.0%	3,144
NAME	text	0.0%	1,838	one_word short_text duplicates
NAMELSAD	text	0.0%	1,882	short_text duplicates
STUSPS	categorical	0.0%	51
STATE_NAME	categorical	0.0%	51
LSAD	categorical	0.0%	9	imbalance
ALAND	numeric	0.0%	3,144	high_skew outliers
AWATER	numeric	0.0%	3,144	high_skew outliers
fips	numeric	0.0%	3,144
active_duty_est	numeric	0.0%	3,028	high_skew outliers
veterans_est	numeric	0.0%	2,424	high_skew outliers
total_pop	numeric	0.0%	3,080	high_skew outliers
active_duty_per_10k	numeric	0.0%	3,144
veterans_per_100	numeric	0.0%	3,143

STATEFP

numeric foreign_key

This is the US Census STATEFP code, the 2-digit state FIPS identifier, stored here numerically so that leading zeros are dropped (for example, 01 appears as 1). Values range from 1 to 56 with 51 unique entries across 3,144 rows, matching the 50 US states plus DC, and the row count aligns with the number of US counties. The near-uniform spread (skew -0.08, kurtosis -1.10) and zero outliers are consistent with a categorical state code rather than a measured quantity. Treatment: Cast to a zero-padded 2-character string and treat as a categorical state key for joins, not a numeric feature. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 51
min: 1
max: 56
mean: 30.26
median: 29
std: 15.15
q1: 18
q3: 45
iqr: 27
skew: -0.08128
kurtosis: -1.099
n_outliers: 0
outlier_rate: 0
zero_rate: 0

COUNTYFP

numeric identifier high_skew outliers

COUNTYFP is the 3-digit FIPS county code, stored numerically across 3144 rows with no nulls and 329 unique values. The distribution is heavily right-skewed (skew 2.84, kurtosis 11.4) with a max of 840 well beyond Q3 of 133.5, flagging 176 outliers — expected behavior since FIPS codes are categorical identifiers, not measurements, and high values correspond to specific county assignments. Treatment: Cast to zero-padded string and combine with STATEFP to form a 5-digit GEOID join key; do not treat as numeric. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 329
min: 1
max: 840
mean: 103.9
median: 79
std: 107.6
q1: 35
q3: 133.5
iqr: 98.5
skew: 2.841
kurtosis: 11.38
n_outliers: 176
outlier_rate: 0.05598
zero_rate: 0
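
The treatment for STATEFP and COUNTYFP (zero-pad, then concatenate) can be sketched as follows. The `make_geoid` helper is a hypothetical name; the two example inputs correspond to the minimum and maximum GEOID values reported in this profile.

```python
def make_geoid(statefp, countyfp):
    """Zero-pad numeric FIPS parts and join them into a 5-character GEOID."""
    return f"{int(statefp):02d}{int(countyfp):03d}"

print(make_geoid(1, 1))    # prints 01001
print(make_geoid(56, 45))  # prints 56045
```

Keeping the result as a string preserves the leading zero that numeric storage drops for state codes below 10.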

COUNTYNS

numeric identifier

COUNTYNS is the Census Bureau's permanent numeric ANSI/GNIS identifier for U.S. counties: all 3144 values are unique with no nulls or zeros, and the range (23901 to 2830254) matches the GNIS ID space. The distribution is broad but unremarkable (skew 0.17, kurtosis -0.80), as expected for an ID code rather than a measurement. Treatment: Treat as a county-level key for joins; do not use as a numeric feature. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,144
min: 23,901
max: 2.83e+06
mean: 9.503e+05
median: 9.741e+05
std: 5.168e+05
q1: 4.85e+05
q3: 1.384e+06
iqr: 8.99e+05
skew: 0.1721
kurtosis: -0.8015
n_outliers: 11
outlier_rate: 0.003499
zero_rate: 0

GEOIDFQ

text identifier near_unique one_word allcaps short_text

This is the Census Bureau's fully qualified GEOID (GEOIDFQ) for U.S. counties: every value is exactly 14 characters, single-token and all-caps, consisting of the `0500000US` summary-level prefix followed by the 5-digit state+county FIPS code. All 3,144 rows are unique with no nulls or duplicates, matching the count of U.S. counties. Vocab size equals row count (3,144), confirming it is a pure identifier with no analytical signal of its own. Treatment: Use as a left-join key against Census geographies; do not feed into models. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,144
len_min: 14
len_max: 14
len_mean: 14
len_median: 14
len_p95: 14
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 3,144
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
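
Since GEOIDFQ is the `0500000US` prefix plus a 5-digit code, the state and county parts can be recovered by slicing. This is a minimal sketch; `split_geoidfq` is a hypothetical helper, and the validation rules are taken from the 14-character pattern described above.

```python
COUNTY_PREFIX = "0500000US"  # county summary-level prefix noted in the profile

def split_geoidfq(geoidfq):
    """Return (state_fips, county_fips, geoid) from a 14-character county GEOIDFQ."""
    if len(geoidfq) != 14 or not geoidfq.startswith(COUNTY_PREFIX):
        raise ValueError(f"not a county GEOIDFQ: {geoidfq!r}")
    geoid = geoidfq[len(COUNTY_PREFIX):]
    if not geoid.isdigit():
        raise ValueError(f"non-numeric FIPS part: {geoid!r}")
    return geoid[:2], geoid[2:], geoid

print(split_geoidfq("0500000US01001"))  # prints ('01', '001', '01001')
```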

GEOID

numeric identifier

GEOID is the 5-digit FIPS code identifying US counties: every one of the 3,144 rows is unique with no nulls, and the range 1001 to 56045 matches the state+county FIPS convention (Alabama through Wyoming). The near-zero skew (-0.08) and flat kurtosis (-1.10) reflect roughly uniform coverage across state codes rather than any meaningful distribution. Treating these as numbers is misleading—they are categorical keys. Treatment: Cast to zero-padded string and use as a join key to county-level geographies; do not model as numeric. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,144
min: 1,001
max: 56,045
mean: 3.037e+04
median: 29,174
std: 1.517e+04
q1: 1.817e+04
q3: 4.508e+04
iqr: 26,905
skew: -0.07923
kurtosis: -1.099
n_outliers: 0
outlier_rate: 0
zero_rate: 0

NAME

text label one_word short_text duplicates

This column holds short place names — almost certainly US county names, given the dominance of 'Washington' (31), 'Franklin' (26), 'Jefferson' (26), 'Lincoln' (24) and 'Madison' (20), all classic county namesakes. Values are overwhelmingly single-word (one_word_rate 0.934, word_mean 1.07) and short (len_mean 7.0, len_max 30), with no nulls. The 41.5% duplicate_rate is expected here: the same county name recurs across states, so 3144 rows collapse to 1838 unique strings. Treatment: Treat as a non-unique name; pair with a state/FIPS column before joining or grouping. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 1,838
len_min: 3
len_max: 30
len_mean: 7.05
len_median: 7
len_p95: 11
word_mean: 1.072
word_median: 1
n_empty: 0
n_duplicates: 1,306
duplicate_rate: 0.4154
vocab_size: 1,875
readability_flesch_mean: 36.74
emoji_rate: 0
url_rate: 0
one_word_rate: 0.9342
allcaps_rate: 0
boilerplate_rate: 0

NAMELSAD

text label short_text duplicates

This is the full legal name of US county-equivalents (NAMELSAD from Census TIGER), with 'county' appearing 2999 times alongside 64 'parish' and 47 'city' entries reflecting Louisiana and independent-city conventions. Names are short (mean 14.08 chars, ~2 words) and heavily duplicated across states: 1262 duplicates (40.1%) driven by repeated names like 'Washington County' (30), 'Jefferson County' (25), and 'Franklin County' (24). Only 1882 unique values across 3144 rows, so this field alone does not identify a county. Treatment: Use as a display label only; join on a state+FIPS code rather than this name to avoid duplicate collisions. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 1,882
len_min: 10
len_max: 46
len_mean: 14.08
len_median: 14
len_p95: 18
word_mean: 2.08
word_median: 2
n_empty: 0
n_duplicates: 1,262
duplicate_rate: 0.4014
vocab_size: 1,883
readability_flesch_mean: 35.29
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
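
The duplicate-name problem can be made concrete with a small collision check. The four rows below are a hand-picked illustrative sample, not an extract of the file, and `colliding_keys` is a hypothetical helper. The sample suggests that even state plus NAME can collide where a county and an independent city share a name, which is one reason to prefer a FIPS-based key.

```python
from collections import Counter

# Illustrative rows: (STUSPS, NAME, NAMELSAD).
rows = [
    ("AL", "Washington", "Washington County"),
    ("GA", "Washington", "Washington County"),
    ("VA", "Richmond", "Richmond County"),
    ("VA", "Richmond", "Richmond city"),
]

def colliding_keys(rows, key):
    """Set of key values that occur in more than one row."""
    counts = Counter(key(r) for r in rows)
    return {k for k, n in counts.items() if n > 1}

by_name = colliding_keys(rows, lambda r: r[1])
by_state_name = colliding_keys(rows, lambda r: (r[0], r[1]))
by_state_namelsad = colliding_keys(rows, lambda r: (r[0], r[2]))
print(by_name, by_state_name, by_state_namelsad)
```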

STUSPS

categorical foreign_key

STUSPS is the USPS two-letter state abbreviation, with 51 distinct values across 3,144 rows — consistent with US states plus DC at the county grain. Distribution matches county counts: TX leads at 254 (8.08%), followed by GA (159), VA (133), and KY (120). No nulls and high entropy ratio (0.93) indicate clean, well-spread categorical coverage. Treatment: left-join on this code to state-level reference tables, or one-hot/target-encode for modelling. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 51
top_value: TX
top_rate: 0.08079
cardinality: 51
entropy: 5.277
entropy_ratio: 0.9304

STATE_NAME

categorical feature

STATE_NAME holds US state labels across 3,144 rows with exactly 51 unique values (50 states plus likely DC) and zero nulls. The distribution mirrors county counts per state: Texas leads at 254 (8.1%), followed by Georgia (159) and Virginia (133), consistent with this being one row per US county. Entropy ratio of 0.93 indicates a fairly even spread across states given their natural county-count differences. Treatment: use as a categorical grouping key or one-hot/target-encode for modelling. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 51
top_value: Texas
top_rate: 0.08079
cardinality: 51
entropy: 5.277
entropy_ratio: 0.9304

LSAD

categorical metadata imbalance

LSAD is a Census Legal/Statistical Area Description code identifying the type of geographic entity for each of 3,144 rows. The distribution is extremely imbalanced: code '06' (county) accounts for 95.39% of rows, and the other 8 of the 9 distinct codes together cover under 5%, giving an entropy ratio of 0.117. The minor categories tail off quickly, from '15' (64 rows) and '25' (40) down to single-digit counts for 'PL', '03', '00' and '12'. Treatment: Collapse rare codes into an 'other' bucket or drop the column, since one value dominates. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 9
top_value: 06
top_rate: 0.9539
cardinality: 9
entropy: 0.3707
entropy_ratio: 0.1169
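
The entropy figures above can be checked from the nine LSAD counts listed earlier in this report. The sketch assumes the profiler reports Shannon entropy in bits and defines the ratio as entropy divided by log2 of the cardinality; that definition is inferred from the numbers, not documented.

```python
import math

# LSAD counts from the "Top values for LSAD" table (sum is 3,144).
lsad_counts = [2999, 64, 40, 13, 11, 9, 4, 2, 2]

def entropy_bits(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

h = entropy_bits(lsad_counts)
ratio = h / math.log2(len(lsad_counts))  # 1.0 would mean perfectly even codes
print(f"entropy: {h:.4f} bits, ratio: {ratio:.4f}")
```

Under that assumption the result should land close to the reported 0.3707 and 0.1169.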

ALAND

numeric feature high_skew outliers

ALAND looks like land-area in square meters for 3,144 unique geographic units (matching the U.S. county count), with no nulls or zeros. The distribution is extremely right-skewed (skew 26.8, kurtosis 953) — the median is 1.59B while the max reaches 377B, and 11.5% of rows flag as outliers. A handful of very large areas dominate the mean (2.91B) versus the median. Treatment: log-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,144
min: 5.3e+06
max: 3.771e+11
mean: 2.911e+09
median: 1.594e+09
std: 9.306e+09
q1: 1.116e+09
q3: 2.394e+09
iqr: 1.277e+09
skew: 26.82
kurtosis: 953.2
n_outliers: 362
outlier_rate: 0.1151
zero_rate: 0

AWATER

numeric feature high_skew outliers

AWATER is the standard Census TIGER field for water-area in square meters, here at what looks like county granularity given n=3144 unique values. The distribution is extremely right-skewed (skew 13.18, kurtosis 210.8): the median is 19.4M but the mean is 222M and the max reaches 25.99B, with 440 outliers (14.0% of rows). One row is zero, and all 3144 values are unique, so this behaves like a continuous geographic feature rather than a key. Treatment: Apply a log1p transform before any modelling to tame the 13.2 skew and heavy outlier tail. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,144
min: 0
max: 2.599e+10
mean: 2.22e+08
median: 1.939e+07
std: 1.241e+09
q1: 7.132e+06
q3: 5.946e+07
iqr: 5.233e+07
skew: 13.18
kurtosis: 210.8
n_outliers: 440
outlier_rate: 0.1399
zero_rate: 0.0003181
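
The reason for preferring log1p over a plain log here is the single zero-valued row. A minimal sketch, with `log_area` as a hypothetical helper name:

```python
import math

def log_area(square_metres):
    """log1p(x) = log(1 + x), which is defined at x = 0."""
    return math.log1p(square_metres)

print(log_area(0))  # prints 0.0

try:
    math.log(0)
except ValueError as exc:
    # A plain log fails on the zero-water row.
    print("math.log(0) failed:", exc)
```

For values in the millions of square metres, log1p and log are nearly identical, so the transform changes little apart from handling the zero.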

fips

numeric identifier

This is the 5-digit US county FIPS code: every one of the 3144 rows is unique, there are no nulls, and the range 1001–56045 spans the standard state+county encoding. The distribution is essentially uniform across the code space (skew −0.08, kurtosis −1.10), as expected for an identifier rather than a measurement. Treatment: left-join on this id; do not use as a numeric feature. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,144
min: 1,001
max: 56,045
mean: 3.037e+04
median: 29,174
std: 1.517e+04
q1: 1.817e+04
q3: 4.508e+04
iqr: 26,905
skew: -0.07923
kurtosis: -1.099
n_outliers: 0
outlier_rate: 0
zero_rate: 0
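
Since fips and GEOID share the same summary statistics, it may be worth confirming row by row that they are identical before dropping one. The sketch below runs on a hypothetical three-row extract; the `columns_identical` helper and the inline sample are illustrative only.

```python
import csv
import io

# Hypothetical three-row extract; the real file has 3,144 rows.
sample = io.StringIO("GEOID,fips\n1001,1001\n1003,1003\n56045,56045\n")
rows = list(csv.DictReader(sample))

def columns_identical(rows, a, b):
    """True when columns a and b agree in every row (compared as integers)."""
    return all(int(r[a]) == int(r[b]) for r in rows)

keys = []
if columns_identical(rows, "GEOID", "fips"):
    # Keep one copy as a zero-padded string key; the other column is redundant.
    keys = [f"{int(r['GEOID']):05d}" for r in rows]
print(keys)  # prints ['01001', '01003', '56045']
```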

active_duty_est

numeric feature high_skew outliers

Numeric estimate of active-duty population per record, with 3144 rows and 3028 unique values suggesting one row per geographic unit (likely county-level given the row count). The distribution is severely right-skewed (skew 13.14, kurtosis 288.57): median is 11698 but mean is 53782.95 and the max reaches 5240842, with 449 outliers (14.3%). No nulls or zeros, and the IQR of 27868 is dwarfed by a std of 176262.59. Treatment: log-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,028
min: 36
max: 5.241e+06
mean: 5.378e+04
median: 11,698
std: 1.763e+05
q1: 4,722
q3: 3.259e+04
iqr: 27,868
skew: 13.14
kurtosis: 288.6
n_outliers: 449
outlier_rate: 0.1428
zero_rate: 0

veterans_est

numeric feature high_skew outliers

Estimated veteran counts per row (likely US counties given n=3,144), ranging from 0 to 244,160 with a median of 1,547.5 but a mean of 5,419.5. The distribution is heavily right-skewed (skew 8.01, kurtosis 100.0) with 408 outliers (12.98%), reflecting a few highly populous areas dwarfing the rest. There are no nulls and only one zero-valued row, so coverage is essentially complete. Treatment: log1p-transform before modelling to tame the heavy right skew. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 2,424
min: 0
max: 244,160
mean: 5,419
median: 1,548
std: 1.311e+04
q1: 634.8
q3: 4,428
iqr: 3,793
skew: 8.014
kurtosis: 100
n_outliers: 408
outlier_rate: 0.1298
zero_rate: 0.0003181

total_pop

numeric feature high_skew outliers

Looks like a per-row population total across 3,144 rows (suggestive of US counties), with no nulls and 3,080 unique values. The distribution is severely right-skewed (skew 13.17, kurtosis 289.76): median is 25,784.5 but the mean is 105,310.94 and the max reaches 9,936,690, with 440 rows (14.0%) flagged as outliers. Min is 50 and zero_rate is 0, so every row carries a real count. Treatment: log-transform before regression to tame the heavy right tail. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,080
min: 50
max: 9.937e+06
mean: 1.053e+05
median: 2.578e+04
std: 3.338e+05
q1: 1.084e+04
q3: 6.808e+04
iqr: 57,244
skew: 13.17
kurtosis: 289.8
n_outliers: 440
outlier_rate: 0.1399
zero_rate: 0

active_duty_per_10k

numeric feature

A per-capita rate (active duty personnel per 10,000) reported across 3,144 rows with no nulls, no zeros, and every value unique. The distribution is tight around a mean of 4,693.79 and median of 4,733.28 with std 592.22 and is mildly left-skewed (-0.38); values range from 1,708.13 to 7,200.00, and 57 rows (1.81%) are flagged as outliers. The 3,144 row count strongly suggests one record per US county. Treatment: Use as-is as a continuous feature; the mild skew does not require transformation. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,144
min: 1,708
max: 7,200
mean: 4,694
median: 4,733
std: 592.2
q1: 4331
q3: 5102
iqr: 771.6
skew: -0.3768
kurtosis: 0.8418
n_outliers: 57
outlier_rate: 0.01813
zero_rate: 0
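
A mean near 4,700 per 10,000 would correspond to roughly 47% of residents, so it may be worth recomputing the rate from its inputs before relying on it. The sketch below assumes the rate is defined as active_duty_est divided by total_pop, times 10,000; that formula is an assumption, and the rows and the `rate_mismatches` helper are hypothetical.

```python
def per_10k(count, population):
    """Rate per 10,000 residents."""
    return 10_000 * count / population

def rate_mismatches(rows, tol=0.01):
    """Rows whose stored rate differs from the recomputed one by more than tol."""
    return [
        r for r in rows
        if abs(per_10k(r["active_duty_est"], r["total_pop"]) - r["active_duty_per_10k"]) > tol
    ]

# Hypothetical rows, not taken from the file; the second has a misplaced decimal.
rows = [
    {"active_duty_est": 4_700, "total_pop": 10_000, "active_duty_per_10k": 4_700.0},
    {"active_duty_est": 36, "total_pop": 50, "active_duty_per_10k": 720.0},
]
bad = rate_mismatches(rows)
print(len(bad), "row(s) disagree with the recomputed rate")
```

If the recomputed and stored rates agree on the real file, the next thing to confirm would be what active_duty_est actually counts in the source.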

veterans_per_100

numeric feature

This column reports veterans per 100 residents, with 3143 unique values across 3144 rows (likely one row per US county). Values range from 0 to 18.09 with a mean of 6.19 and median of 5.98, showing a mild right skew (0.88) and 125 outliers (~3.98%) on the high end. Only one row is zero, so the distribution is effectively continuous and well-populated. Treatment: Use as-is for modelling; optionally winsorize the upper ~4% outliers. high · anthropic:claude-opus-4-7

n: 3,144
nulls: 0 (0.0%)
unique: 3,143
min: 0
max: 18.09
mean: 6.19
median: 5.985
std: 1.998
q1: 4.92
q3: 7.136
iqr: 2.216
skew: 0.8797
kurtosis: 2.029
n_outliers: 125
outlier_rate: 0.03976
zero_rate: 0.0003181
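
The optional winsorizing step can be sketched using Tukey fences computed from the quartiles reported above. The profile does not state which outlier rule it uses, so the 1.5 × IQR rule here is an assumption, and both helper names are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize_upper(values, upper):
    """Clip values above the upper fence down to the fence."""
    return [min(v, upper) for v in values]

# Quartiles reported for veterans_per_100.
lower, upper = tukey_fences(4.92, 7.136)
print(f"fences: {lower:.3f} to {upper:.3f}")

# Hypothetical sample including the reported maximum of 18.09.
clipped = winsorize_upper([5.0, 9.8, 18.09], upper)
print(clipped)
```

With the reported quartiles the upper fence sits near 10.46, which is broadly in line with the long right tail visible in the histogram for this column.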