saturn·

veterans merged county analysis

source: /home/coolhand/html/datavis/data_trove/data/policy/veterans/merged_county_analysis.csv · 3,144 rows · 18 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 3,144 rows, one per U.S. county or county equivalent, combining Census geographic identifiers (GEOID, STATE_NAME, NAMELSAD, ALAND, AWATER) with veteran and active-duty military estimates and rate-normalized fields. The raw magnitude columns (total_pop, active_duty_est, veterans_est, ALAND) are extremely right-skewed, with skew values above 8 and hundreds of outliers each, so any analysis on them should use log-transformed or per-capita versions. The rate columns tell a cleaner story: active_duty_per_10k is roughly symmetric (skew -0.38, mean ~4,694 per 10k), while veterans_per_100 is mildly right-skewed (skew 0.88, mean 6.19, max 18.09) and is the better candidate for ranking counties. State coverage is uneven: Texas alone supplies 254 counties (8.1%), followed by Georgia (159) and Virginia (133), so unweighted county averages over-represent states with many counties. Note also that LSAD is heavily imbalanced (95% code '06') and that GEOID and fips have identical summary statistics and appear to be duplicates of each other.

citing: row_count · column_count · ALAND · total_pop · active_duty_est · veterans_est · active_duty_per_10k · veterans_per_100 · STATE_NAME · LSAD · NAMELSAD
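
The aggregation caveat can be made concrete. The sketch below, in plain Python, compares an unweighted mean of county rates with a pooled state rate built from the count columns. It assumes that veterans_per_100 equals 100 × veterans_est / total_pop, which this profile does not confirm; the function name and the two sample rows are invented for illustration.

```python
from collections import defaultdict


def state_veteran_rates(rows):
    """Return {state: (unweighted_mean_rate, pooled_rate)}, both per 100 residents."""
    sums = defaultdict(lambda: {"rate_sum": 0.0, "n": 0, "vets": 0, "pop": 0})
    for row in rows:
        s = sums[row["STATE_NAME"]]
        s["rate_sum"] += row["veterans_per_100"]
        s["n"] += 1
        s["vets"] += row["veterans_est"]
        s["pop"] += row["total_pop"]
    return {
        state: (s["rate_sum"] / s["n"], 100 * s["vets"] / s["pop"])
        for state, s in sums.items()
    }


# Hypothetical rows: one tiny county with a high rate, one large county with a low rate.
sample_rows = [
    {"STATE_NAME": "Example", "veterans_est": 100, "total_pop": 1_000, "veterans_per_100": 10.0},
    {"STATE_NAME": "Example", "veterans_est": 4_950, "total_pop": 99_000, "veterans_per_100": 5.0},
]

unweighted, pooled = state_veteran_rates(sample_rows)["Example"]
print(f"unweighted mean: {unweighted:.2f}, pooled rate: {pooled:.2f}")
```

In this toy case the tiny county pulls the unweighted mean well above the pooled rate, which is the effect to watch for in states with many small counties.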

Schema

18 columns
Per-column summary.

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| STATEFP | numeric | 0.0% | 51 | |
| COUNTYFP | numeric | 0.0% | 329 | high_skew, outliers |
| COUNTYNS | numeric | 0.0% | 3,144 | |
| GEOIDFQ | text | 0.0% | 3,144 | near_unique, one_word, allcaps, short_text |
| GEOID | numeric | 0.0% | 3,144 | |
| NAME | text | 0.0% | 1,838 | one_word, short_text, duplicates |
| NAMELSAD | text | 0.0% | 1,882 | short_text, duplicates |
| STUSPS | categorical | 0.0% | 51 | |
| STATE_NAME | categorical | 0.0% | 51 | |
| LSAD | categorical | 0.0% | 9 | imbalance |
| ALAND | numeric | 0.0% | 3,144 | high_skew, outliers |
| AWATER | numeric | 0.0% | 3,144 | high_skew, outliers |
| fips | numeric | 0.0% | 3,144 | |
| active_duty_est | numeric | 0.0% | 3,028 | high_skew, outliers |
| veterans_est | numeric | 0.0% | 2,424 | high_skew, outliers |
| total_pop | numeric | 0.0% | 3,080 | high_skew, outliers |
| active_duty_per_10k | numeric | 0.0% | 3,144 | |
| veterans_per_100 | numeric | 0.0% | 3,143 | |

STATEFP

numeric foreign_key
This is the US Census STATEFP code, the 2-digit state FIPS identifier, stored here as a number so leading zeros are lost and values appear with 1 or 2 digits. Values range from 1 to 56 with 51 unique entries across 3,144 rows, matching the count of US states plus DC, and the row count aligns with the number of US counties. The near-uniform spread (skew -0.08, kurtosis -1.10) and zero outliers are consistent with a categorical state code rather than a measured quantity. Treatment: Cast to a zero-padded 2-character string and treat as a categorical state key for joins, not a numeric feature. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 51
min: 1 · max: 56 · mean: 30.26 · median: 29 · std: 15.15
q1: 18 · q3: 45 · iqr: 27 · skew: -0.08128 · kurtosis: -1.099
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0

COUNTYFP

numeric identifier high_skew outliers
COUNTYFP is the 3-digit FIPS county code, stored numerically across 3144 rows with no nulls and 329 unique values. The distribution is heavily right-skewed (skew 2.84, kurtosis 11.4) with a max of 840 well beyond Q3 of 133.5, flagging 176 outliers — expected behavior since FIPS codes are categorical identifiers, not measurements, and high values correspond to specific county assignments. Treatment: Cast to zero-padded string and combine with STATEFP to form a 5-digit GEOID join key; do not treat as numeric. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 329
min: 1 · max: 840 · mean: 103.9 · median: 79 · std: 107.6
q1: 35 · q3: 133.5 · iqr: 98.5 · skew: 2.841 · kurtosis: 11.38
n_outliers: 176 · outlier_rate: 0.05598 · zero_rate: 0
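
The treatment for STATEFP and COUNTYFP amounts to zero-padding each part and concatenating them. A minimal sketch in plain Python follows; `make_geoid` is a hypothetical helper name, and the inputs are assumed to already be integers (values read from the CSV may need `int(...)` first).

```python
def make_geoid(statefp: int, countyfp: int) -> str:
    """Combine numeric state and county FIPS parts into a 5-character GEOID string."""
    if not 0 < statefp <= 99:
        raise ValueError(f"state code out of range: {statefp}")
    if not 0 < countyfp <= 999:
        raise ValueError(f"county code out of range: {countyfp}")
    # 2 digits for the state, 3 for the county, restoring the dropped leading zeros
    return f"{statefp:02d}{countyfp:03d}"


# The endpoints of the GEOID range reported in this profile (1001 and 56045)
print(make_geoid(1, 1))    # 01001
print(make_geoid(56, 45))  # 56045
```

Keeping the result as a string is the point: once the key is numeric again, `01001` and `1001` become indistinguishable and joins against zero-padded Census keys can silently miss.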

COUNTYNS

numeric identifier
COUNTYNS is the Census Bureau's permanent numeric ANSI/GNIS identifier for U.S. counties: all 3144 values are unique with no nulls or zeros, and the range (23901 to 2830254) matches the GNIS ID space. The distribution is broad but unremarkable (skew 0.17, kurtosis -0.80), as expected for an ID code rather than a measurement. Treatment: Treat as a county-level key for joins; do not use as a numeric feature. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,144
min: 23,901 · max: 2.83e+06 · mean: 9.503e+05 · median: 9.741e+05 · std: 5.168e+05
q1: 4.85e+05 · q3: 1.384e+06 · iqr: 8.99e+05 · skew: 0.1721 · kurtosis: -0.8015
n_outliers: 11 · outlier_rate: 0.003499 · zero_rate: 0

GEOIDFQ

text identifier near_unique one_word allcaps short_text
This is the Census Bureau's fully-qualified GEOID (GEOIDFQ) for U.S. counties: every value is exactly 14 characters, single-token, and all-caps, consisting of the 9-character `0500000US` summary-level prefix followed by the 5-digit state+county FIPS code. All 3,144 rows are unique with no nulls or duplicates, matching the count of U.S. counties. Vocab size equals row count (3,144), confirming it is a pure identifier with no analytical signal of its own. Treatment: Use as a left-join key against Census geographies; do not feed into models. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,144
len_min: 14 · len_max: 14 · len_mean: 14 · len_median: 14 · len_p95: 14 · word_mean: 1 · word_median: 1
n_empty: 0 · n_duplicates: 0 · duplicate_rate: 0 · vocab_size: 3,144
readability_flesch_mean: 121.2 · emoji_rate: 0 · url_rate: 0 · one_word_rate: 1 · allcaps_rate: 1 · boilerplate_rate: 0
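
Because GEOIDFQ is the `0500000US` prefix plus a 5-digit code, the short GEOID can be recovered by validating and stripping the prefix. The sketch below shows one way to do that in plain Python; `geoid_from_geoidfq` is a hypothetical helper, and the length and prefix checks are taken from the values described in this profile.

```python
COUNTY_PREFIX = "0500000US"  # summary-level prefix described in this profile


def geoid_from_geoidfq(value: str) -> str:
    """Strip the county prefix from a GEOIDFQ and return the 5-digit GEOID string."""
    if len(value) != 14 or not value.startswith(COUNTY_PREFIX):
        raise ValueError(f"not a county GEOIDFQ: {value!r}")
    geoid = value[len(COUNTY_PREFIX):]
    if not (geoid.isascii() and geoid.isdigit()):
        raise ValueError(f"non-numeric county code in {value!r}")
    return geoid


print(geoid_from_geoidfq("0500000US01001"))  # 01001
```

The result keeps its leading zero, so it can be compared directly with a zero-padded string version of the GEOID or fips column.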

GEOID

numeric identifier
GEOID is the 5-digit FIPS code identifying US counties: every one of the 3,144 rows is unique with no nulls, and the range 1001 to 56045 matches the state+county FIPS convention (Alabama through Wyoming). The near-zero skew (-0.08) and flat kurtosis (-1.10) reflect roughly uniform coverage across state codes rather than any meaningful distribution. Treating these as numbers is misleading—they are categorical keys. Treatment: Cast to zero-padded string and use as a join key to county-level geographies; do not model as numeric. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,144
min: 1,001 · max: 56,045 · mean: 3.037e+04 · median: 29,174 · std: 1.517e+04
q1: 1.817e+04 · q3: 4.508e+04 · iqr: 26,905 · skew: -0.07923 · kurtosis: -1.099
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0

NAME

text label one_word short_text duplicates
This column holds short place names — almost certainly US county names, given the dominance of 'Washington' (31), 'Franklin' (26), 'Jefferson' (26), 'Lincoln' (24) and 'Madison' (20), all classic county namesakes. Values are overwhelmingly single-word (one_word_rate 0.934, word_mean 1.07) and short (len_mean 7.0, len_max 30), with no nulls. The 41.5% duplicate_rate is expected here: the same county name recurs across states, so 3144 rows collapse to 1838 unique strings. Treatment: Treat as a non-unique name; pair with a state/FIPS column before joining or grouping. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 1,838
len_min: 3 · len_max: 30 · len_mean: 7.05 · len_median: 7 · len_p95: 11 · word_mean: 1.072 · word_median: 1
n_empty: 0 · n_duplicates: 1,306 · duplicate_rate: 0.4154 · vocab_size: 1,875
readability_flesch_mean: 36.74 · emoji_rate: 0 · url_rate: 0 · one_word_rate: 0.9342 · allcaps_rate: 0 · boilerplate_rate: 0
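
Before grouping or joining on NAME, it helps to check which key combinations are actually unique. The sketch below counts repeated keys in plain Python; `repeated_keys` and the four hand-typed sample rows are illustrative and not drawn from the profiled file. The sample includes a case where even name plus state is ambiguous (a county and an independent city sharing a name), which is why the FIPS-based key is the safer choice.

```python
from collections import Counter


def repeated_keys(rows, fields):
    """Return {key_tuple: count} for every key that occurs more than once."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative rows typed by hand
sample_rows = [
    {"GEOID": "01129", "NAME": "Washington", "STUSPS": "AL"},
    {"GEOID": "05143", "NAME": "Washington", "STUSPS": "AR"},
    {"GEOID": "51159", "NAME": "Richmond", "STUSPS": "VA"},  # Richmond County
    {"GEOID": "51760", "NAME": "Richmond", "STUSPS": "VA"},  # Richmond city
]

print(repeated_keys(sample_rows, ["NAME"]))            # both names repeat
print(repeated_keys(sample_rows, ["NAME", "STUSPS"]))  # Richmond, VA still repeats
print(repeated_keys(sample_rows, ["GEOID"]))           # empty: GEOID is unique
```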

NAMELSAD

text label short_text duplicates
This is the full legal name of US county-equivalents (NAMELSAD from Census TIGER), with 'county' appearing 2999 times alongside 64 'parish' and 47 'city' entries reflecting Louisiana and independent-city conventions. Names are short (mean 14.08 chars, ~2 words) and heavily duplicated across states: 1262 duplicates (40.1%) driven by repeated names like 'Washington County' (30), 'Jefferson County' (25), and 'Franklin County' (24). Only 1882 unique values across 3144 rows, so this field alone does not identify a county. Treatment: Use as a display label only; join on a state+FIPS code rather than this name to avoid duplicate collisions. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 1,882
len_min: 10 · len_max: 46 · len_mean: 14.08 · len_median: 14 · len_p95: 18 · word_mean: 2.08 · word_median: 2
n_empty: 0 · n_duplicates: 1,262 · duplicate_rate: 0.4014 · vocab_size: 1,883
readability_flesch_mean: 35.29 · emoji_rate: 0 · url_rate: 0 · one_word_rate: 0 · allcaps_rate: 0 · boilerplate_rate: 0

STUSPS

categorical foreign_key
STUSPS is the USPS two-letter state abbreviation, with 51 distinct values across 3,144 rows — consistent with US states plus DC at the county grain. Distribution matches county counts: TX leads at 254 (8.08%), followed by GA (159), VA (133), and KY (120). No nulls and high entropy ratio (0.93) indicate clean, well-spread categorical coverage. Treatment: left-join on this code to state-level reference tables, or one-hot/target-encode for modelling. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 51 · cardinality: 51
top_value: TX · top_rate: 0.08079 · entropy: 5.277 · entropy_ratio: 0.9304

STATE_NAME

categorical feature
STATE_NAME holds US state labels across 3,144 rows with exactly 51 unique values (50 states plus likely DC) and zero nulls. The distribution mirrors county counts per state: Texas leads at 254 (8.1%), followed by Georgia (159) and Virginia (133), consistent with this being one row per US county. Entropy ratio of 0.93 indicates a fairly even spread across states given their natural county-count differences. Treatment: use as a categorical grouping key or one-hot/target-encode for modelling. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 51 · cardinality: 51
top_value: Texas · top_rate: 0.08079 · entropy: 5.277 · entropy_ratio: 0.9304

LSAD

categorical metadata imbalance
LSAD is the Census Legal/Statistical Area Description code identifying the type of geographic entity for each of the 3,144 rows. The distribution is extremely imbalanced: code '06' (county) accounts for 95.39% of rows, and the remaining 4.61% (about 145 rows) are spread across the other 8 codes, such as '15', '25', and 'PL', giving an entropy ratio of 0.117. Treatment: Collapse rare codes into an 'other' bucket, or drop the column, since one value dominates. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 9 · cardinality: 9
top_value: 06 · top_rate: 0.9539 · entropy: 0.3707 · entropy_ratio: 0.1169
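
Collapsing rare LSAD codes can be done with a simple share threshold. This is a minimal sketch in plain Python; `collapse_rare`, the 2% threshold, and the 100-row sample are assumptions chosen for illustration, not values taken from the profiled file.

```python
from collections import Counter


def collapse_rare(codes, min_share=0.02, other="other"):
    """Replace codes whose share of rows is below min_share with `other`."""
    if not codes:
        return []
    counts = Counter(codes)
    total = len(codes)
    keep = {code for code, n in counts.items() if n / total >= min_share}
    return [code if code in keep else other for code in codes]


# Made-up sample: 95 county rows plus a few rarer codes
sample = ["06"] * 95 + ["15"] * 3 + ["25"] + ["PL"]
collapsed = Counter(collapse_rare(sample))
print(collapsed)
```

The threshold is a modelling choice: a lower value keeps more of the minor codes as separate categories, a higher one folds everything except '06' into the bucket.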

ALAND

numeric feature high_skew outliers
ALAND looks like land-area in square meters for 3,144 unique geographic units (matching the U.S. county count), with no nulls or zeros. The distribution is extremely right-skewed (skew 26.8, kurtosis 953) — the median is 1.59B while the max reaches 377B, and 11.5% of rows flag as outliers. A handful of very large areas dominate the mean (2.91B) versus the median. Treatment: log-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,144
min: 5.3e+06 · max: 3.771e+11 · mean: 2.911e+09 · median: 1.594e+09 · std: 9.306e+09
q1: 1.116e+09 · q3: 2.394e+09 · iqr: 1.277e+09 · skew: 26.82 · kurtosis: 953.2
n_outliers: 362 · outlier_rate: 0.1151 · zero_rate: 0
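
To see what the log transform does to ALAND, the sketch below applies a base-10 log to the summary values reported above. The choice of base 10 is an assumption made for readability (each unit is one order of magnitude); any log base tames the tail equally well. `log10_area` is a hypothetical helper.

```python
import math

# Summary values reported for ALAND in this profile, in square meters
aland_stats = {"min": 5.3e6, "median": 1.594e9, "max": 3.771e11}


def log10_area(square_meters: float) -> float:
    """Base-10 log of an area; valid only for strictly positive values."""
    if square_meters <= 0:
        raise ValueError("log transform needs a positive area")
    return math.log10(square_meters)


logged = {name: log10_area(value) for name, value in aland_stats.items()}
raw_ratio = aland_stats["max"] / aland_stats["min"]
log_span = logged["max"] - logged["min"]

print(f"max/min ratio on the raw scale: {raw_ratio:,.0f}")
print(f"max - min on the log10 scale:  {log_span:.2f}")
```

A plain log is safe here because the profile reports a zero_rate of 0 for ALAND; a column containing zeros would need a different transform.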

AWATER

numeric feature high_skew outliers
AWATER is the standard Census TIGER field for water-area in square meters, here at what looks like county granularity given n=3144 unique values. The distribution is extremely right-skewed (skew 13.18, kurtosis 210.8): the median is 19.4M but the mean is 222M and the max reaches 25.99B, with 440 outliers (14.0% of rows). One row is zero, and all 3144 values are unique, so this behaves like a continuous geographic feature rather than a key. Treatment: Apply a log1p transform before any modelling to tame the 13.2 skew and heavy outlier tail. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,144
min: 0 · max: 2.599e+10 · mean: 2.22e+08 · median: 1.939e+07 · std: 1.241e+09
q1: 7.132e+06 · q3: 5.946e+07 · iqr: 5.233e+07 · skew: 13.18 · kurtosis: 210.8
n_outliers: 440 · outlier_rate: 0.1399 · zero_rate: 0.0003181
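
The reason for log1p rather than a plain log is the single zero-valued row: log(0) is undefined, while log1p(0) is exactly 0. A small sketch in plain Python, using the summary values reported above as inputs (`log1p_area` is a hypothetical helper):

```python
import math


def log1p_area(square_meters: float) -> float:
    """Natural log of (1 + area), which is defined at zero."""
    if square_meters < 0:
        raise ValueError("area cannot be negative")
    return math.log1p(square_meters)


# Summary values reported for AWATER in this profile, in square meters
awater_stats = {"min": 0.0, "median": 1.939e7, "max": 2.599e10}
logged = {name: log1p_area(value) for name, value in awater_stats.items()}
print(logged)

# A plain log fails on the zero row
try:
    math.log(awater_stats["min"])
except ValueError as exc:
    print(f"math.log(0.0) raised: {exc}")
```

For values in the millions the added 1 is negligible, so the transformed column behaves like an ordinary log everywhere except at zero.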

fips

numeric identifier
This is the 5-digit US county FIPS code: every one of the 3144 rows is unique, there are no nulls, and the range 1001–56045 spans the standard state+county encoding. The distribution is essentially uniform across the code space (skew −0.08, kurtosis −1.10), as expected for an identifier rather than a measurement. Treatment: left-join on this id; do not use as a numeric feature. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,144
min: 1,001 · max: 56,045 · mean: 3.037e+04 · median: 29,174 · std: 1.517e+04
q1: 1.817e+04 · q3: 4.508e+04 · iqr: 26,905 · skew: -0.07923 · kurtosis: -1.099
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0
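
The fips statistics are identical to those reported for GEOID, which suggests the two columns are redundant, but matching summaries do not prove row-by-row equality. The sketch below checks that directly before dropping one of them; the helper names and the two sample rows are hypothetical.

```python
def columns_match(rows, left="GEOID", right="fips"):
    """True if the two columns hold the same integer code in every row."""
    return all(int(row[left]) == int(row[right]) for row in rows)


def drop_column(rows, name):
    """Return copies of the rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


# Hypothetical rows using the endpoints of the reported code range
sample_rows = [
    {"GEOID": 1001, "fips": 1001, "total_pop": 50},
    {"GEOID": 56045, "fips": 56045, "total_pop": 50},
]

if columns_match(sample_rows):
    sample_rows = drop_column(sample_rows, "fips")

print(sample_rows)
```

Comparing as integers sidesteps differences in zero-padding if one column has been cast to a string and the other has not.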

active_duty_est

numeric feature high_skew outliers
Numeric estimate of the active-duty population per record, with 3,144 rows and 3,028 unique values, suggesting one row per geographic unit (likely county-level given the row count). The distribution is severely right-skewed (skew 13.14, kurtosis 288.57): the median is 11,698 but the mean is 53,782.95 and the max reaches 5,240,842, with 449 outliers (14.3%). There are no nulls or zeros, and the IQR of 27,868 is dwarfed by a std of 176,262.59. Treatment: log-transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,028
min: 36 · max: 5.241e+06 · mean: 5.378e+04 · median: 11,698 · std: 1.763e+05
q1: 4,722 · q3: 3.259e+04 · iqr: 27,868 · skew: 13.14 · kurtosis: 288.6
n_outliers: 449 · outlier_rate: 0.1428 · zero_rate: 0

veterans_est

numeric feature high_skew outliers
Estimated veteran counts per row (likely US counties given n=3,144), ranging from 0 to 244,160 with a median of 1,547.5 but a mean of 5,419.5. The distribution is heavily right-skewed (skew 8.01, kurtosis 100.0) with 408 outliers (12.98%), reflecting a few highly populous areas dwarfing the rest. There are no nulls and only one zero-valued row (zero_rate 0.03%), so coverage is essentially complete. Treatment: log1p-transform before modelling to tame the heavy right skew while keeping the zero row defined. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 2,424
min: 0 · max: 244,160 · mean: 5,419 · median: 1,548 · std: 1.311e+04
q1: 634.8 · q3: 4,428 · iqr: 3,793 · skew: 8.014 · kurtosis: 100
n_outliers: 408 · outlier_rate: 0.1298 · zero_rate: 0.0003181

total_pop

numeric feature high_skew outliers
Looks like a per-row population total across 3,144 rows (suggestive of US counties), with no nulls and 3,080 unique values. The distribution is severely right-skewed (skew 13.17, kurtosis 289.76): median is 25,784.5 but the mean is 105,310.94 and the max reaches 9,936,690, with 440 rows (14.0%) flagged as outliers. Min is 50 and zero_rate is 0, so every row carries a real count. Treatment: log-transform before regression to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,080
min: 50 · max: 9.937e+06 · mean: 1.053e+05 · median: 2.578e+04 · std: 3.338e+05
q1: 1.084e+04 · q3: 6.808e+04 · iqr: 57,244 · skew: 13.17 · kurtosis: 289.8
n_outliers: 440 · outlier_rate: 0.1399 · zero_rate: 0

active_duty_per_10k

numeric feature
A per-capita rate (active-duty personnel per 10,000 residents) reported across 3,144 rows with no nulls, no zeros, and every value unique. Values range from 1,708.13 to 7,200.00 and cluster tightly around a mean of 4,693.79 and a median of 4,733.28 (std 592.22); the distribution is mildly left-skewed (-0.38) with 57 outliers (1.81%). The 3,144 row count strongly suggests one record per US county. Treatment: Use as-is as a continuous feature; the mild skew does not require transformation. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,144
min: 1,708 · max: 7,200 · mean: 4,694 · median: 4,733 · std: 592.2
q1: 4,331 · q3: 5,102 · iqr: 771.6 · skew: -0.3768 · kurtosis: 0.8418
n_outliers: 57 · outlier_rate: 0.01813 · zero_rate: 0
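
A rate column is only as trustworthy as its derivation, and a mean of about 4,694 per 10,000 corresponds to roughly 47 of every 100 residents, so it may be worth recomputing the rate from the count columns before interpreting it. The sketch below assumes active_duty_per_10k = 10,000 × active_duty_est / total_pop, which this profile does not state. The reported minima of the two count columns (36 and 50) happen to reproduce the reported maximum rate of 7,200, which is consistent with that formula but does not prove it. The helper names, placeholder GEOIDs, tolerance, and sample rows are all hypothetical.

```python
def rate_per(count: float, population: float, per: int) -> float:
    """Count per `per` residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return per * count / population


def rate_mismatches(rows, tolerance=0.5):
    """Return GEOIDs whose stored rate differs from the recomputed one."""
    bad = []
    for row in rows:
        expected = rate_per(row["active_duty_est"], row["total_pop"], 10_000)
        if abs(expected - row["active_duty_per_10k"]) > tolerance:
            bad.append(row["GEOID"])
    return bad


# Hypothetical rows; the third one stores a rate that does not match its counts.
sample_rows = [
    {"GEOID": "00001", "active_duty_est": 4_700, "total_pop": 10_000, "active_duty_per_10k": 4_700.0},
    {"GEOID": "00002", "active_duty_est": 36, "total_pop": 50, "active_duty_per_10k": 7_200.0},
    {"GEOID": "00003", "active_duty_est": 120, "total_pop": 10_000, "active_duty_per_10k": 4_700.0},
]

print(rate_per(36, 50, 10_000))      # 7200.0
print(rate_mismatches(sample_rows))  # ['00003']
```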

veterans_per_100

numeric feature
This column reports veterans per 100 residents, with 3143 unique values across 3144 rows (likely one row per US county). Values range from 0 to 18.09 with a mean of 6.19 and median of 5.98, showing a mild right skew (0.88) and 125 outliers (~3.98%) on the high end. Only one row is zero, so the distribution is effectively continuous and well-populated. Treatment: Use as-is for modelling; optionally winsorize the upper ~4% outliers. high · anthropic:claude-opus-4-7
n: 3,144 · nulls: 0 (0.0%) · unique: 3,143
min: 0 · max: 18.09 · mean: 6.19 · median: 5.985 · std: 1.998
q1: 4.92 · q3: 7.136 · iqr: 2.216 · skew: 0.8797 · kurtosis: 2.029
n_outliers: 125 · outlier_rate: 0.03976 · zero_rate: 0.0003181
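
The optional winsorizing step can be sketched as clipping values at an upper fence. The version below assumes the profiler's outlier flag follows the common Tukey rule (more than 1.5 × IQR beyond the quartiles); the report does not state its rule, so treat the fence as one reasonable choice rather than the one actually used. The function names are hypothetical, and the quartiles are the ones reported above.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_upper(values, upper):
    """Winsorize the high end only: cap anything above `upper`."""
    return [min(value, upper) for value in values]


# Quartiles reported for veterans_per_100 in this profile
lower, upper = tukey_fences(q1=4.92, q3=7.136)
print(f"upper fence: {upper:.2f}")

# Reported min, median and max, used here as sample inputs
print(clip_upper([0.0, 5.985, 18.09], upper))
```

Only the upper tail is capped, matching the suggestion to winsorize the high-end outliers; the single zero-valued row is left untouched.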