data trove us military veteran analysis
Reading
This is a 54-row, state-level dataset that merges U.S. military and veteran demographics with firearm licensing, suicide rates, and installation-level economic data. The most striking signal is the veteran suicide rate (mean 35.6, range 24.9–52.3), roughly double the civilian suicide rate (mean 17.2, range 7.7–28.9); the veteran_risk_ratio column quantifies this gap directly (mean 2.2x) across states. A second area worth scrutiny is the extreme right skew in active_duty_per_100k (median 92, max 5,544) and ffl_per_100k (median 12, max 342), which suggests that a handful of states, likely those hosting large installations, are pulling these distributions hard. The installation-level columns (county, installation, economic impact) are about 78% null, so only about 22% of rows (12 records) carry installation-linked data. Analysts should examine how firearm density and military concentration interact with veteran mental health outcomes across states.
citing: veteran_suicide_rate.stats.mean · veteran_suicide_rate.stats.max · civilian_suicide_rate.stats.mean · veteran_risk_ratio.stats.mean · active_duty_per_100k.stats.median · active_duty_per_100k.stats.max · active_duty_per_100k.alerts · ffl_per_100k.stats.median · ffl_per_100k.stats.max · county.null_rate · installation.null_rate · ptsd_prevalence_pct.null_rate · row_count
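The "roughly double" claim can be checked by hand. The sketch below assumes that veteran_risk_ratio is the veteran rate divided by the civilian rate in the same row; the profile does not state the formula, so `risk_ratio` is a hypothetical helper. It also illustrates why the mean of per-state ratios (2.2) need not equal the ratio of the two column means.

```python
def risk_ratio(veteran_rate: float, civilian_rate: float) -> float:
    """Assumed definition: veteran rate divided by civilian rate."""
    if civilian_rate <= 0:
        raise ValueError("civilian_rate must be positive")
    return veteran_rate / civilian_rate

# Ratio of the two column means reported in the profile.
ratio_of_means = risk_ratio(35.64, 17.22)

# Two illustrative (made-up) states as (veteran_rate, civilian_rate) pairs.
states = [(30.0, 10.0), (40.0, 25.0)]
mean_of_ratios = sum(risk_ratio(v, c) for v, c in states) / len(states)

# Column means of the two illustrative states are 35.0 and 17.5.
ratio_of_state_means = risk_ratio(35.0, 17.5)

print(round(ratio_of_means, 2), round(mean_of_ratios, 2), ratio_of_state_means)
```

Averaging per-state ratios gives more weight to states with a low civilian rate, which is one plausible reason the reported mean ratio sits above the ratio of the means.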
Charts the summary said to look at first
Histogram: veteran_suicide_rate
| bin | count |
|---|---|
| 24.9 – 28.81 | 12 |
| 28.81 – 32.73 | 10 |
| 32.73 – 36.64 | 10 |
| 36.64 – 40.56 | 8 |
| 40.56 – 44.47 | 6 |
| 44.47 – 48.39 | 4 |
| 48.39 – 52.3 | 4 |
Histogram: civilian_suicide_rate
| bin | count |
|---|---|
| 7.7 – 10.73 | 10 |
| 10.73 – 13.76 | 9 |
| 13.76 – 16.79 | 7 |
| 16.79 – 19.81 | 9 |
| 19.81 – 22.84 | 7 |
| 22.84 – 25.87 | 7 |
| 25.87 – 28.9 | 5 |
Histogram: ffl_per_100k
| bin | count |
|---|---|
| 1.377 – 49.99 | 41 |
| 49.99 – 98.6 | 4 |
| 98.6 – 147.2 | 2 |
| 147.2 – 195.8 | 1 |
| 195.8 – 244.4 | 2 |
| 244.4 – 293 | 0 |
| 293 – 341.7 | 3 |
Histogram: va_utilization_pct
| bin | count |
|---|---|
| 13.8 – 17.87 | 8 |
| 17.87 – 21.94 | 9 |
| 21.94 – 26.01 | 10 |
| 26.01 – 30.09 | 7 |
| 30.09 – 34.16 | 8 |
| 34.16 – 38.23 | 7 |
| 38.23 – 42.3 | 5 |
Top values: state
| value | count | share |
|---|---|---|
| California | 3 | 5.6% |
| North Carolina | 2 | 3.7% |
| Texas | 2 | 3.7% |
| Alabama | 1 | 1.9% |
| Alaska | 1 | 1.9% |
| Arizona | 1 | 1.9% |
| Arkansas | 1 | 1.9% |
| Colorado | 1 | 1.9% |
| Connecticut | 1 | 1.9% |
| Delaware | 1 | 1.9% |
| Florida | 1 | 1.9% |
| Georgia | 1 | 1.9% |
| Hawaii | 1 | 1.9% |
| Idaho | 1 | 1.9% |
| Illinois | 1 | 1.9% |
| Indiana | 1 | 1.9% |
| Iowa | 1 | 1.9% |
| Kansas | 1 | 1.9% |
| Kentucky | 1 | 1.9% |
| Louisiana | 1 | 1.9% |
Schema
23 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| NAME | categorical | 1.9% | 49 | long_tail |
| state | categorical | 0.0% | 50 | long_tail |
| veteran_population | numeric | 1.9% | 49 | |
| total_population | numeric | 1.9% | 49 | |
| veteran_percentage | numeric | 1.9% | 49 | high_skew, outliers |
| active_duty_personnel | numeric | 0.0% | 50 | outliers |
| ownership_percentage | numeric | 0.0% | 48 | outliers |
| ffl_count | numeric | 0.0% | 50 | outliers |
| veteran_suicide_rate | numeric | 0.0% | 50 | |
| civilian_suicide_rate | numeric | 0.0% | 50 | |
| veteran_risk_ratio | numeric | 0.0% | 41 | |
| ptsd_prevalence_pct | numeric | 46.3% | 25 | null_rate |
| va_users_with_ptsd_pct | numeric | 46.3% | 25 | null_rate |
| spouse_unemployment_rate | numeric | 64.8% | 15 | null_rate, outliers |
| spouse_labor_force_participation | numeric | 64.8% | 15 | null_rate |
| va_utilization_pct | numeric | 0.0% | 50 | |
| installation | categorical | 77.8% | 12 | long_tail, null_rate |
| county | categorical | 77.8% | 10 | long_tail, null_rate |
| annual_economic_impact_millions | numeric | 77.8% | 12 | null_rate |
| total_personnel | numeric | 77.8% | 12 | null_rate |
| direct_jobs | numeric | 77.8% | 12 | null_rate |
| ffl_per_100k | numeric | 1.9% | 49 | high_skew, outliers |
| active_duty_per_100k | numeric | 1.9% | 49 | high_skew, outliers |
NAME
categorical label long_tail
This column contains U.S. state names and serves as a geographic label in a small dataset of 54 rows. With 49 unique values and an entropy ratio of 0.991, cardinality is near-maximal: almost every row has a distinct state name. The top value, California, appears only 3 times (5.66% of the 53 non-null rows), and the long_tail alert reflects that most values appear just once, which suggests a nearly one-row-per-state table with a handful of repeats.
Treatment: Use as a geographic join key or group-by label; deduplicate or aggregate the 7 rows belonging to the 3 repeated states (California ×3, North Carolina ×2, Texas ×2) before any state-level analysis, and resolve the single null.
- n: 54
- nulls: 1 (1.9%)
- unique: 49
- top_value: California
- top_rate: 0.0566
- cardinality: 49
- entropy: 5.563
- entropy_ratio: 0.9907
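A minimal sketch of the deduplication step, assuming the repeated rows for a state carry identical state-level values and differ only in installation-level fields (the profile suggests this but does not confirm it). The helper name `one_row_per_state` and the sample rows are hypothetical.

```python
def one_row_per_state(rows, key="state"):
    """Keep the first row seen for each state; later repeats are dropped."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())

# Illustrative rows only: the values are made up, not taken from the dataset.
rows = [
    {"state": "California", "veteran_suicide_rate": 30.1, "installation": "A"},
    {"state": "California", "veteran_suicide_rate": 30.1, "installation": "B"},
    {"state": "Texas", "veteran_suicide_rate": 33.4, "installation": None},
]
state_level = one_row_per_state(rows)
print(len(state_level))
```

If the repeated rows turn out to disagree on a state-level column, aggregating (for example, taking the mean) would be safer than keeping the first row.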
state
categorical label long_tail
This column contains U.S. state names, with 50 unique values across only 54 rows: nearly one entry per state, suggesting near-complete national coverage. California appears 3 times (top_rate 5.6%), North Carolina and Texas appear twice each, and the remaining 47 states appear exactly once. The entropy ratio of 0.991 confirms an almost flat distribution; the long_tail alert is technically triggered but is largely an artifact of the tiny dataset size rather than meaningful concentration.
Treatment: Use as a categorical grouping key or as a join/filter dimension for geographic aggregation. Encode with care if modelling: with about one row per state, target encoding would simply memorize the target.
- n: 54
- nulls: 0 (0.0%)
- unique: 50
- top_value: California
- top_rate: 0.05556
- cardinality: 50
- entropy: 5.593
- entropy_ratio: 0.9909
veteran_population
numeric feature
This column holds veteran population counts at the U.S. state level, with a plausible range of 61,090 to 1,786,891 across 54 rows. The distribution is remarkably symmetric (skew ≈ 0.009) and platykurtic (kurtosis ≈ −1.35): values are spread broadly across the range with no sharp central peak, and no outliers are detected. The IQR of 988,439 relative to a mean of ~820,444 indicates substantial spread across geographies, consistent with large population differences between small and large states.
Treatment: Use as-is, or normalize by total_population to create a veteran share before modelling.
- n: 54
- nulls: 1 (1.9%)
- unique: 49
- min: 61,090
- max: 1.787e+06
- mean: 8.204e+05
- median: 811,743
- std: 5.29e+05
- q1: 279,178
- q3: 1.268e+06
- iqr: 988,439
- skew: 0.009014
- kurtosis: -1.347
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
total_population
numeric feature
This column holds total population counts at the U.S. state level, consistent with the value range (min 548,984 to max 39,227,468) and the 54-row table keyed by state. The distribution is flat and near-uniform: kurtosis of -1.24 indicates lighter tails than normal, skew is near zero (0.093), and the IQR of 22,033,717 spans more than half the full range, confirming wide spread without outliers. There are 4 repeated values among the 53 non-null rows (49 unique) and a 1.85% null rate (1 missing record) worth investigating.
Treatment: Use as-is, or as the denominator for per-capita rates; investigate the 1 null row and the 4 repeated values before joining or aggregating.
- n: 54
- nulls: 1 (1.9%)
- unique: 49
- min: 548,984
- max: 3.923e+07
- mean: 1.87e+07
- median: 1.958e+07
- std: 1.247e+07
- q1: 6.898e+06
- q3: 2.893e+07
- iqr: 2.203e+07
- skew: 0.09278
- kurtosis: -1.243
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
veteran_percentage
numeric feature high_skew outliers
This column represents veteran percentage, most likely the share (%) of veterans in each state's population. The distribution is severely right-skewed (skew=6.03, kurtosis=37.997) with a median of 4.85% but a mean of 14.09%, driven by 9 outliers (17% of non-null rows) including a maximum of 277.05. That value cannot be a valid percentage and almost certainly reflects a data quality issue such as a raw count, a decimal-point error, or a different unit entirely. The std of 38.90 dwarfs the IQR of 5.89, confirming extreme contamination from these outliers.
Treatment: Investigate and correct or cap values exceeding 100 (max=277.05 is impossible for a percentage), then consider a log-transform or winsorization before modelling.
- n: 54
- nulls: 1 (1.9%)
- unique: 49
- min: 0.22
- max: 277.1
- mean: 14.09
- median: 4.85
- std: 38.9
- q1: 1.88
- q3: 7.77
- iqr: 5.89
- skew: 6.031
- kurtosis: 38
- n_outliers: 9
- outlier_rate: 0.1698
- zero_rate: 0
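One way to investigate the impossible values is to flag anything outside 0–100 and recompute the share from the two population columns. This is a sketch under the assumption that veteran_percentage is meant to equal 100 × veteran_population / total_population; the profile does not state how the column was derived, and both helper names are hypothetical.

```python
def veteran_share_pct(veteran_population, total_population):
    """Recompute the veteran share as a percentage of total population."""
    return 100.0 * veteran_population / total_population

def invalid_percentages(values):
    """Return the non-null values that fall outside the 0-100 range."""
    return [v for v in values if v is not None and not 0 <= v <= 100]

# Median, q3 and max from the profile, plus one null.
reported = [4.85, 7.77, 277.05, None]
flagged = invalid_percentages(reported)

# Hypothetical counts, used only to show the recomputation.
recomputed = veteran_share_pct(50_000, 1_000_000)
print(flagged, recomputed)
```

Comparing the recomputed share with the reported one row by row would show whether the outliers come from a unit error or from a wrong denominator.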
active_duty_personnel
numeric feature outliers
This column represents the count of active duty military personnel per record (per state, given the dataset's state key), with 54 observations and no nulls. The distribution is heavily right-skewed (skew=1.59) with a median of 11,584 but a mean of 34,003, driven by a long upper tail stretching to 162,362. Eight records (14.8% of the dataset) are flagged as outliers, indicating a small number of states with disproportionately large forces. The IQR of 38,504 vs. a std of 46,995 confirms the spread is dominated by extreme high-end values.
Treatment: Log-transform before regression or clustering to reduce the impact of high-value outliers.
- n: 54
- nulls: 0 (0.0%)
- unique: 50
- min: 1,166
- max: 162,362
- mean: 3.4e+04
- median: 11,584
- std: 4.7e+04
- q1: 3884
- q3: 4.239e+04
- iqr: 3.85e+04
- skew: 1.594
- kurtosis: 1.291
- n_outliers: 8
- outlier_rate: 0.1481
- zero_rate: 0
ownership_percentage
numeric feature outliers
This column most plausibly records a firearm ownership rate per state, given the dataset's firearm-licensing columns; the name alone does not say what is owned, so confirm the definition against the source. Values range from 14.7% to 66.3% across 54 records with no nulls. The distribution is moderately left-skewed (skew = -0.69), with the middle half of values between Q1 40.05% and Q3 51.4%. The 5 outliers (~9.3% of rows) sit in the lower tail: the lower IQR fence is about 23.0, while the maximum of 66.3% is inside the upper fence. Kurtosis ≈ 0 indicates tails close to those of a normal distribution, not an unusually flat shape.
Treatment: Use as-is or clip outliers at the IQR fences before modelling; confirm what the percentage measures before interpreting it.
- n: 54
- nulls: 0 (0.0%)
- unique: 48
- min: 14.7
- max: 66.3
- mean: 43.58
- median: 45.75
- std: 13.04
- q1: 40.05
- q3: 51.4
- iqr: 11.35
- skew: -0.6887
- kurtosis: -0.002081
- n_outliers: 5
- outlier_rate: 0.09259
- zero_rate: 0
ffl_count
numeric feature outliers
This column represents a count of Federal Firearms Licensees (FFL) per state, across 54 observations with no nulls. The distribution is right-skewed (skew = 1.66) with a wide IQR of 2496.5 and a standard deviation (2772.66) nearly equal to the mean (3073.11), signaling high dispersion. Five outliers (≈9.3% of rows) pull the tail toward the maximum of 10904, while the minimum is 220 and the median only 2096.5, confirming a long upper tail. The kurtosis of 2.0 is moderate, suggesting the outliers are notable but not extreme relative to a normal distribution.
Treatment: Log-transform before regression or modelling to reduce right skew; investigate the 5 outlier rows for data integrity.
- n: 54
- nulls: 0 (0.0%)
- unique: 50
- min: 220
- max: 10,904
- mean: 3073
- median: 2096
- std: 2773
- q1: 1300
- q3: 3,796
- iqr: 2496
- skew: 1.66
- kurtosis: 2.002
- n_outliers: 5
- outlier_rate: 0.09259
- zero_rate: 0
veteran_suicide_rate
numeric numeric_target
This column contains veteran suicide rates, likely per 100,000 veterans, covering 54 observations. The 50 unique values match the 50 distinct entries in the state column, so the extra 4 rows are most likely the repeated California, North Carolina, and Texas rows rather than territories or summary rows. Values range from 24.9 to 52.3 with a mean of 35.6 and median of 34.65, indicating a mildly right-skewed distribution (skew=0.498). The spread is wide, with an IQR of ~10.85 and a maximum roughly double the minimum, highlighting substantial geographic disparity in veteran suicide rates, yet no statistical outliers were flagged.
Treatment: Use as-is, or apply a mild log-transform if residuals show heteroscedasticity; investigate the 4 duplicate values among 54 rows before modelling.
- n: 54
- nulls: 0 (0.0%)
- unique: 50
- min: 24.9
- max: 52.3
- mean: 35.64
- median: 34.65
- std: 7.393
- q1: 29.8
- q3: 40.65
- iqr: 10.85
- skew: 0.4978
- kurtosis: -0.6953
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
civilian_suicide_rate
numeric feature
This column represents a civilian suicide rate (likely per 100,000 population) across 54 records, with no nulls and no zeros. The distribution is notably well-behaved: near-zero skew (0.17), platykurtic shape (kurtosis −1.09), and no detected outliers, suggesting values spread broadly but uniformly between 7.7 and 28.9. The mean (17.22) and median (17.1) are nearly identical, and the IQR of 9.8 covers a substantial range, indicating genuine cross-state variation rather than clustering.
Treatment: Use as-is in modelling; no transformation needed given the near-symmetric distribution and absence of outliers.
- n: 54
- nulls: 0 (0.0%)
- unique: 50
- min: 7.7
- max: 28.9
- mean: 17.22
- median: 17.1
- std: 6.026
- q1: 12.2
- q3: 22
- iqr: 9.8
- skew: 0.1737
- kurtosis: -1.092
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
veteran_risk_ratio
numeric feature
This column represents a risk ratio for veterans, most plausibly the veteran suicide rate relative to the civilian rate, with all 54 rows populated and no outliers detected. Values range from 1.8 to 3.23, with a mean of ~2.20 and median of 2.025, indicating veterans in this dataset are consistently at elevated risk (all values above 1.0 by a wide margin). The distribution is moderately right-skewed (skew ≈ 0.95) with a relatively tight IQR of 0.59: the middle half of observations falls in the 1.85–2.44 range, and a tail of higher-risk cases pulls the mean upward. The slightly negative kurtosis (≈ -0.31) and 41 unique values out of 54 rows suggest a continuous derived metric rather than a categorised score.
Treatment: Use as-is, or apply a mild log-transform to reduce right skew before regression or classification modelling.
- n: 54
- nulls: 0 (0.0%)
- unique: 41
- min: 1.8
- max: 3.23
- mean: 2.197
- median: 2.025
- std: 0.4101
- q1: 1.85
- q3: 2.44
- iqr: 0.59
- skew: 0.9459
- kurtosis: -0.3143
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
ptsd_prevalence_pct
numeric feature null_rate
This column captures PTSD prevalence as a percentage, likely drawn from epidemiological or clinical survey data across 54 records. The most striking issue is a 46.3% null rate: nearly half the rows are missing, which severely limits usability and warrants investigation into whether missingness is systematic (e.g., tied to a subgroup or data source). Among the 29 non-null values, the distribution is fairly compact (min 6.3, max 12.3, IQR 3.0) with a kurtosis of −1.05, suggesting a roughly uniform spread rather than a peaked central cluster. Only 25 unique values across 29 non-null rows implies some repeated percentage figures, possibly due to rounding or grouped reporting.
Treatment: Investigate the missingness mechanism before imputing; if MAR, impute with a group-level median; if MNAR, model missingness as a separate binary indicator.
- n: 54
- nulls: 25 (46.3%)
- unique: 25
- min: 6.3
- max: 12.3
- mean: 8.621
- median: 8.3
- std: 1.848
- q1: 7.1
- q3: 10.1
- iqr: 3
- skew: 0.4299
- kurtosis: -1.049
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
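The two treatment options can be combined: impute with a median and keep a flag that records which rows were missing. The sketch below uses a single overall median for simplicity (a group-level median would need a grouping column); the helper name and sample values are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill nulls with the median of observed values and flag imputed rows."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing

# Illustrative values only (the profile's min, median and max, plus two nulls).
values = [6.3, None, 8.3, 12.3, None]
imputed, was_missing = impute_with_indicator(values)
print(imputed, was_missing)
```

With a 46.3% null rate, nearly half the column would be the fill value after imputation, so keeping the indicator lets a model tell imputed rows apart from observed ones.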
va_users_with_ptsd_pct
numeric feature null_rate
This column represents the percentage of VA users diagnosed with PTSD, aggregated at the state level like the rest of the 54-row table. The most surprising signal is the 46.3% null rate: nearly half the rows are missing, which severely limits usability and warrants investigation into whether missingness is systematic (e.g., certain regions not reporting). Among observed values, the distribution is fairly uniform (kurtosis –1.13, mild skew of 0.32), ranging from 7.9% to 18.5% with no outliers, suggesting genuine geographic variation rather than data error.
Treatment: Investigate the missingness pattern before use; impute or subset to complete cases, then use as-is (no transform needed given the near-symmetric, outlier-free distribution).
- n: 54
- nulls: 25 (46.3%)
- unique: 25
- min: 7.9
- max: 18.5
- mean: 12.26
- median: 11.9
- std: 3.315
- q1: 9.5
- q3: 14.8
- iqr: 5.3
- skew: 0.3168
- kurtosis: -1.132
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
spouse_unemployment_rate
numeric feature null_rate outliers
This column most plausibly records a military spouse unemployment rate (percent) for the state, given that every row in the table is keyed by state rather than by survey respondent. It is missing for 64.81% of records (35 of 54), which more likely means no figure was reported for those states than anything about individual households; confirm against the source. Among the 19 non-null observations, values range from 7.35 to 16.28 with a mean of ~10.46 and median of ~10.34. The distribution is moderately right-skewed (skew 0.77, kurtosis 0.08), and one outlier at 16.28 sits just beyond the upper fence. Only 15 unique values across 19 observations indicates that some figures repeat.
Treatment: Add a binary missingness indicator; impute or subset to non-null rows for any spouse-specific analysis; investigate the single outlier at 16.28 before inclusion.
- n: 54
- nulls: 35 (64.8%)
- unique: 15
- min: 7.35
- max: 16.28
- mean: 10.46
- median: 10.34
- std: 2.382
- q1: 8.59
- q3: 11.45
- iqr: 2.86
- skew: 0.7742
- kurtosis: 0.07957
- n_outliers: 1
- outlier_rate: 0.05263
- zero_rate: 0
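The "upper fence" mentioned above can be reproduced from the quartiles. The sketch assumes the profiler flags outliers with Tukey's 1.5 × IQR rule; the report does not name its rule, but the reported numbers are consistent with it. `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# q1, q3, min and max of spouse_unemployment_rate from the profile.
lower, upper = tukey_fences(8.59, 11.45)
max_is_outlier = 16.28 > upper
min_is_outlier = 7.35 < lower
print(round(lower, 2), round(upper, 2), max_is_outlier, min_is_outlier)
```

Under this assumption the upper fence is about 15.74, so 16.28 is flagged while the minimum of 7.35 is not.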
spouse_labor_force_participation
numeric feature null_rate
This column represents the labor force participation rate (likely as a percentage) of military spouses. The most striking feature is a null rate of 64.81%: nearly two-thirds of the 54 rows are missing, triggering an alert and severely limiting usability. Among the 19 non-null observations, values are tightly clustered between 66.8 and 73.4 with a mean of ~69.7 and an IQR of only 2.2, suggesting very low variance and minimal outlier risk within the observed subset.
Treatment: Investigate the missingness mechanism before use; if MAR/MCAR, consider imputation with caution given only 19 valid observations across 15 unique values.
- n: 54
- nulls: 35 (64.8%)
- unique: 15
- min: 66.8
- max: 73.4
- mean: 69.69
- median: 69.8
- std: 1.68
- q1: 68.35
- q3: 70.55
- iqr: 2.2
- skew: 0.3659
- kurtosis: -0.3936
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
va_utilization_pct
numeric feature
This column represents a utilization percentage for VA (likely Veterans Affairs) services, expressed as a numeric rate across 54 records. The distribution is notably uniform and platykurtic (kurtosis ≈ −1.05): values are spread broadly and flatly across the range 13.8–42.3 with no outliers and near-zero skew (0.165). The mean (27.04) and median (26.2) are closely aligned, and the IQR of 12.48 spans a moderate band, suggesting this metric reflects genuine variation across states rather than a concentrated signal.
Treatment: Use as-is in modelling; no transformation needed given the near-symmetric, outlier-free distribution.
- n: 54
- nulls: 0 (0.0%)
- unique: 50
- min: 13.8
- max: 42.3
- mean: 27.04
- median: 26.2
- std: 7.952
- q1: 21
- q3: 33.48
- iqr: 12.48
- skew: 0.1651
- kurtosis: -1.051
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
installation
categorical label long_tail null_rate
This column records U.S. military installation names (bases and airfields) associated with records in the dataset. The most striking issue is that 77.78% of the 54 rows are null, leaving only 12 non-null values, each appearing exactly once, which yields a perfectly uniform distribution (entropy_ratio = 1.0) with no dominant installation. The long_tail alert is somewhat misleading given the uniformity; the real concern is the extreme missingness, which severely limits analytical utility.
Treatment: Investigate the source of the 77.78% nulls before use; if imputation is not feasible, treat as a sparse categorical feature or exclude it from models that depend on this column.
- n: 54
- nulls: 42 (77.8%)
- unique: 12
- top_value: Naval Base San Diego, CA
- top_rate: 0.08333
- cardinality: 12
- entropy: 3.585
- entropy_ratio: 1
county
categorical metadata long_tail null_rate
This column represents a US county name associated with each record, likely the county of the row's installation. The most striking issue is a 77.78% null rate: only 12 of 54 rows have a value at all, rendering the column nearly empty. Among the 12 non-null values, cardinality is 10 with a near-uniform distribution (entropy ratio 0.979), meaning almost every populated entry is a distinct county, with only San Diego and Cumberland appearing twice.
Treatment: Investigate the source of missing values before use; with 77.78% nulls the column is unreliable as a feature without significant imputation or enrichment.
- n: 54
- nulls: 42 (77.8%)
- unique: 10
- top_value: San Diego
- top_rate: 0.1667
- cardinality: 10
- entropy: 3.252
- entropy_ratio: 0.9788
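The entropy figures for this column can be reproduced from the counts described above (two counties appearing twice, eight appearing once). The sketch assumes the profiler reports base-2 Shannon entropy and divides by log2 of the cardinality to get the ratio; that definition is not stated in the report, but it matches the reported 3.252 and 0.9788. The placeholder county names are hypothetical.

```python
import math
from collections import Counter

def entropy_stats(values):
    """Return (Shannon entropy in bits, entropy / log2(cardinality))."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Guard against a single-category column, where log2(1) is zero.
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return entropy, entropy / max_entropy

# Two counties twice, eight placeholder counties once, and 42 nulls.
counties = ["San Diego"] * 2 + ["Cumberland"] * 2
counties += [f"county_{i}" for i in range(8)] + [None] * 42
entropy, ratio = entropy_stats(counties)
print(round(entropy, 3), round(ratio, 4))
```

The same function applied to 12 distinct values gives log2(12) ≈ 3.585 and a ratio of 1, which matches the installation column.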
annual_economic_impact_millions
numeric feature null_rate
This column records estimated annual economic impact in millions (the currency is not stated, presumably U.S. dollars), but 77.78% of the 54 rows are null: only 12 rows carry a value, and all 12 values are distinct. The non-null values span 15,000 to 41,000 with a mean of ~23,833 and an IQR of 10,250, indicating a wide but plausible spread; skew is moderate (0.87) and no outliers are flagged. The null count matches the installation column (42 rows), which suggests the field is populated only for rows tied to a named installation rather than missing at random.
Treatment: Investigate the source of the 77.78% nulls before use; if imputation is not justified, treat as a sparse auxiliary feature or drop it, depending on model tolerance for missingness.
- n: 54
- nulls: 42 (77.8%)
- unique: 12
- min: 15,000
- max: 41,000
- mean: 2.383e+04
- median: 21,500
- std: 8178
- q1: 17,750
- q3: 28,000
- iqr: 10,250
- skew: 0.8718
- kurtosis: -0.3627
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
total_personnel
numeric feature null_rate
This column represents total personnel counts, most likely headcount at the row's installation. The most striking issue is a null rate of 77.78%: only 12 of 54 rows have a value, making the column severely under-populated and potentially unreliable for modelling. The 12 non-null values are all distinct, ranging from 27,000 to 82,000 with a mean of ~46,083 and moderate positive skew (0.855). Kurtosis is ≈ 0.006, so tail weight is close to that of a normal distribution, and there are no outliers; the non-null values themselves are internally well-behaved.
Treatment: Investigate the source of the 77.78% missingness before use; impute or drop depending on the missingness mechanism, then consider a log-transform given the positive skew.
- n: 54
- nulls: 42 (77.8%)
- unique: 12
- min: 27,000
- max: 82,000
- mean: 4.608e+04
- median: 43,500
- std: 1.621e+04
- q1: 34,250
- q3: 53,500
- iqr: 19,250
- skew: 0.8549
- kurtosis: 0.005583
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
direct_jobs
numeric feature null_rate
This column records the count of direct jobs associated with each record, most likely jobs at the row's military installation, with values ranging from 8,500 to 25,000 and a mean of ~14,292. The most striking signal is an extremely high null rate of 77.78%: only 12 of 54 rows carry a value, which severely limits usability. All 12 populated rows hold distinct values, and the figures appear rounded, which is consistent with reported estimates rather than exact counts. The distribution is mildly right-skewed (skew 0.81) with no outliers detected.
Treatment: Impute or flag missing values (77.78% null rate) before use; consider whether missingness is systematic before including the column in any model.
- n: 54
- nulls: 42 (77.8%)
- unique: 12
- min: 8,500
- max: 25,000
- mean: 1.429e+04
- median: 13,500
- std: 4883
- q1: 10,750
- q3: 16,500
- iqr: 5,750
- skew: 0.8149
- kurtosis: -0.06741
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
ffl_per_100k
numeric feature high_skew outliers
This column represents the number of Federal Firearms Licensees (FFLs) per 100,000 residents, measured at the U.S. state level. The distribution is severely right-skewed (skew=2.47, kurtosis=5.08): the median is only 12.37 but the mean is pulled to 49.44 by extreme outliers, with a maximum of 341.66, over 27× the median. Eight observations (15.1% of non-null rows) are flagged as outliers, likely small-population states where per-capita rates explode due to a low denominator.
Treatment: Log-transform before modelling to reduce skew; investigate the 8 outliers for small-denominator artifacts before inclusion.
- n: 54
- nulls: 1 (1.9%)
- unique: 49
- min: 1.377
- max: 341.7
- mean: 49.44
- median: 12.37
- std: 87.37
- q1: 7.034
- q3: 31.09
- iqr: 24.05
- skew: 2.474
- kurtosis: 5.076
- n_outliers: 8
- outlier_rate: 0.1509
- zero_rate: 0
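The small-denominator effect is easy to see with a per-100k calculation. This sketch assumes ffl_per_100k is ffl_count divided by total_population, times 100,000 (the profile does not give the formula). It pairs one licensee count with the smallest and the median population from the profile; the pairing is illustrative, not taken from actual rows.

```python
def per_100k(count, population):
    """Rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 100_000

ffl_count = 1300  # q1 of ffl_count, used here as an illustrative count
small_state = per_100k(ffl_count, 548_984)      # min of total_population
median_state = per_100k(ffl_count, 19_580_000)  # median of total_population
print(round(small_state, 1), round(median_state, 1))
```

The same count produces a rate roughly 36 times higher in the small population, which is why the outliers should be checked against population before being treated as anomalies.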
active_duty_per_100k
numeric feature high_skew outliers
This column represents active-duty military personnel per 100,000 residents, measured at the U.S. state level (the 54 rows correspond to 50 distinct states, a few of which repeat, not to states plus territories). The distribution is severely right-skewed (skew=2.91, kurtosis=7.22): the median is only 92.5, yet the mean is 610.7 and the max reaches 5,544, driven by 9 outliers (17% of non-null rows), most likely states with large military installations relative to their population, such as Hawaii, Alaska, or Virginia. The IQR of 301 versus a std of 1,389 further confirms extreme concentration at low values with a long heavy tail.
Treatment: Log-transform (log1p) before regression or clustering to reduce skew; consider flagging or separately modelling the 9 outlier rows.
- n: 54
- nulls: 1 (1.9%)
- unique: 49
- min: 3.105
- max: 5544
- mean: 610.7
- median: 92.49
- std: 1389
- q1: 21.96
- q3: 323.3
- iqr: 301.3
- skew: 2.911
- kurtosis: 7.217
- n_outliers: 9
- outlier_rate: 0.1698
- zero_rate: 0
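A short sketch of what the suggested log1p transform does to this column's scale, using only the min, median, and max reported in the profile. It shows the compression of the upper tail, not a full preprocessing pipeline.

```python
import math

# min, median and max of active_duty_per_100k from the profile.
raw = [3.105, 92.49, 5544.0]
logged = [math.log1p(x) for x in raw]

# How far the maximum sits above the median, before and after the transform.
raw_spread = raw[2] / raw[1]
log_spread = logged[2] / logged[1]
print(round(raw_spread, 1), round(log_spread, 1))
```

The maximum is about 60 times the median on the raw scale but less than twice the median after log1p, while the ordering of values is preserved.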