saturn·

environmental desert data

source: /home/coolhand/html/datavis/data_trove/data/environmental/desert_data.json · 52,037 rows · 8 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 52,037 records of US Census-tract-level demographics: a 10-11 character ID, state and county labels, and five numeric measures (dist_share, inc, pop, pov, and snap, read here as a distance/share metric, income, population, poverty rate, and SNAP counts). The st column has 51 distinct values (50 states plus DC), led by Texas (4,010), California (3,727), and Florida (3,018); the most frequent county names are common ones such as Jefferson and Montgomery. Income is right-skewed (mean $78,215 vs median $70,455, max $250,001) with about 3.9% of rows flagged as outliers, and poverty rate shows a similar skew (mean 13.7%, median 10.8%, max 99.5%). Worth a closer look: the skew and outlier rates in inc, pov, and snap, and the shape of dist_share, which spans 0 to 10,000 with quartiles at 1,786 and 9,378 and kurtosis of -1.5. That is flatter than a uniform distribution (quartiles 2,500 and 7,500, kurtosis -1.2), so values pile up toward both ends of the range; the field may be a scaled share or percentile-style metric rather than a raw count.

citing: row_count · column_count · columns.st.top_values · columns.inc.stats · columns.pov.stats · columns.snap.stats · columns.dist_share.stats · columns.cty.top_values

Schema

8 columns
Per-column summary.

column      type         nulls  unique  alerts
id          text         0.0%   52,037  near_unique one_word allcaps short_text
st          categorical  0.0%   51
cty         text         0.0%   1,870   short_text duplicates
pov         numeric      0.0%   693
inc         numeric      0.0%   31,375
dist_share  numeric      0.0%   33,000
pop         numeric      0.0%   8,732
snap        numeric      0.0%   1,034

id

text identifier near_unique one_word allcaps short_text
This is a unique row identifier: all 52,037 values are distinct, with no nulls or duplicates. Values are single-token digit strings of 10 or 11 characters (len_min 10, len_max 11, len_mean 10.84) that resemble Census tract GEOIDs such as 42069110300. A tract GEOID is 11 digits (2 state + 3 county + 6 tract), so the 10-character values, roughly 16% of rows given the mean length, have probably lost a leading zero from a single-digit state code. Treatment: Left-pad to 11 characters, then use as the primary key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7
n: 52,037
nulls: 0 (0.0%)
unique: 52,037
len_min: 10
len_max: 11
len_mean: 10.84
len_median: 11
len_p95: 11
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 20,000
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
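The mixed lengths of 10 and 11 matter for joins. The following is a minimal sketch, assuming the ids are Census tract GEOIDs (2-digit state, 3-digit county, 6-digit tract) and that 10-character values lost a leading zero; neither assumption is confirmed by the profile. The function names and the sample id `1001020100` are hypothetical.

```python
def normalize_geoid(raw: str) -> str:
    """Left-pad a 10- or 11-digit id to 11 characters (assumed GEOID width)."""
    value = str(raw).strip()
    if not value.isdigit() or not 10 <= len(value) <= 11:
        raise ValueError(f"unexpected id format: {raw!r}")
    return value.zfill(11)


def split_geoid(raw: str) -> dict:
    """Split a tract GEOID into state, county, and tract components."""
    geoid = normalize_geoid(raw)
    return {
        "state_fips": geoid[:2],    # first 2 digits
        "county_fips": geoid[:5],   # state + county, 5 digits
        "tract": geoid[5:],         # remaining 6 digits
    }


# Hypothetical 10-character id, padded back to 11 characters.
print(normalize_geoid("1001020100"))
# Sample id quoted in the column note above.
print(split_geoid("42069110300"))
```

Padding before joining avoids silent mismatches against lookup tables that store the id as an 11-character string.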

st

categorical feature
This column holds US state names: 51 unique values (likely the 50 states plus DC) across 52,037 rows with no nulls. The distribution roughly tracks population: Texas leads at 7.7%, followed by California, Florida, Ohio, and Pennsylvania. An entropy ratio of 0.915 (entropy 5.192 bits against a maximum of log2(51) ≈ 5.672) indicates a fairly even spread with no single state dominating. Treatment: One-hot or target-encode for modelling; consider grouping low-frequency states. high · anthropic:claude-opus-4-7
n: 52,037
nulls: 0 (0.0%)
unique: 51
top_value: Texas
top_rate: 0.07706
cardinality: 51
entropy: 5.192
entropy_ratio: 0.9153
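The entropy ratio can be reproduced from the two reported figures. This sketch assumes the profiler measures entropy in bits and normalizes by log2 of the cardinality; the function name `entropy_ratio` is illustrative and not taken from the profiling tool.

```python
import math
from collections import Counter


def entropy_ratio(labels):
    """Return (Shannon entropy in bits, entropy / log2(number of categories))."""
    counts = Counter(labels)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    categories = len(counts)
    ratio = entropy / math.log2(categories) if categories > 1 else 0.0
    return entropy, ratio


# Consistency check on the reported stats: entropy 5.192 over 51 categories.
reported_ratio = 5.192 / math.log2(51)
print(round(reported_ratio, 4))

# A perfectly even two-category column has ratio 1.0.
print(entropy_ratio(["a", "b", "a", "b"]))
```

A ratio of 1.0 means every category is equally frequent, so 0.915 reflects the moderate tilt toward large states.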

cty

text feature short_text duplicates
This column holds US county and parish names: short text averaging 2 words and 14 characters, with 'county' appearing in 19,399 rows and recurring names such as Jefferson, Montgomery, and Maricopa County topping the list. With only 1,870 unique values across 52,037 rows, the duplicate rate is 96.4% (50,167 repeated rows), which is expected for a categorical geography field and is not a data-quality issue. There are no nulls, URLs, or emoji. County names repeat across states (Jefferson County exists in many), so cty alone does not identify a county; pair it with st. Treatment: Treat as a high-cardinality categorical keyed on (st, cty); encode via target or frequency encoding, or join to a county FIPS lookup. high · anthropic:claude-opus-4-7
n: 52,037
nulls: 0 (0.0%)
unique: 1,870
len_min: 10
len_max: 33
len_mean: 14.3
len_median: 14
len_p95: 18
word_mean: 2.099
word_median: 2
n_empty: 0
n_duplicates: 50,167
duplicate_rate: 0.9641
vocab_size: 1,651
readability_flesch_mean: 25.91
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 0
boilerplate_rate: 0
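Frequency encoding is one of the treatments suggested for this column. The sketch below keys the encoding on the (st, cty) pair rather than on cty alone, because the same county name occurs in several states. The rows are illustrative examples, not records from the dataset, and `frequency_encode` is a hypothetical helper.

```python
from collections import Counter


def frequency_encode(rows):
    """Map each row to the relative frequency of its (state, county) pair."""
    keys = [(row["st"], row["cty"]) for row in rows]
    counts = Counter(keys)
    total = len(keys)
    return [counts[key] / total for key in keys]


# Illustrative rows only: the same county name in two different states.
rows = [
    {"st": "Alabama", "cty": "Jefferson County"},
    {"st": "Alabama", "cty": "Jefferson County"},
    {"st": "Kentucky", "cty": "Jefferson County"},
    {"st": "Arizona", "cty": "Maricopa County"},
]
encoded = frequency_encode(rows)
print(encoded)
```

Encoding on cty alone would give all three Jefferson County rows the same value and merge two unrelated counties.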

pov

numeric feature
Likely a poverty rate in percent, bounded between 0 and 99.5 with a median of 10.8 and an interquartile range from 6.0 to 18.4 (IQR 12.4). The distribution is right-skewed (skew 1.54, kurtosis 3.08) with 2,116 outliers (4.07%) in the upper tail, and only 0.09% zeros across 693 distinct values over 52,037 rows. Treatment: Consider a log1p or winsorising transform before modelling to dampen the right-skewed tail. high · anthropic:claude-opus-4-7
n: 52,037
nulls: 0 (0.0%)
unique: 693
min: 0
max: 99.5
mean: 13.7
median: 10.8
std: 10.58
q1: 6
q3: 18.4
iqr: 12.4
skew: 1.536
kurtosis: 3.077
n_outliers: 2,116
outlier_rate: 0.04066
zero_rate: 0.0008648
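The winsorising treatment can be sketched from the reported quartiles. This assumes the profiler flags outliers with the common 1.5 × IQR (Tukey) rule, which the report does not state; the helper names and the sample values are illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize(values, lower, upper):
    """Clip each value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Quartiles taken from the stats above: q1 = 6.0, q3 = 18.4.
lower, upper = tukey_fences(6.0, 18.4)
print(lower, upper)

# A rate cannot be negative, so the lower fence is floored at 0.
clipped = winsorize([0.0, 10.8, 45.2, 99.5], max(lower, 0.0), upper)
print(clipped)
```

Under this assumption the lower fence is negative, so every flagged row sits in the upper tail, in line with the note above.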

inc

numeric feature
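Likely an income figure in dollars, plausibly median household income given the tract-level context: values range from 2,499 to 250,001 with a mean of 78,215 and median of 70,455. The distribution is right-skewed (skew 1.45) with about 3.9% outliers (2,047 rows). The max of 250,001 and the min of 2,499 each sit one unit outside a round threshold (250,000 and 2,500), which hints at top- and bottom-coding rather than observed values. Treatment: Log-transform, and consider flagging rows at 250,001 or 2,499 as censored before modelling; clipping at the cap would change nothing because no value exceeds it. high · anthropic:claude-opus-4-7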
n: 52,037
nulls: 0 (0.0%)
unique: 31,375
min: 2,499
max: 250,001
mean: 78,215
median: 70,455
std: ≈35,730
q1: 54,059
q3: 94,375
iqr: 40,316
skew: 1.451
kurtosis: 3.144
n_outliers: 2,047
outlier_rate: 0.03934
zero_rate: 0
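A capped maximum is usually better handled with an explicit flag than by treating it as an observed value. The sketch below assumes that 250,001 and 2,499 are top and bottom codes, which the profile suggests but does not confirm; the constant and function names are hypothetical.

```python
import math

TOP_CODE = 250_001    # assumed marker for "250,000 or more"
BOTTOM_CODE = 2_499   # assumed marker for "2,500 or less"


def prepare_income(value):
    """Return (log income, censored flag) for one inc value."""
    censored = value >= TOP_CODE or value <= BOTTOM_CODE
    # zero_rate is 0 and the minimum is 2,499, so a plain log is defined.
    return math.log(value), censored


for income in (2_499, 70_455, 250_001):
    log_income, censored = prepare_income(income)
    print(income, round(log_income, 3), censored)
```

Keeping the flag as a separate feature lets a model treat capped rows differently from rows with an observed income.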

dist_share

numeric feature
A numeric field bounded between 0 and 10,000, which suggests a share stored in basis points (per ten thousand) rather than as a 0-1 ratio; the unit is not confirmed by the data. The distribution is nearly symmetric (skew -0.09) and flat, with negative kurtosis (-1.51) and quartiles at 1,785.6 and 9,377.5 (IQR 7,592). It is not uniform, though: a uniform spread over 0-10,000 would have quartiles at 2,500 and 7,500, a standard deviation near 2,887, and kurtosis of -1.2, whereas the observed standard deviation is 3,671. The values therefore concentrate toward both ends of the range. About 2.1% of rows are exactly zero, and there are no statistical outliers. Treatment: Rescale to [0,1] by dividing by 10,000 before modelling; consider a separate zero flag for the 2% exact zeros. high · anthropic:claude-opus-4-7
n: 52,037
nulls: 0 (0.0%)
unique: 33,000
min: 0
max: 10,000
mean: 5,396
median: 5,503
std: 3,671
q1: 1,786
q3: 9,378
iqr: 7,592
skew: -0.09003
kurtosis: -1.506
n_outliers: 0
outlier_rate: 0
zero_rate: 0.02056
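One way to test how close this column is to uniform is to set the reported statistics against those of an exactly uniform distribution on the same range. The sketch uses the standard closed-form values for a continuous uniform distribution; `uniform_reference` is an illustrative helper, and the reported kurtosis is read as excess kurtosis, since ordinary kurtosis cannot be negative.

```python
import math


def uniform_reference(low, high):
    """Closed-form statistics of a continuous uniform distribution on [low, high]."""
    width = high - low
    return {
        "q1": low + 0.25 * width,
        "q3": low + 0.75 * width,
        "std": width / math.sqrt(12),
        "excess_kurtosis": -1.2,  # constant for any uniform distribution
    }


reference = uniform_reference(0, 10_000)
# Values copied from the dist_share stats above.
observed = {"q1": 1786, "q3": 9378, "std": 3671, "excess_kurtosis": -1.506}

for name, expected in reference.items():
    print(f"{name}: uniform={expected:.1f} observed={observed[name]}")
```

The observed quartiles lie further out than 2,500 and 7,500 and the standard deviation exceeds the uniform value, which points to extra mass near 0 and near 10,000.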

pop

numeric feature
Likely a population count per tract, ranging from 102 to 37,452 with a median of 4,169 and no zeros or nulls. The distribution is right-skewed (skew 1.68, kurtosis 9.99) with 972 outliers (1.87%) in the upper tail. Only 8,732 of the 52,037 values are distinct, but this does not mean the same entity appears in multiple rows: id is unique per row, and integer counts confined to 102-37,452 allow at most 37,351 distinct values, so repeats are unavoidable. Treatment: Log-transform before modelling to dampen the right skew and outliers. medium · anthropic:claude-opus-4-7
n: 52,037
nulls: 0 (0.0%)
unique: 8,732
min: 102
max: 37,452
mean: 4,432
median: 4,169
std: 2,025
q1: 3,022
q3: 5,523
iqr: 2,501
skew: 1.684
kurtosis: 9.986
n_outliers: 972
outlier_rate: 0.01868
zero_rate: 0

snap

numeric feature
A right-skewed numeric feature, probably a count of SNAP recipients or recipient households per tract, with values spanning 0 to 1,888 and a median of 146, well below the mean of 191. Skew of 1.71 and kurtosis of 4.70 confirm a long upper tail, with 1,918 outliers (3.69%) and 2.28% zeros. The 1,034 unique values across 52,037 rows are consistent with an integer count rather than a continuous measurement. Treatment: Apply log1p (a plain log is undefined at the 2.28% of rows that are zero) or winsorize before regression to tame the right tail. medium · anthropic:claude-opus-4-7
n: 52,037
nulls: 0 (0.0%)
unique: 1,034
min: 0
max: 1,888
mean: 191
median: 146
std: 170.2
q1: 66
q3: 268
iqr: 202
skew: 1.709
kurtosis: 4.702
n_outliers: 1,918
outlier_rate: 0.03686
zero_rate: 0.02279
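Because about 2.3% of snap values are zero, a log-style transform needs to stay defined at zero. A minimal sketch using `math.log1p`, which computes log(1 + x); the helper name is illustrative and the sample values are the min, q1, median, q3, and max from the stats above.

```python
import math


def log1p_transform(values):
    """Apply log(1 + x) to non-negative values; zeros map to 0.0."""
    if any(v < 0 for v in values):
        raise ValueError("log1p_transform expects non-negative values")
    return [math.log1p(v) for v in values]


# Five-number summary of snap: min, q1, median, q3, max.
summary = [0, 66, 146, 268, 1888]
transformed = log1p_transform(summary)
print([round(t, 3) for t in transformed])
```

The transform preserves order and compresses the upper tail, so the distance from the median to the max shrinks much more than the distance from the min to the median.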