extracted_cppi-jp-cppi-cross-reference

Overview

Source: /home/coolhand/html/datavis/data_trove/joshua-project/archive/extracted_cppi/jp-cppi-cross-reference.csv

Saturn profiled 19,375 rows across 24 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/joshua-project/archive/extracted_cppi/jp-cppi-cross-reference.csv",
    "--findings", "extracted_cppi-jp-cppi-cross-reference.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset is a Joshua Project cross-reference of 19,375 people-group records across 240 countries, combining CPPI and JP fields covering language, religion, population, and evangelical engagement. India dominates the country distribution at 17.2% of rows, followed by Papua New Guinea, Pakistan, and Indonesia, so any global view should account for that skew. JPPrimaryReligion is led by Christianity (41% of non-null rows, 36.9% of all rows), with Islam, Ethnic Religions, and Hinduism trailing, while CPPIEvangelicalEngagement is split roughly 59% Engaged vs 41% Unengaged among the non-null rows. Watch the population and evangelical-percentage fields: JPPopulation is extremely long-tailed (max ~919M, median 16,000, skew ~95) and JP%Evangelical is right-skewed with ~27% zeros and ~10% outliers. Also note the high null rates on the CPPI-prefixed columns (~37%, and 47.6% for CPPIPrimaryReligion) and JPLeastReached (63%), which limit how many rows carry both JP and CPPI values.

citing: row_count · column_count · Ctry.top_values · Ctry.stats.top_rate · JPPrimaryReligion.top_values · JPPrimaryReligion.stats.top_rate · CPPIEvangelicalEngagement.top_values · CPPIEvangelicalEngagement.stats.top_rate · JPPopulation.stats · JP%Evangelical.stats · CPPIPrimaryReligion.top_values · JPLeastReached.null_rate · CPPIPopulation.null_rate

Out[4]:

saturn.schema() · 24 columns

column	kind	n	null%	unique	alerts
Type	numeric	19,375	0.0%	3
PEID	numeric	19,375	36.7%	12,256	null_rate
ROP3	numeric	19,375	0.0%	11,786
PeopleID3	numeric	19,375	10.9%	10,284
ROG3	categorical	19,375	0.0%	240
Ctry	categorical	19,375	0.0%	240
JPPeopleGroup	text	19,375	10.9%	10,624	one_word duplicates
JPPopulation	numeric	19,375	11.0%	1,731	high_skew outliers
JPIndigenous	categorical	19,375	10.9%	2
JPROL3	text	19,375	10.9%	6,165	one_word short_text duplicates
JPPrimaryLanguage	text	19,375	10.9%	6,152	one_word short_text duplicates
JPRLG3	numeric	19,375	10.9%	8
JPPrimaryReligion	categorical	19,375	10.9%	8
JPScale	numeric	19,375	10.9%	5
JPLeastReached	categorical	19,375	62.9%	1	null_rate imbalance
JP%ChristianAdherent	numeric	19,375	10.9%	957
JP%Evangelical	numeric	19,375	17.2%	692	high_skew outliers
CPPIPeopleGroup	text	19,375	36.7%	9,170	one_word null_rate short_text duplicates
CPPIPopulation	text	19,375	36.7%	1,629	multilingual allcaps null_rate short_text duplicates
CPPIROL	text	19,375	36.7%	5,014	one_word null_rate short_text duplicates
CPPIPrimaryLanguage	text	19,375	36.7%	5,014	one_word null_rate short_text duplicates
CPPIPrimaryReligion	categorical	19,375	47.6%	40	null_rate
CPPIGSEC	numeric	19,375	36.7%	7	null_rate
CPPIEvangelicalEngagement	categorical	19,375	36.7%	2	null_rate

Fig 1.

Ctry · Top countries by row count — India alone is ~17% of records, so check geographic balance before any aggregate.

Top values for Ctry (20 unique shown, of 240 total).
value	count	share
India	3330	17.2%
Papua New Guinea	919	4.7%
Pakistan	851	4.4%
Indonesia	806	4.2%
United States	655	3.4%
China	557	2.9%
Nigeria	557	2.9%
Canada	357	1.8%
Brazil	350	1.8%
Mexico	344	1.8%
Bangladesh	321	1.7%
Cameroon	304	1.6%
Nepal	297	1.5%
Congo, Democratic Republic of	240	1.2%
Sudan	222	1.1%
Australia	211	1.1%
Philippines	209	1.1%
Malaysia	196	1.0%
Russia	189	1.0%
Myanmar (Burma)	165	0.9%

Fig 2.

JPPrimaryReligion · Religion mix is led by Christianity (41% of non-null rows; 36.9% of all rows) and Islam, with smaller Ethnic, Hindu, and Buddhist slices.

Top values for JPPrimaryReligion (8 unique shown, of 8 total).
value	count	share
Christianity	7140	36.9%
Islam	4003	20.7%
Ethnic Religions	2641	13.6%
Hinduism	2396	12.4%
Buddhism	669	3.5%
Non-Religious	264	1.4%
Other / Small	128	0.7%
Unknown	26	0.1%

Fig 3.

CPPIEvangelicalEngagement · Among rows with a non-null engagement status, roughly 59% are 'Engaged' vs 41% 'Unengaged'; the table shares are of all 19,375 rows.

Top values for CPPIEvangelicalEngagement (2 unique shown, of 2 total).
value	count	share
Engaged	7238	37.4%
Unengaged	5018	25.9%
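
The caption quotes shares of non-null rows while the table quotes shares of all rows. A minimal sketch of how the two sets of figures reconcile, using only the counts printed above:

```python
# Counts copied from the CPPIEvangelicalEngagement table above.
TOTAL_ROWS = 19_375
counts = {"Engaged": 7_238, "Unengaged": 5_018}

def shares(counts, total_rows):
    """Return {value: (share_of_all_rows, share_of_non_null_rows)}."""
    non_null = sum(counts.values())
    return {
        value: (n / total_rows, n / non_null)
        for value, n in counts.items()
    }

result = shares(counts, TOTAL_ROWS)
null_rate = 1 - sum(counts.values()) / TOTAL_ROWS

for value, (of_all, of_non_null) in result.items():
    print(f"{value}: {of_all:.1%} of all rows, {of_non_null:.1%} of non-null rows")
print(f"null rate: {null_rate:.1%}")
```

The same arithmetic explains every categorical column in this report where top_rate and the table's share column disagree.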

Fig 4.

JPPopulation · Population is extremely long-tailed (median 16K, max ~919M); plot on a log scale to see structure.

Histogram bins for JPPopulation (median: 16000.0).
bin	count
10 – 2.297e+07	17193
2.297e+07 – 4.594e+07	33
4.594e+07 – 6.891e+07	9
6.891e+07 – 9.188e+07	5
9.188e+07 – 1.149e+08	2
1.149e+08 – 1.378e+08	3
1.378e+08 – 1.608e+08	0
1.608e+08 – 1.838e+08	0
1.838e+08 – 2.067e+08	1
2.067e+08 – 2.297e+08	0
2.297e+08 – 2.527e+08	0
2.527e+08 – 2.756e+08	0
2.756e+08 – 2.986e+08	0
2.986e+08 – 3.216e+08	0
3.216e+08 – 3.446e+08	0
3.446e+08 – 3.675e+08	0
3.675e+08 – 3.905e+08	0
3.905e+08 – 4.135e+08	0
4.135e+08 – 4.364e+08	0
4.364e+08 – 4.594e+08	0
4.594e+08 – 4.824e+08	0
4.824e+08 – 5.053e+08	0
5.053e+08 – 5.283e+08	0
5.283e+08 – 5.513e+08	0
5.513e+08 – 5.743e+08	0
5.743e+08 – 5.972e+08	0
5.972e+08 – 6.202e+08	0
6.202e+08 – 6.432e+08	0
6.432e+08 – 6.661e+08	0
6.661e+08 – 6.891e+08	0
6.891e+08 – 7.121e+08	0
7.121e+08 – 7.35e+08	0
7.35e+08 – 7.58e+08	0
7.58e+08 – 7.81e+08	0
7.81e+08 – 8.04e+08	0
8.04e+08 – 8.269e+08	0
8.269e+08 – 8.499e+08	0
8.499e+08 – 8.729e+08	0
8.729e+08 – 8.958e+08	0
8.958e+08 – 9.188e+08	1
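
With equal-width bins, almost every row lands in the first bin. One way to see structure is to bin by power of ten instead. The sketch below is illustrative only: the sample list is hypothetical (the value 2,500,000 is made up; the others are the reported min, quartiles, median, and max), and `decade_histogram` is a helper name introduced here, not part of Saturn.

```python
import math
from collections import Counter

def decade_histogram(values):
    """Count positive values per power-of-ten bin: key k covers [10**k, 10**(k+1))."""
    bins = Counter()
    for v in values:
        if v is None or v <= 0:
            continue  # log10 is undefined here; nulls are skipped, not imputed
        bins[math.floor(math.log10(v))] += 1
    return dict(sorted(bins.items()))

# Hypothetical sample, not rows from the dataset
sample = [10, 3_100, 16_000, 85_000, None, 2_500_000, 918_811_000]
hist = decade_histogram(sample)
print(hist)
```

Each key is an exponent, so key 4 collects values from 10,000 up to but not including 100,000.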

Fig 5.

JP%Evangelical · Evangelical share is right-skewed with a heavy zero/low cluster and a long tail up to 95%.

Histogram bins for JP%Evangelical (median: 1.5).
bin	count
0 – 2.375	8955
2.375 – 4.75	1481
4.75 – 7.125	1358
7.125 – 9.5	669
9.5 – 11.88	432
11.88 – 14.25	623
14.25 – 16.62	364
16.62 – 19	283
19 – 21.38	425
21.38 – 23.75	222
23.75 – 26.12	350
26.12 – 28.5	131
28.5 – 30.88	198
30.88 – 33.25	97
33.25 – 35.62	91
35.62 – 38	20
38 – 40.38	64
40.38 – 42.75	22
42.75 – 45.12	98
45.12 – 47.5	38
47.5 – 49.88	22
49.88 – 52.25	36
52.25 – 54.62	4
54.62 – 57	8
57 – 59.38	1
59.38 – 61.75	17
61.75 – 64.12	4
64.12 – 66.5	3
66.5 – 68.88	0
68.88 – 71.25	8
71.25 – 73.62	2
73.62 – 76	4
76 – 78.38	4
78.38 – 80.75	3
80.75 – 83.12	0
83.12 – 85.5	0
85.5 – 87.88	2
87.88 – 90.25	2
90.25 – 92.62	0
92.62 – 95	2

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
Type	numeric	0.0%
PEID	numeric	36.7%
ROP3	numeric	0.0%
PeopleID3	numeric	10.9%
ROG3	categorical	0.0%
Ctry	categorical	0.0%
JPPeopleGroup	text	10.9%
JPPopulation	numeric	11.0%
JPIndigenous	categorical	10.9%
JPROL3	text	10.9%
JPPrimaryLanguage	text	10.9%
JPRLG3	numeric	10.9%
JPPrimaryReligion	categorical	10.9%
JPScale	numeric	10.9%
JPLeastReached	categorical	62.9%
JP%ChristianAdherent	numeric	10.9%
JP%Evangelical	numeric	17.2%
CPPIPeopleGroup	text	36.7%
CPPIPopulation	text	36.7%
CPPIROL	text	36.7%
CPPIPrimaryLanguage	text	36.7%
CPPIPrimaryReligion	categorical	47.6%
CPPIGSEC	numeric	36.7%
CPPIEvangelicalEngagement	categorical	36.7%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 259 detected strings).
lang	count	share
en	256	98.8%
zh	1	0.4%
kn	1	0.4%
fa	1	0.4%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 10 numeric columns (values clipped to 2 decimals).
	Type	PEID	ROP3	PeopleID3	JPPopulation	JPRLG3	JPScale	JP%ChristianAdherent	JP%Evangelical	CPPIGSEC
Type	+1.00	+0.20	+0.36	+0.27	-0.04	+0.09	-0.17	-0.11	-0.14	-0.08
PEID	+0.20	+1.00	+0.09	+0.12	+0.04	+0.10	-0.10	-0.10	-0.14	-0.18
ROP3	+0.36	+0.09	+1.00	+0.03	+0.04	-0.02	+0.00	+0.00	-0.03	-0.07
PeopleID3	+0.27	+0.12	+0.03	+1.00	+0.05	+0.30	-0.34	-0.33	-0.14	-0.10
JPPopulation	-0.04	+0.04	+0.04	+0.05	+1.00	+0.02	-0.06	-0.05	-0.03	+0.08
JPRLG3	+0.09	+0.10	-0.02	+0.30	+0.02	+1.00	-0.71	-0.88	-0.24	+0.05
JPScale	-0.17	-0.10	+0.00	-0.34	-0.06	-0.71	+1.00	+0.80	+0.23	-0.04
JP%ChristianAdherent	-0.11	-0.10	+0.00	-0.33	-0.05	-0.88	+0.80	+1.00	+0.22	-0.07
JP%Evangelical	-0.14	-0.14	-0.03	-0.14	-0.03	-0.24	+0.23	+0.22	+1.00	-0.01
CPPIGSEC	-0.08	-0.18	-0.07	-0.10	+0.08	+0.05	-0.04	-0.07	-0.01	+1.00
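
For readers who want to reproduce a cell of this matrix, here is a plain-Python sketch of the Pearson coefficient. It drops any pair with a missing value (pairwise deletion); whether Saturn handles nulls the same way is an assumption, and the input values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson r over pairs where both values are present (pairwise deletion)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, not rows from the dataset
scale = [1, 2, None, 4, 5]
pct = [0.5, 10.0, 30.0, 70.0, None]
r = pearson(scale, pct)
print(round(r, 2))
```

Pearson r treats every column as a measured quantity, so coefficients involving ID or code columns (Type, PEID, ROP3, PeopleID3, JPRLG3) depend on how the codes happen to be numbered.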

Type numeric feature

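Type is encoded numerically but takes only 3 distinct values across 19,375 rows (min 1, max 3, median 1), so it is almost certainly a categorical code rather than a true measurement. The distribution leans toward the lowest category: 10,148 rows are 1, 7,119 are 2, and 2,108 are 3 (mean 1.585). The counts for codes 2 and 3 equal the null counts of PEID (7,119) and PeopleID3 (2,108), which suggests Type records whether a row was matched in both sources or found in only one; confirm this against the raw rows. No nulls or outliers are present.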

Treatment: Treat as a categorical variable and one-hot encode before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[14]:

saturn.columns["Type"].stats

stat	value
n	19,375
nulls	0 (0.0%)
unique	3
min	1
max	3
mean	1.585
median	1
std	0.6785
q1	1
q3	2
iqr	1
skew	0.7351
kurtosis	-0.6013
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 9.

Distribution of Type. Vertical dash marks the median.

Histogram bins for Type (median: 1.0).
bin	count
1 – 1.05	10148
1.05 – 1.1	0
1.1 – 1.15	0
1.15 – 1.2	0
1.2 – 1.25	0
1.25 – 1.3	0
1.3 – 1.35	0
1.35 – 1.4	0
1.4 – 1.45	0
1.45 – 1.5	0
1.5 – 1.55	0
1.55 – 1.6	0
1.6 – 1.65	0
1.65 – 1.7	0
1.7 – 1.75	0
1.75 – 1.8	0
1.8 – 1.85	0
1.85 – 1.9	0
1.9 – 1.95	0
1.95 – 2	0
2 – 2.05	7119
2.05 – 2.1	0
2.1 – 2.15	0
2.15 – 2.2	0
2.2 – 2.25	0
2.25 – 2.3	0
2.3 – 2.35	0
2.35 – 2.4	0
2.4 – 2.45	0
2.45 – 2.5	0
2.5 – 2.55	0
2.55 – 2.6	0
2.6 – 2.65	0
2.65 – 2.7	0
2.7 – 2.75	0
2.75 – 2.8	0
2.8 – 2.85	0
2.85 – 2.9	0
2.9 – 2.95	0
2.95 – 3	2108
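
A minimal sketch of the one-hot treatment suggested for Type, followed by a cross-check of the reported mean from the three non-empty histogram bins. `one_hot` is a helper written for this example.

```python
def one_hot(value, levels):
    """Return one 0/1 indicator per level; an unseen or missing value gives all zeros."""
    return [1 if value == level else 0 for level in levels]

LEVELS = (1, 2, 3)  # the three observed Type codes

encoded = [one_hot(v, LEVELS) for v in (1, 3, 2)]
print(encoded)

# Cross-check of the reported mean, using the per-code counts from the histogram
counts = {1: 10_148, 2: 7_119, 3: 2_108}
n = sum(counts.values())
mean = sum(code * c for code, c in counts.items()) / n
print(n, round(mean, 3))
```

The counts sum to the full row count and reproduce the mean of 1.585, which confirms that the 40-bin histogram holds only three distinct values.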

PEID numeric identifier

PEID looks like an entity identifier: integer values from 1 to 50520 with 12256 unique values and no zeros. The 36.7% null rate is the headline concern: 7,119 of the 19,375 rows have no PEID. Among the 12,256 populated rows every value is distinct (the unique count equals the non-null count), so PEID does not repeat. The low skew (0.33) and negative kurtosis (-1.50) are consistent with an ID rather than a measured quantity, although the histogram shows gaps between ID blocks rather than a uniform spread.

Treatment: Treat as a join key that is unique where present; do not use as a numeric feature, and decide on a policy for the 36.7% missing IDs.

anthropic:claude-opus-4-7 · confidence high

Out[17]:

saturn.columns["PEID"].stats

stat	value
n	19,375
nulls	7,119 (36.7%)
unique	12,256
min	1
max	50,520
mean	2.564e+04
median	1.838e+04
std	1.637e+04
q1	1.213e+04
q3	4.329e+04
iqr	31,162
skew	0.3311
kurtosis	-1.499
n_outliers	0
outlier_rate	0
zero_rate	0
alert: null_rate	36.7% null
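
Whether an ID column repeats can be read straight off these stats: if the non-null count equals the unique count, no value occurs twice. A small sketch of that check, with a hypothetical helper `key_profile` and made-up sample IDs:

```python
from collections import Counter

def key_profile(values):
    """Summarise a candidate key column: nulls, distinct values, repeated values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "nulls": len(values) - len(present),
        "distinct": len(counts),
        "repeated": sorted(v for v, c in counts.items() if c > 1),
        "unique_where_present": len(counts) == len(present),
    }

# Hypothetical IDs for illustration
profile = key_profile([101, None, 205, 318, None])
print(profile)

# Same test applied to the reported PEID stats
n, nulls, unique = 19_375, 7_119, 12_256
peid_unique_where_present = (n - nulls) == unique
print(peid_unique_where_present)
```

For PEID the check passes, since 19,375 minus 7,119 is exactly 12,256.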

Fig 10.

Distribution of PEID. Vertical dash marks the median.

Histogram bins for PEID (median: 18378.5).
bin	count
1 – 1264	501
1264 – 2527	0
2527 – 3790	0
3790 – 5053	0
5053 – 6316	95
6316 – 7579	380
7579 – 8842	268
8842 – 1.01e+04	509
1.01e+04 – 1.137e+04	822
1.137e+04 – 1.263e+04	828
1.263e+04 – 1.389e+04	851
1.389e+04 – 1.516e+04	761
1.516e+04 – 1.642e+04	530
1.642e+04 – 1.768e+04	493
1.768e+04 – 1.895e+04	178
1.895e+04 – 2.021e+04	69
2.021e+04 – 2.147e+04	136
2.147e+04 – 2.273e+04	471
2.273e+04 – 2.4e+04	402
2.4e+04 – 2.526e+04	399
2.526e+04 – 2.652e+04	0
2.652e+04 – 2.779e+04	0
2.779e+04 – 2.905e+04	0
2.905e+04 – 3.031e+04	0
3.031e+04 – 3.158e+04	0
3.158e+04 – 3.284e+04	0
3.284e+04 – 3.41e+04	82
3.41e+04 – 3.536e+04	0
3.536e+04 – 3.663e+04	0
3.663e+04 – 3.789e+04	0
3.789e+04 – 3.915e+04	75
3.915e+04 – 4.042e+04	90
4.042e+04 – 4.168e+04	410
4.168e+04 – 4.294e+04	741
4.294e+04 – 4.421e+04	454
4.421e+04 – 4.547e+04	0
4.547e+04 – 4.673e+04	318
4.673e+04 – 4.799e+04	623
4.799e+04 – 4.926e+04	820
4.926e+04 – 5.052e+04	950

ROP3 numeric identifier

ROP3 holds 6-digit integers tightly bounded between 100004 and 119498 with mean 109260.77 and median 109305, suggesting a coded identifier or category number rather than a measured quantity. The distribution is essentially flat (kurtosis -1.16, skew 0.12) with no outliers and a near-zero null rate, and 11786 unique values across 19375 rows indicates many repeats but no dominant mode visible from these stats.

Treatment: Treat as a categorical code; do not feed as a continuous numeric feature.

anthropic:claude-opus-4-7 · confidence medium

Out[20]:

saturn.columns["ROP3"].stats

stat	value
n	19,375
nulls	2 (0.0%)
unique	11,786
min	100,004
max	119,498
mean	1.093e+05
median	109,305
std	5558
q1	104,126
q3	114,057
iqr	9,931
skew	0.1203
kurtosis	-1.159
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 11.

Distribution of ROP3. Vertical dash marks the median.

Histogram bins for ROP3 (median: 109305.0).
bin	count
1e+05 – 1.005e+05	584
1.005e+05 – 1.01e+05	450
1.01e+05 – 1.015e+05	338
1.015e+05 – 1.02e+05	620
1.02e+05 – 1.024e+05	489
1.024e+05 – 1.029e+05	603
1.029e+05 – 1.034e+05	715
1.034e+05 – 1.039e+05	681
1.039e+05 – 1.044e+05	613
1.044e+05 – 1.049e+05	395
1.049e+05 – 1.054e+05	430
1.054e+05 – 1.059e+05	552
1.059e+05 – 1.063e+05	482
1.063e+05 – 1.068e+05	465
1.068e+05 – 1.073e+05	375
1.073e+05 – 1.078e+05	497
1.078e+05 – 1.083e+05	516
1.083e+05 – 1.088e+05	584
1.088e+05 – 1.093e+05	293
1.093e+05 – 1.098e+05	790
1.098e+05 – 1.102e+05	502
1.102e+05 – 1.107e+05	646
1.107e+05 – 1.112e+05	500
1.112e+05 – 1.117e+05	406
1.117e+05 – 1.122e+05	329
1.122e+05 – 1.127e+05	364
1.127e+05 – 1.132e+05	482
1.132e+05 – 1.136e+05	429
1.136e+05 – 1.141e+05	509
1.141e+05 – 1.146e+05	355
1.146e+05 – 1.151e+05	661
1.151e+05 – 1.156e+05	543
1.156e+05 – 1.161e+05	516
1.161e+05 – 1.166e+05	429
1.166e+05 – 1.171e+05	287
1.171e+05 – 1.175e+05	174
1.175e+05 – 1.18e+05	367
1.18e+05 – 1.185e+05	456
1.185e+05 – 1.19e+05	469
1.19e+05 – 1.195e+05	477

PeopleID3 numeric foreign_key

PeopleID3 is almost certainly an identifier: values are integers ranging from 10119 to 22498 with no zeros, no outliers, and 10284 distinct values across 19375 rows. Given the people-group context, it most likely identifies a people group rather than an individual person. The 10.88% null rate (2,108 rows) means the reference is absent for some records, and the duplication (≈1.68 non-null rows per id, 17,267 / 10,284) indicates the same id appears in multiple rows, plausibly once per country. The low skew (0.39) and negative kurtosis (-0.97) are consistent with assigned ids rather than a measured quantity.

Treatment: Treat as a foreign key for left-joins to a people-group table; do not use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[23]:

saturn.columns["PeopleID3"].stats

stat	value
n	19,375
nulls	2,108 (10.9%)
unique	10,284
min	10,119
max	22,498
mean	1.524e+04
median	14,780
std	3413
q1	12,247
q3	1.814e+04
iqr	5898
skew	0.3911
kurtosis	-0.9664
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 12.

Distribution of PeopleID3. Vertical dash marks the median.

Histogram bins for PeopleID3 (median: 14780.0).
bin	count
1.012e+04 – 1.043e+04	565
1.043e+04 – 1.074e+04	528
1.074e+04 – 1.105e+04	527
1.105e+04 – 1.136e+04	857
1.136e+04 – 1.167e+04	570
1.167e+04 – 1.198e+04	655
1.198e+04 – 1.229e+04	749
1.229e+04 – 1.259e+04	510
1.259e+04 – 1.29e+04	549
1.29e+04 – 1.321e+04	564
1.321e+04 – 1.352e+04	442
1.352e+04 – 1.383e+04	457
1.383e+04 – 1.414e+04	422
1.414e+04 – 1.445e+04	624
1.445e+04 – 1.476e+04	585
1.476e+04 – 1.507e+04	534
1.507e+04 – 1.538e+04	688
1.538e+04 – 1.569e+04	483
1.569e+04 – 1.6e+04	591
1.6e+04 – 1.631e+04	351
1.631e+04 – 1.662e+04	275
1.662e+04 – 1.693e+04	258
1.693e+04 – 1.724e+04	263
1.724e+04 – 1.755e+04	304
1.755e+04 – 1.786e+04	325
1.786e+04 – 1.817e+04	296
1.817e+04 – 1.847e+04	372
1.847e+04 – 1.878e+04	392
1.878e+04 – 1.909e+04	546
1.909e+04 – 1.94e+04	484
1.94e+04 – 1.971e+04	311
1.971e+04 – 2.002e+04	272
2.002e+04 – 2.033e+04	204
2.033e+04 – 2.064e+04	271
2.064e+04 – 2.095e+04	151
2.095e+04 – 2.126e+04	267
2.126e+04 – 2.157e+04	306
2.157e+04 – 2.188e+04	244
2.188e+04 – 2.219e+04	170
2.219e+04 – 2.25e+04	305

ROG3 categorical feature

ROG3 is a two-letter country code with 240 distinct values and no nulls across 19,375 rows. Distribution is moderately concentrated: 'IN' alone accounts for 17.2% of records, followed by 'PP', 'PK', 'ID', and 'US', with entropy ratio 0.78 indicating reasonable spread across the long tail. The codes are not ISO 3166: the counts line up one-to-one with Ctry (PP = Papua New Guinea at 919, CH = China and NI = Nigeria at 557, BM = Myanmar at 165), which matches the FIPS 10-4 scheme rather than ISO's PG, CN, NG, and MM. Map to ISO before joining to ISO-keyed tables.

Treatment: Group rare codes into 'Other' and one-hot or target-encode for modelling; since ROG3 and Ctry carry the same information, keep only one of them.

anthropic:claude-opus-4-7 · confidence high

Out[26]:

saturn.columns["ROG3"].stats

stat	value
n	19,375
nulls	0 (0.0%)
unique	240
top_value	IN
top_rate	0.1719
cardinality	240
entropy	6.175
entropy_ratio	0.781

Fig 13.

Top values for ROG3.

Top values for ROG3 (20 unique shown, of 240 total).
value	count	share
IN	3330	17.2%
PP	919	4.7%
PK	851	4.4%
ID	806	4.2%
US	655	3.4%
CH	557	2.9%
NI	557	2.9%
CA	357	1.8%
BR	350	1.8%
MX	344	1.8%
BG	321	1.7%
CM	304	1.6%
NP	297	1.5%
CG	240	1.2%
SU	222	1.1%
AS	211	1.1%
RP	209	1.1%
MY	196	1.0%
RS	189	1.0%
BM	165	0.9%
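
The top-20 codes pair up with the Ctry names by count, and the ones that differ from ISO 3166-1 alpha-2 match FIPS 10-4. The lookup below is a sketch under the assumption that ROG3 follows FIPS 10-4. It covers only the 20 codes shown above, and `rog3_to_iso` is a name introduced here.

```python
# FIPS 10-4 codes from the top-20 list whose ISO 3166-1 alpha-2 code differs.
# Only codes visible in the table above are covered; the other 220 are not handled.
FIPS_TO_ISO = {
    "PP": "PG",  # Papua New Guinea
    "CH": "CN",  # China
    "NI": "NG",  # Nigeria
    "BG": "BD",  # Bangladesh
    "CG": "CD",  # Congo, Democratic Republic of
    "SU": "SD",  # Sudan
    "AS": "AU",  # Australia
    "RP": "PH",  # Philippines
    "RS": "RU",  # Russia
    "BM": "MM",  # Myanmar (Burma)
}
# Top-20 codes that are the same in both schemes
SAME_IN_BOTH = {"IN", "PK", "ID", "US", "CA", "BR", "MX", "CM", "NP", "MY"}

def rog3_to_iso(code):
    """Translate a ROG3 code to ISO alpha-2, or return None if it is not covered."""
    if code in FIPS_TO_ISO:
        return FIPS_TO_ISO[code]
    if code in SAME_IN_BOTH:
        return code
    return None

print(rog3_to_iso("PP"), rog3_to_iso("IN"), rog3_to_iso("ZZ"))
```

Returning None for uncovered codes is deliberate. Passing an unmapped code through unchanged would silently mislabel rows, because a code such as CH means China in FIPS but Switzerland in ISO.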

Ctry categorical feature

Country name field with 240 distinct values across 19,375 rows and zero nulls. India dominates at 17.2% (3,330 rows), followed by Papua New Guinea, Pakistan, and Indonesia, with the long tail spread broadly (entropy ratio 0.78). Papua New Guinea's share is high relative to its population, which is consistent with each row being a people group rather than a population-weighted sample.

Treatment: Group rare countries into an 'Other' bucket and target- or frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[29]:

saturn.columns["Ctry"].stats

stat	value
n	19,375
nulls	0 (0.0%)
unique	240
top_value	India
top_rate	0.1719
cardinality	240
entropy	6.175
entropy_ratio	0.781

Fig 14.

Top values for Ctry.

Top values for Ctry (20 unique shown, of 240 total).
value	count	share
India	3330	17.2%
Papua New Guinea	919	4.7%
Pakistan	851	4.4%
Indonesia	806	4.2%
United States	655	3.4%
China	557	2.9%
Nigeria	557	2.9%
Canada	357	1.8%
Brazil	350	1.8%
Mexico	344	1.8%
Bangladesh	321	1.7%
Cameroon	304	1.6%
Nepal	297	1.5%
Congo, Democratic Republic of	240	1.2%
Sudan	222	1.1%
Australia	211	1.1%
Philippines	209	1.1%
Malaysia	196	1.0%
Russia	189	1.0%
Myanmar (Burma)	165	0.9%
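
A minimal sketch of the rare-category bucketing suggested in the treatment. The threshold and the mini-column are hypothetical; in a modelling pipeline the counts would normally come from the training split only.

```python
from collections import Counter

def bucket_rare(values, min_count):
    """Replace categories seen fewer than min_count times with 'Other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "Other" for v in values]

# Hypothetical mini-column; the real Ctry column has 240 levels
ctry = ["India"] * 5 + ["Pakistan"] * 3 + ["Fiji", "Malta"]
bucketed = bucket_rare(ctry, min_count=3)
level_counts = Counter(bucketed)
print(level_counts)
```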

JPPeopleGroup text feature

This column holds Joshua Project-style people group labels — short ethnonyms or community descriptors like 'Deaf', 'British', 'French', and qualified forms such as 'Arab, (Muslim traditions)'. Values are short (mean 11.4 chars, median 1 word) and 54% are single-word, but with 10,624 uniques across 19,375 rows and a 38.5% duplicate rate, plus 10.9% nulls, the field is high-cardinality categorical with a long tail. The prevalence of parenthetical religious qualifiers ('(hindu', '(muslim', 'traditions)') in top_words suggests an embedded taxonomy that would need parsing to separate ethnicity from religious tradition.

Treatment: Normalize casing and split parenthetical religious qualifiers into a separate column before encoding as a high-cardinality category.

anthropic:claude-opus-4-7 · confidence high

Out[32]:

saturn.columns["JPPeopleGroup"].stats

stat	value
n	19,375
nulls	2,108 (10.9%)
unique	10,624
len_min	1
len_max	41
len_mean	11.35
len_median	9
len_p95	25
word_mean	1.618
word_median	1
n_empty	0
n_duplicates	6,643
duplicate_rate	0.3847
vocab_size	11,428
readability_flesch_mean	39.84
emoji_rate	0
url_rate	0
one_word_rate	0.5414
allcaps_rate	0
boilerplate_rate	0
alert: one_word	54.1% rows are a single word
alert: duplicates	38.5% duplicate strings
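
The parenthetical qualifiers can be separated from the base name with a small regular expression. This is a sketch that assumes the qualifier is always a single trailing "(...)" group, which should be checked against the data. `split_qualifier` is a helper written for this example.

```python
import re

# Trailing "(...)" qualifier at the end of the label
_QUALIFIER = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")

def split_qualifier(label):
    """Split 'Name (qualifier)' into (name, qualifier); qualifier is None if absent."""
    if label is None:
        return None, None
    match = _QUALIFIER.match(label)
    if match is None:
        return label.strip(), None
    base = match.group("base").rstrip(" ,")  # drop the comma left before the bracket
    return base, match.group("qualifier").strip()

print(split_qualifier("Arab, (Muslim traditions)"))
print(split_qualifier("Deaf"))
```

Labels without a qualifier pass through unchanged, so the function can be applied to the whole column.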

Fig 15.

Character-length distribution for JPPeopleGroup.

Character-length distribution for JPPeopleGroup (mean: 11.350147680546707).
chars	count
1 – 2	1
2 – 3	31
3 – 4	285
4 – 5	1325
5 – 6	1715
6 – 7	1938
7 – 8	1655
8 – 9	1143
9 – 10	697
10 – 11	674
11 – 12	598
12 – 13	742
13 – 14	714
14 – 15	930
15 – 16	794
16 – 17	671
17 – 18	511
18 – 19	345
19 – 20	275
20 – 21	245
21 – 22	223
22 – 23	182
23 – 24	199
24 – 25	295
25 – 26	263
26 – 27	231
27 – 28	136
28 – 29	107
29 – 30	75
30 – 31	49
31 – 32	59
32 – 33	61
33 – 34	39
34 – 35	32
35 – 36	19
36 – 37	1
37 – 38	1
38 – 39	2
39 – 40	2
40 – 41	2

JPPopulation numeric feature

This appears to be the Joshua Project (JP) population estimate for each people-group record, with values spanning from 10 to 918,811,000 and a median of just 16,000. The distribution is extraordinarily right-skewed (skew 94.9, kurtosis 10,840): the mean of 468,378 sits far above the Q3 of 85,000, and 2,609 non-null rows (15.1%) flag as outliers. Roughly 11% of rows are null, and only 1,731 unique values across 19,375 rows suggests heavy repetition, consistent with rounded estimates.

Treatment: log-transform and impute nulls before modelling; check the 918.8M maximum against the source before treating it as an error, since the histogram shows no other value above about 207M.

anthropic:claude-opus-4-7 · confidence high

Out[35]:

saturn.columns["JPPopulation"].stats

stat	value
n	19,375
nulls	2,128 (11.0%)
unique	1,731
min	10
max	9.188e+08
mean	4.684e+05
median	16,000
std	7.861e+06
q1	3,100
q3	85,000
iqr	81,900
skew	94.92
kurtosis	1.084e+04
n_outliers	2,609
outlier_rate	0.1513
zero_rate	0
alert: high_skew	skew=+94.92
alert: outliers	15.1% rows beyond 1.5 IQR
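
The outlier alert refers to the 1.5 IQR rule. Assuming the usual Tukey fences are meant, the cut-offs follow directly from the reported quartiles:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as reported for JPPopulation
lower, upper = tukey_fences(q1=3_100, q3=85_000)
print(lower, upper)
```

The lower fence is negative while the minimum is 10, so under this rule every flagged row sits above the upper fence of 207,850. The flag therefore marks large groups rather than bad data.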

Fig 16.

Distribution of JPPopulation. Vertical dash marks the median.

Histogram bins for JPPopulation (median: 16000.0).
bin	count
10 – 2.297e+07	17193
2.297e+07 – 4.594e+07	33
4.594e+07 – 6.891e+07	9
6.891e+07 – 9.188e+07	5
9.188e+07 – 1.149e+08	2
1.149e+08 – 1.378e+08	3
1.378e+08 – 1.608e+08	0
1.608e+08 – 1.838e+08	0
1.838e+08 – 2.067e+08	1
2.067e+08 – 2.297e+08	0
2.297e+08 – 2.527e+08	0
2.527e+08 – 2.756e+08	0
2.756e+08 – 2.986e+08	0
2.986e+08 – 3.216e+08	0
3.216e+08 – 3.446e+08	0
3.446e+08 – 3.675e+08	0
3.675e+08 – 3.905e+08	0
3.905e+08 – 4.135e+08	0
4.135e+08 – 4.364e+08	0
4.364e+08 – 4.594e+08	0
4.594e+08 – 4.824e+08	0
4.824e+08 – 5.053e+08	0
5.053e+08 – 5.283e+08	0
5.283e+08 – 5.513e+08	0
5.513e+08 – 5.743e+08	0
5.743e+08 – 5.972e+08	0
5.972e+08 – 6.202e+08	0
6.202e+08 – 6.432e+08	0
6.432e+08 – 6.661e+08	0
6.661e+08 – 6.891e+08	0
6.891e+08 – 7.121e+08	0
7.121e+08 – 7.35e+08	0
7.35e+08 – 7.58e+08	0
7.58e+08 – 7.81e+08	0
7.81e+08 – 8.04e+08	0
8.04e+08 – 8.269e+08	0
8.269e+08 – 8.499e+08	0
8.499e+08 – 8.729e+08	0
8.729e+08 – 8.958e+08	0
8.958e+08 – 9.188e+08	1

JPIndigenous categorical feature

Binary Y/N flag, almost certainly indicating whether a people group is classified as indigenous to the country of the record (the 'JP' prefix marks a Joshua Project field). 'Y' dominates at 71.2% of non-null rows (12,293 vs 4,974 'N'), and 10.88% of rows are null. Entropy ratio of 0.87 shows the two classes are reasonably balanced once nulls are excluded.

Treatment: Encode as boolean and decide explicitly whether to impute or flag the ~11% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[38]:

saturn.columns["JPIndigenous"].stats

stat	value
n	19,375
nulls	2,108 (10.9%)
unique	2
top_value	Y
top_rate	0.7119
cardinality	2
entropy	0.8662
entropy_ratio	0.8662
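
One way to follow the treatment is to keep the flag as a nullable boolean and add a separate missing indicator, so the roughly 11% of nulls are neither dropped nor silently turned into 'N'. A sketch with a hypothetical helper `encode_flag` and sample values:

```python
def encode_flag(value):
    """Map 'Y'/'N' to (bool_or_None, is_missing); anything else is treated as missing."""
    if value == "Y":
        return True, 0
    if value == "N":
        return False, 0
    return None, 1

# Hypothetical sample values
encoded = [encode_flag(v) for v in ("Y", "N", None, "Y")]
print(encoded)
```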

Fig 17.

Top values for JPIndigenous.

Top values for JPIndigenous (2 unique shown, of 2 total).
value	count	share
Y	12293	63.4%
N	4974	25.7%

JPROL3 text feature

JPROL3 holds three-letter ISO 639-3 language codes (hin, eng, ben, spa, tam...), with every value being a single token of length 3. The field is 10.88% null and heavily duplicated (64.3% duplicate rate across 6,165 unique codes), with 'hin' (700) and 'eng' (537) leading. The unique count (6,165) is one more than vocab_size (6,164) and allcaps_rate is non-zero, so a few codes probably differ only by letter case. Roughly 6,000 codes is large for a categorical feature but plausible for a worldwide people-group list, since ISO 639-3 defines several thousand language codes.

Treatment: Treat as a categorical language-code feature; one-hot or target-encode top codes and bucket the long tail.

anthropic:claude-opus-4-7 · confidence high

Out[41]:

saturn.columns["JPROL3"].stats

stat	value
n	19,375
nulls	2,108 (10.9%)
unique	6,165
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	11,102
duplicate_rate	0.643
vocab_size	6,164
readability_flesch_mean	115.3
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.0003475
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	64.3% duplicate strings

Fig 18.

Character-length distribution for JPROL3.

Character-length distribution for JPROL3 (mean: 3.0).
chars	count
3	17267

JPPrimaryLanguage text feature

Categorical free-text capturing each record's primary language, dominated by single-word entries (one_word_rate 0.7335) with a mean of 1.32 words and median length 7. There are 6,152 unique values among the 17,267 non-null rows, and the tail is long: the top value 'Hindi' appears just 700 times, 64.37% of rows are duplicates, and 10.88% are null. vocab_size (6,236) exceeds the unique count (6,152) because multi-word entries such as 'Chinese, Mandarin' split into several tokens. Token-level counts will therefore fragment these names, whereas whole-string value_counts will not. The comma form looks like an inverted name for a single language rather than a list of languages, so check before splitting on commas.

Treatment: Normalise casing and whitespace, and keep each entry as one category unless inspection shows genuine multi-language lists; then encode as a high-cardinality categorical feature.

anthropic:claude-opus-4-7 · confidence high

Out[44]:

saturn.columns["JPPrimaryLanguage"].stats

stat	value
n	19,375
nulls	2,108 (10.9%)
unique	6,152
len_min	1
len_max	41
len_mean	9.088
len_median	7
len_p95	18
word_mean	1.319
word_median	1
n_empty	0
n_duplicates	11,115
duplicate_rate	0.6437
vocab_size	6,236
readability_flesch_mean	31.33
emoji_rate	0
url_rate	0
one_word_rate	0.7335
allcaps_rate	0
boilerplate_rate	0
alert: one_word	73.4% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	64.4% duplicate strings
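
The gap between vocab_size and the unique count comes from counting tokens rather than whole strings. The sketch below illustrates the difference with hypothetical entries. How Saturn tokenises is not documented here, so the regular expression is an assumption.

```python
import re
from collections import Counter

# Hypothetical entries in the style the summary describes
languages = ["Hindi", "Chinese, Mandarin", "Chinese, Yue", "Hindi", None]

present = [s for s in languages if s is not None]

# Whole-string counts keep 'Chinese, Mandarin' as one category
value_counts = Counter(present)

# Token counts split on non-letters, so 'Chinese' becomes its own entry
tokens = [t.lower() for s in present for t in re.findall(r"[^\W\d_]+", s)]
token_counts = Counter(tokens)

n_unique = len(value_counts)    # distinct strings
vocab_size = len(token_counts)  # distinct tokens
print(n_unique, vocab_size)
```

Here three distinct strings produce four distinct tokens, the same direction of mismatch as in the stats above.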

Fig 19.

Character-length distribution for JPPrimaryLanguage.

Character-length distribution for JPPrimaryLanguage (mean: 9.088318758325128).
chars	count
1 – 2	2
2 – 3	39
3 – 4	303
4 – 5	1440
5 – 6	2649
6 – 7	2244
7 – 8	3274
8 – 9	1229
9 – 10	706
10 – 11	642
11 – 12	299
12 – 13	377
13 – 14	335
14 – 15	396
15 – 16	398
16 – 17	1265
17 – 18	631
18 – 19	179
19 – 20	123
20 – 21	148
21 – 22	92
22 – 23	78
23 – 24	95
24 – 25	92
25 – 26	39
26 – 27	36
27 – 28	31
28 – 29	23
29 – 30	12
30 – 31	11
31 – 32	31
32 – 33	16
33 – 34	9
34 – 35	1
35 – 36	1
36 – 37	1
37 – 38	1
38 – 39	1
39 – 40	8
40 – 41	10

JPRLG3 numeric feature

JPRLG3 is a low-cardinality integer code taking 8 distinct values between 1 and 9 (no rows at 3). Its per-value counts match the JPPrimaryReligion categories exactly (for example, 7,140 rows at code 1 and 7,140 rows of Christianity; 4,003 at code 6 and 4,003 of Islam), so it is almost certainly the numeric code for that column, which makes it nominal rather than ordinal. The mean (3.37), median (4.0), skew, and kurtosis are therefore not meaningful, and its strong Pearson correlations with JPScale and JP%ChristianAdherent reflect how the codes happen to be numbered. About 10.88% of rows are null.

Treatment: Treat as a nominal category (8 levels), or drop it in favour of JPPrimaryReligion; impute or flag the 10.88% missing before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[47]:

saturn.columns["JPRLG3"].stats

stat	value
n	19,375
nulls	2,108 (10.9%)
unique	8
min	1
max	9
mean	3.367
median	4
std	2.199
q1	1
q3	6
iqr	5
skew	0.09636
kurtosis	-1.595
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 20.

Distribution of JPRLG3. Vertical dash marks the median.

Histogram bins for JPRLG3 (median: 4.0).
bin	count
1 – 1.2	7140
1.2 – 1.4	0
1.4 – 1.6	0
1.6 – 1.8	0
1.8 – 2	0
2 – 2.2	669
2.2 – 2.4	0
2.4 – 2.6	0
2.6 – 2.8	0
2.8 – 3	0
3 – 3.2	0
3.2 – 3.4	0
3.4 – 3.6	0
3.6 – 3.8	0
3.8 – 4	0
4 – 4.2	2641
4.2 – 4.4	0
4.4 – 4.6	0
4.6 – 4.8	0
4.8 – 5	0
5 – 5.2	2396
5.2 – 5.4	0
5.4 – 5.6	0
5.6 – 5.8	0
5.8 – 6	0
6 – 6.2	4003
6.2 – 6.4	0
6.4 – 6.6	0
6.6 – 6.8	0
6.8 – 7	0
7 – 7.2	264
7.2 – 7.4	0
7.4 – 7.6	0
7.6 – 7.8	0
7.8 – 8	0
8 – 8.2	128
8.2 – 8.4	0
8.4 – 8.6	0
8.6 – 8.8	0
8.8 – 9	26
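
The non-empty bins above have the same counts as the JPPrimaryReligion categories reported elsewhere in this profile. The sketch below pairs codes with labels by matching counts. A count match is suggestive rather than proof, so the mapping should be confirmed against the rows, and `match_by_count` is a helper introduced here.

```python
# Non-empty histogram bins for JPRLG3: code -> row count
code_counts = {1: 7140, 2: 669, 4: 2641, 5: 2396, 6: 4003, 7: 264, 8: 128, 9: 26}

# Top values reported for JPPrimaryReligion: label -> row count
label_counts = {
    "Christianity": 7140, "Islam": 4003, "Ethnic Religions": 2641,
    "Hinduism": 2396, "Buddhism": 669, "Non-Religious": 264,
    "Other / Small": 128, "Unknown": 26,
}

def match_by_count(code_counts, label_counts):
    """Pair codes with labels that have the same row count, if counts are unambiguous."""
    by_count = {}
    for label, n in label_counts.items():
        if n in by_count:
            raise ValueError(f"count {n} is shared by more than one label")
        by_count[n] = label
    return {code: by_count.get(n) for code, n in code_counts.items()}

code_to_label = match_by_count(code_counts, label_counts)
for code, label in code_to_label.items():
    print(code, label)
```

Every code finds exactly one label, which is strong evidence that JPRLG3 is a nominal religion code and should not be modelled as an ordered number.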

JPPrimaryReligion categorical feature

Categorical label assigning a primary religion to each people-group record. Eight categories cover the 17,267 non-null rows (~10.9% of the 19,375 rows are missing); Christianity leads at 41.4% of non-null rows (7,140), followed by Islam (4,003) and Ethnic Religions (2,641), with entropy_ratio 0.72 indicating a moderately balanced distribution rather than a single dominant class. Note the explicit 'Unknown' bucket (26) co-existing with a 10.88% null rate, suggesting two distinct kinds of missingness.

Treatment: One-hot or target-encode; reconcile the explicit 'Unknown' category with the 10.88% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[50]:

saturn.columns["JPPrimaryReligion"].stats

stat	value
n	19,375
nulls	2,108 (10.9%)
unique	8
top_value	Christianity
top_rate	0.4135
cardinality	8
entropy	2.166
entropy_ratio	0.722

Fig 21.

Top values for JPPrimaryReligion.

Top values for JPPrimaryReligion (8 unique shown, of 8 total).
value	count	share
Christianity	7140	36.9%
Islam	4003	20.7%
Ethnic Religions	2641	13.6%
Hinduism	2396	12.4%
Buddhism	669	3.5%
Non-Religious	264	1.4%
Other / Small	128	0.7%
Unknown	26	0.1%
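
The entropy figures in the stats can be reproduced from these eight counts. The sketch assumes entropy is measured in bits over non-null rows and that entropy_ratio is entropy divided by log2 of the cardinality, which is consistent with the reported values.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Non-null counts for the eight JPPrimaryReligion categories
counts = [7140, 4003, 2641, 2396, 669, 264, 128, 26]

h = entropy_bits(counts)
h_ratio = h / math.log2(len(counts))  # 1.0 would mean a perfectly even split
print(round(h, 3), round(h_ratio, 3))
```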

JPScale numeric feature

JPScale is an integer-coded ordinal feature with only 5 distinct values spanning 1 to 5, a mean of 2.71 and median of 3, suggesting an ordered rating scale. The distribution is U-shaped rather than flat: values 1, 4, and 5 hold 7,186, 3,872, and 3,302 rows, while 2 and 3 hold only 1,119 and 1,788, which is what drives the negative kurtosis (-1.64). There are no outliers, and the 10.88% null rate should be handled explicitly.

Treatment: Treat as ordinal categorical (1-5) and impute or flag the ~11% missing values before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[53]:

saturn.columns["JPScale"].stats

stat	value
n	19,375
nulls	2,108 (10.9%)
unique	5
min	1
max	5
mean	2.71
median	3
std	1.623
q1	1
q3	4
iqr	3
skew	0.159
kurtosis	-1.637
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 22.

Distribution of JPScale. Vertical dash marks the median.

Histogram bins for JPScale (median: 3.0).
bin	count
1 – 1.1	7186
1.1 – 1.2	0
1.2 – 1.3	0
1.3 – 1.4	0
1.4 – 1.5	0
1.5 – 1.6	0
1.6 – 1.7	0
1.7 – 1.8	0
1.8 – 1.9	0
1.9 – 2	0
2 – 2.1	1119
2.1 – 2.2	0
2.2 – 2.3	0
2.3 – 2.4	0
2.4 – 2.5	0
2.5 – 2.6	0
2.6 – 2.7	0
2.7 – 2.8	0
2.8 – 2.9	0
2.9 – 3	0
3 – 3.1	1788
3.1 – 3.2	0
3.2 – 3.3	0
3.3 – 3.4	0
3.4 – 3.5	0
3.5 – 3.6	0
3.6 – 3.7	0
3.7 – 3.8	0
3.8 – 3.9	0
3.9 – 4	0
4 – 4.1	3872
4.1 – 4.2	0
4.2 – 4.3	0
4.3 – 4.4	0
4.4 – 4.5	0
4.5 – 4.6	0
4.6 – 4.7	0
4.7 – 4.8	0
4.8 – 4.9	0
4.9 – 5	3302
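
Because JPScale takes only five values, the five non-empty bins fully describe it. A short sketch that recovers the reported mean and median from those counts (`mean_and_median` is a helper written for this example):

```python
def mean_and_median(value_counts):
    """Mean and median of an integer-coded column from {value: count}."""
    n = sum(value_counts.values())
    if n == 0:
        raise ValueError("empty distribution")
    mean = sum(v * c for v, c in value_counts.items()) / n
    # Median: walk the sorted values until the cumulative count reaches the midpoint
    # (exact for odd n; for even n this returns the lower of the two middle values)
    midpoint = (n + 1) // 2
    running = 0
    for value in sorted(value_counts):
        running += value_counts[value]
        if running >= midpoint:
            return mean, value

# Non-empty JPScale bins from the histogram above
jpscale = {1: 7186, 2: 1119, 3: 1788, 4: 3872, 5: 3302}
mean, median = mean_and_median(jpscale)
print(round(mean, 2), median)
```

The counts sum to the 17,267 non-null rows, and the median falls on 3 even though 3 is one of the least common values, a reminder that the distribution is U-shaped.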

JPLeastReached categorical metadata

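JPLeastReached holds a single non-null value 'Y' in 7,186 rows, with the remaining 62.91% of records null. With cardinality 1 and entropy 0, the non-null values alone carry no information; any signal lies in whether the flag is present. The 7,186 'Y' rows equal the number of rows at JPScale = 1, so the flag may simply mark the lowest scale level, and the nulls mix 'not flagged' with rows that have no JP data at all (about 10.9% of rows).

Treatment: Recode as a binary is-present indicator if the null pattern is meaningful, keeping rows without JP data as missing; otherwise drop it as redundant with JPScale.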

anthropic:claude-opus-4-7 · confidence high

Out[56]:

saturn.columns["JPLeastReached"].stats

stat	value
n	19,375
nulls	12,189 (62.9%)
unique	1
top_value	Y
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	62.9% null
alert: imbalance	top value is 100.0% of rows

Fig 23.

Top values for JPLeastReached.

Top values for JPLeastReached (1 unique shown, of 1 total).
value	count	share
Y	7186	37.1%

JP%ChristianAdherent numeric feature

Numeric share (0–100) of a population identified as Christian adherents, judging by the name and full 0–100 range with mean 37.58 and median 17.0. The distribution is bimodal: q1 sits at 0.02 while q3 hits 80.0, with 23.7% exact zeros and kurtosis of -1.60, so values cluster at the extremes rather than the middle. Null rate is 10.88%, which should be handled explicitly.

Treatment: Treat as a bounded percentage; impute or flag the 10.88% nulls and consider a zero-inflation indicator before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[59]:

saturn.columns["JP%ChristianAdherent"].stats

stat	value
n	19,375
nulls	2,108 (10.9%)
unique	957
min	0
max	100
mean	37.58
median	17
std	39.33
q1	0.02
q3	80
iqr	79.98
skew	0.3824
kurtosis	-1.602
n_outliers	0
outlier_rate	0
zero_rate	0.2371

Fig 24.

Distribution of JP%ChristianAdherent. Vertical dash marks the median.

Histogram bins for JP%ChristianAdherent (median: 17.0).
bin	count
0 – 2.5	6605
2.5 – 5	525
5 – 7.5	484
7.5 – 10	300
10 – 12.5	410
12.5 – 15	98
15 – 17.5	223
17.5 – 20	61
20 – 22.5	297
22.5 – 25	29
25 – 27.5	202
27.5 – 30	33
30 – 32.5	274
32.5 – 35	40
35 – 37.5	119
37.5 – 40	22
40 – 42.5	236
42.5 – 45	19
45 – 47.5	164
47.5 – 50	30
50 – 52.5	127
52.5 – 55	31
55 – 57.5	201
57.5 – 60	42
60 – 62.5	525
62.5 – 65	98
65 – 67.5	421
67.5 – 70	93
70 – 72.5	486
72.5 – 75	105
75 – 77.5	325
77.5 – 80	144
80 – 82.5	479
82.5 – 85	156
85 – 87.5	411
87.5 – 90	149
90 – 92.5	969
92.5 – 95	359
95 – 97.5	1340
97.5 – 100	635

JP%Evangelical numeric feature

This column appears to capture the percentage of Evangelical adherents in each people group (the JP prefix matches the Joshua Project source named in the overview). The distribution is heavily right-skewed (skew 2.53, kurtosis 8.38): the median is just 1.5 and q3 is 8.0, yet the max is 95.0, and 27% of values are exactly zero. About 17.2% of rows are null and 9.7% are flagged as outliers, so the long tail of high-Evangelical groups is real but sparse.

Treatment: Impute or flag nulls, then apply a log1p or similar transform before modelling to tame the skew.

anthropic:claude-opus-4-7 · confidence high
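
The outlier alert and the suggested transform can both be worked through from the reported quartiles. The sketch below assumes the alert uses the usual Tukey rule (values above q3 + 1.5 × IQR); that reading is inferred from the alert wording, and the function name is hypothetical.

```python
import math

# Quartiles as reported in the stats table for JP%Evangelical.
q1, q3 = 0.0, 8.0
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr   # assumed Tukey upper fence
print(upper_fence)             # 20.0

def log1p_pct(pct):
    """log(1 + x): defined at zero, compresses the long right tail."""
    return math.log1p(pct)

for pct in (0.0, 1.5, 8.0, 95.0):
    print(pct, round(log1p_pct(pct), 3))
```

`log1p` is used rather than a plain logarithm because 27% of the values are exactly zero, where `log` is undefined.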

Out[62]:

saturn.columns["JP%Evangelical"].stats

stat	value
n	19,375
nulls	3,332 (17.2%)
unique	692
min	0
max	95
mean	6.279
median	1.5
std	10.18
q1	0
q3	8
iqr	8
skew	2.535
kurtosis	8.378
n_outliers	1,561
outlier_rate	0.0973
zero_rate	0.27
alert: high_skew	skew=+2.53
alert: outliers	9.7% rows beyond 1.5 IQR

Fig 25.

Distribution of JP%Evangelical. Vertical dash marks the median.

Histogram bins for JP%Evangelical (median: 1.5).
bin	count
0 – 2.375	8955
2.375 – 4.75	1481
4.75 – 7.125	1358
7.125 – 9.5	669
9.5 – 11.88	432
11.88 – 14.25	623
14.25 – 16.62	364
16.62 – 19	283
19 – 21.38	425
21.38 – 23.75	222
23.75 – 26.12	350
26.12 – 28.5	131
28.5 – 30.88	198
30.88 – 33.25	97
33.25 – 35.62	91
35.62 – 38	20
38 – 40.38	64
40.38 – 42.75	22
42.75 – 45.12	98
45.12 – 47.5	38
47.5 – 49.88	22
49.88 – 52.25	36
52.25 – 54.62	4
54.62 – 57	8
57 – 59.38	1
59.38 – 61.75	17
61.75 – 64.12	4
64.12 – 66.5	3
66.5 – 68.88	0
68.88 – 71.25	8
71.25 – 73.62	2
73.62 – 76	4
76 – 78.38	4
78.38 – 80.75	3
80.75 – 83.12	0
83.12 – 85.5	0
85.5 – 87.88	2
87.88 – 90.25	2
90.25 – 92.62	0
92.62 – 95	2

CPPIPeopleGroup text feature

This column appears to hold ethnolinguistic people-group labels from the CPPI side of the cross-reference, with values like 'Han Chinese', 'British', 'Russian', and 'Korean' among the most frequent. Values are short and label-like (72.7% are single words, mean length 8.7 characters), but the column is high-cardinality: 9,170 unique values among the 12,256 non-null rows, with 36.7% of all 19,375 rows null. Duplication is also notable (25.2%), and the long tail includes compound entries like 'Han Chinese, Mandarin', suggesting inconsistent delimiting.

Treatment: Normalize delimiters and treat as a high-cardinality categorical; group rare levels or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
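
Grouping rare levels, as the treatment suggests, might look like the sketch below. The sample labels, the `min_count` threshold, and the "Other" bucket name are all illustrative assumptions; a real threshold would be chosen from the actual frequency table.

```python
from collections import Counter

def group_rare(labels, min_count=2, other="Other"):
    """Replace labels seen fewer than min_count times with a shared bucket."""
    counts = Counter(label for label in labels if label is not None)
    return [
        label if label is None or counts[label] >= min_count else other
        for label in labels
    ]

sample = ["Han Chinese", "British", "Han Chinese", "Korean", None]
grouped = group_rare(sample)
print(grouped)  # ['Han Chinese', 'Other', 'Han Chinese', 'Other', None]
```

Nulls are passed through untouched so that missingness can be handled as its own decision rather than being folded into the rare bucket.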

Out[65]:

saturn.columns["CPPIPeopleGroup"].stats

stat	value
n	19,375
nulls	7,119 (36.7%)
unique	9,170
len_min	1
len_max	38
len_mean	8.68
len_median	7
len_p95	18
word_mean	1.316
word_median	1
n_empty	0
n_duplicates	3,086
duplicate_rate	0.2518
vocab_size	8,586
readability_flesch_mean	40.96
emoji_rate	0
url_rate	0
one_word_rate	0.7273
allcaps_rate	0
boilerplate_rate	0
alert: one_word	72.7% rows are a single word
alert: null_rate	36.7% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	25.2% duplicate strings
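
The duplicate figures above are consistent with counting every non-null value beyond the first occurrence of each distinct string. That definition is inferred from the numbers rather than stated by Saturn, and the helper name is hypothetical.

```python
def duplicate_stats(n_rows, n_nulls, n_unique):
    """Return (n_duplicates, duplicate_rate) under the assumed definition."""
    non_null = n_rows - n_nulls
    n_duplicates = non_null - n_unique
    return n_duplicates, n_duplicates / non_null

# Values taken from the CPPIPeopleGroup stats table.
n_dup, rate = duplicate_stats(n_rows=19_375, n_nulls=7_119, n_unique=9_170)
print(n_dup, round(rate, 4))  # 3086 0.2518
```

Under this reading, the 25.2% rate is a share of the 12,256 non-null rows, not of all 19,375 rows.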

Fig 26.

Character-length distribution for CPPIPeopleGroup.

Character-length distribution for CPPIPeopleGroup (mean: 8.68).
chars	count
1 – 2	1
2 – 3	42
3 – 4	322
4 – 5	1274
5 – 6	1850
6 – 7	1845
7 – 7	1484
7 – 8	1030
8 – 9	558
9 – 10	467
10 – 11	451
11 – 12	398
12 – 13	424
13 – 14	0
14 – 15	494
15 – 16	376
16 – 17	320
17 – 18	245
18 – 19	128
19 – 20	81
20 – 20	83
20 – 21	111
21 – 22	78
22 – 23	53
23 – 24	39
24 – 25	29
25 – 26	0
26 – 27	27
27 – 28	9
28 – 29	10
29 – 30	10
30 – 31	3
31 – 32	5
32 – 32	3
32 – 33	2
33 – 34	0
34 – 35	0
35 – 36	0
36 – 37	2
37 – 38	2

CPPIPopulation text feature

This column holds CPPI population counts stored as comma-formatted strings rather than numbers, with values like ' 1,300 ' and ' 15,500 ' padded by whitespace; the column header itself is padded (' CPPIPopulation '). It is 36.74% null and heavily repetitive: 1,629 unique values among the 12,256 non-null rows give an 86.7% duplicate rate, suggesting rounded or bucketed figures. The 'multilingual' and 'allcaps' alerts are spurious artifacts of digit-only text: only 3 non-English rows exist out of 256 detected.

Treatment: Strip whitespace and commas, then cast to integer; impute or flag the 36.74% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
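
A minimal sketch of the cleaning step, assuming every populated cell is a whitespace-padded, comma-grouped integer like the two examples quoted above. The function name is hypothetical, and anything that does not parse as digits is returned as `None` rather than guessed at.

```python
def parse_population(cell):
    """Convert a padded, comma-formatted string such as ' 1,300 ' to an int."""
    if cell is None:
        return None
    text = cell.strip().replace(",", "")
    return int(text) if text.isdigit() else None

print(parse_population(" 1,300 "))   # 1300
print(parse_population(" 15,500 "))  # 15500
print(parse_population(None))        # None
```

Because the header is padded too, the column key may also need stripping before lookup.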

Out[68]:

saturn.columns[" CPPIPopulation "].stats

stat	value
n	19,375
nulls	7,119 (36.7%)
unique	1,629
len_min	3
len_max	13
len_mean	7.838
len_median	8
len_p95	11
word_mean	3
word_median	3
n_empty	0
n_duplicates	10,627
duplicate_rate	0.8671
vocab_size	1,629
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	0
allcaps_rate	1
boilerplate_rate	0
alert: multilingual	5 languages detected in sample
alert: allcaps	100.0% rows are all-caps
alert: null_rate	36.7% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	86.7% duplicate strings

Fig 27.

Character-length distribution for CPPIPopulation.

Character-length distribution for CPPIPopulation (mean: 7.84); lengths with no rows omitted.
chars	count
3	2
4	128
5	1205
7	3328
8	4079
9	2597
11	796
12	115
13	6

CPPIROL text feature

CPPIROL holds 3-character ISO 639-3 language codes (eng, spa, tel, hin, por...); every value is exactly one word of length 3. The field is null in 36.74% of rows. Among the 12,256 non-null values there are 5,014 distinct codes, so 59.09% are repeats, and 'und' (undetermined) appears 131 times. The distribution is long-tailed, with English and Spanish the most frequent codes.

Treatment: Treat as a categorical language-code feature; impute nulls (or map to 'und') and one-hot or target-encode the high-frequency codes.

anthropic:claude-opus-4-7 · confidence high
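
The encoding described in the treatment could be sketched as follows. The sample codes, the `top_k` cutoff, and the "other" column are illustrative assumptions. Mapping nulls to 'und' is one of the two options the treatment names; it merges "missing" with "undetermined", which may or may not be acceptable.

```python
from collections import Counter

def one_hot_top_codes(codes, top_k=2, missing="und"):
    """One-hot encode the top_k most frequent codes; the rest go to 'other'."""
    filled = [missing if c is None else c for c in codes]
    top = [code for code, _ in Counter(filled).most_common(top_k)]
    rows = []
    for c in filled:
        row = {code: int(c == code) for code in top}
        row["other"] = int(c not in top)
        rows.append(row)
    return top, rows

sample = ["eng", "spa", "eng", None, "tel", "spa", "eng"]
top_codes, encoded = one_hot_top_codes(sample)
print(top_codes)   # ['eng', 'spa']
print(encoded[3])  # {'eng': 0, 'spa': 0, 'other': 1}
```

With 5,014 distinct codes, a small `top_k` plus a catch-all column keeps the feature matrix narrow; target encoding is the alternative the treatment mentions for the long tail.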

Out[71]:

saturn.columns["CPPIROL"].stats

stat	value
n	19,375
nulls	7,119 (36.7%)
unique	5,014
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	7,242
duplicate_rate	0.5909
vocab_size	5,014
readability_flesch_mean	119.5
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: null_rate	36.7% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	59.1% duplicate strings

Fig 28.

Character-length distribution for CPPIROL.

Character-length distribution for CPPIROL (mean: 3.0); all 12,256 non-null values have length 3.
chars	count
3	12256

CPPIPrimaryLanguage text feature

This column records a primary language name per record, with 5,014 distinct values among the 12,256 non-null rows and a majority of one-word entries (one_word_rate 0.77, word_mean 1.29). Top values look canonical (English 401, Spanish 351, Telugu 214); 36.7% of rows are null and 59.1% of non-null values are repeats. The unique count, null count, and duplicate count are identical to CPPIROL's, and the explicit 'Undetermined' bucket (131) matches the 131 'und' codes, which suggests one name per language code rather than free-text entry; that mapping is worth verifying before use. Worth noting: 'arabic' appears 385 times in top_words but does not surface in top_values, suggesting it is split across multi-word variants (e.g. dialect-qualified forms).

Treatment: Confirm that each name maps to a single CPPIROL code, canonicalize any variants that do not, and map 'Undetermined'/nulls to a single missing token before encoding.

anthropic:claude-opus-4-7 · confidence high
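
Since CPPIROL and CPPIPrimaryLanguage report the same unique count (5,014), one quick check on whether the names follow a controlled vocabulary is to look for codes that map to more than one name. The sketch below runs on an invented sample of (code, name) pairs; the function name and the sample are hypothetical.

```python
from collections import defaultdict

def mapping_conflicts(pairs):
    """Return {code: [names]} for codes paired with more than one name."""
    names_by_code = defaultdict(set)
    for code, name in pairs:
        if code is not None and name is not None:
            names_by_code[code].add(name)
    return {
        code: sorted(names)
        for code, names in names_by_code.items()
        if len(names) > 1
    }

sample_pairs = [
    ("eng", "English"),
    ("spa", "Spanish"),
    ("eng", "English"),
    ("tel", "Telugu"),
    ("tel", "Telegu"),   # deliberate variant spelling for the demo
    (None, None),
]
print(mapping_conflicts(sample_pairs))  # {'tel': ['Telegu', 'Telugu']}
```

An empty result on the real data would support treating the name as a display label for the code; any conflicts would be the variants to canonicalize.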

Out[74]:

saturn.columns["CPPIPrimaryLanguage"].stats

stat	value
n	19,375
nulls	7,119 (36.7%)
unique	5,014
len_min	1
len_max	41
len_mean	8.622
len_median	7
len_p95	18
word_mean	1.288
word_median	1
n_empty	0
n_duplicates	7,242
duplicate_rate	0.5909
vocab_size	5,098
readability_flesch_mean	37.09
emoji_rate	0
url_rate	0
one_word_rate	0.7674
allcaps_rate	0
boilerplate_rate	0
alert: one_word	76.7% rows are a single word
alert: null_rate	36.7% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	59.1% duplicate strings

Fig 29.

Character-length distribution for CPPIPrimaryLanguage.

Character-length distribution for CPPIPrimaryLanguage (mean: 8.62).
chars	count
1 – 2	2
2 – 3	32
3 – 4	261
4 – 5	1169
5 – 6	1753
6 – 7	1817
7 – 8	2203
8 – 9	958
9 – 10	564
10 – 11	501
11 – 12	277
12 – 13	398
13 – 14	324
14 – 15	256
15 – 16	592
16 – 17	374
17 – 18	110
18 – 19	90
19 – 20	66
20 – 21	84
21 – 22	65
22 – 23	101
23 – 24	89
24 – 25	50
25 – 26	23
26 – 27	15
27 – 28	10
28 – 29	3
29 – 30	11
30 – 31	13
31 – 32	12
32 – 33	12
33 – 34	5
34 – 35	0
35 – 36	1
36 – 37	0
37 – 38	2
38 – 39	1
39 – 40	8
40 – 41	4

CPPIPrimaryReligion categorical label

Categorical label identifying the primary religion of each record (likely a people-group row), with 40 distinct categories. The distribution is broad (entropy ratio 0.71) with no dominant class: the leading value 'Islam - Sunni' covers only 14.7% of non-null rows, followed by Non-Evangelical Protestant Christianity and two Ethnoreligion variants. Note the heavy 47.6% null rate, which will materially shrink any analysis conditioned on this field and is higher than the ~36.7% seen on the other CPPI columns.

Treatment: Impute or bucket nulls explicitly before use; consider collapsing the 40 categories into parent religions for modelling.

anthropic:claude-opus-4-7 · confidence high
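
Collapsing to parent religions can be sketched by splitting on the ' - ' separator that the listed labels appear to use. That separator convention is an assumption based on the top values shown for this column, and the "Missing" bucket name is hypothetical.

```python
def parent_religion(label, missing="Missing"):
    """Map 'Parent - Subtype' to 'Parent'; bucket nulls explicitly."""
    if label is None:
        return missing
    return label.split(" - ", 1)[0].strip()

for label in ("Islam - Sunni", "Hinduism", "Ethnoreligion - Shinbutso-shugo", None):
    print(repr(label), "->", parent_religion(label))
```

Splitting on ' - ' with surrounding spaces, rather than on a bare hyphen, leaves hyphenated subtype names such as 'Shinbutso-shugo' intact.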

Out[77]:

saturn.columns["CPPIPrimaryReligion"].stats

stat	value
n	19,375
nulls	9,227 (47.6%)
unique	40
top_value	Islam - Sunni
top_rate	0.1473
cardinality	40
entropy	3.791
entropy_ratio	0.7123
alert: null_rate	47.6% null

Fig 30.

Top values for CPPIPrimaryReligion.

Top values for CPPIPrimaryReligion (20 unique shown, of 40 total).
value	count	share of all rows
Islam - Sunni	1495	7.7%
Christianity - Non-Evangelical Protestant	1358	7.0%
Ethnoreligion	1223	6.3%
Ethnoreligion - Animism	1196	6.2%
Christianity - Roman Catholicism	978	5.0%
Hinduism	820	4.2%
Christianity - Evangelical Protestant	738	3.8%
Islam - Folk	359	1.9%
Christianity - Folk Catholicism	349	1.8%
Christianity - Eastern Orthodox	257	1.3%
Ethnoreligion - Chinese Folk Religion	252	1.3%
Hinduism - Folk	192	1.0%
Unaffiliated	176	0.9%
Christianity - Neo-Pentecostalism	121	0.6%
Islam - Shia	103	0.5%
Buddhism - Theravada	103	0.5%
Buddhism - Tibetan	97	0.5%
Judaism	69	0.4%
Unaffiliated - Athiesm	50	0.3%
Ethnoreligion - Shinbutso-shugo	34	0.2%

CPPIGSEC numeric feature

CPPIGSEC is an integer-coded categorical with only 7 unique values ranging 0-6, almost certainly an ordinal classification or sector code rather than a true numeric measure. The distribution is two-humped rather than flat (kurtosis -1.43, IQR spanning 1 to 5): code 1 alone accounts for 6,549 of the 12,256 non-null rows (53.4%), codes 4-6 account for another 4,939 (40.3%), and codes 0, 2, and 3 are sparse. That split explains why the median is 1 while the mean is 2.65. Notably, 36.74% of rows are null, which is large enough to materially affect any downstream join or model.

Treatment: Treat as categorical/ordinal and impute or add a missingness indicator before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[80]:

saturn.columns["CPPIGSEC"].stats

stat	value
n	19,375
nulls	7,119 (36.7%)
unique	7
min	0
max	6
mean	2.654
median	1
std	2.058
q1	1
q3	5
iqr	4
skew	0.5224
kurtosis	-1.432
n_outliers	0
outlier_rate	0
zero_rate	0.02635
alert: null_rate	36.7% null

Fig 31.

Distribution of CPPIGSEC. Vertical dash marks the median.

Histogram bins for CPPIGSEC (median: 1.0); empty bins omitted.
bin	count
0 – 0.15	323
0.9 – 1.05	6549
1.95 – 2.1	243
3 – 3.15	202
3.9 – 4.05	1632
4.95 – 5.1	1479
5.85 – 6	1828
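
Reading each non-empty bin in the histogram above as one integer code (an assumption, though it matches the 7 unique values and the 0-6 range), the per-code shares and the reported mean can be recomputed directly from the counts:

```python
# Non-empty histogram bins mapped to their assumed integer codes.
counts = {0: 323, 1: 6549, 2: 243, 3: 202, 4: 1632, 5: 1479, 6: 1828}

non_null = sum(counts.values())
shares = {code: n / non_null for code, n in counts.items()}
mean = sum(code * n for code, n in counts.items()) / non_null

print(non_null)             # 12256
print(round(shares[1], 3))  # 0.534
print(round(mean, 3))       # 2.654
```

The recomputed total and mean agree with the stats table, which supports treating the field as a small set of discrete codes when choosing an encoding.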

CPPIEvangelicalEngagement categorical feature

Binary engagement flag from the CPPI side of the cross-reference, taking only 'Engaged' or 'Unengaged'. 'Engaged' leads at 59.1% of non-null rows (7,238 vs 5,018), and an entropy ratio of 0.976 shows the two classes are close to balanced. Notably, 36.74% of rows are null, which is large enough to materially bias any downstream split.

Treatment: Encode as binary and add an explicit 'missing' category before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[83]:

saturn.columns["CPPIEvangelicalEngagement"].stats

stat	value
n	19,375
nulls	7,119 (36.7%)
unique	2
top_value	Engaged
top_rate	0.5906
cardinality	2
entropy	0.9762
entropy_ratio	0.9762
alert: null_rate	36.7% null
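
The entropy figures can be reproduced from the two class counts. The sketch assumes Saturn reports Shannon entropy in bits over non-null rows and defines entropy_ratio as entropy divided by log2 of the cardinality; both definitions are inferred from the numbers, and the function name is hypothetical.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

counts = [7238, 5018]  # Engaged, Unengaged (non-null rows only)
entropy = shannon_entropy_bits(counts)
entropy_ratio = entropy / math.log2(len(counts))

print(round(counts[0] / sum(counts), 4))             # 0.5906
print(round(entropy, 4), round(entropy_ratio, 4))    # 0.9762 0.9762
```

With two classes the normalizer log2(2) equals 1, which is why entropy and entropy_ratio are identical in the table.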

Fig 32.

Top values for CPPIEvangelicalEngagement.

Top values for CPPIEvangelicalEngagement (2 unique shown, of 2 total).
value	count	share of all rows
Engaged	7238	37.4%
Unengaged	5018	25.9%

How to cite