saturn·

extracted cppi jp cppi cross reference

saturn notebook · generated 2026-05-01 Report Notebook

Overview

Source: /home/coolhand/html/datavis/data_trove/joshua-project/archive/extracted_cppi/jp-cppi-cross-reference.csv

Saturn profiled 19,375 rows across 24 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
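# Notebook shell escape: install the profiler once per environment.
!pip install saturn-dissect
import subprocess

# Profile the CSV and write the findings to JSON; check=True raises if the CLI exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/joshua-project/archive/extracted_cppi/jp-cppi-cross-reference.csv",
    "--findings", "extracted_cppi-jp-cppi-cross-reference.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)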

Summary confidence: high

This dataset is a Joshua Project cross-reference of 19,375 people-group records across 240 countries, combining CPPI and JP fields covering language, religion, population, and evangelical engagement. India dominates the country distribution at 17.2% of rows, followed by Papua New Guinea, Pakistan, and Indonesia, so any global view should account for that skew. Among rows with a recorded value, JPPrimaryReligion is weighted toward Christianity (41%, or 36.9% of all rows), with Islam, Ethnic Religions, and Hinduism trailing, and CPPIEvangelicalEngagement splits roughly 59% Engaged vs 41% Unengaged. Watch the population and evangelical-percentage fields: JPPopulation is extremely long-tailed (max ~919M, median 16,000, skew ~95), and JP%Evangelical is right-skewed with ~27% zeros and ~10% outliers. Also note the high null rates on the CPPI-prefixed columns and PEID (36.7%, rising to 47.6% for CPPIPrimaryReligion) and on JPLeastReached (62.9%), which constrain joinability.

citing: row_count · column_count · Ctry.top_values · Ctry.stats.top_rate · JPPrimaryReligion.top_values · JPPrimaryReligion.stats.top_rate · CPPIEvangelicalEngagement.top_values · CPPIEvangelicalEngagement.stats.top_rate · JPPopulation.stats · JP%Evangelical.stats · CPPIPrimaryReligion.top_values · JPLeastReached.null_rate · CPPIPopulation.null_rate

Out[4]:

saturn.schema() · 24 columns

column kind n null% unique alerts
Type numeric 19,375 0.0% 3
PEID numeric 19,375 36.7% 12,256 null_rate
ROP3 numeric 19,375 0.0% 11,786
PeopleID3 numeric 19,375 10.9% 10,284
ROG3 categorical 19,375 0.0% 240
Ctry categorical 19,375 0.0% 240
JPPeopleGroup text 19,375 10.9% 10,624 one_word duplicates
JPPopulation numeric 19,375 11.0% 1,731 high_skew outliers
JPIndigenous categorical 19,375 10.9% 2
JPROL3 text 19,375 10.9% 6,165 one_word short_text duplicates
JPPrimaryLanguage text 19,375 10.9% 6,152 one_word short_text duplicates
JPRLG3 numeric 19,375 10.9% 8
JPPrimaryReligion categorical 19,375 10.9% 8
JPScale numeric 19,375 10.9% 5
JPLeastReached categorical 19,375 62.9% 1 null_rate imbalance
JP%ChristianAdherent numeric 19,375 10.9% 957
JP%Evangelical numeric 19,375 17.2% 692 high_skew outliers
CPPIPeopleGroup text 19,375 36.7% 9,170 one_word null_rate short_text duplicates
CPPIPopulation text 19,375 36.7% 1,629 multilingual allcaps null_rate short_text duplicates
CPPIROL text 19,375 36.7% 5,014 one_word null_rate short_text duplicates
CPPIPrimaryLanguage text 19,375 36.7% 5,014 one_word null_rate short_text duplicates
CPPIPrimaryReligion categorical 19,375 47.6% 40 null_rate
CPPIGSEC numeric 19,375 36.7% 7 null_rate
CPPIEvangelicalEngagement categorical 19,375 36.7% 2 null_rate
Fig 1.
Ctry · Top countries by row count — India alone is ~17% of records, so check geographic balance before any aggregate.
Top values for Ctry (20 unique shown, of 240 total).
value                          count  share
India                           3330  17.2%
Papua New Guinea                 919   4.7%
Pakistan                         851   4.4%
Indonesia                        806   4.2%
United States                    655   3.4%
China                            557   2.9%
Nigeria                          557   2.9%
Canada                           357   1.8%
Brazil                           350   1.8%
Mexico                           344   1.8%
Bangladesh                       321   1.7%
Cameroon                         304   1.6%
Nepal                            297   1.5%
Congo, Democratic Republic of    240   1.2%
Sudan                            222   1.1%
Australia                        211   1.1%
Philippines                      209   1.1%
Malaysia                         196   1.0%
Russia                           189   1.0%
Myanmar (Burma)                  165   0.9%
Fig 2.
JPPrimaryReligion · Religion mix is dominated by Christianity (41% of non-null rows) and Islam, with smaller Ethnic, Hindu, and Buddhist slices.
Top values for JPPrimaryReligion (8 unique shown, of 8 total). Shares are of all 19,375 rows, including the 10.9% that are null.
value             count  share
Christianity       7140  36.9%
Islam              4003  20.7%
Ethnic Religions   2641  13.6%
Hinduism           2396  12.4%
Buddhism            669   3.5%
Non-Religious       264   1.4%
Other / Small       128   0.7%
Unknown              26   0.1%
Fig 3.
CPPIEvangelicalEngagement · Among rows with an engagement status, roughly 59% are 'Engaged' vs 41% 'Unengaged', a useful headline split.
Top values for CPPIEvangelicalEngagement (2 unique shown, of 2 total). Shares are of all 19,375 rows; the remaining 36.7% are null.
value      count  share
Engaged     7238  37.4%
Unengaged   5018  25.9%
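
The caption and the data table for CPPIEvangelicalEngagement use different denominators: the table's shares are out of all 19,375 rows, while the 59% / 41% split is out of rows that carry a status. A minimal sketch of that arithmetic, using only the counts reported above:

```python
# Counts copied from the CPPIEvangelicalEngagement data table in this report.
TOTAL_ROWS = 19_375
counts = {"Engaged": 7_238, "Unengaged": 5_018}

non_null = sum(counts.values())      # rows that carry a status
null_rows = TOTAL_ROWS - non_null    # rows with no status at all

for label, n in counts.items():
    share_of_all = n / TOTAL_ROWS    # denominator used by the data table
    share_of_known = n / non_null    # denominator used by the caption
    print(f"{label}: {share_of_all:.1%} of all rows, "
          f"{share_of_known:.1%} of non-null rows")

print(f"null: {null_rows} rows ({null_rows / TOTAL_ROWS:.1%} of all rows)")
```

The same distinction explains why JPPrimaryReligion shows Christianity at 36.9% in its table but 41% in the prose.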
Fig 4.
JPPopulation · Population is extremely long-tailed (median 16K, max ~919M); plot on a log scale to see structure.
Histogram bins for JPPopulation (median: 16000.0). Only the 8 non-empty bins are listed; the other 32 of the 40 bins have a count of 0.
bin                         count
10 – 2.297e+07              17193
2.297e+07 – 4.594e+07          33
4.594e+07 – 6.891e+07           9
6.891e+07 – 9.188e+07           5
9.188e+07 – 1.149e+08           2
1.149e+08 – 1.378e+08           3
1.838e+08 – 2.067e+08           1
8.958e+08 – 9.188e+08           1
Fig 5.
JP%Evangelical · Evangelical share is right-skewed with a heavy zero/low cluster and a long tail up to 95%.
Histogram bins for JP%Evangelical (median: 1.5).
bin               count
0 – 2.375          8955
2.375 – 4.75       1481
4.75 – 7.125       1358
7.125 – 9.5         669
9.5 – 11.88         432
11.88 – 14.25       623
14.25 – 16.62       364
16.62 – 19          283
19 – 21.38          425
21.38 – 23.75       222
23.75 – 26.12       350
26.12 – 28.5        131
28.5 – 30.88        198
30.88 – 33.25        97
33.25 – 35.62        91
35.62 – 38           20
38 – 40.38           64
40.38 – 42.75        22
42.75 – 45.12        98
45.12 – 47.5         38
47.5 – 49.88         22
49.88 – 52.25        36
52.25 – 54.62         4
54.62 – 57            8
57 – 59.38            1
59.38 – 61.75        17
61.75 – 64.12         4
64.12 – 66.5          3
66.5 – 68.88          0
68.88 – 71.25         8
71.25 – 73.62         2
73.62 – 76            4
76 – 78.38            4
78.38 – 80.75         3
80.75 – 83.12         0
83.12 – 85.5          0
85.5 – 87.88          2
87.88 – 90.25         2
90.25 – 92.62         0
92.62 – 95            2
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column                     kind         null %
Type                       numeric        0.0%
PEID                       numeric       36.7%
ROP3                       numeric        0.0%
PeopleID3                  numeric       10.9%
ROG3                       categorical    0.0%
Ctry                       categorical    0.0%
JPPeopleGroup              text          10.9%
JPPopulation               numeric       11.0%
JPIndigenous               categorical   10.9%
JPROL3                     text          10.9%
JPPrimaryLanguage          text          10.9%
JPRLG3                     numeric       10.9%
JPPrimaryReligion          categorical   10.9%
JPScale                    numeric       10.9%
JPLeastReached             categorical   62.9%
JP%ChristianAdherent       numeric       10.9%
JP%Evangelical             numeric       17.2%
CPPIPeopleGroup            text          36.7%
CPPIPopulation             text          36.7%
CPPIROL                    text          36.7%
CPPIPrimaryLanguage        text          36.7%
CPPIPrimaryReligion        categorical   47.6%
CPPIGSEC                   numeric       36.7%
CPPIEvangelicalEngagement  categorical   36.7%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 259 detected strings).
lang  count  share
en      256  98.8%
zh        1   0.4%
kn        1   0.4%
fa        1   0.4%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 10 numeric columns (values clipped to 2 decimals).
Column order, left to right: Type, PEID, ROP3, PeopleID3, JPPopulation, JPRLG3, JPScale, JP%ChristianAdherent, JP%Evangelical, CPPIGSEC.
Type                  +1.00  +0.20  +0.36  +0.27  -0.04  +0.09  -0.17  -0.11  -0.14  -0.08
PEID                  +0.20  +1.00  +0.09  +0.12  +0.04  +0.10  -0.10  -0.10  -0.14  -0.18
ROP3                  +0.36  +0.09  +1.00  +0.03  +0.04  -0.02  +0.00  +0.00  -0.03  -0.07
PeopleID3             +0.27  +0.12  +0.03  +1.00  +0.05  +0.30  -0.34  -0.33  -0.14  -0.10
JPPopulation          -0.04  +0.04  +0.04  +0.05  +1.00  +0.02  -0.06  -0.05  -0.03  +0.08
JPRLG3                +0.09  +0.10  -0.02  +0.30  +0.02  +1.00  -0.71  -0.88  -0.24  +0.05
JPScale               -0.17  -0.10  +0.00  -0.34  -0.06  -0.71  +1.00  +0.80  +0.23  -0.04
JP%ChristianAdherent  -0.11  -0.10  +0.00  -0.33  -0.05  -0.88  +0.80  +1.00  +0.22  -0.07
JP%Evangelical        -0.14  -0.14  -0.03  -0.14  -0.03  -0.24  +0.23  +0.22  +1.00  -0.01
CPPIGSEC              -0.08  -0.18  -0.07  -0.10  +0.08  +0.05  -0.04  -0.07  -0.01  +1.00

Type numeric feature

Type is encoded numerically but takes only 3 distinct values across 19,375 rows (min 1, max 3, median 1), so it is almost certainly a categorical code rather than a true measurement. The distribution leans toward the lowest category, with mean 1.585 and Q3 of 2.0 indicating most rows are 1 or 2. No nulls or outliers are present.

Treatment: Treat as a categorical variable and one-hot encode before modelling.
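
A minimal sketch of that treatment, assuming pandas and a toy frame standing in for the real data. Only the column name `Type` and its three codes (1, 2, 3) come from the profile; what each code means is not recorded in this report.

```python
import pandas as pd

# Toy stand-in for the real column (values invented, codes from the profile).
df = pd.DataFrame({"Type": [1, 2, 1, 3, 2, 1]})

# Declare all three levels up front so every code gets a column even if a
# sample or a train/test split happens to miss one of them.
levels = pd.CategoricalDtype(categories=[1, 2, 3])
type_cat = df["Type"].astype(levels)

# One indicator column per level: Type_1, Type_2, Type_3.
one_hot = pd.get_dummies(type_cat, prefix="Type", dtype=int)
print(one_hot)
```

Fixing the category list is a design choice rather than a requirement; without it, `get_dummies` only emits columns for the codes that actually appear in the data it is given.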

anthropic:claude-opus-4-7 · confidence high
Out[14]:

saturn.columns["Type"].stats

stat value
n 19,375
nulls 0 (0.0%)
unique 3
min 1
max 3
mean 1.585
median 1
std 0.6785
q1 1
q3 2
iqr 1
skew 0.7351
kurtosis -0.6013
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 9.
Distribution of Type. Vertical dash marks the median.
Histogram bins for Type (median: 1.0). Only the 3 non-empty bins are listed; the other 37 of the 40 bins have a count of 0.
bin          count
1 – 1.05     10148
2 – 2.05      7119
2.95 – 3      2108

PEID numeric identifier

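PEID looks like an entity identifier: integer values from 1 to 50,520 with no zeros. Its 12,256 unique values equal its 12,256 non-null rows (19,375 minus 7,119 nulls), so every populated PEID appears exactly once; it does not recur across rows. The 36.7% null rate is the headline concern, and it matches the null rate of most CPPI-prefixed columns, which suggests PEID is the CPPI-side key and is missing wherever a record has no CPPI match. The flat, gappy spread (skew 0.33, kurtosis -1.50) is consistent with an ID rather than a measured quantity.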

Treatment: Treat as a foreign key for joins; do not use as a numeric feature, and decide on a policy for the 36.7% missing IDs.

anthropic:claude-opus-4-7 · confidence high
Out[17]:

saturn.columns["PEID"].stats

stat value
n 19,375
nulls 7,119 (36.7%)
unique 12,256
min 1
max 50,520
mean 2.564e+04
median 1.838e+04
std 1.637e+04
q1 1.213e+04
q3 4.329e+04
iqr 31,162
skew 0.3311
kurtosis -1.499
n_outliers 0
outlier_rate 0
zero_rate 0
alert: null_rate 36.7% null
Fig 10.
Distribution of PEID. Vertical dash marks the median.
Histogram bins for PEID (median: 18378.5).
bin                       count
1 – 1264                    501
1264 – 2527                   0
2527 – 3790                   0
3790 – 5053                   0
5053 – 6316                  95
6316 – 7579                 380
7579 – 8842                 268
8842 – 1.01e+04             509
1.01e+04 – 1.137e+04        822
1.137e+04 – 1.263e+04       828
1.263e+04 – 1.389e+04       851
1.389e+04 – 1.516e+04       761
1.516e+04 – 1.642e+04       530
1.642e+04 – 1.768e+04       493
1.768e+04 – 1.895e+04       178
1.895e+04 – 2.021e+04        69
2.021e+04 – 2.147e+04       136
2.147e+04 – 2.273e+04       471
2.273e+04 – 2.4e+04         402
2.4e+04 – 2.526e+04         399
2.526e+04 – 2.652e+04         0
2.652e+04 – 2.779e+04         0
2.779e+04 – 2.905e+04         0
2.905e+04 – 3.031e+04         0
3.031e+04 – 3.158e+04         0
3.158e+04 – 3.284e+04         0
3.284e+04 – 3.41e+04         82
3.41e+04 – 3.536e+04          0
3.536e+04 – 3.663e+04         0
3.663e+04 – 3.789e+04         0
3.789e+04 – 3.915e+04        75
3.915e+04 – 4.042e+04        90
4.042e+04 – 4.168e+04       410
4.168e+04 – 4.294e+04       741
4.294e+04 – 4.421e+04       454
4.421e+04 – 4.547e+04         0
4.547e+04 – 4.673e+04       318
4.673e+04 – 4.799e+04       623
4.799e+04 – 4.926e+04       820
4.926e+04 – 5.052e+04       950

ROP3 numeric identifier

ROP3 holds 6-digit integers tightly bounded between 100004 and 119498 with mean 109260.77 and median 109305, suggesting a coded identifier or category number rather than a measured quantity. The distribution is essentially flat (kurtosis -1.16, skew 0.12) with no outliers and a near-zero null rate, and 11786 unique values across 19375 rows indicates many repeats but no dominant mode visible from these stats.

Treatment: Treat as a categorical code; do not feed as a continuous numeric feature.

anthropic:claude-opus-4-7 · confidence medium
Out[20]:

saturn.columns["ROP3"].stats

stat value
n 19,375
nulls 2 (0.0%)
unique 11,786
min 100,004
max 119,498
mean 1.093e+05
median 109,305
std 5558
q1 104,126
q3 114,057
iqr 9,931
skew 0.1203
kurtosis -1.159
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 11.
Distribution of ROP3. Vertical dash marks the median.
Histogram bins for ROP3 (median: 109305.0).
bincount
1e+05 – 1.005e+05584
1.005e+05 – 1.01e+05450
1.01e+05 – 1.015e+05338
1.015e+05 – 1.02e+05620
1.02e+05 – 1.024e+05489
1.024e+05 – 1.029e+05603
1.029e+05 – 1.034e+05715
1.034e+05 – 1.039e+05681
1.039e+05 – 1.044e+05613
1.044e+05 – 1.049e+05395
1.049e+05 – 1.054e+05430
1.054e+05 – 1.059e+05552
1.059e+05 – 1.063e+05482
1.063e+05 – 1.068e+05465
1.068e+05 – 1.073e+05375
1.073e+05 – 1.078e+05497
1.078e+05 – 1.083e+05516
1.083e+05 – 1.088e+05584
1.088e+05 – 1.093e+05293
1.093e+05 – 1.098e+05790
1.098e+05 – 1.102e+05502
1.102e+05 – 1.107e+05646
1.107e+05 – 1.112e+05500
1.112e+05 – 1.117e+05406
1.117e+05 – 1.122e+05329
1.122e+05 – 1.127e+05364
1.127e+05 – 1.132e+05482
1.132e+05 – 1.136e+05429
1.136e+05 – 1.141e+05509
1.141e+05 – 1.146e+05355
1.146e+05 – 1.151e+05661
1.151e+05 – 1.156e+05543
1.156e+05 – 1.161e+05516
1.161e+05 – 1.166e+05429
1.166e+05 – 1.171e+05287
1.171e+05 – 1.175e+05174
1.175e+05 – 1.18e+05367
1.18e+05 – 1.185e+05456
1.185e+05 – 1.19e+05469
1.19e+05 – 1.195e+05477

PeopleID3 numeric foreign_key

PeopleID3 is almost certainly an identifier: values are integers ranging from 10,119 to 22,498 with no zeros, no outliers, and 10,284 distinct values across the 17,267 non-null rows. Given that the dataset describes people groups, it most plausibly identifies a people group rather than an individual person; the meaning of the '3' suffix cannot be recovered from these stats. The 10.88% null rate (2,108 rows) matches the other JP-prefixed columns, so those rows are probably records with no Joshua Project match. The duplication (about 1.68 non-null rows per id) indicates the same group appears in several rows. The roughly uniform spread (kurtosis -0.97, mild skew 0.39) is consistent with sequentially assigned ids rather than a measured quantity.

Treatment: Treat as a foreign key for left-joins to a people table; do not use as a numeric feature.
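
A minimal sketch of such a left-join, assuming pandas. The `people` lookup table, its `PeopleName` column, and all values are hypothetical; the report does not say what table PeopleID3 should be joined to.

```python
import pandas as pd

# Toy stand-ins. `xref` mimics the cross-reference; `people` is a hypothetical
# lookup with one row per PeopleID3. All values are invented.
xref = pd.DataFrame({
    "PeopleID3": [10119, 10119, 22498, None, 15000],
    "Ctry": ["India", "Nepal", "China", "Brazil", "Sudan"],
})
people = pd.DataFrame({
    "PeopleID3": [10119, 22498],
    "PeopleName": ["group A", "group B"],
})

# Rows without a key cannot be joined; set them aside explicitly instead of
# letting them pass through the merge silently.
has_id = xref["PeopleID3"].notna()
keyed = xref[has_id].astype({"PeopleID3": "int64"})
unkeyed = xref[~has_id]

# validate= raises if `people` has duplicate keys; indicator= adds a `_merge`
# column that records whether each row found a match.
joined = keyed.merge(people, on="PeopleID3", how="left",
                     validate="many_to_one", indicator=True)

print(joined["_merge"].value_counts())
print(f"{len(unkeyed)} row(s) had no PeopleID3")
```

Casting the key back to an integer type after dropping nulls avoids joining on floats, which is what pandas infers when an integer column contains missing values.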

anthropic:claude-opus-4-7 · confidence high
Out[23]:

saturn.columns["PeopleID3"].stats

stat value
n 19,375
nulls 2,108 (10.9%)
unique 10,284
min 10,119
max 22,498
mean 1.524e+04
median 14,780
std 3413
q1 12,247
q3 1.814e+04
iqr 5898
skew 0.3911
kurtosis -0.9664
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 12.
Distribution of PeopleID3. Vertical dash marks the median.
Histogram bins for PeopleID3 (median: 14780.0).
bincount
1.012e+04 – 1.043e+04565
1.043e+04 – 1.074e+04528
1.074e+04 – 1.105e+04527
1.105e+04 – 1.136e+04857
1.136e+04 – 1.167e+04570
1.167e+04 – 1.198e+04655
1.198e+04 – 1.229e+04749
1.229e+04 – 1.259e+04510
1.259e+04 – 1.29e+04549
1.29e+04 – 1.321e+04564
1.321e+04 – 1.352e+04442
1.352e+04 – 1.383e+04457
1.383e+04 – 1.414e+04422
1.414e+04 – 1.445e+04624
1.445e+04 – 1.476e+04585
1.476e+04 – 1.507e+04534
1.507e+04 – 1.538e+04688
1.538e+04 – 1.569e+04483
1.569e+04 – 1.6e+04591
1.6e+04 – 1.631e+04351
1.631e+04 – 1.662e+04275
1.662e+04 – 1.693e+04258
1.693e+04 – 1.724e+04263
1.724e+04 – 1.755e+04304
1.755e+04 – 1.786e+04325
1.786e+04 – 1.817e+04296
1.817e+04 – 1.847e+04372
1.847e+04 – 1.878e+04392
1.878e+04 – 1.909e+04546
1.909e+04 – 1.94e+04484
1.94e+04 – 1.971e+04311
1.971e+04 – 2.002e+04272
2.002e+04 – 2.033e+04204
2.033e+04 – 2.064e+04271
2.064e+04 – 2.095e+04151
2.095e+04 – 2.126e+04267
2.126e+04 – 2.157e+04306
2.157e+04 – 2.188e+04244
2.188e+04 – 2.219e+04170
2.219e+04 – 2.25e+04305

ROG3 categorical feature

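ROG3 is a two-letter country code with 240 distinct values and no nulls across 19,375 rows. The codes are not ISO 3166: the top-value counts line up one-to-one with the Ctry column (IN 3,330 = India, PP 919 = Papua New Guinea, CH 557 = China, NI 557 = Nigeria, BG 321 = Bangladesh, AS 211 = Australia), which is the pattern of FIPS 10-4 style geography codes. Under ISO 3166 several of these codes denote other countries (CH is Switzerland, NI is Nicaragua, BG is Bulgaria), so do not join ROG3 to ISO-keyed tables without a crosswalk. Distribution is moderately concentrated: 'IN' alone accounts for 17.2% of records, followed by 'PP', 'PK', 'ID', and 'US', with entropy ratio 0.78 indicating reasonable spread across the long tail. Because ROG3 and Ctry share the same cardinality, entropy, and top-value counts, they appear to carry the same information, and only one needs encoding.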

Treatment: Group rare codes into 'Other' and one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[26]:

saturn.columns["ROG3"].stats

stat value
n 19,375
nulls 0 (0.0%)
unique 240
top_value IN
top_rate 0.1719
cardinality 240
entropy 6.175
entropy_ratio 0.781
Fig 13.
Top values for ROG3.
Top values for ROG3 (20 unique shown, of 240 total).
value  count  share
IN      3330  17.2%
PP       919   4.7%
PK       851   4.4%
ID       806   4.2%
US       655   3.4%
CH       557   2.9%
NI       557   2.9%
CA       357   1.8%
BR       350   1.8%
MX       344   1.8%
BG       321   1.7%
CM       304   1.6%
NP       297   1.5%
CG       240   1.2%
SU       222   1.1%
AS       211   1.1%
RP       209   1.1%
MY       196   1.0%
RS       189   1.0%
BM       165   0.9%

Ctry categorical feature

Country name field with 240 distinct values across 19,375 rows and zero nulls. India dominates at 17.2% (3,330 rows), followed by Papua New Guinea, Pakistan, and Indonesia, with the long tail spread broadly (entropy ratio 0.78). Papua New Guinea's share is large relative to its population; because each row is a people-group record rather than a population-weighted sample, row counts track how many groups are listed per country, not how many people live there.

Treatment: Group rare countries into an 'Other' bucket and target- or frequency-encode before modelling.
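
A minimal sketch of the bucket-then-encode step, assuming pandas. The sample counts and the `MIN_COUNT` threshold are invented for illustration; the same pattern would apply to ROG3.

```python
import pandas as pd

# Toy sample; the country names are real but the counts are invented.
ctry = pd.Series(
    ["India"] * 6 + ["Pakistan"] * 3 + ["Nepal"] * 2 + ["Fiji", "Malta"],
    name="Ctry",
)

MIN_COUNT = 2  # hypothetical cut-off; tune it on the real 240-country tail

# 1. Bucket every country seen fewer than MIN_COUNT times into "Other".
counts = ctry.value_counts()
frequent = counts[counts >= MIN_COUNT].index
bucketed = ctry.where(ctry.isin(frequent), "Other")

# 2. Frequency-encode: replace each label with its share of rows.
shares = bucketed.value_counts(normalize=True)
encoded = bucketed.map(shares)

print(shares)
```

Frequency encoding keeps a single numeric column regardless of cardinality. Target encoding, the alternative named in the treatment, would additionally need an outcome column, which this profile does not define.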

anthropic:claude-opus-4-7 · confidence high
Out[29]:

saturn.columns["Ctry"].stats

stat value
n 19,375
nulls 0 (0.0%)
unique 240
top_value India
top_rate 0.1719
cardinality 240
entropy 6.175
entropy_ratio 0.781
Fig 14.
Top values for Ctry.
Top values for Ctry (20 unique shown, of 240 total).
value                          count  share
India                           3330  17.2%
Papua New Guinea                 919   4.7%
Pakistan                         851   4.4%
Indonesia                        806   4.2%
United States                    655   3.4%
China                            557   2.9%
Nigeria                          557   2.9%
Canada                           357   1.8%
Brazil                           350   1.8%
Mexico                           344   1.8%
Bangladesh                       321   1.7%
Cameroon                         304   1.6%
Nepal                            297   1.5%
Congo, Democratic Republic of    240   1.2%
Sudan                            222   1.1%
Australia                        211   1.1%
Philippines                      209   1.1%
Malaysia                         196   1.0%
Russia                           189   1.0%
Myanmar (Burma)                  165   0.9%

JPPeopleGroup text feature

This column holds Joshua Project-style people group labels — short ethnonyms or community descriptors like 'Deaf', 'British', 'French', and qualified forms such as 'Arab, (Muslim traditions)'. Values are short (mean 11.4 chars, median 1 word) and 54% are single-word, but with 10,624 uniques across 19,375 rows and a 38.5% duplicate rate, plus 10.9% nulls, the field is high-cardinality categorical with a long tail. The prevalence of parenthetical religious qualifiers ('(hindu', '(muslim', 'traditions)') in top_words suggests an embedded taxonomy that would need parsing to separate ethnicity from religious tradition.

Treatment: Normalize casing and split parenthetical religious qualifiers into a separate column before encoding as a high-cardinality category.
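
A minimal sketch of the qualifier split, using only the standard library. The regular expression and the helper name `split_qualifier` are illustrative assumptions; the example labels are the ones quoted above, and the real column may contain shapes this pattern does not cover.

```python
import re

# A trailing "(...)" group is treated as the qualifier; any comma or
# whitespace just before it is dropped from the base name.
_QUALIFIER = re.compile(r"^(?P<name>.*?)[\s,]*\((?P<qualifier>[^()]*)\)\s*$")


def split_qualifier(label):
    """Split 'Name (qualifier)' into (name, qualifier); qualifier may be None."""
    text = label.strip()
    match = _QUALIFIER.match(text)
    if match is None:
        return text, None
    return match.group("name").strip(), match.group("qualifier").strip()


for raw in ["Arab, (Muslim traditions)", "Deaf", "British"]:
    print(raw, "->", split_qualifier(raw))
```

Labels without a trailing parenthetical pass through unchanged, so the function can be applied to the whole column before deciding how to encode the two resulting fields.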

anthropic:claude-opus-4-7 · confidence high
Out[32]:

saturn.columns["JPPeopleGroup"].stats

stat value
n 19,375
nulls 2,108 (10.9%)
unique 10,624
len_min 1
len_max 41
len_mean 11.35
len_median 9
len_p95 25
word_mean 1.618
word_median 1
n_empty 0
n_duplicates 6,643
duplicate_rate 0.3847
vocab_size 11,428
readability_flesch_mean 39.84
emoji_rate 0
url_rate 0
one_word_rate 0.5414
allcaps_rate 0
boilerplate_rate 0
alert: one_word 54.1% rows are a single word
alert: duplicates 38.5% duplicate strings
Fig 15.
Character-length distribution for JPPeopleGroup.
Character-length distribution for JPPeopleGroup (mean: 11.350147680546707).
charscount
1 – 21
2 – 331
3 – 4285
4 – 51325
5 – 61715
6 – 71938
7 – 81655
8 – 91143
9 – 10697
10 – 11674
11 – 12598
12 – 13742
13 – 14714
14 – 15930
15 – 16794
16 – 17671
17 – 18511
18 – 19345
19 – 20275
20 – 21245
21 – 22223
22 – 23182
23 – 24199
24 – 25295
25 – 26263
26 – 27231
27 – 28136
28 – 29107
29 – 3075
30 – 3149
31 – 3259
32 – 3361
33 – 3439
34 – 3532
35 – 3619
36 – 371
37 – 381
38 – 392
39 – 402
40 – 412

JPPopulation numeric feature

This is the Joshua Project (JP) population estimate for each people-group record, not a Japan-specific figure. Values span from 10 to 918,811,000 with a median of just 16,000. The distribution is extraordinarily right-skewed (skew 94.9, kurtosis 10,840): the mean of 468,378 sits far above the Q3 of 85,000, and 2,609 non-null rows (15.1%) flag as outliers. Roughly 11% of rows are null, and only 1,731 unique values across 19,375 rows suggests heavy repetition.

Treatment: Log-transform and impute or flag nulls before modelling; check whether the 918M maximum is a genuinely very large group or a unit error.
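
A minimal sketch of what the log transform does to this column, using only the summary statistics reported for JPPopulation. The outlier rule is assumed to be the usual Tukey fence (q3 + 1.5 × IQR), which is consistent with the "beyond 1.5 IQR" alert but is not confirmed by the report.

```python
import math

# Quartiles and extremes copied from saturn.columns["JPPopulation"].stats.
q1, median, q3 = 3_100, 16_000, 85_000
minimum, maximum = 10, 918_811_000

# Tukey upper fence (assumed rule): values above it count as outliers.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr


def log10_or_none(value):
    """Return log10 of a positive count; keep None so nulls stay visible."""
    return None if value is None else math.log10(value)


rows = [("min", minimum), ("q1", q1), ("median", median),
        ("q3", q3), ("upper fence", upper_fence), ("max", maximum)]
for name, value in rows:
    print(f"{name:>11}: {value:>13,.0f}  log10 = {log10_or_none(value):.2f}")
```

On the raw scale the maximum is several thousand times the upper fence; on the log10 scale the whole range fits between 1 and 9, which is why a log axis reveals structure that the linear histogram hides. Leaving nulls as `None` rather than imputing inside the transform keeps the imputation decision separate and explicit.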

anthropic:claude-opus-4-7 · confidence high
Out[35]:

saturn.columns["JPPopulation"].stats

stat value
n 19,375
nulls 2,128 (11.0%)
unique 1,731
min 10
max 9.188e+08
mean 4.684e+05
median 16,000
std 7.861e+06
q1 3,100
q3 85,000
iqr 81,900
skew 94.92
kurtosis 1.084e+04
n_outliers 2,609
outlier_rate 0.1513
zero_rate 0
alert: high_skew skew=+94.92
alert: outliers 15.1% rows beyond 1.5 IQR
Fig 16.
Distribution of JPPopulation. Vertical dash marks the median.
Histogram bins for JPPopulation (median: 16000.0). Only the 8 non-empty bins are listed; the other 32 of the 40 bins have a count of 0.
bin                         count
10 – 2.297e+07              17193
2.297e+07 – 4.594e+07          33
4.594e+07 – 6.891e+07           9
6.891e+07 – 9.188e+07           5
9.188e+07 – 1.149e+08           2
1.149e+08 – 1.378e+08           3
1.838e+08 – 2.067e+08           1
8.958e+08 – 9.188e+08           1

JPIndigenous categorical feature

Binary Y/N flag, most likely indicating whether the people group is classified as indigenous to the country of the record; the 'JP' prefix marks a Joshua Project field, not a Japan-related one. 'Y' accounts for 71.2% of non-null rows (12,293 vs 4,974 'N'), and 10.88% of rows are null. An entropy ratio of 0.87 means the split is uneven but not severely imbalanced once nulls are excluded.

Treatment: Encode as boolean and decide explicitly whether to impute or flag the ~11% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[38]:

saturn.columns["JPIndigenous"].stats

stat value
n 19,375
nulls 2,108 (10.9%)
unique 2
top_value Y
top_rate 0.7119
cardinality 2
entropy 0.8662
entropy_ratio 0.8662
Fig 17.
Top values for JPIndigenous.
Top values for JPIndigenous (2 unique shown, of 2 total). Shares are of all 19,375 rows.
value  count  share
Y      12293  63.4%
N       4974  25.7%
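
The entropy and entropy_ratio values reported for JPIndigenous can be reproduced from the two counts. The sketch below assumes Shannon entropy in bits over non-null values, normalised by log2 of the cardinality; that definition matches the reported numbers but is inferred, not taken from saturn's documentation.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Non-null counts from the JPIndigenous top-values table.
counts = {"Y": 12_293, "N": 4_974}

entropy = shannon_entropy_bits(counts.values())
# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(counts))

print(f"entropy = {entropy:.4f} bits, entropy_ratio = {entropy_ratio:.4f}")
```

For a two-level column the maximum entropy is exactly 1 bit, which is why entropy and entropy_ratio coincide here. Under the same assumed definition, Ctry's ratio is its 6.175 bits divided by log2(240).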

JPROL3 text feature

JPROL3 holds three-letter ISO 639-3 language codes (hin, eng, ben, spa, tam...), with every value being a single token of length 3. The field is 10.88% null and heavily duplicated (64.3% duplicate rate across 6,165 unique codes), with Hindi (700) and English (537) leading. A count of 6,165 distinct codes is large but plausible for a worldwide people-group list, since ISO 639-3 defines several thousand language codes. The one-off gap between unique (6,165) and vocab_size (6,164), together with a small non-zero allcaps_rate (0.0003), may point to a few casing variants worth normalising.

Treatment: Treat as a categorical language-code feature; one-hot or target-encode top codes and bucket the long tail.

anthropic:claude-opus-4-7 · confidence high
Out[41]:

saturn.columns["JPROL3"].stats

stat value
n 19,375
nulls 2,108 (10.9%)
unique 6,165
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 11,102
duplicate_rate 0.643
vocab_size 6,164
readability_flesch_mean 115.3
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.0003475
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 64.3% duplicate strings
Fig 18.
Character-length distribution for JPROL3.
Character-length distribution for JPROL3 (mean: 3.0). All 17,267 non-null values are exactly 3 characters long, so the 40-bin histogram is degenerate: one bin holds every value and the other 39 have a count of 0.
chars  count
3      17267

JPPrimaryLanguage text feature

Categorical free-text capturing each subject's primary language, dominated by single-word entries (one_word_rate 0.7335) with a mean of 1.32 words and median length 7. Despite only 6152 unique values across 19375 rows, the long tail is heavy: top value 'Hindi' appears just 700 times and 64.37% of rows are duplicates, while 10.88% are null. Surprising: vocab_size (6236) exceeds n_unique (6152), implying compound entries like 'Chinese, Mandarin' and 'Punjabi,' that split into multiple tokens and will fragment any naive value_counts.

Treatment: Normalise casing and split comma-separated entries before encoding as a categorical or multi-label feature.

anthropic:claude-opus-4-7 · confidence high
Out[44]:

saturn.columns["JPPrimaryLanguage"].stats

stat value
n 19,375
nulls 2,108 (10.9%)
unique 6,152
len_min 1
len_max 41
len_mean 9.088
len_median 7
len_p95 18
word_mean 1.319
word_median 1
n_empty 0
n_duplicates 11,115
duplicate_rate 0.6437
vocab_size 6,236
readability_flesch_mean 31.33
emoji_rate 0
url_rate 0
one_word_rate 0.7335
allcaps_rate 0
boilerplate_rate 0
alert: one_word 73.4% rows are a single word
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 64.4% duplicate strings
Fig 19.
Character-length distribution for JPPrimaryLanguage.
Character-length distribution for JPPrimaryLanguage (mean: 9.088318758325128).
charscount
1 – 22
2 – 339
3 – 4303
4 – 51440
5 – 62649
6 – 72244
7 – 83274
8 – 91229
9 – 10706
10 – 11642
11 – 12299
12 – 13377
13 – 14335
14 – 15396
15 – 16398
16 – 171265
17 – 18631
18 – 19179
19 – 20123
20 – 21148
21 – 2292
22 – 2378
23 – 2495
24 – 2592
25 – 2639
26 – 2736
27 – 2831
28 – 2923
29 – 3012
30 – 3111
31 – 3231
32 – 3316
33 – 349
34 – 351
35 – 361
36 – 371
37 – 381
38 – 391
39 – 408
40 – 4110

JPRLG3 numeric feature

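JPRLG3 is a low-cardinality integer code taking 8 distinct values between 1 and 9 (no row uses 3). It is a nominal code for JPPrimaryReligion rather than an ordinal rating: its per-code counts match the religion counts exactly (1 = 7,140 Christianity, 2 = 669 Buddhism, 4 = 2,641 Ethnic Religions, 5 = 2,396 Hinduism, 6 = 4,003 Islam, 7 = 264 Non-Religious, 8 = 128 Other / Small, 9 = 26 Unknown). Its mean (3.37), median (4.0), skew (0.096), and kurtosis (-1.60) therefore have no substantive meaning, and neither do its Pearson correlations in Fig 8. About 10.88% of rows are null, the same null count (2,108) as JPPrimaryReligion.

Treatment: Treat as a nominal categorical (8 levels) that duplicates JPPrimaryReligion; keep one of the two and impute or flag the 10.88% missing before modelling.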
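
The non-empty bins of the JPRLG3 histogram carry the same eight counts as the JPPrimaryReligion top-values table, so a code-to-label crosswalk can be inferred by matching counts. The sketch below does exactly that with the numbers printed in this report; the resulting mapping is an inference from aggregate counts, not a published codebook, and should be checked against row-level data.

```python
# Non-empty bins of the JPRLG3 histogram (code -> row count).
code_counts = {1: 7140, 2: 669, 4: 2641, 5: 2396, 6: 4003, 7: 264, 8: 128, 9: 26}

# Top values of JPPrimaryReligion (label -> row count).
label_counts = {
    "Christianity": 7140, "Islam": 4003, "Ethnic Religions": 2641,
    "Hinduism": 2396, "Buddhism": 669, "Non-Religious": 264,
    "Other / Small": 128, "Unknown": 26,
}

# Matching on counts is only safe when every count is distinct.
if len(set(label_counts.values())) != len(label_counts):
    raise ValueError("duplicate counts: cannot match codes to labels")

count_to_label = {n: label for label, n in label_counts.items()}
crosswalk = {code: count_to_label[n] for code, n in code_counts.items()}

for code, label in sorted(crosswalk.items()):
    print(code, label)
```

With row-level access, a cross-tabulation of the two columns would confirm or refute this mapping directly.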

anthropic:claude-opus-4-7 · confidence high
Out[47]:

saturn.columns["JPRLG3"].stats

stat value
n 19,375
nulls 2,108 (10.9%)
unique 8
min 1
max 9
mean 3.367
median 4
std 2.199
q1 1
q3 6
iqr 5
skew 0.09636
kurtosis -1.595
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 20.
Distribution of JPRLG3. Vertical dash marks the median.
Histogram bins for JPRLG3 (median: 4.0). Only the 8 non-empty bins are listed; the other 32 of the 40 bins have a count of 0.
bin        count
1 – 1.2     7140
2 – 2.2      669
4 – 4.2     2641
5 – 5.2     2396
6 – 6.2     4003
7 – 7.2      264
8 – 8.2      128
8.8 – 9       26

JPPrimaryReligion categorical feature

Categorical label assigning a primary religion to each people-group record. Eight categories cover the 17,267 non-null rows (~10.9% of the 19,375 rows are missing); Christianity leads at 41.4% of non-null rows (7,140, or 36.9% of all rows), followed by Islam (4,003) and Ethnic Religions (2,641), with entropy_ratio 0.72 indicating a moderately balanced distribution rather than a single dominant class. Note the explicit 'Unknown' bucket (26) co-existing with a 10.88% null rate, suggesting two distinct flavors of missingness.

Treatment: One-hot or target-encode; reconcile the explicit 'Unknown' category with the 10.88% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[50]:

saturn.columns["JPPrimaryReligion"].stats

stat value
n 19,375
nulls 2,108 (10.9%)
unique 8
top_value Christianity
top_rate 0.4135
cardinality 8
entropy 2.166
entropy_ratio 0.722
Fig 21.
Top values for JPPrimaryReligion.
Top values for JPPrimaryReligion (8 unique shown, of 8 total). Shares are of all 19,375 rows.
value             count  share
Christianity       7140  36.9%
Islam              4003  20.7%
Ethnic Religions   2641  13.6%
Hinduism           2396  12.4%
Buddhism            669   3.5%
Non-Religious       264   1.4%
Other / Small       128   0.7%
Unknown              26   0.1%

JPScale numeric feature

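JPScale is an integer-coded ordinal feature with 5 distinct values spanning 1 to 5, a mean of 2.71 and median of 3. The name suggests a Joshua Project progress-style scale, though these stats alone cannot confirm what each level means. The distribution is not peaked in the middle (kurtosis -1.64): level 1 is the most common (7,186 rows), followed by levels 4 (3,872) and 5 (3,302), with levels 2 (1,119) and 3 (1,788) least common. The level-1 count equals the number of JPLeastReached = 'Y' rows (7,186), so the two fields may encode overlapping information. There are no outliers, and the 10.88% null rate should be handled explicitly.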

Treatment: Treat as ordinal categorical (1-5) and impute or flag the ~11% missing values before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[53]:

saturn.columns["JPScale"].stats

stat value
n 19,375
nulls 2,108 (10.9%)
unique 5
min 1
max 5
mean 2.71
median 3
std 1.623
q1 1
q3 4
iqr 3
skew 0.159
kurtosis -1.637
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 22.
Distribution of JPScale. Vertical dash marks the median.
Histogram bins for JPScale (median: 3.0). Only the 5 non-empty bins are listed; the other 35 of the 40 bins have a count of 0.
bin        count
1 – 1.1     7186
2 – 2.1     1119
3 – 3.1     1788
4 – 4.1     3872
4.9 – 5     3302

JPLeastReached categorical metadata

JPLeastReached holds a single non-null value 'Y' across all 7,186 populated rows, with the remaining 62.91% of records null. With cardinality 1 and entropy 0, the field carries no discriminative information and effectively functions as a presence flag rather than a category.

Treatment: Drop, or recode as a binary is-present indicator if the null pattern is meaningful.

anthropic:claude-opus-4-7 · confidence high
Out[56]:

saturn.columns["JPLeastReached"].stats

stat value
n 19,375
nulls 12,189 (62.9%)
unique 1
top_value Y
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: null_rate 62.9% null
alert: imbalance top value is 100.0% of rows
Fig 23.
Top values for JPLeastReached.
Top values for JPLeastReached (1 unique shown, of 1 total). Share is of all 19,375 rows.
value  count  share
Y       7186  37.1%

JP%ChristianAdherent numeric feature

Numeric share (0–100) of a population identified as Christian adherents, judging by the name and full 0–100 range with mean 37.58 and median 17.0. The distribution is starkly bimodal in feel: q1 sits at 0.02 while q3 hits 80.0, with 23.7% exact zeros and kurtosis of -1.60, so values cluster at the extremes rather than the middle. Null rate is 10.88%, which is non-trivial and should be handled explicitly.

Treatment: Treat as a bounded percentage; impute or flag the 10.88% nulls and consider a zero-inflation indicator before modelling.
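
A minimal sketch of the flag-and-indicator idea, assuming pandas. The sample values are invented; only the column name and the facts that it is a 0 to 100 percentage with many exact zeros and some nulls come from the profile.

```python
import pandas as pd

# Toy sample of a zero-inflated percentage with one missing value.
pct = pd.Series([0.0, 0.02, 17.0, None, 80.0, 0.0, 100.0],
                name="JP%ChristianAdherent")

features = pd.DataFrame({
    "pct": pct,
    # Keep missingness as its own signal instead of silently imputing it.
    "is_missing": pct.isna(),
    # Exact zeros get an indicator so a model can treat "none recorded"
    # differently from "a small positive share". NaN compares as False.
    "is_zero": pct.eq(0),
})

print(features)
```

Whether the `pct` column is then imputed, and with what value, remains a separate modelling decision that the two indicator columns make visible.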

anthropic:claude-opus-4-7 · confidence high
Out[59]:

saturn.columns["JP%ChristianAdherent"].stats

stat value
n 19,375
nulls 2,108 (10.9%)
unique 957
min 0
max 100
mean 37.58
median 17
std 39.33
q1 0.02
q3 80
iqr 79.98
skew 0.3824
kurtosis -1.602
n_outliers 0
outlier_rate 0
zero_rate 0.2371
Fig 24.
Distribution of JP%ChristianAdherent. Vertical dash marks the median.
Histogram bins for JP%ChristianAdherent (median: 17.0).
bin           count
0 – 2.5        6605
2.5 – 5         525
5 – 7.5         484
7.5 – 10        300
10 – 12.5       410
12.5 – 15        98
15 – 17.5       223
17.5 – 20        61
20 – 22.5       297
22.5 – 25        29
25 – 27.5       202
27.5 – 30        33
30 – 32.5       274
32.5 – 35        40
35 – 37.5       119
37.5 – 40        22
40 – 42.5       236
42.5 – 45        19
45 – 47.5       164
47.5 – 50        30
50 – 52.5       127
52.5 – 55        31
55 – 57.5       201
57.5 – 60        42
60 – 62.5       525
62.5 – 65        98
65 – 67.5       421
67.5 – 70        93
70 – 72.5       486
72.5 – 75       105
75 – 77.5       325
77.5 – 80       144
80 – 82.5       479
82.5 – 85       156
85 – 87.5       411
87.5 – 90       149
90 – 92.5       969
92.5 – 95       359
95 – 97.5      1340
97.5 – 100      635

JP%Evangelical numeric feature

This column appears to capture the percentage of Evangelical adherents for some geographic or demographic unit (the JP prefix suggests a JoshuaProject-style indicator). The distribution is heavily right-skewed (skew 2.53, kurtosis 8.38) with a median of just 1.5 and Q3 of 8.0, yet a max of 95.0, and 27% of values are exactly zero. About 17.2% are null and 9.7% flag as outliers, so the long tail of high-Evangelical units is both real and sparse.

Treatment: Impute or flag nulls, then apply a log1p or similar transform before modelling to tame the skew.

anthropic:claude-opus-4-7 · confidence high
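
As a worked example of the outlier rule and the suggested transform: with q1 = 0 and q3 = 8 the IQR is 8, so the upper fence is 8 + 1.5 × 8 = 20 and the lower fence (-12) sits below the column minimum. The sketch below uses only the quartiles reported in the stats; the function names and sample values are illustrative, not saturn's own.

```python
import math

# Quartiles as reported in the stats table for JP%Evangelical.
Q1, Q3 = 0.0, 8.0
IQR = Q3 - Q1
UPPER_FENCE = Q3 + 1.5 * IQR  # 8 + 12 = 20
LOWER_FENCE = Q1 - 1.5 * IQR  # -12, below the column minimum of 0


def log1p_pct(value):
    """Return log(1 + value), passing nulls (None) through untouched."""
    return None if value is None else math.log1p(value)


def beyond_fence(value):
    """True when a non-null value lies outside the 1.5 x IQR fences."""
    return value is not None and not (LOWER_FENCE <= value <= UPPER_FENCE)


# Illustrative values: min, median, q3, the fence itself, max, and one null.
sample = [0.0, 1.5, 8.0, 20.0, 95.0, None]
transformed = [log1p_pct(v) for v in sample]
flags = [beyond_fence(v) for v in sample]
```

Because log1p maps zero to zero, the block of exact zeros remains a distinct spike after the transform, so a separate zero indicator may still be worth adding.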
Out[62]:

saturn.columns["JP%Evangelical"].stats

stat value
n 19,375
nulls 3,332 (17.2%)
unique 692
min 0
max 95
mean 6.279
median 1.5
std 10.18
q1 0
q3 8
iqr 8
skew 2.535
kurtosis 8.378
n_outliers 1,561
outlier_rate 0.0973
zero_rate 0.27
alert: high_skew skew=+2.53
alert: outliers 9.7% rows beyond 1.5 IQR
Fig 25.
Distribution of JP%Evangelical. Vertical dash marks the median.
Histogram bins for JP%Evangelical (median: 1.5).
bin              count
0 – 2.375        8955
2.375 – 4.75     1481
4.75 – 7.125     1358
7.125 – 9.5      669
9.5 – 11.88      432
11.88 – 14.25    623
14.25 – 16.62    364
16.62 – 19       283
19 – 21.38       425
21.38 – 23.75    222
23.75 – 26.12    350
26.12 – 28.5     131
28.5 – 30.88     198
30.88 – 33.25    97
33.25 – 35.62    91
35.62 – 38       20
38 – 40.38       64
40.38 – 42.75    22
42.75 – 45.12    98
45.12 – 47.5     38
47.5 – 49.88     22
49.88 – 52.25    36
52.25 – 54.62    4
54.62 – 57       8
57 – 59.38       1
59.38 – 61.75    17
61.75 – 64.12    4
64.12 – 66.5     3
66.5 – 68.88     0
68.88 – 71.25    8
71.25 – 73.62    2
73.62 – 76       4
76 – 78.38       4
78.38 – 80.75    3
80.75 – 83.12    0
83.12 – 85.5     0
85.5 – 87.88     2
87.88 – 90.25    2
90.25 – 92.62    0
92.62 – 95       2

CPPIPeopleGroup text feature

This column holds people-group names from the CPPI side of the cross-reference (CPPI most likely stands for Church Planting Progress Indicators), with values like 'Han Chinese', 'British', 'Russian', and 'Korean' among the most frequent. Entries are short (72.7% are single words, mean length 8.7 characters), but with 9,170 unique values across the 12,256 non-null rows the column is high-cardinality, and 36.7% of all rows are null. Duplication is also notable (25.2% of non-null values repeat a string already seen), and the long tail includes compound entries like 'Han Chinese, Mandarin', suggesting inconsistent delimiting.

Treatment: Normalize delimiters and treat as a high-cardinality categorical; group rare levels or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
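
The treatment above could look roughly like this. It is a sketch under assumptions: the whitespace variants in the sample are invented to show the normalization (the report only confirms that comma-separated compound labels exist), and `normalize_label`, `group_rare`, and the `__other__` token are hypothetical names.

```python
import re
from collections import Counter


def normalize_label(label):
    """Trim, collapse runs of whitespace, and use ', ' as the only delimiter."""
    if label is None:
        return None
    label = re.sub(r"\s*,\s*", ", ", label.strip())
    return re.sub(r"\s+", " ", label)


def group_rare(labels, min_count=2, other="__other__"):
    """Replace labels seen fewer than min_count times with a shared token."""
    counts = Counter(label for label in labels if label is not None)
    return [
        label if label is None or counts[label] >= min_count else other
        for label in labels
    ]


# Invented variants of labels quoted in the report, plus one null.
raw = ["Han Chinese,Mandarin", "Han Chinese ,  Mandarin", " Korean ", None, "British"]
normalized = [normalize_label(label) for label in raw]
grouped = group_rare(normalized)
```

Nulls are passed through rather than folded into the rare bucket, so the 36.7% missing rows stay distinguishable from genuinely rare groups.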
Out[65]:

saturn.columns["CPPIPeopleGroup"].stats

stat value
n 19,375
nulls 7,119 (36.7%)
unique 9,170
len_min 1
len_max 38
len_mean 8.68
len_median 7
len_p95 18
word_mean 1.316
word_median 1
n_empty 0
n_duplicates 3,086
duplicate_rate 0.2518
vocab_size 8,586
readability_flesch_mean 40.96
emoji_rate 0
url_rate 0
one_word_rate 0.7273
allcaps_rate 0
boilerplate_rate 0
alert: one_word 72.7% rows are a single word
alert: null_rate 36.7% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 25.2% duplicate strings
Fig 26.
Character-length distribution for CPPIPeopleGroup.
Character-length distribution for CPPIPeopleGroup (mean: 8.68). Bin edges are rounded to whole characters, so some labels repeat.
chars      count
1 – 2      1
2 – 3      42
3 – 4      322
4 – 5      1274
5 – 6      1850
6 – 7      1845
7 – 7      1484
7 – 8      1030
8 – 9      558
9 – 10     467
10 – 11    451
11 – 12    398
12 – 13    424
13 – 14    0
14 – 15    494
15 – 16    376
16 – 17    320
17 – 18    245
18 – 19    128
19 – 20    81
20 – 20    83
20 – 21    111
21 – 22    78
22 – 23    53
23 – 24    39
24 – 25    29
25 – 26    0
26 – 27    27
27 – 28    9
28 – 29    10
29 – 30    10
30 – 31    3
31 – 32    5
32 – 32    3
32 – 33    2
33 – 34    0
34 – 35    0
35 – 36    0
36 – 37    2
37 – 38    2

CPPIPopulation text feature

This column holds population counts stored as comma-formatted strings rather than numbers, with values like ' 1,300 ' and ' 15,500 ' padded by whitespace (the column name itself is padded the same way). It is 36.74% null and heavily repetitive: 1,629 unique values across the 12,256 non-null rows, an 86.7% duplicate rate, which suggests rounded or bucketed figures. The 'multilingual' and 'allcaps' alerts are spurious artifacts of digit-only text: only 3 non-English rows exist out of 256 detected.

Treatment: Strip whitespace and commas, then cast to integer; impute or flag the 36.74% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
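
A minimal sketch of the strip-and-cast step, using the two sample values quoted above. The function name `parse_population` is hypothetical, and returning None for anything that is not plain digits is one possible policy, chosen here so odd cells can be inspected instead of silently coerced.

```python
def parse_population(raw):
    """Convert a padded, comma-formatted count such as ' 1,300 ' to an int."""
    if raw is None:
        return None
    cleaned = raw.strip().replace(",", "")
    # Accept only plain ASCII digits; everything else is reported as None.
    if not (cleaned.isascii() and cleaned.isdigit()):
        return None
    return int(cleaned)


# The header carries the same padding as the values, so strip keys as well.
row = {" CPPIPopulation ": " 15,500 "}
clean_row = {key.strip(): parse_population(value) for key, value in row.items()}
```

Stripping the header once at load time avoids having to write the padded name `" CPPIPopulation "` in every later lookup.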
Out[68]:

saturn.columns[" CPPIPopulation "].stats

stat value
n 19,375
nulls 7,119 (36.7%)
unique 1,629
len_min 3
len_max 13
len_mean 7.838
len_median 8
len_p95 11
word_mean 3
word_median 3
n_empty 0
n_duplicates 10,627
duplicate_rate 0.8671
vocab_size 1,629
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 0
allcaps_rate 1
boilerplate_rate 0
alert: multilingual 5 languages detected in sample
alert: allcaps 100.0% rows are all-caps
alert: null_rate 36.7% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 86.7% duplicate strings
Fig 27.
Character-length distribution for CPPIPopulation.
Character-length distribution for CPPIPopulation (mean: 7.84). Only non-empty bins are listed; all other bins have count 0.
chars  count
3      2
4      128
5      1205
7      3328
8      4079
9      2597
11     796
12     115
13     6

CPPIROL text feature

CPPIROL holds 3-character ISO 639-3 language codes (eng, spa, tel, hin, por, ...); every value is a single token of length 3. The field is null in 36.74% of rows. Among the 12,256 non-null values there are 5,014 distinct codes, so 59.09% of values repeat a code already seen, and 'und' (undetermined) appears 131 times. The distribution is long-tailed, with English and Spanish the most frequent codes.

Treatment: Treat as a categorical language-code feature; impute nulls (or map to 'und') and one-hot or target-encode the high-frequency codes.

anthropic:claude-opus-4-7 · confidence high
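
Before encoding, the codes can be shape-checked and nulls mapped to 'und', as the treatment suggests. This is a sketch: `clean_rol` is a hypothetical helper, and the regular expression only checks that a value looks like a three-letter code, not that the code is actually assigned in ISO 639-3.

```python
import re

ROL_SHAPE = re.compile(r"[a-z]{3}")
UNDETERMINED = "und"  # the undetermined code, already present 131 times in the data


def clean_rol(code):
    """Return a lower-case 3-letter code, mapping nulls and blanks to 'und'."""
    if code is None or not code.strip():
        return UNDETERMINED
    code = code.strip().lower()
    if ROL_SHAPE.fullmatch(code) is None:
        raise ValueError(f"not a 3-letter code: {code!r}")
    return code


cleaned = [clean_rol(value) for value in ["eng", " SPA ", None, "und"]]
```

One caveat on the design: folding nulls into 'und' merges "row has no CPPI data" with "language recorded as undetermined". If that distinction matters downstream, keep a separate missingness flag.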
Out[71]:

saturn.columns["CPPIROL"].stats

stat value
n 19,375
nulls 7,119 (36.7%)
unique 5,014
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 7,242
duplicate_rate 0.5909
vocab_size 5,014
readability_flesch_mean 119.5
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: null_rate 36.7% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 59.1% duplicate strings
Fig 28.
Character-length distribution for CPPIROL.
Character-length distribution for CPPIROL (mean: 3.0). All 12,256 non-null values are 3 characters long; every other bin is empty.
chars  count
3      12256

CPPIPrimaryLanguage text feature

This column records the primary language name for each people group, with 5,014 distinct values across the 12,256 non-null rows; most entries are a single word (one_word_rate 0.77, word_mean 1.29). Top values look canonical (English 401, Spanish 351, Telugu 214), 36.7% of rows are null, and 59.1% of non-null values are duplicates. The explicit 'Undetermined' bucket (131) and the long tail of 5,014 names could indicate free-text entry, but the unique count, duplicate count, and 'Undetermined' count all match CPPIROL exactly (5,014, 7,242, and 131 for 'und'), which points instead to names that map one-to-one onto ISO 639-3 codes. Worth noting: 'arabic' appears 385 times in top_words but does not surface in top_values, suggesting it is split across multi-word variants (e.g. dialect-qualified forms).

Treatment: Confirm the one-to-one mapping against CPPIROL, canonicalize any variants that break it, and map 'Undetermined' and nulls to a single missing token before encoding.

anthropic:claude-opus-4-7 · confidence high
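
One way to test whether the language names behave like a controlled vocabulary is to check them against the CPPIROL codes in the same rows. The sketch below assumes each row supplies a (code, name) pair; `mapping_conflicts` is a hypothetical helper and the sample pairs are invented from names and codes quoted in this report.

```python
from collections import defaultdict


def mapping_conflicts(pairs):
    """Return codes with several names, and names with several codes."""
    names_by_code = defaultdict(set)
    codes_by_name = defaultdict(set)
    for code, name in pairs:
        if code is None or name is None:
            continue  # skip rows where either side is null
        names_by_code[code].add(name)
        codes_by_name[name].add(code)
    multi_name = {c: sorted(n) for c, n in names_by_code.items() if len(n) > 1}
    multi_code = {n: sorted(c) for n, c in codes_by_name.items() if len(c) > 1}
    return multi_name, multi_code


clean_pairs = [("eng", "English"), ("spa", "Spanish"), ("eng", "English"),
               ("und", "Undetermined"), (None, None)]
dirty_pairs = clean_pairs + [("eng", "english")]  # invented casing variant

clean_result = mapping_conflicts(clean_pairs)
dirty_result = mapping_conflicts(dirty_pairs)
```

Two empty dictionaries mean the names and codes are in one-to-one correspondence, in which case the code column can serve as the canonical key and the names need no further normalization.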
Out[74]:

saturn.columns["CPPIPrimaryLanguage"].stats

stat value
n 19,375
nulls 7,119 (36.7%)
unique 5,014
len_min 1
len_max 41
len_mean 8.622
len_median 7
len_p95 18
word_mean 1.288
word_median 1
n_empty 0
n_duplicates 7,242
duplicate_rate 0.5909
vocab_size 5,098
readability_flesch_mean 37.09
emoji_rate 0
url_rate 0
one_word_rate 0.7674
allcaps_rate 0
boilerplate_rate 0
alert: one_word 76.7% rows are a single word
alert: null_rate 36.7% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 59.1% duplicate strings
Fig 29.
Character-length distribution for CPPIPrimaryLanguage.
Character-length distribution for CPPIPrimaryLanguage (mean: 8.62).
chars      count
1 – 2      2
2 – 3      32
3 – 4      261
4 – 5      1169
5 – 6      1753
6 – 7      1817
7 – 8      2203
8 – 9      958
9 – 10     564
10 – 11    501
11 – 12    277
12 – 13    398
13 – 14    324
14 – 15    256
15 – 16    592
16 – 17    374
17 – 18    110
18 – 19    90
19 – 20    66
20 – 21    84
21 – 22    65
22 – 23    101
23 – 24    89
24 – 25    50
25 – 26    23
26 – 27    15
27 – 28    10
28 – 29    3
29 – 30    11
30 – 31    13
31 – 32    12
32 – 33    12
33 – 34    5
34 – 35    0
35 – 36    1
36 – 37    0
37 – 38    2
38 – 39    1
39 – 40    8
40 – 41    4

CPPIPrimaryReligion categorical label

Categorical label identifying the primary religion of each people group, drawn from a fixed taxonomy of 40 religious categories. The distribution is broad (entropy ratio 0.71) with no dominant class: the leading value 'Islam - Sunni' covers only 14.7% of non-null rows, followed by 'Christianity - Non-Evangelical Protestant' and two Ethnoreligion variants. Note the heavy 47.6% null rate, higher than the 36.7% seen in the other CPPI columns, which will materially shrink any analysis conditioned on this field.

Treatment: Impute or bucket nulls explicitly before use; consider collapsing the 40 categories into parent religions for modelling.

anthropic:claude-opus-4-7 · confidence high
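
Collapsing the categories into parent religions can be sketched by splitting on the ' - ' separator. That separator convention is inferred from the listed values ('Islam - Sunni', 'Ethnoreligion - Animism') and is an assumption about the full taxonomy; `parent_religion` is a hypothetical helper, and the counts are a six-row subset copied from the top-values table.

```python
from collections import Counter


def parent_religion(label):
    """'Islam - Sunni' -> 'Islam'; labels without ' - ' are returned unchanged."""
    return label.split(" - ", 1)[0].strip()


# Subset of the top-values table (counts as reported).
top_values = {
    "Islam - Sunni": 1495,
    "Christianity - Non-Evangelical Protestant": 1358,
    "Ethnoreligion": 1223,
    "Ethnoreligion - Animism": 1196,
    "Christianity - Roman Catholicism": 978,
    "Islam - Folk": 359,
}

parents = Counter()
for label, count in top_values.items():
    parents[parent_religion(label)] += count
```

Splitting on ' - ' with surrounding spaces, rather than on a bare hyphen, keeps hyphenated names such as 'Non-Evangelical Protestant' and 'Shinbutso-shugo' intact.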
Out[77]:

saturn.columns["CPPIPrimaryReligion"].stats

stat value
n 19,375
nulls 9,227 (47.6%)
unique 40
top_value Islam - Sunni
top_rate 0.1473
cardinality 40
entropy 3.791
entropy_ratio 0.7123
alert: null_rate 47.6% null
Fig 30.
Top values for CPPIPrimaryReligion.
Top values for CPPIPrimaryReligion (20 unique shown, of 40 total). Shares are fractions of all 19,375 rows, nulls included.
value                                      count  share
Islam - Sunni                              1495   7.7%
Christianity - Non-Evangelical Protestant  1358   7.0%
Ethnoreligion                              1223   6.3%
Ethnoreligion - Animism                    1196   6.2%
Christianity - Roman Catholicism           978    5.0%
Hinduism                                   820    4.2%
Christianity - Evangelical Protestant      738    3.8%
Islam - Folk                               359    1.9%
Christianity - Folk Catholicism            349    1.8%
Christianity - Eastern Orthodox            257    1.3%
Ethnoreligion - Chinese Folk Religion      252    1.3%
Hinduism - Folk                            192    1.0%
Unaffiliated                               176    0.9%
Christianity - Neo-Pentecostalism          121    0.6%
Islam - Shia                               103    0.5%
Buddhism - Theravada                       103    0.5%
Buddhism - Tibetan                         97     0.5%
Judaism                                    69     0.4%
Unaffiliated - Athiesm                     50     0.3%
Ethnoreligion - Shinbutso-shugo            34     0.2%

CPPIGSEC numeric feature

CPPIGSEC is an integer-coded categorical with only 7 unique values (0 to 6), almost certainly an ordinal classification rather than a true numeric measure (the name plausibly refers to the Global Status of Evangelical Christianity scale). The distribution is not flat, despite the negative kurtosis (-1.43): code 1 alone holds 6,549 of the 12,256 non-null rows (53.4%), codes 4 to 6 hold another 40.3%, and codes 0, 2, and 3 together account for just 6.3%. That two-cluster shape explains why the median is 1 while the mean is 2.65 and the IQR spans 1 to 5. Notably, 36.74% of rows are null, which is large enough to materially affect any downstream join or model.

Treatment: Treat as categorical/ordinal and impute or add a missingness indicator before modelling.

anthropic:claude-opus-4-7 · confidence high
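
Tallying the non-empty histogram bins as category counts shows how the codes are actually spread and cross-checks the reported mean. The counts below are copied from the histogram table; the `one_hot` helper, with its extra slot for nulls, is one hypothetical way to realize the categorical-plus-missingness treatment.

```python
# Count per integer code, taken from the non-empty histogram bins.
counts = {0: 323, 1: 6549, 2: 243, 3: 202, 4: 1632, 5: 1479, 6: 1828}

total = sum(counts.values())  # number of non-null rows
shares = {code: n / total for code, n in counts.items()}
mean = sum(code * n for code, n in counts.items()) / total


def one_hot(code):
    """One indicator per code, plus a final indicator that marks a null."""
    vector = [0] * (len(counts) + 1)
    vector[len(counts) if code is None else code] = 1
    return vector
```

Here `total` comes to 12,256 (matching 19,375 rows minus 7,119 nulls), `mean` rounds to the reported 2.654, and code 1 alone accounts for a little over half of the non-null rows.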
Out[80]:

saturn.columns["CPPIGSEC"].stats

stat value
n 19,375
nulls 7,119 (36.7%)
unique 7
min 0
max 6
mean 2.654
median 1
std 2.058
q1 1
q3 5
iqr 4
skew 0.5224
kurtosis -1.432
n_outliers 0
outlier_rate 0
zero_rate 0.02635
alert: null_rate 36.7% null
Fig 31.
Distribution of CPPIGSEC. Vertical dash marks the median.
Histogram bins for CPPIGSEC (median: 1.0). Only the seven integer codes occur; all intermediate bins have count 0.
code  count
0     323
1     6549
2     243
3     202
4     1632
5     1479
6     1828

CPPIEvangelicalEngagement categorical feature

Binary evangelical-engagement flag from the CPPI fields, taking only 'Engaged' or 'Unengaged'. 'Engaged' leads at 59.1% of non-null rows (7,238 vs 5,018), a moderate imbalance; the entropy ratio of 0.976 confirms that neither class is rare. Notably, 36.74% of rows are null, which is large enough to materially bias any downstream split.

Treatment: Encode as a three-level categorical (Engaged, Unengaged, missing), or as a binary flag plus a missingness indicator, before modelling.

anthropic:claude-opus-4-7 · confidence high
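
The entropy figures and the two different percentages quoted for 'Engaged' can be reproduced from the counts alone. This sketch assumes that saturn reports Shannon entropy in bits and normalizes it by log2 of the cardinality; that reading is inferred from the numbers matching, not from saturn's documentation.

```python
import math

counts = {"Engaged": 7238, "Unengaged": 5018}
n_rows = 19375
non_null = sum(counts.values())


def shannon_entropy_bits(values):
    """Shannon entropy, in bits, of a collection of category counts."""
    total = sum(values)
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)


entropy = shannon_entropy_bits(counts.values())
# With two classes the normalizer log2(2) is 1, so the ratio equals the entropy.
entropy_ratio = entropy / math.log2(len(counts))

share_of_non_null = counts["Engaged"] / non_null  # basis for top_rate
share_of_all_rows = counts["Engaged"] / n_rows    # basis for the table's share column
```

The same count therefore appears as roughly 59% in the prose and 37% in the top-values table: the first divides by non-null rows, the second by all rows.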
Out[83]:

saturn.columns["CPPIEvangelicalEngagement"].stats

stat value
n 19,375
nulls 7,119 (36.7%)
unique 2
top_value Engaged
top_rate 0.5906
cardinality 2
entropy 0.9762
entropy_ratio 0.9762
alert: null_rate 36.7% null
Fig 32.
Top values for CPPIEvangelicalEngagement.
Top values for CPPIEvangelicalEngagement (2 unique shown, of 2 total). Shares are fractions of all 19,375 rows, nulls included.
value      count  share
Engaged    7238   37.4%
Unengaged  5018   25.9%

How to cite

BibTeX
@misc{saturn-extracted-cppi-jp-cppi-cross-reference-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: extracted cppi jp cppi cross reference},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/extracted_cppi-jp-cppi-cross-reference}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: extracted cppi jp cppi cross reference. Source: /home/coolhand/html/datavis/data_trove/joshua-project/archive/extracted_cppi/jp-cppi-cross-reference.csv. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/extracted_cppi-jp-cppi-cross-reference