saturn·

extracted cppi jp cppi cross reference

source /home/coolhand/html/datavis/data_trove/joshua-project/archive/extracted_cppi/jp-cppi-cross-reference.csv 19,375 rows 24 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Joshua Project cross-reference of 19,375 people groups across 240 countries, combining CPPI and JP fields covering language, religion, population, and evangelical engagement. India dominates the country distribution at 17.2% of rows, followed by Papua New Guinea, Pakistan, and Indonesia, so any global view should account for that skew. JPPrimaryReligion is weighted toward Christianity (41% of non-null rows), with Islam, Ethnic Religions, and Hinduism trailing, while CPPIEvangelicalEngagement is split roughly 59% Engaged vs 41% Unengaged among the non-null rows. Watch the population and evangelical-percentage fields: JPPopulation is extremely long-tailed (max ~919M, median 16,000, skew ~95) and JP%Evangelical is right-skewed with ~27% zeros and ~10% outliers. Also note the high null rates on the CPPI-prefixed columns (~37%, rising to 47.6% for CPPIPrimaryReligion) and on JPLeastReached (63%), which constrain joinability.

citing: row_count · column_count · Ctry.top_values · Ctry.stats.top_rate · JPPrimaryReligion.top_values · JPPrimaryReligion.stats.top_rate · CPPIEvangelicalEngagement.top_values · CPPIEvangelicalEngagement.stats.top_rate · JPPopulation.stats · JP%Evangelical.stats · CPPIPrimaryReligion.top_values · JPLeastReached.null_rate · CPPIPopulation.null_rate
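
The joinability concern can be made concrete by tallying which side of the cross-reference each row carries. The sketch below assumes (an inference from the matching null counts, not something the profile states) that a null PeopleID3 marks a missing JP side and a null PEID marks a missing CPPI side. The function name `side_coverage` and the sample rows are hypothetical.

```python
from collections import Counter

def side_coverage(rows):
    """Count rows by which identifier is present (None stands for null)."""
    def side(row):
        has_jp = row.get("PeopleID3") is not None    # assumed JP-side key
        has_cppi = row.get("PEID") is not None       # assumed CPPI-side key
        if has_jp and has_cppi:
            return "both"
        if has_jp:
            return "jp_only"
        if has_cppi:
            return "cppi_only"
        return "neither"
    return Counter(side(r) for r in rows)

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"PeopleID3": 10119, "PEID": 1},
    {"PeopleID3": 10120, "PEID": None},
    {"PeopleID3": None, "PEID": 2},
]
coverage = side_coverage(sample_rows)
```

Running the same tally on the full file would show how many rows survive an inner join on both identifiers.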

Schema

24 columns
Per-column summary.

Column                     Type         Nulls   Unique  Alerts
Type                       numeric      0.0%    3
PEID                       numeric      36.7%   12,256  null_rate
ROP3                       numeric      0.0%    11,786
PeopleID3                  numeric      10.9%   10,284
ROG3                       categorical  0.0%    240
Ctry                       categorical  0.0%    240
JPPeopleGroup              text         10.9%   10,624  one_word, duplicates
JPPopulation               numeric      11.0%   1,731   high_skew, outliers
JPIndigenous               categorical  10.9%   2
JPROL3                     text         10.9%   6,165   one_word, short_text, duplicates
JPPrimaryLanguage          text         10.9%   6,152   one_word, short_text, duplicates
JPRLG3                     numeric      10.9%   8
JPPrimaryReligion          categorical  10.9%   8
JPScale                    numeric      10.9%   5
JPLeastReached             categorical  62.9%   1       null_rate, imbalance
JP%ChristianAdherent       numeric      10.9%   957
JP%Evangelical             numeric      17.2%   692     high_skew, outliers
CPPIPeopleGroup            text         36.7%   9,170   one_word, null_rate, short_text, duplicates
CPPIPopulation             text         36.7%   1,629   multilingual, allcaps, null_rate, short_text, duplicates
CPPIROL                    text         36.7%   5,014   one_word, null_rate, short_text, duplicates
CPPIPrimaryLanguage        text         36.7%   5,014   one_word, null_rate, short_text, duplicates
CPPIPrimaryReligion        categorical  47.6%   40      null_rate
CPPIGSEC                   numeric      36.7%   7       null_rate
CPPIEvangelicalEngagement  categorical  36.7%   2       null_rate

Type

numeric feature
Type is encoded numerically but takes only 3 distinct values across 19,375 rows (min 1, max 3, median 1), so it is almost certainly a categorical code rather than a true measurement. The distribution leans toward the lowest category, with mean 1.585 and Q3 of 2.0 indicating most rows are 1 or 2. No nulls or outliers are present. Treatment: Treat as a categorical variable and one-hot encode before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 0 (0.0%) · unique: 3
min: 1 · max: 3 · mean: 1.585 · median: 1 · std: 0.6785
q1: 1 · q3: 2 · iqr: 1 · skew: 0.7351 · kurtosis: -0.6013
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0
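
A minimal sketch of the one-hot treatment for Type, using only the standard library. It assumes the three levels are the integers 1, 2 and 3, which follows from the stats (3 distinct values, min 1, max 3). The function name `one_hot_type` is hypothetical.

```python
def one_hot_type(values, levels=(1, 2, 3)):
    """Return one dict of 0/1 indicators per input value."""
    return [{f"Type_{lvl}": int(v == lvl) for lvl in levels} for v in values]

# Hypothetical sample values.
encoded = one_hot_type([1, 3, 2, 1])
```

Each row ends up with exactly one indicator set to 1, so no ordering between the codes is implied.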

PEID

numeric identifier null_rate
PEID looks like an entity identifier: integer values from 1 to 50,520 with no zeros. The 36.7% null rate is the headline concern. Among populated rows the field is unique: 19,375 − 7,119 = 12,256 non-null rows carry 12,256 distinct values, so no PEID repeats. The 7,119 nulls equal the null count of CPPIPeopleGroup, CPPIPopulation, CPPIROL and the other 36.7%-null CPPI columns, which suggests PEID is the key for the CPPI side of the cross-reference. The near-uniform spread (skew 0.33, kurtosis -1.50) is consistent with an ID rather than a measured quantity. Treatment: Treat as a key for joins; do not use as a numeric feature, and decide on a policy for the 36.7% missing IDs. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 7,119 (36.7%) · unique: 12,256
min: 1 · max: 50,520 · mean: 2.564e+04 · median: 1.838e+04 · std: 1.637e+04
q1: 1.213e+04 · q3: 4.329e+04 · iqr: 31,162 · skew: 0.3311 · kurtosis: -1.499
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0
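
Whether an identifier repeats can be read off the stats (19,375 rows minus 7,119 nulls leaves 12,256 non-null rows, equal to the 12,256 unique values), and it is easy to confirm on the data. The sketch below is one way to do that; `key_report` is a hypothetical helper and the sample IDs are made up.

```python
from collections import Counter

def key_report(ids):
    """Summarise nulls, distinct values and repeated values of an ID column."""
    present = [i for i in ids if i is not None]
    counts = Counter(present)
    return {
        "n_null": len(ids) - len(present),
        "n_unique": len(counts),
        "n_repeated": sum(1 for c in counts.values() if c > 1),
    }

# Hypothetical sample: one null, no repeats.
report = key_report([1, 2, None, 50520])
```

A column is usable as a primary key on its populated rows when `n_repeated` is 0.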

ROP3

numeric identifier
ROP3 holds 6-digit integers tightly bounded between 100,004 and 119,498 with mean 109,260.77 and median 109,305, suggesting a coded identifier rather than a measured quantity. The distribution is essentially flat (kurtosis -1.16, skew 0.12) with no outliers and only 2 nulls. With 11,786 unique values across 19,375 rows, each code appears about 1.6 times on average, but no dominant mode is visible from these stats. Treatment: Treat as a categorical code; do not feed as a continuous numeric feature. medium · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2 (0.0%) · unique: 11,786
min: 100,004 · max: 119,498 · mean: 1.093e+05 · median: 109,305 · std: 5558
q1: 104,126 · q3: 114,057 · iqr: 9,931 · skew: 0.1203 · kurtosis: -1.159
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0

PeopleID3

numeric foreign_key
PeopleID3 is almost certainly an identifier: values are integers ranging from 10,119 to 22,498 with no zeros, no outliers, and 10,284 distinct values across 19,375 rows. Since the rows are people groups, it most plausibly identifies a people group rather than an individual person. Its 2,108 nulls (10.88%) equal the null count of most JP-prefixed columns, so a missing PeopleID3 likely means the row has no JP-side match. The duplication (17,267 non-null rows over 10,284 ids, ≈1.68 rows per id) indicates the same id appears in multiple rows. The roughly uniform spread (kurtosis -0.97, mild skew 0.39) is consistent with sequentially assigned ids rather than a measured quantity. Treatment: Treat as a foreign key for left-joins to a people-group table; do not use as a numeric feature. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,108 (10.9%) · unique: 10,284
min: 10,119 · max: 22,498 · mean: 1.524e+04 · median: 14,780 · std: 3413
q1: 12,247 · q3: 1.814e+04 · iqr: 5898 · skew: 0.3911 · kurtosis: -0.9664
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0

ROG3

categorical feature
ROG3 is a two-letter country code with 240 distinct values and no nulls across 19,375 rows, matching the cardinality of Ctry. Distribution is moderately concentrated: 'IN' alone accounts for 17.2% of records, followed by 'PP', 'PK', 'ID', and 'US', with entropy ratio 0.78 indicating reasonable spread across the long tail. 'PP' is not an ISO 3166-1 code. Since Papua New Guinea is the second most common value in Ctry, the codes more likely follow a FIPS 10-4-style scheme, in which 'PP' denotes Papua New Guinea (ISO uses 'PG'). Verify the scheme before joining to ISO-keyed tables. Treatment: Group rare codes into 'Other' and one-hot or target-encode for modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 0 (0.0%) · unique: 240
top_value: IN · top_rate: 0.1719 · cardinality: 240
entropy: 6.175 · entropy_ratio: 0.781

Ctry

categorical feature
Country name field with 240 distinct values across 19,375 rows and zero nulls. India dominates at 17.2% (3,330 rows), followed by Papua New Guinea, Pakistan, and Indonesia, with the long tail spread broadly (entropy ratio 0.78). The Papua New Guinea share is large relative to its population, but each row is a people group rather than a person, so the share reflects how many distinct groups are recorded per country. Treatment: Group rare countries into an 'Other' bucket and target- or frequency-encode before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 0 (0.0%) · unique: 240
top_value: India · top_rate: 0.1719 · cardinality: 240
entropy: 6.175 · entropy_ratio: 0.781
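
The 'Other' bucketing and frequency encoding suggested for Ctry can be sketched as follows. The threshold `min_count` is an assumption to be tuned, and both function names are hypothetical.

```python
from collections import Counter

def bucket_rare(values, min_count=2, other="Other"):
    """Replace values seen fewer than min_count times with a single bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def frequency_encode(values):
    """Map each value to its share of all rows."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

# Hypothetical sample, not the real distribution.
countries = ["India", "India", "Papua New Guinea", "Fiji"]
bucketed = bucket_rare(countries)
shares = frequency_encode(bucketed)
```

Bucketing before encoding keeps the long tail from producing hundreds of near-empty categories.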

JPPeopleGroup

text feature one_word duplicates
This column holds Joshua Project-style people group labels — short ethnonyms or community descriptors like 'Deaf', 'British', 'French', and qualified forms such as 'Arab, (Muslim traditions)'. Values are short (mean 11.4 chars, median 1 word) and 54% are single-word, but with 10,624 uniques across 19,375 rows and a 38.5% duplicate rate, plus 10.9% nulls, the field is high-cardinality categorical with a long tail. The prevalence of parenthetical religious qualifiers ('(hindu', '(muslim', 'traditions)') in top_words suggests an embedded taxonomy that would need parsing to separate ethnicity from religious tradition. Treatment: Normalize casing and split parenthetical religious qualifiers into a separate column before encoding as a high-cardinality category. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,108 (10.9%) · unique: 10,624
len_min: 1 · len_max: 41 · len_mean: 11.35 · len_median: 9 · len_p95: 25
word_mean: 1.618 · word_median: 1 · n_empty: 0
n_duplicates: 6,643 · duplicate_rate: 0.3847 · vocab_size: 11,428
readability_flesch_mean: 39.84 · emoji_rate: 0 · url_rate: 0
one_word_rate: 0.5414 · allcaps_rate: 0 · boilerplate_rate: 0
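
Separating the parenthetical qualifier from the group name could look like the sketch below. It assumes qualifiers always appear as a single trailing parenthesised phrase, as in 'Arab, (Muslim traditions)'; other layouts would need more rules. `split_qualifier` is a hypothetical helper.

```python
import re

# Name, optional separators, then one trailing "(...)" group.
_QUALIFIED = re.compile(r"^(?P<name>.*?)[\s,]*\((?P<qual>[^()]*)\)\s*$")

def split_qualifier(label):
    """Return (name, qualifier); qualifier is None when there is none."""
    match = _QUALIFIED.match(label)
    if match is None:
        return label.strip(), None
    return match.group("name").strip(), match.group("qual").strip()

qualified = split_qualifier("Arab, (Muslim traditions)")
plain = split_qualifier("Deaf")
```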

JPPopulation

numeric feature high_skew outliers
This appears to be a Joshua Project (JP) population count per people group, with values spanning from 10 to 918,811,000 and a median of just 16,000. The distribution is extraordinarily right-skewed (skew 94.9, kurtosis 10,840): the mean of 468,378 sits far above the Q3 of 85,000, and 2,609 rows (15.1% of non-null rows) flag as outliers. Roughly 11% of rows are null, and only 1,731 unique values across 19,375 rows suggests heavy repetition, consistent with rounded estimates. Treatment: log-transform and impute nulls before modelling; investigate the 918M max as a possible unit error. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,128 (11.0%) · unique: 1,731
min: 10 · max: 9.188e+08 · mean: 4.684e+05 · median: 16,000 · std: 7.861e+06
q1: 3,100 · q3: 85,000 · iqr: 81,900 · skew: 94.92 · kurtosis: 1.084e+04
n_outliers: 2,609 · outlier_rate: 0.1513 · zero_rate: 0
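
One way to apply the log transform and imputation is sketched below. Median imputation is an assumption chosen for illustration, not something the profile prescribes. A plain `log10` is safe here because the reported minimum is 10 and zero_rate is 0. `log_population` is a hypothetical helper.

```python
import math
from statistics import median

def log_population(values):
    """Return (log10 values with median-imputed nulls, was-null flags)."""
    present = [v for v in values if v is not None]
    fill = median(present)  # assumed imputation policy
    logged = [math.log10(fill if v is None else v) for v in values]
    was_null = [v is None for v in values]
    return logged, was_null

# Hypothetical sample values.
logged, was_null = log_population([10, 1000, None, 100000])
```

Keeping the `was_null` flags lets a model distinguish imputed rows from observed ones.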

JPIndigenous

categorical feature
Binary Y/N flag, almost certainly indicating whether a people group is classified as indigenous to the country of the row (the 'JP' prefix refers to Joshua Project, the dataset source). 'Y' dominates at 71.2% of non-null rows (12,293 vs 4,974 'N'), and 10.88% of rows are null. Entropy ratio of 0.87 shows a moderate imbalance once nulls are excluded. Treatment: Encode as boolean and decide explicitly whether to impute or flag the ~11% nulls before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,108 (10.9%) · unique: 2
top_value: Y · top_rate: 0.7119 · cardinality: 2
entropy: 0.8662 · entropy_ratio: 0.8662

JPROL3

text feature one_word short_text duplicates
JPROL3 holds three-letter ISO 639-3 language codes (hin, eng, ben, spa, tam...), with every value being a single token of length 3. The field is 10.88% null and heavily duplicated (64.3% duplicate rate across 6,165 unique values), with Hindi (700) and English (537) leading. The unique count (6,165) is one higher than vocab_size (6,164) and allcaps_rate is nonzero, which may indicate an upper-case variant of an otherwise lower-case code. A count above 6,000 is plausible for a global language inventory, since ISO 639-3 defines more than 7,000 codes, so the long tail more likely reflects broad linguistic coverage than noise. Treatment: Treat as a categorical language-code feature; normalize casing, then one-hot or target-encode top codes and bucket the long tail. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,108 (10.9%) · unique: 6,165
len_min: 3 · len_max: 3 · len_mean: 3 · len_median: 3 · len_p95: 3
word_mean: 1 · word_median: 1 · n_empty: 0
n_duplicates: 11,102 · duplicate_rate: 0.643 · vocab_size: 6,164
readability_flesch_mean: 115.3 · emoji_rate: 0 · url_rate: 0
one_word_rate: 1 · allcaps_rate: 0.0003475 · boilerplate_rate: 0

JPPrimaryLanguage

text feature one_word short_text duplicates
Categorical free-text capturing each people group's primary language, dominated by single-word entries (one_word_rate 0.7335) with a mean of 1.32 words and median length 7. With 6,152 unique values across 19,375 rows, the long tail is heavy: the top value 'Hindi' appears just 700 times, 64.37% of non-null rows repeat an earlier value, and 10.88% are null. Surprising: vocab_size (6,236) exceeds n_unique (6,152), implying compound entries like 'Chinese, Mandarin' and 'Punjabi,' that split into multiple tokens and will fragment any token-level count. Treatment: Normalise casing and split comma-separated entries before encoding as a categorical or multi-label feature. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,108 (10.9%) · unique: 6,152
len_min: 1 · len_max: 41 · len_mean: 9.088 · len_median: 7 · len_p95: 18
word_mean: 1.319 · word_median: 1 · n_empty: 0
n_duplicates: 11,115 · duplicate_rate: 0.6437 · vocab_size: 6,236
readability_flesch_mean: 31.33 · emoji_rate: 0 · url_rate: 0
one_word_rate: 0.7335 · allcaps_rate: 0 · boilerplate_rate: 0
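
A hedged sketch of the comma split follows. It assumes that an entry such as 'Chinese, Mandarin' is one language written as head plus qualifier rather than two separate languages, so it returns a (head, qualifier) pair instead of a multi-label list. That reading is an assumption and should be checked against the source's naming convention. `split_language` is a hypothetical helper.

```python
def split_language(name):
    """Split 'Head, Qualifier' into (head, qualifier or None)."""
    parts = [p.strip() for p in name.split(",") if p.strip()]
    if not parts:
        return None, None
    qualifier = ", ".join(parts[1:]) or None
    return parts[0], qualifier

compound = split_language("Chinese, Mandarin")
simple = split_language("Hindi")
trailing = split_language("Punjabi,")
```

Grouping on the head alone gives a coarser language family style feature, while the full pair preserves the original distinction.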

JPRLG3

numeric feature
JPRLG3 is a low-cardinality integer code taking only 8 distinct values between 1 and 9, with a near-symmetric spread (skew 0.096) and a flat distribution (kurtosis -1.60). Its cardinality and null count (8 values, 2,108 nulls) match JPPrimaryReligion exactly, so it is most likely the numeric code for that label rather than an ordinal rating, in which case the mean (3.37) and median (4.0) carry no meaning. About 10.88% of rows are null, which is the main quirk to address. Treatment: Treat as a nominal categorical (8 levels), confirm the one-to-one mapping to JPPrimaryReligion, and impute or flag the 10.88% missing before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,108 (10.9%) · unique: 8
min: 1 · max: 9 · mean: 3.367 · median: 4 · std: 2.199
q1: 1 · q3: 6 · iqr: 5 · skew: 0.09636 · kurtosis: -1.595
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0
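
JPRLG3 and JPPrimaryReligion share the same cardinality (8) and null count (2,108), which raises the question of whether one is simply the code for the other. A minimal sketch of a one-to-one check is below. The code/label pairs in the sample are invented for illustration, and `is_one_to_one` is a hypothetical helper.

```python
def is_one_to_one(pairs):
    """True when every code maps to one label and every label to one code."""
    code_to_label, label_to_code = {}, {}
    for code, label in pairs:
        if code is None or label is None:
            continue  # skip rows where either side is null
        if code_to_label.setdefault(code, label) != label:
            return False
        if label_to_code.setdefault(label, code) != code:
            return False
    return True

# Hypothetical pairs; the real code values are not given in the profile.
consistent = is_one_to_one([(1, "Christianity"), (6, "Islam"), (1, "Christianity")])
conflicting = is_one_to_one([(1, "Christianity"), (1, "Islam")])
```

The same check applies to any code/label pair in the file, for example CPPIROL against CPPIPrimaryLanguage.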

JPPrimaryReligion

categorical feature
Categorical label assigning a primary religion to each record (likely a people group profile from a Joshua Project-style source). Eight categories cover 19,375 rows with ~10.9% missing; Christianity leads at 41.4% of non-null rows (7,140), followed by Islam (4,003) and Ethnic Religions (2,641), with entropy_ratio 0.72 indicating a moderately balanced distribution rather than a single dominant class. Note the explicit 'Unknown' bucket (26) co-existing with a 10.88% null rate, suggesting two distinct flavors of missingness. Treatment: One-hot or target-encode; reconcile the explicit 'Unknown' category with the 10.88% nulls before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,108 (10.9%) · unique: 8
top_value: Christianity · top_rate: 0.4135 · cardinality: 8
entropy: 2.166 · entropy_ratio: 0.722
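
Reconciling the explicit 'Unknown' label with nulls comes down to a policy choice: merge the two, or keep them apart. The sketch below supports both. The label 'Missing' and the function name `reconcile_unknown` are hypothetical.

```python
def reconcile_unknown(values, merge=True, missing_label="Missing"):
    """Fill nulls with 'Unknown' (merge) or with a separate label."""
    fill = "Unknown" if merge else missing_label
    return [fill if v is None else v for v in values]

# Hypothetical sample values.
religions = ["Islam", None, "Unknown"]
merged = reconcile_unknown(religions)
separate = reconcile_unknown(religions, merge=False)
```

Keeping the labels separate preserves the difference between "recorded as unknown" and "no JP-side record at all".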

JPScale

numeric feature
JPScale is an integer-coded ordinal feature with only 5 distinct values spanning 1 to 5, a mean of 2.71 and median of 3, suggesting a Likert-style or severity rating. The distribution is broad and flat (kurtosis -1.64) rather than peaked, with no outliers and a 10.88% null rate that should be handled explicitly. Treatment: Treat as ordinal categorical (1-5) and impute or flag the ~11% missing values before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,108 (10.9%) · unique: 5
min: 1 · max: 5 · mean: 2.71 · median: 3 · std: 1.623
q1: 1 · q3: 4 · iqr: 3 · skew: 0.159 · kurtosis: -1.637
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0

JPLeastReached

categorical metadata null_rate imbalance
JPLeastReached holds a single non-null value 'Y' across all 7,186 populated rows, with the remaining 62.91% of records null. With cardinality 1 and entropy 0, the field carries no discriminative information and effectively functions as a presence flag rather than a category. Treatment: Drop, or recode as a binary is-present indicator if the null pattern is meaningful. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 12,189 (62.9%) · unique: 1
top_value: Y · top_rate: 1 · cardinality: 1
entropy: 0 · entropy_ratio: 0

JP%ChristianAdherent

numeric feature
Numeric share (0–100) of a population identified as Christian adherents, judging by the name and full 0–100 range with mean 37.58 and median 17.0. The quartiles point to a bimodal shape: q1 sits at 0.02 while q3 hits 80.0, with 23.7% exact zeros and kurtosis of -1.60, so values cluster at the extremes rather than the middle. Null rate is 10.88%, which is non-trivial and should be handled explicitly. Treatment: Treat as a bounded percentage; impute or flag the 10.88% nulls and consider a zero-inflation indicator before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 2,108 (10.9%) · unique: 957
min: 0 · max: 100 · mean: 37.58 · median: 17 · std: 39.33
q1: 0.02 · q3: 80 · iqr: 79.98 · skew: 0.3824 · kurtosis: -1.602
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0.2371

JP%Evangelical

numeric feature high_skew outliers
This column appears to capture the percentage of Evangelical adherents for some geographic or demographic unit (the JP prefix suggests a JoshuaProject-style indicator). The distribution is heavily right-skewed (skew 2.53, kurtosis 8.38) with a median of just 1.5 and Q3 of 8.0, yet a max of 95.0, and 27% of values are exactly zero. About 17.2% are null and 9.7% flag as outliers, so the long tail of high-Evangelical units is both real and sparse. Treatment: Impute or flag nulls, then apply a log1p or similar transform before modelling to tame the skew. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 3,332 (17.2%) · unique: 692
min: 0 · max: 95 · mean: 6.279 · median: 1.5 · std: 10.18
q1: 0 · q3: 8 · iqr: 8 · skew: 2.535 · kurtosis: 8.378
n_outliers: 1,561 · outlier_rate: 0.0973 · zero_rate: 0.27
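
A sketch of the log1p treatment, with explicit zero and missing flags so the 27% exact zeros and 17.2% nulls stay visible to a model. Flagging rather than imputing is an assumption made for illustration, and `transform_pct` is a hypothetical helper.

```python
import math

def transform_pct(values):
    """Return log1p of each percentage plus zero/missing indicators."""
    rows = []
    for v in values:
        if v is None:
            rows.append({"log1p_pct": None, "is_zero": None, "is_missing": 1})
        else:
            rows.append({
                "log1p_pct": math.log1p(v),  # log(1 + v), defined at v = 0
                "is_zero": int(v == 0),
                "is_missing": 0,
            })
    return rows

# Hypothetical sample values within the reported 0 to 95 range.
features = transform_pct([0, 1.5, None, 95])
```

`log1p` is used instead of a plain logarithm because the column contains exact zeros.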

CPPIPeopleGroup

text feature one_word null_rate short_text duplicates
This column appears to hold ethnolinguistic or people-group labels from the CPPI side of the cross-reference (the profile does not expand the acronym), with values like 'Han Chinese', 'British', 'Russian', and 'Korean' dominating. It is short and categorical-like (72.7% are single words and mean length is 8.7 characters), but with 9,170 unique values across 19,375 rows it is high-cardinality, and 36.7% are null. Duplication is also notable (25.2% of non-null rows), and the long tail includes compound entries like 'Han Chinese, Mandarin', suggesting inconsistent delimiting. Treatment: Normalize delimiters and treat as a high-cardinality categorical; group rare levels or target-encode before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 7,119 (36.7%) · unique: 9,170
len_min: 1 · len_max: 38 · len_mean: 8.68 · len_median: 7 · len_p95: 18
word_mean: 1.316 · word_median: 1 · n_empty: 0
n_duplicates: 3,086 · duplicate_rate: 0.2518 · vocab_size: 8,586
readability_flesch_mean: 40.96 · emoji_rate: 0 · url_rate: 0
one_word_rate: 0.7273 · allcaps_rate: 0 · boilerplate_rate: 0

CPPIPopulation

text feature multilingual allcaps null_rate short_text duplicates
This column holds CPPI-side population counts stored as comma-formatted strings rather than numbers, with values like ' 1,300 ' and ' 15,500 ' padded by whitespace. It is 36.74% null and heavily repetitive: 1,629 unique values across 19,375 rows with an 86.7% duplicate rate, suggesting rounded or bucketed figures. The 'multilingual' and 'allcaps' alerts are spurious artifacts of digit-only text: only 3 non-English rows exist out of 256 detected. Treatment: Strip whitespace and commas, then cast to integer; impute or flag the 36.74% nulls before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 7,119 (36.7%) · unique: 1,629
len_min: 3 · len_max: 13 · len_mean: 7.838 · len_median: 8 · len_p95: 11
word_mean: 3 · word_median: 3 · n_empty: 0
n_duplicates: 10,627 · duplicate_rate: 0.8671 · vocab_size: 1,629
readability_flesch_mean: 121.2 · emoji_rate: 0 · url_rate: 0
one_word_rate: 0 · allcaps_rate: 1 · boilerplate_rate: 0
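
The strip-and-cast step can be sketched as below, using the two example strings quoted in the description. Returning None for anything that is not purely digits after cleaning is an assumed policy, and `parse_population` is a hypothetical helper.

```python
def parse_population(raw):
    """Convert ' 1,300 ' style strings to int; None for null or non-numeric."""
    if raw is None:
        return None
    cleaned = raw.strip().replace(",", "")
    if not cleaned.isdigit():
        return None  # assumed policy: treat unparseable text as missing
    return int(cleaned)

parsed = [parse_population(s) for s in (" 1,300 ", " 15,500 ", None, "n/a")]
```

After parsing, the column can be compared directly with JPPopulation for rows that carry both values.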

CPPIROL

text feature one_word null_rate short_text duplicates
CPPIROL holds 3-character ISO 639-3 language codes (eng, spa, tel, hin, por...), with every value being exactly one word of length 3. The field is null 36.74% of the time, and 59.09% of present values repeat an earlier value across 5,014 distinct codes, with 'und' (undetermined) appearing 131 times. The distribution is long-tailed, with 'eng' and 'spa' the most frequent codes. Treatment: Treat as a categorical language-code feature; impute nulls (or map to 'und') and one-hot or target-encode the high-frequency codes. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 7,119 (36.7%) · unique: 5,014
len_min: 3 · len_max: 3 · len_mean: 3 · len_median: 3 · len_p95: 3
word_mean: 1 · word_median: 1 · n_empty: 0
n_duplicates: 7,242 · duplicate_rate: 0.5909 · vocab_size: 5,014
readability_flesch_mean: 119.5 · emoji_rate: 0 · url_rate: 0
one_word_rate: 1 · allcaps_rate: 0 · boilerplate_rate: 0

CPPIPrimaryLanguage

text feature one_word null_rate short_text duplicates
This column records a primary language label per record, with 5,014 distinct values across 19,375 rows and mostly one-word entries (one_word_rate 0.77, word_mean 1.29). Top values look canonical (English 401, Spanish 351, Telugu 214), 36.7% are null, and 59.1% of non-null rows repeat an earlier value. The unique count and duplicate count (5,014 and 7,242) are identical to those of CPPIROL, which points to a one-to-one pairing of names with language codes, that is, a controlled vocabulary rather than inconsistent free-text entry; this should be confirmed on the data. An explicit 'Undetermined' bucket (131) matches the 131 'und' codes in CPPIROL. Worth noting: 'arabic' appears 385 times in top_words but does not surface in top_values, suggesting it is split across multi-word variants (e.g. dialect-qualified forms). Treatment: Verify the mapping to CPPIROL, then map 'Undetermined' and nulls to a single missing token before encoding. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 7,119 (36.7%) · unique: 5,014
len_min: 1 · len_max: 41 · len_mean: 8.622 · len_median: 7 · len_p95: 18
word_mean: 1.288 · word_median: 1 · n_empty: 0
n_duplicates: 7,242 · duplicate_rate: 0.5909 · vocab_size: 5,098
readability_flesch_mean: 37.09 · emoji_rate: 0 · url_rate: 0
one_word_rate: 0.7674 · allcaps_rate: 0 · boilerplate_rate: 0

CPPIPrimaryReligion

categorical label null_rate
Categorical label identifying the primary religion of each record (likely a people-group or country-level row), drawn from a fixed taxonomy of 40 religious categories. The distribution is broad (entropy ratio 0.71) with no dominant class — the leading value 'Islam - Sunni' covers only 14.7% of non-null rows, followed by Non-Evangelical Protestant Christianity and two Ethnoreligion variants. Note the heavy 47.6% null rate, which will materially shrink any analysis conditioned on this field. Treatment: Impute or bucket nulls explicitly before use; consider collapsing the 40 categories into parent religions for modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 9,227 (47.6%) · unique: 40
top_value: Islam - Sunni · top_rate: 0.1473 · cardinality: 40
entropy: 3.791 · entropy_ratio: 0.7123

CPPIGSEC

numeric feature null_rate
CPPIGSEC is an integer-coded categorical with only 7 unique values ranging 0-6, almost certainly an ordinal classification or status code rather than a true numeric measure. The distribution is broad and flat (kurtosis -1.43, IQR spanning 1 to 5) with a median of 1 and mean of 2.65, indicating a concentration on the lower codes with a right tail (skew 0.52) and substantial presence across the full scale. Notably, 36.74% of rows are null, which is large enough to materially affect any downstream join or model. Treatment: Treat as categorical/ordinal and impute or add a missingness indicator before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 7,119 (36.7%) · unique: 7
min: 0 · max: 6 · mean: 2.654 · median: 1 · std: 2.058
q1: 1 · q3: 5 · iqr: 4 · skew: 0.5224 · kurtosis: -1.432
n_outliers: 0 · outlier_rate: 0 · zero_rate: 0.02635

CPPIEvangelicalEngagement

categorical feature null_rate
Binary engagement flag for a CPPI Evangelical program, taking only 'Engaged' or 'Unengaged'. 'Engaged' leads at 59.1% of non-null rows (7238 vs 5018), and entropy ratio of 0.976 shows the two classes are nearly balanced. Notably, 36.74% of rows are null, which is large enough to materially bias any downstream split. Treatment: Encode as binary and add an explicit 'missing' category before modelling. high · anthropic:claude-opus-4-7
n: 19,375 · nulls: 7,119 (36.7%) · unique: 2
top_value: Engaged · top_rate: 0.5906 · cardinality: 2
entropy: 0.9762 · entropy_ratio: 0.9762
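
Encoding the flag with an explicit missing level could look like the sketch below. It produces three indicators so that the 36.74% null rows are kept rather than dropped. The indicator names and `encode_engagement` are hypothetical.

```python
def encode_engagement(value):
    """Map 'Engaged' / 'Unengaged' / null to three 0/1 indicators."""
    if value is not None and value not in ("Engaged", "Unengaged"):
        raise ValueError(f"unexpected engagement value: {value!r}")
    return {
        "engaged": int(value == "Engaged"),
        "unengaged": int(value == "Unengaged"),
        "missing": int(value is None),
    }

encoded_rows = [encode_engagement(v) for v in ("Engaged", None, "Unengaged")]
```

Raising on unexpected values guards against silent miscoding if a third label ever appears in the source.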