extracted cppi jp cppi cross reference

source /home/coolhand/html/datavis/data_trove/joshua-project/archive/extracted_cppi/jp-cppi-cross-reference.csv · 19,375 rows · 24 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Joshua Project cross-reference of 19,375 people-group records across 240 countries, combining CPPI and JP fields covering language, religion, population, and evangelical engagement. India dominates the country distribution at 17.2% of rows, followed by Papua New Guinea, Pakistan, and Indonesia, so any global view should account for that skew. JPPrimaryReligion is weighted toward Christianity (41% of non-null rows, 36.9% of all rows), with Islam, Ethnic Religions, and Hinduism trailing, while CPPIEvangelicalEngagement is split roughly 59% Engaged vs 41% Unengaged among the non-null rows. Watch the population and evangelical-percentage fields: JPPopulation is extremely long-tailed (max ~919M, median 16,000, skew ~95) and JP%Evangelical is right-skewed with ~27% zeros and ~10% outliers. Also note the high null rates on the CPPI-prefixed columns (~37%, rising to 47.6% for CPPIPrimaryReligion) and JPLeastReached (63%), which constrain joinability.

citing: row_count · column_count · Ctry.top_values · Ctry.stats.top_rate · JPPrimaryReligion.top_values · JPPrimaryReligion.stats.top_rate · CPPIEvangelicalEngagement.top_values · CPPIEvangelicalEngagement.stats.top_rate · JPPopulation.stats · JP%Evangelical.stats · CPPIPrimaryReligion.top_values · JPLeastReached.null_rate · CPPIPopulation.null_rate

Charts the summary said to look at first

Ctry · Top countries by row count — India alone is ~17% of records, so check geographic balance before any aggregate.

Top values for Ctry (20 unique shown, of 240 total).
value	count	share
India	3330	17.2%
Papua New Guinea	919	4.7%
Pakistan	851	4.4%
Indonesia	806	4.2%
United States	655	3.4%
China	557	2.9%
Nigeria	557	2.9%
Canada	357	1.8%
Brazil	350	1.8%
Mexico	344	1.8%
Bangladesh	321	1.7%
Cameroon	304	1.6%
Nepal	297	1.5%
Congo, Democratic Republic of	240	1.2%
Sudan	222	1.1%
Australia	211	1.1%
Philippines	209	1.1%
Malaysia	196	1.0%
Russia	189	1.0%
Myanmar (Burma)	165	0.9%

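JPPrimaryReligion · Religion mix is led by Christianity (41% of non-null rows) and Islam, with smaller Ethnic Religions, Hindu, and Buddhist slices. The share column in the table is computed over all 19,375 rows, nulls included, which is why Christianity shows 36.9% there.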

Top values for JPPrimaryReligion (8 unique shown, of 8 total).
value	count	share
Christianity	7140	36.9%
Islam	4003	20.7%
Ethnic Religions	2641	13.6%
Hinduism	2396	12.4%
Buddhism	669	3.5%
Non-Religious	264	1.4%
Other / Small	128	0.7%
Unknown	26	0.1%
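
The 41% and 36.9% figures for Christianity differ only in their denominator. A minimal sketch of that arithmetic, using the counts from the table above (the helper name `shares` is illustrative, not part of the profiler):

```python
# Counts copied from the JPPrimaryReligion table; total rows from the report header.
TOTAL_ROWS = 19_375
counts = {
    "Christianity": 7140, "Islam": 4003, "Ethnic Religions": 2641,
    "Hinduism": 2396, "Buddhism": 669, "Non-Religious": 264,
    "Other / Small": 128, "Unknown": 26,
}

def shares(count, total_rows, non_null_rows):
    """Return (share of all rows, share of non-null rows) as percentages."""
    return (round(100 * count / total_rows, 1),
            round(100 * count / non_null_rows, 1))

non_null = sum(counts.values())  # rows where the religion label is populated
all_share, valid_share = shares(counts["Christianity"], TOTAL_ROWS, non_null)
print(non_null, all_share, valid_share)
```

Whichever denominator is used, it helps to state it next to the percentage, since the two differ by the 10.9% null rate.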

CPPIEvangelicalEngagement · Among non-null rows, roughly 59% are 'Engaged' vs 41% 'Unengaged', a useful headline split. The share column in the table counts nulls in the denominator, hence 37.4% and 25.9%.

Top values for CPPIEvangelicalEngagement (2 unique shown, of 2 total).
value	count	share
Engaged	7238	37.4%
Unengaged	5018	25.9%

JPPopulation · Population is extremely long-tailed (median 16K, max ~919M); plot on a log scale to see structure.

Histogram bins for JPPopulation (median: 16000.0).
bin	count
10 – 2.297e+07	17193
2.297e+07 – 4.594e+07	33
4.594e+07 – 6.891e+07	9
6.891e+07 – 9.188e+07	5
9.188e+07 – 1.149e+08	2
1.149e+08 – 1.378e+08	3
1.378e+08 – 1.608e+08	0
1.608e+08 – 1.838e+08	0
1.838e+08 – 2.067e+08	1
2.067e+08 – 2.297e+08	0
2.297e+08 – 2.527e+08	0
2.527e+08 – 2.756e+08	0
2.756e+08 – 2.986e+08	0
2.986e+08 – 3.216e+08	0
3.216e+08 – 3.446e+08	0
3.446e+08 – 3.675e+08	0
3.675e+08 – 3.905e+08	0
3.905e+08 – 4.135e+08	0
4.135e+08 – 4.364e+08	0
4.364e+08 – 4.594e+08	0
4.594e+08 – 4.824e+08	0
4.824e+08 – 5.053e+08	0
5.053e+08 – 5.283e+08	0
5.283e+08 – 5.513e+08	0
5.513e+08 – 5.743e+08	0
5.743e+08 – 5.972e+08	0
5.972e+08 – 6.202e+08	0
6.202e+08 – 6.432e+08	0
6.432e+08 – 6.661e+08	0
6.661e+08 – 6.891e+08	0
6.891e+08 – 7.121e+08	0
7.121e+08 – 7.35e+08	0
7.35e+08 – 7.58e+08	0
7.58e+08 – 7.81e+08	0
7.81e+08 – 8.04e+08	0
8.04e+08 – 8.269e+08	0
8.269e+08 – 8.499e+08	0
8.499e+08 – 8.729e+08	0
8.729e+08 – 8.958e+08	0
8.958e+08 – 9.188e+08	1
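
With linear bins, 17,193 of the 17,247 populated rows land in the first bin, so the histogram shows almost no structure. A minimal sketch of decade (log10) binning follows; the sample values are hypothetical points chosen inside the documented range, not rows from the file.

```python
import math
from collections import Counter

def log10_bin_counts(values):
    """Count values per decade: bin k holds values in [10**k, 10**(k+1))."""
    bins = Counter()
    for v in values:
        if v is None or v <= 0:
            continue  # nulls and non-positive values have no logarithm
        bins[int(math.floor(math.log10(v)))] += 1
    return dict(sorted(bins.items()))

# Hypothetical sample spanning the documented range (min 10, max ~9.188e8).
sample = [10, 3_100, 16_000, 85_000, 2_500_000, 918_811_000, None]
print(log10_bin_counts(sample))
```

On the real column, nine decade bins (10^1 through 10^8) would cover the full range from 10 to ~919M.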

JP%Evangelical · Evangelical share is right-skewed with a heavy zero/low cluster and a long tail up to 95%.

Histogram bins for JP%Evangelical (median: 1.5).
bin	count
0 – 2.375	8955
2.375 – 4.75	1481
4.75 – 7.125	1358
7.125 – 9.5	669
9.5 – 11.88	432
11.88 – 14.25	623
14.25 – 16.62	364
16.62 – 19	283
19 – 21.38	425
21.38 – 23.75	222
23.75 – 26.12	350
26.12 – 28.5	131
28.5 – 30.88	198
30.88 – 33.25	97
33.25 – 35.62	91
35.62 – 38	20
38 – 40.38	64
40.38 – 42.75	22
42.75 – 45.12	98
45.12 – 47.5	38
47.5 – 49.88	22
49.88 – 52.25	36
52.25 – 54.62	4
54.62 – 57	8
57 – 59.38	1
59.38 – 61.75	17
61.75 – 64.12	4
64.12 – 66.5	3
66.5 – 68.88	0
68.88 – 71.25	8
71.25 – 73.62	2
73.62 – 76	4
76 – 78.38	4
78.38 – 80.75	3
80.75 – 83.12	0
83.12 – 85.5	0
85.5 – 87.88	2
87.88 – 90.25	2
90.25 – 92.62	0
92.62 – 95	2

Schema

24 columns

Per-column summary.
column	type	null_rate	unique	alerts
Type	numeric	0.0%	3
PEID	numeric	36.7%	12,256	null_rate
ROP3	numeric	0.0%	11,786
PeopleID3	numeric	10.9%	10,284
ROG3	categorical	0.0%	240
Ctry	categorical	0.0%	240
JPPeopleGroup	text	10.9%	10,624	one_word duplicates
JPPopulation	numeric	11.0%	1,731	high_skew outliers
JPIndigenous	categorical	10.9%	2
JPROL3	text	10.9%	6,165	one_word short_text duplicates
JPPrimaryLanguage	text	10.9%	6,152	one_word short_text duplicates
JPRLG3	numeric	10.9%	8
JPPrimaryReligion	categorical	10.9%	8
JPScale	numeric	10.9%	5
JPLeastReached	categorical	62.9%	1	null_rate imbalance
JP%ChristianAdherent	numeric	10.9%	957
JP%Evangelical	numeric	17.2%	692	high_skew outliers
CPPIPeopleGroup	text	36.7%	9,170	one_word null_rate short_text duplicates
CPPIPopulation	text	36.7%	1,629	multilingual allcaps null_rate short_text duplicates
CPPIROL	text	36.7%	5,014	one_word null_rate short_text duplicates
CPPIPrimaryLanguage	text	36.7%	5,014	one_word null_rate short_text duplicates
CPPIPrimaryReligion	categorical	47.6%	40	null_rate
CPPIGSEC	numeric	36.7%	7	null_rate
CPPIEvangelicalEngagement	categorical	36.7%	2	null_rate

Type

numeric feature

Type is encoded numerically but takes only 3 distinct values across 19,375 rows (min 1, max 3, median 1), so it is almost certainly a categorical code rather than a true measurement. The distribution leans toward the lowest category, with mean 1.585 and Q3 of 2.0 indicating most rows are 1 or 2. No nulls or outliers are present. Treatment: Treat as a categorical variable and one-hot encode before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 0 (0.0%)
unique: 3
min: 1
max: 3
mean: 1.585
median: 1
std: 0.6785
q1: 1
q3: 2
iqr: 1
skew: 0.7351
kurtosis: -0.6013
n_outliers: 0
outlier_rate: 0
zero_rate: 0
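
The one-hot treatment suggested for Type can be sketched without any library. The levels (1, 2, 3) come from the min/max/unique stats above; what each code means is not stated in the source.

```python
def one_hot(values, levels=(1, 2, 3)):
    """One-hot encode integer codes; unknown or missing codes become all zeros."""
    return [[1 if v == level else 0 for level in levels] for v in values]

print(one_hot([1, 3, 2]))
```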

PEID

numeric identifier null_rate

PEID looks like an entity identifier (given the people-group context, more plausibly a people-group ID than a person ID): integer values from 1 to 50520 with no zeros. The 36.7% null rate is the headline concern. The 12,256 non-null rows (19,375 minus 7,119 nulls) hold 12,256 unique values, so every populated PEID is distinct and the column can serve as a unique key wherever it is present. The near-uniform spread (skew 0.33, kurtosis -1.50) is consistent with an ID rather than a measured quantity. Treatment: Treat as a key for joins; do not use as a numeric feature, and decide on a policy for the 36.7% missing IDs. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 7,119 (36.7%)
unique: 12,256
min: 1
max: 50,520
mean: 2.564e+04
median: 1.838e+04
std: 1.637e+04
q1: 1.213e+04
q3: 4.329e+04
iqr: 31,162
skew: 0.3311
kurtosis: -1.499
n_outliers: 0
outlier_rate: 0
zero_rate: 0
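
Whether an identifier repeats can be checked from the n, nulls, and unique stats alone, as long as unique is compared with the non-null row count, not the total. A small sketch (the function name `key_profile` is illustrative), applied to the PEID stats above and the PeopleID3 stats from its own section:

```python
def key_profile(n_rows, n_nulls, n_unique):
    """Summarise an ID column using its non-null rows as the denominator."""
    non_null = n_rows - n_nulls
    return {
        "non_null": non_null,
        "rows_per_id": round(non_null / n_unique, 2),
        "unique_when_present": non_null == n_unique,
    }

peid = key_profile(19_375, 7_119, 12_256)
people_id3 = key_profile(19_375, 2_108, 10_284)
print(peid)
print(people_id3)
```

By this check PEID never repeats among populated rows, while PeopleID3 does.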

ROP3

numeric identifier

ROP3 holds 6-digit integers tightly bounded between 100004 and 119498 with mean 109260.77 and median 109305, suggesting a coded identifier or category number rather than a measured quantity. The distribution is essentially flat (kurtosis -1.16, skew 0.12) with no outliers and a near-zero null rate, and 11786 unique values across 19375 rows indicates many repeats but no dominant mode visible from these stats. Treatment: Treat as a categorical code; do not feed as a continuous numeric feature. medium · anthropic:claude-opus-4-7

n: 19,375
nulls: 2 (0.0%)
unique: 11,786
min: 100,004
max: 119,498
mean: 1.093e+05
median: 109,305
std: 5558
q1: 104,126
q3: 114,057
iqr: 9,931
skew: 0.1203
kurtosis: -1.159
n_outliers: 0
outlier_rate: 0
zero_rate: 0

PeopleID3

numeric foreign_key

PeopleID3 is almost certainly an identifier: values are integers ranging from 10119 to 22498 with no zeros, no outliers, and 10284 distinct values across 19375 rows. Given the dataset, it most plausibly identifies a people group rather than an individual person. Its 2,108 nulls (10.88%) match the other JP-prefixed fields exactly, so these are probably rows with no Joshua Project match. The 17,267 non-null rows share 10,284 ids (≈1.68 rows per id), which indicates the same people group appears in several rows, presumably once per country. The roughly uniform spread (kurtosis -0.97, mild skew 0.39) is consistent with sequentially assigned ids rather than a measured quantity. Treatment: Treat as a foreign key for left-joins to a people-group table; do not use as a numeric feature. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,108 (10.9%)
unique: 10,284
min: 10,119
max: 22,498
mean: 1.524e+04
median: 14,780
std: 3413
q1: 12,247
q3: 1.814e+04
iqr: 5898
skew: 0.3911
kurtosis: -0.9664
n_outliers: 0
outlier_rate: 0
zero_rate: 0

ROG3

categorical feature

ROG3 is a two-letter country code with 240 distinct values and no nulls across 19,375 rows. Its cardinality, top_rate, and entropy match Ctry exactly, so it appears to map one-to-one to the country name. Distribution is moderately concentrated: 'IN' alone accounts for 17.2% of records, followed by 'PP', 'PK', 'ID', and 'US', with entropy ratio 0.78 indicating reasonable spread across the long tail. 'PP' is not a standard ISO country code; it lines up with Papua New Guinea, the second most common value in Ctry, and is that country's FIPS 10-4 code, so the column probably follows a FIPS-style scheme rather than ISO 3166. Verify the scheme before joining to ISO-keyed data. Treatment: Group rare codes into 'Other' and one-hot or target-encode for modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 0 (0.0%)
unique: 240
top_value: IN
top_rate: 0.1719
cardinality: 240
entropy: 6.175
entropy_ratio: 0.781

Ctry

categorical feature

Country name field with 240 distinct values across 19,375 rows and zero nulls. India dominates at 17.2% (3,330 rows), followed by Papua New Guinea, Pakistan, and Indonesia, with the long tail spread broadly (entropy ratio 0.78). Papua New Guinea's share is out of proportion to its population; since each row is a people-group record rather than a person, this probably reflects the number of distinct groups catalogued there. Treatment: Group rare countries into an 'Other' bucket and target- or frequency-encode before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 0 (0.0%)
unique: 240
top_value: India
top_rate: 0.1719
cardinality: 240
entropy: 6.175
entropy_ratio: 0.781
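
The 'Other' bucketing and frequency encoding mentioned in the treatment can be sketched as follows. The `min_count` threshold and the sample list are illustrative choices, not values from the report.

```python
from collections import Counter

def bucket_rare(values, min_count):
    """Replace values seen fewer than min_count times with 'Other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "Other" for v in values]

def frequency_encode(values):
    """Map each value to its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

# Hypothetical mini-column for illustration only.
sample = ["India"] * 3 + ["Nepal"] * 2 + ["Fiji"]
bucketed = bucket_rare(sample, min_count=2)
print(bucketed)
print(frequency_encode(bucketed))
```

Bucketing before encoding keeps the 240-country tail from producing hundreds of near-empty features.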

JPPeopleGroup

text feature one_word duplicates

This column holds Joshua Project-style people group labels — short ethnonyms or community descriptors like 'Deaf', 'British', 'French', and qualified forms such as 'Arab, (Muslim traditions)'. Values are short (mean 11.4 chars, median 1 word) and 54% are single-word, but with 10,624 uniques across 19,375 rows and a 38.5% duplicate rate, plus 10.9% nulls, the field is high-cardinality categorical with a long tail. The prevalence of parenthetical religious qualifiers ('(hindu', '(muslim', 'traditions)') in top_words suggests an embedded taxonomy that would need parsing to separate ethnicity from religious tradition. Treatment: Normalize casing and split parenthetical religious qualifiers into a separate column before encoding as a high-cardinality category. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,108 (10.9%)
unique: 10,624
len_min: 1
len_max: 41
len_mean: 11.35
len_median: 9
len_p95: 25
word_mean: 1.618
word_median: 1
n_empty: 0
n_duplicates: 6,643
duplicate_rate: 0.3847
vocab_size: 11,428
readability_flesch_mean: 39.84
emoji_rate: 0
url_rate: 0
one_word_rate: 0.5414
allcaps_rate: 0
boilerplate_rate: 0
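
Splitting the parenthetical qualifier into its own column could look like the sketch below. It assumes at most one trailing parenthetical per label, as in 'Arab, (Muslim traditions)'; labels that follow other patterns would need their own rules.

```python
import re

# Base label, optional separator (spaces/commas), then one trailing "(...)".
_QUALIFIER = re.compile(r"^(?P<base>.*?)[\s,]*\((?P<qual>[^)]*)\)\s*$")

def split_qualifier(label):
    """Return (base_label, qualifier); qualifier is None when absent."""
    if label is None:
        return (None, None)
    text = label.strip()
    match = _QUALIFIER.match(text)
    if not match:
        return (text, None)
    return (match.group("base").strip(), match.group("qual").strip())

print(split_qualifier("Arab, (Muslim traditions)"))
print(split_qualifier("Deaf"))
```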

JPPopulation

numeric feature high_skew outliers

This appears to be a Joshua Project (JP) population estimate per people-group record, with values spanning from 10 to 918,811,000 and a median of just 16,000. The distribution is extraordinarily right-skewed (skew 94.9, kurtosis 10,840): the mean of 468,378 sits far above the Q3 of 85,000, and 2,609 rows (15.1% of non-null rows) flag as outliers. Roughly 11% of rows are null, and only 1,731 unique values across 19,375 rows suggests heavy repetition, probably from rounded estimates. Treatment: log-transform and impute nulls before modelling; investigate the ~919M max as a possible unit error. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,128 (11.0%)
unique: 1,731
min: 10
max: 9.188e+08
mean: 4.684e+05
median: 16,000
std: 7.861e+06
q1: 3,100
q3: 85,000
iqr: 81,900
skew: 94.92
kurtosis: 1.084e+04
n_outliers: 2,609
outlier_rate: 0.1513
zero_rate: 0
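
One way to apply the log-transform-and-impute treatment while keeping the missingness visible is sketched below. Median imputation is an assumption here; the report does not prescribe an imputation method.

```python
import math
from statistics import median

def impute_and_log(values):
    """Return (log1p(value), was_missing) pairs, filling nulls with the median."""
    present = [v for v in values if v is not None]
    fill = median(present)  # assumed imputation choice, robust to the long tail
    rows = []
    for v in values:
        missing = v is None
        rows.append((math.log1p(fill if missing else v), int(missing)))
    return rows

rows = impute_and_log([10, None, 16_000])
print(rows)
```

Keeping the `was_missing` flag lets a model distinguish imputed rows from observed ones.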

JPIndigenous

categorical feature

Binary Y/N flag, almost certainly indicating whether the record's people group is classified as indigenous (the JP prefix marks Joshua Project fields). 'Y' dominates at 71.2% of non-null rows (12,293 vs 4,974 'N'), and 10.88% of rows are null. An entropy ratio of 0.87 indicates a moderate imbalance between the two classes once nulls are excluded. Treatment: Encode as boolean and decide explicitly whether to impute or flag the ~11% nulls before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,108 (10.9%)
unique: 2
top_value: Y
top_rate: 0.7119
cardinality: 2
entropy: 0.8662
entropy_ratio: 0.8662
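
The entropy ratio can be reproduced from the class counts. This sketch assumes the profiler uses Shannon entropy in bits divided by log2 of the number of classes, which is consistent with the reported values but is not stated in the source.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of classes)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

ratio = entropy_ratio([12_293, 4_974])  # 'Y' vs 'N' counts for JPIndigenous
print(round(ratio, 4))
```

A ratio of 1.0 would mean a 50/50 split; a 71/29 split still scores fairly high, so the ratio alone understates the imbalance.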

JPROL3

text feature one_word short_text duplicates

JPROL3 holds three-letter ISO 639-3 language codes (hin, eng, ben, spa, tam...), with every value being a single token of length 3. The field is 10.88% null and heavily duplicated (64.3% duplicate rate across 6,165 unique values), with Hindi (700) and English (537) leading. vocab_size is 6,164, one fewer than the unique count, and allcaps_rate is non-zero, which suggests at least one code appears in two casings; lower-case the column before counting. A cardinality above 6,000 is high for a categorical feature but plausible here, since ISO 639-3 defines more than 7,000 codes. Treatment: Treat as a categorical language-code feature; one-hot or target-encode top codes and bucket the long tail. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,108 (10.9%)
unique: 6,165
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 11,102
duplicate_rate: 0.643
vocab_size: 6,164
readability_flesch_mean: 115.3
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.0003475
boilerplate_rate: 0
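
Because allcaps_rate is non-zero, case-folding the codes before counting avoids treating two casings of one code as distinct. A minimal sketch that also rejects anything that is not three ASCII letters (it checks only the shape, not membership in the ISO 639-3 registry):

```python
import re

_CODE = re.compile(r"^[a-z]{3}$")

def normalize_code(raw):
    """Lower-case and trim a language code; return None if it is not 3 letters."""
    if raw is None:
        return None
    code = raw.strip().lower()
    return code if _CODE.match(code) else None

codes = ["hin", "HIN", " eng ", "en", None]
print([normalize_code(c) for c in codes])
```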

JPPrimaryLanguage

text feature one_word short_text duplicates

Categorical free-text capturing each subject's primary language, dominated by single-word entries (one_word_rate 0.7335) with a mean of 1.32 words and median length 7. Despite only 6152 unique values across 19375 rows, the long tail is heavy: top value 'Hindi' appears just 700 times and 64.37% of rows are duplicates, while 10.88% are null. Surprising: vocab_size (6236) exceeds n_unique (6152), implying compound entries like 'Chinese, Mandarin' and 'Punjabi,' that split into multiple tokens and will fragment any naive value_counts. Treatment: Normalise casing and split comma-separated entries before encoding as a categorical or multi-label feature. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,108 (10.9%)
unique: 6,152
len_min: 1
len_max: 41
len_mean: 9.088
len_median: 7
len_p95: 18
word_mean: 1.319
word_median: 1
n_empty: 0
n_duplicates: 11,115
duplicate_rate: 0.6437
vocab_size: 6,236
readability_flesch_mean: 31.33
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7335
allcaps_rate: 0
boilerplate_rate: 0

JPRLG3

numeric feature

JPRLG3 is a low-cardinality integer code taking 8 distinct values between 1 and 9. Its cardinality and null count (8 values, 2,108 nulls) match JPPrimaryReligion exactly, so it is most likely the numeric code for that religion label rather than a rating. If so, the codes are nominal: the mean (3.37), median (4.0), skew (0.096), and kurtosis (-1.60) describe code frequencies and carry no ordinal meaning. About 10.88% of rows are null, which is the main quirk to address. Treatment: Treat as nominal categorical (8 levels), check the one-to-one mapping against JPPrimaryReligion, and impute or flag the 10.88% missing before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,108 (10.9%)
unique: 8
min: 1
max: 9
mean: 3.367
median: 4
std: 2.199
q1: 1
q3: 6
iqr: 5
skew: 0.09636
kurtosis: -1.595
n_outliers: 0
outlier_rate: 0
zero_rate: 0

JPPrimaryReligion

categorical feature

Categorical label assigning a primary religion to each record (likely a people group or country profile from a Joshua Project-style source). Eight categories cover 19,375 rows with ~10.9% missing; Christianity leads at 41.4% of non-null rows (7,140), followed by Islam (4,003) and Ethnic Religions (2,641), with entropy_ratio 0.72 indicating a moderately balanced distribution rather than a single dominant class. Note the explicit 'Unknown' bucket (26) co-existing with a 10.88% null rate, suggesting two distinct flavors of missingness. Treatment: One-hot or target-encode; reconcile the explicit 'Unknown' category with the 10.88% nulls before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,108 (10.9%)
unique: 8
top_value: Christianity
top_rate: 0.4135
cardinality: 8
entropy: 2.166
entropy_ratio: 0.722

JPScale

numeric feature

JPScale is an integer-coded ordinal feature with only 5 distinct values spanning 1 to 5, a mean of 2.71 and median of 3, suggesting a Likert-style or severity rating. The distribution is broad and flat (kurtosis -1.64) rather than peaked, with no outliers and a 10.88% null rate that should be handled explicitly. Treatment: Treat as ordinal categorical (1-5) and impute or flag the ~11% missing values before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,108 (10.9%)
unique: 5
min: 1
max: 5
mean: 2.71
median: 3
std: 1.623
q1: 1
q3: 4
iqr: 3
skew: 0.159
kurtosis: -1.637
n_outliers: 0
outlier_rate: 0
zero_rate: 0

JPLeastReached

categorical metadata null_rate imbalance

JPLeastReached holds a single non-null value 'Y' across all 7,186 populated rows, with the remaining 62.91% of records null. With cardinality 1 and entropy 0, the field carries no discriminative information and effectively functions as a presence flag rather than a category. Treatment: Drop, or recode as a binary is-present indicator if the null pattern is meaningful. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 12,189 (62.9%)
unique: 1
top_value: Y
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

JP%ChristianAdherent

numeric feature

Numeric share (0–100) of a population identified as Christian adherents, judging by the name and full 0–100 range with mean 37.58 and median 17.0. The distribution is starkly bimodal in feel: q1 sits at 0.02 while q3 hits 80.0, with 23.7% exact zeros and kurtosis of -1.60, so values cluster at the extremes rather than the middle. Null rate is 10.88%, which is non-trivial and should be handled explicitly. Treatment: Treat as a bounded percentage; impute or flag the 10.88% nulls and consider a zero-inflation indicator before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 2,108 (10.9%)
unique: 957
min: 0
max: 100
mean: 37.58
median: 17
std: 39.33
q1: 0.02
q3: 80
iqr: 79.98
skew: 0.3824
kurtosis: -1.602
n_outliers: 0
outlier_rate: 0
zero_rate: 0.2371

JP%Evangelical

numeric feature high_skew outliers

This column appears to capture the percentage of Evangelical adherents for some geographic or demographic unit (the JP prefix suggests a JoshuaProject-style indicator). The distribution is heavily right-skewed (skew 2.53, kurtosis 8.38) with a median of just 1.5 and Q3 of 8.0, yet a max of 95.0, and 27% of values are exactly zero. About 17.2% are null and 9.7% flag as outliers, so the long tail of high-Evangelical units is both real and sparse. Treatment: Impute or flag nulls, then apply a log1p or similar transform before modelling to tame the skew. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 3,332 (17.2%)
unique: 692
min: 0
max: 95
mean: 6.279
median: 1.5
std: 10.18
q1: 0
q3: 8
iqr: 8
skew: 2.535
kurtosis: 8.378
n_outliers: 1,561
outlier_rate: 0.0973
zero_rate: 0.27
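
The outlier count is easier to interpret with the threshold in hand. Assuming the profiler uses the common 1.5 × IQR rule (not stated in the source), the fences follow from q1 and q3:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)

low, high = tukey_fences(0, 8)  # q1 and q3 from the stats above
print(low, high)
```

Under that assumption, any group with more than 20% Evangelical would be flagged, so the "outliers" here are valid observations in the long tail, not data errors.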

CPPIPeopleGroup

text feature one_word null_rate short_text duplicates

This column appears to hold ethnolinguistic or people-group labels from the CPPI side of the cross-reference (the source does not expand the acronym), with values like 'Han Chinese', 'British', 'Russian', and 'Korean' dominating. It is short and categorical-like (72.7% are single words and mean length is 8.7 characters), but with 9,170 unique values across 19,375 rows it is high-cardinality, and 36.7% are null. Duplication is also notable (25.2%), and the long tail includes compound entries like 'Han Chinese, Mandarin', suggesting inconsistent delimiting. Treatment: Normalize delimiters and treat as a high-cardinality categorical; group rare levels or target-encode before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 7,119 (36.7%)
unique: 9,170
len_min: 1
len_max: 38
len_mean: 8.68
len_median: 7
len_p95: 18
word_mean: 1.316
word_median: 1
n_empty: 0
n_duplicates: 3,086
duplicate_rate: 0.2518
vocab_size: 8,586
readability_flesch_mean: 40.96
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7273
allcaps_rate: 0
boilerplate_rate: 0

CPPIPopulation

text feature multilingual allcaps null_rate short_text duplicates

This column holds CPPI population counts stored as comma-formatted strings rather than numbers, with values like ' 1,300 ' and ' 15,500 ' padded by whitespace. It is 36.74% null and heavily repetitive: 1,629 unique values across 19,375 rows with an 86.7% duplicate rate, suggesting rounded/bucketed figures. The 'multilingual' and 'allcaps' alerts are spurious artifacts of digit-only text: only 3 non-English rows exist out of 256 detected. Treatment: Strip whitespace and commas, then cast to integer; impute or flag the 36.74% nulls before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 7,119 (36.7%)
unique: 1,629
len_min: 3
len_max: 13
len_mean: 7.838
len_median: 8
len_p95: 11
word_mean: 3
word_median: 3
n_empty: 0
n_duplicates: 10,627
duplicate_rate: 0.8671
vocab_size: 1,629
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 0
allcaps_rate: 1
boilerplate_rate: 0
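
The strip-and-cast treatment can be sketched as below. Returning None for anything that is not purely digits after cleaning is a conservative choice made for this example; the report does not say whether such values occur.

```python
def parse_population(raw):
    """Convert a padded, comma-formatted string such as ' 1,300 ' to an int."""
    if raw is None:
        return None
    cleaned = raw.strip().replace(",", "")
    return int(cleaned) if cleaned.isdigit() else None

print(parse_population(" 1,300 "), parse_population(" 15,500 "))
```

Once parsed, the column can be compared directly with JPPopulation for rows where both are populated.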

CPPIROL

text feature one_word null_rate short_text duplicates

CPPIROL holds 3-character ISO 639-3 language codes (eng, spa, tel, hin, por...), with every value being exactly one word of length 3. The field is null 36.74% of the time and 59.09% of present values are duplicates across 5,014 distinct codes, with 'und' (undetermined) appearing 131 times. The distribution is long-tailed, with English and Spanish the most frequent codes. Treatment: Treat as a categorical language-code feature; impute nulls (or map to 'und') and one-hot or target-encode the high-frequency codes. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 7,119 (36.7%)
unique: 5,014
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 7,242
duplicate_rate: 0.5909
vocab_size: 5,014
readability_flesch_mean: 119.5
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

CPPIPrimaryLanguage

text feature one_word null_rate short_text duplicates

This column records a primary language label per record, with 5,014 distinct values across 19,375 rows, mostly one-word entries (one_word_rate 0.77, word_mean 1.29). Top values look canonical (English 401, Spanish 351, Telugu 214), but 36.7% are null and 59.1% duplicate. The unique count (5,014) and n_duplicates (7,242) match CPPIROL exactly, which points to a one-to-one pairing between language names and ISO 639-3 codes, that is, a controlled vocabulary rather than free text. An explicit 'Undetermined' bucket (131, the same count as 'und' in CPPIROL) coexists with the nulls. Worth noting: 'arabic' appears 385 times in top_words but does not surface in top_values, suggesting it is split across multi-word variants (e.g. dialect-qualified forms). Treatment: Verify the name-to-code pairing against CPPIROL and map 'Undetermined'/nulls to a single missing token before encoding. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 7,119 (36.7%)
unique: 5,014
len_min: 1
len_max: 41
len_mean: 8.622
len_median: 7
len_p95: 18
word_mean: 1.288
word_median: 1
n_empty: 0
n_duplicates: 7,242
duplicate_rate: 0.5909
vocab_size: 5,098
readability_flesch_mean: 37.09
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7674
allcaps_rate: 0
boilerplate_rate: 0

CPPIPrimaryReligion

categorical label null_rate

Categorical label identifying the primary religion of each record (likely a people-group or country-level row), drawn from a fixed taxonomy of 40 religious categories. The distribution is broad (entropy ratio 0.71) with no dominant class — the leading value 'Islam - Sunni' covers only 14.7% of non-null rows, followed by Non-Evangelical Protestant Christianity and two Ethnoreligion variants. Note the heavy 47.6% null rate, which will materially shrink any analysis conditioned on this field. Treatment: Impute or bucket nulls explicitly before use; consider collapsing the 40 categories into parent religions for modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 9,227 (47.6%)
unique: 40
top_value: Islam - Sunni
top_rate: 0.1473
cardinality: 40
entropy: 3.791
entropy_ratio: 0.7123
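
Collapsing the 40 categories into parent religions could start from the label text itself. The sketch assumes a 'Parent - Branch' naming convention, inferred only from the top value 'Islam - Sunni'; the other 39 labels are not listed in this report and may need a hand-written mapping.

```python
def parent_religion(label, sep=" - "):
    """Return the text before the first ' - ' separator, or 'Missing' for nulls."""
    if label is None:
        return "Missing"
    return label.split(sep, 1)[0].strip()

print(parent_religion("Islam - Sunni"))
print(parent_religion("Non-Evangelical Protestant Christianity"))
```

Using the spaced separator leaves hyphenated words such as 'Non-Evangelical' intact.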

CPPIGSEC

numeric feature null_rate

CPPIGSEC is an integer-coded categorical with only 7 unique values ranging 0-6, almost certainly an ordinal classification or status code rather than a true numeric measure. The distribution is broad and flat (kurtosis -1.43, IQR spanning 1 to 5) with a median of 1 and mean of 2.65: at least half of the populated rows sit at code 0 or 1, while a substantial share reaches the upper codes (positive skew 0.52). Notably, 36.74% of rows are null, which is large enough to materially affect any downstream join or model. Treatment: Treat as categorical/ordinal and impute or add a missingness indicator before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 7,119 (36.7%)
unique: 7
min: 0
max: 6
mean: 2.654
median: 1
std: 2.058
q1: 1
q3: 5
iqr: 4
skew: 0.5224
kurtosis: -1.432
n_outliers: 0
outlier_rate: 0
zero_rate: 0.02635

CPPIEvangelicalEngagement

categorical feature null_rate

Binary engagement flag from the CPPI data, taking only 'Engaged' or 'Unengaged'. 'Engaged' leads at 59.1% of non-null rows (7238 vs 5018), and entropy ratio of 0.976 shows the two classes are nearly balanced. Notably, 36.74% of rows are null, which is large enough to materially bias any downstream split. Treatment: Encode as a binary flag plus a separate missingness indicator (or as a three-level category with an explicit 'missing' level) before modelling. high · anthropic:claude-opus-4-7

n: 19,375
nulls: 7,119 (36.7%)
unique: 2
top_value: Engaged
top_rate: 0.5906
cardinality: 2
entropy: 0.9762
entropy_ratio: 0.9762
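
With 36.7% nulls, a plain binary encoding would silently count missing rows as 'Unengaged'. A sketch of a two-column encoding that keeps missingness separate (the tuple layout is an illustrative choice):

```python
def encode_engagement(value):
    """Return (engaged, missing): engaged is 1/0, missing is 1 for nulls."""
    if value is None:
        return (0, 1)
    return (1 if value == "Engaged" else 0, 0)

print([encode_engagement(v) for v in ["Engaged", "Unengaged", None]])
```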