extracted cppi jp cppi cross reference
Reading
This dataset is a Joshua Project cross-reference of 19,375 people-group rows across 240 countries, combining CPPI and JP fields covering language, religion, population, and evangelical engagement. India dominates the country distribution at 17.2% of rows, followed by Papua New Guinea, Pakistan, and Indonesia, so any global view should account for that skew. JPPrimaryReligion is weighted toward Christianity (41% of non-null rows, 36.9% of all rows), with Islam, Ethnic Religions, and Hinduism trailing, while CPPIEvangelicalEngagement splits roughly 59% Engaged vs 41% Unengaged among the non-null rows. Watch the population and evangelical-percentage fields: JPPopulation is extremely long-tailed (max ~919M, median 16,000, skew ~95) and JP%Evangelical is right-skewed with ~27% zeros and ~10% outliers. Also note the high null rates on the CPPI-prefixed columns (~37%, rising to 47.6% for CPPIPrimaryReligion) and on JPLeastReached (63%), which constrain joinability.
citing: row_count · column_count · Ctry.top_values · Ctry.stats.top_rate · JPPrimaryReligion.top_values · JPPrimaryReligion.stats.top_rate · CPPIEvangelicalEngagement.top_values · CPPIEvangelicalEngagement.stats.top_rate · JPPopulation.stats · JP%Evangelical.stats · CPPIPrimaryReligion.top_values · JPLeastReached.null_rate · CPPIPopulation.null_rate
Charts the summary said to look at first
Ctry: top 20 values (share of all 19,375 rows)
| value | count | share |
|---|---|---|
| India | 3330 | 17.2% |
| Papua New Guinea | 919 | 4.7% |
| Pakistan | 851 | 4.4% |
| Indonesia | 806 | 4.2% |
| United States | 655 | 3.4% |
| China | 557 | 2.9% |
| Nigeria | 557 | 2.9% |
| Canada | 357 | 1.8% |
| Brazil | 350 | 1.8% |
| Mexico | 344 | 1.8% |
| Bangladesh | 321 | 1.7% |
| Cameroon | 304 | 1.6% |
| Nepal | 297 | 1.5% |
| Congo, Democratic Republic of | 240 | 1.2% |
| Sudan | 222 | 1.1% |
| Australia | 211 | 1.1% |
| Philippines | 209 | 1.1% |
| Malaysia | 196 | 1.0% |
| Russia | 189 | 1.0% |
| Myanmar (Burma) | 165 | 0.9% |
JPPrimaryReligion: values (share of all 19,375 rows; 10.9% of rows are null)
| value | count | share |
|---|---|---|
| Christianity | 7140 | 36.9% |
| Islam | 4003 | 20.7% |
| Ethnic Religions | 2641 | 13.6% |
| Hinduism | 2396 | 12.4% |
| Buddhism | 669 | 3.5% |
| Non-Religious | 264 | 1.4% |
| Other / Small | 128 | 0.7% |
| Unknown | 26 | 0.1% |
CPPIEvangelicalEngagement: values (share of all 19,375 rows; 36.7% of rows are null)
| value | count | share |
|---|---|---|
| Engaged | 7238 | 37.4% |
| Unengaged | 5018 | 25.9% |
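The shares in these tables are computed over all 19,375 rows, whereas the summary's 41% and 59% figures are computed over non-null rows only. The sketch below reproduces both denominators from the counts in this report; the helper name `share` is illustrative and not part of the profiling tool.

```python
def share(count: int, denominator: int) -> float:
    """Return count as a percentage of denominator, rounded to one decimal."""
    return round(100 * count / denominator, 1)

TOTAL_ROWS = 19_375            # row_count from the report
JP_RELIGION_NULLS = 2_108      # nulls in JPPrimaryReligion
CPPI_ENGAGEMENT_NULLS = 7_119  # nulls in CPPIEvangelicalEngagement

# Same count, two denominators: all rows vs. populated rows only.
christianity_all = share(7_140, TOTAL_ROWS)
christianity_non_null = share(7_140, TOTAL_ROWS - JP_RELIGION_NULLS)
engaged_all = share(7_238, TOTAL_ROWS)
engaged_non_null = share(7_238, TOTAL_ROWS - CPPI_ENGAGEMENT_NULLS)
```

Stating the denominator next to every percentage avoids the apparent mismatch between the table (36.9%) and the prose (41%).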
JPPopulation: histogram (40 equal-width bins)
| bin | count |
|---|---|
| 10 – 2.297e+07 | 17193 |
| 2.297e+07 – 4.594e+07 | 33 |
| 4.594e+07 – 6.891e+07 | 9 |
| 6.891e+07 – 9.188e+07 | 5 |
| 9.188e+07 – 1.149e+08 | 2 |
| 1.149e+08 – 1.378e+08 | 3 |
| 1.378e+08 – 1.608e+08 | 0 |
| 1.608e+08 – 1.838e+08 | 0 |
| 1.838e+08 – 2.067e+08 | 1 |
| 2.067e+08 – 2.297e+08 | 0 |
| 2.297e+08 – 2.527e+08 | 0 |
| 2.527e+08 – 2.756e+08 | 0 |
| 2.756e+08 – 2.986e+08 | 0 |
| 2.986e+08 – 3.216e+08 | 0 |
| 3.216e+08 – 3.446e+08 | 0 |
| 3.446e+08 – 3.675e+08 | 0 |
| 3.675e+08 – 3.905e+08 | 0 |
| 3.905e+08 – 4.135e+08 | 0 |
| 4.135e+08 – 4.364e+08 | 0 |
| 4.364e+08 – 4.594e+08 | 0 |
| 4.594e+08 – 4.824e+08 | 0 |
| 4.824e+08 – 5.053e+08 | 0 |
| 5.053e+08 – 5.283e+08 | 0 |
| 5.283e+08 – 5.513e+08 | 0 |
| 5.513e+08 – 5.743e+08 | 0 |
| 5.743e+08 – 5.972e+08 | 0 |
| 5.972e+08 – 6.202e+08 | 0 |
| 6.202e+08 – 6.432e+08 | 0 |
| 6.432e+08 – 6.661e+08 | 0 |
| 6.661e+08 – 6.891e+08 | 0 |
| 6.891e+08 – 7.121e+08 | 0 |
| 7.121e+08 – 7.35e+08 | 0 |
| 7.35e+08 – 7.58e+08 | 0 |
| 7.58e+08 – 7.81e+08 | 0 |
| 7.81e+08 – 8.04e+08 | 0 |
| 8.04e+08 – 8.269e+08 | 0 |
| 8.269e+08 – 8.499e+08 | 0 |
| 8.499e+08 – 8.729e+08 | 0 |
| 8.729e+08 – 8.958e+08 | 0 |
| 8.958e+08 – 9.188e+08 | 1 |
JP%Evangelical: histogram (40 equal-width bins)
| bin | count |
|---|---|
| 0 – 2.375 | 8955 |
| 2.375 – 4.75 | 1481 |
| 4.75 – 7.125 | 1358 |
| 7.125 – 9.5 | 669 |
| 9.5 – 11.88 | 432 |
| 11.88 – 14.25 | 623 |
| 14.25 – 16.62 | 364 |
| 16.62 – 19 | 283 |
| 19 – 21.38 | 425 |
| 21.38 – 23.75 | 222 |
| 23.75 – 26.12 | 350 |
| 26.12 – 28.5 | 131 |
| 28.5 – 30.88 | 198 |
| 30.88 – 33.25 | 97 |
| 33.25 – 35.62 | 91 |
| 35.62 – 38 | 20 |
| 38 – 40.38 | 64 |
| 40.38 – 42.75 | 22 |
| 42.75 – 45.12 | 98 |
| 45.12 – 47.5 | 38 |
| 47.5 – 49.88 | 22 |
| 49.88 – 52.25 | 36 |
| 52.25 – 54.62 | 4 |
| 54.62 – 57 | 8 |
| 57 – 59.38 | 1 |
| 59.38 – 61.75 | 17 |
| 61.75 – 64.12 | 4 |
| 64.12 – 66.5 | 3 |
| 66.5 – 68.88 | 0 |
| 68.88 – 71.25 | 8 |
| 71.25 – 73.62 | 2 |
| 73.62 – 76 | 4 |
| 76 – 78.38 | 4 |
| 78.38 – 80.75 | 3 |
| 80.75 – 83.12 | 0 |
| 83.12 – 85.5 | 0 |
| 85.5 – 87.88 | 2 |
| 87.88 – 90.25 | 2 |
| 90.25 – 92.62 | 0 |
| 92.62 – 95 | 2 |
Schema
24 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| Type | numeric | 0.0% | 3 | |
| PEID | numeric | 36.7% | 12,256 | null_rate |
| ROP3 | numeric | 0.0% | 11,786 | |
| PeopleID3 | numeric | 10.9% | 10,284 | |
| ROG3 | categorical | 0.0% | 240 | |
| Ctry | categorical | 0.0% | 240 | |
| JPPeopleGroup | text | 10.9% | 10,624 | one_word, duplicates |
| JPPopulation | numeric | 11.0% | 1,731 | high_skew, outliers |
| JPIndigenous | categorical | 10.9% | 2 | |
| JPROL3 | text | 10.9% | 6,165 | one_word, short_text, duplicates |
| JPPrimaryLanguage | text | 10.9% | 6,152 | one_word, short_text, duplicates |
| JPRLG3 | numeric | 10.9% | 8 | |
| JPPrimaryReligion | categorical | 10.9% | 8 | |
| JPScale | numeric | 10.9% | 5 | |
| JPLeastReached | categorical | 62.9% | 1 | null_rate, imbalance |
| JP%ChristianAdherent | numeric | 10.9% | 957 | |
| JP%Evangelical | numeric | 17.2% | 692 | high_skew, outliers |
| CPPIPeopleGroup | text | 36.7% | 9,170 | one_word, null_rate, short_text, duplicates |
| CPPIPopulation | text | 36.7% | 1,629 | multilingual, allcaps, null_rate, short_text, duplicates |
| CPPIROL | text | 36.7% | 5,014 | one_word, null_rate, short_text, duplicates |
| CPPIPrimaryLanguage | text | 36.7% | 5,014 | one_word, null_rate, short_text, duplicates |
| CPPIPrimaryReligion | categorical | 47.6% | 40 | null_rate |
| CPPIGSEC | numeric | 36.7% | 7 | null_rate |
| CPPIEvangelicalEngagement | categorical | 36.7% | 2 | null_rate |
Type
numeric · feature
Type is encoded numerically but takes only 3 distinct values across 19,375 rows (min 1, max 3), so it is almost certainly a categorical code rather than a true measurement. The distribution leans toward the lowest category: the median is 1 and Q3 is 2, so at least half of rows are 1 and at least three quarters are 1 or 2 (mean 1.585). No nulls or outliers are present.
Treatment: Treat as a categorical variable and one-hot encode before modelling.
- n: 19,375
- nulls: 0 (0.0%)
- unique: 3
- min: 1
- max: 3
- mean: 1.585
- median: 1
- std: 0.6785
- q1: 1
- q3: 2
- iqr: 1
- skew: 0.7351
- kurtosis: -0.6013
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
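A minimal sketch of one-hot encoding a three-level integer code such as Type. It assumes the three distinct values are the integers 1, 2, and 3 (implied by min 1, max 3, 3 unique values, but not listed explicitly in the report); the function and output column names are illustrative.

```python
def one_hot(code, levels=(1, 2, 3)):
    """Map an integer category code to a dict of 0/1 indicator columns."""
    if code not in levels:
        # Fail loudly instead of silently producing an all-zero row.
        raise ValueError(f"unexpected Type code: {code!r}")
    return {f"Type_{level}": int(code == level) for level in levels}

encoded = one_hot(2)
```

Raising on an unknown code is a design choice: it surfaces new category values early rather than hiding them in the encoded matrix.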
PEID
numeric · identifier · null_rate
PEID looks like an entity identifier: integer values from 1 to 50,520 with no zeros. Its 7,119 nulls (36.7%) match the CPPI-prefixed columns exactly, so it is most plausibly the CPPI-side record ID. The 12,256 unique values equal the number of non-null rows (19,375 - 7,119), so every populated PEID appears exactly once and does not repeat across rows. The flat spread (skew 0.33, kurtosis -1.50) is consistent with an ID rather than a measured quantity.
Treatment: Treat as a key for joins, not as a numeric feature, and decide on a policy for the 36.7% of rows with no PEID.
- n: 19,375
- nulls: 7,119 (36.7%)
- unique: 12,256
- min: 1
- max: 50,520
- mean: 2.564e+04
- median: 1.838e+04
- std: 1.637e+04
- q1: 1.213e+04
- q3: 4.329e+04
- iqr: 31,162
- skew: 0.3311
- kurtosis: -1.499
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
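Whether an ID column is a unique key depends on comparing its distinct count with its populated row count, not with the total row count. Here 19,375 - 7,119 = 12,256 populated rows against 12,256 unique values. The sketch below shows that check on a toy list; `key_profile` and the sample IDs are illustrative, not taken from the dataset.

```python
from collections import Counter

def key_profile(values):
    """Summarise an ID column: nulls, distinct IDs, and whether any ID repeats."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "nulls": len(values) - len(present),
        "unique": len(counts),
        "repeated_ids": sum(1 for c in counts.values() if c > 1),
        # Unique key: every populated row carries a different ID.
        "is_unique_key": len(counts) == len(present),
    }

sample = [101, None, 205, 317, None]  # toy data
profile = key_profile(sample)
```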
ROP3
numeric · identifier
ROP3 holds 6-digit integers tightly bounded between 100,004 and 119,498 (mean 109,260.77, median 109,305), which points to a coded identifier rather than a measured quantity. The distribution is essentially flat (kurtosis -1.16, skew 0.12) with no outliers and only 2 nulls. With 11,786 unique values across 19,375 rows, each code appears about 1.6 times on average; these statistics do not show whether any single code dominates.
Treatment: Treat as a categorical code; do not feed as a continuous numeric feature.
- n: 19,375
- nulls: 2 (0.0%)
- unique: 11,786
- min: 100,004
- max: 119,498
- mean: 1.093e+05
- median: 109,305
- std: 5558
- q1: 104,126
- q3: 114,057
- iqr: 9,931
- skew: 0.1203
- kurtosis: -1.159
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
PeopleID3
numeric · foreign_key
PeopleID3 is almost certainly an identifier: integers from 10,119 to 22,498 with no zeros, no outliers, and 10,284 distinct values. Its 2,108 nulls (10.88%) match most JP-prefixed columns, so it most plausibly identifies the Joshua Project people group, with a null meaning the row has no JP match. The 17,267 populated rows average about 1.68 rows per ID, which is consistent with the same people group being listed once per country. The roughly flat spread (kurtosis -0.97, mild skew 0.39) fits assigned IDs rather than a measured quantity.
Treatment: Treat as a foreign key for left-joins to a people-group table; do not use as a numeric feature.
- n: 19,375
- nulls: 2,108 (10.9%)
- unique: 10,284
- min: 10,119
- max: 22,498
- mean: 1.524e+04
- median: 14,780
- std: 3413
- q1: 12,247
- q3: 1.814e+04
- iqr: 5898
- skew: 0.3911
- kurtosis: -0.9664
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
ROG3
categorical · feature
ROG3 is a two-letter country code with 240 distinct values and no nulls across 19,375 rows. Distribution is moderately concentrated: 'IN' alone accounts for 17.2% of records, followed by 'PP', 'PK', 'ID', and 'US', and an entropy ratio of 0.78 indicates reasonable spread across the long tail. 'PP' is not an ISO 3166-1 code; it is the FIPS 10-4 code for Papua New Guinea (ISO uses 'PG'), which matches Papua New Guinea ranking second in Ctry. The codes therefore appear to follow a FIPS-style scheme rather than ISO, so verify the scheme before joining to ISO-keyed tables. ROG3 also shares its cardinality, top rate, and entropy with Ctry, so the two columns look redundant.
Treatment: Keep one of ROG3 or Ctry; group rare codes into 'Other' and one-hot or target-encode for modelling.
- n: 19,375
- nulls: 0 (0.0%)
- unique: 240
- top_value: IN
- top_rate: 0.1719
- cardinality: 240
- entropy: 6.175
- entropy_ratio: 0.781
Ctry
categorical · feature
Country name field with 240 distinct values across 19,375 rows and zero nulls. India dominates at 17.2% (3,330 rows), followed by Papua New Guinea, Pakistan, and Indonesia, with the long tail spread broadly (entropy ratio 0.78). Papua New Guinea's share is high relative to its population, most likely because each row is a people group rather than a person, so countries with many distinct groups are over-represented.
Treatment: Group rare countries into an 'Other' bucket and target- or frequency-encode before modelling.
- n: 19,375
- nulls: 0 (0.0%)
- unique: 240
- top_value: India
- top_rate: 0.1719
- cardinality: 240
- entropy: 6.175
- entropy_ratio: 0.781
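The treatment of bucketing rare countries and then frequency-encoding can be sketched as follows. The threshold, the helper names, and the toy country list are assumptions for illustration; the report does not prescribe a cut-off.

```python
from collections import Counter

def bucket_rare(values, min_count):
    """Replace values seen fewer than min_count times with 'Other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "Other" for v in values]

def frequency_encode(values):
    """Map each value to its relative frequency within the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

# Toy data: two frequent countries and two singletons.
countries = ["India"] * 5 + ["Pakistan"] * 3 + ["Fiji", "Malta"]
bucketed = bucket_rare(countries, min_count=2)
encoded = frequency_encode(bucketed)
```

Bucketing before encoding keeps the singletons from each receiving their own near-zero frequency, which would otherwise act as a proxy row identifier.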
JPPeopleGroup
text · feature · one_word · duplicates
This column holds Joshua Project-style people group labels: short ethnonyms or community descriptors such as 'Deaf', 'British', and 'French', and qualified forms such as 'Arab, (Muslim traditions)'. Values are short (mean 11.4 characters, median 1 word) and 54% are single-word. With 10,624 unique values, a 38.5% duplicate rate, and 10.9% nulls, the field behaves as a high-cardinality categorical with a long tail. The prevalence of parenthetical religious qualifiers ('(hindu', '(muslim', 'traditions)') in top_words suggests an embedded taxonomy that needs parsing to separate ethnicity from religious tradition.
Treatment: Normalize casing and split parenthetical religious qualifiers into a separate column before encoding as a high-cardinality category.
- n: 19,375
- nulls: 2,108 (10.9%)
- unique: 10,624
- len_min: 1
- len_max: 41
- len_mean: 11.35
- len_median: 9
- len_p95: 25
- word_mean: 1.618
- word_median: 1
- n_empty: 0
- n_duplicates: 6,643
- duplicate_rate: 0.3847
- vocab_size: 11,428
- readability_flesch_mean: 39.84
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.5414
- allcaps_rate: 0
- boilerplate_rate: 0
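One way to split a trailing parenthetical qualifier from a people-group label is a single regular expression. This is a sketch under the assumption that qualifiers always appear as one trailing `(...)` group, as in the 'Arab, (Muslim traditions)' example; the function name is illustrative and real data may need more cases.

```python
import re

# Lazy name, optional comma/space separator, then one trailing "(...)" group.
_QUALIFIER = re.compile(r"^(?P<name>.*?)[,\s]*\((?P<qualifier>[^)]*)\)\s*$")

def split_qualifier(label):
    """Return (name, qualifier); qualifier is None when no parenthetical exists."""
    match = _QUALIFIER.match(label)
    if match is None:
        return label.strip(), None
    return match.group("name").strip(), match.group("qualifier").strip()

with_qualifier = split_qualifier("Arab, (Muslim traditions)")
without_qualifier = split_qualifier("Deaf")
```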
JPPopulation
numeric · feature · high_skew · outliers
Population count per people-group row from the Joshua Project (JP) side, with values spanning 10 to 918,811,000 and a median of just 16,000. The distribution is extraordinarily right-skewed (skew 94.9, kurtosis 10,840): the mean of 468,378 sits far above the Q3 of 85,000, and 2,609 rows (15.1% of non-null rows) flag as outliers. Roughly 11% of rows are null, and only 1,731 unique values across 19,375 rows indicates heavy repetition, which is typical of rounded estimates.
Treatment: Log-transform and impute or flag nulls before modelling; confirm whether the 918.8M maximum is a genuinely very large group or a unit error.
- n: 19,375
- nulls: 2,128 (11.0%)
- unique: 1,731
- min: 10
- max: 9.188e+08
- mean: 4.684e+05
- median: 16,000
- std: 7.861e+06
- q1: 3,100
- q3: 85,000
- iqr: 81,900
- skew: 94.92
- kurtosis: 1.084e+04
- n_outliers: 2,609
- outlier_rate: 0.1513
- zero_rate: 0
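The quartiles above give a sense of where an outlier threshold would sit and why a log scale helps. The sketch assumes the common 1.5 x IQR (Tukey) rule; the report does not state which rule the profiler actually used, so the fence values are illustrative.

```python
import math

Q1, Q3 = 3_100, 85_000  # quartiles reported for JPPopulation

def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def log_population(value):
    """log10 of a population count; nulls pass through as None."""
    return None if value is None else math.log10(value)

lower, upper = tukey_fences(Q1, Q3)
log_min = log_population(10)
log_max = log_population(918_811_000)
```

On a log10 scale the column spans roughly 1 to 9, which is far easier to bin and model than the raw range. The transform is safe here because the minimum is 10 and zero_rate is 0.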
JPIndigenous
categorical · feature
Binary Y/N flag, almost certainly indicating whether the people group is classified as indigenous to the country in the Joshua Project (JP) data. 'Y' dominates at 71.2% of non-null rows (12,293 vs 4,974 'N'), and 10.88% of rows are null. An entropy ratio of 0.87 reflects a moderate imbalance between the two classes once nulls are excluded.
Treatment: Encode as boolean and decide explicitly whether to impute or flag the ~11% nulls before modelling.
- n: 19,375
- nulls: 2,108 (10.9%)
- unique: 2
- top_value: Y
- top_rate: 0.7119
- cardinality: 2
- entropy: 0.8662
- entropy_ratio: 0.8662
JPROL3
text · feature · one_word · short_text · duplicates
JPROL3 holds three-letter ISO 639-3 language codes (hin, eng, ben, spa, tam...), with every value being a single token of length 3. The field is mostly populated (10.88% null) and heavily duplicated (64.3% duplicate rate), with Hindi (700) and English (537) leading. It has 6,165 unique codes but a vocab_size of 6,164, and allcaps_rate implies about 6 upper-case entries, so at least one code probably appears in two casings. The high code count is plausible for ISO 639-3, which covers several thousand languages, and points to very broad linguistic coverage rather than noise.
Treatment: Lowercase the codes, then treat as a categorical language-code feature; one-hot or target-encode top codes and bucket the long tail.
- n: 19,375
- nulls: 2,108 (10.9%)
- unique: 6,165
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 11,102
- duplicate_rate: 0.643
- vocab_size: 6,164
- readability_flesch_mean: 115.3
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.0003475
- boilerplate_rate: 0
JPPrimaryLanguage
text · feature · one_word · short_text · duplicates
Free-text name of each people group's primary language, dominated by single-word entries (one_word_rate 0.7335) with a mean of 1.32 words and median length 7. There are 6,152 unique values and the long tail is heavy: the top value 'Hindi' appears just 700 times, 64.37% of populated rows repeat an earlier value, and 10.88% are null. vocab_size (6,236) exceeds the unique count (6,152), which implies multi-token entries such as 'Chinese, Mandarin' and tokens such as 'Punjabi,'; these fragment token-level counts, although whole-string value counts are unaffected. The unique count is close to JPROL3's 6,165 codes, so names and codes appear to map nearly one-to-one.
Treatment: Normalise casing and keep comma-qualified names such as 'Chinese, Mandarin' as single labels, since the comma looks like a name qualifier rather than a list separator; where possible, key on JPROL3 instead.
- n: 19,375
- nulls: 2,108 (10.9%)
- unique: 6,152
- len_min: 1
- len_max: 41
- len_mean: 9.088
- len_median: 7
- len_p95: 18
- word_mean: 1.319
- word_median: 1
- n_empty: 0
- n_duplicates: 11,115
- duplicate_rate: 0.6437
- vocab_size: 6,236
- readability_flesch_mean: 31.33
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.7335
- allcaps_rate: 0
- boilerplate_rate: 0
JPRLG3
numeric · feature
JPRLG3 is a low-cardinality integer code taking only 8 distinct values between 1 and 9, with no outliers. It has the same number of levels (8) and the same null count (2,108) as JPPrimaryReligion, so it is most plausibly the numeric code for that column rather than an ordinal rating. Summary statistics such as the mean (3.37), median (4), skew (0.096), and kurtosis (-1.60) are therefore not meaningful beyond confirming that several codes are well populated.
Treatment: Treat as a nominal categorical (8 levels), check it against JPPrimaryReligion and drop one if redundant, and impute or flag the 10.88% missing before modelling.
- n: 19,375
- nulls: 2,108 (10.9%)
- unique: 8
- min: 1
- max: 9
- mean: 3.367
- median: 4
- std: 2.199
- q1: 1
- q3: 6
- iqr: 5
- skew: 0.09636
- kurtosis: -1.595
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
JPPrimaryReligion
categorical · feature
Categorical label assigning a primary religion to each people-group row on the Joshua Project side. Eight categories cover 19,375 rows with ~10.9% missing; Christianity leads at 41.4% of non-null rows (7,140), followed by Islam (4,003) and Ethnic Religions (2,641), and an entropy_ratio of 0.72 indicates a moderately balanced distribution rather than a single dominant class. Note the explicit 'Unknown' bucket (26) co-existing with a 10.88% null rate, which suggests two distinct kinds of missingness: religion not known versus no JP record.
Treatment: One-hot or target-encode; reconcile the explicit 'Unknown' category with the 10.88% nulls before modelling.
- n: 19,375
- nulls: 2,108 (10.9%)
- unique: 8
- top_value: Christianity
- top_rate: 0.4135
- cardinality: 8
- entropy: 2.166
- entropy_ratio: 0.722
JPScale
numeric · feature
JPScale is an integer-coded ordinal feature with only 5 distinct values spanning 1 to 5, a mean of 2.71, and a median of 3. The name suggests a Joshua Project rating scale, though the report does not define the levels. The distribution is broad and flat (kurtosis -1.64) rather than peaked, with no outliers and a 10.88% null rate that should be handled explicitly.
Treatment: Treat as ordinal categorical (1-5) and impute or flag the ~11% missing values before modelling.
- n: 19,375
- nulls: 2,108 (10.9%)
- unique: 5
- min: 1
- max: 5
- mean: 2.71
- median: 3
- std: 1.623
- q1: 1
- q3: 4
- iqr: 3
- skew: 0.159
- kurtosis: -1.637
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
JPLeastReached
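categorical · metadata · null_rate · imbalance
JPLeastReached holds a single non-null value 'Y' in 7,186 rows; the remaining 12,189 rows (62.91%) are null. With cardinality 1 and entropy 0 it works as a presence flag rather than a category. The nulls are ambiguous: if the 2,108 rows with no JP data (the null count of PeopleID3 and most JP columns) are among them, as seems likely, then roughly 10,081 rows have JP data but no flag, and only those can be read as "not least reached".
Treatment: Recode as a binary indicator while keeping "no JP record" distinct from "not flagged"; drop only if the flag is not needed.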
- n: 19,375
- nulls: 12,189 (62.9%)
- unique: 1
- top_value: Y
- top_rate: 1
- cardinality: 1
- entropy: 0
- entropy_ratio: 0
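Because the only populated value is 'Y', a null can mean either "not least reached" or "no Joshua Project record at all". The sketch below recodes the flag into three states. It assumes that a null PeopleID3 marks a row with no JP record, which the matching null counts suggest but the report does not confirm; the function and state names are illustrative.

```python
def least_reached_status(jp_least_reached, people_id3):
    """Recode the Y/null flag into three explicit states."""
    if people_id3 is None:
        # Assumption: no PeopleID3 means the row has no JP record,
        # so the flag is unknown rather than negative.
        return "no_jp_record"
    if jp_least_reached == "Y":
        return "least_reached"
    return "not_flagged"

statuses = [
    least_reached_status("Y", 10119),
    least_reached_status(None, 22498),
    least_reached_status(None, None),
]
```

Under that assumption, 12,189 - 2,108 = 10,081 rows would fall into `not_flagged`.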
JP%ChristianAdherent
numeric · feature
Percentage (0–100) of a people group identified as Christian adherents, judging by the name and the full 0–100 range (mean 37.58, median 17.0). The distribution is U-shaped: Q1 sits at 0.02 while Q3 reaches 80.0, 23.7% of values are exactly zero, and kurtosis is -1.60, so values cluster at the extremes rather than the middle. The null rate is 10.88%, which is non-trivial and should be handled explicitly.
Treatment: Treat as a bounded percentage; impute or flag the 10.88% nulls and consider a zero-inflation indicator before modelling.
- n: 19,375
- nulls: 2,108 (10.9%)
- unique: 957
- min: 0
- max: 100
- mean: 37.58
- median: 17
- std: 39.33
- q1: 0.02
- q3: 80
- iqr: 79.98
- skew: 0.3824
- kurtosis: -1.602
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.2371
JP%Evangelical
numeric · feature · high_skew · outliers
Percentage of Evangelical adherents within each people-group row, from the Joshua Project (JP) side. The distribution is heavily right-skewed (skew 2.53, kurtosis 8.38) with a median of just 1.5 and Q3 of 8.0, yet a max of 95.0, and 27% of values are exactly zero. About 17.2% are null (3,332 rows, 1,224 more than the other JP percentage field) and 9.7% of non-null values flag as outliers, so the long tail of high-Evangelical groups is both real and sparse.
Treatment: Impute or flag nulls, then apply a log1p or similar transform before modelling to tame the skew.
- n: 19,375
- nulls: 3,332 (17.2%)
- unique: 692
- min: 0
- max: 95
- mean: 6.279
- median: 1.5
- std: 10.18
- q1: 0
- q3: 8
- iqr: 8
- skew: 2.535
- kurtosis: 8.378
- n_outliers: 1,561
- outlier_rate: 0.0973
- zero_rate: 0.27
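A sketch of the suggested treatment for a zero-inflated, right-skewed percentage: keep separate indicators for missing and exact-zero values, then apply `log1p` to the rest. The feature names and the choice to fill missing values with 0.0 alongside a flag are assumptions, not something the report specifies.

```python
import math

def evangelical_features(pct):
    """Turn one JP%Evangelical value into missing/zero flags plus log1p."""
    if pct is None:
        # The flag carries the missingness; 0.0 is only a placeholder.
        return {"is_missing": 1, "is_zero": 0, "log1p_pct": 0.0}
    return {
        "is_missing": 0,
        "is_zero": int(pct == 0),
        "log1p_pct": math.log1p(pct),  # log(1 + pct), defined at pct == 0
    }

zero_row = evangelical_features(0)
missing_row = evangelical_features(None)
max_row = evangelical_features(95)
```

`log1p` is used rather than a plain logarithm because 27% of values are exactly zero.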
CPPIPeopleGroup
text · feature · one_word · null_rate · short_text · duplicates
This column holds CPPI-side people-group labels, with values like 'Han Chinese', 'British', 'Russian', and 'Korean' among the most frequent. It is short and categorical-like (72.7% are single words and mean length is 8.7 characters), but with 9,170 unique values it is high-cardinality, and 36.7% of rows are null. Duplication is also notable (25.2%), and the long tail includes compound entries like 'Han Chinese, Mandarin', suggesting inconsistent delimiting.
Treatment: Normalize delimiters and treat as a high-cardinality categorical; group rare levels or target-encode before modelling.
- n: 19,375
- nulls: 7,119 (36.7%)
- unique: 9,170
- len_min: 1
- len_max: 38
- len_mean: 8.68
- len_median: 7
- len_p95: 18
- word_mean: 1.316
- word_median: 1
- n_empty: 0
- n_duplicates: 3,086
- duplicate_rate: 0.2518
- vocab_size: 8,586
- readability_flesch_mean: 40.96
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.7273
- allcaps_rate: 0
- boilerplate_rate: 0
CPPIPopulation
text · feature · multilingual · allcaps · null_rate · short_text · duplicates
This column holds CPPI-side population counts stored as comma-formatted strings rather than numbers, with values like ' 1,300 ' and ' 15,500 ' padded by whitespace. It is 36.74% null and heavily repetitive: 1,629 unique values with an 86.7% duplicate rate, suggesting rounded or bucketed figures. The 'multilingual' and 'allcaps' alerts are spurious artifacts of digit-only text; only 3 non-English rows exist out of 256 detected.
Treatment: Strip whitespace and commas, then cast to integer; impute or flag the 36.74% nulls before modelling.
- n: 19,375
- nulls: 7,119 (36.7%)
- unique: 1,629
- len_min: 3
- len_max: 13
- len_mean: 7.838
- len_median: 8
- len_p95: 11
- word_mean: 3
- word_median: 3
- n_empty: 0
- n_duplicates: 10,627
- duplicate_rate: 0.8671
- vocab_size: 1,629
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0
- allcaps_rate: 1
- boilerplate_rate: 0
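The strip-and-cast treatment can be sketched as below. It assumes values are non-negative whole numbers with optional padding and thousands separators, as in the ' 1,300 ' example; anything else is returned as `None` for manual review rather than guessed at. The function name is illustrative.

```python
def parse_population(raw):
    """Convert a padded, comma-formatted count such as ' 1,300 ' to int."""
    if raw is None:
        return None
    cleaned = raw.strip().replace(",", "")
    if not cleaned.isdigit():
        # Leave unexpected text (blank, signs, decimals) for manual review.
        return None
    return int(cleaned)

parsed = [parse_population(v) for v in [" 1,300 ", " 15,500 ", None, "n/a"]]
```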
CPPIROL
text · feature · one_word · null_rate · short_text · duplicates
CPPIROL holds 3-character ISO 639-3 language codes (eng, spa, tel, hin, por...), with every value being exactly one word of length 3. The field is null 36.74% of the time, and 59.09% of present values are duplicates across 5,014 distinct codes, with 'und' (undetermined) appearing 131 times. The distribution is long-tailed: English and Spanish are the most frequent codes, but each covers only a few percent of populated rows.
Treatment: Treat as a categorical language-code feature and one-hot or target-encode the high-frequency codes; keep nulls (no CPPI record) distinct from 'und' (language undetermined) unless the analysis can tolerate merging them.
- n: 19,375
- nulls: 7,119 (36.7%)
- unique: 5,014
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 7,242
- duplicate_rate: 0.5909
- vocab_size: 5,014
- readability_flesch_mean: 119.5
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
CPPIPrimaryLanguage
text · feature · one_word · null_rate · short_text · duplicates
This column records a primary language name per row, with 5,014 distinct values and mostly one-word entries (one_word_rate 0.77, word_mean 1.29). Top values look canonical (English 401, Spanish 351, Telugu 214), 36.7% are null, and 59.1% are duplicates. Its unique count (5,014) and n_duplicates (7,242) are identical to CPPIROL's, which suggests each name maps one-to-one to an ISO 639-3 code, that is, a controlled vocabulary rather than inconsistent free text. An explicit 'Undetermined' value appears 131 times, matching the 131 'und' codes. Worth noting: 'arabic' appears 385 times in top_words but does not surface in top_values, suggesting it is split across multi-word variants (e.g. dialect-qualified forms).
Treatment: Verify the one-to-one mapping with CPPIROL and key on the code; decide explicitly whether 'Undetermined' and nulls should share a single missing token before encoding.
- n: 19,375
- nulls: 7,119 (36.7%)
- unique: 5,014
- len_min: 1
- len_max: 41
- len_mean: 8.622
- len_median: 7
- len_p95: 18
- word_mean: 1.288
- word_median: 1
- n_empty: 0
- n_duplicates: 7,242
- duplicate_rate: 0.5909
- vocab_size: 5,098
- readability_flesch_mean: 37.09
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.7674
- allcaps_rate: 0
- boilerplate_rate: 0
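CPPIROL and CPPIPrimaryLanguage report the same unique count (5,014) and the same n_duplicates (7,242), which hints that codes and names map one-to-one. Matching counts do not prove it, so a direct check is worthwhile. The sketch below finds codes with more than one name and names with more than one code; the helper name and the toy pairs are illustrative.

```python
from collections import defaultdict

def mapping_conflicts(pairs):
    """Return (codes with several names, names with several codes)."""
    names_by_code = defaultdict(set)
    codes_by_name = defaultdict(set)
    for code, name in pairs:
        if code is None or name is None:
            continue  # skip rows with no CPPI record
        names_by_code[code].add(name)
        codes_by_name[name].add(code)
    return (
        {c: n for c, n in names_by_code.items() if len(n) > 1},
        {n: c for n, c in codes_by_name.items() if len(c) > 1},
    )

clean = [("eng", "English"), ("spa", "Spanish"), ("eng", "English"),
         ("und", "Undetermined"), (None, None)]
clean_conflicts = mapping_conflicts(clean)
dirty_conflicts = mapping_conflicts(clean + [("eng", "english")])
```

Two empty dictionaries mean the mapping is one-to-one, in which case the name column adds no information beyond the code.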
CPPIPrimaryReligion
categorical · label · null_rate
Categorical label identifying the primary religion of each people-group row on the CPPI side, drawn from a taxonomy of 40 religious categories. The distribution is broad (entropy ratio 0.71) with no dominant class: the leading value 'Islam - Sunni' covers only 14.7% of non-null rows, followed by Non-Evangelical Protestant Christianity and two Ethnoreligion variants. Note the heavy 47.6% null rate (9,227 rows, 2,108 more than the other CPPI columns), which will materially shrink any analysis conditioned on this field.
Treatment: Impute or bucket nulls explicitly before use; consider collapsing the 40 categories into parent religions for modelling.
- n: 19,375
- nulls: 9,227 (47.6%)
- unique: 40
- top_value: Islam - Sunni
- top_rate: 0.1473
- cardinality: 40
- entropy: 3.791
- entropy_ratio: 0.7123
CPPIGSEC
numeric · feature · null_rate
CPPIGSEC is an integer-coded categorical with only 7 unique values ranging from 0 to 6, almost certainly an ordinal classification or status code rather than a true numeric measure. Values concentrate at the low end (median 1, Q1 1, skew 0.52) but a substantial share sits high on the scale (Q3 5, mean 2.65), giving a broad, flat distribution (kurtosis -1.43). Notably, 36.74% of rows are null, which is large enough to materially affect any downstream join or model.
Treatment: Treat as categorical/ordinal and impute or add a missingness indicator before modelling.
- n: 19,375
- nulls: 7,119 (36.7%)
- unique: 7
- min: 0
- max: 6
- mean: 2.654
- median: 1
- std: 2.058
- q1: 1
- q3: 5
- iqr: 4
- skew: 0.5224
- kurtosis: -1.432
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.02635
CPPIEvangelicalEngagement
categorical · feature · null_rate
Two-level engagement status on the CPPI side, taking only 'Engaged' or 'Unengaged'. 'Engaged' leads at 59.1% of non-null rows (7,238 vs 5,018), and an entropy ratio of 0.976 shows the two classes are nearly balanced. Notably, 36.74% of rows are null, which is large enough to materially bias any downstream split.
Treatment: Encode as a three-level category ('Engaged', 'Unengaged', 'missing'), or as a binary flag plus a separate missingness indicator, before modelling.
- n: 19,375
- nulls: 7,119 (36.7%)
- unique: 2
- top_value: Engaged
- top_rate: 0.5906
- cardinality: 2
- entropy: 0.9762
- entropy_ratio: 0.9762
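The entropy figures in this report can be reproduced from the class counts. The sketch assumes the profiler computes Shannon entropy in bits over non-null values and defines entropy_ratio as entropy divided by log2 of the cardinality; that definition is inferred from the numbers, not documented in the report.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k); 0.0 for a single class."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

engagement_ratio = entropy_ratio([7_238, 5_018])  # Engaged, Unengaged
indigenous_ratio = entropy_ratio([12_293, 4_974])  # JPIndigenous Y, N
single_class_ratio = entropy_ratio([7_186])        # JPLeastReached
```

For a two-class column the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for CPPIEvangelicalEngagement and JPIndigenous.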