Joshua Project countries
Reading
This dataset profiles 238 countries from the Joshua Project, combining demographic data (population, people groups, languages) with religious composition percentages and Bible translation/evangelization status. Christianity is the primary religion in 159 of 238 countries, while the JPScaleText field shows 89 countries as 'Significantly Reached' versus 43 'Unreached', a useful starting lens for mission analysis. Population and people-group counts are extremely right-skewed (skew of roughly 8 to 12, with outliers such as a 1.46B population), so log-scale views or per-capita ratios will be more informative than raw totals. Religion percentage columns also have very high zero-rates (e.g., Hinduism 56%, Buddhism 48%), reflecting that most countries have a negligible presence of any given non-dominant religion. PoplPeoplesFPG and CntPeoplesFPG have substantial null rates (31.5% and 28.6%), so any analysis of frontier/unreached people groups should account for missing coverage.
citing: ReligionPrimary · JPScaleText · RegionName · Window1040 · PercentChristianity · Population · PoplPeoplesFPG · CntPeoplesFPG · PercentIslam · PercentHinduism · PercentBuddhism
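The log-scale and ratio views suggested in the summary can be sketched as follows. This is a minimal illustration, assuming the table is loaded into a pandas DataFrame with the column names listed above; the three rows are toy values built from the reported min, median, and max, not real country records.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real table (illustrative values only).
df = pd.DataFrame({
    "Ctry": ["A", "B", "C"],
    "Population": [50, 5_606_500, 1_463_866_000],
    "PoplPeoplesFPG": [np.nan, 217_000, 1_090_000_000],
})

# log10 compresses a span of 50 to ~1.46e9 into roughly 1.7 to 9.2.
df["log10_population"] = np.log10(df["Population"])

# Share of each country's population living in frontier people groups.
# NaN propagates, so missing FPG coverage stays visible instead of reading as 0.
df["fpg_share"] = df["PoplPeoplesFPG"] / df["Population"]
```

Keeping the ratio as NaN where PoplPeoplesFPG is null avoids silently treating unreported coverage as zero.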
Charts the summary said to look at first
ReligionPrimary
| value | count | share |
|---|---|---|
| Christianity | 159 | 66.8% |
| Islam | 55 | 23.1% |
| Buddhism | 10 | 4.2% |
| Ethnic Religions | 6 | 2.5% |
| Non-Religious | 5 | 2.1% |
| Hinduism | 3 | 1.3% |
JPScaleText
| value | count | share |
|---|---|---|
| Significantly Reached | 89 | 37.4% |
| Partially Reached | 67 | 28.2% |
| Unreached | 43 | 18.1% |
| Superficially Reached | 28 | 11.8% |
| Minimally Reached | 11 | 4.6% |
RegionName
| value | count | share |
|---|---|---|
| America, North and Caribbean | 30 | 12.6% |
| Europe, Western | 28 | 11.8% |
| Africa, East and Southern | 28 | 11.8% |
| Australia and Pacific | 27 | 11.3% |
| Africa, West and Central | 24 | 10.1% |
| Europe, Eastern and Eurasia | 23 | 9.7% |
| America, Latin | 22 | 9.2% |
| Africa, North and Middle East | 19 | 8.0% |
| Asia, Southeast | 11 | 4.6% |
| Asia, Central | 10 | 4.2% |
| Asia, South | 8 | 3.4% |
| Asia, Northeast | 8 | 3.4% |
PercentChristianity (histogram)
| bin | count |
|---|---|
| 0.0165 – 6.682 | 48 |
| 6.682 – 13.35 | 16 |
| 13.35 – 20.01 | 2 |
| 20.01 – 26.68 | 2 |
| 26.68 – 33.34 | 5 |
| 33.34 – 40.01 | 3 |
| 40.01 – 46.68 | 5 |
| 46.68 – 53.34 | 8 |
| 53.34 – 60.01 | 6 |
| 60.01 – 66.67 | 13 |
| 66.67 – 73.34 | 7 |
| 73.34 – 80 | 15 |
| 80 – 86.67 | 25 |
| 86.67 – 93.33 | 42 |
| 93.33 – 100 | 41 |
Window1040
| value | count | share |
|---|---|---|
| N | 170 | 71.4% |
| Y | 68 | 28.6% |
Schema
39 columns

| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| PoplPeoplesLR | numeric | 17.6% | 180 | high_skew, outliers |
| PercentUnknown | numeric | 0.0% | 158 | high_skew |
| ReligionPrimary | categorical | 0.0% | 6 | |
| SecurityLevel | numeric | 0.0% | 3 | |
| PercentNonReligious | numeric | 0.0% | 228 | high_skew, outliers |
| ISO2 | categorical | 0.0% | 238 | long_tail |
| JPScaleImageURL | categorical | 0.0% | 5 | |
| PercentEthnicReligions | numeric | 0.0% | 199 | high_skew, outliers |
| RegionName | categorical | 0.0% | 12 | |
| CntPeoples | numeric | 0.0% | 96 | high_skew, outliers |
| PoplPeoplesFPG | numeric | 31.5% | 142 | null_rate, high_skew, outliers |
| ROG3 | categorical | 0.0% | 238 | long_tail |
| PercentEvangelical | numeric | 2.5% | 232 | |
| ROL3OfficialLanguage | categorical | 0.0% | 88 | long_tail |
| TranslationUnspecified | numeric | 0.0% | 21 | high_skew, outliers |
| TranslationNeeded | numeric | 0.0% | 18 | high_skew, outliers |
| TranslationStarted | numeric | 0.0% | 30 | high_skew, outliers |
| RegionCode | numeric | 0.0% | 12 | |
| Ctry | categorical | 0.0% | 238 | long_tail |
| BibleNewTestament | numeric | 0.0% | 45 | high_skew, outliers |
| BibleComplete | numeric | 0.0% | 54 | high_skew, outliers |
| Window1040 | categorical | 0.0% | 2 | |
| JPScaleText | categorical | 0.0% | 5 | |
| CntPrimaryLanguages | numeric | 0.0% | 91 | high_skew, outliers |
| OfficialLang | categorical | 0.4% | 87 | long_tail |
| BiblePortions | numeric | 0.0% | 35 | high_skew, outliers |
| ROG2 | categorical | 0.0% | 7 | |
| PercentIslam | numeric | 0.0% | 198 | outliers |
| RLG3Primary | numeric | 0.0% | 6 | |
| Capital | categorical | 1.7% | 233 | long_tail |
| Population | numeric | 0.0% | 230 | high_skew, outliers |
| CntPeoplesLR | numeric | 15.1% | 57 | high_skew, outliers |
| CntPeoplesFPG | numeric | 28.6% | 44 | null_rate, high_skew, outliers |
| PercentHinduism | numeric | 0.0% | 106 | high_skew, outliers |
| PercentOtherSmall | numeric | 0.0% | 199 | high_skew, outliers |
| PercentBuddhism | numeric | 0.0% | 125 | high_skew, outliers |
| JPScaleCtry | numeric | 0.0% | 5 | |
| PercentChristianity | numeric | 0.0% | 237 | |
| ISO3 | categorical | 0.0% | 238 | long_tail |
PoplPeoplesLR
numeric feature · high_skew · outliers
Likely the population living in least-reached (LR) people groups in each country, spanning 50 to ~1.39B with a median of 532,500. The distribution is extremely right-skewed (skew 12.1, kurtosis 156) with 31 outliers (15.8% of non-null values) and a std (~104M) far exceeding the mean (~18M), so a few very large countries dominate. 17.6% of rows (42) are null. Treatment: Log-transform and impute the nulls before modelling.
- n: 238
- nulls: 42 (17.6%)
- unique: 180
- min: 50
- max: 1.395e+09
- mean: 1.823e+07
- median: 532,500
- std: 1.038e+08
- q1: 32,750
- q3: 6.022e+06
- iqr: 5.99e+06
- skew: 12.1
- kurtosis: 156.4
- n_outliers: 31
- outlier_rate: 0.1582
- zero_rate: 0
PercentUnknown
numeric feature · high_skew
PercentUnknown ranges from 0 to 2.679, with a mean of 0.218 and a median of 0.176. The other Percent* columns in this table run from 0 to 100, so this is most plausibly a percentage (the share of population whose religion is unknown) rather than a 0–1 proportion, and a maximum of 2.68% is unremarkable. About 34% of the 238 rows are exactly zero (q1 is also 0), and the column is heavily right-tailed with skew 3.92 and kurtosis 32.16. Three outliers (1.3%) sit far above the bulk of the distribution. Treatment: Confirm the 0–100 scale against the source, then log1p-transform given the right skew and zero-inflation.
- n: 238
- nulls: 0 (0.0%)
- unique: 158
- min: 0
- max: 2.679
- mean: 0.2176
- median: 0.1758
- std: 0.2604
- q1: 0
- q3: 0.3673
- iqr: 0.3673
- skew: 3.918
- kurtosis: 32.16
- n_outliers: 3
- outlier_rate: 0.01261
- zero_rate: 0.3403
ReligionPrimary
categorical feature
Primary religion of each country across 238 rows, with 6 distinct values and no nulls. Christianity dominates at 159/238 (top_rate 0.668), followed by Islam at 55; Buddhism (10), Ethnic Religions (6), Non-Religious (5), and Hinduism (3) form the tail. An entropy ratio of 0.54 confirms the heavy concentration in one category. Treatment: One-hot encode, optionally collapsing the four smallest categories into 'Other' to handle imbalance.
- n: 238
- nulls: 0 (0.0%)
- unique: 6
- top_value: Christianity
- top_rate: 0.6681
- cardinality: 6
- entropy: 1.4
- entropy_ratio: 0.5415
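The entropy figures can be reproduced from the ReligionPrimary counts in the chart table. The sketch below assumes the profiler reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2 of the cardinality; the report does not state this, but the numbers agree.

```python
import math

# Counts from the ReligionPrimary table in this report.
counts = {
    "Christianity": 159,
    "Islam": 55,
    "Buddhism": 10,
    "Ethnic Religions": 6,
    "Non-Religious": 5,
    "Hinduism": 3,
}
n = sum(counts.values())

# Shannon entropy in bits: H = -sum(p * log2(p)).
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(counts))
```

The same definition gives log2(238) ≈ 7.895 for a column where every value is unique, which matches the entropy reported for the identifier columns.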
SecurityLevel
numeric feature
SecurityLevel takes only 3 distinct integer values across 238 rows (min 0, max 2) with no nulls, suggesting an ordinal tier code rather than a continuous measure. The distribution is heavily weighted toward the lowest tier: 67.6% of rows are zero, the median is 0, and the mean is just 0.55, producing a right skew of 1.01. No outliers were flagged, which is consistent with a bounded categorical scale. Treatment: Treat as an ordinal category (0/1/2) rather than a continuous numeric.
- n: 238
- nulls: 0 (0.0%)
- unique: 3
- min: 0
- max: 2
- mean: 0.5462
- median: 0
- std: 0.8344
- q1: 0
- q3: 1
- iqr: 1
- skew: 1.011
- kurtosis: -0.796
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.6765
PercentNonReligious
numeric feature · high_skew · outliers
This column reports the percentage of a population that is non-religious across 238 rows, with 228 unique values and no nulls. The distribution is heavily right-skewed (skew 2.61, kurtosis 8.00): the median is just 2.85% while the mean is 7.31% and the max reaches 68.81%, with 30 outliers (12.6% of rows) and 4.6% of values exactly zero. The std (11.39) exceeds the IQR (7.08), confirming a long upper tail rather than a symmetric spread. Treatment: Log1p- or rank-transform before modelling to tame the heavy right tail.
- n: 238
- nulls: 0 (0.0%)
- unique: 228
- min: 0
- max: 68.81
- mean: 7.308
- median: 2.851
- std: 11.39
- q1: 0.5635
- q3: 7.646
- iqr: 7.083
- skew: 2.611
- kurtosis: 7.999
- n_outliers: 30
- outlier_rate: 0.1261
- zero_rate: 0.04622
ISO2
categorical identifier · long_tail
This column holds ISO2 country codes (AF, AL, DZ, AS, AD...), serving as a unique key with 238 distinct values across 238 rows and zero nulls. Cardinality equals row count and entropy_ratio is 1.0, meaning every code appears exactly once (top_rate is just 0.0042). The long_tail alert is expected here since the column is effectively a primary identifier. Treatment: Use as the join key to merge country-level attributes.
- n: 238
- nulls: 0 (0.0%)
- unique: 238
- top_value: AF
- top_rate: 0.004202
- cardinality: 238
- entropy: 7.895
- entropy_ratio: 1
JPScaleImageURL
categorical feature
This column holds URLs to one of five Joshua Project 'gauge' images (gauge-1.png through gauge-5.png), almost certainly a visual encoding of the same 1-5 progress scale as JPScaleText: gauge-5 leads at 37.4% (89 rows, matching 'Significantly Reached') while gauge-2 is rarest at 11 rows (matching 'Minimally Reached'). There are only 5 unique values across 238 rows and no nulls, and the distribution leans toward the high end of the scale. The URL itself carries no information beyond the trailing digit. Treatment: Extract the trailing digit (1-5) and treat as an ordinal feature; drop the URL string.
- n: 238
- nulls: 0 (0.0%)
- unique: 5
- top_value: https://joshuaproject.net/assets/img/gauge/gauge-5.png
- top_rate: 0.3739
- cardinality: 5
- entropy: 2.06
- entropy_ratio: 0.8871
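Extracting the trailing digit might look like the sketch below. It assumes every URL follows the `gauge-N.png` pattern shown in top_value; the second URL is constructed from that pattern for illustration.

```python
import pandas as pd

urls = pd.Series([
    "https://joshuaproject.net/assets/img/gauge/gauge-5.png",
    "https://joshuaproject.net/assets/img/gauge/gauge-2.png",
])

# Capture the single digit between "gauge-" and ".png".
# expand=False returns a Series rather than a one-column DataFrame.
digit = urls.str.extract(r"gauge-(\d)\.png$", expand=False)

# Nullable integer dtype, so a URL that fails to match would become <NA>
# rather than raising or silently turning the column into floats.
scale = pd.to_numeric(digit).astype("Int64")
```

A non-matching URL surfaces as a missing value, which makes unexpected image names easy to spot.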
PercentEthnicReligions
numeric feature · high_skew · outliers
Percentage (0–75.16) of ethnic-religion adherents per country, across 238 rows with 199 unique values. The distribution is heavily right-skewed (skew 3.29, kurtosis 12.38) with a median of just 1.12% but a mean of 5.59% and a long tail producing 27 outliers (11.3% of rows); 16.8% of rows are exact zeros. The IQR (5.52) is far smaller than the std (11.21), confirming most values cluster near zero while a few cases dominate. Treatment: Apply a log1p or similar transform before modelling to tame the heavy right tail and zero mass.
- n: 238
- nulls: 0 (0.0%)
- unique: 199
- min: 0
- max: 75.16
- mean: 5.59
- median: 1.116
- std: 11.21
- q1: 0.05109
- q3: 5.576
- iqr: 5.525
- skew: 3.295
- kurtosis: 12.38
- n_outliers: 27
- outlier_rate: 0.1134
- zero_rate: 0.1681
RegionName
categorical feature
RegionName is a categorical geographic grouping with 12 distinct values across 238 rows and no nulls. The distribution is fairly even: the entropy ratio is 0.96 and the top bucket, 'America, North and Caribbean', accounts for only 12.6% (30 rows). The four Asian regions are the smallest (Southeast 11, Central 10, South 8, Northeast 8), well below the African and European groupings. Treatment: one-hot or target-encode for modelling; safe to use as a grouping key.
- n: 238
- nulls: 0 (0.0%)
- unique: 12
- top_value: America, North and Caribbean
- top_rate: 0.1261
- cardinality: 12
- entropy: 3.454
- entropy_ratio: 0.9634
CntPeoples
numeric feature · high_skew · outliers
Count of people groups per country, fully populated across 238 rows with 96 distinct values ranging from 1 to 2262. The distribution is severely right-skewed (skew 8.20, kurtosis 85.91): the median is 24.5 and Q3 is 56.75, yet the mean is 68.83 and the max reaches 2262, with 24 outliers (10.1% outlier rate). Std (184.37) far exceeds the IQR (48.75), confirming a long heavy tail. Treatment: log-transform (or winsorize the top decile) before any distance- or variance-based modelling.
- n: 238
- nulls: 0 (0.0%)
- unique: 96
- min: 1
- max: 2,262
- mean: 68.83
- median: 24.5
- std: 184.4
- q1: 8
- q3: 56.75
- iqr: 48.75
- skew: 8.202
- kurtosis: 85.91
- n_outliers: 24
- outlier_rate: 0.1008
- zero_rate: 0
PoplPeoplesFPG
numeric feature · null_rate · high_skew · outliers
Likely the population living in frontier people groups (FPG) in each country, with values ranging from 50 to roughly 1.09 billion and a median of 217,000. The distribution is extremely right-skewed (skew 11.6, kurtosis 138.7) with 19% of non-null values flagged as outliers, and 31.5% of rows (75) are null. The mean (~12.26M) sits far above the median, confirming that a handful of very large countries dominate. Treatment: log-transform and impute the 31.5% nulls before any modelling.
- n: 238
- nulls: 75 (31.5%)
- unique: 142
- min: 50
- max: 1.09e+09
- mean: 1.226e+07
- median: 217,000
- std: 8.773e+07
- q1: 16,500
- q3: 1.842e+06
- iqr: 1.825e+06
- skew: 11.6
- kurtosis: 138.7
- n_outliers: 31
- outlier_rate: 0.1902
- zero_rate: 0
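One way to carry out the log-transform-and-impute treatment is sketched below. The report does not say whether a null here means "no frontier groups" or "not assessed" (the column has no zeros, so either is possible); the sketch therefore keeps a missing-value indicator alongside a median fill. The sample values are taken from the summary statistics above, not from real rows.

```python
import numpy as np
import pandas as pd

fpg = pd.Series(
    [50, 16_500, 217_000, np.nan, 1_090_000_000],
    name="PoplPeoplesFPG",
)

# Record missingness before filling so the information is not lost.
fpg_missing = fpg.isna().astype(int)

# Every observed value is >= 50, so log10 is safe; NaN stays NaN.
log_fpg = np.log10(fpg)

# Impute on the log scale with the median of observed values.
log_fpg_filled = log_fpg.fillna(log_fpg.median())
```

Imputing after the transform keeps the filled value near the centre of the log-scale distribution instead of being dragged upward by the largest countries.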
ROG3
categorical identifier · long_tail
ROG3 looks like a country/region code identifier: every one of the 238 rows holds a unique two-letter value (AF, AL, AG, AQ, ...), giving cardinality 238 and entropy_ratio 1.0. The codes differ from ISO2 (AG and AQ here versus DZ and AS there), so the two columns are not interchangeable as join keys. With top_rate at 0.0042 and no nulls, this column carries no predictive signal on its own and behaves as a primary key for the row. Treatment: Use as a join key to country-level attributes that use the same coding; drop from any model as a feature.
- n: 238
- nulls: 0 (0.0%)
- unique: 238
- top_value: AF
- top_rate: 0.004202
- cardinality: 238
- entropy: 7.895
- entropy_ratio: 1
PercentEvangelical
numeric feature
Percentage (0–53.4%) of the population that is evangelical, per country, across 238 rows with 232 distinct values. The distribution is right-skewed (skew 1.25) with mean 10.46% well above the median 6.92% and an IQR spanning 1.38–17.69%, so a long tail of high-evangelical countries pulls the average up. About 2.5% of rows (6) are null and 4 outliers sit beyond the upper whisker; no zero values were recorded. Treatment: Consider a log or sqrt transform before modelling to tame the right skew.
- n: 238
- nulls: 6 (2.5%)
- unique: 232
- min: 0.000766
- max: 53.44
- mean: 10.46
- median: 6.916
- std: 11.35
- q1: 1.38
- q3: 17.69
- iqr: 16.31
- skew: 1.247
- kurtosis: 0.9595
- n_outliers: 4
- outlier_rate: 0.01724
- zero_rate: 0
ROL3OfficialLanguage
categorical feature · long_tail
ISO 639-3 language codes denoting each country's official language, with 88 distinct values across 238 rows and no nulls. English dominates at 26.5% (63 rows) followed by French (25), Spanish (21), and Arabic (20), but the long tail is heavy: an entropy ratio of 0.74 against cardinality 88 means most codes appear only once or twice (e.g. aln, smo at 2). Codes such as 'arb' (Standard Arabic) and 'cmn' (Mandarin Chinese) are individual-language codes that sit under ISO 639-3 macrolanguages, so consistency of coding granularity should be checked before joining to sources that use macrolanguage codes. Treatment: Group the long tail into an 'other' bucket or target-encode before modelling.
- n: 238
- nulls: 0 (0.0%)
- unique: 88
- top_value: eng
- top_rate: 0.2647
- cardinality: 88
- entropy: 4.802
- entropy_ratio: 0.7434
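Grouping the long tail into an 'other' bucket could be done along these lines. The `min_count` threshold is a hypothetical choice for illustration, and the sample series is a toy, not the real column.

```python
import pandas as pd

# Toy sample: two frequent codes and two rare ones.
langs = pd.Series(["eng"] * 5 + ["fra"] * 3 + ["aln", "smo"])

min_count = 3  # hypothetical threshold; tune against the real frequencies
counts = langs.value_counts()
keep = counts[counts >= min_count].index

# Keep frequent codes as-is and replace everything else with "other".
grouped = langs.where(langs.isin(keep), "other")
```

With 88 codes over 238 rows, a threshold like this would leave only a handful of named languages, so the cut-off is worth checking against how much signal the model needs from this column.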
TranslationUnspecified
numeric feature · high_skew · outliers
A heavily right-skewed count per country, most plausibly the number of languages whose Bible translation status is unspecified, with 21 distinct integer values across 238 rows. Roughly 42% of rows are zero and the median is 1, yet the maximum reaches 80 against a Q3 of 2, producing extreme skew (6.02) and kurtosis (45.1). About 11% of values (26 rows) flag as outliers, so a small tail pulls the mean (2.80) well above the median (1). Treatment: Log1p- or rank-transform before modelling, and consider winsorising the long tail.
- n: 238
- nulls: 0 (0.0%)
- unique: 21
- min: 0
- max: 80
- mean: 2.798
- median: 1
- std: 7.881
- q1: 0
- q3: 2
- iqr: 2
- skew: 6.016
- kurtosis: 45.14
- n_outliers: 26
- outlier_rate: 0.1092
- zero_rate: 0.4244
TranslationNeeded
numeric feature · high_skew · outliers
TranslationNeeded is a numeric count column where 68% of rows are zero, the median is 0, and Q3 is 1; it most plausibly tracks how many languages in each country still need a Bible translation. The distribution is extremely right-skewed (skew 8.56, kurtosis 79.6) with a max of 104 against a mean of 2.24, and 31 rows (13%) flag as outliers. With only 18 unique values across 238 rows, this behaves more like a sparse counter than a continuous metric. Treatment: Log1p-transform or binarise (zero vs non-zero) before modelling to tame the heavy tail.
- n: 238
- nulls: 0 (0.0%)
- unique: 18
- min: 0
- max: 104
- mean: 2.235
- median: 0
- std: 10.27
- q1: 0
- q3: 1
- iqr: 1
- skew: 8.56
- kurtosis: 79.64
- n_outliers: 31
- outlier_rate: 0.1303
- zero_rate: 0.6807
TranslationStarted
numeric feature · high_skew · outliers
A numeric counter named TranslationStarted, most plausibly the number of languages in each country for which Bible translation has started. Over half the rows are zero (zero_rate 0.517) and the median is 0 with q3=3, yet the max reaches 261, producing extreme skew (7.81) and kurtosis (64.98) plus 35 outliers (14.7%). The mean of 6.46 is pulled far above the median by these heavy-tail cases. Treatment: Apply a log1p transform (or use a zero-inflated model) to tame the heavy right tail.
- n: 238
- nulls: 0 (0.0%)
- unique: 30
- min: 0
- max: 261
- mean: 6.458
- median: 0
- std: 27.06
- q1: 0
- q3: 3
- iqr: 3
- skew: 7.807
- kurtosis: 64.98
- n_outliers: 35
- outlier_rate: 0.1471
- zero_rate: 0.5168
RegionCode
numeric feature
RegionCode holds 12 distinct integer values from 1 to 12 across 238 rows with no nulls, which strongly suggests a categorical region identifier rather than a true numeric measure; its cardinality matches RegionName, so it is most likely the numeric ID for that label. The mild left skew (-0.47) and median of 8 only reflect how many countries fall under each code and have no quantitative meaning. Treatment: Cast to categorical and one-hot or target-encode before modelling, or drop in favour of RegionName.
- n: 238
- nulls: 0 (0.0%)
- unique: 12
- min: 1
- max: 12
- mean: 7.336
- median: 8
- std: 3.528
- q1: 5
- q3: 10
- iqr: 5
- skew: -0.4715
- kurtosis: -0.9103
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
Ctry
categorical identifier · long_tail
Despite the abbreviated header `Ctry`, this column holds full country names (Afghanistan, Albania, Algeria, …) and acts as a unique key: all 238 rows have distinct values, with entropy_ratio of ~1.0 and a top_rate of just 0.0042. There are no nulls, and the alphabetical run in top_values suggests the dataset is a one-row-per-country reference table. Treatment: Use as the display label; for joins, prefer the ISO code columns, since country-name spellings vary between sources.
- n: 238
- nulls: 0 (0.0%)
- unique: 238
- top_value: Afghanistan
- top_rate: 0.004202
- cardinality: 238
- entropy: 7.895
- entropy_ratio: 1
BibleNewTestament
numeric feature · high_skew · outliers
A count per country, most plausibly the number of languages that have a New Testament translation, ranging from 0 to 274 with a median of just 3. The distribution is extremely right-skewed (skew 6.08, kurtosis 49.4) with 23.1% zeros and 10.5% outliers, so a small number of countries dominate the totals while most have few or none. Treatment: Apply a log1p transform and consider winsorising before modelling.
- n: 238
- nulls: 0 (0.0%)
- unique: 45
- min: 0
- max: 274
- mean: 10.82
- median: 3
- std: 25.76
- q1: 1
- q3: 10
- iqr: 9
- skew: 6.079
- kurtosis: 49.43
- n_outliers: 25
- outlier_rate: 0.105
- zero_rate: 0.2311
BibleComplete
numeric feature · high_skew · outliers
A count per country, most plausibly the number of languages that have a complete Bible translation, ranging from 0 to 162 across 238 rows with no nulls. The distribution is heavily right-skewed (skew 3.32, kurtosis 17.0): the median is 9 with an IQR of 15, yet 18 values (7.6%) qualify as outliers and the max of 162 sits far above q3=19. Only 1.7% of rows are zero, so countries with no complete-Bible language are rare; the long tail of high counts is the dominant feature. Treatment: Apply a log1p transform before modelling to tame the heavy right tail.
- n: 238
- nulls: 0 (0.0%)
- unique: 54
- min: 0
- max: 162
- mean: 15.56
- median: 9
- std: 18.9
- q1: 4
- q3: 19
- iqr: 15
- skew: 3.319
- kurtosis: 17.02
- n_outliers: 18
- outlier_rate: 0.07563
- zero_rate: 0.01681
Window1040
categorical feature
Binary Y/N flag with no nulls across 238 rows. The distribution is imbalanced toward 'N' at 71.4% (170 of 238) versus 68 'Y', giving an entropy ratio of 0.86. The name refers to the 10/40 Window, a missions term for the band between roughly 10 and 40 degrees north latitude spanning North Africa, the Middle East, and Asia, so 'Y' marks countries inside that band. Treatment: Encode as a 0/1 boolean for modelling.
- n: 238
- nulls: 0 (0.0%)
- unique: 2
- top_value: N
- top_rate: 0.7143
- cardinality: 2
- entropy: 0.8631
- entropy_ratio: 0.8631
JPScaleText
categorical label
JPScaleText is a 5-level ordinal label describing reach status (Unreached, Minimally, Superficially, Partially, Significantly Reached) across 238 complete rows. The distribution is fairly balanced with high entropy ratio (0.887) and a modal class of 'Significantly Reached' at 37.4%, though 'Minimally Reached' is sparse at only 11 records. No nulls and tight cardinality make this a clean categorical feature. Treatment: Encode as an ordered ordinal (Unreached → Significantly Reached) before modelling.
- n: 238
- nulls: 0 (0.0%)
- unique: 5
- top_value: Significantly Reached
- top_rate: 0.3739
- cardinality: 5
- entropy: 2.06
- entropy_ratio: 0.8871
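The ordered encoding can be expressed with a pandas ordered categorical. The level order below follows the sequence given in the description; numbering the levels 1 to 5 is an assumption made here for readability.

```python
import pandas as pd

# Lowest to highest reach status, in the order listed above.
levels = [
    "Unreached",
    "Minimally Reached",
    "Superficially Reached",
    "Partially Reached",
    "Significantly Reached",
]

status = pd.Series(["Significantly Reached", "Unreached", "Partially Reached"])

# An ordered categorical supports comparisons such as status_cat >= "Partially Reached".
status_cat = pd.Categorical(status, categories=levels, ordered=True)

# .codes are 0-based positions in `levels`; shift to 1..5.
status_code = pd.Series(status_cat.codes) + 1
```

Any label outside `levels` would receive code -1 before the shift (0 after), which is a convenient check for unexpected spellings.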
CntPrimaryLanguages
numeric feature · high_skew · outliers
CntPrimaryLanguages is a numeric count, likely the number of primary languages spoken in each country, ranging from 1 to 827 across 238 rows with no nulls. The distribution is heavily right-skewed (skew 5.42, kurtosis 36.5): the median is just 20 while the mean is 45.6 and the max reaches 827, with 20 outliers (8.4%) sitting well above the Q3 of 46.5. Most countries have modest language counts but a small tail dominates the variance (std 90.1). Treatment: log-transform (or winsorize the upper tail) before any distance- or regression-based modelling.
- n: 238
- nulls: 0 (0.0%)
- unique: 91
- min: 1
- max: 827
- mean: 45.55
- median: 20
- std: 90.13
- q1: 7
- q3: 46.5
- iqr: 39.5
- skew: 5.42
- kurtosis: 36.52
- n_outliers: 20
- outlier_rate: 0.08403
- zero_rate: 0
OfficialLang
categorical feature · long_tail
This column lists the official language of each of the 238 countries or territories by name, with one null (0.4%). English dominates at 63 occurrences (26.6% of non-null rows), followed by French (25), Spanish (21), and Standard Arabic (20), but the long tail spans 87 distinct values including narrow entries like Gheg Albanian and Samoan, yielding entropy 4.78 (ratio 0.74). The high cardinality relative to 238 rows means many languages appear only once or twice. Treatment: Group rare languages into an 'Other' bucket before one-hot or target encoding.
- n: 238
- nulls: 1 (0.4%)
- unique: 87
- top_value: English
- top_rate: 0.2658
- cardinality: 87
- entropy: 4.783
- entropy_ratio: 0.7423
BiblePortions
numeric feature · high_skew · outliers
A count per country, most plausibly the number of languages for which only portions of the Bible are available, across 238 rows with no nulls and only 35 unique values. The distribution is severely right-skewed (skew 5.57, kurtosis 36.36) with a median of 2 against a max of 161, and 26.05% of rows are zero. Roughly 10.9% of values flag as outliers, so a few countries dominate the tail. Treatment: Apply a log1p transform before modelling to tame the heavy right tail.
- n: 238
- nulls: 0 (0.0%)
- unique: 35
- min: 0
- max: 161
- mean: 7.681
- median: 2
- std: 18.63
- q1: 0
- q3: 6.75
- iqr: 6.75
- skew: 5.573
- kurtosis: 36.36
- n_outliers: 26
- outlier_rate: 0.1092
- zero_rate: 0.2605
ROG2
categorical feature
ROG2 is a low-cardinality categorical with 7 region-like codes (AFR, EUR, ASI, NAR, SOP, LAM, AUS) across 238 rows and no nulls. The distribution is fairly even (entropy ratio 0.893, with the top value AFR holding just 24.4%), though AUS is a tiny tail with only 2 records. The codes look like geographic groupings (Africa, Europe, Asia, North America, South Pacific, Latin America, Australia). Treatment: One-hot encode; consider merging AUS (n=2) into a neighbouring region or 'Other' to avoid sparse dummies.
- n: 238
- nulls: 0 (0.0%)
- unique: 7
- top_value: AFR
- top_rate: 0.2437
- cardinality: 7
- entropy: 2.508
- entropy_ratio: 0.8934
PercentIslam
numeric feature · outliers
Percentage (0–99.47) of Muslim population per country or territory. The distribution is right-skewed (skew 1.35, not enough to trigger the high_skew alert) with a median of just 2.51% against a mean of 22.31%; 17.2% of rows are exactly zero while 16.8% flag as outliers. The picture is bimodal: most countries have negligible Muslim populations and a minority are overwhelmingly Muslim. Treatment: Consider a log1p or logit transform, or bucket into low/medium/high bands, before modelling.
- n: 238
- nulls: 0 (0.0%)
- unique: 198
- min: 0
- max: 99.47
- mean: 22.31
- median: 2.514
- std: 34.82
- q1: 0.1012
- q3: 28.76
- iqr: 28.66
- skew: 1.349
- kurtosis: 0.1114
- n_outliers: 40
- outlier_rate: 0.1681
- zero_rate: 0.1723
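Bucketing into bands can be done with `pandas.cut`. The cut points of 5% and 50% are hypothetical choices for illustration, not thresholds taken from the report; the sample values are the reported min, median, Q3, and max.

```python
import pandas as pd

pct_islam = pd.Series([0.0, 2.51, 28.76, 99.47])

# Right-closed bins: [0, 5], (5, 50], (50, 100].
# include_lowest=True keeps exact zeros in the first band.
bands = pd.cut(
    pct_islam,
    bins=[0, 5, 50, 100],
    labels=["low", "medium", "high"],
    include_lowest=True,
)
```

For a bimodal column like this one, bands preserve the "negligible versus majority" split that a single linear coefficient would blur.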
RLG3Primary
numeric feature
RLG3Primary is a numeric code with 6 unique values between 1 and 7 across 238 rows and no nulls. It has the same cardinality as ReligionPrimary and is most plausibly the numeric ID for that label: the median of 1 matches a code held by the 66.8% Christian-primary majority, and the Q3 of 5.75 falls where the next-largest group begins. The codes are therefore nominal, and the mean (2.45), skew (0.98), and IQR (4.75) carry no quantitative meaning. Treatment: Treat as a nominal categorical and one-hot encode, or drop in favour of ReligionPrimary.
- n: 238
- nulls: 0 (0.0%)
- unique: 6
- min: 1
- max: 7
- mean: 2.45
- median: 1
- std: 2.219
- q1: 1
- q3: 5.75
- iqr: 4.75
- skew: 0.9751
- kurtosis: -0.9467
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
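Whether RLG3Primary is simply a coded form of ReligionPrimary can be sanity-checked against the summary statistics. The code-to-count pairing below is an assumption, not something stated in the report: it assigns the ReligionPrimary counts to codes 1, 2, 4, 5, 6, and 7 and checks that the reported mean, Q3, and std come out.

```python
import numpy as np

# Assumed pairing of code -> number of countries (counts from ReligionPrimary):
# 1 Christianity, 2 Buddhism, 4 Ethnic Religions, 5 Hinduism, 6 Islam, 7 Non-Religious.
counts_by_code = {1: 159, 2: 10, 4: 6, 5: 3, 6: 55, 7: 5}

# Expand into one code per country, already in ascending order.
codes = np.repeat(list(counts_by_code.keys()), list(counts_by_code.values()))

mean = codes.mean()
q3 = np.percentile(codes, 75)   # default linear interpolation
std = codes.std(ddof=1)         # sample standard deviation
```

Under this pairing the statistics agree with the reported mean of 2.45, Q3 of 5.75, and std of 2.219, which supports reading the column as a nominal religion ID; a cross-tabulation against ReligionPrimary on the real data would confirm it.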
Capital
categorical identifier · long_tail
This column lists capital cities, with 233 unique values across 238 rows and a null rate of 1.68% (4 rows). Cardinality is essentially one-per-row (entropy ratio 0.9997), and the only repeat is Kingston appearing twice, likely Jamaica and Norfolk Island sharing the name. It is effectively a near-unique label tied to the country/territory record. Treatment: Treat as a descriptive label and drop it from models; because of the nulls and the duplicate it is not a safe join key.
- n: 238
- nulls: 4 (1.7%)
- unique: 233
- top_value: Kingston
- top_rate: 0.008547
- cardinality: 233
- entropy: 7.862
- entropy_ratio: 0.9997
Population
numeric feature · high_skew · outliers
This is a country/region population count, with 238 rows and 230 unique values, no nulls. The distribution is extremely right-skewed (skew 9.10, kurtosis 89.06): the median is 5,606,500 but the max reaches 1,463,866,000, dwarfing the Q3 of 23,200,000. About 10.9% of values (26 rows) flag as outliers, consistent with a handful of population giants like China/India-scale entities. Treatment: log-transform before any modelling or aggregation.
- n: 238
- nulls: 0 (0.0%)
- unique: 230
- min: 50
- max: 1.464e+09
- mean: 3.459e+07
- median: 5.606e+06
- std: 1.378e+08
- q1: 399,250
- q3: 2.32e+07
- iqr: 2.28e+07
- skew: 9.099
- kurtosis: 89.06
- n_outliers: 26
- outlier_rate: 0.1092
- zero_rate: 0
CntPeoplesLR
numeric feature · high_skew · outliers
CntPeoplesLR is most plausibly the number of least-reached (LR) people groups in each country, with 57 distinct values across 238 rows and a 15.13% null rate. The distribution is severely right-skewed (skew 10.79, kurtosis 129.0): the median is 6.5 and Q3 is 24.75, yet the max reaches 2032 and the mean is 35.27 with std 157.42. 17 outliers (8.42% of non-null values) pull the tail dramatically, and no zeros are recorded. Treatment: log-transform (or winsorize the upper tail) and impute the ~15% nulls before modelling.
- n: 238
- nulls: 36 (15.1%)
- unique: 57
- min: 1
- max: 2,032
- mean: 35.27
- median: 6.5
- std: 157.4
- q1: 2
- q3: 24.75
- iqr: 22.75
- skew: 10.79
- kurtosis: 129
- n_outliers: 17
- outlier_rate: 0.08416
- zero_rate: 0
CntPeoplesFPG
numeric feature · null_rate · high_skew · outliers
A numeric count column, most plausibly the number of frontier people groups (FPG) in each country, with only 44 unique values across 238 rows and 28.57% nulls. The distribution is extremely right-skewed (skew 10.07, kurtosis 109.5): the median is 4 and Q3 is 12.75, yet the max reaches 1700, producing 18 outliers (10.6% of non-null values) and inflating the mean to 28.04 against a std of 143.69. Treatment: Impute the 28.57% nulls and apply a log1p transform before modelling to tame the heavy right tail.
- n: 238
- nulls: 68 (28.6%)
- unique: 44
- min: 1
- max: 1,700
- mean: 28.04
- median: 4
- std: 143.7
- q1: 1
- q3: 12.75
- iqr: 11.75
- skew: 10.07
- kurtosis: 109.5
- n_outliers: 18
- outlier_rate: 0.1059
- zero_rate: 0
PercentHinduism
numeric feature · high_skew · outliers
Country-level share of population identifying as Hindu, expressed as a percentage. The distribution is dominated by zeros (zero_rate 0.559, median 0) with a long right tail to 82.4, producing extreme skew (6.90) and kurtosis (53.0). 43 of 238 rows (18.1%) flag as outliers; with Q3 and IQR both at 0.27, a conventional 1.5×IQR upper fence would sit near 0.69%, so most of these are countries with small Hindu minorities rather than Hindu-majority countries (Hinduism is the primary religion in only 3). Treatment: Apply a log1p transform or a zero-inflated model, since most values are 0 with a heavy right tail.
- n: 238
- nulls: 0 (0.0%)
- unique: 106
- min: 0
- max: 82.4
- mean: 2.01
- median: 0
- std: 8.927
- q1: 0
- q3: 0.2745
- iqr: 0.2745
- skew: 6.898
- kurtosis: 53.02
- n_outliers: 43
- outlier_rate: 0.1807
- zero_rate: 0.5588
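The outlier count is easier to interpret once the fence is worked out. The sketch below assumes the profiler uses the common Tukey rule (upper fence = Q3 + 1.5 × IQR); the report does not state its rule, and `tukey_upper_fence` is a hypothetical helper name.

```python
def tukey_upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper outlier fence under Tukey's rule: Q3 + k * IQR."""
    return q3 + k * (q3 - q1)


# Quartiles reported above for PercentHinduism.
hindu_fence = tukey_upper_fence(q1=0.0, q3=0.2745)

# Quartiles reported for PercentChristianity, where no outliers were flagged.
christian_fence = tukey_upper_fence(q1=11.72, q3=90.92)
```

Under that assumption any country above roughly 0.69% Hindu counts as an outlier, while the Christianity fence lies above 100%, which is consistent with the zero outliers reported for that column.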
PercentOtherSmall
numeric feature · high_skew · outliers
PercentOtherSmall is the percentage of population following other, smaller religions, with 238 rows and 199 unique values, no nulls, 16.4% zeros, and a long right tail (median 0.29, max 12.84). Skew of 5.05 and kurtosis of 39.4 with 15 outliers (6.3%) signal a heavy-tailed distribution. Values exceed 1, which is consistent with the 0–100 scale used by the other Percent* columns rather than a 0–1 proportion. Treatment: Apply a log1p transform and consider a zero-inflation indicator before modelling.
- n: 238
- nulls: 0 (0.0%)
- unique: 199
- min: 0
- max: 12.84
- mean: 0.6962
- median: 0.2884
- std: 1.242
- q1: 0.01289
- q3: 0.9557
- iqr: 0.9428
- skew: 5.052
- kurtosis: 39.39
- n_outliers: 15
- outlier_rate: 0.06303
- zero_rate: 0.1639
PercentBuddhism
numeric feature · high_skew · outliers
This column appears to be the percentage of Buddhists per country, with 238 rows and 125 unique values. The distribution is extremely right-skewed (skew 4.59, kurtosis 20.4): nearly half the rows are zero (zero_rate 0.48), the median is 0.004%, yet the max reaches 88.74%. Saturn flagged 38 outliers (16% of rows); because Q3 and the IQR are both only 0.29, that set extends well beyond the 10 countries where Buddhism is the primary religion and includes countries with small Buddhist minorities. Treatment: Apply a log1p or similar transform before modelling to tame the heavy right tail.
- n: 238
- nulls: 0 (0.0%)
- unique: 125
- min: 0
- max: 88.74
- mean: 3.701
- median: 0.004281
- std: 14.74
- q1: 0
- q3: 0.2949
- iqr: 0.2949
- skew: 4.592
- kurtosis: 20.42
- n_outliers: 38
- outlier_rate: 0.1597
- zero_rate: 0.479
JPScaleCtry
numeric feature
JPScaleCtry holds an integer 1–5 code with 5 unique values across 238 rows and no nulls. It appears to be the numeric form of JPScaleText (1 = Unreached through 5 = Significantly Reached) rather than a survey response: weighting levels 1 to 5 by the JPScaleText counts reproduces the mean of 3.62. The distribution leans high (median 4, Q1=3, Q3=5) with a left skew of -0.78, meaning most countries sit at the upper end of the scale. No outliers are flagged. Treatment: Treat as an ordinal feature; consider ordered encoding rather than raw numeric use.
- n: 238
- nulls: 0 (0.0%)
- unique: 5
- min: 1
- max: 5
- mean: 3.622
- median: 4
- std: 1.473
- q1: 3
- q3: 5
- iqr: 2
- skew: -0.7839
- kurtosis: -0.8065
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
PercentChristianity
numeric feature
This column reports the percentage of Christians per country, spanning 0.02% to 100% with 237 unique values across 238 rows. The distribution is bimodal: the histogram piles up below about 7% (48 countries) and above 80% (108 countries), which puts the median (75.3%) far above the mean (58.2%) and gives a wide IQR from 11.7% to 90.9% with negative skew (-0.54). There are no nulls, no zeros, and no statistical outliers despite the extreme spread. Treatment: Use as-is or rescale to 0-1; consider pairing with the other religion-share columns since values are bounded percentages.
- n: 238
- nulls: 0 (0.0%)
- unique: 237
- min: 0.0165
- max: 100
- mean: 58.17
- median: 75.3
- std: 37.03
- q1: 11.72
- q3: 90.92
- iqr: 79.2
- skew: -0.5364
- kurtosis: -1.392
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
ISO3
categorical identifier · long_tail
ISO3 is the standard three-letter country code, with all 238 values unique (AFG, ALB, DZA, ASM, ...) and zero nulls. Maximum entropy ratio (≈1.0) and a top_rate of 0.0042 confirm one row per country, making this a clean primary key for the table. Treatment: use as the join key to merge country-level data; do not feed into models as a feature.
- n: 238
- nulls: 0 (0.0%)
- unique: 238
- top_value: AFG
- top_rate: 0.004202
- cardinality: 238
- entropy: 7.895
- entropy_ratio: 1
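A join on ISO3 might look like the sketch below. The `other` table and its `extra_metric` column are hypothetical stand-ins for whatever country-level data is being merged; only the ISO3 codes and country names come from this report.

```python
import pandas as pd

countries = pd.DataFrame({
    "ISO3": ["AFG", "ALB", "DZA"],
    "Ctry": ["Afghanistan", "Albania", "Algeria"],
})

# Hypothetical external table keyed by the same ISO3 codes.
other = pd.DataFrame({
    "ISO3": ["AFG", "DZA"],
    "extra_metric": [1.0, 2.0],
})

merged = countries.merge(
    other,
    on="ISO3",
    how="left",              # keep every country row from the profile table
    validate="one_to_one",   # raise if either side has duplicate ISO3 keys
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
```

The `validate` argument guards the one-row-per-country property the profile relies on, and `_merge` shows which countries found no match in the other table.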