joshua project enriched
Reading
This is the Joshua Project people-groups dataset: 16,382 rows and 109 columns describing ethnic groups by country, with demographics, language, religion mix, and Christian-engagement indicators. Most columns are categorical or text; Continent, AffinityBloc, PrimaryReligion, and the JPScale 'reachedness' rating give the cleanest first read on who is in the file. Population is extremely long-tailed (median 20,000, max ~913M, skew ~91), so any size analysis should use logs or quantiles rather than means. Religion-share columns such as PCIslam, PCHinduism, and PCBuddhism are mostly zero with a minority of groups at very high percentages, which suggests that most groups have a single dominant religion. Watch for columns with very high null rates (NomadicTypeDescription 98%, RLG4 96%, PrimaryLanguageDialect 92%) and a more moderate one (NTOnline 29%), and for the many near-duplicate URL and ID fields that add no analytic value.
citing: Continent · AffinityBloc · PrimaryReligion · JPScaleText · Population · PCIslam · PCHinduism · RegionName · LeastReached · Frontier
Charts the summary said to look at first
Continent
| value | count | share |
|---|---|---|
| Asia | 7368 | 45.0% |
| Africa | 3635 | 22.2% |
| Europe | 1532 | 9.4% |
| North America | 1407 | 8.6% |
| Australia | 1088 | 6.6% |
| South America | 905 | 5.5% |
| Oceania | 447 | 2.7% |
PrimaryReligion
| value | count | share |
|---|---|---|
| Christianity | 6459 | 39.4% |
| Islam | 3786 | 23.1% |
| Ethnic Religions | 2651 | 16.2% |
| Hinduism | 2338 | 14.3% |
| Buddhism | 635 | 3.9% |
| Non-Religious | 200 | 1.2% |
| Unknown | 189 | 1.2% |
| Other / Small | 124 | 0.8% |
JPScaleText
| value | count | share |
|---|---|---|
| Unreached | 7124 | 43.5% |
| Partially Reached | 3636 | 22.2% |
| Significantly Reached | 3200 | 19.5% |
| Superficially Reached | 1413 | 8.6% |
| Minimally Reached | 1009 | 6.2% |
Population (histogram bins)
| bin | count |
|---|---|
| 10 – 2.282e+07 | 16302 |
| 2.282e+07 – 4.565e+07 | 32 |
| 4.565e+07 – 6.847e+07 | 12 |
| 6.847e+07 – 9.13e+07 | 3 |
| 9.13e+07 – 1.141e+08 | 3 |
| 1.141e+08 – 1.369e+08 | 3 |
| 1.369e+08 – 1.598e+08 | 0 |
| 1.598e+08 – 1.826e+08 | 0 |
| 1.826e+08 – 2.054e+08 | 1 |
| 2.054e+08 – 2.282e+08 | 0 |
| 2.282e+08 – 2.511e+08 | 0 |
| 2.511e+08 – 2.739e+08 | 0 |
| 2.739e+08 – 2.967e+08 | 0 |
| 2.967e+08 – 3.195e+08 | 0 |
| 3.195e+08 – 3.424e+08 | 0 |
| 3.424e+08 – 3.652e+08 | 0 |
| 3.652e+08 – 3.88e+08 | 0 |
| 3.88e+08 – 4.108e+08 | 0 |
| 4.108e+08 – 4.337e+08 | 0 |
| 4.337e+08 – 4.565e+08 | 0 |
| 4.565e+08 – 4.793e+08 | 0 |
| 4.793e+08 – 5.021e+08 | 0 |
| 5.021e+08 – 5.249e+08 | 0 |
| 5.249e+08 – 5.478e+08 | 0 |
| 5.478e+08 – 5.706e+08 | 0 |
| 5.706e+08 – 5.934e+08 | 0 |
| 5.934e+08 – 6.162e+08 | 0 |
| 6.162e+08 – 6.391e+08 | 0 |
| 6.391e+08 – 6.619e+08 | 0 |
| 6.619e+08 – 6.847e+08 | 0 |
| 6.847e+08 – 7.075e+08 | 0 |
| 7.075e+08 – 7.304e+08 | 0 |
| 7.304e+08 – 7.532e+08 | 0 |
| 7.532e+08 – 7.76e+08 | 0 |
| 7.76e+08 – 7.988e+08 | 0 |
| 7.988e+08 – 8.217e+08 | 0 |
| 8.217e+08 – 8.445e+08 | 0 |
| 8.445e+08 – 8.673e+08 | 0 |
| 8.673e+08 – 8.901e+08 | 0 |
| 8.901e+08 – 9.13e+08 | 1 |
AffinityBloc
| value | count | share |
|---|---|---|
| South Asian Peoples | 4178 | 25.5% |
| Sub-Saharan Peoples | 3073 | 18.8% |
| Eurasian Peoples | 1593 | 9.7% |
| Pacific Islanders | 1588 | 9.7% |
| Latin-Caribbean Americans | 1352 | 8.3% |
| Malay Peoples | 1031 | 6.3% |
| Southeast Asian Peoples | 635 | 3.9% |
| Arab World | 634 | 3.9% |
| Tibetan-Himalayan Peoples | 453 | 2.8% |
| North American Peoples | 415 | 2.5% |
| East Asian Peoples | 402 | 2.5% |
| Turkic Peoples | 299 | 1.8% |
| Persian-Median | 228 | 1.4% |
| Horn of Africa Peoples | 200 | 1.2% |
| Deaf | 164 | 1.0% |
| Jewish | 137 | 0.8% |
Schema
109 columns

| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| PeopleID3ROG3 | text | 0.0% | 16,382 | near_unique, one_word, allcaps, short_text |
| ROG3 | categorical | 0.0% | 238 | |
| PeopleID3 | numeric | 0.0% | 10,415 | |
| ROP3 | numeric | 0.1% | 10,405 | |
| PeopNameInCountry | text | 0.0% | 10,748 | one_word, duplicates |
| ROG2 | categorical | 0.0% | 7 | |
| Continent | categorical | 0.0% | 7 | |
| RegionName | categorical | 0.0% | 12 | |
| ISO3 | categorical | 0.0% | 238 | |
| LocationInCountry | text | 45.1% | 7,794 | multilingual, null_rate |
| PeopleID1 | numeric | 0.0% | 16 | |
| ROP1 | categorical | 0.0% | 16 | |
| AffinityBloc | categorical | 0.0% | 16 | |
| PeopleID2 | numeric | 0.0% | 267 | |
| ROP2 | categorical | 0.0% | 214 | |
| PeopleCluster | categorical | 0.0% | 267 | |
| PeopNameAcrossCountries | text | 0.0% | 10,377 | one_word, duplicates |
| Population | numeric | 0.2% | 1,708 | high_skew, outliers |
| Category | categorical | 0.0% | 3 | |
| ROL3 | text | 0.0% | 6,164 | one_word, short_text, duplicates |
| PrimaryLanguageName | text | 0.0% | 6,153 | one_word, short_text, duplicates |
| PrimaryLanguageDialect | categorical | 92.3% | 980 | long_tail, null_rate |
| NumberLanguagesSpoken | numeric | 0.0% | 78 | high_skew, outliers |
| OfficialLang | categorical | 0.0% | 87 | |
| SpeakNationalLang | unknown | 0.0% | — | skipped |
| BibleStatus | numeric | 0.0% | 6 | |
| BibleYear | categorical | 52.4% | 466 | null_rate |
| NTYear | text | 30.5% | 1,072 | one_word, allcaps, null_rate, short_text, duplicates |
| PortionsYear | text | 17.9% | 1,737 | one_word, allcaps, short_text, duplicates |
| TranslationNeedQuestionable | unknown | 0.0% | — | skipped |
| JPScale | numeric | 0.0% | 5 | |
| JPScalePC | categorical | 0.0% | 5 | |
| JPScalePGAC | categorical | 0.0% | 5 | |
| LeastReached | categorical | 0.0% | 2 | |
| LeastReachedPC | categorical | 0.0% | 2 | |
| LeastReachedPGAC | categorical | 0.0% | 2 | |
| GSEC | categorical | 0.0% | 8 | |
| HasAudioRecordings | categorical | 0.0% | 2 | |
| NTOnline | categorical | 28.5% | 1 | null_rate, imbalance |
| RLG3 | numeric | 0.0% | 8 | |
| RLG3PC | numeric | 0.0% | 8 | |
| RLG3PGAC | numeric | 0.0% | 8 | |
| PrimaryReligion | categorical | 0.0% | 8 | |
| PrimaryReligionPC | categorical | 0.0% | 8 | |
| PrimaryReligionPGAC | categorical | 0.0% | 8 | |
| RLG4 | numeric | 96.2% | 20 | null_rate, outliers |
| ReligionSubdivision | categorical | 96.2% | 20 | null_rate |
| PCIslam | numeric | 0.5% | 1,117 | outliers |
| PCNonReligious | numeric | 0.4% | 223 | high_skew, outliers |
| PCUnknown | numeric | 0.6% | 583 | high_skew |
| SecurityLevel | numeric | 0.0% | 3 | |
| LRTop100 | categorical | 0.0% | 2 | imbalance |
| PhotoAddress | text | 0.0% | 5,277 | one_word, short_text, duplicates |
| PhotoCredits | text | 0.1% | 1,605 | one_word, duplicates |
| PhotoCreditURL | text | 33.1% | 1,434 | one_word, url_heavy, null_rate, duplicates |
| PhotoCreativeCommons | categorical | 0.0% | 2 | |
| PhotoCopyright | categorical | 0.1% | 2 | |
| PhotoPermission | categorical | 0.1% | 3 | |
| ProfileTextExists | categorical | 0.0% | 2 | |
| CountOfCountries | numeric | 0.0% | 48 | high_skew, outliers |
| CountOfProvinces | unknown | 0.0% | — | skipped |
| Longitude | numeric | 0.0% | 15,889 | high_skew |
| Latitude | numeric | 0.0% | 15,851 | |
| Ctry | categorical | 0.0% | 238 | |
| IndigenousCode | categorical | 0.0% | 2 | |
| PercentAdherents | text | 0.0% | 1,248 | one_word, allcaps, short_text, duplicates |
| PercentChristianPC | categorical | 0.0% | 246 | |
| NaturalName | text | 0.0% | 10,737 | one_word, duplicates |
| NaturalPronunciation | text | 69.6% | 1,933 | one_word, null_rate, duplicates |
| PercentChristianPGAC | text | 0.1% | 1,954 | one_word, allcaps, short_text, duplicates |
| PercentEvangelical | text | 6.7% | 1,047 | one_word, allcaps, short_text, duplicates |
| PercentEvangelicalPC | categorical | 1.0% | 228 | |
| PercentEvangelicalPGAC | text | 4.5% | 1,624 | one_word, allcaps, short_text, duplicates |
| PCBuddhism | numeric | 0.6% | 1,052 | high_skew, outliers |
| PCEthnicReligions | numeric | 0.4% | 978 | outliers |
| PCHinduism | numeric | 0.6% | 1,412 | high_skew, outliers |
| PCOtherSmall | numeric | 0.6% | 908 | high_skew, outliers |
| RegionCode | numeric | 0.0% | 12 | |
| PopulationPGAC | numeric | 0.1% | 2,250 | high_skew, outliers |
| Frontier | categorical | 0.0% | 2 | |
| MapAddress | text | 0.0% | 6,029 | one_word, short_text, duplicates |
| HasJesusFilm | categorical | 0.0% | 2 | |
| Nomadic | categorical | 0.0% | 2 | imbalance |
| NomadicTypeDescription | categorical | 98.1% | 6 | null_rate |
| PhotoCCVersionText | categorical | 0.0% | 17 | |
| PhotoCCVersionURL | categorical | 0.0% | 17 | |
| MapCredits | categorical | 0.0% | 199 | long_tail |
| MapCreditURL | categorical | 0.0% | 51 | long_tail, imbalance |
| MapCopyright | categorical | 0.0% | 3 | |
| MapCCVersionText | categorical | 0.0% | 6 | imbalance |
| MapCCVersionURL | categorical | 0.0% | 6 | imbalance |
| JF | categorical | 0.0% | 2 | |
| AudioRecordings | categorical | 0.0% | 2 | |
| Window1040 | categorical | 0.0% | 2 | |
| PeopleGroupMapURL | text | 0.0% | 6,029 | one_word, url_heavy, duplicates |
| PeopleGroupMapExpandedURL | text | 0.0% | 5,561 | one_word, url_heavy, duplicates |
| PeopleGroupURL | text | 0.0% | 16,382 | near_unique, one_word, url_heavy |
| PeopleGroupPhotoURL | text | 0.0% | 5,277 | one_word, url_heavy, duplicates |
| CountryURL | categorical | 0.0% | 238 | |
| JPScaleText | categorical | 0.0% | 5 | |
| JPScaleImageURL | categorical | 0.0% | 5 | |
| Summary | text | 0.0% | 3,778 | one_word, duplicates |
| Obstacles | text | 0.0% | 3,732 | one_word, duplicates |
| HowReach | text | 0.0% | 2,944 | one_word, duplicates |
| PrayForChurch | text | 0.0% | 1,791 | one_word, duplicates |
| PrayForPG | text | 0.0% | 3,530 | one_word, duplicates |
| Resources | unknown | 0.0% | — | skipped |
| country_data | unknown | 0.0% | — | skipped |
| language_data | unknown | 0.0% | — | skipped |
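The null rates in the schema can be turned into a quick triage list. This is a minimal sketch, assuming the rates have been copied by hand into a plain dict (only a subset of columns is shown) and using 90% and 50% cut-offs that are illustrative choices, not something the profiler defines.

```python
# Null rates (percent) copied from the schema table; subset only.
null_rates = {
    "NomadicTypeDescription": 98.1,
    "RLG4": 96.2,
    "ReligionSubdivision": 96.2,
    "PrimaryLanguageDialect": 92.3,
    "NaturalPronunciation": 69.6,
    "BibleYear": 52.4,
    "LocationInCountry": 45.1,
    "NTOnline": 28.5,
    "Population": 0.2,
}

# Hypothetical thresholds, chosen for illustration only.
DROP_ABOVE = 90.0
FLAG_ABOVE = 50.0


def triage(rates, drop_above=DROP_ABOVE, flag_above=FLAG_ABOVE):
    """Split columns into 'drop', 'flag' and 'keep' buckets by null rate."""
    buckets = {"drop": [], "flag": [], "keep": []}
    # Walk columns from the highest null rate to the lowest.
    for column, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
        if rate > drop_above:
            buckets["drop"].append(column)
        elif rate > flag_above:
            buckets["flag"].append(column)
        else:
            buckets["keep"].append(column)
    return buckets


buckets = triage(null_rates)
for name, columns in buckets.items():
    print(name, columns)
```

Columns in the "flag" bucket are candidates for a missingness indicator rather than outright removal; where the thresholds should sit depends on the analysis.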
PeopleID3ROG3
text · identifier · near_unique · one_word · allcaps · short_text
PeopleID3ROG3 is almost certainly a row-level identifier for a people group within a country: every one of the 16,382 rows has a unique 7-character single token (n_unique equals n, duplicate_rate 0, len_min and len_max both 7). The sampled values are a 5-digit numeric prefix followed by a 2-letter suffix (e.g. '10375tz', '10375up'), which matches the column name and suggests a composite key built from PeopleID3 and the ROG3 country code rather than a random hash. The samples are shown lowercased although allcaps_rate is 1, so the displayed casing may not reflect the stored values. No nulls, no boilerplate, and no duplicates: clean, but useless as a feature.
Treatment: Drop from modelling; retain only as a join key.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 16,382
- len_min: 7
- len_max: 7
- len_mean: 7
- len_median: 7
- len_p95: 7
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 16,382
- readability_flesch_mean: 115.3
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
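The composite-key reading of PeopleID3ROG3 can be checked by splitting each value into its two parts. This is a sketch only: the five-digits-plus-two-letters pattern is inferred from the two sampled values quoted in the profile, and `split_key` is a hypothetical helper, not part of the dataset or the profiler.

```python
import re

# Pattern assumed from the sampled values: five digits, then two letters.
KEY_PATTERN = re.compile(r"^(\d{5})([A-Za-z]{2})$")


def split_key(key):
    """Return (people_id3, rog3) for a composite key, or None if it does not fit."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    # Upper-case the suffix so it can be compared with the ROG3 column.
    return int(match.group(1)), match.group(2).upper()


# The two sampled values from the profile, plus one deliberately malformed key.
for key in ["10375tz", "10375up", "1037tz"]:
    print(key, "->", split_key(key))
```

If every row splits cleanly and the parts equal that row's PeopleID3 and ROG3, the column is fully redundant with those two fields.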
ROG3
categorical · feature
ROG3 holds 238 distinct two-letter country codes, with IN (2,262 rows, 13.8%) leading, followed by PP, ID, PK, CH, and NI. There are no nulls across 16,382 rows, and the entropy ratio of 0.79 indicates a broad spread across many countries with moderate concentration in the top few. 'PP' is not an ISO 3166-1 alpha-2 code, and the ISO3 column lists IND, PNG, IDN, PAK, and CHN in the same order, so these are probably FIPS 10-4 style codes (where PP is Papua New Guinea and CH is China) rather than ISO codes; this is worth verifying.
Treatment: Treat as a high-cardinality categorical: target- or frequency-encode for modelling, and use ISO3 rather than ROG3 when joining to ISO-keyed reference data.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 238
- top_value: IN
- top_rate: 0.1381
- cardinality: 238
- entropy: 6.225
- entropy_ratio: 0.7885
PeopleID3
numeric · foreign_key
PeopleID3 is an integer key ranging from 10,119 to 22,661 with 10,415 unique values across 16,382 rows and no nulls. The duplication (5,967 rows repeat a value already seen) and the bounded, non-zero range are consistent with a foreign-key reference to a people table rather than a measurement. The shape statistics (skew 0.37, kurtosis -0.98, no outliers flagged) carry little meaning for an identifier.
Treatment: Treat as a foreign key and left-join to the people dimension; do not use as a numeric feature.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 10,415
- min: 10,119
- max: 22,661
- mean: 1.541e+04
- median: 14,962
- std: 3478
- q1: 12,348
- q3: 1.833e+04
- iqr: 5984
- skew: 0.3697
- kurtosis: -0.9812
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
ROP3
numeric · identifier
ROP3 is a numeric column tightly bounded between 100,004 and 119,649 across 16,382 rows, with 10,405 unique values and almost no nulls (0.07%). The mean (109,058.6) and median (108,856) sit close together with low skew (0.15) and a slightly platykurtic shape (kurtosis -1.05), and the profiler flagged zero outliers. The narrow ~20k range starting just above 100,000 looks like a coded identifier rather than a free-ranging measurement, and the unique count nearly matches PeopleID3's 10,415, which suggests a parallel people-level code.
Treatment: Treat as a categorical code rather than a continuous feature; do not scale or log-transform.
- n: 16,382
- nulls: 11 (0.1%)
- unique: 10,405
- min: 100,004
- max: 119,649
- mean: 1.091e+05
- median: 108,856
- std: 5405
- q1: 104,189
- q3: 113,527
- iqr: 9,338
- skew: 0.1507
- kurtosis: -1.053
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
PeopNameInCountry
text · label · one_word · duplicates
Short ethnonym or people-group labels naming a population within a country (e.g., 'French', 'British', 'Han Chinese, Mandarin'), with a median length of 9 characters and a median word count of 1. 53% of values are single words and 34% are duplicates (5,634 rows), so the same group label recurs across many country rows; 'Deaf' tops the list at 164. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' show that this is a controlled people-group taxonomy rather than free text.
Treatment: Treat as a categorical people-group label; normalize casing and parentheticals, and join with country to form the unique key.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 10,748
- len_min: 1
- len_max: 41
- len_mean: 11.46
- len_median: 9
- len_p95: 25
- word_mean: 1.636
- word_median: 1
- n_empty: 0
- n_duplicates: 5,634
- duplicate_rate: 0.3439
- vocab_size: 11,539
- readability_flesch_mean: 49.27
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.5313
- allcaps_rate: 0
- boilerplate_rate: 0
ROG2
categorical · feature
ROG2 is a low-cardinality categorical with 7 three-letter codes that look like macro-region groupings (ASI, AFR, EUR, NAR, AUS, LAM, SOP). The distribution is uneven but not degenerate: ASI dominates at 45.0% of 16,382 rows, AFR follows at 3,635 rows, and the entropy ratio of 0.80 confirms a broad spread across the remaining buckets. Its cardinality, top_rate, and entropy are identical to Continent's, so it is most likely the coded form of that column. No nulls and no rare-value tail beyond the seven codes.
Treatment: One-hot or target-encode; safe to use directly, but keep only one of ROG2 and Continent.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 7
- top_value: ASI
- top_rate: 0.4498
- cardinality: 7
- entropy: 2.257
- entropy_ratio: 0.8039
Continent
categorical · feature
Categorical continent label with 7 values and zero nulls across 16,382 rows. Asia dominates at 44.98% (7,368 rows), followed by Africa at 3,635; the entropy ratio of 0.80 confirms a moderately skewed but not degenerate distribution. Australia (1,088) and Oceania (447) appear as separate categories, which is unusual. ROG2 carries the same split (AUS and SOP), so the separation may be a deliberate regional scheme rather than inconsistent coding; check which countries fall under each label before merging.
Treatment: One-hot encode; merge Australia and Oceania into a single category only if the analysis calls for standard continents.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 7
- top_value: Asia
- top_rate: 0.4498
- cardinality: 7
- entropy: 2.257
- entropy_ratio: 0.8039
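The entropy and entropy_ratio figures can be reproduced from the Continent counts in the first data table. The sketch below assumes the profiler uses Shannon entropy in bits, normalised by the log2 of the number of categories; that definition is not stated in the report, but it matches the published values of 2.257 and 0.8039.

```python
import math

# Continent counts from the first data table in this report.
counts = {
    "Asia": 7368,
    "Africa": 3635,
    "Europe": 1532,
    "North America": 1407,
    "Australia": 1088,
    "South America": 905,
    "Oceania": 447,
}


def shannon_entropy(counts):
    """Shannon entropy, in bits, of a distribution given as category counts."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


entropy = shannon_entropy(counts)
# Normalise by the maximum possible entropy for this many categories.
entropy_ratio = entropy / math.log2(len(counts))
print(round(entropy, 3), round(entropy_ratio, 4))
```

A ratio of 1 would mean all seven categories are equally common; values near 0.8 indicate one or two dominant categories with a meaningful remainder.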
RegionName
categorical · feature
RegionName is a categorical geographic grouping with 12 distinct world regions and no nulls across 16,382 rows. Distribution is fairly balanced (entropy ratio 0.93), though 'Asia, South' leads at 22.6% (3,707 rows) and the top three regions are all Asian or African. The labels use a 'Continent, Subregion' convention which may need parsing if continent-level rollups are wanted.
Treatment: One-hot or target-encode for modelling; optionally split on the comma to derive a continent column.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 12
- top_value: Asia, South
- top_rate: 0.2263
- cardinality: 12
- entropy: 3.325
- entropy_ratio: 0.9275
ISO3
categorical · foreign_key
This column holds ISO 3166-1 alpha-3 country codes across 238 distinct values with no nulls, consistent with a country dimension key. India (IND) dominates at 13.8% (2,262 rows), followed by PNG, IDN, PAK and CHN, indicating an Asia-Pacific skew rather than uniform global coverage. The entropy ratio of 0.79 indicates a broad spread with moderate concentration in a few countries.
Treatment: Use as a country join key; consider grouping long-tail codes or stratifying analyses to handle the IND-heavy skew.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 238
- top_value: IND
- top_rate: 0.1381
- cardinality: 238
- entropy: 6.225
- entropy_ratio: 0.7885
LocationInCountry
text · free_text · multilingual · null_rate
Free-text descriptions of where a group is located within a country, mixing geographic prose ('Primarily north', 'Madang province.') with longer ethnographic paragraphs up to 994 characters. Nearly half the rows are null (45.12%) and 13.3% of the non-null values are duplicate strings, with stock phrases like 'Widespread.' (103) and 'Scattered.' recurring. Although 2,618 entries register as English, small pockets of Spanish (12), Portuguese (10) and nine other languages appear, and Flesch readability averages a difficult 38.1.
Treatment: Normalize boilerplate phrases and tokenize or embed for semantic use; do not treat as a categorical.
- n: 16,382
- nulls: 7,392 (45.1%)
- unique: 7,794
- len_min: 3
- len_max: 994
- len_mean: 108.2
- len_median: 79
- len_p95: 314
- word_mean: 15.48
- word_median: 11
- n_empty: 0
- n_duplicates: 1,196
- duplicate_rate: 0.133
- vocab_size: 25,733
- readability_flesch_mean: 38.13
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.02725
- allcaps_rate: 0
- boilerplate_rate: 0
PeopleID1
numeric · feature
PeopleID1 is stored as numeric but takes only 16 distinct integer values across 16,382 rows, ranging from 10 to 26 with a median of 20. The bounded range and low cardinality mark it as a grouping code rather than a continuous measurement, so the reported skew (-0.79) is not meaningful. Its 16 levels match the 16 levels of ROP1 and AffinityBloc, which suggests it is the numeric ID of the affinity bloc.
Treatment: Cast to categorical and one-hot encode rather than treating as continuous; if it maps one-to-one to AffinityBloc, keep only one of the two.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 16
- min: 10
- max: 26
- mean: 18.58
- median: 20
- std: 3.921
- q1: 16
- q3: 21
- iqr: 5
- skew: -0.7942
- kurtosis: -0.5112
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
ROP1
categorical · feature
ROP1 is a low-cardinality categorical code with 16 distinct values (all prefixed 'A0##'), fully populated across 16,382 rows. The distribution is moderately concentrated: 'A012' alone covers 25.5% and 'A013' adds another ~19%, while the entropy ratio of 0.83 indicates a reasonably even spread among the rest. Its cardinality, top_rate, and entropy are identical to AffinityBloc's, so it is most likely the code for that label rather than an independent attribute.
Treatment: One-hot or target-encode for modelling; keep only one of ROP1 and AffinityBloc if the mapping is confirmed.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 16
- top_value: A012
- top_rate: 0.255
- cardinality: 16
- entropy: 3.322
- entropy_ratio: 0.8306
AffinityBloc
categorical · feature
AffinityBloc is a categorical grouping of populations into 16 broad ethno-geographic blocs, with no nulls across 16,382 rows. The distribution is moderately concentrated: "South Asian Peoples" leads at 25.5% (4,178 rows), followed by "Sub-Saharan Peoples" (3,073), while the entropy ratio of 0.83 indicates the remaining 14 categories carry meaningful mass. Labels mix regional and ethnolinguistic framings (e.g., "Arab World" alongside "Tibetan-Himalayan Peoples"), which an analyst should note for taxonomy consistency.
Treatment: One-hot or target-encode for modelling; audit label taxonomy for overlap before grouping.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 16
- top_value: South Asian Peoples
- top_rate: 0.255
- cardinality: 16
- entropy: 3.322
- entropy_ratio: 0.8306
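PeopleID1, ROP1, and AffinityBloc all have 16 levels, and ROP1 and AffinityBloc share the same top_rate and entropy, which points to a one-to-one mapping between code and label. A check along the following lines would confirm or refute that. The sample pairs are hypothetical (the pairing of 'A012' with "South Asian Peoples" is assumed from the matching top rates, not read from the data), and `is_one_to_one` is an illustrative helper.

```python
from collections import defaultdict


def is_one_to_one(pairs):
    """True if each left value maps to exactly one right value and vice versa."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)
    return all(len(v) == 1 for v in left_to_right.values()) and all(
        len(v) == 1 for v in right_to_left.values()
    )


# Hypothetical (ROP1, AffinityBloc) pairs; the code-to-label pairing is assumed.
sample = [
    ("A012", "South Asian Peoples"),
    ("A012", "South Asian Peoples"),
    ("A013", "Sub-Saharan Peoples"),
]

print(is_one_to_one(sample))
# Adding a conflicting pair breaks the mapping.
print(is_one_to_one(sample + [("A013", "South Asian Peoples")]))
```

The same check applies to PeopleID2 against PeopleCluster and to ROG2 against Continent.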
PeopleID2
numeric · foreign_key
PeopleID2 is a numeric people-identifier code with only 267 distinct values across 16,382 rows, so each ID repeats heavily and behaves like a foreign key rather than a measurement. Values span 100 to 479 with no nulls or outliers, consistent with a bounded code rather than a quantity to model. The 267 levels match PeopleCluster's 267 levels, which suggests it is the numeric ID of the people cluster.
Treatment: Treat as a categorical key and left-join on it rather than using as a numeric feature.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 267
- min: 100
- max: 479
- mean: 283.7
- median: 268
- std: 114.6
- q1: 183
- q3: 402
- iqr: 219
- skew: 0.2451
- kurtosis: -1.126
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
ROP2
categorical · feature
ROP2 is a categorical code field with 214 distinct alphanumeric values (e.g., 'A012', 'C0152') across 16,382 rows and no nulls, which looks like a controlled people-classification code one level below ROP1. The distribution is heavy-tailed: 'A012' alone covers 25.4% of rows and the next code 'C0152' another ~7%, while the entropy ratio sits at 0.76. 'A012' is also the top ROP1 value at almost the same rate (25.5%), so the 'A'-prefixed entries may be bloc-level codes carried down where no 'C'-prefixed code is assigned; this would also explain why ROP2 has 214 levels against PeopleCluster's 267.
Treatment: Group rare codes into an 'other' bucket and target- or one-hot encode the high-frequency levels; check how 'A'-prefixed rows relate to PeopleCluster first.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 214
- top_value: A012
- top_rate: 0.2545
- cardinality: 214
- entropy: 5.901
- entropy_ratio: 0.7622
PeopleCluster
categorical · feature
PeopleCluster is a categorical ethnographic grouping with 267 distinct values across 16,382 rows and no nulls. The distribution is broad (entropy ratio 0.86) but with a notable concentration: 'New Guinea' accounts for 6.95% of rows (1,139), followed by 'South Asia Hindu - other' (935) and 'South Asia Muslim - other' (592). The labels mix geographic, religious, and tribal descriptors, so several '... - other' buckets are doing heavy lifting.
Treatment: Group rare clusters and target- or frequency-encode before modelling.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 267
- top_value: New Guinea
- top_rate: 0.06953
- cardinality: 267
- entropy: 6.929
- entropy_ratio: 0.8596
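Grouping rare levels and frequency-encoding, as recommended for PeopleCluster and the other high-cardinality codes, can be sketched in a few lines. The toy column below reuses two cluster labels from the profile, but the counts and the 'Example Cluster' label are made up, and the `min_count` threshold is an arbitrary illustration.

```python
from collections import Counter


def bucket_rare(values, min_count, other_label="other"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def frequency_encode(values):
    """Map each value to its relative frequency within the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


# Toy column: real labels, made-up counts, plus one hypothetical rare label.
clusters = (
    ["New Guinea"] * 3
    + ["South Asia Hindu - other"] * 2
    + ["Example Cluster"]
)

bucketed = bucket_rare(clusters, min_count=2)
encoded = frequency_encode(bucketed)
print(bucketed)
print(encoded)
```

Bucketing before encoding keeps rare clusters from receiving near-zero frequencies that a model could overfit to.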
PeopNameAcrossCountries
text · foreign_key · one_word · duplicates
This column holds people-group or ethnicity names repeated across countries, with 10,377 unique labels over 16,382 rows and a 36.7% duplicate rate (6,005 repeats). Entries are short (median 8 chars, mean 1.57 words) and 59% are single-word labels like 'Deaf' (164), 'French' (82), or 'British' (81). The frequent fragments '(hindu' (516) and '(muslim' (424) alongside 'traditions)' (1,038) suggest religious-tradition qualifiers in parentheses are a common naming convention, and the same group name recurs because it appears in multiple country contexts.
Treatment: Treat as a people-group key; normalize casing and parentheticals, and join with country to form a unique grouping key.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 10,377
- len_min: 1
- len_max: 42
- len_mean: 10.93
- len_median: 8
- len_p95: 25
- word_mean: 1.568
- word_median: 1
- n_empty: 0
- n_duplicates: 6,005
- duplicate_rate: 0.3666
- vocab_size: 10,446
- readability_flesch_mean: 49.45
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.5899
- allcaps_rate: 0
- boilerplate_rate: 0
Population
numeric · feature · high_skew · outliers
This is a population count column with 16,382 rows (25 null) and only 1,708 unique values, suggesting many shared or rounded figures. The distribution is extremely heavy-tailed: the median is 20,000 but the max is 912,955,000, with skew 91.04 and kurtosis 10,050.74, and 15.0% of rows flag as outliers. The mean (499,468) sits far above Q3 (93,000), indicating that a small number of very large people groups dominate.
Treatment: Log-transform before any modelling or distance-based analysis.
- n: 16,382
- nulls: 25 (0.2%)
- unique: 1,708
- min: 10
- max: 9.13e+08
- mean: 4.995e+05
- median: 20,000
- std: 8.066e+06
- q1: 4,300
- q3: 93,000
- iqr: 88,700
- skew: 91.04
- kurtosis: 1.005e+04
- n_outliers: 2,455
- outlier_rate: 0.1501
- zero_rate: 0
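A short worked calculation shows why the raw scale is unusable and what a log scale buys. The sketch assumes Tukey's 1.5 × IQR rule for the outlier fences; the profiler's actual outlier rule is not stated in the report, so the fence values are illustrative.

```python
import math

# Summary statistics copied from the Population profile.
q1, q3 = 4_300, 93_000
minimum, median, maximum = 10, 20_000, 912_955_000

# Tukey's rule (an assumption; the profiler's outlier rule is not documented).
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

print("IQR:", iqr)
print("upper fence:", upper_fence)
print("lower fence:", lower_fence)  # negative, so no low-side outliers are possible

# On a log10 scale the same three landmarks sit within a few units of each other.
for label, value in [("min", minimum), ("median", median), ("max", maximum)]:
    print(label, round(math.log10(value), 2))
```

Under this rule, every group above roughly 226,000 people counts as an outlier, which is why the flag covers about 15% of rows and is better read as a sign of skew than as a data-quality problem.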
Category
categorical · feature
A 3-level categorical with no nulls across 16,382 rows, encoded as the strings "1", "2", and "3". Class "1" dominates at 53.1% (8,705 rows) and "2" is the minority at 1,360 rows, giving a moderately imbalanced distribution (entropy ratio 0.83). The numeric-string labels suggest an ordinal or coded category whose meaning is not self-evident from the values alone.
Treatment: One-hot or ordinal encode; consider class-imbalance handling if used as a target.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 3
- top_value: 1
- top_rate: 0.5314
- cardinality: 3
- entropy: 1.313
- entropy_ratio: 0.8284
ROL3
text · feature · one_word · short_text · duplicates
ROL3 holds three-letter language codes in the ISO 639-3 style (every value is exactly 3 characters and one word), with hin, eng, and ben dominating. The column is high-cardinality, with 6,164 distinct codes across 16,382 rows and a 62.4% duplicate rate, plus 176 'xxx' entries that likely flag an undetermined or missing language.
Treatment: Treat as a categorical language code; one-hot or target-encode the top codes, bucket the long tail as 'other', and keep 'xxx' as an explicit 'unknown' level rather than folding it into 'other'.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 6,164
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 10,218
- duplicate_rate: 0.6237
- vocab_size: 6,163
- readability_flesch_mean: 117.4
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.0001221
- boilerplate_rate: 0
PrimaryLanguageName
text · feature · one_word · short_text · duplicates
Holds the primary language name for each record, predominantly single-token entries (one_word_rate 0.73, word_mean 1.32) with Hindi (682), English (424) and Bengali (366) leading. The duplicate_rate of 0.62 is expected for a categorical language label. The 6,153 unique names track the 6,164 unique ROL3 codes closely, so names and codes appear to map almost one-to-one. Comma-bearing tokens in top_words ('arabic,' and 'punjabi,') most likely come from inverted variety names in the style of 'Chinese, Mandarin' (seen in OfficialLang) rather than from lists of several languages, and the longest values (up to 45 characters) are probably long variety names. 176 rows are explicitly 'Language unknown', matching the 176 'xxx' codes in ROL3.
Treatment: Normalize casing and encode as a single-label categorical, ideally keyed on ROL3; inspect comma-bearing values before any splitting.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 6,153
- len_min: 1
- len_max: 45
- len_mean: 9.081
- len_median: 7
- len_p95: 19
- word_mean: 1.322
- word_median: 1
- n_empty: 0
- n_duplicates: 10,229
- duplicate_rate: 0.6244
- vocab_size: 6,251
- readability_flesch_mean: 42.67
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.7306
- allcaps_rate: 0
- boilerplate_rate: 0
PrimaryLanguageDialect
categorical · metadata · long_tail · null_rate
This column records a primary language dialect per record, with 980 distinct values across 16,382 rows. It is 92.3% null, and even the most common value, 'Brazilian Portuguese', accounts for just 3.25% of non-nulls (41 occurrences); the entropy ratio of 0.967 confirms an extremely flat, long-tailed distribution spanning dialects like Assyrian, Punjabi, Sinhalese, and Ta'izzi. The combination of sparse coverage and high cardinality limits its standalone modelling value.
Treatment: Group into language families or a coarse bucket plus a 'missing' indicator before encoding.
- n: 16,382
- nulls: 15,121 (92.3%)
- unique: 980
- top_value: Brazilian Portuguese
- top_rate: 0.03251
- cardinality: 980
- entropy: 9.613
- entropy_ratio: 0.9674
NumberLanguagesSpoken
numeric · feature · high_skew · outliers
Count of languages spoken, with 16,382 non-null integer values ranging from 1 to 145 and a median of 1. The distribution is severely right-skewed (skew 7.44, kurtosis 83.76): Q1 is 1 and Q3 is 2, yet 2,410 rows (14.7%) flag as outliers. The maximum of 145 is high for a single people group and warrants a check against the source record.
Treatment: Cap or log-transform before modelling, and investigate the 145 maximum for data-entry errors.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 78
- min: 1
- max: 145
- mean: 2.764
- median: 1
- std: 5.985
- q1: 1
- q3: 2
- iqr: 1
- skew: 7.437
- kurtosis: 83.76
- n_outliers: 2,410
- outlier_rate: 0.1471
- zero_rate: 0
OfficialLang
categorical · feature
OfficialLang is a categorical column listing the official language of each record's country, with 87 distinct values across 16,382 rows and almost no nulls (0.05%). English dominates at 22.4% (3,672), followed by Hindi (2,262, the same count as the India rows in ISO3) and French (1,478), giving a moderately concentrated distribution (entropy ratio 0.68). Compound labels like 'Arabic, Standard' and 'Chinese, Mandarin' follow an inverted naming convention worth preserving when joining to language references.
Treatment: Group the long tail and one-hot or target-encode the top categories before modelling.
- n: 16,382
- nulls: 8 (0.0%)
- unique: 87
- top_value: English
- top_rate: 0.2243
- cardinality: 87
- entropy: 4.368
- entropy_ratio: 0.6779
SpeakNationalLang
unknown · other · skipped
Column 'SpeakNationalLang' was skipped by the profiler, so no type inference, uniqueness count, or value statistics are available. The only confirmed signals are 16,382 rows with a 0.0 null rate. The name suggests a flag or category indicating whether a people group speaks the national language, but this cannot be verified from the evidence.
Treatment: Re-profile or manually inspect to determine type before any downstream use.
- n: 16,382
- nulls: 0 (0.0%)
- unique: —
BibleStatus
numeric · feature
BibleStatus is an integer-coded categorical with only 6 distinct values spanning 0 to 5 across 16,382 complete rows. The distribution is heavily left-skewed (skew -1.22) with a mean of 3.86 and median of 4, indicating most records cluster at the high end while about 4.8% sit at zero. Despite being stored as numeric, the small cardinality and bounded range suggest an ordinal status code rather than a true measurement.
Treatment: Treat as an ordinal category (one-hot or ordered encode) rather than a continuous numeric.
- n: 16,382
- nulls: 0 (0.0%)
- unique: 6
- min: 0
- max: 5
- mean: 3.862
- median: 4
- std: 1.429
- q1: 3
- q3: 5
- iqr: 2
- skew: -1.219
- kurtosis: 0.5856
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0.04786
BibleYear
categorical · metadata · null_rate
BibleYear appears to encode a translation's publication or revision span, typically formatted as a start-end year range like "1818-2022", with single years (e.g. "1954") appearing as a minority pattern. Cardinality is high (466 distinct values across 16,382 rows) and the most common range covers only 8.75% of non-null values, giving a fairly flat distribution (entropy ratio 0.776). Notably, 52.45% of values are null, which the alert flags and which will limit any direct use.
Treatment: Parse into separate start_year and end_year integer features and add a missingness indicator before modelling.
- n: 16,382
- nulls: 8,592 (52.4%)
- unique: 466
- top_value: 1818-2022
- top_rate: 0.08755
- cardinality: 466
- entropy: 6.883
- entropy_ratio: 0.7765
NTYear
text · metadata · one_word · allcaps · null_rate · short_text · duplicates
NTYear appears to be a year-range metadata field (e.g. '1811-1998', '1380-2011') stored as short single-token strings, with 1,072 unique values across 16,382 rows. The column is messy: 30.47% null, a 90.59% duplicate rate, and a sentinel value 'Yes' that shows up 670 times alongside the date ranges, indicating mixed semantics. Lengths cluster tightly (median 9, max 9), consistent with a 'YYYY-YYYY' format for most non-sentinel entries.
Treatment: Parse into start_year and end_year integer columns and isolate the 'Yes' sentinel into a separate flag. A null may simply mean that no New Testament exists, so prefer a missingness indicator over imputing years.
- n: 16,382
- nulls: 4,991 (30.5%)
- unique: 1,072
- len_min: 3
- len_max: 9
- len_mean: 7.794
- len_median: 9
- len_p95: 9
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 10,319
- duplicate_rate: 0.9059
- vocab_size: 1,072
- readability_flesch_mean: 121.2
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.9412
- boilerplate_rate: 0
PortionsYear
text feature one_word allcaps short_text duplicatesPortionsYear appears to be a single-token field that mostly encodes year ranges (e.g. '1806-1962', '1530-1995') with strings up to 9 characters, but it is contaminated by a large 'Yes' bucket (1520 rows) that breaks the type. Nulls run at 17.92% and duplicate_rate is 0.87 across 1737 unique values out of 16382, so the column is highly repetitive. The mix of a boolean-like 'Yes' with hyphenated year spans suggests two different concepts were merged into one column. Treatment: Split into two fields: parse year ranges into start/end integers and isolate the 'Yes' values into a separate boolean before modelling.
- n
- 16,382
- nulls
- 2,936 (17.9%)
- unique
- 1,737
- len_min
- 3
- len_max
- 9
- len_mean
- 7.595
- len_median
- 9
- len_p95
- 9
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 11,709
- duplicate_rate
- 0.8708
- vocab_size
- 1,737
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0.887
- boilerplate_rate
- 0
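The split recommended for BibleYear, NTYear and PortionsYear can be sketched as a single parser. This is a minimal illustration, assuming the only patterns are 'YYYY-YYYY', 'YYYY', the 'Yes' sentinel and null; the names `YearSpan` and `parse_year_field` are hypothetical and not part of the dataset or the profiler.

```python
import re
from typing import NamedTuple, Optional


class YearSpan(NamedTuple):
    start_year: Optional[int]
    end_year: Optional[int]
    is_yes_flag: bool  # True when the raw value was the 'Yes' sentinel
    is_missing: bool   # True when the raw value was null, empty or unparseable


_RANGE = re.compile(r"^(\d{4})-(\d{4})$")
_SINGLE = re.compile(r"^(\d{4})$")


def parse_year_field(raw: Optional[str]) -> YearSpan:
    """Split a mixed year-range / 'Yes' string into typed parts."""
    if raw is None or raw.strip() == "":
        return YearSpan(None, None, False, True)
    value = raw.strip()
    if value.lower() == "yes":
        return YearSpan(None, None, True, False)
    match = _RANGE.match(value)
    if match:
        return YearSpan(int(match.group(1)), int(match.group(2)), False, False)
    match = _SINGLE.match(value)
    if match:
        year = int(match.group(1))
        # Assumption: a single year is treated as a span of length zero
        return YearSpan(year, year, False, False)
    # Unrecognised pattern: mark as missing rather than guessing
    return YearSpan(None, None, False, True)
```

Keeping `is_yes_flag` separate from `is_missing` preserves the distinction between "exists, year unknown" and "no information", which a plain null would lose.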
TranslationNeedQuestionable
unknown other skipped
Column 'TranslationNeedQuestionable' was skipped by the profiler, so its kind, cardinality and value distribution are unknown. The only confirmed signals are 16382 rows with a 0.0 null rate. The name suggests a boolean or flag indicating uncertainty about translation need, but this cannot be verified from the evidence. Treatment: Re-profile or inspect raw values before deciding on use; do not model until kind is resolved.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- —
JPScale
numeric feature
JPScale is an integer-valued ordinal feature spanning 1 to 5, with only 5 unique values across 16382 rows and no nulls. It is the numeric form of the 'reachedness' rating, not a continuous measurement. The distribution is not flat: the negative kurtosis (-1.66) reflects mass piled at the ends of the scale, and the JPScaleText table earlier in this report shows Unreached alone at 43.5% of rows. Mean is 2.68 and median 3, with mild skew (0.19), no outliers and no zeros. Treatment: Treat as an ordinal categorical (1-5) rather than continuous; one-hot encode or keep as an ordered integer.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 5
- min
- 1
- max
- 5
- mean
- 2.681
- median
- 3
- std
- 1.644
- q1
- 1
- q3
- 4
- iqr
- 3
- skew
- 0.1937
- kurtosis
- -1.658
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
JPScalePC
categorical feature
JPScalePC is a 5-level ordinal code (values "1" through "5") with no nulls across 16,382 rows, most likely the JPScale rating computed at the people-cluster (PC) level. The distribution is concentrated at the extremes: "5" leads at 33.8% and "1" follows closely, while the middle codes "2" and "3" together account for far less, so groups sit mostly at one end of the scale or the other. An entropy ratio of 0.86 confirms the spread is wide but not uniform. Treatment: Treat as ordinal (1-5); cast to integer or one-hot encode depending on the model.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- 5
- top_rate
- 0.3381
- cardinality
- 5
- entropy
- 1.997
- entropy_ratio
- 0.8602
JPScalePGAC
categorical feature
JPScalePGAC is a low-cardinality categorical with 5 distinct string-encoded levels ('1' through '5') across 16382 rows and no nulls, consistent with the JPScale rating taken at the PGAC level (in Joshua Project usage, most likely People Groups Across Countries). The distribution is uneven: '1' dominates at 43.3% while '2' is the rarest at 908 rows, and the entropy ratio of 0.86 shows the remaining mass is spread fairly broadly. The frequency order (1 > 4 > 5 > 3 > 2) is non-monotonic, which matches the two-ended shape of JPScale itself and does not by itself indicate a data problem. Treatment: Treat as ordinal: cast to integer and preserve order, or one-hot encode if the downstream model does not use order.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- 1
- top_rate
- 0.4329
- cardinality
- 5
- entropy
- 1.998
- entropy_ratio
- 0.8603
LeastReached
categorical feature
Binary Y/N flag named LeastReached, fully populated across 16382 rows with only 2 distinct values. The split is fairly balanced: 'N' leads at 56.5% (9258) versus 7124 'Y', yielding a near-maximal entropy ratio of 0.988. No nulls or anomalies present. Treatment: Encode as a 0/1 boolean for modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.5651
- cardinality
- 2
- entropy
- 0.9877
- entropy_ratio
- 0.9877
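The entropy and entropy_ratio figures in these lists are consistent with Shannon entropy in bits, divided by log2 of the cardinality for the ratio. The profiler's exact formula is not stated, so the sketch below is an assumption that happens to reproduce the reported values; `entropy_bits` and `entropy_ratio` are hypothetical helper names.

```python
import math
from typing import Sequence


def entropy_bits(counts: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts: Sequence[int]) -> float:
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column has no spread
    return entropy_bits(counts) / math.log2(k)


# LeastReached counts from the description above: 9258 'N' and 7124 'Y'
least_reached = [9258, 7124]
print(round(entropy_bits(least_reached), 4), round(entropy_ratio(least_reached), 4))
```

For a two-level flag log2(2) is 1, which is why entropy and entropy_ratio coincide in every binary column of this report.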
LeastReachedPC
categorical feature
A binary Y/N flag named LeastReachedPC, most likely the least-reached status evaluated at the people-cluster (PC) level. The split is moderately imbalanced at 67.3% 'N' versus 32.7% 'Y', with no nulls across 16,382 rows and an entropy ratio of 0.91 showing both classes are well represented. Treatment: Encode as a 0/1 indicator for modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.6732
- cardinality
- 2
- entropy
- 0.9116
- entropy_ratio
- 0.9116
LeastReachedPGAC
categorical feature
Binary Y/N flag indicating whether some 'LeastReachedPGAC' condition holds, with no missing values across 16382 rows. The split is fairly balanced: 'N' leads at 56.7% (9291) versus 7091 'Y', giving a near-maximal entropy ratio of 0.987. Treatment: Encode as a 0/1 indicator for modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.5671
- cardinality
- 2
- entropy
- 0.987
- entropy_ratio
- 0.987
GSEC
categorical feature
GSEC is a low-cardinality categorical with 8 distinct values across 16,382 rows and no nulls. The dominant value is the empty string at 40.0% (6,553 rows), followed by '1' at 4,852; the remaining codes ('0','2','3','4','5','6') split the rest, suggesting a coded classification where blanks likely encode 'not applicable' or missing-as-empty. Entropy ratio of 0.732 indicates moderate spread despite the empty-string plurality. Treatment: Recode the empty string as an explicit missing category and one-hot encode the remaining codes.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- (empty string)
- top_rate
- 0.4
- cardinality
- 8
- entropy
- 2.197
- entropy_ratio
- 0.7325
HasAudioRecordings
categorical feature
Binary Y/N flag indicating whether a record has associated audio recordings, fully populated across 16382 rows. The distribution is imbalanced: 'Y' covers 82.3% (13479) versus 2903 'N', with entropy ratio 0.67. Treatment: Encode as a 0/1 boolean indicator for modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.8228
- cardinality
- 2
- entropy
- 0.6739
- entropy_ratio
- 0.6739
NTOnline
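categorical feature null_rate imbalance
NTOnline is a categorical flag with only one observed value, 'Y', across all 11,705 non-null rows. The remaining 28.55% of rows are null, so the column is constant where present and any signal lives in the null-versus-'Y' split. If nulls mean that no online New Testament is available (an assumption to verify against the source), the column is a presence flag rather than noise. Treatment: Recode to a 0/1 presence indicator (null = 0) if that reading holds; otherwise drop as zero-variance.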
- n
- 16,382
- nulls
- 4,677 (28.5%)
- unique
- 1
- top_value
- Y
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
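As an alternative to dropping NTOnline, the null-versus-'Y' split can be kept as a presence indicator. A minimal sketch, under the unverified assumption that a null means "not available online"; `presence_flag` is a hypothetical helper.

```python
from typing import Optional


def presence_flag(raw: Optional[str]) -> int:
    """Return 1 for an explicit 'Y', 0 for null or empty.

    Assumption: null/empty means 'no', which the report does not confirm.
    """
    if raw is None:
        return 0
    return 1 if raw.strip().upper() == "Y" else 0


# Illustrative values only, not actual rows from the dataset
sample = ["Y", None, "Y", ""]
flags = [presence_flag(v) for v in sample]
```

If the source turns out to use null for "unknown" rather than "no", the flag should be read as "documented as online" instead.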
RLG3
numeric feature
RLG3 is a small-integer code ranging from 1 to 9 with only 8 distinct values across 16,382 rows and no nulls. The distribution is broad (kurtosis -1.37, skew 0.13, IQR spanning 1 to 6) with mean 3.47 and median 4, and no outliers. The 8 unique values across a 1-9 range imply that one integer in that span never occurs, and 8 is also the number of PrimaryReligion levels, so RLG3 is probably a numeric religion code rather than a rating. Treatment: Confirm the mapping against PrimaryReligion, then treat as a nominal categorical (one-hot encode, or drop as redundant), not as an ordinal or continuous numeric.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 8
- min
- 1
- max
- 9
- mean
- 3.469
- median
- 4
- std
- 2.238
- q1
- 1
- q3
- 6
- iqr
- 5
- skew
- 0.1265
- kurtosis
- -1.366
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
RLG3PC
numeric feature
RLG3PC is an integer-valued numeric column with only 8 distinct values bounded between 1 and 9, no nulls, and no zeros. The broad distribution (kurtosis -1.47, IQR spanning 1 to 6) and small cardinality suggest this is a category code rather than a continuous measurement, probably the religion code paired with PrimaryReligionPC (also 8 levels). The mean of 3.21 sits above the median of 2, with mild positive skew (0.31). Treatment: Treat as a nominal category and one-hot encode rather than scaling as continuous.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 8
- min
- 1
- max
- 9
- mean
- 3.213
- median
- 2
- std
- 2.311
- q1
- 1
- q3
- 6
- iqr
- 5
- skew
- 0.3143
- kurtosis
- -1.466
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
RLG3PGAC
numeric feature
RLG3PGAC holds an integer code in the 1-9 range with only 8 distinct values across 16,382 rows and zero nulls. The distribution is broad (kurtosis -1.36, std 2.25, IQR spanning 1 to 6) with near-zero skew, and the 8 levels match PrimaryReligionPGAC, suggesting a nominal religion code rather than a rating or a true continuous measurement. No outliers and no zeros are present. Treatment: Treat as a nominal categorical feature; one-hot encode rather than scaling as continuous.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 8
- min
- 1
- max
- 9
- mean
- 3.486
- median
- 4
- std
- 2.252
- q1
- 1
- q3
- 6
- iqr
- 5
- skew
- 0.1259
- kurtosis
- -1.363
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
PrimaryReligion
categorical feature
PrimaryReligion is a low-cardinality categorical label assigning each of 16,382 rows to one of 8 religious traditions, with no nulls. Christianity dominates at 39.4% (6,459 rows), followed by Islam (3,786) and Ethnic Religions (2,651); the long tail includes 189 'Unknown' and 124 'Other / Small' rows. Entropy ratio of 0.74 indicates a moderately balanced distribution rather than a single overwhelming class. Treatment: One-hot or target-encode; consider grouping 'Unknown' and 'Other / Small' if the model is sensitive to rare classes.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- Christianity
- top_rate
- 0.3943
- cardinality
- 8
- entropy
- 2.231
- entropy_ratio
- 0.7436
PrimaryReligionPC
categorical feature
Categorical label of the dominant religion of a people-cluster (PC), with 8 distinct values and no nulls across 16,382 rows. Christianity leads at 47.6% (7,795), followed by Islam (3,658) and Hinduism (2,557), while a small 'Unknown' bucket (173) and 'Other / Small' (62) provide explicit catch-alls. Entropy ratio of 0.69 indicates moderate concentration rather than a single dominant class. Treatment: One-hot or target-encode; consider merging 'Unknown' and 'Other / Small' if the downstream model is sensitive to rare levels.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- Christianity
- top_rate
- 0.4758
- cardinality
- 8
- entropy
- 2.071
- entropy_ratio
- 0.6904
PrimaryReligionPGAC
categorical label
Categorical label of the primary religion at the PGAC level (in Joshua Project usage, most likely People Groups Across Countries), drawn from a fixed taxonomy of 8 values with no nulls across 16,382 rows. Christianity dominates at 39.4% (6,462), followed by Islam (3,766), Ethnic Religions (2,613) and Hinduism (2,348); Buddhism, Non-Religious, Unknown and Other/Small together account for under 8% of rows. Entropy ratio of 0.748 indicates a moderately concentrated but not degenerate distribution. Treatment: One-hot or target-encode; consider grouping the small Unknown/Other tail before modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- Christianity
- top_rate
- 0.3945
- cardinality
- 8
- entropy
- 2.245
- entropy_ratio
- 0.7482
RLG4
numeric feature null_rate outliers
RLG4 is a sparse integer-valued column with only 20 distinct values spanning 10 to 41. It is overwhelmingly missing (null_rate 0.9621), so just under 4% of rows (621) carry a value. Its null count (15,761) and cardinality (20) match ReligionSubdivision exactly, which suggests RLG4 is the numeric code for that label rather than a score or count. On that reading the summary statistics (median 20, IQR 7, skew 0.94, 33 flagged outliers) describe code frequencies and have no quantitative meaning. Treatment: Verify the one-to-one mapping with ReligionSubdivision, keep one of the two columns, and treat nulls as 'not applicable' rather than imputing.
- n
- 16,382
- nulls
- 15,761 (96.2%)
- unique
- 20
- min
- 10
- max
- 41
- mean
- 18.49
- median
- 20
- std
- 6.519
- q1
- 14
- q3
- 21
- iqr
- 7
- skew
- 0.9361
- kurtosis
- 1.156
- n_outliers
- 33
- outlier_rate
- 0.05314
- zero_rate
- 0
ReligionSubdivision
categorical feature null_rate
A sub-categorisation of religion (e.g. Sunni/Shia branches, Buddhist schools, Judaism, Sikhism), populated only when a finer split applies. It is overwhelmingly null at 96.21%, so just 621 of 16382 rows carry one of 20 values, with Sunni leading at 29.15% of the populated rows. Entropy ratio 0.74 indicates the non-null portion is reasonably spread rather than dominated by a single bucket. Treatment: Treat nulls as an explicit 'not applicable' category before one-hot encoding.
- n
- 16,382
- nulls
- 15,761 (96.2%)
- unique
- 20
- top_value
- Sunni
- top_rate
- 0.2915
- cardinality
- 20
- entropy
- 3.185
- entropy_ratio
- 0.7369
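RLG4 and ReligionSubdivision report the same null count (15,761) and the same cardinality (20), so one is plausibly a numeric code for the other. A minimal sketch of how that could be checked before dropping either column; `is_one_to_one` is a hypothetical helper and the code values in the sample are invented for illustration.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Optional, Tuple

Pair = Tuple[Optional[Hashable], Optional[Hashable]]


def is_one_to_one(pairs: Iterable[Pair]) -> bool:
    """True when every code maps to exactly one label and vice versa.

    Rows where both values are None are skipped; a row with only one side
    missing counts as a mismatch.
    """
    code_to_labels = defaultdict(set)
    label_to_codes = defaultdict(set)
    for code, label in pairs:
        if code is None and label is None:
            continue
        code_to_labels[code].add(label)
        label_to_codes[label].add(code)
    return (all(len(v) == 1 for v in code_to_labels.values())
            and all(len(v) == 1 for v in label_to_codes.values()))


# Invented (code, label) pairs; the real code-to-label mapping is not in the report
consistent = [(10, "Sunni"), (10, "Sunni"), (None, None), (11, "Shia")]
conflicting = [(10, "Sunni"), (10, "Shia")]
```

The same check applies to RLG3 against PrimaryReligion, which both show 8 levels.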
PCIslam
numeric feature outliers
PCIslam is a numeric column bounded between 0 and 100, almost certainly a percentage share of Muslim population (or similar Islam-related composition metric) per record. The distribution is heavily zero-inflated: 63.2% of values are exactly 0 and the median is 0, while the mean is 23.2 and values stretch all the way to 100, producing a right skew of 1.27 and 3,438 flagged outliers (21.1%). Nulls are negligible (0.52%) and 1,117 distinct values suggest reasonably fine-grained measurement rather than a coarse bucket. Treatment: Treat as a zero-inflated proportion: model the zero mass separately or add a presence indicator before scaling.
- n
- 16,382
- nulls
- 86 (0.5%)
- unique
- 1,117
- min
- 0
- max
- 100
- mean
- 23.2
- median
- 0
- std
- 39.54
- q1
- 0
- q3
- 28
- iqr
- 28
- skew
- 1.273
- kurtosis
- -0.2575
- n_outliers
- 3,438
- outlier_rate
- 0.211
- zero_rate
- 0.6322
PCNonReligious
numeric feature high_skew outliers
PCNonReligious appears to be a percentage feature capturing the share of a population that is non-religious, ranging from 0 to 99. The distribution is dominated by zeros (75.2% of rows) with median, Q1, and Q3 all at 0, yet the mean is 3.42 and skew is 3.65 with kurtosis 15.4, indicating a long right tail. Roughly 24.8% of values flag as outliers, but with an IQR of 0 every non-zero value is flagged, so this is simply the non-zero share (1 - 0.7522) and not a sign of bad data: most records report none and a minority report substantial percentages. Treatment: Consider a zero-inflated treatment or log1p transform before modelling given the 75% zeros and heavy right tail.
- n
- 16,382
- nulls
- 66 (0.4%)
- unique
- 223
- min
- 0
- max
- 99
- mean
- 3.421
- median
- 0
- std
- 9.21
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 3.648
- kurtosis
- 15.43
- n_outliers
- 4,043
- outlier_rate
- 0.2478
- zero_rate
- 0.7522
PCUnknown
numeric feature high_skew
PCUnknown is a numeric feature bounded between 0 and 100, most likely the percentage of a group's religious affiliation recorded as unknown. It is overwhelmingly zero (zero_rate 0.9558) with median, q1, and q3 all at 0, yet the max reaches 100 with skew 9.07 and kurtosis 81.5, producing 719 outliers (4.4%), which with an IQR of 0 is just the set of non-zero rows. The 583 distinct values (582 of them non-zero) form a long, heavy tail rather than a smooth distribution. Treatment: Binarize (zero vs non-zero) or log1p-transform before modelling given the 95.6% zero mass and extreme skew.
- n
- 16,382
- nulls
- 104 (0.6%)
- unique
- 583
- min
- 0
- max
- 100
- mean
- 1.201
- median
- 0
- std
- 10.34
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 9.066
- kurtosis
- 81.52
- n_outliers
- 719
- outlier_rate
- 0.04417
- zero_rate
- 0.9558
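The zero-inflated treatment suggested for the PC* percentage columns (PCIslam, PCNonReligious, PCUnknown) can be sketched as two derived features per value. This is an illustration under the assumption that values are percentages in [0, 100]; `zero_inflated_features` is a hypothetical helper.

```python
import math
from typing import Optional, Tuple


def zero_inflated_features(pct: Optional[float]) -> Tuple[Optional[int], Optional[float]]:
    """Return (is_nonzero, log1p_share) for one percentage value.

    Nulls stay null in both outputs so missingness can be handled separately.
    """
    if pct is None:
        return None, None
    is_nonzero = 1 if pct > 0 else 0
    # log1p(0) is exactly 0, so zero rows stay at 0 after the transform
    return is_nonzero, math.log1p(pct)
```

The indicator carries the "any presence at all" signal, while the log1p share compresses the long tail that reaches 100.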
SecurityLevel
numeric feature
SecurityLevel takes only 3 distinct integer values spanning 0 to 2 with no nulls, so it is effectively an ordinal category encoded numerically (e.g., low/medium/high). The levels are not balanced: zeros make up 38.8% of rows, and the reported mean of 1.10 then implies roughly 12% at level 1 and 49% at level 2, a two-peaked shape consistent with the kurtosis of -1.82. The median is 1 only because the cumulative share narrowly crosses 50% at that level. Treatment: Treat as an ordinal categorical and one-hot or ordinal-encode before modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 3
- min
- 0
- max
- 2
- mean
- 1.099
- median
- 1
- std
- 0.9307
- q1
- 0
- q3
- 2
- iqr
- 2
- skew
- -0.1985
- kurtosis
- -1.816
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.3883
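The per-level shares of SecurityLevel are not listed, but with only three levels they can be recovered from the reported zero_rate and mean. The arithmetic below is a worked check using only those two figures, with the reported std as a cross-check; variable names are illustrative.

```python
# Reported summary statistics for SecurityLevel (values 0, 1, 2)
zero_rate = 0.3883
mean = 1.099

# Two equations in two unknowns:
#   p1 + p2       = 1 - zero_rate
#   1*p1 + 2*p2   = mean
p0 = zero_rate
p2 = mean - (1 - zero_rate)
p1 = (1 - zero_rate) - p2

# Cross-check: the implied standard deviation should be close to the reported 0.9307
variance = (1 * p1 + 4 * p2) - mean ** 2
std = variance ** 0.5

print(round(p0, 3), round(p1, 3), round(p2, 3), round(std, 3))
```

Because the inputs are rounded, the recovered shares are approximate to about the third decimal place.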
LRTop100
categorical label imbalance
Binary Y/N flag indicating membership in some 'LR Top 100' set, with only 100 positive cases out of 16,382 rows (top_rate 0.9939). Extreme class imbalance and very low entropy (0.0537) make this nearly constant. No nulls, exactly 2 categories as expected. Treatment: Use stratified sampling or class-weighting if modelling; otherwise treat as rare-event indicator.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.9939
- cardinality
- 2
- entropy
- 0.05368
- entropy_ratio
- 0.05368
PhotoAddress
text foreign_key one_word short_text duplicates
PhotoAddress holds single-token image filenames in the pattern p#####.jpg, with a max length of 13 and exactly one word per row. 5,718 of 16,382 rows (~35%) are empty strings rather than nulls, and overall duplicate rate is 67.8%: the same photo file is reused across many records (e.g., p19007.jpg appears 92 times). With only 5,277 unique values, this behaves like a foreign-key reference to an image asset, not a per-row unique pointer. Treatment: Treat empty strings as missing and join to an image/asset table on this filename rather than modelling it as text.
- n
- 16,382
- nulls
- 1 (0.0%)
- unique
- 5,277
- len_min
- 0
- len_max
- 13
- len_mean
- 6.523
- len_median
- 10
- len_p95
- 10
- word_mean
- 1
- word_median
- 1
- n_empty
- 5,718
- n_duplicates
- 11,104
- duplicate_rate
- 0.6779
- vocab_size
- 5,276
- readability_flesch_mean
- 82.43
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
PhotoCredits
text metadata one_word duplicates
Attribution string for a photo credit, mostly short names or sources (mean length 11.5 chars, median 1 word). Highly repetitive: 90.2% duplicate rate across only 1,605 unique values, with 5,718 empty entries and 3,065 'Anonymous' tags dominating. Top words reveal stock/CC sources like Flickr, Wikimedia, Pixabay, and Shutterstock alongside named contributors. Treatment: Treat as low-cardinality categorical attribution; normalize empties/'Anonymous' and group rare credits before any analysis.
- n
- 16,382
- nulls
- 10 (0.1%)
- unique
- 1,605
- len_min
- 0
- len_max
- 56
- len_mean
- 11.54
- len_median
- 9
- len_p95
- 30
- word_mean
- 2.081
- word_median
- 1
- n_empty
- 5,718
- n_duplicates
- 14,767
- duplicate_rate
- 0.902
- vocab_size
- 2,658
- readability_flesch_mean
- -13.88
- emoji_rate
- 0
- url_rate
- 0.0004276
- one_word_rate
- 0.5754
- allcaps_rate
- 0.001649
- boilerplate_rate
- 0
PhotoCreditURL
text metadata one_word url_heavy null_rate duplicates
This column stores photo credit URLs, with every non-empty value being a single token (one_word_rate 1.0) and 47.5% matching a URL pattern. It is sparsely populated: 33.08% null and another 5,718 empty strings among the top values, while 86.9% of values are duplicates; a single domain, https://www.asiaharvest.org, accounts for 736 rows. Only 1,434 unique URLs serve 16,382 rows, suggesting a small set of recurring image sources rather than per-record attribution. Treatment: Extract the domain as a categorical feature and drop the raw URL, which should not be used as a modelling input.
- n
- 16,382
- nulls
- 5,419 (33.1%)
- unique
- 1,434
- len_min
- 0
- len_max
- 240
- len_mean
- 25.67
- len_median
- 0
- len_p95
- 73
- word_mean
- 1
- word_median
- 1
- n_empty
- 5,718
- n_duplicates
- 9,529
- duplicate_rate
- 0.8692
- vocab_size
- 1,433
- readability_flesch_mean
- -261.7
- emoji_rate
- 0
- url_rate
- 0.4753
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
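Domain extraction for PhotoCreditURL can be sketched with the standard library's `urllib.parse`. This is a minimal illustration, assuming that only http/https values count as URLs and that a leading "www." should be dropped; `credit_domain` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlparse


def credit_domain(raw: Optional[str]) -> Optional[str]:
    """Return the lower-cased host of a credit URL, or None if there is no usable URL."""
    if raw is None or raw.strip() == "":
        return None  # nulls and empty strings are both treated as missing
    parsed = urlparse(raw.strip())
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return None  # single-token values that are not URLs
    host = parsed.hostname  # already lower-cased, port stripped
    return host[4:] if host.startswith("www.") else host
```

The resulting domain is far lower in cardinality than the 1,434 raw URLs and can be grouped further (for example, a rare-domain bucket) before encoding.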
PhotoCreativeCommons
categorical feature
A binary Y/N flag indicating whether a photo carries a Creative Commons licence. The column is heavily skewed: 'N' covers 83.6% of the 16382 rows while 'Y' accounts for 2691, with a near-zero null rate of 0.0003. Treatment: Encode as a 0/1 boolean; expect class imbalance if used as a target.
- n
- 16,382
- nulls
- 5 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.8357
- cardinality
- 2
- entropy
- 0.6445
- entropy_ratio
- 0.6445
PhotoCopyright
categorical feature
Binary Y/N flag indicating whether a photo copyright applies, with 'N' dominating at 87.95% of 16,382 rows versus 1,972 'Y' values. Class imbalance is notable but not extreme, and nulls are negligible at 0.09%. Entropy ratio of 0.53 reflects this skew toward 'N'. Treatment: Encode as a 0/1 boolean; be aware of the ~1:7 class imbalance if used as a target.
- n
- 16,382
- nulls
- 15 (0.1%)
- unique
- 2
- top_value
- N
- top_rate
- 0.8795
- cardinality
- 2
- entropy
- 0.5308
- entropy_ratio
- 0.5308
PhotoPermission
categorical feature
A consent flag for photo use, encoded as Y/N with 87.1% of 16382 rows set to 'N' and only 0.1% null. Cardinality is 3 because two records use lowercase 'y' alongside 2111 uppercase 'Y', a casing inconsistency worth normalising. Entropy ratio of 0.35 confirms the heavy skew toward 'N'; whether 'N' means permission was refused or simply never recorded cannot be told from the profile. Treatment: Uppercase-normalise then map to a boolean before modelling.
- n
- 16,382
- nulls
- 17 (0.1%)
- unique
- 3
- top_value
- N
- top_rate
- 0.8709
- cardinality
- 3
- entropy
- 0.5564
- entropy_ratio
- 0.3511
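The casing fix for PhotoPermission, and the Y/N to boolean mapping recommended for the other flags in this report, can be sketched in one function. `yn_to_bool` is a hypothetical helper; unexpected values are returned as None so they stay visible instead of being silently coerced.

```python
from typing import Optional


def yn_to_bool(raw: Optional[str]) -> Optional[bool]:
    """Map 'Y'/'y' to True and 'N'/'n' to False; anything else becomes None."""
    if raw is None:
        return None
    value = raw.strip().upper()
    if value == "Y":
        return True
    if value == "N":
        return False
    return None  # unexpected token: leave as missing for manual review


# Illustrative values only, not actual rows from the dataset
sample = ["Y", "y", "N", None]
encoded = [yn_to_bool(v) for v in sample]
```

After normalisation the column drops from 3 distinct values to the expected 2.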
ProfileTextExists
categorical feature
Binary Y/N flag indicating whether a profile has text, with no nulls across 16382 rows. Roughly 79.5% are 'Y' (13018) versus 'N' (3364), an imbalance worth noting but not extreme. Entropy ratio of 0.73 confirms a moderately skewed but informative distribution. Treatment: Encode as a 0/1 boolean for modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.7947
- cardinality
- 2
- entropy
- 0.7325
- entropy_ratio
- 0.7325
CountOfCountries
numeric feature high_skew outliers
Counts the number of countries associated with each record, ranging from 1 to 164 with a median of 1 and Q3 of just 4. The distribution is severely right-skewed (skew 5.15, kurtosis 32.05) and 19.2% of rows flag as outliers, indicating a long tail where a small set of records span dozens of countries, up to 164, while most cover only one. Treatment: Log-transform or bin (e.g. 1, 2-4, 5+) before modelling to tame the heavy tail.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 48
- min
- 1
- max
- 164
- mean
- 8.328
- median
- 1
- std
- 20.64
- q1
- 1
- q3
- 4
- iqr
- 3
- skew
- 5.152
- kurtosis
- 32.05
- n_outliers
- 3,139
- outlier_rate
- 0.1916
- zero_rate
- 0
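The binning suggested for CountOfCountries (1, 2-4, 5+) can be sketched as follows. The bin edges come from the treatment note above; `country_count_bin` is a hypothetical helper.

```python
def country_count_bin(n: int) -> str:
    """Assign a CountOfCountries value to one of three ordered bins."""
    if n < 1:
        # The profile reports min 1, so anything lower signals a data problem
        raise ValueError("CountOfCountries is expected to be >= 1")
    if n == 1:
        return "1"
    if n <= 4:
        return "2-4"
    return "5+"


bins = [country_count_bin(n) for n in (1, 2, 4, 5, 164)]
```

With a median of 1 and Q3 of 4, the first two bins cover at least three quarters of rows, leaving the long tail in "5+".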
CountOfProvinces
unknown other skipped
The column 'CountOfProvinces' was skipped by the profiler, so beyond a row count of 16382 and a null rate of 0.0 there is no evidence about its distribution, type, or uniqueness. The name suggests an integer count of provinces per record, but this cannot be confirmed from the payload. No further signal is available. Treatment: Re-run the profiler on this column to recover type and distribution before any downstream use.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- —
Longitude
numeric feature high_skew
This is a Longitude coordinate column, but some values are corrupted: valid longitudes must lie within [-180, 180], yet the max is 2350588.0 and the mean (189.33) already exceeds the legal range. Skew of 127.98 and kurtosis of 16376.52 confirm extreme contamination. The median of 55.45 is plausible, so most rows are likely valid. The 207 flagged outliers (1.26%) are not a count of bad rows: with Q1 8.67 and Q3 94.64, and assuming the usual 1.5 × IQR rule, the fences sit near -120 and 224, so legitimate longitudes west of -120° are flagged too. Treatment: Count the rows outside [-180, 180] directly, then set them to missing or repair them from the source; avoid clipping, which would move bad points to ±180.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 15,889
- min
- -179.3
- max
- 2.351e+06
- mean
- 189.3
- median
- 55.45
- std
- 1.836e+04
- q1
- 8.673
- q3
- 94.64
- iqr
- 85.97
- skew
- 128
- kurtosis
- 1.638e+04
- n_outliers
- 207
- outlier_rate
- 0.01264
- zero_rate
- 0
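A range check for Longitude can be sketched as below. It sets out-of-range values to missing instead of clipping them, on the assumption that a value such as 2350588.0 carries no recoverable position. `clean_longitude` is a hypothetical helper, and the sample list reuses summary values from the stats above, not actual rows.

```python
from typing import Optional


def clean_longitude(value: Optional[float]) -> Optional[float]:
    """Return the value if it is a legal longitude in [-180, 180], else None."""
    if value is None or value != value:  # value != value is True only for NaN
        return None
    return value if -180.0 <= value <= 180.0 else None


# Sample built from the reported median, min, max and q3
raw = [55.45, -179.3, 2350588.0, 94.64]
cleaned = [clean_longitude(v) for v in raw]
n_invalid = sum(1 for v in cleaned if v is None)
```

Counting `n_invalid` on the full column gives the true number of malformed rows, which may differ from the 207 IQR-flagged outliers.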
Latitude
numeric feature
Geographic latitude in decimal degrees, with values spanning -54.94 to 78.21, well within the valid [-90, 90] range. Distribution is nearly symmetric (skew -0.12) and slightly flat (kurtosis -0.26), centered around a median of 17.03 and mean of 16.44, suggesting a tropical/northern-hemisphere bias. Near-unique (15851 of 16382) with no nulls and only 39 mild outliers, consistent with per-record geocoordinates rather than a categorical region label. Treatment: Pair with longitude for geospatial features; consider binning by hemisphere or clustering rather than using raw degrees in linear models.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 15,851
- min
- -54.94
- max
- 78.21
- mean
- 16.44
- median
- 17.03
- std
- 20.47
- q1
- 2.072
- q3
- 29.88
- iqr
- 27.81
- skew
- -0.118
- kurtosis
- -0.2579
- n_outliers
- 39
- outlier_rate
- 0.002381
- zero_rate
- 0
Ctry
categorical feature
Country names stored as full strings, with 238 distinct values across 16,382 rows and no nulls. India dominates at 13.8% (2,262 rows), followed by Papua New Guinea (883) and Indonesia (788), so the file is weighted toward South Asia and the Indonesia/New Guinea region. Entropy ratio of 0.79 indicates fairly broad spread despite the long tail. Treatment: Group long-tail countries or target/frequency-encode before modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 238
- top_value
- India
- top_rate
- 0.1381
- cardinality
- 238
- entropy
- 6.225
- entropy_ratio
- 0.7885
IndigenousCode
categorical feature
IndigenousCode is a binary Y/N flag, fully populated across all 16,382 rows with only 2 distinct values. The class split is uneven: 'Y' covers 74.8% of records against 'N' for the remainder, yielding entropy of 0.81. The imbalance is notable but not extreme. Treatment: Encode as a binary indicator; consider class imbalance if used as a target.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.7483
- cardinality
- 2
- entropy
- 0.8139
- entropy_ratio
- 0.8139
PercentAdherents
text feature one_word allcaps short_text duplicates
This is a numeric percentage field (PercentAdherents) stored as text, with all 16382 values being single tokens of length 5-7 like '0.000' or '95.000'. The distribution is heavily concentrated at zero (4007 of 16382 rows) and shows strong duplication (duplicate_rate 0.924, only 1248 unique values). Despite the 'allcaps' and 'one_word' alerts, these are just numeric strings, not categorical text. Treatment: Cast to float and treat as a numeric feature; consider zero-inflation handling given the spike at 0.000.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 1,248
- len_min
- 5
- len_max
- 7
- len_mean
- 5.534
- len_median
- 6
- len_p95
- 6
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 15,134
- duplicate_rate
- 0.9238
- vocab_size
- 1,248
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
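Casting the text-typed percentage columns (PercentAdherents and the other Percent* fields) to float can be sketched as below. The [0, 100] bound is an assumption based on the columns being percentages; `parse_percent` is a hypothetical helper.

```python
from typing import Optional


def parse_percent(raw: Optional[str]) -> Optional[float]:
    """Parse a string such as '95.000' to a float; return None if unusable."""
    if raw is None or raw.strip() == "":
        return None
    try:
        value = float(raw)
    except ValueError:
        return None  # non-numeric token
    # Assumption: a percentage outside [0, 100] (or NaN) is invalid
    return value if 0.0 <= value <= 100.0 else None
```

Once cast, the spike at '0.000' shows up as an ordinary zero mass and can be handled like the other zero-inflated shares.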
PercentChristianPC
categorical feature
Stored as a categorical string, this column appears to be a group-level percentage of Christians, with 246 distinct values across 16,382 rows and no nulls. The PC suffix is read elsewhere in this report as people cluster, and 246 distinct values exceeds the 238 countries in Ctry, so a per-cluster value fits better than a per-country one. The distribution is highly repetitive: the modal value '90.061' covers 6.95% of rows and the top ten values include both very high shares (90.061, 82.325, 76.515) and near-zero shares (0.482, 0.111, 0.000), suggesting a small set of group-level percentages broadcast onto many rows. Entropy ratio of 0.86 indicates the values are fairly evenly spread across the 246 categories despite the heavy mode. Treatment: Cast to float and treat as a numeric feature rather than a category.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 246
- top_value
- 90.061
- top_rate
- 0.06953
- cardinality
- 246
- entropy
- 6.853
- entropy_ratio
- 0.8628
NaturalName
text label one_word duplicates
NaturalName appears to be a people-group or ethno-linguistic label, dominated by single-word entries (one_word_rate 0.555) and short strings (len_mean 10.9, word_mean 1.59). Roughly a third of rows repeat (duplicate_rate 0.344, 5645 duplicates across 10737 uniques), with 'Deaf' (164), 'French' (82), and 'British' (80) leading. Top words expose unclosed parenthetical qualifiers like 'traditions)', '(hindu', '(muslim' occurring 500-1000+ times, suggesting tokenisation broke compound names such as 'X (Hindu traditions)'. Treatment: Normalise casing and repair the parenthetical qualifier splits before using as a categorical grouping key.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 10,737
- len_min
- 1
- len_max
- 41
- len_mean
- 10.91
- len_median
- 9
- len_p95
- 25
- word_mean
- 1.585
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 5,645
- duplicate_rate
- 0.3446
- vocab_size
- 11,164
- readability_flesch_mean
- 50.53
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.5554
- allcaps_rate
- 0
- boilerplate_rate
- 0
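One way to handle the parenthetical qualifiers in NaturalName is to split each name into a base and an optional qualifier instead of tokenising on whitespace. A minimal sketch, assuming at most one trailing parenthesised qualifier per name as in 'X (Hindu traditions)'; `split_qualifier` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Base name, optional whitespace, then a single trailing "(...)" group
_QUALIFIED = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")


def split_qualifier(name: str) -> Tuple[str, Optional[str]]:
    """Split 'X (Hindu traditions)' into ('X', 'Hindu traditions')."""
    match = _QUALIFIED.match(name)
    if match:
        return match.group("base").strip(), match.group("qualifier").strip()
    return name.strip(), None
```

The base can then serve as the grouping key, with the qualifier kept as a separate low-cardinality feature.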
NaturalPronunciation
text metadata one_word null_rate duplicates
This column holds phonetic respellings of people-group names (e.g. 'AY-zhun', 'chai-NEEZ', 'kor-EE-un'), with hyphenated syllables and capitalised stress markers indicating an ad-hoc pronunciation guide. It is sparse and repetitive: 69.63% null, and among the 4,975 non-null rows 69.41% are one-word entries and 61.15% are duplicates across 1,933 unique values. The most frequent value, 'def', appears 164 times, the same count as 'Deaf' in NaturalName, so it is almost certainly the respelling of 'Deaf' rather than a placeholder. Treatment: Treat as a pronunciation lookup keyed to NaturalName and keep it out of modelling features given the 69.63% null rate.
- n
- 16,382
- nulls
- 11,407 (69.6%)
- unique
- 1,933
- len_min
- 2
- len_max
- 57
- len_mean
- 12.13
- len_median
- 11
- len_p95
- 26
- word_mean
- 1.345
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 3,042
- duplicate_rate
- 0.6115
- vocab_size
- 2,039
- readability_flesch_mean
- 61.69
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.6941
- allcaps_rate
- 0.000402
- boilerplate_rate
- 0
PercentChristianPGAC
text feature one_word allcaps short_text duplicates
This column holds percentages (likely Christian population share, per the PGAC suffix) stored as text rather than numeric, with values like "0.000", "95.000", "90.000" filling lengths of 5-7 characters. The distribution is heavily zero-inflated: 3,121 of 16,382 rows are "0.000" and the duplicate rate is 88%, leaving only 1,954 unique values. Flagged as allcaps/one-word only because the profiler treated numeric strings as tokens. Treatment: Cast to float and treat as a numeric percentage feature.
- n
- 16,382
- nulls
- 15 (0.1%)
- unique
- 1,954
- len_min
- 5
- len_max
- 7
- len_mean
- 5.528
- len_median
- 6
- len_p95
- 6
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 14,413
- duplicate_rate
- 0.8806
- vocab_size
- 1,954
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
PercentEvangelical
text feature one_word allcaps short_text duplicates
This is a numeric percentage (share evangelical) stored as text strings such as '0.000' and '6.000', 5-6 characters long and one token each. The distribution is heavily zero-inflated: 4205 of 16382 rows are '0.000', and the duplicate rate is 0.9315 across only 1047 unique values. Null rate is 0.0668, so roughly 7% are missing. Treatment: Cast to float and treat as a zero-inflated numeric feature.
- n
- 16,382
- nulls
- 1,095 (6.7%)
- unique
- 1,047
- len_min
- 5
- len_max
- 6
- len_mean
- 5.226
- len_median
- 5
- len_p95
- 6
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 14,240
- duplicate_rate
- 0.9315
- vocab_size
- 1,047
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
PercentEvangelicalPC
categorical feature
Numeric percentages (0.000 to 28.097) describing evangelical share, but stored as strings with only 228 distinct values across 16,382 rows, suggesting a precomputed group-level statistic (probably per people cluster, given the PC suffix) broadcast to many records rather than a per-row measurement. The top value '20.481' covers 7.0% of rows and the top ten values together account for a large fraction, consistent with one group-level value repeated on every member row. Entropy ratio is 0.87, so distribution is fairly spread but discretised. Treatment: Cast to float and treat as a group-level numeric feature; do not one-hot encode.
- n
- 16,382
- nulls
- 166 (1.0%)
- unique
- 228
- top_value
- 20.481
- top_rate
- 0.07024
- cardinality
- 228
- entropy
- 6.782
- entropy_ratio
- 0.8658
PercentEvangelicalPGAC
text feature one_word allcaps short_text duplicates
This is a numeric percentage (Percent Evangelical, PGAC) stored as text: every value is a single token of 5-6 characters formatted like '0.000', '4.000', '1.801'. The distribution is heavily zero-inflated: 3,272 of 16,382 rows are '0.000', duplicate rate is 0.896 across only 1,624 unique values, and 4.5% are null. Despite the column being typed as text, there is no real language content here. Treatment: Cast to float and treat as a zero-inflated numeric feature.
- n
- 16,382
- nulls
- 743 (4.5%)
- unique
- 1,624
- len_min
- 5
- len_max
- 6
- len_mean
- 5.235
- len_median
- 5
- len_p95
- 6
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 14,015
- duplicate_rate
- 0.8962
- vocab_size
- 1,624
- readability_flesch_mean
- 121.2
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 1
- boilerplate_rate
- 0
PCBuddhism
numeric feature high_skew outliers
PCBuddhism appears to be a per-record percentage of Buddhist adherents, ranging 0–100 with mean 3.77 and median 0. The distribution is overwhelmingly zero (zero_rate 0.89), so q1 = q3 = 0 and iqr = 0. With a zero IQR, an IQR-based outlier rule flags every nonzero value, which would explain why the outlier rate (0.1105) is exactly the complement of the zero rate (0.8895); the flag marks the nonzero minority, not anomalous records. Skew (4.77) and kurtosis (21.99) are extreme for the same reason. Treat this as a sparse, heavy-tailed minority share: most groups have no Buddhist presence, but a long tail reaches 100. Treatment: Add a zero-vs-nonzero indicator and log1p-transform the nonzero share before modelling.
- n
- 16,382
- nulls
- 104 (0.6%)
- unique
- 1,052
- min
- 0
- max
- 100
- mean
- 3.769
- median
- 0
- std
- 16.75
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 4.775
- kurtosis
- 22
- n_outliers
- 1,798
- outlier_rate
- 0.1105
- zero_rate
- 0.8895
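
The treatment above (indicator plus log1p) might look like the sketch below. It assumes shares are floats in 0–100 with None for missing; `zero_inflated_features` and the sample values are hypothetical.

```python
import math
from typing import List, Optional, Tuple

def zero_inflated_features(share: Optional[float]) -> Tuple[Optional[int], Optional[float]]:
    """Return (is_nonzero, log1p_share); both are None when the share is missing."""
    if share is None:
        return None, None
    return int(share > 0), math.log1p(share)

# Hypothetical sample: mostly zeros, a small share, a full share, and a null.
sample: List[Optional[float]] = [0.0, 0.0, 2.5, 100.0, None]
features = [zero_inflated_features(s) for s in sample]
```

Because log1p(0) is 0, zero rows stay at zero after the transform, and the indicator lets a model treat presence and magnitude separately.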
PCEthnicReligions
numeric feature outliers
PCEthnicReligions appears to be a percentage feature (0–100) capturing the share of ethnic/folk religion adherents per record. Just over half the rows are exactly zero (zero_rate 0.5045), so the median and Q1 are both 0, yet values stretch all the way to 100, producing moderate right skew (1.65) and 1,967 flagged outliers (12.05%). The distribution is effectively zero-inflated rather than continuous. Treatment: Model as zero-inflated: add an is_nonzero indicator and log1p-transform the positive values.
- n
- 16,382
- nulls
- 59 (0.4%)
- unique
- 978
- min
- 0
- max
- 100
- mean
- 17.6
- median
- 0
- std
- 29.02
- q1
- 0
- q3
- 25
- iqr
- 25
- skew
- 1.654
- kurtosis
- 1.404
- n_outliers
- 1,967
- outlier_rate
- 0.1205
- zero_rate
- 0.5045
PCHinduism
numeric feature high_skew outliers
PCHinduism appears to be a per-record percentage share of Hinduism, ranging from 0 to 100. The distribution is overwhelmingly zero (zero_rate 0.8343), so Q1, median, and Q3 are all 0, yet the mean is 14.01 with std 33.87, indicating that a minority of records carry very high values. Because the IQR is 0, the 16.57% of rows flagged as outliers are exactly the nonzero rows (0.8343 + 0.1657 = 1), so the flag reflects zero inflation rather than dirty data. Skew is 2.06. Treatment: Treat as a zero-inflated proportion: add a nonzero indicator and consider a log1p or sqrt transform before modelling.
- n
- 16,382
- nulls
- 105 (0.6%)
- unique
- 1,412
- min
- 0
- max
- 100
- mean
- 14.01
- median
- 0
- std
- 33.87
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 2.058
- kurtosis
- 2.3
- n_outliers
- 2,697
- outlier_rate
- 0.1657
- zero_rate
- 0.8343
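
One way to read the outlier flag for this column: zero_rate (0.8343) and outlier_rate (0.1657) sum to 1, which is what a 1.5 × IQR fence produces when q1 = q3 = 0. The sketch below assumes the profiler uses Tukey fences, which the report does not confirm; `tukey_outliers` and the sample values are hypothetical.

```python
from typing import List

def tukey_outliers(values: List[float], q1: float, q3: float, k: float = 1.5) -> List[float]:
    """Return the values that fall outside [q1 - k*IQR, q3 + k*IQR]."""
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: eight zeros and two nonzero shares, so q1 = q3 = 0.
sample = [0.0] * 8 + [35.0, 100.0]
flagged = tukey_outliers(sample, q1=0.0, q3=0.0)
```

With both fences collapsed to 0, every nonzero share is flagged, regardless of how plausible it is.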
PCOtherSmall
numeric feature high_skew outliers
PCOtherSmall appears to be the percentage share (0–100) of other or small religions per record, by analogy with the sibling PC* columns. 89.3% of values are exactly zero, so the median, Q1, and Q3 are all 0. The remaining mass is highly skewed (skew 11.0, kurtosis 124.0) with a max of 100, and the 10.7% flagged as outliers are simply the nonzero rows, since the IQR is 0. This is a sparse, long-tailed distribution rather than a typical continuous feature. Treatment: Treat as zero-inflated: add a binary is_nonzero flag and log1p-transform the positive tail before modelling.
- n
- 16,382
- nulls
- 104 (0.6%)
- unique
- 908
- min
- 0
- max
- 100
- mean
- 0.9613
- median
- 0
- std
- 8.299
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 11
- kurtosis
- 124
- n_outliers
- 1,749
- outlier_rate
- 0.1074
- zero_rate
- 0.8926
RegionCode
numeric feature
RegionCode is an integer-valued field ranging from 1 to 12 with only 12 unique values across 16,382 rows and no nulls. The flat distribution (kurtosis -1.20, skew 0.23, no outliers) and small cardinality indicate a categorical region identifier encoded numerically rather than a true numeric measure. Treatment: Cast to categorical and one-hot or target-encode before modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 12
- min
- 1
- max
- 12
- mean
- 5.935
- median
- 5
- std
- 3.42
- q1
- 3
- q3
- 8
- iqr
- 5
- skew
- 0.231
- kurtosis
- -1.201
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
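
A one-hot encoding of the 12 region codes could be sketched like this. The feature names (`RegionCode_1` and so on) and the helper `one_hot` are hypothetical; the level range 1 to 12 comes from the stats above.

```python
from typing import Dict, List

def one_hot(code: int, levels: List[int]) -> Dict[str, int]:
    """Map an integer region code to one indicator per known level."""
    if code not in levels:
        raise ValueError(f"unexpected RegionCode: {code}")
    return {f"RegionCode_{lvl}": int(code == lvl) for lvl in levels}

levels = list(range(1, 13))  # codes 1..12, as reported
row = one_hot(5, levels)
```

Raising on an unseen code is a design choice for this sketch; silently emitting an all-zero row would hide data drift.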
PopulationPGAC
numeric feature high_skew outliers
PopulationPGAC is a population count spanning 10 to 925,129,800 with a median of just 88,000. The PGAC suffix likely denotes the people group's total across all countries, which would explain why only 2,250 distinct values cover 16,382 rows: the same group-level total repeats on each of the group's country records. The distribution is extremely right-skewed (skew 15.15, kurtosis 262.66) and 19.2% of rows flag as outliers, with the mean (8.8M) two orders of magnitude above the median. Nulls are negligible (0.09%) and there are no zeros, but the gap between Q3 (1.39M) and the max indicates a long, heavy tail of very large groups. Treatment: log-transform before any modelling or distance-based comparison.
- n
- 16,382
- nulls
- 15 (0.1%)
- unique
- 2,250
- min
- 10
- max
- 9.251e+08
- mean
- 8.812e+06
- median
- 88,000
- std
- 5.114e+07
- q1
- 8,800
- q3
- 1.386e+06
- iqr
- 1.377e+06
- skew
- 15.15
- kurtosis
- 262.7
- n_outliers
- 3,145
- outlier_rate
- 0.1922
- zero_rate
- 0
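
To see what the log transform does to this range, the reported summary statistics can be pushed through log10. The base is a choice made for this sketch (the treatment does not specify one), and q3 is the rounded value from the listing.

```python
import math

# Summary statistics as reported above (q3 is rounded in the listing).
stats = {
    "min": 10,
    "q1": 8_800,
    "median": 88_000,
    "q3": 1_386_000,
    "max": 925_129_800,
}

# No zeros are reported, so log10 is defined for every value.
log_stats = {name: math.log10(value) for name, value in stats.items()}
```

On the log scale the column spans roughly 1 to 9 with a median near 4.9, so the largest groups no longer dominate distances or means.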
Frontier
categorical feature
Binary Y/N flag named 'Frontier', fully populated across 16,382 rows with only 2 distinct values. The 'N' class dominates at 70.9% versus 29.1% 'Y', giving an entropy ratio of 0.87: moderately imbalanced but well within usable range. Treatment: Encode as 0/1 boolean and use directly as a feature.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.709
- cardinality
- 2
- entropy
- 0.87
- entropy_ratio
- 0.87
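
The entropy figures for this flag can be reproduced from top_rate alone. The sketch assumes entropy is measured in bits and that entropy_ratio is entropy divided by log2 of the cardinality, which is consistent with the numbers reported here.

```python
import math
from typing import List

def entropy_bits(probs: List[float]) -> float:
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.709, 0.291]            # 'N' and 'Y' shares from top_rate
h = entropy_bits(probs)
ratio = h / math.log2(len(probs))  # the divisor is 1 bit for a binary flag
```

For a two-level column the ratio equals the entropy itself, which is why both lines in the listing show 0.87.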
MapAddress
text foreign_key one_word short_text duplicates
MapAddress holds single-token PNG filenames (e.g. m00320.png), almost certainly references to map image assets. Over half the column is empty (8,728 of 16,382 rows, stored as empty strings rather than nulls) and 63.2% of rows count as duplicates, largely because of those empties; there are only 6,029 distinct values including the empty string. Every non-empty value is one word with max length 13, so this behaves like a sparse foreign key to a map asset rather than free text. Treatment: Treat as a categorical asset reference; recode empties as 'none' and join to the map asset table.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 6,029
- len_min
- 0
- len_max
- 13
- len_mean
- 5.153
- len_median
- 0
- len_p95
- 13
- word_mean
- 1
- word_median
- 1
- n_empty
- 8,728
- n_duplicates
- 10,353
- duplicate_rate
- 0.632
- vocab_size
- 6,028
- readability_flesch_mean
- 13.52
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
HasJesusFilm
categorical feature
Binary Y/N flag indicating whether the JESUS Film is available for each record, with no nulls across 16,382 rows. The split is roughly 2:1 in favour of 'Y' (10,816 vs 5,566; top_rate 0.660), giving a high entropy ratio of 0.925: informative but mildly imbalanced. Treatment: Encode as a 0/1 boolean for modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.6602
- cardinality
- 2
- entropy
- 0.9246
- entropy_ratio
- 0.9246
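
The 0/1 encoding used for this and the other Y/N flags can be sketched as a strict mapping. `yn_to_int` is a hypothetical helper; it assumes the only legal values are 'Y' and 'N', as the cardinality of 2 suggests.

```python
from typing import List

def yn_to_int(flag: str) -> int:
    """Encode a 'Y'/'N' flag as 1/0 and reject anything else."""
    mapping = {"Y": 1, "N": 0}
    if flag not in mapping:
        raise ValueError(f"unexpected flag value: {flag!r}")
    return mapping[flag]

# Hypothetical sample rows.
sample: List[str] = ["Y", "N", "Y"]
encoded = [yn_to_int(f) for f in sample]
```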
Nomadic
categorical feature imbalance
Binary Y/N flag indicating whether a record is 'Nomadic', with no nulls across 16,382 rows. The distribution is severely imbalanced: 'N' covers 98.1% (16,071) versus only 311 'Y' cases, yielding an entropy ratio of just 0.136. Treatment: Encode as a boolean; consider class-weighting or resampling since positives are only ~1.9%.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.981
- cardinality
- 2
- entropy
- 0.1357
- entropy_ratio
- 0.1357
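
For the class-weighting option, one common choice is the "balanced" heuristic, weight = n / (k × class count), the same formula scikit-learn documents for class_weight='balanced'. The report does not prescribe a scheme, so treat this as one possible sketch using the counts above.

```python
# Class counts as reported for Nomadic.
counts = {"N": 16_071, "Y": 311}

total = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic: rarer classes receive proportionally larger weights.
weights = {label: total / (n_classes * count) for label, count in counts.items()}
```

Under this scheme each 'Y' row weighs about 26 and each 'N' row about 0.51, so the two classes contribute equally in aggregate.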
NomadicTypeDescription
categorical metadata null_rate
Categorical descriptor of nomadic livelihood type, with six values combining three base categories (Agro-Pastoralists, Service or Trade, Hunter-Gatherers) singly or in pairs. The column is 98.1% null and populated for only 311 of 16,382 rows, which matches the 311 'Y' rows in Nomadic, so the nulls most likely mean 'not nomadic' rather than 'unknown'. Among populated rows, Agro-Pastoralists dominates at 68.2%. The sparsity makes this effectively a rare annotation rather than a general feature. Treatment: Treat as sparse metadata; fill nulls with an explicit 'Not nomadic' or 'Unknown' category, or drop unless modelling the populated subset.
- n
- 16,382
- nulls
- 16,071 (98.1%)
- unique
- 6
- top_value
- Agro-Pastoralists
- top_rate
- 0.6817
- cardinality
- 6
- entropy
- 1.341
- entropy_ratio
- 0.5187
PhotoCCVersionText
categorical metadata
This column records the Creative Commons license version attached to a photo, with 17 distinct values across 16,382 rows and no nulls. It is dominated by empty strings at 83.6% (13,688 rows), leaving only ~16% with an actual license tag, the most common being 'CC BY 2.0' (661) and 'CC BY-NC-SA 2.0' (440). The low entropy ratio (0.28) confirms the field is sparse in practice despite zero technical nulls. Treatment: Treat empty string as missing and one-hot encode the remaining license categories.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 17
- top_value
- (empty string)
- top_rate
- 0.8356
- cardinality
- 17
- entropy
- 1.137
- entropy_ratio
- 0.2781
PhotoCCVersionURL
categorical metadata
This column holds Creative Commons license URLs associated with photos, drawn from a closed vocabulary of 17 distinct values. It is overwhelmingly empty: 13,688 of 16,382 rows (top_rate 0.836) carry the blank string rather than a license, leaving CC BY 2.0 (661) and CC BY-NC-SA 2.0 (440) as the most common actual licenses. The entropy ratio of 0.278 confirms the distribution is highly concentrated on the empty value. These counts match PhotoCCVersionText exactly, so the two columns appear to carry the same information. Treatment: Treat blank as missing and bucket the remaining license URLs into a low-cardinality categorical.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 17
- top_value
- (empty string)
- top_rate
- 0.8356
- cardinality
- 17
- entropy
- 1.137
- entropy_ratio
- 0.2781
MapCredits
categorical metadata long_tail
Attribution string for the map asset associated with each record, naming data, geography, and design contributors. Over half the rows (top_rate 0.533, 8,733 of 16,382) carry an empty string, and the remaining mass is spread across 198 non-empty credit lines, some of which are near-duplicates: for example, two variants of the Omid/UNESCO/GMI credit differ only by a trailing period (2,228 vs 864 rows). The entropy ratio of 0.357 and the long_tail alert confirm a few dominant phrasings plus a sparse tail. Treatment: Treat as provenance metadata; normalize whitespace/punctuation to collapse duplicate credit strings, and exclude from modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 199
- top_value
- (empty string)
- top_rate
- 0.5331
- cardinality
- 199
- entropy
- 2.726
- entropy_ratio
- 0.357
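
The normalization step in the treatment could be sketched as below: collapse runs of whitespace and drop a trailing period, then count. The credit strings are hypothetical stand-ins, not values from the dataset, and `normalize_credit` is an illustrative helper.

```python
import re
from collections import Counter

def normalize_credit(text: str) -> str:
    """Collapse whitespace and strip one trailing period so near-duplicates match."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(".").rstrip()

# Hypothetical credit lines differing only in spacing and a trailing period.
sample = [
    "Data: Source A. Design: Studio B.",
    "Data: Source A. Design: Studio B",
    "Data:  Source A. Design: Studio B. ",
    "",
]

# Skip blanks, then count normalized variants.
counts = Counter(normalize_credit(s) for s in sample if s.strip())
```

Only the final period is removed, so periods inside the credit line stay intact.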
MapCreditURL
categorical metadata long_tail imbalance
Optional attribution URL for the source map of each record, blank in 15,891 of 16,382 rows (top_rate 0.97). Only 50 distinct non-blank values populate the remaining 3%, led by asiaharvest.org (146) and cartomission.com (117), giving a very low entropy_ratio of 0.052. The mix of http URLs and a mailto: address suggests inconsistent data entry rather than a controlled vocabulary. Treatment: Drop from modelling; retain only as a provenance link for the ~3% of rows that carry a value.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 51
- top_value
- (empty string)
- top_rate
- 0.97
- cardinality
- 51
- entropy
- 0.2954
- entropy_ratio
- 0.05207
MapCopyright
categorical feature
A near-binary flag (with blanks) indicating map copyright status, dominated by 'N' at 86.5% of 16,382 rows. Only 118 records carry 'Y' and 2,100 are empty strings, giving just 3 distinct values and an entropy ratio of 0.387. The empty category is large enough that it should be treated as its own level rather than silently coerced. Treatment: Encode as a 3-level categorical (N / Y / blank); low signal due to severe imbalance.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- N
- top_rate
- 0.8646
- cardinality
- 3
- entropy
- 0.6126
- entropy_ratio
- 0.3865
MapCCVersionText
categorical metadata imbalance
This column records Creative Commons licence versions for map content, but 16,347 of 16,382 rows (top_rate 0.9979) are empty strings, leaving only 35 rows spread across five actual licences (CC0 1.0, CC BY-SA 3.0, CC BY 3.0, CC BY 2.0, CC BY-SA 4.0). The entropy ratio of 0.0099 confirms the column carries almost no information. Nulls are reported as 0 because the missing values are stored as empty strings rather than true nulls. Treatment: Drop or collapse to a binary has_licence flag; too sparse to use as a feature.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- (empty string)
- top_rate
- 0.9979
- cardinality
- 6
- entropy
- 0.02557
- entropy_ratio
- 0.009891
MapCCVersionURL
categorical metadata imbalance
MapCCVersionURL appears to be a Creative Commons license URL field attached to map records, with five distinct CC variants observed (CC0, BY-SA 3.0/4.0, BY 2.0/3.0). It is effectively empty: 16,347 of 16,382 rows (top_rate 0.9979) are blank strings, leaving only 35 rows with an actual license URL and an entropy_ratio of 0.0099. The null rate is reported as 0 because empties are stored as '' rather than nulls, which is itself worth flagging. Treatment: Drop or collapse to a binary has_license flag; near-constant empty string.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- (empty string)
- top_rate
- 0.9979
- cardinality
- 6
- entropy
- 0.02557
- entropy_ratio
- 0.009891
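
Collapsing either map-licence column to the binary flag named in the treatment might look like this. The sketch assumes that blank, whitespace-only, and None values all mean "no licence recorded"; the sample uses licence labels listed for MapCCVersionText.

```python
from typing import List, Optional

def has_license(value: Optional[str]) -> int:
    """Return 1 when a non-blank licence value is present, else 0."""
    return int(bool(value and value.strip()))

# Sample mixing blanks with two of the licence labels reported above.
sample: List[Optional[str]] = ["", "CC BY-SA 3.0", "   ", None, "CC0 1.0"]
flags = [has_license(v) for v in sample]
```

With only 35 populated rows, even this flag will carry very little signal.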
JF
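categorical feature
JF is a binary Y/N flag with no nulls across 16,382 rows. The split is moderately imbalanced, with Y at 66.0% (10,816) versus N at 34.0% (5,566), yielding a high entropy ratio of 0.925. These counts are identical to HasJesusFilm, so JF is probably a duplicate of that column under a shorter name; confirm row-level equality before dropping one of them. Treatment: Encode as a 0/1 indicator before modelling.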
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.6602
- cardinality
- 2
- entropy
- 0.9246
- entropy_ratio
- 0.9246
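
JF reports exactly the same counts as HasJesusFilm (10,816 'Y', 5,566 'N'), but matching totals do not prove the rows agree. A row-level check can be sketched as follows; the two short lists are hypothetical stand-ins for the real columns.

```python
from typing import List, Sequence

def mismatched_rows(a: Sequence[str], b: Sequence[str]) -> List[int]:
    """Return the row positions where two equally long columns disagree."""
    if len(a) != len(b):
        raise ValueError("columns must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Hypothetical values standing in for the two columns.
has_jesus_film = ["Y", "N", "Y", "Y"]
jf = ["Y", "N", "Y", "Y"]
diff = mismatched_rows(has_jesus_film, jf)
```

An empty result on the full data would justify keeping only one of the two columns.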
AudioRecordings
categorical feature
Binary Y/N flag indicating whether audio recordings are present, with no nulls across 16,382 rows. The distribution is imbalanced: 'Y' dominates at 82.3% (13,479) versus 2,903 'N' values, yielding an entropy ratio of 0.67. Treatment: Encode as a 0/1 boolean; be aware of the ~82/18 class imbalance when using as a predictor.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.8228
- cardinality
- 2
- entropy
- 0.6739
- entropy_ratio
- 0.6739
Window1040
categorical feature
Window1040 is a binary Y/N flag covering all 16,382 rows with no nulls. The split is nearly even (Y at 52.3%, or 8,572 rows, versus N at 47.7%, or 7,810 rows), giving an entropy ratio of 0.998, essentially maximum uncertainty for a two-class field. The name most likely refers to the 10/40 Window, the band between roughly 10 and 40 degrees north latitude used in missions literature, so the flag probably marks whether the record falls inside it. The balanced distribution makes it a usable feature rather than a degenerate constant. Treatment: Encode as a 0/1 indicator for modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.5233
- cardinality
- 2
- entropy
- 0.9984
- entropy_ratio
- 0.9984
PeopleGroupMapURL
text metadata one_word url_heavy duplicates
This column holds URLs to people-group map images hosted on joshuaproject.net, with every non-empty entry being a single-token link. Over half the rows (8,728 of 16,382) are empty strings, which drives most of the duplicate_rate of 0.63; the 7,654 populated rows share 6,028 distinct URLs. The url_rate of 0.47 is simply the share of non-empty rows. The empty and unique counts match MapAddress exactly, so the two columns appear to reference the same assets. Treatment: Treat as an optional asset URL: keep as-is for display, or drop if not rendering images.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 6,029
- len_min
- 0
- len_max
- 66
- len_mean
- 29.92
- len_median
- 0
- len_p95
- 66
- word_mean
- 1
- word_median
- 1
- n_empty
- 8,728
- n_duplicates
- 10,353
- duplicate_rate
- 0.632
- vocab_size
- 6,028
- readability_flesch_mean
- -329.1
- emoji_rate
- 0
- url_rate
- 0.4672
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
PeopleGroupMapExpandedURL
text metadata one_word url_heavy duplicates
This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net, at most one link per row. It is mostly empty: 9,538 of 16,382 rows are blank (the modal value), which is why len_median is 0 and url_rate is only 0.418. The overall duplicate_rate of 0.661 is inflated by those blanks, but populated rows are reused too: a single map (m00320.pdf) appears 96 times, indicating that many people-group records share the same regional map. Treatment: Treat as an optional reference link; drop for modelling or keep only as a join key to map assets.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 5,561
- len_min
- 0
- len_max
- 66
- len_mean
- 26.74
- len_median
- 0
- len_p95
- 66
- word_mean
- 1
- word_median
- 1
- n_empty
- 9,538
- n_duplicates
- 10,821
- duplicate_rate
- 0.6605
- vocab_size
- 5,560
- readability_flesch_mean
- -256.6
- emoji_rate
- 0
- url_rate
- 0.4178
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
PeopleGroupURL
text identifier near_unique one_word url_heavy
This column holds Joshua Project people-group URLs, one per row, with a fixed length of 48 characters and exactly one 'word' per cell. Every value is unique across all 16,382 rows (n_unique equals n, duplicate_rate 0.0) and url_rate is 1.0, so it functions as a row-level identifier rather than analysable text. The URL pattern encodes a numeric people-group id plus a two-letter country suffix (e.g. /10375/tz, /10375/up), meaning the same group repeats across countries via different URLs. Treatment: Drop from modelling; keep as a row key or parse out the people-group id and country code if a join is needed.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 16,382
- len_min
- 48
- len_max
- 48
- len_mean
- 48
- len_median
- 48
- len_p95
- 48
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 0
- duplicate_rate
- 0
- vocab_size
- 16,382
- readability_flesch_mean
- -476.9
- emoji_rate
- 0
- url_rate
- 1
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
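
Parsing the id and country suffix out of the URL tail could be sketched as below. Only the trailing `/<id>/<cc>` shape comes from the report; the path prefix in the example URL is an assumption, and `parse_people_group_url` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Matches a numeric id followed by a two-letter code at the end of the URL.
_TAIL = re.compile(r"/(\d+)/([A-Za-z]{2})/?$")

def parse_people_group_url(url: str) -> Optional[Tuple[int, str]]:
    """Return (people_group_id, COUNTRY_CODE), or None if the tail does not match."""
    match = _TAIL.search(url)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).upper()

# Example URL: the '/people_groups' segment is assumed, not quoted from the data.
example = "https://joshuaproject.net/people_groups/10375/tz"
parsed = parse_people_group_url(example)
```

Because the pattern anchors on the tail only, it does not depend on the assumed path prefix being correct.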
PeopleGroupPhotoURL
text metadata one_word url_heavy duplicates
This column holds URLs to people-group profile photos hosted on joshuaproject.net, with every non-empty value being a single token (one_word_rate 1.0, url_rate 0.65). Notably, 5,719 of 16,382 rows are empty strings, and reuse is heavy among the rest (n_duplicates 11,105, duplicate_rate 0.68): excluding the empty string, only 5,276 distinct URLs serve the 10,663 populated rows, meaning many groups share the same photo. The top URL alone repeats 92 times. Treatment: Treat as an optional image asset link; drop for modelling or use only to fetch images, and handle the 5,719 empty values as missing.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 5,277
- len_min
- 0
- len_max
- 68
- len_mean
- 42.32
- len_median
- 65
- len_p95
- 65
- word_mean
- 1
- word_median
- 1
- n_empty
- 5,719
- n_duplicates
- 11,105
- duplicate_rate
- 0.6779
- vocab_size
- 5,276
- readability_flesch_mean
- -585.6
- emoji_rate
- 0
- url_rate
- 0.6509
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
CountryURL
categorical metadata
URL pointing to a country page on joshuaproject.net, with the trailing two-letter code acting as the country identifier. 238 distinct countries appear across 16,382 rows with no nulls; India (IN) leads at 13.8% (2,262 rows), followed by PP (883) and ID (788). The high entropy ratio (0.79) indicates the distribution is broad rather than concentrated despite the India lead. The codes do not all appear to be ISO 3166-1 alpha-2 (PP is not an assigned ISO code), so check the code scheme before joining to external country tables. Treatment: Strip the URL prefix and keep the two-letter country code as a categorical feature.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 238
- top_value
- https://joshuaproject.net/countries/IN
- top_rate
- 0.1381
- cardinality
- 238
- entropy
- 6.225
- entropy_ratio
- 0.7885
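
Stripping the prefix can be sketched with plain string handling. The prefix is taken from the top_value shown above; `country_code` is a hypothetical helper that returns None for anything that does not match.

```python
from typing import Optional

PREFIX = "https://joshuaproject.net/countries/"

def country_code(url: str) -> Optional[str]:
    """Return the trailing country code, or None when the URL has another shape."""
    if not url.startswith(PREFIX):
        return None
    code = url[len(PREFIX):].strip("/")
    return code or None

code = country_code("https://joshuaproject.net/countries/IN")
```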
JPScaleText
categorical label
JPScaleText is a 5-level ordinal categorical describing how 'reached' a people group is, ranging from 'Unreached' to 'Significantly Reached'. The distribution is concentrated at the low end: 'Unreached' covers 43.5% of 16,382 rows (7,124), while 'Minimally Reached' is the rarest at 1,009. No nulls and an entropy ratio of 0.87 indicate that all five levels are represented, though unevenly. Treatment: Encode as an ordered ordinal (Unreached < Minimally < Superficially < Partially < Significantly) before modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- Unreached
- top_rate
- 0.4349
- cardinality
- 5
- entropy
- 2.017
- entropy_ratio
- 0.8688
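
The ordinal encoding in the treatment can be written as an explicit rank table. The labels are the five values listed for this column; starting the ranks at 1 is a choice made for this sketch.

```python
# Levels in the order given by the treatment, lowest to highest.
JP_SCALE_ORDER = [
    "Unreached",
    "Minimally Reached",
    "Superficially Reached",
    "Partially Reached",
    "Significantly Reached",
]

# Rank 1 = Unreached ... rank 5 = Significantly Reached.
JP_SCALE_RANK = {label: rank for rank, label in enumerate(JP_SCALE_ORDER, start=1)}

ranks = [JP_SCALE_RANK[label] for label in ["Unreached", "Partially Reached"]]
```

A dictionary lookup raises KeyError on an unexpected label, which surfaces spelling drift instead of hiding it.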
JPScaleImageURL
categorical metadata
This column holds URLs to one of five 'gauge' images on joshuaproject.net, almost certainly a visual encoding of an ordinal Joshua Project Scale (1-5). The distribution mirrors JPScaleText: gauge-1 leads at 43.5% of 16,382 rows, gauge-2 is the rarest at 1,009 rows, and the entropy ratio is 0.87. There are no nulls, but the URL itself carries no information beyond the underlying 1-5 code. Treatment: Extract the trailing digit as an ordinal feature and drop the URL string.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- https://joshuaproject.net/assets/img/gauge/gauge-1.png
- top_rate
- 0.4349
- cardinality
- 5
- entropy
- 2.017
- entropy_ratio
- 0.8688
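
Extracting the gauge digit could be sketched with a small regular expression. The file-name shape `gauge-N.png` comes from the top_value above; `gauge_level` is a hypothetical helper.

```python
import re
from typing import Optional

_GAUGE = re.compile(r"gauge-([1-5])\.png$")

def gauge_level(url: str) -> Optional[int]:
    """Return the 1-5 level encoded in a gauge image URL, or None if absent."""
    match = _GAUGE.search(url)
    return int(match.group(1)) if match else None

level = gauge_level("https://joshuaproject.net/assets/img/gauge/gauge-1.png")
```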
Summary
text free_text one_word duplicates
An ethnographic descriptor field, likely a people-group summary paragraph in English. The column is dominated by emptiness: 12,328 of 16,382 rows are blank, which drives the median length of 0, the one_word_rate of 0.75, and most of the duplicate_rate of 0.77. Excluding blanks, 4,054 populated rows carry 3,777 distinct texts, so exact duplication among populated rows is modest (about 7%), although some write-ups (Rajput, Jat) repeat dozens of times, partly in slight variants. Populated texts can be substantial (len_p95 = 719, max 1,212), and the mean Flesch readability of 13.2 suggests dense prose, though that mean may be distorted by the blank rows. Treatment: Deduplicate and drop empties before any NLP; treat as supplementary description rather than a per-row feature.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 3,778
- len_min
- 0
- len_max
- 1,212
- len_mean
- 137.7
- len_median
- 0
- len_p95
- 719
- word_mean
- 23.34
- word_median
- 1
- n_empty
- 12,328
- n_duplicates
- 12,604
- duplicate_rate
- 0.7694
- vocab_size
- 24,964
- readability_flesch_mean
- 13.16
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.7525
- allcaps_rate
- 0
- boilerplate_rate
- 0.0001221
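
The duplicate_rate for this column mixes blank rows with genuinely repeated text. The two can be separated from the reported counts alone, assuming the profiler computes n_duplicates as n minus unique and counts the empty string as one unique value (consistent with 16,382 - 3,778 = 12,604 here). `duplicate_rates` is a hypothetical helper.

```python
from typing import Tuple

def duplicate_rates(n_rows: int, n_empty: int, n_unique: int) -> Tuple[float, float]:
    """Return (overall duplicate rate, duplicate rate among non-empty rows)."""
    overall = (n_rows - n_unique) / n_rows
    non_empty = n_rows - n_empty
    # The empty string counts as one unique value when any blanks exist.
    unique_non_empty = n_unique - (1 if n_empty else 0)
    among_non_empty = (non_empty - unique_non_empty) / non_empty
    return overall, among_non_empty

# Counts reported for Summary: n, n_empty, unique.
overall, among_non_empty = duplicate_rates(16_382, 12_328, 3_778)
```

The same helper applies to the other sparse text columns by swapping in their n_empty and unique counts.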
Obstacles
text free_text one_word duplicates
Free-text English commentary describing spiritual or cultural obstacles to Christian evangelism for various ethnic groups (Rajputs, Jats, Bosniaks, Azeri, etc.). The field is overwhelmingly empty: 12,327 of 16,382 rows are blank, driving a median length of 0, a one-word rate of 0.75, and most of the duplicate_rate of 0.77. Among the 4,055 populated rows there are 3,731 distinct texts, with the same Rajput and Jat paragraphs repeated 88 and 74 times, suggesting templated entries reused across related people-group records. Treatment: Treat blanks as missing and deduplicate template paragraphs before tokenizing/embedding for any text modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 3,732
- len_min
- 0
- len_max
- 726
- len_mean
- 47.4
- len_median
- 0
- len_p95
- 267
- word_mean
- 8.704
- word_median
- 1
- n_empty
- 12,327
- n_duplicates
- 12,650
- duplicate_rate
- 0.7722
- vocab_size
- 9,899
- readability_flesch_mean
- 12.76
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.7525
- allcaps_rate
- 0
- boilerplate_rate
- 0.0004273
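
Dropping blanks and deduplicating template paragraphs before text modelling might look like the sketch below. It matches on a whitespace-collapsed, case-folded key, which is one possible definition of "duplicate"; the sample strings are hypothetical, and `unique_non_empty` is an illustrative helper.

```python
from typing import Iterable, List, Optional

def unique_non_empty(texts: Iterable[Optional[str]]) -> List[str]:
    """Keep the first occurrence of each non-blank text, comparing normalized keys."""
    seen = set()
    kept: List[str] = []
    for text in texts:
        if text is None:
            continue
        key = " ".join(text.split()).casefold()
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(text.strip())
    return kept

# Hypothetical rows: blanks, a repeated paragraph, and a spacing variant.
sample = ["", "Template paragraph one.", "Template  paragraph one.", None, "Another paragraph."]
kept = unique_non_empty(sample)
```

Near-duplicates that differ by a word would survive this pass and need fuzzy matching instead.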
HowReach
text free_text one_word duplicates
This is a free-text field describing how a people group could be reached. The column is dominated by emptiness: 13,043 of 16,382 rows (n_empty) are blank, driving a median length of 0 and a duplicate_rate of 0.82. Among populated entries, prose is substantive (len_max 599, len_p95 221) but often repeated: the same paragraph about Jats appears 136 times and several near-duplicates differ only by a word, suggesting templated copy across related groups. Treatment: Treat as sparse free text: filter out empties and deduplicate near-identical strings before any tokenization or embedding.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 2,944
- len_min
- 0
- len_max
- 599
- len_mean
- 36.3
- len_median
- 0
- len_p95
- 221
- word_mean
- 6.879
- word_median
- 1
- n_empty
- 13,043
- n_duplicates
- 13,438
- duplicate_rate
- 0.8203
- vocab_size
- 7,981
- readability_flesch_mean
- 12.17
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.7962
- allcaps_rate
- 0
- boilerplate_rate
- 0.0001221
PrayForChurch
text free_text one_word duplicates
Free-text prayer prompts focused on praying for the church among unreached groups (top words: pray, the, among). The column is mostly empty: 14,208 of 16,382 rows are blank, which accounts for most of the overall duplicate_rate of 0.89. The 2,174 populated rows carry 1,790 distinct texts, and a single Jat-related prayer appears 146 times. That reuse and a vocabulary of only 4,576 words suggest templated prayer points rather than individually authored prose; all 652 language-tagged rows are English. Treatment: Treat as sparse optional commentary; treat empties as missing and dedupe templates before any text modelling.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 1,791
- len_min
- 0
- len_max
- 649
- len_mean
- 26.91
- len_median
- 0
- len_p95
- 210
- word_mean
- 5.599
- word_median
- 1
- n_empty
- 14,208
- n_duplicates
- 14,591
- duplicate_rate
- 0.8907
- vocab_size
- 4,576
- readability_flesch_mean
- 7.55
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.8673
- allcaps_rate
- 0
- boilerplate_rate
- 0
PrayForPG
text free_text one_word duplicates
Free-text prayer points for a people group (PG), recognisable by the recurring 'Pray for...' templates. The column is mostly empty: 12,570 of 16,382 rows are blank (median length 0, median word count 1), which drives most of the 0.78 duplicate rate. Among populated rows, one Rajput-focused prayer block is repeated 88 times. The Flesch mean of 14.3 is consistent with dense, formulaic devotional prose. Treatment: Treat as sparse boilerplate: drop empties, dedupe, and only embed the ~3.5k unique strings if prayer content is actually needed downstream.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- 3,530
- len_min
- 0
- len_max
- 937
- len_mean
- 72.22
- len_median
- 0
- len_p95
- 400
- word_mean
- 13.06
- word_median
- 1
- n_empty
- 12,570
- n_duplicates
- 12,852
- duplicate_rate
- 0.7845
- vocab_size
- 9,427
- readability_flesch_mean
- 14.31
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.7673
- allcaps_rate
- 0
- boilerplate_rate
- 0
Resources
unknown other skipped
The column is named "Resources", but the profiler (saturn) skipped it, so its kind is unknown and no descriptive statistics were computed. We can confirm 16,382 rows with a null_rate of 0.0, but n_unique and value-level signals are missing. Without those stats the semantic role and content type cannot be determined from the evidence alone. Treatment: Re-profile this column with type coercion before deciding how to use it.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- —
country_data
unknown other skipped
The column `country_data` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16,382 and a null rate of 0.0. Without `n_unique` or any sample stats, the actual content (codes, names, nested objects) cannot be inferred from the evidence. Treatment: Re-profile with an appropriate parser to determine structure before use.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- —
language_data
unknown other skipped
The `language_data` column was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16,382 and a null rate of 0.0. Without `n_unique` or any descriptive stats, the actual contents (likely some structured or nested language payload given the name) cannot be characterized here. Treatment: Re-profile with an appropriate parser (e.g., expand JSON or cast to string) before deciding on downstream use.
- n
- 16,382
- nulls
- 0 (0.0%)
- unique
- —