joshua project unreached

source /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_unreached.parquet · 7,124 rows · 109 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Joshua Project catalogue of 7,124 unreached people groups described across 109 fields covering geography, language, religion, population, and outreach status. Every row is flagged as 'Unreached' (JPScaleText is constant) and 'LeastReached' is uniformly Y, so the analytical interest sits in the breakdown by region, religion, and population rather than in reach status itself. The data is heavily skewed toward Asia (5,351 of 7,124) and especially South Asian Peoples (3,681), with India alone accounting for 2,032 groups; Islam (3,279) and Hinduism (2,142) dominate PrimaryReligion. Population is extremely long-tailed (median 30,000 vs. max 135.5M, skew ~21), so any size-based analysis should use log scales or medians. Worth a closer look first: the Continent/Region/Country concentration, the religion mix, and the population distribution — these three together explain most of the dataset's shape.

citing: row_count · column_count · Continent.top_values · AffinityBloc.top_values · Ctry.top_values · PrimaryReligion.top_values · Population.stats · RegionName.top_values · Frontier.top_values · JPScaleText.top_values · LeastReached.top_values · PrimaryLanguageName.top_values

Charts the summary said to look at first

Continent · Shows how heavily the dataset concentrates in Asia (~75%) versus other continents.


Top values for Continent (7 unique shown, of 7 total).
value	count	share
Asia	5351	75.1%
Africa	986	13.8%
Europe	431	6.0%
North America	175	2.5%
South America	106	1.5%
Australia	39	0.5%
Oceania	36	0.5%

PrimaryReligion · Highlights that Islam and Hinduism together account for the majority of unreached groups.


Top values for PrimaryReligion (7 unique shown, of 7 total).
value	count	share
Islam	3279	46.0%
Hinduism	2142	30.1%
Ethnic Religions	933	13.1%
Buddhism	480	6.7%
Unknown	157	2.2%
Other / Small	120	1.7%
Non-Religious	13	0.2%

RegionName · Breaks the Asia bulk into sub-regions, with Asia South alone covering nearly half the rows.


Top values for RegionName (12 unique shown, of 12 total).
value	count	share
Asia, South	3349	47.0%
Asia, Southeast	726	10.2%
Asia, Northeast	521	7.3%
Africa, West and Central	460	6.5%
Africa, North and Middle East	444	6.2%
Africa, East and Southern	373	5.2%
Asia, Central	352	4.9%
Europe, Western	320	4.5%
Europe, Eastern and Eurasia	223	3.1%
America, North and Caribbean	160	2.2%
America, Latin	121	1.7%
Australia and Pacific	75	1.1%

Population · Reveals an extreme right-skew — most groups are small (median 30k) but a few exceed 100M; consider a log scale.


Histogram bins for Population (median: 30000.0).
bin	count
10 – 3.388e+06	6930
3.388e+06 – 6.777e+06	78
6.777e+06 – 1.016e+07	36
1.016e+07 – 1.355e+07	17
1.355e+07 – 1.694e+07	11
1.694e+07 – 2.033e+07	8
2.033e+07 – 2.372e+07	7
2.372e+07 – 2.711e+07	3
2.711e+07 – 3.049e+07	2
3.049e+07 – 3.388e+07	1
3.388e+07 – 3.727e+07	2
3.727e+07 – 4.066e+07	4
4.066e+07 – 4.405e+07	0
4.405e+07 – 4.744e+07	3
4.744e+07 – 5.082e+07	1
5.082e+07 – 5.421e+07	0
5.421e+07 – 5.76e+07	1
5.76e+07 – 6.099e+07	0
6.099e+07 – 6.438e+07	1
6.438e+07 – 6.777e+07	1
6.777e+07 – 7.115e+07	0
7.115e+07 – 7.454e+07	0
7.454e+07 – 7.793e+07	0
7.793e+07 – 8.132e+07	0
8.132e+07 – 8.471e+07	0
8.471e+07 – 8.81e+07	0
8.81e+07 – 9.148e+07	0
9.148e+07 – 9.487e+07	0
9.487e+07 – 9.826e+07	0
9.826e+07 – 1.016e+08	1
1.016e+08 – 1.05e+08	0
1.05e+08 – 1.084e+08	0
1.084e+08 – 1.118e+08	0
1.118e+08 – 1.152e+08	0
1.152e+08 – 1.186e+08	1
1.186e+08 – 1.22e+08	0
1.22e+08 – 1.254e+08	0
1.254e+08 – 1.288e+08	0
1.288e+08 – 1.321e+08	0
1.321e+08 – 1.355e+08	1
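
The linear bins above put 6,930 groups into the first bin, which is why a log scale is suggested. A minimal sketch of power-of-ten binning, applied to the summary statistics reported for Population (min, q1, median, q3, max); the function name `decade_bin` is illustrative and not part of the profiler:

```python
def decade_bin(population: int) -> int:
    """Return the power-of-ten bin index: 1 for 10-99, 2 for 100-999, and so on.

    For positive integers this equals floor(log10(n)); counting digits avoids
    floating-point edge cases at exact powers of ten.
    """
    if population < 1:
        raise ValueError("population must be a positive integer")
    return len(str(int(population))) - 1


# Summary statistics taken from the Population section of this report.
summary = {
    "min": 10,
    "q1": 6_700,
    "median": 30_000,
    "q3": 129_000,
    "max": 135_533_000,
}

bins = {name: decade_bin(value) for name, value in summary.items()}
print(bins)
```

On this scale the quartiles fall in adjacent bins (3, 4, 5) while the maximum sits in bin 8, so the long tail stays visible without flattening the bulk of the data.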

AffinityBloc · Confirms the South Asian Peoples bloc dominates and shows the next-largest cultural clusters.


Top values for AffinityBloc (16 unique shown, of 16 total).
value	count	share
South Asian Peoples	3681	51.7%
Sub-Saharan Peoples	632	8.9%
Arab World	475	6.7%
Southeast Asian Peoples	451	6.3%
Malay Peoples	339	4.8%
Tibetan-Himalayan Peoples	287	4.0%
Turkic Peoples	269	3.8%
Persian-Median	225	3.2%
Eurasian Peoples	166	2.3%
Deaf	151	2.1%
East Asian Peoples	140	2.0%
Jewish	128	1.8%
Horn of Africa Peoples	83	1.2%
Latin-Caribbean Americans	81	1.1%
Pacific Islanders	13	0.2%
North American Peoples	3	0.0%

Schema

109 columns

Per-column summary.
column	type	null rate	unique	alerts
PeopleID3ROG3	text	0.0%	7,124	near_unique one_word allcaps short_text
ROG3	categorical	0.0%	202
PeopleID3	numeric	0.0%	4,614
ROP3	numeric	0.1%	4,608
PeopNameInCountry	text	0.0%	4,722	one_word duplicates
ROG2	categorical	0.0%	7
Continent	categorical	0.0%	7
RegionName	categorical	0.0%	12
ISO3	categorical	0.0%	202
LocationInCountry	text	64.2%	2,176	multilingual null_rate
PeopleID1	numeric	0.0%	16	outliers
ROP1	categorical	0.0%	16
AffinityBloc	categorical	0.0%	16
PeopleID2	numeric	0.0%	205
ROP2	categorical	0.0%	155
PeopleCluster	categorical	0.0%	205
PeopNameAcrossCountries	text	0.0%	4,604	one_word duplicates
Population	numeric	0.2%	1,200	high_skew outliers
Category	categorical	0.0%	3
ROL3	text	0.0%	1,565	one_word short_text duplicates
PrimaryLanguageName	text	0.0%	1,563	one_word short_text duplicates
PrimaryLanguageDialect	categorical	94.5%	303	long_tail null_rate
NumberLanguagesSpoken	numeric	0.0%	69	high_skew outliers
OfficialLang	categorical	0.1%	79
SpeakNationalLang	unknown	0.0%	—	skipped
BibleStatus	numeric	0.0%	6	outliers
BibleYear	categorical	45.8%	163	null_rate
NTYear	categorical	24.6%	305	null_rate
PortionsYear	categorical	13.5%	460
TranslationNeedQuestionable	unknown	0.0%	—	skipped
JPScale	numeric	0.0%	1	constant
JPScalePC	categorical	0.0%	5
JPScalePGAC	categorical	0.0%	5	imbalance
LeastReached	categorical	0.0%	1	imbalance
LeastReachedPC	categorical	0.0%	2
LeastReachedPGAC	categorical	0.0%	2	imbalance
GSEC	categorical	0.0%	8
HasAudioRecordings	categorical	0.0%	2
NTOnline	categorical	22.4%	1	null_rate imbalance
RLG3	numeric	0.0%	7	outliers
RLG3PC	numeric	0.0%	8	outliers
RLG3PGAC	numeric	0.0%	8	outliers
PrimaryReligion	categorical	0.0%	7
PrimaryReligionPC	categorical	0.0%	8
PrimaryReligionPGAC	categorical	0.0%	8
RLG4	numeric	92.4%	18	null_rate outliers
ReligionSubdivision	categorical	92.4%	18	null_rate
PCIslam	numeric	0.1%	902
PCNonReligious	numeric	0.3%	152	high_skew outliers
PCUnknown	numeric	0.4%	388	high_skew outliers
SecurityLevel	numeric	0.0%	3	outliers
LRTop100	categorical	0.0%	2	imbalance
PhotoAddress	text	0.0%	2,880	one_word short_text duplicates
PhotoCredits	categorical	0.1%	851	long_tail
PhotoCreditURL	categorical	36.0%	774	long_tail null_rate
PhotoCreativeCommons	categorical	0.1%	2
PhotoCopyright	categorical	0.2%	2
PhotoPermission	categorical	0.2%	3
ProfileTextExists	categorical	0.0%	2	imbalance
CountOfCountries	numeric	0.0%	39	high_skew outliers
CountOfProvinces	unknown	0.0%	—	skipped
Longitude	numeric	0.0%	6,713
Latitude	numeric	0.0%	6,696
Ctry	categorical	0.0%	202
IndigenousCode	categorical	0.0%	2
PercentAdherents	categorical	0.0%	692	long_tail
PercentChristianPC	categorical	0.0%	184
NaturalName	text	0.0%	4,705	one_word duplicates
NaturalPronunciation	text	48.5%	1,489	one_word null_rate duplicates
PercentChristianPGAC	categorical	0.1%	842
PercentEvangelical	categorical	10.4%	401	long_tail
PercentEvangelicalPC	categorical	2.1%	166
PercentEvangelicalPGAC	categorical	6.3%	548
PCBuddhism	numeric	0.3%	809	high_skew outliers
PCEthnicReligions	numeric	0.3%	351	high_skew outliers
PCHinduism	numeric	0.3%	1,131
PCOtherSmall	numeric	0.3%	670	high_skew outliers
RegionCode	numeric	0.0%	12	outliers
PopulationPGAC	numeric	0.1%	1,509	high_skew outliers
Frontier	categorical	0.0%	2
MapAddress	text	0.0%	4,616	one_word short_text duplicates
HasJesusFilm	categorical	0.0%	2
Nomadic	categorical	0.0%	2	imbalance
NomadicTypeDescription	categorical	96.6%	6	null_rate
PhotoCCVersionText	categorical	0.0%	16
PhotoCCVersionURL	categorical	0.0%	16
MapCredits	categorical	0.0%	161	long_tail
MapCreditURL	categorical	0.0%	31	long_tail imbalance
MapCopyright	categorical	0.0%	3
MapCCVersionText	categorical	0.0%	4	imbalance
MapCCVersionURL	categorical	0.0%	4	imbalance
JF	categorical	0.0%	2
AudioRecordings	categorical	0.0%	2
Window1040	categorical	0.0%	2
PeopleGroupMapURL	text	0.0%	4,616	one_word url_heavy duplicates
PeopleGroupMapExpandedURL	text	0.0%	4,331	one_word url_heavy duplicates
PeopleGroupURL	text	0.0%	7,124	near_unique one_word url_heavy
PeopleGroupPhotoURL	text	0.0%	2,880	one_word url_heavy duplicates
CountryURL	categorical	0.0%	202
JPScaleText	categorical	0.0%	1	imbalance
JPScaleImageURL	categorical	0.0%	1	imbalance
Summary	text	0.0%	3,685	one_word duplicates
Obstacles	text	0.0%	3,641	one_word duplicates
HowReach	text	0.0%	2,853	one_word duplicates
PrayForChurch	text	0.0%	1,713	one_word duplicates
PrayForPG	text	0.0%	3,441	one_word duplicates
Resources	unknown	0.0%	—	skipped
country_data	unknown	0.0%	—	skipped
language_data	unknown	0.0%	—	skipped

PeopleID3ROG3

text identifier near_unique one_word allcaps short_text

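PeopleID3ROG3 is almost certainly a row-level identifier for a people group within a country, not for a person: every one of the 7,124 rows holds a unique 7-character, single-token code (n_unique equals n, len_min=len_max=7, allcaps_rate=1.0, one_word_rate=1.0). Sample values like '10208ng' and '10375su' show a 5-digit numeric prefix followed by a 2-letter suffix, which is consistent with a concatenation of PeopleID3 (10120 to 22661) and the two-letter ROG3 country code. The samples are shown in lowercase even though allcaps_rate is 1.0, so check the casing in the raw file before joining. There are no nulls, duplicates, or empties, so the key looks clean. Treatment: Use as a primary key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7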

n: 7,124
nulls: 0 (0.0%)
unique: 7,124
len_min: 7
len_max: 7
len_mean: 7
len_median: 7
len_p95: 7
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 7,124
readability_flesch_mean: 112.3
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
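
If the key really is a concatenation of PeopleID3 and ROG3, it can be split back into its parts. The sketch below rests on that assumption and on the two sample values quoted above; `split_key` is a hypothetical helper, and upper-casing the suffix is a guess about how ROG3 is stored:

```python
import re

# Five digits followed by two letters, as seen in the sample values.
KEY_PATTERN = re.compile(r"^(\d{5})([A-Za-z]{2})$")


def split_key(key: str) -> tuple[int, str]:
    """Split a PeopleID3ROG3-style key into (people id, country code)."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unexpected key format: {key!r}")
    # Upper-casing is an assumption; adjust to the casing used in ROG3.
    return int(match.group(1)), match.group(2).upper()


print(split_key("10208ng"))
print(split_key("10375su"))
```

Checking that every split pair matches the row's own PeopleID3 and ROG3 values would confirm or refute the concatenation assumption.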

ROG3

categorical feature

ROG3 holds two-letter country codes across 7,124 rows with 202 distinct values and no nulls. India (IN) dominates at 28.5% of records, followed by PK (767) and CH (442). The codes do not appear to be ISO 3166-1 alpha-2: CH has the same count as CHN in ISO3 (442), so it denotes China here rather than Switzerland. The top 10 codes account for the bulk of the mass while a long tail of ~190 other codes shares the remainder, giving an entropy ratio of 0.66. Treatment: Group rare codes into an 'other' bucket before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 202
top_value: IN
top_rate: 0.2852
cardinality: 202
entropy: 5.058
entropy_ratio: 0.6605
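
The "group rare codes" treatment can be sketched as follows. The threshold, the `OTHER` label, and the toy input (including the placeholder codes `XX` and `YY`) are illustrative assumptions, not values from the dataset:

```python
from collections import Counter


def bucket_rare(values, min_count, other="OTHER"):
    """Replace every value seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Toy input: three frequent codes plus two placeholder singletons.
codes = ["IN"] * 5 + ["PK"] * 3 + ["CH"] * 2 + ["XX", "YY"]
bucketed = bucket_rare(codes, min_count=2)
print(Counter(bucketed))
```

With 202 codes and a long tail, the threshold is a modelling choice; a count-based cut-off like this one keeps the encoding width under control without dropping rows.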

PeopleID3

numeric foreign_key

PeopleID3 is an integer key spanning 10120 to 22661 with 4614 unique values across 7124 rows, suggesting a people-group identifier that recurs across countries (about 1.5 rows per id on average). The distribution is mildly left-skewed (-0.23) and platykurtic (-0.95) with no nulls, no zeros, and no outliers, consistent with a dense allocated id range rather than a measured quantity. Treatment: Treat as a join key; do not use as a numeric feature. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 4,614
min: 10,120
max: 22,661
mean: 1.693e+04
median: 1.736e+04
std: 3431
q1: 1.433e+04
q3: 1.958e+04
iqr: 5251
skew: -0.2255
kurtosis: -0.9528
n_outliers: 0
outlier_rate: 0
zero_rate: 0

ROP3

numeric identifier

ROP3 is a numeric column with 4608 unique values across 7124 rows, ranging tightly from 100005 to 119619 with a mean of 111443.68 and median of 112533. The narrow ~19k span sitting well above zero, combined with integer-looking bounds, suggests a coded identifier or sequence number rather than a measured quantity. Mild left skew (-0.47) and no outliers indicate a fairly uniform spread within that band, and the null rate is negligible at 0.001. Treatment: Treat as a categorical code or key; do not feed raw into numeric models. medium · anthropic:claude-opus-4-7

n: 7,124
nulls: 7 (0.1%)
unique: 4,608
min: 100,005
max: 119,619
mean: 1.114e+05
median: 112,533
std: 5269
q1: 107,901
q3: 115,240
iqr: 7,339
skew: -0.4712
kurtosis: -0.7273
n_outliers: 0
outlier_rate: 0
zero_rate: 0

PeopNameInCountry

text label one_word duplicates

This column names a people group as it appears within a given country (e.g., 'Turk', 'Persian', 'Arab, Moroccan'). Values are short (len_mean 12.5, word_mean 1.78) with 4,722 uniques across 7,124 rows, and 33.7% are duplicates because the same people group recurs across countries; 'Deaf' alone appears 151 times. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' in top_words show that religion-tagged variants are baked into the label. Treatment: Treat as a categorical label; pair with country to form a unique key, and consider stripping parenthetical religion tags for cleaner grouping. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 4,722
len_min: 1
len_max: 39
len_mean: 12.5
len_median: 11
len_p95: 27
word_mean: 1.784
word_median: 2
n_empty: 0
n_duplicates: 2,402
duplicate_rate: 0.3372
vocab_size: 4,602
readability_flesch_mean: 56.38
emoji_rate: 0
url_rate: 0
one_word_rate: 0.4575
allcaps_rate: 0
boilerplate_rate: 0
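
Stripping the parenthetical religion tags can be sketched with a regular expression. The helper name and the placeholder label 'Example' are illustrative; only the tag wording '(Hindu traditions)' and the name 'Arab, Moroccan' come from the profile above:

```python
import re

# A trailing parenthetical such as "(Hindu traditions)", with optional spaces.
TAG = re.compile(r"\s*\(([^()]*)\)\s*$")


def split_tradition_tag(name: str):
    """Return (base name, tag) where tag is None if no trailing parenthetical."""
    match = TAG.search(name)
    if match is None:
        return name.strip(), None
    return name[: match.start()].strip(), match.group(1)


print(split_tradition_tag("Example (Hindu traditions)"))
print(split_tradition_tag("Arab, Moroccan"))
```

Keeping the tag as a second value, rather than discarding it, preserves the religion information for later grouping.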

ROG2

categorical feature

ROG2 is a low-cardinality categorical with 7 region codes (ASI, AFR, EUR, NAR, LAM, AUS, SOP) and no nulls across 7,124 rows, consistent with a continental/region-of-origin grouping. The distribution is highly imbalanced: ASI accounts for 75.1% of records while AUS and SOP together contribute fewer than 80 rows, yielding an entropy ratio of just 0.45. Any model conditioned on this field will be dominated by the ASI bucket. Treatment: One-hot encode and consider grouping AUS/SOP/LAM into an 'other' bucket given the severe imbalance. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 7
top_value: ASI
top_rate: 0.7511
cardinality: 7
entropy: 1.251
entropy_ratio: 0.4457

Continent

categorical feature

Continent is a low-cardinality geographic categorical with 7 distinct values and no nulls across 7,124 rows. The distribution is heavily concentrated: Asia alone accounts for 75.1% of records, with Africa a distant second at 986. Notably, both 'Australia' (39) and 'Oceania' (36) appear as separate categories, which is a labeling inconsistency since Australia is part of Oceania. Treatment: Reconcile Australia/Oceania into a single category, then one-hot encode. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 7
top_value: Asia
top_rate: 0.7511
cardinality: 7
entropy: 1.251
entropy_ratio: 0.4457
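
The entropy and entropy_ratio figures can be reproduced from the Continent counts listed earlier in this report. The sketch assumes the profiler uses base-2 Shannon entropy divided by log2 of the cardinality; the report does not state the formula, but this definition matches the printed values:

```python
import math

# Counts from the "Top values for Continent" table.
continent_counts = {
    "Asia": 5351,
    "Africa": 986,
    "Europe": 431,
    "North America": 175,
    "South America": 106,
    "Australia": 39,
    "Oceania": 36,
}


def shannon_entropy(counts) -> float:
    """Base-2 Shannon entropy of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts) -> float:
    """Entropy divided by its maximum, log2(number of categories)."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column has zero entropy by definition
    return shannon_entropy(counts) / math.log2(k)


h = shannon_entropy(list(continent_counts.values()))
r = entropy_ratio(list(continent_counts.values()))
print(round(h, 3), round(r, 4))
```

A ratio of 1 would mean all seven continents are equally represented; the low value here reflects Asia's 75.1% share.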

RegionName

categorical feature

RegionName is a categorical geographic grouping with 12 distinct regions and no nulls across 7,124 rows. The distribution is heavily concentrated: 'Asia, South' alone accounts for 47.0% of records, followed distantly by 'Asia, Southeast' at 726 and 'Asia, Northeast' at 521, leaving the Americas and Europe sparsely represented. Entropy ratio of 0.76 confirms meaningful but uneven coverage across the 12 buckets. Treatment: One-hot encode and consider grouping rare regions, given the dominance of 'Asia, South'. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 12
top_value: Asia, South
top_rate: 0.4701
cardinality: 12
entropy: 2.715
entropy_ratio: 0.7574

ISO3

categorical foreign_key

ISO3 looks like a country code in standard 3-letter ISO 3166-1 alpha-3 format, with 202 distinct values across 7,124 rows and zero nulls. The distribution is heavily concentrated on India (IND) at 28.5% of rows (2,032), followed by PAK (767) and CHN (442), so South and East Asia dominate. Entropy ratio of 0.66 confirms the imbalance is material rather than uniform across countries. Treatment: Use as a join key to country reference tables; consider grouping long-tail codes or stratifying by ISO3 to control for the IND-heavy skew. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 202
top_value: IND
top_rate: 0.2852
cardinality: 202
entropy: 5.058
entropy_ratio: 0.6605

LocationInCountry

text free_text multilingual null_rate

Free-text geographic descriptions of where a people group lives within a country, ranging from terse tags like "Widespread." (56 occurrences) to multi-sentence paragraphs up to 939 characters. 64.23% of rows are null and 14.6% of the non-null values are duplicates, so usable signal is concentrated in roughly a third of the dataset. Content is overwhelmingly English (1714 of 1729 detected) with a long tail of place names producing a 10,936-token vocabulary across 2,176 unique strings. Treatment: Normalize boilerplate phrases like "Widespread" into a categorical flag, then tokenize and embed the residual prose before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 4,576 (64.2%)
unique: 2,176
len_min: 3
len_max: 939
len_mean: 141.1
len_median: 89
len_p95: 455.7
word_mean: 21.15
word_median: 12
n_empty: 0
n_duplicates: 372
duplicate_rate: 0.146
vocab_size: 10,936
readability_flesch_mean: 41.64
emoji_rate: 0
url_rate: 0
one_word_rate: 0.04317
allcaps_rate: 0
boilerplate_rate: 0

PeopleID1

numeric feature outliers

PeopleID1 is stored as numeric but only takes 16 distinct integer values between 10 and 26, with a tight IQR of 1.0 (Q1=20, Q3=21) around a median of 21. The distribution is heavily left-skewed (skew -1.34) and 32.9% of rows fall outside the Tukey fence, so the 'outliers' alert reflects the column being near-categorical rather than truly continuous. There are no nulls and no zeros. The cardinality matches ROP1 and AffinityBloc (16 levels each), so this is probably a numeric code for the affinity bloc rather than a row identifier. Treatment: Cast to categorical (16 levels) rather than treating as a continuous numeric. medium · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 16
min: 10
max: 26
mean: 19.51
median: 21
std: 3.858
q1: 20
q3: 21
iqr: 1
skew: -1.337
kurtosis: 0.8321
n_outliers: 2,347
outlier_rate: 0.3294
zero_rate: 0
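
The outlier flag follows directly from the quartiles. A small worked sketch, assuming the profiler uses the conventional 1.5 × IQR multiplier (the report names the Tukey fence but not the multiplier):

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for PeopleID1.
low, high = tukey_fences(20, 21)
print(low, high)  # 18.5 22.5
```

With fences at 18.5 and 22.5, only the codes 19 through 22 count as inliers, so every row carrying one of the other codes is flagged even though nothing is wrong with it.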

ROP1

categorical feature

ROP1 is a low-cardinality categorical code (16 distinct values, all following an 'A0xx' pattern) with no nulls across 7,124 rows. The distribution is heavily concentrated: 'A012' alone accounts for 51.7% of records, and entropy ratio of 0.67 confirms the imbalance. This looks like a controlled vocabulary or lookup code rather than free input. Treatment: One-hot or target-encode; consider grouping rare codes given the dominance of A012. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 16
top_value: A012
top_rate: 0.5167
cardinality: 16
entropy: 2.676
entropy_ratio: 0.669

AffinityBloc

categorical feature

AffinityBloc is a categorical grouping of peoples/ethnolinguistic blocs, with 16 distinct values across 7124 rows and no nulls. The distribution is heavily concentrated: 'South Asian Peoples' alone accounts for 51.7% of records (3681), followed distantly by 'Sub-Saharan Peoples' (632) and 'Arab World' (475). Entropy ratio of 0.67 confirms the imbalance, and the inclusion of 'Deaf' (151) as a bloc alongside geographic/ethnic categories is a notable taxonomy quirk. Treatment: One-hot or target-encode, and consider grouping rare blocs given the dominance of South Asian Peoples. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 16
top_value: South Asian Peoples
top_rate: 0.5167
cardinality: 16
entropy: 2.676
entropy_ratio: 0.669

PeopleID2

numeric foreign_key

PeopleID2 is a numeric identifier-like field with only 205 distinct values across 7,124 rows, ranging 101 to 475 with no nulls or zeros. The distribution is left-skewed (skew -0.50) and platykurtic (kurtosis -1.24) with median 412 well above the mean 339, suggesting a few low-id clusters pull the mean down while most rows concentrate near the upper end. The low cardinality relative to row count indicates this is a repeated key rather than a per-row unique id. Treatment: Treat as a categorical foreign key and join to a people-cluster lookup (PeopleCluster also has 205 distinct values) rather than using as a numeric feature. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 205
min: 101
max: 475
mean: 339
median: 412
std: 123.2
q1: 232.8
q3: 450
iqr: 217.2
skew: -0.5035
kurtosis: -1.242
n_outliers: 0
outlier_rate: 0
zero_rate: 0

ROP2

categorical feature

ROP2 is a categorical code field with 155 distinct values, made up of 'A012' plus a long tail of C-prefixed codes, which suggests a classification code from a controlled registry. 'A012', the same code that leads ROP1, dominates at 51.6% of the 7,124 rows, while each remaining C-code stays below 2.4%. Entropy ratio of 0.56 confirms the distribution is heavily concentrated rather than uniform. Treatment: Collapse rare C-codes into an 'other' bucket and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 155
top_value: A012
top_rate: 0.5163
cardinality: 155
entropy: 4.085
entropy_ratio: 0.5614

PeopleCluster

categorical feature

PeopleCluster is a high-cardinality categorical taxonomy of ethno-religious groupings, with 205 distinct values across 7124 rows and no nulls. The distribution is dominated by South Asian categories — 'South Asia Hindu - other' alone accounts for 12.2% (869 rows), followed by 'South Asia Muslim - other' (586) and 'South Asia Dalit - other' (352). Entropy ratio of 0.80 indicates a fairly spread distribution despite the South Asian skew, and the appearance of 'Deaf' alongside ethnolinguistic labels signals a mixed taxonomy worth flagging. Treatment: Group the long tail and target- or frequency-encode before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 205
top_value: South Asia Hindu - other
top_rate: 0.122
cardinality: 205
entropy: 6.108
entropy_ratio: 0.7954

PeopNameAcrossCountries

text label one_word duplicates

This column holds people-group / ethnolinguistic names spanning countries (e.g. 'Turk', 'Persian', 'Kurd, Kurmanji', 'Arab, Moroccan'), with frequent religious-tradition qualifiers like '(Hindu traditions)' and '(Muslim traditions)' appearing in 985 and 424 rows respectively. Values are short (mean 12.4 chars, median 2 words) and 47.5% are single-word labels, yet 35.4% (2,520 rows) are duplicates across 4,604 unique strings out of 7,124 — the same group recurs across countries. The most surprising entry is 'Deaf' at 151 occurrences, which sits oddly alongside ethnic categories. Treatment: Treat as a categorical people-group label; normalize qualifiers in parentheses and join on (group, country) for cross-country aggregation. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 4,604
len_min: 1
len_max: 39
len_mean: 12.38
len_median: 10
len_p95: 27
word_mean: 1.766
word_median: 2
n_empty: 0
n_duplicates: 2,520
duplicate_rate: 0.3537
vocab_size: 4,431
readability_flesch_mean: 56.01
emoji_rate: 0
url_rate: 0
one_word_rate: 0.475
allcaps_rate: 0
boilerplate_rate: 0

Population

numeric feature high_skew outliers

Population counts per record, ranging from 10 to 135,533,000 with a median of just 30,000. The distribution is extraordinarily right-skewed (skew 21.1, kurtosis 607): the mean of ~502,600 sits far above Q3 of 129,000, and ~14.9% of rows flag as outliers, reflecting a mix of many small people groups with a handful of very large ones. Null rate is negligible (0.21%) and there are no zeros. Treatment: Apply a log1p transform before any modelling to tame the extreme skew. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 15 (0.2%)
unique: 1,200
min: 10
max: 1.355e+08
mean: 5.026e+05
median: 30,000
std: 3.568e+06
q1: 6,700
q3: 129,000
iqr: 122,300
skew: 21.1
kurtosis: 607.4
n_outliers: 1,058
outlier_rate: 0.1488
zero_rate: 0

ROL3

text feature one_word short_text duplicates

ROL3 holds three-letter ISO 639-3 language codes: every value is exactly 3 characters and one word, with top entries like 'hin', 'ben', 'urd', 'guj', and 'tel' pointing to South Asian languages. The distribution is heavily skewed toward Hindi (662 of 7124) and only 1565 unique codes appear across 7124 rows, giving a 78% duplicate rate. There are no nulls or empties; the only hint of formatting noise is casing, since allcaps_rate is 0.00028 (about two rows) and vocab_size (1,564) is one below the unique count. Treatment: treat as a categorical language code; lowercase before grouping, then one-hot or target-encode, and consider grouping rare codes. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 1,565
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 5,559
duplicate_rate: 0.7803
vocab_size: 1,564
readability_flesch_mean: 118.7
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.0002807
boilerplate_rate: 0

PrimaryLanguageName

text feature one_word short_text duplicates

Categorical language label, usually a single word (one_word_rate 0.70, word_mean 1.33), drawn from a vocabulary of 1,641 tokens across 1,563 distinct values. Hindi (662), Bengali (357), and Sindhi (191) dominate the 7,124 rows, and 78.1% of values are duplicates of an earlier row, as expected for a controlled language taxonomy. Compound names like 'Pashto, Northern' and 'Punjabi, Eastern' indicate ISO-style subvariant naming rather than free text. Treatment: Treat as a categorical factor; normalise the comma-separated subvariants before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 1,563
len_min: 1
len_max: 32
len_mean: 9.147
len_median: 7
len_p95: 17
word_mean: 1.331
word_median: 1
n_empty: 0
n_duplicates: 5,561
duplicate_rate: 0.7806
vocab_size: 1,641
readability_flesch_mean: 33.28
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7016
allcaps_rate: 0
boilerplate_rate: 0
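
One way to normalise the comma-separated subvariants is to split each name into a base language and an optional variant. This is a sketch under the assumption that the first comma always separates the two parts; `split_language_name` is an illustrative helper:

```python
def split_language_name(name: str):
    """Split 'Pashto, Northern' into ('Pashto', 'Northern').

    Names without a comma return (name, None).
    """
    base, sep, variant = name.partition(",")
    return base.strip(), (variant.strip() or None) if sep else None


print(split_language_name("Pashto, Northern"))
print(split_language_name("Punjabi, Eastern"))
print(split_language_name("Hindi"))
```

Encoding on the base name alone reduces cardinality, while the variant can be kept as a separate column if dialect-level detail matters.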

PrimaryLanguageDialect

categorical metadata long_tail null_rate

Free-text or controlled-vocabulary field naming a primary language dialect, with 303 distinct values across 7124 rows but populated in only 5.5% of records (null_rate 0.945). Distribution is essentially flat — entropy_ratio 0.97 and the modal value 'Punjabi' covers just 3.1% of non-nulls (12 occurrences) — so no dialect dominates. The mix spans South Asian, Middle Eastern, African, and European dialects, suggesting a global but extremely sparse roster. Treatment: Drop or collapse into a coarser language grouping; too sparse and high-cardinality to use directly as a feature. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 6,732 (94.5%)
unique: 303
top_value: Punjabi
top_rate: 0.03061
cardinality: 303
entropy: 8.011
entropy_ratio: 0.9719

NumberLanguagesSpoken

numeric feature high_skew outliers

Counts the number of languages spoken, with values ranging from 1 to 120 across 7,124 rows and no nulls. The distribution is heavily right-skewed (skew 4.87, kurtosis 36.3): the median is 1 and Q3 is 5, yet the mean is 4.33 and 597 rows (8.4%) flag as outliers, indicating a long tail of unusually high counts up to 120. Treatment: Cap or log-transform before modelling, and audit the extreme tail (values up to 120) to confirm the counts are genuine rather than data-entry errors. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 69
min: 1
max: 120
mean: 4.333
median: 1
std: 7.32
q1: 1
q3: 5
iqr: 4
skew: 4.871
kurtosis: 36.31
n_outliers: 597
outlier_rate: 0.0838
zero_rate: 0

OfficialLang

categorical feature

Categorical column that appears to give the official language of each record's country, with 79 distinct values across 7124 rows and effectively no nulls (0.08%). Hindi dominates at 28.5% (2032 rows, the same count as India in ISO3), followed by Urdu (767, matching Pakistan), Standard Arabic (657), Mandarin (475), and English (433), giving an entropy ratio of 0.66, which is moderately concentrated rather than uniform. The South/Central Asian skew is notable: five of the top ten values are languages of that region, which may bias any downstream language-level analysis. Treatment: Group rare languages into an 'Other' bucket and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 6 (0.1%)
unique: 79
top_value: Hindi
top_rate: 0.2855
cardinality: 79
entropy: 4.178
entropy_ratio: 0.6628

SpeakNationalLang

unknown other skipped

Column 'SpeakNationalLang' was skipped by the profiler, so type, cardinality and value distribution are all unavailable. The only confirmed facts are that it has 7124 rows and zero nulls. The name suggests a flag or category for whether members of the people group speak the national language, but this cannot be verified from the evidence. Treatment: Re-profile with type inference enabled before deciding on any downstream handling. low · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: —

BibleStatus

numeric feature outliers

BibleStatus is an ordinal/categorical code stored as a small integer, taking just 6 distinct values from 0 to 5 across 7,124 rows with no nulls. The distribution is heavily concentrated at the top (median 5, Q1=4, Q3=5, mean 4.05) with strong negative skew (-1.51), and 13.5% of rows flagged as low-end outliers plus a 3.8% zero rate. This looks like a status/level code rather than a true numeric measurement. Treatment: Treat as an ordinal category (or one-hot encode) rather than a continuous numeric. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 6
min: 0
max: 5
mean: 4.054
median: 5
std: 1.342
q1: 4
q3: 5
iqr: 1
skew: -1.513
kurtosis: 1.538
n_outliers: 961
outlier_rate: 0.1349
zero_rate: 0.03818

BibleYear

categorical metadata null_rate

BibleYear appears to be a publication/edition year field for Bible translations, encoded mostly as date ranges like "1818-2022" rather than single years. Cardinality is high (163 distinct values across 7124 rows) and the column is missing for 45.79% of rows, which is a major coverage gap. The top value "1818-2022" covers 17.14% of non-nulls and most frequent entries are spans, while plain single years like "1954" (191 occurrences) are the exception. Treatment: Parse into start_year and end_year integers, then decide imputation given the 45.79% null rate. medium · anthropic:claude-opus-4-7

n: 7,124
nulls: 3,262 (45.8%)
unique: 163
top_value: 1818-2022
top_rate: 0.1714
cardinality: 163
entropy: 5.318
entropy_ratio: 0.7237

NTYear

categorical feature null_rate

NTYear appears to hold year-range strings (e.g. "1811-1998", "1801-1984") indicating a span between two dates, though the presence of "Yes" as the third most common value (345 rows) signals encoding inconsistency. The column has 305 distinct values across 7124 rows with a high null rate of 24.65%, and the top value covers only 12.3% of non-nulls, so the distribution is fairly spread (entropy ratio 0.735). Treatment: Parse year-range strings into start/end numeric fields and quarantine non-conforming values like "Yes" before modelling. medium · anthropic:claude-opus-4-7

n: 7,124
nulls: 1,756 (24.6%)
unique: 305
top_value: 1811-1998
top_rate: 0.1233
cardinality: 305
entropy: 6.065
entropy_ratio: 0.7349

PortionsYear

categorical metadata

PortionsYear appears to be a free-form field giving the year range in which Scripture portions were published (by analogy with BibleYear and NTYear), with 460 distinct values across 7124 rows and a 13.5% null rate. Most entries are date ranges like '1806-1962' or '1800-1980', but the single most common value is the literal string 'Yes' (821 rows, 13.3% of non-nulls), suggesting inconsistent data entry where a yes/no answer leaks into a date-range field. High entropy ratio (0.71) confirms values are spread across many ranges rather than concentrated. Treatment: Parse date ranges into start/end year numerics and isolate the 'Yes' contamination as a separate boolean flag before use. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 961 (13.5%)
unique: 460
top_value: Yes
top_rate: 0.1332
cardinality: 460
entropy: 6.289
entropy_ratio: 0.711
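
BibleYear, NTYear, and PortionsYear share the same three shapes: a range such as "1818-2022", a single year such as "1954", and the literal "Yes". A minimal parsing sketch for those shapes; the `YearField` type and the choice to raise on anything else are assumptions, and the real columns may contain formats not seen in the top values:

```python
import re
from typing import NamedTuple, Optional


class YearField(NamedTuple):
    start: Optional[int]
    end: Optional[int]
    flag_only: bool  # True when the cell held "Yes" instead of a year


RANGE = re.compile(r"^(\d{4})-(\d{4})$")
SINGLE = re.compile(r"^(\d{4})$")


def parse_year_field(raw: Optional[str]) -> YearField:
    """Parse a year cell into start/end years plus a 'Yes' flag."""
    if raw is None:
        return YearField(None, None, False)
    text = raw.strip()
    match = RANGE.match(text)
    if match:
        return YearField(int(match.group(1)), int(match.group(2)), False)
    match = SINGLE.match(text)
    if match:
        year = int(match.group(1))
        return YearField(year, year, False)
    if text.lower() == "yes":
        return YearField(None, None, True)
    # Unknown shapes are surfaced rather than silently dropped.
    raise ValueError(f"unrecognised year value: {raw!r}")


for cell in ["1818-2022", "1954", "Yes", None]:
    print(cell, parse_year_field(cell))
```

Callers can catch the `ValueError` to collect non-conforming values in a quarantine list for manual review.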

TranslationNeedQuestionable

unknown other skipped

The column "TranslationNeedQuestionable" was skipped by the profiler, so no type, uniqueness, or value statistics are available. All we know is that it has 7124 rows with a 0.0 null rate; nothing else can be inferred from the evidence. Treatment: Re-profile or inspect manually before deciding on use; current evidence is insufficient. low · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: —

JPScale

numeric metadata constant

JPScale is a numeric column that is entirely constant: all 7124 rows hold the value 1.0 with zero nulls, zero variance, and a single unique value. It carries no information for any downstream model or comparison and was flagged as constant. Treatment: Drop; constant column with no variance. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 1
min: 1
max: 1
mean: 1
median: 1
std: 0
q1: 1
q3: 1
iqr: 0
skew: 0
kurtosis: 0
n_outliers: 0
outlier_rate: 0
zero_rate: 0

JPScalePC

categorical feature

JPScalePC is a 5-level categorical code (values "1" through "5") with no nulls across 7124 rows, likely an ordinal scale or rating. The distribution is heavily concentrated at "1" (70.2% of rows), with "3" the rarest at just 205 occurrences, yielding an entropy ratio of 0.59. The non-monotonic frequency order (1 > 4 > 2 > 5 > 3) is unusual for a true ordinal scale and worth checking. Treatment: Treat as ordinal categorical; consider grouping minority levels (3, 5) given the dominance of "1". high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 5
top_value: 1
top_rate: 0.702
cardinality: 5
entropy: 1.377
entropy_ratio: 0.5929
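
The entropy and entropy_ratio figures in these blocks look consistent with Shannon entropy in bits divided by log2 of the cardinality. That reading is inferred from the reported numbers, not taken from the profiler's documentation, so treat the sketch below as one plausible definition. The function name `entropy_ratio` is illustrative.

```python
import math
from collections import Counter

def entropy_ratio(values):
    """Return (entropy_bits, entropy / log2(number of distinct values))."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    k = len(counts)
    # A single-valued column has zero entropy; its ratio is defined as 0 here.
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio

# Cross-check with the JPScalePC block: entropy 1.377 over 5 levels.
reported_ratio = 1.377 / math.log2(5)
```

Under this definition a ratio near 1 means the levels are close to equally frequent, and a ratio near 0 means one level dominates.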

JPScalePGAC

categorical feature imbalance

JPScalePGAC is a 5-level categorical code (values "1" through "5"), most plausibly the Joshua Project progress scale evaluated at the People Group Across Countries (PGAC) level. The distribution is severely imbalanced: "1" accounts for 6910 of 7124 rows (top_rate 0.97), entropy_ratio is just 0.10, and the remaining four levels together hold only 214 records. No nulls are present. Treatment: Collapse rare levels or binarise as "1" vs other before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 5
top_value: 1
top_rate: 0.97
cardinality: 5
entropy: 0.2372
entropy_ratio: 0.1021

LeastReached

categorical metadata imbalance

This column is a single-valued categorical flag holding "Y" for all 7124 rows, with no nulls and zero entropy. Because cardinality is 1 and top_rate is 1.0, it carries no information for any downstream model. Treatment: Drop; constant column with no variance. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 1
top_value: Y
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
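
Constant columns such as JPScale and LeastReached can be found and dropped mechanically. The sketch below assumes the table has been loaded into a pandas DataFrame; the toy frame and the helper name `drop_constant_columns` are illustrative only.

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without columns that hold a single distinct value.

    Nulls count as a value here (dropna=False), so a column that is "Y" or
    null is kept: its null pattern may still carry information.
    """
    n_distinct = df.nunique(dropna=False)
    constant = n_distinct[n_distinct <= 1].index
    return df.drop(columns=constant)

# Toy frame standing in for the real table (values are illustrative).
toy = pd.DataFrame({
    "JPScale": [1.0, 1.0, 1.0],
    "LeastReached": ["Y", "Y", "Y"],
    "NTOnline": ["Y", None, "Y"],
    "LeastReachedPC": ["Y", "N", "Y"],
})
kept = drop_constant_columns(toy)
```

Counting nulls as a value is a deliberate choice: it keeps presence-only flags available for conversion into an is_present indicator instead of silently discarding them.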

LeastReachedPC

categorical feature

Binary Y/N flag indicating whether some 'least reached' people-group condition is met. The column is fully populated across 7124 rows with only 2 distinct values, skewed toward 'Y' at 72.3% (5152) versus 'N' at 1972. Entropy ratio of 0.85 shows the split is uneven but still informative. Treatment: Encode as a 0/1 boolean for modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.7232
cardinality: 2
entropy: 0.8511
entropy_ratio: 0.8511

LeastReachedPGAC

categorical feature imbalance

A binary Y/N flag (likely indicating whether the least-reached PGAC condition was met) with no nulls across 7124 rows. The distribution is severely imbalanced: 'Y' accounts for 6910 rows (97.0%) versus only 214 'N', yielding entropy_ratio of just 0.19. As a near-constant feature it carries little discriminative signal on its own. Treatment: Encode as 0/1 but consider dropping or pairing with rare-class oversampling given the 97/3 imbalance. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.97
cardinality: 2
entropy: 0.1946
entropy_ratio: 0.1946

GSEC

categorical feature

GSEC is a low-cardinality categorical with 8 distinct values across 7124 rows and no nulls. The dominant value is the empty string at 51.08% (3639 rows), followed by '1' at 2767; the other codes ('0' and '2' through '6') together account for the remaining 718 rows, about 10%. The mix of blanks and small integer codes suggests an optional categorical flag where 'missing' is encoded as '' rather than null. Treatment: Recode '' as explicit missing and one-hot encode the remaining small-integer categories. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 8
top_value: (empty string)
top_rate: 0.5108
cardinality: 8
entropy: 1.605
entropy_ratio: 0.535
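
A possible way to apply the GSEC treatment with pandas is shown below. The sample values are illustrative, and the choice to keep an explicit indicator column for missing rows (rather than dropping them) is an assumption about what a downstream model would want.

```python
import pandas as pd

# Illustrative values: blanks mixed with small integer codes stored as text.
gsec = pd.Series(["", "1", "1", "2", "", "0"], name="GSEC")

# Turn empty strings into real missing values.
gsec_clean = gsec.mask(gsec == "")

# One indicator column per observed code, plus one for missing rows.
dummies = pd.get_dummies(gsec_clean, prefix="GSEC", dummy_na=True)
```

With roughly half the column blank, the missing-row indicator is likely to be one of the more informative columns this encoding produces.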

HasAudioRecordings

categorical feature

Binary Y/N flag indicating whether a record has associated audio recordings. The class is heavily imbalanced toward 'Y' at 86.9% (6188 of 7124), with no nulls. Entropy ratio of 0.56 confirms the skew but the minority 'N' class still has 936 observations, enough to be usable. Treatment: Encode as a 0/1 boolean; watch for class imbalance if used as a target. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.8686
cardinality: 2
entropy: 0.5612
entropy_ratio: 0.5612

NTOnline

categorical feature null_rate imbalance

NTOnline is a categorical flag with only one observed value, 'Y', across 5528 non-null rows, while 22.4% of rows are null. With cardinality 1 and entropy 0, it carries no discriminative signal—presence vs. absence is the only information available. Treatment: Drop, or replace with a binary is_present indicator if the null pattern is meaningful. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 1,596 (22.4%)
unique: 1
top_value: Y
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

RLG3

numeric feature outliers

RLG3 is a discrete numeric column with only 7 unique values spanning 2 to 9, suggesting an ordinal rating or Likert-style scale rather than a continuous measurement. The distribution is tight around the median of 6 (IQR=1, Q1=5, Q3=6) with mild left skew (-0.46), but 10.6% of rows (757) fall outside the IQR fences — an artifact of the narrow box rather than true anomalies. Treatment: Treat as an ordinal categorical; the outlier flag is a side-effect of the compressed IQR, not bad data. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 7
min: 2
max: 9
mean: 5.27
median: 6
std: 1.279
q1: 5
q3: 6
iqr: 1
skew: -0.4551
kurtosis: 2.001
n_outliers: 757
outlier_rate: 0.1063
zero_rate: 0
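
The outlier count above follows directly from the narrow quartiles. The sketch below assumes the profiler uses the conventional Tukey multiplier of 1.5, which the report does not state, and applies it to the Q1 and Q3 values reported for RLG3.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the usual IQR outlier rule."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)

# Quartiles reported for RLG3: Q1=5, Q3=6.
low, high = tukey_fences(5, 6)

# Integer codes on a 1 to 9 scale that land outside the fences.
flagged_codes = [v for v in range(1, 10) if v < low or v > high]
```

With fences at 3.5 and 7.5, any row coded 3 or lower, or 8 or higher, is flagged, even though those are ordinary levels of a discrete scale.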

RLG3PC

numeric feature outliers

RLG3PC is an integer-coded ordinal feature with only 8 distinct values spanning 1-9 and a tight IQR of 1 (Q1=5, Q3=6). The distribution is left-skewed (-0.95) and concentrated around the median of 5, yet 14.3% of rows (1022) fall outside the IQR fence, suggesting a heavy lower tail rather than true anomalies. No nulls or zeros are present. Treatment: Treat as an ordinal/categorical scale rather than continuous; the outlier rate reflects skew, not errors. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 8
min: 1
max: 9
mean: 5.079
median: 5
std: 1.52
q1: 5
q3: 6
iqr: 1
skew: -0.9463
kurtosis: 1.703
n_outliers: 1,022
outlier_rate: 0.1435
zero_rate: 0

RLG3PGAC

numeric feature outliers

RLG3PGAC is a numeric column with only 8 distinct integer values spanning 1 to 9, suggesting an ordinal rating or Likert-style score rather than a continuous measurement. The distribution is tight around a median of 5.5 with IQR of just 1 (Q1=5, Q3=6), so the Tukey fences sit at 3.5 and 7.5 and the 776 rows (10.9%) coded 3 or lower, or 8 or higher, are flagged as outliers even though they are ordinary scale levels. Mild left skew (-0.46) reflects a longer tail of low scores, which pulls the mean (5.27) below the median. Treatment: Treat as ordinal categorical; bin or one-hot encode rather than scaling as continuous. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 8
min: 1
max: 9
mean: 5.272
median: 5.5
std: 1.296
q1: 5
q3: 6
iqr: 1
skew: -0.4637
kurtosis: 2.032
n_outliers: 776
outlier_rate: 0.1089
zero_rate: 0

PrimaryReligion

categorical feature

PrimaryReligion is a low-cardinality categorical with 7 distinct values across 7,124 rows and no nulls. Islam dominates at 46% (3,279 rows), followed by Hinduism (2,142) and Ethnic Religions (933); Non-Religious appears only 13 times and 157 rows are explicitly 'Unknown'. Entropy ratio of 0.68 indicates a moderately skewed but not degenerate distribution. Treatment: One-hot encode and consider merging 'Unknown' with 'Other / Small' or treating it as a missing-value flag. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 7
top_value: Islam
top_rate: 0.4603
cardinality: 7
entropy: 1.92
entropy_ratio: 0.6839

PrimaryReligionPC

categorical feature

Categorical label assigning each of 7124 rows to one of 8 primary religion categories, with no nulls. Islam dominates at 3105 rows (43.6%) followed by Hinduism at 2296, while Non-Religious (35) and Other/Small (62) are rare; entropy ratio of 0.68 indicates moderate concentration in the top two classes. 154 rows are explicitly 'Unknown', a category worth treating distinctly from missing. Treatment: One-hot encode the 8 levels and keep 'Unknown' as its own category rather than imputing. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 8
top_value: Islam
top_rate: 0.4359
cardinality: 8
entropy: 2.051
entropy_ratio: 0.6838

PrimaryReligionPGAC

categorical feature

Categorical label for the primary religion of a People Group Across Countries (PGAC) record, with 8 distinct values across 7124 rows and no nulls. Islam dominates at 45.6% (3247), followed by Hinduism (2154) and Ethnic Religions (925); Christianity appears in just 17 rows, which is unsurprising in a dataset restricted to unreached groups. Entropy ratio of 0.65 indicates moderate concentration on the top categories. Treatment: One-hot or target-encode for modelling; consider folding 'Unknown' and 'Non-Religious'/'Christianity' tails into 'Other'. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 8
top_value: Islam
top_rate: 0.4558
cardinality: 8
entropy: 1.955
entropy_ratio: 0.6516

RLG4

numeric feature null_rate outliers

RLG4 is a sparse numeric code populated for only ~7.6% of rows (542 of 7124, null_rate 0.9239) with just 18 distinct integer-like values ranging 10 to 39. Its null count (6,582) and cardinality (18) match ReligionSubdivision exactly, which suggests it is the numeric code for that label rather than a measurement. Read as numbers, the values show skew 1.05, mean 18.19, median 20.0, an IQR of 6, and 30 flagged outliers (5.5% of present values), but these statistics say little about a nominal code. Treatment: Treat as a categorical code, add a missingness indicator, and avoid numeric imputation given 92% nulls. medium · anthropic:claude-opus-4-7

n: 7,124
nulls: 6,582 (92.4%)
unique: 18
min: 10
max: 39
mean: 18.19
median: 20
std: 6.472
q1: 14
q3: 20
iqr: 6
skew: 1.051
kurtosis: 1.474
n_outliers: 30
outlier_rate: 0.05535
zero_rate: 0

ReligionSubdivision

categorical feature null_rate

A sub-classification of religion (denomination/sect), with 18 distinct values like Sunni, Judaism, Sikhism, Tibetan, and Theravada. The column is 92.39% null, so it is only populated for the small subset of records where a finer-grained religious branch applies. Among the 542 populated rows, Sunni leads at 29.52% (160 occurrences), and entropy ratio 0.72 indicates the populated values are spread fairly evenly across branches. Treatment: Treat missingness as its own category and one-hot encode, or roll up into the parent PrimaryReligion field before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 6,582 (92.4%)
unique: 18
top_value: Sunni
top_rate: 0.2952
cardinality: 18
entropy: 2.984
entropy_ratio: 0.7157

PCIslam

numeric feature

PCIslam appears to be a percentage-style indicator of Islamic affiliation, bounded between 0 and 100 with a near-zero null rate (0.0013). The distribution is starkly bimodal rather than continuous: 47.1% of rows are exactly zero, the median is 0.28, yet Q3 sits at 99.99, producing a kurtosis of -1.93 and an IQR spanning nearly the full range. Mean (45.2) and std (48.2) confirm the mass is piled at the extremes rather than around the center. Treatment: Treat as bimodal: consider binarizing (0 vs >0) or binning rather than using raw value in linear models. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 9 (0.1%)
unique: 902
min: 0
max: 100
mean: 45.22
median: 0.2753
std: 48.22
q1: 0
q3: 99.99
iqr: 99.99
skew: 0.1703
kurtosis: -1.935
n_outliers: 0
outlier_rate: 0
zero_rate: 0.4713

PCNonReligious

numeric feature high_skew outliers

PCNonReligious appears to be the percentage of non-religious individuals in each record, but the distribution is dominated by zeros — 87.5% of values are exactly 0 and the entire IQR collapses to 0. The remaining tail stretches to 99.0 with skew of 9.1 and kurtosis of 125.3, producing 886 outliers (12.5% of rows). Mean (1.02) sits far above median (0), so any modelling that assumes symmetry will be misled. Treatment: Treat as zero-inflated; consider a binary is_nonzero flag plus a log1p transform of the positive tail. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 23 (0.3%)
unique: 152
min: 0
max: 99
mean: 1.016
median: 0
std: 4.549
q1: 0
q3: 0
iqr: 0
skew: 9.105
kurtosis: 125.3
n_outliers: 886
outlier_rate: 0.1248
zero_rate: 0.8752
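
The zero-inflated treatment (a presence flag plus a log1p transform) can be sketched with the standard library alone. The helper name and sample values are illustrative; the same pattern would apply to the other mostly-zero percentage columns in this report.

```python
import math

def zero_inflated_features(pct):
    """Map a percentage to (is_nonzero, log1p_value); None stays missing."""
    if pct is None:
        return (None, None)
    return (int(pct > 0), math.log1p(pct))

rows = [0.0, 0.0, 2.5, 99.0, None]  # illustrative PCNonReligious values
features = [zero_inflated_features(v) for v in rows]
```

The flag lets a model separate "any non-religious share at all" from "how large is it", which a single raw column cannot do when 87.5% of values are zero.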

PCUnknown

numeric feature high_skew outliers

PCUnknown is a numeric column expressing what looks like a percentage (range 0-100) of 'unknown' classification, with 92.8% of values being zero and a median/Q1/Q3 all at 0. The distribution is extremely right-skewed (skew 6.45, kurtosis 39.85) with 510 outliers (7.2%) extending up to 100. With 388 unique values and only 0.35% nulls, it carries sparse but potentially meaningful signal in the long tail. Treatment: Binarize (zero vs non-zero) or log-transform the non-zero tail before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 25 (0.4%)
unique: 388
min: 0
max: 100
mean: 2.28
median: 0
std: 14.59
q1: 0
q3: 0
iqr: 0
skew: 6.454
kurtosis: 39.85
n_outliers: 510
outlier_rate: 0.07184
zero_rate: 0.9282

SecurityLevel

numeric feature outliers

SecurityLevel is an ordinal/categorical code stored as numeric, with only 3 distinct values spanning 0 to 2 across 7,124 complete rows. The distribution is heavily concentrated at the top tier (median, Q1, and Q3 all equal 2.0, mean 1.595), yet 15.6% of rows are 0 and the IQR-based outlier check flags 24.9% of records — an artifact of the degenerate IQR of 0 rather than true anomalies. Strong negative skew (-1.47) confirms the mass sits at level 2. Treatment: Treat as a 3-level ordinal category (one-hot or ordered encode); ignore the outlier flag since IQR is zero. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 3
min: 0
max: 2
mean: 1.595
median: 2
std: 0.7442
q1: 2
q3: 2
iqr: 0
skew: -1.466
kurtosis: 0.4048
n_outliers: 1,771
outlier_rate: 0.2486
zero_rate: 0.1564

LRTop100

categorical label imbalance

Binary Y/N flag indicating membership in some 'LRTop100' set, with exactly 100 rows marked 'Y' out of 7124 — strongly suggesting a curated top-100 list. The distribution is severely imbalanced (98.6% 'N', entropy ratio 0.107), which is flagged as an imbalance alert. No nulls are present. Treatment: Use as a binary indicator; if modelling, apply class-imbalance handling (stratification or reweighting). high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: N
top_rate: 0.986
cardinality: 2
entropy: 0.1065
entropy_ratio: 0.1065

PhotoAddress

text metadata one_word short_text duplicates

PhotoAddress holds single-token image filenames following a 'pXXXXX.jpg' pattern (one_word_rate 1.0, len_max 13). Coverage is poor: 1970 of 7124 rows are empty strings, and duplicate_rate is 0.596. Of the 4,243 duplicates, 1,969 are repeats of the empty string; the remainder are genuine reuse of the same photo across records (e.g., p19007.jpg appears 90 times). Only 2880 unique values back 7124 rows, suggesting shared stock images or a many-to-one photo lookup rather than a per-row asset. Treatment: Treat as a file reference: drop from modelling, or join to an image table after handling the ~1970 empty strings. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 1 (0.0%)
unique: 2,880
len_min: 0
len_max: 13
len_mean: 7.26
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 1,970
n_duplicates: 4,243
duplicate_rate: 0.5957
vocab_size: 2,879
readability_flesch_mean: 84.01
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

PhotoCredits

categorical metadata long_tail

PhotoCredits captures the attribution string for an associated image, with 851 distinct credits across 7124 rows. The column is dominated by missing-style values: 1970 rows (27.7%) are empty strings and another 1496 are 'Anonymous', so nearly half of all rows (3,466, about 49%) lack a real attribution. The remaining tail is long and idiosyncratic, mixing organisations ('Operation China, Asia Harvest'), individuals ('Isudas', 'Kerry Olson'), and platform tags ('Steve Evans - Flickr'). Treatment: Treat empty and 'Anonymous' as missing and keep only as provenance metadata; do not use as a model feature. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 10 (0.1%)
unique: 851
top_value: (empty string)
top_rate: 0.2769
cardinality: 851
entropy: 5.584
entropy_ratio: 0.5737

PhotoCreditURL

categorical metadata long_tail null_rate

URL string crediting the source of an associated photo, with 774 distinct values; the most common real domain is asiaharvest.org (443 occurrences), followed by a long tail. 36% of rows are null, and the empty string makes up 43.21% of the non-null values (about 1,970 rows, or 27.7% of all rows), so together roughly two out of three rows carry no usable credit. Remaining values mix organisational sites (newcovenantmissions, createinternational), shorteners (tinyurl), and stock-photo hosts (pixabay, pxhere, flickr). Treatment: Drop for modelling; if provenance matters, parse to domain and treat as a low-coverage attribution field. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 2,565 (36.0%)
unique: 774
top_value: (empty string)
top_rate: 0.4321
cardinality: 774
entropy: 5.389
entropy_ratio: 0.5616
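
If provenance matters, the credit URLs can be reduced to a domain with the standard library. The helper name `credit_domain` and the example URLs are hypothetical; the report does not show the exact URL strings, so handling of scheme-less values is an assumption.

```python
from urllib.parse import urlparse

def credit_domain(url):
    """Return the lower-cased host of a credit URL, or None if unusable."""
    if url is None or not str(url).strip():
        return None
    text = str(url).strip()
    # urlparse only recognises the host when a scheme or leading // is present.
    if "//" not in text:
        text = "//" + text
    host = urlparse(text).hostname
    if host and host.startswith("www."):
        host = host[4:]
    return host or None

examples = ["https://www.asiaharvest.org/photos", "tinyurl.com/abc", "", None]
domains = [credit_domain(u) for u in examples]
```

Collapsing to the domain turns a long tail of one-off URLs into a small set of sources while mapping both nulls and empty strings to the same missing value.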

PhotoCreativeCommons

categorical feature

Binary Y/N flag indicating whether a photo carries a Creative Commons licence. The vast majority (top_rate 0.7981) are 'N' with only 1437 'Y' values, and nulls are negligible (null_rate 0.0007). Class imbalance is notable but not extreme. Treatment: Encode as a 0/1 boolean; impute the handful of nulls with the mode 'N'. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 5 (0.1%)
unique: 2
top_value: N
top_rate: 0.7981
cardinality: 2
entropy: 0.7256
entropy_ratio: 0.7256

PhotoCopyright

categorical feature

Binary Y/N flag indicating whether a photo carries copyright restrictions, with 'N' dominating at 80.6% of 7,124 rows and only 2 unique values. Nulls are negligible (0.17%) and entropy ratio of 0.71 reflects the moderate class imbalance. No anomalies beyond the expected skew toward unrestricted photos. Treatment: Encode as a boolean (Y=1, N=0) and impute the handful of nulls with the mode. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 12 (0.2%)
unique: 2
top_value: N
top_rate: 0.8064
cardinality: 2
entropy: 0.709
entropy_ratio: 0.709

PhotoPermission

categorical feature

Binary opt-in flag for photo permission, stored as 'Y'/'N'. The column is heavily skewed toward 'N' at 80.4% of non-null values (5715 of 7110), with a near-zero null rate of 0.2%. Watch the case inconsistency: 2 records use lowercase 'y' alongside 1393 uppercase 'Y', so case-sensitive joins or filters will miscount. Treatment: Normalize case (upper) and encode as boolean before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 14 (0.2%)
unique: 3
top_value: N
top_rate: 0.8038
cardinality: 3
entropy: 0.7173
entropy_ratio: 0.4526
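
The case normalisation and boolean encoding can be done in one step. This is a minimal sketch with a hypothetical helper name; mapping anything other than Y or N to None is an assumption about how unexpected values should be handled.

```python
def yn_to_bool(value):
    """Map 'Y'/'y' to True, 'N'/'n' to False, and anything else to None."""
    if value is None:
        return None
    text = str(value).strip().upper()
    return {"Y": True, "N": False}.get(text)

raw = ["Y", "y", "N", None, ""]
flags = [yn_to_bool(v) for v in raw]
```

After this mapping the column has two levels plus missing, so the stray lowercase 'y' no longer appears as a third category.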

ProfileTextExists

categorical feature imbalance

A binary flag indicating whether a profile text exists, with values Y/N. The column is severely imbalanced: 6888 of 7124 rows (96.7%) are Y, leaving only 236 N, yielding a low entropy ratio of 0.21. Treatment: Encode as a 0/1 indicator but expect minimal predictive signal due to severe class imbalance. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.9669
cardinality: 2
entropy: 0.2098
entropy_ratio: 0.2098

CountOfCountries

numeric feature high_skew outliers

Likely a per-row count of distinct countries associated with each record, ranging from 1 to 164 across 7124 rows with no nulls. The distribution is severely right-skewed (skew 5.67, kurtosis 33.17): the median is just 2 and Q3 is 4, yet the mean is 8.11 and 16.98% of rows flag as outliers. A long tail of high-country records is dragging the mean far above typical values. Treatment: Log-transform or cap at a high quantile before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 39
min: 1
max: 164
mean: 8.108
median: 2
std: 24.27
q1: 1
q3: 4
iqr: 3
skew: 5.672
kurtosis: 33.17
n_outliers: 1,210
outlier_rate: 0.1698
zero_rate: 0

CountOfProvinces

unknown other skipped

Saturn skipped profiling on CountOfProvinces, so beyond a row count of 7124 and zero nulls, no distributional evidence is available. The name suggests an integer count of provinces per record, but unique count, range, and summary stats are all missing. Without further inspection the column's actual content and cardinality cannot be confirmed. Treatment: Re-profile or manually inspect this column before use; saturn skipped it. low · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: —

Longitude

numeric feature

Geographic longitude coordinates spanning the full global range from -173.08 to 178.44 degrees. The distribution is heavily left-skewed (-1.40) with a median of 75.23 sitting well above the mean of 62.80, suggesting concentration in eastern hemisphere locations with a tail of western-hemisphere points. About 4.4% of values (316 rows) fall outside the typical IQR range. Treatment: Pair with latitude for geospatial features; consider clustering or binning rather than treating as a raw scalar. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 6,713
min: -173.1
max: 178.4
mean: 62.8
median: 75.23
std: 44.79
q1: 40.81
q3: 88.22
iqr: 47.41
skew: -1.402
kurtosis: 2.859
n_outliers: 316
outlier_rate: 0.04436
zero_rate: 0

Latitude

numeric feature

Latitude values for 7124 rows spanning -42.61 to 71.84 with a median of 25.02 — consistent with geographic latitudes in degrees. Distribution leans toward northern hemisphere (mean 23.54, skew -0.70) with 292 outliers (4.1%) likely representing far-southern or far-northern records. No nulls and 6696 unique values suggest near-record-level coordinates. Treatment: Pair with longitude for geospatial features; avoid standard scaling alone since latitude is bounded and non-linear in distance. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 6,696
min: -42.61
max: 71.84
mean: 23.54
median: 25.02
std: 14.92
q1: 15.55
q3: 31.61
iqr: 16.06
skew: -0.702
kurtosis: 2.141
n_outliers: 292
outlier_rate: 0.04099
zero_rate: 0
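
One way to pair latitude with longitude is to project each point onto the unit sphere, or to work with great-circle distances directly. The sketch below uses only the standard library; the function names are illustrative and the Earth radius of 6371 km is the usual mean-radius approximation.

```python
import math

def to_unit_sphere(lat_deg, lon_deg):
    """Convert degrees to (x, y, z) coordinates on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Points either side of the antimeridian are close on the sphere even though
# their raw longitudes differ by almost 360 degrees.
west = to_unit_sphere(0.0, -179.0)
east = to_unit_sphere(0.0, 179.0)
gap_km = haversine_km(0.0, -179.0, 0.0, 179.0)
```

This matters here because longitude runs from about -173 to 178: treated as a raw scalar, the two ends of that range look maximally far apart when they are in fact neighbours.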

Ctry

categorical feature

Country-of-origin or location field with 202 distinct values across 7,124 rows and zero nulls. India dominates at 28.5% (2,032 rows), followed by Pakistan (767) and China (442); the long tail spans 200+ countries with entropy ratio 0.66, indicating concentrated but globally distributed coverage. The South/Central Asia skew is the headline surprise — five of the top six values are Asian. Treatment: Group rare countries into an 'Other' bucket or encode by region before modelling to avoid 200-way one-hot blow-up. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 202
top_value: India
top_rate: 0.2852
cardinality: 202
entropy: 5.058
entropy_ratio: 0.6605
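
Grouping rare countries into an 'Other' bucket might look like the sketch below. The threshold `min_count` and the toy list are illustrative; a sensible cut-off for the real 202-value column would need to be chosen against the actual counts.

```python
from collections import Counter

def lump_rare(values, min_count):
    """Replace categories seen fewer than min_count times with 'Other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "Other" for v in values]

# Toy data: a few frequent countries and two singletons.
ctry = ["India"] * 5 + ["Pakistan"] * 3 + ["China"] * 2 + ["Nepal", "Laos"]
lumped = lump_rare(ctry, min_count=2)
```

The counts should be computed on training data only and then reused at prediction time, otherwise the set of surviving categories can shift between splits.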

IndigenousCode

categorical feature

A binary Y/N flag indicating Indigenous status, fully populated across all 7124 rows. The distribution is imbalanced: 'Y' accounts for 79.4% (5657 rows) versus 1467 'N' rows, which is a notable skew to keep in mind for any stratified analysis. Treatment: Encode as a binary indicator and watch for class imbalance when used as a predictor or stratifier. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.7941
cardinality: 2
entropy: 0.7336
entropy_ratio: 0.7336

PercentAdherents

categorical feature long_tail

PercentAdherents appears to be a numeric measure (likely a percentage or rate of religious adherents) stored as strings, with 692 distinct values across 7,124 rows and no nulls. It is dominated by '0.000', which accounts for 56.2% of records, and that concentration on a single value pulls the entropy ratio down to 0.43 despite a long tail of other values. The format mixing whole numbers like '5.000' with fractions like '0.200' suggests these are raw values rather than binned categories. Treatment: Cast to float and treat as a zero-inflated numeric feature rather than a category. medium · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 692
top_value: 0.000
top_rate: 0.5625
cardinality: 692
entropy: 4.046
entropy_ratio: 0.4288

PercentChristianPC

categorical feature

Stored as a categorical but the 184 distinct values are numeric strings ranging from '0.000' to figures like '8.571', suggesting this is a percent-Christian metric cast as text; the PC suffix most likely denotes the People Cluster level, by analogy with the PGAC (People Group Across Countries) columns. The distribution is concentrated: '0.482' alone covers 12.2% of 7124 rows and the top 10 values account for a large share, yet entropy ratio of 0.79 indicates the long tail still carries information. No nulls, and the repeated exact decimals are what a group-level value shared by every row in the same cluster would produce. Treatment: cast to float and treat as a continuous feature; investigate the heavy spike at 0.482 before modelling. medium · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 184
top_value: 0.482
top_rate: 0.122
cardinality: 184
entropy: 5.934
entropy_ratio: 0.7887

NaturalName

text label one_word duplicates

Short ethnonym/community labels (e.g., 'Deaf', 'Turk', 'Persian', 'Japanese'), averaging 11.8 characters and 1.7 words with a median of 2 words. About 34% of rows are duplicates (2,419) and ~49% are single-word entries, with 4,705 unique values across 7,124 rows. Surprising signals: 'Deaf' tops the list at 151 occurrences, and top words include parenthetical religious qualifiers like 'traditions)', '(hindu', '(muslim' (952/477/411), suggesting many entries carry trailing tradition tags that the tokenizer split awkwardly. Treatment: Normalize casing and strip parenthetical tradition suffixes, then treat as a categorical label. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 4,705
len_min: 1
len_max: 39
len_mean: 11.84
len_median: 10
len_p95: 27
word_mean: 1.723
word_median: 2
n_empty: 0
n_duplicates: 2,419
duplicate_rate: 0.3396
vocab_size: 4,343
readability_flesch_mean: 56.42
emoji_rate: 0
url_rate: 0
one_word_rate: 0.4885
allcaps_rate: 0
boilerplate_rate: 0
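
Stripping the parenthetical tradition suffixes can be done with a single regular expression. The example names below are hypothetical, built from the tokens the profiler reports ('(hindu', '(muslim', 'traditions)'), and the helper name is illustrative.

```python
import re

# A trailing "(...)" group with no nested parentheses.
TRAILING_PAREN = re.compile(r"\s*\(([^()]*)\)\s*$")

def split_tradition(name):
    """Return (base_name, qualifier); qualifier is None when no suffix exists."""
    match = TRAILING_PAREN.search(name)
    if not match:
        return (name.strip(), None)
    return (name[:match.start()].strip(), match.group(1).strip())

names = ["Rajput (Hindu traditions)", "Jat (Muslim traditions)", "Deaf"]
split = [split_tradition(n) for n in names]
```

Returning the qualifier instead of discarding it keeps the religious-tradition tag available as its own feature, which may matter since it distinguishes otherwise identical names.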

NaturalPronunciation

text metadata one_word null_rate duplicates

Phonetic respellings of ethnonyms (short hyphenated pronunciation guides like 'PUR-zhun', 'jae-puh-NEEZ', and 'pahsh-TOON') accompanying some other label column. Values are overwhelmingly single tokens (one_word_rate 0.73, word_mean 1.28, len_mean 10.8) and 48.5% are null, so coverage is partial. Duplicates dominate (n_duplicates 2183, duplicate_rate 0.59) with only 1489 unique forms across the 3,672 non-null rows, suggesting a small controlled vocabulary repeated across records. Treatment: Treat as an optional pronunciation lookup keyed to the parent term and drop it before modelling; with ~48% nulls and free-text content, imputation is not meaningful. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 3,452 (48.5%)
unique: 1,489
len_min: 2
len_max: 42
len_mean: 10.77
len_median: 10
len_p95: 21
word_mean: 1.281
word_median: 1
n_empty: 0
n_duplicates: 2,183
duplicate_rate: 0.5945
vocab_size: 1,537
readability_flesch_mean: 69.93
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7271
allcaps_rate: 0.0005447
boilerplate_rate: 0

PercentChristianPGAC

categorical feature

This column appears to be the percentage of Christians computed at the People Group Across Countries (PGAC) level, stored as strings with three-decimal precision rather than as a numeric type. It is heavily zero-inflated: '0.000' accounts for 43.8% of non-null values (3,121 occurrences), and the second mode '3.733' appears in 151 rows, the same count as the 'Deaf' entries in NaturalName, so it may be a single PGAC-level value repeated across one group's rows. With 842 distinct values and entropy ratio 0.58, the distribution is concentrated but long-tailed, and the null rate is negligible at 0.07%. Treatment: Cast to numeric and consider a zero-inflated transform (e.g., log1p with a zero indicator) before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 5 (0.1%)
unique: 842
top_value: 0.000
top_rate: 0.4384
cardinality: 842
entropy: 5.681
entropy_ratio: 0.5846

PercentEvangelical

categorical feature long_tail

PercentEvangelical reads as a numeric share of evangelicals stored as strings, with 401 distinct values across 7124 rows. The distribution is heavily zero-inflated: 65.7% of non-null values are exactly "0.000" (about 58.9% of all rows) and 10.4% of rows are null, leaving a long tail of small fractions like 0.100, 0.200, 0.500. Entropy ratio of 0.364 confirms most of the signal collapses onto that single zero bucket. Treatment: Cast to float, impute the 10.4% nulls, and consider a zero-vs-nonzero indicator alongside the raw value to handle the zero inflation. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 741 (10.4%)
unique: 401
top_value: 0.000
top_rate: 0.6572
cardinality: 401
entropy: 3.146
entropy_ratio: 0.3638
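
Casting the percent strings to numbers and adding indicator columns could look like this with pandas. The sample values are illustrative, and the choice to flag missing rows rather than impute them immediately is an assumption; imputation can follow once the missingness pattern is understood.

```python
import pandas as pd

# Illustrative raw values: numeric strings, a null, and one malformed entry.
raw = pd.Series(["0.000", "0.100", None, "0.500", "abc"], name="PercentEvangelical")

# Non-numeric strings become NaN instead of raising.
pct = pd.to_numeric(raw, errors="coerce")

frame = pd.DataFrame({
    "pct": pct,
    "pct_missing": pct.isna(),
    # NaN compares as False, so missing rows are not counted as nonzero.
    "pct_nonzero": pct.gt(0),
})
```

The same three-column pattern should carry over to the other string-typed percentage columns, such as PercentAdherents and the PC and PGAC variants.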

PercentEvangelicalPC

categorical feature

PercentEvangelicalPC appears to be a numeric percentage (likely the evangelical population share at the People Cluster level, by analogy with the PGAC columns) that has been stored as strings, yielding 166 distinct values across 7124 rows with a 2.15% null rate. The distribution is concentrated: the top value '0.199' covers 12.47% of non-null values, and the leading entries cluster near zero ('0.095', '0.000', '0.004') yet some values reach above 3 ('3.409', '3.339'), suggesting a long right tail. Entropy ratio of 0.78 indicates moderate concentration rather than uniformity. Treatment: Cast to float, impute the ~2% nulls, and consider log or rank transform given the right-tailed values. medium · anthropic:claude-opus-4-7

n: 7,124
nulls: 153 (2.1%)
unique: 166
top_value: 0.199
top_rate: 0.1247
cardinality: 166
entropy: 5.777
entropy_ratio: 0.7833

PercentEvangelicalPGAC

categorical feature

Numeric percentages (likely the evangelical share at the People Group Across Countries level) stored as strings, hence profiled as categorical with 548 distinct values. The distribution is heavily zero-inflated: '0.000' accounts for 48.9% of non-null values, with a secondary spike at '1.801' (151 rows). That count matches the 151 'Deaf' entries in NaturalName, so the spike may be one PGAC-level value repeated across a single group's rows. Null rate is 6.32% and entropy ratio is 0.55, consistent with a long tail of small fractional values. Treatment: Cast to float, impute or flag nulls, and consider a zero-indicator plus log/sqrt transform given the heavy zero mass. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 450 (6.3%)
unique: 548
top_value: 0.000
top_rate: 0.4891
cardinality: 548
entropy: 4.972
entropy_ratio: 0.5465

PCBuddhism

numeric feature high_skew outliers

PCBuddhism appears to be a percentage feature measuring the Buddhist share of each people group, ranging 0 to 100 with mean 6.41. The distribution is extremely zero-inflated: 82.99% of rows are exactly 0, the entire IQR collapses to 0, and so every non-zero row (17.01%) is flagged as an outlier, with skew 3.48 and kurtosis 10.56. This means Buddhism is absent from most groups but reaches sizeable concentrations in a long tail. Treatment: Treat as zero-inflated proportion: add a presence indicator and log1p-transform the non-zero tail before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 24 (0.3%)
unique: 809
min: 0
max: 100
mean: 6.411
median: 0
std: 22.39
q1: 0
q3: 0
iqr: 0
skew: 3.475
kurtosis: 10.56
n_outliers: 1,208
outlier_rate: 0.1701
zero_rate: 0.8299

PCEthnicReligions

numeric feature high_skew outliers

PCEthnicReligions is a numeric percentage-style feature (0–100) capturing the share of some ethnic-religion category, likely per record/region. It's overwhelmingly zero — 78% of values are 0 and the entire interquartile range collapses to 0 — yet the mean is 13.1 with std 30.7, indicating a small set of records carry very large shares. Skew of 2.16 and a 22% outlier rate confirm a sparse, heavy-tailed distribution rather than a smooth continuum. Treatment: Binarize (zero vs non-zero) or apply a zero-inflated/log1p transform before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 18 (0.3%)
unique: 351
min: 0
max: 100
mean: 13.11
median: 0
std: 30.74
q1: 0
q3: 0
iqr: 0
skew: 2.155
kurtosis: 2.885
n_outliers: 1,560
outlier_rate: 0.2195
zero_rate: 0.7805

PCHinduism

numeric feature

This column appears to be the percentage share of Hindus in some geographic or demographic unit, ranging from 0 to 100 with a mean of 29.8. The distribution is strongly bimodal in spirit: 67.7% of rows are exactly zero while Q3 sits at 98.4, indicating most units have no Hindu presence and a substantial minority are nearly entirely Hindu. Skew is 0.87 and kurtosis -1.22, consistent with this U-shaped split rather than a single peak. Treatment: Consider a zero-vs-nonzero indicator plus the raw percentage, since a flat numeric treatment will hide the bimodal structure. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 24 (0.3%)
unique: 1,131
min: 0
max: 100
mean: 29.82
median: 0
std: 44.98
q1: 0
q3: 98.42
iqr: 98.42
skew: 0.8721
kurtosis: -1.216
n_outliers: 0
outlier_rate: 0
zero_rate: 0.6768

PCOtherSmall

numeric feature high_skew outliers

PCOtherSmall is a numeric feature where 88% of rows are zero and the IQR is zero, meaning the bottom three quartiles are all 0. The remaining mass stretches up to 100 with mean 1.84 and std 12.33, producing severe right skew (7.39) and very heavy tails (kurtosis 54.18). About 12% of rows (851) flag as outliers, suggesting this is a sparse share/percentage indicator that fires only for a small subset of records. Treatment: Binarize presence (>0) or apply log1p before modelling to tame the skew. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 24 (0.3%)
unique: 670
min: 0
max: 100
mean: 1.836
median: 0
std: 12.33
q1: 0
q3: 0
iqr: 0
skew: 7.39
kurtosis: 54.18
n_outliers: 851
outlier_rate: 0.1199
zero_rate: 0.8801

RegionCode

numeric feature outliers

RegionCode holds 12 distinct integer values from 1 to 12 with no nulls, so it is almost certainly a categorical region identifier stored as a number rather than a true numeric measure. The distribution is concentrated around the median of 4 with an IQR of just 2, yet the right skew of 1.12 and 601 flagged outliers (8.4%) reflect the long tail of higher-numbered regions rather than genuine anomalies. Treatment: Cast to categorical and one-hot or target-encode; do not treat as a continuous numeric. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 12
min: 1
max: 12
mean: 5.005
median: 4
std: 2.457
q1: 4
q3: 6
iqr: 2
skew: 1.122
kurtosis: 0.5775
n_outliers: 601
outlier_rate: 0.08436
zero_rate: 0
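
A minimal sketch of the one-hot treatment, using plain dicts as a stand-in for dataframe columns. The helper name `one_hot_codes` and the `RegionCode_<n>` column naming are illustrative assumptions.

```python
def one_hot_codes(codes, categories):
    """One-hot encode integer codes as unordered categories.

    Each row becomes a dict like {"RegionCode_4": 1, ...} holding a single 1.
    """
    return [
        {f"RegionCode_{c}": int(code == c) for c in categories}
        for code in codes
    ]

categories = range(1, 13)  # the profile reports codes 1..12
encoded = one_hot_codes([4, 6, 12], categories)
```

Encoding this way removes the artificial ordering implied by the integers, so region 12 is no longer treated as "three times" region 4.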

PopulationPGAC

numeric feature high_skew outliers

PopulationPGAC appears to be a population count tied to some geographic or administrative unit (PGAC), spanning 10 to roughly 925 million across 7,124 rows with only 0.07% nulls. The distribution is extraordinarily right-skewed (skew 25.5, kurtosis 1052): the median is 130,300 while the mean is 4.88 million, and 17.8% of rows flag as outliers. With 1,509 unique values across 7,124 rows, the same population figures repeat heavily, suggesting many rows share the same aggregate. Treatment: log-transform before regression to tame the extreme right skew. medium · anthropic:claude-opus-4-7

n: 7,124
nulls: 5 (0.1%)
unique: 1,509
min: 10
max: 9.251e+08
mean: 4.881e+06
median: 130,300
std: 2.095e+07
q1: 20,000
q3: 1.435e+06
iqr: 1.415e+06
skew: 25.48
kurtosis: 1052
n_outliers: 1,264
outlier_rate: 0.1776
zero_rate: 0
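
To illustrate what the log transform buys, the sketch below applies log10 to the five summary points reported for PopulationPGAC. The helper name is hypothetical; log10 is safe here only because the reported minimum is 10 and zero_rate is 0.

```python
import math

def log10_population(values):
    """log10-transform positive counts; missing values stay None."""
    return [math.log10(v) if v is not None else None for v in values]

# Reported min, q1, median, q3 and max for PopulationPGAC.
reported = [10, 20_000, 130_300, 1_435_000, 925_100_000]
logged = log10_population(reported)

raw_spread = max(reported) / min(reported)  # ratio between the extremes
log_spread = max(logged) - min(logged)      # same gap in orders of magnitude
```

On the raw scale the extremes differ by a factor of tens of millions; on the log scale they sit about eight units apart, which is far easier for a linear model or a chart axis to handle.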

Frontier

categorical feature

Binary Y/N flag indicating whether a record is on the frontier, with no nulls across 7124 rows. The split is imbalanced toward Y at 66.9% (4767) versus N at 2357, though entropy ratio of 0.92 shows both classes are well represented. Treatment: Encode as a 0/1 indicator before modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.6691
cardinality: 2
entropy: 0.9158
entropy_ratio: 0.9158
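
The reported entropy can be reproduced from the two class counts, assuming entropy is measured in bits and entropy_ratio is entropy divided by log2(cardinality), which for a binary column equals the entropy itself. The helper names below are illustrative.

```python
import math

def binary_entropy(count_a, count_b):
    """Shannon entropy in bits of a two-class split."""
    total = count_a + count_b
    h = 0.0
    for count in (count_a, count_b):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def encode_yes_no(values):
    """Map 'Y' -> 1 and 'N' -> 0; any other value raises KeyError."""
    mapping = {"Y": 1, "N": 0}
    return [mapping[v] for v in values]

frontier_entropy = binary_entropy(4767, 2357)  # counts from the profile
encoded = encode_yes_no(["Y", "N", "Y"])
```

Raising on unexpected values is a deliberate choice in this sketch: it surfaces blanks or stray codes instead of silently coercing them.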

MapAddress

text foreign_key one_word short_text duplicates

MapAddress holds single-token PNG filenames (e.g. 'm00328.png'), with one_word_rate of 1.0 and max length 13, suggesting it points to a map image asset. 1500 of 7124 rows are empty strings, and those empties account for most of the 0.352 duplicate_rate: of the 2,508 duplicate rows, 1,499 are repeated empty strings and 1,009 are repeated filenames, so about 18% of the 5,624 non-empty rows reuse a map that another record already has. With 4,615 distinct filenames (4,616 unique values including the empty string), this behaves like a foreign reference to a finite set of map images rather than a free-text field. Treatment: Treat as a categorical asset reference: flag the 1500 empties as missing and join to a map-image lookup. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 4,616
len_min: 0
len_max: 13
len_mean: 8.649
len_median: 10
len_p95: 13
word_mean: 1
word_median: 1
n_empty: 1,500
n_duplicates: 2,508
duplicate_rate: 0.352
vocab_size: 4,615
readability_flesch_mean: 17.62
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
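
The duplicate count mixes repeated empty strings with repeated filenames. The sketch below separates the two using only the reported figures; it assumes n_duplicates = n - unique (which holds here: 7,124 - 4,616 = 2,508) and that the empty string is counted as one of the unique values. The helper name is hypothetical.

```python
def duplicate_breakdown(n, unique, n_empty):
    """Split the duplicate count into repeated empties and repeated values."""
    empty_dupes = max(n_empty - 1, 0)                 # first empty is not a duplicate
    nonempty_rows = n - n_empty
    nonempty_unique = unique - (1 if n_empty else 0)  # remove '' from the uniques
    nonempty_dupes = nonempty_rows - nonempty_unique
    return {
        "empty_dupes": empty_dupes,
        "nonempty_dupes": nonempty_dupes,
        "nonempty_dupe_rate": nonempty_dupes / nonempty_rows,
    }

map_address = duplicate_breakdown(n=7124, unique=4616, n_empty=1500)
```

The same arithmetic applies to any column in this report where missing values are stored as empty strings rather than nulls.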

HasJesusFilm

categorical feature

Binary Y/N flag indicating whether the Jesus Film is available for the entity (likely a language or people group). Heavily skewed toward 'Y' at 78.7% (5,610 of 7,124), with no nulls across all 7,124 rows. Entropy of 0.746 reflects the imbalance but still leaves a usable minority class of 1,514 'N' values. Treatment: Encode as 0/1 boolean; account for class imbalance if used as a target. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.7875
cardinality: 2
entropy: 0.7463
entropy_ratio: 0.7463

Nomadic

categorical feature imbalance

Binary Y/N flag indicating nomadic status, with no nulls across 7124 rows. Severely imbalanced: 'N' dominates at 96.6% (6884 rows) versus only 240 'Y' cases, yielding a low entropy ratio of 0.21. Treatment: Encode as binary; consider class-weighting or stratified sampling due to severe imbalance. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: N
top_rate: 0.9663
cardinality: 2
entropy: 0.2126
entropy_ratio: 0.2126

NomadicTypeDescription

categorical feature null_rate

This is a low-cardinality categorical describing the type of nomadic livelihood, with only 6 distinct values dominated by 'Agro-Pastoralists' (76.7% of non-nulls, 184 records). The column is almost entirely empty — null_rate is 0.9663, leaving roughly 240 populated rows out of 7124. Several values are comma-joined combinations (e.g., 'Agro-Pastoralists, Service or Trade'), suggesting the field encodes multi-label memberships as concatenated strings. Treatment: Split comma-separated values into multi-hot indicators and treat missingness as its own category given the 96.6% null rate. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 6,884 (96.6%)
unique: 6
top_value: Agro-Pastoralists
top_rate: 0.7667
cardinality: 6
entropy: 1.159
entropy_ratio: 0.4483
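
A minimal sketch of the multi-hot treatment, assuming labels are separated by commas exactly as in the 'Agro-Pastoralists, Service or Trade' example. The helper name and the `missing` indicator key are illustrative choices.

```python
def multi_hot(values, sep=","):
    """Expand comma-joined labels into one indicator per label.

    None (or an empty string) sets only the 'missing' indicator.
    """
    parsed = [
        [part.strip() for part in v.split(sep) if part.strip()] if v else []
        for v in values
    ]
    labels = sorted({label for row in parsed for label in row})
    rows = []
    for row in parsed:
        indicators = {label: int(label in row) for label in labels}
        indicators["missing"] = int(not row)
        rows.append(indicators)
    return rows

sample = [None, "Agro-Pastoralists", "Agro-Pastoralists, Service or Trade"]
encoded = multi_hot(sample)
```

Because the null count here (6,884) equals the number of 'N' rows in Nomadic, the `missing` indicator may largely duplicate that flag; that is worth checking before keeping both.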

PhotoCCVersionText

categorical metadata

Creative Commons license version attached to a photo (e.g., 'CC BY 2.0', 'CC BY-SA 4.0'). The field is dominated by empty strings at 79.8% of 7124 rows, with only 16 distinct values and entropy ratio 0.33, so license metadata is missing for the vast majority of records. Among populated values, 'CC BY 2.0' (387) and 'CC BY-SA 4.0' (246) lead. Treatment: Treat empty string as missing and group rare licenses; use as a low-cardinality categorical only where photo licensing matters. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 16
top_value: (empty string)
top_rate: 0.7984
cardinality: 16
entropy: 1.323
entropy_ratio: 0.3307
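
The suggested treatment (empty string as missing, rare licences grouped) might look like the sketch below. The `min_count` threshold and the 'Other' bucket name are assumptions, and the sample is a toy list: only 'CC BY 2.0' and 'CC BY-SA 4.0' come from the profile.

```python
from collections import Counter

def group_rare(values, min_count, other="Other"):
    """Map '' to None and fold labels seen fewer than `min_count` times
    into a single `other` bucket."""
    cleaned = [v if v else None for v in values]
    counts = Counter(v for v in cleaned if v is not None)
    return [
        None if v is None else (v if counts[v] >= min_count else other)
        for v in cleaned
    ]

sample = ["", "CC BY 2.0", "CC BY 2.0", "CC BY-SA 4.0", "CC BY-SA 4.0",
          "hypothetical rare licence", ""]
grouped = group_rare(sample, min_count=2)
```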

PhotoCCVersionURL

categorical metadata

This column holds the URL of the Creative Commons license version applied to an associated photo, drawn from a fixed set of 16 distinct license URIs. About 79.8% of rows (5688 of 7124) are empty strings, so the field is sparsely populated; among the populated minority, CC BY 2.0 (387) and CC BY-SA 4.0 (246) dominate. Entropy ratio of 0.33 confirms heavy concentration on the blank value. Treatment: Treat empty strings as missing and collapse to a categorical license code before any modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 16
top_value: (empty string)
top_rate: 0.7984
cardinality: 16
entropy: 1.323
entropy_ratio: 0.3307

MapCredits

categorical metadata long_tail

Attribution string crediting the data, geography, and design sources for each map (e.g. Joshua Project, GMI, UNESCO, IMB). With 161 distinct values across 7124 rows, the top credit covers 28% of records and a blank string is the second most common value at 1505 rows; near-duplicates differing only by trailing punctuation (the same Omid/UNESCO credit appears with and without a final period) inflate cardinality. Treatment: Normalise whitespace/punctuation to collapse near-duplicates, then drop from modelling as boilerplate provenance. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 161
top_value: People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project
top_rate: 0.28
cardinality: 161
entropy: 3.318
entropy_ratio: 0.4527
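
The near-duplicate credits described above differ only in trailing punctuation, so a small normaliser is enough to collapse them. This is a sketch with a hypothetical helper name; the credit string is the reported top_value.

```python
import re
from collections import Counter

def normalise_credit(text):
    """Collapse runs of whitespace and drop trailing periods and spaces."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(". ")

credit = ("People Group data: Omid. Map geography: UNESCO / GMI. "
          "Map Design: Joshua Project")
variants = [credit, credit + ".", "  " + credit + " "]
counts = Counter(normalise_credit(v) for v in variants)
```

Only trailing periods are removed, so the sentence-internal periods that separate the data, geography, and design credits are preserved.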

MapCreditURL

categorical metadata long_tail imbalance

This column holds attribution URLs for source maps, but 6919 of 7124 rows (top_rate 0.9712) are empty strings, leaving only 31 distinct values across the entire dataset. Among the populated entries, cartomission.com dominates with 100 occurrences while most other domains appear fewer than 10 times, producing a very long tail. Entropy ratio of 0.054 confirms there is almost no information here unless the empty string itself is treated as a meaningful 'no credit' signal. Treatment: Keep as provenance metadata; do not use as a model feature given 97% blanks. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 31
top_value: (empty string)
top_rate: 0.9712
cardinality: 31
entropy: 0.27
entropy_ratio: 0.05449

MapCopyright

categorical feature

A near-binary flag (N/Y) with a third state being an empty string, almost certainly indicating whether map copyright applies. 'N' dominates at 72.95% (5197/7124), blanks account for 1885 rows, and only 42 records are 'Y' — a severe class imbalance that makes the affirmative case nearly negligible. Treatment: Normalize blanks to a missing/'N' category and treat as a low-signal binary flag given the 42-row positive class. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 3
top_value: N
top_rate: 0.7295
cardinality: 3
entropy: 0.8831
entropy_ratio: 0.5572

MapCCVersionText

categorical metadata imbalance

This appears to be a Creative Commons license version field for maps, but it is effectively empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving only 10 rows with actual licenses split across CC BY-SA 3.0 (8), CC0 1.0 (1), and CC BY 3.0 (1). Entropy is just 0.0166 (entropy_ratio 0.0083), so the column carries almost no information despite having 0% nulls — the missingness is encoded as empty strings rather than NaN. Treatment: Drop; near-constant blank with only 10 informative rows. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 4
top_value: (empty string)
top_rate: 0.9986
cardinality: 4
entropy: 0.01662
entropy_ratio: 0.00831

MapCCVersionURL

categorical metadata imbalance

MapCCVersionURL appears to hold a Creative Commons license URL associated with each map record, but it is essentially empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving just 10 rows split across three CC license URLs. Entropy is 0.017 (ratio 0.008), so the column carries almost no information despite having 4 distinct values and zero nulls (the missingness is encoded as "" rather than null). Treatment: Drop; near-constant with empty-string standing in for missing. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 4
top_value: (empty string)
top_rate: 0.9986
cardinality: 4
entropy: 0.01662
entropy_ratio: 0.00831

JF

categorical feature

JF is a binary Y/N flag with no nulls across 7124 rows. The distribution is imbalanced: "Y" accounts for 78.7% (5610 rows) versus 1514 "N", giving an entropy ratio of 0.746. The column name is opaque on its own, but its counts, top_rate, and entropy match HasJesusFilm exactly, so JF is probably a duplicate of that Jesus Film availability flag. Treatment: Confirm row-wise equality with HasJesusFilm and drop one of the two; otherwise encode as a 0/1 indicator and consider class imbalance if used as a target. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.7875
cardinality: 2
entropy: 0.7463
entropy_ratio: 0.7463
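
JF and HasJesusFilm report identical counts in this profile (5,610 'Y', 1,514 'N'), but matching totals do not prove the columns agree row by row. A minimal check, with hypothetical helper names and toy lists standing in for the real columns:

```python
from itertools import islice

def columns_identical(a, b):
    """True when two equally long columns agree on every row."""
    if len(a) != len(b):
        raise ValueError("columns must have the same length")
    return all(x == y for x, y in zip(a, b))

def first_mismatches(a, b, limit=5):
    """Row indices where the columns differ, up to `limit` of them."""
    diffs = (i for i, (x, y) in enumerate(zip(a, b)) if x != y)
    return list(islice(diffs, limit))

# Toy stand-ins for the JF and HasJesusFilm columns.
jf = ["Y", "N", "Y", "Y"]
has_jesus_film = ["Y", "N", "Y", "Y"]
same = columns_identical(jf, has_jesus_film)
```

If the real columns turn out to be identical, one of them can be dropped without losing information; `first_mismatches` helps inspect the rows that differ when they are not.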

AudioRecordings

categorical feature

Binary Y/N flag indicating whether audio recordings exist for each row, with no nulls across 7124 records. The distribution is heavily imbalanced toward 'Y' at 86.9% (6188 vs 936), giving an entropy ratio of 0.56. Treatment: Encode as a 0/1 indicator; be mindful of class imbalance if used as a target. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.8686
cardinality: 2
entropy: 0.5612
entropy_ratio: 0.5612

Window1040

categorical feature

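Window1040 is a binary Y/N flag covering all 7124 rows with no nulls. The distribution is imbalanced: 'Y' accounts for 5910 rows (top_rate 0.8296) versus 1214 'N', giving an entropy ratio of 0.659. The profile evidence alone does not define the flag, but the name most likely refers to the 10/40 Window, the band between roughly 10° and 40° north latitude that missions literature uses to locate most unreached groups; that reading would fit the 83% 'Y' share in an unreached-only dataset. Either way it behaves like a clean indicator variable. Treatment: Encode as a 0/1 indicator and watch for class imbalance when used as a predictor. high · anthropic:claude-opus-4-7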

n: 7,124
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.8296
cardinality: 2
entropy: 0.6586
entropy_ratio: 0.6586

PeopleGroupMapURL

text metadata one_word url_heavy duplicates

This column holds URLs to people-group map images hosted on joshuaproject.net, with every value being a single token (one_word_rate 1.0, url_rate 0.79). 1,500 of 7,124 rows are empty strings and 2,508 are duplicates (duplicate_rate 0.35); 1,499 of those duplicates are repeated empty strings, leaving 1,009 rows (about 18% of the 5,624 populated rows) that reuse a map image already attached to another people group (e.g., m00328.png appears 40 times). With 4,616 unique values across 7,124 rows, this is a reference link rather than a unique key. Treatment: Keep as a display/reference URL; drop from modelling features. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 4,616
len_min: 0
len_max: 66
len_mean: 50.49
len_median: 63
len_p95: 66
word_mean: 1
word_median: 1
n_empty: 1,500
n_duplicates: 2,508
duplicate_rate: 0.352
vocab_size: 4,615
readability_flesch_mean: -568.7
emoji_rate: 0
url_rate: 0.7894
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

PeopleGroupMapExpandedURL

text metadata one_word url_heavy duplicates

This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net, with 72.3% of rows containing a URL and every value being a single token. 1,975 rows (about 27.7%) are empty strings, and 2,793 rows (39.2%) duplicate another value — e.g. m00328.pdf appears 40 times — suggesting many people groups share the same regional map. Treatment: Treat as a reference link; drop from modelling or extract the map ID if joining to a maps table. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 4,331
len_min: 0
len_max: 66
len_mean: 46.2
len_median: 63
len_p95: 66
word_mean: 1
word_median: 1
n_empty: 1,975
n_duplicates: 2,793
duplicate_rate: 0.3921
vocab_size: 4,330
readability_flesch_mean: -468.9
emoji_rate: 0
url_rate: 0.7228
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

PeopleGroupURL

text identifier near_unique one_word url_heavy

This column holds Joshua Project people-group URLs, one per row, with every value a 48-character single-token https link (url_rate 1.0, one_word_rate 1.0, len_min and len_max both 48). All 7124 values are unique with zero nulls or duplicates, so it functions as a per-row identifier rather than a feature. The URLs encode a people-group ID and a country code suffix (e.g., /10375/tz, /10375/up), meaning the same group recurs across countries in the underlying key even though the full URL is unique. Treatment: Drop from modelling; retain as a row-level link key or parse out the people-group ID and country code as separate features. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 7,124
len_min: 48
len_max: 48
len_mean: 48
len_median: 48
len_p95: 48
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 7,124
readability_flesch_mean: -479.9
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
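
Parsing the ID and country code out of the URL could look like the sketch below. The full example URL is an assumption (only the '/10375/tz' suffix and the 48-character length come from the profile), and the function name is hypothetical; the parser relies only on the last two path segments.

```python
from urllib.parse import urlparse

def parse_people_group_url(url):
    """Return (people_group_id, country_code) from the last two path segments."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"unexpected URL shape: {url!r}")
    people_id, country = segments[-2], segments[-1]
    # Upper-case the code so it lines up with codes such as 'IN' in CountryURL.
    return int(people_id), country.upper()

# Assumed example URL; only its '/10375/tz' suffix appears in the profile.
example = "https://joshuaproject.net/people_groups/10375/tz"
parsed = parse_people_group_url(example)
```

The two parsed parts are likely to duplicate existing ID and country columns, so they may be more useful as a consistency check than as new features.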

PeopleGroupPhotoURL

text metadata one_word url_heavy duplicates

This column holds Joshua Project people-group photo URLs, with every populated cell being a single joshuaproject.net/assets/media/profiles/photos/.jpg link (url_rate 0.72, one_word_rate 1.0). 1971 of 7124 rows are empty strings (no nulls reported), and the same image URLs repeat heavily — duplicate_rate is 0.60 with only 2880 unique values, the top URL appearing 90 times. The same photo is clearly being reused across many people-group records rather than being a unique per-row asset. Treatment: Treat as an optional asset link; drop or replace empty strings with null and do not use as a feature. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 2,880
len_min: 0
len_max: 68
len_mean: 47.04
len_median: 65
len_p95: 65
word_mean: 1
word_median: 1
n_empty: 1,971
n_duplicates: 4,244
duplicate_rate: 0.5957
vocab_size: 2,879
readability_flesch_mean: -604.3
emoji_rate: 0
url_rate: 0.7233
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

CountryURL

categorical foreign_key

Country-level URLs pointing to joshuaproject.net profile pages, with the 2-letter country code as the path segment. There are 202 distinct countries across 7,124 rows and no nulls, but the distribution is heavily concentrated: India alone accounts for 28.5% of rows (2,032), with Pakistan (767) a distant second. Entropy ratio of 0.66 confirms moderate skew toward a handful of South Asian countries. Treatment: Extract the trailing country code as a categorical key; treat the URL itself as redundant metadata. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 202
top_value: https://joshuaproject.net/countries/IN
top_rate: 0.2852
cardinality: 202
entropy: 5.058
entropy_ratio: 0.6605

JPScaleText

categorical metadata imbalance

JPScaleText is a categorical field that holds a single value, "Unreached", across all 7124 rows with no nulls. With cardinality of 1 and entropy of 0, it carries no information and cannot discriminate between records. The constant value suggests this dataset has been pre-filtered to unreached people groups only. Treatment: Drop; constant column with zero entropy. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 1
top_value: Unreached
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
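
Constant columns such as this one can be found mechanically before modelling. A minimal sketch, using a plain dict of lists as a stand-in for a dataframe (the helper name and toy table are illustrative, and values are assumed hashable):

```python
def constant_columns(table):
    """Names of columns that hold at most one distinct value.

    `table` maps column name -> list of hashable values.
    """
    return [name for name, values in table.items() if len(set(values)) <= 1]

toy = {
    "JPScaleText": ["Unreached", "Unreached", "Unreached"],
    "Frontier": ["Y", "N", "Y"],
}
to_drop = constant_columns(toy)
```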

JPScaleImageURL

categorical metadata imbalance

Every one of the 7,124 rows holds the same URL, https://joshuaproject.net/assets/img/gauge/gauge-1.png, giving a single unique value and zero entropy. This looks like a static asset link (a JP Scale gauge image) attached to each record rather than a discriminating feature. It carries no information for analysis or modelling. Treatment: Drop; constant column with a single value across all rows. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 1
top_value: https://joshuaproject.net/assets/img/gauge/gauge-1.png
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

Summary

text free_text one_word duplicates

Free-text English summaries describing people groups, with South Asian examples such as Rajputs, Jats, Bania, and Beldar, averaging 51 words with median length 316 characters. Coverage is poor: 3,167 of 7,124 rows (44%) are empty strings. The 48% duplicate rate is driven almost entirely by those empties: of the 3,439 duplicate rows, 3,166 are repeated empty strings and only 273 (about 7% of the 3,957 non-empty rows) are repeated text, leaving 3,684 distinct summaries. Several near-identical Rajput paragraphs differ by only a word or two, suggesting lightly edited copies of the same source text rather than independent summaries, and exact-match duplicate counts will miss these. Flesch readability of 30.4 indicates fairly difficult prose. Treatment: Treat the 3,167 empty rows as missing and deduplicate near-identical entries before tokenizing and embedding. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 3,685
len_min: 0
len_max: 1,212
len_mean: 309.7
len_median: 316
len_p95: 793
word_mean: 51.26
word_median: 52
n_empty: 3,167
n_duplicates: 3,439
duplicate_rate: 0.4827
vocab_size: 24,501
readability_flesch_mean: 30.4
emoji_rate: 0
url_rate: 0
one_word_rate: 0.4446
allcaps_rate: 0
boilerplate_rate: 0.0002807
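
Exact-match deduplication misses paragraphs that differ by a word or two. One way to catch them is a similarity threshold, sketched here with the standard library's `difflib.SequenceMatcher`. The 0.9 threshold, the greedy grouping strategy, and the helper name are assumptions, and the sample sentences are invented placeholders, not text from the dataset.

```python
from difflib import SequenceMatcher

def near_duplicate_groups(texts, threshold=0.9):
    """Greedily group texts whose similarity to a group's first member
    is at least `threshold`. Empty strings are skipped as missing.

    Quadratic in the number of distinct texts, so suited to a few
    thousand short strings rather than large corpora.
    """
    groups = []  # each group is a list of texts; group[0] is its anchor
    for text in dict.fromkeys(t for t in texts if t):  # unique, order kept
        for group in groups:
            if SequenceMatcher(None, group[0], text).ratio() >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups

# Invented placeholder sentences standing in for lightly edited summaries.
sample = [
    "People group A lives mainly in the northern hills and farms rice.",
    "People group A lives mainly in the northern hills and farms wheat.",
    "",
    "People group B is a coastal fishing community with its own dialect.",
]
groups = near_duplicate_groups(sample)
```

Keeping one representative per group (for example `group[0]`) reduces the weight of copied passages in any downstream embedding or topic model.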

Obstacles

text free_text one_word duplicates

Free-text English prose describing barriers to Christian evangelism among various people groups (Rajputs, Jats, Bosniaks, Azeri, etc.), averaging 18 words and 107 characters per entry. Notably, 3167 of 7124 rows are empty strings, and they dominate the 0.489 duplicate rate: 3,166 of the 3,483 duplicate rows are repeated empties, while 317 (about 8% of the 3,957 non-empty rows) are repeated text. Among those, a single Rajput-pride passage is repeated 88 times and a near-identical Jat passage appears as both 74- and 7-count variants. Readability is low (Flesch 31.6) and vocabulary is modest (9760 unique words), consistent with a templated missiological description field. Treatment: Treat empties as missing, dedupe near-identical passages, then tokenize and embed for downstream topic or similarity analysis. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 3,641
len_min: 0
len_max: 726
len_mean: 106.9
len_median: 95
len_p95: 317
word_mean: 18.37
word_median: 16
n_empty: 3,167
n_duplicates: 3,483
duplicate_rate: 0.4889
vocab_size: 9,760
readability_flesch_mean: 31.62
emoji_rate: 0
url_rate: 0
one_word_rate: 0.4446
allcaps_rate: 0
boilerplate_rate: 0.0009826

HowReach

text free_text one_word duplicates

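Free-text English prose describing outreach/engagement strategies for various people groups, likely a 'how to reach' field in a missions dataset. Over half the rows (3883 of 7124) are empty strings, which is why the median length is 0 characters; the reported median word count of 1 and one_word_rate of 0.5451 (= 3883/7124) indicate that the profiler counts an empty string as a single word. The 0.60 duplicate_rate is also mostly empties: 3,882 of the 4,271 duplicate rows are repeated empty strings, while 389 (12% of the 3,241 non-empty rows) are repeated text, with the same Jats and Rajputs paragraphs recurring dozens of times. Readability is low (Flesch 27.3) and vocabulary reaches 7803 tokens across the non-empty rows. Treatment: Treat empty strings as missing, deduplicate boilerplate, then tokenize and embed for downstream NLP. high · anthropic:claude-opus-4-7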

n: 7,124
nulls: 0 (0.0%)
unique: 2,853
len_min: 0
len_max: 599
len_mean: 80.82
len_median: 0
len_p95: 260
word_mean: 14.08
word_median: 1
n_empty: 3,883
n_duplicates: 4,271
duplicate_rate: 0.5995
vocab_size: 7,803
readability_flesch_mean: 27.34
emoji_rate: 0
url_rate: 0
one_word_rate: 0.5451
allcaps_rate: 0
boilerplate_rate: 0.0002807

PrayForChurch

text free_text one_word duplicates

Free-text prayer prompts for an unreached-people-group / church-planting dataset, written in English (1473 detected) and centered on words like 'pray', 'Christ', 'among'. The field is sparsely populated: 5032 of 7124 rows are empty and only 1713 unique strings exist. The 0.76 duplicate rate mostly reflects those empties (5,031 of the 5,411 duplicate rows are repeated empty strings); the remaining 380 duplicates, about 18% of the 2,092 populated rows, come from boilerplate prayers reused across people groups (the top non-empty value repeats 146 times). Readability is low (Flesch 19.5) and length varies wildly from 0 to 649 chars, so the column is a mix of nothing, one-liners, and full paragraphs. Treatment: Treat as optional long-form text: mark empties as missing and tokenize/embed the rest before any modelling. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 1,713
len_min: 0
len_max: 649
len_mean: 59.63
len_median: 0
len_p95: 286
word_mean: 11.19
word_median: 1
n_empty: 5,032
n_duplicates: 5,411
duplicate_rate: 0.7595
vocab_size: 4,447
readability_flesch_mean: 19.5
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7063
allcaps_rate: 0
boilerplate_rate: 0

PrayForPG

text free_text one_word duplicates

Free-text prayer points for people groups (PG), each entry a short paragraph of intercessions led by the verb 'pray' (5450 occurrences). Nearly half the rows are empty (3405 of 7124), and those empties make up most of the 0.517 duplicate_rate: 3,404 of the 3,683 duplicate rows are repeated empty strings, while 279 (about 7.5% of the 3,719 populated rows) reuse boilerplate templates, the top non-empty value repeating 88 times. Readability is low (Flesch mean 32.7) and all detected language is English (2528 rows tagged en). Treatment: Treat as free-text: drop empties, dedupe boilerplate, then tokenize/embed if used as a feature. high · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: 3,441
len_min: 0
len_max: 937
len_mean: 163.1
len_median: 120
len_p95: 453.8
word_mean: 28.23
word_median: 20
n_empty: 3,405
n_duplicates: 3,683
duplicate_rate: 0.517
vocab_size: 9,291
readability_flesch_mean: 32.72
emoji_rate: 0
url_rate: 0
one_word_rate: 0.478
allcaps_rate: 0
boilerplate_rate: 0

Resources

unknown other skipped

The column is named "Resources" with 7124 rows and zero nulls, but saturn skipped profiling so the kind is unknown and no unique count or value statistics were computed. Without type inference or sample values, its content (numeric, list, text, or identifier) cannot be determined from the evidence. Treatment: Re-profile or inspect raw samples to establish type before any downstream use. low · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: —

country_data

unknown other skipped

The column `country_data` was skipped by the profiler, so its kind is unrecorded and no statistics, uniqueness, or value distribution are available. The only confirmed signals are 7124 rows with a 0.0 null rate. Without further inspection, the contents (likely some country-related payload given the name) cannot be characterised. Treatment: Re-profile or manually inspect this column before any downstream use. low · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: —

language_data

unknown other skipped

The column `language_data` was skipped by the profiler — its kind is unrecognised and no descriptive statistics, uniqueness count, or value samples were emitted. Only the row count (7124) and a null rate of 0.0 are available, so nothing can be said about content, cardinality, or distribution. The name hints at linguistic payloads (possibly nested or serialised), but this is not corroborated by evidence. Treatment: Re-profile after parsing or casting to a supported type before deciding on use. low · anthropic:claude-opus-4-7

n: 7,124
nulls: 0 (0.0%)
unique: —
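
For the three skipped columns (Resources, country_data, language_data), a first inspection step could simply count the Python types found in a sample of raw values. This is a sketch under the assumption that the values have already been loaded into a list; the helper name is hypothetical and the toy payloads are invented, since the real contents are unknown.

```python
from collections import Counter

def type_summary(values, sample_size=1000):
    """Count Python type names over the first `sample_size` values."""
    return Counter(type(v).__name__ for v in values[:sample_size])

# Hypothetical payloads; the real contents of these columns are unknown.
toy_column = [{"a": 1}, {"a": 2}, [], "text", None]
summary = type_summary(toy_column)
```

A column dominated by `dict` or `list` values would point to nested data that needs flattening or serialising before re-profiling.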
