joshua project enriched

source /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_enriched.parquet 16,382 rows 109 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is the Joshua Project people-groups dataset: 16,382 rows and 109 columns describing ethnic groups by country, with demographics, language, religion mix, and Christian-engagement indicators. The shape is dominated by categorical and text fields; Continent, AffinityBloc, PrimaryReligion, and the JPScale 'reachedness' rating give the cleanest first read on who is in the file. Population is extremely long-tailed (median 20,000 but max ~913M and skew ~91), so any size analysis should use logs or quantiles rather than means. Religion-share columns like PCIslam, PCHinduism, and PCBuddhism are mostly zero with a minority of groups at very high percentages, which suggests that one religion typically dominates each group. Watch out for columns with very high null rates (RLG4 96%, NomadicTypeDescription 98%, PrimaryLanguageDialect 92%), a more moderate one in NTOnline (29%, with a single distinct value), and many near-duplicate URL/ID fields that won't add analytic value.

citing: Continent · AffinityBloc · PrimaryReligion · JPScaleText · Population · PCIslam · PCHinduism · RegionName · LeastReached · Frontier

Charts the summary said to look at first

Continent · Where the people groups sit geographically — Asia dominates at ~45% of rows.

Top values for Continent (7 unique shown, of 7 total).
value	count	share
Asia	7368	45.0%
Africa	3635	22.2%
Europe	1532	9.4%
North America	1407	8.6%
Australia	1088	6.6%
South America	905	5.5%
Oceania	447	2.7%

PrimaryReligion · Religion mix across groups; Christianity and Islam together cover most of the file.

Top values for PrimaryReligion (8 unique shown, of 8 total).
value	count	share
Christianity	6459	39.4%
Islam	3786	23.1%
Ethnic Religions	2651	16.2%
Hinduism	2338	14.3%
Buddhism	635	3.9%
Non-Religious	200	1.2%
Unknown	189	1.2%
Other / Small	124	0.8%

JPScaleText · Reachedness rating — note that 'Unreached' is the single largest bucket (~43%).

Top values for JPScaleText (5 unique shown, of 5 total).
value	count	share
Unreached	7124	43.5%
Partially Reached	3636	22.2%
Significantly Reached	3200	19.5%
Superficially Reached	1413	8.6%
Minimally Reached	1009	6.2%

Population · Highly skewed group sizes (median 20k, max ~913M); plot on a log scale to see the shape.

Histogram bins for Population (median: 20000.0).
bin	count
10 – 2.282e+07	16302
2.282e+07 – 4.565e+07	32
4.565e+07 – 6.847e+07	12
6.847e+07 – 9.13e+07	3
9.13e+07 – 1.141e+08	3
1.141e+08 – 1.369e+08	3
1.369e+08 – 1.598e+08	0
1.598e+08 – 1.826e+08	0
1.826e+08 – 2.054e+08	1
2.054e+08 – 2.282e+08	0
2.282e+08 – 2.511e+08	0
2.511e+08 – 2.739e+08	0
2.739e+08 – 2.967e+08	0
2.967e+08 – 3.195e+08	0
3.195e+08 – 3.424e+08	0
3.424e+08 – 3.652e+08	0
3.652e+08 – 3.88e+08	0
3.88e+08 – 4.108e+08	0
4.108e+08 – 4.337e+08	0
4.337e+08 – 4.565e+08	0
4.565e+08 – 4.793e+08	0
4.793e+08 – 5.021e+08	0
5.021e+08 – 5.249e+08	0
5.249e+08 – 5.478e+08	0
5.478e+08 – 5.706e+08	0
5.706e+08 – 5.934e+08	0
5.934e+08 – 6.162e+08	0
6.162e+08 – 6.391e+08	0
6.391e+08 – 6.619e+08	0
6.619e+08 – 6.847e+08	0
6.847e+08 – 7.075e+08	0
7.075e+08 – 7.304e+08	0
7.304e+08 – 7.532e+08	0
7.532e+08 – 7.76e+08	0
7.76e+08 – 7.988e+08	0
7.988e+08 – 8.217e+08	0
8.217e+08 – 8.445e+08	0
8.445e+08 – 8.673e+08	0
8.673e+08 – 8.901e+08	0
8.901e+08 – 9.13e+08	1

AffinityBloc · Top-level cultural groupings; South Asian and Sub-Saharan blocs lead.

Top values for AffinityBloc (16 unique shown, of 16 total).
value	count	share
South Asian Peoples	4178	25.5%
Sub-Saharan Peoples	3073	18.8%
Eurasian Peoples	1593	9.7%
Pacific Islanders	1588	9.7%
Latin-Caribbean Americans	1352	8.3%
Malay Peoples	1031	6.3%
Southeast Asian Peoples	635	3.9%
Arab World	634	3.9%
Tibetan-Himalayan Peoples	453	2.8%
North American Peoples	415	2.5%
East Asian Peoples	402	2.5%
Turkic Peoples	299	1.8%
Persian-Median	228	1.4%
Horn of Africa Peoples	200	1.2%
Deaf	164	1.0%
Jewish	137	0.8%

Schema

109 columns

Per-column summary.
column	type	nulls	unique	alerts
PeopleID3ROG3	text	0.0%	16,382	near_unique one_word allcaps short_text
ROG3	categorical	0.0%	238
PeopleID3	numeric	0.0%	10,415
ROP3	numeric	0.1%	10,405
PeopNameInCountry	text	0.0%	10,748	one_word duplicates
ROG2	categorical	0.0%	7
Continent	categorical	0.0%	7
RegionName	categorical	0.0%	12
ISO3	categorical	0.0%	238
LocationInCountry	text	45.1%	7,794	multilingual null_rate
PeopleID1	numeric	0.0%	16
ROP1	categorical	0.0%	16
AffinityBloc	categorical	0.0%	16
PeopleID2	numeric	0.0%	267
ROP2	categorical	0.0%	214
PeopleCluster	categorical	0.0%	267
PeopNameAcrossCountries	text	0.0%	10,377	one_word duplicates
Population	numeric	0.2%	1,708	high_skew outliers
Category	categorical	0.0%	3
ROL3	text	0.0%	6,164	one_word short_text duplicates
PrimaryLanguageName	text	0.0%	6,153	one_word short_text duplicates
PrimaryLanguageDialect	categorical	92.3%	980	long_tail null_rate
NumberLanguagesSpoken	numeric	0.0%	78	high_skew outliers
OfficialLang	categorical	0.0%	87
SpeakNationalLang	unknown	0.0%	—	skipped
BibleStatus	numeric	0.0%	6
BibleYear	categorical	52.4%	466	null_rate
NTYear	text	30.5%	1,072	one_word allcaps null_rate short_text duplicates
PortionsYear	text	17.9%	1,737	one_word allcaps short_text duplicates
TranslationNeedQuestionable	unknown	0.0%	—	skipped
JPScale	numeric	0.0%	5
JPScalePC	categorical	0.0%	5
JPScalePGAC	categorical	0.0%	5
LeastReached	categorical	0.0%	2
LeastReachedPC	categorical	0.0%	2
LeastReachedPGAC	categorical	0.0%	2
GSEC	categorical	0.0%	8
HasAudioRecordings	categorical	0.0%	2
NTOnline	categorical	28.5%	1	null_rate imbalance
RLG3	numeric	0.0%	8
RLG3PC	numeric	0.0%	8
RLG3PGAC	numeric	0.0%	8
PrimaryReligion	categorical	0.0%	8
PrimaryReligionPC	categorical	0.0%	8
PrimaryReligionPGAC	categorical	0.0%	8
RLG4	numeric	96.2%	20	null_rate outliers
ReligionSubdivision	categorical	96.2%	20	null_rate
PCIslam	numeric	0.5%	1,117	outliers
PCNonReligious	numeric	0.4%	223	high_skew outliers
PCUnknown	numeric	0.6%	583	high_skew
SecurityLevel	numeric	0.0%	3
LRTop100	categorical	0.0%	2	imbalance
PhotoAddress	text	0.0%	5,277	one_word short_text duplicates
PhotoCredits	text	0.1%	1,605	one_word duplicates
PhotoCreditURL	text	33.1%	1,434	one_word url_heavy null_rate duplicates
PhotoCreativeCommons	categorical	0.0%	2
PhotoCopyright	categorical	0.1%	2
PhotoPermission	categorical	0.1%	3
ProfileTextExists	categorical	0.0%	2
CountOfCountries	numeric	0.0%	48	high_skew outliers
CountOfProvinces	unknown	0.0%	—	skipped
Longitude	numeric	0.0%	15,889	high_skew
Latitude	numeric	0.0%	15,851
Ctry	categorical	0.0%	238
IndigenousCode	categorical	0.0%	2
PercentAdherents	text	0.0%	1,248	one_word allcaps short_text duplicates
PercentChristianPC	categorical	0.0%	246
NaturalName	text	0.0%	10,737	one_word duplicates
NaturalPronunciation	text	69.6%	1,933	one_word null_rate duplicates
PercentChristianPGAC	text	0.1%	1,954	one_word allcaps short_text duplicates
PercentEvangelical	text	6.7%	1,047	one_word allcaps short_text duplicates
PercentEvangelicalPC	categorical	1.0%	228
PercentEvangelicalPGAC	text	4.5%	1,624	one_word allcaps short_text duplicates
PCBuddhism	numeric	0.6%	1,052	high_skew outliers
PCEthnicReligions	numeric	0.4%	978	outliers
PCHinduism	numeric	0.6%	1,412	high_skew outliers
PCOtherSmall	numeric	0.6%	908	high_skew outliers
RegionCode	numeric	0.0%	12
PopulationPGAC	numeric	0.1%	2,250	high_skew outliers
Frontier	categorical	0.0%	2
MapAddress	text	0.0%	6,029	one_word short_text duplicates
HasJesusFilm	categorical	0.0%	2
Nomadic	categorical	0.0%	2	imbalance
NomadicTypeDescription	categorical	98.1%	6	null_rate
PhotoCCVersionText	categorical	0.0%	17
PhotoCCVersionURL	categorical	0.0%	17
MapCredits	categorical	0.0%	199	long_tail
MapCreditURL	categorical	0.0%	51	long_tail imbalance
MapCopyright	categorical	0.0%	3
MapCCVersionText	categorical	0.0%	6	imbalance
MapCCVersionURL	categorical	0.0%	6	imbalance
JF	categorical	0.0%	2
AudioRecordings	categorical	0.0%	2
Window1040	categorical	0.0%	2
PeopleGroupMapURL	text	0.0%	6,029	one_word url_heavy duplicates
PeopleGroupMapExpandedURL	text	0.0%	5,561	one_word url_heavy duplicates
PeopleGroupURL	text	0.0%	16,382	near_unique one_word url_heavy
PeopleGroupPhotoURL	text	0.0%	5,277	one_word url_heavy duplicates
CountryURL	categorical	0.0%	238
JPScaleText	categorical	0.0%	5
JPScaleImageURL	categorical	0.0%	5
Summary	text	0.0%	3,778	one_word duplicates
Obstacles	text	0.0%	3,732	one_word duplicates
HowReach	text	0.0%	2,944	one_word duplicates
PrayForChurch	text	0.0%	1,791	one_word duplicates
PrayForPG	text	0.0%	3,530	one_word duplicates
Resources	unknown	0.0%	—	skipped
country_data	unknown	0.0%	—	skipped
language_data	unknown	0.0%	—	skipped

PeopleID3ROG3

text identifier near_unique one_word allcaps short_text

PeopleID3ROG3 is a row-level composite identifier, not a person-level one: each row is one people group in one country, and every one of the 16,382 rows has a unique 7-character single token (n_unique equals n, duplicate_rate 0, len_min and len_max both 7). The sampled values are a 5-digit numeric prefix followed by a 2-letter suffix (e.g. '10375tz', '10375up'), which matches the column name: a PeopleID3 value (10119 to 22661) concatenated with a ROG3 country code. The samples print in lowercase while allcaps_rate is 1, so the samples were probably case-folded by the profiler; check the raw casing before joining on this key. No nulls, no boilerplate, no duplicates: clean, but useless as a feature. Treatment: Drop from modelling; retain only as a join key. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 16,382
len_min: 7
len_max: 7
len_mean: 7
len_median: 7
len_p95: 7
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 16,382
readability_flesch_mean: 115.3
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
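
The composite layout described above can be checked with a short sketch. It assumes the key is exactly five digits followed by two letters, as in the sampled values; the function name `split_people_country_key` is hypothetical, and upper-casing the suffix is a choice made here, not something the report specifies.

```python
import re

# Assumed layout: 5 digits (PeopleID3) followed by 2 letters (ROG3).
KEY_PATTERN = re.compile(r"(\d{5})([A-Za-z]{2})")


def split_people_country_key(key):
    """Return (people_id3, rog3), or None when the key does not fit the layout."""
    match = KEY_PATTERN.fullmatch(key.strip())
    if match is None:
        return None
    # Upper-case the country part so it compares cleanly against ROG3.
    return int(match.group(1)), match.group(2).upper()


print(split_people_country_key("10375tz"))
```

Rows for which this returns None would be worth inspecting before the column is used as a join key.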

ROG3

categorical feature

ROG3 holds 238 distinct two-letter country codes, with IN (2,262 rows, 13.8%) leading, followed by PP, ID, PK, CH, and NI. No nulls across 16,382 rows; the entropy ratio of 0.79 indicates a broad spread across many countries with moderate concentration in the top few. The codes are not ISO 3166-1 alpha-2: 'PP' is not assigned there, and the leading codes line up with the ISO3 column's IND, PNG, IDN, PAK, CHN, which fits a FIPS 10-4-style scheme (PP for Papua New Guinea, CH for China) rather than ISO. ROG3 and ISO3 also share the same cardinality (238), top_rate (0.1381), and entropy (6.225), so ROG3 is most likely a second encoding of the same country dimension. Treatment: Treat as a high-cardinality categorical: target- or frequency-encode for modelling, and use the ISO3 column in the same file as the crosswalk instead of assuming ISO alpha-2. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 238
top_value: IN
top_rate: 0.1381
cardinality: 238
entropy: 6.225
entropy_ratio: 0.7885
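
One way to reconcile ROG3 against ISO3 is to build the crosswalk from the file itself and flag any code that maps to more than one target. The sketch below works on plain (ROG3, ISO3) pairs; the helper name `build_crosswalk` is hypothetical, and the sample pairs are assumed for illustration, since the report lists the top values of each column but not the row-level pairing.

```python
from collections import defaultdict


def build_crosswalk(pairs):
    """Map each left code to its single right code; collect ambiguous ones."""
    targets = defaultdict(set)
    for left, right in pairs:
        targets[left].add(right)
    crosswalk = {k: next(iter(v)) for k, v in targets.items() if len(v) == 1}
    conflicts = {k: sorted(v) for k, v in targets.items() if len(v) > 1}
    return crosswalk, conflicts


# Assumed example pairs, not read from the dataset.
rows = [("IN", "IND"), ("PP", "PNG"), ("CH", "CHN"), ("IN", "IND")]
crosswalk, conflicts = build_crosswalk(rows)
print(crosswalk, conflicts)
```

An empty `conflicts` dict on the real data would support treating the two columns as one dimension and dropping one of them.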

PeopleID3

numeric foreign_key

PeopleID3 is an integer key ranging from 10119 to 22661 with 10415 unique values across 16382 rows and no nulls. The duplication (about 5967 repeated entries) and the bounded, non-zero range are consistent with a foreign-key reference to a people table rather than a measurement. Distribution is mildly right-skewed (0.37) and platykurtic (-0.98), with no outliers flagged. Treatment: Treat as a foreign key and left-join to the people dimension; do not use as a numeric feature. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 10,415
min: 10,119
max: 22,661
mean: 1.541e+04
median: 14,962
std: 3478
q1: 12,348
q3: 1.833e+04
iqr: 5984
skew: 0.3697
kurtosis: -0.9812
n_outliers: 0
outlier_rate: 0
zero_rate: 0

ROP3

numeric identifier

ROP3 is a numeric column tightly bounded between 100004 and 119649 across 16382 rows, with 10405 unique values and almost no nulls (0.07%). The mean (109058.6) and median (108856) sit close together with low skew (0.15) and slightly platykurtic shape (kurtosis -1.05), and the profiler flagged zero outliers. The narrow ~20k range starting just above 100000, together with a cardinality close to PeopleID3's (10,415), looks more like a people-group identifier than a free-ranging measurement. Treatment: Treat as an identifier or categorical code rather than a continuous feature; do not scale or log-transform. medium · anthropic:claude-opus-4-7

n: 16,382
nulls: 11 (0.1%)
unique: 10,405
min: 100,004
max: 119,649
mean: 1.091e+05
median: 108,856
std: 5405
q1: 104,189
q3: 113,527
iqr: 9,338
skew: 0.1507
kurtosis: -1.053
n_outliers: 0
outlier_rate: 0
zero_rate: 0

PeopNameInCountry

text label one_word duplicates

Short ethnonym/people-group labels naming a population within a country (e.g., 'French', 'British', 'Han Chinese, Mandarin'), with a median length of 9 characters and median word count of 1. 53% of values are single words and 34% are duplicates (5634 rows), so the same group label recurs across many country rows; 'Deaf' tops the list at 164. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' indicate a Joshua-Project-style people-group taxonomy rather than free text. Treatment: Treat as a categorical people-group label; normalize casing/parentheticals and join with country to form the unique key. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 10,748
len_min: 1
len_max: 41
len_mean: 11.46
len_median: 9
len_p95: 25
word_mean: 1.636
word_median: 1
n_empty: 0
n_duplicates: 5,634
duplicate_rate: 0.3439
vocab_size: 11,539
readability_flesch_mean: 49.27
emoji_rate: 0
url_rate: 0
one_word_rate: 0.5313
allcaps_rate: 0
boilerplate_rate: 0
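
Normalizing the parenthetical qualifiers could look like the sketch below, which separates a trailing '(...)' from the base name. The function name `split_qualifier` and the group name 'Examplegroup' are hypothetical; only the qualifier text comes from the report.

```python
import re

# A base name followed by one trailing parenthetical, e.g. "(Hindu traditions)".
QUALIFIER = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")


def split_qualifier(name):
    """Return (base_name, qualifier); qualifier is None when absent."""
    match = QUALIFIER.match(name)
    if match is None:
        return name.strip(), None
    return match.group("base").strip(), match.group("qualifier").strip()


print(split_qualifier("Examplegroup (Hindu traditions)"))
print(split_qualifier("Han Chinese, Mandarin"))
```

Keeping the qualifier as its own column preserves the distinction between, say, Hindu and Muslim branches of the same named group.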

ROG2

categorical feature

ROG2 is a low-cardinality categorical with 7 codes that look like macro-region groupings (ASI, AFR, EUR, NAR, AUS, LAM, SOP). Distribution is uneven but not degenerate: ASI dominates at 45.0% of 16,382 rows, AFR follows at 3,635, and entropy ratio of 0.80 confirms broad spread across the remaining buckets. No nulls and no rare-value tail beyond the seven codes. Treatment: one-hot or target-encode; safe to use directly given clean 7-level categorical with no missingness. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 7
top_value: ASI
top_rate: 0.4498
cardinality: 7
entropy: 2.257
entropy_ratio: 0.8039
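
The report does not define entropy or entropy_ratio, but the figures are consistent with Shannon entropy in bits divided by log2 of the number of distinct values. The sketch below checks that reading against the Continent counts from the chart table, which share ROG2's top_rate and entropy; the function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0
    return entropy_bits(counts) / math.log2(k)


# Continent counts as listed in the chart table.
continent_counts = [7368, 3635, 1532, 1407, 1088, 905, 447]
print(round(entropy_bits(continent_counts), 3))
print(round(entropy_ratio(continent_counts), 4))
```

The same definition also fits ROG3, where 6.225 divided by log2(238) gives roughly 0.7885.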

Continent

categorical feature

Categorical continent label with all 7 expected values and zero nulls across 16,382 rows. Asia dominates at 44.98% (7,368 rows), followed by Africa at 3,635; entropy ratio of 0.80 confirms a moderately skewed but not degenerate distribution. Note that Australia (1,088) and Oceania (447) appear as separate categories, which is unusual and suggests inconsistent regional coding worth reconciling. Treatment: One-hot encode after merging Australia and Oceania into a single category. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 7
top_value: Asia
top_rate: 0.4498
cardinality: 7
entropy: 2.257
entropy_ratio: 0.8039
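
The suggested merge is a simple recode. The sketch below applies it to the counts from the chart table; choosing 'Oceania' as the merged label is an assumption, and `merge_continents` is a hypothetical helper.

```python
from collections import Counter

# Assumed target label for the merged category.
CONTINENT_MERGE = {"Australia": "Oceania"}


def merge_continents(counts, mapping=CONTINENT_MERGE):
    """Recode labels through the mapping and re-aggregate the counts."""
    merged = Counter()
    for label, n in counts.items():
        merged[mapping.get(label, label)] += n
    return merged


continent_counts = {
    "Asia": 7368, "Africa": 3635, "Europe": 1532, "North America": 1407,
    "Australia": 1088, "South America": 905, "Oceania": 447,
}
merged = merge_continents(continent_counts)
print(merged.most_common())
```

After the merge the column has six levels, and the combined category holds 1,088 + 447 = 1,535 rows.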

RegionName

categorical feature

RegionName is a categorical geographic grouping with 12 distinct world regions and no nulls across 16,382 rows. Distribution is fairly balanced (entropy ratio 0.93), though 'Asia, South' leads at 22.6% (3,707 rows) and the top three regions are all Asian or African. The labels use a 'Continent, Subregion' convention which may need parsing if continent-level rollups are wanted. Treatment: one-hot or target-encode for modelling; optionally split on the comma to derive a continent column. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 12
top_value: Asia, South
top_rate: 0.2263
cardinality: 12
entropy: 3.325
entropy_ratio: 0.9275
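
Splitting on the comma can be sketched as follows. Only 'Asia, South' is taken from the report; the other eleven labels are not shown, so the no-comma case is handled defensively, and `split_region` is a hypothetical name.

```python
def split_region(region_name):
    """Split 'Continent, Subregion' into (continent, subregion)."""
    head, sep, tail = region_name.partition(",")
    # Labels without a comma keep the whole string as the first part.
    return head.strip(), (tail.strip() if sep else None)


print(split_region("Asia, South"))
```

The derived first part should be compared against the Continent column before it is trusted, since the two groupings may not agree for every region.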

ISO3

categorical foreign_key

This column holds ISO3 country codes across 238 distinct values with no nulls, consistent with a country dimension key. India (IND) dominates at 13.8% (2262 rows), followed by PNG, IDN, PAK and CHN, indicating a heavy Asia/tropical skew rather than uniform global coverage. Entropy ratio of 0.79 confirms moderate concentration on a few countries. Treatment: Use as a country join key; consider grouping long-tail codes or stratifying analyses to handle the IND-heavy skew. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 238
top_value: IND
top_rate: 0.1381
cardinality: 238
entropy: 6.225
entropy_ratio: 0.7885

LocationInCountry

text free_text multilingual null_rate

Free-text descriptions of where a group is located within a country, mixing geographic prose ('Primarily north', 'Madang province.') with longer ethnographic paragraphs up to 994 characters. Nearly half the rows are null (45.12%) and 13.3% are duplicate strings, with stock phrases like 'Widespread.' (103) and 'Scattered.' recurring. Although 2,618 entries register as English, small pockets of Spanish (12), Portuguese (10) and nine other languages appear, and Flesch readability averages a difficult 38.1. Treatment: Normalize boilerplate phrases and tokenize/embed for semantic use; do not treat as a categorical. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 7,392 (45.1%)
unique: 7,794
len_min: 3
len_max: 994
len_mean: 108.2
len_median: 79
len_p95: 314
word_mean: 15.48
word_median: 11
n_empty: 0
n_duplicates: 1,196
duplicate_rate: 0.133
vocab_size: 25,733
readability_flesch_mean: 38.13
emoji_rate: 0
url_rate: 0
one_word_rate: 0.02725
allcaps_rate: 0
boilerplate_rate: 0
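
A minimal sketch of the boilerplate normalization: collapse whitespace and flag the stock phrases so they can be handled apart from real descriptions. The phrase list holds only the two phrases the report names and would need extending after inspecting the data; `normalize_location` is a hypothetical helper.

```python
import re

# Stock phrases named in the report, lower-cased without the final period.
STOCK_PHRASES = {"widespread", "scattered"}


def normalize_location(text):
    """Return (cleaned_text, is_stock_phrase); None input stays None."""
    if text is None:
        return None, False
    cleaned = re.sub(r"\s+", " ", text).strip()
    key = cleaned.rstrip(".").lower()
    return cleaned, key in STOCK_PHRASES


print(normalize_location("Widespread."))
print(normalize_location("  Madang   province. "))
```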

PeopleID1

numeric feature

PeopleID1 is stored as numeric but takes only 16 distinct integer values across 16,382 rows, ranging from 10 to 26 with a median of 20. The bounded range, low cardinality, and left skew (-0.79) suggest this is a small categorical or grouping code rather than a true continuous measurement, despite the 'ID' name implying a key. Its cardinality matches ROP1 and AffinityBloc (16 each), so it is probably the numeric code for the affinity bloc. Treatment: Cast to categorical and one-hot encode rather than treating as continuous; keep only one of the three columns if they prove redundant. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 16
min: 10
max: 26
mean: 18.58
median: 20
std: 3.921
q1: 16
q3: 21
iqr: 5
skew: -0.7942
kurtosis: -0.5112
n_outliers: 0
outlier_rate: 0
zero_rate: 0

ROP1

categorical feature

ROP1 is a low-cardinality categorical code with 16 distinct values (all prefixed 'A0##'), fully populated across 16382 rows. The distribution is moderately concentrated: 'A012' alone covers 25.5% and 'A013' adds another ~19%, while entropy ratio is 0.83 indicating reasonably even spread among the rest. ROP1 matches AffinityBloc exactly on cardinality (16), top_rate (0.255), and entropy (3.322), so it is most likely the code for the affinity bloc ('A012' would then be 'South Asian Peoples') rather than an independent attribute. Treatment: one-hot or target-encode for modelling, using either ROP1 or AffinityBloc but not both. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 16
top_value: A012
top_rate: 0.255
cardinality: 16
entropy: 3.322
entropy_ratio: 0.8306

AffinityBloc

categorical feature

AffinityBloc is a categorical grouping of populations into 16 broad ethno-geographic blocs, with no nulls across 16,382 rows. The distribution is moderately concentrated: "South Asian Peoples" leads at 25.5% (4,178 rows), followed by "Sub-Saharan Peoples" (3,073), while entropy ratio of 0.83 indicates the remaining 14 categories carry meaningful mass. Labels mix regional and ethnolinguistic framings (e.g., "Arab World" alongside "Tibetan-Himalayan Peoples"), which an analyst should note for taxonomy consistency. Treatment: One-hot or target-encode for modelling; audit label taxonomy for overlap before grouping. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 16
top_value: South Asian Peoples
top_rate: 0.255
cardinality: 16
entropy: 3.322
entropy_ratio: 0.8306

PeopleID2

numeric foreign_key

PeopleID2 is a numeric people-identifier code with only 267 distinct values across 16,382 rows, so each id repeats heavily and behaves more like a foreign key than a measurement. Values span 100 to 479 with a fairly flat distribution (kurtosis -1.13, skew 0.25) and no nulls or outliers, consistent with a bounded code rather than a quantity to model. Treatment: Treat as a categorical key and left-join on it rather than using as a numeric feature. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 267
min: 100
max: 479
mean: 283.7
median: 268
std: 114.6
q1: 183
q3: 402
iqr: 219
skew: 0.2451
kurtosis: -1.126
n_outliers: 0
outlier_rate: 0
zero_rate: 0

ROP2

categorical feature

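ROP2 is a categorical code field with 214 distinct alphanumeric values (e.g., 'A012', 'C0152') across 16,382 rows and no nulls. By name and position it sits between ROP1 (affinity bloc) and PeopleCluster, so it is most likely a people-cluster code, not a routing or product code. The distribution is heavy-tailed: 'A012' alone covers 25.4% of rows and the next code 'C0152' another ~7%, in line with PeopleCluster's top value 'New Guinea' at 6.95%, while entropy ratio sits at 0.76. 'A012' is also ROP1's top value at almost the same share (25.5%), and ROP2 has fewer distinct values (214) than PeopleCluster and PeopleID2 (267 each), which suggests that 'A'-prefixed bloc codes stand in for the cluster code on some rows. Treatment: Verify the 'A'-prefixed rows against PeopleID2 and PeopleCluster first; then group rare codes into an 'other' bucket and target/one-hot encode the high-frequency levels. high · anthropic:claude-opus-4-7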

n: 16,382
nulls: 0 (0.0%)
unique: 214
top_value: A012
top_rate: 0.2545
cardinality: 214
entropy: 5.901
entropy_ratio: 0.7622

PeopleCluster

categorical feature

PeopleCluster is a categorical ethnographic grouping with 267 distinct values across 16,382 rows and no nulls. The distribution is broad (entropy ratio 0.86) but with a notable concentration: 'New Guinea' accounts for 6.95% of rows (1,139), followed by 'South Asia Hindu - other' (935) and 'South Asia Muslim - other' (592). The labels mix geographic, religious, and tribal descriptors, so several '... - other' buckets are doing heavy lifting. Treatment: Group rare clusters and target- or frequency-encode before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 267
top_value: New Guinea
top_rate: 0.06953
cardinality: 267
entropy: 6.929
entropy_ratio: 0.8596
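
Grouping rare levels and frequency-encoding, as suggested for this and the other high-cardinality columns, can be sketched as below. The 1% threshold and the 'other' label are assumptions to tune, and both function names are hypothetical.

```python
from collections import Counter


def lump_rare(values, min_share=0.01, other_label="other"):
    """Replace levels whose share of rows is below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    keep = {v for v, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other_label for v in values]


def frequency_encode(values):
    """Encode each value as its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


sample = ["a"] * 98 + ["b", "c"]
print(Counter(lump_rare(sample, min_share=0.05)))
```

Lumping before encoding keeps the many small clusters from each receiving a near-identical, noisy frequency value.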

PeopNameAcrossCountries

text foreign_key one_word duplicates

This column holds people-group or ethnicity names repeated across countries, with 10,377 unique labels over 16,382 rows and a 36.7% duplicate rate (6,005 repeats). Entries are short (median 8 chars, mean 1.57 words) and 59% are single-word labels like 'Deaf' (164), 'French' (82), or 'British' (81). The frequent fragments '(hindu' (516) and '(muslim' (424) alongside 'traditions)' (1038) suggest religious-tradition qualifiers in parentheses are a common naming convention, and the same group name recurs because it appears in multiple country contexts. Treatment: Treat as a people-group key; normalize casing/parentheticals and join with country to form a unique grouping key. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 10,377
len_min: 1
len_max: 42
len_mean: 10.93
len_median: 8
len_p95: 25
word_mean: 1.568
word_median: 1
n_empty: 0
n_duplicates: 6,005
duplicate_rate: 0.3666
vocab_size: 10,446
readability_flesch_mean: 49.45
emoji_rate: 0
url_rate: 0
one_word_rate: 0.5899
allcaps_rate: 0
boilerplate_rate: 0

Population

numeric feature high_skew outliers

This is a Population count column with 16,382 rows and only 1,708 unique values, suggesting many shared or rounded figures. The distribution is extremely heavy-tailed: median is 20,000 but the max is 912,955,000, with skew 91.04 and kurtosis 10,050.74, and 15.0% of rows flag as outliers. The mean (499,468) sits far above Q3 (93,000), indicating a small number of very large entities dominate. Treatment: Log-transform before any modelling or distance-based analysis. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 25 (0.2%)
unique: 1,708
min: 10
max: 9.13e+08
mean: 4.995e+05
median: 20,000
std: 8.066e+06
q1: 4,300
q3: 93,000
iqr: 88,700
skew: 91.04
kurtosis: 1.005e+04
n_outliers: 2,455
outlier_rate: 0.1501
zero_rate: 0
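
To see why the log transform matters here, the sketch below bins values by power of ten instead of by equal width; the equal-width histogram in the report puts 16,302 of the non-null rows in its first bin. The function names are hypothetical, and the sample values are the min, quartiles, median, and max from the stats above plus one missing value.

```python
import math
from collections import Counter


def log10_population(value):
    """Base-10 log of a population; None for missing or non-positive values."""
    if value is None or value <= 0:
        return None
    return math.log10(value)


def decade_histogram(values):
    """Count values per power-of-ten bin, keyed by the bin's lower exponent."""
    bins = Counter()
    for v in values:
        lv = log10_population(v)
        if lv is not None:
            bins[math.floor(lv)] += 1
    return dict(sorted(bins.items()))


sample = [10, 4300, 20000, 93000, 912955000, None]
print(decade_histogram(sample))
```

Since the minimum is 10 and zero_rate is 0, a plain log10 works without an offset; the 25 nulls still need separate handling.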

ROL3

text feature one_word short_text duplicates

ROL3 holds three-letter ISO 639-3 language codes (every value is exactly 3 characters and one word), with hin, eng, and ben dominating. Cardinality is high, with 6,164 distinct codes across 16,382 rows and a 62.4% duplicate rate, plus 176 'xxx' entries that likely flag undetermined or missing language (PrimaryLanguageName has the same number of 'Language unknown' rows). Treatment: Treat as a categorical language code; one-hot or target-encode top codes, bucket the long tail as 'other', and keep 'xxx' as its own 'unknown' level rather than folding it into 'other'. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 6,164
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 10,218
duplicate_rate: 0.6237
vocab_size: 6,163
readability_flesch_mean: 117.4
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.0001221
boilerplate_rate: 0

PrimaryLanguageName

text feature one_word short_text duplicates

Holds the primary language name for each record, predominantly single-token entries (one_word_rate 0.73, word_mean 1.32) with Hindi (682), English (424) and Bengali (366) leading. A duplicate_rate of 0.62 is expected for a categorical language label, and n_unique 6153 sits almost exactly on ROL3's 6,164 codes, so names map close to one-to-one onto language codes. The trailing commas in top_words ('arabic,', 'punjabi,') most likely come from inverted names of the 'Arabic, Standard' kind seen in OfficialLang, not from lists of several languages, and lengths up to 45 chars are consistent with long single names. 176 rows are explicitly 'Language unknown', matching the 176 'xxx' codes in ROL3. Treatment: Keep each value as a single label and do not split on commas; map 'Language unknown' to a missing indicator and encode via ROL3 where a stable code is needed. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 6,153
len_min: 1
len_max: 45
len_mean: 9.081
len_median: 7
len_p95: 19
word_mean: 1.322
word_median: 1
n_empty: 0
n_duplicates: 10,229
duplicate_rate: 0.6244
vocab_size: 6,251
readability_flesch_mean: 42.67
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7306
allcaps_rate: 0
boilerplate_rate: 0

PrimaryLanguageDialect

categorical metadata long_tail null_rate

This column records a primary language dialect per record, with 980 distinct values across 16,382 rows. It is 92.3% null, and even the most common value, 'Brazilian Portuguese', accounts for just 3.25% of non-nulls (41 occurrences); entropy ratio of 0.967 confirms an extremely flat, long-tailed distribution spanning dialects like Assyrian, Punjabi, Sinhalese, and Ta'izzi. The combination of sparse coverage and high cardinality limits its standalone modelling value. Treatment: Group into language families or a coarse bucket plus 'missing' indicator before encoding. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 15,121 (92.3%)
unique: 980
top_value: Brazilian Portuguese
top_rate: 0.03251
cardinality: 980
entropy: 9.613
entropy_ratio: 0.9674

NumberLanguagesSpoken

numeric feature high_skew outliers

Count of languages spoken, with 16382 non-null integer values ranging from 1 to 145 and a median of 1. The distribution is severely right-skewed (skew 7.44, kurtosis 83.76): Q1 is 1 and Q3 is 2, yet 2410 rows (14.7%) flag as outliers. Each row is a people group, not a person, so large counts are possible for big or widely dispersed groups, but the max of 145 is still extreme against a mean of 2.76 and worth checking. Treatment: Cap or log-transform before modelling, and investigate the 145 maximum for data-entry errors. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 78
min: 1
max: 145
mean: 2.764
median: 1
std: 5.985
q1: 1
q3: 2
iqr: 1
skew: 7.437
kurtosis: 83.76
n_outliers: 2,410
outlier_rate: 0.1471
zero_rate: 0
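
The report does not say which outlier rule the profiler uses. Assuming the common Tukey rule (1.5 times the IQR beyond the quartiles), the fences follow directly from the stats above, and the sketch shows capping at the upper fence; both function names are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_upper(value, upper):
    """Cap a value at the upper fence (one simple form of winsorizing)."""
    return min(value, upper)


lower, upper = tukey_fences(1, 2)  # NumberLanguagesSpoken quartiles
print(lower, upper)
print(cap_upper(145, upper))
```

Under this assumption any group with four or more languages counts as an outlier, which would explain a 14.7% outlier rate on a column whose IQR is only 1. For such a narrow IQR, capping at a chosen percentile may be more useful than the fence itself.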

OfficialLang

categorical feature

OfficialLang is a categorical column listing the official language of each record, with 87 distinct values across 16,382 rows and almost no nulls (0.05%). English dominates at 22.4% (3,672), followed by Hindi (2,262) and French (1,478), giving a moderately concentrated distribution (entropy ratio 0.68). The presence of compound labels like 'Arabic, Standard' and 'Chinese, Mandarin' suggests a specific naming convention worth preserving when joining to language references. Treatment: Group long tail and one-hot or target-encode the top categories before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 8 (0.0%)
unique: 87
top_value: English
top_rate: 0.2243
cardinality: 87
entropy: 4.368
entropy_ratio: 0.6779

SpeakNationalLang

unknown other skipped

Column 'SpeakNationalLang' was skipped by the profiler, so no type inference, uniqueness count, or value statistics are available. The only confirmed signals are 16382 rows with a 0.0 null rate. The name suggests a flag or category indicating whether the people group speaks the national language, but this cannot be verified from the evidence. Treatment: Re-profile or manually inspect to determine type before any downstream use. low · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: —

BibleStatus

numeric feature

BibleStatus is an integer-coded categorical with only 6 distinct values spanning 0 to 5 across 16,382 complete rows. The distribution is heavily left-skewed (skew -1.22) with a mean of 3.86 and median of 4, indicating most records cluster at the high end while about 4.8% sit at zero. Despite being stored as numeric, the small cardinality and bounded range suggest an ordinal status code rather than a true measurement. Treatment: Treat as an ordinal category (one-hot or ordered encode) rather than a continuous numeric. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 6
min: 0
max: 5
mean: 3.862
median: 4
std: 1.429
q1: 3
q3: 5
iqr: 2
skew: -1.219
kurtosis: 0.5856
n_outliers: 0
outlier_rate: 0
zero_rate: 0.04786

BibleYear

categorical metadata null_rate

BibleYear appears to encode a translation's publication or revision span, typically formatted as a start-end year range like "1818-2022", with single years (e.g. "1954") appearing as a minority pattern. Cardinality is high (466 distinct values across 16382 rows) and the most common range covers only 8.75% of records, giving a flat distribution (entropy ratio 0.776). Notably, 52.45% of values are null, which the alert flags and which will limit any direct use. Treatment: Parse into separate start_year and end_year integer features and add a missingness indicator before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 8,592 (52.4%)
unique: 466
top_value: 1818-2022
top_rate: 0.08755
cardinality: 466
entropy: 6.883
entropy_ratio: 0.7765

NTYear

text metadata one_word allcaps null_rate short_text duplicates

NTYear appears to be a year-range metadata field (e.g. '1811-1998', '1380-2011') stored as short single-token strings, with 1072 unique values across 16382 rows. The column is messy: 30.47% null, 90.59% duplicate rate, and a sentinel value 'Yes' shows up 670 times alongside the date ranges, indicating mixed semantics. Lengths cluster tightly (median 9, max 9), consistent with a 'YYYY-YYYY' format for most non-sentinel entries. Treatment: Parse into start_year/end_year integer columns, isolate the 'Yes' sentinel into a separate flag, and impute or drop the 30% nulls. medium · anthropic:claude-opus-4-7

n: 16,382
nulls: 4,991 (30.5%)
unique: 1,072
len_min: 3
len_max: 9
len_mean: 7.794
len_median: 9
len_p95: 9
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 10,319
duplicate_rate: 0.9059
vocab_size: 1,072
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.9412
boilerplate_rate: 0

PortionsYear

text feature one_word allcaps short_text duplicates

PortionsYear appears to be a single-token field that mostly encodes year ranges (e.g. '1806-1962', '1530-1995') with strings up to 9 characters, but it is contaminated by a large 'Yes' bucket (1520 rows) that breaks the type. Nulls run at 17.92% and duplicate_rate is 0.87 across 1737 unique values out of 16382, so the column is highly repetitive. The mix of a boolean-like 'Yes' with hyphenated year spans suggests two different concepts were merged into one column. Treatment: Split into two fields: parse year ranges into start/end integers and isolate the 'Yes' values into a separate boolean before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 2,936 (17.9%)
unique: 1,737
len_min: 3
len_max: 9
len_mean: 7.595
len_median: 9
len_p95: 9
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 11,709
duplicate_rate: 0.8708
vocab_size: 1,737
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.887
boilerplate_rate: 0
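
The split recommended for the year fields (BibleYear, NTYear, PortionsYear) can be sketched as one parser that returns start and end years plus separate flags for the 'Yes' sentinel and for missing values. It assumes the two formats the report describes, 'YYYY-YYYY' and a single 'YYYY'; anything else is left unparsed rather than guessed. The function name and output keys are hypothetical.

```python
import re

# Either a single year or a start-end range of four-digit years.
YEAR_RANGE = re.compile(r"(\d{4})(?:-(\d{4}))?")


def parse_year_field(raw):
    """Parse one raw value into start/end years and two boolean flags."""
    result = {"start_year": None, "end_year": None,
              "flag_yes": False, "missing": False}
    if raw is None or not raw.strip():
        result["missing"] = True
        return result
    text = raw.strip()
    if text.lower() == "yes":
        result["flag_yes"] = True
        return result
    match = YEAR_RANGE.fullmatch(text)
    if match:
        result["start_year"] = int(match.group(1))
        # A single year is treated as a range of length zero.
        result["end_year"] = int(match.group(2) or match.group(1))
    return result


print(parse_year_field("1806-1962"))
print(parse_year_field("Yes"))
```

Keeping 'Yes' and missing as distinct flags avoids imputing a year where the source only records that a translation exists, or records nothing.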

TranslationNeedQuestionable

unknown other skipped

Column 'TranslationNeedQuestionable' was skipped by the profiler, so its kind, cardinality and value distribution are unknown. The only confirmed signals are 16382 rows with a 0.0 null rate. The name suggests a boolean or flag indicating uncertainty about translation need, but this cannot be verified from the evidence. Treatment: Re-profile or inspect raw values before deciding on use; do not model until kind is resolved. low · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: —

JPScale

numeric feature

JPScale is an integer-valued ordinal feature spanning 1 to 5 with only 5 unique values across 16,382 rows and no nulls. The dataset summary identifies it as the Joshua Project 'reachedness' rating, so it is a coded scale rather than a continuous measurement. Skew is near zero (0.19) with mean 2.68 and median 3, but the kurtosis of -1.66 is below the -1.3 of a uniform five-point scale, which points to mass concentrated toward the extremes rather than a flat spread. No outliers and no zeros are present. Treatment: Treat as an ordinal categorical (1-5) rather than continuous; one-hot or keep as ordered integer. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 5
min: 1
max: 5
mean: 2.681
median: 3
std: 1.644
q1: 1
q3: 4
iqr: 3
skew: 0.1937
kurtosis: -1.658
n_outliers: 0
outlier_rate: 0
zero_rate: 0

JPScalePC

categorical feature

JPScalePC is a 5-level ordinal code (values "1" through "5") stored as strings, with no nulls across 16,382 rows; the name suggests the JPScale rating at people-cluster (PC) level. The distribution is bimodal at the extremes: "5" leads at 33.8% and "1" follows closely, while the middle codes "2" and "3" together account for far less, so groups concentrate at both ends of the scale rather than around the middle. Entropy ratio of 0.86 confirms the spread is wide but not uniform. Treatment: Treat as ordinal (1-5); cast to integer or one-hot encode depending on the model. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 5
top_value: 5
top_rate: 0.3381
cardinality: 5
entropy: 1.997
entropy_ratio: 0.8602

JPScalePGAC

categorical feature

JPScalePGAC is a low-cardinality categorical with 5 distinct string-encoded levels ('1' through '5') across 16,382 rows and no nulls, consistent with an ordinal scale; the name points to the JPScale rating at PGAC level (in Joshua Project usage, most likely 'People Group Across Countries'). The distribution is uneven: '1' dominates at 43.3% while '2' is the rarest at 908 rows, yet the entropy ratio is high at 0.86, indicating the remaining mass is spread broadly. The frequency order is non-monotonic (1 > 4 > 5 > 3 > 2), which fits a scale with mass at both ends rather than one that tapers. Treatment: Treat as ordinal: cast to integer and preserve order, or one-hot encode if the downstream model is non-ordinal. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 5
top_value: 1
top_rate: 0.4329
cardinality: 5
entropy: 1.998
entropy_ratio: 0.8603

LeastReached

categorical feature

Binary Y/N flag named LeastReached, fully populated across 16382 rows with only 2 distinct values. The split is fairly balanced — 'N' leads at 56.5% (9258) versus 7124 'Y' — yielding near-maximal entropy ratio of 0.988. No nulls or anomalies present. Treatment: Encode as a 0/1 boolean for modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: N
top_rate: 0.5651
cardinality: 2
entropy: 0.9877
entropy_ratio: 0.9877
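
The entropy figures in these categorical blocks can be reproduced from the counts quoted in the descriptions. The sketch below assumes the profiler reports Shannon entropy in bits and normalises by log2 of the cardinality; that reading is inferred from the reported numbers, not stated by the report, and the function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits, of a list of category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    if k < 2:
        return 0.0  # a single category carries no information
    return entropy_bits(counts) / math.log2(k)


# Counts from the LeastReached description: 9,258 'N' and 7,124 'Y'.
least_reached = [9258, 7124]
print(entropy_ratio(least_reached))
```

For a two-level column the ratio equals the entropy itself because log2(2) = 1, which is why LeastReached reports the same value on both lines. The printed value agrees with the reported 0.9877 to about three decimal places.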

LeastReachedPC

categorical feature

A binary Y/N flag named LeastReachedPC, presumably the LeastReached status evaluated at people-cluster (PC) level. The split is moderately imbalanced at 67.3% 'N' versus 32.7% 'Y', with no nulls across 16,382 rows; the entropy ratio of 0.91 shows both classes are well represented. Treatment: Encode as a 0/1 indicator for modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: N
top_rate: 0.6732
cardinality: 2
entropy: 0.9116
entropy_ratio: 0.9116

LeastReachedPGAC

categorical feature

Binary Y/N flag, presumably the LeastReached status evaluated at PGAC level, with no missing values across 16,382 rows. The split is fairly balanced: 'N' leads at 56.7% (9,291) versus 7,091 'Y', giving a near-maximal entropy ratio of 0.987. Treatment: Encode as a 0/1 indicator for modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: N
top_rate: 0.5671
cardinality: 2
entropy: 0.987
entropy_ratio: 0.987

GSEC

categorical feature

GSEC is a low-cardinality categorical with 8 distinct values across 16,382 rows and no nulls. The dominant value is the empty string at 40.0% (6,553 rows), followed by '1' at 4,852; the remaining codes ('0','2','3','4','5','6') split the rest, suggesting a coded classification where blanks likely encode 'not applicable' or missing-as-empty. Entropy ratio of 0.732 indicates moderate spread despite the empty-string plurality. Treatment: Recode the empty string as an explicit missing category and one-hot encode the remaining codes. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 8
top_value: "" (empty string)
top_rate: 0.4
cardinality: 8
entropy: 2.197
entropy_ratio: 0.7325
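
A minimal sketch of the GSEC treatment, assuming the seven non-empty codes named in the description are the full set. The label "missing" and the function names are illustrative choices, not part of the dataset.

```python
# Non-empty codes listed in the GSEC description.
GSEC_CODES = ["0", "1", "2", "3", "4", "5", "6"]
MISSING = "missing"  # hypothetical label for blank or null values


def recode_gsec(value):
    """Map None and blank strings to an explicit missing level."""
    if value is None or value.strip() == "":
        return MISSING
    return value.strip()


def one_hot_gsec(value):
    """Return a dict of 0/1 indicators, one per level.

    A code outside GSEC_CODES yields all zeros, which makes
    unexpected values easy to spot downstream.
    """
    label = recode_gsec(value)
    levels = GSEC_CODES + [MISSING]
    return {f"GSEC_{level}": int(label == level) for level in levels}


print(one_hot_gsec(""))
print(one_hot_gsec("1"))
```

Keeping the blank as its own level preserves the 40% of rows that would otherwise collapse into an all-zero vector.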

HasAudioRecordings

categorical feature

Binary Y/N flag indicating whether a record has associated audio recordings, fully populated across 16382 rows. The distribution is imbalanced: 'Y' covers 82.3% (13479) versus 2903 'N', with entropy ratio 0.67. Treatment: Encode as a 0/1 boolean indicator for modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.8228
cardinality: 2
entropy: 0.6739
entropy_ratio: 0.6739

NTOnline

categorical feature null_rate imbalance

NTOnline is a categorical flag with only one observed value, 'Y', across all 11,705 non-null rows. The remaining 28.55% of rows are null, so the column is constant where present and its only variation is presence versus absence. Treatment: If null means 'not online', recode to a 0/1 indicator (Y = 1, null = 0); if the nulls are genuinely unknown, drop the column as zero-variance. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 4,677 (28.5%)
unique: 1
top_value: Y
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

RLG3

numeric feature

RLG3 is a small-integer code ranging from 1 to 9 with only 8 distinct values across 16,382 rows and no nulls. The distribution is broad and flat (kurtosis -1.37, skew 0.13, q1 = 1, q3 = 6) with mean 3.47 and median 4, and no outliers. The 8 unique values across a 1-9 range imply that one integer in that span never occurs, and the cardinality matches the 8 levels of PrimaryReligion, so this is probably a nominal religion code rather than a rating. Treatment: Confirm the mapping against PrimaryReligion; if it holds, treat as a nominal categorical (one-hot) rather than an ordinal or continuous numeric. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 8
min: 1
max: 9
mean: 3.469
median: 4
std: 2.238
q1: 1
q3: 6
iqr: 5
skew: 0.1265
kurtosis: -1.366
n_outliers: 0
outlier_rate: 0
zero_rate: 0

RLG3PC

numeric feature

RLG3PC is an integer-valued numeric column with only 8 distinct values bounded between 1 and 9, no nulls, and no zeros. The flat distribution (kurtosis -1.47, q1 = 1, q3 = 6) and small cardinality suggest this is a code or category rather than a continuous measurement. The mean of 3.21 sits above the median of 2, consistent with the mild positive skew (0.31). Treatment: Treat as a category; one-hot or ordinal-encode rather than scaling as continuous. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 8
min: 1
max: 9
mean: 3.213
median: 2
std: 2.311
q1: 1
q3: 6
iqr: 5
skew: 0.3143
kurtosis: -1.466
n_outliers: 0
outlier_rate: 0
zero_rate: 0

RLG3PGAC

numeric feature

RLG3PGAC holds an integer code on a small 1-9 scale with only 8 distinct values across 16,382 rows and zero nulls. The distribution is broad and flat (kurtosis -1.36, std 2.25, IQR spanning 1 to 6) with near-zero skew, suggesting an ordinal category or rating rather than a true continuous measurement. No outliers and no zeros are present. Treatment: Treat as an ordinal/categorical feature; one-hot or ordinal encode rather than scaling as continuous. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 8
min: 1
max: 9
mean: 3.486
median: 4
std: 2.252
q1: 1
q3: 6
iqr: 5
skew: 0.1259
kurtosis: -1.363
n_outliers: 0
outlier_rate: 0
zero_rate: 0

PrimaryReligion

categorical feature

PrimaryReligion is a low-cardinality categorical label assigning each of 16,382 rows to one of 8 religious traditions, with no nulls. Christianity dominates at 39.4% (6,459 rows), followed by Islam (3,786) and Ethnic Religions (2,651); the long tail includes 189 'Unknown' and 124 'Other / Small' rows. Entropy ratio of 0.74 indicates a moderately balanced distribution rather than a single overwhelming class. Treatment: One-hot or target-encode; consider grouping 'Unknown' and 'Other / Small' if modelling sensitivity to rare classes. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 8
top_value: Christianity
top_rate: 0.3943
cardinality: 8
entropy: 2.231
entropy_ratio: 0.7436

PrimaryReligionPC

categorical feature

Categorical label of the dominant religion of a people-cluster (PC), with 8 distinct values and no nulls across 16,382 rows. Christianity leads at 47.6% (7,795), followed by Islam (3,658) and Hinduism (2,557), while a small 'Unknown' bucket (173) and 'Other / Small' (62) provide explicit catch-alls. Entropy ratio of 0.69 indicates moderate concentration rather than a single dominant class. Treatment: One-hot or target-encode; consider merging 'Unknown' and 'Other / Small' if downstream model is sensitive to rare levels. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 8
top_value: Christianity
top_rate: 0.4758
cardinality: 8
entropy: 2.071
entropy_ratio: 0.6904

PrimaryReligionPGAC

categorical label

Categorical label of the primary religion at PGAC level (in Joshua Project usage, most likely 'People Group Across Countries'), drawn from a fixed taxonomy of 8 values with no nulls across 16,382 rows. Christianity dominates at 39.4% (6,462), followed by Islam (3,766), Ethnic Religions (2,613) and Hinduism (2,348); Buddhism, Non-Religious, Unknown and Other / Small together account for about 7.3% of rows. Entropy ratio of 0.748 indicates a moderately concentrated but not degenerate distribution. Treatment: One-hot or target-encode; consider grouping the small Unknown and Other / Small tail before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 8
top_value: Christianity
top_rate: 0.3945
cardinality: 8
entropy: 2.245
entropy_ratio: 0.7482

RLG4

numeric feature null_rate outliers

RLG4 is a sparse integer-valued column with only 20 distinct values spanning 10 to 41. It is overwhelmingly missing (null_rate 0.9621), so only 621 rows (just under 4%) carry a value; among those the distribution is right-skewed (skew 0.94) with 33 flagged outliers (outlier_rate 0.053), a median of 20, an IQR of 7, and no zeros. Its null count (15,761) and cardinality (20) match ReligionSubdivision exactly, so it is probably the numeric code for that label rather than a score or count, in which case the skew and outlier figures carry no meaning. Treatment: Confirm the mapping to ReligionSubdivision; if it holds, keep one of the two columns and encode it as a nominal category with an explicit missing level. medium · anthropic:claude-opus-4-7

n: 16,382
nulls: 15,761 (96.2%)
unique: 20
min: 10
max: 41
mean: 18.49
median: 20
std: 6.519
q1: 14
q3: 21
iqr: 7
skew: 0.9361
kurtosis: 1.156
n_outliers: 33
outlier_rate: 0.05314
zero_rate: 0

ReligionSubdivision

categorical feature null_rate

A sub-categorisation of religion (e.g. Sunni/Shia branches, Buddhist schools, Judaism, Sikhism), populated only when a finer split applies. It is overwhelmingly null at 96.21%, so just 621 of 16,382 rows carry one of 20 values, with Sunni leading at 29.15% of the populated rows (about 181). Entropy ratio 0.74 indicates the non-null portion is reasonably spread rather than dominated by a single bucket. Treatment: Treat nulls as an explicit 'not applicable' category before one-hot encoding. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 15,761 (96.2%)
unique: 20
top_value: Sunni
top_rate: 0.2915
cardinality: 20
entropy: 3.185
entropy_ratio: 0.7369
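
RLG4 and ReligionSubdivision report the same null count (15,761) and the same cardinality (20), which suggests one is a numeric code for the other. A quick way to test that is to check that every code maps to exactly one label. The sketch below is a hypothetical helper over (code, label) pairs; the sample pairs are invented for illustration, because the real code-to-label mapping is not given in the profile.

```python
from collections import defaultdict


def code_label_mismatches(pairs):
    """Return the codes that map to more than one label.

    `pairs` is an iterable of (code, label) tuples, one per row.
    A (None, None) row is consistent; (None, 'Sunni') is not.
    """
    seen = defaultdict(set)
    for code, label in pairs:
        seen[code].add(label)
    return {code: labels for code, labels in seen.items() if len(labels) > 1}


# Illustrative rows only: the codes below are made up.
consistent = [(10, "Sunni"), (10, "Sunni"), (None, None), (11, "Shia")]
broken = consistent + [(10, "Shia")]

print(code_label_mismatches(consistent))
print(code_label_mismatches(broken))
```

Swapping the tuple order checks the reverse direction (each label maps to one code). The same check applies to RLG3 against PrimaryReligion, which share a cardinality of 8.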

PCIslam

numeric feature outliers

PCIslam is a numeric column bounded between 0 and 100, almost certainly a percentage share of Muslim population (or similar Islam-related composition metric) per record. The distribution is heavily zero-inflated: 63.2% of values are exactly 0 and the median is 0, while the mean is 23.2 and values stretch all the way to 100, producing a right skew of 1.27 and 3,438 flagged outliers (21.1%). Nulls are negligible (0.52%) and 1,117 distinct values suggest reasonably fine-grained measurement rather than a coarse bucket. Treatment: Treat as a zero-inflated proportion: model the zero mass separately or add a presence indicator before scaling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 86 (0.5%)
unique: 1,117
min: 0
max: 100
mean: 23.2
median: 0
std: 39.54
q1: 0
q3: 28
iqr: 28
skew: 1.273
kurtosis: -0.2575
n_outliers: 3,438
outlier_rate: 0.211
zero_rate: 0.6322
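
The zero-inflated treatment recommended for PCIslam and the other religion-share columns can be sketched as a two-feature transform: a presence indicator plus a log1p of the share. The function name is hypothetical, and rejecting values outside 0 to 100 is an assumption based on the reported min and max.

```python
import math


def zero_inflated_features(pct):
    """Return (is_nonzero, log1p_value) for a percentage in [0, 100].

    None (a null share) is passed through as (None, None) so the
    caller can decide how to handle missing rows.
    """
    if pct is None:
        return (None, None)
    if not 0 <= pct <= 100:
        raise ValueError(f"percentage out of range: {pct}")
    return (int(pct > 0), math.log1p(pct))


print(zero_inflated_features(0))
print(zero_inflated_features(28))
```

The indicator carries the 63% zero mass on its own, so the log-scaled value only has to describe the spread among groups where the share is positive.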

PCNonReligious

numeric feature high_skew outliers

PCNonReligious appears to be a percentage feature capturing the share of a population that is non-religious, ranging from 0 to 99. The distribution is dominated by zeros (75.2% of rows) with median, Q1, and Q3 all at 0, yet the mean is 3.42 and skew is 3.65 with kurtosis 15.4, indicating a long right tail. Because the IQR is 0, every non-zero value falls outside the IQR fences, so the 24.8% outlier rate is simply the non-zero share (1 - 0.7522) rather than a sign of bad data: most records report none and a minority report substantial percentages. Treatment: Consider a zero-inflated treatment or log1p transform before modelling given the 75% zeros and heavy right tail. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 66 (0.4%)
unique: 223
min: 0
max: 99
mean: 3.421
median: 0
std: 9.21
q1: 0
q3: 0
iqr: 0
skew: 3.648
kurtosis: 15.43
n_outliers: 4,043
outlier_rate: 0.2478
zero_rate: 0.7522
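
When q1 and q3 are both 0, IQR-based outlier flags stop measuring anything unusual. The sketch below assumes the profiler uses the common 1.5×IQR fences (the report does not say so) and checks that assumption against the zero and outlier rates reported for four of the religion-share columns.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under the k * IQR rule."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


# With q1 = q3 = 0 both fences sit at 0, so any non-zero value is flagged.
low, high = iqr_fences(0.0, 0.0)

# (zero_rate, outlier_rate) as reported in this profile.
reported = {
    "PCNonReligious": (0.7522, 0.2478),
    "PCUnknown": (0.9558, 0.04417),
    "PCBuddhism": (0.8895, 0.1105),
    "PCHinduism": (0.8343, 0.1657),
}

for name, (zero_rate, outlier_rate) in reported.items():
    # If every non-zero row is an outlier, the two rates sum to 1.
    print(name, round(zero_rate + outlier_rate, 3))
```

Each pair sums to 1 within rounding, which is consistent with the outlier counts for these columns being the non-zero rows and nothing more.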

PCUnknown

numeric feature high_skew

PCUnknown is a numeric feature bounded between 0 and 100, most likely the percentage of a group's population whose religion is unknown. It is overwhelmingly zero (zero_rate 0.9558) with median, q1, and q3 all at 0, yet the max reaches 100 with skew 9.07 and kurtosis 81.5. The 719 flagged outliers (4.4%) are the non-zero rows, and their 582 distinct non-zero values form a long, heavy tail rather than a smooth distribution. Treatment: Binarize (zero vs non-zero) or log1p-transform before modelling given the 95.6% zero mass and extreme skew. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 104 (0.6%)
unique: 583
min: 0
max: 100
mean: 1.201
median: 0
std: 10.34
q1: 0
q3: 0
iqr: 0
skew: 9.066
kurtosis: 81.52
n_outliers: 719
outlier_rate: 0.04417
zero_rate: 0.9558

SecurityLevel

numeric feature

SecurityLevel takes only 3 distinct integer values spanning 0 to 2 with no nulls, so it is effectively an ordinal category encoded numerically (e.g., low/medium/high). Zeros make up 38.8% of rows; combined with the mean of 1.10, that implies roughly 12% of rows at level 1 and 49% at level 2. The levels are therefore not balanced: the middle level is rare and the mass sits at the two ends, which is also what the kurtosis of -1.82 reflects. Treatment: Treat as an ordinal categorical and one-hot or ordinal-encode before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 3
min: 0
max: 2
mean: 1.099
median: 1
std: 0.9307
q1: 0
q3: 2
iqr: 2
skew: -0.1985
kurtosis: -1.816
n_outliers: 0
outlier_rate: 0
zero_rate: 0.3883

LRTop100

categorical label imbalance

Binary Y/N flag indicating membership in some 'LR Top 100' set, with only 100 positive cases out of 16,382 rows (top_rate 0.9939). Extreme class imbalance and very low entropy (0.0537) make this nearly constant. No nulls, exactly 2 categories as expected. Treatment: Use stratified sampling or class-weighting if modelling; otherwise treat as rare-event indicator. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: N
top_rate: 0.9939
cardinality: 2
entropy: 0.05368
entropy_ratio: 0.05368
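
For the class-weighting option mentioned for LRTop100, one common heuristic is inverse-frequency weighting, weight = n_samples / (n_classes * class_count). The sketch below applies it to the counts implied by the description (100 'Y' rows out of 16,382); the function name is hypothetical, and the heuristic is one choice among several rather than something the report prescribes.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# 100 'Y' rows per the description; the other 16,282 rows are 'N'.
weights = balanced_class_weights({"N": 16282, "Y": 100})
print(weights)
```

Under this scheme each 'Y' row counts roughly 160 times as much as an 'N' row, so both classes contribute equally to a weighted loss.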

PhotoAddress

text foreign_key one_word short_text duplicates

PhotoAddress holds single-token image filenames in the pattern p#####.jpg, with a max length of 13 and exactly one word per row. 5,718 of 16,382 rows (~35%) are empty strings rather than nulls, and overall duplicate rate is 67.8% — the same photo file is reused across many records (e.g., p19007.jpg appears 92 times). With only 5,277 unique values, this behaves like a foreign-key reference to an image asset, not a per-row unique pointer. Treatment: Treat empty strings as missing and join to an image/asset table on this filename rather than modelling it as text. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 1 (0.0%)
unique: 5,277
len_min: 0
len_max: 13
len_mean: 6.523
len_median: 10
len_p95: 10
word_mean: 1
word_median: 1
n_empty: 5,718
n_duplicates: 11,104
duplicate_rate: 0.6779
vocab_size: 5,276
readability_flesch_mean: 82.43
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

PhotoCredits

text metadata one_word duplicates

Attribution string for a photo credit, mostly short names or sources (mean length 11.5 chars, median 1 word). Highly repetitive: 90.2% duplicate rate across only 1,605 unique values, with 5,718 empty entries and 3,065 'Anonymous' tags dominating. Top words reveal stock/CC sources like Flickr, Wikimedia, Pixabay, and Shutterstock alongside named contributors. Treatment: Treat as low-cardinality categorical attribution; normalize empties/'Anonymous' and group rare credits before any analysis. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 10 (0.1%)
unique: 1,605
len_min: 0
len_max: 56
len_mean: 11.54
len_median: 9
len_p95: 30
word_mean: 2.081
word_median: 1
n_empty: 5,718
n_duplicates: 14,767
duplicate_rate: 0.902
vocab_size: 2,658
readability_flesch_mean: -13.88
emoji_rate: 0
url_rate: 0.0004276
one_word_rate: 0.5754
allcaps_rate: 0.001649
boilerplate_rate: 0

PhotoCreditURL

text metadata one_word url_heavy null_rate duplicates

This column stores photo credit URLs, with every non-empty value being a single token (one_word_rate 1.0) and 47.5% matching a URL pattern. It is sparsely populated: 33.08% null plus 5,718 empty strings, so about 68% of rows carry no URL, and 86.9% of non-null values are duplicates; a single domain, https://www.asiaharvest.org, accounts for 736 rows. Only 1,434 unique values serve 16,382 rows, suggesting a small set of recurring image sources rather than per-record attribution. Treatment: Extract the domain as a categorical feature and drop the raw URL, which should not be used as a modelling input itself. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 5,419 (33.1%)
unique: 1,434
len_min: 0
len_max: 240
len_mean: 25.67
len_median: 0
len_p95: 73
word_mean: 1
word_median: 1
n_empty: 5,718
n_duplicates: 9,529
duplicate_rate: 0.8692
vocab_size: 1,433
readability_flesch_mean: -261.7
emoji_rate: 0
url_rate: 0.4753
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
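
Domain extraction for PhotoCreditURL can be done with the standard library. This is a minimal sketch with a hypothetical function name; it returns None for nulls, empty strings, and values that do not parse as a URL with a host, which covers the non-URL portion of the column.

```python
from urllib.parse import urlparse


def credit_domain(value):
    """Return the lower-cased host of a credit URL, or None if absent."""
    if value is None:
        return None
    text = value.strip()
    if not text:
        return None
    try:
        # hostname is None when the value has no scheme://host part.
        return urlparse(text).hostname
    except ValueError:
        # Malformed inputs (e.g. a broken bracketed host) are treated as missing.
        return None


print(credit_domain("https://www.asiaharvest.org"))
print(credit_domain(""))
```

Grouping on the returned host collapses the 1,434 distinct values into a much smaller set of sources, which is what makes the column usable as a category.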

PhotoCreativeCommons

categorical feature

A binary Y/N flag indicating whether a photo carries a Creative Commons licence. The column is heavily skewed: 'N' covers 83.6% of the 16382 rows while 'Y' accounts for 2691, with a near-zero null rate of 0.0003. Treatment: Encode as a 0/1 boolean; expect class imbalance if used as a target. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 5 (0.0%)
unique: 2
top_value: N
top_rate: 0.8357
cardinality: 2
entropy: 0.6445
entropy_ratio: 0.6445

PhotoCopyright

categorical feature

Binary Y/N flag indicating whether a photo copyright applies, with 'N' dominating at 87.95% of 16,382 rows versus 1,972 'Y' values. Class imbalance is notable but not extreme, and nulls are negligible at 0.09%. Entropy ratio of 0.53 reflects this skew toward 'N'. Treatment: Encode as a 0/1 boolean; be aware of the ~1:7 class imbalance if used as a target. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 15 (0.1%)
unique: 2
top_value: N
top_rate: 0.8795
cardinality: 2
entropy: 0.5308
entropy_ratio: 0.5308

PhotoPermission

categorical feature

A photo-permission flag encoded as Y/N, with 87.1% of non-null values set to 'N' and only 0.1% null. Cardinality is 3 because two records use lowercase 'y' alongside 2,111 uppercase 'Y', a casing inconsistency worth normalising. Entropy ratio of 0.35 confirms the heavy skew toward 'N'. Treatment: Uppercase-normalise then map to a boolean before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 17 (0.1%)
unique: 3
top_value: N
top_rate: 0.8709
cardinality: 3
entropy: 0.5564
entropy_ratio: 0.3511
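
The uppercase-then-map step for PhotoPermission applies equally to the other Y/N flags in this file. A minimal sketch follows, with a hypothetical function name; raising on unexpected values is a design choice so that a new variant does not get silently coerced.

```python
def yn_to_bool(value):
    """Map 'Y'/'N' in any casing to True/False; None and blanks stay None."""
    if value is None:
        return None
    text = value.strip().upper()
    if text == "":
        return None
    if text == "Y":
        return True
    if text == "N":
        return False
    raise ValueError(f"unexpected flag value: {value!r}")


print(yn_to_bool("y"), yn_to_bool("Y"), yn_to_bool("N"), yn_to_bool(None))
```

After this mapping the two lowercase 'y' rows fold into the 'Y' class and the column has the expected two levels.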

ProfileTextExists

categorical feature

Binary Y/N flag indicating whether a profile has text, with no nulls across 16382 rows. Roughly 79.5% are 'Y' (13018) versus 'N' (3364), an imbalance worth noting but not extreme. Entropy ratio of 0.73 confirms a moderately skewed but informative distribution. Treatment: Encode as a 0/1 boolean for modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.7947
cardinality: 2
entropy: 0.7325
entropy_ratio: 0.7325

CountOfCountries

numeric feature high_skew outliers

Counts the number of countries associated with each record, ranging from 1 to 164 with a median of 1 and Q3 of just 4. The distribution is severely right-skewed (skew 5.15, kurtosis 32.05) and 19.2% of rows flag as outliers, indicating a long tail: most records cover a single country, while a minority span dozens of countries, up to 164. Treatment: Log-transform or bin (e.g. 1, 2-4, 5+) before modelling to tame the heavy tail. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 48
min: 1
max: 164
mean: 8.328
median: 1
std: 20.64
q1: 1
q3: 4
iqr: 3
skew: 5.152
kurtosis: 32.05
n_outliers: 3,139
outlier_rate: 0.1916
zero_rate: 0
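
Both options named in the CountOfCountries treatment are short to express. The sketch below uses the example bin edges from the description (1, 2-4, 5+); the function names are hypothetical, and the edges are the report's example rather than tuned values.

```python
import math


def country_count_bin(n):
    """Assign a count of countries to one of three ordered bins."""
    if n < 1:
        raise ValueError(f"count must be at least 1, got {n}")
    if n == 1:
        return "1"
    if n <= 4:
        return "2-4"
    return "5+"


def country_count_log(n):
    """Natural log of the count; the minimum value 1 maps to 0.0."""
    return math.log(n)


print(country_count_bin(1), country_count_bin(4), country_count_bin(164))
print(country_count_log(1))
```

Since the minimum is 1 and there are no zeros, a plain log is safe here and no log1p offset is needed.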

CountOfProvinces

unknown other skipped

The column 'CountOfProvinces' was skipped by the profiler, so beyond a row count of 16382 and a null rate of 0.0 there is no evidence about its distribution, type, or uniqueness. The name suggests an integer count of provinces per record, but this cannot be confirmed from the payload. No further signal is available. Treatment: Re-run the profiler on this column to recover type and distribution before any downstream use. low · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: —

Longitude

numeric feature high_skew

This is a Longitude coordinate column, but some values are corrupted: valid longitudes must lie within [-180, 180], yet the max is 2350588.0 and the mean (189.33) already exceeds the legal range. Skew of 127.98 and kurtosis of 16376.52 confirm that a few extreme values dominate the moments. The median of 55.45 and the quartiles (8.67, 94.64) are plausible, so most rows are likely valid. The 207 flagged outliers (1.26%) are not a count of corrupt rows: under the usual 1.5×IQR rule the fences sit near -120 and 224, so valid far-western longitudes are flagged too. Treatment: Count rows outside [-180, 180] directly, then set them to missing or drop them before any geospatial use; clipping to ±180 would only move bad points onto the antimeridian. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 15,889
min: -179.3
max: 2.351e+06
mean: 189.3
median: 55.45
std: 1.836e+04
q1: 8.673
q3: 94.64
iqr: 85.97
skew: 128
kurtosis: 1.638e+04
n_outliers: 207
outlier_rate: 0.01264
zero_rate: 0
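
A range check is a more direct test for corrupt longitudes than the outlier flag. The sketch below uses a hypothetical helper that nulls out impossible values, and it also computes the 1.5×IQR fences from the reported quartiles to show what the outlier flag actually covers. That the profiler uses 1.5×IQR is an assumption.

```python
def clean_longitude(value):
    """Keep longitudes inside [-180, 180]; return None for anything else.

    NaN also fails the range comparison and comes back as None.
    """
    if value is None or not -180.0 <= value <= 180.0:
        return None
    return value


# Quartiles as reported for the Longitude column.
q1, q3 = 8.673, 94.64
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(clean_longitude(2350588.0), clean_longitude(-179.3))
print(round(lower_fence, 2), round(upper_fence, 2))
```

The lower fence falls well inside the legal range, so a valid point such as the reported minimum of -179.3 counts as an IQR outlier even though it survives the range check.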

Latitude

numeric feature

Geographic latitude in decimal degrees, with values spanning -54.94 to 78.21 — well within the valid [-90, 90] range. Distribution is nearly symmetric (skew -0.12) and slightly flat (kurtosis -0.26), centered around a median of 17.03 and mean of 16.44, suggesting a tropical/northern-hemisphere bias. Near-unique (15851 of 16382) with no nulls and only 39 mild outliers, consistent with per-record geocoordinates rather than a categorical region label. Treatment: Pair with longitude for geospatial features; consider binning by hemisphere or clustering rather than using raw degrees in linear models. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 15,851
min: -54.94
max: 78.21
mean: 16.44
median: 17.03
std: 20.47
q1: 2.072
q3: 29.88
iqr: 27.81
skew: -0.118
kurtosis: -0.2579
n_outliers: 39
outlier_rate: 0.002381
zero_rate: 0

Ctry

categorical feature

Country names stored as full strings, with 238 distinct values across 16,382 rows and no nulls. India dominates at 13.8% (2,262 rows), followed by Papua New Guinea (883) and Indonesia (788), so the largest contributors are in South Asia, Melanesia and Southeast Asia. Entropy ratio of 0.79 indicates fairly broad spread despite the long tail. Treatment: Group long-tail countries or target/frequency-encode before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 238
top_value: India
top_rate: 0.1381
cardinality: 238
entropy: 6.225
entropy_ratio: 0.7885

IndigenousCode

categorical feature

IndigenousCode is a binary Y/N flag, fully populated across all 16,382 rows with only 2 distinct values. The class split is uneven: 'Y' covers 74.8% of records against 'N' for the remainder, yielding entropy of 0.81. The imbalance is notable but not extreme. Treatment: Encode as a binary indicator; consider class imbalance if used as a target. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.7483
cardinality: 2
entropy: 0.8139
entropy_ratio: 0.8139

PercentAdherents

text feature one_word allcaps short_text duplicates

This is a numeric percentage field (PercentAdherents) stored as text, with all 16382 values being single tokens of length 5-7 like '0.000' or '95.000'. The distribution is heavily concentrated at zero (4007 of 16382 rows) and shows strong duplication (duplicate_rate 0.924, only 1248 unique values). Despite the 'allcaps' and 'one_word' alerts, these are just numeric strings, not categorical text. Treatment: Cast to float and treat as a numeric feature; consider zero-inflation handling given the spike at 0.000. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 1,248
len_min: 5
len_max: 7
len_mean: 5.534
len_median: 6
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 15,134
duplicate_rate: 0.9238
vocab_size: 1,248
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0
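
The cast recommended for PercentAdherents and the other text-typed percentage columns is a one-liner, but nulls and blanks need handling first. A minimal sketch with a hypothetical function name; the 0 to 100 range check is an assumption based on these being percentages.

```python
def parse_percent(text):
    """Convert a string such as '95.000' to a float; None or blank gives None."""
    if text is None or text.strip() == "":
        return None
    value = float(text)
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"percentage out of range: {text!r}")
    return value


print(parse_percent("95.000"), parse_percent("0.000"), parse_percent(None))
```

Once cast, the column can be handled like the other numeric percentage features in this profile, including the zero-inflation treatment where the '0.000' spike is large.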

PercentChristianPC

categorical feature

Stored as a categorical string, this column appears to be the percentage of Christians at people-cluster (PC) level, with 246 distinct values across 16,382 rows and no nulls. The distribution is highly repetitive: the modal value '90.061' covers 6.95% of rows and the top ten values include both very high shares (90.061, 82.325, 76.515) and near-zero shares (0.482, 0.111, 0.000), suggesting a small set of cluster-level percentages broadcast onto many rows. Entropy ratio of 0.86 indicates the values are fairly evenly spread across the 246 categories despite the heavy mode. Treatment: Cast to float and treat as a numeric feature rather than a category. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 246
top_value: 90.061
top_rate: 0.06953
cardinality: 246
entropy: 6.853
entropy_ratio: 0.8628

NaturalName

text label one_word duplicates

NaturalName appears to be a people-group or ethno-linguistic label, dominated by single-word entries (one_word_rate 0.555) and short strings (len_mean 10.9, word_mean 1.59). Roughly a third of rows repeat (duplicate_rate 0.344, 5,645 duplicates across 10,737 uniques), with 'Deaf' (164), 'French' (82), and 'British' (80) leading. The top words include fragments like 'traditions)', '(hindu', and '(muslim' occurring 500-1000+ times; these come from whitespace tokenisation of compound names such as 'X (Hindu traditions)' and show that many names carry a parenthetical religious qualifier. Treatment: Normalise casing and split the parenthetical qualifier into its own field before using the name as a categorical grouping key. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 10,737
len_min: 1
len_max: 41
len_mean: 10.91
len_median: 9
len_p95: 25
word_mean: 1.585
word_median: 1
n_empty: 0
n_duplicates: 5,645
duplicate_rate: 0.3446
vocab_size: 11,164
readability_flesch_mean: 50.53
emoji_rate: 0
url_rate: 0
one_word_rate: 0.5554
allcaps_rate: 0
boilerplate_rate: 0
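
Names of the form 'X (Hindu traditions)' can be separated into a base name and a qualifier with a small regular expression. This is a sketch under the assumption that the qualifier is always a single trailing parenthetical; the function name is hypothetical.

```python
import re

# Base name, then one trailing '(...)' group with no nested parentheses.
_QUALIFIED = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")


def split_qualifier(name):
    """Return (base_name, qualifier); qualifier is None when absent."""
    match = _QUALIFIED.match(name)
    if match:
        return match.group("base"), match.group("qualifier")
    return name.strip(), None


print(split_qualifier("X (Hindu traditions)"))
print(split_qualifier("Deaf"))
```

Grouping on the base name and keeping the qualifier as a separate column avoids treating 'X (Hindu traditions)' and 'X (Muslim traditions)' as unrelated labels.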

NaturalPronunciation

text metadata one_word null_rate duplicates

This column holds phonetic respellings of ethnic or demographic labels (e.g. 'AY-zhun', 'chai-NEEZ', 'kor-EE-un'), with hyphenated syllables and capitalised stress markers indicating an ad-hoc pronunciation guide. It is sparse and repetitive: 69.63% null, 69.41% one-word entries, and 61.15% duplicates among non-null values, with only 1,933 unique values across 16,382 rows. The most frequent value, 'def', appears 164 times, the same count as 'Deaf' in NaturalName, so it is almost certainly the pronunciation of 'Deaf' rather than a placeholder. Treatment: Treat as a pronunciation lookup keyed to NaturalName; with 69.63% null it is display metadata rather than a modelling feature, so leave nulls as missing or drop the column. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 11,407 (69.6%)
unique: 1,933
len_min: 2
len_max: 57
len_mean: 12.13
len_median: 11
len_p95: 26
word_mean: 1.345
word_median: 1
n_empty: 0
n_duplicates: 3,042
duplicate_rate: 0.6115
vocab_size: 2,039
readability_flesch_mean: 61.69
emoji_rate: 0
url_rate: 0
one_word_rate: 0.6941
allcaps_rate: 0.000402
boilerplate_rate: 0

PercentChristianPGAC

text feature one_word allcaps short_text duplicates

This column holds percentages (likely Christian population share, per the PGAC suffix) stored as text rather than numeric, with values like "0.000", "95.000", "90.000" filling lengths of 5-7 characters. The distribution is heavily zero-inflated: 3,121 of 16,382 rows are "0.000" and the duplicate rate is 88%, leaving only 1,954 unique values. Flagged as allcaps/one-word only because the profiler treated numeric strings as tokens. Treatment: Cast to float and treat as a numeric percentage feature. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 15 (0.1%)
unique: 1,954
len_min: 5
len_max: 7
len_mean: 5.528
len_median: 6
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 14,413
duplicate_rate: 0.8806
vocab_size: 1,954
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

PercentEvangelical

text feature one_word allcaps short_text duplicates

This is a numeric percentage (share evangelical) stored as text strings like '0.000' to '6.000', with values ranging 5-6 characters long and one token each. The distribution is heavily zero-inflated: 4205 of 16382 rows are '0.000', and the duplicate rate is 0.9315 across only 1047 unique values. Null rate is 0.0668, so roughly 7% are missing. Treatment: Cast to float and treat as a zero-inflated numeric feature. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 1,095 (6.7%)
unique: 1,047
len_min: 5
len_max: 6
len_mean: 5.226
len_median: 5
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 14,240
duplicate_rate: 0.9315
vocab_size: 1,047
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

PercentEvangelicalPC

categorical feature

Numeric percentages (0.000 to 28.097) describing evangelical share, stored as strings with only 228 distinct values across 16,382 rows, which suggests a people-cluster (PC) level statistic broadcast to many records rather than a per-row measurement. The top value '20.481' covers 7.0% of rows and the top ten values together account for a large fraction, consistent with one cluster-level value being repeated for every group in a large cluster. Entropy ratio is 0.87, so the distribution is fairly spread but discretised, and 1.0% of rows are null. Treatment: Cast to float and treat as a group-level numeric feature; do not one-hot encode. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 166 (1.0%)
unique: 228
top_value: 20.481
top_rate: 0.07024
cardinality: 228
entropy: 6.782
entropy_ratio: 0.8658

PercentEvangelicalPGAC

text feature one_word allcaps short_text duplicates

This is a numeric percentage (Percent Evangelical, PGAC) stored as text — every value is a single token of 5-6 characters formatted like '0.000', '4.000', '1.801'. The distribution is heavily zero-inflated: 3,272 of 16,382 rows are '0.000', duplicate rate is 0.896 across only 1,624 unique values, and 4.5% are null. Despite the column being typed as text, there is no real language content here. Treatment: Cast to float and treat as a zero-inflated numeric feature. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 743 (4.5%)
unique: 1,624
len_min: 5
len_max: 6
len_mean: 5.235
len_median: 5
len_p95: 6
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 14,015
duplicate_rate: 0.8962
vocab_size: 1,624
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 1
boilerplate_rate: 0

PCBuddhism

numeric feature high_skew outliers

PCBuddhism appears to be a per-record percentage feature for Buddhist composition, ranging 0–100 with mean 3.77 and median 0. The distribution is overwhelmingly zero (zero_rate 0.89) with q1=q3=0 and iqr=0, yet ~11% of rows are outliers and skew (4.77) and kurtosis (21.99) are extreme. Treat this as a sparse, heavy-tailed minority share where most populations have no Buddhist presence but a long tail reaches 100. Treatment: Add a zero-vs-nonzero indicator and log1p-transform the nonzero share before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 104 (0.6%)
unique: 1,052
min: 0
max: 100
mean: 3.769
median: 0
std: 16.75
q1: 0
q3: 0
iqr: 0
skew: 4.775
kurtosis: 22
n_outliers: 1,798
outlier_rate: 0.1105
zero_rate: 0.8895
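
A minimal sketch of the indicator-plus-log1p treatment, assuming the share has already been loaded as a float or `None`. The function name `zero_inflated_features` is hypothetical and is not part of any profiler API.

```python
import math
from typing import Optional, Tuple


def zero_inflated_features(share: Optional[float]) -> Tuple[Optional[int], Optional[float]]:
    """Split a 0-100 share into (is_nonzero, log1p(share)); missing stays missing."""
    if share is None:
        return None, None
    if share < 0:
        raise ValueError("a percentage share should not be negative")
    is_nonzero = int(share > 0)
    # log1p(0) is 0.0, so zero rows stay at zero on the transformed scale.
    return is_nonzero, math.log1p(share)


for value in (0.0, 0.5, 100.0, None):
    print(value, zero_inflated_features(value))
```

The same helper applies unchanged to the other sparse PC* share columns, since they have the same 0–100 range.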

PCEthnicReligions

numeric feature outliers

PCEthnicReligions appears to be a percentage feature (0–100) capturing the share of ethnic/folk religion adherents per record. Just over half the rows are exactly zero (zero_rate 0.5045) and the median and Q1 are both 0, yet values stretch all the way to 100, producing strong right skew (1.65) and 1,967 flagged outliers (12.05%). The distribution is effectively zero-inflated rather than continuous. Treatment: Model as zero-inflated: add an is_nonzero indicator and log1p-transform the positive values. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 59 (0.4%)
unique: 978
min: 0
max: 100
mean: 17.6
median: 0
std: 29.02
q1: 0
q3: 25
iqr: 25
skew: 1.654
kurtosis: 1.404
n_outliers: 1,967
outlier_rate: 0.1205
zero_rate: 0.5045

PCHinduism

numeric feature high_skew outliers

PCHinduism appears to be a per-record percentage share of Hinduism (0–100). The distribution is overwhelmingly zero (zero_rate 0.8343), so Q1, median, and Q3 are all 0, yet the mean is 14.01 with std 33.87, indicating that a minority of records carry very high values. The 16.57% outlier rate equals exactly 1 − zero_rate: with an IQR of 0, an IQR-based rule flags every nonzero row, so the flag reflects sparsity rather than dirty data. Skew is 2.06. Treatment: Treat as zero-inflated proportion: add a nonzero indicator and consider a log1p or sqrt transform before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 105 (0.6%)
unique: 1,412
min: 0
max: 100
mean: 14.01
median: 0
std: 33.87
q1: 0
q3: 0
iqr: 0
skew: 2.058
kurtosis: 2.3
n_outliers: 2,697
outlier_rate: 0.1657
zero_rate: 0.8343
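
The link between the zero rate and the outlier rate can be checked with a few lines of arithmetic. The sketch below assumes the profiler uses the common 1.5 × IQR (Tukey) fence rule; the report does not state which rule it uses, so treat that as an assumption.

```python
from typing import Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper fences for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# PCHinduism: q1 = q3 = 0, so both fences collapse to 0 and any value > 0 is flagged.
hinduism_fences = tukey_fences(0.0, 0.0)

# PCEthnicReligions: q1 = 0, q3 = 25, so only shares above the upper fence are flagged.
ethnic_fences = tukey_fences(0.0, 25.0)

# If every nonzero row is flagged, the outlier rate is 1 - zero_rate.
implied_outlier_rate = round(1 - 0.8343, 4)

print(hinduism_fences, ethnic_fences, implied_outlier_rate)
```

The implied rate matches the reported outlier_rate of 0.1657, which supports reading the flag as "nonzero" for columns whose IQR is 0.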

PCOtherSmall

numeric feature high_skew outliers

PCOtherSmall appears to be the percentage share (0–100) of other or small religions, in line with the sibling PC* columns. 89.3% of values are exactly zero, and the median, Q1, and Q3 are all 0. The remaining mass is highly skewed (skew 11.0, kurtosis 124.0) with a max of 100. The 10.7% flagged as outliers is simply the nonzero share (1 − zero_rate), so this is a sparse long-tailed distribution rather than a typical continuous feature. Treatment: Treat as zero-inflated: add a binary is_nonzero flag and log1p-transform the positive tail before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 104 (0.6%)
unique: 908
min: 0
max: 100
mean: 0.9613
median: 0
std: 8.299
q1: 0
q3: 0
iqr: 0
skew: 11
kurtosis: 124
n_outliers: 1,749
outlier_rate: 0.1074
zero_rate: 0.8926

RegionCode

numeric feature

RegionCode is an integer-valued field ranging from 1 to 12 with only 12 unique values across 16382 rows and no nulls. The flat distribution (kurtosis -1.20, skew 0.23, no outliers) and small cardinality indicate a categorical region identifier encoded numerically rather than a true numeric measure. Treatment: Cast to categorical and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 12
min: 1
max: 12
mean: 5.935
median: 5
std: 3.42
q1: 3
q3: 8
iqr: 5
skew: 0.231
kurtosis: -1.201
n_outliers: 0
outlier_rate: 0
zero_rate: 0
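
One way to one-hot encode the twelve region codes without any external library is sketched here. The output key names (`RegionCode_1` and so on) are an arbitrary choice for illustration.

```python
from typing import Dict

# The report gives min 1, max 12, and 12 unique values, so codes 1..12 are assumed.
REGION_CODES = list(range(1, 13))


def one_hot_region(code: int) -> Dict[str, int]:
    """Return a dict with one indicator per region code, exactly one of which is 1."""
    if code not in REGION_CODES:
        raise ValueError(f"unexpected RegionCode: {code!r}")
    return {f"RegionCode_{c}": int(c == code) for c in REGION_CODES}


encoded = one_hot_region(5)
print(encoded)
```

Raising on unknown codes keeps a silent all-zero row from slipping into a model if the source data ever gains a thirteenth region.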

PopulationPGAC

numeric feature high_skew outliers

PopulationPGAC is a population count spanning 10 to 925,129,800 with a median of just 88,000. The PGAC suffix most likely stands for 'people group across countries', so each value would be a group's total population across every country where it appears, repeated on each of that group's rows; the low cardinality (2,250 distinct values in 16,382 rows) is consistent with that reading. The distribution is extremely right-skewed (skew 15.15, kurtosis 262.66) and 19.2% of rows flag as outliers, with the mean (8.8M) two orders of magnitude above the median. Nulls are negligible (0.09%) and there are no zeros, but the spread between Q3 (1.39M) and the max indicates a long heavy tail of very large groups. Treatment: log-transform before any modelling or distance-based comparison. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 15 (0.1%)
unique: 2,250
min: 10
max: 9.251e+08
mean: 8.812e+06
median: 88,000
std: 5.114e+07
q1: 8,800
q3: 1.386e+06
iqr: 1.377e+06
skew: 15.15
kurtosis: 262.7
n_outliers: 3,145
outlier_rate: 0.1922
zero_rate: 0
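
To illustrate what the log transform does to this range, the sketch below applies a base-10 log to the reported min, median, and max. The helper name `log_population` is hypothetical. Because the column has no zeros, a plain log is safe here without an offset.

```python
import math
from typing import Optional


def log_population(population: Optional[float]) -> Optional[float]:
    """Base-10 log of a strictly positive population; missing stays missing."""
    if population is None:
        return None
    if population <= 0:
        raise ValueError("population must be positive for a log transform")
    return math.log10(population)


# Values taken from the stats block: min, median, max.
for label, value in [("min", 10), ("median", 88_000), ("max", 925_129_800)]:
    print(label, round(log_population(value), 2))
```

On the log scale the column spans roughly 1 to 9, so the median sits near the middle of the range instead of being crushed against the minimum.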

Frontier

categorical feature

Binary Y/N flag named 'Frontier', fully populated across 16382 rows with only 2 distinct values. The 'N' class dominates at 70.9% versus 29.1% 'Y', giving an entropy ratio of 0.87 — moderately imbalanced but well within usable range. Treatment: Encode as 0/1 boolean and use directly as a feature. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: N
top_rate: 0.709
cardinality: 2
entropy: 0.87
entropy_ratio: 0.87
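
The 0/1 encoding and the reported entropy ratio can both be reproduced with a short sketch. For a two-level column the maximum entropy is 1 bit, so the entropy and the entropy ratio coincide. The function names are illustrative only.

```python
import math


def yn_to_int(flag: str) -> int:
    """Map 'Y' to 1 and 'N' to 0; any other value raises KeyError."""
    return {"Y": 1, "N": 0}[flag]


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a two-class split with class share p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# top_rate for 'N' is 0.709 in the stats block.
frontier_entropy = binary_entropy(0.709)
print([yn_to_int(v) for v in ["Y", "N", "N"]], round(frontier_entropy, 2))
```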

MapAddress

text foreign_key one_word short_text duplicates

MapAddress holds single-token PNG filenames (e.g. m00320.png), almost certainly references to map image assets. Over half the column is empty (8,728 of 16,382 rows), and the 63.2% duplicate rate is dominated by those repeated blanks: the 7,654 populated rows hold 6,028 distinct filenames, so about 1,626 populated rows (21%) reuse a filename. Every non-empty value is one word with max length 13, so this behaves like a sparse foreign key to a map asset rather than free text. Treatment: Treat as a categorical asset reference; impute empties as 'none' and join to a map asset table if one is available. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 6,029
len_min: 0
len_max: 13
len_mean: 5.153
len_median: 0
len_p95: 13
word_mean: 1
word_median: 1
n_empty: 8,728
n_duplicates: 10,353
duplicate_rate: 0.632
vocab_size: 6,028
readability_flesch_mean: 13.52
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

HasJesusFilm

categorical feature

Binary Y/N flag indicating whether the JESUS Film is available for each record, with no nulls across 16,382 rows. The split is roughly 2:1 in favour of 'Y' (10,816 vs 5,566; top_rate 0.660), giving a high entropy ratio of 0.925 — informative but mildly imbalanced. Treatment: Encode as a 0/1 boolean for modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.6602
cardinality: 2
entropy: 0.9246
entropy_ratio: 0.9246

Nomadic

categorical feature imbalance

Binary Y/N flag indicating whether a record is 'Nomadic', with no nulls across 16382 rows. The distribution is severely imbalanced: 'N' covers 98.1% (16071) versus only 311 'Y' cases, yielding an entropy ratio of just 0.136. Treatment: Encode as a boolean; consider class-weighting or resampling since positives are only ~1.9%. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: N
top_rate: 0.981
cardinality: 2
entropy: 0.1357
entropy_ratio: 0.1357
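
If class weighting is chosen, one common heuristic is inverse-frequency weighting, where each class gets `n / (k * count)`. The sketch below applies it to the counts reported for this column. It is a generic heuristic (the same formula that scikit-learn describes for its "balanced" class weights), not something the report prescribes.

```python
from typing import Dict


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Counts from the description: 16,071 'N' rows and 311 'Y' rows.
weights = balanced_class_weights({"N": 16071, "Y": 311})
print(weights)
```

Each 'Y' row ends up weighted about fifty times more than an 'N' row, which is worth keeping in mind when judging how noisy the resulting estimates may be.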

NomadicTypeDescription

categorical metadata null_rate

Categorical descriptor of nomadic livelihood type, with six values combining three base categories (Agro-Pastoralists, Service or Trade, Hunter-Gatherers) singly or in pairs. The column is 98.1% null and populated for only 311 of 16,382 rows; the 16,071 nulls match the count of Nomadic = 'N' rows, so the field appears to be filled only for nomadic groups. Among populated rows, Agro-Pastoralists dominates at 68.2%. The sparsity makes this effectively a rare annotation rather than a general feature. Treatment: Treat as sparse metadata; impute an 'Unknown' category or drop unless modelling the populated subset. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 16,071 (98.1%)
unique: 6
top_value: Agro-Pastoralists
top_rate: 0.6817
cardinality: 6
entropy: 1.341
entropy_ratio: 0.5187

PhotoCCVersionText

categorical metadata

This column records the Creative Commons license version attached to a photo, with 17 distinct values across 16,382 rows and no nulls. It is dominated by empty strings at 83.6% (13,688 rows), leaving only ~16% with an actual license tag — the most common being 'CC BY 2.0' (661) and 'CC BY-NC-SA 2.0' (440). Low entropy ratio (0.28) confirms the field is sparse in practice despite zero technical nulls. Treatment: Treat empty string as missing and one-hot encode the remaining license categories. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 17
top_value: (empty string)
top_rate: 0.8356
cardinality: 17
entropy: 1.137
entropy_ratio: 0.2781
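
A small sketch of the "blank as missing, then one-hot" treatment, using only the two license tags named in the description. The helper names and the `license=` key prefix are illustrative choices.

```python
from typing import Dict, List, Optional


def blank_to_none(value: Optional[str]) -> Optional[str]:
    """Treat None, '' and whitespace-only strings as missing."""
    if value is None or value.strip() == "":
        return None
    return value.strip()


def one_hot(values: List[Optional[str]]) -> List[Dict[str, int]]:
    """One indicator per observed level; missing rows get all zeros."""
    levels = sorted({v for v in values if v is not None})
    return [{f"license={lvl}": int(v == lvl) for lvl in levels} for v in values]


# Sample built from the two tags the description names; the ordering is made up.
sample = ["", "CC BY 2.0", "CC BY-NC-SA 2.0", "", "CC BY 2.0"]
cleaned = [blank_to_none(v) for v in sample]
encoded = one_hot(cleaned)
print(cleaned)
print(encoded[0], encoded[1])
```

Leaving missing rows as all zeros avoids creating a dominant "missing" indicator that would mostly duplicate a has-photo flag.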

PhotoCCVersionURL

categorical metadata

This column holds Creative Commons license URLs associated with photos, drawn from a closed vocabulary of 17 distinct values. It is overwhelmingly empty: 13,688 of 16,382 rows (top_rate 0.836) carry the blank string rather than a license, leaving CC BY 2.0 (661) and CC BY-NC-SA 2.0 (440) as the most common actual licenses. Entropy ratio of 0.278 confirms the distribution is highly concentrated on the empty value. Treatment: Treat blank as missing and bucket the remaining license URLs into a low-cardinality categorical. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 17
top_value: (empty string)
top_rate: 0.8356
cardinality: 17
entropy: 1.137
entropy_ratio: 0.2781

MapCredits

categorical metadata long_tail

Attribution string for the map asset associated with each record, naming data, geography, and design contributors. Over half the rows (top_rate 0.533, 8,733 of 16,382) carry an empty string, and the remaining mass is spread across 198 distinct credit lines, some of which are near-duplicates: for example, two variants of the Omid/UNESCO/GMI credit differ only by a trailing period (2,228 vs 864). Entropy ratio of 0.357 and the long_tail alert confirm a few dominant phrasings plus a sparse tail. Treatment: Treat as provenance metadata; normalize whitespace/punctuation to collapse duplicate credit strings, and exclude from modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 199
top_value: (empty string)
top_rate: 0.5331
cardinality: 199
entropy: 2.726
entropy_ratio: 0.357
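
The whitespace and trailing-punctuation normalization could look like the sketch below. The sample strings are placeholders, not real credit lines from the dataset, and `normalize_credit` is a hypothetical helper.

```python
import re
from collections import Counter


def normalize_credit(text: str) -> str:
    """Collapse runs of whitespace and drop trailing periods and spaces."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(" .")


# Placeholder credit lines that differ only in spacing and a trailing period.
sample = [
    "Data: Example Source.",
    "Data:  Example Source",
    "",
    "Data: Example Source",
]
counts = Counter(normalize_credit(s) for s in sample if s.strip())
print(counts)
```

This only merges variants that differ in spacing or a final period. Credits that differ in wording would need a fuzzier comparison.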

MapCreditURL

categorical metadata long_tail imbalance

Optional attribution URL for the source map of each record, blank in 15,891 of 16,382 rows (top_rate 0.97). The remaining 491 rows (3%) hold 50 distinct non-blank values, dominated by asiaharvest.org (146) and cartomission.com (117), giving a very low entropy_ratio of 0.052. The mix of http URLs and a mailto: address suggests inconsistent data entry rather than a controlled vocabulary. Treatment: Drop from modelling; retain only as a provenance link for the ~3% of rows that carry a value. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 51
top_value: (empty string)
top_rate: 0.97
cardinality: 51
entropy: 0.2954
entropy_ratio: 0.05207

MapCopyright

categorical feature

A near-binary flag (with blanks) indicating map copyright status, dominated by 'N' at 86.5% of 16,382 rows. Only 118 records carry 'Y' and 2,100 are empty strings, giving just 3 distinct values and an entropy ratio of 0.387. The empty category is large enough that it should be treated as its own level rather than silently coerced. Treatment: Encode as a 3-level categorical (N / Y / blank); low signal due to severe imbalance. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 3
top_value: N
top_rate: 0.8646
cardinality: 3
entropy: 0.6126
entropy_ratio: 0.3865

MapCCVersionText

categorical metadata imbalance

This column records Creative Commons licence versions for map content, but 16347 of 16382 rows (top_rate 0.9979) are empty strings, leaving only 35 rows spread across five actual licences (CC0 1.0, CC BY-SA 3.0, CC BY 3.0, CC BY 2.0, CC BY-SA 4.0). Entropy ratio of 0.0099 confirms the column carries almost no information. Note that nulls are reported as 0.0 because the missing values are stored as empty strings rather than true nulls. Treatment: Drop or collapse to a binary has_licence flag; too sparse to use as a feature. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 6
top_value: (empty string)
top_rate: 0.9979
cardinality: 6
entropy: 0.02557
entropy_ratio: 0.009891

MapCCVersionURL

categorical metadata imbalance

MapCCVersionURL appears to be a Creative Commons license URL field attached to map records, with five distinct CC variants observed (CC0, BY-SA 3.0/4.0, BY 2.0/3.0). It is effectively empty: 16,347 of 16,382 rows (top_rate 0.9979) are blank strings, leaving only 35 rows with an actual license URL and entropy_ratio of 0.0099. Null_rate is reported as 0.0 because empties are stored as '' rather than nulls, which is itself worth flagging. Treatment: Drop or collapse to a binary has_license flag; near-constant empty string. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 6
top_value: (empty string)
top_rate: 0.9979
cardinality: 6
entropy: 0.02557
entropy_ratio: 0.009891

JF

categorical feature

JF is a binary Y/N flag with no nulls across 16,382 rows. Y dominates at 66.0% (10,816) versus N at 34.0% (5,566), yielding an entropy ratio of 0.925. These counts match HasJesusFilm exactly, so JF is most likely a duplicate of that flag. Treatment: Encode as a 0/1 indicator before modelling, and keep only one of JF and HasJesusFilm once they are confirmed to agree row by row. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.6602
cardinality: 2
entropy: 0.9246
entropy_ratio: 0.9246

AudioRecordings

categorical feature

Binary Y/N flag indicating whether audio recordings are present, with no nulls across 16,382 rows. The distribution is imbalanced: 'Y' dominates at 82.3% (13,479) versus 2,903 'N' values, yielding an entropy ratio of 0.67. Treatment: Encode as a 0/1 boolean; be aware of the ~82/18 class imbalance when using as a predictor. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.8228
cardinality: 2
entropy: 0.6739
entropy_ratio: 0.6739

Window1040

categorical feature

Window1040 is a binary Y/N flag covering all 16,382 rows with no nulls. The split is nearly even (Y at 52.3%, 8,572 rows; N at 47.7%, 7,810 rows), giving an entropy ratio of 0.998, essentially the maximum for a two-class field. The name most likely refers to the 10/40 Window, the band between roughly 10° and 40° north latitude used in missions research, so Y would mark groups located inside it. The balanced distribution makes it a usable feature rather than a degenerate constant. Treatment: Encode as a 0/1 indicator for modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2
top_value: Y
top_rate: 0.5233
cardinality: 2
entropy: 0.9984
entropy_ratio: 0.9984

PeopleGroupMapURL

text metadata one_word url_heavy duplicates

This column holds URLs to people-group map images hosted on joshuaproject.net, with every non-empty entry being a single-token link. Over half the rows (8,728 of 16,382) are empty strings, and the url_rate of 0.47 is simply the populated share (7,654 rows). The duplicate_rate of 0.63 is dominated by the repeated blanks: the populated rows hold 6,028 distinct URLs, so about 1,626 of them (21%) reuse a map. The empty and unique counts are identical to MapAddress, which suggests the two columns reference the same assets. Treatment: Treat as an optional asset URL: keep as-is for display, or drop if not rendering images. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 6,029
len_min: 0
len_max: 66
len_mean: 29.92
len_median: 0
len_p95: 66
word_mean: 1
word_median: 1
n_empty: 8,728
n_duplicates: 10,353
duplicate_rate: 0.632
vocab_size: 6,028
readability_flesch_mean: -329.1
emoji_rate: 0
url_rate: 0.4672
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

PeopleGroupMapExpandedURL

text metadata one_word url_heavy duplicates

This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net. It is mostly empty: 9,538 of 16,382 rows (the modal value) are blank, which is why len_median is 0 and url_rate is only 0.418. The duplicate_rate of 0.661 is dominated by those blanks: the 6,844 populated rows hold 5,560 distinct URLs, so about 1,284 of them (19%) reuse a map, and a single map (m00320.pdf) appears 96 times. Some people-group records therefore share the same regional map. Treatment: Treat as an optional reference link; drop for modelling or keep only as a join key to map assets. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 5,561
len_min: 0
len_max: 66
len_mean: 26.74
len_median: 0
len_p95: 66
word_mean: 1
word_median: 1
n_empty: 9,538
n_duplicates: 10,821
duplicate_rate: 0.6605
vocab_size: 5,560
readability_flesch_mean: -256.6
emoji_rate: 0
url_rate: 0.4178
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

PeopleGroupURL

text identifier near_unique one_word url_heavy

This column holds Joshua Project people-group URLs, one per row, with a perfectly fixed length of 48 characters and exactly one 'word' per cell. Every value is unique across all 16382 rows (n_unique equals n, duplicate_rate 0.0) and url_rate is 1.0, so it functions as a row-level identifier rather than analysable text. The URL pattern encodes a numeric people-group id plus a two-letter country suffix (e.g. /10375/tz, /10375/up), meaning the same group repeats across countries via different URLs. Treatment: Drop from modelling; keep as a row key or parse out the people-group id and country code if a join is needed. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 16,382
len_min: 48
len_max: 48
len_mean: 48
len_median: 48
len_p95: 48
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 16,382
readability_flesch_mean: -476.9
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
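
Parsing the people-group id and country suffix out of the URL might look like the sketch below. Only the trailing `/10375/tz` pattern comes from the report; the full example URL, including its path prefix, is an assumption made for illustration, and the parser deliberately relies only on the last two path segments.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_people_group_url(url: str) -> Tuple[int, str]:
    """Return (people_group_id, country_code) from the last two URL path segments."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"unexpected URL shape: {url!r}")
    people_id, country = segments[-2], segments[-1]
    # Upper-case the suffix so it can be compared with upper-case country codes.
    return int(people_id), country.upper()


# Assumed URL shape; only the '/10375/tz' ending is taken from the report.
example = "https://joshuaproject.net/people_groups/10375/tz"
print(parse_people_group_url(example))
```

Grouping rows by the parsed id is one way to see how many countries a single people group spans.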

PeopleGroupPhotoURL

text metadata one_word url_heavy duplicates

This column holds URLs to people-group profile photos hosted on joshuaproject.net, with every non-empty value being a single token (one_word_rate 1.0). 5,719 of 16,382 rows are empty strings, so url_rate is 0.65. Reuse among the populated rows is heavy even after setting the blanks aside: 10,663 populated rows share only 5,276 distinct URLs, so roughly half of them repeat a photo used elsewhere, and the top URL alone appears 92 times. The overall duplicate_rate of 0.68 (n_duplicates 11,105) also counts the repeated blanks. Treatment: Treat as an optional image asset link; drop for modelling or use only to fetch images, and handle the 5,719 empty values as missing. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 5,277
len_min: 0
len_max: 68
len_mean: 42.32
len_median: 65
len_p95: 65
word_mean: 1
word_median: 1
n_empty: 5,719
n_duplicates: 11,105
duplicate_rate: 0.6779
vocab_size: 5,276
readability_flesch_mean: -585.6
emoji_rate: 0
url_rate: 0.6509
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

CountryURL

categorical metadata

URL pointing to a country page on joshuaproject.net, with the trailing two-letter code acting as the country identifier. 238 distinct countries appear across 16,382 rows with no nulls, and India (IN) dominates at 13.8% (2,262 rows) followed by PP (883) and ID (788). PP is not an assigned ISO 3166 alpha-2 code, so these are probably Joshua Project's own country codes rather than ISO codes; check them against the ISO3 column before joining to external data. High entropy ratio (0.79) indicates the distribution is broad rather than concentrated despite the India lead. Treatment: Strip the URL prefix and keep the two-letter country code as a categorical feature. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 238
top_value: https://joshuaproject.net/countries/IN
top_rate: 0.1381
cardinality: 238
entropy: 6.225
entropy_ratio: 0.7885
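
Stripping the prefix can be done by taking the final path segment, as in this minimal sketch. The validation step (two alphabetic characters) is an added assumption based on the codes quoted in the description.

```python
def country_code_from_url(url: str) -> str:
    """Return the trailing two-letter code of a country URL, upper-cased."""
    code = url.rstrip("/").rsplit("/", 1)[-1]
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"no two-letter country code at the end of {url!r}")
    return code.upper()


# top_value from the stats block.
print(country_code_from_url("https://joshuaproject.net/countries/IN"))
```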

JPScaleText

categorical label

JPScaleText is a 5-level ordinal categorical describing how 'reached' a people group is, ranging from 'Unreached' to 'Significantly Reached'. The distribution is concentrated at the lowest level: 'Unreached' covers 43.5% of 16,382 rows (7,124), while 'Minimally Reached' is the rarest at 1,009. There are no nulls, and the entropy ratio of 0.87 shows that all five levels are well represented despite the skew. Treatment: Encode as an ordinal (Unreached < Minimally < Superficially < Partially < Significantly) before modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 5
top_value: Unreached
top_rate: 0.4349
cardinality: 5
entropy: 2.017
entropy_ratio: 0.8688
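
An ordinal encoding along these lines is sketched below. The report spells out only 'Unreached', 'Minimally Reached', and 'Significantly Reached' in full; the exact label text for the two middle levels is an assumption and should be checked against the actual values.

```python
from typing import Dict, List

# Assumed label spellings for levels 3 and 4; the other three appear in the report.
JP_SCALE_ORDER: List[str] = [
    "Unreached",
    "Minimally Reached",
    "Superficially Reached",
    "Partially Reached",
    "Significantly Reached",
]
JP_SCALE_RANK: Dict[str, int] = {label: i + 1 for i, label in enumerate(JP_SCALE_ORDER)}


def encode_jp_scale(label: str) -> int:
    """Map a JPScaleText label to its 1-5 rank; unknown labels raise KeyError."""
    return JP_SCALE_RANK[label.strip()]


print([encode_jp_scale(x) for x in ["Unreached", "Significantly Reached"]])
```

Failing loudly on an unknown label is intentional, because a silent default would hide a spelling mismatch in the two assumed labels.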

JPScaleImageURL

categorical metadata

This column holds URLs to one of five 'gauge' images on joshuaproject.net, almost certainly a visual encoding of an ordinal Joshua Project Scale (1-5). Distribution is uneven: gauge-1 dominates at 43.5% of 16,382 rows, while gauge-2 is rarest at 1,009 rows, and entropy ratio is 0.87. No nulls, but the URL itself carries no information beyond the underlying 1-5 code. Treatment: Extract the trailing digit as an ordinal feature and drop the URL string. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 5
top_value: https://joshuaproject.net/assets/img/gauge/gauge-1.png
top_rate: 0.4349
cardinality: 5
entropy: 2.017
entropy_ratio: 0.8688
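
Extracting the trailing digit is a one-line regular expression. The sketch assumes every URL ends in `gauge-N.png` with a single digit, as in the top value shown in the stats block.

```python
import re

_GAUGE_PATTERN = re.compile(r"gauge-(\d)\.png$")


def gauge_level(url: str) -> int:
    """Return the 1-5 level encoded in a gauge image URL."""
    match = _GAUGE_PATTERN.search(url)
    if match is None:
        raise ValueError(f"not a gauge image URL: {url!r}")
    return int(match.group(1))


# top_value from the stats block.
print(gauge_level("https://joshuaproject.net/assets/img/gauge/gauge-1.png"))
```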

Summary

text free_text one_word duplicates

A people-group summary field in English that is empty for most rows: 12,328 of 16,382 rows are blank, which drives the median length of 0 and the one_word_rate of 0.75. The duplicate_rate of 0.77 is almost entirely those repeated blanks. The 4,054 populated rows hold 3,777 distinct texts, so only 277 populated rows (about 7%) repeat an earlier entry, chiefly Rajput and Jat write-ups that recur dozens of times, sometimes in slight variants. Populated texts can be substantial (len_p95 = 719, max 1,212), and the Flesch mean of 13.2 points to dense, hard-to-read prose. Treatment: Deduplicate and drop empties before any NLP; treat as supplementary description rather than a per-row feature. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 3,778
len_min: 0
len_max: 1,212
len_mean: 137.7
len_median: 0
len_p95: 719
word_mean: 23.34
word_median: 1
n_empty: 12,328
n_duplicates: 12,604
duplicate_rate: 0.7694
vocab_size: 24,964
readability_flesch_mean: 13.16
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7525
allcaps_rate: 0
boilerplate_rate: 0.0001221
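
Because blanks count as duplicates of each other, the headline duplicate_rate says little about reuse among populated rows. The arithmetic below separates the two. It assumes the `unique` figure counts the empty string as one value, which is consistent with n_duplicates = n − unique in the stats block. The function name is hypothetical.

```python
from typing import Tuple


def populated_duplicate_stats(n: int, n_empty: int, unique_incl_empty: int) -> Tuple[int, int, int, float]:
    """Return (populated rows, distinct non-empty values, repeated rows, repeat share)."""
    populated = n - n_empty
    # Assumption: the empty string is counted once in the unique total.
    distinct_nonempty = unique_incl_empty - (1 if n_empty > 0 else 0)
    repeats = populated - distinct_nonempty
    return populated, distinct_nonempty, repeats, repeats / populated


# Figures from the Summary stats block: n, n_empty, unique.
stats = populated_duplicate_stats(16382, 12328, 3778)
print(stats)
```

The same function can be pointed at the n, n_empty, and unique figures of any other mostly-blank text or URL column in this report.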

Obstacles

text free_text one_word duplicates

Free-text English commentary describing spiritual or cultural obstacles to Christian evangelism for various ethnic groups (Rajputs, Jats, Bosniaks, Azeri, etc.). The field is overwhelmingly empty: 12,327 of 16,382 rows are blank, driving a median length of 0 and a one-word rate of 0.75. The duplicate_rate of 0.77 mostly reflects those blanks. The 4,055 populated rows hold 3,731 distinct texts, so about 324 populated rows (8%) repeat an earlier entry, led by Rajput and Jat paragraphs that appear 88 and 74 times, suggesting templated entries reused across related people-group records. Treatment: Treat blanks as missing and deduplicate template paragraphs before tokenizing/embedding for any text modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 3,732
len_min: 0
len_max: 726
len_mean: 47.4
len_median: 0
len_p95: 267
word_mean: 8.704
word_median: 1
n_empty: 12,327
n_duplicates: 12,650
duplicate_rate: 0.7722
vocab_size: 9,899
readability_flesch_mean: 12.76
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7525
allcaps_rate: 0
boilerplate_rate: 0.0004273

HowReach

text free_text one_word duplicates

This is a free-text field describing how a people group could be reached. The column is dominated by emptiness: 13,043 of 16,382 rows (n_empty) are blank, driving a median length of 0 and most of the 0.82 duplicate_rate. The 3,339 populated rows hold 2,943 distinct texts, so about 396 populated rows (12%) repeat an earlier entry. Populated prose is substantive (len_max 599, len_p95 221); the same paragraph about Jats appears 136 times and several near-duplicates differ only by a word, suggesting templated copy across related groups. Treatment: Treat as sparse free text: filter out empties and deduplicate near-identical strings before any tokenization or embedding. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 2,944
len_min: 0
len_max: 599
len_mean: 36.3
len_median: 0
len_p95: 221
word_mean: 6.879
word_median: 1
n_empty: 13,043
n_duplicates: 13,438
duplicate_rate: 0.8203
vocab_size: 7,981
readability_flesch_mean: 12.17
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7962
allcaps_rate: 0
boilerplate_rate: 0.0001221
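
Near-identical strings can be collapsed with a similarity threshold. The sketch below uses `difflib.SequenceMatcher` from the standard library. The 0.9 threshold and the sample sentences are illustrative assumptions, not values or text from the dataset.

```python
import difflib
from typing import List


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def dedupe_near_identical(texts: List[str], threshold: float = 0.9) -> List[str]:
    """Keep the first text of each group whose similarity ratio is >= threshold."""
    kept: List[str] = []
    kept_norm: List[str] = []
    for text in texts:
        norm = normalize(text)
        if not norm:
            continue  # drop empty rows
        is_duplicate = any(
            difflib.SequenceMatcher(None, norm, seen, autojunk=False).ratio() >= threshold
            for seen in kept_norm
        )
        if not is_duplicate:
            kept.append(text)
            kept_norm.append(norm)
    return kept


# Made-up sample sentences, not text from the dataset.
a = "Pray for workers to go to this people group and share the good news with them."
b = "Pray for workers to go to this people group and share the good news with them all."
c = "Community leaders could be reached through literacy programmes."
print(dedupe_near_identical([a, "", b, c, a]))
```

The comparison is quadratic in the number of kept strings. That is workable for a few thousand distinct texts but would need blocking or hashing for much larger columns.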

PrayForChurch

text free_text one_word duplicates

Free-text prayer prompts focused on praying for the church among each people group (top words: pray, the, among). The column is mostly empty: 14,208 of 16,382 rows are blank, which accounts for most of the 0.89 duplicate_rate. The 2,174 populated rows hold 1,790 distinct texts, so about 384 populated rows (18%) repeat an earlier entry, including a single Jat-related prayer that appears 146 times. The small vocabulary (4,576 words) suggests templated prayer points rather than individually authored prose, and all 652 language-tagged rows are English. Treatment: Treat as sparse optional commentary; impute empties as missing and dedupe templates before any text modelling. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 1,791
len_min: 0
len_max: 649
len_mean: 26.91
len_median: 0
len_p95: 210
word_mean: 5.599
word_median: 1
n_empty: 14,208
n_duplicates: 14,591
duplicate_rate: 0.8907
vocab_size: 4,576
readability_flesch_mean: 7.55
emoji_rate: 0
url_rate: 0
one_word_rate: 0.8673
allcaps_rate: 0
boilerplate_rate: 0

PrayForPG

text free_text one_word duplicates

Free-text prayer points for a people group (PG), built around recurring 'Pray for...' templates. The column is mostly empty: 12,570 of 16,382 rows are blank (median length 0, median word count 1), which accounts for most of the 0.78 duplicate rate. The 3,812 populated rows hold 3,529 distinct texts, so about 283 populated rows (7%) repeat an earlier entry, including one Rajput-focused prayer block repeated 88 times. Flesch mean of 14.3 is consistent with dense, formulaic devotional prose. Treatment: Treat as sparse boilerplate: drop empties, dedupe, and only embed the ~3.5k unique strings if prayer content is actually needed downstream. high · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: 3,530
len_min: 0
len_max: 937
len_mean: 72.22
len_median: 0
len_p95: 400
word_mean: 13.06
word_median: 1
n_empty: 12,570
n_duplicates: 12,852
duplicate_rate: 0.7845
vocab_size: 9,427
readability_flesch_mean: 14.31
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7673
allcaps_rate: 0
boilerplate_rate: 0

Resources

unknown other skipped

The column is named "Resources" but saturn skipped profiling, so kind is unknown and no descriptive statistics were computed. We can confirm 16382 rows with a null_rate of 0.0, but n_unique and value-level signals are missing. Without those stats the semantic role and content type cannot be determined from the evidence alone. Treatment: Re-profile this column with type coercion before deciding how to use it. low · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: —

country_data

unknown other skipped

The column `country_data` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16382 and a null rate of 0.0. Without `n_unique` or any sample stats, the actual content (codes, names, nested objects) cannot be inferred from the evidence. Treatment: Re-profile with an appropriate parser to determine structure before use. low · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: —

language_data

unknown other skipped

The `language_data` column was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16382 and a null rate of 0.0. Without `n_unique` or any descriptive stats, the actual contents (likely some structured or nested language payload given the name) cannot be characterized here. Treatment: Re-profile with an appropriate parser (e.g., expand JSON or cast to string) before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 16,382
nulls: 0 (0.0%)
unique: —
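
Before choosing a parser for a skipped column, a first pass can simply tally what kinds of values it holds. The sketch below is one hypothetical way to do that on values already loaded into Python. It makes no claim about what these columns actually contain, and the sample list is invented.

```python
import json
from collections import Counter
from typing import Any, Iterable


def describe_value(value: Any) -> str:
    """Label a value as null, a native container, a JSON-encoded string, or a plain type."""
    if value is None:
        return "null"
    if isinstance(value, (dict, list)):
        return type(value).__name__
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return "str"
        return f"json:{type(parsed).__name__}"
    return type(value).__name__


def profile_types(values: Iterable[Any]) -> Counter:
    """Count how many values fall into each label."""
    return Counter(describe_value(v) for v in values)


# Invented sample covering the cases the function distinguishes.
sample = ['{"name": "x"}', {"name": "y"}, None, "plain text", "[1, 2]"]
print(profile_types(sample))
```

If most values come back as `dict`, `list`, or `json:*`, expanding the nested fields into their own columns is probably the next step. If they come back as `str`, the usual text profile applies.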
