saturn·

joshua project joshua project enriched

source /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_enriched.parquet · 16,382 rows · 109 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is the Joshua Project people-groups dataset: 16,382 rows and 109 columns describing ethnic groups by country, with demographics, language, religion mix, and Christian-engagement indicators. The shape is dominated by categorical and text fields; Continent, AffinityBloc, PrimaryReligion, and the JPScale 'reachedness' rating give the cleanest first read on who is in the file. Population is extremely long-tailed (median 20,000 but max ~913M and skew ~91), so any size analysis should use logs or quantiles rather than means. Religion-share columns like PCIslam, PCHinduism, and PCBuddhism are mostly zero with a minority of groups at very high percentages, which suggests that most groups have a single dominant religion. Watch out for columns with very high null rates (NomadicTypeDescription 98%, RLG4 96%, PrimaryLanguageDialect 92%), a more moderate gap in NTOnline (29%), and many near-duplicate URL/ID fields that won't add analytic value.

citing: Continent · AffinityBloc · PrimaryReligion · JPScaleText · Population · PCIslam · PCHinduism · RegionName · LeastReached · Frontier

Schema

109 columns
Per-column summary. Each entry lists column name, kind, null rate, and unique count; alert flags, where present, follow on the next line.
PeopleID3ROG3 text 0.0% 16,382
near_unique one_word allcaps short_text
ROG3 categorical 0.0% 238
PeopleID3 numeric 0.0% 10,415
ROP3 numeric 0.1% 10,405
PeopNameInCountry text 0.0% 10,748
one_word duplicates
ROG2 categorical 0.0% 7
Continent categorical 0.0% 7
RegionName categorical 0.0% 12
ISO3 categorical 0.0% 238
LocationInCountry text 45.1% 7,794
multilingual null_rate
PeopleID1 numeric 0.0% 16
ROP1 categorical 0.0% 16
AffinityBloc categorical 0.0% 16
PeopleID2 numeric 0.0% 267
ROP2 categorical 0.0% 214
PeopleCluster categorical 0.0% 267
PeopNameAcrossCountries text 0.0% 10,377
one_word duplicates
Population numeric 0.2% 1,708
high_skew outliers
Category categorical 0.0% 3
ROL3 text 0.0% 6,164
one_word short_text duplicates
PrimaryLanguageName text 0.0% 6,153
one_word short_text duplicates
PrimaryLanguageDialect categorical 92.3% 980
long_tail null_rate
NumberLanguagesSpoken numeric 0.0% 78
high_skew outliers
OfficialLang categorical 0.0% 87
SpeakNationalLang unknown 0.0%
skipped
BibleStatus numeric 0.0% 6
BibleYear categorical 52.4% 466
null_rate
NTYear text 30.5% 1,072
one_word allcaps null_rate short_text duplicates
PortionsYear text 17.9% 1,737
one_word allcaps short_text duplicates
TranslationNeedQuestionable unknown 0.0%
skipped
JPScale numeric 0.0% 5
JPScalePC categorical 0.0% 5
JPScalePGAC categorical 0.0% 5
LeastReached categorical 0.0% 2
LeastReachedPC categorical 0.0% 2
LeastReachedPGAC categorical 0.0% 2
GSEC categorical 0.0% 8
HasAudioRecordings categorical 0.0% 2
NTOnline categorical 28.5% 1
null_rate imbalance
RLG3 numeric 0.0% 8
RLG3PC numeric 0.0% 8
RLG3PGAC numeric 0.0% 8
PrimaryReligion categorical 0.0% 8
PrimaryReligionPC categorical 0.0% 8
PrimaryReligionPGAC categorical 0.0% 8
RLG4 numeric 96.2% 20
null_rate outliers
ReligionSubdivision categorical 96.2% 20
null_rate
PCIslam numeric 0.5% 1,117
outliers
PCNonReligious numeric 0.4% 223
high_skew outliers
PCUnknown numeric 0.6% 583
high_skew
SecurityLevel numeric 0.0% 3
LRTop100 categorical 0.0% 2
imbalance
PhotoAddress text 0.0% 5,277
one_word short_text duplicates
PhotoCredits text 0.1% 1,605
one_word duplicates
PhotoCreditURL text 33.1% 1,434
one_word url_heavy null_rate duplicates
PhotoCreativeCommons categorical 0.0% 2
PhotoCopyright categorical 0.1% 2
PhotoPermission categorical 0.1% 3
ProfileTextExists categorical 0.0% 2
CountOfCountries numeric 0.0% 48
high_skew outliers
CountOfProvinces unknown 0.0%
skipped
Longitude numeric 0.0% 15,889
high_skew
Latitude numeric 0.0% 15,851
Ctry categorical 0.0% 238
IndigenousCode categorical 0.0% 2
PercentAdherents text 0.0% 1,248
one_word allcaps short_text duplicates
PercentChristianPC categorical 0.0% 246
NaturalName text 0.0% 10,737
one_word duplicates
NaturalPronunciation text 69.6% 1,933
one_word null_rate duplicates
PercentChristianPGAC text 0.1% 1,954
one_word allcaps short_text duplicates
PercentEvangelical text 6.7% 1,047
one_word allcaps short_text duplicates
PercentEvangelicalPC categorical 1.0% 228
PercentEvangelicalPGAC text 4.5% 1,624
one_word allcaps short_text duplicates
PCBuddhism numeric 0.6% 1,052
high_skew outliers
PCEthnicReligions numeric 0.4% 978
outliers
PCHinduism numeric 0.6% 1,412
high_skew outliers
PCOtherSmall numeric 0.6% 908
high_skew outliers
RegionCode numeric 0.0% 12
PopulationPGAC numeric 0.1% 2,250
high_skew outliers
Frontier categorical 0.0% 2
MapAddress text 0.0% 6,029
one_word short_text duplicates
HasJesusFilm categorical 0.0% 2
Nomadic categorical 0.0% 2
imbalance
NomadicTypeDescription categorical 98.1% 6
null_rate
PhotoCCVersionText categorical 0.0% 17
PhotoCCVersionURL categorical 0.0% 17
MapCredits categorical 0.0% 199
long_tail
MapCreditURL categorical 0.0% 51
long_tail imbalance
MapCopyright categorical 0.0% 3
MapCCVersionText categorical 0.0% 6
imbalance
MapCCVersionURL categorical 0.0% 6
imbalance
JF categorical 0.0% 2
AudioRecordings categorical 0.0% 2
Window1040 categorical 0.0% 2
PeopleGroupMapURL text 0.0% 6,029
one_word url_heavy duplicates
PeopleGroupMapExpandedURL text 0.0% 5,561
one_word url_heavy duplicates
PeopleGroupURL text 0.0% 16,382
near_unique one_word url_heavy
PeopleGroupPhotoURL text 0.0% 5,277
one_word url_heavy duplicates
CountryURL categorical 0.0% 238
JPScaleText categorical 0.0% 5
JPScaleImageURL categorical 0.0% 5
Summary text 0.0% 3,778
one_word duplicates
Obstacles text 0.0% 3,732
one_word duplicates
HowReach text 0.0% 2,944
one_word duplicates
PrayForChurch text 0.0% 1,791
one_word duplicates
PrayForPG text 0.0% 3,530
one_word duplicates
Resources unknown 0.0%
skipped
country_data unknown 0.0%
skipped
language_data unknown 0.0%
skipped

PeopleID3ROG3

text identifier near_unique one_word allcaps short_text
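PeopleID3ROG3 is almost certainly a row-level identifier for a people group within a country, not a person-level ID: every one of the 16,382 rows has a unique 7-character single token (n_unique equals n, duplicate_rate 0, len_min and len_max both 7). The sampled values look like a 5-digit numeric prefix followed by a 2-letter suffix (e.g. '10375tz', '10375up'), which is consistent with the column name: a PeopleID3 value (range 10119 to 22661) concatenated with a two-letter ROG3 country code, i.e. a structured composite key rather than a random hash. The samples appear lowercased even though allcaps_rate is 1, so check the stored casing before joining. No nulls, no boilerplate, no duplicates: clean but useless as a feature. Treatment: Drop from modelling; retain only as a join key. high · anthropic:claude-opus-4-7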
n
16,382
nulls
0 (0.0%)
unique
16,382
len_min
7
len_max
7
len_mean
7
len_median
7
len_p95
7
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
16,382
readability_flesch_mean
115.3
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0
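A minimal sketch of splitting the composite key back into its parts, assuming the 5-digit-plus-2-letter layout seen in the sampled values holds for every row. The helper name `split_people_country_key` is hypothetical and not part of saturn.

```python
import re

# Assumed layout: 5 digits (PeopleID3) followed by 2 letters (ROG3 country code).
KEY_PATTERN = re.compile(r"^(\d{5})([A-Za-z]{2})$")


def split_people_country_key(key):
    """Split a PeopleID3ROG3 value into (people_id, country_code)."""
    match = KEY_PATTERN.match(key.strip())
    if match is None:
        # Fail loudly so that rows breaking the assumed layout are noticed.
        raise ValueError(f"unexpected key format: {key!r}")
    people_id, country = match.groups()
    # Upper-case the suffix so it compares cleanly with ROG3 regardless of stored casing.
    return int(people_id), country.upper()


print(split_people_country_key("10375tz"))
```

Raising on a non-matching value is a deliberate choice here: a silent fallback would hide any rows where the key does not follow the assumed pattern.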

ROG3

categorical feature
ROG3 holds 238 distinct two-letter country codes, with IN (2,262 rows, 13.8%) leading, followed by PP, ID, PK, CH, and NI. No nulls across 16,382 rows; the entropy ratio of 0.79 indicates a broad spread across many countries with moderate concentration in the leading few. These are not ISO 3166-1 alpha-2 codes: the top values line up with ISO3's IND, PNG, IDN, PAK, and CHN, so 'PP' and 'CH' here stand for Papua New Guinea and China, which matches the FIPS 10-4 convention (under ISO alpha-2, 'CH' is Switzerland and 'PP' is unassigned). Treatment: Treat as a high-cardinality categorical: target- or frequency-encode for modelling, and use the ISO3 column rather than ROG3 when joining to ISO-keyed reference data. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
238
top_value
IN
top_rate
0.1381
cardinality
238
entropy
6.225
entropy_ratio
0.7885
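The reported entropy_ratio values are consistent with Shannon entropy in bits divided by log2 of the cardinality. That definition is inferred from the numbers in this report, not taken from saturn's documentation, so treat the sketch below as an assumption about how the figure is derived.

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Shannon entropy in bits divided by log2(number of distinct values)."""
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # a constant column has no spread to measure
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(k)


# Cross-check against the reported figures: entropy 6.225 over 238 levels.
reported_ratio = 6.225 / math.log2(238)
print(round(reported_ratio, 4))
```

A ratio of 1.0 means every level is equally frequent; values near 0 mean one level holds almost all rows.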

PeopleID3

numeric foreign_key
PeopleID3 is an integer key ranging from 10119 to 22661 with 10415 unique values across 16382 rows and no nulls. The duplication (about 5967 repeated entries) and the bounded, non-zero range are consistent with a foreign-key reference to a people table rather than a measurement. Distribution is mildly right-skewed (0.37) and platykurtic (-0.98), with no outliers flagged. Treatment: Treat as a foreign key and left-join to the people dimension; do not use as a numeric feature. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
10,415
min
10,119
max
22,661
mean
1.541e+04
median
14,962
std
3478
q1
12,348
q3
1.833e+04
iqr
5984
skew
0.3697
kurtosis
-0.9812
n_outliers
0
outlier_rate
0
zero_rate
0

ROP3

numeric identifier
ROP3 is a numeric column tightly bounded between 100004 and 119649 across 16382 rows, with 10405 unique values and almost no nulls (0.07%). The mean (109058.6) and median (108856) sit close together with low skew (0.15) and slightly platykurtic shape (kurtosis -1.05), and saturn flagged zero outliers. The narrow ~20k range starting just above 100000 looks more like a coded identifier or zoned key than a free-ranging measurement. Treatment: Treat as a categorical code rather than a continuous feature; do not scale or log-transform. medium · anthropic:claude-opus-4-7
n
16,382
nulls
11 (0.1%)
unique
10,405
min
100,004
max
119,649
mean
1.091e+05
median
108,856
std
5405
q1
104,189
q3
113,527
iqr
9,338
skew
0.1507
kurtosis
-1.053
n_outliers
0
outlier_rate
0
zero_rate
0

PeopNameInCountry

text label one_word duplicates
Short ethnonym/people-group labels naming a population within a country (e.g., 'French', 'British', 'Han Chinese, Mandarin'), with a median length of 9 characters and median word count of 1. 53% of values are single words and 34% are duplicates (5634 rows), so the same group label recurs across many country rows; 'Deaf' tops the list at 164. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' indicate a Joshua-Project-style people-group taxonomy rather than free text. Treatment: Treat as a categorical people-group label; normalize casing/parentheticals and join with country to form the unique key. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
10,748
len_min
1
len_max
41
len_mean
11.46
len_median
9
len_p95
25
word_mean
1.636
word_median
1
n_empty
0
n_duplicates
5,634
duplicate_rate
0.3439
vocab_size
11,539
readability_flesch_mean
49.27
emoji_rate
0
url_rate
0
one_word_rate
0.5313
allcaps_rate
0
boilerplate_rate
0
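One way to normalize these labels is to separate the base name from a trailing parenthetical qualifier before building a key. This is a sketch only: `split_label` is a hypothetical helper, and the sample label "Example (Hindu traditions)" is made up to show the pattern.

```python
import re

# Trailing parenthetical such as "(Hindu traditions)" or "(Muslim traditions)".
QUALIFIER = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")


def split_label(label):
    """Return (base_name, qualifier_or_None), whitespace-collapsed and case-folded."""
    text = " ".join(label.split())  # collapse repeated whitespace
    match = QUALIFIER.match(text)
    if match is None:
        return text.casefold(), None
    return match["base"].casefold(), match["qualifier"].casefold()


for raw in ["Deaf", "Han Chinese, Mandarin", "Example (Hindu traditions)"]:
    print(raw, "->", split_label(raw))
```

Pairing the normalized base name with a country code (ISO3 or ROG3) then gives a candidate unique key, which should be checked for collisions before use.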

ROG2

categorical feature
ROG2 is a low-cardinality categorical with 7 codes that look like macro-region groupings (ASI, AFR, EUR, NAR, AUS, LAM, SOP). Distribution is uneven but not degenerate: ASI dominates at 45.0% of 16,382 rows, AFR follows at 3,635, and entropy ratio of 0.80 confirms broad spread across the remaining buckets. No nulls and no rare-value tail beyond the seven codes. Treatment: one-hot or target-encode; safe to use directly given clean 7-level categorical with no missingness. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
7
top_value
ASI
top_rate
0.4498
cardinality
7
entropy
2.257
entropy_ratio
0.8039

Continent

categorical feature
Categorical continent label with all 7 expected values and zero nulls across 16,382 rows. Asia dominates at 44.98% (7,368 rows), followed by Africa at 3,635; entropy ratio of 0.80 confirms a moderately skewed but not degenerate distribution. Note that Australia (1,088) and Oceania (447) appear as separate categories, which is unusual and suggests inconsistent regional coding worth reconciling. Treatment: One-hot encode after merging Australia and Oceania into a single category. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
7
top_value
Asia
top_rate
0.4498
cardinality
7
entropy
2.257
entropy_ratio
0.8039

RegionName

categorical feature
RegionName is a categorical geographic grouping with 12 distinct world regions and no nulls across 16,382 rows. Distribution is fairly balanced (entropy ratio 0.93), though 'Asia, South' leads at 22.6% (3,707 rows) and the top three regions are all Asian or African. The labels use a 'Continent, Subregion' convention which may need parsing if continent-level rollups are wanted. Treatment: one-hot or target-encode for modelling; optionally split on the comma to derive a continent column. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
12
top_value
Asia, South
top_rate
0.2263
cardinality
12
entropy
3.325
entropy_ratio
0.9275
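A small sketch of the comma split, assuming every label follows the 'Continent, Subregion' pattern seen in 'Asia, South'. Labels without a comma are returned unchanged; `region_to_continent` is a hypothetical helper name.

```python
def region_to_continent(region_name):
    """Return the part of a RegionName label before the first comma."""
    continent, _sep, _subregion = region_name.partition(",")
    return continent.strip()


print(region_to_continent("Asia, South"))
```

Compare the derived values against the existing Continent column before relying on them, since the two taxonomies may not agree for every region.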

ISO3

categorical foreign_key
This column holds ISO3 country codes across 238 distinct values with no nulls, consistent with a country dimension key. India (IND) dominates at 13.8% (2262 rows), followed by PNG, IDN, PAK and CHN, indicating a heavy Asia/tropical skew rather than uniform global coverage. Entropy ratio of 0.79 confirms moderate concentration on a few countries. Treatment: Use as a country join key; consider grouping long-tail codes or stratifying analyses to handle the IND-heavy skew. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
238
top_value
IND
top_rate
0.1381
cardinality
238
entropy
6.225
entropy_ratio
0.7885

LocationInCountry

text free_text multilingual null_rate
Free-text descriptions of where a group is located within a country, mixing geographic prose ('Primarily north', 'Madang province.') with longer ethnographic paragraphs up to 994 characters. Nearly half the rows are null (45.12%) and 13.3% are duplicate strings, with stock phrases like 'Widespread.' (103) and 'Scattered.' recurring. Although 2,618 entries register as English, small pockets of Spanish (12), Portuguese (10) and nine other languages appear, and Flesch readability averages a difficult 38.1. Treatment: Normalize boilerplate phrases and tokenize/embed for semantic use; do not treat as a categorical. high · anthropic:claude-opus-4-7
n
16,382
nulls
7,392 (45.1%)
unique
7,794
len_min
3
len_max
994
len_mean
108.2
len_median
79
len_p95
314
word_mean
15.48
word_median
11
n_empty
0
n_duplicates
1,196
duplicate_rate
0.133
vocab_size
25,733
readability_flesch_mean
38.13
emoji_rate
0
url_rate
0
one_word_rate
0.02725
allcaps_rate
0
boilerplate_rate
0

PeopleID1

numeric feature
PeopleID1 is stored as numeric but takes only 16 distinct integer values across 16,382 rows, ranging from 10 to 26 with a median of 20. The bounded range and low cardinality mark it as a grouping code rather than a continuous measurement, and the count of 16 matches ROP1 and AffinityBloc, so it is probably the numeric ID of the affinity bloc; skew and kurtosis are not meaningful for a code like this. Treatment: Cast to categorical and one-hot encode rather than treating as continuous; if it maps one-to-one onto AffinityBloc, keep only one of the two. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
16
min
10
max
26
mean
18.58
median
20
std
3.921
q1
16
q3
21
iqr
5
skew
-0.7942
kurtosis
-0.5112
n_outliers
0
outlier_rate
0
zero_rate
0

ROP1

categorical feature
ROP1 is a low-cardinality categorical code with 16 distinct values (all prefixed 'A0##'), fully populated across 16382 rows. The distribution is moderately concentrated: 'A012' alone covers 25.5% and 'A013' adds another ~19%, while the entropy ratio of 0.83 indicates a reasonably even spread among the rest. Its cardinality, top rate (0.255), and entropy (3.322) are identical to AffinityBloc's, so it is most likely the code for that column, with 'A012' corresponding to 'South Asian Peoples', rather than an independent attribute. Treatment: one-hot or target-encode for modelling, but not alongside AffinityBloc if the two prove to be one-to-one. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
16
top_value
A012
top_rate
0.255
cardinality
16
entropy
3.322
entropy_ratio
0.8306

AffinityBloc

categorical feature
AffinityBloc is a categorical grouping of populations into 16 broad ethno-geographic blocs, with no nulls across 16,382 rows. The distribution is moderately concentrated: "South Asian Peoples" leads at 25.5% (4,178 rows), followed by "Sub-Saharan Peoples" (3,073), while entropy ratio of 0.83 indicates the remaining 14 categories carry meaningful mass. Labels mix regional and ethnolinguistic framings (e.g., "Arab World" alongside "Tibetan-Himalayan Peoples"), which an analyst should note for taxonomy consistency. Treatment: One-hot or target-encode for modelling; audit label taxonomy for overlap before grouping. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
16
top_value
South Asian Peoples
top_rate
0.255
cardinality
16
entropy
3.322
entropy_ratio
0.8306

PeopleID2

numeric foreign_key
PeopleID2 is a numeric people-identifier code with only 267 distinct values across 16,382 rows, so each ID repeats heavily and behaves more like a foreign key than a measurement. Values span 100 to 479 with no nulls or outliers, consistent with a bounded code rather than a quantity to model; the cardinality of 267 matches PeopleCluster, so it is probably that column's numeric ID. Treatment: Treat as a categorical key and left-join on it rather than using as a numeric feature. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
267
min
100
max
479
mean
283.7
median
268
std
114.6
q1
183
q3
402
iqr
219
skew
0.2451
kurtosis
-1.126
n_outliers
0
outlier_rate
0
zero_rate
0

ROP2

categorical feature
ROP2 is a categorical code field with 214 distinct alphanumeric values (e.g., 'A012', 'C0152') across 16,382 rows and no nulls. By name it sits between ROP1 and ROP3 in the same code family, so it is presumably a people-cluster-level code within the people taxonomy. The distribution is heavy-tailed: 'A012' alone covers 25.4% of rows and the next code 'C0152' another ~7%, while the entropy ratio sits at 0.76. 'A012' is also ROP1's top value at almost the same share, and the 214 levels do not match PeopleCluster's 267, so many rows may carry a bloc-level 'A' code where a cluster-level 'C' code would be expected; verify this before using ROP2 as a cluster key. Treatment: Group rare codes into an 'other' bucket and target/one-hot encode the high-frequency levels. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
214
top_value
A012
top_rate
0.2545
cardinality
214
entropy
5.901
entropy_ratio
0.7622

PeopleCluster

categorical feature
PeopleCluster is a categorical ethnographic grouping with 267 distinct values across 16,382 rows and no nulls. The distribution is broad (entropy ratio 0.86) but with a notable concentration: 'New Guinea' accounts for 6.95% of rows (1,139), followed by 'South Asia Hindu - other' (935) and 'South Asia Muslim - other' (592). The labels mix geographic, religious, and tribal descriptors, so several '... - other' buckets are doing heavy lifting. Treatment: Group rare clusters and target- or frequency-encode before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
267
top_value
New Guinea
top_rate
0.06953
cardinality
267
entropy
6.929
entropy_ratio
0.8596
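The treatment for high-cardinality columns such as PeopleCluster can be sketched with the standard library alone. The threshold and the toy labels 'Tiny A' and 'Tiny B' are illustrative assumptions; `bucket_rare` and `frequency_encode` are hypothetical helper names.

```python
from collections import Counter


def bucket_rare(values, min_count=50, other="other"):
    """Replace levels seen fewer than min_count times with a shared bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def frequency_encode(values):
    """Map each level to its share of rows."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]


# Toy column: one common cluster and two rare, made-up ones.
clusters = ["New Guinea"] * 3 + ["Tiny A", "Tiny B"]
bucketed = bucket_rare(clusters, min_count=2)
print(bucketed)
print(frequency_encode(bucketed))
```

Bucketing before encoding keeps the rare levels from each receiving a near-zero frequency of their own; when a held-out set is used, the counts should come from the training split only.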

PeopNameAcrossCountries

text foreign_key one_word duplicates
This column holds people-group or ethnicity names repeated across countries, with 10,377 unique labels over 16,382 rows and a 36.7% duplicate rate (6,005 repeats). Entries are short (median 8 chars, mean 1.57 words) and 59% are single-word labels like 'Deaf' (164), 'French' (82), or 'British' (81). The frequent fragments '(hindu' (516) and '(muslim' (424) alongside 'traditions)' (1038) suggest religious-tradition qualifiers in parentheses are a common naming convention, and the same group name recurs because it appears in multiple country contexts. Treatment: Treat as a people-group key; normalize casing/parentheticals and join with country to form a unique grouping key. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
10,377
len_min
1
len_max
42
len_mean
10.93
len_median
8
len_p95
25
word_mean
1.568
word_median
1
n_empty
0
n_duplicates
6,005
duplicate_rate
0.3666
vocab_size
10,446
readability_flesch_mean
49.45
emoji_rate
0
url_rate
0
one_word_rate
0.5899
allcaps_rate
0
boilerplate_rate
0

Population

numeric feature high_skew outliers
This is a Population count column with 16,382 rows (25 null) and only 1,708 unique values, suggesting many shared or rounded figures. The distribution is extremely heavy-tailed: median is 20,000 but the max is 912,955,000, with skew 91.04 and kurtosis 10,050.74, and 15.0% of rows flag as outliers. The mean (499,468) sits far above Q3 (93,000), indicating that a small number of very large people groups dominate the total. Treatment: Log-transform before any modelling or distance-based analysis. high · anthropic:claude-opus-4-7
n
16,382
nulls
25 (0.2%)
unique
1,708
min
10
max
9.13e+08
mean
4.995e+05
median
20,000
std
8.066e+06
q1
4,300
q3
93,000
iqr
88,700
skew
91.04
kurtosis
1.005e+04
n_outliers
2,455
outlier_rate
0.1501
zero_rate
0
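To illustrate why a log scale helps, the sketch below applies log10 to the quartiles and extremes reported above. `log10_population` is a hypothetical helper; it relies on the reported minimum of 10, so zeros are not handled.

```python
import math


def log10_population(value):
    """Return log10 of a population count, or None for a missing value."""
    if value is None:
        return None
    return math.log10(value)  # reported min is 10, so no zero handling here


# Summary statistics taken from the profile above.
summary = {
    "min": 10,
    "q1": 4_300,
    "median": 20_000,
    "q3": 93_000,
    "max": 912_955_000,
}
for name, value in summary.items():
    print(f"{name:>6}: {value:>11,} -> log10 = {log10_population(value):.2f}")
```

On this scale the column spans roughly 1 to 9, so the largest groups no longer swamp distance-based methods the way they do on the raw scale.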

Category

categorical feature
A 3-level categorical with no nulls across 16,382 rows, encoded as the strings "1", "2", and "3". Class "1" dominates at 53.1% (8,705 rows) and "2" is the minority at 1,360 rows, giving a moderately imbalanced distribution (entropy ratio 0.83). The numeric-string labels suggest an ordinal or coded category whose meaning is not self-evident from the values alone. Treatment: One-hot or ordinal encode; consider class-imbalance handling if used as a target. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
3
top_value
1
top_rate
0.5314
cardinality
3
entropy
1.313
entropy_ratio
0.8284

ROL3

text feature one_word short_text duplicates
ROL3 holds three-letter language codes in the style of ISO 639-3 (every value is exactly 3 characters and one word), with hin, eng, and ben most frequent. Cardinality is high, with 6,164 distinct codes across 16,382 rows and a 62.4% duplicate rate, plus 176 'xxx' entries that likely flag an undetermined or missing language (the count matches the 176 'Language unknown' rows in PrimaryLanguageName). Treatment: Treat as a categorical language code; one-hot or target-encode top codes, bucket the long tail as 'other', and keep 'xxx' as its own 'unknown' level. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
6,164
len_min
3
len_max
3
len_mean
3
len_median
3
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
10,218
duplicate_rate
0.6237
vocab_size
6,163
readability_flesch_mean
117.4
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.0001221
boilerplate_rate
0

PrimaryLanguageName

text feature one_word short_text duplicates
Holds the primary language name for each record, predominantly single-token entries (one_word_rate 0.73, word_mean 1.32) with Hindi (682), English (424) and Bengali (366) leading. The high duplicate_rate of 0.62 is expected for a categorical language label, and n_unique 6153 is close to ROL3's 6,164 codes, which points to roughly one name per language code rather than multi-language strings. The comma-terminated tokens in top_words ('arabic,' and 'punjabi,') are more plausibly inverted names of the kind seen in OfficialLang ('Arabic, Standard', 'Chinese, Mandarin') than comma-separated lists, and lengths up to 45 chars fit that reading. 176 rows are explicitly 'Language unknown'. Treatment: Treat as a categorical label paired with ROL3; normalize casing, map 'Language unknown' to missing, and do not split on commas without first checking the raw values. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
6,153
len_min
1
len_max
45
len_mean
9.081
len_median
7
len_p95
19
word_mean
1.322
word_median
1
n_empty
0
n_duplicates
10,229
duplicate_rate
0.6244
vocab_size
6,251
readability_flesch_mean
42.67
emoji_rate
0
url_rate
0
one_word_rate
0.7306
allcaps_rate
0
boilerplate_rate
0

PrimaryLanguageDialect

categorical metadata long_tail null_rate
This column records a primary language dialect per record, with 980 distinct values across 16,382 rows. It is 92.3% null, and even the most common value, 'Brazilian Portuguese', accounts for just 3.25% of non-nulls (41 occurrences); entropy ratio of 0.967 confirms an extremely flat, long-tailed distribution spanning dialects like Assyrian, Punjabi, Sinhalese, and Ta'izzi. The combination of sparse coverage and high cardinality limits its standalone modelling value. Treatment: Group into language families or a coarse bucket plus 'missing' indicator before encoding. high · anthropic:claude-opus-4-7
n
16,382
nulls
15,121 (92.3%)
unique
980
top_value
Brazilian Portuguese
top_rate
0.03251
cardinality
980
entropy
9.613
entropy_ratio
0.9674

NumberLanguagesSpoken

numeric feature high_skew outliers
Count of languages spoken, with 16382 non-null integer values ranging from 1 to 145 and a median of 1. The distribution is severely right-skewed (skew 7.44, kurtosis 83.76): Q1 is 1 and Q3 is 2, yet 2410 rows (14.7%) flag as outliers. Rows are people groups, not individuals, so a large count is possible for a widely dispersed group, but the max of 145 is extreme enough to verify against the source. Treatment: Cap or log-transform before modelling, and check the 145 maximum before treating it as a data-entry error. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
78
min
1
max
145
mean
2.764
median
1
std
5.985
q1
1
q3
2
iqr
1
skew
7.437
kurtosis
83.76
n_outliers
2,410
outlier_rate
0.1471
zero_rate
0

OfficialLang

categorical feature
OfficialLang is a categorical column listing the official language of each record, with 87 distinct values across 16,382 rows and almost no nulls (0.05%). English dominates at 22.4% (3,672), followed by Hindi (2,262) and French (1,478), giving a moderately concentrated distribution (entropy ratio 0.68). The presence of compound labels like 'Arabic, Standard' and 'Chinese, Mandarin' suggests a specific naming convention worth preserving when joining to language references. Treatment: Group long tail and one-hot or target-encode the top categories before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
8 (0.0%)
unique
87
top_value
English
top_rate
0.2243
cardinality
87
entropy
4.368
entropy_ratio
0.6779

SpeakNationalLang

unknown other skipped
Column 'SpeakNationalLang' was skipped by the profiler, so no type inference, uniqueness count, or value statistics are available. The only confirmed signals are 16382 rows with a 0.0 null rate. The name suggests a flag or category indicating whether the people group speaks the national language, but this cannot be verified from the evidence. Treatment: Re-profile or manually inspect to determine type before any downstream use. low · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
(not profiled)

BibleStatus

numeric feature
BibleStatus is an integer-coded categorical with only 6 distinct values spanning 0 to 5 across 16,382 complete rows. The distribution is heavily left-skewed (skew -1.22) with a mean of 3.86 and median of 4, indicating most records cluster at the high end while about 4.8% sit at zero. Despite being stored as numeric, the small cardinality and bounded range suggest an ordinal status code rather than a true measurement. Treatment: Treat as an ordinal category (one-hot or ordered encode) rather than a continuous numeric. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
6
min
0
max
5
mean
3.862
median
4
std
1.429
q1
3
q3
5
iqr
2
skew
-1.219
kurtosis
0.5856
n_outliers
0
outlier_rate
0
zero_rate
0.04786

BibleYear

categorical metadata null_rate
BibleYear appears to encode a translation's publication or revision span, typically formatted as a start-end year range like "1818-2022", with single years (e.g. "1954") appearing as a minority pattern. Cardinality is high (466 distinct values across 16382 rows) and the most common range covers only 8.75% of records, giving a flat distribution (entropy ratio 0.776). Notably, 52.45% of values are null, which the alert flags and which will limit any direct use. Treatment: Parse into separate start_year and end_year integer features and add a missingness indicator before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
8,592 (52.4%)
unique
466
top_value
1818-2022
top_rate
0.08755
cardinality
466
entropy
6.883
entropy_ratio
0.7765

NTYear

text metadata one_word allcaps null_rate short_text duplicates
NTYear appears to be a year-range metadata field (e.g. '1811-1998', '1380-2011') stored as short single-token strings, with 1072 unique values across 16382 rows. The column is messy: 30.47% null, a 90.59% duplicate rate among non-null values, and a sentinel value 'Yes' that shows up 670 times alongside the date ranges, indicating mixed semantics. Lengths cluster tightly (median 9, max 9), consistent with a 'YYYY-YYYY' format for most non-sentinel entries. Treatment: Parse into start_year/end_year integer columns, isolate the 'Yes' sentinel into a separate flag, and keep the 30% nulls as an explicit missing indicator rather than imputing years. medium · anthropic:claude-opus-4-7
n
16,382
nulls
4,991 (30.5%)
unique
1,072
len_min
3
len_max
9
len_mean
7.794
len_median
9
len_p95
9
word_mean
1
word_median
1
n_empty
0
n_duplicates
10,319
duplicate_rate
0.9059
vocab_size
1,072
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.9412
boilerplate_rate
0
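A possible parser for these mixed year fields, written as a sketch under stated assumptions: values are either a 'YYYY-YYYY' range, a single 'YYYY' year (assumed to occur here as it does in BibleYear), the 'Yes' sentinel, or null. `parse_year_field` is a hypothetical helper name.

```python
import re

# Either a single year or a start-end range.
YEAR_RANGE = re.compile(r"^(\d{4})(?:-(\d{4}))?$")


def parse_year_field(raw):
    """Return (start_year, end_year, yes_flag, is_missing) for one raw value."""
    if raw is None or raw.strip() == "":
        return None, None, False, True
    text = raw.strip()
    if text.casefold() == "yes":
        # Sentinel: something exists, but no year is recorded.
        return None, None, True, False
    match = YEAR_RANGE.match(text)
    if match is None:
        raise ValueError(f"unrecognized year value: {raw!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return start, end, False, False


for raw in ["1811-1998", "Yes", None]:
    print(repr(raw), "->", parse_year_field(raw))
```

Keeping the sentinel and the missing state as separate flags preserves the distinction between "exists, year unknown" and "nothing recorded", which imputing a year would erase.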

PortionsYear

text feature one_word allcaps short_text duplicates
PortionsYear appears to be a single-token field that mostly encodes year ranges (e.g. '1806-1962', '1530-1995') with strings up to 9 characters, but it is contaminated by a large 'Yes' bucket (1520 rows) that breaks the type. Nulls run at 17.92% and duplicate_rate is 0.87 across 1737 unique values out of 16382, so the column is highly repetitive. The mix of a boolean-like 'Yes' with hyphenated year spans suggests two different concepts were merged into one column. Treatment: Split into two fields: parse year ranges into start/end integers and isolate the 'Yes' values into a separate boolean before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
2,936 (17.9%)
unique
1,737
len_min
3
len_max
9
len_mean
7.595
len_median
9
len_p95
9
word_mean
1
word_median
1
n_empty
0
n_duplicates
11,709
duplicate_rate
0.8708
vocab_size
1,737
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.887
boilerplate_rate
0

TranslationNeedQuestionable

unknown other skipped
Column 'TranslationNeedQuestionable' was skipped by the profiler, so its kind, cardinality and value distribution are unknown. The only confirmed signals are 16382 rows with a 0.0 null rate. The name suggests a boolean or flag indicating uncertainty about translation need, but this cannot be verified from the evidence. Treatment: Re-profile or inspect raw values before deciding on use; do not model until kind is resolved. low · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
(not profiled)

JPScale

numeric feature
JPScale is an integer-valued ordinal feature spanning 1 to 5 with only 5 unique values across 16382 rows and no nulls; the dataset summary identifies it as the Joshua Project 'reachedness' rating. The distribution is spread across the range rather than peaked (kurtosis -1.66, skew 0.19, Q1 of 1 and Q3 of 4) with mean 2.68 and median 3, as expected for a coded rating rather than a continuous measurement. No outliers and no zeros are present. Treatment: Treat as an ordinal categorical (1-5) rather than continuous; one-hot or keep as ordered integer. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
5
min
1
max
5
mean
2.681
median
3
std
1.644
q1
1
q3
4
iqr
3
skew
0.1937
kurtosis
-1.658
n_outliers
0
outlier_rate
0
zero_rate
0

JPScalePC

categorical feature
JPScalePC is a 5-level categorical, a variant of the JPScale rating stored as strings ("1" through "5"), with no nulls across 16,382 rows; the meaning of the PC suffix is not evident from the profile. The distribution is bimodal at the extremes: "5" leads at 33.8% and "1" follows closely, while the middle codes "2" and "3" together account for far less, so groups cluster at the two ends of the scale rather than in the middle. Entropy ratio of 0.86 confirms the spread is wide but not uniform. Treatment: Treat as ordinal (1-5); keep as integer or one-hot depending on model. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
5
top_value
5
top_rate
0.3381
cardinality
5
entropy
1.997
entropy_ratio
0.8602

JPScalePGAC

categorical feature
JPScalePGAC is a low-cardinality categorical with 5 distinct string-encoded levels ('1' through '5') across 16382 rows and no nulls, consistent with an ordinal scale. It is a variant of the Joshua Project JPScale rating; PGAC presumably denotes a people-group-across-countries rollup, by analogy with PeopNameAcrossCountries and PopulationPGAC. The distribution is uneven: '1' dominates at 43.3% while '2' is the rarest at 908 rows, yet the entropy ratio is high at 0.86, indicating the remaining mass is spread broadly. The frequency order is non-monotonic (1 > 4 > 5 > 3 > 2), which echoes the two-ended shape of JPScalePC and is not in itself a data-quality problem. Treatment: Treat as ordinal: cast to integer and preserve order, or one-hot encode if downstream model is non-ordinal. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
5
top_value
1
top_rate
0.4329
cardinality
5
entropy
1.998
entropy_ratio
0.8603

LeastReached

categorical feature
Binary Y/N flag named LeastReached, fully populated across 16382 rows with only 2 distinct values. The split is fairly balanced — 'N' leads at 56.5% (9258) versus 7124 'Y' — yielding near-maximal entropy ratio of 0.988. No nulls or anomalies present. Treatment: Encode as a 0/1 boolean for modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.5651
cardinality
2
entropy
0.9877
entropy_ratio
0.9877
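The report does not define entropy_ratio, but the figures are consistent with Shannon entropy in bits divided by log2(cardinality). A minimal sketch under that assumption, using the 'N'/'Y' counts quoted for LeastReached (the function name is illustrative, not the profiler's):

```python
import math
from collections import Counter


def entropy_ratio(values):
    """Return (entropy_bits, entropy_ratio) for a sequence of category labels."""
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Normalise by the maximum possible entropy, log2(k); a constant column gets 0.
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio


# Counts quoted in the LeastReached summary: 9,258 'N' and 7,124 'Y'.
labels = ["N"] * 9258 + ["Y"] * 7124
h, r = entropy_ratio(labels)
```

For a binary column log2(2) = 1, so entropy and entropy_ratio coincide, which is why the two stat lines carry the same number.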

LeastReachedPC

categorical feature
A binary Y/N flag named LeastReachedPC, most likely the least-reached status evaluated at the people-cluster (PC) level. The split is moderately imbalanced at 67.3% 'N' versus 32.7% 'Y', with no nulls across 16,382 rows; the entropy ratio of 0.91 shows both classes are well represented. Treatment: Encode as a 0/1 indicator for modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.6732
cardinality
2
entropy
0.9116
entropy_ratio
0.9116

LeastReachedPGAC

categorical feature
Binary Y/N flag, most likely the least-reached status evaluated at the PGAC (probably 'people group across countries') level, with no missing values across 16,382 rows. The split is fairly balanced: 'N' leads at 56.7% (9,291) versus 7,091 'Y', giving a near-maximal entropy ratio of 0.987. Treatment: Encode as a 0/1 indicator for modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.5671
cardinality
2
entropy
0.987
entropy_ratio
0.987

GSEC

categorical feature
GSEC is a low-cardinality categorical with 8 distinct values across 16,382 rows and no nulls. The dominant value is the empty string at 40.0% (6,553 rows), followed by '1' at 4,852; the remaining codes ('0','2','3','4','5','6') split the rest, suggesting a coded classification where blanks likely encode 'not applicable' or missing-as-empty. Entropy ratio of 0.732 indicates moderate spread despite the empty-string plurality. Treatment: Recode the empty string as an explicit missing category and one-hot encode the remaining codes. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
8
top_value
(empty string)
top_rate
0.4
cardinality
8
entropy
2.197
entropy_ratio
0.7325
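The recode-then-encode treatment suggested for GSEC can be sketched as follows. The label "missing", the GSEC_ column prefix, and the sample values are illustrative assumptions, not part of the dataset:

```python
def recode_gsec(value, missing_label="missing"):
    """Map empty or whitespace-only GSEC strings (and None) to an explicit label."""
    if value is None or str(value).strip() == "":
        return missing_label
    return str(value).strip()


def one_hot(values, prefix="GSEC"):
    """One-hot encode a list of labels into a list of {column: 0/1} dicts."""
    levels = sorted(set(values))
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in levels} for v in values]


raw = ["", "1", "0", "", "6", None]  # illustrative values only
recoded = [recode_gsec(v) for v in raw]
encoded = one_hot(recoded)
```

Keeping the blank as its own level, rather than dropping those rows, preserves the 40% of records where the code is absent.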

HasAudioRecordings

categorical feature
Binary Y/N flag indicating whether a record has associated audio recordings, fully populated across 16382 rows. The distribution is imbalanced: 'Y' covers 82.3% (13479) versus 2903 'N', with entropy ratio 0.67. Treatment: Encode as a 0/1 boolean indicator for modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.8228
cardinality
2
entropy
0.6739
entropy_ratio
0.6739

NTOnline

categorical feature null_rate imbalance
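NTOnline is a categorical flag with only one observed value, 'Y', across all 11,705 non-null rows. The remaining 28.55% of rows (4,677) are null, so the only information the column carries is presence versus absence. If a null means 'not available' (presumably no New Testament online), that is a usable binary signal; if it means 'unknown', the column is effectively constant. Treatment: Recode to a presence indicator if the null semantics can be confirmed; otherwise drop as a zero-variance column. high · anthropic:claude-opus-4-7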
n
16,382
nulls
4,677 (28.5%)
unique
1
top_value
Y
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

RLG3

numeric feature
RLG3 is a small-integer code ranging from 1 to 9 with only 8 distinct values across 16,382 rows and no nulls. The profiler reports a broad, flat distribution (kurtosis -1.37, skew 0.13, IQR spanning 1 to 6) with mean 3.47 and median 4 and no outliers, although such moments mean little for a code. The 8 distinct values match the 8 categories of PrimaryReligion, so this is most likely the numeric religion code behind that label, with one integer in the 1-9 range unused; a cross-tabulation against PrimaryReligion would confirm it. Treatment: Treat as a nominal categorical (one-hot or target-encode) rather than as a continuous or ordinal numeric. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
8
min
1
max
9
mean
3.469
median
4
std
2.238
q1
1
q3
6
iqr
5
skew
0.1265
kurtosis
-1.366
n_outliers
0
outlier_rate
0
zero_rate
0

RLG3PC

numeric feature
RLG3PC is an integer-valued numeric column with only 8 distinct values bounded between 1 and 9, no nulls, and no zeros. The flat distribution (kurtosis -1.47, IQR spanning 1 to 6) and small cardinality suggest a category code rather than a continuous measurement; the 8 values match the 8 categories of PrimaryReligionPC, so it is most likely the numeric code for that label. The mean (3.21) sits above the median (2), in line with the mild positive skew (0.31). Treatment: Treat as a nominal category; one-hot or target-encode rather than scaling as continuous. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
8
min
1
max
9
mean
3.213
median
2
std
2.311
q1
1
q3
6
iqr
5
skew
0.3143
kurtosis
-1.466
n_outliers
0
outlier_rate
0
zero_rate
0

RLG3PGAC

numeric feature
RLG3PGAC holds an integer code in the range 1-9 with only 8 distinct values across 16,382 rows and zero nulls. The profiler reports a broad, flat distribution (kurtosis -1.36, std 2.25, IQR spanning 1 to 6) with near-zero skew, which points to a category code rather than a true continuous measurement; the 8 values match the 8 categories of PrimaryReligionPGAC, so it is most likely the numeric code for that label. No outliers and no zeros are present. Treatment: Treat as a nominal categorical feature; one-hot or target-encode rather than scaling as continuous. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
8
min
1
max
9
mean
3.486
median
4
std
2.252
q1
1
q3
6
iqr
5
skew
0.1259
kurtosis
-1.363
n_outliers
0
outlier_rate
0
zero_rate
0

PrimaryReligion

categorical feature
PrimaryReligion is a low-cardinality categorical label assigning each of 16,382 rows to one of 8 religious traditions, with no nulls. Christianity dominates at 39.4% (6,459 rows), followed by Islam (3,786) and Ethnic Religions (2,651); the long tail includes 189 'Unknown' and 124 'Other / Small' rows. Entropy ratio of 0.74 indicates a moderately balanced distribution rather than a single overwhelming class. Treatment: One-hot or target-encode; consider grouping 'Unknown' and 'Other / Small' if modelling sensitivity to rare classes. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
8
top_value
Christianity
top_rate
0.3943
cardinality
8
entropy
2.231
entropy_ratio
0.7436

PrimaryReligionPC

categorical feature
Categorical label of the dominant religion of a people-cluster (PC), with 8 distinct values and no nulls across 16,382 rows. Christianity leads at 47.6% (7,795), followed by Islam (3,658) and Hinduism (2,557), while a small 'Unknown' bucket (173) and 'Other / Small' (62) provide explicit catch-alls. Entropy ratio of 0.69 indicates moderate concentration rather than a single dominant class. Treatment: One-hot or target-encode; consider merging 'Unknown' and 'Other / Small' if downstream model is sensitive to rare levels. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
8
top_value
Christianity
top_rate
0.4758
cardinality
8
entropy
2.071
entropy_ratio
0.6904

PrimaryReligionPGAC

categorical label
Categorical label of the primary religion at the PGAC level (most likely 'people group across countries'), drawn from a fixed taxonomy of 8 values with no nulls across 16,382 rows. Christianity dominates at 39.4% (6,462), followed by Islam (3,766), Ethnic Religions (2,613) and Hinduism (2,348); Buddhism, Non-Religious, Unknown and Other/Small together account for under 8% of rows. Entropy ratio of 0.748 indicates a moderately concentrated but not degenerate distribution. Treatment: One-hot or target-encode; consider grouping the small Unknown/Other tail before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
8
top_value
Christianity
top_rate
0.3945
cardinality
8
entropy
2.245
entropy_ratio
0.7482

RLG4

numeric feature null_rate outliers
RLG4 is a sparse integer-valued column with only 20 distinct values spanning 10 to 41. Its 15,761 nulls and 20 distinct values match ReligionSubdivision exactly, so it is most likely the numeric code for that subdivision rather than an ordinal score or count. It is overwhelmingly missing (null_rate 0.9621), so just under 4% of rows (621) carry a value; among those the profiler reports right skew (0.94) and 33 flagged outliers (outlier_rate 0.053), statistics that carry little meaning if the values are category codes. The median is 20 with IQR 7, and no zeros are present. Treatment: Treat as a nominal code with nulls as an explicit 'not applicable' level; do not impute numerically. medium · anthropic:claude-opus-4-7
n
16,382
nulls
15,761 (96.2%)
unique
20
min
10
max
41
mean
18.49
median
20
std
6.519
q1
14
q3
21
iqr
7
skew
0.9361
kurtosis
1.156
n_outliers
33
outlier_rate
0.05314
zero_rate
0

ReligionSubdivision

categorical feature null_rate
A sub-categorisation of religion (e.g. Sunni/Shia branches, Buddhist schools, Judaism, Sikhism), populated only when a finer split applies. It is overwhelmingly null at 96.21%, so just 621 of 16,382 rows carry one of 20 values, with Sunni leading at 29.15% of the populated rows. Entropy ratio 0.74 indicates the non-null portion is reasonably spread rather than dominated by a single bucket. Treatment: Treat nulls as an explicit 'not applicable' category before one-hot encoding. high · anthropic:claude-opus-4-7
n
16,382
nulls
15,761 (96.2%)
unique
20
top_value
Sunni
top_rate
0.2915
cardinality
20
entropy
3.185
entropy_ratio
0.7369

PCIslam

numeric feature outliers
PCIslam is a numeric column bounded between 0 and 100, almost certainly a percentage share of Muslim population (or similar Islam-related composition metric) per record. The distribution is heavily zero-inflated: 63.2% of values are exactly 0 and the median is 0, while the mean is 23.2 and values stretch all the way to 100, producing a right skew of 1.27 and 3,438 flagged outliers (21.1%). Nulls are negligible (0.52%) and 1,117 distinct values suggest reasonably fine-grained measurement rather than a coarse bucket. Treatment: Treat as a zero-inflated proportion: model the zero mass separately or add a presence indicator before scaling. high · anthropic:claude-opus-4-7
n
16,382
nulls
86 (0.5%)
unique
1,117
min
0
max
100
mean
23.2
median
0
std
39.54
q1
0
q3
28
iqr
28
skew
1.273
kurtosis
-0.2575
n_outliers
3,438
outlier_rate
0.211
zero_rate
0.6322
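One way to realise the 'presence indicator plus transformed share' treatment for a zero-inflated percentage such as PCIslam is sketched below. The function name and the choice of log1p are assumptions for illustration; nulls are passed through as None rather than imputed:

```python
import math


def zero_inflated_features(pct):
    """Split a 0-100 percentage into (is_nonzero, log1p_value); None stays None."""
    if pct is None:
        return None, None
    if not 0 <= pct <= 100:
        raise ValueError(f"percentage out of range: {pct}")
    # The indicator captures the zero mass; log1p compresses the positive tail.
    return int(pct > 0), math.log1p(pct)


rows = [0, 0, 28.0, 100, None]  # illustrative values only
features = [zero_inflated_features(v) for v in rows]
```

A model can then learn the zero-versus-nonzero split and the size of the share separately.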

PCNonReligious

numeric feature high_skew outliers
PCNonReligious appears to be a percentage feature capturing the share of a population that is non-religious, ranging from 0 to 99. The distribution is dominated by zeros (75.2% of rows) with median, Q1, and Q3 all at 0, yet the mean is 3.42 and skew is 3.65 with kurtosis 15.4, indicating a long right tail. Roughly 24.8% of values flag as outliers, but because the IQR is 0 every non-zero value falls outside the usual IQR fences, so this figure is simply the non-zero share (0.7522 + 0.2478 = 1) rather than evidence of anomalous records. Treatment: Consider a zero-inflated treatment or log1p transform before modelling given the 75% zeros and heavy right tail. high · anthropic:claude-opus-4-7
n
16,382
nulls
66 (0.4%)
unique
223
min
0
max
99
mean
3.421
median
0
std
9.21
q1
0
q3
0
iqr
0
skew
3.648
kurtosis
15.43
n_outliers
4,043
outlier_rate
0.2478
zero_rate
0.7522
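The outlier counts on the sparse percentage columns are easier to read once the fence rule is written out. The sketch below assumes the profiler uses the common 1.5 × IQR rule (the report does not say so explicitly, although zero_rate + outlier_rate = 1 here is consistent with it); the sample values are illustrative:

```python
def tukey_outliers(values, q1, q3, k=1.5):
    """Return the values outside [q1 - k*IQR, q3 + k*IQR]."""
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


sample = [0, 0, 0, 0, 0, 0, 3, 12, 99]  # illustrative, not real rows
# With q1 = q3 = 0 the fences collapse to [0, 0], so every non-zero value is flagged.
flagged = tukey_outliers(sample, q1=0, q3=0)
```

On columns like this one, the outlier flag is therefore better read as "has any non-zero share" than as a data-quality signal.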

PCUnknown

numeric feature high_skew
PCUnknown is a numeric feature bounded between 0 and 100, almost certainly the percentage of a group's population whose religion is recorded as 'unknown'. It is overwhelmingly zero (zero_rate 0.9558) with median, q1, and q3 all at 0, yet the max reaches 100 with skew 9.07 and kurtosis 81.5. The 719 flagged outliers (4.4%) are essentially all of the non-zero rows, since the IQR is 0. The 583 distinct values (zero included) form a long, heavy tail rather than a smooth distribution. Treatment: Binarize (zero vs non-zero) or log1p-transform before modelling given the 95.6% zero mass and extreme skew. high · anthropic:claude-opus-4-7
n
16,382
nulls
104 (0.6%)
unique
583
min
0
max
100
mean
1.201
median
0
std
10.34
q1
0
q3
0
iqr
0
skew
9.066
kurtosis
81.52
n_outliers
719
outlier_rate
0.04417
zero_rate
0.9558

SecurityLevel

numeric feature
SecurityLevel takes only 3 distinct integer values spanning 0 to 2 with no nulls, so it is effectively an ordinal category encoded numerically (e.g., low/medium/high). The distribution is U-shaped rather than balanced: zeros make up 38.8% of rows, and together with the mean of 1.10 this implies roughly 12% of rows at level 1 and 49% at level 2, which also accounts for the strongly negative kurtosis (-1.82) and the quartiles sitting at 0 and 2. Treatment: Treat as an ordinal categorical and one-hot or ordinal-encode before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
3
min
0
max
2
mean
1.099
median
1
std
0.9307
q1
0
q3
2
iqr
2
skew
-0.1985
kurtosis
-1.816
n_outliers
0
outlier_rate
0
zero_rate
0.3883
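For a column that only takes the values 0, 1 and 2, the level shares can be recovered from the zero rate and the mean alone. The sketch below does that arithmetic for SecurityLevel and cross-checks the result against the reported standard deviation; it is a back-of-envelope derivation from the rounded summary stats, not a recount of the data:

```python
def three_level_shares(zero_rate, mean):
    """Solve p1 + p2 = 1 - zero_rate and p1 + 2*p2 = mean for levels 0, 1, 2."""
    nonzero = 1 - zero_rate
    p2 = mean - nonzero
    p1 = nonzero - p2
    return zero_rate, p1, p2


p0, p1, p2 = three_level_shares(zero_rate=0.3883, mean=1.099)
# Cross-check with Var = E[x^2] - mean^2, where E[x^2] = p1 + 4*p2.
std = (p1 + 4 * p2 - 1.099 ** 2) ** 0.5
```

The derived shares come out near 39% / 12% / 49%, and the implied standard deviation agrees with the reported 0.9307 to about three decimal places.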

LRTop100

categorical label imbalance
Binary Y/N flag indicating membership in some 'LR Top 100' set, with only 100 positive cases out of 16,382 rows (top_rate 0.9939). Extreme class imbalance and very low entropy (0.0537) make this nearly constant. No nulls, exactly 2 categories as expected. Treatment: Use stratified sampling or class-weighting if modelling; otherwise treat as rare-event indicator. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.9939
cardinality
2
entropy
0.05368
entropy_ratio
0.05368
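If LRTop100 is used as a target, class weighting is one way to handle the imbalance. The sketch below applies the familiar 'balanced' heuristic, n_samples / (n_classes × class_count), to the counts implied by the summary (100 'Y' out of 16,382 rows); whether this weighting suits a given model is an assumption to validate:

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# 100 positives are quoted in the summary; the remaining 16,282 rows are 'N'.
weights = balanced_class_weights({"N": 16282, "Y": 100})
```

The positive class ends up weighted roughly 160 times more heavily than the negative class, which shows how little signal 100 rows provide.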

PhotoAddress

text foreign_key one_word short_text duplicates
PhotoAddress holds single-token image filenames in the pattern p#####.jpg, with a max length of 13 and exactly one word per row. 5,718 of 16,382 rows (~35%) are empty strings rather than nulls, and overall duplicate rate is 67.8% — the same photo file is reused across many records (e.g., p19007.jpg appears 92 times). With only 5,277 unique values, this behaves like a foreign-key reference to an image asset, not a per-row unique pointer. Treatment: Treat empty strings as missing and join to an image/asset table on this filename rather than modelling it as text. high · anthropic:claude-opus-4-7
n
16,382
nulls
1 (0.0%)
unique
5,277
len_min
0
len_max
13
len_mean
6.523
len_median
10
len_p95
10
word_mean
1
word_median
1
n_empty
5,718
n_duplicates
11,104
duplicate_rate
0.6779
vocab_size
5,276
readability_flesch_mean
82.43
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

PhotoCredits

text metadata one_word duplicates
Attribution string for a photo credit, mostly short names or sources (mean length 11.5 chars, median 1 word). Highly repetitive: 90.2% duplicate rate across only 1,605 unique values, with 5,718 empty entries and 3,065 'Anonymous' tags dominating. Top words reveal stock/CC sources like Flickr, Wikimedia, Pixabay, and Shutterstock alongside named contributors. Treatment: Treat as low-cardinality categorical attribution; normalize empties/'Anonymous' and group rare credits before any analysis. high · anthropic:claude-opus-4-7
n
16,382
nulls
10 (0.1%)
unique
1,605
len_min
0
len_max
56
len_mean
11.54
len_median
9
len_p95
30
word_mean
2.081
word_median
1
n_empty
5,718
n_duplicates
14,767
duplicate_rate
0.902
vocab_size
2,658
readability_flesch_mean
-13.88
emoji_rate
0
url_rate
0.0004276
one_word_rate
0.5754
allcaps_rate
0.001649
boilerplate_rate
0

PhotoCreditURL

text metadata one_word url_heavy null_rate duplicates
This column stores photo credit URLs, with every non-empty value being a single token (one_word_rate 1.0) and 47.5% matching a URL pattern. It is sparsely populated: 33.08% of rows are null and a further 5,718 are empty strings, while 86.9% of non-null values are duplicates; a single site, https://www.asiaharvest.org, accounts for 736 rows. Only 1,434 distinct values appear across 16,382 rows, suggesting a small set of recurring image sources rather than per-record attribution. Treatment: Extract the domain as a categorical feature and drop the raw URL, which should not itself be used as a modelling input. high · anthropic:claude-opus-4-7
n
16,382
nulls
5,419 (33.1%)
unique
1,434
len_min
0
len_max
240
len_mean
25.67
len_median
0
len_p95
73
word_mean
1
word_median
1
n_empty
5,718
n_duplicates
9,529
duplicate_rate
0.8692
vocab_size
1,433
readability_flesch_mean
-261.7
emoji_rate
0
url_rate
0.4753
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
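Domain extraction for PhotoCreditURL can be done with the standard library. This is a minimal sketch: the function name is illustrative, and values without a scheme (the profile shows that not every non-empty value matches a URL pattern) are returned as None rather than guessed at:

```python
from urllib.parse import urlparse


def credit_domain(url):
    """Return the lower-cased host of a credit URL, or None if absent or not a URL."""
    if url is None or not url.strip():
        return None
    # .hostname is None when the value has no scheme/netloc, e.g. a bare word.
    return urlparse(url.strip()).hostname


examples = ["https://www.asiaharvest.org", "", None, "not-a-url"]
domains = [credit_domain(u) for u in examples]
```

The resulting host string can then be frequency-encoded or grouped, with rare hosts collapsed into an 'other' level.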

PhotoCreativeCommons

categorical feature
A binary Y/N flag indicating whether a photo carries a Creative Commons licence. The column is heavily skewed: 'N' covers 83.6% of the 16382 rows while 'Y' accounts for 2691, with a near-zero null rate of 0.0003. Treatment: Encode as a 0/1 boolean; expect class imbalance if used as a target. high · anthropic:claude-opus-4-7
n
16,382
nulls
5 (0.0%)
unique
2
top_value
N
top_rate
0.8357
cardinality
2
entropy
0.6445
entropy_ratio
0.6445

PhotoCopyright

categorical feature
Binary Y/N flag indicating whether a photo copyright applies, with 'N' dominating at 87.95% of 16,382 rows versus 1,972 'Y' values. Class imbalance is notable but not extreme, and nulls are negligible at 0.09%. Entropy ratio of 0.53 reflects this skew toward 'N'. Treatment: Encode as a 0/1 boolean; be aware of the ~1:7 class imbalance if used as a target. high · anthropic:claude-opus-4-7
n
16,382
nulls
15 (0.1%)
unique
2
top_value
N
top_rate
0.8795
cardinality
2
entropy
0.5308
entropy_ratio
0.5308

PhotoPermission

categorical feature
A permission flag for photo use, encoded as Y/N with 87.1% of the non-null rows set to 'N' and only 0.1% of 16,382 rows null. Cardinality is 3 because two records use lowercase 'y' alongside 2,111 uppercase 'Y', a casing inconsistency worth normalising. Entropy ratio of 0.35 (computed over 3 levels) reflects the heavy skew toward 'N'; whether 'N' means permission was refused or simply not recorded cannot be told from the profile. Treatment: Uppercase-normalise then map to a boolean before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
17 (0.1%)
unique
3
top_value
N
top_rate
0.8709
cardinality
3
entropy
0.5564
entropy_ratio
0.3511
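The casing fix for PhotoPermission amounts to normalising before mapping. A small sketch (the function name is illustrative; unexpected tokens are mapped to None rather than coerced):

```python
def permission_flag(value):
    """Map 'Y'/'y' to True, 'N'/'n' to False, and null, blank or other values to None."""
    if value is None:
        return None
    token = value.strip().upper()
    return {"Y": True, "N": False}.get(token)


raw = ["N", "Y", "y", None, ""]  # illustrative values only
flags = [permission_flag(v) for v in raw]
```

After this step the column has two levels plus missing, so the cardinality of 3 reported above disappears.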

ProfileTextExists

categorical feature
Binary Y/N flag indicating whether a profile has text, with no nulls across 16382 rows. Roughly 79.5% are 'Y' (13018) versus 'N' (3364), an imbalance worth noting but not extreme. Entropy ratio of 0.73 confirms a moderately skewed but informative distribution. Treatment: Encode as a 0/1 boolean for modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.7947
cardinality
2
entropy
0.7325
entropy_ratio
0.7325

CountOfCountries

numeric feature high_skew outliers
Counts the number of countries associated with each record, ranging from 1 to 164 with a median of 1 and Q3 of just 4. The distribution is severely right-skewed (skew 5.15, kurtosis 32.05) and 19.2% of rows flag as outliers, indicating a long tail where a minority of records span dozens of countries (up to 164) while most cover only one. Treatment: Log-transform or bin (e.g. 1, 2-4, 5+) before modelling to tame the heavy tail. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
48
min
1
max
164
mean
8.328
median
1
std
20.64
q1
1
q3
4
iqr
3
skew
5.152
kurtosis
32.05
n_outliers
3,139
outlier_rate
0.1916
zero_rate
0
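The binning suggested for CountOfCountries (1, 2-4, 5+) can be written directly. The bin edges come from the treatment note; the function name and labels are illustrative:

```python
def country_bin(count):
    """Bucket a country count into '1', '2-4' or '5+'."""
    if count < 1:
        raise ValueError(f"count must be >= 1, got {count}")
    if count == 1:
        return "1"
    if count <= 4:
        return "2-4"
    return "5+"


bins = [country_bin(c) for c in (1, 2, 4, 5, 164)]
```

With a median of 1 and Q3 of 4, the first two bins cover at least three quarters of the rows and the '5+' bin absorbs the long tail.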

CountOfProvinces

unknown other skipped
The column 'CountOfProvinces' was skipped by the profiler, so beyond a row count of 16382 and a null rate of 0.0 there is no evidence about its distribution, type, or uniqueness. The name suggests an integer count of provinces per record, but this cannot be confirmed from the payload. No further signal is available. Treatment: Re-run the profiler on this column to recover type and distribution before any downstream use. low · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique

Longitude

numeric feature high_skew
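This is a Longitude coordinate column with at least one corrupted value: valid longitudes must lie within [-180, 180], yet the max is 2350588.0 and the mean (189.33) already exceeds the legal range. The skew of 127.98 is almost exactly what a single dominating value produces in 16,382 rows ((n-2)/sqrt(n-1) ≈ 127.98), and that one value alone would account for essentially all of the standard deviation, so the statistics are consistent with a single malformed entry rather than widespread contamination. The 207 flagged outliers (1.26%) are a separate matter: the lower IQR fence falls near -120, so most of them are probably valid far-western longitudes. The median of 55.45 is plausible. Treatment: Set values outside [-180, 180] to missing (or correct them at source) before any geospatial use; do not clip, since a clipped value would still be a wrong location. high · anthropic:claude-opus-4-7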
n
16,382
nulls
0 (0.0%)
unique
15,889
min
-179.3
max
2.351e+06
mean
189.3
median
55.45
std
1.836e+04
q1
8.673
q3
94.64
iqr
85.97
skew
128
kurtosis
1.638e+04
n_outliers
207
outlier_rate
0.01264
zero_rate
0
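A range check for Longitude, plus a sanity check on how far one bad value can move the summary statistics, can be sketched as below. Treating invalid values as missing is an assumed handling choice, and the closed-form expressions describe the idealised case where a single value dwarfs all others:

```python
import math


def clean_longitude(lon):
    """Return lon if it lies in [-180, 180]; otherwise None (treated as missing)."""
    if lon is None or math.isnan(lon) or not -180.0 <= lon <= 180.0:
        return None
    return lon


n, max_lon = 16382, 2350588.0  # row count and max from the profile
# Population skewness when one value dominates n - 1 near-identical ones.
skew_single = (n - 2) / math.sqrt(n - 1)
# That value's approximate contribution to the standard deviation.
std_single = max_lon * math.sqrt(n - 1) / n

cleaned = [clean_longitude(v) for v in (55.45, -179.3, max_lon)]
```

Both derived figures land close to the profiled skew (128) and std (1.836e+04), which supports reading the damage as one stray value rather than many.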

Latitude

numeric feature
Geographic latitude in decimal degrees, with values spanning -54.94 to 78.21 — well within the valid [-90, 90] range. Distribution is nearly symmetric (skew -0.12) and slightly flat (kurtosis -0.26), centered around a median of 17.03 and mean of 16.44, suggesting a tropical/northern-hemisphere bias. Near-unique (15851 of 16382) with no nulls and only 39 mild outliers, consistent with per-record geocoordinates rather than a categorical region label. Treatment: Pair with longitude for geospatial features; consider binning by hemisphere or clustering rather than using raw degrees in linear models. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
15,851
min
-54.94
max
78.21
mean
16.44
median
17.03
std
20.47
q1
2.072
q3
29.88
iqr
27.81
skew
-0.118
kurtosis
-0.2579
n_outliers
39
outlier_rate
0.002381
zero_rate
0

Ctry

categorical feature
Country names stored as full strings, with 238 distinct values across 16,382 rows and no nulls. India dominates at 13.8% (2,262 rows), followed by Papua New Guinea (883) and Indonesia (788), so the rows are weighted toward South Asia, Melanesia and Southeast Asia; since each row is a people group in a country, this reflects the number of distinct groups recorded per country rather than population. Entropy ratio of 0.79 indicates fairly broad spread despite the long tail. Treatment: Group long-tail countries or target/frequency-encode before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
238
top_value
India
top_rate
0.1381
cardinality
238
entropy
6.225
entropy_ratio
0.7885

IndigenousCode

categorical feature
IndigenousCode is a binary Y/N flag, fully populated across all 16,382 rows with only 2 distinct values. The class split is uneven: 'Y' covers 74.8% of records against 'N' for the remainder, yielding entropy of 0.81. The imbalance is notable but not extreme. Treatment: Encode as a binary indicator; consider class imbalance if used as a target. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.7483
cardinality
2
entropy
0.8139
entropy_ratio
0.8139

PercentAdherents

text feature one_word allcaps short_text duplicates
This is a numeric percentage field (PercentAdherents) stored as text, with all 16382 values being single tokens of length 5-7 like '0.000' or '95.000'. The distribution is heavily concentrated at zero (4007 of 16382 rows) and shows strong duplication (duplicate_rate 0.924, only 1248 unique values). Despite the 'allcaps' and 'one_word' alerts, these are just numeric strings, not categorical text. Treatment: Cast to float and treat as a numeric feature; consider zero-inflation handling given the spike at 0.000. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
1,248
len_min
5
len_max
7
len_mean
5.534
len_median
6
len_p95
6
word_mean
1
word_median
1
n_empty
0
n_duplicates
15,134
duplicate_rate
0.9238
vocab_size
1,248
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0
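Casting a text-typed percentage such as PercentAdherents to a number is straightforward, but it is worth validating the range at the same time. A minimal sketch (the function name is illustrative; blanks and None are kept as missing):

```python
def parse_percent(text):
    """Convert a string such as '95.000' to a float in [0, 100]; None or blank -> None."""
    if text is None or not text.strip():
        return None
    value = float(text)
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"percentage out of range: {text!r}")
    return value


parsed = [parse_percent(t) for t in ("0.000", "95.000", None)]
```

The same helper applies to the other text-typed percentage columns in this report, which share the 'd.ddd' format.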

PercentChristianPC

categorical feature
Stored as a categorical string, this column appears to be the percentage of Christians at the people-cluster (PC) level, with 246 distinct values across 16,382 rows and no nulls. The distribution is highly repetitive: the modal value '90.061' covers 6.95% of rows and the top ten values include both very high shares (90.061, 82.325, 76.515) and near-zero shares (0.482, 0.111, 0.000), consistent with a small set of cluster-level percentages repeated on every row in the cluster. Entropy ratio of 0.86 indicates the values are fairly evenly spread across the 246 levels despite the heavy mode. Treatment: Cast to float and treat as a group-level numeric feature rather than a category. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
246
top_value
90.061
top_rate
0.06953
cardinality
246
entropy
6.853
entropy_ratio
0.8628

NaturalName

text label one_word duplicates
NaturalName appears to be a people-group or ethno-linguistic label, dominated by single-word entries (one_word_rate 0.555) and short strings (len_mean 10.9, word_mean 1.59). Roughly a third of rows repeat (duplicate_rate 0.344, 5,645 duplicates across 10,737 uniques), with 'Deaf' (164), 'French' (82), and 'British' (80) leading. The top words include fragments such as 'traditions)', '(hindu', and '(muslim' occurring 500-1000+ times; these come from the profiler splitting compound names such as 'X (Hindu traditions)' on whitespace, so the underlying values are most likely intact and the parenthetical is a qualifier on the base name. Treatment: Keep the full string as the grouping key; optionally split the parenthetical qualifier into its own feature. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
10,737
len_min
1
len_max
41
len_mean
10.91
len_median
9
len_p95
25
word_mean
1.585
word_median
1
n_empty
0
n_duplicates
5,645
duplicate_rate
0.3446
vocab_size
11,164
readability_flesch_mean
50.53
emoji_rate
0
url_rate
0
one_word_rate
0.5554
allcaps_rate
0
boilerplate_rate
0
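Separating a trailing parenthetical qualifier from a NaturalName value can be done with one regular expression. The pattern below assumes names of the form 'X (Hindu traditions)' with a single trailing parenthetical; the function name is illustrative:

```python
import re

# Base name, optional whitespace, then one trailing "(...)" group.
_QUALIFIER = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")


def split_qualifier(name):
    """Return (base_name, qualifier); qualifier is None when no parenthetical exists."""
    text = name.strip()
    match = _QUALIFIER.match(text)
    if not match:
        return text, None
    return match.group("base"), match.group("qualifier")


pairs = [split_qualifier(n) for n in ("X (Hindu traditions)", "Deaf")]
```

Names with nested or multiple parentheticals would need a more careful rule; the sketch only covers the simple trailing case.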

NaturalPronunciation

text metadata one_word null_rate duplicates
This column holds phonetic respellings of people-group names (e.g. 'AY-zhun', 'chai-NEEZ', 'kor-EE-un'), with hyphenated syllables and capitalised stress markers indicating an ad-hoc pronunciation guide. It is sparse and repetitive: 69.63% null, 69.41% one-word entries, and 61.15% duplicates across only 1,933 unique values out of 16,382 rows. The most frequent value, 'def', appears 164 times, the same count as 'Deaf' in NaturalName, so it is almost certainly the respelling of 'Deaf' rather than a placeholder. Treatment: Treat as a display-only lookup keyed to NaturalName; exclude from modelling rather than imputing the 69.63% of missing values. high · anthropic:claude-opus-4-7
n
16,382
nulls
11,407 (69.6%)
unique
1,933
len_min
2
len_max
57
len_mean
12.13
len_median
11
len_p95
26
word_mean
1.345
word_median
1
n_empty
0
n_duplicates
3,042
duplicate_rate
0.6115
vocab_size
2,039
readability_flesch_mean
61.69
emoji_rate
0
url_rate
0
one_word_rate
0.6941
allcaps_rate
0.000402
boilerplate_rate
0

PercentChristianPGAC

text feature one_word allcaps short_text duplicates
This column holds percentages (the Christian share at the PGAC level, judging by the column name) stored as text rather than numeric, with values like "0.000", "95.000", "90.000" that are 5-7 characters long. The distribution has a large spike at zero: 3,121 of 16,382 rows (19%) are "0.000" and the duplicate rate is 88%, leaving only 1,954 unique values. It is flagged as allcaps/one-word only because the profiler treated numeric strings as tokens. Treatment: Cast to float and treat as a numeric percentage feature. high · anthropic:claude-opus-4-7
n
16,382
nulls
15 (0.1%)
unique
1,954
len_min
5
len_max
7
len_mean
5.528
len_median
6
len_p95
6
word_mean
1
word_median
1
n_empty
0
n_duplicates
14,413
duplicate_rate
0.8806
vocab_size
1,954
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0

PercentEvangelical

text feature one_word allcaps short_text duplicates
This is a numeric percentage (share evangelical) stored as text, with values such as '0.000' and '6.000' that are 5-6 characters long and one token each. The distribution has a large spike at zero: 4,205 of 16,382 rows are '0.000', and the duplicate rate is 0.9315 across only 1,047 unique values. Null rate is 0.0668, so roughly 7% of rows are missing. Treatment: Cast to float and treat as a zero-inflated numeric feature. high · anthropic:claude-opus-4-7
n
16,382
nulls
1,095 (6.7%)
unique
1,047
len_min
5
len_max
6
len_mean
5.226
len_median
5
len_p95
6
word_mean
1
word_median
1
n_empty
0
n_duplicates
14,240
duplicate_rate
0.9315
vocab_size
1,047
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0

PercentEvangelicalPC

categorical feature
Numeric percentages (0.000 to 28.097) describing evangelical share, but stored as strings with only 228 distinct values across 16,382 rows, which suggests a precomputed people-cluster (PC) statistic repeated on every row in the cluster rather than a per-row measurement. The top value '20.481' covers 7.0% of rows, consistent with large clusters sharing one value. Entropy ratio is 0.87, so the distribution is fairly spread but discretised. Treatment: Cast to float and treat as a group-level numeric feature; do not one-hot encode. high · anthropic:claude-opus-4-7
n
16,382
nulls
166 (1.0%)
unique
228
top_value
20.481
top_rate
0.07024
cardinality
228
entropy
6.782
entropy_ratio
0.8658

PercentEvangelicalPGAC

text feature one_word allcaps short_text duplicates
This is a numeric percentage (Percent Evangelical, PGAC) stored as text — every value is a single token of 5-6 characters formatted like '0.000', '4.000', '1.801'. The distribution is heavily zero-inflated: 3,272 of 16,382 rows are '0.000', duplicate rate is 0.896 across only 1,624 unique values, and 4.5% are null. Despite the column being typed as text, there is no real language content here. Treatment: Cast to float and treat as a zero-inflated numeric feature. high · anthropic:claude-opus-4-7
n
16,382
nulls
743 (4.5%)
unique
1,624
len_min
5
len_max
6
len_mean
5.235
len_median
5
len_p95
6
word_mean
1
word_median
1
n_empty
0
n_duplicates
14,015
duplicate_rate
0.8962
vocab_size
1,624
readability_flesch_mean
121.2
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0

PCBuddhism

numeric feature high_skew outliers
PCBuddhism appears to be a per-record percentage feature for Buddhist composition, ranging 0–100 with mean 3.77 and median 0. The distribution is overwhelmingly zero (zero_rate 0.89) with q1=q3=0 and iqr=0, yet ~11% of rows are outliers and skew (4.77) and kurtosis (21.99) are extreme. Treat this as a sparse, heavy-tailed minority share where most populations have no Buddhist presence but a long tail reaches 100. Treatment: Add a zero-vs-nonzero indicator and log1p-transform the nonzero share before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
104 (0.6%)
unique
1,052
min
0
max
100
mean
3.769
median
0
std
16.75
q1
0
q3
0
iqr
0
skew
4.775
kurtosis
22
n_outliers
1,798
outlier_rate
0.1105
zero_rate
0.8895

PCEthnicReligions

numeric feature outliers
PCEthnicReligions appears to be a percentage feature (0–100) capturing the share of ethnic/folk religion adherents per record. Just over half the rows are exactly zero (zero_rate 0.5045) and the median and Q1 are both 0, yet values stretch all the way to 100, producing strong right skew (1.65) and 1,967 flagged outliers (12.05%). The distribution is effectively zero-inflated rather than continuous. Treatment: Model as zero-inflated: add an is_nonzero indicator and log1p-transform the positive values. high · anthropic:claude-opus-4-7
n
16,382
nulls
59 (0.4%)
unique
978
min
0
max
100
mean
17.6
median
0
std
29.02
q1
0
q3
25
iqr
25
skew
1.654
kurtosis
1.404
n_outliers
1,967
outlier_rate
0.1205
zero_rate
0.5045

PCHinduism

numeric feature high_skew outliers
PCHinduism appears to be a per-record percentage share of Hinduism (0–100), with max 100.0 and min 0.0. The distribution is overwhelmingly zero (zero_rate 0.8343) so Q1, median, and Q3 are all 0.0, yet the mean is 14.01 with std 33.87, indicating a small minority of records carry very high values. Skew 2.06 and 16.57% flagged outliers confirm a heavy right tail rather than dirty data. Treatment: Treat as zero-inflated proportion: add a nonzero indicator and consider a log1p or sqrt transform before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
105 (0.6%)
unique
1,412
min
0
max
100
mean
14.01
median
0
std
33.87
q1
0
q3
0
iqr
0
skew
2.058
kurtosis
2.3
n_outliers
2,697
outlier_rate
0.1657
zero_rate
0.8343

PCOtherSmall

numeric feature high_skew outliers
PCOtherSmall is a numeric percentage (0-100), most likely the share of a group's population following religions in the 'Other / Small' category, with 89.3% of values exactly zero and median, Q1, and Q3 all at 0.0. The remaining mass is highly skewed (skew 11.0, kurtosis 124.0) with a max of 100.0; the 10.7% flagged as outliers are the non-zero rows, since the IQR is 0. Treatment: Treat as zero-inflated: add a binary is_nonzero flag and log1p-transform the positive tail before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
104 (0.6%)
unique
908
min
0
max
100
mean
0.9613
median
0
std
8.299
q1
0
q3
0
iqr
0
skew
11
kurtosis
124
n_outliers
1,749
outlier_rate
0.1074
zero_rate
0.8926

RegionCode

numeric feature
RegionCode is an integer-valued field ranging from 1 to 12 with only 12 unique values across 16382 rows and no nulls. The flat distribution (kurtosis -1.20, skew 0.23, no outliers) and small cardinality indicate a categorical region identifier encoded numerically rather than a true numeric measure. Treatment: Cast to categorical and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
12
min
1
max
12
mean
5.935
median
5
std
3.42
q1
3
q3
8
iqr
5
skew
0.231
kurtosis
-1.201
n_outliers
0
outlier_rate
0
zero_rate
0

PopulationPGAC

numeric feature high_skew outliers
PopulationPGAC is a population count spanning 10 to 925,129,800 with a median of just 88,000. Given the PGAC suffix it is most likely a people group's total population across all countries, which would explain why only 2,250 distinct values appear across 16,382 rows: the same total repeats on each country row for that group. The distribution is extremely right-skewed (skew 15.15, kurtosis 262.66) and 19.2% of rows flag as outliers, with the mean (8.8M) two orders of magnitude above the median. Nulls are negligible (0.09%) and there are no zeros, but the gap between Q3 (1.39M) and the max indicates a long heavy tail of very large groups. Treatment: log-transform before any modelling or distance-based comparison. high · anthropic:claude-opus-4-7
n
16,382
nulls
15 (0.1%)
unique
2,250
min
10
max
9.251e+08
mean
8.812e+06
median
88,000
std
5.114e+07
q1
8,800
q3
1.386e+06
iqr
1.377e+06
skew
15.15
kurtosis
262.7
n_outliers
3,145
outlier_rate
0.1922
zero_rate
0
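
To see what the suggested log transform does to this column, here is a small sketch applied to the summary statistics reported above. The helper name `log10_population` is hypothetical, and base 10 is chosen only because it reads as orders of magnitude.

```python
import math

def log10_population(values):
    """log10 of each population; None passes through (the column has 15 nulls)."""
    return [None if v is None else math.log10(v) for v in values]

# min, q1, median, q3, and max as reported in the card above.
reported = [10, 8_800, 88_000, 1_386_000, 925_129_800]
logged = log10_population(reported)
```

On the raw scale the max is about 10,500 times the median; on the log scale it sits about 4 units above it, which is far easier for distance-based methods to handle.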

Frontier

categorical feature
Binary Y/N flag named 'Frontier', fully populated across 16382 rows with only 2 distinct values. The 'N' class dominates at 70.9% versus 29.1% 'Y', giving an entropy ratio of 0.87 — moderately imbalanced but well within usable range. Treatment: Encode as 0/1 boolean and use directly as a feature. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.709
cardinality
2
entropy
0.87
entropy_ratio
0.87
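
The entropy ratio quoted above can be reproduced from the class shares. The sketch below assumes the profiler uses Shannon entropy in bits, normalised by log2 of the class count; that assumption matches the reported numbers but is not confirmed by the report. The last line shows the 0/1 encoding on a toy list.

```python
import math

def entropy_ratio(counts):
    """Shannon entropy in bits divided by log2(number of classes)."""
    if len(counts) < 2:
        return 0.0
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(counts))

# Shares from the card above: 70.9% 'N', 29.1% 'Y'.
frontier_ratio = entropy_ratio([709, 291])

# Y/N to 0/1 on a toy list.
encoded = [1 if v == "Y" else 0 for v in ["N", "Y", "N"]]
```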

MapAddress

text foreign_key one_word short_text duplicates
MapAddress holds single-token PNG filenames (e.g. m00320.png), almost certainly references to map image assets. Over half the column is empty (8,728 of 16,382 rows), and the 6,029 distinct values (6,028 filenames plus the empty string) give a duplicate rate of 63.2%, most of it contributed by the repeated empties. Every non-empty value is one word with max length 13, so this behaves like a sparse foreign key to a map asset rather than free text. Treatment: Treat as a categorical asset reference; mark empties as 'none' and join to the map asset table. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
6,029
len_min
0
len_max
13
len_mean
5.153
len_median
0
len_p95
13
word_mean
1
word_median
1
n_empty
8,728
n_duplicates
10,353
duplicate_rate
0.632
vocab_size
6,028
readability_flesch_mean
13.52
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

HasJesusFilm

categorical feature
Binary Y/N flag indicating whether the JESUS Film is available for each record, with no nulls across 16,382 rows. The split is roughly 2:1 in favour of 'Y' (10,816 vs 5,566; top_rate 0.660), giving a high entropy ratio of 0.925 — informative but mildly imbalanced. Treatment: Encode as a 0/1 boolean for modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.6602
cardinality
2
entropy
0.9246
entropy_ratio
0.9246

Nomadic

categorical feature imbalance
Binary Y/N flag indicating whether a record is 'Nomadic', with no nulls across 16382 rows. The distribution is severely imbalanced: 'N' covers 98.1% (16071) versus only 311 'Y' cases, yielding an entropy ratio of just 0.136. Treatment: Encode as a boolean; consider class-weighting or resampling since positives are only ~1.9%. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.981
cardinality
2
entropy
0.1357
entropy_ratio
0.1357
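
If class-weighting is chosen for the Nomadic flag, one common heuristic is weight = n_samples / (n_classes * class_count), the same formula scikit-learn uses for its "balanced" option. A minimal sketch with the counts from the card above (the function name is illustrative):

```python
def balanced_class_weights(counts):
    """weight = n_samples / (n_classes * class_count) for each class label."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Counts from the card above: 16,071 'N' and 311 'Y'.
weights = balanced_class_weights({"N": 16071, "Y": 311})
```

With these counts each 'Y' row weighs roughly 52 times as much as an 'N' row, which is worth keeping in mind when judging how noisy the minority class might be.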

NomadicTypeDescription

categorical metadata null_rate
Categorical descriptor of nomadic livelihood type, with six values combining three base categories (Agro-Pastoralists, Service or Trade, Hunter-Gatherers) singly or in pairs. The column is 98.1% null, populated for only 311 of 16,382 rows (the same count as the 'Y' records in Nomadic), and among those Agro-Pastoralists dominates at 68.2% of non-nulls. The sparsity makes this effectively a rare annotation rather than a general feature. Treatment: Treat as sparse metadata; impute an 'Unknown' category or drop unless modelling the populated subset. high · anthropic:claude-opus-4-7
n
16,382
nulls
16,071 (98.1%)
unique
6
top_value
Agro-Pastoralists
top_rate
0.6817
cardinality
6
entropy
1.341
entropy_ratio
0.5187

PhotoCCVersionText

categorical metadata
This column records the Creative Commons license version attached to a photo, with 17 distinct values across 16,382 rows and no nulls. It is dominated by empty strings at 83.6% (13,688 rows), leaving only ~16% with an actual license tag — the most common being 'CC BY 2.0' (661) and 'CC BY-NC-SA 2.0' (440). Low entropy ratio (0.28) confirms the field is sparse in practice despite zero technical nulls. Treatment: Treat empty string as missing and one-hot encode the remaining license categories. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
17
top_value
(empty string)
top_rate
0.8356
cardinality
17
entropy
1.137
entropy_ratio
0.2781

PhotoCCVersionURL

categorical metadata
This column holds Creative Commons license URLs associated with photos, drawn from a closed vocabulary of 17 distinct values. It is overwhelmingly empty: 13,688 of 16,382 rows (top_rate 0.836) carry the blank string rather than a license, leaving CC BY 2.0 (661) and CC BY-NC-SA 2.0 (440) as the most common actual licenses. Entropy ratio of 0.278 confirms the distribution is highly concentrated on the empty value. Treatment: Treat blank as missing and bucket the remaining license URLs into a low-cardinality categorical. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
17
top_value
(empty string)
top_rate
0.8356
cardinality
17
entropy
1.137
entropy_ratio
0.2781

MapCredits

categorical metadata long_tail
Attribution string for the map asset associated with each record, naming data, geography, and design contributors. Over half the rows (top_rate 0.533, 8,733 of 16,382) carry an empty string, and the remaining rows are spread across 198 distinct credit lines, some of which are near-duplicates: for example, two variants of the Omid/UNESCO/GMI credit differ only by a trailing period (2,228 vs 864). Entropy ratio of 0.357 and the long_tail alert confirm a few dominant phrasings plus a sparse tail. Treatment: Treat as provenance metadata; normalize whitespace/punctuation to collapse duplicate credit strings, and exclude from modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
199
top_value
(empty string)
top_rate
0.5331
cardinality
199
entropy
2.726
entropy_ratio
0.357
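
The whitespace and punctuation normalisation suggested for MapCredits might look like the sketch below. The credit strings are hypothetical stand-ins (the report does not reproduce the real ones), and `normalize_credit` is an illustrative name.

```python
import re

def normalize_credit(text):
    """Collapse whitespace runs, then drop trailing periods and spaces."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(". ")

# Hypothetical credit lines differing only in spacing and a trailing period.
raw = ["Data: Omid  / UNESCO / GMI.", "Data: Omid / UNESCO / GMI", ""]
distinct = {normalize_credit(s) for s in raw if s.strip()}
```

Stripping trailing periods would also shorten credits that legitimately end in an abbreviation such as "Inc.", so it is worth reviewing the collapsed values before overwriting the column.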

MapCreditURL

categorical metadata long_tail imbalance
Optional attribution URL for the source map of each record, blank in 15,891 of 16,382 rows (top_rate 0.97). Only 50 distinct non-blank values populate the remaining 3% (491 rows), led by asiaharvest.org (146) and cartomission.com (117), giving a very low entropy_ratio of 0.052. The mix of http URLs and a mailto: address suggests inconsistent data entry rather than a controlled vocabulary. Treatment: Drop from modelling; retain only as a provenance link for the ~3% of rows that carry a value. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
51
top_value
(empty string)
top_rate
0.97
cardinality
51
entropy
0.2954
entropy_ratio
0.05207

MapCopyright

categorical feature
A near-binary flag (with blanks) indicating map copyright status, dominated by 'N' at 86.5% of 16,382 rows. Only 118 records carry 'Y' and 2,100 are empty strings, giving just 3 distinct values and an entropy ratio of 0.387. The empty category is large enough that it should be treated as its own level rather than silently coerced. Treatment: Encode as a 3-level categorical (N / Y / blank); low signal due to severe imbalance. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
3
top_value
N
top_rate
0.8646
cardinality
3
entropy
0.6126
entropy_ratio
0.3865

MapCCVersionText

categorical metadata imbalance
This column records Creative Commons licence versions for map content, but 16347 of 16382 rows (top_rate 0.9979) are empty strings, leaving only 35 rows spread across five actual licences (CC0 1.0, CC BY-SA 3.0, CC BY 3.0, CC BY 2.0, CC BY-SA 4.0). Entropy ratio of 0.0099 confirms the column carries almost no information. Note that nulls are reported as 0.0 because the missing values are stored as empty strings rather than true nulls. Treatment: Drop or collapse to a binary has_licence flag; too sparse to use as a feature. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
6
top_value
(empty string)
top_rate
0.9979
cardinality
6
entropy
0.02557
entropy_ratio
0.009891
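
Because the missing values here are empty strings rather than true nulls, a null count alone will not catch them. A small sketch of treating blanks as missing and collapsing the column to the suggested binary flag (helper names are illustrative):

```python
def blank_to_none(value):
    """Map None, empty, or whitespace-only strings to None so they count as missing."""
    return value if value and value.strip() else None

def has_licence(value):
    """1 when a licence string is present, 0 when the cell is blank or missing."""
    return int(blank_to_none(value) is not None)

# Licence names taken from the card above, mixed with blanks.
sample = ["", "CC BY-SA 3.0", "  ", "CC0 1.0"]
flags = [has_licence(v) for v in sample]
missing_rate = flags.count(0) / len(flags)
```

The same `blank_to_none` step applies to the other string columns in this report that show 0.0% nulls but a large n_empty or an empty top_value.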

MapCCVersionURL

categorical metadata imbalance
MapCCVersionURL appears to be a Creative Commons license URL field attached to map records, with five distinct CC variants observed (CC0, BY-SA 3.0/4.0, BY 2.0/3.0). It is effectively empty: 16,347 of 16,382 rows (top_rate 0.9979) are blank strings, leaving only 35 rows with an actual license URL and entropy_ratio of 0.0099. Null_rate is reported as 0.0 because empties are stored as '' rather than nulls, which is itself worth flagging. Treatment: Drop or collapse to a binary has_license flag; near-constant empty string. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
6
top_value
(empty string)
top_rate
0.9979
cardinality
6
entropy
0.02557
entropy_ratio
0.009891

JF

categorical feature
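JF is a binary Y/N flag with no nulls across 16,382 rows. Y leads at 66.0% (10,816) versus N at 34.0% (5,566), a mild imbalance that still yields a high entropy ratio of 0.925. These counts are identical to those of HasJesusFilm, so JF is very likely a duplicate of that column under a shorter name; confirm row-level equality and keep only one of the two. Treatment: Encode as a 0/1 indicator before modelling, or drop it if it proves identical to HasJesusFilm. high · anthropic:claude-opus-4-7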
n
16,382
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.6602
cardinality
2
entropy
0.9246
entropy_ratio
0.9246
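
JF reports exactly the same counts as HasJesusFilm (10,816 'Y' and 5,566 'N'), which hints at a duplicate column, but matching counts do not prove matching rows. A minimal row-level check, shown on toy lists standing in for the two columns:

```python
def columns_identical(a, b):
    """True when two columns agree row by row (same length, same values)."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

# Toy columns standing in for HasJesusFilm and JF.
has_jesus_film = ["Y", "N", "Y", "Y"]
jf = ["Y", "N", "Y", "Y"]
redundant = columns_identical(has_jesus_film, jf)

# Same counts, different rows: the check correctly reports a mismatch.
same_counts_only = columns_identical(["Y", "N"], ["N", "Y"])
```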

AudioRecordings

categorical feature
Binary Y/N flag indicating whether audio recordings are present, with no nulls across 16,382 rows. The distribution is imbalanced: 'Y' dominates at 82.3% (13,479) versus 2,903 'N' values, yielding an entropy ratio of 0.67. Treatment: Encode as a 0/1 boolean; be aware of the ~82/18 class imbalance when using as a predictor. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.8228
cardinality
2
entropy
0.6739
entropy_ratio
0.6739

Window1040

categorical feature
Window1040 is a binary Y/N flag covering all 16,382 rows with no nulls. The split is nearly even (Y at 52.3%, 8,572 rows; N at 47.7%, 7,810 rows), giving an entropy ratio of 0.998, essentially maximum uncertainty for a two-class field. The name most likely refers to the '10/40 Window', the band between roughly 10 and 40 degrees north latitude used in missions literature, so 'Y' would mark groups located inside that band; the profile itself does not confirm this. The balanced distribution makes it a usable feature rather than a degenerate constant. Treatment: Encode as a 0/1 indicator for modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.5233
cardinality
2
entropy
0.9984
entropy_ratio
0.9984

PeopleGroupMapURL

text metadata one_word url_heavy duplicates
This column holds URLs to people-group map images hosted on joshuaproject.net, with every non-empty entry being a single-token link. Over half the rows (8,728 of 16,382) are empty strings, and among the rest the same map URLs recur: the 7,654 populated rows share 6,028 distinct URLs. The headline duplicate_rate of 0.63 also counts the repeated empties. The url_rate of 0.47 is simply the populated share (7,654 of 16,382), meaning every non-empty value is a URL. Treatment: Treat as an optional asset URL: keep as-is for display, or drop if not rendering images. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
6,029
len_min
0
len_max
66
len_mean
29.92
len_median
0
len_p95
66
word_mean
1
word_median
1
n_empty
8,728
n_duplicates
10,353
duplicate_rate
0.632
vocab_size
6,028
readability_flesch_mean
-329.1
emoji_rate
0
url_rate
0.4672
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

PeopleGroupMapExpandedURL

text metadata one_word url_heavy duplicates
This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net, at most one link per row. It is mostly empty: 9,538 of 16,382 rows (58.2%) are blank, making the empty string the modal value, which is why len_median is 0 and url_rate is only 0.418. Among populated rows links are reused: 6,844 populated rows share 5,560 distinct URLs, and a single map (m00320.pdf) appears 96 times, indicating that many people-group records share the same regional map. The headline duplicate_rate of 0.661 also counts the repeated blanks. Treatment: Treat as an optional reference link; drop for modelling or keep only as a join key to map assets. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
5,561
len_min
0
len_max
66
len_mean
26.74
len_median
0
len_p95
66
word_mean
1
word_median
1
n_empty
9,538
n_duplicates
10,821
duplicate_rate
0.6605
vocab_size
5,560
readability_flesch_mean
-256.6
emoji_rate
0
url_rate
0.4178
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

PeopleGroupURL

text identifier near_unique one_word url_heavy
This column holds Joshua Project people-group URLs, one per row, with a perfectly fixed length of 48 characters and exactly one 'word' per cell. Every value is unique across all 16382 rows (n_unique equals n, duplicate_rate 0.0) and url_rate is 1.0, so it functions as a row-level identifier rather than analysable text. The URL pattern encodes a numeric people-group id plus a two-letter country suffix (e.g. /10375/tz, /10375/up), meaning the same group repeats across countries via different URLs. Treatment: Drop from modelling; keep as a row key or parse out the people-group id and country code if a join is needed. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
16,382
len_min
48
len_max
48
len_mean
48
len_median
48
len_p95
48
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
16,382
readability_flesch_mean
-476.9
emoji_rate
0
url_rate
1
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
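
If the people-group id and country code are needed for a join, they can be pulled out of the URL tail. The sketch below assumes every URL ends in "/<numeric id>/<two-letter code>", as the examples above suggest; the full URL used here is a hypothetical reconstruction, since only the "/10375/tz" tail appears in the report.

```python
import re

# Assumed tail pattern: "/<digits>/<two letters>", optionally followed by "/".
_TAIL = re.compile(r"/(\d+)/([A-Za-z]{2})/?$")

def parse_people_group_url(url):
    """Return (people_id, country_code) or None when the URL does not match."""
    m = _TAIL.search(url)
    if m is None:
        return None
    return int(m.group(1)), m.group(2).upper()

# Hypothetical full URL built around the "/10375/tz" example.
parsed = parse_people_group_url("https://joshuaproject.net/people_groups/10375/tz")
```

Returning None for non-matching values makes it easy to count how many rows deviate from the assumed pattern before relying on the parsed keys.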

PeopleGroupPhotoURL

text metadata one_word url_heavy duplicates
This column holds URLs to people-group profile photos hosted on joshuaproject.net, with every non-empty value being a single token (one_word_rate 1.0, url_rate 0.65). Notably, 5,719 of 16,382 rows are empty strings, and the 10,663 populated rows share only 5,276 distinct URLs, so on average each photo serves about two records; the top URL alone repeats 92 times. The reported n_duplicates (11,105) and duplicate_rate (0.68) also count the repeated empties. Treatment: Treat as an optional image asset link; drop for modelling or use only to fetch images, and handle the 5,719 empty values as missing. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
5,277
len_min
0
len_max
68
len_mean
42.32
len_median
65
len_p95
65
word_mean
1
word_median
1
n_empty
5,719
n_duplicates
11,105
duplicate_rate
0.6779
vocab_size
5,276
readability_flesch_mean
-585.6
emoji_rate
0
url_rate
0.6509
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

CountryURL

categorical metadata
URL pointing to a country page on joshuaproject.net, with the trailing two-letter code acting as the country identifier. 238 distinct countries appear across 16,382 rows with no nulls; India (IN) leads at 13.8% (2,262 rows), followed by PP (883) and ID (788). Codes such as PP are not ISO 3166 alpha-2 codes, so the suffix appears to be Joshua Project's own country code (the schema's ROG3 column also has 238 values) and should not be joined to ISO tables directly. High entropy ratio (0.79) indicates the distribution is broad rather than concentrated despite the India lead. Treatment: Strip the URL prefix and keep the two-letter country code as a categorical feature, or drop the column if it duplicates ROG3. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
238
top_value
https://joshuaproject.net/countries/IN
top_rate
0.1381
cardinality
238
entropy
6.225
entropy_ratio
0.7885
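
Stripping the URL prefix amounts to taking the last path segment. A minimal sketch using the top_value shown above (the function name `country_code` is illustrative):

```python
from urllib.parse import urlparse

def country_code(url):
    """Last path segment of the country URL, upper-cased; None for blanks."""
    if not url:
        return None
    segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return segment.upper() or None

code = country_code("https://joshuaproject.net/countries/IN")  # top_value above
```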

JPScaleText

categorical label
JPScaleText is a 5-level ordinal categorical describing how 'reached' a people group is, ranging from 'Unreached' to 'Significantly Reached'. The distribution is concentrated at the low end: 'Unreached' covers 43.5% of 16,382 rows (7,124), while 'Minimally Reached' is the rarest at 1,009. No nulls and an entropy ratio of 0.87 indicate that all five levels are represented, though unevenly. Treatment: Encode as an ordinal (Unreached < Minimally < Superficially < Partially < Significantly) before modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
5
top_value
Unreached
top_rate
0.4349
cardinality
5
entropy
2.017
entropy_ratio
0.8688
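
The ordinal encoding can be written as an explicit lookup so the order is never left to alphabetical sorting. The full label strings below are an assumption: the card spells out 'Unreached', 'Minimally Reached', and 'Significantly Reached', and the other two are inferred from the same pattern.

```python
# Assumed full labels, lowest to highest; verify against the column's actual values.
JP_SCALE_ORDER = [
    "Unreached",
    "Minimally Reached",
    "Superficially Reached",
    "Partially Reached",
    "Significantly Reached",
]
JP_SCALE_RANK = {label: rank for rank, label in enumerate(JP_SCALE_ORDER, start=1)}

def encode_jp_scale(label):
    """Return the 1..5 rank, or None for a label outside the assumed vocabulary."""
    return JP_SCALE_RANK.get(label)

ranks = [encode_jp_scale(v) for v in ["Unreached", "Partially Reached", "???"]]
```

Returning None for unknown labels surfaces any spelling that differs from the assumed list instead of silently mis-ranking it.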

JPScaleImageURL

categorical metadata
This column holds URLs to one of five 'gauge' images on joshuaproject.net, almost certainly a visual encoding of an ordinal Joshua Project Scale (1-5). Distribution is uneven: gauge-1 dominates at 43.5% of 16,382 rows, while gauge-2 is rarest at 1,009 rows, and entropy ratio is 0.87. No nulls, but the URL itself carries no information beyond the underlying 1-5 code. Treatment: Extract the trailing digit as an ordinal feature and drop the URL string. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
5
top_value
https://joshuaproject.net/assets/img/gauge/gauge-1.png
top_rate
0.4349
cardinality
5
entropy
2.017
entropy_ratio
0.8688
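
Extracting the trailing digit from the gauge URL is a one-line pattern match. A small sketch based on the top_value shown above; it assumes all five URLs follow the same "gauge-<digit>.png" naming.

```python
import re

_GAUGE = re.compile(r"gauge-(\d)\.png$")

def gauge_level(url):
    """Return the digit from '.../gauge-<digit>.png', or None if absent."""
    m = _GAUGE.search(url or "")
    return int(m.group(1)) if m else None

level = gauge_level("https://joshuaproject.net/assets/img/gauge/gauge-1.png")
```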

Summary

text free_text one_word duplicates
A short ethnographic descriptor field, likely a people-group summary paragraph in English. The column is dominated by emptiness: 12,328 of 16,382 rows are blank (median length 0), and those blanks account for most of the duplicate_rate of 0.77 and for the one_word alert (the one_word_rate of 0.75 equals the blank share). The 4,054 populated rows hold 3,777 distinct texts, so exact repeats are modest, although the same Rajput and Jat write-ups recur dozens of times in slight variants. Populated texts can be substantial (len_p95 = 719, max 1,212), and the Flesch mean of 13.2 suggests dense, hard-to-read prose. Treatment: Deduplicate and drop empties before any NLP; treat as supplementary description rather than a per-row feature. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
3,778
len_min
0
len_max
1,212
len_mean
137.7
len_median
0
len_p95
719
word_mean
23.34
word_median
1
n_empty
12,328
n_duplicates
12,604
duplicate_rate
0.7694
vocab_size
24,964
readability_flesch_mean
13.16
emoji_rate
0
url_rate
0
one_word_rate
0.7525
allcaps_rate
0
boilerplate_rate
0.0001221
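
The reported duplicate_rate counts repeated blanks as duplicates, so it says little about repetition among real texts. Assuming the unique count of 3,778 includes the empty string, the populated rows number 16,382 - 12,328 = 4,054 with 3,777 distinct texts, i.e. roughly 6.8% exact repeats. The sketch below shows how that populated-only rate and the deduplicated list could be computed; the toy strings are invented.

```python
def populated_duplicate_rate(texts):
    """Share of non-blank rows that exactly repeat an earlier non-blank row."""
    populated = [t.strip() for t in texts if t and t.strip()]
    if not populated:
        return 0.0
    return 1 - len(set(populated)) / len(populated)

def unique_texts(texts):
    """Non-blank texts with exact repeats removed, first occurrence kept."""
    return list(dict.fromkeys(t.strip() for t in texts if t and t.strip()))

# Invented toy column: two blanks, one repeated text, one unique text.
toy = ["", "A farming people.", "", "A farming people.", "Coastal traders."]
rate = populated_duplicate_rate(toy)
kept = unique_texts(toy)
```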

Obstacles

text free_text one_word duplicates
Free-text English commentary describing spiritual or cultural obstacles to Christian evangelism for various ethnic groups (Rajputs, Jats, Bosniaks, Azeri, etc.). The field is overwhelmingly empty: 12,327 of 16,382 rows are blank, driving a median length of 0, a one-word rate of 0.75, and most of the duplicate_rate of 0.77. The 4,055 populated rows hold 3,731 distinct texts; the repeats are concentrated in a few templates, with the same Rajput and Jat paragraphs appearing 88 and 74 times, which suggests templated entries reused across related people-group records. Treatment: Treat blanks as missing and deduplicate template paragraphs before tokenizing/embedding for any text modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
3,732
len_min
0
len_max
726
len_mean
47.4
len_median
0
len_p95
267
word_mean
8.704
word_median
1
n_empty
12,327
n_duplicates
12,650
duplicate_rate
0.7722
vocab_size
9,899
readability_flesch_mean
12.76
emoji_rate
0
url_rate
0
one_word_rate
0.7525
allcaps_rate
0
boilerplate_rate
0.0004273

HowReach

text free_text one_word duplicates
This is a free-text field describing how a people group could be reached. The column is dominated by emptiness: 13,043 of 16,382 rows (n_empty) are blank, driving a median length of 0 and most of the duplicate_rate of 0.82. Among the 3,339 populated entries, prose is substantive (len_max 599, len_p95 221) but often repeated: the same paragraph about Jats appears 136 times and several near-duplicates differ only by a word, suggesting templated copy across related groups. Treatment: Treat as sparse free text: filter out empties and deduplicate near-identical strings before any tokenization or embedding. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
2,944
len_min
0
len_max
599
len_mean
36.3
len_median
0
len_p95
221
word_mean
6.879
word_median
1
n_empty
13,043
n_duplicates
13,438
duplicate_rate
0.8203
vocab_size
7,981
readability_flesch_mean
12.17
emoji_rate
0
url_rate
0
one_word_rate
0.7962
allcaps_rate
0
boilerplate_rate
0.0001221
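
Exact deduplication will miss the near-duplicates that differ by a single word. One way to group them, sketched here with the standard library's `difflib`, is a greedy pass that attaches each text to the first sufficiently similar representative. The 0.9 threshold and the sample sentences are arbitrary illustrations, not values derived from the data.

```python
from difflib import SequenceMatcher

def group_near_duplicates(texts, threshold=0.9):
    """Greedy grouping: each text joins the first group whose representative is similar enough."""
    groups = []  # groups[i][0] is the representative of group i
    for text in texts:
        for group in groups:
            score = SequenceMatcher(None, group[0], text, autojunk=False).ratio()
            if score >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups

# Made-up sentences; the first two differ by one letter.
toy = [
    "Workers are needed to share the gospel with the Jat people.",
    "Workers are needed to share the gospel with the Jat peoples.",
    "Radio broadcasts in the local language could be effective.",
]
groups = group_near_duplicates(toy)
```

The comparison is quadratic in the number of groups, which should be manageable for the roughly 2.9k distinct strings here but would need a cheaper blocking step on much larger columns.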

PrayForChurch

text free_text one_word duplicates
Free-text prayer prompts focused on praying for the church among unreached groups (top words: pray, the, among). The column is mostly empty: 14,208 of 16,382 rows are blank, which accounts for most of the overall duplicate_rate of 0.89. Among the 2,174 non-empty entries there are 1,790 distinct texts, with a single Jat-related prayer appearing 146 times. The small vocabulary of 4,576 words suggests these are templated prayer points rather than authored prose, and all 652 language-tagged rows are English. Treatment: Treat as sparse optional commentary; mark empties as missing and dedupe templates before any text modelling. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
1,791
len_min
0
len_max
649
len_mean
26.91
len_median
0
len_p95
210
word_mean
5.599
word_median
1
n_empty
14,208
n_duplicates
14,591
duplicate_rate
0.8907
vocab_size
4,576
readability_flesch_mean
7.55
emoji_rate
0
url_rate
0
one_word_rate
0.8673
allcaps_rate
0
boilerplate_rate
0

PrayForPG

text free_text one_word duplicates
Free-text prayer points for a people group (PG), built around recurring 'Pray for...' templates. The column is mostly empty: 12,570 of 16,382 rows are blank (median length 0, median word count 1), which drives most of the duplicate rate of 0.78. Among the 3,812 populated rows there are 3,529 distinct texts, with one Rajput-focused prayer block repeated 88 times. The Flesch mean of 14.3 is consistent with dense, formulaic devotional prose rather than varied commentary. Treatment: Treat as sparse boilerplate: drop empties, dedupe, and only embed the ~3.5k unique strings if prayer content is actually needed downstream. high · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
3,530
len_min
0
len_max
937
len_mean
72.22
len_median
0
len_p95
400
word_mean
13.06
word_median
1
n_empty
12,570
n_duplicates
12,852
duplicate_rate
0.7845
vocab_size
9,427
readability_flesch_mean
14.31
emoji_rate
0
url_rate
0
one_word_rate
0.7673
allcaps_rate
0
boilerplate_rate
0

Resources

unknown other skipped
The column is named "Resources" but saturn skipped profiling, so kind is unknown and no descriptive statistics were computed. We can confirm 16382 rows with a null_rate of 0.0, but n_unique and value-level signals are missing. Without those stats the semantic role and content type cannot be determined from the evidence alone. Treatment: Re-profile this column with type coercion before deciding how to use it. low · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique

country_data

unknown other skipped
The column `country_data` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16382 and a null rate of 0.0. Without `n_unique` or any sample stats, the actual content (codes, names, nested objects) cannot be inferred from the evidence. Treatment: Re-profile with an appropriate parser to determine structure before use. low · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique

language_data

unknown other skipped
The `language_data` column was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16382 and a null rate of 0.0. Without `n_unique` or any descriptive stats, the actual contents (likely some structured or nested language payload given the name) cannot be characterized here. Treatment: Re-profile with an appropriate parser (e.g., expand JSON or cast to string) before deciding on downstream use. low · anthropic:claude-opus-4-7
n
16,382
nulls
0 (0.0%)
unique
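
Before choosing a parser for the skipped columns (Resources, country_data, language_data), it may help to check what each cell actually decodes to. The sketch below tries JSON on string cells and tallies the resulting Python types; the sample cells are hypothetical, since the real contents are unknown.

```python
import json
from collections import Counter

def sniff_cell_types(cells):
    """Count the type each cell decodes to; strings that are not valid JSON stay 'str'."""
    kinds = Counter()
    for cell in cells:
        if isinstance(cell, str):
            try:
                cell = json.loads(cell)
            except json.JSONDecodeError:
                pass  # leave the raw string as-is
        kinds[type(cell).__name__] += 1
    return kinds

# Hypothetical cells covering JSON text, plain text, a nested object, and a null.
kinds = sniff_cell_types(['{"code": "eng"}', "[1, 2]", "not json", {"already": "dict"}, None])
```

A tally dominated by `dict` or `list` would point toward expanding nested fields, while a tally dominated by `str` would suggest the column is ordinary text that the profiler failed to type.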