saturn·

joshua project joshua project unreached

source /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_unreached.parquet 7,124 rows 109 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset is a Joshua Project catalogue of 7,124 unreached people groups described across 109 fields covering geography, language, religion, population, and outreach status. Every row is flagged as 'Unreached' (JPScaleText is constant) and 'LeastReached' is uniformly Y, so the analytical interest sits in the breakdown by region, religion, and population rather than in reach status itself. The data is heavily skewed toward Asia (5,351 of 7,124) and especially South Asian Peoples (3,681), with India alone accounting for 2,032 groups; Islam (3,279) and Hinduism (2,142) dominate PrimaryReligion. Population is extremely long-tailed (median 30,000 vs. max 135.5M, skew ~21), so any size-based analysis should use log scales or medians. Worth a closer look first: the Continent/Region/Country concentration, the religion mix, and the population distribution — these three together explain most of the dataset's shape.

citing: row_count · column_count · Continent.top_values · AffinityBloc.top_values · Ctry.top_values · PrimaryReligion.top_values · Population.stats · RegionName.top_values · Frontier.top_values · JPScaleText.top_values · LeastReached.top_values · PrimaryLanguageName.top_values

Schema

109 columns
Per-column summary. Each entry lists the column name, type, null rate, and unique count; any alert flags follow on the next line.
PeopleID3ROG3 text 0.0% 7,124
near_unique one_word allcaps short_text
ROG3 categorical 0.0% 202
PeopleID3 numeric 0.0% 4,614
ROP3 numeric 0.1% 4,608
PeopNameInCountry text 0.0% 4,722
one_word duplicates
ROG2 categorical 0.0% 7
Continent categorical 0.0% 7
RegionName categorical 0.0% 12
ISO3 categorical 0.0% 202
LocationInCountry text 64.2% 2,176
multilingual null_rate
PeopleID1 numeric 0.0% 16
outliers
ROP1 categorical 0.0% 16
AffinityBloc categorical 0.0% 16
PeopleID2 numeric 0.0% 205
ROP2 categorical 0.0% 155
PeopleCluster categorical 0.0% 205
PeopNameAcrossCountries text 0.0% 4,604
one_word duplicates
Population numeric 0.2% 1,200
high_skew outliers
Category categorical 0.0% 3
ROL3 text 0.0% 1,565
one_word short_text duplicates
PrimaryLanguageName text 0.0% 1,563
one_word short_text duplicates
PrimaryLanguageDialect categorical 94.5% 303
long_tail null_rate
NumberLanguagesSpoken numeric 0.0% 69
high_skew outliers
OfficialLang categorical 0.1% 79
SpeakNationalLang unknown 0.0%
skipped
BibleStatus numeric 0.0% 6
outliers
BibleYear categorical 45.8% 163
null_rate
NTYear categorical 24.6% 305
null_rate
PortionsYear categorical 13.5% 460
TranslationNeedQuestionable unknown 0.0%
skipped
JPScale numeric 0.0% 1
constant
JPScalePC categorical 0.0% 5
JPScalePGAC categorical 0.0% 5
imbalance
LeastReached categorical 0.0% 1
imbalance
LeastReachedPC categorical 0.0% 2
LeastReachedPGAC categorical 0.0% 2
imbalance
GSEC categorical 0.0% 8
HasAudioRecordings categorical 0.0% 2
NTOnline categorical 22.4% 1
null_rate imbalance
RLG3 numeric 0.0% 7
outliers
RLG3PC numeric 0.0% 8
outliers
RLG3PGAC numeric 0.0% 8
outliers
PrimaryReligion categorical 0.0% 7
PrimaryReligionPC categorical 0.0% 8
PrimaryReligionPGAC categorical 0.0% 8
RLG4 numeric 92.4% 18
null_rate outliers
ReligionSubdivision categorical 92.4% 18
null_rate
PCIslam numeric 0.1% 902
PCNonReligious numeric 0.3% 152
high_skew outliers
PCUnknown numeric 0.4% 388
high_skew outliers
SecurityLevel numeric 0.0% 3
outliers
LRTop100 categorical 0.0% 2
imbalance
PhotoAddress text 0.0% 2,880
one_word short_text duplicates
PhotoCredits categorical 0.1% 851
long_tail
PhotoCreditURL categorical 36.0% 774
long_tail null_rate
PhotoCreativeCommons categorical 0.1% 2
PhotoCopyright categorical 0.2% 2
PhotoPermission categorical 0.2% 3
ProfileTextExists categorical 0.0% 2
imbalance
CountOfCountries numeric 0.0% 39
high_skew outliers
CountOfProvinces unknown 0.0%
skipped
Longitude numeric 0.0% 6,713
Latitude numeric 0.0% 6,696
Ctry categorical 0.0% 202
IndigenousCode categorical 0.0% 2
PercentAdherents categorical 0.0% 692
long_tail
PercentChristianPC categorical 0.0% 184
NaturalName text 0.0% 4,705
one_word duplicates
NaturalPronunciation text 48.5% 1,489
one_word null_rate duplicates
PercentChristianPGAC categorical 0.1% 842
PercentEvangelical categorical 10.4% 401
long_tail
PercentEvangelicalPC categorical 2.1% 166
PercentEvangelicalPGAC categorical 6.3% 548
PCBuddhism numeric 0.3% 809
high_skew outliers
PCEthnicReligions numeric 0.3% 351
high_skew outliers
PCHinduism numeric 0.3% 1,131
PCOtherSmall numeric 0.3% 670
high_skew outliers
RegionCode numeric 0.0% 12
outliers
PopulationPGAC numeric 0.1% 1,509
high_skew outliers
Frontier categorical 0.0% 2
MapAddress text 0.0% 4,616
one_word short_text duplicates
HasJesusFilm categorical 0.0% 2
Nomadic categorical 0.0% 2
imbalance
NomadicTypeDescription categorical 96.6% 6
null_rate
PhotoCCVersionText categorical 0.0% 16
PhotoCCVersionURL categorical 0.0% 16
MapCredits categorical 0.0% 161
long_tail
MapCreditURL categorical 0.0% 31
long_tail imbalance
MapCopyright categorical 0.0% 3
MapCCVersionText categorical 0.0% 4
imbalance
MapCCVersionURL categorical 0.0% 4
imbalance
JF categorical 0.0% 2
AudioRecordings categorical 0.0% 2
Window1040 categorical 0.0% 2
PeopleGroupMapURL text 0.0% 4,616
one_word url_heavy duplicates
PeopleGroupMapExpandedURL text 0.0% 4,331
one_word url_heavy duplicates
PeopleGroupURL text 0.0% 7,124
near_unique one_word url_heavy
PeopleGroupPhotoURL text 0.0% 2,880
one_word url_heavy duplicates
CountryURL categorical 0.0% 202
JPScaleText categorical 0.0% 1
imbalance
JPScaleImageURL categorical 0.0% 1
imbalance
Summary text 0.0% 3,685
one_word duplicates
Obstacles text 0.0% 3,641
one_word duplicates
HowReach text 0.0% 2,853
one_word duplicates
PrayForChurch text 0.0% 1,713
one_word duplicates
PrayForPG text 0.0% 3,441
one_word duplicates
Resources unknown 0.0%
skipped
country_data unknown 0.0%
skipped
language_data unknown 0.0%
skipped

PeopleID3ROG3

text identifier near_unique one_word allcaps short_text
PeopleID3ROG3 is almost certainly a composite row identifier for a people group within a country, not a person-level ID: the name concatenates PeopleID3 and ROG3, and every one of the 7,124 rows holds a unique 7-character, single-token code (n_unique equals n, len_min=len_max=7, one_word_rate=1.0). Sample values like '10208ng' and '10375su' show a 5-digit numeric prefix followed by a 2-letter suffix; the samples appear in lowercase even though allcaps_rate=1.0, so confirm the stored casing before joining. There are no nulls, duplicates, or empties, so the key looks clean. Treatment: Use as a primary key for joins; exclude from modelling features. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
7,124
len_min
7
len_max
7
len_mean
7
len_median
7
len_p95
7
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
7,124
readability_flesch_mean
112.3
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
1
boilerplate_rate
0
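
The key layout described above (a 5-digit prefix followed by a 2-letter suffix) can be checked before the column is used for joins. The sketch below is a minimal illustration using only the standard library; the function names are hypothetical, and the pattern is an assumption taken from the sample values rather than from any Joshua Project specification.

```python
import re

# Assumed layout (from the profile's sample values): 5 digits, then 2 letters.
KEY_PATTERN = re.compile(r"(\d{5})([A-Za-z]{2})")


def split_people_country_key(key):
    """Return (people_id3, rog3) for a well-formed key, else None."""
    match = KEY_PATTERN.fullmatch(key)
    if match is None:
        return None
    # Upper-case the suffix so casing differences do not break joins.
    return int(match.group(1)), match.group(2).upper()


def is_unique_key(keys):
    """True when no key value repeats."""
    keys = list(keys)
    return len(set(keys)) == len(keys)
```

Applied to the whole column, any `None` result would flag a key that departs from the assumed layout and deserves a manual look.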

ROG3

categorical feature
ROG3 holds two-letter country codes across 7,124 rows with 202 distinct values and no nulls. India (IN) dominates at 28.5% of records, followed by PK (767) and CH (442). CH here is China (the ISO3 column reports CHN at the same count of 442), so these codes are not ISO 3166-1 alpha-2, where CH denotes Switzerland. The top 10 codes account for most rows while a long tail of roughly 190 other codes shares the remainder, giving an entropy ratio of 0.66. Treatment: Group rare codes into an 'other' bucket before one-hot or target encoding. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
202
top_value
IN
top_rate
0.2852
cardinality
202
entropy
5.058
entropy_ratio
0.6605
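
The "group rare codes into an 'other' bucket" treatment recommended above can be sketched as follows. The helper name and the `min_count` threshold are hypothetical; the right cut-off depends on the downstream model and is not given by the profile.

```python
from collections import Counter


def bucket_rare(values, min_count, other_label="other"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy input, not the real column: one code appears only once.
codes = ["IN"] * 5 + ["PK"] * 3 + ["XX"]
bucketed = bucket_rare(codes, min_count=2)
```

The same helper applies to the other long-tailed categoricals in this report, such as ISO3, ROP2, PeopleCluster, and OfficialLang.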

PeopleID3

numeric foreign_key
PeopleID3 is an integer key spanning 10120 to 22661 with 4,614 unique values across 7,124 rows, which points to a people-group identifier that recurs across countries (about 1.5 rows per id on average) rather than a person identifier. The distribution is mildly left-skewed (-0.23) and platykurtic (-0.95) with no nulls, no zeros, and no outliers, consistent with a dense allocated id range rather than a measured quantity. Treatment: Treat as a join key; do not use as a numeric feature. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
4,614
min
10,120
max
22,661
mean
1.693e+04
median
1.736e+04
std
3431
q1
1.433e+04
q3
1.958e+04
iqr
5251
skew
-0.2255
kurtosis
-0.9528
n_outliers
0
outlier_rate
0
zero_rate
0

ROP3

numeric identifier
ROP3 is a numeric column with 4608 unique values across 7124 rows, ranging tightly from 100005 to 119619 with a mean of 111443.68 and median of 112533. The narrow ~19k span sitting well above zero, combined with integer-looking bounds, suggests a coded identifier or sequence number rather than a measured quantity. Mild left skew (-0.47) and no outliers indicate a fairly uniform spread within that band, and the null rate is negligible at 0.001. Treatment: Treat as a categorical code or key; do not feed raw into numeric models. medium · anthropic:claude-opus-4-7
n
7,124
nulls
7 (0.1%)
unique
4,608
min
100,005
max
119,619
mean
1.114e+05
median
112,533
std
5269
q1
107,901
q3
115,240
iqr
7,339
skew
-0.4712
kurtosis
-0.7273
n_outliers
0
outlier_rate
0
zero_rate
0

PeopNameInCountry

text label one_word duplicates
This column names a people group as it appears within a given country (e.g., 'Turk', 'Persian', 'Arab, Moroccan') in the Joshua Project registry. Values are short (len_mean 12.5, word_mean 1.78) with 4,722 uniques across 7,124 rows, and 33.7% are duplicates because the same people group recurs across countries; 'Deaf' alone appears 151 times. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' in top_words show that religion-tagged variants are baked into the label. Treatment: Treat as a categorical label; pair with country to form a unique key, and consider stripping parenthetical religion tags for cleaner grouping. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
4,722
len_min
1
len_max
39
len_mean
12.5
len_median
11
len_p95
27
word_mean
1.784
word_median
2
n_empty
0
n_duplicates
2,402
duplicate_rate
0.3372
vocab_size
4,602
readability_flesch_mean
56.38
emoji_rate
0
url_rate
0
one_word_rate
0.4575
allcaps_rate
0
boilerplate_rate
0
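
Stripping the parenthetical religion tags mentioned above is a one-line regular expression. This is a minimal sketch; the function name is hypothetical, and the second example label is made up for illustration rather than taken from the data.

```python
import re

# Matches a parenthesised qualifier plus any whitespace before it.
_PAREN = re.compile(r"\s*\([^)]*\)")


def strip_parenthetical(name):
    """Remove '(...)' qualifiers such as '(Hindu traditions)' from a label."""
    return _PAREN.sub("", name).strip()


plain = strip_parenthetical("Arab, Moroccan")              # no tag: unchanged
tagged = strip_parenthetical("Example (Muslim traditions)")  # tag removed
```

Keeping the original label in a separate column is advisable, since the tag itself may be analytically useful.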

ROG2

categorical feature
ROG2 is a low-cardinality categorical with 7 region codes (ASI, AFR, EUR, NAR, LAM, AUS, SOP) and no nulls across 7,124 rows, consistent with a continental/region-of-origin grouping. The distribution is highly imbalanced: ASI accounts for 75.1% of records while AUS and SOP together contribute fewer than 80 rows, yielding an entropy ratio of just 0.45. Any model conditioned on this field will be dominated by the ASI bucket. Treatment: One-hot encode and consider grouping AUS/SOP/LAM into an 'other' bucket given the severe imbalance. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
7
top_value
ASI
top_rate
0.7511
cardinality
7
entropy
1.251
entropy_ratio
0.4457

Continent

categorical feature
Continent is a low-cardinality geographic categorical with 7 distinct values and no nulls across 7,124 rows. The distribution is heavily concentrated: Asia alone accounts for 75.1% of records, with Africa a distant second at 986. Notably, 'Australia' (39) and 'Oceania' (36) appear as separate categories even though Australia is usually counted as part of Oceania; the same split appears in ROG2 (AUS and SOP), so it looks like a source taxonomy choice rather than a data-entry error. Treatment: One-hot encode; merge Australia and Oceania into a single category if the analysis needs conventional continents. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
7
top_value
Asia
top_rate
0.7511
cardinality
7
entropy
1.251
entropy_ratio
0.4457

RegionName

categorical feature
RegionName is a categorical geographic grouping with 12 distinct regions and no nulls across 7,124 rows. The distribution is heavily concentrated: 'Asia, South' alone accounts for 47.0% of records, followed distantly by 'Asia, Southeast' at 726 and 'Asia, Northeast' at 521, leaving the Americas and Europe sparsely represented. Entropy ratio of 0.76 confirms meaningful but uneven coverage across the 12 buckets. Treatment: One-hot encode and consider grouping rare regions, given the dominance of 'Asia, South'. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
12
top_value
Asia, South
top_rate
0.4701
cardinality
12
entropy
2.715
entropy_ratio
0.7574

ISO3

categorical foreign_key
ISO3 looks like a country code in standard 3-letter ISO 3166-1 alpha-3 format, with 202 distinct values across 7,124 rows and zero nulls. The distribution is heavily concentrated on India (IND) at 28.5% of rows (2,032), followed by PAK (767) and CHN (442), so South and East Asia dominate. Entropy ratio of 0.66 confirms the imbalance is material rather than uniform across countries. Treatment: Use as a join key to country reference tables; consider grouping long-tail codes or stratifying by ISO3 to control for the IND-heavy skew. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
202
top_value
IND
top_rate
0.2852
cardinality
202
entropy
5.058
entropy_ratio
0.6605

LocationInCountry

text free_text multilingual null_rate
Free-text geographic descriptions of where a people group lives within a country, ranging from terse tags like "Widespread." (56 occurrences) to multi-sentence paragraphs up to 939 characters. 64.23% of rows are null and 14.6% of the non-null values are duplicates, so usable signal is concentrated in roughly a third of the dataset. Content is overwhelmingly English (1714 of 1729 detected) with a long tail of place names producing a 10,936-token vocabulary across 2,176 unique strings. Treatment: Normalize boilerplate phrases like "Widespread" into a categorical flag, then tokenize and embed the residual prose before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
4,576 (64.2%)
unique
2,176
len_min
3
len_max
939
len_mean
141.1
len_median
89
len_p95
455.7
word_mean
21.15
word_median
12
n_empty
0
n_duplicates
372
duplicate_rate
0.146
vocab_size
10,936
readability_flesch_mean
41.64
emoji_rate
0
url_rate
0
one_word_rate
0.04317
allcaps_rate
0
boilerplate_rate
0
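
The treatment above separates boilerplate such as "Widespread." from genuine prose. A minimal sketch of that split is shown below; the function name is hypothetical, and only the single boilerplate phrase named in the profile is handled.

```python
def split_location(text):
    """Return (is_widespread, residual_text) for one LocationInCountry value.

    Nulls yield (None, None) so that 'unknown' is not confused with 'no'.
    """
    if text is None:
        return (None, None)
    cleaned = text.strip()
    if cleaned.rstrip(".").lower() == "widespread":
        return (True, None)
    return (False, cleaned)
```

The residual text can then go to whatever tokenizer or embedding step the analysis uses.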

PeopleID1

numeric feature outliers
PeopleID1 is stored as numeric but only takes 16 distinct integer values between 10 and 26, with a tight IQR of 1.0 (Q1=20, Q3=21) around a median of 21. The distribution is heavily left-skewed (skew -1.34) and 32.9% of rows fall outside the Tukey fence, so the 'outliers' alert reflects the column being near-categorical rather than truly continuous. No nulls and no zeros. Its 16 levels match the 16 values of ROP1 and AffinityBloc, so it is most likely a numeric code for the affinity bloc rather than a row identifier. Treatment: Cast to categorical (16 levels) rather than treating as a continuous numeric. medium · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
16
min
10
max
26
mean
19.51
median
21
std
3.858
q1
20
q3
21
iqr
1
skew
-1.337
kurtosis
0.8321
n_outliers
2,347
outlier_rate
0.3294
zero_rate
0
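
To see why a third of the rows are flagged, it helps to compute the Tukey fences from the quartiles reported above. The sketch assumes the conventional multiplier of 1.5 times the IQR; the profiler's exact setting is not stated in this report.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


# PeopleID1 quartiles from the profile: Q1=20, Q3=21.
low, high = tukey_fences(20, 21)
```

With an IQR of 1 the fences sit at 18.5 and 22.5, so any code from 10 to 18 or from 23 to 26 counts as an "outlier" even though it is a perfectly valid category.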

ROP1

categorical feature
ROP1 is a low-cardinality categorical code (16 distinct values, all following an 'A0xx' pattern) with no nulls across 7,124 rows. The distribution is heavily concentrated: 'A012' alone accounts for 51.7% of records, and entropy ratio of 0.67 confirms the imbalance. This looks like a controlled vocabulary or lookup code rather than free input. Treatment: One-hot or target-encode; consider grouping rare codes given the dominance of A012. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
16
top_value
A012
top_rate
0.5167
cardinality
16
entropy
2.676
entropy_ratio
0.669

AffinityBloc

categorical feature
AffinityBloc is a categorical grouping of peoples/ethnolinguistic blocs, with 16 distinct values across 7124 rows and no nulls. The distribution is heavily concentrated: 'South Asian Peoples' alone accounts for 51.7% of records (3681), followed distantly by 'Sub-Saharan Peoples' (632) and 'Arab World' (475). Entropy ratio of 0.67 confirms the imbalance, and the inclusion of 'Deaf' (151) as a bloc alongside geographic/ethnic categories is a notable taxonomy quirk. Treatment: One-hot or target-encode, and consider grouping rare blocs given the dominance of South Asian Peoples. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
16
top_value
South Asian Peoples
top_rate
0.5167
cardinality
16
entropy
2.676
entropy_ratio
0.669

PeopleID2

numeric foreign_key
PeopleID2 is a numeric identifier-like field with only 205 distinct values across 7,124 rows, ranging 101 to 475 with no nulls or zeros. The distribution is left-skewed (skew -0.50) and platykurtic (kurtosis -1.24) with median 412 well above the mean 339, suggesting a few low-id clusters pull the mean down while most rows concentrate near the upper end. The low cardinality relative to row count indicates this is a repeated key rather than a per-row unique id, and the count of 205 matches PeopleCluster. Treatment: Treat as a categorical foreign key and left-join to the people-cluster dimension rather than using as a numeric feature. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
205
min
101
max
475
mean
339
median
412
std
123.2
q1
232.8
q3
450
iqr
217.2
skew
-0.5035
kurtosis
-1.242
n_outliers
0
outlier_rate
0
zero_rate
0

ROP2

categorical feature
ROP2 is a categorical code field with 155 distinct letter-prefixed values (an 'A' or 'C' followed by digits), suggesting a classification code. One value, 'A012', dominates at 51.6% of the 7,124 rows, while the remaining categories are long-tailed C-codes each below 2.4%. Entropy ratio of 0.56 confirms the distribution is heavily concentrated rather than uniform. Treatment: Collapse rare C-codes into an 'other' bucket and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
155
top_value
A012
top_rate
0.5163
cardinality
155
entropy
4.085
entropy_ratio
0.5614

PeopleCluster

categorical feature
PeopleCluster is a high-cardinality categorical taxonomy of ethno-religious groupings, with 205 distinct values across 7124 rows and no nulls. The distribution is dominated by South Asian categories — 'South Asia Hindu - other' alone accounts for 12.2% (869 rows), followed by 'South Asia Muslim - other' (586) and 'South Asia Dalit - other' (352). Entropy ratio of 0.80 indicates a fairly spread distribution despite the South Asian skew, and the appearance of 'Deaf' alongside ethnolinguistic labels signals a mixed taxonomy worth flagging. Treatment: Group the long tail and target- or frequency-encode before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
205
top_value
South Asia Hindu - other
top_rate
0.122
cardinality
205
entropy
6.108
entropy_ratio
0.7954

PeopNameAcrossCountries

text label one_word duplicates
This column holds people-group / ethnolinguistic names spanning countries (e.g. 'Turk', 'Persian', 'Kurd, Kurmanji', 'Arab, Moroccan'), with frequent religious-tradition qualifiers like '(Hindu traditions)' and '(Muslim traditions)' appearing in 985 and 424 rows respectively. Values are short (mean 12.4 chars, median 2 words) and 47.5% are single-word labels, yet 35.4% (2,520 rows) are duplicates across 4,604 unique strings out of 7,124 — the same group recurs across countries. The most surprising entry is 'Deaf' at 151 occurrences, which sits oddly alongside ethnic categories. Treatment: Treat as a categorical people-group label; normalize qualifiers in parentheses and join on (group, country) for cross-country aggregation. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
4,604
len_min
1
len_max
39
len_mean
12.38
len_median
10
len_p95
27
word_mean
1.766
word_median
2
n_empty
0
n_duplicates
2,520
duplicate_rate
0.3537
vocab_size
4,431
readability_flesch_mean
56.01
emoji_rate
0
url_rate
0
one_word_rate
0.475
allcaps_rate
0
boilerplate_rate
0

Population

numeric feature high_skew outliers
Population of each people group within its country, ranging from 10 to 135,533,000 with a median of just 30,000. The distribution is extraordinarily right-skewed (skew 21.1, kurtosis 607): the mean of ~502,570 sits far above Q3 of 129,000, and ~14.9% of rows flag as outliers, reflecting many small people groups alongside a few very large ones. Null rate is negligible (0.21%) and there are no zeros. Treatment: Apply a log1p transform before any modelling to tame the extreme skew. high · anthropic:claude-opus-4-7
n
7,124
nulls
15 (0.2%)
unique
1,200
min
10
max
1.355e+08
mean
5.026e+05
median
30,000
std
3.568e+06
q1
6,700
q3
129,000
iqr
122,300
skew
21.1
kurtosis
607.4
n_outliers
1,058
outlier_rate
0.1488
zero_rate
0
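
The log1p treatment above can be sketched with the standard library alone. The function name is hypothetical; nulls are passed through as `None` rather than imputed, since the profile does not say how the 15 missing values should be handled.

```python
import math


def log1p_population(value):
    """Return log(1 + value), or None when the population is missing."""
    if value is None:
        return None
    return math.log1p(value)


# Minimum, median, and maximum reported in the profile.
transformed = [log1p_population(v) for v in (10, 30_000, 135_533_000)]
```

On this scale the reported range of 10 to 135,533,000 compresses to roughly 2.4 to 18.7, which is far easier to plot and model.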

Category

categorical label
Three-level categorical with values "1", "2", "3" and no nulls across 7124 rows. Distribution is imbalanced: "3" dominates at 60.3% (4299), "1" follows at 2330, and "2" is rare at 495. Entropy ratio of 0.78 confirms the skew toward the majority class. Treatment: Treat as a categorical label; consider class weighting or stratified sampling to handle the minority class "2". high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
3
top_value
3
top_rate
0.6035
cardinality
3
entropy
1.234
entropy_ratio
0.7788

ROL3

text feature one_word short_text duplicates
ROL3 holds three-letter ISO 639-3 language codes — every value is exactly 3 characters and one word, with top entries like 'hin', 'ben', 'urd', 'guj', and 'tel' pointing to South Asian languages. The distribution is heavily skewed toward Hindi (662 of 7124) and only 1565 unique codes appear across 7124 rows, giving a 78% duplicate rate. No nulls, no empties, no formatting noise. Treatment: treat as a categorical language code; one-hot or target-encode, and consider grouping rare codes. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
1,565
len_min
3
len_max
3
len_mean
3
len_median
3
len_p95
3
word_mean
1
word_median
1
n_empty
0
n_duplicates
5,559
duplicate_rate
0.7803
vocab_size
1,564
readability_flesch_mean
118.7
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0.0002807
boilerplate_rate
0

PrimaryLanguageName

text feature one_word short_text duplicates
Categorical language label, usually a single word (one_word_rate 0.70, word_mean 1.33) drawn from a vocabulary of 1,641 tokens across 1,563 distinct values. Hindi (662), Bengali (357), and Sindhi (191) dominate the 7,124 rows, and 78.1% of values are duplicates of an earlier row, as expected for a controlled language taxonomy. Compound names like 'Pashto, Northern' and 'Punjabi, Eastern' indicate ISO-style subvariant naming rather than free text. Treatment: Treat as a categorical factor; normalise the comma-separated subvariants before one-hot or target encoding. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
1,563
len_min
1
len_max
32
len_mean
9.147
len_median
7
len_p95
17
word_mean
1.331
word_median
1
n_empty
0
n_duplicates
5,561
duplicate_rate
0.7806
vocab_size
1,641
readability_flesch_mean
33.28
emoji_rate
0
url_rate
0
one_word_rate
0.7016
allcaps_rate
0
boilerplate_rate
0

PrimaryLanguageDialect

categorical metadata long_tail null_rate
Free-text or controlled-vocabulary field naming a primary language dialect, with 303 distinct values across 7124 rows but populated in only 5.5% of records (null_rate 0.945). Distribution is essentially flat — entropy_ratio 0.97 and the modal value 'Punjabi' covers just 3.1% of non-nulls (12 occurrences) — so no dialect dominates. The mix spans South Asian, Middle Eastern, African, and European dialects, suggesting a global but extremely sparse roster. Treatment: Drop or collapse into a coarser language grouping; too sparse and high-cardinality to use directly as a feature. high · anthropic:claude-opus-4-7
n
7,124
nulls
6,732 (94.5%)
unique
303
top_value
Punjabi
top_rate
0.03061
cardinality
303
entropy
8.011
entropy_ratio
0.9719

NumberLanguagesSpoken

numeric feature high_skew outliers
Counts the number of languages spoken, with values ranging from 1 to 120 across 7,124 rows and no nulls. The distribution is heavily right-skewed (skew 4.87, kurtosis 36.3): the median is 1 and Q3 is 5, yet the mean is 4.33 and 597 rows (8.4%) flag as outliers, suggesting a long tail of implausibly high counts up to 120. Treatment: Cap or log-transform before modelling, and audit the extreme tail (values up to 120) for data-entry errors. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
69
min
1
max
120
mean
4.333
median
1
std
7.32
q1
1
q3
5
iqr
4
skew
4.871
kurtosis
36.31
n_outliers
597
outlier_rate
0.0838
zero_rate
0
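
One way to apply the "cap" option above is to clip values at the upper Tukey fence. The sketch assumes the conventional 1.5 times IQR multiplier and uses the quartiles reported in the profile (Q1=1, Q3=5); the function name is hypothetical.

```python
def cap_at_upper_fence(values, q1, q3, k=1.5):
    """Clip each value at the upper Tukey fence q3 + k * (q3 - q1)."""
    upper = q3 + k * (q3 - q1)
    return [min(v, upper) for v in values]


# Toy values spanning the reported range of 1 to 120.
capped = cap_at_upper_fence([1, 5, 120], q1=1, q3=5)
```

With these quartiles the fence is 11, so the extreme value of 120 is pulled down to 11 while typical values are left alone. Capping discards information, so auditing the tail first, as the treatment suggests, is the safer order.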

OfficialLang

categorical feature
Categorical column listing an official language per record, with 79 distinct values across 7124 rows and effectively no nulls (0.08%). Hindi dominates at 28.5% (2032 rows), followed by Urdu (767), Standard Arabic (657), Mandarin (475), and English (433), giving an entropy ratio of 0.66 — moderately concentrated rather than uniform. The South/Central Asian skew is notable: five of the top ten values are languages of that region, which may bias any downstream language-level analysis. Treatment: Group rare languages into an 'Other' bucket and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
6 (0.1%)
unique
79
top_value
Hindi
top_rate
0.2855
cardinality
79
entropy
4.178
entropy_ratio
0.6628

SpeakNationalLang

unknown other skipped
Column 'SpeakNationalLang' was skipped by the profiler, so type, cardinality and value distribution are all unavailable. The only confirmed facts are that it has 7124 rows and zero nulls. The name suggests a flag or category for whether a people group speaks the national language, but this cannot be verified from the evidence. Treatment: Re-profile with type inference enabled before deciding on any downstream handling. low · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique

BibleStatus

numeric feature outliers
BibleStatus is an ordinal/categorical code stored as a small integer, taking just 6 distinct values from 0 to 5 across 7,124 rows with no nulls. The distribution is heavily concentrated at the top (median 5, Q1=4, Q3=5, mean 4.05) with strong negative skew (-1.51), and 13.5% of rows flagged as low-end outliers plus a 3.8% zero rate. This looks like a status/level code rather than a true numeric measurement. Treatment: Treat as an ordinal category (or one-hot encode) rather than a continuous numeric. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
6
min
0
max
5
mean
4.054
median
5
std
1.342
q1
4
q3
5
iqr
1
skew
-1.513
kurtosis
1.538
n_outliers
961
outlier_rate
0.1349
zero_rate
0.03818

BibleYear

categorical metadata null_rate
BibleYear appears to be a publication/edition year field for Bible translations, encoded mostly as date ranges like "1818-2022" rather than single years. Cardinality is high (163 distinct values across 7124 rows) and the column is missing for 45.79% of rows, which is a major coverage gap. The top value "1818-2022" covers 17.14% of non-nulls and most frequent entries are spans, while plain single years like "1954" (191 occurrences) are the exception. Treatment: Parse into start_year and end_year integers, then decide imputation given the 45.79% null rate. medium · anthropic:claude-opus-4-7
n
7,124
nulls
3,262 (45.8%)
unique
163
top_value
1818-2022
top_rate
0.1714
cardinality
163
entropy
5.318
entropy_ratio
0.7237
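
Parsing the span strings into start and end years, as recommended above, can be sketched as follows. The function name is hypothetical; the pattern covers the two shapes the profile reports ("1818-2022" and plain years like "1954") and returns `None` for anything else.

```python
import re

# A 4-digit year, optionally followed by '-' and a second 4-digit year.
_SPAN = re.compile(r"(\d{4})(?:-(\d{4}))?")


def parse_year_span(value):
    """Return (start_year, end_year), or None if the value is null or malformed."""
    if value is None:
        return None
    match = _SPAN.fullmatch(value.strip())
    if match is None:
        return None
    start = int(match.group(1))
    # A single year is treated as a span of length zero.
    end = int(match.group(2)) if match.group(2) else start
    return (start, end)
```

Values that come back as `None` despite being non-null are worth listing separately, since they reveal any formats the profile's top values did not show.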

NTYear

categorical feature null_rate
NTYear appears to hold year-range strings (e.g. "1811-1998", "1801-1984") indicating a span between two dates, though the presence of "Yes" as the third most common value (345 rows) signals encoding inconsistency. The column has 305 distinct values across 7124 rows with a high null rate of 24.65%, and the top value covers only 12.3% of non-nulls, so the distribution is fairly spread (entropy ratio 0.735). Treatment: Parse year-range strings into start/end numeric fields and quarantine non-conforming values like "Yes" before modelling. medium · anthropic:claude-opus-4-7
n
7,124
nulls
1,756 (24.6%)
unique
305
top_value
1811-1998
top_rate
0.1233
cardinality
305
entropy
6.065
entropy_ratio
0.7349

PortionsYear

categorical metadata
PortionsYear appears to be a free-form field that, by analogy with BibleYear and NTYear, most likely records the year range in which Scripture portions were published, with 460 distinct values across 7124 rows and a 13.5% null rate. Most entries are date ranges like '1806-1962' or '1800-1980', but the single most common value is the literal string 'Yes' (821 rows, 13.3%), suggesting inconsistent data entry where a yes/no answer leaks into a date-range field. High entropy ratio (0.71) confirms values are spread across many ranges rather than concentrated. Treatment: Parse date ranges into start/end year numerics and isolate the 'Yes' contamination as a separate boolean flag before use. high · anthropic:claude-opus-4-7
n
7,124
nulls
961 (13.5%)
unique
460
top_value
Yes
top_rate
0.1332
cardinality
460
entropy
6.289
entropy_ratio
0.711
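
The treatment above asks for two things at once: numeric start and end years, and a separate flag for the literal 'Yes'. A minimal sketch is shown below; the function name and the dictionary keys are hypothetical.

```python
import re

_RANGE = re.compile(r"(\d{4})(?:-(\d{4}))?")


def split_portions_year(value):
    """Split one PortionsYear value into a 'Yes' flag and optional years."""
    result = {"yes_flag": False, "start_year": None, "end_year": None}
    if value is None:
        return result
    text = value.strip()
    if text.lower() == "yes":
        # The literal 'Yes' carries no year information.
        result["yes_flag"] = True
        return result
    match = _RANGE.fullmatch(text)
    if match is not None:
        result["start_year"] = int(match.group(1))
        result["end_year"] = int(match.group(2) or match.group(1))
    return result
```

The same split would suit NTYear, where the profile also reports 'Yes' among the common values.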

TranslationNeedQuestionable

unknown other skipped
The column "TranslationNeedQuestionable" was skipped by the profiler, so no type, uniqueness, or value statistics are available. All we know is that it has 7124 rows with a 0.0 null rate; nothing else can be inferred from the evidence. Treatment: Re-profile or inspect manually before deciding on use; current evidence is insufficient. low · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique

JPScale

numeric metadata constant
JPScale is a numeric column that is entirely constant: all 7124 rows hold the value 1.0 with zero nulls, zero variance, and a single unique value. It carries no information for any downstream model or comparison and was flagged as constant. Treatment: Drop; constant column with no variance. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
1
min
1
max
1
mean
1
median
1
std
0
q1
1
q3
1
iqr
0
skew
0
kurtosis
0
n_outliers
0
outlier_rate
0
zero_rate
0

JPScalePC

categorical feature
JPScalePC is a 5-level categorical code (values "1" through "5") with no nulls across 7124 rows, likely an ordinal scale or rating. The distribution is heavily concentrated at "1" (70.2% of rows), with "3" the rarest at just 205 occurrences, yielding an entropy ratio of 0.59. The non-monotonic frequency order (1 > 4 > 2 > 5 > 3) is unusual for a true ordinal scale and worth checking. Treatment: Treat as ordinal categorical; consider grouping minority levels (3, 5) given the dominance of "1". high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
5
top_value
1
top_rate
0.702
cardinality
5
entropy
1.377
entropy_ratio
0.5929

JPScalePGAC

categorical feature imbalance
JPScalePGAC is a 5-level categorical code (values "1" through "5"). Given the JPScale family it belongs to, it is most likely the Joshua Project progress scale at a different aggregation level (the PGAC suffix plausibly means "people group across countries", but that is an inference from the column names). The distribution is severely imbalanced: "1" accounts for 6910 of 7124 rows (top_rate 0.97), entropy_ratio is just 0.10, and the remaining four levels together hold only 214 records. No nulls are present. Treatment: Collapse rare levels or binarise as "1" vs other before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
5
top_value
1
top_rate
0.97
cardinality
5
entropy
0.2372
entropy_ratio
0.1021

LeastReached

categorical metadata imbalance
This column is a single-valued categorical flag holding "Y" for all 7124 rows, with no nulls and zero entropy. Because cardinality is 1 and top_rate is 1.0, it carries no information for any downstream model. Treatment: Drop; constant column with no variance. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
1
top_value
Y
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

LeastReachedPC

categorical feature
Binary Y/N flag indicating whether some 'least reached' people-group condition is met. The column is fully populated across 7124 rows with only 2 distinct values, skewed toward 'Y' at 72.3% (5152) versus 'N' at 1972. Entropy ratio of 0.85 shows the split is uneven but still informative. Treatment: Encode as a 0/1 boolean for modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.7232
cardinality
2
entropy
0.8511
entropy_ratio
0.8511
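
The entropy and entropy_ratio figures used throughout this report are consistent with Shannon entropy in bits divided by log2 of the cardinality; that definition is inferred from the reported numbers, not documented by the profiler. The sketch below recomputes both from the Y/N counts given above (5152 and 1972).

```python
import math


def entropy_ratio(counts):
    """Return (entropy_bits, entropy / log2(k)) for a list of category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if len(probs) <= 1:
        # A constant column has zero entropy; the ratio is defined as 0 here.
        return (0.0, 0.0)
    entropy = sum(-p * math.log2(p) for p in probs)
    return (entropy, entropy / math.log2(len(probs)))


# LeastReachedPC counts from the profile: Y=5152, N=1972.
h, ratio = entropy_ratio([5152, 1972])
```

For a binary column log2(2) is 1, which is why entropy and entropy_ratio coincide at about 0.851 here. A ratio near 1 means categories are evenly spread; a ratio near 0 means one category dominates.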

LeastReachedPGAC

categorical feature imbalance
A binary Y/N flag (likely indicating whether the least-reached PGAC condition was met) with no nulls across 7124 rows. The distribution is severely imbalanced: 'Y' accounts for 6910 rows (97.0%) versus only 214 'N', yielding entropy_ratio of just 0.19. As a near-constant feature it carries little discriminative signal on its own. Treatment: Encode as 0/1 but consider dropping or pairing with rare-class oversampling given the 97/3 imbalance. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.97
cardinality
2
entropy
0.1946
entropy_ratio
0.1946
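
The entropy and entropy_ratio figures for the binary flags can be reproduced from the class counts alone. The sketch below assumes the profiler reports Shannon entropy in bits and divides by log2 of the observed cardinality; that assumption is consistent with the numbers shown in this report but is not confirmed by it. The function names are illustrative.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a constant column has no entropy to normalise
    return entropy_bits(counts) / math.log2(k)

# LeastReachedPGAC: 6910 'Y' vs 214 'N'
pgac_ratio = entropy_ratio([6910, 214])
# LeastReachedPC: 5152 'Y' vs 1972 'N'
pc_ratio = entropy_ratio([5152, 1972])
print(round(pgac_ratio, 4), round(pc_ratio, 4))
```

For a two-level column the ratio equals the entropy itself, which is why both rows of the stat block show the same value.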

GSEC

categorical feature
GSEC is a low-cardinality categorical with 8 distinct values across 7124 rows and no nulls. The dominant value is the empty string at 51.08% (3639 rows), followed by '1' at 2767; the other six codes ('0' and '2' through '6') together account for the remaining 718 rows, about 10%. The mix of blanks and small integer codes suggests an optional categorical flag where 'missing' is encoded as '' instead of null. Treatment: Recode '' as explicit missing and one-hot encode the remaining small-integer categories. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
8
top_value
(empty string)
top_rate
0.5108
cardinality
8
entropy
1.605
entropy_ratio
0.535
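
A minimal sketch of the recode-then-encode treatment, using only the standard library. The sample values are illustrative and not taken from the file; `recode_blank` and `one_hot` are hypothetical helper names.

```python
def recode_blank(value):
    """Map None, empty, or whitespace-only strings to None; strip other values."""
    if value is None or str(value).strip() == "":
        return None
    return str(value).strip()

def one_hot(values, prefix):
    """One-hot encode non-missing levels and add an explicit missing flag."""
    levels = sorted({v for v in values if v is not None})
    rows = []
    for v in values:
        row = {f"{prefix}_{lvl}": int(v == lvl) for lvl in levels}
        row[f"{prefix}_missing"] = int(v is None)
        rows.append(row)
    return rows

sample = ["", "1", "2", "", "0"]  # illustrative values only
cleaned = [recode_blank(v) for v in sample]
encoded = one_hot(cleaned, "GSEC")
```

Keeping a separate `GSEC_missing` column preserves the fact that roughly half of the rows are blank, which may itself be informative.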

HasAudioRecordings

categorical feature
Binary Y/N flag indicating whether a record has associated audio recordings. The class is heavily imbalanced toward 'Y' at 86.9% (6188 of 7124), with no nulls. Entropy ratio of 0.56 confirms the skew but the minority 'N' class still has 936 observations, enough to be usable. Treatment: Encode as a 0/1 boolean; watch for class imbalance if used as a target. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.8686
cardinality
2
entropy
0.5612
entropy_ratio
0.5612

NTOnline

categorical feature null_rate imbalance
NTOnline is a categorical flag with only one observed value, 'Y', across 5528 non-null rows, while 22.4% of rows are null. With cardinality 1 and entropy 0, it carries no discriminative signal—presence vs. absence is the only information available. Treatment: Drop, or replace with a binary is_present indicator if the null pattern is meaningful. high · anthropic:claude-opus-4-7
n
7,124
nulls
1,596 (22.4%)
unique
1
top_value
Y
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

RLG3

numeric feature outliers
RLG3 is a discrete numeric column with only 7 unique values spanning 2 to 9. The name and the match with PrimaryReligion's 7 distinct values suggest a religion code stored as an integer, that is, a nominal identifier and not an ordinal rating or a continuous measurement. Values cluster at 5 and 6 (median 6, IQR=1, Q1=5, Q3=6) with mild negative skew (-0.46), and 10.6% of rows (757) fall outside the IQR fences, an artifact of the narrow box, not a sign of true anomalies. Treatment: Treat as a nominal categorical (one-hot, or drop in favour of PrimaryReligion if the two map one-to-one); the outlier flag is a side-effect of the compressed IQR, not bad data. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
7
min
2
max
9
mean
5.27
median
6
std
1.279
q1
5
q3
6
iqr
1
skew
-0.4551
kurtosis
2.001
n_outliers
757
outlier_rate
0.1063
zero_rate
0
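
To see why a narrow IQR flags so many rows, the fences can be computed directly from Q1 and Q3. This sketch assumes the profiler uses the conventional 1.5*IQR Tukey rule, which the report does not state explicitly.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences for the k*IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True when x lies strictly outside the fences."""
    lo, hi = tukey_fences(q1, q3, k)
    return x < lo or x > hi

# RLG3: Q1=5, Q3=6, so IQR=1
rlg3_fences = tukey_fences(5, 6)
# Candidate integer codes in the observed 2..9 range that would be flagged
flagged_codes = [v for v in range(2, 10) if is_outlier(v, 5, 6)]
print(rlg3_fences, flagged_codes)
```

With fences at 3.5 and 7.5, every row coded 2, 3, 8, or 9 counts as an outlier even though those are ordinary category codes.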

RLG3PC

numeric feature outliers
RLG3PC is an integer code with only 8 distinct values spanning 1-9 and a tight IQR of 1 (Q1=5, Q3=6). The count of 8 matches PrimaryReligionPC's 8 categories, which suggests a nominal religion code and not an ordered scale. Values concentrate at the median of 5 and skew is -0.95, yet 14.3% of rows (1022) fall outside the IQR fences; with the usual 1.5*IQR fences at 3.5 and 7.5, that simply counts rows coded 1 to 3 or 8 to 9. No nulls or zeros are present. Treatment: Treat as a nominal categorical code and not as a continuous value; the outlier rate reflects the narrow IQR, not errors. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
8
min
1
max
9
mean
5.079
median
5
std
1.52
q1
5
q3
6
iqr
1
skew
-0.9463
kurtosis
1.703
n_outliers
1,022
outlier_rate
0.1435
zero_rate
0

RLG3PGAC

numeric feature outliers
RLG3PGAC is a numeric column with only 8 distinct integer values spanning 1 to 9. The count matches PrimaryReligionPGAC's 8 categories, which suggests a nominal religion code and not a rating or a continuous measurement. The distribution is tight around a median of 5.5 with IQR of just 1 (Q1=5, Q3=6), and 776 rows (10.9%) fall outside the Tukey fences because, with so narrow a box, any code below 4 or above 7 registers as an outlier. Skew is mildly negative (-0.46) and the mean is 5.27, but neither statistic is meaningful if the codes are nominal. Treatment: Treat as a nominal categorical; one-hot encode and do not scale as continuous. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
8
min
1
max
9
mean
5.272
median
5.5
std
1.296
q1
5
q3
6
iqr
1
skew
-0.4637
kurtosis
2.032
n_outliers
776
outlier_rate
0.1089
zero_rate
0

PrimaryReligion

categorical feature
PrimaryReligion is a low-cardinality categorical with 7 distinct values across 7,124 rows and no nulls. Islam dominates at 46% (3,279 rows), followed by Hinduism (2,142) and Ethnic Religions (933); Non-Religious appears only 13 times and 157 rows are explicitly 'Unknown'. Entropy ratio of 0.68 indicates a moderately skewed but not degenerate distribution. Treatment: One-hot encode and consider merging 'Unknown' with 'Other / Small' or treating it as a missing-value flag. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
7
top_value
Islam
top_rate
0.4603
cardinality
7
entropy
1.92
entropy_ratio
0.6839

PrimaryReligionPC

categorical feature
Categorical label assigning each of 7124 rows to one of 8 primary religion categories, with no nulls. Islam dominates at 3105 rows (43.6%) followed by Hinduism at 2296, while Non-Religious (35) and Other/Small (62) are rare; entropy ratio of 0.68 indicates moderate concentration in the top two classes. 154 rows are explicitly 'Unknown', a category worth treating distinctly from missing. Treatment: One-hot encode the 8 levels and keep 'Unknown' as its own category rather than imputing. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
8
top_value
Islam
top_rate
0.4359
cardinality
8
entropy
2.051
entropy_ratio
0.6838

PrimaryReligionPGAC

categorical feature
Categorical label for the primary religion of a People Group Across Countries (PGAC) record, with 8 distinct values across 7124 rows and no nulls. Islam dominates at 45.6% (3247), followed by Hinduism (2154) and Ethnic Religions (925); Christianity appears in just 17 rows, which is expected in a dataset restricted to unreached groups. Entropy ratio of 0.65 indicates moderate concentration on the top categories. Treatment: One-hot or target-encode for modelling; consider folding the 'Unknown', 'Non-Religious', and 'Christianity' tails into 'Other'. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
8
top_value
Islam
top_rate
0.4558
cardinality
8
entropy
1.955
entropy_ratio
0.6516

RLG4

numeric feature null_rate outliers
RLG4 is a sparse numeric column populated for only ~7.6% of rows (null_rate 0.9239) with just 18 distinct integer-like values ranging 10 to 39. It shares both its 6,582 nulls and its 18 distinct values with ReligionSubdivision, which suggests a numeric code for that label and not a measured quantity. Skew is positive (1.05) even though the mean (18.19) sits below the median (20.0), so the bulk lies at or below 20 with a thin tail toward 39; 30 values (5.5% of present values) are flagged as outliers against an IQR of 6. Treatment: Treat as a nominal code with an explicit missing level, or drop in favour of ReligionSubdivision if the two map one-to-one; numeric imputation is not meaningful for a code with 92% nulls. medium · anthropic:claude-opus-4-7
n
7,124
nulls
6,582 (92.4%)
unique
18
min
10
max
39
mean
18.19
median
20
std
6.472
q1
14
q3
20
iqr
6
skew
1.051
kurtosis
1.474
n_outliers
30
outlier_rate
0.05535
zero_rate
0

ReligionSubdivision

categorical feature null_rate
A sub-classification of religion (denomination/sect), with 18 distinct values like Sunni, Judaism, Sikhism, Tibetan, and Theravada. The column is 92.39% null, so it is only populated for the small subset of records where a finer-grained religious branch applies. Among the 542 populated rows, Sunni leads at 29.52% (160 occurrences), and entropy ratio 0.72 indicates the populated values are spread fairly evenly across branches. Treatment: Treat missingness as its own category and one-hot encode, or roll up into the parent PrimaryReligion field before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
6,582 (92.4%)
unique
18
top_value
Sunni
top_rate
0.2952
cardinality
18
entropy
2.984
entropy_ratio
0.7157
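
For sparse columns it helps to restate top_rate as a share of all rows. The sketch below assumes top_rate is computed over non-null values only, which fits this column (0.2952 of 542 populated rows gives the 160 Sunni occurrences); the helper name is hypothetical.

```python
def share_of_all_rows(top_rate, n, nulls):
    """Convert a top_rate measured over non-null rows into (count, share of all rows)."""
    non_null = n - nulls
    count = round(top_rate * non_null)
    return count, count / n

# ReligionSubdivision: n=7124, nulls=6582, top_rate=0.2952
sunni_count, sunni_share = share_of_all_rows(0.2952, 7124, 6582)
print(sunni_count, round(sunni_share, 4))
```

Seen against the whole table, the leading subdivision covers only about 2% of rows, which is a more useful number when deciding whether the column is worth encoding.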

PCIslam

numeric feature
PCIslam appears to be a percentage-style indicator of Islamic affiliation, bounded between 0 and 100 with a near-zero null rate (0.0013). The distribution is starkly bimodal rather than continuous: 47.1% of rows are exactly zero, the median is 0.28, yet Q3 sits at 99.99, producing a kurtosis of -1.93 and an IQR spanning nearly the full range. Mean (45.2) and std (48.2) confirm the mass is piled at the extremes rather than around the center. Treatment: Treat as bimodal: consider binarizing (0 vs >0) or binning rather than using raw value in linear models. high · anthropic:claude-opus-4-7
n
7,124
nulls
9 (0.1%)
unique
902
min
0
max
100
mean
45.22
median
0.2753
std
48.22
q1
0
q3
99.99
iqr
99.99
skew
0.1703
kurtosis
-1.935
n_outliers
0
outlier_rate
0
zero_rate
0.4713

PCNonReligious

numeric feature high_skew outliers
PCNonReligious appears to be the percentage of non-religious individuals in each record, but the distribution is dominated by zeros — 87.5% of values are exactly 0 and the entire IQR collapses to 0. The remaining tail stretches to 99.0 with skew of 9.1 and kurtosis of 125.3, producing 886 outliers (12.5% of rows). Mean (1.02) sits far above median (0), so any modelling that assumes symmetry will be misled. Treatment: Treat as zero-inflated; consider a binary is_nonzero flag plus a log1p transform of the positive tail. high · anthropic:claude-opus-4-7
n
7,124
nulls
23 (0.3%)
unique
152
min
0
max
99
mean
1.016
median
0
std
4.549
q1
0
q3
0
iqr
0
skew
9.105
kurtosis
125.3
n_outliers
886
outlier_rate
0.1248
zero_rate
0.8752
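
A minimal sketch of the zero-inflated treatment named above: a binary presence flag plus a log1p transform of the value. Nulls are passed through as None so they can be handled separately; the function name is illustrative.

```python
import math

def zero_inflated_features(pct):
    """Return (is_nonzero, log1p(pct)) for a percentage, or (None, None) if missing."""
    if pct is None:
        return None, None
    is_nonzero = int(pct > 0)
    return is_nonzero, math.log1p(pct)

flag_zero, log_zero = zero_inflated_features(0.0)
flag_max, log_max = zero_inflated_features(99.0)
```

Because log1p(0) is exactly 0, zero rows stay at zero after the transform while the long tail up to 99 is compressed to roughly 4.6.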

PCUnknown

numeric feature high_skew outliers
PCUnknown is a numeric column expressing what looks like a percentage (range 0-100) of 'unknown' classification, with 92.8% of values being zero and a median/Q1/Q3 all at 0. The distribution is extremely right-skewed (skew 6.45, kurtosis 39.85) with 510 outliers (7.2%) extending up to 100. With 388 unique values and only 0.35% nulls, it carries sparse but potentially meaningful signal in the long tail. Treatment: Binarize (zero vs non-zero) or log-transform the non-zero tail before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
25 (0.4%)
unique
388
min
0
max
100
mean
2.28
median
0
std
14.59
q1
0
q3
0
iqr
0
skew
6.454
kurtosis
39.85
n_outliers
510
outlier_rate
0.07184
zero_rate
0.9282

SecurityLevel

numeric feature outliers
SecurityLevel is an ordinal/categorical code stored as numeric, with only 3 distinct values spanning 0 to 2 across 7,124 complete rows. The distribution is heavily concentrated at the top tier (median, Q1, and Q3 all equal 2.0, mean 1.595), yet 15.6% of rows are 0 and the IQR-based outlier check flags 24.9% of records — an artifact of the degenerate IQR of 0 rather than true anomalies. Strong negative skew (-1.47) confirms the mass sits at level 2. Treatment: Treat as a 3-level ordinal category (one-hot or ordered encode); ignore the outlier flag since IQR is zero. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
3
min
0
max
2
mean
1.595
median
2
std
0.7442
q1
2
q3
2
iqr
0
skew
-1.466
kurtosis
0.4048
n_outliers
1,771
outlier_rate
0.2486
zero_rate
0.1564

LRTop100

categorical label imbalance
Binary Y/N flag indicating membership in some 'LRTop100' set, with exactly 100 rows marked 'Y' out of 7124 — strongly suggesting a curated top-100 list. The distribution is severely imbalanced (98.6% 'N', entropy ratio 0.107), which is flagged as an imbalance alert. No nulls are present. Treatment: Use as a binary indicator; if modelling, apply class-imbalance handling (stratification or reweighting). high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.986
cardinality
2
entropy
0.1065
entropy_ratio
0.1065
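
One common reweighting heuristic gives each class a weight of n / (k * count), so that both classes contribute equally in aggregate. The sketch below applies it to the 100 'Y' versus 7024 'N' split; it illustrates the idea and is not a recommendation from the profiler.

```python
def balanced_class_weights(counts):
    """Weight each class by n / (k * count) so classes contribute equally overall."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# LRTop100: 100 'Y' rows out of 7124, so 7024 'N' rows
weights = balanced_class_weights({"Y": 100, "N": 7024})
print(weights)
```

Each 'Y' row ends up weighted about 70 times more heavily than an 'N' row, which is worth keeping in mind when only 100 positive examples exist.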

PhotoAddress

text metadata one_word short_text duplicates
PhotoAddress holds single-token image filenames following a 'pXXXXX.jpg' pattern (one_word_rate 1.0, len_max 13). Coverage is incomplete: 1970 of 7124 rows are empty strings. The duplicate_rate of 0.596 is inflated by those repeated empties, but genuine reuse is still common (e.g., p19007.jpg appears 90 times). Only 2880 unique values back 7124 rows, suggesting shared stock images or a many-to-one photo lookup and not a per-row asset. Treatment: Treat as a file reference: drop from modelling, or join to an image table after handling the 1970 empty strings. high · anthropic:claude-opus-4-7
n
7,124
nulls
1 (0.0%)
unique
2,880
len_min
0
len_max
13
len_mean
7.26
len_median
10
len_p95
10
word_mean
1
word_median
1
n_empty
1,970
n_duplicates
4,243
duplicate_rate
0.5957
vocab_size
2,879
readability_flesch_mean
84.01
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
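
The reported n_duplicates includes repeats of the empty string, so it overstates how often a real filename is reused. The arithmetic below separates the two, assuming the empty string is counted as one of the unique values (the vocab_size of 2,879 against 2,880 unique is consistent with that). The helper name is hypothetical.

```python
def non_empty_duplicates(n, nulls, n_empty, unique):
    """Return (repeat_count, repeat_rate) among non-null, non-empty values."""
    non_empty_rows = n - nulls - n_empty
    # Assumption: the empty string occupies one slot in `unique` when present.
    non_empty_unique = unique - (1 if n_empty > 0 else 0)
    repeats = non_empty_rows - non_empty_unique
    return repeats, repeats / non_empty_rows

# PhotoAddress: n=7124, nulls=1, n_empty=1970, unique=2880
photo_repeats, photo_rate = non_empty_duplicates(7124, 1, 1970, 2880)
print(photo_repeats, round(photo_rate, 3))
```

Under that assumption, 2,274 of the 5,153 real filenames (about 44%) repeat an earlier one, so photo reuse is substantial even after the empties are set aside.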

PhotoCredits

categorical metadata long_tail
PhotoCredits captures the attribution string for an associated image, with 851 distinct credits across 7124 rows. The column is dominated by missing-style values: 1970 rows (27.7%) are empty strings and another 1496 are 'Anonymous', so just under half (about 48.7%) lack a real attribution. The remaining tail is long and idiosyncratic, mixing organisations ('Operation China, Asia Harvest'), individuals ('Isudas', 'Kerry Olson'), and platform tags ('Steve Evans - Flickr'). Treatment: Treat empty and 'Anonymous' as missing and keep only as provenance metadata; do not use as a model feature. high · anthropic:claude-opus-4-7
n
7,124
nulls
10 (0.1%)
unique
851
top_value
(empty string)
top_rate
0.2769
cardinality
851
entropy
5.584
entropy_ratio
0.5737

PhotoCreditURL

categorical metadata long_tail null_rate
URL string crediting the source of an associated photo, dominated by a single domain (asiaharvest.org appears 443 times) alongside a long tail of 774 distinct values. 36% of rows are null, and 43.21% of the non-null values are empty strings (about 1970 rows, the same count as the empty PhotoAddress values), so roughly 64% of rows carry no usable credit. Remaining values mix organisational sites (newcovenantmissions, createinternational), shorteners (tinyurl), and stock-photo hosts (pixabay, pxhere, flickr). Treatment: Drop for modelling; if provenance matters, parse to domain and treat as a low-coverage attribution field. high · anthropic:claude-opus-4-7
n
7,124
nulls
2,565 (36.0%)
unique
774
top_value
(empty string)
top_rate
0.4321
cardinality
774
entropy
5.389
entropy_ratio
0.5616
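
If provenance matters, the parse-to-domain step could look like the sketch below. The example URLs and paths are hypothetical; only the domain names come from the profile. Values without a scheme are given one before parsing, since `urlparse` would otherwise treat the host as a path.

```python
from urllib.parse import urlparse

def credit_domain(url):
    """Return the lower-cased host without a leading 'www.', or None if unusable."""
    if url is None or not url.strip():
        return None
    text = url.strip()
    if "://" not in text:
        text = "http://" + text  # assume a bare host/path when no scheme is given
    host = urlparse(text).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host

# Hypothetical inputs for illustration
examples = ["https://www.asiaharvest.org/some/page", "tinyurl.com/abc", "", None]
domains = [credit_domain(u) for u in examples]
```

Grouping by the returned domain turns 774 distinct strings into a much shorter list of sources, with None covering both nulls and empty strings.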

PhotoCreativeCommons

categorical feature
Binary Y/N flag indicating whether a photo carries a Creative Commons licence. The vast majority (top_rate 0.7981) are 'N' with only 1437 'Y' values, and nulls are negligible (null_rate 0.0007). Class imbalance is notable but not extreme. Treatment: Encode as a 0/1 boolean; impute the handful of nulls with the mode 'N'. high · anthropic:claude-opus-4-7
n
7,124
nulls
5 (0.1%)
unique
2
top_value
N
top_rate
0.7981
cardinality
2
entropy
0.7256
entropy_ratio
0.7256

PhotoCopyright

categorical feature
Binary Y/N flag indicating whether a photo carries copyright restrictions, with 'N' dominating at 80.6% of 7,124 rows and only 2 unique values. Nulls are negligible (0.17%) and entropy ratio of 0.71 reflects the moderate class imbalance. No anomalies beyond the expected skew toward unrestricted photos. Treatment: Encode as a boolean (Y=1, N=0) and impute the handful of nulls with the mode. high · anthropic:claude-opus-4-7
n
7,124
nulls
12 (0.2%)
unique
2
top_value
N
top_rate
0.8064
cardinality
2
entropy
0.709
entropy_ratio
0.709

PhotoPermission

categorical feature
Y/N opt-in flag for photo permission. The column is heavily skewed toward 'N' at 80.4% of non-null values (5715 rows), with a near-zero null rate of 0.2%. The profile counts 3 unique values because of a case inconsistency: 2 records use lowercase 'y' alongside 1393 uppercase 'Y', so case-sensitive joins or filters will miscount. Treatment: Normalize case (upper) and encode as boolean before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
14 (0.2%)
unique
3
top_value
N
top_rate
0.8038
cardinality
3
entropy
0.7173
entropy_ratio
0.4526
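
A small sketch of the normalise-then-encode step for Y/N flags with inconsistent case. Anything that is not 'Y' or 'N' after trimming and upper-casing is returned as None so it can be reviewed instead of being silently coerced; the function name is illustrative.

```python
def yn_to_bool(value):
    """Map 'Y'/'y' to True, 'N'/'n' to False, and anything else to None."""
    if value is None:
        return None
    text = str(value).strip().upper()
    if text == "Y":
        return True
    if text == "N":
        return False
    return None

encoded_flags = [yn_to_bool(v) for v in ["Y", "y", "N", None, ""]]
```

After this step the 2 lowercase 'y' rows fold into the 1393 uppercase ones, and the column has two levels plus missing.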

ProfileTextExists

categorical feature imbalance
A binary flag indicating whether a profile text exists, with values Y/N. The column is severely imbalanced: 6888 of 7124 rows (96.7%) are Y, leaving only 236 N, yielding a low entropy ratio of 0.21. Treatment: Encode as a 0/1 indicator but expect minimal predictive signal due to severe class imbalance. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.9669
cardinality
2
entropy
0.2098
entropy_ratio
0.2098

CountOfCountries

numeric feature high_skew outliers
Likely the number of countries in which each row's people group is found, ranging from 1 to 164 across 7124 rows with no nulls. The distribution is severely right-skewed (skew 5.67, kurtosis 33.17): the median is just 2 and Q3 is 4, yet the mean is 8.11 and 16.98% of rows flag as outliers. A long tail of records with high country counts drags the mean far above typical values. Treatment: Log-transform or cap at a high quantile before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
39
min
1
max
164
mean
8.108
median
2
std
24.27
q1
1
q3
4
iqr
3
skew
5.672
kurtosis
33.17
n_outliers
1,210
outlier_rate
0.1698
zero_rate
0
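
Capping at a high quantile can be sketched with the nearest-rank method, as below. The sample values are illustrative, not drawn from the file, and the choice of quantile is an assumption to be tuned.

```python
import math

def cap_at_quantile(values, q=0.99):
    """Cap values at the q-th quantile (nearest-rank); return (capped_values, cap)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based nearest rank
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values], cap

sample = [1, 1, 2, 2, 3, 4, 4, 7, 30, 164]  # illustrative counts only
capped, cap = cap_at_quantile(sample, q=0.9)
```

Capping keeps the ordering of typical rows intact and only flattens the extreme tail, which can be preferable to a log transform when the small counts (1, 2, 3) should stay on their natural scale.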

CountOfProvinces

unknown other skipped
Saturn skipped profiling on CountOfProvinces, so beyond a row count of 7124 and zero nulls, no distributional evidence is available. The name suggests an integer count of provinces per record, but unique count, range, and summary stats are all missing. Without further inspection the column's actual content and cardinality cannot be confirmed. Treatment: Re-profile or manually inspect this column before use; saturn skipped it. low · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique

Longitude

numeric feature
Geographic longitude coordinates spanning the full global range from -173.08 to 178.44 degrees. The distribution is heavily left-skewed (-1.40) with a median of 75.23 sitting well above the mean of 62.80, suggesting concentration in eastern hemisphere locations with a tail of western-hemisphere points. About 4.4% of values (316 rows) fall outside the typical IQR range. Treatment: Pair with latitude for geospatial features; consider clustering or binning rather than treating as a raw scalar. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
6,713
min
-173.1
max
178.4
mean
62.8
median
75.23
std
44.79
q1
40.81
q3
88.22
iqr
47.41
skew
-1.402
kurtosis
2.859
n_outliers
316
outlier_rate
0.04436
zero_rate
0

Latitude

numeric feature
Latitude values for 7124 rows spanning -42.61 to 71.84 with a median of 25.02 — consistent with geographic latitudes in degrees. Distribution leans toward northern hemisphere (mean 23.54, skew -0.70) with 292 outliers (4.1%) likely representing far-southern or far-northern records. No nulls and 6696 unique values suggest near-record-level coordinates. Treatment: Pair with longitude for geospatial features; avoid standard scaling alone since latitude is bounded and non-linear in distance. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
6,696
min
-42.61
max
71.84
mean
23.54
median
25.02
std
14.92
q1
15.55
q3
31.61
iqr
16.06
skew
-0.702
kurtosis
2.141
n_outliers
292
outlier_rate
0.04099
zero_rate
0
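
One way to pair latitude with longitude without treating either as a raw scalar is to map each point to a 3D unit vector on the sphere. This is a sketch of that idea, not something the profile prescribes; the function name is illustrative.

```python
import math

def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

origin = to_unit_vector(0.0, 0.0)
median_point = to_unit_vector(25.02, 75.23)  # the medians reported above
```

The encoding removes the artificial jump between longitudes 180 and -180, and Euclidean distance between vectors increases monotonically with great-circle distance, so clustering behaves sensibly.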

Ctry

categorical feature
Country field with 202 distinct values across 7,124 rows and zero nulls. India dominates at 28.5% (2,032 rows), followed by Pakistan (767) and China (442); the remaining 199 countries form a long tail, and the entropy ratio of 0.66 indicates concentrated but globally distributed coverage. The South/Central Asia skew is the headline pattern: five of the top six values are Asian. Treatment: Group rare countries into an 'Other' bucket or encode by region before modelling to avoid 200-way one-hot blow-up. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
202
top_value
India
top_rate
0.2852
cardinality
202
entropy
5.058
entropy_ratio
0.6605
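
Grouping rare countries can be sketched with a frequency threshold. The sample list and the `min_count` value are illustrative assumptions; a real threshold would be chosen from the actual value counts.

```python
from collections import Counter

def lump_rare(values, min_count, other="Other"):
    """Replace categories seen fewer than min_count times with a single bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Illustrative sample, not the real distribution
sample = ["India"] * 5 + ["Pakistan"] * 3 + ["China"] * 2 + ["Fiji"]
lumped = lump_rare(sample, min_count=2)
```

The threshold should be fitted on training data only and then reused, so that a country does not switch between its own level and 'Other' across splits.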

IndigenousCode

categorical feature
A binary Y/N flag indicating Indigenous status, fully populated across all 7124 rows. The distribution is imbalanced: 'Y' accounts for 79.4% (5657 rows) versus 1467 'N' rows, which is a notable skew to keep in mind for any stratified analysis. Treatment: Encode as a binary indicator and watch for class imbalance when used as a predictor or stratifier. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.7941
cardinality
2
entropy
0.7336
entropy_ratio
0.7336

PercentAdherents

categorical feature long_tail
PercentAdherents appears to be a numeric measure (likely a percentage or rate of religious adherents) stored as strings, with 692 distinct values across 7,124 rows and no nulls. It is dominated by '0.000', which accounts for 56.2% of records, and the long tail of small integer- and decimal-valued strings drives entropy down to a ratio of 0.43. The format mixing whole numbers like '5.000' with fractions like '0.200' suggests these are raw values rather than binned categories. Treatment: Cast to float and treat as a zero-inflated numeric feature rather than a category. medium · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
692
top_value
0.000
top_rate
0.5625
cardinality
692
entropy
4.046
entropy_ratio
0.4288
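
Casting the string-typed percentage columns to float can be sketched as below. Blank or unparseable values become None instead of raising, on the assumption that they should be treated as missing; the function name is illustrative.

```python
def parse_percent(text):
    """Parse a numeric string such as '0.200' to float; return None if not parseable."""
    if text is None:
        return None
    text = str(text).strip()
    if text == "":
        return None
    try:
        return float(text)
    except ValueError:
        return None

parsed = [parse_percent(v) for v in ["0.000", "5.000", "0.200", "", None, "n/a"]]
```

Once cast, the same column can be checked for range (0 to 100) and passed to a zero-indicator or log transform like any other numeric feature.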

PercentChristianPC

categorical feature
Stored as a categorical, but the 184 distinct values are numeric strings ranging from '0.000' to figures like '8.571', suggesting a percent-Christian metric cast as text. By analogy with the PGAC columns, the PC suffix most plausibly marks a people-cluster-level figure. The distribution is concentrated: '0.482' alone covers 12.2% of 7124 rows and the top 10 values account for a large share, yet entropy ratio of 0.79 indicates the long tail still carries information. No nulls, and the repeated exact decimals are what a group-level aggregate copied onto every member row would look like, as opposed to raw per-row measurements. Treatment: cast to float and treat as a continuous feature; investigate the heavy spike at 0.482 before modelling. medium · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
184
top_value
0.482
top_rate
0.122
cardinality
184
entropy
5.934
entropy_ratio
0.7887

NaturalName

text label one_word duplicates
Short ethnonym/community labels (e.g., 'Deaf', 'Turk', 'Persian', 'Japanese'), averaging 11.8 characters and 1.7 words with a median of 2 words. About 34% of rows are duplicates (2,419) and ~49% are single-word entries, with 4,705 unique values across 7,124 rows. Surprising signals: 'Deaf' tops the list at 151 occurrences, and top words include parenthetical religious qualifiers like 'traditions)', '(hindu', '(muslim' (952/477/411), suggesting many entries carry trailing tradition tags that the tokenizer split awkwardly. Treatment: Normalize casing and strip parenthetical tradition suffixes, then treat as a categorical label. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
4,705
len_min
1
len_max
39
len_mean
11.84
len_median
10
len_p95
27
word_mean
1.723
word_median
2
n_empty
0
n_duplicates
2,419
duplicate_rate
0.3396
vocab_size
4,343
readability_flesch_mean
56.42
emoji_rate
0
url_rate
0
one_word_rate
0.4885
allcaps_rate
0
boilerplate_rate
0
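
Stripping a trailing parenthetical can be done with a small regular expression. The sketch below also returns the removed tag so the tradition information is not lost; the example name is hypothetical, and only the '(Hindu traditions)' style of suffix is taken from the top-word evidence above.

```python
import re

# Matches one trailing "(...)" group with optional surrounding whitespace
_SUFFIX = re.compile(r"\s*\([^()]*\)\s*$")

def split_tradition(name):
    """Return (base_name, tag) where tag is the trailing parenthetical or None."""
    match = _SUFFIX.search(name)
    if not match:
        return name.strip(), None
    base = name[: match.start()].strip()
    tag = match.group(0).strip()[1:-1].strip()
    return base, tag

with_tag = split_tradition("Example (Hindu traditions)")  # hypothetical name
without_tag = split_tradition("Deaf")
```

Keeping the tag as its own column lets the base name act as the label while the tradition remains available as a separate feature.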

NaturalPronunciation

text metadata one_word null_rate duplicates
Phonetic respellings of ethnonyms — short hyphenated pronunciation guides like 'PUR-zhun', 'jae-puh-NEEZ', and 'pahsh-TOON' — accompanying some other label column. Values are overwhelmingly single tokens (one_word_rate 0.73, word_mean 1.28, len_mean 10.8) and 48.5% are null, so coverage is partial. Duplicates dominate (n_duplicates 2183, duplicate_rate 0.59) with only 1489 unique forms across 7124 rows, suggesting a small controlled vocabulary repeated across records. Treatment: Treat as an optional pronunciation lookup keyed to the parent term; drop or impute before modelling given ~48% nulls. high · anthropic:claude-opus-4-7
n
7,124
nulls
3,452 (48.5%)
unique
1,489
len_min
2
len_max
42
len_mean
10.77
len_median
10
len_p95
21
word_mean
1.281
word_median
1
n_empty
0
n_duplicates
2,183
duplicate_rate
0.5945
vocab_size
1,537
readability_flesch_mean
69.93
emoji_rate
0
url_rate
0
one_word_rate
0.7271
allcaps_rate
0.0005447
boilerplate_rate
0

PercentChristianPGAC

categorical feature
This column appears to be the percentage of Christians at the PGAC (People Group Across Countries) level, stored as strings with three-decimal precision instead of a numeric type. It is heavily zero-inflated: '0.000' accounts for 43.8% of the 7,124 rows (3,121 occurrences), and the second mode is '3.733' at 151 rows. That count matches the 151 'Deaf' rows in NaturalName, so the spike may be a single group-level value repeated across that group's country records and not a data error. With 842 distinct values and entropy ratio 0.58, the distribution is concentrated but long-tailed, and the null rate is negligible at 0.07%. Treatment: Cast to numeric and consider a zero-inflated transform (e.g., log1p with a zero indicator) before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
5 (0.1%)
unique
842
top_value
0.000
top_rate
0.4384
cardinality
842
entropy
5.681
entropy_ratio
0.5846

PercentEvangelical

categorical feature long_tail
PercentEvangelical reads as a numeric share of evangelicals stored as strings, with 401 distinct values across 7124 rows. The distribution is heavily zero-inflated: "0.000" makes up 65.7% of non-null values (about 4,195 rows, or 58.9% of all rows) and another 10.4% of rows are null, leaving a long tail of small fractions like 0.100, 0.200, 0.500. Entropy ratio of 0.364 confirms most of the signal collapses onto that single zero bucket. Treatment: Cast to float, impute the 10.4% nulls, and consider a zero-vs-nonzero indicator alongside the raw value to handle the zero inflation. high · anthropic:claude-opus-4-7
n
7,124
nulls
741 (10.4%)
unique
401
top_value
0.000
top_rate
0.6572
cardinality
401
entropy
3.146
entropy_ratio
0.3638

PercentEvangelicalPC

categorical feature
PercentEvangelicalPC appears to be a numeric percentage (an evangelical population share, most plausibly at the people-cluster level by analogy with the PGAC columns) that has been stored as strings, yielding 166 distinct values across 7124 rows with a 2.15% null rate. The distribution is concentrated: the top value '0.199' covers 12.47% of non-null values (about 869 rows), and the leading entries cluster near zero ('0.095', '0.000', '0.004'), yet some values reach above 3 ('3.409', '3.339'), suggesting a long right tail. Entropy ratio of 0.78 indicates moderate concentration, not uniformity. Treatment: Cast to float, impute the ~2% nulls, and consider log or rank transform given the right-tailed values. medium · anthropic:claude-opus-4-7
n
7,124
nulls
153 (2.1%)
unique
166
top_value
0.199
top_rate
0.1247
cardinality
166
entropy
5.777
entropy_ratio
0.7833

PercentEvangelicalPGAC

categorical feature
Numeric percentages (likely the evangelical share at the People Group Across Countries level) stored as strings, hence profiled as categorical with 548 distinct values. The distribution is heavily zero-inflated: '0.000' accounts for 48.9% of non-null values (about 3,264 rows, or 45.8% of all rows), with a secondary spike at '1.801' (151 rows). That is the same row count as the '3.733' spike in PercentChristianPGAC, which points to one group-level value repeated across many country records. Null rate is 6.32% and entropy ratio is 0.55, consistent with a long tail of small fractional values. Treatment: Cast to float, impute or flag nulls, and consider a zero-indicator plus log/sqrt transform given the heavy zero mass. high · anthropic:claude-opus-4-7
n
7,124
nulls
450 (6.3%)
unique
548
top_value
0.000
top_rate
0.4891
cardinality
548
entropy
4.972
entropy_ratio
0.5465

PCBuddhism

numeric feature high_skew outliers
PCBuddhism appears to be a percentage feature measuring the Buddhist share of each people group (PC here reads as 'percent', in line with PCIslam and PCHinduism), ranging 0 to 100 with mean 6.41. The distribution is extremely zero-inflated: 82.99% of rows are exactly 0, the entire IQR collapses to 0, and so every non-zero row (17.01%) is flagged as an outlier, with skew 3.48 and kurtosis 10.56. Buddhism is absent from most groups but reaches sizeable concentrations in a long tail. Treatment: Treat as zero-inflated proportion: add a presence indicator and log1p-transform the non-zero tail before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
24 (0.3%)
unique
809
min
0
max
100
mean
6.411
median
0
std
22.39
q1
0
q3
0
iqr
0
skew
3.475
kurtosis
10.56
n_outliers
1,208
outlier_rate
0.1701
zero_rate
0.8299

PCEthnicReligions

numeric feature high_skew outliers
PCEthnicReligions is a numeric percentage-style feature (0-100), most plausibly the share of each people group that follows ethnic religions. It is overwhelmingly zero: 78% of values are 0 and the entire interquartile range collapses to 0, yet the mean is 13.1 with std 30.7, indicating a minority of records carry very large shares. Skew of 2.16 and a 22% outlier rate confirm a sparse, heavy-tailed distribution and not a smooth continuum. Treatment: Binarize (zero vs non-zero) or apply a zero-inflated/log1p transform before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
18 (0.3%)
unique
351
min
0
max
100
mean
13.11
median
0
std
30.74
q1
0
q3
0
iqr
0
skew
2.155
kurtosis
2.885
n_outliers
1,560
outlier_rate
0.2195
zero_rate
0.7805

PCHinduism

numeric feature
This column appears to be the percentage of each people group that is Hindu, ranging from 0 to 100 with a mean of 29.8. The distribution is U-shaped: 67.7% of rows are exactly zero while Q3 sits at 98.4, so most groups have no Hindu presence and a substantial minority are almost entirely Hindu. Skew is 0.87 and kurtosis -1.22, consistent with mass at both extremes and no single central peak. Treatment: Consider a zero-vs-nonzero indicator plus the raw percentage, since a flat numeric treatment will hide the bimodal structure. high · anthropic:claude-opus-4-7
n
7,124
nulls
24 (0.3%)
unique
1,131
min
0
max
100
mean
29.82
median
0
std
44.98
q1
0
q3
98.42
iqr
98.42
skew
0.8721
kurtosis
-1.216
n_outliers
0
outlier_rate
0
zero_rate
0.6768

PCOtherSmall

numeric feature high_skew outliers
PCOtherSmall is a numeric feature where 88% of rows are zero and the IQR is zero, meaning the bottom three quartiles are all 0. The remaining mass stretches up to 100 with mean 1.84 and std 12.33, producing severe right skew (7.39) and very heavy tails (kurtosis 54.18). About 12% of rows (851) flag as outliers, suggesting this is a sparse share/percentage indicator that fires only for a small subset of records. Treatment: Binarize presence (>0) or apply log1p before modelling to tame the skew. high · anthropic:claude-opus-4-7
n
7,124
nulls
24 (0.3%)
unique
670
min
0
max
100
mean
1.836
median
0
std
12.33
q1
0
q3
0
iqr
0
skew
7.39
kurtosis
54.18
n_outliers
851
outlier_rate
0.1199
zero_rate
0.8801

RegionCode

numeric feature outliers
RegionCode holds 12 distinct integer values from 1 to 12 with no nulls, so it is almost certainly a categorical region identifier stored as a number rather than a true numeric measure. The distribution is concentrated around the median of 4 with an IQR of just 2, yet the right skew of 1.12 and 601 flagged outliers (8.4%) reflect the long tail of higher-numbered regions rather than genuine anomalies. Treatment: Cast to categorical and one-hot or target-encode; do not treat as a continuous numeric. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
12
min
1
max
12
mean
5.005
median
4
std
2.457
q1
4
q3
6
iqr
2
skew
1.122
kurtosis
0.5775
n_outliers
601
outlier_rate
0.08436
zero_rate
0

PopulationPGAC

numeric feature high_skew outliers
PopulationPGAC appears to be the population of each people group summed across countries (PGAC, People Group Across Countries), spanning 10 to roughly 925 million across 7,124 rows with only 0.07% nulls. The distribution is extraordinarily right-skewed (skew 25.5, kurtosis 1052): the median is 130,300 while the mean is 4.88 million, and 17.8% of rows flag as outliers. With 1,509 unique values across 7,124 rows, the same population figures repeat heavily, which fits a group-level total copied onto each of that group's country rows. Treatment: log-transform before regression to tame the extreme right skew. medium · anthropic:claude-opus-4-7
n
7,124
nulls
5 (0.1%)
unique
1,509
min
10
max
9.251e+08
mean
4.881e+06
median
130,300
std
2.095e+07
q1
20,000
q3
1.435e+06
iqr
1.415e+06
skew
25.48
kurtosis
1052
n_outliers
1,264
outlier_rate
0.1776
zero_rate
0

Frontier

categorical feature
Binary Y/N flag indicating whether a people group is classified as 'Frontier' (the profile does not give the definition), with no nulls across 7124 rows. The split is imbalanced toward Y at 66.9% (4767) versus N at 2357, though entropy ratio of 0.92 shows both classes are well represented. Treatment: Encode as a 0/1 indicator before modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.6691
cardinality
2
entropy
0.9158
entropy_ratio
0.9158
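
The entropy figures above are consistent with Shannon entropy in bits, divided by log2 of the cardinality to give the ratio. That reading is an inference from the reported numbers, not something the report states. A minimal check for Frontier:

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits, skipping zero-probability classes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Frontier: top_rate 0.6691 for 'Y', the remainder for 'N' (from the profile above).
p_yes = 0.6691
frontier_entropy = entropy_bits([p_yes, 1 - p_yes])

# Assumed normalisation: divide by the maximum entropy for the cardinality.
# For a two-class column that maximum is log2(2) = 1 bit, so ratio equals entropy.
frontier_ratio = frontier_entropy / math.log2(2)
print(frontier_entropy, frontier_ratio)
```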

MapAddress

text foreign_key one_word short_text duplicates
MapAddress holds single-token PNG filenames (e.g. 'm00328.png'), with one_word_rate of 1.0 and max length 13, suggesting it points to a map image asset. 1500 of 7124 rows are empty strings, and the duplicate_rate of 0.352 is computed over all rows, so it counts those repeated empties; among the 5624 non-empty rows, 1009 (about 18%) are repeats of a filename already seen, meaning some records share the same map. With 4616 unique values across 7124 rows, this behaves like a foreign reference to a finite set of map images rather than a free-text field. Treatment: Treat as a categorical asset reference: impute or flag the 1500 empties and join to a map-image lookup. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
4,616
len_min
0
len_max
13
len_mean
8.649
len_median
10
len_p95
13
word_mean
1
word_median
1
n_empty
1,500
n_duplicates
2,508
duplicate_rate
0.352
vocab_size
4,615
readability_flesch_mean
17.62
emoji_rate
0
url_rate
0
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
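
The reported duplicate_rate includes the repeated empty strings. The arithmetic below separates the two, assuming n_duplicates is simply n minus unique (which matches the reported 2,508) and that the empty string is counted as one of the unique values (which matches vocab_size being unique minus one).

```python
# Figures from the MapAddress profile above.
n_rows, n_empty, n_unique = 7124, 1500, 4616

# Assumption: n_duplicates = n_rows - n_unique.
n_duplicates = n_rows - n_unique

# Exclude the empty string on both sides of the count.
non_empty_rows = n_rows - n_empty
non_empty_unique = n_unique - 1
non_empty_duplicates = non_empty_rows - non_empty_unique

non_empty_duplicate_rate = non_empty_duplicates / non_empty_rows
print(n_duplicates, non_empty_duplicates, round(non_empty_duplicate_rate, 3))
# prints: 2508 1009 0.179
```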

HasJesusFilm

categorical feature
Binary Y/N flag indicating whether the Jesus Film is available for the entity (likely a language or people group). Heavily skewed toward 'Y' at 78.7% (5,610 of 7,124), with no nulls across all 7,124 rows. Entropy of 0.746 reflects the imbalance but still leaves a usable minority class of 1,514 'N' values. Treatment: Encode as 0/1 boolean; account for class imbalance if used as a target. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.7875
cardinality
2
entropy
0.7463
entropy_ratio
0.7463

Nomadic

categorical feature imbalance
Binary Y/N flag indicating nomadic status, with no nulls across 7124 rows. Severely imbalanced: 'N' dominates at 96.6% (6884 rows) versus only 240 'Y' cases, yielding a low entropy ratio of 0.21. Treatment: Encode as binary; consider class-weighting or stratified sampling due to severe imbalance. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.9663
cardinality
2
entropy
0.2126
entropy_ratio
0.2126
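
A sketch of the encoding and class-weighting suggested above, using the counts from this profile. The weighting formula is the common "balanced" heuristic (rows divided by classes times class count); it is one option among several and not something the report prescribes.

```python
# Class counts for Nomadic taken from the profile above.
counts = {"N": 6884, "Y": 240}
n_rows = sum(counts.values())

# "Balanced" heuristic: n_rows / (n_classes * class_count), so the rare class weighs more.
class_weights = {label: n_rows / (len(counts) * count) for label, count in counts.items()}

# 0/1 encoding for the flag itself.
encode = {"N": 0, "Y": 1}
encoded_sample = [encode[value] for value in ["N", "N", "Y"]]

print(class_weights, encoded_sample)
```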

NomadicTypeDescription

categorical feature null_rate
This is a low-cardinality categorical describing the type of nomadic livelihood, with only 6 distinct values dominated by 'Agro-Pastoralists' (76.7% of non-nulls, 184 records). The column is almost entirely empty — null_rate is 0.9663, leaving roughly 240 populated rows out of 7124. Several values are comma-joined combinations (e.g., 'Agro-Pastoralists, Service or Trade'), suggesting the field encodes multi-label memberships as concatenated strings. Treatment: Split comma-separated values into multi-hot indicators and treat missingness as its own category given the 96.6% null rate. high · anthropic:claude-opus-4-7
n
7,124
nulls
6,884 (96.6%)
unique
6
top_value
Agro-Pastoralists
top_rate
0.7667
cardinality
6
entropy
1.159
entropy_ratio
0.4483
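
The multi-hot split described above could look like this in pandas. The two label strings are the ones quoted in the description; the separator ", " and the name of the missing-indicator column are assumptions.

```python
import pandas as pd

# Two values quoted in the profile plus a null, standing in for the real column.
s = pd.Series(
    ["Agro-Pastoralists", "Agro-Pastoralists, Service or Trade", None],
    name="NomadicTypeDescription",
)

# Split on the comma separator and build one indicator column per label.
multi_hot = s.str.get_dummies(sep=", ")

# Keep missingness as its own indicator (hypothetical column name).
multi_hot["NomadicType_missing"] = s.isna().astype(int)
print(multi_hot)
```

With a 96.6% null rate the missing indicator will largely mirror the Nomadic flag, so one of the two may be redundant in a model.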

PhotoCCVersionText

categorical metadata
Creative Commons license version attached to a photo (e.g., 'CC BY 2.0', 'CC BY-SA 4.0'). The field is dominated by empty strings at 79.8% of 7124 rows, with only 16 distinct values and entropy ratio 0.33, so license metadata is missing for the vast majority of records. Among populated values, 'CC BY 2.0' (387) and 'CC BY-SA 4.0' (246) lead. Treatment: Treat empty string as missing and group rare licenses; use as a low-cardinality categorical only where photo licensing matters. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
16
top_value
(empty string)
top_rate
0.7984
cardinality
16
entropy
1.323
entropy_ratio
0.3307

PhotoCCVersionURL

categorical metadata
This column holds the URL of the Creative Commons license version applied to an associated photo, drawn from a fixed set of 16 distinct license URIs. About 79.8% of rows (5688 of 7124) are empty strings, so the field is sparsely populated; among the populated minority, CC BY 2.0 (387) and CC BY-SA 4.0 (246) dominate. Entropy ratio of 0.33 confirms heavy concentration on the blank value. Treatment: Treat empty strings as missing and collapse to a categorical license code before any modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
16
top_value
(empty string)
top_rate
0.7984
cardinality
16
entropy
1.323
entropy_ratio
0.3307

MapCredits

categorical metadata long_tail
Attribution string crediting the data, geography, and design sources for each map (e.g. Joshua Project, GMI, UNESCO, IMB). With 161 distinct values across 7124 rows, the top credit covers 28% of records and a blank string is the second most common value at 1505 rows; near-duplicates differing only by trailing punctuation (the same Omid/UNESCO credit appears with and without a final period) inflate cardinality. Treatment: Normalise whitespace/punctuation to collapse near-duplicates, then drop from modelling as boilerplate provenance. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
161
top_value
People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project
top_rate
0.28
cardinality
161
entropy
3.318
entropy_ratio
0.4527
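
One way to collapse the near-duplicate credits is a small normaliser like the one below. The credit string is the top_value shown above; the variants are constructed for illustration.

```python
import re

def normalise_credit(text):
    """Collapse runs of whitespace, then strip trailing full stops and spaces."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(". ")

credit = "People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project"

# Constructed variants: a trailing period and trailing whitespace.
variants = [credit, credit + ".", credit + "  "]
normalised = {normalise_credit(v) for v in variants}
print(len(normalised))  # prints: 1
```

Stripping the final period would also alter a credit that legitimately ends in an abbreviation, which is acceptable for grouping but not for display.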

MapCreditURL

categorical metadata long_tail imbalance
This column holds attribution URLs for source maps, but 6919 of 7124 rows (top_rate 0.9712) are empty strings, leaving 205 populated rows spread over 30 distinct non-empty URLs (31 distinct values counting the blank). Among the populated entries, cartomission.com dominates with 100 occurrences while most other domains appear fewer than 10 times, producing a very long tail. Entropy ratio of 0.054 confirms there is almost no information here unless the empty string itself is treated as a meaningful 'no credit' signal. Treatment: Keep as provenance metadata; do not use as a model feature given 97% blanks. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
31
top_value
(empty string)
top_rate
0.9712
cardinality
31
entropy
0.27
entropy_ratio
0.05449

MapCopyright

categorical feature
A near-binary flag (N/Y) with a third state being an empty string, almost certainly indicating whether map copyright applies. 'N' dominates at 72.95% (5197/7124), blanks account for 1885 rows, and only 42 records are 'Y', a severe class imbalance that makes the affirmative case nearly negligible. Treatment: Convert blanks to missing (or fold them into 'N' only if a blank is known to mean no copyright) and treat as a low-signal binary flag given the 42-row positive class. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
3
top_value
N
top_rate
0.7295
cardinality
3
entropy
0.8831
entropy_ratio
0.5572

MapCCVersionText

categorical metadata imbalance
This appears to be a Creative Commons license version field for maps, but it is effectively empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving only 10 rows with actual licenses split across CC BY-SA 3.0 (8), CC0 1.0 (1), and CC BY 3.0 (1). Entropy is just 0.0166 (entropy_ratio 0.0083), so the column carries almost no information despite having 0% nulls — the missingness is encoded as empty strings rather than NaN. Treatment: Drop; near-constant blank with only 10 informative rows. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
4
top_value
(empty string)
top_rate
0.9986
cardinality
4
entropy
0.01662
entropy_ratio
0.00831

MapCCVersionURL

categorical metadata imbalance
MapCCVersionURL appears to hold a Creative Commons license URL associated with each map record, but it is essentially empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving just 10 rows split across three CC license URLs. Entropy is 0.017 (ratio 0.008), so the column carries almost no information despite having 4 distinct values and zero nulls (the missingness is encoded as "" rather than null). Treatment: Drop; near-constant with empty-string standing in for missing. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
4
top_value
(empty string)
top_rate
0.9986
cardinality
4
entropy
0.01662
entropy_ratio
0.00831
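
Because missingness in these columns is stored as "" rather than null, a null-rate check alone will not flag them. The sketch below converts empty strings to missing and drops columns whose populated share falls under a threshold. The toy data and the 0.2 cut-off are illustrative assumptions.

```python
import pandas as pd

# Toy frame: one nearly blank column and one well-populated flag.
df = pd.DataFrame({
    "MapCCVersionText": [""] * 9 + ["CC BY-SA 3.0"],
    "Frontier": ["Y", "N"] * 5,
})

# Replace empty strings with missing values so they count as nulls.
cleaned = df.mask(df.eq(""))

# Share of populated cells per column.
populated_share = cleaned.notna().mean()

MIN_POPULATED_SHARE = 0.2  # hypothetical threshold, tune to taste
to_drop = populated_share[populated_share < MIN_POPULATED_SHARE].index.tolist()
reduced = cleaned.drop(columns=to_drop)
print(to_drop)  # prints: ['MapCCVersionText']
```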

JF

categorical feature
JF is a binary Y/N flag with no nulls across 7124 rows. The distribution is imbalanced: "Y" accounts for 78.7% (5610 rows) versus 1514 "N", giving an entropy ratio of 0.746. The column name is opaque, but these counts match HasJesusFilm exactly (5,610 Y, 1,514 N, entropy 0.7463), so JF is probably a duplicate or abbreviation of that flag; this should be confirmed row by row before dropping either column. Treatment: Encode as a 0/1 indicator; consider class imbalance if used as a target. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.7875
cardinality
2
entropy
0.7463
entropy_ratio
0.7463
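
JF and HasJesusFilm report identical counts in their profiles (5,610 'Y' and 1,514 'N'), which suggests but does not prove that they are the same flag. A row-level comparison settles it; the frame below is a toy stand-in for the real data.

```python
import pandas as pd

# Toy stand-in; in practice these would be the two columns of the loaded dataset.
df = pd.DataFrame({
    "HasJesusFilm": ["Y", "Y", "N", "Y"],
    "JF": ["Y", "Y", "N", "Y"],
})

# Matching marginal counts are not enough: compare the values row by row.
identical = bool((df["JF"] == df["HasJesusFilm"]).all())

# A cross-tabulation shows any disagreement as off-diagonal counts.
table = pd.crosstab(df["HasJesusFilm"], df["JF"])
print(identical)
print(table)
```

If the columns agree on every row, one of them can be dropped without losing information.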

AudioRecordings

categorical feature
Binary Y/N flag indicating whether audio recordings exist for each row, with no nulls across 7124 records. The distribution is heavily imbalanced toward 'Y' at 86.9% (6188 vs 936), giving an entropy ratio of 0.56. Treatment: Encode as a 0/1 indicator; be mindful of class imbalance if used as a target. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.8686
cardinality
2
entropy
0.5612
entropy_ratio
0.5612

Window1040

categorical feature
Window1040 is a binary Y/N flag covering all 7124 rows with no nulls. The distribution is imbalanced: 'Y' accounts for 5910 rows (top_rate 0.8296) versus 1214 'N', giving an entropy ratio of 0.659. The profile evidence does not define the flag, but the name most likely refers to the 10/40 Window (the band between roughly 10 and 40 degrees north latitude used in missions research); either way it behaves like a clean indicator variable. Treatment: Encode as a 0/1 indicator and watch for class imbalance when used as a predictor. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.8296
cardinality
2
entropy
0.6586
entropy_ratio
0.6586

PeopleGroupMapURL

text metadata one_word url_heavy duplicates
This column holds URLs to people-group map images hosted on joshuaproject.net, with every non-empty value being a single token (one_word_rate 1.0, url_rate 0.79). 1,500 of 7,124 rows are empty strings and 2,508 rows are counted as duplicates (duplicate_rate 0.35); 1,499 of those are repeated empties, leaving 1,009 repeated URLs among the 5,624 populated rows, so a number of people groups share the same map image (e.g., m00328.png appears 40 times). With 4,616 unique values across 7,124 rows, this is a reference link rather than a unique key. Treatment: Keep as a display/reference URL; drop from modelling features. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
4,616
len_min
0
len_max
66
len_mean
50.49
len_median
63
len_p95
66
word_mean
1
word_median
1
n_empty
1,500
n_duplicates
2,508
duplicate_rate
0.352
vocab_size
4,615
readability_flesch_mean
-568.7
emoji_rate
0
url_rate
0.7894
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

PeopleGroupMapExpandedURL

text metadata one_word url_heavy duplicates
This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net, with 72.3% of rows containing a URL and every value being a single token. 1,975 rows (about 27.7%) are empty strings, and 2,793 rows (39.2%) duplicate another value — e.g. m00328.pdf appears 40 times — suggesting many people groups share the same regional map. Treatment: Treat as a reference link; drop from modelling or extract the map ID if joining to a maps table. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
4,331
len_min
0
len_max
66
len_mean
46.2
len_median
63
len_p95
66
word_mean
1
word_median
1
n_empty
1,975
n_duplicates
2,793
duplicate_rate
0.3921
vocab_size
4,330
readability_flesch_mean
-468.9
emoji_rate
0
url_rate
0.7228
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

PeopleGroupURL

text identifier near_unique one_word url_heavy
This column holds Joshua Project people-group URLs, one per row, with every value a 48-character single-token https link (url_rate 1.0, one_word_rate 1.0, len_min and len_max both 48). All 7124 values are unique with zero nulls or duplicates, so it functions as a per-row identifier rather than a feature. The URLs encode a people-group ID and a country code suffix (e.g., /10375/tz, /10375/up), meaning the same group recurs across countries in the underlying key even though the full URL is unique. Treatment: Drop from modelling; retain as a row-level link key or parse out the people-group ID and country code as separate features. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
7,124
len_min
48
len_max
48
len_mean
48
len_median
48
len_p95
48
word_mean
1
word_median
1
n_empty
0
n_duplicates
0
duplicate_rate
0
vocab_size
7,124
readability_flesch_mean
-479.9
emoji_rate
0
url_rate
1
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0
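
Parsing the people-group ID and country code out of the URL, as the treatment suggests, can be sketched with the standard library. Only the trailing '/10375/tz' and '/10375/up' segments appear in this profile; the host and leading path in the sample URLs are placeholders.

```python
from urllib.parse import urlparse

def split_people_group_url(url):
    """Return (people_group_id, country_code) from the last two path segments."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    return segments[-2], segments[-1]

# Placeholder URLs: only the final two path segments are taken from the profile.
urls = [
    "https://example.org/people_groups/10375/tz",
    "https://example.org/people_groups/10375/up",
]
parsed = [split_people_group_url(u) for u in urls]
print(parsed)  # prints: [('10375', 'tz'), ('10375', 'up')]
```

The two parsed parts should be checked against existing ID and country columns before being added as new features, since they may be redundant.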

PeopleGroupPhotoURL

text metadata one_word url_heavy duplicates
This column holds Joshua Project people-group photo URLs, with every populated cell being a single joshuaproject.net/assets/media/profiles/photos/.jpg link (url_rate 0.72, one_word_rate 1.0). 1971 of 7124 rows are empty strings (no nulls reported), and the same image URLs repeat heavily — duplicate_rate is 0.60 with only 2880 unique values, the top URL appearing 90 times. The same photo is clearly being reused across many people-group records rather than being a unique per-row asset. Treatment: Treat as an optional asset link; drop or replace empty strings with null and do not use as a feature. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2,880
len_min
0
len_max
68
len_mean
47.04
len_median
65
len_p95
65
word_mean
1
word_median
1
n_empty
1,971
n_duplicates
4,244
duplicate_rate
0.5957
vocab_size
2,879
readability_flesch_mean
-604.3
emoji_rate
0
url_rate
0.7233
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

CountryURL

categorical foreign_key
Country-level URLs pointing to joshuaproject.net profile pages, with the 2-letter country code as the path segment. There are 202 distinct countries across 7,124 rows and no nulls, but the distribution is heavily concentrated: India alone accounts for 28.5% of rows (2,032), with Pakistan (767) a distant second. Entropy ratio of 0.66 confirms moderate skew toward a handful of South Asian countries. Treatment: Extract the trailing country code as a categorical key; treat the URL itself as redundant metadata. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
202
top_value
https://joshuaproject.net/countries/IN
top_rate
0.2852
cardinality
202
entropy
5.058
entropy_ratio
0.6605

JPScaleText

categorical metadata imbalance
JPScaleText is a categorical field that holds a single value, "Unreached", across all 7124 rows with no nulls. With cardinality of 1 and entropy of 0, it carries no information and cannot discriminate between records. The constant value suggests this dataset has been pre-filtered to unreached people groups only. Treatment: Drop; constant column with zero entropy. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
1
top_value
Unreached
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

JPScaleImageURL

categorical metadata imbalance
Every one of the 7,124 rows holds the same URL, https://joshuaproject.net/assets/img/gauge/gauge-1.png, giving a single unique value and zero entropy. This looks like a static asset link (a JP Scale gauge image) attached to each record rather than a discriminating feature. It carries no information for analysis or modelling. Treatment: Drop; constant column with a single value across all rows. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
1
top_value
https://joshuaproject.net/assets/img/gauge/gauge-1.png
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

Summary

text free_text one_word duplicates
Free-text English summaries describing South Asian people groups (Rajputs, Jats, Bania, Beldar, etc.), averaging 51 words with median length 316 characters. Coverage is poor: 3,167 of 7,124 rows (44%) are empty strings, and those repeated empties account for most of the 3,439 duplicate rows (duplicate rate 48%); only 273 of the 3,957 populated rows repeat an earlier summary, leaving 3,685 unique values including the empty string. Several near-identical Rajput paragraphs differ by only a word or two, suggesting lightly edited copies of the same source text rather than independent summaries. Flesch readability of 30.4 indicates fairly difficult prose. Treatment: Deduplicate near-identical entries and drop or impute the 3,167 empty rows before tokenizing and embedding. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
3,685
len_min
0
len_max
1,212
len_mean
309.7
len_median
316
len_p95
793
word_mean
51.26
word_median
52
n_empty
3,167
n_duplicates
3,439
duplicate_rate
0.4827
vocab_size
24,501
readability_flesch_mean
30.4
emoji_rate
0
url_rate
0
one_word_rate
0.4446
allcaps_rate
0
boilerplate_rate
0.0002807
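
Near-identical summaries that differ by a word or two will survive an exact-match dedupe. One simple approach, shown here with the standard library's difflib, keeps the first text of each group whose similarity ratio meets a threshold. The sample sentences and the 0.9 threshold are invented for illustration.

```python
from difflib import SequenceMatcher

def dedupe_near_identical(texts, threshold=0.9):
    """Keep the first text of each group whose similarity ratio meets the threshold."""
    kept = []
    for text in texts:
        if not text.strip():
            continue  # empty strings are treated as missing
        if any(SequenceMatcher(None, text, seen).ratio() >= threshold for seen in kept):
            continue  # close enough to a text already kept
        kept.append(text)
    return kept

# Invented samples: two lightly edited copies, one empty cell, one distinct text.
samples = [
    "The people live mainly in the north and work as farmers.",
    "The people live mainly in the north and work as traders.",
    "",
    "A completely different summary about coastal fishing villages.",
]
unique_summaries = dedupe_near_identical(samples)
print(len(unique_summaries))  # prints: 2
```

The comparison is quadratic in the number of kept texts, so running an exact dedupe first keeps the cost down on a few thousand rows.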

Obstacles

text free_text one_word duplicates
Free-text English prose describing barriers to Christian evangelism among various people groups (Rajputs, Jats, Bosniaks, Azeri, etc.), averaging 18 words and 107 characters per entry. Notably, 3167 of 7124 rows are empty strings and the duplicate rate is 0.489, with a single Rajput-pride passage repeated 88 times and a near-identical Jat passage appearing as both 74- and 7-count variants. Readability is low (Flesch 31.6) and vocabulary is modest (9760 unique words), consistent with a templated missiological description field. Treatment: Treat empties as missing, dedupe near-identical passages, then tokenize and embed for downstream topic or similarity analysis. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
3,641
len_min
0
len_max
726
len_mean
106.9
len_median
95
len_p95
317
word_mean
18.37
word_median
16
n_empty
3,167
n_duplicates
3,483
duplicate_rate
0.4889
vocab_size
9,760
readability_flesch_mean
31.62
emoji_rate
0
url_rate
0
one_word_rate
0.4446
allcaps_rate
0
boilerplate_rate
0.0009826

HowReach

text free_text one_word duplicates
Free-text English prose describing outreach/engagement strategies for various people groups, likely a 'how to reach' field in a missions dataset. Over half the rows (3883 of 7124) are empty strings, so the median length is 0 and the median word count is 1 (the one_word_rate of 0.5451 equals the empty share, so the profiler appears to count an empty string as a single token). The duplicate_rate of 0.60 is driven mostly by those empties, although the same Jats and Rajputs paragraphs also repeat dozens of times. Readability is low (Flesch 27.3) and vocabulary reaches 7803 tokens across the non-empty rows. Treatment: Treat empty strings as missing, deduplicate boilerplate, then tokenize and embed for downstream NLP. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
2,853
len_min
0
len_max
599
len_mean
80.82
len_median
0
len_p95
260
word_mean
14.08
word_median
1
n_empty
3,883
n_duplicates
4,271
duplicate_rate
0.5995
vocab_size
7,803
readability_flesch_mean
27.34
emoji_rate
0
url_rate
0
one_word_rate
0.5451
allcaps_rate
0
boilerplate_rate
0.0002807
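
The zero median length above is a consequence of the empty cells, not of the text itself. A toy illustration of how treating empties as missing changes the summary (the sample strings are invented):

```python
from statistics import median

# Toy column: more than half the cells are empty strings, as in HowReach.
cells = ["", "", "", "Visit in person.", "Share audio stories in the local language."]

# Median over every cell is pulled to zero by the empties.
median_all = median(len(cell) for cell in cells)

# Treat empty strings as missing, then summarise only the populated cells.
populated = [cell for cell in cells if cell.strip()]
median_populated = median(len(cell) for cell in populated)

print(median_all, median_populated)
```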

PrayForChurch

text free_text one_word duplicates
Free-text prayer prompts for an unreached-people-group / church-planting dataset, written in English (1473 detected) and centered on words like 'pray', 'Christ', 'among'. The field is sparsely populated: 5032 of 7124 rows are empty and only 1713 unique strings exist. The 0.76 duplicate rate mostly reflects the repeated empties, but boilerplate prayers are also reused across people groups (top non-empty value repeats 146 times). Readability is low (Flesch 19.5) and length varies wildly from 0 to 649 chars, so the column is a mix of nothing, one-liners, and full paragraphs. Treatment: Treat as optional long-form text: impute empties as missing and tokenize/embed the rest before any modelling. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
1,713
len_min
0
len_max
649
len_mean
59.63
len_median
0
len_p95
286
word_mean
11.19
word_median
1
n_empty
5,032
n_duplicates
5,411
duplicate_rate
0.7595
vocab_size
4,447
readability_flesch_mean
19.5
emoji_rate
0
url_rate
0
one_word_rate
0.7063
allcaps_rate
0
boilerplate_rate
0

PrayForPG

text free_text one_word duplicates
Free-text prayer points for people groups (PG), each entry a short paragraph of intercessions led by the verb 'pray' (5450 occurrences). Nearly half the rows are empty (3405 of 7124), which accounts for most of the 0.517 duplicate_rate; among populated rows some boilerplate templates recur, with the top non-empty value repeating 88 times, so unique content is somewhat less than the 3441 distinct strings suggest. Readability is low (Flesch mean 32.7) and all detected language is English (2528 rows tagged en). Treatment: Treat as free-text: drop empties, dedupe boilerplate, then tokenize/embed if used as a feature. high · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
3,441
len_min
0
len_max
937
len_mean
163.1
len_median
120
len_p95
453.8
word_mean
28.23
word_median
20
n_empty
3,405
n_duplicates
3,683
duplicate_rate
0.517
vocab_size
9,291
readability_flesch_mean
32.72
emoji_rate
0
url_rate
0
one_word_rate
0.478
allcaps_rate
0
boilerplate_rate
0

Resources

unknown other skipped
The column is named "Resources" with 7124 rows and zero nulls, but saturn skipped profiling so the kind is unknown and no unique count or value statistics were computed. Without type inference or sample values, its content (numeric, list, text, or identifier) cannot be determined from the evidence. Treatment: Re-profile or inspect raw samples to establish type before any downstream use. low · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique

country_data

unknown other skipped
The column `country_data` was skipped by the profiler, so its kind is unrecorded and no statistics, uniqueness, or value distribution are available. The only confirmed signals are 7124 rows with a 0.0 null rate. Without further inspection, the contents (likely some country-related payload given the name) cannot be characterised. Treatment: Re-profile or manually inspect this column before any downstream use. low · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique

language_data

unknown other skipped
The column `language_data` was skipped by the profiler — its kind is unrecognised and no descriptive statistics, uniqueness count, or value samples were emitted. Only the row count (7124) and a null rate of 0.0 are available, so nothing can be said about content, cardinality, or distribution. The name hints at linguistic payloads (possibly nested or serialised), but this is not corroborated by evidence. Treatment: Re-profile after parsing or casting to a supported type before deciding on use. low · anthropic:claude-opus-4-7
n
7,124
nulls
0 (0.0%)
unique
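
For the three skipped columns (Resources, country_data, language_data), a first inspection step is to count the Python types of the raw cell values and check whether any strings parse as JSON. The sample cells below are hypothetical; nothing in the profile says what these columns actually contain.

```python
import json
from collections import Counter

def type_census(values):
    """Count the Python type names found among a column's raw values."""
    return Counter(type(value).__name__ for value in values)

def looks_like_json(value):
    """True if value is a string that parses as JSON."""
    if not isinstance(value, str):
        return False
    try:
        json.loads(value)
    except ValueError:
        return False
    return True

# Hypothetical cell values standing in for one skipped column.
column = [{"iso": "xx"}, '{"iso": "yy"}', None, ["a", "b"]]

census = type_census(column)
json_strings = sum(looks_like_json(value) for value in column)
print(census, json_strings)
```

A census dominated by dict or list would point to nested data that needs flattening before profiling, while JSON-parsable strings would point to serialised payloads that need decoding first.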