saturn·

archive api data sample

source /home/coolhand/html/datavis/data_trove/joshua-project/archive/api_data_sample.json 50 rows 107 columns profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a 50-row, 107-column sample from the Joshua Project API describing Arab and related Muslim people groups across 41 countries. The dataset is dominated by one affinity bloc ('Arab World', 100%) and one religion ('Islam', 98%), so the interesting variation lies in geography, population size, and reachedness rather than in identity fields. Look first at Population and PopulationPGAC, which are heavily right-skewed (max 2.22M and 7.56M respectively, with multiple outliers), and at PCIslam, which is high but varies from 25% to 100%. JPScaleText shows that 76% of groups are classified 'Unreached', making that the most actionable signal alongside Continent/RegionName for where these groups sit. Note also the high null rates on the language-dialect, Bible-translation, and nomadic descriptors (86–98% missing), which limit any analysis of those attributes.

citing: Population · PopulationPGAC · PCIslam · JPScaleText · Continent · RegionName · PrimaryLanguageName · LeastReached · PrimaryReligion · PeopleCluster · PrimaryLanguageDialect · NomadicTypeDescription

Schema

107 columns
Per-column summary: column name, kind, null rate, and unique count. Any alerts raised for a column follow on the next line.
ROL3 categorical 0.0% 10
PhotoCredits categorical 0.0% 5
PrimaryReligion categorical 0.0% 2
imbalance
Ctry categorical 0.0% 41
long_tail
RegionName categorical 0.0% 11
BibleYear categorical 92.0% 3
long_tail null_rate
RLG3PC numeric 0.0% 1
constant
Population numeric 0.0% 49
high_skew outliers
Resources unknown 0.0%
skipped
LeastReachedPGAC categorical 0.0% 2
NumberLanguagesSpoken numeric 0.0% 2
high_skew
GSEC categorical 0.0% 2
AudioRecordings categorical 0.0% 1
imbalance
PercentAdherents categorical 0.0% 29
long_tail
ROP1 categorical 0.0% 1
imbalance
JPScalePGAC categorical 0.0% 2
Latitude numeric 0.0% 50
PeopNameInCountry categorical 0.0% 7
long_tail
Window1040 categorical 0.0% 2
PeopleGroupMapURL categorical 0.0% 17
long_tail
CountryURL categorical 0.0% 41
long_tail
PercentEvangelicalPC categorical 0.0% 3
long_tail imbalance
CountOfProvinces unknown 0.0%
skipped
PercentEvangelicalPGAC categorical 0.0% 5
NomadicTypeDescription categorical 90.0% 1
null_rate imbalance
MapCredits categorical 0.0% 7
HasJesusFilm categorical 0.0% 2
HowReach categorical 0.0% 20
long_tail
PCIslam numeric 0.0% 30
high_skew outliers
NTYear categorical 42.0% 8
long_tail null_rate
RLG4 numeric 86.0% 1
null_rate constant
AffinityBloc categorical 0.0% 1
imbalance
NaturalName categorical 0.0% 7
long_tail
PercentChristianPGAC categorical 0.0% 5
PrimaryLanguageName categorical 0.0% 10
CountOfCountries numeric 0.0% 4
PeopleID2 numeric 0.0% 3
high_skew
Summary categorical 0.0% 21
long_tail
Obstacles categorical 0.0% 21
long_tail
ROP2 categorical 0.0% 3
long_tail imbalance
RLG3 numeric 0.0% 2
high_skew
PercentEvangelical categorical 0.0% 18
long_tail
LeastReached categorical 0.0% 2
Continent categorical 0.0% 6
JPScalePC categorical 0.0% 1
imbalance
JPScaleText categorical 0.0% 4
SecurityLevel numeric 0.0% 3
LRTop100 categorical 0.0% 1
imbalance
PrimaryReligionPGAC categorical 0.0% 1
imbalance
PCNonReligious numeric 2.0% 6
outliers
PhotoCreditURL categorical 4.0% 3
PhotoCreativeCommons categorical 0.0% 2
PrayForPG categorical 0.0% 21
long_tail
PeopleGroupPhotoURL categorical 0.0% 5
ROG2 categorical 0.0% 6
PhotoCCVersionText categorical 0.0% 2
Longitude numeric 0.0% 50
outliers
JPScaleImageURL categorical 0.0% 4
OfficialLang categorical 0.0% 21
long_tail
PhotoPermission categorical 0.0% 2
imbalance
PCHinduism numeric 2.0% 3
high_skew
PeopleID3 numeric 0.0% 5
high_skew outliers
PeopleID1 numeric 0.0% 1
constant
SpeakNationalLang unknown 0.0%
skipped
PortionsYear categorical 16.0% 9
long_tail
PrimaryReligionPC categorical 0.0% 1
imbalance
PCUnknown numeric 2.0% 1
constant
ProfileTextExists categorical 0.0% 2
PCOtherSmall numeric 2.0% 3
high_skew outliers
BibleStatus numeric 0.0% 4
Frontier categorical 0.0% 2
MapAddress categorical 0.0% 17
long_tail
PeopleID3ROG3 categorical 0.0% 50
long_tail
ROP3 numeric 0.0% 5
high_skew outliers
PrimaryLanguageDialect categorical 98.0% 1
long_tail null_rate imbalance
JPScale numeric 0.0% 4
high_skew outliers
HasAudioRecordings categorical 0.0% 1
imbalance
PCBuddhism numeric 2.0% 1
constant
PeopNameAcrossCountries categorical 0.0% 5
PhotoCCVersionURL categorical 0.0% 2
MapCCVersionText categorical 0.0% 1
imbalance
PercentChristianPC categorical 0.0% 3
long_tail imbalance
Nomadic categorical 0.0% 2
PrayForChurch categorical 0.0% 9
long_tail
RLG3PGAC numeric 0.0% 1
constant
ISO3 categorical 0.0% 41
long_tail
NaturalPronunciation categorical 2.0% 6
PhotoAddress categorical 0.0% 5
RegionCode numeric 0.0% 11
LocationInCountry categorical 72.0% 13
long_tail null_rate
JF categorical 0.0% 2
PopulationPGAC numeric 0.0% 5
outliers
PeopleGroupMapExpandedURL categorical 0.0% 11
long_tail
TranslationNeedQuestionable unknown 0.0%
skipped
Category categorical 0.0% 3
PhotoCopyright categorical 0.0% 2
imbalance
NTOnline categorical 18.0% 1
imbalance
LeastReachedPC categorical 0.0% 1
imbalance
ROG3 categorical 0.0% 41
long_tail
ReligionSubdivision categorical 86.0% 1
null_rate imbalance
PCEthnicReligions numeric 2.0% 3
high_skew outliers
PeopleCluster categorical 0.0% 3
long_tail imbalance
IndigenousCode categorical 0.0% 2
MapCreditURL categorical 0.0% 1
imbalance
MapCopyright categorical 0.0% 1
imbalance
MapCCVersionURL categorical 0.0% 1
imbalance
PeopleGroupURL categorical 0.0% 50
long_tail

ROL3

categorical feature
ROL3 holds 3-letter codes (likely ISO 639-3 language tags such as 'eng', 'arz', 'ary', 'apc', 'afb') across 50 complete rows with 10 distinct values. Distribution is skewed toward Levantine/Gulf Arabic variants: 'apc' covers 36% (18 of 50) and 'afb' another 26% (13 of 50), while six codes appear only once or twice. Entropy ratio of 0.75 indicates moderate concentration rather than uniform spread. Treatment: Group rare codes into an 'other' bucket, then one-hot encode. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
10
top_value
apc
top_rate
0.36
cardinality
10
entropy
2.501
entropy_ratio
0.7527

PhotoCredits

categorical metadata
Attribution string for the image accompanying each row, naming photographer and source platform. Just 5 distinct credits cover all 50 rows, with 'Hashim Abdullah - Pixabay' alone accounting for 56% and the top three Pixabay/Flickr contributors covering 48 of 50 entries. Two credits ('Link Up Africa', 'Claudiovidri - Shutterstock') appear only once, suggesting a long tail of incidental sources. Treatment: Retain as provenance metadata; drop from modelling features. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
5
top_value
Hashim Abdullah - Pixabay
top_rate
0.56
cardinality
5
entropy
1.611
entropy_ratio
0.694

PrimaryReligion

categorical feature imbalance
Categorical column capturing the dominant religion of each record, with only two observed values across 50 rows. The distribution is severely imbalanced: Islam accounts for 49 of 50 entries (top_rate 0.98) and Christianity for just 1, yielding an entropy ratio of 0.14. With effectively no variance, this column carries almost no discriminative signal. Treatment: Drop or collapse to a binary 'is_Islam' flag; near-constant for modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
Islam
top_rate
0.98
cardinality
2
entropy
0.1414
entropy_ratio
0.1414
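The entropy and entropy_ratio figures reported for categorical columns are consistent with Shannon entropy in bits, normalised by log2 of the cardinality. The sketch below is an assumption about how the profiler derives them, not its actual code, and the function names are hypothetical. It rebuilds the PrimaryReligion column from the reported 49/1 split.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits of the observed value frequencies."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_ratio(values):
    """Entropy divided by its maximum, log2(cardinality); 0 for a constant column."""
    k = len(set(values))
    if k <= 1:
        return 0.0
    return shannon_entropy(values) / math.log2(k)

# 49 'Islam' rows and 1 'Christianity' row, as reported above.
religion = ["Islam"] * 49 + ["Christianity"]
h = shannon_entropy(religion)
r = entropy_ratio(religion)
```

For a two-level column log2(2) is 1, which is why entropy and entropy_ratio coincide here and in the other binary flags.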

Ctry

categorical feature long_tail
Country field with 41 distinct values across 50 rows and no nulls. The distribution is essentially flat: entropy ratio is 0.986 and the most common value, United Arab Emirates, appears just twice (4%). Nine countries tie at 2 occurrences each and the remaining 32 are singletons, hence the long_tail alert. Treatment: Group rare countries into regions or an 'Other' bucket before encoding. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
41
top_value
United Arab Emirates
top_rate
0.04
cardinality
41
entropy
5.284
entropy_ratio
0.9862

RegionName

categorical feature
RegionName is a categorical geographic grouping with 11 distinct regions across 50 rows and no nulls. The distribution is uneven: 'Africa, North and Middle East' alone accounts for 32% (16/50), and the three African regions together dominate the column. Entropy ratio of 0.86 indicates spread is fairly even given the cardinality, but several regions ('Australia and Pacific', 'America, Latin') appear only once. Treatment: one-hot or target-encode; consider grouping single-row regions to avoid sparse levels. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
11
top_value
Africa, North and Middle East
top_rate
0.32
cardinality
11
entropy
2.973
entropy_ratio
0.8595

BibleYear

categorical metadata long_tail null_rate
BibleYear appears to be a metadata field capturing the publication year or year-range of a Bible edition, with values like "1890-2024", "1382-2020", and "2021". The column is almost entirely empty: 92% null with only 4 of 50 rows populated and 3 distinct values. Format is inconsistent, mixing single years with hyphenated ranges, which blocks numeric parsing. Treatment: Drop or quarantine; null rate of 0.92 and mixed year/range formats make it unusable without manual curation. high · anthropic:claude-opus-4-7
n
50
nulls
46 (92.0%)
unique
3
top_value
1890-2024
top_rate
0.5
cardinality
3
entropy
1.5
entropy_ratio
0.9464

RLG3PC

numeric other constant
RLG3PC is a numeric column that is entirely constant: all 50 rows hold the value 6.0, with zero variance and only 1 unique value. There is no information for a model to learn from, and no nulls or outliers to caveat. Treatment: Drop; constant column carries no signal. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
min
6
max
6
mean
6
median
6
std
0
q1
6
q3
6
iqr
0
skew
0
kurtosis
0
n_outliers
0
outlier_rate
0
zero_rate
0

Population

numeric feature high_skew outliers
Population counts across 50 rows, ranging from 200 to 2,221,000 with a median of just 46,500 versus a mean of 264,074. The distribution is severely right-skewed (skew 2.52, kurtosis 5.78) and 12% of rows (6 values) flag as outliers, indicating that a few very large people groups dominate a sample otherwise made up of small populations. No nulls or zeros, and 49 of 50 values are unique. Treatment: log-transform before any modelling to tame the right skew and outliers. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
49
min
200
max
2.221e+06
mean
264,074
median
46,500
std
4.927e+05
q1
13,000
q3
272,500
iqr
259,500
skew
2.52
kurtosis
5.781
n_outliers
6
outlier_rate
0.12
zero_rate
0
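A minimal sketch of the two ideas in the Population note, using only the summary statistics listed above. It assumes the outlier flag follows the common 1.5 × IQR (Tukey) rule, which the report does not state, and it uses log1p as one reasonable choice of log transform. The function names are hypothetical.

```python
import math

# Summary statistics copied from the Population profile above.
Q1, Q3 = 13_000, 272_500
MIN, MEDIAN, MAX = 200, 46_500, 2_221_000

def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences under the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def log_population(x):
    """Natural log of (1 + x); defined at zero and compresses the right tail."""
    return math.log1p(x)

lower, upper = tukey_fences(Q1, Q3)
raw_ratio = MAX / MEDIAN                                  # spread before the transform
log_ratio = log_population(MAX) / log_population(MEDIAN)  # spread after it
```

Under this assumption the upper fence sits at 661,750, so the 2.22M maximum is well outside it, while on the log scale the maximum is only a small multiple of the median.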

Resources

unknown other skipped
The column is named "Resources" and contains 50 non-null entries, but saturn skipped profiling it so its kind is unknown and no descriptive stats (uniqueness, distribution, type) are available. Without further signals, its content and structure cannot be characterized from this evidence. Treatment: Re-run profiling with type coercion or inspect raw values manually before deciding on use. low · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique

LeastReachedPGAC

categorical feature
Binary Y/N flag, most likely the 'least reached' status assessed for the people group across countries (PGAC), with no nulls across 50 rows. The split is nearly balanced (28 N, 22 Y; top_rate 0.56) and the entropy_ratio of 0.99 is close to the maximum for a binary feature. Treatment: Encode as 0/1 boolean for modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.56
cardinality
2
entropy
0.9896
entropy_ratio
0.9896

NumberLanguagesSpoken

numeric feature high_skew
Counts the number of languages spoken, but with only 2 unique values across 50 rows it is effectively a binary indicator of monolingual vs bilingual. The mean of 1.04 and median of 1.0 show 48 of 50 records sit at 1, with just 2 outliers at 2.0 driving the extreme skew (4.69) and kurtosis (20.04). Treatment: Recode as a binary multilingual flag; the raw count adds no information. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
min
1
max
2
mean
1.04
median
1
std
0.1979
q1
1
q3
1
iqr
0
skew
4.695
kurtosis
20.04
n_outliers
2
outlier_rate
0.04
zero_rate
0

GSEC

categorical feature
GSEC is a binary categorical field with exactly two values, "1" and an empty string, split perfectly 25/25 across the 50 rows. The maximum entropy (1.0) confirms a balanced flag, but the empty string rather than "0" or null suggests the absent state is encoded as a blank string rather than a true missing value. Treatment: Recode empty string to 0 (or NaN if it means missing) and treat as a binary indicator. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
1
top_rate
0.5
cardinality
2
entropy
1
entropy_ratio
1
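The recoding suggested for GSEC can be sketched as follows. Whether a blank means "no" or "unknown" is not established by the profile, so the hypothetical helper below exposes that choice as a parameter instead of deciding it.

```python
def recode_blank_flag(value, blank_means_missing=False):
    """Map "1" to 1 and the empty string to 0, or to None when blanks mean 'unknown'."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "":
        return None if blank_means_missing else 0
    if text == "1":
        return 1
    raise ValueError(f"unexpected GSEC value: {value!r}")

# 25 rows of "1" and 25 blanks, matching the split reported above.
gsec = ["1"] * 25 + [""] * 25
as_zero = [recode_blank_flag(v) for v in gsec]
as_missing = [recode_blank_flag(v, blank_means_missing=True) for v in gsec]
```

Raising on any other value is a deliberate choice: a full export may contain codes that this 50-row sample does not show.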

AudioRecordings

categorical metadata imbalance
AudioRecordings is a categorical flag that takes the single value 'Y' across all 50 rows, with zero nulls and entropy of 0. Because cardinality is 1 and top_rate is 1.0, the column carries no information and cannot discriminate between records. Treatment: Drop; constant column with no variance. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
Y
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

PercentAdherents

categorical feature long_tail
PercentAdherents holds numeric percentages stored as strings, with 29 distinct values across 50 rows and no nulls. The mode is "0.000" at 12% of rows, and entropy ratio 0.942 indicates a very flat distribution with a long tail of small-frequency values. The mix of fractional (0.200, 0.500) and whole-number (5.000, 6.000) entries suggests the values are raw percentages rather than proportions. Treatment: Cast strings to float and treat as a continuous numeric feature. medium · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
29
top_value
0.000
top_rate
0.12
cardinality
29
entropy
4.576
entropy_ratio
0.942
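Casting the string percentages to floats could look like the sketch below. The helper name is hypothetical, and treating blanks as missing is an assumption, since this sample contains no blank PercentAdherents values.

```python
def parse_percent(text):
    """Convert a percentage string such as "0.200" to float; blank or None becomes None."""
    if text is None or text.strip() == "":
        return None
    return float(text)

samples = ["0.000", "0.200", "0.500", "5.000", "6.000"]  # values quoted in the profile
parsed = [parse_percent(s) for s in samples]
as_proportion = [p / 100 for p in parsed]  # only if downstream code expects 0-1 proportions
```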

ROP1

categorical metadata imbalance
ROP1 is a categorical column holding a single constant value 'A001' across all 50 rows, with zero nulls and entropy of 0.0. It carries no information for modelling or segmentation since cardinality is 1 and top_rate is 1.0. Treatment: Drop; constant column with no variance. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
A001
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

JPScalePGAC

categorical feature
A binary categorical field with only the values "1" and "2" and no nulls: "2" appears in 28 rows (top_rate 0.56) and "1" in 22. The near-maximal entropy ratio (0.99) indicates an almost balanced two-class distribution. The name points to the Joshua Project scale (JPScale; see JPScaleText), most likely evaluated for the people group across countries (PGAC), so the values are ordinal scale levels. Treatment: Cast to an ordered categorical (or a 0/1 indicator, since only two levels occur) before modelling. medium · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
2
top_rate
0.56
cardinality
2
entropy
0.9896
entropy_ratio
0.9896

Latitude

numeric feature
Numeric column holding geographic latitude in degrees, with all 50 values unique and no nulls. The range spans -33.87 to 59.11, consistent with worldwide coordinates, and the distribution is mildly left-skewed (-0.46) with a mean of 22.28 sitting below the median of 24.12. Only one outlier (2%) is flagged, suggesting one row sits far from the otherwise broad spread (IQR 26.18). Treatment: Pair with longitude as a geospatial coordinate; consider binning or distance features rather than using as a raw scalar. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
50
min
-33.87
max
59.11
mean
22.28
median
24.12
std
20.19
q1
9.247
q3
35.43
iqr
26.18
skew
-0.4594
kurtosis
0.0487
n_outliers
1
outlier_rate
0.02
zero_rate
0
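One way to build the distance features mentioned for Latitude is the haversine great-circle distance to a chosen reference point. This is a sketch only: the Earth-radius constant and the choice of reference point are assumptions that do not come from the report.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant, not from the report

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# Illustrative calls with made-up coordinates, not rows from the dataset.
d_same = haversine_km(24.12, 45.0, 24.12, 45.0)
d_pole = haversine_km(0.0, 0.0, 90.0, 0.0)  # equator to pole, a quarter of a meridian
```

A distance to a fixed reference treats latitude and longitude jointly, which avoids reading either coordinate as an ordinary scalar.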

PeopNameInCountry

categorical label long_tail
Categorical label naming the people group in-country, with only 7 distinct values across 50 rows and no nulls. The distribution is heavily concentrated on Arab variants: 'Arab' alone covers 54% of rows, and the top three Arab-prefixed labels account for 46 of 50 entries, leaving a long tail of singletons like 'Tuareg, Air' and 'Amri'. Entropy ratio of 0.65 confirms the imbalance flagged by the long_tail alert. Treatment: Collapse rare singletons into an 'Other' bucket before any group-wise analysis. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
7
top_value
Arab
top_rate
0.54
cardinality
7
entropy
1.835
entropy_ratio
0.6537

Window1040

categorical feature
Window1040 is a binary Y/N flag, almost perfectly balanced with 26 'Y' and 24 'N' across 50 rows. Entropy ratio of 0.999 confirms a near-maximum-uncertainty split, and there are no nulls. The name most likely refers to the 10/40 Window, the band between 10 and 40 degrees north latitude used in mission research, so the flag would mark whether a group is located inside it; the profile alone does not confirm this. Treatment: Encode as a 0/1 boolean for modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.52
cardinality
2
entropy
0.9988
entropy_ratio
0.9988

PeopleGroupMapURL

categorical metadata long_tail
URL pointing to a people-group map image hosted on joshuaproject.net, one per row. 48% of the 50 rows (24) are empty strings, so nearly half the field is effectively missing even though the null rate reads 0%, and a single map (m00007.png) accounts for 9 of the 26 non-blank entries. The 17 unique values include the empty string, leaving 16 distinct URLs, most of which appear only once (hence the long_tail alert). Treatment: Treat empty strings as missing and drop from modelling; keep only as a display link. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
17
top_value
(empty string)
top_rate
0.48
cardinality
17
entropy
2.792
entropy_ratio
0.6832
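Because blanks are stored as empty strings, the reported null rate of 0% understates missingness. A small sketch of normalising blanks and recomputing the rate is shown below. The helper names are hypothetical, and placeholder strings stand in for the real URLs.

```python
def blank_to_none(value):
    """Return None for None, empty, or whitespace-only strings; otherwise the value unchanged."""
    if value is None or value.strip() == "":
        return None
    return value

def missing_rate(values):
    """Share of entries that are missing once blanks are normalised to None."""
    cleaned = [blank_to_none(v) for v in values]
    return sum(v is None for v in cleaned) / len(cleaned)

# Placeholder strings stand in for the 26 populated rows; 24 rows are blank as reported above.
urls = [""] * 24 + [f"map-{i}" for i in range(26)]
rate = missing_rate(urls)
```

The same normalisation applies to the other columns in this report whose top value is the empty string.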

CountryURL

categorical foreign_key long_tail
URLs to country pages on joshuaproject.net, with the two-letter country code as the path suffix. With 41 unique values across 50 rows and entropy ratio 0.986, the column is near-unique; the most frequent URL (UAE) appears just twice (top_rate 0.04). The base domain is constant, so the country code is the only informative part. Treatment: Extract the trailing country code and use that as the join/grouping key instead of the raw URL. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
41
top_value
https://joshuaproject.net/countries/AE
top_rate
0.04
cardinality
41
entropy
5.284
entropy_ratio
0.9862
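Extracting the trailing country code can be done with the standard library alone. The function name below is hypothetical, and the sketch assumes every URL follows the pattern of the sample value shown above.

```python
from urllib.parse import urlparse

def country_code_from_url(url):
    """Return the last path segment of a country URL, for example 'AE'."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]

code = country_code_from_url("https://joshuaproject.net/countries/AE")
```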

PercentEvangelicalPC

categorical feature long_tail imbalance
Numeric-looking field stored as a categorical with only 3 distinct values across 50 rows, and 96% of rows share the single value '0.197'. The other two values ('0.103' and '0.265') each appear exactly once, giving an extreme imbalance and an entropy ratio of just 0.178. The PC suffix most likely denotes the people cluster rather than a principal component: PeopleCluster also has 3 distinct values, so this looks like a cluster-level evangelical percentage repeated on every row of the same cluster, leaving almost no row-level signal. Treatment: Drop or treat as constant; near-zero variance offers no modelling signal. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
3
top_value
0.197
top_rate
0.96
cardinality
3
entropy
0.2823
entropy_ratio
0.1781

CountOfProvinces

unknown other skipped
CountOfProvinces was skipped by the profiler, so no type, uniqueness, or distribution stats are available beyond a row count of 50 with no nulls. The name suggests an integer tally of provinces per record, but this cannot be confirmed from the evidence. No further signal is present to flag skew, duplicates, or range. Treatment: Re-run the profiler on this column to recover type and distribution before deciding how to use it. low · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique

PercentEvangelicalPGAC

categorical feature
Likely the percentage of evangelical adherents for the people group across countries (PGAC), stored as strings rather than floats, with only 5 distinct values across 50 rows. The distribution is severely lumpy: '1.892' covers 56% of rows and the top three values ('1.892', '0.233', '0.023') account for 48 of 50 observations. PeopNameAcrossCountries also has 5 distinct values, which suggests these are group-level figures repeated on each row rather than per-row measurements. Treatment: Cast to float and treat as a group-level attribute rather than an independent continuous measurement per row. medium · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
5
top_value
1.892
top_rate
0.56
cardinality
5
entropy
1.611
entropy_ratio
0.694

NomadicTypeDescription

categorical metadata null_rate imbalance
This appears to be a descriptive label for a nomadic lifestyle classification, but it carries almost no information in this sample. 90% of rows are null, and the 5 non-null rows all hold the single value 'Agro-Pastoralists' (top_rate 1.0, cardinality 1, entropy 0.0). As-is, the column cannot discriminate between records. Treatment: Drop: 90% null and only one observed value provides no signal. high · anthropic:claude-opus-4-7
n
50
nulls
45 (90.0%)
unique
1
top_value
Agro-Pastoralists
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

MapCredits

categorical metadata
MapCredits holds attribution strings for the map associated with each row, citing sources like Joshua Project, GMI, ESRI, and Bethany World Prayer Center. Nearly half the rows (24 of 50) carry an empty string rather than a null, and a single credit to 'Bethany World Prayer Center' covers another 14 rows, leaving only 7 distinct values across the column. The dominance of blanks alongside non-null status is the main surprise — missingness is encoded as empty text, not NULL. Treatment: Normalize empty strings to nulls and treat as provenance metadata; drop from modelling features. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
7
top_value
(empty string)
top_rate
0.48
cardinality
7
entropy
1.987
entropy_ratio
0.7077

HasJesusFilm

categorical feature
Binary Y/N flag indicating whether each record has an associated 'Jesus Film' resource. The column is complete (null_rate 0.0) with only 2 unique values, heavily skewed toward 'Y' at 82% (41 of 50), leaving just 9 'N' cases. Treatment: Encode as boolean; expect limited discriminatory power given the 82/18 imbalance. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.82
cardinality
2
entropy
0.6801
entropy_ratio
0.6801

HowReach

categorical free_text long_tail
HowReach holds free-text outreach suggestions, likely missionary engagement strategies for various people groups. The column is dominated by empty strings (62% top_rate, 31 of 50 rows blank), and every non-blank value is unique, yielding 20 distinct values across 50 rows with no nulls flagged. Entropy ratio of 0.60 plus the long_tail alert confirm this is essentially sparse prose, not a categorical variable. Treatment: Treat blanks as missing and tokenize/embed the prose entries rather than one-hot encoding. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
20
top_value
(empty string)
top_rate
0.62
cardinality
20
entropy
2.572
entropy_ratio
0.5952

PCIslam

numeric feature high_skew outliers
PCIslam appears to be a percentage measure of Islamic affiliation per record, ranging 25.0 to 100.0 with a median of 95.99 and mean 91.12. The distribution is heavily left-skewed (skew -2.65, kurtosis 7.39) with a tight IQR of 6.55 between Q1 92.93 and Q3 99.47, yet 8 outliers (16%) sit far below that cluster. Most observations are near-saturation while a small tail of low-share records pulls the mean down. Treatment: Consider a reflected log or beta transform before modelling to tame the left skew and downweight the low-share outliers. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
30
min
25
max
100
mean
91.12
median
95.99
std
14.7
q1
92.93
q3
99.47
iqr
6.55
skew
-2.648
kurtosis
7.388
n_outliers
8
outlier_rate
0.16
zero_rate
0
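A reflected log transform, one of the options named for PCIslam, can be sketched as below. It assumes the variable is bounded above by 100 and applies log1p to the distance from that bound. The function name is hypothetical, and the inputs are the summary statistics listed above, not raw rows.

```python
import math

PC_MAX = 100.0  # upper bound of a percentage

def reflected_log(x, upper=PC_MAX):
    """log1p of the distance from the upper bound; turns a left tail into a compressed right tail."""
    return math.log1p(upper - x)

# Minimum, Q1, median, Q3 and maximum copied from the PCIslam profile.
stats = [25.0, 92.93, 95.99, 99.47, 100.0]
transformed = [reflected_log(v) for v in stats]
```

The reflection reverses the ordering, so larger transformed values correspond to a lower Islam share; any model coefficient on this feature has to be read with that in mind.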

NTYear

categorical free_text long_tail null_rate
NTYear appears to be a free-form annotation of New Testament (NT) availability, mixing a yes/no flag with single years (e.g. 2005, 1932, 2012) and year ranges (e.g. 1879-1989, 1990-2003). The format is inconsistent: 'Yes' accounts for 62% of non-null entries (18 of 29), while 42% of all rows are null and the remaining 11 cells split across 7 heterogeneous values. This is effectively two or three different fields collapsed into one string column. Treatment: Split into a boolean indicator and parsed year/year-range fields before use. high · anthropic:claude-opus-4-7
n
50
nulls
21 (42.0%)
unique
8
top_value
Yes
top_rate
0.6207
cardinality
8
entropy
1.925
entropy_ratio
0.6416
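The split suggested for NTYear might be sketched as follows. The parser is hypothetical and covers only the three shapes described above ('Yes', a single year, a year range). It also assumes that a year or range implies a New Testament exists, which the profile does not confirm.

```python
import re

_YEAR = re.compile(r"^(\d{4})$")
_RANGE = re.compile(r"^(\d{4})\s*-\s*(\d{4})$")

def parse_nt_year(value):
    """Split an NTYear string into (has_nt, first_year, last_year).

    None or blank -> (None, None, None); 'Yes' -> (True, None, None);
    '2005' -> (True, 2005, 2005); '1879-1989' -> (True, 1879, 1989).
    """
    if value is None or value.strip() == "":
        return (None, None, None)
    text = value.strip()
    if text.lower() == "yes":
        return (True, None, None)
    m = _YEAR.match(text)
    if m:
        year = int(m.group(1))
        return (True, year, year)
    m = _RANGE.match(text)
    if m:
        return (True, int(m.group(1)), int(m.group(2)))
    raise ValueError(f"unrecognised NTYear value: {value!r}")
```

Unrecognised shapes raise an error so that new formats in a fuller export surface immediately instead of being silently dropped.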

RLG4

numeric other null_rate constant
RLG4 is a numeric column that is effectively unusable: 86% of its 50 rows are null, and every one of the remaining values equals 20.0 (min, median, max, std all confirm this). With a single distinct value and no variance, it carries no information for modelling. Treatment: Drop the column; it is constant where present and 86% null. high · anthropic:claude-opus-4-7
n
50
nulls
43 (86.0%)
unique
1
min
20
max
20
mean
20
median
20
std
0
q1
20
q3
20
iqr
0
skew
0
kurtosis
0
n_outliers
0
outlier_rate
0
zero_rate
0

AffinityBloc

categorical metadata imbalance
AffinityBloc is a categorical grouping label, but every one of the 50 rows holds the same value, "Arab World". With cardinality 1 and entropy 0, this column carries no information for distinguishing records in this slice. Treatment: Drop; constant column with zero entropy. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
Arab World
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

NaturalName

categorical label long_tail
NaturalName is a low-cardinality categorical label, likely an ethnic or linguistic group identifier, with 7 distinct values across 50 rows and no nulls. The distribution is heavily concentrated: 'Arab' alone covers 54% of rows, and together with 'Gulf-spoken Arab' (11) and 'Omani Arab' (8) accounts for 46 of 50 records, leaving four singleton categories in a long tail. Entropy ratio of 0.65 confirms the imbalance. Treatment: Group the four singleton categories into an 'Other' bucket before encoding. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
7
top_value
Arab
top_rate
0.54
cardinality
7
entropy
1.835
entropy_ratio
0.6537
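Grouping singleton levels into an 'Other' bucket, as recommended for NaturalName and several other long-tail columns, can be sketched as below. The helper and the threshold of two occurrences are assumptions. The counts for the three named levels come from the profile, while the four singleton labels are placeholders because the report does not name them.

```python
from collections import Counter

def group_rare(values, min_count=2, other="Other"):
    """Replace every level seen fewer than min_count times with a single bucket label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Three named levels with reported counts, plus four hypothetical singleton labels.
names = (["Arab"] * 27 + ["Gulf-spoken Arab"] * 11 + ["Omani Arab"] * 8
         + ["singleton_a", "singleton_b", "singleton_c", "singleton_d"])
grouped = Counter(group_rare(names))
```

After grouping, the column has four levels instead of seven, which keeps one-hot encodings free of single-row indicator columns.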

PercentChristianPGAC

categorical feature
This column appears to be a percentage-based metric (likely 'Percent Christian' from a PGAC indicator) stored as strings, with only 5 distinct values across 50 rows. The distribution is heavily concentrated: '14.741' accounts for 56% of records, followed by '0.935' (12 rows) and '0.066' (8 rows), suggesting these are repeated category-level constants rather than per-row measurements. The presence of just 5 unique values for what looks like a continuous percentage is suspicious and points to either aggregated/joined reference data or a coarse bucketing. Treatment: Cast to numeric and verify whether values are constants from a join key before using as a feature. medium · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
5
top_value
14.741
top_rate
0.56
cardinality
5
entropy
1.611
entropy_ratio
0.694

PrimaryLanguageName

categorical feature
This is a categorical column naming a primary language, dominated by Arabic dialects across 10 distinct values in 50 rows. Levantine Arabic leads at 18/50 (36%), followed by Gulf (13) and Omani (8); non-Arabic entries are rare (Swahili 2, plus singletons for Tamajeq, English, etc.). Entropy ratio of 0.75 indicates moderate concentration without a single overwhelming class, and there are no nulls. Treatment: Group rare non-Arabic and minor dialects into an 'Other' bucket before one-hot encoding. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
10
top_value
Arabic, Levantine
top_rate
0.36
cardinality
10
entropy
2.501
entropy_ratio
0.7527

CountOfCountries

numeric feature
Numeric count of countries with only 4 unique values across 50 rows, ranging 1 to 28 with median 28 and mean 19.88. The distribution is heavily concentrated at the maximum (median equals max, Q3 equals max), indicating most rows hit the ceiling of 28 while a minority sit much lower. Negative kurtosis (-1.52) and mild left skew (-0.43) confirm a bimodal-like spread rather than a smooth distribution. Treatment: Treat as a low-cardinality ordinal or bin into categorical buckets rather than as a continuous variable. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
4
min
1
max
28
mean
19.88
median
28
std
9.512
q1
12
q3
28
iqr
16
skew
-0.4253
kurtosis
-1.521
n_outliers
0
outlier_rate
0
zero_rate
0

PeopleID2

numeric foreign_key high_skew
PeopleID2 is stored as numeric but behaves like a categorical key, with only 3 unique values across 50 rows and an IQR of 0 because Q1, median, and Q3 all equal 111. The mean (115.04) is pulled above the median by 2 outliers reaching up to 307, producing extreme skew (6.85) and kurtosis (44.93). No nulls or zeros are present. Treatment: Treat as a categorical identifier (not a numeric feature) and left-join on it rather than aggregating. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
3
min
111
max
307
mean
115
median
111
std
27.71
q1
111
q3
111
iqr
0
skew
6.847
kurtosis
44.93
n_outliers
2
outlier_rate
0.04
zero_rate
0

Summary

categorical free_text long_tail
Free-text ethnographic summaries describing people groups, with 21 unique values across 50 rows and a 60% top rate driven entirely by empty strings (30 of 50). The non-empty entries are long prose paragraphs about Tuareg, Arab, and other groups, so this is descriptive content rather than a category. Entropy ratio of 0.61 and the long_tail alert confirm most non-empty values appear only once. Treatment: Treat empty strings as missing and tokenize/embed the prose before any modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
21
top_value
(empty string)
top_rate
0.6
cardinality
21
entropy
2.7
entropy_ratio
0.6146

Obstacles

categorical free_text long_tail
Free-text commentary describing barriers to gospel outreach for various people groups, one paragraph per row. 30 of 50 rows (top_rate 0.6) are empty strings, and the remaining 20 entries are essentially unique, yielding 21 distinct values and entropy_ratio 0.61. Despite the categorical kind, the content is prose passages about Islam, identity, and mission access, not a bounded label set. Treatment: Treat empty strings as missing and tokenize/embed the prose rather than one-hot encoding. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
21
top_value
(empty string)
top_rate
0.6
cardinality
21
entropy
2.7
entropy_ratio
0.6146

ROP2

categorical feature long_tail imbalance
ROP2 is a categorical column with only 3 distinct codes (C0013, C0219, C0019) across 50 rows and no nulls. The distribution is extremely lopsided: C0013 covers 96% of rows while the other two codes appear once each, yielding an entropy ratio of just 0.178. As a near-constant feature it carries almost no signal. Treatment: Drop or collapse rare levels into 'other'; near-constant column unlikely to aid modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
3
top_value
C0013
top_rate
0.96
cardinality
3
entropy
0.2823
entropy_ratio
0.1781

RLG3

numeric feature high_skew
RLG3 is a near-constant numeric feature: 49 of 50 rows take the value 6.0 (median, Q1, and Q3 all equal 6.0), with a single outlier at 1.0 producing extreme negative skew (-6.86) and kurtosis (45.02). With only 2 unique values and an IQR of 0, this behaves more like a degenerate flag than a continuous measurement. Treatment: Drop or binarize as is_not_6; near-zero variance makes it useless for most models. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
min
1
max
6
mean
5.9
median
6
std
0.7071
q1
6
q3
6
iqr
0
skew
-6.857
kurtosis
45.02
n_outliers
1
outlier_rate
0.02
zero_rate
0

PercentEvangelical

categorical feature long_tail
PercentEvangelical appears to be a numeric share of evangelicals stored as strings, with 18 distinct values across 50 rows and no nulls. The distribution is heavily concentrated on small values: '0.000' and '0.500' tie for the mode at 8 occurrences each (16% top_rate), while values up to '2.500' form a long tail. High entropy_ratio (0.88) indicates the mass is spread fairly evenly across the small set of bins despite the long_tail alert. Treatment: Cast strings to float and treat as a continuous numeric feature. medium · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
18
top_value
0.000
top_rate
0.16
cardinality
18
entropy
3.686
entropy_ratio
0.884
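A minimal sketch of the suggested cast, using only the standard library. The sample values are hypothetical apart from '0.000', '0.500' and '2.500', which are quoted above; blank or non-numeric strings are mapped to None rather than guessed.

```python
def to_float(value):
    """Return float(value), or None when the input is blank, missing or not numeric."""
    try:
        return float(value.strip())
    except (AttributeError, ValueError):
        return None


raw = ["0.000", "0.500", "2.500", "", None]  # hypothetical sample
parsed = [to_float(v) for v in raw]
print(parsed)
```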

LeastReached

categorical feature
Binary Y/N flag, likely indicating whether some 'least reached' status applies to each record. The class is imbalanced toward Y at 76% (38 of 50), with N covering the remaining 12. No nulls, and entropy ratio of 0.80 reflects the moderate skew. Treatment: Encode as a 0/1 indicator; consider class-imbalance handling if used as a target. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.76
cardinality
2
entropy
0.795
entropy_ratio
0.795

Continent

categorical feature
Categorical continent label with 6 distinct values across 50 rows and no nulls. Asia dominates at 38% (19 rows), followed by Africa (14) and Europe (10), while Australia and South America appear only once each. The skewed distribution and high entropy ratio (0.80) suggest reasonable spread but with a clear Asia/Africa concentration. Treatment: One-hot encode; consider grouping Australia and South America into an 'Other' bucket given single-row counts. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
6
top_value
Asia
top_rate
0.38
cardinality
6
entropy
2.067
entropy_ratio
0.7996

JPScalePC

categorical metadata imbalance
JPScalePC is a categorical column that holds the single value "1" across all 50 rows, giving it cardinality 1 and zero entropy. With a top_rate of 1.0 and no nulls, it carries no information and likely represents a constant flag or scale parameter that was never varied in this slice. Treatment: Drop before modelling; the column is constant. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
1
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

JPScaleText

categorical label
Categorical label describing a Joshua Project reach scale, with 4 distinct levels across 50 rows and no nulls. The distribution is heavily skewed: 'Unreached' covers 76% of rows, while 'Superficially Reached' appears just once, giving an entropy ratio of 0.54. Class imbalance will dominate any model trained on this field. Treatment: Treat as ordinal categorical and address class imbalance (e.g., stratify or collapse rare levels) before modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
4
top_value
Unreached
top_rate
0.76
cardinality
4
entropy
1.08
entropy_ratio
0.5402

SecurityLevel

numeric feature
SecurityLevel takes only 3 distinct integer values across 50 rows (min 0, max 2, median 2), so it reads as an ordinal category encoded as a number rather than a continuous measure. The distribution is bimodal: 42% of rows (21) are 0, and the mean of 1.12 implies that 27 rows are 2 and only 2 rows are 1. The strongly negative kurtosis (-1.90) is consistent with this two-peaked shape, and no outliers are flagged. Treatment: Treat as an ordinal category (0/1/2) or one-hot encode before modelling rather than using as a continuous numeric. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
3
min
0
max
2
mean
1.12
median
2
std
0.9823
q1
0
q3
2
iqr
2
skew
-0.2416
kurtosis
-1.899
n_outliers
0
outlier_rate
0
zero_rate
0.42

LRTop100

categorical feature imbalance
This is a categorical flag (likely 'is in LR Top 100') that takes the single value 'N' across all 50 rows. With cardinality 1 and entropy 0.0, it carries no information for any downstream model. The 'imbalance' alert here reflects total constancy rather than skew. Treatment: Drop; constant column with zero entropy. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
N
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

PrimaryReligionPGAC

categorical metadata imbalance
This column records the primary religion classification (PGAC), but every one of the 50 rows holds the single value "Islam". With cardinality of 1 and entropy of 0.0, it carries no information for modelling or segmentation. Treatment: Drop; constant column with zero variance. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
Islam
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

PCNonReligious

numeric feature outliers
Likely a percentage of non-religious population per row, with 50 records and only 6 distinct values. The distribution is dominated by zeros (zero_rate 0.755) with median and IQR both 0, yet a long right tail pushes the max to 10.0 and flags 12 outliers (24.5%). Skew of 1.89 and one missing value (null_rate 0.02) confirm a sparse, heavily right-skewed feature. Treatment: Treat as sparse; consider a zero/non-zero binary flag plus log1p transform before modelling. high · anthropic:claude-opus-4-7
n
50
nulls
1 (2.0%)
unique
6
min
0
max
10
mean
1.255
median
0
std
2.45
q1
0
q3
0
iqr
0
skew
1.892
kurtosis
2.849
n_outliers
12
outlier_rate
0.2449
zero_rate
0.7551
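One way the suggested treatment could look: a zero/non-zero indicator alongside a log1p-transformed value, with the missing row left as missing. The function name and sample list are hypothetical; only the maximum of 10.0 comes from the profile.

```python
import math


def sparse_features(value):
    """Map a raw percentage to (is_nonzero, log1p_value); None stays missing."""
    if value is None:
        return (None, None)
    return (int(value > 0), math.log1p(value))


raw = [0.0, 0.0, 10.0, None, 5.0]  # hypothetical sample; 10.0 is the reported max
features = [sparse_features(v) for v in raw]
print(features)
```

Because log1p(0) is 0, the zero-dominated bulk stays at zero while the right tail is compressed.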

PhotoCreditURL

categorical metadata
This column holds photo credit URLs (Pixabay and Flickr links), but with only 3 unique values across 50 rows it functions as a coarse source tag rather than a per-record citation. The top URL covers 58.3% of non-null rows (28 of 48), suggesting the same stock photo is reused widely. Null rate is 4% (2 rows). Treatment: Drop or retain as a low-cardinality source label; not useful as a modelling feature. high · anthropic:claude-opus-4-7
n
50
nulls
2 (4.0%)
unique
3
top_value
https://pixabay.com/photos/people-students-walk-street-muslim-6284192/
top_rate
0.5833
cardinality
3
entropy
1.384
entropy_ratio
0.8735

PhotoCreativeCommons

categorical feature
Binary Y/N flag indicating whether a photo carries a Creative Commons license, with no nulls across 50 rows. The distribution is heavily skewed toward 'N' at 84% (42 of 50), leaving only 8 records flagged 'Y'. Treatment: Encode as a boolean indicator; note class imbalance if used as a predictor. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.84
cardinality
2
entropy
0.6343
entropy_ratio
0.6343

PrayForPG

categorical free_text long_tail
Free-text prayer prompts for unreached people groups, with 60% of the 50 rows being empty strings and the remaining 20 entries each unique multi-sentence paragraphs. The dominance of the blank value (top_rate 0.60) coexists with very high textual diversity among non-blank rows, hence the long_tail alert and entropy_ratio of 0.61. No nulls are recorded, but the empty string is functioning as a missing-value sentinel. Treatment: Treat empty strings as missing and tokenize/embed the remaining prose if used as a feature. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
21
top_value
(empty string)
top_rate
0.6
cardinality
21
entropy
2.7
entropy_ratio
0.6146

PeopleGroupPhotoURL

categorical metadata
This column holds URLs to people-group profile photos hosted on joshuaproject.net, one per row with no nulls. Despite 50 rows, only 5 distinct images appear, and a single photo (p10375.jpg) covers 56% of records while the top three URLs account for 48 of 50 — suggesting many rows share the same people-group identity rather than being unique entities. Treatment: Drop for modelling; retain as a display asset or join key to a people-group lookup. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
5
top_value
https://joshuaproject.net/assets/media/profiles/photos/p10375.jpg
top_rate
0.56
cardinality
5
entropy
1.611
entropy_ratio
0.694

ROG2

categorical feature
ROG2 looks like a regional grouping code with 6 categories (ASI, AFR, EUR, NAR, AUS, LAM), fully populated across all 50 rows. Distribution is moderately concentrated — ASI leads at 38% (19/50) and AFR follows at 14, while AUS and LAM appear only once each. Entropy ratio of 0.80 indicates fairly even spread among the top regions but with thin tails. Treatment: One-hot encode, but consider collapsing AUS and LAM into an 'Other' bucket given single-row support. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
6
top_value
ASI
top_rate
0.38
cardinality
6
entropy
2.067
entropy_ratio
0.7996
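A sketch of the collapse-then-encode treatment in plain Python. The counts for ASI, AFR, AUS and LAM are quoted above; EUR = 10 is taken from the matching Continent profile, and NAR = 5 is inferred by subtraction from 50 rows. The min_count threshold is an illustrative choice.

```python
from collections import Counter


def collapse_rare(values, min_count=2, other="Other"):
    """Replace levels seen fewer than min_count times with a shared bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def one_hot(values):
    """Return the sorted level names and one 0/1 indicator row per value."""
    levels = sorted(set(values))
    return levels, [[int(v == lvl) for lvl in levels] for v in values]


# NAR = 5 is inferred as 50 - (19 + 14 + 10 + 1 + 1); it is not stated directly.
rog2 = ["ASI"] * 19 + ["AFR"] * 14 + ["EUR"] * 10 + ["NAR"] * 5 + ["AUS", "LAM"]
collapsed = collapse_rare(rog2)
levels, matrix = one_hot(collapsed)
print(levels, Counter(collapsed))
```

Collapsing first keeps the encoded width at five columns and avoids indicator columns that are non-zero for a single row.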

PhotoCCVersionText

categorical metadata
This is a categorical license-text field for photo Creative Commons versioning, with only 2 distinct values across 50 rows. 84% of entries are empty strings (42/50), and the remaining 8 carry 'CC BY-NC-SA 2.0' — there are no nulls, just blanks standing in for missing licenses. Entropy ratio of 0.63 reflects this binary, heavily imbalanced split. Treatment: Recode empty strings to missing and collapse to a binary has_license flag. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
(empty string)
top_rate
0.84
cardinality
2
entropy
0.6343
entropy_ratio
0.6343
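The recode could be sketched as follows, rebuilding the column from the counts quoted above (42 blanks, 8 licensed). The names blank_to_none and has_license are illustrative.

```python
def blank_to_none(value):
    """Treat None, empty and whitespace-only strings as missing."""
    if value is None or value.strip() == "":
        return None
    return value


license_text = [""] * 42 + ["CC BY-NC-SA 2.0"] * 8
cleaned = [blank_to_none(v) for v in license_text]
has_license = [int(v is not None) for v in cleaned]
print(sum(has_license), "of", len(has_license), "rows carry a license")
```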

Longitude

numeric feature outliers
Geographic longitude coordinates spanning -118.3 to 151.2, covering most of the globe's east-west range with all 50 values unique. The distribution is left-skewed (skew -0.77) with a median of 39.4 sitting well above the mean of 27.5, and 7 outliers (14%) flag locations far from the central cluster around Europe/Asia. Treatment: Pair with Latitude as a 2D geospatial feature; avoid scaling independently. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
50
min
-118.3
max
151.2
mean
27.55
median
39.45
std
52.08
q1
10.72
q3
51.19
iqr
40.47
skew
-0.7666
kurtosis
1.346
n_outliers
7
outlier_rate
0.14
zero_rate
0
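One possible way to use longitude jointly with latitude, offered as an illustration rather than the report's prescribed method, is to map each coordinate pair to a point on the unit sphere. The three resulting coordinates share a scale and do not break at the +/-180 degree seam. The latitude values below are hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Two points on either side of the antimeridian: far apart in raw longitude,
# close together on the sphere.
a = to_unit_vector(0.0, 179.0)
b = to_unit_vector(0.0, -179.0)
print(round(math.dist(a, b), 4))
```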

JPScaleImageURL

categorical feature
This column holds URLs to Joshua Project gauge images (gauge-1 through gauge-4), almost certainly a visual encoding of an ordinal progress/status scale. Distribution is heavily skewed: 76% point to gauge-1.png, while gauge-3.png appears only once across 50 rows. With just 4 unique values and no nulls, it's a low-cardinality categorical masquerading as a URL. Treatment: Extract the gauge digit (1-4) and treat as an ordinal feature rather than keeping the URL string. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
4
top_value
https://joshuaproject.net/assets/img/gauge/gauge-1.png
top_rate
0.76
cardinality
4
entropy
1.08
entropy_ratio
0.5402
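Extracting the gauge digit might look like the sketch below. The URL pattern is taken from the top_value shown above; the assumption that every value follows the same gauge-N.png naming is untested, so non-matching strings return None instead of raising.

```python
import re

GAUGE_RE = re.compile(r"gauge-(\d+)\.png$")


def gauge_level(url):
    """Return the integer gauge level embedded in the URL, or None if absent."""
    match = GAUGE_RE.search(url or "")
    return int(match.group(1)) if match else None


top_url = "https://joshuaproject.net/assets/img/gauge/gauge-1.png"
print(gauge_level(top_url))
```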

OfficialLang

categorical feature long_tail
This is a categorical column naming an official language, with 21 distinct values across 50 rows and no nulls. The distribution is heavily skewed: 'Arabic, Standard' alone covers 36% of rows, followed by English (8) and French (5), while a long tail of languages like Swahili, Ukrainian, Sinhala and Bulgarian appear only once. Entropy ratio of 0.77 confirms concentration at the top despite the wide vocabulary. Treatment: Group rare languages into an 'Other' bucket before one-hot encoding. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
21
top_value
Arabic, Standard
top_rate
0.36
cardinality
21
entropy
3.39
entropy_ratio
0.7719

PhotoPermission

categorical feature imbalance
Binary Y/N flag indicating whether photo permission has been granted, with no missing values across 50 rows. The distribution is severely imbalanced: 'N' covers 49 of 50 rows (top_rate 0.98) with only a single 'Y', yielding entropy_ratio of 0.14. With one positive case, this column carries almost no discriminative signal in the current sample. Treatment: Drop or hold aside as a near-constant flag until more 'Y' cases accumulate. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.98
cardinality
2
entropy
0.1414
entropy_ratio
0.1414

PCHinduism

numeric feature high_skew
PCHinduism appears to be the percentage of each group adhering to Hinduism (by analogy with PCIslam), with 95.9% of the 49 non-null values being zero and only 3 distinct values. The distribution is extremely sparse and right-skewed (skew 5.96, kurtosis 35.36): the 2 non-zero rows, topping out at 6.0, are flagged as outliers against a median and IQR of 0. Effectively a near-constant feature with rare nonzero spikes. Treatment: Binarize to zero/nonzero or drop, given 95.9% zeros and only 3 unique values. high · anthropic:claude-opus-4-7
n
50
nulls
1 (2.0%)
unique
3
min
0
max
6
mean
0.1633
median
0
std
0.8978
q1
0
q3
0
iqr
0
skew
5.957
kurtosis
35.36
n_outliers
2
outlier_rate
0.04082
zero_rate
0.9592

PeopleID3

numeric foreign_key high_skew outliers
PeopleID3 is a numeric identifier-like field with only 5 unique values across 50 rows, clustered tightly around 10375-10376 (IQR of 1.0). The distribution is severely left-skewed (skew -5.59, kurtosis 31.2) because at least one value drops to 10208 while the bulk sits near the max of 10378, producing a 20% outlier rate. Despite being typed as numeric, the near-constant range and low cardinality suggest this behaves as a categorical key rather than a measurement. Treatment: Treat as a categorical identifier; do not use as a continuous numeric feature. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
5
min
10,208
max
10,378
mean
1.037e+04
median
10,375
std
25.8
q1
10,375
q3
10,376
iqr
1
skew
-5.593
kurtosis
31.24
n_outliers
10
outlier_rate
0.2
zero_rate
0

PeopleID1

numeric other constant
PeopleID1 is flagged as constant: every one of the 50 rows holds the value 10, with zero variance and a single unique value. Despite the 'ID' name, it carries no identifying information and cannot distinguish records. There is no null or outlier activity to interpret. Treatment: Drop, constant column with no signal. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
min
10
max
10
mean
10
median
10
std
0
q1
10
q3
10
iqr
0
skew
0
kurtosis
0
n_outliers
0
outlier_rate
0
zero_rate
0

SpeakNationalLang

unknown feature skipped
Saturn skipped this column, so no profiling stats were computed beyond row count (50) and a null rate of 0.0. The name 'SpeakNationalLang' suggests a binary or categorical indicator of whether a people group speaks the national language, but kind is 'unknown' and n_unique is missing, so the actual value distribution cannot be confirmed from this evidence. Treatment: Re-profile or manually inspect to determine dtype before use; if binary, encode as 0/1. low · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique

PortionsYear

categorical feature long_tail
PortionsYear is a low-cardinality categorical with 9 unique values across 50 rows and a 16% null rate. Most entries are year ranges (e.g., "1939-2021", which covers 42.9% of the 42 non-null rows, and "2009-2024"), but 4 rows contain the string "Yes", indicating a mixed/dirty schema where a boolean answer was recorded in a date-range field. Entropy ratio of 0.70 and a long-tail alert reflect many singleton ranges alongside two dominant values. Treatment: Clean the type mismatch (separate "Yes" from year ranges), then parse ranges into start/end year numerics before modelling. high · anthropic:claude-opus-4-7
n
50
nulls
8 (16.0%)
unique
9
top_value
1939-2021
top_rate
0.4286
cardinality
9
entropy
2.222
entropy_ratio
0.7009
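A sketch of the clean-up, assuming values are either a "YYYY-YYYY" range, a bare "YYYY" year (an assumption, since the profile does not list every value), the literal "Yes", or missing. Anything else is left unparsed rather than guessed.

```python
import re

YEAR_RE = re.compile(r"^(\d{4})(?:-(\d{4}))?$")


def parse_portions_year(value):
    """Return (start_year, end_year, flag_only) for one PortionsYear value."""
    if value is None:
        return (None, None, False)
    text = value.strip()
    match = YEAR_RE.match(text)
    if match:
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        return (start, end, False)
    # "Yes" carries no dates; keep it as a separate boolean signal.
    return (None, None, text.lower() == "yes")


for raw in ["1939-2021", "2009-2024", "Yes", None]:
    print(raw, parse_portions_year(raw))
```

Keeping the "Yes" rows as a flag preserves the information that portions exist without inventing a date for them.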

PrimaryReligionPC

categorical feature imbalance
This column records the primary religion of a people-cluster (PC), but every one of the 50 rows holds the value "Islam". With cardinality 1 and entropy 0, it carries no information for distinguishing records in this slice. Treatment: Drop; constant column with zero entropy. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
Islam
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

PCUnknown

numeric feature constant
PCUnknown is a numeric column that is effectively constant: every one of the 49 non-null observations is exactly 0, and 2% of rows are null. There is no variance, no spread, and no outliers, so the column carries no information as currently populated. Treatment: Drop; constant column with zero variance. high · anthropic:claude-opus-4-7
n
50
nulls
1 (2.0%)
unique
1
min
0
max
0
mean
0
median
0
std
0
q1
0
q3
0
iqr
0
skew
0
kurtosis
0
n_outliers
0
outlier_rate
0
zero_rate
1

ProfileTextExists

categorical feature
Binary Y/N flag indicating whether a profile text exists, with no nulls across 50 rows. The distribution is heavily imbalanced: 'Y' covers 45 of 50 (top_rate 0.9) versus only 5 'N', giving an entropy ratio of 0.47. Treatment: Encode as a 0/1 boolean; watch for low signal given the 90/10 imbalance. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.9
cardinality
2
entropy
0.469
entropy_ratio
0.469

PCOtherSmall

numeric feature high_skew outliers
PCOtherSmall is a numeric feature, likely the percentage of each group following other small religions (by analogy with PCIslam), that is zero for nearly every row: 93.9% of the 49 non-null rows are 0 and only 3 distinct values appear. The distribution is extremely right-skewed (skew 6.41, kurtosis 40.46), with the 3 non-zero rows (max 7) all flagged as outliers (6.1% outlier rate), so almost all signal lives in a tiny tail. Treatment: Binarize to zero/non-zero or drop, since variance is concentrated in a handful of outliers. high · anthropic:claude-opus-4-7
n
50
nulls
1 (2.0%)
unique
3
min
0
max
7
mean
0.1837
median
0
std
1.014
q1
0
q3
0
iqr
0
skew
6.411
kurtosis
40.46
n_outliers
3
outlier_rate
0.06122
zero_rate
0.9388

BibleStatus

numeric feature
BibleStatus is a small-range integer code with only 4 distinct values spanning 2 to 5, mean 3.5 and median 4. The tight IQR (3 to 4) and absence of zeros or nulls suggest it's an ordinal status flag rather than a true numeric measurement. Mild left skew (-0.38) indicates most records sit at the higher end of the scale. Treatment: Treat as an ordinal categorical and one-hot or ordinal-encode before modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
4
min
2
max
5
mean
3.5
median
4
std
0.8631
q1
3
q3
4
iqr
1
skew
-0.3848
kurtosis
-0.6309
n_outliers
0
outlier_rate
0
zero_rate
0

Frontier

categorical feature
Frontier is a binary Y/N flag with no nulls across 50 rows. The distribution is heavily skewed toward 'N' at 84% (42 of 50), leaving only 8 'Y' cases, which limits its discriminative power. Treatment: Encode as a 0/1 indicator; watch for class imbalance when using as a predictor. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.84
cardinality
2
entropy
0.6343
entropy_ratio
0.6343

MapAddress

categorical metadata long_tail
MapAddress holds filenames of map images (e.g. 'm00007.png', 'm10375_ke.png'), with country/region suffixes appearing on some variants. Nearly half the rows (24/50) are empty strings and a single file 'm00007.png' covers 9 more, leaving a long tail of 15 other filenames at 1-2 occurrences each. Cardinality is 17 unique values with entropy ratio 0.68, so the column is dominated by the blank and one hero image. Treatment: Treat empty string as missing and group the long tail before any categorical encoding. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
17
top_value
(empty string)
top_rate
0.48
cardinality
17
entropy
2.792
entropy_ratio
0.6832

PeopleID3ROG3

categorical identifier long_tail
PeopleID3ROG3 looks like a per-row identifier: every one of the 50 rows holds a distinct alphanumeric code (5 digits followed by 2 letters, e.g. '10208NG'), giving cardinality 50 and entropy_ratio 1.0. Top_rate is 0.02 because no value repeats, and there are no nulls. The long_tail alert simply reflects that uniqueness rather than any skew. Treatment: drop from modelling features; retain as a join/lookup key. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
50
top_value
10208NG
top_rate
0.02
cardinality
50
entropy
5.644
entropy_ratio
1

ROP3

numeric feature high_skew outliers
ROP3 is a low-cardinality integer code rather than a continuous reading: only 5 unique values appear across 50 rows, clustered at 100425-100427 (median 100425, IQR 2), with a minimum of 100161 and a maximum of 100431. The extreme negative skew (-5.53), kurtosis above 30, and 20% outlier rate are artefacts of treating code values as magnitudes; the few low codes pull the mean (100418.74) below the median. The profile closely mirrors PeopleID3 (also 5 unique values and 10 flagged outliers), and the sibling columns ROP1 and ROP2 are categorical codes, so ROP3 most plausibly identifies the people group rather than measuring anything. Treatment: Treat as a categorical identifier or join key; do not centre, bin, or model it as a continuous numeric. medium · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
5
min
100,161
max
100,431
mean
1.004e+05
median
100,425
std
41.09
q1
100,425
q3
100,427
iqr
2
skew
-5.53
kurtosis
30.52
n_outliers
10
outlier_rate
0.2
zero_rate
0

PrimaryLanguageDialect

categorical metadata long_tail null_rate imbalance
PrimaryLanguageDialect is a categorical field that is effectively empty: 98% of the 50 rows are null, and the single non-null value is the string "Air". With only 1 unique value, entropy is 0, so the column carries no signal in this sample. The value is plausibly legitimate rather than an error: PeopNameAcrossCountries contains exactly one 'Tuareg, Air' row and NaturalPronunciation lists 'AH-eer TWA-reg', so "Air" likely names that group's dialect. Treatment: Drop from modelling; check that the "Air" value sits on the 'Tuareg, Air' row before treating it as a data-entry error. high · anthropic:claude-opus-4-7
n
50
nulls
49 (98.0%)
unique
1
top_value
Air
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

JPScale

numeric feature high_skew outliers
JPScale is a small-integer ordinal feature with only 4 distinct values ranging from 1 to 4, where the bulk of records sit at 1 (median, Q1 and Q3 all equal 1.0, IQR 0.0). The distribution is heavily right-skewed (skew 2.29, kurtosis 4.44) and 24% of rows (12 of 50) flag as outliers simply because anything above 1 deviates from the dominant value. Mean 1.38 against std 0.81 confirms most mass is at the floor with a long thin tail toward 4. Treatment: Treat as an ordinal/categorical scale (1-4) rather than continuous; one-hot or bin the rare 2-4 levels. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
4
min
1
max
4
mean
1.38
median
1
std
0.8053
q1
1
q3
1
iqr
0
skew
2.29
kurtosis
4.437
n_outliers
12
outlier_rate
0.24
zero_rate
0

HasAudioRecordings

categorical metadata imbalance
This column is a flag indicating whether audio recordings exist, but every one of the 50 rows holds the value "Y". Cardinality is 1 and entropy is 0, so it carries no information for any downstream task. Treatment: Drop; constant column with zero entropy. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
Y
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

PCBuddhism

numeric feature constant
PCBuddhism appears to be a numeric feature (likely the percentage of each group adhering to Buddhism, by analogy with PCIslam) that carries no information in this sample: every one of the 49 non-null values is exactly 0.0, with a 2% null rate. The constant alert and zero_rate of 1.0 confirm there is no variance to model. Treatment: Drop; constant column with no predictive signal. high · anthropic:claude-opus-4-7
n
50
nulls
1 (2.0%)
unique
1
min
0
max
0
mean
0
median
0
std
0
q1
0
q3
0
iqr
0
skew
0
kurtosis
0
n_outliers
0
outlier_rate
0
zero_rate
1

PeopNameAcrossCountries

categorical feature
This column appears to label ethnic or people-group identities across countries, with only 5 unique values across 50 rows and no nulls. The distribution is heavily skewed toward 'Arab' (28 of 50, top_rate 0.56), with 'Arab, Arabic Gulf Spoken' and 'Arab, Omani' as the next most common, while 'Tuareg, Air' and 'Amri' appear just once each. Entropy ratio of 0.69 confirms moderate concentration rather than uniform spread. Treatment: Group rare categories or one-hot encode after consolidating the long tail. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
5
top_value
Arab
top_rate
0.56
cardinality
5
entropy
1.611
entropy_ratio
0.694

PhotoCCVersionURL

categorical metadata
This column appears to hold a Creative Commons license URL associated with a photo, but it is overwhelmingly empty: 42 of 50 rows are blank strings and only 8 carry the single license value 'https://creativecommons.org/licenses/by-nc-sa/2.0/'. With just 2 unique values and a top_rate of 0.84, it functions more as a binary licensed/unlicensed flag than a true URL field. Note that nulls are reported as 0.0 because the missing entries are empty strings rather than true nulls. Treatment: Convert to a boolean has_license flag rather than treating as a URL. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
(empty string)
top_rate
0.84
cardinality
2
entropy
0.6343
entropy_ratio
0.6343

MapCCVersionText

categorical metadata imbalance
MapCCVersionText is a categorical column that contains a single value — the empty string — across all 50 rows. Cardinality is 1, entropy is 0, and null_rate is 0.0, so the field is technically populated but carries no information. Treatment: Drop; constant empty value provides no signal. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
(empty string)
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

PercentChristianPC

categorical feature long_tail imbalance
This appears to be the percentage of Christians at the people-cluster (PC) level, stored as strings with only 3 distinct values across 50 rows. It is near-constant: the value '0.997' covers 48/50 rows (top_rate 0.96), with '0.116' and '1.344' each appearing once, yielding an entropy ratio of just 0.178. ROP2 shows the same 48/1/1 split with identical entropy, which suggests the value is fixed per people cluster and the two non-modal rows simply belong to other clusters rather than being data-entry artefacts. Treatment: Drop or treat as near-constant; if kept, cast to float and check whether it is redundant with ROP2 before modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
3
top_value
0.997
top_rate
0.96
cardinality
3
entropy
0.2823
entropy_ratio
0.1781

Nomadic

categorical feature
Binary Y/N flag indicating whether a people group is nomadic, with no nulls across 50 rows. The distribution is heavily imbalanced: 'N' covers 45 of 50 (top_rate 0.9) while 'Y' appears only 5 times, yielding a low entropy_ratio of 0.47. Treatment: Encode as a 0/1 indicator; watch for class imbalance given only 5 positives. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.9
cardinality
2
entropy
0.469
entropy_ratio
0.469

PrayForChurch

categorical free_text long_tail
Free-text prayer prompts about Christian outreach to Arab/Muslim people groups, stored as a categorical but functionally a short-document field. 42 of 50 rows (top_rate 0.84) are empty strings and the remaining 8 are all unique long sentences, giving 9 distinct values and an entropy ratio of 0.35. The long_tail alert reflects this empty-vs-unique split rather than meaningful category structure. Treatment: Treat empty strings as missing and tokenize/embed the remaining prose rather than one-hot encoding. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
9
top_value
(empty string)
top_rate
0.84
cardinality
9
entropy
1.114
entropy_ratio
0.3515

RLG3PGAC

numeric metadata constant
RLG3PGAC is a numeric column that holds the constant value 6.0 across all 50 rows, with zero variance and no nulls. Since min, max, mean, and both quartiles are all 6.0, the column carries no information for modelling or analysis. Treatment: Drop, constant column. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
min
6
max
6
mean
6
median
6
std
0
q1
6
q3
6
iqr
0
skew
0
kurtosis
0
n_outliers
0
outlier_rate
0
zero_rate
0

ISO3

categorical foreign_key long_tail
ISO3 holds three-letter country codes (ARE, CAN, EGY, KEN...), making it a country identifier. With 41 unique values across 50 rows and entropy ratio 0.986, it is near-uniform with only nine countries appearing twice; no nulls. The long_tail alert reflects that most countries appear exactly once. Treatment: left-join on this id to enrich with country attributes. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
41
top_value
ARE
top_rate
0.04
cardinality
41
entropy
5.284
entropy_ratio
0.9862
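The enrichment join could be sketched in plain Python as below. The row_id values, the country_name field and the lookup table are hypothetical stand-ins for whatever country reference data is available; only the ISO3 codes themselves appear in the profile.

```python
def left_join(rows, lookup, key, fields):
    """Attach lookup fields to each row by key; unmatched rows get None."""
    joined = []
    for row in rows:
        extra = lookup.get(row[key], {})
        joined.append({**row, **{f: extra.get(f) for f in fields}})
    return joined


# Hypothetical rows and country reference table.
rows = [
    {"row_id": 1, "ISO3": "ARE"},
    {"row_id": 2, "ISO3": "EGY"},
    {"row_id": 3, "ISO3": "CAN"},
]
countries = {
    "ARE": {"country_name": "United Arab Emirates"},
    "EGY": {"country_name": "Egypt"},
    "KEN": {"country_name": "Kenya"},
}

enriched = left_join(rows, countries, key="ISO3", fields=["country_name"])
for row in enriched:
    print(row)
```

A left join keeps every profiled row even when the reference table lacks a code (CAN here), so gaps show up as None values instead of dropped records.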

NaturalPronunciation

categorical feature
Phonetic pronunciation guides for people-group names, with most variants ending in 'AE-rub' (likely 'Arab'). The top value 'AE-rub' covers 55% of non-null rows (27 of 49), and the top three values account for 46 of the 49 non-null entries, leaving three singleton spellings such as 'AH-eer TWA-reg' and 'KEN-yun AE-rub'. Cardinality is just 6 with a 2% null rate, suggesting a controlled vocabulary rather than free text. Treatment: One-hot encode or group the three singleton categories into 'other' before modelling. high · anthropic:claude-opus-4-7
n
50
nulls
1 (2.0%)
unique
6
top_value
AE-rub
top_rate
0.551
cardinality
6
entropy
1.728
entropy_ratio
0.6686

PhotoAddress

categorical metadata
PhotoAddress holds JPG filenames (e.g., p10375.jpg), so it points to image assets associated with each row. With only 5 unique values across 50 rows and the top file p10375.jpg covering 56% of records, the same images are reused heavily rather than being row-specific. Entropy ratio of 0.69 confirms a skewed distribution dominated by three filenames, while two others appear just once each. Treatment: Treat as a low-cardinality asset reference; join to an image table or drop unless image features are needed. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
5
top_value
p10375.jpg
top_rate
0.56
cardinality
5
entropy
1.611
entropy_ratio
0.694

RegionCode

numeric foreign_key
RegionCode is stored as an integer but only takes 11 distinct values across 50 rows (min 1, max 12), so it is almost certainly a categorical region identifier rather than a true numeric quantity. The distribution is roughly centered (mean 7.28, median 7) with low skew (-0.09) and one flagged outlier, but those moments are not meaningful for a code. No nulls or zeros are present. Treatment: Cast to categorical and one-hot or target-encode before modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
11
min
1
max
12
mean
7.28
median
7
std
2.711
q1
6
q3
9
iqr
3
skew
-0.08855
kurtosis
-0.2134
n_outliers
1
outlier_rate
0.02
zero_rate
0

LocationInCountry

categorical free_text long_tail null_rate
Free-text geographic descriptions of where a group is located within a country, ranging from single words like "Scattered" to multi-clause sentences naming provinces and landmarks. The column is 72% null and only 14 of 50 rows carry values, yet entropy_ratio is 0.99 with 13 unique strings across those 14 rows, so nearly every entry is bespoke. The top value "Widespread." appears just twice, so there is no usable category structure. Treatment: Treat as free-text notes; geocode or NER-extract place names rather than one-hot encoding. high · anthropic:claude-opus-4-7
n
50
nulls
36 (72.0%)
unique
13
top_value
Widespread.
top_rate
0.1429
cardinality
13
entropy
3.664
entropy_ratio
0.9903

JF

categorical feature
JF is a binary Y/N flag with no missing values across 50 rows. The distribution is imbalanced: 'Y' accounts for 41 of 50 records (top_rate 0.82) versus 9 'N's, yielding entropy of 0.68. Treatment: Encode as a 0/1 indicator and account for the 82/18 class imbalance in modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
Y
top_rate
0.82
cardinality
2
entropy
0.6801
entropy_ratio
0.6801

PopulationPGAC

numeric feature outliers
PopulationPGAC appears to be the population of the people group across all countries (PGAC), with values ranging from 101,000 to 7,562,600 across 50 rows. Only 5 unique values appear, matching the 5 distinct PeopNameAcrossCountries labels, which suggests a group-level total repeated on every row of that group rather than a per-row measurement. The right skew (1.03) and 26% outlier rate (13 rows) stem from rows whose values sit far above the Q3 of 3,096,000. Treatment: Treat as a group-level attribute: deduplicate by people group before summing or averaging, and log-transform if kept as a numeric feature. medium · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
5
min
101,000
max
7.563e+06
mean
3.402e+06
median
1.927e+06
std
2.427e+06
q1
1.927e+06
q3
3.096e+06
iqr
1.169e+06
skew
1.03
kurtosis
-0.6488
n_outliers
13
outlier_rate
0.26
zero_rate
0

PeopleGroupMapExpandedURL

categorical metadata long_tail
This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net. It is mostly empty: 38 of 50 rows (top_rate 0.76) are blank strings, and the remaining 12 rows carry 10 distinct links (11 distinct values once the blank is counted). Despite null_rate being 0, the dominant value is an empty string, so true coverage is roughly a quarter of rows. Treatment: Convert empty strings to nulls and treat as an optional reference link rather than a modelling feature. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
11
top_value
(empty string)
top_rate
0.76
cardinality
11
entropy
1.575
entropy_ratio
0.4554
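
The suggested treatment, converting blanks to nulls before measuring coverage, might look like the sketch below. The sample list is a hypothetical stand-in with the same shape as the column (38 blanks, 12 links); the example.org URLs are placeholders, not real values from the dataset.

```python
def blank_to_none(values):
    """Replace empty or whitespace-only strings with None."""
    return [None if isinstance(v, str) and not v.strip() else v for v in values]

def coverage(values):
    """Share of entries that are not None."""
    return sum(v is not None for v in values) / len(values)

# Hypothetical stand-in for the column: 38 blanks and 12 placeholder links.
sample = [""] * 38 + [f"https://example.org/map_{i}.pdf" for i in range(12)]
cleaned = blank_to_none(sample)
print(coverage(cleaned))  # 0.24
```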

TranslationNeedQuestionable

unknown other skipped
The column 'TranslationNeedQuestionable' was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 with no nulls. The name suggests a boolean or flag indicating whether a translation need is in doubt, but this cannot be confirmed from the evidence. No distributional signals are present to flag. Treatment: Re-profile or manually inspect to determine type before any downstream use. low · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
(not available)

Category

categorical label
A low-cardinality categorical with 3 distinct values ('1','2','3') across 50 rows and no nulls, likely a class label or category code. The distribution is imbalanced: '1' dominates at 68% (34/50) while '2' and '3' account for just 7 and 9 rows respectively, giving an entropy ratio of 0.77. Treatment: Treat as a categorical class label and address the class imbalance (e.g., stratified splits or reweighting) before modelling. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
3
top_value
1
top_rate
0.68
cardinality
3
entropy
1.221
entropy_ratio
0.7702
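
One way to act on the reweighting suggestion is inverse-frequency class weights. The sketch below uses the n_samples / (n_classes × class_count) heuristic, the same formula scikit-learn applies for `class_weight="balanced"`; whether it suits a given model is a judgment call, and `balanced_class_weights` is an illustrative name.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

category_counts = {"1": 34, "2": 7, "3": 9}  # counts from the Category summary
weights = balanced_class_weights(category_counts)
for label, w in weights.items():
    print(label, round(w, 3))
```

The majority class '1' ends up with a weight below 1 and the two minority classes with weights above 1, so each class contributes equally in total.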

PhotoCopyright

categorical feature imbalance
PhotoCopyright is a binary Y/N flag, most likely indicating whether a photo carries a copyright restriction. The distribution is severely imbalanced: 49 of 50 rows are 'N' and only 1 is 'Y', giving an entropy ratio of just 0.14. With effectively no variance, this column carries little signal on its own. Treatment: Drop or retain only as a rare-event indicator; near-constant at 98% 'N'. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.98
cardinality
2
entropy
0.1414
entropy_ratio
0.1414

NTOnline

categorical feature imbalance
NTOnline is a categorical flag that takes only the value 'Y' across all 41 non-null rows; the remaining 9 rows (18%) are null. It is effectively a constant indicator with no discriminative information, and its non-trivial missingness may itself be the only signal. Treatment: Drop as a zero-variance column, or replace with a binary is_null indicator if missingness is meaningful. high · anthropic:claude-opus-4-7
n
50
nulls
9 (18.0%)
unique
1
top_value
Y
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0
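
If missingness is kept as the signal, the is_null indicator could be derived as in the sketch below. The list is a hypothetical stand-in matching the reported counts (41 'Y', 9 nulls), and `missing_indicator` is an illustrative name.

```python
def missing_indicator(values):
    """Return 1 where the entry is None, else 0."""
    return [1 if v is None else 0 for v in values]

# Hypothetical stand-in matching the reported counts: 41 'Y' and 9 nulls.
nt_online = ["Y"] * 41 + [None] * 9
nt_online_missing = missing_indicator(nt_online)
print(sum(nt_online_missing) / len(nt_online_missing))  # 0.18
```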

LeastReachedPC

categorical metadata imbalance
This is a categorical flag that takes the value "Y" for all 50 rows, giving cardinality 1 and entropy 0.0. With a single constant value it carries no information and cannot discriminate between records. The name suggests it once tracked a 'least reached PC' status, but here it is degenerate. Treatment: Drop; constant column with zero entropy. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
Y
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

ROG3

categorical feature long_tail
ROG3 holds two-letter country codes (AE, CA, EG, KE, SO, etc.), making it a geographic categorical feature. With 41 unique values across 50 rows and a top rate of just 0.04, the column is almost a unique-per-row identifier: no code appears more than twice, and the entropy ratio of 0.9862 confirms a nearly flat distribution. No nulls, but the sample is too thin for any country to dominate. Treatment: Group rare countries into an 'Other' bucket or map to region before encoding. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
41
top_value
AE
top_rate
0.04
cardinality
41
entropy
5.284
entropy_ratio
0.9862
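
The 'Other' bucket treatment can be sketched as a simple count threshold. The toy list below is hypothetical and does not reproduce the real ROG3 values; `collapse_rare`, `min_count`, and the `"Other"` label are illustrative choices.

```python
from collections import Counter

def collapse_rare(values, min_count=2, other="Other"):
    """Keep levels seen at least min_count times; relabel the rest as `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Hypothetical toy column; the real ROG3 values are not reproduced here.
codes = ["AE", "AE", "EG", "EG", "CA", "KE", "SO"]
print(collapse_rare(codes))
# ['AE', 'AE', 'EG', 'EG', 'Other', 'Other', 'Other']
```

Because no code appears more than twice in this sample, any threshold above 1 would move most rows into the bucket, so mapping to region may be the more useful of the two options here.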

ReligionSubdivision

categorical metadata null_rate imbalance
This column records a religious subdivision, but it is overwhelmingly empty: 86% of the 50 rows are null, and every one of the 7 populated rows is 'Sunni'. With cardinality of 1 and entropy of 0, the field carries no discriminative signal in this sample. Treatment: Drop or collapse to a binary 'Sunni vs missing' indicator; otherwise non-informative. high · anthropic:claude-opus-4-7
n
50
nulls
43 (86.0%)
unique
1
top_value
Sunni
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

PCEthnicReligions

numeric feature high_skew outliers
PCEthnicReligions is a numeric column (the PC prefix, as in PCIslam, suggests a percentage) in which 93.9% of the 49 non-null values (46 rows) are zero and only 3 distinct values appear. The distribution is highly right-skewed (skew 4.45, kurtosis 19.79) with a max of 10 against a median of 0, producing 3 outliers (6.1%). One null is present. Treatment: Binarize to zero/non-zero or drop, since near-constant zeros carry little signal. high · anthropic:claude-opus-4-7
n
50
nulls
1 (2.0%)
unique
3
min
0
max
10
mean
0.4082
median
0
std
1.719
q1
0
q3
0
iqr
0
skew
4.446
kurtosis
19.79
n_outliers
3
outlier_rate
0.06122
zero_rate
0.9388
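
A minimal sketch of the zero/non-zero binarization, keeping the single null as a null instead of silently coding it as zero. The input list is hypothetical and only mimics the shape of the summary; `binarize_nonzero` is an illustrative name.

```python
def binarize_nonzero(values):
    """Map each value to 1 if non-zero, 0 if zero, keeping None as None."""
    return [None if v is None else int(v != 0) for v in values]

# Hypothetical values shaped like the summary: mostly zeros, one null.
pc_ethnic = [0, 0, 10, None, 0, 5]
print(binarize_nonzero(pc_ethnic))  # [0, 0, 1, None, 0, 1]
```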

PeopleCluster

categorical feature long_tail imbalance
Categorical grouping of people clusters, with only 3 distinct values across 50 rows and no nulls. The distribution is extremely imbalanced: 'Arab, Arabian' covers 96% of rows, leaving 'Tuareg' and 'Arab, Sudan' with a single record each, yielding a very low entropy ratio of 0.178. Treatment: Drop or collapse rare levels; near-constant column offers little signal. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
3
top_value
Arab, Arabian
top_rate
0.96
cardinality
3
entropy
0.2823
entropy_ratio
0.1781

IndigenousCode

categorical feature
Binary Y/N flag indicating Indigenous status, with 'N' dominating at 86% (43 of 50) versus 7 'Y' records. The column is fully populated with no nulls and only 2 distinct values, yielding a low entropy ratio of 0.58. The class imbalance is notable for any modelling use. Treatment: Encode as a binary indicator and account for class imbalance in any downstream model. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
2
top_value
N
top_rate
0.86
cardinality
2
entropy
0.5842
entropy_ratio
0.5842
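
Encoding a Y/N flag such as this one as a 0/1 indicator could look like the sketch below. It raises on any value other than 'Y' or 'N' so that unexpected codes surface early; that strictness is a design choice of this example, and `encode_yn` is an illustrative name.

```python
def encode_yn(values):
    """Encode 'Y' as 1 and 'N' as 0; raise on anything else."""
    mapping = {"Y": 1, "N": 0}
    encoded = []
    for v in values:
        if v not in mapping:
            raise ValueError(f"unexpected flag value: {v!r}")
        encoded.append(mapping[v])
    return encoded

# Stand-in with the reported counts: 43 'N' and 7 'Y'.
flags = ["N"] * 43 + ["Y"] * 7
encoded = encode_yn(flags)
print(sum(encoded) / len(encoded))  # 0.14
```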

MapCreditURL

categorical metadata imbalance
MapCreditURL appears to be a metadata field intended to hold a URL crediting a map's source, but every one of the 50 rows is an empty string. Cardinality is 1, entropy is 0, and the top value (empty) accounts for 100% of records, so the column carries no information. Treatment: Drop; the column is constant (all empty strings). high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
(empty string)
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

MapCopyright

categorical metadata imbalance
MapCopyright is a categorical column that holds the single value "N" across all 50 rows, giving it zero entropy and a top_rate of 1.0. With cardinality of 1 and no nulls, it carries no information for any downstream model or comparison. Treatment: Drop; constant column with a single value. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
N
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0

MapCCVersionURL

categorical metadata imbalance
This column appears to be a metadata field intended to hold a Creative Commons version URL for a map, but every one of the 50 rows contains an empty string. Cardinality is 1 with entropy of 0.0, so it carries no information whatsoever. Treatment: Drop the column; it is constant (empty) across all rows. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
1
top_value
(empty string)
top_rate
1
cardinality
1
entropy
0
entropy_ratio
0
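
MapCreditURL, MapCopyright, and MapCCVersionURL are all flagged as constant, two of them only because every entry is an empty string. A check that catches both cases might look like the sketch below; the miniature table is hypothetical, and `is_constant` with its `treat_blank_as_missing` flag is an illustrative helper.

```python
def is_constant(values, treat_blank_as_missing=True):
    """True if the column has at most one distinct non-missing value."""
    seen = set()
    for v in values:
        if v is None:
            continue
        if treat_blank_as_missing and isinstance(v, str) and not v.strip():
            continue
        seen.add(v)
    return len(seen) <= 1

# Hypothetical miniature table keyed by column name.
table = {
    "MapCopyright": ["N", "N", "N"],
    "MapCreditURL": ["", "", ""],
    "IndigenousCode": ["N", "Y", "N"],
}
to_drop = [name for name, col in table.items() if is_constant(col)]
print(to_drop)  # ['MapCopyright', 'MapCreditURL']
```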

PeopleGroupURL

categorical identifier long_tail
This column holds Joshua Project people-group URLs, with each row pointing to a /people_groups/{id}/{country} path. All 50 rows are unique (cardinality 50, entropy_ratio 1.0, top_rate 0.02) and there are no nulls, so it functions as a per-row identifier rather than a categorical feature. The group ids in the path repeat (e.g. 10375 appears across TZ, UP, AG, AS, AU, BA), suggesting the same people group is tracked across multiple country codes. Treatment: Drop from modelling; retain as a row-level link or parse out the group id and country code into separate keys. high · anthropic:claude-opus-4-7
n
50
nulls
0 (0.0%)
unique
50
top_value
https://joshuaproject.net/people_groups/10208/NG
top_rate
0.02
cardinality
50
entropy
5.644
entropy_ratio
1
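
Parsing the group id and country code out of the URL, as the treatment suggests, can be sketched with the standard library. The function assumes every URL follows the /people_groups/{id}/{country} shape described above and raises otherwise; `split_people_group_url` is an illustrative name.

```python
from urllib.parse import urlparse

def split_people_group_url(url):
    """Return (group_id, country_code) from a /people_groups/{id}/{country} URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "people_groups":
        raise ValueError(f"unexpected URL shape: {url!r}")
    return parts[1], parts[2]

print(split_people_group_url("https://joshuaproject.net/people_groups/10208/NG"))
# ('10208', 'NG')
```

The id is kept as a string so that it can serve as a join key without any numeric reformatting.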