joshua project joshua project unreached

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_unreached.parquet

Saturn profiled 7,124 rows across 109 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_unreached.parquet",
    "--findings", "joshua-project-joshua_project_unreached.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset is a Joshua Project catalogue of 7,124 unreached people groups described across 109 fields covering geography, language, religion, population, and outreach status. Every row is flagged as 'Unreached' (JPScaleText is constant) and 'LeastReached' is uniformly Y, so the analytical interest sits in the breakdown by region, religion, and population rather than in reach status itself. The data is heavily skewed toward Asia (5,351 of 7,124) and especially South Asian Peoples (3,681), with India alone accounting for 2,032 groups; Islam (3,279) and Hinduism (2,142) dominate PrimaryReligion. Population is extremely long-tailed (median 30,000 vs. max 135.5M, skew ~21), so any size-based analysis should use log scales or medians rather than means. Worth a closer look first: the Continent/Region/Country concentration, the religion mix, and the population distribution. Together these three explain most of the dataset's shape.

citing: row_count · column_count · Continent.top_values · AffinityBloc.top_values · Ctry.top_values · PrimaryReligion.top_values · Population.stats · RegionName.top_values · Frontier.top_values · JPScaleText.top_values · LeastReached.top_values · PrimaryLanguageName.top_values
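The advice to prefer medians and log scales for Population can be illustrated with a minimal sketch. The `populations` list below is hypothetical (chosen only to mimic the long tail, with the reported median and maximum as anchors) and is not drawn from the dataset's rows.

```python
import math
import statistics

# Hypothetical values: a few small groups plus one very large one.
# 30,000 and 135,500,000 echo the reported median and max; the rest are made up.
populations = [500, 4_000, 30_000, 30_000, 250_000, 135_500_000]

# The median is robust to the single huge value; the mean is dominated by it.
median_pop = statistics.median(populations)
mean_pop = statistics.fmean(populations)

# log10 compresses the right tail so small and large groups fit on one axis.
log_pops = [math.log10(p) for p in populations if p > 0]

print(f"median={median_pop:,.0f} mean={mean_pop:,.0f}")
print([round(x, 2) for x in log_pops])
```

On these values the mean lands several hundred times above the median, which is the distortion a log axis or a median-based summary avoids.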

Out[4]:

saturn.schema() · 109 columns

column kind n null% unique alerts
PeopleID3ROG3 text 7,124 0.0% 7,124 near_unique one_word allcaps short_text
ROG3 categorical 7,124 0.0% 202
PeopleID3 numeric 7,124 0.0% 4,614
ROP3 numeric 7,124 0.1% 4,608
PeopNameInCountry text 7,124 0.0% 4,722 one_word duplicates
ROG2 categorical 7,124 0.0% 7
Continent categorical 7,124 0.0% 7
RegionName categorical 7,124 0.0% 12
ISO3 categorical 7,124 0.0% 202
LocationInCountry text 7,124 64.2% 2,176 multilingual null_rate
PeopleID1 numeric 7,124 0.0% 16 outliers
ROP1 categorical 7,124 0.0% 16
AffinityBloc categorical 7,124 0.0% 16
PeopleID2 numeric 7,124 0.0% 205
ROP2 categorical 7,124 0.0% 155
PeopleCluster categorical 7,124 0.0% 205
PeopNameAcrossCountries text 7,124 0.0% 4,604 one_word duplicates
Population numeric 7,124 0.2% 1,200 high_skew outliers
Category categorical 7,124 0.0% 3
ROL3 text 7,124 0.0% 1,565 one_word short_text duplicates
PrimaryLanguageName text 7,124 0.0% 1,563 one_word short_text duplicates
PrimaryLanguageDialect categorical 7,124 94.5% 303 long_tail null_rate
NumberLanguagesSpoken numeric 7,124 0.0% 69 high_skew outliers
OfficialLang categorical 7,124 0.1% 79
SpeakNationalLang unknown 7,124 0.0% skipped
BibleStatus numeric 7,124 0.0% 6 outliers
BibleYear categorical 7,124 45.8% 163 null_rate
NTYear categorical 7,124 24.6% 305 null_rate
PortionsYear categorical 7,124 13.5% 460
TranslationNeedQuestionable unknown 7,124 0.0% skipped
JPScale numeric 7,124 0.0% 1 constant
JPScalePC categorical 7,124 0.0% 5
JPScalePGAC categorical 7,124 0.0% 5 imbalance
LeastReached categorical 7,124 0.0% 1 imbalance
LeastReachedPC categorical 7,124 0.0% 2
LeastReachedPGAC categorical 7,124 0.0% 2 imbalance
GSEC categorical 7,124 0.0% 8
HasAudioRecordings categorical 7,124 0.0% 2
NTOnline categorical 7,124 22.4% 1 null_rate imbalance
RLG3 numeric 7,124 0.0% 7 outliers
RLG3PC numeric 7,124 0.0% 8 outliers
RLG3PGAC numeric 7,124 0.0% 8 outliers
PrimaryReligion categorical 7,124 0.0% 7
PrimaryReligionPC categorical 7,124 0.0% 8
PrimaryReligionPGAC categorical 7,124 0.0% 8
RLG4 numeric 7,124 92.4% 18 null_rate outliers
ReligionSubdivision categorical 7,124 92.4% 18 null_rate
PCIslam numeric 7,124 0.1% 902
PCNonReligious numeric 7,124 0.3% 152 high_skew outliers
PCUnknown numeric 7,124 0.4% 388 high_skew outliers
SecurityLevel numeric 7,124 0.0% 3 outliers
LRTop100 categorical 7,124 0.0% 2 imbalance
PhotoAddress text 7,124 0.0% 2,880 one_word short_text duplicates
PhotoCredits categorical 7,124 0.1% 851 long_tail
PhotoCreditURL categorical 7,124 36.0% 774 long_tail null_rate
PhotoCreativeCommons categorical 7,124 0.1% 2
PhotoCopyright categorical 7,124 0.2% 2
PhotoPermission categorical 7,124 0.2% 3
ProfileTextExists categorical 7,124 0.0% 2 imbalance
CountOfCountries numeric 7,124 0.0% 39 high_skew outliers
CountOfProvinces unknown 7,124 0.0% skipped
Longitude numeric 7,124 0.0% 6,713
Latitude numeric 7,124 0.0% 6,696
Ctry categorical 7,124 0.0% 202
IndigenousCode categorical 7,124 0.0% 2
PercentAdherents categorical 7,124 0.0% 692 long_tail
PercentChristianPC categorical 7,124 0.0% 184
NaturalName text 7,124 0.0% 4,705 one_word duplicates
NaturalPronunciation text 7,124 48.5% 1,489 one_word null_rate duplicates
PercentChristianPGAC categorical 7,124 0.1% 842
PercentEvangelical categorical 7,124 10.4% 401 long_tail
PercentEvangelicalPC categorical 7,124 2.1% 166
PercentEvangelicalPGAC categorical 7,124 6.3% 548
PCBuddhism numeric 7,124 0.3% 809 high_skew outliers
PCEthnicReligions numeric 7,124 0.3% 351 high_skew outliers
PCHinduism numeric 7,124 0.3% 1,131
PCOtherSmall numeric 7,124 0.3% 670 high_skew outliers
RegionCode numeric 7,124 0.0% 12 outliers
PopulationPGAC numeric 7,124 0.1% 1,509 high_skew outliers
Frontier categorical 7,124 0.0% 2
MapAddress text 7,124 0.0% 4,616 one_word short_text duplicates
HasJesusFilm categorical 7,124 0.0% 2
Nomadic categorical 7,124 0.0% 2 imbalance
NomadicTypeDescription categorical 7,124 96.6% 6 null_rate
PhotoCCVersionText categorical 7,124 0.0% 16
PhotoCCVersionURL categorical 7,124 0.0% 16
MapCredits categorical 7,124 0.0% 161 long_tail
MapCreditURL categorical 7,124 0.0% 31 long_tail imbalance
MapCopyright categorical 7,124 0.0% 3
MapCCVersionText categorical 7,124 0.0% 4 imbalance
MapCCVersionURL categorical 7,124 0.0% 4 imbalance
JF categorical 7,124 0.0% 2
AudioRecordings categorical 7,124 0.0% 2
Window1040 categorical 7,124 0.0% 2
PeopleGroupMapURL text 7,124 0.0% 4,616 one_word url_heavy duplicates
PeopleGroupMapExpandedURL text 7,124 0.0% 4,331 one_word url_heavy duplicates
PeopleGroupURL text 7,124 0.0% 7,124 near_unique one_word url_heavy
PeopleGroupPhotoURL text 7,124 0.0% 2,880 one_word url_heavy duplicates
CountryURL categorical 7,124 0.0% 202
JPScaleText categorical 7,124 0.0% 1 imbalance
JPScaleImageURL categorical 7,124 0.0% 1 imbalance
Summary text 7,124 0.0% 3,685 one_word duplicates
Obstacles text 7,124 0.0% 3,641 one_word duplicates
HowReach text 7,124 0.0% 2,853 one_word duplicates
PrayForChurch text 7,124 0.0% 1,713 one_word duplicates
PrayForPG text 7,124 0.0% 3,441 one_word duplicates
Resources unknown 7,124 0.0% skipped
country_data unknown 7,124 0.0% skipped
language_data unknown 7,124 0.0% skipped
Fig 1.
Continent · Shows how heavily the dataset concentrates in Asia (~75%) versus other continents.
Top values for Continent (7 unique shown, of 7 total).
value count share
Asia 5351 75.1%
Africa 986 13.8%
Europe 431 6.0%
North America 175 2.5%
South America 106 1.5%
Australia 39 0.5%
Oceania 36 0.5%
Fig 2.
PrimaryReligion · Highlights that Islam and Hinduism together account for the majority of unreached groups.
Top values for PrimaryReligion (7 unique shown, of 7 total).
value count share
Islam 3279 46.0%
Hinduism 2142 30.1%
Ethnic Religions 933 13.1%
Buddhism 480 6.7%
Unknown 157 2.2%
Other / Small 120 1.7%
Non-Religious 13 0.2%
Fig 3.
RegionName · Breaks the Asia bulk into sub-regions, with Asia South alone covering nearly half the rows.
Top values for RegionName (12 unique shown, of 12 total).
value count share
Asia, South 3349 47.0%
Asia, Southeast 726 10.2%
Asia, Northeast 521 7.3%
Africa, West and Central 460 6.5%
Africa, North and Middle East 444 6.2%
Africa, East and Southern 373 5.2%
Asia, Central 352 4.9%
Europe, Western 320 4.5%
Europe, Eastern and Eurasia 223 3.1%
America, North and Caribbean 160 2.2%
America, Latin 121 1.7%
Australia and Pacific 75 1.1%
Fig 4.
Population · Reveals an extreme right skew: most groups are small (median 30k) but a few exceed 100M; consider a log scale.
Histogram bins for Population (median: 30000.0).
bin count
10 – 3.388e+06 6930
3.388e+06 – 6.777e+06 78
6.777e+06 – 1.016e+07 36
1.016e+07 – 1.355e+07 17
1.355e+07 – 1.694e+07 11
1.694e+07 – 2.033e+07 8
2.033e+07 – 2.372e+07 7
2.372e+07 – 2.711e+07 3
2.711e+07 – 3.049e+07 2
3.049e+07 – 3.388e+07 1
3.388e+07 – 3.727e+07 2
3.727e+07 – 4.066e+07 4
4.066e+07 – 4.405e+07 0
4.405e+07 – 4.744e+07 3
4.744e+07 – 5.082e+07 1
5.082e+07 – 5.421e+07 0
5.421e+07 – 5.76e+07 1
5.76e+07 – 6.099e+07 0
6.099e+07 – 6.438e+07 1
6.438e+07 – 6.777e+07 1
6.777e+07 – 7.115e+07 0
7.115e+07 – 7.454e+07 0
7.454e+07 – 7.793e+07 0
7.793e+07 – 8.132e+07 0
8.132e+07 – 8.471e+07 0
8.471e+07 – 8.81e+07 0
8.81e+07 – 9.148e+07 0
9.148e+07 – 9.487e+07 0
9.487e+07 – 9.826e+07 0
9.826e+07 – 1.016e+08 1
1.016e+08 – 1.05e+08 0
1.05e+08 – 1.084e+08 0
1.084e+08 – 1.118e+08 0
1.118e+08 – 1.152e+08 0
1.152e+08 – 1.186e+08 1
1.186e+08 – 1.22e+08 0
1.22e+08 – 1.254e+08 0
1.254e+08 – 1.288e+08 0
1.288e+08 – 1.321e+08 0
1.321e+08 – 1.355e+08 1
Fig 5.
AffinityBloc · Confirms the South Asian Peoples bloc dominates and shows the next-largest cultural clusters.
Top values for AffinityBloc (16 unique shown, of 16 total).
value count share
South Asian Peoples 3681 51.7%
Sub-Saharan Peoples 632 8.9%
Arab World 475 6.7%
Southeast Asian Peoples 451 6.3%
Malay Peoples 339 4.8%
Tibetan-Himalayan Peoples 287 4.0%
Turkic Peoples 269 3.8%
Persian-Median 225 3.2%
Eurasian Peoples 166 2.3%
Deaf 151 2.1%
East Asian Peoples 140 2.0%
Jewish 128 1.8%
Horn of Africa Peoples 83 1.2%
Latin-Caribbean Americans 81 1.1%
Pacific Islanders 13 0.2%
North American Peoples 3 0.0%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column kind null %
PeopleID3ROG3 text 0.0%
ROG3 categorical 0.0%
PeopleID3 numeric 0.0%
ROP3 numeric 0.1%
PeopNameInCountry text 0.0%
ROG2 categorical 0.0%
Continent categorical 0.0%
RegionName categorical 0.0%
ISO3 categorical 0.0%
LocationInCountry text 64.2%
PeopleID1 numeric 0.0%
ROP1 categorical 0.0%
AffinityBloc categorical 0.0%
PeopleID2 numeric 0.0%
ROP2 categorical 0.0%
PeopleCluster categorical 0.0%
PeopNameAcrossCountries text 0.0%
Population numeric 0.2%
Category categorical 0.0%
ROL3 text 0.0%
PrimaryLanguageName text 0.0%
PrimaryLanguageDialect categorical 94.5%
NumberLanguagesSpoken numeric 0.0%
OfficialLang categorical 0.1%
SpeakNationalLang unknown 0.0%
BibleStatus numeric 0.0%
BibleYear categorical 45.8%
NTYear categorical 24.6%
PortionsYear categorical 13.5%
TranslationNeedQuestionable unknown 0.0%
JPScale numeric 0.0%
JPScalePC categorical 0.0%
JPScalePGAC categorical 0.0%
LeastReached categorical 0.0%
LeastReachedPC categorical 0.0%
LeastReachedPGAC categorical 0.0%
GSEC categorical 0.0%
HasAudioRecordings categorical 0.0%
NTOnline categorical 22.4%
RLG3 numeric 0.0%
RLG3PC numeric 0.0%
RLG3PGAC numeric 0.0%
PrimaryReligion categorical 0.0%
PrimaryReligionPC categorical 0.0%
PrimaryReligionPGAC categorical 0.0%
RLG4 numeric 92.4%
ReligionSubdivision categorical 92.4%
PCIslam numeric 0.1%
PCNonReligious numeric 0.3%
PCUnknown numeric 0.4%
SecurityLevel numeric 0.0%
LRTop100 categorical 0.0%
PhotoAddress text 0.0%
PhotoCredits categorical 0.1%
PhotoCreditURL categorical 36.0%
PhotoCreativeCommons categorical 0.1%
PhotoCopyright categorical 0.2%
PhotoPermission categorical 0.2%
ProfileTextExists categorical 0.0%
CountOfCountries numeric 0.0%
CountOfProvinces unknown 0.0%
Longitude numeric 0.0%
Latitude numeric 0.0%
Ctry categorical 0.0%
IndigenousCode categorical 0.0%
PercentAdherents categorical 0.0%
PercentChristianPC categorical 0.0%
NaturalName text 0.0%
NaturalPronunciation text 48.5%
PercentChristianPGAC categorical 0.1%
PercentEvangelical categorical 10.4%
PercentEvangelicalPC categorical 2.1%
PercentEvangelicalPGAC categorical 6.3%
PCBuddhism numeric 0.3%
PCEthnicReligions numeric 0.3%
PCHinduism numeric 0.3%
PCOtherSmall numeric 0.3%
RegionCode numeric 0.0%
PopulationPGAC numeric 0.1%
Frontier categorical 0.0%
MapAddress text 0.0%
HasJesusFilm categorical 0.0%
Nomadic categorical 0.0%
NomadicTypeDescription categorical 96.6%
PhotoCCVersionText categorical 0.0%
PhotoCCVersionURL categorical 0.0%
MapCredits categorical 0.0%
MapCreditURL categorical 0.0%
MapCopyright categorical 0.0%
MapCCVersionText categorical 0.0%
MapCCVersionURL categorical 0.0%
JF categorical 0.0%
AudioRecordings categorical 0.0%
Window1040 categorical 0.0%
PeopleGroupMapURL text 0.0%
PeopleGroupMapExpandedURL text 0.0%
PeopleGroupURL text 0.0%
PeopleGroupPhotoURL text 0.0%
CountryURL categorical 0.0%
JPScaleText categorical 0.0%
JPScaleImageURL categorical 0.0%
Summary text 0.0%
Obstacles text 0.0%
HowReach text 0.0%
PrayForChurch text 0.0%
PrayForPG text 0.0%
Resources unknown 0.0%
country_data unknown 0.0%
language_data unknown 0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 11,918 detected strings).
lang count share
en 11903 99.9%
de 4 0.0%
pt 2 0.0%
it 2 0.0%
eo 2 0.0%
ceb 1 0.0%
ilo 1 0.0%
min 1 0.0%
id 1 0.0%
es 1 0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
column PeopleID3 ROP3 PeopleID1 PeopleID2 Population NumberLanguagesSpoken BibleStatus JPScale RLG3 RLG3PC RLG3PGAC RLG4
PeopleID3 +1.00 +0.86 +0.29 +0.45 -0.02 +0.15 +0.10 nan +0.03 +0.04 +0.02 +0.12
ROP3 +0.86 +1.00 +0.33 +0.47 -0.01 +0.16 +0.13 nan +0.03 +0.08 +0.03 +0.06
PeopleID1 +0.29 +0.33 +1.00 +0.55 -0.13 +0.17 -0.00 nan +0.08 +0.22 +0.10 -0.29
PeopleID2 +0.45 +0.47 +0.55 +1.00 -0.06 +0.39 +0.40 nan +0.06 +0.20 +0.07 -0.14
Population -0.02 -0.01 -0.13 -0.06 +1.00 +0.01 -0.01 nan +0.01 -0.01 +0.01 +0.01
NumberLanguagesSpoken +0.15 +0.16 +0.17 +0.39 +0.01 +1.00 +0.24 nan -0.00 +0.06 -0.00 -0.06
BibleStatus +0.10 +0.13 -0.00 +0.40 -0.01 +0.24 +1.00 nan -0.07 +0.01 -0.07 -0.02
JPScale nan nan nan nan nan nan nan nan nan nan nan nan
RLG3 +0.03 +0.03 +0.08 +0.06 +0.01 -0.00 -0.07 nan +1.00 +0.68 +0.99 +0.17
RLG3PC +0.04 +0.08 +0.22 +0.20 -0.01 +0.06 +0.01 nan +0.68 +1.00 +0.68 +0.14
RLG3PGAC +0.02 +0.03 +0.10 +0.07 +0.01 -0.00 -0.07 nan +0.99 +0.68 +1.00 +0.16
RLG4 +0.12 +0.06 -0.29 -0.14 +0.01 -0.06 -0.02 nan +0.17 +0.14 +0.16 +1.00

PeopleID3ROG3 text identifier

PeopleID3ROG3 is almost certainly a row-level identifier for a people group within a country rather than for an individual person: every one of the 7,124 rows holds a unique 7-character, all-caps, single-token code (n_unique equals n, len_min=len_max=7, allcaps_rate=1.0, one_word_rate=1.0). Sample values like '10208ng' and '10375su' suggest a 5-digit numeric prefix followed by a 2-letter suffix, which matches the shape of a PeopleID3 value (10120 to 22661) joined to a two-letter ROG3 country code. There are no nulls, duplicates, or empties, so the key looks clean.

Treatment: Use as a primary key for joins; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high
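Before relying on a column as a primary key it is worth checking uniqueness directly. The sketch below is a minimal, stdlib-only check; the `rows` records and the `duplicate_keys` helper are hypothetical, and it assumes (as the column name suggests, though the report does not confirm it) that the key is the pair PeopleID3 + ROG3.

```python
from collections import Counter

# Hypothetical records; the values are illustrative, not taken from the dataset.
rows = [
    {"PeopleID3": 10208, "ROG3": "NG"},
    {"PeopleID3": 10375, "ROG3": "SU"},
    {"PeopleID3": 10375, "ROG3": "IN"},
]

def duplicate_keys(rows, key_fields):
    """Return the key tuples that occur more than once across rows."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# The composite key is unique here, while PeopleID3 alone repeats.
print(duplicate_keys(rows, ["PeopleID3", "ROG3"]))
print(duplicate_keys(rows, ["PeopleID3"]))
```

An empty result for the composite key is the condition a join on that key depends on; a non-empty result lists the offending keys.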
Out[14]:

saturn.columns["PeopleID3ROG3"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 7,124
len_min 7
len_max 7
len_mean 7
len_median 7
len_p95 7
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 7,124
readability_flesch_mean 112.3
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
Fig 9.
Character-length distribution for PeopleID3ROG3 (mean: 7.0).
chars count
7 7124
All other bins across the plotted 6 – 8 range are empty.

ROG3 categorical feature

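ROG3 holds two-letter country codes across 7,124 rows with 202 distinct values and no nulls. India (IN) dominates at 28.5% of records (2,032), followed by PK (767) and CH (442). The top 10 codes account for about 62% of rows (4,430), while the remaining 192 codes share the rest, giving an entropy ratio of 0.66. The codes do not appear to be ISO 3166-1 alpha-2: CH (442) and SU (168) have the same counts as CHN and SDN in the ISO3 column, so use ISO3 when joining to ISO-keyed reference data.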

Treatment: Group rare codes into an 'other' bucket before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
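The "group rare codes" treatment can be sketched as follows. The `lump_rare` helper, the `min_count` threshold, the `OTHER` label, and the toy `codes` list are all assumptions made for illustration; the report does not prescribe a threshold.

```python
from collections import Counter

def lump_rare(values, min_count, other_label="OTHER"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy input: three frequent codes and two singletons (XX, YY are made up).
codes = ["IN"] * 5 + ["PK"] * 3 + ["CH"] * 2 + ["XX", "YY"]
lumped = lump_rare(codes, min_count=2)

print(Counter(lumped))
```

Lumping before one-hot encoding keeps the feature width bounded by the number of frequent codes plus one, instead of one column per rare country.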
Out[17]:

saturn.columns["ROG3"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 202
top_value IN
top_rate 0.2852
cardinality 202
entropy 5.058
entropy_ratio 0.6605
Fig 10.
Top values for ROG3.
Top values for ROG3 (20 unique shown, of 202 total).
value count share
IN 2032 28.5%
PK 767 10.8%
CH 442 6.2%
BG 256 3.6%
ID 234 3.3%
NP 184 2.6%
SU 168 2.4%
LA 142 2.0%
RS 115 1.6%
US 90 1.3%
IR 85 1.2%
CD 81 1.1%
MY 78 1.1%
TH 73 1.0%
VM 69 1.0%
TU 61 0.9%
BM 59 0.8%
AF 58 0.8%
CE 55 0.8%
CA 52 0.7%

PeopleID3 numeric foreign_key

PeopleID3 is an integer key spanning 10120 to 22661 with 4,614 unique values across 7,124 rows, suggesting a people-group identifier that recurs (about 1.5 rows per id on average). The distribution is mildly left-skewed (-0.23) and platykurtic (-0.95) with no nulls, no zeros, and no outliers, consistent with a dense allocated id range rather than a measured quantity.

Treatment: Treat as a join key; do not use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
Out[20]:

saturn.columns["PeopleID3"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 4,614
min 10,120
max 22,661
mean 1.693e+04
median 1.736e+04
std 3431
q1 1.433e+04
q3 1.958e+04
iqr 5251
skew -0.2255
kurtosis -0.9528
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 11.
Distribution of PeopleID3. Vertical dash marks the median.
Histogram bins for PeopleID3 (median: 17356.5).
bin count
1.012e+04 – 1.043e+04 131
1.043e+04 – 1.075e+04 123
1.075e+04 – 1.106e+04 167
1.106e+04 – 1.137e+04 106
1.137e+04 – 1.169e+04 123
1.169e+04 – 1.2e+04 108
1.2e+04 – 1.231e+04 168
1.231e+04 – 1.263e+04 196
1.263e+04 – 1.294e+04 140
1.294e+04 – 1.326e+04 128
1.326e+04 – 1.357e+04 131
1.357e+04 – 1.388e+04 93
1.388e+04 – 1.42e+04 71
1.42e+04 – 1.451e+04 212
1.451e+04 – 1.482e+04 96
1.482e+04 – 1.514e+04 178
1.514e+04 – 1.545e+04 223
1.545e+04 – 1.576e+04 170
1.576e+04 – 1.608e+04 53
1.608e+04 – 1.639e+04 233
1.639e+04 – 1.67e+04 231
1.67e+04 – 1.702e+04 221
1.702e+04 – 1.733e+04 238
1.733e+04 – 1.764e+04 314
1.764e+04 – 1.796e+04 274
1.796e+04 – 1.827e+04 255
1.827e+04 – 1.859e+04 272
1.859e+04 – 1.89e+04 236
1.89e+04 – 1.921e+04 278
1.921e+04 – 1.953e+04 164
1.953e+04 – 1.984e+04 160
1.984e+04 – 2.015e+04 174
2.015e+04 – 2.047e+04 205
2.047e+04 – 2.078e+04 92
2.078e+04 – 2.109e+04 106
2.109e+04 – 2.141e+04 254
2.141e+04 – 2.172e+04 189
2.172e+04 – 2.203e+04 115
2.203e+04 – 2.235e+04 295
2.235e+04 – 2.266e+04 201

ROP3 numeric identifier

ROP3 is a numeric column with 4,608 unique values across 7,124 rows, ranging tightly from 100005 to 119619 with a mean of 111443.68 and median of 112533. The narrow ~19k span sitting well above zero, combined with integer-looking bounds, suggests a coded identifier or sequence number rather than a measured quantity. Mild left skew (-0.47) and no outliers indicate a fairly even spread within that band, and nulls are negligible at 7 rows (0.1%).

Treatment: Treat as a categorical code or key; do not feed raw into numeric models.

anthropic:claude-opus-4-7 · confidence medium
Out[23]:

saturn.columns["ROP3"].stats

stat value
n 7,124
nulls 7 (0.1%)
unique 4,608
min 100,005
max 119,619
mean 1.114e+05
median 112,533
std 5269
q1 107,901
q3 115,240
iqr 7,339
skew -0.4712
kurtosis -0.7273
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 12.
Distribution of ROP3. Vertical dash marks the median.
Histogram bins for ROP3 (median: 112533.0).
bin count
1e+05 – 1.005e+05 133
1.005e+05 – 1.01e+05 97
1.01e+05 – 1.015e+05 107
1.015e+05 – 1.02e+05 139
1.02e+05 – 1.025e+05 94
1.025e+05 – 1.029e+05 89
1.029e+05 – 1.034e+05 86
1.034e+05 – 1.039e+05 153
1.039e+05 – 1.044e+05 155
1.044e+05 – 1.049e+05 105
1.049e+05 – 1.054e+05 89
1.054e+05 – 1.059e+05 137
1.059e+05 – 1.064e+05 141
1.064e+05 – 1.069e+05 102
1.069e+05 – 1.074e+05 66
1.074e+05 – 1.079e+05 79
1.079e+05 – 1.083e+05 162
1.083e+05 – 1.088e+05 100
1.088e+05 – 1.093e+05 83
1.093e+05 – 1.098e+05 263
1.098e+05 – 1.103e+05 152
1.103e+05 – 1.108e+05 111
1.108e+05 – 1.113e+05 127
1.113e+05 – 1.118e+05 342
1.118e+05 – 1.123e+05 277
1.123e+05 – 1.128e+05 300
1.128e+05 – 1.132e+05 418
1.132e+05 – 1.137e+05 364
1.137e+05 – 1.142e+05 340
1.142e+05 – 1.147e+05 169
1.147e+05 – 1.152e+05 324
1.152e+05 – 1.157e+05 236
1.157e+05 – 1.162e+05 360
1.162e+05 – 1.167e+05 71
1.167e+05 – 1.172e+05 72
1.172e+05 – 1.177e+05 39
1.177e+05 – 1.181e+05 263
1.181e+05 – 1.186e+05 332
1.186e+05 – 1.191e+05 67
1.191e+05 – 1.196e+05 373

PeopNameInCountry text label

This column names a people group as it appears within a given country (e.g., 'Turk', 'Persian', 'Arab, Moroccan') in the Joshua Project registry. Values are short (len_mean 12.5, word_mean 1.78) with 4,722 uniques across 7,124 rows, and 33.7% are duplicates because the same people group recurs across countries; 'Deaf' alone appears 151 times. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' in top_words show that religion-tagged variants are baked into the label.

Treatment: Treat as a categorical label; pair with country to form a unique key, and consider stripping parenthetical religion tags for cleaner grouping.

anthropic:claude-opus-4-7 · confidence high
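Stripping the parenthetical religion tags could look like the sketch below. The `strip_trailing_tag` helper is hypothetical, and 'Example Group' is a placeholder name rather than a value from the dataset; only the '(Hindu traditions)' style of qualifier comes from the report. It assumes the tag is always a single trailing parenthetical.

```python
import re

# Matches one trailing "(...)" group plus any whitespace around it.
_TRAILING_TAG = re.compile(r"\s*\([^)]*\)\s*$")

def strip_trailing_tag(name):
    """Remove a single trailing parenthetical qualifier from a label."""
    return _TRAILING_TAG.sub("", name).strip()

print(strip_trailing_tag("Example Group (Hindu traditions)"))
print(strip_trailing_tag("Arab, Moroccan"))
```

Keeping the original label in a separate column is advisable, since the tag may distinguish groups that would otherwise collapse into one after stripping.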
Out[26]:

saturn.columns["PeopNameInCountry"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 4,722
len_min 1
len_max 39
len_mean 12.5
len_median 11
len_p95 27
word_mean 1.784
word_median 2
n_empty 0
n_duplicates 2,402
duplicate_rate 0.3372
vocab_size 4,602
readability_flesch_mean 56.38
emoji_rate 0
url_rate 0
one_word_rate 0.4575
allcaps_rate 0
boilerplate_rate 0
alert: one_word 45.7% rows are a single word
alert: duplicates 33.7% duplicate strings
Fig 13.
Character-length distribution for PeopNameInCountry (mean: 12.5).
chars count
1 – 2 1
2 – 3 14
3 – 4 101
4 – 5 557
5 – 6 661
6 – 7 692
7 – 8 665
8 – 9 370
9 – 10 202
10 – 10 210
10 – 11 277
11 – 12 290
12 – 13 329
13 – 14 434
14 – 15 330
15 – 16 282
16 – 17 175
17 – 18 118
18 – 19 67
19 – 20 0
20 – 21 87
21 – 22 56
22 – 23 65
23 – 24 103
24 – 25 195
25 – 26 209
26 – 27 192
27 – 28 111
28 – 29 69
29 – 30 57
30 – 30 37
30 – 31 46
31 – 32 45
32 – 33 33
33 – 34 22
34 – 35 18
35 – 36 1
36 – 37 0
37 – 38 1
38 – 39 2

ROG2 categorical feature

ROG2 is a low-cardinality categorical with 7 region codes (ASI, AFR, EUR, NAR, LAM, AUS, SOP) and no nulls across 7,124 rows, consistent with a continental/region-of-origin grouping. The distribution is highly imbalanced: ASI accounts for 75.1% of records while AUS and SOP together contribute fewer than 80 rows, yielding an entropy ratio of just 0.45. Any model conditioned on this field will be dominated by the ASI bucket.

Treatment: One-hot encode and consider grouping AUS/SOP/LAM into an 'other' bucket given the severe imbalance.

anthropic:claude-opus-4-7 · confidence high
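The report does not define `entropy` or `entropy_ratio`, but the published figures are consistent with Shannon entropy in bits divided by log2 of the cardinality. The sketch below works that assumption through using the ROG2 counts from Fig 14; the `entropy_ratio` function is a hypothetical helper, not saturn's implementation.

```python
import math

def entropy_ratio(counts):
    """Return (Shannon entropy in bits, entropy / log2(number of categories))."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    if len(probs) < 2:
        # A constant column has zero entropy and no meaningful ratio.
        return entropy, 0.0
    return entropy, entropy / math.log2(len(probs))

# ROG2 counts as listed in Fig 14: ASI, AFR, EUR, NAR, LAM, AUS, SOP.
rog2_counts = [5351, 986, 431, 175, 106, 39, 36]
entropy, ratio = entropy_ratio(rog2_counts)

print(round(entropy, 3), round(ratio, 4))
```

A ratio of 1.0 would mean all seven codes are equally common; values well below 1.0, as here, indicate that one or two codes carry most of the rows.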
Out[29]:

saturn.columns["ROG2"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 7
top_value ASI
top_rate 0.7511
cardinality 7
entropy 1.251
entropy_ratio 0.4457
Fig 14.
Top values for ROG2.
Top values for ROG2 (7 unique shown, of 7 total).
value count share
ASI 5351 75.1%
AFR 986 13.8%
EUR 431 6.0%
NAR 175 2.5%
LAM 106 1.5%
AUS 39 0.5%
SOP 36 0.5%

Continent categorical feature

Continent is a low-cardinality geographic categorical with 7 distinct values and no nulls across 7,124 rows. The distribution is heavily concentrated: Asia alone accounts for 75.1% of records, with Africa a distant second at 986. Notably, both 'Australia' (39) and 'Oceania' (36) appear as separate categories, which is a labeling inconsistency since Australia is part of Oceania.

Treatment: Reconcile Australia/Oceania into a single category, then one-hot encode.

anthropic:claude-opus-4-7 · confidence high
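One way to reconcile the two labels is a small mapping applied before encoding. In this sketch the choice of 'Oceania' as the merged label and the `CONTINENT_MERGE` / `reconcile_continent` names are assumptions; the counts are those listed for Continent in the report.

```python
# Assumed merge rule: fold 'Australia' into 'Oceania'.
CONTINENT_MERGE = {"Australia": "Oceania"}

def reconcile_continent(value):
    """Map a Continent label to its merged label, leaving others unchanged."""
    return CONTINENT_MERGE.get(value, value)

# Counts as reported for the Continent column.
continent_counts = {
    "Asia": 5351, "Africa": 986, "Europe": 431, "North America": 175,
    "South America": 106, "Australia": 39, "Oceania": 36,
}

merged_counts = {}
for label, n in continent_counts.items():
    target = reconcile_continent(label)
    merged_counts[target] = merged_counts.get(target, 0) + n

print(merged_counts)
```

The merged bucket holds 75 rows, the same total RegionName reports for 'Australia and Pacific'.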
Out[32]:

saturn.columns["Continent"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 7
top_value Asia
top_rate 0.7511
cardinality 7
entropy 1.251
entropy_ratio 0.4457
Fig 15.
Top values for Continent.
Top values for Continent (7 unique shown, of 7 total).
value count share
Asia 5351 75.1%
Africa 986 13.8%
Europe 431 6.0%
North America 175 2.5%
South America 106 1.5%
Australia 39 0.5%
Oceania 36 0.5%

RegionName categorical feature

RegionName is a categorical geographic grouping with 12 distinct regions and no nulls across 7,124 rows. The distribution is heavily concentrated: 'Asia, South' alone accounts for 47.0% of records, followed distantly by 'Asia, Southeast' at 726 and 'Asia, Northeast' at 521, leaving the Americas and Europe sparsely represented. Entropy ratio of 0.76 confirms meaningful but uneven coverage across the 12 buckets.

Treatment: One-hot encode and consider grouping rare regions, given the dominance of 'Asia, South'.

anthropic:claude-opus-4-7 · confidence high
Out[35]:

saturn.columns["RegionName"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 12
top_value Asia, South
top_rate 0.4701
cardinality 12
entropy 2.715
entropy_ratio 0.7574
Fig 16.
Top values for RegionName.
Top values for RegionName (12 unique shown, of 12 total).
value count share
Asia, South 3349 47.0%
Asia, Southeast 726 10.2%
Asia, Northeast 521 7.3%
Africa, West and Central 460 6.5%
Africa, North and Middle East 444 6.2%
Africa, East and Southern 373 5.2%
Asia, Central 352 4.9%
Europe, Western 320 4.5%
Europe, Eastern and Eurasia 223 3.1%
America, North and Caribbean 160 2.2%
America, Latin 121 1.7%
Australia and Pacific 75 1.1%

ISO3 categorical foreign_key

ISO3 looks like a country code in standard 3-letter ISO 3166-1 alpha-3 format, with 202 distinct values across 7,124 rows and zero nulls. The distribution is heavily concentrated on India (IND) at 28.5% of rows (2,032), followed by PAK (767) and CHN (442), so South and East Asia dominate. Entropy ratio of 0.66 confirms the imbalance is material rather than uniform across countries.

Treatment: Use as a join key to country reference tables; consider grouping long-tail codes or stratifying by ISO3 to control for the IND-heavy skew.

anthropic:claude-opus-4-7 · confidence high
Out[38]:

saturn.columns["ISO3"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 202
top_value IND
top_rate 0.2852
cardinality 202
entropy 5.058
entropy_ratio 0.6605
Fig 17.
Top values for ISO3.
Top values for ISO3 (20 unique shown, of 202 total).
value count share
IND 2032 28.5%
PAK 767 10.8%
CHN 442 6.2%
BGD 256 3.6%
IDN 234 3.3%
NPL 184 2.6%
SDN 168 2.4%
LAO 142 2.0%
RUS 115 1.6%
USA 90 1.3%
IRN 85 1.2%
TCD 81 1.1%
MYS 78 1.1%
THA 73 1.0%
VNM 69 1.0%
TUR 61 0.9%
MMR 59 0.8%
AFG 58 0.8%
LKA 55 0.8%
CAN 52 0.7%

LocationInCountry text free_text

Free-text geographic descriptions of where a people group lives within a country, ranging from terse tags like "Widespread." (56 occurrences) to multi-sentence paragraphs up to 939 characters. 64.23% of rows are null and 14.6% of the non-null values are duplicates, so usable signal is concentrated in roughly a third of the dataset. Content is overwhelmingly English (1714 of 1729 detected) with a long tail of place names producing a 10,936-token vocabulary across 2,176 unique strings.

Treatment: Normalize boilerplate phrases like "Widespread" into a categorical flag, then tokenize and embed the residual prose before modelling.

anthropic:claude-opus-4-7 · confidence high
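The first step of that treatment, separating boilerplate such as "Widespread." from genuine prose, might be sketched like this. The `split_location` helper is hypothetical, and the rule (case-insensitive match on the single word "widespread", ignoring a trailing period) is an assumption; other boilerplate phrases would need their own entries.

```python
def split_location(text):
    """Return (is_widespread, residual_text) for one LocationInCountry value."""
    if text is None:
        # Nulls carry neither the flag nor any prose.
        return (False, None)
    cleaned = text.strip()
    if cleaned.rstrip(".").lower() == "widespread":
        return (True, None)
    return (False, cleaned)

# The second example string is made up for illustration.
print(split_location("Widespread."))
print(split_location("  Northern hills.  "))
print(split_location(None))
```

Only the residual text would then go on to tokenization or embedding, so the 56 "Widespread." rows do not dominate the text features.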
Out[41]:

saturn.columns["LocationInCountry"].stats

stat value
n 7,124
nulls 4,576 (64.2%)
unique 2,176
len_min 3
len_max 939
len_mean 141.1
len_median 89
len_p95 455.7
word_mean 21.15
word_median 12
n_empty 0
n_duplicates 372
duplicate_rate 0.146
vocab_size 10,936
readability_flesch_mean 41.64
emoji_rate 0
url_rate 0
one_word_rate 0.04317
allcaps_rate 0
boilerplate_rate 0
alert: multilingual 11 languages detected in sample
alert: null_rate 64.2% null
Fig 18.
Character-length distribution for LocationInCountry (mean: 141.1).
chars count
3 – 26 327
26 – 50 413
50 – 73 323
73 – 97 303
97 – 120 234
120 – 143 153
143 – 167 116
167 – 190 84
190 – 214 75
214 – 237 48
237 – 260 55
260 – 284 46
284 – 307 42
307 – 331 46
331 – 354 33
354 – 377 31
377 – 401 19
401 – 424 24
424 – 448 40
448 – 471 17
471 – 494 19
494 – 518 12
518 – 541 10
541 – 565 14
565 – 588 10
588 – 611 10
611 – 635 11
635 – 658 7
658 – 682 4
682 – 705 2
705 – 728 3
728 – 752 4
752 – 775 7
775 – 799 1
799 – 822 1
822 – 845 1
845 – 869 1
869 – 892 0
892 – 916 1
916 – 939 1

PeopleID1 numeric feature

PeopleID1 is stored as numeric but only takes 16 distinct integer values between 10 and 26, with a tight IQR of 1.0 (Q1=20, Q3=21) around a median of 21. The distribution is heavily left-skewed (skew -1.34) and 32.9% of rows fall outside the Tukey fence, so the 'outliers' alert reflects the column being near-categorical rather than truly continuous. No nulls and no zeros, but the name suggests an identifier despite the low cardinality.

Treatment: Cast to categorical (16 levels) rather than treating as a continuous numeric.

anthropic:claude-opus-4-7 · confidence medium
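The outlier alert can be reproduced by hand, which shows why it is an artifact of the narrow IQR. The sketch below applies the standard Tukey rule (1.5 × IQR beyond the quartiles) to the reported Q1=20 and Q3=21. The `value_counts` mapping is read off the histogram bins, assuming each non-empty bin holds exactly one integer code; `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = tukey_fences(20, 21)

# Code -> row count, read from the PeopleID1 histogram (assumption: one code per bin).
value_counts = {
    10: 475, 11: 140, 12: 166, 13: 83, 14: 225, 15: 128, 16: 81, 17: 339,
    18: 3, 19: 13, 20: 451, 21: 3681, 22: 632, 23: 287, 24: 269, 26: 151,
}

n_total = sum(value_counts.values())
n_flagged = sum(n for code, n in value_counts.items() if code < low or code > high)
outlier_rate = n_flagged / n_total

print(low, high, n_flagged, round(outlier_rate, 4))
```

With fences at 18.5 and 22.5, only codes 19 through 22 count as inliers, so twelve of the sixteen legitimate codes are flagged. That matches the reported n_outliers of 2,347 and supports treating the column as categorical.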
Out[44]:

saturn.columns["PeopleID1"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 16
min 10
max 26
mean 19.51
median 21
std 3.858
q1 20
q3 21
iqr 1
skew -1.337
kurtosis 0.8321
n_outliers 2,347
outlier_rate 0.3294
zero_rate 0
alert: outliers 32.9% rows beyond 1.5 IQR
Fig 19.
Distribution of PeopleID1. Vertical dash marks the median.
Histogram bins for PeopleID1 (median: 21.0).
bin count
10 – 10.4 475
10.4 – 10.8 0
10.8 – 11.2 140
11.2 – 11.6 0
11.6 – 12 0
12 – 12.4 166
12.4 – 12.8 0
12.8 – 13.2 83
13.2 – 13.6 0
13.6 – 14 0
14 – 14.4 225
14.4 – 14.8 0
14.8 – 15.2 128
15.2 – 15.6 0
15.6 – 16 0
16 – 16.4 81
16.4 – 16.8 0
16.8 – 17.2 339
17.2 – 17.6 0
17.6 – 18 0
18 – 18.4 3
18.4 – 18.8 0
18.8 – 19.2 13
19.2 – 19.6 0
19.6 – 20 0
20 – 20.4 451
20.4 – 20.8 0
20.8 – 21.2 3681
21.2 – 21.6 0
21.6 – 22 0
22 – 22.4 632
22.4 – 22.8 0
22.8 – 23.2 287
23.2 – 23.6 0
23.6 – 24 0
24 – 24.4 269
24.4 – 24.8 0
24.8 – 25.2 0
25.2 – 25.6 0
25.6 – 26 151

ROP1 categorical feature

ROP1 is a low-cardinality categorical code (16 distinct values, all following an 'A0xx' pattern) with no nulls across 7,124 rows. The distribution is heavily concentrated: 'A012' alone accounts for 51.7% of records, and entropy ratio of 0.67 confirms the imbalance. This looks like a controlled vocabulary or lookup code rather than free input.

Treatment: One-hot or target-encode; consider grouping rare codes given the dominance of A012.

anthropic:claude-opus-4-7 · confidence high
Out[47]:

saturn.columns["ROP1"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 16
top_value A012
top_rate 0.5167
cardinality 16
entropy 2.676
entropy_ratio 0.669
Fig 20.
Top values for ROP1.
Top values for ROP1 (16 unique shown, of 16 total).
value count share
A012 3681 51.7%
A013 632 8.9%
A001 475 6.7%
A011 451 6.3%
A008 339 4.8%
A014 287 4.0%
A015 269 3.8%
A005 225 3.2%
A003 166 2.3%
A017 151 2.1%
A002 140 2.0%
A006 128 1.8%
A004 83 1.2%
A007 81 1.1%
A010 13 0.2%
A009 3 0.0%

AffinityBloc categorical feature

AffinityBloc is a categorical grouping of peoples/ethnolinguistic blocs, with 16 distinct values across 7124 rows and no nulls. The distribution is heavily concentrated: 'South Asian Peoples' alone accounts for 51.7% of records (3681), followed distantly by 'Sub-Saharan Peoples' (632) and 'Arab World' (475). Entropy ratio of 0.67 confirms the imbalance, and the inclusion of 'Deaf' (151) as a bloc alongside geographic/ethnic categories is a notable taxonomy quirk.

Treatment: One-hot or target-encode, and consider grouping rare blocs given the dominance of South Asian Peoples.

anthropic:claude-opus-4-7 · confidence high
Out[50]:

saturn.columns["AffinityBloc"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 16
top_value South Asian Peoples
top_rate 0.5167
cardinality 16
entropy 2.676
entropy_ratio 0.669
Fig 21.
Top values for AffinityBloc.
Show data table
Top values for AffinityBloc (16 unique shown, of 16 total).
valuecountshare
South Asian Peoples368151.7%
Sub-Saharan Peoples6328.9%
Arab World4756.7%
Southeast Asian Peoples4516.3%
Malay Peoples3394.8%
Tibetan-Himalayan Peoples2874.0%
Turkic Peoples2693.8%
Persian-Median2253.2%
Eurasian Peoples1662.3%
Deaf1512.1%
East Asian Peoples1402.0%
Jewish1281.8%
Horn of Africa Peoples831.2%
Latin-Caribbean Americans811.1%
Pacific Islanders130.2%
North American Peoples30.0%
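
The ROP1 and AffinityBloc tables report identical counts at every rank (3681, 632, 475, ... 13, 3), which suggests, but does not prove, that ROP1 is the code for the AffinityBloc label. A minimal sketch of how one might check this on the raw rows; the `is_one_to_one` helper and the sample pairs are hypothetical, and the pairing of codes to labels shown here is illustrative only.

```python
from collections import defaultdict

def is_one_to_one(pairs):
    """True if each left value maps to exactly one right value and vice versa."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)
    return (all(len(v) == 1 for v in left_to_right.values())
            and all(len(v) == 1 for v in right_to_left.values()))

# Hypothetical (ROP1, AffinityBloc) pairs; real pairs would come from the parquet rows.
sample = [
    ("A012", "South Asian Peoples"),
    ("A012", "South Asian Peoples"),
    ("A013", "Sub-Saharan Peoples"),
]
print(is_one_to_one(sample))
```

If the check passes on the full data, one of the two columns is redundant for modelling and only one needs encoding.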

PeopleID2 numeric foreign_key

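PeopleID2 is a numeric identifier-like field with only 205 distinct values across 7,124 rows, ranging from 101 to 475 with no nulls or zeros. The distribution is left-skewed (skew -0.50) and platykurtic (kurtosis -1.24), with the median (412) well above the mean (339): at least half the rows carry ids of 412 or higher, the histogram is empty between roughly 363 and 400, and a spread of lower ids pulls the mean down. These shape statistics describe how often each id is reused, not a measured quantity. The low cardinality relative to row count indicates a repeated key rather than a per-row unique id, and the 205 distinct values match the 205 distinct PeopleCluster labels, so this is plausibly the numeric key for PeopleCluster.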

Treatment: Treat as a categorical foreign key and left-join to the people dimension rather than using as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
Out[53]:

saturn.columns["PeopleID2"].stats

statvalue
n7,124
nulls0 (0.0%)
unique205
min 101
max 475
mean 339
median 412
std 123.2
q1 232.8
q3 450
iqr 217.2
skew -0.5035
kurtosis -1.242
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 22.
Distribution of PeopleID2. Vertical dash marks the median.
Histogram bins for PeopleID2 (median: 412.0).
bincount
101 – 110.372
110.3 – 119.7353
119.7 – 129.199
129.1 – 138.458
138.4 – 147.874
147.8 – 157.1217
157.1 – 166.4156
166.4 – 175.886
175.8 – 185.192
185.1 – 194.567
194.5 – 203.8188
203.8 – 213.2111
213.2 – 222.6122
222.6 – 231.984
231.9 – 241.2222
241.2 – 250.6136
250.6 – 259.934
259.9 – 269.3144
269.3 – 278.63
278.6 – 28868
288 – 297.4194
297.4 – 306.7147
306.7 – 316223
316 – 325.4256
325.4 – 334.8160
334.8 – 344.114
344.1 – 353.458
353.4 – 362.88
362.8 – 372.10
372.1 – 381.50
381.5 – 390.80
390.8 – 400.20
400.2 – 409.6100
409.6 – 418.9573
418.9 – 428.2216
428.2 – 437.650
437.6 – 446.9928
446.9 – 456.3128
456.3 – 465.61136
465.6 – 475547
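
The treatment note for PeopleID2 recommends a left join rather than numeric use. A minimal sketch with pandas, assuming a lookup table exists; `people_dim`, its `label` column and all sample values are hypothetical. The `validate="many_to_one"` argument makes the merge fail loudly if the lookup side has duplicate keys.

```python
import pandas as pd

# Hypothetical fact rows and a hypothetical lookup (dimension) table.
facts = pd.DataFrame({"PeopleID2": [412, 412, 101],
                      "Population": [30000, 6700, 129000]})
people_dim = pd.DataFrame({"PeopleID2": [101, 412],
                           "label": ["cluster A", "cluster B"]})

# Cast the key to text so it cannot be mistaken for a numeric feature.
facts["PeopleID2"] = facts["PeopleID2"].astype(str)
people_dim["PeopleID2"] = people_dim["PeopleID2"].astype(str)

# Left join keeps every fact row; validate raises MergeError on duplicate lookup keys.
joined = facts.merge(people_dim, on="PeopleID2", how="left", validate="many_to_one")
print(joined)
```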

ROP2 categorical feature

ROP2 is a categorical code field with 155 distinct values. The codes shown follow two patterns, 'A' plus three digits ('A012') and 'C' plus four digits ('C0229'), suggesting a classification code. One value, 'A012', dominates at 51.6% of the 7,124 rows, while every other value in the top 20 is a C-code at or below 2.4%. An entropy ratio of 0.56 confirms that the distribution is heavily concentrated rather than uniform.

Treatment: Collapse rare C-codes into an 'other' bucket and one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[56]:

saturn.columns["ROP2"].stats

statvalue
n7,124
nulls0 (0.0%)
unique155
top_value A012
top_rate 0.5163
cardinality 155
entropy 4.085
entropy_ratio 0.5614
Fig 23.
Top values for ROP2.
Top values for ROP2 (20 unique shown, of 155 total).
value  count  share
A012   3678   51.6%
C0229  169    2.4%
C0147  167    2.3%
C0252  151    2.1%
C0102  128    1.8%
C0061  111    1.6%
C0179  90     1.3%
C0207  86     1.2%
C0013  76     1.1%
C0156  75     1.1%
C0223  75     1.1%
C0015  70     1.0%
C0221  68     1.0%
C0126  63     0.9%
C0114  61     0.9%
C0017  59     0.8%
C0019  58     0.8%
C0216  52     0.7%
C0062  51     0.7%
C0201  50     0.7%
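
Collapsing rare ROP2 codes into an 'other' bucket, as the treatment note suggests, can be sketched with the standard library. The `collapse_rare` helper, the sample list and the cut-off are all hypothetical; the right threshold on the real data is a judgment call.

```python
from collections import Counter

def collapse_rare(values, min_count, other="other"):
    """Replace values seen fewer than min_count times with one bucket label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Hypothetical mini-sample, not the real distribution.
codes = ["A012"] * 5 + ["C0229"] * 2 + ["C0147"]
collapsed = collapse_rare(codes, min_count=2)
print(Counter(collapsed))
```

Counts should be taken from the training split only, so that the bucket boundaries do not depend on held-out rows.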

PeopleCluster categorical feature

PeopleCluster is a high-cardinality categorical taxonomy of ethno-religious groupings, with 205 distinct values across 7124 rows and no nulls. The distribution is dominated by South Asian categories — 'South Asia Hindu - other' alone accounts for 12.2% (869 rows), followed by 'South Asia Muslim - other' (586) and 'South Asia Dalit - other' (352). Entropy ratio of 0.80 indicates a fairly spread distribution despite the South Asian skew, and the appearance of 'Deaf' alongside ethnolinguistic labels signals a mixed taxonomy worth flagging.

Treatment: Group the long tail and target- or frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[59]:

saturn.columns["PeopleCluster"].stats

statvalue
n7,124
nulls0 (0.0%)
unique205
top_value South Asia Hindu - other
top_rate 0.122
cardinality 205
entropy 6.108
entropy_ratio 0.7954
Fig 24.
Top values for PeopleCluster.
Top values for PeopleCluster (20 unique shown, of 205 total).
valuecountshare
South Asia Hindu - other86912.2%
South Asia Muslim - other5868.2%
South Asia Dalit - other3524.9%
South Asia Tribal - other3114.4%
South Asia Muslim - Pashtun2934.1%
Tibeto-Burman, other1692.4%
Mon-Khmer1672.3%
Deaf1512.1%
South Asia Muslim - Jat1381.9%
Jewish1281.8%
South Asia Forward Caste - Brahmin1231.7%
South Asia - other1141.6%
South Asia Muslim - Rajput1121.6%
Caucasus1111.6%
South Asia Forward Caste - Rajput951.3%
Persian901.3%
South Asia Buddhist871.2%
Tai861.2%
South Asia Hindu - Jat771.1%
Arab, Arabian761.1%
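
Frequency encoding, one of the two options named in the PeopleCluster treatment note, replaces each label with its share of rows. A small sketch using the top three counts from the table; the `frequency_encode` helper is hypothetical, and the remaining rows are lumped under a placeholder label purely to keep the total at 7,124.

```python
from collections import Counter

def frequency_encode(values):
    """Map each category to its share of rows."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

# Top three counts from the table; the rest is a placeholder, not a real label.
clusters = (["South Asia Hindu - other"] * 869
            + ["South Asia Muslim - other"] * 586
            + ["South Asia Dalit - other"] * 352
            + ["(all other clusters)"] * (7124 - 869 - 586 - 352))
encoded = frequency_encode(clusters)
print(round(encoded[0], 3))
```

The first encoded value reproduces the top_rate of 0.122 reported in the stats table.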

PeopNameAcrossCountries text label

This column holds people-group / ethnolinguistic names spanning countries (e.g. 'Turk', 'Persian', 'Kurd, Kurmanji', 'Arab, Moroccan'), with frequent religious-tradition qualifiers like '(Hindu traditions)' and '(Muslim traditions)' appearing in 985 and 424 rows respectively. Values are short (mean 12.4 chars, median 2 words) and 47.5% are single-word labels, yet 35.4% (2,520 rows) are duplicates across 4,604 unique strings out of 7,124 — the same group recurs across countries. The most surprising entry is 'Deaf' at 151 occurrences, which sits oddly alongside ethnic categories.

Treatment: Treat as a categorical people-group label; normalize qualifiers in parentheses and join on (group, country) for cross-country aggregation.

anthropic:claude-opus-4-7 · confidence high
Out[62]:

saturn.columns["PeopNameAcrossCountries"].stats

statvalue
n7,124
nulls0 (0.0%)
unique4,604
len_min 1
len_max 39
len_mean 12.38
len_median 10
len_p95 27
word_mean 1.766
word_median 2
n_empty 0
n_duplicates 2,520
duplicate_rate 0.3537
vocab_size 4,431
readability_flesch_mean 56.01
emoji_rate 0
url_rate 0
one_word_rate 0.475
allcaps_rate 0
boilerplate_rate 0
alert: one_word47.5% rows are a single word
alert: duplicates35.4% duplicate strings
Fig 25.
Character-length distribution for PeopNameAcrossCountries.
Character-length distribution for PeopNameAcrossCountries (mean: 12.38).
charscount
1 – 21
2 – 314
3 – 495
4 – 5582
5 – 6679
6 – 7736
7 – 8689
8 – 9382
9 – 10206
10 – 10209
10 – 11270
11 – 12277
12 – 13308
13 – 14420
14 – 15310
15 – 16253
16 – 17164
17 – 18103
18 – 1967
19 – 200
20 – 2179
21 – 2255
22 – 2364
23 – 24105
24 – 25205
25 – 26209
26 – 27194
27 – 28115
28 – 2967
29 – 3057
30 – 3036
30 – 3150
31 – 3245
32 – 3333
33 – 3418
34 – 3522
35 – 361
36 – 371
37 – 381
38 – 392
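
Normalising the parenthetical qualifiers mentioned for PeopNameAcrossCountries could look like the sketch below. The regular expression and the `split_qualifier` helper are assumptions, and 'Rajput (Hindu traditions)' is a made-up label in the style the prose describes.

```python
import re

# Trailing "(...)" qualifier, e.g. "(Hindu traditions)"; nested parentheses are not handled.
QUALIFIER = re.compile(r"^(?P<name>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")

def split_qualifier(label):
    """Split 'Name (qualifier)' into (name, qualifier); qualifier is None if absent."""
    match = QUALIFIER.match(label)
    if match:
        return match.group("name"), match.group("qualifier")
    return label.strip(), None

print(split_qualifier("Rajput (Hindu traditions)"))
print(split_qualifier("Kurd, Kurmanji"))
```

Keeping the qualifier as its own column preserves the religious-tradition signal while letting the base name be grouped across countries.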

Population numeric feature

Population counts per people group, ranging from 10 to 135,533,000 with a median of just 30,000. The distribution is extraordinarily right-skewed (skew 21.1, kurtosis 607): the mean of ~502,570 sits far above Q3 of 129,000, and ~14.9% of rows (1,058) lie beyond the 1.5 × IQR fences, which points to many small groups alongside a handful of very large ones. Null rate is negligible (0.21%, 15 rows) and there are no zeros.

Treatment: Apply a log1p transform before any modelling to tame the extreme skew.

anthropic:claude-opus-4-7 · confidence high
Out[65]:

saturn.columns["Population"].stats

statvalue
n7,124
nulls15 (0.2%)
unique1,200
min 10
max 1.355e+08
mean 5.026e+05
median 30,000
std 3.568e+06
q1 6,700
q3 129,000
iqr 122,300
skew 21.1
kurtosis 607.4
n_outliers 1,058
outlier_rate 0.1488
zero_rate 0
alert: high_skewskew=+21.10
alert: outliers14.9% rows beyond 1.5 IQR
Fig 26.
Distribution of Population. Vertical dash marks the median.
Histogram bins for Population (median: 30000.0).
bincount
10 – 3.388e+066930
3.388e+06 – 6.777e+0678
6.777e+06 – 1.016e+0736
1.016e+07 – 1.355e+0717
1.355e+07 – 1.694e+0711
1.694e+07 – 2.033e+078
2.033e+07 – 2.372e+077
2.372e+07 – 2.711e+073
2.711e+07 – 3.049e+072
3.049e+07 – 3.388e+071
3.388e+07 – 3.727e+072
3.727e+07 – 4.066e+074
4.066e+07 – 4.405e+070
4.405e+07 – 4.744e+073
4.744e+07 – 5.082e+071
5.082e+07 – 5.421e+070
5.421e+07 – 5.76e+071
5.76e+07 – 6.099e+070
6.099e+07 – 6.438e+071
6.438e+07 – 6.777e+071
6.777e+07 – 7.115e+070
7.115e+07 – 7.454e+070
7.454e+07 – 7.793e+070
7.793e+07 – 8.132e+070
8.132e+07 – 8.471e+070
8.471e+07 – 8.81e+070
8.81e+07 – 9.148e+070
9.148e+07 – 9.487e+070
9.487e+07 – 9.826e+070
9.826e+07 – 1.016e+081
1.016e+08 – 1.05e+080
1.05e+08 – 1.084e+080
1.084e+08 – 1.118e+080
1.118e+08 – 1.152e+080
1.152e+08 – 1.186e+081
1.186e+08 – 1.22e+080
1.22e+08 – 1.254e+080
1.254e+08 – 1.288e+080
1.288e+08 – 1.321e+080
1.321e+08 – 1.355e+081
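
To illustrate what the log1p treatment does to the Population scale, the sketch below transforms the five summary points from the stats table. The `log1p_or_none` helper is hypothetical; it passes nulls through because 15 rows have no population.

```python
import math

# Summary points from the stats table (max is given there as 1.355e+08).
summary = {"min": 10, "q1": 6_700, "median": 30_000, "q3": 129_000, "max": 1.355e8}

def log1p_or_none(x):
    """log(1 + x), leaving nulls untouched."""
    return None if x is None else math.log1p(x)

transformed = {name: log1p_or_none(value) for name, value in summary.items()}
for name, value in transformed.items():
    print(f"{name:>6}: {value:.2f}")
```

On the raw scale the maximum is roughly 4,500 times the median; after the transform the ratio is below 2, which is why the histogram above collapses into its first bin while a log-scale view would not.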

Category categorical label

Three-level categorical with values "1", "2", "3" and no nulls across 7124 rows. Distribution is imbalanced: "3" dominates at 60.3% (4299), "1" follows at 2330, and "2" is rare at 495. Entropy ratio of 0.78 confirms the skew toward the majority class.

Treatment: Treat as a categorical label; consider class weighting or stratified sampling to handle the minority class "2".

anthropic:claude-opus-4-7 · confidence high
Out[68]:

saturn.columns["Category"].stats

statvalue
n7,124
nulls0 (0.0%)
unique3
top_value 3
top_rate 0.6035
cardinality 3
entropy 1.234
entropy_ratio 0.7788
Fig 27.
Top values for Category.
Top values for Category (3 unique shown, of 3 total).
value  count  share
3      4299   60.3%
1      2330   32.7%
2      495    6.9%
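
The entropy and entropy_ratio figures in the stats table are consistent with Shannon entropy in bits divided by log2 of the cardinality. The sketch below reproduces them from the three counts above, assuming that is how the profiler defines them; the definition is inferred from the numbers, not taken from Saturn's documentation.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

counts = [4299, 2330, 495]            # Category levels "3", "1", "2"
h = entropy_bits(counts)
ratio = h / math.log2(len(counts))    # assumed normalisation: log2(cardinality)
print(round(h, 3), round(ratio, 4))
```

A ratio of 1 would mean three equally frequent levels; values well below 1 indicate that one level dominates.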

ROL3 text feature

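ROL3 holds three-letter ISO 639-3 language codes: every value is exactly 3 characters and one word, with top entries like 'hin', 'ben', 'urd', 'guj', and 'tel' pointing to South Asian languages. The distribution is heavily skewed toward Hindi (662 of 7,124), and only 1,565 unique codes appear across 7,124 rows, giving a 78% duplicate rate. There are no nulls and no empties. The only formatting noise is a non-zero allcaps_rate (0.00028, about 2 rows); vocab_size (1,564) is one below the unique count (1,565), which is consistent with one code appearing in two letter cases.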

Treatment: treat as a categorical language code; one-hot or target-encode, and consider grouping rare codes.

anthropic:claude-opus-4-7 · confidence high
Out[71]:

saturn.columns["ROL3"].stats

statvalue
n7,124
nulls0 (0.0%)
unique1,565
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 5,559
duplicate_rate 0.7803
vocab_size 1,564
readability_flesch_mean 118.7
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.0002807
boilerplate_rate 0
alert: one_word100.0% rows are a single word
alert: short_text95th-percentile length under 20 chars
alert: duplicates78.0% duplicate strings
Fig 28.
Character-length distribution for ROL3.
Character-length distribution for ROL3 (mean: 3.0).
chars  count
3      7124
(all other bins are empty)

PrimaryLanguageName text feature

Categorical language label, usually a single word (one_word_rate 0.70, word_mean 1.33), with 1,563 distinct values and a token vocabulary of 1,641. Hindi (662), Bengali (357), and Sindhi (191) dominate the 7,124 rows, and 78.1% of values are duplicates of an earlier row, which is expected for a controlled language taxonomy. Compound names like 'Pashto, Northern' and 'Punjabi, Eastern' indicate ISO-style subvariant naming rather than free text.

Treatment: Treat as a categorical factor; normalise the comma-separated subvariants before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
Out[74]:

saturn.columns["PrimaryLanguageName"].stats

statvalue
n7,124
nulls0 (0.0%)
unique1,563
len_min 1
len_max 32
len_mean 9.147
len_median 7
len_p95 17
word_mean 1.331
word_median 1
n_empty 0
n_duplicates 5,561
duplicate_rate 0.7806
vocab_size 1,641
readability_flesch_mean 33.28
emoji_rate 0
url_rate 0
one_word_rate 0.7016
allcaps_rate 0
boilerplate_rate 0
alert: one_word70.2% rows are a single word
alert: short_text95th-percentile length under 20 chars
alert: duplicates78.1% duplicate strings
Fig 29.
Character-length distribution for PrimaryLanguageName.
Character-length distribution for PrimaryLanguageName (mean: 9.15).
charscount
1 – 22
2 – 329
3 – 361
3 – 4655
4 – 50
5 – 61267
6 – 6888
6 – 71222
7 – 80
8 – 9517
9 – 10207
10 – 10137
10 – 1165
11 – 120
12 – 1398
13 – 13157
13 – 14150
14 – 150
15 – 16204
16 – 16890
16 – 17228
17 – 1860
18 – 190
19 – 2037
20 – 2046
20 – 2151
21 – 220
22 – 2335
23 – 2334
23 – 2441
24 – 253
25 – 260
26 – 2716
27 – 278
27 – 283
28 – 290
29 – 301
30 – 300
30 – 3111
31 – 321
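
One way to normalise the comma-separated subvariants in PrimaryLanguageName is to split each name into a base language and an optional variant. The `split_language_name` helper is hypothetical, and it assumes the first ', ' always separates base from variant, which should be checked against the full value list.

```python
def split_language_name(name):
    """Split 'Base, Variant' into (base, variant); variant is None if absent."""
    base, separator, variant = name.partition(", ")
    return base, (variant if separator else None)

for name in ["Pashto, Northern", "Punjabi, Eastern", "Hindi"]:
    print(split_language_name(name))
```

Encoding the base and the variant separately lets rare variants share signal with their base language.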

PrimaryLanguageDialect categorical metadata

Free-text or controlled-vocabulary field naming a primary language dialect, with 303 distinct values across 7124 rows but populated in only 5.5% of records (null_rate 0.945). Distribution is essentially flat — entropy_ratio 0.97 and the modal value 'Punjabi' covers just 3.1% of non-nulls (12 occurrences) — so no dialect dominates. The mix spans South Asian, Middle Eastern, African, and European dialects, suggesting a global but extremely sparse roster.

Treatment: Drop or collapse into a coarser language grouping; too sparse and high-cardinality to use directly as a feature.

anthropic:claude-opus-4-7 · confidence high
Out[77]:

saturn.columns["PrimaryLanguageDialect"].stats

statvalue
n7,124
nulls6,732 (94.5%)
unique303
top_value Punjabi
top_rate 0.03061
cardinality 303
entropy 8.011
entropy_ratio 0.9719
alert: long_tail255 singleton categories
alert: null_rate94.5% null
Fig 30.
Top values for PrimaryLanguageDialect.
Top values for PrimaryLanguageDialect (20 unique shown, of 303 total).
valuecountshare
Punjabi120.2%
Ta'izzi80.1%
Sinhalese60.1%
Wasulunkakan60.1%
Pomak50.1%
Siripuria50.1%
Hui40.1%
Vixlin40.1%
Western Sudanese30.0%
Brazilian Portuguese30.0%
Bawean30.0%
Bajuni30.0%
Miri30.0%
Southern Khams30.0%
Levantine Turkmen30.0%
Jerbi20.0%
Tawallammat Tan Ataram20.0%
Tihami20.0%
Timbuktu20.0%
Xinan Guanhua20.0%

NumberLanguagesSpoken numeric feature

Counts the number of languages spoken, with values ranging from 1 to 120 across 7,124 rows and no nulls. The distribution is heavily right-skewed (skew 4.87, kurtosis 36.3): the median is 1 and Q3 is 5, yet the mean is 4.33, and 597 rows (8.4%) exceed the upper Tukey fence of 11 (Q3 + 1.5 × IQR). The tail runs up to 120; whether such counts are genuine or data-entry errors cannot be told from the stats alone.

Treatment: Cap or log-transform before modelling, and audit the extreme tail (values up to 120) for data-entry errors.

anthropic:claude-opus-4-7 · confidence high
Out[80]:

saturn.columns["NumberLanguagesSpoken"].stats

statvalue
n7,124
nulls0 (0.0%)
unique69
min 1
max 120
mean 4.333
median 1
std 7.32
q1 1
q3 5
iqr 4
skew 4.871
kurtosis 36.31
n_outliers 597
outlier_rate 0.0838
zero_rate 0
alert: high_skewskew=+4.87
alert: outliers8.4% rows beyond 1.5 IQR
Fig 31.
Distribution of NumberLanguagesSpoken. Vertical dash marks the median.
Histogram bins for NumberLanguagesSpoken (median: 1.0).
bincount
1 – 3.9754930
3.975 – 6.95911
6.95 – 9.925493
9.925 – 12.9262
12.9 – 15.88146
15.88 – 18.8588
18.85 – 21.8254
21.82 – 24.851
24.8 – 27.7844
27.78 – 30.7527
30.75 – 33.7321
33.73 – 36.714
36.7 – 39.6817
39.68 – 42.6514
42.65 – 45.629
45.62 – 48.69
48.6 – 51.588
51.58 – 54.554
54.55 – 57.524
57.52 – 60.56
60.5 – 63.482
63.48 – 66.452
66.45 – 69.420
69.42 – 72.41
72.4 – 75.381
75.38 – 78.352
78.35 – 81.330
81.33 – 84.32
84.3 – 87.280
87.28 – 90.250
90.25 – 93.230
93.23 – 96.20
96.2 – 99.170
99.17 – 102.20
102.2 – 105.11
105.1 – 108.10
108.1 – 111.10
111.1 – 1140
114 – 1170
117 – 1201
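
Capping NumberLanguagesSpoken at the Tukey fence is one reading of the 'cap' option in the treatment note. A minimal sketch using the quartiles from the stats table; the helper names and the sample values are hypothetical, and capping at the fence rather than at a percentile is an assumption.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap(value, lower, upper):
    """Clamp a value into [lower, upper] (winsorising)."""
    return max(lower, min(upper, value))

lower, upper = tukey_fences(q1=1, q3=5)   # quartiles from the stats table
sample = [1, 3, 12, 120]                  # hypothetical raw values
capped = [cap(v, lower, upper) for v in sample]
print(lower, upper, capped)
```

Capping discards the size of the tail; if the extreme counts turn out to be genuine, a log transform keeps their ordering instead.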

OfficialLang categorical feature

Categorical column listing an official language per record, with 79 distinct values across 7,124 rows and effectively no nulls (0.08%). Hindi dominates at 28.5% (2,032 rows), followed by Urdu (767), Standard Arabic (657), Mandarin (475), and English (433), giving an entropy ratio of 0.66: moderately concentrated rather than uniform. The South Asian skew is notable: four of the top ten values (Hindi, Urdu, Bengali, Nepali) are languages of that region and together cover 3,239 rows, which may bias any downstream language-level analysis.

Treatment: Group rare languages into an 'Other' bucket and one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[83]:

saturn.columns["OfficialLang"].stats

statvalue
n7,124
nulls6 (0.1%)
unique79
top_value Hindi
top_rate 0.2855
cardinality 79
entropy 4.178
entropy_ratio 0.6628
Fig 32.
Top values for OfficialLang.
Top values for OfficialLang (20 unique shown, of 79 total).
valuecountshare
Hindi203228.5%
Urdu76710.8%
Arabic, Standard6579.2%
Chinese, Mandarin4756.7%
English4336.1%
French3034.3%
Bengali2563.6%
Indonesian2343.3%
Nepali1842.6%
Lao1422.0%
Russian1151.6%
Portuguese941.3%
Malay871.2%
Persian, Iranian851.2%
Spanish791.1%
Thai731.0%
Vietnamese691.0%
Turkish610.9%
Burmese590.8%
Pashto, Southern580.8%

SpeakNationalLang unknown other

Column 'SpeakNationalLang' was skipped by the profiler (no profiler for kind=unknown), so type, cardinality and value distribution are all unavailable. The only confirmed facts are that it has 7,124 rows and zero nulls. The name suggests a flag or category for whether a people group speaks the national language, but this cannot be verified from the evidence.

Treatment: Re-profile with type inference enabled before deciding on any downstream handling.

anthropic:claude-opus-4-7 · confidence low
Out[86]:

saturn.columns["SpeakNationalLang"].stats

statvalue
n7,124
nulls0 (0.0%)
unique
alert: skippedno profiler for kind=unknown

BibleStatus numeric feature

BibleStatus is an ordinal/categorical code stored as a small integer, taking just 6 distinct values from 0 to 5 across 7,124 rows with no nulls. The distribution is heavily concentrated at the top (median 5, Q1=4, Q3=5, mean 4.05) with strong negative skew (-1.51), and 13.5% of rows flagged as low-end outliers plus a 3.8% zero rate. This looks like a status/level code rather than a true numeric measurement.

Treatment: Treat as an ordinal category (or one-hot encode) rather than a continuous numeric.

anthropic:claude-opus-4-7 · confidence high
Out[88]:

saturn.columns["BibleStatus"].stats

statvalue
n7,124
nulls0 (0.0%)
unique6
min 0
max 5
mean 4.054
median 5
std 1.342
q1 4
q3 5
iqr 1
skew -1.513
kurtosis 1.538
n_outliers 961
outlier_rate 0.1349
zero_rate 0.03818
alert: outliers13.5% rows beyond 1.5 IQR
Fig 33.
Distribution of BibleStatus. Vertical dash marks the median.
Histogram bins for BibleStatus (median: 5.0).
bincount
0 – 0.125272
0.125 – 0.250
0.25 – 0.3750
0.375 – 0.50
0.5 – 0.6250
0.625 – 0.750
0.75 – 0.8750
0.875 – 10
1 – 1.125216
1.125 – 1.250
1.25 – 1.3750
1.375 – 1.50
1.5 – 1.6250
1.625 – 1.750
1.75 – 1.8750
1.875 – 20
2 – 2.125473
2.125 – 2.250
2.25 – 2.3750
2.375 – 2.50
2.5 – 2.6250
2.625 – 2.750
2.75 – 2.8750
2.875 – 30
3 – 3.125795
3.125 – 3.250
3.25 – 3.3750
3.375 – 3.50
3.5 – 3.6250
3.625 – 3.750
3.75 – 3.8750
3.875 – 40
4 – 4.1251506
4.125 – 4.250
4.25 – 4.3750
4.375 – 4.50
4.5 – 4.6250
4.625 – 4.750
4.75 – 4.8750
4.875 – 53862
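
Because every non-empty histogram bin above sits on an integer, the six BibleStatus value counts can be read straight off the table and the headline stats recomputed from them. A short check, assuming each non-empty bin holds exactly one integer code (0 to 5):

```python
# Value counts read from the non-empty histogram bins.
counts = {0: 272, 1: 216, 2: 473, 3: 795, 4: 1506, 5: 3862}

n = sum(counts.values())
mean = sum(code * c for code, c in counts.items()) / n
zero_rate = counts[0] / n

# Tukey lower fence from the stats table: q1 - 1.5 * iqr = 4 - 1.5 * 1
lower_fence = 4 - 1.5 * 1
n_outliers = sum(c for code, c in counts.items() if code < lower_fence)

print(n, round(mean, 3), round(zero_rate, 5), n_outliers)
```

The 961 flagged outliers are simply the rows with codes 0, 1 and 2, which supports treating the column as an ordinal code rather than a continuous measure.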

BibleYear categorical metadata

BibleYear appears to be a publication/edition year field for Bible translations, encoded mostly as date ranges like "1818-2022" rather than single years. Cardinality is high (163 distinct values across 7124 rows) and the column is missing for 45.79% of rows, which is a major coverage gap. The top value "1818-2022" covers 17.14% of non-nulls and most frequent entries are spans, while plain single years like "1954" (191 occurrences) are the exception.

Treatment: Parse into start_year and end_year integers, then decide imputation given the 45.79% null rate.

anthropic:claude-opus-4-7 · confidence medium
Out[91]:

saturn.columns["BibleYear"].stats

statvalue
n7,124
nulls3,262 (45.8%)
unique163
top_value 1818-2022
top_rate 0.1714
cardinality 163
entropy 5.318
entropy_ratio 0.7237
alert: null_rate45.8% null
Fig 34.
Top values for BibleYear.
Top values for BibleYear (20 unique shown, of 163 total).
value      count  share
1818-2022  662    9.3%
1809-2022  357    5.0%
1954       191    2.7%
1843-2022  167    2.3%
1823-2021  155    2.2%
1815-2021  150    2.1%
1895-2020  146    2.0%
1854-2022  136    1.9%
1959-2021  131    1.8%
1727-2024  129    1.8%
1821-2024  96     1.3%
1827-2024  90     1.3%
1841-2022  85     1.2%
2008       80     1.1%
1914-2024  71     1.0%
1827-2009  55     0.8%
1874-2018  50     0.7%
2023       43     0.6%
1883-2018  36     0.5%
1838-2023  36     0.5%

NTYear categorical feature

NTYear appears to hold year-range strings (e.g. "1811-1998", "1801-1984") indicating a span between two dates, though the presence of "Yes" as the third most common value (345 rows) signals encoding inconsistency. The column has 305 distinct values across 7124 rows with a high null rate of 24.65%, and the top value covers only 12.3% of non-nulls, so the distribution is fairly spread (entropy ratio 0.735).

Treatment: Parse year-range strings into start/end numeric fields and quarantine non-conforming values like "Yes" before modelling.

anthropic:claude-opus-4-7 · confidence medium
Out[94]:

saturn.columns["NTYear"].stats

statvalue
n7,124
nulls1,756 (24.6%)
unique305
top_value 1811-1998
top_rate 0.1233
cardinality 305
entropy 6.065
entropy_ratio 0.7349
alert: null_rate24.6% null
Fig 35.
Top values for NTYear.
Top values for NTYear (20 unique shown, of 305 total).
value      count  share
1811-1998  662    9.3%
1801-1984  357    5.0%
Yes        345    4.8%
1890-1992  191    2.7%
1758-2000  167    2.3%
1820-1985  155    2.2%
1809-2000  150    2.1%
1818-1991  146    2.0%
2023-2024  146    2.0%
1818-1989  136    1.9%
1815-2011  131    1.8%
1819-2021  130    1.8%
1715-1998  129    1.8%
1978-2022  110    1.5%
1827-2023  100    1.4%
1811-1982  96     1.3%
1829-1995  85     1.2%
2010       83     1.2%
1819-1993  77     1.1%
1821-2010  71     1.0%

PortionsYear categorical metadata

PortionsYear appears to be a free-form field giving a year range, probably for published Scripture portions by analogy with the neighbouring BibleYear and NTYear columns, with 460 distinct values across 7,124 rows and a 13.5% null rate. Most entries are date ranges like '1806-1962' or '1800-1980', but the single most common value is the literal string 'Yes' (821 rows, 13.3% of non-null values), suggesting inconsistent data entry where a yes/no answer leaks into a date-range field. The entropy ratio of 0.71 confirms that values are spread across many ranges rather than concentrated.

Treatment: Parse date ranges into start/end year numerics and isolate the 'Yes' contamination as a separate boolean flag before use.

anthropic:claude-opus-4-7 · confidence high
Out[97]:

saturn.columns["PortionsYear"].stats

statvalue
n7,124
nulls961 (13.5%)
unique460
top_value Yes
top_rate 0.1332
cardinality 460
entropy 6.289
entropy_ratio 0.711
Fig 36.
Top values for PortionsYear.
Top values for PortionsYear (20 unique shown, of 460 total).
value      count  share
Yes        821    11.5%
1806-1962  662    9.3%
1800-1980  357    5.0%
1825-1981  191    2.7%
1747-1894  167    2.3%
1809-1965  155    2.2%
1811-1956  150    2.1%
1824-2010  146    2.0%
1812-1966  136    1.9%
1818-1954  131    1.8%
1885-1922  130    1.8%
1714-1956  129    1.8%
1927-1964  110    1.5%
1807-1957  96     1.3%
1812-1988  90     1.3%
1811-1968  85     1.2%
2024       73     1.0%
1850-1961  71     1.0%
1782-1985  55     0.8%
1939-2021  52     0.7%
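
The same parsing step applies to BibleYear, NTYear and PortionsYear: split year ranges into start and end, treat single years as a one-year range, and set aside strings such as 'Yes'. A minimal sketch with a hypothetical `parse_year_field` helper; it assumes four-digit years and a plain hyphen, and flags anything else as non-year text for review.

```python
import re

RANGE = re.compile(r"^(\d{4})-(\d{4})$")
YEAR = re.compile(r"^\d{4}$")

def parse_year_field(raw):
    """Return (start_year, end_year, is_non_year_text) for one cell."""
    if raw is None:
        return None, None, False          # genuine null
    text = raw.strip()
    match = RANGE.match(text)
    if match:
        return int(match.group(1)), int(match.group(2)), False
    if YEAR.match(text):
        return int(text), int(text), False
    return None, None, True               # e.g. "Yes": presence only, no dates

for cell in ["1806-1962", "2024", "Yes", None]:
    print(cell, parse_year_field(cell))
```

Keeping the third element as its own boolean column preserves the information that something exists even where no year was recorded.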

TranslationNeedQuestionable unknown other

The column "TranslationNeedQuestionable" was skipped by the profiler, so no type, uniqueness, or value statistics are available. All we know is that it has 7124 rows with a 0.0 null rate; nothing else can be inferred from the evidence.

Treatment: Re-profile or inspect manually before deciding on use; current evidence is insufficient.

anthropic:claude-opus-4-7 · confidence low
Out[100]:

saturn.columns["TranslationNeedQuestionable"].stats

statvalue
n7,124
nulls0 (0.0%)
unique
alert: skippedno profiler for kind=unknown

JPScale numeric metadata

JPScale is a numeric column that is entirely constant: all 7124 rows hold the value 1.0 with zero nulls, zero variance, and a single unique value. It carries no information for any downstream model or comparison and was flagged as constant.

Treatment: Drop; constant column with no variance.

anthropic:claude-opus-4-7 · confidence high
Out[102]:

saturn.columns["JPScale"].stats

statvalue
n7,124
nulls0 (0.0%)
unique1
min 1
max 1
mean 1
median 1
std 0
q1 1
q3 1
iqr 0
skew 0
kurtosis 0
n_outliers 0
outlier_rate 0
zero_rate 0
alert: constantonly one distinct value
Fig 37.
Distribution of JPScale. Vertical dash marks the median.
Histogram bins for JPScale (median: 1.0).
bin        count
1 – 1.025  7124
(all other bins from 0.5 to 1.5 are empty)

JPScalePC categorical feature

JPScalePC is a 5-level categorical code (values "1" through "5") with no nulls across 7,124 rows, likely an ordinal scale or rating; given the dataset, the 'JP' prefix most plausibly stands for Joshua Project. The distribution is heavily concentrated at "1" (70.2% of rows), with "3" the rarest at just 205 occurrences, yielding an entropy ratio of 0.59. The frequency order (1 > 4 > 2 > 5 > 3) does not follow the level order, which is not an error in itself but is worth keeping in mind before pooling levels.

Treatment: Treat as ordinal categorical; consider grouping minority levels (3, 5) given the dominance of "1".

anthropic:claude-opus-4-7 · confidence high
Out[105]:

saturn.columns["JPScalePC"].stats

statvalue
n7,124
nulls0 (0.0%)
unique5
top_value 1
top_rate 0.702
cardinality 5
entropy 1.377
entropy_ratio 0.5929
Fig 38.
Top values for JPScalePC.
Top values for JPScalePC (5 unique shown, of 5 total).
value  count  share
1      5001   70.2%
4      1141   16.0%
2      525    7.4%
5      252    3.5%
3      205    2.9%

JPScalePGAC categorical feature

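JPScalePGAC is a 5-level categorical code (values "1" through "5"). Given the dataset and the sibling columns JPScale and JPScalePC, it is most plausibly a Joshua Project scale value at another level of aggregation (PGAC may stand for 'people group across countries', compare PeopNameAcrossCountries), not a seismic or other physical measure. The distribution is severely imbalanced: "1" accounts for 6,910 of 7,124 rows (top_rate 0.97), entropy_ratio is just 0.10, and the remaining four levels together hold 214 records. No nulls are present.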

Treatment: Collapse rare levels or binarise as "1" vs other before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[108]:

saturn.columns["JPScalePGAC"].stats

statvalue
n7,124
nulls0 (0.0%)
unique5
top_value 1
top_rate 0.97
cardinality 5
entropy 0.2372
entropy_ratio 0.1021
alert: imbalancetop value is 97.0% of rows
Fig 39.
Top values for JPScalePGAC.
Top values for JPScalePGAC (5 unique shown, of 5 total).
value  count  share
1      6910   97.0%
4      118    1.7%
2      75     1.1%
5      15     0.2%
3      6      0.1%

LeastReached categorical metadata

This column is a single-valued categorical flag holding "Y" for all 7124 rows, with no nulls and zero entropy. Because cardinality is 1 and top_rate is 1.0, it carries no information for any downstream model.

Treatment: Drop; constant column with no variance.

anthropic:claude-opus-4-7 · confidence high
Out[111]:

saturn.columns["LeastReached"].stats

statvalue
n7,124
nulls0 (0.0%)
unique1
top_value Y
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalancetop value is 100.0% of rows
Fig 40.
Top values for LeastReached.
Top values for LeastReached (1 unique shown, of 1 total).
valuecountshare
Y7124100.0%
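
JPScale and LeastReached are both flagged as constant. Rather than dropping such columns by name, they can be detected in one pass. A minimal pandas sketch on a hypothetical three-row frame; counting nulls as a value (`dropna=False`) is a design choice so that a column like NTOnline, which mixes 'Y' with nulls, is not dropped by accident.

```python
import pandas as pd

# Hypothetical mini-frame using column names from this report.
df = pd.DataFrame({
    "JPScale": [1.0, 1.0, 1.0],
    "LeastReached": ["Y", "Y", "Y"],
    "LeastReachedPC": ["Y", "N", "Y"],
})

# A column is constant when it has at most one distinct value, nulls included.
constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
reduced = df.drop(columns=constant)
print(constant, list(reduced.columns))
```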

LeastReachedPC categorical feature

Binary Y/N flag indicating whether some 'least reached' people-group condition is met. The column is fully populated across 7124 rows with only 2 distinct values, skewed toward 'Y' at 72.3% (5152) versus 'N' at 1972. Entropy ratio of 0.85 shows the split is uneven but still informative.

Treatment: Encode as a 0/1 boolean for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[114]:

saturn.columns["LeastReachedPC"].stats

statvalue
n7,124
nulls0 (0.0%)
unique2
top_value Y
top_rate 0.7232
cardinality 2
entropy 0.8511
entropy_ratio 0.8511
Fig 41.
Top values for LeastReachedPC.
Top values for LeastReachedPC (2 unique shown, of 2 total).
valuecountshare
Y515272.3%
N197227.7%

LeastReachedPGAC categorical feature

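A binary Y/N flag (likely indicating whether the least-reached PGAC condition was met) with no nulls across 7,124 rows. The distribution is severely imbalanced: 'Y' accounts for 6,910 rows (97.0%) versus only 214 'N', yielding an entropy_ratio of just 0.19. The 'Y' count equals the number of rows with JPScalePGAC = "1" (6,910), so the flag may simply restate that scale level; this is worth confirming on the raw rows. As a near-constant feature it carries little discriminative signal on its own.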

Treatment: Encode as 0/1 but consider dropping or pairing with rare-class oversampling given the 97/3 imbalance.

anthropic:claude-opus-4-7 · confidence high
Out[117]:

saturn.columns["LeastReachedPGAC"].stats

statvalue
n7,124
nulls0 (0.0%)
unique2
top_value Y
top_rate 0.97
cardinality 2
entropy 0.1946
entropy_ratio 0.1946
alert: imbalancetop value is 97.0% of rows
Fig 42.
Top values for LeastReachedPGAC.
Top values for LeastReachedPGAC (2 unique shown, of 2 total).
valuecountshare
Y691097.0%
N2143.0%

GSEC categorical feature

GSEC is a low-cardinality categorical with 8 distinct values across 7,124 rows and no nulls. The dominant value is the empty string at 51.08% (3,639 rows), followed by '1' at 2,767; the remaining six codes ('0', '2', '3', '4', '5', '6') together account for 718 rows, about 10.1%. The mix of blanks and small integer codes suggests an optional categorical flag where 'missing' is encoded as '' rather than null.

Treatment: Recode '' as explicit missing and one-hot encode the remaining small-integer categories.

anthropic:claude-opus-4-7 · confidence high
Out[120]:

saturn.columns["GSEC"].stats

statvalue
n7,124
nulls0 (0.0%)
unique8
top_value
top_rate 0.5108
cardinality 8
entropy 1.605
entropy_ratio 0.535
Fig 43.
Top values for GSEC.
Top values for GSEC (8 unique shown, of 8 total).
value           count  share
(empty string)  3639   51.1%
1               2767   38.8%
4               186    2.6%
0               176    2.5%
2               139    2.0%
3               100    1.4%
6               62     0.9%
5               55     0.8%
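
The two steps in the GSEC treatment note, recoding '' as missing and then one-hot encoding, can be sketched as follows. The sample cells and the `GSEC_missing` column name are hypothetical; keeping an explicit missing indicator is an assumption, made because blanks cover about half the rows.

```python
raw = ["", "1", "4", "", "0"]   # hypothetical sample cells

# Step 1: turn blank strings into an explicit missing marker.
cleaned = [None if v.strip() == "" else v for v in raw]

# Step 2: one-hot encode the observed codes, plus a missing indicator.
levels = sorted({v for v in cleaned if v is not None})
rows = [
    {**{f"GSEC_{level}": int(v == level) for level in levels},
     "GSEC_missing": int(v is None)}
    for v in cleaned
]
print(rows[0])
print(rows[1])
```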

HasAudioRecordings categorical feature

Binary Y/N flag indicating whether a record has associated audio recordings. The class is heavily imbalanced toward 'Y' at 86.9% (6188 of 7124), with no nulls. Entropy ratio of 0.56 confirms the skew but the minority 'N' class still has 936 observations, enough to be usable.

Treatment: Encode as a 0/1 boolean; watch for class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[123]:

saturn.columns["HasAudioRecordings"].stats

statvalue
n7,124
nulls0 (0.0%)
unique2
top_value Y
top_rate 0.8686
cardinality 2
entropy 0.5612
entropy_ratio 0.5612
Fig 44.
Top values for HasAudioRecordings.
Top values for HasAudioRecordings (2 unique shown, of 2 total).
valuecountshare
Y618886.9%
N93613.1%

NTOnline categorical feature

NTOnline is a categorical flag with only one observed value, 'Y', across 5528 non-null rows, while 22.4% of rows are null. With cardinality 1 and entropy 0, it carries no discriminative signal—presence vs. absence is the only information available.

Treatment: Drop, or replace with a binary is_present indicator if the null pattern is meaningful.

anthropic:claude-opus-4-7 · confidence high
Out[126]:

saturn.columns["NTOnline"].stats

statvalue
n7,124
nulls1,596 (22.4%)
unique1
top_value Y
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: null_rate22.4% null
alert: imbalancetop value is 100.0% of rows
Fig 45.
Top values for NTOnline.
Top values for NTOnline (1 unique shown, of 1 total).
valuecountshare
Y552877.6%
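
Since NTOnline holds only 'Y' or null, the is_present indicator from the treatment note reduces to a null check. A tiny sketch with a hypothetical `presence_flag` helper and made-up sample cells; it assumes a null means 'not online' rather than 'unknown', which the stats cannot confirm.

```python
def presence_flag(value):
    """1 if the cell holds any value, 0 if it is null (None)."""
    return 0 if value is None else 1

nt_online = ["Y", None, "Y", "Y"]   # hypothetical sample cells
is_present = [presence_flag(v) for v in nt_online]
print(is_present)
```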

RLG3 numeric feature

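RLG3 is a discrete numeric column with only 7 unique values spanning 2 to 9 (no rows carry 3). Although this looks like a rating scale, the counts for codes 6 (3,279) and 5 (2,142) equal the Islam and Hinduism counts reported for PrimaryReligion in the Summary, so RLG3 is most plausibly a nominal religion code. Rows cluster on codes 5 and 6 (median 6, Q1=5, Q3=6, IQR=1), and 10.6% of rows (757) fall outside the IQR fences of 3.5 and 7.5, namely codes 2, 8 and 9. That is an artifact of the narrow box rather than a sign of true anomalies, and the skew (-0.46) has no substantive meaning for a nominal code.

Treatment: Treat as a nominal categorical code and confirm the code-to-religion mapping against PrimaryReligion; the outlier flag is a side-effect of the compressed IQR, not bad data.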

anthropic:claude-opus-4-7 · confidence high
Out[129]:

saturn.columns["RLG3"].stats

statvalue
n7,124
nulls0 (0.0%)
unique7
min 2
max 9
mean 5.27
median 6
std 1.279
q1 5
q3 6
iqr 1
skew -0.4551
kurtosis 2.001
n_outliers 757
outlier_rate 0.1063
zero_rate 0
alert: outliers10.6% rows beyond 1.5 IQR
Fig 46.
Distribution of RLG3. Vertical dash marks the median.
Histogram bins for RLG3 (median: 6.0).
bincount
2 – 2.175480
2.175 – 2.350
2.35 – 2.5250
2.525 – 2.70
2.7 – 2.8750
2.875 – 3.050
3.05 – 3.2250
3.225 – 3.40
3.4 – 3.5750
3.575 – 3.750
3.75 – 3.9250
3.925 – 4.1933
4.1 – 4.2750
4.275 – 4.450
4.45 – 4.6250
4.625 – 4.80
4.8 – 4.9750
4.975 – 5.152142
5.15 – 5.3250
5.325 – 5.50
5.5 – 5.6750
5.675 – 5.850
5.85 – 6.0253279
6.025 – 6.20
6.2 – 6.3750
6.375 – 6.550
6.55 – 6.7250
6.725 – 6.90
6.9 – 7.07513
7.075 – 7.250
7.25 – 7.4250
7.425 – 7.60
7.6 – 7.7750
7.775 – 7.950
7.95 – 8.125120
8.125 – 8.30
8.3 – 8.4750
8.475 – 8.650
8.65 – 8.8250
8.825 – 9157
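
To see why a narrow IQR inflates the outlier count, the Tukey fences can be recomputed from the RLG3 value counts. This is a sketch under one assumption: that saturn's quartiles agree with Python's `statistics.quantiles` for this column. The function name `tukey_outlier_count` is hypothetical.

```python
import statistics

# Value counts read from the RLG3 histogram table (code -> rows).
rlg3_counts = {2: 480, 4: 933, 5: 2142, 6: 3279, 7: 13, 8: 120, 9: 157}


def tukey_outlier_count(counts, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for a value->count mapping."""
    # Expand the counts back into one sorted list of observations.
    values = sorted(v for v, n in counts.items() for _ in range(n))
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    n_out = sum(n for v, n in counts.items() if v < lower or v > upper)
    return lower, upper, n_out


lower, upper, n_out = tukey_outlier_count(rlg3_counts)
print(lower, upper, n_out)
```

With Q1=5 and Q3=6 the fences sit at 3.5 and 7.5, so every row coded 2, 8, or 9 is flagged, which reproduces the 757 reported in n_outliers.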

RLG3PC numeric feature

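RLG3PC is an integer-coded feature with only 8 distinct values spanning 1 to 9 and a tight IQR of 1 (Q1=5, Q3=6). Its value counts match the PrimaryReligionPC categories exactly (3,105 rows at code 6 and 3,105 labelled Islam; 332 at code 1 and 332 labelled Christianity), so it appears to be a nominal religion code. The distribution is left-skewed (-0.95) around the median of 5, and 14.3% of rows (1,022) fall outside the IQR fences: 806 at codes 1 and 2, and 216 at codes 8 and 9. These reflect the narrow IQR rather than true anomalies. No nulls or zeros are present.

Treatment: Treat as a nominal categorical rather than continuous; the outlier rate reflects the narrow IQR, not errors.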

anthropic:claude-opus-4-7 · confidence high
Out[132]:

saturn.columns["RLG3PC"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  8
min  1
max  9
mean  5.079
median  5
std  1.52
q1  5
q3  6
iqr  1
skew  -0.9463
kurtosis  1.703
n_outliers  1,022
outlier_rate  0.1435
zero_rate  0
alert: outliers  14.3% rows beyond 1.5 IQR
Fig 47.
Distribution of RLG3PC. Vertical dash marks the median.
Histogram bins for RLG3PC (median: 5.0). Only the 8 non-empty bins of 40 are listed; the other 32 have a count of 0.
bin  count
1 – 1.2  332
2 – 2.2  474
4 – 4.2  666
5 – 5.2  2296
6 – 6.2  3105
7 – 7.2  35
8 – 8.2  62
8.8 – 9  154

RLG3PGAC numeric feature

RLG3PGAC is a numeric column with only 8 distinct integer values spanning 1 to 9. Its value counts match the PrimaryReligionPGAC categories exactly (3,247 rows at code 6 and 3,247 labelled Islam; 17 at code 1 and 17 labelled Christianity), so it appears to be a nominal religion code rather than a rating or a continuous measurement. The distribution is tight around a median of 5.5 with an IQR of just 1 (Q1=5, Q3=6), so any value outside the central 5-6 band registers as an outlier: 776 rows (10.9%) fall outside the Tukey fences. The mild left skew (-0.46) comes from the 483 rows at codes 1 and 2, which pull the mean (5.27) below the median.

Treatment: Treat as a nominal categorical; one-hot encode rather than scaling as continuous.

anthropic:claude-opus-4-7 · confidence high
Out[135]:

saturn.columns["RLG3PGAC"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  8
min  1
max  9
mean  5.272
median  5.5
std  1.296
q1  5
q3  6
iqr  1
skew  -0.4637
kurtosis  2.032
n_outliers  776
outlier_rate  0.1089
zero_rate  0
alert: outliers  10.9% rows beyond 1.5 IQR
Fig 48.
Distribution of RLG3PGAC. Vertical dash marks the median.
Histogram bins for RLG3PGAC (median: 5.5). Only the 8 non-empty bins of 40 are listed; the other 32 have a count of 0.
bin  count
1 – 1.2  17
2 – 2.2  466
4 – 4.2  925
5 – 5.2  2154
6 – 6.2  3247
7 – 7.2  22
8 – 8.2  131
8.8 – 9  162
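
One way to test whether a numeric code column such as RLG3PGAC is simply an encoding of a label column such as PrimaryReligionPGAC is to check that the two map one-to-one. The sketch below uses hypothetical sample rows; the function name `is_one_to_one` and the code/label pairs are assumptions for illustration, not values confirmed by the source data.

```python
from collections import defaultdict


def is_one_to_one(codes, labels):
    """True if each code pairs with exactly one label and each label with one code."""
    code_to_labels = defaultdict(set)
    label_to_codes = defaultdict(set)
    for code, label in zip(codes, labels):
        code_to_labels[code].add(label)
        label_to_codes[label].add(code)
    return (all(len(s) == 1 for s in code_to_labels.values())
            and all(len(s) == 1 for s in label_to_codes.values()))


# Hypothetical rows: (code, label) pairs as they might appear side by side.
codes = [6, 6, 5, 2, 5]
labels = ["Islam", "Islam", "Hinduism", "Buddhism", "Hinduism"]
print(is_one_to_one(codes, labels))
```

If the check passes on the real columns, one of the pair can be dropped before modelling, since both carry the same information.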

PrimaryReligion categorical feature

PrimaryReligion is a low-cardinality categorical with 7 distinct values across 7,124 rows and no nulls. Islam dominates at 46% (3,279 rows), followed by Hinduism (2,142) and Ethnic Religions (933); Non-Religious appears only 13 times and 157 rows are explicitly 'Unknown'. Entropy ratio of 0.68 indicates a moderately skewed but not degenerate distribution.

Treatment: One-hot encode and consider merging 'Unknown' with 'Other / Small' or treating it as a missing-value flag.

anthropic:claude-opus-4-7 · confidence high
Out[138]:

saturn.columns["PrimaryReligion"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  7
top_value  Islam
top_rate  0.4603
cardinality  7
entropy  1.92
entropy_ratio  0.6839
Fig 49.
Top values for PrimaryReligion.
Top values for PrimaryReligion (7 unique shown, of 7 total).
value  count  share
Islam  3279  46.0%
Hinduism  2142  30.1%
Ethnic Religions  933  13.1%
Buddhism  480  6.7%
Unknown  157  2.2%
Other / Small  120  1.7%
Non-Religious  13  0.2%
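
The entropy and entropy_ratio figures can be reproduced from the value counts. The sketch below assumes saturn reports Shannon entropy in bits and normalises by log2 of the cardinality; that definition is an assumption, though it agrees with the numbers reported for this column.

```python
import math

# Value counts from the PrimaryReligion table.
religion_counts = {
    "Islam": 3279, "Hinduism": 2142, "Ethnic Religions": 933,
    "Buddhism": 480, "Unknown": 157, "Other / Small": 120,
    "Non-Religious": 13,
}


def entropy_bits(counts):
    """Shannon entropy in bits of a value->count mapping."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


print(round(entropy_bits(religion_counts), 2),
      round(entropy_ratio(religion_counts), 4))
```

A ratio of 1 would mean all seven categories are equally common; values well below 1 indicate concentration in a few categories.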

PrimaryReligionPC categorical feature

Categorical label assigning each of 7124 rows to one of 8 primary religion categories, with no nulls. Islam dominates at 3105 rows (43.6%) followed by Hinduism at 2296, while Non-Religious (35) and Other/Small (62) are rare; entropy ratio of 0.68 indicates moderate concentration in the top two classes. 154 rows are explicitly 'Unknown', a category worth treating distinctly from missing.

Treatment: One-hot encode the 8 levels and keep 'Unknown' as its own category rather than imputing.

anthropic:claude-opus-4-7 · confidence high
Out[141]:

saturn.columns["PrimaryReligionPC"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  8
top_value  Islam
top_rate  0.4359
cardinality  8
entropy  2.051
entropy_ratio  0.6838
Fig 50.
Top values for PrimaryReligionPC.
Top values for PrimaryReligionPC (8 unique shown, of 8 total).
value  count  share
Islam  3105  43.6%
Hinduism  2296  32.2%
Ethnic Religions  666  9.3%
Buddhism  474  6.7%
Christianity  332  4.7%
Unknown  154  2.2%
Other / Small  62  0.9%
Non-Religious  35  0.5%

PrimaryReligionPGAC categorical feature

Categorical label for the primary religion of a People Group Across Countries (PGAC) record, with 8 distinct values across 7124 rows and no nulls. Islam dominates at 45.6% (3247), followed by Hinduism (2154) and Ethnic Religions (925); Christianity is strikingly rare at just 17 rows, which is notable for a religion-coded dataset. Entropy ratio of 0.65 indicates moderate concentration on the top categories.

Treatment: One-hot or target-encode for modelling; consider folding 'Unknown' and 'Non-Religious'/'Christianity' tails into 'Other'.

anthropic:claude-opus-4-7 · confidence high
Out[144]:

saturn.columns["PrimaryReligionPGAC"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  8
top_value  Islam
top_rate  0.4558
cardinality  8
entropy  1.955
entropy_ratio  0.6516
Fig 51.
Top values for PrimaryReligionPGAC.
Top values for PrimaryReligionPGAC (8 unique shown, of 8 total).
value  count  share
Islam  3247  45.6%
Hinduism  2154  30.2%
Ethnic Religions  925  13.0%
Buddhism  466  6.5%
Unknown  162  2.3%
Other / Small  131  1.8%
Non-Religious  22  0.3%
Christianity  17  0.2%

RLG4 numeric feature

RLG4 is a sparse numeric feature populated for only ~7.6% of rows (542 of 7,124; null_rate 0.9239) with just 18 distinct integer-like values ranging from 10 to 39. It shares its null count (6,582) and its cardinality (18) with ReligionSubdivision, and its 160-row and 120-row histogram bins match the Sunni (160) and Judaism (120) counts there, which suggests it is the numeric code for that field rather than an ordinal score. Skew is positive (1.05) even though the mean (18.19) sits below the median (20.0), because 30 values lie above 30; these are the 30 flagged outliers (5.5% of present values), given an IQR of 6 and an upper fence of 29.

Treatment: Treat as a categorical code with missingness as its own level; imputing a numeric value is not meaningful if the codes are nominal.

anthropic:claude-opus-4-7 · confidence medium
Out[147]:

saturn.columns["RLG4"].stats

stat  value
n  7,124
nulls  6,582 (92.4%)
unique  18
min  10
max  39
mean  18.19
median  20
std  6.472
q1  14
q3  20
iqr  6
skew  1.051
kurtosis  1.474
n_outliers  30
outlier_rate  0.05535
zero_rate  0
alert: null_rate  92.4% null
alert: outliers  5.5% rows beyond 1.5 IQR
Fig 52.
Distribution of RLG4. Vertical dash marks the median.
Histogram bins for RLG4 (median: 20.0).
bin  count
10 – 11.26  107
11.26 – 12.52  9
12.52 – 13.78  0
13.78 – 15.04  120
15.04 – 16.3  1
16.3 – 17.57  0
17.57 – 18.83  20
18.83 – 20.09  160
20.09 – 21.35  2
21.35 – 22.61  3
22.61 – 23.87  11
23.87 – 25.13  79
25.13 – 26.39  0
26.39 – 27.65  0
27.65 – 28.91  0
28.91 – 30.17  0
30.17 – 31.43  0
31.43 – 32.7  4
32.7 – 33.96  0
33.96 – 35.22  2
35.22 – 36.48  8
36.48 – 37.74  6
37.74 – 39  10

ReligionSubdivision categorical feature

A sub-classification of religion (denomination/sect), with 18 distinct values such as Sunni, Judaism, Sikhism, Tibetan, and Theravada. The column is 92.39% null, so it is populated for only the 542 records where a finer-grained religious branch applies. Sunni leads with 160 occurrences (29.52% of non-null values, 2.2% of all 7124 rows), and the entropy ratio of 0.72 indicates the populated values are spread fairly evenly across branches.

Treatment: Treat missingness as its own category and one-hot encode, or roll up into the parent PrimaryReligion field before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[150]:

saturn.columns["ReligionSubdivision"].stats

stat  value
n  7,124
nulls  6,582 (92.4%)
unique  18
top_value  Sunni
top_rate  0.2952
cardinality  18
entropy  2.984
entropy_ratio  0.7157
alert: null_rate  92.4% null
Fig 53.
Top values for ReligionSubdivision.
Top values for ReligionSubdivision (18 unique shown, of 18 total).
value  count  share
Sunni  160  2.2%
Judaism  120  1.7%
Sikhism  68  1.0%
Tibetan  58  0.8%
Theravada  49  0.7%
Shia  20  0.3%
Zoroastrianism  11  0.2%
Jainism  11  0.2%
Mahayana  9  0.1%
Prakriti  9  0.1%
Kirati  8  0.1%
Mandaeism  6  0.1%
Druze  4  0.1%
Baha'i  3  0.0%
Shia Imami Ismaili  2  0.0%
Syncretized  2  0.0%
Lingayat  1  0.0%
Animism  1  0.0%

PCIslam numeric feature

PCIslam appears to be a percentage-style indicator of Islamic affiliation, bounded between 0 and 100 with a near-zero null rate (0.0013). The distribution is starkly bimodal rather than continuous: 47.1% of rows are exactly zero, the median is 0.28, yet Q3 sits at 99.99, producing a kurtosis of -1.93 and an IQR spanning nearly the full range. Mean (45.2) and std (48.2) confirm the mass is piled at the extremes rather than around the center.

Treatment: Treat as bimodal: consider binarizing (0 vs >0) or binning rather than using raw value in linear models.

anthropic:claude-opus-4-7 · confidence high
Out[153]:

saturn.columns["PCIslam"].stats

stat  value
n  7,124
nulls  9 (0.1%)
unique  902
min  0
max  100
mean  45.22
median  0.2753
std  48.22
q1  0
q3  99.99
iqr  99.99
skew  0.1703
kurtosis  -1.935
n_outliers  0
outlier_rate  0
zero_rate  0.4713
Fig 54.
Distribution of PCIslam. Vertical dash marks the median.
Histogram bins for PCIslam (median: 0.275292498279422).
bin  count
0 – 2.5  3635
2.5 – 5  26
5 – 7.5  22
7.5 – 10  13
10 – 12.5  17
12.5 – 15  9
15 – 17.5  9
17.5 – 20  8
20 – 22.5  20
22.5 – 25  1
25 – 27.5  11
27.5 – 30  5
30 – 32.5  21
32.5 – 35  10
35 – 37.5  7
37.5 – 40  7
40 – 42.5  12
42.5 – 45  2
45 – 47.5  4
47.5 – 50  1
50 – 52.5  9
52.5 – 55  3
55 – 57.5  4
57.5 – 60  5
60 – 62.5  11
62.5 – 65  3
65 – 67.5  15
67.5 – 70  14
70 – 72.5  30
72.5 – 75  13
75 – 77.5  25
77.5 – 80  22
80 – 82.5  39
82.5 – 85  20
85 – 87.5  43
87.5 – 90  45
90 – 92.5  103
92.5 – 95  109
95 – 97.5  226
97.5 – 100  2536

PCNonReligious numeric feature

PCNonReligious appears to be the percentage of non-religious individuals in each record, but the distribution is dominated by zeros — 87.5% of values are exactly 0 and the entire IQR collapses to 0. The remaining tail stretches to 99.0 with skew of 9.1 and kurtosis of 125.3, producing 886 outliers (12.5% of rows). Mean (1.02) sits far above median (0), so any modelling that assumes symmetry will be misled.

Treatment: Treat as zero-inflated; consider a binary is_nonzero flag plus a log1p transform of the positive tail.

anthropic:claude-opus-4-7 · confidence high
Out[156]:

saturn.columns["PCNonReligious"].stats

stat  value
n  7,124
nulls  23 (0.3%)
unique  152
min  0
max  99
mean  1.016
median  0
std  4.549
q1  0
q3  0
iqr  0
skew  9.105
kurtosis  125.3
n_outliers  886
outlier_rate  0.1248
zero_rate  0.8752
alert: high_skew  skew=+9.11
alert: outliers  12.5% rows beyond 1.5 IQR
Fig 55.
Distribution of PCNonReligious. Vertical dash marks the median.
Histogram bins for PCNonReligious (median: 0.0).
bin  count
0 – 2.475  6442
2.475 – 4.95  159
4.95 – 7.425  222
7.425 – 9.9  34
9.9 – 12.38  73
12.38 – 14.85  17
14.85 – 17.32  48
17.32 – 19.8  18
19.8 – 22.28  32
22.28 – 24.75  3
24.75 – 27.23  9
27.23 – 29.7  4
29.7 – 32.18  12
32.18 – 34.65  5
34.65 – 37.12  3
37.12 – 39.6  2
39.6 – 42.08  2
42.08 – 44.55  1
44.55 – 47.02  2
47.02 – 49.5  2
49.5 – 51.98  3
51.98 – 54.45  1
54.45 – 56.93  2
56.93 – 59.4  1
59.4 – 61.88  0
61.88 – 64.35  0
64.35 – 66.83  0
66.83 – 69.3  1
69.3 – 71.78  0
71.78 – 74.25  0
74.25 – 76.73  0
76.73 – 79.2  0
79.2 – 81.67  0
81.67 – 84.15  0
84.15 – 86.62  0
86.62 – 89.1  0
89.1 – 91.58  0
91.58 – 94.05  0
94.05 – 96.53  2
96.53 – 99  1
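
The zero-inflated treatment (a binary is_nonzero flag plus a log1p transform) might look like the sketch below. It works on a plain list with None for nulls; the sample values and the function name `zero_inflated_features` are hypothetical.

```python
import math


def zero_inflated_features(values):
    """Split a zero-heavy percentage into an is_nonzero flag and a log1p magnitude.

    Nulls (None) are passed through as None in both outputs.
    """
    flags, logs = [], []
    for v in values:
        if v is None:
            flags.append(None)
            logs.append(None)
        else:
            flags.append(1 if v > 0 else 0)
            logs.append(math.log1p(v))
    return flags, logs


# Hypothetical sample standing in for PCNonReligious.
flags, logs = zero_inflated_features([0.0, 99.0, None, 2.5])
print(flags)
```

The flag captures whether any non-religious share is recorded at all, while log1p compresses the long tail and maps 0 to 0.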

PCUnknown numeric feature

PCUnknown is a numeric column expressing what looks like a percentage (range 0-100) of 'unknown' classification, with 92.8% of values being zero and the median, Q1, and Q3 all at 0. The distribution is extremely right-skewed (skew 6.45, kurtosis 39.85) with 510 outliers (7.2%) extending up to 100, including a secondary spike of 152 rows in the 97.5-100 bin. With 388 unique values and only 0.35% nulls, it carries sparse but potentially meaningful signal among the non-zero rows.

Treatment: Binarize (zero vs non-zero) or log-transform the non-zero tail before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[159]:

saturn.columns["PCUnknown"].stats

stat  value
n  7,124
nulls  25 (0.4%)
unique  388
min  0
max  100
mean  2.28
median  0
std  14.59
q1  0
q3  0
iqr  0
skew  6.454
kurtosis  39.85
n_outliers  510
outlier_rate  0.07184
zero_rate  0.9282
alert: high_skew  skew=+6.45
alert: outliers  7.2% rows beyond 1.5 IQR
Fig 56.
Distribution of PCUnknown. Vertical dash marks the median.
Histogram bins for PCUnknown (median: 0.0).
bin  count
0 – 2.5  6885
2.5 – 5  20
5 – 7.5  9
7.5 – 10  10
10 – 12.5  6
12.5 – 15  2
15 – 17.5  3
17.5 – 20  0
20 – 22.5  0
22.5 – 25  2
25 – 27.5  0
27.5 – 30  2
30 – 32.5  0
32.5 – 35  2
35 – 37.5  0
37.5 – 40  1
40 – 42.5  0
42.5 – 45  0
45 – 47.5  0
47.5 – 50  0
50 – 52.5  0
52.5 – 55  0
55 – 57.5  0
57.5 – 60  0
60 – 62.5  1
62.5 – 65  0
65 – 67.5  0
67.5 – 70  0
70 – 72.5  0
72.5 – 75  0
75 – 77.5  0
77.5 – 80  0
80 – 82.5  0
82.5 – 85  1
85 – 87.5  1
87.5 – 90  0
90 – 92.5  0
92.5 – 95  0
95 – 97.5  2
97.5 – 100  152

SecurityLevel numeric feature

SecurityLevel is an ordinal/categorical code stored as numeric, with only 3 distinct values spanning 0 to 2 across 7,124 complete rows. The distribution is heavily concentrated at the top tier (median, Q1, and Q3 all equal 2.0, mean 1.595), yet 15.6% of rows are 0 and the IQR-based outlier check flags 24.9% of records — an artifact of the degenerate IQR of 0 rather than true anomalies. Strong negative skew (-1.47) confirms the mass sits at level 2.

Treatment: Treat as a 3-level ordinal category (one-hot or ordered encode); ignore the outlier flag since IQR is zero.

anthropic:claude-opus-4-7 · confidence high
Out[162]:

saturn.columns["SecurityLevel"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  3
min  0
max  2
mean  1.595
median  2
std  0.7442
q1  2
q3  2
iqr  0
skew  -1.466
kurtosis  0.4048
n_outliers  1,771
outlier_rate  0.2486
zero_rate  0.1564
alert: outliers  24.9% rows beyond 1.5 IQR
Fig 57.
Distribution of SecurityLevel. Vertical dash marks the median.
Histogram bins for SecurityLevel (median: 2.0). Only the 3 non-empty bins of 40 are listed; the other 37 have a count of 0.
bin  count
0 – 0.05  1114
1 – 1.05  657
1.95 – 2  5353

LRTop100 categorical label

Binary Y/N flag indicating membership in some 'LRTop100' set, with exactly 100 rows marked 'Y' out of 7124 — strongly suggesting a curated top-100 list. The distribution is severely imbalanced (98.6% 'N', entropy ratio 0.107), which is flagged as an imbalance alert. No nulls are present.

Treatment: Use as a binary indicator; if modelling, apply class-imbalance handling (stratification or reweighting).

anthropic:claude-opus-4-7 · confidence high
Out[165]:

saturn.columns["LRTop100"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  2
top_value  N
top_rate  0.986
cardinality  2
entropy  0.1065
entropy_ratio  0.1065
alert: imbalance  top value is 98.6% of rows
Fig 58.
Top values for LRTop100.
Top values for LRTop100 (2 unique shown, of 2 total).
value  count  share
N  7024  98.6%
Y  100  1.4%

PhotoAddress text metadata

PhotoAddress holds single-token image filenames, mostly following a 10-character 'pXXXXX.jpg' pattern (one_word_rate 1.0; 5,092 rows have length 10 and 61 have the maximum length of 13). Coverage is incomplete: 1970 of 7124 rows (27.7%) are empty strings, and duplicate_rate is 0.596, so the same photo is reused across many records (e.g., p19007.jpg appears 90 times). Only 2880 unique values back 7124 rows, suggesting shared stock images or a many-to-one photo lookup rather than a per-row asset.

Treatment: Treat as a file reference: drop from modelling, or join to an image table after handling the ~1970 empty strings.

anthropic:claude-opus-4-7 · confidence high
Out[168]:

saturn.columns["PhotoAddress"].stats

stat  value
n  7,124
nulls  1 (0.0%)
unique  2,880
len_min  0
len_max  13
len_mean  7.26
len_median  10
len_p95  10
word_mean  1
word_median  1
n_empty  1,970
n_duplicates  4,243
duplicate_rate  0.5957
vocab_size  2,879
readability_flesch_mean  84.01
emoji_rate  0
url_rate  0
one_word_rate  1
allcaps_rate  0
boilerplate_rate  0
alert: one_word  100.0% rows are a single word
alert: short_text  95th-percentile length under 20 chars
alert: duplicates  59.6% duplicate strings
Fig 59.
Character-length distribution for PhotoAddress.
Character-length distribution for PhotoAddress (mean: 7.2600028078057). Bin labels are rounded to whole characters; only the 3 non-empty bins of 40 are listed, and the other 37 have a count of 0.
chars  count
0  1970
10  5092
13  61

PhotoCredits categorical metadata

PhotoCredits captures the attribution string for an associated image, with 851 distinct credits across 7124 rows. The column has a large share of missing-style values: 1970 rows (27.7%) are empty strings and another 1496 (21.0%) are 'Anonymous', so just under half (3,466 rows, 48.7%) lack a real attribution. The remaining tail is long and idiosyncratic, mixing organisations ('Operation China, Asia Harvest'), individuals ('Isudas', 'Kerry Olson'), and platform tags ('Steve Evans - Flickr').

Treatment: Treat empty and 'Anonymous' as missing and keep only as provenance metadata; do not use as a model feature.

anthropic:claude-opus-4-7 · confidence high
Out[171]:

saturn.columns["PhotoCredits"].stats

stat  value
n  7,124
nulls  10 (0.1%)
unique  851
top_value  (empty string)
top_rate  0.2769
cardinality  851
entropy  5.584
entropy_ratio  0.5737
alert: long_tail  463 singleton categories
Fig 60.
Top values for PhotoCredits.
Top values for PhotoCredits (20 unique shown, of 851 total).
value  count  share
(empty string)  1970  27.7%
Anonymous  1496  21.0%
Operation China, Asia Harvest  263  3.7%
Isudas  212  3.0%
Kate Nelson/AusAID - Wikimedia  90  1.3%
manothegreek  77  1.1%
Kerry Olson  76  1.1%
Asia Harvest-Operation Myanmar  69  1.0%
Steve Evans - Flickr  62  0.9%
Final Sudan  43  0.6%
Rod Waddington - Flickr  42  0.6%
Peoples of Laos, Asia Harvest  41  0.6%
COMIBAM / Sepal  37  0.5%
Link Up Africa  36  0.5%
CharlesFred - Flickr  36  0.5%
Asia Harvest  36  0.5%
Hamed Saber - Flickr  36  0.5%
Peoples of the Buddhist World, Asia Harvest  36  0.5%
N-Y-C - Pixabay  34  0.5%
pxhere  34  0.5%

PhotoCreditURL categorical metadata

URL string crediting the source of an associated photo, with 774 distinct values in a long tail; the most common non-empty value is https://www.asiaharvest.org (443 rows, 6.2%). 36.0% of rows are null and another 1,970 are empty strings (27.7% of all rows, or 43.21% of non-null rows), so roughly 64% of rows carry no usable credit. Remaining values mix organisational sites (newcovenantmissions, createinternational), shorteners (tinyurl), and stock-photo hosts (pixabay, pxhere, flickr).

Treatment: Drop for modelling; if provenance matters, parse to domain and treat as a low-coverage attribution field.

anthropic:claude-opus-4-7 · confidence high
Out[174]:

saturn.columns["PhotoCreditURL"].stats

stat  value
n  7,124
nulls  2,565 (36.0%)
unique  774
top_value  (empty string)
top_rate  0.4321
cardinality  774
entropy  5.389
entropy_ratio  0.5616
alert: long_tail  465 singleton categories
alert: null_rate  36.0% null
Fig 61.
Top values for PhotoCreditURL.
Top values for PhotoCreditURL (20 unique shown, of 774 total).
value  count  share
(empty string)  1970  27.7%
https://www.asiaharvest.org  443  6.2%
https://tinyurl.com/89ffm33y  90  1.3%
https://www.pesquisas.org.br/  37  0.5%
https://flickr.com/photos/44124425616@N01/424972149/  36  0.5%
https://pixabay.com/photos/fashion-asian-japanese-chinese-4257900/  34  0.5%
https://pxhere.com/en/photo/637533  34  0.5%
https://www.newcovenantmissions.org  31  0.4%
https://www.createinternational.com  30  0.4%
https://pixabay.com/photos/person-face-people-portrait-smile-5039573/  27  0.4%
https://www.globalROAR.org  26  0.4%
https://pixabay.com/photos/bread-marrakesh-food-morocco-1166272/  25  0.4%
https://pixabay.com/photos/pakistani-model-boy-model-pakistan-3770152/  24  0.3%
https://commons.wikimedia.org/wiki/File:Sudanese_arab_from_manasir_tribe.jpg  23  0.3%
https://pixabay.com/photos/refugee-afghan-forest-geocaching-1189087/  21  0.3%
https://flickr.com/photos/91418149@N03/14204194758  20  0.3%
https://pixabay.com/pl/photos/m%c4%99%c5%bcczy%c5%bani-turban-portret-facet-2146800/  19  0.3%
https://commons.wikimedia.org/wiki/File:Algerian_man_with_turban.jpg  18  0.3%
https://commons.wikimedia.org/wiki/File:Iraqi_man_on_Baghdad_street.jpg  18  0.3%
https://flickr.com/photos/rosluc6460/3260871722  18  0.3%
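
If provenance matters, the parse-to-domain step in the treatment could be sketched with the standard library's `urllib.parse.urlsplit`. The function name `credit_domain` is hypothetical, and stripping a leading "www." is a design choice made for this illustration.

```python
from urllib.parse import urlsplit


def credit_domain(url):
    """Return the lower-cased host of a credit URL, or None for null/empty input."""
    if url is None or not url.strip():
        return None
    host = urlsplit(url.strip()).hostname
    if host is None:
        return None
    # Drop a leading "www." so the same site groups under one key.
    return host[4:] if host.startswith("www.") else host


print(credit_domain("https://www.asiaharvest.org"))
print(credit_domain("https://flickr.com/photos/rosluc6460/3260871722"))
print(credit_domain(""))
```

Grouping by domain collapses the 774 distinct URLs into a much smaller set of sources, and both nulls and empty strings map to None so they can be counted together as missing.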

PhotoCreativeCommons categorical feature

Binary Y/N flag indicating whether a photo carries a Creative Commons licence. The vast majority (top_rate 0.7981) are 'N' with only 1437 'Y' values, and nulls are negligible (null_rate 0.0007). Class imbalance is notable but not extreme.

Treatment: Encode as a 0/1 boolean; impute the handful of nulls with the mode 'N'.

anthropic:claude-opus-4-7 · confidence high
Out[177]:

saturn.columns["PhotoCreativeCommons"].stats

stat  value
n  7,124
nulls  5 (0.1%)
unique  2
top_value  N
top_rate  0.7981
cardinality  2
entropy  0.7256
entropy_ratio  0.7256
Fig 62.
Top values for PhotoCreativeCommons.
Top values for PhotoCreativeCommons (2 unique shown, of 2 total).
value  count  share
N  5682  79.8%
Y  1437  20.2%

PhotoCopyright categorical feature

Binary Y/N flag indicating whether a photo carries copyright restrictions, with 'N' dominating at 80.6% of the 7,112 non-null rows (5,735) and only 2 unique values. Nulls are negligible (0.17%) and the entropy ratio of 0.71 reflects the moderate class imbalance. No anomalies beyond the expected skew toward unrestricted photos.

Treatment: Encode as a boolean (Y=1, N=0) and impute the handful of nulls with the mode.

anthropic:claude-opus-4-7 · confidence high
Out[180]:

saturn.columns["PhotoCopyright"].stats

stat  value
n  7,124
nulls  12 (0.2%)
unique  2
top_value  N
top_rate  0.8064
cardinality  2
entropy  0.709
entropy_ratio  0.709
Fig 63.
Top values for PhotoCopyright.
Top values for PhotoCopyright (2 unique shown, of 2 total).
value  count  share
N  5735  80.5%
Y  1377  19.3%

PhotoPermission categorical feature

Binary opt-in flag for photo permission, stored as 'Y'/'N'. The column is heavily skewed toward 'N' at 80.4% of non-null rows (5,715 of 7,110), with a near-zero null rate of 0.2%. Watch the case inconsistency: 2 records use lowercase 'y' alongside 1393 uppercase 'Y', so case-sensitive joins or filters will miscount.

Treatment: Normalize case (upper) and encode as boolean before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[183]:

saturn.columns["PhotoPermission"].stats

stat  value
n  7,124
nulls  14 (0.2%)
unique  3
top_value  N
top_rate  0.8038
cardinality  3
entropy  0.7173
entropy_ratio  0.4526
Fig 64.
Top values for PhotoPermission.
Top values for PhotoPermission (3 unique shown, of 3 total).
value  count  share
N  5715  80.2%
Y  1393  19.6%
y  2  0.0%
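
The case normalisation and boolean encoding could be done as in this minimal sketch. The function name `permission_to_bool` is hypothetical, and returning None for nulls or unexpected values is an assumption about how missing data should be handled.

```python
def permission_to_bool(value):
    """Map 'Y'/'y' to True and 'N'/'n' to False; nulls and other values become None."""
    if value is None:
        return None
    normalised = value.strip().upper()
    if normalised == "Y":
        return True
    if normalised == "N":
        return False
    return None


# Hypothetical sample including the lowercase variant seen in the table.
sample = ["N", "Y", "y", None]
print([permission_to_bool(v) for v in sample])
```

After normalisation the 2 lowercase 'y' rows fold into the 'Y' class, giving 1,395 True values in total.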

ProfileTextExists categorical feature

A binary flag indicating whether a profile text exists, with values Y/N. The column is severely imbalanced: 6888 of 7124 rows (96.7%) are Y, leaving only 236 N, yielding a low entropy ratio of 0.21.

Treatment: Encode as a 0/1 indicator but expect minimal predictive signal due to severe class imbalance.

anthropic:claude-opus-4-7 · confidence high
Out[186]:

saturn.columns["ProfileTextExists"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  2
top_value  Y
top_rate  0.9669
cardinality  2
entropy  0.2098
entropy_ratio  0.2098
alert: imbalance  top value is 96.7% of rows
Fig 65.
Top values for ProfileTextExists.
Top values for ProfileTextExists (2 unique shown, of 2 total).
value  count  share
Y  6888  96.7%
N  236  3.3%

CountOfCountries numeric feature

Likely a per-row count of distinct countries associated with each record, ranging from 1 to 164 across 7124 rows with no nulls. The distribution is severely right-skewed (skew 5.67, kurtosis 33.17): the median is just 2 and Q3 is 4, yet the mean is 8.11 and 16.98% of rows flag as outliers. The tail is not smooth: 151 rows fall in the top histogram bin (159.9 to 164), so a small set of very widely dispersed records drags the mean far above typical values.

Treatment: Log-transform or cap at a high quantile before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
Out[189]:

saturn.columns["CountOfCountries"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  39
min  1
max  164
mean  8.108
median  2
std  24.27
q1  1
q3  4
iqr  3
skew  5.672
kurtosis  33.17
n_outliers  1,210
outlier_rate  0.1698
zero_rate  0
alert: high_skew  skew=+5.67
alert: outliers  17.0% rows beyond 1.5 IQR
Fig 66.
Distribution of CountOfCountries. Vertical dash marks the median.
Histogram bins for CountOfCountries (median: 2.0).
bin  count
1 – 5.075  5710
5.075 – 9.15  272
9.15 – 13.23  248
13.23 – 17.3  163
17.3 – 21.38  150
21.38 – 25.45  139
25.45 – 29.53  123
29.53 – 33.6  20
33.6 – 37.68  0
37.68 – 41.75  109
41.75 – 45.83  2
45.83 – 49.9  37
49.9 – 53.98  0
53.98 – 58.05  0
58.05 – 62.12  0
62.12 – 66.2  0
66.2 – 70.28  0
70.28 – 74.35  0
74.35 – 78.42  0
78.42 – 82.5  0
82.5 – 86.58  0
86.58 – 90.65  0
90.65 – 94.73  0
94.73 – 98.8  0
98.8 – 102.9  0
102.9 – 107  0
107 – 111  0
111 – 115.1  0
115.1 – 119.2  0
119.2 – 123.2  0
123.2 – 127.3  0
127.3 – 131.4  0
131.4 – 135.5  0
135.5 – 139.6  0
139.6 – 143.6  0
143.6 – 147.7  0
147.7 – 151.8  0
151.8 – 155.8  0
155.8 – 159.9  0
159.9 – 164  151
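
Capping at a high quantile, one of the two options in the treatment, can be sketched as below. The nearest-rank quantile and the function name `cap_at_quantile` are choices made for this illustration; other quantile definitions would give slightly different caps.

```python
import math


def cap_at_quantile(values, q=0.95):
    """Cap values at the q-th quantile, using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values], cap


# Hypothetical sample with one extreme count, like the 164-country records.
sample_counts = [1, 1, 2, 2, 3, 4, 4, 9, 25, 164]
capped, cap = cap_at_quantile(sample_counts, q=0.9)
print(cap, capped)
```

Capping keeps the ordering of typical rows intact while limiting the influence of the extreme records on means and linear models.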

CountOfProvinces unknown other

Saturn skipped profiling on CountOfProvinces, so beyond a row count of 7124 and zero nulls, no distributional evidence is available. The name suggests an integer count of provinces per record, but unique count, range, and summary stats are all missing. Without further inspection the column's actual content and cardinality cannot be confirmed.

Treatment: Re-profile or manually inspect this column before use; saturn skipped it.

anthropic:claude-opus-4-7 · confidence low
Out[192]:

saturn.columns["CountOfProvinces"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  (not reported)
alert: skipped  no profiler for kind=unknown

Longitude numeric feature

Geographic longitude coordinates spanning nearly the full global range, from -173.08 to 178.44 degrees. The distribution is heavily left-skewed (-1.40) with a median of 75.23 sitting well above the mean of 62.80, reflecting concentration in the eastern hemisphere with a tail of western-hemisphere points. About 4.4% of values (316 rows) fall outside the 1.5×IQR fences (roughly -30.3 to 159.3); most of these are western-hemisphere locations, which are plausible coordinates rather than obvious errors.

Treatment: Pair with latitude for geospatial features; consider clustering or binning rather than treating as a raw scalar.

anthropic:claude-opus-4-7 · confidence high
Out[194]:

saturn.columns["Longitude"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  6,713
min  -173.1
max  178.4
mean  62.8
median  75.23
std  44.79
q1  40.81
q3  88.22
iqr  47.41
skew  -1.402
kurtosis  2.859
n_outliers  316
outlier_rate  0.04436
zero_rate  0
Fig 67.
Distribution of Longitude. Vertical dash marks the median.
Histogram bins for Longitude (median: 75.22975927090806).
bin  count
-173.1 – -164.3  6
-164.3 – -155.5  2
-155.5 – -146.7  0
-146.7 – -137.9  0
-137.9 – -129.1  0
-129.1 – -120.4  11
-120.4 – -111.6  16
-111.6 – -102.8  2
-102.8 – -93.99  8
-93.99 – -85.2  15
-85.2 – -76.42  48
-76.42 – -67.63  110
-67.63 – -58.84  29
-58.84 – -50.05  24
-50.05 – -41.26  8
-41.26 – -32.48  10
-32.48 – -23.69  0
-23.69 – -14.9  46
-14.9 – -6.112  139
-6.112 – 2.676  229
2.676 – 11.46  217
11.46 – 20.25  249
20.25 – 29.04  173
29.04 – 37.83  356
37.83 – 46.62  244
46.62 – 55.4  243
55.4 – 64.19  104
64.19 – 72.98  781
72.98 – 81.77  1584
81.77 – 90.56  970
90.56 – 99.34  313
99.34 – 108.1  728
108.1 – 116.9  135
116.9 – 125.7  161
125.7 – 134.5  52
134.5 – 143.3  36
143.3 – 152.1  41
152.1 – 160.9  8
160.9 – 169.6  3
169.6 – 178.4  23

Latitude numeric feature

Latitude values for 7124 rows spanning -42.61 to 71.84 with a median of 25.02 — consistent with geographic latitudes in degrees. Distribution leans toward northern hemisphere (mean 23.54, skew -0.70) with 292 outliers (4.1%) likely representing far-southern or far-northern records. No nulls and 6696 unique values suggest near-record-level coordinates.

Treatment: Pair with longitude for geospatial features; avoid standard scaling alone since latitude is bounded and non-linear in distance.

anthropic:claude-opus-4-7 · confidence high
Out[197]:

saturn.columns["Latitude"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  6,696
min  -42.61
max  71.84
mean  23.54
median  25.02
std  14.92
q1  15.55
q3  31.61
iqr  16.06
skew  -0.702
kurtosis  2.141
n_outliers  292
outlier_rate  0.04099
zero_rate  0
Fig 68.
Distribution of Latitude. Vertical dash marks the median.
Histogram bins for Latitude (median: 25.021597596171).
bin  count
-42.61 – -39.75  4
-39.75 – -36.89  20
-36.89 – -34.03  18
-34.03 – -31.16  18
-31.16 – -28.3  4
-28.3 – -25.44  11
-25.44 – -22.58  7
-22.58 – -19.72  15
-19.72 – -16.86  14
-16.86 – -14  18
-14 – -11.14  40
-11.14 – -8.275  45
-8.275 – -5.414  65
-5.414 – -2.553  115
-2.553 – 0.3084  74
0.3084 – 3.17  135
3.17 – 6.031  98
6.031 – 8.892  158
8.892 – 11.75  414
11.75 – 14.61  396
14.61 – 17.48  299
17.48 – 20.34  408
20.34 – 23.2  595
23.2 – 26.06  961
26.06 – 28.92  863
28.92 – 31.78  584
31.78 – 34.64  560
34.64 – 37.5  244
37.5 – 40.36  154
40.36 – 43.23  257
43.23 – 46.09  112
46.09 – 48.95  105
48.95 – 51.81  131
51.81 – 54.67  72
54.67 – 57.53  39
57.53 – 60.39  52
60.39 – 63.25  3
63.25 – 66.12  10
66.12 – 68.98  3
68.98 – 71.84  3
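
One common way to pair latitude with longitude as a geospatial feature is to project each point onto the unit sphere, which removes the discontinuity at ±180 degrees of longitude. This is offered as one possible encoding, not the method saturn or the dataset prescribes; the function name `to_unit_vector` is hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# The dataset medians (25.02, 75.23) used purely as an example point.
x, y, z = to_unit_vector(25.02, 75.23)
print(round(x, 3), round(y, 3), round(z, 3))
```

Euclidean distance between these vectors grows monotonically with great-circle distance, so clustering on (x, y, z) behaves more sensibly than clustering on raw degrees.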

Ctry categorical feature

Country-of-origin or location field with 202 distinct values across 7,124 rows and zero nulls. India dominates at 28.5% (2,032 rows), followed by Pakistan (767) and China (442); the long tail spans the remaining countries with entropy ratio 0.66, indicating concentrated but globally distributed coverage. The Asian concentration is the headline: the six most frequent values (India, Pakistan, China, Bangladesh, Indonesia, Nepal) are all Asian countries and together account for 3,915 rows (55.0%).

Treatment: Group rare countries into an 'Other' bucket or encode by region before modelling to avoid 200-way one-hot blow-up.

anthropic:claude-opus-4-7 · confidence high
Out[200]:

saturn.columns["Ctry"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  202
top_value  India
top_rate  0.2852
cardinality  202
entropy  5.058
entropy_ratio  0.6605
Fig 69.
Top values for Ctry.
Top values for Ctry (20 unique shown, of 202 total).
value  count  share
India  2032  28.5%
Pakistan  767  10.8%
China  442  6.2%
Bangladesh  256  3.6%
Indonesia  234  3.3%
Nepal  184  2.6%
Sudan  168  2.4%
Laos  142  2.0%
Russia  115  1.6%
United States  90  1.3%
Iran  85  1.2%
Chad  81  1.1%
Malaysia  78  1.1%
Thailand  73  1.0%
Vietnam  69  1.0%
Türkiye (Turkey)  61  0.9%
Myanmar (Burma)  59  0.8%
Afghanistan  58  0.8%
Sri Lanka  55  0.8%
Canada  52  0.7%
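
Grouping rare countries into an 'Other' bucket can be sketched with a frequency threshold. The threshold, the sample values, and the function name `group_rare` are hypothetical; the right cut-off depends on the model and is not specified by the source.

```python
from collections import Counter


def group_rare(values, min_count, other="Other"):
    """Replace categories seen fewer than min_count times with a single bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical sample standing in for the Ctry column.
sample_ctry = ["India", "India", "India", "Pakistan", "Pakistan", "Chad", "Canada"]
grouped = group_rare(sample_ctry, min_count=2)
print(grouped)
```

On the full column this reduces the 202-way encoding to the handful of countries above the threshold plus one 'Other' level.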

IndigenousCode categorical feature

A binary Y/N flag indicating Indigenous status, fully populated across all 7124 rows. The distribution is imbalanced: 'Y' accounts for 79.4% (5657 rows) versus 1467 'N' rows, which is a notable skew to keep in mind for any stratified analysis.

Treatment: Encode as a binary indicator and watch for class imbalance when used as a predictor or stratifier.

anthropic:claude-opus-4-7 · confidence high
Out[203]:

saturn.columns["IndigenousCode"].stats

stat  value
n  7,124
nulls  0 (0.0%)
unique  2
top_value  Y
top_rate  0.7941
cardinality  2
entropy  0.7336
entropy_ratio  0.7336
Fig 70.
Top values for IndigenousCode.
Top values for IndigenousCode (2 unique shown, of 2 total).
value  count  share
Y  5657  79.4%
N  1467  20.6%

PercentAdherents categorical feature

PercentAdherents appears to be a numeric measure (likely a percentage of religious adherents) stored as strings, with 692 distinct values across 7,124 rows and no nulls. It is dominated by '0.000', which accounts for 56.2% of records; that single value holds the entropy ratio down to 0.43 despite a long tail of 467 singleton values. The mix of whole numbers like '5.000' with fractions like '0.200' suggests these are raw values rather than binned categories.

Treatment: Cast to float and treat as a zero-inflated numeric feature rather than a category.

anthropic:claude-opus-4-7 · confidence medium
Out[206]:

saturn.columns["PercentAdherents"].stats

stat              value
n                 7,124
nulls             0 (0.0%)
unique            692
top_value         0.000
top_rate          0.5625
cardinality       692
entropy           4.046
entropy_ratio     0.4288
alert: long_tail  467 singleton categories
Fig 71.
Top values for PercentAdherents.
Top values for PercentAdherents (20 unique shown, of 692 total).
value  count  share
0.000  4007   56.2%
1.000  285    4.0%
2.000  224    3.1%
3.000  166    2.3%
0.500  159    2.2%
5.000  137    1.9%
4.000  129    1.8%
0.200  96     1.3%
0.100  88     1.2%
0.300  77     1.1%
0.010  59     0.8%
1.500  54     0.8%
0.050  52     0.7%
0.400  47     0.7%
0.020  33     0.5%
1.200  30     0.4%
0.600  29     0.4%
0.009  25     0.4%
0.800  24     0.3%
0.090  22     0.3%
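
The "cast to float and treat as zero-inflated" treatment could look like the following with pandas. The sample strings are invented in the column's three-decimal format, and the `is_zero` / `is_missing` feature names are assumptions for illustration only.

```python
import pandas as pd

# Toy values in the same three-decimal string format as the column; not real rows.
raw = pd.Series(["0.000", "1.000", "0.500", "0.000", "n/a"], name="PercentAdherents")

# errors="coerce" turns unparseable strings into NaN instead of raising.
as_float = pd.to_numeric(raw, errors="coerce")

features = pd.DataFrame({
    "value": as_float,
    "is_zero": as_float.eq(0),       # NaN compares as False here
    "is_missing": as_float.isna(),   # keeps parse failures visible
})
print(features)
```

Keeping a separate `is_missing` flag makes any value that failed to parse visible rather than letting it pass as a non-zero row.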

PercentChristianPC categorical feature

Stored as a categorical, but the 184 distinct values are numeric strings, running from '0.000' up to at least '27.052' among the top 20, which suggests a percent-Christian metric cast as text. The meaning of the PC suffix is not stated in the profile (per-capita and per-county are guesses). The distribution is concentrated: '0.482' alone covers 12.2% of the 7,124 rows and the top 10 values account for about 48% (3,400 rows), yet an entropy ratio of 0.79 indicates the long tail still carries information. There are no nulls, and the repeated exact decimals hint at a lookup or pre-binned source rather than raw measurements.

Treatment: cast to float and treat as a continuous feature; investigate the heavy spike at 0.482 before modelling.

anthropic:claude-opus-4-7 · confidence medium
Out[209]:

saturn.columns["PercentChristianPC"].stats

stat           value
n              7,124
nulls          0 (0.0%)
unique         184
top_value      0.482
top_rate       0.122
cardinality    184
entropy        5.934
entropy_ratio  0.7887
Fig 72.
Top values for PercentChristianPC.
Top values for PercentChristianPC (20 unique shown, of 184 total).
value   count  share
0.482   869    12.2%
0.111   586    8.2%
0.000   374    5.2%
0.508   352    4.9%
8.571   311    4.4%
0.004   293    4.1%
5.023   169    2.4%
5.345   167    2.3%
3.733   151    2.1%
1.552   128    1.8%
0.008   127    1.8%
5.482   114    1.6%
27.052  111    1.6%
0.005   105    1.5%
1.495   90     1.3%
6.327   87     1.2%
0.731   86     1.2%
0.003   81     1.1%
0.001   79     1.1%
0.030   77     1.1%

NaturalName text label

Short ethnonym/community labels (e.g., 'Deaf', 'Turk', 'Persian', 'Japanese'), averaging 11.8 characters and 1.7 words with a median of 2 words. About 34% of rows are duplicates (2,419) and ~49% are single-word entries, with 4,705 unique values across 7,124 rows. Surprising signals: 'Deaf' tops the list at 151 occurrences, and top words include parenthetical religious qualifiers like 'traditions)', '(hindu', '(muslim' (952/477/411), suggesting many entries carry trailing tradition tags that the tokenizer split awkwardly.

Treatment: Normalize casing and strip parenthetical tradition suffixes, then treat as a categorical label.

anthropic:claude-opus-4-7 · confidence high
Out[212]:

saturn.columns["NaturalName"].stats

stat                     value
n                        7,124
nulls                    0 (0.0%)
unique                   4,705
len_min                  1
len_max                  39
len_mean                 11.84
len_median               10
len_p95                  27
word_mean                1.723
word_median              2
n_empty                  0
n_duplicates             2,419
duplicate_rate           0.3396
vocab_size               4,343
readability_flesch_mean  56.42
emoji_rate               0
url_rate                 0
one_word_rate            0.4885
allcaps_rate             0
boilerplate_rate         0
alert: one_word          48.8% rows are a single word
alert: duplicates        34.0% duplicate strings
Fig 73.
Character-length distribution for NaturalName.
Character-length distribution for NaturalName (mean: 11.84250421111735).
chars    count
1 – 2    1
2 – 3    15
3 – 4    107
4 – 5    590
5 – 6    711
6 – 7    776
7 – 8    702
8 – 9    389
9 – 10   216
10 – 10  278
10 – 11  347
11 – 12  281
12 – 13  431
13 – 14  349
14 – 15  292
15 – 16  219
16 – 17  110
17 – 18  66
18 – 19  47
19 – 20  0
20 – 21  46
21 – 22  35
22 – 23  48
23 – 24  109
24 – 25  174
25 – 26  191
26 – 27  179
27 – 28  103
28 – 29  77
29 – 30  38
30 – 30  35
30 – 31  53
31 – 32  37
32 – 33  36
33 – 34  18
34 – 35  15
35 – 36  0
36 – 37  0
37 – 38  1
38 – 39  2
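
Stripping the trailing tradition tags mentioned in the NaturalName note could be done with a small regular expression. This is a sketch under assumptions: `normalize_name` is a hypothetical helper, and the 'Group A' / 'Group B' labels are invented stand-ins for real names carrying a suffix such as '(Hindu traditions)'.

```python
import re

# Matches one trailing parenthetical, e.g. " (Muslim traditions)", at the end of a label.
_TRAILING_PAREN = re.compile(r"\s*\([^()]*\)\s*$")


def normalize_name(name):
    """Drop a trailing parenthetical suffix, trim whitespace, and case-fold."""
    stripped = _TRAILING_PAREN.sub("", name).strip()
    return stripped.casefold()


# 'Deaf' appears in the profile; the other two labels are invented examples.
examples = ["Deaf", "Group A (Muslim traditions)", "  Group B (Hindu traditions) "]
normalized = [normalize_name(n) for n in examples]
print(normalized)
```

Dropping the suffix can merge groups that differ only by tradition, so it may be safer to keep the original label in a separate column.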

NaturalPronunciation text metadata

Phonetic respellings of ethnonyms (short hyphenated pronunciation guides like 'PUR-zhun', 'jae-puh-NEEZ', and 'pahsh-TOON') that appear to accompany another label column. Values are overwhelmingly single tokens (one_word_rate 0.73, word_mean 1.28, len_mean 10.8) and 48.5% are null, so coverage is partial. Duplicates dominate among the 3,672 non-null rows (n_duplicates 2,183, duplicate_rate 0.59), which hold only 1,489 unique forms, suggesting a small controlled vocabulary repeated across records.

Treatment: Treat as an optional pronunciation lookup keyed to the parent term; drop or impute before modelling given ~48% nulls.

anthropic:claude-opus-4-7 · confidence high
Out[215]:

saturn.columns["NaturalPronunciation"].stats

stat                     value
n                        7,124
nulls                    3,452 (48.5%)
unique                   1,489
len_min                  2
len_max                  42
len_mean                 10.77
len_median               10
len_p95                  21
word_mean                1.281
word_median              1
n_empty                  0
n_duplicates             2,183
duplicate_rate           0.5945
vocab_size               1,537
readability_flesch_mean  69.93
emoji_rate               0
url_rate                 0
one_word_rate            0.7271
allcaps_rate             0.0005447
boilerplate_rate         0
alert: one_word          72.7% rows are a single word
alert: null_rate         48.5% null
alert: duplicates        59.4% duplicate strings
Fig 74.
Character-length distribution for NaturalPronunciation.
Character-length distribution for NaturalPronunciation (mean: 10.772875816993464).
chars    count
2 – 3    1
3 – 4    251
4 – 5    136
5 – 6    47
6 – 7    112
7 – 8    595
8 – 9    517
9 – 10   143
10 – 11  181
11 – 12  392
12 – 13  243
13 – 14  123
14 – 15  83
15 – 16  121
16 – 17  114
17 – 18  104
18 – 19  137
19 – 20  112
20 – 21  59
21 – 22  48
22 – 23  46
23 – 24  29
24 – 25  17
25 – 26  20
26 – 27  15
27 – 28  7
28 – 29  4
29 – 30  2
30 – 31  1
31 – 32  6
32 – 33  1
33 – 34  1
34 – 35  0
35 – 36  0
36 – 37  0
37 – 38  1
38 – 39  1
39 – 40  0
40 – 41  0
41 – 42  2
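
The duplicate figures for this column only add up if they are computed over non-null rows. The arithmetic below reproduces them; that n_duplicates equals non-null rows minus unique values is an inference from the reported numbers, not a documented Saturn definition.

```python
def duplicate_stats(n_rows, n_null, n_unique):
    """Return (non-null rows, duplicate rows, duplicate rate over non-null rows)."""
    n_present = n_rows - n_null
    n_duplicates = n_present - n_unique
    return n_present, n_duplicates, n_duplicates / n_present


# Figures from the NaturalPronunciation stats table.
present, dupes, rate = duplicate_stats(n_rows=7124, n_null=3452, n_unique=1489)
print(present, dupes, round(rate, 4))
```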

PercentChristianPGAC categorical feature

This column appears to be a percentage of Christians (the PGAC suffix is not defined in the profile), stored as strings with three-decimal precision rather than as a numeric type. It is heavily zero-inflated: '0.000' accounts for 43.8% of the 7,124 rows (3,121 occurrences), and the oddly specific value '3.733' is the second mode at 151 rows. That count matches the 151 'Deaf' rows in NaturalName, so the spike may be one group's value repeated across many rows; this is worth checking against the raw data. With 842 distinct values and an entropy ratio of 0.58, the distribution is concentrated but long-tailed, and the null rate is negligible at 0.07% (5 rows).

Treatment: Cast to numeric and consider a zero-inflated transform (e.g., log1p with a zero indicator) before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[218]:

saturn.columns["PercentChristianPGAC"].stats

stat           value
n              7,124
nulls          5 (0.1%)
unique         842
top_value      0.000
top_rate       0.4384
cardinality    842
entropy        5.681
entropy_ratio  0.5846
Fig 75.
Top values for PercentChristianPGAC.
Top values for PercentChristianPGAC (20 unique shown, of 842 total).
value  count  share
0.000  3121   43.8%
3.733  151    2.1%
1.000  87     1.2%
2.000  79     1.1%
0.001  75     1.1%
0.006  69     1.0%
5.000  65     0.9%
3.000  59     0.8%
0.010  55     0.8%
4.000  55     0.8%
0.500  50     0.7%
0.002  46     0.6%
0.005  45     0.6%
0.030  45     0.6%
0.049  44     0.6%
0.100  39     0.5%
0.200  37     0.5%
1.592  37     0.5%
0.009  36     0.5%
1.771  36     0.5%

PercentEvangelical categorical feature

PercentEvangelical reads as a numeric share of evangelicals stored as strings, with 401 distinct values across 7,124 rows. The distribution is heavily zero-inflated: exactly "0.000" appears in 4,195 rows, which is 58.9% of all rows and 65.7% of the non-null ones, and another 10.4% are null, leaving a long tail of small fractions like 0.100, 0.200, 0.500. An entropy ratio of 0.364 confirms that most of the mass collapses onto that single zero bucket.

Treatment: Cast to float, impute the 10.4% nulls, and consider a zero-vs-nonzero indicator alongside the raw value to handle the zero inflation.

anthropic:claude-opus-4-7 · confidence high
Out[221]:

saturn.columns["PercentEvangelical"].stats

stat              value
n                 7,124
nulls             741 (10.4%)
unique            401
top_value         0.000
top_rate          0.6572
cardinality       401
entropy           3.146
entropy_ratio     0.3638
alert: long_tail  256 singleton categories
Fig 76.
Top values for PercentEvangelical.
Top values for PercentEvangelical (20 unique shown, of 401 total).
value  count  share
0.000  4195   58.9%
0.100  171    2.4%
0.200  154    2.2%
0.500  135    1.9%
1.000  133    1.9%
0.300  108    1.5%
2.000  92     1.3%
0.400  71     1.0%
0.050  60     0.8%
0.010  60     0.8%
0.800  46     0.6%
1.500  40     0.6%
0.900  38     0.5%
0.600  35     0.5%
0.030  32     0.4%
0.700  29     0.4%
0.020  27     0.4%
0.080  25     0.4%
0.001  25     0.4%
0.006  24     0.3%
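
The stats table reports a top_rate of 0.6572 while the value table lists the same '0.000' bucket at 58.9%. The two figures appear to use different denominators (non-null rows versus all rows). That reading is an inference, and `shares` below is a hypothetical helper that simply reproduces both numbers.

```python
def shares(count, n_rows, n_null):
    """Return (share of all rows, share of non-null rows) for one value's count."""
    return count / n_rows, count / (n_rows - n_null)


# '0.000' occurs 4195 times; 741 of the 7124 rows are null.
share_all, share_present = shares(count=4195, n_rows=7124, n_null=741)
print(round(share_all, 3), round(share_present, 4))
```

When a column has many nulls, it helps to say which denominator a quoted percentage uses.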

PercentEvangelicalPC categorical feature

PercentEvangelicalPC appears to be a numeric percentage (likely an evangelical population share; the meaning of the PC suffix is not stated) that has been stored as strings, yielding 166 distinct values across 7,124 rows with a 2.15% null rate. The distribution is concentrated: the top value '0.199' covers 869 rows (12.2% of all rows, 12.47% of non-null rows), and the leading entries cluster near zero ('0.095', '0.000', '0.004'), yet some values reach above 3 ('3.409', '3.339'), suggesting a long right tail. An entropy ratio of 0.78 indicates the values are fairly spread out despite that leading spike.

Treatment: Cast to float, impute the ~2% nulls, and consider log or rank transform given the right-tailed values.

anthropic:claude-opus-4-7 · confidence medium
Out[224]:

saturn.columns["PercentEvangelicalPC"].stats

stat           value
n              7,124
nulls          153 (2.1%)
unique         166
top_value      0.199
top_rate       0.1247
cardinality    166
entropy        5.777
entropy_ratio  0.7833
Fig 77.
Top values for PercentEvangelicalPC.
Top values for PercentEvangelicalPC (20 unique shown, of 166 total).
value  count  share
0.199  869    12.2%
0.095  586    8.2%
0.000  441    6.2%
0.247  352    4.9%
0.004  315    4.4%
3.409  311    4.4%
3.339  169    2.4%
2.656  167    2.3%
0.003  151    2.1%
0.001  131    1.8%
0.866  128    1.8%
1.699  114    1.6%
0.472  111    1.6%
0.028  96     1.3%
1.315  90     1.3%
1.509  87     1.2%
0.439  86     1.2%
0.036  78     1.1%
0.012  77     1.1%
0.197  76     1.1%

PercentEvangelicalPGAC categorical feature

Numeric percentages (likely an evangelical share at the PGAC level) stored as strings, hence profiled as categorical with 548 distinct values. The distribution is heavily zero-inflated: '0.000' appears in 3,264 rows (45.8% of all 7,124 rows, 48.9% of non-null rows), with a secondary spike at '1.801' (151 rows). That count equals the 151-row spike at '3.733' in PercentChristianPGAC, so both may come from one repeated group; the profile alone cannot confirm this. The null rate is 6.32% and the entropy ratio is 0.55, consistent with a long tail of small fractional values.

Treatment: Cast to float, impute or flag nulls, and consider a zero-indicator plus log/sqrt transform given the heavy zero mass.

anthropic:claude-opus-4-7 · confidence high
Out[227]:

saturn.columns["PercentEvangelicalPGAC"].stats

stat           value
n              7,124
nulls          450 (6.3%)
unique         548
top_value      0.000
top_rate       0.4891
cardinality    548
entropy        4.972
entropy_ratio  0.5465
Fig 78.
Top values for PercentEvangelicalPGAC.
Top values for PercentEvangelicalPGAC (20 unique shown, of 548 total).
value  count  share
0.000  3264   45.8%
1.801  151    2.1%
0.002  98     1.4%
0.001  93     1.3%
0.006  68     1.0%
0.004  66     0.9%
1.000  59     0.8%
0.016  58     0.8%
0.200  42     0.6%
0.010  42     0.6%
0.100  41     0.6%
0.005  41     0.6%
0.049  40     0.6%
0.500  37     0.5%
0.104  37     0.5%
0.074  36     0.5%
2.000  36     0.5%
1.543  36     0.5%
0.007  35     0.5%
0.003  34     0.5%

PCBuddhism numeric feature

PCBuddhism appears to be a percentage feature measuring the Buddhist share for each record (PC here most plausibly abbreviates 'percent'), ranging from 0 to 100 with mean 6.41. The distribution is extremely zero-inflated: 82.99% of non-null values are exactly 0, so the entire IQR collapses to 0 and every non-zero value (17.01%, 1,208 rows) is flagged as an outlier; skew is 3.48 and kurtosis 10.56. Buddhism is therefore absent from most records, but the non-zero values are far from negligible: 255 rows sit at 95% or above.

Treatment: Treat as zero-inflated proportion: add a presence indicator and log1p-transform the non-zero tail before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[230]:

saturn.columns["PCBuddhism"].stats

stat              value
n                 7,124
nulls             24 (0.3%)
unique            809
min               0
max               100
mean              6.411
median            0
std               22.39
q1                0
q3                0
iqr               0
skew              3.475
kurtosis          10.56
n_outliers        1,208
outlier_rate      0.1701
zero_rate         0.8299
alert: high_skew  skew=+3.48
alert: outliers   17.0% rows beyond 1.5 IQR
Fig 79.
Distribution of PCBuddhism. Vertical dash marks the median.
Histogram bins for PCBuddhism (median: 0.0).
bin          count
0 – 2.5      6435
2.5 – 5      27
5 – 7.5      19
7.5 – 10     7
10 – 12.5    25
12.5 – 15    3
15 – 17.5    13
17.5 – 20    3
20 – 22.5    14
22.5 – 25    4
25 – 27.5    6
27.5 – 30    3
30 – 32.5    21
32.5 – 35    5
35 – 37.5    9
37.5 – 40    3
40 – 42.5    22
42.5 – 45    4
45 – 47.5    2
47.5 – 50    5
50 – 52.5    4
52.5 – 55    3
55 – 57.5    10
57.5 – 60    10
60 – 62.5    16
62.5 – 65    3
65 – 67.5    9
67.5 – 70    24
70 – 72.5    18
72.5 – 75    7
75 – 77.5    6
77.5 – 80    12
80 – 82.5    14
82.5 – 85    8
85 – 87.5    13
87.5 – 90    20
90 – 92.5    29
92.5 – 95    9
95 – 97.5    76
97.5 – 100   179
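
A presence indicator plus a log1p transform, as the PCBuddhism treatment suggests, might be built like this. The sample values and the `present` / `log1p` column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy percentages, not real rows; None stands in for the column's few nulls.
pc = pd.Series([0.0, 0.0, 12.5, 98.0, None], name="PCBuddhism")

features = pd.DataFrame({
    "present": pc.gt(0),      # False for zeros and for missing values
    "log1p": np.log1p(pc),    # log(1 + x): maps 0 to 0 and leaves NaN as NaN
})
print(features)
```

Because the non-zero values pile up near 100 (179 rows fall in the top bin), log1p compresses the scale but does not remove that second mode, so the indicator alone may carry most of the signal.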

PCEthnicReligions numeric feature

PCEthnicReligions is a numeric percentage-style feature (0–100) capturing the share of some ethnic-religion category, likely per record/region. It's overwhelmingly zero — 78% of values are 0 and the entire interquartile range collapses to 0 — yet the mean is 13.1 with std 30.7, indicating a small set of records carry very large shares. Skew of 2.16 and a 22% outlier rate confirm a sparse, heavy-tailed distribution rather than a smooth continuum.

Treatment: Binarize (zero vs non-zero) or apply a zero-inflated/log1p transform before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[233]:

saturn.columns["PCEthnicReligions"].stats

stat              value
n                 7,124
nulls             18 (0.3%)
unique            351
min               0
max               100
mean              13.11
median            0
std               30.74
q1                0
q3                0
iqr               0
skew              2.155
kurtosis          2.885
n_outliers        1,560
outlier_rate      0.2195
zero_rate         0.7805
alert: high_skew  skew=+2.16
alert: outliers   22.0% rows beyond 1.5 IQR
Fig 80.
Distribution of PCEthnicReligions. Vertical dash marks the median.
Histogram bins for PCEthnicReligions (median: 0.0).
bin          count
0 – 2.5      5603
2.5 – 5      76
5 – 7.5      108
7.5 – 10     48
10 – 12.5    71
12.5 – 15    15
15 – 17.5    30
17.5 – 20    20
20 – 22.5    45
22.5 – 25    11
25 – 27.5    27
27.5 – 30    14
30 – 32.5    42
32.5 – 35    4
35 – 37.5    15
37.5 – 40    9
40 – 42.5    23
42.5 – 45    1
45 – 47.5    13
47.5 – 50    5
50 – 52.5    7
52.5 – 55    5
55 – 57.5    6
57.5 – 60    10
60 – 62.5    21
62.5 – 65    3
65 – 67.5    26
67.5 – 70    12
70 – 72.5    30
72.5 – 75    3
75 – 77.5    8
77.5 – 80    16
80 – 82.5    40
82.5 – 85    18
85 – 87.5    44
87.5 – 90    18
90 – 92.5    56
92.5 – 95    57
95 – 97.5    223
97.5 – 100   323
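
The outlier alert reads "rows beyond 1.5 IQR", which matches the usual Tukey fence rule. Assuming that rule (the quantile method Saturn uses is not stated), the sketch below shows why a zero IQR turns every non-zero value into an outlier.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Toy sample: 80% zeros, like the column; q1 = q3 = 0, so both fences sit at 0.
sample = [0] * 8 + [50, 100]
flagged = tukey_outliers(sample)
print(flagged)
```

In this column the outlier_rate (0.2195) and zero_rate (0.7805) sum to 1, so the alert restates that about 22% of non-null values are non-zero rather than pointing at anomalies.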

PCHinduism numeric feature

This column appears to be the percentage share of Hindus for each record, ranging from 0 to 100 with a mean of 29.8. The distribution is strongly bimodal: 67.7% of non-null values are exactly zero while Q3 sits at 98.4, indicating that most records have no Hindu presence and a substantial minority are almost entirely Hindu. Skew is 0.87 and kurtosis -1.22, consistent with this U-shaped split rather than a single peak.

Treatment: Consider a zero-vs-nonzero indicator plus the raw percentage, since a flat numeric treatment will hide the bimodal structure.

anthropic:claude-opus-4-7 · confidence high
Out[236]:

saturn.columns["PCHinduism"].stats

stat          value
n             7,124
nulls         24 (0.3%)
unique        1,131
min           0
max           100
mean          29.82
median        0
std           44.98
q1            0
q3            98.42
iqr           98.42
skew          0.8721
kurtosis      -1.216
n_outliers    0
outlier_rate  0
zero_rate     0.6768
Fig 81.
Distribution of PCHinduism. Vertical dash marks the median.
Histogram bins for PCHinduism (median: 0.0).
bin          count
0 – 2.5      4856
2.5 – 5      10
5 – 7.5      14
7.5 – 10     2
10 – 12.5    9
12.5 – 15    6
15 – 17.5    7
17.5 – 20    3
20 – 22.5    10
22.5 – 25    8
25 – 27.5    4
27.5 – 30    4
30 – 32.5    5
32.5 – 35    6
35 – 37.5    2
37.5 – 40    2
40 – 42.5    4
42.5 – 45    5
45 – 47.5    0
47.5 – 50    3
50 – 52.5    6
52.5 – 55    1
55 – 57.5    8
57.5 – 60    4
60 – 62.5    7
62.5 – 65    1
65 – 67.5    8
67.5 – 70    7
70 – 72.5    12
72.5 – 75    5
75 – 77.5    5
77.5 – 80    9
80 – 82.5    9
82.5 – 85    12
85 – 87.5    11
87.5 – 90    20
90 – 92.5    23
92.5 – 95    31
95 – 97.5    117
97.5 – 100   1844

PCOtherSmall numeric feature

PCOtherSmall is a numeric feature where 88% of rows are zero and the IQR is zero, meaning the bottom three quartiles are all 0. The remaining mass stretches up to 100 with mean 1.84 and std 12.33, producing severe right skew (7.39) and very heavy tails (kurtosis 54.18). About 12% of rows (851) flag as outliers, suggesting this is a sparse share/percentage indicator that fires only for a small subset of records.

Treatment: Binarize presence (>0) or apply log1p before modelling to tame the skew.

anthropic:claude-opus-4-7 · confidence high
Out[239]:

saturn.columns["PCOtherSmall"].stats

stat              value
n                 7,124
nulls             24 (0.3%)
unique            670
min               0
max               100
mean              1.836
median            0
std               12.33
q1                0
q3                0
iqr               0
skew              7.39
kurtosis          54.18
n_outliers        851
outlier_rate      0.1199
zero_rate         0.8801
alert: high_skew  skew=+7.39
alert: outliers   12.0% rows beyond 1.5 IQR
Fig 82.
Distribution of PCOtherSmall. Vertical dash marks the median.
Histogram bins for PCOtherSmall (median: 0.0).
bin          count
0 – 2.5      6854
2.5 – 5      30
5 – 7.5      24
7.5 – 10     11
10 – 12.5    12
12.5 – 15    5
15 – 17.5    4
17.5 – 20    4
20 – 22.5    1
22.5 – 25    21
25 – 27.5    2
27.5 – 30    3
30 – 32.5    4
32.5 – 35    0
35 – 37.5    1
37.5 – 40    3
40 – 42.5    3
42.5 – 45    0
45 – 47.5    0
47.5 – 50    0
50 – 52.5    0
52.5 – 55    0
55 – 57.5    3
57.5 – 60    2
60 – 62.5    3
62.5 – 65    0
65 – 67.5    0
67.5 – 70    3
70 – 72.5    3
72.5 – 75    0
75 – 77.5    1
77.5 – 80    0
80 – 82.5    2
82.5 – 85    2
85 – 87.5    2
87.5 – 90    0
90 – 92.5    2
92.5 – 95    0
95 – 97.5    2
97.5 – 100   93

RegionCode numeric feature

RegionCode holds 12 distinct integer values from 1 to 12 with no nulls, so it is almost certainly a categorical region identifier stored as a number rather than a true numeric measure. The distribution is concentrated around the median of 4 with an IQR of just 2, yet the right skew of 1.12 and 601 flagged outliers (8.4%) reflect the long tail of higher-numbered regions rather than genuine anomalies.

Treatment: Cast to categorical and one-hot or target-encode; do not treat as a continuous numeric.

anthropic:claude-opus-4-7 · confidence high
Out[242]:

saturn.columns["RegionCode"].stats

stat             value
n                7,124
nulls            0 (0.0%)
unique           12
min              1
max              12
mean             5.005
median           4
std              2.457
q1               4
q3               6
iqr              2
skew             1.122
kurtosis         0.5775
n_outliers       601
outlier_rate     0.08436
zero_rate        0
alert: outliers  8.4% rows beyond 1.5 IQR
Fig 83.
Distribution of RegionCode. Vertical dash marks the median.
Histogram bins for RegionCode (median: 4.0).
bin            count
1 – 1.275      75
1.275 – 1.55   0
1.55 – 1.825   0
1.825 – 2.1    726
2.1 – 2.375    0
2.375 – 2.65   0
2.65 – 2.925   0
2.925 – 3.2    521
3.2 – 3.475    0
3.475 – 3.75   0
3.75 – 4.025   3349
4.025 – 4.3    0
4.3 – 4.575    0
4.575 – 4.85   0
4.85 – 5.125   352
5.125 – 5.4    0
5.4 – 5.675    0
5.675 – 5.95   0
5.95 – 6.225   444
6.225 – 6.5    0
6.5 – 6.775    0
6.775 – 7.05   373
7.05 – 7.325   0
7.325 – 7.6    0
7.6 – 7.875    0
7.875 – 8.15   460
8.15 – 8.425   0
8.425 – 8.7    0
8.7 – 8.975    0
8.975 – 9.25   223
9.25 – 9.525   0
9.525 – 9.8    0
9.8 – 10.08    320
10.08 – 10.35  0
10.35 – 10.62  0
10.62 – 10.9   0
10.9 – 11.18   121
11.18 – 11.45  0
11.45 – 11.73  0
11.73 – 12     160
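
Casting RegionCode to a categorical and one-hot encoding it could look like the following with pandas. The toy rows are invented; declaring the full 1 to 12 code range up front is a design assumption so that every data split yields the same 12 columns.

```python
import pandas as pd

df = pd.DataFrame({"RegionCode": [4, 4, 6, 12, 1]})  # toy rows, not real data

# Fix the category set to codes 1..12 so unseen codes still get a column.
region_dtype = pd.CategoricalDtype(categories=list(range(1, 13)))
region = df["RegionCode"].astype(region_dtype)

one_hot = pd.get_dummies(region, prefix="RegionCode")
print(one_hot.columns.tolist())
```

The integer codes have no meaningful order or distance, so summary stats such as the mean of 5.005 or the 601 flagged outliers do not describe anything real about the regions.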

PopulationPGAC numeric feature

PopulationPGAC appears to be a population count at the PGAC level (the abbreviation is not defined in the profile), spanning 10 to roughly 925 million across 7,124 rows with only 0.07% nulls. The distribution is extraordinarily right-skewed (skew 25.5, kurtosis 1,052): the median is 130,300 while the mean is 4.88 million, and 17.8% of rows are flagged as outliers. With 1,509 unique values across 7,124 rows, the same population figures repeat heavily, suggesting many rows share the same aggregate.

Treatment: log-transform before regression to tame the extreme right skew.

anthropic:claude-opus-4-7 · confidence medium
Out[245]:

saturn.columns["PopulationPGAC"].stats

stat              value
n                 7,124
nulls             5 (0.1%)
unique            1,509
min               10
max               9.251e+08
mean              4.881e+06
median            130,300
std               2.095e+07
q1                20,000
q3                1.435e+06
iqr               1.415e+06
skew              25.48
kurtosis          1052
n_outliers        1,264
outlier_rate      0.1776
zero_rate         0
alert: high_skew  skew=+25.48
alert: outliers   17.8% rows beyond 1.5 IQR
Fig 84.
Distribution of PopulationPGAC. Vertical dash marks the median.
Histogram bins for PopulationPGAC (median: 130300.0).
bin                     count
10 – 2.313e+07          6626
2.313e+07 – 4.626e+07   327
4.626e+07 – 6.938e+07   104
6.938e+07 – 9.251e+07   9
9.251e+07 – 1.156e+08   0
1.156e+08 – 1.388e+08   51
1.388e+08 – 9.02e+08    0 (33 consecutive empty bins)
9.02e+08 – 9.251e+08    2
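
The effect of the suggested log transform can be checked on a handful of values. The five numbers below are the reported min, q1, median, q3 and (approximate) max, used as a stand-in sample rather than real rows, and `skewness` is a plain moment-based estimate that may differ from the estimator Saturn uses.

```python
import math


def skewness(xs):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Stand-in sample built from the summary stats above; not real rows.
pops = [10, 20_000, 130_300, 1_435_000, 925_100_000]
logs = [math.log10(p) for p in pops]
print(round(skewness(pops), 2), round(skewness(logs), 2))
```

Since the minimum is 10 and zero_rate is 0, a plain log10 is defined for every non-null row; log1p would only be needed if zeros could appear.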

Frontier categorical feature

Binary Y/N flag indicating whether a record is on the frontier, with no nulls across 7124 rows. The split is imbalanced toward Y at 66.9% (4767) versus N at 2357, though entropy ratio of 0.92 shows both classes are well represented.

Treatment: Encode as a 0/1 indicator before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[248]:

saturn.columns["Frontier"].stats

stat           value
n              7,124
nulls          0 (0.0%)
unique         2
top_value      Y
top_rate       0.6691
cardinality    2
entropy        0.9158
entropy_ratio  0.9158
Fig 85.
Top values for Frontier.
Top values for Frontier (2 unique shown, of 2 total).
value  count  share
Y      4767   66.9%
N      2357   33.1%

MapAddress text foreign_key

MapAddress holds single-token PNG filenames (e.g. 'm00328.png'), with a one_word_rate of 1.0 and a maximum length of 13, suggesting it points to a map image asset. 1,500 of 7,124 rows are empty strings, and they account for most of the 0.352 duplicate_rate: of the 2,508 duplicates, 1,499 are repeats of the empty string. Among the 5,624 non-empty rows there are 4,615 distinct filenames, so about 1,009 rows (roughly 18%) reuse a map that another record already references. The field therefore behaves like a foreign reference to a finite set of map images rather than free text.

Treatment: Treat as a categorical asset reference: impute or flag the 1500 empties and join to a map-image lookup.

anthropic:claude-opus-4-7 · confidence high
Out[251]:

saturn.columns["MapAddress"].stats

stat                     value
n                        7,124
nulls                    0 (0.0%)
unique                   4,616
len_min                  0
len_max                  13
len_mean                 8.649
len_median               10
len_p95                  13
word_mean                1
word_median              1
n_empty                  1,500
n_duplicates             2,508
duplicate_rate           0.352
vocab_size               4,615
readability_flesch_mean  17.62
emoji_rate               0
url_rate                 0
one_word_rate            1
allcaps_rate             0
boilerplate_rate         0
alert: one_word          100.0% rows are a single word
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        35.2% duplicate strings
Fig 86.
Character-length distribution for MapAddress.
Character-length distribution for MapAddress (mean: 8.648652442448062).
chars  count
0      1500
10     3833
13     1791
(all other length bins are empty)

HasJesusFilm categorical feature

Binary Y/N flag indicating whether the Jesus Film is available for the entity (likely a language or people group). Heavily skewed toward 'Y' at 78.7% (5,610 of 7,124), with no nulls across all 7,124 rows. Entropy of 0.746 reflects the imbalance but still leaves a usable minority class of 1,514 'N' values.

Treatment: Encode as 0/1 boolean; account for class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[254]:

saturn.columns["HasJesusFilm"].stats

stat           value
n              7,124
nulls          0 (0.0%)
unique         2
top_value      Y
top_rate       0.7875
cardinality    2
entropy        0.7463
entropy_ratio  0.7463
Fig 87.
Top values for HasJesusFilm.
Top values for HasJesusFilm (2 unique shown, of 2 total).
value  count  share
Y      5610   78.7%
N      1514   21.3%

Nomadic categorical feature

Binary Y/N flag indicating nomadic status, with no nulls across 7124 rows. Severely imbalanced: 'N' dominates at 96.6% (6884 rows) versus only 240 'Y' cases, yielding a low entropy ratio of 0.21.

Treatment: Encode as binary; consider class-weighting or stratified sampling due to severe imbalance.

anthropic:claude-opus-4-7 · confidence high
Out[257]:

saturn.columns["Nomadic"].stats

stat              value
n                 7,124
nulls             0 (0.0%)
unique            2
top_value         N
top_rate          0.9663
cardinality       2
entropy           0.2126
entropy_ratio     0.2126
alert: imbalance  top value is 96.6% of rows
Fig 88.
Top values for Nomadic.
Top values for Nomadic (2 unique shown, of 2 total).
value  count  share
N      6884   96.6%
Y      240    3.4%
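
If Nomadic is used as a target, the class weighting mentioned in the treatment can be derived directly from the counts. The sketch uses the common "balanced" heuristic, n / (k * count); choosing that heuristic is an assumption, since the profile does not prescribe one.

```python
def balanced_class_weights(counts):
    """Weight each class by n / (k * count), so rarer classes weigh more."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Counts from the Nomadic table: N = 6884, Y = 240.
weights = balanced_class_weights({"N": 6884, "Y": 240})
print({label: round(w, 4) for label, w in weights.items()})
```

With these counts a 'Y' row weighs roughly 29 times as much as an 'N' row, which is worth sanity-checking against only 240 positive examples.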

NomadicTypeDescription categorical feature

This is a low-cardinality categorical describing the type of nomadic livelihood, with only 6 distinct values dominated by 'Agro-Pastoralists' (76.7% of non-nulls, 184 records). The column is almost entirely empty: null_rate is 0.9663, leaving exactly 240 populated rows out of 7,124, the same count as the 'Y' rows in Nomadic. Several values are comma-joined combinations (e.g., 'Agro-Pastoralists, Service or Trade'), suggesting the field encodes multi-label memberships as concatenated strings.

Treatment: Split comma-separated values into multi-hot indicators and treat missingness as its own category given the 96.6% null rate.

anthropic:claude-opus-4-7 · confidence high
Out[260]:

saturn.columns["NomadicTypeDescription"].stats

stat              value
n                 7,124
nulls             6,884 (96.6%)
unique            6
top_value         Agro-Pastoralists
top_rate          0.7667
cardinality       6
entropy           1.159
entropy_ratio     0.4483
alert: null_rate  96.6% null
Fig 89.
Top values for NomadicTypeDescription.
Top values for NomadicTypeDescription (6 unique shown, of 6 total).
value                                 count  share
Agro-Pastoralists                     184    2.6%
Service or Trade                      30     0.4%
Agro-Pastoralists, Service or Trade   16     0.2%
Hunter-Gatherers                      8      0.1%
Agro-Pastoralists, Hunter-Gatherers   1      0.0%
Service or Trade, Hunter-Gatherers    1      0.0%
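
Splitting the comma-joined values into multi-hot indicators might look like this. The three base labels come from the table above; `multi_hot` and its `", "` separator are assumptions about how the strings are joined.

```python
LABELS = ["Agro-Pastoralists", "Service or Trade", "Hunter-Gatherers"]


def multi_hot(value, labels=LABELS, sep=", "):
    """Return a 0/1 indicator per label; None or empty input gives all zeros."""
    parts = set(value.split(sep)) if value else set()
    return {label: int(label in parts) for label in labels}


row = multi_hot("Agro-Pastoralists, Service or Trade")
empty = multi_hot(None)
print(row)
print(empty)
```

The 6,884 nulls equal the number of 'N' rows in Nomadic, so an all-zero row probably just means "not nomadic"; that correspondence should be verified on the raw data before dropping a separate missing flag.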

PhotoCCVersionText categorical metadata

Creative Commons license version attached to a photo (e.g., 'CC BY 2.0', 'CC BY-SA 4.0'). The field is dominated by empty strings at 79.8% of 7124 rows, with only 16 distinct values and entropy ratio 0.33, so license metadata is missing for the vast majority of records. Among populated values, 'CC BY 2.0' (387) and 'CC BY-SA 4.0' (246) lead.

Treatment: Treat empty string as missing and group rare licenses; use as a low-cardinality categorical only where photo licensing matters.

anthropic:claude-opus-4-7 · confidence high
Out[263]:

saturn.columns["PhotoCCVersionText"].stats

stat           value
n              7,124
nulls          0 (0.0%)
unique         16
top_value      (empty string)
top_rate       0.7984
cardinality    16
entropy        1.323
entropy_ratio  0.3307
Fig 90.
Top values for PhotoCCVersionText.
Top values for PhotoCCVersionText (16 unique shown, of 16 total).
value            count  share
(empty string)   5688   79.8%
CC BY 2.0        387    5.4%
CC BY-SA 4.0     246    3.5%
CC BY-SA 2.0     193    2.7%
CC BY-NC-SA 2.0  151    2.1%
CC BY-SA 3.0     143    2.0%
CC0 1.0          127    1.8%
CC BY-NC 2.0     111    1.6%
CC BY 3.0        27     0.4%
CC BY-NC-ND 2.0  18     0.3%
CC BY 4.0        14     0.2%
CC SA 1.0        7      0.1%
CC BY-ND 2.0     5      0.1%
CC BY 3.0 BR     4      0.1%
CC BY-SA 2.5     2      0.0%
CC BY-NC-SA 4.0  1      0.0%
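
Because this column reports 0% nulls while about 80% of its values are empty strings, converting blanks to real missing values is a useful first step. A minimal pandas sketch on invented sample rows:

```python
import pandas as pd

# Toy rows, not real data; "" stands in for the blank license entries.
lic = pd.Series(["", "CC BY 2.0", "CC BY-SA 4.0", "", "CC SA 1.0"],
                name="PhotoCCVersionText")

# mask() replaces entries where the condition holds with NaN.
lic_clean = lic.mask(lic.eq(""))
print(lic_clean.isna().mean())
```

After this step, null-rate based checks treat the blanks as missing, which matches how the column behaves in practice.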

PhotoCCVersionURL categorical metadata

This column holds the URL of the Creative Commons license version applied to an associated photo, drawn from a fixed set of 16 distinct license URIs. About 79.8% of rows (5688 of 7124) are empty strings, so the field is sparsely populated; among the populated minority, CC BY 2.0 (387) and CC BY-SA 4.0 (246) dominate. Entropy ratio of 0.33 confirms heavy concentration on the blank value.

Treatment: Treat empty strings as missing and collapse to a categorical license code before any modelling.

anthropic:claude-opus-4-7 · confidence high
Out[266]:

saturn.columns["PhotoCCVersionURL"].stats

stat           value
n              7,124
nulls          0 (0.0%)
unique         16
top_value      (empty string)
top_rate       0.7984
cardinality    16
entropy        1.323
entropy_ratio  0.3307
Fig 91.
Top values for PhotoCCVersionURL.
Top values for PhotoCCVersionURL (16 unique shown, of 16 total).
value                                                   count  share
(empty string)                                          5688   79.8%
https://creativecommons.org/licenses/by/2.0/            387    5.4%
https://creativecommons.org/licenses/by-sa/4.0/         246    3.5%
https://creativecommons.org/licenses/by-sa/2.0/         193    2.7%
https://creativecommons.org/licenses/by-nc-sa/2.0/      151    2.1%
https://creativecommons.org/licenses/by-sa/3.0/         143    2.0%
https://creativecommons.org/publicdomain/zero/1.0/      127    1.8%
https://creativecommons.org/licenses/by-nc/2.0/         111    1.6%
https://creativecommons.org/licenses/by/3.0/            27     0.4%
https://creativecommons.org/licenses/by-nc-nd/2.0/      18     0.3%
https://creativecommons.org/licenses/by/4.0/            14     0.2%
https://creativecommons.org/licenses/by-sa/1.0/         7      0.1%
https://creativecommons.org/licenses/by-nd/2.0/         5      0.1%
https://creativecommons.org/licenses/by/3.0/br/deed.en  4      0.1%
https://creativecommons.org/licenses/by-sa/2.5/         2      0.0%
https://creativecommons.org/licenses/by-nc-sa/4.0       1      0.0%

MapCredits categorical metadata

Attribution string crediting the data, geography, and design sources for each map (e.g. Joshua Project, GMI, UNESCO, IMB). With 161 distinct values across 7124 rows, the top credit covers 28% of records and a blank string is the second most common value at 1505 rows; near-duplicates differing only by trailing punctuation (the same Omid/UNESCO credit appears with and without a final period) inflate cardinality.

Treatment: Normalise whitespace/punctuation to collapse near-duplicates, then drop from modelling as boilerplate provenance.

anthropic:claude-opus-4-7 · confidence high
Out[269]:

saturn.columns["MapCredits"].stats

stat              value
n                 7,124
nulls             0 (0.0%)
unique            161
top_value         People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project
top_rate          0.28
cardinality       161
entropy           3.318
entropy_ratio     0.4527
alert: long_tail  99 singleton categories
Fig 92.
Top values for MapCredits.
Top values for MapCredits (20 unique shown, of 161 total).
value  count  share
People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project  1995  28.0%
(empty string)  1505  21.1%
Location: IMB. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.  808  11.3%
People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project.  755  10.6%
People Group Location: Omid. Other geography / data: GMI. Map Design: Joshua Project  583  8.2%
Bethany World Prayer Center  408  5.7%
Joshua Project / Global Mapping International  335  4.7%
Bryan Nicholson / cartoMission  100  1.4%
Location: SIL / WLMS. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.  77  1.1%
Anonymous  70  1.0%
Location: WLMS. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.  47  0.7%
NCRP  44  0.6%
Location: Web research. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.  31  0.4%
Asia Harvest-Operation Myanmar  26  0.4%
Location: World Jewish Congress, Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.  26  0.4%
Location: Joshua Project. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.  23  0.3%
Southeast Asia Link - SEALINK  21  0.3%
Location: Ethnologue. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.  18  0.3%
Location: Asia Harvest. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.  10  0.1%
Peoples of the Red Book  8  0.1%
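
The near-duplicate credits that differ only by a final period can be collapsed with a small normaliser. `normalize_credit` is a hypothetical helper; the two strings are the Omid/UNESCO credits listed in the table above.

```python
import re


def normalize_credit(text):
    """Collapse runs of whitespace and drop trailing periods."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(".").rstrip()


a = "People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project"
b = a + "."  # the variant with a final period
print(normalize_credit(a) == normalize_credit(b))
```

Applied to the two rows above, this would merge 1,995 and 755 records into one credit covering 2,750 rows; other variants (for example a comma where a period is expected) would need their own rules.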

MapCreditURL categorical metadata

This column holds attribution URLs for source maps, but 6,919 of 7,124 rows (top_rate 0.9712) are empty strings, leaving 205 populated rows spread over 30 distinct URLs (the 31st distinct value is the empty string). Among the populated entries, cartomission.com dominates with 100 occurrences while most other URLs appear fewer than 10 times, producing a very long tail. An entropy ratio of 0.054 confirms there is almost no information here unless the empty string itself is treated as a meaningful 'no credit' signal.

Treatment: Keep as provenance metadata; do not use as a model feature given 97% blanks.

anthropic:claude-opus-4-7 · confidence high
Out[272]:

saturn.columns["MapCreditURL"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 31
top_value ""
top_rate 0.9712
cardinality 31
entropy 0.27
entropy_ratio 0.05449
alert: long_tail 19 singleton categories
alert: imbalance top value is 97.1% of rows
Fig 93.
Top values for MapCreditURL.
Top values for MapCreditURL (20 unique shown, of 31 total).
value count share
"" 6919 97.1%
https://www.cartomission.com 100 1.4%
https://www.asiaharvest.org 28 0.4%
https://www.worldjewishcongress.org/ 26 0.4%
https://www.eki.ee/books/redbook/introduction.shtml 8 0.1%
https://commons.wikimedia.org/wiki/File:Maeneo_penye_wasemaji_wa_Kiswahili.png 7 0.1%
https://www.npolar.no/ansipra/english/Indexpages/Map_index.html 5 0.1%
https://www.face-music.ch/bi_bid/trad_costumes_en.html 3 0.0%
https://thekurds.net/ 3 0.0%
https://www.cartpioneers.org/products/Peoples-of-Yemen-Prayer-Guide.html 2 0.0%
http://lingvarium.org/ 2 0.0%
https://www.westmelanesia.com/ 2 0.0%
https://www.lib.utexas.edu/maps/africa/libya_ethnic_1974.jpg 1 0.0%
https://commons.wikimedia.org/wiki/File:Libya_ethnic.svg 1 0.0%
https://www.ssb.no/en/statbank/table/09817/tableViewLayout1/ 1 0.0%
https://commons.wikimedia.org/wiki/File:Alawites_in_the_Levant.jpg 1 0.0%
https://zolimacitymag.com/keeping-hakka-culture-alive-the-story-of-hong-kongs-mountain-pioneers/ 1 0.0%
https://www.cia.gov/the-world-factbook/countries/china/map/ 1 0.0%
https://commons.wikimedia.org/wiki/File:Albanians_in_Kosovo_2011_census.GIF 1 0.0%
https://www.refworld.org/docid/4a8414f5c.html 1 0.0%
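
Because missingness in this column is encoded as empty strings rather than nulls, a small normalisation step helps before any counting. The following is a minimal sketch in plain Python; the helper name `normalize_blank` and the toy `sample` list are illustrative assumptions, not part of saturn or of the dataset.

```python
from collections import Counter

def normalize_blank(value):
    """Return None for None or whitespace-only strings, else the stripped string."""
    if value is None:
        return None
    text = value.strip()
    return text or None

# Toy sample made up for illustration; these are not real rows from the file.
sample = ["", "https://www.cartomission.com", "", "  ", "https://www.asiaharvest.org", ""]
cleaned = [normalize_blank(v) for v in sample]

# Share of rows with no credit URL once blanks are treated as missing.
missing_rate = cleaned.count(None) / len(cleaned)
# Frequency table over the populated rows only.
populated = Counter(v for v in cleaned if v is not None)

print(f"missing rate: {missing_rate:.2f}")
print(populated.most_common())
```

Counting over the populated rows only keeps the 97% blank share from dominating the frequency table.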

MapCopyright categorical feature

A near-binary flag (N/Y) with a third state being an empty string, almost certainly indicating whether map copyright applies. 'N' dominates at 72.95% (5197/7124), blanks account for 1885 rows, and only 42 records are 'Y' — a severe class imbalance that makes the affirmative case nearly negligible.

Treatment: Normalize blanks to an explicit missing category (fold them into 'N' only if a blank is known to mean no copyright) and treat the column as a low-signal flag given the 42-row positive class.

anthropic:claude-opus-4-7 · confidence high
Out[275]:

saturn.columns["MapCopyright"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 3
top_value N
top_rate 0.7295
cardinality 3
entropy 0.8831
entropy_ratio 0.5572
Fig 94.
Top values for MapCopyright.
Top values for MapCopyright (3 unique shown, of 3 total).
value count share
N 5197 73.0%
"" 1885 26.5%
Y 42 0.6%
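
The entropy figures above can be reproduced from the three counts. The sketch below assumes that saturn reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2(cardinality); that definition is inferred from the reported numbers, not taken from saturn's documentation, and the function names are illustrative.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column carries no information
    return shannon_entropy_bits(counts) / math.log2(k)

# Counts for N, empty string, and Y taken from the MapCopyright table.
map_copyright_counts = [5197, 1885, 42]
h = shannon_entropy_bits(map_copyright_counts)
r = entropy_ratio(map_copyright_counts)
print(f"entropy={h:.4f} bits, ratio={r:.4f}")
```

Under these assumptions the results should land close to the reported entropy of 0.8831 and entropy_ratio of 0.5572.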

MapCCVersionText categorical metadata

This appears to be a Creative Commons license version field for maps, but it is effectively empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving only 10 rows with actual licenses split across CC BY-SA 3.0 (8), CC0 1.0 (1), and CC BY 3.0 (1). Entropy is just 0.0166 (entropy_ratio 0.0083), so the column carries almost no information despite having 0% nulls — the missingness is encoded as empty strings rather than NaN.

Treatment: Drop; near-constant blank with only 10 informative rows.

anthropic:claude-opus-4-7 · confidence high
Out[278]:

saturn.columns["MapCCVersionText"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 4
top_value ""
top_rate 0.9986
cardinality 4
entropy 0.01662
entropy_ratio 0.00831
alert: imbalance top value is 99.9% of rows
Fig 95.
Top values for MapCCVersionText.
Top values for MapCCVersionText (4 unique shown, of 4 total).
value count share
"" 7114 99.9%
CC BY-SA 3.0 8 0.1%
CC0 1.0 1 0.0%
CC BY 3.0 1 0.0%

MapCCVersionURL categorical metadata

MapCCVersionURL appears to hold a Creative Commons license URL associated with each map record, but it is essentially empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving just 10 rows split across three CC license URLs. Entropy is 0.017 (ratio 0.008), so the column carries almost no information despite having 4 distinct values and zero nulls (the missingness is encoded as "" rather than null).

Treatment: Drop; near-constant with empty-string standing in for missing.

anthropic:claude-opus-4-7 · confidence high
Out[281]:

saturn.columns["MapCCVersionURL"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 4
top_value ""
top_rate 0.9986
cardinality 4
entropy 0.01662
entropy_ratio 0.00831
alert: imbalance top value is 99.9% of rows
Fig 96.
Top values for MapCCVersionURL.
Top values for MapCCVersionURL (4 unique shown, of 4 total).
value count share
"" 7114 99.9%
https://creativecommons.org/licenses/by-sa/3.0/ 8 0.1%
https://creativecommons.org/publicdomain/zero/1.0/ 1 0.0%
https://creativecommons.org/licenses/by/3.0/ 1 0.0%

JF categorical feature

JF is a binary Y/N flag with no nulls across 7124 rows. The distribution is imbalanced: "Y" accounts for 78.7% (5610 rows) versus 1514 "N", giving an entropy ratio of 0.746. The column name is opaque, so the semantic meaning of the flag is not recoverable from the evidence.

Treatment: Encode as a 0/1 indicator; consider class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[284]:

saturn.columns["JF"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 2
top_value Y
top_rate 0.7875
cardinality 2
entropy 0.7463
entropy_ratio 0.7463
Fig 97.
Top values for JF.
Top values for JF (2 unique shown, of 2 total).
value count share
Y 5610 78.7%
N 1514 21.3%
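
The 0/1 encoding suggested in the treatment can be sketched as follows. This is a plain-Python illustration, not saturn functionality; `encode_yes_no` is a hypothetical helper, and the list is rebuilt from the counts in the table above rather than read from the file.

```python
def encode_yes_no(values):
    """Map 'Y' to 1 and 'N' to 0, failing loudly on anything else."""
    mapping = {"Y": 1, "N": 0}
    encoded = []
    for v in values:
        if v not in mapping:
            raise ValueError(f"unexpected flag value: {v!r}")
        encoded.append(mapping[v])
    return encoded

# Rebuild the column from the reported counts (5610 Y, 1514 N).
flags = ["Y"] * 5610 + ["N"] * 1514
encoded = encode_yes_no(flags)

# The share of the positive class makes the imbalance explicit.
positive_share = sum(encoded) / len(encoded)
print(f"positive share: {positive_share:.4f}")
```

Raising on unexpected values is a deliberate choice here: other flags in this dataset use the empty string as a third state, and silently mapping it to 0 would hide that.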

AudioRecordings categorical feature

Binary Y/N flag indicating whether audio recordings exist for each row, with no nulls across 7124 records. The distribution is heavily imbalanced toward 'Y' at 86.9% (6188 vs 936), giving an entropy ratio of 0.56.

Treatment: Encode as a 0/1 indicator; be mindful of class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[287]:

saturn.columns["AudioRecordings"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 2
top_value Y
top_rate 0.8686
cardinality 2
entropy 0.5612
entropy_ratio 0.5612
Fig 98.
Top values for AudioRecordings.
Top values for AudioRecordings (2 unique shown, of 2 total).
value count share
Y 6188 86.9%
N 936 13.1%

Window1040 categorical feature

Window1040 is a binary Y/N flag covering all 7124 rows with no nulls. The distribution is imbalanced: 'Y' accounts for 5910 rows (top_rate 0.8296) versus 1214 'N', giving an entropy ratio of 0.659. The column's semantic meaning isn't recoverable from the evidence, but it behaves like a clean indicator variable.

Treatment: Encode as a 0/1 indicator and watch for class imbalance when used as a predictor.

anthropic:claude-opus-4-7 · confidence high
Out[290]:

saturn.columns["Window1040"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 2
top_value Y
top_rate 0.8296
cardinality 2
entropy 0.6586
entropy_ratio 0.6586
Fig 99.
Top values for Window1040.
Top values for Window1040 (2 unique shown, of 2 total).
value count share
Y 5910 83.0%
N 1214 17.0%

PeopleGroupMapURL text metadata

This column holds URLs to people-group map images hosted on joshuaproject.net, with every value being a single token (one_word_rate 1.0, url_rate 0.79). Of the 7,124 rows, 1,500 are empty strings and 2,508 are duplicates (duplicate_rate 0.35, a count that includes the repeated empty strings), meaning many people groups share the same map image (e.g., m00328.png appears 40 times). With 4,616 unique values across 7,124 rows, this is a reference link rather than a unique key.

Treatment: Keep as a display/reference URL; drop from modelling features.

anthropic:claude-opus-4-7 · confidence high
Out[293]:

saturn.columns["PeopleGroupMapURL"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 4,616
len_min 0
len_max 66
len_mean 50.49
len_median 63
len_p95 66
word_mean 1
word_median 1
n_empty 1,500
n_duplicates 2,508
duplicate_rate 0.352
vocab_size 4,615
readability_flesch_mean -568.7
emoji_rate 0
url_rate 0.7894
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: url_heavy 78.9% rows contain a URL
alert: duplicates 35.2% duplicate strings
Fig 100.
Character-length distribution for PeopleGroupMapURL.
Character-length distribution for PeopleGroupMapURL (mean: 50.49).
chars count
0 – 2 1500
63 – 64 3833
64 – 66 1791
(all bins between 2 and 63 characters are empty)

PeopleGroupMapExpandedURL text metadata

This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net, with 72.3% of rows containing a URL and every value being a single token. 1,975 rows (about 27.7%) are empty strings, and 2,793 rows (39.2%) duplicate another value — e.g. m00328.pdf appears 40 times — suggesting many people groups share the same regional map.

Treatment: Treat as a reference link; drop from modelling or extract the map ID if joining to a maps table.

anthropic:claude-opus-4-7 · confidence high
Out[296]:

saturn.columns["PeopleGroupMapExpandedURL"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 4,331
len_min 0
len_max 66
len_mean 46.2
len_median 63
len_p95 66
word_mean 1
word_median 1
n_empty 1,975
n_duplicates 2,793
duplicate_rate 0.3921
vocab_size 4,330
readability_flesch_mean -468.9
emoji_rate 0
url_rate 0.7228
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: url_heavy 72.3% rows contain a URL
alert: duplicates 39.2% duplicate strings
Fig 101.
Character-length distribution for PeopleGroupMapExpandedURL.
Character-length distribution for PeopleGroupMapExpandedURL (mean: 46.20).
chars count
0 – 2 1975
63 – 64 3579
64 – 66 1570
(all bins between 2 and 63 characters are empty)
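
If the map ID is needed as a join key, it can be pulled from the file name at the end of the URL. The sketch below assumes IDs look like `m00328` followed by a `.pdf` or `.png` extension, as in the example quoted in the prose; the `example.org` URLs are hypothetical placeholders, since the full paths are not shown in this report.

```python
import re

# Matches a trailing file name such as m00328.pdf or m00328.png.
MAP_FILE = re.compile(r"(?P<map_id>m\d+)\.(?:png|pdf)$", re.IGNORECASE)

def extract_map_id(url):
    """Return the map ID (e.g. 'm00328') or None for empty or unmatched values."""
    if not url:
        return None
    match = MAP_FILE.search(url)
    return match.group("map_id").lower() if match else None

example_urls = [
    "https://example.org/maps/m00328.pdf",  # hypothetical path
    "https://example.org/maps/m00328.png",  # hypothetical path
    "",                                     # empty string, as in 27.7% of rows
]
map_ids = [extract_map_id(u) for u in example_urls]
print(map_ids)
```

Returning None for empty strings keeps the blank rows out of any later join instead of matching them to each other.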

PeopleGroupURL text identifier

This column holds Joshua Project people-group URLs, one per row, with every value a 48-character single-token https link (url_rate 1.0, one_word_rate 1.0, len_min and len_max both 48). All 7124 values are unique with zero nulls or duplicates, so it functions as a per-row identifier rather than a feature. The URLs encode a people-group ID and a country code suffix (e.g., /10375/tz, /10375/up), meaning the same group recurs across countries in the underlying key even though the full URL is unique.

Treatment: Drop from modelling; retain as a row-level link key or parse out the people-group ID and country code as separate features.

anthropic:claude-opus-4-7 · confidence high
Out[299]:

saturn.columns["PeopleGroupURL"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 7,124
len_min 48
len_max 48
len_mean 48
len_median 48
len_p95 48
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 7,124
readability_flesch_mean -479.9
emoji_rate 0
url_rate 1
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: url_heavy 100.0% rows contain a URL
Fig 102.
Character-length distribution for PeopleGroupURL.
Character-length distribution for PeopleGroupURL (mean: 48.0).
chars count
48 – 48 7124
(every value is exactly 48 characters, so all other bins are empty)
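
Splitting the URL into a people-group ID and a country code, as the treatment suggests, might look like the sketch below. The example URL is an assumption built from the `/10375/tz` suffix quoted in the prose and the fixed 48-character length; the exact path layout and casing have not been confirmed against the raw data, and `split_people_group_url` is a hypothetical helper.

```python
from urllib.parse import urlparse

def split_people_group_url(url):
    """Return (people_group_id, country_code) from the last two path segments."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"unexpected URL shape: {url!r}")
    people_id, country_code = parts[-2], parts[-1]
    return people_id, country_code.upper()

# Assumed URL shape; only the trailing "/10375/tz" is attested in the report.
example = "https://joshuaproject.net/people_groups/10375/TZ"
people_id, country_code = split_people_group_url(example)
print(people_id, country_code)
```

Keeping the two parts as separate columns lets the same people-group ID be grouped across countries while the full URL stays available as the row-level key.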

PeopleGroupPhotoURL text metadata

This column holds Joshua Project people-group photo URLs, with every populated cell being a single joshuaproject.net/assets/media/profiles/photos/.jpg link (url_rate 0.72, one_word_rate 1.0). 1971 of 7124 rows are empty strings (no nulls reported), and the same image URLs repeat heavily — duplicate_rate is 0.60 with only 2880 unique values, the top URL appearing 90 times. The same photo is clearly being reused across many people-group records rather than being a unique per-row asset.

Treatment: Treat as an optional asset link; drop or replace empty strings with null and do not use as a feature.

anthropic:claude-opus-4-7 · confidence high
Out[302]:

saturn.columns["PeopleGroupPhotoURL"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 2,880
len_min 0
len_max 68
len_mean 47.04
len_median 65
len_p95 65
word_mean 1
word_median 1
n_empty 1,971
n_duplicates 4,244
duplicate_rate 0.5957
vocab_size 2,879
readability_flesch_mean -604.3
emoji_rate 0
url_rate 0.7233
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: url_heavy 72.3% rows contain a URL
alert: duplicates 59.6% duplicate strings
Fig 103.
Character-length distribution for PeopleGroupPhotoURL.
Character-length distribution for PeopleGroupPhotoURL (mean: 47.04).
chars count
0 – 2 1971
65 – 66 5092
66 – 68 61
(all bins between 2 and 65 characters are empty)

CountryURL categorical foreign_key

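Country-level URLs pointing to joshuaproject.net profile pages, with a two-letter country code as the final path segment. There are 202 distinct values across 7,124 rows and no nulls, but the distribution is heavily concentrated: India alone accounts for 28.5% of rows (2,032), with Pakistan (767) a distant second. An entropy ratio of 0.66 confirms moderate skew toward a handful of countries, led by South Asia. The codes appear to follow a FIPS-style scheme rather than ISO 3166 (CH, BM, and VM all rank in the top 20, which reads as China, Burma, and Vietnam under FIPS 10-4), so check the code system before joining to ISO-keyed tables.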

Treatment: Extract the trailing country code as a categorical key; treat the URL itself as redundant metadata.

anthropic:claude-opus-4-7 · confidence high
Out[305]:

saturn.columns["CountryURL"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 202
top_value https://joshuaproject.net/countries/IN
top_rate 0.2852
cardinality 202
entropy 5.058
entropy_ratio 0.6605
Fig 104.
Top values for CountryURL.
Top values for CountryURL (20 unique shown, of 202 total).
value count share
https://joshuaproject.net/countries/IN 2032 28.5%
https://joshuaproject.net/countries/PK 767 10.8%
https://joshuaproject.net/countries/CH 442 6.2%
https://joshuaproject.net/countries/BG 256 3.6%
https://joshuaproject.net/countries/ID 234 3.3%
https://joshuaproject.net/countries/NP 184 2.6%
https://joshuaproject.net/countries/SU 168 2.4%
https://joshuaproject.net/countries/LA 142 2.0%
https://joshuaproject.net/countries/RS 115 1.6%
https://joshuaproject.net/countries/US 90 1.3%
https://joshuaproject.net/countries/IR 85 1.2%
https://joshuaproject.net/countries/CD 81 1.1%
https://joshuaproject.net/countries/MY 78 1.1%
https://joshuaproject.net/countries/TH 73 1.0%
https://joshuaproject.net/countries/VM 69 1.0%
https://joshuaproject.net/countries/TU 61 0.9%
https://joshuaproject.net/countries/BM 59 0.8%
https://joshuaproject.net/countries/AF 58 0.8%
https://joshuaproject.net/countries/CE 55 0.8%
https://joshuaproject.net/countries/CA 52 0.7%
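
Extracting the trailing code as a categorical key takes one string split. A minimal sketch, using the URL shape shown in the table; the helper name `country_code` and the three-row list are illustrative.

```python
from collections import Counter

def country_code(url):
    """Return the last path segment of a country URL, upper-cased."""
    return url.rstrip("/").rsplit("/", 1)[-1].upper()

# Three illustrative rows using the URL shape shown in the table above.
urls = [
    "https://joshuaproject.net/countries/IN",
    "https://joshuaproject.net/countries/IN",
    "https://joshuaproject.net/countries/PK",
]
code_counts = Counter(country_code(u) for u in urls)
print(code_counts.most_common())
```

Once the code is extracted, the URL column itself can be dropped, since it is fully recoverable from the code.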

JPScaleText categorical metadata

JPScaleText is a categorical field that holds a single value, "Unreached", across all 7124 rows with no nulls. With cardinality of 1 and entropy of 0, it carries no information and cannot discriminate between records. The constant value suggests this dataset has been pre-filtered to unreached people groups only.

Treatment: Drop; constant column with zero entropy.

anthropic:claude-opus-4-7 · confidence high
Out[308]:

saturn.columns["JPScaleText"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 1
top_value Unreached
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance top value is 100.0% of rows
Fig 105.
Top values for JPScaleText.
Top values for JPScaleText (1 unique shown, of 1 total).
value count share
Unreached 7124 100.0%
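
Constant columns such as this one can be found mechanically before modelling. The sketch below works on rows represented as dictionaries; the two toy rows are invented for illustration, and `constant_columns` is a hypothetical helper rather than a saturn function. It assumes cell values are hashable.

```python
def constant_columns(rows):
    """Return the sorted names of columns that hold a single distinct value."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            seen.setdefault(name, set()).add(value)
    return sorted(name for name, values in seen.items() if len(values) == 1)

# Toy rows invented for illustration; the column names come from the report.
rows = [
    {"JPScaleText": "Unreached", "LeastReached": "Y", "Window1040": "Y"},
    {"JPScaleText": "Unreached", "LeastReached": "Y", "Window1040": "N"},
]
to_drop = constant_columns(rows)
print(to_drop)
```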

JPScaleImageURL categorical metadata

Every one of the 7,124 rows holds the same URL, https://joshuaproject.net/assets/img/gauge/gauge-1.png, giving a single unique value and zero entropy. This looks like a static asset link (a JP Scale gauge image) attached to each record rather than a discriminating feature. It carries no information for analysis or modelling.

Treatment: Drop; constant column with a single value across all rows.

anthropic:claude-opus-4-7 · confidence high
Out[311]:

saturn.columns["JPScaleImageURL"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 1
top_value https://joshuaproject.net/assets/img/gauge/gauge-1.png
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance top value is 100.0% of rows
Fig 106.
Top values for JPScaleImageURL.
Top values for JPScaleImageURL (1 unique shown, of 1 total).
value count share
https://joshuaproject.net/assets/img/gauge/gauge-1.png 7124 100.0%

Summary text free_text

Free-text English summaries describing South Asian people groups (Rajputs, Jats, Bania, Beldar, etc.), averaging 51 words with a median length of 316 characters. Coverage is poor: 3,167 of 7,124 rows (44%) are empty strings, and 3,439 rows (a 48% duplicate rate) repeat a value seen elsewhere, leaving 3,685 unique values. Most of those duplicates are the repeated empty string itself, which is also what drives the one_word alert (one_word_rate 0.4446 equals the empty share). Several near-identical Rajput paragraphs differ by only a word or two, suggesting lightly edited copies of the same source text rather than independent summaries. A Flesch readability score of 30.4 indicates fairly difficult prose.

Treatment: Treat the 3,167 empty rows as missing and deduplicate near-identical entries before tokenizing and embedding.

anthropic:claude-opus-4-7 · confidence high
Out[314]:

saturn.columns["Summary"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 3,685
len_min 0
len_max 1,212
len_mean 309.7
len_median 316
len_p95 793
word_mean 51.26
word_median 52
n_empty 3,167
n_duplicates 3,439
duplicate_rate 0.4827
vocab_size 24,501
readability_flesch_mean 30.4
emoji_rate 0
url_rate 0
one_word_rate 0.4446
allcaps_rate 0
boilerplate_rate 0.0002807
alert: one_word 44.5% rows are a single word
alert: duplicates 48.3% duplicate strings
Fig 107.
Character-length distribution for Summary.
Character-length distribution for Summary (mean: 309.70).
chars count
0 – 30 3167
30 – 61 1
61 – 91 1
91 – 121 7
121 – 152 16
152 – 182 21
182 – 212 40
212 – 242 60
242 – 273 76
273 – 303 110
303 – 333 144
333 – 364 173
364 – 394 162
394 – 424 198
424 – 454 208
454 – 485 206
485 – 515 243
515 – 545 234
545 – 576 207
576 – 606 224
606 – 636 253
636 – 667 244
667 – 697 196
697 – 727 251
727 – 758 143
758 – 788 165
788 – 818 85
818 – 848 84
848 – 879 60
879 – 909 29
909 – 939 18
939 – 970 9
970 – 1000 10
1000 – 1030 21
1030 – 1060 50
1060 – 1091 3
1091 – 1121 2
1121 – 1151 1
1151 – 1182 0
1182 – 1212 2
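
The duplicate figures for Summary can be checked with simple arithmetic. The sketch assumes that n_duplicates is defined as n minus unique and that the empty string counts as one of the unique values; both assumptions are inferred from the reported numbers, not from saturn's documentation.

```python
def duplicate_stats(n_rows, n_unique):
    """Return (n_duplicates, duplicate_rate) assuming duplicates = n - unique."""
    n_duplicates = n_rows - n_unique
    return n_duplicates, n_duplicates / n_rows

# Figures taken from the Summary stats table.
n_dup, dup_rate = duplicate_stats(7124, 3685)

# Every empty string after the first counts as a duplicate of it.
n_empty = 3167
repeated_empties = n_empty - 1
# What remains are rows that repeat some non-empty text.
repeated_text = n_dup - repeated_empties

print(n_dup, round(dup_rate, 4), repeated_empties, repeated_text)
```

Under these assumptions only a small share of the flagged duplicates are repeated non-empty passages; most are blanks.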

Obstacles text free_text

Free-text English prose describing barriers to Christian evangelism among various people groups (Rajputs, Jats, Bosniaks, Azeri, etc.), averaging 18 words and 107 characters per entry. Notably, 3167 of 7124 rows are empty strings and the duplicate rate is 0.489, with a single Rajput-pride passage repeated 88 times and a near-identical Jat passage appearing as both 74- and 7-count variants. Readability is low (Flesch 31.6) and vocabulary is modest (9760 unique words), consistent with a templated missiological description field.

Treatment: Treat empties as missing, dedupe near-identical passages, then tokenize and embed for downstream topic or similarity analysis.

anthropic:claude-opus-4-7 · confidence high
Out[317]:

saturn.columns["Obstacles"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 3,641
len_min 0
len_max 726
len_mean 106.9
len_median 95
len_p95 317
word_mean 18.37
word_median 16
n_empty 3,167
n_duplicates 3,483
duplicate_rate 0.4889
vocab_size 9,760
readability_flesch_mean 31.62
emoji_rate 0
url_rate 0
one_word_rate 0.4446
allcaps_rate 0
boilerplate_rate 0.0009826
alert: one_word 44.5% rows are a single word
alert: duplicates 48.9% duplicate strings
Fig 108.
Character-length distribution for Obstacles.
Character-length distribution for Obstacles (mean: 106.86).
chars count
0 – 18 3167
18 – 36 1
36 – 54 24
54 – 73 103
73 – 91 198
91 – 109 293
109 – 127 407
127 – 145 420
145 – 163 353
163 – 182 355
182 – 200 265
200 – 218 223
218 – 236 173
236 – 254 161
254 – 272 232
272 – 290 87
290 – 309 222
309 – 327 194
327 – 345 69
345 – 363 39
363 – 381 31
381 – 399 24
399 – 417 11
417 – 436 8
436 – 454 9
454 – 472 13
472 – 490 12
490 – 508 4
508 – 526 7
526 – 544 3
544 – 563 4
563 – 581 4
581 – 599 1
599 – 617 0
617 – 635 2
635 – 653 1
653 – 672 0
672 – 690 2
690 – 708 1
708 – 726 1
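
Exact-match deduplication will not merge passages that differ by a word or two, such as the Jat variants mentioned above. The sketch below groups near-identical strings with the standard library's `difflib.SequenceMatcher`; the 0.9 threshold and the three example passages are invented for illustration, and the greedy pairwise comparison is quadratic, so it suits a few thousand distinct strings rather than very large corpora.

```python
from difflib import SequenceMatcher

def group_near_duplicates(texts, threshold=0.9):
    """Greedily group texts whose similarity to a group's first member meets the threshold."""
    groups = []  # each group is a list; its first element is the representative
    for text in texts:
        for group in groups:
            if SequenceMatcher(None, group[0], text).ratio() >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups

# Invented passages, not quotations from the dataset.
passages = [
    "Their pride in their heritage can keep them from listening.",
    "Their pride in their heritage may keep them from listening.",
    "Few workers speak the local language.",
]
groups = group_near_duplicates(passages)
print([len(g) for g in groups])
```

Running this on the distinct non-empty values only, rather than on all rows, keeps the comparison count manageable.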

HowReach text free_text

Free-text English prose describing outreach and engagement strategies for various people groups, likely a 'how to reach' field in a missions dataset. Over half the rows (3883 of 7124) are empty strings and duplicate_rate is 0.60, with the same Jats and Rajputs paragraphs repeating dozens of times. As a result the median length is 0 characters; word_median is reported as 1 rather than 0, apparently because the profiler counts an empty string as a single word (one_word_rate 0.5451 matches the empty share). Readability is low (Flesch 27.3) and the vocabulary reaches 7803 tokens across the non-empty rows.

Treatment: Treat empty strings as missing, deduplicate boilerplate, then tokenize and embed for downstream NLP.

anthropic:claude-opus-4-7 · confidence high
Out[320]:

saturn.columns["HowReach"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 2,853
len_min 0
len_max 599
len_mean 80.82
len_median 0
len_p95 260
word_mean 14.08
word_median 1
n_empty 3,883
n_duplicates 4,271
duplicate_rate 0.5995
vocab_size 7,803
readability_flesch_mean 27.34
emoji_rate 0
url_rate 0
one_word_rate 0.5451
allcaps_rate 0
boilerplate_rate 0.0002807
alert: one_word 54.5% rows are a single word
alert: duplicates 60.0% duplicate strings
Fig 109.
Character-length distribution for HowReach.
Character-length distribution for HowReach (mean: 80.82).
chars count
0 – 15 3883
15 – 30 0
30 – 45 1
45 – 60 7
60 – 75 61
75 – 90 156
90 – 105 252
105 – 120 244
120 – 135 277
135 – 150 269
150 – 165 274
165 – 180 233
180 – 195 219
195 – 210 184
210 – 225 357
225 – 240 172
240 – 255 135
255 – 270 90
270 – 285 103
285 – 300 60
300 – 314 39
314 – 329 22
329 – 344 24
344 – 359 12
359 – 374 9
374 – 389 14
389 – 404 5
404 – 419 4
419 – 434 3
434 – 449 1
449 – 464 3
464 – 479 2
479 – 494 1
494 – 509 3
509 – 524 0
524 – 539 2
539 – 554 2
554 – 569 0
569 – 584 0
584 – 599 1
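
When more than half of a column is blank, length statistics over all rows mostly describe the blanks. A minimal sketch of computing them separately for the populated rows follows; the toy strings are invented for illustration and `length_summary` is a hypothetical helper.

```python
from statistics import median

def length_summary(texts):
    """Compare character-length medians with and without empty strings."""
    all_lengths = [len(t) for t in texts]
    non_empty = [len(t) for t in texts if t.strip()]
    return {
        "median_all": median(all_lengths),
        "median_non_empty": median(non_empty) if non_empty else None,
        "empty_share": 1 - len(non_empty) / len(texts),
    }

# Toy column: three blanks and two short passages (invented text).
toy = ["", "", "", "Pray for workers.", "Share stories in the heart language."]
summary = length_summary(toy)
print(summary)
```

Reporting both medians side by side makes it clear whether a value of 0 reflects short text or simply missing text.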

PrayForChurch text free_text

Free-text prayer prompts for an unreached-people-group / church-planting dataset, written in English (1473 detected) and centered on words like 'pray', 'Christ', 'among'. The field is sparsely populated: 5032 of 7124 rows are empty and only 1713 unique strings exist, giving a 0.76 duplicate rate as the same boilerplate prayer is reused across people groups (top non-empty value repeats 146 times). Readability is low (Flesch 19.5) and length varies wildly from 0 to 649 chars, so the column is a mix of nothing, one-liners, and full paragraphs.

Treatment: Treat as optional long-form text: mark empties as missing and tokenize/embed the rest before any modelling.

anthropic:claude-opus-4-7 · confidence high
Out[323]:

saturn.columns["PrayForChurch"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 1,713
len_min 0
len_max 649
len_mean 59.63
len_median 0
len_p95 286
word_mean 11.19
word_median 1
n_empty 5,032
n_duplicates 5,411
duplicate_rate 0.7595
vocab_size 4,447
readability_flesch_mean 19.5
emoji_rate 0
url_rate 0
one_word_rate 0.7063
allcaps_rate 0
boilerplate_rate 0
alert: one_word 70.6% rows are a single word
alert: duplicates 76.0% duplicate strings
Fig 110.
Character-length distribution for PrayForChurch.
Character-length distribution for PrayForChurch (mean: 59.63).
chars count
0 – 16 5032
16 – 32 0
32 – 49 1
49 – 65 16
65 – 81 45
81 – 97 104
97 – 114 109
114 – 130 143
130 – 146 160
146 – 162 184
162 – 178 125
178 – 195 315
195 – 211 103
211 – 227 115
227 – 243 91
243 – 260 92
260 – 276 53
276 – 292 101
292 – 308 52
308 – 324 115
324 – 341 17
341 – 357 27
357 – 373 23
373 – 389 33
389 – 406 15
406 – 422 3
422 – 438 13
438 – 454 9
454 – 471 5
471 – 487 2
487 – 503 6
503 – 519 3
519 – 535 2
535 – 552 2
552 – 568 2
568 – 584 2
584 – 600 1
600 – 617 0
617 – 633 1
633 – 649 2

PrayForPG text free_text

Free-text prayer points for people groups (PG), each entry a short paragraph of intercessions led by the verb 'pray' (5450 occurrences). Nearly half the rows are empty (3405 of 7124) and another large chunk reuse boilerplate templates — duplicate_rate 0.517 with the top non-empty value repeating 88 times — so unique content is far less than the 3441 distinct strings suggest. Readability is low (Flesch mean 32.7) and all detected language is English (2528 rows tagged en).

Treatment: Treat as free-text: drop empties, dedupe boilerplate, then tokenize/embed if used as a feature.

anthropic:claude-opus-4-7 · confidence high
Out[326]:

saturn.columns["PrayForPG"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique 3,441
len_min 0
len_max 937
len_mean 163.1
len_median 120
len_p95 453.8
word_mean 28.23
word_median 20
n_empty 3,405
n_duplicates 3,683
duplicate_rate 0.517
vocab_size 9,291
readability_flesch_mean 32.72
emoji_rate 0
url_rate 0
one_word_rate 0.478
allcaps_rate 0
boilerplate_rate 0
alert: one_word 47.8% rows are a single word
alert: duplicates 51.7% duplicate strings
Fig 111.
Character-length distribution for PrayForPG.
Character-length distribution for PrayForPG (mean: 163.07).
chars count
0 – 23 3406
23 – 47 0
47 – 70 6
70 – 94 47
94 – 117 86
117 – 141 145
141 – 164 136
164 – 187 176
187 – 211 159
211 – 234 208
234 – 258 212
258 – 281 264
281 – 305 274
305 – 328 330
328 – 351 321
351 – 375 243
375 – 398 275
398 – 422 197
422 – 445 240
445 – 468 117
468 – 492 76
492 – 515 59
515 – 539 43
539 – 562 34
562 – 586 22
586 – 609 11
609 – 632 11
632 – 656 6
656 – 679 11
679 – 703 4
703 – 726 3
726 – 750 1
750 – 773 0
773 – 796 0
796 – 820 0
820 – 843 0
843 – 867 0
867 – 890 0
890 – 914 0
914 – 937 1

Resources unknown other

The column is named "Resources" with 7124 rows and zero nulls, but saturn skipped profiling so the kind is unknown and no unique count or value statistics were computed. Without type inference or sample values, its content (numeric, list, text, or identifier) cannot be determined from the evidence.

Treatment: Re-profile or inspect raw samples to establish type before any downstream use.

anthropic:claude-opus-4-7 · confidence low
Out[329]:

saturn.columns["Resources"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique (not computed)
alert: skipped no profiler for kind=unknown

country_data unknown other

The column `country_data` was skipped by the profiler, so its kind is unrecorded and no statistics, uniqueness, or value distribution are available. The only confirmed signals are 7124 rows with a 0.0 null rate. Without further inspection, the contents (likely some country-related payload given the name) cannot be characterised.

Treatment: Re-profile or manually inspect this column before any downstream use.

anthropic:claude-opus-4-7 · confidence low
Out[331]:

saturn.columns["country_data"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique (not computed)
alert: skipped no profiler for kind=unknown

language_data unknown other

The column `language_data` was skipped by the profiler — its kind is unrecognised and no descriptive statistics, uniqueness count, or value samples were emitted. Only the row count (7124) and a null rate of 0.0 are available, so nothing can be said about content, cardinality, or distribution. The name hints at linguistic payloads (possibly nested or serialised), but this is not corroborated by evidence.

Treatment: Re-profile after parsing or casting to a supported type before deciding on use.

anthropic:claude-opus-4-7 · confidence low
Out[333]:

saturn.columns["language_data"].stats

stat value
n 7,124
nulls 0 (0.0%)
unique (not computed)
alert: skipped no profiler for kind=unknown
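
For the three skipped columns (Resources, country_data, language_data), a first inspection step is to tally what kinds of values they actually hold. The sketch below is a plain-Python illustration under the assumption that the values are either nested objects or JSON-encoded strings; that assumption is not confirmed by the profile, and the sample values are invented.

```python
import json
from collections import Counter

def describe_values(values):
    """Tally Python type names, flagging strings that parse as JSON objects or arrays."""
    tally = Counter()
    for value in values:
        label = type(value).__name__
        if isinstance(value, str):
            try:
                parsed = json.loads(value)
            except ValueError:
                pass  # not JSON; keep the plain 'str' label
            else:
                if isinstance(parsed, (dict, list)):
                    label = f"str(json {type(parsed).__name__})"
        tally[label] += 1
    return tally

# Invented sample values standing in for one unprofiled column.
sample_values = [{"a": 1}, '{"a": 1}', "plain text", None, ["x"]]
tally = describe_values(sample_values)
print(tally)
```

If most values turn out to be nested or serialised structures, flattening them into separate columns first would let the profiler assign a supported kind.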

How to cite


BibTeX
@misc{saturn-joshua-project-joshua-project-unreached-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: joshua project joshua project unreached},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/joshua-project-joshua_project_unreached}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: joshua project joshua project unreached. Source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_unreached.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/joshua-project-joshua_project_unreached