saturn·

joshua project joshua project enriched

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_enriched.parquet

Saturn profiled 16,382 rows across 109 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
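!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the parquet file with the flags used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_enriched.parquet",
    "--findings", "joshua-project-joshua_project_enriched.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if saturn exits with a non-zero status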

Summary confidence: high

This is the Joshua Project people-groups dataset: 16,382 rows and 109 columns describing ethnic groups by country, with demographics, language, religion mix, and Christian-engagement indicators. The shape is dominated by categorical and text fields — Continent, AffinityBloc, PrimaryReligion, and the JPScale 'reachedness' rating give the cleanest first read on who is in the file. Population is extremely long-tailed (median 20,000 but max ~913M and skew ~91), so any size analysis should use logs or quantiles rather than means. Religion-share columns like PCIslam, PCHinduism, and PCBuddhism are mostly zero with a minority of groups at very high percentages, which tells you religion is effectively single-dominant per group. Watch out for several columns with very high null rates (RLG4 96%, NomadicTypeDescription 98%, PrimaryLanguageDialect 92%, NTOnline 29%) and many near-duplicate URL/ID fields that won't add analytic value.
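
The advice to use logs or quantiles rather than means can be illustrated with a minimal sketch. The population values below are made up for illustration (they are not rows from the file); they only mimic a long right tail with one very large group.

```python
import math
import statistics

# Hypothetical population values (illustrative only, not taken from the file):
# mostly small groups plus one very large one, mimicking a long right tail.
populations = [300, 1_200, 8_000, 20_000, 45_000, 150_000, 2_000_000, 913_000_000]

mean_pop = statistics.mean(populations)
median_pop = statistics.median(populations)

# Quartile cut points; method="inclusive" treats the list as the full population.
q1, q2, q3 = statistics.quantiles(populations, n=4, method="inclusive")

# log10 compresses the tail so small and large groups fit on one axis.
log_pops = [math.log10(p) for p in populations]

print(f"mean={mean_pop:,.0f}  median={median_pop:,.0f}")
print(f"quartiles={q1:,.0f} / {q2:,.0f} / {q3:,.0f}")
print(f"log10 range={min(log_pops):.2f} to {max(log_pops):.2f}")
```

The single largest value drags the mean far above the median, while the quartiles and the log-scale range stay readable. On the real column the 0.2% of null Population values would need to be dropped or imputed first.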

citing: Continent · AffinityBloc · PrimaryReligion · JPScaleText · Population · PCIslam · PCHinduism · RegionName · LeastReached · Frontier

Out[4]:

saturn.schema() · 109 columns

column kind n null% unique alerts
PeopleID3ROG3 text 16,382 0.0% 16,382 near_unique one_word allcaps short_text
ROG3 categorical 16,382 0.0% 238
PeopleID3 numeric 16,382 0.0% 10,415
ROP3 numeric 16,382 0.1% 10,405
PeopNameInCountry text 16,382 0.0% 10,748 one_word duplicates
ROG2 categorical 16,382 0.0% 7
Continent categorical 16,382 0.0% 7
RegionName categorical 16,382 0.0% 12
ISO3 categorical 16,382 0.0% 238
LocationInCountry text 16,382 45.1% 7,794 multilingual null_rate
PeopleID1 numeric 16,382 0.0% 16
ROP1 categorical 16,382 0.0% 16
AffinityBloc categorical 16,382 0.0% 16
PeopleID2 numeric 16,382 0.0% 267
ROP2 categorical 16,382 0.0% 214
PeopleCluster categorical 16,382 0.0% 267
PeopNameAcrossCountries text 16,382 0.0% 10,377 one_word duplicates
Population numeric 16,382 0.2% 1,708 high_skew outliers
Category categorical 16,382 0.0% 3
ROL3 text 16,382 0.0% 6,164 one_word short_text duplicates
PrimaryLanguageName text 16,382 0.0% 6,153 one_word short_text duplicates
PrimaryLanguageDialect categorical 16,382 92.3% 980 long_tail null_rate
NumberLanguagesSpoken numeric 16,382 0.0% 78 high_skew outliers
OfficialLang categorical 16,382 0.0% 87
SpeakNationalLang unknown 16,382 0.0% skipped
BibleStatus numeric 16,382 0.0% 6
BibleYear categorical 16,382 52.4% 466 null_rate
NTYear text 16,382 30.5% 1,072 one_word allcaps null_rate short_text duplicates
PortionsYear text 16,382 17.9% 1,737 one_word allcaps short_text duplicates
TranslationNeedQuestionable unknown 16,382 0.0% skipped
JPScale numeric 16,382 0.0% 5
JPScalePC categorical 16,382 0.0% 5
JPScalePGAC categorical 16,382 0.0% 5
LeastReached categorical 16,382 0.0% 2
LeastReachedPC categorical 16,382 0.0% 2
LeastReachedPGAC categorical 16,382 0.0% 2
GSEC categorical 16,382 0.0% 8
HasAudioRecordings categorical 16,382 0.0% 2
NTOnline categorical 16,382 28.5% 1 null_rate imbalance
RLG3 numeric 16,382 0.0% 8
RLG3PC numeric 16,382 0.0% 8
RLG3PGAC numeric 16,382 0.0% 8
PrimaryReligion categorical 16,382 0.0% 8
PrimaryReligionPC categorical 16,382 0.0% 8
PrimaryReligionPGAC categorical 16,382 0.0% 8
RLG4 numeric 16,382 96.2% 20 null_rate outliers
ReligionSubdivision categorical 16,382 96.2% 20 null_rate
PCIslam numeric 16,382 0.5% 1,117 outliers
PCNonReligious numeric 16,382 0.4% 223 high_skew outliers
PCUnknown numeric 16,382 0.6% 583 high_skew
SecurityLevel numeric 16,382 0.0% 3
LRTop100 categorical 16,382 0.0% 2 imbalance
PhotoAddress text 16,382 0.0% 5,277 one_word short_text duplicates
PhotoCredits text 16,382 0.1% 1,605 one_word duplicates
PhotoCreditURL text 16,382 33.1% 1,434 one_word url_heavy null_rate duplicates
PhotoCreativeCommons categorical 16,382 0.0% 2
PhotoCopyright categorical 16,382 0.1% 2
PhotoPermission categorical 16,382 0.1% 3
ProfileTextExists categorical 16,382 0.0% 2
CountOfCountries numeric 16,382 0.0% 48 high_skew outliers
CountOfProvinces unknown 16,382 0.0% skipped
Longitude numeric 16,382 0.0% 15,889 high_skew
Latitude numeric 16,382 0.0% 15,851
Ctry categorical 16,382 0.0% 238
IndigenousCode categorical 16,382 0.0% 2
PercentAdherents text 16,382 0.0% 1,248 one_word allcaps short_text duplicates
PercentChristianPC categorical 16,382 0.0% 246
NaturalName text 16,382 0.0% 10,737 one_word duplicates
NaturalPronunciation text 16,382 69.6% 1,933 one_word null_rate duplicates
PercentChristianPGAC text 16,382 0.1% 1,954 one_word allcaps short_text duplicates
PercentEvangelical text 16,382 6.7% 1,047 one_word allcaps short_text duplicates
PercentEvangelicalPC categorical 16,382 1.0% 228
PercentEvangelicalPGAC text 16,382 4.5% 1,624 one_word allcaps short_text duplicates
PCBuddhism numeric 16,382 0.6% 1,052 high_skew outliers
PCEthnicReligions numeric 16,382 0.4% 978 outliers
PCHinduism numeric 16,382 0.6% 1,412 high_skew outliers
PCOtherSmall numeric 16,382 0.6% 908 high_skew outliers
RegionCode numeric 16,382 0.0% 12
PopulationPGAC numeric 16,382 0.1% 2,250 high_skew outliers
Frontier categorical 16,382 0.0% 2
MapAddress text 16,382 0.0% 6,029 one_word short_text duplicates
HasJesusFilm categorical 16,382 0.0% 2
Nomadic categorical 16,382 0.0% 2 imbalance
NomadicTypeDescription categorical 16,382 98.1% 6 null_rate
PhotoCCVersionText categorical 16,382 0.0% 17
PhotoCCVersionURL categorical 16,382 0.0% 17
MapCredits categorical 16,382 0.0% 199 long_tail
MapCreditURL categorical 16,382 0.0% 51 long_tail imbalance
MapCopyright categorical 16,382 0.0% 3
MapCCVersionText categorical 16,382 0.0% 6 imbalance
MapCCVersionURL categorical 16,382 0.0% 6 imbalance
JF categorical 16,382 0.0% 2
AudioRecordings categorical 16,382 0.0% 2
Window1040 categorical 16,382 0.0% 2
PeopleGroupMapURL text 16,382 0.0% 6,029 one_word url_heavy duplicates
PeopleGroupMapExpandedURL text 16,382 0.0% 5,561 one_word url_heavy duplicates
PeopleGroupURL text 16,382 0.0% 16,382 near_unique one_word url_heavy
PeopleGroupPhotoURL text 16,382 0.0% 5,277 one_word url_heavy duplicates
CountryURL categorical 16,382 0.0% 238
JPScaleText categorical 16,382 0.0% 5
JPScaleImageURL categorical 16,382 0.0% 5
Summary text 16,382 0.0% 3,778 one_word duplicates
Obstacles text 16,382 0.0% 3,732 one_word duplicates
HowReach text 16,382 0.0% 2,944 one_word duplicates
PrayForChurch text 16,382 0.0% 1,791 one_word duplicates
PrayForPG text 16,382 0.0% 3,530 one_word duplicates
Resources unknown 16,382 0.0% skipped
country_data unknown 16,382 0.0% skipped
language_data unknown 16,382 0.0% skipped
Fig 1.
Continent · Where the people groups sit geographically — Asia dominates at ~45% of rows.
Top values for Continent (7 unique shown, of 7 total).
value count share
Asia 7368 45.0%
Africa 3635 22.2%
Europe 1532 9.4%
North America 1407 8.6%
Australia 1088 6.6%
South America 905 5.5%
Oceania 447 2.7%
Fig 2.
PrimaryReligion · Religion mix across groups; Christianity and Islam together cover most of the file.
Top values for PrimaryReligion (8 unique shown, of 8 total).
value count share
Christianity 6459 39.4%
Islam 3786 23.1%
Ethnic Religions 2651 16.2%
Hinduism 2338 14.3%
Buddhism 635 3.9%
Non-Religious 200 1.2%
Unknown 189 1.2%
Other / Small 124 0.8%
Fig 3.
JPScaleText · Reachedness rating — note that 'Unreached' is the single largest bucket (~43%).
Top values for JPScaleText (5 unique shown, of 5 total).
value count share
Unreached 7124 43.5%
Partially Reached 3636 22.2%
Significantly Reached 3200 19.5%
Superficially Reached 1413 8.6%
Minimally Reached 1009 6.2%
Fig 4.
Population · Highly skewed group sizes (median 20k, max ~913M); plot on a log scale to see the shape.
Histogram bins for Population (median: 20000.0).
bin count
10 – 2.282e+07 16302
2.282e+07 – 4.565e+07 32
4.565e+07 – 6.847e+07 12
6.847e+07 – 9.13e+07 3
9.13e+07 – 1.141e+08 3
1.141e+08 – 1.369e+08 3
1.369e+08 – 1.598e+08 0
1.598e+08 – 1.826e+08 0
1.826e+08 – 2.054e+08 1
2.054e+08 – 2.282e+08 0
2.282e+08 – 2.511e+08 0
2.511e+08 – 2.739e+08 0
2.739e+08 – 2.967e+08 0
2.967e+08 – 3.195e+08 0
3.195e+08 – 3.424e+08 0
3.424e+08 – 3.652e+08 0
3.652e+08 – 3.88e+08 0
3.88e+08 – 4.108e+08 0
4.108e+08 – 4.337e+08 0
4.337e+08 – 4.565e+08 0
4.565e+08 – 4.793e+08 0
4.793e+08 – 5.021e+08 0
5.021e+08 – 5.249e+08 0
5.249e+08 – 5.478e+08 0
5.478e+08 – 5.706e+08 0
5.706e+08 – 5.934e+08 0
5.934e+08 – 6.162e+08 0
6.162e+08 – 6.391e+08 0
6.391e+08 – 6.619e+08 0
6.619e+08 – 6.847e+08 0
6.847e+08 – 7.075e+08 0
7.075e+08 – 7.304e+08 0
7.304e+08 – 7.532e+08 0
7.532e+08 – 7.76e+08 0
7.76e+08 – 7.988e+08 0
7.988e+08 – 8.217e+08 0
8.217e+08 – 8.445e+08 0
8.445e+08 – 8.673e+08 0
8.673e+08 – 8.901e+08 0
8.901e+08 – 9.13e+08 1
Fig 5.
AffinityBloc · Top-level cultural groupings; South Asian and Sub-Saharan blocs lead.
Top values for AffinityBloc (16 unique shown, of 16 total).
value count share
South Asian Peoples 4178 25.5%
Sub-Saharan Peoples 3073 18.8%
Eurasian Peoples 1593 9.7%
Pacific Islanders 1588 9.7%
Latin-Caribbean Americans 1352 8.3%
Malay Peoples 1031 6.3%
Southeast Asian Peoples 635 3.9%
Arab World 634 3.9%
Tibetan-Himalayan Peoples 453 2.8%
North American Peoples 415 2.5%
East Asian Peoples 402 2.5%
Turkic Peoples 299 1.8%
Persian-Median 228 1.4%
Horn of Africa Peoples 200 1.2%
Deaf 164 1.0%
Jewish 137 0.8%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column kind null%
PeopleID3ROG3 text 0.0%
ROG3 categorical 0.0%
PeopleID3 numeric 0.0%
ROP3 numeric 0.1%
PeopNameInCountry text 0.0%
ROG2 categorical 0.0%
Continent categorical 0.0%
RegionName categorical 0.0%
ISO3 categorical 0.0%
LocationInCountry text 45.1%
PeopleID1 numeric 0.0%
ROP1 categorical 0.0%
AffinityBloc categorical 0.0%
PeopleID2 numeric 0.0%
ROP2 categorical 0.0%
PeopleCluster categorical 0.0%
PeopNameAcrossCountries text 0.0%
Population numeric 0.2%
Category categorical 0.0%
ROL3 text 0.0%
PrimaryLanguageName text 0.0%
PrimaryLanguageDialect categorical 92.3%
NumberLanguagesSpoken numeric 0.0%
OfficialLang categorical 0.0%
SpeakNationalLang unknown 0.0%
BibleStatus numeric 0.0%
BibleYear categorical 52.4%
NTYear text 30.5%
PortionsYear text 17.9%
TranslationNeedQuestionable unknown 0.0%
JPScale numeric 0.0%
JPScalePC categorical 0.0%
JPScalePGAC categorical 0.0%
LeastReached categorical 0.0%
LeastReachedPC categorical 0.0%
LeastReachedPGAC categorical 0.0%
GSEC categorical 0.0%
HasAudioRecordings categorical 0.0%
NTOnline categorical 28.5%
RLG3 numeric 0.0%
RLG3PC numeric 0.0%
RLG3PGAC numeric 0.0%
PrimaryReligion categorical 0.0%
PrimaryReligionPC categorical 0.0%
PrimaryReligionPGAC categorical 0.0%
RLG4 numeric 96.2%
ReligionSubdivision categorical 96.2%
PCIslam numeric 0.5%
PCNonReligious numeric 0.4%
PCUnknown numeric 0.6%
SecurityLevel numeric 0.0%
LRTop100 categorical 0.0%
PhotoAddress text 0.0%
PhotoCredits text 0.1%
PhotoCreditURL text 33.1%
PhotoCreativeCommons categorical 0.0%
PhotoCopyright categorical 0.1%
PhotoPermission categorical 0.1%
ProfileTextExists categorical 0.0%
CountOfCountries numeric 0.0%
CountOfProvinces unknown 0.0%
Longitude numeric 0.0%
Latitude numeric 0.0%
Ctry categorical 0.0%
IndigenousCode categorical 0.0%
PercentAdherents text 0.0%
PercentChristianPC categorical 0.0%
NaturalName text 0.0%
NaturalPronunciation text 69.6%
PercentChristianPGAC text 0.1%
PercentEvangelical text 6.7%
PercentEvangelicalPC categorical 1.0%
PercentEvangelicalPGAC text 4.5%
PCBuddhism numeric 0.6%
PCEthnicReligions numeric 0.4%
PCHinduism numeric 0.6%
PCOtherSmall numeric 0.6%
RegionCode numeric 0.0%
PopulationPGAC numeric 0.1%
Frontier categorical 0.0%
MapAddress text 0.0%
HasJesusFilm categorical 0.0%
Nomadic categorical 0.0%
NomadicTypeDescription categorical 98.1%
PhotoCCVersionText categorical 0.0%
PhotoCCVersionURL categorical 0.0%
MapCredits categorical 0.0%
MapCreditURL categorical 0.0%
MapCopyright categorical 0.0%
MapCCVersionText categorical 0.0%
MapCCVersionURL categorical 0.0%
JF categorical 0.0%
AudioRecordings categorical 0.0%
Window1040 categorical 0.0%
PeopleGroupMapURL text 0.0%
PeopleGroupMapExpandedURL text 0.0%
PeopleGroupURL text 0.0%
PeopleGroupPhotoURL text 0.0%
CountryURL categorical 0.0%
JPScaleText categorical 0.0%
JPScaleImageURL categorical 0.0%
Summary text 0.0%
Obstacles text 0.0%
HowReach text 0.0%
PrayForChurch text 0.0%
PrayForPG text 0.0%
Resources unknown 0.0%
country_data unknown 0.0%
language_data unknown 0.0%
Fig 7.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 7,033 detected strings).
lang count share
en 6996 99.5%
es 12 0.2%
pt 10 0.1%
id 3 0.0%
fr 2 0.0%
la 2 0.0%
it 2 0.0%
de 2 0.0%
sco 1 0.0%
nl 1 0.0%
ceb 1 0.0%
sq 1 0.0%
Fig 8.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
column PeopleID3 ROP3 PeopleID1 PeopleID2 Population NumberLanguagesSpoken BibleStatus JPScale RLG3 RLG3PC RLG3PGAC RLG4
PeopleID3 +1.00 +0.90 +0.28 +0.46 +0.06 +0.16 +0.05 -0.36 +0.32 +0.32 +0.32 +0.09
ROP3 +0.90 +1.00 +0.27 +0.46 -0.09 +0.18 +0.07 -0.36 +0.30 +0.30 +0.29 +0.10
PeopleID1 +0.28 +0.27 +1.00 +0.46 -0.00 +0.15 -0.13 -0.07 +0.16 +0.17 +0.15 -0.08
PeopleID2 +0.46 +0.46 +0.46 +1.00 +0.06 +0.35 +0.20 -0.38 +0.27 +0.31 +0.25 +0.24
Population +0.06 -0.09 -0.00 +0.06 +1.00 -0.01 -0.08 +0.05 -0.05 -0.04 -0.05 +0.07
NumberLanguagesSpoken +0.16 +0.18 +0.15 +0.35 -0.01 +1.00 +0.14 -0.22 +0.17 +0.21 +0.17 +0.12
BibleStatus +0.05 +0.07 -0.13 +0.20 -0.08 +0.14 +1.00 -0.23 +0.12 +0.19 +0.11 +0.25
JPScale -0.36 -0.36 -0.07 -0.38 +0.05 -0.22 -0.23 +1.00 -0.72 -0.71 -0.71 -0.12
RLG3 +0.32 +0.30 +0.16 +0.27 -0.05 +0.17 +0.12 -0.72 +1.00 +0.79 +0.97 +0.09
RLG3PC +0.32 +0.30 +0.17 +0.31 -0.04 +0.21 +0.19 -0.71 +0.79 +1.00 +0.79 +0.17
RLG3PGAC +0.32 +0.29 +0.15 +0.25 -0.05 +0.17 +0.11 -0.71 +0.97 +0.79 +1.00 +0.10
RLG4 +0.09 +0.10 -0.08 +0.24 +0.07 +0.12 +0.25 -0.12 +0.09 +0.17 +0.10 +1.00

PeopleID3ROG3 text identifier

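PeopleID3ROG3 is almost certainly a row-level identifier for a people group within a country, not for an individual person: every one of the 16,382 rows has a unique 7-character single token (n_unique equals n, duplicate_rate 0, len_min and len_max both 7). The sampled values look like a 5-digit numeric prefix followed by a 2-letter suffix (e.g. '10375tz', '10375up'), suggesting a structured composite key rather than a random hash. That fits the column name: a PeopleID3 value (10119 to 22661 in this file) joined to a two-letter ROG3 country code. The samples are quoted in lower case although saturn reports allcaps_rate 1, so check the stored casing before matching on the suffix. No nulls, no boilerplate, no duplicates: clean but useless as a feature.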

Treatment: Drop from modelling; retain only as a join key.
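
A minimal sketch of validating and splitting the key before using it in joins. It assumes the format is five digits followed by two letters, as the two sampled values suggest; the pattern and the helper name `split_key` are illustrative, and whether the two parts really equal PeopleID3 and ROG3 should be confirmed against those columns.

```python
import re

# Pattern assumed from the sampled values: five digits, then two letters.
KEY_PATTERN = re.compile(r"^(\d{5})([A-Za-z]{2})$")

def split_key(key: str) -> tuple[int, str]:
    """Split a composite key into (numeric part, upper-cased two-letter code)."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unexpected key format: {key!r}")
    return int(match.group(1)), match.group(2).upper()

sample_keys = ["10375tz", "10375up"]  # the two samples quoted in the profile
parts = [split_key(k) for k in sample_keys]

# A join key must be unique; on the real column, compare against the row count.
is_unique = len(set(sample_keys)) == len(sample_keys)
print(parts, is_unique)
```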

anthropic:claude-opus-4-7 · confidence high
Out[14]:

saturn.columns["PeopleID3ROG3"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 16,382
len_min 7
len_max 7
len_mean 7
len_median 7
len_p95 7
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 16,382
readability_flesch_mean 115.3
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: near_unique · 100.0% of rows are unique strings
alert: one_word · 100.0% rows are a single word
alert: allcaps · 100.0% rows are all-caps
alert: short_text · 95th-percentile length under 20 chars
Fig 9.
Character-length distribution for PeopleID3ROG3.
Character-length distribution for PeopleID3ROG3 (mean: 7.0).
chars count
7 16382

ROG3 categorical feature

ROG3 holds 238 distinct two-letter country codes, with IN (2,262 rows, 13.8%) leading, followed by PP, ID, PK, CH, and NI. No nulls across 16,382 rows, and the entropy ratio of 0.79 indicates a fairly even spread across many countries rather than concentration in a handful. The codes are not ISO 3166-1 alpha-2: the top-value counts match the ISO3 column row for row (PP 883 = PNG 883, CH 547 = CHN 547, NI 535 = NGA 535), whereas ISO alpha-2 assigns CH to Switzerland and NI to Nicaragua and does not assign PP at all. That pattern is consistent with FIPS 10-4 country codes, so ROG3 is best read as a separate code system rather than a corrupted ISO column.

Treatment: Treat as a high-cardinality categorical: target- or frequency-encode for modelling, and map codes to ISO through the ISO3 column in the same file rather than assuming ISO alpha-2 semantics.
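
One way to reconcile the codes is to build a crosswalk from the file's own ISO3 column and flag any ROG3 code that maps to more than one ISO3 value. The sketch below uses hypothetical row pairs; the pairings (for example PP with PNG) are inferred from the matching row counts in the ROG3 and ISO3 top-values tables, not read from the file.

```python
from collections import defaultdict

# Hypothetical (ROG3, ISO3) row pairs. The pairings are inferred from matching
# counts in the two top-values tables and should be confirmed on real rows.
rows = [("IN", "IND"), ("PP", "PNG"), ("ID", "IDN"), ("IN", "IND"),
        ("CH", "CHN"), ("NI", "NGA"), ("PP", "PNG")]

iso3_by_rog3 = defaultdict(set)
for rog3, iso3 in rows:
    iso3_by_rog3[rog3].add(iso3)

# Codes that map to more than one ISO3 value would need manual review.
ambiguous = {code: vals for code, vals in iso3_by_rog3.items() if len(vals) > 1}

# Keep only the clean one-to-one entries as the crosswalk.
crosswalk = {code: next(iter(vals))
             for code, vals in iso3_by_rog3.items() if len(vals) == 1}

print(crosswalk["PP"], ambiguous)
```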

anthropic:claude-opus-4-7 · confidence high
Out[17]:

saturn.columns["ROG3"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 238
top_value IN
top_rate 0.1381
cardinality 238
entropy 6.225
entropy_ratio 0.7885
Fig 10.
Top values for ROG3.
Top values for ROG3 (20 unique shown, of 238 total).
value count share
IN 2262 13.8%
PP 883 5.4%
ID 788 4.8%
PK 775 4.7%
CH 547 3.3%
NI 535 3.3%
US 496 3.0%
MX 333 2.0%
BR 321 2.0%
CM 292 1.8%
BG 278 1.7%
CA 243 1.5%
CG 231 1.4%
BM 218 1.3%
AS 205 1.3%
RP 200 1.2%
SU 198 1.2%
NP 195 1.2%
LA 184 1.1%
MY 183 1.1%

PeopleID3 numeric foreign_key

PeopleID3 is an integer key ranging from 10119 to 22661 with 10415 unique values across 16382 rows and no nulls. The duplication (about 5967 repeated entries) and the bounded, non-zero range are consistent with a foreign-key reference to a people table rather than a measurement. Distribution is mildly right-skewed (0.37) and platykurtic (-0.98), with no outliers flagged.

Treatment: Treat as a foreign key and left-join to the people dimension; do not use as a numeric feature.
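
A minimal sketch of the left join described above, written with plain dictionaries so it runs without extra packages. The people dimension, its `people_name` field, and the sample rows are all hypothetical; only the PeopleID3 and ROG3 column names come from the profile.

```python
# Hypothetical people dimension keyed by PeopleID3; names are placeholders.
people_dim = {
    10119: {"people_name": "Group A"},
    14962: {"people_name": "Group B"},
}

# Hypothetical fact rows; 22661 deliberately has no match in the dimension.
fact_rows = [
    {"PeopleID3": 10119, "ROG3": "IN"},
    {"PeopleID3": 10119, "ROG3": "PK"},
    {"PeopleID3": 22661, "ROG3": "ID"},
]

def left_join(rows, dim, key):
    """Left join: keep every fact row, fill None where the key has no match."""
    dim_fields = sorted({field for rec in dim.values() for field in rec})
    joined = []
    for row in rows:
        match = dim.get(row[key], {})
        joined.append({**row, **{f: match.get(f) for f in dim_fields}})
    return joined

joined = left_join(fact_rows, people_dim, "PeopleID3")
unmatched = [r["PeopleID3"] for r in joined if r["people_name"] is None]
print(len(joined), unmatched)
```

Listing the unmatched keys after the join is a cheap check that the dimension actually covers the IDs in the fact table.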

anthropic:claude-opus-4-7 · confidence high
Out[20]:

saturn.columns["PeopleID3"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 10,415
min 10,119
max 22,661
mean 1.541e+04
median 14,962
std 3478
q1 12,348
q3 1.833e+04
iqr 5984
skew 0.3697
kurtosis -0.9812
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 11.
Distribution of PeopleID3. Vertical dash marks the median.
Histogram bins for PeopleID3 (median: 14962.0).
bin count
1.012e+04 – 1.043e+04 548
1.043e+04 – 1.075e+04 446
1.075e+04 – 1.106e+04 492
1.106e+04 – 1.137e+04 745
1.137e+04 – 1.169e+04 515
1.169e+04 – 1.2e+04 587
1.2e+04 – 1.231e+04 670
1.231e+04 – 1.263e+04 486
1.263e+04 – 1.294e+04 509
1.294e+04 – 1.325e+04 519
1.325e+04 – 1.357e+04 456
1.357e+04 – 1.388e+04 446
1.388e+04 – 1.42e+04 409
1.42e+04 – 1.451e+04 614
1.451e+04 – 1.482e+04 537
1.482e+04 – 1.514e+04 573
1.514e+04 – 1.545e+04 584
1.545e+04 – 1.576e+04 626
1.576e+04 – 1.608e+04 411
1.608e+04 – 1.639e+04 305
1.639e+04 – 1.67e+04 251
1.67e+04 – 1.702e+04 258
1.702e+04 – 1.733e+04 261
1.733e+04 – 1.764e+04 344
1.764e+04 – 1.796e+04 302
1.796e+04 – 1.827e+04 302
1.827e+04 – 1.858e+04 387
1.858e+04 – 1.89e+04 345
1.89e+04 – 1.921e+04 590
1.921e+04 – 1.953e+04 319
1.953e+04 – 1.984e+04 319
1.984e+04 – 2.015e+04 215
2.015e+04 – 2.047e+04 254
2.047e+04 – 2.078e+04 239
2.078e+04 – 2.109e+04 173
2.109e+04 – 2.141e+04 297
2.141e+04 – 2.172e+04 291
2.172e+04 – 2.203e+04 146
2.203e+04 – 2.235e+04 314
2.235e+04 – 2.266e+04 297

ROP3 numeric identifier

ROP3 is a numeric column tightly bounded between 100004 and 119649 across 16382 rows, with 10405 unique values and almost no nulls (11 rows, 0.07%). The mean (109058.6) and median (108856) sit close together with low skew (0.15) and a slightly platykurtic shape (kurtosis -1.05), and saturn flagged zero outliers. The narrow ~20k range starting just above 100000 looks like a coded identifier rather than a free-ranging measurement. It also tracks PeopleID3 closely (Pearson +0.90 in Fig 8; 10,405 vs 10,415 unique values), so it is most likely a second code for the same people groups.

Treatment: Treat as a categorical code rather than a continuous feature; do not scale or log-transform.

anthropic:claude-opus-4-7 · confidence medium
Out[23]:

saturn.columns["ROP3"].stats

stat value
n 16,382
nulls 11 (0.1%)
unique 10,405
min 100,004
max 119,649
mean 1.091e+05
median 108,856
std 5405
q1 104,189
q3 113,527
iqr 9,338
skew 0.1507
kurtosis -1.053
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 12.
Distribution of ROP3. Vertical dash marks the median.
Histogram bins for ROP3 (median: 108856.0).
bin count
1e+05 – 1.005e+05 508
1.005e+05 – 1.01e+05 385
1.01e+05 – 1.015e+05 309
1.015e+05 – 1.02e+05 530
1.02e+05 – 1.025e+05 423
1.025e+05 – 1.03e+05 489
1.03e+05 – 1.034e+05 565
1.034e+05 – 1.039e+05 572
1.039e+05 – 1.044e+05 480
1.044e+05 – 1.049e+05 337
1.049e+05 – 1.054e+05 391
1.054e+05 – 1.059e+05 502
1.059e+05 – 1.064e+05 425
1.064e+05 – 1.069e+05 426
1.069e+05 – 1.074e+05 366
1.074e+05 – 1.079e+05 434
1.079e+05 – 1.084e+05 470
1.084e+05 – 1.088e+05 548
1.088e+05 – 1.093e+05 247
1.093e+05 – 1.098e+05 693
1.098e+05 – 1.103e+05 499
1.103e+05 – 1.108e+05 584
1.108e+05 – 1.113e+05 417
1.113e+05 – 1.118e+05 363
1.118e+05 – 1.123e+05 303
1.123e+05 – 1.128e+05 330
1.128e+05 – 1.133e+05 457
1.133e+05 – 1.138e+05 424
1.138e+05 – 1.142e+05 478
1.142e+05 – 1.147e+05 240
1.147e+05 – 1.152e+05 627
1.152e+05 – 1.157e+05 487
1.157e+05 – 1.162e+05 457
1.162e+05 – 1.167e+05 76
1.167e+05 – 1.172e+05 135
1.172e+05 – 1.177e+05 90
1.177e+05 – 1.182e+05 318
1.182e+05 – 1.187e+05 433
1.187e+05 – 1.192e+05 119
1.192e+05 – 1.196e+05 434

PeopNameInCountry text label

PeopNameInCountry holds short ethnonym or people-group labels naming a population within a country (e.g., 'French', 'British', 'Han Chinese, Mandarin'), with a median length of 9 characters and a median word count of 1. 53% of values are single words and 34% are duplicates (5634 rows), so the same group label recurs across many country rows; 'Deaf' tops the list at 164. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' reflect the Joshua Project people-group taxonomy rather than free text.

Treatment: Treat as a categorical people-group label; normalize casing/parentheticals and join with country to form the unique key.
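
A minimal sketch of the normalisation and key-building step. The regular expression and helper names are illustrative, the name 'Placeholder' is invented, and the country codes passed to `group_key` are just examples; only the qualifier '(Hindu traditions)' and the label 'Han Chinese, Mandarin' come from the profile.

```python
import re

# Trailing parenthetical qualifier, e.g. "(Hindu traditions)".
QUALIFIER = re.compile(r"\s*\(([^)]*)\)\s*$")

def split_qualifier(name):
    """Return (base_name, qualifier); qualifier is None when absent."""
    match = QUALIFIER.search(name)
    if match is None:
        return name.strip(), None
    return name[:match.start()].strip(), match.group(1).strip()

def group_key(name, country_code):
    """Case-insensitive key: the same label in two countries stays distinct."""
    base, qualifier = split_qualifier(name)
    return (base.casefold(), (qualifier or "").casefold(), country_code)

print(split_qualifier("Placeholder (Hindu traditions)"))
print(group_key("Deaf", "IN") == group_key("Deaf", "PK"))
```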

anthropic:claude-opus-4-7 · confidence high
Out[26]:

saturn.columns["PeopNameInCountry"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 10,748
len_min 1
len_max 41
len_mean 11.46
len_median 9
len_p95 25
word_mean 1.636
word_median 1
n_empty 0
n_duplicates 5,634
duplicate_rate 0.3439
vocab_size 11,539
readability_flesch_mean 49.27
emoji_rate 0
url_rate 0
one_word_rate 0.5313
allcaps_rate 0
boilerplate_rate 0
alert: one_word · 53.1% rows are a single word
alert: duplicates · 34.4% duplicate strings
Fig 13.
Character-length distribution for PeopNameInCountry.
Character-length distribution for PeopNameInCountry (mean: 11.46).
chars count
1 – 2 1
2 – 3 33
3 – 4 282
4 – 5 1209
5 – 6 1631
6 – 7 1843
7 – 8 1526
8 – 9 1037
9 – 10 653
10 – 11 636
11 – 12 574
12 – 13 718
13 – 14 708
14 – 15 828
15 – 16 761
16 – 17 665
17 – 18 499
18 – 19 334
19 – 20 254
20 – 21 234
21 – 22 198
22 – 23 235
23 – 24 186
24 – 25 278
25 – 26 255
26 – 27 228
27 – 28 136
28 – 29 107
29 – 30 73
30 – 31 50
31 – 32 59
32 – 33 56
33 – 34 37
34 – 35 31
35 – 36 19
36 – 37 2
37 – 38 1
38 – 39 1
39 – 40 2
40 – 41 2

ROG2 categorical feature

ROG2 is a low-cardinality categorical with 7 macro-region codes (ASI, AFR, EUR, NAR, AUS, LAM, SOP). Distribution is uneven but not degenerate: ASI dominates at 45.0% of 16,382 rows, AFR follows at 3,635, and an entropy ratio of 0.80 confirms a broad spread across the remaining buckets. The counts match the Continent column value for value (ASI 7,368 = Asia, SOP 447 = Oceania), so ROG2 looks like the coded form of Continent. No nulls and no rare-value tail beyond the seven codes.

Treatment: one-hot or target-encode; safe to use directly as a clean 7-level categorical with no missingness, but include only one of ROG2 and Continent in a model.
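
The entropy figures quoted for categorical columns can be checked by hand. The sketch below assumes saturn reports Shannon entropy in bits and divides it by log2 of the cardinality to get the ratio; that definition is an assumption, but with the ROG2 counts from the top-values table it lands close to the reported 2.257 and 0.8039.

```python
import math

# Counts copied from the ROG2 top-values table (all 7 codes).
counts = {"ASI": 7368, "AFR": 3635, "EUR": 1532, "NAR": 1407,
          "AUS": 1088, "LAM": 905, "SOP": 447}

total = sum(counts.values())

# Shannon entropy in bits: H = -sum(p * log2(p)).
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

# Normalise by the maximum possible entropy for k categories, log2(k).
entropy_ratio = entropy / math.log2(len(counts))

print(f"entropy={entropy:.3f}  ratio={entropy_ratio:.4f}")
```

A ratio of 1 would mean all seven codes are equally common; a ratio near 0 would mean one code holds almost every row.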

anthropic:claude-opus-4-7 · confidence high
Out[29]:

saturn.columns["ROG2"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 7
top_value ASI
top_rate 0.4498
cardinality 7
entropy 2.257
entropy_ratio 0.8039
Fig 14.
Top values for ROG2.
Top values for ROG2 (7 unique shown, of 7 total).
value count share
ASI 7368 45.0%
AFR 3635 22.2%
EUR 1532 9.4%
NAR 1407 8.6%
AUS 1088 6.6%
LAM 905 5.5%
SOP 447 2.7%

Continent categorical feature

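Categorical continent label with 7 values and zero nulls across 16,382 rows. Asia dominates at 44.98% (7,368 rows), followed by Africa at 3,635; an entropy ratio of 0.80 confirms a moderately skewed but not degenerate distribution. Note that Australia (1,088) and Oceania (447) appear as separate categories, which is unusual for a continent field. The two sum to 1,535, exactly the 'Australia and Pacific' count in RegionName, and they mirror the AUS and SOP codes in ROG2, so the split looks like a deliberate source convention that is still worth reconciling before continent-level comparisons.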

Treatment: One-hot encode after merging Australia and Oceania into a single category.
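
A minimal sketch of the merge-then-encode treatment, using the counts from the Continent top-values table. The merged label 'Australia and Oceania' is an assumed name chosen for illustration; any consistent label would do.

```python
# Counts copied from the Continent top-values table.
continent_counts = {"Asia": 7368, "Africa": 3635, "Europe": 1532,
                    "North America": 1407, "Australia": 1088,
                    "South America": 905, "Oceania": 447}

# Assumed label for the merged bucket (not a value present in the file).
MERGE_MAP = {"Australia": "Australia and Oceania",
             "Oceania": "Australia and Oceania"}

merged_counts = {}
for label, n in continent_counts.items():
    target = MERGE_MAP.get(label, label)
    merged_counts[target] = merged_counts.get(target, 0) + n

levels = sorted(merged_counts)

def one_hot(label):
    """Return a 0/1 indicator list over the merged levels."""
    target = MERGE_MAP.get(label, label)
    return [int(target == level) for level in levels]

print(merged_counts["Australia and Oceania"], one_hot("Oceania"))
```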

anthropic:claude-opus-4-7 · confidence high
Out[32]:

saturn.columns["Continent"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 7
top_value Asia
top_rate 0.4498
cardinality 7
entropy 2.257
entropy_ratio 0.8039
Fig 15.
Top values for Continent.
Top values for Continent (7 unique shown, of 7 total).
value count share
Asia 7368 45.0%
Africa 3635 22.2%
Europe 1532 9.4%
North America 1407 8.6%
Australia 1088 6.6%
South America 905 5.5%
Oceania 447 2.7%

RegionName categorical feature

RegionName is a categorical geographic grouping with 12 distinct world regions and no nulls across 16,382 rows. Distribution is fairly balanced (entropy ratio 0.93), though 'Asia, South' leads at 22.6% (3,707 rows) and the top three regions are all Asian or African. Eleven of the twelve labels use a 'Continent, Subregion' convention; 'Australia and Pacific' has no comma, and 'America, Latin' and 'America, North and Caribbean' share the head 'America', so splitting on the comma does not reproduce the seven values of the Continent column.

Treatment: one-hot or target-encode for modelling; use the existing Continent column for continent-level rollups, and split on the comma only if a coarser five-way grouping is acceptable.
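
A short sketch of the comma split, applied to the twelve labels from the RegionName top-values table. The helper name `split_region` is illustrative. It shows what the split actually yields: one label has no comma, and the heads collapse to fewer groups than the Continent column has.

```python
# The 12 labels from the RegionName top-values table.
REGION_LABELS = [
    "Asia, South", "Africa, West and Central", "Asia, Southeast",
    "Australia and Pacific", "America, Latin", "Africa, East and Southern",
    "Europe, Western", "America, North and Caribbean", "Asia, Northeast",
    "Africa, North and Middle East", "Europe, Eastern and Eurasia",
    "Asia, Central",
]

def split_region(label: str) -> tuple[str, str]:
    """Split 'Head, Subregion'; labels without a comma get an empty subregion."""
    head, sep, tail = label.partition(",")
    return head.strip(), (tail.strip() if sep else "")

heads = {split_region(label)[0] for label in REGION_LABELS}
print(sorted(heads))
```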

anthropic:claude-opus-4-7 · confidence high
Out[35]:

saturn.columns["RegionName"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 12
top_value Asia, South
top_rate 0.2263
cardinality 12
entropy 3.325
entropy_ratio 0.9275
Fig 16.
Top values for RegionName.
Top values for RegionName (12 unique shown, of 12 total).
value count share
Asia, South 3707 22.6%
Africa, West and Central 2175 13.3%
Asia, Southeast 1922 11.7%
Australia and Pacific 1535 9.4%
America, Latin 1395 8.5%
Africa, East and Southern 1276 7.8%
Europe, Western 1116 6.8%
America, North and Caribbean 917 5.6%
Asia, Northeast 709 4.3%
Africa, North and Middle East 593 3.6%
Europe, Eastern and Eurasia 577 3.5%
Asia, Central 460 2.8%

ISO3 categorical foreign_key

This column holds ISO3 country codes across 238 distinct values with no nulls, consistent with a country dimension key. India (IND) dominates at 13.8% (2262 rows), followed by PNG, IDN, PAK and CHN, indicating a heavy Asia/tropical skew rather than uniform global coverage. Entropy ratio of 0.79 confirms moderate concentration on a few countries.

Treatment: Use as a country join key; consider grouping long-tail codes or stratifying analyses to handle the IND-heavy skew.
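
Grouping the long tail can be sketched as a frequency threshold. The counts below are the first six rows of the ISO3 top-values table; the 4% cut-off and the 'OTHER' label are assumptions chosen for illustration, and in a real run the counts would cover all 238 codes.

```python
# Row counts copied from the ISO3 top-values table (first six codes only).
top_counts = {"IND": 2262, "PNG": 883, "IDN": 788,
              "PAK": 775, "CHN": 547, "NGA": 535}
N_ROWS = 16382      # total rows in the file
MIN_SHARE = 0.04    # assumed cut-off for keeping a code; tune per analysis

def frequency_encode(code: str) -> float:
    """Share of rows carrying this code (0.0 for codes not in the table)."""
    return top_counts.get(code, 0) / N_ROWS

def bucket(code: str) -> str:
    """Keep frequent codes as-is and pool the long tail into 'OTHER'."""
    return code if frequency_encode(code) >= MIN_SHARE else "OTHER"

kept = [code for code in top_counts if bucket(code) == code]
print(kept, round(frequency_encode("IND"), 4))
```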

anthropic:claude-opus-4-7 · confidence high
Out[38]:

saturn.columns["ISO3"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 238
top_value IND
top_rate 0.1381
cardinality 238
entropy 6.225
entropy_ratio 0.7885
Fig 17.
Top values for ISO3.
Top values for ISO3 (20 unique shown, of 238 total).
value count share
IND 2262 13.8%
PNG 883 5.4%
IDN 788 4.8%
PAK 775 4.7%
CHN 547 3.3%
NGA 535 3.3%
USA 496 3.0%
MEX 333 2.0%
BRA 321 2.0%
CMR 292 1.8%
BGD 278 1.7%
CAN 243 1.5%
COD 231 1.4%
MMR 218 1.3%
AUS 205 1.3%
PHL 200 1.2%
SDN 198 1.2%
NPL 195 1.2%
LAO 184 1.1%
MYS 183 1.1%

LocationInCountry text free_text

Free-text descriptions of where a group is located within a country, mixing geographic prose ('Primarily north', 'Madang province.') with longer ethnographic paragraphs up to 994 characters. Nearly half the rows are null (45.12%) and 13.3% are duplicate strings, with stock phrases like 'Widespread.' (103) and 'Scattered.' recurring. Although 2,618 entries register as English, small pockets of Spanish (12), Portuguese (10) and nine other languages appear, and Flesch readability averages a difficult 38.1.

Treatment: Normalize boilerplate phrases and tokenize/embed for semantic use; do not treat as a categorical.
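
A minimal sketch of the boilerplate-normalisation step. The stock-phrase set holds only the two phrases quoted in the profile and would need extending after inspecting the real values; the function name is illustrative.

```python
import re

# Stock phrases quoted in the profile; extend after inspecting real values.
STOCK_PHRASES = {"widespread", "scattered"}

def normalize_location(text):
    """Return (cleaned_text, is_stock_phrase); None (a null cell) stays None."""
    if text is None:
        return None, False
    cleaned = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    key = cleaned.rstrip(".").lower()             # ignore case and final period
    return cleaned, key in STOCK_PHRASES

print(normalize_location("Widespread."))
print(normalize_location("  Madang   province. "))
```

Flagging stock phrases separately keeps them from dominating any embedding or token statistics built on the remaining descriptive text.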

anthropic:claude-opus-4-7 · confidence high
Out[41]:

saturn.columns["LocationInCountry"].stats

stat value
n 16,382
nulls 7,392 (45.1%)
unique 7,794
len_min 3
len_max 994
len_mean 108.2
len_median 79
len_p95 314
word_mean 15.48
word_median 11
n_empty 0
n_duplicates 1,196
duplicate_rate 0.133
vocab_size 25,733
readability_flesch_mean 38.13
emoji_rate 0
url_rate 0
one_word_rate 0.02725
allcaps_rate 0
boilerplate_rate 0
alert: multilingual · 13 languages detected in sample
alert: null_rate · 45.1% null
Fig 18.
Character-length distribution for LocationInCountry.
Character-length distribution for LocationInCountry (mean: 108.2).
chars count
3 – 28 921
28 – 53 1766
53 – 77 1690
77 – 102 1346
102 – 127 962
127 – 152 632
152 – 176 409
176 – 201 262
201 – 226 170
226 – 251 137
251 – 276 116
276 – 300 81
300 – 325 82
325 – 350 53
350 – 375 54
375 – 399 33
399 – 424 37
424 – 449 48
449 – 474 26
474 – 498 28
498 – 523 23
523 – 548 14
548 – 573 16
573 – 598 19
598 – 622 14
622 – 647 14
647 – 672 8
672 – 697 4
697 – 721 4
721 – 746 5
746 – 771 9
771 – 796 1
796 – 821 1
821 – 845 1
845 – 870 1
870 – 895 0
895 – 920 1
920 – 944 1
944 – 969 0
969 – 994 1

PeopleID1 numeric feature

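PeopleID1 is stored as numeric but takes only 16 distinct integer values across 16,382 rows, ranging from 10 to 26 with a median of 20. It is not a measurement: the 16 per-value row counts in its histogram (4,178 at 21, 3,073 at 22, 1,593 at 12, and so on) are exactly the 16 row counts of AffinityBloc and ROP1, so PeopleID1 appears to be the numeric ID of the affinity bloc. The left skew (-0.79) is an artefact of how those IDs happen to be numbered.

Treatment: Cast to categorical and one-hot encode rather than treating as continuous; keep only one of PeopleID1, ROP1, and AffinityBloc as a feature, since they appear to encode the same grouping.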
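
Whether PeopleID1 is a grouping code can be tested by comparing its per-value counts with those of AffinityBloc. The sketch below uses the non-empty bins of the PeopleID1 histogram and the AffinityBloc top-values table. Matching by count only yields a tentative mapping, which should be confirmed row by row on the actual file.

```python
# Non-empty bins from the PeopleID1 histogram (code -> row count).
id1_counts = {10: 634, 11: 402, 12: 1593, 13: 200, 14: 228, 15: 137,
              16: 1352, 17: 1031, 18: 415, 19: 1588, 20: 635, 21: 4178,
              22: 3073, 23: 453, 24: 299, 26: 164}

# Row counts from the AffinityBloc top-values table.
bloc_counts = {
    "South Asian Peoples": 4178, "Sub-Saharan Peoples": 3073,
    "Eurasian Peoples": 1593, "Pacific Islanders": 1588,
    "Latin-Caribbean Americans": 1352, "Malay Peoples": 1031,
    "Southeast Asian Peoples": 635, "Arab World": 634,
    "Tibetan-Himalayan Peoples": 453, "North American Peoples": 415,
    "East Asian Peoples": 402, "Turkic Peoples": 299, "Persian-Median": 228,
    "Horn of Africa Peoples": 200, "Deaf": 164, "Jewish": 137,
}

# Same multiset of counts suggests the two columns encode one grouping.
same_profile = sorted(id1_counts.values()) == sorted(bloc_counts.values())

# Every count is distinct here, so counts can pair codes with labels.
bloc_by_count = {n: name for name, n in bloc_counts.items()}
tentative_map = {code: bloc_by_count[n] for code, n in id1_counts.items()}

print(same_profile, tentative_map[21], tentative_map[26])
```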

anthropic:claude-opus-4-7 · confidence high
Out[44]:

saturn.columns["PeopleID1"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 16
min 10
max 26
mean 18.58
median 20
std 3.921
q1 16
q3 21
iqr 5
skew -0.7942
kurtosis -0.5112
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 19.
Distribution of PeopleID1. Vertical dash marks the median.
Histogram bins for PeopleID1 (median: 20.0).
bin count
10 – 10.4 634
10.4 – 10.8 0
10.8 – 11.2 402
11.2 – 11.6 0
11.6 – 12 0
12 – 12.4 1593
12.4 – 12.8 0
12.8 – 13.2 200
13.2 – 13.6 0
13.6 – 14 0
14 – 14.4 228
14.4 – 14.8 0
14.8 – 15.2 137
15.2 – 15.6 0
15.6 – 16 0
16 – 16.4 1352
16.4 – 16.8 0
16.8 – 17.2 1031
17.2 – 17.6 0
17.6 – 18 0
18 – 18.4 415
18.4 – 18.8 0
18.8 – 19.2 1588
19.2 – 19.6 0
19.6 – 20 0
20 – 20.4 635
20.4 – 20.8 0
20.8 – 21.2 4178
21.2 – 21.6 0
21.6 – 22 0
22 – 22.4 3073
22.4 – 22.8 0
22.8 – 23.2 453
23.2 – 23.6 0
23.6 – 24 0
24 – 24.4 299
24.4 – 24.8 0
24.8 – 25.2 0
25.2 – 25.6 0
25.6 – 26 164

ROP1 categorical feature

ROP1 is a low-cardinality categorical code with 16 distinct values (all prefixed 'A0##'), fully populated across 16382 rows. The distribution is moderately concentrated: 'A012' alone covers 25.5% and 'A013' adds another ~19%, while the entropy ratio of 0.83 indicates a reasonably even spread among the rest. The counts, entropy, and cardinality are identical to AffinityBloc (A012 4,178 = South Asian Peoples, A013 3,073 = Sub-Saharan Peoples), so this is most likely the code for the affinity bloc rather than an independent attribute.

Treatment: one-hot or target-encode for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[47]:

saturn.columns["ROP1"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 16
top_value A012
top_rate 0.255
cardinality 16
entropy 3.322
entropy_ratio 0.8306
Fig 20.
Top values for ROP1.
Top values for ROP1 (16 unique shown, of 16 total).
value count share
A012 4178 25.5%
A013 3073 18.8%
A003 1593 9.7%
A010 1588 9.7%
A007 1352 8.3%
A008 1031 6.3%
A011 635 3.9%
A001 634 3.9%
A014 453 2.8%
A009 415 2.5%
A002 402 2.5%
A015 299 1.8%
A005 228 1.4%
A004 200 1.2%
A017 164 1.0%
A006 137 0.8%

AffinityBloc categorical feature

AffinityBloc is a categorical grouping of populations into 16 broad ethno-geographic blocs, with no nulls across 16,382 rows. The distribution is moderately concentrated: "South Asian Peoples" leads at 25.5% (4,178 rows), followed by "Sub-Saharan Peoples" (3,073), while entropy ratio of 0.83 indicates the remaining 14 categories carry meaningful mass. Labels mix regional and ethnolinguistic framings (e.g., "Arab World" alongside "Tibetan-Himalayan Peoples"), which an analyst should note for taxonomy consistency.

Treatment: One-hot or target-encode for modelling; audit label taxonomy for overlap before grouping.

anthropic:claude-opus-4-7 · confidence high
Out[50]:

saturn.columns["AffinityBloc"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 16
top_value South Asian Peoples
top_rate 0.255
cardinality 16
entropy 3.322
entropy_ratio 0.8306
Fig 21.
Top values for AffinityBloc.
Show data table
Top values for AffinityBloc (16 unique shown, of 16 total).
valuecountshare
South Asian Peoples417825.5%
Sub-Saharan Peoples307318.8%
Eurasian Peoples15939.7%
Pacific Islanders15889.7%
Latin-Caribbean Americans13528.3%
Malay Peoples10316.3%
Southeast Asian Peoples6353.9%
Arab World6343.9%
Tibetan-Himalayan Peoples4532.8%
North American Peoples4152.5%
East Asian Peoples4022.5%
Turkic Peoples2991.8%
Persian-Median2281.4%
Horn of Africa Peoples2001.2%
Deaf1641.0%
Jewish1370.8%
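
The ROP1 and AffinityBloc tables have the same 16 counts, which hints that one column is the code for the other. The sketch below pairs codes with labels by matching counts; the helper `pair_by_count` is hypothetical, and a match on counts is only suggestive, so the pairing should be confirmed on the real rows (for example by checking that each ROP1 value co-occurs with exactly one AffinityBloc value).

```python
# Top-value tables copied from the ROP1 and AffinityBloc sections.
ROP1 = {"A012": 4178, "A013": 3073, "A003": 1593, "A010": 1588,
        "A007": 1352, "A008": 1031, "A011": 635, "A001": 634,
        "A014": 453, "A009": 415, "A002": 402, "A015": 299,
        "A005": 228, "A004": 200, "A017": 164, "A006": 137}
AFFINITY_BLOC = {
    "South Asian Peoples": 4178, "Sub-Saharan Peoples": 3073,
    "Eurasian Peoples": 1593, "Pacific Islanders": 1588,
    "Latin-Caribbean Americans": 1352, "Malay Peoples": 1031,
    "Southeast Asian Peoples": 635, "Arab World": 634,
    "Tibetan-Himalayan Peoples": 453, "North American Peoples": 415,
    "East Asian Peoples": 402, "Turkic Peoples": 299,
    "Persian-Median": 228, "Horn of Africa Peoples": 200,
    "Deaf": 164, "Jewish": 137,
}


def pair_by_count(codes, labels):
    """Pair each code with the label that has the same row count.

    Returns None when counts are tied or do not line up, because the
    pairing would then be a guess rather than an inference.
    """
    by_count = {}
    for label, n in labels.items():
        if n in by_count:
            return None  # two labels share a count: ambiguous
        by_count[n] = label
    if len(set(codes.values())) != len(codes):
        return None  # two codes share a count: ambiguous
    if set(codes.values()) != set(by_count):
        return None  # the two count profiles differ
    return {code: by_count[n] for code, n in codes.items()}


mapping = pair_by_count(ROP1, AFFINITY_BLOC)
for code in sorted(mapping):
    print(code, "->", mapping[code])
```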

PeopleID2 numeric foreign_key

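PeopleID2 is a numeric people-identifier code with only 267 distinct values across 16,382 rows, so each id repeats heavily and behaves like a foreign key rather than a measurement. The cardinality equals that of PeopleCluster (267), which suggests it is the numeric id of the cluster. Values span 100 to 479 with no nulls or flagged outliers, but the histogram is lumpy rather than flat: no ids fall between roughly 356 and 393, and three bins hold more than 1,000 rows each. Skew (0.25) and kurtosis (-1.13) therefore describe how often each code is used, not the shape of a quantity.

Treatment: Treat as a categorical key and join on it rather than using it as a numeric feature.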

anthropic:claude-opus-4-7 · confidence high
Out[53]:

saturn.columns["PeopleID2"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 267
min 100
max 479
mean 283.7
median 268
std 114.6
q1 183
q3 402
iqr 219
skew 0.2451
kurtosis -1.126
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 22. Distribution of PeopleID2. Vertical dash marks the median.
Histogram bins for PeopleID2 (median: 268.0).
bin count
100 – 109.5 542
109.5 – 119 595
119 – 128.4 283
128.4 – 137.9 144
137.9 – 147.4 411
147.4 – 156.8 247
156.8 – 166.3 876
166.3 – 175.8 611
175.8 – 185.3 502
185.3 – 194.8 309
194.8 – 204.2 303
204.2 – 213.7 258
213.7 – 223.2 280
223.2 – 232.7 334
232.7 – 242.1 400
242.1 – 251.6 1677
251.6 – 261.1 243
261.1 – 270.5 314
270.5 – 280 257
280 – 289.5 478
289.5 – 299 722
299 – 308.4 228
308.4 – 317.9 568
317.9 – 327.4 269
327.4 – 336.9 646
336.9 – 346.4 348
346.4 – 355.8 368
355.8 – 365.3 0
365.3 – 374.8 0
374.8 – 384.2 0
384.2 – 393.7 0
393.7 – 403.2 132
403.2 – 412.7 447
412.7 – 422.1 360
422.1 – 431.6 94
431.6 – 441.1 61
441.1 – 450.6 1015
450.6 – 460.1 219
460.1 – 469.5 1195
469.5 – 479 646

ROP2 categorical feature

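ROP2 is a categorical code field with 214 distinct alphanumeric values (e.g., 'A012', 'C0152') across 16,382 rows and no nulls. The distribution is heavy-tailed: 'A012' alone covers 25.4% of rows (4,169) and the next code, 'C0152', another 7.0%, with an entropy ratio of 0.76. The 'A' vs 'C' prefix split marks two code families in one column. The 'C' counts line up with PeopleCluster labels (C0152 and 'New Guinea' both have 1,139 rows; C0044 and 'Benue' both have 342), so ROP2 is most likely a people-cluster code rather than a generic routing or product code, while the 'A012' rows appear to carry the South Asian affinity-bloc code from ROP1 in place of a cluster code.

Treatment: Group rare codes into an 'other' bucket and target/one-hot encode the high-frequency levels; decide first whether the 'A012' rows should be split using PeopleCluster.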

anthropic:claude-opus-4-7 · confidence high
Out[56]:

saturn.columns["ROP2"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 214
top_value A012
top_rate 0.2545
cardinality 214
entropy 5.901
entropy_ratio 0.7622
Fig 23. Top values for ROP2.
Top values for ROP2 (20 unique shown, of 214 total).
value count share
A012 4169 25.4%
C0152 1139 7.0%
C0201 448 2.7%
C0044 342 2.1%
C0147 253 1.5%
C0154 240 1.5%
C0065 226 1.4%
C0062 223 1.4%
C0229 223 1.4%
C0087 182 1.1%
C0252 164 1.0%
C0012 162 1.0%
C0003 161 1.0%
C0059 160 1.0%
C0085 160 1.0%
C0258 159 1.0%
C0079 152 0.9%
C0061 149 0.9%
C0195 142 0.9%
C0015 139 0.8%
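
The 'other' bucket suggested for ROP2 can be sketched in plain Python. The function name `lump_rare`, the 5% threshold, and the row counts in the sample are illustrative assumptions; only the code values come from the table above.

```python
from collections import Counter


def lump_rare(values, min_share=0.01, other="other"):
    """Replace levels whose share of rows is below min_share with `other`."""
    counts = Counter(values)
    total = len(values)
    keep = {level for level, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other for v in values]


# Tiny illustrative column: real code values, made-up row counts.
sample = ["A012"] * 50 + ["C0152"] * 30 + ["C0201"] * 19 + ["C0015"] * 1
lumped = lump_rare(sample, min_share=0.05)
print(Counter(lumped))
```

The threshold should be chosen on the training split only, so that rare levels seen later fall into the same bucket instead of creating new columns.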

PeopleCluster categorical feature

PeopleCluster is a categorical ethnographic grouping with 267 distinct values across 16,382 rows and no nulls. The distribution is broad (entropy ratio 0.86) but with a notable concentration: 'New Guinea' accounts for 6.95% of rows (1,139), followed by 'South Asia Hindu - other' (935) and 'South Asia Muslim - other' (592). The labels mix geographic, religious, and tribal descriptors, and the catch-all buckets are large: the four South Asia '- other' clusters in the top 20 cover 2,334 rows (14.2%) between them.

Treatment: Group rare clusters and target- or frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[59]:

saturn.columns["PeopleCluster"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 267
top_value New Guinea
top_rate 0.06953
cardinality 267
entropy 6.929
entropy_ratio 0.8596
Fig 24. Top values for PeopleCluster.
Top values for PeopleCluster (20 unique shown, of 267 total).
value count share
New Guinea 1139 7.0%
South Asia Hindu - other 935 5.7%
South Asia Muslim - other 592 3.6%
South American Indigenous 448 2.7%
South Asia Tribal - other 448 2.7%
South Asia Dalit - other 359 2.2%
Benue 342 2.1%
South Asia Muslim - Pashtun 293 1.8%
Mon-Khmer 253 1.5%
North American Indigenous 240 1.5%
Chinese 226 1.4%
Chadic 223 1.4%
Tibeto-Burman, other 223 1.4%
Gur 182 1.1%
Deaf 164 1.0%
Anglo-Celt 162 1.0%
Adamawa-Ubangi 161 1.0%
Borneo-Kalimantan 160 1.0%
Guinean 160 1.0%
Bantu, Central-Congo 159 1.0%
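
Frequency encoding, one of the options named in the PeopleCluster treatment, replaces each label with its share of the training rows. The sketch below is a minimal stdlib version; the function names and the choice of 0.0 for unseen labels are assumptions, and the sample rows are invented.

```python
from collections import Counter


def fit_frequencies(train_values):
    """Learn each level's relative frequency from the training column."""
    counts = Counter(train_values)
    total = len(train_values)
    return {level: n / total for level, n in counts.items()}


def frequency_encode(values, freqs, unseen=0.0):
    """Replace each level with its learned frequency; unseen levels get `unseen`."""
    return [freqs.get(v, unseen) for v in values]


# Invented rows; the labels are taken from the table above.
train = ["New Guinea", "New Guinea", "New Guinea", "Benue"]
freqs = fit_frequencies(train)
encoded = frequency_encode(["Benue", "New Guinea", "Gur"], freqs)
print(encoded)
```

Keeping the fit and the transform separate avoids leaking test-set frequencies into the encoding.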

PeopNameAcrossCountries text foreign_key

This column holds people-group or ethnicity names repeated across countries, with 10,377 unique labels over 16,382 rows and a 36.7% duplicate rate (6,005 repeats). Entries are short (median 8 chars, mean 1.57 words) and 59% are single-word labels like 'Deaf' (164), 'French' (82), or 'British' (81). The frequent fragments '(hindu' (516) and '(muslim' (424) alongside 'traditions)' (1038) suggest religious-tradition qualifiers in parentheses are a common naming convention, and the same group name recurs because it appears in multiple country contexts.

Treatment: Treat as a people-group key; normalize casing/parentheticals and join with country to form a unique grouping key.

anthropic:claude-opus-4-7 · confidence high
Out[62]:

saturn.columns["PeopNameAcrossCountries"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 10,377
len_min 1
len_max 42
len_mean 10.93
len_median 8
len_p95 25
word_mean 1.568
word_median 1
n_empty 0
n_duplicates 6,005
duplicate_rate 0.3666
vocab_size 10,446
readability_flesch_mean 49.45
emoji_rate 0
url_rate 0
one_word_rate 0.5899
allcaps_rate 0
boilerplate_rate 0
alert: one_word 59.0% rows are a single word
alert: duplicates 36.7% duplicate strings
Fig 25. Character-length distribution for PeopNameAcrossCountries.
Character-length distribution for PeopNameAcrossCountries (mean: 10.93).
chars count
1 – 2 35
2 – 3 313
3 – 4 1356
4 – 5 1836
5 – 6 2072
6 – 7 1670
7 – 8 1120
8 – 9 682
9 – 10 647
10 – 11 533
11 – 12 629
12 – 13 604
13 – 14 701
14 – 15 630
15 – 16 558
16 – 17 429
17 – 18 275
18 – 19 211
19 – 20 202
20 – 22 164
22 – 23 222
23 – 24 167
24 – 25 274
25 – 26 246
26 – 27 227
27 – 28 134
28 – 29 107
29 – 30 71
30 – 31 50
31 – 32 65
32 – 33 56
33 – 34 38
34 – 35 25
35 – 36 24
36 – 37 1
37 – 38 2
38 – 39 1
39 – 40 4
40 – 41 0
41 – 42 1
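
The grouping key suggested for PeopNameAcrossCountries can be sketched as follows. The sample name is illustrative of the 'Name (qualifier)' shape implied by the frequent fragments, and using ROG3 as the country code is an assumption based on the schema listing; both helper names are hypothetical.

```python
import re

# "Name (qualifier)" -> base name plus the text inside a trailing parenthesis.
_QUALIFIER = re.compile(r"^(?P<base>.*?)\s*\((?P<qual>[^()]*)\)\s*$")


def split_qualifier(name):
    """Return (base_name, qualifier); qualifier is None when there is none."""
    cleaned = name.strip()
    match = _QUALIFIER.match(cleaned)
    if match:
        return match.group("base"), match.group("qual")
    return cleaned, None


def group_key(name, country_code):
    """Case-insensitive key: (country, base name, qualifier or '')."""
    base, qualifier = split_qualifier(name)
    return (country_code.upper(), base.casefold(), (qualifier or "").casefold())


print(split_qualifier("Rajput (Hindu traditions)"))
print(group_key("Deaf", "in"))
```

Keeping the qualifier as its own key component, rather than dropping it, preserves the distinction between groups that share a base name but differ in religious tradition.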

Population numeric feature

Population is a count per people group, present for 16,357 of 16,382 rows (25 nulls, 0.2%), with only 1,708 unique values, which suggests many rounded or shared estimates. The distribution is extremely heavy-tailed: the median is 20,000 but the max is 912,955,000, with skew 91.04 and kurtosis of roughly 10,050, and 15.0% of rows (2,455) fall beyond the 1.5 IQR fences. The mean (about 499,500) sits far above Q3 (93,000), so a small number of very large groups dominate any sum or average.

Treatment: Log-transform before any modelling or distance-based analysis (the minimum is 10, so no zero handling is needed), and decide how to treat the 25 nulls.

anthropic:claude-opus-4-7 · confidence high
Out[65]:

saturn.columns["Population"].stats

stat value
n 16,382
nulls 25 (0.2%)
unique 1,708
min 10
max 9.13e+08
mean 4.995e+05
median 20,000
std 8.066e+06
q1 4,300
q3 93,000
iqr 88,700
skew 91.04
kurtosis 1.005e+04
n_outliers 2,455
outlier_rate 0.1501
zero_rate 0
alert: high_skew skew=+91.04
alert: outliers 15.0% rows beyond 1.5 IQR
Fig 26. Distribution of Population. Vertical dash marks the median.
Histogram bins for Population (median: 20000.0).
bin count
10 – 2.282e+07 16302
2.282e+07 – 4.565e+07 32
4.565e+07 – 6.847e+07 12
6.847e+07 – 9.13e+07 3
9.13e+07 – 1.141e+08 3
1.141e+08 – 1.369e+08 3
1.369e+08 – 1.598e+08 0
1.598e+08 – 1.826e+08 0
1.826e+08 – 2.054e+08 1
2.054e+08 – 8.901e+08 0 (30 consecutive empty bins)
8.901e+08 – 9.13e+08 1
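
The effect of the log transform recommended for Population can be seen on the summary statistics alone. The sketch below applies log10 to the values from the stats table; the function name is hypothetical, and passing missing values through as None is one possible way to handle the 25 nulls, not the only one.

```python
import math


def log10_population(value):
    """log10 of a population count; None (missing) stays None."""
    if value is None:
        return None
    if value <= 0:
        raise ValueError("population must be positive for a log transform")
    return math.log10(value)


# Summary values copied from the Population stats table.
SUMMARY = {"min": 10, "q1": 4300, "median": 20000,
           "q3": 93000, "max": 912_955_000}

logged = {name: log10_population(v) for name, v in SUMMARY.items()}
for name, value in logged.items():
    print(name, round(value, 2))
```

On the log scale the quartiles sit about 0.67 on either side of the median, so the transformed column is far closer to symmetric than the raw one.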

Category categorical feature

A 3-level categorical with no nulls across 16,382 rows, encoded as the strings "1", "2", and "3". Class "1" dominates at 53.1% (8,705 rows) and "2" is the minority at 1,360 rows, giving a moderately imbalanced distribution (entropy ratio 0.83). The numeric-string labels suggest an ordinal or coded category whose meaning is not self-evident from the values alone.

Treatment: One-hot or ordinal encode; consider class-imbalance handling if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[68]:

saturn.columns["Category"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 3
top_value 1
top_rate 0.5314
cardinality 3
entropy 1.313
entropy_ratio 0.8284
Fig 27. Top values for Category.
Top values for Category (3 unique shown, of 3 total).
value count share
1 8705 53.1%
3 6317 38.6%
2 1360 8.3%

ROL3 text feature

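ROL3 holds three-letter language codes in the style of ISO 639-3 (every value is exactly 3 characters and one word), with hin, eng, and ben the most frequent. There are 6,164 distinct codes across 16,382 rows and a 62.4% duplicate rate, so most codes are shared by several people groups. 176 rows carry 'xxx', which likely flags an undetermined language (PrimaryLanguageName has the same number of 'Language unknown' rows). The allcaps_rate of 0.0001221 works out to about 2 upper-case rows, and vocab_size is one less than unique, so at least one code probably appears in two casings.

Treatment: Treat as a categorical language code: lower-case it, map 'xxx' to missing or to its own level, then one-hot or target-encode the top codes and bucket the long tail as 'other'.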

anthropic:claude-opus-4-7 · confidence high
Out[71]:

saturn.columns["ROL3"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 6,164
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 10,218
duplicate_rate 0.6237
vocab_size 6,163
readability_flesch_mean 117.4
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.0001221
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 62.4% duplicate strings
Fig 28. Character-length distribution for ROL3.
Character-length distribution for ROL3 (mean: 3.0). All 16,382 values are exactly 3 characters long, so only one of the 40 histogram bins is non-empty.
chars count
3 16382
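
A small normalizer covers the ROL3 cleanup: consistent casing, a shape check, and explicit handling of the 'xxx' placeholder. Mapping 'xxx' to None is an assumption about what the placeholder means, and the function name is hypothetical.

```python
import re

_CODE = re.compile(r"^[a-z]{3}$")
UNKNOWN = "xxx"  # placeholder reported 176 times in this column


def normalize_rol3(code):
    """Lower-case and validate a three-letter code; the placeholder becomes None."""
    cleaned = code.strip().lower()
    if not _CODE.match(cleaned):
        raise ValueError(f"not a three-letter code: {code!r}")
    return None if cleaned == UNKNOWN else cleaned


print([normalize_rol3(c) for c in ["hin", "ENG", "xxx"]])
```

Raising on malformed input, rather than silently passing it through, surfaces any value that does not fit the three-letter pattern before it becomes a category level.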

PrimaryLanguageName text feature

Holds the primary language name for each people group, predominantly single-word entries (one_word_rate 0.73, word_mean 1.32), with Hindi (682), English (424) and Bengali (366) leading. The duplicate_rate of 0.62 is expected for a language label, and the 6,153 unique names track the 6,164 ROL3 codes closely, so this is essentially one name per code. The trailing commas in top_words ('arabic,' and 'punjabi,') are more likely inverted-name qualifiers of the kind seen in OfficialLang ('Arabic, Standard', 'Chinese, Mandarin') than lists of several languages, and lengths up to 45 characters fit such qualified names. 176 rows are explicitly 'Language unknown'.

Treatment: Normalize casing and keep each name as a single label; do not split on commas unless inspection shows genuine multi-language strings. Map 'Language unknown' to missing.

anthropic:claude-opus-4-7 · confidence high
Out[74]:

saturn.columns["PrimaryLanguageName"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 6,153
len_min 1
len_max 45
len_mean 9.081
len_median 7
len_p95 19
word_mean 1.322
word_median 1
n_empty 0
n_duplicates 10,229
duplicate_rate 0.6244
vocab_size 6,251
readability_flesch_mean 42.67
emoji_rate 0
url_rate 0
one_word_rate 0.7306
allcaps_rate 0
boilerplate_rate 0
alert: one_word 73.1% rows are a single word
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 62.4% duplicate strings
Fig 29. Character-length distribution for PrimaryLanguageName.
Character-length distribution for PrimaryLanguageName (mean: 9.08).
chars count
1 – 2 49
2 – 3 295
3 – 4 1430
4 – 5 2556
5 – 6 2134
6 – 8 2957
8 – 9 1139
9 – 10 677
10 – 11 597
11 – 12 311
12 – 13 698
13 – 14 387
14 – 15 372
15 – 16 1263
16 – 18 535
18 – 19 160
19 – 20 119
20 – 21 139
21 – 22 85
22 – 23 72
23 – 24 175
24 – 25 36
25 – 26 41
26 – 27 30
27 – 29 23
29 – 30 12
30 – 31 11
31 – 32 31
32 – 33 16
33 – 34 8
34 – 35 1
35 – 36 1
36 – 37 1
37 – 38 2
38 – 40 9
40 – 41 5
41 – 42 4
42 – 43 0
43 – 44 0
44 – 45 1
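
If the commas in PrimaryLanguageName are 'Base, Qualifier' names, as they are in OfficialLang, a split on the first comma yields a base language for roll-ups without discarding the qualifier. This is a sketch under that assumption; the function name is hypothetical and the sample names come from the OfficialLang table.

```python
from collections import Counter


def split_language_name(name):
    """Split 'Base, Qualifier' into (base, qualifier); plain names get None."""
    base, sep, qualifier = name.partition(",")
    return base.strip(), (qualifier.strip() if sep else None)


names = ["Arabic, Standard", "Chinese, Mandarin", "Hindi", "German, Standard"]
pairs = [split_language_name(n) for n in names]
print(pairs)

# Roll-up by base name, keeping the full name as the actual label.
print(Counter(base for base, _ in pairs))
```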

PrimaryLanguageDialect categorical metadata

This column records a primary language dialect per record, with 980 distinct values across 16,382 rows. It is 92.3% null, and even the most common value, 'Brazilian Portuguese', accounts for just 3.25% of non-nulls (41 occurrences); entropy ratio of 0.967 confirms an extremely flat, long-tailed distribution spanning dialects like Assyrian, Punjabi, Sinhalese, and Ta'izzi. The combination of sparse coverage and high cardinality limits its standalone modelling value.

Treatment: Group into language families or a coarse bucket plus 'missing' indicator before encoding.

anthropic:claude-opus-4-7 · confidence high
Out[77]:

saturn.columns["PrimaryLanguageDialect"].stats

stat value
n 16,382
nulls 15,121 (92.3%)
unique 980
top_value Brazilian Portuguese
top_rate 0.03251
cardinality 980
entropy 9.613
entropy_ratio 0.9674
alert: long_tail 850 singleton categories
alert: null_rate 92.3% null
Fig 30. Top values for PrimaryLanguageDialect.
Top values for PrimaryLanguageDialect (20 unique shown, of 980 total).
value count share
Brazilian Portuguese 41 0.3%
Assyrian 15 0.1%
Punjabi 14 0.1%
Sinhalese 10 0.1%
Ta'izzi 8 0.0%
Guipuzcoan 8 0.0%
Moldavian 7 0.0%
Kalderash 6 0.0%
Siripuria 6 0.0%
Tangsa 6 0.0%
Wasulunkakan 6 0.0%
Nyanja 6 0.0%
Orange River 5 0.0%
Pomak 5 0.0%
Hui 4 0.0%
Central 4 0.0%
Vixlin 4 0.0%
Malagasy 4 0.0%
Unga 4 0.0%
Western Sudanese 3 0.0%

NumberLanguagesSpoken numeric feature

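Count of languages spoken by each people group, with 16,382 non-null integer values ranging from 1 to 145 and a median of 1. The distribution is severely right-skewed (skew 7.44, kurtosis 83.76). Because Q1 is 1 and Q3 is 2, the IQR is 1 and the upper 1.5 IQR fence is 3.5, so every group with four or more languages is flagged: the 2,410 'outliers' (14.7%) are mostly a side effect of the narrow IQR rather than bad data. The rows are people groups, not individuals, so a count of 145 is not impossible for a large, widely dispersed group, but the four values above 100 are worth a manual check.

Treatment: Cap or log-transform before modelling, and spot-check the largest values against the source.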

anthropic:claude-opus-4-7 · confidence high
Out[80]:

saturn.columns["NumberLanguagesSpoken"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 78
min 1
max 145
mean 2.764
median 1
std 5.985
q1 1
q3 2
iqr 1
skew 7.437
kurtosis 83.76
n_outliers 2,410
outlier_rate 0.1471
zero_rate 0
alert: high_skew skew=+7.44
alert: outliers 14.7% rows beyond 1.5 IQR
Fig 31. Distribution of NumberLanguagesSpoken. Vertical dash marks the median.
Histogram bins for NumberLanguagesSpoken (median: 1.0).
bin count
1 – 4.6 14386
4.6 – 8.2 950
8.2 – 11.8 343
11.8 – 15.4 232
15.4 – 19 99
19 – 22.6 80
22.6 – 26.2 80
26.2 – 29.8 41
29.8 – 33.4 35
33.4 – 37 18
37 – 40.6 26
40.6 – 44.2 20
44.2 – 47.8 16
47.8 – 51.4 10
51.4 – 55 5
55 – 58.6 8
58.6 – 62.2 10
62.2 – 65.8 4
65.8 – 69.4 5
69.4 – 73 1
73 – 76.6 3
76.6 – 80.2 1
80.2 – 83.8 1
83.8 – 87.4 2
87.4 – 91 1
91 – 94.6 1
94.6 – 98.2 0
98.2 – 101.8 0
101.8 – 105.4 2
105.4 – 109 0
109 – 112.6 0
112.6 – 116.2 0
116.2 – 119.8 0
119.8 – 123.4 1
123.4 – 127 0
127 – 130.6 0
130.6 – 134.2 0
134.2 – 137.8 0
137.8 – 141.4 0
141.4 – 145 1
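
The outlier alert for NumberLanguagesSpoken is worded as "rows beyond 1.5 IQR", which points to the usual Tukey fences. The sketch below computes those fences from the quartiles in the stats tables; whether saturn uses exactly this rule is an assumption, and the function name is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the stats tables.
langs_low, langs_high = tukey_fences(1, 2)
pop_low, pop_high = tukey_fences(4300, 93000)

print("NumberLanguagesSpoken fences:", langs_low, langs_high)
print("Population fences:", pop_low, pop_high)
```

With an upper fence of 3.5, any people group recorded with four or more languages is counted as an outlier, which is why the flag fires on roughly one row in seven.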

OfficialLang categorical feature

OfficialLang is a categorical column listing the official language of each record, with 87 distinct values across 16,382 rows and almost no nulls (0.05%). English dominates at 22.4% (3,672), followed by Hindi (2,262) and French (1,478), giving a moderately concentrated distribution (entropy ratio 0.68). The presence of compound labels like 'Arabic, Standard' and 'Chinese, Mandarin' suggests a specific naming convention worth preserving when joining to language references.

Treatment: Group long tail and one-hot or target-encode the top categories before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[83]:

saturn.columns["OfficialLang"].stats

stat value
n 16,382
nulls 8 (0.0%)
unique 87
top_value English
top_rate 0.2243
cardinality 87
entropy 4.368
entropy_ratio 0.6779
Fig 32. Top values for OfficialLang.
Top values for OfficialLang (20 unique shown, of 87 total).
value count share
English 3672 22.4%
Hindi 2262 13.8%
French 1478 9.0%
Spanish 1121 6.8%
Arabic, Standard 891 5.4%
Indonesian 788 4.8%
Urdu 775 4.7%
Chinese, Mandarin 638 3.9%
Portuguese 529 3.2%
Bengali 278 1.7%
Burmese 218 1.3%
Malay 207 1.3%
Tagalog 200 1.2%
Nepali 195 1.2%
Lao 184 1.1%
Russian 171 1.0%
German, Standard 158 1.0%
Swahili 154 0.9%
Sinhala 139 0.8%
Amharic 120 0.7%

SpeakNationalLang unknown other

Column 'SpeakNationalLang' was skipped by the profiler, so no type inference, uniqueness count, or value statistics are available. The only confirmed signals are 16,382 rows and a null rate of 0.0. The name suggests a flag or category indicating whether a people group speaks the national language, but this cannot be verified from the evidence.

Treatment: Re-profile or manually inspect to determine type before any downstream use.

anthropic:claude-opus-4-7 · confidence low
Out[86]:

saturn.columns["SpeakNationalLang"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique
alert: skipped no profiler for kind=unknown

BibleStatus numeric feature

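BibleStatus is an integer-coded categorical with 6 distinct values spanning 0 to 5 across 16,382 complete rows. The distribution is left-skewed (skew -1.22) with a mean of 3.86 and median of 4: 7,790 rows (47.6%) sit at 5, while about 4.8% (784) sit at zero. Although stored as numeric, the small cardinality and bounded range indicate an ordinal status code rather than a measurement. The cumulative counts match the null counts of the year columns exactly (2,936 rows at status 2 or below equal the PortionsYear nulls, 4,991 at 3 or below equal the NTYear nulls, and 8,592 at 4 or below equal the BibleYear nulls), which suggests the levels are cumulative: 3 and above has portions, 4 and above a New Testament, and 5 a complete Bible.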

Treatment: Treat as an ordinal category (one-hot or ordered encode) rather than a continuous numeric.

anthropic:claude-opus-4-7 · confidence high
Out[88]:

saturn.columns["BibleStatus"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 6
min 0
max 5
mean 3.862
median 4
std 1.429
q1 3
q3 5
iqr 2
skew -1.219
kurtosis 0.5856
n_outliers 0
outlier_rate 0
zero_rate 0.04786
Fig 33. Distribution of BibleStatus. Vertical dash marks the median.
Histogram bins for BibleStatus (median: 4.0). Only the 6 non-empty bins are listed; the other 34 bins fall between integer values and are empty.
bin count
0 – 0.125 784
1 – 1.125 555
2 – 2.125 1597
3 – 3.125 2055
4 – 4.125 3601
4.875 – 5 7790

BibleYear categorical metadata

BibleYear appears to encode a translation's publication or revision span, typically formatted as a start-end year range like "1818-2022", with single years (e.g. "1954") as a minority pattern. Cardinality is high (466 distinct values) and the most common range covers 8.75% of non-null values (682 rows, or 4.2% of all rows), with an entropy ratio of 0.776. 52.4% of values (8,592) are null, and the 7,790 populated rows equal the number of rows with BibleStatus 5, so a null most likely means that no complete Bible exists rather than that a date went unrecorded.

Treatment: Parse into separate start_year and end_year integer features and add a missingness indicator before modelling; do not impute the nulls.

anthropic:claude-opus-4-7 · confidence high
Out[91]:

saturn.columns["BibleYear"].stats

stat value
n 16,382
nulls 8,592 (52.4%)
unique 466
top_value 1818-2022
top_rate 0.08755
cardinality 466
entropy 6.883
entropy_ratio 0.7765
alert: null_rate 52.4% null
Fig 34. Top values for BibleYear.
Top values for BibleYear (20 unique shown, of 466 total).
value count share
1818-2022 682 4.2%
1382-2020 424 2.6%
1809-2022 366 2.2%
1553-2020 308 1.9%
1727-2024 221 1.3%
1954 195 1.2%
1751-2025 173 1.1%
1843-2022 167 1.0%
1815-2021 164 1.0%
1823-2021 160 1.0%
1854-2022 148 0.9%
1895-2020 146 0.9%
1874-2018 136 0.8%
1959-2021 134 0.8%
1478-2020 112 0.7%
1841-2022 109 0.7%
1821-2024 98 0.6%
1827-2024 92 0.6%
1875-2024 90 0.5%
2008 87 0.5%
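
The BibleYear treatment (start year, end year, missingness indicator) can be sketched with a small parser. The function name is hypothetical, and treating a single year as a span whose start and end coincide is an assumption; values in any other shape raise so that they are noticed rather than silently dropped.

```python
import re

_SPAN = re.compile(r"^(\d{4})(?:-(\d{4}))?$")


def parse_year_span(text):
    """Return (start_year, end_year, is_missing) for 'YYYY' or 'YYYY-YYYY'."""
    if text is None:
        return None, None, True
    match = _SPAN.match(text.strip())
    if not match:
        raise ValueError(f"unexpected BibleYear value: {text!r}")
    start = int(match.group(1))
    # Assumption: a lone year is both the first and the latest edition.
    end = int(match.group(2)) if match.group(2) else start
    return start, end, False


for raw in ["1818-2022", "1954", None]:
    print(raw, "->", parse_year_span(raw))
```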

NTYear text metadata

NTYear is a year or year-range metadata field (e.g. '1811-1998', '1380-2011') stored as short single-token strings, with 1,072 unique values across 16,382 rows. The column mixes three kinds of value: 8,778 nine-character 'YYYY-YYYY' ranges, 1,943 four-character single years, and a sentinel 'Yes' that appears 670 times. 30.5% of rows (4,991) are null, and that count equals the number of rows with BibleStatus 3 or below, so the nulls probably mean that no New Testament exists. The 94.1% allcaps alert looks like an artifact: it matches the share of digit-only strings (10,721 of 11,391 non-null values), which contain no lowercase letters.

Treatment: Parse into start_year/end_year integer columns, isolate the 'Yes' sentinel into a separate flag, and keep the nulls as a missingness indicator rather than imputing them.

anthropic:claude-opus-4-7 · confidence medium
Out[94]:

saturn.columns["NTYear"].stats

stat value
n 16,382
nulls 4,991 (30.5%)
unique 1,072
len_min 3
len_max 9
len_mean 7.794
len_median 9
len_p95 9
word_mean 1
word_median 1
n_empty 0
n_duplicates 10,319
duplicate_rate 0.9059
vocab_size 1,072
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.9412
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 94.1% rows are all-caps
alert: null_rate 30.5% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 90.6% duplicate strings
Fig 35. Character-length distribution for NTYear.
Character-length distribution for NTYear (mean: 7.79). Only the 3 non-empty bins are listed; the other 37 bins are empty.
chars count
3 670
4 1943
9 8778
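
The three length bins of NTYear map onto three kinds of value, and the same counts explain the allcaps alert. The sketch below checks that arithmetic and classifies individual values; the reading that saturn counts any string without lowercase letters as all-caps is an assumption, and `classify` is a hypothetical helper.

```python
# Non-empty length bins copied from the NTYear histogram.
LENGTH_COUNTS = {3: 670, 4: 1943, 9: 8778}
N_ROWS, N_NULL = 16_382, 4_991

non_null = sum(LENGTH_COUNTS.values())
digit_only = LENGTH_COUNTS[4] + LENGTH_COUNTS[9]
digit_only_share = digit_only / non_null


def classify(value):
    """Label a raw NTYear value as missing, flag, year, range, or other."""
    if value is None:
        return "missing"
    if value == "Yes":
        return "flag"
    if len(value) == 4 and value.isdigit():
        return "year"
    if (len(value) == 9 and value[4] == "-"
            and value[:4].isdigit() and value[5:].isdigit()):
        return "range"
    return "other"


print(non_null, round(digit_only_share, 4))
print([classify(v) for v in ["1811-1998", "1954", "Yes", None]])
```

Any value that lands in "other" is worth inspecting by hand before the column is split.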

PortionsYear text feature

PortionsYear is a single-token field that mostly encodes year ranges (e.g. '1806-1962', '1530-1995'): 9,972 nine-character ranges and 1,954 four-character single years, plus a 'Yes' bucket of 1,520 rows that breaks the type. Nulls run at 17.9% (2,936 rows, the same as the number of rows with BibleStatus 2 or below), and duplicate_rate is 0.87 across 1,737 unique values, so the column is highly repetitive. The mix of a boolean-like 'Yes' with hyphenated year spans suggests two concepts share one column: whether portions exist and when they were published. As in NTYear, the 88.7% allcaps alert matches the share of digit-only strings and can be ignored.

Treatment: Split into two fields: parse year ranges into start/end integers and isolate the 'Yes' values into a separate boolean before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[97]:

saturn.columns["PortionsYear"].stats

stat value
n 16,382
nulls 2,936 (17.9%)
unique 1,737
len_min 3
len_max 9
len_mean 7.595
len_median 9
len_p95 9
word_mean 1
word_median 1
n_empty 0
n_duplicates 11,709
duplicate_rate 0.8708
vocab_size 1,737
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.887
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 88.7% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 87.1% duplicate strings
Fig 36. Character-length distribution for PortionsYear.
Character-length distribution for PortionsYear (mean: 7.60). Only the 3 non-empty bins are listed; the other 37 bins are empty.
chars count
3 1520
4 1954
9 9972

TranslationNeedQuestionable unknown other

Column 'TranslationNeedQuestionable' was skipped by the profiler, so its kind, cardinality and value distribution are unknown. The only confirmed signals are 16382 rows with a 0.0 null rate. The name suggests a boolean or flag indicating uncertainty about translation need, but this cannot be verified from the evidence.

Treatment: Re-profile or inspect raw values before deciding on use; do not model until kind is resolved.

anthropic:claude-opus-4-7 · confidence low
Out[100]:

saturn.columns["TranslationNeedQuestionable"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique
alert: skipped no profiler for kind=unknown

JPScale numeric feature

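JPScale is an integer-valued ordinal feature spanning 1 to 5 with 5 unique values across 16,382 rows and no nulls; it is the Joshua Project 'reachedness' rating rather than a survey response. The distribution is U-shaped, not flat: 7,124 rows (43.5%) sit at 1, levels 4 and 5 hold 3,636 and 3,200, and levels 2 and 3 only 1,009 and 1,413. The strongly negative kurtosis (-1.66) reflects that two-ended shape, and the mean (2.68) and median (3) fall in the sparse middle, so neither is a useful summary. The 7,124 rows at level 1 equal the LeastReached 'Y' count. No outliers and no zeros are present.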

Treatment: Treat as an ordinal categorical (1-5) rather than continuous; one-hot or keep as ordered integer.

anthropic:claude-opus-4-7 · confidence high
Out[102]:

saturn.columns["JPScale"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 5
min 1
max 5
mean 2.681
median 3
std 1.644
q1 1
q3 4
iqr 3
skew 0.1937
kurtosis -1.658
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 37. Distribution of JPScale. Vertical dash marks the median.
Histogram bins for JPScale (median: 3.0). Only the 5 non-empty bins are listed; the other 35 bins fall between integer values and are empty.
bin count
1 – 1.1 7124
2 – 2.1 1009
3 – 3.1 1413
4 – 4.1 3636
4.9 – 5 3200
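
Because JPScale takes only five values, its histogram is really a frequency table, and the per-level shares say more than the moments do. The sketch below rebuilds the shares and the mean from the non-empty bins; it is a worked check on the numbers above, not part of saturn.

```python
# Non-empty histogram bins for JPScale, one per integer level.
JPSCALE_COUNTS = {1: 7124, 2: 1009, 3: 1413, 4: 3636, 5: 3200}

total = sum(JPSCALE_COUNTS.values())
shares = {level: n / total for level, n in JPSCALE_COUNTS.items()}
mean = sum(level * n for level, n in JPSCALE_COUNTS.items()) / total

for level, share in shares.items():
    print(level, f"{share:.1%}")
print("mean", round(mean, 3))
```

The two ends of the scale (levels 1, 4 and 5) hold most of the rows, while the mean lands between the sparse levels 2 and 3.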

JPScalePC categorical feature

JPScalePC is a 5-level categorical holding the values "1" through "5" with no nulls across 16,382 rows; by its name it is a variant of the JPScale rating stored as strings, though the profile does not explain the 'PC' suffix. The distribution is concentrated at the extremes: "5" leads at 33.8% and "1" follows at 31.7%, while the middle codes "2" and "3" together account for only about 10.6%. This is a property of the rating scale, not a sign of polarised survey responses. An entropy ratio of 0.86 confirms the spread is wide but not uniform.

Treatment: Treat as ordinal (1-5); keep as integer or one-hot depending on model.

anthropic:claude-opus-4-7 · confidence high
Out[105]:

saturn.columns["JPScalePC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 5
top_value 5
top_rate 0.3381
cardinality 5
entropy 1.997
entropy_ratio 0.8602
Fig 38. Top values for JPScalePC.
Top values for JPScalePC (5 unique shown, of 5 total).
value count share
5 5539 33.8%
1 5189 31.7%
4 3914 23.9%
3 927 5.7%
2 813 5.0%

JPScalePGAC categorical feature

JPScalePGAC is a low-cardinality categorical with 5 distinct string-encoded levels ('1' through '5') across 16,382 rows and no nulls. By its name it is another variant of the Joshua Project JPScale rating, not a seismic intensity scale; the profile does not explain the 'PGAC' suffix (it may refer to people groups across countries, compare PeopNameAcrossCountries). The distribution is uneven: '1' dominates at 43.3% (7,091 rows) while '2' is the rarest at 908 rows, and an entropy ratio of 0.86 shows the remaining mass is spread broadly. The frequency order (1 > 4 > 5 > 3 > 2) has the same U shape as JPScale, so the sparse middle is a feature of the scale rather than a data problem.

Treatment: Treat as ordinal: cast to integer and preserve order, or one-hot encode if downstream model is non-ordinal.

anthropic:claude-opus-4-7 · confidence high
Out[108]:

saturn.columns["JPScalePGAC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 5
top_value 1
top_rate 0.4329
cardinality 5
entropy 1.998
entropy_ratio 0.8603
Fig 39. Top values for JPScalePGAC.
Top values for JPScalePGAC (5 unique shown, of 5 total).
value count share
1 7091 43.3%
4 3587 21.9%
5 3520 21.5%
3 1276 7.8%
2 908 5.5%

LeastReached categorical feature

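Binary Y/N flag named LeastReached, fully populated across 16,382 rows with only 2 distinct values. The split is fairly balanced: 'N' leads at 56.5% (9,258) versus 7,124 'Y', yielding a near-maximal entropy ratio of 0.988. The 'Y' count equals the number of rows at JPScale 1, so the flag looks like a restatement of the lowest scale level. No nulls or anomalies are present.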

Treatment: Encode as a 0/1 boolean for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[111]:

saturn.columns["LeastReached"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value N
top_rate 0.5651
cardinality 2
entropy 0.9877
entropy_ratio 0.9877
Fig 40. Top values for LeastReached.
Top values for LeastReached (2 unique shown, of 2 total).
value count share
N 9258 56.5%
Y 7124 43.5%

LeastReachedPC categorical feature

A binary Y/N flag named LeastReachedPC, apparently the 'PC' counterpart of LeastReached (the profile does not explain the suffix). The split is moderately imbalanced at 67.3% N (11,029) versus 32.7% Y (5,353), with no nulls across 16,382 rows and an entropy ratio of 0.91 showing both classes are well represented.

Treatment: Encode as a 0/1 indicator for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[114]:

saturn.columns["LeastReachedPC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value N
top_rate 0.6732
cardinality 2
entropy 0.9116
entropy_ratio 0.9116
Fig 41. Top values for LeastReachedPC.
Top values for LeastReachedPC (2 unique shown, of 2 total).
value count share
N 11029 67.3%
Y 5353 32.7%

LeastReachedPGAC categorical feature

Binary Y/N flag, apparently the 'PGAC' counterpart of LeastReached, with no missing values across 16,382 rows. The split is fairly balanced: 'N' leads at 56.7% (9,291) versus 7,091 'Y', giving a near-maximal entropy ratio of 0.987. The 'Y' count equals the number of rows with JPScalePGAC '1'.

Treatment: Encode as a 0/1 indicator for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[117]:

saturn.columns["LeastReachedPGAC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value N
top_rate 0.5671
cardinality 2
entropy 0.987
entropy_ratio 0.987
Fig 42. Top values for LeastReachedPGAC.
Top values for LeastReachedPGAC (2 unique shown, of 2 total).
value count share
N 9291 56.7%
Y 7091 43.3%

GSEC categorical feature

GSEC is a low-cardinality categorical with 8 distinct values across 16,382 rows and no nulls. The dominant value is the empty string at 40.0% (6,553 rows), followed by '1' at 4,852; the remaining codes ('0','2','3','4','5','6') split the rest, suggesting a coded classification where blanks likely encode 'not applicable' or missing-as-empty. Entropy ratio of 0.732 indicates moderate spread despite the empty-string plurality.

Treatment: Recode the empty string as an explicit missing category and one-hot encode the remaining codes.

anthropic:claude-opus-4-7 · confidence high
Out[120]:

saturn.columns["GSEC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 8
top_value (empty string)
top_rate 0.4
cardinality 8
entropy 2.197
entropy_ratio 0.7325
Fig 43. Top values for GSEC.
Top values for GSEC (8 unique shown, of 8 total).
value count share
(empty string) 6553 40.0%
1 4852 29.6%
6 1676 10.2%
4 1435 8.8%
5 1337 8.2%
0 238 1.5%
2 167 1.0%
3 124 0.8%
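
The GSEC treatment, recoding the empty string as an explicit level and then one-hot encoding, can be sketched in plain Python. The label 'missing', the column prefix, and the function name are illustrative choices, and the sample values are invented.

```python
def one_hot_with_missing(values, prefix="GSEC", missing_label="missing"):
    """Recode '' as an explicit level, then build one indicator per level."""
    recoded = [missing_label if v == "" else v for v in values]
    levels = sorted(set(recoded))
    rows = [{f"{prefix}_{level}": int(v == level) for level in levels}
            for v in recoded]
    return levels, rows


# Invented sample using values that occur in the column.
levels, rows = one_hot_with_missing(["", "1", "6", ""])
print(levels)
for row in rows:
    print(row)
```

Giving the blank its own indicator keeps 40% of the rows in play instead of treating them as silently absent.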

HasAudioRecordings categorical feature

Binary Y/N flag indicating whether a record has associated audio recordings, fully populated across 16382 rows. The distribution is imbalanced: 'Y' covers 82.3% (13479) versus 2903 'N', with entropy ratio 0.67.

Treatment: Encode as a 0/1 boolean indicator for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[123]:

saturn.columns["HasAudioRecordings"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value Y
top_rate 0.8228
cardinality 2
entropy 0.6739
entropy_ratio 0.6739
Fig 44. Top values for HasAudioRecordings.
Top values for HasAudioRecordings (2 unique shown, of 2 total).
value count share
Y 13479 82.3%
N 2903 17.7%

NTOnline categorical feature

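NTOnline is a categorical flag with only one observed value, 'Y', in all 11,705 non-null rows (71.5%). The remaining 4,677 rows (28.5%) are null. The zero entropy and the imbalance alert describe the non-null values only: read as Y versus null, the column is a roughly 71/29 presence flag, and the null most plausibly means that no online New Testament is available rather than that the value is unknown.

Treatment: Recode to a 0/1 indicator (Y = 1, null = 0) after confirming what null means; drop only if that reading proves wrong.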

anthropic:claude-opus-4-7 · confidence high
Out[126]:

saturn.columns["NTOnline"].stats

stat value
n 16,382
nulls 4,677 (28.5%)
unique 1
top_value Y
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: null_rate 28.5% null
alert: imbalance top value is 100.0% of rows
Fig 45. Top values for NTOnline.
Top values for NTOnline (1 unique shown, of 1 total).
value count share
Y 11705 71.5%
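
If the nulls in NTOnline mean "not available", the column becomes an ordinary presence flag. The sketch below makes that recoding explicit; the interpretation of null is an assumption that should be checked against the source, and the function name is hypothetical.

```python
def presence_flag(value):
    """Map 'Y' to 1 and null to 0; anything else is unexpected and raises."""
    if value is None:
        return 0
    if value == "Y":
        return 1
    raise ValueError(f"unexpected NTOnline value: {value!r}")


# Counts copied from the NTOnline stats table.
N_ROWS, N_PRESENT = 16_382, 11_705
present_share = N_PRESENT / N_ROWS

print([presence_flag(v) for v in ["Y", None, "Y"]])
print(round(present_share, 4))
```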

RLG3 numeric feature

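RLG3 is a small-integer code ranging from 1 to 9 with 8 distinct values across 16,382 rows and no nulls. The histogram shows which value is absent: code 3 never occurs. The distribution is uneven rather than flat: code 1 holds 6,459 rows (39.4%), codes 4, 5 and 6 hold between 2,338 and 3,786 each, and codes 2, 7, 8 and 9 are rare (under 4% each). The name suggests a religion code (compare PrimaryReligion and RLG4 elsewhere in the file), in which case the numbers are labels with no inherent order, and the mean (3.47), median (4), skew and kurtosis carry no meaning.

Treatment: Treat as a nominal categorical and one-hot or target-encode it; use it as an ordinal feature only if the code list turns out to be ordered.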

anthropic:claude-opus-4-7 · confidence high
Out[129]:

saturn.columns["RLG3"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 8
min 1
max 9
mean 3.469
median 4
std 2.238
q1 1
q3 6
iqr 5
skew 0.1265
kurtosis -1.366
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 46.
Distribution of RLG3. Vertical dash marks the median.
Histogram bins for RLG3 (median: 4.0). Only non-empty bins are listed; the other 32 of the 40 bins have count 0.
bin      count
1 – 1.2  6459
2 – 2.2  635
4 – 4.2  2651
5 – 5.2  2338
6 – 6.2  3786
7 – 7.2  200
8 – 8.2  124
8.8 – 9  189
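
A quick way to test whether RLG3 is just a numeric ID for PrimaryReligion is to pair codes and labels by their row counts. The sketch below uses the counts printed in this notebook; the helper name match_by_count is illustrative. Matching on counts is only suggestive, so with the raw rows available a code-by-label cross-tabulation would be the proper confirmation.

```python
# Counts copied from the non-empty RLG3 histogram bins and from the
# PrimaryReligion top-values table in this report.
rlg3_counts = {1: 6459, 2: 635, 4: 2651, 5: 2338, 6: 3786, 7: 200, 8: 124, 9: 189}
religion_counts = {
    "Christianity": 6459, "Islam": 3786, "Ethnic Religions": 2651,
    "Hinduism": 2338, "Buddhism": 635, "Non-Religious": 200,
    "Unknown": 189, "Other / Small": 124,
}

def match_by_count(code_counts, label_counts):
    """Pair each code with the label that has the same row count.

    Returns None when a count is shared by two labels or has no partner,
    because the pairing is then ambiguous or impossible.
    """
    by_count = {}
    for label, n in label_counts.items():
        if n in by_count:
            return None  # two labels share a count: cannot disambiguate
        by_count[n] = label
    mapping = {}
    for code, n in code_counts.items():
        if n not in by_count:
            return None  # this code's count has no matching label
        mapping[code] = by_count[n]
    return mapping

mapping = match_by_count(rlg3_counts, religion_counts)
for code in sorted(mapping):
    print(code, mapping[code])
```

The same check can be repeated for RLG3PC against PrimaryReligionPC and for RLG3PGAC against PrimaryReligionPGAC.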

RLG3PC numeric feature

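RLG3PC is an integer-valued numeric column with only 8 distinct values bounded between 1 and 9, no nulls, and no zeros; the value 3 never occurs. The flat distribution (kurtosis -1.47, IQR spanning 1 to 6) and small cardinality mark it as a code rather than a continuous measurement. Its per-value counts (7,795, 647, 1,223, 2,557, 3,658, 267, 62, 173) match the category counts of PrimaryReligionPC exactly, which strongly suggests it is the numeric ID of that label. Mean 3.21 sits above the median of 2, consistent with the mild positive skew (0.31), but neither statistic is meaningful for a nominal code.

Treatment: Treat as a nominal category; one-hot encode rather than scaling as continuous, and avoid using it alongside PrimaryReligionPC.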

anthropic:claude-opus-4-7 · confidence high
Out[132]:

saturn.columns["RLG3PC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 8
min 1
max 9
mean 3.213
median 2
std 2.311
q1 1
q3 6
iqr 5
skew 0.3143
kurtosis -1.466
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 47.
Distribution of RLG3PC. Vertical dash marks the median.
Histogram bins for RLG3PC (median: 2.0). Only non-empty bins are listed; the other 32 of the 40 bins have count 0.
bin      count
1 – 1.2  7795
2 – 2.2  647
4 – 4.2  1223
5 – 5.2  2557
6 – 6.2  3658
7 – 7.2  267
8 – 8.2  62
8.8 – 9  173

RLG3PGAC numeric feature

RLG3PGAC holds an integer code on a small 1-9 scale with only 8 distinct values across 16,382 rows and zero nulls; the value 3 never occurs. The distribution is broad and flat (kurtosis -1.36, std 2.25, IQR spanning 1 to 6) with near-zero skew. Its per-value counts (6,462, 603, 2,613, 2,348, 3,766, 256, 139, 195) match the category counts of PrimaryReligionPGAC exactly, so it is almost certainly the numeric ID of that label rather than a rating or a continuous measurement. No outliers and no zeros are present.

Treatment: Treat as a nominal categorical feature; one-hot encode rather than scaling as continuous, and avoid using it alongside PrimaryReligionPGAC.

anthropic:claude-opus-4-7 · confidence high
Out[135]:

saturn.columns["RLG3PGAC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 8
min 1
max 9
mean 3.486
median 4
std 2.252
q1 1
q3 6
iqr 5
skew 0.1259
kurtosis -1.363
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 48.
Distribution of RLG3PGAC. Vertical dash marks the median.
Histogram bins for RLG3PGAC (median: 4.0). Only non-empty bins are listed; the other 32 of the 40 bins have count 0.
bin      count
1 – 1.2  6462
2 – 2.2  603
4 – 4.2  2613
5 – 5.2  2348
6 – 6.2  3766
7 – 7.2  256
8 – 8.2  139
8.8 – 9  195

PrimaryReligion categorical feature

PrimaryReligion is a low-cardinality categorical label assigning each of 16,382 rows to one of 8 religious traditions, with no nulls. Christianity dominates at 39.4% (6,459 rows), followed by Islam (3,786) and Ethnic Religions (2,651); the long tail includes 189 'Unknown' and 124 'Other / Small' rows. Entropy ratio of 0.74 indicates a moderately balanced distribution rather than a single overwhelming class.

Treatment: One-hot or target-encode; consider grouping 'Unknown' and 'Other / Small' if modelling sensitivity to rare classes.

anthropic:claude-opus-4-7 · confidence high
Out[138]:

saturn.columns["PrimaryReligion"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 8
top_value Christianity
top_rate 0.3943
cardinality 8
entropy 2.231
entropy_ratio 0.7436
Fig 49.
Top values for PrimaryReligion.
Top values for PrimaryReligion (8 unique shown, of 8 total).
value             count  share
Christianity      6459   39.4%
Islam             3786   23.1%
Ethnic Religions  2651   16.2%
Hinduism          2338   14.3%
Buddhism          635    3.9%
Non-Religious     200    1.2%
Unknown           189    1.2%
Other / Small     124    0.8%
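
The treatment suggested for PrimaryReligion (merge the two catch-all labels, then one-hot encode) can be sketched in plain Python as follows. The merged label name and both helper names are illustrative choices, not part of the dataset or of Saturn.

```python
def group_rare(values, rare=("Unknown", "Other / Small"), merged="Other/Unknown"):
    """Replace each of the listed rare labels with one merged label."""
    return [merged if v in rare else v for v in values]

def one_hot(values):
    """Return (sorted category list, one 0/1 indicator row per input value)."""
    cats = sorted(set(values))
    index = {c: i for i, c in enumerate(cats)}
    rows = []
    for v in values:
        row = [0] * len(cats)
        row[index[v]] = 1
        rows.append(row)
    return cats, rows

sample = ["Christianity", "Islam", "Unknown", "Other / Small", "Hinduism"]
cats, rows = one_hot(group_rare(sample))
print(cats)
for label, row in zip(sample, rows):
    print(label, row)
```

Merging before encoding keeps the two smallest classes from becoming near-empty indicator columns.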

PrimaryReligionPC categorical feature

Categorical label of the dominant religion of a people-cluster (PC), with 8 distinct values and no nulls across 16,382 rows. Christianity leads at 47.6% (7,795), followed by Islam (3,658) and Hinduism (2,557), while a small 'Unknown' bucket (173) and 'Other / Small' (62) provide explicit catch-alls. Entropy ratio of 0.69 indicates moderate concentration rather than a single dominant class.

Treatment: One-hot or target-encode; consider merging 'Unknown' and 'Other / Small' if downstream model is sensitive to rare levels.

anthropic:claude-opus-4-7 · confidence high
Out[141]:

saturn.columns["PrimaryReligionPC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 8
top_value Christianity
top_rate 0.4758
cardinality 8
entropy 2.071
entropy_ratio 0.6904
Fig 50.
Top values for PrimaryReligionPC.
Top values for PrimaryReligionPC (8 unique shown, of 8 total).
value             count  share
Christianity      7795   47.6%
Islam             3658   22.3%
Hinduism          2557   15.6%
Ethnic Religions  1223   7.5%
Buddhism          647    3.9%
Non-Religious     267    1.6%
Unknown           173    1.1%
Other / Small     62     0.4%

PrimaryReligionPGAC categorical label

Categorical label of the primary religion of a people group taken across all countries (PGAC most likely stands for 'people group across countries'), drawn from a fixed taxonomy of 8 values with no nulls across 16,382 rows. Christianity dominates at 39.4% (6,462), followed by Islam (3,766), Ethnic Religions (2,613) and Hinduism (2,348); Buddhism, Non-Religious, Unknown and Other / Small together account for under 8% of rows. Entropy ratio of 0.748 indicates a moderately concentrated but not degenerate distribution.

Treatment: One-hot or target-encode; consider grouping the small Unknown/Other tail before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[144]:

saturn.columns["PrimaryReligionPGAC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 8
top_value Christianity
top_rate 0.3945
cardinality 8
entropy 2.245
entropy_ratio 0.7482
Fig 51.
Top values for PrimaryReligionPGAC.
Top values for PrimaryReligionPGAC (8 unique shown, of 8 total).
value             count  share
Christianity      6462   39.4%
Islam             3766   23.0%
Ethnic Religions  2613   16.0%
Hinduism          2348   14.3%
Buddhism          603    3.7%
Non-Religious     256    1.6%
Unknown           195    1.2%
Other / Small     139    0.8%

RLG4 numeric feature

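RLG4 is a sparse integer-valued column with only 20 distinct values spanning 10 to 41. It is overwhelmingly missing (null_rate 0.9621), so just under 4% of rows (621) carry a value. It shares its null count (15,761) and its number of distinct values (20) with ReligionSubdivision, and its median value, 20, occurs 181 times, the same count as 'Sunni' in that column; RLG4 therefore looks like the numeric ID of the religion subdivision rather than a score or count. Read that way, the skew (0.94), the 33 flagged outliers (outlier_rate 0.053), the median of 20 and the IQR of 7 describe an arbitrary code numbering, not a quantity. No zeros are present.

Treatment: Treat as a nominal code paired with ReligionSubdivision; map nulls to an explicit 'not applicable' level instead of imputing, and keep only one of the two columns.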

anthropic:claude-opus-4-7 · confidence medium
Out[147]:

saturn.columns["RLG4"].stats

stat value
n 16,382
nulls 15,761 (96.2%)
unique 20
min 10
max 41
mean 18.49
median 20
std 6.519
q1 14
q3 21
iqr 7
skew 0.9361
kurtosis 1.156
n_outliers 33
outlier_rate 0.05314
zero_rate 0
alert: null_rate 96.2% null
alert: outliers 5.3% rows beyond 1.5 IQR
Fig 52.
Distribution of RLG4. Vertical dash marks the median.
Histogram bins for RLG4 (median: 20.0).
bin            count
10 – 11.29     114
11.29 – 12.58  16
12.58 – 13.88  0
13.88 – 15.17  124
15.17 – 16.46  7
16.46 – 17.75  0
17.75 – 19.04  21
19.04 – 20.33  181
20.33 – 21.62  3
21.62 – 22.92  4
22.92 – 24.21  80
24.21 – 25.5   11
25.5 – 26.79   27
26.79 – 28.08  0
28.08 – 29.38  0
29.38 – 30.67  0
30.67 – 31.96  0
31.96 – 33.25  4
33.25 – 34.54  0
34.54 – 35.83  2
35.83 – 37.12  16
37.12 – 38.42  1
38.42 – 39.71  9
39.71 – 41     1

ReligionSubdivision categorical feature

A sub-categorisation of religion (e.g. Sunni/Shia branches, Buddhist schools, Judaism, Sikhism), populated only when a finer split applies. It is overwhelmingly null at 96.21%, so just 621 of 16,382 rows carry one of 20 values, with Sunni leading at 29.15% of the populated rows (181). The share column in the top-values table is computed over all rows, which is why Sunni appears there as 1.1%. Entropy ratio 0.74 indicates the non-null portion is reasonably spread rather than dominated by a single bucket.

Treatment: Treat nulls as an explicit 'not applicable' category before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high
Out[150]:

saturn.columns["ReligionSubdivision"].stats

stat value
n 16,382
nulls 15,761 (96.2%)
unique 20
top_value Sunni
top_rate 0.2915
cardinality 20
entropy 3.185
entropy_ratio 0.7369
alert: null_rate 96.2% null
Fig 53.
Top values for ReligionSubdivision.
Top values for ReligionSubdivision (20 unique shown, of 20 total).
value               count  share
Sunni               181    1.1%
Judaism             124    0.8%
Sikhism             68     0.4%
Tibetan             59     0.4%
Theravada           55     0.3%
Ancestor Worship    27     0.2%
Shia                21     0.1%
Mahayana            16     0.1%
Jainism             12     0.1%
Zoroastrianism      11     0.1%
Kirati              10     0.1%
Prakriti            9      0.1%
Animism             7      0.0%
Mandaeism           6      0.0%
Druze               4      0.0%
Baha'i              4      0.0%
Syncretized         3      0.0%
Shia Imami Ismaili  2      0.0%
Sarna               1      0.0%
Lingayat            1      0.0%
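
The two percentages quoted for Sunni (29.15% in the stats, 1.1% in the table) differ only in their denominator. The short calculation below uses the figures reported above; fill_not_applicable is an illustrative helper for the suggested treatment, and its label text is an arbitrary choice.

```python
n_rows = 16382   # total rows
n_null = 15761   # nulls in ReligionSubdivision
sunni = 181      # rows labelled 'Sunni'

populated = n_rows - n_null              # rows that carry a subdivision
share_of_populated = sunni / populated   # basis used by top_rate
share_of_all = sunni / n_rows            # basis used by the table's share column

print(populated)
print(round(share_of_populated, 4))
print(round(share_of_all * 100, 1))

def fill_not_applicable(values, label="Not applicable"):
    """Turn nulls (None) into an explicit category before encoding."""
    return [label if v is None else v for v in values]

print(fill_not_applicable(["Sunni", None, "Judaism"]))
```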

PCIslam numeric feature

PCIslam is a numeric column bounded between 0 and 100, almost certainly the percentage share of Muslims in each group's population. The distribution is heavily zero-inflated and U-shaped: 63.2% of values are exactly 0 and the median is 0, while the mean is 23.2 and the top histogram bin (97.5 to 100) holds 2,540 rows, giving a right skew of 1.27. The 3,438 flagged outliers (21.1%) are this upper mode rather than anomalies: with Q3 at 28 and an IQR of 28, the 1.5×IQR upper fence sits at 70, so every group above 70% is flagged. Nulls are negligible (0.52%), and 1,117 distinct values suggest reasonably fine-grained measurement rather than a coarse bucket.

Treatment: Treat as a zero-inflated proportion: model the zero mass separately or add a presence indicator before scaling.

anthropic:claude-opus-4-7 · confidence high
Out[153]:

saturn.columns["PCIslam"].stats

stat value
n 16,382
nulls 86 (0.5%)
unique 1,117
min 0
max 100
mean 23.2
median 0
std 39.54
q1 0
q3 28
iqr 28
skew 1.273
kurtosis -0.2575
n_outliers 3,438
outlier_rate 0.211
zero_rate 0.6322
alert: outliers 21.1% rows beyond 1.5 IQR
Fig 54.
Distribution of PCIslam. Vertical dash marks the median.
Histogram bins for PCIslam (median: 0.0).
bin          count
0 – 2.5      11065
2.5 – 5      160
5 – 7.5      245
7.5 – 10     54
10 – 12.5    250
12.5 – 15    34
15 – 17.5    119
17.5 – 20    26
20 – 22.5    167
22.5 – 25    18
25 – 27.5    80
27.5 – 30    17
30 – 32.5    113
32.5 – 35    27
35 – 37.5    45
37.5 – 40    21
40 – 42.5    58
42.5 – 45    12
45 – 47.5    29
47.5 – 50    6
50 – 52.5    40
52.5 – 55    10
55 – 57.5    26
57.5 – 60    22
60 – 62.5    66
62.5 – 65    15
65 – 67.5    53
67.5 – 70    23
70 – 72.5    72
72.5 – 75    25
75 – 77.5    53
77.5 – 80    32
80 – 82.5    87
82.5 – 85    37
85 – 87.5    76
87.5 – 90    59
90 – 92.5    159
92.5 – 95    123
95 – 97.5    232
97.5 – 100   2540
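
One way to apply the presence-indicator treatment to a zero-inflated percentage such as PCIslam is to split each value into a 0/1 flag and a scaled magnitude. This is a minimal sketch under the assumption that nulls arrive as None and that dividing by 100 is an acceptable scaling; the function name is illustrative.

```python
def split_zero_inflated(values, scale=100.0):
    """Split a zero-inflated percentage into (is_present, magnitude) lists.

    is_present is 1 for values above 0 and 0 for exact zeros; magnitude is
    the value divided by `scale`. Nulls (None) stay None in both outputs.
    """
    present, magnitude = [], []
    for v in values:
        if v is None:
            present.append(None)
            magnitude.append(None)
        else:
            present.append(1 if v > 0 else 0)
            magnitude.append(v / scale)
    return present, magnitude

flags, sizes = split_zero_inflated([0, 28, None, 100])
print(flags)
print(sizes)
```

A model can then learn the zero versus non-zero decision from the flag and the size of the share from the magnitude.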

PCNonReligious numeric feature

PCNonReligious appears to be a percentage feature capturing the share of a population that is non-religious, ranging from 0 to 99. The distribution is dominated by zeros (75.2% of rows) with median, Q1, and Q3 all at 0, yet the mean is 3.42 and skew is 3.65 with kurtosis 15.4, indicating a long right tail. The 24.8% outlier rate is a by-product of that zero mass rather than a sign of anomalies: with an IQR of 0, the 1.5×IQR rule flags every non-zero value, so the outlier rate is exactly the complement of the zero rate. Most records report none and a minority report substantial percentages.

Treatment: Consider a zero-inflated treatment or log1p transform before modelling given the 75% zeros and heavy right tail.

anthropic:claude-opus-4-7 · confidence high
Out[156]:

saturn.columns["PCNonReligious"].stats

stat value
n 16,382
nulls 66 (0.4%)
unique 223
min 0
max 99
mean 3.421
median 0
std 9.21
q1 0
q3 0
iqr 0
skew 3.648
kurtosis 15.43
n_outliers 4,043
outlier_rate 0.2478
zero_rate 0.7522
alert: high_skew skew=+3.65
alert: outliers 24.8% rows beyond 1.5 IQR
Fig 55.
Distribution of PCNonReligious. Vertical dash marks the median.
Histogram bins for PCNonReligious (median: 0.0).
bin            count
0 – 2.475      12958
2.475 – 4.95   603
4.95 – 7.425   634
7.425 – 9.9    160
9.9 – 12.38    406
12.38 – 14.85  77
14.85 – 17.32  299
17.32 – 19.8   91
19.8 – 22.28   203
22.28 – 24.75  72
24.75 – 27.23  105
27.23 – 29.7   39
29.7 – 32.18   145
32.18 – 34.65  41
34.65 – 37.12  137
37.12 – 39.6   46
39.6 – 42.08   93
42.08 – 44.55  47
44.55 – 47.02  65
47.02 – 49.5   10
49.5 – 51.98   7
51.98 – 54.45  20
54.45 – 56.93  13
56.93 – 59.4   4
59.4 – 61.88   7
61.88 – 64.35  1
64.35 – 66.83  4
66.83 – 69.3   1
69.3 – 71.78   16
71.78 – 74.25  8
74.25 – 76.73  1
76.73 – 79.2   0
79.2 – 81.67   0
81.67 – 84.15  0
84.15 – 86.62  0
86.62 – 89.1   0
89.1 – 91.58   0
91.58 – 94.05  0
94.05 – 96.53  2
96.53 – 99     1
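
The outlier counts for the zero-heavy percentage columns follow directly from the fence arithmetic. The sketch below assumes the alert "rows beyond 1.5 IQR" refers to the standard Tukey fences with strict inequalities; the quartiles are the ones reported in the stats tables.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True when x lies strictly outside the fences."""
    low, high = tukey_fences(q1, q3, k)
    return x < low or x > high

# PCNonReligious: q1 = q3 = 0, so both fences collapse onto 0
# and any positive percentage is flagged.
print(tukey_fences(0, 0))
print(is_outlier(0.5, 0, 0), is_outlier(0, 0, 0))

# PCIslam: q1 = 0, q3 = 28, so only values above 70 are flagged.
print(tukey_fences(0, 28))
```

For columns like these the outlier flag is better read as "non-zero" or "upper mode" than as "suspicious".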

PCUnknown numeric feature

PCUnknown is a numeric feature bounded between 0 and 100, most likely the percentage of a group's population whose religion is recorded as unknown. It is overwhelmingly zero (zero_rate 0.9558) with median, q1, and q3 all at 0, yet the max reaches 100 with skew 9.07 and kurtosis 81.5. The 719 outliers (4.4%) are simply the non-zero rows, since an IQR of 0 makes the 1.5×IQR rule flag any value above 0. The non-zero values (583 distinct values in total, including 0) form a long, thin tail with a second spike at the top: the 97.5 to 100 bin alone holds 152 rows.

Treatment: Binarize (zero vs non-zero) or log1p-transform before modelling given the 95.6% zero mass and extreme skew.

anthropic:claude-opus-4-7 · confidence high
Out[159]:

saturn.columns["PCUnknown"].stats

stat value
n 16,382
nulls 104 (0.6%)
unique 583
min 0
max 100
mean 1.201
median 0
std 10.34
q1 0
q3 0
iqr 0
skew 9.066
kurtosis 81.52
n_outliers 719
outlier_rate 0.04417
zero_rate 0.9558
alert: high_skew skew=+9.07
Fig 56.
Distribution of PCUnknown. Vertical dash marks the median.
Histogram bins for PCUnknown (median: 0.0).
bin          count
0 – 2.5      15982
2.5 – 5      29
5 – 7.5      13
7.5 – 10     11
10 – 12.5    12
12.5 – 15    4
15 – 17.5    7
17.5 – 20    2
20 – 22.5    5
22.5 – 25    5
25 – 27.5    3
27.5 – 30    6
30 – 32.5    1
32.5 – 35    3
35 – 37.5    0
37.5 – 40    4
40 – 42.5    2
42.5 – 45    0
45 – 47.5    1
47.5 – 50    2
50 – 52.5    2
52.5 – 55    1
55 – 57.5    1
57.5 – 60    2
60 – 62.5    1
62.5 – 65    0
65 – 67.5    1
67.5 – 70    1
70 – 72.5    3
72.5 – 75    1
75 – 77.5    1
77.5 – 80    1
80 – 82.5    1
82.5 – 85    1
85 – 87.5    2
87.5 – 90    0
90 – 92.5    1
92.5 – 95    12
95 – 97.5    2
97.5 – 100   152

SecurityLevel numeric feature

SecurityLevel takes only 3 distinct integer values spanning 0 to 2 with no nulls, so it is effectively a category encoded numerically (e.g., low/medium/high). The distribution is U-shaped rather than balanced: level 0 holds 38.8% of rows (6,361), level 1 only 12.4% (2,030) and level 2 48.8% (7,991), which is what the strongly negative kurtosis (-1.82) reflects. The mean of 1.10 and median of 1.0 therefore land on the least common level and should not be read as a typical value.

Treatment: Treat as an ordinal categorical and one-hot or ordinal-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[162]:

saturn.columns["SecurityLevel"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 3
min 0
max 2
mean 1.099
median 1
std 0.9307
q1 0
q3 2
iqr 2
skew -0.1985
kurtosis -1.816
n_outliers 0
outlier_rate 0
zero_rate 0.3883
Fig 57.
Distribution of SecurityLevel. Vertical dash marks the median.
Histogram bins for SecurityLevel (median: 1.0). Only non-empty bins are listed; the other 37 of the 40 bins have count 0.
bin       count
0 – 0.05  6361
1 – 1.05  2030
1.95 – 2  7991
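
The per-level shares and the mean of SecurityLevel can be recomputed from the three non-empty histogram bins, which makes it easy to see that the mean sits on the least common level. The counts below are copied from the histogram; nothing else is assumed.

```python
# Rows per SecurityLevel value, from the non-empty histogram bins.
counts = {0: 6361, 1: 2030, 2: 7991}

total = sum(counts.values())
shares = {level: n / total for level, n in counts.items()}
mean = sum(level * n for level, n in counts.items()) / total

for level in sorted(shares):
    print(level, round(shares[level], 3))
print("mean", round(mean, 3))
```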

LRTop100 categorical label

Binary Y/N flag marking membership in an 'LR Top 100' list (LR presumably stands for least-reached), with exactly 100 positive cases out of 16,382 rows (top_rate 0.9939 for 'N'). Extreme class imbalance and very low entropy (0.0537) make the column nearly constant. There are no nulls and exactly 2 categories, as expected.

Treatment: Use stratified sampling or class-weighting if modelling; otherwise treat as rare-event indicator.

anthropic:claude-opus-4-7 · confidence high
Out[165]:

saturn.columns["LRTop100"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value N
top_rate 0.9939
cardinality 2
entropy 0.05368
entropy_ratio 0.05368
alert: imbalance top value is 99.4% of rows
Fig 58.
Top values for LRTop100.
Top values for LRTop100 (2 unique shown, of 2 total).
value  count  share
N      16282  99.4%
Y      100    0.6%
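
If LRTop100 is ever used as a target, class weights are one way to offset the 100-to-16,282 imbalance. The sketch below implements the common "balanced" heuristic, n_samples / (n_classes * class_count), which is the same formula scikit-learn uses for class_weight="balanced"; the function name is illustrative.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Counts copied from the LRTop100 table.
weights = balanced_class_weights({"N": 16282, "Y": 100})
print(weights)
```

Each 'Y' row then counts roughly 160 times as much as an 'N' row, so both classes contribute equally to a weighted loss.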

PhotoAddress text foreign_key

PhotoAddress holds single-token image filenames in the pattern p#####.jpg, with a max length of 13 and exactly one word per row. 5,718 of 16,382 rows (~35%) are empty strings rather than nulls, and overall duplicate rate is 67.8% — the same photo file is reused across many records (e.g., p19007.jpg appears 92 times). With only 5,277 unique values, this behaves like a foreign-key reference to an image asset, not a per-row unique pointer.

Treatment: Treat empty strings as missing and join to an image/asset table on this filename rather than modelling it as text.

anthropic:claude-opus-4-7 · confidence high
Out[168]:

saturn.columns["PhotoAddress"].stats

stat value
n 16,382
nulls 1 (0.0%)
unique 5,277
len_min 0
len_max 13
len_mean 6.523
len_median 10
len_p95 10
word_mean 1
word_median 1
n_empty 5,718
n_duplicates 11,104
duplicate_rate 0.6779
vocab_size 5,276
readability_flesch_mean 82.43
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 67.8% duplicate strings
Fig 59.
Character-length distribution for PhotoAddress.
Character-length distribution for PhotoAddress (mean: 6.522556620474941). Only non-empty bins are listed; the other 37 of the 40 bins have count 0.
chars  count
0      5718
10     10591
13     72
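
Treating empty strings as missing and checking filenames against the described pattern might look like the sketch below. The regular expression is an assumption derived from the "p#####.jpg" description; the 72 values of length 13 show that other name shapes exist, so a failed match is a prompt to inspect, not proof of an error.

```python
import re

# Assumed pattern: 'p', five digits, '.jpg' (10 characters in total).
PHOTO_RE = re.compile(r"p\d{5}\.jpg")

def clean_photo_address(value):
    """Map None, empty, or whitespace-only strings to None; strip the rest."""
    if value is None:
        return None
    value = value.strip()
    return value or None

def matches_pattern(value):
    """True when a cleaned value has exactly the assumed p#####.jpg shape."""
    return value is not None and PHOTO_RE.fullmatch(value) is not None

raw = ["p19007.jpg", "", None, "photo.jpg"]
cleaned = [clean_photo_address(v) for v in raw]
print(cleaned)
print([matches_pattern(v) for v in cleaned])
```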

PhotoCredits text metadata

Attribution string for a photo credit, mostly short names or sources (mean length 11.5 chars, median 1 word). Highly repetitive: 90.2% duplicate rate across only 1,605 unique values, with 5,718 empty entries and 3,065 'Anonymous' tags dominating. Top words reveal stock/CC sources like Flickr, Wikimedia, Pixabay, and Shutterstock alongside named contributors.

Treatment: Treat as low-cardinality categorical attribution; normalize empties/'Anonymous' and group rare credits before any analysis.

anthropic:claude-opus-4-7 · confidence high
Out[171]:

saturn.columns["PhotoCredits"].stats

stat value
n 16,382
nulls 10 (0.1%)
unique 1,605
len_min 0
len_max 56
len_mean 11.54
len_median 9
len_p95 30
word_mean 2.081
word_median 1
n_empty 5,718
n_duplicates 14,767
duplicate_rate 0.902
vocab_size 2,658
readability_flesch_mean -13.88
emoji_rate 0
url_rate 0.0004276
one_word_rate 0.5754
allcaps_rate 0.001649
boilerplate_rate 0
alert: one_word 57.5% rows are a single word
alert: duplicates 90.2% duplicate strings
Fig 60.
Character-length distribution for PhotoCredits.
Character-length distribution for PhotoCredits (mean: 11.541595406792084).
chars    count
0 – 1    5718
1 – 3    0
3 – 4    8
4 – 6    14
6 – 7    293
7 – 8    34
8 – 10   3203
10 – 11  541
11 – 13  425
13 – 14  182
14 – 15  479
15 – 17  173
17 – 18  474
18 – 20  251
20 – 21  446
21 – 22  658
22 – 24  371
24 – 25  683
25 – 27  247
27 – 28  269
28 – 29  798
29 – 31  532
31 – 32  209
32 – 34  53
34 – 35  24
35 – 36  81
36 – 38  23
38 – 39  17
39 – 41  17
41 – 42  6
42 – 43  61
43 – 45  5
45 – 46  39
46 – 48  12
48 – 49  11
49 – 50  8
50 – 52  1
52 – 53  3
53 – 55  2
55 – 56  1

PhotoCreditURL text metadata

This column stores photo credit URLs, with every non-empty value being a single token (one_word_rate 1.0) and 47.5% matching a URL pattern. It is sparsely populated: 33.1% of rows are null and a further 5,718 hold empty strings, leaving roughly 5,245 rows (about a third) with an actual value. Duplication is high at 86.9%, and a single site, https://www.asiaharvest.org, accounts for 736 rows. At most 1,434 distinct values serve 16,382 rows, suggesting a small set of recurring image sources rather than per-record attribution.

Treatment: Extract the domain as a categorical feature and drop the raw URL; do not use the raw URL as a modelling input.

anthropic:claude-opus-4-7 · confidence high
Out[174]:

saturn.columns["PhotoCreditURL"].stats

stat value
n 16,382
nulls 5,419 (33.1%)
unique 1,434
len_min 0
len_max 240
len_mean 25.67
len_median 0
len_p95 73
word_mean 1
word_median 1
n_empty 5,718
n_duplicates 9,529
duplicate_rate 0.8692
vocab_size 1,433
readability_flesch_mean -261.7
emoji_rate 0
url_rate 0.4753
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: url_heavy 47.5% rows contain a URL
alert: null_rate 33.1% null
alert: duplicates 86.9% duplicate strings
Fig 61.
Character-length distribution for PhotoCreditURL.
Character-length distribution for PhotoCreditURL (mean: 25.672899753717047).
chars      count
0 – 6      5718
6 – 12     0
12 – 18    3
18 – 24    47
24 – 30    1018
30 – 36    370
36 – 42    111
42 – 48    388
48 – 54    851
54 – 60    514
60 – 66    584
66 – 72    783
72 – 78    117
78 – 84    88
84 – 90    94
90 – 96    41
96 – 102   33
102 – 108  26
108 – 114  23
114 – 120  68
120 – 126  6
126 – 132  16
132 – 138  2
138 – 144  24
144 – 150  1
150 – 156  21
156 – 162  0
162 – 168  2
168 – 174  7
174 – 180  0
180 – 186  0
186 – 192  2
192 – 198  1
198 – 204  0
204 – 210  2
210 – 216  0
216 – 222  0
222 – 228  0
228 – 234  0
234 – 240  2
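
Domain extraction for PhotoCreditURL can be done with the standard library. In this sketch, stripping a leading "www." and returning None for anything without a recognisable host are illustrative choices, not something the report prescribes.

```python
from urllib.parse import urlparse

def credit_domain(url):
    """Return the lower-cased host of a URL without a leading 'www.',
    or None for nulls, empty strings, and values with no host part."""
    if not url:
        return None
    host = urlparse(url.strip()).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host

print(credit_domain("https://www.asiaharvest.org"))
print(credit_domain(""))
print(credit_domain("not a url"))
```

The resulting domain column has far fewer levels than the raw URL and can be frequency-encoded or grouped like any other categorical.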

PhotoCreativeCommons categorical feature

A binary Y/N flag indicating whether a photo carries a Creative Commons licence. The column is heavily skewed: 'N' accounts for 13,686 rows (83.6% of non-null values) against 2,691 'Y', and only 5 rows are null.

Treatment: Encode as a 0/1 boolean; expect class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[177]:

saturn.columns["PhotoCreativeCommons"].stats

stat value
n 16,382
nulls 5 (0.0%)
unique 2
top_value N
top_rate 0.8357
cardinality 2
entropy 0.6445
entropy_ratio 0.6445
Fig 62.
Top values for PhotoCreativeCommons.
Top values for PhotoCreativeCommons (2 unique shown, of 2 total).
value  count  share
N      13686  83.5%
Y      2691   16.4%

PhotoCopyright categorical feature

Binary Y/N flag indicating whether a photo copyright applies, with 'N' dominating at 14,395 rows (87.95% of non-null values) versus 1,972 'Y'. Class imbalance is notable but not extreme, and nulls are negligible at 0.09% (15 rows). Entropy ratio of 0.53 reflects this skew toward 'N'.

Treatment: Encode as a 0/1 boolean; be aware of the ~1:7 class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[180]:

saturn.columns["PhotoCopyright"].stats

stat value
n 16,382
nulls 15 (0.1%)
unique 2
top_value N
top_rate 0.8795
cardinality 2
entropy 0.5308
entropy_ratio 0.5308
Fig 63.
Top values for PhotoCopyright.
Top values for PhotoCopyright (2 unique shown, of 2 total).
value  count  share
N      14395  87.9%
Y      1972   12.0%

PhotoPermission categorical feature

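A Y/N flag recording photo-use permission, with 14,252 rows (87.1% of non-null values) set to 'N' and only 0.1% null. Cardinality is 3 because two records use lowercase 'y' alongside 2,111 uppercase 'Y', a casing inconsistency worth normalising. The entropy ratio of 0.35 overstates the skew, because it divides the entropy (0.5564 bits) by log2(3) for three categories; with the casing merged into two categories the ratio would be about 0.56. An 'N' may mean that no permission is on record rather than that permission was refused.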

Treatment: Uppercase-normalise then map to a boolean before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[183]:

saturn.columns["PhotoPermission"].stats

stat value
n 16,382
nulls 17 (0.1%)
unique 3
top_value N
top_rate 0.8709
cardinality 3
entropy 0.5564
entropy_ratio 0.3511
Fig 64.
Top values for PhotoPermission.
Top values for PhotoPermission (3 unique shown, of 3 total).
value  count  share
N      14252  87.0%
Y      2111   12.9%
y      2      0.0%
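
The casing fix and boolean mapping for PhotoPermission take only a few lines. The sketch assumes nulls arrive as None and that any value other than Y or N (after trimming and uppercasing) should also be treated as missing; the function name is illustrative.

```python
def permission_to_bool(value):
    """Map 'Y'/'y' to True, 'N'/'n' to False, and anything else to None."""
    if value is None:
        return None
    normalised = value.strip().upper()
    if normalised == "Y":
        return True
    if normalised == "N":
        return False
    return None

print([permission_to_bool(v) for v in ["N", "Y", "y", None, ""]])
```

After this step the column has two categories, so the stray third level no longer distorts cardinality or entropy ratio.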

ProfileTextExists categorical feature

Binary Y/N flag indicating whether a profile has text, with no nulls across 16382 rows. Roughly 79.5% are 'Y' (13018) versus 'N' (3364), an imbalance worth noting but not extreme. Entropy ratio of 0.73 confirms a moderately skewed but informative distribution.

Treatment: Encode as a 0/1 boolean for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[186]:

saturn.columns["ProfileTextExists"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value Y
top_rate 0.7947
cardinality 2
entropy 0.7325
entropy_ratio 0.7325
Fig 65.
Top values for ProfileTextExists.
Top values for ProfileTextExists (2 unique shown, of 2 total).
value  count  share
Y      13018  79.5%
N      3364   20.5%

CountOfCountries numeric feature

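Counts the number of countries associated with each record, ranging from 1 to 164 with a median of 1 and Q3 of just 4. The distribution is severely right-skewed (skew 5.15, kurtosis 32.05) and 19.2% of rows flag as outliers (values above the 1.5×IQR upper fence of 8.5), indicating a long tail where a minority of records span dozens of countries, up to 164, while most cover only one. The top bin (159.9 to 164) holds 164 rows, which would be consistent with a group-level count repeated on each of that group's per-country rows.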

Treatment: Log-transform or bin (e.g. 1, 2-4, 5+) before modelling to tame the heavy tail.

anthropic:claude-opus-4-7 · confidence high
Out[189]:

saturn.columns["CountOfCountries"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 48
min 1
max 164
mean 8.328
median 1
std 20.64
q1 1
q3 4
iqr 3
skew 5.152
kurtosis 32.05
n_outliers 3,139
outlier_rate 0.1916
zero_rate 0
alert: high_skew skew=+5.15
alert: outliers 19.2% rows beyond 1.5 IQR
Fig 66.
Distribution of CountOfCountries. Vertical dash marks the median.
Histogram bins for CountOfCountries (median: 1.0).
bin            count
1 – 5.075      12712
5.075 – 9.15   702
9.15 – 13.23   503
13.23 – 17.3   366
17.3 – 21.38   334
21.38 – 25.45  306
25.45 – 29.53  221
29.53 – 33.6   193
33.6 – 37.68   69
37.68 – 41.75  235
41.75 – 45.83  84
45.83 – 49.9   142
49.9 – 53.98   53
53.98 – 58.05  57
58.05 – 62.12  0
62.12 – 66.2   0
66.2 – 70.28   0
70.28 – 74.35  0
74.35 – 78.42  78
78.42 – 82.5   163
82.5 – 86.58   0
86.58 – 90.65  0
90.65 – 94.73  0
94.73 – 98.8   0
98.8 – 102.9   0
102.9 – 107    0
107 – 111      0
111 – 115.1    0
115.1 – 119.2  0
119.2 – 123.2  0
123.2 – 127.3  0
127.3 – 131.4  0
131.4 – 135.5  0
135.5 – 139.6  0
139.6 – 143.6  0
143.6 – 147.7  0
147.7 – 151.8  0
151.8 – 155.8  0
155.8 – 159.9  0
159.9 – 164    164
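
The two options named in the treatment, binning and a log transform, can be sketched as follows. The bin edges (1, 2-4, 5+) are the ones given as an example in the treatment line; the function names are illustrative.

```python
import math

def country_bin(n):
    """Bucket a country count into '1', '2-4', or '5+'."""
    if n < 1:
        raise ValueError("CountOfCountries is expected to be at least 1")
    if n == 1:
        return "1"
    if n <= 4:
        return "2-4"
    return "5+"

def log_count(n):
    """Natural log of the count; maps the minimum value 1 to 0.0."""
    return math.log(n)

print([country_bin(n) for n in [1, 2, 4, 5, 164]])
print(round(log_count(164), 2))
```

Binning is the safer choice when the count is a group-level attribute repeated across rows, since it avoids giving a handful of widely dispersed groups outsized leverage.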

CountOfProvinces unknown other

The column 'CountOfProvinces' was skipped by the profiler, so beyond a row count of 16382 and a null rate of 0.0 there is no evidence about its distribution, type, or uniqueness. The name suggests an integer count of provinces per record, but this cannot be confirmed from the payload. No further signal is available.

Treatment: Re-run the profiler on this column to recover type and distribution before any downstream use.

anthropic:claude-opus-4-7 · confidence low
Out[192]:

saturn.columns["CountOfProvinces"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique
alert: skipped no profiler for kind=unknown

Longitude numeric feature

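This is a Longitude coordinate column with at least one corrupted value: valid longitudes must lie within [-180, 180], yet the max is 2350588.0, and the mean (189.33) exceeds the legal range. Skew of 127.98 and kurtosis of 16376.52 reflect that contamination. The histogram places 16,381 rows in the lowest bin and a single row in the highest, so the damage may be confined to one entry, which on its own adds about 143 to the mean (2350588 / 16382). The median of 55.45 is plausible, so the bulk of rows are likely valid. The 207 flagged outliers (1.26%) need not be errors: the 1.5×IQR fences fall at about -120.3 and 223.6, so legitimate longitudes west of -120 are flagged too.

Treatment: Null out or repair values outside [-180, 180] before any geospatial use; clipping would silently move the bad points onto the 180° meridian.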

anthropic:claude-opus-4-7 · confidence high
Out[194]:

saturn.columns["Longitude"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 15,889
min -179.3
max 2.351e+06
mean 189.3
median 55.45
std 1.836e+04
q1 8.673
q3 94.64
iqr 85.97
skew 128
kurtosis 1.638e+04
n_outliers 207
outlier_rate 0.01264
zero_rate 0
alert: high_skew skew=+127.98
Fig 67.
Distribution of Longitude. Vertical dash marks the median.
Histogram bins for Longitude (median: 55.451769999999996). Only non-empty bins are listed; the 38 bins in between have count 0.
bin                    count
-179.3 – 5.859e+04     16381
2.292e+06 – 2.351e+06  1
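
A range check is enough to isolate the malformed longitude before any mapping. The sketch below replaces out-of-range values with None instead of clipping them; the sample list is made up from figures quoted in the stats, and clean_longitude is an illustrative name.

```python
def clean_longitude(value):
    """Return the value if it is a legal longitude, otherwise None.

    NaN also fails the range comparison and is therefore mapped to None.
    """
    if value is None or not (-180.0 <= value <= 180.0):
        return None
    return value

sample = [55.45, -179.3, 2350588.0, 94.64]
cleaned = [clean_longitude(v) for v in sample]
print(cleaned)

# Contribution of the single maximum value to the column mean.
mean_shift = 2350588.0 / 16382
print(round(mean_shift, 1))
```

Rows that come back as None can then be inspected by hand; a misplaced decimal point is one plausible cause, but that is a guess that the raw record would have to confirm.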

Latitude numeric feature

Geographic latitude in decimal degrees, with values spanning -54.94 to 78.21 — well within the valid [-90, 90] range. Distribution is nearly symmetric (skew -0.12) and slightly flat (kurtosis -0.26), centered around a median of 17.03 and mean of 16.44, suggesting a tropical/northern-hemisphere bias. Near-unique (15851 of 16382) with no nulls and only 39 mild outliers, consistent with per-record geocoordinates rather than a categorical region label.

Treatment: Pair with longitude for geospatial features; consider binning by hemisphere or clustering rather than using raw degrees in linear models.

anthropic:claude-opus-4-7 · confidence high
Out[197]:

saturn.columns["Latitude"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 15,851
min -54.94
max 78.21
mean 16.44
median 17.03
std 20.47
q1 2.072
q3 29.88
iqr 27.81
skew -0.118
kurtosis -0.2579
n_outliers 39
outlier_rate 0.002381
zero_rate 0
Fig 68.
Distribution of Latitude. Vertical dash marks the median.
Histogram bins for Latitude (median: 17.03254999983485).
bin              count
-54.94 – -51.61  5
-51.61 – -48.28  1
-48.28 – -44.95  5
-44.95 – -41.62  16
-41.62 – -38.29  12
-38.29 – -34.96  75
-34.96 – -31.64  153
-31.64 – -28.31  36
-28.31 – -24.98  109
-24.98 – -21.65  132
-21.65 – -18.32  172
-18.32 – -14.99  296
-14.99 – -11.66  253
-11.66 – -8.335  488
-8.335 – -5.006  761
-5.006 – -1.678  936
-1.678 – 1.651   589
1.651 – 4.98     633
4.98 – 8.308     1052
8.308 – 11.64    1205
11.64 – 14.97    791
14.97 – 18.29    793
18.29 – 21.62    775
21.62 – 24.95    1210
24.95 – 28.28    1405
28.28 – 31.61    779
31.61 – 34.94    775
34.94 – 38.27    429
38.27 – 41.6     496
41.6 – 44.92     496
44.92 – 48.25    387
48.25 – 51.58    471
51.58 – 54.91    280
54.91 – 58.24    131
58.24 – 61.57    155
61.57 – 64.9     43
64.9 – 68.22     19
68.22 – 71.55    15
71.55 – 74.88    1
74.88 – 78.21    2

Ctry categorical feature

Country names stored as full strings, with 238 distinct values across 16,382 rows and no nulls. India dominates at 13.8% (2,262 rows), followed by Papua New Guinea (883) and Indonesia (788), so the file leans toward South Asia, Southeast Asia and Melanesia. Entropy ratio of 0.79 indicates a fairly broad spread despite the long tail.

Treatment: Group long-tail countries or target/frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[200]:

saturn.columns["Ctry"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 238
top_value India
top_rate 0.1381
cardinality 238
entropy 6.225
entropy_ratio 0.7885
Fig 69.
Top values for Ctry.
Top values for Ctry (20 unique shown, of 238 total).
value                          count  share
India                          2262   13.8%
Papua New Guinea               883    5.4%
Indonesia                      788    4.8%
Pakistan                       775    4.7%
China                          547    3.3%
Nigeria                        535    3.3%
United States                  496    3.0%
Mexico                         333    2.0%
Brazil                         321    2.0%
Cameroon                       292    1.8%
Bangladesh                     278    1.7%
Canada                         243    1.5%
Congo, Democratic Republic of  231    1.4%
Myanmar (Burma)                218    1.3%
Australia                      205    1.3%
Philippines                    200    1.2%
Sudan                          198    1.2%
Nepal                          195    1.2%
Laos                           184    1.1%
Malaysia                       183    1.1%
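
Grouping the long tail and frequency-encoding Ctry can be combined in one pass. In the sketch below, the min_count threshold and the "Other" label are illustrative parameters, and the tiny sample stands in for the real column.

```python
from collections import Counter

def frequency_encode(values, min_count=2, other="Other"):
    """Group categories seen fewer than min_count times under `other`,
    then replace every value by the relative frequency of its group."""
    counts = Counter(values)
    grouped = [v if counts[v] >= min_count else other for v in values]
    freq = Counter(grouped)
    n = len(grouped)
    return grouped, [freq[v] / n for v in grouped]

sample = ["India", "India", "Laos", "Nepal"]
grouped, encoded = frequency_encode(sample)
print(grouped)
print(encoded)
```

On the full column a threshold in the tens or hundreds of rows would be more realistic than 2; the right value depends on the downstream model.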

IndigenousCode categorical feature

IndigenousCode is a binary Y/N flag, fully populated across all 16,382 rows with only 2 distinct values. The class split is uneven: 'Y' covers 74.8% of records against 'N' for the remainder, yielding entropy of 0.81. The imbalance is notable but not extreme.

Treatment: Encode as a binary indicator; consider class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[203]:

saturn.columns["IndigenousCode"].stats

statvalue
n16,382
nulls0 (0.0%)
unique2
top_value Y
top_rate 0.7483
cardinality 2
entropy 0.8139
entropy_ratio 0.8139
Fig 70.
Top values for IndigenousCode.
Top values for IndigenousCode (2 unique shown, of 2 total).
value count share
Y 12259 74.8%
N 4123 25.2%
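
The entropy figures can be reproduced from the counts in the table. This is a minimal sketch, assuming the profiler reports Shannon entropy in bits and divides it by log2 of the cardinality to get entropy_ratio; that reading is consistent with the numbers in this report but is not confirmed by it. The function names are illustrative.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)


# Counts from the IndigenousCode table: Y = 12259, N = 4123.
h = shannon_entropy_bits([12259, 4123])   # about 0.814, in line with the reported 0.8139
ratio = entropy_ratio([12259, 4123])      # same value, because log2(2) = 1
```

For a two-valued column the ratio and the entropy coincide, which is why both stats read 0.8139 here.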

PercentAdherents text feature

This is a numeric percentage field (PercentAdherents) stored as text: all 16,382 values are single tokens 5 to 7 characters long, such as '0.000' or '95.000'. The distribution is heavily concentrated at zero (4,007 of 16,382 rows, about 24.5%) and shows strong duplication (duplicate_rate 0.924, only 1,248 unique values). The 'allcaps' and 'one_word' alerts are artifacts of profiling numeric strings as text; this is not categorical text.

Treatment: Cast to float and treat as a numeric feature; consider zero-inflation handling given the spike at 0.000.

anthropic:claude-opus-4-7 · confidence high
Out[206]:

saturn.columns["PercentAdherents"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 1,248
len_min 5
len_max 7
len_mean 5.534
len_median 6
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 15,134
duplicate_rate 0.9238
vocab_size 1,248
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 92.4% duplicate strings
Fig 71.
Character-length distribution for PercentAdherents.
Character-length distribution for PercentAdherents (mean: 5.533573434257112; empty bins omitted).
chars count
5 7823
6 8377
7 182
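
The cast-to-float treatment recommended for the text-typed percentage columns can be sketched as below. This is an illustration under assumptions: the helper name `parse_percent` is hypothetical, and missing or unparseable strings are mapped to `None` rather than to zero so that the spike at '0.000' stays distinguishable from missing data.

```python
def parse_percent(text):
    """Return the float value of a percentage string, or None if missing or unparseable."""
    if text is None:
        return None
    text = text.strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None


# Toy inputs in the same format as the column ('0.000', '95.000'), plus missing values.
raw = ["0.000", "95.000", "100.000", None, ""]
values = [parse_percent(t) for t in raw]

# Zero-inflation handling: keep a separate flag for exact zeros, leaving missing as None.
is_zero = [None if v is None else v == 0.0 for v in values]
```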

PercentChristianPC categorical feature

Stored as a categorical string, this column holds a Christian percentage that looks group-level rather than per-row: only 246 distinct values occur across 16,382 rows, with no nulls. The grouping key is probably not country, since the modal value '90.061' covers 1,139 rows (6.95%) and no country in Ctry has that row count (India has 2,262, Papua New Guinea 883). The top ten values include both very high shares (90.061, 82.325, 76.515) and near-zero shares (0.482, 0.111, 0.000), consistent with a small set of group-level percentages broadcast onto many rows. An entropy ratio of 0.86 indicates the rows are fairly evenly spread across the 246 values despite the heavy mode.

Treatment: Cast to float and treat as a numeric feature rather than a category.

anthropic:claude-opus-4-7 · confidence high
Out[209]:

saturn.columns["PercentChristianPC"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 246
top_value 90.061
top_rate 0.06953
cardinality 246
entropy 6.853
entropy_ratio 0.8628
Fig 72.
Top values for PercentChristianPC.
Top values for PercentChristianPC (20 unique shown, of 246 total).
value count share
90.061 1139 7.0%
0.482 935 5.7%
0.111 592 3.6%
82.325 448 2.7%
8.571 448 2.7%
0.000 374 2.3%
0.508 359 2.2%
76.515 342 2.1%
0.004 293 1.8%
5.345 253 1.5%
70.253 240 1.5%
9.838 226 1.4%
38.996 223 1.4%
5.023 223 1.4%
24.025 182 1.1%
3.733 164 1.0%
67.647 162 1.0%
70.325 161 1.0%
49.250 160 1.0%
67.300 160 1.0%

NaturalName text label

NaturalName appears to be a people-group or ethno-linguistic label. Values are short (len_mean 10.9, word_mean 1.59) and just over half are a single word (one_word_rate 0.555). Roughly a third of rows repeat an earlier value (duplicate_rate 0.345: 5,645 duplicates across 10,737 uniques), led by 'Deaf' (164), 'French' (82), and 'British' (80). The top-word list includes fragments such as 'traditions)', '(hindu', and '(muslim', each occurring 500 to 1,000+ times. These most likely come from the profiler splitting names shaped like 'X (Hindu traditions)' on whitespace, which means a religious qualifier is embedded in many names, not that the stored strings are broken.

Treatment: Normalise casing and split the parenthetical qualifier into its own field before using the base name as a categorical grouping key.

anthropic:claude-opus-4-7 · confidence high
Out[212]:

saturn.columns["NaturalName"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 10,737
len_min 1
len_max 41
len_mean 10.91
len_median 9
len_p95 25
word_mean 1.585
word_median 1
n_empty 0
n_duplicates 5,645
duplicate_rate 0.3446
vocab_size 11,164
readability_flesch_mean 50.53
emoji_rate 0
url_rate 0
one_word_rate 0.5554
allcaps_rate 0
boilerplate_rate 0
alert: one_word 55.5% rows are a single word
alert: duplicates 34.5% duplicate strings
Fig 73.
Character-length distribution for NaturalName.
Character-length distribution for NaturalName (mean: 10.908008790135515).
chars count
1 – 2 1
2 – 3 34
3 – 4 290
4 – 5 1252
5 – 6 1696
6 – 7 1956
7 – 8 1615
8 – 9 1098
9 – 10 697
10 – 11 766
11 – 12 729
12 – 13 698
13 – 14 883
14 – 15 741
15 – 16 733
16 – 17 593
17 – 18 355
18 – 19 248
19 – 20 186
20 – 21 147
21 – 22 132
22 – 23 112
23 – 24 180
24 – 25 232
25 – 26 280
26 – 27 204
27 – 28 134
28 – 29 100
29 – 30 50
30 – 31 46
31 – 32 67
32 – 33 43
33 – 34 42
34 – 35 22
35 – 36 15
36 – 37 0
37 – 38 0
38 – 39 1
39 – 40 2
40 – 41 2
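
Names shaped like 'X (Hindu traditions)' can be separated into a base name and a qualifier before grouping. The sketch below is one way to do that with a regular expression; the function name `split_qualifier` is hypothetical, and it assumes at most one trailing parenthetical per name, which the profile does not confirm.

```python
import re

# Base name, optional whitespace, then one trailing "(...)" group with no nested parentheses.
_QUALIFIER = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")


def split_qualifier(name):
    """Return (base_name, qualifier); qualifier is None when there is no trailing parenthetical."""
    match = _QUALIFIER.match(name)
    if match is None:
        return name.strip(), None
    return match.group("base").strip(), match.group("qualifier").strip()


with_qualifier = split_qualifier("X (Hindu traditions)")   # ('X', 'Hindu traditions')
without_qualifier = split_qualifier("Deaf")                # ('Deaf', None)
```

Keeping the qualifier in its own field lets the base name serve as a grouping key while the religious tradition remains available as a separate feature.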

NaturalPronunciation text metadata

This column holds phonetic respellings of people-group names (e.g. 'AY-zhun', 'chai-NEEZ', 'kor-EE-un'), with hyphenated syllables and capitalised stress, forming an ad-hoc pronunciation guide. It is sparse and repetitive: 69.63% of rows are null (11,407), and among the 4,975 non-null rows 69.41% are single words and 61.15% are duplicates, leaving 1,933 unique values. The most frequent value, 'def', appears 164 times, the same count as 'Deaf' in NaturalName, so it is most likely the respelling of 'Deaf' rather than a placeholder or default.

Treatment: Treat as a pronunciation lookup keyed to NaturalName rather than a modelling feature; confirm that 'def' pairs with 'Deaf', and leave nulls as missing or drop the column given the 69.63% null rate.

anthropic:claude-opus-4-7 · confidence high
Out[215]:

saturn.columns["NaturalPronunciation"].stats

stat value
n 16,382
nulls 11,407 (69.6%)
unique 1,933
len_min 2
len_max 57
len_mean 12.13
len_median 11
len_p95 26
word_mean 1.345
word_median 1
n_empty 0
n_duplicates 3,042
duplicate_rate 0.6115
vocab_size 2,039
readability_flesch_mean 61.69
emoji_rate 0
url_rate 0
one_word_rate 0.6941
allcaps_rate 0.000402
boilerplate_rate 0
alert: one_word 69.4% rows are a single word
alert: null_rate 69.6% null
alert: duplicates 61.1% duplicate strings
Fig 74.
Character-length distribution for NaturalPronunciation.
Character-length distribution for NaturalPronunciation (mean: 12.130653266331658).
chars count
2 – 3 285
3 – 5 161
5 – 6 231
6 – 8 690
8 – 9 599
9 – 10 488
10 – 12 500
12 – 13 281
13 – 14 333
14 – 16 211
16 – 17 257
17 – 18 174
18 – 20 165
20 – 21 156
21 – 23 55
23 – 24 39
24 – 25 76
25 – 27 26
27 – 28 58
28 – 30 24
30 – 31 17
31 – 32 43
32 – 34 8
34 – 35 11
35 – 36 12
36 – 38 9
38 – 39 14
39 – 40 7
40 – 42 9
42 – 43 6
43 – 45 8
45 – 46 6
46 – 47 3
47 – 49 3
49 – 50 4
50 – 52 1
52 – 53 0
53 – 54 3
54 – 56 0
56 – 57 2

PercentChristianPGAC text feature

This column holds percentages (a Christian share, going by the PercentChristian prefix; the profile does not expand the PGAC suffix) stored as text rather than numeric, with values like "0.000", "95.000", and "90.000" that are 5 to 7 characters long. The distribution is heavily zero-inflated: 3,121 of 16,382 rows are "0.000", the duplicate rate is 88%, and only 1,954 unique values occur. The allcaps and one-word alerts fire only because the profiler treated numeric strings as text tokens.

Treatment: Cast to float and treat as a numeric percentage feature.

anthropic:claude-opus-4-7 · confidence high
Out[218]:

saturn.columns["PercentChristianPGAC"].stats

stat value
n 16,382
nulls 15 (0.1%)
unique 1,954
len_min 5
len_max 7
len_mean 5.528
len_median 6
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 14,413
duplicate_rate 0.8806
vocab_size 1,954
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 88.1% duplicate strings
Fig 75.
Character-length distribution for PercentChristianPGAC.
Character-length distribution for PercentChristianPGAC (mean: 5.528196981731533; empty bins omitted).
chars count
5 7867
6 8355
7 145

PercentEvangelical text feature

This is a numeric percentage (share evangelical) stored as text, one token per row and 5 to 6 characters long, with values such as '0.000' and '6.000'. The distribution is heavily zero-inflated: 4,205 of 16,382 rows are '0.000', and the duplicate rate is 0.9315 across only 1,047 unique values. The null rate is 0.0668, so roughly 7% of rows (1,095) are missing.

Treatment: Cast to float and treat as a zero-inflated numeric feature.

anthropic:claude-opus-4-7 · confidence high
Out[221]:

saturn.columns["PercentEvangelical"].stats

stat value
n 16,382
nulls 1,095 (6.7%)
unique 1,047
len_min 5
len_max 6
len_mean 5.226
len_median 5
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 14,240
duplicate_rate 0.9315
vocab_size 1,047
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 93.2% duplicate strings
Fig 76.
Character-length distribution for PercentEvangelical.
Character-length distribution for PercentEvangelical (mean: 5.226401517629358; empty bins omitted).
chars count
5 11826
6 3461

PercentEvangelicalPC categorical feature

Evangelical percentages stored as strings, with only 228 distinct values across 16,382 rows and 166 nulls (1.0%), which suggests a precomputed group-level statistic broadcast to many records rather than a per-row measurement. The 20 most frequent values range from 0.000 to 28.097. The top value '20.481' covers 7.0% of rows and the top ten values together cover about 32% (5,273 rows), consistent with one value repeated across all rows of a group. The entropy ratio is 0.87, so rows are fairly evenly spread over a discrete set of values.

Treatment: Cast to float and treat as a group-level numeric feature; do not one-hot encode.

anthropic:claude-opus-4-7 · confidence high
Out[224]:

saturn.columns["PercentEvangelicalPC"].stats

stat value
n 16,382
nulls 166 (1.0%)
unique 228
top_value 20.481
top_rate 0.07024
cardinality 228
entropy 6.782
entropy_ratio 0.8658
Fig 77.
Top values for PercentEvangelicalPC.
Top values for PercentEvangelicalPC (20 unique shown, of 228 total).
value count share
20.481 1139 7.0%
0.199 935 5.7%
0.095 592 3.6%
13.274 448 2.7%
3.409 448 2.7%
0.000 441 2.7%
0.247 359 2.2%
28.097 342 2.1%
0.004 316 1.9%
2.656 253 1.5%
7.851 240 1.5%
7.936 226 1.4%
10.122 223 1.4%
3.339 223 1.4%
9.990 182 1.1%
10.025 162 1.0%
21.711 161 1.0%
9.519 160 1.0%
25.396 160 1.0%
17.433 159 1.0%
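
Whether a column such as PercentEvangelicalPC really is a group-level value broadcast onto rows can be checked by counting distinct values per candidate grouping key. The sketch below uses toy rows; `group_key` is a hypothetical column name standing in for whichever key is being tested, and is not a column confirmed in this dataset.

```python
from collections import defaultdict


def distinct_values_per_group(rows, key, value):
    """Count distinct `value` entries for each `key`; 1 everywhere indicates a broadcast statistic."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[value])
    return {group: len(values) for group, values in seen.items()}


# Toy rows: group A carries one value, group B carries two.
rows = [
    {"group_key": "A", "PercentEvangelicalPC": "20.481"},
    {"group_key": "A", "PercentEvangelicalPC": "20.481"},
    {"group_key": "B", "PercentEvangelicalPC": "0.199"},
    {"group_key": "B", "PercentEvangelicalPC": "0.247"},
]
counts = distinct_values_per_group(rows, "group_key", "PercentEvangelicalPC")
is_broadcast = all(n == 1 for n in counts.values())   # False here, because of group B
```

If the check passes for a key, the percentage can be stored once per group and joined back, rather than treated as 16,382 independent measurements.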

PercentEvangelicalPGAC text feature

This is a numeric percentage (Percent Evangelical, PGAC) stored as text — every value is a single token of 5-6 characters formatted like '0.000', '4.000', '1.801'. The distribution is heavily zero-inflated: 3,272 of 16,382 rows are '0.000', duplicate rate is 0.896 across only 1,624 unique values, and 4.5% are null. Despite the column being typed as text, there is no real language content here.

Treatment: Cast to float and treat as a zero-inflated numeric feature.

anthropic:claude-opus-4-7 · confidence high
Out[227]:

saturn.columns["PercentEvangelicalPGAC"].stats

stat value
n 16,382
nulls 743 (4.5%)
unique 1,624
len_min 5
len_max 6
len_mean 5.235
len_median 5
len_p95 6
word_mean 1
word_median 1
n_empty 0
n_duplicates 14,015
duplicate_rate 0.8962
vocab_size 1,624
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 1
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 100.0% rows are all-caps
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 89.6% duplicate strings
Fig 78.
Character-length distribution for PercentEvangelicalPGAC.
Character-length distribution for PercentEvangelicalPGAC (mean: 5.234925506745956; empty bins omitted).
chars count
5 11965
6 3674

PCBuddhism numeric feature

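PCBuddhism appears to be a per-record percentage of Buddhist adherents, ranging 0–100 with mean 3.77 and median 0. The distribution is overwhelmingly zero (zero_rate 0.89), so q1 = q3 = 0 and IQR = 0. With a zero IQR every nonzero value lies beyond the 1.5 IQR fences, which is why the outlier rate (0.1105) equals one minus the zero rate (0.8895); the flag marks the nonzero rows, not anomalies. Skew (4.77) and kurtosis (21.99) are extreme. The nonzero rows are not a smooth tail: the histogram has a second, smaller peak at 97.5–100 (179 rows), so treat this as a sparse share where most groups record no Buddhist adherents and a minority are almost entirely Buddhist.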

Treatment: Add a zero-vs-nonzero indicator and log1p-transform the nonzero share before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[230]:

saturn.columns["PCBuddhism"].stats

stat value
n 16,382
nulls 104 (0.6%)
unique 1,052
min 0
max 100
mean 3.769
median 0
std 16.75
q1 0
q3 0
iqr 0
skew 4.775
kurtosis 22
n_outliers 1,798
outlier_rate 0.1105
zero_rate 0.8895
alert: high_skew skew=+4.77
alert: outliers 11.0% rows beyond 1.5 IQR
Fig 79.
Distribution of PCBuddhism. Vertical dash marks the median.
Histogram bins for PCBuddhism (median: 0.0).
bin count
0 – 2.5 15174
2.5 – 5 44
5 – 7.5 41
7.5 – 10 16
10 – 12.5 121
12.5 – 15 10
15 – 17.5 34
17.5 – 20 4
20 – 22.5 44
22.5 – 25 25
25 – 27.5 21
27.5 – 30 6
30 – 32.5 46
32.5 – 35 10
35 – 37.5 14
37.5 – 40 6
40 – 42.5 34
42.5 – 45 6
45 – 47.5 4
47.5 – 50 7
50 – 52.5 19
52.5 – 55 17
55 – 57.5 16
57.5 – 60 18
60 – 62.5 21
62.5 – 65 3
65 – 67.5 15
67.5 – 70 26
70 – 72.5 29
72.5 – 75 9
75 – 77.5 11
77.5 – 80 15
80 – 82.5 21
82.5 – 85 11
85 – 87.5 17
87.5 – 90 25
90 – 92.5 40
92.5 – 95 37
95 – 97.5 82
97.5 – 100 179
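
The zero-inflation treatment suggested for the PC* religion shares (a nonzero indicator plus a log1p transform) can be sketched as follows. The function name `zero_inflated_features` is hypothetical, and applying log1p to every value is a simplification: log1p(0) is 0, so zeros pass through unchanged.

```python
import math


def zero_inflated_features(share):
    """Return (is_nonzero, log1p_share) for a 0-100 percentage, or (None, None) if missing."""
    if share is None:
        return None, None
    if share < 0:
        raise ValueError("percentage shares should not be negative")
    return (1 if share > 0 else 0), math.log1p(share)


# Toy values covering the zero spike, a small share, the top of the range, and a null.
features = [zero_inflated_features(v) for v in [0.0, 2.5, 100.0, None]]
```

The indicator carries the "any presence at all" signal, while the transformed share compresses the 0 to 100 range to roughly 0 to 4.6.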

PCEthnicReligions numeric feature

PCEthnicReligions appears to be a percentage feature (0–100) capturing the share of ethnic/folk religion adherents per record. Just over half the rows are exactly zero (zero_rate 0.5045), so the median and Q1 are both 0, yet values stretch all the way to 100, producing a clear right skew (1.65). With Q3 = 25 and IQR = 25 the upper fence is Q3 + 1.5 × IQR = 62.5, and the 1,967 flagged outliers (12.05%) are the rows above it, matching the histogram bins from 62.5 upward. The distribution is effectively zero-inflated rather than continuous.

Treatment: Model as zero-inflated: add an is_nonzero indicator and log1p-transform the positive values.

anthropic:claude-opus-4-7 · confidence high
Out[233]:

saturn.columns["PCEthnicReligions"].stats

stat value
n 16,382
nulls 59 (0.4%)
unique 978
min 0
max 100
mean 17.6
median 0
std 29.02
q1 0
q3 25
iqr 25
skew 1.654
kurtosis 1.404
n_outliers 1,967
outlier_rate 0.1205
zero_rate 0.5045
alert: outliers 12.1% rows beyond 1.5 IQR
Fig 80.
Distribution of PCEthnicReligions. Vertical dash marks the median.
Histogram bins for PCEthnicReligions (median: 0.0).
bin count
0 – 2.5 9029
2.5 – 5 551
5 – 7.5 772
7.5 – 10 242
10 – 12.5 674
12.5 – 15 96
15 – 17.5 298
17.5 – 20 100
20 – 22.5 411
22.5 – 25 53
25 – 27.5 309
27.5 – 30 59
30 – 32.5 395
32.5 – 35 92
35 – 37.5 225
37.5 – 40 58
40 – 42.5 219
42.5 – 45 21
45 – 47.5 120
47.5 – 50 30
50 – 52.5 130
52.5 – 55 29
55 – 57.5 147
57.5 – 60 51
60 – 62.5 245
62.5 – 65 25
65 – 67.5 106
67.5 – 70 48
70 – 72.5 188
72.5 – 75 22
75 – 77.5 109
77.5 – 80 51
80 – 82.5 235
82.5 – 85 53
85 – 87.5 134
87.5 – 90 73
90 – 92.5 220
92.5 – 95 119
95 – 97.5 258
97.5 – 100 326

PCHinduism numeric feature

PCHinduism appears to be a per-record percentage share of Hinduism (0–100). The distribution is overwhelmingly zero (zero_rate 0.8343), so Q1, median, and Q3 are all 0, yet the mean is 14.01 with std 33.87 because a minority of records carry very high values: 1,845 rows fall in the 97.5–100 bin. Because IQR = 0, every nonzero row is flagged, so the 16.57% outlier rate is simply one minus the zero rate. Together with skew 2.06, it reflects a two-peaked distribution (zero or close to 100) rather than dirty data.

Treatment: Treat as zero-inflated proportion: add a nonzero indicator and consider a log1p or sqrt transform before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[236]:

saturn.columns["PCHinduism"].stats

stat value
n 16,382
nulls 105 (0.6%)
unique 1,412
min 0
max 100
mean 14.01
median 0
std 33.87
q1 0
q3 0
iqr 0
skew 2.058
kurtosis 2.3
n_outliers 2,697
outlier_rate 0.1657
zero_rate 0.8343
alert: high_skew skew=+2.06
alert: outliers 16.6% rows beyond 1.5 IQR
Fig 81.
Distribution of PCHinduism. Vertical dash marks the median.
Histogram bins for PCHinduism (median: 0.0).
bin count
0 – 2.5 13738
2.5 – 5 30
5 – 7.5 28
7.5 – 10 8
10 – 12.5 23
12.5 – 15 16
15 – 17.5 12
17.5 – 20 4
20 – 22.5 14
22.5 – 25 12
25 – 27.5 10
27.5 – 30 7
30 – 32.5 11
32.5 – 35 6
35 – 37.5 3
37.5 – 40 2
40 – 42.5 11
42.5 – 45 7
45 – 47.5 4
47.5 – 50 9
50 – 52.5 16
52.5 – 55 5
55 – 57.5 11
57.5 – 60 8
60 – 62.5 19
62.5 – 65 5
65 – 67.5 16
67.5 – 70 11
70 – 72.5 26
72.5 – 75 12
75 – 77.5 14
77.5 – 80 14
80 – 82.5 17
82.5 – 85 27
85 – 87.5 32
87.5 – 90 41
90 – 92.5 42
92.5 – 95 41
95 – 97.5 120
97.5 – 100 1845
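
The outlier alerts on the PC* columns follow from how the 1.5 IQR rule behaves when the quartiles collapse. A minimal sketch, assuming the profiler uses the standard Tukey fences (Q1 − 1.5·IQR, Q3 + 1.5·IQR) with strict inequalities; the report only says "beyond 1.5 IQR", so the exact rule is an assumption, and the function names are illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, q1, q3, k=1.5):
    """Number of values strictly outside the fences."""
    low, high = tukey_fences(q1, q3, k)
    return sum(1 for v in values if v < low or v > high)


# Reported quartiles: PCHinduism has q1 = q3 = 0; PCEthnicReligions has q1 = 0, q3 = 25.
hindu_fences = tukey_fences(0, 0)     # both fences collapse to 0
ethnic_fences = tukey_fences(0, 25)   # (-37.5, 62.5)

# Toy data: with IQR = 0, every nonzero value is counted as an outlier.
toy = [0, 0, 0, 0, 0, 0, 0, 0, 3.5, 100]
flagged = count_outliers(toy, 0, 0)   # 2, i.e. exactly the nonzero values
```

This is why outlier_rate and zero_rate sum to 1 for PCHinduism: the alert is counting presence, not anomaly.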

PCOtherSmall numeric feature

PCOtherSmall is a numeric feature on the same 0–100 scale as the other PC* religion shares, so it most likely records the percentage adhering to other or small religions. 89.3% of values are exactly zero, and the median, Q1, and Q3 are all 0. The remaining mass is highly skewed (skew 11.0, kurtosis 124.0) with a max of 100.0. The 10.7% flagged as outliers are exactly the nonzero rows, because IQR = 0 (outlier_rate 0.1074 = 1 − zero_rate 0.8926), so the column is a sparse long-tailed share rather than a typical continuous feature.

Treatment: Treat as zero-inflated: add a binary is_nonzero flag and log1p-transform the positive tail before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[239]:

saturn.columns["PCOtherSmall"].stats

stat value
n 16,382
nulls 104 (0.6%)
unique 908
min 0
max 100
mean 0.9613
median 0
std 8.299
q1 0
q3 0
iqr 0
skew 11
kurtosis 124
n_outliers 1,749
outlier_rate 0.1074
zero_rate 0.8926
alert: high_skew skew=+11.00
alert: outliers 10.7% rows beyond 1.5 IQR
Fig 82.
Distribution of PCOtherSmall. Vertical dash marks the median.
Histogram bins for PCOtherSmall (median: 0.0).
bin count
0 – 2.5 15753
2.5 – 5 155
5 – 7.5 119
7.5 – 10 20
10 – 12.5 36
12.5 – 15 9
15 – 17.5 10
17.5 – 20 6
20 – 22.5 6
22.5 – 25 22
25 – 27.5 3
27.5 – 30 4
30 – 32.5 6
32.5 – 35 0
35 – 37.5 1
37.5 – 40 3
40 – 42.5 3
42.5 – 45 1
45 – 47.5 0
47.5 – 50 0
50 – 52.5 0
52.5 – 55 1
55 – 57.5 3
57.5 – 60 2
60 – 62.5 3
62.5 – 65 0
65 – 67.5 0
67.5 – 70 4
70 – 72.5 4
72.5 – 75 0
75 – 77.5 1
77.5 – 80 0
80 – 82.5 2
82.5 – 85 2
85 – 87.5 2
87.5 – 90 0
90 – 92.5 2
92.5 – 95 0
95 – 97.5 2
97.5 – 100 93

RegionCode numeric feature

RegionCode is an integer-valued field ranging from 1 to 12 with only 12 unique values across 16,382 rows and no nulls. The small cardinality and integer codes indicate a categorical region identifier encoded numerically rather than a true numeric measure, so the moments (skew 0.23, kurtosis -1.20) and the absence of outliers carry little meaning. Row counts per code are uneven, from 460 (code 5) to 3,707 (code 4).

Treatment: Cast to categorical and one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[242]:

saturn.columns["RegionCode"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 12
min 1
max 12
mean 5.935
median 5
std 3.42
q1 3
q3 8
iqr 5
skew 0.231
kurtosis -1.201
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 83.
Distribution of RegionCode. Vertical dash marks the median.
Histogram bins for RegionCode (median: 5.0; empty bins between integer codes omitted).
code count
1 1535
2 1922
3 709
4 3707
5 460
6 593
7 1276
8 2175
9 577
10 1116
11 1395
12 917
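
Casting RegionCode to a categorical and one-hot encoding it can be sketched without any library. The helper `one_hot` and the constant `REGION_LEVELS` are names chosen for this example; the levels 1 to 12 come from the min and max reported above.

```python
def one_hot(codes, levels):
    """One-hot encode integer codes against a fixed, ordered list of levels."""
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for code in codes:
        row = [0] * len(levels)
        row[index[code]] = 1   # a KeyError on an unseen code is deliberate
        rows.append(row)
    return rows


REGION_LEVELS = list(range(1, 13))            # codes 1..12, as reported for RegionCode
encoded = one_hot([1, 4, 12], REGION_LEVELS)  # three toy rows
```

Fixing the level list up front keeps the column order stable between training and scoring data, and makes an unexpected code fail loudly instead of being silently treated as a number.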

PopulationPGAC numeric feature

PopulationPGAC is a population measure spanning 10 to 925,129,800 with a median of just 88,000. Rows in this file are people groups, and only 2,250 distinct values occur across 16,382 rows, so the column most likely holds a group-level population total repeated across rows rather than counts for geographic units. The distribution is extremely right-skewed (skew 15.15, kurtosis 262.66) and 19.2% of rows flag as outliers, with the mean (8.8M) two orders of magnitude above the median. Nulls are negligible (0.09%) and there are no zeros, but the gap between Q3 (1.39M) and the max indicates a long, heavy tail of very large groups.

Treatment: log-transform before any modelling or distance-based comparison.

anthropic:claude-opus-4-7 · confidence high
Out[245]:

saturn.columns["PopulationPGAC"].stats

stat value
n 16,382
nulls 15 (0.1%)
unique 2,250
min 10
max 9.251e+08
mean 8.812e+06
median 88,000
std 5.114e+07
q1 8,800
q3 1.386e+06
iqr 1.377e+06
skew 15.15
kurtosis 262.7
n_outliers 3,145
outlier_rate 0.1922
zero_rate 0
alert: high_skew skew=+15.15
alert: outliers 19.2% rows beyond 1.5 IQR
Fig 84.
Distribution of PopulationPGAC. Vertical dash marks the median.
Histogram bins for PopulationPGAC (median: 88000.0).
bin count
10 – 2.313e+07 14975
2.313e+07 – 4.626e+07 660
4.626e+07 – 6.938e+07 344
6.938e+07 – 9.251e+07 143
9.251e+07 – 1.156e+08 17
1.156e+08 – 1.388e+08 108
1.388e+08 – 1.619e+08 0
1.619e+08 – 1.85e+08 0
1.85e+08 – 2.082e+08 78
2.082e+08 – 2.313e+08 0
2.313e+08 – 2.544e+08 0
2.544e+08 – 2.775e+08 0
2.775e+08 – 3.007e+08 0
3.007e+08 – 3.238e+08 0
3.238e+08 – 3.469e+08 0
3.469e+08 – 3.701e+08 0
3.701e+08 – 3.932e+08 0
3.932e+08 – 4.163e+08 0
4.163e+08 – 4.394e+08 0
4.394e+08 – 4.626e+08 0
4.626e+08 – 4.857e+08 0
4.857e+08 – 5.088e+08 0
5.088e+08 – 5.319e+08 0
5.319e+08 – 5.551e+08 0
5.551e+08 – 5.782e+08 0
5.782e+08 – 6.013e+08 0
6.013e+08 – 6.245e+08 0
6.245e+08 – 6.476e+08 0
6.476e+08 – 6.707e+08 0
6.707e+08 – 6.938e+08 0
6.938e+08 – 7.17e+08 0
7.17e+08 – 7.401e+08 0
7.401e+08 – 7.632e+08 0
7.632e+08 – 7.864e+08 0
7.864e+08 – 8.095e+08 0
8.095e+08 – 8.326e+08 0
8.326e+08 – 8.557e+08 0
8.557e+08 – 8.789e+08 0
8.789e+08 – 9.02e+08 0
9.02e+08 – 9.251e+08 42
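
The effect of the recommended log transform can be seen by applying it to the summary statistics reported for PopulationPGAC. This is a small sketch: the helper name `log10_population` is hypothetical, base 10 is a choice made for readability (any base works), and the q1 and q3 inputs are the rounded figures from the stats table.

```python
import math


def log10_population(pop):
    """log10 of a strictly positive population; None passes through as missing."""
    if pop is None:
        return None
    if pop <= 0:
        raise ValueError("log transform needs a strictly positive population")
    return math.log10(pop)


# Summary statistics as reported above (q1 and q3 are rounded in the report).
reported = {"min": 10, "q1": 8_800, "median": 88_000, "q3": 1_386_000, "max": 925_129_800}
logged = {name: log10_population(value) for name, value in reported.items()}
```

On the log10 scale the column runs from 1 to roughly 9, with the quartiles near 3.9, 4.9, and 6.1, so the bulk of the groups is no longer squeezed into the first histogram bin. Because zero_rate is 0, no offset (such as log1p) is needed here.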

Frontier categorical feature

Binary Y/N flag named 'Frontier', fully populated across 16382 rows with only 2 distinct values. The 'N' class dominates at 70.9% versus 29.1% 'Y', giving an entropy ratio of 0.87 — moderately imbalanced but well within usable range.

Treatment: Encode as 0/1 boolean and use directly as a feature.

anthropic:claude-opus-4-7 · confidence high
Out[248]:

saturn.columns["Frontier"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value N
top_rate 0.709
cardinality 2
entropy 0.87
entropy_ratio 0.87
Fig 85.
Top values for Frontier.
Top values for Frontier (2 unique shown, of 2 total).
value count share
N 11615 70.9%
Y 4767 29.1%

MapAddress text foreign_key

MapAddress holds single-token PNG filenames (e.g. m00320.png), almost certainly references to map image assets. Over half the column is the empty string (8,728 of 16,382 rows, 53.3%), which the profile counts as a value rather than a null, so the 6,029 distinct values include it (vocab_size 6,028 filenames). The 63.2% duplicate rate is driven mostly by those repeated empties. Non-empty values are 10 or 13 characters long (5,028 and 2,626 rows), so this behaves like a sparse reference to a map asset rather than free text.

Treatment: Treat as a categorical asset reference; convert empty strings to missing, and join to a map asset table if one is available.

anthropic:claude-opus-4-7 · confidence high
Out[251]:

saturn.columns["MapAddress"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 6,029
len_min 0
len_max 13
len_mean 5.153
len_median 0
len_p95 13
word_mean 1
word_median 1
n_empty 8,728
n_duplicates 10,353
duplicate_rate 0.632
vocab_size 6,028
readability_flesch_mean 13.52
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 63.2% duplicate strings
Fig 86.
Character-length distribution for MapAddress.
Character-length distribution for MapAddress (mean: 5.153094860212429; empty bins omitted).
chars count
0 8728
10 5028
13 2626

HasJesusFilm categorical feature

Binary Y/N flag indicating whether the JESUS Film is available for each record, with no nulls across 16,382 rows. The split is roughly 2:1 in favour of 'Y' (10,816 vs 5,566; top_rate 0.660), giving a high entropy ratio of 0.925 — informative but mildly imbalanced.

Treatment: Encode as a 0/1 boolean for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[254]:

saturn.columns["HasJesusFilm"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value Y
top_rate 0.6602
cardinality 2
entropy 0.9246
entropy_ratio 0.9246
Fig 87.
Top values for HasJesusFilm.
Top values for HasJesusFilm (2 unique shown, of 2 total).
value count share
Y 10816 66.0%
N 5566 34.0%

Nomadic categorical feature

Binary Y/N flag indicating whether a record is 'Nomadic', with no nulls across 16382 rows. The distribution is severely imbalanced: 'N' covers 98.1% (16071) versus only 311 'Y' cases, yielding an entropy ratio of just 0.136.

Treatment: Encode as a boolean; consider class-weighting or resampling since positives are only ~1.9%.

anthropic:claude-opus-4-7 · confidence high
Out[257]:

saturn.columns["Nomadic"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value N
top_rate 0.981
cardinality 2
entropy 0.1357
entropy_ratio 0.1357
alert: imbalance top value is 98.1% of rows
Fig 88.
Top values for Nomadic.
Top values for Nomadic (2 unique shown, of 2 total).
value count share
N 16071 98.1%
Y 311 1.9%
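
The boolean encoding and class weighting suggested for Nomadic can be sketched as below. The weights use the common "balanced" heuristic, n / (k · n_c), which is also what scikit-learn documents for `class_weight="balanced"`; the function names are hypothetical, and whether weighting or resampling is preferable depends on the model.

```python
def encode_yn(flag):
    """Map the Y/N flag to 1/0; any other value raises KeyError."""
    return {"Y": 1, "N": 0}[flag]


def balanced_class_weights(counts):
    """Weight each class by total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Counts from the Nomadic table: N = 16071, Y = 311.
weights = balanced_class_weights({"N": 16071, "Y": 311})
# Each 'Y' row is weighted about 26.3, each 'N' row about 0.51.
```

With these weights the 311 positive rows contribute as much total weight as the 16,071 negatives.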

NomadicTypeDescription categorical metadata

Categorical descriptor of nomadic livelihood type, with six values combining three base categories (Agro-Pastoralists, Service or Trade, Hunter-Gatherers) singly or in pairs. The column is 98.1% null and populated for only 311 of 16,382 rows; the null count (16,071) equals the number of rows where Nomadic is 'N', which suggests a null here means "not nomadic" rather than unknown. Among the populated rows Agro-Pastoralists dominates at 68.2%. The sparsity makes this effectively a rare annotation rather than a general feature.

Treatment: Treat as sparse metadata; fill nulls with an explicit 'Not nomadic' category if the match with Nomadic holds row by row, or drop the column unless modelling the populated subset.

anthropic:claude-opus-4-7 · confidence high
Out[260]:

saturn.columns["NomadicTypeDescription"].stats

stat value
n 16,382
nulls 16,071 (98.1%)
unique 6
top_value Agro-Pastoralists
top_rate 0.6817
cardinality 6
entropy 1.341
entropy_ratio 0.5187
alert: null_rate 98.1% null
Fig 89.
Top values for NomadicTypeDescription.
Top values for NomadicTypeDescription (6 unique shown, of 6 total).
value count share
Agro-Pastoralists 212 1.3%
Service or Trade 68 0.4%
Agro-Pastoralists, Service or Trade 16 0.1%
Hunter-Gatherers 11 0.1%
Agro-Pastoralists, Hunter-Gatherers 2 0.0%
Service or Trade, Hunter-Gatherers 2 0.0%
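
Because the six values are combinations of three base categories, the column can be expanded into three indicator features instead of six one-hot levels. A minimal sketch, assuming ", " is the only separator used (as in the table above); `BASE_TYPES` and `multi_hot` are names chosen for this example.

```python
BASE_TYPES = ["Agro-Pastoralists", "Service or Trade", "Hunter-Gatherers"]


def multi_hot(description, sep=", "):
    """Return a 0/1 indicator per base type; None (null) maps to all zeros."""
    if description is None:
        return {base: 0 for base in BASE_TYPES}
    parts = set(description.split(sep))
    unknown = parts - set(BASE_TYPES)
    if unknown:
        raise ValueError(f"unexpected nomadic type(s): {sorted(unknown)}")
    return {base: int(base in parts) for base in BASE_TYPES}


pair = multi_hot("Agro-Pastoralists, Service or Trade")
single = multi_hot("Hunter-Gatherers")
missing = multi_hot(None)
```

Mapping nulls to all zeros matches the reading that a null means the group is not nomadic; if nulls turn out to mean unknown, a separate missing indicator would be needed.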

PhotoCCVersionText categorical metadata

This column records the Creative Commons license version attached to a photo, with 17 distinct values (16 license tags plus the empty string) across 16,382 rows and no nulls. It is dominated by empty strings at 83.6% (13,688 rows), leaving only about 16% with an actual license tag, the most common being 'CC BY 2.0' (661) and 'CC BY-NC-SA 2.0' (440). The low entropy ratio (0.28) confirms the field is sparse in practice despite zero technical nulls.

Treatment: Treat empty string as missing and one-hot encode the remaining license categories.

anthropic:claude-opus-4-7 · confidence high
Out[263]:

saturn.columns["PhotoCCVersionText"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 17
top_value (empty string)
top_rate 0.8356
cardinality 17
entropy 1.137
entropy_ratio 0.2781
Fig 90.
Top values for PhotoCCVersionText.
Top values for PhotoCCVersionText (17 unique shown, of 17 total).
value count share
(empty string) 13688 83.6%
CC BY 2.0 661 4.0%
CC BY-NC-SA 2.0 440 2.7%
CC BY-SA 4.0 425 2.6%
CC BY-SA 2.0 332 2.0%
CC0 1.0 247 1.5%
CC BY-NC 2.0 230 1.4%
CC BY-SA 3.0 219 1.3%
CC BY 3.0 35 0.2%
CC BY-NC-ND 2.0 35 0.2%
CC BY 3.0 BR 21 0.1%
CC BY 4.0 20 0.1%
CC BY-SA 2.5 11 0.1%
CC BY-ND 2.0 9 0.1%
CC SA 1.0 7 0.0%
CC BY-SA 3.0 DE 1 0.0%
CC BY-NC-SA 4.0 1 0.0%

PhotoCCVersionURL categorical metadata

This column holds Creative Commons license URLs associated with photos, drawn from a closed vocabulary of 17 distinct values (16 URLs plus the empty string). It is overwhelmingly empty: 13,688 of 16,382 rows (top_rate 0.836) carry the blank string rather than a license, leaving the CC BY 2.0 URL (661) and the CC BY-NC-SA 2.0 URL (440) as the most common actual licenses. The counts match PhotoCCVersionText value for value, so the two columns carry the same information. An entropy ratio of 0.278 confirms the distribution is highly concentrated on the empty value.

Treatment: Treat blank as missing and bucket the remaining license URLs into a low-cardinality categorical; since the column mirrors PhotoCCVersionText, keep only one of the two for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[266]:

saturn.columns["PhotoCCVersionURL"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 17
top_value (empty string)
top_rate 0.8356
cardinality 17
entropy 1.137
entropy_ratio 0.2781
Fig 91.
Top values for PhotoCCVersionURL.
Top values for PhotoCCVersionURL (17 unique shown, of 17 total).
value count share
(empty string) 13688 83.6%
https://creativecommons.org/licenses/by/2.0/ 661 4.0%
https://creativecommons.org/licenses/by-nc-sa/2.0/ 440 2.7%
https://creativecommons.org/licenses/by-sa/4.0/ 425 2.6%
https://creativecommons.org/licenses/by-sa/2.0/ 332 2.0%
https://creativecommons.org/publicdomain/zero/1.0/ 247 1.5%
https://creativecommons.org/licenses/by-nc/2.0/ 230 1.4%
https://creativecommons.org/licenses/by-sa/3.0/ 219 1.3%
https://creativecommons.org/licenses/by/3.0/ 35 0.2%
https://creativecommons.org/licenses/by-nc-nd/2.0/ 35 0.2%
https://creativecommons.org/licenses/by/3.0/br/deed.en 21 0.1%
https://creativecommons.org/licenses/by/4.0/ 20 0.1%
https://creativecommons.org/licenses/by-sa/2.5/ 11 0.1%
https://creativecommons.org/licenses/by-nd/2.0/ 9 0.1%
https://creativecommons.org/licenses/by-sa/1.0/ 7 0.0%
https://creativecommons.org/licenses/by-sa/3.0/de/deed.en 1 0.0%
https://creativecommons.org/licenses/by-nc-sa/4.0 1 0.0%

MapCredits categorical metadata

Attribution string for the map asset associated with each record, naming data, geography, and design contributors. Over half the rows (top_rate 0.533, 8,733 of 16,382) carry an empty string, and the remaining mass is spread across 198 non-empty credit lines, some of them near-duplicates: for example, two variants of the Omid/UNESCO/GMI credit differ only by a trailing period (2,228 vs 864 rows). The entropy ratio of 0.357 and the long_tail alert (114 singleton categories) confirm a few dominant phrasings plus a sparse tail.

Treatment: Treat as provenance metadata; normalize whitespace/punctuation to collapse duplicate credit strings, and exclude from modelling.

anthropic:claude-opus-4-7 · confidence high
Out[269]:

saturn.columns["MapCredits"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 199
top_value (empty string)
top_rate 0.5331
cardinality 199
entropy 2.726
entropy_ratio 0.357
alert: long_tail 114 singleton categories
Fig 92.
Top values for MapCredits.
Top values for MapCredits (20 unique shown, of 199 total).
value count share
(empty string) 8733 53.3%
People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project 2228 13.6%
People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project. 864 5.3%
Location: IMB. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project. 853 5.2%
Anonymous 759 4.6%
People Group Location: Omid. Other geography / data: GMI. Map Design: Joshua Project 603 3.7%
Bethany World Prayer Center 582 3.6%
Joshua Project / Global Mapping International 520 3.2%
Asia Harvest-Operation Myanmar 144 0.9%
Bryan Nicholson / cartoMission 117 0.7%
Mark Stevens 96 0.6%
Location: SIL / WLMS. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project. 85 0.5%
West Melanesia 78 0.5%
Southeast Asia Link - SEALINK 68 0.4%
NCRP 54 0.3%
Location: WLMS. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project. 49 0.3%
Rodrigo Tinoco - https://www.data4mission.com/ 44 0.3%
Location: Web research. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project. 35 0.2%
Location: World Jewish Congress, Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project. 26 0.2%
Location: Joshua Project. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project. 23 0.1%
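
Collapsing the near-duplicate credit strings, as the MapCredits treatment suggests, can be sketched as a small normalisation step. The function name `normalize_credit` is hypothetical, and the rule (collapse whitespace, drop one trailing period) is an assumption that covers the Omid/UNESCO/GMI pair shown above but not every possible variant.

```python
import re


def normalize_credit(text):
    """Collapse runs of whitespace and drop a trailing period so near-duplicates compare equal."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(".").rstrip()


# The two variants from the table, differing only by the final period.
variant_a = "People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project"
variant_b = "People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project."
same = normalize_credit(variant_a) == normalize_credit(variant_b)   # True
```

Periods inside the string are left alone, so the "Omid." and "GMI." segments keep their punctuation.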

MapCreditURL categorical metadata

Optional attribution URL for the source map of each record, blank in 15,891 of 16,382 rows (top_rate 0.97). Only 50 distinct non-empty values populate the remaining 3% (491 rows), led by asiaharvest.org (146) and cartomission.com (117), giving a very low entropy_ratio of 0.052. The mix of http and https variants of the same site (lingvarium.org), a mailto: address, and inconsistent trailing slashes suggests free-form data entry rather than a controlled vocabulary.

Treatment: Drop from modelling; retain only as a provenance link for the ~3% of rows that carry a value.

anthropic:claude-opus-4-7 · confidence high
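One way to quantify the mixed schemes noted above is to tally the scheme of each value. This is a minimal sketch using the standard library; the helper `url_scheme` and the label "empty" are illustrative choices, and the sample holds only a handful of values copied from the value table for this column.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_scheme(value: str) -> str:
    """Return the lower-cased URL scheme, or 'empty' for blank cells (hypothetical helper)."""
    value = value.strip()
    if not value:
        return "empty"
    return urlsplit(value).scheme.lower() or "no-scheme"


sample = [
    "",
    "https://www.asiaharvest.org",
    "http://lingvarium.org/",
    "https://lingvarium.org/",
    "mailto:cambodia.peoples@gmail.com",
]
scheme_counts = Counter(url_scheme(v) for v in sample)
print(scheme_counts)
```

Run over the full column, a tally like this would show how many provenance links are web pages versus contact addresses before deciding whether any of them are worth keeping.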
Out[272]:

saturn.columns["MapCreditURL"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 51
top_value (empty string)
top_rate 0.97
cardinality 51
entropy 0.2954
entropy_ratio 0.05207
alert: long_tail 26 singleton categories
alert: imbalance top value is 97.0% of rows
Fig 93. Top values for MapCreditURL.
Top values for MapCreditURL (20 unique shown, of 51 total).
value count share
(empty string) 15891 97.0%
https://www.asiaharvest.org 146 0.9%
https://www.cartomission.com 117 0.7%
https://www.westmelanesia.com/ 78 0.5%
https://www.worldjewishcongress.org/ 26 0.2%
https://www.census.gov/prod/2004pubs/c2kbr-35.pdf 16 0.1%
https://commons.wikimedia.org/wiki/File:Caucasus_ethnic.jpg 11 0.1%
https://www.npolar.no/ansipra/english/Indexpages/Map_index.html 11 0.1%
https://www.eki.ee/books/redbook/introduction.shtml 9 0.1%
mailto:cambodia.peoples@gmail.com 9 0.1%
https://commons.wikimedia.org/wiki/File:Maeneo_penye_wasemaji_wa_Kiswahili.png 8 0.0%
https://www.cartpioneers.org/products/Peoples-of-Yemen-Prayer-Guide.html 4 0.0%
https://www.burmaissues.org/En/ethnicgroups1.html 4 0.0%
https://www.face-music.ch/bi_bid/trad_costumes_en.html 3 0.0%
https://thekurds.net/ 3 0.0%
http://lingvarium.org/ 2 0.0%
https://lingvarium.org/ 2 0.0%
https://commons.wikimedia.org/wiki/File:Basque_Country_Location_Map.svg 2 0.0%
https://commons.wikimedia.org/wiki/File:Mapuche_du_Chili_et_d%27Argentine.gif 2 0.0%
https://commons.wikimedia.org/wiki/File:Tlingit-map.png 2 0.0%

MapCopyright categorical feature

A near-binary flag (with blanks) indicating map copyright status, dominated by 'N' at 86.5% of 16,382 rows. Only 118 records carry 'Y' and 2,100 are empty strings, giving just 3 distinct values and an entropy ratio of 0.387. The empty category is large enough that it should be treated as its own level rather than silently coerced.

Treatment: Encode as a 3-level categorical (N / Y / blank); low signal due to severe imbalance.

anthropic:claude-opus-4-7 · confidence high
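The entropy figures reported for this column can be reproduced from the three value counts. The sketch below assumes saturn reports Shannon entropy in bits and divides by log2 of the cardinality to get entropy_ratio; that definition is inferred from the numbers matching, not taken from saturn's documentation.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k categories (assumed definition)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Counts from the MapCopyright value table: 'N', empty string, 'Y'.
map_copyright_counts = [14164, 2100, 118]
h = entropy_bits(map_copyright_counts)
r = entropy_ratio(map_copyright_counts)
print(round(h, 4), round(r, 4))
```

For a two-level column the denominator is log2(2) = 1, which is why entropy and entropy_ratio coincide for the binary flags in this report.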
Out[275]:

saturn.columns["MapCopyright"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 3
top_value N
top_rate 0.8646
cardinality 3
entropy 0.6126
entropy_ratio 0.3865
Fig 94. Top values for MapCopyright.
Top values for MapCopyright (3 unique shown, of 3 total).
value count share
N 14164 86.5%
(empty string) 2100 12.8%
Y 118 0.7%

MapCCVersionText categorical metadata

This column records Creative Commons licence versions for map content, but 16347 of 16382 rows (top_rate 0.9979) are empty strings, leaving only 35 rows spread across five actual licences (CC0 1.0, CC BY-SA 3.0, CC BY 3.0, CC BY 2.0, CC BY-SA 4.0). Entropy ratio of 0.0099 confirms the column carries almost no information. Note that nulls are reported as 0.0 because the missing values are stored as empty strings rather than true nulls.

Treatment: Drop or collapse to a binary has_licence flag; too sparse to use as a feature.

anthropic:claude-opus-4-7 · confidence high
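Because blanks are stored as empty strings, the reported null rate of 0.0 understates missingness. The sketch below shows one way to treat blanks as missing and derive the binary flag from the treatment; `to_missing`, `effective_null_rate`, and `has_licence` are hypothetical names, and the cell list is rebuilt from the counts in this column's value table rather than read from the file.

```python
def to_missing(value):
    """Map None and blank or whitespace-only strings to None; leave other values unchanged."""
    if value is None or (isinstance(value, str) and not value.strip()):
        return None
    return value


def effective_null_rate(values):
    """Share of cells that are missing once blanks are counted as nulls."""
    values = list(values)
    return sum(to_missing(v) is None for v in values) / len(values)


# Rebuilt from the reported counts for MapCCVersionText.
cells = (
    [""] * 16347
    + ["CC0 1.0"] * 17
    + ["CC BY-SA 3.0"] * 13
    + ["CC BY 3.0"] * 2
    + ["CC BY 2.0"] * 2
    + ["CC BY-SA 4.0"]
)
has_licence = [to_missing(v) is not None for v in cells]
print(sum(has_licence), round(effective_null_rate(cells), 4))
```

The same conversion applies to the other columns in this section whose top value is the empty string.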
Out[278]:

saturn.columns["MapCCVersionText"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 6
top_value (empty string)
top_rate 0.9979
cardinality 6
entropy 0.02557
entropy_ratio 0.009891
alert: imbalance top value is 99.8% of rows
Fig 95. Top values for MapCCVersionText.
Top values for MapCCVersionText (6 unique shown, of 6 total).
value count share
(empty string) 16347 99.8%
CC0 1.0 17 0.1%
CC BY-SA 3.0 13 0.1%
CC BY 3.0 2 0.0%
CC BY 2.0 2 0.0%
CC BY-SA 4.0 1 0.0%

MapCCVersionURL categorical metadata

MapCCVersionURL appears to be a Creative Commons license URL field attached to map records, with five distinct CC variants observed (CC0, BY-SA 3.0/4.0, BY 2.0/3.0). It is effectively empty: 16,347 of 16,382 rows (top_rate 0.9979) are blank strings, leaving only 35 rows with an actual license URL and an entropy_ratio of 0.0099. Its value counts (17, 13, 2, 2, 1) and entropy are identical to those of MapCCVersionText, so the two columns very likely encode the same information. Null_rate is reported as 0.0 because empties are stored as '' rather than nulls, which is itself worth flagging.

Treatment: Drop or collapse to a binary has_license flag; near-constant empty string.

anthropic:claude-opus-4-7 · confidence high
Out[281]:

saturn.columns["MapCCVersionURL"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 6
top_value (empty string)
top_rate 0.9979
cardinality 6
entropy 0.02557
entropy_ratio 0.009891
alert: imbalance top value is 99.8% of rows
Fig 96. Top values for MapCCVersionURL.
Top values for MapCCVersionURL (6 unique shown, of 6 total).
value count share
(empty string) 16347 99.8%
https://creativecommons.org/publicdomain/zero/1.0/ 17 0.1%
https://creativecommons.org/licenses/by-sa/3.0/ 13 0.1%
https://creativecommons.org/licenses/by/3.0/ 2 0.0%
https://creativecommons.org/licenses/by/2.0/ 2 0.0%
https://creativecommons.org/licenses/by-sa/4.0/ 1 0.0%

JF categorical feature

JF is a binary Y/N flag with no nulls across 16382 rows. The split is moderately imbalanced, with Y dominating at 66.0% (10816) versus N at 34.0% (5566), yielding a high entropy ratio of 0.925.

Treatment: Encode as a 0/1 indicator before modelling.

anthropic:claude-opus-4-7 · confidence high
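A 0/1 encoding for Y/N flags such as this one could be sketched as follows. The function name `encode_yn` and the decision to raise on unexpected values (rather than silently mapping them to 0) are assumptions for illustration; the sample rows are made up and not taken from the file.

```python
def encode_yn(value: str) -> int:
    """Encode 'Y' as 1 and 'N' as 0; raise on anything else so surprises are not hidden."""
    mapping = {"Y": 1, "N": 0}
    key = value.strip().upper()
    if key not in mapping:
        raise ValueError(f"unexpected flag value: {value!r}")
    return mapping[key]


jf_sample = ["Y", "N", "Y", "Y"]  # illustrative rows only
encoded = [encode_yn(v) for v in jf_sample]
print(encoded)
```

Raising on unknown values matters for columns like MapCopyright, where a blank level exists and should not be folded into 'N'.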
Out[284]:

saturn.columns["JF"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value Y
top_rate 0.6602
cardinality 2
entropy 0.9246
entropy_ratio 0.9246
Fig 97. Top values for JF.
Top values for JF (2 unique shown, of 2 total).
value count share
Y 10816 66.0%
N 5566 34.0%

AudioRecordings categorical feature

Binary Y/N flag indicating whether audio recordings are present, with no nulls across 16,382 rows. The distribution is imbalanced: 'Y' dominates at 82.3% (13,479) versus 2,903 'N' values, yielding an entropy ratio of 0.67.

Treatment: Encode as a 0/1 boolean; be aware of the ~82/18 class imbalance when using as a predictor.

anthropic:claude-opus-4-7 · confidence high
Out[287]:

saturn.columns["AudioRecordings"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value Y
top_rate 0.8228
cardinality 2
entropy 0.6739
entropy_ratio 0.6739
Fig 98. Top values for AudioRecordings.
Top values for AudioRecordings (2 unique shown, of 2 total).
value count share
Y 13479 82.3%
N 2903 17.7%

Window1040 categorical feature

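Window1040 is a binary Y/N flag covering all 16,382 rows with no nulls. The split is nearly even (Y at 52.3%, 8,572 rows; N at 47.7%, 7,810 rows), giving an entropy ratio of 0.998, essentially maximum uncertainty for a two-class field. The stats alone do not define the flag; the name most plausibly refers to the '10/40 Window' (the band between roughly 10 and 40 degrees north latitude used in missions research), but that reading should be confirmed against the Joshua Project data dictionary. Either way, the balanced distribution makes it a usable feature rather than a degenerate constant.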

Treatment: Encode as a 0/1 indicator for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[290]:

saturn.columns["Window1040"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2
top_value Y
top_rate 0.5233
cardinality 2
entropy 0.9984
entropy_ratio 0.9984
Fig 99. Top values for Window1040.
Top values for Window1040 (2 unique shown, of 2 total).
value count share
Y 8572 52.3%
N 7810 47.7%

PeopleGroupMapURL text metadata

This column holds URLs to people-group map images hosted on joshuaproject.net, with every non-empty entry being a single-token link. Over half the rows (8,728 of 16,382) are empty strings, so url_rate is 0.47. The duplicate_rate of 0.63 (n_duplicates 10,353) is driven mainly by those empties: the 7,654 populated rows carry 6,028 distinct URLs, so only about 21% of populated rows repeat a map used by another record.

Treatment: Treat as an optional asset URL: keep as-is for display, or drop if not rendering images.

anthropic:claude-opus-4-7 · confidence high
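The headline duplicate_rate counts every blank after the first as a duplicate, so it helps to recompute reuse on populated rows only. The sketch below does that arithmetic from the reported n, n_empty, and unique; the function name and the returned keys are illustrative, and it assumes the empty string is counted as one of the unique values whenever blanks are present.

```python
def populated_reuse(n: int, n_empty: int, unique: int) -> dict:
    """Summarise repetition among non-empty rows from three profile stats."""
    populated = n - n_empty
    # Assumption: `unique` includes the empty string when any blanks exist.
    distinct = unique - 1 if n_empty > 0 else unique
    repeats = populated - distinct
    return {
        "populated": populated,
        "distinct": distinct,
        "repeats": repeats,
        "repeat_share": repeats / populated if populated else 0.0,
    }


map_url = populated_reuse(n=16382, n_empty=8728, unique=6029)
print(map_url)
```

The `distinct` figure produced this way agrees with the reported vocab_size of 6,028 for this column, which supports the assumption about how blanks are counted.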
Out[293]:

saturn.columns["PeopleGroupMapURL"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 6,029
len_min 0
len_max 66
len_mean 29.92
len_median 0
len_p95 66
word_mean 1
word_median 1
n_empty 8,728
n_duplicates 10,353
duplicate_rate 0.632
vocab_size 6,028
readability_flesch_mean -329.1
emoji_rate 0
url_rate 0.4672
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: url_heavy 46.7% rows contain a URL
alert: duplicates 63.2% duplicate strings
Fig 100. Character-length distribution for PeopleGroupMapURL.
Character-length distribution for PeopleGroupMapURL (mean: 29.91576120131852).
chars count
0 – 2 8728
63 – 64 5028
64 – 66 2626
(the 37 bins between 2 and 63 characters are all empty)

PeopleGroupMapExpandedURL text metadata

This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net, one link per row. It is mostly empty: 9,538 of 16,382 rows are blank (the modal value), which is why len_median is 0 and url_rate is only 0.418. The duplicate_rate of 0.661 largely reflects those blanks; among the 6,844 populated rows there are 5,560 distinct URLs, so about 19% of populated rows reuse a map. The most repeated file (m00320.pdf) appears 96 times, indicating that some people-group records share the same regional map.

Treatment: Treat as an optional reference link; drop for modelling or keep only as a join key to map assets.

anthropic:claude-opus-4-7 · confidence high
Out[296]:

saturn.columns["PeopleGroupMapExpandedURL"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 5,561
len_min 0
len_max 66
len_mean 26.74
len_median 0
len_p95 66
word_mean 1
word_median 1
n_empty 9,538
n_duplicates 10,821
duplicate_rate 0.6605
vocab_size 5,560
readability_flesch_mean -256.6
emoji_rate 0
url_rate 0.4178
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: url_heavy 41.8% rows contain a URL
alert: duplicates 66.1% duplicate strings
Fig 101. Character-length distribution for PeopleGroupMapExpandedURL.
Character-length distribution for PeopleGroupMapExpandedURL (mean: 26.739592235380297).
chars count
0 – 2 9538
63 – 64 4552
64 – 66 2292
(the 37 bins between 2 and 63 characters are all empty)

PeopleGroupURL text identifier

This column holds Joshua Project people-group URLs, one per row, with a perfectly fixed length of 48 characters and exactly one 'word' per cell. Every value is unique across all 16382 rows (n_unique equals n, duplicate_rate 0.0) and url_rate is 1.0, so it functions as a row-level identifier rather than analysable text. The URL pattern encodes a numeric people-group id plus a two-letter country suffix (e.g. /10375/tz, /10375/up), meaning the same group repeats across countries via different URLs.

Treatment: Drop from modelling; keep as a row key or parse out the people-group id and country code if a join is needed.

anthropic:claude-opus-4-7 · confidence high
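Parsing the people-group id and country code out of the URL tail, as the treatment suggests, might look like the sketch below. The profile only shows tails such as /10375/tz, so the full URL used here (including the example.org host and path) is a hypothetical stand-in, and upper-casing the code is an illustrative choice.

```python
import re

# Matches a trailing "/<digits>/<two letters>" with an optional final slash.
_TAIL = re.compile(r"/(\d+)/([A-Za-z]{2})/?$")


def parse_people_group_url(url: str):
    """Return (people_group_id, country_code) or None if the tail does not match."""
    match = _TAIL.search(url.strip())
    if match is None:
        return None
    return match.group(1), match.group(2).upper()


example = "https://example.org/people_groups/10375/tz"  # hypothetical URL shape
print(parse_people_group_url(example))
```

Keeping the id as a string avoids losing leading zeros if any ids turn out to be zero-padded.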
Out[299]:

saturn.columns["PeopleGroupURL"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 16,382
len_min 48
len_max 48
len_mean 48
len_median 48
len_p95 48
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 16,382
readability_flesch_mean -476.9
emoji_rate 0
url_rate 1
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: url_heavy 100.0% rows contain a URL
Fig 102. Character-length distribution for PeopleGroupURL.
Character-length distribution for PeopleGroupURL (mean: 48.0).
chars count
48 – 48 16382
(all 40 bins span 48 – 48 because every value is exactly 48 characters; the other 39 are empty)

PeopleGroupPhotoURL text metadata

This column holds URLs to people-group profile photos hosted on joshuaproject.net, with every non-empty value being a single token (one_word_rate 1.0). 5,719 of 16,382 rows are empty strings, so url_rate is 0.65. Reuse among the populated rows is substantial: 5,276 distinct URLs serve 10,663 rows, so roughly half of the populated rows repeat a photo used elsewhere, and the top URL alone appears 92 times. The headline figures (n_duplicates 11,105, duplicate_rate 0.68) also count the empty strings as duplicates of one another.

Treatment: Treat as an optional image asset link; drop for modelling or use only to fetch images, and handle the 5719 empty values as missing.

anthropic:claude-opus-4-7 · confidence high
Out[302]:

saturn.columns["PeopleGroupPhotoURL"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 5,277
len_min 0
len_max 68
len_mean 42.32
len_median 65
len_p95 65
word_mean 1
word_median 1
n_empty 5,719
n_duplicates 11,105
duplicate_rate 0.6779
vocab_size 5,276
readability_flesch_mean -585.6
emoji_rate 0
url_rate 0.6509
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: url_heavy 65.1% rows contain a URL
alert: duplicates 67.8% duplicate strings
Fig 103. Character-length distribution for PeopleGroupPhotoURL.
Character-length distribution for PeopleGroupPhotoURL (mean: 42.321511414967645).
chars count
0 – 2 5719
65 – 66 10591
66 – 68 72
(the 37 bins between 2 and 65 characters are all empty)

CountryURL categorical metadata

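URL pointing to a country page on joshuaproject.net, with the trailing two-letter code acting as the country identifier. 238 distinct codes appear across 16,382 rows with no nulls, matching the cardinality of ROG3. India (IN) leads at 13.8% (2,262 rows), followed by PP (883) and ID (788). The codes do not all read as ISO 3166-1 alpha-2 (PP is not an assigned ISO code, and CH with 547 rows is unlikely to be Switzerland), so map them through the dataset's own country table rather than assuming ISO. A high entropy ratio (0.79) indicates the distribution is broad rather than concentrated despite the India lead.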

Treatment: Strip the URL prefix and keep the two-letter country code as a categorical feature.

anthropic:claude-opus-4-7 · confidence high
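Stripping the prefix and keeping the code could be done as in this minimal sketch. The function name `country_code` and the validation rule (exactly two letters) are assumptions; the two URLs are copied from the value table for this column.

```python
def country_code(url: str) -> str:
    """Return the final path segment of a country URL as an upper-case two-letter code."""
    code = url.rstrip("/").rsplit("/", 1)[-1]
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"no two-letter code at the end of {url!r}")
    return code.upper()


urls = [
    "https://joshuaproject.net/countries/IN",
    "https://joshuaproject.net/countries/PP",
]
codes = [country_code(u) for u in urls]
print(codes)
```

Validating the tail guards against a blank or malformed URL being turned into a spurious category.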
Out[305]:

saturn.columns["CountryURL"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 238
top_value https://joshuaproject.net/countries/IN
top_rate 0.1381
cardinality 238
entropy 6.225
entropy_ratio 0.7885
Fig 104. Top values for CountryURL.
Top values for CountryURL (20 unique shown, of 238 total).
value count share
https://joshuaproject.net/countries/IN 2262 13.8%
https://joshuaproject.net/countries/PP 883 5.4%
https://joshuaproject.net/countries/ID 788 4.8%
https://joshuaproject.net/countries/PK 775 4.7%
https://joshuaproject.net/countries/CH 547 3.3%
https://joshuaproject.net/countries/NI 535 3.3%
https://joshuaproject.net/countries/US 496 3.0%
https://joshuaproject.net/countries/MX 333 2.0%
https://joshuaproject.net/countries/BR 321 2.0%
https://joshuaproject.net/countries/CM 292 1.8%
https://joshuaproject.net/countries/BG 278 1.7%
https://joshuaproject.net/countries/CA 243 1.5%
https://joshuaproject.net/countries/CG 231 1.4%
https://joshuaproject.net/countries/BM 218 1.3%
https://joshuaproject.net/countries/AS 205 1.3%
https://joshuaproject.net/countries/RP 200 1.2%
https://joshuaproject.net/countries/SU 198 1.2%
https://joshuaproject.net/countries/NP 195 1.2%
https://joshuaproject.net/countries/LA 184 1.1%
https://joshuaproject.net/countries/MY 183 1.1%

JPScaleText categorical label

JPScaleText is a 5-level ordinal categorical describing how 'reached' a people group is, ranging from 'Unreached' to 'Significantly Reached'. The distribution is skewed toward the bottom of the scale: 'Unreached' covers 43.5% of 16,382 rows (7,124), while 'Minimally Reached' is the rarest at 1,009 (6.2%). There are no nulls, and an entropy ratio of 0.87 indicates that all five levels are reasonably well represented despite the skew.

Treatment: Encode as an ordered ordinal (Unreached < Minimally < Superficially < Partially < Significantly) before modelling.

anthropic:claude-opus-4-7 · confidence high
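The ordered encoding named in the treatment can be written as a plain lookup. In this sketch the label order follows the treatment line; the names `JP_SCALE_ORDER`, `JP_SCALE_RANK`, and `jp_scale_rank`, and the choice of ranks 1 to 5, are illustrative.

```python
# Lowest to highest, following the order given in the treatment.
JP_SCALE_ORDER = [
    "Unreached",
    "Minimally Reached",
    "Superficially Reached",
    "Partially Reached",
    "Significantly Reached",
]
JP_SCALE_RANK = {label: rank for rank, label in enumerate(JP_SCALE_ORDER, start=1)}


def jp_scale_rank(label: str) -> int:
    """Map a JPScaleText label to its ordinal rank; raises KeyError on unknown labels."""
    return JP_SCALE_RANK[label.strip()]


ranks = [jp_scale_rank(x) for x in ["Unreached", "Partially Reached"]]
print(ranks)
```

An explicit list keeps the order from being inferred alphabetically or by frequency, either of which would scramble the scale.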
Out[308]:

saturn.columns["JPScaleText"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 5
top_value Unreached
top_rate 0.4349
cardinality 5
entropy 2.017
entropy_ratio 0.8688
Fig 105. Top values for JPScaleText.
Top values for JPScaleText (5 unique shown, of 5 total).
value count share
Unreached 7124 43.5%
Partially Reached 3636 22.2%
Significantly Reached 3200 19.5%
Superficially Reached 1413 8.6%
Minimally Reached 1009 6.2%

JPScaleImageURL categorical metadata

This column holds URLs to one of five 'gauge' images on joshuaproject.net, almost certainly a visual encoding of the ordinal Joshua Project Scale (1-5). The value counts match JPScaleText exactly (gauge-1 7,124; gauge-2 1,009; gauge-3 1,413; gauge-4 3,636; gauge-5 3,200), which supports reading gauge-1 as 'Unreached' through gauge-5 as 'Significantly Reached'. There are no nulls, and the URL carries no information beyond the underlying 1-5 code.

Treatment: Extract the trailing digit as an ordinal feature and drop the URL string.

anthropic:claude-opus-4-7 · confidence high
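Extracting the trailing digit could be sketched with a small regular expression. The pattern assumes every value ends in gauge-N.png with N from 1 to 5, which matches the five URLs listed for this column; the function name `gauge_level` is hypothetical.

```python
import re

_GAUGE = re.compile(r"gauge-([1-5])\.png$")


def gauge_level(url: str) -> int:
    """Return the 1-5 level encoded in a gauge image URL."""
    match = _GAUGE.search(url.strip())
    if match is None:
        raise ValueError(f"not a gauge image URL: {url!r}")
    return int(match.group(1))


level = gauge_level("https://joshuaproject.net/assets/img/gauge/gauge-4.png")
print(level)
```

Comparing the extracted level with an ordinal encoding of JPScaleText is a cheap consistency check before dropping one of the two columns.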
Out[311]:

saturn.columns["JPScaleImageURL"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 5
top_value https://joshuaproject.net/assets/img/gauge/gauge-1.png
top_rate 0.4349
cardinality 5
entropy 2.017
entropy_ratio 0.8688
Fig 106. Top values for JPScaleImageURL.
Top values for JPScaleImageURL (5 unique shown, of 5 total).
value count share
https://joshuaproject.net/assets/img/gauge/gauge-1.png 7124 43.5%
https://joshuaproject.net/assets/img/gauge/gauge-4.png 3636 22.2%
https://joshuaproject.net/assets/img/gauge/gauge-5.png 3200 19.5%
https://joshuaproject.net/assets/img/gauge/gauge-3.png 1413 8.6%
https://joshuaproject.net/assets/img/gauge/gauge-2.png 1009 6.2%

Summary text free_text

A short ethnographic descriptor field, likely a people-group summary paragraph in English. The column is dominated by emptiness: 12,328 of 16,382 rows are blank, which drives the median length of 0, the one_word alert, and most of the duplicate_rate of 0.77. The 4,054 populated rows carry 3,777 distinct texts, so exact repetition among real entries is limited, although the same Rajput and Jat write-ups repeat dozens of times in slight variants. Populated texts can be substantial (max 1,212 characters; len_p95 = 719 even with the blanks included), and a mean Flesch readability of 13.2 points to dense, hard-to-read prose.

Treatment: Deduplicate and drop empties before any NLP; treat as supplementary description rather than a per-row feature.

anthropic:claude-opus-4-7 · confidence high
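Dropping empties and collapsing exact repeats before any NLP could be sketched as below. The helper `unique_nonempty` is hypothetical, whitespace normalization is an added assumption, and the sample strings are invented placeholders rather than text from the column.

```python
def unique_nonempty(texts):
    """Return {normalized text: count} for non-blank entries, in first-seen order."""
    counts = {}
    for text in texts:
        key = " ".join(text.split())  # collapse whitespace so trivial variants match
        if key:
            counts[key] = counts.get(key, 0) + 1
    return counts


sample = [
    "",
    "A farming community in the hills.",
    "A farming  community in the hills. ",
    "",
    "Traders living along the coast.",
]
deduped = unique_nonempty(sample)
print(deduped)
```

Keeping the counts alongside the unique texts preserves how often each template was reused, which is lost if the list is simply converted to a set.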
Out[314]:

saturn.columns["Summary"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 3,778
len_min 0
len_max 1,212
len_mean 137.7
len_median 0
len_p95 719
word_mean 23.34
word_median 1
n_empty 12,328
n_duplicates 12,604
duplicate_rate 0.7694
vocab_size 24,964
readability_flesch_mean 13.16
emoji_rate 0
url_rate 0
one_word_rate 0.7525
allcaps_rate 0
boilerplate_rate 0.0001221
alert: one_word 75.3% rows are a single word
alert: duplicates 76.9% duplicate strings
Fig 107. Character-length distribution for Summary.
Character-length distribution for Summary (mean: 137.66963740691003).
chars count
0 – 30 12328
30 – 61 1
61 – 91 1
91 – 121 7
121 – 152 18
152 – 182 25
182 – 212 44
212 – 242 63
242 – 273 76
273 – 303 113
303 – 333 146
333 – 364 174
364 – 394 167
394 – 424 204
424 – 454 219
454 – 485 214
485 – 515 247
515 – 545 238
545 – 576 210
576 – 606 227
606 – 636 256
636 – 667 252
667 – 697 202
697 – 727 257
727 – 758 144
758 – 788 169
788 – 818 88
818 – 848 85
848 – 879 60
879 – 909 30
909 – 939 19
939 – 970 9
970 – 1000 10
1000 – 1030 21
1030 – 1060 50
1060 – 1091 3
1091 – 1121 2
1121 – 1151 1
1151 – 1182 0
1182 – 1212 2

Obstacles text free_text

Free-text English commentary describing spiritual or cultural obstacles to Christian evangelism for various ethnic groups (Rajputs, Jats, Bosniaks, Azeri, etc.). The field is overwhelmingly empty: 12,327 of 16,382 rows are blank, driving a median length of 0, a one-word rate of 0.75, and most of the duplicate_rate of 0.77. The 4,055 populated rows carry 3,731 distinct texts, and the repetition that does exist is concentrated in a few templates: the same Rajput and Jat paragraphs appear 88 and 74 times, suggesting entries reused across related people-group records.

Treatment: Treat blanks as missing and deduplicate template paragraphs before tokenizing/embedding for any text modelling.

anthropic:claude-opus-4-7 · confidence high
Out[317]:

saturn.columns["Obstacles"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 3,732
len_min 0
len_max 726
len_mean 47.4
len_median 0
len_p95 267
word_mean 8.704
word_median 1
n_empty 12,327
n_duplicates 12,650
duplicate_rate 0.7722
vocab_size 9,899
readability_flesch_mean 12.76
emoji_rate 0
url_rate 0
one_word_rate 0.7525
allcaps_rate 0
boilerplate_rate 0.0004273
alert: one_word 75.2% rows are a single word
alert: duplicates 77.2% duplicate strings
Fig 108. Character-length distribution for Obstacles.
Character-length distribution for Obstacles (mean: 47.39598339641069).
chars count
0 – 18 12327
18 – 36 1
36 – 54 25
54 – 73 111
73 – 91 206
91 – 109 309
109 – 127 422
127 – 145 433
145 – 163 361
163 – 182 360
182 – 200 267
200 – 218 228
218 – 236 174
236 – 254 164
254 – 272 232
272 – 290 90
290 – 309 223
309 – 327 197
327 – 345 70
345 – 363 39
363 – 381 34
381 – 399 25
399 – 417 11
417 – 436 9
436 – 454 9
454 – 472 13
472 – 490 12
490 – 508 4
508 – 526 7
526 – 544 3
544 – 563 4
563 – 581 4
581 – 599 1
599 – 617 0
617 – 635 2
635 – 653 1
653 – 672 0
672 – 690 2
690 – 708 1
708 – 726 1

HowReach text free_text

This is a free-text field describing how a people group could be reached. The column is dominated by emptiness: 13,043 of 16,382 rows (n_empty) are blank, driving a median length of 0 and most of the duplicate_rate of 0.82. The 3,339 populated rows carry 2,943 distinct texts. Populated entries are substantive (len_max 599; len_p95 221 with the blanks included) but repeated in places: the same paragraph about Jats appears 136 times, and several near-duplicates differ only by a word, suggesting templated copy across related groups.

Treatment: Treat as sparse free text: filter out empties and deduplicate near-identical strings before any tokenization or embedding.

anthropic:claude-opus-4-7 · confidence high
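Near-identical strings that differ by a word will survive exact deduplication, so a similarity pass is one option. The sketch below uses `difflib.SequenceMatcher` from the standard library with a greedy grouping; the 0.9 threshold, the function name, and the sample sentences are all illustrative assumptions, not values taken from the column.

```python
from difflib import SequenceMatcher


def group_near_duplicates(texts, threshold=0.9):
    """Greedy grouping: each text joins the first group whose first member is similar enough."""
    groups = []
    for text in texts:
        for group in groups:
            if SequenceMatcher(None, group[0], text).ratio() >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups


sample = [
    "Workers are needed to share with this group in their own language.",
    "Workers are needed to share with this tribe in their own language.",
    "Audio Scripture exists in the main trade language.",
]
groups = group_near_duplicates(sample)
print([len(g) for g in groups])
```

The comparison is quadratic in the worst case, which is workable for the few thousand distinct strings reported here but would need blocking or hashing on a much larger corpus.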
Out[320]:

saturn.columns["HowReach"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 2,944
len_min 0
len_max 599
len_mean 36.3
len_median 0
len_p95 221
word_mean 6.879
word_median 1
n_empty 13,043
n_duplicates 13,438
duplicate_rate 0.8203
vocab_size 7,981
readability_flesch_mean 12.17
emoji_rate 0
url_rate 0
one_word_rate 0.7962
allcaps_rate 0
boilerplate_rate 0.0001221
alert: one_word 79.6% rows are a single word
alert: duplicates 82.0% duplicate strings
Fig 109. Character-length distribution for HowReach.
Character-length distribution for HowReach (mean: 36.29764375534123).
chars count
0 – 15 13043
15 – 30 0
30 – 45 1
45 – 60 7
60 – 75 61
75 – 90 158
90 – 105 257
105 – 120 253
120 – 135 289
135 – 150 277
150 – 165 282
165 – 180 238
180 – 195 226
195 – 210 190
210 – 225 368
225 – 240 177
240 – 255 139
255 – 270 92
270 – 285 106
285 – 300 60
300 – 314 42
314 – 329 25
329 – 344 25
344 – 359 13
359 – 374 9
374 – 389 15
389 – 404 5
404 – 419 4
419 – 434 3
434 – 449 1
449 – 464 3
464 – 479 2
479 – 494 2
494 – 509 3
509 – 524 0
524 – 539 3
539 – 554 2
554 – 569 0
569 – 584 0
584 – 599 1

PrayForChurch text free_text

Free-text prayer prompts focused on praying for the church among unreached groups (top words: pray, the, among). The column is mostly empty: 14,208 of 16,382 rows are blank, which accounts for most of the duplicate_rate of 0.89. The 2,174 populated rows carry 1,790 distinct texts, with a single Jat-related prayer appearing 146 times. The small vocabulary (4,576 words) suggests templated prayer points rather than authored prose, and all 652 language-tagged rows are English.

Treatment: Treat as sparse optional commentary; treat empties as missing and dedupe templates before any text modelling.

anthropic:claude-opus-4-7 · confidence high
Out[323]:

saturn.columns["PrayForChurch"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 1,791
len_min 0
len_max 649
len_mean 26.91
len_median 0
len_p95 210
word_mean 5.599
word_median 1
n_empty 14,208
n_duplicates 14,591
duplicate_rate 0.8907
vocab_size 4,576
readability_flesch_mean 7.55
emoji_rate 0
url_rate 0
one_word_rate 0.8673
allcaps_rate 0
boilerplate_rate 0
alert: one_word 86.7% rows are a single word
alert: duplicates 89.1% duplicate strings
Fig 110. Character-length distribution for PrayForChurch.
Character-length distribution for PrayForChurch (mean: 26.90599438408009).
chars count
0 – 16 14208
16 – 32 0
32 – 49 1
49 – 65 16
65 – 81 45
81 – 97 107
97 – 114 112
114 – 130 149
130 – 146 164
146 – 162 193
162 – 178 134
178 – 195 324
195 – 211 110
211 – 227 126
227 – 243 95
243 – 260 99
260 – 276 56
276 – 292 103
292 – 308 55
308 – 324 116
324 – 341 17
341 – 357 28
357 – 373 23
373 – 389 33
389 – 406 15
406 – 422 3
422 – 438 13
438 – 454 9
454 – 471 5
471 – 487 2
487 – 503 6
503 – 519 3
519 – 535 2
535 – 552 2
552 – 568 2
568 – 584 2
584 – 600 1
600 – 617 0
617 – 633 1
633 – 649 2

PrayForPG text free_text

Free-text prayer points for a people group (PG), built on recurring 'Pray for...' templates. The column is mostly empty: 12,570 of 16,382 rows are blank (median length 0, median word count 1), which accounts for most of the duplicate_rate of 0.78. The 3,812 populated rows carry 3,529 distinct texts, with one Rajput-focused prayer block repeated 88 times. A Flesch mean of 14.3 is consistent with dense, formulaic devotional prose rather than varied commentary.

Treatment: Treat as sparse boilerplate: drop empties, dedupe, and only embed the ~3.5k unique strings if prayer content is actually needed downstream.

anthropic:claude-opus-4-7 · confidence high
Out[326]:

saturn.columns["PrayForPG"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique 3,530
len_min 0
len_max 937
len_mean 72.22
len_median 0
len_p95 400
word_mean 13.06
word_median 1
n_empty 12,570
n_duplicates 12,852
duplicate_rate 0.7845
vocab_size 9,427
readability_flesch_mean 14.31
emoji_rate 0
url_rate 0
one_word_rate 0.7673
allcaps_rate 0
boilerplate_rate 0
alert: one_word 76.7% rows are a single word
alert: duplicates 78.5% duplicate strings
Fig 111. Character-length distribution for PrayForPG.
Character-length distribution for PrayForPG (mean: 72.21584666096936).
chars count
0 – 23 12571
23 – 47 0
47 – 70 6
70 – 94 48
94 – 117 91
117 – 141 161
141 – 164 144
164 – 187 188
187 – 211 167
211 – 234 215
234 – 258 218
258 – 281 266
281 – 305 279
305 – 328 339
328 – 351 324
351 – 375 245
375 – 398 275
398 – 422 199
422 – 445 241
445 – 468 122
468 – 492 76
492 – 515 59
515 – 539 43
539 – 562 34
562 – 586 22
586 – 609 11
609 – 632 12
632 – 656 6
656 – 679 11
679 – 703 4
703 – 726 3
726 – 750 1
750 – 773 0
773 – 796 0
796 – 820 0
820 – 843 0
843 – 867 0
867 – 890 0
890 – 914 0
914 – 937 1

Resources unknown other

The column is named "Resources" but saturn skipped profiling, so kind is unknown and no descriptive statistics were computed. We can confirm 16382 rows with a null_rate of 0.0, but n_unique and value-level signals are missing. Without those stats the semantic role and content type cannot be determined from the evidence alone.

Treatment: Re-profile this column with type coercion before deciding how to use it.

anthropic:claude-opus-4-7 · confidence low
Out[329]:

saturn.columns["Resources"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique (not computed)
alert: skipped no profiler for kind=unknown

country_data unknown other

The column `country_data` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16382 and a null rate of 0.0. Without `n_unique` or any sample stats, the actual content (codes, names, nested objects) cannot be inferred from the evidence.

Treatment: Re-profile with an appropriate parser to determine structure before use.

anthropic:claude-opus-4-7 · confidence low
Out[331]:

saturn.columns["country_data"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique (not computed)
alert: skipped no profiler for kind=unknown

language_data unknown other

The `language_data` column was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16382 and a null rate of 0.0. Without `n_unique` or any descriptive stats, the actual contents (likely some structured or nested language payload given the name) cannot be characterized here.

Treatment: Re-profile with an appropriate parser (e.g., expand JSON or cast to string) before deciding on downstream use.

anthropic:claude-opus-4-7 · confidence low
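Before re-profiling a skipped column, it can help to see what kinds of objects its cells actually hold. The sketch below is a generic inspector, not part of saturn: `describe_cell` is a hypothetical helper, and the sample values (including the key name in the JSON string) are invented because the profile exposes nothing about the real contents.

```python
import json
from collections import Counter


def describe_cell(value) -> str:
    """Coarse description of one cell: its type and, for strings, whether it parses as JSON."""
    if isinstance(value, (dict, list)):
        return f"nested:{type(value).__name__}"
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return "str:plain"
        return f"str:json-{type(parsed).__name__}"
    return type(value).__name__


# Invented examples of shapes such a column might hold.
sample = [{"code": "eng"}, '{"code": "eng"}', "plain text", [1, 2], None]
shape_counts = Counter(describe_cell(v) for v in sample)
print(shape_counts)
```

A tally like this over a few hundred rows is usually enough to choose between expanding nested fields and casting the column to string.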
Out[333]:

saturn.columns["language_data"].stats

stat value
n 16,382
nulls 0 (0.0%)
unique (not computed)
alert: skipped no profiler for kind=unknown

How to cite

BibTeX
@misc{saturn-joshua-project-joshua-project-enriched-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: joshua project joshua project enriched},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/joshua-project-joshua_project_enriched}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: joshua project joshua project enriched. Source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_enriched.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/joshua-project-joshua_project_enriched