joshua-project-joshua_project_enriched

Overview

Source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_enriched.parquet

Saturn profiled 16,382 rows across 109 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

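# Notebook cell: install the profiler, then run its CLI on the parquet file.
!pip install saturn-dissect
import subprocess

# Flags are the ones used for this report; --llm turns on the opt-in prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_enriched.parquet",
    "--findings", "joshua-project-joshua_project_enriched.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)  # raise CalledProcessError if the CLI exits non-zero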

Summary confidence: high

This is the Joshua Project people-groups dataset: 16,382 rows and 109 columns describing ethnic groups by country, with demographics, language, religion mix, and Christian-engagement indicators. The shape is dominated by categorical and text fields; Continent, AffinityBloc, PrimaryReligion, and the JPScale 'reachedness' rating give the cleanest first read on who is in the file. Population is extremely long-tailed (median 20,000 but max ~913M and skew ~91), so any size analysis should use logs or quantiles rather than means. Religion-share columns like PCIslam, PCHinduism, and PCBuddhism are mostly zero with a minority of groups at very high percentages, which suggests that most groups have a single dominant religion. Watch out for columns that are almost entirely null (NomadicTypeDescription 98.1%, RLG4 96.2%, PrimaryLanguageDialect 92.3%), for NTOnline, which is 28.5% null and has only one distinct non-null value, and for the many near-duplicate URL/ID fields that won't add analytic value.

citing: Continent · AffinityBloc · PrimaryReligion · JPScaleText · Population · PCIslam · PCHinduism · RegionName · LeastReached · Frontier

Out[4]:

saturn.schema() · 109 columns

column	kind	n	null%	unique	alerts
PeopleID3ROG3	text	16,382	0.0%	16,382	near_unique one_word allcaps short_text
ROG3	categorical	16,382	0.0%	238
PeopleID3	numeric	16,382	0.0%	10,415
ROP3	numeric	16,382	0.1%	10,405
PeopNameInCountry	text	16,382	0.0%	10,748	one_word duplicates
ROG2	categorical	16,382	0.0%	7
Continent	categorical	16,382	0.0%	7
RegionName	categorical	16,382	0.0%	12
ISO3	categorical	16,382	0.0%	238
LocationInCountry	text	16,382	45.1%	7,794	multilingual null_rate
PeopleID1	numeric	16,382	0.0%	16
ROP1	categorical	16,382	0.0%	16
AffinityBloc	categorical	16,382	0.0%	16
PeopleID2	numeric	16,382	0.0%	267
ROP2	categorical	16,382	0.0%	214
PeopleCluster	categorical	16,382	0.0%	267
PeopNameAcrossCountries	text	16,382	0.0%	10,377	one_word duplicates
Population	numeric	16,382	0.2%	1,708	high_skew outliers
Category	categorical	16,382	0.0%	3
ROL3	text	16,382	0.0%	6,164	one_word short_text duplicates
PrimaryLanguageName	text	16,382	0.0%	6,153	one_word short_text duplicates
PrimaryLanguageDialect	categorical	16,382	92.3%	980	long_tail null_rate
NumberLanguagesSpoken	numeric	16,382	0.0%	78	high_skew outliers
OfficialLang	categorical	16,382	0.0%	87
SpeakNationalLang	unknown	16,382	0.0%	—	skipped
BibleStatus	numeric	16,382	0.0%	6
BibleYear	categorical	16,382	52.4%	466	null_rate
NTYear	text	16,382	30.5%	1,072	one_word allcaps null_rate short_text duplicates
PortionsYear	text	16,382	17.9%	1,737	one_word allcaps short_text duplicates
TranslationNeedQuestionable	unknown	16,382	0.0%	—	skipped
JPScale	numeric	16,382	0.0%	5
JPScalePC	categorical	16,382	0.0%	5
JPScalePGAC	categorical	16,382	0.0%	5
LeastReached	categorical	16,382	0.0%	2
LeastReachedPC	categorical	16,382	0.0%	2
LeastReachedPGAC	categorical	16,382	0.0%	2
GSEC	categorical	16,382	0.0%	8
HasAudioRecordings	categorical	16,382	0.0%	2
NTOnline	categorical	16,382	28.5%	1	null_rate imbalance
RLG3	numeric	16,382	0.0%	8
RLG3PC	numeric	16,382	0.0%	8
RLG3PGAC	numeric	16,382	0.0%	8
PrimaryReligion	categorical	16,382	0.0%	8
PrimaryReligionPC	categorical	16,382	0.0%	8
PrimaryReligionPGAC	categorical	16,382	0.0%	8
RLG4	numeric	16,382	96.2%	20	null_rate outliers
ReligionSubdivision	categorical	16,382	96.2%	20	null_rate
PCIslam	numeric	16,382	0.5%	1,117	outliers
PCNonReligious	numeric	16,382	0.4%	223	high_skew outliers
PCUnknown	numeric	16,382	0.6%	583	high_skew
SecurityLevel	numeric	16,382	0.0%	3
LRTop100	categorical	16,382	0.0%	2	imbalance
PhotoAddress	text	16,382	0.0%	5,277	one_word short_text duplicates
PhotoCredits	text	16,382	0.1%	1,605	one_word duplicates
PhotoCreditURL	text	16,382	33.1%	1,434	one_word url_heavy null_rate duplicates
PhotoCreativeCommons	categorical	16,382	0.0%	2
PhotoCopyright	categorical	16,382	0.1%	2
PhotoPermission	categorical	16,382	0.1%	3
ProfileTextExists	categorical	16,382	0.0%	2
CountOfCountries	numeric	16,382	0.0%	48	high_skew outliers
CountOfProvinces	unknown	16,382	0.0%	—	skipped
Longitude	numeric	16,382	0.0%	15,889	high_skew
Latitude	numeric	16,382	0.0%	15,851
Ctry	categorical	16,382	0.0%	238
IndigenousCode	categorical	16,382	0.0%	2
PercentAdherents	text	16,382	0.0%	1,248	one_word allcaps short_text duplicates
PercentChristianPC	categorical	16,382	0.0%	246
NaturalName	text	16,382	0.0%	10,737	one_word duplicates
NaturalPronunciation	text	16,382	69.6%	1,933	one_word null_rate duplicates
PercentChristianPGAC	text	16,382	0.1%	1,954	one_word allcaps short_text duplicates
PercentEvangelical	text	16,382	6.7%	1,047	one_word allcaps short_text duplicates
PercentEvangelicalPC	categorical	16,382	1.0%	228
PercentEvangelicalPGAC	text	16,382	4.5%	1,624	one_word allcaps short_text duplicates
PCBuddhism	numeric	16,382	0.6%	1,052	high_skew outliers
PCEthnicReligions	numeric	16,382	0.4%	978	outliers
PCHinduism	numeric	16,382	0.6%	1,412	high_skew outliers
PCOtherSmall	numeric	16,382	0.6%	908	high_skew outliers
RegionCode	numeric	16,382	0.0%	12
PopulationPGAC	numeric	16,382	0.1%	2,250	high_skew outliers
Frontier	categorical	16,382	0.0%	2
MapAddress	text	16,382	0.0%	6,029	one_word short_text duplicates
HasJesusFilm	categorical	16,382	0.0%	2
Nomadic	categorical	16,382	0.0%	2	imbalance
NomadicTypeDescription	categorical	16,382	98.1%	6	null_rate
PhotoCCVersionText	categorical	16,382	0.0%	17
PhotoCCVersionURL	categorical	16,382	0.0%	17
MapCredits	categorical	16,382	0.0%	199	long_tail
MapCreditURL	categorical	16,382	0.0%	51	long_tail imbalance
MapCopyright	categorical	16,382	0.0%	3
MapCCVersionText	categorical	16,382	0.0%	6	imbalance
MapCCVersionURL	categorical	16,382	0.0%	6	imbalance
JF	categorical	16,382	0.0%	2
AudioRecordings	categorical	16,382	0.0%	2
Window1040	categorical	16,382	0.0%	2
PeopleGroupMapURL	text	16,382	0.0%	6,029	one_word url_heavy duplicates
PeopleGroupMapExpandedURL	text	16,382	0.0%	5,561	one_word url_heavy duplicates
PeopleGroupURL	text	16,382	0.0%	16,382	near_unique one_word url_heavy
PeopleGroupPhotoURL	text	16,382	0.0%	5,277	one_word url_heavy duplicates
CountryURL	categorical	16,382	0.0%	238
JPScaleText	categorical	16,382	0.0%	5
JPScaleImageURL	categorical	16,382	0.0%	5
Summary	text	16,382	0.0%	3,778	one_word duplicates
Obstacles	text	16,382	0.0%	3,732	one_word duplicates
HowReach	text	16,382	0.0%	2,944	one_word duplicates
PrayForChurch	text	16,382	0.0%	1,791	one_word duplicates
PrayForPG	text	16,382	0.0%	3,530	one_word duplicates
Resources	unknown	16,382	0.0%	—	skipped
country_data	unknown	16,382	0.0%	—	skipped
language_data	unknown	16,382	0.0%	—	skipped

Fig 1.

Continent · Where the people groups sit geographically — Asia dominates at ~45% of rows.

Top values for Continent (7 unique shown, of 7 total).
value	count	share
Asia	7368	45.0%
Africa	3635	22.2%
Europe	1532	9.4%
North America	1407	8.6%
Australia	1088	6.6%
South America	905	5.5%
Oceania	447	2.7%

Fig 2.

PrimaryReligion · Religion mix across groups; Christianity and Islam together cover most of the file.

Top values for PrimaryReligion (8 unique shown, of 8 total).
value	count	share
Christianity	6459	39.4%
Islam	3786	23.1%
Ethnic Religions	2651	16.2%
Hinduism	2338	14.3%
Buddhism	635	3.9%
Non-Religious	200	1.2%
Unknown	189	1.2%
Other / Small	124	0.8%

Fig 3.

JPScaleText · Reachedness rating — note that 'Unreached' is the single largest bucket (~43%).

Top values for JPScaleText (5 unique shown, of 5 total).
value	count	share
Unreached	7124	43.5%
Partially Reached	3636	22.2%
Significantly Reached	3200	19.5%
Superficially Reached	1413	8.6%
Minimally Reached	1009	6.2%

Fig 4.

Population · Highly skewed group sizes (median 20k, max ~913M); plot on a log scale to see the shape.

Histogram bins for Population (median: 20000.0).
bin	count
10 – 2.282e+07	16302
2.282e+07 – 4.565e+07	32
4.565e+07 – 6.847e+07	12
6.847e+07 – 9.13e+07	3
9.13e+07 – 1.141e+08	3
1.141e+08 – 1.369e+08	3
1.369e+08 – 1.598e+08	0
1.598e+08 – 1.826e+08	0
1.826e+08 – 2.054e+08	1
2.054e+08 – 2.282e+08	0
2.282e+08 – 2.511e+08	0
2.511e+08 – 2.739e+08	0
2.739e+08 – 2.967e+08	0
2.967e+08 – 3.195e+08	0
3.195e+08 – 3.424e+08	0
3.424e+08 – 3.652e+08	0
3.652e+08 – 3.88e+08	0
3.88e+08 – 4.108e+08	0
4.108e+08 – 4.337e+08	0
4.337e+08 – 4.565e+08	0
4.565e+08 – 4.793e+08	0
4.793e+08 – 5.021e+08	0
5.021e+08 – 5.249e+08	0
5.249e+08 – 5.478e+08	0
5.478e+08 – 5.706e+08	0
5.706e+08 – 5.934e+08	0
5.934e+08 – 6.162e+08	0
6.162e+08 – 6.391e+08	0
6.391e+08 – 6.619e+08	0
6.619e+08 – 6.847e+08	0
6.847e+08 – 7.075e+08	0
7.075e+08 – 7.304e+08	0
7.304e+08 – 7.532e+08	0
7.532e+08 – 7.76e+08	0
7.76e+08 – 7.988e+08	0
7.988e+08 – 8.217e+08	0
8.217e+08 – 8.445e+08	0
8.445e+08 – 8.673e+08	0
8.673e+08 – 8.901e+08	0
8.901e+08 – 9.13e+08	1
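
With equal-width bins, almost every row lands in the first bucket. A minimal sketch of decade (log10) binning follows, assuming Population has been loaded into a plain Python list. The helper name and the toy values are illustrative and are not taken from the file.

```python
import math
from collections import Counter

def log10_bin_counts(values):
    """Count positive values per decade: bin k holds 10**k <= v < 10**(k+1)."""
    counts = Counter()
    for v in values:
        if v is None or v <= 0:
            continue  # nulls and non-positive values have no logarithm
        counts[math.floor(math.log10(v))] += 1
    return dict(sorted(counts.items()))

# Toy values only (assumption): a few sizes spanning the reported range.
sample = [10, 450, 20_000, 20_000, 75_000, 3_200_000, 913_000_000, None]
decades = log10_bin_counts(sample)
for k, n in decades.items():
    print(f"1e{k} to 1e{k + 1}: {n}")
```

Each decade gets its own bucket, so the long tail up to ~913M no longer flattens the bulk of the distribution around the 20,000 median.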

Fig 5.

AffinityBloc · Top-level cultural groupings; South Asian and Sub-Saharan blocs lead.

Top values for AffinityBloc (16 unique shown, of 16 total).
value	count	share
South Asian Peoples	4178	25.5%
Sub-Saharan Peoples	3073	18.8%
Eurasian Peoples	1593	9.7%
Pacific Islanders	1588	9.7%
Latin-Caribbean Americans	1352	8.3%
Malay Peoples	1031	6.3%
Southeast Asian Peoples	635	3.9%
Arab World	634	3.9%
Tibetan-Himalayan Peoples	453	2.8%
North American Peoples	415	2.5%
East Asian Peoples	402	2.5%
Turkic Peoples	299	1.8%
Persian-Median	228	1.4%
Horn of Africa Peoples	200	1.2%
Deaf	164	1.0%
Jewish	137	0.8%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
PeopleID3ROG3	text	0.0%
ROG3	categorical	0.0%
PeopleID3	numeric	0.0%
ROP3	numeric	0.1%
PeopNameInCountry	text	0.0%
ROG2	categorical	0.0%
Continent	categorical	0.0%
RegionName	categorical	0.0%
ISO3	categorical	0.0%
LocationInCountry	text	45.1%
PeopleID1	numeric	0.0%
ROP1	categorical	0.0%
AffinityBloc	categorical	0.0%
PeopleID2	numeric	0.0%
ROP2	categorical	0.0%
PeopleCluster	categorical	0.0%
PeopNameAcrossCountries	text	0.0%
Population	numeric	0.2%
Category	categorical	0.0%
ROL3	text	0.0%
PrimaryLanguageName	text	0.0%
PrimaryLanguageDialect	categorical	92.3%
NumberLanguagesSpoken	numeric	0.0%
OfficialLang	categorical	0.0%
SpeakNationalLang	unknown	0.0%
BibleStatus	numeric	0.0%
BibleYear	categorical	52.4%
NTYear	text	30.5%
PortionsYear	text	17.9%
TranslationNeedQuestionable	unknown	0.0%
JPScale	numeric	0.0%
JPScalePC	categorical	0.0%
JPScalePGAC	categorical	0.0%
LeastReached	categorical	0.0%
LeastReachedPC	categorical	0.0%
LeastReachedPGAC	categorical	0.0%
GSEC	categorical	0.0%
HasAudioRecordings	categorical	0.0%
NTOnline	categorical	28.5%
RLG3	numeric	0.0%
RLG3PC	numeric	0.0%
RLG3PGAC	numeric	0.0%
PrimaryReligion	categorical	0.0%
PrimaryReligionPC	categorical	0.0%
PrimaryReligionPGAC	categorical	0.0%
RLG4	numeric	96.2%
ReligionSubdivision	categorical	96.2%
PCIslam	numeric	0.5%
PCNonReligious	numeric	0.4%
PCUnknown	numeric	0.6%
SecurityLevel	numeric	0.0%
LRTop100	categorical	0.0%
PhotoAddress	text	0.0%
PhotoCredits	text	0.1%
PhotoCreditURL	text	33.1%
PhotoCreativeCommons	categorical	0.0%
PhotoCopyright	categorical	0.1%
PhotoPermission	categorical	0.1%
ProfileTextExists	categorical	0.0%
CountOfCountries	numeric	0.0%
CountOfProvinces	unknown	0.0%
Longitude	numeric	0.0%
Latitude	numeric	0.0%
Ctry	categorical	0.0%
IndigenousCode	categorical	0.0%
PercentAdherents	text	0.0%
PercentChristianPC	categorical	0.0%
NaturalName	text	0.0%
NaturalPronunciation	text	69.6%
PercentChristianPGAC	text	0.1%
PercentEvangelical	text	6.7%
PercentEvangelicalPC	categorical	1.0%
PercentEvangelicalPGAC	text	4.5%
PCBuddhism	numeric	0.6%
PCEthnicReligions	numeric	0.4%
PCHinduism	numeric	0.6%
PCOtherSmall	numeric	0.6%
RegionCode	numeric	0.0%
PopulationPGAC	numeric	0.1%
Frontier	categorical	0.0%
MapAddress	text	0.0%
HasJesusFilm	categorical	0.0%
Nomadic	categorical	0.0%
NomadicTypeDescription	categorical	98.1%
PhotoCCVersionText	categorical	0.0%
PhotoCCVersionURL	categorical	0.0%
MapCredits	categorical	0.0%
MapCreditURL	categorical	0.0%
MapCopyright	categorical	0.0%
MapCCVersionText	categorical	0.0%
MapCCVersionURL	categorical	0.0%
JF	categorical	0.0%
AudioRecordings	categorical	0.0%
Window1040	categorical	0.0%
PeopleGroupMapURL	text	0.0%
PeopleGroupMapExpandedURL	text	0.0%
PeopleGroupURL	text	0.0%
PeopleGroupPhotoURL	text	0.0%
CountryURL	categorical	0.0%
JPScaleText	categorical	0.0%
JPScaleImageURL	categorical	0.0%
Summary	text	0.0%
Obstacles	text	0.0%
HowReach	text	0.0%
PrayForChurch	text	0.0%
PrayForPG	text	0.0%
Resources	unknown	0.0%
country_data	unknown	0.0%
language_data	unknown	0.0%
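
The null-rate table is easiest to act on as a threshold filter. The sketch below assumes the rates have been copied into a dict (only a subset of the columns above is shown); the function name and the 90% cut-off are illustrative choices, not Saturn settings.

```python
# Subset of the null rates listed above, in percent.
NULL_PCT = {
    "LocationInCountry": 45.1,
    "PrimaryLanguageDialect": 92.3,
    "BibleYear": 52.4,
    "NTYear": 30.5,
    "NTOnline": 28.5,
    "RLG4": 96.2,
    "ReligionSubdivision": 96.2,
    "PhotoCreditURL": 33.1,
    "NaturalPronunciation": 69.6,
    "NomadicTypeDescription": 98.1,
    "Population": 0.2,
}

def columns_above(null_pct, threshold):
    """Return column names whose null rate is >= threshold, highest first."""
    hits = [c for c, p in null_pct.items() if p >= threshold]
    return sorted(hits, key=lambda c: -null_pct[c])

mostly_null = columns_above(NULL_PCT, 90.0)
print(mostly_null)
```

Whether a mostly-null column should be dropped or turned into a presence flag depends on the analysis; the filter only produces the candidate list.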

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 7,033 detected strings).
lang	count	share
en	6996	99.5%
es	12	0.2%
pt	10	0.1%
id	3	0.0%
fr	2	0.0%
la	2	0.0%
it	2	0.0%
de	2	0.0%
sco	1	0.0%
nl	1	0.0%
ceb	1	0.0%
sq	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
	PeopleID3	ROP3	PeopleID1	PeopleID2	Population	NumberLanguagesSpoken	BibleStatus	JPScale	RLG3	RLG3PC	RLG3PGAC	RLG4
PeopleID3	+1.00	+0.90	+0.28	+0.46	+0.06	+0.16	+0.05	-0.36	+0.32	+0.32	+0.32	+0.09
ROP3	+0.90	+1.00	+0.27	+0.46	-0.09	+0.18	+0.07	-0.36	+0.30	+0.30	+0.29	+0.10
PeopleID1	+0.28	+0.27	+1.00	+0.46	-0.00	+0.15	-0.13	-0.07	+0.16	+0.17	+0.15	-0.08
PeopleID2	+0.46	+0.46	+0.46	+1.00	+0.06	+0.35	+0.20	-0.38	+0.27	+0.31	+0.25	+0.24
Population	+0.06	-0.09	-0.00	+0.06	+1.00	-0.01	-0.08	+0.05	-0.05	-0.04	-0.05	+0.07
NumberLanguagesSpoken	+0.16	+0.18	+0.15	+0.35	-0.01	+1.00	+0.14	-0.22	+0.17	+0.21	+0.17	+0.12
BibleStatus	+0.05	+0.07	-0.13	+0.20	-0.08	+0.14	+1.00	-0.23	+0.12	+0.19	+0.11	+0.25
JPScale	-0.36	-0.36	-0.07	-0.38	+0.05	-0.22	-0.23	+1.00	-0.72	-0.71	-0.71	-0.12
RLG3	+0.32	+0.30	+0.16	+0.27	-0.05	+0.17	+0.12	-0.72	+1.00	+0.79	+0.97	+0.09
RLG3PC	+0.32	+0.30	+0.17	+0.31	-0.04	+0.21	+0.19	-0.71	+0.79	+1.00	+0.79	+0.17
RLG3PGAC	+0.32	+0.29	+0.15	+0.25	-0.05	+0.17	+0.11	-0.71	+0.97	+0.79	+1.00	+0.10
RLG4	+0.09	+0.10	-0.08	+0.24	+0.07	+0.12	+0.25	-0.12	+0.09	+0.17	+0.10	+1.00
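
Several columns in this matrix are identifiers (PeopleID3, ROP3, PeopleID1, PeopleID2), so a coefficient such as +0.90 between PeopleID3 and ROP3 says something about how the codes were assigned, not about a relationship between measurements. For reference, here is a minimal sketch of the Pearson coefficient itself. It assumes two equal-length lists and drops pairs where either value is missing; how Saturn samples rows and handles nulls is not stated in this report, so that part is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)

# Toy data: a perfectly linear pair and its mirror image.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4, None], [8, 6, 4, 2, 0])
```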

PeopleID3ROG3 text identifier

PeopleID3ROG3 is a row-level identifier, not a person-level one: every one of the 16,382 rows has a unique 7-character single token (n_unique equals n, duplicate_rate 0, len_min and len_max both 7). The sampled values are a 5-digit numeric prefix followed by a 2-letter suffix (e.g. '10375tz', '10375up'), which fits the column name: a PeopleID3 value (range 10,119 to 22,661) concatenated with a two-letter ROG3 country code. That makes it a structured composite key rather than a random hash. The samples are quoted in lowercase although allcaps_rate is 1, so check the stored casing before joining on it. No nulls, no boilerplate, no duplicates: clean but useless as a feature.

Treatment: Drop from modelling; retain only as a join key.

anthropic:claude-opus-4-7 · confidence high
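
A minimal sketch of splitting the key into its two parts. It assumes the 5-digit prefix is the PeopleID3 value and the 2-letter suffix is the ROG3 code, which the column name suggests but this report does not confirm; the function names are illustrative.

```python
import re

# Five digits followed by two letters, e.g. '10375tz' (either letter case).
KEY_PATTERN = re.compile(r"(\d{5})([A-Za-z]{2})")

def split_key(key):
    """Return (numeric prefix, upper-cased 2-letter suffix) for a composite key."""
    match = KEY_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"unexpected key format: {key!r}")
    return int(match.group(1)), match.group(2).upper()

def is_unique(keys):
    """True when no key value repeats."""
    return len(set(keys)) == len(keys)

sampled = ["10375tz", "10375up"]  # the two sampled values quoted above
parts = [split_key(k) for k in sampled]
print(parts, is_unique(sampled))
```

Raising on a non-matching key is deliberate: a silent fallback would hide rows that do not follow the 5+2 layout.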

Out[14]:

saturn.columns["PeopleID3ROG3"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	16,382
len_min	7
len_max	7
len_mean	7
len_median	7
len_p95	7
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	16,382
readability_flesch_mean	115.3
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 9.

Character-length distribution for PeopleID3ROG3.

Character-length distribution for PeopleID3ROG3 (mean: 7.0).
chars	count
7	16382

ROG3 categorical feature

ROG3 holds 238 distinct two-letter country codes, with IN (2,262 rows, 13.8%) leading, followed by PP, ID, PK, CH, and NI. No nulls across 16,382 rows, and the entropy ratio of 0.79 indicates a broad spread across many countries with some concentration at the top. The codes are not ISO 3166-1 alpha-2: the top-20 counts line up one-for-one with the ISO3 column (PP with PNG at 883, CH with CHN at 547, NI with NGA at 535, BM with MMR at 218), a pattern consistent with FIPS 10-4 style country codes.

Treatment: Treat as a high-cardinality categorical: target- or frequency-encode for modelling, and use the ISO3 column in the same file when a standard country code is needed rather than reading ROG3 values as ISO alpha-2.

anthropic:claude-opus-4-7 · confidence high
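
The treatment mentions frequency encoding. A minimal sketch, assuming the column is available as a list of codes: each value is replaced by its share of rows. The function name and the toy list are illustrative.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

# Toy column (assumption), not the real distribution.
codes = ["IN", "IN", "PP", "ID"]
encoded = frequency_encode(codes)
print(encoded)
```

In a modelling pipeline the frequencies would normally be learned on the training split only and then applied to held-out rows, so the encoding does not leak information.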

Out[17]:

saturn.columns["ROG3"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	238
top_value	IN
top_rate	0.1381
cardinality	238
entropy	6.225
entropy_ratio	0.7885

Fig 10.

Top values for ROG3.

Top values for ROG3 (20 unique shown, of 238 total).
value	count	share
IN	2262	13.8%
PP	883	5.4%
ID	788	4.8%
PK	775	4.7%
CH	547	3.3%
NI	535	3.3%
US	496	3.0%
MX	333	2.0%
BR	321	2.0%
CM	292	1.8%
BG	278	1.7%
CA	243	1.5%
CG	231	1.4%
BM	218	1.3%
AS	205	1.3%
RP	200	1.2%
SU	198	1.2%
NP	195	1.2%
LA	184	1.1%
MY	183	1.1%

PeopleID3 numeric foreign_key

PeopleID3 is an integer key ranging from 10119 to 22661 with 10415 unique values across 16382 rows and no nulls. The duplication (about 5967 repeated entries) and the bounded, non-zero range are consistent with a foreign-key reference to a people table rather than a measurement. Distribution is mildly right-skewed (0.37) and platykurtic (-0.98), with no outliers flagged.

Treatment: Treat as a foreign key and left-join to the people dimension; do not use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
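
A minimal sketch of the left join the treatment describes. No people dimension is included in this report, so the `people` dict, its `PeopleLabel` field, and the toy rows are all hypothetical.

```python
def left_join(rows, dimension, key="PeopleID3"):
    """Attach dimension attributes to each row; collect keys with no match."""
    joined, unmatched = [], []
    for row in rows:
        attrs = dimension.get(row[key])
        if attrs is None:
            unmatched.append(row[key])
            attrs = {}  # keep the row, as a left join does
        joined.append({**row, **attrs})
    return joined, unmatched

# Hypothetical dimension and fact rows (not taken from the file).
people = {10375: {"PeopleLabel": "example group"}}
rows = [
    {"PeopleID3": 10375, "ROG3": "TZ"},
    {"PeopleID3": 10375, "ROG3": "UP"},
    {"PeopleID3": 22661, "ROG3": "IN"},
]
joined, unmatched = left_join(rows, people)
print(len(joined), unmatched)
```

Reporting the unmatched keys is the useful part: a foreign key that fails to resolve is worth knowing about before any aggregation.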

Out[20]:

saturn.columns["PeopleID3"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	10,415
min	10,119
max	22,661
mean	1.541e+04
median	14,962
std	3478
q1	12,348
q3	1.833e+04
iqr	5984
skew	0.3697
kurtosis	-0.9812
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 11.

Distribution of PeopleID3. Vertical dash marks the median.

Histogram bins for PeopleID3 (median: 14962.0).
bin	count
1.012e+04 – 1.043e+04	548
1.043e+04 – 1.075e+04	446
1.075e+04 – 1.106e+04	492
1.106e+04 – 1.137e+04	745
1.137e+04 – 1.169e+04	515
1.169e+04 – 1.2e+04	587
1.2e+04 – 1.231e+04	670
1.231e+04 – 1.263e+04	486
1.263e+04 – 1.294e+04	509
1.294e+04 – 1.325e+04	519
1.325e+04 – 1.357e+04	456
1.357e+04 – 1.388e+04	446
1.388e+04 – 1.42e+04	409
1.42e+04 – 1.451e+04	614
1.451e+04 – 1.482e+04	537
1.482e+04 – 1.514e+04	573
1.514e+04 – 1.545e+04	584
1.545e+04 – 1.576e+04	626
1.576e+04 – 1.608e+04	411
1.608e+04 – 1.639e+04	305
1.639e+04 – 1.67e+04	251
1.67e+04 – 1.702e+04	258
1.702e+04 – 1.733e+04	261
1.733e+04 – 1.764e+04	344
1.764e+04 – 1.796e+04	302
1.796e+04 – 1.827e+04	302
1.827e+04 – 1.858e+04	387
1.858e+04 – 1.89e+04	345
1.89e+04 – 1.921e+04	590
1.921e+04 – 1.953e+04	319
1.953e+04 – 1.984e+04	319
1.984e+04 – 2.015e+04	215
2.015e+04 – 2.047e+04	254
2.047e+04 – 2.078e+04	239
2.078e+04 – 2.109e+04	173
2.109e+04 – 2.141e+04	297
2.141e+04 – 2.172e+04	291
2.172e+04 – 2.203e+04	146
2.203e+04 – 2.235e+04	314
2.235e+04 – 2.266e+04	297

ROP3 numeric identifier

ROP3 is a numeric column tightly bounded between 100004 and 119649 across 16382 rows, with 10405 unique values and almost no nulls (0.07%). The mean (109058.6) and median (108856) sit close together with low skew (0.15) and slightly platykurtic shape (kurtosis -1.05), and Saturn flagged zero outliers. The narrow ~20k range starting just above 100000, together with the +0.90 correlation with PeopleID3, looks more like a coded identifier than a free-ranging measurement.

Treatment: Treat as an identifier code rather than a continuous feature; do not scale or log-transform.

anthropic:claude-opus-4-7 · confidence medium

Out[23]:

saturn.columns["ROP3"].stats

stat	value
n	16,382
nulls	11 (0.1%)
unique	10,405
min	100,004
max	119,649
mean	1.091e+05
median	108,856
std	5405
q1	104,189
q3	113,527
iqr	9,338
skew	0.1507
kurtosis	-1.053
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 12.

Distribution of ROP3. Vertical dash marks the median.

Histogram bins for ROP3 (median: 108856.0).
bin	count
1e+05 – 1.005e+05	508
1.005e+05 – 1.01e+05	385
1.01e+05 – 1.015e+05	309
1.015e+05 – 1.02e+05	530
1.02e+05 – 1.025e+05	423
1.025e+05 – 1.03e+05	489
1.03e+05 – 1.034e+05	565
1.034e+05 – 1.039e+05	572
1.039e+05 – 1.044e+05	480
1.044e+05 – 1.049e+05	337
1.049e+05 – 1.054e+05	391
1.054e+05 – 1.059e+05	502
1.059e+05 – 1.064e+05	425
1.064e+05 – 1.069e+05	426
1.069e+05 – 1.074e+05	366
1.074e+05 – 1.079e+05	434
1.079e+05 – 1.084e+05	470
1.084e+05 – 1.088e+05	548
1.088e+05 – 1.093e+05	247
1.093e+05 – 1.098e+05	693
1.098e+05 – 1.103e+05	499
1.103e+05 – 1.108e+05	584
1.108e+05 – 1.113e+05	417
1.113e+05 – 1.118e+05	363
1.118e+05 – 1.123e+05	303
1.123e+05 – 1.128e+05	330
1.128e+05 – 1.133e+05	457
1.133e+05 – 1.138e+05	424
1.138e+05 – 1.142e+05	478
1.142e+05 – 1.147e+05	240
1.147e+05 – 1.152e+05	627
1.152e+05 – 1.157e+05	487
1.157e+05 – 1.162e+05	457
1.162e+05 – 1.167e+05	76
1.167e+05 – 1.172e+05	135
1.172e+05 – 1.177e+05	90
1.177e+05 – 1.182e+05	318
1.182e+05 – 1.187e+05	433
1.187e+05 – 1.192e+05	119
1.192e+05 – 1.196e+05	434

PeopNameInCountry text label

Short ethnonym/people-group labels naming a population within a country (e.g., 'French', 'British', 'Han Chinese, Mandarin'), with a median length of 9 characters and median word count of 1. 53% of values are single words and 34% are duplicates (5634 rows), so the same group label recurs across many country rows; 'Deaf' tops the list at 164. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' indicate a Joshua-Project-style people-group taxonomy rather than free text.

Treatment: Treat as a categorical people-group label; normalize casing/parentheticals and join with country to form the unique key.

anthropic:claude-opus-4-7 · confidence high
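
A minimal sketch of the normalization step, assuming qualifiers always appear as a trailing parenthetical such as '(Hindu traditions)'. The function names are illustrative, and 'Example' is a made-up group name.

```python
import re

# Optional trailing "(qualifier)" at the end of a label.
QUALIFIER = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")

def split_label(label):
    """Return (base name, qualifier or None) with whitespace collapsed."""
    text = " ".join(label.split())
    match = QUALIFIER.match(text)
    if match:
        return match.group("base"), match.group("qualifier")
    return text, None

def row_key(label, country_code):
    """Case-insensitive key from base name, qualifier, and country code."""
    base, qualifier = split_label(label)
    return (base.casefold(), (qualifier or "").casefold(), country_code.upper())

print(split_label("Example (Hindu traditions)"))
print(row_key("French", "fr"))
```

The qualifier stays in the key on purpose, so that groups that differ only by '(Hindu traditions)' versus '(Muslim traditions)' remain distinct.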

Out[26]:

saturn.columns["PeopNameInCountry"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	10,748
len_min	1
len_max	41
len_mean	11.46
len_median	9
len_p95	25
word_mean	1.636
word_median	1
n_empty	0
n_duplicates	5,634
duplicate_rate	0.3439
vocab_size	11,539
readability_flesch_mean	49.27
emoji_rate	0
url_rate	0
one_word_rate	0.5313
allcaps_rate	0
boilerplate_rate	0
alert: one_word	53.1% rows are a single word
alert: duplicates	34.4% duplicate strings

Fig 13.

Character-length distribution for PeopNameInCountry.

Character-length distribution for PeopNameInCountry (mean: 11.46).
chars	count
1 – 2	1
2 – 3	33
3 – 4	282
4 – 5	1209
5 – 6	1631
6 – 7	1843
7 – 8	1526
8 – 9	1037
9 – 10	653
10 – 11	636
11 – 12	574
12 – 13	718
13 – 14	708
14 – 15	828
15 – 16	761
16 – 17	665
17 – 18	499
18 – 19	334
19 – 20	254
20 – 21	234
21 – 22	198
22 – 23	235
23 – 24	186
24 – 25	278
25 – 26	255
26 – 27	228
27 – 28	136
28 – 29	107
29 – 30	73
30 – 31	50
31 – 32	59
32 – 33	56
33 – 34	37
34 – 35	31
35 – 36	19
36 – 37	2
37 – 38	1
38 – 39	1
39 – 40	2
40 – 41	2

ROG2 categorical feature

ROG2 is a low-cardinality categorical with 7 codes that look like macro-region groupings (ASI, AFR, EUR, NAR, AUS, LAM, SOP). Distribution is uneven but not degenerate: ASI dominates at 45.0% of 16,382 rows, AFR follows at 3,635, and entropy ratio of 0.80 confirms broad spread across the remaining buckets. No nulls and no rare-value tail beyond the seven codes.

Treatment: one-hot or target-encode; safe to use directly given clean 7-level categorical with no missingness.

anthropic:claude-opus-4-7 · confidence high

Out[29]:

saturn.columns["ROG2"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	7
top_value	ASI
top_rate	0.4498
cardinality	7
entropy	2.257
entropy_ratio	0.8039

Fig 14.

Top values for ROG2.

Top values for ROG2 (7 unique shown, of 7 total).
value	count	share
ASI	7368	45.0%
AFR	3635	22.2%
EUR	1532	9.4%
NAR	1407	8.6%
AUS	1088	6.6%
LAM	905	5.5%
SOP	447	2.7%
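
The reported entropy (2.257) and entropy_ratio (0.8039) are consistent with Shannon entropy in bits divided by log2 of the cardinality. That reading is inferred from the numbers and is not a documented Saturn definition. A short sketch that recomputes both from the ROG2 counts above:

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0  # a single category carries no uncertainty
    return entropy_bits(counts) / math.log2(k)

ROG2_COUNTS = [7368, 3635, 1532, 1407, 1088, 905, 447]  # from the table above
h = entropy_bits(ROG2_COUNTS)
ratio = entropy_ratio(ROG2_COUNTS)
print(round(h, 3), round(ratio, 4))
```

A ratio of 1 would mean all seven codes are equally common; values nearer 0 mean one code dominates.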

Continent categorical feature

Categorical continent label with 7 values and zero nulls across 16,382 rows. Asia dominates at 44.98% (7,368 rows), followed by Africa at 3,635; entropy ratio of 0.80 confirms a moderately skewed but not degenerate distribution. Note that Australia (1,088) and Oceania (447) appear as separate categories, whereas RegionName folds both into 'Australia and Pacific' (1,535 = 1,088 + 447), so the two columns draw the regional boundary differently and are worth reconciling.

Treatment: One-hot encode after merging Australia and Oceania into a single category.

anthropic:claude-opus-4-7 · confidence high
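
A minimal sketch of the merge followed by one-hot encoding. Keeping 'Oceania' as the merged label is an arbitrary choice, and the function names are illustrative.

```python
from collections import Counter

MERGE = {"Australia": "Oceania"}  # assumption: keep 'Oceania' as the merged label

def recode(value):
    """Map a continent label to its merged category."""
    return MERGE.get(value, value)

def one_hot(values):
    """Return (sorted levels, one indicator row per value)."""
    levels = sorted(set(values))
    return levels, [[1 if v == level else 0 for level in levels] for v in values]

# Counts from the Continent table in this report.
CONTINENT_COUNTS = {
    "Asia": 7368, "Africa": 3635, "Europe": 1532, "North America": 1407,
    "Australia": 1088, "South America": 905, "Oceania": 447,
}
merged = Counter()
for label, n in CONTINENT_COUNTS.items():
    merged[recode(label)] += n

levels, rows = one_hot([recode(v) for v in ["Asia", "Australia", "Asia"]])
print(dict(merged), levels, rows)
```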

Out[32]:

saturn.columns["Continent"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	7
top_value	Asia
top_rate	0.4498
cardinality	7
entropy	2.257
entropy_ratio	0.8039

Fig 15.

Top values for Continent.

Top values for Continent (7 unique shown, of 7 total).
value	count	share
Asia	7368	45.0%
Africa	3635	22.2%
Europe	1532	9.4%
North America	1407	8.6%
Australia	1088	6.6%
South America	905	5.5%
Oceania	447	2.7%

RegionName categorical feature

RegionName is a categorical geographic grouping with 12 distinct world regions and no nulls across 16,382 rows. Distribution is fairly balanced (entropy ratio 0.93), though 'Asia, South' leads at 22.6% (3,707 rows) and the top three regions are all Asian or African. Most labels use a 'Continent, Subregion' convention, but 'Australia and Pacific' has no comma, so a naive split needs a fallback if continent-level rollups are wanted.

Treatment: one-hot or target-encode for modelling; optionally split on the comma to derive a continent column, handling 'Australia and Pacific' separately.

anthropic:claude-opus-4-7 · confidence high
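
A minimal sketch of the comma split, with a fallback for labels that have no comma. The function name is illustrative.

```python
def region_to_continent(region):
    """Text before the first comma, or the whole label when there is no comma."""
    head, sep, _ = region.partition(",")
    return head.strip() if sep else region.strip()

# The 12 labels from the RegionName table in this report.
REGIONS = [
    "Asia, South", "Africa, West and Central", "Asia, Southeast",
    "Australia and Pacific", "America, Latin", "Africa, East and Southern",
    "Europe, Western", "America, North and Caribbean", "Asia, Northeast",
    "Africa, North and Middle East", "Europe, Eastern and Eurasia", "Asia, Central",
]
derived = sorted({region_to_continent(r) for r in REGIONS})
print(derived)
```

The derived heads are coarser than the Continent column: 'America' is a single value here, while Continent separates North America from South America.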

Out[35]:

saturn.columns["RegionName"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	12
top_value	Asia, South
top_rate	0.2263
cardinality	12
entropy	3.325
entropy_ratio	0.9275

Fig 16.

Top values for RegionName.

Top values for RegionName (12 unique shown, of 12 total).
value	count	share
Asia, South	3707	22.6%
Africa, West and Central	2175	13.3%
Asia, Southeast	1922	11.7%
Australia and Pacific	1535	9.4%
America, Latin	1395	8.5%
Africa, East and Southern	1276	7.8%
Europe, Western	1116	6.8%
America, North and Caribbean	917	5.6%
Asia, Northeast	709	4.3%
Africa, North and Middle East	593	3.6%
Europe, Eastern and Eurasia	577	3.5%
Asia, Central	460	2.8%

ISO3 categorical foreign_key

This column holds three-letter ISO country codes across 238 distinct values with no nulls, consistent with a country dimension key. India (IND) dominates at 13.8% (2,262 rows), followed by PNG, IDN, PAK and CHN, so coverage is weighted toward Asia and the Pacific rather than spread uniformly. The entropy ratio of 0.79 indicates a broad spread with moderate concentration on a few countries.

Treatment: Use as a country join key; consider grouping long-tail codes or stratifying analyses to handle the IND-heavy skew.

anthropic:claude-opus-4-7 · confidence high

Out[38]:

saturn.columns["ISO3"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	238
top_value	IND
top_rate	0.1381
cardinality	238
entropy	6.225
entropy_ratio	0.7885

Fig 17.

Top values for ISO3.

Top values for ISO3 (20 unique shown, of 238 total).
value	count	share
IND	2262	13.8%
PNG	883	5.4%
IDN	788	4.8%
PAK	775	4.7%
CHN	547	3.3%
NGA	535	3.3%
USA	496	3.0%
MEX	333	2.0%
BRA	321	2.0%
CMR	292	1.8%
BGD	278	1.7%
CAN	243	1.5%
COD	231	1.4%
MMR	218	1.3%
AUS	205	1.3%
PHL	200	1.2%
SDN	198	1.2%
NPL	195	1.2%
LAO	184	1.1%
MYS	183	1.1%

LocationInCountry text free_text

Free-text descriptions of where a group is located within a country, mixing geographic prose ('Primarily north', 'Madang province.') with longer ethnographic paragraphs up to 994 characters. Nearly half the rows are null (45.12%) and 13.3% of the non-null rows are duplicate strings, with stock phrases like 'Widespread.' (103) and 'Scattered.' recurring. Although 2,618 entries register as English, small pockets of Spanish (12), Portuguese (10) and several other languages appear (the multilingual alert reports 13 languages in the sample), and Flesch readability averages a difficult 38.1.

Treatment: Normalize boilerplate phrases and tokenize/embed for semantic use; do not treat as a categorical.

anthropic:claude-opus-4-7 · confidence high
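
A minimal sketch of the boilerplate normalization, assuming stock phrases are short strings that differ only in case, spacing, or a trailing period. The function names and thresholds are illustrative.

```python
from collections import Counter

def normalize_phrase(text):
    """Collapse whitespace, drop trailing periods, and casefold; None stays None."""
    if text is None:
        return None
    cleaned = " ".join(text.split()).rstrip(".").casefold()
    return cleaned or None

def stock_phrases(texts, min_count=2, max_words=2):
    """Short normalized phrases that occur at least `min_count` times."""
    counts = Counter(
        phrase
        for phrase in map(normalize_phrase, texts)
        if phrase and len(phrase.split()) <= max_words
    )
    return {phrase: n for phrase, n in counts.items() if n >= min_count}

# Toy rows built from the phrases quoted above.
texts = ["Widespread.", "widespread", "Scattered.", "Madang province.", None, " Widespread. "]
found = stock_phrases(texts)
print(found)
```

Phrases found this way can be mapped to a small flag column, while the remaining longer descriptions go on to tokenization or embedding.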

Out[41]:

saturn.columns["LocationInCountry"].stats

stat	value
n	16,382
nulls	7,392 (45.1%)
unique	7,794
len_min	3
len_max	994
len_mean	108.2
len_median	79
len_p95	314
word_mean	15.48
word_median	11
n_empty	0
n_duplicates	1,196
duplicate_rate	0.133
vocab_size	25,733
readability_flesch_mean	38.13
emoji_rate	0
url_rate	0
one_word_rate	0.02725
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	13 languages detected in sample
alert: null_rate	45.1% null

Fig 18.

Character-length distribution for LocationInCountry.

Character-length distribution for LocationInCountry (mean: 108.2).
chars	count
3 – 28	921
28 – 53	1766
53 – 77	1690
77 – 102	1346
102 – 127	962
127 – 152	632
152 – 176	409
176 – 201	262
201 – 226	170
226 – 251	137
251 – 276	116
276 – 300	81
300 – 325	82
325 – 350	53
350 – 375	54
375 – 399	33
399 – 424	37
424 – 449	48
449 – 474	26
474 – 498	28
498 – 523	23
523 – 548	14
548 – 573	16
573 – 598	19
598 – 622	14
622 – 647	14
647 – 672	8
672 – 697	4
697 – 721	4
721 – 746	5
746 – 771	9
771 – 796	1
796 – 821	1
821 – 845	1
845 – 870	1
870 – 895	0
895 – 920	1
920 – 944	1
944 – 969	0
969 – 994	1

PeopleID1 numeric feature

PeopleID1 is stored as numeric but takes only 16 distinct integer values across 16,382 rows, ranging from 10 to 26 with a median of 20. The bounded range and low cardinality suggest a small grouping code rather than a true continuous measurement or a row key, despite the 'ID' name. The 16 value counts match the AffinityBloc and ROP1 counts exactly (4,178 for the largest, 3,073 for the next), so this looks like the numeric code for AffinityBloc.

Treatment: Cast to categorical and one-hot encode rather than treating as continuous; keep only one of PeopleID1, ROP1, and AffinityBloc to avoid redundant features.

anthropic:claude-opus-4-7 · confidence high
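
The bin counts for PeopleID1 match the AffinityBloc counts elsewhere in this report (for example 4,178 and 3,073), which suggests the two columns encode the same grouping. Matching counts are only circumstantial, so the sketch below checks the mapping row by row. The pairings in the toy data are inferred from those counts and are not confirmed by the report; the function name is illustrative.

```python
def is_one_to_one(pairs):
    """True when each left value maps to exactly one right value and vice versa."""
    forward, backward = {}, {}
    for left, right in pairs:
        if forward.setdefault(left, right) != right:
            return False  # same code seen with two different labels
        if backward.setdefault(right, left) != left:
            return False  # same label seen with two different codes
    return True

# Toy (PeopleID1, AffinityBloc) pairs; the pairing itself is an assumption.
pairs = [
    (21, "South Asian Peoples"),
    (22, "Sub-Saharan Peoples"),
    (21, "South Asian Peoples"),
]
print(is_one_to_one(pairs))
```

If the check passes on the full columns, one of the redundant columns can be dropped without losing information.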

Out[44]:

saturn.columns["PeopleID1"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	16
min	10
max	26
mean	18.58
median	20
std	3.921
q1	16
q3	21
iqr	5
skew	-0.7942
kurtosis	-0.5112
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 19.

Distribution of PeopleID1. Vertical dash marks the median.

Histogram bins for PeopleID1 (median: 20.0).
bin	count
10 – 10.4	634
10.4 – 10.8	0
10.8 – 11.2	402
11.2 – 11.6	0
11.6 – 12	0
12 – 12.4	1593
12.4 – 12.8	0
12.8 – 13.2	200
13.2 – 13.6	0
13.6 – 14	0
14 – 14.4	228
14.4 – 14.8	0
14.8 – 15.2	137
15.2 – 15.6	0
15.6 – 16	0
16 – 16.4	1352
16.4 – 16.8	0
16.8 – 17.2	1031
17.2 – 17.6	0
17.6 – 18	0
18 – 18.4	415
18.4 – 18.8	0
18.8 – 19.2	1588
19.2 – 19.6	0
19.6 – 20	0
20 – 20.4	635
20.4 – 20.8	0
20.8 – 21.2	4178
21.2 – 21.6	0
21.6 – 22	0
22 – 22.4	3073
22.4 – 22.8	0
22.8 – 23.2	453
23.2 – 23.6	0
23.6 – 24	0
24 – 24.4	299
24.4 – 24.8	0
24.8 – 25.2	0
25.2 – 25.6	0
25.6 – 26	164

ROP1 categorical feature

ROP1 is a low-cardinality categorical code with 16 distinct values (all of the form 'A0##'), fully populated across 16382 rows. The distribution is moderately concentrated: 'A012' alone covers 25.5% and 'A013' adds another ~19%, while entropy ratio is 0.83 indicating reasonably even spread among the rest. Its counts match AffinityBloc value for value (A012 with South Asian Peoples at 4,178, A013 with Sub-Saharan Peoples at 3,073), so it reads as the code for the AffinityBloc label rather than an independent attribute.

Treatment: one-hot or target-encode for modelling, using either ROP1 or AffinityBloc but not both.

anthropic:claude-opus-4-7 · confidence high

Out[47]:

saturn.columns["ROP1"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	16
top_value	A012
top_rate	0.255
cardinality	16
entropy	3.322
entropy_ratio	0.8306

Fig 20.

Top values for ROP1.

Top values for ROP1 (16 unique shown, of 16 total).
value	count	share
A012	4178	25.5%
A013	3073	18.8%
A003	1593	9.7%
A010	1588	9.7%
A007	1352	8.3%
A008	1031	6.3%
A011	635	3.9%
A001	634	3.9%
A014	453	2.8%
A009	415	2.5%
A002	402	2.5%
A015	299	1.8%
A005	228	1.4%
A004	200	1.2%
A017	164	1.0%
A006	137	0.8%

AffinityBloc categorical feature

AffinityBloc is a categorical grouping of populations into 16 broad ethno-geographic blocs, with no nulls across 16,382 rows. The distribution is moderately concentrated: "South Asian Peoples" leads at 25.5% (4,178 rows), followed by "Sub-Saharan Peoples" (3,073), while entropy ratio of 0.83 indicates the remaining 14 categories carry meaningful mass. Labels mix regional and ethnolinguistic framings (e.g., "Arab World" alongside "Tibetan-Himalayan Peoples"), which an analyst should note for taxonomy consistency.

Treatment: One-hot or target-encode for modelling; audit label taxonomy for overlap before grouping.

anthropic:claude-opus-4-7 · confidence high
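
The ROP1 and AffinityBloc tables show identical counts per level, which suggests a code/label pair. One way to confirm that before dropping either column is to check that every code maps to exactly one label and vice versa. This is a minimal sketch: `mapping_conflicts` is a hypothetical helper, and the rows are toy values for illustration, not rows read from the file.

```python
from collections import defaultdict


def mapping_conflicts(pairs):
    """Return (codes with >1 label, labels with >1 code) for (code, label) pairs."""
    code_to_labels = defaultdict(set)
    label_to_codes = defaultdict(set)
    for code, label in pairs:
        code_to_labels[code].add(label)
        label_to_codes[label].add(code)
    bad_codes = {c: sorted(v) for c, v in code_to_labels.items() if len(v) > 1}
    bad_labels = {l: sorted(v) for l, v in label_to_codes.items() if len(v) > 1}
    return bad_codes, bad_labels


# Toy (ROP1, AffinityBloc) pairs; the pairing itself is an assumption.
rows = [
    ("A012", "South Asian Peoples"),
    ("A012", "South Asian Peoples"),
    ("A013", "Sub-Saharan Peoples"),
    ("A017", "Deaf"),
]

bad_codes, bad_labels = mapping_conflicts(rows)
print(bad_codes, bad_labels)  # both empty when the mapping is one-to-one
```

The same check applies to other apparent code/label pairs in this file, such as PeopleID2 and PeopleCluster.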

Out[50]:

saturn.columns["AffinityBloc"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	16
top_value	South Asian Peoples
top_rate	0.255
cardinality	16
entropy	3.322
entropy_ratio	0.8306

Fig 21.

Top values for AffinityBloc.

Top values for AffinityBloc (16 unique shown, of 16 total).
value	count	share
South Asian Peoples	4178	25.5%
Sub-Saharan Peoples	3073	18.8%
Eurasian Peoples	1593	9.7%
Pacific Islanders	1588	9.7%
Latin-Caribbean Americans	1352	8.3%
Malay Peoples	1031	6.3%
Southeast Asian Peoples	635	3.9%
Arab World	634	3.9%
Tibetan-Himalayan Peoples	453	2.8%
North American Peoples	415	2.5%
East Asian Peoples	402	2.5%
Turkic Peoples	299	1.8%
Persian-Median	228	1.4%
Horn of Africa Peoples	200	1.2%
Deaf	164	1.0%
Jewish	137	0.8%

PeopleID2 numeric foreign_key

PeopleID2 is a numeric people-identifier code with only 267 distinct values across 16,382 rows, so each id repeats heavily and behaves more like a foreign key than a measurement. Values span 100 to 479 with a fairly flat distribution (kurtosis -1.13, skew 0.25) and no nulls or outliers, consistent with a bounded code rather than a quantity to model.

Treatment: Treat as a categorical key and left-join on it rather than using as a numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[53]:

saturn.columns["PeopleID2"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	267
min	100
max	479
mean	283.7
median	268
std	114.6
q1	183
q3	402
iqr	219
skew	0.2451
kurtosis	-1.126
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 22.

Distribution of PeopleID2. Vertical dash marks the median.

Histogram bins for PeopleID2 (median: 268.0).
bin	count
100 – 109.5	542
109.5 – 119	595
119 – 128.4	283
128.4 – 137.9	144
137.9 – 147.4	411
147.4 – 156.8	247
156.8 – 166.3	876
166.3 – 175.8	611
175.8 – 185.3	502
185.3 – 194.8	309
194.8 – 204.2	303
204.2 – 213.7	258
213.7 – 223.2	280
223.2 – 232.7	334
232.7 – 242.1	400
242.1 – 251.6	1677
251.6 – 261.1	243
261.1 – 270.5	314
270.5 – 280	257
280 – 289.5	478
289.5 – 299	722
299 – 308.4	228
308.4 – 317.9	568
317.9 – 327.4	269
327.4 – 336.9	646
336.9 – 346.4	348
346.4 – 355.8	368
355.8 – 365.3	0
365.3 – 374.8	0
374.8 – 384.2	0
384.2 – 393.7	0
393.7 – 403.2	132
403.2 – 412.7	447
412.7 – 422.1	360
422.1 – 431.6	94
431.6 – 441.1	61
441.1 – 450.6	1015
450.6 – 460.1	219
460.1 – 469.5	1195
469.5 – 479	646

ROP2 categorical feature

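ROP2 is a categorical code field with 214 distinct alphanumeric values (e.g., 'A012', 'C0152') across 16,382 rows and no nulls. The distribution is heavy-tailed: 'A012' alone covers 25.4% of rows (4,169) and the next code, 'C0152', another 7.0%, while the entropy ratio sits at 0.76. Two code families share the column. Several 'C' codes have the same counts as PeopleCluster labels ('C0152' and 'New Guinea' at 1,139, 'C0044' and 'Benue' at 342), which suggests the 'C' codes identify people clusters; 'A012' is also the top ROP1 code, so it may be a bloc-level code standing in where no cluster code is assigned. Both readings are inferred from matching counts and should be confirmed against the source documentation.

Treatment: Group rare codes into an 'other' bucket and target- or one-hot encode the high-frequency levels.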

anthropic:claude-opus-4-7 · confidence high
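
Grouping rare codes before encoding can be sketched as follows. The threshold and the 'other' label are illustrative choices, `bucket_rare` is a hypothetical helper, and the sample values are toy data that only borrow the code format seen above.

```python
from collections import Counter


def bucket_rare(values, min_count, other="other"):
    """Replace every value seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Toy sample: two frequent codes and two singletons.
sample = ["A012"] * 5 + ["C0152"] * 3 + ["C0001", "C0002"]
bucketed = bucket_rare(sample, min_count=3)
print(Counter(bucketed))
```

In a modelling pipeline the counts should come from the training split only, so that the set of retained levels does not leak information from held-out rows.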

Out[56]:

saturn.columns["ROP2"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	214
top_value	A012
top_rate	0.2545
cardinality	214
entropy	5.901
entropy_ratio	0.7622

Fig 23.

Top values for ROP2.

Top values for ROP2 (20 unique shown, of 214 total).
value	count	share
A012	4169	25.4%
C0152	1139	7.0%
C0201	448	2.7%
C0044	342	2.1%
C0147	253	1.5%
C0154	240	1.5%
C0065	226	1.4%
C0062	223	1.4%
C0229	223	1.4%
C0087	182	1.1%
C0252	164	1.0%
C0012	162	1.0%
C0003	161	1.0%
C0059	160	1.0%
C0085	160	1.0%
C0258	159	1.0%
C0079	152	0.9%
C0061	149	0.9%
C0195	142	0.9%
C0015	139	0.8%

PeopleCluster categorical feature

PeopleCluster is a categorical ethnographic grouping with 267 distinct values across 16,382 rows and no nulls. The distribution is broad (entropy ratio 0.86) but with a notable concentration: 'New Guinea' accounts for 6.95% of rows (1,139), followed by 'South Asia Hindu - other' (935) and 'South Asia Muslim - other' (592). The labels mix geographic, religious, and tribal descriptors, so several '... - other' buckets are doing heavy lifting.

Treatment: Group rare clusters and target- or frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[59]:

saturn.columns["PeopleCluster"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	267
top_value	New Guinea
top_rate	0.06953
cardinality	267
entropy	6.929
entropy_ratio	0.8596

Fig 24.

Top values for PeopleCluster.

Top values for PeopleCluster (20 unique shown, of 267 total).
value	count	share
New Guinea	1139	7.0%
South Asia Hindu - other	935	5.7%
South Asia Muslim - other	592	3.6%
South American Indigenous	448	2.7%
South Asia Tribal - other	448	2.7%
South Asia Dalit - other	359	2.2%
Benue	342	2.1%
South Asia Muslim - Pashtun	293	1.8%
Mon-Khmer	253	1.5%
North American Indigenous	240	1.5%
Chinese	226	1.4%
Chadic	223	1.4%
Tibeto-Burman, other	223	1.4%
Gur	182	1.1%
Deaf	164	1.0%
Anglo-Celt	162	1.0%
Adamawa-Ubangi	161	1.0%
Borneo-Kalimantan	160	1.0%
Guinean	160	1.0%
Bantu, Central-Congo	159	1.0%

PeopNameAcrossCountries text foreign_key

This column holds people-group or ethnicity names repeated across countries, with 10,377 unique labels over 16,382 rows and a 36.7% duplicate rate (6,005 repeats). Entries are short (median 8 chars, mean 1.57 words) and 59% are single-word labels like 'Deaf' (164), 'French' (82), or 'British' (81). The frequent fragments '(hindu' (516) and '(muslim' (424) alongside 'traditions)' (1038) suggest religious-tradition qualifiers in parentheses are a common naming convention, and the same group name recurs because it appears in multiple country contexts.

Treatment: Treat as a people-group key; normalize casing/parentheticals and join with country to form a unique grouping key.

anthropic:claude-opus-4-7 · confidence high
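
A grouping key of the kind described above might be built like this. The sketch keeps the parenthetical qualifier as a separate field instead of discarding it, since groups that differ only in the qualifier may be distinct. It assumes ROG3 is the country code column; `split_name` and `group_key` are hypothetical helpers, and the example names are illustrative.

```python
import re

# Matches "Base name (qualifier)" with the qualifier at the end of the string.
_QUALIFIER = re.compile(r"^(.*?)\s*\(([^)]*)\)\s*$")


def split_name(name):
    """Return (base, qualifier), both case-folded; qualifier is None if absent."""
    text = " ".join(name.split())  # collapse runs of whitespace
    match = _QUALIFIER.match(text)
    if match:
        return match.group(1).casefold(), match.group(2).casefold()
    return text.casefold(), None


def group_key(name, country_code):
    """Combine a country code with the normalized name parts into one key."""
    base, qualifier = split_name(name)
    return (country_code.strip().upper(), base, qualifier)


print(split_name("Rajput (Hindu traditions)"))
print(group_key("French", "ca"))
```

Keeping the key as a tuple avoids inventing a separator character that might also occur inside a name.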

Out[62]:

saturn.columns["PeopNameAcrossCountries"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	10,377
len_min	1
len_max	42
len_mean	10.93
len_median	8
len_p95	25
word_mean	1.568
word_median	1
n_empty	0
n_duplicates	6,005
duplicate_rate	0.3666
vocab_size	10,446
readability_flesch_mean	49.45
emoji_rate	0
url_rate	0
one_word_rate	0.5899
allcaps_rate	0
boilerplate_rate	0
alert: one_word	59.0% rows are a single word
alert: duplicates	36.7% duplicate strings

Fig 25.

Character-length distribution for PeopNameAcrossCountries.

Character-length distribution for PeopNameAcrossCountries (mean: 10.93).
chars	count
1 – 2	35
2 – 3	313
3 – 4	1356
4 – 5	1836
5 – 6	2072
6 – 7	1670
7 – 8	1120
8 – 9	682
9 – 10	647
10 – 11	533
11 – 12	629
12 – 13	604
13 – 14	701
14 – 15	630
15 – 16	558
16 – 17	429
17 – 18	275
18 – 19	211
19 – 20	202
20 – 22	164
22 – 23	222
23 – 24	167
24 – 25	274
25 – 26	246
26 – 27	227
27 – 28	134
28 – 29	107
29 – 30	71
30 – 31	50
31 – 32	65
32 – 33	56
33 – 34	38
34 – 35	25
35 – 36	24
36 – 37	1
37 – 38	2
38 – 39	1
39 – 40	4
40 – 41	0
41 – 42	1

Population numeric feature

Population is a count column over 16,382 rows, with 25 nulls (0.2%) and only 1,708 unique values, which suggests many rounded or shared estimates. The distribution is extremely heavy-tailed: the median is 20,000 but the max is 912,955,000, with skew 91.04 and kurtosis 10,050.74, and 15.0% of rows flag as outliers. The mean (499,468) sits far above Q3 (93,000), so a small number of very large people groups dominate any sum or average.

Treatment: Log-transform before any modelling or distance-based analysis, and decide explicitly how to handle the 25 nulls.

anthropic:claude-opus-4-7 · confidence high
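
A minimal sketch of the log transform, assuming a base-10 log (any base works, base 10 just reads as orders of magnitude) and that missing values are carried through as None for a later decision. `log10_population` is a hypothetical helper; the sample uses the min, q1, median, q3 and max from the stats table plus one missing value.

```python
import math


def log10_population(values):
    """Map each population to log10(value); keep None for missing entries."""
    out = []
    for v in values:
        if v is None:
            out.append(None)
        elif v <= 0:
            # zero_rate is 0 in the profile, so this should not trigger here.
            raise ValueError(f"non-positive population: {v!r}")
        else:
            out.append(math.log10(v))
    return out


sample = [10, 4300, 20000, 93000, 912955000, None]
logged = log10_population(sample)
print([None if x is None else round(x, 2) for x in logged])
```

On the log scale the observed range shrinks from eight orders of magnitude to an interval of roughly 1 to 9, which is far easier to bin or plot than the linear histogram, where the first bin holds 16,302 of the rows.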

Out[65]:

saturn.columns["Population"].stats

stat	value
n	16,382
nulls	25 (0.2%)
unique	1,708
min	10
max	9.13e+08
mean	4.995e+05
median	20,000
std	8.066e+06
q1	4,300
q3	93,000
iqr	88,700
skew	91.04
kurtosis	1.005e+04
n_outliers	2,455
outlier_rate	0.1501
zero_rate	0
alert: high_skew	skew=+91.04
alert: outliers	15.0% rows beyond 1.5 IQR

Fig 26.

Distribution of Population. Vertical dash marks the median.

Histogram bins for Population (median: 20000.0).
bin	count
10 – 2.282e+07	16302
2.282e+07 – 4.565e+07	32
4.565e+07 – 6.847e+07	12
6.847e+07 – 9.13e+07	3
9.13e+07 – 1.141e+08	3
1.141e+08 – 1.369e+08	3
1.369e+08 – 1.598e+08	0
1.598e+08 – 1.826e+08	0
1.826e+08 – 2.054e+08	1
2.054e+08 – 2.282e+08	0
2.282e+08 – 2.511e+08	0
2.511e+08 – 2.739e+08	0
2.739e+08 – 2.967e+08	0
2.967e+08 – 3.195e+08	0
3.195e+08 – 3.424e+08	0
3.424e+08 – 3.652e+08	0
3.652e+08 – 3.88e+08	0
3.88e+08 – 4.108e+08	0
4.108e+08 – 4.337e+08	0
4.337e+08 – 4.565e+08	0
4.565e+08 – 4.793e+08	0
4.793e+08 – 5.021e+08	0
5.021e+08 – 5.249e+08	0
5.249e+08 – 5.478e+08	0
5.478e+08 – 5.706e+08	0
5.706e+08 – 5.934e+08	0
5.934e+08 – 6.162e+08	0
6.162e+08 – 6.391e+08	0
6.391e+08 – 6.619e+08	0
6.619e+08 – 6.847e+08	0
6.847e+08 – 7.075e+08	0
7.075e+08 – 7.304e+08	0
7.304e+08 – 7.532e+08	0
7.532e+08 – 7.76e+08	0
7.76e+08 – 7.988e+08	0
7.988e+08 – 8.217e+08	0
8.217e+08 – 8.445e+08	0
8.445e+08 – 8.673e+08	0
8.673e+08 – 8.901e+08	0
8.901e+08 – 9.13e+08	1

Category categorical feature

A 3-level categorical with no nulls across 16,382 rows, encoded as the strings "1", "2", and "3". Class "1" dominates at 53.1% (8,705 rows) and "2" is the minority at 1,360 rows, giving a moderately imbalanced distribution (entropy ratio 0.83). The numeric-string labels suggest an ordinal or coded category whose meaning is not self-evident from the values alone.

Treatment: One-hot or ordinal encode; consider class-imbalance handling if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[68]:

saturn.columns["Category"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	3
top_value	1
top_rate	0.5314
cardinality	3
entropy	1.313
entropy_ratio	0.8284

Fig 27.

Top values for Category.

Top values for Category (3 unique shown, of 3 total).
value	count	share
1	8705	53.1%
3	6317	38.6%
2	1360	8.3%

ROL3 text feature

ROL3 holds three-letter ISO 639-3 language codes (every value is exactly 3 characters and one word), with hin, eng, and ben the most frequent. The column is high-cardinality, with 6,164 distinct codes across 16,382 rows and a 62.4% duplicate rate, plus 176 'xxx' entries that likely flag an undetermined or missing language.

Treatment: Treat as a categorical language code; one-hot or target-encode the top codes, bucket the long tail as 'other', and keep 'xxx' as its own 'unknown' level so missing languages are not confused with rare ones.

anthropic:claude-opus-4-7 · confidence high

Out[71]:

saturn.columns["ROL3"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	6,164
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	10,218
duplicate_rate	0.6237
vocab_size	6,163
readability_flesch_mean	117.4
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.0001221
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	62.4% duplicate strings

Fig 28.

Character-length distribution for ROL3.

Character-length distribution for ROL3 (mean: 3.0).
chars	count
3	16382

PrimaryLanguageName text feature

Holds the primary language name for each record, predominantly single-token entries (one_word_rate 0.73, word_mean 1.32), with Hindi (682), English (424) and Bengali (366) leading. The duplicate_rate of 0.62 is expected for a language label, and the 6,153 unique names track the 6,164 unique ROL3 codes closely, so names and codes appear to map almost one to one. Commas in top_words ('arabic,' and 'punjabi,') most likely reflect inverted names of the kind seen in OfficialLang ('Arabic, Standard'), not lists of several languages, and lengths up to 45 characters are consistent with such qualified names. 176 rows are explicitly 'Language unknown', matching the 176 'xxx' codes in ROL3.

Treatment: Normalize casing and whitespace, keep comma-qualified names intact, and prefer ROL3 as the join key; verify on raw values that no entry lists several languages before considering a multi-label split.

anthropic:claude-opus-4-7 · confidence high

Out[74]:

saturn.columns["PrimaryLanguageName"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	6,153
len_min	1
len_max	45
len_mean	9.081
len_median	7
len_p95	19
word_mean	1.322
word_median	1
n_empty	0
n_duplicates	10,229
duplicate_rate	0.6244
vocab_size	6,251
readability_flesch_mean	42.67
emoji_rate	0
url_rate	0
one_word_rate	0.7306
allcaps_rate	0
boilerplate_rate	0
alert: one_word	73.1% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	62.4% duplicate strings

Fig 29.

Character-length distribution for PrimaryLanguageName.

Character-length distribution for PrimaryLanguageName (mean: 9.08).
chars	count
1 – 2	49
2 – 3	295
3 – 4	1430
4 – 5	2556
5 – 6	2134
6 – 8	2957
8 – 9	1139
9 – 10	677
10 – 11	597
11 – 12	311
12 – 13	698
13 – 14	387
14 – 15	372
15 – 16	1263
16 – 18	535
18 – 19	160
19 – 20	119
20 – 21	139
21 – 22	85
22 – 23	72
23 – 24	175
24 – 25	36
25 – 26	41
26 – 27	30
27 – 29	23
29 – 30	12
30 – 31	11
31 – 32	31
32 – 33	16
33 – 34	8
34 – 35	1
35 – 36	1
36 – 37	1
37 – 38	2
38 – 40	9
40 – 41	5
41 – 42	4
42 – 43	0
43 – 44	0
44 – 45	1

PrimaryLanguageDialect categorical metadata

This column records a primary language dialect per record, with 980 distinct values across 16,382 rows. It is 92.3% null, and even the most common value, 'Brazilian Portuguese', accounts for just 3.25% of non-nulls (41 occurrences); entropy ratio of 0.967 confirms an extremely flat, long-tailed distribution spanning dialects like Assyrian, Punjabi, Sinhalese, and Ta'izzi. The combination of sparse coverage and high cardinality limits its standalone modelling value.

Treatment: Group into language families or a coarse bucket plus 'missing' indicator before encoding.

anthropic:claude-opus-4-7 · confidence high

Out[77]:

saturn.columns["PrimaryLanguageDialect"].stats

stat	value
n	16,382
nulls	15,121 (92.3%)
unique	980
top_value	Brazilian Portuguese
top_rate	0.03251
cardinality	980
entropy	9.613
entropy_ratio	0.9674
alert: long_tail	850 singleton categories
alert: null_rate	92.3% null

Fig 30.

Top values for PrimaryLanguageDialect.

Top values for PrimaryLanguageDialect (20 unique shown, of 980 total).
value	count	share
Brazilian Portuguese	41	0.3%
Assyrian	15	0.1%
Punjabi	14	0.1%
Sinhalese	10	0.1%
Ta'izzi	8	0.0%
Guipuzcoan	8	0.0%
Moldavian	7	0.0%
Kalderash	6	0.0%
Siripuria	6	0.0%
Tangsa	6	0.0%
Wasulunkakan	6	0.0%
Nyanja	6	0.0%
Orange River	5	0.0%
Pomak	5	0.0%
Hui	4	0.0%
Central	4	0.0%
Vixlin	4	0.0%
Malagasy	4	0.0%
Unga	4	0.0%
Western Sudanese	3	0.0%

NumberLanguagesSpoken numeric feature

Count of languages spoken, with 16,382 non-null integer values ranging from 1 to 145 and a median of 1. The distribution is severely right-skewed (skew 7.44, kurtosis 83.76): Q1 is 1 and Q3 is 2, so a 1.5 IQR rule flags every value above 3.5, which is why 2,410 rows (14.7%) count as outliers. Each row describes a people group, not a person, so large counts are not impossible, but the max of 145 is extreme enough to check against the source.

Treatment: Cap or log-transform before modelling, and check the rows near the 145 maximum against the source before treating them as errors.

anthropic:claude-opus-4-7 · confidence high
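
The outlier alerts read "rows beyond 1.5 IQR", which is assumed here to mean Tukey's fences (q1 - 1.5 * iqr, q3 + 1.5 * iqr); Saturn's exact rule is not stated in the report. The sketch computes the fences from the quartiles in the stats tables and shows a simple cap. `tukey_fences` and `cap` are hypothetical helpers, and the cap of 10 is an arbitrary illustration.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap(values, upper):
    """Clip values above `upper`; values at or below it are unchanged."""
    return [min(v, upper) for v in values]


# Quartiles copied from the stats tables.
print(tukey_fences(1, 2))         # NumberLanguagesSpoken
print(tukey_fences(4300, 93000))  # Population

print(cap([1, 2, 7, 145], upper=10))
```

With an IQR of 1, any group with four or more languages falls outside the upper fence, so the 14.7% outlier rate reflects how narrow the interquartile range is more than it reflects bad data.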

Out[80]:

saturn.columns["NumberLanguagesSpoken"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	78
min	1
max	145
mean	2.764
median	1
std	5.985
q1	1
q3	2
iqr	1
skew	7.437
kurtosis	83.76
n_outliers	2,410
outlier_rate	0.1471
zero_rate	0
alert: high_skew	skew=+7.44
alert: outliers	14.7% rows beyond 1.5 IQR

Fig 31.

Distribution of NumberLanguagesSpoken. Vertical dash marks the median.

Histogram bins for NumberLanguagesSpoken (median: 1.0).
bin	count
1 – 4.6	14386
4.6 – 8.2	950
8.2 – 11.8	343
11.8 – 15.4	232
15.4 – 19	99
19 – 22.6	80
22.6 – 26.2	80
26.2 – 29.8	41
29.8 – 33.4	35
33.4 – 37	18
37 – 40.6	26
40.6 – 44.2	20
44.2 – 47.8	16
47.8 – 51.4	10
51.4 – 55	5
55 – 58.6	8
58.6 – 62.2	10
62.2 – 65.8	4
65.8 – 69.4	5
69.4 – 73	1
73 – 76.6	3
76.6 – 80.2	1
80.2 – 83.8	1
83.8 – 87.4	2
87.4 – 91	1
91 – 94.6	1
94.6 – 98.2	0
98.2 – 101.8	0
101.8 – 105.4	2
105.4 – 109	0
109 – 112.6	0
112.6 – 116.2	0
116.2 – 119.8	0
119.8 – 123.4	1
123.4 – 127	0
127 – 130.6	0
130.6 – 134.2	0
134.2 – 137.8	0
137.8 – 141.4	0
141.4 – 145	1

OfficialLang categorical feature

OfficialLang is a categorical column listing the official language of each record, with 87 distinct values across 16,382 rows and almost no nulls (0.05%). English dominates at 22.4% (3,672), followed by Hindi (2,262) and French (1,478), giving a moderately concentrated distribution (entropy ratio 0.68). The presence of compound labels like 'Arabic, Standard' and 'Chinese, Mandarin' suggests a specific naming convention worth preserving when joining to language references.

Treatment: Group long tail and one-hot or target-encode the top categories before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[83]:

saturn.columns["OfficialLang"].stats

stat	value
n	16,382
nulls	8 (0.0%)
unique	87
top_value	English
top_rate	0.2243
cardinality	87
entropy	4.368
entropy_ratio	0.6779

Fig 32.

Top values for OfficialLang.

Top values for OfficialLang (20 unique shown, of 87 total).
value	count	share
English	3672	22.4%
Hindi	2262	13.8%
French	1478	9.0%
Spanish	1121	6.8%
Arabic, Standard	891	5.4%
Indonesian	788	4.8%
Urdu	775	4.7%
Chinese, Mandarin	638	3.9%
Portuguese	529	3.2%
Bengali	278	1.7%
Burmese	218	1.3%
Malay	207	1.3%
Tagalog	200	1.2%
Nepali	195	1.2%
Lao	184	1.1%
Russian	171	1.0%
German, Standard	158	1.0%
Swahili	154	0.9%
Sinhala	139	0.8%
Amharic	120	0.7%

SpeakNationalLang unknown other

Column 'SpeakNationalLang' was skipped by the profiler, so no type inference, uniqueness count, or value statistics are available. The only confirmed signals are 16,382 rows with a 0.0 null rate. The name suggests a flag or category indicating whether a people group speaks the national language, but this cannot be verified from the evidence.

Treatment: Re-profile or manually inspect to determine type before any downstream use.

anthropic:claude-opus-4-7 · confidence low

Out[86]:

saturn.columns["SpeakNationalLang"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

BibleStatus numeric feature

BibleStatus is an integer-coded categorical with only 6 distinct values spanning 0 to 5 across 16,382 complete rows. The distribution is left-skewed (skew -1.22) with a mean of 3.86 and median of 4: 47.6% of rows (7,790) sit at the top code 5, while about 4.8% sit at zero. Despite being stored as numeric, the small cardinality and bounded range suggest an ordinal status code, not a true measurement.

Treatment: Treat as an ordinal category (one-hot or ordered encode) rather than a continuous numeric.

anthropic:claude-opus-4-7 · confidence high

Out[88]:

saturn.columns["BibleStatus"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	6
min	0
max	5
mean	3.862
median	4
std	1.429
q1	3
q3	5
iqr	2
skew	-1.219
kurtosis	0.5856
n_outliers	0
outlier_rate	0
zero_rate	0.04786

Fig 33.

Distribution of BibleStatus. Vertical dash marks the median.

Histogram bins for BibleStatus (median: 4.0).
bin	count
0 – 0.125	784
0.125 – 0.25	0
0.25 – 0.375	0
0.375 – 0.5	0
0.5 – 0.625	0
0.625 – 0.75	0
0.75 – 0.875	0
0.875 – 1	0
1 – 1.125	555
1.125 – 1.25	0
1.25 – 1.375	0
1.375 – 1.5	0
1.5 – 1.625	0
1.625 – 1.75	0
1.75 – 1.875	0
1.875 – 2	0
2 – 2.125	1597
2.125 – 2.25	0
2.25 – 2.375	0
2.375 – 2.5	0
2.5 – 2.625	0
2.625 – 2.75	0
2.75 – 2.875	0
2.875 – 3	0
3 – 3.125	2055
3.125 – 3.25	0
3.25 – 3.375	0
3.375 – 3.5	0
3.5 – 3.625	0
3.625 – 3.75	0
3.75 – 3.875	0
3.875 – 4	0
4 – 4.125	3601
4.125 – 4.25	0
4.25 – 4.375	0
4.375 – 4.5	0
4.5 – 4.625	0
4.625 – 4.75	0
4.75 – 4.875	0
4.875 – 5	7790

BibleYear categorical metadata

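BibleYear appears to encode a translation's publication or revision span, typically formatted as a start-end year range like "1818-2022", with single years (e.g. "1954") appearing as a minority pattern. Cardinality is high (466 distinct values across 16,382 rows) and the most common range covers only 8.75% of non-null values (682 rows, or 4.2% of all rows); with an entropy ratio of 0.776 the remaining mass is spread over many ranges. Notably, 52.45% of values are null, which the alert flags and which will limit any direct use. The leading counts (682, 424, 366) equal the Hindi, English, and Bengali counts in PrimaryLanguageName, which suggests the value is an attribute of the language and not of the individual people group, and the 7,790 non-null rows equal the number of rows at BibleStatus 5.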

Treatment: Parse into separate start_year and end_year integer features and add a missingness indicator before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[91]:

saturn.columns["BibleYear"].stats

stat	value
n	16,382
nulls	8,592 (52.4%)
unique	466
top_value	1818-2022
top_rate	0.08755
cardinality	466
entropy	6.883
entropy_ratio	0.7765
alert: null_rate	52.4% null

Fig 34.

Top values for BibleYear.

Top values for BibleYear (20 unique shown, of 466 total).
value	count	share
1818-2022	682	4.2%
1382-2020	424	2.6%
1809-2022	366	2.2%
1553-2020	308	1.9%
1727-2024	221	1.3%
1954	195	1.2%
1751-2025	173	1.1%
1843-2022	167	1.0%
1815-2021	164	1.0%
1823-2021	160	1.0%
1854-2022	148	0.9%
1895-2020	146	0.9%
1874-2018	136	0.8%
1959-2021	134	0.8%
1478-2020	112	0.7%
1841-2022	109	0.7%
1821-2024	98	0.6%
1827-2024	92	0.6%
1875-2024	90	0.5%
2008	87	0.5%

NTYear text metadata

NTYear is a year or year-range metadata field (e.g. '1811-1998', '1380-2011') stored as short single-token strings, with 1,072 unique values across 16,382 rows. The column mixes three shapes: 8,778 nine-character values consistent with 'YYYY-YYYY', 1,943 four-character values consistent with a single year, and the sentinel 'Yes' in 670 rows. It is 30.47% null, with a 90.59% duplicate rate among the non-null values. The allcaps alert (94.1%) is an artifact: it matches the share of digit-only strings among non-null rows, (8,778 + 1,943) / 11,391.

Treatment: Parse into start_year/end_year integer columns, isolate the 'Yes' sentinel into a separate flag, and keep the 30% nulls as an explicit missing indicator unless the source confirms what they mean.

anthropic:claude-opus-4-7 · confidence medium
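
The parsing step could look like the sketch below, which also fits BibleYear and PortionsYear since they share the same shapes. It assumes only three non-null forms occur ('YYYY-YYYY', 'YYYY', and 'Yes') and that a single year means start and end coincide; both are assumptions drawn from the length histogram and the quoted examples, and `parse_year_field` is a hypothetical helper. Anything else raises so that unexpected values surface instead of being silently dropped.

```python
import re

_RANGE = re.compile(r"^(\d{4})-(\d{4})$")
_YEAR = re.compile(r"^\d{4}$")


def parse_year_field(value):
    """Return (start_year, end_year, flag_only) for one raw cell."""
    if value is None:
        return None, None, False
    text = value.strip()
    match = _RANGE.match(text)
    if match:
        return int(match.group(1)), int(match.group(2)), False
    if _YEAR.match(text):
        year = int(text)
        return year, year, False
    if text.casefold() == "yes":
        # Sentinel: something exists, but no year is recorded.
        return None, None, True
    raise ValueError(f"unrecognised year value: {value!r}")


for raw in ["1811-1998", "1954", "Yes", None]:
    print(raw, parse_year_field(raw))
```

A null cell and a 'Yes' cell both yield no years, so the third element is what keeps them apart downstream.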

Out[94]:

saturn.columns["NTYear"].stats

stat	value
n	16,382
nulls	4,991 (30.5%)
unique	1,072
len_min	3
len_max	9
len_mean	7.794
len_median	9
len_p95	9
word_mean	1
word_median	1
n_empty	0
n_duplicates	10,319
duplicate_rate	0.9059
vocab_size	1,072
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.9412
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	94.1% rows are all-caps
alert: null_rate	30.5% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	90.6% duplicate strings

Fig 35.

Character-length distribution for NTYear.

Character-length distribution for NTYear (mean: 7.79).
chars	count
3	670
4	1943
9	8778

PortionsYear text feature

PortionsYear is a single-token field that mostly encodes year ranges (e.g. '1806-1962', '1530-1995') in strings of up to 9 characters, but it also carries a large 'Yes' bucket (1,520 rows) that breaks the type, plus 1,954 four-character values consistent with single years. Nulls run at 17.92% and duplicate_rate is 0.87 across 1,737 unique values out of 16,382 rows, so the column is highly repetitive. The mix of a boolean-like 'Yes' with hyphenated year spans suggests two different concepts were merged into one column.

Treatment: Split into two fields: parse year ranges into start/end integers and isolate the 'Yes' values into a separate boolean before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[97]:

saturn.columns["PortionsYear"].stats

stat	value
n	16,382
nulls	2,936 (17.9%)
unique	1,737
len_min	3
len_max	9
len_mean	7.595
len_median	9
len_p95	9
word_mean	1
word_median	1
n_empty	0
n_duplicates	11,709
duplicate_rate	0.8708
vocab_size	1,737
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.887
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	88.7% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	87.1% duplicate strings

Fig 36.

Character-length distribution for PortionsYear.

Character-length distribution for PortionsYear (mean: 7.60).
chars	count
3	1520
4	1954
9	9972

TranslationNeedQuestionable unknown other

Column 'TranslationNeedQuestionable' was skipped by the profiler, so its kind, cardinality and value distribution are unknown. The only confirmed signals are 16382 rows with a 0.0 null rate. The name suggests a boolean or flag indicating uncertainty about translation need, but this cannot be verified from the evidence.

Treatment: Re-profile or inspect raw values before deciding on use; do not model until kind is resolved.

anthropic:claude-opus-4-7 · confidence low

Out[100]:

saturn.columns["TranslationNeedQuestionable"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

JPScale numeric feature

JPScale is an integer-valued ordinal feature spanning 1 to 5 with only 5 unique values across 16,382 rows and no nulls. The distribution is U-shaped, not flat: 43.5% of rows (7,124) sit at 1, the middle codes 2 and 3 hold 1,009 and 1,413 rows, and codes 4 and 5 hold 3,636 and 3,200, which is what the strongly negative kurtosis (-1.66) reflects. The mean is 2.68 and the median 3. This is the Joshua Project 'reachedness' rating, a coded scale and not a continuous measurement. No outliers and no zeros are present.

Treatment: Treat as an ordinal categorical (1-5), not as continuous; one-hot encode or keep as an ordered integer.

anthropic:claude-opus-4-7 · confidence high

Out[102]:

saturn.columns["JPScale"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	5
min	1
max	5
mean	2.681
median	3
std	1.644
q1	1
q3	4
iqr	3
skew	0.1937
kurtosis	-1.658
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 37.

Distribution of JPScale. Vertical dash marks the median.

Histogram bins for JPScale (median: 3.0).
bin	count
1 – 1.1	7124
1.1 – 1.2	0
1.2 – 1.3	0
1.3 – 1.4	0
1.4 – 1.5	0
1.5 – 1.6	0
1.6 – 1.7	0
1.7 – 1.8	0
1.8 – 1.9	0
1.9 – 2	0
2 – 2.1	1009
2.1 – 2.2	0
2.2 – 2.3	0
2.3 – 2.4	0
2.4 – 2.5	0
2.5 – 2.6	0
2.6 – 2.7	0
2.7 – 2.8	0
2.8 – 2.9	0
2.9 – 3	0
3 – 3.1	1413
3.1 – 3.2	0
3.2 – 3.3	0
3.3 – 3.4	0
3.4 – 3.5	0
3.5 – 3.6	0
3.6 – 3.7	0
3.7 – 3.8	0
3.8 – 3.9	0
3.9 – 4	0
4 – 4.1	3636
4.1 – 4.2	0
4.2 – 4.3	0
4.3 – 4.4	0
4.4 – 4.5	0
4.5 – 4.6	0
4.6 – 4.7	0
4.7 – 4.8	0
4.8 – 4.9	0
4.9 – 5	3200

JPScalePC categorical feature

JPScalePC is a 5-level categorical holding the values "1" through "5" as strings, with no nulls across 16,382 rows. The name marks it as a variant of the JPScale rating, though the meaning of the PC suffix is not evident from the profile. The distribution is concentrated away from the middle: "5" leads at 33.8% and "1" follows at 31.7%, while the middle codes "2" and "3" together account for only 10.6%. These are coded ratings of people groups, so the thin middle describes the scale, not polarised survey responses. An entropy ratio of 0.86 confirms the spread is wide but not uniform.

Treatment: Cast to integer and treat as ordinal (1-5), or one-hot encode, depending on the model.

anthropic:claude-opus-4-7 · confidence high

Out[105]:

saturn.columns["JPScalePC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	5
top_value	5
top_rate	0.3381
cardinality	5
entropy	1.997
entropy_ratio	0.8602

Fig 38.

Top values for JPScalePC.

Top values for JPScalePC (5 unique shown, of 5 total).
value	count	share
5	5539	33.8%
1	5189	31.7%
4	3914	23.9%
3	927	5.7%
2	813	5.0%

JPScalePGAC categorical feature

JPScalePGAC is a low-cardinality categorical with 5 distinct string-encoded levels ('1' through '5') across 16,382 rows and no nulls, consistent with an ordinal scale. The name marks it as a variant of the JPScale rating; PGAC probably stands for people group across countries, by analogy with PeopNameAcrossCountries, but the profile does not confirm this. The distribution is uneven: '1' dominates at 43.3% (7,091 rows) while '2' is the rarest at 908 rows, and the entropy ratio of 0.86 shows the remaining mass is spread broadly. Frequencies run 1 > 4 > 5 > 3 > 2, the same U shape seen in JPScale, so the thin middle looks like a property of the scale and not a data-quality problem.

Treatment: Treat as ordinal: cast to integer and preserve order, or one-hot encode if the downstream model is non-ordinal.

anthropic:claude-opus-4-7 · confidence high

Out[108]:

saturn.columns["JPScalePGAC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	5
top_value	1
top_rate	0.4329
cardinality	5
entropy	1.998
entropy_ratio	0.8603

Fig 39.

Top values for JPScalePGAC.

Top values for JPScalePGAC (5 unique shown, of 5 total).
value	count	share
1	7091	43.3%
4	3587	21.9%
5	3520	21.5%
3	1276	7.8%
2	908	5.5%

LeastReached categorical feature

Binary Y/N flag named LeastReached, fully populated across 16,382 rows with only 2 distinct values. The split is fairly balanced: 'N' leads at 56.5% (9,258) versus 7,124 'Y', yielding a near-maximal entropy ratio of 0.988. The 'Y' count equals the number of rows at JPScale 1 (7,124), so the flag is probably derived from that scale. No nulls or anomalies are present.

Treatment: Encode as a 0/1 boolean for modelling.

anthropic:claude-opus-4-7 · confidence high
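
The 0/1 encoding, plus a check of whether the flag is redundant with the scale, can be sketched as below. The rule "LeastReached is 'Y' exactly when JPScale is 1" is a hypothesis suggested by the matching counts (7,124 in both tables), not something the profile confirms; `yn_to_int` and `disagreements` are hypothetical helpers and the rows are toy values.

```python
def yn_to_int(value):
    """Map 'Y' to 1 and 'N' to 0; reject anything else."""
    mapping = {"Y": 1, "N": 0}
    if value not in mapping:
        raise ValueError(f"expected 'Y' or 'N', got {value!r}")
    return mapping[value]


def disagreements(flags, scales, threshold=1):
    """Return row indices where the flag differs from (scale <= threshold)."""
    return [
        i
        for i, (flag, scale) in enumerate(zip(flags, scales))
        if yn_to_int(flag) != int(scale <= threshold)
    ]


# Toy rows: the last two break the hypothesised rule on purpose.
flags = ["Y", "N", "N", "Y"]
scales = [1, 4, 1, 2]
print(disagreements(flags, scales))
```

An empty result on the real columns would mean the flag adds no information beyond the scale, and one of the two can be dropped from a model.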

Out[111]:

saturn.columns["LeastReached"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	N
top_rate	0.5651
cardinality	2
entropy	0.9877
entropy_ratio	0.9877

Fig 40.

Top values for LeastReached.

Top values for LeastReached (2 unique shown, of 2 total).
value	count	share
N	9258	56.5%
Y	7124	43.5%

LeastReachedPC categorical feature

A binary Y/N flag named LeastReachedPC, presumably the LeastReached flag computed under the 'PC' variant of the scale (compare JPScalePC), though the profile does not define it. The split is moderately imbalanced at 67.3% N (11,029) versus 32.7% Y (5,353), with no nulls across 16,382 rows and an entropy ratio of 0.91 showing both classes are well represented.

Treatment: Encode as a 0/1 indicator for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[114]:

saturn.columns["LeastReachedPC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	N
top_rate	0.6732
cardinality	2
entropy	0.9116
entropy_ratio	0.9116

Fig 41.

Top values for LeastReachedPC.

Top values for LeastReachedPC (2 unique shown, of 2 total).
value	count	share
N	11029	67.3%
Y	5353	32.7%

LeastReachedPGAC categorical feature

Binary Y/N flag named LeastReachedPGAC, with no missing values across 16,382 rows. The split is fairly balanced: 'N' leads at 56.7% (9,291) versus 7,091 'Y', giving a near-maximal entropy ratio of 0.987. The 'Y' count equals the '1' count in JPScalePGAC (7,091), so the flag is probably derived from that scale.

Treatment: Encode as a 0/1 indicator for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[117]:

saturn.columns["LeastReachedPGAC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	N
top_rate	0.5671
cardinality	2
entropy	0.987
entropy_ratio	0.987

Fig 42.

Top values for LeastReachedPGAC.

Top values for LeastReachedPGAC (2 unique shown, of 2 total).
value	count	share
N	9291	56.7%
Y	7091	43.3%

GSEC categorical feature

GSEC is a low-cardinality categorical with 8 distinct values across 16,382 rows and no nulls. The dominant value is the empty string at 40.0% (6,553 rows), followed by '1' at 4,852; the remaining codes ('0','2','3','4','5','6') split the rest, suggesting a coded classification where blanks likely encode 'not applicable' or missing-as-empty. Entropy ratio of 0.732 indicates moderate spread despite the empty-string plurality.

Treatment: Recode the empty string as an explicit missing category and one-hot encode the remaining codes.

anthropic:claude-opus-4-7 · confidence high
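
Recoding the blank level before one-hot encoding might look like this. The label 'missing' is an arbitrary choice, `recode_blank` and `one_hot` are hypothetical helpers, and the sample values are toy data using the code format shown in the table.

```python
def recode_blank(value, missing_label="missing"):
    """Turn None, empty, or whitespace-only strings into an explicit label."""
    if value is None or value.strip() == "":
        return missing_label
    return value.strip()


def one_hot(values):
    """Return (sorted levels, one indicator row per value)."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


raw = ["", "1", " ", "6", None]
recoded = [recode_blank(v) for v in raw]
levels, rows = one_hot(recoded)
print(levels)
print(rows)
```

Making the blank an explicit level keeps the 40% of rows without a code visible to the model instead of letting an encoder drop or mis-handle them.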

Out[120]:

saturn.columns["GSEC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	8
top_value	(empty string)
top_rate	0.4
cardinality	8
entropy	2.197
entropy_ratio	0.7325

Fig 43.

Top values for GSEC.

Top values for GSEC (8 unique shown, of 8 total).
value	count	share
(empty string)	6553	40.0%
1	4852	29.6%
6	1676	10.2%
4	1435	8.8%
5	1337	8.2%
0	238	1.5%
2	167	1.0%
3	124	0.8%
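
A minimal sketch of the GSEC treatment, assuming pandas is available and using a toy stand-in for the column. The label "missing" is illustrative and is not a value defined by the dataset.

```python
import pandas as pd

# Toy stand-in for the GSEC column; real values would come from the parquet file.
gsec = pd.Series(["", "1", "6", "", "4", "1", "0"], name="GSEC")

# Recode the empty string as an explicit level so it survives one-hot encoding.
# "missing" is a hypothetical label chosen for this sketch.
gsec_clean = gsec.replace("", "missing")

# One indicator column per code, including the explicit "missing" level.
gsec_onehot = pd.get_dummies(gsec_clean, prefix="GSEC", dtype=int)
print(gsec_onehot.columns.tolist())
```

Keeping the blank as its own level avoids silently merging 40% of the rows into another code.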

HasAudioRecordings categorical feature

Binary Y/N flag indicating whether a record has associated audio recordings, fully populated across 16382 rows. The distribution is imbalanced: 'Y' covers 82.3% (13479) versus 2903 'N', with entropy ratio 0.67.

Treatment: Encode as a 0/1 boolean indicator for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[123]:

saturn.columns["HasAudioRecordings"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.8228
cardinality	2
entropy	0.6739
entropy_ratio	0.6739

Fig 44.

Top values for HasAudioRecordings.

Top values for HasAudioRecordings (2 unique shown, of 2 total).
value	count	share
Y	13479	82.3%
N	2903	17.7%

NTOnline categorical feature

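NTOnline is a categorical flag with only one observed value, 'Y', across all 11,705 non-null rows; the remaining 28.55% of rows (4,677) are null. Taken at face value the column is constant where present, so any signal lies in the null-versus-'Y' split: the nulls may simply mean the resource is absent rather than unrecorded, which the profile cannot confirm.

Treatment: Check the source's field definition; if null means 'no', recode as a 0/1 presence indicator, otherwise drop as a zero-variance column.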

anthropic:claude-opus-4-7 · confidence high

Out[126]:

saturn.columns["NTOnline"].stats

stat	value
n	16,382
nulls	4,677 (28.5%)
unique	1
top_value	Y
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	28.5% null
alert: imbalance	top value is 100.0% of rows

Fig 45.

Top values for NTOnline.

Top values for NTOnline (1 unique shown, of 1 total).
value	count	share
Y	11705	71.5%
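
If the nulls in NTOnline turn out to mean "absent" rather than "unrecorded" (an assumption that needs checking against the source's field definition), the column can be kept as a presence indicator instead of being dropped. A small sketch with toy values:

```python
import math

def presence_flag(value):
    """Return 1 when a value is recorded and 0 when it is null (None or NaN)."""
    is_null = value is None or (isinstance(value, float) and math.isnan(value))
    return 0 if is_null else 1

# Toy values standing in for NTOnline; the real column has 11,705 'Y' and 4,677 nulls.
nt_online = ["Y", None, "Y", float("nan"), "Y"]
nt_flag = [presence_flag(v) for v in nt_online]
print(nt_flag)  # [1, 0, 1, 0, 1]
```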

RLG3 numeric feature

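RLG3 is a small-integer code ranging from 1 to 9 with only 8 distinct values across 16,382 rows and no nulls. The distribution is broad and flat (kurtosis -1.37, skew 0.13, IQR spanning 1 to 6) with mean 3.47 and median 4, and no outliers. The histogram shows that code 3 never occurs. The per-code counts (6,459 at 1, 635 at 2, 2,651 at 4, 2,338 at 5, 3,786 at 6, 200 at 7, 124 at 8, 189 at 9) match the PrimaryReligion value counts exactly, so RLG3 is almost certainly a numeric key for that label rather than a rating, and its mean and skew carry no meaning.

Treatment: Treat as a nominal categorical that duplicates PrimaryReligion; keep one of the two and one-hot encode it rather than using the code as a number.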

anthropic:claude-opus-4-7 · confidence high

Out[129]:

saturn.columns["RLG3"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	8
min	1
max	9
mean	3.469
median	4
std	2.238
q1	1
q3	6
iqr	5
skew	0.1265
kurtosis	-1.366
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 46.

Distribution of RLG3. Vertical dash marks the median.

Histogram bins for RLG3 (median: 4.0).
bin	count
1 – 1.2	6459
1.2 – 1.4	0
1.4 – 1.6	0
1.6 – 1.8	0
1.8 – 2	0
2 – 2.2	635
2.2 – 2.4	0
2.4 – 2.6	0
2.6 – 2.8	0
2.8 – 3	0
3 – 3.2	0
3.2 – 3.4	0
3.4 – 3.6	0
3.6 – 3.8	0
3.8 – 4	0
4 – 4.2	2651
4.2 – 4.4	0
4.4 – 4.6	0
4.6 – 4.8	0
4.8 – 5	0
5 – 5.2	2338
5.2 – 5.4	0
5.4 – 5.6	0
5.6 – 5.8	0
5.8 – 6	0
6 – 6.2	3786
6.2 – 6.4	0
6.4 – 6.6	0
6.6 – 6.8	0
6.8 – 7	0
7 – 7.2	200
7.2 – 7.4	0
7.4 – 7.6	0
7.6 – 7.8	0
7.8 – 8	0
8 – 8.2	124
8.2 – 8.4	0
8.4 – 8.6	0
8.6 – 8.8	0
8.8 – 9	189
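
The bin counts above line up with the value counts reported for PrimaryReligion (6,459 at code 1 and 6,459 'Christianity', for instance), which suggests RLG3 may be a numeric key for that label. A sketch of how to check this on the raw rows, assuming the two columns are read as (code, label) pairs. The sample pairs are illustrative, and the code-to-label assignments in them are inferred from matching counts rather than documented.

```python
from collections import defaultdict

def is_one_to_one(pairs):
    """True when every code maps to exactly one label and vice versa."""
    labels_by_code = defaultdict(set)
    codes_by_label = defaultdict(set)
    for code, label in pairs:
        labels_by_code[code].add(label)
        codes_by_label[label].add(code)
    return (all(len(s) == 1 for s in labels_by_code.values())
            and all(len(s) == 1 for s in codes_by_label.values()))

# Illustrative rows only (hypothetical sample, not read from the file).
consistent = [(1, "Christianity"), (6, "Islam"), (1, "Christianity"), (5, "Hinduism")]
conflicting = consistent + [(6, "Hinduism")]

print(is_one_to_one(consistent))   # True
print(is_one_to_one(conflicting))  # False
```

If the check passes on the full column pair, one of the two columns is redundant for modelling.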

RLG3PC numeric feature

RLG3PC is an integer-valued column with only 8 distinct values bounded between 1 and 9, no nulls, and no zeros. The flat distribution (kurtosis -1.47, IQR spanning 1 to 6) and small cardinality point to a category code rather than a continuous measurement; the mean of 3.21 sits above the median of 2, with mild positive skew (0.31). The per-code counts match the PrimaryReligionPC value counts exactly (for example 7,795 at code 1 and 7,795 'Christianity'), so this is almost certainly the numeric key for that label.

Treatment: Treat as a nominal category that duplicates PrimaryReligionPC; one-hot encode one of the two rather than scaling as continuous.

anthropic:claude-opus-4-7 · confidence high

Out[132]:

saturn.columns["RLG3PC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	8
min	1
max	9
mean	3.213
median	2
std	2.311
q1	1
q3	6
iqr	5
skew	0.3143
kurtosis	-1.466
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 47.

Distribution of RLG3PC. Vertical dash marks the median.

Histogram bins for RLG3PC (median: 2.0).
bin	count
1 – 1.2	7795
1.2 – 1.4	0
1.4 – 1.6	0
1.6 – 1.8	0
1.8 – 2	0
2 – 2.2	647
2.2 – 2.4	0
2.4 – 2.6	0
2.6 – 2.8	0
2.8 – 3	0
3 – 3.2	0
3.2 – 3.4	0
3.4 – 3.6	0
3.6 – 3.8	0
3.8 – 4	0
4 – 4.2	1223
4.2 – 4.4	0
4.4 – 4.6	0
4.6 – 4.8	0
4.8 – 5	0
5 – 5.2	2557
5.2 – 5.4	0
5.4 – 5.6	0
5.6 – 5.8	0
5.8 – 6	0
6 – 6.2	3658
6.2 – 6.4	0
6.4 – 6.6	0
6.6 – 6.8	0
6.8 – 7	0
7 – 7.2	267
7.2 – 7.4	0
7.4 – 7.6	0
7.6 – 7.8	0
7.8 – 8	0
8 – 8.2	62
8.2 – 8.4	0
8.4 – 8.6	0
8.6 – 8.8	0
8.8 – 9	173

RLG3PGAC numeric feature

RLG3PGAC holds an integer code on a small 1-9 scale with only 8 distinct values across 16,382 rows and zero nulls. The distribution is broad and flat (kurtosis -1.36, std 2.25, IQR spanning 1 to 6) with near-zero skew, and no outliers or zeros are present. The per-code counts match the PrimaryReligionPGAC value counts exactly (for example 6,462 at code 1 and 6,462 'Christianity'), so this is almost certainly a nominal key for that label rather than a rating or a continuous measurement.

Treatment: Treat as a nominal category that duplicates PrimaryReligionPGAC; one-hot encode one of the two rather than scaling as continuous.

anthropic:claude-opus-4-7 · confidence high

Out[135]:

saturn.columns["RLG3PGAC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	8
min	1
max	9
mean	3.486
median	4
std	2.252
q1	1
q3	6
iqr	5
skew	0.1259
kurtosis	-1.363
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 48.

Distribution of RLG3PGAC. Vertical dash marks the median.

Histogram bins for RLG3PGAC (median: 4.0).
bin	count
1 – 1.2	6462
1.2 – 1.4	0
1.4 – 1.6	0
1.6 – 1.8	0
1.8 – 2	0
2 – 2.2	603
2.2 – 2.4	0
2.4 – 2.6	0
2.6 – 2.8	0
2.8 – 3	0
3 – 3.2	0
3.2 – 3.4	0
3.4 – 3.6	0
3.6 – 3.8	0
3.8 – 4	0
4 – 4.2	2613
4.2 – 4.4	0
4.4 – 4.6	0
4.6 – 4.8	0
4.8 – 5	0
5 – 5.2	2348
5.2 – 5.4	0
5.4 – 5.6	0
5.6 – 5.8	0
5.8 – 6	0
6 – 6.2	3766
6.2 – 6.4	0
6.4 – 6.6	0
6.6 – 6.8	0
6.8 – 7	0
7 – 7.2	256
7.2 – 7.4	0
7.4 – 7.6	0
7.6 – 7.8	0
7.8 – 8	0
8 – 8.2	139
8.2 – 8.4	0
8.4 – 8.6	0
8.6 – 8.8	0
8.8 – 9	195

PrimaryReligion categorical feature

PrimaryReligion is a low-cardinality categorical label assigning each of 16,382 rows to one of 8 religious traditions, with no nulls. Christianity dominates at 39.4% (6,459 rows), followed by Islam (3,786) and Ethnic Religions (2,651); the long tail includes 189 'Unknown' and 124 'Other / Small' rows. Entropy ratio of 0.74 indicates a moderately balanced distribution rather than a single overwhelming class.

Treatment: One-hot or target-encode; consider grouping 'Unknown' and 'Other / Small' if modelling sensitivity to rare classes.

anthropic:claude-opus-4-7 · confidence high

Out[138]:

saturn.columns["PrimaryReligion"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	8
top_value	Christianity
top_rate	0.3943
cardinality	8
entropy	2.231
entropy_ratio	0.7436

Fig 49.

Top values for PrimaryReligion.

Top values for PrimaryReligion (8 unique shown, of 8 total).
value	count	share
Christianity	6459	39.4%
Islam	3786	23.1%
Ethnic Religions	2651	16.2%
Hinduism	2338	14.3%
Buddhism	635	3.9%
Non-Religious	200	1.2%
Unknown	189	1.2%
Other / Small	124	0.8%
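
The reported entropy and entropy_ratio are consistent with Shannon entropy in bits and that entropy divided by log2 of the cardinality. This reading is inferred from the numbers in the profile, not taken from saturn's documentation. A short sketch that recomputes both from the value counts above:

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# PrimaryReligion value counts, copied from the table above.
counts = [6459, 3786, 2651, 2338, 635, 200, 189, 124]
print(round(entropy_bits(counts), 3), round(entropy_ratio(counts), 4))
```

A ratio near 1 means the categories are close to evenly used; a ratio near 0 means one category dominates.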

PrimaryReligionPC categorical feature

Categorical label of the dominant religion of a people-cluster (PC), with 8 distinct values and no nulls across 16,382 rows. Christianity leads at 47.6% (7,795), followed by Islam (3,658) and Hinduism (2,557), while a small 'Unknown' bucket (173) and 'Other / Small' (62) provide explicit catch-alls. Entropy ratio of 0.69 indicates moderate concentration rather than a single dominant class.

Treatment: One-hot or target-encode; consider merging 'Unknown' and 'Other / Small' if downstream model is sensitive to rare levels.

anthropic:claude-opus-4-7 · confidence high

Out[141]:

saturn.columns["PrimaryReligionPC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	8
top_value	Christianity
top_rate	0.4758
cardinality	8
entropy	2.071
entropy_ratio	0.6904

Fig 50.

Top values for PrimaryReligionPC.

Top values for PrimaryReligionPC (8 unique shown, of 8 total).
value	count	share
Christianity	7795	47.6%
Islam	3658	22.3%
Hinduism	2557	15.6%
Ethnic Religions	1223	7.5%
Buddhism	647	3.9%
Non-Religious	267	1.6%
Unknown	173	1.1%
Other / Small	62	0.4%

PrimaryReligionPGAC categorical label

Categorical label of the primary religion at the PGAC level (in Joshua Project usage most likely 'People Group Across Countries', that is, the group aggregated over every country where it lives), drawn from a fixed taxonomy of 8 values with no nulls across 16,382 rows. Christianity dominates at 39.4% (6,462), followed by Islam (3,766), Ethnic Religions (2,613) and Hinduism (2,348); Buddhism, Non-Religious, Unknown and Other / Small together account for under 8% of rows. Entropy ratio of 0.748 indicates a moderately concentrated but not degenerate distribution.

Treatment: One-hot or target-encode; consider grouping the small Unknown/Other tail before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[144]:

saturn.columns["PrimaryReligionPGAC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	8
top_value	Christianity
top_rate	0.3945
cardinality	8
entropy	2.245
entropy_ratio	0.7482

Fig 51.

Top values for PrimaryReligionPGAC.

Top values for PrimaryReligionPGAC (8 unique shown, of 8 total).
value	count	share
Christianity	6462	39.4%
Islam	3766	23.0%
Ethnic Religions	2613	16.0%
Hinduism	2348	14.3%
Buddhism	603	3.7%
Non-Religious	256	1.6%
Unknown	195	1.2%
Other / Small	139	0.8%

RLG4 numeric feature

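RLG4 is a sparse integer-valued column with only 20 distinct values spanning 10 to 41. It is overwhelmingly missing (null_rate 0.9621): only 621 rows, just under 4%, carry a value. Its null count (15,761) and cardinality (20) are identical to those of ReligionSubdivision, and its largest bins match that column's top labels (181 rows near code 20 and 181 'Sunni'; 124 rows near code 14 or 15 and 124 'Judaism'), so it is most likely a nominal code for the subdivision rather than a score or count. On that reading the skew (0.94), the 33 flagged outliers (outlier_rate 0.053 of non-null rows) and the median of 20 with IQR 7 describe code frequencies, not magnitudes.

Treatment: Treat as a nominal code paired with ReligionSubdivision; keep one of the two with an explicit 'not applicable' level, and do not impute numerically.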

anthropic:claude-opus-4-7 · confidence medium

Out[147]:

saturn.columns["RLG4"].stats

stat	value
n	16,382
nulls	15,761 (96.2%)
unique	20
min	10
max	41
mean	18.49
median	20
std	6.519
q1	14
q3	21
iqr	7
skew	0.9361
kurtosis	1.156
n_outliers	33
outlier_rate	0.05314
zero_rate	0
alert: null_rate	96.2% null
alert: outliers	5.3% rows beyond 1.5 IQR

Fig 52.

Distribution of RLG4. Vertical dash marks the median.

Histogram bins for RLG4 (median: 20.0).
bin	count
10 – 11.29	114
11.29 – 12.58	16
12.58 – 13.88	0
13.88 – 15.17	124
15.17 – 16.46	7
16.46 – 17.75	0
17.75 – 19.04	21
19.04 – 20.33	181
20.33 – 21.62	3
21.62 – 22.92	4
22.92 – 24.21	80
24.21 – 25.5	11
25.5 – 26.79	27
26.79 – 28.08	0
28.08 – 29.38	0
29.38 – 30.67	0
30.67 – 31.96	0
31.96 – 33.25	4
33.25 – 34.54	0
34.54 – 35.83	2
35.83 – 37.12	16
37.12 – 38.42	1
38.42 – 39.71	9
39.71 – 41	1

ReligionSubdivision categorical feature

A sub-categorisation of religion (e.g. Sunni/Shia branches, Buddhist schools, Judaism, Sikhism), populated only when a finer split applies. It is overwhelmingly null at 96.21%, so just 621 of 16,382 rows carry one of 20 values, with Sunni leading at 29.15% of the populated rows (181). Entropy ratio 0.74 indicates the non-null portion is reasonably spread rather than dominated by a single bucket. Note that the share column in the value table is relative to all 16,382 rows, not to the populated ones.

Treatment: Treat nulls as an explicit 'not applicable' category before one-hot encoding.

anthropic:claude-opus-4-7 · confidence high

Out[150]:

saturn.columns["ReligionSubdivision"].stats

stat	value
n	16,382
nulls	15,761 (96.2%)
unique	20
top_value	Sunni
top_rate	0.2915
cardinality	20
entropy	3.185
entropy_ratio	0.7369
alert: null_rate	96.2% null

Fig 53.

Top values for ReligionSubdivision.

Top values for ReligionSubdivision (20 unique shown, of 20 total).
value	count	share
Sunni	181	1.1%
Judaism	124	0.8%
Sikhism	68	0.4%
Tibetan	59	0.4%
Theravada	55	0.3%
Ancestor Worship	27	0.2%
Shia	21	0.1%
Mahayana	16	0.1%
Jainism	12	0.1%
Zoroastrianism	11	0.1%
Kirati	10	0.1%
Prakriti	9	0.1%
Animism	7	0.0%
Mandaeism	6	0.0%
Druze	4	0.0%
Baha'i	4	0.0%
Syncretized	3	0.0%
Shia Imami Ismaili	2	0.0%
Sarna	1	0.0%
Lingayat	1	0.0%

PCIslam numeric feature

PCIslam is a numeric column bounded between 0 and 100, almost certainly a percentage share of Muslim population (or similar Islam-related composition metric) per record. The distribution is heavily zero-inflated: 63.2% of values are exactly 0 and the median is 0, while the mean is 23.2 and values stretch all the way to 100, producing a right skew of 1.27 and 3,438 flagged outliers (21.1%). Nulls are negligible (0.52%) and 1,117 distinct values suggest reasonably fine-grained measurement rather than a coarse bucket.

Treatment: Treat as a zero-inflated proportion: model the zero mass separately or add a presence indicator before scaling.

anthropic:claude-opus-4-7 · confidence high

Out[153]:

saturn.columns["PCIslam"].stats

stat	value
n	16,382
nulls	86 (0.5%)
unique	1,117
min	0
max	100
mean	23.2
median	0
std	39.54
q1	0
q3	28
iqr	28
skew	1.273
kurtosis	-0.2575
n_outliers	3,438
outlier_rate	0.211
zero_rate	0.6322
alert: outliers	21.1% rows beyond 1.5 IQR

Fig 54.

Distribution of PCIslam. Vertical dash marks the median.

Histogram bins for PCIslam (median: 0.0).
bin	count
0 – 2.5	11065
2.5 – 5	160
5 – 7.5	245
7.5 – 10	54
10 – 12.5	250
12.5 – 15	34
15 – 17.5	119
17.5 – 20	26
20 – 22.5	167
22.5 – 25	18
25 – 27.5	80
27.5 – 30	17
30 – 32.5	113
32.5 – 35	27
35 – 37.5	45
37.5 – 40	21
40 – 42.5	58
42.5 – 45	12
45 – 47.5	29
47.5 – 50	6
50 – 52.5	40
52.5 – 55	10
55 – 57.5	26
57.5 – 60	22
60 – 62.5	66
62.5 – 65	15
65 – 67.5	53
67.5 – 70	23
70 – 72.5	72
72.5 – 75	25
75 – 77.5	53
77.5 – 80	32
80 – 82.5	87
82.5 – 85	37
85 – 87.5	76
87.5 – 90	59
90 – 92.5	159
92.5 – 95	123
95 – 97.5	232
97.5 – 100	2540
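
One way to "model the zero mass separately" is a hurdle-style split: a presence flag plus the positive share, with nulls kept as nulls. The sketch below is an illustration under that assumption; the function name and the choice to rescale to (0, 1] are not part of the profile.

```python
import math

def split_zero_inflated(pct):
    """Split a 0-100 percentage into (presence flag, positive share in (0, 1]).

    Nulls stay null in both outputs; zeros get flag 0 and no positive share.
    """
    if pct is None or (isinstance(pct, float) and math.isnan(pct)):
        return None, None
    if pct == 0:
        return 0, None
    return 1, pct / 100.0

# Toy values standing in for PCIslam.
rows = [0.0, 97.5, None, 12.0]
features = [split_zero_inflated(v) for v in rows]
print(features)
```

The flag can feed a classifier while the positive share is modelled only on rows where the flag is 1.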

PCNonReligious numeric feature

PCNonReligious appears to be a percentage feature capturing the share of a population that is non-religious, ranging from 0 to 99. The distribution is dominated by zeros (75.2% of rows) with median, Q1, and Q3 all at 0, yet the mean is 3.42 and skew is 3.65 with kurtosis 15.4, indicating a long right tail. The 24.8% outlier rate is an artifact of the zero IQR: with Q1 and Q3 both at 0, the 1.5 IQR rule flags every non-zero value (0.7522 zeros plus 0.2478 outliers sum to 1). The signal is sparse: most records report none and a minority report substantial percentages.

Treatment: Consider a zero-inflated treatment or log1p transform before modelling given the 75% zeros and heavy right tail.

anthropic:claude-opus-4-7 · confidence high

Out[156]:

saturn.columns["PCNonReligious"].stats

stat	value
n	16,382
nulls	66 (0.4%)
unique	223
min	0
max	99
mean	3.421
median	0
std	9.21
q1	0
q3	0
iqr	0
skew	3.648
kurtosis	15.43
n_outliers	4,043
outlier_rate	0.2478
zero_rate	0.7522
alert: high_skew	skew=+3.65
alert: outliers	24.8% rows beyond 1.5 IQR

Fig 55.

Distribution of PCNonReligious. Vertical dash marks the median.

Histogram bins for PCNonReligious (median: 0.0).
bin	count
0 – 2.475	12958
2.475 – 4.95	603
4.95 – 7.425	634
7.425 – 9.9	160
9.9 – 12.38	406
12.38 – 14.85	77
14.85 – 17.32	299
17.32 – 19.8	91
19.8 – 22.28	203
22.28 – 24.75	72
24.75 – 27.23	105
27.23 – 29.7	39
29.7 – 32.18	145
32.18 – 34.65	41
34.65 – 37.12	137
37.12 – 39.6	46
39.6 – 42.08	93
42.08 – 44.55	47
44.55 – 47.02	65
47.02 – 49.5	10
49.5 – 51.98	7
51.98 – 54.45	20
54.45 – 56.93	13
56.93 – 59.4	4
59.4 – 61.88	7
61.88 – 64.35	1
64.35 – 66.83	4
66.83 – 69.3	1
69.3 – 71.78	16
71.78 – 74.25	8
74.25 – 76.73	1
76.73 – 79.2	0
79.2 – 81.67	0
81.67 – 84.15	0
84.15 – 86.62	0
86.62 – 89.1	0
89.1 – 91.58	0
91.58 – 94.05	0
94.05 – 96.53	2
96.53 – 99	1
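
When Q1 and Q3 are both 0, the IQR is 0 and the 1.5 IQR fences collapse onto 0, so the rule flags every non-zero value. A small sketch on synthetic data (not the real column) shows the effect; the quartile method used here is Python's inclusive method and may differ from the one saturn uses.

```python
import statistics

def iqr_outlier_rate(values):
    """Return (share of values outside the 1.5 IQR fences, IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = sum(1 for v in values if v < low or v > high)
    return flagged / len(values), iqr

# Synthetic column: 80 zeros and 20 positive percentages.
values = [0.0] * 80 + [1.0, 5.0, 12.5, 40.0] * 5
rate, iqr = iqr_outlier_rate(values)
print(rate, iqr)  # 0.2 0.0
```

The flagged share equals the non-zero share, which is why the outlier alert on such columns says little beyond the zero rate.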

PCUnknown numeric feature

PCUnknown is a numeric feature bounded between 0 and 100, almost certainly the percentage of a group's population whose religion is unknown. It is overwhelmingly zero (zero_rate 0.9558) with median, q1, and q3 all at 0, yet the max reaches 100 with skew 9.07 and kurtosis 81.5. Because the IQR is 0, the 719 outliers (4.4%) are essentially all the non-zero rows. The 583 distinct values form a sparse, heavy tail with a second cluster near 100 (152 rows in the 97.5 to 100 bin) rather than a smooth distribution.

Treatment: Binarize (zero vs non-zero) or log1p-transform before modelling given the 95.6% zero mass and extreme skew.

anthropic:claude-opus-4-7 · confidence high

Out[159]:

saturn.columns["PCUnknown"].stats

stat	value
n	16,382
nulls	104 (0.6%)
unique	583
min	0
max	100
mean	1.201
median	0
std	10.34
q1	0
q3	0
iqr	0
skew	9.066
kurtosis	81.52
n_outliers	719
outlier_rate	0.04417
zero_rate	0.9558
alert: high_skew	skew=+9.07

Fig 56.

Distribution of PCUnknown. Vertical dash marks the median.

Histogram bins for PCUnknown (median: 0.0).
bin	count
0 – 2.5	15982
2.5 – 5	29
5 – 7.5	13
7.5 – 10	11
10 – 12.5	12
12.5 – 15	4
15 – 17.5	7
17.5 – 20	2
20 – 22.5	5
22.5 – 25	5
25 – 27.5	3
27.5 – 30	6
30 – 32.5	1
32.5 – 35	3
35 – 37.5	0
37.5 – 40	4
40 – 42.5	2
42.5 – 45	0
45 – 47.5	1
47.5 – 50	2
50 – 52.5	2
52.5 – 55	1
55 – 57.5	1
57.5 – 60	2
60 – 62.5	1
62.5 – 65	0
65 – 67.5	1
67.5 – 70	1
70 – 72.5	3
72.5 – 75	1
75 – 77.5	1
77.5 – 80	1
80 – 82.5	1
82.5 – 85	1
85 – 87.5	2
87.5 – 90	0
90 – 92.5	1
92.5 – 95	12
95 – 97.5	2
97.5 – 100	152

SecurityLevel numeric feature

SecurityLevel takes only 3 distinct integer values spanning 0 to 2 with no nulls, so it is effectively an ordinal category encoded numerically (e.g., low/medium/high). The distribution is bimodal rather than balanced: level 2 holds 48.8% of rows (7,991), level 0 holds 38.8% (6,361), and level 1 only 12.4% (2,030), which explains the strongly negative kurtosis (-1.82). The mean of 1.10 and median of 1 sit on the least common level and are poor summaries.

Treatment: Treat as an ordinal categorical and one-hot or ordinal-encode before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[162]:

saturn.columns["SecurityLevel"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	3
min	0
max	2
mean	1.099
median	1
std	0.9307
q1	0
q3	2
iqr	2
skew	-0.1985
kurtosis	-1.816
n_outliers	0
outlier_rate	0
zero_rate	0.3883

Fig 57.

Distribution of SecurityLevel. Vertical dash marks the median.

Histogram bins for SecurityLevel (median: 1.0).
bin	count
0 – 0.05	6361
0.05 – 0.1	0
0.1 – 0.15	0
0.15 – 0.2	0
0.2 – 0.25	0
0.25 – 0.3	0
0.3 – 0.35	0
0.35 – 0.4	0
0.4 – 0.45	0
0.45 – 0.5	0
0.5 – 0.55	0
0.55 – 0.6	0
0.6 – 0.65	0
0.65 – 0.7	0
0.7 – 0.75	0
0.75 – 0.8	0
0.8 – 0.85	0
0.85 – 0.9	0
0.9 – 0.95	0
0.95 – 1	0
1 – 1.05	2030
1.05 – 1.1	0
1.1 – 1.15	0
1.15 – 1.2	0
1.2 – 1.25	0
1.25 – 1.3	0
1.3 – 1.35	0
1.35 – 1.4	0
1.4 – 1.45	0
1.45 – 1.5	0
1.5 – 1.55	0
1.55 – 1.6	0
1.6 – 1.65	0
1.65 – 1.7	0
1.7 – 1.75	0
1.75 – 1.8	0
1.8 – 1.85	0
1.85 – 1.9	0
1.9 – 1.95	0
1.95 – 2	7991

LRTop100 categorical label

Binary Y/N flag indicating membership in some 'LR Top 100' set, with only 100 positive cases out of 16,382 rows (top_rate 0.9939). Extreme class imbalance and very low entropy (0.0537) make this nearly constant. No nulls, exactly 2 categories as expected.

Treatment: Use stratified sampling or class-weighting if modelling; otherwise treat as rare-event indicator.

anthropic:claude-opus-4-7 · confidence high

Out[165]:

saturn.columns["LRTop100"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	N
top_rate	0.9939
cardinality	2
entropy	0.05368
entropy_ratio	0.05368
alert: imbalance	top value is 99.4% of rows

Fig 58.

Top values for LRTop100.

Top values for LRTop100 (2 unique shown, of 2 total).
value	count	share
N	16282	99.4%
Y	100	0.6%
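
For the class-weighting option, a common heuristic is to weight each class by n / (k * count), the same formula scikit-learn applies for class_weight="balanced". The sketch below computes it in plain Python from the counts above; whether weighting is appropriate depends on the model, which the profile does not specify.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

# LRTop100 value counts, copied from the table above.
weights = balanced_class_weights({"N": 16282, "Y": 100})
print(weights)
```

With these counts each 'Y' row weighs roughly 160 times as much as an 'N' row, so the two classes contribute equally in total.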

PhotoAddress text foreign_key

PhotoAddress holds single-token image filenames in the pattern p#####.jpg, with a max length of 13 and exactly one word per row. 5,718 of 16,382 rows (~35%) are empty strings rather than nulls, and overall duplicate rate is 67.8% — the same photo file is reused across many records (e.g., p19007.jpg appears 92 times). With only 5,277 unique values, this behaves like a foreign-key reference to an image asset, not a per-row unique pointer.

Treatment: Treat empty strings as missing and join to an image/asset table on this filename rather than modelling it as text.

anthropic:claude-opus-4-7 · confidence high

Out[168]:

saturn.columns["PhotoAddress"].stats

stat	value
n	16,382
nulls	1 (0.0%)
unique	5,277
len_min	0
len_max	13
len_mean	6.523
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	5,718
n_duplicates	11,104
duplicate_rate	0.6779
vocab_size	5,276
readability_flesch_mean	82.43
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	67.8% duplicate strings

Fig 59.

Character-length distribution for PhotoAddress.

Character-length distribution for PhotoAddress (mean: 6.52). Lengths not listed have a count of 0.
chars	count
0	5718
10	10591
13	72

PhotoCredits text metadata

Attribution string for a photo credit, mostly short names or sources (mean length 11.5 chars, median 1 word). Highly repetitive: 90.2% duplicate rate across only 1,605 unique values, with 5,718 empty entries and 3,065 'Anonymous' tags dominating. Top words reveal stock/CC sources like Flickr, Wikimedia, Pixabay, and Shutterstock alongside named contributors.

Treatment: Treat as low-cardinality categorical attribution; normalize empties/'Anonymous' and group rare credits before any analysis.

anthropic:claude-opus-4-7 · confidence high

Out[171]:

saturn.columns["PhotoCredits"].stats

stat	value
n	16,382
nulls	10 (0.1%)
unique	1,605
len_min	0
len_max	56
len_mean	11.54
len_median	9
len_p95	30
word_mean	2.081
word_median	1
n_empty	5,718
n_duplicates	14,767
duplicate_rate	0.902
vocab_size	2,658
readability_flesch_mean	-13.88
emoji_rate	0
url_rate	0.0004276
one_word_rate	0.5754
allcaps_rate	0.001649
boilerplate_rate	0
alert: one_word	57.5% rows are a single word
alert: duplicates	90.2% duplicate strings

Fig 60.

Character-length distribution for PhotoCredits.

Character-length distribution for PhotoCredits (mean: 11.54).
chars	count
0 – 1	5718
1 – 3	0
3 – 4	8
4 – 6	14
6 – 7	293
7 – 8	34
8 – 10	3203
10 – 11	541
11 – 13	425
13 – 14	182
14 – 15	479
15 – 17	173
17 – 18	474
18 – 20	251
20 – 21	446
21 – 22	658
22 – 24	371
24 – 25	683
25 – 27	247
27 – 28	269
28 – 29	798
29 – 31	532
31 – 32	209
32 – 34	53
34 – 35	24
35 – 36	81
36 – 38	23
38 – 39	17
39 – 41	17
41 – 42	6
42 – 43	61
43 – 45	5
45 – 46	39
46 – 48	12
48 – 49	11
49 – 50	8
50 – 52	1
52 – 53	3
53 – 55	2
55 – 56	1

PhotoCreditURL text metadata

This column stores photo credit URLs, with every non-empty value being a single token (one_word_rate 1.0) and 47.5% matching a URL pattern. It is sparsely populated: 5,419 rows (33.08%) are null and another 5,718 are empty strings, so about 68% of rows carry no URL. Among the 10,963 non-null values 86.9% are duplicates; a single domain, https://www.asiaharvest.org, accounts for 736 rows. Only 1,434 unique values appear, suggesting a small set of recurring image sources rather than per-record attribution.

Treatment: Treat empty strings as missing, extract the domain as a categorical feature, and drop the raw URL rather than using it as a modelling input.

anthropic:claude-opus-4-7 · confidence high

Out[174]:

saturn.columns["PhotoCreditURL"].stats

stat	value
n	16,382
nulls	5,419 (33.1%)
unique	1,434
len_min	0
len_max	240
len_mean	25.67
len_median	0
len_p95	73
word_mean	1
word_median	1
n_empty	5,718
n_duplicates	9,529
duplicate_rate	0.8692
vocab_size	1,433
readability_flesch_mean	-261.7
emoji_rate	0
url_rate	0.4753
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: url_heavy	47.5% rows contain a URL
alert: null_rate	33.1% null
alert: duplicates	86.9% duplicate strings

Fig 61.

Character-length distribution for PhotoCreditURL.

Character-length distribution for PhotoCreditURL (mean: 25.67).
chars	count
0 – 6	5718
6 – 12	0
12 – 18	3
18 – 24	47
24 – 30	1018
30 – 36	370
36 – 42	111
42 – 48	388
48 – 54	851
54 – 60	514
60 – 66	584
66 – 72	783
72 – 78	117
78 – 84	88
84 – 90	94
90 – 96	41
96 – 102	33
102 – 108	26
108 – 114	23
114 – 120	68
120 – 126	6
126 – 132	16
132 – 138	2
138 – 144	24
144 – 150	1
150 – 156	21
156 – 162	0
162 – 168	2
168 – 174	7
174 – 180	0
180 – 186	0
186 – 192	2
192 – 198	1
198 – 204	0
204 – 210	2
210 – 216	0
216 – 222	0
222 – 228	0
228 – 234	0
234 – 240	2
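
A sketch of the domain extraction, using only the standard library. Stripping a leading "www." and returning None for empty or scheme-less values are choices made for this illustration, and example.org is a placeholder URL, not one from the data.

```python
from urllib.parse import urlparse

def credit_domain(url):
    """Return the lowercased host of a URL without a leading 'www.', or None."""
    if url is None or not url.strip():
        return None  # nulls and empty strings are treated as missing
    host = urlparse(url.strip()).hostname  # None when the value has no scheme
    if host is None:
        return None
    return host[4:] if host.startswith("www.") else host

print(credit_domain("https://www.asiaharvest.org"))   # asiaharvest.org
print(credit_domain("https://Example.org/some/path"))  # example.org
print(credit_domain(""))                               # None
```

Values without a scheme parse with no host and come back as None, so they should be counted before being discarded.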

PhotoCreativeCommons categorical feature

A binary Y/N flag indicating whether a photo carries a Creative Commons licence. The column is heavily skewed: 'N' covers 83.6% of the non-null values (13,686) while 'Y' accounts for 2,691, with only 5 nulls among 16,382 rows (null rate 0.0003).

Treatment: Encode as a 0/1 boolean; expect class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[177]:

saturn.columns["PhotoCreativeCommons"].stats

stat	value
n	16,382
nulls	5 (0.0%)
unique	2
top_value	N
top_rate	0.8357
cardinality	2
entropy	0.6445
entropy_ratio	0.6445

Fig 62.

Top values for PhotoCreativeCommons.

Top values for PhotoCreativeCommons (2 unique shown, of 2 total).
value	count	share
N	13686	83.5%
Y	2691	16.4%

PhotoCopyright categorical feature

Binary Y/N flag indicating whether a photo copyright applies, with 'N' dominating at 87.95% of 16,382 rows versus 1,972 'Y' values. Class imbalance is notable but not extreme, and nulls are negligible at 0.09%. Entropy ratio of 0.53 reflects this skew toward 'N'.

Treatment: Encode as a 0/1 boolean; be aware of the ~1:7 class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[180]:

saturn.columns["PhotoCopyright"].stats

stat	value
n	16,382
nulls	15 (0.1%)
unique	2
top_value	N
top_rate	0.8795
cardinality	2
entropy	0.5308
entropy_ratio	0.5308

Fig 63.

Top values for PhotoCopyright.

Top values for PhotoCopyright (2 unique shown, of 2 total).
value	count	share
N	14395	87.9%
Y	1972	12.0%

PhotoPermission categorical feature

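A Y/N flag for photo-use permission, with 87.1% of non-null values set to 'N' and only 0.1% null across 16,382 rows. Cardinality is 3 because two records use lowercase 'y' alongside 2,111 uppercase 'Y', a casing inconsistency worth normalising. The entropy ratio of 0.35 is depressed by that spurious third level (the entropy of 0.5564 is divided by log2(3) rather than log2(2)). 'N' may mean that no permission is recorded rather than that permission was refused.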

Treatment: Uppercase-normalise then map to a boolean before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[183]:

saturn.columns["PhotoPermission"].stats

stat	value
n	16,382
nulls	17 (0.1%)
unique	3
top_value	N
top_rate	0.8709
cardinality	3
entropy	0.5564
entropy_ratio	0.3511

Fig 64.

Top values for PhotoPermission.

Top values for PhotoPermission (3 unique shown, of 3 total).
value	count	share
N	14252	87.0%
Y	2111	12.9%
y	2	0.0%
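
A minimal sketch of the normalisation step: trim, uppercase, then map to a boolean, leaving nulls and unexpected values as None so they can be inspected rather than silently coerced.

```python
def yn_to_bool(value):
    """Map 'Y'/'N' in any casing to True/False; anything else becomes None."""
    if value is None:
        return None
    return {"Y": True, "N": False}.get(str(value).strip().upper())

# Toy values standing in for PhotoPermission, including the lowercase variant.
raw = ["N", "Y", "y", None, "?"]
print([yn_to_bool(v) for v in raw])  # [False, True, True, None, None]
```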

ProfileTextExists categorical feature

Binary Y/N flag indicating whether a profile has text, with no nulls across 16382 rows. Roughly 79.5% are 'Y' (13018) versus 'N' (3364), an imbalance worth noting but not extreme. Entropy ratio of 0.73 confirms a moderately skewed but informative distribution.

Treatment: Encode as a 0/1 boolean for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[186]:

saturn.columns["ProfileTextExists"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.7947
cardinality	2
entropy	0.7325
entropy_ratio	0.7325

Fig 65.

Top values for ProfileTextExists.

Top values for ProfileTextExists (2 unique shown, of 2 total).
value	count	share
Y	13018	79.5%
N	3364	20.5%

CountOfCountries numeric feature

Counts the number of countries associated with each record, presumably the number of countries in which the row's people group appears, ranging from 1 to 164 with a median of 1 and Q3 of just 4. The distribution is severely right-skewed (skew 5.15, kurtosis 32.05) and 19.2% of rows flag as outliers, indicating a long tail where a minority of records belong to groups spread across dozens of countries, up to 164, while most cover only one.

Treatment: Log-transform or bin (e.g. 1, 2-4, 5+) before modelling to tame the heavy tail.

anthropic:claude-opus-4-7 · confidence high

Out[189]:

saturn.columns["CountOfCountries"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	48
min	1
max	164
mean	8.328
median	1
std	20.64
q1	1
q3	4
iqr	3
skew	5.152
kurtosis	32.05
n_outliers	3,139
outlier_rate	0.1916
zero_rate	0
alert: high_skew	skew=+5.15
alert: outliers	19.2% rows beyond 1.5 IQR

Fig 66.

Distribution of CountOfCountries. Vertical dash marks the median.

Histogram bins for CountOfCountries (median: 1.0).
bin	count
1 – 5.075	12712
5.075 – 9.15	702
9.15 – 13.23	503
13.23 – 17.3	366
17.3 – 21.38	334
21.38 – 25.45	306
25.45 – 29.53	221
29.53 – 33.6	193
33.6 – 37.68	69
37.68 – 41.75	235
41.75 – 45.83	84
45.83 – 49.9	142
49.9 – 53.98	53
53.98 – 58.05	57
58.05 – 62.12	0
62.12 – 66.2	0
66.2 – 70.28	0
70.28 – 74.35	0
74.35 – 78.42	78
78.42 – 82.5	163
82.5 – 86.58	0
86.58 – 90.65	0
90.65 – 94.73	0
94.73 – 98.8	0
98.8 – 102.9	0
102.9 – 107	0
107 – 111	0
111 – 115.1	0
115.1 – 119.2	0
119.2 – 123.2	0
123.2 – 127.3	0
127.3 – 131.4	0
131.4 – 135.5	0
135.5 – 139.6	0
139.6 – 143.6	0
143.6 – 147.7	0
147.7 – 151.8	0
151.8 – 155.8	0
155.8 – 159.9	0
159.9 – 164	164
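
Both suggested treatments are short to express. The sketch below uses the bin edges named in the treatment (1, 2-4, 5+) and a natural log, which is safe here because the column's minimum is 1; the function names are illustrative.

```python
import math

def country_count_bin(n):
    """Map a country count to the coarse bins '1', '2-4', '5+'."""
    if n < 1:
        raise ValueError("country counts start at 1 in this column")
    if n == 1:
        return "1"
    return "2-4" if n <= 4 else "5+"

def log_country_count(n):
    """Natural log of the count; 1 maps to 0.0 and 164 to about 5.1."""
    return math.log(n)

for n in (1, 4, 5, 164):
    print(n, country_count_bin(n), round(log_country_count(n), 2))
```

Binning keeps the single-country majority as its own level, while the log keeps the ordering within the tail.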

CountOfProvinces unknown other

The column 'CountOfProvinces' was skipped by the profiler, so beyond a row count of 16382 and a null rate of 0.0 there is no evidence about its distribution, type, or uniqueness. The name suggests an integer count of provinces per record, but this cannot be confirmed from the payload. No further signal is available.

Treatment: Re-run the profiler on this column to recover type and distribution before any downstream use.

anthropic:claude-opus-4-7 · confidence low

Out[192]:

saturn.columns["CountOfProvinces"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

Longitude numeric feature

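This is a Longitude coordinate column containing at least one corrupted value: valid longitudes must lie within [-180, 180], yet the max is 2350588.0, which pulls the mean (189.33) outside the legal range. The histogram places a single row in the top bin, and the standard deviation (1.836e+04) is about what one value of that size among 16,382 rows would produce, so the skew of 127.98 and kurtosis of 16376.52 likely stem from that one entry. The 207 IQR-flagged outliers (1.26%) are a separate matter: the lower fence falls near -120, so most of them are probably valid western-hemisphere longitudes. The median of 55.45 is plausible.

Treatment: Set values outside [-180, 180] to missing (or repair them from the source) before any geospatial use; clipping would turn a corrupt value into a wrong coordinate at 180.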

anthropic:claude-opus-4-7 · confidence high

Out[194]:

saturn.columns["Longitude"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	15,889
min	-179.3
max	2.351e+06
mean	189.3
median	55.45
std	1.836e+04
q1	8.673
q3	94.64
iqr	85.97
skew	128
kurtosis	1.638e+04
n_outliers	207
outlier_rate	0.01264
zero_rate	0
alert: high_skew	skew=+127.98

Fig 67.

Distribution of Longitude. Vertical dash marks the median.

Histogram bins for Longitude (median: 55.45).
bin	count
-179.3 – 5.859e+04	16381
5.859e+04 – 1.174e+05	0
1.174e+05 – 1.761e+05	0
1.761e+05 – 2.349e+05	0
2.349e+05 – 2.937e+05	0
2.937e+05 – 3.524e+05	0
3.524e+05 – 4.112e+05	0
4.112e+05 – 4.7e+05	0
4.7e+05 – 5.287e+05	0
5.287e+05 – 5.875e+05	0
5.875e+05 – 6.463e+05	0
6.463e+05 – 7.051e+05	0
7.051e+05 – 7.638e+05	0
7.638e+05 – 8.226e+05	0
8.226e+05 – 8.814e+05	0
8.814e+05 – 9.401e+05	0
9.401e+05 – 9.989e+05	0
9.989e+05 – 1.058e+06	0
1.058e+06 – 1.116e+06	0
1.116e+06 – 1.175e+06	0
1.175e+06 – 1.234e+06	0
1.234e+06 – 1.293e+06	0
1.293e+06 – 1.352e+06	0
1.352e+06 – 1.41e+06	0
1.41e+06 – 1.469e+06	0
1.469e+06 – 1.528e+06	0
1.528e+06 – 1.587e+06	0
1.587e+06 – 1.645e+06	0
1.645e+06 – 1.704e+06	0
1.704e+06 – 1.763e+06	0
1.763e+06 – 1.822e+06	0
1.822e+06 – 1.88e+06	0
1.88e+06 – 1.939e+06	0
1.939e+06 – 1.998e+06	0
1.998e+06 – 2.057e+06	0
2.057e+06 – 2.116e+06	0
2.116e+06 – 2.174e+06	0
2.174e+06 – 2.233e+06	0
2.233e+06 – 2.292e+06	0
2.292e+06 – 2.351e+06	1
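
A range check is enough to isolate the out-of-range value. This sketch returns None for anything outside [-180, 180] rather than clipping, on the assumption that a missing coordinate is preferable to a fabricated one at the boundary.

```python
import math

def valid_longitude(value):
    """Return the longitude if it lies within [-180, 180], else None."""
    if value is None or math.isnan(value):
        return None
    return value if -180.0 <= value <= 180.0 else None

# The reported max, the reported median, and the reported min.
samples = [2350588.0, 55.45, -179.3]
print([valid_longitude(v) for v in samples])  # [None, 55.45, -179.3]
```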

Latitude numeric feature

Geographic latitude in decimal degrees, with values spanning -54.94 to 78.21 — well within the valid [-90, 90] range. Distribution is nearly symmetric (skew -0.12) and slightly flat (kurtosis -0.26), centered around a median of 17.03 and mean of 16.44, suggesting a tropical/northern-hemisphere bias. Near-unique (15851 of 16382) with no nulls and only 39 mild outliers, consistent with per-record geocoordinates rather than a categorical region label.

Treatment: Pair with longitude for geospatial features; consider binning by hemisphere or clustering rather than using raw degrees in linear models.

anthropic:claude-opus-4-7 · confidence high

Out[197]:

saturn.columns["Latitude"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	15,851
min	-54.94
max	78.21
mean	16.44
median	17.03
std	20.47
q1	2.072
q3	29.88
iqr	27.81
skew	-0.118
kurtosis	-0.2579
n_outliers	39
outlier_rate	0.002381
zero_rate	0

Fig 68.

Distribution of Latitude. Vertical dash marks the median.

Histogram bins for Latitude (median: 17.03254999983485).
bin	count
-54.94 – -51.61	5
-51.61 – -48.28	1
-48.28 – -44.95	5
-44.95 – -41.62	16
-41.62 – -38.29	12
-38.29 – -34.96	75
-34.96 – -31.64	153
-31.64 – -28.31	36
-28.31 – -24.98	109
-24.98 – -21.65	132
-21.65 – -18.32	172
-18.32 – -14.99	296
-14.99 – -11.66	253
-11.66 – -8.335	488
-8.335 – -5.006	761
-5.006 – -1.678	936
-1.678 – 1.651	589
1.651 – 4.98	633
4.98 – 8.308	1052
8.308 – 11.64	1205
11.64 – 14.97	791
14.97 – 18.29	793
18.29 – 21.62	775
21.62 – 24.95	1210
24.95 – 28.28	1405
28.28 – 31.61	779
31.61 – 34.94	775
34.94 – 38.27	429
38.27 – 41.6	496
41.6 – 44.92	496
44.92 – 48.25	387
48.25 – 51.58	471
51.58 – 54.91	280
54.91 – 58.24	131
58.24 – 61.57	155
61.57 – 64.9	43
64.9 – 68.22	19
68.22 – 71.55	15
71.55 – 74.88	1
74.88 – 78.21	2
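
The treatment above suggests binning by hemisphere instead of feeding raw degrees to a linear model. A minimal sketch of one such binning follows. The band edges (23.44 degrees for the tropics, 66.56 degrees for the polar circles) and the function name `latitude_band` are assumptions chosen for illustration; any other banding would work the same way.

```python
def latitude_band(lat):
    """Map a latitude in decimal degrees to (hemisphere, climate band)."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    hemisphere = "N" if lat >= 0 else "S"
    distance_from_equator = abs(lat)
    if distance_from_equator <= 23.44:
        band = "tropical"
    elif distance_from_equator <= 66.56:
        band = "temperate"
    else:
        band = "polar"
    return hemisphere, band

# Min, median and max from the Latitude stats table.
bands = [latitude_band(x) for x in (-54.94, 17.03, 78.21)]
print(bands)
```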

Ctry categorical feature

Country names stored as full strings, with 238 distinct values across 16,382 rows and no nulls. India dominates at 13.8% (2,262 rows), followed by Papua New Guinea (883) and Indonesia (788), so the file leans toward South Asia, Southeast Asia, and Melanesia. An entropy ratio of 0.79 indicates a fairly broad spread despite the long tail.

Treatment: Group long-tail countries or target/frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[200]:

saturn.columns["Ctry"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	238
top_value	India
top_rate	0.1381
cardinality	238
entropy	6.225
entropy_ratio	0.7885

Fig 69.

Top values for Ctry.

Top values for Ctry (20 unique shown, of 238 total).
value	count	share
India	2262	13.8%
Papua New Guinea	883	5.4%
Indonesia	788	4.8%
Pakistan	775	4.7%
China	547	3.3%
Nigeria	535	3.3%
United States	496	3.0%
Mexico	333	2.0%
Brazil	321	2.0%
Cameroon	292	1.8%
Bangladesh	278	1.7%
Canada	243	1.5%
Congo, Democratic Republic of	231	1.4%
Myanmar (Burma)	218	1.3%
Australia	205	1.3%
Philippines	200	1.2%
Sudan	198	1.2%
Nepal	195	1.2%
Laos	184	1.1%
Malaysia	183	1.1%
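
The treatment for Ctry mentions grouping long-tail countries and frequency encoding. A minimal sketch of both steps follows, using only the standard library. The function names, the `min_share` threshold, and the ten-row sample are illustrative assumptions.

```python
from collections import Counter

def group_long_tail(values, min_share=0.01, other="Other"):
    """Replace values whose share of rows is below min_share with `other`."""
    counts = Counter(values)
    n = len(values)
    keep = {v for v, c in counts.items() if c / n >= min_share}
    return [v if v in keep else other for v in values]

def frequency_encode(values):
    """Replace each value with its share of rows."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

# Made-up sample, not the real country distribution.
sample = ["India"] * 6 + ["Nepal"] * 3 + ["Laos"]
grouped = group_long_tail(sample, min_share=0.2)
encoded = frequency_encode(sample)
print(grouped[-1], encoded[0], encoded[-1])
```

Frequency encoding keeps a single numeric column, which is usually easier to handle than 238 one-hot columns.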

IndigenousCode categorical feature

IndigenousCode is a binary Y/N flag, fully populated across all 16,382 rows with only 2 distinct values. The class split is uneven: 'Y' covers 74.8% of records against 'N' for the remainder, yielding entropy of 0.81. The imbalance is notable but not extreme.

Treatment: Encode as a binary indicator; consider class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[203]:

saturn.columns["IndigenousCode"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.7483
cardinality	2
entropy	0.8139
entropy_ratio	0.8139

Fig 70.

Top values for IndigenousCode.

Top values for IndigenousCode (2 unique shown, of 2 total).
value	count	share
Y	12259	74.8%
N	4123	25.2%
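
The entropy and entropy_ratio figures can be reproduced from the counts in the table above. The sketch below assumes Saturn reports Shannon entropy in bits and normalises by log2 of the cardinality; that assumption matches the reported 0.8139 here and the 0.7885 ratio for Ctry (6.225 / log2(238)), but it is not confirmed by the tool's documentation.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Y and N counts from the IndigenousCode table.
h = entropy_bits([12259, 4123])
r = entropy_ratio([12259, 4123])
print(round(h, 4), round(r, 4))
```

For a binary column the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide in the stats table.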

PercentAdherents text feature

This is a numeric percentage field (PercentAdherents) stored as text, with all 16382 values being single tokens of length 5-7 like '0.000' or '95.000'. The distribution is heavily concentrated at zero (4007 of 16382 rows) and shows strong duplication (duplicate_rate 0.924, only 1248 unique values). Despite the 'allcaps' and 'one_word' alerts, these are just numeric strings, not categorical text.

Treatment: Cast to float and treat as a numeric feature; consider zero-inflation handling given the spike at 0.000.

anthropic:claude-opus-4-7 · confidence high

Out[206]:

saturn.columns["PercentAdherents"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	1,248
len_min	5
len_max	7
len_mean	5.534
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	15,134
duplicate_rate	0.9238
vocab_size	1,248
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	92.4% duplicate strings

Fig 71.

Character-length distribution for PercentAdherents.

Character-length distribution for PercentAdherents (mean: 5.533573434257112).
chars	count
5	7823
6	8377
7	182
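
Casting these text percentages to float is a one-line conversion once missing values are handled. A minimal sketch follows; the helper name `parse_percent` and the sample strings are assumptions for illustration, and the same helper would apply to the other Percent* text columns that do contain nulls.

```python
def parse_percent(text):
    """Convert a text percentage such as '95.000' to float; None or blank stays None."""
    if text is None or text.strip() == "":
        return None
    return float(text)

# Sample strings in the same format as the column, plus one missing value.
raw = ["0.000", "95.000", "1.801", None]
values = [parse_percent(t) for t in raw]
print(values)
```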

PercentChristianPC categorical feature

Stored as a categorical string, this column is a percentage of Christians computed for a group larger than the individual row, with 246 distinct values across 16,382 rows and no nulls. The grouping is probably not the country: India alone has 2,262 rows in Ctry, yet the most frequent value here, '90.061', covers only 1,139 rows (6.95%). The PC suffix may instead denote a people-cluster aggregate, though the stats alone cannot confirm that. The top ten values include both very high shares (90.061, 82.325, 76.515) and near-zero shares (0.482, 0.111, 0.000), consistent with a small set of group-level percentages broadcast onto many rows. An entropy ratio of 0.86 indicates the values are fairly evenly spread across the 246 categories despite the heavy mode.

Treatment: Cast to float and treat as a numeric feature rather than a category.

anthropic:claude-opus-4-7 · confidence high

Out[209]:

saturn.columns["PercentChristianPC"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	246
top_value	90.061
top_rate	0.06953
cardinality	246
entropy	6.853
entropy_ratio	0.8628

Fig 72.

Top values for PercentChristianPC.

Top values for PercentChristianPC (20 unique shown, of 246 total).
value	count	share
90.061	1139	7.0%
0.482	935	5.7%
0.111	592	3.6%
82.325	448	2.7%
8.571	448	2.7%
0.000	374	2.3%
0.508	359	2.2%
76.515	342	2.1%
0.004	293	1.8%
5.345	253	1.5%
70.253	240	1.5%
9.838	226	1.4%
38.996	223	1.4%
5.023	223	1.4%
24.025	182	1.1%
3.733	164	1.0%
67.647	162	1.0%
70.325	161	1.0%
49.250	160	1.0%
67.300	160	1.0%
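
Whether this percentage is broadcast per country, per people cluster, or per some other key can be tested directly: a broadcast column has exactly one distinct value per key. The sketch below counts distinct values per key. The function name and the three toy rows are made up for illustration; on the real file the candidate key columns would be substituted in.

```python
from collections import defaultdict

def distinct_values_per_key(rows, key, col):
    """Count how many distinct `col` values appear for each `key` value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[col])
    return {k: len(v) for k, v in seen.items()}

# Toy rows, not taken from the dataset.
rows = [
    {"Ctry": "India", "PercentChristianPC": "0.482"},
    {"Ctry": "India", "PercentChristianPC": "0.111"},
    {"Ctry": "Nepal", "PercentChristianPC": "0.482"},
]
per_country = distinct_values_per_key(rows, "Ctry", "PercentChristianPC")
print(per_country)
```

A key for which every count equals 1 is a plausible broadcast level; any count above 1 rules that key out.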

NaturalName text label

NaturalName appears to be a people-group or ethno-linguistic label, dominated by single-word entries (one_word_rate 0.555) and short strings (len_mean 10.9, word_mean 1.59). Roughly a third of rows repeat (duplicate_rate 0.345, 5645 duplicates across 10737 uniques), with 'Deaf' (164), 'French' (82), and 'British' (80) leading. Top words include fragments such as 'traditions)', '(hindu', and '(muslim' occurring 500-1000+ times; these come from whitespace tokenisation of compound names such as 'X (Hindu traditions)', so they reflect how the profiler splits words rather than damage in the stored strings.

Treatment: Split the parenthetical qualifier into its own field before using the name as a categorical grouping key.

anthropic:claude-opus-4-7 · confidence high

Out[212]:

saturn.columns["NaturalName"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	10,737
len_min	1
len_max	41
len_mean	10.91
len_median	9
len_p95	25
word_mean	1.585
word_median	1
n_empty	0
n_duplicates	5,645
duplicate_rate	0.3446
vocab_size	11,164
readability_flesch_mean	50.53
emoji_rate	0
url_rate	0
one_word_rate	0.5554
allcaps_rate	0
boilerplate_rate	0
alert: one_word	55.5% rows are a single word
alert: duplicates	34.5% duplicate strings

Fig 73.

Character-length distribution for NaturalName.

Character-length distribution for NaturalName (mean: 10.908008790135515).
chars	count
1 – 2	1
2 – 3	34
3 – 4	290
4 – 5	1252
5 – 6	1696
6 – 7	1956
7 – 8	1615
8 – 9	1098
9 – 10	697
10 – 11	766
11 – 12	729
12 – 13	698
13 – 14	883
14 – 15	741
15 – 16	733
16 – 17	593
17 – 18	355
18 – 19	248
19 – 20	186
20 – 21	147
21 – 22	132
22 – 23	112
23 – 24	180
24 – 25	232
25 – 26	280
26 – 27	204
27 – 28	134
28 – 29	100
29 – 30	50
30 – 31	46
31 – 32	67
32 – 33	43
33 – 34	42
34 – 35	22
35 – 36	15
36 – 37	0
37 – 38	0
38 – 39	1
39 – 40	2
40 – 41	2
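
The parenthetical fragments reported for NaturalName are what whitespace tokenisation produces for names of the form 'X (Hindu traditions)'. The sketch below reproduces that split and shows one way to separate the base name from the qualifier. The regular expression and the function name `split_qualifier` are assumptions for illustration, and 'X' is the placeholder used in the prose above, not a real value.

```python
import re

# Base name, then one trailing parenthetical with no nested parentheses.
QUALIFIER = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")

def split_qualifier(name):
    """Return (base, qualifier); qualifier is None when there is no trailing parenthetical."""
    m = QUALIFIER.match(name)
    if m:
        return m.group("base"), m.group("qualifier")
    return name, None

name = "X (Hindu traditions)"
tokens = name.lower().split()   # how a whitespace tokeniser sees the name
parts = split_qualifier(name)
print(tokens, parts)
```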

NaturalPronunciation text metadata

This column holds phonetic respellings of ethnic or demographic labels (e.g. 'AY-zhun', 'chai-NEEZ', 'kor-EE-un'), with hyphenated syllables and capitalised stress markers indicating an ad-hoc pronunciation guide. It is sparse and repetitive: 69.63% of rows are null, 61.15% of the 4,975 populated rows are duplicates (1,933 unique values), and 69.41% of entries are a single word. The most frequent value, 'def', appears 164 times, which matches the 164 'Deaf' rows in NaturalName; it is most likely the respelling of 'Deaf' rather than a placeholder or default.

Treatment: Treat as a categorical pronunciation lookup keyed to an ethnicity label; leave as missing or drop given the 69.63% null rate.

anthropic:claude-opus-4-7 · confidence high

Out[215]:

saturn.columns["NaturalPronunciation"].stats

stat	value
n	16,382
nulls	11,407 (69.6%)
unique	1,933
len_min	2
len_max	57
len_mean	12.13
len_median	11
len_p95	26
word_mean	1.345
word_median	1
n_empty	0
n_duplicates	3,042
duplicate_rate	0.6115
vocab_size	2,039
readability_flesch_mean	61.69
emoji_rate	0
url_rate	0
one_word_rate	0.6941
allcaps_rate	0.000402
boilerplate_rate	0
alert: one_word	69.4% rows are a single word
alert: null_rate	69.6% null
alert: duplicates	61.1% duplicate strings

Fig 74.

Character-length distribution for NaturalPronunciation.

Character-length distribution for NaturalPronunciation (mean: 12.130653266331658).
chars	count
2 – 3	285
3 – 5	161
5 – 6	231
6 – 8	690
8 – 9	599
9 – 10	488
10 – 12	500
12 – 13	281
13 – 14	333
14 – 16	211
16 – 17	257
17 – 18	174
18 – 20	165
20 – 21	156
21 – 23	55
23 – 24	39
24 – 25	76
25 – 27	26
27 – 28	58
28 – 30	24
30 – 31	17
31 – 32	43
32 – 34	8
34 – 35	11
35 – 36	12
36 – 38	9
38 – 39	14
39 – 40	7
40 – 42	9
42 – 43	6
43 – 45	8
45 – 46	6
46 – 47	3
47 – 49	3
49 – 50	4
50 – 52	1
52 – 53	0
53 – 54	3
54 – 56	0
56 – 57	2

PercentChristianPGAC text feature

This column holds percentages (by its name, a Christian population share; the PGAC suffix may stand for 'people group across countries', which the stats alone cannot confirm) stored as text rather than numeric, with values like "0.000", "95.000", "90.000" of 5-7 characters. The distribution is heavily zero-inflated: 3,121 of 16,382 rows are "0.000" and the duplicate rate is 88%, leaving only 1,954 unique values; 15 rows are null. The allcaps and one-word alerts fire only because the profiler treated numeric strings as tokens.

Treatment: Cast to float and treat as a numeric percentage feature.

anthropic:claude-opus-4-7 · confidence high

Out[218]:

saturn.columns["PercentChristianPGAC"].stats

stat	value
n	16,382
nulls	15 (0.1%)
unique	1,954
len_min	5
len_max	7
len_mean	5.528
len_median	6
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	14,413
duplicate_rate	0.8806
vocab_size	1,954
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	88.1% duplicate strings

Fig 75.

Character-length distribution for PercentChristianPGAC.

Character-length distribution for PercentChristianPGAC (mean: 5.528196981731533).
chars	count
5	7867
6	8355
7	145

PercentEvangelical text feature

This is a numeric percentage (share evangelical) stored as text strings such as '0.000' and '6.000', each a single token 5-6 characters long. The distribution is heavily zero-inflated: 4205 of 16382 rows are '0.000', and the duplicate rate is 0.9315 among the non-null rows, which hold only 1047 unique values. Null rate is 0.0668, so roughly 7% are missing.

Treatment: Cast to float and treat as a zero-inflated numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[221]:

saturn.columns["PercentEvangelical"].stats

stat	value
n	16,382
nulls	1,095 (6.7%)
unique	1,047
len_min	5
len_max	6
len_mean	5.226
len_median	5
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	14,240
duplicate_rate	0.9315
vocab_size	1,047
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	93.2% duplicate strings

Fig 76.

Character-length distribution for PercentEvangelical.

Character-length distribution for PercentEvangelical (mean: 5.226401517629358).
chars	count
5	11826
6	3461

PercentEvangelicalPC categorical feature

Numeric percentages describing evangelical share (the 20 most frequent values span 0.000 to 28.097), stored as strings with only 228 distinct values across 16,382 rows. That pattern suggests a precomputed group-level statistic broadcast to many records rather than a per-row measurement. The top value '20.481' covers 7.0% of rows and the top ten values together account for about 32% (5,273 rows), consistent with repeated group-level values. Entropy ratio is 0.87, so the distribution is fairly spread but discretised.

Treatment: Cast to float and treat as a group-level numeric feature; do not one-hot encode.

anthropic:claude-opus-4-7 · confidence high

Out[224]:

saturn.columns["PercentEvangelicalPC"].stats

stat	value
n	16,382
nulls	166 (1.0%)
unique	228
top_value	20.481
top_rate	0.07024
cardinality	228
entropy	6.782
entropy_ratio	0.8658

Fig 77.

Top values for PercentEvangelicalPC.

Top values for PercentEvangelicalPC (20 unique shown, of 228 total).
value	count	share
20.481	1139	7.0%
0.199	935	5.7%
0.095	592	3.6%
13.274	448	2.7%
3.409	448	2.7%
0.000	441	2.7%
0.247	359	2.2%
28.097	342	2.1%
0.004	316	1.9%
2.656	253	1.5%
7.851	240	1.5%
7.936	226	1.4%
10.122	223	1.4%
3.339	223	1.4%
9.990	182	1.1%
10.025	162	1.0%
21.711	161	1.0%
9.519	160	1.0%
25.396	160	1.0%
17.433	159	1.0%

PercentEvangelicalPGAC text feature

This is a numeric percentage (Percent Evangelical, PGAC) stored as text — every value is a single token of 5-6 characters formatted like '0.000', '4.000', '1.801'. The distribution is heavily zero-inflated: 3,272 of 16,382 rows are '0.000', duplicate rate is 0.896 across only 1,624 unique values, and 4.5% are null. Despite the column being typed as text, there is no real language content here.

Treatment: Cast to float and treat as a zero-inflated numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[227]:

saturn.columns["PercentEvangelicalPGAC"].stats

stat	value
n	16,382
nulls	743 (4.5%)
unique	1,624
len_min	5
len_max	6
len_mean	5.235
len_median	5
len_p95	6
word_mean	1
word_median	1
n_empty	0
n_duplicates	14,015
duplicate_rate	0.8962
vocab_size	1,624
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	89.6% duplicate strings

Fig 78.

Character-length distribution for PercentEvangelicalPGAC.

Character-length distribution for PercentEvangelicalPGAC (mean: 5.234925506745956).
chars	count
5	11965
6	3674

PCBuddhism numeric feature

PCBuddhism appears to be a per-record percentage feature for Buddhist composition, ranging 0–100 with mean 3.77 and median 0. The distribution is overwhelmingly zero (zero_rate 0.89) with q1=q3=0 and iqr=0, so the 1.5 IQR rule flags every nonzero value: the outlier rate (0.1105) is exactly 1 minus the zero rate (0.8895). Skew (4.77) and kurtosis (22.0) are high. Treat this as a sparse, heavy-tailed share where most groups have no Buddhist presence but a tail reaches 100, with a secondary concentration of 179 rows in the 97.5–100 bin.

Treatment: Add a zero-vs-nonzero indicator and log1p-transform the nonzero share before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[230]:

saturn.columns["PCBuddhism"].stats

stat	value
n	16,382
nulls	104 (0.6%)
unique	1,052
min	0
max	100
mean	3.769
median	0
std	16.75
q1	0
q3	0
iqr	0
skew	4.775
kurtosis	22
n_outliers	1,798
outlier_rate	0.1105
zero_rate	0.8895
alert: high_skew	skew=+4.77
alert: outliers	11.0% rows beyond 1.5 IQR

Fig 79.

Distribution of PCBuddhism. Vertical dash marks the median.

Histogram bins for PCBuddhism (median: 0.0).
bin	count
0 – 2.5	15174
2.5 – 5	44
5 – 7.5	41
7.5 – 10	16
10 – 12.5	121
12.5 – 15	10
15 – 17.5	34
17.5 – 20	4
20 – 22.5	44
22.5 – 25	25
25 – 27.5	21
27.5 – 30	6
30 – 32.5	46
32.5 – 35	10
35 – 37.5	14
37.5 – 40	6
40 – 42.5	34
42.5 – 45	6
45 – 47.5	4
47.5 – 50	7
50 – 52.5	19
52.5 – 55	17
55 – 57.5	16
57.5 – 60	18
60 – 62.5	21
62.5 – 65	3
65 – 67.5	15
67.5 – 70	26
70 – 72.5	29
72.5 – 75	9
75 – 77.5	11
77.5 – 80	15
80 – 82.5	21
82.5 – 85	11
85 – 87.5	17
87.5 – 90	25
90 – 92.5	40
92.5 – 95	37
95 – 97.5	82
97.5 – 100	179
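
The treatment for this column pairs a zero-vs-nonzero indicator with a log1p transform. A minimal sketch of that feature pair follows; the function name `zero_inflated_features` is an assumption, and missing shares are passed through as None rather than imputed.

```python
import math

def zero_inflated_features(share):
    """Return (is_nonzero, log1p(share)) for a percentage share; None stays (None, None)."""
    if share is None:
        return None, None
    return int(share > 0), math.log1p(share)

# Min and max of the column, plus a missing value.
features = [zero_inflated_features(x) for x in (0.0, 100.0, None)]
print(features)
```

Because log1p(0) is 0, the transform leaves the zero spike in place and only compresses the positive tail, which is why the separate indicator is still useful.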

PCEthnicReligions numeric feature

PCEthnicReligions appears to be a percentage feature (0–100) capturing the share of ethnic/folk religion adherents per record. Just over half the rows are exactly zero (zero_rate 0.5045) and the median and Q1 are both 0, yet values stretch all the way to 100, producing moderate right skew (1.65, not enough to trigger the high_skew alert) and 1,967 flagged outliers (12.05%). Those outliers are the values above the upper fence of 62.5 (q3 + 1.5 × iqr = 25 + 37.5). The distribution is effectively zero-inflated rather than continuous.

Treatment: Model as zero-inflated: add an is_nonzero indicator and log1p-transform the positive values.

anthropic:claude-opus-4-7 · confidence high

Out[233]:

saturn.columns["PCEthnicReligions"].stats

stat	value
n	16,382
nulls	59 (0.4%)
unique	978
min	0
max	100
mean	17.6
median	0
std	29.02
q1	0
q3	25
iqr	25
skew	1.654
kurtosis	1.404
n_outliers	1,967
outlier_rate	0.1205
zero_rate	0.5045
alert: outliers	12.1% rows beyond 1.5 IQR

Fig 80.

Distribution of PCEthnicReligions. Vertical dash marks the median.

Histogram bins for PCEthnicReligions (median: 0.0).
bin	count
0 – 2.5	9029
2.5 – 5	551
5 – 7.5	772
7.5 – 10	242
10 – 12.5	674
12.5 – 15	96
15 – 17.5	298
17.5 – 20	100
20 – 22.5	411
22.5 – 25	53
25 – 27.5	309
27.5 – 30	59
30 – 32.5	395
32.5 – 35	92
35 – 37.5	225
37.5 – 40	58
40 – 42.5	219
42.5 – 45	21
45 – 47.5	120
47.5 – 50	30
50 – 52.5	130
52.5 – 55	29
55 – 57.5	147
57.5 – 60	51
60 – 62.5	245
62.5 – 65	25
65 – 67.5	106
67.5 – 70	48
70 – 72.5	188
72.5 – 75	22
75 – 77.5	109
77.5 – 80	51
80 – 82.5	235
82.5 – 85	53
85 – 87.5	134
87.5 – 90	73
90 – 92.5	220
92.5 – 95	119
95 – 97.5	258
97.5 – 100	326
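
The outlier counts for the percentage columns follow from the quartiles alone. The sketch below assumes the "beyond 1.5 IQR" alert uses standard Tukey fences (q1 − 1.5·iqr and q3 + 1.5·iqr); that assumption is consistent with the table above, where the bins from 62.5 upward sum to 1,967, the reported n_outliers. The function name is illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ethnic = tukey_fences(0.0, 25.0)    # PCEthnicReligions: q1=0, q3=25
buddhism = tukey_fences(0.0, 0.0)   # PCBuddhism: q1=q3=0
print(ethnic, buddhism)
```

When q1 and q3 are both 0 the fences collapse to (0, 0), so every nonzero value is flagged. That is why PCBuddhism's outlier_rate (0.1105) equals 1 minus its zero_rate (0.8895), and why the outlier alert on such columns says little about data quality.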

PCHinduism numeric feature

PCHinduism appears to be a per-record percentage share of Hinduism (0–100), with max 100.0 and min 0.0. The distribution is overwhelmingly zero (zero_rate 0.8343) so Q1, median, and Q3 are all 0.0, yet the mean is 14.01 with std 33.87. The nonzero mass is concentrated at the top: 1,845 rows fall in the 97.5–100 bin, so the column is close to bimodal rather than smoothly long-tailed. Because the IQR is 0, the 16.57% flagged as outliers are simply the nonzero rows (1 − 0.8343 = 0.1657), not evidence of dirty data.

Treatment: Treat as zero-inflated proportion: add a nonzero indicator and consider a log1p or sqrt transform before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[236]:

saturn.columns["PCHinduism"].stats

stat	value
n	16,382
nulls	105 (0.6%)
unique	1,412
min	0
max	100
mean	14.01
median	0
std	33.87
q1	0
q3	0
iqr	0
skew	2.058
kurtosis	2.3
n_outliers	2,697
outlier_rate	0.1657
zero_rate	0.8343
alert: high_skew	skew=+2.06
alert: outliers	16.6% rows beyond 1.5 IQR

Fig 81.

Distribution of PCHinduism. Vertical dash marks the median.

Histogram bins for PCHinduism (median: 0.0).
bin	count
0 – 2.5	13738
2.5 – 5	30
5 – 7.5	28
7.5 – 10	8
10 – 12.5	23
12.5 – 15	16
15 – 17.5	12
17.5 – 20	4
20 – 22.5	14
22.5 – 25	12
25 – 27.5	10
27.5 – 30	7
30 – 32.5	11
32.5 – 35	6
35 – 37.5	3
37.5 – 40	2
40 – 42.5	11
42.5 – 45	7
45 – 47.5	4
47.5 – 50	9
50 – 52.5	16
52.5 – 55	5
55 – 57.5	11
57.5 – 60	8
60 – 62.5	19
62.5 – 65	5
65 – 67.5	16
67.5 – 70	11
70 – 72.5	26
72.5 – 75	12
75 – 77.5	14
77.5 – 80	14
80 – 82.5	17
82.5 – 85	27
85 – 87.5	32
87.5 – 90	41
90 – 92.5	42
92.5 – 95	41
95 – 97.5	120
97.5 – 100	1845

PCOtherSmall numeric feature

PCOtherSmall is a numeric feature that, by analogy with the other PC* religion columns, appears to hold the percentage share of other, smaller religions. 89.3% of values are exactly zero and the median, Q1, and Q3 are all 0.0. The remaining mass is highly skewed (skew 11.0, kurtosis 124.0) with a max of 100.0. Because the IQR is 0, the 10.7% flagged as outliers are exactly the nonzero rows, which points to a sparse long-tailed distribution rather than a typical continuous feature.

Treatment: Treat as zero-inflated: add a binary is_nonzero flag and log1p-transform the positive tail before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[239]:

saturn.columns["PCOtherSmall"].stats

stat	value
n	16,382
nulls	104 (0.6%)
unique	908
min	0
max	100
mean	0.9613
median	0
std	8.299
q1	0
q3	0
iqr	0
skew	11
kurtosis	124
n_outliers	1,749
outlier_rate	0.1074
zero_rate	0.8926
alert: high_skew	skew=+11.00
alert: outliers	10.7% rows beyond 1.5 IQR

Fig 82.

Distribution of PCOtherSmall. Vertical dash marks the median.

Histogram bins for PCOtherSmall (median: 0.0).
bin	count
0 – 2.5	15753
2.5 – 5	155
5 – 7.5	119
7.5 – 10	20
10 – 12.5	36
12.5 – 15	9
15 – 17.5	10
17.5 – 20	6
20 – 22.5	6
22.5 – 25	22
25 – 27.5	3
27.5 – 30	4
30 – 32.5	6
32.5 – 35	0
35 – 37.5	1
37.5 – 40	3
40 – 42.5	3
42.5 – 45	1
45 – 47.5	0
47.5 – 50	0
50 – 52.5	0
52.5 – 55	1
55 – 57.5	3
57.5 – 60	2
60 – 62.5	3
62.5 – 65	0
65 – 67.5	0
67.5 – 70	4
70 – 72.5	4
72.5 – 75	0
75 – 77.5	1
77.5 – 80	0
80 – 82.5	2
82.5 – 85	2
85 – 87.5	2
87.5 – 90	0
90 – 92.5	2
92.5 – 95	0
95 – 97.5	2
97.5 – 100	93

RegionCode numeric feature

RegionCode is an integer-valued field ranging from 1 to 12 with only 12 unique values across 16382 rows and no nulls. The flat distribution (kurtosis -1.20, skew 0.23, no outliers) and small cardinality indicate a categorical region identifier encoded numerically rather than a true numeric measure.

Treatment: Cast to categorical and one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[242]:

saturn.columns["RegionCode"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	12
min	1
max	12
mean	5.935
median	5
std	3.42
q1	3
q3	8
iqr	5
skew	0.231
kurtosis	-1.201
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 83.

Distribution of RegionCode. Vertical dash marks the median.

Histogram bins for RegionCode (median: 5.0).
bin	count
1 – 1.275	1535
1.275 – 1.55	0
1.55 – 1.825	0
1.825 – 2.1	1922
2.1 – 2.375	0
2.375 – 2.65	0
2.65 – 2.925	0
2.925 – 3.2	709
3.2 – 3.475	0
3.475 – 3.75	0
3.75 – 4.025	3707
4.025 – 4.3	0
4.3 – 4.575	0
4.575 – 4.85	0
4.85 – 5.125	460
5.125 – 5.4	0
5.4 – 5.675	0
5.675 – 5.95	0
5.95 – 6.225	593
6.225 – 6.5	0
6.5 – 6.775	0
6.775 – 7.05	1276
7.05 – 7.325	0
7.325 – 7.6	0
7.6 – 7.875	0
7.875 – 8.15	2175
8.15 – 8.425	0
8.425 – 8.7	0
8.7 – 8.975	0
8.975 – 9.25	577
9.25 – 9.525	0
9.525 – 9.8	0
9.8 – 10.08	1116
10.08 – 10.35	0
10.35 – 10.62	0
10.62 – 10.9	0
10.9 – 11.18	1395
11.18 – 11.45	0
11.45 – 11.73	0
11.73 – 12	917
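
Since RegionCode is an identifier rather than a quantity, the treatment calls for categorical encoding. A minimal one-hot sketch follows; it assumes the category list is the 12 integer codes 1 to 12 seen in the stats, and the function name is illustrative.

```python
def one_hot(code, categories):
    """Return a 0/1 vector with a single 1 at the position of `code`."""
    if code not in categories:
        raise ValueError(f"unknown category: {code}")
    return [int(code == c) for c in categories]

region_codes = list(range(1, 13))   # codes 1..12, per the min/max/unique stats
vec = one_hot(4, region_codes)
print(vec)
```

Leaving the codes as integers would make a linear model treat region 12 as "twelve times" region 1, which has no meaning for an identifier.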

PopulationPGAC numeric feature

PopulationPGAC is a numeric population measure spanning 10 to 925,129,800 with a median of just 88,000. With only 2,250 unique values across 16,367 non-null rows, it reads as a group-level total repeated on every row of the group (the PGAC suffix may stand for 'people group across countries', which the stats alone cannot confirm) rather than a count for a geographic unit. The distribution is extremely right-skewed (skew 15.15, kurtosis 262.66) and 19.2% of rows flag as outliers, with the mean (8.8M) two orders of magnitude above the median. Nulls are negligible (0.09%) and there are no zeros, but the spread between Q3 (1.39M) and the max indicates a long heavy tail of very large groups.

Treatment: log-transform before any modelling or distance-based comparison.

anthropic:claude-opus-4-7 · confidence high

Out[245]:

saturn.columns["PopulationPGAC"].stats

stat	value
n	16,382
nulls	15 (0.1%)
unique	2,250
min	10
max	9.251e+08
mean	8.812e+06
median	88,000
std	5.114e+07
q1	8,800
q3	1.386e+06
iqr	1.377e+06
skew	15.15
kurtosis	262.7
n_outliers	3,145
outlier_rate	0.1922
zero_rate	0
alert: high_skew	skew=+15.15
alert: outliers	19.2% rows beyond 1.5 IQR

Fig 84.

Distribution of PopulationPGAC. Vertical dash marks the median.

Histogram bins for PopulationPGAC (median: 88000.0).
bin	count
10 – 2.313e+07	14975
2.313e+07 – 4.626e+07	660
4.626e+07 – 6.938e+07	344
6.938e+07 – 9.251e+07	143
9.251e+07 – 1.156e+08	17
1.156e+08 – 1.388e+08	108
1.388e+08 – 1.619e+08	0
1.619e+08 – 1.85e+08	0
1.85e+08 – 2.082e+08	78
2.082e+08 – 2.313e+08	0
2.313e+08 – 2.544e+08	0
2.544e+08 – 2.775e+08	0
2.775e+08 – 3.007e+08	0
3.007e+08 – 3.238e+08	0
3.238e+08 – 3.469e+08	0
3.469e+08 – 3.701e+08	0
3.701e+08 – 3.932e+08	0
3.932e+08 – 4.163e+08	0
4.163e+08 – 4.394e+08	0
4.394e+08 – 4.626e+08	0
4.626e+08 – 4.857e+08	0
4.857e+08 – 5.088e+08	0
5.088e+08 – 5.319e+08	0
5.319e+08 – 5.551e+08	0
5.551e+08 – 5.782e+08	0
5.782e+08 – 6.013e+08	0
6.013e+08 – 6.245e+08	0
6.245e+08 – 6.476e+08	0
6.476e+08 – 6.707e+08	0
6.707e+08 – 6.938e+08	0
6.938e+08 – 7.17e+08	0
7.17e+08 – 7.401e+08	0
7.401e+08 – 7.632e+08	0
7.632e+08 – 7.864e+08	0
7.864e+08 – 8.095e+08	0
8.095e+08 – 8.326e+08	0
8.326e+08 – 8.557e+08	0
8.557e+08 – 8.789e+08	0
8.789e+08 – 9.02e+08	0
9.02e+08 – 9.251e+08	42
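
The linear histogram above puts 14,975 rows in its first bin because the values span roughly eight orders of magnitude. A log transform spreads them out, as the sketch below shows for the reported min, median, and max. The function name is an assumption; base 10 is chosen only for readability, and since the column has no zeros no offset is needed.

```python
import math

def log10_population(pop):
    """Base-10 log of a positive population; None stays None."""
    if pop is None:
        return None
    if pop <= 0:
        raise ValueError("population must be positive")
    return math.log10(pop)

# Min, median and max from the PopulationPGAC stats table.
logs = [log10_population(p) for p in (10, 88_000, 925_129_800)]
print([round(x, 3) for x in logs])
```

On this scale the column runs from about 1 to about 9, with the median near 4.9.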

Frontier categorical feature

Binary Y/N flag named 'Frontier', fully populated across 16382 rows with only 2 distinct values. The 'N' class dominates at 70.9% versus 29.1% 'Y', giving an entropy ratio of 0.87 — moderately imbalanced but well within usable range.

Treatment: Encode as 0/1 boolean and use directly as a feature.

anthropic:claude-opus-4-7 · confidence high

Out[248]:

saturn.columns["Frontier"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	N
top_rate	0.709
cardinality	2
entropy	0.87
entropy_ratio	0.87

Fig 85.

Top values for Frontier.

Top values for Frontier (2 unique shown, of 2 total).
value	count	share
N	11615	70.9%
Y	4767	29.1%

MapAddress text foreign_key

MapAddress holds single-token PNG filenames (e.g. m00320.png), almost certainly references to map image assets. Over half the column is empty (8728 of 16382 rows) and 63.2% are duplicates, with only 6029 distinct values across 16382 rows. Every non-empty value is one word with max length 13, so this behaves like a sparse foreign key to a map asset rather than free text.

Treatment: Treat as a categorical asset reference; impute empties as 'none' and join to the map asset table.

anthropic:claude-opus-4-7 · confidence high

Out[251]:

saturn.columns["MapAddress"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	6,029
len_min	0
len_max	13
len_mean	5.153
len_median	0
len_p95	13
word_mean	1
word_median	1
n_empty	8,728
n_duplicates	10,353
duplicate_rate	0.632
vocab_size	6,028
readability_flesch_mean	13.52
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	63.2% duplicate strings

Fig 86.

Character-length distribution for MapAddress.

Character-length distribution for MapAddress (mean: 5.153094860212429).
chars	count
0	8728
10	5028
13	2626

HasJesusFilm categorical feature

Binary Y/N flag indicating whether the JESUS Film is available for each record, with no nulls across 16,382 rows. The split is roughly 2:1 in favour of 'Y' (10,816 vs 5,566; top_rate 0.660), giving a high entropy ratio of 0.925 — informative but mildly imbalanced.

Treatment: Encode as a 0/1 boolean for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[254]:

saturn.columns["HasJesusFilm"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.6602
cardinality	2
entropy	0.9246
entropy_ratio	0.9246

Fig 87.

Top values for HasJesusFilm.

Top values for HasJesusFilm (2 unique shown, of 2 total).
value	count	share
Y	10816	66.0%
N	5566	34.0%

Nomadic categorical feature

Binary Y/N flag indicating whether a record is 'Nomadic', with no nulls across 16382 rows. The distribution is severely imbalanced: 'N' covers 98.1% (16071) versus only 311 'Y' cases, yielding an entropy ratio of just 0.136.

Treatment: Encode as a boolean; consider class-weighting or resampling since positives are only ~1.9%.

anthropic:claude-opus-4-7 · confidence high

Out[257]:

saturn.columns["Nomadic"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	N
top_rate	0.981
cardinality	2
entropy	0.1357
entropy_ratio	0.1357
alert: imbalance	top value is 98.1% of rows

Fig 88.

Top values for Nomadic.

Top values for Nomadic (2 unique shown, of 2 total).
value	count	share
N	16071	98.1%
Y	311	1.9%
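
If Nomadic is used as a target, the class weighting mentioned in the treatment can be derived from the counts above. The sketch uses the common "balanced" heuristic, weight = n / (k × class count), which is one option among several and an assumption here rather than something Saturn prescribes.

```python
def balanced_class_weights(counts):
    """Weight each class by n / (k * count) so classes contribute equally in total."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# N and Y counts from the Nomadic table.
weights = balanced_class_weights({"N": 16071, "Y": 311})
print({label: round(w, 3) for label, w in weights.items()})
```

Each 'Y' row ends up weighted roughly 52 times as heavily as an 'N' row, which mirrors the 98.1% to 1.9% split.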

NomadicTypeDescription categorical metadata

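Categorical descriptor of nomadic livelihood type, with six values combining three base categories (Agro-Pastoralists, Service or Trade, Hunter-Gatherers) singly or in pairs. The column is 98.1% null, populated for only 311 of 16,382 rows, and among those Agro-Pastoralists dominates at 68.2% of non-nulls (the share column in the table below is relative to all rows instead). The null count (16,071) equals the number of Nomadic = 'N' rows, so a null most likely means 'not applicable' rather than 'unknown'. The sparsity makes this effectively a rare annotation rather than a general feature.

Treatment: Treat as sparse metadata; fill nulls with a 'Not nomadic' category if the match with Nomadic = 'N' holds row by row, or drop unless modelling the populated subset.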

anthropic:claude-opus-4-7 · confidence high

Out[260]:

saturn.columns["NomadicTypeDescription"].stats

stat	value
n	16,382
nulls	16,071 (98.1%)
unique	6
top_value	Agro-Pastoralists
top_rate	0.6817
cardinality	6
entropy	1.341
entropy_ratio	0.5187
alert: null_rate	98.1% null

Fig 89.

Top values for NomadicTypeDescription.

Top values for NomadicTypeDescription (6 unique shown, of 6 total).
value	count	share
Agro-Pastoralists	212	1.3%
Service or Trade	68	0.4%
Agro-Pastoralists, Service or Trade	16	0.1%
Hunter-Gatherers	11	0.1%
Agro-Pastoralists, Hunter-Gatherers	2	0.0%
Service or Trade, Hunter-Gatherers	2	0.0%

PhotoCCVersionText categorical metadata

This column records the Creative Commons license version attached to a photo, with 17 distinct values across 16,382 rows and no nulls. It is dominated by empty strings at 83.6% (13,688 rows), leaving only ~16% with an actual license tag — the most common being 'CC BY 2.0' (661) and 'CC BY-NC-SA 2.0' (440). Low entropy ratio (0.28) confirms the field is sparse in practice despite zero technical nulls.

Treatment: Treat empty string as missing and one-hot encode the remaining license categories.

anthropic:claude-opus-4-7 · confidence high

Out[263]:

saturn.columns["PhotoCCVersionText"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	17
top_value	(empty string)
top_rate	0.8356
cardinality	17
entropy	1.137
entropy_ratio	0.2781

Fig 90.

Top values for PhotoCCVersionText.

Top values for PhotoCCVersionText (17 unique shown, of 17 total).
value	count	share
(empty string)	13688	83.6%
CC BY 2.0	661	4.0%
CC BY-NC-SA 2.0	440	2.7%
CC BY-SA 4.0	425	2.6%
CC BY-SA 2.0	332	2.0%
CC0 1.0	247	1.5%
CC BY-NC 2.0	230	1.4%
CC BY-SA 3.0	219	1.3%
CC BY 3.0	35	0.2%
CC BY-NC-ND 2.0	35	0.2%
CC BY 3.0 BR	21	0.1%
CC BY 4.0	20	0.1%
CC BY-SA 2.5	11	0.1%
CC BY-ND 2.0	9	0.1%
CC SA 1.0	7	0.0%
CC BY-SA 3.0 DE	1	0.0%
CC BY-NC-SA 4.0	1	0.0%
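
The column reports 0 nulls only because missing licenses are stored as empty strings. A minimal sketch of converting blanks to missing before encoding follows; the helper name and the four sample values are illustrative assumptions.

```python
def blank_to_none(value):
    """Map None, empty, or whitespace-only strings to None; keep everything else."""
    if value is None or value.strip() == "":
        return None
    return value

# Illustrative values in the same format as the column.
licenses = ["", "CC BY 2.0", "  ", "CC0 1.0"]
cleaned = [blank_to_none(v) for v in licenses]
missing_rate = sum(v is None for v in cleaned) / len(cleaned)
print(cleaned, missing_rate)
```

Applied to the real column, the missing rate after this step should match the 83.6% share of the blank value in the table above.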

PhotoCCVersionURL categorical metadata

This column holds Creative Commons license URLs associated with photos, drawn from a closed vocabulary of 17 distinct values. It is overwhelmingly empty: 13,688 of 16,382 rows (top_rate 0.836) carry the blank string rather than a license, leaving CC BY 2.0 (661) and CC BY-NC-SA 2.0 (440) as the most common actual licenses. Entropy ratio of 0.278 confirms the distribution is highly concentrated on the empty value.

Treatment: Treat blank as missing and bucket the remaining license URLs into a low-cardinality categorical.

anthropic:claude-opus-4-7 · confidence high

Out[266]:

saturn.columns["PhotoCCVersionURL"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	17
top_value	(empty string)
top_rate	0.8356
cardinality	17
entropy	1.137
entropy_ratio	0.2781

Fig 91.

Top values for PhotoCCVersionURL.

Top values for PhotoCCVersionURL (17 unique shown, of 17 total).
value	count	share
(empty string)	13688	83.6%
https://creativecommons.org/licenses/by/2.0/	661	4.0%
https://creativecommons.org/licenses/by-nc-sa/2.0/	440	2.7%
https://creativecommons.org/licenses/by-sa/4.0/	425	2.6%
https://creativecommons.org/licenses/by-sa/2.0/	332	2.0%
https://creativecommons.org/publicdomain/zero/1.0/	247	1.5%
https://creativecommons.org/licenses/by-nc/2.0/	230	1.4%
https://creativecommons.org/licenses/by-sa/3.0/	219	1.3%
https://creativecommons.org/licenses/by/3.0/	35	0.2%
https://creativecommons.org/licenses/by-nc-nd/2.0/	35	0.2%
https://creativecommons.org/licenses/by/3.0/br/deed.en	21	0.1%
https://creativecommons.org/licenses/by/4.0/	20	0.1%
https://creativecommons.org/licenses/by-sa/2.5/	11	0.1%
https://creativecommons.org/licenses/by-nd/2.0/	9	0.1%
https://creativecommons.org/licenses/by-sa/1.0/	7	0.0%
https://creativecommons.org/licenses/by-sa/3.0/de/deed.en	1	0.0%
https://creativecommons.org/licenses/by-nc-sa/4.0	1	0.0%

MapCredits categorical metadata

Attribution string for the map asset associated with each record, naming data, geography, and design contributors. Over half the rows (top_rate 0.533, 8,733 of 16,382) carry an empty string, and the remaining mass is spread across 198 non-empty credit lines, some of which are near-duplicates: for example, two variants of the Omid/UNESCO/GMI credit differ only by a trailing period (2,228 vs 864). The entropy ratio of 0.357 and the long_tail alert (114 singleton categories) confirm a few dominant phrasings plus a sparse tail.

Treatment: Treat as provenance metadata; normalize whitespace/punctuation to collapse duplicate credit strings, and exclude from modelling.
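
The normalization step can be sketched as follows, using the two Omid/UNESCO/GMI variants from the value table. The function name `normalize_credit` and the specific rules (collapse whitespace, drop one trailing period) are assumptions for illustration; real credit lines may need further rules.

```python
import re
from collections import Counter


def normalize_credit(text):
    """Collapse runs of whitespace and drop a trailing period."""
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip(".").strip()


# Counts copied from the top-values table for MapCredits.
counts = Counter({
    "People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project": 2228,
    "People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project.": 864,
    "Anonymous": 759,
})

merged = Counter()
for credit, n in counts.items():
    merged[normalize_credit(credit)] += n
```

After normalization the two Omid variants fall into one category, so their counts add up instead of splitting the same credit across two levels.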

anthropic:claude-opus-4-7 · confidence high

Out[269]:

saturn.columns["MapCredits"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	199
top_value	(empty string)
top_rate	0.5331
cardinality	199
entropy	2.726
entropy_ratio	0.357
alert: long_tail	114 singleton categories

Fig 92.

Top values for MapCredits.

Top values for MapCredits (20 unique shown, of 199 total).
value	count	share
(empty string)	8733	53.3%
People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project	2228	13.6%
People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project.	864	5.3%
Location: IMB. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	853	5.2%
Anonymous	759	4.6%
People Group Location: Omid. Other geography / data: GMI. Map Design: Joshua Project	603	3.7%
Bethany World Prayer Center	582	3.6%
Joshua Project / Global Mapping International	520	3.2%
Asia Harvest-Operation Myanmar	144	0.9%
Bryan Nicholson / cartoMission	117	0.7%
Mark Stevens	96	0.6%
Location: SIL / WLMS. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	85	0.5%
West Melanesia	78	0.5%
Southeast Asia Link - SEALINK	68	0.4%
NCRP	54	0.3%
Location: WLMS. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	49	0.3%
Rodrigo Tinoco - https://www.data4mission.com/	44	0.3%
Location: Web research. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	35	0.2%
Location: World Jewish Congress, Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	26	0.2%
Location: Joshua Project. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	23	0.1%

MapCreditURL categorical metadata

Optional attribution URL for the source map of each record, blank in 15,891 of 16,382 rows (top_rate 0.97). Only 50 distinct non-blank values populate the remaining 491 rows (3%), led by asiaharvest.org (146) and cartomission.com (117), giving a very low entropy_ratio of 0.052. The values are not a controlled vocabulary: a mailto: address sits alongside web URLs, and lingvarium.org appears under both http and https.

Treatment: Drop from modelling; retain only as a provenance link for the ~3% of rows that carry a value.

anthropic:claude-opus-4-7 · confidence high

Out[272]:

saturn.columns["MapCreditURL"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	51
top_value	(empty string)
top_rate	0.97
cardinality	51
entropy	0.2954
entropy_ratio	0.05207
alert: long_tail	26 singleton categories
alert: imbalance	top value is 97.0% of rows

Fig 93.

Top values for MapCreditURL.

Top values for MapCreditURL (20 unique shown, of 51 total).
value	count	share
(empty string)	15891	97.0%
https://www.asiaharvest.org	146	0.9%
https://www.cartomission.com	117	0.7%
https://www.westmelanesia.com/	78	0.5%
https://www.worldjewishcongress.org/	26	0.2%
https://www.census.gov/prod/2004pubs/c2kbr-35.pdf	16	0.1%
https://commons.wikimedia.org/wiki/File:Caucasus_ethnic.jpg	11	0.1%
https://www.npolar.no/ansipra/english/Indexpages/Map_index.html	11	0.1%
https://www.eki.ee/books/redbook/introduction.shtml	9	0.1%
mailto:cambodia.peoples@gmail.com	9	0.1%
https://commons.wikimedia.org/wiki/File:Maeneo_penye_wasemaji_wa_Kiswahili.png	8	0.0%
https://www.cartpioneers.org/products/Peoples-of-Yemen-Prayer-Guide.html	4	0.0%
https://www.burmaissues.org/En/ethnicgroups1.html	4	0.0%
https://www.face-music.ch/bi_bid/trad_costumes_en.html	3	0.0%
https://thekurds.net/	3	0.0%
http://lingvarium.org/	2	0.0%
https://lingvarium.org/	2	0.0%
https://commons.wikimedia.org/wiki/File:Basque_Country_Location_Map.svg	2	0.0%
https://commons.wikimedia.org/wiki/File:Mapuche_du_Chili_et_d%27Argentine.gif	2	0.0%
https://commons.wikimedia.org/wiki/File:Tlingit-map.png	2	0.0%

MapCopyright categorical feature

A near-binary flag (with blanks) indicating map copyright status, dominated by 'N' at 86.5% of 16,382 rows. Only 118 records carry 'Y' and 2,100 are empty strings, giving just 3 distinct values and an entropy ratio of 0.387. The empty category is large enough that it should be treated as its own level rather than silently coerced.

Treatment: Encode as a 3-level categorical (N / Y / blank); low signal due to severe imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[275]:

saturn.columns["MapCopyright"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	3
top_value	N
top_rate	0.8646
cardinality	3
entropy	0.6126
entropy_ratio	0.3865

Fig 94.

Top values for MapCopyright.

Top values for MapCopyright (3 unique shown, of 3 total).
value	count	share
N	14164	86.5%
(empty string)	2100	12.8%
Y	118	0.7%

MapCCVersionText categorical metadata

This column records Creative Commons licence versions for map content, but 16347 of 16382 rows (top_rate 0.9979) are empty strings, leaving only 35 rows spread across five actual licences (CC0 1.0, CC BY-SA 3.0, CC BY 3.0, CC BY 2.0, CC BY-SA 4.0). Entropy ratio of 0.0099 confirms the column carries almost no information. Note that nulls are reported as 0.0 because the missing values are stored as empty strings rather than true nulls.

Treatment: Drop or collapse to a binary has_licence flag; too sparse to use as a feature.
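
As a rough sketch, the has_licence flag only needs to distinguish blank cells from populated ones. The helper below is hypothetical and assumes values arrive as strings (or None).

```python
def has_licence(value):
    """Return 1 when the cell holds a non-blank licence string, else 0."""
    return int(bool(value and value.strip()))


# Blank, whitespace-only, and None cells all count as "no licence".
sample = ["", "CC0 1.0", "   ", "CC BY-SA 3.0", None]
flags = [has_licence(v) for v in sample]
```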

anthropic:claude-opus-4-7 · confidence high

Out[278]:

saturn.columns["MapCCVersionText"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	6
top_value	(empty string)
top_rate	0.9979
cardinality	6
entropy	0.02557
entropy_ratio	0.009891
alert: imbalance	top value is 99.8% of rows

Fig 95.

Top values for MapCCVersionText.

Top values for MapCCVersionText (6 unique shown, of 6 total).
value	count	share
(empty string)	16347	99.8%
CC0 1.0	17	0.1%
CC BY-SA 3.0	13	0.1%
CC BY 3.0	2	0.0%
CC BY 2.0	2	0.0%
CC BY-SA 4.0	1	0.0%

MapCCVersionURL categorical metadata

MapCCVersionURL appears to be a Creative Commons license URL field attached to map records, with five distinct CC variants observed (CC0, BY-SA 3.0/4.0, BY 2.0/3.0). It is effectively empty: 16,347 of 16,382 rows (top_rate 0.9979) are blank strings, leaving only 35 rows with an actual license URL and entropy_ratio of 0.0099. Null_rate is reported as 0.0 because empties are stored as '' rather than nulls, which is itself worth flagging.

Treatment: Drop or collapse to a binary has_license flag; near-constant empty string.

anthropic:claude-opus-4-7 · confidence high

Out[281]:

saturn.columns["MapCCVersionURL"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	6
top_value	(empty string)
top_rate	0.9979
cardinality	6
entropy	0.02557
entropy_ratio	0.009891
alert: imbalance	top value is 99.8% of rows

Fig 96.

Top values for MapCCVersionURL.

Top values for MapCCVersionURL (6 unique shown, of 6 total).
value	count	share
(empty string)	16347	99.8%
https://creativecommons.org/publicdomain/zero/1.0/	17	0.1%
https://creativecommons.org/licenses/by-sa/3.0/	13	0.1%
https://creativecommons.org/licenses/by/3.0/	2	0.0%
https://creativecommons.org/licenses/by/2.0/	2	0.0%
https://creativecommons.org/licenses/by-sa/4.0/	1	0.0%

JF categorical feature

JF is a binary Y/N flag with no nulls across 16382 rows. The split is moderately imbalanced, with Y dominating at 66.0% (10816) versus N at 34.0% (5566), yielding a high entropy ratio of 0.925.

Treatment: Encode as a 0/1 indicator before modelling.
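
One way to do the 0/1 encoding while guarding against values other than Y and N. The mapping and function name are illustrative, and the same helper would apply to the other Y/N flags in this report.

```python
YN_CODES = {"Y": 1, "N": 0}


def encode_yn(values):
    """Encode Y/N flags as 1/0, failing loudly on any other value."""
    unexpected = sorted(set(values) - set(YN_CODES))
    if unexpected:
        raise ValueError(f"unexpected flag values: {unexpected}")
    return [YN_CODES[v] for v in values]


encoded = encode_yn(["Y", "N", "Y", "Y"])
```

Raising on unknown values is a deliberate choice here: a column like MapCopyright, which also contains blanks, would fail rather than be silently coerced.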

anthropic:claude-opus-4-7 · confidence high

Out[284]:

saturn.columns["JF"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.6602
cardinality	2
entropy	0.9246
entropy_ratio	0.9246

Fig 97.

Top values for JF.

Top values for JF (2 unique shown, of 2 total).
value	count	share
Y	10816	66.0%
N	5566	34.0%
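
The entropy and entropy_ratio figures in these stats tables are consistent with Shannon entropy in bits, divided by log2 of the cardinality. That reading is inferred from the numbers, not from saturn's source, so treat the sketch below as a cross-check and not as the profiler's actual implementation.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


jf_ratio = entropy_ratio([10816, 5566])                # JF: Y, N
map_copyright_ratio = entropy_ratio([14164, 2100, 118])  # MapCopyright: N, blank, Y
```

Both results land within rounding of the reported values (0.9246 for JF and 0.3865 for MapCopyright). For a two-class column, log2(2) = 1, which is why entropy and entropy_ratio are identical in the JF table.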

AudioRecordings categorical feature

Binary Y/N flag indicating whether audio recordings are present, with no nulls across 16,382 rows. The distribution is imbalanced: 'Y' dominates at 82.3% (13,479) versus 2,903 'N' values, yielding an entropy ratio of 0.67.

Treatment: Encode as a 0/1 boolean; be aware of the ~82/18 class imbalance when using as a predictor.

anthropic:claude-opus-4-7 · confidence high

Out[287]:

saturn.columns["AudioRecordings"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.8228
cardinality	2
entropy	0.6739
entropy_ratio	0.6739

Fig 98.

Top values for AudioRecordings.

Top values for AudioRecordings (2 unique shown, of 2 total).
value	count	share
Y	13479	82.3%
N	2903	17.7%

Window1040 categorical feature

Window1040 is a binary Y/N flag covering all 16,382 rows with no nulls. The split is nearly even (Y at 8,572 rows, 52.3%; N at 7,810 rows, 47.7%), giving an entropy ratio of 0.998, essentially maximum uncertainty for a two-class field. The profile carries no column description; the name most likely refers to the '10/40 Window' (roughly the band between 10 and 40 degrees north latitude), but confirm this against the Joshua Project data dictionary. The balanced distribution makes it a usable feature rather than a degenerate constant.

Treatment: Encode as a 0/1 indicator for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[290]:

saturn.columns["Window1040"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.5233
cardinality	2
entropy	0.9984
entropy_ratio	0.9984

Fig 99.

Top values for Window1040.

Top values for Window1040 (2 unique shown, of 2 total).
value	count	share
Y	8572	52.3%
N	7810	47.7%

PeopleGroupMapURL text metadata

This column holds URLs to people-group map images hosted on joshuaproject.net, with every non-empty entry being a single token link. Over half the rows (8,728 of 16,382) are empty strings, so the url_rate of 0.47 is simply the share of non-empty rows (7,654). Map URLs also recur: 6,028 distinct URLs cover those 7,654 rows, although the overall duplicate_rate of 0.63 is driven mostly by the repeated empty string.

Treatment: Treat as an optional asset URL: keep as-is for display, or drop if not rendering images.

anthropic:claude-opus-4-7 · confidence high

Out[293]:

saturn.columns["PeopleGroupMapURL"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	6,029
len_min	0
len_max	66
len_mean	29.92
len_median	0
len_p95	66
word_mean	1
word_median	1
n_empty	8,728
n_duplicates	10,353
duplicate_rate	0.632
vocab_size	6,028
readability_flesch_mean	-329.1
emoji_rate	0
url_rate	0.4672
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: url_heavy	46.7% rows contain a URL
alert: duplicates	63.2% duplicate strings

Fig 100.

Character-length distribution for PeopleGroupMapURL.

Character-length distribution for PeopleGroupMapURL (mean: 29.91576120131852).
chars	count
0 – 2	8728
2 – 3	0
3 – 5	0
5 – 7	0
7 – 8	0
8 – 10	0
10 – 12	0
12 – 13	0
13 – 15	0
15 – 16	0
16 – 18	0
18 – 20	0
20 – 21	0
21 – 23	0
23 – 25	0
25 – 26	0
26 – 28	0
28 – 30	0
30 – 31	0
31 – 33	0
33 – 35	0
35 – 36	0
36 – 38	0
38 – 40	0
40 – 41	0
41 – 43	0
43 – 45	0
45 – 46	0
46 – 48	0
48 – 50	0
50 – 51	0
51 – 53	0
53 – 54	0
54 – 56	0
56 – 58	0
58 – 59	0
59 – 61	0
61 – 63	0
63 – 64	5028
64 – 66	2626

PeopleGroupMapExpandedURL text metadata

This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net, at most one link per row. It is mostly empty: 9,538 of 16,382 rows are blank (the modal value), which is why len_median is 0 and url_rate is only 0.418. The overall duplicate_rate of 0.661 is driven largely by those blanks, but populated rows are reused too: 5,560 distinct URLs cover 6,844 non-empty rows, and a single map (m00320.pdf) appears 96 times, indicating that some people-group records share the same regional map.

Treatment: Treat as an optional reference link; drop for modelling or keep only as a join key to map assets.

anthropic:claude-opus-4-7 · confidence high

Out[296]:

saturn.columns["PeopleGroupMapExpandedURL"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	5,561
len_min	0
len_max	66
len_mean	26.74
len_median	0
len_p95	66
word_mean	1
word_median	1
n_empty	9,538
n_duplicates	10,821
duplicate_rate	0.6605
vocab_size	5,560
readability_flesch_mean	-256.6
emoji_rate	0
url_rate	0.4178
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: url_heavy	41.8% rows contain a URL
alert: duplicates	66.1% duplicate strings

Fig 101.

Character-length distribution for PeopleGroupMapExpandedURL.

Character-length distribution for PeopleGroupMapExpandedURL (mean: 26.739592235380297).
chars	count
0 – 2	9538
2 – 3	0
3 – 5	0
5 – 7	0
7 – 8	0
8 – 10	0
10 – 12	0
12 – 13	0
13 – 15	0
15 – 16	0
16 – 18	0
18 – 20	0
20 – 21	0
21 – 23	0
23 – 25	0
25 – 26	0
26 – 28	0
28 – 30	0
30 – 31	0
31 – 33	0
33 – 35	0
35 – 36	0
36 – 38	0
38 – 40	0
40 – 41	0
41 – 43	0
43 – 45	0
45 – 46	0
46 – 48	0
48 – 50	0
50 – 51	0
51 – 53	0
53 – 54	0
54 – 56	0
56 – 58	0
58 – 59	0
59 – 61	0
61 – 63	0
63 – 64	4552
64 – 66	2292

PeopleGroupURL text identifier

This column holds Joshua Project people-group URLs, one per row, with a perfectly fixed length of 48 characters and exactly one 'word' per cell. Every value is unique across all 16382 rows (n_unique equals n, duplicate_rate 0.0) and url_rate is 1.0, so it functions as a row-level identifier rather than analysable text. The URL pattern encodes a numeric people-group id plus a two-letter country suffix (e.g. /10375/tz, /10375/up), meaning the same group repeats across countries via different URLs.

Treatment: Drop from modelling; keep as a row key or parse out the people-group id and country code if a join is needed.
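
A sketch of the parsing step, keyed only on the trailing `/<id>/<cc>` pattern described above. The full example URL is an assumption: only the `/10375/tz` tail appears in this report, and the `people_groups` path segment is a guess that happens to give the reported 48-character length.

```python
import re

# Match a numeric id followed by a two-letter suffix at the end of the URL.
_TAIL = re.compile(r"/(\d+)/([A-Za-z]{2})/?$")


def parse_people_group_url(url):
    """Return (people_group_id, country_code), or None if the tail does not match."""
    match = _TAIL.search(url)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).upper()


# Hypothetical full URL; only the '/10375/tz' tail is taken from the report.
example = "https://joshuaproject.net/people_groups/10375/tz"
parsed = parse_people_group_url(example)
```

Matching on the tail alone keeps the parser independent of the exact host and path, which this profile does not show.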

anthropic:claude-opus-4-7 · confidence high

Out[299]:

saturn.columns["PeopleGroupURL"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	16,382
len_min	48
len_max	48
len_mean	48
len_median	48
len_p95	48
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	16,382
readability_flesch_mean	-476.9
emoji_rate	0
url_rate	1
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL

Fig 102.

Character-length distribution for PeopleGroupURL.

Character-length distribution for PeopleGroupURL (mean: 48.0).
chars	count
48	16382

PeopleGroupPhotoURL text metadata

This column holds URLs to people-group profile photos hosted on joshuaproject.net, with every non-empty value being a single token (one_word_rate 1.0, url_rate 0.65). Notably, 5,719 of 16,382 rows are empty strings, and those blanks account for roughly half of the 11,105 duplicates (duplicate_rate 0.68). Among the 10,663 populated rows there are only 5,276 distinct URLs, meaning many groups share the same photo. The top URL alone repeats 92 times.

Treatment: Treat as an optional image asset link; drop for modelling or use only to fetch images, and handle the 5719 empty values as missing.

anthropic:claude-opus-4-7 · confidence high

Out[302]:

saturn.columns["PeopleGroupPhotoURL"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	5,277
len_min	0
len_max	68
len_mean	42.32
len_median	65
len_p95	65
word_mean	1
word_median	1
n_empty	5,719
n_duplicates	11,105
duplicate_rate	0.6779
vocab_size	5,276
readability_flesch_mean	-585.6
emoji_rate	0
url_rate	0.6509
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: url_heavy	65.1% rows contain a URL
alert: duplicates	67.8% duplicate strings

Fig 103.

Character-length distribution for PeopleGroupPhotoURL.

Character-length distribution for PeopleGroupPhotoURL (mean: 42.321511414967645).
chars	count
0 – 2	5719
2 – 3	0
3 – 5	0
5 – 7	0
7 – 8	0
8 – 10	0
10 – 12	0
12 – 14	0
14 – 15	0
15 – 17	0
17 – 19	0
19 – 20	0
20 – 22	0
22 – 24	0
24 – 26	0
26 – 27	0
27 – 29	0
29 – 31	0
31 – 32	0
32 – 34	0
34 – 36	0
36 – 37	0
37 – 39	0
39 – 41	0
41 – 42	0
42 – 44	0
44 – 46	0
46 – 48	0
48 – 49	0
49 – 51	0
51 – 53	0
53 – 54	0
54 – 56	0
56 – 58	0
58 – 60	0
60 – 61	0
61 – 63	0
63 – 65	0
65 – 66	10591
66 – 68	72

CountryURL categorical metadata

URL pointing to a country page on joshuaproject.net, with the trailing two-letter code acting as the country identifier. 238 distinct countries appear across 16,382 rows with no nulls, and India (IN) dominates at 13.8% (2,262 rows) followed by PP (883) and ID (788). High entropy ratio (0.79) indicates the distribution is broad rather than concentrated despite the India lead.

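Treatment: Strip the URL prefix and keep the two-letter country code as a categorical feature. The suffix probably duplicates the ROG3 column, which also has 238 distinct values. The codes appear to follow a scheme other than ISO 3166-1 alpha-2 (frequent values such as PP do not read as ISO codes), so confirm the code system before joining to external country tables.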
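
Stripping the prefix can be as simple as taking the last path segment, as in this sketch. The function name and the two-letter sanity check are illustrative additions.

```python
def country_code(url):
    """Return the trailing two-letter code of a country URL, or None if absent."""
    code = url.rstrip("/").rsplit("/", 1)[-1]
    return code if len(code) == 2 and code.isalpha() else None


# URL copied from the top-values table for CountryURL.
top_code = country_code("https://joshuaproject.net/countries/IN")
```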

anthropic:claude-opus-4-7 · confidence high

Out[305]:

saturn.columns["CountryURL"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	238
top_value	https://joshuaproject.net/countries/IN
top_rate	0.1381
cardinality	238
entropy	6.225
entropy_ratio	0.7885

Fig 104.

Top values for CountryURL.

Top values for CountryURL (20 unique shown, of 238 total).
value	count	share
https://joshuaproject.net/countries/IN	2262	13.8%
https://joshuaproject.net/countries/PP	883	5.4%
https://joshuaproject.net/countries/ID	788	4.8%
https://joshuaproject.net/countries/PK	775	4.7%
https://joshuaproject.net/countries/CH	547	3.3%
https://joshuaproject.net/countries/NI	535	3.3%
https://joshuaproject.net/countries/US	496	3.0%
https://joshuaproject.net/countries/MX	333	2.0%
https://joshuaproject.net/countries/BR	321	2.0%
https://joshuaproject.net/countries/CM	292	1.8%
https://joshuaproject.net/countries/BG	278	1.7%
https://joshuaproject.net/countries/CA	243	1.5%
https://joshuaproject.net/countries/CG	231	1.4%
https://joshuaproject.net/countries/BM	218	1.3%
https://joshuaproject.net/countries/AS	205	1.3%
https://joshuaproject.net/countries/RP	200	1.2%
https://joshuaproject.net/countries/SU	198	1.2%
https://joshuaproject.net/countries/NP	195	1.2%
https://joshuaproject.net/countries/LA	184	1.1%
https://joshuaproject.net/countries/MY	183	1.1%

JPScaleText categorical label

JPScaleText is a 5-level ordinal categorical describing how 'reached' a people group is, ranging from 'Unreached' to 'Significantly Reached'. The distribution is skewed toward the lowest level: 'Unreached' covers 43.5% of 16,382 rows (7,124), while 'Minimally Reached' is the rarest at 1,009. There are no nulls, and the entropy ratio of 0.87 indicates that all five levels are well represented despite the skew.

Treatment: Encode as an ordinal (Unreached < Minimally < Superficially < Partially < Significantly) before modelling.
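
A sketch of the ordinal encoding using the five labels from the value table. The 1 to 5 ranks are a choice made here; they line up with the gauge-1 to gauge-5 images in JPScaleImageURL, whose row counts match these labels one for one.

```python
JP_SCALE_ORDER = [
    "Unreached",
    "Minimally Reached",
    "Superficially Reached",
    "Partially Reached",
    "Significantly Reached",
]
JP_SCALE_RANK = {label: rank for rank, label in enumerate(JP_SCALE_ORDER, start=1)}


def encode_jp_scale(labels):
    """Map each label to its 1-5 rank; raises KeyError on an unknown label."""
    return [JP_SCALE_RANK[label] for label in labels]


ranks = encode_jp_scale(["Unreached", "Partially Reached", "Minimally Reached"])
```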

anthropic:claude-opus-4-7 · confidence high

Out[308]:

saturn.columns["JPScaleText"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	5
top_value	Unreached
top_rate	0.4349
cardinality	5
entropy	2.017
entropy_ratio	0.8688

Fig 105.

Top values for JPScaleText.

Top values for JPScaleText (5 unique shown, of 5 total).
value	count	share
Unreached	7124	43.5%
Partially Reached	3636	22.2%
Significantly Reached	3200	19.5%
Superficially Reached	1413	8.6%
Minimally Reached	1009	6.2%

JPScaleImageURL categorical metadata

This column holds URLs to one of five 'gauge' images on joshuaproject.net, almost certainly a visual encoding of an ordinal Joshua Project Scale (1-5). Distribution is uneven: gauge-1 dominates at 43.5% of 16,382 rows, while gauge-2 is rarest at 1,009 rows, and entropy ratio is 0.87. No nulls, but the URL itself carries no information beyond the underlying 1-5 code.

Treatment: Extract the trailing digit as an ordinal feature and drop the URL string.
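
Extracting the digit is a one-line regular expression over the `gauge-N.png` file name seen in the value table. The helper name is hypothetical.

```python
import re

# The five observed URLs all end in gauge-1.png through gauge-5.png.
_GAUGE = re.compile(r"gauge-([1-5])\.png$")


def gauge_level(url):
    """Return the 1-5 gauge level encoded in the image URL, or None."""
    match = _GAUGE.search(url)
    return int(match.group(1)) if match else None


level = gauge_level("https://joshuaproject.net/assets/img/gauge/gauge-4.png")
```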

anthropic:claude-opus-4-7 · confidence high

Out[311]:

saturn.columns["JPScaleImageURL"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	5
top_value	https://joshuaproject.net/assets/img/gauge/gauge-1.png
top_rate	0.4349
cardinality	5
entropy	2.017
entropy_ratio	0.8688

Fig 106.

Top values for JPScaleImageURL.

Top values for JPScaleImageURL (5 unique shown, of 5 total).
value	count	share
https://joshuaproject.net/assets/img/gauge/gauge-1.png	7124	43.5%
https://joshuaproject.net/assets/img/gauge/gauge-4.png	3636	22.2%
https://joshuaproject.net/assets/img/gauge/gauge-5.png	3200	19.5%
https://joshuaproject.net/assets/img/gauge/gauge-3.png	1413	8.6%
https://joshuaproject.net/assets/img/gauge/gauge-2.png	1009	6.2%

Summary text free_text

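A short ethnographic descriptor field, likely a people-group summary paragraph in English. The column is dominated by emptiness: 12,328 of 16,382 rows are blank (hence the median length of 0), and the one_word alert at 75.3% equals that blank share and does not indicate genuinely one-word summaries. The headline duplicate_rate of 0.77 is also mostly the repeated empty string: the 4,054 populated rows hold 3,777 distinct texts, with the repetition concentrated in Rajput and Jat write-ups that recur dozens of times in slight variants. Populated texts can be substantial (overall len_p95 = 719, max 1,212). The Flesch mean of 13.2 suggests dense prose, but it may be averaged over blank rows as well, so treat it as indicative only.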

Treatment: Deduplicate and drop empties before any NLP; treat as supplementary description rather than a per-row feature.
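
A minimal sketch of the drop-empties and exact-deduplicate step, assuming the column is available as a list of strings. The sample strings are placeholders, not values from the dataset.

```python
from collections import Counter


def template_counts(texts):
    """Count each distinct non-empty text after trimming surrounding whitespace."""
    return Counter(t.strip() for t in texts if t and t.strip())


# Placeholder strings; the real column holds paragraph-length summaries.
sample = ["", "Text A", "Text A ", "Text B", ""]
counts = template_counts(sample)
unique_texts = list(counts)
```

Keeping the counts alongside the unique texts makes it easy to see which write-ups are reused across many people-group records.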

anthropic:claude-opus-4-7 · confidence high

Out[314]:

saturn.columns["Summary"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	3,778
len_min	0
len_max	1,212
len_mean	137.7
len_median	0
len_p95	719
word_mean	23.34
word_median	1
n_empty	12,328
n_duplicates	12,604
duplicate_rate	0.7694
vocab_size	24,964
readability_flesch_mean	13.16
emoji_rate	0
url_rate	0
one_word_rate	0.7525
allcaps_rate	0
boilerplate_rate	0.0001221
alert: one_word	75.3% rows are a single word
alert: duplicates	76.9% duplicate strings
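
A quick arithmetic check on the one_word alert: in each free-text column of this report, the reported one_word_rate equals n_empty / n to four decimals. That suggests, as an inference from the numbers and not a documented saturn behaviour, that empty strings are being counted as single-word rows.

```python
N_ROWS = 16382

# (n_empty, one_word_rate) copied from the stats tables in this report.
reported = {
    "Summary": (12328, 0.7525),
    "Obstacles": (12327, 0.7525),
    "HowReach": (13043, 0.7962),
    "PrayForChurch": (14208, 0.8673),
    "PrayForPG": (12570, 0.7673),
}

# Absolute gap between the blank share and the reported one-word rate.
gaps = {name: abs(n_empty / N_ROWS - rate) for name, (n_empty, rate) in reported.items()}
all_match = all(gap < 1e-4 for gap in gaps.values())
```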

Fig 107.

Character-length distribution for Summary.

Character-length distribution for Summary (mean: 137.66963740691003).
chars	count
0 – 30	12328
30 – 61	1
61 – 91	1
91 – 121	7
121 – 152	18
152 – 182	25
182 – 212	44
212 – 242	63
242 – 273	76
273 – 303	113
303 – 333	146
333 – 364	174
364 – 394	167
394 – 424	204
424 – 454	219
454 – 485	214
485 – 515	247
515 – 545	238
545 – 576	210
576 – 606	227
606 – 636	256
636 – 667	252
667 – 697	202
697 – 727	257
727 – 758	144
758 – 788	169
788 – 818	88
818 – 848	85
848 – 879	60
879 – 909	30
909 – 939	19
939 – 970	9
970 – 1000	10
1000 – 1030	21
1030 – 1060	50
1060 – 1091	3
1091 – 1121	2
1121 – 1151	1
1151 – 1182	0
1182 – 1212	2

Obstacles text free_text

Free-text English commentary describing spiritual or cultural obstacles to Christian evangelism for various ethnic groups (Rajputs, Jats, Bosniaks, Azeri, etc.). The field is overwhelmingly empty: 12,327 of 16,382 rows are blank, driving a median length of 0 and a one-word rate of 0.75. The duplicate_rate of 0.77 is mostly those blanks: the 4,055 populated rows hold 3,731 distinct texts. The repetition that does exist is concentrated, with the same Rajput and Jat paragraphs repeated 88 and 74 times, suggesting templated entries reused across related people-group records.

Treatment: Treat blanks as missing and deduplicate template paragraphs before tokenizing/embedding for any text modelling.

anthropic:claude-opus-4-7 · confidence high

Out[317]:

saturn.columns["Obstacles"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	3,732
len_min	0
len_max	726
len_mean	47.4
len_median	0
len_p95	267
word_mean	8.704
word_median	1
n_empty	12,327
n_duplicates	12,650
duplicate_rate	0.7722
vocab_size	9,899
readability_flesch_mean	12.76
emoji_rate	0
url_rate	0
one_word_rate	0.7525
allcaps_rate	0
boilerplate_rate	0.0004273
alert: one_word	75.2% rows are a single word
alert: duplicates	77.2% duplicate strings

Fig 108.

Character-length distribution for Obstacles.

Character-length distribution for Obstacles (mean: 47.39598339641069).
chars	count
0 – 18	12327
18 – 36	1
36 – 54	25
54 – 73	111
73 – 91	206
91 – 109	309
109 – 127	422
127 – 145	433
145 – 163	361
163 – 182	360
182 – 200	267
200 – 218	228
218 – 236	174
236 – 254	164
254 – 272	232
272 – 290	90
290 – 309	223
309 – 327	197
327 – 345	70
345 – 363	39
363 – 381	34
381 – 399	25
399 – 417	11
417 – 436	9
436 – 454	9
454 – 472	13
472 – 490	12
490 – 508	4
508 – 526	7
526 – 544	3
544 – 563	4
563 – 581	4
581 – 599	1
599 – 617	0
617 – 635	2
635 – 653	1
653 – 672	0
672 – 690	2
690 – 708	1
708 – 726	1

HowReach text free_text

This is a free-text field describing how a people group could be reached. The column is dominated by emptiness: 13,043 of 16,382 rows (n_empty) are blank, which drives the median length of 0 and most of the duplicate_rate of 0.82. Populated entries are substantive (len_max 599, overall len_p95 221) and mostly distinct (2,943 distinct texts across 3,339 rows), but repetition is concentrated: the same paragraph about Jats appears 136 times and several near-duplicates differ only by a word, suggesting templated copy across related groups.

Treatment: Treat as sparse free text: filter out empties and deduplicate near-identical strings before any tokenization or embedding.
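
One possible way to collapse near-identical strings is a similarity threshold over normalized text, sketched here with the standard library's `difflib`. The 0.9 threshold, the function name, and the sample sentences are all assumptions for illustration; the sample sentences are invented and do not come from the dataset.

```python
from difflib import SequenceMatcher


def dedupe_near(texts, threshold=0.9):
    """Keep the first text of each near-duplicate cluster, dropping empties."""
    kept, kept_norm = [], []
    for text in texts:
        norm = " ".join(text.lower().split())  # case-fold and collapse whitespace
        if not norm:
            continue
        if any(SequenceMatcher(None, norm, k).ratio() >= threshold for k in kept_norm):
            continue
        kept.append(text)
        kept_norm.append(norm)
    return kept


# Invented sample: two variants differing by one letter, plus an unrelated text.
sample = [
    "",
    "Pray for workers to go to the Jat people.",
    "Pray for workers to go to the Jat peoples.",
    "The gospel needs to be shared through oral storytelling.",
]
unique_texts = dedupe_near(sample)
```

This pairwise comparison is quadratic in the number of retained texts, so it is best run on the exact-deduplicated strings (about 2.9k here) and not on all 16,382 rows.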

anthropic:claude-opus-4-7 · confidence high

Out[320]:

saturn.columns["HowReach"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	2,944
len_min	0
len_max	599
len_mean	36.3
len_median	0
len_p95	221
word_mean	6.879
word_median	1
n_empty	13,043
n_duplicates	13,438
duplicate_rate	0.8203
vocab_size	7,981
readability_flesch_mean	12.17
emoji_rate	0
url_rate	0
one_word_rate	0.7962
allcaps_rate	0
boilerplate_rate	0.0001221
alert: one_word	79.6% rows are a single word
alert: duplicates	82.0% duplicate strings

Fig 109.

Character-length distribution for HowReach.

Character-length distribution for HowReach (mean: 36.29764375534123).
chars	count
0 – 15	13043
15 – 30	0
30 – 45	1
45 – 60	7
60 – 75	61
75 – 90	158
90 – 105	257
105 – 120	253
120 – 135	289
135 – 150	277
150 – 165	282
165 – 180	238
180 – 195	226
195 – 210	190
210 – 225	368
225 – 240	177
240 – 255	139
255 – 270	92
270 – 285	106
285 – 300	60
300 – 314	42
314 – 329	25
329 – 344	25
344 – 359	13
359 – 374	9
374 – 389	15
389 – 404	5
404 – 419	4
419 – 434	3
434 – 449	1
449 – 464	3
464 – 479	2
479 – 494	2
494 – 509	3
509 – 524	0
524 – 539	3
539 – 554	2
554 – 569	0
569 – 584	0
584 – 599	1

PrayForChurch text free_text

Free-text prayer prompts focused on praying for the church among unreached groups (top words: pray, the, among). The column is mostly empty: 14,208 of 16,382 rows are blank, and those blanks account for most of the duplicate_rate of 0.89. The 2,174 populated rows hold 1,790 distinct texts, with a single Jat-related prayer appearing 146 times. That repetition and a vocabulary of only 4,576 words suggest templated prayer points rather than individually authored prose, and all 652 language-tagged rows are English.

Treatment: Treat as sparse optional commentary; treat empties as missing and dedupe templates before any text modelling.

anthropic:claude-opus-4-7 · confidence high

Out[323]:

saturn.columns["PrayForChurch"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	1,791
len_min	0
len_max	649
len_mean	26.91
len_median	0
len_p95	210
word_mean	5.599
word_median	1
n_empty	14,208
n_duplicates	14,591
duplicate_rate	0.8907
vocab_size	4,576
readability_flesch_mean	7.55
emoji_rate	0
url_rate	0
one_word_rate	0.8673
allcaps_rate	0
boilerplate_rate	0
alert: one_word	86.7% rows are a single word
alert: duplicates	89.1% duplicate strings

Fig 110.

Character-length distribution for PrayForChurch.

Character-length distribution for PrayForChurch (mean: 26.90599438408009).
chars	count
0 – 16	14208
16 – 32	0
32 – 49	1
49 – 65	16
65 – 81	45
81 – 97	107
97 – 114	112
114 – 130	149
130 – 146	164
146 – 162	193
162 – 178	134
178 – 195	324
195 – 211	110
211 – 227	126
227 – 243	95
243 – 260	99
260 – 276	56
276 – 292	103
292 – 308	55
308 – 324	116
324 – 341	17
341 – 357	28
357 – 373	23
373 – 389	33
389 – 406	15
406 – 422	3
422 – 438	13
438 – 454	9
454 – 471	5
471 – 487	2
487 – 503	6
503 – 519	3
519 – 535	2
535 – 552	2
552 – 568	2
568 – 584	2
584 – 600	1
600 – 617	0
617 – 633	1
633 – 649	2

PrayForPG text free_text

Free-text prayer points for a people group (PG), built around recurring 'Pray for...' templates. The column is mostly empty: 12,570 of 16,382 rows are blank (median length 0, median word count 1), and those blanks make up most of the 0.78 duplicate rate. The 3,812 populated rows hold 3,529 distinct texts, and one Rajput-focused prayer block is repeated 88 times. The Flesch mean of 14.3 is consistent with dense, formulaic devotional prose, though it may be averaged over blank rows too.

Treatment: Treat as sparse boilerplate: drop empties, dedupe, and only embed the ~3.5k unique strings if prayer content is actually needed downstream.

anthropic:claude-opus-4-7 · confidence high

Out[326]:

saturn.columns["PrayForPG"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	3,530
len_min	0
len_max	937
len_mean	72.22
len_median	0
len_p95	400
word_mean	13.06
word_median	1
n_empty	12,570
n_duplicates	12,852
duplicate_rate	0.7845
vocab_size	9,427
readability_flesch_mean	14.31
emoji_rate	0
url_rate	0
one_word_rate	0.7673
allcaps_rate	0
boilerplate_rate	0
alert: one_word	76.7% rows are a single word
alert: duplicates	78.5% duplicate strings

Fig 111.

Character-length distribution for PrayForPG.

Character-length distribution for PrayForPG (mean: 72.21584666096936).
chars	count
0 – 23	12571
23 – 47	0
47 – 70	6
70 – 94	48
94 – 117	91
117 – 141	161
141 – 164	144
164 – 187	188
187 – 211	167
211 – 234	215
234 – 258	218
258 – 281	266
281 – 305	279
305 – 328	339
328 – 351	324
351 – 375	245
375 – 398	275
398 – 422	199
422 – 445	241
445 – 468	122
468 – 492	76
492 – 515	59
515 – 539	43
539 – 562	34
562 – 586	22
586 – 609	11
609 – 632	12
632 – 656	6
656 – 679	11
679 – 703	4
703 – 726	3
726 – 750	1
750 – 773	0
773 – 796	0
796 – 820	0
820 – 843	0
843 – 867	0
867 – 890	0
890 – 914	0
914 – 937	1

Resources unknown other

The column is named "Resources" but saturn skipped profiling, so kind is unknown and no descriptive statistics were computed. We can confirm 16382 rows with a null_rate of 0.0, but n_unique and value-level signals are missing. Without those stats the semantic role and content type cannot be determined from the evidence alone.

Treatment: Re-profile this column with type coercion before deciding how to use it.

anthropic:claude-opus-4-7 · confidence low

Out[329]:

saturn.columns["Resources"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

country_data unknown other

The column `country_data` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16382 and a null rate of 0.0. Without `n_unique` or any sample stats, the actual content (codes, names, nested objects) cannot be inferred from the evidence.

Treatment: Re-profile with an appropriate parser to determine structure before use.

anthropic:claude-opus-4-7 · confidence low

Out[331]:

saturn.columns["country_data"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

language_data unknown other

The `language_data` column was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 16382 and a null rate of 0.0. Without `n_unique` or any descriptive stats, the actual contents (likely some structured or nested language payload given the name) cannot be characterized here.

Treatment: Re-profile with an appropriate parser (e.g., expand JSON or cast to string) before deciding on downstream use.
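
Before choosing a parser, it may help to tally what the cells actually contain. The sketch below is a hypothetical helper (not a saturn feature) that counts Python types in a sample of values and checks whether strings parse as JSON; the sample values are invented because the real contents are unknown.

```python
import json
from collections import Counter


def describe_values(values, sample_size=1000):
    """Tally value types in a sample, flagging strings that parse as JSON."""
    kinds = Counter()
    for value in values[:sample_size]:
        if isinstance(value, str):
            try:
                parsed = json.loads(value)
                kinds[f"str->json:{type(parsed).__name__}"] += 1
            except ValueError:  # json.JSONDecodeError subclasses ValueError
                kinds["str"] += 1
        else:
            kinds[type(value).__name__] += 1
    return kinds


# Invented sample covering a JSON string, plain text, a nested object, and None.
kinds = describe_values(['{"a": 1}', "plain", {"b": 2}, None])
```

The same check would apply to the Resources and country_data columns, which were skipped for the same reason.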

anthropic:claude-opus-4-7 · confidence low

Out[333]:

saturn.columns["language_data"].stats

stat	value
n	16,382
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown