joshua-project-joshua_project_unreached

Overview

Source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_unreached.parquet

Saturn profiled 7,124 rows across 109 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

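!pip install saturn-dissect
import subprocess

# Profile the parquet file and write the findings JSON to the working directory.
# check=True raises CalledProcessError if the saturn CLI exits non-zero.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_unreached.parquet",
    "--findings", "joshua-project-joshua_project_unreached.json",
    "--llm", "anthropic:claude-opus-4-7",
], check=True)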

Summary confidence: high

This dataset is a Joshua Project catalogue of 7,124 unreached people groups described across 109 fields covering geography, language, religion, population, and outreach status. Every row is flagged as 'Unreached' (JPScaleText is constant) and 'LeastReached' is uniformly Y, so the analytical interest sits in the breakdown by region, religion, and population rather than in reach status itself. The data is heavily skewed toward Asia (5,351 of 7,124) and especially South Asian Peoples (3,681), with India alone accounting for 2,032 groups; Islam (3,279) and Hinduism (2,142) dominate PrimaryReligion. Population is extremely long-tailed (median 30,000 vs. max 135.5M, skew ~21), so any size-based analysis should use log scales or medians. Worth a closer look first: the Continent/Region/Country concentration, the religion mix, and the population distribution — these three together explain most of the dataset's shape.

citing: row_count · column_count · Continent.top_values · AffinityBloc.top_values · Ctry.top_values · PrimaryReligion.top_values · Population.stats · RegionName.top_values · Frontier.top_values · JPScaleText.top_values · LeastReached.top_values · PrimaryLanguageName.top_values

Out[4]:

saturn.schema() · 109 columns

column	kind	n	null%	unique	alerts
PeopleID3ROG3	text	7,124	0.0%	7,124	near_unique one_word allcaps short_text
ROG3	categorical	7,124	0.0%	202
PeopleID3	numeric	7,124	0.0%	4,614
ROP3	numeric	7,124	0.1%	4,608
PeopNameInCountry	text	7,124	0.0%	4,722	one_word duplicates
ROG2	categorical	7,124	0.0%	7
Continent	categorical	7,124	0.0%	7
RegionName	categorical	7,124	0.0%	12
ISO3	categorical	7,124	0.0%	202
LocationInCountry	text	7,124	64.2%	2,176	multilingual null_rate
PeopleID1	numeric	7,124	0.0%	16	outliers
ROP1	categorical	7,124	0.0%	16
AffinityBloc	categorical	7,124	0.0%	16
PeopleID2	numeric	7,124	0.0%	205
ROP2	categorical	7,124	0.0%	155
PeopleCluster	categorical	7,124	0.0%	205
PeopNameAcrossCountries	text	7,124	0.0%	4,604	one_word duplicates
Population	numeric	7,124	0.2%	1,200	high_skew outliers
Category	categorical	7,124	0.0%	3
ROL3	text	7,124	0.0%	1,565	one_word short_text duplicates
PrimaryLanguageName	text	7,124	0.0%	1,563	one_word short_text duplicates
PrimaryLanguageDialect	categorical	7,124	94.5%	303	long_tail null_rate
NumberLanguagesSpoken	numeric	7,124	0.0%	69	high_skew outliers
OfficialLang	categorical	7,124	0.1%	79
SpeakNationalLang	unknown	7,124	0.0%	—	skipped
BibleStatus	numeric	7,124	0.0%	6	outliers
BibleYear	categorical	7,124	45.8%	163	null_rate
NTYear	categorical	7,124	24.6%	305	null_rate
PortionsYear	categorical	7,124	13.5%	460
TranslationNeedQuestionable	unknown	7,124	0.0%	—	skipped
JPScale	numeric	7,124	0.0%	1	constant
JPScalePC	categorical	7,124	0.0%	5
JPScalePGAC	categorical	7,124	0.0%	5	imbalance
LeastReached	categorical	7,124	0.0%	1	imbalance
LeastReachedPC	categorical	7,124	0.0%	2
LeastReachedPGAC	categorical	7,124	0.0%	2	imbalance
GSEC	categorical	7,124	0.0%	8
HasAudioRecordings	categorical	7,124	0.0%	2
NTOnline	categorical	7,124	22.4%	1	null_rate imbalance
RLG3	numeric	7,124	0.0%	7	outliers
RLG3PC	numeric	7,124	0.0%	8	outliers
RLG3PGAC	numeric	7,124	0.0%	8	outliers
PrimaryReligion	categorical	7,124	0.0%	7
PrimaryReligionPC	categorical	7,124	0.0%	8
PrimaryReligionPGAC	categorical	7,124	0.0%	8
RLG4	numeric	7,124	92.4%	18	null_rate outliers
ReligionSubdivision	categorical	7,124	92.4%	18	null_rate
PCIslam	numeric	7,124	0.1%	902
PCNonReligious	numeric	7,124	0.3%	152	high_skew outliers
PCUnknown	numeric	7,124	0.4%	388	high_skew outliers
SecurityLevel	numeric	7,124	0.0%	3	outliers
LRTop100	categorical	7,124	0.0%	2	imbalance
PhotoAddress	text	7,124	0.0%	2,880	one_word short_text duplicates
PhotoCredits	categorical	7,124	0.1%	851	long_tail
PhotoCreditURL	categorical	7,124	36.0%	774	long_tail null_rate
PhotoCreativeCommons	categorical	7,124	0.1%	2
PhotoCopyright	categorical	7,124	0.2%	2
PhotoPermission	categorical	7,124	0.2%	3
ProfileTextExists	categorical	7,124	0.0%	2	imbalance
CountOfCountries	numeric	7,124	0.0%	39	high_skew outliers
CountOfProvinces	unknown	7,124	0.0%	—	skipped
Longitude	numeric	7,124	0.0%	6,713
Latitude	numeric	7,124	0.0%	6,696
Ctry	categorical	7,124	0.0%	202
IndigenousCode	categorical	7,124	0.0%	2
PercentAdherents	categorical	7,124	0.0%	692	long_tail
PercentChristianPC	categorical	7,124	0.0%	184
NaturalName	text	7,124	0.0%	4,705	one_word duplicates
NaturalPronunciation	text	7,124	48.5%	1,489	one_word null_rate duplicates
PercentChristianPGAC	categorical	7,124	0.1%	842
PercentEvangelical	categorical	7,124	10.4%	401	long_tail
PercentEvangelicalPC	categorical	7,124	2.1%	166
PercentEvangelicalPGAC	categorical	7,124	6.3%	548
PCBuddhism	numeric	7,124	0.3%	809	high_skew outliers
PCEthnicReligions	numeric	7,124	0.3%	351	high_skew outliers
PCHinduism	numeric	7,124	0.3%	1,131
PCOtherSmall	numeric	7,124	0.3%	670	high_skew outliers
RegionCode	numeric	7,124	0.0%	12	outliers
PopulationPGAC	numeric	7,124	0.1%	1,509	high_skew outliers
Frontier	categorical	7,124	0.0%	2
MapAddress	text	7,124	0.0%	4,616	one_word short_text duplicates
HasJesusFilm	categorical	7,124	0.0%	2
Nomadic	categorical	7,124	0.0%	2	imbalance
NomadicTypeDescription	categorical	7,124	96.6%	6	null_rate
PhotoCCVersionText	categorical	7,124	0.0%	16
PhotoCCVersionURL	categorical	7,124	0.0%	16
MapCredits	categorical	7,124	0.0%	161	long_tail
MapCreditURL	categorical	7,124	0.0%	31	long_tail imbalance
MapCopyright	categorical	7,124	0.0%	3
MapCCVersionText	categorical	7,124	0.0%	4	imbalance
MapCCVersionURL	categorical	7,124	0.0%	4	imbalance
JF	categorical	7,124	0.0%	2
AudioRecordings	categorical	7,124	0.0%	2
Window1040	categorical	7,124	0.0%	2
PeopleGroupMapURL	text	7,124	0.0%	4,616	one_word url_heavy duplicates
PeopleGroupMapExpandedURL	text	7,124	0.0%	4,331	one_word url_heavy duplicates
PeopleGroupURL	text	7,124	0.0%	7,124	near_unique one_word url_heavy
PeopleGroupPhotoURL	text	7,124	0.0%	2,880	one_word url_heavy duplicates
CountryURL	categorical	7,124	0.0%	202
JPScaleText	categorical	7,124	0.0%	1	imbalance
JPScaleImageURL	categorical	7,124	0.0%	1	imbalance
Summary	text	7,124	0.0%	3,685	one_word duplicates
Obstacles	text	7,124	0.0%	3,641	one_word duplicates
HowReach	text	7,124	0.0%	2,853	one_word duplicates
PrayForChurch	text	7,124	0.0%	1,713	one_word duplicates
PrayForPG	text	7,124	0.0%	3,441	one_word duplicates
Resources	unknown	7,124	0.0%	—	skipped
country_data	unknown	7,124	0.0%	—	skipped
language_data	unknown	7,124	0.0%	—	skipped

Fig 1.

Continent · Shows how heavily the dataset concentrates in Asia (~75%) versus other continents.

Top values for Continent (7 unique shown, of 7 total).
value	count	share
Asia	5351	75.1%
Africa	986	13.8%
Europe	431	6.0%
North America	175	2.5%
South America	106	1.5%
Australia	39	0.5%
Oceania	36	0.5%

Fig 2.

PrimaryReligion · Highlights that Islam and Hinduism together account for the majority of unreached groups.

Top values for PrimaryReligion (7 unique shown, of 7 total).
value	count	share
Islam	3279	46.0%
Hinduism	2142	30.1%
Ethnic Religions	933	13.1%
Buddhism	480	6.7%
Unknown	157	2.2%
Other / Small	120	1.7%
Non-Religious	13	0.2%

Fig 3.

RegionName · Breaks the Asia bulk into sub-regions, with Asia South alone covering nearly half the rows.

Top values for RegionName (12 unique shown, of 12 total).
value	count	share
Asia, South	3349	47.0%
Asia, Southeast	726	10.2%
Asia, Northeast	521	7.3%
Africa, West and Central	460	6.5%
Africa, North and Middle East	444	6.2%
Africa, East and Southern	373	5.2%
Asia, Central	352	4.9%
Europe, Western	320	4.5%
Europe, Eastern and Eurasia	223	3.1%
America, North and Caribbean	160	2.2%
America, Latin	121	1.7%
Australia and Pacific	75	1.1%

Fig 4.

Population · Reveals an extreme right-skew — most groups are small (median 30k) but a few exceed 100M; consider a log scale.

Histogram bins for Population (median: 30000.0).
bin	count
10 – 3.388e+06	6930
3.388e+06 – 6.777e+06	78
6.777e+06 – 1.016e+07	36
1.016e+07 – 1.355e+07	17
1.355e+07 – 1.694e+07	11
1.694e+07 – 2.033e+07	8
2.033e+07 – 2.372e+07	7
2.372e+07 – 2.711e+07	3
2.711e+07 – 3.049e+07	2
3.049e+07 – 3.388e+07	1
3.388e+07 – 3.727e+07	2
3.727e+07 – 4.066e+07	4
4.066e+07 – 4.405e+07	0
4.405e+07 – 4.744e+07	3
4.744e+07 – 5.082e+07	1
5.082e+07 – 5.421e+07	0
5.421e+07 – 5.76e+07	1
5.76e+07 – 6.099e+07	0
6.099e+07 – 6.438e+07	1
6.438e+07 – 6.777e+07	1
6.777e+07 – 7.115e+07	0
7.115e+07 – 7.454e+07	0
7.454e+07 – 7.793e+07	0
7.793e+07 – 8.132e+07	0
8.132e+07 – 8.471e+07	0
8.471e+07 – 8.81e+07	0
8.81e+07 – 9.148e+07	0
9.148e+07 – 9.487e+07	0
9.487e+07 – 9.826e+07	0
9.826e+07 – 1.016e+08	1
1.016e+08 – 1.05e+08	0
1.05e+08 – 1.084e+08	0
1.084e+08 – 1.118e+08	0
1.118e+08 – 1.152e+08	0
1.152e+08 – 1.186e+08	1
1.186e+08 – 1.22e+08	0
1.22e+08 – 1.254e+08	0
1.254e+08 – 1.288e+08	0
1.288e+08 – 1.321e+08	0
1.321e+08 – 1.355e+08	1
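
With linear bins, 6,930 of the counted rows land in the first bin, so the table above says little about the shape of the bulk. A minimal sketch of order-of-magnitude binning follows. It assumes the population values are available as a plain list; `decade_bins` is a hypothetical helper and the sample values are illustrative, not taken from the data.

```python
import math
from collections import Counter

def decade_bins(values):
    """Count values by order of magnitude (10^k <= v < 10^(k+1))."""
    counts = Counter()
    for v in values:
        if v is None or v <= 0:
            continue  # skip nulls and non-positive values; log10 is undefined there
        counts[int(math.floor(math.log10(v)))] += 1
    return dict(sorted(counts.items()))

# Hypothetical sample; only the overall range mirrors the report.
sample = [10, 450, 3_000, 30_000, 30_000, 120_000, 2_500_000, 135_500_000]
bins = decade_bins(sample)
for exponent, count in bins.items():
    print(f"1e{exponent} to 1e{exponent + 1}: {count}")
```

Each decade gets its own bin, so small and very large groups stay visible in the same table.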

Fig 5.

AffinityBloc · Confirms the South Asian Peoples bloc dominates and shows the next-largest cultural clusters.

Top values for AffinityBloc (16 unique shown, of 16 total).
value	count	share
South Asian Peoples	3681	51.7%
Sub-Saharan Peoples	632	8.9%
Arab World	475	6.7%
Southeast Asian Peoples	451	6.3%
Malay Peoples	339	4.8%
Tibetan-Himalayan Peoples	287	4.0%
Turkic Peoples	269	3.8%
Persian-Median	225	3.2%
Eurasian Peoples	166	2.3%
Deaf	151	2.1%
East Asian Peoples	140	2.0%
Jewish	128	1.8%
Horn of Africa Peoples	83	1.2%
Latin-Caribbean Americans	81	1.1%
Pacific Islanders	13	0.2%
North American Peoples	3	0.0%

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
PeopleID3ROG3	text	0.0%
ROG3	categorical	0.0%
PeopleID3	numeric	0.0%
ROP3	numeric	0.1%
PeopNameInCountry	text	0.0%
ROG2	categorical	0.0%
Continent	categorical	0.0%
RegionName	categorical	0.0%
ISO3	categorical	0.0%
LocationInCountry	text	64.2%
PeopleID1	numeric	0.0%
ROP1	categorical	0.0%
AffinityBloc	categorical	0.0%
PeopleID2	numeric	0.0%
ROP2	categorical	0.0%
PeopleCluster	categorical	0.0%
PeopNameAcrossCountries	text	0.0%
Population	numeric	0.2%
Category	categorical	0.0%
ROL3	text	0.0%
PrimaryLanguageName	text	0.0%
PrimaryLanguageDialect	categorical	94.5%
NumberLanguagesSpoken	numeric	0.0%
OfficialLang	categorical	0.1%
SpeakNationalLang	unknown	0.0%
BibleStatus	numeric	0.0%
BibleYear	categorical	45.8%
NTYear	categorical	24.6%
PortionsYear	categorical	13.5%
TranslationNeedQuestionable	unknown	0.0%
JPScale	numeric	0.0%
JPScalePC	categorical	0.0%
JPScalePGAC	categorical	0.0%
LeastReached	categorical	0.0%
LeastReachedPC	categorical	0.0%
LeastReachedPGAC	categorical	0.0%
GSEC	categorical	0.0%
HasAudioRecordings	categorical	0.0%
NTOnline	categorical	22.4%
RLG3	numeric	0.0%
RLG3PC	numeric	0.0%
RLG3PGAC	numeric	0.0%
PrimaryReligion	categorical	0.0%
PrimaryReligionPC	categorical	0.0%
PrimaryReligionPGAC	categorical	0.0%
RLG4	numeric	92.4%
ReligionSubdivision	categorical	92.4%
PCIslam	numeric	0.1%
PCNonReligious	numeric	0.3%
PCUnknown	numeric	0.4%
SecurityLevel	numeric	0.0%
LRTop100	categorical	0.0%
PhotoAddress	text	0.0%
PhotoCredits	categorical	0.1%
PhotoCreditURL	categorical	36.0%
PhotoCreativeCommons	categorical	0.1%
PhotoCopyright	categorical	0.2%
PhotoPermission	categorical	0.2%
ProfileTextExists	categorical	0.0%
CountOfCountries	numeric	0.0%
CountOfProvinces	unknown	0.0%
Longitude	numeric	0.0%
Latitude	numeric	0.0%
Ctry	categorical	0.0%
IndigenousCode	categorical	0.0%
PercentAdherents	categorical	0.0%
PercentChristianPC	categorical	0.0%
NaturalName	text	0.0%
NaturalPronunciation	text	48.5%
PercentChristianPGAC	categorical	0.1%
PercentEvangelical	categorical	10.4%
PercentEvangelicalPC	categorical	2.1%
PercentEvangelicalPGAC	categorical	6.3%
PCBuddhism	numeric	0.3%
PCEthnicReligions	numeric	0.3%
PCHinduism	numeric	0.3%
PCOtherSmall	numeric	0.3%
RegionCode	numeric	0.0%
PopulationPGAC	numeric	0.1%
Frontier	categorical	0.0%
MapAddress	text	0.0%
HasJesusFilm	categorical	0.0%
Nomadic	categorical	0.0%
NomadicTypeDescription	categorical	96.6%
PhotoCCVersionText	categorical	0.0%
PhotoCCVersionURL	categorical	0.0%
MapCredits	categorical	0.0%
MapCreditURL	categorical	0.0%
MapCopyright	categorical	0.0%
MapCCVersionText	categorical	0.0%
MapCCVersionURL	categorical	0.0%
JF	categorical	0.0%
AudioRecordings	categorical	0.0%
Window1040	categorical	0.0%
PeopleGroupMapURL	text	0.0%
PeopleGroupMapExpandedURL	text	0.0%
PeopleGroupURL	text	0.0%
PeopleGroupPhotoURL	text	0.0%
CountryURL	categorical	0.0%
JPScaleText	categorical	0.0%
JPScaleImageURL	categorical	0.0%
Summary	text	0.0%
Obstacles	text	0.0%
HowReach	text	0.0%
PrayForChurch	text	0.0%
PrayForPG	text	0.0%
Resources	unknown	0.0%
country_data	unknown	0.0%
language_data	unknown	0.0%

Fig 7.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 11,918 detected strings).
lang	count	share
en	11903	99.9%
de	4	0.0%
pt	2	0.0%
it	2	0.0%
eo	2	0.0%
ceb	1	0.0%
ilo	1	0.0%
min	1	0.0%
id	1	0.0%
es	1	0.0%

Fig 8.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 12 numeric columns (values clipped to 2 decimals).
	PeopleID3	ROP3	PeopleID1	PeopleID2	Population	NumberLanguagesSpoken	BibleStatus	JPScale	RLG3	RLG3PC	RLG3PGAC	RLG4
PeopleID3	+1.00	+0.86	+0.29	+0.45	-0.02	+0.15	+0.10	+nan	+0.03	+0.04	+0.02	+0.12
ROP3	+0.86	+1.00	+0.33	+0.47	-0.01	+0.16	+0.13	+nan	+0.03	+0.08	+0.03	+0.06
PeopleID1	+0.29	+0.33	+1.00	+0.55	-0.13	+0.17	-0.00	+nan	+0.08	+0.22	+0.10	-0.29
PeopleID2	+0.45	+0.47	+0.55	+1.00	-0.06	+0.39	+0.40	+nan	+0.06	+0.20	+0.07	-0.14
Population	-0.02	-0.01	-0.13	-0.06	+1.00	+0.01	-0.01	+nan	+0.01	-0.01	+0.01	+0.01
NumberLanguagesSpoken	+0.15	+0.16	+0.17	+0.39	+0.01	+1.00	+0.24	+nan	-0.00	+0.06	-0.00	-0.06
BibleStatus	+0.10	+0.13	-0.00	+0.40	-0.01	+0.24	+1.00	+nan	-0.07	+0.01	-0.07	-0.02
JPScale	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan	+nan
RLG3	+0.03	+0.03	+0.08	+0.06	+0.01	-0.00	-0.07	+nan	+1.00	+0.68	+0.99	+0.17
RLG3PC	+0.04	+0.08	+0.22	+0.20	-0.01	+0.06	+0.01	+nan	+0.68	+1.00	+0.68	+0.14
RLG3PGAC	+0.02	+0.03	+0.10	+0.07	+0.01	-0.00	-0.07	+nan	+0.99	+0.68	+1.00	+0.16
RLG4	+0.12	+0.06	-0.29	-0.14	+0.01	-0.06	-0.02	+nan	+0.17	+0.14	+0.16	+1.00
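
The `+nan` entries in the JPScale row and column are what Pearson correlation gives for a constant column: the variance is zero, so the denominator vanishes. The sketch below illustrates that behaviour with a hand-written `pearson` function. It is not Saturn's implementation, and both input lists are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson r for two equal-length sequences; NaN if either has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

constant = [1, 1, 1, 1]   # stands in for a constant column such as JPScale
varying = [3, 5, 8, 13]   # hypothetical values
r_const = pearson(constant, varying)
r_self = pearson(varying, varying)
```

Dropping constant columns before computing the matrix avoids the NaN row and column altogether.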

PeopleID3ROG3 text identifier

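PeopleID3ROG3 looks like a row-level key for a people group within a country rather than for an individual person: every one of the 7,124 rows holds a unique 7-character, single-token code (n_unique equals n, len_min=len_max=7, one_word_rate=1.0, allcaps_rate=1.0). Sample values like '10208ng' and '10375su' suggest a 5-digit numeric prefix followed by a 2-letter suffix, which fits the column name: a PeopleID3 value concatenated with a ROG3 country code. The samples appear in lowercase even though allcaps_rate is 1.0, so confirm the stored casing before joining on this field. There are no nulls, duplicates, or empties, so the key looks clean.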

Treatment: Use as a primary key for joins; exclude from modelling features.

anthropic:claude-opus-4-7 · confidence high
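
A minimal sketch of validating and splitting such a key, assuming the 5-digit plus 2-letter layout that the two sample values suggest. `KEY_PATTERN`, `split_key`, and `is_unique` are hypothetical names, and the uppercase normalisation is an assumption.

```python
import re

# Assumed layout: 5 digits followed by 2 letters, as the samples suggest.
KEY_PATTERN = re.compile(r"^(\d{5})([A-Za-z]{2})$")

def split_key(key):
    """Return (numeric_prefix, letter_suffix), or None if the key does not match."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).upper()

def is_unique(keys):
    """True when no key repeats, i.e. the column can serve as a primary key."""
    return len(set(keys)) == len(keys)

samples = ["10208ng", "10375su"]  # sample values quoted in the report
parts = [split_key(k) for k in samples]
print(parts)  # [(10208, 'NG'), (10375, 'SU')]
```

Keys that return `None` would be worth inspecting before the column is used in a join.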

Out[14]:

saturn.columns["PeopleID3ROG3"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	7,124
len_min	7
len_max	7
len_mean	7
len_median	7
len_p95	7
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	7,124
readability_flesch_mean	112.3
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	1
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: allcaps	100.0% rows are all-caps
alert: short_text	95th-percentile length under 20 chars

Fig 9.

Character-length distribution for PeopleID3ROG3.

Character-length distribution for PeopleID3ROG3 (mean: 7.0).
chars	count
7	7124

ROG3 categorical feature

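ROG3 holds two-letter country codes across 7,124 rows with 202 distinct values and no nulls. The codes do not appear to be ISO 3166-1 alpha-2: CH lines up with CHN and BG with BGD in the ISO3 column (identical counts), so map through ISO3 rather than assuming ISO semantics. India (IN) dominates at 28.5% of records (2,032), followed by PK (767) and CH (442). The top 10 codes account for 4,430 rows (about 62%), while the remaining 192 codes share the rest, giving an entropy ratio of 0.66.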

Treatment: Group rare codes into an 'other' bucket before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
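
The rare-code grouping suggested in the treatment can be sketched as below. The 5% threshold, the `OTHER` label, and the miniature column are all assumptions chosen for illustration; `XX` is a made-up code.

```python
from collections import Counter

def bucket_rare(values, min_share=0.01, other="OTHER"):
    """Replace categories whose share of rows is below min_share with `other`."""
    counts = Counter(values)
    total = len(values)
    keep = {v for v, c in counts.items() if c / total >= min_share}
    return [v if v in keep else other for v in values]

# Hypothetical miniature column of 100 rows.
codes = ["IN"] * 60 + ["PK"] * 30 + ["CH"] * 9 + ["XX"] * 1
bucketed = bucket_rare(codes, min_share=0.05)
print(Counter(bucketed))
```

Fitting the set of kept codes on training data only, then reusing it at prediction time, keeps the bucket stable across splits.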

Out[17]:

saturn.columns["ROG3"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	202
top_value	IN
top_rate	0.2852
cardinality	202
entropy	5.058
entropy_ratio	0.6605

Fig 10.

Top values for ROG3.

Top values for ROG3 (20 unique shown, of 202 total).
value	count	share
IN	2032	28.5%
PK	767	10.8%
CH	442	6.2%
BG	256	3.6%
ID	234	3.3%
NP	184	2.6%
SU	168	2.4%
LA	142	2.0%
RS	115	1.6%
US	90	1.3%
IR	85	1.2%
CD	81	1.1%
MY	78	1.1%
TH	73	1.0%
VM	69	1.0%
TU	61	0.9%
BM	59	0.8%
AF	58	0.8%
CE	55	0.8%
CA	52	0.7%

PeopleID3 numeric foreign_key

PeopleID3 is an integer key spanning 10,120 to 22,661 with 4,614 unique values across 7,124 rows, suggesting a people-group identifier that recurs across rows (about 1.5 rows per id on average). The distribution is mildly left-skewed (-0.23) and platykurtic (-0.95) with no nulls, no zeros, and no outliers, consistent with a densely allocated id range rather than a measured quantity.

Treatment: Treat as a join key; do not use as a numeric feature.

anthropic:claude-opus-4-7 · confidence high

Out[20]:

saturn.columns["PeopleID3"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	4,614
min	10,120
max	22,661
mean	1.693e+04
median	1.736e+04
std	3431
q1	1.433e+04
q3	1.958e+04
iqr	5251
skew	-0.2255
kurtosis	-0.9528
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 11.

Distribution of PeopleID3. Vertical dash marks the median.

Histogram bins for PeopleID3 (median: 17356.5).
bin	count
1.012e+04 – 1.043e+04	131
1.043e+04 – 1.075e+04	123
1.075e+04 – 1.106e+04	167
1.106e+04 – 1.137e+04	106
1.137e+04 – 1.169e+04	123
1.169e+04 – 1.2e+04	108
1.2e+04 – 1.231e+04	168
1.231e+04 – 1.263e+04	196
1.263e+04 – 1.294e+04	140
1.294e+04 – 1.326e+04	128
1.326e+04 – 1.357e+04	131
1.357e+04 – 1.388e+04	93
1.388e+04 – 1.42e+04	71
1.42e+04 – 1.451e+04	212
1.451e+04 – 1.482e+04	96
1.482e+04 – 1.514e+04	178
1.514e+04 – 1.545e+04	223
1.545e+04 – 1.576e+04	170
1.576e+04 – 1.608e+04	53
1.608e+04 – 1.639e+04	233
1.639e+04 – 1.67e+04	231
1.67e+04 – 1.702e+04	221
1.702e+04 – 1.733e+04	238
1.733e+04 – 1.764e+04	314
1.764e+04 – 1.796e+04	274
1.796e+04 – 1.827e+04	255
1.827e+04 – 1.859e+04	272
1.859e+04 – 1.89e+04	236
1.89e+04 – 1.921e+04	278
1.921e+04 – 1.953e+04	164
1.953e+04 – 1.984e+04	160
1.984e+04 – 2.015e+04	174
2.015e+04 – 2.047e+04	205
2.047e+04 – 2.078e+04	92
2.078e+04 – 2.109e+04	106
2.109e+04 – 2.141e+04	254
2.141e+04 – 2.172e+04	189
2.172e+04 – 2.203e+04	115
2.203e+04 – 2.235e+04	295
2.235e+04 – 2.266e+04	201

ROP3 numeric identifier

ROP3 is a numeric column with 4,608 unique values across 7,124 rows, ranging from 100,005 to 119,619 with a mean of 111,443.68 and a median of 112,533. The narrow span of about 19,600 sitting well above zero, combined with integer-looking bounds, suggests a coded identifier or sequence number rather than a measured quantity. Mild left skew (-0.47) and no outliers indicate no extreme values within that band, and nulls are negligible (7 rows, 0.1%).

Treatment: Treat as a categorical code or key; do not feed raw into numeric models.

anthropic:claude-opus-4-7 · confidence medium

Out[23]:

saturn.columns["ROP3"].stats

stat	value
n	7,124
nulls	7 (0.1%)
unique	4,608
min	100,005
max	119,619
mean	1.114e+05
median	112,533
std	5269
q1	107,901
q3	115,240
iqr	7,339
skew	-0.4712
kurtosis	-0.7273
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 12.

Distribution of ROP3. Vertical dash marks the median.

Histogram bins for ROP3 (median: 112533.0).
bin	count
1e+05 – 1.005e+05	133
1.005e+05 – 1.01e+05	97
1.01e+05 – 1.015e+05	107
1.015e+05 – 1.02e+05	139
1.02e+05 – 1.025e+05	94
1.025e+05 – 1.029e+05	89
1.029e+05 – 1.034e+05	86
1.034e+05 – 1.039e+05	153
1.039e+05 – 1.044e+05	155
1.044e+05 – 1.049e+05	105
1.049e+05 – 1.054e+05	89
1.054e+05 – 1.059e+05	137
1.059e+05 – 1.064e+05	141
1.064e+05 – 1.069e+05	102
1.069e+05 – 1.074e+05	66
1.074e+05 – 1.079e+05	79
1.079e+05 – 1.083e+05	162
1.083e+05 – 1.088e+05	100
1.088e+05 – 1.093e+05	83
1.093e+05 – 1.098e+05	263
1.098e+05 – 1.103e+05	152
1.103e+05 – 1.108e+05	111
1.108e+05 – 1.113e+05	127
1.113e+05 – 1.118e+05	342
1.118e+05 – 1.123e+05	277
1.123e+05 – 1.128e+05	300
1.128e+05 – 1.132e+05	418
1.132e+05 – 1.137e+05	364
1.137e+05 – 1.142e+05	340
1.142e+05 – 1.147e+05	169
1.147e+05 – 1.152e+05	324
1.152e+05 – 1.157e+05	236
1.157e+05 – 1.162e+05	360
1.162e+05 – 1.167e+05	71
1.167e+05 – 1.172e+05	72
1.172e+05 – 1.177e+05	39
1.177e+05 – 1.181e+05	263
1.181e+05 – 1.186e+05	332
1.186e+05 – 1.191e+05	67
1.191e+05 – 1.196e+05	373

PeopNameInCountry text label

This column names a people group as it appears within a given country (e.g., 'Turk', 'Persian', 'Arab, Moroccan'). Values are short (len_mean 12.5, word_mean 1.78) with 4,722 uniques across 7,124 rows. 33.7% of rows (2,402) are duplicates because the same people group recurs across countries; 'Deaf' alone appears 151 times. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' in top_words show that religion-tagged variants are baked into the label.

Treatment: Treat as a categorical label; pair with country to form a unique key, and consider stripping parenthetical religion tags for cleaner grouping.

anthropic:claude-opus-4-7 · confidence high
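
A minimal sketch of the two steps in the treatment: splitting off a trailing parenthetical tag, and pairing the label with a country code to form a key. The helper names are hypothetical and the base name `Example` is made up. Only the tag text and 'Arab, Moroccan' come from the report.

```python
import re

# Matches a trailing parenthetical such as "(Hindu traditions)".
TRAILING_PAREN = re.compile(r"\s*\(([^)]*)\)\s*$")

def split_label(name):
    """Return (base_name, tag) where tag is the trailing parenthetical or None."""
    match = TRAILING_PAREN.search(name)
    if match is None:
        return name.strip(), None
    return name[: match.start()].strip(), match.group(1)

def composite_key(name, country_code):
    """Pair the label with a country code, since the label alone repeats."""
    return (name, country_code)

print(split_label("Example (Hindu traditions)"))  # ('Example', 'Hindu traditions')
print(split_label("Arab, Moroccan"))              # ('Arab, Moroccan', None)
```

Keeping the tag as its own field rather than discarding it preserves the distinction between religion-tagged variants.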

Out[26]:

saturn.columns["PeopNameInCountry"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	4,722
len_min	1
len_max	39
len_mean	12.5
len_median	11
len_p95	27
word_mean	1.784
word_median	2
n_empty	0
n_duplicates	2,402
duplicate_rate	0.3372
vocab_size	4,602
readability_flesch_mean	56.38
emoji_rate	0
url_rate	0
one_word_rate	0.4575
allcaps_rate	0
boilerplate_rate	0
alert: one_word	45.7% rows are a single word
alert: duplicates	33.7% duplicate strings

Fig 13.

Character-length distribution for PeopNameInCountry.

Character-length distribution for PeopNameInCountry (mean: 12.5).
chars	count
1 – 2	1
2 – 3	14
3 – 4	101
4 – 5	557
5 – 6	661
6 – 7	692
7 – 8	665
8 – 9	370
9 – 10	202
10 – 10	210
10 – 11	277
11 – 12	290
12 – 13	329
13 – 14	434
14 – 15	330
15 – 16	282
16 – 17	175
17 – 18	118
18 – 19	67
19 – 20	0
20 – 21	87
21 – 22	56
22 – 23	65
23 – 24	103
24 – 25	195
25 – 26	209
26 – 27	192
27 – 28	111
28 – 29	69
29 – 30	57
30 – 30	37
30 – 31	46
31 – 32	45
32 – 33	33
33 – 34	22
34 – 35	18
35 – 36	1
36 – 37	0
37 – 38	1
38 – 39	2

ROG2 categorical feature

ROG2 is a low-cardinality categorical with 7 region codes (ASI, AFR, EUR, NAR, LAM, AUS, SOP) and no nulls across 7,124 rows, consistent with a continental/region-of-origin grouping. The distribution is highly imbalanced: ASI accounts for 75.1% of records while AUS and SOP together contribute only 75 rows, yielding an entropy ratio of just 0.45. Any model conditioned on this field will be dominated by the ASI bucket.

Treatment: One-hot encode and consider grouping AUS/SOP/LAM into an 'other' bucket given the severe imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[29]:

saturn.columns["ROG2"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	7
top_value	ASI
top_rate	0.7511
cardinality	7
entropy	1.251
entropy_ratio	0.4457

Fig 14.

Top values for ROG2.

Top values for ROG2 (7 unique shown, of 7 total).
value	count	share
ASI	5351	75.1%
AFR	986	13.8%
EUR	431	6.0%
NAR	175	2.5%
LAM	106	1.5%
AUS	39	0.5%
SOP	36	0.5%
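
The reported entropy figures are consistent with Shannon entropy in bits, with the ratio obtained by dividing by log2 of the cardinality. That reading is inferred from the numbers and is not a documented Saturn definition. The sketch below applies it to the ROG2 counts above and should land close to the reported 1.251 and 0.4457.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    if k <= 1:
        return 0.0  # assumption: define the ratio as 0 for a single category
    return entropy_bits(counts) / math.log2(k)

rog2_counts = [5351, 986, 431, 175, 106, 39, 36]  # ASI, AFR, EUR, NAR, LAM, AUS, SOP
print(round(entropy_bits(rog2_counts), 3))
print(round(entropy_ratio(rog2_counts), 4))
```

A ratio of 1.0 would mean all seven codes are equally common. Values well below 1.0 indicate that a few codes carry most of the rows.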

Continent categorical feature

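Continent is a low-cardinality geographic categorical with 7 distinct values and no nulls across 7,124 rows. The distribution is heavily concentrated: Asia alone accounts for 75.1% of records, with Africa a distant second at 986. Both 'Australia' (39) and 'Oceania' (36) appear as separate categories. The same split appears in ROG2 (AUS 39, SOP 36), and RegionName combines the two as 'Australia and Pacific' (75), so this looks like a deliberate source taxonomy rather than a data-entry error, although it differs from the common convention of treating Australia as part of Oceania.

Treatment: If a conventional continent scheme is needed, reconcile Australia/Oceania into a single category, then one-hot encode.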

anthropic:claude-opus-4-7 · confidence high

Out[32]:

saturn.columns["Continent"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	7
top_value	Asia
top_rate	0.7511
cardinality	7
entropy	1.251
entropy_ratio	0.4457

Fig 15.

Top values for Continent.

Top values for Continent (7 unique shown, of 7 total).
value	count	share
Asia	5351	75.1%
Africa	986	13.8%
Europe	431	6.0%
North America	175	2.5%
South America	106	1.5%
Australia	39	0.5%
Oceania	36	0.5%

RegionName categorical feature

RegionName is a categorical geographic grouping with 12 distinct regions and no nulls across 7,124 rows. The distribution is heavily concentrated: 'Asia, South' alone accounts for 47.0% of records, followed distantly by 'Asia, Southeast' at 726 and 'Asia, Northeast' at 521, leaving the Americas and Europe sparsely represented. Entropy ratio of 0.76 confirms meaningful but uneven coverage across the 12 buckets.

Treatment: One-hot encode and consider grouping rare regions, given the dominance of 'Asia, South'.

anthropic:claude-opus-4-7 · confidence high

Out[35]:

saturn.columns["RegionName"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	12
top_value	Asia, South
top_rate	0.4701
cardinality	12
entropy	2.715
entropy_ratio	0.7574

Fig 16.

Top values for RegionName.

Top values for RegionName (12 unique shown, of 12 total).
value	count	share
Asia, South	3349	47.0%
Asia, Southeast	726	10.2%
Asia, Northeast	521	7.3%
Africa, West and Central	460	6.5%
Africa, North and Middle East	444	6.2%
Africa, East and Southern	373	5.2%
Asia, Central	352	4.9%
Europe, Western	320	4.5%
Europe, Eastern and Eurasia	223	3.1%
America, North and Caribbean	160	2.2%
America, Latin	121	1.7%
Australia and Pacific	75	1.1%

ISO3 categorical foreign_key

ISO3 looks like a country code in standard 3-letter ISO 3166-1 alpha-3 format, with 202 distinct values across 7,124 rows and zero nulls. The distribution is heavily concentrated on India (IND) at 28.5% of rows (2,032), followed by PAK (767) and CHN (442), so South and East Asia dominate. Entropy ratio of 0.66 confirms the imbalance is material rather than uniform across countries.

Treatment: Use as a join key to country reference tables; consider grouping long-tail codes or stratifying by ISO3 to control for the IND-heavy skew.

anthropic:claude-opus-4-7 · confidence high

Out[38]:

saturn.columns["ISO3"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	202
top_value	IND
top_rate	0.2852
cardinality	202
entropy	5.058
entropy_ratio	0.6605

Fig 17.

Top values for ISO3.

Top values for ISO3 (20 unique shown, of 202 total).
value	count	share
IND	2032	28.5%
PAK	767	10.8%
CHN	442	6.2%
BGD	256	3.6%
IDN	234	3.3%
NPL	184	2.6%
SDN	168	2.4%
LAO	142	2.0%
RUS	115	1.6%
USA	90	1.3%
IRN	85	1.2%
TCD	81	1.1%
MYS	78	1.1%
THA	73	1.0%
VNM	69	1.0%
TUR	61	0.9%
MMR	59	0.8%
AFG	58	0.8%
LKA	55	0.8%
CAN	52	0.7%

LocationInCountry text free_text

Free-text geographic descriptions of where a people group lives within a country, ranging from terse tags like "Widespread." (56 occurrences) to multi-sentence paragraphs up to 939 characters. 64.2% of rows are null (4,576), and 14.6% of the 2,548 non-null values are duplicates, so usable signal is concentrated in roughly a third of the dataset. Content is overwhelmingly English (1,714 of 1,729 detected strings), with a long tail of place names producing a 10,936-token vocabulary across 2,176 unique strings.

Treatment: Normalize boilerplate phrases like "Widespread" into a categorical flag, then tokenize and embed the residual prose before modelling.

anthropic:claude-opus-4-7 · confidence high
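
The first half of the treatment, turning the "Widespread." tag into a flag, can be sketched as follows. `widespread_flag` is a hypothetical helper, the matching rule (whole value equals "widespread", ignoring case and a trailing period) is an assumption, and the prose row is invented.

```python
def widespread_flag(text):
    """Return (is_widespread, residual_text) for one LocationInCountry value."""
    if text is None:
        return False, None  # nulls stay null (64.2% of rows in the report)
    cleaned = text.strip()
    if cleaned.rstrip(".").lower() == "widespread":
        return True, None   # the whole value is the boilerplate tag
    return False, cleaned

rows = ["Widespread.", None, "Hypothetical prose describing several districts."]
flags = [widespread_flag(r) for r in rows]
```

Only the residual text in the second tuple element would then go on to tokenization or embedding.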

Out[41]:

saturn.columns["LocationInCountry"].stats

stat	value
n	7,124
nulls	4,576 (64.2%)
unique	2,176
len_min	3
len_max	939
len_mean	141.1
len_median	89
len_p95	455.7
word_mean	21.15
word_median	12
n_empty	0
n_duplicates	372
duplicate_rate	0.146
vocab_size	10,936
readability_flesch_mean	41.64
emoji_rate	0
url_rate	0
one_word_rate	0.04317
allcaps_rate	0
boilerplate_rate	0
alert: multilingual	11 languages detected in sample
alert: null_rate	64.2% null

Fig 18.

Character-length distribution for LocationInCountry.

Character-length distribution for LocationInCountry (mean: 141.1).
chars	count
3 – 26	327
26 – 50	413
50 – 73	323
73 – 97	303
97 – 120	234
120 – 143	153
143 – 167	116
167 – 190	84
190 – 214	75
214 – 237	48
237 – 260	55
260 – 284	46
284 – 307	42
307 – 331	46
331 – 354	33
354 – 377	31
377 – 401	19
401 – 424	24
424 – 448	40
448 – 471	17
471 – 494	19
494 – 518	12
518 – 541	10
541 – 565	14
565 – 588	10
588 – 611	10
611 – 635	11
635 – 658	7
658 – 682	4
682 – 705	2
705 – 728	3
728 – 752	4
752 – 775	7
775 – 799	1
799 – 822	1
822 – 845	1
845 – 869	1
869 – 892	0
892 – 916	1
916 – 939	1

PeopleID1 numeric feature

PeopleID1 is stored as numeric but takes only 16 distinct integer values between 10 and 26, with a tight IQR of 1.0 (Q1=20, Q3=21) around a median of 21. The distribution is heavily left-skewed (skew -1.34) and 32.9% of rows fall outside the Tukey fences, so the 'outliers' alert reflects the column being near-categorical rather than truly continuous. The per-value counts match the AffinityBloc counts exactly (3,681 rows at 21 versus 3,681 for 'South Asian Peoples', 151 at 26 versus 151 for 'Deaf'), so the column is most likely a numeric code for the affinity bloc. There are no nulls and no zeros.

Treatment: Cast to categorical (16 levels) rather than treating as a continuous numeric.

anthropic:claude-opus-4-7 · confidence medium

Out[44]:

saturn.columns["PeopleID1"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	16
min	10
max	26
mean	19.51
median	21
std	3.858
q1	20
q3	21
iqr	1
skew	-1.337
kurtosis	0.8321
n_outliers	2,347
outlier_rate	0.3294
zero_rate	0
alert: outliers	32.9% rows beyond 1.5 IQR

Fig 19.

Distribution of PeopleID1. Vertical dash marks the median.

Histogram bins for PeopleID1 (median: 21.0).
bin	count
10 – 10.4	475
10.4 – 10.8	0
10.8 – 11.2	140
11.2 – 11.6	0
11.6 – 12	0
12 – 12.4	166
12.4 – 12.8	0
12.8 – 13.2	83
13.2 – 13.6	0
13.6 – 14	0
14 – 14.4	225
14.4 – 14.8	0
14.8 – 15.2	128
15.2 – 15.6	0
15.6 – 16	0
16 – 16.4	81
16.4 – 16.8	0
16.8 – 17.2	339
17.2 – 17.6	0
17.6 – 18	0
18 – 18.4	3
18.4 – 18.8	0
18.8 – 19.2	13
19.2 – 19.6	0
19.6 – 20	0
20 – 20.4	451
20.4 – 20.8	0
20.8 – 21.2	3681
21.2 – 21.6	0
21.6 – 22	0
22 – 22.4	632
22.4 – 22.8	0
22.8 – 23.2	287
23.2 – 23.6	0
23.6 – 24	0
24 – 24.4	269
24.4 – 24.8	0
24.8 – 25.2	0
25.2 – 25.6	0
25.6 – 26	151
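
The outlier alert can be reproduced from the histogram alone. The sketch below reads one count per integer value off the bins above, takes Q1=20 and Q3=21 from the stats table, and applies the usual 1.5 × IQR Tukey fences. `tukey_outliers` is a hypothetical helper, not part of Saturn.

```python
# Value counts read off the PeopleID1 histogram (value -> rows).
value_counts = {
    10: 475, 11: 140, 12: 166, 13: 83, 14: 225, 15: 128, 16: 81, 17: 339,
    18: 3, 19: 13, 20: 451, 21: 3681, 22: 632, 23: 287, 24: 269, 26: 151,
}

def tukey_outliers(counts, q1, q3, k=1.5):
    """Count rows outside [q1 - k*IQR, q3 + k*IQR] given value -> row counts."""
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outside = sum(n for value, n in counts.items() if value < low or value > high)
    return low, high, outside

low, high, n_out = tukey_outliers(value_counts, q1=20, q3=21)
total = sum(value_counts.values())
print(low, high, n_out)  # 18.5 22.5 2347
```

With an IQR of 1, every code below 19 or above 22 counts as an outlier, which is why a third of the rows trip the alert even though nothing is anomalous.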

ROP1 categorical feature

ROP1 is a low-cardinality categorical code (16 distinct values, all following an 'A0xx' pattern) with no nulls across 7,124 rows. The distribution is heavily concentrated: 'A012' alone accounts for 51.7% of records, and an entropy ratio of 0.67 confirms the imbalance. The counts are identical to the AffinityBloc counts value for value, so ROP1 appears to be the code for that label, and using both as model features would be redundant.

Treatment: One-hot or target-encode; consider grouping rare codes given the dominance of A012.

anthropic:claude-opus-4-7 · confidence high

Out[47]:

saturn.columns["ROP1"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	16
top_value	A012
top_rate	0.5167
cardinality	16
entropy	2.676
entropy_ratio	0.669

Fig 20.

Top values for ROP1.

Top values for ROP1 (16 unique shown, of 16 total).
value	count	share
A012	3681	51.7%
A013	632	8.9%
A001	475	6.7%
A011	451	6.3%
A008	339	4.8%
A014	287	4.0%
A015	269	3.8%
A005	225	3.2%
A003	166	2.3%
A017	151	2.1%
A002	140	2.0%
A006	128	1.8%
A004	83	1.2%
A007	81	1.1%
A010	13	0.2%
A009	3	0.0%

AffinityBloc categorical feature

AffinityBloc is a categorical grouping of peoples/ethnolinguistic blocs, with 16 distinct values across 7124 rows and no nulls. The distribution is heavily concentrated: 'South Asian Peoples' alone accounts for 51.7% of records (3681), followed distantly by 'Sub-Saharan Peoples' (632) and 'Arab World' (475). Entropy ratio of 0.67 confirms the imbalance, and the inclusion of 'Deaf' (151) as a bloc alongside geographic/ethnic categories is a notable taxonomy quirk.

Treatment: One-hot or target-encode, and consider grouping rare blocs given the dominance of South Asian Peoples.

anthropic:claude-opus-4-7 · confidence high
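
The value counts for AffinityBloc (3681, 632, 475, ...) match those of ROP1 and of the PeopleID1 histogram exactly, which suggests the three columns encode one taxonomy. A minimal sketch for checking that on the raw rows follows; the helper name is hypothetical and the code/label pairs in the sample are inferred from matching counts, not confirmed by the source.

```python
from collections import defaultdict


def is_one_to_one(pairs):
    """Return True if every left value maps to exactly one right value and vice versa."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)
    return (all(len(v) == 1 for v in left_to_right.values())
            and all(len(v) == 1 for v in right_to_left.values()))


# Illustrative rows only: the code/label pairing is inferred from matching counts.
sample = [
    ("A012", "South Asian Peoples"),
    ("A012", "South Asian Peoples"),
    ("A013", "Sub-Saharan Peoples"),
    ("A001", "Arab World"),
]
print(is_one_to_one(sample))
print(is_one_to_one(sample + [("A001", "Turkic Peoples")]))
```

If the check passes on the full data, keeping one of the three columns is enough for modelling.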

Out[50]:

saturn.columns["AffinityBloc"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	16
top_value	South Asian Peoples
top_rate	0.5167
cardinality	16
entropy	2.676
entropy_ratio	0.669

Fig 21.

Top values for AffinityBloc.

Top values for AffinityBloc (16 unique shown, of 16 total).
value	count	share
South Asian Peoples	3681	51.7%
Sub-Saharan Peoples	632	8.9%
Arab World	475	6.7%
Southeast Asian Peoples	451	6.3%
Malay Peoples	339	4.8%
Tibetan-Himalayan Peoples	287	4.0%
Turkic Peoples	269	3.8%
Persian-Median	225	3.2%
Eurasian Peoples	166	2.3%
Deaf	151	2.1%
East Asian Peoples	140	2.0%
Jewish	128	1.8%
Horn of Africa Peoples	83	1.2%
Latin-Caribbean Americans	81	1.1%
Pacific Islanders	13	0.2%
North American Peoples	3	0.0%

PeopleID2 numeric foreign_key

PeopleID2 is a numeric identifier-like field with only 205 distinct values across 7,124 rows, ranging from 101 to 475 with no nulls or zeros. The distribution is left-skewed (skew -0.50) and platykurtic (kurtosis -1.24), with the median (412) well above the mean (339): just over half of the rows (3,678) fall in the 400 to 475 band, no IDs appear between roughly 363 and 400, and the rest are spread across 101 to 362. The low cardinality relative to row count indicates a repeated key rather than a per-row unique ID, and the 205 distinct values match the cardinality of PeopleCluster.

Treatment: Treat as a categorical foreign key and left-join to the people dimension rather than using as a numeric feature.

anthropic:claude-opus-4-7 · confidence high
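
A left join keeps every fact row even when the key has no match in the dimension. The sketch below assumes a hypothetical people dimension keyed by PeopleID2; the `cluster_name` attribute and its values are placeholders, not real mappings from the dataset.

```python
def left_join(rows, dimension, key):
    """Left-join dict rows to a lookup table; unmatched keys get a None attribute."""
    joined = []
    for row in rows:
        attrs = dimension.get(row[key])
        merged = dict(row)
        merged["cluster_name"] = attrs["cluster_name"] if attrs else None
        joined.append(merged)
    return joined


# Hypothetical dimension table; the names are placeholders, not real mappings.
people_dim = {
    101: {"cluster_name": "cluster A"},
    412: {"cluster_name": "cluster B"},
}
facts = [{"PeopleID2": 412}, {"PeopleID2": 101}, {"PeopleID2": 475}]

result = left_join(facts, people_dim, "PeopleID2")
# Keys missing from the dimension are surfaced rather than silently dropped.
unmatched = [r["PeopleID2"] for r in result if r["cluster_name"] is None]
print(unmatched)
```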

Out[53]:

saturn.columns["PeopleID2"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	205
min	101
max	475
mean	339
median	412
std	123.2
q1	232.8
q3	450
iqr	217.2
skew	-0.5035
kurtosis	-1.242
n_outliers	0
outlier_rate	0
zero_rate	0

Fig 22.

Distribution of PeopleID2. Vertical dash marks the median.

Histogram bins for PeopleID2 (median: 412.0).
bin	count
101 – 110.3	72
110.3 – 119.7	353
119.7 – 129.1	99
129.1 – 138.4	58
138.4 – 147.8	74
147.8 – 157.1	217
157.1 – 166.4	156
166.4 – 175.8	86
175.8 – 185.1	92
185.1 – 194.5	67
194.5 – 203.8	188
203.8 – 213.2	111
213.2 – 222.6	122
222.6 – 231.9	84
231.9 – 241.2	222
241.2 – 250.6	136
250.6 – 259.9	34
259.9 – 269.3	144
269.3 – 278.6	3
278.6 – 288	68
288 – 297.4	194
297.4 – 306.7	147
306.7 – 316	223
316 – 325.4	256
325.4 – 334.8	160
334.8 – 344.1	14
344.1 – 353.4	58
353.4 – 362.8	8
362.8 – 372.1	0
372.1 – 381.5	0
381.5 – 390.8	0
390.8 – 400.2	0
400.2 – 409.6	100
409.6 – 418.9	573
418.9 – 428.2	216
428.2 – 437.6	50
437.6 – 446.9	928
446.9 – 456.3	128
456.3 – 465.6	1136
465.6 – 475	547

ROP2 categorical feature

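ROP2 is a categorical code field with 155 distinct values that follow an A### or C#### pattern (for example 'A012' and 'C0229'), suggesting a classification code. One value, 'A012', dominates at 51.6% of the 7,124 rows (3,678), while the remaining categories are long-tailed C-codes, each at or below 2.4%. 'A012' is also the top ROP1 code (3,681 rows), so most of those rows appear to carry the bloc-level code here instead of a finer C-code; that reading is an inference from the counts and worth confirming. An entropy ratio of 0.56 confirms the distribution is heavily concentrated rather than uniform.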

Treatment: Collapse rare C-codes into an 'other' bucket and one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
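
Collapsing the long tail can be sketched as a count threshold. The function name, the `min_count` value, and the toy frequencies below are illustrative assumptions; only the code strings come from the table in this section.

```python
from collections import Counter


def collapse_rare(values, min_count, other_label="other"):
    """Replace categories seen fewer than min_count times with a single bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy column using codes from this section; the frequencies here are made up.
codes = ["A012"] * 5 + ["C0229"] * 3 + ["C0147", "C0201"]
collapsed = collapse_rare(codes, min_count=3)
print(Counter(collapsed))
```

In a modelling pipeline the counts would normally be learned on the training split only, so that rare-code decisions do not leak from held-out rows.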

Out[56]:

saturn.columns["ROP2"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	155
top_value	A012
top_rate	0.5163
cardinality	155
entropy	4.085
entropy_ratio	0.5614

Fig 23.

Top values for ROP2.

Top values for ROP2 (20 unique shown, of 155 total).
value	count	share
A012	3678	51.6%
C0229	169	2.4%
C0147	167	2.3%
C0252	151	2.1%
C0102	128	1.8%
C0061	111	1.6%
C0179	90	1.3%
C0207	86	1.2%
C0013	76	1.1%
C0156	75	1.1%
C0223	75	1.1%
C0015	70	1.0%
C0221	68	1.0%
C0126	63	0.9%
C0114	61	0.9%
C0017	59	0.8%
C0019	58	0.8%
C0216	52	0.7%
C0062	51	0.7%
C0201	50	0.7%

PeopleCluster categorical feature

PeopleCluster is a high-cardinality categorical taxonomy of ethno-religious groupings, with 205 distinct values across 7124 rows and no nulls. The distribution is dominated by South Asian categories — 'South Asia Hindu - other' alone accounts for 12.2% (869 rows), followed by 'South Asia Muslim - other' (586) and 'South Asia Dalit - other' (352). Entropy ratio of 0.80 indicates a fairly spread distribution despite the South Asian skew, and the appearance of 'Deaf' alongside ethnolinguistic labels signals a mixed taxonomy worth flagging.

Treatment: Group the long tail and target- or frequency-encode before modelling.

anthropic:claude-opus-4-7 · confidence high
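
Frequency encoding replaces each category with its share of the training rows, which keeps a 205-level column as a single numeric feature. This is a minimal sketch with hypothetical helper names; the toy sample does not reflect the real proportions.

```python
from collections import Counter


def fit_frequency_encoding(train_values):
    """Map each category to its share of the training rows."""
    counts = Counter(train_values)
    total = len(train_values)
    return {value: count / total for value, count in counts.items()}


def apply_frequency_encoding(values, mapping, unseen=0.0):
    """Encode values with a fitted mapping; unseen categories get a default."""
    return [mapping.get(v, unseen) for v in values]


train = ["South Asia Hindu - other"] * 3 + ["Deaf"]  # toy sample, not real proportions
mapping = fit_frequency_encoding(train)
encoded = apply_frequency_encoding(["Deaf", "Tai"], mapping)
print(encoded)
```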

Out[59]:

saturn.columns["PeopleCluster"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	205
top_value	South Asia Hindu - other
top_rate	0.122
cardinality	205
entropy	6.108
entropy_ratio	0.7954

Fig 24.

Top values for PeopleCluster.

Top values for PeopleCluster (20 unique shown, of 205 total).
value	count	share
South Asia Hindu - other	869	12.2%
South Asia Muslim - other	586	8.2%
South Asia Dalit - other	352	4.9%
South Asia Tribal - other	311	4.4%
South Asia Muslim - Pashtun	293	4.1%
Tibeto-Burman, other	169	2.4%
Mon-Khmer	167	2.3%
Deaf	151	2.1%
South Asia Muslim - Jat	138	1.9%
Jewish	128	1.8%
South Asia Forward Caste - Brahmin	123	1.7%
South Asia - other	114	1.6%
South Asia Muslim - Rajput	112	1.6%
Caucasus	111	1.6%
South Asia Forward Caste - Rajput	95	1.3%
Persian	90	1.3%
South Asia Buddhist	87	1.2%
Tai	86	1.2%
South Asia Hindu - Jat	77	1.1%
Arab, Arabian	76	1.1%

PeopNameAcrossCountries text label

This column holds people-group / ethnolinguistic names spanning countries (e.g. 'Turk', 'Persian', 'Kurd, Kurmanji', 'Arab, Moroccan'), with frequent religious-tradition qualifiers like '(Hindu traditions)' and '(Muslim traditions)' appearing in 985 and 424 rows respectively. Values are short (mean 12.4 chars, median 2 words) and 47.5% are single-word labels, yet 35.4% (2,520 rows) are duplicates across 4,604 unique strings out of 7,124 — the same group recurs across countries. The most surprising entry is 'Deaf' at 151 occurrences, which sits oddly alongside ethnic categories.

Treatment: Treat as a categorical people-group label; normalize qualifiers in parentheses and join on (group, country) for cross-country aggregation.

anthropic:claude-opus-4-7 · confidence high
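
Normalising the parenthesised qualifiers can be sketched with a single regular expression that separates the base name from a trailing "(...)" part. The helper name is hypothetical, and "Rajput (Hindu traditions)" is an illustrative value assembled from the qualifier quoted in this section rather than a row copied from the data.

```python
import re

# Matches a trailing parenthesised qualifier such as "(Hindu traditions)".
QUALIFIER_RE = re.compile(r"^(?P<base>.*?)\s*\((?P<qualifier>[^()]*)\)\s*$")


def split_qualifier(name):
    """Return (base_name, qualifier); qualifier is None when no parentheses trail the name."""
    match = QUALIFIER_RE.match(name)
    if match:
        return match.group("base"), match.group("qualifier")
    return name.strip(), None


# Hypothetical value built from the qualifier quoted in this section.
print(split_qualifier("Rajput (Hindu traditions)"))
print(split_qualifier("Kurd, Kurmanji"))
```

Keeping the qualifier as its own column preserves the distinction between, say, Hindu and Muslim traditions of one group while still allowing aggregation on the base name.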

Out[62]:

saturn.columns["PeopNameAcrossCountries"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	4,604
len_min	1
len_max	39
len_mean	12.38
len_median	10
len_p95	27
word_mean	1.766
word_median	2
n_empty	0
n_duplicates	2,520
duplicate_rate	0.3537
vocab_size	4,431
readability_flesch_mean	56.01
emoji_rate	0
url_rate	0
one_word_rate	0.475
allcaps_rate	0
boilerplate_rate	0
alert: one_word	47.5% rows are a single word
alert: duplicates	35.4% duplicate strings

Fig 25.

Character-length distribution for PeopNameAcrossCountries.

Character-length distribution for PeopNameAcrossCountries (mean: 12.38).
chars	count
1 – 2	1
2 – 3	14
3 – 4	95
4 – 5	582
5 – 6	679
6 – 7	736
7 – 8	689
8 – 9	382
9 – 10	206
10 – 10	209
10 – 11	270
11 – 12	277
12 – 13	308
13 – 14	420
14 – 15	310
15 – 16	253
16 – 17	164
17 – 18	103
18 – 19	67
19 – 20	0
20 – 21	79
21 – 22	55
22 – 23	64
23 – 24	105
24 – 25	205
25 – 26	209
26 – 27	194
27 – 28	115
28 – 29	67
29 – 30	57
30 – 30	36
30 – 31	50
31 – 32	45
32 – 33	33
33 – 34	18
34 – 35	22
35 – 36	1
36 – 37	1
37 – 38	1
38 – 39	2

Population numeric feature

Population counts per record, ranging from 10 to 135,533,000 with a median of just 30,000. The distribution is extremely right-skewed (skew 21.1, kurtosis 607): the mean of ~502,570 sits far above Q3 of 129,000, and ~14.9% of rows flag as outliers. Since each row is a people group, this points to many small groups alongside a handful of very large ones; 6,930 of the 7,109 non-null rows fall in the first histogram bin (under about 3.4M). Null rate is negligible (0.21%) and there are no zeros.

Treatment: Apply a log1p transform before any modelling to tame the extreme skew.

anthropic:claude-opus-4-7 · confidence high
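
A minimal sketch of the log1p treatment, assuming missing populations arrive as `None` and should stay missing rather than be imputed. The helper name is hypothetical; the sample values are the min, Q1, median, Q3 and max from the stats in this section.

```python
import math


def log1p_or_none(value):
    """log(1 + x) for non-negative numbers; missing values pass through as None."""
    if value is None:
        return None
    return math.log1p(value)


# Min, Q1, median, Q3 and max from the stats table, plus one missing value.
population = [10, 6_700, 30_000, 129_000, 135_533_000, None]
transformed = [log1p_or_none(p) for p in population]
print([None if t is None else round(t, 2) for t in transformed])
```

On the transformed scale the gap between the smallest and largest group shrinks from seven orders of magnitude to a range a linear model or histogram can handle.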

Out[65]:

saturn.columns["Population"].stats

stat	value
n	7,124
nulls	15 (0.2%)
unique	1,200
min	10
max	1.355e+08
mean	5.026e+05
median	30,000
std	3.568e+06
q1	6,700
q3	129,000
iqr	122,300
skew	21.1
kurtosis	607.4
n_outliers	1,058
outlier_rate	0.1488
zero_rate	0
alert: high_skew	skew=+21.10
alert: outliers	14.9% rows beyond 1.5 IQR

Fig 26.

Distribution of Population. Vertical dash marks the median.

Histogram bins for Population (median: 30000.0).
bin	count
10 – 3.388e+06	6930
3.388e+06 – 6.777e+06	78
6.777e+06 – 1.016e+07	36
1.016e+07 – 1.355e+07	17
1.355e+07 – 1.694e+07	11
1.694e+07 – 2.033e+07	8
2.033e+07 – 2.372e+07	7
2.372e+07 – 2.711e+07	3
2.711e+07 – 3.049e+07	2
3.049e+07 – 3.388e+07	1
3.388e+07 – 3.727e+07	2
3.727e+07 – 4.066e+07	4
4.066e+07 – 4.405e+07	0
4.405e+07 – 4.744e+07	3
4.744e+07 – 5.082e+07	1
5.082e+07 – 5.421e+07	0
5.421e+07 – 5.76e+07	1
5.76e+07 – 6.099e+07	0
6.099e+07 – 6.438e+07	1
6.438e+07 – 6.777e+07	1
6.777e+07 – 7.115e+07	0
7.115e+07 – 7.454e+07	0
7.454e+07 – 7.793e+07	0
7.793e+07 – 8.132e+07	0
8.132e+07 – 8.471e+07	0
8.471e+07 – 8.81e+07	0
8.81e+07 – 9.148e+07	0
9.148e+07 – 9.487e+07	0
9.487e+07 – 9.826e+07	0
9.826e+07 – 1.016e+08	1
1.016e+08 – 1.05e+08	0
1.05e+08 – 1.084e+08	0
1.084e+08 – 1.118e+08	0
1.118e+08 – 1.152e+08	0
1.152e+08 – 1.186e+08	1
1.186e+08 – 1.22e+08	0
1.22e+08 – 1.254e+08	0
1.254e+08 – 1.288e+08	0
1.288e+08 – 1.321e+08	0
1.321e+08 – 1.355e+08	1

Category categorical label

Three-level categorical with values "1", "2", "3" and no nulls across 7124 rows. Distribution is imbalanced: "3" dominates at 60.3% (4299), "1" follows at 2330, and "2" is rare at 495. Entropy ratio of 0.78 confirms the skew toward the majority class.

Treatment: Treat as a categorical label; consider class weighting or stratified sampling to handle the minority class "2".

anthropic:claude-opus-4-7 · confidence high
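
Class weighting can be sketched with the common "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight="balanced"`. The function name is illustrative and the counts are copied from the top-values table in this section.

```python
# Class counts for Category, copied from the top-values table.
CATEGORY_COUNTS = {"3": 4299, "1": 2330, "2": 495}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


weights = balanced_class_weights(CATEGORY_COUNTS)
for label, weight in sorted(weights.items()):
    print(label, round(weight, 3))
```

With these counts the minority class "2" receives a weight of roughly 4.8, against roughly 0.55 for the majority class "3".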

Out[68]:

saturn.columns["Category"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	3
top_value	3
top_rate	0.6035
cardinality	3
entropy	1.234
entropy_ratio	0.7788

Fig 27.

Top values for Category.

Top values for Category (3 unique shown, of 3 total).
value	count	share
3	4299	60.3%
1	2330	32.7%
2	495	6.9%

ROL3 text feature

ROL3 holds three-letter ISO 639-3 language codes — every value is exactly 3 characters and one word, with top entries like 'hin', 'ben', 'urd', 'guj', and 'tel' pointing to South Asian languages. The distribution is heavily skewed toward Hindi (662 of 7124) and only 1565 unique codes appear across 7124 rows, giving a 78% duplicate rate. No nulls, no empties, no formatting noise.

Treatment: treat as a categorical language code; one-hot or target-encode, and consider grouping rare codes.

anthropic:claude-opus-4-7 · confidence high
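
Before encoding, the codes can be normalised and shape-checked. The allcaps_rate of 0.0002807 corresponds to about 2 of 7,124 rows, so a small number of codes may differ only by case. The sketch below checks shape only (three ASCII letters), not membership in the ISO 639-3 registry; the helper name and the inputs "HIN" and "h1n" are hypothetical.

```python
import re

ISO_639_3_SHAPE = re.compile(r"^[a-z]{3}$")


def normalise_language_code(code):
    """Lower-case and strip a code; return None if it is not three ASCII letters."""
    cleaned = code.strip().lower()
    return cleaned if ISO_639_3_SHAPE.match(cleaned) else None


# "HIN" and "h1n" are hypothetical inputs used to exercise the two branches.
print([normalise_language_code(c) for c in ["hin", "HIN", " ben ", "h1n"]])
```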

Out[71]:

saturn.columns["ROL3"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1,565
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	5,559
duplicate_rate	0.7803
vocab_size	1,564
readability_flesch_mean	118.7
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.0002807
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	78.0% duplicate strings

Fig 28.

Character-length distribution for ROL3.

Character-length distribution for ROL3 (mean: 3.0).
chars	count
3	7124

PrimaryLanguageName text feature

Categorical language label, usually a single word (one_word_rate 0.70, word_mean 1.33), drawn from a vocabulary of 1,641 tokens across 1,563 distinct values. Hindi (662), Bengali (357), and Sindhi (191) dominate the 7,124 rows, and 78.1% of values are duplicates of an earlier row, as expected for a controlled language taxonomy. Compound names like 'Pashto, Northern' and 'Punjabi, Eastern' indicate ISO-style subvariant naming rather than free text.

Treatment: Treat as a categorical factor; normalise the comma-separated subvariants before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
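
Normalising the comma-separated subvariants can be sketched as a split on the first comma, giving a base language and an optional variant. The helper name is hypothetical, and treating everything after the first comma as the variant is an assumption about the naming convention.

```python
def split_language_name(name):
    """Split 'Base, Variant' into (base, variant); variant is None without a comma."""
    base, sep, variant = name.partition(",")
    return base.strip(), (variant.strip() if sep else None)


print(split_language_name("Pashto, Northern"))
print(split_language_name("Hindi"))
```

Encoding on the base name alone merges 'Pashto, Northern' with any other Pashto variant, which reduces cardinality at the cost of dialect detail.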

Out[74]:

saturn.columns["PrimaryLanguageName"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1,563
len_min	1
len_max	32
len_mean	9.147
len_median	7
len_p95	17
word_mean	1.331
word_median	1
n_empty	0
n_duplicates	5,561
duplicate_rate	0.7806
vocab_size	1,641
readability_flesch_mean	33.28
emoji_rate	0
url_rate	0
one_word_rate	0.7016
allcaps_rate	0
boilerplate_rate	0
alert: one_word	70.2% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	78.1% duplicate strings

Fig 29.

Character-length distribution for PrimaryLanguageName.

Character-length distribution for PrimaryLanguageName (mean: 9.147).
chars	count
1 – 2	2
2 – 3	29
3 – 3	61
3 – 4	655
4 – 5	0
5 – 6	1267
6 – 6	888
6 – 7	1222
7 – 8	0
8 – 9	517
9 – 10	207
10 – 10	137
10 – 11	65
11 – 12	0
12 – 13	98
13 – 13	157
13 – 14	150
14 – 15	0
15 – 16	204
16 – 16	890
16 – 17	228
17 – 18	60
18 – 19	0
19 – 20	37
20 – 20	46
20 – 21	51
21 – 22	0
22 – 23	35
23 – 23	34
23 – 24	41
24 – 25	3
25 – 26	0
26 – 27	16
27 – 27	8
27 – 28	3
28 – 29	0
29 – 30	1
30 – 30	0
30 – 31	11
31 – 32	1

PrimaryLanguageDialect categorical metadata

Free-text or controlled-vocabulary field naming a primary language dialect, with 303 distinct values across 7124 rows but populated in only 5.5% of records (null_rate 0.945). Distribution is essentially flat — entropy_ratio 0.97 and the modal value 'Punjabi' covers just 3.1% of non-nulls (12 occurrences) — so no dialect dominates. The mix spans South Asian, Middle Eastern, African, and European dialects, suggesting a global but extremely sparse roster.

Treatment: Drop or collapse into a coarser language grouping; too sparse and high-cardinality to use directly as a feature.

anthropic:claude-opus-4-7 · confidence high

Out[77]:

saturn.columns["PrimaryLanguageDialect"].stats

stat	value
n	7,124
nulls	6,732 (94.5%)
unique	303
top_value	Punjabi
top_rate	0.03061
cardinality	303
entropy	8.011
entropy_ratio	0.9719
alert: long_tail	255 singleton categories
alert: null_rate	94.5% null

Fig 30.

Top values for PrimaryLanguageDialect.

Top values for PrimaryLanguageDialect (20 unique shown, of 303 total).
value	count	share
Punjabi	12	0.2%
Ta'izzi	8	0.1%
Sinhalese	6	0.1%
Wasulunkakan	6	0.1%
Pomak	5	0.1%
Siripuria	5	0.1%
Hui	4	0.1%
Vixlin	4	0.1%
Western Sudanese	3	0.0%
Brazilian Portuguese	3	0.0%
Bawean	3	0.0%
Bajuni	3	0.0%
Miri	3	0.0%
Southern Khams	3	0.0%
Levantine Turkmen	3	0.0%
Jerbi	2	0.0%
Tawallammat Tan Ataram	2	0.0%
Tihami	2	0.0%
Timbuktu	2	0.0%
Xinan Guanhua	2	0.0%

NumberLanguagesSpoken numeric feature

Counts the number of languages spoken, with values ranging from 1 to 120 across 7,124 rows and no nulls. The distribution is heavily right-skewed (skew 4.87, kurtosis 36.3): the median is 1 and Q3 is 5, yet the mean is 4.33 and 597 rows (8.4%) flag as outliers. The long tail reaches 120; whether such counts are genuine or data-entry errors cannot be told from the stats alone.

Treatment: Cap or log-transform before modelling, and audit the extreme tail (values up to 120) for data-entry errors.

anthropic:claude-opus-4-7 · confidence high
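
Capping can be sketched with Tukey fences at 1.5 IQR beyond the quartiles, the same rule the outlier alert refers to. With Q1 = 1 and Q3 = 5 the upper fence is 5 + 1.5 * 4 = 11. The helper names are hypothetical, and using the fence as the cap is one possible choice rather than a recommendation from the profiler.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip(values, lower, upper):
    """Cap each value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Quartiles from the stats table for NumberLanguagesSpoken.
lower, upper = tukey_fences(q1=1, q3=5)
capped = clip([1, 5, 12, 120], lower, upper)
print(upper, capped)
```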

Out[80]:

saturn.columns["NumberLanguagesSpoken"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	69
min	1
max	120
mean	4.333
median	1
std	7.32
q1	1
q3	5
iqr	4
skew	4.871
kurtosis	36.31
n_outliers	597
outlier_rate	0.0838
zero_rate	0
alert: high_skew	skew=+4.87
alert: outliers	8.4% rows beyond 1.5 IQR

Fig 31.

Distribution of NumberLanguagesSpoken. Vertical dash marks the median.

Histogram bins for NumberLanguagesSpoken (median: 1.0).
bin	count
1 – 3.975	4930
3.975 – 6.95	911
6.95 – 9.925	493
9.925 – 12.9	262
12.9 – 15.88	146
15.88 – 18.85	88
18.85 – 21.82	54
21.82 – 24.8	51
24.8 – 27.78	44
27.78 – 30.75	27
30.75 – 33.73	21
33.73 – 36.7	14
36.7 – 39.68	17
39.68 – 42.65	14
42.65 – 45.62	9
45.62 – 48.6	9
48.6 – 51.58	8
51.58 – 54.55	4
54.55 – 57.52	4
57.52 – 60.5	6
60.5 – 63.48	2
63.48 – 66.45	2
66.45 – 69.42	0
69.42 – 72.4	1
72.4 – 75.38	1
75.38 – 78.35	2
78.35 – 81.33	0
81.33 – 84.3	2
84.3 – 87.28	0
87.28 – 90.25	0
90.25 – 93.23	0
93.23 – 96.2	0
96.2 – 99.17	0
99.17 – 102.2	0
102.2 – 105.1	1
105.1 – 108.1	0
108.1 – 111.1	0
111.1 – 114	0
114 – 117	0
117 – 120	1

OfficialLang categorical feature

Categorical column listing an official language per record, with 79 distinct values across 7,124 rows and effectively no nulls (0.08%). Hindi dominates at 28.5% (2,032 rows), followed by Urdu (767), Standard Arabic (657), Mandarin (475), and English (433), giving an entropy ratio of 0.66, which is moderately concentrated rather than uniform. The South Asian skew is notable: four of the top ten values (Hindi, Urdu, Bengali, Nepali) are languages of that region, which may bias any downstream language-level analysis. Hindi's 2,032 rows equal the number of India rows reported in the summary, so the column probably records the official language of each row's country rather than of the people group.

Treatment: Group rare languages into an 'Other' bucket and one-hot or target-encode before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[83]:

saturn.columns["OfficialLang"].stats

stat	value
n	7,124
nulls	6 (0.1%)
unique	79
top_value	Hindi
top_rate	0.2855
cardinality	79
entropy	4.178
entropy_ratio	0.6628

Fig 32.

Top values for OfficialLang.

Top values for OfficialLang (20 unique shown, of 79 total).
value	count	share
Hindi	2032	28.5%
Urdu	767	10.8%
Arabic, Standard	657	9.2%
Chinese, Mandarin	475	6.7%
English	433	6.1%
French	303	4.3%
Bengali	256	3.6%
Indonesian	234	3.3%
Nepali	184	2.6%
Lao	142	2.0%
Russian	115	1.6%
Portuguese	94	1.3%
Malay	87	1.2%
Persian, Iranian	85	1.2%
Spanish	79	1.1%
Thai	73	1.0%
Vietnamese	69	1.0%
Turkish	61	0.9%
Burmese	59	0.8%
Pashto, Southern	58	0.8%

SpeakNationalLang unknown other

Column 'SpeakNationalLang' was skipped by the profiler, so type, cardinality and value distribution are all unavailable. The only confirmed facts are that it has 7,124 rows and zero nulls. The name suggests a flag or category for whether a people group speaks the national language, but this cannot be verified from the evidence.

Treatment: Re-profile with type inference enabled before deciding on any downstream handling.

anthropic:claude-opus-4-7 · confidence low

Out[86]:

saturn.columns["SpeakNationalLang"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

BibleStatus numeric feature

BibleStatus is an ordinal/categorical code stored as a small integer, taking just 6 distinct values from 0 to 5 across 7,124 rows with no nulls. The distribution is heavily concentrated at the top (median 5, Q1=4, Q3=5, mean 4.05) with strong negative skew (-1.51), and 13.5% of rows flagged as low-end outliers plus a 3.8% zero rate. This looks like a status/level code rather than a true numeric measurement.

Treatment: Treat as an ordinal category (or one-hot encode) rather than a continuous numeric.

anthropic:claude-opus-4-7 · confidence high
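
The outlier count for this column can be reproduced from the per-level counts in its histogram. With Q1 = 4 and Q3 = 5 the IQR is 1, so the lower fence sits at 4 - 1.5 = 2.5 and every row at levels 0, 1 or 2 is flagged. The sketch below assumes each non-empty histogram bin corresponds to one integer level.

```python
# Counts per BibleStatus level, read from the non-empty histogram bins.
BIBLE_STATUS_COUNTS = {0: 272, 1: 216, 2: 473, 3: 795, 4: 1506, 5: 3862}

Q1, Q3 = 4, 5                      # quartiles from the stats table
IQR = Q3 - Q1
LOWER_FENCE = Q1 - 1.5 * IQR       # 2.5
UPPER_FENCE = Q3 + 1.5 * IQR       # 6.5

n_outliers = sum(
    count for level, count in BIBLE_STATUS_COUNTS.items()
    if level < LOWER_FENCE or level > UPPER_FENCE
)
outlier_rate = n_outliers / sum(BIBLE_STATUS_COUNTS.values())
print(n_outliers, round(outlier_rate, 4))
```

The result matches the reported n_outliers of 961, which supports reading the alert as a consequence of the narrow IQR on a six-level code rather than as evidence of bad rows.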

Out[88]:

saturn.columns["BibleStatus"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	6
min	0
max	5
mean	4.054
median	5
std	1.342
q1	4
q3	5
iqr	1
skew	-1.513
kurtosis	1.538
n_outliers	961
outlier_rate	0.1349
zero_rate	0.03818
alert: outliers	13.5% rows beyond 1.5 IQR

Fig 33.

Distribution of BibleStatus. Vertical dash marks the median.

Histogram bins for BibleStatus (median: 5.0).
bin	count
0 – 0.125	272
0.125 – 0.25	0
0.25 – 0.375	0
0.375 – 0.5	0
0.5 – 0.625	0
0.625 – 0.75	0
0.75 – 0.875	0
0.875 – 1	0
1 – 1.125	216
1.125 – 1.25	0
1.25 – 1.375	0
1.375 – 1.5	0
1.5 – 1.625	0
1.625 – 1.75	0
1.75 – 1.875	0
1.875 – 2	0
2 – 2.125	473
2.125 – 2.25	0
2.25 – 2.375	0
2.375 – 2.5	0
2.5 – 2.625	0
2.625 – 2.75	0
2.75 – 2.875	0
2.875 – 3	0
3 – 3.125	795
3.125 – 3.25	0
3.25 – 3.375	0
3.375 – 3.5	0
3.5 – 3.625	0
3.625 – 3.75	0
3.75 – 3.875	0
3.875 – 4	0
4 – 4.125	1506
4.125 – 4.25	0
4.25 – 4.375	0
4.375 – 4.5	0
4.5 – 4.625	0
4.625 – 4.75	0
4.75 – 4.875	0
4.875 – 5	3862

BibleYear categorical metadata

BibleYear appears to be a publication/edition year field for Bible translations, encoded mostly as date ranges like "1818-2022" rather than single years. Cardinality is high (163 distinct values across 7124 rows) and the column is missing for 45.79% of rows, which is a major coverage gap. The top value "1818-2022" covers 17.14% of non-nulls and most frequent entries are spans, while plain single years like "1954" (191 occurrences) are the exception.

Treatment: Parse into start_year and end_year integers, then decide imputation given the 45.79% null rate.

anthropic:claude-opus-4-7 · confidence medium
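
Parsing into start_year and end_year can be sketched with one regular expression that accepts either a span or a lone year. The helper name is hypothetical, and treating a lone year as a span whose start and end coincide is an assumption; missing values are returned as `(None, None)` so that the imputation decision stays separate.

```python
import re

YEAR_RANGE_RE = re.compile(r"^(\d{4})(?:-(\d{4}))?$")


def parse_year_span(value):
    """Return (start_year, end_year), or (None, None) for missing or unparseable input."""
    if value is None:
        return None, None
    match = YEAR_RANGE_RE.match(value.strip())
    if not match:
        return None, None
    start = int(match.group(1))
    # Assumption: a lone year means start and end coincide.
    end = int(match.group(2)) if match.group(2) else start
    return start, end


print(parse_year_span("1818-2022"))
print(parse_year_span("1954"))
print(parse_year_span(None))
```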

Out[91]:

saturn.columns["BibleYear"].stats

stat	value
n	7,124
nulls	3,262 (45.8%)
unique	163
top_value	1818-2022
top_rate	0.1714
cardinality	163
entropy	5.318
entropy_ratio	0.7237
alert: null_rate	45.8% null

Fig 34.

Top values for BibleYear.

Top values for BibleYear (20 unique shown, of 163 total).
value	count	share
1818-2022	662	9.3%
1809-2022	357	5.0%
1954	191	2.7%
1843-2022	167	2.3%
1823-2021	155	2.2%
1815-2021	150	2.1%
1895-2020	146	2.0%
1854-2022	136	1.9%
1959-2021	131	1.8%
1727-2024	129	1.8%
1821-2024	96	1.3%
1827-2024	90	1.3%
1841-2022	85	1.2%
2008	80	1.1%
1914-2024	71	1.0%
1827-2009	55	0.8%
1874-2018	50	0.7%
2023	43	0.6%
1883-2018	36	0.5%
1838-2023	36	0.5%

NTYear categorical feature

NTYear appears to hold year-range strings (e.g. "1811-1998", "1801-1984") indicating a span between two dates, though the presence of "Yes" as the third most common value (345 rows) signals encoding inconsistency. The column has 305 distinct values across 7124 rows with a high null rate of 24.65%, and the top value covers only 12.3% of non-nulls, so the distribution is fairly spread (entropy ratio 0.735).

Treatment: Parse year-range strings into start/end numeric fields and quarantine non-conforming values like "Yes" before modelling.

anthropic:claude-opus-4-7 · confidence medium

Out[94]:

saturn.columns["NTYear"].stats

stat	value
n	7,124
nulls	1,756 (24.6%)
unique	305
top_value	1811-1998
top_rate	0.1233
cardinality	305
entropy	6.065
entropy_ratio	0.7349
alert: null_rate	24.6% null

Fig 35.

Top values for NTYear.

Top values for NTYear (20 unique shown, of 305 total).
value	count	share
1811-1998	662	9.3%
1801-1984	357	5.0%
Yes	345	4.8%
1890-1992	191	2.7%
1758-2000	167	2.3%
1820-1985	155	2.2%
1809-2000	150	2.1%
1818-1991	146	2.0%
2023-2024	146	2.0%
1818-1989	136	1.9%
1815-2011	131	1.8%
1819-2021	130	1.8%
1715-1998	129	1.8%
1978-2022	110	1.5%
1827-2023	100	1.4%
1811-1982	96	1.3%
1829-1995	85	1.2%
2010	83	1.2%
1819-1993	77	1.1%
1821-2010	71	1.0%

PortionsYear categorical metadata

PortionsYear appears to be a free-form field giving a year or year range, most likely the publication years of Scripture portions by analogy with BibleYear and NTYear, with 460 distinct values across 7,124 rows and a 13.5% null rate. Most entries are date ranges like '1806-1962' or '1800-1980', but the single most common value is the literal string 'Yes' (821 rows, 13.3% of non-nulls), suggesting inconsistent data entry where a yes/no answer leaks into a date-range field. The entropy ratio of 0.71 confirms values are spread across many ranges rather than concentrated.

Treatment: Parse date ranges into start/end year numerics and isolate the 'Yes' contamination as a separate boolean flag before use.

anthropic:claude-opus-4-7 · confidence high
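
Isolating the 'Yes' contamination can be sketched as a three-way split: year spans, the bare flag, and anything else held back for review. The record layout and field names below are hypothetical, and "c. 1900" is an invented malformed value used only to show the quarantine branch.

```python
import re

SPAN_RE = re.compile(r"^(\d{4})(?:-(\d{4}))?$")


def split_portions_year(value):
    """Separate the 'Yes' flag from year spans; anything else is kept for review."""
    record = {"start_year": None, "end_year": None, "flag_only": False, "unparsed": None}
    if value is None:
        return record
    text = value.strip()
    if text.lower() == "yes":
        record["flag_only"] = True
        return record
    match = SPAN_RE.match(text)
    if match:
        record["start_year"] = int(match.group(1))
        record["end_year"] = int(match.group(2) or match.group(1))
    else:
        record["unparsed"] = text  # quarantine rather than guess
    return record


for raw in ["Yes", "1806-1962", "2024", "c. 1900"]:
    print(raw, split_portions_year(raw))
```

Keeping `flag_only` separate from the missing case preserves the difference between "portions exist but the year is unknown" and "no information at all", if that is indeed what 'Yes' means here.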

Out[97]:

saturn.columns["PortionsYear"].stats

stat	value
n	7,124
nulls	961 (13.5%)
unique	460
top_value	Yes
top_rate	0.1332
cardinality	460
entropy	6.289
entropy_ratio	0.711

Fig 36.

Top values for PortionsYear.

Top values for PortionsYear (20 unique shown, of 460 total).
value	count	share
Yes	821	11.5%
1806-1962	662	9.3%
1800-1980	357	5.0%
1825-1981	191	2.7%
1747-1894	167	2.3%
1809-1965	155	2.2%
1811-1956	150	2.1%
1824-2010	146	2.0%
1812-1966	136	1.9%
1818-1954	131	1.8%
1885-1922	130	1.8%
1714-1956	129	1.8%
1927-1964	110	1.5%
1807-1957	96	1.3%
1812-1988	90	1.3%
1811-1968	85	1.2%
2024	73	1.0%
1850-1961	71	1.0%
1782-1985	55	0.8%
1939-2021	52	0.7%

TranslationNeedQuestionable unknown other

The column "TranslationNeedQuestionable" was skipped by the profiler, so no type, uniqueness, or value statistics are available. All we know is that it has 7124 rows with a 0.0 null rate; nothing else can be inferred from the evidence.

Treatment: Re-profile or inspect manually before deciding on use; current evidence is insufficient.

anthropic:claude-opus-4-7 · confidence low

Out[100]:

saturn.columns["TranslationNeedQuestionable"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

JPScale numeric metadata

JPScale is a numeric column that is entirely constant: all 7124 rows hold the value 1.0 with zero nulls, zero variance, and a single unique value. It carries no information for any downstream model or comparison and was flagged as constant.

Treatment: Drop; constant column with no variance.

anthropic:claude-opus-4-7 · confidence high
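
Constant columns such as this one can be found and dropped in a single pass. The sketch below assumes rows are available as dictionaries with identical keys; the helper name and the toy rows are illustrative, with only the column names and constant values taken from the report.

```python
def constant_columns(rows):
    """Names of columns whose values are identical across all rows."""
    if not rows:
        return []
    first = rows[0]
    return [name for name in first if all(row[name] == first[name] for row in rows)]


# Toy rows; only the column names and constant values come from the report.
rows = [
    {"JPScale": 1.0, "LeastReached": "Y", "JPScalePC": "1"},
    {"JPScale": 1.0, "LeastReached": "Y", "JPScalePC": "4"},
]
to_drop = constant_columns(rows)
kept = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
print(to_drop, kept)
```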

Out[102]:

saturn.columns["JPScale"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
min	1
max	1
mean	1
median	1
std	0
q1	1
q3	1
iqr	0
skew	0
kurtosis	0
n_outliers	0
outlier_rate	0
zero_rate	0
alert: constant	only one distinct value

Fig 37.

Distribution of JPScale. Vertical dash marks the median.

Histogram bins for JPScale (median: 1.0).
bin	count
0.5 – 0.525	0
0.525 – 0.55	0
0.55 – 0.575	0
0.575 – 0.6	0
0.6 – 0.625	0
0.625 – 0.65	0
0.65 – 0.675	0
0.675 – 0.7	0
0.7 – 0.725	0
0.725 – 0.75	0
0.75 – 0.775	0
0.775 – 0.8	0
0.8 – 0.825	0
0.825 – 0.85	0
0.85 – 0.875	0
0.875 – 0.9	0
0.9 – 0.925	0
0.925 – 0.95	0
0.95 – 0.975	0
0.975 – 1	0
1 – 1.025	7124
1.025 – 1.05	0
1.05 – 1.075	0
1.075 – 1.1	0
1.1 – 1.125	0
1.125 – 1.15	0
1.15 – 1.175	0
1.175 – 1.2	0
1.2 – 1.225	0
1.225 – 1.25	0
1.25 – 1.275	0
1.275 – 1.3	0
1.3 – 1.325	0
1.325 – 1.35	0
1.35 – 1.375	0
1.375 – 1.4	0
1.4 – 1.425	0
1.425 – 1.45	0
1.45 – 1.475	0
1.475 – 1.5	0

JPScalePC categorical feature

JPScalePC is a 5-level categorical code (values "1" through "5") with no nulls across 7124 rows, likely an ordinal scale or rating. The distribution is heavily concentrated at "1" (70.2% of rows), with "3" the rarest at just 205 occurrences, yielding an entropy ratio of 0.59. The non-monotonic frequency order (1 > 4 > 2 > 5 > 3) is unusual for a true ordinal scale and worth checking.

Treatment: Treat as ordinal categorical; consider grouping minority levels (3, 5) given the dominance of "1".

anthropic:claude-opus-4-7 · confidence high

Out[105]:

saturn.columns["JPScalePC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	5
top_value	1
top_rate	0.702
cardinality	5
entropy	1.377
entropy_ratio	0.5929

Fig 38.

Top values for JPScalePC.

Top values for JPScalePC (5 unique shown, of 5 total).
value	count	share
1	5001	70.2%
4	1141	16.0%
2	525	7.4%
5	252	3.5%
3	205	2.9%

JPScalePGAC categorical feature

JPScalePGAC is a 5-level categorical code (values "1" through "5"). Given the dataset, "JP" most likely stands for Joshua Project and "PGAC" for people group across countries, so this is probably the Joshua Project progress scale assessed at that level. The distribution is severely imbalanced: "1" accounts for 6,910 of 7,124 rows (top_rate 0.97), entropy_ratio is just 0.10, and the remaining four levels together hold 214 records. No nulls are present.

Treatment: Collapse rare levels or binarise as "1" vs other before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[108]:

saturn.columns["JPScalePGAC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	5
top_value	1
top_rate	0.97
cardinality	5
entropy	0.2372
entropy_ratio	0.1021
alert: imbalance	top value is 97.0% of rows

Fig 39.

Top values for JPScalePGAC.

Top values for JPScalePGAC (5 unique shown, of 5 total).
value	count	share
1	6910	97.0%
4	118	1.7%
2	75	1.1%
5	15	0.2%
3	6	0.1%

LeastReached categorical metadata

This column is a single-valued categorical flag holding "Y" for all 7124 rows, with no nulls and zero entropy. Because cardinality is 1 and top_rate is 1.0, it carries no information for any downstream model.

Treatment: Drop; constant column with no variance.

anthropic:claude-opus-4-7 · confidence high

Out[111]:

saturn.columns["LeastReached"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	Y
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 40.

Top values for LeastReached.

Top values for LeastReached (1 unique shown, of 1 total).
value	count	share
Y	7124	100.0%

LeastReachedPC categorical feature

Binary Y/N flag indicating whether some 'least reached' people-group condition is met. The column is fully populated across 7124 rows with only 2 distinct values, skewed toward 'Y' at 72.3% (5152) versus 'N' at 1972. Entropy ratio of 0.85 shows the split is uneven but still informative.

Treatment: Encode as a 0/1 boolean for modelling.

anthropic:claude-opus-4-7 · confidence high
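
A minimal sketch of the 0/1 encoding, written to fail loudly on any value other than 'Y' or 'N' so that blanks or stray codes do not silently become zeros. The helper name is hypothetical.

```python
YN_TO_INT = {"Y": 1, "N": 0}


def encode_yn(value):
    """Map 'Y'/'N' to 1/0 and raise on anything else."""
    if value not in YN_TO_INT:
        raise ValueError(f"unexpected flag value: {value!r}")
    return YN_TO_INT[value]


encoded = [encode_yn(v) for v in ["Y", "N", "Y"]]
print(encoded)
```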

Out[114]:

saturn.columns["LeastReachedPC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.7232
cardinality	2
entropy	0.8511
entropy_ratio	0.8511

Fig 41.

Top values for LeastReachedPC.

Top values for LeastReachedPC (2 unique shown, of 2 total).
value	count	share
Y	5152	72.3%
N	1972	27.7%

LeastReachedPGAC categorical feature

A binary Y/N flag with no nulls across 7,124 rows, presumably the least-reached status assessed for the people group across countries (PGAC). The distribution is severely imbalanced: 'Y' accounts for 6,910 rows (97.0%) versus only 214 'N', yielding an entropy_ratio of just 0.19. The 'Y' count equals the number of rows with JPScalePGAC = "1", so the flag may simply restate that scale level. As a near-constant feature it carries little discriminative signal on its own.

Treatment: Encode as 0/1 but consider dropping or pairing with rare-class oversampling given the 97/3 imbalance.

anthropic:claude-opus-4-7 · confidence high
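
The 6,910 'Y' rows here equal the count of "1" in JPScalePGAC, and the 214 'N' rows equal the sum of its other four levels, so the flag may be derived from that scale. The sketch below tests that hypothesis row by row; the rule `JPScalePGAC == "1"` is an inference from the matching counts, and the function name and toy rows are illustrative.

```python
def find_flag_mismatches(rows):
    """Rows where LeastReachedPGAC differs from the hypothesised rule JPScalePGAC == '1'."""
    mismatches = []
    for row in rows:
        expected = "Y" if row["JPScalePGAC"] == "1" else "N"
        if row["LeastReachedPGAC"] != expected:
            mismatches.append(row)
    return mismatches


# Toy rows; the last one is deliberately inconsistent to show a detected mismatch.
rows = [
    {"JPScalePGAC": "1", "LeastReachedPGAC": "Y"},
    {"JPScalePGAC": "4", "LeastReachedPGAC": "N"},
    {"JPScalePGAC": "2", "LeastReachedPGAC": "Y"},
]
print(find_flag_mismatches(rows))
```

An empty result on the full data would justify dropping one of the two columns as redundant.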

Out[117]:

saturn.columns["LeastReachedPGAC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.97
cardinality	2
entropy	0.1946
entropy_ratio	0.1946
alert: imbalance	top value is 97.0% of rows

Fig 42.

Top values for LeastReachedPGAC.

Top values for LeastReachedPGAC (2 unique shown, of 2 total).
value	count	share
Y	6910	97.0%
N	214	3.0%

GSEC categorical feature

GSEC is a low-cardinality categorical with 8 distinct values across 7,124 rows and no nulls. The dominant value is the empty string at 51.08% (3,639 rows), followed by '1' at 2,767 (38.8%); the remaining six codes ('0' and '2' through '6') together account for 718 rows, about 10.1%. The mix of blanks and small integer codes suggests an optional categorical flag where 'missing' is encoded as '' rather than null.

Treatment: Recode '' as explicit missing and one-hot encode the remaining small-integer categories.

anthropic:claude-opus-4-7 · confidence high
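
Recoding the blanks can be sketched as below. The check is written against empty or whitespace-only strings specifically, so the legitimate code '0' is not mistaken for missing; the helper name and the toy values are illustrative.

```python
def recode_blank(value):
    """Turn empty or whitespace-only strings into None; leave other values alone."""
    if value is None or value.strip() == "":
        return None
    return value


gsec = ["", "1", "4", " ", "0"]
recoded = [recode_blank(v) for v in gsec]
missing_rate = sum(v is None for v in recoded) / len(recoded)
print(recoded, missing_rate)
```

After this step the true missing rate for the column would be about 51%, which the profiler's null count of 0 currently hides.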

Out[120]:

saturn.columns["GSEC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	8
top_value	(empty string)
top_rate	0.5108
cardinality	8
entropy	1.605
entropy_ratio	0.535

Fig 43.

Top values for GSEC.

Top values for GSEC (8 unique shown, of 8 total).
value	count	share
(empty string)	3639	51.1%
1	2767	38.8%
4	186	2.6%
0	176	2.5%
2	139	2.0%
3	100	1.4%
6	62	0.9%
5	55	0.8%

HasAudioRecordings categorical feature

Binary Y/N flag indicating whether a record has associated audio recordings. The class is heavily imbalanced toward 'Y' at 86.9% (6188 of 7124), with no nulls. Entropy ratio of 0.56 confirms the skew but the minority 'N' class still has 936 observations, enough to be usable.

Treatment: Encode as a 0/1 boolean; watch for class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[123]:

saturn.columns["HasAudioRecordings"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.8686
cardinality	2
entropy	0.5612
entropy_ratio	0.5612

Fig 44.

Top values for HasAudioRecordings.

Top values for HasAudioRecordings (2 unique shown, of 2 total).
value	count	share
Y	6188	86.9%
N	936	13.1%

NTOnline categorical feature

NTOnline is a categorical flag with only one observed value, 'Y', across 5528 non-null rows, while 22.4% of rows are null. With cardinality 1 and entropy 0, it carries no discriminative signal—presence vs. absence is the only information available.

Treatment: Drop, or replace with a binary is_present indicator if the null pattern is meaningful.

anthropic:claude-opus-4-7 · confidence high

Out[126]:

saturn.columns["NTOnline"].stats

stat	value
n	7,124
nulls	1,596 (22.4%)
unique	1
top_value	Y
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: null_rate	22.4% null
alert: imbalance	top value is 100.0% of rows

Fig 45.

Top values for NTOnline.

Top values for NTOnline (1 unique shown, of 1 total).
value	count	share
Y	5528	77.6%

RLG3 numeric feature

RLG3 is a discrete numeric column with only 7 unique values spanning 2 to 9. Its per-value counts (480, 933, 2,142, 3,279, 13, 120, 157) coincide exactly with the PrimaryReligion category counts, so it is most likely a nominal religion code (for example, 6 for Islam and 5 for Hinduism) rather than an ordinal rating or a continuous measurement. The median is 6 with IQR=1 (Q1=5, Q3=6) and skew of -0.46, and 10.6% of rows (757) fall outside the 1.5×IQR fences; both reflect the arbitrary code order and the narrow box rather than true anomalies.

Treatment: Treat as a nominal categorical (it appears redundant with PrimaryReligion); the outlier flag is a side-effect of the compressed IQR, not bad data.

anthropic:claude-opus-4-7 · confidence high

Out[129]:

saturn.columns["RLG3"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	7
min	2
max	9
mean	5.27
median	6
std	1.279
q1	5
q3	6
iqr	1
skew	-0.4551
kurtosis	2.001
n_outliers	757
outlier_rate	0.1063
zero_rate	0
alert: outliers	10.6% rows beyond 1.5 IQR

Fig 46.

Distribution of RLG3. Vertical dash marks the median.

Histogram bins for RLG3 (median: 6.0).
bin	count
2 – 2.175	480
2.175 – 2.35	0
2.35 – 2.525	0
2.525 – 2.7	0
2.7 – 2.875	0
2.875 – 3.05	0
3.05 – 3.225	0
3.225 – 3.4	0
3.4 – 3.575	0
3.575 – 3.75	0
3.75 – 3.925	0
3.925 – 4.1	933
4.1 – 4.275	0
4.275 – 4.45	0
4.45 – 4.625	0
4.625 – 4.8	0
4.8 – 4.975	0
4.975 – 5.15	2142
5.15 – 5.325	0
5.325 – 5.5	0
5.5 – 5.675	0
5.675 – 5.85	0
5.85 – 6.025	3279
6.025 – 6.2	0
6.2 – 6.375	0
6.375 – 6.55	0
6.55 – 6.725	0
6.725 – 6.9	0
6.9 – 7.075	13
7.075 – 7.25	0
7.25 – 7.425	0
7.425 – 7.6	0
7.6 – 7.775	0
7.775 – 7.95	0
7.95 – 8.125	120
8.125 – 8.3	0
8.3 – 8.475	0
8.475 – 8.65	0
8.65 – 8.825	0
8.825 – 9	157

RLG3PC numeric feature

RLG3PC is an integer-coded feature with 8 distinct values spanning 1 to 9 and a tight IQR of 1 (Q1=5, Q3=6). Its per-value counts match the PrimaryReligionPC category counts one for one (for example, 3,105 rows at code 6 and 3,105 rows labelled Islam), so it reads as a nominal religion code rather than an ordinal scale. The median is 5 and skew is -0.95; 14.3% of rows (1,022) fall outside the IQR fences, split between codes 1 and 2 below (806 rows) and codes 8 and 9 above (216 rows). No nulls or zeros are present.

Treatment: Treat as a nominal categorical rather than continuous; the outlier rate reflects the narrow IQR, not errors.

anthropic:claude-opus-4-7 · confidence high

Out[132]:

saturn.columns["RLG3PC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	8
min	1
max	9
mean	5.079
median	5
std	1.52
q1	5
q3	6
iqr	1
skew	-0.9463
kurtosis	1.703
n_outliers	1,022
outlier_rate	0.1435
zero_rate	0
alert: outliers	14.3% rows beyond 1.5 IQR

Fig 47.

Distribution of RLG3PC. Vertical dash marks the median.

Histogram bins for RLG3PC (median: 5.0).
bin	count
1 – 1.2	332
1.2 – 1.4	0
1.4 – 1.6	0
1.6 – 1.8	0
1.8 – 2	0
2 – 2.2	474
2.2 – 2.4	0
2.4 – 2.6	0
2.6 – 2.8	0
2.8 – 3	0
3 – 3.2	0
3.2 – 3.4	0
3.4 – 3.6	0
3.6 – 3.8	0
3.8 – 4	0
4 – 4.2	666
4.2 – 4.4	0
4.4 – 4.6	0
4.6 – 4.8	0
4.8 – 5	0
5 – 5.2	2296
5.2 – 5.4	0
5.4 – 5.6	0
5.6 – 5.8	0
5.8 – 6	0
6 – 6.2	3105
6.2 – 6.4	0
6.4 – 6.6	0
6.6 – 6.8	0
6.8 – 7	0
7 – 7.2	35
7.2 – 7.4	0
7.4 – 7.6	0
7.6 – 7.8	0
7.8 – 8	0
8 – 8.2	62
8.2 – 8.4	0
8.4 – 8.6	0
8.6 – 8.8	0
8.8 – 9	154

RLG3PGAC numeric feature

RLG3PGAC is a numeric column with 8 distinct integer values spanning 1 to 9. Its per-value counts match the PrimaryReligionPGAC category counts one for one (for example, 3,247 rows at code 6 and 3,247 rows labelled Islam), so it is most likely a nominal religion code rather than a rating or a continuous measurement. The median is 5.5 because exactly half the rows (3,562) carry a code of 5 or lower; with an IQR of just 1 (Q1=5, Q3=6), the 776 rows (10.9%) at codes 1, 2, 8, and 9 fall outside the Tukey fences. The skew (-0.46) and mean (5.27) depend on the arbitrary code order and carry little meaning.

Treatment: Treat as a nominal categorical and one-hot encode rather than scaling as continuous.

anthropic:claude-opus-4-7 · confidence high

Out[135]:

saturn.columns["RLG3PGAC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	8
min	1
max	9
mean	5.272
median	5.5
std	1.296
q1	5
q3	6
iqr	1
skew	-0.4637
kurtosis	2.032
n_outliers	776
outlier_rate	0.1089
zero_rate	0
alert: outliers	10.9% rows beyond 1.5 IQR

Fig 48.

Distribution of RLG3PGAC. Vertical dash marks the median.

Histogram bins for RLG3PGAC (median: 5.5).
bin	count
1 – 1.2	17
1.2 – 1.4	0
1.4 – 1.6	0
1.6 – 1.8	0
1.8 – 2	0
2 – 2.2	466
2.2 – 2.4	0
2.4 – 2.6	0
2.6 – 2.8	0
2.8 – 3	0
3 – 3.2	0
3.2 – 3.4	0
3.4 – 3.6	0
3.6 – 3.8	0
3.8 – 4	0
4 – 4.2	925
4.2 – 4.4	0
4.4 – 4.6	0
4.6 – 4.8	0
4.8 – 5	0
5 – 5.2	2154
5.2 – 5.4	0
5.4 – 5.6	0
5.6 – 5.8	0
5.8 – 6	0
6 – 6.2	3247
6.2 – 6.4	0
6.4 – 6.6	0
6.6 – 6.8	0
6.8 – 7	0
7 – 7.2	22
7.2 – 7.4	0
7.4 – 7.6	0
7.6 – 7.8	0
7.8 – 8	0
8 – 8.2	131
8.2 – 8.4	0
8.4 – 8.6	0
8.6 – 8.8	0
8.8 – 9	162

PrimaryReligion categorical feature

PrimaryReligion is a low-cardinality categorical with 7 distinct values across 7,124 rows and no nulls. Islam dominates at 46% (3,279 rows), followed by Hinduism (2,142) and Ethnic Religions (933); Non-Religious appears only 13 times and 157 rows are explicitly 'Unknown'. Entropy ratio of 0.68 indicates a moderately skewed but not degenerate distribution.

Treatment: One-hot encode and consider merging 'Unknown' with 'Other / Small' or treating it as a missing-value flag.

anthropic:claude-opus-4-7 · confidence high

Out[138]:

saturn.columns["PrimaryReligion"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	7
top_value	Islam
top_rate	0.4603
cardinality	7
entropy	1.92
entropy_ratio	0.6839

Fig 49.

Top values for PrimaryReligion.

Top values for PrimaryReligion (7 unique shown, of 7 total).
value	count	share
Islam	3279	46.0%
Hinduism	2142	30.1%
Ethnic Religions	933	13.1%
Buddhism	480	6.7%
Unknown	157	2.2%
Other / Small	120	1.7%
Non-Religious	13	0.2%
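
The counts in this table coincide with the non-empty bins of the RLG3 histogram reported earlier in this profile, which suggests RLG3 is the numeric code for PrimaryReligion. The sketch below pairs codes with labels purely by matching counts. That pairing is an inference from the published aggregates, not something taken from Joshua Project documentation, and a cross-tabulation on the raw rows would be the proper confirmation.

```python
# Counts copied from the RLG3 histogram (non-empty bins only).
rlg3_counts = {2: 480, 4: 933, 5: 2142, 6: 3279, 7: 13, 8: 120, 9: 157}

# Counts copied from the PrimaryReligion top-values table.
religion_counts = {
    "Islam": 3279, "Hinduism": 2142, "Ethnic Religions": 933, "Buddhism": 480,
    "Unknown": 157, "Other / Small": 120, "Non-Religious": 13,
}

# Pair each code with the label that has the same count. This works only
# because every count here is unique; it is a consistency check, not proof.
by_count = {count: label for label, count in religion_counts.items()}
inferred = {code: by_count.get(count) for code, count in sorted(rlg3_counts.items())}

for code, label in inferred.items():
    print(code, label)
```

If the pairing holds on the raw data, one of the two columns is redundant for modelling.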

PrimaryReligionPC categorical feature

Categorical label assigning each of 7124 rows to one of 8 primary religion categories, with no nulls. Islam dominates at 3105 rows (43.6%) followed by Hinduism at 2296, while Non-Religious (35) and Other/Small (62) are rare; entropy ratio of 0.68 indicates moderate concentration in the top two classes. 154 rows are explicitly 'Unknown', a category worth treating distinctly from missing.

Treatment: One-hot encode the 8 levels and keep 'Unknown' as its own category rather than imputing.

anthropic:claude-opus-4-7 · confidence high

Out[141]:

saturn.columns["PrimaryReligionPC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	8
top_value	Islam
top_rate	0.4359
cardinality	8
entropy	2.051
entropy_ratio	0.6838

Fig 50.

Top values for PrimaryReligionPC.

Top values for PrimaryReligionPC (8 unique shown, of 8 total).
value	count	share
Islam	3105	43.6%
Hinduism	2296	32.2%
Ethnic Religions	666	9.3%
Buddhism	474	6.7%
Christianity	332	4.7%
Unknown	154	2.2%
Other / Small	62	0.9%
Non-Religious	35	0.5%

PrimaryReligionPGAC categorical feature

Categorical label for the primary religion of a People Group Across Countries (PGAC) record, with 8 distinct values across 7,124 rows and no nulls. Islam dominates at 45.6% (3,247), followed by Hinduism (2,154) and Ethnic Religions (925); Christianity is rare at just 17 rows, as one would expect in a catalogue where every row is flagged 'Unreached'. Entropy ratio of 0.65 indicates moderate concentration on the top categories.

Treatment: One-hot or target-encode for modelling; consider folding the 'Unknown', 'Non-Religious', and 'Christianity' tails into 'Other'.

anthropic:claude-opus-4-7 · confidence high

Out[144]:

saturn.columns["PrimaryReligionPGAC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	8
top_value	Islam
top_rate	0.4558
cardinality	8
entropy	1.955
entropy_ratio	0.6516

Fig 51.

Top values for PrimaryReligionPGAC.

Top values for PrimaryReligionPGAC (8 unique shown, of 8 total).
value	count	share
Islam	3247	45.6%
Hinduism	2154	30.2%
Ethnic Religions	925	13.0%
Buddhism	466	6.5%
Unknown	162	2.3%
Other / Small	131	1.8%
Non-Religious	22	0.3%
Christianity	17	0.2%

RLG4 numeric feature

RLG4 is a sparse numeric feature populated for only ~7.6% of rows (null_rate 0.9239) with just 18 distinct integer values ranging from 10 to 39. It shares its null count (6,582) and cardinality (18) with ReligionSubdivision, which suggests it is the numeric code for that field rather than a score. Skew is positive (1.05) because of a small cluster of codes above 31, even though the mean (18.19) sits below the median (20); with Q1=14, Q3=20, and an IQR of 6, the 30 flagged outliers (5.5% of present values) are exactly the rows with codes above the upper fence of 29.

Treatment: Treat as a nominal code alongside ReligionSubdivision; add a missingness indicator rather than imputing a numeric value, given 92% nulls and a small discrete value set.

anthropic:claude-opus-4-7 · confidence medium

Out[147]:

saturn.columns["RLG4"].stats

stat	value
n	7,124
nulls	6,582 (92.4%)
unique	18
min	10
max	39
mean	18.19
median	20
std	6.472
q1	14
q3	20
iqr	6
skew	1.051
kurtosis	1.474
n_outliers	30
outlier_rate	0.05535
zero_rate	0
alert: null_rate	92.4% null
alert: outliers	5.5% rows beyond 1.5 IQR

Fig 52.

Distribution of RLG4. Vertical dash marks the median.

Histogram bins for RLG4 (median: 20.0).
bin	count
10 – 11.26	107
11.26 – 12.52	9
12.52 – 13.78	0
13.78 – 15.04	120
15.04 – 16.3	1
16.3 – 17.57	0
17.57 – 18.83	20
18.83 – 20.09	160
20.09 – 21.35	2
21.35 – 22.61	3
22.61 – 23.87	11
23.87 – 25.13	79
25.13 – 26.39	0
26.39 – 27.65	0
27.65 – 28.91	0
28.91 – 30.17	0
30.17 – 31.43	0
31.43 – 32.7	4
32.7 – 33.96	0
33.96 – 35.22	2
35.22 – 36.48	8
36.48 – 37.74	6
37.74 – 39	10

ReligionSubdivision categorical feature

A sub-classification of religion (denomination/sect), with 18 distinct values like Sunni, Judaism, Sikhism, Tibetan, and Theravada. The column is 92.39% null, so it is only populated for the small subset of records (542 rows) where a finer-grained religious branch applies. Among those non-null rows, Sunni leads at 29.52% (160 occurrences), and the entropy ratio of 0.72 indicates the populated values are spread fairly evenly across branches. The share column in the top-values table is computed over all 7,124 rows, which is why Sunni appears there as 2.2%.

Treatment: Treat missingness as its own category and one-hot encode, or roll up into the parent PrimaryReligion field before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[150]:

saturn.columns["ReligionSubdivision"].stats

stat	value
n	7,124
nulls	6,582 (92.4%)
unique	18
top_value	Sunni
top_rate	0.2952
cardinality	18
entropy	2.984
entropy_ratio	0.7157
alert: null_rate	92.4% null

Fig 53.

Top values for ReligionSubdivision.

Top values for ReligionSubdivision (18 unique shown, of 18 total).
value	count	share
Sunni	160	2.2%
Judaism	120	1.7%
Sikhism	68	1.0%
Tibetan	58	0.8%
Theravada	49	0.7%
Shia	20	0.3%
Zoroastrianism	11	0.2%
Jainism	11	0.2%
Mahayana	9	0.1%
Prakriti	9	0.1%
Kirati	8	0.1%
Mandaeism	6	0.1%
Druze	4	0.1%
Baha'i	3	0.0%
Shia Imami Ismaili	2	0.0%
Syncretized	2	0.0%
Lingayat	1	0.0%
Animism	1	0.0%

PCIslam numeric feature

PCIslam appears to be a percentage-style indicator of Islamic affiliation, bounded between 0 and 100 with a near-zero null rate (0.0013). The distribution is starkly bimodal rather than continuous: 47.1% of rows are exactly zero, the median is 0.28, yet Q3 sits at 99.99, producing a kurtosis of -1.93 and an IQR spanning nearly the full range. Mean (45.2) and std (48.2) confirm the mass is piled at the extremes rather than around the center.

Treatment: Treat as bimodal: consider binarizing (0 vs >0) or binning rather than using raw value in linear models.

anthropic:claude-opus-4-7 · confidence high
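
One way to read the binning suggestion above is a three-band bucket. The sketch below is illustrative only: the function name `bin_percentage` and the cut points 5 and 95 are assumptions, chosen because the histogram puts most of the mass below 2.5 or above 97.5. The sample values are hypothetical apart from the reported median and Q3.

```python
def bin_percentage(pct, low=5.0, high=95.0):
    """Bucket a 0-100 percentage into three bands.

    `low` and `high` are illustrative cut points (assumptions), not values
    taken from the profile.
    """
    if pct is None:
        return "missing"
    if pct < low:
        return "near_zero"
    if pct > high:
        return "near_total"
    return "mixed"


# The minimum, the reported median, a mid-range value, the reported Q3, and a null.
sample = [0.0, 0.2753, 42.0, 99.99, None]
bands = [bin_percentage(p) for p in sample]
print(bands)
```

A categorical band like this avoids asking a linear model to interpolate across the nearly empty middle of the range.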

Out[153]:

saturn.columns["PCIslam"].stats

stat	value
n	7,124
nulls	9 (0.1%)
unique	902
min	0
max	100
mean	45.22
median	0.2753
std	48.22
q1	0
q3	99.99
iqr	99.99
skew	0.1703
kurtosis	-1.935
n_outliers	0
outlier_rate	0
zero_rate	0.4713

Fig 54.

Distribution of PCIslam. Vertical dash marks the median.

Histogram bins for PCIslam (median: 0.275292498279422).
bin	count
0 – 2.5	3635
2.5 – 5	26
5 – 7.5	22
7.5 – 10	13
10 – 12.5	17
12.5 – 15	9
15 – 17.5	9
17.5 – 20	8
20 – 22.5	20
22.5 – 25	1
25 – 27.5	11
27.5 – 30	5
30 – 32.5	21
32.5 – 35	10
35 – 37.5	7
37.5 – 40	7
40 – 42.5	12
42.5 – 45	2
45 – 47.5	4
47.5 – 50	1
50 – 52.5	9
52.5 – 55	3
55 – 57.5	4
57.5 – 60	5
60 – 62.5	11
62.5 – 65	3
65 – 67.5	15
67.5 – 70	14
70 – 72.5	30
72.5 – 75	13
75 – 77.5	25
77.5 – 80	22
80 – 82.5	39
82.5 – 85	20
85 – 87.5	43
87.5 – 90	45
90 – 92.5	103
92.5 – 95	109
95 – 97.5	226
97.5 – 100	2536

PCNonReligious numeric feature

PCNonReligious appears to be the percentage of non-religious individuals in each record, but the distribution is dominated by zeros — 87.5% of values are exactly 0 and the entire IQR collapses to 0. The remaining tail stretches to 99.0 with skew of 9.1 and kurtosis of 125.3, producing 886 outliers (12.5% of rows). Mean (1.02) sits far above median (0), so any modelling that assumes symmetry will be misled.

Treatment: Treat as zero-inflated; consider a binary is_nonzero flag plus a log1p transform of the positive tail.

anthropic:claude-opus-4-7 · confidence high
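
The zero-inflated treatment above splits one column into two features: whether any non-religious share is recorded, and how large it is on a compressed scale. A minimal sketch follows, assuming nulls should propagate as None; the helper name `zero_inflated_features` and the sample values are hypothetical.

```python
import math


def zero_inflated_features(pct):
    """Split a zero-heavy percentage into a presence flag and a log1p magnitude."""
    if pct is None:
        return {"is_nonzero": None, "log1p_value": None}
    return {"is_nonzero": int(pct > 0), "log1p_value": math.log1p(pct)}


# Hypothetical values spanning the reported range (min 0, max 99), plus a null.
sample = [0.0, 0.5, 7.0, 99.0, None]
features = [zero_inflated_features(p) for p in sample]
```

Because log1p(0) is 0, the magnitude feature stays well defined for the 87.5% of rows that are exactly zero.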

Out[156]:

saturn.columns["PCNonReligious"].stats

stat	value
n	7,124
nulls	23 (0.3%)
unique	152
min	0
max	99
mean	1.016
median	0
std	4.549
q1	0
q3	0
iqr	0
skew	9.105
kurtosis	125.3
n_outliers	886
outlier_rate	0.1248
zero_rate	0.8752
alert: high_skew	skew=+9.11
alert: outliers	12.5% rows beyond 1.5 IQR

Fig 55.

Distribution of PCNonReligious. Vertical dash marks the median.

Histogram bins for PCNonReligious (median: 0.0).
bin	count
0 – 2.475	6442
2.475 – 4.95	159
4.95 – 7.425	222
7.425 – 9.9	34
9.9 – 12.38	73
12.38 – 14.85	17
14.85 – 17.32	48
17.32 – 19.8	18
19.8 – 22.28	32
22.28 – 24.75	3
24.75 – 27.23	9
27.23 – 29.7	4
29.7 – 32.18	12
32.18 – 34.65	5
34.65 – 37.12	3
37.12 – 39.6	2
39.6 – 42.08	2
42.08 – 44.55	1
44.55 – 47.02	2
47.02 – 49.5	2
49.5 – 51.98	3
51.98 – 54.45	1
54.45 – 56.93	2
56.93 – 59.4	1
59.4 – 61.88	0
61.88 – 64.35	0
64.35 – 66.83	0
66.83 – 69.3	1
69.3 – 71.78	0
71.78 – 74.25	0
74.25 – 76.73	0
76.73 – 79.2	0
79.2 – 81.67	0
81.67 – 84.15	0
84.15 – 86.62	0
86.62 – 89.1	0
89.1 – 91.58	0
91.58 – 94.05	0
94.05 – 96.53	2
96.53 – 99	1

PCUnknown numeric feature

PCUnknown is a numeric column expressing what looks like a percentage (range 0-100) of 'unknown' classification, with 92.8% of values being zero and a median/Q1/Q3 all at 0. The distribution is extremely right-skewed (skew 6.45, kurtosis 39.85); because the IQR is 0, the 510 flagged outliers (7.2%) are simply every non-zero row. Those non-zero values are not a smooth tail: 152 of them sit in the top 97.5–100 histogram bin. With 388 unique values and only 0.35% nulls, the column carries sparse but potentially meaningful signal.

Treatment: Binarize (zero vs non-zero) or log-transform the non-zero tail before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[159]:

saturn.columns["PCUnknown"].stats

stat	value
n	7,124
nulls	25 (0.4%)
unique	388
min	0
max	100
mean	2.28
median	0
std	14.59
q1	0
q3	0
iqr	0
skew	6.454
kurtosis	39.85
n_outliers	510
outlier_rate	0.07184
zero_rate	0.9282
alert: high_skew	skew=+6.45
alert: outliers	7.2% rows beyond 1.5 IQR

Fig 56.

Distribution of PCUnknown. Vertical dash marks the median.

Histogram bins for PCUnknown (median: 0.0).
bin	count
0 – 2.5	6885
2.5 – 5	20
5 – 7.5	9
7.5 – 10	10
10 – 12.5	6
12.5 – 15	2
15 – 17.5	3
17.5 – 20	0
20 – 22.5	0
22.5 – 25	2
25 – 27.5	0
27.5 – 30	2
30 – 32.5	0
32.5 – 35	2
35 – 37.5	0
37.5 – 40	1
40 – 42.5	0
42.5 – 45	0
45 – 47.5	0
47.5 – 50	0
50 – 52.5	0
52.5 – 55	0
55 – 57.5	0
57.5 – 60	0
60 – 62.5	1
62.5 – 65	0
65 – 67.5	0
67.5 – 70	0
70 – 72.5	0
72.5 – 75	0
75 – 77.5	0
77.5 – 80	0
80 – 82.5	0
82.5 – 85	1
85 – 87.5	1
87.5 – 90	0
90 – 92.5	0
92.5 – 95	0
95 – 97.5	2
97.5 – 100	152

SecurityLevel numeric feature

SecurityLevel is an ordinal/categorical code stored as numeric, with only 3 distinct values spanning 0 to 2 across 7,124 complete rows. The distribution is heavily concentrated at the top tier (median, Q1, and Q3 all equal 2.0, mean 1.595), yet 15.6% of rows are 0 and the IQR-based outlier check flags 24.9% of records — an artifact of the degenerate IQR of 0 rather than true anomalies. Strong negative skew (-1.47) confirms the mass sits at level 2.

Treatment: Treat as a 3-level ordinal category (one-hot or ordered encode); ignore the outlier flag since IQR is zero.

anthropic:claude-opus-4-7 · confidence high
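
The outlier artifact described above is easy to reproduce from the published aggregates. The sketch assumes saturn applies the conventional Tukey rule (1.5 × IQR beyond Q1 and Q3), which the alert wording implies but the profile does not document; `tukey_outlier_count` is a name made up for this illustration.

```python
def tukey_outlier_count(value_counts, q1, q3, k=1.5):
    """Count rows outside [q1 - k*IQR, q3 + k*IQR] given a {value: count} table."""
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sum(n for value, n in value_counts.items() if value < lower or value > upper)


# Counts from the SecurityLevel histogram; q1 and q3 from its stats table.
security_counts = {0: 1114, 1: 657, 2: 5353}
n_flagged = tukey_outlier_count(security_counts, q1=2, q3=2)
print(n_flagged)  # 1771, matching n_outliers
```

With Q1 = Q3 = 2 both fences collapse onto 2, so every row at level 0 or 1 is flagged regardless of how common those levels are.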

Out[162]:

saturn.columns["SecurityLevel"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	3
min	0
max	2
mean	1.595
median	2
std	0.7442
q1	2
q3	2
iqr	0
skew	-1.466
kurtosis	0.4048
n_outliers	1,771
outlier_rate	0.2486
zero_rate	0.1564
alert: outliers	24.9% rows beyond 1.5 IQR

Fig 57.

Distribution of SecurityLevel. Vertical dash marks the median.

Histogram bins for SecurityLevel (median: 2.0).
bin	count
0 – 0.05	1114
0.05 – 0.1	0
0.1 – 0.15	0
0.15 – 0.2	0
0.2 – 0.25	0
0.25 – 0.3	0
0.3 – 0.35	0
0.35 – 0.4	0
0.4 – 0.45	0
0.45 – 0.5	0
0.5 – 0.55	0
0.55 – 0.6	0
0.6 – 0.65	0
0.65 – 0.7	0
0.7 – 0.75	0
0.75 – 0.8	0
0.8 – 0.85	0
0.85 – 0.9	0
0.9 – 0.95	0
0.95 – 1	0
1 – 1.05	657
1.05 – 1.1	0
1.1 – 1.15	0
1.15 – 1.2	0
1.2 – 1.25	0
1.25 – 1.3	0
1.3 – 1.35	0
1.35 – 1.4	0
1.4 – 1.45	0
1.45 – 1.5	0
1.5 – 1.55	0
1.55 – 1.6	0
1.6 – 1.65	0
1.65 – 1.7	0
1.7 – 1.75	0
1.75 – 1.8	0
1.8 – 1.85	0
1.85 – 1.9	0
1.9 – 1.95	0
1.95 – 2	5353

LRTop100 categorical label

Binary Y/N flag indicating membership in some 'LRTop100' set, with exactly 100 rows marked 'Y' out of 7124 — strongly suggesting a curated top-100 list. The distribution is severely imbalanced (98.6% 'N', entropy ratio 0.107), which is flagged as an imbalance alert. No nulls are present.

Treatment: Use as a binary indicator; if modelling, apply class-imbalance handling (stratification or reweighting).

anthropic:claude-opus-4-7 · confidence high

Out[165]:

saturn.columns["LRTop100"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	N
top_rate	0.986
cardinality	2
entropy	0.1065
entropy_ratio	0.1065
alert: imbalance	top value is 98.6% of rows

Fig 58.

Top values for LRTop100.

Top values for LRTop100 (2 unique shown, of 2 total).
value	count	share
N	7024	98.6%
Y	100	1.4%

PhotoAddress text metadata

PhotoAddress holds single-token image filenames following a 'pXXXXX.jpg' pattern (one_word_rate 1.0, len_max 13). Coverage is poor: 1970 of 7124 rows are empty strings and duplicate_rate is 0.596, so the same photo is reused across many records (e.g., p19007.jpg appears 90 times). Only 2880 unique values back 7124 rows, suggesting shared stock images or a many-to-one photo lookup rather than a per-row asset.

Treatment: Treat as a file reference: drop from modelling, or join to an image table after handling the ~1970 empty strings.

anthropic:claude-opus-4-7 · confidence high

Out[168]:

saturn.columns["PhotoAddress"].stats

stat	value
n	7,124
nulls	1 (0.0%)
unique	2,880
len_min	0
len_max	13
len_mean	7.26
len_median	10
len_p95	10
word_mean	1
word_median	1
n_empty	1,970
n_duplicates	4,243
duplicate_rate	0.5957
vocab_size	2,879
readability_flesch_mean	84.01
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	59.6% duplicate strings

Fig 59.

Character-length distribution for PhotoAddress.

Character-length distribution for PhotoAddress (mean: 7.2600028078057).
chars	count
0	1970
10	5092
13	61

All other length bins are empty.

PhotoCredits categorical metadata

PhotoCredits captures the attribution string for an associated image, with 851 distinct credits across 7,124 rows. The column is dominated by missing-style values: 1,970 rows (27.7%) are empty strings and another 1,496 (21.0%) are 'Anonymous', so nearly half of all rows (48.7%) lack a real attribution. The remaining tail is long and idiosyncratic, mixing organisations ('Operation China, Asia Harvest'), individuals ('Isudas', 'Kerry Olson'), and platform tags ('Steve Evans - Flickr').

Treatment: Treat empty and 'Anonymous' as missing and keep only as provenance metadata; do not use as a model feature.

anthropic:claude-opus-4-7 · confidence high

Out[171]:

saturn.columns["PhotoCredits"].stats

stat	value
n	7,124
nulls	10 (0.1%)
unique	851
top_value
top_rate	0.2769
cardinality	851
entropy	5.584
entropy_ratio	0.5737
alert: long_tail	463 singleton categories

Fig 60.

Top values for PhotoCredits.

Top values for PhotoCredits (20 unique shown, of 851 total).
value	count	share
	1970	27.7%
Anonymous	1496	21.0%
Operation China, Asia Harvest	263	3.7%
Isudas	212	3.0%
Kate Nelson/AusAID - Wikimedia	90	1.3%
manothegreek	77	1.1%
Kerry Olson	76	1.1%
Asia Harvest-Operation Myanmar	69	1.0%
Steve Evans - Flickr	62	0.9%
Final Sudan	43	0.6%
Rod Waddington - Flickr	42	0.6%
Peoples of Laos, Asia Harvest	41	0.6%
COMIBAM / Sepal	37	0.5%
Link Up Africa	36	0.5%
CharlesFred - Flickr	36	0.5%
Asia Harvest	36	0.5%
Hamed Saber - Flickr	36	0.5%
Peoples of the Buddhist World, Asia Harvest	36	0.5%
N-Y-C - Pixabay	34	0.5%
pxhere	34	0.5%

PhotoCreditURL categorical metadata

URL string crediting the source of an associated photo, with 774 distinct values and a long tail; the most common real URL is https://www.asiaharvest.org (443 rows). 36.0% of rows are null (2,565) and another 1,970 are empty strings, which is 27.7% of all rows or 43.21% of non-null rows (the basis for top_rate). Together roughly 64% of rows carry no usable credit. Remaining values mix organisational sites (newcovenantmissions, createinternational), shorteners (tinyurl), and stock-photo hosts (pixabay, pxhere, flickr).

Treatment: Drop for modelling; if provenance matters, parse to domain and treat as a low-coverage attribution field.

anthropic:claude-opus-4-7 · confidence high
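
Parsing each credit URL down to its domain, as the treatment suggests, can be done with the standard library. In this sketch `credit_domain` is a hypothetical helper; the URLs are taken from the top-values table, and the empty string and None stand in for the two "no credit" encodings.

```python
from urllib.parse import urlparse


def credit_domain(url):
    """Return the lower-cased host without a leading 'www.', or None if blank."""
    if not url:  # covers both None and the empty string
        return None
    host = urlparse(url).hostname
    if host is None:
        return None
    return host[4:] if host.startswith("www.") else host


urls = [
    "https://www.asiaharvest.org",
    "https://www.globalROAR.org",
    "https://flickr.com/photos/rosluc6460/3260871722",
    "",
    None,
]
domains = [credit_domain(u) for u in urls]
print(domains)
```

Using `hostname` rather than `netloc` lower-cases the host, so 'globalROAR.org' and 'globalroar.org' collapse into one category. Shortened links such as tinyurl would still need resolving separately.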

Out[174]:

saturn.columns["PhotoCreditURL"].stats

stat	value
n	7,124
nulls	2,565 (36.0%)
unique	774
top_value
top_rate	0.4321
cardinality	774
entropy	5.389
entropy_ratio	0.5616
alert: long_tail	465 singleton categories
alert: null_rate	36.0% null

Fig 61.

Top values for PhotoCreditURL.

Top values for PhotoCreditURL (20 unique shown, of 774 total).
value	count	share
	1970	27.7%
https://www.asiaharvest.org	443	6.2%
https://tinyurl.com/89ffm33y	90	1.3%
https://www.pesquisas.org.br/	37	0.5%
https://flickr.com/photos/44124425616@N01/424972149/	36	0.5%
https://pixabay.com/photos/fashion-asian-japanese-chinese-4257900/	34	0.5%
https://pxhere.com/en/photo/637533	34	0.5%
https://www.newcovenantmissions.org	31	0.4%
https://www.createinternational.com	30	0.4%
https://pixabay.com/photos/person-face-people-portrait-smile-5039573/	27	0.4%
https://www.globalROAR.org	26	0.4%
https://pixabay.com/photos/bread-marrakesh-food-morocco-1166272/	25	0.4%
https://pixabay.com/photos/pakistani-model-boy-model-pakistan-3770152/	24	0.3%
https://commons.wikimedia.org/wiki/File:Sudanese_arab_from_manasir_tribe.jpg	23	0.3%
https://pixabay.com/photos/refugee-afghan-forest-geocaching-1189087/	21	0.3%
https://flickr.com/photos/91418149@N03/14204194758	20	0.3%
https://pixabay.com/pl/photos/m%c4%99%c5%bcczy%c5%bani-turban-portret-facet-2146800/	19	0.3%
https://commons.wikimedia.org/wiki/File:Algerian_man_with_turban.jpg	18	0.3%
https://commons.wikimedia.org/wiki/File:Iraqi_man_on_Baghdad_street.jpg	18	0.3%
https://flickr.com/photos/rosluc6460/3260871722	18	0.3%

PhotoCreativeCommons categorical feature

Binary Y/N flag indicating whether a photo carries a Creative Commons licence. The vast majority (top_rate 0.7981) are 'N' with only 1437 'Y' values, and nulls are negligible (null_rate 0.0007). Class imbalance is notable but not extreme.

Treatment: Encode as a 0/1 boolean; impute the handful of nulls with the mode 'N'.

anthropic:claude-opus-4-7 · confidence high

Out[177]:

saturn.columns["PhotoCreativeCommons"].stats

stat	value
n	7,124
nulls	5 (0.1%)
unique	2
top_value	N
top_rate	0.7981
cardinality	2
entropy	0.7256
entropy_ratio	0.7256

Fig 62.

Top values for PhotoCreativeCommons.

Top values for PhotoCreativeCommons (2 unique shown, of 2 total).
value	count	share
N	5682	79.8%
Y	1437	20.2%

PhotoCopyright categorical feature

Binary Y/N flag indicating whether a photo carries copyright restrictions, with only 2 unique values. 'N' dominates at 80.6% of non-null rows (5,735 of 7,112), which is 80.5% of all 7,124 rows. Nulls are negligible (12 rows, 0.17%) and the entropy ratio of 0.71 reflects the moderate class imbalance. No anomalies beyond the expected skew toward unrestricted photos.

Treatment: Encode as a boolean (Y=1, N=0) and impute the handful of nulls with the mode.

anthropic:claude-opus-4-7 · confidence high

Out[180]:

saturn.columns["PhotoCopyright"].stats

stat	value
n	7,124
nulls	12 (0.2%)
unique	2
top_value	N
top_rate	0.8064
cardinality	2
entropy	0.709
entropy_ratio	0.709

Fig 63.

Top values for PhotoCopyright.

Top values for PhotoCopyright (2 unique shown, of 2 total).
value	count	share
N	5735	80.5%
Y	1377	19.3%

PhotoPermission categorical feature

Binary opt-in flag for photo permission, stored as 'Y'/'N'. The column is heavily skewed toward 'N' at 80.4% of non-null rows (5,715 of 7,110), with a near-zero null rate of 0.2% (14 rows). Watch the case inconsistency: 2 records use lowercase 'y' alongside 1,393 uppercase 'Y', so case-sensitive joins or filters will miscount.

Treatment: Normalize case (upper) and encode as boolean before modelling.

anthropic:claude-opus-4-7 · confidence high
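
A small normalising function covers the case fix and the boolean encoding in one step. This is a sketch under the assumption that any value other than Y or N (in either case) should become None rather than raise; `yn_to_bool` and the sample rows are hypothetical.

```python
def yn_to_bool(value):
    """Map 'Y'/'y' to True, 'N'/'n' to False, anything else (incl. None) to None."""
    if value is None:
        return None
    cleaned = value.strip().upper()
    return {"Y": True, "N": False}.get(cleaned)


# Hypothetical rows covering the three observed values plus a null.
sample = ["N", "Y", "y", None]
flags = [yn_to_bool(v) for v in sample]
print(flags)  # [False, True, True, None]
```

The same helper would apply to the other Y/N photo flags, which share the encoding but show no mixed case in this profile.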

Out[183]:

saturn.columns["PhotoPermission"].stats

stat	value
n	7,124
nulls	14 (0.2%)
unique	3
top_value	N
top_rate	0.8038
cardinality	3
entropy	0.7173
entropy_ratio	0.4526

Fig 64.

Top values for PhotoPermission.

Top values for PhotoPermission (3 unique shown, of 3 total).
value	count	share
N	5715	80.2%
Y	1393	19.6%
y	2	0.0%
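
The entropy and entropy_ratio figures reported for this column are consistent with Shannon entropy in bits, divided by log2 of the cardinality. That reading is inferred from the numbers in this profile, not from saturn's documentation. The sketch below recomputes both from the three counts in the table above.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Non-null counts for N, Y, and y from the table above.
counts = [5715, 1393, 2]
h = entropy_bits(counts)
ratio = h / math.log2(len(counts))  # normalise by the maximum for 3 categories
print(round(h, 4), round(ratio, 4))
```

This also explains why the stray lowercase 'y' matters: it raises the cardinality from 2 to 3, so the ratio is divided by log2(3) instead of 1 and drops well below the entropy itself.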

ProfileTextExists categorical feature

A binary flag indicating whether a profile text exists, with values Y/N. The column is severely imbalanced: 6888 of 7124 rows (96.7%) are Y, leaving only 236 N, yielding a low entropy ratio of 0.21.

Treatment: Encode as a 0/1 indicator but expect minimal predictive signal due to severe class imbalance.

anthropic:claude-opus-4-7 · confidence high

Out[186]:

saturn.columns["ProfileTextExists"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.9669
cardinality	2
entropy	0.2098
entropy_ratio	0.2098
alert: imbalance	top value is 96.7% of rows

Fig 65.

Top values for ProfileTextExists.

Top values for ProfileTextExists (2 unique shown, of 2 total).
value	count	share
Y	6888	96.7%
N	236	3.3%

CountOfCountries numeric feature

Likely a per-row count of distinct countries associated with each record, ranging from 1 to 164 across 7,124 rows with no nulls. The distribution is severely right-skewed (skew 5.67, kurtosis 33.17): the median is just 2 and Q3 is 4, yet the mean is 8.11 and 16.98% of rows flag as outliers. Records spanning many countries drag the mean far above typical values, and the tail is not smooth: 151 rows fall in the top histogram bin (159.9–164).

Treatment: Log-transform or cap at a high quantile before modelling to tame the heavy right tail.

anthropic:claude-opus-4-7 · confidence high
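
The two options in the treatment above can be combined or used separately. The sketch below clips at a cap and then applies log1p; the cap of 50 is an arbitrary illustration (in practice it would be read off a high quantile of the real column), and both the function name `cap_and_log` and the sample counts are hypothetical.

```python
import math


def cap_and_log(values, cap):
    """Clip each count at `cap`, then apply log1p to compress what remains."""
    return [math.log1p(min(v, cap)) for v in values]


# Hypothetical counts spanning the reported range (min 1, median 2, max 164).
sample = [1, 2, 4, 27, 164]
# cap=50 is an assumption for illustration only.
transformed = cap_and_log(sample, cap=50)
```

Capping flattens the cluster of rows near 164 to a single ceiling value, so whether to cap depends on whether that cluster is signal or noise for the task at hand.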

Out[189]:

saturn.columns["CountOfCountries"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	39
min	1
max	164
mean	8.108
median	2
std	24.27
q1	1
q3	4
iqr	3
skew	5.672
kurtosis	33.17
n_outliers	1,210
outlier_rate	0.1698
zero_rate	0
alert: high_skew	skew=+5.67
alert: outliers	17.0% rows beyond 1.5 IQR

Fig 66.

Distribution of CountOfCountries. Vertical dash marks the median.

Histogram bins for CountOfCountries (median: 2.0).
bin	count
1 – 5.075	5710
5.075 – 9.15	272
9.15 – 13.23	248
13.23 – 17.3	163
17.3 – 21.38	150
21.38 – 25.45	139
25.45 – 29.53	123
29.53 – 33.6	20
33.6 – 37.68	0
37.68 – 41.75	109
41.75 – 45.83	2
45.83 – 49.9	37
49.9 – 53.98	0
53.98 – 58.05	0
58.05 – 62.12	0
62.12 – 66.2	0
66.2 – 70.28	0
70.28 – 74.35	0
74.35 – 78.42	0
78.42 – 82.5	0
82.5 – 86.58	0
86.58 – 90.65	0
90.65 – 94.73	0
94.73 – 98.8	0
98.8 – 102.9	0
102.9 – 107	0
107 – 111	0
111 – 115.1	0
115.1 – 119.2	0
119.2 – 123.2	0
123.2 – 127.3	0
127.3 – 131.4	0
131.4 – 135.5	0
135.5 – 139.6	0
139.6 – 143.6	0
143.6 – 147.7	0
147.7 – 151.8	0
151.8 – 155.8	0
155.8 – 159.9	0
159.9 – 164	151

CountOfProvinces unknown other

Saturn skipped profiling on CountOfProvinces, so beyond a row count of 7124 and zero nulls, no distributional evidence is available. The name suggests an integer count of provinces per record, but unique count, range, and summary stats are all missing. Without further inspection the column's actual content and cardinality cannot be confirmed.

Treatment: Re-profile or manually inspect this column before use; saturn skipped it.

anthropic:claude-opus-4-7 · confidence low

Out[192]:

saturn.columns["CountOfProvinces"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

Longitude numeric feature

Geographic longitude coordinates spanning nearly the full global range, from -173.08 to 178.44 degrees. The distribution is left-skewed (-1.40), with the median of 75.23 sitting well above the mean of 62.80: most records lie in the eastern hemisphere, with a thinner tail of western-hemisphere points. About 4.4% of values (316 rows) fall outside the 1.5×IQR fences (roughly -30.3° to 159.3°).

Treatment: Pair with latitude for geospatial features; consider clustering or binning rather than treating as a raw scalar.

anthropic:claude-opus-4-7 · confidence high

Out[194]:

saturn.columns["Longitude"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	6,713
min	-173.1
max	178.4
mean	62.8
median	75.23
std	44.79
q1	40.81
q3	88.22
iqr	47.41
skew	-1.402
kurtosis	2.859
n_outliers	316
outlier_rate	0.04436
zero_rate	0

Fig 67.

Distribution of Longitude. Vertical dash marks the median.

Histogram bins for Longitude (median: 75.22975927090806).
bin	count
-173.1 – -164.3	6
-164.3 – -155.5	2
-155.5 – -146.7	0
-146.7 – -137.9	0
-137.9 – -129.1	0
-129.1 – -120.4	11
-120.4 – -111.6	16
-111.6 – -102.8	2
-102.8 – -93.99	8
-93.99 – -85.2	15
-85.2 – -76.42	48
-76.42 – -67.63	110
-67.63 – -58.84	29
-58.84 – -50.05	24
-50.05 – -41.26	8
-41.26 – -32.48	10
-32.48 – -23.69	0
-23.69 – -14.9	46
-14.9 – -6.112	139
-6.112 – 2.676	229
2.676 – 11.46	217
11.46 – 20.25	249
20.25 – 29.04	173
29.04 – 37.83	356
37.83 – 46.62	244
46.62 – 55.4	243
55.4 – 64.19	104
64.19 – 72.98	781
72.98 – 81.77	1584
81.77 – 90.56	970
90.56 – 99.34	313
99.34 – 108.1	728
108.1 – 116.9	135
116.9 – 125.7	161
125.7 – 134.5	52
134.5 – 143.3	36
143.3 – 152.1	41
152.1 – 160.9	8
160.9 – 169.6	3
169.6 – 178.4	23

Latitude numeric feature

Latitude values for 7,124 rows spanning -42.61 to 71.84 with a median of 25.02, consistent with geographic latitudes in degrees. The distribution leans toward the northern hemisphere (mean 23.54, skew -0.70); the 292 outliers (4.1%) are records beyond the 1.5×IQR fences, roughly south of -8.5° or north of 55.7°. No nulls and 6,696 unique values suggest near-record-level coordinates.

Treatment: Pair with longitude for geospatial features; avoid standard scaling alone, since raw degrees do not map linearly to distance (a degree of longitude shrinks toward the poles).

anthropic:claude-opus-4-7 · confidence high
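
One common way to pair latitude with longitude is to project both onto the unit sphere, which removes the discontinuity at ±180° (relevant here because Longitude runs from -173.08 to 178.44). The sketch below is a generic spherical conversion, not something the profile prescribes; `to_unit_vector` and the two test points are hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert degrees latitude/longitude to a 3-D point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Two hypothetical points on either side of the antimeridian: 2 degrees
# apart on the globe, but 358 apart as raw longitude values.
a = to_unit_vector(0.0, 179.0)
b = to_unit_vector(0.0, -179.0)
gap = math.dist(a, b)  # straight-line (chord) distance between the two points
```

The three resulting coordinates can be fed to clustering or distance-based models directly, at the cost of treating the Earth as a perfect sphere.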

Out[197]:

saturn.columns["Latitude"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	6,696
min	-42.61
max	71.84
mean	23.54
median	25.02
std	14.92
q1	15.55
q3	31.61
iqr	16.06
skew	-0.702
kurtosis	2.141
n_outliers	292
outlier_rate	0.04099
zero_rate	0

Fig 68.

Distribution of Latitude. Vertical dash marks the median.

Histogram bins for Latitude (median: 25.021597596171).
bin	count
-42.61 – -39.75	4
-39.75 – -36.89	20
-36.89 – -34.03	18
-34.03 – -31.16	18
-31.16 – -28.3	4
-28.3 – -25.44	11
-25.44 – -22.58	7
-22.58 – -19.72	15
-19.72 – -16.86	14
-16.86 – -14	18
-14 – -11.14	40
-11.14 – -8.275	45
-8.275 – -5.414	65
-5.414 – -2.553	115
-2.553 – 0.3084	74
0.3084 – 3.17	135
3.17 – 6.031	98
6.031 – 8.892	158
8.892 – 11.75	414
11.75 – 14.61	396
14.61 – 17.48	299
17.48 – 20.34	408
20.34 – 23.2	595
23.2 – 26.06	961
26.06 – 28.92	863
28.92 – 31.78	584
31.78 – 34.64	560
34.64 – 37.5	244
37.5 – 40.36	154
40.36 – 43.23	257
43.23 – 46.09	112
46.09 – 48.95	105
48.95 – 51.81	131
51.81 – 54.67	72
54.67 – 57.53	39
57.53 – 60.39	52
60.39 – 63.25	3
63.25 – 66.12	10
66.12 – 68.98	3
68.98 – 71.84	3

Ctry categorical feature

Country-of-origin or location field with 202 distinct values across 7,124 rows and zero nulls. India dominates at 28.5% (2,032 rows), followed by Pakistan (767) and China (442); the long tail covers the remaining countries, and an entropy ratio of 0.66 indicates concentrated but globally distributed coverage. The Asia skew is the headline surprise: all six of the top values are Asian countries, and Sudan in seventh place is the first non-Asian entry.

Treatment: Group rare countries into an 'Other' bucket or encode by region before modelling to avoid 200-way one-hot blow-up.

anthropic:claude-opus-4-7 · confidence high
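
A minimal sketch of the 'Other' bucket idea, using only the standard library. The helper name `collapse_rare` and the threshold of 3 are hypothetical; the profile does not recommend a specific cut-off.

```python
from collections import Counter

def collapse_rare(values, min_count):
    """Replace categories seen fewer than min_count times with 'Other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "Other" for v in values]

# Toy sample, not the real column: two countries clear the threshold, two do not.
sample = ["India"] * 5 + ["Pakistan"] * 3 + ["Chad", "Laos"]
collapsed = collapse_rare(sample, min_count=3)
```

Counting before replacing keeps the decision tied to the original frequencies, so a rare value is never rescued by the size of the 'Other' bucket itself.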

Out[200]:

saturn.columns["Ctry"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	202
top_value	India
top_rate	0.2852
cardinality	202
entropy	5.058
entropy_ratio	0.6605

Fig 69.

Top values for Ctry.

Top values for Ctry (20 unique shown, of 202 total).
value	count	share
India	2032	28.5%
Pakistan	767	10.8%
China	442	6.2%
Bangladesh	256	3.6%
Indonesia	234	3.3%
Nepal	184	2.6%
Sudan	168	2.4%
Laos	142	2.0%
Russia	115	1.6%
United States	90	1.3%
Iran	85	1.2%
Chad	81	1.1%
Malaysia	78	1.1%
Thailand	73	1.0%
Vietnam	69	1.0%
Türkiye (Turkey)	61	0.9%
Myanmar (Burma)	59	0.8%
Afghanistan	58	0.8%
Sri Lanka	55	0.8%
Canada	52	0.7%

IndigenousCode categorical feature

A binary Y/N flag indicating Indigenous status, fully populated across all 7124 rows. The distribution is imbalanced: 'Y' accounts for 79.4% (5657 rows) versus 1467 'N' rows, which is a notable skew to keep in mind for any stratified analysis.

Treatment: Encode as a binary indicator and watch for class imbalance when used as a predictor or stratifier.

anthropic:claude-opus-4-7 · confidence high
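
The entropy figures in these tables can be reproduced from the value counts, assuming Saturn reports Shannon entropy in bits and divides by log2(cardinality) for the ratio. That assumption matches the published numbers, but the profiler's implementation is not shown in this report.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# IndigenousCode counts from the value table: Y = 5657, N = 1467.
counts = [5657, 1467]
h = entropy_bits(counts)
ratio = entropy_ratio(counts)
```

For a two-value column the maximum entropy is 1 bit, so entropy and entropy_ratio coincide, as they do in the IndigenousCode stats.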

Out[203]:

saturn.columns["IndigenousCode"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.7941
cardinality	2
entropy	0.7336
entropy_ratio	0.7336

Fig 70.

Top values for IndigenousCode.

Top values for IndigenousCode (2 unique shown, of 2 total).
value	count	share
Y	5657	79.4%
N	1467	20.6%

PercentAdherents categorical feature

PercentAdherents appears to be a numeric measure (likely a percentage or rate of religious adherents) stored as strings, with 692 distinct values across 7,124 rows and no nulls. It is dominated by '0.000', which accounts for 56.2% of records; that single spike, rather than the long tail of small integer- and decimal-valued strings (467 of them singletons), is what pulls the entropy ratio down to 0.43. The format mixing whole numbers like '5.000' with fractions like '0.200' suggests these are raw values rather than binned categories.

Treatment: Cast to float and treat as a zero-inflated numeric feature rather than a category.

anthropic:claude-opus-4-7 · confidence medium
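
A small sketch of the cast, assuming the strings are plain decimals like those in the top-values table. The helper `parse_percent` is hypothetical, and treating blanks as missing is an assumption, since this column reports no nulls.

```python
def parse_percent(text):
    """Convert a numeric string such as '0.500' to float; blank or None becomes None."""
    if text is None or text.strip() == "":
        return None
    return float(text)

# Values taken from the top-values table, plus one missing entry for illustration.
raw = ["0.000", "1.000", "0.500", "0.009", None]
values = [parse_percent(t) for t in raw]
n_zero = sum(1 for v in values if v == 0.0)
```

Letting `float` raise on anything unexpected is deliberate here: a malformed string surfaces immediately instead of silently becoming a missing value.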

Out[206]:

saturn.columns["PercentAdherents"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	692
top_value	0.000
top_rate	0.5625
cardinality	692
entropy	4.046
entropy_ratio	0.4288
alert: long_tail	467 singleton categories

Fig 71.

Top values for PercentAdherents.

Top values for PercentAdherents (20 unique shown, of 692 total).
value	count	share
0.000	4007	56.2%
1.000	285	4.0%
2.000	224	3.1%
3.000	166	2.3%
0.500	159	2.2%
5.000	137	1.9%
4.000	129	1.8%
0.200	96	1.3%
0.100	88	1.2%
0.300	77	1.1%
0.010	59	0.8%
1.500	54	0.8%
0.050	52	0.7%
0.400	47	0.7%
0.020	33	0.5%
1.200	30	0.4%
0.600	29	0.4%
0.009	25	0.4%
0.800	24	0.3%
0.090	22	0.3%

PercentChristianPC categorical feature

Stored as a categorical, but the 184 distinct values are numeric strings running from '0.000' to at least '27.052' among the top 20, suggesting a percent-Christian metric cast as text (the profile does not define the PC suffix). The distribution is concentrated: '0.482' alone covers 12.2% of the 7,124 rows and the top 10 values account for 47.7% (3,400 rows), yet an entropy ratio of 0.79 indicates the long tail still carries information. No nulls, but the repeated exact decimals, each shared by hundreds of rows, hint at a lookup or pre-aggregated source rather than raw per-row measurements.

Treatment: cast to float and treat as a continuous feature; investigate the heavy spike at 0.482 before modelling.

anthropic:claude-opus-4-7 · confidence medium

Out[209]:

saturn.columns["PercentChristianPC"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	184
top_value	0.482
top_rate	0.122
cardinality	184
entropy	5.934
entropy_ratio	0.7887

Fig 72.

Top values for PercentChristianPC.

Top values for PercentChristianPC (20 unique shown, of 184 total).
value	count	share
0.482	869	12.2%
0.111	586	8.2%
0.000	374	5.2%
0.508	352	4.9%
8.571	311	4.4%
0.004	293	4.1%
5.023	169	2.4%
5.345	167	2.3%
3.733	151	2.1%
1.552	128	1.8%
0.008	127	1.8%
5.482	114	1.6%
27.052	111	1.6%
0.005	105	1.5%
1.495	90	1.3%
6.327	87	1.2%
0.731	86	1.2%
0.003	81	1.1%
0.001	79	1.1%
0.030	77	1.1%

NaturalName text label

Short ethnonym/community labels (e.g., 'Deaf', 'Turk', 'Persian', 'Japanese'), averaging 11.8 characters and 1.7 words with a median of 2 words. About 34% of rows are duplicates (2,419) and ~49% are single-word entries, with 4,705 unique values across 7,124 rows. Surprising signals: 'Deaf' tops the list at 151 occurrences, and top words include parenthetical religious qualifiers like 'traditions)', '(hindu', '(muslim' (952/477/411), suggesting many entries carry trailing tradition tags that the tokenizer split awkwardly.

Treatment: Normalize casing and strip parenthetical tradition suffixes, then treat as a categorical label.

anthropic:claude-opus-4-7 · confidence high
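
A sketch of the suffix handling, assuming the tradition tags appear as one trailing parenthetical, which is what the split tokens '(hindu', '(muslim', and 'traditions)' suggest. The helper `split_tradition` and the sample name are hypothetical.

```python
import re

# A single trailing "(...)" group, with optional surrounding whitespace.
_SUFFIX = re.compile(r"\s*\([^()]*\)\s*$")

def split_tradition(name):
    """Return (base_name, tag); tag is the trailing parenthetical text or None."""
    match = _SUFFIX.search(name)
    if not match:
        return name.strip(), None
    tag = match.group(0).strip()[1:-1]  # drop the enclosing parentheses
    return name[: match.start()].strip(), tag

base, tag = split_tradition("Example (Hindu traditions)")
```

Returning the tag instead of discarding it keeps the religious qualifier available as its own feature, so two groups that differ only by tradition are not merged by accident.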

Out[212]:

saturn.columns["NaturalName"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	4,705
len_min	1
len_max	39
len_mean	11.84
len_median	10
len_p95	27
word_mean	1.723
word_median	2
n_empty	0
n_duplicates	2,419
duplicate_rate	0.3396
vocab_size	4,343
readability_flesch_mean	56.42
emoji_rate	0
url_rate	0
one_word_rate	0.4885
allcaps_rate	0
boilerplate_rate	0
alert: one_word	48.8% rows are a single word
alert: duplicates	34.0% duplicate strings

Fig 73.

Character-length distribution for NaturalName.

Character-length distribution for NaturalName (mean: 11.84250421111735).
chars	count
1 – 2	1
2 – 3	15
3 – 4	107
4 – 5	590
5 – 6	711
6 – 7	776
7 – 8	702
8 – 9	389
9 – 10	216
10 – 10	278
10 – 11	347
11 – 12	281
12 – 13	431
13 – 14	349
14 – 15	292
15 – 16	219
16 – 17	110
17 – 18	66
18 – 19	47
19 – 20	0
20 – 21	46
21 – 22	35
22 – 23	48
23 – 24	109
24 – 25	174
25 – 26	191
26 – 27	179
27 – 28	103
28 – 29	77
29 – 30	38
30 – 30	35
30 – 31	53
31 – 32	37
32 – 33	36
33 – 34	18
34 – 35	15
35 – 36	0
36 – 37	0
37 – 38	1
38 – 39	2

NaturalPronunciation text metadata

Phonetic respellings of ethnonyms (short hyphenated pronunciation guides like 'PUR-zhun', 'jae-puh-NEEZ', and 'pahsh-TOON') accompanying some other label column, plausibly NaturalName, whose examples include 'Persian' and 'Japanese'. Values are overwhelmingly single tokens (one_word_rate 0.73, word_mean 1.28, len_mean 10.8) and 48.5% are null, so coverage is partial. Duplicates dominate among the populated values (n_duplicates 2,183, duplicate_rate 0.59), with only 1,489 unique forms across the 3,672 non-null rows, suggesting a small controlled vocabulary repeated across records.

Treatment: Treat as an optional pronunciation lookup keyed to the parent term; drop or impute before modelling given ~48% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[215]:

saturn.columns["NaturalPronunciation"].stats

stat	value
n	7,124
nulls	3,452 (48.5%)
unique	1,489
len_min	2
len_max	42
len_mean	10.77
len_median	10
len_p95	21
word_mean	1.281
word_median	1
n_empty	0
n_duplicates	2,183
duplicate_rate	0.5945
vocab_size	1,537
readability_flesch_mean	69.93
emoji_rate	0
url_rate	0
one_word_rate	0.7271
allcaps_rate	0.0005447
boilerplate_rate	0
alert: one_word	72.7% rows are a single word
alert: null_rate	48.5% null
alert: duplicates	59.4% duplicate strings

Fig 74.

Character-length distribution for NaturalPronunciation.

Character-length distribution for NaturalPronunciation (mean: 10.772875816993464).
chars	count
2 – 3	1
3 – 4	251
4 – 5	136
5 – 6	47
6 – 7	112
7 – 8	595
8 – 9	517
9 – 10	143
10 – 11	181
11 – 12	392
12 – 13	243
13 – 14	123
14 – 15	83
15 – 16	121
16 – 17	114
17 – 18	104
18 – 19	137
19 – 20	112
20 – 21	59
21 – 22	48
22 – 23	46
23 – 24	29
24 – 25	17
25 – 26	20
26 – 27	15
27 – 28	7
28 – 29	4
29 – 30	2
30 – 31	1
31 – 32	6
32 – 33	1
33 – 34	1
34 – 35	0
35 – 36	0
36 – 37	0
37 – 38	1
38 – 39	1
39 – 40	0
40 – 41	0
41 – 42	2

PercentChristianPGAC categorical feature

This column appears to be a percentage of Christians (PGAC suggesting a per-group Christian metric), stored as strings with three-decimal precision rather than as a numeric type. It is heavily zero-inflated: '0.000' accounts for 43.8% of the 7,124 rows (3,121 occurrences), and a suspiciously specific value '3.733' is the second mode at 151 rows, the same count as the 151 'Deaf' entries in NaturalName, which hints at one group-level figure repeated across rows. With 842 distinct values and entropy ratio 0.58, the distribution is concentrated but long-tailed, and the null rate is negligible at 0.07%.

Treatment: Cast to numeric and consider a zero-inflated transform (e.g., log1p with a zero indicator) before modelling.

anthropic:claude-opus-4-7 · confidence high
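
A minimal sketch of the zero-indicator plus log1p idea, assuming the column has already been cast to float. The function name `zero_inflated_features` is hypothetical.

```python
import math

def zero_inflated_features(value):
    """Return (is_nonzero, log1p_value) for a non-negative percentage; None stays None."""
    if value is None:
        return None, None
    return int(value > 0), math.log1p(value)

# 3.733 is the second most common value in the table; 0.0 is the mode.
features = [zero_inflated_features(v) for v in (0.0, 3.733, None)]
```

The indicator carries the zero versus non-zero split, while log1p compresses the tail and still maps 0 to 0, so the two features do not contradict each other.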

Out[218]:

saturn.columns["PercentChristianPGAC"].stats

stat	value
n	7,124
nulls	5 (0.1%)
unique	842
top_value	0.000
top_rate	0.4384
cardinality	842
entropy	5.681
entropy_ratio	0.5846

Fig 75.

Top values for PercentChristianPGAC.

Top values for PercentChristianPGAC (20 unique shown, of 842 total).
value	count	share
0.000	3121	43.8%
3.733	151	2.1%
1.000	87	1.2%
2.000	79	1.1%
0.001	75	1.1%
0.006	69	1.0%
5.000	65	0.9%
3.000	59	0.8%
0.010	55	0.8%
4.000	55	0.8%
0.500	50	0.7%
0.002	46	0.6%
0.005	45	0.6%
0.030	45	0.6%
0.049	44	0.6%
0.100	39	0.5%
0.200	37	0.5%
1.592	37	0.5%
0.009	36	0.5%
1.771	36	0.5%

PercentEvangelical categorical feature

PercentEvangelical reads as a numeric share of evangelicals stored as strings, with 401 distinct values across 7,124 rows. The distribution is heavily zero-inflated: "0.000" accounts for 65.7% of non-null rows (4,195, or 58.9% of all rows) and another 10.4% of rows are null, leaving a long tail of small fractions like 0.100, 0.200, 0.500. An entropy ratio of 0.364 confirms that most of the mass collapses onto that single zero bucket.

Treatment: Cast to float, impute the 10.4% nulls, and consider a zero-vs-nonzero indicator alongside the raw value to handle the zero inflation.

anthropic:claude-opus-4-7 · confidence high
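
The stats table reports a top_rate of 0.6572 while the value table shows a 58.9% share for the same value. The two agree once the denominators are spelled out, assuming top_rate is computed over non-null rows and share over all rows (the arithmetic below is consistent with that reading).

```python
# Figures copied from the PercentEvangelical stats and value tables.
n_rows, n_null, n_zero = 7124, 741, 4195

rate_non_null = n_zero / (n_rows - n_null)  # denominator excludes nulls
rate_all = n_zero / n_rows                  # denominator includes nulls
```

The same distinction applies to every column with nulls in this report, so a top_rate and a share should only be compared after checking which denominator each one uses.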

Out[221]:

saturn.columns["PercentEvangelical"].stats

stat	value
n	7,124
nulls	741 (10.4%)
unique	401
top_value	0.000
top_rate	0.6572
cardinality	401
entropy	3.146
entropy_ratio	0.3638
alert: long_tail	256 singleton categories

Fig 76.

Top values for PercentEvangelical.

Top values for PercentEvangelical (20 unique shown, of 401 total).
value	count	share
0.000	4195	58.9%
0.100	171	2.4%
0.200	154	2.2%
0.500	135	1.9%
1.000	133	1.9%
0.300	108	1.5%
2.000	92	1.3%
0.400	71	1.0%
0.050	60	0.8%
0.010	60	0.8%
0.800	46	0.6%
1.500	40	0.6%
0.900	38	0.5%
0.600	35	0.5%
0.030	32	0.4%
0.700	29	0.4%
0.020	27	0.4%
0.080	25	0.4%
0.001	25	0.4%
0.006	24	0.3%

PercentEvangelicalPC categorical feature

PercentEvangelicalPC appears to be a numeric percentage (likely an evangelical population share; the profile does not define the PC suffix) that has been stored as strings, yielding 166 distinct values across 7,124 rows with a 2.15% null rate. The distribution is concentrated: the top value '0.199' covers 12.5% of non-null rows (869, or 12.2% of all rows), and the leading entries cluster near zero ('0.095', '0.000', '0.004'), yet some values exceed 3 ('3.409', '3.339'), suggesting a long right tail. An entropy ratio of 0.78 indicates moderate concentration rather than uniformity.

Treatment: Cast to float, impute the ~2% nulls, and consider log or rank transform given the right-tailed values.

anthropic:claude-opus-4-7 · confidence medium

Out[224]:

saturn.columns["PercentEvangelicalPC"].stats

stat	value
n	7,124
nulls	153 (2.1%)
unique	166
top_value	0.199
top_rate	0.1247
cardinality	166
entropy	5.777
entropy_ratio	0.7833

Fig 77.

Top values for PercentEvangelicalPC.

Top values for PercentEvangelicalPC (20 unique shown, of 166 total).
value	count	share
0.199	869	12.2%
0.095	586	8.2%
0.000	441	6.2%
0.247	352	4.9%
0.004	315	4.4%
3.409	311	4.4%
3.339	169	2.4%
2.656	167	2.3%
0.003	151	2.1%
0.001	131	1.8%
0.866	128	1.8%
1.699	114	1.6%
0.472	111	1.6%
0.028	96	1.3%
1.315	90	1.3%
1.509	87	1.2%
0.439	86	1.2%
0.036	78	1.1%
0.012	77	1.1%
0.197	76	1.1%

PercentEvangelicalPGAC categorical feature

Numeric percentages (likely the evangelical share per PGAC unit) stored as strings, hence profiled as categorical with 548 distinct values. The distribution is heavily zero-inflated: '0.000' accounts for 48.9% of non-null rows (3,264, or 45.8% of all 7,124 rows), with a secondary spike at '1.801' (151 rows). A value that specific, repeated that often, looks like one shared figure copied across rows rather than independent measurements. Null rate is 6.32% and entropy ratio is 0.55, consistent with a long tail of small fractional values.

Treatment: Cast to float, impute or flag nulls, and consider a zero-indicator plus log/sqrt transform given the heavy zero mass.

anthropic:claude-opus-4-7 · confidence high

Out[227]:

saturn.columns["PercentEvangelicalPGAC"].stats

stat	value
n	7,124
nulls	450 (6.3%)
unique	548
top_value	0.000
top_rate	0.4891
cardinality	548
entropy	4.972
entropy_ratio	0.5465

Fig 78.

Top values for PercentEvangelicalPGAC.

Top values for PercentEvangelicalPGAC (20 unique shown, of 548 total).
value	count	share
0.000	3264	45.8%
1.801	151	2.1%
0.002	98	1.4%
0.001	93	1.3%
0.006	68	1.0%
0.004	66	0.9%
1.000	59	0.8%
0.016	58	0.8%
0.200	42	0.6%
0.010	42	0.6%
0.100	41	0.6%
0.005	41	0.6%
0.049	40	0.6%
0.500	37	0.5%
0.104	37	0.5%
0.074	36	0.5%
2.000	36	0.5%
1.543	36	0.5%
0.007	35	0.5%
0.003	34	0.5%

PCBuddhism numeric feature

PCBuddhism appears to be a percentage feature measuring the Buddhist share of each record (the PC prefix most plausibly abbreviates 'percent'), ranging 0 to 100 with mean 6.41. The distribution is extremely zero-inflated: 82.99% of non-null rows are exactly 0 and the entire IQR collapses to 0, so the 1.5 IQR rule flags every non-zero value, which is why the outlier rate is exactly the remaining 17.01% (skew 3.48, kurtosis 10.56). Buddhism is absent from most records but reaches sizeable concentrations in a long tail, including 179 rows in the 97.5 to 100 bin.

Treatment: Treat as zero-inflated proportion: add a presence indicator and log1p-transform the non-zero tail before modelling.

anthropic:claude-opus-4-7 · confidence high
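
The outlier alert reads '17.0% rows beyond 1.5 IQR', and zero_rate plus outlier_rate sum to exactly 1. A toy reconstruction shows why, assuming the standard Tukey fences (Q1 - 1.5·IQR, Q3 + 1.5·IQR); the quartile method Saturn uses is not stated, so the 'inclusive' choice below is an assumption.

```python
import statistics

def tukey_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Toy data shaped like PCBuddhism: 83 zeros and 17 non-zero percentages.
toy = [0.0] * 83 + [0.5, 2.0, 12.0, 40.0, 67.5, 90.0, 97.0] + [99.0] * 10
outliers = tukey_outliers(toy)
```

With more than 75% zeros both quartiles are 0, both fences sit at 0, and every non-zero row is flagged. For columns like this the outlier count is better read as a count of non-zero rows than as a list of anomalies.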

Out[230]:

saturn.columns["PCBuddhism"].stats

stat	value
n	7,124
nulls	24 (0.3%)
unique	809
min	0
max	100
mean	6.411
median	0
std	22.39
q1	0
q3	0
iqr	0
skew	3.475
kurtosis	10.56
n_outliers	1,208
outlier_rate	0.1701
zero_rate	0.8299
alert: high_skew	skew=+3.48
alert: outliers	17.0% rows beyond 1.5 IQR

Fig 79.

Distribution of PCBuddhism. Vertical dash marks the median.

Histogram bins for PCBuddhism (median: 0.0).
bin	count
0 – 2.5	6435
2.5 – 5	27
5 – 7.5	19
7.5 – 10	7
10 – 12.5	25
12.5 – 15	3
15 – 17.5	13
17.5 – 20	3
20 – 22.5	14
22.5 – 25	4
25 – 27.5	6
27.5 – 30	3
30 – 32.5	21
32.5 – 35	5
35 – 37.5	9
37.5 – 40	3
40 – 42.5	22
42.5 – 45	4
45 – 47.5	2
47.5 – 50	5
50 – 52.5	4
52.5 – 55	3
55 – 57.5	10
57.5 – 60	10
60 – 62.5	16
62.5 – 65	3
65 – 67.5	9
67.5 – 70	24
70 – 72.5	18
72.5 – 75	7
75 – 77.5	6
77.5 – 80	12
80 – 82.5	14
82.5 – 85	8
85 – 87.5	13
87.5 – 90	20
90 – 92.5	29
92.5 – 95	9
95 – 97.5	76
97.5 – 100	179

PCEthnicReligions numeric feature

PCEthnicReligions is a numeric percentage-style feature (0–100) capturing the share of each record attributed to ethnic religions. It is overwhelmingly zero: 78% of values are 0 and the entire interquartile range collapses to 0, yet the mean is 13.1 with std 30.7, indicating that a minority of records carry very large shares (323 rows fall in the 97.5 to 100 bin). Skew of 2.16 confirms a sparse, heavy-tailed distribution rather than a smooth continuum; the 22% outlier rate is simply the non-zero rows, since a zero IQR puts both fences at 0.

Treatment: Binarize (zero vs non-zero) or apply a zero-inflated/log1p transform before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[233]:

saturn.columns["PCEthnicReligions"].stats

stat	value
n	7,124
nulls	18 (0.3%)
unique	351
min	0
max	100
mean	13.11
median	0
std	30.74
q1	0
q3	0
iqr	0
skew	2.155
kurtosis	2.885
n_outliers	1,560
outlier_rate	0.2195
zero_rate	0.7805
alert: high_skew	skew=+2.16
alert: outliers	22.0% rows beyond 1.5 IQR

Fig 80.

Distribution of PCEthnicReligions. Vertical dash marks the median.

Histogram bins for PCEthnicReligions (median: 0.0).
bin	count
0 – 2.5	5603
2.5 – 5	76
5 – 7.5	108
7.5 – 10	48
10 – 12.5	71
12.5 – 15	15
15 – 17.5	30
17.5 – 20	20
20 – 22.5	45
22.5 – 25	11
25 – 27.5	27
27.5 – 30	14
30 – 32.5	42
32.5 – 35	4
35 – 37.5	15
37.5 – 40	9
40 – 42.5	23
42.5 – 45	1
45 – 47.5	13
47.5 – 50	5
50 – 52.5	7
52.5 – 55	5
55 – 57.5	6
57.5 – 60	10
60 – 62.5	21
62.5 – 65	3
65 – 67.5	26
67.5 – 70	12
70 – 72.5	30
72.5 – 75	3
75 – 77.5	8
77.5 – 80	16
80 – 82.5	40
82.5 – 85	18
85 – 87.5	44
87.5 – 90	18
90 – 92.5	56
92.5 – 95	57
95 – 97.5	223
97.5 – 100	323

PCHinduism numeric feature

This column appears to be the percentage share of Hindus in each record, ranging from 0 to 100 with a mean of 29.8. The distribution is effectively bimodal: 67.7% of rows are exactly zero while Q3 sits at 98.4, so most records have no Hindu presence and a substantial minority are almost entirely Hindu (1,844 rows fall in the 97.5 to 100 bin). Skew is 0.87 and kurtosis -1.22, consistent with this U-shaped split rather than a single peak.

Treatment: Consider a zero-vs-nonzero indicator plus the raw percentage, since a flat numeric treatment will hide the bimodal structure.

anthropic:claude-opus-4-7 · confidence high

Out[236]:

saturn.columns["PCHinduism"].stats

stat	value
n	7,124
nulls	24 (0.3%)
unique	1,131
min	0
max	100
mean	29.82
median	0
std	44.98
q1	0
q3	98.42
iqr	98.42
skew	0.8721
kurtosis	-1.216
n_outliers	0
outlier_rate	0
zero_rate	0.6768

Fig 81.

Distribution of PCHinduism. Vertical dash marks the median.

Histogram bins for PCHinduism (median: 0.0).
bin	count
0 – 2.5	4856
2.5 – 5	10
5 – 7.5	14
7.5 – 10	2
10 – 12.5	9
12.5 – 15	6
15 – 17.5	7
17.5 – 20	3
20 – 22.5	10
22.5 – 25	8
25 – 27.5	4
27.5 – 30	4
30 – 32.5	5
32.5 – 35	6
35 – 37.5	2
37.5 – 40	2
40 – 42.5	4
42.5 – 45	5
45 – 47.5	0
47.5 – 50	3
50 – 52.5	6
52.5 – 55	1
55 – 57.5	8
57.5 – 60	4
60 – 62.5	7
62.5 – 65	1
65 – 67.5	8
67.5 – 70	7
70 – 72.5	12
72.5 – 75	5
75 – 77.5	5
77.5 – 80	9
80 – 82.5	9
82.5 – 85	12
85 – 87.5	11
87.5 – 90	20
90 – 92.5	23
92.5 – 95	31
95 – 97.5	117
97.5 – 100	1844

PCOtherSmall numeric feature

PCOtherSmall is a numeric feature where 88% of rows are zero and the IQR is zero, meaning the bottom three quartiles are all 0. The remaining mass stretches up to 100 with mean 1.84 and std 12.33, producing severe right skew (7.39) and very heavy tails (kurtosis 54.18). About 12% of rows (851) flag as outliers, suggesting this is a sparse share/percentage indicator that fires only for a small subset of records.

Treatment: Binarize presence (>0) or apply log1p before modelling to tame the skew.

anthropic:claude-opus-4-7 · confidence high

Out[239]:

saturn.columns["PCOtherSmall"].stats

stat	value
n	7,124
nulls	24 (0.3%)
unique	670
min	0
max	100
mean	1.836
median	0
std	12.33
q1	0
q3	0
iqr	0
skew	7.39
kurtosis	54.18
n_outliers	851
outlier_rate	0.1199
zero_rate	0.8801
alert: high_skew	skew=+7.39
alert: outliers	12.0% rows beyond 1.5 IQR

Fig 82.

Distribution of PCOtherSmall. Vertical dash marks the median.

Histogram bins for PCOtherSmall (median: 0.0).
bin	count
0 – 2.5	6854
2.5 – 5	30
5 – 7.5	24
7.5 – 10	11
10 – 12.5	12
12.5 – 15	5
15 – 17.5	4
17.5 – 20	4
20 – 22.5	1
22.5 – 25	21
25 – 27.5	2
27.5 – 30	3
30 – 32.5	4
32.5 – 35	0
35 – 37.5	1
37.5 – 40	3
40 – 42.5	3
42.5 – 45	0
45 – 47.5	0
47.5 – 50	0
50 – 52.5	0
52.5 – 55	0
55 – 57.5	3
57.5 – 60	2
60 – 62.5	3
62.5 – 65	0
65 – 67.5	0
67.5 – 70	3
70 – 72.5	3
72.5 – 75	0
75 – 77.5	1
77.5 – 80	0
80 – 82.5	2
82.5 – 85	2
85 – 87.5	2
87.5 – 90	0
90 – 92.5	2
92.5 – 95	0
95 – 97.5	2
97.5 – 100	93

RegionCode numeric feature

RegionCode holds 12 distinct integer values from 1 to 12 with no nulls, so it is almost certainly a categorical region identifier stored as a number rather than a true numeric measure. The distribution is concentrated around the median of 4 with an IQR of just 2, yet the right skew of 1.12 and 601 flagged outliers (8.4%) reflect the long tail of higher-numbered regions rather than genuine anomalies.

Treatment: Cast to categorical and one-hot or target-encode; do not treat as a continuous numeric.

anthropic:claude-opus-4-7 · confidence high
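
A dependency-free sketch of one-hot encoding for an integer code column. The helper `one_hot` is hypothetical; a dataframe library would normally do this in one call.

```python
def one_hot(codes):
    """One-hot encode integer codes as categories; returns (levels, rows)."""
    levels = sorted(set(codes))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for code in codes:
        row = [0] * len(levels)
        row[index[code]] = 1
        rows.append(row)
    return levels, rows

# Toy input using region codes that occur in the histogram above.
levels, rows = one_hot([4, 2, 4, 12])
```

After encoding, code 12 is no "larger" than code 2: each region is its own column, which removes the artificial ordering that produced the skew and outlier flags in the numeric profile.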

Out[242]:

saturn.columns["RegionCode"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	12
min	1
max	12
mean	5.005
median	4
std	2.457
q1	4
q3	6
iqr	2
skew	1.122
kurtosis	0.5775
n_outliers	601
outlier_rate	0.08436
zero_rate	0
alert: outliers	8.4% rows beyond 1.5 IQR

Fig 83.

Distribution of RegionCode. Vertical dash marks the median.

Histogram bins for RegionCode (median: 4.0).
bin	count
1 – 1.275	75
1.275 – 1.55	0
1.55 – 1.825	0
1.825 – 2.1	726
2.1 – 2.375	0
2.375 – 2.65	0
2.65 – 2.925	0
2.925 – 3.2	521
3.2 – 3.475	0
3.475 – 3.75	0
3.75 – 4.025	3349
4.025 – 4.3	0
4.3 – 4.575	0
4.575 – 4.85	0
4.85 – 5.125	352
5.125 – 5.4	0
5.4 – 5.675	0
5.675 – 5.95	0
5.95 – 6.225	444
6.225 – 6.5	0
6.5 – 6.775	0
6.775 – 7.05	373
7.05 – 7.325	0
7.325 – 7.6	0
7.6 – 7.875	0
7.875 – 8.15	460
8.15 – 8.425	0
8.425 – 8.7	0
8.7 – 8.975	0
8.975 – 9.25	223
9.25 – 9.525	0
9.525 – 9.8	0
9.8 – 10.08	320
10.08 – 10.35	0
10.35 – 10.62	0
10.62 – 10.9	0
10.9 – 11.18	121
11.18 – 11.45	0
11.45 – 11.73	0
11.73 – 12	160

PopulationPGAC numeric feature

PopulationPGAC appears to be a population count for a larger aggregate that rows share (PGAC most likely stands for People Group Across Countries, Joshua Project's term for a group counted over every country where it lives), spanning 10 to roughly 925 million across 7,124 rows with only 0.07% nulls. The distribution is extraordinarily right-skewed (skew 25.5, kurtosis 1052): the median is 130,300 while the mean is 4.88 million, and 17.8% of rows flag as outliers. With 1,509 unique values across 7,124 rows, the same population figures repeat heavily, suggesting many rows share the same aggregate.

Treatment: log-transform before regression to tame the extreme right skew.

anthropic:claude-opus-4-7 · confidence medium
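
To see what a log transform does to this range, the sketch below applies log10 to three summary figures from the stats table. It only illustrates the scale change and does not refit anything.

```python
import math

# min, median, and max as reported for PopulationPGAC.
summary = {"min": 10, "median": 130_300, "max": 9.251e8}
log_summary = {name: math.log10(value) for name, value in summary.items()}
```

On the raw scale the maximum is about 7,100 times the median; on the log10 scale the three values sit at roughly 1, 5.1, and 9.0, so the median lands near the middle of the range instead of being crushed against the left edge, as it is in the histogram.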

Out[245]:

saturn.columns["PopulationPGAC"].stats

stat	value
n	7,124
nulls	5 (0.1%)
unique	1,509
min	10
max	9.251e+08
mean	4.881e+06
median	130,300
std	2.095e+07
q1	20,000
q3	1.435e+06
iqr	1.415e+06
skew	25.48
kurtosis	1052
n_outliers	1,264
outlier_rate	0.1776
zero_rate	0
alert: high_skew	skew=+25.48
alert: outliers	17.8% rows beyond 1.5 IQR

Fig 84.

Distribution of PopulationPGAC. Vertical dash marks the median.

Histogram bins for PopulationPGAC (median: 130300.0).
bin	count
10 – 2.313e+07	6626
2.313e+07 – 4.626e+07	327
4.626e+07 – 6.938e+07	104
6.938e+07 – 9.251e+07	9
9.251e+07 – 1.156e+08	0
1.156e+08 – 1.388e+08	51
1.388e+08 – 1.619e+08	0
1.619e+08 – 1.85e+08	0
1.85e+08 – 2.082e+08	0
2.082e+08 – 2.313e+08	0
2.313e+08 – 2.544e+08	0
2.544e+08 – 2.775e+08	0
2.775e+08 – 3.007e+08	0
3.007e+08 – 3.238e+08	0
3.238e+08 – 3.469e+08	0
3.469e+08 – 3.701e+08	0
3.701e+08 – 3.932e+08	0
3.932e+08 – 4.163e+08	0
4.163e+08 – 4.394e+08	0
4.394e+08 – 4.626e+08	0
4.626e+08 – 4.857e+08	0
4.857e+08 – 5.088e+08	0
5.088e+08 – 5.319e+08	0
5.319e+08 – 5.551e+08	0
5.551e+08 – 5.782e+08	0
5.782e+08 – 6.013e+08	0
6.013e+08 – 6.245e+08	0
6.245e+08 – 6.476e+08	0
6.476e+08 – 6.707e+08	0
6.707e+08 – 6.938e+08	0
6.938e+08 – 7.17e+08	0
7.17e+08 – 7.401e+08	0
7.401e+08 – 7.632e+08	0
7.632e+08 – 7.864e+08	0
7.864e+08 – 8.095e+08	0
8.095e+08 – 8.326e+08	0
8.326e+08 – 8.557e+08	0
8.557e+08 – 8.789e+08	0
8.789e+08 – 9.02e+08	0
9.02e+08 – 9.251e+08	2

Frontier categorical feature

Binary Y/N flag indicating whether a record is on the frontier, with no nulls across 7124 rows. The split is imbalanced toward Y at 66.9% (4767) versus N at 2357, though entropy ratio of 0.92 shows both classes are well represented.

Treatment: Encode as a 0/1 indicator before modelling.

anthropic:claude-opus-4-7 · confidence high
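
A small sketch of the 0/1 encoding that fails loudly on anything other than 'Y' or 'N'. The helper `yn_to_int` is hypothetical.

```python
def yn_to_int(flag):
    """Map 'Y' to 1 and 'N' to 0; reject any other value."""
    mapping = {"Y": 1, "N": 0}
    if flag not in mapping:
        raise ValueError(f"unexpected flag: {flag!r}")
    return mapping[flag]

encoded = [yn_to_int(f) for f in ["Y", "N", "Y"]]
```

Raising on unknown values is a design choice: the profile shows exactly two values today, and a strict mapping makes a future third value (a blank or a lowercase 'y', say) visible instead of silently coercing it.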

Out[248]:

saturn.columns["Frontier"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.6691
cardinality	2
entropy	0.9158
entropy_ratio	0.9158

Fig 85.

Top values for Frontier.

Top values for Frontier (2 unique shown, of 2 total).
value	count	share
Y	4767	66.9%
N	2357	33.1%

MapAddress text foreign_key

MapAddress holds single-token PNG filenames (e.g. 'm00328.png'), with one_word_rate of 1.0 and max length 13, suggesting it points to a map image asset. 1,500 of 7,124 rows are empty strings, and they account for most of the 0.352 duplicate_rate: 1,499 of the 2,508 duplicates are repeated empties. Among the 5,624 non-empty rows there are 4,615 distinct filenames (vocab_size), so about 18% of populated rows reuse a map that already appears elsewhere. The column behaves like a foreign reference to a finite set of map images rather than a free-text field.

Treatment: Treat as a categorical asset reference: impute or flag the 1500 empties and join to a map-image lookup.

anthropic:claude-opus-4-7 · confidence high
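
Because empty strings count as duplicates of each other, a duplicate rate over all rows can overstate how often real values repeat. The sketch below computes both rates on toy data; the filenames are hypothetical and only follow the 'm00328.png' pattern, and `duplicate_rates` is an illustrative helper.

```python
def duplicate_rates(values):
    """Return (rate over all rows, rate over non-empty rows) of repeated strings."""
    n_dupes = len(values) - len(set(values))
    non_empty = [v for v in values if v != ""]
    n_dupes_non_empty = len(non_empty) - len(set(non_empty))
    rate_non_empty = n_dupes_non_empty / len(non_empty) if non_empty else 0.0
    return n_dupes / len(values), rate_non_empty

toy = ["", "", "", "m00001.png", "m00001.png",
       "m00002.png", "m00003.png", "m00004.png"]
overall, non_empty_only = duplicate_rates(toy)
```

Here the overall rate is 0.375 while the rate among populated rows is 0.2, the same kind of gap that separates the reported duplicate_rate from the share of real filenames that repeat.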

Out[251]:

saturn.columns["MapAddress"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	4,616
len_min	0
len_max	13
len_mean	8.649
len_median	10
len_p95	13
word_mean	1
word_median	1
n_empty	1,500
n_duplicates	2,508
duplicate_rate	0.352
vocab_size	4,615
readability_flesch_mean	17.62
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	35.2% duplicate strings

Fig 86.

Character-length distribution for MapAddress.

Character-length distribution for MapAddress (mean: 8.648652442448062).
chars	count
0	1500
10	3833
13	1791
(all other lengths)	0

HasJesusFilm categorical feature

Binary Y/N flag indicating whether the Jesus Film is available for the entity (likely a language or people group). Heavily skewed toward 'Y' at 78.7% (5,610 of 7,124), with no nulls across all 7,124 rows. Entropy of 0.746 reflects the imbalance but still leaves a usable minority class of 1,514 'N' values.

Treatment: Encode as 0/1 boolean; account for class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high

Out[254]:

saturn.columns["HasJesusFilm"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.7875
cardinality	2
entropy	0.7463
entropy_ratio	0.7463

Fig 87.

Top values for HasJesusFilm.

Top values for HasJesusFilm (2 unique shown, of 2 total).
value	count	share
Y	5610	78.7%
N	1514	21.3%

Nomadic categorical feature

Binary Y/N flag indicating nomadic status, with no nulls across 7124 rows. Severely imbalanced: 'N' dominates at 96.6% (6884 rows) versus only 240 'Y' cases, yielding a low entropy ratio of 0.21.

Treatment: Encode as binary; consider class-weighting or stratified sampling due to severe imbalance.

anthropic:claude-opus-4-7 · confidence high
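
For class weighting, a common heuristic is weight = n / (k × count), where n is the row count and k the number of classes; this is the same rule scikit-learn applies when class_weight is set to 'balanced'. The sketch applies it to the Nomadic counts, and `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

# Counts from the Nomadic value table.
weights = balanced_class_weights({"N": 6884, "Y": 240})
```

The rare 'Y' class receives a weight close to 14.8 and the common 'N' class about 0.52, so each class contributes equally to a weighted loss overall.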

Out[257]:

saturn.columns["Nomadic"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	N
top_rate	0.9663
cardinality	2
entropy	0.2126
entropy_ratio	0.2126
alert: imbalance	top value is 96.6% of rows

Fig 88.

Top values for Nomadic.

Top values for Nomadic (2 unique shown, of 2 total).
value	count	share
N	6884	96.6%
Y	240	3.4%

NomadicTypeDescription categorical feature

This is a low-cardinality categorical describing the type of nomadic livelihood, with only 6 distinct values dominated by 'Agro-Pastoralists' (76.7% of non-nulls, 184 records). The column is almost entirely empty: null_rate is 0.9663, leaving 240 populated rows out of 7,124, the same count as the 240 'Y' rows in Nomadic. Several values are comma-joined combinations (e.g., 'Agro-Pastoralists, Service or Trade'), suggesting the field encodes multi-label memberships as concatenated strings. The share column in the value table is relative to all rows, not to the populated ones.

Treatment: Split comma-separated values into multi-hot indicators and treat missingness as its own category given the 96.6% null rate.

anthropic:claude-opus-4-7 · confidence high
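
A sketch of the multi-hot split using the three base labels visible in the value table. The helper `multi_hot` is hypothetical, and mapping a null to all zeros is one option among several; a separate missing-indicator column would preserve the distinction the treatment note asks for.

```python
LABELS = ["Agro-Pastoralists", "Service or Trade", "Hunter-Gatherers"]

def multi_hot(value, labels=LABELS):
    """Split a comma-joined value into 0/1 indicators, one per known label."""
    present = set() if value is None else {part.strip() for part in value.split(",")}
    return {label: int(label in present) for label in labels}

combined = multi_hot("Agro-Pastoralists, Service or Trade")
missing = multi_hot(None)
```

Splitting on the comma reduces six observed string values to three independent indicators, so the two one-row combinations no longer form their own categories.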

Out[260]:

saturn.columns["NomadicTypeDescription"].stats

stat	value
n	7,124
nulls	6,884 (96.6%)
unique	6
top_value	Agro-Pastoralists
top_rate	0.7667
cardinality	6
entropy	1.159
entropy_ratio	0.4483
alert: null_rate	96.6% null

Fig 89.

Top values for NomadicTypeDescription.

Top values for NomadicTypeDescription (6 unique shown, of 6 total).
value	count	share
Agro-Pastoralists	184	2.6%
Service or Trade	30	0.4%
Agro-Pastoralists, Service or Trade	16	0.2%
Hunter-Gatherers	8	0.1%
Agro-Pastoralists, Hunter-Gatherers	1	0.0%
Service or Trade, Hunter-Gatherers	1	0.0%

PhotoCCVersionText categorical metadata

Creative Commons license version attached to a photo (e.g., 'CC BY 2.0', 'CC BY-SA 4.0'). The field is dominated by empty strings at 79.8% of 7124 rows, with only 16 distinct values and entropy ratio 0.33, so license metadata is missing for the vast majority of records. Among populated values, 'CC BY 2.0' (387) and 'CC BY-SA 4.0' (246) lead.

Treatment: Treat empty string as missing and group rare licenses; use as a low-cardinality categorical only where photo licensing matters.

anthropic:claude-opus-4-7 · confidence high

Out[263]:

saturn.columns["PhotoCCVersionText"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	16
top_value	(empty string)
top_rate	0.7984
cardinality	16
entropy	1.323
entropy_ratio	0.3307

Fig 90.

Top values for PhotoCCVersionText.

Top values for PhotoCCVersionText (16 unique shown, of 16 total).
value	count	share
(empty string)	5688	79.8%
CC BY 2.0	387	5.4%
CC BY-SA 4.0	246	3.5%
CC BY-SA 2.0	193	2.7%
CC BY-NC-SA 2.0	151	2.1%
CC BY-SA 3.0	143	2.0%
CC0 1.0	127	1.8%
CC BY-NC 2.0	111	1.6%
CC BY 3.0	27	0.4%
CC BY-NC-ND 2.0	18	0.3%
CC BY 4.0	14	0.2%
CC SA 1.0	7	0.1%
CC BY-ND 2.0	5	0.1%
CC BY 3.0 BR	4	0.1%
CC BY-SA 2.5	2	0.0%
CC BY-NC-SA 4.0	1	0.0%

PhotoCCVersionURL categorical metadata

This column holds the URL of the Creative Commons license version applied to an associated photo, drawn from a fixed set of 16 distinct license URIs. About 79.8% of rows (5688 of 7124) are empty strings, so the field is sparsely populated; among the populated minority, CC BY 2.0 (387) and CC BY-SA 4.0 (246) dominate. Entropy ratio of 0.33 confirms heavy concentration on the blank value.

Treatment: Treat empty strings as missing and collapse to a categorical license code before any modelling.

anthropic:claude-opus-4-7 · confidence high

Out[266]:

saturn.columns["PhotoCCVersionURL"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	16
top_value	(empty string)
top_rate	0.7984
cardinality	16
entropy	1.323
entropy_ratio	0.3307

Fig 91.

Top values for PhotoCCVersionURL.

Top values for PhotoCCVersionURL (16 unique shown, of 16 total).
value	count	share
(empty string)	5688	79.8%
https://creativecommons.org/licenses/by/2.0/	387	5.4%
https://creativecommons.org/licenses/by-sa/4.0/	246	3.5%
https://creativecommons.org/licenses/by-sa/2.0/	193	2.7%
https://creativecommons.org/licenses/by-nc-sa/2.0/	151	2.1%
https://creativecommons.org/licenses/by-sa/3.0/	143	2.0%
https://creativecommons.org/publicdomain/zero/1.0/	127	1.8%
https://creativecommons.org/licenses/by-nc/2.0/	111	1.6%
https://creativecommons.org/licenses/by/3.0/	27	0.4%
https://creativecommons.org/licenses/by-nc-nd/2.0/	18	0.3%
https://creativecommons.org/licenses/by/4.0/	14	0.2%
https://creativecommons.org/licenses/by-sa/1.0/	7	0.1%
https://creativecommons.org/licenses/by-nd/2.0/	5	0.1%
https://creativecommons.org/licenses/by/3.0/br/deed.en	4	0.1%
https://creativecommons.org/licenses/by-sa/2.5/	2	0.0%
https://creativecommons.org/licenses/by-nc-sa/4.0	1	0.0%

MapCredits categorical metadata

Attribution string crediting the data, geography, and design sources for each map (e.g. Joshua Project, GMI, UNESCO, IMB). With 161 distinct values across 7124 rows, the top credit covers 28% of records and a blank string is the second most common value at 1505 rows; near-duplicates differing only by trailing punctuation (the same Omid/UNESCO credit appears with and without a final period) inflate cardinality.

Treatment: Normalise whitespace/punctuation to collapse near-duplicates, then drop from modelling as boilerplate provenance.

anthropic:claude-opus-4-7 · confidence high
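
To illustrate the normalisation step, the sketch below collapses whitespace and strips trailing full stops, then re-aggregates counts. The helper name `normalise_credit` is hypothetical; the two strings and their counts are copied from the top-values table for this column.

```python
import re
from collections import Counter


def normalise_credit(text: str) -> str:
    """Collapse runs of whitespace and drop trailing full stops."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(". ")


# The two Omid/UNESCO variants and their counts from the table.
observed = [
    ("People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project", 1995),
    ("People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project.", 755),
]

merged = Counter()
for value, count in observed:
    merged[normalise_credit(value)] += count

print(merged.most_common(1))
```

After merging, the two variants form a single category of 2,750 rows. Other near-duplicates, such as the credit that uses a comma after "World Jewish Congress", would need additional rules.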

Out[269]:

saturn.columns["MapCredits"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	161
top_value	People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project
top_rate	0.28
cardinality	161
entropy	3.318
entropy_ratio	0.4527
alert: long_tail	99 singleton categories

Fig 92.

Top values for MapCredits.

Top values for MapCredits (20 unique shown, of 161 total).
value	count	share
People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project	1995	28.0%
	1505	21.1%
Location: IMB. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	808	11.3%
People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project.	755	10.6%
People Group Location: Omid. Other geography / data: GMI. Map Design: Joshua Project	583	8.2%
Bethany World Prayer Center	408	5.7%
Joshua Project / Global Mapping International	335	4.7%
Bryan Nicholson / cartoMission	100	1.4%
Location: SIL / WLMS. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	77	1.1%
Anonymous	70	1.0%
Location: WLMS. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	47	0.7%
NCRP	44	0.6%
Location: Web research. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	31	0.4%
Asia Harvest-Operation Myanmar	26	0.4%
Location: World Jewish Congress, Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	26	0.4%
Location: Joshua Project. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	23	0.3%
Southeast Asia Link - SEALINK	21	0.3%
Location: Ethnologue. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	18	0.3%
Location: Asia Harvest. Imagery: GMI, ESRI, Maxar, Earthstar Geographics, ESRI User Community. Design: Joshua Project.	10	0.1%
Peoples of the Red Book	8	0.1%

MapCreditURL categorical metadata

This column holds attribution URLs for source maps, but 6919 of 7124 rows (top_rate 0.9712) are empty strings, leaving 205 populated rows spread across 30 distinct URLs (31 unique values including the blank). Among the populated entries, cartomission.com leads with 100 occurrences, followed by asiaharvest.org (28) and worldjewishcongress.org (26); every other URL appears fewer than 10 times and 19 are singletons, producing a very long tail. The entropy ratio of 0.054 confirms there is almost no information here unless the empty string itself is treated as a meaningful 'no credit' signal.

Treatment: Keep as provenance metadata; do not use as a model feature given 97% blanks.

anthropic:claude-opus-4-7 · confidence high

Out[272]:

saturn.columns["MapCreditURL"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	31
top_value
top_rate	0.9712
cardinality	31
entropy	0.27
entropy_ratio	0.05449
alert: long_tail	19 singleton categories
alert: imbalance	top value is 97.1% of rows

Fig 93.

Top values for MapCreditURL.

Top values for MapCreditURL (20 unique shown, of 31 total).
value	count	share
	6919	97.1%
https://www.cartomission.com	100	1.4%
https://www.asiaharvest.org	28	0.4%
https://www.worldjewishcongress.org/	26	0.4%
https://www.eki.ee/books/redbook/introduction.shtml	8	0.1%
https://commons.wikimedia.org/wiki/File:Maeneo_penye_wasemaji_wa_Kiswahili.png	7	0.1%
https://www.npolar.no/ansipra/english/Indexpages/Map_index.html	5	0.1%
https://www.face-music.ch/bi_bid/trad_costumes_en.html	3	0.0%
https://thekurds.net/	3	0.0%
https://www.cartpioneers.org/products/Peoples-of-Yemen-Prayer-Guide.html	2	0.0%
http://lingvarium.org/	2	0.0%
https://www.westmelanesia.com/	2	0.0%
https://www.lib.utexas.edu/maps/africa/libya_ethnic_1974.jpg	1	0.0%
https://commons.wikimedia.org/wiki/File:Libya_ethnic.svg	1	0.0%
https://www.ssb.no/en/statbank/table/09817/tableViewLayout1/	1	0.0%
https://commons.wikimedia.org/wiki/File:Alawites_in_the_Levant.jpg	1	0.0%
https://zolimacitymag.com/keeping-hakka-culture-alive-the-story-of-hong-kongs-mountain-pioneers/	1	0.0%
https://www.cia.gov/the-world-factbook/countries/china/map/	1	0.0%
https://commons.wikimedia.org/wiki/File:Albanians_in_Kosovo_2011_census.GIF	1	0.0%
https://www.refworld.org/docid/4a8414f5c.html	1	0.0%

MapCopyright categorical feature

A near-binary N/Y flag with the empty string as a third state, most likely indicating whether a map is under copyright. 'N' dominates at 72.95% (5197/7124), blanks account for 1885 rows (26.5%), and only 42 records (0.6%) are 'Y', a class imbalance severe enough that the affirmative case is nearly negligible.

Treatment: Normalize blanks to an explicit missing category (or fold them into 'N' only if that reading is confirmed) and treat as a low-signal binary flag given the 42-row positive class.

anthropic:claude-opus-4-7 · confidence high
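
One way to make the three states explicit is to parse the flag into `True`, `False`, or `None`. The sketch below is illustrative only: `parse_flag` is a hypothetical name, and the counts are the ones reported for this column.

```python
from typing import Optional

FLAG_MAP = {"Y": True, "N": False}


def parse_flag(raw: str) -> Optional[bool]:
    """Return True/False for Y/N and None for blank or unexpected values."""
    return FLAG_MAP.get(raw.strip().upper())


# Counts reported for MapCopyright.
counts = {"N": 5197, "": 1885, "Y": 42}

# Share of 'Y' over all rows, and over rows with a known (non-blank) flag.
positive_share = counts["Y"] / sum(counts.values())
known_share = counts["Y"] / (counts["Y"] + counts["N"])

print(f"{positive_share:.4f} of all rows, {known_share:.4f} of known rows")
```

Keeping blanks as `None` rather than folding them into 'N' preserves the option to decide later which reading is correct.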

Out[275]:

saturn.columns["MapCopyright"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	3
top_value	N
top_rate	0.7295
cardinality	3
entropy	0.8831
entropy_ratio	0.5572

Fig 94.

Top values for MapCopyright.

Top values for MapCopyright (3 unique shown, of 3 total).
value	count	share
N	5197	73.0%
	1885	26.5%
Y	42	0.6%

MapCCVersionText categorical metadata

This appears to be a Creative Commons license version field for maps, but it is effectively empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving only 10 rows with actual licenses split across CC BY-SA 3.0 (8), CC0 1.0 (1), and CC BY 3.0 (1). Entropy is just 0.0166 (entropy_ratio 0.0083), so the column carries almost no information despite having 0% nulls — the missingness is encoded as empty strings rather than NaN.

Treatment: Drop; near-constant blank with only 10 informative rows.

anthropic:claude-opus-4-7 · confidence high

Out[278]:

saturn.columns["MapCCVersionText"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	4
top_value
top_rate	0.9986
cardinality	4
entropy	0.01662
entropy_ratio	0.00831
alert: imbalance	top value is 99.9% of rows

Fig 95.

Top values for MapCCVersionText.

Top values for MapCCVersionText (4 unique shown, of 4 total).
value	count	share
	7114	99.9%
CC BY-SA 3.0	8	0.1%
CC0 1.0	1	0.0%
CC BY 3.0	1	0.0%

MapCCVersionURL categorical metadata

MapCCVersionURL appears to hold a Creative Commons license URL associated with each map record, but it is essentially empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving just 10 rows split across three CC license URLs. Entropy is 0.017 (ratio 0.008), so the column carries almost no information despite having 4 distinct values and zero nulls (the missingness is encoded as "" rather than null).

Treatment: Drop; near-constant with empty-string standing in for missing.

anthropic:claude-opus-4-7 · confidence high

Out[281]:

saturn.columns["MapCCVersionURL"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	4
top_value
top_rate	0.9986
cardinality	4
entropy	0.01662
entropy_ratio	0.00831
alert: imbalance	top value is 99.9% of rows

Fig 96.

Top values for MapCCVersionURL.

Top values for MapCCVersionURL (4 unique shown, of 4 total).
value	count	share
	7114	99.9%
https://creativecommons.org/licenses/by-sa/3.0/	8	0.1%
https://creativecommons.org/publicdomain/zero/1.0/	1	0.0%
https://creativecommons.org/licenses/by/3.0/	1	0.0%

JF categorical feature

JF is a binary Y/N flag with no nulls across 7124 rows. The distribution is imbalanced: "Y" accounts for 78.7% (5610 rows) versus 1514 "N", giving an entropy ratio of 0.746. The column name is opaque, so the semantic meaning of the flag is not recoverable from the evidence.

Treatment: Encode as a 0/1 indicator; consider class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
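
The entropy figures in these tables are consistent with Shannon entropy in bits, with the ratio obtained by dividing by log2 of the cardinality. Whether Saturn computes them exactly this way is an assumption; the sketch below simply reproduces the reported numbers from the value counts.

```python
import math


def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


jf_counts = [5610, 1514]  # Y, N counts for JF
print(round(entropy_bits(jf_counts), 4), round(entropy_ratio(jf_counts), 4))
```

For a two-valued column the maximum entropy is 1 bit, which is why `entropy` and `entropy_ratio` coincide for JF.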

Out[284]:

saturn.columns["JF"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.7875
cardinality	2
entropy	0.7463
entropy_ratio	0.7463

Fig 97.

Top values for JF.

Top values for JF (2 unique shown, of 2 total).
value	count	share
Y	5610	78.7%
N	1514	21.3%

AudioRecordings categorical feature

Binary Y/N flag indicating whether audio recordings exist for each row, with no nulls across 7124 records. The distribution is heavily imbalanced toward 'Y' at 86.9% (6188 vs 936), giving an entropy ratio of 0.56.

Treatment: Encode as a 0/1 indicator; be mindful of class imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
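
If the flag is ever used as a target, one common way to account for the imbalance is inverse-frequency class weighting, weight = n / (k * n_class). This is one heuristic among several, not something the profile prescribes; `balanced_weights` is a hypothetical helper name.

```python
def balanced_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Counts reported for AudioRecordings.
weights = balanced_weights({"Y": 6188, "N": 936})
for label, weight in weights.items():
    print(label, round(weight, 3))
```

The minority 'N' class receives a weight several times larger than 'Y', so each class contributes equally in aggregate.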

Out[287]:

saturn.columns["AudioRecordings"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.8686
cardinality	2
entropy	0.5612
entropy_ratio	0.5612

Fig 98.

Top values for AudioRecordings.

Top values for AudioRecordings (2 unique shown, of 2 total).
value	count	share
Y	6188	86.9%
N	936	13.1%

Window1040 categorical feature

Window1040 is a binary Y/N flag covering all 7124 rows with no nulls. The distribution is imbalanced: 'Y' accounts for 5910 rows (top_rate 0.8296) versus 1214 'N', giving an entropy ratio of 0.659. The column's semantic meaning isn't recoverable from the evidence, but it behaves like a clean indicator variable.

Treatment: Encode as a 0/1 indicator and watch for class imbalance when used as a predictor.

anthropic:claude-opus-4-7 · confidence high

Out[290]:

saturn.columns["Window1040"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2
top_value	Y
top_rate	0.8296
cardinality	2
entropy	0.6586
entropy_ratio	0.6586

Fig 99.

Top values for Window1040.

Top values for Window1040 (2 unique shown, of 2 total).
value	count	share
Y	5910	83.0%
N	1214	17.0%

PeopleGroupMapURL text metadata

This column holds URLs to people-group map images hosted on joshuaproject.net; every value is a single token (one_word_rate 1.0) and 78.9% of rows contain a URL (url_rate 0.79). The remaining 1,500 of 7,124 rows are empty strings. Of the 2,508 duplicates (duplicate_rate 0.35), 1,499 are repeats of the blank and 1,009 are repeated URLs, so some people groups share the same map image (e.g., m00328.png appears 40 times). With 4,615 distinct URLs across 5,624 populated rows, this is a reference link rather than a unique key.

Treatment: Keep as a display/reference URL; drop from modelling features.

anthropic:claude-opus-4-7 · confidence high
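
Because empty strings are counted as ordinary values, the duplicate count mixes repeated blanks with genuinely repeated URLs. The arithmetic can be sketched as below, assuming that `n_duplicates` equals `n - unique` and that the blank counts as one unique value (both consistent with the stats table, but still assumptions about the profiler).

```python
def duplicate_breakdown(n, n_empty, unique):
    """Split duplicates into (repeated blanks, repeated non-empty values)."""
    distinct_nonempty = unique - (1 if n_empty else 0)
    repeated_blanks = max(n_empty - 1, 0)
    repeated_values = (n - n_empty) - distinct_nonempty
    return repeated_blanks, repeated_values


# n, n_empty and unique as reported for PeopleGroupMapURL.
blanks, repeats = duplicate_breakdown(7124, 1500, 4616)
print(blanks, repeats, blanks + repeats)
```

The same helper applies to the other URL and free-text columns in this report, where blanks often account for most of the duplicate rate.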

Out[293]:

saturn.columns["PeopleGroupMapURL"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	4,616
len_min	0
len_max	66
len_mean	50.49
len_median	63
len_p95	66
word_mean	1
word_median	1
n_empty	1,500
n_duplicates	2,508
duplicate_rate	0.352
vocab_size	4,615
readability_flesch_mean	-568.7
emoji_rate	0
url_rate	0.7894
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: url_heavy	78.9% rows contain a URL
alert: duplicates	35.2% duplicate strings

Fig 100.

Character-length distribution for PeopleGroupMapURL.

Character-length distribution for PeopleGroupMapURL (mean: 50.489191465468835).
chars	count
0 – 2	1500
2 – 63	0
63 – 64	3833
64 – 66	1791

PeopleGroupMapExpandedURL text metadata

This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net, with 72.3% of rows containing a URL and every value being a single token. 1,975 rows (about 27.7%) are empty strings. Of the 2,793 duplicate rows (39.2%), 1,974 are repeats of the blank and 819 are repeated URLs (e.g. m00328.pdf appears 40 times), suggesting that some people groups share the same regional map.

Treatment: Treat as a reference link; drop from modelling or extract the map ID if joining to a maps table.

anthropic:claude-opus-4-7 · confidence high
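
Extracting the map ID could look like the sketch below, which takes the file stem of the URL path. The example URL is an assumed path built around the `m00328.pdf` file name mentioned above; the real directory layout is not shown in the profile, and `map_id` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse


def map_id(url: str) -> Optional[str]:
    """Return the file stem of a map URL (e.g. 'm00328'), or None if blank."""
    url = url.strip()
    if not url:
        return None
    stem = PurePosixPath(urlparse(url).path).stem
    return stem or None


# Assumed URL shape, for illustration only.
example = "https://joshuaproject.net/assets/media/profiles/maps/m00328.pdf"
print(map_id(example))
```

Since only the stem is kept, a `.png` and a `.pdf` with the same file name yield the same key, which may help when relating this column to PeopleGroupMapURL.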

Out[296]:

saturn.columns["PeopleGroupMapExpandedURL"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	4,331
len_min	0
len_max	66
len_mean	46.2
len_median	63
len_p95	66
word_mean	1
word_median	1
n_empty	1,975
n_duplicates	2,793
duplicate_rate	0.3921
vocab_size	4,330
readability_flesch_mean	-468.9
emoji_rate	0
url_rate	0.7228
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: url_heavy	72.3% rows contain a URL
alert: duplicates	39.2% duplicate strings

Fig 101.

Character-length distribution for PeopleGroupMapExpandedURL.

Character-length distribution for PeopleGroupMapExpandedURL (mean: 46.19553621560921).
chars	count
0 – 2	1975
2 – 63	0
63 – 64	3579
64 – 66	1570

PeopleGroupURL text identifier

This column holds Joshua Project people-group URLs, one per row, with every value a 48-character single-token https link (url_rate 1.0, one_word_rate 1.0, len_min and len_max both 48). All 7124 values are unique with zero nulls or duplicates, so it functions as a per-row identifier rather than a feature. The URLs encode a people-group ID and a country code suffix (e.g., /10375/tz, /10375/up), meaning the same group recurs across countries in the underlying key even though the full URL is unique.

Treatment: Drop from modelling; retain as a row-level link key or parse out the people-group ID and country code as separate features.

anthropic:claude-opus-4-7 · confidence high
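
Parsing the two trailing path segments might be sketched as follows. The full URL in the example is an assumption built around the `/10375/tz` suffix quoted above, and `parse_group_url` and `GroupKey` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class GroupKey(NamedTuple):
    people_id: str      # kept as text so any leading zeros survive
    country_code: str   # upper-cased two-letter suffix


# Digits, then a two-letter code, at the end of the URL.
_TAIL = re.compile(r"/(\d+)/([A-Za-z]{2})/?$")


def parse_group_url(url: str) -> Optional[GroupKey]:
    """Split a people-group URL into (people_id, country_code)."""
    match = _TAIL.search(url.strip())
    if match is None:
        return None
    return GroupKey(match.group(1), match.group(2).upper())


# Assumed URL shape, for illustration only.
print(parse_group_url("https://joshuaproject.net/people_groups/10375/tz"))
```

Returning `None` for unexpected shapes makes it easy to count rows that do not follow the pattern before relying on the parsed fields.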

Out[299]:

saturn.columns["PeopleGroupURL"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	7,124
len_min	48
len_max	48
len_mean	48
len_median	48
len_p95	48
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	7,124
readability_flesch_mean	-479.9
emoji_rate	0
url_rate	1
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL

Fig 102.

Character-length distribution for PeopleGroupURL.

Character-length distribution for PeopleGroupURL (mean: 48.0).
chars	count
48	7124

PeopleGroupPhotoURL text metadata

This column holds Joshua Project people-group photo URLs; every populated cell is a single .jpg link under joshuaproject.net/assets/media/profiles/photos/ (url_rate 0.72, one_word_rate 1.0). 1971 of 7124 rows are empty strings (no nulls reported), and image URLs repeat heavily: duplicate_rate is 0.60 with 2880 unique values (2879 distinct URLs plus the blank), and the top URL appears 90 times. Even after setting aside the 1970 repeated blanks, 2274 rows reuse a URL seen elsewhere, so photos are shared across people-group records rather than being unique per-row assets.

Treatment: Treat as an optional asset link; replace empty strings with null (or drop those rows) and do not use as a feature.

anthropic:claude-opus-4-7 · confidence high
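
To see which assets are shared, the populated values can be counted while ignoring blanks. This is a minimal sketch with invented file names; `shared_values` is a hypothetical helper.

```python
from collections import Counter


def shared_values(values, min_count=2):
    """Return (value, count) pairs for non-blank values seen min_count+ times."""
    counts = Counter(v for v in values if v.strip())
    return [(v, n) for v, n in counts.most_common() if n >= min_count]


# Invented sample: two rows share a photo, two rows are blank.
sample = ["a.jpg", "", "a.jpg", "b.jpg", ""]
print(shared_values(sample))
```

Filtering blanks first keeps the empty string from appearing as the most "shared" asset.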

Out[302]:

saturn.columns["PeopleGroupPhotoURL"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2,880
len_min	0
len_max	68
len_mean	47.04
len_median	65
len_p95	65
word_mean	1
word_median	1
n_empty	1,971
n_duplicates	4,244
duplicate_rate	0.5957
vocab_size	2,879
readability_flesch_mean	-604.3
emoji_rate	0
url_rate	0.7233
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: url_heavy	72.3% rows contain a URL
alert: duplicates	59.6% duplicate strings

Fig 103.

Character-length distribution for PeopleGroupPhotoURL.

Character-length distribution for PeopleGroupPhotoURL (mean: 47.042111173498036).
chars	count
0 – 2	1971
2 – 65	0
65 – 66	5092
66 – 68	61

CountryURL categorical foreign_key

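Country-level URLs pointing to joshuaproject.net profile pages, with the 2-letter country code as the final path segment. There are 202 distinct countries across 7,124 rows and no nulls, but the distribution is heavily concentrated: India alone accounts for 28.5% of rows (2,032), with Pakistan (767) a distant second. The entropy ratio of 0.66 confirms moderate skew toward a handful of mostly South Asian countries. The codes may not be ISO 3166-1 alpha-2 (CH and BG rank third and fourth here, which would be surprising for Switzerland and Bulgaria), so confirm the code scheme before joining to external tables.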

Treatment: Extract the trailing country code as a categorical key; treat the URL itself as redundant metadata.

anthropic:claude-opus-4-7 · confidence high
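
Extracting the trailing code is a one-line string operation. The sketch below assumes every value follows the `.../countries/<code>` shape shown in the table; `country_code` is a hypothetical helper name.

```python
def country_code(url: str) -> str:
    """Return the last path segment of a country URL, upper-cased."""
    return url.rstrip("/").rsplit("/", 1)[-1].upper()


print(country_code("https://joshuaproject.net/countries/IN"))
```

The result is only as meaningful as the code scheme behind it, so it is worth checking a few values against a known country list before using it as a join key.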

Out[305]:

saturn.columns["CountryURL"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	202
top_value	https://joshuaproject.net/countries/IN
top_rate	0.2852
cardinality	202
entropy	5.058
entropy_ratio	0.6605

Fig 104.

Top values for CountryURL.

Top values for CountryURL (20 unique shown, of 202 total).
value	count	share
https://joshuaproject.net/countries/IN	2032	28.5%
https://joshuaproject.net/countries/PK	767	10.8%
https://joshuaproject.net/countries/CH	442	6.2%
https://joshuaproject.net/countries/BG	256	3.6%
https://joshuaproject.net/countries/ID	234	3.3%
https://joshuaproject.net/countries/NP	184	2.6%
https://joshuaproject.net/countries/SU	168	2.4%
https://joshuaproject.net/countries/LA	142	2.0%
https://joshuaproject.net/countries/RS	115	1.6%
https://joshuaproject.net/countries/US	90	1.3%
https://joshuaproject.net/countries/IR	85	1.2%
https://joshuaproject.net/countries/CD	81	1.1%
https://joshuaproject.net/countries/MY	78	1.1%
https://joshuaproject.net/countries/TH	73	1.0%
https://joshuaproject.net/countries/VM	69	1.0%
https://joshuaproject.net/countries/TU	61	0.9%
https://joshuaproject.net/countries/BM	59	0.8%
https://joshuaproject.net/countries/AF	58	0.8%
https://joshuaproject.net/countries/CE	55	0.8%
https://joshuaproject.net/countries/CA	52	0.7%

JPScaleText categorical metadata

JPScaleText is a categorical field that holds a single value, "Unreached", across all 7124 rows with no nulls. With cardinality of 1 and entropy of 0, it carries no information and cannot discriminate between records. The constant value suggests this dataset has been pre-filtered to unreached people groups only.

Treatment: Drop; constant column with zero entropy.

anthropic:claude-opus-4-7 · confidence high
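
Constant columns such as this one can be detected generically before modelling. The sketch below works on a plain dict of column name to list of values; the sample table is invented and `constant_columns` is a hypothetical helper.

```python
def constant_columns(table):
    """Names of columns whose values are all identical (or that are empty)."""
    return [name for name, values in table.items() if len(set(values)) <= 1]


# Invented three-row sample using column names from this report.
table = {
    "JPScaleText": ["Unreached", "Unreached", "Unreached"],
    "JF": ["Y", "N", "Y"],
    "LeastReached": ["Y", "Y", "Y"],
}
print(constant_columns(table))
```

The check relies on values being hashable, so nested columns would need a different comparison.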

Out[308]:

saturn.columns["JPScaleText"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	Unreached
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 105.

Top values for JPScaleText.

Top values for JPScaleText (1 unique shown, of 1 total).
value	count	share
Unreached	7124	100.0%

JPScaleImageURL categorical metadata

Every one of the 7,124 rows holds the same URL, https://joshuaproject.net/assets/img/gauge/gauge-1.png, giving a single unique value and zero entropy. This looks like a static asset link (a JP Scale gauge image) attached to each record rather than a discriminating feature. It carries no information for analysis or modelling.

Treatment: Drop; constant column with a single value across all rows.

anthropic:claude-opus-4-7 · confidence high

Out[311]:

saturn.columns["JPScaleImageURL"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1
top_value	https://joshuaproject.net/assets/img/gauge/gauge-1.png
top_rate	1
cardinality	1
entropy	0
entropy_ratio	0
alert: imbalance	top value is 100.0% of rows

Fig 106.

Top values for JPScaleImageURL.

Top values for JPScaleImageURL (1 unique shown, of 1 total).
value	count	share
https://joshuaproject.net/assets/img/gauge/gauge-1.png	7124	100.0%

Summary text free_text

Free-text English summaries describing people groups, with South Asian examples such as Rajputs, Jats, Bania, and Beldar, averaging 51 words with a median length of 316 characters (both figures include the empty rows). Coverage is poor: 3,167 of 7,124 rows (44%) are empty strings. Those blanks also account for most of the 3,439 duplicates (48% duplicate rate): 3,166 are repeated blanks and only 273 populated rows repeat an earlier summary, leaving 3,684 distinct non-empty texts. Several near-identical Rajput paragraphs differ by only a word or two, suggesting lightly edited copies of the same source text rather than independent summaries. A Flesch readability of 30.4 indicates fairly difficult prose.

Treatment: Deduplicate near-identical entries and drop or impute the 3,167 empty rows before tokenizing and embedding.

anthropic:claude-opus-4-7 · confidence high
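
Near-duplicate removal can be sketched with the standard library's `difflib.SequenceMatcher`. The 0.9 threshold, the helper names, and the sample sentences are all illustrative assumptions; the greedy pairwise comparison is quadratic, so a few thousand rows are manageable but much larger inputs would call for a hashing-based method.

```python
import re
from difflib import SequenceMatcher


def _canon(text):
    """Lower-case and collapse whitespace so trivial differences vanish."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def dedupe_near(texts, threshold=0.9):
    """Keep the first of each group of texts whose similarity >= threshold."""
    kept, kept_canon = [], []
    for text in texts:
        canon = _canon(text)
        if not canon:
            continue  # drop empty rows
        if any(SequenceMatcher(None, canon, seen).ratio() >= threshold
               for seen in kept_canon):
            continue  # too close to something already kept
        kept.append(text)
        kept_canon.append(canon)
    return kept


# Invented sample sentences, not taken from the dataset.
samples = [
    "The Rajput are a large community in northern India.",
    "The Rajput are a very large community in northern India.",
    "",
    "Bosniaks are a South Slavic people.",
]
print(dedupe_near(samples))
```

The threshold controls how aggressive the merge is; it is worth inspecting a sample of dropped rows before settling on a value.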

Out[314]:

saturn.columns["Summary"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	3,685
len_min	0
len_max	1,212
len_mean	309.7
len_median	316
len_p95	793
word_mean	51.26
word_median	52
n_empty	3,167
n_duplicates	3,439
duplicate_rate	0.4827
vocab_size	24,501
readability_flesch_mean	30.4
emoji_rate	0
url_rate	0
one_word_rate	0.4446
allcaps_rate	0
boilerplate_rate	0.0002807
alert: one_word	44.5% rows are a single word
alert: duplicates	48.3% duplicate strings

Fig 107.

Character-length distribution for Summary.

Character-length distribution for Summary (mean: 309.70087029758565).
chars	count
0 – 30	3167
30 – 61	1
61 – 91	1
91 – 121	7
121 – 152	16
152 – 182	21
182 – 212	40
212 – 242	60
242 – 273	76
273 – 303	110
303 – 333	144
333 – 364	173
364 – 394	162
394 – 424	198
424 – 454	208
454 – 485	206
485 – 515	243
515 – 545	234
545 – 576	207
576 – 606	224
606 – 636	253
636 – 667	244
667 – 697	196
697 – 727	251
727 – 758	143
758 – 788	165
788 – 818	85
818 – 848	84
848 – 879	60
879 – 909	29
909 – 939	18
939 – 970	9
970 – 1000	10
1000 – 1030	21
1030 – 1060	50
1060 – 1091	3
1091 – 1121	2
1121 – 1151	1
1151 – 1182	0
1182 – 1212	2

Obstacles text free_text

Free-text English prose describing barriers to Christian evangelism among various people groups (Rajputs, Jats, Bosniaks, Azeri, etc.), averaging 18 words and 107 characters per row (empty rows included). Notably, 3167 of 7124 rows are empty strings, and they account for most of the 0.489 duplicate rate: 3166 of the 3483 duplicates are repeated blanks and 317 are repeated passages. A single Rajput-pride passage is repeated 88 times, and a near-identical Jat passage appears as both 74- and 7-count variants. Readability is low (Flesch 31.6) and vocabulary is modest (9760 unique words), consistent with a templated missiological description field.

Treatment: Treat empties as missing, dedupe near-identical passages, then tokenize and embed for downstream topic or similarity analysis.

anthropic:claude-opus-4-7 · confidence high

Out[317]:

saturn.columns["Obstacles"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	3,641
len_min	0
len_max	726
len_mean	106.9
len_median	95
len_p95	317
word_mean	18.37
word_median	16
n_empty	3,167
n_duplicates	3,483
duplicate_rate	0.4889
vocab_size	9,760
readability_flesch_mean	31.62
emoji_rate	0
url_rate	0
one_word_rate	0.4446
allcaps_rate	0
boilerplate_rate	0.0009826
alert: one_word	44.5% rows are a single word
alert: duplicates	48.9% duplicate strings

Fig 108.

Character-length distribution for Obstacles.

Character-length distribution for Obstacles (mean: 106.86285794497473).
chars	count
0 – 18	3167
18 – 36	1
36 – 54	24
54 – 73	103
73 – 91	198
91 – 109	293
109 – 127	407
127 – 145	420
145 – 163	353
163 – 182	355
182 – 200	265
200 – 218	223
218 – 236	173
236 – 254	161
254 – 272	232
272 – 290	87
290 – 309	222
309 – 327	194
327 – 345	69
345 – 363	39
363 – 381	31
381 – 399	24
399 – 417	11
417 – 436	8
436 – 454	9
454 – 472	13
472 – 490	12
490 – 508	4
508 – 526	7
526 – 544	3
544 – 563	4
563 – 581	4
581 – 599	1
599 – 617	0
617 – 635	2
635 – 653	1
653 – 672	0
672 – 690	2
690 – 708	1
708 – 726	1

HowReach text free_text

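Free-text English prose describing outreach/engagement strategies for various people groups, likely a 'how to reach' field in a missions dataset. Over half the rows (3883 of 7124) are empty strings, so the median length is 0 characters; the reported word_median of 1 and the one_word_rate of 0.545, which equals the empty share, suggest the profiler counts an empty string as a single word. The duplicate_rate of 0.60 is mostly repeated blanks (3882 of 4271 duplicates), though 389 populated rows also repeat, with the same Jats and Rajputs paragraphs recurring dozens of times. Readability is low (Flesch 27.3) and the vocabulary reaches 7803 distinct tokens across the non-empty rows.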

Treatment: Treat empty strings as missing, deduplicate boilerplate, then tokenize and embed for downstream NLP.

anthropic:claude-opus-4-7 · confidence high
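
With more than half the rows empty, whole-column averages understate how long the populated entries are. Assuming `len_mean` is taken over all rows and empty rows contribute length 0 (consistent with `len_min` 0, but an assumption about the profiler), the mean over populated rows can be recovered as below; `mean_over_populated` is a hypothetical helper.

```python
def mean_over_populated(mean_all, n, n_empty):
    """Rescale a whole-column mean to the rows that are not empty."""
    populated = n - n_empty
    if populated <= 0:
        raise ValueError("no populated rows")
    return mean_all * n / populated


# len_mean, n and n_empty as reported for HowReach.
print(round(mean_over_populated(80.82116788321167, 7124, 3883), 2))
```

The same rescaling does not apply cleanly to `word_mean` if empty strings are counted as one word, which the stats for this column hint at.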

Out[320]:

saturn.columns["HowReach"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	2,853
len_min	0
len_max	599
len_mean	80.82
len_median	0
len_p95	260
word_mean	14.08
word_median	1
n_empty	3,883
n_duplicates	4,271
duplicate_rate	0.5995
vocab_size	7,803
readability_flesch_mean	27.34
emoji_rate	0
url_rate	0
one_word_rate	0.5451
allcaps_rate	0
boilerplate_rate	0.0002807
alert: one_word	54.5% rows are a single word
alert: duplicates	60.0% duplicate strings

Fig 109.

Character-length distribution for HowReach.

Character-length distribution for HowReach (mean: 80.82116788321167).
chars	count
0 – 15	3883
15 – 30	0
30 – 45	1
45 – 60	7
60 – 75	61
75 – 90	156
90 – 105	252
105 – 120	244
120 – 135	277
135 – 150	269
150 – 165	274
165 – 180	233
180 – 195	219
195 – 210	184
210 – 225	357
225 – 240	172
240 – 255	135
255 – 270	90
270 – 285	103
285 – 300	60
300 – 314	39
314 – 329	22
329 – 344	24
344 – 359	12
359 – 374	9
374 – 389	14
389 – 404	5
404 – 419	4
419 – 434	3
434 – 449	1
449 – 464	3
464 – 479	2
479 – 494	1
494 – 509	3
509 – 524	0
524 – 539	2
539 – 554	2
554 – 569	0
569 – 584	0
584 – 599	1

PrayForChurch text free_text

Free-text prayer prompts for an unreached-people-group / church-planting dataset, written in English (1473 detected) and centered on words like 'pray', 'Christ', 'among'. The field is sparsely populated: 5032 of 7124 rows are empty and only 1713 unique strings exist. The 0.76 duplicate rate is mostly repeated blanks (5031 of 5411 duplicates); a further 380 populated rows reuse boilerplate prayers across people groups (the top non-empty value repeats 146 times). Readability is low (Flesch 19.5) and length ranges from 0 to 649 chars, so the column is a mix of empty cells, one-liners, and full paragraphs.

Treatment: Treat as optional long-form text: mark empties as missing and tokenize/embed the rest before any modelling.

anthropic:claude-opus-4-7 · confidence high

Out[323]:

saturn.columns["PrayForChurch"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	1,713
len_min	0
len_max	649
len_mean	59.63
len_median	0
len_p95	286
word_mean	11.19
word_median	1
n_empty	5,032
n_duplicates	5,411
duplicate_rate	0.7595
vocab_size	4,447
readability_flesch_mean	19.5
emoji_rate	0
url_rate	0
one_word_rate	0.7063
allcaps_rate	0
boilerplate_rate	0
alert: one_word	70.6% rows are a single word
alert: duplicates	76.0% duplicate strings

Fig 110.

Character-length distribution for PrayForChurch.

Character-length distribution for PrayForChurch (mean: 59.63447501403706).
chars	count
0 – 16	5032
16 – 32	0
32 – 49	1
49 – 65	16
65 – 81	45
81 – 97	104
97 – 114	109
114 – 130	143
130 – 146	160
146 – 162	184
162 – 178	125
178 – 195	315
195 – 211	103
211 – 227	115
227 – 243	91
243 – 260	92
260 – 276	53
276 – 292	101
292 – 308	52
308 – 324	115
324 – 341	17
341 – 357	27
357 – 373	23
373 – 389	33
389 – 406	15
406 – 422	3
422 – 438	13
438 – 454	9
454 – 471	5
471 – 487	2
487 – 503	6
503 – 519	3
519 – 535	2
535 – 552	2
552 – 568	2
568 – 584	2
584 – 600	1
600 – 617	0
617 – 633	1
633 – 649	2

PrayForPG text free_text

Free-text prayer points for people groups (PG), each entry a short paragraph of intercessions led by the verb 'pray' (5450 occurrences). Nearly half the rows are empty (3405 of 7124), and those blanks drive most of the 0.517 duplicate_rate: 3404 of the 3683 duplicates are repeated empty strings, while 279 populated rows reuse an earlier text (the top non-empty value repeats 88 times). Templated variants may push truly unique content below the 3440 distinct non-empty strings. Readability is low (Flesch mean 32.7) and all detected language is English (2528 rows tagged en).

Treatment: Treat as free-text: drop empties, dedupe boilerplate, then tokenize/embed if used as a feature.

anthropic:claude-opus-4-7 · confidence high
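
Dropping empties and exact repeats while keeping first-seen order can be done with a dict, as in this minimal sketch. The sample strings are invented and `unique_nonempty` is a hypothetical helper; it only removes exact matches after whitespace normalisation, so templated variants would need a fuzzier comparison.

```python
def unique_nonempty(texts):
    """Drop blanks and exact repeats, preserving first-seen order."""
    cleaned = (" ".join(t.split()) for t in texts)  # collapse whitespace
    return list(dict.fromkeys(t for t in cleaned if t))


# Invented sample strings.
sample = ["Pray for  peace.", "", "Pray for peace.", "Pray for workers."]
print(unique_nonempty(sample))
```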

Out[326]:

saturn.columns["PrayForPG"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	3,441
len_min	0
len_max	937
len_mean	163.1
len_median	120
len_p95	453.8
word_mean	28.23
word_median	20
n_empty	3,405
n_duplicates	3,683
duplicate_rate	0.517
vocab_size	9,291
readability_flesch_mean	32.72
emoji_rate	0
url_rate	0
one_word_rate	0.478
allcaps_rate	0
boilerplate_rate	0
alert: one_word	47.8% rows are a single word
alert: duplicates	51.7% duplicate strings

Fig 111.

Character-length distribution for PrayForPG.

Character-length distribution for PrayForPG (mean: 163.06948343627175).
chars	count
0 – 23	3406
23 – 47	0
47 – 70	6
70 – 94	47
94 – 117	86
117 – 141	145
141 – 164	136
164 – 187	176
187 – 211	159
211 – 234	208
234 – 258	212
258 – 281	264
281 – 305	274
305 – 328	330
328 – 351	321
351 – 375	243
375 – 398	275
398 – 422	197
422 – 445	240
445 – 468	117
468 – 492	76
492 – 515	59
515 – 539	43
539 – 562	34
562 – 586	22
586 – 609	11
609 – 632	11
632 – 656	6
656 – 679	11
679 – 703	4
703 – 726	3
726 – 750	1
750 – 773	0
773 – 796	0
796 – 820	0
820 – 843	0
843 – 867	0
867 – 890	0
890 – 914	0
914 – 937	1

Resources unknown other

The column is named "Resources" with 7124 rows and zero nulls, but saturn skipped profiling so the kind is unknown and no unique count or value statistics were computed. Without type inference or sample values, its content (numeric, list, text, or identifier) cannot be determined from the evidence.

Treatment: Re-profile or inspect raw samples to establish type before any downstream use.

anthropic:claude-opus-4-7 · confidence low
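
A first inspection of an unprofiled column could tally the Python type of a few sampled values and check whether strings hold serialised JSON. Nothing in the profile says what this column contains, so the sample values below are invented and the helper names are hypothetical.

```python
import json
from collections import Counter


def describe_value(value):
    """Name a value's type; flag strings that parse as JSON."""
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except ValueError:
            return "str"
        return f"str containing JSON {type(parsed).__name__}"
    return type(value).__name__


def summarise_kinds(sample):
    """Tally the kinds seen in a sample of raw values."""
    return Counter(describe_value(v) for v in sample)


# Invented sample standing in for raw cell values.
print(summarise_kinds(['[{"a": 1}]', "plain text", [1, 2], None]))
```

If one kind dominates the tally, that is a reasonable starting point for choosing a cast or parser before re-profiling.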

Out[329]:

saturn.columns["Resources"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

country_data unknown other

The column `country_data` was skipped by the profiler, so its kind is unrecorded and no statistics, uniqueness, or value distribution are available. The only confirmed signals are 7124 rows with a 0.0 null rate. Without further inspection, the contents (likely some country-related payload given the name) cannot be characterised.

Treatment: Re-profile or manually inspect this column before any downstream use.

anthropic:claude-opus-4-7 · confidence low

Out[331]:

saturn.columns["country_data"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

language_data unknown other

The column `language_data` was skipped by the profiler — its kind is unrecognised and no descriptive statistics, uniqueness count, or value samples were emitted. Only the row count (7124) and a null rate of 0.0 are available, so nothing can be said about content, cardinality, or distribution. The name hints at linguistic payloads (possibly nested or serialised), but this is not corroborated by evidence.

Treatment: Re-profile after parsing or casting to a supported type before deciding on use.

anthropic:claude-opus-4-7 · confidence low

Out[333]:

saturn.columns["language_data"].stats

stat	value
n	7,124
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown