joshua-project-joshua_project_languages

Overview

Source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_languages.json

Saturn profiled 7,134 rows across 26 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_languages.json",
    "--findings", "joshua-project-joshua_project_languages.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This is a Joshua Project languages dataset with 7,134 rows and 26 columns, profiling world languages alongside Bible translation status, audio/film resource availability, primary religion, and host-country distribution. The headline signal is religious-engagement coverage: PrimaryReligion is dominated by Christianity (3,328), followed by Ethnic Religions (1,472) and Islam (945), and JPScale skews toward the more-reached end, with category 5 the largest bucket (2,050). Resource availability is uneven: HasAudioRecordings is roughly 59% Yes / 41% No, while HasJesusFilm is only ~28% Yes, so the Jesus Film coverage gap is worth a closer look. Geographic concentration is also notable: HubCountry is led by Papua New Guinea (837), Indonesia (686), and Nigeria (494), which together account for about 28% of entries. Finally, NbrPGICs is extremely skewed (max 1,804, median 1), so any per-language counts should be inspected with that long tail in mind.

citing: PrimaryReligion · JPScale · HasAudioRecordings · HasJesusFilm · HubCountry · NbrPGICs · Status · BibleStatus

Out[4]:

saturn.schema() · 26 columns

column	kind	n	null%	unique	alerts
ROL3	text	7,134	0.0%	7,134	near_unique one_word short_text
Language	text	7,134	0.0%	7,124	near_unique one_word
WebLangText	text	7,134	0.0%	7,134	near_unique one_word
Status	categorical	7,134	0.0%	2
ROG3	categorical	7,134	0.0%	211
HubCountry	categorical	7,134	0.0%	210
BibleStatus	numeric	7,134	0.0%	6
GRN_URL	text	7,134	41.4%	4,179	near_unique one_word url_heavy null_rate
TranslationNeedQuestionable	unknown	7,134	0.0%	—	skipped
BibleYear	categorical	7,134	89.0%	488	long_tail null_rate
NTYear	text	7,134	63.6%	1,109	one_word allcaps null_rate short_text duplicates
PortionsYear	text	7,134	43.0%	1,797	one_word allcaps null_rate short_text duplicates
PercentAdherents	numeric	7,134	11.9%	1,349
PercentEvangelical	numeric	7,134	17.3%	1,006
HasJesusFilm	categorical	7,134	0.0%	2
JF_URL	text	7,134	71.6%	2,008	near_unique one_word url_heavy null_rate
HasAudioRecordings	categorical	7,134	0.0%	2
JPScale	categorical	7,134	11.9%	5
LeastReached	categorical	7,134	0.0%	2
RLG3	numeric	7,134	10.9%	8
PrimaryReligion	categorical	7,134	0.0%	9
FCBH_URL	text	7,134	68.0%	2,272	near_unique one_word url_heavy null_rate
NbrPGICs	numeric	7,134	10.8%	155	high_skew outliers
NbrCountries	numeric	7,134	17.7%	43	high_skew outliers
JF	categorical	7,134	0.0%	2
AudioRecordings	categorical	7,134	0.0%	2

Fig 1.

PrimaryReligion · Christianity dominates at ~47%, but Ethnic Religions and Islam together cover roughly a third of languages.

Top values for PrimaryReligion (9 unique shown, of 9 total).
value	count	share
Christianity	3328	46.6%
Ethnic Religions	1472	20.6%
Islam	945	13.2%
	774	10.8%
Hinduism	268	3.8%
Buddhism	192	2.7%
Unknown	110	1.5%
Other / Small	25	0.4%
Non-Religious	18	0.3%

Fig 2.

JPScale · Distribution across the 5-point Joshua Project progress scale, skewed toward the more-reached categories 4 and 5.

Top values for JPScale (5 unique shown, of 5 total).
value	count	share
5	2050	28.7%
4	1900	26.6%
1	1473	20.6%
3	455	6.4%
2	409	5.7%

Fig 3.

HasJesusFilm · Only about 28% of languages have a Jesus Film available — a clear coverage gap.

Top values for HasJesusFilm (2 unique shown, of 2 total).
value	count	share
N	5105	71.6%
Y	2027	28.4%

Fig 4.

HubCountry · Top hub countries (Papua New Guinea, Indonesia, Nigeria) concentrate a large share of the languages tracked.

Top values for HubCountry (20 unique shown, of 210 total).
value	count	share
Papua New Guinea	837	11.7%
Indonesia	686	9.6%
Nigeria	494	6.9%
India	383	5.4%
Mexico	277	3.9%
China	256	3.6%
Cameroon	234	3.3%
Australia	192	2.7%
Congo, Democratic Republic of	182	2.6%
United States	179	2.5%
Philippines	175	2.5%
Brazil	175	2.5%
Tanzania	110	1.5%
Vanuatu	108	1.5%
Chad	104	1.5%
Nepal	100	1.4%
Malaysia	99	1.4%
Myanmar (Burma)	94	1.3%
Russia	91	1.3%
Peru	82	1.1%

Fig 5.

BibleStatus · Bible translation status spread from 0 to 5 — note the ~15% at zero indicating no scripture available.

Histogram bins for BibleStatus (median: 3.0).
bin	count
0	1064
1	488
2	1514
3	1470
4	1812
5	784

Fig 6.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
ROL3	text	0.0%
Language	text	0.0%
WebLangText	text	0.0%
Status	categorical	0.0%
ROG3	categorical	0.0%
HubCountry	categorical	0.0%
BibleStatus	numeric	0.0%
GRN_URL	text	41.4%
TranslationNeedQuestionable	unknown	0.0%
BibleYear	categorical	89.0%
NTYear	text	63.6%
PortionsYear	text	43.0%
PercentAdherents	numeric	11.9%
PercentEvangelical	numeric	17.3%
HasJesusFilm	categorical	0.0%
JF_URL	text	71.6%
HasAudioRecordings	categorical	0.0%
JPScale	categorical	11.9%
LeastReached	categorical	0.0%
RLG3	numeric	10.9%
PrimaryReligion	categorical	0.0%
FCBH_URL	text	68.0%
NbrPGICs	numeric	10.8%
NbrCountries	numeric	17.7%
JF	categorical	0.0%
AudioRecordings	categorical	0.0%

Fig 7.

Pearson correlation across numeric columns (sampled, bounded).

Pearson correlation across 6 numeric columns (values clipped to 2 decimals).
	BibleStatus	PercentAdherents	PercentEvangelical	RLG3	NbrPGICs	NbrCountries
BibleStatus	+1.00	-0.03	+0.01	+0.01	+0.02	-0.00
PercentAdherents	-0.03	+1.00	+0.05	-0.01	-0.01	-0.07
PercentEvangelical	+0.01	+0.05	+1.00	-0.04	+0.02	-0.07
RLG3	+0.01	-0.01	-0.04	+1.00	-0.02	+0.13
NbrPGICs	+0.02	-0.01	+0.02	-0.02	+1.00	+0.07
NbrCountries	-0.00	-0.07	-0.07	+0.13	+0.07	+1.00

ROL3 text identifier

ROL3 is a text column of exactly 7134 unique three-character single-word tokens across 7134 rows, with zero nulls and zero duplicates. The perfect 1:1 cardinality (vocab_size 7134 == n) and uniform len_min/max/mean of 3 strongly suggest this is a row-level identifier or code rather than natural language. Top tokens like 'aou', 'aiw', 'aas' show no repeated values, confirming it carries no distributional signal on its own.

Treatment: Drop from modelling or use only as a join key; near-unique three-letter codes carry no predictive signal.

anthropic:claude-opus-4-7 · confidence high

Out[13]:

saturn.columns["ROL3"].stats

stat	value
n	7,134
nulls	0 (0.0%)
unique	7,134
len_min	3
len_max	3
len_mean	3
len_median	3
len_p95	3
word_mean	1
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	7,134
readability_flesch_mean	120.4
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: short_text	95th-percentile length under 20 chars

Fig 8.

Character-length distribution for ROL3.

Character-length distribution for ROL3 (mean: 3.0).
chars	count
3	7134

Language text identifier

This column holds language names, with 7124 distinct values across 7134 rows and only 10 duplicates — essentially one row per language. Entries are short (mean 9.1 chars, median 1 word) and 73.5% are single-word labels, though compound names involving directional qualifiers (southern, northern, eastern, western) and family roots (zapotec, mixtec, naga) appear often. The high cardinality combined with the 'language' and 'sign' top tokens suggests this is a catalog of world languages, likely including sign-language variants.

Treatment: Treat as a near-unique label rather than a feature; because 10 names repeat, join on ROL3 (which is fully unique) rather than on Language, and do not one-hot encode.

anthropic:claude-opus-4-7 · confidence high

Out[16]:

saturn.columns["Language"].stats

stat	value
n	7,134
nulls	0 (0.0%)
unique	7,124
len_min	1
len_max	45
len_mean	9.102
len_median	7
len_p95	22
word_mean	1.363
word_median	1
n_empty	0
n_duplicates	10
duplicate_rate	0.001402
vocab_size	7,180
readability_flesch_mean	52.16
emoji_rate	0
url_rate	0
one_word_rate	0.7347
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	99.9% of rows are unique strings
alert: one_word	73.5% rows are a single word

Fig 9.

Character-length distribution for Language.

Character-length distribution for Language (mean: 9.102326885337819).
chars	count
1 – 2	27
2 – 3	210
3 – 4	762
4 – 5	1116
5 – 6	1097
6 – 8	841
8 – 9	531
9 – 10	346
10 – 11	225
11 – 12	215
12 – 13	383
13 – 14	204
14 – 15	175
15 – 16	193
16 – 18	124
18 – 19	90
19 – 20	78
20 – 21	74
21 – 22	69
22 – 23	73
23 – 24	123
24 – 25	31
25 – 26	30
26 – 27	24
27 – 29	17
29 – 30	13
30 – 31	6
31 – 32	19
32 – 33	10
33 – 34	7
34 – 35	2
35 – 36	1
36 – 37	1
37 – 38	2
38 – 40	8
40 – 41	3
41 – 42	3
42 – 43	0
43 – 44	0
44 – 45	1

WebLangText text identifier

WebLangText appears to be a per-row language name label, with every one of the 7134 values unique and 73% being a single word (mean 1.37 words, median length 7 chars). Top tokens like 'language', 'sign', 'zapotec', 'mixtec', and 'naga' suggest this is an inventory of world languages including sign languages and regional variants (Southern/Northern/Eastern/Western). The full uniqueness (n_unique == n) means it functions as an identifier rather than a categorical feature.

Treatment: Treat as a language-name key; left-join on this rather than using as a model feature.

anthropic:claude-opus-4-7 · confidence high

Out[19]:

saturn.columns["WebLangText"].stats

stat	value
n	7,134
nulls	0 (0.0%)
unique	7,134
len_min	1
len_max	45
len_mean	9.119
len_median	7
len_p95	22
word_mean	1.366
word_median	1
n_empty	0
n_duplicates	0
duplicate_rate	0
vocab_size	7,200
readability_flesch_mean	52.37
emoji_rate	0
url_rate	0
one_word_rate	0.7318
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	73.2% rows are a single word

Fig 10.

Character-length distribution for WebLangText.

Character-length distribution for WebLangText (mean: 9.11914774320157).
chars	count
1 – 2	27
2 – 3	208
3 – 4	748
4 – 5	1114
5 – 6	1095
6 – 8	841
8 – 9	531
9 – 10	348
10 – 11	239
11 – 12	217
12 – 13	385
13 – 14	204
14 – 15	175
15 – 16	193
16 – 18	124
18 – 19	90
19 – 20	78
20 – 21	74
21 – 22	69
22 – 23	73
23 – 24	123
24 – 25	31
25 – 26	30
26 – 27	24
27 – 29	17
29 – 30	13
30 – 31	6
31 – 32	19
32 – 33	10
33 – 34	7
34 – 35	2
35 – 36	1
36 – 37	1
37 – 38	2
38 – 40	8
40 – 41	3
41 – 42	3
42 – 43	0
43 – 44	0
44 – 45	1

Status categorical label

Binary status flag with two values, 'L' and 'N', dominated by 'L' at 86.0% (6,134 of the 7,132 non-null rows) versus 'N' at 14.0% (998). Class imbalance is notable, and there are 2 nulls (null_rate 0.0003). Entropy ratio of 0.58 confirms the skewed distribution.

Treatment: Encode as binary; address class imbalance (e.g., stratified sampling or class weights) before modelling.

anthropic:claude-opus-4-7 · confidence high
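
The entropy figure above can be reproduced by hand from the value counts. The sketch below assumes the profiler computes Shannon entropy in bits over non-null values and divides by log2 of the cardinality; the report does not state the formula, and `entropy_ratio` is a hypothetical helper, not part of Saturn.

```python
import math

def entropy_ratio(counts):
    """Shannon entropy in bits and its ratio to the maximum log2(k).

    Hypothetical helper; the formula is an assumption, not Saturn's code.
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    # With a single category the maximum is 0, so fall back to 1.0.
    max_entropy = math.log2(len(probs)) if len(probs) > 1 else 1.0
    return entropy, entropy / max_entropy

# Status counts from the top-values table: L = 6134, N = 998 (nulls excluded).
status_entropy, status_ratio = entropy_ratio([6134, 998])
print(round(status_entropy, 4), round(status_ratio, 4))
```

For a two-class column log2(2) = 1, so entropy and entropy_ratio coincide, which matches the identical 0.5841 values in the stats table. The same assumption is consistent with JPScale, where 2.07 / log2(5) is approximately 0.89.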

Out[22]:

saturn.columns["Status"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	2
top_value	L
top_rate	0.8601
cardinality	2
entropy	0.5841
entropy_ratio	0.5841

Fig 11.

Top values for Status.

Top values for Status (2 unique shown, of 2 total).
value	count	share
L	6134	86.0%
N	998	14.0%

ROG3 categorical feature

ROG3 is a categorical code field with 211 distinct two-letter values, dominated by 'PP' at 11.7% (837 rows) followed by 'ID', 'NI', 'IN', 'MX'. Entropy is 5.64 (ratio 0.73), indicating broad spread across the 211 categories rather than concentration in a few. Nulls are negligible (0.03%). The top-20 counts match the HubCountry table almost row for row (PP 837 = Papua New Guinea, NI 494 = Nigeria, CH 256 = China, AS 192 = Australia), so these are country codes, and pairings such as PP, NI, CH and AS are consistent with FIPS 10-4 rather than ISO 3166, where NI, CH and AS denote other territories.

Treatment: Target-encode or group rare levels before modelling; check the codes against FIPS 10-4 before joining to ISO-keyed data.

anthropic:claude-opus-4-7 · confidence medium

Out[25]:

saturn.columns["ROG3"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	211
top_value	PP
top_rate	0.1174
cardinality	211
entropy	5.642
entropy_ratio	0.7307

Fig 12.

Top values for ROG3.

Top values for ROG3 (20 unique shown, of 211 total).
value	count	share
PP	837	11.7%
ID	686	9.6%
NI	494	6.9%
IN	383	5.4%
MX	277	3.9%
CH	256	3.6%
CM	234	3.3%
AS	192	2.7%
CG	182	2.6%
US	179	2.5%
RP	175	2.5%
BR	175	2.5%
TZ	110	1.5%
NH	108	1.5%
CD	104	1.5%
NP	100	1.4%
MY	98	1.4%
BM	94	1.3%
RS	91	1.3%
PE	82	1.1%

HubCountry categorical feature

HubCountry is a categorical country-name field with 210 distinct values across 7,134 rows and a near-zero null rate (0.0003). The distribution is broad rather than concentrated: entropy ratio 0.73, with the top value Papua New Guinea covering only 11.7% of records, followed by Indonesia (686) and Nigeria (494). Because each row is a language, the leading countries are those with many distinct languages rather than the largest economies, which is worth noting before any geographic modelling.

Treatment: Group long-tail countries into regions or frequency buckets before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
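
One way to apply the frequency-bucket treatment is to fold every country seen fewer than a chosen number of times into a single label. This is a minimal sketch on toy data; `bucket_rare`, the `min_count` threshold and the "Other" label are illustrative assumptions, not anything prescribed by the profile.

```python
from collections import Counter

def bucket_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy sample only; real input would be the HubCountry column.
sample = ["Papua New Guinea"] * 5 + ["Indonesia"] * 4 + ["Peru", "Chad", "Nepal"]
bucketed = bucket_rare(sample, min_count=4)
print(Counter(bucketed))
```

The threshold trades resolution for stability: a higher `min_count` yields fewer, better-populated levels for one-hot or target encoding.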

Out[28]:

saturn.columns["HubCountry"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	210
top_value	Papua New Guinea
top_rate	0.1174
cardinality	210
entropy	5.641
entropy_ratio	0.7312

Fig 13.

Top values for HubCountry.

Top values for HubCountry (20 unique shown, of 210 total).
value	count	share
Papua New Guinea	837	11.7%
Indonesia	686	9.6%
Nigeria	494	6.9%
India	383	5.4%
Mexico	277	3.9%
China	256	3.6%
Cameroon	234	3.3%
Australia	192	2.7%
Congo, Democratic Republic of	182	2.6%
United States	179	2.5%
Philippines	175	2.5%
Brazil	175	2.5%
Tanzania	110	1.5%
Vanuatu	108	1.5%
Chad	104	1.5%
Nepal	100	1.4%
Malaysia	99	1.4%
Myanmar (Burma)	94	1.3%
Russia	91	1.3%
Peru	82	1.1%

BibleStatus numeric feature

BibleStatus is an integer-coded categorical with 6 distinct values from 0 to 5, mean 2.68 and median 3, almost certainly an ordinal status/level code rather than a true numeric measure. About 14.9% of rows are zero and the distribution is mildly left-skewed (skew -0.34) with flat kurtosis (-0.91), suggesting a fairly even spread across the upper levels with a sizable zero/'none' bucket. Null rate is negligible (0.0003) and no outliers were flagged.

Treatment: Treat as an ordinal categorical (one-hot or ordered encoding) rather than a continuous numeric.

anthropic:claude-opus-4-7 · confidence high
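
The two encodings named in the treatment differ in what they preserve: one-hot keeps the levels separate, while an ordered encoding keeps their rank. The sketch below shows both for the six observed codes; the function names are hypothetical, and equal spacing between levels is an assumption of the ordered variant.

```python
LEVELS = [0, 1, 2, 3, 4, 5]  # the six codes observed in the profile

def one_hot(code):
    """One indicator per level; None (null) maps to all zeros."""
    return [1 if code == level else 0 for level in LEVELS]

def ordered(code):
    """Rank scaled to [0, 1], preserving order; None stays None."""
    if code is None:
        return None
    return LEVELS.index(code) / (len(LEVELS) - 1)

for code in (0, 3, 5, None):
    print(code, one_hot(code), ordered(code))
```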

Out[31]:

saturn.columns["BibleStatus"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	6
min	0
max	5
mean	2.677
median	3
std	1.555
q1	2
q3	4
iqr	2
skew	-0.3401
kurtosis	-0.9086
n_outliers	0
outlier_rate	0
zero_rate	0.1492

Fig 14.

Distribution of BibleStatus. Vertical dash marks the median.

Histogram bins for BibleStatus (median: 3.0).
bin	count
0	1064
1	488
2	1514
3	1470
4	1812
5	784

GRN_URL text identifier

This column holds Global Recordings Network language URLs, every value a fixed 44-character single token under https://globalrecordings.net/en/language/ followed by a three-letter language code. With 4,179 unique values among the 4,180 non-null rows, it functions as a per-language identifier link rather than a feature. Two things stand out: exactly one URL is duplicated (idt appears twice), and 41.4% of rows have no GRN link at all.

Treatment: Extract the trailing language code as a foreign key; otherwise drop the URL itself from modelling.

anthropic:claude-opus-4-7 · confidence high
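
Extracting the trailing code is a small string operation. The sketch below assumes every non-null value follows the /en/language/ pattern described above; `grn_code` is a hypothetical helper, and anything that does not match the prefix is returned as None rather than guessed at.

```python
from urllib.parse import urlparse

GRN_PREFIX = "/en/language/"  # path prefix described in the profile

def grn_code(url):
    """Return the trailing language code, or None for nulls and other URLs."""
    if not url:
        return None
    path = urlparse(url).path
    if not path.startswith(GRN_PREFIX):
        return None
    code = path[len(GRN_PREFIX):].strip("/")
    return code or None

print(grn_code("https://globalrecordings.net/en/language/idt"))
```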

Out[34]:

saturn.columns["GRN_URL"].stats

stat	value
n	7,134
nulls	2,954 (41.4%)
unique	4,179
len_min	44
len_max	44
len_mean	44
len_median	44
len_p95	44
word_mean	1
word_median	1
n_empty	0
n_duplicates	1
duplicate_rate	0.0002392
vocab_size	4,179
readability_flesch_mean	-435
emoji_rate	0
url_rate	1
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL
alert: null_rate	41.4% null

Fig 15.

Character-length distribution for GRN_URL.

Character-length distribution for GRN_URL (mean: 44.0).
chars	count
44	4180

TranslationNeedQuestionable unknown other

The column 'TranslationNeedQuestionable' was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 7134 and a null rate of 0.0. The name suggests a boolean or flag indicating whether the need for translation is in doubt, but this cannot be confirmed from the evidence. No distribution, unique count, or sample values were captured.

Treatment: Re-profile with type inference enabled before deciding on downstream use.

anthropic:claude-opus-4-7 · confidence low

Out[37]:

saturn.columns["TranslationNeedQuestionable"].stats

stat	value
n	7,134
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

BibleYear categorical free_text

Free-text field nominally capturing a year associated with a Bible (likely year acquired or published), but heavily polluted: 89.01% of 7134 rows are null, and among the 488 unique values the most common entry is "2023" at just 3.4% of non-null values (27 rows), with "Yes" the second most frequent value (22 rows), indicating the field also absorbed yes/no answers. Year ranges such as "2014-2015" also appear. Entropy ratio of 0.93 confirms a long, flat tail with no dominant year.

Treatment: Clean by coercing single years to integers, splitting ranges (e.g., "2014-2015") into start and end years, and routing non-numeric responses (e.g., "Yes") to a separate flag before use.

anthropic:claude-opus-4-7 · confidence high
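
A cleaning pass for this field has to handle the value shapes visible in the top-values table: single years, year ranges, and stray answers such as "Yes". The sketch below is one possible parser; `parse_year_field` and its three-part return value are assumptions for illustration, and any shape not listed here falls through to the flag slot untouched.

```python
import re

YEAR = re.compile(r"^(\d{4})$")
YEAR_RANGE = re.compile(r"^(\d{4})-(\d{4})$")

def parse_year_field(raw):
    """Split a raw value into (start_year, end_year, flag).

    Years and year ranges fill start/end; any other non-empty text
    (for example "Yes") is kept as a flag string. Nulls give all None.
    """
    if raw is None or not str(raw).strip():
        return (None, None, None)
    text = str(raw).strip()
    match = YEAR.match(text)
    if match:
        year = int(match.group(1))
        return (year, year, None)
    match = YEAR_RANGE.match(text)
    if match:
        return (int(match.group(1)), int(match.group(2)), None)
    return (None, None, text)

for raw in ["2023", "Yes", "2014-2015", None]:
    print(repr(raw), parse_year_field(raw))
```

Keeping the unparsed text as a flag, rather than dropping it, makes it possible to audit what else ended up in the column before deciding how to treat it.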

Out[39]:

saturn.columns["BibleYear"].stats

stat	value
n	7,134
nulls	6,350 (89.0%)
unique	488
top_value	2023
top_rate	0.03444
cardinality	488
entropy	8.296
entropy_ratio	0.9289
alert: long_tail	411 singleton categories
alert: null_rate	89.0% null

Fig 16.

Top values for BibleYear.

Top values for BibleYear (20 unique shown, of 488 total).
value	count	share
2023	27	0.4%
Yes	22	0.3%
2019	15	0.2%
2016	14	0.2%
2022	14	0.2%
2014	14	0.2%
2013	11	0.2%
2009	11	0.2%
2018	10	0.1%
2021	10	0.1%
2010	10	0.1%
2024	9	0.1%
2011	9	0.1%
2015	8	0.1%
2002	8	0.1%
2014-2015	8	0.1%
2013-2014	7	0.1%
2020	6	0.1%
2008	6	0.1%
1999	6	0.1%

NTYear text feature

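Despite the name suggesting a year, NTYear is a single-token text column with mixed semantics: the most frequent value is 'Yes' (147 rows), followed by four-digit years from 2016-2024. By length, 147 non-null values have 3 characters, 1,021 have 4, and 1,428 have 9 (median 9), so more than half are neither 'Yes' nor a bare four-digit year; the 9-character values are plausibly year ranges like the "2014-2015" entries in BibleYear, though the profile does not show them. It's 63.61% null and 57.28% duplicates across only 1109 unique values, with one_word_rate of 1.0 and allcaps_rate of 0.94. The mix of a yes/no token alongside year values suggests two questions were collapsed into one field.

Treatment: Split into a boolean indicator and numeric year columns before any modelling, and inspect the 9-character values before choosing a parser.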

anthropic:claude-opus-4-7 · confidence high

Out[42]:

saturn.columns["NTYear"].stats

stat	value
n	7,134
nulls	4,538 (63.6%)
unique	1,109
len_min	3
len_max	9
len_mean	6.694
len_median	9
len_p95	9
word_mean	1
word_median	1
n_empty	0
n_duplicates	1,487
duplicate_rate	0.5728
vocab_size	1,109
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.9434
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	94.3% rows are all-caps
alert: null_rate	63.6% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	57.3% duplicate strings

Fig 17.

Character-length distribution for NTYear.

Character-length distribution for NTYear (mean: 6.693759630200308).
chars	count
3	147
4	1021
5 – 8	0
9	1428

PortionsYear text feature

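Despite the name PortionsYear, this column mixes a yes/no flag with years: every entry is a single word, the most common value is 'Yes' (706 occurrences) followed by years like 2024 (107), 2022 (55) and 2025 (43). By length, 706 non-null values have 3 characters, 1,290 have 4, and 2,070 have 9 (median 9), so about half are neither 'Yes' nor a bare four-digit year; those 9-character values are plausibly year ranges, though the profile does not show them. 43% of rows are null and 55.8% of the non-null values are duplicates across only 1,797 unique tokens, with 82.6% flagged as all-caps. The semantic mix of a boolean and a year in one field is the headline anomaly.

Treatment: Split into two columns, a boolean 'has portions' flag and a parsed year, before modelling, and inspect the 9-character values before choosing a parser.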

anthropic:claude-opus-4-7 · confidence high

Out[45]:

saturn.columns["PortionsYear"].stats

stat	value
n	7,134
nulls	3,068 (43.0%)
unique	1,797
len_min	3
len_max	9
len_mean	6.372
len_median	9
len_p95	9
word_mean	1
word_median	1
n_empty	0
n_duplicates	2,269
duplicate_rate	0.558
vocab_size	1,797
readability_flesch_mean	121.2
emoji_rate	0
url_rate	0
one_word_rate	1
allcaps_rate	0.8264
boilerplate_rate	0
alert: one_word	100.0% rows are a single word
alert: allcaps	82.6% rows are all-caps
alert: null_rate	43.0% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	55.8% duplicate strings

Fig 18.

Character-length distribution for PortionsYear.

Character-length distribution for PortionsYear (mean: 6.371864240039351).
chars	count
3	706
4	1290
5 – 8	0
9	2070

PercentAdherents numeric feature

PercentAdherents is a numeric share variable bounded between 0 and 100, almost certainly the percentage of some population that adheres to a religion or group. The distribution is U-shaped rather than centrally peaked: the IQR spans 5.34 to 90.0 with a median of 58.33, kurtosis is -1.68, the 0 to 2.5 bin alone holds 1,282 rows, and 7.3% of values are exactly zero. Nearly 12% of rows are null, which is worth flagging before any aggregation.

Treatment: Impute or filter the 11.87% nulls and consider binning given the U-shaped distribution before modelling.

anthropic:claude-opus-4-7 · confidence high
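
For a U-shaped variable, coarse bins with an explicit missing bucket can be easier to model than the raw percentage. The sketch below uses cut points of 5 and 90, chosen only because they sit near the reported quartiles (5.34 and 90.0); both the cut points and the `adherent_bucket` name are illustrative assumptions.

```python
def adherent_bucket(pct):
    """Map a percentage to a coarse label; cut points are illustrative."""
    if pct is None:
        return "missing"
    if pct < 5:
        return "low"
    if pct < 90:
        return "mid"
    return "high"

for pct in (0.0, 5.34, 58.33, 90.0, None):
    print(pct, adherent_bucket(pct))
```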

Out[48]:

saturn.columns["PercentAdherents"].stats

stat	value
n	7,134
nulls	847 (11.9%)
unique	1,349
min	0
max	100
mean	49.63
median	58.33
std	38.83
q1	5.34
q3	90
iqr	84.66
skew	-0.08408
kurtosis	-1.679
n_outliers	0
outlier_rate	0
zero_rate	0.07285

Fig 19.

Distribution of PercentAdherents. Vertical dash marks the median.

Histogram bins for PercentAdherents (median: 58.33).
bin	count
0 – 2.5	1282
2.5 – 5	205
5 – 7.5	203
7.5 – 10	112
10 – 12.5	191
12.5 – 15	53
15 – 17.5	119
17.5 – 20	42
20 – 22.5	129
22.5 – 25	25
25 – 27.5	102
27.5 – 30	25
30 – 32.5	110
32.5 – 35	29
35 – 37.5	66
37.5 – 40	22
40 – 42.5	145
42.5 – 45	18
45 – 47.5	80
47.5 – 50	20
50 – 52.5	55
52.5 – 55	19
55 – 57.5	83
57.5 – 60	24
60 – 62.5	181
62.5 – 65	31
65 – 67.5	171
67.5 – 70	48
70 – 72.5	220
72.5 – 75	70
75 – 77.5	116
77.5 – 80	43
80 – 82.5	164
82.5 – 85	56
85 – 87.5	167
87.5 – 90	88
90 – 92.5	435
92.5 – 95	155
95 – 97.5	785
97.5 – 100	398

PercentEvangelical numeric feature

This appears to be a percentage feature capturing the share of evangelicals in each record's population, ranging from 0 to 95 with a median of just 5. The distribution is heavily right-skewed (skew 1.93, kurtosis 4.92) and 9% of values are exact zeros, with 251 outliers (4.3%) pulling the mean up to 10.1. Note also that 17.3% of rows are null, which is substantial.

Treatment: Impute or flag the 17% nulls and consider a log1p transform before modelling to tame the right skew.

anthropic:claude-opus-4-7 · confidence high
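
The log1p transform suggested above compresses the long right tail while leaving exact zeros at zero. The sketch below pairs it with a missing-value indicator; filling nulls with 0.0 is an assumption made here for illustration, and the flag is what lets a model tell an imputed zero from a real one.

```python
import math

def transform_evangelical(pct):
    """Return (log1p value, missing flag); nulls become (0.0, 1)."""
    if pct is None:
        return (0.0, 1)
    return (math.log1p(pct), 0)

for pct in (0.0, 5.0, 95.0, None):
    print(pct, transform_evangelical(pct))
```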

Out[51]:

saturn.columns["PercentEvangelical"].stats

stat	value
n	7,134
nulls	1,234 (17.3%)
unique	1,006
min	0
max	95
mean	10.13
median	5
std	12.24
q1	1
q3	15.55
iqr	14.55
skew	1.932
kurtosis	4.925
n_outliers	251
outlier_rate	0.04254
zero_rate	0.09

Fig 20.

Distribution of PercentEvangelical. Vertical dash marks the median.

Histogram bins for PercentEvangelical (median: 5.0).
bin	count
0 – 2.375	1985
2.375 – 4.75	770
4.75 – 7.125	674
7.125 – 9.5	295
9.5 – 11.88	252
11.88 – 14.25	311
14.25 – 16.62	225
16.62 – 19	170
19 – 21.38	295
21.38 – 23.75	150
23.75 – 26.12	223
26.12 – 28.5	82
28.5 – 30.88	94
30.88 – 33.25	67
33.25 – 35.62	40
35.62 – 38	17
38 – 40.38	37
40.38 – 42.75	21
42.75 – 45.12	73
45.12 – 47.5	31
47.5 – 49.88	19
49.88 – 52.25	16
52.25 – 54.62	3
54.62 – 57	7
57 – 59.38	0
59.38 – 61.75	16
61.75 – 64.12	3
64.12 – 66.5	2
66.5 – 68.88	0
68.88 – 71.25	5
71.25 – 73.62	2
73.62 – 76	4
76 – 78.38	3
78.38 – 80.75	3
80.75 – 83.12	0
83.12 – 85.5	1
85.5 – 87.88	2
87.88 – 90.25	1
90.25 – 92.62	0
92.62 – 95	1

HasJesusFilm categorical feature

Binary Y/N flag indicating whether each record has an associated Jesus Film, with only 2 unique values across 7134 rows and a negligible 0.0003 null rate. The distribution is moderately imbalanced: 'N' dominates at 71.6% (5105) versus 2027 'Y' values, yielding an entropy ratio of 0.86.

Treatment: Encode as a 0/1 boolean indicator for modelling.

anthropic:claude-opus-4-7 · confidence high

Out[54]:

saturn.columns["HasJesusFilm"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	2
top_value	N
top_rate	0.7158
cardinality	2
entropy	0.8611
entropy_ratio	0.8611

Fig 21.

Top values for HasJesusFilm.

Top values for HasJesusFilm (2 unique shown, of 2 total).
value	count	share
N	5105	71.6%
Y	2027	28.4%

JF_URL text metadata

This column holds JesusFilm.org URLs (url_rate 1.0, one_word_rate 0.9995), almost all pointing to language-specific watch pages like /watch/jesus.html/{language}.html. Coverage is sparse with a 71.6% null rate, and values are near-unique (2,008 distinct among the 2,027 non-null rows), though a generic partners/resources page appears 13 times and 19 duplicates exist overall. URL lengths are tight (45-87 chars, median 55), consistent with a templated link rather than free text.

Treatment: Treat as a reference link; extract the language slug from the path if you need a feature, otherwise drop from modelling.

anthropic:claude-opus-4-7 · confidence high

Out[57]:

saturn.columns["JF_URL"].stats

stat	value
n	7,134
nulls	5,107 (71.6%)
unique	2,008
len_min	45
len_max	87
len_mean	56.77
len_median	55
len_p95	69
word_mean	1
word_median	1
n_empty	0
n_duplicates	19
duplicate_rate	0.009373
vocab_size	2,009
readability_flesch_mean	-781.9
emoji_rate	0
url_rate	1
one_word_rate	0.9995
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	99.1% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL
alert: null_rate	71.6% null

Fig 22.

Character-length distribution for JF_URL.

Character-length distribution for JF_URL (mean: 56.772570300937346).
chars	count
45 – 46	13
46 – 47	0
47 – 48	0
48 – 49	0
49 – 50	5
50 – 51	58
51 – 52	221
52 – 53	324
53 – 54	299
54 – 56	237
56 – 57	139
57 – 58	96
58 – 59	72
59 – 60	69
60 – 61	63
61 – 62	85
62 – 63	57
63 – 64	72
64 – 65	42
65 – 66	23
66 – 67	29
67 – 68	17
68 – 69	18
69 – 70	22
70 – 71	15
71 – 72	20
72 – 73	8
73 – 74	9
74 – 75	3
75 – 76	1
76 – 78	0
78 – 79	3
79 – 80	3
80 – 81	0
81 – 82	2
82 – 83	0
83 – 84	1
84 – 85	0
85 – 86	0
86 – 87	1

HasAudioRecordings categorical feature

Binary Y/N flag indicating whether a record has associated audio recordings. The split is 59.1% Y vs 40.9% N, with entropy ratio 0.976 showing near-maximal balance for a two-class field. Null rate is negligible (0.0003).

Treatment: Encode as a boolean indicator before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[60]:

saturn.columns["HasAudioRecordings"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	2
top_value	Y
top_rate	0.5911
cardinality	2
entropy	0.9759
entropy_ratio	0.9759

Fig 23.

Top values for HasAudioRecordings.

Top values for HasAudioRecordings (2 unique shown, of 2 total).
value	count	share
Y	4216	59.1%
N	2916	40.9%

JPScale categorical feature

JPScale is a low-cardinality categorical with 5 distinct values ('1' through '5'), suggesting an ordinal rating or scale. Distribution is bimodal: the high end ('5' at 28.7% and '4' at 26.6% of all rows) and the low end ('1' at 20.6%) dominate, while the middle values '3' and '2' are comparatively rare. The reported top_rate of 0.3261 for '5' is its share of the 6,287 non-null rows, not of all rows. Entropy ratio of 0.89 indicates fairly even spread across categories, but 11.87% of rows are null.

Treatment: Treat as ordinal (1–5); impute or flag the ~12% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
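
The top_rate in the stats table (0.3261) and the share in the top-values table (28.7%) describe the same count with different denominators. The arithmetic below, using only counts from the report, shows each value as a share of all rows and of non-null rows.

```python
# Counts from the JPScale top-values table; 7,134 rows in total.
counts = {"5": 2050, "4": 1900, "1": 1473, "3": 455, "2": 409}
n_rows = 7134
n_non_null = sum(counts.values())  # rows with a JPScale value

for value, count in counts.items():
    share_all = 100 * count / n_rows
    share_non_null = 100 * count / n_non_null
    print(f"{value}: {share_all:.1f}% of rows, {share_non_null:.1f}% of non-null")
```

With 847 nulls, the two denominators differ by about 12%, which is enough to shift every share by several points.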

Out[63]:

saturn.columns["JPScale"].stats

stat	value
n	7,134
nulls	847 (11.9%)
unique	5
top_value	5
top_rate	0.3261
cardinality	5
entropy	2.07
entropy_ratio	0.8915

Fig 24.

Top values for JPScale.

Top values for JPScale (5 unique shown, of 5 total).
value	count	share
5	2050	28.7%
4	1900	26.6%
1	1473	20.6%
3	455	6.4%
2	409	5.7%

LeastReached categorical feature

Binary Y/N flag indicating whether some 'least reached' status applies, with N dominating at 79.3% (5659) versus 1473 Y values across 7134 rows. Class imbalance is notable but not extreme, and nulls are negligible (0.03%). Cardinality is exactly 2 with entropy ratio 0.73, consistent with a clean boolean indicator.

Treatment: Encode as boolean (Y=1, N=0) and account for the ~80/20 imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
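
If LeastReached is used as a target, one common way to account for the imbalance is inverse-frequency class weights. The sketch below uses the n_samples / (n_classes * class_count) heuristic, the same one scikit-learn documents for class_weight="balanced"; `balanced_weights` is a hypothetical stdlib-only helper.

```python
def balanced_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * c) for label, c in counts.items()}

# LeastReached counts from the top-values table (nulls excluded).
weights = balanced_weights({"N": 5659, "Y": 1473})
print({label: round(w, 3) for label, w in weights.items()})
```

Under these weights each class contributes the same total weight, so the minority 'Y' class is not drowned out during training.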

Out[66]:

saturn.columns["LeastReached"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	2
top_value	N
top_rate	0.7935
cardinality	2
entropy	0.7348
entropy_ratio	0.7348

Fig 25.

Top values for LeastReached.

Top values for LeastReached (2 unique shown, of 2 total).
value	count	share
N	5659	79.3%
Y	1473	20.6%

RLG3 numeric feature

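RLG3 is a small-cardinality integer code (8 unique values across 7134 rows, ranging 1-9 with no zeros and no 3s). Its value counts match the PrimaryReligion categories one for one (code 1 has 3328 rows like Christianity, 4 has 1472 like Ethnic Religions, 6 has 945 like Islam, 9 has 110 like Unknown), which points to a nominal religion code rather than an ordinal scale. On that reading the mean (2.82), the right skew (0.74) and the 110 outliers (1.7%, all at code 9) are artefacts of the numbering and carry no meaning. The 10.88% null rate (776 rows) is in line with the 774 empty-string entries plus 2 nulls in PrimaryReligion.

Treatment: Treat as a nominal categorical that likely duplicates PrimaryReligion; keep one of the two and impute or flag the ~11% missing values before modelling.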

anthropic:claude-opus-4-7 · confidence high

Out[69]:

saturn.columns["RLG3"].stats

stat	value
n	7,134
nulls	776 (10.9%)
unique	8
min	1
max	9
mean	2.819
median	1
std	2.144
q1	1
q3	4
iqr	3
skew	0.7378
kurtosis	-0.5707
n_outliers	110
outlier_rate	0.0173
zero_rate	0

Fig 26.

Distribution of RLG3. Vertical dash marks the median.

Histogram bins for RLG3 (median: 1.0).
bin	count
1 – 1.2	3328
1.2 – 1.4	0
1.4 – 1.6	0
1.6 – 1.8	0
1.8 – 2	0
2 – 2.2	192
2.2 – 2.4	0
2.4 – 2.6	0
2.6 – 2.8	0
2.8 – 3	0
3 – 3.2	0
3.2 – 3.4	0
3.4 – 3.6	0
3.6 – 3.8	0
3.8 – 4	0
4 – 4.2	1472
4.2 – 4.4	0
4.4 – 4.6	0
4.6 – 4.8	0
4.8 – 5	0
5 – 5.2	268
5.2 – 5.4	0
5.4 – 5.6	0
5.6 – 5.8	0
5.8 – 6	0
6 – 6.2	945
6.2 – 6.4	0
6.4 – 6.6	0
6.6 – 6.8	0
6.8 – 7	0
7 – 7.2	18
7.2 – 7.4	0
7.4 – 7.6	0
7.6 – 7.8	0
7.8 – 8	0
8 – 8.2	25
8.2 – 8.4	0
8.4 – 8.6	0
8.6 – 8.8	0
8.8 – 9	110
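
The histogram counts above coincide with the PrimaryReligion value counts reported in this document. A minimal sketch that pairs each RLG3 code with the label sharing its count; the resulting mapping is an inference from matching counts, not a confirmed code book, and should be checked row by row against the source data:

```python
# Non-empty histogram bins for RLG3 (code -> count), copied from above.
rlg3_counts = {1: 3328, 2: 192, 4: 1472, 5: 268, 6: 945, 7: 18, 8: 25, 9: 110}

# Named PrimaryReligion categories and their counts from this report.
# The empty-string category (774 rows) is left out: together with the 2 true
# nulls it matches the 776 RLG3 nulls.
religion_counts = {
    "Christianity": 3328,
    "Ethnic Religions": 1472,
    "Islam": 945,
    "Hinduism": 268,
    "Buddhism": 192,
    "Unknown": 110,
    "Other / Small": 25,
    "Non-Religious": 18,
}

# Index labels by count so ambiguous (repeated) counts can be detected.
by_count = {}
for label, n in religion_counts.items():
    by_count.setdefault(n, []).append(label)

# Accept a pairing only when exactly one label has the same count.
inferred = {}
for code, n in sorted(rlg3_counts.items()):
    labels = by_count.get(n, [])
    inferred[code] = labels[0] if len(labels) == 1 else None

for code, label in inferred.items():
    print(code, label)
```

If the pairing holds on the raw rows, RLG3 and PrimaryReligion carry the same information and only one of them needs to enter a model.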

PrimaryReligion categorical feature

Categorical label for the dominant religion of each record, with 9 distinct values across 7134 rows. Christianity leads at 46.7% of non-null rows (3328), followed by Ethnic Religions (1472) and Islam (945). Note that the empty-string category appears 774 times alongside an explicit 'Unknown' bucket of 110, so two separate missingness conventions coexist beyond the 0.03% true nulls.

Treatment: Consolidate the empty string into 'Unknown' (leaving 8 categories) and one-hot encode the result.

anthropic:claude-opus-4-7 · confidence high

Out[72]:

saturn.columns["PrimaryReligion"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	9
top_value	Christianity
top_rate	0.4666
cardinality	9
entropy	2.179
entropy_ratio	0.6873

Fig 27.

Top values for PrimaryReligion.

Top values for PrimaryReligion (9 unique shown, of 9 total).
value	count	share
Christianity	3328	46.6%
Ethnic Religions	1472	20.6%
Islam	945	13.2%
(empty string)	774	10.8%
Hinduism	268	3.8%
Buddhism	192	2.7%
Unknown	110	1.5%
Other / Small	25	0.4%
Non-Religious	18	0.3%
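
The consolidation and one-hot step described in the treatment can be sketched in plain Python. The function names are hypothetical, and folding true nulls into 'Unknown' along with the empty string is an assumption made here for simplicity:

```python
# The 8 categories that remain once the empty string is folded into 'Unknown'.
CATEGORIES = [
    "Christianity",
    "Ethnic Religions",
    "Islam",
    "Hinduism",
    "Buddhism",
    "Unknown",
    "Other / Small",
    "Non-Religious",
]


def normalise_religion(value):
    """Collapse None, empty, and whitespace-only strings into 'Unknown'."""
    if value is None or value.strip() == "":
        return "Unknown"
    return value.strip()


def one_hot(label):
    """Return a 0/1 vector with a single 1 at the label's position."""
    return [1 if label == category else 0 for category in CATEGORIES]


raw = ["Christianity", "", None, "Unknown", "Islam"]
clean = [normalise_religion(v) for v in raw]
vectors = [one_hot(v) for v in clean]

for label, vector in zip(clean, vectors):
    print(label, vector)
```

Merging discards the distinction between "recorded as Unknown" and "left blank"; if that distinction matters, a separate was-blank indicator can be kept alongside the encoded columns.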

FCBH_URL text metadata

This column holds a single URL per row pointing to Faith Comes By Hearing resources (apk.fcbh.org or live.bible.is), with url_rate at 1.0 and one_word_rate at 0.9987. It is largely missing (null_rate 0.6801); of the 2282 populated rows, 2272 values are unique and only 10 are duplicates. Lengths are tight (min 25, max 100, mean 37.68, median 34), consistent with a structured link field rather than free text.

Treatment: Treat as an optional reference link; keep as-is for lookup, do not feed into modelling.

anthropic:claude-opus-4-7 · confidence high

Out[75]:

saturn.columns["FCBH_URL"].stats

stat	value
n	7,134
nulls	4,852 (68.0%)
unique	2,272
len_min	25
len_max	100
len_mean	37.68
len_median	34
len_p95	66
word_mean	1.001
word_median	1
n_empty	0
n_duplicates	10
duplicate_rate	0.004382
vocab_size	2,272
readability_flesch_mean	-325.5
emoji_rate	0
url_rate	1
one_word_rate	0.9987
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	99.6% of rows are unique strings
alert: one_word	99.9% rows are a single word
alert: url_heavy	100.0% rows contain a URL
alert: null_rate	68.0% null

Fig 28.

Character-length distribution for FCBH_URL.

Character-length distribution for FCBH_URL (mean: 37.68).
chars	count
25 – 27	17
27 – 29	15
29 – 31	40
31 – 32	43
32 – 34	1831
34 – 36	30
36 – 38	9
38 – 40	7
40 – 42	5
42 – 44	6
44 – 46	8
46 – 48	4
48 – 49	2
49 – 51	0
51 – 53	0
53 – 55	0
55 – 57	1
57 – 59	0
59 – 61	1
61 – 62	1
62 – 64	25
64 – 66	221
66 – 68	3
68 – 70	1
70 – 72	1
72 – 74	0
74 – 76	0
76 – 78	2
78 – 79	3
79 – 81	3
81 – 83	1
83 – 85	1
85 – 87	0
87 – 89	0
89 – 91	0
91 – 92	0
92 – 94	0
94 – 96	0
96 – 98	0
98 – 100	1
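
The duplicate and uniqueness figures for FCBH_URL only make sense relative to the populated rows, not all 7134. A short arithmetic sketch, assuming n_duplicates is defined as populated rows minus unique values (this definition is consistent with the reported numbers but is not stated in the report):

```python
# Figures copied from the FCBH_URL stats block above.
n_rows = 7134
nulls = 4852
unique = 2272

populated = n_rows - nulls        # rows that actually hold a URL
duplicates = populated - unique   # populated rows repeating an earlier value

null_rate = nulls / n_rows
duplicate_rate = duplicates / populated
unique_rate = unique / populated  # basis of the near_unique alert

print(f"populated={populated} duplicates={duplicates}")
print(f"null_rate={null_rate:.4f} duplicate_rate={duplicate_rate:.6f}")
print(f"unique_rate={unique_rate:.1%}")
```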

NbrPGICs numeric feature

NbrPGICs is a heavily right-skewed count feature, with median 1 and Q3 of 2 but a maximum of 1804 and standard deviation of 48.93. The distribution shows extreme tail behaviour (skew 16.65, kurtosis 404.08) and 701 outliers (11.0% of values), while 10.75% of rows are null. Most records carry a trivial count, but a small subset reports values orders of magnitude larger.

Treatment: Log-transform or cap at a high quantile and impute the 10.75% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high

Out[78]:

saturn.columns["NbrPGICs"].stats

stat	value
n	7,134
nulls	767 (10.8%)
unique	155
min	1
max	1,804
mean	7.209
median	1
std	48.93
q1	1
q3	2
iqr	1
skew	16.65
kurtosis	404.1
n_outliers	701
outlier_rate	0.1101
zero_rate	0
alert: high_skew	skew=+16.65
alert: outliers	11.0% rows beyond 1.5 IQR

Fig 29.

Distribution of NbrPGICs. Vertical dash marks the median.

Histogram bins for NbrPGICs (median: 1.0).
bin	count
1 – 46.08	6231
46.08 – 91.15	46
91.15 – 136.2	21
136.2 – 181.3	9
181.3 – 226.4	11
226.4 – 271.5	5
271.5 – 316.5	11
316.5 – 361.6	9
361.6 – 406.7	1
406.7 – 451.8	6
451.8 – 496.8	1
496.8 – 541.9	3
541.9 – 587	2
587 – 632.1	2
632.1 – 677.1	0
677.1 – 722.2	3
722.2 – 767.3	1
767.3 – 812.4	1
812.4 – 857.4	1
857.4 – 902.5	0
902.5 – 947.6	2
947.6 – 992.7	0
992.7 – 1038	0
1038 – 1083	0
1083 – 1128	0
1128 – 1173	0
1173 – 1218	0
1218 – 1263	0
1263 – 1308	0
1308 – 1353	0
1353 – 1398	0
1398 – 1443	0
1443 – 1488	0
1488 – 1534	0
1534 – 1579	0
1579 – 1624	0
1624 – 1669	0
1669 – 1714	0
1714 – 1759	0
1759 – 1804	1
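
The two options named in the treatment (capping at a high quantile, or a log transform) can be sketched as below. The sample values are hypothetical and only chosen to resemble the reported shape (many 1s and 2s, a maximum of 1804); the nearest-rank quantile is one of several quantile definitions and is used here for simplicity:

```python
import math


def cap_at_quantile(values, q=0.99):
    """Cap non-null values at the nearest-rank q-quantile; None passes through."""
    present = sorted(v for v in values if v is not None)
    if not present:
        return list(values)
    rank = max(1, math.ceil(q * len(present)))  # 1-based nearest rank
    cap = present[rank - 1]
    return [None if v is None else min(v, cap) for v in values]


def log1p_or_none(values):
    """Apply log(1 + x) to non-null values; None passes through."""
    return [None if v is None else math.log1p(v) for v in values]


# Hypothetical sample, not real rows from the dataset.
sample = [1, 1, 1, 2, 2, 3, 12, 250, 1804, None]

capped = cap_at_quantile(sample, q=0.8)
logged = log1p_or_none(sample)

print(capped)
print([None if v is None else round(v, 2) for v in logged])
```

Capping keeps the original units but flattens the tail; the log transform keeps the ordering of large values while compressing their scale. Nulls are passed through untouched so that imputation remains a separate, explicit step.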

NbrCountries numeric feature

NbrCountries is a numeric count of countries associated with each record, ranging from 1 to 136 with a median of 1 and Q1=Q3=1, meaning at least three quarters of non-null rows are single-country. Because the IQR is 0, the 1.5 IQR rule flags every value above 1, so the 1203 outliers (20.5% outlier rate) are simply the multi-country records rather than anomalies. The distribution is extremely heavy-tailed (skew 15.6, kurtosis 364) with a 17.75% null rate, so a small minority of multi-country records dominates the variance.

Treatment: Log1p-transform or bucket into 1 vs. multi-country before modelling, and impute or flag the 17.75% nulls.

anthropic:claude-opus-4-7 · confidence high

Out[81]:

saturn.columns["NbrCountries"].stats

stat	value
n	7,134
nulls	1,266 (17.7%)
unique	43
min	1
max	136
mean	1.711
median	1
std	3.901
q1	1
q3	1
iqr	0
skew	15.6
kurtosis	364.1
n_outliers	1,203
outlier_rate	0.205
zero_rate	0
alert: high_skew	skew=+15.60
alert: outliers	20.5% rows beyond 1.5 IQR

Fig 30.

Distribution of NbrCountries. Vertical dash marks the median.

Histogram bins for NbrCountries (median: 1.0).
bin	count
1 – 4.375	5656
4.375 – 7.75	92
7.75 – 11.12	41
11.12 – 14.5	20
14.5 – 17.88	10
17.88 – 21.25	12
21.25 – 24.62	5
24.62 – 28	8
28 – 31.38	4
31.38 – 34.75	4
34.75 – 38.12	2
38.12 – 41.5	3
41.5 – 44.88	2
44.88 – 48.25	1
48.25 – 51.62	1
51.62 – 55	1
55 – 58.38	1
58.38 – 61.75	0
61.75 – 65.12	1
65.12 – 68.5	0
68.5 – 71.88	1
71.88 – 75.25	0
75.25 – 78.62	1
78.62 – 82	0
82 – 85.38	0
85.38 – 88.75	1
88.75 – 92.12	0
92.12 – 95.5	0
95.5 – 98.88	0
98.88 – 102.2	0
102.2 – 105.6	0
105.6 – 109	0
109 – 112.4	0
112.4 – 115.8	0
115.8 – 119.1	0
119.1 – 122.5	0
122.5 – 125.9	0
125.9 – 129.2	0
129.2 – 132.6	0
132.6 – 136	1
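
With Q1 = Q3 = 1 the interquartile range is zero, which changes what the outlier alert means for this column. A minimal sketch of the 1.5 IQR fences and of the single vs. multi-country bucketing suggested in the treatment (function names are hypothetical):

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (low, high) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def bucket_countries(value):
    """Bucket a country count into 'single', 'multi', or 'missing'."""
    if value is None:
        return "missing"
    return "single" if value == 1 else "multi"


# Quartiles copied from the NbrCountries stats block above.
low, high = tukey_fences(q1=1, q3=1)

# With both fences at 1, every value above 1 counts as an outlier, so the
# outlier count doubles as the number of multi-country rows.
non_null = 7134 - 1266
multi_share = 1203 / non_null

buckets = [bucket_countries(v) for v in [1, 3, None, 136]]

print(low, high)
print(f"multi_share={multi_share:.3f}")
print(buckets)
```

Keeping "missing" as its own bucket is one way to flag the 17.75% nulls without imputing a count for them.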

JF categorical feature

Binary Y/N flag with only two distinct values across 7134 rows and a negligible null rate of 0.0003. The distribution is skewed toward 'N' at 71.6% (5105), leaving 'Y' at 2027 occurrences (28.4%). That Y share matches the ~28% reported for HasJesusFilm in the summary, so JF may duplicate that column. An entropy ratio of 0.86 indicates the split is imbalanced but still informative.

Treatment: Encode as a 0/1 indicator and impute the rare nulls with the mode.

anthropic:claude-opus-4-7 · confidence high

Out[84]:

saturn.columns["JF"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	2
top_value	N
top_rate	0.7158
cardinality	2
entropy	0.8611
entropy_ratio	0.8611

Fig 31.

Top values for JF.

Top values for JF (2 unique shown, of 2 total).
value	count	share
N	5105	71.6%
Y	2027	28.4%
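
Mode imputation followed by 0/1 encoding, as suggested in the treatment, can be sketched as follows. The function name and the sample values are hypothetical:

```python
from collections import Counter


def impute_mode_and_encode(values):
    """Fill None with the most common non-null value, then map Y -> 1, N -> 0."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot take the mode of an all-null column")
    mode = Counter(present).most_common(1)[0][0]
    filled = [mode if v is None else v for v in values]
    return [1 if v == "Y" else 0 for v in filled], mode


# Hypothetical sample, not real rows from the dataset.
sample = ["N", "Y", "N", None, "N", "Y"]
encoded, mode = impute_mode_and_encode(sample)

print(mode)
print(encoded)
```

With only 2 nulls in 7134 rows the choice of imputation barely affects the distribution; dropping the two rows would be an equally defensible option.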

AudioRecordings categorical feature

Binary Y/N flag indicating whether audio recordings exist for each row, with only 2 unique values across 7,134 records and a negligible null rate of 0.0003. The split leans moderately toward 'Y' at 59.1% (4,216) versus 'N' at 40.9% (2,916), giving a high entropy ratio (0.976) for a binary field. No surprising signals beyond the slight Y majority.

Treatment: Encode as a 0/1 boolean indicator after imputing the few nulls.

anthropic:claude-opus-4-7 · confidence high

Out[87]:

saturn.columns["AudioRecordings"].stats

stat	value
n	7,134
nulls	2 (0.0%)
unique	2
top_value	Y
top_rate	0.5911
cardinality	2
entropy	0.9759
entropy_ratio	0.9759

Fig 32.

Top values for AudioRecordings.

Top values for AudioRecordings (2 unique shown, of 2 total).
value	count	share
Y	4216	59.1%
N	2916	40.9%
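
The 59.1% Y share here is close to the ~59% Yes rate the summary reports for HasAudioRecordings, which suggests the two columns may carry the same information. A minimal sketch of a row-level agreement check; the rows are hypothetical, and the assumption that one column may use 'Yes'/'No' while the other uses 'Y'/'N' is a guess that the normaliser is meant to absorb:

```python
TRUTHY = {"y", "yes", "true", "1"}
FALSY = {"n", "no", "false", "0"}


def to_bool(value):
    """Normalise common yes/no spellings to True/False; unknowns become None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in TRUTHY:
        return True
    if text in FALSY:
        return False
    return None


def columns_agree(rows, left, right):
    """Return (agreeing_rows, total_rows) after normalising both columns."""
    agree = sum(1 for row in rows if to_bool(row.get(left)) == to_bool(row.get(right)))
    return agree, len(rows)


# Hypothetical rows, not taken from the dataset.
rows = [
    {"AudioRecordings": "Y", "HasAudioRecordings": "Yes"},
    {"AudioRecordings": "N", "HasAudioRecordings": "No"},
    {"AudioRecordings": "Y", "HasAudioRecordings": "No"},
]

agree, total = columns_agree(rows, "AudioRecordings", "HasAudioRecordings")
print(f"{agree} of {total} rows agree")
```

If the check shows full agreement on the real data, one of the two columns can be dropped before modelling; the same check applies to JF and HasJesusFilm.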


How to cite