saturn · joshua project languages

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_languages.json

Saturn profiled 7,134 rows across 26 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_languages.json",
    "--findings", "joshua-project-joshua_project_languages.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This is a Joshua Project languages dataset with 7,134 rows and 26 columns, profiling world languages alongside Bible translation status, audio/film resource availability, primary religion, and host-country distribution. The headline signal is religious-engagement coverage: PrimaryReligion is dominated by Christianity (3,328) followed by Ethnic Religions (1,472) and Islam (945), and JPScale skews toward the more-reached end with category 5 the largest bucket (2,050). Resource availability is uneven — HasAudioRecordings is roughly 59% Yes / 41% No, while HasJesusFilm is only ~28% Yes, suggesting the Jesus Film coverage gap is worth a closer look. Geographic concentration is also notable: HubCountry is led by Papua New Guinea (837), Indonesia (686), and Nigeria (494), together accounting for a large share of entries. Finally, NbrPGICs is extremely skewed (max 1,804, median 1) so any per-language counts should be inspected with that long tail in mind.

citing: PrimaryReligion · JPScale · HasAudioRecordings · HasJesusFilm · HubCountry · NbrPGICs · Status · BibleStatus

Out[4]:

saturn.schema() · 26 columns

column kind n null% unique alerts
ROL3 text 7,134 0.0% 7,134 near_unique one_word short_text
Language text 7,134 0.0% 7,124 near_unique one_word
WebLangText text 7,134 0.0% 7,134 near_unique one_word
Status categorical 7,134 0.0% 2
ROG3 categorical 7,134 0.0% 211
HubCountry categorical 7,134 0.0% 210
BibleStatus numeric 7,134 0.0% 6
GRN_URL text 7,134 41.4% 4,179 near_unique one_word url_heavy null_rate
TranslationNeedQuestionable unknown 7,134 0.0% skipped
BibleYear categorical 7,134 89.0% 488 long_tail null_rate
NTYear text 7,134 63.6% 1,109 one_word allcaps null_rate short_text duplicates
PortionsYear text 7,134 43.0% 1,797 one_word allcaps null_rate short_text duplicates
PercentAdherents numeric 7,134 11.9% 1,349
PercentEvangelical numeric 7,134 17.3% 1,006
HasJesusFilm categorical 7,134 0.0% 2
JF_URL text 7,134 71.6% 2,008 near_unique one_word url_heavy null_rate
HasAudioRecordings categorical 7,134 0.0% 2
JPScale categorical 7,134 11.9% 5
LeastReached categorical 7,134 0.0% 2
RLG3 numeric 7,134 10.9% 8
PrimaryReligion categorical 7,134 0.0% 9
FCBH_URL text 7,134 68.0% 2,272 near_unique one_word url_heavy null_rate
NbrPGICs numeric 7,134 10.8% 155 high_skew outliers
NbrCountries numeric 7,134 17.7% 43 high_skew outliers
JF categorical 7,134 0.0% 2
AudioRecordings categorical 7,134 0.0% 2
Fig 1.
PrimaryReligion · Christianity dominates at ~47%, but Ethnic Religions and Islam together cover roughly a third of languages.
Top values for PrimaryReligion (9 unique shown, of 9 total).
value count share
Christianity 3328 46.6%
Ethnic Religions 1472 20.6%
Islam 945 13.2%
(empty string) 774 10.8%
Hinduism 268 3.8%
Buddhism 192 2.7%
Unknown 110 1.5%
Other / Small 25 0.4%
Non-Religious 18 0.3%
Fig 2.
JPScale · Distribution across the 5-point Joshua Project progress scale, skewed toward the more-reached categories 4 and 5.
Top values for JPScale (5 unique shown, of 5 total).
value count share
5 2050 28.7%
4 1900 26.6%
1 1473 20.6%
3 455 6.4%
2 409 5.7%
Fig 3.
HasJesusFilm · Only about 28% of languages have a Jesus Film available — a clear coverage gap.
Top values for HasJesusFilm (2 unique shown, of 2 total).
value count share
N 5105 71.6%
Y 2027 28.4%
Fig 4.
HubCountry · Top hub countries (Papua New Guinea, Indonesia, Nigeria) concentrate a large share of the languages tracked.
Top values for HubCountry (20 unique shown, of 210 total).
value count share
Papua New Guinea 837 11.7%
Indonesia 686 9.6%
Nigeria 494 6.9%
India 383 5.4%
Mexico 277 3.9%
China 256 3.6%
Cameroon 234 3.3%
Australia 192 2.7%
Congo, Democratic Republic of 182 2.6%
United States 179 2.5%
Philippines 175 2.5%
Brazil 175 2.5%
Tanzania 110 1.5%
Vanuatu 108 1.5%
Chad 104 1.5%
Nepal 100 1.4%
Malaysia 99 1.4%
Myanmar (Burma) 94 1.3%
Russia 91 1.3%
Peru 82 1.1%
Fig 5.
BibleStatus · Bible translation status spread from 0 to 5 — note the ~15% at zero indicating no scripture available.
Histogram bins for BibleStatus (median: 3.0). Values are integers, so only the six bins listed are non-empty; the other 34 of the 40 bins have count 0.
bin count
0 – 0.125 1064
1 – 1.125 488
2 – 2.125 1514
3 – 3.125 1470
4 – 4.125 1812
4.875 – 5 784
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
Per-column null rate across the corpus.
column kind null %
ROL3 text 0.0%
Language text 0.0%
WebLangText text 0.0%
Status categorical 0.0%
ROG3 categorical 0.0%
HubCountry categorical 0.0%
BibleStatus numeric 0.0%
GRN_URL text 41.4%
TranslationNeedQuestionable unknown 0.0%
BibleYear categorical 89.0%
NTYear text 63.6%
PortionsYear text 43.0%
PercentAdherents numeric 11.9%
PercentEvangelical numeric 17.3%
HasJesusFilm categorical 0.0%
JF_URL text 71.6%
HasAudioRecordings categorical 0.0%
JPScale categorical 11.9%
LeastReached categorical 0.0%
RLG3 numeric 10.9%
PrimaryReligion categorical 0.0%
FCBH_URL text 68.0%
NbrPGICs numeric 10.8%
NbrCountries numeric 17.7%
JF categorical 0.0%
AudioRecordings categorical 0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 6 numeric columns (values clipped to 2 decimals).
column BibleStatus PercentAdherents PercentEvangelical RLG3 NbrPGICs NbrCountries
BibleStatus +1.00 -0.03 +0.01 +0.01 +0.02 -0.00
PercentAdherents -0.03 +1.00 +0.05 -0.01 -0.01 -0.07
PercentEvangelical +0.01 +0.05 +1.00 -0.04 +0.02 -0.07
RLG3 +0.01 -0.01 -0.04 +1.00 -0.02 +0.13
NbrPGICs +0.02 -0.01 +0.02 -0.02 +1.00 +0.07
NbrCountries -0.00 -0.07 -0.07 +0.13 +0.07 +1.00

ROL3 text identifier

ROL3 is a text column of exactly 7134 unique three-character single-word tokens across 7134 rows, with zero nulls and zero duplicates. The perfect 1:1 cardinality (vocab_size 7134 == n) and uniform len_min/max/mean of 3 strongly suggest this is a row-level identifier or code rather than natural language. Top tokens like 'aou', 'aiw', 'aas' show no repeated values, confirming it carries no distributional signal on its own.

Treatment: Drop from modelling or use only as a join key; near-unique three-letter codes carry no predictive signal.

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["ROL3"].stats

stat value
n 7,134
nulls 0 (0.0%)
unique 7,134
len_min 3
len_max 3
len_mean 3
len_median 3
len_p95 3
word_mean 1
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 7,134
readability_flesch_mean 120.4
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: short_text 95th-percentile length under 20 chars
Fig 8.
Character-length distribution for ROL3.
Character-length distribution for ROL3 (mean: 3.0). Every value is 3 characters long, so a single bin holds all rows; the other 39 bins have count 0.
chars count
3 7134

Language text identifier

This column holds language names, with 7124 distinct values across 7134 rows and only 10 duplicates — essentially one row per language. Entries are short (mean 9.1 chars, median 1 word) and 73.5% are single-word labels, though compound names involving directional qualifiers (southern, northern, eastern, western) and family roots (zapotec, mixtec, naga) appear often. The high cardinality combined with the 'language' and 'sign' top tokens suggests this is a catalog of world languages, likely including sign-language variants.

Treatment: Treat as a near-unique label rather than a key: the 10 duplicated names would fan out a join, so join on ROL3 instead and do not one-hot encode.

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["Language"].stats

stat value
n 7,134
nulls 0 (0.0%)
unique 7,124
len_min 1
len_max 45
len_mean 9.102
len_median 7
len_p95 22
word_mean 1.363
word_median 1
n_empty 0
n_duplicates 10
duplicate_rate 0.001402
vocab_size 7,180
readability_flesch_mean 52.16
emoji_rate 0
url_rate 0
one_word_rate 0.7347
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 99.9% of rows are unique strings
alert: one_word 73.5% rows are a single word
Fig 9.
Character-length distribution for Language.
Character-length distribution for Language (mean: 9.102).
chars count
1 – 2 27
2 – 3 210
3 – 4 762
4 – 5 1116
5 – 6 1097
6 – 8 841
8 – 9 531
9 – 10 346
10 – 11 225
11 – 12 215
12 – 13 383
13 – 14 204
14 – 15 175
15 – 16 193
16 – 18 124
18 – 19 90
19 – 20 78
20 – 21 74
21 – 22 69
22 – 23 73
23 – 24 123
24 – 25 31
25 – 26 30
26 – 27 24
27 – 29 17
29 – 30 13
30 – 31 6
31 – 32 19
32 – 33 10
33 – 34 7
34 – 35 2
35 – 36 1
36 – 37 1
37 – 38 2
38 – 40 8
40 – 41 3
41 – 42 3
42 – 43 0
43 – 44 0
44 – 45 1

WebLangText text identifier

WebLangText appears to be a per-row language name label, with every one of the 7134 values unique and 73% being a single word (mean 1.37 words, median length 7 chars). Top tokens like 'language', 'sign', 'zapotec', 'mixtec', and 'naga' suggest this is an inventory of world languages including sign languages and regional variants (Southern/Northern/Eastern/Western). The full uniqueness (n_unique == n) means it functions as an identifier rather than a categorical feature.

Treatment: Treat as a language-name key; left-join on this rather than using as a model feature.

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["WebLangText"].stats

stat value
n 7,134
nulls 0 (0.0%)
unique 7,134
len_min 1
len_max 45
len_mean 9.119
len_median 7
len_p95 22
word_mean 1.366
word_median 1
n_empty 0
n_duplicates 0
duplicate_rate 0
vocab_size 7,200
readability_flesch_mean 52.37
emoji_rate 0
url_rate 0
one_word_rate 0.7318
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 73.2% rows are a single word
Fig 10.
Character-length distribution for WebLangText.
Character-length distribution for WebLangText (mean: 9.119).
chars count
1 – 2 27
2 – 3 208
3 – 4 748
4 – 5 1114
5 – 6 1095
6 – 8 841
8 – 9 531
9 – 10 348
10 – 11 239
11 – 12 217
12 – 13 385
13 – 14 204
14 – 15 175
15 – 16 193
16 – 18 124
18 – 19 90
19 – 20 78
20 – 21 74
21 – 22 69
22 – 23 73
23 – 24 123
24 – 25 31
25 – 26 30
26 – 27 24
27 – 29 17
29 – 30 13
30 – 31 6
31 – 32 19
32 – 33 10
33 – 34 7
34 – 35 2
35 – 36 1
36 – 37 1
37 – 38 2
38 – 40 8
40 – 41 3
41 – 42 3
42 – 43 0
43 – 44 0
44 – 45 1

Status categorical label

Binary status flag with two values, 'L' and 'N', dominated by 'L' at 86.0% (6,134 of the 7,132 non-null rows) versus 998 for 'N'. The class imbalance is notable, and there are 2 nulls (null_rate 0.0003). An entropy ratio of 0.58 confirms the skewed distribution.

Treatment: Encode as binary; address class imbalance (e.g., stratified sampling or class weights) before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["Status"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 2
top_value L
top_rate 0.8601
cardinality 2
entropy 0.5841
entropy_ratio 0.5841
Fig 11.
Top values for Status.
Top values for Status (2 unique shown, of 2 total).
value count share
L 6134 86.0%
N 998 14.0%
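
The entropy figures in the stats block can be reproduced from the two counts in the Status table. This is a minimal sketch, assuming saturn's `entropy` is Shannon entropy in bits over non-null values and that `entropy_ratio` divides it by log2 of the cardinality. The report does not define either stat, so both are assumptions, although they reproduce the reported values.

```python
import math

def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return shannon_entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Counts from the Status table: L = 6134, N = 998 (the 2 nulls are excluded).
status_counts = [6134, 998]
status_entropy = shannon_entropy_bits(status_counts)
status_ratio = entropy_ratio(status_counts)

# Cross-check on a higher-cardinality column: ROG3 reports entropy 5.642
# over 211 categories, so the ratio should be 5.642 / log2(211).
rog3_ratio = 5.642 / math.log2(211)

print(round(status_entropy, 4), round(status_ratio, 4), round(rog3_ratio, 4))
```

For a two-value column the maximum entropy is 1 bit, which is why `entropy` and `entropy_ratio` coincide for Status.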

ROG3 categorical feature

ROG3 is a categorical code field with 211 distinct two-letter values, led by 'PP' at 11.7% (837 rows), then 'ID', 'NI', 'IN', 'MX'. The top-20 counts track HubCountry almost one-for-one (PP 837 = Papua New Guinea, NI 494 = Nigeria, CH 256 = China), so these are country codes. Entropy is 5.64 (ratio 0.73), indicating a broad spread across the 211 categories rather than concentration in a few, and nulls are negligible (0.03%). The codes are not ISO 3166 alpha-2: 'CH' lines up with China rather than Switzerland, and 'PP', 'NI', 'RP', and 'AS' match the FIPS 10-4 codes for Papua New Guinea, Nigeria, the Philippines, and Australia.

Treatment: Target-encode or group rare levels before modelling; confirm the code standard (the values look like FIPS 10-4, not ISO 3166) before joining to external country tables.

anthropic:claude-opus-4-7 · confidence medium
Out[25]:

saturn.columns["ROG3"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 211
top_value PP
top_rate 0.1174
cardinality 211
entropy 5.642
entropy_ratio 0.7307
Fig 12.
Top values for ROG3.
Top values for ROG3 (20 unique shown, of 211 total).
value count share
PP 837 11.7%
ID 686 9.6%
NI 494 6.9%
IN 383 5.4%
MX 277 3.9%
CH 256 3.6%
CM 234 3.3%
AS 192 2.7%
CG 182 2.6%
US 179 2.5%
RP 175 2.5%
BR 175 2.5%
TZ 110 1.5%
NH 108 1.5%
CD 104 1.5%
NP 100 1.4%
MY 98 1.4%
BM 94 1.3%
RS 91 1.3%
PE 82 1.1%

HubCountry categorical feature

HubCountry is a categorical country-name field with 210 distinct values across 7,134 rows and a near-zero null rate (0.0003). The distribution is broad rather than concentrated: entropy ratio 0.73, with the top value Papua New Guinea covering only 11.7% of records, followed by Indonesia (686) and Nigeria (494). The leading countries are ones with very high linguistic diversity rather than the largest economies, which is worth noting before any geographic modelling.

Treatment: Group long-tail countries into regions or frequency buckets before one-hot or target encoding.

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["HubCountry"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 210
top_value Papua New Guinea
top_rate 0.1174
cardinality 210
entropy 5.641
entropy_ratio 0.7312
Fig 13.
Top values for HubCountry.
Top values for HubCountry (20 unique shown, of 210 total).
value count share
Papua New Guinea 837 11.7%
Indonesia 686 9.6%
Nigeria 494 6.9%
India 383 5.4%
Mexico 277 3.9%
China 256 3.6%
Cameroon 234 3.3%
Australia 192 2.7%
Congo, Democratic Republic of 182 2.6%
United States 179 2.5%
Philippines 175 2.5%
Brazil 175 2.5%
Tanzania 110 1.5%
Vanuatu 108 1.5%
Chad 104 1.5%
Nepal 100 1.4%
Malaysia 99 1.4%
Myanmar (Burma) 94 1.3%
Russia 91 1.3%
Peru 82 1.1%
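
The frequency-bucket treatment suggested for HubCountry could look like the sketch below. The threshold `min_count=100` and the label `"Other"` are hypothetical choices for illustration, not values the report recommends, and the toy column uses only three counts from the table.

```python
from collections import Counter

def bucket_rare_levels(values, min_count=100, other_label="Other"):
    """Replace levels seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy column built from three counts in the HubCountry table.
sample = ["Papua New Guinea"] * 837 + ["Nepal"] * 100 + ["Peru"] * 82
bucketed = Counter(bucket_rare_levels(sample))
print(bucketed)
```

With 210 countries and most of them far below 100 rows, a threshold like this collapses the long tail into one level; grouping by region instead would keep more geographic signal but needs an external lookup.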

BibleStatus numeric feature

BibleStatus is an integer-coded categorical with 6 distinct values from 0 to 5, mean 2.68 and median 3, almost certainly an ordinal status/level code rather than a true numeric measure. About 14.9% of rows are zero and the distribution is mildly left-skewed (skew -0.34) with flat kurtosis (-0.91), suggesting a fairly even spread across the upper levels with a sizable zero/'none' bucket. Null rate is negligible (0.0003) and no outliers were flagged.

Treatment: Treat as an ordinal categorical (one-hot or ordered encoding) rather than a continuous numeric.

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["BibleStatus"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 6
min 0
max 5
mean 2.677
median 3
std 1.555
q1 2
q3 4
iqr 2
skew -0.3401
kurtosis -0.9086
n_outliers 0
outlier_rate 0
zero_rate 0.1492
Fig 14.
Distribution of BibleStatus. Vertical dash marks the median.
Histogram bins for BibleStatus (median: 3.0). Values are integers, so only the six bins listed are non-empty; the other 34 of the 40 bins have count 0.
bin count
0 – 0.125 1064
1 – 1.125 488
2 – 2.125 1514
3 – 3.125 1470
4 – 4.125 1812
4.875 – 5 784

GRN_URL text identifier

This column holds Global Recordings Network language URLs, every value a fixed 44-character single token under https://globalrecordings.net/en/language/ followed by a language code. With 4179 unique values among the 4,180 non-null rows and a 41.41% null rate, it functions as a per-language identifier link rather than a feature. There is exactly one duplicate URL (idt appears twice), and 41% of rows have no GRN link at all.

Treatment: Extract the trailing language code as a foreign key; otherwise drop the URL itself from modelling.

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["GRN_URL"].stats

stat value
n 7,134
nulls 2,954 (41.4%)
unique 4,179
len_min 44
len_max 44
len_mean 44
len_median 44
len_p95 44
word_mean 1
word_median 1
n_empty 0
n_duplicates 1
duplicate_rate 0.0002392
vocab_size 4,179
readability_flesch_mean -435
emoji_rate 0
url_rate 1
one_word_rate 1
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 100.0% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: url_heavy 100.0% rows contain a URL
alert: null_rate 41.4% null
Fig 15.
Character-length distribution for GRN_URL.
Character-length distribution for GRN_URL (mean: 44.0). Every non-null value is 44 characters long, so a single bin holds all rows; the other 39 bins have count 0.
chars count
44 4180
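
Extracting the trailing language code from GRN_URL, as the treatment suggests, can be sketched with the standard library. The function name `grn_language_code` is illustrative, and the example URL is assembled from the prefix and the `idt` code mentioned in the column summary rather than copied from a raw row.

```python
from urllib.parse import urlparse

GRN_PREFIX = "/en/language/"

def grn_language_code(url):
    """Return the trailing path segment of a GRN language URL, or None."""
    if not url:
        return None
    path = urlparse(url).path
    if not path.startswith(GRN_PREFIX):
        return None  # not a GRN language page; leave for manual review
    code = path[len(GRN_PREFIX):].strip("/")
    return code or None

example_url = "https://globalrecordings.net/en/language/idt"
example_code = grn_language_code(example_url)
print(len(example_url), example_code)
```

Returning None for nulls and for URLs outside the expected path keeps the 41.4% missing rows distinguishable from malformed links.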

TranslationNeedQuestionable unknown other

The column 'TranslationNeedQuestionable' was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 7134 and a null rate of 0.0. The name suggests a boolean or flag indicating whether the need for translation is in doubt, but this cannot be confirmed from the evidence. No distribution, unique count, or sample values were captured.

Treatment: Re-profile with type inference enabled before deciding on downstream use.

anthropic:claude-opus-4-7 · confidence low
Out[37]:

saturn.columns["TranslationNeedQuestionable"].stats

stat value
n 7,134
nulls 0 (0.0%)
unique n/a
alert: skipped no profiler for kind=unknown

BibleYear categorical free_text

Free-text field nominally capturing a year associated with a Bible (likely year acquired or published), but heavily polluted: 89.01% of 7134 rows are null, and among the 488 unique values the most common entry is "2023" at just 3.4% of non-null rows (27 rows), with "Yes" the second most frequent value (22 rows), indicating the field also absorbed yes/no answers. Year ranges such as "2014-2015" appear as well. An entropy ratio of 0.93 confirms a long, flat tail with no dominant year.

Treatment: Clean by coercing to integer years, deciding how ranges such as "2014-2015" should map to a single year, and routing non-numeric responses (e.g., "Yes") to a separate flag before use.

anthropic:claude-opus-4-7 · confidence high
Out[39]:

saturn.columns["BibleYear"].stats

stat value
n 7,134
nulls 6,350 (89.0%)
unique 488
top_value 2023
top_rate 0.03444
cardinality 488
entropy 8.296
entropy_ratio 0.9289
alert: long_tail 411 singleton categories
alert: null_rate 89.0% null
Fig 16.
Top values for BibleYear.
Top values for BibleYear (20 unique shown, of 488 total).
value count share
2023 27 0.4%
Yes 22 0.3%
2019 15 0.2%
2016 14 0.2%
2022 14 0.2%
2014 14 0.2%
2013 11 0.2%
2009 11 0.2%
2018 10 0.1%
2021 10 0.1%
2010 10 0.1%
2024 9 0.1%
2011 9 0.1%
2015 8 0.1%
2002 8 0.1%
2014-2015 8 0.1%
2013-2014 7 0.1%
2020 6 0.1%
2008 6 0.1%
1999 6 0.1%
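
The BibleYear table shows three value shapes: single years, a bare "Yes", and ranges like "2014-2015". A minimal sketch of separating them into a year and a flag follows. The function name `parse_bible_year` is illustrative, and keeping the end year of a range is an assumption made for this sketch, since the report does not say which year a range should map to.

```python
import re

YEAR_RE = re.compile(r"^\d{4}$")
RANGE_RE = re.compile(r"^(\d{4})-(\d{4})$")

def parse_bible_year(raw):
    """Split a raw BibleYear value into (year, flag).

    year is an int or None; flag is True only for a bare "Yes".
    For a range such as "2014-2015" the end year is kept (an assumption).
    """
    if raw is None:
        return None, False
    text = str(raw).strip()
    if text.lower() == "yes":
        return None, True
    if YEAR_RE.match(text):
        return int(text), False
    match = RANGE_RE.match(text)
    if match:
        return int(match.group(2)), False
    return None, False  # unrecognised text: leave for manual review

parsed = [parse_bible_year(v) for v in ["2023", "Yes", "2014-2015", None]]
print(parsed)
```

The same split would apply to NTYear and PortionsYear, which also mix "Yes" with year values, once their longer values have been inspected.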

NTYear text feature

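Despite the name suggesting a year, NTYear is a single-token text column with mixed semantics: the most frequent value is 'Yes' (147 rows), followed by four-digit years from 2016-2024. The length histogram shows only three shapes: 147 three-character values, 1,021 four-character values, and 1,428 nine-character values; the last group is probably year ranges like the '2014-2015' entries seen in BibleYear, though the profile does not confirm this. The column is 63.61% null, and 57.28% of the non-null values are duplicates across only 1109 unique values, with a one_word_rate of 1.0 and an allcaps_rate of 0.94. The mix of a yes/no token alongside year values suggests two questions were collapsed into one field.

Treatment: Split into two columns (boolean indicator vs. numeric year) before any modelling, and decide how the nine-character values should map to a year.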

anthropic:claude-opus-4-7 · confidence high
Out[42]:

saturn.columns["NTYear"].stats

stat value
n 7,134
nulls 4,538 (63.6%)
unique 1,109
len_min 3
len_max 9
len_mean 6.694
len_median 9
len_p95 9
word_mean 1
word_median 1
n_empty 0
n_duplicates 1,487
duplicate_rate 0.5728
vocab_size 1,109
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.9434
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 94.3% rows are all-caps
alert: null_rate 63.6% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 57.3% duplicate strings
Fig 17.
Character-length distribution for NTYear.
Character-length distribution for NTYear (mean: 6.694). Only three lengths occur; the other 37 of the 40 bins have count 0.
chars count
3 147
4 1021
9 1428

PortionsYear text feature

Despite the name PortionsYear, this column mixes a yes/no flag with four-digit years: every entry is a single word, and the most common value is 'Yes' (706 occurrences) followed by years like 2024 (107), 2022 (55) and 2025 (43). By length, 706 values have three characters, 1,290 have four, and 2,070 have nine, so about half of the non-null values are longer than a single year. 43% of rows are null, and 55.8% of the non-null values are duplicates across only 1,797 unique tokens, with 82.6% in all-caps. The semantic mix of a boolean and a year in one field is the headline anomaly.

Treatment: Split into two columns, a boolean 'has portions' flag and a parsed year, before modelling, and decide how the nine-character values should map to a year.

anthropic:claude-opus-4-7 · confidence high
Out[45]:

saturn.columns["PortionsYear"].stats

stat value
n 7,134
nulls 3,068 (43.0%)
unique 1,797
len_min 3
len_max 9
len_mean 6.372
len_median 9
len_p95 9
word_mean 1
word_median 1
n_empty 0
n_duplicates 2,269
duplicate_rate 0.558
vocab_size 1,797
readability_flesch_mean 121.2
emoji_rate 0
url_rate 0
one_word_rate 1
allcaps_rate 0.8264
boilerplate_rate 0
alert: one_word 100.0% rows are a single word
alert: allcaps 82.6% rows are all-caps
alert: null_rate 43.0% null
alert: short_text 95th-percentile length under 20 chars
alert: duplicates 55.8% duplicate strings
Fig 18.
Character-length distribution for PortionsYear.
Character-length distribution for PortionsYear (mean: 6.372). Only three lengths occur; the other 37 of the 40 bins have count 0.
chars count
3 706
4 1290
9 2070

PercentAdherents numeric feature

PercentAdherents is a numeric share variable bounded between 0 and 100, almost certainly the percentage of some population that adheres to a religion or group. The distribution is U-shaped rather than centrally peaked: the IQR spans 5.34 to 90.0 with a median of 58.33, kurtosis is -1.68, and 7.3% of values are exactly zero. Nearly 12% of rows are null, which is worth flagging before any aggregation.

Treatment: Impute or filter the 11.87% nulls and consider binning given the U-shaped distribution before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[48]:

saturn.columns["PercentAdherents"].stats

stat value
n 7,134
nulls 847 (11.9%)
unique 1,349
min 0
max 100
mean 49.63
median 58.33
std 38.83
q1 5.34
q3 90
iqr 84.66
skew -0.08408
kurtosis -1.679
n_outliers 0
outlier_rate 0
zero_rate 0.07285
Fig 19.
Distribution of PercentAdherents. Vertical dash marks the median.
Histogram bins for PercentAdherents (median: 58.33).
bin count
0 – 2.5 1282
2.5 – 5 205
5 – 7.5 203
7.5 – 10 112
10 – 12.5 191
12.5 – 15 53
15 – 17.5 119
17.5 – 20 42
20 – 22.5 129
22.5 – 25 25
25 – 27.5 102
27.5 – 30 25
30 – 32.5 110
32.5 – 35 29
35 – 37.5 66
37.5 – 40 22
40 – 42.5 145
42.5 – 45 18
45 – 47.5 80
47.5 – 50 20
50 – 52.5 55
52.5 – 55 19
55 – 57.5 83
57.5 – 60 24
60 – 62.5 181
62.5 – 65 31
65 – 67.5 171
67.5 – 70 48
70 – 72.5 220
72.5 – 75 70
75 – 77.5 116
77.5 – 80 43
80 – 82.5 164
82.5 – 85 56
85 – 87.5 167
87.5 – 90 88
90 – 92.5 435
92.5 – 95 155
95 – 97.5 785
97.5 – 100 398

PercentEvangelical numeric feature

This appears to be a percentage feature capturing the share of evangelicals in each record's population, ranging from 0 to 95 with a median of just 5. The distribution is heavily right-skewed (skew 1.93, kurtosis 4.92) and 9% of values are exact zeros, with 251 outliers (4.3%) pulling the mean up to 10.1. Note also that 17.3% of rows are null, which is substantial.

Treatment: Impute or flag the 17% nulls and consider a log1p transform before modelling to tame the right skew.

anthropic:claude-opus-4-7 · confidence high
Out[51]:

saturn.columns["PercentEvangelical"].stats

stat value
n 7,134
nulls 1,234 (17.3%)
unique 1,006
min 0
max 95
mean 10.13
median 5
std 12.24
q1 1
q3 15.55
iqr 14.55
skew 1.932
kurtosis 4.925
n_outliers 251
outlier_rate 0.04254
zero_rate 0.09
Fig 20.
Distribution of PercentEvangelical. Vertical dash marks the median.
Histogram bins for PercentEvangelical (median: 5.0).
bin count
0 – 2.375 1985
2.375 – 4.75 770
4.75 – 7.125 674
7.125 – 9.5 295
9.5 – 11.88 252
11.88 – 14.25 311
14.25 – 16.62 225
16.62 – 19 170
19 – 21.38 295
21.38 – 23.75 150
23.75 – 26.12 223
26.12 – 28.5 82
28.5 – 30.88 94
30.88 – 33.25 67
33.25 – 35.62 40
35.62 – 38 17
38 – 40.38 37
40.38 – 42.75 21
42.75 – 45.12 73
45.12 – 47.5 31
47.5 – 49.88 19
49.88 – 52.25 16
52.25 – 54.62 3
54.62 – 57 7
57 – 59.38 0
59.38 – 61.75 16
61.75 – 64.12 3
64.12 – 66.5 2
66.5 – 68.88 0
68.88 – 71.25 5
71.25 – 73.62 2
73.62 – 76 4
76 – 78.38 3
78.38 – 80.75 3
80.75 – 83.12 0
83.12 – 85.5 1
85.5 – 87.88 2
87.88 – 90.25 1
90.25 – 92.62 0
92.62 – 95 1
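
The log1p transform and null flag suggested for PercentEvangelical can be sketched with the standard library alone. The input list reuses the min, q1, median, q3, and max from the stats block plus one null; it is illustrative, not a sample of real rows, and the helper name is hypothetical.

```python
import math

def log1p_or_none(value):
    """log(1 + x) for a non-null percentage; None stays None."""
    return None if value is None else math.log1p(value)

# min, q1, median, q3, max from the stats block, plus one null.
raw = [0, 1, 5, 15.55, 95, None]
transformed = [log1p_or_none(v) for v in raw]
missing_flag = [v is None for v in raw]
print(transformed, missing_flag)
```

log1p is used instead of a plain log because 9% of the values are exactly zero, and log1p(0) is 0 where log(0) is undefined.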

HasJesusFilm categorical feature

Binary Y/N flag indicating whether each record has an associated Jesus Film, with only 2 unique values across 7134 rows and a negligible 0.0003 null rate. The distribution is moderately imbalanced: 'N' dominates at 71.6% (5105) versus 2027 'Y' values, yielding an entropy ratio of 0.86.

Treatment: Encode as a 0/1 boolean indicator for modelling.

anthropic:claude-opus-4-7 · confidence high
Out[54]:

saturn.columns["HasJesusFilm"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 2
top_value N
top_rate 0.7158
cardinality 2
entropy 0.8611
entropy_ratio 0.8611
Fig 21.
Top values for HasJesusFilm.
Top values for HasJesusFilm (2 unique shown, of 2 total).
value count share
N 5105 71.6%
Y 2027 28.4%

JF_URL text metadata

This column holds JesusFilm.org URLs (url_rate 1.0, one_word_rate 0.9995), almost all pointing to language-specific watch pages like /watch/jesus.html/{language}.html. Coverage is sparse with a 71.6% null rate, and values are near-unique (2,008 distinct among the 2,027 non-null rows), though a generic partners/resources page appears 13 times and 19 duplicates exist overall. URL lengths are tight (45-87 chars, median 55), consistent with a templated link rather than free text.

Treatment: Treat as a reference link; extract the language slug from the path if you need a feature, otherwise drop from modelling.

anthropic:claude-opus-4-7 · confidence high
Out[57]:

saturn.columns["JF_URL"].stats

stat value
n 7,134
nulls 5,107 (71.6%)
unique 2,008
len_min 45
len_max 87
len_mean 56.77
len_median 55
len_p95 69
word_mean 1
word_median 1
n_empty 0
n_duplicates 19
duplicate_rate 0.009373
vocab_size 2,009
readability_flesch_mean -781.9
emoji_rate 0
url_rate 1
one_word_rate 0.9995
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 99.1% of rows are unique strings
alert: one_word 100.0% rows are a single word
alert: url_heavy 100.0% rows contain a URL
alert: null_rate 71.6% null
Fig 22.
Character-length distribution for JF_URL.
Character-length distribution for JF_URL (mean: 56.77).
chars count
45 – 46 13
46 – 47 0
47 – 48 0
48 – 49 0
49 – 50 5
50 – 51 58
51 – 52 221
52 – 53 324
53 – 54 299
54 – 56 237
56 – 57 139
57 – 58 96
58 – 59 72
59 – 60 69
60 – 61 63
61 – 62 85
62 – 63 57
63 – 64 72
64 – 65 42
65 – 66 23
66 – 67 29
67 – 68 17
68 – 69 18
69 – 70 22
70 – 71 15
71 – 72 20
72 – 73 8
73 – 74 9
74 – 75 3
75 – 76 1
76 – 78 0
78 – 79 3
79 – 80 3
80 – 81 0
81 – 82 2
82 – 83 0
83 – 84 1
84 – 85 0
85 – 86 0
86 – 87 1

HasAudioRecordings categorical feature

Binary Y/N flag indicating whether a record has associated audio recordings. The split is 59.1% Y (4,216) vs 40.9% N (2,916), with an entropy ratio of 0.976 showing near-maximal balance for a two-class field. Null rate is negligible (0.0003).

Treatment: Encode as a boolean indicator before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[60]:

saturn.columns["HasAudioRecordings"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 2
top_value Y
top_rate 0.5911
cardinality 2
entropy 0.9759
entropy_ratio 0.9759
Fig 23.
Top values for HasAudioRecordings.
Top values for HasAudioRecordings (2 unique shown, of 2 total).
value count share
Y 4216 59.1%
N 2916 40.9%

JPScale categorical feature

JPScale is a low-cardinality categorical with 5 distinct values ('1' through '5'), suggesting an ordinal rating or scale. The distribution has two peaks: '5' (28.7% of all rows, 32.6% of non-null rows) and '4' (26.6% of all rows) dominate alongside '1' (20.6%), while the middle values '3' and '2' are comparatively rare. An entropy ratio of 0.89 indicates a fairly even spread across categories, but 11.87% of rows are null.

Treatment: Treat as ordinal (1–5); impute or flag the ~12% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[63]:

saturn.columns["JPScale"].stats

stat value
n 7,134
nulls 847 (11.9%)
unique 5
top_value 5
top_rate 0.3261
cardinality 5
entropy 2.07
entropy_ratio 0.8915
Fig 24.
Top values for JPScale.
Top values for JPScale (5 unique shown, of 5 total).
value count share
5 2050 28.7%
4 1900 26.6%
1 1473 20.6%
3 455 6.4%
2 409 5.7%
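
The JPScale table reports shares of all 7,134 rows, while `top_rate` (0.3261) matches a share of the 6,287 non-null rows. The sketch below recomputes both from the counts in the table, assuming that `top_rate` uses the non-null denominator; the report does not state the denominator, but the arithmetic is consistent with it.

```python
# Counts from the JPScale table, plus row and null totals from the stats block.
jpscale_counts = {"5": 2050, "4": 1900, "1": 1473, "3": 455, "2": 409}
n_rows = 7134
n_nulls = 847
n_non_null = n_rows - n_nulls

def share(count, denominator):
    """Percentage share rounded to one decimal place."""
    return round(100 * count / denominator, 1)

share_of_all = {k: share(v, n_rows) for k, v in jpscale_counts.items()}
share_of_non_null = {k: share(v, n_non_null) for k, v in jpscale_counts.items()}
print(share_of_all)
print(share_of_non_null)
```

When a column has a double-digit null rate, quoting a share without its denominator makes figures from the table and the stats block look inconsistent even though both are correct.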

LeastReached categorical feature

Binary Y/N flag indicating whether some 'least reached' status applies, with N dominating at 79.3% (5659) versus 1473 Y values across 7134 rows. Class imbalance is notable but not extreme, and nulls are negligible (0.03%). Cardinality is exactly 2 with entropy ratio 0.73, consistent with a clean boolean indicator.

Treatment: Encode as boolean (Y=1, N=0) and account for the ~80/20 imbalance if used as a target.

anthropic:claude-opus-4-7 · confidence high
Out[66]:

saturn.columns["LeastReached"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 2
top_value N
top_rate 0.7935
cardinality 2
entropy 0.7348
entropy_ratio 0.7348
Fig 25.
Top values for LeastReached.
Top values for LeastReached (2 unique shown, of 2 total).
value count share
N 5659 79.3%
Y 1473 20.6%
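
If LeastReached is used as a target, one common way to account for the roughly 80/20 imbalance is inverse-frequency class weights. The sketch below uses the "balanced" heuristic, n_samples / (n_classes * class_count), with the counts from the table; the choice of heuristic is an assumption for illustration, not something the report prescribes.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Counts from the LeastReached table (the 2 nulls are excluded).
least_reached_counts = {"N": 5659, "Y": 1473}
weights = balanced_class_weights(least_reached_counts)
print(weights)
```

Under this weighting each class contributes the same total weight, so the minority 'Y' class is not drowned out during training.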

RLG3 numeric feature

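RLG3 is a small-cardinality integer code (only 8 unique values across 7134 rows, ranging 1-9 with no zeros and no value 3). Its per-value counts (3328, 1472, 945, 268, 192, 110, 25, 18) are identical to the PrimaryReligion category counts, so it is most likely the numeric code behind that label: a nominal category code rather than an ordinal scale or a continuous measure. Summary statistics such as the mean (2.82) and skew (0.74) therefore carry no substantive meaning, and the 110 flagged outliers (1.7%) coincide with the 110 rows at code 9. The 10.88% null rate (776 rows) matches PrimaryReligion's 774 empty strings plus 2 nulls and should be addressed before modelling.

Treatment: Treat as a nominal categorical (not ordered), confirm the code-to-religion mapping against PrimaryReligion, and impute or flag the ~11% missing values before modelling.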

anthropic:claude-opus-4-7 · confidence high
Out[69]:

saturn.columns["RLG3"].stats

statvalue
n7,134
nulls776 (10.9%)
unique8
min 1
max 9
mean 2.819
median 1
std 2.144
q1 1
q3 4
iqr 3
skew 0.7378
kurtosis -0.5707
n_outliers 110
outlier_rate 0.0173
zero_rate 0
Fig 26.
Distribution of RLG3. Vertical dash marks the median.
Show data table
Histogram bins for RLG3 (median: 1.0).
bin count
1 – 1.2 3328
1.2 – 1.4 0
1.4 – 1.6 0
1.6 – 1.8 0
1.8 – 2 0
2 – 2.2 192
2.2 – 2.4 0
2.4 – 2.6 0
2.6 – 2.8 0
2.8 – 3 0
3 – 3.2 0
3.2 – 3.4 0
3.4 – 3.6 0
3.6 – 3.8 0
3.8 – 4 0
4 – 4.2 1472
4.2 – 4.4 0
4.4 – 4.6 0
4.6 – 4.8 0
4.8 – 5 0
5 – 5.2 268
5.2 – 5.4 0
5.4 – 5.6 0
5.6 – 5.8 0
5.8 – 6 0
6 – 6.2 945
6.2 – 6.4 0
6.4 – 6.6 0
6.6 – 6.8 0
6.8 – 7 0
7 – 7.2 18
7.2 – 7.4 0
7.4 – 7.6 0
7.6 – 7.8 0
7.8 – 8 0
8 – 8.2 25
8.2 – 8.4 0
8.4 – 8.6 0
8.6 – 8.8 0
8.8 – 9 110
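
The non-empty histogram bins above can be compared with the PrimaryReligion value counts reported in this notebook. The sketch below pairs each RLG3 code with the label that has the same count. The function name `match_by_count` is hypothetical, and matching on counts alone is only suggestive: a row-level cross-tabulation on the source file would be needed to confirm the mapping.

```python
# Counts copied from the RLG3 histogram (code -> rows).
rlg3_counts = {1: 3328, 2: 192, 4: 1472, 5: 268, 6: 945, 7: 18, 8: 25, 9: 110}

# Counts copied from the PrimaryReligion value table ("" is the empty string).
religion_counts = {
    "Christianity": 3328,
    "Ethnic Religions": 1472,
    "Islam": 945,
    "": 774,
    "Hinduism": 268,
    "Buddhism": 192,
    "Unknown": 110,
    "Other / Small": 25,
    "Non-Religious": 18,
}


def match_by_count(code_counts, label_counts):
    """Pair each code with a label only when exactly one label shares its count."""
    labels_by_count = {}
    for label, count in label_counts.items():
        labels_by_count.setdefault(count, []).append(label)
    mapping = {}
    for code, count in code_counts.items():
        candidates = labels_by_count.get(count, [])
        if len(candidates) == 1:
            mapping[code] = candidates[0]
    return mapping


mapping = match_by_count(rlg3_counts, religion_counts)
for code in sorted(mapping):
    print(code, mapping[code])
```

Under this reading the empty-string label is the only one without a code, which would be consistent with RLG3 being null on those rows (776 RLG3 nulls against 774 empty strings plus 2 nulls in PrimaryReligion).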

PrimaryReligion categorical feature

Categorical label for the dominant religion of each record, with 9 distinct values across 7134 rows. Christianity leads at 46.7% (3328), followed by Ethnic Religions (1472) and Islam (945). Note the empty-string category appears 774 times alongside an explicit 'Unknown' bucket of 110 — two separate missingness conventions coexist beyond the 0.03% true nulls.

Treatment: Consolidate the empty string into 'Unknown' (leaving 8 categories) and one-hot encode the result.

anthropic:claude-opus-4-7 · confidence high
Out[72]:

saturn.columns["PrimaryReligion"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 9
top_value Christianity
top_rate 0.4666
cardinality 9
entropy 2.179
entropy_ratio 0.6873
Fig 27.
Top values for PrimaryReligion.
Top values for PrimaryReligion (9 unique shown, of 9 total).
value count share
Christianity 3328 46.6%
Ethnic Religions 1472 20.6%
Islam 945 13.2%
(empty string) 774 10.8%
Hinduism 268 3.8%
Buddhism 192 2.7%
Unknown 110 1.5%
Other / Small 25 0.4%
Non-Religious 18 0.3%
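
The consolidation and one-hot step from the treatment can be sketched in plain Python as follows. The helper names `consolidate_missing` and `one_hot` are hypothetical, the sample values are illustrative rather than rows from the dataset, and folding true nulls into 'Unknown' alongside the empty string is an assumption that goes slightly beyond the stated treatment.

```python
def consolidate_missing(value, unknown="Unknown"):
    """Fold None and empty or whitespace-only strings into one 'Unknown' label."""
    if value is None or not value.strip():
        return unknown
    return value


def one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


# Illustrative sample covering both missingness conventions.
sample = ["Christianity", "", "Islam", None, "Unknown", "Hinduism"]

cleaned = [consolidate_missing(v) for v in sample]
categories, rows = one_hot(cleaned)

print(categories)
for value, row in zip(cleaned, rows):
    print(value, row)
```

Consolidating before encoding avoids carrying two separate indicator columns that both mean "religion not recorded".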

FCBH_URL text metadata

This column holds a single URL per row pointing to Faith Comes By Hearing resources (apk.fcbh.org or live.bible.is), with url_rate at 1.0 and one_word_rate at 0.9987. It is largely missing (null_rate is 0.6801): of the 2,282 populated rows, 2,272 values are unique and only 10 are duplicates. Lengths cluster tightly (median 34, mean 37.68, range 25 to 100), consistent with a structured link field rather than free text.

Treatment: Treat as an optional reference link; keep as-is for lookup, do not feed into modelling.

anthropic:claude-opus-4-7 · confidence high
Out[75]:

saturn.columns["FCBH_URL"].stats

stat value
n 7,134
nulls 4,852 (68.0%)
unique 2,272
len_min 25
len_max 100
len_mean 37.68
len_median 34
len_p95 66
word_mean 1.001
word_median 1
n_empty 0
n_duplicates 10
duplicate_rate 0.004382
vocab_size 2,272
readability_flesch_mean -325.5
emoji_rate 0
url_rate 1
one_word_rate 0.9987
allcaps_rate 0
boilerplate_rate 0
alert: near_unique 99.6% of rows are unique strings
alert: one_word 99.9% rows are a single word
alert: url_heavy 100.0% rows contain a URL
alert: null_rate 68.0% null
Fig 28.
Character-length distribution for FCBH_URL.
Character-length distribution for FCBH_URL (mean: 37.6783523225241).
chars count
25 – 27 17
27 – 29 15
29 – 31 40
31 – 32 43
32 – 34 1831
34 – 36 30
36 – 38 9
38 – 40 7
40 – 42 5
42 – 44 6
44 – 46 8
46 – 48 4
48 – 49 2
49 – 51 0
51 – 53 0
53 – 55 0
55 – 57 1
57 – 59 0
59 – 61 1
61 – 62 1
62 – 64 25
64 – 66 221
66 – 68 3
68 – 70 1
70 – 72 1
72 – 74 0
74 – 76 0
76 – 78 2
78 – 79 3
79 – 81 3
81 – 83 1
83 – 85 1
85 – 87 0
87 – 89 0
89 – 91 0
91 – 92 0
92 – 94 0
94 – 96 0
96 – 98 0
98 – 100 1
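
Since FCBH_URL is kept as a lookup field, a quick sanity check is to tally links by host and recompute the duplicate rate over populated rows. The sketch below uses the standard library only. The sample URLs (scheme and paths) are made up for illustration, only the two host names come from the report, and the helper names `host_counts` and `duplicate_rate` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def host_counts(urls):
    """Count populated URLs by host name, skipping None and empty strings."""
    return Counter(urlparse(u).netloc.lower() for u in urls if u)


def duplicate_rate(urls):
    """Share of populated rows that repeat an earlier value."""
    populated = [u for u in urls if u]
    if not populated:
        return 0.0
    return (len(populated) - len(set(populated))) / len(populated)


# Hypothetical sample; real paths in the dataset will differ.
sample_urls = [
    "https://live.bible.is/bible/AAAAAA",
    "https://live.bible.is/bible/BBBBBB",
    "https://apk.fcbh.org/dl/CCCCCC",
    "https://live.bible.is/bible/AAAAAA",  # repeated on purpose
    None,
]

print(host_counts(sample_urls))
print(duplicate_rate(sample_urls))

# The reported figures agree with this definition: 10 duplicates over
# 7,134 - 4,852 = 2,282 populated rows.
reported_rate = 10 / (7134 - 4852)
print(round(reported_rate, 6))
```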

NbrPGICs numeric feature

NbrPGICs is a heavily right-skewed count feature, with median 1 and Q3 of 2 but a maximum of 1804 and standard deviation of 48.93. The distribution shows extreme tail behaviour (skew 16.65, kurtosis 404.08) and 701 outliers (11.0% of values), while 10.75% of rows are null. Most records carry a trivial count, but a small subset reports values orders of magnitude larger.

Treatment: Log-transform or cap at a high quantile and impute the 10.75% nulls before modelling.

anthropic:claude-opus-4-7 · confidence high
Out[78]:

saturn.columns["NbrPGICs"].stats

stat value
n 7,134
nulls 767 (10.8%)
unique 155
min 1
max 1,804
mean 7.209
median 1
std 48.93
q1 1
q3 2
iqr 1
skew 16.65
kurtosis 404.1
n_outliers 701
outlier_rate 0.1101
zero_rate 0
alert: high_skew skew=+16.65
alert: outliers 11.0% rows beyond 1.5 IQR
Fig 29.
Distribution of NbrPGICs. Vertical dash marks the median.
Histogram bins for NbrPGICs (median: 1.0).
bin count
1 – 46.08 6231
46.08 – 91.15 46
91.15 – 136.2 21
136.2 – 181.3 9
181.3 – 226.4 11
226.4 – 271.5 5
271.5 – 316.5 11
316.5 – 361.6 9
361.6 – 406.7 1
406.7 – 451.8 6
451.8 – 496.8 1
496.8 – 541.9 3
541.9 – 587 2
587 – 632.1 2
632.1 – 677.1 0
677.1 – 722.2 3
722.2 – 767.3 1
767.3 – 812.4 1
812.4 – 857.4 1
857.4 – 902.5 0
902.5 – 947.6 2
947.6 – 992.7 0
992.7 – 1038 0
1038 – 1083 0
1083 – 1128 0
1128 – 1173 0
1173 – 1218 0
1218 – 1263 0
1263 – 1308 0
1308 – 1353 0
1353 – 1398 0
1398 – 1443 0
1443 – 1488 0
1488 – 1534 0
1534 – 1579 0
1579 – 1624 0
1624 – 1669 0
1669 – 1714 0
1714 – 1759 0
1759 – 1804 1
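
Both options named in the NbrPGICs treatment, a log transform and a cap at a high quantile, can be sketched with the standard library. The helper names `log1p_or_none` and `cap_at_quantile` are hypothetical, the sample values are illustrative apart from the reported maximum of 1804, and the nearest-rank quantile is just one of several reasonable definitions.

```python
import math


def log1p_or_none(x):
    """Apply log(1 + x), leaving missing values untouched."""
    return None if x is None else math.log1p(x)


def cap_at_quantile(values, q=0.99):
    """Clip values above the nearest-rank q-quantile of the non-missing data."""
    present = sorted(v for v in values if v is not None)
    if not present:
        return list(values)
    # Nearest rank: the smallest value with at least a share q of data at or below it.
    rank = max(1, math.ceil(q * len(present)))
    cap = present[rank - 1]
    return [None if v is None else min(v, cap) for v in values]


# Illustrative long-tailed sample; 1804 is the reported maximum.
sample = [1, 1, 1, 2, 2, 3, 12, 150, 1804, None]

capped = cap_at_quantile(sample, q=0.8)
logged = [log1p_or_none(v) for v in sample]

print(capped)
print([None if v is None else round(v, 3) for v in logged])
```

The log transform keeps the ordering of the tail while compressing it, whereas capping discards how extreme the largest counts are; which is preferable depends on whether those large counts are meaningful for the downstream model.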

NbrCountries numeric feature

NbrCountries is a numeric count of countries associated with each record, ranging from 1 to 136 with a median of 1 and Q1=Q3=1, meaning at least three quarters of non-null rows are single-country. The distribution is extremely heavy-tailed (skew 15.6, kurtosis 364) with 1203 outliers (20.5% outlier rate) and a 17.75% null rate. Because the IQR is 0, the 1.5 IQR rule flags every value above 1, so those 1203 outliers are simply the multi-country records, and that minority dominates the variance.

Treatment: Log1p-transform or bucket into 1 vs. multi-country before modelling, and impute or flag the 17.75% nulls.

anthropic:claude-opus-4-7 · confidence high
Out[81]:

saturn.columns["NbrCountries"].stats

stat value
n 7,134
nulls 1,266 (17.7%)
unique 43
min 1
max 136
mean 1.711
median 1
std 3.901
q1 1
q3 1
iqr 0
skew 15.6
kurtosis 364.1
n_outliers 1,203
outlier_rate 0.205
zero_rate 0
alert: high_skew skew=+15.60
alert: outliers 20.5% rows beyond 1.5 IQR
Fig 30.
Distribution of NbrCountries. Vertical dash marks the median.
Histogram bins for NbrCountries (median: 1.0).
bin count
1 – 4.375 5656
4.375 – 7.75 92
7.75 – 11.12 41
11.12 – 14.5 20
14.5 – 17.88 10
17.88 – 21.25 12
21.25 – 24.62 5
24.62 – 28 8
28 – 31.38 4
31.38 – 34.75 4
34.75 – 38.12 2
38.12 – 41.5 3
41.5 – 44.88 2
44.88 – 48.25 1
48.25 – 51.62 1
51.62 – 55 1
55 – 58.38 1
58.38 – 61.75 0
61.75 – 65.12 1
65.12 – 68.5 0
68.5 – 71.88 1
71.88 – 75.25 0
75.25 – 78.62 1
78.62 – 82 0
82 – 85.38 0
85.38 – 88.75 1
88.75 – 92.12 0
92.12 – 95.5 0
95.5 – 98.88 0
98.88 – 102.2 0
102.2 – 105.6 0
105.6 – 109 0
109 – 112.4 0
112.4 – 115.8 0
115.8 – 119.1 0
119.1 – 122.5 0
122.5 – 125.9 0
125.9 – 129.2 0
129.2 – 132.6 0
132.6 – 136 1
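
The single versus multi-country bucketing suggested for NbrCountries, together with the 1.5 IQR fence behind the outlier alert, can be sketched as follows. The helper names `upper_fence` and `country_bucket` and the sample values are hypothetical, and the sketch assumes Saturn's outlier alert uses the conventional `q3 + 1.5 * IQR` upper bound.

```python
def upper_fence(q1, q3, k=1.5):
    """Upper outlier bound under the conventional k * IQR rule."""
    return q3 + k * (q3 - q1)


def country_bucket(n):
    """Collapse the count into 'single', 'multi', or an explicit 'missing' flag."""
    if n is None:
        return "missing"
    return "single" if n == 1 else "multi"


# With the reported q1 = q3 = 1 the fence sits at 1, so any value above 1 is flagged.
fence = upper_fence(1, 1)

# Illustrative sample; 136 is the reported maximum.
sample = [1, 1, 2, 136, None, 1, 5]

buckets = [country_bucket(n) for n in sample]
flagged = [n for n in sample if n is not None and n > fence]

print(fence)
print(buckets)
print(flagged)
```

Keeping "missing" as its own bucket preserves the 17.75% of rows without a count instead of silently assigning them to the single-country group.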

JF categorical feature

Binary Y/N flag with only two distinct values across 7134 rows and a negligible null rate of 0.0003. The distribution is skewed toward 'N' at 71.6% (5105), leaving 'Y' at 28.4% (2027). Entropy ratio of 0.86 indicates the split is imbalanced but still informative.

Treatment: Encode as a 0/1 indicator and impute the rare nulls with the mode.

anthropic:claude-opus-4-7 · confidence high
Out[84]:

saturn.columns["JF"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 2
top_value N
top_rate 0.7158
cardinality 2
entropy 0.8611
entropy_ratio 0.8611
Fig 31.
Top values for JF.
Top values for JF (2 unique shown, of 2 total).
value count share
N 5105 71.6%
Y 2027 28.4%
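
The entropy figures in the stats table can be reproduced from the value counts. The sketch below assumes that entropy is Shannon entropy in bits and that entropy_ratio divides it by the base-2 log of the cardinality. That reading is inferred from the reported numbers rather than taken from Saturn's documentation, and the function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2 of the number of observed values."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# JF value counts from the table above: N = 5105, Y = 2027.
jf_counts = [5105, 2027]
h = entropy_bits(jf_counts)
ratio = entropy_ratio(jf_counts)

# For a two-valued column log2(2) = 1, so entropy and entropy_ratio coincide.
print(round(h, 4), round(ratio, 4))

# The same functions applied to the PrimaryReligion counts from this notebook.
religion_counts = [3328, 1472, 945, 774, 268, 192, 110, 25, 18]
print(round(entropy_ratio(religion_counts), 4))
```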

AudioRecordings categorical feature

Binary Y/N flag indicating whether audio recordings exist for each row, with only 2 unique values across 7,134 records and a negligible null rate of 0.0003. The split leans moderately toward 'Y' at 59.1% (4,216) versus 'N' at 40.9% (2,916), giving high entropy (0.976) for a binary field. No surprising signals beyond the slight Y-majority skew.

Treatment: Encode as a 0/1 boolean indicator after imputing the few nulls.

anthropic:claude-opus-4-7 · confidence high
Out[87]:

saturn.columns["AudioRecordings"].stats

stat value
n 7,134
nulls 2 (0.0%)
unique 2
top_value Y
top_rate 0.5911
cardinality 2
entropy 0.9759
entropy_ratio 0.9759
Fig 32.
Top values for AudioRecordings.
Top values for AudioRecordings (2 unique shown, of 2 total).
value count share
Y 4216 59.1%
N 2916 40.9%
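
The "impute, then encode" treatment for AudioRecordings can be sketched as below. Mode imputation is used here as an assumption, since the treatment does not name an imputation method. The helper name `impute_mode` is hypothetical and the column is rebuilt from the reported counts rather than read from the source file.

```python
from collections import Counter


def impute_mode(values):
    """Replace None with the most frequent non-missing value."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Rebuild the column from the reported counts: 4216 Y, 2916 N, 2 nulls.
audio = ["Y"] * 4216 + ["N"] * 2916 + [None] * 2

filled = impute_mode(audio)
flags = [1 if v == "Y" else 0 for v in filled]

# The two nulls take the majority value 'Y'.
print(sum(flags), len(flags))
```

With only 2 nulls in 7,134 rows the choice of imputation method has little practical effect, so dropping those rows would be an equally defensible alternative.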

How to cite


BibTeX
@misc{saturn-joshua-project-joshua-project-languages-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: joshua project joshua project languages},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/joshua-project-joshua_project_languages}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: joshua project joshua project languages. Source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_languages.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/joshua-project-joshua_project_languages