saturn·

joshua project · joshua project languages

source: /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_languages.json · 7,134 rows · 26 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a Joshua Project languages dataset with 7,134 rows and 26 columns, profiling world languages alongside Bible translation status, audio/film resource availability, primary religion, and host-country distribution. The headline signal is religious-engagement coverage: PrimaryReligion is dominated by Christianity (3,328) followed by Ethnic Religions (1,472) and Islam (945), and JPScale skews toward the more-reached end with category 5 the largest bucket (2,050). Resource availability is uneven: HasAudioRecordings is roughly 59% Y / 41% N, while HasJesusFilm is only ~28% Y, suggesting the Jesus Film coverage gap is worth a closer look. Geographic concentration is also notable: HubCountry is led by Papua New Guinea (837), Indonesia (686), and Nigeria (494), which together account for 2,017 entries, or about 28% of the dataset. Finally, NbrPGICs is extremely skewed (max 1,804, median 1), so any per-language counts should be inspected with that long tail in mind.

citing: PrimaryReligion · JPScale · HasAudioRecordings · HasJesusFilm · HubCountry · NbrPGICs · Status · BibleStatus

Schema

26 columns
Per-column summary.
Column · Type · Nulls · Unique · Alerts
ROL3 · text · 0.0% · 7,134 · near_unique, one_word, short_text
Language · text · 0.0% · 7,124 · near_unique, one_word
WebLangText · text · 0.0% · 7,134 · near_unique, one_word
Status · categorical · 0.0% · 2 · none
ROG3 · categorical · 0.0% · 211 · none
HubCountry · categorical · 0.0% · 210 · none
BibleStatus · numeric · 0.0% · 6 · none
GRN_URL · text · 41.4% · 4,179 · near_unique, one_word, url_heavy, null_rate
TranslationNeedQuestionable · unknown · 0.0% · n/a · skipped
BibleYear · categorical · 89.0% · 488 · long_tail, null_rate
NTYear · text · 63.6% · 1,109 · one_word, allcaps, null_rate, short_text, duplicates
PortionsYear · text · 43.0% · 1,797 · one_word, allcaps, null_rate, short_text, duplicates
PercentAdherents · numeric · 11.9% · 1,349 · none
PercentEvangelical · numeric · 17.3% · 1,006 · none
HasJesusFilm · categorical · 0.0% · 2 · none
JF_URL · text · 71.6% · 2,008 · near_unique, one_word, url_heavy, null_rate
HasAudioRecordings · categorical · 0.0% · 2 · none
JPScale · categorical · 11.9% · 5 · none
LeastReached · categorical · 0.0% · 2 · none
RLG3 · numeric · 10.9% · 8 · none
PrimaryReligion · categorical · 0.0% · 9 · none
FCBH_URL · text · 68.0% · 2,272 · near_unique, one_word, url_heavy, null_rate
NbrPGICs · numeric · 10.8% · 155 · high_skew, outliers
NbrCountries · numeric · 17.7% · 43 · high_skew, outliers
JF · categorical · 0.0% · 2 · none
AudioRecordings · categorical · 0.0% · 2 · none

ROL3

text identifier near_unique one_word short_text
ROL3 is a text column of exactly 7134 unique three-character single-word tokens across 7134 rows, with zero nulls and zero duplicates. The perfect 1:1 cardinality (vocab_size 7134 == n) and uniform len_min/max/mean of 3 strongly suggest this is a row-level identifier or code rather than natural language. Top tokens like 'aou', 'aiw', 'aas' show no repeated values, confirming it carries no distributional signal on its own. Treatment: Drop from modelling or use only as a join key; near-unique three-letter codes carry no predictive signal. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 0 (0.0%) · unique: 7,134 · len_min: 3 · len_max: 3 · len_mean: 3 · len_median: 3 · len_p95: 3 · word_mean: 1 · word_median: 1 · n_empty: 0 · n_duplicates: 0 · duplicate_rate: 0 · vocab_size: 7,134 · readability_flesch_mean: 120.4 · emoji_rate: 0 · url_rate: 0 · one_word_rate: 1 · allcaps_rate: 0 · boilerplate_rate: 0

Language

text identifier near_unique one_word
This column holds language names, with 7,124 distinct values across 7,134 rows and only 10 duplicates, so it is essentially one row per language. Entries are short (mean 9.1 characters, median 1 word) and 73.5% are single-word labels, though compound names involving directional qualifiers (southern, northern, eastern, western) and family roots (zapotec, mixtec, naga) appear often. The high cardinality combined with the 'language' and 'sign' top tokens suggests this is a catalog of world languages, likely including sign-language variants. Treatment: Treat as a near-unique label rather than a feature to one-hot encode; because 10 names repeat, a join on Language can match more than one row, so prefer a fully unique column such as ROL3 as the join key. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 0 (0.0%) · unique: 7,124 · len_min: 1 · len_max: 45 · len_mean: 9.102 · len_median: 7 · len_p95: 22 · word_mean: 1.363 · word_median: 1 · n_empty: 0 · n_duplicates: 10 · duplicate_rate: 0.001402 · vocab_size: 7,180 · readability_flesch_mean: 52.16 · emoji_rate: 0 · url_rate: 0 · one_word_rate: 0.7347 · allcaps_rate: 0 · boilerplate_rate: 0
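Since the profile reports 10 duplicated names, it can help to list the repeated values before using Language in a join. The following is a minimal sketch in plain Python; `find_duplicates` is a hypothetical helper and the sample names are placeholders, not values from the dataset.

```python
from collections import Counter


def find_duplicates(values):
    """Return {value: count} for every non-null value that occurs more than once."""
    counts = Counter(v for v in values if v is not None)
    return {value: n for value, n in counts.items() if n > 1}


# Placeholder names for illustration only; not rows from the dataset.
names = ["Alpha", "Beta", "Alpha", "Gamma", None]
print(find_duplicates(names))
```

An empty result for a candidate key means a join on it cannot fan out into extra rows.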

WebLangText

text identifier near_unique one_word
WebLangText appears to be a per-row language name label, with every one of the 7134 values unique and 73% being a single word (mean 1.37 words, median length 7 chars). Top tokens like 'language', 'sign', 'zapotec', 'mixtec', and 'naga' suggest this is an inventory of world languages including sign languages and regional variants (Southern/Northern/Eastern/Western). The full uniqueness (n_unique == n) means it functions as an identifier rather than a categorical feature. Treatment: Treat as a language-name key; left-join on this rather than using as a model feature. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 0 (0.0%) · unique: 7,134 · len_min: 1 · len_max: 45 · len_mean: 9.119 · len_median: 7 · len_p95: 22 · word_mean: 1.366 · word_median: 1 · n_empty: 0 · n_duplicates: 0 · duplicate_rate: 0 · vocab_size: 7,200 · readability_flesch_mean: 52.37 · emoji_rate: 0 · url_rate: 0 · one_word_rate: 0.7318 · allcaps_rate: 0 · boilerplate_rate: 0

Status

categorical label
Binary status flag with two values, 'L' and 'N', dominated by 'L' at 86.0% (6,134 of the 7,132 non-null rows) versus 998 for 'N'. The class imbalance is notable, and there are 2 nulls (null_rate 0.0003). An entropy ratio of 0.58 confirms the skewed distribution. Treatment: Encode as binary; address class imbalance (e.g., stratified sampling or class weights) before modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 2 (0.0%) · unique: 2 · top_value: L · top_rate: 0.8601 · cardinality: 2 · entropy: 0.5841 · entropy_ratio: 0.5841
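The entropy figures can be reproduced from the counts quoted for Status. The sketch below assumes the profiler reports Shannon entropy in bits and defines entropy_ratio as entropy divided by log2(cardinality); the profile does not state this, but the definition is consistent with the reported figures (for example JPScale: 2.07 / log2(5) ≈ 0.89). The function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0


# Status counts quoted in the profile: 6,134 'L' and 998 'N'.
status_counts = [6134, 998]
print(round(entropy_bits(status_counts), 4))
print(round(entropy_ratio(status_counts), 4))
```

For a two-valued column the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for the binary flags in this report.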

ROG3

categorical feature
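ROG3 is a categorical code field with 211 distinct two-letter values, led by 'PP' at 11.7% (837 rows) and followed by 'ID', 'NI', 'IN', and 'MX', a distribution consistent with country or region codes. Entropy is 5.64 (ratio 0.73), indicating broad spread across the 211 categories rather than concentration in a few. Nulls are negligible (0.03%). The top-10 values include tokens that read as ISO-style country codes ('IN', 'MX', 'US', 'CH') alongside 'PP', which is not an ISO 3166 alpha-2 code, and others whose reading is ambiguous ('NI', 'CG'). Because 'PP' covers the same 837 rows as Papua New Guinea in HubCountry, the codes look more like FIPS 10-4 than ISO 3166 (under FIPS, 'PP' is Papua New Guinea, 'NI' is Nigeria, and 'CH' is China rather than Switzerland), though this should be confirmed against the source documentation. Treatment: Target-encode or group rare levels before modelling; confirm the code standard before joining to external country tables. medium · anthropic:claude-opus-4-7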
n: 7,134 · nulls: 2 (0.0%) · unique: 211 · top_value: PP · top_rate: 0.1174 · cardinality: 211 · entropy: 5.642 · entropy_ratio: 0.7307

HubCountry

categorical feature
HubCountry is a categorical country-name field with 210 distinct values across 7,134 rows and a near-zero null rate (0.0003). The distribution is broad rather than concentrated: the entropy ratio is 0.73, and the top value, Papua New Guinea, covers only 11.7% of records (837), followed by Indonesia (686) and Nigeria (494). The leading countries are those with high linguistic diversity rather than the largest populations or economies, which is expected for a per-language table and worth keeping in mind before any geographic modelling. Treatment: Group long-tail countries into regions or frequency buckets before one-hot or target encoding. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 2 (0.0%) · unique: 210 · top_value: Papua New Guinea · top_rate: 0.1174 · cardinality: 210 · entropy: 5.641 · entropy_ratio: 0.7312
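One way to apply the frequency-bucket treatment is to fold rarely seen countries into a single level. This is a sketch in plain Python; `group_rare_levels`, the "Other" label, and the threshold are illustrative assumptions rather than anything the profile prescribes.

```python
from collections import Counter


def group_rare_levels(values, min_count, other_label="Other"):
    """Replace levels seen fewer than min_count times with other_label; keep None."""
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other_label
        for v in values
    ]


# Illustrative sample; the threshold of 2 is arbitrary.
sample = ["Papua New Guinea", "Papua New Guinea", "Indonesia",
          "Nigeria", "Indonesia", "Fiji", None]
print(group_rare_levels(sample, min_count=2))
```

Nulls are passed through untouched so they can be imputed or flagged separately.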

BibleStatus

numeric feature
BibleStatus is an integer-coded categorical with 6 distinct values from 0 to 5, mean 2.68 and median 3, almost certainly an ordinal status/level code rather than a true numeric measure. About 14.9% of rows are zero and the distribution is mildly left-skewed (skew -0.34) with flat kurtosis (-0.91), suggesting a fairly even spread across the upper levels with a sizable zero/'none' bucket. Null rate is negligible (0.0003) and no outliers were flagged. Treatment: Treat as an ordinal categorical (one-hot or ordered encoding) rather than a continuous numeric. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 2 (0.0%) · unique: 6 · min: 0 · max: 5 · mean: 2.677 · median: 3 · std: 1.555 · q1: 2 · q3: 4 · iqr: 2 · skew: -0.3401 · kurtosis: -0.9086 · n_outliers: 0 · outlier_rate: 0 · zero_rate: 0.1492

GRN_URL

text identifier near_unique one_word url_heavy null_rate
This column holds Global Recordings Network language URLs, each a fixed 44-character single token consisting of https://globalrecordings.net/en/language/ followed by a language code. With 4,179 unique values among the 4,180 non-null rows and a 41.41% null rate, it functions as a per-language identifier link rather than a feature. Only one URL is duplicated (the one ending in idt appears twice), and 41% of rows have no GRN link at all. Treatment: Extract the trailing language code as a foreign key; otherwise drop the URL itself from modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 2,954 (41.4%) · unique: 4,179 · len_min: 44 · len_max: 44 · len_mean: 44 · len_median: 44 · len_p95: 44 · word_mean: 1 · word_median: 1 · n_empty: 0 · n_duplicates: 1 · duplicate_rate: 0.0002392 · vocab_size: 4,179 · readability_flesch_mean: -435 · emoji_rate: 0 · url_rate: 1 · one_word_rate: 1 · allcaps_rate: 0 · boilerplate_rate: 0
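Extracting the trailing code could look like the sketch below. It assumes every populated value follows the https://globalrecordings.net/en/language/ pattern described above; `grn_language_code` is a hypothetical helper that returns None for nulls and for URLs that do not match.

```python
from urllib.parse import urlparse

GRN_PREFIX = "/en/language/"  # path prefix described in the profile


def grn_language_code(url):
    """Return the trailing language code of a GRN language URL, or None."""
    if not url:
        return None
    path = urlparse(url).path
    if not path.startswith(GRN_PREFIX):
        return None
    code = path[len(GRN_PREFIX):].strip("/")
    return code or None


print(grn_language_code("https://globalrecordings.net/en/language/idt"))  # idt
```

Whether the extracted code lines up with ROL3 is not stated in the profile and would need to be checked before using it as a foreign key.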

TranslationNeedQuestionable

unknown other skipped
The column 'TranslationNeedQuestionable' was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 7134 and a null rate of 0.0. The name suggests a boolean or flag indicating whether the need for translation is in doubt, but this cannot be confirmed from the evidence. No distribution, unique count, or sample values were captured. Treatment: Re-profile with type inference enabled before deciding on downstream use. low · anthropic:claude-opus-4-7
n: 7,134 · nulls: 0 (0.0%) · unique: (not captured)

BibleYear

categorical free_text long_tail null_rate
Free-text field nominally capturing a year associated with a Bible (likely year acquired or published), but heavily polluted: 89.01% of the 7,134 rows are null, leaving 784 populated rows. Among the 488 unique values the most common entry is "2023" at just 3.4%, and "Yes" is the second most frequent value (22 times), indicating the field also absorbed yes/no answers. An entropy ratio of 0.93 confirms a long, flat tail with no dominant year. Treatment: Clean by coercing to integer years and routing non-numeric responses (e.g., "Yes") to a separate flag before use. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 6,350 (89.0%) · unique: 488 · top_value: 2023 · top_rate: 0.03444 · cardinality: 488 · entropy: 8.296 · entropy_ratio: 0.9289
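Before coercing BibleYear to integers, it may be worth tallying which values are not bare four-digit years, so that nothing is discarded unseen. This is a minimal sketch; `non_year_tokens` is a hypothetical helper, and apart from "2023" and "Yes" the sample values are made up.

```python
import re
from collections import Counter

YEAR = re.compile(r"^\d{4}$")


def non_year_tokens(values):
    """Count non-null values that are not a bare four-digit year."""
    tokens = (str(v).strip() for v in values if v is not None)
    return Counter(t for t in tokens if not YEAR.match(t))


# Illustrative sample; only "2023" and "Yes" are values named in the profile.
sample = ["2023", "Yes", None, "2023", "Yes", "n/a"]
print(non_year_tokens(sample).most_common())
```

The resulting tally shows which tokens need their own flag column and which can be treated as missing.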

NTYear

text feature one_word allcaps null_rate short_text duplicates
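Despite the name suggesting a year, NTYear is a single-token text column with mixed semantics: the most frequent value is 'Yes' (147 rows), followed by four-digit years from 2016-2024. It is 63.61% null, and 57.28% of the 2,596 non-null values are duplicates, leaving 1,109 unique values; one_word_rate is 1.0 and allcaps_rate is 0.94. The mix of a yes/no token alongside year values suggests two questions were collapsed into one field. Note that len_median and len_max are both 9, so at least half of the values are longer than a bare four-digit year; these may be year ranges or another composite format and should be inspected before parsing. Treatment: Split into separate columns (a boolean indicator and numeric year fields) once the 9-character format is confirmed. high · anthropic:claude-opus-4-7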
n: 7,134 · nulls: 4,538 (63.6%) · unique: 1,109 · len_min: 3 · len_max: 9 · len_mean: 6.694 · len_median: 9 · len_p95: 9 · word_mean: 1 · word_median: 1 · n_empty: 0 · n_duplicates: 1,487 · duplicate_rate: 0.5728 · vocab_size: 1,109 · readability_flesch_mean: 121.2 · emoji_rate: 0 · url_rate: 0 · one_word_rate: 1 · allcaps_rate: 0.9434 · boilerplate_rate: 0

PortionsYear

text feature one_word allcaps null_rate short_text duplicates
Despite the name PortionsYear, this column mixes a yes/no flag with year values: every entry is a single word, and the most common value is 'Yes' (706 occurrences), followed by years like 2024 (107), 2022 (55) and 2025 (43). 43% of rows are null, and 55.8% of the 4,066 non-null values are duplicates, leaving 1,797 unique tokens, with 82.6% in all-caps. The semantic mix of a boolean and a year in one field is the headline anomaly. Note also that len_median and len_max are both 9, so at least half of the values are longer than a bare four-digit year and may be year ranges or another composite format. Treatment: Split into separate columns (a boolean 'has portions' flag and parsed year fields) before modelling, after confirming the 9-character format. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 3,068 (43.0%) · unique: 1,797 · len_min: 3 · len_max: 9 · len_mean: 6.372 · len_median: 9 · len_p95: 9 · word_mean: 1 · word_median: 1 · n_empty: 0 · n_duplicates: 2,269 · duplicate_rate: 0.558 · vocab_size: 1,797 · readability_flesch_mean: 121.2 · emoji_rate: 0 · url_rate: 0 · one_word_rate: 1 · allcaps_rate: 0.8264 · boilerplate_rate: 0
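The split for NTYear and PortionsYear might be sketched as follows. Two things here are assumptions rather than facts from the profile: that the 9-character values are `YYYY-YYYY` ranges (suggested only by len_median = len_max = 9), and that a null means nothing is recorded. `split_year_field` is a hypothetical helper.

```python
import re

YEAR = re.compile(r"^\d{4}$")
RANGE = re.compile(r"^(\d{4})-(\d{4})$")  # assumed format, not confirmed by the profile


def split_year_field(value):
    """Split a mixed field into (has_flag, first_year, last_year).

    Tokens that match none of the known shapes return (None, None, None)
    so they can be reviewed instead of being silently dropped.
    """
    if value is None:
        return (False, None, None)  # assumption: null means nothing recorded
    text = str(value).strip()
    if text.lower() == "yes":
        return (True, None, None)
    if YEAR.match(text):
        return (True, int(text), int(text))
    match = RANGE.match(text)
    if match:
        return (True, int(match.group(1)), int(match.group(2)))
    return (None, None, None)


for token in ["Yes", "2024", "1990-2005", None, "unclear"]:
    print(token, split_year_field(token))
```

If the 9-character values turn out to have a different shape, only the `RANGE` pattern needs to change.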

PercentAdherents

numeric feature
PercentAdherents is a numeric share variable bounded between 0 and 100, almost certainly the percentage of some population that adheres to a religion or group. The distribution looks U-shaped rather than centrally peaked: the quartiles sit at 5.34 and 90.0 around a median of 58.33, kurtosis is -1.68, and 7.3% of values are exactly zero. Nearly 12% of rows (847) are null, which is worth flagging before any aggregation. Treatment: Impute or filter the 11.87% nulls and consider binning given the U-shaped distribution before modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 847 (11.9%) · unique: 1,349 · min: 0 · max: 100 · mean: 49.63 · median: 58.33 · std: 38.83 · q1: 5.34 · q3: 90 · iqr: 84.66 · skew: -0.08408 · kurtosis: -1.679 · n_outliers: 0 · outlier_rate: 0 · zero_rate: 0.07285

PercentEvangelical

numeric feature
This appears to be a percentage feature capturing the share of evangelicals in each record's population, ranging from 0 to 95 with a median of just 5. The distribution is heavily right-skewed (skew 1.93, kurtosis 4.92) and 9% of values are exact zeros, with 251 outliers (4.3%) pulling the mean up to 10.1. Note also that 17.3% of rows are null, which is substantial. Treatment: Impute or flag the 17% nulls and consider a log1p transform before modelling to tame the right skew. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 1,234 (17.3%) · unique: 1,006 · min: 0 · max: 95 · mean: 10.13 · median: 5 · std: 12.24 · q1: 1 · q3: 15.55 · iqr: 14.55 · skew: 1.932 · kurtosis: 4.925 · n_outliers: 251 · outlier_rate: 0.04254 · zero_rate: 0.09
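The "impute or flag, then log1p" treatment can be sketched in a few lines. Median imputation is one choice among several and is an assumption here, as is the helper name `log1p_with_flag`; the sample percentages are illustrative.

```python
import math
import statistics


def log1p_with_flag(values):
    """Median-impute nulls, apply log1p, and return (transformed, was_missing)."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    transformed = [math.log1p(fill if v is None else v) for v in values]
    was_missing = [v is None for v in values]
    return transformed, was_missing


# Illustrative percentages, not rows from the dataset.
sample = [0.0, 1.0, 5.0, None, 95.0]
transformed, was_missing = log1p_with_flag(sample)
print([round(t, 3) for t in transformed])
print(was_missing)
```

Keeping the `was_missing` flag alongside the imputed value lets a model distinguish a real median-like value from a filled-in one. log1p is used rather than log because 9% of the values are exact zeros.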

HasJesusFilm

categorical feature
Binary Y/N flag indicating whether each record has an associated Jesus Film, with only 2 unique values across 7134 rows and a negligible 0.0003 null rate. The distribution is moderately imbalanced: 'N' dominates at 71.6% (5105) versus 2027 'Y' values, yielding an entropy ratio of 0.86. Treatment: Encode as a 0/1 boolean indicator for modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 2 (0.0%) · unique: 2 · top_value: N · top_rate: 0.7158 · cardinality: 2 · entropy: 0.8611 · entropy_ratio: 0.8611

JF_URL

text metadata near_unique one_word url_heavy null_rate
This column holds JesusFilm.org URLs (url_rate 1.0, one_word_rate 0.9995), almost all pointing to language-specific watch pages like /watch/jesus.html/{language}.html. Coverage is sparse with a 71.6% null rate, and values are near-unique (2,008 distinct among the 2,027 populated rows), though a generic partners/resources page appears 13 times and 19 duplicates exist overall. URL lengths are tight (45-87 chars, median 55), consistent with a templated link rather than free text. Treatment: Treat as a reference link; extract the language slug from the path if you need a feature, otherwise drop from modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 5,107 (71.6%) · unique: 2,008 · len_min: 45 · len_max: 87 · len_mean: 56.77 · len_median: 55 · len_p95: 69 · word_mean: 1 · word_median: 1 · n_empty: 0 · n_duplicates: 19 · duplicate_rate: 0.009373 · vocab_size: 2,009 · readability_flesch_mean: -781.9 · emoji_rate: 0 · url_rate: 1 · one_word_rate: 0.9995 · allcaps_rate: 0 · boilerplate_rate: 0

HasAudioRecordings

categorical feature
Binary Y/N flag indicating whether a record has associated audio recordings. The split is 59.1% 'Y' versus 40.9% 'N', and the entropy ratio of 0.976 shows near-maximal balance for a two-class field. Null rate is negligible (0.0003). Treatment: Encode as a boolean indicator before modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 2 (0.0%) · unique: 2 · top_value: Y · top_rate: 0.5911 · cardinality: 2 · entropy: 0.9759 · entropy_ratio: 0.9759

JPScale

categorical feature
JPScale is a low-cardinality categorical with 5 distinct values ('1' through '5'), suggesting an ordinal rating or scale. The distribution is weighted toward both ends: the upper levels '5' (32.6%) and '4' (26.6%) dominate alongside the lowest level '1' (20.6%), while '2' and '3' are comparatively rare. An entropy ratio of 0.89 indicates a fairly even spread across categories, but 11.87% of rows are null. Treatment: Treat as ordinal (1–5); impute or flag the ~12% nulls before modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 847 (11.9%) · unique: 5 · top_value: 5 · top_rate: 0.3261 · cardinality: 5 · entropy: 2.07 · entropy_ratio: 0.8915

LeastReached

categorical feature
Binary Y/N flag indicating whether some 'least reached' status applies, with N dominating at 79.3% (5659) versus 1473 Y values across 7134 rows. Class imbalance is notable but not extreme, and nulls are negligible (0.03%). Cardinality is exactly 2 with entropy ratio 0.73, consistent with a clean boolean indicator. Treatment: Encode as boolean (Y=1, N=0) and account for the ~80/20 imbalance if used as a target. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 2 (0.0%) · unique: 2 · top_value: N · top_rate: 0.7935 · cardinality: 2 · entropy: 0.7348 · entropy_ratio: 0.7348
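If LeastReached is used as a target, the Y/N encoding and one common way of weighting for the imbalance can be sketched as below. The "balanced" heuristic (total / (number of classes × class count)) is a choice made for this example, not something the profile specifies, and both function names are hypothetical.

```python
def encode_yn(value):
    """Map 'Y' -> 1 and 'N' -> 0; anything else, including None, -> None."""
    return {"Y": 1, "N": 0}.get(value)


def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Counts quoted in the profile: 5,659 'N' and 1,473 'Y'.
weights = balanced_class_weights({"N": 5659, "Y": 1473})
print({label: round(w, 3) for label, w in weights.items()})
print([encode_yn(v) for v in ["Y", "N", None]])
```

The minority class receives the larger weight, so errors on 'Y' rows cost proportionally more during training.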

RLG3

numeric feature
RLG3 is a small-cardinality integer code (only 8 unique values across 7,134 rows, ranging 1-9 with no zeros) rather than a true continuous measure. The values lean low: median 1, Q3 of 4, mean 2.82, with right skew (0.74) and 110 outliers (1.7%) at the high end. Its 776 nulls (10.88%) equal PrimaryReligion's 774 empty strings plus 2 nulls, and its 110 outliers equal PrimaryReligion's 'Unknown' count, which suggests RLG3 is the numeric code for PrimaryReligion. If so, the codes are nominal, and the mean, median, and skew carry no meaning. Treatment: Confirm the mapping to PrimaryReligion; if it holds, treat as a nominal categorical (or drop as redundant) and impute or flag the ~11% missing values before modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 776 (10.9%) · unique: 8 · min: 1 · max: 9 · mean: 2.819 · median: 1 · std: 2.144 · q1: 1 · q3: 4 · iqr: 3 · skew: 0.7378 · kurtosis: -0.5707 · n_outliers: 110 · outlier_rate: 0.0173 · zero_rate: 0

PrimaryReligion

categorical feature
Categorical label for the dominant religion of each record, with 9 distinct values across 7,134 rows. Christianity leads at 46.7% (3,328), followed by Ethnic Religions (1,472) and Islam (945). Note that the empty-string category appears 774 times alongside an explicit 'Unknown' bucket of 110, so two separate missingness conventions coexist beyond the 0.03% true nulls. Treatment: Consolidate the empty string into 'Unknown' and one-hot encode the resulting categories. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 2 (0.0%) · unique: 9 · top_value: Christianity · top_rate: 0.4666 · cardinality: 9 · entropy: 2.179 · entropy_ratio: 0.6873
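Consolidating the missingness conventions before one-hot encoding could look like this sketch. Folding true nulls into 'Unknown' as well goes slightly beyond the stated treatment and is an assumption of the example; `normalise_religion` and `one_hot` are hypothetical helpers.

```python
def normalise_religion(value, unknown_label="Unknown"):
    """Fold None, empty, and whitespace-only strings into the unknown label."""
    if value is None or not str(value).strip():
        return unknown_label
    return str(value).strip()


def one_hot(value, categories):
    """Return a {category: 0/1} indicator dict for a single value."""
    return {c: int(value == c) for c in categories}


# Illustrative sample using category names quoted in the profile.
raw = ["Christianity", "", "Islam", None, "Unknown", "Ethnic Religions"]
clean = [normalise_religion(v) for v in raw]
categories = sorted(set(clean))
print(clean)
print(one_hot(clean[0], categories))
```

After this step the empty string no longer appears as its own category, so the encoded matrix has one column fewer than the raw cardinality suggests.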

FCBH_URL

text metadata near_unique one_word url_heavy null_rate
This column holds a single URL per row pointing to Faith Comes By Hearing resources (apk.fcbh.org or live.bible.is), with url_rate at 1.0 and one_word_rate at 0.9987. It is largely missing (null_rate 0.6801); of the 2,282 populated rows, 2,272 values are unique, with only 10 duplicates. Lengths range from 25 to 100 characters (mean 37.68, median 34), consistent with a structured link field rather than free text. Treatment: Treat as an optional reference link; keep as-is for lookup, do not feed into modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 4,852 (68.0%) · unique: 2,272 · len_min: 25 · len_max: 100 · len_mean: 37.68 · len_median: 34 · len_p95: 66 · word_mean: 1.001 · word_median: 1 · n_empty: 0 · n_duplicates: 10 · duplicate_rate: 0.004382 · vocab_size: 2,272 · readability_flesch_mean: -325.5 · emoji_rate: 0 · url_rate: 1 · one_word_rate: 0.9987 · allcaps_rate: 0 · boilerplate_rate: 0

NbrPGICs

numeric feature high_skew outliers
NbrPGICs is a heavily right-skewed count feature, with median 1 and Q3 of 2 but a maximum of 1804 and standard deviation of 48.93. The distribution shows extreme tail behaviour (skew 16.65, kurtosis 404.08) and 701 outliers (11.0% of values), while 10.75% of rows are null. Most records carry a trivial count, but a small subset reports values orders of magnitude larger. Treatment: Log-transform or cap at a high quantile and impute the 10.75% nulls before modelling. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 767 (10.8%) · unique: 155 · min: 1 · max: 1,804 · mean: 7.209 · median: 1 · std: 48.93 · q1: 1 · q3: 2 · iqr: 1 · skew: 16.65 · kurtosis: 404.1 · n_outliers: 701 · outlier_rate: 0.1101 · zero_rate: 0
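Capping at a high quantile and then log-transforming can be combined in one pass. The sketch below uses a simple nearest-rank quantile to stay dependency-free; the 0.99 default, the helper names, and the sample values (other than the maximum of 1,804) are assumptions made for illustration.

```python
import math


def nearest_rank_quantile(values, q):
    """Nearest-rank quantile of the non-null values, for 0 < q <= 1."""
    ordered = sorted(v for v in values if v is not None)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def cap_and_log(values, q=0.99):
    """Cap non-null values at the q-quantile, then apply log1p; keep None."""
    cap = nearest_rank_quantile(values, q)
    return [None if v is None else math.log1p(min(v, cap)) for v in values]


# Illustrative counts; only the maximum of 1,804 comes from the profile.
sample = [1, 1, 2, 1804, None]
print(nearest_rank_quantile(sample, 0.75))  # 2
print(cap_and_log(sample, q=0.75))
```

Nulls are left as None here so that imputation stays a separate, explicit decision.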

NbrCountries

numeric feature high_skew outliers
NbrCountries is a numeric count of countries associated with each record, ranging from 1 to 136 with a median of 1 and Q1=Q3=1, meaning at least three quarters of non-null rows are single-country. The distribution is extremely heavy-tailed (skew 15.6, kurtosis 364) with a 17.75% null rate. Because the IQR is 0, an IQR-based rule flags every value above 1, so the 1,203 outliers (20.5%) are most likely just the multi-country records, and that minority dominates the variance. Treatment: Log1p-transform or bucket into 1 vs. multi-country before modelling, and impute or flag the 17.75% nulls. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 1,266 (17.7%) · unique: 43 · min: 1 · max: 136 · mean: 1.711 · median: 1 · std: 3.901 · q1: 1 · q3: 1 · iqr: 0 · skew: 15.6 · kurtosis: 364.1 · n_outliers: 1,203 · outlier_rate: 0.205 · zero_rate: 0
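The high outlier rate follows from the quartiles if the profiler uses Tukey's 1.5 × IQR fences. The profile does not state its outlier rule, so that is an assumption, though it is consistent with the reported numbers (for example, zero outliers for BibleStatus and PercentAdherents). `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


# Quartiles reported for NbrCountries: q1 = q3 = 1, so the fences collapse.
print(tukey_fences(1, 1))  # (1.0, 1.0)
# Quartiles reported for RLG3: q1 = 1, q3 = 4.
print(tukey_fences(1, 4))  # (-3.5, 8.5)
```

With both fences at 1, every record spanning two or more countries counts as an outlier, so the outlier count here measures how many languages are multi-country rather than how many values are anomalous.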

JF

categorical feature
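Binary Y/N flag with only two distinct values across 7,134 rows and a negligible null rate of 0.0003. The distribution is skewed toward 'N' at 71.6%, leaving 'Y' at about 2,027 occurrences. An entropy ratio of 0.86 indicates the split is imbalanced but still informative. These statistics are identical to those of HasJesusFilm (2 nulls, top_rate 0.7158, entropy 0.8611), so JF is probably a duplicate of that column and should be checked for redundancy. Treatment: Encode as a 0/1 indicator and impute the rare nulls with the mode; keep only one of JF and HasJesusFilm if they prove identical. high · anthropic:claude-opus-4-7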
n: 7,134 · nulls: 2 (0.0%) · unique: 2 · top_value: N · top_rate: 0.7158 · cardinality: 2 · entropy: 0.8611 · entropy_ratio: 0.8611

AudioRecordings

categorical feature
Binary Y/N flag indicating whether audio recordings exist for each row, with only 2 unique values across 7,134 records and a negligible null rate of 0.0003. The split leans toward 'Y' at 59.1% (4,216) versus 'N' (2,916), giving high entropy (0.976) for a binary field. These statistics are identical to those of HasAudioRecordings (2 nulls, top_rate 0.5911, entropy 0.9759), so the two columns are probably duplicates. Treatment: Encode as a 0/1 boolean indicator after imputing the few nulls; keep only one of AudioRecordings and HasAudioRecordings if they prove identical. high · anthropic:claude-opus-4-7
n: 7,134 · nulls: 2 (0.0%) · unique: 2 · top_value: Y · top_rate: 0.5911 · cardinality: 2 · entropy: 0.9759 · entropy_ratio: 0.9759
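JF and AudioRecordings report the same null counts, top values, and entropies as HasJesusFilm and HasAudioRecordings, which suggests the pairs may be redundant. A row-by-row comparison would settle it; the sketch below assumes the records are available as a list of dicts (for instance after loading the source JSON), and both the helper name and the sample rows are made up for illustration.

```python
def columns_identical(rows, col_a, col_b):
    """True if col_a and col_b hold equal values in every row (None equals None)."""
    return all(row.get(col_a) == row.get(col_b) for row in rows)


# Illustrative rows shaped like the dataset's records; the values are made up.
rows = [
    {"HasJesusFilm": "Y", "JF": "Y", "HasAudioRecordings": "N", "AudioRecordings": "N"},
    {"HasJesusFilm": "N", "JF": "N", "HasAudioRecordings": "Y", "AudioRecordings": "Y"},
    {"HasJesusFilm": None, "JF": None, "HasAudioRecordings": None, "AudioRecordings": None},
]
print(columns_identical(rows, "HasJesusFilm", "JF"))
print(columns_identical(rows, "HasAudioRecordings", "AudioRecordings"))
```

Matching summary statistics alone do not prove equality, since two columns can share a distribution while disagreeing on individual rows, so the comparison has to run on the actual records.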