joshua project languages

source /home/coolhand/html/datavis/data_trove/joshua-project/joshua_project_languages.json · 7,134 rows · 26 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This is a Joshua Project languages dataset with 7,134 rows and 26 columns, profiling world languages alongside Bible translation status, audio/film resource availability, primary religion, and host-country distribution. The headline signal is religious-engagement coverage: PrimaryReligion is dominated by Christianity (3,328), followed by Ethnic Religions (1,472) and Islam (945), and JPScale skews toward the more-reached end, with category 5 the largest bucket (2,050). Resource availability is uneven: HasAudioRecordings is roughly 59% Y / 41% N, while HasJesusFilm is only ~28% Y, so the Jesus Film coverage gap is worth a closer look. Geographic concentration is also notable: HubCountry is led by Papua New Guinea (837), Indonesia (686), and Nigeria (494), which together account for about 28% of entries (2,017 of 7,134). Finally, NbrPGICs is extremely skewed (max 1,804, median 1), so any per-language counts should be inspected with that long tail in mind.

citing: PrimaryReligion · JPScale · HasAudioRecordings · HasJesusFilm · HubCountry · NbrPGICs · Status · BibleStatus

Charts the summary said to look at first

PrimaryReligion · Christianity dominates at ~47%, but Ethnic Religions and Islam together cover roughly a third of languages.

Top values for PrimaryReligion (9 unique shown, of 9 total).
value	count	share
Christianity	3328	46.6%
Ethnic Religions	1472	20.6%
Islam	945	13.2%
(empty string)	774	10.8%
Hinduism	268	3.8%
Buddhism	192	2.7%
Unknown	110	1.5%
Other / Small	25	0.4%
Non-Religious	18	0.3%

JPScale · Distribution across the 5-point Joshua Project progress scale, skewed toward the more-reached categories 4 and 5.

Top values for JPScale (5 unique shown, of 5 total).
value	count	share
5	2050	28.7%
4	1900	26.6%
1	1473	20.6%
3	455	6.4%
2	409	5.7%

HasJesusFilm · Only about 28% of languages have a Jesus Film available — a clear coverage gap.

Top values for HasJesusFilm (2 unique shown, of 2 total).
value	count	share
N	5105	71.6%
Y	2027	28.4%

HubCountry · The top three hub countries (Papua New Guinea, Indonesia, Nigeria) together hold about 28% of the languages tracked.

Top values for HubCountry (20 unique shown, of 210 total).
value	count	share
Papua New Guinea	837	11.7%
Indonesia	686	9.6%
Nigeria	494	6.9%
India	383	5.4%
Mexico	277	3.9%
China	256	3.6%
Cameroon	234	3.3%
Australia	192	2.7%
Congo, Democratic Republic of	182	2.6%
United States	179	2.5%
Philippines	175	2.5%
Brazil	175	2.5%
Tanzania	110	1.5%
Vanuatu	108	1.5%
Chad	104	1.5%
Nepal	100	1.4%
Malaysia	99	1.4%
Myanmar (Burma)	94	1.3%
Russia	91	1.3%
Peru	82	1.1%

BibleStatus · Bible translation status spread from 0 to 5 — note the ~15% at zero indicating no scripture available.

Value counts for BibleStatus (median: 3.0). The profiler's 40 histogram bins are empty except at the six integer codes, so only those are listed.
value	count
0	1064
1	488
2	1514
3	1470
4	1812
5	784

Schema

26 columns

Per-column summary. Click column name to jump to its detail.
column	type	null rate	unique	alerts
ROL3	text	0.0%	7,134	near_unique one_word short_text
Language	text	0.0%	7,124	near_unique one_word
WebLangText	text	0.0%	7,134	near_unique one_word
Status	categorical	0.0%	2
ROG3	categorical	0.0%	211
HubCountry	categorical	0.0%	210
BibleStatus	numeric	0.0%	6
GRN_URL	text	41.4%	4,179	near_unique one_word url_heavy null_rate
TranslationNeedQuestionable	unknown	0.0%	—	skipped
BibleYear	categorical	89.0%	488	long_tail null_rate
NTYear	text	63.6%	1,109	one_word allcaps null_rate short_text duplicates
PortionsYear	text	43.0%	1,797	one_word allcaps null_rate short_text duplicates
PercentAdherents	numeric	11.9%	1,349
PercentEvangelical	numeric	17.3%	1,006
HasJesusFilm	categorical	0.0%	2
JF_URL	text	71.6%	2,008	near_unique one_word url_heavy null_rate
HasAudioRecordings	categorical	0.0%	2
JPScale	categorical	11.9%	5
LeastReached	categorical	0.0%	2
RLG3	numeric	10.9%	8
PrimaryReligion	categorical	0.0%	9
FCBH_URL	text	68.0%	2,272	near_unique one_word url_heavy null_rate
NbrPGICs	numeric	10.8%	155	high_skew outliers
NbrCountries	numeric	17.7%	43	high_skew outliers
JF	categorical	0.0%	2
AudioRecordings	categorical	0.0%	2

ROL3

text identifier near_unique one_word short_text

ROL3 is a text column of exactly 7134 unique three-character single-word tokens across 7134 rows, with zero nulls and zero duplicates. The perfect 1:1 cardinality (vocab_size 7134 == n) and uniform len_min/max/mean of 3 strongly suggest this is a row-level identifier or code rather than natural language. Top tokens like 'aou', 'aiw', 'aas' show no repeated values, confirming it carries no distributional signal on its own. Treatment: Drop from modelling or use only as a join key; near-unique three-letter codes carry no predictive signal. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 0 (0.0%)
unique: 7,134
len_min: 3
len_max: 3
len_mean: 3
len_median: 3
len_p95: 3
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 7,134
readability_flesch_mean: 120.4
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

Language

text identifier near_unique one_word

This column holds language names, with 7124 distinct values across 7134 rows and only 10 duplicates, so it is essentially one row per language. Entries are short (mean 9.1 characters, median 7 characters, median 1 word) and 73.5% are single-word labels, though compound names involving directional qualifiers (southern, northern, eastern, western) and family roots (zapotec, mixtec, naga) appear often. The high cardinality combined with the 'language' and 'sign' top tokens suggests this is a catalog of world languages, likely including sign-language variants. Treatment: Treat as a near-unique label and do not one-hot encode it; because 10 values repeat, a join on Language can match more than one row, so de-duplicate first or join on the fully unique ROL3 instead. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 0 (0.0%)
unique: 7,124
len_min: 1
len_max: 45
len_mean: 9.102
len_median: 7
len_p95: 22
word_mean: 1.363
word_median: 1
n_empty: 0
n_duplicates: 10
duplicate_rate: 0.001402
vocab_size: 7,180
readability_flesch_mean: 52.16
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7347
allcaps_rate: 0
boilerplate_rate: 0
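
Since the profile reports 10 duplicated Language values, it may help to list them before using the column as a join key. The following is a minimal sketch, assuming the JSON has been loaded as a list of row dictionaries; the function name and the sample rows (including their ROL3 codes and names) are hypothetical.

```python
from collections import Counter

def duplicated_values(rows, column):
    """Return {value: count} for values of `column` that occur more than once."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical sample rows; the real file has 7,134 of them.
sample_rows = [
    {"ROL3": "aaa", "Language": "Alpha"},
    {"ROL3": "aab", "Language": "Beta"},
    {"ROL3": "aac", "Language": "Alpha"},
]

dupes = duplicated_values(sample_rows, "Language")
print(dupes)
```

An empty result for a column means it is safe to use as a one-to-one key on that data.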

WebLangText

text identifier near_unique one_word

WebLangText appears to be a per-row language name label, with every one of the 7134 values unique and 73% being a single word (mean 1.37 words, median length 7 chars). Top tokens like 'language', 'sign', 'zapotec', 'mixtec', and 'naga' suggest this is an inventory of world languages including sign languages and regional variants (Southern/Northern/Eastern/Western). The full uniqueness (n_unique == n) means it functions as an identifier rather than a categorical feature. Treatment: Treat as a language-name key; left-join on this rather than using as a model feature. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 0 (0.0%)
unique: 7,134
len_min: 1
len_max: 45
len_mean: 9.119
len_median: 7
len_p95: 22
word_mean: 1.366
word_median: 1
n_empty: 0
n_duplicates: 0
duplicate_rate: 0
vocab_size: 7,200
readability_flesch_mean: 52.37
emoji_rate: 0
url_rate: 0
one_word_rate: 0.7318
allcaps_rate: 0
boilerplate_rate: 0

Status

categorical label

Binary status flag with two values, 'L' and 'N', dominated by 'L' at 86.0% (6134 of the 7132 non-null rows) versus 998 for 'N'. Class imbalance is notable, and there are 2 nulls (null_rate 0.0003). Entropy ratio of 0.58 confirms the skewed distribution. Treatment: Encode as binary; address class imbalance (e.g., stratified sampling or class weights) before modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 2 (0.0%)
unique: 2
top_value: L
top_rate: 0.8601
cardinality: 2
entropy: 0.5841
entropy_ratio: 0.5841
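
The entropy figure can be reproduced from the two class counts quoted above. This sketch computes Shannon entropy in bits and divides it by log2 of the number of observed classes; that normalisation is an assumption about how the profiler defines entropy_ratio, though it is consistent with the values reported in this document (for two classes the divisor is 1, so entropy and ratio coincide).

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits of a categorical distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed classes."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return entropy_bits(counts) / math.log2(k)

status_counts = [6134, 998]  # 'L' and 'N' counts from the Status column
status_entropy = entropy_bits(status_counts)
status_ratio = entropy_ratio(status_counts)
print(round(status_entropy, 4), round(status_ratio, 4))
```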

ROG3

categorical feature

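ROG3 is a categorical code field with 211 distinct two-letter values, led by 'PP' at 11.7% (837 rows) and followed by 'ID', 'NI', 'IN', 'MX', a distribution consistent with country or region codes. Entropy is 5.64 (ratio 0.73), indicating broad spread across the 211 categories rather than concentration in a few. Nulls are negligible (0.03%). The top-10 values include codes that coincide with ISO 3166 ('IN', 'MX', 'US') and codes that do not fit ISO for the leading hub countries ('PP', 'NI', 'CG', and 'CH', which is Switzerland under ISO). The pattern is consistent with FIPS 10-4 country codes, where PP is Papua New Guinea, NI is Nigeria and CH is China; the 'PP' count of 837 also matches the Papua New Guinea count in HubCountry. Treatment: Target-encode or group rare levels before modelling; confirm the code standard (FIPS 10-4 looks more likely than ISO 3166) before joining to external country tables. medium · anthropic:claude-opus-4-7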

n: 7,134
nulls: 2 (0.0%)
unique: 211
top_value: PP
top_rate: 0.1174
cardinality: 211
entropy: 5.642
entropy_ratio: 0.7307
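
ROG3 has 211 distinct values while HubCountry has 210, so the two columns may not map one-to-one. A rough sketch of a check follows, assuming rows are dictionaries; the function name is hypothetical, the PP and ID pairings are inferred from matching counts in this profile, and the 'XX'/'YY'/'Exampleland' rows are invented purely to show a conflict.

```python
from collections import defaultdict

def code_name_conflicts(rows, code_col="ROG3", name_col="HubCountry"):
    """Find codes with several names and names with several codes."""
    names_by_code = defaultdict(set)
    codes_by_name = defaultdict(set)
    for row in rows:
        code, name = row.get(code_col), row.get(name_col)
        if code is None or name is None:
            continue  # skip the few rows where either field is null
        names_by_code[code].add(name)
        codes_by_name[name].add(code)
    multi_name = {c: sorted(n) for c, n in names_by_code.items() if len(n) > 1}
    multi_code = {n: sorted(c) for n, c in codes_by_name.items() if len(c) > 1}
    return multi_name, multi_code

sample_rows = [
    {"ROG3": "PP", "HubCountry": "Papua New Guinea"},
    {"ROG3": "ID", "HubCountry": "Indonesia"},
    {"ROG3": "XX", "HubCountry": "Exampleland"},  # invented conflict
    {"ROG3": "YY", "HubCountry": "Exampleland"},  # invented conflict
    {"ROG3": None, "HubCountry": None},
]

multi_name, multi_code = code_name_conflicts(sample_rows)
print(multi_name, multi_code)
```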

HubCountry

categorical feature

HubCountry is a categorical country-name field with 210 distinct values across 7,134 rows and a near-zero null rate (0.0003). The distribution is broad rather than concentrated: the entropy ratio is 0.73, and the top value, Papua New Guinea, covers only 11.7% of records, followed by Indonesia (686) and Nigeria (494). The leading countries are those with the greatest linguistic diversity rather than the largest populations or economies, which is worth noting before any geographic modelling. Treatment: Group long-tail countries into regions or frequency buckets before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 2 (0.0%)
unique: 210
top_value: Papua New Guinea
top_rate: 0.1174
cardinality: 210
entropy: 5.641
entropy_ratio: 0.7312
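
One way to apply the frequency-bucket treatment is to fold every country below a count threshold into a single label. This is a sketch only; the threshold, the 'Other' label and the function name are assumptions, and the sample counts are scaled down for illustration.

```python
from collections import Counter

def bucket_rare(values, min_count, other_label="Other"):
    """Replace values seen fewer than `min_count` times with `other_label`."""
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is None or counts[v] >= min_count else other_label
        for v in values
    ]

# Scaled-down illustration, not the real counts.
countries = ["Papua New Guinea"] * 3 + ["Indonesia"] * 2 + ["Peru", None]
bucketed = bucket_rare(countries, min_count=2)
print(Counter(bucketed))
```

Nulls are passed through unchanged so that missingness can be handled separately.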

BibleStatus

numeric feature

BibleStatus is an integer-coded categorical with 6 distinct values from 0 to 5, mean 2.68 and median 3, almost certainly an ordinal status/level code rather than a true numeric measure. About 14.9% of rows are zero and the distribution is mildly left-skewed (skew -0.34) with flat kurtosis (-0.91), suggesting a fairly even spread across the upper levels with a sizable zero/'none' bucket. Null rate is negligible (0.0003) and no outliers were flagged. Treatment: Treat as an ordinal categorical (one-hot or ordered encoding) rather than a continuous numeric. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 2 (0.0%)
unique: 6
min: 0
max: 5
mean: 2.677
median: 3
std: 1.555
q1: 2
q3: 4
iqr: 2
skew: -0.3401
kurtosis: -0.9086
n_outliers: 0
outlier_rate: 0
zero_rate: 0.1492
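
As a consistency check, the mean and zero rate above can be recomputed from the six per-code counts shown in the BibleStatus chart data. The snippet below is plain arithmetic on those published counts; only the variable names are new.

```python
# Counts per BibleStatus code, taken from the chart data in this report.
bible_status_counts = {0: 1064, 1: 488, 2: 1514, 3: 1470, 4: 1812, 5: 784}

n_non_null = sum(bible_status_counts.values())
zero_rate = bible_status_counts[0] / n_non_null
mean_status = sum(code * n for code, n in bible_status_counts.items()) / n_non_null

print(n_non_null, round(zero_rate, 4), round(mean_status, 3))
```

The total equals 7,134 rows minus the 2 nulls, so the rates in this block are shares of non-null rows.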

GRN_URL

text identifier near_unique one_word url_heavy null_rate

This column holds Global Recordings Network language URLs, every value a fixed 44-character single token under https://globalrecordings.net/en/language/ followed by a language code. With 4179 unique values among the 4,180 non-null rows and a 41.41% null rate, it functions as a per-language identifier link rather than a feature. Only one URL is duplicated (the one ending in idt appears twice), and 41% of rows have no GRN link at all. Treatment: Extract the trailing language code as a foreign key; otherwise drop the URL itself from modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 2,954 (41.4%)
unique: 4,179
len_min: 44
len_max: 44
len_mean: 44
len_median: 44
len_p95: 44
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 1
duplicate_rate: 0.0002392
vocab_size: 4,179
readability_flesch_mean: -435
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0
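
Extracting the trailing code is a simple prefix strip, given the fixed URL shape described above. A minimal sketch, with a hypothetical function name; it returns None for nulls and for anything that does not start with the expected prefix.

```python
GRN_PREFIX = "https://globalrecordings.net/en/language/"

def grn_language_code(url):
    """Return the code that follows the GRN language prefix, or None."""
    if not url or not url.startswith(GRN_PREFIX):
        return None
    code = url[len(GRN_PREFIX):].strip("/")
    return code or None

print(grn_language_code(GRN_PREFIX + "idt"))
print(grn_language_code(None))
```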

TranslationNeedQuestionable

unknown other skipped

The column 'TranslationNeedQuestionable' was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 7134 and a null rate of 0.0. The name suggests a boolean or flag indicating whether the need for translation is in doubt, but this cannot be confirmed from the evidence. No distribution, unique count, or sample values were captured. Treatment: Re-profile with type inference enabled before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 7,134
nulls: 0 (0.0%)
unique: —

BibleYear

categorical free_text long_tail null_rate

Free-text field nominally capturing a year associated with a Bible (likely year acquired or published), but heavily polluted: 89.01% of 7134 rows are null, and among the 488 unique values the most common entry is "2023" at just 3.4% of non-null values (about 27 rows), with "Yes" the second most frequent value (22 times), which indicates the field also absorbed yes/no answers. Entropy ratio of 0.93 confirms a long, flat tail with no dominant year. Treatment: Clean by coercing to integer years and routing non-numeric responses (e.g., "Yes") to a separate flag before use. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 6,350 (89.0%)
unique: 488
top_value: 2023
top_rate: 0.03444
cardinality: 488
entropy: 8.296
entropy_ratio: 0.9289

NTYear

text feature one_word allcaps null_rate short_text duplicates

Despite the name suggesting a year, NTYear is a single-token text column with mixed semantics: the most frequent value is 'Yes' (147 rows), followed by four-digit years from 2016-2024. It is 63.61% null, and 57.28% of the 2,596 non-null values are duplicates across only 1109 unique values, with one_word_rate of 1.0 and allcaps_rate of 0.94. The length statistics (median and maximum of 9 characters) show that many values are longer than a bare four-digit year, so inspect the actual formats before parsing. The mix of a yes/no token alongside year values suggests two questions were collapsed into one field. Treatment: Split into two columns (boolean indicator vs. numeric year) before any modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 4,538 (63.6%)
unique: 1,109
len_min: 3
len_max: 9
len_mean: 6.694
len_median: 9
len_p95: 9
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 1,487
duplicate_rate: 0.5728
vocab_size: 1,109
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.9434
boilerplate_rate: 0
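
The split recommended for the mixed yes/year fields could look like the sketch below. It assumes tokens are either 'Yes', a four-digit year, or a longer token containing one or more four-digit years; keeping the latest year in the last case is a hypothetical choice, not something the profile confirms, and the function name is invented.

```python
import re

# Matches a standalone four-digit run such as 2016 or 2024.
YEAR_RE = re.compile(r"(?<!\d)\d{4}(?!\d)")

def split_year_field(raw):
    """Return (has_item, year) from a field that mixes 'Yes' with years."""
    if raw is None:
        return None, None
    text = str(raw).strip()
    years = [int(y) for y in YEAR_RE.findall(text)]
    if years:
        return True, max(years)  # assumption: keep the latest year found
    if text.lower() == "yes":
        return True, None        # item exists, year not recorded
    return None, None            # unrecognised token: leave for manual review

for token in ["Yes", "2024", "2016-2024", None]:
    print(token, split_year_field(token))
```

Unrecognised tokens are returned as unknown rather than coerced, so they can be counted and reviewed.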

PortionsYear

text feature one_word allcaps null_rate short_text duplicates

Despite the name PortionsYear, this column mixes a yes/no flag with four-digit years: every entry is a single word, the most common value is 'Yes' (706 occurrences) followed by years like 2024 (107), 2022 (55) and 2025 (43). 43% of rows are null and 55.8% of the non-null values are duplicates across only 1,797 unique tokens, with 82.6% in all-caps. The semantic mix of a boolean and a year in one field is the headline anomaly. Treatment: Split into two columns — a boolean 'has portions' flag and a parsed year — before modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 3,068 (43.0%)
unique: 1,797
len_min: 3
len_max: 9
len_mean: 6.372
len_median: 9
len_p95: 9
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 2,269
duplicate_rate: 0.558
vocab_size: 1,797
readability_flesch_mean: 121.2
emoji_rate: 0
url_rate: 0
one_word_rate: 1
allcaps_rate: 0.8264
boilerplate_rate: 0

PercentAdherents

numeric feature

PercentAdherents is a numeric share variable bounded between 0 and 100, almost certainly the percentage of some population that adheres to a religion or group. The distribution appears U-shaped, with mass near both ends: the IQR spans 5.34 to 90.0 with a median of 58.33, the kurtosis of -1.68 indicates a flat spread rather than a central peak, and 7.3% of values are exactly zero. Nearly 12% of rows are null, which is worth flagging before any aggregation. Treatment: Impute or filter the 11.87% nulls and consider binning given the U-shaped distribution before modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 847 (11.9%)
unique: 1,349
min: 0
max: 100
mean: 49.63
median: 58.33
std: 38.83
q1: 5.34
q3: 90
iqr: 84.66
skew: -0.08408
kurtosis: -1.679
n_outliers: 0
outlier_rate: 0
zero_rate: 0.07285

PercentEvangelical

numeric feature

This appears to be a percentage feature capturing the share of evangelicals in each record's population, ranging from 0 to 95 with a median of just 5. The distribution is heavily right-skewed (skew 1.93, kurtosis 4.92) and 9% of values are exact zeros, with 251 outliers (4.3%) pulling the mean up to 10.1. Note also that 17.3% of rows are null, which is substantial. Treatment: Impute or flag the 17% nulls and consider a log1p transform before modelling to tame the right skew. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 1,234 (17.3%)
unique: 1,006
min: 0
max: 95
mean: 10.13
median: 5
std: 12.24
q1: 1
q3: 15.55
iqr: 14.55
skew: 1.932
kurtosis: 4.925
n_outliers: 251
outlier_rate: 0.04254
zero_rate: 0.09
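
A log1p transform paired with a missing-value flag is one way to follow the treatment above without inventing values for the 17% nulls. This is a sketch under that assumption; the function name is hypothetical, and imputation is deliberately left out.

```python
import math

def log1p_with_flag(value):
    """Return (log1p(value), 0) for a number, or (None, 1) when missing."""
    if value is None:
        return None, 1
    return math.log1p(value), 0

# Median, maximum, zero and a missing value from the ranges reported above.
for v in [5, 95, 0, None]:
    print(v, log1p_with_flag(v))
```

log1p is used rather than log because about 9% of the values are exactly zero.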

HasJesusFilm

categorical feature

Binary Y/N flag indicating whether each record has an associated Jesus Film, with only 2 unique values across 7134 rows and a negligible 0.0003 null rate. The distribution is moderately imbalanced: 'N' dominates at 71.6% (5105) versus 2027 'Y' values, yielding an entropy ratio of 0.86. Treatment: Encode as a 0/1 boolean indicator for modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 2 (0.0%)
unique: 2
top_value: N
top_rate: 0.7158
cardinality: 2
entropy: 0.8611
entropy_ratio: 0.8611

JF_URL

text metadata near_unique one_word url_heavy null_rate

This column holds JesusFilm.org URLs (url_rate 1.0, one_word_rate 0.9995), almost all pointing to language-specific watch pages like /watch/jesus.html/{language}.html. Coverage is sparse with a 71.6% null rate, and values are near-unique (2008 distinct among the 2,027 non-null rows), though a generic partners/resources page appears 13 times and 19 duplicates exist overall. URL lengths are tight (45-87 chars, median 55), consistent with a templated link rather than free text. Treatment: Treat as a reference link; extract the language slug from the path if you need a feature, otherwise drop from modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 5,107 (71.6%)
unique: 2,008
len_min: 45
len_max: 87
len_mean: 56.77
len_median: 55
len_p95: 69
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 19
duplicate_rate: 0.009373
vocab_size: 2,009
readability_flesch_mean: -781.9
emoji_rate: 0
url_rate: 1
one_word_rate: 0.9995
allcaps_rate: 0
boilerplate_rate: 0

HasAudioRecordings

categorical feature

Binary Y/N flag indicating whether a record has associated audio recordings. The split is 59.1% Y versus 40.9% N, with entropy ratio 0.976 showing near-maximal balance for a two-class field. Null rate is negligible (0.0003). Treatment: Encode as a boolean indicator before modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 2 (0.0%)
unique: 2
top_value: Y
top_rate: 0.5911
cardinality: 2
entropy: 0.9759
entropy_ratio: 0.9759

JPScale

categorical feature

JPScale is a low-cardinality categorical with 5 distinct values ('1' through '5'), suggesting an ordinal rating or scale. The distribution is concentrated at both ends rather than in the middle: '5' (2,050 rows) and '4' (1,900) lead, '1' (1,473) follows, and the middle values '3' (455) and '2' (409) are comparatively rare. The top_rate of 0.3261 for '5' is a share of the 6,287 non-null rows; as shares of all 7,134 rows, '5' is 28.7%, '4' is 26.6% and '1' is 20.6%. Entropy ratio of 0.89 indicates fairly even spread across categories, but 11.87% of rows are null. Treatment: Treat as ordinal (1–5); impute or flag the ~12% nulls before modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 847 (11.9%)
unique: 5
top_value: 5
top_rate: 0.3261
cardinality: 5
entropy: 2.07
entropy_ratio: 0.8915
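
The two percentages quoted for category 5 in this report (28.7% in the chart table, 0.3261 as top_rate) differ only in their denominator. The arithmetic below reproduces both from the published counts; the variable names are the only additions.

```python
# JPScale counts from the chart data in this report.
jpscale_counts = {"5": 2050, "4": 1900, "1": 1473, "3": 455, "2": 409}

n_rows = 7134
n_non_null = sum(jpscale_counts.values())  # rows with a JPScale value

share_of_all = {k: c / n_rows for k, c in jpscale_counts.items()}
share_of_non_null = {k: c / n_non_null for k, c in jpscale_counts.items()}

print(n_non_null, round(share_of_all["5"], 3), round(share_of_non_null["5"], 4))
```

When comparing categories, it is safer to pick one denominator and use it throughout.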

LeastReached

categorical feature

Binary Y/N flag indicating whether some 'least reached' status applies, with N dominating at 79.3% (5659) versus 1473 Y values across 7134 rows. Class imbalance is notable but not extreme, and nulls are negligible (0.03%). Cardinality is exactly 2 with entropy ratio 0.73, consistent with a clean boolean indicator. The Y count equals the JPScale category 1 count (1,473), which suggests the flag may be derived from JPScale; check this before using both as features. Treatment: Encode as boolean (Y=1, N=0) and account for the ~80/20 imbalance if used as a target. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 2 (0.0%)
unique: 2
top_value: N
top_rate: 0.7935
cardinality: 2
entropy: 0.7348
entropy_ratio: 0.7348

RLG3

numeric feature

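RLG3 is a small-cardinality integer code (only 8 unique values across 7134 rows, ranging 1-9 with no zeros), which points to a category code rather than a true continuous measure. Its counts line up with PrimaryReligion: the 8 codes match the 8 named religion categories, the 776 nulls equal the 774 empty-string entries plus the 2 nulls there, and the 110 flagged outliers equal the 'Unknown' count, so RLG3 is probably the numeric code for PrimaryReligion. If so, the codes are nominal, and the reported median (1), skew (0.74) and outliers describe the code numbering rather than a quantity. Note the 10.88% null rate, which is non-trivial and should be addressed before modelling. Treatment: Treat as a nominal categorical rather than ordinal, impute or flag the ~11% missing values, and keep only one of RLG3 and PrimaryReligion. high · anthropic:claude-opus-4-7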

n: 7,134
nulls: 776 (10.9%)
unique: 8
min: 1
max: 9
mean: 2.819
median: 1
std: 2.144
q1: 1
q3: 4
iqr: 3
skew: 0.7378
kurtosis: -0.5707
n_outliers: 110
outlier_rate: 0.0173
zero_rate: 0

PrimaryReligion

categorical feature

Categorical label for the dominant religion of each record, with 9 distinct values across 7134 rows. Christianity leads at 46.7% (3328), followed by Ethnic Religions (1472) and Islam (945). Note that the empty-string category appears 774 times alongside an explicit 'Unknown' bucket of 110, so two separate missingness conventions coexist beyond the 0.03% true nulls. Treatment: Consolidate the empty string into 'Unknown' and one-hot encode the resulting 8 categories. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 2 (0.0%)
unique: 9
top_value: Christianity
top_rate: 0.4666
cardinality: 9
entropy: 2.179
entropy_ratio: 0.6873
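
The consolidation and encoding steps can be sketched as follows. Mapping true nulls to 'Unknown' as well is an extra assumption beyond the treatment above, and both function names are hypothetical.

```python
def consolidate_religion(value, unknown_label="Unknown"):
    """Map empty strings (and, by assumption, nulls) to the 'Unknown' label."""
    if value is None or not str(value).strip():
        return unknown_label
    return str(value).strip()

def one_hot(values):
    """Return (sorted categories, one indicator row per input value)."""
    categories = sorted(set(values))
    matrix = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, matrix

raw = ["Christianity", "", None, "Islam", "Unknown"]
cleaned = [consolidate_religion(v) for v in raw]
categories, matrix = one_hot(cleaned)
print(categories)
print(matrix)
```

Applied to the real column, this would fold the 774 empty strings into the existing 110 'Unknown' rows.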

FCBH_URL

text metadata near_unique one_word url_heavy null_rate

This column holds a single URL per row pointing to Faith Comes By Hearing resources (apk.fcbh.org or live.bible.is), with url_rate at 1.0 and one_word_rate at 0.9987. It is largely missing (null_rate 0.6801), and of the 2,282 populated rows 2,272 values are unique, with only 10 duplicates. Lengths are tight (min 25, max 100, mean 37.68), consistent with a structured link field rather than free text. Treatment: Treat as an optional reference link; keep as-is for lookup, do not feed into modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 4,852 (68.0%)
unique: 2,272
len_min: 25
len_max: 100
len_mean: 37.68
len_median: 34
len_p95: 66
word_mean: 1.001
word_median: 1
n_empty: 0
n_duplicates: 10
duplicate_rate: 0.004382
vocab_size: 2,272
readability_flesch_mean: -325.5
emoji_rate: 0
url_rate: 1
one_word_rate: 0.9987
allcaps_rate: 0
boilerplate_rate: 0

NbrPGICs

numeric feature high_skew outliers

NbrPGICs is a heavily right-skewed count feature, with median 1 and Q3 of 2 but a maximum of 1804 and standard deviation of 48.93. The distribution shows extreme tail behaviour (skew 16.65, kurtosis 404.08) and 701 outliers (11.0% of values), while 10.75% of rows are null. Most records carry a trivial count, but a small subset reports values orders of magnitude larger. Treatment: Log-transform or cap at a high quantile and impute the 10.75% nulls before modelling. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 767 (10.8%)
unique: 155
min: 1
max: 1,804
mean: 7.209
median: 1
std: 48.93
q1: 1
q3: 2
iqr: 1
skew: 16.65
kurtosis: 404.1
n_outliers: 701
outlier_rate: 0.1101
zero_rate: 0
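
Capping at a high quantile and then applying log1p is one way to tame a tail like this. The sketch below uses a simple nearest-rank quantile; the quantile level, the function names and the sample values (apart from the reported maximum of 1,804) are assumptions for illustration.

```python
import math

def nearest_rank_quantile(values, q):
    """Nearest-rank quantile of a non-empty list, for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def cap_and_log(values, q=0.99):
    """Cap non-null values at the q-quantile, then apply log1p; keep nulls."""
    present = [v for v in values if v is not None]
    cap = nearest_rank_quantile(present, q)
    return [None if v is None else math.log1p(min(v, cap)) for v in values]

# Illustrative counts with one extreme value and one null.
counts = [1, 1, 1, 2, 2, 3, 5, 8, 40, 1804, None]
transformed = cap_and_log(counts, q=0.9)
print(transformed)
```

Nulls are left in place so they can be imputed or flagged in a separate step.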

NbrCountries

numeric feature high_skew outliers

NbrCountries is a numeric count of countries associated with each record, ranging from 1 to 136 with a median of 1 and Q1=Q3=1, meaning at least three quarters of rows are single-country. The distribution is extremely heavy-tailed (skew 15.6, kurtosis 364) with 1203 outliers (20.5% outlier rate) and a 17.75% null rate, so a small minority of multi-country records dominate the variance. Treatment: Log1p-transform or bucket into 1 vs. multi-country before modelling, and impute or flag the 17.75% nulls. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 1,266 (17.7%)
unique: 43
min: 1
max: 136
mean: 1.711
median: 1
std: 3.901
q1: 1
q3: 1
iqr: 0
skew: 15.6
kurtosis: 364.1
n_outliers: 1,203
outlier_rate: 0.205
zero_rate: 0

JF

categorical feature

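Binary Y/N flag with only two distinct values across 7134 rows and a negligible null rate of 0.0003. The distribution is skewed toward 'N' at 71.6%, leaving 'Y' at 2,027 occurrences. Entropy ratio of 0.86 indicates the split is imbalanced but still informative. Every statistic here matches HasJesusFilm (null count, top value, top rate, entropy), so JF is likely a duplicate of that column. Treatment: Confirm the duplication and keep only one of the two; encode the survivor as a 0/1 indicator and impute the rare nulls with the mode. high · anthropic:claude-opus-4-7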

n: 7,134
nulls: 2 (0.0%)
unique: 2
top_value: N
top_rate: 0.7158
cardinality: 2
entropy: 0.8611
entropy_ratio: 0.8611

AudioRecordings

categorical feature

Binary Y/N flag indicating whether audio recordings exist for each row, with only 2 unique values across 7,134 records and a negligible null rate of 0.0003. The split leans toward 'Y' at 59.1% (4,216) versus 'N' (2,916), giving high entropy (0.976) for a binary field. Every statistic here matches HasAudioRecordings, so this column is likely a duplicate of it. Treatment: Confirm the duplication and keep only one of the two; encode the survivor as a 0/1 boolean indicator after imputing the few nulls. high · anthropic:claude-opus-4-7

n: 7,134
nulls: 2 (0.0%)
unique: 2
top_value: Y
top_rate: 0.5911
cardinality: 2
entropy: 0.9759
entropy_ratio: 0.9759