Joshua Project unreached
Reading
This dataset is a Joshua Project catalogue of 7,124 unreached people groups described across 109 fields covering geography, language, religion, population, and outreach status. Every row is flagged as 'Unreached' (JPScaleText is constant) and 'LeastReached' is uniformly Y, so the analytical interest sits in the breakdown by region, religion, and population rather than in reach status itself. The data is heavily skewed toward Asia (5,351 of 7,124) and especially South Asian Peoples (3,681), with India alone accounting for 2,032 groups; Islam (3,279) and Hinduism (2,142) dominate PrimaryReligion. Population is extremely long-tailed (median 30,000 vs. max 135.5M, skew ~21), so any size-based analysis should use log scales or medians. Worth a closer look first: the Continent/Region/Country concentration, the religion mix, and the population distribution — these three together explain most of the dataset's shape.
citing: row_count · column_count · Continent.top_values · AffinityBloc.top_values · Ctry.top_values · PrimaryReligion.top_values · Population.stats · RegionName.top_values · Frontier.top_values · JPScaleText.top_values · LeastReached.top_values · PrimaryLanguageName.top_values
Charts the summary said to look at first
Continent
| value | count | share |
|---|---|---|
| Asia | 5351 | 75.1% |
| Africa | 986 | 13.8% |
| Europe | 431 | 6.0% |
| North America | 175 | 2.5% |
| South America | 106 | 1.5% |
| Australia | 39 | 0.5% |
| Oceania | 36 | 0.5% |
PrimaryReligion
| value | count | share |
|---|---|---|
| Islam | 3279 | 46.0% |
| Hinduism | 2142 | 30.1% |
| Ethnic Religions | 933 | 13.1% |
| Buddhism | 480 | 6.7% |
| Unknown | 157 | 2.2% |
| Other / Small | 120 | 1.7% |
| Non-Religious | 13 | 0.2% |
RegionName
| value | count | share |
|---|---|---|
| Asia, South | 3349 | 47.0% |
| Asia, Southeast | 726 | 10.2% |
| Asia, Northeast | 521 | 7.3% |
| Africa, West and Central | 460 | 6.5% |
| Africa, North and Middle East | 444 | 6.2% |
| Africa, East and Southern | 373 | 5.2% |
| Asia, Central | 352 | 4.9% |
| Europe, Western | 320 | 4.5% |
| Europe, Eastern and Eurasia | 223 | 3.1% |
| America, North and Caribbean | 160 | 2.2% |
| America, Latin | 121 | 1.7% |
| Australia and Pacific | 75 | 1.1% |
Population (histogram bins)
| bin | count |
|---|---|
| 10 – 3.388e+06 | 6930 |
| 3.388e+06 – 6.777e+06 | 78 |
| 6.777e+06 – 1.016e+07 | 36 |
| 1.016e+07 – 1.355e+07 | 17 |
| 1.355e+07 – 1.694e+07 | 11 |
| 1.694e+07 – 2.033e+07 | 8 |
| 2.033e+07 – 2.372e+07 | 7 |
| 2.372e+07 – 2.711e+07 | 3 |
| 2.711e+07 – 3.049e+07 | 2 |
| 3.049e+07 – 3.388e+07 | 1 |
| 3.388e+07 – 3.727e+07 | 2 |
| 3.727e+07 – 4.066e+07 | 4 |
| 4.066e+07 – 4.405e+07 | 0 |
| 4.405e+07 – 4.744e+07 | 3 |
| 4.744e+07 – 5.082e+07 | 1 |
| 5.082e+07 – 5.421e+07 | 0 |
| 5.421e+07 – 5.76e+07 | 1 |
| 5.76e+07 – 6.099e+07 | 0 |
| 6.099e+07 – 6.438e+07 | 1 |
| 6.438e+07 – 6.777e+07 | 1 |
| 6.777e+07 – 7.115e+07 | 0 |
| 7.115e+07 – 7.454e+07 | 0 |
| 7.454e+07 – 7.793e+07 | 0 |
| 7.793e+07 – 8.132e+07 | 0 |
| 8.132e+07 – 8.471e+07 | 0 |
| 8.471e+07 – 8.81e+07 | 0 |
| 8.81e+07 – 9.148e+07 | 0 |
| 9.148e+07 – 9.487e+07 | 0 |
| 9.487e+07 – 9.826e+07 | 0 |
| 9.826e+07 – 1.016e+08 | 1 |
| 1.016e+08 – 1.05e+08 | 0 |
| 1.05e+08 – 1.084e+08 | 0 |
| 1.084e+08 – 1.118e+08 | 0 |
| 1.118e+08 – 1.152e+08 | 0 |
| 1.152e+08 – 1.186e+08 | 1 |
| 1.186e+08 – 1.22e+08 | 0 |
| 1.22e+08 – 1.254e+08 | 0 |
| 1.254e+08 – 1.288e+08 | 0 |
| 1.288e+08 – 1.321e+08 | 0 |
| 1.321e+08 – 1.355e+08 | 1 |
AffinityBloc
| value | count | share |
|---|---|---|
| South Asian Peoples | 3681 | 51.7% |
| Sub-Saharan Peoples | 632 | 8.9% |
| Arab World | 475 | 6.7% |
| Southeast Asian Peoples | 451 | 6.3% |
| Malay Peoples | 339 | 4.8% |
| Tibetan-Himalayan Peoples | 287 | 4.0% |
| Turkic Peoples | 269 | 3.8% |
| Persian-Median | 225 | 3.2% |
| Eurasian Peoples | 166 | 2.3% |
| Deaf | 151 | 2.1% |
| East Asian Peoples | 140 | 2.0% |
| Jewish | 128 | 1.8% |
| Horn of Africa Peoples | 83 | 1.2% |
| Latin-Caribbean Americans | 81 | 1.1% |
| Pacific Islanders | 13 | 0.2% |
| North American Peoples | 3 | 0.0% |
Schema
109 columns
| Column | Type | Nulls | Unique | Alerts |
|---|---|---|---|---|
| PeopleID3ROG3 | text | 0.0% | 7,124 | near_unique, one_word, allcaps, short_text |
| ROG3 | categorical | 0.0% | 202 | |
| PeopleID3 | numeric | 0.0% | 4,614 | |
| ROP3 | numeric | 0.1% | 4,608 | |
| PeopNameInCountry | text | 0.0% | 4,722 | one_word, duplicates |
| ROG2 | categorical | 0.0% | 7 | |
| Continent | categorical | 0.0% | 7 | |
| RegionName | categorical | 0.0% | 12 | |
| ISO3 | categorical | 0.0% | 202 | |
| LocationInCountry | text | 64.2% | 2,176 | multilingual, null_rate |
| PeopleID1 | numeric | 0.0% | 16 | outliers |
| ROP1 | categorical | 0.0% | 16 | |
| AffinityBloc | categorical | 0.0% | 16 | |
| PeopleID2 | numeric | 0.0% | 205 | |
| ROP2 | categorical | 0.0% | 155 | |
| PeopleCluster | categorical | 0.0% | 205 | |
| PeopNameAcrossCountries | text | 0.0% | 4,604 | one_word, duplicates |
| Population | numeric | 0.2% | 1,200 | high_skew, outliers |
| Category | categorical | 0.0% | 3 | |
| ROL3 | text | 0.0% | 1,565 | one_word, short_text, duplicates |
| PrimaryLanguageName | text | 0.0% | 1,563 | one_word, short_text, duplicates |
| PrimaryLanguageDialect | categorical | 94.5% | 303 | long_tail, null_rate |
| NumberLanguagesSpoken | numeric | 0.0% | 69 | high_skew, outliers |
| OfficialLang | categorical | 0.1% | 79 | |
| SpeakNationalLang | unknown | 0.0% | n/a | skipped |
| BibleStatus | numeric | 0.0% | 6 | outliers |
| BibleYear | categorical | 45.8% | 163 | null_rate |
| NTYear | categorical | 24.6% | 305 | null_rate |
| PortionsYear | categorical | 13.5% | 460 | |
| TranslationNeedQuestionable | unknown | 0.0% | n/a | skipped |
| JPScale | numeric | 0.0% | 1 | constant |
| JPScalePC | categorical | 0.0% | 5 | |
| JPScalePGAC | categorical | 0.0% | 5 | imbalance |
| LeastReached | categorical | 0.0% | 1 | imbalance |
| LeastReachedPC | categorical | 0.0% | 2 | |
| LeastReachedPGAC | categorical | 0.0% | 2 | imbalance |
| GSEC | categorical | 0.0% | 8 | |
| HasAudioRecordings | categorical | 0.0% | 2 | |
| NTOnline | categorical | 22.4% | 1 | null_rate, imbalance |
| RLG3 | numeric | 0.0% | 7 | outliers |
| RLG3PC | numeric | 0.0% | 8 | outliers |
| RLG3PGAC | numeric | 0.0% | 8 | outliers |
| PrimaryReligion | categorical | 0.0% | 7 | |
| PrimaryReligionPC | categorical | 0.0% | 8 | |
| PrimaryReligionPGAC | categorical | 0.0% | 8 | |
| RLG4 | numeric | 92.4% | 18 | null_rate, outliers |
| ReligionSubdivision | categorical | 92.4% | 18 | null_rate |
| PCIslam | numeric | 0.1% | 902 | |
| PCNonReligious | numeric | 0.3% | 152 | high_skew, outliers |
| PCUnknown | numeric | 0.4% | 388 | high_skew, outliers |
| SecurityLevel | numeric | 0.0% | 3 | outliers |
| LRTop100 | categorical | 0.0% | 2 | imbalance |
| PhotoAddress | text | 0.0% | 2,880 | one_word, short_text, duplicates |
| PhotoCredits | categorical | 0.1% | 851 | long_tail |
| PhotoCreditURL | categorical | 36.0% | 774 | long_tail, null_rate |
| PhotoCreativeCommons | categorical | 0.1% | 2 | |
| PhotoCopyright | categorical | 0.2% | 2 | |
| PhotoPermission | categorical | 0.2% | 3 | |
| ProfileTextExists | categorical | 0.0% | 2 | imbalance |
| CountOfCountries | numeric | 0.0% | 39 | high_skew, outliers |
| CountOfProvinces | unknown | 0.0% | n/a | skipped |
| Longitude | numeric | 0.0% | 6,713 | |
| Latitude | numeric | 0.0% | 6,696 | |
| Ctry | categorical | 0.0% | 202 | |
| IndigenousCode | categorical | 0.0% | 2 | |
| PercentAdherents | categorical | 0.0% | 692 | long_tail |
| PercentChristianPC | categorical | 0.0% | 184 | |
| NaturalName | text | 0.0% | 4,705 | one_word, duplicates |
| NaturalPronunciation | text | 48.5% | 1,489 | one_word, null_rate, duplicates |
| PercentChristianPGAC | categorical | 0.1% | 842 | |
| PercentEvangelical | categorical | 10.4% | 401 | long_tail |
| PercentEvangelicalPC | categorical | 2.1% | 166 | |
| PercentEvangelicalPGAC | categorical | 6.3% | 548 | |
| PCBuddhism | numeric | 0.3% | 809 | high_skew, outliers |
| PCEthnicReligions | numeric | 0.3% | 351 | high_skew, outliers |
| PCHinduism | numeric | 0.3% | 1,131 | |
| PCOtherSmall | numeric | 0.3% | 670 | high_skew, outliers |
| RegionCode | numeric | 0.0% | 12 | outliers |
| PopulationPGAC | numeric | 0.1% | 1,509 | high_skew, outliers |
| Frontier | categorical | 0.0% | 2 | |
| MapAddress | text | 0.0% | 4,616 | one_word, short_text, duplicates |
| HasJesusFilm | categorical | 0.0% | 2 | |
| Nomadic | categorical | 0.0% | 2 | imbalance |
| NomadicTypeDescription | categorical | 96.6% | 6 | null_rate |
| PhotoCCVersionText | categorical | 0.0% | 16 | |
| PhotoCCVersionURL | categorical | 0.0% | 16 | |
| MapCredits | categorical | 0.0% | 161 | long_tail |
| MapCreditURL | categorical | 0.0% | 31 | long_tail, imbalance |
| MapCopyright | categorical | 0.0% | 3 | |
| MapCCVersionText | categorical | 0.0% | 4 | imbalance |
| MapCCVersionURL | categorical | 0.0% | 4 | imbalance |
| JF | categorical | 0.0% | 2 | |
| AudioRecordings | categorical | 0.0% | 2 | |
| Window1040 | categorical | 0.0% | 2 | |
| PeopleGroupMapURL | text | 0.0% | 4,616 | one_word, url_heavy, duplicates |
| PeopleGroupMapExpandedURL | text | 0.0% | 4,331 | one_word, url_heavy, duplicates |
| PeopleGroupURL | text | 0.0% | 7,124 | near_unique, one_word, url_heavy |
| PeopleGroupPhotoURL | text | 0.0% | 2,880 | one_word, url_heavy, duplicates |
| CountryURL | categorical | 0.0% | 202 | |
| JPScaleText | categorical | 0.0% | 1 | imbalance |
| JPScaleImageURL | categorical | 0.0% | 1 | imbalance |
| Summary | text | 0.0% | 3,685 | one_word, duplicates |
| Obstacles | text | 0.0% | 3,641 | one_word, duplicates |
| HowReach | text | 0.0% | 2,853 | one_word, duplicates |
| PrayForChurch | text | 0.0% | 1,713 | one_word, duplicates |
| PrayForPG | text | 0.0% | 3,441 | one_word, duplicates |
| Resources | unknown | 0.0% | n/a | skipped |
| country_data | unknown | 0.0% | n/a | skipped |
| language_data | unknown | 0.0% | n/a | skipped |
PeopleID3ROG3
text identifier near_unique one_word allcaps short_text
PeopleID3ROG3 is a row-level identifier for a people group within a country, not a person-level id: every one of the 7,124 rows holds a unique 7-character, single-token code (n_unique equals n, len_min=len_max=7, one_word_rate=1.0). Sample values like '10208ng' and '10375su' show a 5-digit numeric prefix followed by a 2-letter suffix, which fits the PeopleID3 range (10120 to 22661) followed by a two-letter ROG3 country code. The profiler reports allcaps_rate=1.0 even though the samples shown are lowercase, so check the stored casing before joining. There are no nulls, duplicates, or empties, so the key looks clean. Treatment: Use as a primary key for joins; exclude from modelling features.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 7,124
- len_min: 7
- len_max: 7
- len_mean: 7
- len_median: 7
- len_p95: 7
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 0
- duplicate_rate: 0
- vocab_size: 7,124
- readability_flesch_mean: 112.3
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 1
- boilerplate_rate: 0
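The key check described above can be sketched as follows. This is a minimal illustration, assuming the key is the PeopleID3 digits followed by the lower-cased ROG3 code; that construction is inferred from the two sample values and is not confirmed by the report. The rows and the helper name `expected_key` are hypothetical.

```python
from collections import Counter

# Hypothetical rows: (PeopleID3ROG3, PeopleID3, ROG3). Pairings are illustrative only.
rows = [
    ("10208ng", 10208, "NG"),
    ("10375su", 10375, "SU"),
    ("10375in", 10375, "IN"),
]

def expected_key(people_id3: int, rog3: str) -> str:
    """Assumed construction: PeopleID3 digits followed by the lower-cased ROG3 code."""
    return f"{people_id3}{rog3.lower()}"

# Stored keys that do not match the assumed construction.
mismatches = [key for key, pid, rog in rows if key != expected_key(pid, rog)]

# Keys that occur more than once; a usable primary key should have none.
key_counts = Counter(key for key, _, _ in rows)
duplicates = [key for key, n in key_counts.items() if n > 1]
```

If `mismatches` is non-empty on the real data, the assumed construction (or the casing) is wrong and the column should be treated as an opaque key.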
ROG3
categorical feature
ROG3 holds two-letter country codes across 7,124 rows with 202 distinct values and no nulls. India (IN) dominates at 28.5% of records, followed by PK (767) and CH (442). The ISO3 column reports the same 442 rows as CHN, so CH here means China and the codes do not follow ISO 3166-1 alpha-2, where CH is Switzerland. The top 10 codes account for the bulk of rows while a long tail of about 190 other codes shares the remainder, giving an entropy ratio of 0.66. Treatment: Group rare codes into an 'other' bucket before one-hot or target encoding.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 202
- top_value: IN
- top_rate: 0.2852
- cardinality: 202
- entropy: 5.058
- entropy_ratio: 0.6605
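The rare-code bucketing suggested in the treatment could look like the sketch below. The helper `bucket_rare` and the `min_count` threshold are hypothetical choices, and the sample list is a toy stand-in for the real column.

```python
from collections import Counter

def bucket_rare(values, min_count):
    """Replace values seen fewer than min_count times with 'other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "other" for v in values]

# Toy sample; the real column has 202 distinct codes.
codes = ["IN"] * 5 + ["PK"] * 3 + ["CH"] * 2 + ["NP", "BG"]
bucketed = bucket_rare(codes, min_count=2)
```

A count threshold is only one option; keeping the top k codes by frequency would work the same way with a different membership test.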
PeopleID3
numeric foreign_key
PeopleID3 is an integer key spanning 10120 to 22661 with 4,614 unique values across 7,124 rows, which points to a people-group identifier that recurs when the same group appears in more than one country (about 1.5 rows per id on average). The distribution is mildly left-skewed (-0.23) and platykurtic (-0.95) with no nulls, no zeros, and no outliers, consistent with a dense allocated id range rather than a measured quantity. Treatment: Treat as a join key; do not use as a numeric feature.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 4,614
- min: 10,120
- max: 22,661
- mean: 1.693e+04
- median: 1.736e+04
- std: 3431
- q1: 1.433e+04
- q3: 1.958e+04
- iqr: 5251
- skew: -0.2255
- kurtosis: -0.9528
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
ROP3
numeric identifier
ROP3 is a numeric column with 4,608 unique values across 7,124 rows, ranging from 100005 to 119619 with a mean of 111443.68 and a median of 112533. The narrow span of about 19,600 sitting well above zero, combined with integer-looking bounds, suggests a coded identifier or sequence number rather than a measured quantity. Its unique count is close to that of PeopleID3 (4,614), so it may be a parallel people-group code. Mild left skew (-0.47) and no outliers indicate a fairly even spread within that band, and only 7 rows (0.1%) are null. Treatment: Treat as a categorical code or key; do not feed raw into numeric models.
- n: 7,124
- nulls: 7 (0.1%)
- unique: 4,608
- min: 100,005
- max: 119,619
- mean: 1.114e+05
- median: 112,533
- std: 5269
- q1: 107,901
- q3: 115,240
- iqr: 7,339
- skew: -0.4712
- kurtosis: -0.7273
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
PeopNameInCountry
text label one_word duplicates
This column names a people group as it appears within a given country (e.g., 'Turk', 'Persian', 'Arab, Moroccan') in the Joshua Project registry. Values are short (len_mean 12.5, word_mean 1.78) with 4,722 uniques across 7,124 rows, and 33.7% are duplicates because the same people group recurs across countries; 'Deaf' alone appears 151 times. Frequent qualifiers like '(Hindu traditions)' and '(Muslim traditions)' in top_words show that religion-tagged variants are baked into the label. Treatment: Treat as a categorical label; pair with country to form a unique key, and consider stripping parenthetical religion tags for cleaner grouping.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 4,722
- len_min: 1
- len_max: 39
- len_mean: 12.5
- len_median: 11
- len_p95: 27
- word_mean: 1.784
- word_median: 2
- n_empty: 0
- n_duplicates: 2,402
- duplicate_rate: 0.3372
- vocab_size: 4,602
- readability_flesch_mean: 56.38
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.4575
- allcaps_rate: 0
- boilerplate_rate: 0
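Stripping the parenthetical religion tags might be done along these lines. This is a sketch that assumes the qualifier always sits at the end of the label in a single pair of parentheses; the helper `split_qualifier` is hypothetical, and 'Somegroup' is a made-up name used only to show the shape.

```python
import re

# Matches one trailing parenthetical qualifier such as " (Hindu traditions)".
_TRAILING_PAREN = re.compile(r"\s*\(([^()]*)\)\s*$")

def split_qualifier(name: str):
    """Return (base_name, qualifier); qualifier is None when no trailing parenthetical exists."""
    match = _TRAILING_PAREN.search(name)
    if match is None:
        return name.strip(), None
    return name[: match.start()].strip(), match.group(1).strip()

with_tag = split_qualifier("Somegroup (Muslim traditions)")
without_tag = split_qualifier("Arab, Moroccan")
```

Keeping the qualifier as a separate value, rather than discarding it, preserves the religion signal while letting the base name group cleanly.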
ROG2
categorical feature
ROG2 is a low-cardinality categorical with 7 region codes (ASI, AFR, EUR, NAR, LAM, AUS, SOP) and no nulls across 7,124 rows, consistent with a continent-level grouping. Its cardinality, top rate, and entropy are identical to those of Continent, so the two columns are very likely the same grouping in coded and named form. The distribution is highly imbalanced: ASI accounts for 75.1% of records while AUS and SOP together contribute fewer than 80 rows, yielding an entropy ratio of just 0.45. Any model conditioned on this field will be dominated by the ASI bucket. Treatment: One-hot encode and consider grouping AUS/SOP/LAM into an 'other' bucket given the severe imbalance.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 7
- top_value: ASI
- top_rate: 0.7511
- cardinality: 7
- entropy: 1.251
- entropy_ratio: 0.4457
Continent
categorical feature
Continent is a low-cardinality geographic categorical with 7 distinct values and no nulls across 7,124 rows. The distribution is heavily concentrated: Asia alone accounts for 75.1% of records, with Africa a distant second at 986. Both 'Australia' (39) and 'Oceania' (36) appear as separate categories, which is a labelling inconsistency since Australia is part of Oceania; RegionName already combines them as 'Australia and Pacific' (75 rows). Treatment: Reconcile Australia/Oceania into a single category, then one-hot encode.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 7
- top_value: Asia
- top_rate: 0.7511
- cardinality: 7
- entropy: 1.251
- entropy_ratio: 0.4457
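The entropy figures and the Australia/Oceania reconciliation can be worked through from the Continent counts in this report. The sketch assumes that `entropy` is Shannon entropy in bits and that `entropy_ratio` divides it by log2 of the cardinality; the report doesn't state the profiler's formula, but this definition should reproduce the reported 1.251 and 0.4457. Folding 'Australia' into 'Oceania' is one possible reconciliation rule, not the dataset's own.

```python
import math

# Continent counts taken from the data table in this report.
counts = {
    "Asia": 5351, "Africa": 986, "Europe": 431, "North America": 175,
    "South America": 106, "Australia": 39, "Oceania": 36,
}

def entropy_bits(freqs):
    """Shannon entropy in bits of a collection of category counts."""
    freqs = [c for c in freqs if c > 0]
    total = sum(freqs)
    return -sum((c / total) * math.log2(c / total) for c in freqs)

def entropy_ratio(freqs):
    """Entropy divided by its maximum, log2(number of categories)."""
    freqs = list(freqs)
    return entropy_bits(freqs) / math.log2(len(freqs))

h = entropy_bits(counts.values())
ratio = entropy_ratio(counts.values())

# Assumed reconciliation: fold 'Australia' into 'Oceania'.
merged = dict(counts)
merged["Oceania"] += merged.pop("Australia")
```

After the merge there are six categories and the combined bucket holds 75 rows, matching the 'Australia and Pacific' region count.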
RegionName
categorical feature
RegionName is a categorical geographic grouping with 12 distinct regions and no nulls across 7,124 rows. The distribution is heavily concentrated: 'Asia, South' alone accounts for 47.0% of records, followed distantly by 'Asia, Southeast' at 726 and 'Asia, Northeast' at 521, leaving the Americas and Europe sparsely represented. An entropy ratio of 0.76 indicates meaningful but uneven coverage across the 12 buckets. Treatment: One-hot encode and consider grouping rare regions, given the dominance of 'Asia, South'.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 12
- top_value: Asia, South
- top_rate: 0.4701
- cardinality: 12
- entropy: 2.715
- entropy_ratio: 0.7574
ISO3
categorical foreign_key
ISO3 looks like a country code in ISO 3166-1 alpha-3 format, with 202 distinct values across 7,124 rows and zero nulls. The distribution is heavily concentrated on India (IND) at 28.5% of rows (2,032), followed by PAK (767) and CHN (442), so South and East Asia dominate. An entropy ratio of 0.66 confirms that the imbalance is material. Treatment: Use as a join key to country reference tables; consider grouping long-tail codes or stratifying by ISO3 to control for the IND-heavy skew.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 202
- top_value: IND
- top_rate: 0.2852
- cardinality: 202
- entropy: 5.058
- entropy_ratio: 0.6605
LocationInCountry
text free_text multilingual null_rate
Free-text geographic descriptions of where a people group lives within a country, ranging from terse tags like "Widespread." (56 occurrences) to multi-sentence paragraphs of up to 939 characters. 64.23% of rows are null and 14.6% of the non-null values are duplicates, so usable signal is concentrated in roughly a third of the dataset. Content is overwhelmingly English (1714 of 1729 detected), with a long tail of place names producing a 10,936-token vocabulary across 2,176 unique strings. Treatment: Normalize recurring stock phrases like "Widespread" into a categorical flag, then tokenize and embed the remaining prose before modelling.
- n: 7,124
- nulls: 4,576 (64.2%)
- unique: 2,176
- len_min: 3
- len_max: 939
- len_mean: 141.1
- len_median: 89
- len_p95: 455.7
- word_mean: 21.15
- word_median: 12
- n_empty: 0
- n_duplicates: 372
- duplicate_rate: 0.146
- vocab_size: 10,936
- readability_flesch_mean: 41.64
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.04317
- allcaps_rate: 0
- boilerplate_rate: 0
PeopleID1
numeric feature outliers
PeopleID1 is stored as numeric but takes only 16 distinct integer values between 10 and 26, with a tight IQR of 1.0 (Q1=20, Q3=21) around a median of 21. The distribution is heavily left-skewed (skew -1.34) and 32.9% of rows fall outside the Tukey fence, so the 'outliers' alert reflects the column being near-categorical rather than truly continuous. There are no nulls and no zeros. The 16 levels match the cardinality of ROP1 and AffinityBloc, so this is probably the numeric id of the affinity bloc. Treatment: Cast to categorical (16 levels) rather than treating as a continuous numeric.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 16
- min: 10
- max: 26
- mean: 19.51
- median: 21
- std: 3.858
- q1: 20
- q3: 21
- iqr: 1
- skew: -1.337
- kurtosis: 0.8321
- n_outliers: 2,347
- outlier_rate: 0.3294
- zero_rate: 0
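The outlier alert is easier to interpret once the fences are written out. A small worked example, assuming the profiler uses the conventional 1.5 × IQR Tukey rule (the report doesn't state the multiplier):

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for PeopleID1.
lower, upper = tukey_fences(q1=20, q3=21)
```

With fences at 18.5 and 22.5, every code from 10 to 18 and from 23 to 26 counts as an outlier, which is why a third of rows are flagged in a column that is really a set of category ids.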
ROP1
categorical feature
ROP1 is a low-cardinality categorical code (16 distinct values, all following an 'A0xx' pattern) with no nulls across 7,124 rows. The distribution is heavily concentrated: 'A012' alone accounts for 51.7% of records, and an entropy ratio of 0.67 confirms the imbalance. The cardinality, top rate, and entropy are identical to AffinityBloc, so ROP1 is very likely the code for that column, with 'A012' standing for 'South Asian Peoples'. Treatment: One-hot or target-encode; consider grouping rare codes given the dominance of A012.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 16
- top_value: A012
- top_rate: 0.5167
- cardinality: 16
- entropy: 2.676
- entropy_ratio: 0.669
AffinityBloc
categorical feature
AffinityBloc is a categorical grouping of peoples into ethnolinguistic blocs, with 16 distinct values across 7,124 rows and no nulls. The distribution is heavily concentrated: 'South Asian Peoples' alone accounts for 51.7% of records (3,681), followed distantly by 'Sub-Saharan Peoples' (632) and 'Arab World' (475). An entropy ratio of 0.67 confirms the imbalance, and the inclusion of 'Deaf' (151) as a bloc alongside geographic and ethnic categories is a notable taxonomy quirk. Treatment: One-hot or target-encode, and consider grouping rare blocs given the dominance of South Asian Peoples.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 16
- top_value: South Asian Peoples
- top_rate: 0.5167
- cardinality: 16
- entropy: 2.676
- entropy_ratio: 0.669
PeopleID2
numeric foreign_key
PeopleID2 is a numeric identifier-like field with only 205 distinct values across 7,124 rows, ranging from 101 to 475 with no nulls or zeros. The distribution is left-skewed (skew -0.50) and platykurtic (kurtosis -1.24), with the median (412) well above the mean (339): most rows sit near the upper end of the id range while a smaller set of low ids pulls the mean down. The 205 levels match the cardinality of PeopleCluster, so this is probably the people-cluster id, a repeated key rather than a per-row unique id. Treatment: Treat as a categorical foreign key and join to a people-cluster lookup rather than using it as a numeric feature.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 205
- min: 101
- max: 475
- mean: 339
- median: 412
- std: 123.2
- q1: 232.8
- q3: 450
- iqr: 217.2
- skew: -0.5035
- kurtosis: -1.242
- n_outliers: 0
- outlier_rate: 0
- zero_rate: 0
ROP2
categorical feature
ROP2 is a categorical code field with 155 distinct values made up of A- and C-prefixed codes, suggesting a classification code. One value, 'A012', dominates at 51.6% of the 7,124 rows, while the remaining categories are long-tailed C-codes each below 2.4%. 'A012' is also the top ROP1 code with almost the same share, and PeopleCluster (205 values, top share 12.2%) shows no such concentration, so ROP2 may be carrying the bloc-level code for many rows; this is worth verifying against the source. An entropy ratio of 0.56 confirms the distribution is heavily concentrated rather than uniform. Treatment: Collapse rare C-codes into an 'other' bucket and one-hot or target-encode before modelling.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 155
- top_value: A012
- top_rate: 0.5163
- cardinality: 155
- entropy: 4.085
- entropy_ratio: 0.5614
PeopleCluster
categorical feature
PeopleCluster is a high-cardinality categorical taxonomy of ethno-religious groupings, with 205 distinct values across 7,124 rows and no nulls. The distribution is dominated by South Asian categories: 'South Asia Hindu - other' alone accounts for 12.2% (869 rows), followed by 'South Asia Muslim - other' (586) and 'South Asia Dalit - other' (352). An entropy ratio of 0.80 indicates a fairly spread distribution despite the South Asian skew, and the appearance of 'Deaf' alongside ethnolinguistic labels signals a mixed taxonomy worth flagging. Treatment: Group the long tail and target- or frequency-encode before modelling.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 205
- top_value: South Asia Hindu - other
- top_rate: 0.122
- cardinality: 205
- entropy: 6.108
- entropy_ratio: 0.7954
PeopNameAcrossCountries
text label one_word duplicates
This column holds people-group names as used across countries (e.g. 'Turk', 'Persian', 'Kurd, Kurmanji', 'Arab, Moroccan'), with frequent religious-tradition qualifiers like '(Hindu traditions)' and '(Muslim traditions)' appearing in 985 and 424 rows respectively. Values are short (mean 12.4 chars, median 2 words) and 47.5% are single-word labels, yet 35.4% (2,520 rows) are duplicates across 4,604 unique strings out of 7,124, because the same group recurs across countries. The most surprising entry is 'Deaf' at 151 occurrences, which sits oddly alongside ethnic categories. Treatment: Treat as a categorical people-group label; normalize qualifiers in parentheses and join on (group, country) for cross-country aggregation.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 4,604
- len_min: 1
- len_max: 39
- len_mean: 12.38
- len_median: 10
- len_p95: 27
- word_mean: 1.766
- word_median: 2
- n_empty: 0
- n_duplicates: 2,520
- duplicate_rate: 0.3537
- vocab_size: 4,431
- readability_flesch_mean: 56.01
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.475
- allcaps_rate: 0
- boilerplate_rate: 0
Population
numeric feature high_skew outliers
Population of each people group within its country, ranging from 10 to 135,533,000 with a median of just 30,000. The distribution is extraordinarily right-skewed (skew 21.1, kurtosis 607): the mean of about 502,570 sits far above Q3 of 129,000, and about 14.9% of rows flag as outliers, reflecting many small groups alongside a few with populations in the tens of millions. The null rate is negligible (0.21%) and there are no zeros. Treatment: Apply a log1p transform before any modelling to tame the extreme skew.
- n: 7,124
- nulls: 15 (0.2%)
- unique: 1,200
- min: 10
- max: 1.355e+08
- mean: 5.026e+05
- median: 30,000
- std: 3.568e+06
- q1: 6,700
- q3: 129,000
- iqr: 122,300
- skew: 21.1
- kurtosis: 607.4
- n_outliers: 1,058
- outlier_rate: 0.1488
- zero_rate: 0
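The effect of the log1p treatment can be seen directly on the reported summary statistics. This sketch applies `math.log1p` to those five figures only; it illustrates the compression of the scale and is not a transform of the full column.

```python
import math

# Summary statistics reported above for Population.
stats = {"min": 10, "q1": 6_700, "median": 30_000, "q3": 129_000, "max": 135_533_000}

# log1p(x) = ln(1 + x); it is monotonic, so the ordering of quantiles is preserved.
logged = {name: math.log1p(value) for name, value in stats.items()}

raw_ratio = stats["max"] / stats["median"]    # max is thousands of times the median
log_ratio = logged["max"] / logged["median"]  # less than twice the median on the log scale
```

Because the column has no zeros, a plain logarithm would also work; log1p is simply the safer default if zeros appear in a later extract.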
Category
categorical label
Three-level categorical with values "1", "2", "3" and no nulls across 7,124 rows. The distribution is imbalanced: "3" dominates at 60.3% (4,299), "1" follows at 2,330, and "2" is rare at 495. An entropy ratio of 0.78 confirms the skew toward the majority class. Treatment: Treat as a categorical label; consider class weighting or stratified sampling to handle the minority class "2".
- n: 7,124
- nulls: 0 (0.0%)
- unique: 3
- top_value: 3
- top_rate: 0.6035
- cardinality: 3
- entropy: 1.234
- entropy_ratio: 0.7788
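If Category is used as a classification target (an assumption; the report doesn't say what the three codes mean), class weights can be derived from the counts above. The sketch uses the common "balanced" heuristic, weight = n / (k × count), which is one choice among several.

```python
# Counts reported above for Category.
counts = {"3": 4299, "1": 2330, "2": 495}

n_rows = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic: rarer classes receive proportionally larger weights.
weights = {label: n_rows / (n_classes * count) for label, count in counts.items()}
```

Under this scheme each class contributes the same total weight, so the 495 rows of class "2" count as much in aggregate as the 4,299 rows of class "3".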
ROL3
text feature one_word short_text duplicates
ROL3 holds three-letter ISO 639-3 language codes: every value is exactly 3 characters and one word, with top entries like 'hin', 'ben', 'urd', 'guj', and 'tel' pointing to South Asian languages. The distribution is heavily skewed toward Hindi (662 of 7,124), and only 1,565 unique codes appear across 7,124 rows, giving a 78% duplicate rate. There are no nulls or empties, and the only formatting noise is an allcaps_rate of 0.00028, or about two rows, which may also explain why vocab_size (1,564) is one less than the unique count. Treatment: Treat as a categorical language code; lower-case the values, one-hot or target-encode, and consider grouping rare codes.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 1,565
- len_min: 3
- len_max: 3
- len_mean: 3
- len_median: 3
- len_p95: 3
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 5,559
- duplicate_rate: 0.7803
- vocab_size: 1,564
- readability_flesch_mean: 118.7
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 1
- allcaps_rate: 0.0002807
- boilerplate_rate: 0
PrimaryLanguageName
text feature one_word short_text duplicates
Categorical language label, usually a single word (one_word_rate 0.70, word_mean 1.33), drawn from a vocabulary of 1,641 tokens across 1,563 distinct values. Hindi (662), Bengali (357), and Sindhi (191) dominate the 7,124 rows, and 78.1% of values are duplicates of an earlier row, as expected for a controlled language taxonomy. Compound names like 'Pashto, Northern' and 'Punjabi, Eastern' indicate ISO-style subvariant naming rather than free text. Treatment: Treat as a categorical factor; normalise the comma-separated subvariants before one-hot or target encoding.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 1,563
- len_min: 1
- len_max: 32
- len_mean: 9.147
- len_median: 7
- len_p95: 17
- word_mean: 1.331
- word_median: 1
- n_empty: 0
- n_duplicates: 5,561
- duplicate_rate: 0.7806
- vocab_size: 1,641
- readability_flesch_mean: 33.28
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.7016
- allcaps_rate: 0
- boilerplate_rate: 0
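One way to normalise the comma-separated subvariants is to split each name into a base language and a variant. This is a sketch that assumes the first comma separates the two parts; the helper `split_language_name` is hypothetical.

```python
def split_language_name(name: str):
    """Split 'Base, Variant' into (base, variant); variant is None when there is no comma."""
    base, sep, variant = name.partition(",")
    return base.strip(), (variant.strip() if sep else None)

pashto = split_language_name("Pashto, Northern")
hindi = split_language_name("Hindi")
```

Encoding the base language as one feature and the variant as another keeps related varieties together without losing the distinction.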
PrimaryLanguageDialect
categorical metadata long_tail null_rate
Free-text or controlled-vocabulary field naming a primary language dialect, with 303 distinct values across 7,124 rows but populated in only 5.5% of records (null_rate 0.945). The distribution is essentially flat: entropy_ratio is 0.97 and the modal value 'Punjabi' covers just 3.1% of non-nulls (12 occurrences), so no dialect dominates. The mix spans South Asian, Middle Eastern, African, and European dialects, suggesting a global but extremely sparse roster. Treatment: Drop or collapse into a coarser language grouping; too sparse and high-cardinality to use directly as a feature.
- n: 7,124
- nulls: 6,732 (94.5%)
- unique: 303
- top_value: Punjabi
- top_rate: 0.03061
- cardinality: 303
- entropy: 8.011
- entropy_ratio: 0.9719
NumberLanguagesSpoken
numeric feature high_skew outliers
Counts the number of languages spoken, with values ranging from 1 to 120 across 7,124 rows and no nulls. The distribution is heavily right-skewed (skew 4.87, kurtosis 36.3): the median is 1 and Q3 is 5, yet the mean is 4.33 and 597 rows (8.4%) flag as outliers. The long tail up to 120 may be genuine for very large, dispersed groups or may reflect data-entry errors; the report alone cannot tell. Treatment: Cap or log-transform before modelling, and audit the extreme tail (values up to 120).
- n: 7,124
- nulls: 0 (0.0%)
- unique: 69
- min: 1
- max: 120
- mean: 4.333
- median: 1
- std: 7.32
- q1: 1
- q3: 5
- iqr: 4
- skew: 4.871
- kurtosis: 36.31
- n_outliers: 597
- outlier_rate: 0.0838
- zero_rate: 0
OfficialLang
categorical feature
Categorical column listing an official language, with 79 distinct values across 7,124 rows and effectively no nulls (0.08%). Hindi dominates at 28.5% (2,032 rows), followed by Urdu (767), Standard Arabic (657), Mandarin (475), and English (433), giving an entropy ratio of 0.66, moderately concentrated rather than uniform. The Hindi and Urdu counts equal the India and Pakistan row counts, so the value appears to be the official language of the record's country rather than of the people group. The South/Central Asian skew is notable: five of the top ten values are languages of that region, which may bias any downstream language-level analysis. Treatment: Group rare languages into an 'Other' bucket and one-hot or target-encode before modelling.
- n: 7,124
- nulls: 6 (0.1%)
- unique: 79
- top_value: Hindi
- top_rate: 0.2855
- cardinality: 79
- entropy: 4.178
- entropy_ratio: 0.6628
SpeakNationalLang
unknown other skipped
Column 'SpeakNationalLang' was skipped by the profiler, so type, cardinality, and value distribution are all unavailable. The only confirmed facts are that it has 7,124 rows and zero nulls. The name suggests a flag or category for whether the people group speaks the national language, but this cannot be verified from the evidence. Treatment: Re-profile with type inference enabled before deciding on any downstream handling.
- n: 7,124
- nulls: 0 (0.0%)
- unique: n/a
BibleStatus
numeric feature outliers
BibleStatus is an ordinal/categorical code stored as a small integer, taking just 6 distinct values from 0 to 5 across 7,124 rows with no nulls. The distribution is heavily concentrated at the top (median 5, Q1=4, Q3=5, mean 4.05) with strong negative skew (-1.51); 13.5% of rows are flagged as low-end outliers and 3.8% are zero. This looks like a status/level code rather than a true numeric measurement. Treatment: Treat as an ordinal category (or one-hot encode) rather than a continuous numeric.
- n: 7,124
- nulls: 0 (0.0%)
- unique: 6
- min: 0
- max: 5
- mean: 4.054
- median: 5
- std: 1.342
- q1: 4
- q3: 5
- iqr: 1
- skew: -1.513
- kurtosis: 1.538
- n_outliers: 961
- outlier_rate: 0.1349
- zero_rate: 0.03818
BibleYear
categorical metadata null_rate
BibleYear appears to be a publication/edition year field for Bible translations, encoded mostly as year ranges like "1818-2022" rather than single years. Cardinality is high (163 distinct values across 7,124 rows) and the column is missing for 45.79% of rows, which is a major coverage gap. The top value "1818-2022" covers 17.14% of non-nulls and most frequent entries are spans, while plain single years like "1954" (191 occurrences) are the exception. Treatment: Parse into start_year and end_year integers, then decide imputation given the 45.79% null rate.
- n: 7,124
- nulls: 3,262 (45.8%)
- unique: 163
- top_value: 1818-2022
- top_rate: 0.1714
- cardinality: 163
- entropy: 5.318
- entropy_ratio: 0.7237
NTYear
categorical feature null_rate
NTYear appears to hold year-range strings (e.g. "1811-1998", "1801-1984") giving a span between two years, though the presence of "Yes" as the third most common value (345 rows) signals encoding inconsistency. The column has 305 distinct values across 7,124 rows with a high null rate of 24.65%, and the top value covers only 12.3% of non-nulls, so the distribution is fairly spread (entropy ratio 0.735). Treatment: Parse year-range strings into start/end numeric fields and quarantine non-conforming values like "Yes" before modelling.
- n: 7,124
- nulls: 1,756 (24.6%)
- unique: 305
- top_value: 1811-1998
- top_rate: 0.1233
- cardinality: 305
- entropy: 6.065
- entropy_ratio: 0.7349
PortionsYear
categorical metadataPortionsYear appears to be a free-form field describing the year range covered by record portions, with 460 distinct values across 7124 rows and a 13.5% null rate. Most entries are date ranges like '1806-1962' or '1800-1980', but the single most common value is the literal string 'Yes' (821 rows, 13.3%), suggesting inconsistent data entry where a yes/no answer leaks into a date-range field. High entropy ratio (0.71) confirms values are spread across many ranges rather than concentrated. Treatment: Parse date ranges into start/end year numerics and isolate the 'Yes' contamination as a separate boolean flag before use.
- n
- 7,124
- nulls
- 961 (13.5%)
- unique
- 460
- top_value
- Yes
- top_rate
- 0.1332
- cardinality
- 460
- entropy
- 6.289
- entropy_ratio
- 0.711
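The parsing treatment recommended for BibleYear, NTYear, and PortionsYear can be sketched as follows. This is a minimal illustration, assuming cells contain either a four-digit year, a `YYYY-YYYY` span, or the stray literal "Yes"; the names `YearSpan` and `parse_year_field` are hypothetical, and any other format found in the real data would need its own rule.

```python
import re
from typing import NamedTuple, Optional

# Matches a single year ("1954") or a span ("1818-2022").
_YEAR_RANGE = re.compile(r"^(\d{4})(?:-(\d{4}))?$")


class YearSpan(NamedTuple):
    start_year: Optional[int]
    end_year: Optional[int]
    is_flag_only: bool  # True when the cell held "Yes" instead of a year


def parse_year_field(raw: Optional[str]) -> YearSpan:
    """Split one year-range cell into numeric parts (hypothetical helper)."""
    if raw is None or not raw.strip():
        return YearSpan(None, None, False)  # null or blank cell
    text = raw.strip()
    if text.lower() == "yes":
        return YearSpan(None, None, True)  # quarantine the stray yes/no answer
    match = _YEAR_RANGE.match(text)
    if match is None:
        # Unrecognised format: left empty here so it can be inspected manually.
        return YearSpan(None, None, False)
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return YearSpan(start, end, False)


print(parse_year_field("1806-1962"))
print(parse_year_field("Yes"))
```

Keeping the "Yes" rows as a separate flag preserves the information that something exists for that group even though no year was recorded.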
TranslationNeedQuestionable
unknown other skipped
The column "TranslationNeedQuestionable" was skipped by the profiler, so no type, uniqueness, or value statistics are available. All we know is that it has 7124 rows with a 0.0 null rate; nothing else can be inferred from the evidence. Treatment: Re-profile or inspect manually before deciding on use; current evidence is insufficient.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- —
JPScale
numeric metadata constant
JPScale is a numeric column that is entirely constant: all 7124 rows hold the value 1.0 with zero nulls, zero variance, and a single unique value. It carries no information for any downstream model or comparison and was flagged as constant. Treatment: Drop; constant column with no variance.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 1
- min
- 1
- max
- 1
- mean
- 1
- median
- 1
- std
- 0
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
JPScalePC
categorical feature
JPScalePC is a 5-level categorical code (values "1" through "5") with no nulls across 7124 rows, likely an ordinal scale or rating. The distribution is heavily concentrated at "1" (70.2% of rows), with "3" the rarest at just 205 occurrences, yielding an entropy ratio of 0.59. The non-monotonic frequency order (1 > 4 > 2 > 5 > 3) is unusual for a true ordinal scale and worth checking. Treatment: Treat as ordinal categorical; consider grouping minority levels (3, 5) given the dominance of "1".
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- 1
- top_rate
- 0.702
- cardinality
- 5
- entropy
- 1.377
- entropy_ratio
- 0.5929
JPScalePGAC
categorical feature imbalance
JPScalePGAC is a 5-level categorical code (values "1" through "5"), most likely the Joshua Project progress scale (JPScale) assessed at the People Group Across Countries (PGAC) level. The distribution is severely imbalanced: "1" accounts for 6910 of 7124 rows (top_rate 0.97), entropy_ratio is just 0.10, and the remaining four levels together hold only 214 records. No nulls are present. Treatment: Collapse rare levels or binarise as "1" vs other before modelling.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- 1
- top_rate
- 0.97
- cardinality
- 5
- entropy
- 0.2372
- entropy_ratio
- 0.1021
LeastReached
categorical metadata imbalance
This column is a single-valued categorical flag holding "Y" for all 7124 rows, with no nulls and zero entropy. Because cardinality is 1 and top_rate is 1.0, it carries no information for any downstream model. Treatment: Drop; constant column with no variance.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- Y
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
LeastReachedPC
categorical feature
Binary Y/N flag indicating whether some 'least reached' people-group condition is met. The column is fully populated across 7124 rows with only 2 distinct values, skewed toward 'Y' at 72.3% (5152) versus 'N' at 1972. Entropy ratio of 0.85 shows the split is uneven but still informative. Treatment: Encode as a 0/1 boolean for modelling.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.7232
- cardinality
- 2
- entropy
- 0.8511
- entropy_ratio
- 0.8511
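The entropy and entropy_ratio figures quoted throughout this report can be reproduced with a short calculation. The sketch below assumes the profiler uses Shannon entropy in bits, normalised by log2 of the number of distinct values; that definition is an assumption, although it matches the LeastReachedPC numbers above. The function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple


def entropy_and_ratio(values: Iterable[Hashable]) -> Tuple[float, float]:
    """Return (Shannon entropy in bits, entropy / log2(cardinality))."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # A constant column has no spread to normalise, so its ratio is 0.
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    ratio = entropy / max_entropy if max_entropy > 0 else 0.0
    return entropy, ratio


# LeastReachedPC counts from the profile: 5152 'Y' and 1972 'N'.
flags = ["Y"] * 5152 + ["N"] * 1972
entropy, ratio = entropy_and_ratio(flags)
print(round(entropy, 4), round(ratio, 4))
```

For a binary column the normaliser is log2(2) = 1, which is why entropy and entropy_ratio coincide for every Y/N flag in the report.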
LeastReachedPGAC
categorical feature imbalance
A binary Y/N flag (likely indicating whether the least-reached PGAC condition was met) with no nulls across 7124 rows. The distribution is severely imbalanced: 'Y' accounts for 6910 rows (97.0%) versus only 214 'N', yielding entropy_ratio of just 0.19. As a near-constant feature it carries little discriminative signal on its own. Treatment: Encode as 0/1 but consider dropping or pairing with rare-class oversampling given the 97/3 imbalance.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.97
- cardinality
- 2
- entropy
- 0.1946
- entropy_ratio
- 0.1946
GSEC
categorical feature
GSEC is a low-cardinality categorical with 8 distinct values across 7124 rows and no nulls. The dominant value is the empty string at 51.08% (3639 rows), followed by '1' at 2767; the remaining six codes ('0' and '2' through '6') together account for the other 718 rows (about 10%). The mix of blanks and small integer codes suggests an optional categorical flag where 'missing' is encoded as '' rather than null. Treatment: Recode '' as explicit missing and one-hot encode the remaining small-integer categories.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- '' (empty string)
- top_rate
- 0.5108
- cardinality
- 8
- entropy
- 1.605
- entropy_ratio
- 0.535
HasAudioRecordings
categorical feature
Binary Y/N flag indicating whether a record has associated audio recordings. The class is heavily imbalanced toward 'Y' at 86.9% (6188 of 7124), with no nulls. Entropy ratio of 0.56 confirms the skew but the minority 'N' class still has 936 observations, enough to be usable. Treatment: Encode as a 0/1 boolean; watch for class imbalance if used as a target.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.8686
- cardinality
- 2
- entropy
- 0.5612
- entropy_ratio
- 0.5612
NTOnline
categorical feature null_rate imbalance
NTOnline is a categorical flag with only one observed value, 'Y', across 5528 non-null rows, while 22.4% of rows are null. With cardinality 1 and entropy 0, it carries no discriminative signal; presence vs. absence is the only information available. Treatment: Drop, or replace with a binary is_present indicator if the null pattern is meaningful.
- n
- 7,124
- nulls
- 1,596 (22.4%)
- unique
- 1
- top_value
- Y
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
RLG3
numeric feature outliers
RLG3 is a discrete numeric column with only 7 unique values spanning 2 to 9. Its cardinality matches the 7 categories of PrimaryReligion, so it is more plausibly a nominal religion code than an ordinal rating or Likert-style scale. The values cluster at the median of 6 (IQR=1, Q1=5, Q3=6) with mild left skew (-0.46), and 10.6% of rows (757) fall outside the IQR fences, an artifact of the narrow box rather than true anomalies. Treatment: Treat as a categorical code and confirm its mapping to PrimaryReligion before assuming any order; the outlier flag is a side-effect of the compressed IQR, not bad data.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 7
- min
- 2
- max
- 9
- mean
- 5.27
- median
- 6
- std
- 1.279
- q1
- 5
- q3
- 6
- iqr
- 1
- skew
- -0.4551
- kurtosis
- 2.001
- n_outliers
- 757
- outlier_rate
- 0.1063
- zero_rate
- 0
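To see why a narrow IQR inflates the outlier count, the fences can be computed directly from the reported quartiles. This sketch assumes the profiler applies the conventional Tukey rule with a 1.5 x IQR multiplier; the report does not state the multiplier, and the function names are hypothetical.

```python
from typing import List, Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper fences for the IQR rule (k=1.5 is an assumption)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flagged(values: List[float], q1: float, q3: float) -> List[float]:
    """Values strictly outside the fences."""
    low, high = tukey_fences(q1, q3)
    return [v for v in values if v < low or v > high]


# RLG3 quartiles from the profile: Q1=5, Q3=6.
print(tukey_fences(5, 6))                        # (3.5, 7.5)
print(flagged([2, 3, 4, 5, 6, 7, 8, 9], 5, 6))   # [2, 3, 8, 9]
```

With Q1=5 and Q3=6, every code of 3 or below and 8 or above is flagged regardless of how legitimate it is. When the IQR is zero, as for SecurityLevel (Q1=Q3=2), the fences collapse onto a single value and every other level is reported as an outlier.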
RLG3PC
numeric feature outliers
RLG3PC is an integer-coded feature with only 8 distinct values spanning 1-9 and a tight IQR of 1 (Q1=5, Q3=6). Its cardinality matches the 8 categories of PrimaryReligionPC, which suggests a nominal religion code rather than an ordinal scale. The distribution is left-skewed (-0.95) and concentrated around the median of 5, yet 14.3% of rows (1022) fall outside the IQR fences, reflecting the frequency of codes far from the central 5-6 band rather than true anomalies. No nulls or zeros are present. Treatment: Treat as a categorical code rather than continuous; the outlier rate reflects the narrow IQR, not errors.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 8
- min
- 1
- max
- 9
- mean
- 5.079
- median
- 5
- std
- 1.52
- q1
- 5
- q3
- 6
- iqr
- 1
- skew
- -0.9463
- kurtosis
- 1.703
- n_outliers
- 1,022
- outlier_rate
- 0.1435
- zero_rate
- 0
RLG3PGAC
numeric feature outliers
RLG3PGAC is a numeric column with only 8 distinct integer values spanning 1 to 9. Its cardinality matches the 8 categories of PrimaryReligionPGAC, so it is more plausibly a nominal religion code than an ordinal rating or a continuous measurement. The values cluster at a median of 5.5 with an IQR of just 1 (Q1=5, Q3=6), and 776 rows (10.9%) fall outside the Tukey fences because any code far from the central 5-6 band registers as an outlier. The mild left skew (-0.46) and mean of 5.27 indicate a longer tail toward the low codes. Treatment: Treat as a categorical code; one-hot encode rather than scaling as continuous.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 8
- min
- 1
- max
- 9
- mean
- 5.272
- median
- 5.5
- std
- 1.296
- q1
- 5
- q3
- 6
- iqr
- 1
- skew
- -0.4637
- kurtosis
- 2.032
- n_outliers
- 776
- outlier_rate
- 0.1089
- zero_rate
- 0
PrimaryReligion
categorical feature
PrimaryReligion is a low-cardinality categorical with 7 distinct values across 7,124 rows and no nulls. Islam dominates at 46% (3,279 rows), followed by Hinduism (2,142) and Ethnic Religions (933); Non-Religious appears only 13 times and 157 rows are explicitly 'Unknown'. Entropy ratio of 0.68 indicates a moderately skewed but not degenerate distribution. Treatment: One-hot encode and consider merging 'Unknown' with 'Other / Small' or treating it as a missing-value flag.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- Islam
- top_rate
- 0.4603
- cardinality
- 7
- entropy
- 1.92
- entropy_ratio
- 0.6839
PrimaryReligionPC
categorical feature
Categorical label assigning each of 7124 rows to one of 8 primary religion categories, with no nulls. Islam dominates at 3105 rows (43.6%) followed by Hinduism at 2296, while Non-Religious (35) and Other/Small (62) are rare; entropy ratio of 0.68 indicates moderate concentration in the top two classes. 154 rows are explicitly 'Unknown', a category worth treating distinctly from missing. Treatment: One-hot encode the 8 levels and keep 'Unknown' as its own category rather than imputing.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- Islam
- top_rate
- 0.4359
- cardinality
- 8
- entropy
- 2.051
- entropy_ratio
- 0.6838
PrimaryReligionPGAC
categorical feature
Categorical label for the primary religion of a People Group Across Countries (PGAC) record, with 8 distinct values across 7124 rows and no nulls. Islam dominates at 45.6% (3247), followed by Hinduism (2154) and Ethnic Religions (925); Christianity appears in just 17 rows, which is expected for a catalogue restricted to unreached groups. Entropy ratio of 0.65 indicates moderate concentration on the top categories. Treatment: One-hot or target-encode for modelling; consider folding 'Unknown' and 'Non-Religious'/'Christianity' tails into 'Other'.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 8
- top_value
- Islam
- top_rate
- 0.4558
- cardinality
- 8
- entropy
- 1.955
- entropy_ratio
- 0.6516
RLG4
numeric feature null_rate outliers
RLG4 is a sparse numeric feature populated for only ~7.6% of rows (null_rate 0.9239) with just 18 distinct integer-like values ranging 10 to 39. Its null count (6,582) and cardinality (18) match ReligionSubdivision exactly, so it is most likely the numeric code for that label rather than an ordinal score. The reported skew is positive (1.05) even though the mean (18.19) sits below the median (20.0), with 30 flagged outliers (5.5% of present values) and an IQR of 6. Treatment: Treat as a nominal code paired with ReligionSubdivision; add a missingness indicator rather than imputing, given 92% nulls and a small discrete value set.
- n
- 7,124
- nulls
- 6,582 (92.4%)
- unique
- 18
- min
- 10
- max
- 39
- mean
- 18.19
- median
- 20
- std
- 6.472
- q1
- 14
- q3
- 20
- iqr
- 6
- skew
- 1.051
- kurtosis
- 1.474
- n_outliers
- 30
- outlier_rate
- 0.05535
- zero_rate
- 0
ReligionSubdivision
categorical feature null_rate
A sub-classification of religion (denomination/sect), with 18 distinct values like Sunni, Judaism, Sikhism, Tibetan, and Theravada. The column is 92.39% null, so it is only populated for the small subset of records where a finer-grained religious branch applies. Among the 542 non-null rows, Sunni leads at 29.52% (160 occurrences), and entropy ratio 0.72 indicates the populated values are spread fairly evenly across branches. Treatment: Treat missingness as its own category and one-hot encode, or roll up into the parent Religion field before modelling.
- n
- 7,124
- nulls
- 6,582 (92.4%)
- unique
- 18
- top_value
- Sunni
- top_rate
- 0.2952
- cardinality
- 18
- entropy
- 2.984
- entropy_ratio
- 0.7157
PCIslam
numeric feature
PCIslam appears to be a percentage-style indicator of Islamic affiliation, bounded between 0 and 100 with a near-zero null rate (0.0013). The distribution is starkly bimodal rather than continuous: 47.1% of rows are exactly zero, the median is 0.28, yet Q3 sits at 99.99, producing a kurtosis of -1.93 and an IQR spanning nearly the full range. Mean (45.2) and std (48.2) confirm the mass is piled at the extremes rather than around the center. Treatment: Treat as bimodal: consider binarizing (0 vs >0) or binning rather than using raw value in linear models.
- n
- 7,124
- nulls
- 9 (0.1%)
- unique
- 902
- min
- 0
- max
- 100
- mean
- 45.22
- median
- 0.2753
- std
- 48.22
- q1
- 0
- q3
- 99.99
- iqr
- 99.99
- skew
- 0.1703
- kurtosis
- -1.935
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.4713
PCNonReligious
numeric feature high_skew outliers
PCNonReligious appears to be the percentage of non-religious individuals in each record, but the distribution is dominated by zeros: 87.5% of values are exactly 0 and the entire IQR collapses to 0. The remaining tail stretches to 99.0 with skew of 9.1 and kurtosis of 125.3, producing 886 outliers (12.5% of rows). Mean (1.02) sits far above median (0), so any modelling that assumes symmetry will be misled. Treatment: Treat as zero-inflated; consider a binary is_nonzero flag plus a log1p transform of the positive tail.
- n
- 7,124
- nulls
- 23 (0.3%)
- unique
- 152
- min
- 0
- max
- 99
- mean
- 1.016
- median
- 0
- std
- 4.549
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 9.105
- kurtosis
- 125.3
- n_outliers
- 886
- outlier_rate
- 0.1248
- zero_rate
- 0.8752
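The zero-inflated treatment suggested for PCNonReligious (and the similar PC* religion-share columns) can be sketched as a pair of derived features. This is an illustration only: the function name is hypothetical, and it assumes the input has already been read as a float percentage with nulls represented as `None`.

```python
import math
from typing import Optional, Tuple


def zero_inflated_features(
    pct: Optional[float],
) -> Tuple[Optional[int], Optional[float]]:
    """Return (is_nonzero flag, log1p of the value) for one percentage cell."""
    if pct is None:
        return None, None  # leave nulls for a separate imputation decision
    if pct < 0:
        raise ValueError("percentage cannot be negative")
    is_nonzero = 1 if pct > 0 else 0
    # log1p(0) is exactly 0, so zeros stay at zero while the tail is compressed.
    return is_nonzero, math.log1p(pct)


print(zero_inflated_features(0.0))
print(zero_inflated_features(99.0))
```

Splitting the signal this way lets a model learn "is the religion present at all" separately from "how large is the share when it is present".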
PCUnknown
numeric feature high_skew outliers
PCUnknown is a numeric column expressing what looks like a percentage (range 0-100) of 'unknown' classification, with 92.8% of values being zero and a median/Q1/Q3 all at 0. The distribution is extremely right-skewed (skew 6.45, kurtosis 39.85) with 510 outliers (7.2%) extending up to 100. With 388 unique values and only 0.35% nulls, it carries sparse but potentially meaningful signal in the long tail. Treatment: Binarize (zero vs non-zero) or log-transform the non-zero tail before modelling.
- n
- 7,124
- nulls
- 25 (0.4%)
- unique
- 388
- min
- 0
- max
- 100
- mean
- 2.28
- median
- 0
- std
- 14.59
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 6.454
- kurtosis
- 39.85
- n_outliers
- 510
- outlier_rate
- 0.07184
- zero_rate
- 0.9282
SecurityLevel
numeric feature outliers
SecurityLevel is an ordinal/categorical code stored as numeric, with only 3 distinct values spanning 0 to 2 across 7,124 complete rows. The distribution is heavily concentrated at the top tier (median, Q1, and Q3 all equal 2.0, mean 1.595), yet 15.6% of rows are 0 and the IQR-based outlier check flags 24.9% of records, an artifact of the degenerate IQR of 0 rather than true anomalies. Strong negative skew (-1.47) confirms the mass sits at level 2. Treatment: Treat as a 3-level ordinal category (one-hot or ordered encode); ignore the outlier flag since IQR is zero.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 3
- min
- 0
- max
- 2
- mean
- 1.595
- median
- 2
- std
- 0.7442
- q1
- 2
- q3
- 2
- iqr
- 0
- skew
- -1.466
- kurtosis
- 0.4048
- n_outliers
- 1,771
- outlier_rate
- 0.2486
- zero_rate
- 0.1564
LRTop100
categorical label imbalance
Binary Y/N flag indicating membership in some 'LRTop100' set, with exactly 100 rows marked 'Y' out of 7124, strongly suggesting a curated top-100 list. The distribution is severely imbalanced (98.6% 'N', entropy ratio 0.107), which is flagged as an imbalance alert. No nulls are present. Treatment: Use as a binary indicator; if modelling, apply class-imbalance handling (stratification or reweighting).
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.986
- cardinality
- 2
- entropy
- 0.1065
- entropy_ratio
- 0.1065
PhotoAddress
text metadata one_word short_text duplicates
PhotoAddress holds single-token image filenames following a 'pXXXXX.jpg' pattern (one_word_rate 1.0, len_max 13). Coverage is partial: 1970 of 7124 rows (27.7%) are empty strings. The duplicate_rate of 0.596 overstates photo reuse, because the repeated empty string alone accounts for 1,969 of the 4,243 duplicates (counting repeats beyond the first occurrence); the remaining 2,274 reflect the same photo being reused across records (e.g., p19007.jpg appears 90 times). Only 2880 unique values back 7124 rows, suggesting shared stock images or a many-to-one photo lookup rather than a per-row asset. Treatment: Treat as a file reference: drop from modelling, or join to an image table after handling the ~1970 empty strings.
- n
- 7,124
- nulls
- 1 (0.0%)
- unique
- 2,880
- len_min
- 0
- len_max
- 13
- len_mean
- 7.26
- len_median
- 10
- len_p95
- 10
- word_mean
- 1
- word_median
- 1
- n_empty
- 1,970
- n_duplicates
- 4,243
- duplicate_rate
- 0.5957
- vocab_size
- 2,879
- readability_flesch_mean
- 84.01
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
PhotoCredits
categorical metadata long_tail
PhotoCredits captures the attribution string for an associated image, with 851 distinct credits across 7124 rows. The column is dominated by missing-style values: 1970 rows (27.7%) are empty strings and another 1496 are 'Anonymous', so just under half of rows (about 49%) lack a real attribution. The remaining tail is long and idiosyncratic, mixing organisations ('Operation China, Asia Harvest'), individuals ('Isudas', 'Kerry Olson'), and platform tags ('Steve Evans - Flickr'). Treatment: Treat empty and 'Anonymous' as missing and keep only as provenance metadata; do not use as a model feature.
- n
- 7,124
- nulls
- 10 (0.1%)
- unique
- 851
- top_value
- '' (empty string)
- top_rate
- 0.2769
- cardinality
- 851
- entropy
- 5.584
- entropy_ratio
- 0.5737
PhotoCreditURL
categorical metadata long_tail null_rate
URL string crediting the source of an associated photo, dominated by a single domain (asiaharvest.org appears 443 times) alongside a long tail of 774 distinct values. 36% of rows are null and the empty string is the top value at 43.21% of non-nulls (about 1,970 rows, or 27.7% of all rows), so together roughly 64% of rows carry no usable credit. Remaining values mix organisational sites (newcovenantmissions, createinternational), shorteners (tinyurl), and stock-photo hosts (pixabay, pxhere, flickr). Treatment: Drop for modelling; if provenance matters, parse to domain and treat as a low-coverage attribution field.
- n
- 7,124
- nulls
- 2,565 (36.0%)
- unique
- 774
- top_value
- '' (empty string)
- top_rate
- 0.4321
- cardinality
- 774
- entropy
- 5.389
- entropy_ratio
- 0.5616
PhotoCreativeCommons
categorical feature
Binary Y/N flag indicating whether a photo carries a Creative Commons licence. The vast majority (top_rate 0.7981) are 'N' with only 1437 'Y' values, and nulls are negligible (null_rate 0.0007). Class imbalance is notable but not extreme. Treatment: Encode as a 0/1 boolean; impute the handful of nulls with the mode 'N'.
- n
- 7,124
- nulls
- 5 (0.1%)
- unique
- 2
- top_value
- N
- top_rate
- 0.7981
- cardinality
- 2
- entropy
- 0.7256
- entropy_ratio
- 0.7256
PhotoCopyright
categorical feature
Binary Y/N flag indicating whether a photo carries copyright restrictions, with 'N' dominating at 80.6% of 7,124 rows and only 2 unique values. Nulls are negligible (0.17%) and entropy ratio of 0.71 reflects the moderate class imbalance. No anomalies beyond the expected skew toward unrestricted photos. Treatment: Encode as a boolean (Y=1, N=0) and impute the handful of nulls with the mode.
- n
- 7,124
- nulls
- 12 (0.2%)
- unique
- 2
- top_value
- N
- top_rate
- 0.8064
- cardinality
- 2
- entropy
- 0.709
- entropy_ratio
- 0.709
PhotoPermission
categorical feature
Binary opt-in flag for photo permission, stored as 'Y'/'N'. The column is heavily skewed toward 'N' at 80.4% (5715 of the 7110 non-null rows), with a near-zero null rate of 0.2%. Watch the case inconsistency: 2 records use lowercase 'y' alongside 1393 uppercase 'Y', which is why the profile reports 3 unique values, and case-sensitive joins or filters will miscount. Treatment: Normalize case (upper) and encode as boolean before modelling.
- n
- 7,124
- nulls
- 14 (0.2%)
- unique
- 3
- top_value
- N
- top_rate
- 0.8038
- cardinality
- 3
- entropy
- 0.7173
- entropy_ratio
- 0.4526
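The case normalisation recommended for PhotoPermission applies equally to the other Y/N flags in this report. A minimal sketch, assuming the only legitimate values are 'Y' and 'N' in any case and that anything else should be treated as missing; `yn_to_bool` is a hypothetical helper name.

```python
from typing import Optional


def yn_to_bool(raw: Optional[str]) -> Optional[bool]:
    """Map 'Y'/'y' to True and 'N'/'n' to False; everything else to None."""
    if raw is None:
        return None
    text = raw.strip().upper()
    if text == "Y":
        return True
    if text == "N":
        return False
    return None  # blank or unexpected value: keep it visible as missing


print([yn_to_bool(v) for v in ["Y", "y", "N", "", None]])
```

Returning `None` for unexpected values, rather than defaulting to `False`, keeps data-entry problems from silently inflating the 'N' class.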
ProfileTextExists
categorical feature imbalance
A binary flag indicating whether a profile text exists, with values Y/N. The column is severely imbalanced: 6888 of 7124 rows (96.7%) are Y, leaving only 236 N, yielding a low entropy ratio of 0.21. Treatment: Encode as a 0/1 indicator but expect minimal predictive signal due to severe class imbalance.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.9669
- cardinality
- 2
- entropy
- 0.2098
- entropy_ratio
- 0.2098
CountOfCountries
numeric feature high_skew outliers
Likely a per-row count of distinct countries associated with each record, ranging from 1 to 164 across 7124 rows with no nulls. The distribution is severely right-skewed (skew 5.67, kurtosis 33.17): the median is just 2 and Q3 is 4, yet the mean is 8.11 and 16.98% of rows flag as outliers. A long tail of high-country records is dragging the mean far above typical values. Treatment: Log-transform or cap at a high quantile before modelling to tame the heavy right tail.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 39
- min
- 1
- max
- 164
- mean
- 8.108
- median
- 2
- std
- 24.27
- q1
- 1
- q3
- 4
- iqr
- 3
- skew
- 5.672
- kurtosis
- 33.17
- n_outliers
- 1,210
- outlier_rate
- 0.1698
- zero_rate
- 0
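The cap-then-log treatment for CountOfCountries might look like the sketch below. The nearest-rank percentile and the 99th-percentile default are illustrative choices, not something the profile prescribes, and both function names are hypothetical.

```python
import math
from typing import List


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cap_and_log(values: List[float], pct: float = 99.0) -> List[float]:
    """Cap each value at the chosen percentile, then apply log1p."""
    cap = nearest_rank_percentile(values, pct)
    return [math.log1p(min(v, cap)) for v in values]


# Hypothetical sample shaped like the column: mostly small, one extreme count.
sample = [1, 1, 2, 2, 4, 164]
print(nearest_rank_percentile(sample, 80))
print(cap_and_log(sample, pct=80))
```

Capping first limits the influence of the handful of groups present in well over a hundred countries, while the log keeps the ordering among the typical small counts.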
CountOfProvinces
unknown other skipped
The profiler (Saturn) skipped CountOfProvinces, so beyond a row count of 7124 and zero nulls, no distributional evidence is available. The name suggests an integer count of provinces per record, but unique count, range, and summary stats are all missing. Without further inspection the column's actual content and cardinality cannot be confirmed. Treatment: Re-profile or manually inspect this column before use; Saturn skipped it.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- —
Longitude
numeric feature
Geographic longitude coordinates spanning nearly the full global range, from -173.08 to 178.44 degrees. The distribution is heavily left-skewed (-1.40) with a median of 75.23 sitting well above the mean of 62.80, suggesting concentration in eastern hemisphere locations with a tail of western-hemisphere points. About 4.4% of values (316 rows) fall outside the IQR fences. Treatment: Pair with latitude for geospatial features; consider clustering or binning rather than treating as a raw scalar.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 6,713
- min
- -173.1
- max
- 178.4
- mean
- 62.8
- median
- 75.23
- std
- 44.79
- q1
- 40.81
- q3
- 88.22
- iqr
- 47.41
- skew
- -1.402
- kurtosis
- 2.859
- n_outliers
- 316
- outlier_rate
- 0.04436
- zero_rate
- 0
Latitude
numeric feature
Latitude values for 7124 rows spanning -42.61 to 71.84 with a median of 25.02, consistent with geographic latitudes in degrees. Distribution leans toward northern hemisphere (mean 23.54, skew -0.70) with 292 outliers (4.1%) likely representing far-southern or far-northern records. No nulls and 6696 unique values suggest near-record-level coordinates. Treatment: Pair with longitude for geospatial features; avoid relying on standard scaling alone, since coordinates are bounded angles and differences in degrees are not ground distances (a degree of longitude covers less ground toward the poles).
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 6,696
- min
- -42.61
- max
- 71.84
- mean
- 23.54
- median
- 25.02
- std
- 14.92
- q1
- 15.55
- q3
- 31.61
- iqr
- 16.06
- skew
- -0.702
- kurtosis
- 2.141
- n_outliers
- 292
- outlier_rate
- 0.04099
- zero_rate
- 0
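When Latitude and Longitude are paired for distance-based features, a great-circle formula is one common option. The sketch below uses the haversine formula with a spherical Earth of radius 6371 km; both the spherical model and the function name are assumptions made for illustration, not part of the profile.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of longitude at the equator versus at 60 degrees north.
print(haversine_km(0, 0, 0, 1))
print(haversine_km(60, 0, 60, 1))
```

The second distance is about half the first, which is why raw longitude differences should not be treated as distances across a dataset whose latitudes run from about -43 to 72 degrees.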
Ctry
categorical feature
Country-of-origin or location field with 202 distinct values across 7,124 rows and zero nulls. India dominates at 28.5% (2,032 rows), followed by Pakistan (767) and China (442); the long tail spans nearly 200 further countries with entropy ratio 0.66, indicating concentrated but globally distributed coverage. The South/Central Asia skew is the headline surprise: five of the top six values are Asian. Treatment: Group rare countries into an 'Other' bucket or encode by region before modelling to avoid 200-way one-hot blow-up.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 202
- top_value
- India
- top_rate
- 0.2852
- cardinality
- 202
- entropy
- 5.058
- entropy_ratio
- 0.6605
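Grouping rare countries into an 'Other' bucket can be sketched with a frequency threshold. The threshold value, the bucket label, and the function name `lump_rare` are all illustrative assumptions; the sample data below is made up and is not taken from the dataset.

```python
from collections import Counter
from typing import List


def lump_rare(values: List[str], min_count: int, other: str = "Other") -> List[str]:
    """Replace categories seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical toy sample, not real counts.
sample = ["India"] * 5 + ["Pakistan"] * 3 + ["Fiji"]
print(Counter(lump_rare(sample, min_count=2)))
```

In practice the threshold should be fitted on training data only and then reused, so that a country's bucket does not change between training and scoring.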
IndigenousCode
categorical feature
A binary Y/N flag indicating Indigenous status, fully populated across all 7124 rows. The distribution is imbalanced: 'Y' accounts for 79.4% (5657 rows) versus 1467 'N' rows, which is a notable skew to keep in mind for any stratified analysis. Treatment: Encode as a binary indicator and watch for class imbalance when used as a predictor or stratifier.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.7941
- cardinality
- 2
- entropy
- 0.7336
- entropy_ratio
- 0.7336
PercentAdherents
categorical feature long_tail
PercentAdherents appears to be a numeric measure (likely a percentage or rate of religious adherents) stored as strings, with 692 distinct values across 7,124 rows and no nulls. It is dominated by '0.000', which accounts for 56.2% of records; that concentration at zero, despite a long tail of other values, holds the entropy ratio down to 0.43. The three-decimal formatting of both whole numbers like '5.000' and fractions like '0.200' suggests these are raw values rather than binned categories. Treatment: Cast to float and treat as a zero-inflated numeric feature rather than a category.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 692
- top_value
- 0.000
- top_rate
- 0.5625
- cardinality
- 692
- entropy
- 4.046
- entropy_ratio
- 0.4288
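Casting the string-typed percentage columns (PercentAdherents and the PercentChristian/PercentEvangelical family) to floats is safer with explicit validation. The sketch below assumes values are meant to be percentages in the range 0 to 100 and treats anything else as missing; that range check and the name `parse_percent` are assumptions.

```python
from typing import Optional


def parse_percent(raw: Optional[str]) -> Optional[float]:
    """Convert a numeric string such as '0.200' to float, or None if unusable."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        return None  # non-numeric text
    if not 0.0 <= value <= 100.0:  # also rejects NaN and infinities
        return None
    return value


print([parse_percent(v) for v in ["0.000", "5.000", "0.200", "", None]])
```

Counting how many cells come back as `None` after the cast gives a quick check that no unexpected formats were hiding in the column.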
PercentChristianPC
categorical feature
Stored as a categorical but the 184 distinct values are numeric strings ranging from '0.000' to figures like '8.571', suggesting this is a percent-Christian metric cast as text. The PC suffix is not defined in the profile; given the low cardinality, it plausibly marks a value computed at a coarser grouping level (such as a people cluster) and repeated on every member row. The distribution is concentrated: '0.482' alone covers 12.2% of 7124 rows and the top 10 values account for a large share, yet entropy ratio of 0.79 indicates the long tail still carries information. No nulls, and the repeated exact decimals hint at a lookup or pre-aggregated source rather than raw per-row measurements. Treatment: Cast to float and treat as a continuous feature; investigate the heavy spike at 0.482 before modelling.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 184
- top_value
- 0.482
- top_rate
- 0.122
- cardinality
- 184
- entropy
- 5.934
- entropy_ratio
- 0.7887
NaturalName
text label one_word duplicates
Short ethnonym/community labels (e.g., 'Deaf', 'Turk', 'Persian', 'Japanese'), averaging 11.8 characters and 1.7 words with a median of 2 words. About 34% of rows are duplicates (2,419) and ~49% are single-word entries, with 4,705 unique values across 7,124 rows. Surprising signals: 'Deaf' tops the list at 151 occurrences, and top words include parenthetical religious qualifiers like 'traditions)', '(hindu', '(muslim' (952/477/411), suggesting many entries carry trailing tradition tags that the tokenizer split awkwardly. Treatment: Normalize casing and strip parenthetical tradition suffixes, then treat as a categorical label.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 4,705
- len_min
- 1
- len_max
- 39
- len_mean
- 11.84
- len_median
- 10
- len_p95
- 27
- word_mean
- 1.723
- word_median
- 2
- n_empty
- 0
- n_duplicates
- 2,419
- duplicate_rate
- 0.3396
- vocab_size
- 4,343
- readability_flesch_mean
- 56.42
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.4885
- allcaps_rate
- 0
- boilerplate_rate
- 0
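Stripping the trailing tradition tags from NaturalName could be done with a single regular expression. This sketch assumes the tag is always one parenthesised group at the very end of the name; the example names are hypothetical and the helper names are not from the source.

```python
import re

# One parenthesised group at the end of the string, with optional whitespace.
_TRAILING_PAREN = re.compile(r"\s*\([^()]*\)\s*$")


def strip_tradition_suffix(name: str) -> str:
    """Remove a trailing '(...)' qualifier, e.g. '(Hindu traditions)'."""
    return _TRAILING_PAREN.sub("", name).strip()


def label_key(name: str) -> str:
    """Case-insensitive grouping key for the cleaned label."""
    return strip_tradition_suffix(name).casefold()


# Hypothetical examples of the pattern described above.
print(strip_tradition_suffix("Rajput (Hindu traditions)"))
print(label_key("Rajput (Muslim traditions)"))
```

Note that stripping the suffix merges groups that differ only by tradition, so it is worth keeping the removed tag in a separate column if that distinction matters downstream.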
NaturalPronunciation
text metadata one_word null_rate duplicates
Phonetic respellings of ethnonyms: short hyphenated pronunciation guides like 'PUR-zhun', 'jae-puh-NEEZ', and 'pahsh-TOON', presumably accompanying NaturalName. Values are overwhelmingly single tokens (one_word_rate 0.73, word_mean 1.28, len_mean 10.8) and 48.5% are null, so coverage is partial. Duplicates dominate (n_duplicates 2183, duplicate_rate 0.59) with only 1489 unique forms across the 3,672 non-null rows, suggesting a small controlled vocabulary repeated across records. Treatment: Treat as an optional pronunciation lookup keyed to the parent term; exclude from modelling rather than imputing, given ~48% nulls.
- n
- 7,124
- nulls
- 3,452 (48.5%)
- unique
- 1,489
- len_min
- 2
- len_max
- 42
- len_mean
- 10.77
- len_median
- 10
- len_p95
- 21
- word_mean
- 1.281
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 2,183
- duplicate_rate
- 0.5945
- vocab_size
- 1,537
- readability_flesch_mean
- 69.93
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.7271
- allcaps_rate
- 0.0005447
- boilerplate_rate
- 0
PercentChristianPGAC
categorical feature
This column appears to be the percentage of Christians measured at the People Group Across Countries (PGAC) level, stored as strings with three-decimal precision rather than as a numeric type. It is heavily zero-inflated: '0.000' accounts for 43.8% of the 7,124 rows (3,121 occurrences), and the specific value '3.733' is the second mode at 151 rows. That count equals the 151 'Deaf' rows in NaturalName, which would be consistent with one PGAC-level figure repeated for every country where the group appears. With 842 distinct values and entropy ratio 0.58, the distribution is concentrated but long-tailed, and the null rate is negligible at 0.07%. Treatment: Cast to numeric and consider a zero-inflated transform (e.g., log1p with a zero indicator) before modelling.
- n
- 7,124
- nulls
- 5 (0.1%)
- unique
- 842
- top_value
- 0.000
- top_rate
- 0.4384
- cardinality
- 842
- entropy
- 5.681
- entropy_ratio
- 0.5846
PercentEvangelical
categorical feature long_tail
PercentEvangelical reads as a numeric share of evangelicals stored as strings, with 401 distinct values across 7124 rows. The distribution is heavily zero-inflated: 65.7% of non-null values (roughly 4,195 rows) are exactly "0.000" and 10.4% of rows are null, leaving a long tail of small fractions like 0.100, 0.200, 0.500. Entropy ratio of 0.364 confirms most of the signal collapses onto that single zero bucket. Treatment: Cast to float, impute the 10.4% nulls, and consider a zero-vs-nonzero indicator alongside the raw value to handle the zero inflation.
- n
- 7,124
- nulls
- 741 (10.4%)
- unique
- 401
- top_value
- 0.000
- top_rate
- 0.6572
- cardinality
- 401
- entropy
- 3.146
- entropy_ratio
- 0.3638
PercentEvangelicalPC
categorical feature
PercentEvangelicalPC appears to be a numeric percentage (likely an evangelical population share) that has been stored as strings, yielding 166 distinct values across 7124 rows with a 2.15% null rate. The PC suffix is not defined in the profile; the low cardinality suggests a value computed at a coarser grouping level (such as a people cluster) and repeated across member rows. The distribution is concentrated: the top value '0.199' covers 12.47% of non-null values, and the leading entries cluster near zero ('0.095', '0.000', '0.004') yet some values reach above 3 ('3.409', '3.339'), suggesting a long right tail. Entropy ratio of 0.78 indicates moderate concentration rather than uniformity. Treatment: Cast to float, impute the ~2% nulls, and consider log or rank transform given the right-tailed values.
- n
- 7,124
- nulls
- 153 (2.1%)
- unique
- 166
- top_value
- 0.199
- top_rate
- 0.1247
- cardinality
- 166
- entropy
- 5.777
- entropy_ratio
- 0.7833
PercentEvangelicalPGAC
categorical feature
Numeric percentages (likely the evangelical share at the People Group Across Countries level) stored as strings, hence profiled as categorical with 548 distinct values. The distribution is heavily zero-inflated: '0.000' accounts for 48.9% of the 6,674 non-null values (about 3,264 rows), with a secondary spike at '1.801' (151 rows). That count equals the 151 'Deaf' rows in NaturalName, so the spike is more likely a single PGAC-level figure repeated across countries than a data error. Null rate is 6.32% and entropy ratio is 0.55, consistent with a long tail of small fractional values. Treatment: Cast to float, impute or flag nulls, and consider a zero-indicator plus log/sqrt transform given the heavy zero mass.
- n
- 7,124
- nulls
- 450 (6.3%)
- unique
- 548
- top_value
- 0.000
- top_rate
- 0.4891
- cardinality
- 548
- entropy
- 4.972
- entropy_ratio
- 0.5465
PCBuddhism
numeric feature high_skew outliers
PCBuddhism appears to be a percentage feature measuring the Buddhist share of each record (presumably of the people group's population, in line with the other PC* religion columns), ranging 0 to 100 with mean 6.41. The distribution is extremely zero-inflated: 82.99% of rows are exactly 0 and the entire IQR collapses to 0, so every non-zero value (17.01% of rows) is flagged as an outlier, with skew 3.48 and kurtosis 10.56. This means Buddhism is absent from most records but reaches sizeable concentrations in a long tail. Treatment: Treat as zero-inflated proportion: add a presence indicator and log1p-transform the non-zero tail before modelling.
- n
- 7,124
- nulls
- 24 (0.3%)
- unique
- 809
- min
- 0
- max
- 100
- mean
- 6.411
- median
- 0
- std
- 22.39
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 3.475
- kurtosis
- 10.56
- n_outliers
- 1,208
- outlier_rate
- 0.1701
- zero_rate
- 0.8299
PCEthnicReligions
numeric feature high_skew outliers
PCEthnicReligions is a numeric percentage-style feature (0–100) capturing the share of some ethnic-religion category, likely per record/region. It is overwhelmingly zero (78% of values are 0 and the entire interquartile range collapses to 0), yet the mean is 13.1 with std 30.7, indicating a small set of records carry very large shares. Skew of 2.16 and a 22% outlier rate confirm a sparse, heavy-tailed distribution rather than a smooth continuum. Treatment: Binarize (zero vs non-zero) or apply a zero-inflated/log1p transform before modelling.
- n
- 7,124
- nulls
- 18 (0.3%)
- unique
- 351
- min
- 0
- max
- 100
- mean
- 13.11
- median
- 0
- std
- 30.74
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 2.155
- kurtosis
- 2.885
- n_outliers
- 1,560
- outlier_rate
- 0.2195
- zero_rate
- 0.7805
PCHinduism
numeric feature
This column appears to be the percentage share of Hindus within each record's unit (most plausibly the people group), ranging from 0 to 100 with a mean of 29.8. The distribution is effectively bimodal: 67.7% of rows are exactly zero while Q3 sits at 98.4, indicating that most units have no Hindu presence and a substantial minority are almost entirely Hindu. Skew is 0.87 and kurtosis -1.22, consistent with this U-shaped split rather than a single peak. Treatment: Consider a zero-vs-nonzero indicator plus the raw percentage, since a flat numeric treatment will hide the bimodal structure.
- n
- 7,124
- nulls
- 24 (0.3%)
- unique
- 1,131
- min
- 0
- max
- 100
- mean
- 29.82
- median
- 0
- std
- 44.98
- q1
- 0
- q3
- 98.42
- iqr
- 98.42
- skew
- 0.8721
- kurtosis
- -1.216
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.6768
PCOtherSmall
numeric feature high_skew outliers
PCOtherSmall is a numeric feature where 88% of rows are zero and the IQR is zero, meaning at least the bottom three quarters of values are 0. The remaining mass stretches up to 100 with mean 1.84 and std 12.33, producing severe right skew (7.39) and very heavy tails (kurtosis 54.18). About 12% of rows (851) flag as outliers, suggesting this is a sparse share/percentage indicator that is non-zero only for a small subset of records. Treatment: Binarize presence (>0) or apply log1p before modelling to tame the skew.
- n
- 7,124
- nulls
- 24 (0.3%)
- unique
- 670
- min
- 0
- max
- 100
- mean
- 1.836
- median
- 0
- std
- 12.33
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 7.39
- kurtosis
- 54.18
- n_outliers
- 851
- outlier_rate
- 0.1199
- zero_rate
- 0.8801
RegionCode
numeric feature outliers
RegionCode holds 12 distinct integer values from 1 to 12 with no nulls, so it is almost certainly a categorical region identifier stored as a number rather than a true numeric measure. The distribution is concentrated around the median of 4 with an IQR of just 2, yet the right skew of 1.12 and 601 flagged outliers (8.4%) reflect the long tail of higher-numbered regions rather than genuine anomalies. Treatment: Cast to categorical and one-hot or target-encode; do not treat as a continuous numeric.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 12
- min
- 1
- max
- 12
- mean
- 5.005
- median
- 4
- std
- 2.457
- q1
- 4
- q3
- 6
- iqr
- 2
- skew
- 1.122
- kurtosis
- 0.5775
- n_outliers
- 601
- outlier_rate
- 0.08436
- zero_rate
- 0
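One way to apply the one-hot treatment is sketched below. The function name is hypothetical, and the 1 to 12 range is taken from the reported min and max.

```python
def one_hot_region(code: int, n_regions: int = 12) -> list:
    """Encode an integer region code (1..n_regions) as a one-hot vector."""
    if not 1 <= code <= n_regions:
        raise ValueError(f"region code {code} outside 1..{n_regions}")
    return [1 if i == code else 0 for i in range(1, n_regions + 1)]

encoded = one_hot_region(4)
```

Encoding this way stops a model from reading region 12 as "three times" region 4, which the raw integer would imply.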
PopulationPGAC
numeric feature high_skew outliers
PopulationPGAC appears to be a population count for the PGAC unit (most plausibly the people group's total across all countries), spanning 10 to roughly 925 million across 7,124 rows with only 0.07% nulls. The distribution is extraordinarily right-skewed (skew 25.5, kurtosis 1052): the median is 130,300 while the mean is 4.88 million, and 17.8% of rows flag as outliers. With 1,509 unique values across 7,124 rows, the same population figures repeat heavily, suggesting many rows share the same aggregate. Treatment: log-transform before regression to tame the extreme right skew.
- n
- 7,124
- nulls
- 5 (0.1%)
- unique
- 1,509
- min
- 10
- max
- 9.251e+08
- mean
- 4.881e+06
- median
- 130,300
- std
- 2.095e+07
- q1
- 20,000
- q3
- 1.435e+06
- iqr
- 1.415e+06
- skew
- 25.48
- kurtosis
- 1052
- n_outliers
- 1,264
- outlier_rate
- 0.1776
- zero_rate
- 0
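To see what the log transform does to this column, the sketch below applies log10 to the summary points reported above. The dictionary is just a convenient container for those figures; log10 is one reasonable choice, and log1p would serve equally well since the minimum is 10.

```python
import math

# Summary statistics as reported in the profile above.
reported = {
    "min": 10,
    "q1": 20_000,
    "median": 130_300,
    "q3": 1_435_000,
    "max": 925_100_000,
}

log_scale = {name: round(math.log10(value), 2) for name, value in reported.items()}
# On the log scale the five points span about eight units,
# instead of eight orders of magnitude on the raw scale.
```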
Frontier
categorical feature
Binary Y/N flag indicating whether a record is on the frontier, with no nulls across 7124 rows. The split is imbalanced toward Y at 66.9% (4767) versus N at 2357, though entropy ratio of 0.92 shows both classes are well represented. Treatment: Encode as a 0/1 indicator before modelling.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.6691
- cardinality
- 2
- entropy
- 0.9158
- entropy_ratio
- 0.9158
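The entropy figures in these lists can be reproduced from the class counts. The sketch below assumes Shannon entropy in bits, normalised by log2 of the cardinality; that definition is inferred from the reported numbers and is not documented in the source.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of classes)."""
    k = len(counts)
    if k <= 1:
        return 0.0  # a constant column carries no information
    return entropy_bits(counts) / math.log2(k)

frontier_counts = [4767, 2357]  # Y and N counts reported above
frontier_entropy = entropy_bits(frontier_counts)
frontier_ratio = entropy_ratio(frontier_counts)
```

For a binary column the maximum entropy is 1 bit, which is why entropy and entropy_ratio coincide for every Y/N flag in this report.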
MapAddress
text foreign_key one_word short_text duplicates
MapAddress holds single-token PNG filenames (e.g. 'm00328.png'), with one_word_rate of 1.0 and max length 13, suggesting it points to a map image asset. 1500 of 7124 rows are empty strings, and the duplicate_rate of 0.352 is computed over all rows, so it counts the repeated empty string as well. Excluding empties, 5,624 non-empty rows carry 4,615 distinct filenames, so about 1,009 rows (roughly 18% of non-empty rows) reuse a filename that appears elsewhere, meaning some records share the same map. This behaves like a foreign reference to a finite set of map images rather than a free-text field. Treatment: Treat as a categorical asset reference: impute or flag the 1500 empties and join to a map-image lookup.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 4,616
- len_min
- 0
- len_max
- 13
- len_mean
- 8.649
- len_median
- 10
- len_p95
- 13
- word_mean
- 1
- word_median
- 1
- n_empty
- 1,500
- n_duplicates
- 2,508
- duplicate_rate
- 0.352
- vocab_size
- 4,615
- readability_flesch_mean
- 17.62
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
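Because the reported duplicate_rate (0.352) is taken over all rows, repeated empty strings inflate it. A sketch of a duplicate count that sets empties aside first is given below; the helper name is hypothetical and 'm00001.png' is a made-up filename.

```python
from collections import Counter

def duplicate_stats(values):
    """Count repeats among non-empty values only."""
    non_empty = [v for v in values if v != ""]
    distinct = Counter(non_empty)
    repeats = len(non_empty) - len(distinct)  # rows beyond the first occurrence
    return {
        "n_empty": len(values) - len(non_empty),
        "n_non_empty": len(non_empty),
        "n_repeats": repeats,
        "repeat_rate": repeats / len(non_empty) if non_empty else 0.0,
    }

sample = ["m00328.png", "", "m00328.png", "m00001.png", "", "m00328.png"]
stats = duplicate_stats(sample)
```

Applied to the reported totals (7,124 rows, 1,500 empty, 4,616 unique including the empty string), the same arithmetic gives 5,624 - 4,615 = 1,009 repeats, or about 18% of non-empty rows.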
HasJesusFilm
categorical feature
Binary Y/N flag indicating whether the Jesus Film is available for the entity (likely a language or people group). Heavily skewed toward 'Y' at 78.7% (5,610 of 7,124), with no nulls across all 7,124 rows. Entropy of 0.746 reflects the imbalance but still leaves a usable minority class of 1,514 'N' values. Treatment: Encode as 0/1 boolean; account for class imbalance if used as a target.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.7875
- cardinality
- 2
- entropy
- 0.7463
- entropy_ratio
- 0.7463
Nomadic
categorical feature imbalance
Binary Y/N flag indicating nomadic status, with no nulls across 7124 rows. Severely imbalanced: 'N' dominates at 96.6% (6884 rows) versus only 240 'Y' cases, yielding a low entropy ratio of 0.21. Treatment: Encode as binary; consider class-weighting or stratified sampling due to severe imbalance.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.9663
- cardinality
- 2
- entropy
- 0.2126
- entropy_ratio
- 0.2126
NomadicTypeDescription
categorical feature null_rate
This is a low-cardinality categorical describing the type of nomadic livelihood, with only 6 distinct values dominated by 'Agro-Pastoralists' (76.7% of non-nulls, 184 records). The column is almost entirely empty: null_rate is 0.9663, leaving 240 populated rows out of 7124. The 6,884 nulls equal the number of 'N' rows in Nomadic, so the missingness looks structural (not applicable) rather than unknown. Several values are comma-joined combinations (e.g., 'Agro-Pastoralists, Service or Trade'), suggesting the field encodes multi-label memberships as concatenated strings. Treatment: Split comma-separated values into multi-hot indicators and treat missingness as its own category given the 96.6% null rate.
- n
- 7,124
- nulls
- 6,884 (96.6%)
- unique
- 6
- top_value
- Agro-Pastoralists
- top_rate
- 0.7667
- cardinality
- 6
- entropy
- 1.159
- entropy_ratio
- 0.4483
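A minimal sketch of the multi-hot split follows. Only the two labels quoted in the description are used; the full label set would have to come from the data, and the function names are illustrative.

```python
from typing import Optional

def split_labels(value: Optional[str]) -> set:
    """Split a comma-joined value into a set of trimmed labels."""
    if value is None:
        return set()
    return {part.strip() for part in value.split(",") if part.strip()}

def multi_hot(value: Optional[str], labels: list) -> dict:
    """One 0/1 indicator per known label, plus a missing flag for nulls."""
    present = split_labels(value)
    row = {label: int(label in present) for label in labels}
    row["is_missing"] = int(value is None)
    return row

labels = ["Agro-Pastoralists", "Service or Trade"]
combined = multi_hot("Agro-Pastoralists, Service or Trade", labels)
single = multi_hot("Agro-Pastoralists", labels)
missing = multi_hot(None, labels)
```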
PhotoCCVersionText
categorical metadata
Creative Commons license version attached to a photo (e.g., 'CC BY 2.0', 'CC BY-SA 4.0'). The field is dominated by empty strings at 79.8% of 7124 rows, with only 16 distinct values and entropy ratio 0.33, so license metadata is missing for the vast majority of records. Among populated values, 'CC BY 2.0' (387) and 'CC BY-SA 4.0' (246) lead. Treatment: Treat empty string as missing and group rare licenses; use as a low-cardinality categorical only where photo licensing matters.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 16
- top_value
- (empty string)
- top_rate
- 0.7984
- cardinality
- 16
- entropy
- 1.323
- entropy_ratio
- 0.3307
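The "treat empty as missing, group rare licenses" step might look like the sketch below. The threshold, the 'Other' bucket name, and the sample composition are assumptions chosen for illustration.

```python
from collections import Counter

def group_rare(values, min_count, other="Other"):
    """Map '' to None and fold values seen fewer than min_count times into `other`."""
    counts = Counter(v for v in values if v != "")
    grouped = []
    for v in values:
        if v == "":
            grouped.append(None)          # empty string stands in for missing
        elif counts[v] >= min_count:
            grouped.append(v)
        else:
            grouped.append(other)
    return grouped

# Illustrative sample; 'CC0 1.0' is used here only as a rare value.
sample = ["CC BY 2.0", "CC BY 2.0", "CC BY 2.0", "CC BY-SA 4.0", "CC BY-SA 4.0", "", "CC0 1.0"]
grouped = group_rare(sample, min_count=2)
```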
PhotoCCVersionURL
categorical metadata
This column holds the URL of the Creative Commons license version applied to an associated photo, drawn from a fixed set of 16 distinct license URIs. About 79.8% of rows (5688 of 7124) are empty strings, so the field is sparsely populated; among the populated minority, CC BY 2.0 (387) and CC BY-SA 4.0 (246) dominate. Entropy ratio of 0.33 confirms heavy concentration on the blank value. Treatment: Treat empty strings as missing and collapse to a categorical license code before any modelling.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 16
- top_value
- (empty string)
- top_rate
- 0.7984
- cardinality
- 16
- entropy
- 1.323
- entropy_ratio
- 0.3307
MapCredits
categorical metadata long_tail
Attribution string crediting the data, geography, and design sources for each map (e.g. Joshua Project, GMI, UNESCO, IMB). With 161 distinct values across 7124 rows, the top credit covers 28% of records and a blank string is the second most common value at 1505 rows; near-duplicates differing only by trailing punctuation (the same Omid/UNESCO credit appears with and without a final period) inflate cardinality. Treatment: Normalise whitespace/punctuation to collapse near-duplicates, then drop from modelling as boilerplate provenance.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 161
- top_value
- People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project
- top_rate
- 0.28
- cardinality
- 161
- entropy
- 3.318
- entropy_ratio
- 0.4527
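The normalisation step can be sketched as below, under the assumption that collapsing whitespace and stripping trailing periods is enough to merge the variants described. Other kinds of variation (casing, reordered credits) would need further rules.

```python
import re

def normalise_credit(text: str) -> str:
    """Collapse runs of whitespace and drop trailing periods."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(".").rstrip()

# The top value from the list above, with and without a final period.
without_period = "People Group data: Omid. Map geography: UNESCO / GMI. Map Design: Joshua Project"
with_period = without_period + "."

same = normalise_credit(without_period) == normalise_credit(with_period)
```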
MapCreditURL
categorical metadata long_tail imbalance
This column holds attribution URLs for source maps, but 6919 of 7124 rows (top_rate 0.9712) are empty strings, leaving only 31 distinct values across the entire dataset. Among the populated entries, cartomission.com dominates with 100 occurrences while most other domains appear fewer than 10 times, producing a very long tail. Entropy ratio of 0.054 confirms there is almost no information here unless the empty string itself is treated as a meaningful 'no credit' signal. Treatment: Keep as provenance metadata; do not use as a model feature given 97% blanks.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 31
- top_value
- (empty string)
- top_rate
- 0.9712
- cardinality
- 31
- entropy
- 0.27
- entropy_ratio
- 0.05449
MapCopyright
categorical feature
A near-binary flag (N/Y) with a third state being an empty string, almost certainly indicating whether map copyright applies. 'N' dominates at 72.95% (5197/7124), blanks account for 1885 rows, and only 42 records are 'Y', a severe class imbalance that makes the affirmative case nearly negligible. Treatment: Normalize blanks to a missing/'N' category and treat as a low-signal binary flag given the 42-row positive class.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- N
- top_rate
- 0.7295
- cardinality
- 3
- entropy
- 0.8831
- entropy_ratio
- 0.5572
MapCCVersionText
categorical metadata imbalance
This appears to be a Creative Commons license version field for maps, but it is effectively empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving only 10 rows with actual licenses split across CC BY-SA 3.0 (8), CC0 1.0 (1), and CC BY 3.0 (1). Entropy is just 0.0166 (entropy_ratio 0.0083), so the column carries almost no information despite having 0% nulls; the missingness is encoded as empty strings rather than NaN. Treatment: Drop; near-constant blank with only 10 informative rows.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 4
- top_value
- (empty string)
- top_rate
- 0.9986
- cardinality
- 4
- entropy
- 0.01662
- entropy_ratio
- 0.00831
MapCCVersionURL
categorical metadata imbalance
MapCCVersionURL appears to hold a Creative Commons license URL associated with each map record, but it is essentially empty: 7114 of 7124 rows (top_rate 0.9986) carry the blank string, leaving just 10 rows split across three CC license URLs. Entropy is 0.017 (ratio 0.008), so the column carries almost no information despite having 4 distinct values and zero nulls (the missingness is encoded as "" rather than null). Treatment: Drop; near-constant with empty-string standing in for missing.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 4
- top_value
- (empty string)
- top_rate
- 0.9986
- cardinality
- 4
- entropy
- 0.01662
- entropy_ratio
- 0.00831
JF
categorical feature
JF is a binary Y/N flag with no nulls across 7124 rows. The distribution is imbalanced: "Y" accounts for 78.7% (5610 rows) versus 1514 "N", giving an entropy ratio of 0.746. The column name is opaque, but the counts and entropy match HasJesusFilm exactly, so JF is most likely a duplicate of that Jesus Film flag; a row-by-row comparison would confirm this. Treatment: Encode as a 0/1 indicator; if it proves identical to HasJesusFilm, keep only one of the two. Consider class imbalance if used as a target.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.7875
- cardinality
- 2
- entropy
- 0.7463
- entropy_ratio
- 0.7463
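JF and HasJesusFilm report identical counts (5,610 'Y' and 1,514 'N'), which hints at redundancy, but matching totals do not prove matching rows. A small check along these lines would settle it; the helper and the toy lists are illustrative, not taken from the dataset.

```python
def find_mismatches(left, right):
    """Return the row indices where two equally long columns differ."""
    if len(left) != len(right):
        raise ValueError("columns must have the same length")
    return [i for i, (a, b) in enumerate(zip(left, right)) if a != b]

# Toy columns standing in for JF and HasJesusFilm.
jf = ["Y", "N", "Y", "Y"]
has_jesus_film = ["Y", "N", "Y", "Y"]
shuffled = ["Y", "Y", "N", "Y"]  # same counts, different rows

identical = find_mismatches(jf, has_jesus_film) == []
differing_rows = find_mismatches(jf, shuffled)
```

The `shuffled` case shows why equal marginal counts alone are not sufficient evidence.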
AudioRecordings
categorical feature
Binary Y/N flag indicating whether audio recordings exist for each row, with no nulls across 7124 records. The distribution is heavily imbalanced toward 'Y' at 86.9% (6188 vs 936), giving an entropy ratio of 0.56. Treatment: Encode as a 0/1 indicator; be mindful of class imbalance if used as a target.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.8686
- cardinality
- 2
- entropy
- 0.5612
- entropy_ratio
- 0.5612
Window1040
categorical feature
Window1040 is a binary Y/N flag covering all 7124 rows with no nulls. The distribution is imbalanced: 'Y' accounts for 5910 rows (top_rate 0.8296) versus 1214 'N', giving an entropy ratio of 0.659. The name most plausibly refers to the 10/40 Window, the missions term for the band between roughly 10 and 40 degrees north latitude, so 'Y' would mark groups located inside it; the profile alone cannot confirm this. Either way it behaves like a clean indicator variable. Treatment: Encode as a 0/1 indicator and watch for class imbalance when used as a predictor.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.8296
- cardinality
- 2
- entropy
- 0.6586
- entropy_ratio
- 0.6586
PeopleGroupMapURL
text metadata one_word url_heavy duplicates
This column holds URLs to people-group map images hosted on joshuaproject.net, with every non-empty value being a single token (one_word_rate 1.0, url_rate 0.79). 1,500 of 7,124 rows are empty strings and 2,508 are duplicates (duplicate_rate 0.35); that count includes the repeated empty string, leaving about 1,009 non-empty rows that reuse a map image (e.g., m00328.png appears 40 times). The unique, empty, and duplicate counts match MapAddress exactly, so this is probably the same filename with a URL prefix. With 4,616 unique values across 7,124 rows, this is a reference link rather than a unique key. Treatment: Keep as a display/reference URL; drop from modelling features.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 4,616
- len_min
- 0
- len_max
- 66
- len_mean
- 50.49
- len_median
- 63
- len_p95
- 66
- word_mean
- 1
- word_median
- 1
- n_empty
- 1,500
- n_duplicates
- 2,508
- duplicate_rate
- 0.352
- vocab_size
- 4,615
- readability_flesch_mean
- -568.7
- emoji_rate
- 0
- url_rate
- 0.7894
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
PeopleGroupMapExpandedURL
text metadata one_word url_heavy duplicates
This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net, with 72.3% of rows containing a URL and every value being a single token. 1,975 rows (about 27.7%) are empty strings, and 2,793 rows (39.2%) duplicate another value. That figure includes the repeated empty string; about 819 non-empty rows reuse a PDF (e.g. m00328.pdf appears 40 times), suggesting many people groups share the same regional map. Treatment: Treat as a reference link; drop from modelling or extract the map ID if joining to a maps table.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 4,331
- len_min
- 0
- len_max
- 66
- len_mean
- 46.2
- len_median
- 63
- len_p95
- 66
- word_mean
- 1
- word_median
- 1
- n_empty
- 1,975
- n_duplicates
- 2,793
- duplicate_rate
- 0.3921
- vocab_size
- 4,330
- readability_flesch_mean
- -468.9
- emoji_rate
- 0
- url_rate
- 0.7228
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
PeopleGroupURL
text identifier near_unique one_word url_heavy
This column holds Joshua Project people-group URLs, one per row, with every value a 48-character single-token https link (url_rate 1.0, one_word_rate 1.0, len_min and len_max both 48). All 7124 values are unique with zero nulls or duplicates, so it functions as a per-row identifier rather than a feature. The URLs encode a people-group ID and a country code suffix (e.g., /10375/tz, /10375/up), meaning the same group recurs across countries in the underlying key even though the full URL is unique. Treatment: Drop from modelling; retain as a row-level link key or parse out the people-group ID and country code as separate features.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 7,124
- len_min
- 48
- len_max
- 48
- len_mean
- 48
- len_median
- 48
- len_p95
- 48
- word_mean
- 1
- word_median
- 1
- n_empty
- 0
- n_duplicates
- 0
- duplicate_rate
- 0
- vocab_size
- 7,124
- readability_flesch_mean
- -479.9
- emoji_rate
- 0
- url_rate
- 1
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
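Parsing the ID and country code out of the URL could be done as sketched below. Only the trailing '/10375/tz' pattern comes from the description; the host and leading path in the example are placeholders, and the function name is hypothetical.

```python
from urllib.parse import urlparse

def split_people_group_url(url: str) -> tuple:
    """Return (people_group_id, country_code) from the last two path segments."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"unexpected URL shape: {url!r}")
    return segments[-2], segments[-1]

# Placeholder host and prefix; only the '/10375/tz' suffix mirrors the source.
example_url = "https://example.org/people_groups/10375/tz"
people_group_id, country_code = split_people_group_url(example_url)
```

The extracted ID can then group rows that describe the same people across several countries.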
PeopleGroupPhotoURL
text metadata one_word url_heavy duplicates
This column holds Joshua Project people-group photo URLs, with every populated cell being a single joshuaproject.net/assets/media/profiles/photos/.jpg link (url_rate 0.72, one_word_rate 1.0). 1971 of 7124 rows are empty strings (no nulls reported), and duplicate_rate is 0.60 with only 2880 unique values. Part of that is the repeated empty string, but image URLs also repeat heavily: 5,153 non-empty rows share 2,879 distinct URLs, and the top URL appears 90 times. The same photo is clearly being reused across many people-group records rather than being a unique per-row asset. Treatment: Treat as an optional asset link; replace empty strings with null (or drop the column) and do not use as a feature.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2,880
- len_min
- 0
- len_max
- 68
- len_mean
- 47.04
- len_median
- 65
- len_p95
- 65
- word_mean
- 1
- word_median
- 1
- n_empty
- 1,971
- n_duplicates
- 4,244
- duplicate_rate
- 0.5957
- vocab_size
- 2,879
- readability_flesch_mean
- -604.3
- emoji_rate
- 0
- url_rate
- 0.7233
- one_word_rate
- 1
- allcaps_rate
- 0
- boilerplate_rate
- 0
CountryURL
categorical foreign_key
Country-level URLs pointing to joshuaproject.net profile pages, with the 2-letter country code as the path segment. There are 202 distinct countries across 7,124 rows and no nulls, but the distribution is heavily concentrated: India alone accounts for 28.5% of rows (2,032), with Pakistan (767) a distant second. Entropy ratio of 0.66 confirms moderate skew toward a handful of South Asian countries. Treatment: Extract the trailing country code as a categorical key; treat the URL itself as redundant metadata.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 202
- top_value
- https://joshuaproject.net/countries/IN
- top_rate
- 0.2852
- cardinality
- 202
- entropy
- 5.058
- entropy_ratio
- 0.6605
JPScaleText
categorical metadata imbalance
JPScaleText is a categorical field that holds a single value, "Unreached", across all 7124 rows with no nulls. With cardinality of 1 and entropy of 0, it carries no information and cannot discriminate between records. The constant value suggests this dataset has been pre-filtered to unreached people groups only. Treatment: Drop; constant column with zero entropy.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- Unreached
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
JPScaleImageURL
categorical metadata imbalance
Every one of the 7,124 rows holds the same URL, https://joshuaproject.net/assets/img/gauge/gauge-1.png, giving a single unique value and zero entropy. This looks like a static asset link (a JP Scale gauge image) attached to each record rather than a discriminating feature. It carries no information for analysis or modelling. Treatment: Drop; constant column with a single value across all rows.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- https://joshuaproject.net/assets/img/gauge/gauge-1.png
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
Summary
text free_text one_word duplicates
Free-text English summaries describing South Asian people groups (Rajputs, Jats, Bania, Beldar, etc.), averaging 51 words with median length 316 characters. Coverage is poor: 3,167 of 7,124 rows (44%) are empty strings. The 3,439 duplicates (48% duplicate rate) are mostly those repeated empties; only about 273 non-empty rows exactly repeat another summary, leaving 3,685 unique values including the empty string. Several near-identical Rajput paragraphs differ by only a word or two, suggesting lightly edited copies of the same source text rather than independent summaries, which exact-match duplicate counts will miss. Flesch readability of 30.4 indicates fairly difficult prose. Treatment: Deduplicate near-identical entries and drop or flag the 3,167 empty rows before tokenizing and embedding.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 3,685
- len_min
- 0
- len_max
- 1,212
- len_mean
- 309.7
- len_median
- 316
- len_p95
- 793
- word_mean
- 51.26
- word_median
- 52
- n_empty
- 3,167
- n_duplicates
- 3,439
- duplicate_rate
- 0.4827
- vocab_size
- 24,501
- readability_flesch_mean
- 30.4
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.4446
- allcaps_rate
- 0
- boilerplate_rate
- 0.0002807
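Exact-match counts miss summaries that differ by a word or two. One simple way to catch those is a similarity threshold, sketched here with the standard library's `difflib.SequenceMatcher`. The 0.9 threshold and the placeholder sentences are assumptions, and the pairwise comparison is quadratic, so this suits a few thousand rows but not much more.

```python
from difflib import SequenceMatcher

def dedupe_near(texts, threshold=0.9):
    """Keep the first of each cluster of near-identical non-empty texts."""
    kept = []
    for text in texts:
        if text == "":
            continue  # empties are treated as missing, not as content
        is_new = all(
            SequenceMatcher(None, text, seen).ratio() < threshold
            for seen in kept
        )
        if is_new:
            kept.append(text)
    return kept

# Placeholder sentences, not taken from the dataset.
sample = [
    "",
    "Group A lives mainly in the northern plains and farms wheat.",
    "Group A lives mainly in the northern plains and farms rice.",
    "An unrelated passage about coastal fishing.",
]
kept = dedupe_near(sample)
```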
Obstacles
text free_text one_word duplicates
Free-text English prose describing barriers to Christian evangelism among various people groups (Rajputs, Jats, Bosniaks, Azeri, etc.), averaging 18 words and 107 characters per entry. 3167 of 7124 rows are empty strings and the duplicate rate is 0.489, most of which is the repeated empty string (about 317 non-empty rows exactly repeat another entry). A single Rajput-pride passage is repeated 88 times, and a near-identical Jat passage appears in two variants with 74 and 7 occurrences. Readability is low (Flesch 31.6) and vocabulary is modest (9760 unique words), consistent with a templated missiological description field. Treatment: Treat empties as missing, dedupe near-identical passages, then tokenize and embed for downstream topic or similarity analysis.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 3,641
- len_min
- 0
- len_max
- 726
- len_mean
- 106.9
- len_median
- 95
- len_p95
- 317
- word_mean
- 18.37
- word_median
- 16
- n_empty
- 3,167
- n_duplicates
- 3,483
- duplicate_rate
- 0.4889
- vocab_size
- 9,760
- readability_flesch_mean
- 31.62
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.4446
- allcaps_rate
- 0
- boilerplate_rate
- 0.0009826
HowReach
text free_text one_word duplicates
Free-text English prose describing outreach/engagement strategies for various people groups, likely a 'how to reach' field in a missions dataset. Over half the rows (3883 of 7124) are empty strings, so the median length is 0; the reported word_median of 1 and one_word_rate of 0.5451 (3883/7124) indicate that the profiler counts an empty string as a single token. duplicate_rate is 0.60, driven mainly by the empties, though the same Jats and Rajputs paragraphs also repeat dozens of times. Readability is low (Flesch 27.3) and vocabulary reaches 7803 tokens across the non-empty rows. Treatment: Treat empty strings as missing, deduplicate boilerplate, then tokenize and embed for downstream NLP.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 2,853
- len_min
- 0
- len_max
- 599
- len_mean
- 80.82
- len_median
- 0
- len_p95
- 260
- word_mean
- 14.08
- word_median
- 1
- n_empty
- 3,883
- n_duplicates
- 4,271
- duplicate_rate
- 0.5995
- vocab_size
- 7,803
- readability_flesch_mean
- 27.34
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.5451
- allcaps_rate
- 0
- boilerplate_rate
- 0.0002807
PrayForChurch
text free_text one_word duplicates
Free-text prayer prompts for an unreached-people-group / church-planting dataset, written in English (1473 detected) and centered on words like 'pray', 'Christ', 'among'. The field is sparsely populated: 5032 of 7124 rows are empty and only 1713 unique strings exist, giving a 0.76 duplicate rate. Most of that rate comes from the empties, but boilerplate prayers are also reused across people groups (the top non-empty value repeats 146 times). Readability is low (Flesch 19.5) and length varies widely from 0 to 649 chars, so the column is a mix of empty cells, one-liners, and full paragraphs. Treatment: Treat as optional long-form text: mark empties as missing and tokenize/embed the rest before any modelling.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 1,713
- len_min
- 0
- len_max
- 649
- len_mean
- 59.63
- len_median
- 0
- len_p95
- 286
- word_mean
- 11.19
- word_median
- 1
- n_empty
- 5,032
- n_duplicates
- 5,411
- duplicate_rate
- 0.7595
- vocab_size
- 4,447
- readability_flesch_mean
- 19.5
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.7063
- allcaps_rate
- 0
- boilerplate_rate
- 0
PrayForPG
text free_text one_word duplicates
Free-text prayer points for people groups (PG), each entry a short paragraph of intercessions led by the verb 'pray' (5450 occurrences). Nearly half the rows are empty (3405 of 7124), which accounts for most of the 0.517 duplicate_rate; boilerplate templates are also reused, with the top non-empty value repeating 88 times, so unique content is somewhat less than the 3441 distinct strings suggest. Readability is low (Flesch mean 32.7) and all detected language is English (2528 rows tagged en). Treatment: Treat as free-text: drop empties, dedupe boilerplate, then tokenize/embed if used as a feature.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- 3,441
- len_min
- 0
- len_max
- 937
- len_mean
- 163.1
- len_median
- 120
- len_p95
- 453.8
- word_mean
- 28.23
- word_median
- 20
- n_empty
- 3,405
- n_duplicates
- 3,683
- duplicate_rate
- 0.517
- vocab_size
- 9,291
- readability_flesch_mean
- 32.72
- emoji_rate
- 0
- url_rate
- 0
- one_word_rate
- 0.478
- allcaps_rate
- 0
- boilerplate_rate
- 0
Resources
unknown other skipped
The column is named "Resources" with 7124 rows and zero nulls, but saturn skipped profiling so the kind is unknown and no unique count or value statistics were computed. Without type inference or sample values, its content (numeric, list, text, or identifier) cannot be determined from the evidence. Treatment: Re-profile or inspect raw samples to establish type before any downstream use.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- —
country_data
unknown other skipped
The column `country_data` was skipped by the profiler, so its kind is unrecorded and no statistics, uniqueness, or value distribution are available. The only confirmed signals are 7124 rows with a 0.0 null rate. Without further inspection, the contents (likely some country-related payload given the name) cannot be characterised. Treatment: Re-profile or manually inspect this column before any downstream use.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- —
language_data
unknown other skipped
The column `language_data` was skipped by the profiler: its kind is unrecognised and no descriptive statistics, uniqueness count, or value samples were emitted. Only the row count (7124) and a null rate of 0.0 are available, so nothing can be said about content, cardinality, or distribution. The name hints at linguistic payloads (possibly nested or serialised), but this is not corroborated by evidence. Treatment: Re-profile after parsing or casting to a supported type before deciding on use.
- n
- 7,124
- nulls
- 0 (0.0%)
- unique
- —
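For the three skipped columns (Resources, country_data, language_data), a first inspection pass could tally what kind of value each cell holds. The sketch below is one way to do that; it assumes the cells may be native containers or JSON-encoded strings, which the profile does not confirm, and the sample values are invented.

```python
import json
from collections import Counter

def sniff_kind(value) -> str:
    """Label a raw cell as null, a native type, JSON-encoded text, or plain text."""
    if value is None:
        return "null"
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return "text"
        return "json:" + type(parsed).__name__
    return type(value).__name__

def kind_counts(values) -> Counter:
    """Tally the kinds found in a column sample."""
    return Counter(sniff_kind(v) for v in values)

# Invented sample cells for illustration.
sample = ['{"a": 1}', "[1, 2]", "plain words", None, {"b": 2}, [3]]
counts = kind_counts(sample)
```

If one kind dominates, the column can be parsed to that type and profiled again.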