archive api data sample
Reading
This is a 50-row, 107-column sample from the Joshua Project API describing Arab and related Muslim people groups across 41 countries. The dataset is dominated by one affinity bloc ('Arab World', 100%) and one religion ('Islam', 98%), so the interesting variation lies in geography, population size, and reachedness rather than in identity fields. Look first at Population and PopulationPGAC, which are heavily right-skewed (max 2.22M and 7.56M respectively, with multiple outliers), and at PCIslam, which is high but varies from 25% to 100%. JPScaleText shows that 76% of groups are classified 'Unreached', making that the most actionable signal alongside Continent/RegionName for where these groups sit. Note also the high null rates on language, Bible-translation, and nomadic descriptors (86–98% missing), which limit any analysis of those attributes.
citing: Population · PopulationPGAC · PCIslam · JPScaleText · Continent · RegionName · PrimaryLanguageName · LeastReached · PrimaryReligion · PeopleCluster · PrimaryLanguageDialect · NomadicTypeDescription
Charts the summary said to look at first
Population (binned counts)
| bin | count |
|---|---|
| 200 – 3.175e+05 | 40 |
| 3.175e+05 – 6.347e+05 | 4 |
| 6.347e+05 – 9.52e+05 | 2 |
| 9.52e+05 – 1.269e+06 | 0 |
| 1.269e+06 – 1.586e+06 | 2 |
| 1.586e+06 – 1.904e+06 | 1 |
| 1.904e+06 – 2.221e+06 | 1 |
JPScaleText (value counts)
| value | count | share |
|---|---|---|
| Unreached | 38 | 76.0% |
| Minimally Reached | 8 | 16.0% |
| Partially Reached | 3 | 6.0% |
| Superficially Reached | 1 | 2.0% |
RegionName (value counts)
| value | count | share |
|---|---|---|
| Africa, North and Middle East | 16 | 32.0% |
| Africa, East and Southern | 7 | 14.0% |
| Europe, Western | 6 | 12.0% |
| America, North and Caribbean | 5 | 10.0% |
| Africa, West and Central | 4 | 8.0% |
| Europe, Eastern and Eurasia | 4 | 8.0% |
| Asia, Southeast | 3 | 6.0% |
| Asia, South | 2 | 4.0% |
| Australia and Pacific | 1 | 2.0% |
| America, Latin | 1 | 2.0% |
| Asia, Central | 1 | 2.0% |
PCIslam (binned counts)
| bin | count |
|---|---|
| 25 – 35.71 | 1 |
| 35.71 – 46.43 | 0 |
| 46.43 – 57.14 | 1 |
| 57.14 – 67.86 | 2 |
| 67.86 – 78.57 | 3 |
| 78.57 – 89.29 | 1 |
| 89.29 – 100 | 42 |
PrimaryLanguageName (value counts)
| value | count | share |
|---|---|---|
| Arabic, Levantine | 18 | 36.0% |
| Arabic, Gulf | 13 | 26.0% |
| Arabic, Omani | 8 | 16.0% |
| Arabic, Mesopotamian | 4 | 8.0% |
| Swahili | 2 | 4.0% |
| Tamajeq, Tayart | 1 | 2.0% |
| Arabic, Sudanese | 1 | 2.0% |
| English | 1 | 2.0% |
| Arabic, Moroccan | 1 | 2.0% |
| Arabic, Egyptian | 1 | 2.0% |
Schema
107 columns
| Column | Kind | Nulls | Unique | Alerts |
|---|---|---|---|---|
| ROL3 | categorical | 0.0% | 10 | |
| PhotoCredits | categorical | 0.0% | 5 | |
| PrimaryReligion | categorical | 0.0% | 2 | imbalance |
| Ctry | categorical | 0.0% | 41 | long_tail |
| RegionName | categorical | 0.0% | 11 | |
| BibleYear | categorical | 92.0% | 3 | long_tail, null_rate |
| RLG3PC | numeric | 0.0% | 1 | constant |
| Population | numeric | 0.0% | 49 | high_skew, outliers |
| Resources | unknown | 0.0% | — | skipped |
| LeastReachedPGAC | categorical | 0.0% | 2 | |
| NumberLanguagesSpoken | numeric | 0.0% | 2 | high_skew |
| GSEC | categorical | 0.0% | 2 | |
| AudioRecordings | categorical | 0.0% | 1 | imbalance |
| PercentAdherents | categorical | 0.0% | 29 | long_tail |
| ROP1 | categorical | 0.0% | 1 | imbalance |
| JPScalePGAC | categorical | 0.0% | 2 | |
| Latitude | numeric | 0.0% | 50 | |
| PeopNameInCountry | categorical | 0.0% | 7 | long_tail |
| Window1040 | categorical | 0.0% | 2 | |
| PeopleGroupMapURL | categorical | 0.0% | 17 | long_tail |
| CountryURL | categorical | 0.0% | 41 | long_tail |
| PercentEvangelicalPC | categorical | 0.0% | 3 | long_tail, imbalance |
| CountOfProvinces | unknown | 0.0% | — | skipped |
| PercentEvangelicalPGAC | categorical | 0.0% | 5 | |
| NomadicTypeDescription | categorical | 90.0% | 1 | null_rate, imbalance |
| MapCredits | categorical | 0.0% | 7 | |
| HasJesusFilm | categorical | 0.0% | 2 | |
| HowReach | categorical | 0.0% | 20 | long_tail |
| PCIslam | numeric | 0.0% | 30 | high_skew, outliers |
| NTYear | categorical | 42.0% | 8 | long_tail, null_rate |
| RLG4 | numeric | 86.0% | 1 | null_rate, constant |
| AffinityBloc | categorical | 0.0% | 1 | imbalance |
| NaturalName | categorical | 0.0% | 7 | long_tail |
| PercentChristianPGAC | categorical | 0.0% | 5 | |
| PrimaryLanguageName | categorical | 0.0% | 10 | |
| CountOfCountries | numeric | 0.0% | 4 | |
| PeopleID2 | numeric | 0.0% | 3 | high_skew |
| Summary | categorical | 0.0% | 21 | long_tail |
| Obstacles | categorical | 0.0% | 21 | long_tail |
| ROP2 | categorical | 0.0% | 3 | long_tail, imbalance |
| RLG3 | numeric | 0.0% | 2 | high_skew |
| PercentEvangelical | categorical | 0.0% | 18 | long_tail |
| LeastReached | categorical | 0.0% | 2 | |
| Continent | categorical | 0.0% | 6 | |
| JPScalePC | categorical | 0.0% | 1 | imbalance |
| JPScaleText | categorical | 0.0% | 4 | |
| SecurityLevel | numeric | 0.0% | 3 | |
| LRTop100 | categorical | 0.0% | 1 | imbalance |
| PrimaryReligionPGAC | categorical | 0.0% | 1 | imbalance |
| PCNonReligious | numeric | 2.0% | 6 | outliers |
| PhotoCreditURL | categorical | 4.0% | 3 | |
| PhotoCreativeCommons | categorical | 0.0% | 2 | |
| PrayForPG | categorical | 0.0% | 21 | long_tail |
| PeopleGroupPhotoURL | categorical | 0.0% | 5 | |
| ROG2 | categorical | 0.0% | 6 | |
| PhotoCCVersionText | categorical | 0.0% | 2 | |
| Longitude | numeric | 0.0% | 50 | outliers |
| JPScaleImageURL | categorical | 0.0% | 4 | |
| OfficialLang | categorical | 0.0% | 21 | long_tail |
| PhotoPermission | categorical | 0.0% | 2 | imbalance |
| PCHinduism | numeric | 2.0% | 3 | high_skew |
| PeopleID3 | numeric | 0.0% | 5 | high_skew, outliers |
| PeopleID1 | numeric | 0.0% | 1 | constant |
| SpeakNationalLang | unknown | 0.0% | — | skipped |
| PortionsYear | categorical | 16.0% | 9 | long_tail |
| PrimaryReligionPC | categorical | 0.0% | 1 | imbalance |
| PCUnknown | numeric | 2.0% | 1 | constant |
| ProfileTextExists | categorical | 0.0% | 2 | |
| PCOtherSmall | numeric | 2.0% | 3 | high_skew, outliers |
| BibleStatus | numeric | 0.0% | 4 | |
| Frontier | categorical | 0.0% | 2 | |
| MapAddress | categorical | 0.0% | 17 | long_tail |
| PeopleID3ROG3 | categorical | 0.0% | 50 | long_tail |
| ROP3 | numeric | 0.0% | 5 | high_skew, outliers |
| PrimaryLanguageDialect | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| JPScale | numeric | 0.0% | 4 | high_skew, outliers |
| HasAudioRecordings | categorical | 0.0% | 1 | imbalance |
| PCBuddhism | numeric | 2.0% | 1 | constant |
| PeopNameAcrossCountries | categorical | 0.0% | 5 | |
| PhotoCCVersionURL | categorical | 0.0% | 2 | |
| MapCCVersionText | categorical | 0.0% | 1 | imbalance |
| PercentChristianPC | categorical | 0.0% | 3 | long_tail, imbalance |
| Nomadic | categorical | 0.0% | 2 | |
| PrayForChurch | categorical | 0.0% | 9 | long_tail |
| RLG3PGAC | numeric | 0.0% | 1 | constant |
| ISO3 | categorical | 0.0% | 41 | long_tail |
| NaturalPronunciation | categorical | 2.0% | 6 | |
| PhotoAddress | categorical | 0.0% | 5 | |
| RegionCode | numeric | 0.0% | 11 | |
| LocationInCountry | categorical | 72.0% | 13 | long_tail, null_rate |
| JF | categorical | 0.0% | 2 | |
| PopulationPGAC | numeric | 0.0% | 5 | outliers |
| PeopleGroupMapExpandedURL | categorical | 0.0% | 11 | long_tail |
| TranslationNeedQuestionable | unknown | 0.0% | — | skipped |
| Category | categorical | 0.0% | 3 | |
| PhotoCopyright | categorical | 0.0% | 2 | imbalance |
| NTOnline | categorical | 18.0% | 1 | imbalance |
| LeastReachedPC | categorical | 0.0% | 1 | imbalance |
| ROG3 | categorical | 0.0% | 41 | long_tail |
| ReligionSubdivision | categorical | 86.0% | 1 | null_rate, imbalance |
| PCEthnicReligions | numeric | 2.0% | 3 | high_skew, outliers |
| PeopleCluster | categorical | 0.0% | 3 | long_tail, imbalance |
| IndigenousCode | categorical | 0.0% | 2 | |
| MapCreditURL | categorical | 0.0% | 1 | imbalance |
| MapCopyright | categorical | 0.0% | 1 | imbalance |
| MapCCVersionURL | categorical | 0.0% | 1 | imbalance |
| PeopleGroupURL | categorical | 0.0% | 50 | long_tail |
ROL3
categorical feature
ROL3 holds three-letter codes (likely ISO 639-3 language tags such as 'eng', 'arz', 'ary', 'apc', 'afb') across 50 complete rows with 10 distinct values. The distribution is concentrated in Levantine and Gulf Arabic: 'apc' covers 36% (18/50) and 'afb' another 26% (13/50), while six codes appear only once or twice. An entropy ratio of 0.75 indicates moderate concentration rather than a uniform spread. Treatment: Group rare codes into an 'other' bucket, then one-hot encode.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 10
- top_value
- apc
- top_rate
- 0.36
- cardinality
- 10
- entropy
- 2.501
- entropy_ratio
- 0.7527
PhotoCredits
categorical metadata
Attribution string for the image accompanying each row, naming photographer and source platform. Just 5 distinct credits cover all 50 rows, with 'Hashim Abdullah - Pixabay' alone accounting for 56% and the top three Pixabay/Flickr contributors covering 48 of 50 entries. Two credits ('Link Up Africa', 'Claudiovidri - Shutterstock') appear only once, suggesting a long tail of incidental sources. Treatment: Retain as provenance metadata; drop from modelling features.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- Hashim Abdullah - Pixabay
- top_rate
- 0.56
- cardinality
- 5
- entropy
- 1.611
- entropy_ratio
- 0.694
PrimaryReligion
categorical feature imbalance
Categorical column capturing the dominant religion of each record, with only two observed values across 50 rows. The distribution is severely imbalanced: Islam accounts for 49 of 50 entries (top_rate 0.98) and Christianity for just 1, yielding an entropy ratio of 0.14. With effectively no variance, this column carries almost no discriminative signal. Treatment: Drop or collapse to a binary 'is_Islam' flag; near-constant for modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Islam
- top_rate
- 0.98
- cardinality
- 2
- entropy
- 0.1414
- entropy_ratio
- 0.1414
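The entropy and entropy_ratio figures reported on these cards can be reproduced from value counts. The sketch below assumes base-2 Shannon entropy normalised by log2 of the number of observed categories, which agrees with the numbers shown for PrimaryReligion (49 Islam, 1 Christianity). The function names are illustrative and are not the profiler's own API.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Base-2 Shannon entropy of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed categories."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column has no spread to normalise
    return shannon_entropy(counts) / math.log2(k)


# Counts taken from the PrimaryReligion description: 49 Islam, 1 Christianity.
religion = Counter({"Islam": 49, "Christianity": 1})
h = shannon_entropy(list(religion.values()))
ratio = entropy_ratio(list(religion.values()))
print(round(h, 4), round(ratio, 4))
```

For a two-level column the normaliser is log2(2) = 1, which is why entropy and entropy_ratio coincide on binary cards such as this one.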
Ctry
categorical feature long_tail
Country field with 41 distinct values across 50 rows and no nulls. The distribution is essentially flat: the entropy ratio is 0.986 and the most common value, United Arab Emirates, appears just twice (4%). Nine countries tie at 2 occurrences each and the remaining 32 are singletons, hence the long_tail alert. Treatment: Group countries into regions before encoding; because every country is rare here, an 'Other' bucket alone would absorb nearly all rows.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 41
- top_value
- United Arab Emirates
- top_rate
- 0.04
- cardinality
- 41
- entropy
- 5.284
- entropy_ratio
- 0.9862
RegionName
categorical feature
RegionName is a categorical geographic grouping with 11 distinct regions across 50 rows and no nulls. The distribution is uneven at the top: 'Africa, North and Middle East' alone accounts for 32% (16/50), and the three African regions together cover 54% (27/50). The entropy ratio of 0.86 shows the remaining rows are spread fairly evenly across the other regions, but three regions ('Australia and Pacific', 'America, Latin', 'Asia, Central') appear only once. Treatment: One-hot or target-encode; consider grouping single-row regions to avoid sparse levels.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 11
- top_value
- Africa, North and Middle East
- top_rate
- 0.32
- cardinality
- 11
- entropy
- 2.973
- entropy_ratio
- 0.8595
BibleYear
categorical metadata long_tail null_rate
BibleYear appears to be a metadata field capturing the publication year or year-range of a Bible edition, with values like "1890-2024", "1382-2020", and "2021". The column is almost entirely empty: 92% null with only 4 of 50 rows populated and 3 distinct values. Format is inconsistent, mixing single years with hyphenated ranges, which blocks numeric parsing. Treatment: Drop or quarantine; null rate of 0.92 and mixed year/range formats make it unusable without manual curation.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 3
- top_value
- 1890-2024
- top_rate
- 0.5
- cardinality
- 3
- entropy
- 1.5
- entropy_ratio
- 0.9464
RLG3PC
numeric other constant
RLG3PC is a numeric column that is entirely constant: all 50 rows hold the value 6.0, with zero variance and only 1 unique value. There is no information for a model to learn from, and no nulls or outliers to caveat. Treatment: Drop; constant column carries no signal.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- min
- 6
- max
- 6
- mean
- 6
- median
- 6
- std
- 0
- q1
- 6
- q3
- 6
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
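Several cards in this report recommend dropping constant columns (RLG3PC here, and RLG4, which is constant wherever it is populated). A minimal sketch of how such columns could be detected is shown below; it assumes rows are held as plain dictionaries, and the three rows are made-up illustrations rather than records from the sample.

```python
def constant_columns(rows):
    """Return names of columns whose non-null values are all identical."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            values = seen.setdefault(name, set())
            if value is not None:
                values.add(value)
    # Zero or one distinct non-null value means the column cannot discriminate.
    return sorted(name for name, values in seen.items() if len(values) <= 1)


# Three made-up rows shaped like the sample (values are illustrative only).
rows = [
    {"RLG3PC": 6.0, "RLG4": None, "Population": 200},
    {"RLG3PC": 6.0, "RLG4": 20.0, "Population": 46500},
    {"RLG3PC": 6.0, "RLG4": 20.0, "Population": 2221000},
]
to_drop = constant_columns(rows)
kept = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
print(to_drop)  # ['RLG3PC', 'RLG4']
```

Ignoring nulls when counting distinct values is a design choice: it flags mostly-null, single-valued columns such as RLG4 as well as fully populated constants.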
Population
numeric feature high_skew outliers
Population counts across 50 rows, ranging from 200 to 2,221,000 with a median of just 46,500 versus a mean of 264,074. The distribution is severely right-skewed (skew 2.52, kurtosis 5.78) and 12% of rows (6 values) are flagged as outliers, indicating that a few very large people groups dominate a sample of mostly small ones. There are no nulls or zeros, and the 50 rows hold 49 distinct values. Treatment: Log-transform before any modelling to tame the right skew and outliers.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 49
- min
- 200
- max
- 2.221e+06
- mean
- 264,074
- median
- 46,500
- std
- 4.927e+05
- q1
- 13,000
- q3
- 272,500
- iqr
- 259,500
- skew
- 2.52
- kurtosis
- 5.781
- n_outliers
- 6
- outlier_rate
- 0.12
- zero_rate
- 0
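The outlier flag and the suggested log transform can be illustrated from the quartiles reported above. This sketch assumes the profiler uses the common Tukey rule (1.5 × IQR beyond the quartiles); the report does not state which rule it applies, so treat the fences as an approximation of its behaviour.

```python
import math

# Quartiles as reported for Population.
q1, q3 = 13_000, 272_500
iqr = q3 - q1

# Tukey fences (assumed rule: 1.5 x IQR beyond the quartiles).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_outlier(value):
    """True if the value falls outside the assumed Tukey fences."""
    return value < lower_fence or value > upper_fence


def log_population(value):
    """log10 transform; safe here because the reported minimum is 200 (> 0)."""
    return math.log10(value)


print(upper_fence)            # 661750.0
print(is_outlier(2_221_000))  # True
print(round(log_population(200), 2), round(log_population(2_221_000), 2))
```

On the log10 scale the range of 200 to 2,221,000 shrinks to roughly four orders of magnitude, which is why the transform is a reasonable first step before modelling.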
Resources
unknown other skipped
The column is named "Resources" and contains 50 non-null entries, but saturn skipped profiling it so its kind is unknown and no descriptive stats (uniqueness, distribution, type) are available. Without further signals, its content and structure cannot be characterized from this evidence. Treatment: Re-run profiling with type coercion or inspect raw values manually before deciding on use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
LeastReachedPGAC
categorical feature
Binary Y/N flag that appears to mark least-reached status at the PGAC level (PGAC most likely abbreviates 'people group across countries' in Joshua Project field names), with no nulls across 50 rows. The split is nearly balanced (28 N, 22 Y; top_rate 0.56) and an entropy_ratio of 0.99 confirms near-maximal informativeness for a binary feature. Treatment: Encode as 0/1 boolean for modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.56
- cardinality
- 2
- entropy
- 0.9896
- entropy_ratio
- 0.9896
NumberLanguagesSpoken
numeric feature high_skew
Counts the number of languages spoken, but with only 2 unique values across 50 rows it is effectively a binary indicator of monolingual vs bilingual. The mean of 1.04 and median of 1.0 show 48 of 50 records sit at 1, with just 2 outliers at 2.0 driving the extreme skew (4.69) and kurtosis (20.04). Treatment: Recode as a binary multilingual flag; the raw count adds no information.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- min
- 1
- max
- 2
- mean
- 1.04
- median
- 1
- std
- 0.1979
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- 4.695
- kurtosis
- 20.04
- n_outliers
- 2
- outlier_rate
- 0.04
- zero_rate
- 0
GSEC
categorical feature
GSEC is a binary categorical field with exactly two values, "1" and an empty string, split perfectly 25/25 across the 50 rows. The maximum entropy (1.0) confirms a balanced flag, but the empty string rather than "0" or null suggests the absent state is encoded as a blank string rather than a true missing value. Treatment: Recode empty string to 0 (or NaN if it means missing) and treat as a binary indicator.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 1
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
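The recoding suggested for GSEC (and for other columns in this report where blanks stand in for nulls, such as MapCredits or HowReach) could look like the sketch below. Whether a blank means "absent" or "unknown" is not established by the profile, so the choice is exposed as a hypothetical `blank_means_absent` parameter.

```python
def blank_to_none(value):
    """Map empty or whitespace-only strings to None; leave other values alone."""
    if isinstance(value, str) and value.strip() == "":
        return None
    return value


def gsec_flag(value, blank_means_absent=True):
    """Encode GSEC as 1/0, or as 1/None if a blank is treated as missing."""
    cleaned = blank_to_none(value)
    if cleaned is None:
        return 0 if blank_means_absent else None
    return 1 if cleaned == "1" else None  # unexpected codes stay unknown


raw = ["1", "", "1", ""]
print([gsec_flag(v) for v in raw])                            # [1, 0, 1, 0]
print([gsec_flag(v, blank_means_absent=False) for v in raw])  # [1, None, 1, None]
```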
AudioRecordings
categorical metadata imbalance
AudioRecordings is a categorical flag that takes the single value 'Y' across all 50 rows, with zero nulls and entropy of 0. Because cardinality is 1 and top_rate is 1.0, the column carries no information and cannot discriminate between records. Treatment: Drop; constant column with no variance.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- Y
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
PercentAdherents
categorical feature long_tail
PercentAdherents holds numeric percentages stored as strings, with 29 distinct values across 50 rows and no nulls. The mode is "0.000" at 12% of rows, and entropy ratio 0.942 indicates a very flat distribution with a long tail of small-frequency values. The mix of fractional (0.200, 0.500) and whole-number (5.000, 6.000) entries suggests the values are raw percentages rather than proportions. Treatment: Cast strings to float and treat as a continuous numeric feature.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 29
- top_value
- 0.000
- top_rate
- 0.12
- cardinality
- 29
- entropy
- 4.576
- entropy_ratio
- 0.942
ROP1
categorical metadata imbalance
ROP1 is a categorical column holding a single constant value 'A001' across all 50 rows, with zero nulls and entropy of 0.0. It carries no information for modelling or segmentation since cardinality is 1 and top_rate is 1.0. Treatment: Drop; constant column with no variance.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- A001
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
JPScalePGAC
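categorical feature
A binary categorical field with only the values "1" and "2", split 22/28 across 50 rows with no nulls. The near-maximal entropy ratio (0.99) indicates an almost balanced two-class distribution. Given the Joshua Project source, the name most likely denotes the Joshua Project progress scale (compare JPScale and JPScaleText) at the PGAC level, not a seismic or generic rating code. The 28/22 split matches LeastReachedPGAC exactly, so the two columns may be redundant. Treatment: Cast to an ordered categorical (or 0/1 indicator) before modelling, and check it against LeastReachedPGAC.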
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 2
- top_rate
- 0.56
- cardinality
- 2
- entropy
- 0.9896
- entropy_ratio
- 0.9896
Latitude
numeric feature
Numeric column holding geographic latitude in degrees, with all 50 values unique and no nulls. The range spans -33.87 to 59.11, consistent with worldwide coordinates, and the distribution is mildly left-skewed (-0.46) with a mean of 22.28 sitting below the median of 24.12. Only one outlier (2%) is flagged; under the common 1.5 × IQR rule the lower fence is about -30.0, so it is probably the southernmost point at -33.87. The spread is otherwise broad (IQR 26.18). Treatment: Pair with longitude as a geospatial coordinate; consider binning or distance features rather than using it as a raw scalar.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- min
- -33.87
- max
- 59.11
- mean
- 22.28
- median
- 24.12
- std
- 20.19
- q1
- 9.247
- q3
- 35.43
- iqr
- 26.18
- skew
- -0.4594
- kurtosis
- 0.0487
- n_outliers
- 1
- outlier_rate
- 0.02
- zero_rate
- 0
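One way to build the distance features mentioned for Latitude is the haversine great-circle formula applied to each (Latitude, Longitude) pair. The sketch below is a generic implementation; the reference point (approximately Cairo) is a hypothetical choice for illustration, not something the dataset prescribes.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical reference point (roughly Cairo); any anchor could be used instead.
REF_LAT, REF_LON = 30.04, 31.24


def distance_to_reference(lat, lon):
    """Distance feature: kilometres from a row's coordinates to the anchor."""
    return haversine_km(lat, lon, REF_LAT, REF_LON)
```

A single distance-to-anchor column is only one option; pairwise distances or coarse latitude/longitude bins would serve the same purpose of avoiding raw coordinates as scalars.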
PeopNameInCountry
categorical label long_tail
Categorical label naming the people group in-country, with only 7 distinct values across 50 rows and no nulls. The distribution is heavily concentrated on Arab variants: 'Arab' alone covers 54% of rows, and the top three Arab-prefixed labels account for 46 of 50 entries, leaving a long tail of singletons like 'Tuareg, Air' and 'Amri'. Entropy ratio of 0.65 confirms the imbalance flagged by the long_tail alert. Treatment: Collapse rare singletons into an 'Other' bucket before any group-wise analysis.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- Arab
- top_rate
- 0.54
- cardinality
- 7
- entropy
- 1.835
- entropy_ratio
- 0.6537
Window1040
categorical feature
Window1040 is a binary Y/N flag, almost perfectly balanced with 26 'Y' and 24 'N' across 50 rows. An entropy ratio of 0.999 confirms a near-even split, and there are no nulls. In a missions dataset the name almost certainly refers to the 10/40 Window (the band between roughly 10 and 40 degrees north latitude), so the flag likely marks whether a group is located inside it, although the profile itself does not confirm the semantics. Treatment: Encode as a 0/1 boolean for modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.52
- cardinality
- 2
- entropy
- 0.9988
- entropy_ratio
- 0.9988
PeopleGroupMapURL
categorical metadata long_tail
URL pointing to a people-group map image hosted on joshuaproject.net. 48% of the 50 rows (24) are empty strings, so the field is blank almost as often as it is populated, and a single map (m00007.png) accounts for 9 of the 26 non-blank entries. With 17 unique values (16 URLs plus the empty string) and a long_tail alert, most distinct URLs appear only once. Treatment: Treat empty strings as missing and drop from modelling; keep only as a display link.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 17
- top_value
- (empty string)
- top_rate
- 0.48
- cardinality
- 17
- entropy
- 2.792
- entropy_ratio
- 0.6832
CountryURL
categorical foreign_key long_tail
URLs to country pages on joshuaproject.net, with the two-letter country code as the path suffix. With 41 unique values across 50 rows and entropy ratio 0.986, the column is near-unique; the most frequent URL (UAE) appears just twice (top_rate 0.04). The base domain is constant, so the country code is the only informative part. Treatment: Extract the trailing country code and use that as the join/grouping key instead of the raw URL.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 41
- top_value
- https://joshuaproject.net/countries/AE
- top_rate
- 0.04
- cardinality
- 41
- entropy
- 5.284
- entropy_ratio
- 0.9862
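Extracting the trailing country code, as the treatment above suggests, needs only the standard library. This sketch assumes every populated value follows the pattern of the top value shown (`https://joshuaproject.net/countries/AE`); the function name is illustrative.

```python
from urllib.parse import urlparse


def country_code_from_url(url):
    """Return the last path segment of a country URL, upper-cased, or None."""
    if not url:
        return None
    path = urlparse(url).path.rstrip("/")
    code = path.rsplit("/", 1)[-1]
    return code.upper() if code else None


print(country_code_from_url("https://joshuaproject.net/countries/AE"))  # AE
```

The resulting code can then serve as the join or grouping key in place of the raw URL.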
PercentEvangelicalPC
categorical feature long_tail imbalance
Numeric-looking field stored as a categorical with only 3 distinct values across 50 rows; 96% of rows share the single value '0.197'. The other two values ('0.103' and '0.265') each appear exactly once, giving an extreme imbalance and an entropy ratio of just 0.178. The PC suffix more plausibly stands for people cluster than for principal component: the 48/1/1 split mirrors ROP2, and PeopleCluster also has 3 levels, which suggests a cluster-level percentage repeated on every row of that cluster. Treatment: Drop or treat as constant; near-zero variance offers no modelling signal.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- 0.197
- top_rate
- 0.96
- cardinality
- 3
- entropy
- 0.2823
- entropy_ratio
- 0.1781
CountOfProvinces
unknown other skipped
CountOfProvinces was skipped by the profiler, so no type, uniqueness, or distribution stats are available beyond a row count of 50 with no nulls. The name suggests an integer tally of provinces per record, but this cannot be confirmed from the evidence. No further signal is present to flag skew, duplicates, or range. Treatment: Re-run the profiler on this column to recover type and distribution before deciding how to use it.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
PercentEvangelicalPGAC
categorical feature
Likely a percentage of evangelical adherents stored as strings rather than floats, with only 5 distinct values across 50 rows. The distribution is severely lumpy: '1.892' covers 56% of rows and the top three values ('1.892', '0.233', '0.023') account for 48 of 50 observations. PGAC most likely means 'people group across countries', so these look like group-level figures repeated on each country row (PopulationPGAC and PeopleID3 also have 5 distinct values) rather than imputed codes or per-row measurements. Treatment: Cast to float and treat as a group-level attribute rather than an independent continuous percentage.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- 1.892
- top_rate
- 0.56
- cardinality
- 5
- entropy
- 1.611
- entropy_ratio
- 0.694
NomadicTypeDescription
categorical metadata null_rate imbalance
This appears to be a descriptive label for a nomadic lifestyle classification, but it carries almost no information in this sample. 90% of rows are null, and the 5 non-null rows all hold the single value 'Agro-Pastoralists' (top_rate 1.0, cardinality 1, entropy 0.0). As-is, the column cannot discriminate between records. Treatment: Drop: 90% null and only one observed value provides no signal.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 1
- top_value
- Agro-Pastoralists
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
MapCredits
categorical metadata
MapCredits holds attribution strings for the map associated with each row, citing sources like Joshua Project, GMI, ESRI, and Bethany World Prayer Center. Nearly half the rows (24 of 50) carry an empty string rather than a null, and a single credit to 'Bethany World Prayer Center' covers another 14 rows; the column has only 7 distinct values in total. The main surprise is that missingness is encoded as empty text, not NULL, so the reported null rate of 0% understates how sparse the field is. Treatment: Normalize empty strings to nulls and treat as provenance metadata; drop from modelling features.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- (empty string)
- top_rate
- 0.48
- cardinality
- 7
- entropy
- 1.987
- entropy_ratio
- 0.7077
HasJesusFilm
categorical feature
Binary Y/N flag indicating whether each record has an associated 'Jesus Film' resource. The column is complete (null_rate 0.0) with only 2 unique values, heavily skewed toward 'Y' at 82% (41 of 50), leaving just 9 'N' cases. Treatment: Encode as boolean; expect limited discriminatory power given the 82/18 imbalance.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.82
- cardinality
- 2
- entropy
- 0.6801
- entropy_ratio
- 0.6801
HowReach
categorical free_text long_tail
HowReach holds free-text outreach suggestions, likely missionary engagement strategies for various people groups. The column is dominated by empty strings (62% top_rate, 31 of 50 rows blank), and every non-blank value is unique, yielding 20 distinct values across 50 rows with no nulls flagged. Entropy ratio of 0.60 plus the long_tail alert confirm this is essentially sparse prose, not a categorical variable. Treatment: Treat blanks as missing and tokenize/embed the prose entries rather than one-hot encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 20
- top_value
- (empty string)
- top_rate
- 0.62
- cardinality
- 20
- entropy
- 2.572
- entropy_ratio
- 0.5952
PCIslam
numeric feature high_skew outliers
PCIslam appears to be a percentage measure of Islamic affiliation per record, ranging from 25.0 to 100.0 with a median of 95.99 and a mean of 91.12. The distribution is heavily left-skewed (skew -2.65, kurtosis 7.39) with a tight IQR of 6.55 between Q1 92.93 and Q3 99.47, yet 8 outliers (16%) sit far below that cluster. Most observations are near saturation while a small tail of low-share records pulls the mean down. Treatment: Consider a reflected log or beta transform before modelling to tame the left skew and downweight the low-share outliers.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 30
- min
- 25
- max
- 100
- mean
- 91.12
- median
- 95.99
- std
- 14.7
- q1
- 92.93
- q3
- 99.47
- iqr
- 6.55
- skew
- -2.648
- kurtosis
- 7.388
- n_outliers
- 8
- outlier_rate
- 0.16
- zero_rate
- 0
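The reflected log transform mentioned in the treatment can be sketched as follows. It assumes PCIslam is bounded by 100 (consistent with the reported maximum) and uses `log1p` so that a value of exactly 100 maps to 0. Note that the reflection reverses the ordering: larger outputs correspond to smaller original shares.

```python
import math


def reflected_log(pct, upper=100.0):
    """Reflect around the upper bound, then apply log1p.

    Values near the upper bound map close to 0 and the long low tail is
    compressed. Larger outputs mean a SMALLER original share.
    """
    if not 0.0 <= pct <= upper:
        raise ValueError(f"expected a percentage in [0, {upper}], got {pct}")
    return math.log1p(upper - pct)


# Max, median, Q1, and min as reported for PCIslam.
for value in (100.0, 95.99, 92.93, 25.0):
    print(value, round(reflected_log(value), 3))
```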
NTYear
categorical free_text long_tail null_rate
NTYear appears to be a free-form annotation of NT (presumably New Testament) publication status, mixing a yes/no flag with single years (e.g. 2005, 1932, 2012) and year ranges (e.g. 1879-1989, 1990-2003). The format is inconsistent: 'Yes' accounts for 62% of non-null entries (18 of 29), 42% of all rows are null, and the remaining 11 cells are split across 7 heterogeneous values. This is effectively two or three different fields collapsed into one string column. Treatment: Split into a boolean indicator and parsed year/year-range fields before use.
- n
- 50
- nulls
- 21 (42.0%)
- unique
- 8
- top_value
- Yes
- top_rate
- 0.6207
- cardinality
- 8
- entropy
- 1.925
- entropy_ratio
- 0.6416
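The split recommended for NTYear could be implemented along these lines. The sketch handles the three shapes described above ('Yes', a single year, a hyphenated range); treating any year as implying availability is an assumption, and nulls are returned as unknown rather than as False. The same parser would also cover the year/range values described for BibleYear.

```python
import re

_YEAR = re.compile(r"^(\d{4})$")
_RANGE = re.compile(r"^(\d{4})\s*-\s*(\d{4})$")


def parse_year_field(value):
    """Split a mixed flag/year/range string into (available, first_year, last_year).

    None or blank -> (None, None, None): availability unknown, not asserted False.
    """
    if value is None or value.strip() == "":
        return (None, None, None)
    text = value.strip()
    if text.lower() == "yes":
        return (True, None, None)
    if text.lower() == "no":
        return (False, None, None)
    m = _YEAR.match(text)
    if m:
        year = int(m.group(1))
        return (True, year, year)  # assumption: a year implies availability
    m = _RANGE.match(text)
    if m:
        return (True, int(m.group(1)), int(m.group(2)))
    return (None, None, None)  # unrecognised format; leave for manual review


for raw in ("Yes", "2005", "1879-1989", None):
    print(raw, parse_year_field(raw))
```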
RLG4
numeric other null_rate constant
RLG4 is a numeric column that is effectively unusable: 86% of its 50 rows are null, and every one of the remaining values equals 20.0 (min, median, max, std all confirm this). With a single distinct value and no variance, it carries no information for modelling. Treatment: Drop the column; it is constant where present and 86% null.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 1
- min
- 20
- max
- 20
- mean
- 20
- median
- 20
- std
- 0
- q1
- 20
- q3
- 20
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
AffinityBloc
categorical metadata imbalance
AffinityBloc is a categorical grouping label, but every one of the 50 rows holds the same value, "Arab World". With cardinality 1 and entropy 0, this column carries no information for distinguishing records in this slice. Treatment: Drop; constant column with zero entropy.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- Arab World
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
NaturalName
categorical label long_tail
NaturalName is a low-cardinality categorical label, likely an ethnic or linguistic group identifier, with 7 distinct values across 50 rows and no nulls. The distribution is heavily concentrated: 'Arab' alone covers 54% of rows, and together with 'Gulf-spoken Arab' (11) and 'Omani Arab' (8) accounts for 46 of 50 records, leaving four singleton categories in a long tail. Entropy ratio of 0.65 confirms the imbalance. Treatment: Group the four singleton categories into an 'Other' bucket before encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- Arab
- top_rate
- 0.54
- cardinality
- 7
- entropy
- 1.835
- entropy_ratio
- 0.6537
PercentChristianPGAC
categorical feature
This column appears to be a percentage-based metric (likely percent Christian at the PGAC level, where PGAC most likely means 'people group across countries') stored as strings, with only 5 distinct values across 50 rows. The distribution is heavily concentrated: '14.741' accounts for 56% of records (28 rows), followed by '0.935' (12 rows) and '0.066' (8 rows), suggesting these are group-level constants repeated on each row rather than per-row measurements. Having just 5 unique values for what looks like a continuous percentage points to either aggregated/joined reference data or a coarse bucketing. Treatment: Cast to numeric and verify whether values are constants from a join key before using as a feature.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- 14.741
- top_rate
- 0.56
- cardinality
- 5
- entropy
- 1.611
- entropy_ratio
- 0.694
PrimaryLanguageName
categorical feature
This is a categorical column naming a primary language, dominated by Arabic dialects across 10 distinct values in 50 rows. Levantine Arabic leads at 18/50 (36%), followed by Gulf (13) and Omani (8); non-Arabic entries are rare (Swahili 2, plus singletons for Tamajeq, English, etc.). Entropy ratio of 0.75 indicates moderate concentration without a single overwhelming class, and there are no nulls. Treatment: Group rare non-Arabic and minor dialects into an 'Other' bucket before one-hot encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 10
- top_value
- Arabic, Levantine
- top_rate
- 0.36
- cardinality
- 10
- entropy
- 2.501
- entropy_ratio
- 0.7527
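Grouping rare levels into an 'Other' bucket, as recommended here and for several other categorical columns, can be sketched with the counts from the PrimaryLanguageName table earlier in this report. The threshold of 3 rows is a hypothetical cut-off chosen for illustration; the report does not specify one.

```python
from collections import Counter

# Counts from the PrimaryLanguageName table in this report.
language_counts = Counter({
    "Arabic, Levantine": 18, "Arabic, Gulf": 13, "Arabic, Omani": 8,
    "Arabic, Mesopotamian": 4, "Swahili": 2, "Tamajeq, Tayart": 1,
    "Arabic, Sudanese": 1, "English": 1, "Arabic, Moroccan": 1,
    "Arabic, Egyptian": 1,
})


def group_rare(counts, min_count=3, other_label="Other"):
    """Map each level to itself, or to other_label if it has fewer than min_count rows."""
    return {level: (level if n >= min_count else other_label)
            for level, n in counts.items()}


mapping = group_rare(language_counts)
grouped = Counter()
for level, n in language_counts.items():
    grouped[mapping[level]] += n
print(grouped.most_common())
```

With this threshold the ten original levels collapse to five, and the 'Other' bucket holds 7 of the 50 rows, which keeps the one-hot encoding compact without discarding the four main dialects.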
CountOfCountries
numeric feature
Numeric count of countries with only 4 unique values across 50 rows, ranging from 1 to 28 with a median of 28 and a mean of 19.88. The distribution is heavily concentrated at the maximum (the median and Q3 both equal the max of 28), so at least half the rows sit at 28 while a minority sit much lower. Negative kurtosis (-1.52) and mild left skew (-0.43) are consistent with a lumpy, multi-modal spread rather than a smooth distribution. Treatment: Treat as a low-cardinality ordinal or bin into categorical buckets rather than as a continuous variable.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 4
- min
- 1
- max
- 28
- mean
- 19.88
- median
- 28
- std
- 9.512
- q1
- 12
- q3
- 28
- iqr
- 16
- skew
- -0.4253
- kurtosis
- -1.521
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
PeopleID2
numeric foreign_key high_skew
PeopleID2 is stored as numeric but behaves like a categorical key, with only 3 unique values across 50 rows and an IQR of 0 because Q1, median, and Q3 all equal 111. The mean (115.04) is pulled above the median by 2 outlying rows with values up to 307, producing extreme skew (6.85) and kurtosis (44.93). No nulls or zeros are present. Treatment: Treat as a categorical identifier (not a numeric feature); use it as a join or grouping key rather than averaging or scaling it.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 3
- min
- 111
- max
- 307
- mean
- 115
- median
- 111
- std
- 27.71
- q1
- 111
- q3
- 111
- iqr
- 0
- skew
- 6.847
- kurtosis
- 44.93
- n_outliers
- 2
- outlier_rate
- 0.04
- zero_rate
- 0
Summary
categorical free_text long_tail
Free-text ethnographic summaries describing people groups, with 21 unique values across 50 rows and a 60% top rate driven entirely by empty strings (30 of 50). The non-empty entries are long, prose paragraphs about Tuareg, Arab, and other groups, so this is descriptive content rather than a category. Entropy ratio of 0.61 and the long_tail alert confirm most non-empty values appear only once. Treatment: Treat empty strings as missing and tokenize/embed the prose before any modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 21
- top_value
- (empty string)
- top_rate
- 0.6
- cardinality
- 21
- entropy
- 2.7
- entropy_ratio
- 0.6146
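The "treat empty strings as missing" step recommended for this column (and for the other free-text columns) can be sketched as follows. This is a minimal illustration, not the profiler's own logic: the helper name `blank_to_missing` and the five-item sample list are hypothetical, and treating whitespace-only strings as missing is an added assumption.

```python
from typing import List, Optional


def blank_to_missing(value: Optional[str]) -> Optional[str]:
    """Return None for None, empty, or whitespace-only strings; otherwise the stripped text."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


# Hypothetical stand-in for a free-text column: 3 blanks and 2 prose entries.
summary: List[Optional[str]] = [
    "",
    "first prose paragraph",
    "",
    "   ",
    "second prose paragraph",
]

cleaned = [blank_to_missing(v) for v in summary]
missing_rate = sum(v is None for v in cleaned) / len(cleaned)
print(missing_rate)  # 0.6
```

After this normalisation the reported null rate of 0% would rise to the true missing share, which for Summary is the 60% currently hidden in the top value.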
Obstacles
categorical free_text long_tail
Free-text commentary describing barriers to gospel outreach for various people groups, one paragraph per row. 30 of 50 rows (top_rate 0.6) are empty strings, and the remaining 20 entries are essentially unique, yielding 21 distinct values and entropy_ratio 0.61. Despite the categorical kind, the content is prose passages about Islam, identity, and mission access, not a bounded label set. Treatment: Treat empty strings as missing and tokenize/embed the prose rather than one-hot encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 21
- top_value
- (empty string)
- top_rate
- 0.6
- cardinality
- 21
- entropy
- 2.7
- entropy_ratio
- 0.6146
ROP2
categorical feature long_tail imbalance
ROP2 is a categorical column with only 3 distinct codes (C0013, C0219, C0019) across 50 rows and no nulls. The distribution is extremely lopsided: C0013 covers 96% of rows while the other two codes appear once each, yielding an entropy ratio of just 0.178. As a near-constant feature it carries almost no signal. Treatment: Drop or collapse rare levels into 'other'; near-constant column unlikely to aid modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- C0013
- top_rate
- 0.96
- cardinality
- 3
- entropy
- 0.2823
- entropy_ratio
- 0.1781
RLG3
numeric feature high_skew
RLG3 is a near-constant numeric feature: 49 of 50 rows take the value 6.0 (median, Q1, and Q3 all equal 6.0), with a single outlier at 1.0 producing extreme negative skew (-6.86) and kurtosis (45.02). With only 2 unique values and an IQR of 0, this behaves more like a degenerate flag than a continuous measurement. Treatment: Drop or binarize as is_not_6; near-zero variance makes it useless for most models.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- min
- 1
- max
- 6
- mean
- 5.9
- median
- 6
- std
- 0.7071
- q1
- 6
- q3
- 6
- iqr
- 0
- skew
- -6.857
- kurtosis
- 45.02
- n_outliers
- 1
- outlier_rate
- 0.02
- zero_rate
- 0
PercentEvangelical
categorical feature long_tail
PercentEvangelical appears to be a numeric share of evangelicals stored as strings, with 18 distinct values across 50 rows and no nulls. The distribution is heavily concentrated on small values: '0.000' and '0.500' tie for the mode at 8 occurrences each (16% top_rate), while values up to '2.500' form a long tail. High entropy_ratio (0.88) indicates the mass is spread fairly evenly across the small set of bins despite the long_tail alert. Treatment: Cast strings to float and treat as a continuous numeric feature.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 18
- top_value
- 0.000
- top_rate
- 0.16
- cardinality
- 18
- entropy
- 3.686
- entropy_ratio
- 0.884
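The cast recommended above might look like the sketch below. The helper name `to_float` is hypothetical, and returning None for unparseable input is an assumed policy rather than anything the profile prescribes; the sample strings are the three values quoted in the description plus one blank to show the fallback.

```python
from typing import List, Optional


def to_float(text: Optional[str]) -> Optional[float]:
    """Parse a numeric string; return None when the value is missing or not numeric."""
    if text is None:
        return None
    try:
        return float(text.strip())
    except ValueError:
        # Assumed policy: unparseable text becomes missing instead of raising.
        return None


raw: List[Optional[str]] = ["0.000", "0.500", "2.500", ""]
parsed = [to_float(v) for v in raw]
print(parsed)  # [0.0, 0.5, 2.5, None]
```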
LeastReached
categorical feature
Binary Y/N flag, likely indicating whether some 'least reached' status applies to each record. The class is imbalanced toward Y at 76% (38 of 50), with N covering the remaining 12. No nulls, and entropy ratio of 0.80 reflects the moderate skew. Treatment: Encode as a 0/1 indicator; consider class-imbalance handling if used as a target.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.76
- cardinality
- 2
- entropy
- 0.795
- entropy_ratio
- 0.795
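The entropy and entropy_ratio figures in these stat lists are consistent with Shannon entropy in bits, divided by log2 of the cardinality. The sketch below reproduces the LeastReached numbers under that assumption; the function name `entropy_and_ratio` is hypothetical, and the profiler's actual implementation is not shown in this report.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple


def entropy_and_ratio(values: Iterable[Hashable]) -> Tuple[float, float]:
    """Shannon entropy in bits, and entropy divided by log2(number of distinct values)."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    k = len(counts)
    # A constant column has no spread, so its ratio is defined here as 0.
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio


# 38 'Y' and 12 'N', as reported for LeastReached.
least_reached = ["Y"] * 38 + ["N"] * 12
h, r = entropy_and_ratio(least_reached)
print(round(h, 3), round(r, 3))  # 0.795 0.795
```

For a binary column the two numbers coincide because log2(2) = 1; for a four-level column such as JPScaleText the ratio is half the entropy.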
Continent
categorical feature
Categorical continent label with 6 distinct values across 50 rows and no nulls. Asia dominates at 38% (19 rows), followed by Africa (14) and Europe (10), while Australia and South America appear only once each. The skewed distribution and high entropy ratio (0.80) suggest reasonable spread but with a clear Asia/Africa concentration. Treatment: One-hot encode; consider grouping Australia and South America into an 'Other' bucket given single-row counts.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- Asia
- top_rate
- 0.38
- cardinality
- 6
- entropy
- 2.067
- entropy_ratio
- 0.7996
JPScalePC
categorical metadata imbalance
JPScalePC is a categorical column that holds the single value "1" across all 50 rows, giving it cardinality 1 and zero entropy. With a top_rate of 1.0 and no nulls, it carries no information and likely represents a constant flag or scale parameter that was never varied in this slice. Treatment: Drop before modelling; the column is constant.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- 1
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
JPScaleText
categorical label
Categorical label describing a Joshua Project reach scale, with 4 distinct levels across 50 rows and no nulls. The distribution is heavily skewed: 'Unreached' covers 76% of rows, while 'Superficially Reached' appears just once, giving an entropy ratio of 0.54. Class imbalance will dominate any model trained on this field. Treatment: Treat as ordinal categorical and address class imbalance (e.g., stratify or collapse rare levels) before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 4
- top_value
- Unreached
- top_rate
- 0.76
- cardinality
- 4
- entropy
- 1.08
- entropy_ratio
- 0.5402
SecurityLevel
numeric feature
SecurityLevel takes only 3 distinct integer values across 50 rows (min 0, max 2, median 2), so it reads as an ordinal category encoded as a number rather than a continuous measure. The distribution is bimodal-leaning: 42% of rows are zero while the median sits at 2, and the strongly negative kurtosis (-1.90) confirms a flat, multi-peaked shape with no outliers. Treatment: Treat as an ordinal category (0/1/2) or one-hot encode before modelling rather than using as a continuous numeric.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 3
- min
- 0
- max
- 2
- mean
- 1.12
- median
- 2
- std
- 0.9823
- q1
- 0
- q3
- 2
- iqr
- 2
- skew
- -0.2416
- kurtosis
- -1.899
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.42
LRTop100
categorical feature imbalance
This is a categorical flag (likely 'is in LR Top 100') that takes the single value 'N' across all 50 rows. With cardinality 1 and entropy 0.0, it carries no information for any downstream model. The 'imbalance' alert here reflects total constancy rather than skew. Treatment: Drop; constant column with zero entropy.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- N
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
PrimaryReligionPGAC
categorical metadata imbalance
This column records the primary religion classification (PGAC), but every one of the 50 rows holds the single value "Islam". With cardinality of 1 and entropy of 0.0, it carries no information for modelling or segmentation. Treatment: Drop; constant column with zero variance.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- Islam
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
PCNonReligious
numeric feature outliers
Likely a percentage of non-religious population per row, with 50 records (49 non-null) and only 6 distinct values. The distribution is dominated by zeros (zero_rate 0.755 of non-null values) with median and IQR both 0, yet a long right tail pushes the max to 10.0 and flags 12 outliers (24.5%). Skew of 1.89 and one missing value (null_rate 0.02) confirm a sparse, heavily right-skewed feature. Treatment: Treat as sparse; consider a zero/non-zero binary flag plus log1p transform before modelling.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 6
- min
- 0
- max
- 10
- mean
- 1.255
- median
- 0
- std
- 2.45
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 1.892
- kurtosis
- 2.849
- n_outliers
- 12
- outlier_rate
- 0.2449
- zero_rate
- 0.7551
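The two-part treatment suggested for this sparse column (a zero/non-zero flag plus a log1p transform) can be sketched as below. The function name `sparse_features` and the sample values are hypothetical, and passing missing values through as None is an assumed choice.

```python
import math
from typing import List, Optional, Tuple


def sparse_features(pct: Optional[float]) -> Tuple[Optional[int], Optional[float]]:
    """Return (is_nonzero flag, log1p-transformed value); missing input stays missing."""
    if pct is None:
        return None, None
    return int(pct > 0), math.log1p(pct)


# Hypothetical sample: two zeros, the reported maximum of 10.0, and one missing value.
values: List[Optional[float]] = [0.0, 0.0, 10.0, None]
features = [sparse_features(v) for v in values]
for original, (flag, logged) in zip(values, features):
    print(original, flag, logged)
```

The flag captures whether any non-religious share is recorded at all, while log1p compresses the tail without failing at zero.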
PhotoCreditURL
categorical metadata
This column holds photo credit URLs (Pixabay and Flickr links), but with only 3 unique values across 50 rows it functions as a coarse source tag rather than a per-record citation. The top URL covers 58.3% of non-null rows (28 of 48), suggesting the same stock photo is reused widely. Null rate is 4% (2 rows). Treatment: Drop or retain as a low-cardinality source label; not useful as a modelling feature.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 3
- top_value
- https://pixabay.com/photos/people-students-walk-street-muslim-6284192/
- top_rate
- 0.5833
- cardinality
- 3
- entropy
- 1.384
- entropy_ratio
- 0.8735
PhotoCreativeCommons
categorical feature
Binary Y/N flag indicating whether a photo carries a Creative Commons license, with no nulls across 50 rows. The distribution is heavily skewed toward 'N' at 84% (42 of 50), leaving only 8 records flagged 'Y'. Treatment: Encode as a boolean indicator; note class imbalance if used as a predictor.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.84
- cardinality
- 2
- entropy
- 0.6343
- entropy_ratio
- 0.6343
PrayForPG
categorical free_text long_tail
Free-text prayer prompts for unreached people groups, with 60% of the 50 rows being empty strings and the remaining 20 entries each unique multi-sentence paragraphs. The dominance of the blank value (top_rate 0.60) coexists with very high textual diversity among non-blank rows, hence the long_tail alert and entropy_ratio of 0.61. No nulls are recorded, but the empty string is functioning as a missing-value sentinel. Treatment: Treat empty strings as missing and tokenize/embed the remaining prose if used as a feature.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 21
- top_value
- (empty string)
- top_rate
- 0.6
- cardinality
- 21
- entropy
- 2.7
- entropy_ratio
- 0.6146
PeopleGroupPhotoURL
categorical metadata
This column holds URLs to people-group profile photos hosted on joshuaproject.net, one per row with no nulls. Despite 50 rows, only 5 distinct images appear, and a single photo (p10375.jpg) covers 56% of records while the top three URLs account for 48 of 50. This suggests many rows share the same people-group identity rather than being unique entities. Treatment: Drop for modelling; retain as a display asset or join key to a people-group lookup.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- https://joshuaproject.net/assets/media/profiles/photos/p10375.jpg
- top_rate
- 0.56
- cardinality
- 5
- entropy
- 1.611
- entropy_ratio
- 0.694
ROG2
categorical feature
ROG2 looks like a regional grouping code with 6 categories (ASI, AFR, EUR, NAR, AUS, LAM), fully populated across all 50 rows. Distribution is moderately concentrated: ASI leads at 38% (19/50) and AFR follows with 14 rows (28%), while AUS and LAM appear only once each. Entropy ratio of 0.80 indicates fairly even spread among the top regions but with thin tails. Treatment: One-hot encode, but consider collapsing AUS and LAM into an 'Other' bucket given single-row support.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- ASI
- top_rate
- 0.38
- cardinality
- 6
- entropy
- 2.067
- entropy_ratio
- 0.7996
PhotoCCVersionText
categorical metadata
This is a categorical license-text field for photo Creative Commons versioning, with only 2 distinct values across 50 rows. 84% of entries are empty strings (42/50), and the remaining 8 carry 'CC BY-NC-SA 2.0'; there are no nulls, just blanks standing in for missing licenses. Entropy ratio of 0.63 reflects this binary, heavily imbalanced split. Treatment: Recode empty strings to missing and collapse to a binary has_license flag.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- (empty string)
- top_rate
- 0.84
- cardinality
- 2
- entropy
- 0.6343
- entropy_ratio
- 0.6343
Longitude
numeric feature outliers
Geographic longitude coordinates spanning -118.3 to 151.2, covering most of the globe's east-west range with all 50 values unique. The distribution is left-skewed (skew -0.77) with a median of 39.45 sitting well above the mean of 27.55, and 7 outliers (14%) flag locations far from the central cluster, whose interquartile range runs from about 10.7 to 51.2 degrees east. Treatment: Pair with Latitude as a 2D geospatial feature; avoid scaling independently.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- min
- -118.3
- max
- 151.2
- mean
- 27.55
- median
- 39.45
- std
- 52.08
- q1
- 10.72
- q3
- 51.19
- iqr
- 40.47
- skew
- -0.7666
- kurtosis
- 1.346
- n_outliers
- 7
- outlier_rate
- 0.14
- zero_rate
- 0
JPScaleImageURL
categorical feature
This column holds URLs to Joshua Project gauge images (gauge-1 through gauge-4), almost certainly a visual encoding of an ordinal progress/status scale. Distribution is heavily skewed: 76% point to gauge-1.png, while gauge-3.png appears only once across 50 rows. With just 4 unique values and no nulls, it's a low-cardinality categorical masquerading as a URL. Treatment: Extract the gauge digit (1-4) and treat as an ordinal feature rather than keeping the URL string.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 4
- top_value
- https://joshuaproject.net/assets/img/gauge/gauge-1.png
- top_rate
- 0.76
- cardinality
- 4
- entropy
- 1.08
- entropy_ratio
- 0.5402
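Extracting the gauge digit from the URL could be done with a small regular expression, as sketched here. The pattern assumes every URL ends in `gauge-<digits>.png`, which matches the top value shown above but has not been checked against all rows; the names `GAUGE_PATTERN` and `gauge_level` are hypothetical.

```python
import re
from typing import Optional

# Assumed URL shape: ".../gauge-<digits>.png" at the end of the string.
GAUGE_PATTERN = re.compile(r"gauge-(\d+)\.png$")


def gauge_level(url: str) -> Optional[int]:
    """Return the integer gauge level embedded in the image URL, or None if absent."""
    match = GAUGE_PATTERN.search(url)
    return int(match.group(1)) if match else None


print(gauge_level("https://joshuaproject.net/assets/img/gauge/gauge-1.png"))  # 1
```

The extracted level should duplicate the JPScale column, so in practice one of the two can be dropped once they are confirmed to agree.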
OfficialLang
categorical feature long_tail
This is a categorical column naming an official language, with 21 distinct values across 50 rows and no nulls. The distribution is heavily skewed: 'Arabic, Standard' alone covers 36% of rows, followed by English (8) and French (5), while a long tail of languages like Swahili, Ukrainian, Sinhala and Bulgarian appear only once. Entropy ratio of 0.77 confirms concentration at the top despite the wide vocabulary. Treatment: Group rare languages into an 'Other' bucket before one-hot encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 21
- top_value
- Arabic, Standard
- top_rate
- 0.36
- cardinality
- 21
- entropy
- 3.39
- entropy_ratio
- 0.7719
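Grouping rare levels into an 'Other' bucket, as recommended here and for several other long-tailed columns, might be sketched like this. The function name `collapse_rare`, the `min_count` threshold of 2, and the mini-sample counts are all illustrative assumptions; only the language names come from the description above.

```python
from collections import Counter
from typing import List


def collapse_rare(values: List[str], min_count: int = 2, other: str = "Other") -> List[str]:
    """Replace every level that occurs fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical mini-sample; the counts are illustrative, not the real distribution.
langs = ["Arabic, Standard"] * 3 + ["English"] * 2 + ["Swahili", "Ukrainian"]
collapsed = collapse_rare(langs)
print(Counter(collapsed))
```

The threshold is a modelling choice: with only 50 rows, even a level seen twice has thin support, so a higher `min_count` may be appropriate.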
PhotoPermission
categorical feature imbalance
Binary Y/N flag indicating whether photo permission has been granted, with no missing values across 50 rows. The distribution is severely imbalanced: 'N' covers 49 of 50 rows (top_rate 0.98) with only a single 'Y', yielding entropy_ratio of 0.14. With one positive case, this column carries almost no discriminative signal in the current sample. Treatment: Drop or hold aside as a near-constant flag until more 'Y' cases accumulate.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.98
- cardinality
- 2
- entropy
- 0.1414
- entropy_ratio
- 0.1414
PCHinduism
numeric feature high_skew
PCHinduism appears to be the percentage of each group that is Hindu (by analogy with PCIslam), with 95.9% of the 49 non-null values being zero and only 3 distinct values across 50 rows. The distribution is extremely sparse and right-skewed (skew 5.96, kurtosis 35.36), with 2 non-zero outliers reaching a max of 6.0 against a median and IQR of 0. Effectively a near-constant feature with rare nonzero spikes. Treatment: Binarize to zero/nonzero or drop, given 95.9% zeros and only 3 unique values.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 3
- min
- 0
- max
- 6
- mean
- 0.1633
- median
- 0
- std
- 0.8978
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 5.957
- kurtosis
- 35.36
- n_outliers
- 2
- outlier_rate
- 0.04082
- zero_rate
- 0.9592
PeopleID3
numeric foreign_key high_skew outliers
PeopleID3 is a numeric identifier-like field with only 5 unique values across 50 rows, clustered tightly around 10375-10376 (IQR of 1.0). The distribution is severely left-skewed (skew -5.59, kurtosis 31.2) because at least one value drops to 10208 while the bulk sits near the max of 10378, producing a 20% outlier rate. Despite being typed as numeric, the near-constant range and low cardinality suggest this behaves as a categorical key rather than a measurement. Treatment: Treat as a categorical identifier; do not use as a continuous numeric feature.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- min
- 10,208
- max
- 10,378
- mean
- 1.037e+04
- median
- 10,375
- std
- 25.8
- q1
- 10,375
- q3
- 10,376
- iqr
- 1
- skew
- -5.593
- kurtosis
- 31.24
- n_outliers
- 10
- outlier_rate
- 0.2
- zero_rate
- 0
PeopleID1
numeric other constant
PeopleID1 is flagged as constant: every one of the 50 rows holds the value 10, with zero variance and a single unique value. Despite the 'ID' name, it carries no identifying information and cannot distinguish records. There is no null or outlier activity to interpret. Treatment: Drop, constant column with no signal.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- min
- 10
- max
- 10
- mean
- 10
- median
- 10
- std
- 0
- q1
- 10
- q3
- 10
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
SpeakNationalLang
unknown feature skipped
Saturn skipped this column, so no profiling stats were computed beyond row count (50) and a null rate of 0.0. The name 'SpeakNationalLang' suggests a binary or categorical indicator of whether a people group speaks the national language, but kind is 'unknown' and n_unique is missing, so the actual value distribution cannot be confirmed from this evidence. Treatment: Re-profile or manually inspect to determine dtype before use; if binary, encode as 0/1.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
PortionsYear
categorical feature long_tail
PortionsYear is a low-cardinality categorical with 9 unique values across 50 rows and a 16% null rate. Most entries are year ranges (e.g., "1939-2021" at 42.9% and "2009-2024"), but 4 rows contain the string "Yes", indicating a mixed/dirty schema where a boolean answer was recorded in a date-range field. Entropy ratio of 0.70 and a long-tail alert reflect many singleton ranges alongside two dominant values. Treatment: Clean the type mismatch (separate "Yes" from year ranges), then parse ranges into start/end year numerics before modelling.
- n
- 50
- nulls
- 8 (16.0%)
- unique
- 9
- top_value
- 1939-2021
- top_rate
- 0.4286
- cardinality
- 9
- entropy
- 2.222
- entropy_ratio
- 0.7009
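The clean-up described above (separating "Yes" from year ranges, then parsing ranges into start and end years) could be sketched as follows. Everything named here is hypothetical: the `Portions` tuple, the `parse_portions_year` function, the decision to read a year range as implying availability, and the handling of a lone four-digit year, which the profile does not show but which seems a plausible variant.

```python
import re
from typing import NamedTuple, Optional


class Portions(NamedTuple):
    available: Optional[bool]   # True when the field indicates portions exist
    start_year: Optional[int]
    end_year: Optional[int]


RANGE = re.compile(r"^(\d{4})-(\d{4})$")
YEAR = re.compile(r"^\d{4}$")


def parse_portions_year(raw: Optional[str]) -> Portions:
    """Split a PortionsYear-style value into an availability flag and numeric years."""
    if raw is None or not raw.strip():
        return Portions(None, None, None)
    text = raw.strip()
    if text.lower() == "yes":
        # Boolean answer recorded in the date field: keep the flag, no years.
        return Portions(True, None, None)
    match = RANGE.match(text)
    if match:
        return Portions(True, int(match.group(1)), int(match.group(2)))
    if YEAR.match(text):
        # Assumed variant: a single year is treated as a one-year range.
        year = int(text)
        return Portions(True, year, year)
    return Portions(None, None, None)  # unrecognised format, left as missing


print(parse_portions_year("1939-2021"))
print(parse_portions_year("Yes"))
```

Keeping the flag and the years in separate fields means the 4 "Yes" rows contribute to an availability feature without polluting the numeric year columns.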
PrimaryReligionPC
categorical feature imbalance
This column records the primary religion of a people-cluster (PC), but every one of the 50 rows holds the value "Islam". With cardinality 1 and entropy 0, it carries no information for distinguishing records in this slice. Treatment: Drop; constant column with zero entropy.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- Islam
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
PCUnknown
numeric feature constant
PCUnknown is a numeric column that is effectively constant: every one of the 49 non-null observations is exactly 0, and 2% of rows are null. There is no variance, no spread, and no outliers, so the column carries no information as currently populated. Treatment: Drop; constant column with zero variance.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 1
- min
- 0
- max
- 0
- mean
- 0
- median
- 0
- std
- 0
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 1
ProfileTextExists
categorical feature
Binary Y/N flag indicating whether a profile text exists, with no nulls across 50 rows. The distribution is heavily imbalanced: 'Y' covers 45 of 50 (top_rate 0.9) versus only 5 'N', giving an entropy ratio of 0.47. Treatment: Encode as a 0/1 boolean; watch for low signal given the 90/10 imbalance.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.9
- cardinality
- 2
- entropy
- 0.469
- entropy_ratio
- 0.469
PCOtherSmall
numeric feature high_skew outliers
PCOtherSmall appears to be a percentage feature (by analogy with the other PC* religion shares) that is zero for nearly every group: 93.9% of the 49 non-null rows are 0 and only 3 distinct values appear. The distribution is extremely right-skewed (skew 6.41, kurtosis 40.46) with a max of 7 among 3 outliers (6.1% outlier rate), so almost all signal lives in a tiny tail. Treatment: Binarize to zero/non-zero or drop, since variance is concentrated in a handful of outliers.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 3
- min
- 0
- max
- 7
- mean
- 0.1837
- median
- 0
- std
- 1.014
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 6.411
- kurtosis
- 40.46
- n_outliers
- 3
- outlier_rate
- 0.06122
- zero_rate
- 0.9388
BibleStatus
numeric feature
BibleStatus is a small-range integer code with only 4 distinct values spanning 2 to 5, mean 3.5 and median 4. The tight IQR (3 to 4) and absence of zeros or nulls suggest it's an ordinal status flag rather than a true numeric measurement. Mild left skew (-0.38) indicates most records sit at the higher end of the scale. Treatment: Treat as an ordinal categorical and one-hot or ordinal-encode before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 4
- min
- 2
- max
- 5
- mean
- 3.5
- median
- 4
- std
- 0.8631
- q1
- 3
- q3
- 4
- iqr
- 1
- skew
- -0.3848
- kurtosis
- -0.6309
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
Frontier
categorical feature
Frontier is a binary Y/N flag with no nulls across 50 rows. The distribution is heavily skewed toward 'N' at 84% (42 of 50), leaving only 8 'Y' cases, which limits its discriminative power. Treatment: Encode as a 0/1 indicator; watch for class imbalance when using as a predictor.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.84
- cardinality
- 2
- entropy
- 0.6343
- entropy_ratio
- 0.6343
MapAddress
categorical metadata long_tail
MapAddress holds filenames of map images (e.g. 'm00007.png', 'm10375_ke.png'), with country/region suffixes appearing on some variants. Nearly half the rows (24/50) are empty strings and a single file 'm00007.png' covers 9 more, leaving a long tail of 15 other filenames at 1-2 occurrences each. Cardinality is 17 unique values with entropy ratio 0.68, so the column is dominated by the blank and one hero image. Treatment: Treat empty string as missing and group the long tail before any categorical encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 17
- top_value
- (empty string)
- top_rate
- 0.48
- cardinality
- 17
- entropy
- 2.792
- entropy_ratio
- 0.6832
PeopleID3ROG3
categorical identifier long_tail
PeopleID3ROG3 looks like a per-row identifier: every one of the 50 rows holds a distinct alphanumeric code (5 digits followed by 2 letters, e.g. '10208NG'), giving cardinality 50 and entropy_ratio 1.0. Top_rate is 0.02 because no value repeats, and there are no nulls. The long_tail alert simply reflects that uniqueness rather than any skew. Treatment: Drop from modelling features; retain as a join/lookup key.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- 10208NG
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
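If the key needs to be decomposed for joins, a sketch along these lines may help. It assumes, based only on the column name and the '10208NG' example, that the code is a 5-digit PeopleID3 followed by a 2-letter uppercase country code; the names `KEY` and `split_key` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed layout: five digits (people ID) followed by two uppercase letters (country code).
KEY = re.compile(r"^(\d{5})([A-Z]{2})$")


def split_key(key: str) -> Optional[Tuple[int, str]]:
    """Split a composite key into (people_id, country_code); None if the layout differs."""
    match = KEY.match(key)
    return (int(match.group(1)), match.group(2)) if match else None


print(split_key("10208NG"))  # (10208, 'NG')
```

Returning None for non-matching keys makes layout violations visible instead of silently producing a wrong split.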
ROP3
numeric feature high_skew outliers
ROP3 is a near-constant numeric code clustered tightly around 100425 (median) with an IQR of just 2, yet 20% of values are flagged as outliers and the minimum drops to 100161 versus a max of 100431. The extreme negative skew (-5.53) and kurtosis above 30 indicate a heavy left tail dragging the mean (100418.74) below the median. With only 5 unique values across 50 rows, and a shape that closely tracks PeopleID3 (also 5 unique values and a 20% outlier rate), this is most likely a people-group identifier code rather than a measurement, so the moments and outlier flags carry little meaning. Treatment: Treat as a categorical identifier; do not centre, bin, or otherwise use it as a continuous numeric feature.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- min
- 100,161
- max
- 100,431
- mean
- 1.004e+05
- median
- 100,425
- std
- 41.09
- q1
- 100,425
- q3
- 100,427
- iqr
- 2
- skew
- -5.53
- kurtosis
- 30.52
- n_outliers
- 10
- outlier_rate
- 0.2
- zero_rate
- 0
PrimaryLanguageDialect
categorical metadata long_tail null_rate imbalance
PrimaryLanguageDialect is a categorical field that is effectively empty: 98% of the 50 rows are null, and the single non-null value is the string "Air". That value is plausible rather than suspect: the sample contains one 'Tuareg, Air' row (see PeopNameAcrossCountries), so "Air" most likely names that group's dialect. With only 1 unique value, entropy is 0, so the column carries no signal in this sample. Treatment: Drop from modelling; optionally confirm that the lone "Air" value sits on the 'Tuareg, Air' row.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Air
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
JPScale
numeric feature high_skew outliers
JPScale is a small-integer ordinal feature with only 4 distinct values ranging from 1 to 4, where the bulk of records sit at 1 (median, Q1 and Q3 all equal 1.0, IQR 0.0). The distribution is heavily right-skewed (skew 2.29, kurtosis 4.44) and 24% of rows (12 of 50) flag as outliers simply because anything above 1 deviates from the dominant value. Mean 1.38 against std 0.81 confirms most mass is at the floor with a long thin tail toward 4. Treatment: Treat as an ordinal/categorical scale (1-4) rather than continuous; one-hot or bin the rare 2-4 levels.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 4
- min
- 1
- max
- 4
- mean
- 1.38
- median
- 1
- std
- 0.8053
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- 2.29
- kurtosis
- 4.437
- n_outliers
- 12
- outlier_rate
- 0.24
- zero_rate
- 0
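The outlier count above is what a standard 1.5 × IQR (Tukey fence) rule would produce when the IQR collapses to zero. The sketch below assumes the profiler uses that rule, which this report does not confirm. The per-level counts (38, 8, 1, 3) are reconstructed from the JPScaleText shares and the reported mean of 1.38, so treat them as an inference; the function name `tukey_outliers` is hypothetical.

```python
import statistics
from typing import List, Sequence


def tukey_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Reconstructed JPScale sample: 38 ones, 8 twos, 1 three, 3 fours (an inference).
jp_scale = [1] * 38 + [2] * 8 + [3] * 1 + [4] * 3
outliers = tukey_outliers(jp_scale)
print(len(outliers), len(outliers) / len(jp_scale))  # 12 0.24
```

Because Q1 and Q3 both equal 1, the fences collapse onto a single point and every row above 1 is flagged, which is why the outlier alert is uninformative for near-constant ordinal columns.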
HasAudioRecordings
categorical metadata imbalance
This column is a flag indicating whether audio recordings exist, but every one of the 50 rows holds the value "Y". Cardinality is 1 and entropy is 0, so it carries no information for any downstream task. Treatment: Drop; constant column with zero entropy.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- Y
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
PCBuddhism
numeric feature constant
PCBuddhism appears to be a numeric feature (likely the percentage of each group that is Buddhist, by analogy with PCIslam) that carries no information in this sample: every one of the 49 non-null values is exactly 0.0, with a 2% null rate. The constant alert and zero_rate of 1.0 confirm there is no variance to model. Treatment: Drop; constant column with no predictive signal.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 1
- min
- 0
- max
- 0
- mean
- 0
- median
- 0
- std
- 0
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 1
PeopNameAcrossCountries
categorical feature
This column appears to label ethnic or people-group identities across countries, with only 5 unique values across 50 rows and no nulls. The distribution is heavily skewed toward 'Arab' (28 of 50, top_rate 0.56), with 'Arab, Arabic Gulf Spoken' and 'Arab, Omani' as the next most common, while 'Tuareg, Air' and 'Amri' appear just once each. Entropy ratio of 0.69 confirms moderate concentration rather than uniform spread. Treatment: Group rare categories or one-hot encode after consolidating the long tail.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- Arab
- top_rate
- 0.56
- cardinality
- 5
- entropy
- 1.611
- entropy_ratio
- 0.694
PhotoCCVersionURL
categorical metadata
This column appears to hold a Creative Commons license URL associated with a photo, but it is overwhelmingly empty: 42 of 50 rows are blank strings and only 8 carry the single license value 'https://creativecommons.org/licenses/by-nc-sa/2.0/'. With just 2 unique values and a top_rate of 0.84, it functions more as a binary licensed/unlicensed flag than a true URL field. Note that nulls are reported as 0.0 because the missing entries are empty strings rather than true nulls. Treatment: Convert to a boolean has_license flag rather than treating as a URL.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- (empty string)
- top_rate
- 0.84
- cardinality
- 2
- entropy
- 0.6343
- entropy_ratio
- 0.6343
MapCCVersionText
categorical metadata imbalance
MapCCVersionText is a categorical column that contains a single value, the empty string, across all 50 rows. Cardinality is 1, entropy is 0, and null_rate is 0.0, so the field is technically populated but carries no information. Treatment: Drop; constant empty value provides no signal.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
PercentChristianPC
categorical feature long_tail imbalance
This appears to be the percent-Christian figure for each row's people cluster (the PC suffix matches PrimaryReligionPC), stored as strings with only 3 distinct values across 50 rows. It is overwhelmingly degenerate: the value '0.997' covers 48/50 rows (top_rate 0.96), with '0.116' and '1.344' each appearing once, yielding an entropy ratio of just 0.178. The 48/1/1 split mirrors ROP2, so the two non-modal values most likely belong to the two rows from other people clusters rather than being data-entry artefacts. Treatment: Drop or treat as near-constant; confirm that the two non-modal rows coincide with the non-modal ROP2 codes before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- 0.997
- top_rate
- 0.96
- cardinality
- 3
- entropy
- 0.2823
- entropy_ratio
- 0.1781
Nomadic
categorical feature
Binary Y/N flag indicating whether a record is nomadic, with no nulls across 50 rows. The distribution is heavily imbalanced: 'N' covers 45 of 50 (top_rate 0.9) while 'Y' appears only 5 times, yielding low entropy_ratio of 0.47. Treatment: Encode as a 0/1 indicator; watch for class imbalance given only 5 positives.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.9
- cardinality
- 2
- entropy
- 0.469
- entropy_ratio
- 0.469
PrayForChurch
categorical free_text long_tail
Free-text prayer prompts about Christian outreach to Arab/Muslim people groups, stored as a categorical but functionally a short-document field. 42 of 50 rows (top_rate 0.84) are empty strings and the remaining 8 are all unique long sentences, giving 9 distinct values and an entropy ratio of 0.35. The long_tail alert reflects this empty-vs-unique split rather than meaningful category structure. Treatment: Treat empty strings as missing and tokenize/embed the remaining prose rather than one-hot encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 9
- top_value
- (empty string)
- top_rate
- 0.84
- cardinality
- 9
- entropy
- 1.114
- entropy_ratio
- 0.3515
RLG3PGAC
numeric metadata constant
RLG3PGAC is a numeric column that holds the constant value 6.0 across all 50 rows, with zero variance and no nulls. Since min, max, mean, and both quartiles are all 6.0, the column carries no information for modelling or analysis. Treatment: Drop, constant column.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- min
- 6
- max
- 6
- mean
- 6
- median
- 6
- std
- 0
- q1
- 6
- q3
- 6
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
ISO3
categorical foreign_key long_tail
ISO3 holds three-letter country codes (ARE, CAN, EGY, KEN...), making it a country identifier. With 41 unique values across 50 rows and entropy ratio 0.986, it is near-uniform: nine countries appear twice and the other 32 appear once; no nulls. The long_tail alert reflects that most countries appear exactly once. Treatment: Left-join on this id to enrich with country attributes.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 41
- top_value
- ARE
- top_rate
- 0.04
- cardinality
- 41
- entropy
- 5.284
- entropy_ratio
- 0.9862
NaturalPronunciation
categorical feature
Phonetic pronunciation guides for what appear to be Arabic-related labels, with most variants ending in 'AE-rub' (likely 'Arab'). The top value 'AE-rub' covers 55% of non-null rows (27 of 49), and the top three values account for 46 of those 49 entries, leaving three singleton long-tail spellings like 'AH-eer TWA-reg' and 'KEN-yun AE-rub'. Cardinality is just 6 with a 2% null rate, suggesting a controlled vocabulary rather than free text. Treatment: One-hot encode or group the three singleton categories into 'other' before modelling.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 6
- top_value
- AE-rub
- top_rate
- 0.551
- cardinality
- 6
- entropy
- 1.728
- entropy_ratio
- 0.6686
PhotoAddress
categorical metadata
PhotoAddress holds JPG filenames (e.g., p10375.jpg), so it points to image assets associated with each row. With only 5 unique values across 50 rows and the top file p10375.jpg covering 56% of records, the same images are reused heavily rather than being row-specific. Entropy ratio of 0.69 confirms a skewed distribution dominated by three filenames, while two others appear just once each. Treatment: Treat as a low-cardinality asset reference; join to an image table or drop unless image features are needed.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- p10375.jpg
- top_rate
- 0.56
- cardinality
- 5
- entropy
- 1.611
- entropy_ratio
- 0.694
RegionCode
numeric foreign_key
RegionCode is stored as an integer but only takes 11 distinct values across 50 rows (min 1, max 12), so it is almost certainly a categorical region identifier rather than a true numeric quantity. The distribution is roughly centered (mean 7.28, median 7) with low skew (-0.09) and one flagged outlier, but those moments are not meaningful for a code. No nulls or zeros are present. Treatment: Cast to categorical and one-hot or target-encode before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 11
- min
- 1
- max
- 12
- mean
- 7.28
- median
- 7
- std
- 2.711
- q1
- 6
- q3
- 9
- iqr
- 3
- skew
- -0.08855
- kurtosis
- -0.2134
- n_outliers
- 1
- outlier_rate
- 0.02
- zero_rate
- 0
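A minimal sketch of the cast-then-one-hot treatment for RegionCode, written without third-party libraries. The function name `one_hot` and the sample codes are hypothetical; the codes are merely chosen from the observed 1 to 12 range.

```python
def one_hot(codes, prefix="RegionCode"):
    """One-hot encode integer codes as unordered categories.

    Returns the sorted list of levels and one indicator dict per row.
    """
    levels = sorted(set(codes))
    rows = [
        {f"{prefix}_{level}": int(code == level) for level in levels}
        for code in codes
    ]
    return levels, rows

# Hypothetical codes within the observed 1..12 range.
levels, rows = one_hot([7, 1, 12, 7])
print(levels)
print(rows[0])
```

Treating the code as a label this way discards the numeric ordering, which is the point: the mean and skew of a region identifier carry no meaning.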
LocationInCountry
categorical free_text long_tail null_rate
Free-text geographic descriptions of where a group is located within a country, ranging from single words like "Scattered" to multi-clause sentences naming provinces and landmarks. The column is 72% null, so only 14 of 50 rows carry values, yet entropy_ratio is 0.99 with 13 unique strings across those 14 non-nulls: essentially every entry is bespoke. The top value "Widespread." appears just twice, so there is no usable category structure. Treatment: Treat as free-text notes; geocode or NER-extract place names rather than one-hot encoding.
- n
- 50
- nulls
- 36 (72.0%)
- unique
- 13
- top_value
- Widespread.
- top_rate
- 0.1429
- cardinality
- 13
- entropy
- 3.664
- entropy_ratio
- 0.9903
JF
categorical feature
JF is a binary Y/N flag with no missing values across 50 rows. The distribution is imbalanced: 'Y' accounts for 41 of 50 records (top_rate 0.82) versus 9 'N's, yielding entropy of 0.68. Treatment: Encode as a 0/1 indicator and account for the 82/18 class imbalance in modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- Y
- top_rate
- 0.82
- cardinality
- 2
- entropy
- 0.6801
- entropy_ratio
- 0.6801
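As a rough illustration of the JF treatment, the sketch below encodes the flag as 0/1 and derives inverse-frequency class weights using the common 'balanced' heuristic, weight(c) = n / (k * count(c)). The helper names are hypothetical, and the choice of weighting scheme is an assumption rather than something the profile prescribes.

```python
def encode_flag(values, positive="Y"):
    """Map a Y/N flag to 1/0."""
    return [1 if v == positive else 0 for v in values]

def balanced_weights(labels):
    """Inverse-frequency weights: n / (n_classes * count of that class)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

# Counts taken from the profile: 41 'Y' and 9 'N'.
jf = ["Y"] * 41 + ["N"] * 9
encoded = encode_flag(jf)
weights = balanced_weights(encoded)
print(weights)
```

The minority class ('N', encoded 0) receives a weight of roughly 2.78 against roughly 0.61 for 'Y', so each class contributes equally in aggregate.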
PopulationPGAC
numeric feature outliers
PopulationPGAC appears to be a population count tied to some PGAC grouping, with values ranging from 101,000 to 7,562,600 across 50 rows. Only 5 unique values populate the column, so the 'numeric' framing is misleading: it behaves more like a coarse categorical bucket. The right skew (1.03) and 26% outlier rate stem from the 13 flagged rows, which include the largest population value far above the Q3 of 3,096,000. Treatment: Treat as a categorical/ordinal bucket given only 5 unique values, or log-transform if kept numeric.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- min
- 101,000
- max
- 7.563e+06
- mean
- 3.402e+06
- median
- 1.927e+06
- std
- 2.427e+06
- q1
- 1.927e+06
- q3
- 3.096e+06
- iqr
- 1.169e+06
- skew
- 1.03
- kurtosis
- -0.6488
- n_outliers
- 13
- outlier_rate
- 0.26
- zero_rate
- 0
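To make the outlier flag concrete, the sketch below applies Tukey's 1.5 × IQR fences to the reported quartiles and shows the effect of a log10 transform on the range. Whether the profiler actually uses this rule is an assumption; the quartiles, minimum, and maximum are the rounded figures from the stats above.

```python
import math

def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as reported in the profile (rounded).
Q1, Q3 = 1_927_000, 3_096_000
low, high = tukey_fences(Q1, Q3)

def is_outlier(x):
    """True when x falls outside the Tukey fences computed above."""
    return x < low or x > high

# A log10 transform compresses the range from ~75x to under 2 units.
log_min = math.log10(101_000)
log_max = math.log10(7_562_600)
print(low, high, round(log_max - log_min, 3))
```

Under this rule the fences sit at 173,500 and 4,849,500, so both the reported minimum and the reported maximum would be flagged.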
PeopleGroupMapExpandedURL
categorical metadata long_tail
This column holds URLs to expanded people-group map PDFs hosted on joshuaproject.net. It is mostly empty: 38 of 50 rows (top_rate 0.76) are blank strings, and the 11 distinct values include that blank, leaving 10 distinct URLs spread over the 12 populated rows. Despite a null rate of 0, the dominant value is an empty string, so true coverage is roughly a quarter of rows (24%). Treatment: Convert empty strings to nulls and treat as an optional reference link rather than a modelling feature.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 11
- top_value
- (empty string)
- top_rate
- 0.76
- cardinality
- 11
- entropy
- 1.575
- entropy_ratio
- 0.4554
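The 'convert empty strings to nulls' step could look like the sketch below, which also recomputes coverage afterwards. The helper names are hypothetical and the `example.org` URLs are placeholders, not values from the dataset; only the 38 blank / 12 populated split comes from the profile.

```python
def blank_to_none(values):
    """Turn empty or whitespace-only strings into None; keep other values."""
    return [v if v is not None and v.strip() else None for v in values]

def coverage(values):
    """Share of rows holding a real (non-None) value."""
    return sum(v is not None for v in values) / len(values)

# 38 blanks and 12 populated rows, matching the profile's split.
# The URLs themselves are placeholders.
urls = [""] * 38 + [f"https://example.org/map_{i}.pdf" for i in range(12)]
cleaned = blank_to_none(urls)
print(cleaned.count(None), coverage(cleaned))
```

After cleaning, the reported null rate of 0 becomes 76%, which matches how sparse the column actually is.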
TranslationNeedQuestionable
unknown other skipped
The column 'TranslationNeedQuestionable' was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 with no nulls. The name suggests a boolean or flag indicating whether a translation need is in doubt, but this cannot be confirmed from the evidence. No distributional signals are present to flag. Treatment: Re-profile or manually inspect to determine type before any downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
Category
categorical label
A low-cardinality categorical with 3 distinct values ('1','2','3') across 50 rows and no nulls, likely a class label or category code. The distribution is imbalanced: '1' dominates at 68% (34/50) while '2' and '3' account for just 7 and 9 rows respectively, giving an entropy ratio of 0.77. Treatment: Treat as a categorical class label and address the class imbalance (e.g., stratified splits or reweighting) before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- 1
- top_rate
- 0.68
- cardinality
- 3
- entropy
- 1.221
- entropy_ratio
- 0.7702
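The entropy and entropy_ratio figures for Category can be reproduced from the stated counts (34, 7, 9), assuming the profiler uses base-2 Shannon entropy normalised by log2 of the cardinality. That definition is inferred from the numbers rather than documented, and the function names are hypothetical.

```python
import math

def shannon_entropy(counts):
    """Base-2 Shannon entropy of a list of category counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of non-empty categories)."""
    k = sum(1 for c in counts if c > 0)
    return shannon_entropy(counts) / math.log2(k) if k > 1 else 0.0

# Counts for Category values '1', '2', '3' as stated in the profile.
category_counts = [34, 7, 9]
h = shannon_entropy(category_counts)
r = entropy_ratio(category_counts)
print(round(h, 3), round(r, 4))
```

A ratio of 1 would mean perfectly balanced classes and 0 a constant column, so 0.77 sits at moderate imbalance.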
PhotoCopyright
categorical feature imbalance
PhotoCopyright is a binary Y/N flag, almost certainly indicating whether a photo carries a copyright restriction. The distribution is severely imbalanced: 49 of 50 rows are 'N' and only 1 is 'Y', giving an entropy ratio of just 0.14. With effectively no variance, this column carries little signal on its own. Treatment: Drop or retain only as a rare-event indicator; near-constant at 98% 'N'.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.98
- cardinality
- 2
- entropy
- 0.1414
- entropy_ratio
- 0.1414
NTOnline
categorical feature imbalance
NTOnline is a categorical flag that takes only the value 'Y' across all 41 non-null rows, with 18% of records null. It is effectively a constant indicator with no discriminative information, and its non-trivial missingness rate may itself be the only signal. Treatment: Drop as a zero-variance column, or replace with a binary is_null indicator if missingness is meaningful.
- n
- 50
- nulls
- 9 (18.0%)
- unique
- 1
- top_value
- Y
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
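If the missingness in NTOnline is judged meaningful, the is_null replacement might be sketched as below. The function name is hypothetical, and representing nulls as Python `None` is an assumption about how the data is loaded.

```python
def missing_indicator(values):
    """Return 1 where the value is missing (None), else 0."""
    return [1 if v is None else 0 for v in values]

# 41 'Y' values and 9 nulls, matching the profile.
nt_online = ["Y"] * 41 + [None] * 9
nt_missing = missing_indicator(nt_online)
print(sum(nt_missing), len(nt_missing))
```

The derived indicator has two levels (9 ones, 41 zeros), whereas the original column has a single observed value.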
LeastReachedPC
categorical metadata imbalance
This is a categorical flag that takes the value "Y" for all 50 rows, giving cardinality 1 and entropy 0.0. With a single constant value it carries no information and cannot discriminate between records. The name suggests it once tracked a 'least reached PC' status, but here it is degenerate. Treatment: Drop; constant column with zero entropy.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- Y
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ROG3
categorical feature long_tail
ROG3 holds two-letter country codes (AE, CA, EG, KE, SO, etc.), making it a geographic categorical feature. With 41 unique values across 50 rows and a top rate of just 0.04, the column is almost a unique-per-row identifier, and the entropy ratio of 0.9862 confirms an extremely flat, long-tail distribution. No nulls, but the sample is too thin for any country to dominate. Treatment: Group rare countries into an 'Other' bucket or map to region before encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 41
- top_value
- AE
- top_rate
- 0.04
- cardinality
- 41
- entropy
- 5.284
- entropy_ratio
- 0.9862
ReligionSubdivision
categorical metadata null_rate imbalance
This column records a religious subdivision, but it is overwhelmingly empty: 86% of the 50 rows are null, and every one of the 7 populated rows is 'Sunni'. With cardinality of 1 and entropy of 0, the field carries no discriminative signal in this sample. Treatment: Drop or collapse to a binary 'Sunni vs missing' indicator; otherwise non-informative.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 1
- top_value
- Sunni
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
PCEthnicReligions
numeric feature high_skew outliers
Numeric column, likely a percentage given the PC prefix, where 93.9% of the 49 non-null values (46 rows) are zero and only 3 distinct values appear. The distribution is highly right-skewed (skew 4.45, kurtosis 19.79) with a max of 10 against a median of 0, producing 3 outliers (6.1%). One null is present. Treatment: Binarize to zero/non-zero or drop, since near-constant zeros carry little signal.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 3
- min
- 0
- max
- 10
- mean
- 0.4082
- median
- 0
- std
- 1.719
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 4.446
- kurtosis
- 19.79
- n_outliers
- 3
- outlier_rate
- 0.06122
- zero_rate
- 0.9388
PeopleCluster
categorical feature long_tail imbalance
Categorical grouping of people clusters, with only 3 distinct values across 50 rows and no nulls. The distribution is extremely imbalanced: 'Arab, Arabian' covers 96% of rows, leaving 'Tuareg' and 'Arab, Sudan' with a single record each, yielding a very low entropy ratio of 0.178. Treatment: Drop or collapse rare levels; near-constant column offers little signal.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- Arab, Arabian
- top_rate
- 0.96
- cardinality
- 3
- entropy
- 0.2823
- entropy_ratio
- 0.1781
IndigenousCode
categorical feature
Binary Y/N flag indicating Indigenous status, with 'N' dominating at 86% (43 of 50) versus 7 'Y' records. The column is fully populated with no nulls and only 2 distinct values, yielding a low entropy ratio of 0.58. The class imbalance is notable for any modelling use. Treatment: Encode as a binary indicator and account for class imbalance in any downstream model.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- N
- top_rate
- 0.86
- cardinality
- 2
- entropy
- 0.5842
- entropy_ratio
- 0.5842
MapCreditURL
categorical metadata imbalance
MapCreditURL appears to be a metadata field intended to hold a URL crediting a map's source, but every one of the 50 rows is an empty string. Cardinality is 1, entropy is 0, and the top value (empty) accounts for 100% of records, so the column carries no information. Treatment: Drop; the column is constant (all empty strings).
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
MapCopyright
categorical metadata imbalance
MapCopyright is a categorical column that holds the single value "N" across all 50 rows, giving it zero entropy and a top_rate of 1.0. With cardinality of 1 and no nulls, it carries no information for any downstream model or comparison. Treatment: Drop; constant column with a single value.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- N
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
MapCCVersionURL
categorical metadata imbalance
This column appears to be a metadata field intended to hold a Creative Commons version URL for a map, but every one of the 50 rows contains an empty string. Cardinality is 1 with entropy of 0.0, so it carries no information whatsoever. Treatment: Drop the column; it is constant (empty) across all rows.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
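Several of the columns profiled here (MapCreditURL, MapCopyright, MapCCVersionURL, LeastReachedPC) are constant and marked for dropping. A small sketch of how such columns might be detected in bulk follows; the function name and the four-row toy table are hypothetical, and treating blank strings as missing is an assumption.

```python
def constant_columns(table, blank_is_missing=True):
    """Return names of columns with at most one distinct non-missing value.

    `table` maps column name -> list of values.
    """
    constant = []
    for name, values in table.items():
        seen = set()
        for v in values:
            if v is None or (blank_is_missing and v == ""):
                continue
            seen.add(v)
        if len(seen) <= 1:
            constant.append(name)
    return constant

# Toy four-row table mirroring the value patterns described in the profile.
table = {
    "MapCreditURL": ["", "", "", ""],
    "MapCopyright": ["N", "N", "N", "N"],
    "MapCCVersionURL": ["", "", "", ""],
    "JF": ["Y", "Y", "N", "Y"],
}
to_drop = constant_columns(table)
print(to_drop)
```

Counting blanks as missing means an all-empty column and a single-valued column are both caught by the same check.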
PeopleGroupURL
categorical identifier long_tail
This column holds Joshua Project people-group URLs, with each row pointing to a /people_groups/{id}/{country} path. All 50 rows are unique (cardinality 50, entropy_ratio 1.0, top_rate 0.02) and there are no nulls, so it functions as a per-row identifier rather than a categorical feature. The URL stems repeat (e.g. 10375 appears across TZ, UP, AG, AS, AU, BA), suggesting the same people group is tracked across multiple country codes. Treatment: Drop from modelling; retain as a row-level link or parse out the group id and country code into separate keys.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- https://joshuaproject.net/people_groups/10208/NG
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
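Parsing the group id and country code out of a PeopleGroupURL could be done along these lines, assuming every URL follows the /people_groups/{id}/{country} path shape described above. The function name is hypothetical; the example URL is the top_value from the stats.

```python
from urllib.parse import urlparse

def split_people_group_url(url):
    """Return (group_id, country_code) from a /people_groups/{id}/{country} URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3 or parts[0] != "people_groups":
        raise ValueError(f"unexpected URL shape: {url!r}")
    return parts[1], parts[2]

group_id, country = split_people_group_url(
    "https://joshuaproject.net/people_groups/10208/NG"
)
print(group_id, country)
```

The id is kept as a string since it is an identifier, and the explicit shape check surfaces any URL that deviates from the expected pattern instead of silently mis-parsing it.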