quirky peppers
Reading
This dataset catalogs 175 pepper varieties in 11 fields: name, origin, flavor, heat category, Capsicum species (type), intended use, source URL, and Scoville heat measurements (min, median, max, plus a jalapeño-relative score, jalRP). The Scoville and jalRP columns are extremely right-skewed (skew roughly 9 to 10, kurtosis above 100): scoville_max reaches 16,000,000 against a median of just 30,000, so a handful of super-hot peppers dominate the tail, and 24.6% of scoville_max values are flagged as outliers. On the categorical side, 'Medium' heat accounts for 40% of peppers and 'Culinary' use for about 81%, while origin leans toward the United States (26%) and Mexico (15%). Worth a closer look first: the Scoville distribution (consider a log scale) and the type column, which has casing inconsistencies ('annuum' vs 'Annuum', 'chinense' vs 'Chinense') that should be cleaned before any grouping.
citing: scoville_max · scoville_median · jalRP · heat · use · origin · type · flavor
Charts the summary said to look at first
scoville_max: histogram (13 equal-width bins)
| bin | count |
|---|---|
| 0 – 1.231e+06 | 155 |
| 1.231e+06 – 2.462e+06 | 16 |
| 2.462e+06 – 3.692e+06 | 3 |
| 3.692e+06 – 4.923e+06 | 0 |
| 4.923e+06 – 6.154e+06 | 0 |
| 6.154e+06 – 7.385e+06 | 0 |
| 7.385e+06 – 8.615e+06 | 0 |
| 8.615e+06 – 9.846e+06 | 0 |
| 9.846e+06 – 1.108e+07 | 0 |
| 1.108e+07 – 1.231e+07 | 0 |
| 1.231e+07 – 1.354e+07 | 0 |
| 1.354e+07 – 1.477e+07 | 0 |
| 1.477e+07 – 1.6e+07 | 1 |
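With equal-width bins, 155 of 175 values land in the first bin, which is why the summary suggests a log scale. A minimal sketch of power-of-ten binning follows; the function name `decade_counts` and the sample values are hypothetical and are not rows from the dataset.

```python
import math
from collections import Counter

def decade_counts(values):
    """Count values per power-of-ten bin; zeros get their own 'zero' bin."""
    counts = Counter()
    for v in values:
        if v <= 0:
            counts["zero"] += 1  # log10 is undefined at 0, so keep zeros apart
        else:
            counts[math.floor(math.log10(v))] += 1
    return counts

# Hypothetical sample values, not rows from the dataset.
sample_shu = [0, 500, 2_500, 8_000, 30_000, 120_000, 350_000, 2_200_000, 16_000_000]

for key, count in decade_counts(sample_shu).items():
    label = "0" if key == "zero" else f"10^{key} to 10^{key + 1}"
    print(label, count)
```

Keeping zeros in a separate bin avoids hiding the sweet peppers (0 SHU) inside the lowest decade.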
heat: value counts
| value | count | share |
|---|---|---|
| Medium | 70 | 40.0% |
| Mild | 45 | 25.7% |
| Super Hot | 30 | 17.1% |
| Hot | 17 | 9.7% |
| Extra Hot | 13 | 7.4% |
type: value counts
| value | count | share |
|---|---|---|
| annuum | 104 | 59.4% |
| chinense | 46 | 26.3% |
| baccatum | 12 | 6.9% |
| Annuum | 4 | 2.3% |
| frutescens | 4 | 2.3% |
| pubescens | 2 | 1.1% |
| Chinense | 2 | 1.1% |
| N/A | 1 | 0.6% |
use: value counts
| value | count | share |
|---|---|---|
| Culinary | 141 | 80.6% |
| Ornamental | 31 | 17.7% |
| Culinary, Ornamental | 2 | 1.1% |
| (empty string) | 1 | 0.6% |
origin: value counts (top 20 of 34 values)
| value | count | share |
|---|---|---|
| United States | 46 | 26.3% |
| Mexico | 26 | 14.9% |
| South America | 11 | 6.3% |
| Peru | 11 | 6.3% |
| Italy | 8 | 4.6% |
| Unknown | 7 | 4.0% |
| United Kingdom | 7 | 4.0% |
| Trinidad | 7 | 4.0% |
| Caribbean | 6 | 3.4% |
| India | 6 | 3.4% |
| Brazil | 5 | 2.9% |
| Spain | 4 | 2.3% |
| Hungary | 4 | 2.3% |
| Japan | 3 | 1.7% |
| Africa | 3 | 1.7% |
| China | 2 | 1.1% |
| Thailand | 2 | 1.1% |
| Balkan Peninsula | 1 | 0.6% |
| France | 1 | 0.6% |
| Chile | 1 | 0.6% |
Schema
11 columns
| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| name | categorical | 0.0% | 175 | long_tail |
| heat | categorical | 0.0% | 5 | |
| scoville_min | numeric | 0.0% | 44 | high_skew, outliers |
| scoville_max | numeric | 0.0% | 59 | high_skew, outliers |
| scoville_median | numeric | 0.0% | 80 | high_skew, outliers |
| jalRP | numeric | 0.0% | 81 | high_skew, outliers |
| type | categorical | 0.0% | 8 | |
| origin | categorical | 0.0% | 34 | |
| use | categorical | 0.0% | 4 | |
| flavor | categorical | 0.0% | 73 | long_tail |
| url | categorical | 0.0% | 175 | long_tail |
name
categorical · identifier · long_tail
The `name` column holds 175 unique strings across 175 rows (entropy_ratio ~1.0), making it a per-row identifier. Sample values such as "Bell Pepper", "Gypsy Pepper", and "Peperone di Senise" show that each row is one pepper variety, so the column is a label rather than a categorical feature. Because every value occurs exactly once (top_rate 0.0057), it carries no frequency signal to model on.
Treatment: Use as a row label or join key; drop it from the feature matrix because every value is unique.
- n: 175
- nulls: 0 (0.0%)
- unique: 175
- top_value: Bell Pepper
- top_rate: 0.005714
- cardinality: 175
- entropy: 7.451
- entropy_ratio: 1
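The entropy figures in this report are consistent with Shannon entropy in bits, and entropy_ratio with that entropy divided by log2 of the cardinality. The sketch below reproduces the reported values under that assumption; the function names are illustrative and are not taken from the profiling tool.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    if k <= 1:
        return 0.0  # a constant column has no spread to normalise
    return shannon_entropy(values) / math.log2(k)

# 175 distinct placeholder names stand in for the real `name` values.
names = [f"pepper_{i}" for i in range(175)]
print(round(shannon_entropy(names), 3), round(entropy_ratio(names), 4))

# The `heat` counts from the value-count table above.
heat = ["Medium"] * 70 + ["Mild"] * 45 + ["Super Hot"] * 30 + ["Hot"] * 17 + ["Extra Hot"] * 13
print(round(shannon_entropy(heat), 3), round(entropy_ratio(heat), 4))
```

A ratio of 1 means every distinct value is equally frequent, which is exactly the situation for an identifier column.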
heat
categorical · feature
This is a categorical heat rating with 5 ordinal tiers and no nulls across 175 rows. Medium dominates at 40% (70 rows), followed by Mild (45). The counts do not decline monotonically up the scale: Hot (17) and Extra Hot (13) are the rarest tiers, while Super Hot (30) is more common than either.
Treatment: Encode as an ordered categorical (Mild < Medium < Hot < Extra Hot < Super Hot) before modelling.
- n: 175
- nulls: 0 (0.0%)
- unique: 5
- top_value: Medium
- top_rate: 0.4
- cardinality: 5
- entropy: 2.074
- entropy_ratio: 0.8933
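An ordered encoding of `heat` can be sketched as a rank lookup. The tier order is passed in explicitly because the relative position of the top two tiers is an assumption here; pass a different `order` if the source scale ranks them the other way. `HEAT_ORDER` and `encode_heat` are hypothetical names.

```python
# Assumed tier order, mildest first. Verify against the source scale before use.
HEAT_ORDER = ("Mild", "Medium", "Hot", "Extra Hot", "Super Hot")

def encode_heat(label, order=HEAT_ORDER):
    """Return the integer rank of a heat label; raise ValueError if unknown."""
    return order.index(label)

labels = ["Medium", "Mild", "Super Hot", "Hot", "Extra Hot"]
print([encode_heat(label) for label in labels])
```

Raising on unknown labels, instead of silently assigning a default rank, surfaces typos or new tiers early.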
scoville_min
numeric · feature · high_skew · outliers
Minimum Scoville heat rating for each of the 175 peppers, spanning 0 to 15,000,000 with a median of 15,000. The distribution is extremely right-skewed (skew 10.31, kurtosis 120.13): the mean of 289,208 is about 19 times the median, 29 values (16.6%) are flagged as outliers, and 9.7% are zeros. A std of 1,218,458 against an IQR of just 74,000 confirms a long, heavy tail.
Treatment: Apply a log1p transform before any modelling to tame the extreme skew and outliers.
- n: 175
- nulls: 0 (0.0%)
- unique: 44
- min: 0
- max: 1.5e+07
- mean: 2.892e+05
- median: 15,000
- std: 1.218e+06
- q1: 1,000
- q3: 75,000
- iqr: 74,000
- skew: 10.31
- kurtosis: 120.1
- n_outliers: 29
- outlier_rate: 0.1657
- zero_rate: 0.09714
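The log1p transform computes log(1 + x), so the zero-heat rows map to 0 instead of failing as they would under a plain logarithm. A minimal sketch using the reported min, q1, median, q3, and max of scoville_min as inputs; the helper names are illustrative.

```python
import math

def log1p_transform(values):
    """Apply log(1 + x) elementwise; safe for the zeros in this column."""
    return [math.log1p(v) for v in values]

def inverse_log1p(values):
    """Undo the transform with exp(x) - 1."""
    return [math.expm1(v) for v in values]

# Reported summary statistics of scoville_min: min, q1, median, q3, max.
summary = [0, 1_000, 15_000, 75_000, 15_000_000]
transformed = log1p_transform(summary)
print([round(t, 2) for t in transformed])
```

On the transformed scale the maximum is less than twice the median, compared with a factor of 1,000 on the raw scale.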
scoville_max
numeric · feature · high_skew · outliers
Maximum Scoville heat rating for each of the 175 peppers, ranging from 0 to 16,000,000 with a median of 30,000 but a mean of 384,835. The distribution is extremely right-skewed (skew 9.45, kurtosis 106.1), with 24.6% of values flagged as outliers and 5.7% zeros. The interquartile range (2,750 to 100,000, IQR 97,250) is dwarfed by the max, consistent with a few extreme super-hot varieties dominating the tail.
Treatment: Apply a log1p transform before modelling to tame the heavy right tail.
- n: 175
- nulls: 0 (0.0%)
- unique: 59
- min: 0
- max: 1.6e+07
- mean: 3.848e+05
- median: 30,000
- std: 1.333e+06
- q1: 2,750
- q3: 100,000
- iqr: 97,250
- skew: 9.45
- kurtosis: 106.1
- n_outliers: 43
- outlier_rate: 0.2457
- zero_rate: 0.05714
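The report does not state which outlier rule it uses. A common choice is Tukey's rule, which flags values more than 1.5 × IQR outside the quartiles; the sketch below assumes that rule and takes the reported q1 and q3 of scoville_max as inputs. The sample values are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]

# Reported quartiles of scoville_max.
lower, upper = tukey_fences(2_750, 100_000)
print(lower, upper)

# Hypothetical sample values, not rows from the dataset.
sample = [0, 5_000, 100_000, 245_875, 350_000, 16_000_000]
print(flag_outliers(sample, 2_750, 100_000))
```

Because the lower fence is negative and Scoville values cannot be, only the upper fence matters for this column under the assumed rule.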
scoville_median
numeric · feature · high_skew · outliers
Median Scoville heat rating for each of the 175 peppers, with no nulls and 80 unique values. The distribution is extremely right-skewed (skew 9.79, kurtosis 111.5): the median is 22,500 while the mean is 339,805 and the max reaches 15,500,000, with 41 outliers (23.4%) and 5.7% zeros. The interquartile range (2,000 to 90,000, IQR 88,000) is tiny relative to the std of 1,278,965, confirming a heavy upper tail.
Treatment: Apply a log1p transform before modelling to compress the heavy right tail.
- n: 175
- nulls: 0 (0.0%)
- unique: 80
- min: 0
- max: 1.55e+07
- mean: 3.398e+05
- median: 22,500
- std: 1.279e+06
- q1: 2,000
- q3: 90,000
- iqr: 88,000
- skew: 9.794
- kurtosis: 111.5
- n_outliers: 41
- outlier_rate: 0.2343
- zero_rate: 0.05714
jalRP
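numeric · feature · high_skew · outliers
The jalapeño-relative score 'jalRP' is extremely right-skewed: the median is 4.29 with Q3 at 17.14, yet the max reaches 2952.38 and the mean (64.72) sits far above the median. Skew of 9.79 and kurtosis of 111.48 confirm a heavy tail, and 23.4% of values are flagged as outliers with 5.7% exact zeros. These figures match scoville_median exactly in skew, kurtosis, outlier count, and zero rate, and the summary statistics equal those of scoville_median divided by 5,250 (for example, 22,500 / 5,250 ≈ 4.29), so jalRP appears to be a linear rescaling of scoville_median and adds no independent information. The 81 unique values across 175 rows reflect repeated discrete magnitudes rather than a smooth continuum.
Treatment: Apply a log1p transform (or winsorize) before modelling to tame the heavy tail, and avoid using jalRP alongside scoville_median as separate features.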
- n: 175
- nulls: 0 (0.0%)
- unique: 81
- min: 0
- max: 2952
- mean: 64.72
- median: 4.29
- std: 243.6
- q1: 0.38
- q3: 17.14
- iqr: 16.76
- skew: 9.795
- kurtosis: 111.5
- n_outliers: 41
- outlier_rate: 0.2343
- zero_rate: 0.05714
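The jalRP statistics line up with the scoville_median statistics divided by a constant. The sketch below checks that against the figures reported in this document; the divisor 5,250 is inferred from those ratios and is an assumption, not a value documented by the dataset.

```python
# Assumed jalapeño reference heat in SHU, inferred from the reported ratios.
JALAPENO_REFERENCE_SHU = 5_250

def implied_jalrp(scoville_median, reference=JALAPENO_REFERENCE_SHU):
    """Jalapeño-relative score implied by a median Scoville rating."""
    return scoville_median / reference

# (scoville_median statistic, reported jalRP statistic) pairs from this report.
reported = {
    "q1": (2_000, 0.38),
    "median": (22_500, 4.29),
    "q3": (90_000, 17.14),
    "mean": (339_805, 64.72),
    "max": (15_500_000, 2952.38),
}

for stat, (shu, jalrp) in reported.items():
    print(stat, round(implied_jalrp(shu), 2), jalrp)
```

If the relationship holds row by row, one of the two columns can be dropped without losing information.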
type
categorical · label
This column records the Capsicum species (type), dominated by 'annuum' at 59.4% of 175 rows, with 'chinense' second at 46 rows (26.3%). Watch out for case-inconsistent duplicates ('Annuum' 4, 'Chinense' 2 alongside their lowercase forms) and a literal 'N/A' string that is not counted as null (null rate 0.0%). After merging the case variants, 'annuum' covers 108 rows (61.7%) and 'chinense' 48 (27.4%), leaving 5 species plus one missing value.
Treatment: Lowercase-normalize and convert 'N/A' to null before using as a categorical label.
- n: 175
- nulls: 0 (0.0%)
- unique: 8
- top_value: annuum
- top_rate: 0.5943
- cardinality: 8
- entropy: 1.657
- entropy_ratio: 0.5524
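The clean-up for `type` can be sketched as a small normalising function applied before counting. The counts below come from the value-count table for `type`; `normalize_type` is a hypothetical helper name.

```python
from collections import Counter

def normalize_type(raw):
    """Lowercase and trim a species label; map 'N/A' or blank to None."""
    cleaned = raw.strip().lower()
    return None if cleaned in {"", "n/a"} else cleaned

# Value counts as reported for the `type` column.
raw_counts = {
    "annuum": 104, "chinense": 46, "baccatum": 12, "Annuum": 4,
    "frutescens": 4, "pubescens": 2, "Chinense": 2, "N/A": 1,
}

merged = Counter()
for label, count in raw_counts.items():
    merged[normalize_type(label)] += count

print(merged.most_common())
```

After normalisation the eight raw labels collapse to five species plus one missing value.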
origin
categorical · feature
This is a categorical origin field with 34 distinct values across 175 rows and no nulls. The distribution is moderately concentrated: United States leads at 26.3% (46 rows), followed by Mexico (26), then South America and Peru (11 each), and an entropy ratio of 0.78 indicates a fairly broad spread across the remaining values. Notable quirks include a mix of country-level (United States, Italy, India) and region-level (South America, Caribbean) labels, plus 7 explicit 'Unknown' entries.
Treatment: Normalise country-vs-region granularity and treat 'Unknown' as missing before one-hot or target encoding.
- n: 175
- nulls: 0 (0.0%)
- unique: 34
- top_value: United States
- top_rate: 0.2629
- cardinality: 34
- entropy: 3.98
- entropy_ratio: 0.7823
use
categorical · feature
This is a low-cardinality categorical describing the intended use of each pepper, with 4 distinct values across 175 rows and no nulls. The distribution is heavily imbalanced: 'Culinary' accounts for 80.6% (141 rows) and 'Ornamental' for 31 rows, plus 2 rows with a combined 'Culinary, Ornamental' label and 1 empty string that should be treated as missing. An entropy ratio of 0.40 confirms the imbalance.
Treatment: Normalize the empty string to null and split the comma-delimited value into multi-hot flags before encoding.
- n: 175
- nulls: 0 (0.0%)
- unique: 4
- top_value: Culinary
- top_rate: 0.8057
- cardinality: 4
- entropy: 0.8097
- entropy_ratio: 0.4049
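One way to sketch the treatment for `use` is a function that returns one 0/1 flag per category and `None` for the empty string. The category list is taken from the values observed in this report; `use_flags` is a hypothetical helper name.

```python
def use_flags(raw, categories=("Culinary", "Ornamental")):
    """Map a raw `use` string to 0/1 flags per category; None if blank."""
    if not raw.strip():
        return None  # empty string is treated as missing
    tags = {tag.strip() for tag in raw.split(",")}
    return {category: int(category in tags) for category in categories}

# The four distinct raw values reported for the `use` column.
for raw in ["Culinary", "Ornamental", "Culinary, Ornamental", ""]:
    print(repr(raw), use_flags(raw))
```

With flags, the two 'Culinary, Ornamental' rows contribute to both categories instead of forming a third, nearly empty class.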
flavor
categorical · feature · long_tail
This is a categorical flavor descriptor whose values look like comma-separated tag combinations (e.g. 'Sweet, Fruity, Earthy, Smoky') rather than single labels. Cardinality is high, with 73 unique values across only 175 rows, and an entropy_ratio of 0.845 confirms a long tail; the top value 'Sweet' covers just 14.3% of rows. The compound labels suggest the underlying data is multi-label flavor notes that have been collapsed into one string.
Treatment: Split on commas and one-hot encode the individual flavor tags before modelling.
- n: 175
- nulls: 0 (0.0%)
- unique: 73
- top_value: Sweet
- top_rate: 0.1429
- cardinality: 73
- entropy: 5.232
- entropy_ratio: 0.8453
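Splitting the flavor strings and one-hot encoding the tags can be sketched as below. The first two sample rows use values quoted in this report; the third is a hypothetical combination added for illustration, and the function names are illustrative.

```python
def split_tags(raw):
    """Split a comma-separated flavor string into trimmed, non-empty tags."""
    return [tag.strip() for tag in raw.split(",") if tag.strip()]

def one_hot_tags(rows):
    """Return (sorted tag vocabulary, one 0/1 row per input string)."""
    vocab = sorted({tag for raw in rows for tag in split_tags(raw)})
    matrix = []
    for raw in rows:
        present = set(split_tags(raw))
        matrix.append([int(tag in present) for tag in vocab])
    return vocab, matrix

# Third row is hypothetical; the first two appear in the report.
rows = ["Sweet", "Sweet, Fruity, Earthy, Smoky", "Fruity, Smoky"]
vocab, matrix = one_hot_tags(rows)
print(vocab)
for raw, encoded in zip(rows, matrix):
    print(encoded, raw)
```

Encoding individual tags replaces 73 mostly rare string combinations with a much smaller set of reusable flavor columns.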
url
categorical · identifier · long_tail
This is a URL column serving as a per-row identifier, with all 175 values unique and zero nulls. Every entry is a pepperscale.com pepper page (e.g., bell-pepper, gypsy-pepper, habanada-pepper), so the column is effectively a primary key for pepper varieties. An entropy ratio of ~1.0 confirms no repetition.
Treatment: Drop from modelling; retain as a row key or source link.
- n: 175
- nulls: 0 (0.0%)
- unique: 175
- top_value: https://www.pepperscale.com/bell-pepper/
- top_rate: 0.005714
- cardinality: 175
- entropy: 7.451
- entropy_ratio: 1