saturn

quirky peppers

source: /home/coolhand/html/datavis/data_trove/data/quirky/peppers.json · 175 rows · 11 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 175 pepper varieties in 11 fields covering name, origin, flavor, heat category, biological type, intended use, source URL, and Scoville heat measurements (min, median, max, plus a jalapeño-relative score, jalRP). The Scoville and jalRP numeric columns are extremely right-skewed (skew roughly 9 to 10, kurtosis above 100): scoville_max reaches 16,000,000 against a median of just 30,000, so a handful of super-hot peppers dominate the tail, and between 17% and 25% of the values in each numeric column flag as outliers. On the categorical side, 'Medium' heat accounts for 40% of peppers and 'Culinary' use covers about 81%, while origin leans toward the United States (26%) and Mexico (15%). Worth a closer look first: the Scoville distribution (consider a log scale) and the type column, which has casing inconsistencies ('annuum' vs 'Annuum', 'chinense' vs 'Chinense') that should be cleaned before any grouping.

citing: scoville_max · scoville_median · jalRP · heat · use · origin · type · flavor

Schema

11 columns
Per-column summary.

column           type         nulls  unique  alerts
name             categorical  0.0%   175     long_tail
heat             categorical  0.0%   5
scoville_min     numeric      0.0%   44      high_skew, outliers
scoville_max     numeric      0.0%   59      high_skew, outliers
scoville_median  numeric      0.0%   80      high_skew, outliers
jalRP            numeric      0.0%   81      high_skew, outliers
type             categorical  0.0%   8
origin           categorical  0.0%   34
use              categorical  0.0%   4
flavor           categorical  0.0%   73      long_tail
url              categorical  0.0%   175     long_tail

name

categorical identifier long_tail
The `name` column holds 175 unique strings across 175 rows (cardinality 175, entropy_ratio 1.0), making it a perfect per-row identifier. Sample values like "Bell Pepper", "Gypsy Pepper", and "Peperone di Senise" show that it names individual pepper varieties rather than encoding a categorical feature. With every value occurring exactly once (top_rate 0.0057), there is no useful frequency signal to model on. Treatment: Use as a row label or join key; drop from the feature matrix, since every value is unique. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 175
top_value: Bell Pepper
top_rate: 0.005714
cardinality: 175
entropy: 7.451
entropy_ratio: 1
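The entropy figures in these stat blocks can be reproduced with a short sketch. It assumes the profiler reports Shannon entropy in bits and divides by log2(cardinality) to get entropy_ratio; the report does not state this, but the numbers for `name` are consistent with it. The function name `entropy_stats` and the placeholder identifiers are hypothetical.

```python
import math
from collections import Counter


def entropy_stats(values):
    """Return (entropy in bits, entropy ratio) for a list of category values."""
    counts = Counter(values)
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Normalise by the maximum entropy for this many categories, log2(k).
    # A single-category column has zero entropy, so avoid dividing by log2(1) = 0.
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return entropy, entropy / max_entropy


# Placeholder identifiers standing in for the 175 unique pepper names.
names = [f"pepper-{i}" for i in range(175)]
name_entropy, name_ratio = entropy_stats(names)
print(round(name_entropy, 3), round(name_ratio, 4))
```

For 175 distinct values the entropy is log2(175), about 7.451 bits, and the ratio is 1, matching the block above.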

heat

categorical feature
This is a categorical heat rating with 5 ordinal tiers and no nulls across 175 rows. Medium dominates at 40% (70 rows), followed by Mild (45); Hot (17) and Extra Hot (13) are the rarest tiers, while Super Hot (30) is more common than either, breaking the expected decline in frequency up the heat scale. Treatment: Encode as an ordered categorical before modelling. The conventional order is Mild < Medium < Hot < Extra Hot < Super Hot, with super-hots at the top of the Scoville scale; confirm it against scoville_median per tier. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 5
top_value: Medium
top_rate: 0.4
cardinality: 5
entropy: 2.074
entropy_ratio: 0.8933
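Label names alone do not settle whether 'Extra Hot' or 'Super Hot' ranks higher, so one option is to derive the tier order from the data instead of hard-coding it. The sketch below ranks tiers by the median of scoville_median within each tier and then encodes rows as integer ranks. The function `tier_order` is hypothetical, and the rows are made-up illustrations, not values from peppers.json.

```python
from collections import defaultdict
from statistics import median


def tier_order(rows):
    """Order heat tiers by the median of scoville_median within each tier."""
    by_tier = defaultdict(list)
    for row in rows:
        by_tier[row["heat"]].append(row["scoville_median"])
    return sorted(by_tier, key=lambda tier: median(by_tier[tier]))


# Made-up rows for illustration only; not values from peppers.json.
rows = [
    {"heat": "Mild", "scoville_median": 500},
    {"heat": "Mild", "scoville_median": 0},
    {"heat": "Medium", "scoville_median": 15_000},
    {"heat": "Hot", "scoville_median": 60_000},
    {"heat": "Extra Hot", "scoville_median": 200_000},
    {"heat": "Super Hot", "scoville_median": 1_200_000},
]

order = tier_order(rows)
rank = {tier: i for i, tier in enumerate(order)}  # tier -> ordinal code
encoded = [rank[row["heat"]] for row in rows]
print(order, encoded)
```

Running this against the real rows would show directly which of the two top tiers carries the higher Scoville values.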

scoville_min

numeric feature high_skew outliers
Minimum Scoville heat ratings for 175 peppers, spanning 0 to 15,000,000 with a median of 15,000. The distribution is heavily right-skewed (skew 10.31, kurtosis 120.13): the mean of 289,208 is about 19 times the median, 29 values (16.6%) flag as outliers, and 9.7% are zeros. The std of 1,218,458 against an IQR of just 74,000 confirms a long, heavy tail. Treatment: Apply a log1p transform before any modelling to tame the extreme skew and outliers; log1p also handles the zeros, where a plain log is undefined. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 44
min: 0
max: 1.5e+07
mean: 2.892e+05
median: 15,000
std: 1.218e+06
q1: 1,000
q3: 75,000
iqr: 74,000
skew: 10.31
kurtosis: 120.1
n_outliers: 29
outlier_rate: 0.1657
zero_rate: 0.09714
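To illustrate what the recommended log1p transform does to this range, the sketch below applies it to the five landmarks from the summary (min, q1, median, q3, max). The helper name `log1p_transform` is hypothetical; `math.log1p` is the standard-library function for log(1 + x).

```python
import math


def log1p_transform(values):
    """Apply log(1 + x); defined at x = 0, unlike a plain log."""
    return [math.log1p(v) for v in values]


# Landmarks from the scoville_min summary: min, q1, median, q3, max.
landmarks = [0, 1_000, 15_000, 75_000, 15_000_000]
transformed = log1p_transform(landmarks)

for raw, t in zip(landmarks, transformed):
    print(f"{raw:>12,} -> {t:.2f}")
```

The transform is monotonic, so rank order is preserved while a raw range of 15,000,000 is compressed to fewer than 17 units.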

scoville_max

numeric feature high_skew outliers
Maximum Scoville heat ratings for 175 peppers, ranging from 0 to 16,000,000 with a median of 30,000 but a mean of 384,835. The distribution is extremely right-skewed (skew 9.45, kurtosis 106.1), with 24.6% of values flagged as outliers and 5.7% zeros. The interquartile range (2,750 to 100,000, an IQR of 97,250) is dwarfed by the max, consistent with a few extreme super-hot varieties dominating the tail. Treatment: Apply a log1p transform before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 59
min: 0
max: 1.6e+07
mean: 3.848e+05
median: 30,000
std: 1.333e+06
q1: 2,750
q3: 100,000
iqr: 97,250
skew: 9.45
kurtosis: 106.1
n_outliers: 43
outlier_rate: 0.2457
zero_rate: 0.05714
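The report does not say which rule produces the outlier flags. A common choice is Tukey's fences at 1.5 × IQR beyond the quartiles, and the sketch below computes those fences from the reported q1 and q3 under that assumption. The function names are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, lower, upper):
    """True when the value falls outside the fences."""
    return value < lower or value > upper


# Quartiles reported for scoville_max.
lower, upper = tukey_fences(2_750, 100_000)
print(lower, upper)
print(is_outlier(16_000_000, lower, upper))  # the reported max
print(is_outlier(30_000, lower, upper))      # the reported median
```

Because the lower fence is negative and Scoville values cannot be, only high values can be flagged under this rule, which fits a tail made up of super-hot varieties.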

scoville_median

numeric feature high_skew outliers
Numeric column capturing the median Scoville heat rating across 175 entries with no nulls and 80 unique values. The distribution is extremely right-skewed (skew 9.79, kurtosis 111.5): the median is 22,500 while the mean is 339,805 and the max reaches 15,500,000, with 41 outliers (23.4%) and 5.7% zeros. The IQR of 88,000 (2,000 to 90,000) is tiny relative to the std of 1,278,965, confirming a heavy upper tail. Treatment: Apply a log1p transform before modelling to compress the heavy right tail. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 80
min: 0
max: 1.55e+07
mean: 3.398e+05
median: 22,500
std: 1.279e+06
q1: 2,000
q3: 90,000
iqr: 88,000
skew: 9.794
kurtosis: 111.5
n_outliers: 41
outlier_rate: 0.2343
zero_rate: 0.05714
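For readers who want to see where skew and kurtosis figures like these come from, here is a minimal sketch using central moments. It computes the population (biased) skewness and excess kurtosis; the profiler may use a bias-corrected estimator instead, so exact agreement with the reported 9.794 and 111.5 is not guaranteed. The function name and the toy samples are hypothetical.

```python
from statistics import fmean


def moment_skew_kurtosis(values):
    """Population skewness and excess kurtosis from central moments."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n  # variance
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Toy samples: one symmetric, one with a single extreme value in the tail.
symmetric = [1, 2, 3, 4, 5]
heavy_tail = [1_000] * 9 + [15_500_000]

sym_skew, sym_kurt = moment_skew_kurtosis(symmetric)
tail_skew, tail_kurt = moment_skew_kurtosis(heavy_tail)
print(sym_skew, tail_skew, tail_kurt)
```

A single extreme value among ten is enough to push the skewness well above zero, which is the same mechanism at work in the Scoville columns.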

jalRP

numeric feature high_skew outliers
Numeric feature 'jalRP' is extremely right-skewed: the median is 4.29 with Q3 at 17.14, yet the max reaches 2952.38 and the mean (64.72) sits far above the median. Skew of 9.79 and kurtosis of 111.48 confirm a heavy tail, and 23.4% of values flag as outliers with 5.7% exact zeros. Only 81 unique values across 175 rows suggest repeated discrete magnitudes rather than a smooth continuum. Treatment: Apply a log1p transform (or winsorize) before modelling to tame the heavy tail. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 81
min: 0
max: 2952
mean: 64.72
median: 4.29
std: 243.6
q1: 0.38
q3: 17.14
iqr: 16.76
skew: 9.795
kurtosis: 111.5
n_outliers: 41
outlier_rate: 0.2343
zero_rate: 0.05714
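The skew, kurtosis, outlier rate, and zero rate reported for jalRP are identical to those of scoville_median, which is what a constant rescaling of one column into the other would produce. The sketch below tests that reading by dividing each reported scoville_median statistic by its jalRP counterpart. The inputs are the summary figures from this report; the conclusion is an inference from those figures, not something the report states.

```python
# Summary statistics as reported for the two columns.
scoville_median = {
    "max": 15_500_000, "mean": 339_805, "median": 22_500,
    "q1": 2_000, "q3": 90_000, "std": 1_278_965,
}
jalrp = {
    "max": 2952.38, "mean": 64.72, "median": 4.29,
    "q1": 0.38, "q3": 17.14, "std": 243.6,
}

# If jalRP = scoville_median / c, every ratio should land near the same c.
ratios = {key: scoville_median[key] / jalrp[key] for key in jalrp}
for key, ratio in ratios.items():
    print(f"{key:>6}: {ratio:,.1f}")

spread = (max(ratios.values()) - min(ratios.values())) / min(ratios.values())
print(f"relative spread: {spread:.2%}")
```

Every ratio falls within about half a percent of 5,250, the midpoint of the commonly quoted 2,500 to 8,000 SHU range for jalapeños. If that holds row by row, jalRP carries no information beyond scoville_median and only one of the two needs to go into a model.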

type

categorical label
This column records the Capsicum species (type), dominated by 'annuum' at 59.4% of 175 rows (104), with 'chinense' second at 46 rows. Watch out for case-inconsistent duplicates ('Annuum' in 4 rows and 'Chinense' in 2, alongside their lowercase forms) and a literal 'N/A' string that is not counted as null (null rate 0.0%). Treatment: Lowercase-normalize and convert 'N/A' to null before using as a categorical label. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 8
top_value: annuum
top_rate: 0.5943
cardinality: 8
entropy: 1.657
entropy_ratio: 0.5524
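The cleanup described for this column could look like the following sketch. The set of strings treated as missing is an assumption (only 'N/A' is confirmed by the report), and the function name `normalize_type` is hypothetical.

```python
from collections import Counter

# Strings treated as missing. Only "N/A" appears in the report; the
# others are assumed variants added for robustness.
MISSING_TOKENS = {"", "n/a", "na"}


def normalize_type(value):
    """Lowercase and trim a species label; map missing markers to None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return None if cleaned in MISSING_TOKENS else cleaned


# The five spellings called out in the report.
raw = ["annuum", "Annuum", "chinense", "Chinense", "N/A"]
counts = Counter(normalize_type(v) for v in raw)
print(counts)
```

After normalisation the case variants merge, so the 8 reported categories should collapse to at most 5 real species plus nulls.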

origin

categorical feature
This is a categorical origin/country field with 34 distinct values across 175 rows and no nulls. Distribution is moderately concentrated: United States leads at 26.3% (46 rows), followed by Mexico (26) and South America (11), with entropy ratio 0.78 indicating fairly broad spread across the long tail. Notable quirks include a mix of country-level (United States, Italy, India) and region-level (South America, Caribbean) labels, plus 7 explicit 'Unknown' entries. Treatment: Normalise country-vs-region granularity and treat 'Unknown' as missing before one-hot or target encoding. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 34
top_value: United States
top_rate: 0.2629
cardinality: 34
entropy: 3.98
entropy_ratio: 0.7823
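One way to reconcile the mixed granularity is to roll countries up to regions, so every label sits at the same level. The sketch below does that with a small lookup table. The mapping is a hypothetical, partial example covering only the countries named in this report; a real table would need all 34 values and a decision about which regions to use.

```python
# Hypothetical, partial mapping for illustration only.
COUNTRY_TO_REGION = {
    "United States": "North America",
    "Mexico": "North America",
    "Italy": "Europe",
    "India": "Asia",
}

MISSING_ORIGINS = {"", "Unknown"}


def normalize_origin(value):
    """Map countries to regions, pass regions through, and null out unknowns."""
    if value is None:
        return None
    cleaned = value.strip()
    if cleaned in MISSING_ORIGINS:
        return None
    return COUNTRY_TO_REGION.get(cleaned, cleaned)


for origin in ["United States", "Mexico", "Caribbean", "Unknown"]:
    print(origin, "->", normalize_origin(origin))
```

Rolling up loses country detail, so keeping the original column alongside the normalised one is a reasonable precaution.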

use

categorical feature
This is a low-cardinality categorical describing each pepper's intended use, with 4 distinct values across 175 rows and no nulls. The distribution is heavily skewed: 'Culinary' accounts for 80.6% (141 rows) and 'Ornamental' for 31 rows, plus 2 rows with a combined 'Culinary, Ornamental' label and 1 empty string that should be treated as missing. An entropy ratio of 0.40 confirms the imbalance. Treatment: Normalize the empty string to null and split the comma-delimited value into multi-hot flags before encoding. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 4
top_value: Culinary
top_rate: 0.8057
cardinality: 4
entropy: 0.8097
entropy_ratio: 0.4049
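A minimal sketch of that treatment follows, assuming the only base labels are 'Culinary' and 'Ornamental' as the value counts suggest. The function name `use_flags` is hypothetical.

```python
# Base labels inferred from the four observed values.
USE_LABELS = ("Culinary", "Ornamental")


def use_flags(value):
    """Return one boolean flag per base label, or None for a missing value."""
    if value is None or not value.strip():
        return None  # the empty string is treated as missing
    parts = {part.strip() for part in value.split(",")}
    return {label: label in parts for label in USE_LABELS}


# The four distinct values reported for the column.
for raw in ["Culinary", "Ornamental", "Culinary, Ornamental", ""]:
    print(repr(raw), "->", use_flags(raw))
```

With two flags instead of four categories, the combined label no longer needs its own level.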

flavor

categorical feature long_tail
This is a categorical flavor descriptor field, with values that look like comma-separated tag combinations (e.g. 'Sweet, Fruity, Earthy, Smoky') rather than single labels. Cardinality is high, with 73 unique values across only 175 rows, and an entropy_ratio of 0.845 confirms a long tail; the top value 'Sweet' covers just 14.3% of rows. The compound labels suggest the underlying data is multi-label flavor notes that have been collapsed into one string. Treatment: Split on commas and one-hot encode the individual flavor tags before modelling. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 73
top_value: Sweet
top_rate: 0.1429
cardinality: 73
entropy: 5.232
entropy_ratio: 0.8453
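The split-and-encode step can be sketched as below, building the tag vocabulary from the data itself. The two input strings are the sample values quoted in this report; the function names are hypothetical.

```python
def split_tags(value):
    """Split a comma-separated flavor string into trimmed, non-empty tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def one_hot_tags(values):
    """Return (sorted tag vocabulary, one 0/1 row per input value)."""
    tag_lists = [split_tags(v) for v in values]
    vocab = sorted({tag for tags in tag_lists for tag in tags})
    matrix = [[int(tag in tags) for tag in vocab] for tags in tag_lists]
    return vocab, matrix


# Sample values quoted in the report.
flavors = ["Sweet", "Sweet, Fruity, Earthy, Smoky"]
vocab, matrix = one_hot_tags(flavors)
print(vocab)
print(matrix)
```

The number of distinct tags is likely far smaller than the 73 distinct combined strings, which is what makes the encoding usable on 175 rows.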

url

categorical identifier long_tail
This is a URL column serving as a per-row identifier, with all 175 values unique and zero nulls. Every entry is a pepperscale.com pepper page (e.g., bell-pepper, gypsy-pepper, habanada-pepper), so the column is effectively a primary key for pepper varieties. Entropy ratio of ~1.0 confirms no repetition. Treatment: Drop from modelling; retain as a row key or source link. high · anthropic:claude-opus-4-7
n: 175
nulls: 0 (0.0%)
unique: 175
top_value: https://www.pepperscale.com/bell-pepper/
top_rate: 0.005714
cardinality: 175
entropy: 7.451
entropy_ratio: 1
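If a shorter row key is preferred over the full URL, the page slug can be extracted with the standard library. This sketch assumes every URL follows the pattern of the sample value, with the slug as the last path segment; the function name `url_slug` is hypothetical.

```python
from urllib.parse import urlparse


def url_slug(url):
    """Return the last non-empty path segment of a URL."""
    path = urlparse(url).path
    return path.strip("/").split("/")[-1]


slug = url_slug("https://www.pepperscale.com/bell-pepper/")
print(slug)
```

Before relying on slugs as keys, it is worth checking that they are still unique across all 175 rows.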