saturn·

quirky hot sauces

source /home/coolhand/html/datavis/data_trove/data/quirky/hot_sauces.json · 258 rows · 9 columns · profiled 2026-05-01

Reading

dataset summary · high confidence · anthropic:claude-opus-4-7

This dataset catalogs 258 hot sauce products sourced entirely from OpenFoodFacts. All 9 columns are categorical: name, brand, countries, categories, ingredients, labels, url, source, and type. Brands are highly fragmented across 158 unique values: Tabasco (12) and McIlhenny Company, Tabasco (11) lead among named brands, but no brand dominates, and 37 records have a blank brand worth investigating. Geographically, 'United States' (54) and France (28) are the most frequent of the 123 distinct country values, though inconsistent encoding (e.g., 'en:us' vs 'United States') means the countries column needs cleaning before counts can be trusted. The labels column is sparse: 145 of 258 rows are blank, so dietary tags like 'No gluten' or 'Non GMO project' are available for at most 113 rows. Note that source and type are constant (OpenFoodFacts / hot_sauce_product) and carry no analytical signal.

citing: brand · countries · labels · categories · name · source · type

Schema

9 columns
Per-column summary.
column · type · nulls · unique · alert
name · categorical · 0.0% · 221 · long_tail
brand · categorical · 0.0% · 158 · long_tail
countries · categorical · 0.0% · 123 · long_tail
categories · categorical · 0.0% · 106 · long_tail
ingredients · categorical · 0.0% · 207 · long_tail
labels · categorical · 0.0% · 77 · long_tail
url · categorical · 0.0% · 258 · long_tail
source · categorical · 0.0% · 1 · imbalance
type · categorical · 0.0% · 1 · imbalance
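
Every column reports 0.0% nulls, yet the per-column notes describe blanks stored as empty strings. A minimal sketch of a blank audit follows. It assumes the JSON file loads as a list of row dictionaries, which this report does not confirm, and `blank_rates` and the sample rows are hypothetical.

```python
from collections import Counter

def blank_rates(rows):
    """Return {column: share of rows whose value is None or a blank string}."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if value is None or (isinstance(value, str) and not value.strip()):
                counts[col] += 1
    total = len(rows)
    if total == 0:
        return {}
    columns = sorted({col for row in rows for col in row})
    return {col: counts[col] / total for col in columns}

# Made-up rows for illustration only; real use would pass the parsed
# contents of hot_sauces.json (assumed to be a list of dicts).
sample = [
    {"name": "Habanero Sauce", "brand": "", "labels": ""},
    {"name": "Sriracha", "brand": "Tabasco", "labels": "   "},
]
rates = blank_rates(sample)
```

Whitespace-only strings are counted as blank here. That is a design choice, not something the profile states.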

name

categorical label long_tail
Product name field for hot sauces, with 221 unique values across 258 rows and a near-maximal entropy ratio of 0.984. The top value, 'Carolina Reaper Hot Sauce', covers only 2.3% of rows (6 of 258). Casing and spelling variants ('Carolina Reaper Hot Sauce' vs 'Carolina reaper hot sauce', 'Sriracha Hot Chilli Sauce' vs 'Sriracha Hot Chili Sauce'), a French entry, and 3 empty strings show that the values are not normalized even though the null rate is 0.0.
Treatment: normalize casing and spelling variants, and treat empty strings as missing, before grouping or joining.
high · anthropic:claude-opus-4-7
n: 258
nulls: 0 (0.0%)
unique: 221
top_value: Carolina Reaper Hot Sauce
top_rate: 0.02326
cardinality: 221
entropy: 7.666
entropy_ratio: 0.9843
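
The name treatment can be sketched as a grouping key. This is one possible approach, not the profiler's method. The `SPELLING` map is a hypothetical starting point that covers only the 'Chilli' vs 'Chili' variant mentioned in the notes for this column.

```python
import re

# Hypothetical spelling map; extend it after inspecting the real variants.
SPELLING = {"chilli": "chili"}

def normalize_name(raw):
    """Return a case-insensitive grouping key, or None for a blank name."""
    if raw is None:
        return None
    # Collapse runs of whitespace, trim, and fold case.
    text = re.sub(r"\s+", " ", raw).strip().casefold()
    if not text:
        return None
    words = [SPELLING.get(word, word) for word in text.split(" ")]
    return " ".join(words)

key_a = normalize_name("Carolina Reaper Hot Sauce")
key_b = normalize_name("Carolina reaper hot sauce")
key_c = normalize_name("Sriracha Hot Chilli Sauce")
```

The key is meant for grouping and joining. Keep the original string for display.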

brand

categorical feature long_tail
Categorical brand label for a hot sauce catalogue, with 158 distinct values (including the empty string) across 258 rows and a high entropy ratio (0.894) that indicates a long tail. The most common value is the empty string, at 37 occurrences (14.3% top rate), so blanks standing in for missing values outnumber any real brand, including Tabasco (12) and McIlhenny Company, Tabasco (11). 'Tabasco' and 'McIlhenny Company, Tabasco' likely refer to the same maker but appear as separate categories, which points to inconsistent normalisation.
Treatment: replace empty strings with explicit nulls, normalise brand aliases (e.g. Tabasco vs McIlhenny), then group rare brands into 'Other' before encoding.
high · anthropic:claude-opus-4-7
n: 258
nulls: 0 (0.0%)
unique: 158
top_value: (empty string)
top_rate: 0.1434
cardinality: 158
entropy: 6.53
entropy_ratio: 0.8941
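
A minimal sketch of the three brand steps (blank to null, alias merge, rare-brand grouping). The `ALIASES` entry follows the guess above that both spellings are the same maker. The `min_count` threshold is an arbitrary illustration, not a value taken from the profile.

```python
from collections import Counter

# Hypothetical alias table, keyed by case-folded raw brand.
ALIASES = {"mcilhenny company, tabasco": "Tabasco"}

def clean_brand(raw):
    """Map blanks to None and known aliases to one canonical brand."""
    if raw is None or not raw.strip():
        return None
    brand = raw.strip()
    return ALIASES.get(brand.casefold(), brand)

def group_rare(brands, min_count=2):
    """Replace brands seen fewer than min_count times with 'Other'."""
    cleaned = [clean_brand(b) for b in brands]
    counts = Counter(b for b in cleaned if b is not None)
    return [
        b if b is None or counts[b] >= min_count else "Other"
        for b in cleaned
    ]

grouped = group_rare(["Tabasco", "McIlhenny Company, Tabasco", "", "Cholula"])
```

Missing brands stay as `None` rather than falling into 'Other', so the two cases remain distinguishable downstream.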

countries

categorical feature long_tail
Country-of-origin or country-of-sale label for 258 records, with 123 distinct values and no nulls. The encoding is inconsistent: plain names ('United States', 54) coexist with Open Food Facts-style tag prefixes ('en:us', 10; 'en:United States', 4) and multi-country strings ('United States, World'), so the same country appears under several spellings. The 20.9% top rate therefore understates the true United States share, and the high entropy ratio (0.82) partly reflects spelling fragmentation rather than real geographic spread.
Treatment: normalize to ISO country codes (strip 'en:' prefixes, split comma lists) before grouping or encoding.
high · anthropic:claude-opus-4-7
n: 258
nulls: 0 (0.0%)
unique: 123
top_value: United States
top_rate: 0.2093
cardinality: 123
entropy: 5.676
entropy_ratio: 0.8176
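
The country treatment might look like the sketch below. It assumes prefixes are always two lowercase letters followed by a colon, as in the 'en:' examples. `ISO_CODES` is a hypothetical, deliberately tiny lookup; a real run would need a complete name-to-code table.

```python
import re

# Hypothetical lookup from case-folded names or tags to ISO 3166-1 alpha-2.
ISO_CODES = {"united states": "US", "us": "US", "france": "FR", "fr": "FR"}

def parse_countries(raw):
    """Split one countries cell into (sorted ISO codes, sorted unmapped tokens)."""
    codes, unmapped = set(), set()
    for part in raw.split(","):
        # Drop a leading language prefix such as 'en:' before the lookup.
        token = re.sub(r"^[a-z]{2}:", "", part.strip()).strip().casefold()
        if not token:
            continue
        if token in ISO_CODES:
            codes.add(ISO_CODES[token])
        else:
            unmapped.add(token)
    return sorted(codes), sorted(unmapped)

plain = parse_countries("United States")
tagged = parse_countries("en:us")
multi = parse_countries("United States, World")
```

Unmapped tokens such as 'world' are returned separately so they can be reviewed instead of being silently dropped.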

categories

categorical feature long_tail
Comma-delimited product category tags, dominated by condiment/sauce/hot-sauce hierarchies. Cardinality is high (106 unique across 258 rows, entropy ratio 0.82), and the most common value is the empty string at 13.6% (35 rows), which indicates missing labels encoded as blanks rather than nulls. Many values are near-duplicates that differ by spacing, casing, 'en:' prefixes, or one extra tag (e.g., 'Condiments,Sauces' vs 'Condiments, Sauces, Groceries'), so raw cardinality overstates the size of the true taxonomy.
Treatment: normalise delimiters and casing, treat empty strings as missing, then split into a multi-hot tag encoding.
high · anthropic:claude-opus-4-7
n: 258
nulls: 0 (0.0%)
unique: 106
top_value: (empty string)
top_rate: 0.1357
cardinality: 106
entropy: 5.506
entropy_ratio: 0.8183
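
A minimal sketch of the split-and-encode step for categories. Folding hyphens to spaces is an assumption about how tag-style values such as 'en:hot-sauces' relate to plain names. Both function names are hypothetical.

```python
import re

def split_tags(raw):
    """Split one comma-delimited cell into a set of normalised tags."""
    tags = set()
    for part in raw.split(","):
        tag = re.sub(r"^[a-z]{2}:", "", part.strip().casefold())
        # Assumption: hyphens and repeated spaces are equivalent separators.
        tag = re.sub(r"[\s-]+", " ", tag).strip()
        if tag:
            tags.add(tag)
    return tags

def multi_hot(cells):
    """Return (vocabulary, matrix) with one 0/1 row per input cell."""
    tag_sets = [split_tags(cell) for cell in cells]
    vocab = sorted(set().union(*tag_sets))
    matrix = [[int(tag in tags) for tag in vocab] for tags in tag_sets]
    return vocab, matrix

vocab, matrix = multi_hot(["Condiments,Sauces", "Condiments, Sauces, Groceries", ""])
```

A blank cell becomes an all-zero row. If missing categories need to be told apart from "no tags", track blanks in a separate indicator.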

ingredients

categorical free_text long_tail
Free-text ingredient lists for hot sauce or chili products, with 207 distinct strings across 258 rows and an entropy ratio of 0.90, meaning almost every non-blank value is unique. The most common value is the empty string at 49 rows (19% top_rate), so roughly a fifth of records have no ingredients recorded. The remaining entries mix multiple languages (English, French, Norwegian, German) and formatting conventions, so direct categorical use is not viable.
Treatment: treat empty strings as missing, then tokenize and normalize across languages and extract ingredient features before modelling.
high · anthropic:claude-opus-4-7
n: 258
nulls: 0 (0.0%)
unique: 207
top_value: (empty string)
top_rate: 0.1899
cardinality: 207
entropy: 6.922
entropy_ratio: 0.8997
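
The tokenizing step could start from a sketch like this one. It assumes ingredients are separated by commas or semicolons and that percentages and parenthesised details can be discarded. Neither assumption has been checked against the real strings, and no cross-language mapping is attempted.

```python
import re

def tokenize_ingredients(raw):
    """Return a list of lower-cased ingredient tokens, or None if blank."""
    if raw is None or not raw.strip():
        return None
    text = raw.casefold()
    text = re.sub(r"\([^()]*\)", " ", text)           # drop non-nested parentheses
    text = re.sub(r"\d+(?:[.,]\d+)?\s*%", " ", text)  # drop percentages like 60% or 12,5%
    tokens = []
    for part in re.split(r"[,;]", text):
        # Collapse whitespace and trim common trailing punctuation.
        token = re.sub(r"\s+", " ", part).strip(" .:*_")
        if token:
            tokens.append(token)
    return tokens

tokens = tokenize_ingredients("Red peppers (cayenne) 60%, Vinegar; salt.")
```

Percentages are removed before splitting so that a decimal comma (as in '12,5%') is not mistaken for a separator.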

labels

categorical feature long_tail
Free-form product label tags (dietary, certification, packaging) with 77 distinct values across 258 rows. Over half the rows (56.2%) carry an empty string rather than a true null, so null_rate=0 is misleading. Values mix languages (English 'No gluten' vs French 'Sans gluten') and formats (raw text vs Open Food Facts taxonomy codes like 'en:vegan'), and many cells concatenate multiple labels with commas.
Treatment: treat empty strings as missing, split on commas, normalise language/taxonomy variants, then multi-hot encode.
high · anthropic:claude-opus-4-7
n: 258
nulls: 0 (0.0%)
unique: 77
top_value: (empty string)
top_rate: 0.562
cardinality: 77
entropy: 3.557
entropy_ratio: 0.5675
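
Before encoding labels, it helps to see how many rows each canonical label covers. The sketch below does that. The `CANONICAL` map is hypothetical and covers only the two variants named in the notes for this column.

```python
from collections import Counter

# Hypothetical map from case-folded variants to one canonical label.
CANONICAL = {"sans gluten": "no gluten", "en:vegan": "vegan"}

def parse_labels(raw):
    """Return the set of canonical labels in a cell, or None if it is blank."""
    if raw is None or not raw.strip():
        return None
    labels = set()
    for part in raw.split(","):
        key = part.strip().casefold()
        if key:
            labels.add(CANONICAL.get(key, key))
    return labels

def label_coverage(cells):
    """Return (number of blank cells, Counter of rows per canonical label)."""
    parsed = [parse_labels(cell) for cell in cells]
    missing = sum(1 for labels in parsed if labels is None)
    counts = Counter(label for labels in parsed if labels for label in labels)
    return missing, counts

missing, counts = label_coverage(["No gluten", "Sans gluten, en:vegan", "", "Vegan"])
```

Because each cell is parsed into a set, a label repeated within one cell counts once per row.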

url

categorical identifier long_tail
This column holds Open Food Facts product URLs, one per row, with the trailing path segment being the product barcode. Every one of the 258 values is unique (entropy_ratio 1.0, top_rate 0.0039), so it functions as a row identifier rather than a feature.
Treatment: drop from modelling; keep as a lookup link or join key on the embedded barcode.
high · anthropic:claude-opus-4-7
n: 258
nulls: 0 (0.0%)
unique: 258
top_value: https://world.openfoodfacts.org/product/8710605030051
top_rate: 0.003876
cardinality: 258
entropy: 8.011
entropy_ratio: 1
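
To use the barcode as a join key, the trailing path segment can be extracted as in this sketch. It relies on the URL shape shown in top_value and returns None when the last segment is not all digits, for example if a URL carried a trailing product slug. `barcode_from_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def barcode_from_url(url):
    """Return the trailing all-digit path segment of a product URL, else None."""
    path = urlparse(url).path.rstrip("/")
    segment = path.rsplit("/", 1)[-1]
    return segment if segment.isdigit() else None

barcode = barcode_from_url("https://world.openfoodfacts.org/product/8710605030051")
```

The barcode is kept as a string so that any leading zeros survive.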

source

categorical metadata imbalance
This column records the data provenance, with every one of the 258 rows tagged 'OpenFoodFacts'. Cardinality is 1 and entropy is 0, so it carries no information for modelling and simply documents that the entire dataset came from a single source.
Treatment: drop before modelling; retain only as dataset-level provenance.
high · anthropic:claude-opus-4-7
n: 258
nulls: 0 (0.0%)
unique: 1
top_value: OpenFoodFacts
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
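
The entropy and entropy_ratio figures in this report are consistent with Shannon entropy in bits divided by log2(cardinality). For url, log2(258) is about 8.011, and 8.011 / 8.011 gives 1. That definition is inferred from the numbers, not documented by the profiler. The sketch below follows it and returns a ratio of 0 for a constant column, matching the values shown for source.

```python
import math
from collections import Counter

def entropy_stats(values):
    """Return (Shannon entropy in bits, entropy / log2(number of distinct values))."""
    counts = Counter(values)
    total = len(values)
    # p * log2(1 / p) avoids a negative zero when a single value has p == 1.
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    ratio = entropy / max_entropy if max_entropy else 0.0
    return entropy, ratio

constant = entropy_stats(["OpenFoodFacts"] * 258)
all_unique = entropy_stats([f"row-{i}" for i in range(258)])
```

Treating the ratio as 0 when cardinality is 1 is a convention chosen here to avoid dividing by log2(1) = 0.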

type

categorical metadata imbalance
This column is a constant categorical tag identifying every row as 'hot_sauce_product', appearing in all 258 records with no nulls. Cardinality is 1 and entropy is 0, so it carries no discriminative information. It likely served as a type marker from an ingestion pipeline rather than a usable feature.
Treatment: drop before modelling; a single constant value provides no signal.
high · anthropic:claude-opus-4-7
n: 258
nulls: 0 (0.0%)
unique: 1
top_value: hot_sauce_product
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
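
Dropping constant columns such as source and type can be automated. The sketch below assumes rows are dictionaries with hashable values (strings, as in this dataset). The helper names and sample rows are hypothetical.

```python
def constant_columns(rows):
    """Return the sorted names of columns that hold exactly one distinct value."""
    seen = {}
    for row in rows:
        for col, value in row.items():
            seen.setdefault(col, set()).add(value)
    return sorted(col for col, values in seen.items() if len(values) == 1)

def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    drop = set(columns)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

# Made-up rows for illustration only.
sample = [
    {"name": "A", "source": "OpenFoodFacts", "type": "hot_sauce_product"},
    {"name": "B", "source": "OpenFoodFacts", "type": "hot_sauce_product"},
]
constants = constant_columns(sample)
trimmed = drop_columns(sample, constants)
```

Record the dropped values once at dataset level so that provenance is not lost.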