saturn·

quirky chocolate origins

source: /home/coolhand/html/datavis/data_trove/data/quirky/chocolate_origins.json · 2,530 rows · 10 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 2,530 chocolate bar reviews in 10 columns covering bean origins, cocoa percentages, ingredients, ratings, and review metadata. Ratings cluster tightly (median 3.25, IQR 0.5) on a 1–4 scale; cocoa percent is similarly concentrated around 70% but carries 235 flagged outliers (9.3% of rows) worth investigating. Geographic skew is notable: U.S.A. accounts for 44.9% of company locations, whereas bean origins are more diverse, led by Venezuela, Peru, and the Dominican Republic. The `company` column is entirely empty (a single blank value across all rows) and should be excluded from analysis.

citing: row_count · column_count · rating · cocoa_percent · company_location · country_of_bean_origin · ingredients · company · review_date

Schema

10 columns
Per-column summary.

| column | type | nulls | unique | alerts |
|---|---|---|---|---|
| ref | categorical | 0.0% | 630 | |
| company | categorical | 0.0% | 1 | imbalance |
| company_location | categorical | 0.0% | 67 | |
| review_date | numeric | 0.0% | 16 | |
| country_of_bean_origin | categorical | 0.0% | 62 | |
| specific_bean_origin | text | 0.0% | 1,605 | one_word, duplicates |
| cocoa_percent | numeric | 0.0% | 46 | outliers |
| ingredients | categorical | 0.0% | 22 | |
| most_memorable_characteristics | text | 0.0% | 2,487 | near_unique |
| rating | numeric | 0.0% | 12 | |

ref

categorical foreign_key
`ref` is a high-cardinality categorical with 630 distinct values across 2530 rows and no nulls, with entropy ratio 0.9954 indicating a near-uniform distribution. Values are short numeric strings (e.g. "414", "24", "1462") and the most frequent appears only 10 times (top_rate 0.0040), so this behaves like a reference/lookup id repeated a handful of times rather than a free-form feature. Treatment: treat as a foreign key and left-join to its reference table rather than one-hot encoding. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 630
top_value: 414
top_rate: 0.003953
cardinality: 630
entropy: 9.257
entropy_ratio: 0.9954
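The entropy figures above appear consistent with Shannon entropy in bits divided by log2 of the cardinality. The sketch below shows that calculation on toy values; the function name `entropy_ratio` and the sample list are illustrative assumptions, not part of the profiler.

```python
import math
from collections import Counter

def entropy_ratio(values):
    """Return (entropy in bits, entropy / log2(number of distinct values))."""
    counts = Counter(values)
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    k = len(counts)
    # A single distinct value has zero entropy; define the ratio as 0 there.
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio

# Toy data (not taken from the dataset): four ids, each repeated five times.
uniform = ["414", "24", "1462", "7"] * 5
h, r = entropy_ratio(uniform)

# Cross-check against the reported figures for `ref`:
# entropy 9.257 bits over 630 distinct values.
reported_ratio = 9.257 / math.log2(630)
```

A perfectly even distribution gives a ratio of 1, so a value near 0.995 for `ref` means the ids repeat almost equally often.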

company

categorical metadata imbalance
The column is labelled 'company' but contains a single value—an empty string—across all 2530 rows. Cardinality is 1, entropy is 0, and top_rate is 1.0, so it carries no information. This is effectively a placeholder field that was never populated. Treatment: Drop; constant empty value provides no signal. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 1
top_value: "" (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
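A minimal sketch of the drop step, assuming the JSON has been loaded as a list of row dictionaries. The helper names and the three sample rows are hypothetical.

```python
def constant_columns(rows):
    """Return the column names whose value is identical in every row."""
    if not rows:
        return []
    return [col for col in rows[0]
            if len({row.get(col) for row in rows}) == 1]

def drop_columns(rows, cols):
    """Return copies of the rows without the named columns."""
    drop = set(cols)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

# Illustrative records only; the real file has 2,530 rows and 10 columns.
rows = [
    {"ref": "414", "company": "", "rating": 3.25},
    {"ref": "24", "company": "", "rating": 3.5},
    {"ref": "1462", "company": "", "rating": 2.75},
]
to_drop = constant_columns(rows)
cleaned = drop_columns(rows, to_drop)
```

Detecting constants by value rather than by name also catches any other column that turns out to be a placeholder.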

company_location

categorical feature
Country of the chocolate maker, with 67 distinct locations and no nulls across 2530 rows. Heavily US-centric: 'U.S.A.' accounts for 44.9% (1136 rows), followed by Canada (177), France (176), and the U.K. (133), giving an entropy ratio of 0.61. Country names use abbreviated forms ('U.S.A.', 'U.K.') so any joins on canonical country lists will need normalisation. Treatment: Normalise country labels and group long-tail countries before one-hot or target encoding. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 67
top_value: U.S.A.
top_rate: 0.449
cardinality: 67
entropy: 3.675
entropy_ratio: 0.6058
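The label normalisation could look roughly like the sketch below. The alias table, the canonical set, and the label "Wales" are made up for illustration; a real join would use whatever country list the downstream table keys on.

```python
# Hypothetical alias table and canonical list; neither comes from the profile.
ALIASES = {"U.S.A.": "United States", "U.K.": "United Kingdom"}
CANONICAL = {"United States", "United Kingdom", "Canada", "France"}

def normalise_country(label):
    """Trim whitespace and map abbreviated forms to canonical names."""
    cleaned = label.strip()
    return ALIASES.get(cleaned, cleaned)

def unmatched(labels, canonical=CANONICAL):
    """Labels that would still fail a join after normalisation."""
    return sorted({normalise_country(label) for label in labels} - canonical)

# Toy input; "Wales" stands in for any label missing from the canonical list.
raw = ["U.S.A.", "Canada", "France", "U.K.", " U.S.A. ", "Wales"]
normalised = [normalise_country(label) for label in raw]
still_unmatched = unmatched(raw)
```

Listing the unmatched labels before joining makes it easier to extend the alias table instead of silently losing rows.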

review_date

numeric timestamp
This column stores the year a review was recorded, ranging from 2006 to 2021 with only 16 unique values across 2530 rows. The distribution is centered around 2015 (mean 2014.37, median 2015) with a modest spread (std 3.97) and is roughly symmetric (skew -0.18). No nulls or outliers are present. Treatment: Treat as a year-level temporal feature; bin or convert to datetime for trend analysis. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 16
min: 2006
max: 2021
mean: 2014
median: 2015
std: 3.968
q1: 2012
q3: 2018
iqr: 6
skew: -0.1833
kurtosis: -0.7727
n_outliers: 0
outlier_rate: 0
zero_rate: 0
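One way to apply the treatment is sketched below: convert each year to a date at 1 January and assign it to a fixed-width bin. The four-year width and the 1 January convention are assumptions chosen because 2006 to 2021 splits evenly into four bins.

```python
from datetime import date

def year_to_date(year):
    """Represent a review year as 1 January of that year."""
    return date(int(year), 1, 1)

def year_bin(year, start=2006, width=4):
    """Label the fixed-width bin a year falls into, e.g. '2014-2017'."""
    lo = start + ((int(year) - start) // width) * width
    return f"{lo}-{lo + width - 1}"

# Sample years spanning the reported range (2006 to 2021).
years = [2006, 2012, 2015, 2018, 2021]
dates = [year_to_date(y) for y in years]
bins = [year_bin(y) for y in years]
```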

country_of_bean_origin

categorical feature
Categorical country label identifying where the cocoa beans originated, with 62 distinct values across 2530 complete rows and no nulls. The distribution is broad rather than concentrated: the top value Venezuela accounts for only 10% of rows, and entropy ratio 0.79 confirms fairly even spread across many origins. Notable wrinkle: 'Blend' appears as the 6th most common value (156 rows), meaning some entries aren't a single country and will need special handling. Treatment: Group rare origins into 'Other', isolate 'Blend' as its own category, then one-hot or target-encode. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 62
top_value: Venezuela
top_rate: 0.1
cardinality: 62
entropy: 4.717
entropy_ratio: 0.7921
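The grouping described in the treatment can be sketched as follows. The `min_count` threshold and the toy counts are assumptions; 'Blend' is kept as its own category regardless of frequency.

```python
from collections import Counter

def bucket_origins(origins, min_count, keep=("Blend",)):
    """Replace rare origins with 'Other', always preserving labels in `keep`."""
    counts = Counter(origins)
    return [o if o in keep or counts[o] >= min_count else "Other"
            for o in origins]

def one_hot(labels):
    """Return (sorted categories, one indicator row per label)."""
    categories = sorted(set(labels))
    return categories, [[1 if lab == c else 0 for c in categories]
                        for lab in labels]

# Toy counts for illustration; the real column has 62 distinct values.
origins = ["Venezuela", "Venezuela", "Peru", "Peru",
           "Blend", "Dominican Republic", "Madagascar"]
bucketed = bucket_origins(origins, min_count=2)
categories, matrix = one_hot(bucketed)
```

Keeping 'Blend' out of 'Other' lets a model distinguish multi-origin bars from single-origin bars that are merely rare.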

specific_bean_origin

text feature one_word duplicates
This column captures the specific bean origin (region, estate, or country), with 1,605 unique values across 2,530 rows. Top values are dominated by countries such as Madagascar (55), Ecuador (43), and Peru (41), but the high frequency of the word 'batch' (356 occurrences) suggests many entries mix origin names with batch identifiers, inflating uniqueness. About 34% of values are single words and about 37% are duplicates, indicating inconsistent granularity: some entries are broad countries, others are specific estates or batch-tagged labels. Treatment: Normalize by stripping batch suffixes and standardizing to country/region before encoding. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 1,605
len_min: 3
len_max: 51
len_mean: 17.12
len_median: 14
len_p95: 39
word_mean: 2.681
word_median: 2
n_empty: 0
n_duplicates: 925
duplicate_rate: 0.3656
vocab_size: 2,079
readability_flesch_mean: 28.41
emoji_rate: 0
url_rate: 0
one_word_rate: 0.3375
allcaps_rate: 0.001581
boilerplate_rate: 0
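A rough sketch of the batch-stripping step follows. The profile does not show how batch tags are written, so the regular expression assumes a trailing "batch <id>" fragment, optionally preceded by a comma or dash. The sample values are invented to match that assumption.

```python
import re

# Assumed pattern: optional separator, the word "batch", optional id, end of string.
BATCH_SUFFIX = re.compile(r"[\s,;-]*\bbatch\b[\s#:.]*[\w-]*\s*$", re.IGNORECASE)

def strip_batch(value):
    """Remove a trailing batch tag, leaving the origin name."""
    return BATCH_SUFFIX.sub("", value).strip()

# Invented examples; real entries may use other formats.
values = [
    "Madagascar",
    "Madagascar, batch 1",
    "Madagascar, batch 2",
    "Peru",
    "Ecuador, Batch #A2",
]
cleaned = [strip_batch(v) for v in values]
unique_before = len(set(values))
unique_after = len(set(cleaned))
```

Comparing the distinct counts before and after is a quick check on how much of the 1,605-value cardinality comes from batch tags rather than genuinely different origins.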

cocoa_percent

numeric feature outliers
This is the cocoa percentage of each chocolate bar, ranging from 42 to 100 with a tight median of 70 and an IQR of just 4. The distribution is right-skewed (skew 1.20, kurtosis 6.54) and 9.3% of rows (235) are flagged as outliers. If the profiler uses the common 1.5×IQR rule, the fences fall at 64 and 80, so the flagged rows probably include low-cocoa bars (the minimum is 42) as well as the high-cocoa tail toward 100%. With only 46 unique values across 2,530 rows, the field is effectively semi-discrete. Treatment: Use as-is or bin into cocoa-strength buckets; no transform needed given the narrow IQR. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 46
min: 42
max: 100
mean: 71.64
median: 70
std: 5.617
q1: 70
q3: 74
iqr: 4
skew: 1.198
kurtosis: 6.541
n_outliers: 235
outlier_rate: 0.09289
zero_rate: 0
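To make the outlier flag and the bucketing concrete, the sketch below derives Tukey fences from the reported quartiles and assigns strength buckets. The 1.5×IQR rule is an assumption about how the profiler flags outliers, and the bucket cut points are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Reported quartiles for cocoa_percent: q1 = 70, q3 = 74.
lower, upper = tukey_fences(70, 74)

def is_outlier(pct):
    """True when a value falls outside the fences on either side."""
    return pct < lower or pct > upper

def strength_bucket(pct):
    """Hypothetical cut points; adjust to the analysis at hand."""
    if pct < 65:
        return "low"
    if pct <= 75:
        return "standard"
    if pct < 90:
        return "high"
    return "extreme"

# The sample covers the reported min (42), median (70), and max (100).
sample = [42, 64, 70, 74, 80, 85, 100]
flags = [is_outlier(p) for p in sample]
buckets = [strength_bucket(p) for p in sample]
```

Under this rule a bar at exactly 64% or 80% is not flagged, while values at both extremes of the range are.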

ingredients

categorical feature
This is a coded composition field: each value gives an ingredient count followed by single-letter ingredient tokens (e.g. '3- B,S,C'). In a chocolate dataset the letters most plausibly stand for beans, sugar, and cocoa butter, though the profile itself does not include a key. With only 22 distinct combinations across 2,530 rows and a top value ('3- B,S,C') covering 39.5% of records, the field is highly concentrated (entropy_ratio is just 0.545). Notably, 87 rows (about 3.4%) carry an empty string rather than null, so null_rate=0.0 understates true missingness. Treatment: Treat empty strings as missing, then encode the ingredient tokens as separate indicator columns (strip the count prefix and split on commas) rather than one-hot encoding the raw combined string. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 22
top_value: 3- B,S,C
top_rate: 0.3949
cardinality: 22
entropy: 2.43
entropy_ratio: 0.545
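The parsing step could be sketched like this, assuming every non-empty value follows the "<count>- <tokens>" shape of the top value. Only '3- B,S,C' comes from the profile; the other sample strings, including the token 'L', are illustrative.

```python
def parse_ingredients(value):
    """Return (declared_count, tokens), or None when the value is empty."""
    value = (value or "").strip()
    if not value:
        return None  # empty string treated as missing
    count_part, _, token_part = value.partition("-")
    count_part = count_part.strip()
    count = int(count_part) if count_part.isdigit() else None
    tokens = [t.strip() for t in token_part.split(",") if t.strip()]
    return count, tokens

def multi_hot(parsed_rows):
    """Return (sorted token vocabulary, one indicator row per input; None stays None)."""
    vocab = sorted({t for p in parsed_rows if p for t in p[1]})
    rows = [None if p is None else [1 if v in p[1] else 0 for v in vocab]
            for p in parsed_rows]
    return vocab, rows

raw = ["3- B,S,C", "2- B,S", "", "4- B,S,C,L"]
parsed = [parse_ingredients(v) for v in raw]
vocab, encoded = multi_hot(parsed)
```

Keeping the declared count next to the tokens allows a consistency check, since the count should equal the number of tokens.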

most_memorable_characteristics

text free_text near_unique
Short free-text tasting notes (mean 23 characters, about 3 words) describing flavor and texture, with top tokens such as 'cocoa', 'sweet', 'nutty', 'creamy', and 'fruit'. Values are near-unique (2,487 distinct of 2,530) yet built from a small vocabulary of 868 words, indicating comma-separated descriptor combinations rather than prose. There are only 43 exact duplicates and no empties or URLs; the readability mean of 49.7 is not very meaningful at this length. Treatment: Split on commas into descriptor tags and one-hot or embed for modelling. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 2,487
len_min: 3
len_max: 37
len_mean: 23.06
len_median: 23
len_p95: 30
word_mean: 3.376
word_median: 3
n_empty: 0
n_duplicates: 43
duplicate_rate: 0.017
vocab_size: 868
readability_flesch_mean: 49.71
emoji_rate: 0
url_rate: 0
one_word_rate: 0.007115
allcaps_rate: 0
boilerplate_rate: 0
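A minimal sketch of the tag split, assuming descriptors are comma-separated as the profile suggests. The notes are invented from the top tokens listed above, and the `min_count` filter is an added assumption to keep the indicator matrix small.

```python
from collections import Counter

def to_tags(note):
    """Split a tasting note into lower-cased, trimmed descriptor tags."""
    return [t.strip().lower() for t in note.split(",") if t.strip()]

# Invented notes built from the reported top tokens.
notes = ["cocoa, sweet, nutty", "creamy, fruit", "sweet, creamy, cocoa"]
tagged = [to_tags(n) for n in notes]

# Count tag frequency and keep only tags seen at least `min_count` times.
tag_counts = Counter(t for tags in tagged for t in tags)
min_count = 2
vocab = sorted(t for t, c in tag_counts.items() if c >= min_count)

# One indicator row per note over the retained vocabulary.
matrix = [[1 if v in tags else 0 for v in vocab] for tags in tagged]
```

With a vocabulary of 868 words, a frequency cut-off like this is one way to avoid hundreds of near-empty indicator columns.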

rating

numeric feature
A bounded numeric rating on a 1.0–4.0 scale with only 12 distinct values, suggesting half- or quarter-step increments rather than continuous scores. The distribution is tight (IQR 0.5, std 0.45) and slightly left-skewed (-0.61), centered near 3.25 with a mean of 3.20, and 50 low-end outliers (1.98%) pull the tail. No nulls or zeros, so every row carries a usable score. Treatment: Use as-is as an ordinal/numeric feature; consider treating the 50 low-end outliers separately if modelling tails. high · anthropic:claude-opus-4-7
n: 2,530
nulls: 0 (0.0%)
unique: 12
min: 1
max: 4
mean: 3.196
median: 3.25
std: 0.4453
q1: 3
q3: 3.5
iqr: 0.5
skew: -0.6084
kurtosis: 1.053
n_outliers: 50
outlier_rate: 0.01976
zero_rate: 0
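To check the step-size reading and obtain an ordinal encoding, something like the sketch below could be used. The quarter-point grid is an assumption suggested by the 12 distinct values, and the sample ratings are illustrative.

```python
def on_grid(value, step=0.25):
    """True when the value is a whole multiple of the step."""
    return float(value / step).is_integer()

def ordinal_code(value, lowest=1.0, step=0.25):
    """Number of steps above the lowest possible rating."""
    return int(round((value - lowest) / step))

# Illustrative ratings within the reported 1.0 to 4.0 range.
ratings = [1.0, 2.75, 3.0, 3.25, 3.5, 4.0]
all_on_grid = all(on_grid(r) for r in ratings)
codes = [ordinal_code(r) for r in ratings]
```

If any real rating fails the grid check, the column is better treated as continuous than as ordinal.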