quirky chocolate origins

source /home/coolhand/html/datavis/data_trove/data/quirky/chocolate_origins.json · 2,530 rows · 10 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset catalogs 2,530 chocolate bar reviews across 10 columns covering bean origins, cocoa percentages, ingredients, ratings, and review metadata. Ratings cluster tightly (median 3.25, IQR 0.5) on a 1–4 scale. Cocoa percent is similarly concentrated around 70% (IQR 4) but carries 235 flagged outliers (9.3% of rows) worth investigating. Geographic skew is notable: U.S.A. accounts for 44.9% of company locations, whereas bean origins are more diverse, led by Venezuela, Peru, and the Dominican Republic. The `company` column is empty in every row (a single blank value), so it should be excluded from analysis.

citing: row_count · column_count · rating · cocoa_percent · company_location · country_of_bean_origin · ingredients · company · review_date

Charts the summary said to look at first

rating · Check how tightly ratings cluster around 3.25 and whether the left tail of low scores is meaningful.

Histogram bins for rating (median: 3.25).
bin	count
1 – 1.075	4
1.075 – 1.15	0
1.15 – 1.225	0
1.225 – 1.3	0
1.3 – 1.375	0
1.375 – 1.45	0
1.45 – 1.525	10
1.525 – 1.6	0
1.6 – 1.675	0
1.675 – 1.75	0
1.75 – 1.825	3
1.825 – 1.9	0
1.9 – 1.975	0
1.975 – 2.05	33
2.05 – 2.125	0
2.125 – 2.2	0
2.2 – 2.275	17
2.275 – 2.35	0
2.35 – 2.425	0
2.425 – 2.5	0
2.5 – 2.575	166
2.575 – 2.65	0
2.65 – 2.725	0
2.725 – 2.8	333
2.8 – 2.875	0
2.875 – 2.95	0
2.95 – 3.025	523
3.025 – 3.1	0
3.1 – 3.175	0
3.175 – 3.25	0
3.25 – 3.325	464
3.325 – 3.4	0
3.4 – 3.475	0
3.475 – 3.55	565
3.55 – 3.625	0
3.625 – 3.7	0
3.7 – 3.775	300
3.775 – 3.85	0
3.85 – 3.925	0
3.925 – 4	112

cocoa_percent · Look for the 70% spike and the 9% of bars flagged as outliers above or below the typical range.

Histogram bins for cocoa_percent (median: 70.0).
bin	count
42 – 43.45	1
43.45 – 44.9	0
44.9 – 46.35	1
46.35 – 47.8	0
47.8 – 49.25	0
49.25 – 50.7	1
50.7 – 52.15	0
52.15 – 53.6	1
53.6 – 55.05	16
55.05 – 56.5	2
56.5 – 57.95	1
57.95 – 59.4	8
59.4 – 60.85	47
60.85 – 62.3	23
62.3 – 63.75	14
63.75 – 65.2	124
65.2 – 66.65	28
66.65 – 68.1	106
68.1 – 69.55	13
69.55 – 71	1046
71 – 72.45	340
72.45 – 73.9	72
73.9 – 75.35	377
75.35 – 76.8	35
76.8 – 78.25	63
78.25 – 79.7	2
79.7 – 81.15	95
81.15 – 82.6	18
82.6 – 84.05	9
84.05 – 85.5	40
85.5 – 86.95	1
86.95 – 88.4	9
88.4 – 89.85	2
89.85 – 91.3	12
91.3 – 92.75	0
92.75 – 94.2	0
94.2 – 95.65	0
95.65 – 97.1	0
97.1 – 98.55	0
98.55 – 100	23

company_location · Note how heavily U.S.A. dominates at ~45% of all reviews, dwarfing Canada and France.

Top values for company_location (20 unique shown, of 67 total).
value	count	share
U.S.A.	1136	44.9%
Canada	177	7.0%
France	176	7.0%
U.K.	133	5.3%
Italy	78	3.1%
Belgium	63	2.5%
Ecuador	58	2.3%
Australia	53	2.1%
Switzerland	44	1.7%
Germany	42	1.7%
Spain	36	1.4%
Venezuela	31	1.2%
Japan	31	1.2%
Denmark	31	1.2%
Austria	30	1.2%
Colombia	29	1.1%
New Zealand	27	1.1%
Hungary	26	1.0%
Brazil	25	1.0%
Peru	23	0.9%

country_of_bean_origin · See the more balanced spread across Venezuela, Peru, Dominican Republic, and Ecuador as top bean sources.

Top values for country_of_bean_origin (20 unique shown, of 62 total).
value	count	share
Venezuela	253	10.0%
Peru	244	9.6%
Dominican Republic	226	8.9%
Ecuador	219	8.7%
Madagascar	177	7.0%
Blend	156	6.2%
Nicaragua	100	4.0%
Bolivia	80	3.2%
Tanzania	79	3.1%
Colombia	79	3.1%
Brazil	78	3.1%
Belize	76	3.0%
Vietnam	73	2.9%
Guatemala	62	2.5%
Mexico	55	2.2%
Papua New Guinea	50	2.0%
Costa Rica	43	1.7%
Trinidad	42	1.7%
Ghana	41	1.6%
India	35	1.4%

ingredients · Observe that two recipes (3-ingredient B,S,C and 2-ingredient B,S) account for the majority of bars.

Top values for ingredients (20 unique shown, of 22 total).
value	count	share
3- B,S,C	999	39.5%
2- B,S	718	28.4%
4- B,S,C,L	286	11.3%
5- B,S,C,V,L	184	7.3%
4- B,S,C,V	141	5.6%
(empty string)	87	3.4%
2- B,S*	31	1.2%
4- B,S*,C,Sa	20	0.8%
3- B,S*,C	12	0.5%
3- B,S,L	8	0.3%
4- B,S*,C,V	7	0.3%
5-B,S,C,V,Sa	6	0.2%
1- B	6	0.2%
4- B,S,V,L	5	0.2%
4- B,S,C,Sa	5	0.2%
6-B,S,C,V,L,Sa	4	0.2%
3- B,S,V	3	0.1%
4- B,S*,V,L	3	0.1%
4- B,S*,C,L	2	0.1%
3- B,S*,Sa	1	0.0%

Schema

10 columns

Per-column summary.
column	type	nulls	unique	alerts
ref	categorical	0.0%	630
company	categorical	0.0%	1	imbalance
company_location	categorical	0.0%	67
review_date	numeric	0.0%	16
country_of_bean_origin	categorical	0.0%	62
specific_bean_origin	text	0.0%	1,605	one_word duplicates
cocoa_percent	numeric	0.0%	46	outliers
ingredients	categorical	0.0%	22
most_memorable_characteristics	text	0.0%	2,487	near_unique
rating	numeric	0.0%	12

ref

categorical foreign_key

`ref` is a high-cardinality categorical with 630 distinct values across 2530 rows and no nulls, with entropy ratio 0.9954 indicating a near-uniform distribution. Values are short numeric strings (e.g. "414", "24", "1462") and the most frequent appears only 10 times (top_rate 0.0040), so this behaves like a reference/lookup id repeated a handful of times rather than a free-form feature. Treatment: treat as a foreign key and left-join to its reference table rather than one-hot encoding. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 630
top_value: 414
top_rate: 0.003953
cardinality: 630
entropy: 9.257
entropy_ratio: 0.9954
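
The reported `entropy_ratio` is consistent with Shannon entropy in bits divided by log2 of the cardinality (9.257 / log2(630) ≈ 0.995). The profiler does not state its formula, so the sketch below is an inference from the figures, and the function name is illustrative.

```python
import math
from collections import Counter

def entropy_ratio(values):
    """Shannon entropy in bits divided by log2(number of distinct values)."""
    counts = Counter(values)
    n = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # a constant column carries no information
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))

# Cross-check against the figures reported for `ref`:
# entropy 9.257 bits over 630 distinct values.
reported_ratio = 9.257 / math.log2(630)
```

A ratio near 1 means the values are spread almost evenly, which is what one would expect from an id-like column.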

company

categorical metadata imbalance

The column is labelled 'company' but contains a single value—an empty string—across all 2530 rows. Cardinality is 1, entropy is 0, and top_rate is 1.0, so it carries no information. This is effectively a placeholder field that was never populated. Treatment: Drop; constant empty value provides no signal. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 1
top_value: "" (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
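
A minimal sketch of the drop step, assuming the records are loaded as a list of dicts that all share the same keys. The helper names and the two-row sample are hypothetical.

```python
def constant_columns(rows):
    """Return names of columns holding one distinct value across all rows.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return []
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]

def drop_columns(rows, to_drop):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]

# Hypothetical sample mirroring the empty `company` field.
sample = [
    {"company": "", "company_location": "U.S.A.", "rating": 3.25},
    {"company": "", "company_location": "France", "rating": 3.5},
]
dropped = constant_columns(sample)
cleaned = drop_columns(sample, dropped)
```

Detecting constant columns generically, rather than hard-coding `company`, also catches any other field that turns out to be unpopulated.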

company_location

categorical feature

Country of the chocolate maker, with 67 distinct locations and no nulls across 2530 rows. Heavily US-centric: 'U.S.A.' accounts for 44.9% (1136 rows), followed by Canada (177), France (176), and the U.K. (133), giving an entropy ratio of 0.61. Country names use abbreviated forms ('U.S.A.', 'U.K.') so any joins on canonical country lists will need normalisation. Treatment: Normalise country labels and group long-tail countries before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 67
top_value: U.S.A.
top_rate: 0.449
cardinality: 67
entropy: 3.675
entropy_ratio: 0.6058
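
The normalise-then-group treatment could look like the sketch below. The canonical names in `CANONICAL`, the `min_count` threshold, and the sample list are all assumptions for illustration; the report does not prescribe them.

```python
from collections import Counter

# Hypothetical mapping; extend it to match whatever canonical list a join targets.
CANONICAL = {"U.S.A.": "United States", "U.K.": "United Kingdom"}

def normalise_country(label):
    """Map abbreviated labels to canonical names, leaving others unchanged."""
    label = label.strip()
    return CANONICAL.get(label, label)

def group_long_tail(labels, min_count, other="Other"):
    """Replace labels seen fewer than `min_count` times with `other`."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]

# Hypothetical sample.
locations = ["U.S.A."] * 5 + ["U.K."] * 3 + ["Peru"]
normalised = [normalise_country(x) for x in locations]
grouped = group_long_tail(normalised, min_count=2)
```

Normalising before grouping matters: otherwise spelling variants of one country could each fall below the threshold and be lumped into the long tail.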

review_date

numeric timestamp

This column stores the year a review was recorded, ranging from 2006 to 2021 with only 16 unique values across 2530 rows. The distribution is centered around 2015 (mean 2014.37, median 2015) with a modest spread (std 3.97) and is roughly symmetric (skew -0.18). No nulls or outliers are present. Treatment: Treat as a year-level temporal feature; bin or convert to datetime for trend analysis. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 16
min: 2006
max: 2021
mean: 2014
median: 2015
std: 3.968
q1: 2012
q3: 2018
iqr: 6
skew: -0.1833
kurtosis: -0.7727
n_outliers: 0
outlier_rate: 0
zero_rate: 0
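
Both options in the treatment can be sketched with the standard library. The four-year bin width and the choice of 1 January as the representative date are assumptions, not something the profile specifies.

```python
from datetime import date

def year_to_date(year):
    """Represent a review year as 1 January of that year."""
    return date(int(year), 1, 1)

def year_bin(year, start=2006, width=4):
    """Label the fixed-width bin containing `year`, e.g. '2006-2009'."""
    lo = start + ((int(year) - start) // width) * width
    return f"{lo}-{lo + width - 1}"

bins = [year_bin(y) for y in (2006, 2015, 2021)]
```

With the observed range 2006 to 2021, a width of 4 yields four equal bins; a different width would need a check that the last bin is not nearly empty.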

country_of_bean_origin

categorical feature

Categorical country label identifying where the cocoa beans originated, with 62 distinct values across 2530 complete rows and no nulls. The distribution is broad rather than concentrated: the top value Venezuela accounts for only 10% of rows, and entropy ratio 0.79 confirms fairly even spread across many origins. Notable wrinkle: 'Blend' appears as the 6th most common value (156 rows), meaning some entries aren't a single country and will need special handling. Treatment: Group rare origins into 'Other', isolate 'Blend' as its own category, then one-hot or target-encode. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 62
top_value: Venezuela
top_rate: 0.1
cardinality: 62
entropy: 4.717
entropy_ratio: 0.7921
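
One way to apply this treatment is to bucket rare origins while always keeping 'Blend' separate, then target-encode against `rating` with shrinkage toward the global mean. This is a sketch under assumptions: the smoothing `weight`, the `min_count` threshold, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def bucket_origin(label, counts, min_count):
    """Keep 'Blend' as its own category; fold rare origins into 'Other'."""
    if label == "Blend":
        return "Blend"
    return label if counts[label] >= min_count else "Other"

def smoothed_target_encoding(categories, targets, weight=10.0):
    """Map each category to its mean target, shrunk toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + weight * global_mean) / (counts[cat] + weight)
        for cat in counts
    }

# Hypothetical sample rows.
origins = ["Venezuela", "Venezuela", "Blend", "Peru"]
ratings = [3.5, 3.0, 2.5, 3.0]
counts = Counter(origins)
bucketed = [bucket_origin(o, counts, min_count=2) for o in origins]
encoding = smoothed_target_encoding(bucketed, ratings, weight=2.0)
```

In real use the encoding should be fitted on training folds only, since it is computed from the target.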

specific_bean_origin

text feature one_word duplicates

This column captures the specific bean origin (region, estate, or country), with 1,605 unique values across 2,530 rows. Top values are dominated by countries such as Madagascar (55), Ecuador (43), and Peru (41), but the high frequency of the word 'batch' (356 occurrences) suggests many entries mix origin names with batch identifiers, inflating uniqueness. Roughly 34% of values are single words and 37% are duplicates, indicating inconsistent granularity: some entries are broad countries, others are specific estates or batch-tagged labels. Treatment: Normalize by stripping batch suffixes and standardizing to country/region before encoding. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 1,605
len_min: 3
len_max: 51
len_mean: 17.12
len_median: 14
len_p95: 39
word_mean: 2.681
word_median: 2
n_empty: 0
n_duplicates: 925
duplicate_rate: 0.3656
vocab_size: 2,079
readability_flesch_mean: 28.41
emoji_rate: 0
url_rate: 0
one_word_rate: 0.3375
allcaps_rate: 0.001581
boilerplate_rate: 0
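
A sketch of the batch-stripping step follows. The profile only reports that the word 'batch' is frequent, not how the tags are formatted, so the regular expression and the example strings are assumptions to be checked against the actual values.

```python
import re

# Assumed format: an optional separator, the word "batch", then anything to end of string.
BATCH_SUFFIX = re.compile(r"[,;]?\s*\bbatch\b.*$", re.IGNORECASE)

def strip_batch(origin):
    """Remove a trailing batch tag; keep the original if nothing else remains."""
    cleaned = BATCH_SUFFIX.sub("", origin).strip()
    return cleaned or origin.strip()

# Hypothetical examples.
examples = ["Madagascar, batch 1", "Sambirano Valley, Batch 22", "Ecuador"]
stripped = [strip_batch(x) for x in examples]
```

Re-counting unique values after this step gives a quick measure of how much of the 1,605 cardinality came from batch tags alone.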

cocoa_percent

numeric feature outliers

This is the cocoa percentage of each chocolate bar, ranging from 42 to 100 with a median of 70 and an IQR of just 4 (q1 70, q3 74). The distribution is right-skewed and heavy-tailed (skew 1.20, kurtosis 6.54), and 9.3% of rows (235) are flagged as outliers. If the flag follows the common 1.5×IQR rule, the fences sit at 64 and 80, and the histogram shows bars beyond both: a cluster below 64% as well as the high-cocoa tail running up to 100%. With only 46 unique values across 2530 rows, the field is effectively semi-discrete. Treatment: Use as-is or bin into cocoa-strength buckets; no transform needed given the narrow IQR. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 46
min: 42
max: 100
mean: 71.64
median: 70
std: 5.617
q1: 70
q3: 74
iqr: 4
skew: 1.198
kurtosis: 6.541
n_outliers: 235
outlier_rate: 0.09289
zero_rate: 0
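
If bucketing is preferred, a sketch like the following would do. The cut points in `EDGES` are hypothetical and should be chosen to suit the analysis.

```python
from bisect import bisect_right

# Hypothetical cut points; each bucket includes its lower edge.
EDGES = [60, 70, 80, 90]
LABELS = ["<60", "60 to <70", "70 to <80", "80 to <90", "90+"]

def cocoa_bucket(pct):
    """Return the label of the bucket containing `pct`."""
    return LABELS[bisect_right(EDGES, pct)]

buckets = [cocoa_bucket(p) for p in (42, 69.5, 70, 85, 100)]
```

Because 70% is both the median and q1, placing an edge exactly at 70 puts the dominant spike at the bottom of its bucket, so edge inclusivity is worth fixing deliberately.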

ingredients

categorical feature

This is a coded recipe field where each value gives an ingredient count followed by abbreviated ingredient tokens (e.g. '3- B,S,C'). The tokens are consistent with the usual convention in chocolate bar review data: B = beans, S = sugar, S* = sweetener other than sugar, C = cocoa butter, V = vanilla, L = lecithin, Sa = salt. With only 22 distinct combinations across 2530 rows and a top value ('3- B,S,C') covering 39.5% of records, the field is highly concentrated; entropy_ratio is just 0.545. Notably, 87 rows carry an empty string rather than null, so null_rate=0.0 understates true missingness. Treatment: Treat empty strings as missing, strip the leading count, then one-hot encode the individual ingredient tokens (split on comma) rather than the raw combined string. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 22
top_value: 3- B,S,C
top_rate: 0.3949
cardinality: 22
entropy: 2.43
entropy_ratio: 0.545
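
The parsing this treatment implies can be sketched as below. The token list is taken from the top-values table; the function names are illustrative. Note that some values omit the space after the hyphen ('5-B,S,C,V,Sa'), so the sketch splits on the hyphen rather than on '- '.

```python
# Tokens observed in the top-values table for `ingredients`.
TOKENS = ["B", "S", "S*", "C", "V", "L", "Sa"]

def parse_ingredients(raw):
    """Split '3- B,S,C' into (3, ['B', 'S', 'C']); return None for empty strings."""
    raw = raw.strip()
    if not raw:
        return None  # empty string treated as missing
    count_part, _, token_part = raw.partition("-")
    tokens = [t.strip() for t in token_part.split(",") if t.strip()]
    return int(count_part), tokens

def multi_hot(tokens):
    """Encode a token list as 0/1 flags in the order given by TOKENS."""
    return [1 if t in tokens else 0 for t in TOKENS]

parsed = parse_ingredients("4- B,S*,C,Sa")
flags = multi_hot(parsed[1])
```

Keeping `S` and `S*` as separate flags preserves the distinction between the two sweetener tokens, which a plain prefix match would lose.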

most_memorable_characteristics

text free_text near_unique

Short free-text tasting notes (mean 23 characters, ~3 words) describing flavor and texture characteristics, almost certainly from a chocolate or cocoa review dataset given top tokens like 'cocoa', 'sweet', 'nutty', 'creamy', and 'fruit'. Values are near-unique (2487 distinct of 2530) yet built from a small vocabulary of 868 words, indicating these are comma-separated descriptor combinations rather than prose. Only 43 exact duplicates and no empties or URLs; readability mean of 49.7 is not very meaningful at this length. Treatment: Split on commas into descriptor tags and one-hot or embed for modelling. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 2,487
len_min: 3
len_max: 37
len_mean: 23.06
len_median: 23
len_p95: 30
word_mean: 3.376
word_median: 3
n_empty: 0
n_duplicates: 43
duplicate_rate: 0.017
vocab_size: 868
readability_flesch_mean: 49.71
emoji_rate: 0
url_rate: 0
one_word_rate: 0.007115
allcaps_rate: 0
boilerplate_rate: 0
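
A minimal sketch of the tag-splitting treatment, using a hypothetical three-note sample built from the top tokens mentioned above. Lower-casing the tags is an added assumption.

```python
from collections import Counter

def split_tags(note):
    """Split a comma-separated tasting note into cleaned, lower-cased tags."""
    return [t.strip().lower() for t in note.split(",") if t.strip()]

# Hypothetical sample notes.
notes = ["cocoa, sweet, nutty", "creamy, fruit", "sweet, creamy"]
tag_lists = [split_tags(n) for n in notes]

# Sorted vocabulary and one 0/1 row per note.
vocab = sorted({t for tags in tag_lists for t in tags})
matrix = [[1 if t in tags else 0 for t in vocab] for tags in tag_lists]
tag_counts = Counter(t for tags in tag_lists for t in tags)
```

With a reported vocabulary of 868 words, the full indicator matrix would be wide and sparse, so dropping tags below a frequency threshold (via `tag_counts`) may be worthwhile before modelling.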

rating

numeric feature

A bounded numeric rating on a 1.0–4.0 scale with only 12 distinct values, suggesting half- or quarter-step increments rather than continuous scores. The distribution is tight (IQR 0.5, std 0.45) and slightly left-skewed (-0.61), centered near 3.25 with a mean of 3.20, and 50 low-end outliers (1.98%) pull the tail. No nulls or zeros, so every row carries a usable score. Treatment: Use as-is as an ordinal/numeric feature; consider treating the 50 low-end outliers separately if modelling tails. high · anthropic:claude-opus-4-7

n: 2,530
nulls: 0 (0.0%)
unique: 12
min: 1
max: 4
mean: 3.196
median: 3.25
std: 0.4453
q1: 3
q3: 3.5
iqr: 0.5
skew: -0.6084
kurtosis: 1.053
n_outliers: 50
outlier_rate: 0.01976
zero_rate: 0
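
The outlier count can be reproduced from the histogram with a 1.5×IQR rule. The profiler does not state which rule it uses, and the sketch assumes each non-empty histogram bin holds a single quarter- or half-step rating value, so treat this as a consistency check rather than the profiler's method.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k times the IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Counts read off the rating histogram, one assumed value per non-empty bin.
RATING_COUNTS = {
    1.0: 4, 1.5: 10, 1.75: 3, 2.0: 33, 2.25: 17, 2.5: 166,
    2.75: 333, 3.0: 523, 3.25: 464, 3.5: 565, 3.75: 300, 4.0: 112,
}

low, high = iqr_fences(3.0, 3.5)  # q1 and q3 as reported
n_outliers = sum(c for v, c in RATING_COUNTS.items() if v < low or v > high)
total = sum(RATING_COUNTS.values())
```

Under these assumptions the fences are 2.25 and 4.25, so every flagged row is a rating of 2.0 or lower, and the count matches the reported `n_outliers`.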