saturn·

quirky chocolate origins

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/chocolate_origins.json

Saturn profiled 2,530 rows across 10 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
!pip install saturn-dissect
import subprocess
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/chocolate_origins.json",
    "--findings", "quirky-chocolate_origins.json",
    "--llm", "anthropic:claude-opus-4-7",
])

Summary confidence: high

This dataset catalogs 2,530 chocolate bar reviews with 10 columns covering bean origins, cocoa percentages, ingredients, ratings, and review metadata. Ratings cluster tightly (median 3.25, IQR 0.5) on a 1–4 scale, and cocoa percent is similarly concentrated around 70% (IQR 4) but carries 235 outliers (9.3% of rows) worth investigating. Geographic skew is notable: U.S.A. dominates company locations at 44.9% of records, whereas bean origins are more diverse, led by Venezuela, Peru, and the Dominican Republic. Note that the `company` column is effectively empty (a single blank value across all rows), so it should be excluded from analysis.

citing: row_count · column_count · rating · cocoa_percent · company_location · country_of_bean_origin · ingredients · company · review_date

Out[4]:

saturn.schema() · 10 columns

column kind n null% unique alerts
ref categorical 2,530 0.0% 630
company categorical 2,530 0.0% 1 imbalance
company_location categorical 2,530 0.0% 67
review_date numeric 2,530 0.0% 16
country_of_bean_origin categorical 2,530 0.0% 62
specific_bean_origin text 2,530 0.0% 1,605 one_word duplicates
cocoa_percent numeric 2,530 0.0% 46 outliers
ingredients categorical 2,530 0.0% 22
most_memorable_characteristics text 2,530 0.0% 2,487 near_unique
rating numeric 2,530 0.0% 12
Fig 1.
rating · Check how tightly ratings cluster around 3.25 and whether the left tail of low scores is meaningful.
Histogram bins for rating (median: 3.25).
bin  count
1 – 1.075  4
1.075 – 1.15  0
1.15 – 1.225  0
1.225 – 1.3  0
1.3 – 1.375  0
1.375 – 1.45  0
1.45 – 1.525  10
1.525 – 1.6  0
1.6 – 1.675  0
1.675 – 1.75  0
1.75 – 1.825  3
1.825 – 1.9  0
1.9 – 1.975  0
1.975 – 2.05  33
2.05 – 2.125  0
2.125 – 2.2  0
2.2 – 2.275  17
2.275 – 2.35  0
2.35 – 2.425  0
2.425 – 2.5  0
2.5 – 2.575  166
2.575 – 2.65  0
2.65 – 2.725  0
2.725 – 2.8  333
2.8 – 2.875  0
2.875 – 2.95  0
2.95 – 3.025  523
3.025 – 3.1  0
3.1 – 3.175  0
3.175 – 3.25  0
3.25 – 3.325  464
3.325 – 3.4  0
3.4 – 3.475  0
3.475 – 3.55  565
3.55 – 3.625  0
3.625 – 3.7  0
3.7 – 3.775  300
3.775 – 3.85  0
3.85 – 3.925  0
3.925 – 4  112
Fig 2.
cocoa_percent · Look for the 70% spike and the 9% of bars flagged as outliers above or below the typical range.
Histogram bins for cocoa_percent (median: 70.0).
bin  count
42 – 43.45  1
43.45 – 44.9  0
44.9 – 46.35  1
46.35 – 47.8  0
47.8 – 49.25  0
49.25 – 50.7  1
50.7 – 52.15  0
52.15 – 53.6  1
53.6 – 55.05  16
55.05 – 56.5  2
56.5 – 57.95  1
57.95 – 59.4  8
59.4 – 60.85  47
60.85 – 62.3  23
62.3 – 63.75  14
63.75 – 65.2  124
65.2 – 66.65  28
66.65 – 68.1  106
68.1 – 69.55  13
69.55 – 71  1046
71 – 72.45  340
72.45 – 73.9  72
73.9 – 75.35  377
75.35 – 76.8  35
76.8 – 78.25  63
78.25 – 79.7  2
79.7 – 81.15  95
81.15 – 82.6  18
82.6 – 84.05  9
84.05 – 85.5  40
85.5 – 86.95  1
86.95 – 88.4  9
88.4 – 89.85  2
89.85 – 91.3  12
91.3 – 92.75  0
92.75 – 94.2  0
94.2 – 95.65  0
95.65 – 97.1  0
97.1 – 98.55  0
98.55 – 100  23
Fig 3.
company_location · Note how heavily U.S.A. dominates at ~45% of all reviews, dwarfing Canada and France.
Top values for company_location (20 unique shown, of 67 total).
value  count  share
U.S.A.  1136  44.9%
Canada  177  7.0%
France  176  7.0%
U.K.  133  5.3%
Italy  78  3.1%
Belgium  63  2.5%
Ecuador  58  2.3%
Australia  53  2.1%
Switzerland  44  1.7%
Germany  42  1.7%
Spain  36  1.4%
Venezuela  31  1.2%
Japan  31  1.2%
Denmark  31  1.2%
Austria  30  1.2%
Colombia  29  1.1%
New Zealand  27  1.1%
Hungary  26  1.0%
Brazil  25  1.0%
Peru  23  0.9%
Fig 4.
country_of_bean_origin · See the more balanced spread across Venezuela, Peru, Dominican Republic, and Ecuador as top bean sources.
Top values for country_of_bean_origin (20 unique shown, of 62 total).
value  count  share
Venezuela  253  10.0%
Peru  244  9.6%
Dominican Republic  226  8.9%
Ecuador  219  8.7%
Madagascar  177  7.0%
Blend  156  6.2%
Nicaragua  100  4.0%
Bolivia  80  3.2%
Tanzania  79  3.1%
Colombia  79  3.1%
Brazil  78  3.1%
Belize  76  3.0%
Vietnam  73  2.9%
Guatemala  62  2.5%
Mexico  55  2.2%
Papua New Guinea  50  2.0%
Costa Rica  43  1.7%
Trinidad  42  1.7%
Ghana  41  1.6%
India  35  1.4%
Fig 5.
ingredients · Observe that two recipes (3-ingredient B,S,C and 2-ingredient B,S) account for the majority of bars.
Top values for ingredients (20 unique shown, of 22 total).
value  count  share
3- B,S,C  999  39.5%
2- B,S  718  28.4%
4- B,S,C,L  286  11.3%
5- B,S,C,V,L  184  7.3%
4- B,S,C,V  141  5.6%
(empty string)  87  3.4%
2- B,S*  31  1.2%
4- B,S*,C,Sa  20  0.8%
3- B,S*,C  12  0.5%
3- B,S,L  8  0.3%
4- B,S*,C,V  7  0.3%
5-B,S,C,V,Sa  6  0.2%
1- B  6  0.2%
4- B,S,V,L  5  0.2%
4- B,S,C,Sa  5  0.2%
6-B,S,C,V,L,Sa  4  0.2%
3- B,S,V  3  0.1%
4- B,S*,V,L  3  0.1%
4- B,S*,C,L  2  0.1%
3- B,S*,Sa  1  0.0%
Fig 6.
Per-column null rate across the corpus. Columns are ordered by input position.
column  kind  null %
ref  categorical  0.0%
company  categorical  0.0%
company_location  categorical  0.0%
review_date  numeric  0.0%
country_of_bean_origin  categorical  0.0%
specific_bean_origin  text  0.0%
cocoa_percent  numeric  0.0%
ingredients  categorical  0.0%
most_memorable_characteristics  text  0.0%
rating  numeric  0.0%
Fig 7.
Pearson correlation across numeric columns (sampled, bounded).
Pearson correlation across 3 numeric columns (values clipped to 2 decimals).
column  review_date  cocoa_percent  rating
review_date  +1.00  +0.09  +0.13
cocoa_percent  +0.09  +1.00  -0.14
rating  +0.13  -0.14  +1.00

ref categorical foreign_key

`ref` is a high-cardinality categorical with 630 distinct values across 2530 rows and no nulls, with entropy ratio 0.9954 indicating a near-uniform distribution. Values are short numeric strings (e.g. "414", "24", "1462") and the most frequent appears only 10 times (top_rate 0.0040), so this behaves like a reference/lookup id repeated a handful of times rather than a free-form feature.

Treatment: treat as a foreign key and left-join to its reference table rather than one-hot encoding.
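
The left-join treatment can be sketched as follows. The profile does not say which table `ref` points to, so the `lookup` table, its `batch_label` field, and the `left_join` helper are all hypothetical and serve only to show the shape of the operation: every review row is kept, and matching rows gain the lookup fields.

```python
def left_join(rows, lookup, key="ref"):
    """Attach lookup fields to each row; rows without a match are kept unchanged."""
    return [{**row, **lookup.get(row[key], {})} for row in rows]

# Hypothetical reference table keyed by ref (not part of the profiled dataset).
lookup = {"414": {"batch_label": "A"}}

# Hypothetical review rows; only the ref values come from the profile.
rows = [
    {"ref": "414", "rating": 3.5},
    {"ref": "24", "rating": 3.0},
]

joined = left_join(rows, lookup)
```

Unlike one-hot encoding, which would add 630 sparse columns, the join adds only the fields the reference table supplies.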

anthropic:claude-opus-4-7 · confidence high
Out[13]:

saturn.columns["ref"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 630
top_value 414
top_rate 0.003953
cardinality 630
entropy 9.257
entropy_ratio 0.9954
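
The reported `entropy` and `entropy_ratio` are consistent with Shannon entropy in bits divided by log2 of the cardinality: 9.257 / log2(630) is roughly 0.995. A minimal sketch of that calculation, assuming this is how the ratio is defined; `entropy_bits` and `entropy_ratio` are illustrative helpers, not part of the saturn-dissect API.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy of the value distribution, in bits."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    if k <= 1:
        return 0.0  # a constant column carries no information
    return entropy_bits(values) / math.log2(k)

# Cross-check against the reported ref stats (entropy 9.257, cardinality 630).
reported_ratio = 9.257 / math.log2(630)
```

A ratio near 1 means values are spread almost evenly, as with `ref`; a ratio of 0 means a single value, as with `company`.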
Fig 8.
Top values for ref.
Top values for ref (20 unique shown, of 630 total).
value  count  share
414  10  0.4%
24  9  0.4%
404  9  0.4%
387  9  0.4%
1462  8  0.3%
1454  8  0.3%
431  8  0.3%
439  8  0.3%
1450  8  0.3%
552  8  0.3%
1458  8  0.3%
1466  8  0.3%
370  7  0.3%
502  7  0.3%
636  7  0.3%
572  7  0.3%
355  7  0.3%
486  7  0.3%
478  7  0.3%
377  7  0.3%

company categorical metadata

The column is labelled 'company' but contains a single value—an empty string—across all 2530 rows. Cardinality is 1, entropy is 0, and top_rate is 1.0, so it carries no information. This is effectively a placeholder field that was never populated.

Treatment: Drop; constant empty value provides no signal.
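
A minimal sketch of detecting and dropping constant columns such as `company`, assuming the JSON source loads as a list of record dicts (the profile does not show the file layout). The helper names and sample rows are illustrative.

```python
def constant_columns(rows):
    """Return names of columns whose value is identical in every row."""
    if not rows:
        return []
    return [
        col for col in rows[0]
        if len({row[col] for row in rows}) == 1
    ]

def drop_columns(rows, to_drop):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]

# Hypothetical records shaped like this dataset.
rows = [
    {"company": "", "rating": 3.25},
    {"company": "", "rating": 3.5},
]

dead = constant_columns(rows)
clean = drop_columns(rows, dead)
```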

anthropic:claude-opus-4-7 · confidence high
Out[16]:

saturn.columns["company"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 1
top_value (empty string)
top_rate 1
cardinality 1
entropy 0
entropy_ratio 0
alert: imbalance · top value is 100.0% of rows
Fig 9.
Top values for company.
Top values for company (1 unique shown, of 1 total).
value  count  share
(empty string)  2530  100.0%

company_location categorical feature

Country of the chocolate maker, with 67 distinct locations and no nulls across 2530 rows. Heavily US-centric: 'U.S.A.' accounts for 44.9% (1136 rows), followed by Canada (177), France (176), and the U.K. (133), giving an entropy ratio of 0.61. Country names use abbreviated forms ('U.S.A.', 'U.K.') so any joins on canonical country lists will need normalisation.

Treatment: Normalise country labels and group long-tail countries before one-hot or target encoding.
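
The label normalisation step might look like the sketch below. The alias map covers only the two abbreviated forms named above, and the canonical set is a stand-in for whichever country list you join against; `Scotland` is a hypothetical label used to show how unmatched values surface.

```python
# Illustrative alias map; extend it to match your canonical country list.
ALIASES = {"U.S.A.": "United States", "U.K.": "United Kingdom"}

def normalise_country(label, aliases=ALIASES):
    """Trim whitespace and map known abbreviations to canonical names."""
    cleaned = label.strip()
    return aliases.get(cleaned, cleaned)

def unmatched(labels, canonical):
    """Labels that still fail to match the canonical list after normalisation."""
    return sorted({normalise_country(lab) for lab in labels} - set(canonical))

canonical = {"United States", "United Kingdom", "Canada"}  # stand-in list
leftover = unmatched(["U.S.A.", "U.K.", "Canada", "Scotland"], canonical)
```

Running `unmatched` before any join makes it easy to see which of the 67 labels still need an alias.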

anthropic:claude-opus-4-7 · confidence high
Out[19]:

saturn.columns["company_location"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 67
top_value U.S.A.
top_rate 0.449
cardinality 67
entropy 3.675
entropy_ratio 0.6058
Fig 10.
Top values for company_location.
Top values for company_location (20 unique shown, of 67 total).
value  count  share
U.S.A.  1136  44.9%
Canada  177  7.0%
France  176  7.0%
U.K.  133  5.3%
Italy  78  3.1%
Belgium  63  2.5%
Ecuador  58  2.3%
Australia  53  2.1%
Switzerland  44  1.7%
Germany  42  1.7%
Spain  36  1.4%
Venezuela  31  1.2%
Japan  31  1.2%
Denmark  31  1.2%
Austria  30  1.2%
Colombia  29  1.1%
New Zealand  27  1.1%
Hungary  26  1.0%
Brazil  25  1.0%
Peru  23  0.9%

review_date numeric timestamp

This column stores the year a review was recorded, ranging from 2006 to 2021 with only 16 unique values across 2530 rows. The distribution is centered around 2015 (mean 2014.37, median 2015) with a modest spread (std 3.97) and is roughly symmetric (skew -0.18). No nulls or outliers are present.

Treatment: Treat as a year-level temporal feature; bin or convert to datetime for trend analysis.
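
Both options in the treatment can be sketched with the standard library. Mapping a year to January 1 and using four-year buckets starting at 2006 are assumptions chosen for illustration; the profile only establishes that values are whole years from 2006 to 2021.

```python
from datetime import date

def year_to_date(year):
    """Represent a year-only value as January 1 of that year."""
    return date(int(year), 1, 1)

def year_bucket(year, start=2006, width=4):
    """Label the fixed-width bucket containing the year, e.g. '2006-2009'."""
    lo = start + ((int(year) - start) // width) * width
    return f"{lo}-{lo + width - 1}"

buckets = [year_bucket(y) for y in (2006, 2015, 2021)]
```

With a width of 4, the 16 observed years split into four equal buckets.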

anthropic:claude-opus-4-7 · confidence high
Out[22]:

saturn.columns["review_date"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 16
min 2006
max 2021
mean 2014
median 2015
std 3.968
q1 2012
q3 2018
iqr 6
skew -0.1833
kurtosis -0.7727
n_outliers 0
outlier_rate 0
zero_rate 0
Fig 11.
Distribution of review_date. Vertical dash marks the median.
Counts per year for review_date (median: 2015.0). The histogram used 40 equal-width bins with edges rounded to whole years; each non-empty bin holds one year, and the 24 empty bins are omitted.
year  count
2006  62
2007  73
2008  92
2009  123
2010  110
2011  163
2012  194
2013  183
2014  247
2015  284
2016  217
2017  105
2018  228
2019  193
2020  81
2021  175

country_of_bean_origin categorical feature

Categorical country label identifying where the cocoa beans originated, with 62 distinct values across 2530 complete rows and no nulls. The distribution is broad rather than concentrated: the top value Venezuela accounts for only 10% of rows, and entropy ratio 0.79 confirms fairly even spread across many origins. Notable wrinkle: 'Blend' appears as the 6th most common value (156 rows), meaning some entries aren't a single country and will need special handling.

Treatment: Group rare origins into 'Other', isolate 'Blend' as its own category, then one-hot or target-encode.
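
A minimal sketch of the grouping step, assuming a 2% share threshold (the treatment does not specify one). `Blend` is listed in `keep` so it survives as its own category regardless of frequency; `Fiji` in the sample is a hypothetical rare label.

```python
from collections import Counter

def group_rare(labels, min_share=0.02, keep=("Blend",)):
    """Map labels below min_share of rows to 'Other', never touching `keep`."""
    counts = Counter(labels)
    n = len(labels)
    return [
        lab if lab in keep or counts[lab] / n >= min_share else "Other"
        for lab in labels
    ]

# Hypothetical sample: two common origins, one blend, one rare origin.
sample = ["Venezuela"] * 50 + ["Peru"] * 48 + ["Blend", "Fiji"]
grouped = group_rare(sample)
```

The grouped labels can then be one-hot or target-encoded with far fewer than 62 categories.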

anthropic:claude-opus-4-7 · confidence high
Out[25]:

saturn.columns["country_of_bean_origin"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 62
top_value Venezuela
top_rate 0.1
cardinality 62
entropy 4.717
entropy_ratio 0.7921
Fig 12.
Top values for country_of_bean_origin.
Top values for country_of_bean_origin (20 unique shown, of 62 total).
value  count  share
Venezuela  253  10.0%
Peru  244  9.6%
Dominican Republic  226  8.9%
Ecuador  219  8.7%
Madagascar  177  7.0%
Blend  156  6.2%
Nicaragua  100  4.0%
Bolivia  80  3.2%
Tanzania  79  3.1%
Colombia  79  3.1%
Brazil  78  3.1%
Belize  76  3.0%
Vietnam  73  2.9%
Guatemala  62  2.5%
Mexico  55  2.2%
Papua New Guinea  50  2.0%
Costa Rica  43  1.7%
Trinidad  42  1.7%
Ghana  41  1.6%
India  35  1.4%

specific_bean_origin text feature

This column captures the specific bean origin (region, estate, or country) for what appears to be a chocolate/cocoa dataset, with 1,605 unique values across 2,530 rows. Top values are dominated by countries like Madagascar (55), Ecuador (43), and Peru (41), but the high frequency of the word 'batch' (356 occurrences) suggests many entries mix origin names with batch identifiers, inflating uniqueness. Roughly 34% of values are single words and 37% are duplicates, indicating inconsistent granularity — some entries are broad countries, others are specific estates or batch-tagged labels.

Treatment: Normalize by stripping batch suffixes and standardizing to country/region before encoding.
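
The profile reports that the word 'batch' is frequent but does not show how batch tags are written, so the pattern below is an assumption: a trailing "batch <id>" tag, optionally preceded by a comma. The sample strings are hypothetical; inspect real values and adjust the regex before relying on it.

```python
import re

# Assumed pattern: trailing "batch <id>", optionally preceded by a comma or spaces.
BATCH_SUFFIX = re.compile(r"[,\s]*\bbatch\s*#?\s*\w+\s*$", re.IGNORECASE)

def strip_batch(origin):
    """Remove an assumed trailing batch tag and surrounding whitespace."""
    return BATCH_SUFFIX.sub("", origin).strip()

# Hypothetical examples of tagged and untagged origins.
examples = ["Madagascar, batch 1", "Peru", "Ecuador Batch #12"]
stripped = [strip_batch(e) for e in examples]
```

Counting distinct values before and after stripping shows how much of the 1,605-value cardinality comes from batch tags.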

anthropic:claude-opus-4-7 · confidence high
Out[28]:

saturn.columns["specific_bean_origin"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 1,605
len_min 3
len_max 51
len_mean 17.12
len_median 14
len_p95 39
word_mean 2.681
word_median 2
n_empty 0
n_duplicates 925
duplicate_rate 0.3656
vocab_size 2,079
readability_flesch_mean 28.41
emoji_rate 0
url_rate 0
one_word_rate 0.3375
allcaps_rate 0.001581
boilerplate_rate 0
alert: one_word · 33.8% rows are a single word
alert: duplicates · 36.6% duplicate strings
Fig 13.
Character-length distribution for specific_bean_origin.
Character-length distribution for specific_bean_origin (mean: 17.12; bin edges rounded to whole characters).
chars  count
3 – 4  86
4 – 5  106
5 – 7  152
7 – 8  211
8 – 9  142
9 – 10  306
10 – 11  60
11 – 13  71
13 – 14  79
14 – 15  54
15 – 16  143
16 – 17  73
17 – 19  98
19 – 20  65
20 – 21  64
21 – 22  119
22 – 23  48
23 – 25  57
25 – 26  42
26 – 27  45
27 – 28  84
28 – 29  37
29 – 31  39
31 – 32  35
32 – 33  29
33 – 34  58
34 – 35  18
35 – 37  23
37 – 38  30
38 – 39  21
39 – 40  33
40 – 41  13
41 – 43  16
43 – 44  14
44 – 45  13
45 – 46  23
46 – 47  8
47 – 49  12
49 – 50  1
50 – 51  2

cocoa_percent numeric feature

This is the cocoa percentage of each chocolate bar, ranging from 42 to 100 with a median of 70 and an IQR of just 4 (q1 70, q3 74). The distribution is right-skewed (skew 1.20, kurtosis 6.54), and 9.3% of rows (235) flag as outliers beyond the 1.5 × IQR fences at 64 and 80. The histogram shows these on both sides: a low tail below 64% and a high tail that runs up to 23 bars in the top bin (98.55–100). With only 46 unique values across 2530 rows, the field is effectively semi-discrete.

Treatment: Use as-is or bin into cocoa-strength buckets; no transform needed given the narrow IQR.
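
The outlier alert refers to rows "beyond 1.5 IQR". A minimal sketch of that rule, assuming the standard Tukey fences with strict inequalities (the profile does not state how boundary values are handled); the helper names are illustrative.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper fences at k interquartile ranges outside the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True when x lies strictly outside the fences."""
    lo, hi = iqr_fences(q1, q3, k)
    return x < lo or x > hi

# Quartiles taken from the cocoa_percent stats table (q1 70, q3 74).
low, high = iqr_fences(70, 74)
```

With an IQR of only 4, any bar below 64% or above 80% cocoa is flagged, which is why a field with otherwise ordinary values reports a 9.3% outlier rate.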

anthropic:claude-opus-4-7 · confidence high
Out[31]:

saturn.columns["cocoa_percent"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 46
min 42
max 100
mean 71.64
median 70
std 5.617
q1 70
q3 74
iqr 4
skew 1.198
kurtosis 6.541
n_outliers 235
outlier_rate 0.09289
zero_rate 0
alert: outliers · 9.3% rows beyond 1.5 IQR
Fig 14.
Distribution of cocoa_percent. Vertical dash marks the median.
Histogram bins for cocoa_percent (median: 70.0).
bin  count
42 – 43.45  1
43.45 – 44.9  0
44.9 – 46.35  1
46.35 – 47.8  0
47.8 – 49.25  0
49.25 – 50.7  1
50.7 – 52.15  0
52.15 – 53.6  1
53.6 – 55.05  16
55.05 – 56.5  2
56.5 – 57.95  1
57.95 – 59.4  8
59.4 – 60.85  47
60.85 – 62.3  23
62.3 – 63.75  14
63.75 – 65.2  124
65.2 – 66.65  28
66.65 – 68.1  106
68.1 – 69.55  13
69.55 – 71  1046
71 – 72.45  340
72.45 – 73.9  72
73.9 – 75.35  377
75.35 – 76.8  35
76.8 – 78.25  63
78.25 – 79.7  2
79.7 – 81.15  95
81.15 – 82.6  18
82.6 – 84.05  9
84.05 – 85.5  40
85.5 – 86.95  1
86.95 – 88.4  9
88.4 – 89.85  2
89.85 – 91.3  12
91.3 – 92.75  0
92.75 – 94.2  0
94.2 – 95.65  0
95.65 – 97.1  0
97.1 – 98.55  0
98.55 – 100  23

ingredients categorical feature

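This appears to be a coded composition field where each value gives an ingredient count followed by short ingredient tokens (e.g. '3- B,S,C'). Given the chocolate context, the tokens most plausibly stand for beans (B), sugar (S), a non-sugar sweetener (S*), cocoa butter (C), vanilla (V), lecithin (L), and salt (Sa), though the profile itself does not confirm the key. With only 22 distinct combinations across 2530 rows and a top value ('3- B,S,C') covering 39.5% of records, the field is highly concentrated: entropy_ratio is just 0.545. Notably, 87 rows carry an empty string rather than null, so null_rate=0.0 understates true missingness, and a few values omit the space after the count (e.g. '5-B,S,C,V,Sa'), so parsing should tolerate both forms.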

Treatment: Treat empty strings as missing, then one-hot encode the ingredient tokens (split on comma) rather than the raw combined string.
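
A minimal sketch of that treatment: blank values become missing, the leading count is dropped, and each token becomes its own 0/1 indicator. The vocabulary lists the tokens visible in the top-values table; the function names are illustrative.

```python
def parse_ingredients(raw):
    """Return the set of ingredient tokens, or None when the field is blank."""
    text = raw.strip()
    if not text:
        return None                      # empty string treated as missing
    _, _, tokens = text.partition("-")   # drop the leading count, e.g. "3-"
    return {t.strip() for t in tokens.split(",") if t.strip()}

def multi_hot(raw, vocabulary):
    """Map each vocabulary token to 1 if present, else 0; None if missing."""
    tokens = parse_ingredients(raw)
    if tokens is None:
        return None
    return {tok: int(tok in tokens) for tok in vocabulary}

# Tokens seen in the top-values table for this column.
VOCAB = ["B", "S", "S*", "C", "V", "L", "Sa"]
encoded = multi_hot("2- B,S*", VOCAB)
```

Splitting on the first hyphen handles both '3- B,S,C' and '5-B,S,C,V,Sa', and keeping `S*` distinct from `S` preserves the difference between the two tokens.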

anthropic:claude-opus-4-7 · confidence high
Out[34]:

saturn.columns["ingredients"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 22
top_value 3- B,S,C
top_rate 0.3949
cardinality 22
entropy 2.43
entropy_ratio 0.545
Fig 15.
Top values for ingredients.
Top values for ingredients (20 unique shown, of 22 total).
value  count  share
3- B,S,C  999  39.5%
2- B,S  718  28.4%
4- B,S,C,L  286  11.3%
5- B,S,C,V,L  184  7.3%
4- B,S,C,V  141  5.6%
(empty string)  87  3.4%
2- B,S*  31  1.2%
4- B,S*,C,Sa  20  0.8%
3- B,S*,C  12  0.5%
3- B,S,L  8  0.3%
4- B,S*,C,V  7  0.3%
5-B,S,C,V,Sa  6  0.2%
1- B  6  0.2%
4- B,S,V,L  5  0.2%
4- B,S,C,Sa  5  0.2%
6-B,S,C,V,L,Sa  4  0.2%
3- B,S,V  3  0.1%
4- B,S*,V,L  3  0.1%
4- B,S*,C,L  2  0.1%
3- B,S*,Sa  1  0.0%

most_memorable_characteristics text free_text

Short free-text tasting notes (mean 23 characters, ~3 words) describing flavor and texture characteristics, almost certainly from a chocolate or cocoa review dataset given top tokens like 'cocoa', 'sweet', 'nutty', 'creamy', and 'fruit'. Values are near-unique (2487 distinct of 2530) yet built from a small vocabulary of 868 words, indicating these are comma-separated descriptor combinations rather than prose. Only 43 exact duplicates and no empties or URLs; readability mean of 49.7 is not very meaningful at this length.

Treatment: Split on commas into descriptor tags and one-hot or embed for modelling.
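
Splitting notes into tags might look like the sketch below. The sample notes are hypothetical, assembled from tokens the profile reports as frequent; lower-casing is an added assumption to merge case variants.

```python
from collections import Counter

def to_tags(note):
    """Split a comma-separated tasting note into lower-cased descriptor tags."""
    return [t.strip().lower() for t in note.split(",") if t.strip()]

def tag_counts(notes):
    """Count how often each descriptor tag appears across all notes."""
    return Counter(tag for note in notes for tag in to_tags(note))

# Hypothetical notes; real values are not shown in the profile.
notes = ["sweet, nutty", "creamy, cocoa, sweet"]
counts = tag_counts(notes)
```

Tag counts give a natural cutoff for one-hot encoding: keep the frequent descriptors as indicator columns and fold the rest into an 'other' flag or an embedding.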

anthropic:claude-opus-4-7 · confidence high
Out[37]:

saturn.columns["most_memorable_characteristics"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 2,487
len_min 3
len_max 37
len_mean 23.06
len_median 23
len_p95 30
word_mean 3.376
word_median 3
n_empty 0
n_duplicates 43
duplicate_rate 0.017
vocab_size 868
readability_flesch_mean 49.71
emoji_rate 0
url_rate 0
one_word_rate 0.007115
allcaps_rate 0
boilerplate_rate 0
alert: near_unique · 98.3% of rows are unique strings
Fig 16.
Character-length distribution for most_memorable_characteristics.
Character-length distribution for most_memorable_characteristics (mean: 23.06; bin edges rounded to whole characters).
chars  count
3 – 4  1
4 – 5  0
5 – 6  4
6 – 6  4
6 – 7  3
7 – 8  1
8 – 9  0
9 – 10  2
10 – 11  3
11 – 12  9
12 – 12  39
12 – 13  47
13 – 14  52
14 – 15  0
15 – 16  29
16 – 17  34
17 – 17  69
17 – 18  100
18 – 19  145
19 – 20  0
20 – 21  206
21 – 22  206
22 – 23  156
23 – 23  173
23 – 24  179
24 – 25  192
25 – 26  0
26 – 27  201
27 – 28  179
28 – 28  178
28 – 29  121
29 – 30  86
30 – 31  66
31 – 32  0
32 – 33  23
33 – 34  13
34 – 34  3
34 – 35  4
35 – 36  1
36 – 37  1

rating numeric feature

A bounded numeric rating on a 1.0–4.0 scale with only 12 distinct values, suggesting half- or quarter-step increments rather than continuous scores. The distribution is tight (IQR 0.5, std 0.45) and slightly left-skewed (-0.61), centered near 3.25 with a mean of 3.20, and 50 low-end outliers (1.98%) pull the tail. No nulls or zeros, so every row carries a usable score.

Treatment: Use as-is as an ordinal/numeric feature; consider treating the 50 low-end outliers separately if modelling tails.

anthropic:claude-opus-4-7 · confidence high
Out[40]:

saturn.columns["rating"].stats

stat value
n 2,530
nulls 0 (0.0%)
unique 12
min 1
max 4
mean 3.196
median 3.25
std 0.4453
q1 3
q3 3.5
iqr 0.5
skew -0.6084
kurtosis 1.053
n_outliers 50
outlier_rate 0.01976
zero_rate 0
Fig 17.
Distribution of rating. Vertical dash marks the median.
Histogram bins for rating (median: 3.25).
bin  count
1 – 1.075  4
1.075 – 1.15  0
1.15 – 1.225  0
1.225 – 1.3  0
1.3 – 1.375  0
1.375 – 1.45  0
1.45 – 1.525  10
1.525 – 1.6  0
1.6 – 1.675  0
1.675 – 1.75  0
1.75 – 1.825  3
1.825 – 1.9  0
1.9 – 1.975  0
1.975 – 2.05  33
2.05 – 2.125  0
2.125 – 2.2  0
2.2 – 2.275  17
2.275 – 2.35  0
2.35 – 2.425  0
2.425 – 2.5  0
2.5 – 2.575  166
2.575 – 2.65  0
2.65 – 2.725  0
2.725 – 2.8  333
2.8 – 2.875  0
2.875 – 2.95  0
2.95 – 3.025  523
3.025 – 3.1  0
3.1 – 3.175  0
3.175 – 3.25  0
3.25 – 3.325  464
3.325 – 3.4  0
3.4 – 3.475  0
3.475 – 3.55  565
3.55 – 3.625  0
3.625 – 3.7  0
3.7 – 3.775  300
3.775 – 3.85  0
3.85 – 3.925  0
3.925 – 4  112

How to cite


BibTeX
@misc{saturn-quirky-chocolate-origins-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky chocolate origins},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-chocolate_origins}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky chocolate origins. Source: /home/coolhand/html/datavis/data_trove/data/quirky/chocolate_origins.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-chocolate_origins