data trove world of cheese

saturn notebook · generated 2026-06-21

Overview

Source: /home/coolhand/html/datavis/data_trove/data/quirky/cheese_list.json

Saturn profiled 7,146 rows across 4 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

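# Notebook cell: the leading "!" is IPython shell syntax, not plain Python.
!pip install saturn-dissect
import subprocess

# Run the saturn CLI on the source file with the flags used for this report.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/data/quirky/cheese_list.json",
    "--findings", "data-trove-world-of-cheese.json",
    "--llm", "anthropic:default",
])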

Summary confidence: high

This dataset is a multilingual catalogue of 7,146 cheese products spanning 32 categories and 111 countries of origin. The most striking pattern is geographic concentration: France alone accounts for 26% of all entries (1,853), followed by Germany (907) and the United States (759), so the dataset skews heavily toward Western Europe, with the United States as the main exception. On the category side, Cream Cheese leads with 1,187 entries (17%), and the top 5 categories together cover just over half the dataset (51%), which is worth examining for over-representation. The 'value' column is constant at 1.0 and can be safely ignored. Product names are highly multilingual (30 languages detected), and 11% of names are exact repeats of another entry; case and language variants of the same cheese are counted as distinct values, so the real overlap is likely higher.

citing: row_count · column_count · country.stats.top_value · country.stats.top_rate · country.n_unique · category.stats.top_value · category.stats.top_rate · category.n_unique · name.stats.duplicate_rate · name.language_counts · value.alerts

Out[4]:

saturn.schema() · 4 columns

column	kind	n	null%	unique	alerts
name	text	7,146	0.0%	6,337	multilingual
country	categorical	7,146	0.0%	111
category	categorical	7,146	0.0%	32
value	numeric	7,146	0.0%	1	constant

Fig 1.

country · Note the sharp drop-off after France (25.9%), Germany (12.7%), and the United States (10.6%): no other country exceeds 5% of entries.

Top values for country (20 unique shown, of 111 total).
value	count	share
France	1853	25.9%
Germany	907	12.7%
United States	759	10.6%
Belgium	334	4.7%
United Kingdom	333	4.7%
Spain	319	4.5%
Italy	307	4.3%
Switzerland	209	2.9%
Poland	145	2.0%
Netherlands	134	1.9%
Austria	127	1.8%
Canada	125	1.7%
Sweden	123	1.7%
Portugal	115	1.6%
Ireland	114	1.6%
Czech Republic	103	1.4%
Australia	100	1.4%
Finland	88	1.2%
Norway	75	1.0%
Bulgaria	60	0.8%

Fig 2.

category · Cream Cheese, Mozzarella, and Soft Cheese are the three largest categories and together account for 35% of entries.

Top values for category (20 unique shown, of 32 total).
value	count	share
Cream Cheese	1187	16.6%
Mozzarella	702	9.8%
Soft Cheese	637	8.9%
Grated Cheese	571	8.0%
Cottage Cheese	544	7.6%
Goat Cheese	526	7.4%
Cheese Spread	473	6.6%
Gouda	456	6.4%
Hard Cheese	340	4.8%
Feta	246	3.4%
Fresh Cheese	196	2.7%
Fromage Blanc	150	2.1%
Raclette	144	2.0%
Comté	99	1.4%
Edam	97	1.4%
Havarti	95	1.3%
Burrata	88	1.2%
Halloumi	87	1.2%
Ricotta	85	1.2%
Dairy Products	77	1.1%

Fig 3.

name · The top language breakdown shows English and French names account for the majority, despite 30 languages being present overall.

Fig 4.

name · Most cheese names are concise (median 21 characters), but a long tail stretches to 191 characters — worth inspecting for data quality issues.

Character-length distribution for name (mean: 22.96).
chars	count
4 – 9	438
9 – 13	994
13 – 18	1446
18 – 23	1162
23 – 27	1119
27 – 32	774
32 – 37	389
37 – 41	341
41 – 46	184
46 – 51	107
51 – 55	69
55 – 60	44
60 – 65	28
65 – 69	17
69 – 74	7
74 – 79	2
79 – 83	2
83 – 88	6
88 – 93	0
93 – 98	1
98 – 102	3
102 – 107	1
107 – 112	6
112 – 116	0
116 – 121	0
121 – 126	3
126 – 130	1
130 – 135	0
135 – 140	0
140 – 144	0
144 – 149	0
149 – 154	0
154 – 158	1
158 – 163	0
163 – 168	0
168 – 172	0
172 – 177	0
177 – 182	0
182 – 186	0
186 – 191	1
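
As a sanity check on the caption, the median can be approximated from the binned counts alone by interpolating linearly inside the bin that contains the middle row. The sketch below is an illustration, not Saturn's method; the bin edges are the rounded values printed in the table, so the result is approximate, and `median_from_bins` is a hypothetical helper name.

```python
def median_from_bins(bins, total):
    """Estimate the median from (low, high, count) bins by linear interpolation.

    Assumes values are spread evenly inside each bin. `total` is the full row
    count, so bins beyond the one holding the median can be omitted.
    """
    half = total / 2
    seen = 0
    for low, high, count in bins:
        if count and seen + count >= half:
            # Fraction of this bin needed to reach the middle row.
            return low + (half - seen) / count * (high - low)
        seen += count
    raise ValueError("bins do not cover half of the rows")


# First four rows of the character-length table above; 7,146 rows in total.
name_length_bins = [(4, 9, 438), (9, 13, 994), (13, 18, 1446), (18, 23, 1162)]
estimate = median_from_bins(name_length_bins, total=7146)
print(round(estimate, 1))  # 21.0
```

The estimate agrees with the reported median of 21 characters, which suggests the histogram and the summary statistic describe the same data.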

Fig 5.

Per-column null rate across the corpus. Columns are ordered by input position.

Per-column null rate across the corpus.
column	kind	null %
name	text	0.0%
country	categorical	0.0%
category	categorical	0.0%
value	numeric	0.0%

Fig 6.

Language mix across all text columns (per-string detection, sampled).

Per-language counts (total 4,745 detected strings).
lang	count	share
en	1847	38.9%
fr	1029	21.7%
de	552	11.6%
it	389	8.2%
es	251	5.3%
nl	146	3.1%
pt	96	2.0%
pl	77	1.6%
ca	50	1.1%
fi	39	0.8%
sv	30	0.6%
uk	29	0.6%
cs	27	0.6%
ru	24	0.5%
sk	19	0.4%
tr	17	0.4%
et	13	0.3%
no	13	0.3%
ro	12	0.3%
ja	12	0.3%
el	10	0.2%
da	9	0.2%
hu	7	0.1%
he	7	0.1%
lt	7	0.1%
ceb	7	0.1%
eo	7	0.1%
lv	7	0.1%
sh	6	0.1%
id	6	0.1%

name text label

This column contains the names of cheese products (or cheese-related food items), as evidenced by top values like 'Mozzarella', 'Cottage Cheese', and 'Gouda', and dominant words including 'cheese', 'mozzarella', 'fromage', and 'queso'. Notably, 11.3% of values are exact duplicates (809 out of 7,146). That figure understates the real overlap, because case-inconsistent entries such as 'Cottage cheese' (29) and 'Cottage Cheese' (27) are counted as separate values. A multilingual alert is triggered across 30 detected languages, led by English (1,847), French (1,029), German (552), Italian (389), and Spanish (251); this reflects international product naming rather than language mixing within a single entry. The mean name length is about 23 characters and the median is 3 words, consistent with structured product-label strings rather than free text.

Treatment: Normalise case before deduplication or grouping; consider language-aware normalisation to consolidate cross-lingual synonyms (e.g. 'fromage', 'queso', 'cheese') for modelling.

anthropic:default · confidence high
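
A minimal sketch of the case-normalisation step, using only the Python standard library. The sample names are illustrative stand-ins, not rows read from the file. `normalise_name` and `duplicate_rate` are hypothetical helpers, and the rate is assumed to be (n − unique) / n, which matches the stats for this column (7,146 − 6,337 = 809).

```python
import unicodedata
from collections import Counter


def normalise_name(name):
    """Collapse case, Unicode form, and whitespace differences into one key."""
    text = unicodedata.normalize("NFKC", name)
    return " ".join(text.split()).casefold()


def duplicate_rate(names, key=lambda s: s):
    """Share of entries whose key repeats an earlier entry: (n - unique) / n."""
    counts = Counter(key(n) for n in names)
    return (len(names) - len(counts)) / len(names)


# Illustrative sample only; not taken from cheese_list.json.
sample = ["Cottage cheese", "Cottage Cheese", "Gouda", "gouda ", "Mozzarella", "Feta"]

raw_rate = duplicate_rate(sample)                         # exact-string matching
folded_rate = duplicate_rate(sample, key=normalise_name)  # after normalisation
print(raw_rate, round(folded_rate, 3))  # 0.0 0.333
```

Cross-lingual synonyms such as 'fromage', 'queso', and 'cheese' are out of scope for this sketch and would need a translation table or a similar mapping.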

Out[12]:

saturn.columns["name"].stats

stat	value
n	7,146
nulls	0 (0.0%)
unique	6,337
len_min	4
len_max	191
len_mean	22.96
len_median	21
len_p95	44
word_mean	3.443
word_median	3
n_empty	0
n_duplicates	809
duplicate_rate	0.1132
vocab_size	4,732
readability_flesch_mean	53.61
emoji_rate	0.0005598
url_rate	0
one_word_rate	0.1041
allcaps_rate	0.01707
boilerplate_rate	0
alert: multilingual	31 languages detected in sample

Fig 7.

Character-length distribution for name.

country categorical feature

This column records the country associated with each product, spanning 111 distinct countries across 7,146 rows with no nulls. France dominates, accounting for 25.9% of all records (1,853 rows), followed by Germany (907) and the United States (759), which suggests a strongly Europe-centric dataset, possibly French-sourced. The entropy ratio of 0.628 indicates moderate spread, but the tail is long: the top 10 countries hold about 74% of rows, leaving the other 101 countries to share the remaining 26%.

Treatment: One-hot encode top countries and group sparse tail into an 'Other' category before modelling.

anthropic:default · confidence high
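
One way to realise this treatment, sketched with the standard library only. The cut-off of two top countries in the example is an arbitrary illustrative choice, and `group_tail` and `one_hot` are hypothetical helper names; a real pipeline would more likely rely on a dataframe library's encoder.

```python
from collections import Counter


def group_tail(values, top_n=10, other="Other"):
    """Keep the top_n most frequent labels and map everything else to `other`."""
    keep = {label for label, _ in Counter(values).most_common(top_n)}
    return [v if v in keep else other for v in values]


def one_hot(values):
    """Return (sorted column labels, one 0/1 row per input value)."""
    labels = sorted(set(values))
    index = {label: i for i, label in enumerate(labels)}
    rows = []
    for v in values:
        row = [0] * len(labels)
        row[index[v]] = 1
        rows.append(row)
    return labels, rows


# Illustrative sample; the real column has 111 distinct countries.
countries = ["France", "France", "France", "Germany", "Germany", "Bulgaria", "Norway"]
grouped = group_tail(countries, top_n=2)
labels, encoded = one_hot(grouped)
print(labels)                    # ['France', 'Germany', 'Other']
print(encoded[0], encoded[-1])   # [1, 0, 0] [0, 0, 1]
```

Grouping before encoding keeps the feature matrix at top_n + 1 columns instead of 111, at the cost of treating all tail countries as interchangeable.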

Out[15]:

saturn.columns["country"].stats

stat	value
n	7,146
nulls	0 (0.0%)
unique	111
top_value	France
top_rate	0.2593
cardinality	111
entropy	4.268
entropy_ratio	0.6281
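
The two entropy figures are consistent with Shannon entropy in bits divided by its maximum for the observed cardinality, log2(111) ≈ 6.794. That definition is inferred from the reported numbers rather than taken from Saturn's documentation, so the sketch below rests on that assumption; the function names are hypothetical.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a frequency table."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by log2(number of categories); 0.0 for a single category."""
    k = sum(1 for c in counts if c > 0)
    if k < 2:
        return 0.0
    return shannon_entropy_bits(counts) / math.log2(k)


# Check against the reported stats: entropy 4.268 bits over 111 countries.
reported_ratio = 4.268 / math.log2(111)
print(round(reported_ratio, 3))  # 0.628

# A uniform table reaches the maximum ratio of 1.0.
print(entropy_ratio([5, 5, 5, 5]))  # 1.0
```

A ratio near 1 means rows are spread almost evenly across categories; a ratio near 0 means a few categories hold nearly everything.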

Fig 8.

Top values for country.

category categorical label

This column is a product category label for what appears to be a cheese-focused retail or food dataset, with 32 distinct categories across 7,146 records and no nulls. The distribution is moderately uneven: 'Cream Cheese' leads at 16.6% (1,187 rows), and the top 10 categories account for about 80% of records (5,682 rows). The entropy ratio of 0.82 suggests reasonable spread across categories, with a long tail beyond the top 10. No alerts are raised, though the mix includes generic labels such as 'Soft Cheese', 'Hard Cheese', and 'Dairy Products' alongside specific varieties.

Treatment: One-hot encode or target-encode for modelling; consider grouping low-frequency tail categories if sparse classes cause issues.

anthropic:default · confidence high
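
The concentration figures can be reproduced from the category counts in the Fig 2 table. The sketch below hard-codes the ten largest counts from that table; `top_k_share` is a hypothetical helper, not part of Saturn.

```python
def top_k_share(counts, k, total):
    """Fraction of `total` rows covered by the k largest counts."""
    return sum(sorted(counts, reverse=True)[:k]) / total


# Ten largest category counts from the Fig 2 table (Cream Cheese through Feta).
category_counts = [1187, 702, 637, 571, 544, 526, 473, 456, 340, 246]
TOTAL_ROWS = 7146

for k in (1, 5, 10):
    print(k, f"{top_k_share(category_counts, k, TOTAL_ROWS):.1%}")
# 1 16.6%
# 5 51.0%
# 10 79.5%
```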

Out[18]:

saturn.columns["category"].stats

stat	value
n	7,146
nulls	0 (0.0%)
unique	32
top_value	Cream Cheese
top_rate	0.1661
cardinality	32
entropy	4.098
entropy_ratio	0.8195

Fig 9.

Top values for category.

value numeric other

This column is a numeric constant: every one of its 7,146 non-null rows holds exactly the value 1.0, with zero variance, zero skew, and a single unique value. It carries no information and will contribute nothing to any model or analysis. This is likely a placeholder, a join artifact, or a weight/flag column that was never populated with real data.

Treatment: Drop immediately; zero-variance constant adds no predictive or descriptive value.

anthropic:default · confidence high
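
A sketch of how such columns can be detected and dropped before modelling, assuming the file loads as a list of flat records with the four column names from the schema. The sample rows are illustrative, and `constant_columns` and `drop_columns` are hypothetical helpers.

```python
def constant_columns(records):
    """Return sorted names of columns holding one distinct value across all records."""
    if not records:
        return []
    return sorted(
        col for col in records[0]
        if len({row.get(col) for row in records}) == 1
    )


def drop_columns(records, columns):
    """Copy each record without the named columns."""
    skip = set(columns)
    return [{k: v for k, v in row.items() if k not in skip} for row in records]


# Illustrative rows shaped like the schema above; not read from the source file.
rows = [
    {"name": "Gouda", "country": "Netherlands", "category": "Gouda", "value": 1.0},
    {"name": "Comté", "country": "France", "category": "Comté", "value": 1.0},
]
constants = constant_columns(rows)
cleaned = drop_columns(rows, constants)
print(constants, sorted(cleaned[0]))  # ['value'] ['category', 'country', 'name']
```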

Out[21]:

saturn.columns["value"].stats

stat	value
n	7,146
nulls	0 (0.0%)
unique	1
min	1
max	1
mean	1
median	1
std	0
q1	1
q3	1
iqr	0
skew	0
kurtosis	0
n_outliers	0
outlier_rate	0
zero_rate	0
alert: constant	only one distinct value

Fig 10.

Distribution of value. Vertical dash marks the median.

Histogram bins for value (median: 1.0).
bin	count
1 – 1.025	7146
All other bins (0.5 – 1.5 in steps of 0.025, 39 in total) are empty.

How to cite

BibTeX

@misc{saturn-data-trove-world-of-cheese-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: data trove world of cheese},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/data-trove-world-of-cheese}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:default},
}

APA

Steuber, L. (2026). Saturn reading: data trove world of cheese. Source: /home/coolhand/html/datavis/data_trove/data/quirky/cheese_list.json. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:default). Retrieved from https://dr.eamer.dev/saturn/view/data-trove-world-of-cheese