quirky-openfoodfacts_cheese_20260121

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/quirky/openfoodfacts_cheese_20260121.parquet

Saturn profiled 77,145 rows across 10 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:

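# Install the Saturn CLI (notebook shell magic; use plain `pip install` outside a notebook).
!pip install saturn-dissect
import subprocess

# Profile the parquet file, write the findings JSON, and request the optional LLM prose.
subprocess.run([
    "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/quirky/openfoodfacts_cheese_20260121.parquet",
    "--findings", "quirky-openfoodfacts_cheese_20260121.json",
    "--llm", "anthropic:claude-opus-4-7",
])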

Summary confidence: high

This dataset contains 77,145 product records from Open Food Facts, focused on cheese products. Its 10 columns cover names, ingredients, quantities, image URLs, and six tag fields (brands, categories, countries, labels, nutrition grades, origins). The text fields are highly multilingual: product_name spans 30+ languages in the detection sample, led by English (3,820) and French (315), and ingredients_text shows the same pattern. Two things deserve a closer look first. One is the heavy null rate on quantity (57.7%) and ingredients_text (41.9%), which will limit any analysis depending on those fields. The other is the strong duplication in quantity (90.5% duplicate rate), where values like '1 serving(s)', '8 oz', and '200 g' recur many times over. Product names also duplicate substantially (30.9%), with 'Cottage Cheese', 'Cheese', and 'Mozzarella' appearing as common generic labels. The six tag-style columns were skipped during profiling, so their structure is not yet characterized.

citing: row_count · column_count · columns.quantity.null_rate · columns.quantity.stats.duplicate_rate · columns.quantity.top_values · columns.ingredients_text.null_rate · columns.ingredients_text.language_counts · columns.product_name.language_counts · columns.product_name.top_values · columns.product_name.stats.duplicate_rate · columns.image_url.null_rate
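
The citation keys above read like dotted paths into the findings JSON written by the `--findings` flag. The following is a minimal sketch of resolving such paths, assuming the file is a nested JSON object whose keys mirror them. That layout is an assumption, and `resolve`, `load_findings`, and the `sample` dict are hypothetical names used only for illustration.

```python
import json
from functools import reduce

def load_findings(path):
    """Read a findings file from disk (assumed to be a single JSON object)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def resolve(findings, dotted_path):
    """Walk a nested dict along a dotted key path such as 'columns.quantity.null_rate'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), findings)

# Tiny stand-in for the findings file; values are copied from the tables in this report.
sample = {
    "row_count": 77145,
    "column_count": 10,
    "columns": {"quantity": {"null_rate": 0.577, "stats": {"duplicate_rate": 0.905}}},
}

for path in ("row_count", "columns.quantity.null_rate", "columns.quantity.stats.duplicate_rate"):
    print(path, "=", resolve(sample, path))
```

With the real file, `resolve(load_findings("quirky-openfoodfacts_cheese_20260121.json"), "columns.quantity.null_rate")` would be the equivalent call, provided the assumed layout holds.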

Out[4]:

saturn.schema() · 10 columns

column	kind	n	null%	unique	alerts
brands_tags	unknown	77,145	0.0%	—	skipped
categories_tags	unknown	77,145	0.0%	—	skipped
countries_tags	unknown	77,145	0.0%	—	skipped
image_url	text	77,145	39.0%	47,035	near_unique one_word url_heavy null_rate
ingredients_text	text	77,145	41.9%	30,902	multilingual null_rate duplicates
labels_tags	unknown	77,145	0.0%	—	skipped
nutrition_grades_tags	unknown	77,145	0.0%	—	skipped
origins_tags	unknown	77,145	0.0%	—	skipped
product_name	text	77,145	1.0%	52,776	multilingual duplicates
quantity	text	77,145	57.7%	3,102	one_word null_rate short_text duplicates

Fig 1.

quantity · Most common quantity strings — note inconsistent formatting like '200 g' vs '200g' that will need normalization.


Character-length distribution for quantity (mean: 4.739691195392439).
chars	count
0 – 1	4393
1 – 3	66
3 – 4	10984
4 – 6	12621
6 – 7	1264
7 – 9	413
9 – 10	462
10 – 12	129
12 – 13	1845
13 – 14	80
14 – 16	51
16 – 17	142
17 – 19	49
19 – 20	48
20 – 22	14
22 – 23	21
23 – 25	4
25 – 26	13
26 – 28	3
28 – 29	5
29 – 30	8
30 – 32	5
32 – 33	4
33 – 35	0
35 – 36	3
36 – 38	1
38 – 39	1
39 – 41	2
41 – 42	6
42 – 44	1
44 – 45	1
45 – 46	1
46 – 48	0
48 – 49	1
49 – 51	0
51 – 52	0
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	1

Fig 2.

product_name · Top product names are dominated by generic terms like 'Cottage Cheese' and 'Mozzarella', suggesting many non-unique entries.


Character-length distribution for product_name (mean: 27.748952056588944).
chars	count
0 – 6	1658
6 – 13	7931
13 – 19	16585
19 – 25	15617
25 – 32	12097
32 – 38	8979
38 – 44	4569
44 – 51	2949
51 – 57	2054
57 – 64	1107
64 – 70	796
70 – 76	562
76 – 83	297
83 – 89	201
89 – 95	135
95 – 102	105
102 – 108	99
108 – 114	96
114 – 121	63
121 – 127	70
127 – 133	54
133 – 140	56
140 – 146	49
146 – 152	35
152 – 159	37
159 – 165	30
165 – 171	13
171 – 178	22
178 – 184	11
184 – 190	12
190 – 197	15
197 – 203	10
203 – 210	4
210 – 216	5
216 – 222	5
222 – 229	4
229 – 235	2
235 – 241	2
241 – 248	1
248 – 254	3

Fig 3.

ingredients_text · Ingredient text length is highly skewed (median 126 chars, max 5,283) — useful for spotting sparse vs. detailed entries.


Character-length distribution for ingredients_text (mean: 238.2221552743746).
chars	count
0 – 132	22912
132 – 264	8740
264 – 396	4217
396 – 528	3155
528 – 660	1952
660 – 792	1268
792 – 925	810
925 – 1057	581
1057 – 1189	378
1189 – 1321	282
1321 – 1453	173
1453 – 1585	115
1585 – 1717	57
1717 – 1849	38
1849 – 1981	37
1981 – 2113	29
2113 – 2245	29
2245 – 2377	7
2377 – 2509	12
2509 – 2642	5
2642 – 2774	3
2774 – 2906	1
2906 – 3038	3
3038 – 3170	2
3170 – 3302	0
3302 – 3434	0
3434 – 3566	1
3566 – 3698	0
3698 – 3830	1
3830 – 3962	1
3962 – 4094	0
4094 – 4226	0
4226 – 4358	0
4358 – 4491	0
4491 – 4623	0
4623 – 4755	1
4755 – 4887	0
4887 – 5019	0
5019 – 5151	0
5151 – 5283	1
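
The length tables in this report each have 40 rows, which is consistent with equal-width binning between the shortest and longest string. The sketch below shows one way such a histogram could be produced. The binning rule is an assumption about how the tables were built, not Saturn's documented behaviour, and `length_histogram` is a hypothetical helper.

```python
def length_histogram(texts, bins=40):
    """Count string lengths in `bins` equal-width buckets spanning min..max length."""
    lengths = [len(t) for t in texts if t is not None]
    lo, hi = min(lengths), max(lengths)
    width = (hi - lo) / bins or 1  # fall back to width 1 when every length is equal
    counts = [0] * bins
    for n in lengths:
        idx = min(int((n - lo) / width), bins - 1)  # the maximum lands in the last bucket
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# Toy input: three short strings, one long one, and a null that is ignored.
edges, counts = length_histogram(["a", "bb", "cccc", None, "dddddddd"], bins=2)
print(edges, counts)
```

On a skewed column such as ingredients_text, most of the mass falls in the first few buckets, which matches the shape of the table above.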

Fig 4.

product_name · Product name length distribution shows most names are short (median 24 chars), with a long tail of verbose entries.

Data table: identical to the product_name character-length table under Fig 2.

Fig 5.

Per-column null rate across the corpus. Columns are ordered by input position.


Per-column null rate across the corpus.
column	kind	null %
brands_tags	unknown	0.0%
categories_tags	unknown	0.0%
countries_tags	unknown	0.0%
image_url	text	39.0%
ingredients_text	text	41.9%
labels_tags	unknown	0.0%
nutrition_grades_tags	unknown	0.0%
origins_tags	unknown	0.0%
product_name	text	1.0%
quantity	text	57.7%

Fig 6.

Language mix across all text columns (per-string detection, sampled).


Per-language counts (total 6,991 detected strings).
lang	count	share
en	5443	77.9%
fr	527	7.5%
de	297	4.2%
it	185	2.6%
es	151	2.2%
nl	52	0.7%
pl	49	0.7%
pt	33	0.5%
sv	31	0.4%
ru	26	0.4%
no	22	0.3%
fi	20	0.3%
cs	18	0.3%
ca	15	0.2%
tr	15	0.2%
ja	13	0.2%
da	10	0.1%
bg	8	0.1%
el	8	0.1%
uk	8	0.1%
hu	6	0.1%
ro	6	0.1%
et	6	0.1%
hr	5	0.1%
ms	5	0.1%
eo	4	0.1%
ceb	4	0.1%
id	4	0.1%
sk	3	0.0%
bs	3	0.0%
zh	3	0.0%
sr	2	0.0%
sh	2	0.0%
lt	2	0.0%
nn	2	0.0%
ar	1	0.0%
ko	1	0.0%
th	1	0.0%
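
The share column above is each count divided by the 6,991 detected strings. As a quick check of how concentrated the mix is, the sketch below recomputes the shares for the five largest languages and the size of the remaining tail. `language_shares` is a hypothetical helper, and the counts are copied from the table.

```python
def language_shares(counts, total):
    """Return each language's share of `total` as a percentage rounded to one decimal."""
    return {lang: round(100 * n / total, 1) for lang, n in counts.items()}

TOTAL_DETECTED = 6991
top_five = {"en": 5443, "fr": 527, "de": 297, "it": 185, "es": 151}

shares = language_shares(top_five, TOTAL_DETECTED)
tail_count = TOTAL_DETECTED - sum(top_five.values())  # strings in all other languages

print(shares)
print("tail strings:", tail_count)
```

The tail is small in volume but covers more than thirty languages, so per-language processing needs a fallback for languages with only a handful of strings.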

brands_tags unknown other

The profiler skipped brands_tags, so no descriptive statistics, uniqueness count, or value samples are available. The only confirmed facts are that it has 77,145 rows and a null rate of 0.0, meaning every row carries some value. Based on the name alone it likely holds brand tag strings, but the evidence does not verify this.

Treatment: Re-run profiling with string handling enabled before deciding on use.

anthropic:claude-opus-4-7 · confidence low

Out[12]:

saturn.columns["brands_tags"].stats

stat	value
n	77,145
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

categories_tags unknown feature

The profiler skipped this column, so no statistics, uniqueness, or value samples are available beyond a row count of 77,145 and a null rate of 0.0. The name `categories_tags` suggests a multi-valued categorical field (likely a delimited list of category labels, e.g. from Open Food Facts), but this cannot be confirmed from the evidence.

Treatment: Re-profile with list/string handling enabled, then split on the delimiter and one-hot or multi-hot encode before modelling.

anthropic:claude-opus-4-7 · confidence low
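
The split-then-encode treatment suggested for categories_tags could look roughly like the sketch below. It assumes each value is a comma-delimited string. Both the delimiter and the sample tags are assumptions, since the column was never profiled, and if the values turn out to be lists already the split step is unnecessary. `multi_hot` is a hypothetical helper.

```python
def multi_hot(rows, sep=","):
    """Split each delimited tag string and build a multi-hot matrix.

    Returns (vocabulary, matrix) where matrix[i][j] is 1 if row i carries vocabulary[j].
    """
    split_rows = [
        {tag.strip() for tag in row.split(sep) if tag.strip()} if row else set()
        for row in rows
    ]
    vocabulary = sorted(set().union(*split_rows))
    index = {tag: j for j, tag in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in split_rows]
    for i, tags in enumerate(split_rows):
        for tag in tags:
            matrix[i][index[tag]] = 1
    return vocabulary, matrix

# Hypothetical values; the real tag vocabulary is unknown until the column is re-profiled.
vocabulary, matrix = multi_hot(["en:cheeses,en:dairies", "en:dairies", ""])
print(vocabulary)
print(matrix)
```

For a large tag vocabulary a sparse representation would be more practical than a dense list of lists.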

Out[14]:

saturn.columns["categories_tags"].stats

stat	value
n	77,145
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

countries_tags unknown feature

The column `countries_tags` was skipped by the profiler, so no statistics beyond row count and null rate are available. All 77,145 rows are non-null, but uniqueness, distribution, and value examples are unknown. The name suggests a multi-valued tag field (e.g., comma- or colon-separated country codes like `en:france`), but this cannot be confirmed from the evidence.

Treatment: Re-profile with a tag-aware parser (split on separator) before deciding on encoding or filtering.

anthropic:claude-opus-4-7 · confidence low
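
A tag-aware pass over countries_tags might start by separating the language prefix from the value and counting how often each value occurs, which helps decide on filtering before any encoding. The sketch assumes comma-delimited strings with `prefix:value` tags such as `en:france`. Both are unverified guesses from the column name, and `parse_tag` and `tag_counts` are hypothetical helpers.

```python
from collections import Counter

def parse_tag(tag):
    """Split 'en:france' into ('en', 'france'); tags without a prefix get prefix None."""
    prefix, sep, value = tag.strip().partition(":")
    return (prefix, value) if sep else (None, prefix)

def tag_counts(rows, sep=","):
    """Count tag values across rows, ignoring the language prefix and empty entries."""
    counts = Counter()
    for row in rows:
        for tag in (row or "").split(sep):
            if tag.strip():
                counts[parse_tag(tag)[1]] += 1
    return counts

# Hypothetical rows for illustration only.
print(tag_counts(["en:france,en:germany", "en:france", None]).most_common())
```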

Out[16]:

saturn.columns["countries_tags"].stats

stat	value
n	77,145
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

image_url text metadata

This column holds Open Food Facts product image URLs, one per row; a url_rate of 1.0 and a word_mean of 1.0 confirm that each value is a single link. It is 39.03% null, so coverage is partial, and it is near-unique among the populated rows: 47,035 distinct values across 47,036 non-null rows, with only one duplicate URL. Lengths are tightly bounded (75 to 96 chars), reflecting a fixed CDN path template.

Treatment: Drop for modelling; retain as a reference link for image fetching or manual inspection.

anthropic:claude-opus-4-7 · confidence high

Out[18]:

saturn.columns["image_url"].stats

stat	value
n	77,145
nulls	30,109 (39.0%)
unique	47,035
len_min	75
len_max	96
len_mean	84.24
len_median	84
len_p95	85
word_mean	1
word_median	1
n_empty	0
n_duplicates	1
duplicate_rate	2.126e-05
vocab_size	19,999
readability_flesch_mean	-811.9
emoji_rate	0
url_rate	1
one_word_rate	1
allcaps_rate	0
boilerplate_rate	0
alert: near_unique	100.0% of rows are unique strings
alert: one_word	100.0% rows are a single word
alert: url_heavy	100.0% rows contain a URL
alert: null_rate	39.0% null
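
The duplicate_rate figures in the stats tables appear to use non-null rows as the denominator rather than all 77,145 rows. The sketch below recomputes the rate for each profiled column from the n, nulls, and n_duplicates values reported in this document. `duplicate_rate` is a hypothetical helper, and the formula is inferred from the numbers rather than taken from Saturn's documentation.

```python
def duplicate_rate(n_rows, n_nulls, n_duplicates):
    """Duplicates as a fraction of non-null rows."""
    return n_duplicates / (n_rows - n_nulls)

# (n, nulls, n_duplicates, reported duplicate_rate), copied from the stats tables.
reported = {
    "image_url": (77145, 30109, 1, 2.126e-05),
    "ingredients_text": (77145, 32334, 13909, 0.3104),
    "product_name": (77145, 805, 23564, 0.3087),
    "quantity": (77145, 44503, 29540, 0.905),
}

for name, (n, nulls, dups, rate) in reported.items():
    print(f"{name}: computed {duplicate_rate(n, nulls, dups):.4g}, reported {rate}")
```

Read this way, a rate such as 30.9% for product_name describes the populated rows only, which matters when comparing columns with very different null rates.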

Fig 7.

Character-length distribution for image_url.


Character-length distribution for image_url (mean: 84.24347308444595).
chars	count
75 – 76	2
76 – 76	2
76 – 77	0
77 – 77	0
77 – 78	0
78 – 78	0
78 – 79	0
79 – 79	0
79 – 80	0
80 – 80	0
80 – 81	0
81 – 81	0
81 – 82	0
82 – 82	0
82 – 83	0
83 – 83	0
83 – 84	0
84 – 84	36226
84 – 85	0
85 – 86	10558
86 – 86	163
86 – 87	0
87 – 87	4
87 – 88	0
88 – 88	0
88 – 89	0
89 – 89	33
89 – 90	0
90 – 90	3
90 – 91	0
91 – 91	11
91 – 92	0
92 – 92	8
92 – 93	0
93 – 93	9
93 – 94	0
94 – 94	3
94 – 95	0
95 – 95	13
95 – 96	1

ingredients_text text free_text

Free-text ingredient lists from what looks like a food product database, dominated by English (1,623 detected) but spanning 31 languages, including French (212), German (128), and trace Thai, Korean, and Arabic. 41.91% of values are null, 4,981 are empty strings, and another 2,553 read literally as 'undefined'. 31.04% of non-null entries are exact duplicates: cheese-style boilerplate like 'Pasteurized milk, cheese culture, salt, enzymes.' recurs hundreds of times. Texts are short on average (word_median 16, len_median 126) but stretch to 5,283 characters, and the Flesch mean of 16.0 reflects dense comma-separated lists rather than prose.

Treatment: Normalise 'undefined'/empty to null, dedupe, and tokenize per detected language before embedding.

anthropic:claude-opus-4-7 · confidence high
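
The first two steps of that treatment, mapping placeholders to null and removing exact duplicates, could be sketched as follows. The placeholder set covers only the two cases the report names (empty strings and the literal 'undefined'). The whitespace collapsing is an extra assumption, and per-language tokenization is left out. `clean_ingredients` and `unique_non_null` are hypothetical helpers.

```python
PLACEHOLDERS = {"", "undefined"}  # values the report flags as empty or literal 'undefined'

def clean_ingredients(values):
    """Map placeholder strings to None and collapse runs of whitespace."""
    cleaned = []
    for value in values:
        text = " ".join(value.split()) if value is not None else ""
        cleaned.append(None if text.lower() in PLACEHOLDERS else text)
    return cleaned

def unique_non_null(values):
    """Drop None and exact duplicates while preserving first-seen order."""
    return list(dict.fromkeys(v for v in values if v is not None))

raw = ["undefined", "", " Milk,  salt ", None, "Milk, salt"]
cleaned = clean_ingredients(raw)
print(cleaned)
print(unique_non_null(cleaned))
```

Keeping the cleaned list aligned with the original rows, and deduplicating separately, leaves row-level joins intact while still giving a compact set of texts to embed.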

Out[21]:

saturn.columns["ingredients_text"].stats

stat	value
n	77,145
nulls	32,334 (41.9%)
unique	30,902
len_min	0
len_max	5,283
len_mean	238.2
len_median	126
len_p95	842
word_mean	31.64
word_median	16
n_empty	4,981
n_duplicates	13,909
duplicate_rate	0.3104
vocab_size	35,093
readability_flesch_mean	16.01
emoji_rate	0.0002232
url_rate	0.002299
one_word_rate	0.1733
allcaps_rate	0.08565
boilerplate_rate	0
alert: multilingual	31 languages detected in sample
alert: null_rate	41.9% null
alert: duplicates	31.0% duplicate strings

Fig 8.

Character-length distribution for ingredients_text.

Data table: identical to the ingredients_text character-length table under Fig 3.

labels_tags unknown other

The profiler skipped the column "labels_tags", so no type, cardinality, or value statistics are available beyond a row count of 77,145 and a null rate of 0.0. The name suggests a multi-valued tag or label field (likely comma- or pipe-delimited strings), but this cannot be confirmed from the evidence. No surprises can be flagged without underlying stats.

Treatment: Re-run the profiler with parsing for list-like strings before deciding on encoding.

anthropic:claude-opus-4-7 · confidence low

Out[24]:

saturn.columns["labels_tags"].stats

stat	value
n	77,145
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

nutrition_grades_tags unknown label

This column is named nutrition_grades_tags, suggesting it holds Nutri-Score-style grade tags (e.g., a to e) for food products. Saturn skipped profiling, so no uniqueness, value distribution, or stats are available beyond a row count of 77,145 and a null rate of 0.0. Without cardinality or value samples, the actual content cannot be verified from the evidence.

Treatment: Re-profile to recover value distribution before deciding; if it resolves to a small grade vocabulary, one-hot encode.

anthropic:claude-opus-4-7 · confidence low

Out[26]:

saturn.columns["nutrition_grades_tags"].stats

stat	value
n	77,145
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

origins_tags unknown metadata

The profiler skipped column `origins_tags`, so no type, cardinality, or value statistics are available beyond a row count of 77,145 and a null rate of 0.0. The name suggests a tag-style field listing geographic origins, likely multi-valued (comma- or pipe-delimited) per row, but this cannot be confirmed from the evidence. No distributional signals can be assessed until the column is re-profiled.

Treatment: Re-run the profiler on this column; if multi-valued, split on the delimiter and one-hot or count-encode the tags.

anthropic:claude-opus-4-7 · confidence low

Out[28]:

saturn.columns["origins_tags"].stats

stat	value
n	77,145
nulls	0 (0.0%)
unique	—
alert: skipped	no profiler for kind=unknown

product_name text free_text

Short product-name strings (mean 27.7 chars, median 4 words) for what looks like a cheese-dominated catalog: 'cheese' appears 13,869 times and the top values are 'Cottage Cheese', 'Cheese', 'Mozzarella', etc. Of the 76,340 non-null rows, 30.9% (23,564) are exact duplicates. Case-variant pairs like 'Cottage Cheese'/'cottage cheese' and 'Cream Cheese'/'Cream cheese' are counted as distinct strings, so the true duplication is higher still. 349 rows are empty and null_rate is 1.04%. Although English dominates (3,820 detected), there is a multilingual tail (fr 315, de 169, it 163, es 114) that will fragment exact-match joins.

Treatment: Normalize case and whitespace to collapse near-duplicates, then tokenize/embed for modelling or fuzzy matching.

anthropic:claude-opus-4-7 · confidence high
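
The case and whitespace normalization suggested above could be sketched like this. Case-folding plus whitespace collapsing is one reasonable choice of normalization key, not the only one. `normalize_name` and `collapsed_counts` are hypothetical helpers, and the sample names are the case-variant pairs quoted in the note.

```python
from collections import Counter

def normalize_name(name):
    """Case-fold and collapse whitespace so case variants map to one key."""
    return " ".join(name.split()).casefold()

def collapsed_counts(names):
    """Count product names by normalized key, skipping null and blank entries."""
    return Counter(normalize_name(n) for n in names if n and n.strip())

names = ["Cottage Cheese", "cottage cheese", "Cream  Cheese", "Cream cheese", "", None]
print(collapsed_counts(names))
```

Comparing the number of normalized keys with the reported 52,776 unique strings would show how much of the apparent variety is only a difference in casing or spacing.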

Out[30]:

saturn.columns["product_name"].stats

stat	value
n	77,145
nulls	805 (1.0%)
unique	52,776
len_min	0
len_max	254
len_mean	27.75
len_median	24
len_p95	58
word_mean	4.269
word_median	4
n_empty	349
n_duplicates	23,564
duplicate_rate	0.3087
vocab_size	8,852
readability_flesch_mean	63.62
emoji_rate	0.0003537
url_rate	0
one_word_rate	0.05946
allcaps_rate	0.01347
boilerplate_rate	0
alert: multilingual	31 languages detected in sample
alert: duplicates	30.9% duplicate strings

Fig 9.

Character-length distribution for product_name.

Data table: identical to the product_name character-length table under Fig 2.

quantity text feature

This column captures food portion sizes as free-text quantity strings (e.g. '1 serving(s)', '8 oz', '200 g'), mixing grams, ounces, and serving counts. It is sparse and repetitive: 57.69% of rows are null, 4,252 are empty, and 90.5% of non-null values are duplicates, drawn from a vocabulary of just 1,802 tokens. Formatting is inconsistent: '200 g' (1,469) and '200g' (696) are stored as separate strings, so naive grouping splits one quantity across several keys and undercounts it.

Treatment: Parse into numeric magnitude + normalized unit (g/oz/serving), collapsing variants like '200 g' and '200g'.

anthropic:claude-opus-4-7 · confidence high
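
Parsing quantity into a numeric magnitude and a normalized unit could be sketched as below. The unit alias table is an illustrative assumption covering only the units named in the note, and the pattern accepts just the simple '<number><optional space><unit>' shape. A comma is read as a decimal separator, so a thousands separator would be misparsed. `parse_quantity` is a hypothetical helper.

```python
import re

# Illustrative aliases only; extend after inspecting the real values.
UNIT_ALIASES = {
    "g": "g", "gr": "g", "gram": "g", "grams": "g",
    "oz": "oz", "ounce": "oz", "ounces": "oz",
    "serving": "serving", "servings": "serving", "serving(s)": "serving",
}

QUANTITY_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([^\d\s].*?)\s*$")

def parse_quantity(text):
    """Return (magnitude, unit), or None when the string does not fit '<number> <unit>'."""
    if not text:
        return None
    match = QUANTITY_RE.match(text)
    if not match:
        return None
    magnitude = float(match.group(1).replace(",", "."))
    unit = UNIT_ALIASES.get(match.group(2).lower())
    return (magnitude, unit) if unit else None

for raw in ("200 g", "200g", "8 oz", "1 serving(s)", "", "approx. one wheel"):
    print(repr(raw), "->", parse_quantity(raw))
```

With this key, '200 g' and '200g' collapse to the same (200.0, 'g') pair. Strings that return None are worth logging rather than dropping, since they show which formats the pattern misses.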

Out[33]:

saturn.columns["quantity"].stats

stat	value
n	77,145
nulls	44,503 (57.7%)
unique	3,102
len_min	0
len_max	58
len_mean	4.74
len_median	5
len_p95	12
word_mean	1.791
word_median	2
n_empty	4,252
n_duplicates	29,540
duplicate_rate	0.905
vocab_size	1,802
readability_flesch_mean	102.6
emoji_rate	0
url_rate	0
one_word_rate	0.2975
allcaps_rate	0.01042
boilerplate_rate	0
alert: one_word	29.7% rows are a single word
alert: null_rate	57.7% null
alert: short_text	95th-percentile length under 20 chars
alert: duplicates	90.5% duplicate strings

Fig 10.

Character-length distribution for quantity.

Data table: identical to the quantity character-length table under Fig 1.
