quirky openfoodfacts cheese 20260121

source /home/coolhand/html/datavis/data_trove/cache/quirky/openfoodfacts_cheese_20260121.parquet · 77,145 rows · 10 columns · profiled 2026-05-01

Reading

dataset summary · high confidence anthropic:claude-opus-4-7

This dataset contains 77,145 product records from Open Food Facts, focused on cheese products, with 10 columns covering names, ingredients, quantities, image URLs, and several tag fields (brands, categories, countries, labels, nutrition grades, origins). The text fields are highly multilingual — product_name spans 30+ languages with English (3,820) and French (315) dominating, and ingredients_text shows the same pattern. Two things deserve a closer look first: the heavy null rates on quantity (57.7%) and ingredients_text (41.9%), which will limit any analysis depending on those fields, and the strong duplication in quantity (90.5% duplicate rate) where values like '1 serving(s)', '8 oz', and '200 g' recur thousands of times. Product names also duplicate substantially (30.9%), with 'Cottage Cheese', 'Cheese', and 'Mozzarella' appearing as common generic labels. Note that the six tag-style columns were skipped during profiling, so their structure is not yet characterized.

citing: row_count · column_count · columns.quantity.null_rate · columns.quantity.stats.duplicate_rate · columns.quantity.top_values · columns.ingredients_text.null_rate · columns.ingredients_text.language_counts · columns.product_name.language_counts · columns.product_name.top_values · columns.product_name.stats.duplicate_rate · columns.image_url.null_rate

Charts the summary said to look at first

quantity · Most common quantity strings — note inconsistent formatting like '200 g' vs '200g' that will need normalization.

Character-length distribution for quantity (mean: 4.74).
chars	count
0 – 1	4393
1 – 3	66
3 – 4	10984
4 – 6	12621
6 – 7	1264
7 – 9	413
9 – 10	462
10 – 12	129
12 – 13	1845
13 – 14	80
14 – 16	51
16 – 17	142
17 – 19	49
19 – 20	48
20 – 22	14
22 – 23	21
23 – 25	4
25 – 26	13
26 – 28	3
28 – 29	5
29 – 30	8
30 – 32	5
32 – 33	4
33 – 35	0
35 – 36	3
36 – 38	1
38 – 39	1
39 – 41	2
41 – 42	6
42 – 44	1
44 – 45	1
45 – 46	1
46 – 48	0
48 – 49	1
49 – 51	0
51 – 52	0
52 – 54	0
54 – 55	0
55 – 57	0
57 – 58	1

product_name · Top product names are dominated by generic terms like 'Cottage Cheese' and 'Mozzarella', suggesting many non-unique entries.

Character-length distribution for product_name (mean: 27.75).
chars	count
0 – 6	1658
6 – 13	7931
13 – 19	16585
19 – 25	15617
25 – 32	12097
32 – 38	8979
38 – 44	4569
44 – 51	2949
51 – 57	2054
57 – 64	1107
64 – 70	796
70 – 76	562
76 – 83	297
83 – 89	201
89 – 95	135
95 – 102	105
102 – 108	99
108 – 114	96
114 – 121	63
121 – 127	70
127 – 133	54
133 – 140	56
140 – 146	49
146 – 152	35
152 – 159	37
159 – 165	30
165 – 171	13
171 – 178	22
178 – 184	11
184 – 190	12
190 – 197	15
197 – 203	10
203 – 210	4
210 – 216	5
216 – 222	5
222 – 229	4
229 – 235	2
235 – 241	2
241 – 248	1
248 – 254	3

ingredients_text · Ingredient text length is highly skewed (median 126 chars, max 5,283) — useful for spotting sparse vs. detailed entries.

Character-length distribution for ingredients_text (mean: 238.2).
chars	count
0 – 132	22912
132 – 264	8740
264 – 396	4217
396 – 528	3155
528 – 660	1952
660 – 792	1268
792 – 925	810
925 – 1057	581
1057 – 1189	378
1189 – 1321	282
1321 – 1453	173
1453 – 1585	115
1585 – 1717	57
1717 – 1849	38
1849 – 1981	37
1981 – 2113	29
2113 – 2245	29
2245 – 2377	7
2377 – 2509	12
2509 – 2642	5
2642 – 2774	3
2774 – 2906	1
2906 – 3038	3
3038 – 3170	2
3170 – 3302	0
3302 – 3434	0
3434 – 3566	1
3566 – 3698	0
3698 – 3830	1
3830 – 3962	1
3962 – 4094	0
4094 – 4226	0
4226 – 4358	0
4358 – 4491	0
4491 – 4623	0
4623 – 4755	1
4755 – 4887	0
4887 – 5019	0
5019 – 5151	0
5151 – 5283	1

product_name · Product name length distribution shows most names are short (median 24 chars), with a long tail of verbose entries.

The underlying character-length distribution for product_name is the same table shown earlier in this section.

Schema

10 columns

Per-column summary.
Column	Type	Nulls	Unique	Alerts
brands_tags	unknown	0.0%	—	skipped
categories_tags	unknown	0.0%	—	skipped
countries_tags	unknown	0.0%	—	skipped
image_url	text	39.0%	47,035	near_unique one_word url_heavy null_rate
ingredients_text	text	41.9%	30,902	multilingual null_rate duplicates
labels_tags	unknown	0.0%	—	skipped
nutrition_grades_tags	unknown	0.0%	—	skipped
origins_tags	unknown	0.0%	—	skipped
product_name	text	1.0%	52,776	multilingual duplicates
quantity	text	57.7%	3,102	one_word null_rate short_text duplicates
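
The counts reported for the four profiled text columns appear to be tied together by simple arithmetic: unique equals the non-null row count minus n_duplicates, and duplicate_rate equals n_duplicates divided by the non-null row count. The Python sketch below checks this against the figures in the per-column details. The relationships are inferred from the numbers themselves, not from the profiler's documentation, and the names `REPORTED` and `check_column` are hypothetical.

```python
# Figures copied from the per-column details in this report:
# (n, nulls, unique, n_duplicates, duplicate_rate)
REPORTED = {
    "image_url": (77145, 30109, 47035, 1, 2.126e-05),
    "ingredients_text": (77145, 32334, 30902, 13909, 0.3104),
    "product_name": (77145, 805, 52776, 23564, 0.3087),
    "quantity": (77145, 44503, 3102, 29540, 0.905),
}


def check_column(n, nulls, unique, n_duplicates, duplicate_rate):
    """Return (non_null, derived_unique, derived_rate) for one column.

    Assumes duplicates are counted among non-null values only.
    """
    non_null = n - nulls
    derived_unique = non_null - n_duplicates
    derived_rate = n_duplicates / non_null
    return non_null, derived_unique, derived_rate


for name, row in REPORTED.items():
    non_null, derived_unique, derived_rate = check_column(*row)
    print(f"{name}: non_null={non_null} unique={derived_unique} rate={derived_rate:.4g}")
```

If the derived values match the reported ones, the duplicate rates in this report should be read as shares of non-null values, not of all 77,145 rows.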

brands_tags

unknown other skipped

The column brands_tags was skipped by the profiler, so no descriptive statistics, uniqueness count, or value samples are available. The only confirmed facts are that it has 77145 rows and a null_rate of 0.0, meaning every row carries some value. Based on the name alone it likely holds brand tag strings, but without evidence this is not verified. Treatment: Re-run profiling with string handling enabled before deciding on use. low · anthropic:claude-opus-4-7

n: 77,145
nulls: 0 (0.0%)
unique: —

categories_tags

unknown feature skipped

This column was skipped by the profiler, so no statistics, uniqueness, or value samples are available beyond a row count of 77145 and a null rate of 0.0. The name `categories_tags` suggests a multi-valued categorical field (likely a delimited list of category labels, e.g. from Open Food Facts), but this cannot be confirmed from the evidence. Treatment: Re-profile with list/string handling enabled, then split on the delimiter and one-hot or multi-hot encode before modelling. low · anthropic:claude-opus-4-7

n: 77,145
nulls: 0 (0.0%)
unique: —
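
The split-then-encode treatment suggested for categories_tags can be sketched as follows. This is a minimal illustration, assuming the column holds comma-delimited strings; the delimiter, the sample values, and the helper names `split_tags` and `multi_hot` are all assumptions, since the column was not profiled. If the column turns out to store lists instead of strings, the split step would not be needed.

```python
def split_tags(value, sep=","):
    """Split one delimited tag string into a list of stripped, non-empty tags."""
    if value is None:
        return []
    return [tag.strip() for tag in value.split(sep) if tag.strip()]


def multi_hot(values, sep=","):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if row i carries tag j."""
    tag_lists = [split_tags(v, sep) for v in values]
    vocabulary = sorted({tag for tags in tag_lists for tag in tags})
    index = {tag: j for j, tag in enumerate(vocabulary)}
    matrix = []
    for tags in tag_lists:
        row = [0] * len(vocabulary)
        for tag in tags:
            row[index[tag]] = 1
        matrix.append(row)
    return vocabulary, matrix


# Made-up sample values, not taken from the dataset.
sample = ["en:cheeses,en:dairies", "en:cheeses", ""]
vocab, encoded = multi_hot(sample)
print(vocab)
print(encoded)
```

For a column with many distinct tags, a sparse representation or a cap on vocabulary size would likely be preferable to a dense matrix.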

countries_tags

unknown feature skipped

The column `countries_tags` was skipped by the profiler, so no statistics beyond row count and null rate are available. All 77,145 rows are non-null, but uniqueness, distribution, and value examples are unknown. The name suggests a multi-valued tag field (e.g., comma- or colon-separated country codes like `en:france`), but this cannot be confirmed from the evidence. Treatment: Re-profile with a tag-aware parser (split on separator) before deciding on encoding or filtering. low · anthropic:claude-opus-4-7

n: 77,145
nulls: 0 (0.0%)
unique: —
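
A tag-aware parser of the kind suggested for countries_tags might look like the sketch below. It assumes comma-delimited strings whose tags may carry a language prefix such as `en:`; both assumptions are unverified, and `parse_tag` and `has_country` are hypothetical helper names.

```python
def parse_tag(tag):
    """Split 'en:france' into ('en', 'france'); tags without a prefix get ''."""
    prefix, sep, value = tag.strip().partition(":")
    if not sep:
        return "", prefix
    return prefix, value


def has_country(value, country, sep=","):
    """True if the delimited tag string contains the given country value."""
    tags = [t for t in value.split(sep) if t.strip()]
    return any(parse_tag(t)[1] == country for t in tags)


# Made-up sample values, not taken from the dataset.
rows = ["en:france,en:belgium", "en:united-states", "france"]
print([has_country(r, "france") for r in rows])
```

Separating the prefix from the value makes it possible to filter on the country name without depending on which language prefix a given row uses.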

image_url

text metadata near_unique one_word url_heavy null_rate

This column holds Open Food Facts product image URLs, one per row; a url_rate of 1.0 and word_mean of 1.0 confirm that each value is a single link. It is 39.03% null, so coverage is partial, but it is near-unique among the populated rows: 47,035 distinct values across 47,036 non-null entries, with only one duplicate URL. Lengths are tightly bounded (75-96 chars), consistent with a fixed CDN path template. Treatment: Drop for modelling; retain as a reference link for image fetching or manual inspection. high · anthropic:claude-opus-4-7

n: 77,145
nulls: 30,109 (39.0%)
unique: 47,035
len_min: 75
len_max: 96
len_mean: 84.24
len_median: 84
len_p95: 85
word_mean: 1
word_median: 1
n_empty: 0
n_duplicates: 1
duplicate_rate: 2.126e-05
vocab_size: 19,999
readability_flesch_mean: -811.9
emoji_rate: 0
url_rate: 1
one_word_rate: 1
allcaps_rate: 0
boilerplate_rate: 0

ingredients_text

text free_text multilingual null_rate duplicates

Free-text ingredient lists from what looks like a food product database, dominated by English (1,623 rows) but spanning 30 languages including French (212), German (128), and trace Thai/Korean/Arabic. 41.91% of values are null; among the non-null values, 4,981 are empty strings and another 2,553 read literally as 'undefined'. 31.04% of non-null entries are exact duplicates: cheese-style boilerplate like 'Pasteurized milk, cheese culture, salt, enzymes.' recurs hundreds of times. Texts are short on average (word_median 16, len_median 126) but stretch to 5,283 characters, and the Flesch mean of 16.0 reflects dense comma-separated lists rather than prose. Treatment: Normalise 'undefined'/empty to null, dedupe, and tokenize per detected language before embedding. high · anthropic:claude-opus-4-7

n: 77,145
nulls: 32,334 (41.9%)
unique: 30,902
len_min: 0
len_max: 5,283
len_mean: 238.2
len_median: 126
len_p95: 842
word_mean: 31.64
word_median: 16
n_empty: 4,981
n_duplicates: 13,909
duplicate_rate: 0.3104
vocab_size: 35,093
readability_flesch_mean: 16.01
emoji_rate: 0.0002232
url_rate: 0.002299
one_word_rate: 0.1733
allcaps_rate: 0.08565
boilerplate_rate: 0
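
The first two steps of the ingredients_text treatment (mapping placeholders to null, then dropping exact repeats) can be sketched in plain Python as below. Treating 'undefined' case-insensitively and stripping surrounding whitespace are assumptions, and `normalise_missing` and `dedupe` are hypothetical names; per-language tokenization is left out.

```python
PLACEHOLDERS = {"", "undefined"}


def normalise_missing(value):
    """Map None, empty/whitespace-only strings and 'undefined' to None."""
    if value is None:
        return None
    stripped = value.strip()
    if stripped.lower() in PLACEHOLDERS:
        return None
    return stripped


def dedupe(values):
    """Drop missing values and exact repeats, keeping first-seen order."""
    seen = set()
    result = []
    for value in map(normalise_missing, values):
        if value is not None and value not in seen:
            seen.add(value)
            result.append(value)
    return result


sample = [
    "Pasteurized milk, cheese culture, salt, enzymes.",
    "undefined",
    "",
    None,
    "Pasteurized milk, cheese culture, salt, enzymes.",
]
print(dedupe(sample))
```

Normalising placeholders before counting matters here because the reported null rate does not include the empty or 'undefined' rows, so the effective share of unusable values is higher than 41.91%.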

labels_tags

unknown other skipped

The column "labels_tags" was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 77145 and a null rate of 0.0. The name suggests a multi-valued tag or label field (likely comma- or pipe-delimited strings), but this cannot be confirmed from the evidence. No surprises can be flagged without underlying stats. Treatment: Re-run the profiler with parsing for list-like strings before deciding on encoding. low · anthropic:claude-opus-4-7

n: 77,145
nulls: 0 (0.0%)
unique: —

nutrition_grades_tags

unknown label skipped

This column is named nutrition_grades_tags, suggesting it holds Nutri-Score-style grade tags (e.g., a–e) for food products. Saturn skipped profiling, so no uniqueness, value distribution, or stats are available beyond a row count of 77145 with a null_rate of 0.0. Without cardinality or value samples, the actual content cannot be verified from the evidence. Treatment: Re-profile to recover value distribution before deciding; if it resolves to a small grade vocabulary, one-hot encode. low · anthropic:claude-opus-4-7

n: 77,145
nulls: 0 (0.0%)
unique: —
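
If re-profiling confirms a small grade vocabulary, the one-hot step could be as simple as the sketch below. The a to e vocabulary and the single-value-per-row shape are assumptions based only on the column name, and `one_hot_grade` is a hypothetical helper.

```python
GRADES = ["a", "b", "c", "d", "e"]  # assumed grade vocabulary, unverified


def one_hot_grade(value):
    """Return a 5-element 0/1 list; unknown or missing grades give all zeros."""
    grade = (value or "").strip().lower()
    return [1 if grade == g else 0 for g in GRADES]


# Made-up sample values, not taken from the dataset.
for raw in ["b", "E", "unknown", None]:
    print(raw, one_hot_grade(raw))
```

Mapping anything outside the assumed vocabulary to an all-zero row keeps unexpected values visible as a separate case instead of silently assigning them a grade.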

origins_tags

unknown metadata skipped

Column `origins_tags` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 77145 and a null rate of 0.0. The name suggests a tag-style field listing geographic origins, likely multi-valued (comma- or pipe-delimited) per row, but this cannot be confirmed from the evidence. No distributional signals can be assessed until the column is re-profiled. Treatment: Re-run the profiler on this column; if multi-valued, split on the delimiter and one-hot or count-encode the tags. low · anthropic:claude-opus-4-7

n: 77,145
nulls: 0 (0.0%)
unique: —

product_name

text free_text multilingual duplicates

Short product-name strings (mean 27.7 chars, median 4 words) for what looks like a cheese-dominated catalog: 'cheese' appears 13,869 times and the top values are 'Cottage Cheese', 'Cheese', 'Mozzarella', etc. Of the 76,340 non-null values, 30.9% are exact duplicates (23,564). Case-variant pairs like 'Cottage Cheese'/'cottage cheese' and 'Cream Cheese'/'Cream cheese' are stored as distinct strings, so the exact-match figure likely understates the real duplication. 349 rows are empty and null_rate is 1.04%. Although English dominates (3,820 detected), there is a multilingual tail (de 169, fr 315, it 163, es 114) that will fragment exact-match joins. Treatment: Normalize case and whitespace to collapse near-duplicates, then tokenize/embed for modelling or fuzzy matching. high · anthropic:claude-opus-4-7

n: 77,145
nulls: 805 (1.0%)
unique: 52,776
len_min: 0
len_max: 254
len_mean: 27.75
len_median: 24
len_p95: 58
word_mean: 4.269
word_median: 4
n_empty: 349
n_duplicates: 23,564
duplicate_rate: 0.3087
vocab_size: 8,852
readability_flesch_mean: 63.62
emoji_rate: 0.0003537
url_rate: 0
one_word_rate: 0.05946
allcaps_rate: 0.01347
boilerplate_rate: 0
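
The case and whitespace normalization suggested for product_name can be sketched with the standard library alone. Applying NFKC and `casefold()` is one reasonable choice among several and is an assumption here; `normalise_name` is a hypothetical name, and the double-space and trailing-space variants in the sample are made up for illustration.

```python
import unicodedata
from collections import Counter


def normalise_name(name):
    """Apply NFKC, casefold, and collapse runs of whitespace to one space."""
    text = unicodedata.normalize("NFKC", name)
    return " ".join(text.casefold().split())


names = ["Cottage Cheese", "cottage cheese", "Cream  Cheese", "Cream cheese "]
counts = Counter(normalise_name(n) for n in names)
print(counts)
```

Counting on the normalized key instead of the raw string lets the case-variant pairs collapse into a single group while the original spelling is kept for display.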

quantity

text feature one_word null_rate short_text duplicates

This column captures food portion sizes as free-text quantity strings (e.g. '1 serving(s)', '8 oz', '200 g'), mixing units of grams, ounces and serving counts. It is sparse and repetitive: 57.69% of rows are null, 4,252 are empty, and 90.5% of non-null values are duplicates drawn from a vocabulary of just 1,802 tokens. Formatting is inconsistent: '200 g' (1,469) and '200g' (696) are stored as separate strings, so naive grouping splits what is one quantity across several groups. Treatment: Parse into numeric magnitude + normalized unit (g/oz/serving), collapsing variants like '200 g' and '200g'. high · anthropic:claude-opus-4-7

n: 77,145
nulls: 44,503 (57.7%)
unique: 3,102
len_min: 0
len_max: 58
len_mean: 4.74
len_median: 5
len_p95: 12
word_mean: 1.791
word_median: 2
n_empty: 4,252
n_duplicates: 29,540
duplicate_rate: 0.905
vocab_size: 1,802
readability_flesch_mean: 102.6
emoji_rate: 0
url_rate: 0
one_word_rate: 0.2975
allcaps_rate: 0.01042
boilerplate_rate: 0
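
The parsing treatment suggested for quantity can be sketched with a regular expression and a unit alias table. The alias table below covers only the units named in this report plus a few guessed spellings, so it is an assumption to be extended after inspecting real values; `parse_quantity`, `UNIT_ALIASES`, and `QUANTITY_RE` are hypothetical names.

```python
import re

# Unit aliases are an assumption; extend after inspecting real values.
UNIT_ALIASES = {
    "g": "g", "gr": "g", "gram": "g", "grams": "g",
    "oz": "oz", "ounce": "oz", "ounces": "oz",
    "serving": "serving", "servings": "serving", "serving(s)": "serving",
}

# A leading number (optional decimal part), optional space, then a unit.
QUANTITY_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([^\d\s].*?)\s*$")


def parse_quantity(text):
    """Return (magnitude, unit), or None if the string does not parse."""
    if text is None:
        return None
    match = QUANTITY_RE.match(text)
    if not match:
        return None
    # A comma is read as a decimal mark; thousands separators would misparse.
    magnitude = float(match.group(1).replace(",", "."))
    unit = UNIT_ALIASES.get(match.group(2).lower())
    if unit is None:
        return None
    return magnitude, unit


for raw in ["200 g", "200g", "8 oz", "1 serving(s)", ""]:
    print(repr(raw), parse_quantity(raw))
```

Returning None for anything that does not match keeps unparsed strings countable, which helps decide which aliases or patterns to add next.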