quirky openfoodfacts cheese 20260121
Reading
This dataset contains 77,145 product records from Open Food Facts, focused on cheese products, with 10 columns covering names, ingredients, quantities, image URLs, and six tag fields (brands, categories, countries, labels, nutrition grades, origins). The text fields are highly multilingual: product_name spans 30+ languages, led by English (3,820 detected) with French (315) a distant second, and ingredients_text shows the same pattern. Two things deserve a closer look first. The heavy null rates on quantity (57.7%) and ingredients_text (41.9%) will limit any analysis that depends on those fields. Duplication in quantity is also strong: 90.5% of its non-null values are duplicates, with strings like '1 serving(s)', '8 oz', and '200 g' recurring thousands of times. Product names duplicate substantially as well (30.9%), with 'Cottage Cheese', 'Cheese', and 'Mozzarella' appearing as common generic labels. The six tag-style columns were skipped during profiling, so their structure is not yet characterized.
citing: row_count · column_count · columns.quantity.null_rate · columns.quantity.stats.duplicate_rate · columns.quantity.top_values · columns.ingredients_text.null_rate · columns.ingredients_text.language_counts · columns.product_name.language_counts · columns.product_name.top_values · columns.product_name.stats.duplicate_rate · columns.image_url.null_rate
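The rates quoted in the summary can be re-derived from the raw counts listed in the per-column stats further down. The sketch below assumes that duplicate_rate is n_duplicates divided by the number of non-null values. The report does not state this definition, but it reproduces the published figures. The function and constant names are illustrative only.

```python
# Counts copied from the per-column stats in this report.
ROW_COUNT = 77_145

COLUMN_COUNTS = {
    # column: (nulls, n_duplicates)
    "quantity": (44_503, 29_540),
    "ingredients_text": (32_334, 13_909),
    "product_name": (805, 23_564),
}


def null_rate(nulls: int, rows: int = ROW_COUNT) -> float:
    """Share of all rows whose value is null."""
    return nulls / rows


def duplicate_rate(nulls: int, n_duplicates: int, rows: int = ROW_COUNT) -> float:
    """Share of non-null values that are duplicates (assumed definition)."""
    return n_duplicates / (rows - nulls)


for name, (nulls, dups) in COLUMN_COUNTS.items():
    print(
        f"{name}: null_rate={null_rate(nulls):.3f} "
        f"duplicate_rate={duplicate_rate(nulls, dups):.4f}"
    )
```

Dividing by all 77,145 rows instead would give a quantity duplicate rate near 0.38 rather than 0.905, which is why the denominator matters when comparing columns with very different null rates.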
Charts the summary said to look at first
quantity: value length in characters (range matches len_max 58)
| chars | count |
|---|---|
| 0 – 1 | 4393 |
| 1 – 3 | 66 |
| 3 – 4 | 10984 |
| 4 – 6 | 12621 |
| 6 – 7 | 1264 |
| 7 – 9 | 413 |
| 9 – 10 | 462 |
| 10 – 12 | 129 |
| 12 – 13 | 1845 |
| 13 – 14 | 80 |
| 14 – 16 | 51 |
| 16 – 17 | 142 |
| 17 – 19 | 49 |
| 19 – 20 | 48 |
| 20 – 22 | 14 |
| 22 – 23 | 21 |
| 23 – 25 | 4 |
| 25 – 26 | 13 |
| 26 – 28 | 3 |
| 28 – 29 | 5 |
| 29 – 30 | 8 |
| 30 – 32 | 5 |
| 32 – 33 | 4 |
| 33 – 35 | 0 |
| 35 – 36 | 3 |
| 36 – 38 | 1 |
| 38 – 39 | 1 |
| 39 – 41 | 2 |
| 41 – 42 | 6 |
| 42 – 44 | 1 |
| 44 – 45 | 1 |
| 45 – 46 | 1 |
| 46 – 48 | 0 |
| 48 – 49 | 1 |
| 49 – 51 | 0 |
| 51 – 52 | 0 |
| 52 – 54 | 0 |
| 54 – 55 | 0 |
| 55 – 57 | 0 |
| 57 – 58 | 1 |
product_name: value length in characters (range matches len_max 254)
| chars | count |
|---|---|
| 0 – 6 | 1658 |
| 6 – 13 | 7931 |
| 13 – 19 | 16585 |
| 19 – 25 | 15617 |
| 25 – 32 | 12097 |
| 32 – 38 | 8979 |
| 38 – 44 | 4569 |
| 44 – 51 | 2949 |
| 51 – 57 | 2054 |
| 57 – 64 | 1107 |
| 64 – 70 | 796 |
| 70 – 76 | 562 |
| 76 – 83 | 297 |
| 83 – 89 | 201 |
| 89 – 95 | 135 |
| 95 – 102 | 105 |
| 102 – 108 | 99 |
| 108 – 114 | 96 |
| 114 – 121 | 63 |
| 121 – 127 | 70 |
| 127 – 133 | 54 |
| 133 – 140 | 56 |
| 140 – 146 | 49 |
| 146 – 152 | 35 |
| 152 – 159 | 37 |
| 159 – 165 | 30 |
| 165 – 171 | 13 |
| 171 – 178 | 22 |
| 178 – 184 | 11 |
| 184 – 190 | 12 |
| 190 – 197 | 15 |
| 197 – 203 | 10 |
| 203 – 210 | 4 |
| 210 – 216 | 5 |
| 216 – 222 | 5 |
| 222 – 229 | 4 |
| 229 – 235 | 2 |
| 235 – 241 | 2 |
| 241 – 248 | 1 |
| 248 – 254 | 3 |
ingredients_text: value length in characters (range matches len_max 5,283)
| chars | count |
|---|---|
| 0 – 132 | 22912 |
| 132 – 264 | 8740 |
| 264 – 396 | 4217 |
| 396 – 528 | 3155 |
| 528 – 660 | 1952 |
| 660 – 792 | 1268 |
| 792 – 925 | 810 |
| 925 – 1057 | 581 |
| 1057 – 1189 | 378 |
| 1189 – 1321 | 282 |
| 1321 – 1453 | 173 |
| 1453 – 1585 | 115 |
| 1585 – 1717 | 57 |
| 1717 – 1849 | 38 |
| 1849 – 1981 | 37 |
| 1981 – 2113 | 29 |
| 2113 – 2245 | 29 |
| 2245 – 2377 | 7 |
| 2377 – 2509 | 12 |
| 2509 – 2642 | 5 |
| 2642 – 2774 | 3 |
| 2774 – 2906 | 1 |
| 2906 – 3038 | 3 |
| 3038 – 3170 | 2 |
| 3170 – 3302 | 0 |
| 3302 – 3434 | 0 |
| 3434 – 3566 | 1 |
| 3566 – 3698 | 0 |
| 3698 – 3830 | 1 |
| 3830 – 3962 | 1 |
| 3962 – 4094 | 0 |
| 4094 – 4226 | 0 |
| 4226 – 4358 | 0 |
| 4358 – 4491 | 0 |
| 4491 – 4623 | 0 |
| 4623 – 4755 | 1 |
| 4755 – 4887 | 0 |
| 4887 – 5019 | 0 |
| 5019 – 5151 | 0 |
| 5151 – 5283 | 1 |
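The length tables in this section each have 40 rows whose edges run from 0 to the column's len_max, which suggests equal-width binning with rounded edge labels. A minimal sketch of that binning follows, under that assumption. How the profiler treats values that fall exactly on a bin edge is unknown, and `length_histogram` is a hypothetical helper, not part of any profiling tool.

```python
def length_histogram(values, n_bins=40):
    """Count string lengths in n_bins equal-width bins over [0, max_len].

    Returns a list of (lower_edge, upper_edge, count) tuples.
    None values are skipped; empty strings count as length 0.
    """
    lengths = [len(v) for v in values if v is not None]
    if not lengths:
        return []
    max_len = max(lengths)
    width = max_len / n_bins if max_len else 1.0
    counts = [0] * n_bins
    for n in lengths:
        # Clamp so the maximum length lands in the last bin.
        idx = min(int(n / width), n_bins - 1)
        counts[idx] += 1
    return [(i * width, (i + 1) * width, c) for i, c in enumerate(counts)]


# Tiny example with 4 bins of width 2 over lengths 0..8.
for lower, upper, count in length_histogram(["", "ab", "abcd", None, "abcdefgh"], n_bins=4):
    print(f"{lower:g} to {upper:g}: {count}")
```

With this scheme the first bin absorbs empty strings, which is consistent with the large first-bin count in the shortest of the three tables.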
Schema
10 columns
| Column | Type | Null rate | Unique | Alerts |
|---|---|---|---|---|
| brands_tags | unknown | 0.0% | n/a | skipped |
| categories_tags | unknown | 0.0% | n/a | skipped |
| countries_tags | unknown | 0.0% | n/a | skipped |
| image_url | text | 39.0% | 47,035 | near_unique, one_word, url_heavy, null_rate |
| ingredients_text | text | 41.9% | 30,902 | multilingual, null_rate, duplicates |
| labels_tags | unknown | 0.0% | n/a | skipped |
| nutrition_grades_tags | unknown | 0.0% | n/a | skipped |
| origins_tags | unknown | 0.0% | n/a | skipped |
| product_name | text | 1.0% | 52,776 | multilingual, duplicates |
| quantity | text | 57.7% | 3,102 | one_word, null_rate, short_text, duplicates |
image_url
text metadata near_unique one_word url_heavy null_rate
This column holds Open Food Facts product image URLs, one per row; a url_rate of 1.0 and a word_mean of 1.0 confirm that each value is a single link. It is 39.03% null, so coverage is partial, and near-unique among the rest: 47,035 distinct values across 47,036 non-null rows, meaning only one duplicate URL appears. Lengths are tightly bounded (75-96 chars), which is consistent with a fixed CDN path template. Treatment: Drop for modelling; retain as a reference link for image fetching or manual inspection.
- n: 77,145
- nulls: 30,109 (39.0%)
- unique: 47,035
- len_min: 75
- len_max: 96
- len_mean: 84.24
- len_median: 84
- len_p95: 85
- word_mean: 1
- word_median: 1
- n_empty: 0
- n_duplicates: 1
- duplicate_rate: 2.126e-05
- vocab_size: 19,999
- readability_flesch_mean: -811.9
- emoji_rate: 0
- url_rate: 1
- one_word_rate: 1
- allcaps_rate: 0
- boilerplate_rate: 0
ingredients_text
text free_text multilingual null_rate duplicates
Free-text ingredient lists from what looks like a food product database, dominated by English (1,623 rows) but spanning 30 languages, including French (212), German (128), and trace Thai, Korean, and Arabic. 41.91% of values are null, 4,981 are empty strings, and another 2,553 read literally as 'undefined'. 31.04% of non-null entries are exact duplicates: cheese-style boilerplate like 'Pasteurized milk, cheese culture, salt, enzymes.' recurs hundreds of times. Texts are short at the median (word_median 16, len_median 126) but stretch to 5,283 characters, and the Flesch mean of 16.0 reflects dense comma-separated lists rather than prose. Treatment: Normalise 'undefined'/empty to null, dedupe, and tokenize per detected language before embedding.
- n: 77,145
- nulls: 32,334 (41.9%)
- unique: 30,902
- len_min: 0
- len_max: 5,283
- len_mean: 238.2
- len_median: 126
- len_p95: 842
- word_mean: 31.64
- word_median: 16
- n_empty: 4,981
- n_duplicates: 13,909
- duplicate_rate: 0.3104
- vocab_size: 35,093
- readability_flesch_mean: 16.01
- emoji_rate: 0.0002232
- url_rate: 0.002299
- one_word_rate: 0.1733
- allcaps_rate: 0.08565
- boilerplate_rate: 0
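The first two steps of the ingredients_text treatment (mapping placeholder strings to null, then dropping exact duplicates) could look roughly like the sketch below. The only placeholder the profile confirms is the literal 'undefined'. Matching it case-insensitively and trimming surrounding whitespace are assumptions, and the helper names are illustrative.

```python
from typing import Iterable, List, Optional

# Strings treated as missing after trimming and lower-casing.
# 'undefined' is reported in the profile; extend the tuple if other
# placeholders turn up in the data.
MISSING_TOKENS = ("", "undefined")


def normalise_missing(value: Optional[str]) -> Optional[str]:
    """Return the trimmed text, or None for null, empty, or placeholder values."""
    if value is None:
        return None
    stripped = value.strip()
    if stripped.lower() in MISSING_TOKENS:
        return None
    return stripped


def dedupe(values: Iterable[Optional[str]]) -> List[str]:
    """Drop missing values and exact duplicates, keeping first-seen order."""
    seen = set()
    out: List[str] = []
    for raw in values:
        text = normalise_missing(raw)
        if text is None or text in seen:
            continue
        seen.add(text)
        out.append(text)
    return out


sample = [
    "Pasteurized milk, cheese culture, salt, enzymes.",
    "undefined",
    "",
    None,
    " Pasteurized milk, cheese culture, salt, enzymes. ",
    "Lait, sel",
]
print(dedupe(sample))
```

Deduplicating before language detection and tokenization keeps the recurring boilerplate lists from dominating whatever is built downstream, though whether duplicates should be dropped or merely down-weighted depends on the task.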
product_name
text free_text multilingual duplicates
Short product-name strings (mean 27.7 chars, median 4 words) for what looks like a cheese-dominated catalog: 'cheese' appears 13,869 times and the top values are 'Cottage Cheese', 'Cheese', 'Mozzarella', etc. Of the 77,145 rows, 805 (1.04%) are null and 349 are empty. Among the 76,340 non-null values, 30.9% (23,564) are exact duplicates. Case-variant pairs like 'Cottage Cheese'/'cottage cheese' and 'Cream Cheese'/'Cream cheese' are counted as distinct values, so the exact-match figure understates the real repetition. Although English dominates (3,820 detected), there is a multilingual tail (fr 315, de 169, it 163, es 114) that will fragment exact-match joins. Treatment: Normalize case and whitespace to collapse near-duplicates, then tokenize/embed for modelling or fuzzy matching.
- n: 77,145
- nulls: 805 (1.0%)
- unique: 52,776
- len_min: 0
- len_max: 254
- len_mean: 27.75
- len_median: 24
- len_p95: 58
- word_mean: 4.269
- word_median: 4
- n_empty: 349
- n_duplicates: 23,564
- duplicate_rate: 0.3087
- vocab_size: 8,852
- readability_flesch_mean: 63.62
- emoji_rate: 0.0003537
- url_rate: 0
- one_word_rate: 0.05946
- allcaps_rate: 0.01347
- boilerplate_rate: 0
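The case and whitespace normalization recommended for product_name can be sketched as follows. Unicode NFKC normalization and `str.casefold` are choices made here for the multilingual tail, not something the profile prescribes, and `normalise_name` and `count_normalised` are hypothetical helpers.

```python
import unicodedata
from collections import Counter


def normalise_name(name):
    """Return a comparison key for a product name, or None if it is missing/empty."""
    if name is None:
        return None
    text = unicodedata.normalize("NFKC", name)
    # split() with no arguments trims the ends and collapses whitespace runs.
    text = " ".join(text.split())
    return text.casefold() or None


def count_normalised(names):
    """Count names by normalized key, skipping missing and empty values."""
    return Counter(key for key in map(normalise_name, names) if key is not None)


names = ["Cottage Cheese", "cottage cheese", "Cream Cheese", "Cream  cheese ", "", None]
counts = count_normalised(names)
print(counts.most_common())
```

The key is meant for grouping and joins only; keeping the original string alongside it preserves the display form. Translations of the same product name will still land under different keys, so this step does not replace fuzzy or embedding-based matching.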
quantity
text feature one_word null_rate short_text duplicates
This column captures food portion sizes as free-text quantity strings (e.g. '1 serving(s)', '8 oz', '200 g'), mixing grams, ounces, and serving counts. It is sparse and repetitive: 57.69% of rows are null, 4,252 are empty, and 90.5% of non-null values are duplicates drawn from a vocabulary of just 1,802 tokens. Formatting is inconsistent: '200 g' (1,469) and '200g' (696) are stored as separate strings, so naive grouping splits one quantity across several keys and undercounts it. Treatment: Parse into numeric magnitude + normalized unit (g/oz/serving), collapsing variants like '200 g' and '200g'.
- n: 77,145
- nulls: 44,503 (57.7%)
- unique: 3,102
- len_min: 0
- len_max: 58
- len_mean: 4.74
- len_median: 5
- len_p95: 12
- word_mean: 1.791
- word_median: 2
- n_empty: 4,252
- n_duplicates: 29,540
- duplicate_rate: 0.905
- vocab_size: 1,802
- readability_flesch_mean: 102.6
- emoji_rate: 0
- url_rate: 0
- one_word_rate: 0.2975
- allcaps_rate: 0.01042
- boilerplate_rate: 0
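The quantity treatment (magnitude plus normalized unit) could be approached along these lines. This is a sketch that only handles the simple "number then unit" shape seen in the top values. The alias table is an assumption beyond g, oz, and serving, a comma is treated as a decimal separator (which would misread thousands separators), and anything the pattern does not recognise is returned as None rather than guessed.

```python
import re
from typing import NamedTuple, Optional


class Quantity(NamedTuple):
    magnitude: float
    unit: str


# Unit spellings mapped to a canonical unit. Only g, oz and serving are
# attested in the profile; the other aliases are illustrative assumptions.
UNIT_ALIASES = {
    "g": "g", "gr": "g", "gram": "g", "grams": "g",
    "kg": "kg",
    "oz": "oz", "ounce": "oz", "ounces": "oz",
    "lb": "lb", "lbs": "lb",
    "serving": "serving", "servings": "serving", "serving(s)": "serving",
}

# A number (optional decimal part), optional space, a unit word, optional dot.
_QUANTITY_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-z()]+)\.?\s*$", re.IGNORECASE)


def parse_quantity(raw: Optional[str]) -> Optional[Quantity]:
    """Parse strings like '200 g' or '8 oz'; return None when unsure."""
    if raw is None:
        return None
    match = _QUANTITY_RE.match(raw)
    if match is None:
        return None
    number, unit = match.groups()
    canonical = UNIT_ALIASES.get(unit.lower())
    if canonical is None:
        return None
    # Assumption: a comma is a decimal separator, as in '1,5 kg'.
    return Quantity(float(number.replace(",", ".")), canonical)


for text in ["200 g", "200g", "8 oz", "1 serving(s)", "2 x 125 g", ""]:
    print(repr(text), "->", parse_quantity(text))
```

With this parser '200 g' and '200g' collapse to the same key. Multi-pack strings such as '2 x 125 g' are deliberately left unparsed so they can be reviewed instead of being silently misread.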