saturn

quirky openfoodfacts cheese 20260121

source /home/coolhand/html/datavis/data_trove/cache/quirky/openfoodfacts_cheese_20260121.parquet · 77,145 rows · 10 columns · profiled 2026-05-01

Reading

dataset summary · high confidence · anthropic:claude-opus-4-7

This dataset contains 77,145 product records from Open Food Facts, focused on cheese products, with 10 columns covering names, ingredients, quantities, image URLs, and six tag fields (brands, categories, countries, labels, nutrition grades, origins). The text fields are highly multilingual: product_name spans 30+ detected languages, led by English (3,820) with French (315) a distant second, and ingredients_text shows the same pattern. Two things deserve a closer look first. The heavy null rates on quantity (57.7%) and ingredients_text (41.9%) will limit any analysis that depends on those fields. Quantity is also strongly duplicated (90.5% of its non-null values), with strings like '1 serving(s)', '8 oz', and '200 g' recurring thousands of times. Product names duplicate substantially too (30.9%), with 'Cottage Cheese', 'Cheese', and 'Mozzarella' appearing as common generic labels. The six tag-style columns were skipped during profiling, so their structure is not yet characterized.

citing: row_count · column_count · columns.quantity.null_rate · columns.quantity.stats.duplicate_rate · columns.quantity.top_values · columns.ingredients_text.null_rate · columns.ingredients_text.language_counts · columns.product_name.language_counts · columns.product_name.top_values · columns.product_name.stats.duplicate_rate · columns.image_url.null_rate
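
The duplicate rates cited in the summary appear to be computed over non-null values rather than over all rows. That definition is inferred from the counts reported in this document, not stated by the profiler, so the sketch below is a consistency check under that assumption. The function name `duplicate_rate` is illustrative.

```python
# Counts copied from the per-column statistics in this report.
N_ROWS = 77_145

COLUMNS = {
    # name: (nulls, n_duplicates, reported duplicate_rate)
    "image_url": (30_109, 1, 2.126e-05),
    "ingredients_text": (32_334, 13_909, 0.3104),
    "product_name": (805, 23_564, 0.3087),
    "quantity": (44_503, 29_540, 0.905),
}


def duplicate_rate(n_rows: int, nulls: int, n_duplicates: int) -> float:
    """Assumed definition: duplicates divided by the number of non-null values."""
    return n_duplicates / (n_rows - nulls)


for name, (nulls, dups, reported) in COLUMNS.items():
    rate = duplicate_rate(N_ROWS, nulls, dups)
    print(f"{name}: computed {rate:.4g}, reported {reported:.4g}")
```

Under this definition the computed values agree with the reported ones to the precision shown in the report. Dividing by all 77,145 rows instead would give a noticeably lower figure for the sparse columns.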

Schema

10 columns
Per-column summary.

column                  type      null %   unique   alerts
brands_tags             unknown   0.0%     -        skipped
categories_tags         unknown   0.0%     -        skipped
countries_tags          unknown   0.0%     -        skipped
image_url               text      39.0%    47,035   near_unique one_word url_heavy null_rate
ingredients_text        text      41.9%    30,902   multilingual null_rate duplicates
labels_tags             unknown   0.0%     -        skipped
nutrition_grades_tags   unknown   0.0%     -        skipped
origins_tags            unknown   0.0%     -        skipped
product_name            text      1.0%     52,776   multilingual duplicates
quantity                text      57.7%    3,102    one_word null_rate short_text duplicates

brands_tags

unknown other skipped
The column brands_tags was skipped by the profiler, so no descriptive statistics, uniqueness count, or value samples are available. The only confirmed facts are that it has 77,145 rows and a null rate of 0.0, meaning every row carries some value. The name suggests brand tag strings, but nothing in the evidence verifies this. Treatment: Re-run profiling with string handling enabled before deciding on use. low · anthropic:claude-opus-4-7
n
77,145
nulls
0 (0.0%)
unique

categories_tags

unknown feature skipped
This column was skipped by the profiler, so no statistics, uniqueness, or value samples are available beyond a row count of 77,145 and a null rate of 0.0. The name `categories_tags` suggests a multi-valued categorical field (likely a delimited list of category labels, e.g. from Open Food Facts), but this cannot be confirmed from the evidence. Treatment: Re-profile with list/string handling enabled, then split on the delimiter and one-hot or multi-hot encode before modelling. low · anthropic:claude-opus-4-7
n
77,145
nulls
0 (0.0%)
unique
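
The split-and-encode step in the treatment for categories_tags could look roughly like the sketch below. The column was never profiled, so the comma delimiter and the sample tag values are assumptions made only for illustration; the real storage format (for example a native list type) needs to be confirmed first. `multi_hot` is a hypothetical helper name.

```python
def multi_hot(rows, sep=","):
    """Split delimited tag strings and return (vocabulary, 0/1 matrix).

    `sep` is an assumed delimiter; the actual one is unknown until the
    column has been re-profiled.
    """
    split_rows = [
        [tag.strip() for tag in row.split(sep) if tag.strip()] if row else []
        for row in rows
    ]
    vocab = sorted({tag for tags in split_rows for tag in tags})
    index = {tag: i for i, tag in enumerate(vocab)}

    matrix = []
    for tags in split_rows:
        vec = [0] * len(vocab)
        for tag in tags:
            vec[index[tag]] = 1  # mark presence, not frequency
        matrix.append(vec)
    return vocab, matrix


# Hypothetical values, not taken from the dataset.
sample = ["en:cheeses,en:dairies", "en:dairies", ""]
vocab, matrix = multi_hot(sample)
print(vocab)
print(matrix)
```

A dense list-of-lists is fine for a sketch. With a large tag vocabulary a sparse representation would be the more practical choice.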

countries_tags

unknown feature skipped
The column `countries_tags` was skipped by the profiler, so no statistics beyond row count and null rate are available. All 77,145 rows are non-null, but uniqueness, distribution, and value examples are unknown. The name suggests a multi-valued tag field (e.g., comma- or colon-separated country codes like `en:france`), but this cannot be confirmed from the evidence. Treatment: Re-profile with a tag-aware parser (split on separator) before deciding on encoding or filtering. low · anthropic:claude-opus-4-7
n
77,145
nulls
0 (0.0%)
unique
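
A tag-aware parser of the kind suggested for countries_tags might be sketched as follows. It assumes comma-separated tags with an optional `prefix:name` form such as `en:france`. Both assumptions come from the column name and the example above, not from profiled values. `parse_tags` and `tag_counts` are hypothetical helper names.

```python
from collections import Counter


def parse_tags(value, sep=","):
    """Split a delimited tag string into (prefix, name) pairs.

    Tags without a colon get an empty prefix. `sep` is an assumption.
    """
    pairs = []
    for raw in (value or "").split(sep):
        tag = raw.strip()
        if not tag:
            continue
        prefix, colon, name = tag.partition(":")
        pairs.append((prefix, name) if colon else ("", tag))
    return pairs


def tag_counts(values, sep=","):
    """Count how many times each tag name occurs across all rows."""
    counts = Counter()
    for value in values:
        counts.update(name for _, name in parse_tags(value, sep))
    return counts


# Hypothetical values, not taken from the dataset.
rows = ["en:france,en:germany", "en:france", None]
print(tag_counts(rows).most_common())
```

Counting tag names this way gives the value distribution that the profiler could not produce, which is what any later encoding or filtering decision depends on.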

image_url

text metadata near_unique one_word url_heavy null_rate
This column holds Open Food Facts product image URLs, one per populated row, with a url_rate of 1.0 and a word_mean of 1.0 confirming that each value is a single link. It is 39.03% null (30,109 rows), so coverage is partial, and near-unique among the rest: 47,035 distinct values across 47,036 non-null rows, with only one duplicate URL. Lengths are tightly bounded (75-96 chars), reflecting a fixed CDN path template. Treatment: Drop for modelling; retain as a reference link for image fetching or manual inspection. high · anthropic:claude-opus-4-7
n
77,145
nulls
30,109 (39.0%)
unique
47,035
len_min
75
len_max
96
len_mean
84.24
len_median
84
len_p95
85
word_mean
1
word_median
1
n_empty
0
n_duplicates
1
duplicate_rate
2.126e-05
vocab_size
19,999
readability_flesch_mean
-811.9
emoji_rate
0
url_rate
1
one_word_rate
1
allcaps_rate
0
boilerplate_rate
0

ingredients_text

text free_text multilingual null_rate duplicates
Free-text ingredient lists, led by English (1,623 rows detected) but spanning 30 languages including French (212), German (128), and trace Thai, Korean, and Arabic. 41.91% of values are null, 4,981 are empty strings, and another 2,553 read literally as 'undefined'. 31.04% of non-null entries are exact duplicates: cheese-style boilerplate like 'Pasteurized milk, cheese culture, salt, enzymes.' recurs hundreds of times. Typical entries are short (median 16 words, 126 characters), but the mean is 238 characters and the longest entry reaches 5,283. The Flesch mean of 16.0 reflects dense comma-separated lists rather than prose. Treatment: Normalise 'undefined'/empty to null, dedupe, and tokenize per detected language before embedding. high · anthropic:claude-opus-4-7
n
77,145
nulls
32,334 (41.9%)
unique
30,902
len_min
0
len_max
5,283
len_mean
238.2
len_median
126
len_p95
842
word_mean
31.64
word_median
16
n_empty
4,981
n_duplicates
13,909
duplicate_rate
0.3104
vocab_size
35,093
readability_flesch_mean
16.01
emoji_rate
0.0002232
url_rate
0.002299
one_word_rate
0.1733
allcaps_rate
0.08565
boilerplate_rate
0
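
The first two steps of the ingredients_text treatment (mapping placeholders to null, then deduplicating) can be sketched in plain Python. Treating 'undefined' case-insensitively and stripping surrounding whitespace are assumptions beyond what the report states, and `normalize_missing` and `dedupe` are hypothetical helper names. Per-language tokenization is not covered here.

```python
PLACEHOLDERS = {"", "undefined"}  # values the report flags as not real content


def normalize_missing(value):
    """Return None for nulls, empty strings, and the literal 'undefined'."""
    if value is None:
        return None
    stripped = value.strip()
    return None if stripped.lower() in PLACEHOLDERS else stripped


def dedupe(values):
    """Drop None and exact repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for value in values:
        if value is None or value in seen:
            continue
        seen.add(value)
        unique.append(value)
    return unique


raw = [
    "Pasteurized milk, cheese culture, salt, enzymes.",
    "undefined",
    "",
    None,
    "Pasteurized milk, cheese culture, salt, enzymes.",
    "  Milk, salt  ",  # illustrative value, not taken from the dataset
]
cleaned = [normalize_missing(v) for v in raw]
print(dedupe(cleaned))
```

Normalising before deduplicating matters: otherwise 'undefined' and the empty string would each survive as a distinct "ingredient list".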

labels_tags

unknown other skipped
The column "labels_tags" was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 77,145 and a null rate of 0.0. The name suggests a multi-valued tag or label field (likely comma- or pipe-delimited strings), but this cannot be confirmed from the evidence, and no surprises can be flagged without underlying stats. Treatment: Re-run the profiler with parsing for list-like strings before deciding on encoding. low · anthropic:claude-opus-4-7
n
77,145
nulls
0 (0.0%)
unique

nutrition_grades_tags

unknown label skipped
This column is named nutrition_grades_tags, suggesting it holds Nutri-Score-style grade tags (e.g., a–e) for food products. Saturn skipped profiling, so no uniqueness, value distribution, or stats are available beyond a row count of 77,145 and a null rate of 0.0. Without cardinality or value samples, the actual content cannot be verified from the evidence. Treatment: Re-profile to recover the value distribution before deciding; if it resolves to a small grade vocabulary, one-hot encode. low · anthropic:claude-opus-4-7
n
77,145
nulls
0 (0.0%)
unique

origins_tags

unknown metadata skipped
Column `origins_tags` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 77,145 and a null rate of 0.0. The name suggests a tag-style field listing geographic origins, likely multi-valued (comma- or pipe-delimited) per row, but this cannot be confirmed from the evidence. No distributional signals can be assessed until the column is re-profiled. Treatment: Re-run the profiler on this column; if multi-valued, split on the delimiter and one-hot or count-encode the tags. low · anthropic:claude-opus-4-7
n
77,145
nulls
0 (0.0%)
unique

product_name

text free_text multilingual duplicates
Short product-name strings (mean 27.7 chars, median 4 words) for what looks like a cheese-dominated catalog: 'cheese' appears 13,869 times and the top values are 'Cottage Cheese', 'Cheese', 'Mozzarella', etc. Of the 76,340 non-null values, 30.9% (23,564) are duplicates. Case-variant pairs like 'Cottage Cheese'/'cottage cheese' and 'Cream Cheese'/'Cream cheese' are counted as distinct strings, so the true level of repetition is higher than that figure suggests. 349 rows are empty and the null rate is 1.04% (805 rows). Although English dominates (3,820 detected), there is a multilingual tail (fr 315, de 169, it 163, es 114) that will fragment exact-match joins. Treatment: Normalize case and whitespace to collapse near-duplicates, then tokenize/embed for modelling or fuzzy matching. high · anthropic:claude-opus-4-7
n
77,145
nulls
805 (1.0%)
unique
52,776
len_min
0
len_max
254
len_mean
27.75
len_median
24
len_p95
58
word_mean
4.269
word_median
4
n_empty
349
n_duplicates
23,564
duplicate_rate
0.3087
vocab_size
8,852
readability_flesch_mean
63.62
emoji_rate
0.0003537
url_rate
0
one_word_rate
0.05946
allcaps_rate
0.01347
boilerplate_rate
0
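
The case and whitespace normalization recommended for product_name can be sketched with the standard library. Applying Unicode NFKC normalization and `casefold()` is an assumption about what "normalize" should cover for multilingual names, and `normalize_name` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter


def normalize_name(name: str) -> str:
    """Collapse case, Unicode compatibility forms, and whitespace variants."""
    text = unicodedata.normalize("NFKC", name)
    # split() with no arguments drops leading/trailing and repeated whitespace
    return " ".join(text.split()).casefold()


# The case variants are the ones named in the report; the extra spaces in the
# fourth entry are added here for illustration.
names = [
    "Cottage Cheese",
    "cottage cheese",
    "Cream Cheese",
    "Cream  cheese ",
    "Mozzarella",
]
counts = Counter(normalize_name(n) for n in names)
print(counts.most_common())
```

Keeping the original string alongside the normalized key is usually worthwhile, since the normalized form is meant for grouping and matching rather than display.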

quantity

text feature one_word null_rate short_text duplicates
This column captures quantities as free-text strings (e.g. '1 serving(s)', '8 oz', '200 g'), mixing grams, ounces, and serving counts. It is sparse and repetitive: 57.69% of rows are null, 4,252 are empty, and 90.5% of non-null values are duplicates, leaving only 3,102 distinct strings built from a vocabulary of 1,802 tokens. Formatting is inconsistent: '200 g' (1,469 rows) and '200g' (696 rows) are stored as separate strings, so naive grouping will undercount. Treatment: Parse into numeric magnitude + normalized unit (g/oz/serving), collapsing variants like '200 g' and '200g'. high · anthropic:claude-opus-4-7
n
77,145
nulls
44,503 (57.7%)
unique
3,102
len_min
0
len_max
58
len_mean
4.74
len_median
5
len_p95
12
word_mean
1.791
word_median
2
n_empty
4,252
n_duplicates
29,540
duplicate_rate
0.905
vocab_size
1,802
readability_flesch_mean
102.6
emoji_rate
0
url_rate
0
one_word_rate
0.2975
allcaps_rate
0.01042
boilerplate_rate
0
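
A minimal parser for the quantity treatment is sketched below. It handles only a single number followed by a single unit. The alias table is an assumption that covers the three unit families named in the report, and `parse_quantity` is a hypothetical helper name. Anything it cannot interpret comes back as `None` rather than a guess.

```python
import re

# Assumed spellings; extend after inspecting the real value distribution.
UNIT_ALIASES = {
    "g": "g", "gr": "g", "gram": "g", "grams": "g",
    "oz": "oz", "ounce": "oz", "ounces": "oz",
    "serving": "serving", "servings": "serving", "serving(s)": "serving",
}

# One number (dot or comma decimal), optional space, one unit token.
_PATTERN = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-z()]+)\s*$", re.IGNORECASE)


def parse_quantity(text):
    """Return (magnitude, unit) or None when the string is not understood."""
    if not text:
        return None
    match = _PATTERN.match(text)
    if match is None:
        return None
    # Assumption: a comma is a decimal separator ('1,5' -> 1.5), so a
    # thousands-style value such as '1,000 g' would be misread as 1.0.
    magnitude = float(match.group(1).replace(",", "."))
    unit = UNIT_ALIASES.get(match.group(2).lower())
    if unit is None:
        return None
    return magnitude, unit


for raw in ["200 g", "200g", "8 oz", "1 serving(s)", "2 x 100 g", ""]:
    print(repr(raw), "->", parse_quantity(raw))
```

With this in place '200 g' and '200g' map to the same `(200.0, 'g')` pair, so grouping on the parsed value merges the two spellings. Multi-pack strings and unknown units are left unparsed for manual review.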