quirky openfoodfacts cheese 20260121

saturn notebook · generated 2026-05-01

Overview

Source: /home/coolhand/html/datavis/data_trove/cache/quirky/openfoodfacts_cheese_20260121.parquet

Saturn profiled 77,145 rows across 10 columns. The stats below are deterministic and machine-readable; the prose is a language-model interpretation of those stats (opt-in, added after the fact, never sees raw rows).

[2]:
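!pip install saturn-dissect
import subprocess

# Profile the cached Parquet file with the saturn CLI.
subprocess.run(
    [
        "saturn", "analyze", "/home/coolhand/html/datavis/data_trove/cache/quirky/openfoodfacts_cheese_20260121.parquet",
        "--findings", "quirky-openfoodfacts_cheese_20260121.json",
        "--llm", "anthropic:claude-opus-4-7",
    ],
    check=True,  # raise CalledProcessError instead of silently ignoring a failed run
)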

Summary confidence: high

This dataset contains 77,145 product records from Open Food Facts, focused on cheese products, with 10 columns covering names, ingredients, quantities, image URLs, and several tag fields (brands, categories, countries, labels, nutrition grades, origins). The text fields are highly multilingual — product_name spans 30+ languages with English (3,820) and French (315) dominating, and ingredients_text shows the same pattern. Two things deserve a closer look first: the heavy null rates on quantity (57.7%) and ingredients_text (41.9%), which will limit any analysis depending on those fields, and the strong duplication in quantity (90.5% duplicate rate) where values like '1 serving(s)', '8 oz', and '200 g' recur thousands of times. Product names also duplicate substantially (30.9%), with 'Cottage Cheese', 'Cheese', and 'Mozzarella' appearing as common generic labels. Note that the six tag-style columns were skipped during profiling, so their structure is not yet characterized.

citing: row_count · column_count · columns.quantity.null_rate · columns.quantity.stats.duplicate_rate · columns.quantity.top_values · columns.ingredients_text.null_rate · columns.ingredients_text.language_counts · columns.product_name.language_counts · columns.product_name.top_values · columns.product_name.stats.duplicate_rate · columns.image_url.null_rate
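
The citation keys above read like dotted paths into the findings JSON. The sketch below shows one way such a key could be resolved once the file is loaded. It is a minimal illustration only: the `lookup` helper is hypothetical, and the nested layout of the sample dictionary is an assumption inferred from the key names, not a documented Saturn schema.

```python
import json
from functools import reduce
from typing import Any


def lookup(findings: dict, dotted_path: str) -> Any:
    """Resolve a citation key such as 'columns.quantity.null_rate'."""
    # Walk one dictionary level per dot-separated segment.
    # A missing segment raises KeyError, which makes stale citations visible.
    return reduce(lambda node, key: node[key], dotted_path.split("."), findings)


# Hypothetical fragment shaped like the citation keys; the real file may differ.
findings = json.loads("""
{
  "row_count": 77145,
  "column_count": 10,
  "columns": {
    "quantity": {"null_rate": 0.577, "stats": {"duplicate_rate": 0.905}}
  }
}
""")

print(lookup(findings, "row_count"))
print(lookup(findings, "columns.quantity.stats.duplicate_rate"))
```

Resolving every cited key this way is a cheap check that each statement in the summary points at a value that actually exists in the findings.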

Out[4]:

saturn.schema() · 10 columns

column                 kind     n       null%  unique  alerts
brands_tags            unknown  77,145  0.0%           skipped
categories_tags        unknown  77,145  0.0%           skipped
countries_tags         unknown  77,145  0.0%           skipped
image_url              text     77,145  39.0%  47,035  near_unique one_word url_heavy null_rate
ingredients_text       text     77,145  41.9%  30,902  multilingual null_rate duplicates
labels_tags            unknown  77,145  0.0%           skipped
nutrition_grades_tags  unknown  77,145  0.0%           skipped
origins_tags           unknown  77,145  0.0%           skipped
product_name           text     77,145  1.0%   52,776  multilingual duplicates
quantity               text     77,145  57.7%  3,102   one_word null_rate short_text duplicates
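
The null and duplicate figures in this report appear to be related by simple arithmetic: for each text column, the reported n_duplicates equals the non-null count minus the unique count, and duplicate_rate is that number divided by the non-null count. The sketch below restates that relationship on a toy column. The `text_stats` function is hypothetical and the definitions are an assumption that happens to match the reported numbers, not Saturn's documented implementation.

```python
from typing import Optional, Sequence


def text_stats(values: Sequence[Optional[str]]) -> dict:
    """Summary counts for one text column; None stands for a null cell."""
    non_null = [v for v in values if v is not None]
    nulls = len(values) - len(non_null)
    unique = len(set(non_null))
    # Assumed definition: every repeat beyond the first occurrence is a duplicate.
    n_duplicates = len(non_null) - unique
    return {
        "n": len(values),
        "nulls": nulls,
        "null_rate": nulls / len(values) if values else 0.0,
        "unique": unique,
        "n_empty": sum(1 for v in non_null if v == ""),
        "n_duplicates": n_duplicates,
        "duplicate_rate": n_duplicates / len(non_null) if non_null else 0.0,
    }


# Toy column, not taken from the dataset.
toy = ["200 g", "200 g", "200g", "8 oz", "8 oz", "", None, None]
print(text_stats(toy))

# Cross-check with the reported quantity figures: n, nulls, and unique.
non_null = 77_145 - 44_503
print(non_null - 3_102, round((non_null - 3_102) / non_null, 3))
```

Under this reading, duplicate rates are relative to populated values only, so they should not be compared directly with the null rates, which are relative to all rows.
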
Fig 1.
quantity · Most common quantity strings — note inconsistent formatting like '200 g' vs '200g' that will need normalization.
Character-length distribution for quantity (mean: 4.739691195392439).
chars     count
0 – 1     4393
1 – 3     66
3 – 4     10984
4 – 6     12621
6 – 7     1264
7 – 9     413
9 – 10    462
10 – 12   129
12 – 13   1845
13 – 14   80
14 – 16   51
16 – 17   142
17 – 19   49
19 – 20   48
20 – 22   14
22 – 23   21
23 – 25   4
25 – 26   13
26 – 28   3
28 – 29   5
29 – 30   8
30 – 32   5
32 – 33   4
33 – 35   0
35 – 36   3
36 – 38   1
38 – 39   1
39 – 41   2
41 – 42   6
42 – 44   1
44 – 45   1
45 – 46   1
46 – 48   0
48 – 49   1
49 – 51   0
51 – 52   0
52 – 54   0
54 – 55   0
55 – 57   0
57 – 58   1
Fig 2.
product_name · Top product names are dominated by generic terms like 'Cottage Cheese' and 'Mozzarella', suggesting many non-unique entries.
Character-length distribution for product_name (mean: 27.748952056588944).
chars       count
0 – 6       1658
6 – 13      7931
13 – 19     16585
19 – 25     15617
25 – 32     12097
32 – 38     8979
38 – 44     4569
44 – 51     2949
51 – 57     2054
57 – 64     1107
64 – 70     796
70 – 76     562
76 – 83     297
83 – 89     201
89 – 95     135
95 – 102    105
102 – 108   99
108 – 114   96
114 – 121   63
121 – 127   70
127 – 133   54
133 – 140   56
140 – 146   49
146 – 152   35
152 – 159   37
159 – 165   30
165 – 171   13
171 – 178   22
178 – 184   11
184 – 190   12
190 – 197   15
197 – 203   10
203 – 210   4
210 – 216   5
216 – 222   5
222 – 229   4
229 – 235   2
235 – 241   2
241 – 248   1
248 – 254   3
Fig 3.
ingredients_text · Ingredient text length is highly skewed (median 126 chars, max 5,283) — useful for spotting sparse vs. detailed entries.
Character-length distribution for ingredients_text (mean: 238.2221552743746).
chars         count
0 – 132       22912
132 – 264     8740
264 – 396     4217
396 – 528     3155
528 – 660     1952
660 – 792     1268
792 – 925     810
925 – 1057    581
1057 – 1189   378
1189 – 1321   282
1321 – 1453   173
1453 – 1585   115
1585 – 1717   57
1717 – 1849   38
1849 – 1981   37
1981 – 2113   29
2113 – 2245   29
2245 – 2377   7
2377 – 2509   12
2509 – 2642   5
2642 – 2774   3
2774 – 2906   1
2906 – 3038   3
3038 – 3170   2
3170 – 3302   0
3302 – 3434   0
3434 – 3566   1
3566 – 3698   0
3698 – 3830   1
3830 – 3962   1
3962 – 4094   0
4094 – 4226   0
4226 – 4358   0
4358 – 4491   0
4491 – 4623   0
4623 – 4755   1
4755 – 4887   0
4887 – 5019   0
5019 – 5151   0
5151 – 5283   1
Fig 4.
product_name · Product name length distribution shows most names are short (median 24 chars), with a long tail of verbose entries.
Fig 5.
Per-column null rate across the corpus. Columns are ordered by input position.
column                 kind     null %
brands_tags            unknown  0.0%
categories_tags        unknown  0.0%
countries_tags         unknown  0.0%
image_url              text     39.0%
ingredients_text       text     41.9%
labels_tags            unknown  0.0%
nutrition_grades_tags  unknown  0.0%
origins_tags           unknown  0.0%
product_name           text     1.0%
quantity               text     57.7%
Fig 6.
Language mix across all text columns (per-string detection, sampled).
Per-language counts (total 6,991 detected strings).
lang  count  share
en    5443   77.9%
fr    527    7.5%
de    297    4.2%
it    185    2.6%
es    151    2.2%
nl    52     0.7%
pl    49     0.7%
pt    33     0.5%
sv    31     0.4%
ru    26     0.4%
no    22     0.3%
fi    20     0.3%
cs    18     0.3%
ca    15     0.2%
tr    15     0.2%
ja    13     0.2%
da    10     0.1%
bg    8      0.1%
el    8      0.1%
uk    8      0.1%
hu    6      0.1%
ro    6      0.1%
et    6      0.1%
hr    5      0.1%
ms    5      0.1%
eo    4      0.1%
ceb   4      0.1%
id    4      0.1%
sk    3      0.0%
bs    3      0.0%
zh    3      0.0%
sr    2      0.0%
sh    2      0.0%
lt    2      0.0%
nn    2      0.0%
ar    1      0.0%
ko    1      0.0%
th    1      0.0%

brands_tags unknown other

The column brands_tags was skipped by the profiler, so no descriptive statistics, uniqueness count, or value samples are available. The only confirmed facts are that it has 77145 rows and a null_rate of 0.0, meaning every row carries some value. Based on the name alone it likely holds brand tag strings, but without evidence this is not verified.

Treatment: Re-run profiling with string handling enabled before deciding on use.

anthropic:claude-opus-4-7 · confidence low
Out[12]:

saturn.columns["brands_tags"].stats

stat            value
n               77,145
nulls           0 (0.0%)
unique          (not reported)
alert: skipped  no profiler for kind=unknown

categories_tags unknown feature

This column was skipped by the profiler, so no statistics, uniqueness, or value samples are available beyond a row count of 77145 and a null rate of 0.0. The name `categories_tags` suggests a multi-valued categorical field (likely a delimited list of category labels, e.g. from Open Food Facts), but this cannot be confirmed from the evidence.

Treatment: Re-profile with list/string handling enabled, then split on the delimiter and one-hot or multi-hot encode before modelling.

anthropic:claude-opus-4-7 · confidence low
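
The split-then-encode treatment suggested above could look like the following sketch. It assumes each cell is a single comma-delimited string, which has not been confirmed for this column (it may be stored as a list type). The function names and the sample tags are hypothetical.

```python
from typing import Iterable


def split_tags(raw: str, delimiter: str = ",") -> list[str]:
    """Split one cell into trimmed, non-empty tags."""
    return [t.strip() for t in raw.split(delimiter) if t.strip()]


def multi_hot(rows: Iterable[str], delimiter: str = ",") -> tuple[list[str], list[list[int]]]:
    """Return the sorted tag vocabulary and one 0/1 vector per row."""
    parsed = [split_tags(r, delimiter) for r in rows]
    vocab = sorted({t for tags in parsed for t in tags})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for tags in parsed:
        vec = [0] * len(vocab)
        for t in tags:
            vec[index[t]] = 1  # a repeated tag within one row still maps to 1
        matrix.append(vec)
    return vocab, matrix


# Hypothetical cells; the real delimiter and tag format are unverified.
rows = ["en:cheeses,en:dairies", "en:dairies", ""]
vocab, matrix = multi_hot(rows)
print(vocab)
print(matrix)
```

With a large tag vocabulary the dense matrix grows quickly, so a sparse representation or a frequency cutoff would likely be needed before modelling.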
Out[14]:

saturn.columns["categories_tags"].stats

stat            value
n               77,145
nulls           0 (0.0%)
unique          (not reported)
alert: skipped  no profiler for kind=unknown

countries_tags unknown feature

The column `countries_tags` was skipped by the profiler, so no statistics beyond row count and null rate are available. All 77,145 rows are non-null, but uniqueness, distribution, and value examples are unknown. The name suggests a multi-valued tag field (e.g., comma- or colon-separated country codes like `en:france`), but this cannot be confirmed from the evidence.

Treatment: Re-profile with a tag-aware parser (split on separator) before deciding on encoding or filtering.

anthropic:claude-opus-4-7 · confidence low
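
A tag-aware parse, as suggested above, might separate the prefix from the value in tags shaped like `en:france` and then count values to see the distribution before choosing an encoding. This is a sketch under assumptions: the comma separator, the `prefix:value` shape, and all names below are hypothetical until the column is re-profiled.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    prefix: Optional[str]  # e.g. "en"; None when the tag has no colon
    value: str             # e.g. "france"


def parse_tag(tag: str) -> Tag:
    """Split 'en:france' into prefix and value at the first colon."""
    head, sep, tail = tag.strip().partition(":")
    if not sep:
        return Tag(None, head)
    return Tag(head, tail)


def parse_tags(raw: str, separator: str = ",") -> list[Tag]:
    """Parse one cell holding zero or more separator-delimited tags."""
    return [parse_tag(part) for part in raw.split(separator) if part.strip()]


# Hypothetical cells, not sampled from the dataset.
cells = ["en:france, en:germany", "en:france", "united-states", ""]
counts = Counter(tag.value for cell in cells for tag in parse_tags(cell))
print(counts.most_common())
```

A frequency table like `counts` is the missing evidence here: it would show whether the column is dominated by a handful of countries or has a long tail that needs grouping.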
Out[16]:

saturn.columns["countries_tags"].stats

stat            value
n               77,145
nulls           0 (0.0%)
unique          (not reported)
alert: skipped  no profiler for kind=unknown

image_url text metadata

This column holds Open Food Facts product image URLs, one per row; a url_rate of 1.0 and a word_mean of 1.0 confirm that each value is a single link. It is 39.03% null, so coverage is partial, and near-unique among the populated rows: 47,035 distinct values across 47,036 non-null entries, with only one duplicate URL. Lengths are tightly bounded (75–96 chars), which likely reflects a fixed CDN path template.

Treatment: Drop for modelling; retain as a reference link for image fetching or manual inspection.

anthropic:claude-opus-4-7 · confidence high
Out[18]:

saturn.columns["image_url"].stats

stat                     value
n                        77,145
nulls                    30,109 (39.0%)
unique                   47,035
len_min                  75
len_max                  96
len_mean                 84.24
len_median               84
len_p95                  85
word_mean                1
word_median              1
n_empty                  0
n_duplicates             1
duplicate_rate           2.126e-05
vocab_size               19,999
readability_flesch_mean  -811.9
emoji_rate               0
url_rate                 1
one_word_rate            1
allcaps_rate             0
boilerplate_rate         0
alert: near_unique       100.0% of rows are unique strings
alert: one_word          100.0% rows are a single word
alert: url_heavy         100.0% rows contain a URL
alert: null_rate         39.0% null
Fig 7.
Character-length distribution for image_url.
Character-length distribution for image_url (mean: 84.24347308444595). Bin edges appear to be rounded to whole characters, so some labels repeat.
chars     count
75 – 76   2
76 – 76   2
76 – 77   0
77 – 77   0
77 – 78   0
78 – 78   0
78 – 79   0
79 – 79   0
79 – 80   0
80 – 80   0
80 – 81   0
81 – 81   0
81 – 82   0
82 – 82   0
82 – 83   0
83 – 83   0
83 – 84   0
84 – 84   36226
84 – 85   0
85 – 86   10558
86 – 86   163
86 – 87   0
87 – 87   4
87 – 88   0
88 – 88   0
88 – 89   0
89 – 89   33
89 – 90   0
90 – 90   3
90 – 91   0
91 – 91   11
91 – 92   0
92 – 92   8
92 – 93   0
93 – 93   9
93 – 94   0
94 – 94   3
94 – 95   0
95 – 95   13
95 – 96   1

ingredients_text text free_text

Free-text ingredient lists from what looks like a food product database, dominated by English (1,623 detected strings) but spanning roughly 30 languages (the profiler alert reports 31 in its sample), including French (212), German (128), and trace Thai, Korean, and Arabic. 41.91% of values are null, 4,981 are empty strings, and another 2,553 read literally as 'undefined'. 31.04% of non-null entries are exact duplicates, with cheese-style boilerplate like 'Pasteurized milk, cheese culture, salt, enzymes.' recurring hundreds of times. Texts are typically short (word_median 16, len_median 126) but stretch to 5,283 characters, and the Flesch mean of 16.0 reflects dense comma-separated lists rather than prose.

Treatment: Normalise 'undefined'/empty to null, dedupe, and tokenize per detected language before embedding.

anthropic:claude-opus-4-7 · confidence high
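
The first two steps of the treatment above (mapping placeholders to null, then deduplicating) can be sketched with the standard library alone. The helper names are hypothetical, and treating 'undefined' case-insensitively is an assumption; the report only mentions the literal lowercase string. Per-language tokenization is left out because it needs a language-detection library.

```python
from typing import Iterable, Optional

# Strings treated as missing. Matching 'undefined' case-insensitively is an assumption.
PLACEHOLDERS = {"", "undefined"}


def to_null(text: Optional[str]) -> Optional[str]:
    """Return None for nulls, blanks, and the 'undefined' placeholder."""
    if text is None:
        return None
    stripped = text.strip()
    return None if stripped.lower() in PLACEHOLDERS else stripped


def clean_and_dedupe(values: Iterable[Optional[str]]) -> list[str]:
    """Drop missing values and exact repeats, keeping first-seen order."""
    seen: set[str] = set()
    kept: list[str] = []
    for value in values:
        cleaned = to_null(value)
        if cleaned is None or cleaned in seen:
            continue
        seen.add(cleaned)
        kept.append(cleaned)
    return kept


sample = [
    "Pasteurized milk, cheese culture, salt, enzymes.",
    "undefined",
    "",
    None,
    "Pasteurized milk, cheese culture, salt, enzymes.",
    " Lait, sel ",  # hypothetical French entry
]
print(clean_and_dedupe(sample))
```

Deduplicating after placeholder cleanup matters for the ordering of steps: otherwise 'undefined' and the empty string each survive as one "unique" ingredient list.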
Out[21]:

saturn.columns["ingredients_text"].stats

stat                     value
n                        77,145
nulls                    32,334 (41.9%)
unique                   30,902
len_min                  0
len_max                  5,283
len_mean                 238.2
len_median               126
len_p95                  842
word_mean                31.64
word_median              16
n_empty                  4,981
n_duplicates             13,909
duplicate_rate           0.3104
vocab_size               35,093
readability_flesch_mean  16.01
emoji_rate               0.0002232
url_rate                 0.002299
one_word_rate            0.1733
allcaps_rate             0.08565
boilerplate_rate         0
alert: multilingual      31 languages detected in sample
alert: null_rate         41.9% null
alert: duplicates        31.0% duplicate strings
Fig 8.
Character-length distribution for ingredients_text.

labels_tags unknown other

The column "labels_tags" was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 77145 and a null rate of 0.0. The name suggests a multi-valued tag or label field (likely comma- or pipe-delimited strings), but this cannot be confirmed from the evidence. No surprises can be flagged without underlying stats.

Treatment: Re-run the profiler with parsing for list-like strings before deciding on encoding.

anthropic:claude-opus-4-7 · confidence low
Out[24]:

saturn.columns["labels_tags"].stats

stat            value
n               77,145
nulls           0 (0.0%)
unique          (not reported)
alert: skipped  no profiler for kind=unknown

nutrition_grades_tags unknown label

This column is named nutrition_grades_tags, suggesting it holds Nutri-Score-style grade tags (e.g., a–e) for food products. Saturn skipped profiling, so no uniqueness, value distribution, or stats are available beyond a row count of 77145 with a null_rate of 0.0. Without cardinality or value samples, the actual content cannot be verified from the evidence.

Treatment: Re-profile to recover value distribution before deciding; if it resolves to a small grade vocabulary, one-hot encode.

anthropic:claude-opus-4-7 · confidence low
Out[26]:

saturn.columns["nutrition_grades_tags"].stats

stat            value
n               77,145
nulls           0 (0.0%)
unique          (not reported)
alert: skipped  no profiler for kind=unknown

origins_tags unknown metadata

Column `origins_tags` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 77145 and a null rate of 0.0. The name suggests a tag-style field listing geographic origins, likely multi-valued (comma- or pipe-delimited) per row, but this cannot be confirmed from the evidence. No distributional signals can be assessed until the column is re-profiled.

Treatment: Re-run the profiler on this column; if multi-valued, split on the delimiter and one-hot or count-encode the tags.

anthropic:claude-opus-4-7 · confidence low
Out[28]:

saturn.columns["origins_tags"].stats

stat            value
n               77,145
nulls           0 (0.0%)
unique          (not reported)
alert: skipped  no profiler for kind=unknown

product_name text free_text

Short product-name strings (mean 27.7 chars, median 4 words) for what looks like a cheese-dominated catalog: 'cheese' appears 13,869 times and the top values are 'Cottage Cheese', 'Cheese', 'Mozzarella', etc. Of the 76,340 non-null values, 30.9% are exact duplicates (23,564); 349 values are empty strings and the null rate is 1.04%. Case-variant pairs like 'Cottage Cheese'/'cottage cheese' and 'Cream Cheese'/'Cream cheese' are stored as distinct strings, so the exact-match figure likely understates the true overlap. Although English dominates (3,820 detected), there is a multilingual tail (fr 315, de 169, it 163, es 114) that will fragment exact-match joins.

Treatment: Normalize case and whitespace to collapse near-duplicates, then tokenize/embed for modelling or fuzzy matching.

anthropic:claude-opus-4-7 · confidence high
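
Collapsing the case and whitespace variants mentioned above can be sketched as follows. The `normalize_name` helper is hypothetical, and applying Unicode NFKC normalization is an extra assumption beyond what the treatment states. The sample names are the pairs cited in the card plus an invented double-space variant.

```python
import unicodedata
from collections import Counter


def normalize_name(name: str) -> str:
    """Canonical form for matching: NFKC, collapsed whitespace, case-folded."""
    text = unicodedata.normalize("NFKC", name)
    # str.split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(text.split()).casefold()


names = [
    "Cottage Cheese",
    "cottage cheese",
    "Cream Cheese",
    "Cream  cheese ",  # invented variant with extra spaces
    "Mozzarella",
]
counts = Counter(normalize_name(n) for n in names)
print(counts.most_common())
```

`casefold()` is used rather than `lower()` because it handles more non-English case pairs, which is relevant given the multilingual tail. The normalized key is best kept alongside the original string rather than replacing it.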
Out[30]:

saturn.columns["product_name"].stats

stat                     value
n                        77,145
nulls                    805 (1.0%)
unique                   52,776
len_min                  0
len_max                  254
len_mean                 27.75
len_median               24
len_p95                  58
word_mean                4.269
word_median              4
n_empty                  349
n_duplicates             23,564
duplicate_rate           0.3087
vocab_size               8,852
readability_flesch_mean  63.62
emoji_rate               0.0003537
url_rate                 0
one_word_rate            0.05946
allcaps_rate             0.01347
boilerplate_rate         0
alert: multilingual      31 languages detected in sample
alert: duplicates        30.9% duplicate strings
Fig 9.
Character-length distribution for product_name.

quantity text feature

This column captures food portion sizes as free-text quantity strings (e.g. '1 serving(s)', '8 oz', '200 g'), mixing units of grams, ounces and serving counts. It is sparse and repetitive: 57.69% of rows are null, 4252 are empty, and 90.5% of populated values are duplicates drawn from a vocabulary of just 1802 tokens. Note the inconsistent formatting — '200 g' (1469) and '200g' (696) are stored as separate strings, so naive grouping will undercount.

Treatment: Parse into numeric magnitude + normalized unit (g/oz/serving), collapsing variants like '200 g' and '200g'.

anthropic:claude-opus-4-7 · confidence high
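
A parser along the lines of the treatment above might look like this sketch. The unit alias table, the regular expression, and the choice to read a comma as a decimal separator are all assumptions for illustration; the real column has 3,102 distinct strings and will contain formats this does not cover, which the function reports as None rather than guessing.

```python
import re
from typing import NamedTuple, Optional


class Quantity(NamedTuple):
    magnitude: float
    unit: str  # normalized: "g", "oz", or "serving"


# Hypothetical alias table; extend after inspecting the real top values.
UNIT_ALIASES = {
    "g": "g", "gram": "g", "grams": "g",
    "oz": "oz", "ounce": "oz", "ounces": "oz",
    "serving": "serving", "servings": "serving", "serving(s)": "serving",
}

# A number, optional whitespace, then a unit made of letters and parentheses.
_PATTERN = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-z()]+)\s*$", re.IGNORECASE)


def parse_quantity(raw: Optional[str]) -> Optional[Quantity]:
    """Parse strings like '200 g' or '200g'; return None when unrecognized."""
    if not raw:
        return None
    match = _PATTERN.match(raw)
    if match is None:
        return None
    unit = UNIT_ALIASES.get(match.group(2).lower())
    if unit is None:
        return None
    # Assumption: a comma is a decimal separator ('1,5'), not a thousands mark.
    return Quantity(float(match.group(1).replace(",", ".")), unit)


for text in ["200 g", "200g", "8 oz", "1 serving(s)", "", "about a pound"]:
    print(repr(text), parse_quantity(text))
```

Grouping on the parsed `(magnitude, unit)` pair rather than the raw string is what merges '200 g' and '200g' into one bucket. Tracking the share of rows that return None gives a direct measure of how much of the column the alias table still misses.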
Out[33]:

saturn.columns["quantity"].stats

stat                     value
n                        77,145
nulls                    44,503 (57.7%)
unique                   3,102
len_min                  0
len_max                  58
len_mean                 4.74
len_median               5
len_p95                  12
word_mean                1.791
word_median              2
n_empty                  4,252
n_duplicates             29,540
duplicate_rate           0.905
vocab_size               1,802
readability_flesch_mean  102.6
emoji_rate               0
url_rate                 0
one_word_rate            0.2975
allcaps_rate             0.01042
boilerplate_rate         0
alert: one_word          29.7% rows are a single word
alert: null_rate         57.7% null
alert: short_text        95th-percentile length under 20 chars
alert: duplicates        90.5% duplicate strings
Fig 10.
Character-length distribution for quantity.

How to cite

BibTeX
@misc{saturn-quirky-openfoodfacts-cheese-20260121-2026,
  author       = {Steuber, Luke},
  title        = {Saturn reading: quirky openfoodfacts cheese 20260121},
  year         = {2026},
  howpublished = {\url{https://dr.eamer.dev/saturn/view/quirky-openfoodfacts_cheese_20260121}},
  note         = {Profiled with saturn-dissect v0.2.0, prompt saturn-insight-v2, model anthropic:claude-opus-4-7},
}
APA
Steuber, L. (2026). Saturn reading: quirky openfoodfacts cheese 20260121. Source: /home/coolhand/html/datavis/data_trove/cache/quirky/openfoodfacts_cheese_20260121.parquet. Profiled with saturn-dissect v0.2.0 (saturn-insight-v2, anthropic:claude-opus-4-7). Retrieved from https://dr.eamer.dev/saturn/view/quirky-openfoodfacts_cheese_20260121