wild openfoodfacts sample

source: /home/coolhand/html/datavis/data_trove/cache/wild/openfoodfacts_sample.json · 50 rows · 545 columns · profiled 2026-05-01

Reading

dataset summary · high confidence · anthropic:claude-opus-4-7

This is a 50-row sample from Open Food Facts with 545 columns, dominated by per-language localized fields (product names, generic names, ingredient texts, packaging texts, origin) plus nutrition, scoring, and provenance metadata. The shape is extremely sparse: the vast majority of localized columns have null rates of 92–98% (only the French and English variants are mostly populated), so most analytical signal lives in a small core of fields. Worth a closer look first: the Nutri-Score and NOVA distributions (the sample skews heavily to grade 'e' and NOVA group 4), the Eco-Score grade mix, and the food_groups / pnns_groups_2 breakdown, which shows that the sample is concentrated in chocolate and biscuit products. Also note the heavy imbalance in `lang` (70% French) and `countries_lc`, which biases any text or origin analysis. Treat the hundreds of sparse per-language columns (`product_name_<lang>`, `ingredients_text_<lang>`, and similar) as effectively empty rather than as features.

citing: column_count · row_count · nutriscore_grade · nova_groups · ecoscore_grade · food_groups · pnns_groups_2 · lang · countries_lc · ecoscore_score · nutriscore_score · completeness
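A minimal sketch of how the sparse columns could be set aside before analysis, assuming the sample has been loaded as a list of flat dicts (one per product) and that a missing key or a `None` value counts as null. The 0.5 threshold, the helper names `null_rates` and `dense_columns`, and the toy records are illustrative assumptions, not part of the profile.

```python
def null_rates(records):
    """Return {column: fraction of records where the column is missing or None}."""
    columns = set()
    for rec in records:
        columns.update(rec)
    n = len(records)  # assumes at least one record
    return {
        col: sum(1 for rec in records if rec.get(col) is None) / n
        for col in columns
    }


def dense_columns(records, max_null_rate=0.5):
    """Columns whose null rate is at or below the (assumed) threshold, sorted by name."""
    rates = null_rates(records)
    return sorted(col for col, rate in rates.items() if rate <= max_null_rate)


# Toy records standing in for the real sample (hypothetical values).
records = [
    {"code": "1", "nutriscore_grade": "e", "product_name_ja": None},
    {"code": "2", "nutriscore_grade": "d"},
    {"code": "3", "nutriscore_grade": "e", "product_name_ja": None},
    {"code": "4", "nutriscore_grade": "a", "product_name_ja": "x"},
]

rates = null_rates(records)
kept = dense_columns(records)
print(kept)
```

With only 50 rows, any threshold is a judgment call; a stricter cutoff would also drop the mid-sparsity German, Spanish, and Italian variants.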

Charts the summary said to look at first

nutriscore_grade · Check how heavily the sample skews to Nutri-Score 'e' (27 of 50) versus a/b/c.

Top values for nutriscore_grade (6 unique shown, of 6 total).
value	count	share
e	27	54.0%
d	9	18.0%
c	7	14.0%
a	4	8.0%
b	2	4.0%
unknown	1	2.0%
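A small sketch of how a count and share table like the one above can be derived, assuming shares are taken over all rows. The column is rebuilt here from the reported counts because the raw values are not shown; `share_table` is a hypothetical helper name.

```python
from collections import Counter


def share_table(values):
    """Return (value, count, share-of-all-rows) tuples, most frequent first."""
    total = len(values)
    counts = Counter(values)
    return [(v, c, c / total) for v, c in counts.most_common()]


# Rebuild the nutriscore_grade column from the counts reported in the table.
reported = {"e": 27, "d": 9, "c": 7, "a": 4, "b": 2, "unknown": 1}
grades = [g for g, n in reported.items() for _ in range(n)]

table = share_table(grades)
for value, count, share in table:
    print(f"{value}\t{count}\t{share:.1%}")
```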

nova_groups · See the split between ultra-processed (NOVA 4) and processed (NOVA 3) products, which dominate the sample.

Top values for nova_groups (3 unique shown, of 3 total). Shares are of all 50 rows; the 2 null rows (4.0%) are not listed, so the shares sum to 96.0%.
value	count	share
4	33	66.0%
3	14	28.0%
1	1	2.0%

ecoscore_grade · Compare Eco-Score grade frequencies including the notable 'unknown' and 'not-applicable' buckets.

Top values for ecoscore_grade (9 unique shown, of 9 total).
value	count	share
e	12	24.0%
d	9	18.0%
b	8	16.0%
c	8	16.0%
unknown	6	12.0%
a	3	6.0%
a-plus	2	4.0%
not-applicable	1	2.0%
f	1	2.0%

pnns_groups_2 · Confirm that biscuits/cakes and chocolate products together account for most of the sample's food categories.

Top values for pnns_groups_2 (11 unique shown, of 11 total).
value	count	share
Biscuits and cakes	17	34.0%
Chocolate products	16	32.0%
Appetizers	4	8.0%
Pastries	3	6.0%
Bread	2	4.0%
unknown	2	4.0%
Sweets	2	4.0%
Dairy desserts	1	2.0%
Waters and flavored waters	1	2.0%
Cereals	1	2.0%
Dried fruits	1	2.0%

lang · Spot the strong French-language bias (35/50) that will affect any text-based analysis.

Top values for lang (5 unique shown, of 5 total).
value	count	share
fr	35	70.0%
en	10	20.0%
de	3	6.0%
bg	1	2.0%
ro	1	2.0%

Schema

545 columns

Per-column summary.
column	type	null rate	unique	Alerts
update_key	categorical	0.0%	9	long_tail
categories_old	categorical	2.0%	45	long_tail
ecoscore_score	numeric	14.0%	31
environment_impact_level	categorical	56.0%	1	null_rate imbalance
ingredients_text_fi	categorical	90.0%	4	long_tail null_rate
nutrition_data_prepared	categorical	4.0%	1	imbalance
packaging_shapes_tags	unknown	0.0%	—	skipped
nutrient_levels_tags	unknown	0.0%	—	skipped
packagings_materials	unknown	0.0%	—	skipped
ingredients_without_ecobalyse_ids	unknown	0.0%	—	skipped
generic_name_nl	categorical	76.0%	4	long_tail null_rate
misc_tags	unknown	0.0%	—	skipped
product_name_sv	categorical	92.0%	4	long_tail null_rate
scans_n	numeric	0.0%	49	high_skew outliers
schema_version	numeric	0.0%	1	constant
url	categorical	0.0%	50	long_tail
vitamins_tags	unknown	0.0%	—	skipped
debug_param_sorted_langs	unknown	0.0%	—	skipped
packaging	categorical	12.0%	41	long_tail
grades	unknown	0.0%	—	skipped
last_modified_t	numeric	0.0%	50	outliers
origin_nl	categorical	76.0%	1	null_rate imbalance
allergens_lc	categorical	4.0%	6
states_hierarchy	unknown	0.0%	—	skipped
ingredients_text_ja	categorical	98.0%	1	long_tail null_rate imbalance
teams_tags	unknown	0.0%	—	skipped
traces_from_user	categorical	0.0%	35	long_tail
origins_tags	unknown	0.0%	—	skipped
serving_quantity_unit	categorical	8.0%	2	imbalance
vitamins_prev_tags	unknown	0.0%	—	skipped
ingredients_hierarchy	unknown	0.0%	—	skipped
unique_scans_n	numeric	0.0%	48	high_skew outliers
labels	categorical	2.0%	42	long_tail
generic_name_en	categorical	14.0%	8	long_tail
weighters_tags	unknown	0.0%	—	skipped
popularity_tags	unknown	0.0%	—	skipped
product_name_fi	categorical	90.0%	4	long_tail null_rate
origin_fr	categorical	8.0%	7	long_tail
generic_name	categorical	4.0%	28	long_tail
nutriscore_version	categorical	0.0%	1	imbalance
ingredients_without_ciqual_codes	unknown	0.0%	—	skipped
manufacturing_places_tags	unknown	0.0%	—	skipped
photographers_tags	unknown	0.0%	—	skipped
packaging_text_pl	categorical	90.0%	1	null_rate imbalance
informers_tags	unknown	0.0%	—	skipped
ingredients_text_en	categorical	12.0%	36	long_tail
ingredients_text_it	categorical	68.0%	12	long_tail null_rate
origin_de	categorical	60.0%	1	null_rate imbalance
nova_group	numeric	4.0%	3	high_skew
packaging_text_fi	categorical	90.0%	1	null_rate imbalance
states	categorical	0.0%	26	long_tail
ingredients_with_unspecified_percent_sum	numeric	0.0%	22
added_countries_tags	unknown	0.0%	—	skipped
id	categorical	0.0%	50	long_tail
nutrient_levels	unknown	0.0%	—	skipped
sortkey	numeric	12.0%	44	high_skew outliers
image_small_url	categorical	0.0%	50	long_tail
packaging_recycling_tags	unknown	0.0%	—	skipped
food_groups	categorical	2.0%	11
nova_groups_markers	unknown	0.0%	—	skipped
packaging_text_de	categorical	60.0%	2	null_rate
categories_lc	categorical	0.0%	6
checkers	unknown	0.0%	—	skipped
packaging_text_es	categorical	60.0%	2	null_rate
unknown_nutrients_tags	unknown	0.0%	—	skipped
editors_tags	unknown	0.0%	—	skipped
nutrition_score_warning_fruits_vegetables_nuts_estimate_from_ingredients	numeric	10.0%	1	constant
labels_lc	categorical	2.0%	6
nutriscore_data	unknown	0.0%	—	skipped
other_nutritional_substances_tags	unknown	0.0%	—	skipped
product_name_nb	categorical	96.0%	2	long_tail null_rate
nutrition_data_prepared_per	categorical	0.0%	1	imbalance
product_quantity	categorical	6.0%	27	long_tail
product_type	categorical	0.0%	1	imbalance
checkers_tags	unknown	0.0%	—	skipped
nucleotides_tags	unknown	0.0%	—	skipped
languages_tags	unknown	0.0%	—	skipped
traces_lc	categorical	4.0%	6
categories_hierarchy	unknown	0.0%	—	skipped
image_front_small_url	categorical	0.0%	50	long_tail
entry_dates_tags	unknown	0.0%	—	skipped
ecoscore_tags	unknown	0.0%	—	skipped
nutrition_score_warning_fruits_vegetables_legumes_estimate_from_ingredients	numeric	8.0%	1	constant
ingredients_without_ciqual_codes_n	numeric	0.0%	15
rev	numeric	0.0%	46
ingredients_non_nutritive_sweeteners_n	numeric	0.0%	1	constant
ingredients_without_ecobalyse_ids_n	numeric	0.0%	20
environment_impact_level_tags	unknown	0.0%	—	skipped
last_image_dates_tags	unknown	0.0%	—	skipped
labels_hierarchy	unknown	0.0%	—	skipped
product_name_en	categorical	14.0%	34	long_tail
nutrition_score_warning_fruits_vegetables_legumes_estimate_from_ingredients_value	numeric	8.0%	6	high_skew outliers
traces	categorical	0.0%	23	long_tail
generic_name_fi	categorical	90.0%	5	long_tail null_rate
emb_codes_orig	categorical	34.0%	5	long_tail null_rate
ingredients_with_specified_percent_n	numeric	0.0%	7
nutrition_grades	categorical	0.0%	6
weighers_tags	unknown	0.0%	—	skipped
categories_tags	unknown	0.0%	—	skipped
image_url	categorical	0.0%	50	long_tail
sources	unknown	0.0%	—	skipped
languages_hierarchy	unknown	0.0%	—	skipped
pnns_groups_1	categorical	0.0%	7
countries_lc	categorical	2.0%	6
additives_tags	unknown	0.0%	—	skipped
codes_tags	unknown	0.0%	—	skipped
countries_tags	unknown	0.0%	—	skipped
creator	categorical	0.0%	13	long_tail
ingredients	unknown	0.0%	—	skipped
product_name_nl	categorical	76.0%	7	long_tail null_rate
ingredients_n_tags	unknown	0.0%	—	skipped
origin_es	categorical	60.0%	1	null_rate imbalance
product_name_pl	categorical	90.0%	3	long_tail null_rate
scores	unknown	0.0%	—	skipped
brands	categorical	0.0%	41	long_tail
ingredients_text_de	categorical	60.0%	16	long_tail null_rate
ingredients_text_nb	categorical	96.0%	1	null_rate imbalance
packagings_n	numeric	18.0%	5	outliers
complete	numeric	0.0%	2
emb_codes_20141016	categorical	58.0%	7	long_tail null_rate
ingredients_tags	unknown	0.0%	—	skipped
packaging_text_ja	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_de	categorical	60.0%	9	long_tail null_rate
last_editor	categorical	2.0%	24	long_tail
minerals_prev_tags	unknown	0.0%	—	skipped
last_image_t	numeric	0.0%	50	high_skew
obsolete_since_date	categorical	12.0%	1	imbalance
pnns_groups_2_tags	unknown	0.0%	—	skipped
emb_codes_tags	unknown	0.0%	—	skipped
countries_beforescanbot	categorical	14.0%	38	long_tail
nutrition_grade_fr	categorical	0.0%	6
data_quality_tags	unknown	0.0%	—	skipped
ingredients_with_specified_percent_sum	numeric	0.0%	22
origin_it	categorical	68.0%	1	null_rate imbalance
nutrition_data_per	categorical	0.0%	2
origin_pl	categorical	90.0%	1	null_rate imbalance
product	unknown	0.0%	—	skipped
link	categorical	4.0%	28	long_tail
ingredients_text_nl	categorical	76.0%	9	long_tail null_rate
additives_n	numeric	0.0%	8
generic_name_sv	categorical	92.0%	4	long_tail null_rate
ingredients_that_may_be_from_palm_oil_tags	unknown	0.0%	—	skipped
known_ingredients_n	numeric	0.0%	22
completeness	numeric	0.0%	14	outliers
ingredients_sweeteners_n	numeric	0.0%	1	constant
nova_groups	categorical	4.0%	3
allergens_hierarchy	unknown	0.0%	—	skipped
obsolete	categorical	12.0%	1	imbalance
origin_sv	categorical	92.0%	1	null_rate imbalance
packaging_hierarchy	unknown	0.0%	—	skipped
ingredients_with_unspecified_percent_n	numeric	0.0%	18
fruits-vegetables-nuts_100g_estimate	numeric	46.0%	2	null_rate high_skew
emb_codes	categorical	4.0%	11	long_tail
packagings	unknown	0.0%	—	skipped
purchase_places_tags	unknown	0.0%	—	skipped
additives_original_tags	unknown	0.0%	—	skipped
image_front_url	categorical	0.0%	50	long_tail
data_quality_bugs_tags	unknown	0.0%	—	skipped
origin_fi	categorical	90.0%	1	null_rate imbalance
images	unknown	0.0%	—	skipped
ingredients_analysis	unknown	0.0%	—	skipped
ingredients_text_with_allergens_pl	categorical	92.0%	3	long_tail null_rate
product_name_de	categorical	60.0%	16	long_tail null_rate
ingredients_text_with_allergens_nb	categorical	96.0%	1	null_rate imbalance
packaging_text_it	categorical	68.0%	3	long_tail null_rate
product_name_it	categorical	68.0%	12	long_tail null_rate
serving_quantity	categorical	12.0%	27	long_tail
product_name_ja	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_sv	categorical	92.0%	4	long_tail null_rate
allergens_tags	unknown	0.0%	—	skipped
ingredients_text_fr	categorical	4.0%	47	long_tail
nutrition_score_beverage	numeric	0.0%	2	high_skew
ingredients_ids_debug	unknown	0.0%	—	skipped
nutrition_data	categorical	2.0%	1	imbalance
origin_ja	categorical	98.0%	1	long_tail null_rate imbalance
packaging_text_en	categorical	14.0%	5	long_tail
unknown_ingredients_n	numeric	0.0%	6	high_skew outliers
ingredients_from_palm_oil_tags	unknown	0.0%	—	skipped
labels_tags	unknown	0.0%	—	skipped
packaging_old_before_taxonomization	categorical	24.0%	36	long_tail null_rate
packaging_text_nb	categorical	96.0%	1	null_rate imbalance
nutrition_grades_tags	unknown	0.0%	—	skipped
category_properties	unknown	0.0%	—	skipped
nutriscore_score	numeric	2.0%	28
packaging_tags	unknown	0.0%	—	skipped
labels_old	categorical	8.0%	38	long_tail
packaging_text	categorical	4.0%	13	long_tail
ingredients_percent_analysis	numeric	0.0%	2	high_skew outliers
ecoscore_data	unknown	0.0%	—	skipped
ingredients_text_sv	categorical	92.0%	4	long_tail null_rate
brands_tags	unknown	0.0%	—	skipped
compared_to_category	categorical	0.0%	35	long_tail
data_sources	categorical	0.0%	43	long_tail
other_nutritional_substances_prev_tags	unknown	0.0%	—	skipped
ingredients_from_palm_oil_n	numeric	8.0%	2	outliers
last_updated_t	numeric	0.0%	50	outliers
nutrition_score_debug	categorical	0.0%	2	imbalance
popularity_key	numeric	0.0%	49	high_skew outliers
product_name_es	categorical	60.0%	17	long_tail null_rate
allergens_from_user	categorical	0.0%	34	long_tail
informers	unknown	0.0%	—	skipped
brands_old	categorical	32.0%	29	long_tail null_rate
data_quality_errors_tags	unknown	0.0%	—	skipped
ingredients_text	categorical	0.0%	50	long_tail
categories	categorical	0.0%	46	long_tail
nutrition_score_warning_fruits_vegetables_nuts_estimate_from_ingredients_value	numeric	10.0%	13	high_skew outliers
ingredients_from_or_that_may_be_from_palm_oil_n	numeric	6.0%	3
origins_old	categorical	22.0%	9	long_tail null_rate
packaging_text_nl	categorical	76.0%	1	null_rate imbalance
expiration_date	categorical	4.0%	34	long_tail
selected_images	unknown	0.0%	—	skipped
traces_from_ingredients	categorical	0.0%	12	long_tail
ingredients_text_with_allergens	categorical	0.0%	50	long_tail
image_front_thumb_url	categorical	0.0%	50	long_tail
lc	categorical	0.0%	5
ingredients_text_debug	categorical	28.0%	35	long_tail null_rate
packagings_materials_main	categorical	62.0%	3	null_rate
data_quality_dimensions	unknown	0.0%	—	skipped
serving_size	categorical	12.0%	37	long_tail
pnns_groups_1_tags	unknown	0.0%	—	skipped
origin	categorical	6.0%	6	long_tail
ingredients_lc	categorical	0.0%	4
packaging_old	categorical	14.0%	40	long_tail
packaging_text_fr	categorical	6.0%	14	long_tail
nova_group_debug	categorical	0.0%	3	long_tail imbalance
ingredients_original_tags	unknown	0.0%	—	skipped
data_quality_completeness_tags	unknown	0.0%	—	skipped
cities_tags	unknown	0.0%	—	skipped
countries_hierarchy	unknown	0.0%	—	skipped
nutriscore_score_opposite	numeric	2.0%	28
categories_properties_tags	unknown	0.0%	—	skipped
origins_lc	categorical	4.0%	6
ciqual_food_name_tags	unknown	0.0%	—	skipped
countries	categorical	0.0%	43	long_tail
ingredients_text_with_allergens_it	categorical	68.0%	12	long_tail null_rate
packaging_lc	categorical	12.0%	7
correctors_tags	unknown	0.0%	—	skipped
interface_version_created	categorical	2.0%	3
states_tags	unknown	0.0%	—	skipped
nutriscore_2021_tags	unknown	0.0%	—	skipped
stores_tags	unknown	0.0%	—	skipped
image_thumb_url	categorical	0.0%	50	long_tail
categories_properties	unknown	0.0%	—	skipped
nucleotides_prev_tags	unknown	0.0%	—	skipped
allergens_from_ingredients	categorical	0.0%	35	long_tail
ingredients_text_with_allergens_fi	categorical	90.0%	4	long_tail null_rate
_keywords	unknown	0.0%	—	skipped
manufacturing_places	categorical	2.0%	20	long_tail
pnns_groups_2	categorical	0.0%	11
ingredients_text_pl	categorical	90.0%	3	long_tail null_rate
generic_name_es	categorical	60.0%	7	long_tail null_rate
origin_en	categorical	14.0%	2	imbalance
generic_name_it	categorical	68.0%	5	long_tail null_rate
ingredients_that_may_be_from_palm_oil_n	numeric	8.0%	3	high_skew outliers
ingredients_text_es	categorical	60.0%	13	long_tail null_rate
teams	categorical	8.0%	39	long_tail
food_groups_tags	unknown	0.0%	—	skipped
data_quality_warnings_tags	unknown	0.0%	—	skipped
debug_tags	unknown	0.0%	—	skipped
main_countries_tags	unknown	0.0%	—	skipped
origins_hierarchy	unknown	0.0%	—	skipped
packagings_complete	numeric	4.0%	2
nutriscore_tags	unknown	0.0%	—	skipped
ingredients_text_with_allergens_nl	categorical	78.0%	9	long_tail null_rate
created_t	numeric	0.0%	50
traces_hierarchy	unknown	0.0%	—	skipped
generic_name_nb	categorical	96.0%	1	null_rate imbalance
ingredients_text_with_allergens_de	categorical	66.0%	16	long_tail null_rate
ingredients_text_with_allergens_es	categorical	62.0%	13	long_tail null_rate
product_name_fr	categorical	2.0%	47	long_tail
stores	categorical	4.0%	31	long_tail
_id	categorical	0.0%	50	long_tail
nutriments	unknown	0.0%	—	skipped
editors	unknown	0.0%	—	skipped
max_imgid	categorical	0.0%	38	long_tail
nutriscore_grade	categorical	0.0%	6
product_quantity_unit	categorical	10.0%	2	imbalance
ingredients_analysis_tags	unknown	0.0%	—	skipped
ingredients_text_with_allergens_fr	categorical	4.0%	47	long_tail
interface_version_modified	categorical	0.0%	2
data_sources_tags	unknown	0.0%	—	skipped
ingredients_text_with_allergens_en	categorical	16.0%	36	long_tail
removed_countries_tags	unknown	0.0%	—	skipped
amino_acids_prev_tags	unknown	0.0%	—	skipped
code	categorical	0.0%	50	long_tail
correctors	unknown	0.0%	—	skipped
generic_name_ja	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_fr	categorical	6.0%	34	long_tail
generic_name_pl	categorical	90.0%	2	null_rate
amino_acids_tags	unknown	0.0%	—	skipped
ingredients_debug	unknown	0.0%	—	skipped
ingredients_text_with_allergens_ja	categorical	98.0%	1	long_tail null_rate imbalance
data_quality_info_tags	unknown	0.0%	—	skipped
last_edit_dates_tags	unknown	0.0%	—	skipped
last_modified_by	categorical	2.0%	24	long_tail
no_nutrition_data	categorical	4.0%	1	imbalance
nutriscore	unknown	0.0%	—	skipped
origin_nb	categorical	96.0%	1	null_rate imbalance
origins	categorical	4.0%	20	long_tail
nova_groups_tags	unknown	0.0%	—	skipped
languages	unknown	0.0%	—	skipped
nutriscore_2023_tags	unknown	0.0%	—	skipped
packaging_materials_tags	unknown	0.0%	—	skipped
lang	categorical	0.0%	5
packaging_text_sv	categorical	92.0%	1	null_rate imbalance
photographers	unknown	0.0%	—	skipped
languages_codes	unknown	0.0%	—	skipped
ecoscore_grade	categorical	0.0%	9
ingredients_n	numeric	0.0%	22
allergens	categorical	0.0%	16
minerals_tags	unknown	0.0%	—	skipped
product_name	categorical	0.0%	49	long_tail
purchase_places	categorical	2.0%	32	long_tail
quantity	categorical	2.0%	36	long_tail
traces_tags	unknown	0.0%	—	skipped
origin_uk	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_ar	categorical	80.0%	2	null_rate
packaging_text_uk	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_ar	categorical	78.0%	2	null_rate
ingredients_text_uk	categorical	98.0%	1	long_tail null_rate imbalance
last_check_dates_tags	unknown	0.0%	—	skipped
checked	categorical	86.0%	1	null_rate imbalance
packaging_text_ar	categorical	80.0%	1	null_rate imbalance
carbon_footprint_percent_of_known_ingredients	numeric	62.0%	19	null_rate
last_checker	categorical	86.0%	4	null_rate
product_name_uk	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_uk	categorical	98.0%	1	long_tail null_rate imbalance
product_name_ar	categorical	78.0%	6	long_tail null_rate
carbon_footprint_from_known_ingredients_debug	categorical	72.0%	14	long_tail null_rate
last_checked_t	numeric	86.0%	7	null_rate
ingredients_text_with_allergens_uk	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_ar	categorical	82.0%	2	null_rate
origin_ar	categorical	80.0%	1	null_rate imbalance
nutriments_estimated	unknown	0.0%	—	skipped
nutrition_score_warning_no_fiber	numeric	70.0%	1	null_rate constant
ingredients_text_debug_tags	unknown	0.0%	—	skipped
taxonomies_enhancer_tags	unknown	0.0%	—	skipped
completed_t	numeric	68.0%	16	null_rate
product_name_bg	categorical	94.0%	3	long_tail null_rate
ingredients_text_et	categorical	94.0%	3	long_tail null_rate
origin_sl	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_dz	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_sl	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_ca	categorical	96.0%	1	null_rate imbalance
ingredients_text_dz	categorical	98.0%	1	long_tail null_rate imbalance
product_name_ca	categorical	96.0%	1	null_rate imbalance
origin_ca	categorical	96.0%	1	null_rate imbalance
product_name_et	categorical	94.0%	3	long_tail null_rate
ingredients_text_with_allergens_bg	categorical	94.0%	3	long_tail null_rate
ingredients_text_with_allergens_et	categorical	94.0%	3	long_tail null_rate
origin_sk	categorical	98.0%	1	long_tail null_rate imbalance
origin_bg	categorical	94.0%	1	null_rate imbalance
packaging_text_sl	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_sk	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_sl	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_ca	categorical	96.0%	1	null_rate imbalance
generic_name_sl	categorical	98.0%	1	long_tail null_rate imbalance
product_name_dz	categorical	98.0%	1	long_tail null_rate imbalance
origin_et	categorical	94.0%	1	null_rate imbalance
ingredients_text_with_allergens_sk	categorical	98.0%	1	long_tail null_rate imbalance
product_name_sk	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_pt	categorical	84.0%	4	long_tail null_rate
ingredients_text_with_allergens_ca	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_pt	categorical	80.0%	3	long_tail null_rate
packaging_text_pt	categorical	80.0%	1	null_rate imbalance
ingredients_text_pt	categorical	80.0%	4	long_tail null_rate
origin_pt	categorical	80.0%	1	null_rate imbalance
nutrition_score_warning_nutriments_estimated	numeric	96.0%	1	null_rate constant
packaging_text_bg	categorical	94.0%	1	null_rate imbalance
generic_name_et	categorical	94.0%	1	null_rate imbalance
packaging_text_ca	categorical	96.0%	1	null_rate imbalance
product_name_sl	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_bg	categorical	94.0%	1	null_rate imbalance
ingredients_text_sk	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_bg	categorical	94.0%	3	long_tail null_rate
packaging_text_et	categorical	94.0%	1	null_rate imbalance
packaging_text_sk	categorical	98.0%	1	long_tail null_rate imbalance
product_name_pt	categorical	80.0%	7	long_tail null_rate
abbreviated_product_name_fr	categorical	86.0%	7	long_tail null_rate
obsolete_imported	categorical	86.0%	1	null_rate imbalance
sources_fields	unknown	0.0%	—	skipped
emb_code	categorical	98.0%	1	long_tail null_rate imbalance
lang_imported	categorical	86.0%	1	null_rate imbalance
generic_name_zh	categorical	98.0%	1	long_tail null_rate imbalance
conservation_conditions_fr_imported	categorical	86.0%	7	long_tail null_rate
origin_fr_imported	categorical	96.0%	2	long_tail null_rate
owner	categorical	86.0%	6	long_tail null_rate
ingredients_text_fr_imported	categorical	86.0%	7	long_tail null_rate
owners_tags	categorical	86.0%	6	long_tail null_rate
product_name_zh	categorical	98.0%	1	long_tail null_rate imbalance
nutrition_data_prepared_per_imported	categorical	86.0%	1	null_rate imbalance
abbreviated_product_name_fr_imported	categorical	86.0%	7	long_tail null_rate
generic_name_zh_debug_tags	unknown	0.0%	—	skipped
customer_service_fr	categorical	86.0%	6	long_tail null_rate
customer_service_fr_imported	categorical	86.0%	6	long_tail null_rate
ingredients_text_zh_debug_tags	unknown	0.0%	—	skipped
product_name_fr_imported	categorical	86.0%	7	long_tail null_rate
brands_imported	categorical	86.0%	6	long_tail null_rate
owner_imported	categorical	88.0%	5	long_tail null_rate
product_name_zh_debug_tags	unknown	0.0%	—	skipped
lc_imported	categorical	84.0%	2	null_rate
ingredients_text_zh	categorical	98.0%	1	long_tail null_rate imbalance
quantity_imported	categorical	86.0%	7	long_tail null_rate
nutrition_data_per_imported	categorical	84.0%	1	null_rate imbalance
generic_name_fr_imported	categorical	86.0%	7	long_tail null_rate
owner_fields	unknown	0.0%	—	skipped
categories_imported	categorical	88.0%	5	long_tail null_rate
conservation_conditions_fr	categorical	86.0%	7	long_tail null_rate
conservation_conditions	categorical	86.0%	7	long_tail null_rate
countries_imported	categorical	84.0%	2	null_rate
origins_fr	categorical	96.0%	2	long_tail null_rate
abbreviated_product_name	categorical	86.0%	7	long_tail null_rate
customer_service	categorical	86.0%	6	long_tail null_rate
data_sources_imported	categorical	84.0%	8	long_tail null_rate
nova_group_error	categorical	96.0%	1	null_rate imbalance
ingredients_text_de_ocr_1648897071_result	categorical	98.0%	1	long_tail null_rate imbalance
packaging_text_ro	categorical	96.0%	1	null_rate imbalance
product_name_ro	categorical	96.0%	2	long_tail null_rate
producer_version_id	categorical	92.0%	3	long_tail null_rate
serving_size_imported	categorical	88.0%	6	long_tail null_rate
no_nutrition_data_imported	categorical	92.0%	1	null_rate imbalance
packaging_imported	categorical	92.0%	2	null_rate
ingredients_text_ro	categorical	96.0%	1	null_rate imbalance
producer_version_id_imported	categorical	92.0%	3	long_tail null_rate
labels_imported	categorical	90.0%	3	long_tail null_rate
ingredients_text_de_ocr_1648990410_result	categorical	98.0%	1	long_tail null_rate imbalance
allergens_imported	categorical	90.0%	4	long_tail null_rate
ingredients_text_de_ocr_1648990410	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_de_ocr_1648897071	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_ro	categorical	96.0%	1	null_rate imbalance
origin_ro	categorical	96.0%	1	null_rate imbalance
abbreviated_product_name_imported	categorical	94.0%	3	long_tail null_rate
traces_imported	categorical	92.0%	4	long_tail null_rate
specific_ingredients	unknown	0.0%	—	skipped
product_name_ru	categorical	94.0%	2	null_rate
origin_ru	categorical	94.0%	1	null_rate imbalance
ingredients_text_with_allergens_ru	categorical	94.0%	1	null_rate imbalance
packaging_text_ru	categorical	94.0%	1	null_rate imbalance
generic_name_ru	categorical	94.0%	2	null_rate
ingredients_text_ru	categorical	94.0%	1	null_rate imbalance
ingredients_text_da	categorical	96.0%	2	long_tail null_rate
ingredients_text_with_allergens_da	categorical	96.0%	2	long_tail null_rate
product_name_da	categorical	96.0%	2	long_tail null_rate
packaging_text_da	categorical	96.0%	1	null_rate imbalance
generic_name_da	categorical	96.0%	2	long_tail null_rate
forest_footprint_data	unknown	0.0%	—	skipped
origin_da	categorical	96.0%	1	null_rate imbalance
origin_sr	categorical	96.0%	1	null_rate imbalance
ingredients_text_nl_ocr_1675675383_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_cs	categorical	94.0%	2	null_rate
product_name_cs	categorical	94.0%	2	null_rate
origin_hu	categorical	92.0%	1	null_rate imbalance
packaging_text_hu	categorical	92.0%	1	null_rate imbalance
origin_cs	categorical	96.0%	1	null_rate imbalance
ingredients_text_with_allergens_hu	categorical	94.0%	3	long_tail null_rate
generic_name_cs	categorical	94.0%	1	null_rate imbalance
ingredients_text_hu	categorical	92.0%	4	long_tail null_rate
ingredients_text_sr	categorical	96.0%	2	long_tail null_rate
packaging_text_sr	categorical	96.0%	1	null_rate imbalance
ingredients_text_nl_ocr_1675675383	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_cs	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_sr	categorical	96.0%	2	long_tail null_rate
packaging_text_cs	categorical	94.0%	1	null_rate imbalance
product_name_sr	categorical	96.0%	2	long_tail null_rate
ingredients_text_hu_ocr_1571428260_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_hu_ocr_1571428260	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_hu	categorical	92.0%	2	null_rate
product_name_hu	categorical	92.0%	3	long_tail null_rate
ingredients_text_with_allergens_sr	categorical	96.0%	2	long_tail null_rate
ingredients_text_es_ocr_1548767061_result	categorical	98.0%	1	long_tail null_rate imbalance
product_name_xx	categorical	96.0%	1	null_rate imbalance
generic_name_xx	categorical	96.0%	1	null_rate imbalance
ingredients_text_es_ocr_1548767061	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_xx	categorical	96.0%	1	null_rate imbalance
origin_xx	categorical	98.0%	1	long_tail null_rate imbalance
packaging_text_xx	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_ur	categorical	98.0%	1	long_tail null_rate imbalance
product_name_ur	categorical	98.0%	1	long_tail null_rate imbalance
origin_he	categorical	98.0%	1	long_tail null_rate imbalance
product_name_he	categorical	96.0%	2	long_tail null_rate
origin_ur	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_ur	categorical	98.0%	1	long_tail null_rate imbalance
packaging_text_he	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_he	categorical	98.0%	1	long_tail null_rate imbalance
packaging_text_ur	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_he	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_he	categorical	98.0%	1	long_tail null_rate imbalance
nutriscore_grade_producer	categorical	94.0%	3	long_tail null_rate
nutriscore_grade_producer_imported	categorical	94.0%	3	long_tail null_rate
packaging_text_el	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_el	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_el	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_el	categorical	98.0%	1	long_tail null_rate imbalance
origin_el	categorical	98.0%	1	long_tail null_rate imbalance
product_name_el	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_th	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_de_ocr_1559410715_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_th	categorical	98.0%	1	long_tail null_rate imbalance
packaging_text_th	categorical	98.0%	1	long_tail null_rate imbalance
product_name_th	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_de_ocr_1548767354_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_th	categorical	98.0%	1	long_tail null_rate imbalance
origin_th	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_de_ocr_1548767354	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_de_ocr_1559410715	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_it_ocr_1559410715	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_it_ocr_1559410715_result	categorical	98.0%	1	long_tail null_rate imbalance
packaging_text_fr_imported	categorical	98.0%	1	long_tail null_rate imbalance
preparation_fr_imported	categorical	98.0%	1	long_tail null_rate imbalance
preparation	categorical	98.0%	1	long_tail null_rate imbalance
preparation_fr	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_lc	categorical	98.0%	1	long_tail null_rate imbalance
product_name_lc	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_lc	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_lc	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_xx_debug_tags	unknown	0.0%	—	skipped
product_name_xx_debug_tags	unknown	0.0%	—	skipped
generic_name_xx_debug_tags	unknown	0.0%	—	skipped
ingredients_text_fr_ocr_1561814324	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1561814324_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1624039072_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1624039072	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573108346	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1566920858_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573107556	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573108346_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573107560_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573108349_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573108360	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573109955_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573108349	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573109955	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573107556_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573108360_result	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1573107560	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1566920858	categorical	98.0%	1	long_tail null_rate imbalance
generic_name_lt	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_ro	categorical	98.0%	1	long_tail null_rate imbalance
packaging_text_lt	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_lt	categorical	98.0%	1	long_tail null_rate imbalance
origin_lt	categorical	98.0%	1	long_tail null_rate imbalance
product_name_lt	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_with_allergens_lt	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1713713129	categorical	98.0%	1	long_tail null_rate imbalance
ingredients_text_fr_ocr_1713713129_result	categorical	98.0%	1	long_tail null_rate imbalance

update_key

categorical metadata long_tail

A categorical update_key field with only 9 distinct values across 50 rows, dominated by 'brands' at 56% (28/50) and 'sort' at 20%. The long tail mixes human-readable labels ('divinfood', 'nova-yogurts', 'germany2', 'france') with timestamp-style tokens ('key_1748337248', 'ingredients20240805'), suggesting inconsistent naming conventions for what appears to track update batches or jobs. Entropy ratio of 0.64 confirms the heavy concentration on a few keys. Treatment: Group rare keys into 'other' or normalize naming before using as a grouping dimension. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 9
top_value: brands
top_rate: 0.56
cardinality: 9
entropy: 2.015
entropy_ratio: 0.6357
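
The "group rare keys into 'other'" treatment could look like the sketch below. It is a minimal illustration in plain Python, not the profiler's own logic: the helper `group_rare` and the `min_count` threshold are hypothetical, and the example data only reproduces the reported shares for 'brands' (28 rows) and 'sort' (10 rows). The split of the remaining 12 rows is invented.

```python
from collections import Counter

def group_rare(values, min_count=5, other="other"):
    """Replace values seen fewer than min_count times with a catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Illustrative data: 'brands' and 'sort' follow the reported counts;
# the distribution of the rare keys below is made up for the example.
keys = (
    ["brands"] * 28
    + ["sort"] * 10
    + ["france"] * 4
    + ["germany2"] * 3
    + ["divinfood"] * 3
    + ["nova-yogurts"] * 2
)

grouped = group_rare(keys)
print(Counter(grouped))
```

The threshold is a judgment call; with only 50 rows, any key below a handful of occurrences is unlikely to support a stable group-level estimate.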

categories_old

categorical feature long_tail

Hierarchical product category strings (Open Food Facts style taxonomy paths), stored as comma-separated breadcrumbs. Near-unique with 45 distinct values across 50 rows and entropy ratio 0.99, and the strings appear in mixed languages (French, English, Polish, Bulgarian Cyrillic), so direct grouping will fragment. Top value covers only 4% of rows and one row is null. Treatment: Split on commas, normalise language, and keep only the top-level taxon as a categorical feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 45
top_value: Snacks, Snacks sucrés, Biscuits et gâteaux, Biscuits, Biscuits secs
top_rate: 0.04082
cardinality: 45
entropy: 5.451
entropy_ratio: 0.9926
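
A minimal sketch of the "split on commas, keep the top-level taxon" step, assuming the breadcrumb order runs from broadest to narrowest as in the reported top value. The helper name `top_level_taxon` is hypothetical, and language normalisation (mapping French, Polish, or Bulgarian labels to one vocabulary) is deliberately left out because it needs a taxonomy lookup that this report does not describe.

```python
def top_level_taxon(path):
    """Return the first comma-separated breadcrumb, lower-cased.

    Returns None for null or blank input so that missing values stay missing.
    """
    if path is None:
        return None
    parts = [p.strip() for p in path.split(",") if p.strip()]
    return parts[0].lower() if parts else None

# The reported top value for this column.
sample = "Snacks, Snacks sucrés, Biscuits et gâteaux, Biscuits, Biscuits secs"
print(top_level_taxon(sample))
```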

ecoscore_score

numeric feature

Numeric Eco-Score rating per item, ranging from 13 to 94 with a mean of 47.7 and median of 44. The distribution is mildly right-skewed (0.31) and platykurtic (-0.79), spanning a wide IQR of 36.5 with no outliers flagged. Notably, 14% of values are null (7 of 50) and only 31 unique scores appear across the 43 non-null rows. Treatment: Impute or flag the 14% nulls, then use as a continuous feature without transformation. high · anthropic:claude-opus-4-7

n: 50
nulls: 7 (14.0%)
unique: 31
min: 13
max: 94
mean: 47.74
median: 44
std: 21.19
q1: 27.5
q3: 64
iqr: 36.5
skew: 0.3069
kurtosis: -0.7946
n_outliers: 0
outlier_rate: 0
zero_rate: 0
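
One way to "impute or flag" the nulls is to do both: fill with the median and keep a separate missing indicator. The sketch below assumes median imputation, which the report does not prescribe; `impute_with_flag` is a hypothetical helper and the six scores are invented for illustration.

```python
from statistics import median

def impute_with_flag(values):
    """Fill None with the median of observed values and return a missing mask."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# Made-up scores within the reported 13 to 94 range, with two nulls.
scores = [13, None, 44, 64, None, 94]
filled, was_missing = impute_with_flag(scores)
print(filled)
print(was_missing)
```

Keeping the mask lets a downstream model distinguish a genuine mid-range score from an imputed one.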

environment_impact_level

categorical feature null_rate imbalance

This appears to be a categorical flag for environmental impact severity, but it carries no usable signal in this sample. 56% of the 50 rows are null, and the remaining 22 records all hold the empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; the column is effectively constant and majority-null. high · anthropic:claude-opus-4-7

n: 50
nulls: 28 (56.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fi

categorical free_text long_tail null_rate

Finnish-language ingredient declarations, almost entirely absent: 90% of the 50 rows are null, and the 5 non-null entries hold only 4 distinct values because two of them are empty strings. The few populated entries are verbose product ingredient lists (chocolate, wheat-based baked goods) with allergen markup, suggesting this is a localized free-text field rather than a categorical feature despite its low cardinality here. Treatment: Drop or set aside; a null_rate of 0.9 makes it unusable as a feature without a Finnish-text NLP pipeline. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 4
top_value: (empty string)
top_rate: 0.4
cardinality: 4
entropy: 1.922
entropy_ratio: 0.961

nutrition_data_prepared

categorical metadata imbalance

This appears to be a flag indicating whether nutrition data was prepared, but it carries no information: only one unique value (an empty string) appears across all 48 non-null rows, with a 4% null rate. Entropy is 0 and top_rate is 1.0, so the column is constant. Treatment: Drop; constant column with no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_shapes_tags

unknown free_text skipped

This column, packaging_shapes_tags, was skipped by the profiler so no descriptive statistics are available beyond a row count of 50 and a null rate of 0. The name suggests a tag-style field listing packaging shape descriptors, likely multi-valued per row, which is probably why it was bypassed. Without unique counts or value samples, nothing further can be confirmed. Treatment: Re-profile after splitting the tag list, then one-hot or multi-label encode. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nutrient_levels_tags

unknown feature skipped

Column 'nutrient_levels_tags' was skipped by the profiler, so no statistics beyond a 50-row count and 0% null rate are available. The name suggests a list of nutrient classification tags (likely multi-valued strings like 'fat-in-low-quantity'), but uniqueness, cardinality, and value distribution are all unknown. Treat any downstream use cautiously until the column is re-profiled with list-aware parsing. Treatment: Re-profile with list/tag-aware parsing, then one-hot or multi-label encode the individual tags. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —
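
List-aware re-profiling for a skipped tag column could start from per-tag row counts, as sketched below. This assumes each cell is a list of tag strings, which the report could not confirm; the helper `tag_frequencies` and the example tags are hypothetical (only 'fat-in-low-quantity' is mentioned in the summary above, and the `en:` prefix is an assumption).

```python
from collections import Counter

def tag_frequencies(rows):
    """Count how many rows carry each tag; each row is a list of tag strings."""
    counts = Counter()
    for tags in rows:
        # set() so a tag repeated within one row is only counted once.
        counts.update(set(tags or []))
    return counts

# Invented rows; the third row is an empty list, which is non-null but carries no tags.
rows = [
    ["en:fat-in-high-quantity", "en:sugars-in-high-quantity"],
    ["en:fat-in-low-quantity"],
    [],
    ["en:fat-in-high-quantity"],
]

freq = tag_frequencies(rows)
empty_rows = sum(1 for tags in rows if not tags)
print(freq.most_common())
print(empty_rows)
```

Counting empty lists separately matters here: a 0% null rate does not rule out rows whose tag list is empty.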

packagings_materials

unknown other skipped

The column packagings_materials was skipped by the profiler, so its kind is unknown and no descriptive statistics are available. We only know there are 50 rows and a 0.0 null rate; uniqueness, type, and value distribution are all missing. The name suggests structured packaging material data (likely nested or list-valued), which would explain why the profiler bailed out. Treatment: Inspect raw values manually and parse the nested structure before any downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_without_ecobalyse_ids

unknown other skipped

This column is named `ingredients_without_ecobalyse_ids`, suggesting it lists ingredients that lack matching identifiers in the Ecobalyse reference system. Saturn skipped profiling, so type, uniqueness, and value distribution are unknown; the only confirmed facts are 50 rows and a null_rate of 0.0. Treatment: Inspect raw values manually to determine structure (likely a list) before deciding on parsing or join strategy. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

generic_name_nl

categorical metadata long_tail null_rate

This appears to be a Dutch-language generic product name field, likely from a food product catalog. It is largely unusable as-is: 76% of rows are null and among the 12 non-null entries, 9 are empty strings, leaving only 3 distinct real values (e.g., 'Extra fijne pure chocolade'). Cardinality is just 4 across 50 rows, so there is essentially no signal here. Treatment: Drop, or retain only as a descriptive label — too sparse to model. high · anthropic:claude-opus-4-7

n: 50
nulls: 38 (76.0%)
unique: 4
top_value: (empty string)
top_rate: 0.75
cardinality: 4
entropy: 1.208
entropy_ratio: 0.6038

misc_tags

unknown other skipped

The column 'misc_tags' was skipped by the profiler, so no type inference, uniqueness count, or value statistics are available. The only confirmed signals are 50 rows with a 0.0 null rate. Without further stats, its content and structure cannot be characterized. Treatment: Re-profile with a parser suited to this column (e.g., list/JSON tags) before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

product_name_sv

categorical metadata long_tail null_rate

Swedish-localised product name field, populated for only 4 of 50 rows (null_rate 0.92). The 4 present values are all unique, giving maximum entropy (entropy_ratio 1.0) but no repeated category to learn from. Values like "90% Cocoa" and "Dark 70%" look English rather than Swedish, suggesting localisation is incomplete or mislabelled. Treatment: Drop or defer until localisation coverage improves; not usable as a feature at 92% null. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 4
top_value: 90% Cocoa
top_rate: 0.25
cardinality: 4
entropy: 2
entropy_ratio: 1

scans_n

numeric feature high_skew outliers

A numeric count of scans per record, with 49 unique values across 50 rows and no nulls or zeros. The distribution is tightly clustered (median 492, IQR 217) but extremely right-skewed (skew 3.90, kurtosis 18.72) with a max of 2523 versus a Q3 of 604, producing 4 outliers (8%). The mean (577.94) sits well above the median, confirming a heavy upper tail. Treatment: Log-transform or winsorize before modelling to tame the heavy right tail. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 49
min: 333
max: 2,523
mean: 577.9
median: 492
std: 343.9
q1: 387
q3: 604
iqr: 217
skew: 3.899
kurtosis: 18.72
n_outliers: 4
outlier_rate: 0.08
zero_rate: 0
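
The two suggested transforms can be sketched as follows. Capping at the upper Tukey fence (q3 + 1.5 × IQR) is used here as a simple stand-in for winsorizing; whether the profiler's outlier flag uses the same 1.5 × IQR rule is an assumption, and percentile-based winsorizing would be an equally valid choice. The four example values are the reported min, median, q3, and max.

```python
import math

def cap_upper(values, cap):
    """Clip values above cap down to cap (a one-sided winsorization)."""
    return [min(v, cap) for v in values]

# Quartiles as reported for scans_n.
q1, q3 = 387, 604
upper_fence = q3 + 1.5 * (q3 - q1)

# Reported min, median, q3, and max, used as a tiny illustrative sample.
scans = [333, 492, 604, 2523]

capped = cap_upper(scans, upper_fence)
logged = [math.log1p(v) for v in scans]

print(upper_fence)
print(capped)
print([round(x, 3) for x in logged])
```

The log transform keeps the ordering and all information in the tail, while capping discards the magnitude of extreme values; which one suits depends on whether very high scan counts are meaningful for the model.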

schema_version

numeric metadata constant

Constant numeric column holding the value 996.0 across all 50 rows with no nulls. Despite being typed as numeric, the zero variance (std 0.0, iqr 0.0) and single unique value indicate this is a schema/version tag rather than a measurement. Carries no signal for modelling. Treatment: Drop before modelling; retain only as a provenance tag. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 1
min: 996
max: 996
mean: 996
median: 996
std: 0
q1: 996
q3: 996
iqr: 0
skew: 0
kurtosis: 0
n_outliers: 0
outlier_rate: 0
zero_rate: 0

url

categorical identifier long_tail

This column holds Open Food Facts product URLs, one per row, with every value unique across all 50 rows (cardinality 50, entropy_ratio 1.0). The URL path embeds a product barcode plus a slugified name, so it functions as a permalink/identifier rather than a feature. No nulls, but the long_tail alert simply reflects that every row is its own category. Treatment: Drop from modelling; keep as a join key or reference link to the source product page. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: https://world.openfoodfacts.org/product/6111242100992/perly
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

vitamins_tags

unknown other skipped

The column `vitamins_tags` was skipped by the profiler, so no type, uniqueness, or distribution statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests a list-valued field enumerating vitamin identifiers (e.g., tag-style strings), but this cannot be confirmed from the evidence. Without parsing, downstream use is blocked. Treatment: Re-profile after parsing as a list of tags, then one-hot or multi-hot encode. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

debug_param_sorted_langs

unknown metadata skipped

This column was skipped by the profiler (alert: "skipped"), so its kind is unknown and no statistics were computed beyond a row count of 50 with 0% nulls. The name suggests a debug artefact holding sorted language codes, likely a list or compound value the profiler couldn't classify. Without unique counts or value samples there is nothing further to infer. Treatment: Drop unless you can re-profile with list/struct support; it appears to be a debug field. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

packaging

categorical free_text long_tail

Free-form packaging descriptions, likely from a food/product database (Open Food Facts style) given multilingual prefixes like 'en:', 'es:', 'pt:'. Cardinality is extreme: 41 unique values across the 44 non-null rows with entropy ratio 0.985, and the top value 'Plastique' covers only 9% of non-null rows (4 of 44). Most entries are comma-separated multi-tag strings mixing languages. 12% are null, and the long_tail alert confirms there is no usable category structure as-is. Treatment: Split on commas, normalize language prefixes, and one-hot encode the resulting material tags rather than using the raw string. high · anthropic:claude-opus-4-7

n: 50
nulls: 6 (12.0%)
unique: 41
top_value: Plastique
top_rate: 0.09091
cardinality: 41
entropy: 5.278
entropy_ratio: 0.9851
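
A minimal sketch of the "split on commas, normalize language prefixes" step. It assumes prefixes are exactly two letters followed by a colon, as in the 'en:', 'es:', 'pt:' examples; the helper `normalise_packaging` and the sample string (apart from 'Plastique') are hypothetical. It only strips the prefix and lower-cases the token, so 'plastique' and 'plastic' would still count as different tags without a translation map.

```python
def normalise_packaging(raw):
    """Split a packaging string into lower-cased tags without language prefixes."""
    tags = []
    for part in raw.split(","):
        token = part.strip().lower()
        # Drop a two-letter language prefix such as 'en:' or 'es:'.
        if len(token) > 3 and token[2] == ":" and token[:2].isalpha():
            token = token[3:]
        if token:
            tags.append(token)
    return tags

# Invented example mixing an unprefixed French tag with prefixed ones.
print(normalise_packaging("Plastique, en:Cardboard, es:Caja"))
```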

grades

unknown other skipped

The column is named "grades" and contains 50 rows with no nulls, but saturn skipped profiling and could not infer a kind, so no distributional stats are available. Without n_unique or value summaries, it's impossible to tell whether this holds letter grades, numeric scores, or a nested structure. The "skipped" alert is the key signal: something about the storage type prevented standard analysis. Treatment: Manually inspect a sample to determine the underlying type before deciding on a downstream encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

last_modified_t

numeric timestamp outliers

Values are Unix epoch seconds (min 1737907641, max 1768643720), so this column is a last-modified timestamp covering late January 2025 through mid-January 2026 (UTC). All 50 rows are unique with no nulls, but the distribution is heavily left-skewed (skew -1.96) with 6 outliers (12%) sitting far below the q1 of 1761612624, suggesting a small tail of much older edits while most records cluster within an IQR of about 6.1M seconds (roughly 71 days). Treat as a timestamp, not a numeric feature. Treatment: Convert from epoch seconds to datetime and derive recency or bucketed features instead of using the raw integer. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
min: 1.738e+09
max: 1.769e+09
mean: 1.763e+09
median: 1.767e+09
std: 8.093e+06
q1: 1.762e+09
q3: 1.768e+09
iqr: 6.138e+06
skew: -1.961
kurtosis: 2.972
n_outliers: 6
outlier_rate: 0.12
zero_rate: 0
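
The epoch-to-datetime conversion can be sketched with the standard library alone. The two inputs are the reported min and max; the helper names `epoch_to_utc` and `days_before` are hypothetical, and using the newest edit as the reference point for recency is an assumption (the profiling date would work equally well).

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def days_before(reference, seconds):
    """Whole days between an edit and a reference datetime (a recency feature)."""
    return (reference - epoch_to_utc(seconds)).days

oldest = epoch_to_utc(1737907641)  # reported min
newest = epoch_to_utc(1768643720)  # reported max

print(oldest.date().isoformat())
print(newest.date().isoformat())
print(days_before(newest, 1737907641))
```

Passing `tz=timezone.utc` avoids the result depending on the local timezone of the machine running the conversion.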

origin_nl

categorical metadata null_rate imbalance

origin_nl appears to be a categorical attribute (likely a Dutch-language origin label) but is effectively empty in this sample. 76% of the 50 rows are null, and the remaining 12 non-null entries are all the empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; column has no variance and is mostly null. high · anthropic:claude-opus-4-7

n: 50
nulls: 38 (76.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

allergens_lc

categorical metadata

Language code for the allergens text, with 6 distinct values across 50 rows and a 4% null rate. The distribution is nearly bimodal between 'en' (22) and 'fr' (21), with es/de/it/pl appearing once or twice each — a language mix worth flagging before any text processing. Treatment: Use as a language filter or routing key before tokenizing the allergens text. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 6
top_value: en
top_rate: 0.4583
cardinality: 6
entropy: 1.578
entropy_ratio: 0.6104
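
The entropy figures in these stat blocks are consistent with Shannon entropy in bits over the non-null values, with entropy_ratio being that entropy divided by log2 of the cardinality. That reading is an inference from the numbers, not something the report states. The sketch below checks it for this column, assuming the four minor languages split as 2, 1, 1, 1 (the summary only says "once or twice each", and the 48 non-null rows force that split).

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 'en' = 22 and 'fr' = 21 are reported; 2, 1, 1, 1 is the inferred split
# of the remaining 5 non-null rows across es/de/it/pl.
counts = [22, 21, 2, 1, 1, 1]

h = entropy_bits(counts)
ratio = h / math.log2(len(counts))

print(round(h, 3))
print(round(ratio, 4))
```

Both results should land close to the reported entropy (1.578) and entropy_ratio (0.6104), which supports the assumed definition.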

states_hierarchy

unknown other skipped

The column 'states_hierarchy' was skipped by the profiler, so its kind is unknown and no descriptive statistics were computed. We can only confirm there are 50 rows with no nulls; uniqueness, type, and value distribution are unavailable. Treatment: Re-profile or inspect manually to determine type before any downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text_ja

categorical free_text long_tail null_rate imbalance

Japanese-language ingredients text, almost entirely absent from this sample. 98% of the 50 rows are null, and the single non-null value is an empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; the column carries no usable signal in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

teams_tags

unknown other skipped

The column `teams_tags` was skipped by the profiler, so its kind, uniqueness, and value distribution are unknown. Only two facts are available: 50 rows were seen and none were null. Without further stats, the content (e.g. whether it holds lists, delimited tags, or structured objects) cannot be characterised. Treatment: Re-profile with a parser that handles this column's type before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

traces_from_user

categorical free_text long_tail

This column appears to capture user-submitted allergen/ingredient traces, prefixed with a language code like '(en)' or '(fr)' followed by comma-separated tags such as 'en:milk,en:nuts'. With 35 unique values across 50 rows and entropy ratio 0.938, it is highly diverse; the top value '(en) ' (an empty tag list) covers only 14% and the distribution has a long tail. Notably, the language prefix is mixed (English and French) and many entries are blank tag lists, which complicates direct use as a category. Treatment: Parse the language prefix and split the tag list into a multi-hot allergen feature before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 35
top_value: (en)
top_rate: 0.14
cardinality: 35
entropy: 4.811
entropy_ratio: 0.9379
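
Parsing the language prefix and tag list might look like the sketch below. It assumes every value has the shape '(xx) tag,tag,...' with a two-letter lower-case code, based on the examples in the summary; the regular expression and the helper `parse_traces` are hypothetical, and values that do not match are returned with no language rather than guessed at.

```python
import re

# '(en) en:milk,en:nuts' -> lang 'en', tags 'en:milk,en:nuts'
TRACES_PATTERN = re.compile(r"^\((?P<lang>[a-z]{2})\)\s*(?P<tags>.*)$")

def parse_traces(raw):
    """Return (language_code, list_of_tags); (None, []) if the format is unexpected."""
    match = TRACES_PATTERN.match(raw.strip())
    if match is None:
        return None, []
    tags = [t.strip() for t in match.group("tags").split(",") if t.strip()]
    return match.group("lang"), tags

print(parse_traces("(en) en:milk,en:nuts"))
print(parse_traces("(en) "))
```

The second call shows the reported top value: a language prefix with an empty tag list, which should be treated as "no traces declared" rather than as its own category.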

origins_tags

unknown other skipped

The column `origins_tags` was skipped by the profiler, so kind is unknown and no descriptive statistics were computed. The only confirmed signals are 50 rows present and a 0.0 null rate; uniqueness, value distribution, and data type are all unavailable. Treatment: Re-profile with an appropriate parser (likely a list/tag field) before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

serving_quantity_unit

categorical metadata imbalance

This column records the unit of measurement for serving quantity, almost exclusively grams ('g' at 45 of 46 non-null rows, top_rate 0.978) with a single 'ml' entry. With only 2 unique values, an 8% null rate, and entropy_ratio of 0.151, it carries almost no information. Treatment: Drop or collapse to a binary flag; near-constant with negligible signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 4 (8.0%)
unique: 2
top_value: g
top_rate: 0.9783
cardinality: 2
entropy: 0.1511
entropy_ratio: 0.1511

vitamins_prev_tags

unknown other skipped

The column "vitamins_prev_tags" was skipped by the profiler, so no type, uniqueness, or distribution stats are available. The only confirmed signals are 50 rows with a 0.0 null rate. Without further evidence the content (likely a list/array of prior tag values given the name) cannot be characterized. Treatment: Re-profile with a parser that handles nested/array values before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_hierarchy

unknown other skipped

This column is labelled ingredients_hierarchy but saturn skipped profiling it, so no type, uniqueness, or value statistics are available. The only confirmed signals are that it has 50 rows and zero nulls. Without further evidence, the structure (likely nested or list-valued, given the name) cannot be verified. Treatment: Re-profile with a parser that handles nested or list-valued fields before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

unique_scans_n

numeric feature high_skew outliers

Numeric count of unique scans per row, with 48 distinct values across 50 records and no nulls or zeros. The distribution is heavily right-skewed (skew 3.91, kurtosis 18.71): median is 432 against a mean of 525.38, and the max of 2257 sits far beyond q3 of 560.75, producing 4 outliers (8% outlier rate). Std of 306.41 dwarfs the IQR of 198, confirming a long upper tail. Treatment: Log-transform or winsorize before modelling to tame the long upper tail. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 48
min: 319
max: 2,257
mean: 525.4
median: 432
std: 306.4
q1: 362.8
q3: 560.8
iqr: 198
skew: 3.911
kurtosis: 18.71
n_outliers: 4
outlier_rate: 0.08
zero_rate: 0

labels

categorical feature long_tail

Free-form labels/certifications column (e.g. organic, fair-trade, Triman, Green Dot) stored as comma-separated multi-label strings, often mixing English, French, Portuguese and Spanish tokens. Of 50 rows, 42 distinct values and entropy ratio 0.95 indicate near-unique combinations; the only repeated 'value' is the empty string (8 rows, 16%) on top of a 2% null rate, so roughly one in five records carries no label at all. The long_tail alert is well earned — almost every non-empty cell is its own bag of tags. Treatment: Split on commas into a multi-hot tag set (normalising language variants) before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 42
top_value: (empty string)
top_rate: 0.1633
cardinality: 42
entropy: 5.125
entropy_ratio: 0.9504

generic_name_en

categorical free_text long_tail

Likely an English-language generic product name field, but it is essentially empty: the top value is the blank string at 83.7% of non-null rows, with a further 14% null. Only 7 actual product descriptions appear across 50 rows (e.g. 'Dark chocolate', 'Crackers'), all singletons, giving cardinality 8 and entropy ratio 0.37. The long_tail alert reflects that every real value occurs exactly once. Treatment: Drop or treat blanks as missing; too sparse and unique to use as a categorical feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 7 (14.0%)
unique: 8
top_value: (empty string)
top_rate: 0.8372
cardinality: 8
entropy: 1.098
entropy_ratio: 0.366
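
Treating blanks as missing is a one-line normalisation, sketched here in plain Python. The helper `blank_to_none` is hypothetical, and the short list reuses the two example names from the summary alongside invented blank and null entries; treating whitespace-only strings as blank is an added assumption.

```python
def blank_to_none(value):
    """Map empty or whitespace-only strings to None; leave real text stripped."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None

# Two names quoted in the summary, plus invented blank/null entries.
names = ["", "Dark chocolate", None, "   ", "Crackers"]

cleaned = [blank_to_none(v) for v in names]
missing_rate = sum(v is None for v in cleaned) / len(cleaned)

print(cleaned)
print(missing_rate)
```

Applied to this column, the same step would merge the 14% nulls with the blank-string rows, making the true missing rate visible to the profiler.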

weighters_tags

unknown other skipped

The column 'weighters_tags' was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. Without kind detection or sample values, its content and structure cannot be characterised here. The name suggests it may hold tag-like annotations, but this is not confirmed by evidence. Treatment: Re-profile with parsing enabled to determine type and cardinality before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

popularity_tags

unknown other skipped

The column `popularity_tags` was skipped by the profiler, so its kind is unknown and no descriptive statistics were computed. The only signals available are that 50 rows were seen with a null rate of 0.0, meaning every row carries some value. Cardinality, type, and distribution are all missing, so the column's actual content cannot be characterized from this evidence. Treatment: Re-profile with the appropriate parser (likely list/JSON) before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

product_name_fi

categorical metadata long_tail null_rate

Likely a Finnish-localized product name field, but it is essentially empty: 90% nulls and the most frequent observed value is the empty string (top_rate 0.4 of the 5 non-null entries). Among the few populated rows, the names are in English (e.g., 'Excellence: 90% cocoa Dark Supreme', 'Arriba 85% Cacao Dark Chocolate'), contradicting the _fi suffix. With only 4 unique values across 50 rows, this column carries almost no usable signal. Treatment: Drop or defer until localization coverage improves; do not use as a feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 4
top_value: (empty string)
top_rate: 0.4
cardinality: 4
entropy: 1.922
entropy_ratio: 0.961

origin_fr

categorical free_text long_tail

This appears to be a French-language origin/provenance field describing where a product or its ingredients are made. The column is essentially empty: 40 of 50 rows hold the empty string and another 8% are null, leaving only 6 distinct non-blank descriptions ranging from a single country ('France') to multi-region ingredient breakdowns. Entropy ratio of 0.319 and a top_rate of 0.87 confirm the long-tail alert — almost no usable signal here. Treatment: Drop or defer; too sparse and unstructured to use without targeted NER on the few populated strings. high · anthropic:claude-opus-4-7

n: 50
nulls: 4 (8.0%)
unique: 7
top_value: (empty string)
top_rate: 0.8696
cardinality: 7
entropy: 0.8958
entropy_ratio: 0.3191

generic_name

categorical free_text long_tail

Free-text generic product names, predominantly French with some English entries (e.g., "Compound Chocolate with MILK AND ALMONDS"). The dominant value is the empty string at 21 of the 48 non-null rows (top_rate 0.4375); together with the 2 null rows, 23 of 50 rows (46%) carry no usable name. The remaining 27 distinct values are all singletons (27 values over 27 rows), producing the flagged long tail. Treatment: Treat empty strings as missing, then tokenize/normalize language before embedding or matching. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 28
top_value: (empty string)
top_rate: 0.4375
cardinality: 28
entropy: 3.663
entropy_ratio: 0.762

nutriscore_version

categorical metadata imbalance

This column records the Nutri-Score version applied to each row, and every one of the 50 records carries the value "2023". With cardinality 1 and entropy 0, it offers no discriminative signal in this sample. Treatment: Drop, constant column. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 1
top_value: 2023
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_without_ciqual_codes

unknown other skipped

This column, named ingredients_without_ciqual_codes, was skipped by the profiler so no descriptive statistics are available beyond a row count of 50 and a null rate of 0. The name suggests it holds ingredient entries that lack a matching CIQUAL food-database code, likely as a list or nested structure that the profiler could not introspect. Without unique counts or value samples, nothing further can be inferred. Treatment: Re-profile after parsing the nested structure, or explode to a list before downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

manufacturing_places_tags

unknown metadata skipped

This column was skipped by the profiler, so no statistics beyond row count and null rate are available. The name suggests it holds tags for manufacturing locations, likely a multi-valued or list-like field that the dissector could not classify. With 50 rows and 0% nulls reported but no uniqueness or value stats, nothing further can be inferred from the evidence. Treatment: Re-profile after parsing the tag list, then one-hot or multi-label encode top tags. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

photographers_tags

unknown other skipped

The column `photographers_tags` was skipped by the profiler, so no kind, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds tag annotations associated with photographers, likely a list or delimited string, but this cannot be confirmed from the evidence. No further signal is present to characterise distribution, cardinality, or content. Treatment: Re-profile with list/string parsing enabled before deciding on downstream handling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

packaging_text_pl

categorical metadata null_rate imbalance

Polish-language packaging text field that is effectively empty: 90% of the 50 rows are null and the remaining 10% are all the empty string, giving a single observed value and zero entropy. There is no usable signal here, only nulls and blanks. Treatment: Drop; the column carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

informers_tags

unknown other skipped

The column `informers_tags` was skipped by the profiler, so no type, uniqueness, or value statistics were computed beyond a row count of 50 with 0% nulls. Without stats it's impossible to tell whether this holds scalar tags, delimited lists, or nested structures, though the plural name hints at a multi-valued tag field. Treat any interpretation as provisional until the column is re-profiled. Treatment: Re-run profiling with parsing for list/JSON values before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text_en

categorical free_text long_tail

English-language ingredient lists for food products, stored as free-form text rather than a controlled vocabulary. With 36 unique values across 50 rows and entropy ratio 0.93, values are nearly all distinct; the only repeated 'value' is the empty string (9 occurrences, top_rate 0.20), and 12% are null, so roughly a third of rows carry no usable ingredient text. Content is heterogeneous — multi-sentence allergen-tagged lists, percentages, punctuation noise, and at least one junk entry ('Hhhhh'). Treatment: Normalize, tokenize, and embed (or parse into ingredient lists) before modelling; treat empty strings as nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 6 (12.0%)
unique: 36
top_value: (empty string)
top_rate: 0.2045
cardinality: 36
entropy: 4.811
entropy_ratio: 0.9306

ingredients_text_it

categorical free_text long_tail null_rate

Free-form Italian ingredient lists for food products, with 68% nulls and only 50 rows total. Of the 16 non-null entries, 5 are empty strings (top_rate 0.3125) and the remaining values are nearly all unique long product descriptions, yielding 12 distinct values and entropy_ratio 0.913. Effectively unstructured text rather than a categorical field, despite being typed as such. Treatment: Treat as free text: normalize empty strings to null, then tokenize/parse for allergen or ingredient extraction rather than one-hot encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 34 (68.0%)
unique: 12
top_value: (empty string)
top_rate: 0.3125
cardinality: 12
entropy: 3.274
entropy_ratio: 0.9134

origin_de

categorical feature null_rate imbalance

This appears to be a German-origin flag or label, but it carries no information in this sample: 60% of rows are null and the remaining 20 rows all hold the empty string, giving a single unique value and zero entropy. There is no signal to model on here. Treatment: Drop; constant column with majority nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

nova_group

numeric feature high_skew

This is the NOVA food classification group (1-4 scale) indicating processing level, with 3 unique values present across 50 rows and a 4% null rate. The distribution is heavily skewed toward ultra-processed foods: median is 4.0, Q1-Q3 spans 3-4, and skew of -2.06 with kurtosis 5.65 confirms a long left tail with one outlier at the low end. Despite being numeric, only 3 of the 4 possible NOVA categories appear in this sample. Treatment: Treat as ordinal categorical rather than continuous; impute the 4% nulls with median (4) or a missing-indicator. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 3
min: 1
max: 4
mean: 3.646
median: 4
std: 0.601
q1: 3
q3: 4
iqr: 1
skew: -2.062
kurtosis: 5.651
n_outliers: 1
outlier_rate: 0.02083
zero_rate: 0
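
An ordinal encoding with a missing indicator could be sketched as below. The counts (33 rows at NOVA 4, 14 at NOVA 3, 1 at NOVA 1) come from the nova_groups table earlier in this report, and it is an assumption that nova_group holds the same values; the 2 nulls are added to match the reported null count. The helper `encode_nova` is hypothetical.

```python
from statistics import mean

NOVA_LEVELS = (1, 2, 3, 4)  # ordered from unprocessed to ultra-processed

def encode_nova(value):
    """Return (ordinal_group, is_missing); the group is None when missing."""
    if value is None:
        return None, 1
    group = int(value)
    if group not in NOVA_LEVELS:
        raise ValueError(f"unexpected NOVA group: {value!r}")
    return group, 0

# Counts taken from the nova_groups table, plus the 2 reported nulls.
values = [4] * 33 + [3] * 14 + [1] + [None] * 2

encoded = [encode_nova(v) for v in values]
observed = [group for group, missing in encoded if not missing]

print(round(mean(observed), 3))
print(sum(missing for _, missing in encoded))
```

The observed mean should agree with the reported 3.646, which is a quick check that the two columns describe the same distribution.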

packaging_text_fi

categorical free_text null_rate imbalance

Finnish-language packaging text field that is effectively empty in this sample. 90% of the 50 rows are null and the remaining 5 rows all hold the empty string, giving a single observed value and zero entropy. Treatment: Drop; column carries no information in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

states

categorical feature long_tail

This column packs an Open Food Facts-style product completion checklist into a single comma-joined string of `en:*-completed` / `en:*-to-be-completed` tags covering nutrition, ingredients, photos, packaging, etc. With 26 unique combinations across just 50 rows (entropy ratio 0.91) and the most common state appearing only 8 times, it behaves like a long-tail composite status flag rather than a clean category. The values are clearly multi-valued — they should be split into individual status tags before any modelling. Treatment: Split on comma and one-hot encode each `en:*` tag instead of treating the concatenated string as a single category. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 26
top_value: en:to-be-completed, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-completed, en:packaging-code-to-be-completed, en:characteristics-to-be-completed, en:origins-to-be-completed, en:categories-completed, en:brands-completed, en:packaging-completed, en:quantity-completed, en:product-name-completed, en:photos-validated, en:packaging-photo-selected, en:nutrition-photo-selected, en:ingredients-photo-selected, en:front-photo-selected, en:photos-uploaded
top_rate: 0.16
cardinality: 26
entropy: 4.286
entropy_ratio: 0.9119
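
A minimal sketch of the split-and-encode treatment described above, assuming the cells are joined with a comma followed by optional whitespace (as in the top_value shown). The function names are illustrative and the two example cells are shortened, hypothetical combinations built from tags that appear in the top value.

```python
def split_tags(cell):
    """Split one comma-joined states string into clean tag names."""
    return [tag.strip() for tag in cell.split(",") if tag.strip()]

def multi_hot(cells):
    """Return (vocabulary, rows) where each row is a 0/1 vector over the vocabulary."""
    vocab = sorted({tag for cell in cells for tag in split_tags(cell)})
    rows = []
    for cell in cells:
        present = set(split_tags(cell))
        rows.append([int(tag in present) for tag in vocab])
    return vocab, rows

# Shortened, hypothetical cells for illustration only.
cells = [
    "en:to-be-completed, en:nutrition-facts-completed, en:photos-uploaded",
    "en:to-be-completed, en:photos-uploaded",
]
vocab, rows = multi_hot(cells)
print(vocab)
print(rows)
```

Each tag becomes its own binary column, so 26 distinct concatenated strings collapse into a much smaller set of reusable status flags.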

ingredients_with_unspecified_percent_sum

numeric feature

This column appears to be a per-record sum of ingredient percentages where the precise share was not specified, expressed on a 0–100 scale (max 100.0, min 0.4). The distribution is heavily left-skewed (skew -1.18) with a median of 100.0 and Q3 also at 100.0, meaning at least half of the 50 rows have effectively all of their ingredient mass unspecified. Only 22 unique values across 50 rows and a mean of 79.4 confirm the concentration at the upper bound. Treatment: Treat as a data-quality indicator; consider binarizing (e.g. =100 vs <100) rather than using the raw left-skewed value. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 22
min: 0.4
max: 100
mean: 79.42
median: 100
std: 31.64
q1: 53.6
q3: 100
iqr: 46.4
skew: -1.183
kurtosis: -0.133
n_outliers: 0
outlier_rate: 0
zero_rate: 0

added_countries_tags

unknown other skipped

The column 'added_countries_tags' was skipped by the profiler, so no type, uniqueness, or distributional statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds country tags associated with record additions, likely a list-like or multi-valued field that the profiler could not classify. Without further stats, nothing can be said about cardinality, format, or content. Treatment: Inspect raw values manually and re-profile after parsing into a normalized list type. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

id

categorical identifier long_tail

This column is a unique row identifier, with all 50 values distinct (n_unique=50, entropy_ratio=1.0) and no nulls. The values look like product barcodes (mostly 13-digit EAN/GTIN strings such as '6111242100992', with at least one shorter numeric like '20995553'), suggesting a product-level key rather than a sequential surrogate ID. The long_tail alert simply reflects that every value occurs exactly once (top_rate=0.02). Treatment: Drop from modelling; retain as a join key for linking to product metadata. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: 6111242100992
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

nutrient_levels

unknown other skipped

The column 'nutrient_levels' was skipped by the profiler, so its kind is unknown and no descriptive statistics are available. We only know it has 50 rows with a 0.0 null rate; uniqueness, distribution, and value structure are all missing from the evidence. Treatment: Re-profile or manually inspect a sample to determine the underlying type before any downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

sortkey

numeric timestamp high_skew outliers

Values range from 1,567,543,172 to 1,610,897,644 with a median of 1,608,147,866 — consistent with Unix epoch seconds spanning roughly 2019 to early 2021, masquerading as a numeric sort key. Distribution is heavily left-skewed (skew -2.78, kurtosis 8.09) with 4 outliers (9.1%) trailing toward older timestamps, and 12% of rows are null. The tight IQR of ~6.16M seconds (~71 days) versus a 43M-second range confirms most records cluster late in the window. Treatment: Convert from epoch seconds to datetime and use as a temporal feature rather than a raw numeric. high · anthropic:claude-opus-4-7

n: 50
nulls: 6 (12.0%)
unique: 44
min: 1.568e+09
max: 1.611e+09
mean: 1.605e+09
median: 1.608e+09
std: 8.692e+06
q1: 1.604e+09
q3: 1.61e+09
iqr: 6.16e+06
skew: -2.782
kurtosis: 8.091
n_outliers: 4
outlier_rate: 0.09091
zero_rate: 0
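
The conversion suggested above can be sketched as follows, assuming the values really are Unix epoch seconds in UTC (the profile only says they are consistent with that). The helper name `epoch_to_utc` is hypothetical; the two inputs are the min and max quoted in the narrative.

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime; pass None through."""
    if seconds is None:
        return None
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

earliest = epoch_to_utc(1_567_543_172)  # reported minimum
latest = epoch_to_utc(1_610_897_644)    # reported maximum
span_days = (latest - earliest).days

print(earliest.date(), latest.date(), span_days)
```

Passing `tz=timezone.utc` avoids results that depend on the local timezone of the machine running the conversion.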

image_small_url

categorical identifier long_tail

Per-row URL pointing to a 200px product front image hosted on images.openfoodfacts.org, with French/English locale suffixes embedded in the filename. All 50 rows are unique with zero nulls, so this acts as a row-level asset reference rather than a feature. The path segments encode the product barcode (e.g. 6111242100992), making this effectively a derivable identifier. Treatment: Drop from modelling; retain as a fetch URL for image pipelines or extract the embedded barcode. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.200.jpg
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1
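
Extracting the embedded barcode might be done along these lines. This is a sketch that assumes the barcode is exactly the digits in the folder segments between `/products/` and the file name, as in the top_value above; other URL layouts are not covered and the function name is hypothetical.

```python
from urllib.parse import urlparse

def barcode_from_image_url(url):
    """Recover the barcode from the folder segments after /products/, or None."""
    path = urlparse(url).path
    marker = "/products/"
    if marker not in path:
        return None
    tail = path.split(marker, 1)[1]
    folder = tail.rsplit("/", 1)[0]   # drop the file name
    code = folder.replace("/", "")    # join the split barcode folders
    return code if code.isdigit() else None

url = "https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.200.jpg"
print(barcode_from_image_url(url))
```

The result can be compared against the `id` column as a consistency check before the URL column is dropped from modelling.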

packaging_recycling_tags

unknown other skipped

The column 'packaging_recycling_tags' was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds packaging or recyclability tags, likely a multi-valued or list-like field that the profiler could not categorise. Without parsed values, nothing can be said about cardinality, distribution, or label vocabulary. Treatment: Re-profile after parsing into a list or one-hot tag set before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

food_groups

categorical feature

This is a categorical food taxonomy field using Open Food Facts-style prefixed slugs (e.g., 'en:biscuits-and-cakes'). The distribution is heavily concentrated on sweets: 'en:biscuits-and-cakes' (17/49) and 'en:chocolate-products' (16/49) together account for roughly two-thirds of non-null rows, with 11 distinct categories across 50 records and a 2% null rate. Entropy ratio of 0.74 confirms moderate concentration rather than uniform spread. Treatment: Strip the 'en:' prefix and one-hot or target-encode; consider grouping the long tail of single-occurrence categories into 'other'. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 11
top_value: en:biscuits-and-cakes
top_rate: 0.3469
cardinality: 11
entropy: 2.549
entropy_ratio: 0.7367
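
One possible shape for the prefix-stripping and rare-category grouping, as a sketch. The `min_count` threshold of 2 and the function name are assumptions, and `en:cheese` in the sample is a made-up slug used only to show a singleton being bucketed.

```python
from collections import Counter

def clean_food_groups(values, min_count=2, prefix="en:"):
    """Strip the language prefix, then map categories seen fewer than min_count times to 'other'."""
    stripped = [
        v[len(prefix):] if v is not None and v.startswith(prefix) else v
        for v in values
    ]
    counts = Counter(v for v in stripped if v is not None)
    return [
        v if v is None or counts[v] >= min_count else "other"
        for v in stripped
    ]

# Illustrative values; 'en:cheese' is hypothetical.
sample = [
    "en:biscuits-and-cakes", "en:biscuits-and-cakes",
    "en:chocolate-products", "en:chocolate-products",
    "en:cheese", None,
]
cleaned = clean_food_groups(sample)
print(cleaned)
```

Nulls are left untouched here so that imputation can be decided separately from the grouping step.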

nova_groups_markers

unknown other skipped

Column 'nova_groups_markers' was skipped by the profiler, so no type, uniqueness, or distribution stats are available beyond a row count of 50 and a null rate of 0.0. The name suggests it carries NOVA food-classification group markers, likely a list or structured field that the dissector could not parse. Without parsed values, nothing further can be said about its content. Treatment: Inspect raw values manually and reparse (likely a list/struct) before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

packaging_text_de

categorical free_text null_rate

German-language packaging description field, almost entirely unpopulated. With a 60% null rate and the empty string accounting for 19 of the 20 non-null rows (95% top_rate), only one row carries actual content ("1 Folie aus 22 PAP zum Recyclen"). Cardinality of 2 and entropy ratio of 0.29 confirm there is virtually no usable signal here. Treatment: Drop; effectively empty with only one informative value. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 2
top_value: "" (empty string)
top_rate: 0.95
cardinality: 2
entropy: 0.2864
entropy_ratio: 0.2864
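
The entropy figures in these blocks can be reproduced with the sketch below. It assumes Shannon entropy in bits over non-null values, and entropy_ratio defined as entropy divided by log2(cardinality). That definition is inferred from the reported numbers, not from profiler documentation. The value list rebuilds this column's reported shape: 30 nulls, 19 empty strings, and one real value.

```python
from collections import Counter
from math import log2

def entropy_stats(values):
    """Return (entropy_bits, entropy_ratio) over non-null values."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    n = len(observed)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    k = len(counts)
    # With a single distinct value the ratio is defined here as 0.
    ratio = entropy / log2(k) if k > 1 else 0.0
    return entropy, ratio

values = [None] * 30 + [""] * 19 + ["1 Folie aus 22 PAP zum Recyclen"]
entropy, ratio = entropy_stats(values)
print(round(entropy, 4), round(ratio, 4))
```

With only two distinct values, log2(cardinality) is 1, which is why entropy and entropy_ratio coincide for this column.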

categories_lc

categorical feature

This column appears to hold lowercase ISO language codes, with 6 distinct values across 50 rows and no nulls. The distribution is dominated by 'fr' (25) and 'en' (19), together covering 44 of 50 rows, while 'es', 'de', 'it', and 'pl' form a thin long tail. Entropy ratio of 0.63 reflects this Franco-English skew rather than a balanced multilingual mix. Treatment: One-hot encode, optionally collapsing rare codes (it, pl, de, es) into an 'other' bucket. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 6
top_value: fr
top_rate: 0.5
cardinality: 6
entropy: 1.628
entropy_ratio: 0.6297

checkers

unknown other skipped

The column 'checkers' was skipped by the profiler, so its data type and value distribution are unknown. Only the row count (50) and null rate (0.0) are reported; n_unique and all other statistics are missing. Without further inspection there is no basis to infer what this column represents. Treatment: Re-profile or manually inspect the column before any downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

packaging_text_es

categorical free_text null_rate

Spanish-language packaging description, populated for almost none of the rows. 60% are null and of the 20 non-null entries, 19 are empty strings, leaving exactly one real value describing a recyclable cardboard box and plastic tray. Effective cardinality is 2 and entropy ratio is 0.29, so this column carries virtually no signal in this sample. Treatment: Drop unless a larger sample shows meaningful Spanish text coverage. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 2
top_value: "" (empty string)
top_rate: 0.95
cardinality: 2
entropy: 0.2864
entropy_ratio: 0.2864

unknown_nutrients_tags

unknown other skipped

This column is labelled `unknown_nutrients_tags` and was skipped by the profiler, so no descriptive statistics, uniqueness count, or value samples are available. The only confirmed signals are that all 50 rows are non-null and the column kind is reported as 'unknown'. Without further evidence its content and structure cannot be characterised. Treatment: Re-profile with type inference enabled before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

editors_tags

unknown other skipped

The column `editors_tags` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 50 and a null rate of 0.0. Without sample values or a detected kind, the content and structure are unknown — the name suggests editor-assigned tags, possibly a list or delimited string, but this is not confirmed by evidence. Treatment: Re-profile with list/string parsing enabled to determine structure before any downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nutrition_score_warning_fruits_vegetables_nuts_estimate_from_ingredients

numeric metadata constant

This is a nutrition-score warning flag indicating whether fruit/vegetable/nut content was estimated from ingredients. Every one of the 45 non-null rows holds the value 1.0, and 10% of rows are null — so the column carries no discriminative signal in this sample, only a presence/absence distinction. Treatment: Drop as a feature; optionally retain a binary is_null indicator if the missingness itself is meaningful. high · anthropic:claude-opus-4-7

n: 50
nulls: 5 (10.0%)
unique: 1
min: 1
max: 1
mean: 1
median: 1
std: 0
q1: 1
q3: 1
iqr: 0
skew: 0
kurtosis: 0
n_outliers: 0
outlier_rate: 0
zero_rate: 0

labels_lc

categorical label

This column appears to be a lowercase ISO language code label, with 6 distinct values across 50 rows and one null. English and French dominate at 22 occurrences each (44 of 49 non-null rows), leaving es, de, it, and pl with just 1-2 rows each, 5 rows in total: a near-binary distribution despite the multilingual appearance. Entropy ratio of 0.61 confirms the imbalance. Treatment: Group rare codes (es/de/it/pl) into 'other' before stratifying or one-hot encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 6
top_value: en
top_rate: 0.449
cardinality: 6
entropy: 1.57
entropy_ratio: 0.6072

nutriscore_data

unknown other skipped

The column 'nutriscore_data' was skipped by the profiler, so its kind, uniqueness, and value distribution are unknown. The only confirmed signals are 50 rows with a 0.0 null rate. Without further stats, the contents (likely a nested Nutri-Score payload given the name) cannot be characterised. Treatment: Re-profile with nested/struct support enabled before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

other_nutritional_substances_tags

unknown other skipped

This column is flagged as skipped by the profiler, so no statistics beyond row count (50) and a null rate of 0.0 were computed. The name suggests it holds tag-style annotations for additional nutritional substances, likely a delimited or list-valued field that the dissector could not type. Without unique counts or value samples, its actual content and cardinality remain unverified. Treatment: Manually inspect raw values and re-profile as a multi-label tag list before deciding to encode or drop. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

product_name_nb

categorical metadata long_tail null_rate

Norwegian product name field (suffix _nb suggests Bokmål locale) that is almost entirely empty: 96% null across 50 rows, leaving only 2 non-null observations with one being an empty string and the other '99% mørk sjokolade'. With just two distinct values and effectively no signal, this column cannot support analysis as-is. Treatment: Drop unless joined to a richer localized catalog; null rate is too high to model. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: "" (empty string)
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

nutrition_data_prepared_per

categorical metadata imbalance

This column records the basis on which nutrition data is reported, and every one of the 50 rows carries the single value "100g". With cardinality of 1, entropy of 0, and a top_rate of 1.0, the field provides no discriminating information whatsoever. Treatment: Drop; constant column carries no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 1
top_value: 100g
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_quantity

categorical feature long_tail

Numeric product quantities stored as strings, treated here as categorical with 27 distinct values across 50 rows. The mode '100' covers 23.4% of non-nulls, but entropy ratio of 0.90 confirms a long tail with most other values appearing only once or twice. Note 6% nulls and the presence of '0' as a quantity, which may be a missing-value placeholder rather than a real package size. Treatment: Cast to numeric and treat as a quantitative feature; investigate zeros and nulls before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 3 (6.0%)
unique: 27
top_value: 100
top_rate: 0.234
cardinality: 27
entropy: 4.287
entropy_ratio: 0.9017
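
A hedged sketch of the numeric cast. Treating '0' as missing is an assumption for illustration (the profile only says zeros need investigating), and the helper name `parse_quantity` is hypothetical.

```python
def parse_quantity(raw):
    """Cast a quantity string to float; return None for nulls, junk, and non-positive values."""
    if raw is None:
        return None
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    # Assumption: a quantity of 0 (or NaN) is a placeholder, not a real size.
    return value if value > 0 else None

parsed = [parse_quantity(v) for v in ["100", "0", None, "abc", "250"]]
print(parsed)
```

If inspection shows that zeros are genuine, the final comparison can be relaxed to keep them.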

product_type

categorical metadata imbalance

This is a categorical column recording product type, but every one of the 50 rows holds the same value, "food". Cardinality is 1 and entropy is 0, so the column carries no information for modelling or segmentation. Treatment: Drop; constant column with zero entropy. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 1
top_value: food
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

checkers_tags

unknown other skipped

The column `checkers_tags` was skipped by the profiler, so its kind is unknown and no statistics (uniqueness, value distribution, type) were computed. Only the row count (50) and null rate (0.0) are available; everything else is missing. The name suggests it may hold tag-like values associated with a checkers process, but this cannot be confirmed from the evidence. Treatment: Re-run profiling or manually inspect a sample before deciding how to use this column. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nucleotides_tags

unknown other skipped

The column 'nucleotides_tags' was skipped by the profiler, so no type, uniqueness, or distribution statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds tag-style annotations related to nucleotides, likely a list or delimited string, but this cannot be confirmed from the evidence. No further signal is present to characterise its values. Treatment: Re-profile with list/string parsing enabled before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

languages_tags

unknown metadata skipped

This column is named languages_tags, suggesting it holds language metadata (likely tag strings such as locale codes) for each record. Saturn skipped detailed profiling, so no cardinality, distribution, or value samples are available beyond a row count of 50 and a null rate of 0.0. Without uniqueness or value stats, no surprises can be flagged. Treatment: Re-profile or inspect raw values to determine structure before deciding whether to split tags and one-hot encode. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

traces_lc

categorical feature

This is a low-cardinality categorical column holding lowercase language codes (fr, en, es, de, it, pl), almost certainly a detected or declared language tag. The distribution is heavily concentrated on French (23 of 48 non-null rows) and English (20 of 48), with the top value covering 47.9% of non-null rows and entropy ratio of 0.61. Four percent of rows are null and three languages appear only once, so any per-language analysis will be unstable beyond fr/en. Treatment: Keep fr/en as-is and bucket de/es/it/pl into an 'other' category before encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 6
top_value: fr
top_rate: 0.4792
cardinality: 6
entropy: 1.575
entropy_ratio: 0.6093

categories_hierarchy

unknown other skipped

The column `categories_hierarchy` was skipped by the profiler, so no type, uniqueness, or distribution stats are available. The name suggests a nested or path-like categorical structure (e.g., taxonomy levels), but this cannot be confirmed from the evidence. Only the row count (50) and null rate (0.0) are known. Treatment: Re-profile after parsing the hierarchy (e.g., split into level columns) before deciding on encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

image_front_small_url

categorical metadata long_tail

URLs pointing to small front-of-pack product images on the Open Food Facts CDN, one per row. Every one of 50 values is unique (entropy_ratio 1.0, top_rate 0.02) and there are no nulls, so this acts as a per-product asset link rather than a feature. URLs mix `front_fr` and `front_en` suffixes, hinting at a French/English language mix in the source catalogue. Treatment: Keep as a reference link; drop from modelling or fetch the images separately for vision features. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.200.jpg
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

entry_dates_tags

unknown other skipped

This column was skipped by the profiler, so no statistics beyond row count and null rate are available. The name 'entry_dates_tags' suggests a composite field combining dates and tags, likely a nested or list-like structure that didn't fit a scalar type. With 50 rows and 0% nulls, every record has some value, but its shape is unknown from this evidence. Treatment: Inspect raw values and parse into separate date and tag columns before use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ecoscore_tags

unknown other skipped

The column 'ecoscore_tags' was skipped by the profiler, so no statistics, uniqueness, or value samples are available beyond a row count of 50 and a null rate of 0.0. Based on the name alone, it likely holds Open Food Facts-style ecoscore category tags (e.g., 'en:b'), but this cannot be confirmed from the evidence. The 'skipped' alert is the key signal here and warrants a re-profile with appropriate parsing. Treatment: Re-profile with list/tag-aware parsing before deciding on encoding or drop. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nutrition_score_warning_fruits_vegetables_legumes_estimate_from_ingredients

numeric metadata constant

This appears to be a binary warning flag indicating that the fruits/vegetables/legumes share in a Nutri-Score calculation was estimated from ingredients. Every non-null value is 1.0 (n_unique=1, std=0), and 8% of rows are null, so the column carries no discriminative signal in this sample. Treatment: Drop; constant value provides no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 4 (8.0%)
unique: 1
min: 1
max: 1
mean: 1
median: 1
std: 0
q1: 1
q3: 1
iqr: 0
skew: 0
kurtosis: 0
n_outliers: 0
outlier_rate: 0
zero_rate: 0

ingredients_without_ciqual_codes_n

numeric feature

Counts the number of ingredients in a record that lack a CIQUAL code, so it's a data-quality feature describing how complete the ingredient mapping is. The distribution is right-skewed (skew 1.21) with a median of 3.5 but a max of 22 and one outlier; 18% of rows are already fully mapped (zero_rate 0.18). Only 15 unique values across 50 rows, so it behaves like a small ordinal count. Treatment: Treat as a count; consider log1p or a binary 'fully mapped' flag before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 15
min: 0
max: 22
mean: 4.98
median: 3.5
std: 4.825
q1: 1
q3: 7.75
iqr: 6.75
skew: 1.208
kurtosis: 1.491
n_outliers: 1
outlier_rate: 0.02
zero_rate: 0.18
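
The two derived features mentioned in the treatment could be built as in this small sketch. The feature names `fully_mapped` and `log1p_unmapped` are illustrative, not taken from the source.

```python
from math import log1p

def count_features(n_unmapped):
    """Derive a 'fully mapped' flag and a log1p-scaled count from the raw count."""
    return {
        "fully_mapped": n_unmapped == 0,
        "log1p_unmapped": log1p(n_unmapped),  # log(1 + n), defined at n = 0
    }

features = [count_features(n) for n in (0, 1, 22)]
print(features)
```

`log1p` is preferred over a plain logarithm because 18% of rows are zero, where `log` would be undefined.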

rev

numeric feature

Despite the short name, this is unlikely to be revenue: in Open Food Facts, `rev` is the product record's revision counter, so the values most plausibly count edits per record. They span 19 to 674 with a mean of 230 and median of 233.5. Distribution is moderately right-skewed (0.71) with a wide IQR of 237.75 and only one outlier (2%), so spread is large but not pathological. All 50 rows are populated with no zeros and 46 unique values. Treatment: Treat as an edit-activity count; consider a log or sqrt transform before regression to tame the right skew. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 46
min: 19
max: 674
mean: 230
median: 233.5
std: 166.6
q1: 72.75
q3: 310.5
iqr: 237.8
skew: 0.7092
kurtosis: -0.02278
n_outliers: 1
outlier_rate: 0.02
zero_rate: 0

ingredients_non_nutritive_sweeteners_n

numeric feature constant

This column appears to be a count of non-nutritive (artificial) sweeteners listed in a product's ingredients. Across all 50 rows it is exactly 0, with zero_rate of 1.0 and no nulls, so it carries no information in this sample. Treatment: Drop; constant zero provides no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 1
min: 0
max: 0
mean: 0
median: 0
std: 0
q1: 0
q3: 0
iqr: 0
skew: 0
kurtosis: 0
n_outliers: 0
outlier_rate: 0
zero_rate: 1

ingredients_without_ecobalyse_ids_n

numeric feature

This is a count of ingredients on a product that lack Ecobalyse identifiers, ranging from 0 to 29 with a median of 6.5 and mean 8.16. The distribution is right-skewed (skew 1.28) with one outlier at the high end, suggesting most products have a handful of unmapped ingredients while a few have many. Only 2% are zero, meaning nearly every row has at least one ingredient missing an Ecobalyse ID — a notable data-coverage gap. Treatment: Use as-is or log-transform if feeding into a regression; treat as a coverage-quality signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 20
min: 0
max: 29
mean: 8.16
median: 6.5
std: 5.898
q1: 4
q3: 11
iqr: 7
skew: 1.28
kurtosis: 1.743
n_outliers: 1
outlier_rate: 0.02
zero_rate: 0.02

environment_impact_level_tags

unknown other skipped

This column was skipped by the profiler, so no type, uniqueness, or distribution stats are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds tags describing environmental impact levels, likely a list-valued or multi-label field that the profiler could not classify. Without parsed values there is no way to confirm cardinality, label vocabulary, or whether tags are single- or multi-valued. Treatment: Re-profile after parsing the tag structure (e.g., split lists) before deciding on encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

last_image_dates_tags

unknown other skipped

This column is named `last_image_dates_tags`, suggesting it holds image-related dates and tags, but Saturn skipped profiling so type and content cannot be confirmed. The only evidence available is 50 rows with no nulls; uniqueness, distribution, and value samples are all missing. Treatment: Inspect raw values manually to determine structure before deciding on parsing or modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

labels_hierarchy

unknown other skipped

This column was skipped by the profiler, so its contents are uncharacterised beyond a row count of 50 and a null rate of 0.0. The name suggests a nested or structured label taxonomy, which likely tripped the profiler's type detection. No uniqueness, value, or distribution statistics are available to confirm. Treatment: Inspect raw values manually and parse the hierarchy (e.g., split on delimiter or expand JSON) before profiling again. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

product_name_en

categorical free_text long_tail

Free-text English product names with 34 unique values across 50 rows and high entropy ratio (0.91), indicating heavy diversity. Notable issues: 14% nulls plus an empty-string value taking the top slot at 23.3% (10 occurrences), so effective missingness is much higher than null_rate alone suggests. Values mix languages (e.g., 'Edelbitter-Schokolade', 'Chocolat noir', 'tonik') and include junk like 'Hhhhh', flagged as long_tail. Treatment: Normalise empties to null, language-detect, then tokenize/embed before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 7 (14.0%)
unique: 34
top_value: "" (empty string)
top_rate: 0.2326
cardinality: 34
entropy: 4.654
entropy_ratio: 0.9147
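
To make the "effective missingness" point concrete, the sketch below counts nulls and blank strings together. The list reproduces the reported counts (7 nulls, 10 empty strings, 33 other rows); the non-empty names are placeholders, not real product names.

```python
def effective_missing_rate(values):
    """Share of rows that are null or contain only whitespace."""
    missing = sum(1 for v in values if v is None or not v.strip())
    return missing / len(values)

# 7 nulls + 10 empty strings + 33 populated rows (placeholder text).
values = [None] * 7 + [""] * 10 + ["placeholder name"] * 33
rate = effective_missing_rate(values)
print(f"{rate:.0%}")
```

On these counts the effective missing rate is 17 of 50 rows, well above the 14% null rate reported on its own.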

nutrition_score_warning_fruits_vegetables_legumes_estimate_from_ingredients_value

numeric feature high_skew outliers

This appears to be a Nutri-Score warning value estimating fruit/vegetable/legume content from ingredients. The distribution is dominated by zeros (zero_rate 0.89, median and IQR both 0), with a handful of extreme values pushing the max to 50 and producing severe skew (5.93) and kurtosis (35.2). The five non-zero rows are all flagged as outliers (10.9% rate); they alone lift the mean to 1.65 and account for the std of 7.55. A further 8% of rows are null. Treatment: Binarize (zero vs non-zero) or winsorize before modelling given the heavy zero mass and extreme skew. high · anthropic:claude-opus-4-7

n: 50
nulls: 4 (8.0%)
unique: 6
min: 0
max: 50
mean: 1.652
median: 0
std: 7.551
q1: 0
q3: 0
iqr: 0
skew: 5.932
kurtosis: 35.23
n_outliers: 5
outlier_rate: 0.1087
zero_rate: 0.8913
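
The outlier count here is consistent with a 1.5 × IQR (Tukey fence) rule: with Q1 = Q3 = 0 the fences collapse to zero, so every non-zero row is flagged. Whether the profiler actually uses this rule is an assumption. In the sketch below only the maximum of 50 comes from the report; the other four non-zero values are hypothetical.

```python
def tukey_outliers(values, q1, q3, k=1.5):
    """Return values outside [q1 - k*IQR, q3 + k*IQR]."""
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# 41 zeros plus 5 non-zero rows; all non-zero values except 50.0 are made up.
values = [0.0] * 41 + [50.0, 10.0, 8.0, 5.0, 3.0]
outliers = tukey_outliers(values, q1=0.0, q3=0.0)
print(len(outliers), len(outliers) / len(values))
```

This is why binarizing (zero vs non-zero) loses little: the outlier flag and the non-zero flag identify the same rows.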

traces

categorical feature long_tail

This column holds comma-separated allergen/ingredient trace tags with an `en:` language prefix (e.g. `en:milk,en:nuts`), so each cell is a multi-label set rather than a single category. Across 50 rows there are 23 distinct combinations and entropy ratio 0.87, indicating high diversity, and the most common value is the empty string at 22% (11 rows) — meaning missing-as-empty rather than a true null (null_rate 0.0). The long_tail alert reflects many combinations appearing only once or twice. Treatment: Split on commas and multi-hot encode the individual `en:` tags; treat empty string as missing. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 23
top_value: "" (empty string)
top_rate: 0.22
cardinality: 23
entropy: 3.922
entropy_ratio: 0.8671

generic_name_fi

categorical free_text long_tail null_rate

Finnish-language generic product name field, populated for only 5 of 50 rows (90% null). Among the 5 present values, all are unique with maximum entropy (2.32, ratio 1.0), and casing inconsistencies appear ("Tumma suklaa" vs "tumma suklaa") plus one empty string counted as a value. Treatment: Normalise case and treat empty strings as null; too sparse (90% missing) to use as a feature without imputation or dropping. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 5
top_value: Hieno tumma suklaa jossa 90% kaakaota
top_rate: 0.2
cardinality: 5
entropy: 2.322
entropy_ratio: 1

emb_codes_orig

categorical metadata long_tail null_rate

This appears to be original packaging or establishment codes (EMB-prefixed identifiers used on European food labels), kept in raw form. The column is sparsely populated: 34% are null and among the remaining rows the empty string dominates at roughly 85% (top_rate 0.848), leaving only 5 distinct values across 50 rows. One entry is not a code at all but a company name pair (SOLENT GMBH & CO. KG,SCHWARZ BETEILIGUNGS GMBH), suggesting inconsistent source formatting. Treatment: Normalise empty strings to null and parse/validate the EMB code pattern before use; too sparse to model directly. medium · anthropic:claude-opus-4-7

n: 50
nulls: 17 (34.0%)
unique: 5
top_value: "" (empty string)
top_rate: 0.8485
cardinality: 5
entropy: 0.9048
entropy_ratio: 0.3897

ingredients_with_specified_percent_n

numeric feature

Counts the number of ingredients whose percentage is explicitly specified on a product label. The distribution is heavily zero-inflated (zero_rate 0.58) with median 0 and mean 1.1, but a long right tail reaches 8 (skew 1.88, kurtosis 3.68), and only 7 distinct values appear across 50 rows. Treatment: Treat as a count feature; consider a binary 'has_specified_percent' flag plus log1p transform to tame the skew. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 7
min: 0
max: 8
mean: 1.1
median: 0
std: 1.729
q1: 0
q3: 2
iqr: 2
skew: 1.878
kurtosis: 3.676
n_outliers: 1
outlier_rate: 0.02
zero_rate: 0.58

nutrition_grades

categorical label

This is a Nutri-Score-style nutrition grade for each item, with six observed levels (a-e plus 'unknown'). The distribution is heavily skewed toward the worst grade: 'e' accounts for 27 of 50 rows (top_rate 0.54), while 'a' and 'b' together appear only 6 times. One row carries the literal value 'unknown' rather than null, so null_rate is 0.0 despite missing information. Treatment: Treat as ordered categorical (a through e). high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 6
top_value: e
top_rate: 0.54
cardinality: 6
entropy: 1.913
entropy_ratio: 0.7399
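
An ordered encoding of the grades might look like the sketch below. Mapping the literal 'unknown' to a missing value is an assumption based on the note above that it stands in for missing information; the names `GRADE_RANK` and `grade_to_rank` are illustrative.

```python
# Ordinal ranks: 'a' is the best Nutri-Score grade, 'e' the worst.
GRADE_RANK = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}

def grade_to_rank(grade):
    """Map a grade letter to its ordinal rank; anything else (e.g. 'unknown') becomes None."""
    if grade is None:
        return None
    return GRADE_RANK.get(grade.strip().lower())

ranks = [grade_to_rank(g) for g in ["e", "d", "a", "unknown", None]]
print(ranks)
```

Keeping the ranks as integers preserves the ordering for models that can use it, while the missing entries can still be handled by a separate indicator.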

weighers_tags

unknown other skipped

The column `weighers_tags` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 50 and a null rate of 0.0. Without `n_unique` or any sample values it is impossible to tell whether this holds tag strings, arrays, or something else. The name suggests a multi-valued tag field associated with 'weighers', which would explain why the profiler couldn't fit it into a standard kind. Treatment: Re-profile with array/text handling enabled to determine structure before use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

categories_tags

unknown other skipped

The column `categories_tags` was skipped by the profiler, so no type, cardinality, or distribution stats are available beyond the row count (n=50) and a null rate of 0.0. The name suggests a multi-valued tag field (e.g., comma- or colon-separated category labels), but this cannot be confirmed from the evidence. No further signal is present. Treatment: Re-profile with parsing for delimited tag lists before deciding on encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

image_url

categorical identifier long_tail

This column holds Open Food Facts product image URLs, one per row, all pointing to front-of-package JPEGs at 400px width. Every one of the 50 values is unique (entropy_ratio 1.0, top_rate 0.02), so it functions as a per-row asset link rather than a categorical feature. URL paths mix _fr and _en locale suffixes, hinting at a multilingual product catalog. Treatment: Drop from modelling; retain as a reference link or fetch for downstream image features. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.400.jpg
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

sources

unknown other skipped

The column "sources" was skipped by the profiler, so its kind is unknown and no statistics (uniqueness, value distribution, type) were computed. Only two facts are available: 50 rows were seen and none were null. Without further inspection, nothing can be said about its content or structure. Treatment: Re-profile or manually inspect this column before any downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

languages_hierarchy

unknown other skipped

The column 'languages_hierarchy' was skipped by the profiler, so no statistics are available beyond a row count of 50 and a null rate of 0. The name suggests a nested or structured representation of languages (likely a list or path-like string), but the dissector did not characterize its values. No uniqueness, length, or value-distribution signals are present to confirm. Treatment: Re-profile with a parser that handles nested/structured values before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

pnns_groups_1

categorical feature

This is a PNNS food group classifier with 7 distinct categories and no nulls across 50 rows. The distribution is severely imbalanced: 'Sugary snacks' accounts for 76% of records, with entropy ratio just 0.48, suggesting the sample is dominated by one food type. Two rows are explicitly labeled 'unknown', and four other categories appear only once or twice each. Treatment: One-hot encode, but expect the 'Sugary snacks' class to dominate any downstream model. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 7
top_value: Sugary snacks
top_rate: 0.76
cardinality: 7
entropy: 1.36
entropy_ratio: 0.4846

countries_lc

categorical feature

Lowercase ISO-style language or country codes with 6 distinct values across 50 rows and a 2% null rate. The distribution is heavily English-dominant (en at 28, top_rate 0.57) followed by fr at 16, leaving es/de/it/pl as singletons or near-singletons. Entropy ratio of 0.59 confirms the long tail is thin and unlikely to support per-class modelling. Treatment: Group rare codes into an 'other' bucket and one-hot encode; impute the 2% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 6
top_value: en
top_rate: 0.5714
cardinality: 6
entropy: 1.521
entropy_ratio: 0.5883

additives_tags

unknown other skipped

The column `additives_tags` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests a list-style field enumerating food additive identifiers (e.g., E-numbers), but this cannot be confirmed from the evidence. No distributional signal is present to flag. Treatment: Re-profile with list/string parsing enabled, then explode tags for one-hot or multi-label encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —
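
Once the raw values have been inspected, the explode-and-encode step suggested above could look like the sketch below. It assumes each cell is either a list of tag strings or a comma-delimited string, which the profiler output does not confirm. The sample tags (`en:e322`, `en:e330`) and the helper names `parse_tags` and `multi_hot` are hypothetical, not values or functions taken from this sample.

```python
def parse_tags(cell):
    """Return a list of tag strings from a list-valued or comma-delimited cell."""
    if cell is None:
        return []
    if isinstance(cell, (list, tuple)):
        parts = [str(t) for t in cell]
    else:
        parts = str(cell).split(",")
    return [p.strip() for p in parts if p.strip()]


def multi_hot(cells):
    """Build a sorted tag vocabulary and one 0/1 row per input cell."""
    parsed = [parse_tags(c) for c in cells]
    vocab = sorted({t for tags in parsed for t in tags})
    rows = [[1 if t in tags else 0 for t in vocab] for tags in parsed]
    return vocab, rows


# Hypothetical cells covering the shapes the parser is meant to tolerate.
cells = [["en:e322", "en:e330"], "en:e322", [], None]
vocab, rows = multi_hot(cells)
```

The same helper would apply to the other skipped `*_tags` columns if they turn out to share this structure.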

codes_tags

unknown other skipped

Column `codes_tags` was skipped by the profiler, so no type inference, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests a tag or code list (likely a delimited or array-valued field), but this cannot be confirmed from the evidence. Without `n_unique` or any sampled values, no distributional claims can be made. Treatment: Re-profile with array/string parsing enabled before deciding on a downstream transform. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

countries_tags

unknown feature skipped

The column `countries_tags` was skipped by the profiler (kind=unknown) so no statistics were computed beyond a 50-row count with 0% nulls. Based solely on the name, it likely holds country tag strings (e.g., comma- or colon-delimited slugs), but uniqueness, cardinality, and value distribution are all unknown here. Treat any interpretation as provisional until the column is reparsed. Treatment: Reparse as a delimited tag list, then explode and one-hot or multi-label encode before modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

creator

categorical metadata long_tail

Usernames of the people or bots that created each record, with 13 distinct creators across 50 rows and no nulls. Two accounts dominate: 'openfoodfacts-contributors' at 46% (23 rows) and 'kiliweb' at 15 rows, together covering 76% of the column, while the remaining creators each appear once or twice — a classic long tail flagged in alerts. Treatment: Collapse rare creators into an 'other' bucket before any encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 13
top_value: openfoodfacts-contributors
top_rate: 0.46
cardinality: 13
entropy: 2.351
entropy_ratio: 0.6353
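
The 'other' bucket suggested above can be sketched as follows. The two dominant handles and their counts come from this card; the tail handles (`user_0` to `user_10`) and the `min_count` threshold are hypothetical choices made only so the example totals 50 rows across 13 creators.

```python
from collections import Counter


def collapse_rare(values, min_count=3, other="other"):
    """Replace any value seen fewer than min_count times with a shared bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# 23 + 15 rows for the two dominant creators, as reported on the card.
# The remaining 12 rows use placeholder handles (11 creators, one seen twice).
values = (
    ["openfoodfacts-contributors"] * 23
    + ["kiliweb"] * 15
    + [f"user_{i}" for i in range(11)]
    + ["user_0"]
)
collapsed = Counter(collapse_rare(values))
```

With a threshold of 3 the column shrinks from 13 categories to 3, at the cost of losing any distinction among the tail creators.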

ingredients

unknown free_text skipped

This column is named 'ingredients' but saturn skipped profiling it (kind=unknown, no stats computed). Across 50 rows there are zero nulls, but uniqueness, types, and value distribution are all unknown. Based on the name alone it is likely a list or free-text field of recipe components, which is why a generic profiler bailed out. Treatment: Parse into a list and one-hot or tokenize/embed before modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

product_name_nl

categorical free_text long_tail null_rate

Dutch-language product names, but the column is mostly empty: 76% of rows are null, and of the 12 populated rows, 6 are blank strings, leaving only 6 actual names, each distinct (7 unique values once the empty string is counted). The surviving entries are a language mix (English 'Dark absolute', French 'Tartines craquantes multi-céréales', Dutch 'Volkoren cracotte'), so the field is not consistently Dutch despite its name. Treatment: Drop or defer; too sparse and linguistically inconsistent to use without upstream cleanup. high · anthropic:claude-opus-4-7

n: 50
nulls: 38 (76.0%)
unique: 7
top_value: (empty string)
top_rate: 0.5
cardinality: 7
entropy: 2.292
entropy_ratio: 0.8166

ingredients_n_tags

unknown other skipped

The column 'ingredients_n_tags' was skipped by the profiler, so no statistics, uniqueness, or type information are available beyond a row count of 50 and a null rate of 0.0. The name suggests a count of ingredient tags, but this cannot be confirmed from the evidence. Without stats, any downstream assumption about its distribution or role is unsupported. Treatment: Re-profile with type coercion to determine whether this is numeric before use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

origin_es

categorical metadata null_rate imbalance

This appears to be a Spanish-language origin field, but it carries no usable signal in this sample. 60% of rows are null and the remaining 20 non-null entries are all the empty string, giving cardinality 1 and entropy 0. Treatment: Drop; the column has no variation and a 60% null rate. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_pl

categorical metadata long_tail null_rate

Polish-language product names, populated for only 10% of rows (null_rate 0.9) with just 3 distinct values across 50 records. The top value is the empty string at 60%, leaving only two real product names ('Czekolada gorzka 74%' and 'Excellence 70% Cocoa Intense Dark') appearing once each. Both the long_tail and null_rate alerts fire, and empty strings are being counted as a category rather than nulls. Treatment: Normalise empty strings to null and treat as a sparse localisation field; drop unless Polish-market analysis is required. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 3
top_value: (empty string)
top_rate: 0.6
cardinality: 3
entropy: 1.371
entropy_ratio: 0.865
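
A minimal sketch of the empty-string normalisation recommended above. The row counts (45 nulls, 3 empty strings, 2 names) follow from the statistics on this card; the helper name `normalise_empty` is hypothetical, and treating whitespace-only strings as missing is an added assumption.

```python
def normalise_empty(value):
    """Map None, empty, and whitespace-only strings to None; strip the rest."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None


# Reconstructed from the card: 45 nulls, 3 empty strings, 2 real names.
column = [None] * 45 + [
    "",
    "",
    "",
    "Czekolada gorzka 74%",
    "Excellence 70% Cocoa Intense Dark",
]
cleaned = [normalise_empty(v) for v in column]
effective_null_rate = cleaned.count(None) / len(cleaned)
```

After normalisation the effective null rate rises from 90% to 96%, which is the figure to use when deciding whether to drop the column.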

scores

unknown other skipped

The column 'scores' was skipped by the profiler and reports kind 'unknown', so no statistics, uniqueness, or value distribution were computed. The only confirmed signals are 50 rows and a 0.0 null rate; everything else is missing. Without type inference or sample values, the column's actual content (numeric scores, lists, structured objects) cannot be determined from this evidence. Treatment: Re-profile with type coercion or inspect raw values before deciding on a downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

brands

categorical feature long_tail

Brand name of each product, with 41 distinct values across 50 rows and no nulls. The distribution is essentially flat (entropy ratio 0.97), with Lindt leading at just 8% (4 occurrences) and most brands appearing once — a long tail flagged explicitly. One value is in Arabic script (عربي), suggesting mixed-language entries that may need normalization. Treatment: Group rare brands into an 'other' bucket and normalize encodings/scripts before one-hot or target encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 41
top_value: Lindt
top_rate: 0.08
cardinality: 41
entropy: 5.214
entropy_ratio: 0.9731

ingredients_text_de

categorical free_text long_tail null_rate

German-language ingredient declarations, likely scraped from product packaging (e.g. Kakaomasse, Zucker, Weizenmehl). Coverage is poor: 60% of rows are null and the most common value is the empty string (5 of the 20 non-null rows, 25%), while the remaining 15 unique strings are essentially free text with allergen markup like _SOJA_ and _WEIZENMEHL_. Entropy ratio of 0.94 confirms each populated row is nearly unique, so this behaves as free text rather than a category. Treatment: Treat as free text: parse comma-separated tokens or embed; do not one-hot encode. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 16
top_value: (empty string)
top_rate: 0.25
cardinality: 16
entropy: 3.741
entropy_ratio: 0.9354
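
The comma-separated parsing suggested above might be sketched like this. The example string is assembled from tokens named on this card and is not an actual row from the sample; the helper `parse_ingredients` and the assumption that allergens are always wrapped in single underscores are illustrative.

```python
import re

# Allergens are assumed to be marked as _NAME_ in the declaration text.
ALLERGEN = re.compile(r"_([^_]+)_")


def parse_ingredients(text):
    """Split a declaration on commas and return (tokens, allergens), lowercased."""
    if not text or not text.strip():
        return [], []
    allergens = [name.strip().lower() for name in ALLERGEN.findall(text)]
    tokens = []
    for part in text.split(","):
        token = part.replace("_", "").strip(" .").lower()
        if token:
            tokens.append(token)
    return tokens, allergens


# Illustrative declaration built from tokens mentioned on the card.
tokens, allergens = parse_ingredients("Kakaomasse, Zucker, _WEIZENMEHL_, _SOJA_")
```

A plain comma split will mis-handle nested sub-ingredient lists in parentheses, so real declarations would likely need a more careful tokenizer.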

ingredients_text_nb

categorical free_text null_rate imbalance

This appears to be a Norwegian Bokmål ingredients text field, likely from a multilingual product dataset. It is effectively empty: 96% of the 50 rows are null, and the only non-null value across the remaining 2 rows is an empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; no usable signal at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packagings_n

numeric feature outliers

Likely a count of packaging components per product, ranging from 1 to 5 with a mean of 2.07 and median of 2. The IQR is 0 because Q1 and Q3 both equal 2, which mechanically labels nearly half the rows (outlier_rate 0.488, n_outliers 20) as outliers — a quirk of the IQR rule on a low-cardinality integer, not a data quality issue. Note the 18% null rate and only 5 distinct values across 50 rows. Treatment: Treat as a small-count integer feature; impute the 18% nulls and ignore the IQR-flagged outliers. high · anthropic:claude-opus-4-7

n: 50
nulls: 9 (18.0%)
unique: 5
min: 1
max: 5
mean: 2.073
median: 2
std: 0.8772
q1: 2
q3: 2
iqr: 0
skew: 0.9834
kurtosis: 1.602
n_outliers: 20
outlier_rate: 0.4878
zero_rate: 0
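
The IQR quirk described above can be reproduced with a small sketch. The 41 values below are hypothetical: they were chosen only so that Q1, Q3, the mean, and the outlier count agree with the figures on this card, and they are not the actual rows. The 1.5×IQR fences are assumed to be the rule the profiler applies.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return (outliers, q1, q3, iqr) using fences at q1 - k*iqr and q3 + k*iqr."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return outliers, q1, q3, iqr


# Hypothetical non-null values (n=41) consistent with the reported summary.
values = [1] * 10 + [2] * 21 + [3] * 8 + [4] + [5]
outliers, q1, q3, iqr = tukey_outliers(values)
outlier_rate = len(outliers) / len(values)
```

Because both fences collapse onto the value 2 when the IQR is 0, every row that is not exactly 2 is flagged, which is why the rule is uninformative for this column.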

complete

numeric label

Binary 0/1 indicator (n_unique=2, min=0, max=1) likely flagging completion status. Only 32% of rows are marked complete (mean=0.32, zero_rate=0.68), so the negative class dominates roughly 2:1. No nulls or outliers across the 50 rows. Treatment: Treat as binary target; account for the 68/32 class imbalance during modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 2
min: 0
max: 1
mean: 0.32
median: 0
std: 0.4712
q1: 0
q3: 1
iqr: 1
skew: 0.7717
kurtosis: -1.404
n_outliers: 0
outlier_rate: 0
zero_rate: 0.68

emb_codes_20141016

categorical metadata long_tail null_rate

This appears to be a packager/manufacturer code field from an Open Food Facts-style export, dated 2014-10-16, mixing French EMB establishment codes (e.g., 'EMB 44068A') with free-text manufacturer descriptors in multiple languages (German, Spanish). Of the 50 rows, 58% are null and another 30% (15/50) hold the empty string, which is the top value, leaving just 6 distinct non-empty entries, each appearing exactly once. Entropy ratio of 0.57 and the dominance of blanks make this column nearly unusable as-is. Treatment: Drop or defer; coverage is too sparse and values too heterogeneous to feature-engineer without a dedicated parser. high · anthropic:claude-opus-4-7

n: 50
nulls: 29 (58.0%)
unique: 7
top_value: (empty string)
top_rate: 0.7143
cardinality: 7
entropy: 1.602
entropy_ratio: 0.5705

ingredients_tags

unknown free_text skipped

The column 'ingredients_tags' was skipped by the profiler, so no type, uniqueness, or distributional statistics are available. The only confirmed signals are 50 rows with a 0.0 null rate. The name suggests a list-valued or delimited tag field (e.g., ingredient identifiers), which would explain why standard profiling bailed out. Treatment: Parse/explode the tag list and one-hot or embed the individual ingredients before modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

packaging_text_ja

categorical free_text long_tail null_rate imbalance

Japanese-language packaging text, almost entirely absent: 98% of the 50 rows are null and the single non-null value is itself an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here for any downstream task. Treatment: Drop the column; it is effectively empty. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_de

categorical free_text long_tail null_rate

German-language generic product name, likely a free-text label for food items (chocolates, biscuits, spreads). Coverage is poor: null_rate is 0.6 and the top value is the empty string at 12 of the 20 non-null rows, while the remaining 8 unique values each appear once, indicating no repetition across products. Entropy_ratio of 0.68 reflects the empty-string mass dominating an otherwise unique long tail. Treatment: Treat as free text; impute missing and normalize/tokenize before any categorical use. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 9
top_value: (empty string)
top_rate: 0.6
cardinality: 9
entropy: 2.171
entropy_ratio: 0.6849

last_editor

categorical metadata long_tail

Likely the username or bot handle that last edited each record. One contributor, "foodless", dominates with 21 of the 49 non-null rows (top_rate 0.43), while the remaining 28 rows spread across 23 other editors, producing a long tail and entropy ratio of 0.77. One row (2%) is null, and several handles look like apps/bots (e.g., municorn-calorie-counter-app, macrofactor) mixed with human usernames. Treatment: Group rare editors into an "other" bucket and keep as a categorical provenance feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 24
top_value: foodless
top_rate: 0.4286
cardinality: 24
entropy: 3.513
entropy_ratio: 0.7662

minerals_prev_tags

unknown other skipped

The column `minerals_prev_tags` was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 50 and a null rate of 0.0. Without `n_unique` or any descriptive stats, the content and structure of this field are unknown. The name suggests it may hold prior tag annotations related to minerals, but this cannot be confirmed from the evidence. Treatment: Re-profile with the appropriate parser (likely list/string) before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

last_image_t

numeric timestamp high_skew

Values are 10-digit integers ranging from 1,639,159,016 to 1,767,675,445 with a median of 1,752,195,111. These are Unix epoch seconds, so the column is a 'last image' timestamp spanning December 2021 through early January 2026 (UTC). All 50 rows are unique with no nulls or zeros, but the distribution is strongly left-skewed (skew -2.44, kurtosis 7.36) with 2 outliers (4%) sitting far below the bulk, indicating a few very stale records against an otherwise recent cluster. Treatment: Cast from epoch seconds to datetime and derive recency features rather than using the raw integer. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
min: 1.639e+09
max: 1.768e+09
mean: 1.745e+09
median: 1.752e+09
std: 2.681e+07
q1: 1.735e+09
q3: 1.764e+09
iqr: 2.896e+07
skew: -2.443
kurtosis: 7.36
n_outliers: 2
outlier_rate: 0.04
zero_rate: 0
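
The cast and recency derivation recommended above can be sketched with the standard library. The two epoch values are the reported minimum and maximum; using the newest timestamp as the reference point for "age" is an assumption made for illustration, and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def age_in_days(seconds, reference):
    """Days elapsed between an epoch timestamp and a reference datetime."""
    return (reference - epoch_to_utc(seconds)).total_seconds() / 86400


oldest = epoch_to_utc(1_639_159_016)  # reported min
newest = epoch_to_utc(1_767_675_445)  # reported max
span_days = age_in_days(1_639_159_016, reference=newest)
```

In practice the reference would more likely be the profiling date or the export date, so that age is comparable across refreshes of the sample.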

obsolete_since_date

categorical metadata imbalance

This appears to be a date column marking when items became obsolete, but it carries no usable information in this sample. Across 50 rows there is a single non-null distinct value — the empty string — making up 100% of non-nulls (44 of 44), with a 12% null rate on top. Entropy is 0.0 and cardinality is 1, so the field is effectively blank. Treatment: Drop; the column is constant (empty) and offers no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 6 (12.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

pnns_groups_2_tags

unknown other skipped

Column `pnns_groups_2_tags` was skipped by the profiler, so no statistics, uniqueness count, or value samples are available. The only confirmed signals are 50 rows present and a 0.0 null rate. The name suggests Open Food Facts PNNS group-2 category tags, typically a low-cardinality categorical, but this cannot be verified from the evidence. Treatment: Re-run the profiler on this column to determine type before deciding; if categorical tags, one-hot or target-encode. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

emb_codes_tags

unknown other skipped

The column 'emb_codes_tags' was skipped by saturn, so no statistics beyond row count (50) and a null rate of 0.0 are available. The name suggests it holds packager (EMB) code tags, likely a multi-valued categorical string field, but uniqueness, cardinality, and value distribution are unknown. Without further profiling no surprises can be flagged. Treatment: Re-profile with string/list parsing enabled before deciding whether to one-hot encode or drop. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

countries_beforescanbot

categorical feature long_tail

This appears to be a multi-country list field (likely product distribution countries from an Open Food Facts–style source, captured before a scan-bot pass). With 38 unique values across 50 rows and entropy_ratio 0.965, it is nearly free-form: France leads at only 6 occurrences (top_rate 0.14), and many cells are comma-separated lists. Values mix languages (French 'Belgique', Spanish 'Bélgica', Dutch 'nl:Duitsland', English 'en:Morocco') and taxonomy-prefixed codes, plus a 14% null rate. Treatment: Split on comma, normalize language variants and 'xx:' prefixes to ISO country codes, then multi-hot encode. high · anthropic:claude-opus-4-7

n: 50
nulls: 7 (14.0%)
unique: 38
top_value: France
top_rate: 0.1395
cardinality: 38
entropy: 5.066
entropy_ratio: 0.9653
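
A sketch of the split-and-normalise step described above. The lookup table `COUNTRY_TO_ISO` is a deliberately tiny, hypothetical mapping covering only the variants quoted on this card; a real pipeline would need a full taxonomy. The two-letter `xx:` prefix pattern is likewise an assumption based on the quoted examples.

```python
import re

# Hypothetical mini-lookup; only the variants quoted on the card are covered.
COUNTRY_TO_ISO = {
    "france": "FR",
    "belgique": "BE",
    "bélgica": "BE",
    "duitsland": "DE",
    "morocco": "MA",
}
LANG_PREFIX = re.compile(r"^[a-z]{2}:")


def normalise_countries(cell):
    """Split a cell on commas, drop 'xx:' prefixes, and map names to ISO codes."""
    if not cell:
        return []
    codes = []
    for part in cell.split(","):
        name = LANG_PREFIX.sub("", part.strip().lower())
        code = COUNTRY_TO_ISO.get(name)
        if code and code not in codes:
            codes.append(code)
    return codes


codes = normalise_countries("France, nl:Duitsland, en:Morocco")
```

Names missing from the lookup are silently dropped here; logging them instead would show how much of the long tail the mapping actually covers.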

nutrition_grade_fr

categorical label

This is the French Nutri-Score grade (a-e) for each food item, with one row coded as 'unknown'. The distribution is heavily skewed toward the worst grade: 'e' alone covers 54% of the 50 rows, and grades d+e together dominate while only 6 rows are 'a' or 'b'. Entropy ratio of 0.74 confirms moderate concentration rather than a balanced ordinal spread. Treatment: Treat as ordered categorical (a < b < c < d < e). high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 6
top_value: e
top_rate: 0.54
cardinality: 6
entropy: 1.913
entropy_ratio: 0.7399

data_quality_tags

unknown other skipped

The column 'data_quality_tags' was skipped by the profiler, so no kind, uniqueness, or value statistics are available. The only confirmed signals are 50 rows with a null_rate of 0.0, meaning every row has some value, but its content and cardinality are unknown. Treatment: Re-profile or manually inspect before use; the profiler skipped this column. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_with_specified_percent_sum

numeric feature

This appears to be a numeric feature summing the percentages of ingredients whose proportions are explicitly disclosed (likely on food product labels). The distribution is heavily zero-inflated with a zero_rate of 0.58 and median of 0.0, while values stretch up to 99.6, giving an overall mean of 22.74 and std of 32.88. The right skew (0.998) and bimodal shape (q1=0, q3=52.25) suggest two regimes: products with no specified percentages and those with substantial disclosure. Treatment: Consider a hurdle approach: a binary 'has_disclosure' flag plus the continuous value, since 58% of rows are zero. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 22
min: 0
max: 99.6
mean: 22.74
median: 0
std: 32.88
q1: 0
q3: 52.25
iqr: 52.25
skew: 0.9979
kurtosis: -0.5856
n_outliers: 0
outlier_rate: 0
zero_rate: 0.58
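
The hurdle idea above amounts to splitting one zero-inflated column into two features. A minimal sketch, where the helper name `hurdle_features` and the choice to treat missing values like zeros are assumptions; the example inputs reuse the reported minimum, Q3, and maximum.

```python
def hurdle_features(value):
    """Split a zero-inflated value into (has_disclosure, amount)."""
    has_disclosure = 1 if value is not None and value > 0 else 0
    amount = float(value) if has_disclosure else 0.0
    return has_disclosure, amount


# Reported min, q3, and max from the card, used as example inputs.
features = [hurdle_features(v) for v in (0, 52.25, 99.6)]
```

A model can then learn separately whether any percentage is disclosed and, given disclosure, how much.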

origin_it

categorical feature null_rate imbalance

This column appears to be an Italian-language origin field (per the _it suffix), but it carries no information: of 50 rows, 68% are null and the remaining 16 non-null values are all empty strings, giving a single unique value and zero entropy. There is no signal here to model on. Treatment: Drop; constant column with majority nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 34 (68.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

nutrition_data_per

categorical metadata

This column records the basis on which nutrition values are reported, taking only two values: '100g' and 'serving'. The encoding is heavily skewed, with 84% of the 50 rows using '100g' and the remaining 8 rows using 'serving', and there are no nulls. Analysts should note that nutrition figures in other columns are not directly comparable across rows without normalising to a common basis. Treatment: Use as a grouping flag and normalise nutrition fields to a single basis before aggregation. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 2
top_value: 100g
top_rate: 0.84
cardinality: 2
entropy: 0.6343
entropy_ratio: 0.6343
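
The report does not define `entropy` or `entropy_ratio`. The sketch below assumes Shannon entropy in bits, divided by log2 of the cardinality for the ratio. Under that assumption the 42/8 split on this card reproduces the reported 0.6343.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy of the empirical distribution, in bits."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    return entropy_bits(values) / math.log2(k) if k > 1 else 0.0


# 84% of 50 rows are '100g' (42 rows); the remaining 8 are 'serving'.
values = ["100g"] * 42 + ["serving"] * 8
h = entropy_bits(values)
ratio = entropy_ratio(values)
```

With two categories the maximum entropy is 1 bit, which is why `entropy` and `entropy_ratio` coincide on this card.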

origin_pl

categorical metadata null_rate imbalance

This appears to be a Polish-language origin field (per the _pl suffix), but it carries no usable information here. 90% of the 50 rows are null, and the remaining 5 non-null entries are all empty strings, giving cardinality 1 and entropy 0. Treatment: Drop; the column is 90% null with only empty strings remaining. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product

unknown other skipped

The column 'product' was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 50 and a null rate of 0.0. Without unique counts or sample values, the role of this column cannot be inferred from the evidence. Treatment: Re-run the profiler on this column to obtain stats before deciding on treatment. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

link

categorical metadata long_tail

URLs pointing to product pages, likely a manufacturer/source link field. The dominant value is the empty string at 21 of the 48 non-null rows (top_rate 0.4375); together with the 2 nulls (4%), 23 of 50 rows carry no link. The remaining 27 unique values look like one-off product URLs across various brand domains, hence the long_tail alert and entropy ratio of 0.76. Treatment: Extract domain as a low-cardinality feature; drop the raw URL as it's near-unique and often blank. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 28
top_value: (empty string)
top_rate: 0.4375
cardinality: 28
entropy: 3.663
entropy_ratio: 0.762
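
Domain extraction, as suggested above, can be done with `urllib.parse`. The example URLs are placeholders rather than values from the sample, and stripping a leading `www.` is an assumed normalisation choice.

```python
from urllib.parse import urlparse


def link_domain(url):
    """Return the lowercased host of a URL without 'www.', or None if blank."""
    if not url or not url.strip():
        return None
    candidate = url.strip()
    if "://" not in candidate:
        # Scheme-less links are parsed as network locations.
        candidate = "//" + candidate
    host = urlparse(candidate).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


# Placeholder URLs, not taken from the sample.
domains = [
    link_domain(u)
    for u in ("https://www.example.com/products/1", "example.org/x", "")
]
```

Blank links map to `None`, so the same pass also converts the empty-string mass into proper missing values.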

ingredients_text_nl

categorical free_text long_tail null_rate

Dutch-language ingredient lists for food products, present for only 24% of the 50 rows (null_rate 0.76). Among the 12 non-null entries there are 9 distinct strings with high entropy_ratio 0.92, and the modal value is actually the empty string (4 occurrences) rather than a real ingredient list. Contents range from short declarations like 'Aardappelen, zonnebloemolie, zeezout.' to long packaging blurbs containing addresses and URLs, so the field mixes ingredients with marketing text. Treatment: Treat empty strings as nulls, then tokenize and embed (or parse comma-separated ingredients) before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 38 (76.0%)
unique: 9
top_value: (empty string)
top_rate: 0.3333
cardinality: 9
entropy: 2.918
entropy_ratio: 0.9206

additives_n

numeric feature

Count of additives per product, ranging from 0 to 8 across 50 rows with no nulls and only 8 distinct values. The distribution is heavily right-skewed (skew 1.47, kurtosis 2.10) with a zero_rate of 0.4 and median of 1, while a small tail produces 2 outliers (outlier_rate 0.04). Mean (1.52) sits well above the median, confirming a few additive-heavy products pull the average up. Treatment: Treat as a discrete count; consider log1p or binning (0 vs 1+ vs many) before modelling given the skew and high zero_rate. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 8
min: 0
max: 8
mean: 1.52
median: 1
std: 1.821
q1: 0
q3: 2
iqr: 2
skew: 1.473
kurtosis: 2.105
n_outliers: 2
outlier_rate: 0.04
zero_rate: 0.4
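
Both transforms mentioned above are short to express. In this sketch the bin labels and the `many_threshold` of 3 are hypothetical choices; the card only names the three-way split, not where "many" begins.

```python
import math


def additive_bin(n, many_threshold=3):
    """Bucket an additive count into 'none', 'some', or 'many'."""
    if n == 0:
        return "none"
    return "some" if n < many_threshold else "many"


def additive_log(n):
    """log(1 + n), which keeps zero counts at 0 and compresses the tail."""
    return math.log1p(n)


# Reported min, median, q3, and max from the card as example inputs.
bins = [additive_bin(n) for n in (0, 1, 2, 8)]
logged = [additive_log(n) for n in (0, 1, 2, 8)]
```

Binning discards magnitude within each bucket, while `log1p` keeps the ordering and only shrinks the gap between 2 and 8 additives.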

generic_name_sv

categorical free_text long_tail null_rate

Swedish-language generic product name, populated for only 4 of 50 rows (null_rate 0.92). The four observed values are all distinct (entropy_ratio 1.0), including one empty string, so there is effectively no usable signal here. Top value 'Fin mörk choklad med 90% kakao' appears just once. Treatment: Drop or defer — too sparse (92% null) and unique to model. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 4
top_value: Fin mörk choklad med 90% kakao
top_rate: 0.25
cardinality: 4
entropy: 2
entropy_ratio: 1

ingredients_that_may_be_from_palm_oil_tags

unknown feature skipped

This column was skipped by the profiler, so no statistics, uniqueness, or value samples are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds tags listing ingredients potentially derived from palm oil, likely a multi-valued/list field typical of Open Food Facts exports. Without parsed values, nothing can be said about cardinality, distribution, or content. Treatment: Re-profile after parsing as a list of tags, then one-hot or count-encode before modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

known_ingredients_n

numeric feature

A non-negative integer count of recognised ingredients per record, ranging from 0 to 36 with a mean of 11.76 and median of 9. The distribution is right-skewed (skew 0.86) with a wide IQR of 13.5, and 4% of rows are zero — meaning a small fraction had no ingredients matched at all. No outliers were flagged and there are no nulls across the 50 rows. Treatment: Consider a log1p transform before modelling to tame the right skew. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 22
min: 0
max: 36
mean: 11.76
median: 9
std: 8.721
q1: 5
q3: 18.5
iqr: 13.5
skew: 0.8598
kurtosis: 0.07411
n_outliers: 0
outlier_rate: 0
zero_rate: 0.04

completeness

numeric feature outliers

A numeric quality score named 'completeness', ranging from 0.575 to 1.1 with mean 0.91 and median 0.9, so most rows are near-complete. The max of 1.1 is suspicious for a metric that nominally caps at 1.0, and 12% of values flag as outliers with a left skew of -0.67, suggesting a tail of poorly-populated records. Having only 14 unique values across 50 rows hints at a discretised or rounded score rather than a continuous measurement. Treatment: Clip values above 1.0 and inspect the low-end outliers before using as a quality filter. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 14
min: 0.575
max: 1.1
mean: 0.91
median: 0.9
std: 0.1358
q1: 0.8875
q3: 1
iqr: 0.1125
skew: -0.6678
kurtosis: 0.32
n_outliers: 6
outlier_rate: 0.12
zero_rate: 0
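
The clip-and-inspect treatment above can be sketched using the quartiles reported on this card. The 1.5×IQR fences are assumed to be the rule behind the outlier flag, and the four example values are the reported min, median, Q3, and max rather than actual rows.

```python
Q1, Q3 = 0.8875, 1.0  # quartiles reported on this card


def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences at q1 - k*iqr and q3 + k*iqr."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_upper(value, upper=1.0):
    """Cap a score at its nominal maximum."""
    return min(value, upper)


low, high = tukey_fences(Q1, Q3)
values = [0.575, 0.9, 1.0, 1.1]  # reported min, median, q3, max
clipped = [clip_upper(v) for v in values]
low_end = [v for v in values if v < low]
```

Under this rule the upper fence sits near 1.169, above the reported maximum of 1.1, so the flagged outliers would all fall below the lower fence of roughly 0.719.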

ingredients_sweeteners_n

numeric feature constant

This column appears to count sweetener ingredients per record, but every one of the 50 rows holds the value 0 (zero_rate 1.0, n_unique 1, std 0.0). It carries no information for modelling and is flagged constant. Treatment: Drop; constant column with zero variance. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 1
min: 0
max: 0
mean: 0
median: 0
std: 0
q1: 0
q3: 0
iqr: 0
skew: 0
kurtosis: 0
n_outliers: 0
outlier_rate: 0
zero_rate: 1

nova_groups

categorical feature

This column holds NOVA food classification groups, a 4-level ordinal scheme encoded as strings ('1' through '4'). Only 3 of the 4 possible groups appear across 50 rows, with group '4' (ultra-processed) dominating at 33 of the 48 non-null rows (top_rate 0.6875) and group '2' entirely absent. Null rate is 0.04 and entropy_ratio is 0.64, indicating concentration toward the ultra-processed end. Treatment: Treat as ordinal (cast to int) and impute the 4% missing before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 3
top_value: 4
top_rate: 0.6875
cardinality: 3
entropy: 1.006
entropy_ratio: 0.635
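
A sketch of the cast-and-impute treatment above. The counts (33, 14, 1, plus 2 nulls) follow the `nova_groups` data table earlier in this report; imputing with the mode is an assumption, since the card does not say which imputation to use.

```python
import statistics


def nova_to_int(value):
    """Cast a NOVA group string to int, keeping missing values as None."""
    if value is None or str(value).strip() == "":
        return None
    return int(value)


# Counts from the report's nova_groups table, plus the 2 nulls on this card.
raw = ["4"] * 33 + ["3"] * 14 + ["1"] + [None] * 2
levels = [nova_to_int(v) for v in raw]
observed = [v for v in levels if v is not None]
fill = statistics.mode(observed)  # assumption: impute with the modal group
imputed = [fill if v is None else v for v in levels]
```

Mode imputation pushes both missing rows into group 4, which reinforces the existing skew; keeping a separate missing indicator is a reasonable alternative.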

allergens_hierarchy

unknown feature skipped

This column is labeled 'allergens_hierarchy', suggesting it holds hierarchical allergen tags (likely a list or delimited path structure). Saturn skipped profiling, so no uniqueness, cardinality, or value statistics are available beyond the fact that all 50 rows are non-null. Without parsed content, the structure and value distribution cannot be characterized. Treatment: Parse the hierarchy into a list, then one-hot or multi-label encode allergen tags before modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

obsolete

categorical metadata imbalance

The 'obsolete' column has a single observed value—an empty string—across all 44 non-null rows, with the remaining 12% of rows null. Cardinality is 1 and entropy is 0, so this column carries no information as-is. The name suggests a deprecated flag, consistent with it being effectively unused. Treatment: Drop; constant column with no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 6 (12.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_sv

categorical metadata null_rate imbalance

This appears to be a source/origin indicator (likely Swedish, given the _sv suffix) but it carries virtually no information in this sample. With a 92% null rate and the only non-null value being an empty string repeated 4 times, cardinality is 1 and entropy is 0. The column is effectively constant and unusable as-is. Treatment: Drop the column; it is 92% null and otherwise constant. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_hierarchy

unknown other skipped

The column packaging_hierarchy was skipped by the profiler, so no type, uniqueness, or distribution stats are available. All 50 rows are non-null, but every other signal (kind, n_unique, summary stats) is missing. Without further inspection the contents and structure remain unknown. Treatment: Re-profile or manually inspect a sample before deciding on downstream handling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_with_unspecified_percent_n

numeric feature

Likely a count of ingredients on a product whose declared percentage is unspecified, ranging from 1 to 33 with a mean of 8.8 and median of 7. The distribution is right-skewed (skew 1.64, kurtosis 3.55) with two outliers (4%) pulling the upper tail toward 33, well above the Q3 of 11. Every row has a value (null_rate 0, zero_rate 0), so no product in this sample fully discloses ingredient percentages. Treatment: Apply a log or sqrt transform before modelling to tame the right skew. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 18
min: 1
max: 33
mean: 8.8
median: 7
std: 6.061
q1: 5
q3: 11
iqr: 6
skew: 1.645
kurtosis: 3.545
n_outliers: 2
outlier_rate: 0.04
zero_rate: 0

fruits-vegetables-nuts_100g_estimate

numeric feature null_rate high_skew

This is an estimated percentage of fruits, vegetables, and nuts per 100g of product. The signal is almost absent: 46% of rows are null, and of the 27 non-null values, 96.3% are zero, leaving essentially one non-zero observation at 85.0 that drives the mean of 3.15 and skew of 4.9. Treatment: Drop or collapse to a binary has_value flag; the column carries almost no variance. high · anthropic:claude-opus-4-7

n: 50
nulls: 23 (46.0%)
unique: 2
min: 0
max: 85
mean: 3.148
median: 0
std: 16.36
q1: 0
q3: 0
iqr: 0
skew: 4.903
kurtosis: 22.04
n_outliers: 1
outlier_rate: 0.03704
zero_rate: 0.963

emb_codes

categorical metadata long_tail

Looks like a free-form certification/packaging code field (FSC-*, EMB *, LPL.*) with mixed formats including one company-name string. The column is dominated by empty strings, which account for 35 of the 48 non-null rows (top_rate 0.73), and a further 4% of rows are null, leaving very little signal across 11 unique values. Entropy ratio of 0.50 and the long_tail alert confirm most non-empty codes appear only once or twice. Treatment: Treat empty strings as missing and consider dropping or collapsing into a binary has_code flag given the sparsity and long tail. medium · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 11
top_value: (empty string)
top_rate: 0.7292
cardinality: 11
entropy: 1.72
entropy_ratio: 0.4972
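
One way to collapse the field into the has_code flag mentioned above, treating nulls and blank strings alike as missing. The code strings in the example are placeholders in the FSC-/EMB style, not values from the sample.

```python
def has_code(value):
    """True only for a non-null string with non-whitespace content."""
    return value is not None and value.strip() != ""


# Placeholder values; real codes in the column will differ.
rows = ["", None, "EMB 00000", "   ", "FSC-X000000"]
flags = [has_code(v) for v in rows]
print(flags)
```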

packagings

unknown other skipped

This column was skipped by the profiler, so no statistics are available beyond a row count of 50 and a null rate of 0.0. The name 'packagings' suggests it likely holds nested or structured packaging descriptions (lists or objects), which is consistent with the profiler classifying its kind as 'unknown' and emitting a 'skipped' alert. Without unique counts or value summaries, nothing further can be inferred. Treatment: Inspect raw values and parse/normalize the structure before deciding on a downstream treatment. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

purchase_places_tags

unknown other skipped

Profiling skipped this column, so type, uniqueness, and value distribution are unknown. The only confirmed facts are 50 rows present with a null rate of 0.0 and a name suggesting a tags-style field for purchase locations. No further signal is available to characterise content or cardinality. Treatment: Re-run profiling with parsing enabled to inspect tag values before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

additives_original_tags

unknown other skipped

The column 'additives_original_tags' was skipped by the profiler, so no statistics, uniqueness count, or value samples are available beyond a row count of 50 and a null rate of 0.0. Based solely on the name, it likely holds lists of food-additive tag identifiers (e.g., E-numbers) in their original locale, but this cannot be verified from the evidence. Treatment: Re-run the profiler with list/tag parsing enabled, then explode tags before encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

image_front_url

categorical identifier long_tail

Per-row URLs pointing to Open Food Facts product front images, with locale suffixes like front_fr and front_en in the path. All 50 values are unique (entropy_ratio 1.0, top_rate 0.02) and there are no nulls, so this is effectively a 1:1 asset link rather than a feature. Treatment: Treat as an asset URL: drop from modelling, or fetch images out-of-band for vision pipelines. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.400.jpg
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

data_quality_bugs_tags

unknown other skipped

This column was skipped by the profiler, so its kind is unknown and no descriptive statistics are available beyond a row count of 50 and a null rate of 0. The name suggests it holds tags related to data-quality bugs, likely a list or delimited string, but that structure is not confirmed by evidence. Without uniqueness, value, or length signals, no distributional claims can be made. Treatment: Re-run the profiler with parsing enabled (e.g., explode tags) before deciding how to use this column. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

origin_fi

categorical metadata null_rate imbalance

This appears to be an origin field, most likely the Finnish-language variant given the _fi suffix, and it is essentially empty. 90% of the 50 rows are null, and the remaining 5 non-null entries are all the empty string, giving a single unique value and zero entropy. There is no usable signal here. Treatment: Drop; the column is 90% null and the remaining values are blank. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

images

unknown other skipped

The column 'images' was skipped by the profiler, so its kind is unknown and no descriptive statistics were computed. Only the row count (50) and a null rate of 0.0 are available; uniqueness, type, and value distribution are all missing. The name suggests binary or path-like image payloads, which would explain why the dissector bypassed it. Treatment: Inspect raw values manually to confirm format, then route to an image-processing pipeline rather than tabular modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_analysis

unknown other skipped

The column 'ingredients_analysis' was skipped by the profiler, so no type, uniqueness, or distribution statistics are available. The only confirmed signals are 50 rows present and a 0.0 null rate. Without further inspection, its content and structure remain unknown. Treatment: Inspect raw values manually to determine type before any downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text_with_allergens_pl

categorical free_text long_tail null_rate

Polish-language ingredient text with embedded allergen HTML markup, almost entirely absent from this sample. 92% of 50 rows are null and only 3 distinct values appear, two of which are unique product descriptions and one is an empty string (top_rate 0.5 among non-nulls). Treatment: Drop for modelling given 92% nulls; if retained, strip HTML allergen tags and treat as free text. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 3
top_value: (empty string)
top_rate: 0.5
cardinality: 3
entropy: 1.5
entropy_ratio: 0.9464

product_name_de

categorical free_text long_tail null_rate

German-language product names, almost certainly the localized display label for food/confectionery items (chocolate, biscuits, Nutella). 60% of rows are null and the top non-null value is an empty string occurring 5 times, so only 15 of the 50 rows carry a name, each of them distinct. Entropy ratio of 0.935 confirms the populated values are nearly all unique, and at least one entry ('Lightly Sea Salted') is English rather than German. Treatment: Treat as free text: normalize empty strings to null, then tokenize/embed if used as a feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 16
top_value: (empty string)
top_rate: 0.25
cardinality: 16
entropy: 3.741
entropy_ratio: 0.9354

ingredients_text_with_allergens_nb

categorical free_text null_rate imbalance

This appears to be a Norwegian-language ingredients text field with allergen annotations, likely a localized variant of a product description column. It is effectively empty: 96% of rows are null and the only non-null value across the remaining 2 records is an empty string, giving a single unique value and zero entropy. Treatment: Drop; the column carries no information at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_it

categorical free_text long_tail null_rate

Italian-language packaging description text, almost entirely absent from this sample. 68% of rows are null and of the 16 non-null entries, 14 are empty strings, leaving only 2 substantive Italian descriptions of recycling instructions. With cardinality of 3 and a top_rate of 0.875 on the empty string, this column carries virtually no usable signal here. Treatment: Drop unless joined with a much larger Italian-locale slice; too sparse to model. high · anthropic:claude-opus-4-7

n: 50
nulls: 34 (68.0%)
unique: 3
top_value: (empty string)
top_rate: 0.875
cardinality: 3
entropy: 0.6686
entropy_ratio: 0.4218

product_name_it

categorical free_text long_tail null_rate

Italian-language product name field, mostly empty: 68% of the 50 rows are null and the modal non-null value is the empty string "" (5 occurrences, top_rate 0.3125). Among the 12 distinct values the names are heterogeneous chocolate and snack labels (e.g. "Fondente Prodigioso 90% Cacao", "Pringles classiche 175 gr", "Milka"), with case-variant duplicates like "cioccolato fondente" vs "Cioccolato fondente" inflating cardinality. Entropy ratio 0.913 confirms the non-null tail is essentially flat, each name appearing once. Treatment: Normalise case/whitespace, treat empty strings as null, then tokenize and embed; not usable as a categorical feature given 68% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 34 (68.0%)
unique: 12
top_value: (empty string)
top_rate: 0.3125
cardinality: 12
entropy: 3.274
entropy_ratio: 0.9134
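
A small sketch of the case/whitespace normalisation described above, which folds variants such as "cioccolato fondente" and "Cioccolato fondente" into one key and maps blank strings to null. The `clean_name` helper is illustrative, not part of any profiler.

```python
def clean_name(value):
    """Collapse whitespace, casefold, and return None for blank input."""
    if value is None:
        return None
    text = " ".join(value.split()).casefold()
    return text or None


names = ["cioccolato fondente", "Cioccolato  fondente", "", None, "Milka"]
cleaned = [clean_name(n) for n in names]
distinct = {c for c in cleaned if c is not None}
print(sorted(distinct))
```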

serving_quantity

categorical feature long_tail

Numeric serving sizes stored as strings, with 27 distinct values across 50 rows and a 12% null rate. The distribution is long-tailed: top values "100" and "10" each cover only 7 records (top_rate 0.159), entropy_ratio is 0.909 indicating values are spread almost uniformly, and outliers like "1000" and decimals like "11.5" sit alongside round numbers. Treatment: Cast to numeric, impute the 12% nulls, and consider log-transforming before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 6 (12.0%)
unique: 27
top_value: 100
top_rate: 0.1591
cardinality: 27
entropy: 4.322
entropy_ratio: 0.9089
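
The cast, impute, and log steps could be sketched as follows. The input strings mirror the kinds of values mentioned above ("100", "10", "11.5", "1000") plus a null and one extra made-up value; median imputation is an assumption, since the report does not name an imputation method.

```python
import math
from statistics import median


def to_float(value):
    """Parse a numeric string; return None for nulls or unparseable text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


raw = ["100", "10", "11.5", None, "1000", "30"]
parsed = [to_float(v) for v in raw]

# Median imputation is one option; it is robust to the "1000" outlier.
fill = median(v for v in parsed if v is not None)
imputed = [fill if v is None else v for v in parsed]
logged = [math.log1p(v) for v in imputed]
print(fill, [round(x, 2) for x in logged])
```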

product_name_ja

categorical metadata long_tail null_rate imbalance

This appears to be a Japanese product name field that is effectively empty in this sample: 98% of the 50 rows are null and the single non-null value is itself the empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here whatsoever. Treatment: Drop the column; it is 98% null with a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_sv

categorical free_text long_tail null_rate

Swedish-language ingredient lists with embedded HTML allergen markup (…), likely the Swedish localisation of a product ingredients field. Coverage is extremely poor: 92% null and only 4 distinct values across 50 rows, with the top value appearing just once (top_rate 0.25 over the non-null subset). One value is an empty string and others mix Swedish with Danish/Norwegian terms (HVEDEMEL, BYG, EGG), indicating inconsistent locale handling. Treatment: Strip HTML tags and parse allergen spans separately; given 92% nulls, do not use as a primary feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 4
top_value: kakaomassa, kakaosmör, fettreducerat kakaopulver, socker, vanilj.
top_rate: 0.25
cardinality: 4
entropy: 2
entropy_ratio: 1

allergens_tags

unknown feature skipped

Column `allergens_tags` was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests a multi-valued tag field listing allergens (e.g., milk, nuts), but this cannot be verified from the evidence. Re-profile with list/string handling enabled to learn cardinality and tag distribution. Treatment: Re-profile with tag-aware parsing, then one-hot or multi-hot encode the individual allergen tokens. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text_fr

categorical free_text long_tail

This is the French-language ingredients list for food products, stored as free-form text. Of 50 rows, 47 are unique and entropy ratio is 0.998, so values are essentially all distinct long strings; 4% are null and the most common value is an empty string (2 occurrences). Contents range from a two-word 'Eau de source' to multi-sentence ingredient declarations with percentages, allergens and additive codes. Treatment: Tokenize and embed (or extract structured ingredient/allergen features) rather than treating as a category. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 47
top_value: (empty string)
top_rate: 0.04167
cardinality: 47
entropy: 5.543
entropy_ratio: 0.998

nutrition_score_beverage

numeric feature high_skew

This appears to be a beverage-specific nutrition score, encoded as a numeric flag rather than a continuous metric: only 2 unique values across 50 rows, with min 0 and max 1. The distribution is overwhelmingly zero (zero_rate 0.98), leaving a single outlier at 1 that drives the extreme skew (6.86) and kurtosis (45.02). Effectively a near-constant indicator column. Treatment: Treat as a binary flag, or drop as near-constant since 98% of rows share one value. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 2
min: 0
max: 1
mean: 0.02
median: 0
std: 0.1414
q1: 0
q3: 0
iqr: 0
skew: 6.857
kurtosis: 45.02
n_outliers: 1
outlier_rate: 0.02
zero_rate: 0.98

ingredients_ids_debug

unknown other skipped

This column was skipped by the profiler, so no statistics beyond row count (50) and null rate (0.0) are available. The name suggests it holds debug-only ingredient identifiers, likely a complex or nested structure that the dissector could not categorize. Without unique counts or value samples, its content and utility cannot be assessed here. Treatment: Drop unless a downstream consumer specifically needs the raw debug payload. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nutrition_data

categorical metadata imbalance

This appears to be a flag indicating whether nutrition data is present, with the only observed value being "on" across all 49 non-null rows. Cardinality is 1 and entropy is 0, so the column carries no discriminative information; one row (2%) is null. Treatment: Drop, constant column with no variance. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 1
top_value: on
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_ja

categorical metadata long_tail null_rate imbalance

This appears to be a Japanese-language origin field, likely a localized counterpart to a primary origin column. It is effectively empty: 98% of the 50 rows are null and the only non-null value is itself the empty string, yielding a single unique value and zero entropy. Treatment: Drop; the column carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_en

categorical free_text long_tail

English-language packaging descriptions, likely free-text recycling instructions scraped from product labels. Of 50 rows, 14% are null and 39 of the 43 non-null rows (top_rate 0.91) are empty strings, leaving only 4 rows with actual content, each a distinct value (5 unique values once the empty string is counted). Entropy ratio of 0.27 confirms the column is almost entirely uninformative as-is. Treatment: Drop or defer; coverage is too sparse to model, but if retained treat empty strings as nulls and tokenize the remainder. high · anthropic:claude-opus-4-7

n: 50
nulls: 7 (14.0%)
unique: 5
top_value: (empty string)
top_rate: 0.907
cardinality: 5
entropy: 0.6325
entropy_ratio: 0.2724

unknown_ingredients_n

numeric feature high_skew outliers

This is a count of unrecognised ingredients per row, ranging from 0 to 13 with a mean of 0.66. The distribution is dominated by zeros (zero_rate 0.84) with median, q1, and q3 all at 0, but a long right tail produces extreme skew (4.24) and kurtosis (18.32), with 8 outliers (16%) pulling the max to 13. Effectively a sparse anomaly indicator rather than a continuous count. Treatment: Binarise (zero vs non-zero) or cap before modelling; raw values are too skewed for linear models. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 6
min: 0
max: 13
mean: 0.66
median: 0
std: 2.255
q1: 0
q3: 0
iqr: 0
skew: 4.236
kurtosis: 18.32
n_outliers: 8
outlier_rate: 0.16
zero_rate: 0.84

ingredients_from_palm_oil_tags

unknown other skipped

This column, named ingredients_from_palm_oil_tags, was skipped by the profiler so no distribution, uniqueness, or value-level statistics are available. The only confirmed signals are 50 rows and a 0.0 null rate; everything else (kind, n_unique) is missing. Treatment: Re-profile with type coercion before deciding; likely a list/tag field needing parsing and one-hot or multi-label encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

labels_tags

unknown other skipped

The column `labels_tags` was skipped by the profiler, so its kind, cardinality, and value distribution are all unknown. The only confirmed facts are that it contains 50 rows with no nulls. The name suggests a labels or tags field, likely multi-valued or delimited text, but no evidence confirms that. Treatment: Re-profile with a parser that handles list/tag-style values before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

packaging_old_before_taxonomization

categorical free_text long_tail null_rate

Pre-taxonomy packaging descriptions captured as free text, mixing languages (French, Spanish, English, German) and multi-value comma-separated lists. With 36 unique values across 38 non-null rows and entropy ratio 0.99, the field is almost fully unique; even the top value 'plastique' covers only 7.9% and 24% are null. Values combine material terms, language prefixes like 'fr:'/'en:', and counts ('20 biscuits en 4 sachets'), so it behaves more like free text than a category. Treatment: Normalise and split on commas, then map tokens to a controlled packaging taxonomy before use. high · anthropic:claude-opus-4-7

n: 50
nulls: 12 (24.0%)
unique: 36
top_value: plastique
top_rate: 0.07895
cardinality: 36
entropy: 5.123
entropy_ratio: 0.9909
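
A rough sketch of the split-and-normalise step, assuming two-letter language prefixes of the form 'fr:' / 'en:'. Mapping the resulting tokens onto a controlled packaging taxonomy is left out, because no such lookup table is given in this report.

```python
import re

# Two-letter language prefix such as "fr:" or "en:" at the start of a token.
PREFIX = re.compile(r"^[a-z]{2}:")


def packaging_tokens(value):
    """Split a comma-separated packaging string into lowercase tokens."""
    if not value:
        return []
    tokens = []
    for part in value.split(","):
        token = PREFIX.sub("", part.strip().lower())
        if token:
            tokens.append(token)
    return tokens


print(packaging_tokens("fr:Plastique, en:Box"))
print(packaging_tokens("20 biscuits en 4 sachets"))
```

Free-form entries such as '20 biscuits en 4 sachets' survive as a single token, so they would still need manual review or a separate parser.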

packaging_text_nb

categorical free_text null_rate imbalance

This appears to be a Norwegian Bokmål packaging text field, but it is effectively empty: 96% of 50 rows are null and the only 2 non-null values are blank strings, yielding cardinality 1 and entropy 0. Treatment: Drop; the column carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

nutrition_grades_tags

unknown other skipped

This column was skipped by the profiler, so no statistics beyond a 50-row count and zero null rate are available. The name nutrition_grades_tags suggests categorical tags (likely Nutri-Score letters such as a-e) from an Open Food Facts-style source, but uniqueness, frequencies, and value examples are all missing. Treat any interpretation as provisional until the column is reprofiled. Treatment: Reprofile with categorical parsing enabled before deciding on encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

category_properties

unknown other skipped

This column was skipped by the profiler, so no type, uniqueness, or distribution stats were computed beyond a 50-row count and 0% null rate. The name 'category_properties' suggests it holds nested or structured per-category attributes (likely dict/list/JSON), which is why saturn flagged it as unknown rather than a scalar kind. Treatment: Inspect raw values and, if structured, flatten or JSON-normalize into separate columns before profiling again. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —
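
If the raw values do turn out to be nested objects, a flattening step along these lines could precede re-profiling. The keys in the example are invented for illustration; the real structure of this column is not confirmed by the evidence.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted key paths."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical nested record; actual keys are unknown.
record = {"group": {"level": 1, "label": "example"}, "score": 2}
flat_record = flatten(record)
print(flat_record)
```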

nutriscore_score

numeric feature

This is the numeric Nutri-Score (the scale typically runs from -15, best, to 40, worst), here ranging from 0 to 40 with a mean of 17.47 and a median of 19. The distribution is roughly symmetric (skew -0.16, kurtosis -0.53) with no outliers flagged, and 8.2% of non-null values are exactly zero. Only 2% of rows are null and there are 28 unique values across 50 rows, so the column is well populated and reasonably varied. Treatment: Use directly as a numeric feature; impute the 2% nulls with the median. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 28
min: 0
max: 40
mean: 17.47
median: 19
std: 9.906
q1: 10
q3: 25
iqr: 15
skew: -0.1616
kurtosis: -0.5337
n_outliers: 0
outlier_rate: 0
zero_rate: 0.08163

packaging_tags

unknown other skipped

The column 'packaging_tags' was skipped by the profiler, so no type, uniqueness, or distributional statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds packaging-related tags, likely a multi-valued or list-like field that the profiler could not classify. Without parsed values we cannot confirm cardinality, delimiter, or language. Treatment: Re-profile after parsing the tag list (e.g., split on delimiter) before deciding on encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

labels_old

categorical free_text long_tail

Legacy multi-label tags for products, stored as comma-separated strings mixing French, English, Polish, Bulgarian (Cyrillic), and namespaced codes like 'en:CE'. With 38 uniques across 50 rows, an 8% null rate, and the most common value being the empty string at 19.6%, the field is sparse and nearly free-form. Entropy ratio 0.93 and the long_tail alert confirm almost every non-empty value is singleton. Treatment: Split on commas, normalise language/namespace prefixes, and one-hot the resulting tag tokens rather than treating the raw string as a category. high · anthropic:claude-opus-4-7

n: 50
nulls: 4 (8.0%)
unique: 38
top_value: (empty string)
top_rate: 0.1957
cardinality: 38
entropy: 4.903
entropy_ratio: 0.9343

packaging_text

categorical free_text long_tail

Free-text packaging descriptions, mostly in French with some English mixed in, detailing materials and recycling instructions. The dominant value is an empty string at 75% of non-null rows (36 of 48), and only 13 unique values exist with entropy ratio 0.46, so signal is sparse and long-tailed. Among non-empty entries, formats vary widely (multi-line itemized lists, comma-separated tags, uppercase marketing strings), suggesting no controlled vocabulary. Treatment: Normalise case/whitespace and parse material keywords into multi-hot features; treat empty string as missing. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 13
top_value: (empty string)
top_rate: 0.75
cardinality: 13
entropy: 1.708
entropy_ratio: 0.4614

ingredients_percent_analysis

numeric feature high_skew outliers

This appears to be a binary status flag for ingredient percent analysis, taking only 2 unique values (1.0 and -1.0) across 50 rows with no nulls. The distribution is heavily dominated by 1.0 (median, q1, q3 all 1.0; mean 0.84), with 4 outliers (8%) at -1.0 producing extreme negative skew (-3.10) and high kurtosis (7.59). Despite the numeric kind, the IQR of 0 and only two unique values indicate this is categorical rather than continuous. Treatment: Recode as a categorical/boolean flag rather than treating as continuous numeric. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 2
min: -1
max: 1
mean: 0.84
median: 1
std: 0.5481
q1: 1
q3: 1
iqr: 0
skew: -3.096
kurtosis: 7.587
n_outliers: 4
outlier_rate: 0.08
zero_rate: 0

ecoscore_data

unknown other skipped

The column 'ecoscore_data' was skipped by the profiler, so no type, uniqueness, or distribution stats are available. Only the row count (50) and a null rate of 0.0 are reported. The name suggests it holds Eco-Score payloads, likely a nested/structured object that the profiler could not introspect. Treatment: Inspect raw values and parse the nested structure into typed sub-fields before use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text_sv

categorical free_text long_tail null_rate

Swedish-language ingredient lists for food products, free-text rather than truly categorical. Coverage is extremely sparse: 92% null with only 4 distinct values across 50 rows, one of which is an empty string. The non-null entries are full ingredient declarations including allergen markers and a mix of Swedish, Danish, and Norwegian terms. Treatment: Treat as free text; given 92% nulls, drop or use only as a fallback to other-language ingredient columns. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 4
top_value: kakaomassa, kakaosmör, fettreducerat kakaopulver, socker, vanilj.
top_rate: 0.25
cardinality: 4
entropy: 2
entropy_ratio: 1

brands_tags

unknown other skipped

The column 'brands_tags' was skipped by the profiler, so no type, uniqueness, or distribution statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds brand tag strings (likely slug-style identifiers, possibly multi-valued), but this cannot be confirmed from the evidence. Treat any downstream assumption with caution until the column is re-profiled. Treatment: Re-profile or sample the raw values before deciding; if multi-valued tag strings, split and one-hot or embed. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

compared_to_category

categorical metadata long_tail

Holds an Open Food Facts category taxonomy code (e.g., 'en:dark-chocolate-bar-with-more-than-70-cocoa') used as a comparison reference. With 35 unique values across 50 rows and entropy ratio 0.95, the column is extremely diffuse — the modal category covers only 10% of rows and a long tail dominates. No nulls, but the high cardinality relative to sample size will make this hard to use as-is. Treatment: Roll up to a coarser taxonomy level (e.g., chocolate/biscuits/dairy) before any grouping or modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 35
top_value: en:dark-chocolate-bar-with-more-than-70-cocoa
top_rate: 0.1
cardinality: 35
entropy: 4.886
entropy_ratio: 0.9526

data_sources

categorical metadata long_tail

This column records the set of apps/databases that contributed each product's data, stored as a comma-separated list rather than a normalized relation. With 43 unique strings across 50 rows (entropy ratio 0.98) and the most common combination appearing only 4 times (top_rate 0.08), nearly every row has a bespoke source bundle. Notable: the values mix case ('yuka' vs 'Yuka') and overlap heavily on 'App - smoothie-openfoodfacts' and 'Apps', suggesting the same sources are repeatedly concatenated in different orders. Treatment: split on commas, normalize case, and one-hot encode individual sources instead of treating the raw string as a category. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 43
top_value: App - yuka, Apps, App - Open Food Facts, App - smoothie-openfoodfacts
top_rate: 0.08
cardinality: 43
entropy: 5.309
entropy_ratio: 0.9783
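
The split, case-normalise, and one-hot steps might look like the sketch below. The first row is the reported top value; the second is a made-up reordering with the 'Yuka' casing variant, included only to show that both collapse onto the same source names.

```python
def source_set(value):
    """Split a comma-separated source string into a set of lowercase names."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


rows = [
    "App - yuka, Apps, App - Open Food Facts, App - smoothie-openfoodfacts",
    "Apps, App - Yuka",  # hypothetical row with different order and casing
]
sets = [source_set(r) for r in rows]

# Build a stable vocabulary, then one indicator column per source.
vocab = sorted(set().union(*sets))
matrix = [[int(source in row) for source in vocab] for row in sets]

print(vocab)
print(matrix)
```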

other_nutritional_substances_prev_tags

unknown other skipped

This column, `other_nutritional_substances_prev_tags`, was skipped by the profiler, so no statistics on uniqueness, distribution, or content are available. The only signals are that all 50 rows are non-null and the kind is unknown. Without further evidence the contents cannot be characterised; the name suggests a tag list referencing prior values of a nutritional-substances field. Treatment: Re-profile with parsing enabled (likely a delimited tag list) before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_from_palm_oil_n

numeric feature outliers

This is effectively a binary indicator counting palm-oil-derived ingredients per product, stored as numeric with values only 0 or 1 (n_unique=2, max=1.0). The column is heavily zero-dominated (zero_rate ≈ 0.85) with mean ≈ 0.152, and the 7 ones get flagged as outliers because the IQR is 0. Null rate is 8%, modest but worth noting. Treatment: Recast as a boolean palm-oil flag and impute the 8% nulls before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 4 (8.0%)
unique: 2
min: 0
max: 1
mean: 0.1522
median: 0
std: 0.3632
q1: 0
q3: 0
iqr: 0
skew: 1.937
kurtosis: 1.751
n_outliers: 7
outlier_rate: 0.1522
zero_rate: 0.8478
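
To see why every 1 is flagged, the sketch below applies the common 1.5 × IQR fence to 39 zeros and 7 ones, matching the 46 non-null rows. The fence rule is an assumption, as the report does not state which outlier rule the profiler uses.

```python
from statistics import quantiles

# 46 non-null rows: 39 zeros and 7 ones, matching the reported rates.
values = [0] * 39 + [1] * 7

q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Assumed rule: points beyond 1.5 * IQR from the quartiles are outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

print(iqr, len(outliers), len(outliers) / len(values))
```

With both quartiles at 0 the fences collapse onto 0, so the outlier flag here only restates that the minority class exists.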

last_updated_t

numeric timestamp outliers

Values are unique 10-digit integers in the ~1.74e9–1.77e9 range, which corresponds to Unix-epoch seconds from roughly February 2025 through January 2026, consistent with the column name suggesting a 'last updated' timestamp. The distribution is heavily left-skewed (skew -1.94) with 12% flagged as outliers: a handful of much older updates pull the tail down while most rows cluster within a ~6.1M-second IQR (~71 days). No nulls or zeros. Treatment: Cast from Unix seconds to datetime and derive recency features rather than using the raw integer. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
min: 1.739e+09
max: 1.769e+09
mean: 1.763e+09
median: 1.767e+09
std: 8.037e+06
q1: 1.762e+09
q3: 1.768e+09
iqr: 6.138e+06
skew: -1.945
kurtosis: 2.892
n_outliers: 6
outlier_rate: 0.12
zero_rate: 0
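
A short sketch of the datetime cast and a recency feature, using the rounded min and max from the stats above (so the resulting dates are approximate). The reference date is taken to be the profiling date shown in the report header, 2026-05-01, which is an assumption about how recency should be anchored.

```python
from datetime import datetime, timezone


def to_utc(ts):
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Rounded bounds from the stats block: 1.739e+09 and 1.769e+09.
oldest = to_utc(1_739_000_000)
newest = to_utc(1_769_000_000)

# Recency in days relative to the assumed reference date.
reference = datetime(2026, 5, 1, tzinfo=timezone.utc)
age_days = (reference - oldest).days

# The reported IQR of 6.138e6 seconds expressed in days.
iqr_days = 6.138e6 / 86_400

print(oldest.date(), newest.date(), age_days, round(iqr_days))
```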

nutrition_score_debug

categorical metadata imbalance

This looks like a debug/diagnostic field for a nutrition scoring pipeline, capturing which input nutrients were missing during computation. It is overwhelmingly empty: 49 of 50 rows (top_rate 0.98) hold an empty string, with only one row carrying a substantive message about missing saturated-fat, sugars, and sodium. Entropy of 0.14 confirms near-zero information content in this sample. Treatment: Drop from modelling; retain only for pipeline debugging. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 2
top_value: (empty string)
top_rate: 0.98
cardinality: 2
entropy: 0.1414
entropy_ratio: 0.1414
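
The entropy figure can be reproduced from the 49-to-1 split. The sketch assumes Shannon entropy in bits and an entropy ratio defined as entropy divided by log2 of the cardinality; that normalisation is inferred from the reported numbers rather than documented. The placeholder string stands in for the single non-empty debug message.

```python
import math
from collections import Counter


def entropy_stats(values):
    """Return (Shannon entropy in bits, entropy / log2(cardinality))."""
    counts = Counter(values)
    n = len(values)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    k = len(counts)
    ratio = h / math.log2(k) if k > 1 else 0.0
    return h, ratio


# 49 empty strings plus one placeholder for the lone debug message.
h, ratio = entropy_stats([""] * 49 + ["placeholder debug message"])
print(round(h, 4), round(ratio, 4))
```

With only two distinct values, log2 of the cardinality is 1, which is why entropy and entropy_ratio coincide for this column.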

popularity_key

numeric identifier high_skew outliers

Values cluster tightly between 23.999B and 24.000B with an IQR of only ~400K, yet the minimum drops to ~22.9999B, producing severe negative skew (-2.67) and 5 low-side outliers (10%). With 49 unique values across 50 rows and no nulls, this looks like an opaque high-magnitude key or encoded rank rather than a true numeric measure. Treatment: Treat as an identifier and exclude from numeric modelling; join on it if it links to another table. medium · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 49
min: 2.3e+10
max: 2.4e+10
mean: 2.39e+10
median: 2.4e+10
std: 3.03e+08
q1: 2.4e+10
q3: 2.4e+10
iqr: 4.002e+05
skew: -2.667
kurtosis: 5.111
n_outliers: 5
outlier_rate: 0.1
zero_rate: 0

product_name_es

categorical free_text long_tail null_rate

Spanish-language product names, evidently a localized label field paralleling a primary product identifier. With null_rate 0.6 and 4 of the 20 non-null entries being empty strings, only 16 rows carry usable text, and each of those holds a distinct name (17 distinct values once the empty string is counted, entropy_ratio 0.96). Values mix branded items (Nutella Biscuits, Excellence 85% cacao) with generic descriptors (Original, Chocolate negro 85% cacao), so it behaves more like free text than a controlled vocabulary. Treatment: Treat empty strings as nulls and tokenize/embed if used as a feature; otherwise drop given 60% missingness and high cardinality. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 17
top_value: (empty string)
top_rate: 0.2
cardinality: 17
entropy: 3.922
entropy_ratio: 0.9595

allergens_from_user

categorical free_text long_tail

User-submitted allergen tags prefixed with a language code like (fr), (en), (es). There are 34 distinct values across 50 rows with a high entropy ratio of 0.91, and the top value '(fr) ' (rate 0.16, 8 occurrences) is just an empty language tag, as is '(en) ' at 7 occurrences. Values mix languages and free-form casing (e.g. 'Gluten,Lait,Soja, en:gluten' alongside normalised 'en:gluten'), so the same allergen appears under multiple spellings. Treatment: Strip the language prefix, split on commas, and normalise tokens to the en: namespace before using as multi-label features. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 34
top_value: (fr)
top_rate: 0.16
cardinality: 34
entropy: 4.636
entropy_ratio: 0.9112
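
The prefix-strip, split, and normalise sequence could be sketched as below. The `CANONICAL` lookup is a hypothetical three-entry table for illustration; a real pipeline would need a complete mapping from localized allergen names to the en: namespace.

```python
import re

# Leading language marker such as "(fr) " or "(en) ".
LANG_PREFIX = re.compile(r"^\([a-z]{2}\)\s*")

# Hypothetical lookup from localized names to en: tags.
CANONICAL = {"gluten": "en:gluten", "lait": "en:milk", "soja": "en:soybeans"}


def allergen_tags(value):
    """Return sorted, de-duplicated allergen tags for one raw value."""
    body = LANG_PREFIX.sub("", value.strip())
    tags = set()
    for part in body.split(","):
        token = part.strip().lower()
        if not token:
            continue
        if token.startswith("en:"):
            tags.add(token)
        else:
            # Unmapped tokens are kept as-is so they can be reviewed later.
            tags.add(CANONICAL.get(token, token))
    return sorted(tags)


print(allergen_tags("(fr) Gluten,Lait,Soja, en:gluten"))
print(allergen_tags("(fr) "))
```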

informers

unknown other skipped

The column 'informers' was skipped by the profiler, so its kind is unknown and no descriptive statistics were computed. The only confirmed signals are 50 rows with no nulls; uniqueness, type, and value distribution are all missing from the evidence. Treatment: Re-profile or manually inspect the raw values before deciding on any downstream handling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

brands_old

categorical metadata long_tail null_rate

This is a legacy brand-name field for product records, with 29 distinct values across 50 rows and 32% nulls. The distribution is nearly flat (entropy_ratio 0.98) and the top brand 'Gerblé' covers only ~8.8% of non-null rows, so no brand dominates. Values mix clean names (Lindt, Cristaline) with concatenations like 'Wasa,Barilla' and oddities like 'LuMondelez', suggesting prior data-entry or merge artefacts. Treatment: Clean and split multi-brand strings, then reconcile against a canonical brand list before use. high · anthropic:claude-opus-4-7

n: 50
nulls: 16 (32.0%)
unique: 29
top_value: Gerblé
top_rate: 0.08824
cardinality: 29
entropy: 4.749
entropy_ratio: 0.9776

data_quality_errors_tags

unknown other skipped

Profiling was skipped for this column, so no type, uniqueness, or value statistics are available. The only confirmed signals are 50 rows and a 0.0 null rate; everything else is missing. The name suggests it carries tags describing data-quality errors, likely a list or delimited string, but that is not verified by evidence. Treatment: Re-run profiling with list/string parsing enabled before deciding how to use it. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text

categorical free_text long_tail

Free-text ingredient lists from food packaging, one per row. Every one of the 50 rows is unique (entropy_ratio 1.0, top_rate 0.02) and the samples mix multiple languages (English, French, Bulgarian Cyrillic) with punctuation, percentages, and allergen notes. Treating this as a categorical feature is misleading despite the kind tag — it is unstructured multilingual prose flagged long_tail. Treatment: Parse and tokenize (language-detect first), then embed or extract ingredient entities before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: milk cream, cream, sugar, banana, bacteria
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

nutrition_score_warning_fruits_vegetables_nuts_estimate_from_ingredients_value

numeric feature high_skew outliers

This appears to be an estimated percentage of fruits/vegetables/nuts content derived from ingredients, used in nutrition score warnings. The distribution is dominated by zeros (zero_rate 0.71) with median 0.0 and Q3 only 2.33, yet a long tail pushes max to 100.0, producing extreme skew (5.41) and kurtosis (30.37). Seven outliers (15.6%) and a 10% null rate further indicate this signal fires for only a small subset of products. Treatment: Binarize (zero vs non-zero) or log1p-transform before modelling given the heavy zero mass and skew. high · anthropic:claude-opus-4-7

n: 50
nulls: 5 (10.0%)
unique: 13
min: 0
max: 100
mean: 4.532
median: 0
std: 15.52
q1: 0
q3: 2.326
iqr: 2.326
skew: 5.411
kurtosis: 30.37
n_outliers: 7
outlier_rate: 0.1556
zero_rate: 0.7111
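
A minimal sketch of the two suggested transforms, kept side by side so both can be tried. The function name is hypothetical, and leaving nulls as `None` (rather than imputing 0) is an assumption made so the 10% missing rows stay distinguishable from true zeros.

```python
import math


def zero_flag_and_log1p(value):
    """Return (has_any, log1p_value) for one estimate; nulls stay None."""
    if value is None:
        return (None, None)
    has_any = int(value > 0)          # binarised: 0 for the zero mass, 1 otherwise
    compressed = math.log1p(value)    # log(1 + x) maps 0 to 0 and 100 to about 4.6
    return (has_any, compressed)
```

With roughly 71% zeros, the binary flag carries most of the signal; the log1p value is mainly useful if the size of the non-zero tail matters.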

ingredients_from_or_that_may_be_from_palm_oil_n

numeric feature

Likely a count of ingredients sourced from or potentially from palm oil per product. With only 3 unique values ranging 0–2 and 70.2% zeros, most products contain none, while the right skew (1.39) reflects a small tail with one or two such ingredients. Null rate is modest at 6%. Treatment: Treat as a low-cardinality ordinal count; impute missing as 0 or add a missing flag before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 3 (6.0%)
unique: 3
min: 0
max: 2
mean: 0.3404
median: 0
std: 0.5625
q1: 0
q3: 1
iqr: 1
skew: 1.393
kurtosis: 0.969
n_outliers: 0
outlier_rate: 0
zero_rate: 0.7021

origins_old

categorical metadata long_tail null_rate

Legacy free-text origin field, likely superseded (note the `_old` suffix). Of 50 rows, 22% are null and another 31 are empty strings, so the top_rate of 0.795 is dominated by blanks; only 9 distinct values exist and the non-empty entries mix country names ('France', 'Morocco'), multi-region comma lists, and noise like 'biologique' or 'Farine de blé: France'. Entropy ratio 0.425 confirms most signal is absent, and the inconsistent formats mean this cannot be used as a clean categorical without parsing. Treatment: Drop or archive; if needed, parse non-empty strings into a normalised origins list and prefer the replacement column. high · anthropic:claude-opus-4-7

n: 50
nulls: 11 (22.0%)
unique: 9
top_value: (empty string)
top_rate: 0.7949
cardinality: 9
entropy: 1.347
entropy_ratio: 0.4251

packaging_text_nl

categorical free_text null_rate imbalance

Dutch-language packaging text field. 76% of the 50 rows are null, and every one of the 12 non-null values is the empty string, giving cardinality 1 and entropy 0. The column carries no usable signal in this sample. Treatment: Drop; column is effectively empty (null or blank in all rows). high · anthropic:claude-opus-4-7

n: 50
nulls: 38 (76.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

expiration_date

categorical metadata long_tail

This is an expiration_date field captured as free-form text rather than a parsed date. It has 34 unique values across 50 rows; the top value is the empty string (top_rate 0.3125, i.e. 15 of the 48 non-null entries), on top of a 4% null rate. The remaining values mix incompatible formats like '31/07/2020', '28/02/24', '25.11.2025', '01/2018', '19-10-2023', and even non-date tokens like '30days'. Treatment: Normalise to ISO dates with multi-format parsing and treat blanks/'30days' as missing before use. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 34
top_value: (empty string)
top_rate: 0.3125
cardinality: 34
entropy: 4.364
entropy_ratio: 0.8578
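
Multi-format parsing for the examples quoted above might look like the sketch below. The format list and the day-first reading are assumptions based only on those sample strings, and month-only values are returned as `YYYY-MM` rather than being padded with an invented day.

```python
from datetime import datetime

# Tried in order; day-first is assumed because the quoted examples look European.
FULL_DATE_FORMATS = ("%d/%m/%Y", "%d/%m/%y", "%d.%m.%Y", "%d-%m-%Y")
MONTH_ONLY_FORMATS = ("%m/%Y",)


def parse_expiration(raw):
    """Return an ISO date (or YYYY-MM) string, or None when nothing parses."""
    if raw is None or not raw.strip():
        return None
    text = raw.strip()
    for fmt in FULL_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    for fmt in MONTH_ONLY_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m")
        except ValueError:
            continue
    # Blanks and tokens such as '30days' fall through as missing.
    return None
```

Two-digit years follow Python's `%y` convention, so '28/02/24' is read as 2024; any US-style month-first entries would be misread and need a separate check.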

selected_images

unknown other skipped

The column 'selected_images' was skipped by the profiler, so no type, cardinality, or distribution stats are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds image references or selections (filenames, URLs, or arrays), but this cannot be confirmed from the evidence. Without n_unique or value samples, no further characterisation is possible. Treatment: Inspect raw values manually to determine structure before deciding on parsing, exploding, or dropping. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

traces_from_ingredients

categorical free_text long_tail

Allergen trace declarations parsed from product ingredient lists, recorded as free-form comma-separated allergen names. 78% of the 50 rows (39) are empty strings rather than nulls, and the remaining 11 distinct values mix languages (French 'œuf', 'lait', English 'nuts, milk', German 'Schalenfrüchte') and inconsistent casing, with some entries duplicating the same allergens twice in one string. Treatment: Normalise case, split on commas, translate to a canonical allergen vocabulary, and treat empty strings as missing before one-hot encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 12
top_value: (empty string)
top_rate: 0.78
cardinality: 12
entropy: 1.521
entropy_ratio: 0.4243

ingredients_text_with_allergens

categorical free_text long_tail

Free-text ingredient lists with embedded HTML markup highlighting allergens. All 50 rows are unique (entropy_ratio 1.0, top_rate 0.02) and the language mix spans English, French, and Bulgarian Cyrillic, so any naive categorical encoding will explode. The HTML tags and multilingual content mean raw values need cleaning before NLP use. Treatment: Strip HTML tags, language-detect, then tokenize/embed; do not treat as a category. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: milk cream, cream, sugar, banana, bacteria
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1
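
A sketch of the tag-stripping step, which also pulls out the highlighted allergens before the markup is discarded. The report does not show the raw markup, so the `<span class="allergen">` shape is an assumption; the function name is hypothetical.

```python
import re
from html import unescape

# Assumed markup shape; adjust the pattern if the raw values differ.
ALLERGEN_SPAN = re.compile(
    r'<span class="allergen">(.*?)</span>', re.IGNORECASE | re.DOTALL
)
ANY_TAG = re.compile(r"<[^>]+>")


def split_markup(raw):
    """Return (plain_text, allergens) for one ingredient string."""
    allergens = [unescape(m).strip().lower() for m in ALLERGEN_SPAN.findall(raw)]
    plain = unescape(ANY_TAG.sub("", raw)).strip()
    return plain, allergens
```

Keeping the extracted allergens as a separate list gives a structured feature even if the multilingual free text itself is never embedded.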

image_front_thumb_url

categorical identifier long_tail

This column holds Open Food Facts thumbnail URLs pointing to product front images, embedding the product barcode and a language suffix (front_fr/front_en) in the path. Every one of the 50 rows is unique with zero nulls, so it functions as a per-row identifier rather than a feature. The mix of fr and en suffixes hints at a multi-locale product set. Treatment: Drop for modelling; retain as a media link or fetch images if vision features are needed. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.100.jpg
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

lc

categorical feature

This is a low-cardinality categorical with 5 distinct values that look like ISO 639-1 language codes (fr, en, de, bg, ro), suggesting a language tag for each row. The distribution is heavily concentrated: 'fr' accounts for 35 of 50 rows (top_rate 0.70), 'en' for 10, while 'de', 'bg', and 'ro' appear only 1-3 times. Entropy ratio of 0.56 confirms the imbalance, and there are no nulls. Treatment: One-hot encode, or group rare codes (bg, ro, de) into an 'other' bucket before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 5
top_value: fr
top_rate: 0.7
cardinality: 5
entropy: 1.294
entropy_ratio: 0.5572
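
Grouping the rare codes before encoding can be sketched as below. The `min_count` threshold of 5 is an arbitrary illustration, and the split of the five rare rows among de, bg, and ro in the sample is made up, since the report only says each appears 1 to 3 times.

```python
from collections import Counter


def collapse_rare(values, min_count=5, other="other"):
    """Replace any value seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Illustrative sample: fr and en counts come from the report, the rest are assumed.
sample = ["fr"] * 35 + ["en"] * 10 + ["de"] * 3 + ["bg"] + ["ro"]
collapsed = Counter(collapse_rare(sample))
```

One-hot encoding the collapsed column then yields three indicator columns instead of five, two of which would otherwise be set on a single row each.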

ingredients_text_debug

categorical free_text long_tail null_rate

Free-text ingredient lists in French (e.g., 'Lait écrémé, crème, sucre...'), apparently a debug dump of Open Food Facts product compositions. A near-maximal entropy ratio (0.997) and 35 unique values among the 36 non-null rows confirm that essentially every non-null row is distinct; 28% of rows are null and the top value is an empty string appearing twice. Texts vary widely in length and include allergen markup (_lait_, _soja_) plus stray non-ingredient prose such as publication dates. Treatment: Recode empty strings as nulls, then tokenize and embed (or parse into structured allergen/ingredient lists). high · anthropic:claude-opus-4-7

n: 50
nulls: 14 (28.0%)
unique: 35
top_value: (empty string)
top_rate: 0.05556
cardinality: 35
entropy: 5.114
entropy_ratio: 0.9971
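
The underscore markup mentioned above lends itself to a small extraction step. This is a sketch under the assumption that allergens are always wrapped as `_token_` with no nested underscores; the example string in the test is made up for illustration, not taken from the data.

```python
import re

# Assumed convention: an allergen is wrapped in single underscores, e.g. _lait_.
UNDERSCORE_ALLERGEN = re.compile(r"_([^_]+)_")


def extract_marked_allergens(text):
    """Return (text_without_markers, sorted_unique_allergens)."""
    found = UNDERSCORE_ALLERGEN.findall(text)
    allergens = sorted({token.strip().lower() for token in found})
    clean = UNDERSCORE_ALLERGEN.sub(r"\1", text)
    return clean, allergens
```

Rows that contain stray prose such as publication dates would pass through unchanged, so they still need a separate filter before tokenizing.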

packagings_materials_main

categorical feature null_rate

This is a low-cardinality categorical tagging the dominant packaging material, with only 3 distinct values across 50 rows ('en:paper-or-cardboard', 'en:plastic', 'en:unknown'). The headline issue is a 62% null rate, leaving just 19 observed rows where 'en:paper-or-cardboard' alone covers 68.4%. Entropy ratio of 0.70 indicates moderate concentration among the few non-null entries. Treatment: Impute missing as an explicit 'unknown' category before one-hot encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 31 (62.0%)
unique: 3
top_value: en:paper-or-cardboard
top_rate: 0.6842
cardinality: 3
entropy: 1.105
entropy_ratio: 0.6972

data_quality_dimensions

unknown other skipped

The column `data_quality_dimensions` was skipped by the profiler, so no type, uniqueness, or value statistics were computed beyond a row count of 50 and a null rate of 0.0. Without `n_unique` or any descriptive stats, its content and structure are unknown from this evidence alone. The name suggests it may hold structured or list-like quality metadata, but that cannot be confirmed here. Treatment: Re-profile with type inference forced, or inspect raw values manually before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

serving_size

categorical feature long_tail

Free-text serving size descriptors, with 37 unique values across only 50 rows (entropy ratio 0.98) and a 12% null rate. The top value '100g' covers just 6.8% of non-null rows (3 of 44), and formatting is inconsistent throughout: spacing varies ('100g' vs '100 g', '10 g' vs '20g') and some entries are compound strings like '1 Square (10 g)', so the same physical quantity can appear under multiple labels. Treatment: Parse into a numeric grams column via regex and unit normalization before use. high · anthropic:claude-opus-4-7

n: 50
nulls: 6 (12.0%)
unique: 37
top_value: 100g
top_rate: 0.06818
cardinality: 37
entropy: 5.107
entropy_ratio: 0.9803
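
A minimal regex sketch for the grams extraction, covering only the shapes quoted above. It deliberately handles just a bare `g` unit; kilograms, millilitres, and spelled-out units are left as `None` here and would need their own rules. The function name is hypothetical.

```python
import re

# A number (dot or comma decimal) followed by optional spaces and a standalone 'g'.
GRAMS = re.compile(r"(\d+(?:[.,]\d+)?)\s*g\b", re.IGNORECASE)


def serving_grams(raw):
    """Return the first gram quantity found in the string, or None."""
    if raw is None:
        return None
    match = GRAMS.search(raw)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))
```

After this step '100g' and '100 g' collapse to the same numeric value, and compound strings contribute the quantity inside the parentheses.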

pnns_groups_1_tags

unknown metadata skipped

This column is named pnns_groups_1_tags, suggesting it holds Programme National Nutrition Santé top-level group tags (likely a categorical food classification). Saturn skipped profiling, so no uniqueness, frequency, or value statistics are available beyond n=50 and a 0.0 null rate. Without distribution evidence, the cardinality and dominant categories cannot be confirmed. Treatment: Re-profile or inspect manually before use; if categorical, encode as a low-cardinality factor. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

origin

categorical free_text long_tail

Free-text origin/provenance field, likely scraped from product packaging in mixed French/German wording. The column is almost entirely empty: the blank string accounts for 42 of 50 rows (top_rate 0.894) and another 6% are null, leaving only 5 distinct populated values. Entropy ratio of 0.285 and the long_tail alert confirm there is essentially no usable signal as-is. Treatment: Drop or parse with regex/NER to extract country tokens; too sparse to use directly. high · anthropic:claude-opus-4-7

n: 50
nulls: 3 (6.0%)
unique: 6
top_value: (empty string)
top_rate: 0.8936
cardinality: 6
entropy: 0.7359
entropy_ratio: 0.2847

ingredients_lc

categorical metadata

This column appears to be a language code for ingredient text, with only 4 distinct values across 50 rows. French dominates at 70% (35 rows), followed by English (11), with Bulgarian and German trailing at 2 each. The skew is heavy and the entropy ratio of 0.606 confirms concentration around a single language. Treatment: One-hot encode or use as a filter; consider grouping rare languages into 'other'. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 4
top_value: fr
top_rate: 0.7
cardinality: 4
entropy: 1.212
entropy_ratio: 0.6061
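
The entropy figures in these stat blocks appear consistent with Shannon entropy in bits, and entropy_ratio with that entropy divided by log2 of the cardinality. The profiler's actual formula is not shown in this report, so the sketch below is a reconstruction checked against this column's counts (fr 35, en 11, bg 2, de 2), not a statement of how the tool computes it.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(number of categories)."""
    k = sum(1 for c in counts if c > 0)
    if k <= 1:
        return 0.0  # a constant column has no spread
    return entropy_bits(counts) / math.log2(k)


ingredients_lc_counts = [35, 11, 2, 2]  # fr, en, bg, de as stated in the text
h = entropy_bits(ingredients_lc_counts)
ratio = entropy_ratio(ingredients_lc_counts)
```

Under this reading a ratio near 1 means values are spread almost evenly over the observed categories, while a ratio near 0 means one value dominates, which is why blank-heavy columns score low even when their few populated values are all distinct.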

packaging_old

categorical free_text long_tail

Free-form packaging descriptions, almost certainly from an Open Food Facts-style export, mixing French and English tokens plus language-prefixed tags (e.g. 'fr:Triman', 'en:Bottle'). With 40 unique values across 50 rows and entropy_ratio 0.99, it's near-unique; the top value 'Plastique' covers only 6.98% and 14% are null. Entries are comma-separated multi-tags of varying granularity, so this behaves more like a tag list than a clean category. Treatment: Split on commas, normalise language prefixes, and one-hot the resulting tag set rather than treating raw strings as categories. high · anthropic:claude-opus-4-7

n: 50
nulls: 7 (14.0%)
unique: 40
top_value: Plastique
top_rate: 0.06977
cardinality: 40
entropy: 5.269
entropy_ratio: 0.9901

packaging_text_fr

categorical free_text long_tail

Free-text French packaging instructions, mostly empty: 34 of 50 rows (top_rate 0.723) are blank and another 6% are null. Of 14 distinct values, the populated ones are heterogeneous descriptions of materials and recycling instructions (plastic films, cardboard étuis, aluminium sheets), with one outlier containing OCR-like artefacts and a date string. The entropy ratio of 0.49 is pulled down by the blanks; among the 13 non-empty entries every value is unique, which is what the long-tail alert reflects. Treatment: Treat blanks as missing and parse remaining strings for material/recyclability tokens rather than using as a categorical. high · anthropic:claude-opus-4-7

n: 50
nulls: 3 (6.0%)
unique: 14
top_value: (empty string)
top_rate: 0.7234
cardinality: 14
entropy: 1.874
entropy_ratio: 0.4923

nova_group_debug

categorical metadata long_tail imbalance

A categorical debug/diagnostic field, presumably trace messages from a NOVA food-group classifier. It's overwhelmingly empty (96% blank, 48 of 50 rows), with only two non-empty entries — both error strings explaining that NOVA classification was skipped due to unknown ingredients. Entropy ratio of 0.178 confirms near-zero information content. Treatment: Drop; near-constant debug log with no modelling value. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 3
top_value: (empty string)
top_rate: 0.96
cardinality: 3
entropy: 0.2823
entropy_ratio: 0.1781

ingredients_original_tags

unknown free_text skipped

The column `ingredients_original_tags` was skipped by the profiler, so no statistics, uniqueness, or value samples are available beyond a row count of 50 and a null rate of 0.0. The name suggests a list-valued field of ingredient tags (likely arrays or delimited strings), which is consistent with the profiler classifying it as `unknown` and bailing out. Without type or cardinality signals, nothing further can be inferred from the evidence. Treatment: Parse the list/array structure and explode or multi-hot encode before downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

data_quality_completeness_tags

unknown other skipped

This column is named data_quality_completeness_tags but Saturn skipped profiling it, so its kind is unknown and no uniqueness or value statistics were computed. The only confirmed signals are that it has 50 rows with a null rate of 0.0. Without sample values or cardinality, its actual content (likely tag strings about completeness checks) cannot be verified from the evidence. Treatment: Re-profile with explicit parsing before deciding how to use it downstream. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

cities_tags

unknown other skipped

The column `cities_tags` was skipped by the profiler, so no type inference, uniqueness count, or value statistics are available. Only two facts are known: 50 rows were seen and none were null. Without further stats the content and structure cannot be characterised. Treatment: Re-profile or inspect raw values manually before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

countries_hierarchy

unknown feature skipped

Column `countries_hierarchy` was skipped by the profiler, so no kind, uniqueness, or value statistics are available beyond a row count of 50 with zero nulls. The name suggests a nested or list-like representation of country tags (e.g., `en:france > en:europe`), which likely tripped the type detector. Treat the absence of stats as a signal that the values are non-scalar rather than missing. Treatment: Parse the hierarchical strings into a list of country tags, then explode or one-hot encode before modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nutriscore_score_opposite

numeric feature

Numeric column holding the negation of a Nutri-Score (range -40 to 0, median -19). Because a lower Nutri-Score is nutritionally better, the sign flip means higher values here (closer to 0) correspond to better grades and more negative values to worse ones. Distribution is roughly symmetric (skew 0.16, kurtosis -0.53) with no outliers and an IQR of 15. Notable signals: 2% nulls, 8% zeros, and only 28 unique values across 50 rows, consistent with an integer score derived by sign-flipping the original Nutri-Score. Treatment: Use as-is for modelling, or flip the sign back to the original Nutri-Score for interpretability. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 28
min: -40
max: 0
mean: -17.47
median: -19
std: 9.906
q1: -25
q3: -10
iqr: 15
skew: 0.1616
kurtosis: -0.5337
n_outliers: 0
outlier_rate: 0
zero_rate: 0.08163

categories_properties_tags

unknown other skipped

The column `categories_properties_tags` was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests a nested or multi-valued field (categories/properties/tags), which likely tripped the dissector's scalar assumptions. Without distinct-value or sample evidence, its actual content and cardinality are unknown. Treatment: Re-profile after flattening or JSON-parsing this field before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

origins_lc

categorical feature

This appears to be a language code attached to the origins field (the `_lc` suffix matches the other language-code columns such as `lc` and `ingredients_lc`), with 6 distinct values across 50 rows and a 4% null rate. The distribution is dominated by 'fr' (23) and 'en' (20), together accounting for nearly all non-null entries, while 'es', 'de', 'it', and 'pl' appear only once or twice each. Entropy ratio of 0.61 confirms the heavy concentration in two categories. Treatment: One-hot encode with rare categories (es/de/it/pl) collapsed into an 'other' bucket. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 6
top_value: fr
top_rate: 0.4792
cardinality: 6
entropy: 1.575
entropy_ratio: 0.6093

ciqual_food_name_tags

unknown other skipped

This column, ciqual_food_name_tags, was skipped by the profiler so no distributional statistics are available. The only confirmed signals are 50 rows present and a null rate of 0.0; uniqueness, value samples, and type are all missing. Based on the name alone it likely holds CIQUAL food-name tag strings, but that cannot be verified from the evidence. Treatment: Re-run the profiler on this column to recover type and cardinality before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

countries

categorical free_text long_tail

Free-text country list per record, not a clean categorical: 43 unique values across 50 rows (entropy ratio 0.97) with the top value 'Maroc' at only 10%. Values mix languages (Maroc vs Morocco, Belgique vs Belgium), comma-separated multi-country strings, and even an 'en:switzerland' prefix, so the same country appears in several surface forms. The 'long_tail' alert is consistent with this near-unique, multi-label encoding. Treatment: Split on commas, normalise language and prefixes to ISO country codes, then one-hot or multi-hot encode. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 43
top_value: Maroc
top_rate: 0.1
cardinality: 43
entropy: 5.252
entropy_ratio: 0.9678

ingredients_text_with_allergens_it

categorical free_text long_tail null_rate

Italian-language ingredient lists with embedded HTML markup, one row per product. Coverage is poor: 68% of the 50 rows are null and the most common non-null value is the empty string (5 occurrences, 31% of present values), leaving only a handful of genuine ingredient strings. Among the 12 distinct values, contents range from short lists (e.g. "patate, olio di girasole, sale marino.") to long compound declarations, so length and structure vary widely. Treatment: Strip HTML allergen tags, treat empty strings as null, then tokenize for NLP or extract allergen flags as features. high · anthropic:claude-opus-4-7

n: 50
nulls: 34 (68.0%)
unique: 12
top_value
top_rate: 0.3125
cardinality: 12
entropy: 3.274
entropy_ratio: 0.9134

packaging_lc

categorical metadata

This column appears to be the language code of packaging text, with 7 distinct ISO-style codes across 50 rows. French and English tie at 17 occurrences each, though the reported top_rate of 0.386 reflects only one being chosen as top_value ('fr'); German trails at 5, with Portuguese, Italian, Spanish, and Croatian as singletons. A 12% null rate and entropy ratio of 0.71 indicate moderate diversity but a clear FR/EN dominance. Treatment: Treat as a low-cardinality categorical; impute nulls and one-hot encode, optionally collapsing rare codes into 'other'. high · anthropic:claude-opus-4-7

n: 50
nulls: 6 (12.0%)
unique: 7
top_value: fr
top_rate: 0.3864
cardinality: 7
entropy: 1.992
entropy_ratio: 0.7094

correctors_tags

unknown other skipped

The column `correctors_tags` was skipped by the profiler, so no type, uniqueness, or distributional statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds tags identifying correctors, plausibly a list- or set-valued field that the dissector could not coerce into a known kind. Without further stats, nothing can be said about cardinality, value mix, or skew. Treatment: Re-profile after parsing into a primitive type (e.g., explode tags into strings) before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

interface_version_created

categorical metadata

This column appears to record the interface version in use when each record was created, encoded as a date-stamp with optional jQuery Mobile suffix. Only 3 distinct values appear across 50 rows, with '20120622' dominating at 59.2% of non-null rows and the rare '20130323.jqm' appearing just twice. Entropy ratio of 0.74 confirms moderate concentration, and there is a 2% null rate to account for. Treatment: Treat as a low-cardinality categorical; one-hot encode or bucket the rare '20130323.jqm' level. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 3
top_value: 20120622
top_rate: 0.5918
cardinality: 3
entropy: 1.167
entropy_ratio: 0.7363

states_tags

unknown other skipped

This column was skipped by the profiler, so no statistics beyond row count (50) and a 0.0 null rate are available. The name suggests it holds tag-like state markers, possibly multi-valued, but kind is reported as unknown and uniqueness is not measured. Without sampled values or cardinality, its content and structure cannot be characterised from the evidence. Treatment: Re-profile with parsing enabled (likely a delimited tag list) before deciding whether to one-hot or drop. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nutriscore_2021_tags

unknown feature skipped

This column is labelled nutriscore_2021_tags, suggesting Nutri-Score grade tags from a 2021 reference (typically values like a/b/c/d/e). Saturn skipped profiling, so no distribution, uniqueness, or value statistics are available beyond a row count of 50 and a 0.0 null rate. No further signal can be extracted without re-profiling. Treatment: Re-profile or inspect manually; if confirmed categorical, treat as an ordinal Nutri-Score grade. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

stores_tags

unknown other skipped

This column was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests a multi-valued tag field associated with stores (likely a list or delimited string), but this cannot be confirmed from the evidence. Treatment: Re-profile after parsing as a list/array to determine cardinality and tag distribution before use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

image_thumb_url

categorical metadata long_tail

This column holds Open Food Facts product thumbnail URLs, one per row. Every one of the 50 values is unique (entropy_ratio 1.0, top_rate 0.02), so it acts as a per-row asset pointer rather than a categorical feature. URLs mix `front_fr` and `front_en` locale suffixes, hinting at a French/English product mix. Treatment: Drop for modelling; retain only as a display link or for image-fetching pipelines. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.100.jpg
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

categories_properties

unknown other skipped

This column was skipped by the profiler, so its type, cardinality, and value distribution are unknown beyond a count of 50 rows with no nulls. The name `categories_properties` suggests a nested or structured field (e.g., a list or dict of category attributes) that the profiler could not coerce into a scalar kind. Without parsed contents there is nothing further to infer. Treatment: Inspect raw values and parse the nested structure (explode or flatten) before profiling again. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nucleotides_prev_tags

unknown other skipped

Saturn skipped profiling for this column, so its type and contents are unknown beyond a row count of 50 and a null rate of 0.0. The name suggests it holds prior tags associated with nucleotide records, possibly a list or nested structure that the profiler could not introspect. No uniqueness, distribution, or value statistics are available to characterise it further. Treatment: Inspect raw values manually to determine structure before deciding on parsing or encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

allergens_from_ingredients

categorical feature long_tail

Free-text allergen list parsed from ingredient strings, mixing Open Food Facts taxonomy codes (en:gluten, en:milk, en:soybeans) with raw multilingual tokens (blé, lait, NOISETTES, соеви). 30% of the 50 rows (15) are empty strings, and the remaining 34 distinct values are nearly all singletons, some with duplicated tokens within a single cell, so this is dirty list-encoded data rather than a clean category. Treatment: Split on commas, normalize to en: taxonomy codes, dedupe tokens, then multi-hot encode. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 35
top_value: (empty string)
top_rate: 0.3
cardinality: 35
entropy: 4.432
entropy_ratio: 0.864
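
The dedupe and multi-hot steps can be sketched as follows. This assumes the tokens have already been mapped to en: codes by some earlier step, which is not shown here; the function name and the toy rows in the test are illustrative only.

```python
def multi_hot(rows):
    """Return (vocab, matrix): one 0/1 column per distinct token, one row per cell."""
    cleaned = []
    for raw in rows:
        tokens = {t.strip().lower() for t in (raw or "").split(",") if t.strip()}
        cleaned.append(tokens)  # a set removes tokens repeated within one cell
    vocab = sorted(set().union(*cleaned)) if cleaned else []
    matrix = [[int(token in tokens) for token in vocab] for tokens in cleaned]
    return vocab, matrix
```

Empty strings become all-zero rows, so the 15 blank cells need no special category of their own.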

ingredients_text_with_allergens_fi

categorical free_text long_tail null_rate

Finnish-language ingredient text with inline HTML allergen markup, mirroring the multilingual ingredient fields common in Open Food Facts. Coverage is extremely thin: null_rate is 0.9 and only 4 distinct values exist across n=50, with the empty string itself appearing twice as the top_value (top_rate 0.4 of non-nulls). The non-empty entries are long free-text strings wrapping allergens in tags rather than clean tokens. Treatment: Strip HTML tags and tokenize for allergen extraction; otherwise drop, since 90% are null. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 4
top_value: (empty string)
top_rate: 0.4
cardinality: 4
entropy: 1.922
entropy_ratio: 0.961

_keywords

unknown other skipped

The column `_keywords` was skipped by the profiler, so kind is unknown and no statistics (n_unique, value distribution, length, etc.) are available. The only confirmed signals are 50 rows with a 0.0 null rate. Without further evidence the content and structure cannot be characterised — the name suggests a keyword list, but this is not verified. Treatment: Re-profile with appropriate parser (likely list/tokenized text) before deciding usage. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

manufacturing_places

categorical free_text long_tail

Free-text manufacturing locations, mostly country names but mixed with multi-token strings combining cities, regions, and postal codes. The dominant value is the empty string (20 of 50, top_rate 0.408), making missing-or-blank the modal state, with France a distant second at 9. Across 20 unique values the entropy_ratio of 0.737 plus the long_tail alert signals scattered, inconsistent formatting (e.g. 'France,Italie' vs full German address chains). Treatment: Normalise blanks to null and parse/standardise to country tokens before using as a feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 20
top_value: (empty string)
top_rate: 0.4082
cardinality: 20
entropy: 3.187
entropy_ratio: 0.7374

pnns_groups_2

categorical label

This is a food sub-category label (PNNS group 2), with 11 distinct values across 50 rows and no nulls. The distribution is heavily concentrated in sweets: 'Biscuits and cakes' (17) and 'Chocolate products' (16) account for 33 of 50 rows, giving a top_rate of 0.34 and entropy_ratio of 0.75. Two rows carry the literal value 'unknown', which should be treated as missing rather than a real category. Treatment: One-hot or target-encode after recoding 'unknown' to null. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 11
top_value: Biscuits and cakes
top_rate: 0.34
cardinality: 11
entropy: 2.599
entropy_ratio: 0.7513

ingredients_text_pl

categorical free_text long_tail null_rate

Polish-language ingredient text for food products, almost entirely absent in this sample: 90% of rows are null and of the 5 non-null rows, 3 are empty strings, leaving only 2 genuine ingredient lists (both for cocoa-based chocolate). With n_unique=3 across 50 rows, this column carries virtually no usable signal here. Treatment: Drop unless Polish-language analysis is required; too sparse to model. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 3
top_value: (empty string)
top_rate: 0.6
cardinality: 3
entropy: 1.371
entropy_ratio: 0.865

generic_name_es

categorical metadata long_tail null_rate

Spanish-language generic product name, sparsely populated with only 7 unique values across 50 rows and a 60% null rate. The non-null values skew toward dark chocolate descriptions (e.g., 'Chocolate negro' appears twice, with several variants citing cacao percentages), consistent with the chocolate-heavy mix of this sample. The top rate of 0.65 reflects the empty string acting as the modal 'value' (13 of the 20 non-null rows), so only 7 rows carry text and usable coverage is even thinner than the null rate alone implies. Treatment: Treat empty strings as nulls and drop or backfill from a canonical product-name field before use. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 7
top_value
top_rate: 0.65
cardinality: 7
entropy: 1.817
entropy_ratio: 0.6471

origin_en

categorical metadata imbalance

Categorical column likely intended to mark country of origin in English, but it is effectively empty: of 50 rows, 14% are null and 42 of the remaining values are blank strings, leaving just one populated label ("France"). Cardinality is 2 with a top_rate of 0.977 and entropy_ratio of 0.16, so the field carries almost no information. Treatment: Drop; the column is near-constant with blanks and only one real value. high · anthropic:claude-opus-4-7

n: 50
nulls: 7 (14.0%)
unique: 2
top_value
top_rate: 0.9767
cardinality: 2
entropy: 0.1594
entropy_ratio: 0.1594

generic_name_it

categorical metadata long_tail null_rate

Italian generic product name, evidently a localized label field on food items. 68% of rows are null, and among the 16 non-null entries the most common value is the empty string (11 occurrences). Because the cardinality of 5 counts the empty string, the remaining 5 populated entries cover only 4 distinct real names, such as 'Cioccolato extra fondente' and 'Crackers'. Coverage is too sparse to be useful as-is. Treatment: Drop or defer until Italian coverage improves; not usable at 68% null. high · anthropic:claude-opus-4-7

n: 50
nulls: 34 (68.0%)
unique: 5
top_value
top_rate: 0.6875
cardinality: 5
entropy: 1.497
entropy_ratio: 0.6446

ingredients_that_may_be_from_palm_oil_n

numeric feature high_skew outliers

Count of ingredients that may be derived from palm oil per product. Values are extremely concentrated at zero (zero_rate 0.83, median and IQR both 0), with only 3 distinct values up to a max of 2, yet 17% of non-null rows register as outliers and skew is 2.23. An 8% null rate also means some products lack this assessment entirely. Treatment: Binarise to zero/non-zero or drop, since the column is near-constant with heavy skew. high · anthropic:claude-opus-4-7

n: 50
nulls: 4 (8.0%)
unique: 3
min: 0
max: 2
mean: 0.1957
median: 0
std: 0.4531
q1: 0
q3: 0
iqr: 0
skew: 2.23
kurtosis: 4.321
n_outliers: 8
outlier_rate: 0.1739
zero_rate: 0.8261
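
The 17% outlier rate is less alarming than it looks. The sketch below assumes the profiler flags outliers with Tukey's 1.5 × IQR fences, which the report does not state. With q1 = q3 = 0 the fences collapse to zero, so every non-zero count is flagged. The value list is illustrative, chosen to match the reported n, zero_rate, mean, and max; it is not read from the raw file.

```python
import statistics

# Illustrative values consistent with the reported stats (46 non-null rows,
# 38 zeros, mean ~0.1957, max 2). This is NOT the actual raw data.
values = [0] * 38 + [1] * 7 + [2]

q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Assumed outlier rule: anything outside the Tukey fences.
outliers = [v for v in values if v < low_fence or v > high_fence]
outlier_rate = len(outliers) / len(values)

# Binarised feature suggested by the treatment: any suspect ingredient at all?
has_palm_oil_suspect = [v > 0 for v in values]

print(len(outliers), round(outlier_rate, 4), sum(has_palm_oil_suspect))
```

Under that assumption the "outliers" are simply the non-zero rows, so the binary flag keeps all of the information the outlier count was pointing at.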

ingredients_text_es

categorical free_text long_tail null_rate

Spanish-language ingredient lists for food products, stored as free text. Of 50 rows, 60% are null and another 8 entries (top_rate 0.4) are empty strings, leaving 12 populated entries, mostly chocolate and cereal formulations with allergen markers like _TRIGO_ and _HUEVO_. The 13-value cardinality (12 texts plus the empty string) and high entropy ratio (0.84) reflect that every non-empty entry is unique long-form prose, not a true category. Treatment: Treat as multilingual free text: tokenize/embed or parse into ingredient lists; do not use as a categorical feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 30 (60.0%)
unique: 13
top_value
top_rate: 0.4
cardinality: 13
entropy: 3.122
entropy_ratio: 0.8437

teams

categorical feature long_tail

This column holds team affiliations as comma-separated lists of slugs, with 39 unique combinations across 50 rows and an 8% null rate. Cardinality is extreme (entropy ratio 0.97) and the most common value 'pain-au-chocolat' covers only 10.9%, while several rows pack 4-14 teams into one string. The mix of single-team and multi-team entries means this is effectively a multi-label field stored as a delimited string. Treatment: split on commas and one-hot encode as multi-label team membership before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 4 (8.0%)
unique: 39
top_value: pain-au-chocolat
top_rate: 0.1087
cardinality: 39
entropy: 5.124
entropy_ratio: 0.9695
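
The split-and-encode treatment can be sketched with the standard library alone. This is a minimal illustration, assuming a plain comma delimiter with optional surrounding whitespace; `multi_hot` is a hypothetical helper, and every slug other than the reported top value 'pain-au-chocolat' is made up.

```python
def multi_hot(rows, sep=","):
    """Split delimited strings and return (sorted vocabulary, indicator rows)."""
    token_sets = [
        {t.strip() for t in row.split(sep) if t.strip()} if row else set()
        for row in rows
    ]
    vocab = sorted(set().union(*token_sets))
    matrix = [[int(tok in toks) for tok in vocab] for toks in token_sets]
    return vocab, matrix

# 'pain-au-chocolat' is the reported top value; the other slugs are hypothetical.
teams = ["pain-au-chocolat", "pain-au-chocolat,team-a,team-b", None, "team-b"]
vocab, matrix = multi_hot(teams)
print(vocab)
print(matrix)
```

Null rows come out as all-zero vectors here; if missingness itself is informative, a separate indicator column may be worth adding.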

food_groups_tags

unknown feature skipped

This column is labeled food_groups_tags, suggesting it holds categorical food group classifiers (likely list-valued or comma-delimited tags). Saturn skipped profiling, so no uniqueness, cardinality, or distribution stats are available beyond a 50-row sample with zero nulls. Treat any inferences cautiously until the column is re-profiled with a parser that handles its native type. Treatment: Re-profile after parsing as a list/multi-label field, then one-hot or multi-hot encode the tags. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

data_quality_warnings_tags

unknown metadata skipped

This column is named data_quality_warnings_tags, suggesting it carries flags or tag arrays describing data quality issues per row. Saturn skipped profiling, so no uniqueness, value distribution, or stats are available beyond a 50-row sample with 0% nulls. Without parsed contents it is impossible to tell whether the field is empty strings, lists, or structured tags. Treatment: Inspect raw values manually and parse tag structure before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

debug_tags

unknown metadata skipped

Column 'debug_tags' was skipped by the profiler and classified as kind 'unknown', so no descriptive statistics are available beyond a row count of 50 and a null rate of 0.0. Uniqueness, distribution, and content type are all unreported, meaning we cannot infer what the field carries. The name suggests internal debugging annotations rather than analytical signal. Treatment: Drop unless a downstream consumer specifically needs the debug annotations. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

main_countries_tags

unknown feature skipped

The column `main_countries_tags` was skipped during profiling, so no type, cardinality, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests a tags-style field listing principal countries, likely delimited or list-valued, which is probably why the profiler bailed. Nothing else can be inferred without re-profiling with list/text handling enabled. Treatment: Re-profile as a multi-valued tag field, then split/explode and one-hot encode before modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

origins_hierarchy

unknown other skipped

Profiling was skipped for this column, so saturn emitted no type, uniqueness, or value statistics beyond a row count of 50 and a null rate of 0.0. The name suggests a nested or path-like representation of origin categories (e.g. a taxonomy hierarchy), but without parsed values this is inference from the label only. Treat it as opaque until reprofiled. Treatment: Reprofile after parsing the hierarchy (split path or explode levels) before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

packagings_complete

numeric feature

This is a binary 0/1 flag (n_unique=2, min=0, max=1) indicating whether packaging information is complete. The split is nearly even with a mean of 0.52 and zero_rate of 0.48, and 4% of rows are null. The strongly negative kurtosis (-1.99) is expected for a balanced binary variable. Treatment: Cast to boolean and impute the 4% nulls before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 2
min: 0
max: 1
mean: 0.5208
median: 1
std: 0.5049
q1: 0
q3: 1
iqr: 1
skew: -0.08341
kurtosis: -1.993
n_outliers: 0
outlier_rate: 0
zero_rate: 0.4792
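
For a 0/1 flag, skew and kurtosis follow from the mean alone, which is why they add nothing here. The sketch below assumes the profiler reports uncorrected (population) skewness and excess kurtosis; 25 ones among the 48 non-null rows reproduces the reported mean of 0.5208.

```python
import math

# 25 ones among 48 non-null rows matches the reported mean (0.5208).
p = 25 / 48
var = p * (1 - p)

# Closed-form moments of a Bernoulli(p) variable.
skew = (1 - 2 * p) / math.sqrt(var)
excess_kurtosis = (1 - 6 * var) / var

print(round(skew, 5), round(excess_kurtosis, 3))
```

Both values agree with the card to the displayed precision. The kurtosis approaches -2 for any near-balanced binary column and does not indicate a data problem.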

nutriscore_tags

unknown label skipped

This column is labeled nutriscore_tags, suggesting it holds Nutri-Score classification tags (likely letter grades a-e or arrays thereof) for food products. Profiling was skipped, so no cardinality, value distribution, or type stats are available beyond a 50-row sample with zero nulls. Without n_unique or value frequencies, the actual contents and structure remain unverified. Treatment: Re-profile with parsing enabled, then one-hot or ordinal-encode the Nutri-Score grade for modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text_with_allergens_nl

categorical free_text long_tail null_rate

Dutch-language ingredient lists with inline HTML allergen markup, evidently the NL localisation of an Open Food Facts-style ingredients field. Coverage is poor: 78% null and only 9 distinct values across 50 rows. Most populated entries are short cocoa/chocolate ingredient strings, while one row is clearly mis-parsed packaging footer text (Mondelez addresses, URLs). Entropy ratio 0.95 confirms the few present values are nearly all unique, so this is free text rather than a category. Treatment: Strip HTML allergen tags, then tokenize for NLP/allergen extraction; expect heavy missingness so do not use as a primary feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 39 (78.0%)
unique: 9
top_value
top_rate: 0.2727
cardinality: 9
entropy: 3.027
entropy_ratio: 0.955

created_t

numeric timestamp

Values are 10-digit integers ranging from 1,337,517,352 to 1,724,094,916 with all 50 rows unique and no nulls — consistent with Unix epoch seconds spanning roughly mid-2012 to mid-2024. The distribution is mildly right-skewed (skew 0.33) and platykurtic (kurtosis -0.81), with a median of 1,475,927,880.5 sitting near the mean, suggesting events are spread fairly evenly across the window rather than clustered. The name 'created_t' reinforces a creation-timestamp interpretation. Treatment: convert from Unix seconds to datetime and derive features (year, recency) before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
min: 1.338e+09
max: 1.724e+09
mean: 1.483e+09
median: 1.476e+09
std: 1.043e+08
q1: 1.386e+09
q3: 1.555e+09
iqr: 1.694e+08
skew: 0.3311
kurtosis: -0.8095
n_outliers: 0
outlier_rate: 0
zero_rate: 0
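
The conversion suggested in the treatment might look like the sketch below. It assumes the values are Unix seconds in UTC, as the card infers; `created_features` and the choice of reference time `now_t` are hypothetical.

```python
from datetime import datetime, timezone

def created_features(created_t, now_t):
    """Turn Unix epoch seconds into a UTC datetime plus simple derived features."""
    created = datetime.fromtimestamp(created_t, tz=timezone.utc)
    age_days = (now_t - created_t) / 86_400
    return {"created": created, "year": created.year, "age_days": age_days}

# Minimum and maximum values quoted in the card above; the maximum doubles
# as the reference time so that recency is relative to the newest record.
oldest = created_features(1_337_517_352, now_t=1_724_094_916)
newest = created_features(1_724_094_916, now_t=1_724_094_916)
print(oldest["created"].isoformat(), newest["created"].isoformat())
```

Anchoring recency to the newest record rather than the wall clock keeps the feature reproducible across reruns.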

traces_hierarchy

unknown other skipped

Column 'traces_hierarchy' was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. Without kind inference or sample stats, the content remains unknown — the name hints at nested trace/hierarchy data (likely a complex or non-scalar structure), which is consistent with the profiler skipping it. Treatment: Inspect raw values manually and parse the nested structure before it can be profiled or modelled. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

generic_name_nb

categorical metadata null_rate imbalance

This appears to be a Norwegian Bokmål generic name field, likely meant to hold localized product names. It is effectively empty: 96% of the 50 rows are null and the only non-null value observed is the empty string (2 occurrences), giving cardinality 1 and zero entropy. Treatment: Drop; the column carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_de

categorical free_text long_tail null_rate

German-language ingredient lists with embedded HTML tags marking allergens like SOJA, WEIZEN, and HASELNÜSSE. Two-thirds of rows are null (null_rate 0.66), and the 17 non-null rows hold 16 distinct values (entropy_ratio 0.99); the only repeat is the empty string, which appears twice and is therefore the top value. Casing and punctuation are inconsistent across entries, and one row contains lowercase OCR-style text with stray newlines. Treatment: Strip HTML tags to extract allergen labels into a multi-hot feature, then drop or embed the residual text. high · anthropic:claude-opus-4-7

n: 50
nulls: 33 (66.0%)
unique: 16
top_value
top_rate: 0.1176
cardinality: 16
entropy: 3.97
entropy_ratio: 0.9925
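
Allergen extraction can be sketched as below. The report only says "embedded HTML tags", so the `<span class="allergen">` wrapper is an assumption that should be checked against the raw values. The sample string is hypothetical and borrows two allergen names from the card.

```python
import re

# Assumed markup: allergens wrapped in <span class="allergen">...</span>.
ALLERGEN_RE = re.compile(
    r'<span class="allergen">(.*?)</span>', re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")

def split_allergens(text):
    """Return (sorted lowercase allergen labels, text with all tags removed)."""
    if not text:
        return [], ""
    allergens = sorted({m.strip().lower() for m in ALLERGEN_RE.findall(text)})
    plain = " ".join(TAG_RE.sub("", text).split())  # also collapses stray newlines
    return allergens, plain

# Hypothetical German entry, not taken from the dataset.
sample = (
    'Zucker, <span class="allergen">WEIZEN</span>mehl, '
    'Emulgator (<span class="allergen">SOJA</span>lecithin)'
)
allergens, plain = split_allergens(sample)
print(allergens, plain)
```

Lowercasing the labels addresses the inconsistent casing noted above; the resulting label sets can then feed a multi-hot encoding.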

ingredients_text_with_allergens_es

categorical free_text long_tail null_rate

Spanish-language ingredient lists with inline HTML markup highlighting allergens like trigo, soja, avellanas, and lactosa. 62% of the 50 rows are null and another 7 entries are empty strings, leaving 12 populated free-text ingredient lists, each of them distinct (13 unique values including the empty string, entropy ratio 0.87). Treatment: Strip the allergen HTML tags, then tokenize/embed or parse into a structured allergen list before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 31 (62.0%)
unique: 13
top_value
top_rate: 0.3684
cardinality: 13
entropy: 3.214
entropy_ratio: 0.8684

product_name_fr

categorical free_text long_tail

French-language product names from what appears to be a food/grocery catalogue (chocolate bars, mineral water, biscuits). With 47 unique values across 50 rows and entropy ratio 0.996, this is essentially a free-text label rather than a categorical feature — the top value 'Henry's' only appears twice (4%). One null and a long-tail alert are flagged. Treatment: Treat as free-text product label; tokenize and embed (or use as a join key to a product table) rather than one-hot encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 47
top_value: Henry’s
top_rate: 0.04082
cardinality: 47
entropy: 5.533
entropy_ratio: 0.9961

stores

categorical feature long_tail

Comma-delimited list of retail chains where each product was observed (Lidl, Carrefour, Tesco, etc.), stored as a single string per row. Cardinality is high (31 unique across 50 rows, entropy_ratio 0.854) because most non-empty entries are bespoke multi-store concatenations appearing only once. The dominant value is an empty string at 29.17% top_rate plus a 4% null_rate, so roughly a third of rows carry no store information at all. Treatment: Split on comma and one-hot or multi-hot encode individual store names; treat empty string as missing. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 31
top_value
top_rate: 0.2917
cardinality: 31
entropy: 4.233
entropy_ratio: 0.8543

_id

categorical identifier long_tail

This column is a unique record identifier — every one of the 50 rows has a distinct value (n_unique=50, top_rate=0.02, entropy_ratio=1.0). Values look like long numeric codes resembling EAN/GTIN barcodes (e.g., '6111242100992', '7622210578464'), with at least one shorter outlier ('20995553'). The long_tail alert simply reflects that each value occurs exactly once. Treatment: drop from modelling features; retain as a join key. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: 6111242100992
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

nutriments

unknown other skipped

The column 'nutriments' was skipped by the profiler, so no statistics, uniqueness, or value samples are available beyond a row count of 50 and a null rate of 0.0. The name suggests it likely holds nested nutritional data (e.g., a struct or JSON object per product), which is consistent with the profiler's inability to classify it as a standard kind. Without parsed contents we cannot describe its distribution or cardinality. Treatment: Parse/flatten the nested structure into typed sub-columns before profiling or modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —
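
Flattening a nested field before profiling could look like this minimal sketch. It assumes `nutriments` is a dict per row, which the profiler output cannot confirm; the inner keys and values are hypothetical placeholders and `flatten` is an illustrative helper.

```python
def flatten(record, parent="", sep="."):
    """Recursively flatten nested dicts into a single-level dict of scalars."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

# Hypothetical row: the keys inside 'nutriments' are assumptions, not profiled values.
row = {"code": "6111242100992", "nutriments": {"sugars_100g": 48.0, "salt_100g": 0.1}}
flat = flatten(row)
print(flat)
```

Each flattened key becomes its own typed column, so the profiler can report nulls and distributions per nutrient instead of skipping the whole field.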

editors

unknown other skipped

The column is named "editors" and was skipped by the profiler, so its kind is unknown and no descriptive statistics were computed. Across 50 rows there are zero nulls, but uniqueness, type, and value distribution are all unreported. Without further evidence, the content (likely a list or nested structure of editor entries) cannot be characterised. Treatment: Inspect raw values manually to determine type before deciding whether to parse, explode, or drop. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

max_imgid

categorical identifier long_tail

`max_imgid` holds 38 distinct integer-like strings across 50 rows with no nulls, suggesting it stores the maximum image identifier per record. Distribution is nearly uniform (entropy_ratio 0.98) with the top value '47' appearing only 3 times (top_rate 0.06), so it behaves like a high-cardinality numeric id mis-typed as categorical. The long_tail alert confirms most values occur once or twice. Treatment: Cast to integer and treat as a numeric id; do not one-hot encode. medium · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 38
top_value: 47
top_rate: 0.06
cardinality: 38
entropy: 5.149
entropy_ratio: 0.9811

nutriscore_grade

categorical label

This is the Nutri-Score grade, a categorical food-health rating with the expected letter levels a-e plus an 'unknown' bucket, giving 6 distinct values across 50 rows with no nulls. The distribution is heavily weighted toward the worst grade: 'e' alone accounts for 54% (27/50), while healthier grades 'a' and 'b' together cover only 6 rows. Entropy ratio of 0.74 confirms the imbalance, and the lone 'unknown' row signals a missing-data sentinel mixed in with the real grades. Treatment: Treat as ordered categorical (a < b < c < d < e). high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 6
top_value: e
top_rate: 0.54
cardinality: 6
entropy: 1.913
entropy_ratio: 0.7399
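
An ordinal encoding that also handles the 'unknown' sentinel might be sketched as follows. The counts come from the nutriscore_grade table earlier in this report; `GRADE_ORDER` and `encode_grade` are illustrative names, and mapping the sentinel to `None` is one possible choice.

```python
# Ordered scale: 'a' is the best grade and 'e' the worst.
GRADE_ORDER = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}

def encode_grade(grade):
    """Map a Nutri-Score letter to its rank; sentinels such as 'unknown' become None."""
    if grade is None:
        return None
    return GRADE_ORDER.get(grade.strip().lower())

# Counts taken from the nutriscore_grade table in this report.
counts = {"e": 27, "d": 9, "c": 7, "a": 4, "b": 2, "unknown": 1}
grades = [g for g, n in counts.items() for _ in range(n)]
encoded = [encode_grade(g) for g in grades]
missing = encoded.count(None)
print(len(encoded), missing)
```

Keeping the sentinel as a true missing value avoids treating 'unknown' as a sixth rung on the scale.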

product_quantity_unit

categorical metadata imbalance

Unit of measure for product quantities, taking only 'g' or 'ml'. The distribution is severely imbalanced: 'g' covers 44 of 45 non-null rows (top_rate 0.978) while 'ml' appears just once, and 10% of values are null. Entropy ratio of 0.154 confirms the column carries almost no information as-is. Treatment: Likely drop or collapse to a binary indicator; near-constant with one rare 'ml' case. high · anthropic:claude-opus-4-7

n: 50
nulls: 5 (10.0%)
unique: 2
top_value: g
top_rate: 0.9778
cardinality: 2
entropy: 0.1537
entropy_ratio: 0.1537

ingredients_analysis_tags

unknown metadata skipped

This column is labelled ingredients_analysis_tags, suggesting it carries categorical or list-valued tags from an ingredient analysis pipeline. Saturn skipped profiling, so no uniqueness, frequency, or value statistics are available beyond a 50-row sample with zero nulls. Without further stats, neither cardinality nor structure (scalar vs. list) can be confirmed. Treatment: Re-profile with list/tag-aware parsing before deciding to one-hot encode or drop. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text_with_allergens_fr

categorical free_text long_tail

French-language ingredient lists with embedded HTML markup highlighting allergens. Near-unique: 47 distinct values across the 48 non-null rows (entropy ratio 0.998), with 4% nulls and the empty string as the modal value at 2 occurrences. Content varies wildly in length and formatting, mixing prose, percentages, and tagged allergen tokens. Treatment: Strip HTML tags, parse allergen spans into a structured list, then tokenize the remaining text for NLP. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 47
top_value
top_rate: 0.04167
cardinality: 47
entropy: 5.543
entropy_ratio: 0.998

interface_version_modified

categorical metadata

A categorical column capturing an interface version stamp, with only 2 distinct values across 50 rows and no nulls. The distribution is heavily skewed: '20150316.jqm2' covers 84% (42 rows) while '20190830' accounts for the remaining 8 rows. The mixed format (one value carries a '.jqm2' suffix, the other is a bare date) suggests a schema or convention change between releases. Treatment: Treat as a binary version flag; one-hot encode or collapse to pre/post-change indicator. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 2
top_value: 20150316.jqm2
top_rate: 0.84
cardinality: 2
entropy: 0.6343
entropy_ratio: 0.6343

data_sources_tags

unknown other skipped

The column 'data_sources_tags' was skipped by the profiler, so its kind, uniqueness, and value distribution are all unknown. The only confirmed signals are 50 rows with no nulls. Without parsed stats, the name suggests a multi-valued tag field (e.g., a list or delimited string of source labels), but this cannot be verified from the evidence. Treatment: Manually inspect a sample to confirm structure, then explode tags into a multi-hot encoding before modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text_with_allergens_en

categorical free_text long_tail

This column holds English ingredient lists with embedded HTML markup highlighting allergens like wheat, milk, soy, and nuts. With 36 unique values across 50 rows (entropy ratio 0.95) and a 16% null rate, it's near-unique free text; the top 'value' is actually the empty string (7 occurrences), and one row is junk ('Hhhhh'). The HTML tags and inconsistent casing/punctuation mean it needs cleaning before any allergen extraction. Treatment: Strip HTML, normalize case, and parse allergen spans into a structured multi-label feature before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 8 (16.0%)
unique: 36
top_value
top_rate: 0.1667
cardinality: 36
entropy: 4.924
entropy_ratio: 0.9525

removed_countries_tags

unknown other skipped

Column `removed_countries_tags` was skipped by the profiler, so no type, uniqueness, or distribution stats are available. The only facts on hand are 50 rows with a 0.0 null rate. The name suggests a list of country tags that were removed (likely a multi-valued tag field from an Open Food Facts-style schema), but this cannot be confirmed from the evidence. Treatment: Re-profile with list/string parsing enabled before deciding whether to keep, explode, or drop. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

amino_acids_prev_tags

unknown other skipped

This column was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds prior tag values associated with amino acid annotations, likely a list-like or structured field that the dissector could not parse. Nothing else can be inferred without re-profiling. Treatment: Re-profile with a parser that handles its container type before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

code

categorical identifier long_tail

This column holds 50 unique numeric strings of varying length (8 to 13 digits), almost certainly product barcodes (EAN/UPC/GTIN). Every one of 50 rows is unique with no nulls, giving maximum entropy (entropy_ratio 1.0) and a top_rate of just 0.02 — it functions as a row identifier rather than a feature. The long_tail alert simply reflects that uniqueness. Treatment: Use as a join key; drop from any model as it carries no predictive signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 50
top_value: 6111242100992
top_rate: 0.02
cardinality: 50
entropy: 5.644
entropy_ratio: 1

correctors

unknown other skipped

The column 'correctors' was skipped by the profiler, so its kind, uniqueness, and value distribution are unknown. Only the row count (50) and a null rate of 0.0 are reported; no other statistics are available to characterize content. Treatment: Re-profile or inspect manually before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

generic_name_ja

categorical metadata long_tail null_rate imbalance

Likely a Japanese generic-name field (generic_name_ja), but it carries essentially no information in this sample: 98% of 50 rows are null and the single non-null value is an empty string, giving cardinality 1 and entropy 0. Treatment: Drop from modelling; retain only if needed for display lookups. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_fr

categorical free_text long_tail

French-language generic product names, almost certainly from an Open Food Facts-style food catalogue. Cardinality is high (34 unique across 50 rows, entropy ratio 0.87) and most values are one-off descriptors like 'Chocolat noir extra-fin traditionnel à 90% de cacao'. The dominant 'value' is actually the empty string at 29.8% of non-null rows, on top of a 6% null rate, so effectively over a third of records carry no usable label. Treatment: Treat empty strings as missing, then tokenize/embed for any modelling rather than one-hot encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 3 (6.0%)
unique: 34
top_value
top_rate: 0.2979
cardinality: 34
entropy: 4.42
entropy_ratio: 0.8689
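
The "over a third" figure combines nulls and empty strings. As a rough check, the sketch below rebuilds the column's shape from the stats above (top_rate 0.2979 × 47 non-null rows ≈ 14 empty strings) and counts both kinds of blank; the populated names are placeholders and `effective_missing_rate` is a hypothetical helper.

```python
def effective_missing_rate(values):
    """Share of rows that are null, empty, or whitespace-only."""
    missing = sum(1 for v in values if v is None or not v.strip())
    return missing / len(values)

# Shape reconstructed from the reported stats: 3 nulls, 14 empty strings,
# 33 populated names (placeholder text, not real values).
column = [None] * 3 + [""] * 14 + [f"name {i}" for i in range(33)]
rate = effective_missing_rate(column)
print(rate)
```

The same check applies to the other localized name and text columns, where the empty string is routinely the top value.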

generic_name_pl

categorical metadata null_rate

Polish-language generic product name field, populated for only 5 of 50 rows (90% null) and containing just 2 distinct values where 4 of the 5 non-nulls are empty strings. Effectively a single real value ('Wyśmienita czkolada gorzka 70% kakao'), making the column unusable as a feature. Treatment: Drop; null rate 0.9 and only one meaningful value. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 2
top_value
top_rate: 0.8
cardinality: 2
entropy: 0.7219
entropy_ratio: 0.7219

amino_acids_tags

unknown other skipped

This column is labelled amino_acids_tags, suggesting it would hold tags describing amino acid composition (likely a list-valued food annotation field). Saturn skipped profiling, so no uniqueness, cardinality, or value statistics are available beyond an n of 50 and a 0.0 null rate. Nothing further can be inferred without re-profiling. Treatment: Re-profile after parsing as a list/tag field before deciding on encoding. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_debug

unknown metadata skipped

Column 'ingredients_debug' was skipped by the profiler, so no type, uniqueness, or distribution stats are available. Only the row count (50) and null rate (0.0) are known; everything else is missing. The name suggests it is a debug/auxiliary field rather than a modelling input. Treatment: Drop from modelling; retain only if needed for debugging upstream pipelines. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ingredients_text_with_allergens_ja

categorical free_text long_tail null_rate imbalance

Japanese-language ingredient text with allergen markup, almost entirely absent from this sample. 98% of the 50 rows are null, and the only non-null value observed is an empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; the column carries no information in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

data_quality_info_tags

unknown other skipped

This column, data_quality_info_tags, was skipped by the profiler so its type and contents remain uncharacterised. The only signals available are 50 rows with no nulls; uniqueness, value distribution, and data kind are all unknown. Treatment: Inspect raw values manually to determine type before deciding on handling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

last_edit_dates_tags

unknown other skipped

This column was skipped by the profiler, so no type, uniqueness, or distribution stats are available beyond a row count of 50 with no nulls. The name suggests it holds last-edit dates paired with tags, likely a composite or nested field that the dissector could not parse. Without further evidence, its structure and content cannot be characterised. Treatment: Inspect raw values and parse into separate date and tag fields before use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

last_modified_by

categorical metadata long_tail

This column records the user or app that last modified each record, dominated by the bot/account 'foodless' which accounts for 21 of 49 non-null entries (top_rate 0.43). With 24 unique values across 50 rows and entropy_ratio 0.77, there's a long tail of mostly singleton contributors alongside a handful of app-like editors (municorn-calorie-counter-app, macrofactor). Null rate is low at 0.02. Treatment: Keep as audit metadata; if used as a feature, collapse the long tail into 'other' and flag bot-vs-human editors. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 24
top_value: foodless
top_rate: 0.4286
cardinality: 24
entropy: 3.513
entropy_ratio: 0.7662

no_nutrition_data

categorical feature imbalance

A flag column indicating products lacking nutrition data, but it carries no information here: the only observed value is the empty string, present in all 48 non-null rows (top_rate 1.0, cardinality 1, entropy 0.0). The other 4% of rows are null, so nothing distinguishes one record from another. Treatment: Drop; constant column with zero entropy. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

nutriscore

unknown other skipped

The column is named "nutriscore" but saturn skipped profiling it (kind="unknown"), so no type, uniqueness, or distribution stats are available. All 50 rows are non-null, but nothing else can be confirmed from the evidence. Treatment: Manually inspect and cast to a known type before any downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

origin_nb

categorical metadata null_rate imbalance

The column 'origin_nb' is effectively empty: 96% of the 50 rows are null and the only non-null value observed is the empty string, which appears twice. Cardinality is 1 with zero entropy, so the field carries no information in this sample. Treatment: Drop; the column is 96% null with a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origins

categorical free_text long_tail

Free-text origin/provenance strings for ingredients or products, with 20 unique values across 50 rows and a 4% null rate. The dominant value is the empty string, at 24 of the 48 non-null rows (top_rate 0.5), so half the populated column is effectively blank rather than missing. The remainder is messy: language mix (France vs Maroc vs Morocco), comma-delimited multi-origin lists, and 'en:'-prefixed taxonomy tags like 'en:Madagarcar vanilla' (note the typo); this is clearly not a clean categorical. Treatment: Treat empty strings as null, normalise synonyms (Maroc/Morocco), and split on comma into a multi-label set before any encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 2 (4.0%)
unique: 20
top_value
top_rate: 0.5
cardinality: 20
entropy: 3.027
entropy_ratio: 0.7003
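
The three clean-up steps in the treatment can be combined in one pass. This is a sketch under assumptions: the synonym map is hypothetical and deliberately tiny, and dropping the 'en:' prefix is one possible way to reconcile taxonomy tags with plain labels.

```python
# Hypothetical synonym map; extend it after inspecting the real values.
SYNONYMS = {"maroc": "morocco"}

def parse_origins(raw):
    """Return a set of normalised origin labels, or None when the field is blank."""
    if raw is None or not raw.strip():
        return None
    labels = set()
    for part in raw.split(","):
        label = part.strip().lower()
        if label.startswith("en:"):  # drop the taxonomy language prefix
            label = label[3:]
        if label:
            labels.add(SYNONYMS.get(label, label))
    return labels or None

print(parse_origins("France, Maroc"))
print(parse_origins("en:Madagarcar vanilla"))
print(parse_origins(""))
```

Misspellings such as 'Madagarcar' pass through unchanged, so they still need an explicit entry in the synonym map or a manual fix upstream.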

nova_groups_tags

unknown metadata skipped

This column is named nova_groups_tags, suggesting it carries NOVA food-classification group tags (a 1-4 processing-level scheme used in nutrition datasets). However, saturn skipped profiling it, so no type, cardinality, or value statistics are available beyond a 50-row sample with zero nulls. Without inferred kind or unique counts, the actual content and format remain unverified. Treatment: Manually inspect raw values to confirm format, then one-hot encode the small set of NOVA group tags. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

languages

unknown other skipped

The column is named 'languages' but saturn skipped profiling, so type and distribution are unknown. With 50 rows and no nulls, every record carries some value, yet n_unique and other stats are unavailable. The name suggests a list-like field (e.g., languages spoken or supported), which would explain why the dissector flagged it as unknown. Treatment: Inspect raw values and parse (likely explode list-typed entries) before further profiling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nutriscore_2023_tags

unknown other skipped

This column is flagged as skipped by the profiler, so no descriptive statistics, uniqueness, or value samples are available beyond a row count of 50 and a 0.0 null rate. The name suggests it holds Nutri-Score 2023 classification tags (likely a categorical label such as a, b, c, d, e), but that interpretation cannot be verified from the evidence provided. No distributional or quality signals can be assessed here. Treatment: Re-run profiling with this column included to determine its type and cardinality before use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

packaging_materials_tags

unknown free_text skipped

This column appears to be a tags field listing packaging materials, likely a delimited or list-valued string per row. Saturn skipped profiling, so no uniqueness, cardinality, or value-frequency stats are available beyond a 50-row sample with zero nulls. Without parsed token statistics, the actual material vocabulary and its distribution remain unknown. Treatment: split on the tag delimiter and one-hot or multi-hot encode the resulting material tokens before modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

lang

categorical feature

This is a language code column with 5 distinct values and no nulls across 50 rows. The distribution is heavily dominated by 'fr' at 70% (35/50), with 'en' a distant second at 10 occurrences and 'de', 'bg', 'ro' appearing only 1-3 times each. Entropy ratio of 0.56 confirms the imbalance, and the long tail of rare languages (bg, ro with single observations) may be unstable for any per-language modelling. Treatment: One-hot encode with rare languages (bg, ro, de) collapsed into an 'other' bucket. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 5
top_value: fr
top_rate: 0.7
cardinality: 5
entropy: 1.294
entropy_ratio: 0.5572
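The entropy figures can be reproduced by hand. The sketch below assumes the profiler reports Shannon entropy in bits and divides by log2 of the cardinality to get the ratio; the counts for 'de', 'bg' and 'ro' (3, 1, 1) are inferred from the note above, since they must sum to the 5 rows left after 'fr' and 'en'. `collapse_rare` is a hypothetical helper for the suggested 'other' bucket.

```python
import math

def entropy_bits(counts):
    """Shannon entropy in bits for a collection of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# fr and en come from the note; de/bg/ro are inferred so that the total is 50.
lang_counts = {"fr": 35, "en": 10, "de": 3, "bg": 1, "ro": 1}

entropy = entropy_bits(lang_counts.values())
entropy_ratio = entropy / math.log2(len(lang_counts))

def collapse_rare(value, keep=("fr", "en")):
    """Fold every language outside `keep` into a single 'other' bucket."""
    return value if value in keep else "other"

print(round(entropy, 3), round(entropy_ratio, 4))
```

Under these assumptions the result agrees with the reported entropy of 1.294 and ratio of 0.5572, which also supports the inferred tail counts.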

packaging_text_sv

categorical free_text null_rate imbalance

Swedish packaging text field that is effectively empty: 92% of the 50 rows are null and the remaining 4 non-null values are all the empty string, giving a single observed category with entropy 0. There is no usable signal here. Treatment: Drop; column is 92% null and the only non-null value is an empty string. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
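Columns like this one recur throughout the report: nulls plus empty strings and nothing else. A minimal sketch for finding them in bulk follows, assuming rows are loaded as a list of dicts; both function names and the sample rows are hypothetical.

```python
def has_usable_values(rows, column):
    """True if any row holds something other than None or a blank string."""
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        if isinstance(value, str) and value.strip() == "":
            continue
        return True
    return False

def droppable_columns(rows):
    """Names of columns whose values are all missing or blank, in sorted order."""
    columns = sorted({key for row in rows for key in row})
    return [c for c in columns if not has_usable_values(rows, c)]

# Invented rows mimicking the pattern described above.
sample_rows = [
    {"lang": "fr", "packaging_text_sv": None},
    {"lang": "en", "packaging_text_sv": ""},
    {"lang": "fr"},
]

print(droppable_columns(sample_rows))
```

Treating blank strings the same as nulls is the key design choice here; without it, a column like this reports a non-null value and escapes a pure null-rate filter.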

photographers

unknown other skipped

This column 'photographers' was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds photographer attributions, possibly as a list or nested structure that the dissector could not parse. Without unique counts or sample values, nothing further can be inferred. Treatment: Inspect raw values manually to determine structure before deciding whether to parse, explode, or drop. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

languages_codes

unknown other skipped

This column is named languages_codes but saturn skipped detailed profiling, leaving kind as unknown with no uniqueness or value statistics. The only confirmed signals are 50 rows and a 0.0 null rate. Without sample values or cardinality, the structure (single code, list, or delimited string) cannot be determined from the evidence. Treatment: Re-profile with parsing enabled to determine whether values are scalar codes or lists before use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

ecoscore_grade

categorical label

This is the Eco-Score grade, a categorical environmental rating with letter tiers from a-plus through f plus sentinel values 'unknown' and 'not-applicable'. Distribution skews toward worse grades: 'e' leads at 12/50 (top_rate 0.24), followed by 'd' (9), while 'a' and 'a-plus' together account for only 5 rows. Six rows are 'unknown' and one 'not-applicable', so roughly 14% of values are non-informative sentinels that need handling. Treatment: Map sentinels ('unknown','not-applicable') to NA and treat the remaining tiers as an ordinal factor. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 9
top_value: e
top_rate: 0.24
cardinality: 9
entropy: 2.808
entropy_ratio: 0.8857
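The suggested treatment can be sketched as a small mapping. The tier labels between 'a-plus' and 'f' are an assumption based on the description above (seven tiers plus two sentinels matches the reported cardinality of 9); `ecoscore_rank` is a hypothetical helper.

```python
# Assumed tier labels, best to worst; verify against the raw values before use.
ECOSCORE_ORDER = ["a-plus", "a", "b", "c", "d", "e", "f"]
SENTINELS = {"unknown", "not-applicable"}

def ecoscore_rank(grade):
    """Return an ordinal rank (0 = best) or None for missing/sentinel grades."""
    if grade is None or grade in SENTINELS:
        return None
    # list.index raises ValueError on an unexpected label, which surfaces
    # bad data instead of silently ranking it.
    return ECOSCORE_ORDER.index(grade)

ranks = [ecoscore_rank(g) for g in ["a-plus", "e", "unknown", "not-applicable"]]
print(ranks)
```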

ingredients_n

numeric feature

Numeric count of ingredients per record, ranging from 1 to 39 with a median of 9 and mean of 11.7. The distribution is right-skewed (skew 1.24, kurtosis 1.44) with a wide IQR of 11 and 2 outliers (4%) on the high end. No nulls or zeros, and 22 unique values across 50 rows suggest a discrete count variable. Treatment: Consider log or sqrt transform before modelling to tame the right skew. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 22
min: 1
max: 39
mean: 11.7
median: 9
std: 8.244
q1: 5
q3: 16
iqr: 11
skew: 1.237
kurtosis: 1.435
n_outliers: 2
outlier_rate: 0.04
zero_rate: 0
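As a sanity check on the outlier count, the quartiles above can be turned into fences. This assumes the profiler uses the common 1.5 × IQR (Tukey) rule, which the report does not state; `tukey_fences` and `log1p_count` are hypothetical helpers.

```python
import math

def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# q1 and q3 as reported for ingredients_n.
lower, upper = tukey_fences(q1=5, q3=16)

def log1p_count(n):
    """log(1 + n): a skew-reducing transform that stays defined at n = 0."""
    return math.log1p(n)

print(lower, upper, round(log1p_count(39), 3))
```

Under that rule the upper fence is 32.5, so the reported maximum of 39 falls outside it, consistent with outliers only on the high end.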

allergens

categorical feature

Categorical allergen tags using an Open Food Facts-style 'en:' prefix, often combined as comma-separated lists (e.g., 'en:gluten,en:milk,en:soybeans'). The most common value is an empty string at 32% (16/50), suggesting missing or no-allergen records encoded as blanks rather than nulls. With 16 unique values across 50 rows and entropy ratio 0.84, the distribution is fairly spread; gluten, milk, and soybeans dominate the non-empty tags. Treatment: Split on comma and multi-hot encode allergen tags; treat empty string as missing. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 16
top_value
top_rate: 0.32
cardinality: 16
entropy: 3.364
entropy_ratio: 0.8411
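A minimal sketch of the split-and-encode treatment, using the comma-separated format quoted in the note. `parse_allergens` and `multi_hot` are hypothetical helpers; blank strings are mapped to missing, as the treatment suggests, although a blank could also mean "no allergens declared".

```python
def parse_allergens(raw):
    """Split a comma-separated tag string; blank or None becomes None (missing)."""
    if raw is None or raw.strip() == "":
        return None
    return [tag.strip() for tag in raw.split(",") if tag.strip()]

def multi_hot(tag_lists):
    """Encode each parsed row against the sorted vocabulary of observed tags."""
    vocabulary = sorted({tag for tags in tag_lists if tags for tag in tags})
    encoded = []
    for tags in tag_lists:
        if tags is None:
            encoded.append(None)  # keep missing rows distinct from all-zero rows
        else:
            encoded.append([1 if tag in tags else 0 for tag in vocabulary])
    return vocabulary, encoded

raw_values = ["en:gluten,en:milk,en:soybeans", "", "en:milk"]
vocabulary, encoded = multi_hot([parse_allergens(v) for v in raw_values])
print(vocabulary, encoded)
```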

minerals_tags

unknown other skipped

This column was skipped by the profiler, so no statistics are available beyond a row count of 50 and a 0.0 null rate. The name suggests it holds tag-style annotations for minerals (likely a list or delimited string per row), but uniqueness, cardinality, and value distribution are all unknown here. Treatment: Re-profile with a parser appropriate for list/tag fields before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

product_name

categorical free_text long_tail

Free-text product name field with 49 unique values across 50 rows, near-maximal entropy ratio of 0.998 — effectively a per-row label. Values mix languages (French, English, Cyrillic) and formats (brand-only like 'Henry's' versus full descriptors like 'CRISTALINE Eau De Source 0.5L'), and one row is an empty string despite a reported null_rate of 0.0. The single repeat ('Henry's', 2) is the only signal preventing full uniqueness. Treatment: Normalize casing and empty strings, then tokenize/embed rather than one-hot encode. high · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: 49
top_value: Henry’s
top_rate: 0.04
cardinality: 49
entropy: 5.604
entropy_ratio: 0.9981

purchase_places

categorical free_text long_tail

Free-form purchase location strings, often listing multiple places per row separated by commas. France is the most common value at 9 of the 49 non-null rows (18.4%), followed by an empty string (6) and Maroc (5), but with 32 unique values across 50 rows and entropy ratio 0.90, the long tail includes multi-country concatenations like 'Madrid,España,Montargis,France,Würzburg,Deutschland,...'. Mixed languages (Maroc vs Morocco, España vs Spain) and embedded postal codes signal inconsistent data entry rather than a clean categorical. Treatment: Split on commas and normalize each token to a canonical country before using as a multi-label feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 32
top_value: France
top_rate: 0.1837
cardinality: 32
entropy: 4.479
entropy_ratio: 0.8958
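The split-and-normalize treatment might look like the sketch below. The lookup table is hypothetical and covers only the spellings quoted in the note; tokens it does not recognise (cities, postal codes) are dropped rather than guessed.

```python
# Hypothetical lookup; extend it from the tokens actually observed in the data.
CANONICAL_COUNTRY = {
    "france": "France",
    "maroc": "Morocco",
    "morocco": "Morocco",
    "españa": "Spain",
    "spain": "Spain",
    "deutschland": "Germany",
}

def countries_from_purchase_places(raw):
    """Return the sorted canonical countries named in a comma-separated string."""
    if raw is None:
        return []
    found = set()
    for token in raw.split(","):
        key = token.strip().casefold()
        if key in CANONICAL_COUNTRY:
            found.add(CANONICAL_COUNTRY[key])
    return sorted(found)

example = "Madrid,España,Montargis,France,Würzburg,Deutschland"
print(countries_from_purchase_places(example))
```

Returning a sorted list of unique countries makes the output usable directly as a multi-label feature, at the cost of discarding city-level detail.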

quantity

categorical feature long_tail

This column records product quantities as free-text strings, dominated by gram weights but with no consistent format: '100 g', '100g', and '100 gram' all appear separately among the top values. With 36 unique values across 50 rows and entropy ratio 0.959, the field is highly fragmented; the most common value '100 g' covers only 12.2% of non-nulls (6 of 49), one row (2%) is null, and 2 more are empty strings. The long_tail alert reflects this unit/spacing inconsistency rather than genuine variety. Treatment: Normalize units and parse into a numeric grams column before use. high · anthropic:claude-opus-4-7

n: 50
nulls: 1 (2.0%)
unique: 36
top_value: 100 g
top_rate: 0.1224
cardinality: 36
entropy: 4.956
entropy_ratio: 0.9587
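A minimal parsing sketch for the gram-weight variants named above. The accepted unit spellings are an assumption beyond the three quoted forms, and anything the pattern does not recognise (volumes, multipacks, blanks) returns None rather than a guess; `quantity_to_grams` is a hypothetical helper.

```python
import re

# Number (dot or comma decimal), optional space, then a weight unit.
_QUANTITY_RE = re.compile(
    r"^\s*(\d+(?:[.,]\d+)?)\s*(kg|g|gr|gram|grams)\s*$", re.IGNORECASE
)
_TO_GRAMS = {"kg": 1000.0, "g": 1.0, "gr": 1.0, "gram": 1.0, "grams": 1.0}

def quantity_to_grams(raw):
    """Parse strings like '100 g' or '1,5 kg' to grams; None if not a weight."""
    if raw is None:
        return None
    match = _QUANTITY_RE.match(raw)
    if match is None:
        return None
    number = float(match.group(1).replace(",", "."))
    return number * _TO_GRAMS[match.group(2).lower()]

for text in ["100 g", "100g", "100 gram", "1,5 kg", "0.5L", ""]:
    print(repr(text), quantity_to_grams(text))
```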

traces_tags

unknown other skipped

The column `traces_tags` was skipped by the profiler, so no type, uniqueness, or distribution statistics are available. The only confirmed signals are 50 rows present and a null rate of 0.0, meaning every row has some value but its content and structure are unknown. The name suggests it may hold tags for trace allergens ('may contain' declarations), possibly as a nested or list-typed field that the profiler could not parse. Treatment: Inspect raw values manually to determine structure before deciding whether to parse, explode, or drop. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

origin_uk

categorical feature long_tail null_rate imbalance

The `_uk` suffix most likely marks this as the Ukrainian-language variant of the origin field (`uk` is the ISO 639-1 code for Ukrainian, matching the other per-language columns) rather than a United Kingdom flag. Either way it carries virtually no signal: 98% of the 50 rows are null and the only non-null value observed is an empty string. With cardinality of 1 and entropy of 0, the column has no discriminative power as it stands. Treatment: Drop; column is 98% null with a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_ar

categorical metadata null_rate

Arabic generic-name field that is overwhelmingly empty: 80% of the 50 rows are null and of the 10 populated rows, 9 are blank strings and only 1 carries an actual Arabic value (الامير). With cardinality of just 2 and a top_rate of 0.9 on the empty string, this column carries almost no information as currently captured. Treatment: Drop or defer until source data is backfilled; not usable as-is. high · anthropic:claude-opus-4-7

n: 50
nulls: 40 (80.0%)
unique: 2
top_value
top_rate: 0.9
cardinality: 2
entropy: 0.469
entropy_ratio: 0.469

packaging_text_uk

categorical metadata long_tail null_rate imbalance

This column appears to be Ukrainian packaging text, but it is effectively empty: 98% of the 50 rows are null and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so it carries no information. Treatment: Drop; the column has no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_ar

categorical free_text null_rate

Arabic-language ingredients text, populated for only 11 of 50 rows (null_rate 0.78) and with 10 of those 11 non-null entries being empty strings. Only one row carries an actual Arabic ingredient list, giving cardinality 2 and a top_rate of 0.91 on the empty string. Effectively unusable as a feature on this sample. Treatment: Drop for modelling; retain only if you specifically need Arabic ingredient parsing and can source more populated rows. high · anthropic:claude-opus-4-7

n: 50
nulls: 39 (78.0%)
unique: 2
top_value
top_rate: 0.9091
cardinality: 2
entropy: 0.4395
entropy_ratio: 0.4395

ingredients_text_uk

categorical free_text long_tail null_rate imbalance

Ukrainian-language ingredients text, almost entirely absent in this sample. 98% of rows are null and the single non-null value is an empty string, leaving zero usable content. Entropy is 0 and cardinality is 1, so the column carries no signal here. Treatment: Drop from this slice; revisit only if a Ukrainian-locale subset is loaded. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

last_check_dates_tags

unknown other skipped

This column was skipped by the profiler, so no type, uniqueness, or distribution statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it stores tags associated with last-check dates, possibly a list or composite field that the dissector could not parse. Without further stats, nothing can be said about its values. Treatment: Inspect raw values manually and re-profile after parsing into a structured type. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

checked

categorical feature null_rate imbalance

This looks like a checkbox-style flag (likely from a web form), where the only observed value is "on" in 7 of 50 rows. The remaining 86% are null, and entropy is 0.0 because there is no variation among the non-null entries. With cardinality of 1, the column carries no discriminative signal as captured. Treatment: Convert to a boolean (on vs null) or drop, since it has only one observed value. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 1
top_value: on
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_ar

categorical free_text null_rate imbalance

This appears to be Arabic-language packaging text, but it carries no information in this sample: 80% of the 50 rows are null and the remaining 10 values are all empty strings, giving cardinality 1 and entropy 0. There is nothing to model or join on here. Treatment: Drop; the column is effectively constant-empty with 80% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 40 (80.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

carbon_footprint_percent_of_known_ingredients

numeric feature null_rate

Numeric coverage metric indicating the share of an item's known ingredients that have a carbon footprint estimate, ranging from 8.0 to 105.0 with a median of 70.0. The 62% null rate is the dominant signal: only 19 of the 50 rows carry a value (all 19 distinct), so most records lack any coverage figure at all. The max of 105.0 is mildly surprising for what reads like a percentage, and the distribution is slightly left-skewed (skew -0.45) with no flagged outliers. Treatment: Impute or add a missingness indicator before modelling, and verify whether values above 100 are valid. medium · anthropic:claude-opus-4-7

n: 50
nulls: 31 (62.0%)
unique: 19
min: 8
max: 105
mean: 61.79
median: 70
std: 28.98
q1: 45.5
q3: 78.3
iqr: 32.8
skew: -0.4493
kurtosis: -0.8083
n_outliers: 0
outlier_rate: 0
zero_rate: 0
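One way to follow the treatment is to pair a median fill with an explicit missingness flag and to list values above 100 for review. This is a sketch only: median imputation is one choice among several, both helpers are hypothetical, and the sample list is invented apart from reusing the reported min, median and max.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill None with the observed median and return a parallel 0/1 missing flag."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    missing = [1 if v is None else 0 for v in values]
    return filled, missing

def values_over_100(values):
    """Non-null values that exceed 100 and need manual validation."""
    return [v for v in values if v is not None and v > 100]

sample = [8.0, None, 70.0, 105.0]  # invented rows
filled, missing = impute_with_indicator(sample)
print(filled, missing, values_over_100(sample))
```

With 62% of rows missing, the indicator column is likely to carry more information than the imputed values themselves.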

last_checker

categorical metadata null_rate

This looks like the username of the last reviewer/checker on a record, with only 4 distinct values across 50 rows. The column is 86% null, so just 7 rows carry a value, and 'aleene' accounts for 3 of those (top_rate 0.43). Entropy ratio of 0.92 indicates the few present values are spread fairly evenly across the small handful of checkers. Treatment: Treat missingness as a 'never checked' category; too sparse to use as a model feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 4
top_value: aleene
top_rate: 0.4286
cardinality: 4
entropy: 1.842
entropy_ratio: 0.9212

product_name_uk

categorical metadata long_tail null_rate imbalance

This appears to be a Ukrainian-language product name field that is effectively empty: 98% of the 50 rows are null and the single non-null value is itself an empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here whatsoever. Treatment: Drop the column; it carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_uk

categorical metadata long_tail null_rate imbalance

This appears to be the Ukrainian-language generic product name field (`uk` is the ISO 639-1 code for Ukrainian, not a United Kingdom marker), but it is effectively empty in this sample: 98% of the 50 rows are null and the only non-null value is an empty string. Cardinality is 1 with zero entropy, so the column carries no information here. Treatment: Drop; no usable signal at this null rate and cardinality. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_ar

categorical metadata long_tail null_rate

Arabic-language product name field that is mostly absent: 78% null and only 6 distinct values across 50 rows. The non-null entries are a language mix — one Arabic string (برنس) alongside Spanish and English names like 'Leche Y Almendras' and 'Chocolate Negro 92% Cacao' — suggesting the column is not consistently populated with Arabic translations. The most frequent observed value is an empty string (6 occurrences, 54.5% of non-nulls), indicating empties coexist with true nulls. Treatment: Drop or defer until translation coverage improves; normalise empty strings to null and validate language before use. high · anthropic:claude-opus-4-7

n: 50
nulls: 39 (78.0%)
unique: 6
top_value
top_rate: 0.5455
cardinality: 6
entropy: 2.049
entropy_ratio: 0.7928

carbon_footprint_from_known_ingredients_debug

categorical metadata long_tail null_rate

Debug trace string showing the per-ingredient carbon footprint computation (percentage × emission factor = grams) for each product. Every one of the 14 non-null values is unique (entropy_ratio ≈ 1.0, top_rate 0.07), and 72% of rows are null, so it functions as a verbose audit log rather than a feature. Treatment: Drop from modelling; retain only for auditing the carbon calculation. high · anthropic:claude-opus-4-7

n: 50
nulls: 36 (72.0%)
unique: 14
top_value: en:cereal 50% x 0.3 = 15 g -
top_rate: 0.07143
cardinality: 14
entropy: 3.807
entropy_ratio: 1

last_checked_t

numeric timestamp null_rate

Values are Unix epoch seconds ranging from 1540933974 to 1730226344, consistent with 'last checked' timestamps spanning roughly late 2018 to late 2024. Severe sparsity dominates: null_rate is 0.86 and only 7 unique values populate the 50 rows, so this column is barely usable as-is. Distribution is mildly right-skewed (skew 0.81) with no outliers flagged. Treatment: Convert from epoch seconds to datetime and treat as mostly-missing; impute or drop before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
min: 1.541e+09
max: 1.73e+09
mean: 1.607e+09
median: 1.565e+09
std: 7.772e+07
q1: 1.556e+09
q3: 1.652e+09
iqr: 9.601e+07
skew: 0.8106
kurtosis: -1.103
n_outliers: 0
outlier_rate: 0
zero_rate: 0
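The conversion suggested above can be sketched with the standard library. The two timestamps are the min and max quoted in the note; interpreting them in UTC is an assumption, and `epoch_to_utc` is a hypothetical helper.

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime (None stays None)."""
    if seconds is None:
        return None
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

first = epoch_to_utc(1540933974)  # reported minimum
last = epoch_to_utc(1730226344)   # reported maximum
print(first.date(), last.date())
```

Passing None through unchanged keeps the 86% of unchecked rows explicitly missing instead of mapping them to the epoch origin.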

ingredients_text_with_allergens_uk

categorical free_text long_tail null_rate imbalance

This appears to be the Ukrainian-language variant of the ingredients-with-allergens text field (`uk` is the ISO 639-1 code for Ukrainian), but it is effectively empty in this sample. 98% of the 50 rows are null, and the single non-null value is itself an empty string, giving cardinality 1 and zero entropy. There is no usable signal here. Treatment: Drop; no usable values in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_ar

categorical free_text null_rate

Arabic-language ingredients text (with allergen markup) for food products. The column is 82% null across just 50 rows, and of the 9 non-null entries 8 are empty strings — only 1 row carries an actual ingredient list. Effectively no usable signal at this sample size. Treatment: Drop for now; revisit only if a larger Arabic-localized sample becomes available. high · anthropic:claude-opus-4-7

n: 50
nulls: 41 (82.0%)
unique: 2
top_value
top_rate: 0.8889
cardinality: 2
entropy: 0.5033
entropy_ratio: 0.5033

origin_ar

categorical metadata null_rate imbalance

Given the `_ar` suffix, 'origin_ar' is most likely the Arabic-language variant of the origin field. It carries a single observed value (an empty string) across the 10 non-null rows, while 80% of records are null. With cardinality 1 and entropy 0, the column conveys no information in this sample. Treatment: Drop; the column is 80% null and constant on the remainder. high · anthropic:claude-opus-4-7

n: 50
nulls: 40 (80.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

nutriments_estimated

unknown other skipped

The column `nutriments_estimated` was skipped by the profiler, so no type, uniqueness, or distribution stats are available. The only facts on record are that all 50 sampled rows are non-null and the kind is unknown. Without further evidence, the content and structure of this field cannot be characterised. Treatment: Re-profile with an appropriate parser before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

nutrition_score_warning_no_fiber

numeric feature null_rate constant

This appears to be a binary warning flag indicating a missing-fiber condition in a nutrition score, encoded as 1 when triggered. Every one of the 15 non-null rows holds the value 1.0, and 70% of rows are null — consistent with a sparse flag that is only populated when the warning fires. With zero variance, it carries no discriminative signal as-is. Treatment: Recode nulls to 0 to convert into a usable binary indicator, or drop if still constant after recoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 35 (70.0%)
unique: 1
min: 1
max: 1
mean: 1
median: 1
std: 0
q1: 1
q3: 1
iqr: 0
skew: 0
kurtosis: 0
n_outliers: 0
outlier_rate: 0
zero_rate: 0
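The recode-then-check step can be sketched as follows, using the counts reported for this column (15 rows holding 1.0 and 35 nulls). The helper names are hypothetical, and reading null as "warning did not fire" is the assumption the note itself makes.

```python
def recode_flag(values):
    """Map None to 0 and any present value to 1."""
    return [0 if v is None else 1 for v in values]

def is_constant(values):
    """True if the sequence has at most one distinct value."""
    return len(set(values)) <= 1

# Rebuilt from the reported counts, not from the raw data.
raw = [1.0] * 15 + [None] * 35
flag = recode_flag(raw)

print(sum(flag), is_constant(flag))
```

On these counts the recoded flag is no longer constant (15 ones, 35 zeros), so it would survive the "drop if still constant" check.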

ingredients_text_debug_tags

unknown metadata skipped

This column, named `ingredients_text_debug_tags`, was skipped by the profiler so no distributional statistics are available. The name suggests it holds debugging tags emitted by an ingredients-text parser, likely a list or sparse string field. With 50 rows observed and a 0.0 null rate reported but no unique count or value stats, nothing further can be inferred from the evidence. Treatment: Inspect raw values manually; likely drop unless debugging the ingredients parser. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

taxonomies_enhancer_tags

unknown other skipped

The column 'taxonomies_enhancer_tags' was skipped by the profiler, so no type, uniqueness, or distributional statistics are available beyond a row count of 50 and a null rate of 0.0. Without kind detection or value samples, its content (likely some form of taxonomy/tag payload based on the name) cannot be verified from the evidence. Treatment: Re-profile with parsing enabled (likely a nested/list field) before deciding on use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

completed_t

numeric timestamp null_rate

Values are 10-digit integers ranging from 1628199203 to 1763195431, consistent with Unix epoch seconds spanning roughly 2021 through 2025 — almost certainly a 'completed at' timestamp. The 68% null rate is the dominant signal, suggesting most records were never completed. Distribution across the non-null 16 unique values is near-symmetric (skew ~0.001) with no outliers. Treatment: Convert from epoch seconds to datetime and treat nulls as 'not yet completed' rather than imputing. high · anthropic:claude-opus-4-7

n: 50
nulls: 34 (68.0%)
unique: 16
min: 1.628e+09
max: 1.763e+09
mean: 1.7e+09
median: 1.703e+09
std: 4.07e+07
q1: 1.663e+09
q3: 1.74e+09
iqr: 7.618e+07
skew: 0.001247
kurtosis: -1.155
n_outliers: 0
outlier_rate: 0
zero_rate: 0

product_name_bg

categorical metadata long_tail null_rate

This is a Bulgarian-language product name field, with values like 'Шоколад 85% какаова маса' indicating localized chocolate/cocoa product labels. It is 94% null across 50 rows, leaving only 3 non-null entries that are all unique. With so little populated data, this column carries almost no analytical signal in its current state. Treatment: Drop or defer until Bulgarian localization coverage improves; too sparse to use. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: Шоколад 85% какаова маса
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1

ingredients_text_et

categorical free_text long_tail null_rate

Free-text ingredient lists ostensibly tagged as Estonian (et), but the three observed values mix Slovenian, German, and Estonian, suggesting mislabeled locale tagging. The field is 94% null with only 3 non-null entries out of 50, so any signal here is anecdotal at best. Treatment: Drop or defer; too sparse and language-inconsistent to model without a language-detection cleanup pass. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: kakavova masa, manjmasten kakavov prah, kakavovo maslo, sladkor, emulgator: lecitini (_sojin_ lecitin); ekstrakt vanilije.
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1

origin_sl

categorical metadata long_tail null_rate imbalance

Given the `_sl` suffix, this is most likely the Slovenian-language variant of the origin field, but it is effectively empty in this sample. 98% of the 50 rows are null, and the single non-null value is itself a blank string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; the column carries no usable signal at this null rate. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_dz

categorical metadata long_tail null_rate imbalance

This appears to be a localized generic-name field; `dz` is the ISO 639-1 code for Dzongkha, though contributors may have used it to mean Algeria (country code DZ). It is effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no information. Treatment: Drop; no signal (98% null, single empty-string value). high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_sl

categorical free_text long_tail null_rate imbalance

Slovenian-language ingredients text, almost entirely empty: 98% null with only 1 non-null value across 50 rows. The single observed entry is a free-form ingredient declaration for a cocoa-based product (with a trailing 'may contain' and best-before note) rather than a value from a controlled vocabulary, so the categorical framing is misleading. Treatment: Drop for modelling; if needed, treat as free text and merge with other-language ingredient fields. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Kakavova masa, manjmasten kakavov prah, kakavovo maslo, sladkor, emulgator: lecitini (sojin lecitin); ekstrakt vanilije. Lahko vsebuje sledi oreškov (lešniki, mandlji, pistacija) in mleka. Uporabno najmanj do: glej odtis na zadnji strani embalaže.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_ca

categorical metadata null_rate imbalance

This appears to be a Catalan-language generic product name field, but it is effectively empty: 96% of rows are null and the only non-null value observed is the empty string (2 occurrences). Cardinality is 1 with zero entropy, so the column carries no information in this sample. Treatment: Drop; the column is 96% null with a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_dz

categorical free_text long_tail null_rate imbalance

This appears to be a Dzongkha-language ingredients text field, likely a localized variant of a multilingual product description column. Out of 50 rows, 98% are null and the single non-null value is an empty string, leaving zero usable content. Treatment: Drop; effectively empty in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_ca

categorical metadata null_rate imbalance

This appears to be a Catalan-language product name field, but it is effectively empty: 96% of the 50 rows are null and the only 2 non-null values are blank strings, giving a single observed category with entropy 0. There is no usable signal here. Treatment: Drop the column; it is 96% null with no distinct values. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_ca

categorical feature null_rate imbalance

This is most likely the Catalan-language variant of the origin field (`ca` is the ISO 639-1 code for Catalan, consistent with generic_name_ca and product_name_ca) rather than a Canadian-origin flag, and it is effectively empty: 96% of the 50 rows are null, and the only 2 non-null values are both blank strings. With cardinality of 1 and entropy of 0, the column carries no information. Treatment: Drop; the column is 96% null and the remaining values are empty strings. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_et

categorical metadata long_tail null_rate

Estonian-localized product name field, but 94% of the 50 rows are null and only 3 distinct values appear among the remainder — including one empty string and two names that are actually French and English, not Estonian. With each surviving value occurring exactly once (entropy_ratio 1.0), this column carries almost no usable signal and shows a language-tagging mismatch. Treatment: Drop from modelling; revisit upstream localization pipeline since values aren't in Estonian. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: Chocolat noir - 85% cacao
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1

ingredients_text_with_allergens_bg

categorical free_text long_tail null_rate

Bulgarian-language ingredient lists with inline HTML allergen markup, localised for the bg market. Coverage is extremely thin: 94% null and only 3 distinct values across 50 rows, one of which is an empty string. The two real entries are confectionery ingredient declarations mentioning soy, hazelnuts and milk allergens. Treatment: Strip the HTML tags and treat as free text; too sparse (94% null) to use as a feature without aggregation across locales. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: Какаова маса, нискомаслено какао на прах, какаово масло, захар, емулгатор: лецитин (соеви), екстракт от ванилия, Може да съдържа следи от ядки и мляко,
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1
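Stripping the markup can be sketched with a simple tag pattern. The exact markup is not visible in this report, so the `<span class="allergen">` wrapper in the sample is an assumption, and the regex approach suits only flat inline tags like this, not arbitrary HTML. `strip_tags` is a hypothetical helper.

```python
import re

# Matches any single tag such as <span ...> or </span>.
_TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(text):
    """Remove inline HTML tags, keeping the text they wrap (None stays None)."""
    if text is None:
        return None
    return _TAG_RE.sub("", text)

# Assumed markup shape; the wrapped word is taken from the ingredient text above.
sample = 'emulgator: lecitini (<span class="allergen">sojin</span> lecitin)'
print(strip_tags(sample))
```

If the allergen words themselves are wanted as a feature, capture the tag contents before stripping, since the markup is the only thing that identifies them.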

ingredients_text_with_allergens_et

categorical free_text long_tail null_rate

Estonian-localised ingredient text with allergen markup, but only 3 of 50 rows carry a value (null_rate 0.94), all three are unique, and only one is actually in Estonian (the others are Slovenian and German). The field is essentially empty and the few populated rows show a language mix rather than the expected `et` locale, suggesting upstream localisation fallback or mislabelling. Treatment: Drop unless you specifically need allergen extraction; 94% nulls and inconsistent language make it unusable as-is. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: kakavova masa, manjmasten kakavov prah, kakavovo maslo, sladkor, emulgator: lecitini (sojin lecitin); ekstrakt vanilije.
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1

origin_sk

categorical foreign_key long_tail null_rate imbalance

`origin_sk` is most likely the Slovak-language variant of the origin field (`sk` is the ISO 639-1 code for Slovak) rather than a surrogate key; the profiler's foreign_key role appears to come from misreading the suffix. It carries almost no information in this slice: 98% of the 50 rows are null and the single non-null value is an empty string. Cardinality is 1 and entropy is 0, so the column is effectively constant where populated. Treatment: Drop from modelling; no join investigation is needed for a per-language text field. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_bg

categorical foreign_key null_rate imbalance

This is most likely the Bulgarian-language variant of the origin field (`bg` is the ISO 639-1 code for Bulgarian, as in product_name_bg) rather than a block-group identifier or foreign key. It is effectively unusable here: 94% of rows are null, and the only distinct value observed across the 50 rows is the empty string (3 occurrences), giving entropy 0.0. Treatment: Drop; no usable signal at this null rate and cardinality. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_sl

categorical free_text long_tail null_rate imbalance

This appears to be a Slovenian-language packaging text field, but it is effectively empty: 98% of the 50 rows are null and the single non-null value is an empty string, giving cardinality 1 and entropy 0. There is no usable signal here. Treatment: Drop; the column is 98% null with only an empty-string value otherwise. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_sk

categorical foreign_key long_tail null_rate imbalance

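Most likely the Slovak-language variant of the generic product name field (`sk` is the ISO 639-1 code for Slovak) rather than a surrogate key into a drug-name dimension; this is a food dataset and the foreign_key role looks like a misreading of the suffix. It is effectively empty in this sample: 98% of rows are null and the single non-null value is the empty string, giving cardinality 1 and zero entropy. Treatment: Drop or defer until a non-empty sample is available; carries no signal here. high · anthropic:claude-opus-4-7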

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_sl

categorical free_text long_tail null_rate imbalance

This appears to be a Slovenian-language ingredients list with embedded allergen HTML markup, likely a localized product label field. The column is almost entirely empty with a null_rate of 0.98, leaving only 1 non-null row out of 50, and that single value is the only unique entry (cardinality 1, entropy 0.0). With essentially no signal and HTML mixed into the text, it carries no analytical value as-is. Treatment: Drop; 98% null and only one observed value make it unusable, or strip HTML and reserve for text extraction if more rows arrive. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Kakavova masa, manjmasten kakavov prah, kakavovo maslo, sladkor, emulgator: lecitini (sojin lecitin); ekstrakt vanilije. Lahko vsebuje sledi oreškov (lešniki, mandlji, pistacija) in mleka. Uporabno najmanj do: glej odtis na zadnji strani embalaže.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_ca

categorical free_text null_rate imbalance

Catalan-language ingredients text field, almost entirely absent from this sample. 96% of the 50 rows are null and the only 2 non-null values are empty strings, giving a single distinct value and zero entropy. Treatment: Drop; the column carries no usable signal in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_sl

categorical metadata long_tail null_rate imbalance

This column appears to be a Slovenian-language generic name field that is effectively empty in this sample. With a 98% null rate and the only non-null value being an empty string, there is zero usable signal (entropy 0.0, cardinality 1). Treatment: Drop; no usable values present. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_dz

categorical metadata long_tail null_rate imbalance

This appears to be a localized product name field (Dzongkha or similar locale suffix), but it is effectively empty: 98% of the 50 rows are null and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no usable signal in this sample. Treatment: Drop; the column is 98% null with a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_et

categorical metadata null_rate imbalance

`origin_et` is most likely the Estonian-language (`_et`) origin text field, but it carries almost no information here: 94% of the 50 rows are null and the only non-null value observed is the empty string, which accounts for all 3 populated rows. Cardinality is 1 and entropy is 0, so the column is effectively constant where present. Treatment: Drop; constant empty value with 94% nulls offers no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_sk

categorical identifier long_tail null_rate imbalance

Column appears to be the Slovak-language (`_sk`) ingredients text with allergen markup, not a key, but it is effectively empty: 98% of 50 rows are null and the only observed value is the empty string. Cardinality is 1 with zero entropy, so there is no usable signal here. Treatment: Drop; the column is 98% null with a single empty value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_sk

categorical free_text long_tail null_rate imbalance

Almost certainly a Slovak product name field that is effectively empty in this slice — 98% of the 50 rows are null and the single non-null value is itself an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here whatsoever. Treatment: Drop; the column is 98% null with only an empty string observed. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_pt

categorical free_text long_tail null_rate

Portuguese-language ingredient lists with embedded HTML allergen tags (…), likely scraped from a food product database. The column is sparsely populated with a 0.84 null rate, and among the 8 non-null rows 5 are empty strings, leaving only 3 genuine ingredient declarations. Each non-empty value is unique and contains raw HTML markup rather than cleaned text. Treatment: Strip HTML tags to extract allergen tokens, then treat as sparse free text; too null-heavy for direct modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 42 (84.0%)
unique: 4
top_value
top_rate: 0.625
cardinality: 4
entropy: 1.549
entropy_ratio: 0.7744
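
The "strip HTML tags to extract allergen tokens" treatment could be sketched as below, using only the Python standard library. This is a minimal illustration, not the profiler's own method: the report elides the actual markup, so the `<span class="allergen">` tag in the sample, the sample text itself, and the names `AllergenExtractor` and `strip_allergen_markup` are all assumptions. The parser deliberately treats text inside any tag as an allergen token, so it does not depend on a particular tag name.

```python
from html.parser import HTMLParser


class AllergenExtractor(HTMLParser):
    """Collects plain text and, separately, any text found inside a tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of currently open tags
        self.text_parts = []    # all text with markup removed
        self.allergens = []     # text that appeared inside any tag

    def handle_starttag(self, tag, attrs):
        # Note: an unclosed void tag such as <br> would leave depth raised.
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        self.text_parts.append(data)
        if self.depth > 0 and data.strip():
            self.allergens.append(data.strip().lower())


def strip_allergen_markup(value):
    """Return (clean_text, allergen_tokens); None or '' yields ('', [])."""
    if not value:
        return "", []
    parser = AllergenExtractor()
    parser.feed(value)
    parser.close()
    return "".join(parser.text_parts).strip(), parser.allergens


# Hypothetical sample value; the tag, class name, and text are assumptions.
sample = (
    'Açúcar, <span class="allergen">leite</span> em pó, '
    'lecitina de <span class="allergen">soja</span>.'
)
clean, allergens = strip_allergen_markup(sample)
```

Empty strings and nulls both come back as an empty text with no tokens, so the 5 blank rows and 42 nulls in this column would collapse into the same "missing" case.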

ingredients_text_with_allergens_ca

categorical free_text long_tail null_rate imbalance

Localized ingredients text with allergens for Catalan, but it's effectively empty in this sample: 98% null and the only non-null value observed is itself an empty string. With cardinality of 1 and entropy 0, this column carries no information here. Treatment: Drop; no usable signal in this slice. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_pt

categorical free_text long_tail null_rate

Portuguese-language generic product name, populated for only 20% of the 50 rows (null_rate 0.8). Of the 10 non-null entries, 8 are empty strings (top_rate 0.8 on value ''), leaving 2 distinct descriptive food labels such as 'Chocolate extrafino com 70% de cacau'. Coverage is too thin and cardinality too low (n_unique 3 including the blank) to support modelling on its own. Treatment: Drop or retain only as a fallback display label; coverage is too sparse to feature-engineer. high · anthropic:claude-opus-4-7

n: 50
nulls: 40 (80.0%)
unique: 3
top_value
top_rate: 0.8
cardinality: 3
entropy: 0.9219
entropy_ratio: 0.5817
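
The `entropy` and `entropy_ratio` figures in these blocks are consistent with Shannon entropy in bits over the non-null values, normalised by log2 of the cardinality. The sketch below assumes that definition (the report does not state it) and uses a hypothetical helper name, `entropy_stats`. The second product name in the sample is a placeholder, since the report shows only one of the two real strings.

```python
import math
from collections import Counter


def entropy_stats(values):
    """Return (entropy_bits, entropy_ratio) over the non-null values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n = len(present)
    if n == 0 or len(counts) < 2:
        return 0.0, 0.0  # empty or constant column
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Normalise by the maximum entropy possible at this cardinality.
    return entropy, entropy / math.log2(len(counts))


# Shape reported for generic_name_pt: 40 nulls, 8 empty strings, 2 names.
column = (
    [None] * 40
    + [""] * 8
    + ["Chocolate extrafino com 70% de cacau", "placeholder name"]
)
entropy, ratio = entropy_stats(column)
print(round(entropy, 4), round(ratio, 4))
```

Under this assumed definition the sample reproduces the block's reported 0.9219 and 0.5817, and any column with a single distinct value yields 0 for both.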

packaging_text_pt

categorical free_text null_rate imbalance

This appears to be a Portuguese packaging-text field, likely free-form descriptions of product packaging. It is effectively empty: 80% of the 50 rows are null, and the remaining 10 rows all hold the empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here. Treatment: Drop the column; it carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 40 (80.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
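
Many columns in this report are "empty" in two ways at once: true nulls plus empty strings. A minimal sketch of an effective null rate that counts both is shown below; the function name `effective_null_rate` is hypothetical, and treating whitespace-only strings as missing is an assumption rather than something the profiler is known to do.

```python
def effective_null_rate(values):
    """Share of values that are None or blank after trimming whitespace."""
    if not values:
        return 0.0
    missing = sum(
        1
        for v in values
        if v is None or (isinstance(v, str) and not v.strip())
    )
    return missing / len(values)


# Shape reported for packaging_text_pt: 40 nulls and 10 empty strings.
column = [None] * 40 + [""] * 10
rate = effective_null_rate(column)
```

For this column the reported null rate is 80%, but the effective rate under this definition is 100%, which is why the narrative calls it empty.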

ingredients_text_pt

categorical free_text long_tail null_rate

Portuguese-language ingredient lists for food products, stored as free text. The column is mostly empty: 80% null, and 7 of the 10 non-null rows hold an empty string, leaving only 3 populated ingredient lists (4 distinct values including the blank). The few populated entries are long, comma-separated ingredient declarations with allergen tokens in caps or underscores. Treatment: Treat as free text: discard the empty majority as missing, then tokenize and parse ingredients before modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 40 (80.0%)
unique: 4
top_value
top_rate: 0.7
cardinality: 4
entropy: 1.357
entropy_ratio: 0.6784

origin_pt

categorical metadata null_rate imbalance

This appears to be the Portuguese-language (`_pt`) origin text field, but it carries no usable signal in this sample. 80% of rows are null and the remaining 10 rows all hold the same empty-string value, giving a single unique category and entropy of 0. Treatment: Drop; the column is 80% null and the rest is a constant empty string. high · anthropic:claude-opus-4-7

n: 50
nulls: 40 (80.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

nutrition_score_warning_nutriments_estimated

numeric feature null_rate constant

This appears to be a warning flag indicating that the nutrition score was computed from estimated nutriment values, likely a 0/1 boolean. Of 50 rows, 96% are null and the remaining 2 rows both carry the value 1.0, making it effectively constant where present. With no variation and almost no coverage, it carries no usable signal. Treatment: Drop; constant-when-present and 96% null. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
min: 1
max: 1
mean: 1
median: 1
std: 0
q1: 1
q3: 1
iqr: 0
skew: 0
kurtosis: 0
n_outliers: 0
outlier_rate: 0
zero_rate: 0

packaging_text_bg

categorical free_text null_rate imbalance

Likely Bulgarian-language packaging text from a product database. The column is effectively empty: 94% null and the only non-null value across 50 rows is the empty string itself (3 occurrences), giving cardinality 1 and zero entropy. Treatment: Drop; no usable signal at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_et

categorical metadata null_rate imbalance

This appears to be an Estonian-language generic name field, but it carries no usable signal in this sample: 94% of rows are null and the only non-null value observed is the empty string (3 occurrences), giving cardinality 1 and entropy 0. Effectively every record is missing or blank. Treatment: Drop; column is 94% null with a single empty-string value otherwise. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_ca

categorical metadata null_rate imbalance

Likely a Catalan-language packaging text field; the `_ca` suffix here is a language code, as in `ingredients_text_ca`, not a Canadian locale. It is effectively empty: 96% null and the only 2 non-null values are both blank strings, giving a single distinct value and zero entropy. Treatment: Drop; the column carries no signal at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_sl

categorical metadata long_tail null_rate imbalance

Localized Slovenian product name field that is nearly empty: 98% of 50 rows are null and the single populated row reads "ARRIBA 85% cacao". With cardinality 1 and entropy 0, this column carries no usable signal in the current sample. Treatment: Drop from modelling; revisit only if a fuller localized catalogue becomes available. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: ARRIBA 85% cacao
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_bg

categorical metadata null_rate imbalance

This appears to be a Bulgarian-language generic product name field, but it is effectively empty: 94% of the 50 rows are null and the only non-null value observed is the empty string itself, repeated 3 times. Cardinality is 1 with zero entropy, so the column carries no information. Treatment: Drop; the column is 94% null with a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_sk

categorical free_text long_tail null_rate imbalance

This appears to be a Slovak-language ingredients text field (suffix _sk), but it is effectively empty in this sample: 98% of 50 rows are null and the single non-null value is an empty string, yielding cardinality 1 and entropy 0. Treatment: Drop; no usable signal at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_bg

categorical free_text long_tail null_rate

Bulgarian-language ingredient lists for food products, stored as free text in Cyrillic. The column is almost entirely empty (null_rate 0.94) with only 3 non-null values across 50 rows, each unique — effectively unusable as a categorical feature. Despite the categorical kind label, the content is long-form ingredient prose, not a discrete category. Treatment: Drop unless doing multilingual text analysis; 94% null leaves too little signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: Какаова маса, нискомаслено какао на прах, какаово масло, захар, емулгатор: лецитин (соеви), екстракт от ванилия, Може да съдържа следи от ядки и мляко,
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1

packaging_text_et

categorical free_text null_rate imbalance

Estonian packaging text field that is effectively empty: 94% of 50 rows are null, and the only non-null value observed is the empty string itself (3 occurrences). With cardinality of 1 and entropy of 0, this column carries no information. Treatment: Drop; no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_sk

categorical foreign_key long_tail null_rate imbalance

This appears to be the Slovak-language (`_sk`) packaging text field, not a key, but it is essentially empty: 98% of the 50 rows are null and the single non-null value is an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here. Treatment: Drop; 98% null and only one observed value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_pt

categorical free_text long_tail null_rate

This is a Portuguese-localized product name field, but it is mostly empty: 80% null and only 7 distinct values across 50 rows, with the top value being the empty string at 40% of the 10 non-null rows (4 rows). The 6 populated entries are a language mix (Portuguese, Italian, French, English) rather than purely Portuguese, suggesting fallback to original-language labels when no translation exists. Entropy ratio of 0.90 reflects that the few present values are nearly all unique. Treatment: Drop or treat as optional metadata; too sparse and language-inconsistent for direct modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 40 (80.0%)
unique: 7
top_value
top_rate: 0.4
cardinality: 7
entropy: 2.522
entropy_ratio: 0.8983

abbreviated_product_name_fr

categorical label long_tail null_rate

Likely a French abbreviated product name field (brand + descriptor + size), used as a display label for items. The column is mostly empty with a null_rate of 0.86, leaving only 7 unique values across 50 rows, each appearing once — entropy_ratio is 1.0, so among the populated rows every value is distinct. Sparsity makes it unusable as a categorical feature in its current state. Treatment: Drop or treat as free text; too sparse and unique to encode as a category. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: CRISTALINE Eau De Source 0.5L
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1

obsolete_imported

categorical feature null_rate imbalance

This appears to be a boolean-style flag indicating whether a record was imported as obsolete, but the signal is effectively absent: 86% of rows are null and the only observed value is "0" across all 7 non-null entries. Cardinality is 1 with zero entropy, so the column carries no discriminative information in this sample. Treatment: Drop; constant value with 86% nulls offers no modelling signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 1
top_value: 0
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
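
The recurring "constant where present" judgement can be expressed as a one-line check. The sketch below is an illustration under the assumption that a column with at most one distinct non-null value is a drop candidate; `is_uninformative` is a hypothetical name, not part of the profiler.

```python
def is_uninformative(values):
    """True when at most one distinct non-null value is present."""
    present = {v for v in values if v is not None}
    return len(present) <= 1


# Shape reported for obsolete_imported: 43 nulls and seven "0" strings.
obsolete_imported = [None] * 43 + ["0"] * 7
# Shape reported for lc_imported: 42 nulls, seven 'fr', one 'es'.
lc_imported = [None] * 42 + ["fr"] * 7 + ["es"]

drop_obsolete = is_uninformative(obsolete_imported)
drop_lc = is_uninformative(lc_imported)
```

The check ignores the null rate on purpose: a constant column is uninformative at any coverage, whereas a sparse but varied column such as `lc_imported` needs a separate judgement.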

sources_fields

unknown other skipped

The column `sources_fields` was skipped by the profiler, so its kind, cardinality, and value statistics are all unavailable. The only confirmed signals are 50 rows present with a null rate of 0.0, meaning every row has some value, but nothing is known about what those values look like. Without further inspection this column cannot be characterised. Treatment: Re-profile or manually inspect a sample before deciding on downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

emb_code

categorical metadata long_tail null_rate imbalance

This appears to be a packager code (French "code emballeur") rather than an embargo or embarkation code, with values like "EMB 44068 A" identifying the packaging site. The column is almost entirely empty: 98% null across 50 rows, leaving a single non-null observation. With only one value present, entropy is 0 and no distributional inference is possible. Treatment: Drop; 98% null with only one observed value provides no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: EMB 44068 A
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

lang_imported

categorical metadata null_rate imbalance

Likely a language tag for imported records, but 86% of the 50 rows are null and the remaining 7 entries are all 'fr'. With only one observed value, entropy is 0 and the column carries no discriminative signal as captured. Treatment: Drop or hold aside until more non-null values arrive; constant 'fr' with 86% nulls is unusable as a feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 1
top_value: fr
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_zh

categorical metadata long_tail null_rate imbalance

This appears to be a Chinese generic-name field, but it is effectively empty: 98% of the 50 rows are null and the single non-null observation is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no usable signal in this sample. Treatment: Drop; the column is 98% null with a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

conservation_conditions_fr_imported

categorical free_text long_tail null_rate

This column holds French-language storage instructions imported from an external source (e.g., 'A conserver de préférence à l'abri du soleil...'). Coverage is extremely sparse: 86% null and only 7 distinct phrasings across 7 non-null rows, each appearing exactly once. The values are free-text variants of the same advice rather than a controlled vocabulary, so entropy_ratio sits at 1.0. Treatment: Drop or normalise via keyword extraction; too sparse and too variable to use as a categorical feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: A conserver de préférence à l'abri du soleil, dans un endroit propre, frais et sans odeur.
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1

origin_fr_imported

categorical free_text long_tail null_rate

This appears to be a French-language origin/import provenance field, with values ranging from a single country name ("France") to a multi-line description of cocoa paste sourcing across continents. Only 2 of 50 rows are populated (null_rate 0.96), and both populated values are unique, giving entropy_ratio 1.0 over a cardinality of 2. The mix of a clean country label with a long descriptive string suggests inconsistent data entry rather than a true categorical. Treatment: Drop or defer; 96% null and entries mix a country name with prose, so not usable as a category without manual normalisation. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: France
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

owner

categorical metadata long_tail null_rate

Categorical column listing the owning organization (food/beverage manufacturers like Barilla, Ferrero, Nestlé) for each record. The column is overwhelmingly empty: 86% null, leaving only 7 populated rows spread across 6 distinct owners, with Barilla appearing twice and the rest singletons. Entropy ratio of 0.98 confirms the non-null values are nearly uniform, so there is little signal beyond identifying who submitted the entry. Treatment: Drop or retain as provenance metadata only; too sparse for modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 6
top_value: org-barilla-france-sa
top_rate: 0.2857
cardinality: 6
entropy: 2.522
entropy_ratio: 0.9755

ingredients_text_fr_imported

categorical free_text long_tail null_rate

French-language ingredient declarations imported from an external source, with non-null values ranging from a short 'Eau de Source' to long free-text listings (allergens capitalised, percentages, additive codes). The column is 86% null and the 7 present values are all unique, yielding maximum entropy (entropy_ratio 1.0) and a top_rate of just 0.14. This is unstructured product copy, not a category, despite being typed as categorical. Treatment: Treat as free text: parse ingredient lists or tokenize/embed for NLP rather than one-hot encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: Eau de Source
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1

owners_tags

categorical metadata long_tail null_rate

Categorical tag identifying the owning organization for each record, with values like 'org-barilla-france-sa' and 'org-nestle-france' suggesting Open Food Facts-style contributor org slugs. The column is 86% null, leaving only 7 populated rows spread across 6 distinct owners, so entropy ratio is near-saturated (0.976) and the top value covers just 2 records. With this much sparsity it carries almost no signal at n=50. Treatment: Drop or retain only as a provenance tag; too sparse to use as a feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 6
top_value: org-barilla-france-sa
top_rate: 0.2857
cardinality: 6
entropy: 2.522
entropy_ratio: 0.9755

product_name_zh

categorical metadata long_tail null_rate imbalance

This appears to be a Chinese product name field that is effectively empty in this sample: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no usable signal here. Treatment: Drop from modelling; revisit only if a larger sample shows actual Chinese strings populated. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

nutrition_data_prepared_per_imported

categorical metadata null_rate imbalance

This column appears to be metadata indicating the basis on which nutrition data was prepared, with the only observed value being '100g'. It is essentially a constant: 86% of rows are null and the remaining 7 entries all share the single value '100g', giving zero entropy. Treatment: Drop; constant column with no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 1
top_value: 100g
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

abbreviated_product_name_fr_imported

categorical metadata long_tail null_rate

This appears to be a French-language abbreviated product name field, likely imported from an external catalog. It is overwhelmingly empty with a null_rate of 0.86, leaving only 7 distinct values across 50 rows, each appearing once (top_rate 0.143, entropy_ratio 1.0). The few populated entries mix brand-led formats like "CRISTALINE Eau De Source 0.5L" and "NESTLE DESSERT Noir 205g" with locale tags such as "Authentique 275g, fr". Treatment: Drop or defer; too sparse (86% null) and unique to model directly. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: CRISTALINE Eau De Source 0.5L
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1

generic_name_zh_debug_tags

unknown metadata skipped

This column appears to be a debug-tag field associated with Chinese generic names, but saturn skipped profiling so no value-level statistics are available. The only confirmed signals are 50 rows with a 0.0 null rate; uniqueness, distribution, and content are unknown. Treatment: Re-profile or inspect manually before use; likely drop as debug instrumentation. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

customer_service_fr

categorical free_text long_tail null_rate

This column holds French-language customer service contact details (postal addresses or web contact URLs) for product manufacturers. It is overwhelmingly empty with an 86% null rate, leaving only 7 non-null values across 6 nearly-unique strings (entropy ratio 0.976), with the top entry — a Wasa contact URL — appearing just twice. The values are unstructured free text mixing URLs, company names, and postal addresses. Treatment: Drop or treat as sparse metadata; not usable as a categorical feature given 86% nulls and near-unique values. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 6
top_value: Service Consommateurs, : www.wasa.com/fr-fr/contact (depuis la France), www.wasa.com/fr-be/contact (depuis la Belgique)
top_rate: 0.2857
cardinality: 6
entropy: 2.522
entropy_ratio: 0.9755

customer_service_fr_imported

categorical metadata long_tail null_rate

This column holds French-language customer service contact details (postal addresses or web contact URLs) for product manufacturers, imported as free-form strings. It is 86% null with only 7 populated rows yielding 6 distinct values, so it functions more as sparse metadata than an analytical feature. Entries vary in format from full postal addresses (Nestlé, Ferrero, Cristaline) to URLs, indicating no normalization upstream. Treatment: Drop for modelling; retain only if needed as a manufacturer contact lookup. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 6
top_value: Service Consommateurs, : www.wasa.com/fr-fr/contact (depuis la France), www.wasa.com/fr-be/contact (depuis la Belgique)
top_rate: 0.2857
cardinality: 6
entropy: 2.522
entropy_ratio: 0.9755

ingredients_text_zh_debug_tags

unknown metadata skipped

This column is flagged as kind "unknown" and was skipped by the profiler, so no statistics, uniqueness, or value samples are available. The only confirmed signals are 50 rows present and a 0.0 null rate. The name suggests it holds debug tags from Chinese-language ingredient text parsing, but that is inferred from the column name, not the evidence. Treatment: Drop unless debug tags are explicitly needed; re-profile with a parser that handles this type before use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

product_name_fr_imported

categorical free_text long_tail null_rate

French-language product names imported from an external source, judging by the suffix and the values like 'CRISTALINE Eau De Source 0.5L' and 'Biscuits Nutella x22 biscuits fourrés - 304g'. Only 7 of 50 rows carry a value (null_rate 0.86), and every populated value is unique (entropy_ratio 1.0, top_rate 0.143), so this behaves as free-text rather than a category. The extreme nullity combined with full uniqueness makes it unusable as a grouping key. Treatment: Treat as sparse free text—drop for modelling or tokenize/embed if product identification is needed. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: CRISTALINE Eau De Source 0.5L
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1

brands_imported

categorical feature long_tail null_rate

This appears to be a free-text brand field listing imported product brands, with 6 distinct values across only 7 non-null rows out of 50 (null_rate 0.86). The top value 'Wasa' appears just twice (top_rate 0.286), and entropy_ratio 0.976 indicates the few present values are nearly uniformly distributed. One entry 'NESTLE DESSERT,Tablettes' looks like a comma-joined multi-value string, suggesting inconsistent encoding. Treatment: Split multi-value strings on comma and treat as low-coverage categorical; consider dropping given 86% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 6
top_value: Wasa
top_rate: 0.2857
cardinality: 6
entropy: 2.522
entropy_ratio: 0.9755

owner_imported

categorical foreign_key long_tail null_rate

Categorical column holding organisation slugs (e.g. 'org-barilla-france-sa', 'org-nestle-france'), almost certainly a foreign key to an owning company. It is 88% null with only 6 non-null rows spread across 5 distinct owners, so entropy_ratio is 0.97 simply because nearly every present value is unique. The column is too sparse to support any aggregation or join in its current state. Treatment: Drop or defer: 88% null leaves too few rows to join or model on. high · anthropic:claude-opus-4-7

n: 50
nulls: 44 (88.0%)
unique: 5
top_value: org-barilla-france-sa
top_rate: 0.3333
cardinality: 5
entropy: 2.252
entropy_ratio: 0.9697

product_name_zh_debug_tags

unknown metadata skipped

This column appears to be an internal debug-tagging field attached to Chinese product names, but saturn skipped profiling so its contents are opaque. The only confirmed facts are 50 rows with no nulls; uniqueness, value distribution, and type are all unreported. Without a profile pass it is impossible to tell whether it carries useful signal or just developer annotations. Treatment: Re-run profiling with this kind enabled before deciding; provisionally drop as unparsed debug metadata. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

lc_imported

categorical metadata null_rate

A categorical flag indicating the source language of imported records, with values 'fr' and 'es'. The column is dominated by missingness — 84% null across 50 rows — and among the 8 populated rows, 'fr' accounts for 7 (87.5%), leaving 'es' as a single observation. Cardinality is just 2, so this carries little signal in its current state. Treatment: Treat nulls as a category or drop; near-constant with severe missingness limits modelling value. high · anthropic:claude-opus-4-7

n: 50
nulls: 42 (84.0%)
unique: 2
top_value: fr
top_rate: 0.875
cardinality: 2
entropy: 0.5436
entropy_ratio: 0.5436

ingredients_text_zh

categorical free_text long_tail null_rate imbalance

This appears to be a Chinese-language ingredients text field, likely from a localized product/food dataset. It is effectively empty: 98% of the 50 rows are null and the only non-null value observed is itself an empty string, giving a cardinality of 1 and entropy of 0. Treatment: Drop; no usable signal at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

quantity_imported

categorical feature long_tail null_rate

This appears to be a free-form quantity/packaging size field mixing volume ('500 ml') and mass units ('304 g', '275 g'), stored as strings rather than parsed numerics. Coverage is extremely poor: 86% of the 50 rows are null, and among the 7 non-null values every one is unique (entropy_ratio 1.0, top_rate 0.14). With no repeated values and mixed units, it offers little categorical signal as-is. Treatment: Parse into a numeric magnitude plus a unit column before use; given 86% nulls, consider dropping or imputing. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: 500 ml
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1
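
Parsing the quantity strings into a numeric magnitude plus a unit could look like the sketch below. It assumes the simple "number, optional space, unit" shape seen in the quoted values ('500 ml', '304 g', '275 g'); anything else, such as a multipack expression, is returned as `None` rather than guessed. The name `parse_quantity` and the decimal-comma handling are assumptions.

```python
import re

# One number (decimal point or comma allowed) followed by a letters-only unit.
_QUANTITY = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z]+)\s*$")


def parse_quantity(value):
    """Return (magnitude, unit) or None when the string does not match."""
    if not value:
        return None
    match = _QUANTITY.match(value)
    if match is None:
        return None
    magnitude = float(match.group(1).replace(",", "."))
    return magnitude, match.group(2).lower()


parsed = [parse_quantity(v) for v in ["500 ml", "304 g", "275 g", None]]
```

Volume and mass units are kept separate here; converting between them would need product density, which the column does not provide.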

nutrition_data_per_imported

categorical metadata null_rate imbalance

Likely a metadata flag indicating the basis on which nutrition values were imported, with '100g' as the sole observed value across all 8 non-null rows. The column is 84% null and has only one unique value, giving zero entropy and no discriminative power. Treatment: Drop; constant value with 84% nulls carries no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 42 (84.0%)
unique: 1
top_value: 100g
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_fr_imported

categorical free_text long_tail null_rate

French generic product names imported from an upstream source (e.g. Open Food Facts), holding descriptors like "Eau De Source" and "Biscuit fourré à la pâte à tartiner aux noisettes et au cacao Nutella®". The column is 86% null and every one of the 7 observed values is unique (entropy_ratio 1.0), so it behaves as free-text rather than a categorical feature. Values are in French with accented characters and brand marks, which will need normalisation if joined with other locales. Treatment: Treat as multilingual free text: normalise accents and tokenize/embed if used; otherwise drop given 86% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: Eau De Source
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1

owner_fields

unknown other skipped

The column `owner_fields` was skipped by the profiler, so its kind is unknown and no descriptive statistics, uniqueness, or value samples are available. The only signals are a row count of 50 and a null rate of 0.0, meaning every row is populated but the contents are opaque from this evidence alone. Without a sample or type inference, nothing can be said about what the field encodes. Treatment: Re-profile with parsing enabled (or inspect raw values) before deciding how to use this column. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

categories_imported

categorical metadata long_tail null_rate

Hierarchical product category paths (comma-separated taxonomy strings, mostly French with some en: prefixes) from the imported (`_imported`) variant of the categories field. The column is 88% null with only 6 non-null rows across 5 distinct values, so coverage is too sparse to be useful as-is. Entropy ratio of 0.97 confirms the few present values are nearly all distinct, and the top value appears just twice. Treatment: Split on comma into hierarchical levels and use only the top level as a feature, or drop given 88% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 44 (88.0%)
unique: 5
top_value: Snacks, Snacks salés, Amuse-gueules, Chips et frites, Chips
top_rate: 0.3333
cardinality: 5
entropy: 2.252
entropy_ratio: 0.9697

conservation_conditions_fr

categorical free_text long_tail null_rate

French-language storage instructions for products, written as free-form sentences (e.g. "A conserver dans un endroit sec à l'abri de la lumière."). Coverage is very thin: 86% null and only 7 distinct strings across 50 rows, each appearing exactly once, so entropy_ratio is 1.0. Despite semantic overlap (cool, dry, away from light), no two entries are phrased identically, making this unusable as a category without normalisation. Treatment: Treat as free text: normalise/cluster phrases or extract keywords (sec, frais, lumière) rather than one-hot encoding. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: A conserver de préférence à l'abri du soleil, dans un endroit propre, frais et sans odeur.
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1
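
The keyword-extraction treatment for the storage instructions can be sketched as follows. This is a minimal illustration, not the profiler's own code: the keyword list and the function names `strip_accents` and `storage_flags` are assumptions, and accents are stripped so that "lumière" matches the keyword "lumiere".

```python
import re
import unicodedata

# Illustrative keyword list (accent-stripped forms); extend as needed.
KEYWORDS = ("sec", "frais", "lumiere")

def strip_accents(text):
    """Decompose characters and drop combining marks (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def storage_flags(text):
    """Return one boolean per keyword; null or empty text gives all False."""
    if not text:
        return {k: False for k in KEYWORDS}
    # Match whole tokens so that 'sec' does not fire on longer words.
    tokens = set(re.findall(r"[a-z]+", strip_accents(text).lower()))
    return {k: k in tokens for k in KEYWORDS}

print(storage_flags("A conserver dans un endroit sec à l'abri de la lumière."))
```

Whole-token matching is deliberately strict: inflected forms such as "sèche" would need their own keyword entries or a stemmer.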

conservation_conditions

categorical free_text long_tail null_rate

Free-text French storage instructions for a product (e.g., "A conserver à l'abri du soleil..."), captured as a categorical field. With 86% nulls and only 7 distinct values across 50 rows — each appearing exactly once — this behaves like sparse free text rather than a controlled vocabulary. Maximum entropy ratio (1.0) confirms every observed value is unique. Treatment: Treat as free text; normalize/keyword-extract (e.g., 'sec', 'frais', 'abri') or drop given 86% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: A conserver de préférence à l'abri du soleil, dans un endroit propre, frais et sans odeur.
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1

countries_imported

categorical metadata null_rate

Likely the imported copy of the product's countries field (the `_imported` suffix recurs on many columns in this report) rather than a country-of-origin tag, but with 84% nulls only 8 of 50 rows actually carry a value. Of those, 7 are 'France' and 1 is 'España', giving a top_rate of 0.875 and just 2 distinct categories. 'España' is spelled in Spanish, which hints that country names are stored in local languages rather than one canonical form. Treatment: Impute or flag missingness and normalise country names to a single language before any grouping. medium · anthropic:claude-opus-4-7

n: 50
nulls: 42 (84.0%)
unique: 2
top_value: France
top_rate: 0.875
cardinality: 2
entropy: 0.5436
entropy_ratio: 0.5436
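
The entropy and entropy_ratio figures in these stat blocks can be reproduced from the value counts. The sketch below assumes Shannon entropy in bits over non-null values, divided by log2(cardinality) for the ratio, with a ratio of 0 when only one category is present. That convention matches the numbers reported here, but the profiler's actual code is not shown, and `entropy_stats` is a hypothetical name.

```python
import math
from collections import Counter

def entropy_stats(values):
    """Return (entropy_bits, entropy_ratio) computed over non-null values."""
    counts = Counter(v for v in values if v is not None)
    n = sum(counts.values())
    if n == 0:
        return 0.0, 0.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    k = len(counts)
    # log2(1) == 0, so a single-category column is given a ratio of 0.
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    return entropy, ratio

# countries_imported: 42 nulls, 7 x 'France', 1 x 'España'
sample = [None] * 42 + ["France"] * 7 + ["España"]
h, r = entropy_stats(sample)
print(round(h, 4), round(r, 4))
```

With exactly two categories log2(2) = 1, which is why entropy and entropy_ratio coincide for this column.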

origins_fr

categorical metadata long_tail null_rate

This appears to be a French-language origins field listing geographic provenance and source names (towns, regions, water sources) as a comma-concatenated string. The column is almost entirely empty with a 96% null rate, leaving only 2 distinct values across 50 rows—one populated entry bundling 11 locations together and one blank string. The packed multi-value format suggests this was flattened from a list field rather than a clean categorical. Treatment: split on commas and explode into a multi-label set before use; coverage too sparse to model directly. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: Chambon-la-Forêt,France,Cairanne,Provence-Alpes-Côte d'Azur,Vaucluse,Italie,Source Sainte Cécile,Source Ofélia,Source Éléonore,Source Emma,Source Éléna
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

abbreviated_product_name

categorical free_text long_tail null_rate

Short product label field, likely a shelf-name abbreviation including brand, variant and pack size (e.g. 'CRISTALINE Eau De Source 0.5L'). It is almost entirely empty with a null_rate of 0.86, and among the 7 populated rows every value is unique (entropy_ratio 1.0, top_rate ~0.143), so it carries no repeating categories. Treatment: Drop or treat as free text; too sparse and unique to use as a categorical feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 7
top_value: CRISTALINE Eau De Source 0.5L
top_rate: 0.1429
cardinality: 7
entropy: 2.807
entropy_ratio: 1

customer_service

categorical free_text long_tail null_rate

Free-text customer service contact details (postal addresses or URLs) extracted from product packaging, mostly in French. The column is 86% null with only 7 populated rows across 6 near-unique values, and entries are long unstructured strings mixing brands like Wasa, Cristaline, Ferrero, Kellogg's, Nestlé and La Boulangère. Treatment: Drop or parse out brand/URL/address fields separately; too sparse and unstructured to model as-is. high · anthropic:claude-opus-4-7

n: 50
nulls: 43 (86.0%)
unique: 6
top_value: Service Consommateurs, : www.wasa.com/fr-fr/contact (depuis la France), www.wasa.com/fr-be/contact (depuis la Belgique)
top_rate: 0.2857
cardinality: 6
entropy: 2.522
entropy_ratio: 0.9755

data_sources_imported

categorical metadata long_tail null_rate

Concatenated provenance trail listing the producers, databases, and apps that contributed to each record (e.g., 'Database - Equadis, Database - GDSN, Databases, Producers, Producer - nestle-france'). 84% of rows are null and the 8 non-null values are all unique, giving entropy_ratio 1.0 — every observed string is a bespoke composite rather than a clean category. Repeated tokens within a single value (e.g., 'Producers' appearing twice) suggest the field was assembled by concatenation without deduplication. Treatment: Split on commas and one-hot or multi-hot encode the underlying source tokens rather than using the raw string. high · anthropic:claude-opus-4-7

n: 50
nulls: 42 (84.0%)
unique: 8
top_value: Producers, Producer - gie-sources-alma, Database - Equadis, Database - GDSN, Databases, Producers, Producer - gie-sources-alma
top_rate: 0.125
cardinality: 8
entropy: 3
entropy_ratio: 1
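
The split-and-encode treatment for the provenance string can be sketched as below. It is an illustration under assumptions: the helper names `source_tokens` and `multi_hot` are hypothetical, and tokens are deduplicated within a value because the report notes repeats such as 'Producers' appearing twice.

```python
def source_tokens(raw):
    """Split a provenance string on commas, trim, and drop repeats in order."""
    if raw is None:
        return []
    seen = []
    for tok in (t.strip() for t in raw.split(",")):
        if tok and tok not in seen:
            seen.append(tok)
    return seen

def multi_hot(rows):
    """Return (vocabulary, matrix) with one 0/1 column per distinct token."""
    vocab = sorted({t for r in rows for t in source_tokens(r)})
    matrix = [[int(t in source_tokens(r)) for t in vocab] for r in rows]
    return vocab, matrix

top = ("Producers, Producer - gie-sources-alma, Database - Equadis, "
       "Database - GDSN, Databases, Producers, Producer - gie-sources-alma")
print(source_tokens(top))
```

Null rows encode as all zeros, so missingness stays distinguishable only if a separate null flag is kept.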

nova_group_error

categorical metadata null_rate imbalance

This appears to be an error/diagnostic flag explaining why a NOVA food classification group could not be assigned. It is null in 96% of 50 rows, and the only observed value across the 2 non-null cases is "too_many_unknown_ingredients" (top_rate 1.0, cardinality 1, entropy 0). With a single category present, the column carries no discriminative signal in this sample. Treatment: Drop or retain only as a boolean error-present flag; near-constant and overwhelmingly null. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: too_many_unknown_ingredients
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_de_ocr_1648897071_result

categorical free_text long_tail null_rate imbalance

This appears to be the OCR result of a German ingredients list (ingredients_text_de_ocr) tied to a specific timestamped run. Of 50 rows, 98% are null and only a single non-null value exists — a detailed Nuss-Nougat-Creme ingredient declaration — giving cardinality 1 and entropy 0. Treatment: Drop; 98% null with a single observed value provides no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Nuss-Nougat-Creme 40% (Zucker, Palmöl, _Haselnüsse_ 13%, _Magermilchpulver_ 8,7%, fettarmer Kakao 7,4%, Emulgator Lecithine (_Soja_), Vanillin), _Weizenmehl_ 32,5%, pflanzliche Fette (Palm, Palmkern), Rohrzucker 8,5% (enthält _Weizen_), _Milchzucker_, _Weizenkleie_, _Vollmilchpulver_, _Gerstenmalz_ - und Maisextraktpulver, Honig, Backtriebmittel: Dinatriumdiphosphat, Natriumhydrogencarbonat, Ammoniumhydrogencarbonat; fettarmer Kakao, Salz, _Weizenstärke_, _Gerstenmalzmehl_, Emulgator Lecithine (_Soja_), Vanillin
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_ro

categorical free_text null_rate imbalance

Romanian-language packaging text field that is essentially empty: 96% of the 50 rows are null and the remaining 2 non-null values are both blank strings, yielding a single observed category and zero entropy. Treatment: Drop; no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
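
Several columns in this report count blank strings as non-null, so the stated null rate understates how empty they are. A minimal sketch of an "effective" null rate that treats blank or whitespace-only strings as missing; `effective_null_rate` is a hypothetical helper, not part of the profiler.

```python
def effective_null_rate(values):
    """Share of entries that are None or a blank/whitespace-only string."""
    if not values:
        return 0.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, str) and not v.strip())
    )
    return missing / len(values)

# packaging_text_ro: 48 nulls plus 2 blank strings
column = [None] * 48 + ["", ""]
print(effective_null_rate(column))
```

For this column the reported null rate is 0.96, while the effective rate under this definition is 1.0.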

product_name_ro

categorical metadata long_tail null_rate

A Romanian-language product name field that is effectively empty: 96% of the 50 rows are null, leaving only 2 non-null values, one of which is an empty string and the other an English phrase ('Sour Cream & Onion'). With cardinality of 2 and no actual Romanian content observed, this column carries no usable signal in the sample. Treatment: Drop; null_rate 0.96 and no Romanian values present. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: (blank string)
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

producer_version_id

categorical metadata long_tail null_rate

Identifier-style categorical capturing a producer/version reference, but 92% of the 50 rows are null, leaving only 4 populated values. The non-null entries are inconsistent in shape — a small integer ('1'), an ISO timestamp, and an 8-digit number — suggesting the field is overloaded or improperly typed. With cardinality 3 and top_rate 0.5 over a tiny populated subset, no reliable signal can be drawn. Treatment: Drop or quarantine until the upstream schema is clarified; not usable as a feature at 92% null with mixed value types. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 3
top_value: 1
top_rate: 0.5
cardinality: 3
entropy: 1.5
entropy_ratio: 0.9464

serving_size_imported

categorical free_text long_tail null_rate

Free-text serving size descriptors imported from an upstream source, mixing grams with French unit hints like 'tranche' and 'carrés'. 88% of the 50 rows are null and the 6 non-null values are all unique, so entropy_ratio is 1.0 and there is no modal serving. Format is inconsistent (e.g. '30 g' vs '25.6 g (5 carrés (25,6 g))'), making direct aggregation unsafe. Treatment: Parse the leading numeric grams into a numeric column and discard the free-text remainder. high · anthropic:claude-opus-4-7

n: 50
nulls: 44 (88.0%)
unique: 6
top_value: 13.8 g (1)
top_rate: 0.1667
cardinality: 6
entropy: 2.585
entropy_ratio: 1
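
The treatment of parsing the leading gram quantity can be sketched with a regular expression. This assumes the quantity always comes first and is followed by a standalone "g"; values without that shape return None rather than a guess. The name `leading_grams` is hypothetical.

```python
import re

# Leading number (dot or comma decimal) followed by a standalone 'g'.
_LEADING_GRAMS = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*g\b", re.IGNORECASE)

def leading_grams(text):
    """Return the leading gram quantity as a float, or None if absent."""
    if not text:
        return None
    match = _LEADING_GRAMS.match(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))

for raw in ["30 g", "25.6 g (5 carrés (25,6 g))", "13.8 g (1)"]:
    print(raw, "->", leading_grams(raw))
```

Values that lead with a unit other than grams are left as None so they can be reviewed instead of silently converted.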

no_nutrition_data_imported

categorical metadata null_rate imbalance

A boolean-style flag indicating whether nutrition data was skipped during import. With a 0.92 null rate and only 4 non-null rows all reading "false" (top_rate 1.0, cardinality 1, entropy 0.0), the column carries no information in this sample. Treatment: Drop; constant value with overwhelming nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 1
top_value: false
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_imported

categorical metadata null_rate

Categorical column capturing imported packaging type, almost certainly free-form labels like 'Enveloppe' or 'Boîte, Barquette'. It's effectively unusable as-is: 92% of the 50 rows are null, leaving only 4 observed values across 2 distinct categories, with 'Enveloppe' covering 3 of them. Treatment: Drop or set aside until more coverage is available; 92% nulls leave nothing to model. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 2
top_value: Enveloppe
top_rate: 0.75
cardinality: 2
entropy: 0.8113
entropy_ratio: 0.8113

ingredients_text_ro

categorical free_text null_rate imbalance

Romanian-language ingredients text, almost entirely absent in this sample. 96% of the 50 rows are null, and the only 2 non-null values are empty strings, giving cardinality 1 and entropy 0. There is no usable signal here. Treatment: Drop; effectively empty for this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

producer_version_id_imported

categorical metadata long_tail null_rate

This appears to be a sparsely populated categorical field tracking some imported producer version identifier, with 92% null_rate leaving only 4 non-null values across 50 rows. The 3 distinct values are wildly inconsistent in format — '1', a timestamp '2021-01-25T13:53:49+01:00', and a numeric '44217063' — suggesting the column conflates multiple semantics or has been mis-mapped during import. With only 4 observations, top_rate of 0.5 and entropy_ratio of 0.95 are not meaningful signals. Treatment: Drop unless the import pipeline can be fixed to emit a single consistent value type. low · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 3
top_value: 1
top_rate: 0.5
cardinality: 3
entropy: 1.5
entropy_ratio: 0.9464
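
Before dropping the column, the mixed value shapes can be tallied to confirm the diagnosis. A minimal sketch, assuming the three shapes named above (plain integer, ISO 8601 timestamp, anything else); `value_shape` is a hypothetical helper and the sample list simply repeats the observed values with '1' counted twice, per the top_rate of 0.5.

```python
import re
from collections import Counter
from datetime import datetime

def value_shape(value):
    """Classify a raw value as 'null', 'integer', 'iso_timestamp' or 'other'."""
    if value is None:
        return "null"
    text = str(value).strip()
    # Check digits first so all-digit strings are never parsed as dates.
    if re.fullmatch(r"\d+", text):
        return "integer"
    try:
        datetime.fromisoformat(text)
        return "iso_timestamp"
    except ValueError:
        return "other"

observed = ["1", "1", "2021-01-25T13:53:49+01:00", "44217063"]
print(Counter(value_shape(v) for v in observed))
```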

labels_imported

categorical metadata long_tail null_rate

Imported product labels (likely certifications or dietary tags) carried over from an external source, with values like 'Végétarien' and comma-separated certification strings ('Point Vert, Rainforest Alliance, Triman'). The column is 90% null, leaving only 5 populated rows across 3 distinct values, and the top value covers 60% of those. With such sparse coverage and multi-label strings packed into single cells, this field is barely usable as-is. Treatment: Split comma-separated tags and one-hot encode, but expect to drop given 90% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 3
top_value: Végétarien
top_rate: 0.6
cardinality: 3
entropy: 1.371
entropy_ratio: 0.865

ingredients_text_de_ocr_1648990410_result

categorical free_text long_tail null_rate imbalance

This appears to be the OCR-extracted German ingredients text from a timestamped scan (1648990410), capturing raw product label text. It is effectively empty: 98% null with only 1 non-null value out of 50 rows, a single German ingredients string for a hazelnut-nougat cookie product. With cardinality 1 and entropy 0, it carries no discriminative signal in this sample. Treatment: Drop; 98% null and only one observed value provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Kekse mit Nuss - Nugat - Creme - Füllung: Nuss-Nugat-Creme 40% (Zucker, Palmöl, HASELNÜSSE Magermilchpulver, fettarmer Kakao, Emulgator Lecithine (S0JA), Vanillin, Weizenmehl, pflanzliche Fette ( Palm, Palmkern), Rohrzucker, Milchzucker, Weizenkleie, VOLLMILCHPULVER, GERSTENMALZ-und Maisextraktpulver, Honig. Backtriebmittel: Dinatriumdiphosphat, Natriumhydrogencarbonat, Ammoniumhydrogencarbonat; fettarmer Kakao, Salz, Weizenstärke, Gerstenmalzmehl, Emulgator Lecithine (Soja), Vanillin
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

allergens_imported

categorical feature long_tail null_rate

Categorical column listing imported allergen declarations, with 90% nulls leaving only 5 populated rows across 4 distinct values. Entries are comma-separated multi-allergen strings in French (e.g., 'Gluten, Graines de sésame', 'Œufs, Gluten'), and one value embeds what looks like a GS1 code ('Gs1:T4078:ML'), suggesting inconsistent encoding. 'Gluten' is the only repeated value (2 of 5), and entropy_ratio of 0.96 reflects the near-uniform spread across the tiny populated subset. Treatment: Split on commas into a multi-label allergen set and impute or flag the 90% missing before use. medium · anthropic:claude-opus-4-7

n: 50
nulls: 45 (90.0%)
unique: 4
top_value: Gluten
top_rate: 0.4
cardinality: 4
entropy: 1.922
entropy_ratio: 0.961

ingredients_text_de_ocr_1648990410

categorical free_text long_tail null_rate imbalance

This appears to be an OCR-extracted German ingredients text field (timestamped 1648990410), likely from a food product database such as Open Food Facts. Out of 50 rows, 98% are null and only a single non-null value exists — a long ingredient string for a hazelnut-nougat cookie product. With cardinality 1 and entropy 0, the column carries no discriminative signal in this sample. Treatment: Drop; 98% null and only one unique OCR value provides no modelling signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Kekse mit Nuss - Nugat- Creme - Füllung: Nuss-Nugat-Creme 40% (Zucker, Palmöl, HASELNÜSSE Magermilchpulver, fettarmer Kakao, Emulgator Lecithine (S0JA), Vanillin, Weizenmehl, pflanzliche Fette ( Palm, Palmkern), Rohrzucker, Milchzucker, Weizenkleie, VOLLMILCHPULVER, GERSTENMALZ-und Maisextraktpulver, Honig. Backtriebmittel: Dinatriumdiphosphat, Natriumhydrogencarbonat, Ammoniumhydrogencarbonat; fettarmer Kakao, Salz, Weizenstärke, Gerstenmalzmehl, Emulgator Lecithine (Soja), Vanillin
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_de_ocr_1648897071

categorical free_text long_tail null_rate imbalance

This appears to be a German-language OCR extraction of an ingredients list (timestamped 1648897071), likely from Open Food Facts product packaging. The column is essentially empty: 98% null across 50 rows, with only 1 non-null value present, so cardinality is 1 and entropy is 0. The single observed entry is a long free-text ingredients string for a hazelnut-nougat-cream product, with allergens marked by underscores. Treatment: Drop; the column is 98% null with only one unique value and carries no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Nuss-Nougat-Creme 40% (Zucker, Palmöl, _Haselnüsse_ 13%, _Magermilchpulver_ 8,7%, fettarmer Kakao 7,4%, Emulgator Lecithine (_Soja_), Vanillin), _Weizenmehl_ 32,5%, pflanzliche Fette (Palm, Palmkern), Rohrzucker 8,5% (enthält _Weizen_), _Milchzucker_, _Weizenkleie_, _Vollmilchpulver_, _Gerstenmalz_- und Maisextraktpulver, Honig, Backtriebmittel: Dinatriumdiphosphat, Natriumhydrogencarbonat, Ammoniumhydrogencarbonat; fettarmer Kakao, Salz, _Weizenstärke_, _Gerstenmalzmehl_, Emulgator Lecithine (_Soja_), Vanillin
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
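
If the single OCR value is kept for inspection, the underscore-marked allergens can be pulled out with a short regular expression. This sketch assumes underscores always come in matched pairs around one allergen term; a stray underscore would shift the pairing. `marked_allergens` is a hypothetical helper.

```python
import re

def marked_allergens(text):
    """Return distinct terms wrapped in underscores, in first-seen order."""
    found = []
    seen = set()
    for token in re.findall(r"_([^_]+)_", text or ""):
        token = token.strip()
        key = token.casefold()
        if token and key not in seen:
            seen.add(key)
            found.append(token)
    return found

snippet = ("Zucker, Palmöl, _Haselnüsse_ 13%, _Magermilchpulver_ 8,7%, "
           "Emulgator Lecithine (_Soja_), Vanillin), _Weizenmehl_ 32,5%, "
           "Emulgator Lecithine (_Soja_)")
print(marked_allergens(snippet))
```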

generic_name_ro

categorical metadata null_rate imbalance

This appears to be a Romanian-language generic product name field (the `_ro` suffix is a language code), but it is effectively empty: 96% of the 50 rows are null, and the only 2 non-null entries are blank strings, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; the column carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_ro

categorical other null_rate imbalance

Romanian-language origin field ('origin_ro') that is effectively empty: 96% of the 50 rows are null, and the only 2 non-null values are blank strings, giving a single observed category with zero entropy. There is no usable signal here. Treatment: Drop; column carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

abbreviated_product_name_imported

categorical metadata long_tail null_rate

This appears to be a shortened/imported product name field, but it's almost entirely empty: 94% null across 50 rows, leaving only 3 non-null values, each unique (e.g., 'Authentique 275g, fr', 'Fibres 230g, fr', 'DESSERT Noir 205g'). With cardinality equal to the populated count and maximal entropy_ratio of 1.0, there is no repetition to learn from. The mixed formatting and language hints (French abbreviations, weight suffixes) further suggest inconsistent upstream import. Treatment: Drop or defer — too sparse and near-unique to be useful as a feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: Authentique 275g, fr
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1

traces_imported

categorical free_text long_tail null_rate

This column appears to record allergen trace declarations (in French) on food products, listing items like Lupin, Lait, Moutarde, Soja as comma-separated lists. It is almost entirely empty with a 92% null rate, leaving only 4 non-null values across 50 rows, each unique. With cardinality equal to the populated count, every observed value is its own category, making aggregation unreliable. Treatment: Split on commas into a multi-label allergen indicator set, but expect sparse signal given the 92% null rate. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 4
top_value: Lupin, Lait, Moutarde, Graines de sésame, Soja
top_rate: 0.25
cardinality: 4
entropy: 2
entropy_ratio: 1

specific_ingredients

unknown free_text skipped

The column 'specific_ingredients' was skipped by the profiler, so no type, uniqueness, or distribution stats are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds ingredient lists, likely free text or arrays, which is consistent with the profiler declining to summarise it. Without sample values or cardinality we cannot confirm structure or detect duplicates, language mix, or skew. Treatment: Inspect raw values and parse or tokenize before any modelling. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

product_name_ru

categorical metadata null_rate

Russian-language product name field, almost entirely absent: 94% of the 50 rows are null and only 2 distinct non-null values appear, one of which is an empty string. The single real value observed is 'Эксeленс 99% какао', suggesting this column is a sparsely populated localization of a product name. Treatment: Drop or defer; too sparse (94% null) to use as a feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 2
top_value: (blank string)
top_rate: 0.6667
cardinality: 2
entropy: 0.9183
entropy_ratio: 0.9183

origin_ru

categorical metadata null_rate imbalance

Russian-language origin field (the `_ru` suffix is a language code, not a country of origin) that is effectively empty: 94% of the 50 rows are null and the only non-null value observed is the empty string, repeated 3 times. Cardinality is 1 and entropy is 0, so this column carries no information as-is. Treatment: Drop; the column has a single empty value and 94% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_ru

categorical free_text null_rate imbalance

Russian-language ingredient text with allergen markup, almost entirely absent from this sample. 94% of rows are null and the remaining 6% are all empty strings, leaving a single unique value and zero entropy. Treatment: Drop; no usable signal in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_ru

categorical free_text null_rate imbalance

Russian-language packaging text field that is effectively empty: 94% of 50 rows are null and the remaining 3 non-null entries are all the empty string, giving a single observed value and zero entropy. There is no usable signal here. Treatment: Drop; column is 94% null with the only non-null value being an empty string. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_ru

categorical metadata null_rate

Russian-language generic product name field, populated for only 3 of 50 rows (null_rate 0.94). Among the 3 non-null entries, 2 are empty strings and 1 is 'Плитка горького шоколада (99% какао)', so effectively only one real value is present. The column is unusable as a feature at this sample size. Treatment: Drop or hold aside until coverage improves; do not use for modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 2
top_value: (blank string)
top_rate: 0.6667
cardinality: 2
entropy: 0.9183
entropy_ratio: 0.9183

ingredients_text_ru

categorical free_text null_rate imbalance

Russian-language ingredient text from what appears to be a multilingual product catalog. The column is effectively empty: 94% null and the only 3 non-null entries are blank strings, giving cardinality 1 and zero entropy. Treatment: Drop; no usable signal at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_da

categorical free_text long_tail null_rate

Danish-language ingredients text for food products, evidently sourced from Open Food Facts-style multilingual labeling. The column is almost entirely empty (null_rate 0.96), with only 2 non-null values out of 50, one of which is a blank string. The single substantive entry is actually a mixed Swedish/Danish/Norwegian ingredient list with allergen tokens marked by underscores, suggesting the locale tagging is unreliable. Treatment: Drop unless modelling Danish text specifically; coverage is too sparse to be useful. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: _VETEMJÖL_/_HVEDEMEL_, palmolja/-olie, glukossirap, maltextrakt från _KORN_/_BYG_, bakpulver/hævemidler (ammoniumkarbonater, natriumkarbonater), salt, _ÄGG_/_ÆG_/_EGG_, arom, mjölbehandlingsmedel/melbehandlingsmiddel (_NATRIUMDISULFIT_).
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

ingredients_text_with_allergens_da

categorical free_text long_tail null_rate

Danish-language ingredients text with HTML-tagged allergen spans, evidently sourced from a multilingual food product database. The column is 96% null with only 2 non-null values out of 50, one of which is an empty string, leaving effectively a single real entry that mixes Swedish and Danish/Norwegian terms. Treatment: Drop unless Danish-specific allergen extraction is required; coverage is too sparse to model. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: VETEMJÖL/HVEDEMEL, palmolja/-olie, glukossirap, maltextrakt från KORN/BYG, bakpulver/hævemidler (ammoniumkarbonater, natriumkarbonater), salt, ÄGG/ÆG/EGG, arom, mjölbehandlingsmedel/melbehandlingsmiddel (NATRIUMDISULFIT).
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

product_name_da

categorical metadata long_tail null_rate

Danish product name field with only 2 non-null values out of 50 rows (null_rate 0.96), each appearing once. The two observed labels ("Original", "Alpine Milk") look like product variant descriptors rather than full product names. With 96% missingness the column carries almost no signal as-is. Treatment: Drop or defer until backfilled; 96% nulls make it unusable for modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: Original
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

packaging_text_da

categorical free_text null_rate imbalance

Danish-language packaging text field that is effectively empty: 96% of the 50 rows are null, and the only 2 non-null values are empty strings, giving cardinality 1 and zero entropy. There is no usable signal here. Treatment: Drop; column is empty in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_da

categorical metadata long_tail null_rate

Danish-language generic product name field, almost entirely empty: 96% null across 50 rows, leaving only 2 non-null observations. The two surviving values are 'Kiks' and an empty string, so there is essentially no usable signal here. Treatment: Drop; null rate too high to be useful. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: Kiks
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

forest_footprint_data

unknown other skipped

The column 'forest_footprint_data' was skipped by the profiler, so its kind is unknown and no descriptive statistics are available. The only confirmed signals are 50 rows with a 0.0 null rate; uniqueness, type, and value distribution are all missing. Treatment: Re-profile or manually inspect this column before any downstream use. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

origin_da

categorical metadata null_rate imbalance

The column appears to be a Danish-language origin field (the `_da` suffix is a language code, as in `product_name_da`), but it is effectively empty: 96% of the 50 rows are null and the only observed value is the empty string, which accounts for the remaining 2 entries. With cardinality of 1 and entropy of 0, it carries no information. Treatment: Drop; the column has no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_sr

categorical metadata null_rate imbalance

This appears to be a Serbian-language origin field (`_sr` follows the same language-suffix pattern as `origin_ro` and `origin_ru`), but it is effectively empty: 96% of the 50 rows are null and the only 2 non-null values are blank strings, giving a single observed category with entropy 0. There is no usable signal here. Treatment: Drop; column is 96% null with only blank values. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_nl_ocr_1675675383_result

categorical free_text long_tail null_rate imbalance

This column appears to hold OCR-extracted Dutch ingredient text from product packaging, likely a per-image result field. It is 98% null with only one non-null value present ('Cacaomassa, suiker, cacaoboter, natuurlijk Bourbon vanille - stokje.'), giving cardinality 1 and zero entropy. With effectively no variance or coverage, it carries no analytical signal in this sample. Treatment: Drop; near-empty with a single observed value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Cacaomassa, suiker, cacaoboter, natuurlijk Bourbon vanille - stokje.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_cs

categorical free_text null_rate

Czech-language ingredients text, almost entirely absent: 94% of the 50 rows are null and only 2 distinct non-null values exist, one of which is an empty string appearing twice. The single substantive entry is a Czech ingredients list ("Kakaová hmota, cukr, kakaové máslo, vanilka."), confirming this is a localized free-text field rather than a categorical feature. Treatment: Drop for modelling; retain only if Czech-localized text is specifically needed. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 2
top_value: (blank string)
top_rate: 0.6667
cardinality: 2
entropy: 0.9183
entropy_ratio: 0.9183

product_name_cs

categorical metadata null_rate

Czech-localised product name field (`product_name_cs`) that is almost entirely unpopulated: 94% of the 50 rows are null and only 2 distinct values appear, one of which is an empty string. The single real label observed is an English-language entry ("Excellence 70% Cocoa Intense Dark"), suggesting the Czech translation pipeline has not been applied. Treatment: Drop unless Czech localisation is required; the column is 94% null and effectively empty. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 2
top_value: (blank string)
top_rate: 0.6667
cardinality: 2
entropy: 0.9183
entropy_ratio: 0.9183

origin_hu

categorical metadata null_rate imbalance

Hungarian-language origin field ('origin_hu') that is 92% null and, among the 4 non-null rows, contains only the empty string. Effective cardinality is 1 with zero entropy, so the column carries no information in this sample. Treatment: Drop; the column is constant and almost entirely null. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 1
top_value: (blank string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_hu

categorical metadata null_rate imbalance

Hungarian packaging text field that is effectively empty: 92% null and the only non-null value across 50 rows is the empty string, occurring 4 times. Cardinality is 1 with entropy 0, so the column carries no information. Treatment: Drop; no signal (single empty value, 92% null). high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_cs

categorical metadata null_rate imbalance

Czech-language origin field (`origin_cs`; the `_cs` suffix is the Czech language code, as in `product_name_cs`) that is effectively empty: 96% of the 50 rows are null and the only 2 non-null entries are blank strings. With cardinality of 1 and entropy of 0, the column carries no information. Treatment: Drop; the column is 96% null with a single blank value otherwise. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_hu

categorical free_text long_tail null_rate

Hungarian-language ingredient lists with inline HTML markup highlighting allergens, the Hungarian variant of Open Food Facts' per-language ingredients_text_with_allergens fields. The column is almost entirely empty (null_rate 0.94): only 3 of 50 rows are populated, and each of those values is unique. Notable surprise: at least one entry is multilingual, bundling Hungarian, Romanian and Bulgarian label text into a single cell. Treatment: Strip HTML tags and treat as optional free text; too sparse (94% null) to use as a feature without heavy imputation. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: Kakaómassza, cukor, kakaó - vaj, vanília.
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1
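
The "strip HTML tags" treatment can be sketched with the standard library alone. The sample string and its `<span class="allergen">` wrapper are hypothetical, chosen only to illustrate inline allergen markup; the actual markup in this column was not shown in the report.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Hypothetical cell value with one highlighted allergen.
sample = 'Kakaómassza, cukor, <span class="allergen">tej</span>, vanília.'
print(strip_tags(sample))
```

A parser is used rather than a regular expression so that character references such as `&amp;` are decoded along the way.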

generic_name_cs

categorical metadata null_rate imbalance

This appears to be a Czech-localized generic product name field, but it carries no information in this sample. 94% of rows are null, and the single distinct non-null value is an empty string (3 occurrences), giving entropy of 0.0 and a top_rate of 1.0. Treatment: Drop; column is effectively empty (94% null and only blank values otherwise). high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_hu

categorical free_text long_tail null_rate

Hungarian-language ingredient declarations for food products, mirroring Open Food Facts' per-language ingredients_text fields. The column is 92% null with only 4 distinct values across 50 rows, and one of those is an empty string while another is a multi-language label dump (HU/RO/BG) rather than pure Hungarian. Top value frequency is just 0.25, so there is no real mode to lean on. Treatment: Treat as sparse free text; drop or language-filter before any NLP, and don't use as a feature without imputing the 92% missing. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 4
top_value: Kakaómassza, cukor, kakaó - vaj, vanília.
top_rate: 0.25
cardinality: 4
entropy: 2
entropy_ratio: 1

ingredients_text_sr

categorical free_text long_tail null_rate

Serbian-language ingredient list field (likely a localized variant of an ingredients_text column). Out of 50 rows, only 2 are populated (null_rate 0.96), and one of those is an empty string, leaving exactly one substantive value. There is essentially no signal here at this sample size. Treatment: Drop unless analysis is restricted to Serbian-localized rows; otherwise too sparse to use. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: Šećer, kakao masa, kakao buter, vanile.
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

packaging_text_sr

categorical free_text null_rate imbalance

This appears to be a Serbian-language packaging text field, but it is effectively empty in the sample: 96% of 50 rows are null and the only non-null entries are blank strings, giving cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop the column; it carries no information in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_nl_ocr_1675675383

categorical free_text long_tail null_rate imbalance

This column appears to be Dutch-language OCR-extracted ingredient text from product packaging, likely a sparsely populated language variant of an ingredients field. Out of 50 rows, 98% are null and the single non-null value is a chocolate ingredients list ('Cacaomassa, suiker, cacaoboter, natuurlijk Bourbon vanille- stokje.'). With cardinality of 1 and entropy of 0, this column carries effectively no signal in this sample. Treatment: Drop; 98% null with a single observed value provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Cacaomassa, suiker, cacaoboter, natuurlijk Bourbon vanille- stokje.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_cs

categorical free_text long_tail null_rate imbalance

This appears to be a Czech-language ingredient list with allergen annotations for food products. With a 98% null rate and only 1 non-null value across 50 rows ('Kakaová hmota, cukr, kakaové máslo, vanilka.'), the column carries virtually no information in this sample. Entropy is 0.0 and cardinality is 1, so it cannot discriminate between records as-is. Treatment: Drop for modelling; revisit only if a larger sample provides meaningful coverage. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Kakaová hmota, cukr, kakaové máslo, vanilka.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_sr

categorical metadata long_tail null_rate

This appears to be a Serbian-language generic product name field (generic_name_sr), likely a localized label in a multilingual product catalog. It is almost entirely empty: 96% null across 50 rows, with only 2 distinct values observed, one of which is itself a blank string. The single non-empty entry is 'Tamna čokolada sa 70% kakaa', so this column carries virtually no usable signal in this sample. Treatment: Drop or ignore for modelling; retain only if a Serbian-locale view is required. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: Tamna čokolada sa 70% kakaa
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

packaging_text_cs

categorical free_text null_rate imbalance

Czech-language packaging text field that is effectively empty: 94% null and the only non-null value observed (3 rows) is itself an empty string, giving cardinality 1 and entropy 0. There is no usable signal here in this sample. Treatment: Drop; column is constant-empty with 94% nulls. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_sr

categorical metadata long_tail null_rate

This looks like a Serbian-localized product name field, but it is effectively empty: 96% of the 50 rows are null and only 2 distinct non-null values appear, one in English ('Excellence 70% Cocoa Intense Dark') and one in Cyrillic ('Течен Шоколад Нутела'). Neither is clearly Serbian: the first is English, and the Cyrillic wording ('шоколад' rather than Serbian 'чоколада') reads more like Bulgarian, which is notable for a column nominally tagged 'sr'. Treatment: Drop or defer until coverage improves; with 96% nulls and 2 unique values it carries no modelling signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: Excellence 70% Cocoa Intense Dark
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

ingredients_text_hu_ocr_1571428260_result

categorical free_text long_tail null_rate imbalance

This appears to be the OCR-extracted Hungarian ingredients text for a single product (likely chocolate: cocoa mass, sugar, cocoa butter, vanilla), captured from one timestamped scan. With null_rate 0.98 and only 1 non-null value across n=50, this column is effectively empty and carries no comparative signal. Because only one row is populated, cardinality is 1 and entropy is 0. Treatment: Drop; 98% null with a single OCR string offers no modelling value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: kakaómassza, cukor, kakaó - vaj, természetes bourbon vanília. Nyomokban egyéb dióféléket, tejet, szóját, szezámmagot es búzát tartalmazhat.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_hu_ocr_1571428260

categorical free_text long_tail null_rate imbalance

This appears to be a Hungarian-language OCR extraction of an ingredients list (likely from a chocolate product, mentioning cocoa mass, sugar, cocoa butter, and bourbon vanilla). Of 50 rows, 98% are null and only 1 non-null value exists, so the column is effectively empty. The single populated entry is free-text in Hungarian, not a category, despite being typed as categorical. Treatment: Drop; 98% null with a single Hungarian free-text value carries no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: kakaómassza, cukor, kakaó- vaj, természetes bourbon vanília. Nyomokban egyéb dióféléket, tejet, szóját, szezámmagot es búzát tartalmazhat.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
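
The top values of `ingredients_text_hu_ocr_1571428260` and its `_result` sibling differ only in the spacing around one hyphen ("kakaó- vaj" versus "kakaó - vaj"). A minimal sketch of a whitespace-insensitive comparison follows; `normalize_ocr` is a hypothetical helper, and the strings are shortened to the first sentence of each top value.

```python
import re


def normalize_ocr(text):
    """Collapse runs of whitespace and remove spaces around hyphens."""
    collapsed = re.sub(r"\s+", " ", text.strip())
    return re.sub(r"\s*-\s*", "-", collapsed)


raw = "kakaómassza, cukor, kakaó- vaj, természetes bourbon vanília."
result = "kakaómassza, cukor, kakaó - vaj, természetes bourbon vanília."

print(raw == result)
print(normalize_ocr(raw) == normalize_ocr(result))
```

The hyphen rule is lossy (it also joins genuine line-break hyphenation), so it suits duplicate detection better than cleaning text for display.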

generic_name_hu

categorical metadata null_rate

Hungarian generic-name field that is almost entirely missing: 92% of the 50 rows are null, and of the 4 non-null entries, 3 are empty strings and only 1 carries a value ("Finom"). With cardinality of 2 and a top_rate of 0.75 on the empty string, this column carries virtually no usable signal in the sample. Treatment: Drop unless a fuller source can be joined in; current null rate makes it unusable. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 2
top_value
top_rate: 0.75
cardinality: 2
entropy: 0.8113
entropy_ratio: 0.8113

product_name_hu

categorical metadata long_tail null_rate

Hungarian-language product name field that is effectively empty: 92% of the 50 rows are null, and the only 4 non-null entries collapse to 3 distinct strings (one of which is the empty string, appearing twice). The two actual values present ("Excellence 70% Cocoa Intense Dark", "Dark Chocolate 70% Cacao") are in English, not Hungarian, suggesting the localisation pipeline never populated this column. Treatment: Drop; null_rate 0.92 and no genuine Hungarian content make it unusable. high · anthropic:claude-opus-4-7

n: 50
nulls: 46 (92.0%)
unique: 3
top_value
top_rate: 0.5
cardinality: 3
entropy: 1.5
entropy_ratio: 0.9464
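
The entropy_ratio values in this report are consistent with entropy divided by log2 of the cardinality, reported as 0 when only one distinct value exists. The sketch below assumes that definition (it is inferred from the figures, not taken from the profiler's code); `entropy_ratio` is a hypothetical helper.

```python
import math


def entropy_ratio(counts):
    """Shannon entropy in bits divided by its maximum, log2(cardinality)."""
    k = len(counts)
    if k < 2:
        # Assumed convention: a constant column gets ratio 0, as in this report.
        return 0.0
    total = sum(counts)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts)
    return entropy / math.log2(k)


# product_name_hu: empty string twice, two English names once each.
print(round(entropy_ratio([2, 1, 1]), 4))
# origin_hu: the empty string four times.
print(entropy_ratio([4]))
```

With so few populated rows a ratio near 1 mostly reflects that almost every value is unique, not that the column is informative.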

ingredients_text_with_allergens_sr

categorical free_text long_tail null_rate

Serbian-language ingredients list with allergen annotations, populated for only 2 of 50 rows (null_rate 0.96). Of the two non-null entries, one is an empty string and the other is a chocolate ingredient list, leaving effectively a single usable value. Coverage is too sparse to support any aggregate analysis. Treatment: Drop or defer; coverage is 4% and insufficient for modelling. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: Šećer, kakao masa, kakao buter, vanile.
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

ingredients_text_es_ocr_1548767061_result

categorical free_text long_tail null_rate imbalance

This appears to be the result of an OCR pass extracting Spanish ingredient text, with the timestamped name suggesting it's one run among many. Of 50 rows, 98% are null and the single non-null value is a Spanish chocolate ingredients list (cocoa paste, sugar, cocoa butter, sunflower lecithin E-322, vanilla extract, 70% cocoa minimum). With cardinality of 1 and entropy 0, this column carries essentially no information across the dataset. Treatment: drop; 98% null with only one distinct OCR result provides no modelling signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Pasta de cacao, azúcar, manteca de cacao, emulgente: lecitina de girasol (E-322), extracto de vainilla. Cacao: 70% mínimo.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
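
Since the OCR columns follow a visible naming pattern, they can be grouped and dropped together. This sketch assumes the pattern `ingredients_text_<lang>_ocr_<digits>` with an optional `_result` suffix, as seen in the column names of this report, and further assumes the digits are a Unix timestamp in seconds; `parse_ocr_column` and `as_utc` are hypothetical helpers.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from column names such as
# ingredients_text_es_ocr_1548767061_result.
OCR_COLUMN = re.compile(
    r"^ingredients_text_(?P<lang>[a-z]{2})_ocr_(?P<ts>\d+)(?P<result>_result)?$"
)


def parse_ocr_column(name):
    """Split an OCR column name into its parts, or return None if it does not match."""
    match = OCR_COLUMN.match(name)
    if match is None:
        return None
    return {
        "lang": match.group("lang"),
        "timestamp": int(match.group("ts")),
        "is_result": match.group("result") is not None,
    }


def as_utc(timestamp):
    """Interpret the embedded digits as Unix seconds (an assumption)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


parsed = parse_ocr_column("ingredients_text_es_ocr_1548767061_result")
print(parsed)
print(parse_ocr_column("ingredients_text_hu"))
```

Columns for which `parse_ocr_column` returns a dict can then be excluded as a family instead of one by one.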

product_name_xx

categorical metadata null_rate imbalance

A categorical field, likely a localized product name (suffix _xx suggests a translation/locale variant). It is effectively empty: 96% null and the only 2 non-null rows contain the empty string, giving cardinality 1 and zero entropy. Treatment: Drop; the column carries no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_xx

categorical metadata null_rate imbalance

This appears to be a localized generic name field (suffix _xx suggests a translation/locale variant), but it is effectively empty: 96% of the 50 rows are null, and the only 2 non-null values are blank strings. Cardinality is 1 with zero entropy, so the column carries no information. Treatment: Drop; null_rate 0.96 and single empty value provide no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_es_ocr_1548767061

categorical free_text long_tail null_rate imbalance

This appears to be a Spanish-language OCR capture of an ingredients list (likely from a chocolate product label, given 'Pasta de cacao' and '70% mínimo'). It is effectively empty: 98% null across 50 rows, with the single non-null value being one verbatim ingredients string. There is no analytical signal here — entropy is 0 and cardinality is 1. Treatment: Drop; 98% null with only one observed value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Pasta de cacao, azúcar, manteca de cacao, emulgente: lecitina de girasol (E-322), extracto de vainilla. Cacao: 70% mínimo.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_xx

categorical free_text null_rate imbalance

This appears to be a localized ingredients text field (suffix _xx suggests a placeholder or unknown locale variant). It is effectively empty: 96% of the 50 rows are null, and the only 2 non-null entries are both empty strings, giving cardinality 1 and entropy 0. Treatment: Drop; the column carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_xx

categorical other long_tail null_rate imbalance

The column 'origin_xx' is effectively empty: 98% of its 50 rows are null, and the single non-null value is itself an empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop the column; it carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_xx

categorical metadata long_tail null_rate imbalance

This appears to be a localized packaging text field (xx language suffix) that is essentially empty in this sample. 98% of rows are null and the single non-null value is itself an empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; the column carries no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_ur

categorical free_text long_tail null_rate imbalance

Likely an Urdu-language ingredient text field (ingredients_text_ur), almost entirely absent from this sample. 98% of rows are null, and the single non-null value is an empty string, leaving zero usable content. Cardinality is 1 with entropy 0.0, so the column carries no information here. Treatment: Drop from modelling; retain only if a fuller multilingual extract becomes available. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_ur

categorical metadata long_tail null_rate imbalance

This appears to be an Urdu-language product name field that is essentially empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality collapses to 1 and entropy is 0, so the column carries no information as captured. Treatment: Drop; column is effectively empty with zero entropy. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_he

categorical metadata long_tail null_rate imbalance

Hebrew-language origin field (`origin_he`) that is effectively empty: 98% of the 50 rows are null and the only observed value is the empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; column carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_he

categorical metadata long_tail null_rate

Hebrew product name field that is almost entirely empty — 96% null across 50 rows, leaving only 2 non-null values, both unique (נוטלה and תפוציפס שמנת בצל). With just two observations the entropy ratio of 1.0 is meaningless, and the column cannot support any analysis in its current state. Treatment: Drop or defer until backfilled; 96% null makes it unusable downstream. high · anthropic:claude-opus-4-7

n: 50
nulls: 48 (96.0%)
unique: 2
top_value: נוטלה
top_rate: 0.5
cardinality: 2
entropy: 1
entropy_ratio: 1

origin_ur

categorical metadata long_tail null_rate imbalance

Urdu-language origin field (`origin_ur`; `_ur` is the Urdu language code, as in `product_name_ur` and `ingredients_text_ur`), not a truncated URL field. It is effectively empty, with 98% nulls across 50 rows and only 1 non-null value, which is itself an empty string. Cardinality is 1 and entropy is 0, so the column carries no information as captured. Treatment: Drop; column has 98% nulls and a single empty-string value, providing no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_ur

categorical metadata long_tail null_rate imbalance

This appears to be an Urdu-language generic name field, a localized version of a product's generic name. It is effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string, giving cardinality 1 and entropy 0. There is no usable signal here. Treatment: Drop the column; it is 98% null with the only present value being blank. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_he

categorical free_text long_tail null_rate imbalance

Hebrew packaging text field that is essentially empty: 98% of the 50 rows are null and the single non-null observation is itself an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here for any downstream task. Treatment: Drop the column; it carries no information. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_he

categorical free_text long_tail null_rate imbalance

Hebrew-language ingredient text field, almost entirely absent from this sample. 98% of the 50 rows are null, and the single non-null value is an empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; no usable signal at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_ur

categorical free_text long_tail null_rate imbalance

Likely an Urdu-language packaging text field, but it carries essentially no information here: 98% of rows are null and the only non-null value observed is an empty string. Cardinality is 1 with entropy 0.0, so the column is constant across the populated rows. Treatment: Drop; effectively empty with 98% nulls and a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_he

categorical metadata long_tail null_rate imbalance

Hebrew generic product name field that is effectively empty: 98% null across 50 rows, leaving a single non-null value (a Hebrew product description, roughly "hazelnut spread with cocoa") that occupies 100% of the observed entries. Cardinality is 1 and entropy is 0, so the column carries no discriminating signal in this sample. Treatment: Drop from modelling; revisit only if a fuller extract populates the field. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: ממרח אגוזי לוז עם קקאו
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_he

categorical free_text long_tail null_rate imbalance

Hebrew-localized ingredients-with-allergens text, almost entirely absent: 98% null across 50 rows and the single non-null value is an empty string. Cardinality is 1 with zero entropy, so this column carries no information in this sample. Treatment: Drop; no usable signal in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

nutriscore_grade_producer

categorical feature long_tail null_rate

Producer-supplied Nutri-Score letter grade (a–e scale), captured here as three distinct values: 'c', 'e', and 'b', each appearing once. The column is essentially empty, with a 94% null rate leaving only 3 of 50 rows populated, so the apparent uniform entropy (1.58) is an artefact of tiny sample size. No 'a' or 'd' grades observed in the evidence. Treatment: Drop or defer; too sparse (94% null) to use until producer coverage improves. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: c
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1

nutriscore_grade_producer_imported

categorical feature long_tail null_rate

This appears to be a Nutri-Score grade (a-e scale) imported from producer data, stored as a categorical letter grade. The column is almost entirely empty with a 94% null rate, leaving only 3 non-null values across 3 distinct grades (c, e, b) — too sparse to draw any distributional conclusions. Treatment: Drop or treat as missing-indicator only; too sparse (94% null) to use as a feature. high · anthropic:claude-opus-4-7

n: 50
nulls: 47 (94.0%)
unique: 3
top_value: c
top_rate: 0.3333
cardinality: 3
entropy: 1.585
entropy_ratio: 1
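
The "missing-indicator only" treatment can be sketched as a binary flag that records whether a producer grade is present and discards the grade itself. `has_value` is a hypothetical helper, and the column is rebuilt from the counts above (47 nulls plus the grades c, e and b).

```python
def has_value(values):
    """Map each cell to 1 if it holds a non-empty value, else 0."""
    return [0 if v is None or v == "" else 1 for v in values]


# 47 nulls followed by the three observed producer grades.
grades = [None] * 47 + ["c", "e", "b"]
indicator = has_value(grades)
print(sum(indicator), len(indicator))
```

With only 3 positives in 50 rows the flag is itself highly imbalanced, so it is more useful for auditing producer coverage than as a model input.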

packaging_text_el

categorical free_text long_tail null_rate imbalance

This appears to be Greek-language packaging text, but it's effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, meaning there is no usable signal here whatsoever. Treatment: Drop the column; it has no observed content. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_el

categorical free_text long_tail null_rate imbalance

Greek-language ingredients-with-allergens text field that is effectively empty in this sample: 98% null and the only non-null value observed is itself an empty string. With cardinality of 1 and entropy 0, this column carries no signal here. Treatment: Drop; no usable content in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_el

categorical free_text long_tail null_rate imbalance

This is the Greek-language ingredients text field, presumably from a multilingual food product dataset. It is effectively empty: 98% null across 50 rows, and the single non-null value is itself an empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; no usable signal at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_el

categorical metadata long_tail null_rate imbalance

Greek-language generic name field that is effectively empty: 98% nulls and the only observed non-null value is itself an empty string, giving a single unique entry across 50 rows. Entropy is 0.0 and top_rate is 1.0, so the column carries no information in this sample. Treatment: Drop; the column has no usable signal here. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_el

categorical other long_tail null_rate imbalance

Greek-language origin field (`origin_el`) that is nearly entirely empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here. Treatment: Drop the column; it is effectively all null with no variance. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_el

categorical metadata long_tail null_rate imbalance

This appears to be a Greek-language product name field that is effectively empty in this sample. 98% of rows are null and the single non-null value is itself an empty string, giving cardinality 1 and zero entropy. There is no usable signal here at n=50. Treatment: Drop from modelling; revisit only if a larger sample shows actual Greek text. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_th

categorical metadata long_tail null_rate imbalance

Thai-language generic product name field that is effectively empty in this sample: 98% of 50 rows are null and the only non-null value observed is the empty string, giving a single distinct value and zero entropy. Treatment: Drop from modelling; the column carries no information in this slice. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_de_ocr_1559410715_result

categorical free_text long_tail null_rate imbalance

This column appears to hold the OCR result of a German ingredients text extraction (likely from a product label image), with a timestamp embedded in the column name. Of 50 rows, 98% are null and only one non-null value exists — a single German cocoa-product ingredient list mentioning possible traces of nuts, milk, and soy. With cardinality 1 and entropy 0, the column carries effectively no signal at this sample size. Treatment: Drop; 98% null and only one distinct OCR string provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Kakaomasse, fettarmes Kakaopulver, Kakaobutter, Rohrzucker. Kann Schalenfrüchte, Milch und Soja enthalten.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_th

categorical free_text long_tail null_rate imbalance

This appears to be a Thai-localized ingredients-with-allergens text field, but it is effectively empty: 98% of 50 rows are null and the single populated row contains an English-language ingredients string for a 99% cocoa product. The column carries zero entropy (entropy_ratio 0.0) and only one distinct value, so it provides no analytical signal. The language mismatch (English content in a _th column) is also notable. Treatment: Drop; 98% null and a single non-Thai value make it unusable. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Cocoa solids 99%, Cocoa paste, fat-reduced cocoa, cocoa butter, demerara sugar. May contain nuts, milk and soya.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_th

categorical free_text long_tail null_rate imbalance

This appears to be a Thai-language packaging text field, but it is effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no information. Treatment: Drop; no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_th

categorical metadata long_tail null_rate imbalance

Thai-language product name field, almost entirely empty: 98% null across 50 rows with only a single non-null value (one Lindt dark chocolate entry). Cardinality is 1 and entropy is 0, so this column carries no discriminating signal in this sample. Treatment: Drop from modelling; revisit only if a fuller Thai-localised dump becomes available. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: ลินด์ เอ็กเซอร์แลนซ์ ดาร์ก 99% โกโก้ ดาร์ก แอปโซลูท ช็อกโกแลต
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_de_ocr_1548767354_result

categorical free_text long_tail null_rate imbalance

This column appears to hold the OCR result of a German ingredients label (one specific dark chocolate product) tied to a timestamped run (1548767354). Of 50 rows, 98% are null and the single non-null value occupies the entire observed cardinality of 1, giving zero entropy. There is effectively no variation to learn from here. Treatment: Drop; 98% null and only one distinct OCR string. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Extra feine dunkle Schokolade. Schokolade enthält: Kakao: mind. 99%. Zutaten: Kakaomasse, fettarmes Kakaopulver, Kakaobutter, Rohrzucker. Kann Schalenfrüchte, Milch und Soja enthalten.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_th

categorical free_text long_tail null_rate imbalance

This is a Thai-language ingredients text field (ingredients_text_th), but 98% of the 50 rows are null and the single non-null entry is actually English text describing cocoa-based product ingredients. With cardinality of 1 and entropy of 0, the column carries no usable signal and the one populated value appears to be in the wrong language for the field. Treatment: Drop; 98% null and the lone value is mislocalized. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Cocoa solids 99%, Cocoa paste, fat-reduced cocoa, cocoa butter, demerara sugar. May contain nuts, milk and soya.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
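
A mislocalized value like this one can be flagged by checking which script its letters belong to. The sketch below is a rough heuristic, not language identification: it relies on Unicode character names starting with the script name, and `script_share` is a hypothetical helper.

```python
import unicodedata


def script_share(text, script_prefix):
    """Fraction of alphabetic characters whose Unicode name starts with script_prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(letters)


english_in_th = "Cocoa solids 99%, Cocoa paste, fat-reduced cocoa, cocoa butter, demerara sugar."
thai_name = "ลินด์ เอ็กเซอร์แลนซ์ ดาร์ก 99% โกโก้ ดาร์ก แอปโซลูท ช็อกโกแลต"

print(script_share(english_in_th, "THAI"))
print(script_share(thai_name, "THAI"))
```

A script check only helps where the expected language has its own script (Thai, Hebrew, Greek); it cannot catch English text in a Hungarian or Czech column, since those share the Latin alphabet.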

origin_th

categorical other long_tail null_rate imbalance

Thai-language origin field (`origin_th`) that is effectively empty: 98% of the 50 rows are null and the only observed non-null value is itself an empty string, giving a cardinality of 1 and entropy of 0. There is no signal here to model or join on. Treatment: Drop; the column is 98% null with a single empty-string value otherwise. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_de_ocr_1548767354

categorical free_text long_tail null_rate imbalance

This appears to be German-language OCR text of product ingredient lists, likely from a food database (the sole observed value describes dark chocolate ingredients). The column is almost entirely empty with a 0.98 null rate, and only one non-null record exists across 50 rows, yielding cardinality 1 and entropy 0. With a single observation there is no usable variation for analysis. Treatment: Drop; 98% null with only one observed value provides no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Extra feine dunkle Schokolade. Schokolade enthält: Kakao: mind. 99%. Zutaten: Kakaomasse, fettarmes Kakaopulver, Kakaobutter, Rohrzucker. Kann Schalenfrüchte, Milch und Soja enthalten.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_de_ocr_1559410715

categorical free_text long_tail null_rate imbalance

This appears to be a German-language OCR extraction of an ingredients list (likely from a chocolate product packaging), captured as free text. It is almost entirely empty with a 0.98 null rate, and the single non-null row contains one verbose ingredient declaration, giving cardinality 1 and entropy 0.0. With only one observed value out of 50 rows, this column carries no usable signal as-is. Treatment: Drop; effectively empty (98% null, single distinct OCR string). high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Extra feine dunkle Schokolade. Schokolade enthält: Kakao: mind. 99%. Zutaten: Kakaomasse, fettarmes Kakaopulver, Kakaobutter, Rohrzucker. Kann Schalenfrüchte, Milch und Soja enthalten.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_it_ocr_1559410715

categorical free_text long_tail null_rate imbalance

This appears to be an Italian-language OCR extraction of an ingredients list, likely from a food product label (the lone value describes 99% dark chocolate). The column is effectively empty: 98% null across 50 rows, with only a single non-null value, giving cardinality 1 and entropy 0. There is no variation to learn from here. Treatment: Drop; 98% null with a single observed value carries no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Cioccolato amaro extra. Cacao: 99% minimo. Ingredienti: pasta di cacao, cacao magro, burro di cacao, zucchero grezzo di canna. Può contenere frutta a guscio, latte e soia.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_it_ocr_1559410715_result

categorical free_text long_tail null_rate imbalance

This column appears to hold OCR-extracted Italian ingredient text (timestamp-suffixed name suggests a single OCR pass result). It is effectively empty: 98% null across n=50, with only one non-null value — a chocolate ingredient list — giving cardinality 1 and zero entropy. Treatment: Drop; a single non-null OCR string carries no modelling signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: pasta di cacao, cacao magro, burro di cacao, zucchero grezzo di canna. Può contenere frutta a guscio, latte e soia.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_fr_imported

categorical free_text long_tail null_rate imbalance

Likely a French-language packaging description imported into Open Food Facts from an external source; the one observed value gives recycling instructions for two packaging components (a paper sheet and a metal sheet). The column is 98% null and only a single non-null value appears across 50 rows, so it carries no analytical signal in this sample. Treatment: Drop; near-entirely null with no variance. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: 1 FEUILLE PAPIER À RECYCLER, 1 FEUILLE METAL À RECYCLER.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

preparation_fr_imported

categorical metadata long_tail null_rate imbalance

This appears to be a French-language preparation/readiness label imported from an external source, indicating how a product is prepared. It is effectively unusable here: 98% of the 50 rows are null, and the single non-null value is "Produit prêt à consommer", giving cardinality 1 and entropy 0. Treatment: Drop; near-entirely null with a single constant value carries no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Produit prêt à consommer
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

preparation

categorical metadata long_tail null_rate imbalance

A categorical preparation field, likely indicating how a food product should be prepared before consumption. It is effectively empty: 98% of the 50 rows are null, leaving only a single observed value ("Produit prêt à consommer") with cardinality 1 and zero entropy. With no variation among the populated rows, the column carries no discriminative signal. Treatment: Drop; 98% null and only one observed value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Produit prêt à consommer
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

preparation_fr

categorical metadata long_tail null_rate imbalance

This appears to be a French-language preparation instruction field, likely metadata describing how a product is consumed. It is essentially empty: 98% of the 50 rows are null, and the single non-null value is "Produit prêt à consommer", giving cardinality 1 and entropy 0. There is no variation to learn from in this sample. Treatment: Drop; 98% null with only one observed value provides no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Produit prêt à consommer
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_lc

categorical free_text long_tail null_rate imbalance

This appears to be an ingredients text field from Open Food Facts; the `_lc` suffix may denote a language-code slot (compare `countries_lc`) rather than lowercasing. It is effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string, giving cardinality 1 and entropy 0. There is no usable signal here. Treatment: Drop; the column is 98% null and the only observed value is empty. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_lc

categorical feature long_tail null_rate imbalance

A product name field (the `_lc` suffix may denote a language-code slot rather than lowercasing) that is effectively empty: 98% of the 50 rows are null and the single non-null value is an empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; column is 98% null with the only observed value being an empty string. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
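Several cards in this report describe a column whose only non-null value is an empty string, so the reported 98% null rate understates how empty the column is. The sketch below treats blank strings as nulls before applying a drop rule. It assumes the sample is loaded as a list of dicts; the helper names and the 0.95 threshold are hypothetical choices, not settings taken from the profiler.

```python
def effective_null_rate(rows, column):
    """Share of rows where the column is missing, None, or a blank string."""
    def is_empty(value):
        return value is None or (isinstance(value, str) and value.strip() == "")

    return sum(is_empty(row.get(column)) for row in rows) / len(rows)


def columns_to_drop(rows, threshold=0.95):
    """List columns whose effective null rate meets or exceeds the threshold."""
    columns = sorted({key for row in rows for key in row})
    return [c for c in columns if effective_null_rate(rows, c) >= threshold]


# Hypothetical 50-row sample: `lang` is always filled in, while
# `product_name_lc` holds an empty string in one row and is absent elsewhere.
rows = [{"lang": "fr"} for _ in range(50)]
rows[0]["product_name_lc"] = ""

print(effective_null_rate(rows, "product_name_lc"))  # blank counted as null
print(columns_to_drop(rows))
```

Counting blanks as nulls moves such a column from 98% to 100% empty, which makes the drop decision independent of any threshold below 1.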

ingredients_text_with_allergens_lc

categorical free_text long_tail null_rate imbalance

Most likely an ingredients-with-allergens text field (the `_lc` suffix may denote a language-code slot rather than lowercasing), but in this sample it carries no signal: 98% of rows are null and the single non-null value is an empty string. With cardinality 1 and entropy 0, there is nothing to learn from this column as-is. Treatment: Drop from the working set unless a larger sample shows real text content. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_lc

categorical feature long_tail null_rate imbalance

This appears to be a generic product name field (the `_lc` suffix may denote a language-code slot rather than lowercasing), but it is effectively empty in this sample: 98% null and the only non-null value among 50 rows is an empty string. With a cardinality of 1 and entropy of 0, the column carries no information here. Treatment: Drop; no signal at this null rate and cardinality. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_xx_debug_tags

unknown metadata skipped

This column is flagged as unknown kind with all profiling skipped, so saturn produced no statistics beyond a 50-row count and zero nulls. The name suggests it holds debug tags attached to multilingual (xx) ingredient text, likely a list or structured field the profiler could not parse. Without unique counts, value samples, or type information there is nothing further to characterise. Treatment: Inspect raw values manually; drop unless debug tags are needed downstream. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

product_name_xx_debug_tags

unknown metadata skipped

This column was skipped by the profiler, so its kind is unknown and no descriptive statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds debug tags attached to localized product names (the 'xx' locale and 'debug_tags' suffix), which is typically engineering scaffolding rather than analytical signal. Without unique counts or value samples, nothing can be said about cardinality or content. Treatment: Drop unless a downstream debugging workflow specifically needs it. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —

generic_name_xx_debug_tags

unknown metadata skipped

This column is flagged as kind 'unknown' and was skipped by the profiler, so no statistics were computed beyond a row count of 50 and a null rate of 0.0. The name suggests it holds debug tags associated with a generic name field, likely diagnostic metadata rather than analytical signal. Without unique counts or value samples, nothing further can be inferred. Treatment: Drop unless debug tags are explicitly needed for tracing. low · anthropic:claude-opus-4-7

n: 50
nulls: 0 (0.0%)
unique: —
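For the skipped `*_debug_tags` columns, manual inspection can start with a tally of the Python types found in the raw JSON. The sketch below assumes the sample is a list of dicts. The example rows use empty lists, which is only a guess at why the profiler reported zero nulls yet computed no statistics; the helper name `describe_raw_values` is hypothetical.

```python
from collections import Counter


def describe_raw_values(rows, column, sample_size=3):
    """Tally value types for a column and collect a few non-empty samples."""
    type_counts = Counter(type(row.get(column)).__name__ for row in rows)
    non_empty = [row[column] for row in rows if row.get(column)]
    return {
        "types": dict(type_counts),
        "non_empty": len(non_empty),
        "samples": non_empty[:sample_size],
    }


# Hypothetical rows: every product carries an empty list of debug tags.
rows = [{"product_name_xx_debug_tags": []} for _ in range(50)]
print(describe_raw_values(rows, "product_name_xx_debug_tags"))
```

If every value turns out to be an empty container, the column can be dropped on the same grounds as the all-blank text columns.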

ingredients_text_fr_ocr_1561814324

categorical free_text long_tail null_rate imbalance

This appears to be a French-language OCR capture of an ingredients list, timestamped in the column name (1561814324). With null_rate of 0.98, only a single non-null row exists out of 50, and that lone value is a full ingredients paragraph—cardinality is 1 and entropy is 0. There is effectively no signal here for analysis. Treatment: Drop; 98% null with a single observed value provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: 25 % cerneaux de noix, 25 % amandes décortiquées 25 % raisins secs sultanines (raisins secs,huile de tournesol. antioxydant: anhydride lfureux), 15% canneberges, 9,8% sucre, huile de tournesol. Traces éventuelles d'autres fruits à coque et d'arachides. Conditionné sous atmosphère protectrice.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
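The numeric suffix in these OCR column names can be decoded to check when each snapshot was taken. This is a minimal sketch that assumes the suffix is a Unix timestamp in seconds and that names follow the pattern `ingredients_text_<lang>_ocr_<timestamp>` with an optional `_result` ending; both are inferred from the column names in this report, and the function name `parse_ocr_column` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the column names seen in this report.
OCR_COLUMN = re.compile(
    r"^ingredients_text_(?P<lang>[a-z]{2})_ocr_(?P<ts>\d{10})(?P<result>_result)?$"
)


def parse_ocr_column(name):
    """Split an OCR column name into language, UTC timestamp, and variant."""
    match = OCR_COLUMN.match(name)
    if match is None:
        return None
    return {
        "lang": match["lang"],
        # Assumes the suffix counts seconds since the Unix epoch.
        "timestamp": datetime.fromtimestamp(int(match["ts"]), tz=timezone.utc),
        "is_result": match["result"] is not None,
    }


info = parse_ocr_column("ingredients_text_fr_ocr_1561814324")
print(info["lang"], info["timestamp"].isoformat(), info["is_result"])
```

Under that assumption, 1561814324 falls on 2019-06-29 (UTC), so the suffix identifies a single OCR pass rather than a stable feature name.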

ingredients_text_fr_ocr_1561814324_result

categorical free_text long_tail null_rate imbalance

This appears to be the OCR result of a French ingredients label, captured at a specific timestamp (1561814324) suggesting it's one of many time-stamped OCR snapshot columns. Of 50 rows, 98% are null and only a single non-null value exists — a verbose French ingredients string for a nut-and-raisin mix. With cardinality 1 and entropy 0, the column carries essentially no analytical signal in this sample. Treatment: Drop; 98% null and only one distinct OCR string provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: 25 % cerneaux de noix, 25 % amandes décortiquées 25 % raisins secs sultanines (raisins secs,huile de tournesol. antioxydant: anhydride lfureux), 15% canneberges, 9,8% sucre, huile de tournesol. Traces éventuelles d'autres fruits à coque et d'arachides.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1624039072_result

categorical free_text long_tail null_rate imbalance

This column appears to hold OCR results of French ingredient lists, likely from a timestamped extraction run (1624039072). It is effectively empty: 98% null with only 1 non-null value out of 50, that single entry being a cocoa/soy lecithin/vanilla ingredient string. With cardinality 1 and entropy 0, the column carries no usable signal in this sample. Treatment: Drop; 98% null with a single observed value provides no modelling signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Cacao, émulsifiant (lécithine de _soja_), vanille.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1624039072

categorical free_text long_tail null_rate imbalance

This appears to be French OCR-extracted ingredient text from product packaging, with the timestamp suffix suggesting a dated extraction run. Out of 50 rows, 98% are null and only a single non-null value exists ('ingrédients : cacao, émulsifiant (lécithine de _soja_), vanille.'), giving cardinality 1 and zero entropy. The column is effectively empty and carries no discriminative signal. Treatment: Drop; 98% null with a single observed value provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: ingrédients : cacao, émulsifiant (lécithine de _soja_), vanille.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573108346

categorical free_text long_tail null_rate imbalance

This appears to be a French-language OCR-extracted ingredients list (likely from a food product label, given mentions of flour, sugar, butter, eggs, and emulsifiers). Out of 50 rows, 98% are null and only a single non-null value exists, giving cardinality 1 and entropy 0. The column is effectively empty and carries no discriminative signal in this sample. Treatment: Drop; 98% null with a single observed value provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1566920858_result

categorical free_text long_tail null_rate imbalance

This appears to be the OCR-extracted French ingredients text from a single product image (timestamped 1566920858), holding raw label transcriptions. The column is effectively empty: 98% null across n=50, with only one non-null value — a single French ingredient list for a butter/egg pastry product. Cardinality is 1 and entropy is 0, so it carries no discriminative signal in this sample. Treatment: Drop; 98% null and only one unique OCR string provides no modelling signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , oeufs entiers frais, crème fraîche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - lactylate de sodium, Esters et mono et diacétyltartriques des mono et diglycérides d'acides gras), protéines de lait, levure désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573107556

categorical free_text long_tail null_rate imbalance

This is a French OCR-extracted ingredients list, almost certainly a timestamped snapshot column from an Open Food Facts-style export. The column is effectively empty: 98% null across 50 rows, with only a single non-null value present, so cardinality is 1 and entropy is 0. The lone observation is a long free-text French ingredients string, not a category, despite being typed as categorical. Treatment: Drop; a single non-null value at 98% null rate carries no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573108346_result

categorical free_text long_tail null_rate imbalance

This appears to be the OCR-extracted French ingredients text from a specific scan run (timestamped 1573108346), holding raw label transcriptions like a bakery product's flour/sugar/butter list. Out of 50 rows, 98% are null and the single populated value is one long French ingredient string, giving cardinality 1 and entropy 0. The column is effectively empty for analytical purposes. Treatment: Drop; 98% null with only one populated OCR string offers no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573107560_result

categorical free_text long_tail null_rate imbalance

This column appears to hold the OCR-extracted French ingredients text from a product image (timestamp 1573107560 in the name suggests a single OCR run). Of 50 rows, 98% are null and only 1 unique value exists — a single French ingredient list for what looks like a butter/egg pastry. With cardinality 1 and entropy 0, this column carries no discriminative signal in this sample. Treatment: Drop or defer — 98% null and a single observed value provide no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573108349_result

categorical free_text long_tail null_rate imbalance

This column appears to hold the OCR-extracted French ingredients text for a product, tied to a specific OCR run timestamp (1573108349). It is essentially empty: 98% null across 50 rows, with only a single non-null value — a long French ingredients string for what looks like a butter/egg pastry. With cardinality 1 and entropy 0, it carries no discriminative signal in this sample. Treatment: Drop; 98% null and only one distinct OCR string provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573108360

categorical free_text long_tail null_rate imbalance

This appears to be a French-language OCR extraction of a product ingredients list, timestamped 1573108360 in the column name. Out of 50 rows, 98% are null and only a single non-null value exists (a single bakery product's ingredient declaration), giving cardinality 1 and entropy 0. The column is effectively empty and carries no discriminative signal. Treatment: Drop; 98% null with a single OCR string offers no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573109955_result

categorical free_text long_tail null_rate imbalance

This column appears to hold the OCR result of a French ingredients list (timestamped 1573109955), capturing the parsed text from a product label. Of 50 rows, 98% are null and only a single non-null value exists — a long French ingredients string for a butter/egg pastry product. With cardinality 1 and entropy 0, it carries essentially no information at this sample size. Treatment: Drop; 98% null with a single OCR string offers no modelling signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573108349

categorical free_text long_tail null_rate imbalance

This column appears to be a French OCR-extracted ingredients list (likely from food packaging), based on the column name and the single observed value containing French ingredient text like 'Farine de blé, sucre, beurre frais'. It is almost entirely empty: 98% null across n=50, with only one non-null record and cardinality of 1, giving zero entropy. With a single observation it carries no analytical signal in this sample. Treatment: Drop; 98% null and only one unique value provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573109955

categorical free_text long_tail null_rate imbalance

This appears to be an OCR-extracted French ingredients list (timestamped 1573109955), likely from a product packaging scan. The column is almost entirely empty: 98% null across 50 rows, with only a single non-null value present, giving cardinality 1 and entropy 0. That lone value is a long, noisy free-text string typical of raw OCR output rather than a clean categorical label. Treatment: Drop; effectively empty with only one OCR string and no analytical signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573107556_result

categorical free_text long_tail null_rate imbalance

This appears to be the OCR result of a French ingredients label (timestamped 1573107556), capturing extracted text from a product image. Of 50 rows, 98% are null and only a single non-null value exists — one French ingredient list for a butter/egg/flour pastry. With cardinality 1 and entropy 0, the column carries essentially no analytical signal. Treatment: Drop; 98% null with only one observed OCR string. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1573108360_result

categorical free_text long_tail null_rate imbalance

This appears to be the OCR-extracted French ingredients text from a timestamped scan run (1573108360), holding raw label transcriptions. With 98% nulls and only 1 non-null value across 50 rows, it is effectively empty — the single populated entry is a long French ingredient list for a butter/egg pastry product. Cardinality of 1 and entropy of 0 mean it carries no discriminative signal here. Treatment: Drop from modelling; if needed, merge with sibling OCR columns into a single ingredients_text field before NLP. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0
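Merging the sibling OCR columns into one field, as the treatment above suggests, could look like the following sketch. It assumes each row is a dict keyed by column name, picks the newest snapshot, and prefers the `_result` variant when timestamps tie. That preference is a guess that `_result` holds the cleaned text; the function name `coalesce_ocr_text` is hypothetical.

```python
import re

# Pattern inferred from the OCR column names seen in this report.
OCR_COLUMN = re.compile(r"^ingredients_text_([a-z]{2})_ocr_(\d+)(_result)?$")


def coalesce_ocr_text(row, lang="fr"):
    """Return one OCR ingredients string for a row, or None if there is none."""
    candidates = []
    for name, value in row.items():
        match = OCR_COLUMN.match(name)
        if match is None or match.group(1) != lang:
            continue
        if not isinstance(value, str) or not value.strip():
            continue  # skip nulls and blank strings
        # Tuple order makes max() pick the newest timestamp, then `_result`.
        candidates.append((int(match.group(2)), match.group(3) is not None, value))
    return max(candidates)[2] if candidates else None


row = {
    "ingredients_text_fr_ocr_1624039072": "ingrédients : cacao, émulsifiant (lécithine de _soja_), vanille.",
    "ingredients_text_fr_ocr_1624039072_result": "Cacao, émulsifiant (lécithine de _soja_), vanille.",
    "ingredients_text_fr_ocr_1573108360": None,
}
print(coalesce_ocr_text(row))
```

Applied across all rows, this collapses dozens of single-value columns into one text field per language, which is a more useful input for any later text processing than the individual snapshots.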

ingredients_text_fr_ocr_1573107560

categorical free_text long_tail null_rate imbalance

This appears to be an OCR-extracted French ingredients list (timestamped 1573107560), capturing the raw text from a product label. Out of 50 rows, 98% are null and only a single non-null value exists, an entry beginning 'Farine de blé, sucre, beurre frais 9,5%...'. With cardinality 1 and entropy 0, the column carries effectively no signal in this sample. Treatment: Drop; 98% null with a single OCR string offers no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1566920858

categorical free_text long_tail null_rate imbalance

This is a French-language OCR-extracted ingredients list, timestamped in the column name (1566920858), almost certainly from an Open Food Facts-style product dump. Out of 50 rows, 98% are null and only a single non-null value exists, a verbose ingredients string for a butter/egg pastry. With cardinality 1 and entropy 0, the column carries effectively no signal in this sample. Treatment: Drop; 98% null and only one distinct OCR string provides no usable signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Farine de blé, sucre, beurre frais 9,5 % , oeufs entiers frais, crème fraîche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- lactylate de sodium, Esters et mono et diacétyltartriques des mono et diglycérides d'acides gras), protéines de lait, levure désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

generic_name_lt

categorical metadata long_tail null_rate imbalance

This appears to be a Lithuanian-locale generic name field, but it is effectively empty: 98% of the 50 rows are null and the single non-null value is the empty string. Cardinality is 1 and entropy is 0, so the column carries no information. Treatment: Drop; the column is 98% null with a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_ro

categorical free_text long_tail null_rate imbalance

A Romanian-language ingredients-with-allergens text field, almost entirely empty in this sample. 98% of rows are null and the only non-null value observed is an empty string, giving a single unique value across n=50. Treatment: Drop; effectively no signal at this sample size. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

packaging_text_lt

categorical free_text long_tail null_rate imbalance

Lithuanian packaging text field that is effectively empty in this sample: 98% null and the single non-null value is itself an empty string, giving cardinality 1 and zero entropy. There is no usable signal here. Treatment: Drop; no observed content. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_lt

categorical free_text long_tail null_rate imbalance

This appears to be a Lithuanian-language ingredients text field, likely from a multilingual product catalog. It is effectively empty: 98% null across 50 rows, and the single non-null value is itself an empty string, giving cardinality 1 and zero entropy. Treatment: Drop; no usable signal in this sample. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

origin_lt

categorical metadata long_tail null_rate imbalance

The column 'origin_lt' is nearly entirely null, with a null_rate of 0.98 across 50 rows, leaving only a single non-null observation that is itself an empty string. With cardinality of 1, entropy of 0, and top_rate of 1.0, there is no usable signal here. Treatment: Drop; effectively empty with no variance. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

product_name_lt

categorical metadata long_tail null_rate imbalance

This column appears to be a Lithuanian-localized product name field, but it is effectively empty: 98% of the 50 rows are null and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so it carries no information. Treatment: Drop; the column is 98% null with a single empty-string value. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_with_allergens_lt

categorical free_text long_tail null_rate imbalance

This appears to be a Lithuanian-language ingredients text field with allergen annotations, but it is effectively empty in this sample. 98% of rows are null and the only non-null value observed is the empty string, giving zero entropy across n=50. Treatment: Drop from analysis; insufficient non-null content to model. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: (empty string)
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1713713129

categorical free_text long_tail null_rate imbalance

This appears to be a French-language OCR extract of an ingredients list (chocolate product), captured at a specific timestamp suggested by the column suffix. Out of 50 rows, 98% are null and only 1 non-null value exists, making the column effectively a single-record artifact rather than a usable feature. The lone value is free-form text describing cacao paste, cocoa butter, sugar, milk powder, and allergen traces. Treatment: Drop; 98% null and only one observed value provides no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Ingrédients : Pâte de cacao, cacao en poudre dégraissé, beurre de cacao, sucre, lait en poudre, pâte de amandes et de noisettes, émulsifiants (lécithines (soja, toumesol)) et arôme. Cacao 92% minimum. Peut contenir des traces d'autres fruits à coque.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0

ingredients_text_fr_ocr_1713713129_result

categorical free_text long_tail null_rate imbalance

This appears to be the OCR result of a French ingredients list, captured at a single timestamp (1713713129) and stored as raw text. With null_rate 0.98, only 1 of 50 rows has a value, and that single observation is a chocolate ingredients statement (cocoa paste, cocoa powder, almonds, hazelnuts, soy lecithin). Cardinality is 1 and entropy is 0, so there is no variation to model from this column alone. Treatment: Drop; 98% null and only one distinct OCR string provides no signal. high · anthropic:claude-opus-4-7

n: 50
nulls: 49 (98.0%)
unique: 1
top_value: Pâte de cacao, cacao en poudre dégraissé, beurre de cacao, sucre, lait en poudre, pâte de amandes et de noisettes, émulsifiants (lécithines (soja, toumesol)) et arôme. Cacao 92% minimum. Peut contenir des traces d'autres fruits à coque.
top_rate: 1
cardinality: 1
entropy: 0
entropy_ratio: 0