wild openfoodfacts sample
Reading
This is a 50-row sample from Open Food Facts with 545 columns, dominated by per-language localized fields (product names, generic names, ingredient texts, packaging texts, origin) plus nutrition, scoring, and provenance metadata. The data is extremely sparse: the vast majority of localized columns have null rates of 92–98%, so most of the analytical signal lives in a small core of fields. Look first at the Nutri-Score and NOVA distributions (the sample skews heavily toward grade 'e' and NOVA group 4), the Eco-Score grade mix, and the food_groups / pnns_groups_2 breakdown, which shows that the sample is concentrated in chocolate and biscuit products. Also note the heavy imbalance in `lang` (70% French) and `countries_lc`, which biases any text or origin analysis. Treat the hundreds of `*_xx` / `ingredients_text_
citing: column_count · row_count · nutriscore_grade · nova_groups · ecoscore_grade · food_groups · pnns_groups_2 · lang · countries_lc · ecoscore_score · nutriscore_score · completeness
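The sparsity point in the summary can be checked with a few lines of code. The following is a minimal sketch, assuming the sample is loaded as a list of dicts in which a missing field is either absent or `None`; the helper names `null_rates` and `core_columns`, the 0.5 cutoff, and the toy records are all hypothetical and are not taken from the report.

```python
def null_rates(rows, columns):
    """Fraction of rows in which each column is missing (absent or None)."""
    n = len(rows)
    if n == 0:
        return {c: 0.0 for c in columns}
    return {c: sum(1 for r in rows if r.get(c) is None) / n for c in columns}


def core_columns(rows, columns, max_null_rate=0.5):
    """Columns whose null rate does not exceed the cutoff (assumed threshold)."""
    rates = null_rates(rows, columns)
    return [c for c in columns if rates[c] <= max_null_rate]


# Toy records with made-up values, only to exercise the helpers.
rows = [
    {"lang": "fr", "nutriscore_grade": "e", "product_name_ja": None},
    {"lang": "fr", "nutriscore_grade": "d"},
    {"lang": "en", "nutriscore_grade": "e", "product_name_ja": "x"},
    {"lang": "de", "nutriscore_grade": None},
]
columns = ["lang", "nutriscore_grade", "product_name_ja"]

rates = null_rates(rows, columns)
kept = core_columns(rows, columns)
print(rates)
print(kept)
```

Filtering on null rate before any modelling keeps the sparse localized columns from dominating the feature set; the right cutoff depends on the analysis and should be chosen deliberately.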
Charts the summary said to look at first
nutriscore_grade
| value | count | share |
|---|---|---|
| e | 27 | 54.0% |
| d | 9 | 18.0% |
| c | 7 | 14.0% |
| a | 4 | 8.0% |
| b | 2 | 4.0% |
| unknown | 1 | 2.0% |
nova_groups
| value | count | share |
|---|---|---|
| 4 | 33 | 66.0% |
| 3 | 14 | 28.0% |
| 1 | 1 | 2.0% |
ecoscore_grade
| value | count | share |
|---|---|---|
| e | 12 | 24.0% |
| d | 9 | 18.0% |
| b | 8 | 16.0% |
| c | 8 | 16.0% |
| unknown | 6 | 12.0% |
| a | 3 | 6.0% |
| a-plus | 2 | 4.0% |
| not-applicable | 1 | 2.0% |
| f | 1 | 2.0% |
food_groups / pnns_groups_2
| value | count | share |
|---|---|---|
| Biscuits and cakes | 17 | 34.0% |
| Chocolate products | 16 | 32.0% |
| Appetizers | 4 | 8.0% |
| Pastries | 3 | 6.0% |
| Bread | 2 | 4.0% |
| unknown | 2 | 4.0% |
| Sweets | 2 | 4.0% |
| Dairy desserts | 1 | 2.0% |
| Waters and flavored waters | 1 | 2.0% |
| Cereals | 1 | 2.0% |
| Dried fruits | 1 | 2.0% |
lang
| value | count | share |
|---|---|---|
| fr | 35 | 70.0% |
| en | 10 | 20.0% |
| de | 3 | 6.0% |
| bg | 1 | 2.0% |
| ro | 1 | 2.0% |
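The value/count/share tables above can be reproduced with a short helper. This is a sketch under stated assumptions: `share_table` is a hypothetical name, the columns are rebuilt from the counts reported above, and missing values are assumed to count toward the denominator without being listed, which would explain why the NOVA shares sum to 96% rather than 100%.

```python
from collections import Counter


def share_table(values):
    """Return (value, count, share in %) rows, most frequent first.

    None entries are treated as missing: they are not listed, but they still
    count toward the denominator, so shares can sum to less than 100%.
    """
    total = len(values)
    counts = Counter(v for v in values if v is not None)
    return [(value, n, round(100.0 * n / total, 1))
            for value, n in counts.most_common()]


# Rebuild the `lang` column from the counts in the table above.
lang = ["fr"] * 35 + ["en"] * 10 + ["de"] * 3 + ["bg"] + ["ro"]
lang_rows = share_table(lang)
for value, n, share in lang_rows:
    print(f"| {value} | {n} | {share}% |")

# NOVA: 48 labelled rows plus 2 missing ones (assumed from the 4.0% null rate).
nova = [4] * 33 + [3] * 14 + [1] + [None] * 2
nova_rows = share_table(nova)
```

Keeping nulls in the denominator makes shares comparable across columns with different null rates; dividing by the non-null count instead would report NOVA group 4 as 68.8% rather than 66.0%.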
Schema
545 columns
| Column | Type | Null rate | Distinct | Alerts |
|---|---|---|---|---|
| update_key | categorical | 0.0% | 9 | long_tail |
| categories_old | categorical | 2.0% | 45 | long_tail |
| ecoscore_score | numeric | 14.0% | 31 | |
| environment_impact_level | categorical | 56.0% | 1 | null_rate, imbalance |
| ingredients_text_fi | categorical | 90.0% | 4 | long_tail, null_rate |
| nutrition_data_prepared | categorical | 4.0% | 1 | imbalance |
| packaging_shapes_tags | unknown | 0.0% | — | skipped |
| nutrient_levels_tags | unknown | 0.0% | — | skipped |
| packagings_materials | unknown | 0.0% | — | skipped |
| ingredients_without_ecobalyse_ids | unknown | 0.0% | — | skipped |
| generic_name_nl | categorical | 76.0% | 4 | long_tail, null_rate |
| misc_tags | unknown | 0.0% | — | skipped |
| product_name_sv | categorical | 92.0% | 4 | long_tail, null_rate |
| scans_n | numeric | 0.0% | 49 | high_skew, outliers |
| schema_version | numeric | 0.0% | 1 | constant |
| url | categorical | 0.0% | 50 | long_tail |
| vitamins_tags | unknown | 0.0% | — | skipped |
| debug_param_sorted_langs | unknown | 0.0% | — | skipped |
| packaging | categorical | 12.0% | 41 | long_tail |
| grades | unknown | 0.0% | — | skipped |
| last_modified_t | numeric | 0.0% | 50 | outliers |
| origin_nl | categorical | 76.0% | 1 | null_rate, imbalance |
| allergens_lc | categorical | 4.0% | 6 | |
| states_hierarchy | unknown | 0.0% | — | skipped |
| ingredients_text_ja | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| teams_tags | unknown | 0.0% | — | skipped |
| traces_from_user | categorical | 0.0% | 35 | long_tail |
| origins_tags | unknown | 0.0% | — | skipped |
| serving_quantity_unit | categorical | 8.0% | 2 | imbalance |
| vitamins_prev_tags | unknown | 0.0% | — | skipped |
| ingredients_hierarchy | unknown | 0.0% | — | skipped |
| unique_scans_n | numeric | 0.0% | 48 | high_skew, outliers |
| labels | categorical | 2.0% | 42 | long_tail |
| generic_name_en | categorical | 14.0% | 8 | long_tail |
| weighters_tags | unknown | 0.0% | — | skipped |
| popularity_tags | unknown | 0.0% | — | skipped |
| product_name_fi | categorical | 90.0% | 4 | long_tail, null_rate |
| origin_fr | categorical | 8.0% | 7 | long_tail |
| generic_name | categorical | 4.0% | 28 | long_tail |
| nutriscore_version | categorical | 0.0% | 1 | imbalance |
| ingredients_without_ciqual_codes | unknown | 0.0% | — | skipped |
| manufacturing_places_tags | unknown | 0.0% | — | skipped |
| photographers_tags | unknown | 0.0% | — | skipped |
| packaging_text_pl | categorical | 90.0% | 1 | null_rate, imbalance |
| informers_tags | unknown | 0.0% | — | skipped |
| ingredients_text_en | categorical | 12.0% | 36 | long_tail |
| ingredients_text_it | categorical | 68.0% | 12 | long_tail, null_rate |
| origin_de | categorical | 60.0% | 1 | null_rate, imbalance |
| nova_group | numeric | 4.0% | 3 | high_skew |
| packaging_text_fi | categorical | 90.0% | 1 | null_rate, imbalance |
| states | categorical | 0.0% | 26 | long_tail |
| ingredients_with_unspecified_percent_sum | numeric | 0.0% | 22 | |
| added_countries_tags | unknown | 0.0% | — | skipped |
| id | categorical | 0.0% | 50 | long_tail |
| nutrient_levels | unknown | 0.0% | — | skipped |
| sortkey | numeric | 12.0% | 44 | high_skew, outliers |
| image_small_url | categorical | 0.0% | 50 | long_tail |
| packaging_recycling_tags | unknown | 0.0% | — | skipped |
| food_groups | categorical | 2.0% | 11 | |
| nova_groups_markers | unknown | 0.0% | — | skipped |
| packaging_text_de | categorical | 60.0% | 2 | null_rate |
| categories_lc | categorical | 0.0% | 6 | |
| checkers | unknown | 0.0% | — | skipped |
| packaging_text_es | categorical | 60.0% | 2 | null_rate |
| unknown_nutrients_tags | unknown | 0.0% | — | skipped |
| editors_tags | unknown | 0.0% | — | skipped |
| nutrition_score_warning_fruits_vegetables_nuts_estimate_from_ingredients | numeric | 10.0% | 1 | constant |
| labels_lc | categorical | 2.0% | 6 | |
| nutriscore_data | unknown | 0.0% | — | skipped |
| other_nutritional_substances_tags | unknown | 0.0% | — | skipped |
| product_name_nb | categorical | 96.0% | 2 | long_tail, null_rate |
| nutrition_data_prepared_per | categorical | 0.0% | 1 | imbalance |
| product_quantity | categorical | 6.0% | 27 | long_tail |
| product_type | categorical | 0.0% | 1 | imbalance |
| checkers_tags | unknown | 0.0% | — | skipped |
| nucleotides_tags | unknown | 0.0% | — | skipped |
| languages_tags | unknown | 0.0% | — | skipped |
| traces_lc | categorical | 4.0% | 6 | |
| categories_hierarchy | unknown | 0.0% | — | skipped |
| image_front_small_url | categorical | 0.0% | 50 | long_tail |
| entry_dates_tags | unknown | 0.0% | — | skipped |
| ecoscore_tags | unknown | 0.0% | — | skipped |
| nutrition_score_warning_fruits_vegetables_legumes_estimate_from_ingredients | numeric | 8.0% | 1 | constant |
| ingredients_without_ciqual_codes_n | numeric | 0.0% | 15 | |
| rev | numeric | 0.0% | 46 | |
| ingredients_non_nutritive_sweeteners_n | numeric | 0.0% | 1 | constant |
| ingredients_without_ecobalyse_ids_n | numeric | 0.0% | 20 | |
| environment_impact_level_tags | unknown | 0.0% | — | skipped |
| last_image_dates_tags | unknown | 0.0% | — | skipped |
| labels_hierarchy | unknown | 0.0% | — | skipped |
| product_name_en | categorical | 14.0% | 34 | long_tail |
| nutrition_score_warning_fruits_vegetables_legumes_estimate_from_ingredients_value | numeric | 8.0% | 6 | high_skew, outliers |
| traces | categorical | 0.0% | 23 | long_tail |
| generic_name_fi | categorical | 90.0% | 5 | long_tail, null_rate |
| emb_codes_orig | categorical | 34.0% | 5 | long_tail, null_rate |
| ingredients_with_specified_percent_n | numeric | 0.0% | 7 | |
| nutrition_grades | categorical | 0.0% | 6 | |
| weighers_tags | unknown | 0.0% | — | skipped |
| categories_tags | unknown | 0.0% | — | skipped |
| image_url | categorical | 0.0% | 50 | long_tail |
| sources | unknown | 0.0% | — | skipped |
| languages_hierarchy | unknown | 0.0% | — | skipped |
| pnns_groups_1 | categorical | 0.0% | 7 | |
| countries_lc | categorical | 2.0% | 6 | |
| additives_tags | unknown | 0.0% | — | skipped |
| codes_tags | unknown | 0.0% | — | skipped |
| countries_tags | unknown | 0.0% | — | skipped |
| creator | categorical | 0.0% | 13 | long_tail |
| ingredients | unknown | 0.0% | — | skipped |
| product_name_nl | categorical | 76.0% | 7 | long_tail, null_rate |
| ingredients_n_tags | unknown | 0.0% | — | skipped |
| origin_es | categorical | 60.0% | 1 | null_rate, imbalance |
| product_name_pl | categorical | 90.0% | 3 | long_tail, null_rate |
| scores | unknown | 0.0% | — | skipped |
| brands | categorical | 0.0% | 41 | long_tail |
| ingredients_text_de | categorical | 60.0% | 16 | long_tail, null_rate |
| ingredients_text_nb | categorical | 96.0% | 1 | null_rate, imbalance |
| packagings_n | numeric | 18.0% | 5 | outliers |
| complete | numeric | 0.0% | 2 | |
| emb_codes_20141016 | categorical | 58.0% | 7 | long_tail, null_rate |
| ingredients_tags | unknown | 0.0% | — | skipped |
| packaging_text_ja | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_de | categorical | 60.0% | 9 | long_tail, null_rate |
| last_editor | categorical | 2.0% | 24 | long_tail |
| minerals_prev_tags | unknown | 0.0% | — | skipped |
| last_image_t | numeric | 0.0% | 50 | high_skew |
| obsolete_since_date | categorical | 12.0% | 1 | imbalance |
| pnns_groups_2_tags | unknown | 0.0% | — | skipped |
| emb_codes_tags | unknown | 0.0% | — | skipped |
| countries_beforescanbot | categorical | 14.0% | 38 | long_tail |
| nutrition_grade_fr | categorical | 0.0% | 6 | |
| data_quality_tags | unknown | 0.0% | — | skipped |
| ingredients_with_specified_percent_sum | numeric | 0.0% | 22 | |
| origin_it | categorical | 68.0% | 1 | null_rate, imbalance |
| nutrition_data_per | categorical | 0.0% | 2 | |
| origin_pl | categorical | 90.0% | 1 | null_rate, imbalance |
| product | unknown | 0.0% | — | skipped |
| link | categorical | 4.0% | 28 | long_tail |
| ingredients_text_nl | categorical | 76.0% | 9 | long_tail, null_rate |
| additives_n | numeric | 0.0% | 8 | |
| generic_name_sv | categorical | 92.0% | 4 | long_tail, null_rate |
| ingredients_that_may_be_from_palm_oil_tags | unknown | 0.0% | — | skipped |
| known_ingredients_n | numeric | 0.0% | 22 | |
| completeness | numeric | 0.0% | 14 | outliers |
| ingredients_sweeteners_n | numeric | 0.0% | 1 | constant |
| nova_groups | categorical | 4.0% | 3 | |
| allergens_hierarchy | unknown | 0.0% | — | skipped |
| obsolete | categorical | 12.0% | 1 | imbalance |
| origin_sv | categorical | 92.0% | 1 | null_rate, imbalance |
| packaging_hierarchy | unknown | 0.0% | — | skipped |
| ingredients_with_unspecified_percent_n | numeric | 0.0% | 18 | |
| fruits-vegetables-nuts_100g_estimate | numeric | 46.0% | 2 | null_rate, high_skew |
| emb_codes | categorical | 4.0% | 11 | long_tail |
| packagings | unknown | 0.0% | — | skipped |
| purchase_places_tags | unknown | 0.0% | — | skipped |
| additives_original_tags | unknown | 0.0% | — | skipped |
| image_front_url | categorical | 0.0% | 50 | long_tail |
| data_quality_bugs_tags | unknown | 0.0% | — | skipped |
| origin_fi | categorical | 90.0% | 1 | null_rate, imbalance |
| images | unknown | 0.0% | — | skipped |
| ingredients_analysis | unknown | 0.0% | — | skipped |
| ingredients_text_with_allergens_pl | categorical | 92.0% | 3 | long_tail, null_rate |
| product_name_de | categorical | 60.0% | 16 | long_tail, null_rate |
| ingredients_text_with_allergens_nb | categorical | 96.0% | 1 | null_rate, imbalance |
| packaging_text_it | categorical | 68.0% | 3 | long_tail, null_rate |
| product_name_it | categorical | 68.0% | 12 | long_tail, null_rate |
| serving_quantity | categorical | 12.0% | 27 | long_tail |
| product_name_ja | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_with_allergens_sv | categorical | 92.0% | 4 | long_tail, null_rate |
| allergens_tags | unknown | 0.0% | — | skipped |
| ingredients_text_fr | categorical | 4.0% | 47 | long_tail |
| nutrition_score_beverage | numeric | 0.0% | 2 | high_skew |
| ingredients_ids_debug | unknown | 0.0% | — | skipped |
| nutrition_data | categorical | 2.0% | 1 | imbalance |
| origin_ja | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| packaging_text_en | categorical | 14.0% | 5 | long_tail |
| unknown_ingredients_n | numeric | 0.0% | 6 | high_skew, outliers |
| ingredients_from_palm_oil_tags | unknown | 0.0% | — | skipped |
| labels_tags | unknown | 0.0% | — | skipped |
| packaging_old_before_taxonomization | categorical | 24.0% | 36 | long_tail, null_rate |
| packaging_text_nb | categorical | 96.0% | 1 | null_rate, imbalance |
| nutrition_grades_tags | unknown | 0.0% | — | skipped |
| category_properties | unknown | 0.0% | — | skipped |
| nutriscore_score | numeric | 2.0% | 28 | |
| packaging_tags | unknown | 0.0% | — | skipped |
| labels_old | categorical | 8.0% | 38 | long_tail |
| packaging_text | categorical | 4.0% | 13 | long_tail |
| ingredients_percent_analysis | numeric | 0.0% | 2 | high_skew, outliers |
| ecoscore_data | unknown | 0.0% | — | skipped |
| ingredients_text_sv | categorical | 92.0% | 4 | long_tail, null_rate |
| brands_tags | unknown | 0.0% | — | skipped |
| compared_to_category | categorical | 0.0% | 35 | long_tail |
| data_sources | categorical | 0.0% | 43 | long_tail |
| other_nutritional_substances_prev_tags | unknown | 0.0% | — | skipped |
| ingredients_from_palm_oil_n | numeric | 8.0% | 2 | outliers |
| last_updated_t | numeric | 0.0% | 50 | outliers |
| nutrition_score_debug | categorical | 0.0% | 2 | imbalance |
| popularity_key | numeric | 0.0% | 49 | high_skew, outliers |
| product_name_es | categorical | 60.0% | 17 | long_tail, null_rate |
| allergens_from_user | categorical | 0.0% | 34 | long_tail |
| informers | unknown | 0.0% | — | skipped |
| brands_old | categorical | 32.0% | 29 | long_tail, null_rate |
| data_quality_errors_tags | unknown | 0.0% | — | skipped |
| ingredients_text | categorical | 0.0% | 50 | long_tail |
| categories | categorical | 0.0% | 46 | long_tail |
| nutrition_score_warning_fruits_vegetables_nuts_estimate_from_ingredients_value | numeric | 10.0% | 13 | high_skew, outliers |
| ingredients_from_or_that_may_be_from_palm_oil_n | numeric | 6.0% | 3 | |
| origins_old | categorical | 22.0% | 9 | long_tail, null_rate |
| packaging_text_nl | categorical | 76.0% | 1 | null_rate, imbalance |
| expiration_date | categorical | 4.0% | 34 | long_tail |
| selected_images | unknown | 0.0% | — | skipped |
| traces_from_ingredients | categorical | 0.0% | 12 | long_tail |
| ingredients_text_with_allergens | categorical | 0.0% | 50 | long_tail |
| image_front_thumb_url | categorical | 0.0% | 50 | long_tail |
| lc | categorical | 0.0% | 5 | |
| ingredients_text_debug | categorical | 28.0% | 35 | long_tail, null_rate |
| packagings_materials_main | categorical | 62.0% | 3 | null_rate |
| data_quality_dimensions | unknown | 0.0% | — | skipped |
| serving_size | categorical | 12.0% | 37 | long_tail |
| pnns_groups_1_tags | unknown | 0.0% | — | skipped |
| origin | categorical | 6.0% | 6 | long_tail |
| ingredients_lc | categorical | 0.0% | 4 | |
| packaging_old | categorical | 14.0% | 40 | long_tail |
| packaging_text_fr | categorical | 6.0% | 14 | long_tail |
| nova_group_debug | categorical | 0.0% | 3 | long_tail, imbalance |
| ingredients_original_tags | unknown | 0.0% | — | skipped |
| data_quality_completeness_tags | unknown | 0.0% | — | skipped |
| cities_tags | unknown | 0.0% | — | skipped |
| countries_hierarchy | unknown | 0.0% | — | skipped |
| nutriscore_score_opposite | numeric | 2.0% | 28 | |
| categories_properties_tags | unknown | 0.0% | — | skipped |
| origins_lc | categorical | 4.0% | 6 | |
| ciqual_food_name_tags | unknown | 0.0% | — | skipped |
| countries | categorical | 0.0% | 43 | long_tail |
| ingredients_text_with_allergens_it | categorical | 68.0% | 12 | long_tail, null_rate |
| packaging_lc | categorical | 12.0% | 7 | |
| correctors_tags | unknown | 0.0% | — | skipped |
| interface_version_created | categorical | 2.0% | 3 | |
| states_tags | unknown | 0.0% | — | skipped |
| nutriscore_2021_tags | unknown | 0.0% | — | skipped |
| stores_tags | unknown | 0.0% | — | skipped |
| image_thumb_url | categorical | 0.0% | 50 | long_tail |
| categories_properties | unknown | 0.0% | — | skipped |
| nucleotides_prev_tags | unknown | 0.0% | — | skipped |
| allergens_from_ingredients | categorical | 0.0% | 35 | long_tail |
| ingredients_text_with_allergens_fi | categorical | 90.0% | 4 | long_tail, null_rate |
| _keywords | unknown | 0.0% | — | skipped |
| manufacturing_places | categorical | 2.0% | 20 | long_tail |
| pnns_groups_2 | categorical | 0.0% | 11 | |
| ingredients_text_pl | categorical | 90.0% | 3 | long_tail, null_rate |
| generic_name_es | categorical | 60.0% | 7 | long_tail, null_rate |
| origin_en | categorical | 14.0% | 2 | imbalance |
| generic_name_it | categorical | 68.0% | 5 | long_tail, null_rate |
| ingredients_that_may_be_from_palm_oil_n | numeric | 8.0% | 3 | high_skew, outliers |
| ingredients_text_es | categorical | 60.0% | 13 | long_tail, null_rate |
| teams | categorical | 8.0% | 39 | long_tail |
| food_groups_tags | unknown | 0.0% | — | skipped |
| data_quality_warnings_tags | unknown | 0.0% | — | skipped |
| debug_tags | unknown | 0.0% | — | skipped |
| main_countries_tags | unknown | 0.0% | — | skipped |
| origins_hierarchy | unknown | 0.0% | — | skipped |
| packagings_complete | numeric | 4.0% | 2 | |
| nutriscore_tags | unknown | 0.0% | — | skipped |
| ingredients_text_with_allergens_nl | categorical | 78.0% | 9 | long_tail, null_rate |
| created_t | numeric | 0.0% | 50 | |
| traces_hierarchy | unknown | 0.0% | — | skipped |
| generic_name_nb | categorical | 96.0% | 1 | null_rate, imbalance |
| ingredients_text_with_allergens_de | categorical | 66.0% | 16 | long_tail, null_rate |
| ingredients_text_with_allergens_es | categorical | 62.0% | 13 | long_tail, null_rate |
| product_name_fr | categorical | 2.0% | 47 | long_tail |
| stores | categorical | 4.0% | 31 | long_tail |
| _id | categorical | 0.0% | 50 | long_tail |
| nutriments | unknown | 0.0% | — | skipped |
| editors | unknown | 0.0% | — | skipped |
| max_imgid | categorical | 0.0% | 38 | long_tail |
| nutriscore_grade | categorical | 0.0% | 6 | |
| product_quantity_unit | categorical | 10.0% | 2 | imbalance |
| ingredients_analysis_tags | unknown | 0.0% | — | skipped |
| ingredients_text_with_allergens_fr | categorical | 4.0% | 47 | long_tail |
| interface_version_modified | categorical | 0.0% | 2 | |
| data_sources_tags | unknown | 0.0% | — | skipped |
| ingredients_text_with_allergens_en | categorical | 16.0% | 36 | long_tail |
| removed_countries_tags | unknown | 0.0% | — | skipped |
| amino_acids_prev_tags | unknown | 0.0% | — | skipped |
| code | categorical | 0.0% | 50 | long_tail |
| correctors | unknown | 0.0% | — | skipped |
| generic_name_ja | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_fr | categorical | 6.0% | 34 | long_tail |
| generic_name_pl | categorical | 90.0% | 2 | null_rate |
| amino_acids_tags | unknown | 0.0% | — | skipped |
| ingredients_debug | unknown | 0.0% | — | skipped |
| ingredients_text_with_allergens_ja | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| data_quality_info_tags | unknown | 0.0% | — | skipped |
| last_edit_dates_tags | unknown | 0.0% | — | skipped |
| last_modified_by | categorical | 2.0% | 24 | long_tail |
| no_nutrition_data | categorical | 4.0% | 1 | imbalance |
| nutriscore | unknown | 0.0% | — | skipped |
| origin_nb | categorical | 96.0% | 1 | null_rate, imbalance |
| origins | categorical | 4.0% | 20 | long_tail |
| nova_groups_tags | unknown | 0.0% | — | skipped |
| languages | unknown | 0.0% | — | skipped |
| nutriscore_2023_tags | unknown | 0.0% | — | skipped |
| packaging_materials_tags | unknown | 0.0% | — | skipped |
| lang | categorical | 0.0% | 5 | |
| packaging_text_sv | categorical | 92.0% | 1 | null_rate, imbalance |
| photographers | unknown | 0.0% | — | skipped |
| languages_codes | unknown | 0.0% | — | skipped |
| ecoscore_grade | categorical | 0.0% | 9 | |
| ingredients_n | numeric | 0.0% | 22 | |
| allergens | categorical | 0.0% | 16 | |
| minerals_tags | unknown | 0.0% | — | skipped |
| product_name | categorical | 0.0% | 49 | long_tail |
| purchase_places | categorical | 2.0% | 32 | long_tail |
| quantity | categorical | 2.0% | 36 | long_tail |
| traces_tags | unknown | 0.0% | — | skipped |
| origin_uk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_ar | categorical | 80.0% | 2 | null_rate |
| packaging_text_uk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_ar | categorical | 78.0% | 2 | null_rate |
| ingredients_text_uk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| last_check_dates_tags | unknown | 0.0% | — | skipped |
| checked | categorical | 86.0% | 1 | null_rate, imbalance |
| packaging_text_ar | categorical | 80.0% | 1 | null_rate, imbalance |
| carbon_footprint_percent_of_known_ingredients | numeric | 62.0% | 19 | null_rate |
| last_checker | categorical | 86.0% | 4 | null_rate |
| product_name_uk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_uk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| product_name_ar | categorical | 78.0% | 6 | long_tail, null_rate |
| carbon_footprint_from_known_ingredients_debug | categorical | 72.0% | 14 | long_tail, null_rate |
| last_checked_t | numeric | 86.0% | 7 | null_rate |
| ingredients_text_with_allergens_uk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_with_allergens_ar | categorical | 82.0% | 2 | null_rate |
| origin_ar | categorical | 80.0% | 1 | null_rate, imbalance |
| nutriments_estimated | unknown | 0.0% | — | skipped |
| nutrition_score_warning_no_fiber | numeric | 70.0% | 1 | null_rate, constant |
| ingredients_text_debug_tags | unknown | 0.0% | — | skipped |
| taxonomies_enhancer_tags | unknown | 0.0% | — | skipped |
| completed_t | numeric | 68.0% | 16 | null_rate |
| product_name_bg | categorical | 94.0% | 3 | long_tail, null_rate |
| ingredients_text_et | categorical | 94.0% | 3 | long_tail, null_rate |
| origin_sl | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_dz | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_sl | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_ca | categorical | 96.0% | 1 | null_rate, imbalance |
| ingredients_text_dz | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| product_name_ca | categorical | 96.0% | 1 | null_rate, imbalance |
| origin_ca | categorical | 96.0% | 1 | null_rate, imbalance |
| product_name_et | categorical | 94.0% | 3 | long_tail, null_rate |
| ingredients_text_with_allergens_bg | categorical | 94.0% | 3 | long_tail, null_rate |
| ingredients_text_with_allergens_et | categorical | 94.0% | 3 | long_tail, null_rate |
| origin_sk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| origin_bg | categorical | 94.0% | 1 | null_rate, imbalance |
| packaging_text_sl | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_sk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_with_allergens_sl | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_ca | categorical | 96.0% | 1 | null_rate, imbalance |
| generic_name_sl | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| product_name_dz | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| origin_et | categorical | 94.0% | 1 | null_rate, imbalance |
| ingredients_text_with_allergens_sk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| product_name_sk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_with_allergens_pt | categorical | 84.0% | 4 | long_tail, null_rate |
| ingredients_text_with_allergens_ca | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_pt | categorical | 80.0% | 3 | long_tail, null_rate |
| packaging_text_pt | categorical | 80.0% | 1 | null_rate, imbalance |
| ingredients_text_pt | categorical | 80.0% | 4 | long_tail, null_rate |
| origin_pt | categorical | 80.0% | 1 | null_rate, imbalance |
| nutrition_score_warning_nutriments_estimated | numeric | 96.0% | 1 | null_rate, constant |
| packaging_text_bg | categorical | 94.0% | 1 | null_rate, imbalance |
| generic_name_et | categorical | 94.0% | 1 | null_rate, imbalance |
| packaging_text_ca | categorical | 96.0% | 1 | null_rate, imbalance |
| product_name_sl | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_bg | categorical | 94.0% | 1 | null_rate, imbalance |
| ingredients_text_sk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_bg | categorical | 94.0% | 3 | long_tail, null_rate |
| packaging_text_et | categorical | 94.0% | 1 | null_rate, imbalance |
| packaging_text_sk | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| product_name_pt | categorical | 80.0% | 7 | long_tail, null_rate |
| abbreviated_product_name_fr | categorical | 86.0% | 7 | long_tail, null_rate |
| obsolete_imported | categorical | 86.0% | 1 | null_rate, imbalance |
| sources_fields | unknown | 0.0% | — | skipped |
| emb_code | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| lang_imported | categorical | 86.0% | 1 | null_rate, imbalance |
| generic_name_zh | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| conservation_conditions_fr_imported | categorical | 86.0% | 7 | long_tail, null_rate |
| origin_fr_imported | categorical | 96.0% | 2 | long_tail, null_rate |
| owner | categorical | 86.0% | 6 | long_tail, null_rate |
| ingredients_text_fr_imported | categorical | 86.0% | 7 | long_tail, null_rate |
| owners_tags | categorical | 86.0% | 6 | long_tail, null_rate |
| product_name_zh | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| nutrition_data_prepared_per_imported | categorical | 86.0% | 1 | null_rate, imbalance |
| abbreviated_product_name_fr_imported | categorical | 86.0% | 7 | long_tail, null_rate |
| generic_name_zh_debug_tags | unknown | 0.0% | — | skipped |
| customer_service_fr | categorical | 86.0% | 6 | long_tail, null_rate |
| customer_service_fr_imported | categorical | 86.0% | 6 | long_tail, null_rate |
| ingredients_text_zh_debug_tags | unknown | 0.0% | — | skipped |
| product_name_fr_imported | categorical | 86.0% | 7 | long_tail, null_rate |
| brands_imported | categorical | 86.0% | 6 | long_tail, null_rate |
| owner_imported | categorical | 88.0% | 5 | long_tail, null_rate |
| product_name_zh_debug_tags | unknown | 0.0% | — | skipped |
| lc_imported | categorical | 84.0% | 2 | null_rate |
| ingredients_text_zh | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| quantity_imported | categorical | 86.0% | 7 | long_tail, null_rate |
| nutrition_data_per_imported | categorical | 84.0% | 1 | null_rate, imbalance |
| generic_name_fr_imported | categorical | 86.0% | 7 | long_tail, null_rate |
| owner_fields | unknown | 0.0% | — | skipped |
| categories_imported | categorical | 88.0% | 5 | long_tail, null_rate |
| conservation_conditions_fr | categorical | 86.0% | 7 | long_tail, null_rate |
| conservation_conditions | categorical | 86.0% | 7 | long_tail, null_rate |
| countries_imported | categorical | 84.0% | 2 | null_rate |
| origins_fr | categorical | 96.0% | 2 | long_tail, null_rate |
| abbreviated_product_name | categorical | 86.0% | 7 | long_tail, null_rate |
| customer_service | categorical | 86.0% | 6 | long_tail, null_rate |
| data_sources_imported | categorical | 84.0% | 8 | long_tail, null_rate |
| nova_group_error | categorical | 96.0% | 1 | null_rate, imbalance |
| ingredients_text_de_ocr_1648897071_result | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| packaging_text_ro | categorical | 96.0% | 1 | null_rate, imbalance |
| product_name_ro | categorical | 96.0% | 2 | long_tail, null_rate |
| producer_version_id | categorical | 92.0% | 3 | long_tail, null_rate |
| serving_size_imported | categorical | 88.0% | 6 | long_tail, null_rate |
| no_nutrition_data_imported | categorical | 92.0% | 1 | null_rate, imbalance |
| packaging_imported | categorical | 92.0% | 2 | null_rate |
| ingredients_text_ro | categorical | 96.0% | 1 | null_rate, imbalance |
| producer_version_id_imported | categorical | 92.0% | 3 | long_tail, null_rate |
| labels_imported | categorical | 90.0% | 3 | long_tail, null_rate |
| ingredients_text_de_ocr_1648990410_result | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| allergens_imported | categorical | 90.0% | 4 | long_tail, null_rate |
| ingredients_text_de_ocr_1648990410 | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_de_ocr_1648897071 | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_ro | categorical | 96.0% | 1 | null_rate, imbalance |
| origin_ro | categorical | 96.0% | 1 | null_rate, imbalance |
| abbreviated_product_name_imported | categorical | 94.0% | 3 | long_tail, null_rate |
| traces_imported | categorical | 92.0% | 4 | long_tail, null_rate |
| specific_ingredients | unknown | 0.0% | — | skipped |
| product_name_ru | categorical | 94.0% | 2 | null_rate |
| origin_ru | categorical | 94.0% | 1 | null_rate, imbalance |
| ingredients_text_with_allergens_ru | categorical | 94.0% | 1 | null_rate, imbalance |
| packaging_text_ru | categorical | 94.0% | 1 | null_rate, imbalance |
| generic_name_ru | categorical | 94.0% | 2 | null_rate |
| ingredients_text_ru | categorical | 94.0% | 1 | null_rate, imbalance |
| ingredients_text_da | categorical | 96.0% | 2 | long_tail, null_rate |
| ingredients_text_with_allergens_da | categorical | 96.0% | 2 | long_tail, null_rate |
| product_name_da | categorical | 96.0% | 2 | long_tail, null_rate |
| packaging_text_da | categorical | 96.0% | 1 | null_rate, imbalance |
| generic_name_da | categorical | 96.0% | 2 | long_tail, null_rate |
| forest_footprint_data | unknown | 0.0% | — | skipped |
| origin_da | categorical | 96.0% | 1 | null_rate, imbalance |
| origin_sr | categorical | 96.0% | 1 | null_rate, imbalance |
| ingredients_text_nl_ocr_1675675383_result | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_cs | categorical | 94.0% | 2 | null_rate |
| product_name_cs | categorical | 94.0% | 2 | null_rate |
| origin_hu | categorical | 92.0% | 1 | null_rate, imbalance |
| packaging_text_hu | categorical | 92.0% | 1 | null_rate, imbalance |
| origin_cs | categorical | 96.0% | 1 | null_rate, imbalance |
| ingredients_text_with_allergens_hu | categorical | 94.0% | 3 | long_tail, null_rate |
| generic_name_cs | categorical | 94.0% | 1 | null_rate, imbalance |
| ingredients_text_hu | categorical | 92.0% | 4 | long_tail, null_rate |
| ingredients_text_sr | categorical | 96.0% | 2 | long_tail, null_rate |
| packaging_text_sr | categorical | 96.0% | 1 | null_rate, imbalance |
| ingredients_text_nl_ocr_1675675383 | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_with_allergens_cs | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_sr | categorical | 96.0% | 2 | long_tail, null_rate |
| packaging_text_cs | categorical | 94.0% | 1 | null_rate, imbalance |
| product_name_sr | categorical | 96.0% | 2 | long_tail, null_rate |
| ingredients_text_hu_ocr_1571428260_result | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_hu_ocr_1571428260 | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_hu | categorical | 92.0% | 2 | null_rate |
| product_name_hu | categorical | 92.0% | 3 | long_tail, null_rate |
| ingredients_text_with_allergens_sr | categorical | 96.0% | 2 | long_tail, null_rate |
| ingredients_text_es_ocr_1548767061_result | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| product_name_xx | categorical | 96.0% | 1 | null_rate, imbalance |
| generic_name_xx | categorical | 96.0% | 1 | null_rate, imbalance |
| ingredients_text_es_ocr_1548767061 | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_xx | categorical | 96.0% | 1 | null_rate, imbalance |
| origin_xx | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| packaging_text_xx | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_ur | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| product_name_ur | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| origin_he | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| product_name_he | categorical | 96.0% | 2 | long_tail, null_rate |
| origin_ur | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_ur | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| packaging_text_he | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_he | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| packaging_text_ur | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_he | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_with_allergens_he | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| nutriscore_grade_producer | categorical | 94.0% | 3 | long_tail, null_rate |
| nutriscore_grade_producer_imported | categorical | 94.0% | 3 | long_tail, null_rate |
| packaging_text_el | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_with_allergens_el | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| ingredients_text_el | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_el | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| origin_el | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| product_name_el | categorical | 98.0% | 1 | long_tail, null_rate, imbalance |
| generic_name_th | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_de_ocr_1559410715_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_with_allergens_th | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| packaging_text_th | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| product_name_th | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_de_ocr_1548767354_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_th | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| origin_th | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_de_ocr_1548767354 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_de_ocr_1559410715 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_it_ocr_1559410715 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_it_ocr_1559410715_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| packaging_text_fr_imported | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| preparation_fr_imported | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| preparation | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| preparation_fr | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_lc | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| product_name_lc | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_with_allergens_lc | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| generic_name_lc | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_xx_debug_tags | unknown | 0.0% | — |
skipped
|
| product_name_xx_debug_tags | unknown | 0.0% | — |
skipped
|
| generic_name_xx_debug_tags | unknown | 0.0% | — |
skipped
|
| ingredients_text_fr_ocr_1561814324 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1561814324_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1624039072_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1624039072 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573108346 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1566920858_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573107556 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573108346_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573107560_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573108349_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573108360 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573109955_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573108349 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573109955 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573107556_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573108360_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1573107560 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1566920858 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| generic_name_lt | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_with_allergens_ro | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| packaging_text_lt | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_lt | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| origin_lt | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| product_name_lt | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_with_allergens_lt | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1713713129 | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
| ingredients_text_fr_ocr_1713713129_result | categorical | 98.0% | 1 |
long_tail
null_rate
imbalance
|
update_key
categorical metadata long_tail
A categorical update_key field with only 9 distinct values across 50 rows, dominated by 'brands' at 56% (28/50) and 'sort' at 20%. The long tail mixes human-readable labels ('divinfood', 'nova-yogurts', 'germany2', 'france') with timestamp-style tokens ('key_1748337248', 'ingredients20240805'), suggesting inconsistent naming conventions for what appears to track update batches or jobs. An entropy ratio of 0.64 confirms the heavy concentration on a few keys. Treatment: Group rare keys into 'other' or normalize naming before using as a grouping dimension.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 9
- top_value
- brands
- top_rate
- 0.56
- cardinality
- 9
- entropy
- 2.015
- entropy_ratio
- 0.6357
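The entropy ratio reported here appears to be the Shannon entropy in bits divided by log2 of the cardinality. That formula is inferred from the numbers (2.015 / log2(9) ≈ 0.6357) rather than documented by the profiler. A minimal Python sketch of that calculation, together with the rare-key grouping suggested in the treatment, could look like this; the function names and the 10% default threshold are illustrative assumptions.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    if k <= 1:
        return 0.0  # a constant column has no entropy to normalise
    return entropy_bits(values) / math.log2(k)


def group_rare(values, min_share=0.10, other="other"):
    """Replace every value whose share is below min_share with `other`."""
    counts = Counter(values)
    n = len(values)
    keep = {v for v, c in counts.items() if c / n >= min_share}
    return [v if v in keep else other for v in values]


# Cross-check against the figures reported for update_key.
reported_ratio = 2.015 / math.log2(9)
```

Applied to update_key, 'brands' and 'sort' clear any threshold up to 20%; whether any other key survives depends on counts the report does not list.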
categories_old
categorical feature long_tail
Hierarchical product category strings (Open Food Facts style taxonomy paths), stored as comma-separated breadcrumbs. Near-unique with 45 distinct values across 50 rows and entropy ratio 0.99, and the strings appear in mixed languages (French, English, Polish, Bulgarian Cyrillic), so direct grouping will fragment. The top value covers only 4% of non-null rows (2 of 49) and one row is null. Treatment: Split on commas, normalise language, and keep only the top-level taxon as a categorical feature.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 45
- top_value
- Snacks, Snacks sucrés, Biscuits et gâteaux, Biscuits, Biscuits secs
- top_rate
- 0.04082
- cardinality
- 45
- entropy
- 5.451
- entropy_ratio
- 0.9926
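The split-and-keep-top-level step from the treatment can be sketched as follows. This is a minimal illustration, not the report's own pipeline: `top_level_taxon` is a hypothetical helper, and it deliberately does not attempt language normalisation, which would need a translation or taxonomy lookup that the report does not describe.

```python
def top_level_taxon(path):
    """Return the first non-empty breadcrumb of a comma-separated category path."""
    if path is None:
        return None
    parts = [p.strip() for p in path.split(",") if p.strip()]
    return parts[0] if parts else None


# The top value reported for categories_old.
example = "Snacks, Snacks sucrés, Biscuits et gâteaux, Biscuits, Biscuits secs"
root = top_level_taxon(example)
```

Paths written in other languages will still produce different roots for the same concept until they are mapped to a shared vocabulary.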
ecoscore_score
numeric feature
Numeric Eco-Score rating per item, ranging from 13 to 94 with a mean of 47.7 and median of 44. The distribution is mildly right-skewed (0.31) and platykurtic (-0.79), spanning a wide IQR of 36.5 with no outliers flagged. Notably, 14% of values are null and only 31 unique scores appear across the 43 non-null rows. Treatment: Impute or flag the 14% nulls, then use as a continuous feature without transformation.
- n
- 50
- nulls
- 7 (14.0%)
- unique
- 31
- min
- 13
- max
- 94
- mean
- 47.74
- median
- 44
- std
- 21.19
- q1
- 27.5
- q3
- 64
- iqr
- 36.5
- skew
- 0.3069
- kurtosis
- -0.7946
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
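One way to "impute or flag" the nulls is to do both: fill with the observed median and keep an indicator column so a model can still see which rows were missing. The sketch below is an assumption about how that might be done, and `impute_with_flag` is a hypothetical name.

```python
from statistics import median


def impute_with_flag(values):
    """Return (filled_value, was_missing) pairs, filling None with the observed median."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [(fill if v is None else v, v is None) for v in values]


# Toy input using the reported min, median and max plus one missing entry.
filled = impute_with_flag([13, None, 44, 94])
```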
environment_impact_level
categorical feature null_rate imbalance
This appears to be a categorical flag for environmental impact severity, but it carries no usable signal in this sample. 56% of the 50 rows are null, and the remaining 22 records all hold the empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; the column is effectively constant and majority-null.
- n
- 50
- nulls
- 28 (56.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fi
categorical free_text long_tail null_rate
Finnish-language ingredient declarations, almost entirely absent: 90% of the 50 rows are null, and the 5 non-null rows hold only 4 distinct values, with the empty string accounting for 2 of those rows. The few populated entries are verbose product ingredient lists (chocolate, wheat-based baked goods) with allergen markup, suggesting this is a localized free-text field rather than a categorical feature despite its low cardinality here. Treatment: Drop or set aside; null_rate 0.9 makes it unusable as a feature without a Finnish-text NLP pipeline.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 4
- top_value
- (empty string)
- top_rate
- 0.4
- cardinality
- 4
- entropy
- 1.922
- entropy_ratio
- 0.961
nutrition_data_prepared
categorical metadata imbalance
This appears to be a flag indicating whether nutrition data was prepared, but it carries no information: only one unique value (an empty string) appears across all 48 non-null rows, with a 4% null rate. Entropy is 0 and top_rate is 1.0, so the column is constant. Treatment: Drop; constant column with no signal.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packagings_materials
unknown other skipped
The column packagings_materials was skipped by the profiler, so its kind is unknown and no descriptive statistics are available. We only know there are 50 rows and a 0.0 null rate; uniqueness, type, and value distribution are all missing. The name suggests structured packaging material data (likely nested or list-valued), which would explain why the profiler bailed out. Treatment: Inspect raw values manually and parse the nested structure before any downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
ingredients_without_ecobalyse_ids
unknown other skipped
This column is named `ingredients_without_ecobalyse_ids`, suggesting it lists ingredients that lack matching identifiers in the Ecobalyse reference system. Saturn skipped profiling, so type, uniqueness, and value distribution are unknown despite a populated null_rate of 0.0 across 50 rows. Treatment: Inspect raw values manually to determine structure (likely a list) before deciding on parsing or join strategy.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
generic_name_nl
categorical metadata long_tail null_rate
This appears to be a Dutch-language generic product name field, likely from a food product catalog. It is largely unusable as-is: 76% of rows are null and among the 12 non-null entries, 9 are empty strings, leaving only 3 distinct real values (e.g., 'Extra fijne pure chocolade'). Cardinality is just 4 across 50 rows, so there is essentially no signal here. Treatment: Drop, or retain only as a descriptive label; too sparse to model.
- n
- 50
- nulls
- 38 (76.0%)
- unique
- 4
- top_value
- (empty string)
- top_rate
- 0.75
- cardinality
- 4
- entropy
- 1.208
- entropy_ratio
- 0.6038
product_name_sv
categorical metadata long_tail null_rate
Swedish-localised product name field, populated for only 4 of 50 rows (null_rate 0.92). The 4 present values are all unique, giving maximum entropy (entropy_ratio 1.0) but no repeated category to learn from. Values like "90% Cocoa" and "Dark 70%" look English rather than Swedish, suggesting localisation is incomplete or mislabelled. Treatment: Drop or defer until localisation coverage improves; not usable as a feature at 92% null.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 4
- top_value
- 90% Cocoa
- top_rate
- 0.25
- cardinality
- 4
- entropy
- 2
- entropy_ratio
- 1
scans_n
numeric feature high_skew outliers
A numeric count of scans per record, with 49 unique values across 50 rows and no nulls or zeros. The bulk of the distribution is tightly clustered (median 492, IQR 217), but it is extremely right-skewed (skew 3.90, kurtosis 18.72) with a max of 2523 versus a Q3 of 604, producing 4 outliers (8%). The mean (577.94) sits well above the median, confirming a heavy upper tail. Treatment: Log-transform or winsorize before modelling to tame the heavy right tail.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 49
- min
- 333
- max
- 2,523
- mean
- 577.9
- median
- 492
- std
- 343.9
- q1
- 387
- q3
- 604
- iqr
- 217
- skew
- 3.899
- kurtosis
- 18.72
- n_outliers
- 4
- outlier_rate
- 0.08
- zero_rate
- 0
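The two treatments named above can be sketched in a few lines. The report does not say which outlier rule the profiler uses, so the Tukey fence (Q3 + 1.5 × IQR) below is an assumption, and the helper names are hypothetical. With the reported quartiles the upper fence lands at 929.5, well below the maximum of 2,523.

```python
import math


def tukey_upper_fence(q1, q3, k=1.5):
    """Upper outlier fence under Tukey's rule: q3 + k * (q3 - q1)."""
    return q3 + k * (q3 - q1)


def winsorize_upper(values, cap):
    """Clip every value above `cap` down to `cap`, leaving the rest untouched."""
    return [min(v, cap) for v in values]


def log_scale(values):
    """log(1 + x) transform, which compresses a heavy right tail."""
    return [math.log1p(v) for v in values]


# Quartiles reported for scans_n.
fence = tukey_upper_fence(387, 604)
# Reported min, median and max, clipped at the fence.
clipped = winsorize_upper([333, 492, 2523], fence)
```

Winsorizing keeps the original units, which helps when the counts are reported directly; the log transform is usually the better fit when the value feeds a linear model.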
schema_version
numeric metadata constant
Constant numeric column holding the value 996.0 across all 50 rows with no nulls. Despite being typed as numeric, the zero variance (std 0.0, iqr 0.0) and single unique value indicate this is a schema/version tag rather than a measurement. Carries no signal for modelling. Treatment: Drop before modelling; retain only as a provenance tag.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- min
- 996
- max
- 996
- mean
- 996
- median
- 996
- std
- 0
- q1
- 996
- q3
- 996
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
url
categorical identifier long_tail
This column holds Open Food Facts product URLs, one per row, with every value unique across all 50 rows (cardinality 50, entropy_ratio 1.0). The URL path embeds a product barcode plus a slugified name, so it functions as a permalink/identifier rather than a feature. No nulls, but the long_tail alert simply reflects that every row is its own category. Treatment: Drop from modelling; keep as a join key or reference link to the source product page.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- https://world.openfoodfacts.org/product/6111242100992/perly
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
debug_param_sorted_langs
unknown metadata skipped
This column was skipped by the profiler (alert: "skipped"), so its kind is unknown and no statistics were computed beyond a row count of 50 with 0% nulls. The name suggests a debug artefact holding sorted language codes, likely a list or compound value the profiler couldn't classify. Without unique counts or value samples there is nothing further to infer. Treatment: Drop unless you can re-profile with list/struct support; it appears to be a debug field.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
packaging
categorical free_text long_tail
Free-form packaging descriptions, likely from a food/product database (Open Food Facts style) given multilingual prefixes like 'en:', 'es:', 'pt:'. Cardinality is extreme: 41 unique values across 50 rows with entropy ratio 0.985, and the top value 'Plastique' covers only 9% of non-null rows; most entries are comma-separated multi-tag strings mixing languages. 12% are null, and the long_tail alert confirms there is no usable category structure as-is. Treatment: Split on commas, normalize language prefixes, and one-hot encode the resulting material tags rather than using the raw string.
- n
- 50
- nulls
- 6 (12.0%)
- unique
- 41
- top_value
- Plastique
- top_rate
- 0.09091
- cardinality
- 41
- entropy
- 5.278
- entropy_ratio
- 0.9851
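The split, prefix-strip, and encode steps in the treatment might be sketched like this. The sample strings other than 'Plastique' are made up for illustration, the helper names are hypothetical, and stripping a two-letter prefix is only a first pass: it does not translate 'Plastique' into 'plastic', which would need a separate mapping table.

```python
def parse_tags(raw):
    """Split a packaging string on commas and strip two-letter language prefixes."""
    if not raw:
        return []
    tags = []
    for piece in raw.split(","):
        token = piece.strip().lower()
        if not token:
            continue
        # Drop a leading language prefix such as "en:" or "es:".
        if len(token) > 3 and token[2] == ":" and token[:2].isalpha():
            token = token[3:]
        tags.append(token)
    return tags


def multi_hot(rows):
    """Return (sorted vocabulary, 0/1 matrix) for a list of raw packaging strings."""
    parsed = [set(parse_tags(r)) for r in rows]
    vocab = sorted(set().union(*parsed)) if parsed else []
    matrix = [[int(tag in tags) for tag in vocab] for tags in parsed]
    return vocab, matrix


# Illustrative rows only; None stands in for a null cell.
vocab, matrix = multi_hot(["Plastique", "en:plastic, en:bag", None])
```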
grades
unknown other skipped
The column is named "grades" and contains 50 rows with no nulls, but saturn skipped profiling and could not infer a kind, so no distributional stats are available. Without n_unique or value summaries, it's impossible to tell whether this holds letter grades, numeric scores, or a nested structure. The "skipped" alert is the key signal: something about the storage type prevented standard analysis. Treatment: Manually inspect a sample to determine the underlying type before deciding on a downstream encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
last_modified_t
numeric timestamp outliers
Values are Unix epoch seconds (min 1737907641, max 1768643720), so this column is a last-modified timestamp covering late January 2025 through mid-January 2026. All 50 rows are unique with no nulls, but the distribution is heavily left-skewed (skew -1.96) with 6 outliers (12%) sitting far below the q1 of 1761612624, suggesting a small tail of much older edits while most records cluster within an IQR of about 6.1M seconds (roughly 71 days). Treat as a timestamp, not a numeric feature. Treatment: Convert from epoch seconds to datetime and derive recency or bucketed features instead of using the raw integer.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- min
- 1.738e+09
- max
- 1.769e+09
- mean
- 1.763e+09
- median
- 1.767e+09
- std
- 8.093e+06
- q1
- 1.762e+09
- q3
- 1.768e+09
- iqr
- 6.138e+06
- skew
- -1.961
- kurtosis
- 2.972
- n_outliers
- 6
- outlier_rate
- 0.12
- zero_rate
- 0
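Converting the raw integers is a one-liner with the standard library. The sketch below assumes the values are UTC epoch seconds, which the magnitudes suggest but the report does not state; `epoch_to_utc` and `days_between` are hypothetical helper names.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Interpret ts as Unix epoch seconds and return a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def days_between(earlier_ts, later_ts):
    """Elapsed time between two epoch timestamps, in fractional days."""
    return (later_ts - earlier_ts) / 86400


# Reported minimum and maximum of last_modified_t.
first_edit = epoch_to_utc(1737907641)
last_edit = epoch_to_utc(1768643720)
span_days = days_between(1737907641, 1768643720)
```

A recency feature can then be derived as `days_between(ts, reference_ts)` against a fixed reference time, such as the newest timestamp in the extract.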
origin_nl
categorical metadata null_rate imbalance
origin_nl appears to be a categorical attribute (likely a Dutch-language origin label) but is effectively empty in this sample. 76% of the 50 rows are null, and the remaining 12 non-null entries are all the empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; column has no variance and is mostly null.
- n
- 50
- nulls
- 38 (76.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
allergens_lc
categorical metadata
Language code for the allergens text, with 6 distinct values across 50 rows and a 4% null rate. The distribution is split almost evenly between 'en' (22) and 'fr' (21), with es/de/it/pl appearing once or twice each, a language mix worth flagging before any text processing. Treatment: Use as a language filter or routing key before tokenizing the allergens text.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 6
- top_value
- en
- top_rate
- 0.4583
- cardinality
- 6
- entropy
- 1.578
- entropy_ratio
- 0.6104
states_hierarchy
unknown other skipped
The column 'states_hierarchy' was skipped by the profiler, so its kind is unknown and no descriptive statistics were computed. We can only confirm there are 50 rows with no nulls; uniqueness, type, and value distribution are unavailable. Treatment: Re-profile or inspect manually to determine type before any downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
ingredients_text_ja
categorical free_text long_tail null_rate imbalance
Japanese-language ingredients text, almost entirely absent from this sample. 98% of the 50 rows are null, and the single non-null value is an empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; the column carries no usable signal in this sample.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
traces_from_user
categorical free_text long_tail
This column appears to capture user-submitted allergen/ingredient traces, prefixed with a language code like '(en)' or '(fr)' followed by comma-separated tags such as 'en:milk,en:nuts'. With 35 unique values across 50 rows and entropy ratio 0.938, it is highly diverse; the top value '(en) ' (an empty tag list) covers only 14% and the distribution has a long tail. Notably, the language prefix is mixed (English and French) and many entries are blank tag lists, which complicates direct use as a category. Treatment: Parse the language prefix and split the tag list into a multi-hot allergen feature before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 35
- top_value
- (en)
- top_rate
- 0.14
- cardinality
- 35
- entropy
- 4.811
- entropy_ratio
- 0.9379
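Parsing the prefix and the tag list can be sketched with a small regular expression. The exact layout, a parenthesised two-letter code followed by a comma-separated list, is assumed from the examples quoted in the description; `parse_traces` is a hypothetical helper and real values may need a looser pattern.

```python
import re

# "(en) en:milk,en:nuts" -> language "en", tags ["en:milk", "en:nuts"]
_TRACES = re.compile(r"^\((?P<lang>[a-z]{2})\)\s*(?P<tags>.*)$")


def parse_traces(raw):
    """Return (language_code, [tags]); (None, []) when the value does not match."""
    match = _TRACES.match(raw)
    if match is None:
        return None, []
    tags = [t.strip() for t in match.group("tags").split(",") if t.strip()]
    return match.group("lang"), tags


parsed = parse_traces("(en) en:milk,en:nuts")
blank = parse_traces("(en) ")
```

The returned tag lists can then be turned into multi-hot columns in the same way as any other comma-separated tag field.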
serving_quantity_unit
categorical metadata imbalance
This column records the unit of measurement for serving quantity, almost exclusively grams ('g' at 45 of 46 non-null rows, top_rate 0.978) with a single 'ml' entry. With only 2 unique values, an 8% null rate, and entropy_ratio of 0.151, it carries almost no information. Treatment: Drop or collapse to a binary flag; near-constant with negligible signal.
- n
- 50
- nulls
- 4 (8.0%)
- unique
- 2
- top_value
- g
- top_rate
- 0.9783
- cardinality
- 2
- entropy
- 0.1511
- entropy_ratio
- 0.1511
ingredients_hierarchy
unknown other skipped
This column is labelled ingredients_hierarchy but saturn skipped profiling it, so no type, uniqueness, or value statistics are available. The only confirmed signals are that it has 50 rows and zero nulls. Without further evidence, the structure (likely nested or list-valued, given the name) cannot be verified. Treatment: Re-profile with a parser that handles nested or list-valued fields before deciding on use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
unique_scans_n
numeric feature high_skew outliers
Numeric count of unique scans per row, with 48 distinct values across 50 records and no nulls or zeros. The distribution is heavily right-skewed (skew 3.91, kurtosis 18.71): median is 432 against a mean of 525.38, and the max of 2257 sits far beyond q3 of 560.75, producing 4 outliers (8% outlier rate). The std of 306.41 is about 1.5 times the IQR of 198, consistent with a long upper tail. Treatment: Log-transform or winsorize before modelling to tame the long upper tail.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 48
- min
- 319
- max
- 2,257
- mean
- 525.4
- median
- 432
- std
- 306.4
- q1
- 362.8
- q3
- 560.8
- iqr
- 198
- skew
- 3.911
- kurtosis
- 18.71
- n_outliers
- 4
- outlier_rate
- 0.08
- zero_rate
- 0
labels
categorical feature long_tail
Free-form labels/certifications column (e.g. organic, fair-trade, Triman, Green Dot) stored as comma-separated multi-label strings, often mixing English, French, Portuguese and Spanish tokens. With 42 distinct values across 50 rows and an entropy ratio of 0.95, combinations are nearly unique; the only repeated 'value' is the empty string (8 rows, 16%) on top of a 2% null rate, so roughly one in five records carries no label at all. The long_tail alert is well earned: almost every non-empty cell is its own bag of tags. Treatment: Split on commas into a multi-hot tag set (normalising language variants) before modelling.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 42
- top_value
- (empty string)
- top_rate
- 0.1633
- cardinality
- 42
- entropy
- 5.125
- entropy_ratio
- 0.9504
generic_name_en
categorical free_text long_tail
Likely an English-language generic product name field, but it is essentially empty: the top value is the blank string at 83.7% of non-null rows, with a further 14% null. Only 7 actual product descriptions appear across 50 rows (e.g. 'Dark chocolate', 'Crackers'), all singletons, giving cardinality 8 and entropy ratio 0.37. The long_tail alert reflects that every real value occurs exactly once. Treatment: Drop or treat blanks as missing; too sparse and unique to use as a categorical feature.
- n
- 50
- nulls
- 7 (14.0%)
- unique
- 8
- top_value
- (empty string)
- top_rate
- 0.8372
- cardinality
- 8
- entropy
- 1.098
- entropy_ratio
- 0.366
product_name_fi
categorical metadata long_tail null_rate
Likely a Finnish-localized product name field, but it is essentially empty: 90% nulls and the most frequent observed value is the empty string (top_rate 0.4 of the 5 non-null entries). Among the few populated rows, the names are in English (e.g., 'Excellence: 90% cocoa Dark Supreme', 'Arriba 85% Cacao Dark Chocolate'), contradicting the _fi suffix. With only 4 unique values across 50 rows, this column carries almost no usable signal. Treatment: Drop or defer until localization coverage improves; do not use as a feature.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 4
- top_value
- (empty string)
- top_rate
- 0.4
- cardinality
- 4
- entropy
- 1.922
- entropy_ratio
- 0.961
origin_fr
categorical free_text long_tail
This appears to be a French-language origin/provenance field describing where a product or its ingredients are made. The column is essentially empty: 40 of 50 rows hold the empty string and another 8% are null, leaving only 6 distinct non-blank descriptions ranging from a single country ('France') to multi-region ingredient breakdowns. An entropy ratio of 0.319 and a top_rate of 0.87 show how concentrated the column is on the blank value; the long_tail alert comes from the six singleton descriptions, and there is almost no usable signal here. Treatment: Drop or defer; too sparse and unstructured to use without targeted NER on the few populated strings.
- n
- 50
- nulls
- 4 (8.0%)
- unique
- 7
- top_value
- (empty string)
- top_rate
- 0.8696
- cardinality
- 7
- entropy
- 0.8958
- entropy_ratio
- 0.3191
generic_name
categorical free_text long_tail
Free-text generic product names, predominantly French with some English entries (e.g., "Compound Chocolate with MILK AND ALMONDS"). The dominant value is the empty string, at 21 of the 48 non-null rows (top_rate 0.4375); combined with the 2 nulls (null_rate 0.04), 23 of 50 rows (46%) carry no usable name. The remaining 27 distinct values each occur once, producing the flagged long tail. Treatment: Treat empty strings as missing, then tokenize/normalize language before embedding or matching.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 28
- top_value
- (empty string)
- top_rate
- 0.4375
- cardinality
- 28
- entropy
- 3.663
- entropy_ratio
- 0.762
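Because the profiler counts empty strings as values, the null rate understates how much of this column is actually missing. A minimal sketch of the blank-to-missing step follows; the helper names are hypothetical and the placeholder strings only reproduce the reported counts (21 blanks, 2 nulls, 27 populated rows), not real product names.

```python
def blank_to_none(value):
    """Map None, empty, and whitespace-only strings to None; keep everything else."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def effective_missing_rate(values):
    """Share of rows that are null or blank after normalisation."""
    cleaned = [blank_to_none(v) for v in values]
    return sum(v is None for v in cleaned) / len(cleaned)


# Placeholder column with the reported mix of blanks, nulls and populated rows.
column = [""] * 21 + [None] * 2 + ["some name"] * 27
rate = effective_missing_rate(column)
```

The same normalisation applies to every localized text column in this report whose top value is the empty string.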
nutriscore_version
categorical metadata imbalance
This column records the Nutri-Score version applied to each row, and every one of the 50 records carries the value "2023". With cardinality 1 and entropy 0, it offers no discriminative signal in this sample. Treatment: Drop, constant column.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- 2023
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_without_ciqual_codes
unknown other skipped
This column, named ingredients_without_ciqual_codes, was skipped by the profiler so no descriptive statistics are available beyond a row count of 50 and a null rate of 0. The name suggests it holds ingredient entries that lack a matching CIQUAL food-database code, likely as a list or nested structure that the profiler could not introspect. Without unique counts or value samples, nothing further can be inferred. Treatment: Re-profile after parsing the nested structure, or explode to a list before downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
packaging_text_pl
categorical metadata null_rate imbalance
Polish-language packaging text field that is effectively empty: 90% of the 50 rows are null and the remaining 10% are all the empty string, giving a single observed value and zero entropy. There is no usable signal here, only nulls and blanks. Treatment: Drop; the column carries no information.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_en
categorical free_text long_tail
English-language ingredient lists for food products, stored as free-form text rather than a controlled vocabulary. With 36 unique values across 50 rows and entropy ratio 0.93, values are nearly all distinct; the only repeated 'value' is the empty string (9 occurrences, top_rate 0.20), and 12% are null, so 15 of 50 rows (30%) carry no usable ingredient text. Content is heterogeneous: multi-sentence allergen-tagged lists, percentages, punctuation noise, and at least one junk entry ('Hhhhh'). Treatment: Normalize, tokenize, and embed (or parse into ingredient lists) before modelling; treat empty strings as nulls.
- n
- 50
- nulls
- 6 (12.0%)
- unique
- 36
- top_value
- (empty string)
- top_rate
- 0.2045
- cardinality
- 36
- entropy
- 4.811
- entropy_ratio
- 0.9306
ingredients_text_it
categorical free_text long_tail null_rate
Free-form Italian ingredient lists for food products, with 68% nulls across the 50 rows. Of the 16 non-null entries, 5 are empty strings (top_rate 0.3125) and the remaining 11 are all distinct, long ingredient descriptions, yielding 12 distinct values and entropy_ratio 0.913. Effectively unstructured text rather than a categorical field, despite being typed as such. Treatment: Treat as free text: normalize empty strings to null, then tokenize/parse for allergen or ingredient extraction rather than one-hot encoding.
- n
- 50
- nulls
- 34 (68.0%)
- unique
- 12
- top_value
- (empty string)
- top_rate
- 0.3125
- cardinality
- 12
- entropy
- 3.274
- entropy_ratio
- 0.9134
origin_de
categorical feature null_rate imbalance
This appears to be a German-origin flag or label, but it carries no information in this sample: 60% of rows are null and the remaining 20 rows all hold the empty string, giving a single unique value and zero entropy. There is no signal to model on here. Treatment: Drop; constant column with majority nulls.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
nova_group
numeric feature high_skew
This is the NOVA food classification group (1-4 scale) indicating processing level, with 3 unique values present across 50 rows and a 4% null rate. The distribution is heavily skewed toward ultra-processed foods: median is 4.0, Q1-Q3 spans 3-4, and skew of -2.06 with kurtosis 5.65 confirms a long left tail with one outlier at the low end. Only 3 of the 4 possible NOVA categories appear in this sample; group 2 is absent. Treatment: Treat as ordinal categorical rather than continuous; impute the 4% nulls with median (4) or a missing-indicator.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 3
- min
- 1
- max
- 4
- mean
- 3.646
- median
- 4
- std
- 0.601
- q1
- 3
- q3
- 4
- iqr
- 1
- skew
- -2.062
- kurtosis
- 5.651
- n_outliers
- 1
- outlier_rate
- 0.02083
- zero_rate
- 0
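The NOVA breakdown charted near the top of this report (33 rows in group 4, 14 in group 3, 1 in group 1) appears to describe the same 48 non-null rows, since it reproduces the mean and standard deviation listed above; treating it as this column's distribution is nonetheless an assumption. The sketch below rebuilds the values from those counts and shows one possible ordinal encoding with a missing indicator. `encode_nova` is a hypothetical helper.

```python
from statistics import mean, stdev

# Assumed non-null split for nova_group, taken from the NOVA chart counts.
NOVA_COUNTS = {4: 33, 3: 14, 1: 1}


def expand(counts):
    """Turn {level: n} into a flat list with each level repeated n times."""
    return [level for level, n in counts.items() for _ in range(n)]


def encode_nova(value):
    """Return (ordinal_level, missing_flag); nulls keep level None and flag 1."""
    if value is None:
        return None, 1
    return int(value), 0


values = expand(NOVA_COUNTS)
summary = (round(mean(values), 3), round(stdev(values), 3))
encoded = [encode_nova(v) for v in [4.0, None, 1.0]]
```

Keeping the level as an ordered integer preserves the 1 < 2 < 3 < 4 ordering without implying that the gaps between groups are equal measurements.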
packaging_text_fi
categorical free_text null_rate imbalance
Finnish-language packaging text field that is effectively empty in this sample. 90% of the 50 rows are null and the remaining 5 rows all hold the empty string, giving a single observed value and zero entropy. Treatment: Drop; column carries no information in this sample.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
states
categorical feature long_tail
This column packs an Open Food Facts-style product completion checklist into a single comma-joined string of `en:*-completed` / `en:*-to-be-completed` tags covering nutrition, ingredients, photos, packaging, etc. With 26 unique combinations across just 50 rows (entropy ratio 0.91) and the most common state appearing only 8 times, it behaves like a long-tail composite status flag rather than a clean category. The values are clearly multi-valued, so they should be split into individual status tags before any modelling. Treatment: Split on comma and one-hot encode each `en:*` tag instead of treating the concatenated string as a single category.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 26
- top_value
- en:to-be-completed, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-completed, en:packaging-code-to-be-completed, en:characteristics-to-be-completed, en:origins-to-be-completed, en:categories-completed, en:brands-completed, en:packaging-completed, en:quantity-completed, en:product-name-completed, en:photos-validated, en:packaging-photo-selected, en:nutrition-photo-selected, en:ingredients-photo-selected, en:front-photo-selected, en:photos-uploaded
- top_rate
- 0.16
- cardinality
- 26
- entropy
- 4.286
- entropy_ratio
- 0.9119
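Since most tags come in `-completed` / `-to-be-completed` pairs, one alternative to plain one-hot encoding is to parse each string into a per-field status map. The sketch below is an assumption about how that could be done: `parse_states` is a hypothetical helper, and the key "overall" is an invented label for the bare `en:to-be-completed` tag.

```python
TO_DO = "-to-be-completed"
DONE = "-completed"


def parse_states(raw):
    """Map each checklist field to True (done or present) or False (to be completed)."""
    status = {}
    for piece in raw.split(","):
        tag = piece.strip()
        if not tag:
            continue
        name = tag[3:] if tag.startswith("en:") else tag
        if name == "to-be-completed":
            status["overall"] = False  # hypothetical key for the product-level tag
        elif name.endswith(TO_DO):
            status[name[: -len(TO_DO)]] = False
        elif name.endswith(DONE):
            status[name[: -len(DONE)]] = True
        else:
            status[name] = True  # presence-style tags such as photos-uploaded
    return status


# A shortened excerpt of the top value reported for this column.
sample = "en:to-be-completed, en:nutrition-facts-completed, en:packaging-code-to-be-completed, en:photos-uploaded"
status = parse_states(sample)
```

Each key then becomes one boolean column, which collapses the 26 observed string combinations into a fixed set of flags.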
ingredients_with_unspecified_percent_sum
numeric feature
This column appears to be a per-record sum of ingredient percentages where the precise share was not specified, expressed on a 0–100 scale (max 100.0, min 0.4). The distribution is left-skewed (skew -1.18) with a median of 100.0 and Q3 also at 100.0, meaning at least half of the 50 rows have effectively all of their ingredient mass unspecified. Only 22 unique values across 50 rows and a mean of 79.4 confirm the concentration at the upper bound. Treatment: Treat as a data-quality indicator; consider binarizing (e.g. =100 vs <100) rather than using the raw left-skewed value.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 22
- min
- 0.4
- max
- 100
- mean
- 79.42
- median
- 100
- std
- 31.64
- q1
- 53.6
- q3
- 100
- iqr
- 46.4
- skew
- -1.183
- kurtosis
- -0.133
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
id
categorical identifier long_tailThis column is a unique row identifier, with all 50 values distinct (n_unique=50, entropy_ratio=1.0) and no nulls. The values look like product barcodes (mostly 13-digit EAN/GTIN strings such as '6111242100992', with at least one shorter numeric like '20995553'), suggesting a product-level key rather than a sequential surrogate ID. The long_tail alert simply reflects that every value occurs exactly once (top_rate=0.02). Treatment: Drop from modelling; retain as a join key for linking to product metadata.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- 6111242100992
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
nutrient_levels
unknown other skippedThe column 'nutrient_levels' was skipped by the profiler, so its kind is unknown and no descriptive statistics are available. We only know it has 50 rows with a 0.0 null rate; uniqueness, distribution, and value structure are all missing from the evidence. Treatment: Re-profile or manually inspect a sample to determine the underlying type before any downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
sortkey
numeric timestamp high_skew outliersValues range from 1,567,543,172 to 1,610,897,644 with a median of 1,608,147,866 — consistent with Unix epoch seconds spanning roughly 2019 to early 2021, masquerading as a numeric sort key. Distribution is heavily left-skewed (skew -2.78, kurtosis 8.09) with 4 outliers (9.1%) trailing toward older timestamps, and 12% of rows are null. The tight IQR of ~6.16M seconds (~71 days) versus a 43M-second range confirms most records cluster late in the window. Treatment: Convert from epoch seconds to datetime and use as a temporal feature rather than a raw numeric.
- n
- 50
- nulls
- 6 (12.0%)
- unique
- 44
- min
- 1.568e+09
- max
- 1.611e+09
- mean
- 1.605e+09
- median
- 1.608e+09
- std
- 8.692e+06
- q1
- 1.604e+09
- q3
- 1.61e+09
- iqr
- 6.16e+06
- skew
- -2.782
- kurtosis
- 8.091
- n_outliers
- 4
- outlier_rate
- 0.09091
- zero_rate
- 0
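The epoch-to-datetime conversion might look like the sketch below. It assumes the values really are Unix seconds in UTC, which the profile suggests but does not confirm; the Series holds the min, median, and max quoted above plus one null to mirror the 12% missing rows.

```python
import pandas as pd

# Min, median, and max from the profile, plus one missing value.
sortkey = pd.Series([1_567_543_172, 1_608_147_866, None, 1_610_897_644])

# Interpret as seconds since 1970-01-01 UTC; missing values become NaT.
sortkey_dt = pd.to_datetime(sortkey, unit="s", utc=True)

print(sortkey_dt.dt.date.tolist())
```

Once converted, recency or age features can be derived by subtracting from a fixed reference date instead of feeding the raw integer to a model.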
image_small_url
categorical identifier long_tailPer-row URL pointing to a 200px product front image hosted on images.openfoodfacts.org, with French/English locale suffixes embedded in the filename. All 50 rows are unique with zero nulls, so this acts as a row-level asset reference rather than a feature. The path segments encode the product barcode (e.g. 6111242100992), making this effectively a derivable identifier. Treatment: Drop from modelling; retain as a fetch URL for image pipelines or extract the embedded barcode.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.200.jpg
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
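Extracting the embedded barcode could be done as in the sketch below. It assumes every URL follows the same layout as the top value (digit-only path segments after `/products/`); the helper name `barcode_from_url` is hypothetical.

```python
import re
from typing import Optional

url = "https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.200.jpg"

def barcode_from_url(u: str) -> Optional[str]:
    """Join the digit-only path segments that follow /products/ into one string."""
    m = re.search(r"/products/((?:\d+/)+)", u)
    return m.group(1).replace("/", "") if m else None

print(barcode_from_url(url))  # 6111242100992
```

The result matches the `id` value reported for the same product, which is what makes the URL a derivable identifier.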
food_groups
categorical featureThis is a categorical food taxonomy field using Open Food Facts-style prefixed slugs (e.g., 'en:biscuits-and-cakes'). The distribution is heavily concentrated on sweets: 'en:biscuits-and-cakes' (17/49) and 'en:chocolate-products' (16/49) together account for roughly two-thirds of non-null rows, with 11 distinct categories across 50 records and a 2% null rate. Entropy ratio of 0.74 confirms moderate concentration rather than uniform spread. Treatment: Strip the 'en:' prefix and one-hot or target-encode; consider grouping the long tail of single-occurrence categories into 'other'.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 11
- top_value
- en:biscuits-and-cakes
- top_rate
- 0.3469
- cardinality
- 11
- entropy
- 2.549
- entropy_ratio
- 0.7367
nova_groups_markers
unknown other skippedColumn 'nova_groups_markers' was skipped by the profiler, so no type, uniqueness, or distribution stats are available beyond a row count of 50 and a null rate of 0.0. The name suggests it carries NOVA food-classification group markers, likely a list or structured field that the dissector could not parse. Without parsed values, nothing further can be said about its content. Treatment: Inspect raw values manually and reparse (likely a list/struct) before deciding on use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
packaging_text_de
categorical free_text null_rateGerman-language packaging description field, almost entirely unpopulated. With a 60% null rate and the empty string accounting for 19 of the 20 non-null rows (95% top_rate), only one row carries actual content ("1 Folie aus 22 PAP zum Recyclen"). Cardinality of 2 and entropy ratio of 0.29 confirm there is virtually no usable signal here. Treatment: Drop; effectively empty with only one informative value.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 2
- top_value
- (empty string)
- top_rate
- 0.95
- cardinality
- 2
- entropy
- 0.2864
- entropy_ratio
- 0.2864
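The entropy figures can be reproduced from the counts alone. The sketch below assumes the profiler reports Shannon entropy in bits and defines `entropy_ratio` as entropy divided by log2 of the cardinality; that definition is inferred from the reported numbers, not stated in the report.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_ratio(counts):
    """Entropy normalised by its maximum, log2(k); 0 for a constant column."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# packaging_text_de: 19 empty strings and 1 real value among 20 non-null rows.
counts = [19, 1]
print(round(entropy_bits(counts), 4), round(entropy_ratio(counts), 4))  # 0.2864 0.2864
```

With only two categories the maximum entropy is 1 bit, which is why entropy and entropy ratio coincide for this column.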
categories_lc
categorical featureThis column appears to hold lowercase ISO language codes, with 6 distinct values across 50 rows and no nulls. The distribution is dominated by 'fr' (25) and 'en' (19), together covering 44 of 50 rows, while 'es', 'de', 'it', and 'pl' form a thin long tail. Entropy ratio of 0.63 reflects this Franco-English skew rather than a balanced multilingual mix. Treatment: One-hot encode, optionally collapsing rare codes (it, pl, de, es) into an 'other' bucket.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- fr
- top_rate
- 0.5
- cardinality
- 6
- entropy
- 1.628
- entropy_ratio
- 0.6297
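Collapsing rare codes before one-hot encoding can be sketched as follows. The fr and en counts come from the profile; the split of the remaining 6 rows across es, de, it, and pl is hypothetical, and the `min_count` threshold is an illustrative choice.

```python
import pandas as pd

# fr=25 and en=19 as reported; the tail split below is made up for illustration.
lc = pd.Series(["fr"] * 25 + ["en"] * 19 + ["es", "es", "de", "de", "it", "pl"])

# Keep codes seen at least `min_count` times; everything else becomes "other".
min_count = 5
counts = lc.value_counts()
keep = counts[counts >= min_count].index
lc_grouped = lc.where(lc.isin(keep), "other")

lc_onehot = pd.get_dummies(lc_grouped, prefix="lc")
print(lc_grouped.value_counts().to_dict())
```

The same pattern applies to the other `*_lc` columns and to long-tailed provenance fields such as `creator`.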
checkers
unknown other skippedThe column 'checkers' was skipped by the profiler, so its data type and value distribution are unknown. Only the row count (50) and null rate (0.0) are reported; n_unique and all other statistics are missing. Without further inspection there is no basis to infer what this column represents. Treatment: Re-profile or manually inspect the column before any downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
packaging_text_es
categorical free_text null_rateSpanish-language packaging description, populated for almost none of the rows. 60% are null and of the 20 non-null entries, 19 are empty strings, leaving exactly one real value describing a recyclable cardboard box and plastic tray. Effective cardinality is 2 and entropy ratio is 0.29, so this column carries virtually no signal in this sample. Treatment: Drop unless a larger sample shows meaningful Spanish text coverage.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 2
- top_value
- (empty string)
- top_rate
- 0.95
- cardinality
- 2
- entropy
- 0.2864
- entropy_ratio
- 0.2864
nutrition_score_warning_fruits_vegetables_nuts_estimate_from_ingredients
numeric metadata constantThis is a nutrition-score warning flag indicating whether fruit/vegetable/nut content was estimated from ingredients. Every one of the 45 non-null rows holds the value 1.0, and 10% of rows are null — so the column carries no discriminative signal in this sample, only a presence/absence distinction. Treatment: Drop as a feature; optionally retain a binary is_null indicator if the missingness itself is meaningful.
- n
- 50
- nulls
- 5 (10.0%)
- unique
- 1
- min
- 1
- max
- 1
- mean
- 1
- median
- 1
- std
- 0
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
labels_lc
categorical label
This column appears to hold lowercase ISO language codes, with 6 distinct values across 50 rows and one null. English and French dominate at 22 occurrences each, leaving es, de, it, and pl to share the remaining 5 non-null rows: a near-binary distribution despite the multilingual appearance. An entropy ratio of 0.61 confirms the imbalance. Treatment: Group rare codes (es/de/it/pl) into 'other' before stratifying or one-hot encoding.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 6
- top_value
- en
- top_rate
- 0.449
- cardinality
- 6
- entropy
- 1.57
- entropy_ratio
- 0.6072
nutriscore_data
unknown other skippedThe column 'nutriscore_data' was skipped by the profiler, so its kind, uniqueness, and value distribution are unknown. The only confirmed signals are 50 rows with a 0.0 null rate. Without further stats, the contents (likely a nested Nutri-Score payload given the name) cannot be characterised. Treatment: Re-profile with nested/struct support enabled before deciding on use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
product_name_nb
categorical metadata long_tail null_rateNorwegian product name field (suffix _nb suggests Bokmål locale) that is almost entirely empty: 96% null across 50 rows, leaving only 2 non-null observations with one being an empty string and the other '99% mørk sjokolade'. With just two distinct values and effectively no signal, this column cannot support analysis as-is. Treatment: Drop unless joined to a richer localized catalog; null rate is too high to model.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- (empty string)
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
nutrition_data_prepared_per
categorical metadata imbalanceThis column records the basis on which nutrition data is reported, and every one of the 50 rows carries the single value "100g". With cardinality of 1, entropy of 0, and a top_rate of 1.0, the field provides no discriminating information whatsoever. Treatment: Drop; constant column carries no signal.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- 100g
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_quantity
categorical feature long_tail
Numeric product quantities stored as strings, treated here as categorical with 27 distinct values across 50 rows. The mode '100' covers 23.4% of non-nulls, but an entropy ratio of 0.90 confirms a long tail, with most other values appearing only once or twice. Note the 6% nulls and the presence of '0' as a quantity, which more likely marks a missing or placeholder entry than a real package size. Treatment: Cast to numeric and treat as a quantitative feature; investigate zeros and nulls before modelling.
- n
- 50
- nulls
- 3 (6.0%)
- unique
- 27
- top_value
- 100
- top_rate
- 0.234
- cardinality
- 27
- entropy
- 4.287
- entropy_ratio
- 0.9017
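The cast-to-numeric step might be sketched like this. The sample values are hypothetical, and treating 0 as missing is an assumption that should be checked against the raw records first.

```python
import pandas as pd

# Hypothetical string quantities, including a zero, a null, and a junk value.
qty_raw = pd.Series(["100", "250", "0", None, "33.5", "n/a"])

# Unparseable strings become NaN instead of raising.
qty = pd.to_numeric(qty_raw, errors="coerce")

# Assumption: a quantity of 0 is a placeholder, so mask it as missing.
qty = qty.mask(qty == 0)

print(qty.tolist())
```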
product_type
categorical metadata imbalanceThis is a categorical column recording product type, but every one of the 50 rows holds the same value, "food". Cardinality is 1 and entropy is 0, so the column carries no information for modelling or segmentation. Treatment: Drop; constant column with zero entropy.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- top_value
- food
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
traces_lc
categorical feature
This is a low-cardinality categorical column holding lowercase language codes (fr, en, es, de, it, pl), almost certainly a detected or declared language tag. The distribution is heavily concentrated on French (23 rows) and English (20 rows); the top value covers 47.9% of the 48 non-null rows, and the entropy ratio is 0.61. Four percent of rows are null and three languages appear only once, so any per-language analysis will be unstable beyond fr/en. Treatment: Keep fr/en as-is and bucket de/es/it/pl into an 'other' category before encoding.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 6
- top_value
- fr
- top_rate
- 0.4792
- cardinality
- 6
- entropy
- 1.575
- entropy_ratio
- 0.6093
categories_hierarchy
unknown other skippedThe column `categories_hierarchy` was skipped by the profiler, so no type, uniqueness, or distribution stats are available. The name suggests a nested or path-like categorical structure (e.g., taxonomy levels), but this cannot be confirmed from the evidence. Only the row count (50) and null rate (0.0) are known. Treatment: Re-profile after parsing the hierarchy (e.g., split into level columns) before deciding on encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
image_front_small_url
categorical metadata long_tailURLs pointing to small front-of-pack product images on the Open Food Facts CDN, one per row. Every one of 50 values is unique (entropy_ratio 1.0, top_rate 0.02) and there are no nulls, so this acts as a per-product asset link rather than a feature. URLs mix `front_fr` and `front_en` suffixes, hinting at a French/English language mix in the source catalogue. Treatment: Keep as a reference link; drop from modelling or fetch the images separately for vision features.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.200.jpg
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
nutrition_score_warning_fruits_vegetables_legumes_estimate_from_ingredients
numeric metadata constantThis appears to be a binary warning flag indicating that the fruits/vegetables/legumes share in a Nutri-Score calculation was estimated from ingredients. Every non-null value is 1.0 (n_unique=1, std=0), and 8% of rows are null, so the column carries no discriminative signal in this sample. Treatment: Drop; constant value provides no information.
- n
- 50
- nulls
- 4 (8.0%)
- unique
- 1
- min
- 1
- max
- 1
- mean
- 1
- median
- 1
- std
- 0
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
ingredients_without_ciqual_codes_n
numeric featureCounts the number of ingredients in a record that lack a CIQUAL code, so it's a data-quality feature describing how complete the ingredient mapping is. The distribution is right-skewed (skew 1.21) with a median of 3.5 but a max of 22 and one outlier; 18% of rows are already fully mapped (zero_rate 0.18). Only 15 unique values across 50 rows, so it behaves like a small ordinal count. Treatment: Treat as a count; consider log1p or a binary 'fully mapped' flag before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 15
- min
- 0
- max
- 22
- mean
- 4.98
- median
- 3.5
- std
- 4.825
- q1
- 1
- q3
- 7.75
- iqr
- 6.75
- skew
- 1.208
- kurtosis
- 1.491
- n_outliers
- 1
- outlier_rate
- 0.02
- zero_rate
- 0.18
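Both suggested treatments, the log1p transform and the binary flag, can be derived in one step. A minimal sketch with hypothetical counts spanning the reported 0 to 22 range; the output column names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical per-product counts of ingredients lacking a CIQUAL code.
n_unmapped = pd.Series([0, 1, 3, 4, 8, 22])

features = pd.DataFrame({
    # 1 when every ingredient is mapped, else 0.
    "fully_mapped": (n_unmapped == 0).astype(int),
    # log(1 + x) compresses the right tail and keeps zero at zero.
    "log1p_unmapped": np.log1p(n_unmapped),
})

print(features)
```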
rev
numeric feature
A numeric field spanning 19 to 674 with a mean of 230 and median of 233.5. In Open Food Facts, `rev` is most likely the record's revision counter (how many times the product entry has been saved), so it reads as edit-activity metadata rather than revenue or any monetary value. The distribution is moderately right-skewed (0.71) with a wide IQR of 237.75 and only one outlier (2%), so spread is large but not pathological. All 50 rows are populated with no zeros and 46 unique values. Treatment: Consider a log or sqrt transform before regression to tame the right skew.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 46
- min
- 19
- max
- 674
- mean
- 230
- median
- 233.5
- std
- 166.6
- q1
- 72.75
- q3
- 310.5
- iqr
- 237.8
- skew
- 0.7092
- kurtosis
- -0.02278
- n_outliers
- 1
- outlier_rate
- 0.02
- zero_rate
- 0
ingredients_non_nutritive_sweeteners_n
numeric feature constantThis column appears to be a count of non-nutritive (artificial) sweeteners listed in a product's ingredients. Across all 50 rows it is exactly 0, with zero_rate of 1.0 and no nulls, so it carries no information in this sample. Treatment: Drop; constant zero provides no signal.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- min
- 0
- max
- 0
- mean
- 0
- median
- 0
- std
- 0
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 1
ingredients_without_ecobalyse_ids_n
numeric featureThis is a count of ingredients on a product that lack Ecobalyse identifiers, ranging from 0 to 29 with a median of 6.5 and mean 8.16. The distribution is right-skewed (skew 1.28) with one outlier at the high end, suggesting most products have a handful of unmapped ingredients while a few have many. Only 2% are zero, meaning nearly every row has at least one ingredient missing an Ecobalyse ID — a notable data-coverage gap. Treatment: Use as-is or log-transform if feeding into a regression; treat as a coverage-quality signal.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 20
- min
- 0
- max
- 29
- mean
- 8.16
- median
- 6.5
- std
- 5.898
- q1
- 4
- q3
- 11
- iqr
- 7
- skew
- 1.28
- kurtosis
- 1.743
- n_outliers
- 1
- outlier_rate
- 0.02
- zero_rate
- 0.02
labels_hierarchy
unknown other skippedThis column was skipped by the profiler, so its contents are uncharacterised beyond a row count of 50 and a null rate of 0.0. The name suggests a nested or structured label taxonomy, which likely tripped the profiler's type detection. No uniqueness, value, or distribution statistics are available to confirm. Treatment: Inspect raw values manually and parse the hierarchy (e.g., split on delimiter or expand JSON) before profiling again.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
product_name_en
categorical free_text long_tailFree-text English product names with 34 unique values across 50 rows and high entropy ratio (0.91), indicating heavy diversity. Notable issues: 14% nulls plus an empty-string value taking the top slot at 23.3% (10 occurrences), so effective missingness is much higher than null_rate alone suggests. Values mix languages (e.g., 'Edelbitter-Schokolade', 'Chocolat noir', 'tonik') and include junk like 'Hhhhh', flagged as long_tail. Treatment: Normalise empties to null, language-detect, then tokenize/embed before modelling.
- n
- 50
- nulls
- 7 (14.0%)
- unique
- 34
- top_value
- (empty string)
- top_rate
- 0.2326
- cardinality
- 34
- entropy
- 4.654
- entropy_ratio
- 0.9147
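With 7 nulls and 10 empty strings, 17 of the 50 rows (34%) are effectively missing. Normalising empties to null can be sketched as below; the sample values are illustrative, and treating whitespace-only strings as empty is an added assumption.

```python
import pandas as pd

# Illustrative values: real names, an empty string, whitespace only, and a null.
names = pd.Series(["Chocolat noir", "", "  ", None, "tonik"])

# Strip whitespace, then turn empty strings into missing values.
stripped = names.str.strip()
cleaned = stripped.mask(stripped == "")

print(cleaned.isna().mean())  # share of rows that are effectively missing
```

Running this before profiling makes `null_rate` reflect the real coverage of the field.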
nutrition_score_warning_fruits_vegetables_legumes_estimate_from_ingredients_value
numeric feature high_skew outliers
This appears to be a Nutri-Score warning value estimating fruit/vegetable/legume content from ingredients. The distribution is dominated by zeros (zero_rate 0.89, median and IQR both 0), with a handful of extreme values pushing the max to 50 and producing severe skew (5.93) and kurtosis (35.2). Five outliers (10.9% of non-null rows) pull the mean up to 1.65 and inflate the std to 7.55, and 8% of rows are null. Treatment: Binarize (zero vs non-zero) or winsorize before modelling given the heavy zero mass and extreme skew.
- n
- 50
- nulls
- 4 (8.0%)
- unique
- 6
- min
- 0
- max
- 50
- mean
- 1.652
- median
- 0
- std
- 7.551
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 5.932
- kurtosis
- 35.23
- n_outliers
- 5
- outlier_rate
- 0.1087
- zero_rate
- 0.8913
traces
categorical feature long_tailThis column holds comma-separated allergen/ingredient trace tags with an `en:` language prefix (e.g. `en:milk,en:nuts`), so each cell is a multi-label set rather than a single category. Across 50 rows there are 23 distinct combinations and entropy ratio 0.87, indicating high diversity, and the most common value is the empty string at 22% (11 rows) — meaning missing-as-empty rather than a true null (null_rate 0.0). The long_tail alert reflects many combinations appearing only once or twice. Treatment: Split on commas and multi-hot encode the individual `en:` tags; treat empty string as missing.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 23
- top_value
- (empty string)
- top_rate
- 0.22
- cardinality
- 23
- entropy
- 3.922
- entropy_ratio
- 0.8671
generic_name_fi
categorical free_text long_tail null_rateFinnish-language generic product name field, populated for only 5 of 50 rows (90% null). Among the 5 present values, all are unique with maximum entropy (2.32, ratio 1.0), and casing inconsistencies appear ("Tumma suklaa" vs "tumma suklaa") plus one empty string counted as a value. Treatment: Normalise case and treat empty strings as null; too sparse (90% missing) to use as a feature without imputation or dropping.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 5
- top_value
- Hieno tumma suklaa jossa 90% kaakaota
- top_rate
- 0.2
- cardinality
- 5
- entropy
- 2.322
- entropy_ratio
- 1
emb_codes_orig
categorical metadata long_tail null_rateThis appears to be original packaging or establishment codes (EMB-prefixed identifiers used on European food labels), kept in raw form. The column is sparsely populated: 34% are null and among the remaining rows the empty string dominates at roughly 85% (top_rate 0.848), leaving only 5 distinct values across 50 rows. One entry is not a code at all but a company name pair (SOLENT GMBH & CO. KG,SCHWARZ BETEILIGUNGS GMBH), suggesting inconsistent source formatting. Treatment: Normalise empty strings to null and parse/validate the EMB code pattern before use; too sparse to model directly.
- n
- 50
- nulls
- 17 (34.0%)
- unique
- 5
- top_value
- (empty string)
- top_rate
- 0.8485
- cardinality
- 5
- entropy
- 0.9048
- entropy_ratio
- 0.3897
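The normalise-then-validate step might look like the sketch below. The regular expression is an assumed shape for French EMB codes ("EMB", an optional space, five digits, optional trailing letters) and is not taken from the source; the helper name `clean_emb` is hypothetical.

```python
import re

# Assumed pattern, e.g. "EMB 44068A"; adjust after inspecting the real values.
EMB_PATTERN = re.compile(r"^EMB\s?\d{5}[A-Z]*$")

def clean_emb(value):
    """Return the valid-looking EMB codes in a cell, or None if none remain."""
    if value is None or not value.strip():
        return None  # null and empty string are both treated as missing
    parts = [p.strip().upper() for p in value.split(",")]
    codes = [p for p in parts if EMB_PATTERN.match(p)]
    return codes or None

print(clean_emb("EMB 44068A"))
print(clean_emb("SOLENT GMBH & CO. KG,SCHWARZ BETEILIGUNGS GMBH"))
```

Company names such as the pair quoted above fail the pattern and are returned as missing, which separates real codes from free-text manufacturer descriptors.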
ingredients_with_specified_percent_n
numeric featureCounts the number of ingredients whose percentage is explicitly specified on a product label. The distribution is heavily zero-inflated (zero_rate 0.58) with median 0 and mean 1.1, but a long right tail reaches 8 (skew 1.88, kurtosis 3.68), and only 7 distinct values appear across 50 rows. Treatment: Treat as a count feature; consider a binary 'has_specified_percent' flag plus log1p transform to tame the skew.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 7
- min
- 0
- max
- 8
- mean
- 1.1
- median
- 0
- std
- 1.729
- q1
- 0
- q3
- 2
- iqr
- 2
- skew
- 1.878
- kurtosis
- 3.676
- n_outliers
- 1
- outlier_rate
- 0.02
- zero_rate
- 0.58
nutrition_grades
categorical labelThis is a Nutri-Score-style nutrition grade for each item, with six observed levels (a-e plus 'unknown'). The distribution is heavily skewed toward the worst grade: 'e' accounts for 27 of 50 rows (top_rate 0.54), while 'a' and 'b' together appear only 6 times. One row carries the literal value 'unknown' rather than null, so null_rate is 0.0 despite missing information. Treatment: Treat as ordered categorical (a
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- e
- top_rate
- 0.54
- cardinality
- 6
- entropy
- 1.913
- entropy_ratio
- 0.7399
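An ordered-categorical encoding could be sketched as follows. It assumes the usual Nutri-Score ordering from 'a' (best) to 'e' (worst) and maps the literal 'unknown' to missing; both choices are assumptions layered on top of the profile.

```python
import pandas as pd

# Illustrative grades, including the literal "unknown" seen in one row.
grades = pd.Series(["e", "d", "a", "unknown", "c", "b"])

# Values outside `grade_order` (here "unknown") become missing.
grade_order = ["a", "b", "c", "d", "e"]
grades_cat = pd.Series(pd.Categorical(grades, categories=grade_order, ordered=True))

print(grades_cat.cat.codes.tolist())  # integer rank per row, -1 for missing
```

The integer codes preserve the grade order, so they can feed models that expect an ordinal feature.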
image_url
categorical identifier long_tailThis column holds Open Food Facts product image URLs, one per row, all pointing to front-of-package JPEGs at 400px width. Every one of the 50 values is unique (entropy_ratio 1.0, top_rate 0.02), so it functions as a per-row asset link rather than a categorical feature. URL paths mix _fr and _en locale suffixes, hinting at a multilingual product catalog. Treatment: Drop from modelling; retain as a reference link or fetch for downstream image features.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.400.jpg
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
sources
unknown other skippedThe column "sources" was skipped by the profiler, so its kind is unknown and no statistics (uniqueness, value distribution, type) were computed. Only two facts are available: 50 rows were seen and none were null. Without further inspection, nothing can be said about its content or structure. Treatment: Re-profile or manually inspect this column before any downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
languages_hierarchy
unknown other skippedThe column 'languages_hierarchy' was skipped by the profiler, so no statistics are available beyond a row count of 50 and a null rate of 0. The name suggests a nested or structured representation of languages (likely a list or path-like string), but the dissector did not characterize its values. No uniqueness, length, or value-distribution signals are present to confirm. Treatment: Re-profile with a parser that handles nested/structured values before deciding on use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
pnns_groups_1
categorical feature
This is a PNNS food group classifier with 7 distinct categories and no nulls across 50 rows. The distribution is severely imbalanced: 'Sugary snacks' accounts for 76% of records (38 rows), with an entropy ratio of just 0.48, suggesting the sample is dominated by one food type. Two rows are explicitly labeled 'unknown', and the five remaining categories share the other 10 rows. Treatment: One-hot encode, but expect the 'Sugary snacks' class to dominate any downstream model.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 7
- top_value
- Sugary snacks
- top_rate
- 0.76
- cardinality
- 7
- entropy
- 1.36
- entropy_ratio
- 0.4846
countries_lc
categorical featureLowercase ISO-style language or country codes with 6 distinct values across 50 rows and a 2% null rate. The distribution is heavily English-dominant (en at 28, top_rate 0.57) followed by fr at 16, leaving es/de/it/pl as singletons or near-singletons. Entropy ratio of 0.59 confirms the long tail is thin and unlikely to support per-class modelling. Treatment: Group rare codes into an 'other' bucket and one-hot encode; impute the 2% nulls.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 6
- top_value
- en
- top_rate
- 0.5714
- cardinality
- 6
- entropy
- 1.521
- entropy_ratio
- 0.5883
creator
categorical metadata long_tailUsernames of the people or bots that created each record, with 13 distinct creators across 50 rows and no nulls. Two accounts dominate: 'openfoodfacts-contributors' at 46% (23 rows) and 'kiliweb' at 15 rows, together covering 76% of the column, while the remaining creators each appear once or twice — a classic long tail flagged in alerts. Treatment: Collapse rare creators into an 'other' bucket before any encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 13
- top_value
- openfoodfacts-contributors
- top_rate
- 0.46
- cardinality
- 13
- entropy
- 2.351
- entropy_ratio
- 0.6353
ingredients
unknown free_text skipped
This column is named 'ingredients', but the profiler skipped it (kind=unknown, no stats computed). Across 50 rows there are zero nulls, but uniqueness, types, and value distribution are all unknown. Based on the name alone it is likely a list or free-text field of ingredient entries, which would explain why a generic profiler bailed out. Treatment: Parse into a list and one-hot or tokenize/embed before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
product_name_nl
categorical free_text long_tail null_rate
Dutch-language product names, but the column is mostly empty: 76% of rows are null, and of the 12 populated rows 6 are blank strings, leaving only 6 actual names (7 unique values once the empty string is counted). The surviving entries are a language mix (English 'Dark absolute', French 'Tartines craquantes multi-céréales', Dutch 'Volkoren cracotte'), so the field is not consistently Dutch despite its name. Treatment: Drop or defer; too sparse and linguistically inconsistent to use without upstream cleanup.
- n
- 50
- nulls
- 38 (76.0%)
- unique
- 7
- top_value
- (empty string)
- top_rate
- 0.5
- cardinality
- 7
- entropy
- 2.292
- entropy_ratio
- 0.8166
origin_es
categorical metadata null_rate imbalanceThis appears to be a Spanish-language origin field, but it carries no usable signal in this sample. 60% of rows are null and the remaining 20 non-null entries are all the empty string, giving cardinality 1 and entropy 0. Treatment: Drop; the column has no variation and a 60% null rate.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_pl
categorical metadata long_tail null_ratePolish-language product names, populated for only 10% of rows (null_rate 0.9) with just 3 distinct values across 50 records. The top value is the empty string at 60%, leaving only two real product names ('Czekolada gorzka 74%' and 'Excellence 70% Cocoa Intense Dark') appearing once each. Both the long_tail and null_rate alerts fire, and empty strings are being counted as a category rather than nulls. Treatment: Normalise empty strings to null and treat as a sparse localisation field; drop unless Polish-market analysis is required.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 3
- top_value
- (empty string)
- top_rate
- 0.6
- cardinality
- 3
- entropy
- 1.371
- entropy_ratio
- 0.865
scores
unknown other skippedThe column 'scores' was skipped by the profiler and reports kind 'unknown', so no statistics, uniqueness, or value distribution were computed. The only confirmed signals are 50 rows and a 0.0 null rate; everything else is missing. Without type inference or sample values, the column's actual content (numeric scores, lists, structured objects) cannot be determined from this evidence. Treatment: Re-profile with type coercion or inspect raw values before deciding on a downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
brands
categorical feature long_tailBrand name of each product, with 41 distinct values across 50 rows and no nulls. The distribution is essentially flat (entropy ratio 0.97), with Lindt leading at just 8% (4 occurrences) and most brands appearing once — a long tail flagged explicitly. One value is in Arabic script (عربي), suggesting mixed-language entries that may need normalization. Treatment: Group rare brands into an 'other' bucket and normalize encodings/scripts before one-hot or target encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 41
- top_value
- Lindt
- top_rate
- 0.08
- cardinality
- 41
- entropy
- 5.214
- entropy_ratio
- 0.9731
ingredients_text_de
categorical free_text long_tail null_rate
German-language ingredient declarations, likely scraped from product packaging (e.g. Kakaomasse, Zucker, Weizenmehl). Coverage is poor: 60% of rows are null and the most common value is the empty string (5 rows, 25% of the 20 non-nulls), while the other 15 distinct strings each appear once and are essentially free text with allergen markup like _SOJA_ and _WEIZENMEHL_. An entropy ratio of 0.94 confirms each populated row is nearly unique, so this behaves as free text rather than a category. Treatment: Treat as free text: parse comma-separated tokens or embed; do not one-hot encode.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 16
- top_value
- (empty string)
- top_rate
- 0.25
- cardinality
- 16
- entropy
- 3.741
- entropy_ratio
- 0.9354
ingredients_text_nb
categorical free_text null_rate imbalanceThis appears to be a Norwegian Bokmål ingredients text field, likely from a multilingual product dataset. It is effectively empty: 96% of the 50 rows are null, and the only non-null value across the remaining 2 rows is an empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; no usable signal at this sample size.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packagings_n
numeric feature outliersLikely a count of packaging components per product, ranging from 1 to 5 with a mean of 2.07 and median of 2. The IQR is 0 because Q1 and Q3 both equal 2, which mechanically labels nearly half the rows (outlier_rate 0.488, n_outliers 20) as outliers — a quirk of the IQR rule on a low-cardinality integer, not a data quality issue. Note the 18% null rate and only 5 distinct values across 50 rows. Treatment: Treat as a small-count integer feature; impute the 18% nulls and ignore the IQR-flagged outliers.
- n
- 50
- nulls
- 9 (18.0%)
- unique
- 5
- min
- 1
- max
- 5
- mean
- 2.073
- median
- 2
- std
- 0.8772
- q1
- 2
- q3
- 2
- iqr
- 0
- skew
- 0.9834
- kurtosis
- 1.602
- n_outliers
- 20
- outlier_rate
- 0.4878
- zero_rate
- 0
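The IQR quirk is easy to reproduce. The sketch below assumes the profiler uses the standard 1.5 x IQR fences; the array is one hypothetical set of 41 values consistent with the reported mean and quartiles, not the actual data.

```python
import numpy as np

# Hypothetical reconstruction: 41 non-null values with mean 2.073 and Q1 = Q3 = 2.
values = np.array([1] * 10 + [2] * 21 + [3] * 8 + [4, 5])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# With IQR = 0 both fences collapse onto 2, so anything other than 2 is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr, len(outliers), round(len(outliers) / len(values), 4))
```

Under these assumptions 20 of 41 values are flagged (rate 0.4878), matching the reported figures even though every value is a plausible packaging count.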
complete
numeric labelBinary 0/1 indicator (n_unique=2, min=0, max=1) likely flagging completion status. Only 32% of rows are marked complete (mean=0.32, zero_rate=0.68), so the negative class dominates roughly 2:1. No nulls or outliers across the 50 rows. Treatment: Treat as binary target; account for the 68/32 class imbalance during modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- min
- 0
- max
- 1
- mean
- 0.32
- median
- 0
- std
- 0.4712
- q1
- 0
- q3
- 1
- iqr
- 1
- skew
- 0.7717
- kurtosis
- -1.404
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.68
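One common way to account for the 68/32 imbalance is inverse-frequency class weighting. The sketch below uses the counts implied by the profile (34 zeros, 16 ones) and the heuristic weight = n / (k * count); this weighting scheme is a swapped-in illustration, not something the report prescribes.

```python
from collections import Counter

# Counts implied by mean = 0.32 over 50 rows: 16 complete, 34 incomplete.
labels = [0] * 34 + [1] * 16

counts = Counter(labels)
n, k = len(labels), len(counts)

# Each class gets weight n / (k * count), so the rarer class weighs more.
class_weight = {cls: n / (k * c) for cls, c in counts.items()}

print(class_weight)
```

The resulting weights are roughly 0.74 for the majority class and 1.56 for the minority class, so each class contributes equally in aggregate to a weighted loss.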
emb_codes_20141016
categorical metadata long_tail null_rateThis appears to be a packager/manufacturer code field from an Open Food Facts-style export, dated 2014-10-16, mixing French EMB establishment codes (e.g., 'EMB 44068A') with free-text manufacturer descriptors in multiple languages (German, Spanish). With only 50 rows, 58% are null and another 30% (15/50) are empty strings as the top value, leaving just 6 distinct non-empty entries — each appearing exactly once. Entropy ratio of 0.57 and the dominance of blanks make this column nearly unusable as-is. Treatment: Drop or defer; coverage is too sparse and values too heterogeneous to feature-engineer without a dedicated parser.
- n
- 50
- nulls
- 29 (58.0%)
- unique
- 7
- top_value
- (empty string)
- top_rate
- 0.7143
- cardinality
- 7
- entropy
- 1.602
- entropy_ratio
- 0.5705
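The entropy figures on this card can be checked by hand. The sketch below assumes the profiler reports Shannon entropy in bits over non-null values and divides by log2(cardinality) to get entropy_ratio; that assumption reproduces the 1.602 and 0.5705 shown above. The code names are placeholders, not the real cell contents.

```python
from collections import Counter
from math import log2

def entropy_stats(values):
    """Return (Shannon entropy in bits, entropy / log2(number of distinct values))."""
    counts = Counter(values)
    n = sum(counts.values())
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    max_entropy = log2(len(counts)) if len(counts) > 1 else 0.0
    ratio = entropy / max_entropy if max_entropy else 0.0
    return entropy, ratio

# Placeholder for the 21 non-null cells: 15 empty strings plus 6 singleton codes.
cells = [""] * 15 + [f"code_{i}" for i in range(6)]
entropy, ratio = entropy_stats(cells)
print(round(entropy, 3), round(ratio, 4))
```

A ratio near 1 means the non-null values are close to uniformly spread; a low ratio, as here, means one value (the blank) holds most of the mass.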
packaging_text_ja
categorical free_text long_tail null_rate imbalanceJapanese-language packaging text, almost entirely absent: 98% of the 50 rows are null and the single non-null value is itself an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here for any downstream task. Treatment: Drop the column; it is effectively empty.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_de
categorical free_text long_tail null_rate
German-language generic product name, likely a free-text label for food items (chocolates, biscuits, spreads). Coverage is poor: null_rate is 0.6 and the top value is the empty string at 12 of the 20 non-null rows, while the remaining 8 unique values each appear once, indicating no repetition across products. Entropy_ratio of 0.68 reflects the empty-string mass dominating an otherwise unique long tail. Treatment: Treat as free text; impute missing and normalize/tokenize before any categorical use.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 9
- top_value
- top_rate
- 0.6
- cardinality
- 9
- entropy
- 2.171
- entropy_ratio
- 0.6849
last_editor
categorical metadata long_tail
Likely the username or bot handle that last edited each record. One contributor, "foodless", dominates with 21 of the 49 non-null rows (top_rate 0.43), while the remaining 28 rows spread across 23 other editors, producing a long tail and entropy ratio of 0.77. One row (2%) is null, and several handles look like apps/bots (e.g., municorn-calorie-counter-app, macrofactor) mixed with human usernames. Treatment: Group rare editors into an "other" bucket and keep as a categorical provenance feature.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 24
- top_value
- foodless
- top_rate
- 0.4286
- cardinality
- 24
- entropy
- 3.513
- entropy_ratio
- 0.7662
last_image_t
numeric timestamp high_skew
Values are 10-digit integers ranging from 1,639,159,016 to 1,767,675,445 with a median of 1,752,195,111. These are Unix epoch seconds, so the column is a 'last image' timestamp spanning from December 2021 to early January 2026, with the median falling in July 2025. All 50 rows are unique with no nulls or zeros, but the distribution is strongly left-skewed (skew -2.44, kurtosis 7.36) with 2 outliers (4%) sitting far below the bulk, indicating a few very stale records against an otherwise recent cluster. Treatment: Cast from epoch seconds to datetime and derive recency features rather than using the raw integer.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- min
- 1.639e+09
- max
- 1.768e+09
- mean
- 1.745e+09
- median
- 1.752e+09
- std
- 2.681e+07
- q1
- 1.735e+09
- q3
- 1.764e+09
- iqr
- 2.896e+07
- skew
- -2.443
- kurtosis
- 7.36
- n_outliers
- 2
- outlier_rate
- 0.04
- zero_rate
- 0
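A minimal sketch of the epoch-to-datetime cast follows. The three timestamps are the min, median, and max quoted above; `age_days` and its default reference point (the newest timestamp in the sample) are illustrative choices, not part of the dataset.

```python
from datetime import datetime, timezone

def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

oldest = to_utc(1_639_159_016)
median = to_utc(1_752_195_111)
newest = to_utc(1_767_675_445)

def age_days(epoch_seconds, reference=1_767_675_445):
    """Recency feature: days elapsed before a reference timestamp (hypothetical default)."""
    return (reference - epoch_seconds) / 86_400

print(oldest.date(), median.date(), newest.date())
print(round(age_days(1_639_159_016)))
```

Using a fixed reference rather than the current clock keeps the recency feature reproducible across runs.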
obsolete_since_date
categorical metadata imbalanceThis appears to be a date column marking when items became obsolete, but it carries no usable information in this sample. Across 50 rows there is a single non-null distinct value — the empty string — making up 100% of non-nulls (44 of 44), with a 12% null rate on top. Entropy is 0.0 and cardinality is 1, so the field is effectively blank. Treatment: Drop; the column is constant (empty) and offers no signal.
- n
- 50
- nulls
- 6 (12.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
countries_beforescanbot
categorical feature long_tailThis appears to be a multi-country list field (likely product distribution countries from an Open Food Facts–style source, captured before a scan-bot pass). With 38 unique values across 50 rows and entropy_ratio 0.965, it is nearly free-form: France leads at only 6 occurrences (top_rate 0.14), and many cells are comma-separated lists. Values mix languages (French 'Belgique', Spanish 'Bélgica', Dutch 'nl:Duitsland', English 'en:Morocco') and taxonomy-prefixed codes, plus a 14% null rate. Treatment: Split on comma, normalize language variants and 'xx:' prefixes to ISO country codes, then multi-hot encode.
- n
- 50
- nulls
- 7 (14.0%)
- unique
- 38
- top_value
- France
- top_rate
- 0.1395
- cardinality
- 38
- entropy
- 5.066
- entropy_ratio
- 0.9653
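The split, normalize, and multi-hot treatment could look roughly like this. The lookup table is a tiny hypothetical stand-in covering only the spellings quoted above; a real pipeline would need a full country taxonomy, and tokens missing from the table are silently dropped here.

```python
# Hypothetical lookup: lower-cased country spellings -> ISO 3166-1 alpha-2 codes.
COUNTRY_TO_ISO = {
    "france": "FR",
    "belgique": "BE",
    "bélgica": "BE",
    "duitsland": "DE",
    "morocco": "MA",
}

def parse_countries(cell):
    """Split a comma-separated cell, strip 'xx:' prefixes, and map names to ISO codes."""
    if not cell:
        return set()
    codes = set()
    for token in cell.split(","):
        name = token.strip()
        if ":" in name:                      # e.g. 'nl:Duitsland' -> 'Duitsland'
            name = name.split(":", 1)[1]
        code = COUNTRY_TO_ISO.get(name.strip().lower())
        if code:
            codes.add(code)
    return codes

def multi_hot(cells):
    """Return (sorted vocabulary, one 0/1 row per input cell)."""
    parsed = [parse_countries(c) for c in cells]
    vocab = sorted(set().union(*parsed))
    return vocab, [[int(code in row) for code in vocab] for row in parsed]

vocab, rows = multi_hot(["France, Belgique", "Bélgica", "nl:Duitsland,en:Morocco", None])
print(vocab)
print(rows)
```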
nutrition_grade_fr
categorical label
This is the French Nutri-Score grade (a-e) for each food item, with one row coded as 'unknown'. The distribution is heavily skewed toward the worst grade: 'e' alone covers 54% of the 50 rows, grades 'd' and 'e' together cover 72% (36 of 50), and only 6 rows are 'a' or 'b'. Entropy ratio of 0.74 confirms moderate concentration rather than a balanced ordinal spread. Treatment: Treat as ordered categorical (a
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- e
- top_rate
- 0.54
- cardinality
- 6
- entropy
- 1.913
- entropy_ratio
- 0.7399
ingredients_with_specified_percent_sum
numeric featureThis appears to be a numeric feature summing the percentages of ingredients whose proportions are explicitly disclosed (likely on food product labels). The distribution is heavily zero-inflated with a zero_rate of 0.58 and median of 0.0, while non-zero values stretch up to 99.6 with mean 22.74 and std 32.88. The right skew (0.998) and bimodal shape (q1=0, q3=52.25) suggest two regimes: products with no specified percentages and those with substantial disclosure. Treatment: Consider a hurdle approach: a binary 'has_disclosure' flag plus the continuous value, since 58% of rows are zero.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 22
- min
- 0
- max
- 99.6
- mean
- 22.74
- median
- 0
- std
- 32.88
- q1
- 0
- q3
- 52.25
- iqr
- 52.25
- skew
- 0.9979
- kurtosis
- -0.5856
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.58
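One way to read the hurdle suggestion is as two derived features: a flag for whether any percentage is disclosed, and the disclosed sum only where the flag is set. The function name and the choice to return `None` for the zero regime are illustrative assumptions.

```python
def hurdle_features(percent_sum):
    """Split a zero-inflated value into (has_disclosure flag, positive part or None)."""
    has_disclosure = int(percent_sum > 0)
    positive_part = percent_sum if has_disclosure else None
    return has_disclosure, positive_part

# Example values in the reported range (0 to 99.6).
for value in (0.0, 52.25, 99.6):
    print(value, hurdle_features(value))
```

A model can then learn the zero/non-zero decision from the flag and the magnitude from the positive part, instead of fitting one curve through both regimes.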
origin_it
categorical feature null_rate imbalance
This column appears to be the Italian-language variant of the origin field (the _it suffix matches the other per-language columns), but it carries no information: of 50 rows, 68% are null and the remaining 16 non-null values are all empty strings, giving a single unique value and zero entropy. There is no signal here to model on. Treatment: Drop; constant column with majority nulls.
- n
- 50
- nulls
- 34 (68.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
nutrition_data_per
categorical metadataThis column records the basis on which nutrition values are reported, taking only two values: '100g' and 'serving'. The encoding is heavily skewed, with 84% of the 50 rows using '100g' and the remaining 8 rows using 'serving', and there are no nulls. Analysts should note that nutrition figures in other columns are not directly comparable across rows without normalising to a common basis. Treatment: Use as a grouping flag and normalise nutrition fields to a single basis before aggregation.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 100g
- top_rate
- 0.84
- cardinality
- 2
- entropy
- 0.6343
- entropy_ratio
- 0.6343
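A sketch of the normalisation step follows, under the assumption that a per-serving figure can be rescaled with the product's serving size in grams (for example from serving_quantity, whose unit is not confirmed by this report). The function and parameter names are hypothetical.

```python
def to_per_100g(value, basis, serving_quantity_g=None):
    """Rescale a nutrition value to a per-100g basis.

    basis is the nutrition_data_per value ('100g' or 'serving').
    Returns None when a per-serving value cannot be rescaled.
    """
    if basis == "100g":
        return value
    if basis == "serving":
        if not serving_quantity_g:           # missing or zero serving size
            return None
        return value * 100.0 / serving_quantity_g
    raise ValueError(f"unexpected basis: {basis!r}")

print(to_per_100g(12.0, "100g"))
print(to_per_100g(5.0, "serving", serving_quantity_g=25))
print(to_per_100g(5.0, "serving"))
```

Rows that come back as `None` are better excluded from aggregates than mixed in on their original basis.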
origin_pl
categorical metadata null_rate imbalance
This appears to be the Polish-language variant of the origin field (the _pl suffix matches the other per-language columns), but it carries almost no information here. 90% of the 50 rows are null, and the remaining 5 non-null entries are all empty strings, giving cardinality 1 and entropy 0. Treatment: Drop; the column is 90% null with only empty strings remaining.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product
unknown other skippedThe column 'product' was skipped by the profiler, so no type, cardinality, or value statistics are available beyond a row count of 50 and a null rate of 0.0. Without unique counts or sample values, the role of this column cannot be inferred from the evidence. Treatment: Re-run the profiler on this column to obtain stats before deciding on treatment.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
link
categorical metadata long_tail
URLs pointing to product pages, likely a manufacturer/source link field. The dominant value is the empty string at 21 of the 48 non-null rows (top_rate 0.4375); together with the 4% null rate, 23 of 50 rows (46%) carry no link. The remaining 27 unique values each appear once and look like one-off product URLs across various brand domains, hence the long_tail alert and entropy ratio of 0.76. Treatment: Extract domain as a low-cardinality feature; drop the raw URL as it's near-unique and often blank.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 28
- top_value
- top_rate
- 0.4375
- cardinality
- 28
- entropy
- 3.663
- entropy_ratio
- 0.762
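Domain extraction can be sketched with the standard library. The URLs in the usage lines are made-up placeholders on example.com, not values from the column, and stripping a leading "www." is an illustrative normalisation choice.

```python
from urllib.parse import urlparse

def link_domain(url):
    """Return the lower-cased host of a URL without a leading 'www.', or None if blank."""
    if not url or not url.strip():
        return None
    candidate = url.strip()
    if "://" not in candidate:               # tolerate scheme-less links
        candidate = "//" + candidate
    host = urlparse(candidate).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host

print(link_domain("https://www.example.com/products/123"))
print(link_domain("example.com/page"))
print(link_domain(""))
```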
ingredients_text_nl
categorical free_text long_tail null_rateDutch-language ingredient lists for food products, present for only 24% of the 50 rows (null_rate 0.76). Among the 12 non-null entries there are 9 distinct strings with high entropy_ratio 0.92, and the modal value is actually the empty string (4 occurrences) rather than a real ingredient list. Contents range from short declarations like 'Aardappelen, zonnebloemolie, zeezout.' to long packaging blurbs containing addresses and URLs, so the field mixes ingredients with marketing text. Treatment: Treat empty strings as nulls, then tokenize and embed (or parse comma-separated ingredients) before modelling.
- n
- 50
- nulls
- 38 (76.0%)
- unique
- 9
- top_value
- top_rate
- 0.3333
- cardinality
- 9
- entropy
- 2.918
- entropy_ratio
- 0.9206
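A minimal sketch of the empty-string handling and a naive comma split follows, using the short declaration quoted above as input. The split is deliberately simple and will mis-handle nested parentheses or commas inside percentages; function names are illustrative.

```python
def clean_text(cell):
    """Map None, empty, and whitespace-only cells to None; otherwise strip the text."""
    if cell is None or not cell.strip():
        return None
    return cell.strip()

def split_ingredients(cell):
    """Naive comma split of an ingredient declaration into lower-cased tokens."""
    text = clean_text(cell)
    if text is None:
        return []
    parts = text.rstrip(".").split(",")
    return [p.strip().lower() for p in parts if p.strip()]

print(clean_text(""))
print(split_ingredients("Aardappelen, zonnebloemolie, zeezout."))
```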
additives_n
numeric featureCount of additives per product, ranging from 0 to 8 across 50 rows with no nulls and only 8 distinct values. The distribution is heavily right-skewed (skew 1.47, kurtosis 2.10) with a zero_rate of 0.4 and median of 1, while a small tail produces 2 outliers (outlier_rate 0.04). Mean (1.52) sits well above the median, confirming a few additive-heavy products pull the average up. Treatment: Treat as a discrete count; consider log1p or binning (0 vs 1+ vs many) before modelling given the skew and high zero_rate.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 8
- min
- 0
- max
- 8
- mean
- 1.52
- median
- 1
- std
- 1.821
- q1
- 0
- q3
- 2
- iqr
- 2
- skew
- 1.473
- kurtosis
- 2.105
- n_outliers
- 2
- outlier_rate
- 0.04
- zero_rate
- 0.4
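Both suggested transforms are short to write down. In the sketch below the cut-off for "many" (3 or more) is an arbitrary illustration, not a threshold taken from the data.

```python
from math import log1p

def additive_bucket(n, many_threshold=3):
    """Bin an additive count into 'none', 'some', or 'many' (threshold is hypothetical)."""
    if n == 0:
        return "none"
    return "some" if n < many_threshold else "many"

# Counts spanning the reported range (0 to 8).
for n in (0, 1, 2, 8):
    print(n, round(log1p(n), 3), additive_bucket(n))
```

`log1p` keeps zero at zero and compresses the upper tail, while the buckets trade resolution for robustness on a 50-row sample.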
generic_name_sv
categorical free_text long_tail null_rateSwedish-language generic product name, populated for only 4 of 50 rows (null_rate 0.92). The four observed values are all distinct (entropy_ratio 1.0), including one empty string, so there is effectively no usable signal here. Top value 'Fin mörk choklad med 90% kakao' appears just once. Treatment: Drop or defer — too sparse (92% null) and unique to model.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 4
- top_value
- Fin mörk choklad med 90% kakao
- top_rate
- 0.25
- cardinality
- 4
- entropy
- 2
- entropy_ratio
- 1
known_ingredients_n
numeric featureA non-negative integer count of recognised ingredients per record, ranging from 0 to 36 with a mean of 11.76 and median of 9. The distribution is right-skewed (skew 0.86) with a wide IQR of 13.5, and 4% of rows are zero — meaning a small fraction had no ingredients matched at all. No outliers were flagged and there are no nulls across the 50 rows. Treatment: Consider a log1p transform before modelling to tame the right skew.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 22
- min
- 0
- max
- 36
- mean
- 11.76
- median
- 9
- std
- 8.721
- q1
- 5
- q3
- 18.5
- iqr
- 13.5
- skew
- 0.8598
- kurtosis
- 0.07411
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.04
completeness
numeric feature outliersA numeric quality score named 'completeness', bounded loosely between 0.575 and 1.1 with mean 0.91 and median 0.9, so most rows are near-complete. The max of 1.1 is suspicious for a metric that nominally caps at 1.0, and 12% of values flag as outliers with a left skew of -0.67, suggesting a tail of poorly-populated records. Only 14 unique values across 50 rows hints at a discretised or rounded score rather than a continuous measurement. Treatment: Clip values above 1.0 and inspect the low-end outliers before using as a quality filter.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 14
- min
- 0.575
- max
- 1.1
- mean
- 0.91
- median
- 0.9
- std
- 0.1358
- q1
- 0.8875
- q3
- 1
- iqr
- 0.1125
- skew
- -0.6678
- kurtosis
- 0.32
- n_outliers
- 6
- outlier_rate
- 0.12
- zero_rate
- 0
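The clipping and low-end check can be sketched from the quartiles listed above. The sketch assumes the profiler's outlier rule is the usual 1.5 × IQR fence; under that rule the upper fence sits at about 1.169, so the six flagged outliers must all lie on the low side.

```python
# Quartiles copied from the stats above.
Q1, Q3 = 0.8875, 1.0
IQR = Q3 - Q1

# Assumed 1.5 * IQR fences.
LOWER_FENCE = Q1 - 1.5 * IQR
UPPER_FENCE = Q3 + 1.5 * IQR

def clip_completeness(score):
    """Cap the score at 1.0, following the treatment note."""
    return min(score, 1.0)

def is_low_outlier(score):
    """True when the score falls below the lower IQR fence."""
    return score < LOWER_FENCE

print(round(LOWER_FENCE, 5), round(UPPER_FENCE, 5))
print(clip_completeness(1.1), is_low_outlier(0.575), is_low_outlier(0.9))
```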
ingredients_sweeteners_n
numeric feature constantThis column appears to count sweetener ingredients per record, but every one of the 50 rows holds the value 0 (zero_rate 1.0, n_unique 1, std 0.0). It carries no information for modelling and is flagged constant. Treatment: Drop; constant column with zero variance.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 1
- min
- 0
- max
- 0
- mean
- 0
- median
- 0
- std
- 0
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 1
nova_groups
categorical feature
This column holds NOVA food classification groups, a 4-level ordinal scheme encoded as strings ('1' through '4'). Only 3 of the 4 possible groups appear across 50 rows, with group '4' (ultra-processed) dominating at 33 of the 48 non-null rows (top_rate 0.6875) and group '2' entirely absent. Null rate is 0.04 and entropy_ratio is 0.64, indicating concentration toward the ultra-processed end. Treatment: Treat as ordinal (cast to int) and impute the 4% missing before modelling.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 3
- top_value
- 4
- top_rate
- 0.6875
- cardinality
- 3
- entropy
- 1.006
- entropy_ratio
- 0.635
allergens_hierarchy
unknown feature skippedThis column is labeled 'allergens_hierarchy', suggesting it holds hierarchical allergen tags (likely a list or delimited path structure). Saturn skipped profiling, so no uniqueness, cardinality, or value statistics are available beyond the fact that all 50 rows are non-null. Without parsed content, the structure and value distribution cannot be characterized. Treatment: Parse the hierarchy into a list, then one-hot or multi-label encode allergen tags before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
obsolete
categorical metadata imbalanceThe 'obsolete' column has a single observed value—an empty string—across all 44 non-null rows, with the remaining 12% of rows null. Cardinality is 1 and entropy is 0, so this column carries no information as-is. The name suggests a deprecated flag, consistent with it being effectively unused. Treatment: Drop; constant column with no signal.
- n
- 50
- nulls
- 6 (12.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_sv
categorical metadata null_rate imbalanceThis appears to be a source/origin indicator (likely Swedish, given the _sv suffix) but it carries virtually no information in this sample. With a 92% null rate and the only non-null value being an empty string repeated 4 times, cardinality is 1 and entropy is 0. The column is effectively constant and unusable as-is. Treatment: Drop the column; it is 92% null and otherwise constant.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_hierarchy
unknown other skippedThe column packaging_hierarchy was skipped by the profiler, so no type, uniqueness, or distribution stats are available. All 50 rows are non-null, but every other signal (kind, n_unique, summary stats) is missing. Without further inspection the contents and structure remain unknown. Treatment: Re-profile or manually inspect a sample before deciding on downstream handling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
ingredients_with_unspecified_percent_n
numeric featureLikely a count of ingredients on a product whose declared percentage is unspecified, ranging from 1 to 33 with a mean of 8.8 and median of 7. The distribution is right-skewed (skew 1.64, kurtosis 3.55) with two outliers (4%) pulling the upper tail toward 33, well above the Q3 of 11. Every row has a value (null_rate 0, zero_rate 0), so no product in this sample fully discloses ingredient percentages. Treatment: Apply a log or sqrt transform before modelling to tame the right skew.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 18
- min
- 1
- max
- 33
- mean
- 8.8
- median
- 7
- std
- 6.061
- q1
- 5
- q3
- 11
- iqr
- 6
- skew
- 1.645
- kurtosis
- 3.545
- n_outliers
- 2
- outlier_rate
- 0.04
- zero_rate
- 0
fruits-vegetables-nuts_100g_estimate
numeric feature null_rate high_skewThis is an estimated percentage of fruits, vegetables, and nuts per 100g of product. The signal is almost absent: 46% of rows are null, and of the 27 non-null values, 96.3% are zero, leaving essentially one non-zero observation at 85.0 that drives the mean of 3.15 and skew of 4.9. Treatment: Drop or collapse to a binary has_value flag; the column carries almost no variance.
- n
- 50
- nulls
- 23 (46.0%)
- unique
- 2
- min
- 0
- max
- 85
- mean
- 3.148
- median
- 0
- std
- 16.36
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 4.903
- kurtosis
- 22.04
- n_outliers
- 1
- outlier_rate
- 0.03704
- zero_rate
- 0.963
emb_codes
categorical metadata long_tail
Looks like a free-form certification/packaging code field (FSC-*, EMB *, LPL.*) with mixed formats including one company-name string. The column is dominated by empty strings, which fill 35 of the 48 non-null rows (top_rate 0.73), and has a 4% null rate on top, leaving only 13 rows with a real code spread across the 10 non-empty unique values. Entropy ratio of 0.50 and the long_tail alert confirm most non-empty codes appear only once or twice. Treatment: Treat empty strings as missing and consider dropping or collapsing into a binary has_code flag given the sparsity and long tail.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 11
- top_value
- top_rate
- 0.7292
- cardinality
- 11
- entropy
- 1.72
- entropy_ratio
- 0.4972
packagings
unknown other skippedThis column was skipped by the profiler, so no statistics are available beyond a row count of 50 and a null rate of 0.0. The name 'packagings' suggests it likely holds nested or structured packaging descriptions (lists or objects), which is consistent with the profiler classifying its kind as 'unknown' and emitting a 'skipped' alert. Without unique counts or value summaries, nothing further can be inferred. Treatment: Inspect raw values and parse/normalize the structure before deciding on a downstream treatment.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
image_front_url
categorical identifier long_tail
Per-row URLs pointing to Open Food Facts product front images (images.openfoodfacts.org), with locale suffixes like front_fr and front_en in the path. All 50 values are unique (entropy_ratio 1.0, top_rate 0.02) and there are no nulls, so this is effectively a 1:1 asset link rather than a feature. Treatment: Treat as an asset URL: drop from modelling, or fetch images out-of-band for vision pipelines.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.400.jpg
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
origin_fi
categorical metadata null_rate imbalance
This appears to be the Finnish-language variant of the origin field (the _fi suffix matches the other per-language columns), and it is essentially empty. 90% of the 50 rows are null, and the remaining 5 non-null entries are all the empty string, giving a single unique value and zero entropy. There is no usable signal here. Treatment: Drop; the column is 90% null and the remaining values are blank.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
images
unknown other skippedThe column 'images' was skipped by the profiler, so its kind is unknown and no descriptive statistics were computed. Only the row count (50) and a null rate of 0.0 are available; uniqueness, type, and value distribution are all missing. The name suggests binary or path-like image payloads, which would explain why the dissector bypassed it. Treatment: Inspect raw values manually to confirm format, then route to an image-processing pipeline rather than tabular modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
ingredients_analysis
unknown other skippedThe column 'ingredients_analysis' was skipped by the profiler, so no type, uniqueness, or distribution statistics are available. The only confirmed signals are 50 rows present and a 0.0 null rate. Without further inspection, its content and structure remain unknown. Treatment: Inspect raw values manually to determine type before any downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
ingredients_text_with_allergens_pl
categorical free_text long_tail null_ratePolish-language ingredient text with embedded allergen HTML markup, almost entirely absent from this sample. 92% of 50 rows are null and only 3 distinct values appear, two of which are unique product descriptions and one is an empty string (top_rate 0.5 among non-nulls). Treatment: Drop for modelling given 92% nulls; if retained, strip HTML allergen tags and treat as free text.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 3
- top_value
- top_rate
- 0.5
- cardinality
- 3
- entropy
- 1.5
- entropy_ratio
- 0.9464
product_name_de
categorical free_text long_tail null_rate
German-language product names, almost certainly the localized display label for food/confectionery items (chocolate, biscuits, Nutella). 60% of rows are null and the top non-null value is an empty string occurring 5 times, so only 15 of the 50 rows carry a real name, each of them distinct. Entropy ratio of 0.935 confirms the populated values are nearly all unique, and at least one entry ('Lightly Sea Salted') is English rather than German. Treatment: Treat as free text: normalize empty strings to null, then tokenize/embed if used as a feature.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 16
- top_value
- top_rate
- 0.25
- cardinality
- 16
- entropy
- 3.741
- entropy_ratio
- 0.9354
ingredients_text_with_allergens_nb
categorical free_text null_rate imbalanceThis appears to be a Norwegian-language ingredients text field with allergen annotations, likely a localized variant of a product description column. It is effectively empty: 96% of rows are null and the only non-null value across the remaining 2 records is an empty string, giving a single unique value and zero entropy. Treatment: Drop; the column carries no information at this sample size.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_it
categorical free_text long_tail null_rateItalian-language packaging description text, almost entirely absent from this sample. 68% of rows are null and of the 16 non-null entries, 14 are empty strings, leaving only 2 substantive Italian descriptions of recycling instructions. With cardinality of 3 and a top_rate of 0.875 on the empty string, this column carries virtually no usable signal here. Treatment: Drop unless joined with a much larger Italian-locale slice; too sparse to model.
- n
- 50
- nulls
- 34 (68.0%)
- unique
- 3
- top_value
- top_rate
- 0.875
- cardinality
- 3
- entropy
- 0.6686
- entropy_ratio
- 0.4218
product_name_it
categorical free_text long_tail null_rateItalian-language product name field, mostly empty: 68% of the 50 rows are null and the modal non-null value is the empty string "" (5 occurrences, top_rate 0.3125). Among the 12 distinct values the names are heterogeneous chocolate and snack labels (e.g. "Fondente Prodigioso 90% Cacao", "Pringles classiche 175 gr", "Milka"), with case-variant duplicates like "cioccolato fondente" vs "Cioccolato fondente" inflating cardinality. Entropy ratio 0.913 confirms the non-null tail is essentially flat, each name appearing once. Treatment: Normalise case/whitespace, treat empty strings as null, then tokenize and embed; not usable as a categorical feature given 68% nulls.
- n
- 50
- nulls
- 34 (68.0%)
- unique
- 12
- top_value
- top_rate
- 0.3125
- cardinality
- 12
- entropy
- 3.274
- entropy_ratio
- 0.9134
serving_quantity
categorical feature long_tailNumeric serving sizes stored as strings, with 27 distinct values across 50 rows and a 12% null rate. The distribution is long-tailed: top values "100" and "10" each cover only 7 records (top_rate 0.159), entropy_ratio is 0.909 indicating values are spread almost uniformly, and outliers like "1000" and decimals like "11.5" sit alongside round numbers. Treatment: Cast to numeric, impute the 12% nulls, and consider log-transforming before modelling.
- n
- 50
- nulls
- 6 (12.0%)
- unique
- 27
- top_value
- 100
- top_rate
- 0.1591
- cardinality
- 27
- entropy
- 4.322
- entropy_ratio
- 0.9089
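Casting the strings to numbers and filling gaps might look like the sketch below. The sample cells reuse values quoted above ("100", "10", "11.5", "1000") plus a blank and a null; median imputation is one possible choice, not something this report prescribes.

```python
from math import isfinite
from statistics import median

def parse_quantity(cell):
    """Parse a serving size stored as text; return None for blanks or non-numeric text."""
    if cell is None:
        return None
    try:
        value = float(str(cell).strip())
    except ValueError:
        return None
    return value if isfinite(value) and value > 0 else None

cells = ["100", "10", "11.5", "1000", "", None]
parsed = [parse_quantity(c) for c in cells]

# Impute missing entries with the median of the parsed values (illustrative choice).
fill = median(v for v in parsed if v is not None)
imputed = [fill if v is None else v for v in parsed]
print(parsed)
print(imputed)
```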
product_name_ja
categorical metadata long_tail null_rate imbalanceThis appears to be a Japanese product name field that is effectively empty in this sample: 98% of the 50 rows are null and the single non-null value is itself the empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here whatsoever. Treatment: Drop the column; it is 98% null with a single empty-string value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_sv
categorical free_text long_tail null_rateSwedish-language ingredient lists with embedded HTML allergen markup (…), likely the Swedish localisation of a product ingredients field. Coverage is extremely poor: 92% null and only 4 distinct values across 50 rows, with the top value appearing just once (top_rate 0.25 over the non-null subset). One value is an empty string and others mix Swedish with Danish/Norwegian terms (HVEDEMEL, BYG, EGG), indicating inconsistent locale handling. Treatment: Strip HTML tags and parse allergen spans separately; given 92% nulls, do not use as a primary feature.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 4
- top_value
- kakaomassa, kakaosmör, fettreducerat kakaopulver, socker, vanilj.
- top_rate
- 0.25
- cardinality
- 4
- entropy
- 2
- entropy_ratio
- 1
ingredients_text_fr
categorical free_text long_tail
This is the French-language ingredients list for food products, stored as free-form text. The 48 non-null rows hold 47 distinct values and the entropy ratio is 0.998, so values are essentially all distinct long strings; 4% are null and the most common value is an empty string (2 occurrences). Contents range from a three-word 'Eau de source' to multi-sentence ingredient declarations with percentages, allergens and additive codes. Treatment: Tokenize and embed (or extract structured ingredient/allergen features) rather than treating as a category.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 47
- top_value
- top_rate
- 0.04167
- cardinality
- 47
- entropy
- 5.543
- entropy_ratio
- 0.998
nutrition_score_beverage
numeric feature high_skewThis appears to be a beverage-specific nutrition score, encoded as a numeric flag rather than a continuous metric: only 2 unique values across 50 rows, with min 0 and max 1. The distribution is overwhelmingly zero (zero_rate 0.98), leaving a single outlier at 1 that drives the extreme skew (6.86) and kurtosis (45.02). Effectively a near-constant indicator column. Treatment: Treat as a binary flag, or drop as near-constant since 98% of rows share one value.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- min
- 0
- max
- 1
- mean
- 0.02
- median
- 0
- std
- 0.1414
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 6.857
- kurtosis
- 45.02
- n_outliers
- 1
- outlier_rate
- 0.02
- zero_rate
- 0.98
ingredients_ids_debug
unknown other skippedThis column was skipped by the profiler, so no statistics beyond row count (50) and null rate (0.0) are available. The name suggests it holds debug-only ingredient identifiers, likely a complex or nested structure that the dissector could not categorize. Without unique counts or value samples, its content and utility cannot be assessed here. Treatment: Drop unless a downstream consumer specifically needs the raw debug payload.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
nutrition_data
categorical metadata imbalanceThis appears to be a flag indicating whether nutrition data is present, with the only observed value being "on" across all 49 non-null rows. Cardinality is 1 and entropy is 0, so the column carries no discriminative information; one row (2%) is null. Treatment: Drop, constant column with no variance.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 1
- top_value
- on
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_ja
categorical metadata long_tail null_rate imbalanceThis appears to be a Japanese-language origin field, likely a localized counterpart to a primary origin column. It is effectively empty: 98% of the 50 rows are null and the only non-null value is itself the empty string, yielding a single unique value and zero entropy. Treatment: Drop; the column carries no information.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_en
categorical free_text long_tail
English-language packaging descriptions, likely free-text recycling instructions scraped from product labels. Of 50 rows, 14% are null and another 39 (top_rate 0.91 of the 43 non-null) are empty strings, leaving only 4 rows with actual content, each a distinct value (5 unique values including the empty string). Entropy ratio of 0.27 confirms the column is almost entirely uninformative as-is. Treatment: Drop or defer; coverage is too sparse to model, but if retained treat empty strings as nulls and tokenize the remainder.
- n
- 50
- nulls
- 7 (14.0%)
- unique
- 5
- top_value
- top_rate
- 0.907
- cardinality
- 5
- entropy
- 0.6325
- entropy_ratio
- 0.2724
unknown_ingredients_n
numeric feature high_skew outliersThis is a count of unrecognised ingredients per row, ranging from 0 to 13 with a mean of 0.66. The distribution is dominated by zeros (zero_rate 0.84) with median, q1, and q3 all at 0, but a long right tail produces extreme skew (4.24) and kurtosis (18.32), with 8 outliers (16%) pulling the max to 13. Effectively a sparse anomaly indicator rather than a continuous count. Treatment: Binarise (zero vs non-zero) or cap before modelling; raw values are too skewed for linear models.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 6
- min
- 0
- max
- 13
- mean
- 0.66
- median
- 0
- std
- 2.255
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 4.236
- kurtosis
- 18.32
- n_outliers
- 8
- outlier_rate
- 0.16
- zero_rate
- 0.84
packaging_old_before_taxonomization
categorical free_text long_tail null_ratePre-taxonomy packaging descriptions captured as free text, mixing languages (French, Spanish, English, German) and multi-value comma-separated lists. With 36 unique values across 38 non-null rows and entropy ratio 0.99, the field is almost fully unique; even the top value 'plastique' covers only 7.9% and 24% are null. Values combine material terms, language prefixes like 'fr:'/'en:', and counts ('20 biscuits en 4 sachets'), so it behaves more like free text than a category. Treatment: Normalise and split on commas, then map tokens to a controlled packaging taxonomy before use.
- n
- 50
- nulls
- 12 (24.0%)
- unique
- 36
- top_value
- plastique
- top_rate
- 0.07895
- cardinality
- 36
- entropy
- 5.123
- entropy_ratio
- 0.9909
packaging_text_nb
categorical free_text null_rate imbalanceThis appears to be a Norwegian Bokmål packaging text field, but it is effectively empty: 96% of 50 rows are null and the only 2 non-null values are blank strings, yielding cardinality 1 and entropy 0. Treatment: Drop; the column carries no information.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
category_properties
unknown other skippedThis column was skipped by the profiler, so no type, uniqueness, or distribution stats were computed beyond a 50-row count and 0% null rate. The name 'category_properties' suggests it holds nested or structured per-category attributes (likely dict/list/JSON), which is why saturn flagged it as unknown rather than a scalar kind. Treatment: Inspect raw values and, if structured, flatten or JSON-normalize into separate columns before profiling again.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
nutriscore_score
numeric feature
This is the numeric Nutri-Score (typically -15 best to 40 worst), here ranging 0 to 40 with a mean of 17.47 and median of 19. The distribution is roughly symmetric (skew -0.16, kurtosis -0.53) with no outliers flagged, and 8.2% of non-null values are exactly zero. Only 2% of rows are null and the 49 non-null rows hold 28 unique values, so the column is well populated and reasonably varied. Treatment: Use directly as a numeric feature; impute the 2% nulls with the median.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 28
- min
- 0
- max
- 40
- mean
- 17.47
- median
- 19
- std
- 9.906
- q1
- 10
- q3
- 25
- iqr
- 15
- skew
- -0.1616
- kurtosis
- -0.5337
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.08163
labels_old
categorical free_text long_tail
Legacy multi-label tags for products, stored as comma-separated strings mixing French, English, Polish, Bulgarian (Cyrillic), and namespaced codes like 'en:CE'. With 38 unique values across 50 rows, an 8% null rate, and the most common value being the empty string at 19.6% of non-null rows, the field is sparse and nearly free-form. Entropy ratio 0.93 and the long_tail alert confirm that almost every non-empty value is a singleton. Treatment: Split on commas, normalise language/namespace prefixes, and one-hot the resulting tag tokens rather than treating the raw string as a category.
- n
- 50
- nulls
- 4 (8.0%)
- unique
- 38
- top_value
- (empty string)
- top_rate
- 0.1957
- cardinality
- 38
- entropy
- 4.903
- entropy_ratio
- 0.9343
packaging_text
categorical free_text long_tail
Free-text packaging descriptions, mostly in French with some English mixed in, detailing materials and recycling instructions. The dominant value is an empty string at 75% of non-null values (36 of 48 rows), and only 13 unique values exist with entropy ratio 0.46, so signal is sparse and long-tailed. Among non-empty entries, formats vary widely (multi-line itemized lists, comma-separated tags, uppercase marketing strings), suggesting no controlled vocabulary. Treatment: Normalise case/whitespace and parse material keywords into multi-hot features; treat empty string as missing.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 13
- top_value
- (empty string)
- top_rate
- 0.75
- cardinality
- 13
- entropy
- 1.708
- entropy_ratio
- 0.4614
ingredients_percent_analysis
numeric feature high_skew outliers
This appears to be a binary status flag for ingredient percent analysis, taking only 2 unique values (1.0 and -1.0) across 50 rows with no nulls. The distribution is heavily dominated by 1.0 (median, q1, q3 all 1.0; mean 0.84), with 4 outliers (8%) at -1.0 producing extreme negative skew (-3.10) and high kurtosis (7.59). Despite the numeric kind, the IQR of 0 and only two unique values indicate this is categorical rather than continuous. Treatment: Recode as a categorical/boolean flag rather than treating as continuous numeric.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- min
- -1
- max
- 1
- mean
- 0.84
- median
- 1
- std
- 0.5481
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- -3.096
- kurtosis
- 7.587
- n_outliers
- 4
- outlier_rate
- 0.08
- zero_rate
- 0
ecoscore_data
unknown other skipped
The column 'ecoscore_data' was skipped by the profiler, so no type, uniqueness, or distribution stats are available. Only the row count (50) and a null rate of 0.0 are reported. The name suggests it holds Eco-Score payloads, likely a nested/structured object that the profiler could not introspect. Treatment: Inspect raw values and parse the nested structure into typed sub-fields before use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
ingredients_text_sv
categorical free_text long_tail null_rate
Swedish-language ingredient lists for food products, free text rather than truly categorical. Coverage is extremely sparse: 92% null with only 4 distinct values across 50 rows, one of which is an empty string. The non-null entries are full ingredient declarations including allergen markers and mixed Swedish/Danish/Norwegian terms. Treatment: Treat as free text; given 92% nulls, drop or use only as a fallback to other-language ingredient columns.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 4
- top_value
- kakaomassa, kakaosmör, fettreducerat kakaopulver, socker, vanilj.
- top_rate
- 0.25
- cardinality
- 4
- entropy
- 2
- entropy_ratio
- 1
compared_to_category
categorical metadata long_tail
Holds an Open Food Facts category taxonomy code (e.g., 'en:dark-chocolate-bar-with-more-than-70-cocoa') used as a comparison reference. With 35 unique values across 50 rows and entropy ratio 0.95, the column is extremely diffuse: the modal category covers only 10% of rows and a long tail dominates. No nulls, but the high cardinality relative to sample size will make this hard to use as-is. Treatment: Roll up to a coarser taxonomy level (e.g., chocolate/biscuits/dairy) before any grouping or modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 35
- top_value
- en:dark-chocolate-bar-with-more-than-70-cocoa
- top_rate
- 0.1
- cardinality
- 35
- entropy
- 4.886
- entropy_ratio
- 0.9526
data_sources
categorical metadata long_tail
This column records the set of apps/databases that contributed each product's data, stored as a comma-separated list rather than a normalized relation. With 43 unique strings across 50 rows (entropy ratio 0.98) and the most common combination appearing only 4 times (top_rate 0.08), nearly every row has a bespoke source bundle. Notable: the values mix case ('yuka' vs 'Yuka') and overlap heavily on 'App - smoothie-openfoodfacts' and 'Apps', suggesting the same sources are repeatedly concatenated in different orders. Treatment: Split on commas, normalize case, and one-hot encode individual sources instead of treating the raw string as a category.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 43
- top_value
- App - yuka, Apps, App - Open Food Facts, App - smoothie-openfoodfacts
- top_rate
- 0.08
- cardinality
- 43
- entropy
- 5.309
- entropy_ratio
- 0.9783
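A minimal sketch of the split, case-normalise, and one-hot treatment, using only the standard library. The first row is the top value reported above; the second row is invented to show the 'yuka' vs 'Yuka' collapse, and both function names are hypothetical.

```python
def source_set(raw):
    """Split a data_sources string and normalise case and whitespace."""
    return {part.strip().lower() for part in raw.split(",") if part.strip()}

def one_hot_sources(rows):
    """Return (sorted vocabulary, list of 0/1 indicator vectors)."""
    sets = [source_set(r) for r in rows]
    vocab = sorted(set().union(*sets))
    return vocab, [[int(v in s) for v in vocab] for s in sets]

rows = [
    "App - yuka, Apps, App - Open Food Facts, App - smoothie-openfoodfacts",
    "Apps, App - Yuka",  # invented row with different order and casing
]
vocab, matrix = one_hot_sources(rows)
```

Because each row becomes a set, ordering differences between otherwise identical source bundles disappear.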
ingredients_from_palm_oil_n
numeric feature outliers
This is effectively a binary indicator counting palm-oil-derived ingredients per product, stored as numeric with values only 0 or 1 (n_unique=2, max=1.0). The column is heavily zero-dominated (zero_rate ≈ 0.85) with mean ≈ 0.152, and the 7 ones get flagged as outliers because the IQR is 0. Null rate is 8%, modest but worth noting. Treatment: Recast as a boolean palm-oil flag and impute the 8% nulls before modelling.
- n
- 50
- nulls
- 4 (8.0%)
- unique
- 2
- min
- 0
- max
- 1
- mean
- 0.1522
- median
- 0
- std
- 0.3632
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 1.937
- kurtosis
- 1.751
- n_outliers
- 7
- outlier_rate
- 0.1522
- zero_rate
- 0.8478
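The outlier flag on the ones follows directly from a zero IQR. A minimal sketch, assuming the profiler applies the common 1.5 × IQR fence rule (the report does not state its method), with the 46 non-null values reconstructed from the stats above:

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return values outside [q1 - k*iqr, q3 + k*iqr]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# 46 non-null rows: 39 zeros and 7 ones (mean 7/46, zero_rate 39/46).
values = [0] * 39 + [1] * 7
flagged = tukey_outliers(values)
```

With q1 and q3 both at 0 the fences collapse to a single point, so every 1 is flagged even though it is a legitimate value. That is why a boolean recode is more appropriate than outlier handling here.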
last_updated_t
numeric timestamp outliers
Values are unique 10-digit integers in the ~1.74e9–1.77e9 range, which is the Unix-epoch band for early 2025 through early 2026, consistent with the column name suggesting a 'last updated' timestamp. The distribution is heavily left-skewed (skew -1.94) with 12% flagged as outliers: a handful of much older updates pull the tail down while most rows cluster within a ~6.1M-second IQR (~71 days). No nulls or zeros. Treatment: Cast from Unix seconds to datetime and derive recency features rather than using the raw integer.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- min
- 1.739e+09
- max
- 1.769e+09
- mean
- 1.763e+09
- median
- 1.767e+09
- std
- 8.037e+06
- q1
- 1.762e+09
- q3
- 1.768e+09
- iqr
- 6.138e+06
- skew
- -1.945
- kurtosis
- 2.892
- n_outliers
- 6
- outlier_rate
- 0.12
- zero_rate
- 0
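The cast-and-derive treatment can be sketched with the standard library. The two timestamps below are the rounded min and max from the stats above, so the resulting dates are approximate; `to_utc` and `days_since` are hypothetical helper names.

```python
from datetime import datetime, timezone

def to_utc(ts):
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

def days_since(ts, reference_ts):
    """Recency in whole days relative to a reference timestamp."""
    return (to_utc(reference_ts) - to_utc(ts)).days

# Rounded min and max reported for last_updated_t.
oldest, newest = to_utc(1_739_000_000), to_utc(1_769_000_000)
span_days = days_since(1_739_000_000, 1_769_000_000)
```

Using a fixed reference timestamp (for example the extraction time) rather than the current clock keeps the recency feature reproducible.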
nutrition_score_debug
categorical metadata imbalance
This looks like a debug/diagnostic field for a nutrition scoring pipeline, capturing which input nutrients were missing during computation. It is overwhelmingly empty: 49 of 50 rows (top_rate 0.98) hold an empty string, with only one row carrying a substantive message about missing saturated-fat, sugars, and sodium. Entropy of 0.14 confirms near-zero information content in this sample. Treatment: Drop from modelling; retain only for pipeline debugging.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- (empty string)
- top_rate
- 0.98
- cardinality
- 2
- entropy
- 0.1414
- entropy_ratio
- 0.1414
popularity_key
numeric identifier high_skew outliers
Values cluster tightly between 23.999B and 24.000B with an IQR of only ~400K, yet the minimum drops to ~22.9999B, producing severe negative skew (-2.67) and 5 low-side outliers (10%). With 49 unique values across 50 rows and no nulls, this looks like an opaque high-magnitude key or encoded rank rather than a true numeric measure. Treatment: Treat as an identifier and exclude from numeric modelling; join on it if it links to another table.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 49
- min
- 2.3e+10
- max
- 2.4e+10
- mean
- 2.39e+10
- median
- 2.4e+10
- std
- 3.03e+08
- q1
- 2.4e+10
- q3
- 2.4e+10
- iqr
- 4.002e+05
- skew
- -2.667
- kurtosis
- 5.111
- n_outliers
- 5
- outlier_rate
- 0.1
- zero_rate
- 0
product_name_es
categorical free_text long_tail null_rate
Spanish-language product names, evidently a localized label field paralleling the primary product name. With null_rate 0.6 and 4 of the 20 non-null entries being empty strings, only ~16 rows carry usable text; among those, near-uniqueness is extreme (17 distinct values including the empty string, entropy_ratio 0.96). Values mix branded items (Nutella Biscuits, Excellence 85% cacao) with generic descriptors (Original, Chocolate negro 85% cacao), so it behaves more like free text than a controlled vocabulary. Treatment: Treat empty strings as nulls and tokenize/embed if used as a feature; otherwise drop given 60% missingness and high cardinality.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 17
- top_value
- (empty string)
- top_rate
- 0.2
- cardinality
- 17
- entropy
- 3.922
- entropy_ratio
- 0.9595
allergens_from_user
categorical free_text long_tail
User-submitted allergen tags prefixed with a language code like (fr), (en), (es). There are 34 distinct values across 50 rows with a high entropy ratio of 0.91, and the top value '(fr) ' (rate 0.16) is just an empty language tag, as is '(en) ' at 7 occurrences. Values mix languages and free-form casing (e.g. 'Gluten,Lait,Soja, en:gluten' alongside normalised 'en:gluten'), so the same allergen appears under multiple spellings. Treatment: Strip the language prefix, split on commas, and normalise tokens to the en: namespace before using as multi-label features.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 34
- top_value
- (fr)
- top_rate
- 0.16
- cardinality
- 34
- entropy
- 4.636
- entropy_ratio
- 0.9112
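A minimal sketch of the prefix-strip, split, and normalise steps. The `CANONICAL` table is a hypothetical stand-in for the allergen taxonomy and covers only the tokens quoted in this report; unmapped tokens are kept under an `unmapped:` marker (an assumption of this sketch) so they can be reviewed.

```python
import re

# Hypothetical synonym table; extend it from the allergen taxonomy.
CANONICAL = {"gluten": "en:gluten", "lait": "en:milk", "milk": "en:milk",
             "soja": "en:soybeans"}

def normalise_allergens(raw):
    """Strip the '(xx) ' prefix, split on commas, map tokens to en: tags."""
    body = re.sub(r"^\([a-z]{2}\)\s*", "", raw.strip())
    tags = []
    for part in body.split(","):
        token = part.strip().lower()
        if not token:
            continue
        # Remove an inline namespace such as "en:" before the lookup.
        token = re.sub(r"^[a-z]{2}:", "", token)
        tag = CANONICAL.get(token, f"unmapped:{token}")
        if tag not in tags:
            tags.append(tag)
    return tags
```

An empty language tag such as '(fr) ' yields an empty list, which can then be treated as "no allergens declared" or as missing, depending on the analysis.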
informers
unknown other skipped
The column 'informers' was skipped by the profiler, so its kind is unknown and no descriptive statistics were computed. The only confirmed signals are 50 rows with no nulls; uniqueness, type, and value distribution are all missing from the evidence. Treatment: Re-profile or manually inspect the raw values before deciding on any downstream handling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
brands_old
categorical metadata long_tail null_rate
This is a legacy brand-name field for product records, with 29 distinct values across 50 rows and 32% nulls. The distribution is nearly flat (entropy_ratio 0.98) and the top brand 'Gerblé' covers only ~8.8% of non-null rows, so no brand dominates. Values mix clean names (Lindt, Cristaline) with concatenations like 'Wasa,Barilla' and oddities like 'LuMondelez', suggesting prior data-entry or merge artefacts. Treatment: Clean and split multi-brand strings, then reconcile against a canonical brand list before use.
- n
- 50
- nulls
- 16 (32.0%)
- unique
- 29
- top_value
- Gerblé
- top_rate
- 0.08824
- cardinality
- 29
- entropy
- 4.749
- entropy_ratio
- 0.9776
ingredients_text
categorical free_text long_tail
Free-text ingredient lists from food packaging, one per row. Every one of the 50 rows is unique (entropy_ratio 1.0, top_rate 0.02) and the samples mix multiple languages (English, French, Bulgarian Cyrillic) with punctuation, percentages, and allergen notes. Treating this as a categorical feature is misleading despite the kind tag: it is unstructured multilingual prose flagged long_tail. Treatment: Parse and tokenize (language-detect first), then embed or extract ingredient entities before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- milk cream, cream, sugar, banana, bacteria
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
categories
categorical feature long_tail
This column holds Open Food Facts-style hierarchical category breadcrumbs, with each value a comma-separated taxonomy path from broad ('Snacks') to specific ('Chocolat noir en tablette extra dégustation à 70% de cacao minimum'). It is nearly unique (46 distinct values across 50 rows, top_rate just 0.06, entropy_ratio 0.99) and mixes French and English labels for overlapping concepts (e.g. 'Snacks sucrés' vs 'Sweet snacks', 'Chocolats noirs' vs 'Dark chocolates'), which the long_tail alert flags. Treat as a multi-label taxonomy rather than a flat category. Treatment: Split on commas, normalize French/English synonyms, and one-hot or embed the resulting taxonomy tags rather than using the raw string.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 46
- top_value
- Snacks,Snacks sucrés,Cacao et dérivés,Chocolats,Chocolats noirs,Chocolat noir en tablette extra dégustation à 70% de cacao minimum
- top_rate
- 0.06
- cardinality
- 46
- entropy
- 5.469
- entropy_ratio
- 0.9901
nutrition_score_warning_fruits_vegetables_nuts_estimate_from_ingredients_value
numeric feature high_skew outliers
This appears to be an estimated percentage of fruits/vegetables/nuts content derived from ingredients, used in nutrition score warnings. The distribution is dominated by zeros (zero_rate 0.71) with median 0.0 and Q3 only 2.33, yet a long tail pushes max to 100.0, producing extreme skew (5.41) and kurtosis (30.37). Seven outliers (15.6%) and a 10% null rate further indicate this signal fires for only a small subset of products. Treatment: Binarize (zero vs non-zero) or log1p-transform before modelling given the heavy zero mass and skew.
- n
- 50
- nulls
- 5 (10.0%)
- unique
- 13
- min
- 0
- max
- 100
- mean
- 4.532
- median
- 0
- std
- 15.52
- q1
- 0
- q3
- 2.326
- iqr
- 2.326
- skew
- 5.411
- kurtosis
- 30.37
- n_outliers
- 7
- outlier_rate
- 0.1556
- zero_rate
- 0.7111
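Both suggested transforms can be produced side by side so that the choice is left to the model. A minimal sketch; `zero_inflated_features` is a hypothetical helper, and the sample values are the min, Q3, and max from the stats above plus one null.

```python
import math

def zero_inflated_features(value):
    """Return (has_any, log1p_value); None stays None in both slots."""
    if value is None:
        return None, None
    return int(value > 0), math.log1p(value)

samples = [0.0, 2.326, 100.0, None]
features = [zero_inflated_features(v) for v in samples]
```

`log1p` maps 0 to 0 and compresses the 0 to 100 range to roughly 0 to 4.6, which tames the long right tail without discarding the zero mass.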
ingredients_from_or_that_may_be_from_palm_oil_n
numeric feature
Likely a count of ingredients sourced from or potentially from palm oil per product. With only 3 unique values ranging 0–2 and 70.2% zeros, most products contain none, while the right skew (1.39) reflects a small tail with one or two such ingredients. Null rate is modest at 6%. Treatment: Treat as a low-cardinality ordinal count; impute missing as 0 or add a missing flag before modelling.
- n
- 50
- nulls
- 3 (6.0%)
- unique
- 3
- min
- 0
- max
- 2
- mean
- 0.3404
- median
- 0
- std
- 0.5625
- q1
- 0
- q3
- 1
- iqr
- 1
- skew
- 1.393
- kurtosis
- 0.969
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.7021
origins_old
categorical metadata long_tail null_rate
Legacy free-text origin field, likely superseded (note the `_old` suffix). Of 50 rows, 22% are null and another 31 are empty strings, so the top_rate of 0.795 is dominated by blanks; only 9 distinct values exist and the non-empty entries mix country names ('France', 'Morocco'), multi-region comma lists, and noise like 'biologique' or 'Farine de blé: France'. Entropy ratio 0.425 confirms most signal is absent, and the inconsistent formats mean this cannot be used as a clean categorical without parsing. Treatment: Drop or archive; if needed, parse non-empty strings into a normalised origins list and prefer the replacement column.
- n
- 50
- nulls
- 11 (22.0%)
- unique
- 9
- top_value
- (empty string)
- top_rate
- 0.7949
- cardinality
- 9
- entropy
- 1.347
- entropy_ratio
- 0.4251
packaging_text_nl
categorical free_text null_rate imbalance
Dutch-language packaging text field (likely from Open Food Facts or similar). 76% of the 50 rows are null, and every one of the 12 non-null values is the empty string, giving cardinality 1 and entropy 0. The column carries no usable signal in this sample. Treatment: Drop; column is effectively empty (null or blank in all rows).
- n
- 50
- nulls
- 38 (76.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
expiration_date
categorical metadata long_tail
This is an expiration_date field captured as free-form text rather than a parsed date. With 34 unique values across 50 rows and a top_rate of 0.3125 driven by an empty string, roughly 31% of non-null entries are blank (15 of 48, on top of a 4% null rate), and the remaining values mix incompatible formats like '31/07/2020', '28/02/24', '25.11.2025', '01/2018', '19-10-2023', and even non-date tokens like '30days'. Treatment: Normalise to ISO dates with multi-format parsing and treat blanks/'30days' as missing before use.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 34
- top_value
- (empty string)
- top_rate
- 0.3125
- cardinality
- 34
- entropy
- 4.364
- entropy_ratio
- 0.8578
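Multi-format parsing can be sketched by trying a short list of `strptime` patterns in order. The patterns below cover only the sample values quoted above and assume day-first ordering, which is an assumption rather than something the report confirms; month-only values such as '01/2018' resolve to the first day of the month.

```python
from datetime import datetime

# Day-first formats matching the quoted samples; tried in this order.
FORMATS = ["%d/%m/%Y", "%d/%m/%y", "%d.%m.%Y", "%d-%m-%Y", "%m/%Y"]

def parse_expiration(raw):
    """Return an ISO date string, or None when no format matches."""
    text = (raw or "").strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Blanks, nulls, and non-date tokens such as '30days' all fall through to `None`, which matches the "treat as missing" part of the treatment.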
selected_images
unknown other skipped
The column 'selected_images' was skipped by the profiler, so no type, cardinality, or distribution stats are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds image references or selections (filenames, URLs, or arrays), but this cannot be confirmed from the evidence. Without n_unique or value samples, no further characterisation is possible. Treatment: Inspect raw values manually to determine structure before deciding on parsing, exploding, or dropping.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
traces_from_ingredients
categorical free_text long_tail
Allergen trace declarations parsed from product ingredient lists, recorded as free-form comma-separated allergen names. 78% of the 50 rows (39) are empty strings rather than nulls, and the remaining 11 distinct values mix languages (French 'œuf', 'lait', English 'nuts, milk', German 'Schalenfrüchte') and inconsistent casing, with some entries duplicating the same allergens twice in one string. Treatment: Normalise case, split on commas, translate to a canonical allergen vocabulary, and treat empty strings as missing before one-hot encoding.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 12
- top_value
- (empty string)
- top_rate
- 0.78
- cardinality
- 12
- entropy
- 1.521
- entropy_ratio
- 0.4243
ingredients_text_with_allergens
categorical free_text long_tail
Free-text ingredient lists with embedded HTML markup highlighting allergens. All 50 rows are unique (entropy_ratio 1.0, top_rate 0.02) and the language mix spans English, French, and Bulgarian Cyrillic, so any naive categorical encoding will explode. The HTML tags and multilingual content mean raw values need cleaning before NLP use. Treatment: Strip HTML tags, language-detect, then tokenize/embed; do not treat as a category.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- milk cream, cream, sugar, banana, bacteria
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
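Stripping the markup while keeping the highlighted allergens can be sketched with the standard-library HTML parser. The exact tag used in this column is not shown in the report, so the sketch treats text inside any tag as marked; `AllergenExtractor` and `strip_markup` are hypothetical names.

```python
from html.parser import HTMLParser

class AllergenExtractor(HTMLParser):
    """Collect plain text plus the text wrapped in any inline tag."""

    def __init__(self):
        super().__init__()
        self.text, self.marked, self._depth = [], [], 0

    def handle_starttag(self, tag, attrs):
        self._depth += 1

    def handle_endtag(self, tag):
        self._depth = max(0, self._depth - 1)

    def handle_data(self, data):
        self.text.append(data)
        # Text seen while inside a tag is treated as a highlighted allergen.
        if self._depth and data.strip():
            self.marked.append(data.strip().lower())

def strip_markup(raw):
    """Return (plain_text, list_of_marked_terms) for one ingredient string."""
    parser = AllergenExtractor()
    parser.feed(raw)
    parser.close()
    return "".join(parser.text), parser.marked
```

The plain text feeds language detection and tokenisation, while the marked terms give a cheap allergen signal before any NLP is applied.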
image_front_thumb_url
categorical identifier long_tail
This column holds Open Food Facts thumbnail URLs pointing to product front images, embedding the product barcode and a language suffix (front_fr/front_en) in the path. Every one of the 50 rows is unique with zero nulls, so it functions as a per-row identifier rather than a feature. The mix of fr and en suffixes hints at a multi-locale product set. Treatment: Drop for modelling; retain as a media link or fetch images if vision features are needed.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.100.jpg
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
lc
categorical feature
This is a low-cardinality categorical with 5 distinct values that look like ISO 639-1 language codes (fr, en, de, bg, ro), suggesting a language tag for each row. The distribution is heavily concentrated: 'fr' accounts for 35 of 50 rows (top_rate 0.70), 'en' for 10, while 'de', 'bg', and 'ro' appear only 1-3 times. Entropy ratio of 0.56 confirms the imbalance, and there are no nulls. Treatment: One-hot encode, or group rare codes (bg, ro, de) into an 'other' bucket before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- fr
- top_rate
- 0.7
- cardinality
- 5
- entropy
- 1.294
- entropy_ratio
- 0.5572
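Grouping rare codes before one-hot encoding can be sketched as below. The counts for 'fr' and 'en' come from the summary above; how the remaining 5 rows split across 'de', 'bg', and 'ro' is assumed for illustration, and the `min_count` threshold is an arbitrary choice.

```python
from collections import Counter

def bucket_rare(values, min_count=5, other="other"):
    """Replace categories seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# fr and en counts are reported; the de/bg/ro split is an assumption.
codes = ["fr"] * 35 + ["en"] * 10 + ["de"] * 3 + ["bg"] + ["ro"]
bucketed = Counter(bucket_rare(codes))
```

The threshold should be fitted on training data only and reused at inference time, so that a code's bucket does not change between splits.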
ingredients_text_debug
categorical free_text long_tail null_rate
Free-text ingredient lists in French (e.g., 'Lait écrémé, crème, sucre...'), likely a debug dump of OpenFoodFacts-style product compositions. A near-maximal entropy ratio (0.997) and 35 unique values across the 36 non-null rows confirm essentially every non-null row is distinct, while 28% are null and the top value is an empty string appearing twice. Texts vary wildly in length and include allergen markup (_lait_, _soja_) plus stray non-ingredient prose like publication dates. Treatment: Convert empty strings to nulls, then tokenize and embed (or parse into structured allergen/ingredient lists).
- n
- 50
- nulls
- 14 (28.0%)
- unique
- 35
- top_value
- (empty string)
- top_rate
- 0.05556
- cardinality
- 35
- entropy
- 5.114
- entropy_ratio
- 0.9971
packagings_materials_main
categorical feature null_rate
This is a low-cardinality categorical tagging the dominant packaging material, with only 3 distinct values across 50 rows ('en:paper-or-cardboard', 'en:plastic', 'en:unknown'). The headline issue is a 62% null rate, leaving just 19 observed rows where 'en:paper-or-cardboard' alone covers 68.4%. Entropy ratio of 0.70 indicates moderate concentration among the few non-null entries. Treatment: Impute missing as an explicit 'unknown' category before one-hot encoding.
- n
- 50
- nulls
- 31 (62.0%)
- unique
- 3
- top_value
- en:paper-or-cardboard
- top_rate
- 0.6842
- cardinality
- 3
- entropy
- 1.105
- entropy_ratio
- 0.6972
data_quality_dimensions
unknown other skipped
The column `data_quality_dimensions` was skipped by the profiler, so no type, uniqueness, or value statistics were computed beyond a row count of 50 and a null rate of 0.0. Without `n_unique` or any descriptive stats, its content and structure are unknown from this evidence alone. The name suggests it may hold structured or list-like quality metadata, but that cannot be confirmed here. Treatment: Re-profile with type inference forced, or inspect raw values manually before deciding on use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
serving_size
categorical feature long_tail
Free-text serving size descriptors, with 37 unique values across only 50 rows (entropy ratio 0.98) and a 12% null rate. The top value '100g' covers just 6.8% of non-null rows, and formatting is inconsistent: spacing varies ('100g' vs '100 g', '10 g' vs '20g') and compound strings like '1 Square (10 g)' appear, so the same physical quantity can show up under multiple labels. Treatment: Parse into a numeric grams column via regex and unit normalization before use.
- n
- 50
- nulls
- 6 (12.0%)
- unique
- 37
- top_value
- 100g
- top_rate
- 0.06818
- cardinality
- 37
- entropy
- 5.107
- entropy_ratio
- 0.9803
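The regex parse can be sketched as follows. This minimal version only recognises gram quantities, which is what the quoted samples use; other units (kg, ml, and so on) would need their own handling and are left as `None` here. `serving_grams` is a hypothetical helper name.

```python
import re

# A number (dot or comma decimal) followed by optional spaces and "g".
GRAMS = re.compile(r"(\d+(?:[.,]\d+)?)\s*g\b", re.IGNORECASE)

def serving_grams(raw):
    """Return the first gram quantity found in the text, else None."""
    if not raw:
        return None
    match = GRAMS.search(raw)
    if not match:
        return None
    return float(match.group(1).replace(",", "."))
```

For compound strings such as '1 Square (10 g)' the leading count is skipped because it is not followed by a gram unit, so the parenthesised weight is what gets extracted.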
origin
categorical free_text long_tail
Free-text origin/provenance field, likely scraped from product packaging in mixed French/German wording. The column is almost entirely empty: the blank string accounts for 42 of the 47 non-null rows (top_rate 0.894) and another 6% are null, leaving only 5 distinct populated values. Entropy ratio of 0.285 and the long_tail alert confirm there is essentially no usable signal as-is. Treatment: Drop or parse with regex/NER to extract country tokens; too sparse to use directly.
- n
- 50
- nulls
- 3 (6.0%)
- unique
- 6
- top_value
- (empty string)
- top_rate
- 0.8936
- cardinality
- 6
- entropy
- 0.7359
- entropy_ratio
- 0.2847
ingredients_lc
categorical metadata
This column appears to be a language code for ingredient text, with only 4 distinct values across 50 rows. French dominates at 70% (35 rows), followed by English (11), with Bulgarian and German trailing at 2 each. The skew is heavy and the entropy ratio of 0.606 confirms concentration around a single language. Treatment: One-hot encode or use as a filter; consider grouping rare languages into 'other'.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 4
- top_value
- fr
- top_rate
- 0.7
- cardinality
- 4
- entropy
- 1.212
- entropy_ratio
- 0.6061
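The entropy figures can be reproduced from the counts in the summary. The sketch below assumes the profiler reports Shannon entropy in bits and divides by log2(cardinality) to get the ratio; the report does not state this, but the assumption reproduces the values listed for this column.

```python
import math

def entropy_stats(counts):
    """Shannon entropy in bits and its ratio to log2(cardinality)."""
    total = sum(counts)
    entropy = -sum(c / total * math.log2(c / total) for c in counts if c)
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 0.0
    ratio = entropy / max_entropy if max_entropy else 0.0
    return entropy, ratio

# Counts from the summary: fr 35, en 11, bg 2, de 2.
entropy, ratio = entropy_stats([35, 11, 2, 2])
```

A ratio of 1 means every category is equally frequent, while a ratio near 0 means one category dominates; a single-valued column is assigned 0 here by convention.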
packaging_old
categorical free_text long_tail
Free-form packaging descriptions, almost certainly from an Open Food Facts-style export, mixing French and English tokens plus language-prefixed tags (e.g. 'fr:Triman', 'en:Bottle'). With 40 unique values across 50 rows and entropy_ratio 0.99, it's near-unique; the top value 'Plastique' covers only 6.98% and 14% are null. Entries are comma-separated multi-tags of varying granularity, so this behaves more like a tag list than a clean category. Treatment: Split on commas, normalise language prefixes, and one-hot the resulting tag set rather than treating raw strings as categories.
- n
- 50
- nulls
- 7 (14.0%)
- unique
- 40
- top_value
- Plastique
- top_rate
- 0.06977
- cardinality
- 40
- entropy
- 5.269
- entropy_ratio
- 0.9901
packaging_text_fr
categorical free_text long_tail
Free-text French packaging instructions, mostly empty: 34 of the 47 non-null rows (top_rate 0.723) are blank and another 6% are null. Of 14 distinct values, the populated ones are heterogeneous descriptions of materials and recycling instructions (plastic films, cardboard étuis, aluminium sheets), with one outlier containing OCR-like artefacts and a date string. Entropy ratio 0.49 confirms the long-tail alert: almost every non-empty entry is unique. Treatment: Treat blanks as missing and parse remaining strings for material/recyclability tokens rather than using as a categorical.
- n
- 50
- nulls
- 3 (6.0%)
- unique
- 14
- top_value
- (empty string)
- top_rate
- 0.7234
- cardinality
- 14
- entropy
- 1.874
- entropy_ratio
- 0.4923
nova_group_debug
categorical metadata long_tail imbalance
A categorical debug/diagnostic field, presumably trace messages from a NOVA food-group classifier. It's overwhelmingly empty (96% blank, 48 of 50 rows), with only two non-empty entries, both error strings explaining that NOVA classification was skipped due to unknown ingredients. Entropy ratio of 0.178 confirms near-zero information content. Treatment: Drop; near-constant debug log with no modelling value.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 3
- top_value
- (empty string)
- top_rate
- 0.96
- cardinality
- 3
- entropy
- 0.2823
- entropy_ratio
- 0.1781
countries_hierarchy
unknown feature skipped
Column `countries_hierarchy` was skipped by the profiler, so no kind, uniqueness, or value statistics are available beyond a row count of 50 with zero nulls. The name suggests a nested or list-like representation of country tags (e.g., `en:france > en:europe`), which likely tripped the type detector. Treat the absence of stats as a signal that the values are non-scalar rather than missing. Treatment: Parse the hierarchical strings into a list of country tags, then explode or one-hot encode before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
nutriscore_score_opposite
numeric feature
Numeric column holding the negation of a Nutri-Score (range -40 to 0, median -19), so higher values (closer to 0) correspond to better nutritional grades. Distribution is roughly symmetric (skew 0.16, kurtosis -0.53) with no outliers and an IQR of 15. Notable signals: 2% nulls, 8% zeros, and only 28 unique values across 50 rows, consistent with an integer score derived by sign-flipping the original Nutri-Score. Treatment: Use as-is for modelling, or invert the sign back to the original Nutri-Score for interpretability.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 28
- min
- -40
- max
- 0
- mean
- -17.47
- median
- -19
- std
- 9.906
- q1
- -25
- q3
- -10
- iqr
- 15
- skew
- 0.1616
- kurtosis
- -0.5337
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.08163
origins_lc
categorical feature
This is a lowercase language/origin code with 6 distinct values across 50 rows and a 4% null rate. The distribution is dominated by 'fr' (23) and 'en' (20), together accounting for nearly all non-null entries, while 'es', 'de', 'it', and 'pl' appear only once or twice each. Entropy ratio of 0.61 confirms the heavy concentration in two categories. Treatment: One-hot encode with rare categories (es/de/it/pl) collapsed into an 'other' bucket.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 6
- top_value
- fr
- top_rate
- 0.4792
- cardinality
- 6
- entropy
- 1.575
- entropy_ratio
- 0.6093
countries
categorical free_text long_tail
Free-text country list per record, not a clean categorical: 43 unique values across 50 rows (entropy ratio 0.97) with the top value 'Maroc' at only 10%. Values mix languages (Maroc vs Morocco, Belgique vs Belgium), comma-separated multi-country strings, and even an 'en:switzerland' prefix, so the same country appears in several surface forms. The 'long_tail' alert is consistent with this near-unique, multi-label encoding. Treatment: Split on commas, normalise language and prefixes to ISO country codes, then one-hot or multi-hot encode.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 43
- top_value
- Maroc
- top_rate
- 0.1
- cardinality
- 43
- entropy
- 5.252
- entropy_ratio
- 0.9678
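Mapping surface forms to ISO 3166-1 alpha-2 codes can be sketched with a small alias table. The table below is hypothetical and covers only the forms quoted above plus 'France'; a real one would need every spelling present in the data. Tokens missing from the table are dropped silently in this sketch.

```python
# Hypothetical alias table; a full one would cover every surface form.
ALIASES = {"maroc": "MA", "morocco": "MA", "belgique": "BE", "belgium": "BE",
           "switzerland": "CH", "france": "FR"}

def country_codes(raw):
    """Split a countries string and map each surface form to an ISO code."""
    codes = set()
    for part in raw.split(","):
        token = part.strip().lower()
        if ":" in token:
            # Drop a namespace prefix such as "en:".
            token = token.split(":", 1)[1]
        if token in ALIASES:
            codes.add(ALIASES[token])
    return sorted(codes)
```

The sorted code list per row can then be multi-hot encoded, which removes the dependence on language and ordering.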
ingredients_text_with_allergens_it
categorical free_text long_tail null_rate
Italian-language ingredient lists with embedded HTML markup, one row per product. Coverage is poor: 68% of the 50 rows are null and the most common non-null value is the empty string (5 occurrences, 31% of present values), leaving only a handful of genuine ingredient strings. Among the 12 distinct values, contents range from short lists (e.g. "patate, olio di girasole, sale marino.") to long compound declarations, so length and structure vary widely. Treatment: Strip HTML allergen tags, treat empty strings as null, then tokenize for NLP or extract allergen flags as features.
- n
- 50
- nulls
- 34 (68.0%)
- unique
- 12
- top_value
- (empty string)
- top_rate
- 0.3125
- cardinality
- 12
- entropy
- 3.274
- entropy_ratio
- 0.9134
packaging_lc
categorical metadata
This column appears to be the language code of packaging text, with 7 distinct ISO-style codes across 50 rows. French and English tie at 17 occurrences each, though the reported top_rate of 0.386 reflects only one being chosen as top_value ('fr'); German trails at 5, with Portuguese, Italian, Spanish, and Croatian sharing the remaining 5 of the 44 non-null rows at one or two occurrences each. A 12% null rate and entropy ratio of 0.71 indicate moderate diversity but a clear FR/EN dominance. Treatment: Treat as a low-cardinality categorical; impute nulls and one-hot encode, optionally collapsing rare codes into 'other'.
- n
- 50
- nulls
- 6 (12.0%)
- unique
- 7
- top_value
- fr
- top_rate
- 0.3864
- cardinality
- 7
- entropy
- 1.992
- entropy_ratio
- 0.7094
interface_version_created
categorical metadata
This column appears to record the interface version in use when each record was created, encoded as a date-stamp with optional jQuery Mobile suffix. Only 3 distinct values appear across 50 rows, with '20120622' dominating at 59.2% of non-null rows and the rare '20130323.jqm' appearing just twice. Entropy ratio of 0.74 confirms moderate concentration, and there is a 2% null rate to account for. Treatment: Treat as a low-cardinality categorical; one-hot encode or bucket the rare '20130323.jqm' level.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 3
- top_value
- 20120622
- top_rate
- 0.5918
- cardinality
- 3
- entropy
- 1.167
- entropy_ratio
- 0.7363
image_thumb_url
categorical metadata long_tail
This column holds Open Food Facts product thumbnail URLs, one per row. Every one of the 50 values is unique (entropy_ratio 1.0, top_rate 0.02), so it acts as a per-row asset pointer rather than a categorical feature. URLs mix `front_fr` and `front_en` locale suffixes, hinting at a French/English product mix. Treatment: Drop for modelling; retain only as a display link or for image-fetching pipelines.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- https://images.openfoodfacts.org/images/products/611/124/210/0992/front_fr.172.100.jpg
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
categories_properties
unknown other skipped
This column was skipped by the profiler, so its type, cardinality, and value distribution are unknown beyond a count of 50 rows with no nulls. The name `categories_properties` suggests a nested or structured field (e.g., a list or dict of category attributes) that the profiler could not coerce into a scalar kind. Without parsed contents there is nothing further to infer. Treatment: Inspect raw values and parse the nested structure (explode or flatten) before profiling again.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
allergens_from_ingredients
categorical feature long_tail
Free-text allergen list parsed from ingredient strings, mixing Open Food Facts taxonomy codes (en:gluten, en:milk, en:soybeans) with raw multilingual tokens (blé, lait, NOISETTES, соеви). 30% of the 50 rows are empty strings, and the remaining 35 rows hold 34 distinct non-empty values, nearly all singletons, often with duplicated tokens within a single cell, so this is dirty list-encoded data rather than a clean category. Treatment: Split on commas, normalize to en: taxonomy codes, dedupe tokens, then multi-hot encode.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 35
- top_value
- top_rate
- 0.3
- cardinality
- 35
- entropy
- 4.432
- entropy_ratio
- 0.864
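The split, normalize, dedupe, multi-hot treatment above can be sketched as follows. This is a minimal illustration, not the profiler's or Open Food Facts' own pipeline: the `ALIASES` table, the helper names, and the example cells are all hypothetical, and the mapping of raw tokens to `en:` codes is an assumption that would need a real taxonomy lookup.

```python
# Hypothetical alias table: only tokens quoted in the profile are mapped,
# and the target codes are assumed, not taken from the real taxonomy.
ALIASES = {
    "blé": "en:gluten",
    "lait": "en:milk",
    "noisettes": "en:nuts",
}

def normalize_cell(cell):
    """Split one comma-separated cell into a sorted list of unique codes."""
    codes = set()
    for token in cell.split(","):
        token = token.strip().lower()
        if not token:
            continue  # skip blanks from empty cells or trailing commas
        codes.add(ALIASES.get(token, token))
    return sorted(codes)

def multi_hot(cells):
    """Return (vocabulary, rows), where each row is a 0/1 vector over the vocabulary."""
    parsed = [normalize_cell(c) for c in cells]
    vocab = sorted({code for codes in parsed for code in codes})
    rows = [[1 if code in codes else 0 for code in vocab] for codes in parsed]
    return vocab, rows

# Made-up cells for illustration, not rows from the sample.
cells = ["en:gluten,en:milk,blé,lait", "", "NOISETTES, en:milk"]
vocab, rows = multi_hot(cells)
```

Empty cells produce an all-zero row here; whether that should mean "no allergens" or "missing" is a separate modelling decision.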
ingredients_text_with_allergens_fi
categorical free_text long_tail null_rateFinnish-language ingredient text with inline HTML allergen markup, mirroring the multilingual ingredient fields common in Open Food Facts. Coverage is extremely thin: null_rate is 0.9 and only 4 distinct values exist across n=50, with the empty string itself appearing twice as the top_value (top_rate 0.4 of non-nulls). The non-empty entries are long free-text strings wrapping allergens in tags rather than clean tokens. Treatment: Strip HTML tags and tokenize for allergen extraction; otherwise drop, since 90% are null.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 4
- top_value
- top_rate
- 0.4
- cardinality
- 4
- entropy
- 1.922
- entropy_ratio
- 0.961
_keywords
unknown other skippedThe column `_keywords` was skipped by the profiler, so kind is unknown and no statistics (n_unique, value distribution, length, etc.) are available. The only confirmed signals are 50 rows with a 0.0 null rate. Without further evidence the content and structure cannot be characterised — the name suggests a keyword list, but this is not verified. Treatment: Re-profile with appropriate parser (likely list/tokenized text) before deciding usage.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
manufacturing_places
categorical free_text long_tail Free-text manufacturing locations, mostly country names but mixed with multi-token strings combining cities, regions, and postal codes. The dominant value is the empty string (20 of the 49 non-null rows, top_rate 0.408), making blank the modal state, with France a distant second at 9. Across 20 unique values, the entropy_ratio of 0.737 plus the long_tail alert signals scattered, inconsistent formatting (e.g. 'France,Italie' vs full German address chains). Treatment: Normalise blanks to null and parse/standardise to country tokens before using as a feature.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 20
- top_value
- top_rate
- 0.4082
- cardinality
- 20
- entropy
- 3.187
- entropy_ratio
- 0.7374
pnns_groups_2
categorical label This is a food sub-category label (PNNS group 2), with 11 distinct values across 50 rows and no nulls. The distribution is heavily concentrated in sweets: 'Biscuits and cakes' (17 rows, which sets the top_rate of 0.34) and 'Chocolate products' (16) together account for 33 of 50 rows, and the entropy_ratio is 0.75. Two rows carry the literal value 'unknown', which should be treated as missing rather than a real category. Treatment: One-hot or target-encode after recoding 'unknown' to null.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 11
- top_value
- Biscuits and cakes
- top_rate
- 0.34
- cardinality
- 11
- entropy
- 2.599
- entropy_ratio
- 0.7513
ingredients_text_pl
categorical free_text long_tail null_ratePolish-language ingredient text for food products, almost entirely absent in this sample: 90% of rows are null and of the 5 non-null rows, 3 are empty strings, leaving only 2 genuine ingredient lists (both for cocoa-based chocolate). With n_unique=3 across 50 rows, this column carries virtually no usable signal here. Treatment: Drop unless Polish-language analysis is required; too sparse to model.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 3
- top_value
- top_rate
- 0.6
- cardinality
- 3
- entropy
- 1.371
- entropy_ratio
- 0.865
generic_name_es
categorical metadata long_tail null_rateSpanish-language generic product name, sparsely populated with only 7 unique values across 50 rows and a 60% null rate. The non-null values skew heavily toward dark chocolate descriptions (e.g., 'Chocolate negro' appears twice, with several variants citing cacao percentages), suggesting the dataset is dominated by chocolate products. Top rate of 0.65 reflects the empty string acting as the modal 'value', so usable coverage is even thinner than the null rate alone implies. Treatment: Treat empty strings as nulls and drop or backfill from a canonical product-name field before use.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 7
- top_value
- top_rate
- 0.65
- cardinality
- 7
- entropy
- 1.817
- entropy_ratio
- 0.6471
origin_en
categorical metadata imbalanceCategorical column likely intended to mark country of origin in English, but it is effectively empty: of 50 rows, 14% are null and 42 of the remaining values are blank strings, leaving just one populated label ("France"). Cardinality is 2 with a top_rate of 0.977 and entropy_ratio of 0.16, so the field carries almost no information. Treatment: Drop; the column is near-constant with blanks and only one real value.
- n
- 50
- nulls
- 7 (14.0%)
- unique
- 2
- top_value
- top_rate
- 0.9767
- cardinality
- 2
- entropy
- 0.1594
- entropy_ratio
- 0.1594
generic_name_it
categorical metadata long_tail null_rate Italian generic product name, evidently a localized label field on food items. 68% of rows are null, and among the 16 non-null entries the most common value is the empty string (11 occurrences), leaving only 5 populated rows with 4 distinct real names such as 'Cioccolato extra fondente' and 'Crackers'. Coverage is too sparse to be useful as-is. Treatment: Drop or defer until Italian coverage improves; not usable at 68% null.
- n
- 50
- nulls
- 34 (68.0%)
- unique
- 5
- top_value
- top_rate
- 0.6875
- cardinality
- 5
- entropy
- 1.497
- entropy_ratio
- 0.6446
ingredients_that_may_be_from_palm_oil_n
numeric feature high_skew outliersCount of ingredients that may be derived from palm oil per product. Values are extremely concentrated at zero (zero_rate 0.83, median and IQR both 0), with only 3 distinct values up to a max of 2, yet 17% of non-null rows register as outliers and skew is 2.23. An 8% null rate also means some products lack this assessment entirely. Treatment: Binarise to zero/non-zero or drop, since the column is near-constant with heavy skew.
- n
- 50
- nulls
- 4 (8.0%)
- unique
- 3
- min
- 0
- max
- 2
- mean
- 0.1957
- median
- 0
- std
- 0.4531
- q1
- 0
- q3
- 0
- iqr
- 0
- skew
- 2.23
- kurtosis
- 4.321
- n_outliers
- 8
- outlier_rate
- 0.1739
- zero_rate
- 0.8261
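The combination of a zero IQR and a 17% outlier rate is easier to see with numbers. The sketch below assumes the profiler flags outliers with the usual 1.5×IQR Tukey fences (not confirmed by the report) and rebuilds a value list that is consistent with the reported statistics; the exact 38/7/1 split is inferred from the mean, zero_rate, and max, not read from the data.

```python
import statistics

# Reconstructed from the reported stats (46 non-null rows, mean 0.1957,
# zero_rate 0.8261, max 2); the 38/7/1 split is an inference, not raw data.
values = [0] * 38 + [1] * 7 + [2] * 1

q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed Tukey fences: with iqr == 0 both fences collapse onto 0,
# so every non-zero count is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
outlier_rate = len(outliers) / len(values)

# Binarise as the treatment suggests: 1 if any ingredient may be from palm oil.
has_palm_oil_candidate = [int(v > 0) for v in values]
```

Under these assumptions the "outliers" are simply the 8 non-zero rows, which is why binarising loses little information.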
ingredients_text_es
categorical free_text long_tail null_rateSpanish-language ingredient lists for food products, stored as free text. Of 50 rows, 60% are null and another 8 entries (top_rate 0.4) are empty strings, leaving only a handful of distinct populated values—mostly chocolate and cereal formulations with allergen markers like _TRIGO_ and _HUEVO_. The 13-value cardinality and high entropy ratio (0.84) reflect that nearly every non-empty entry is unique long-form prose, not a true category. Treatment: Treat as multilingual free text: tokenize/embed or parse into ingredient lists; do not use as a categorical feature.
- n
- 50
- nulls
- 30 (60.0%)
- unique
- 13
- top_value
- top_rate
- 0.4
- cardinality
- 13
- entropy
- 3.122
- entropy_ratio
- 0.8437
teams
categorical feature long_tail This column holds team affiliations as comma-separated lists of slugs, with 39 unique combinations across 50 rows and an 8% null rate. Cardinality is high (entropy ratio 0.97) and the most common value 'pain-au-chocolat' covers only 10.9% of non-null rows, while several rows pack 4-14 teams into one string. The mix of single-team and multi-team entries means this is effectively a multi-label field stored as a delimited string. Treatment: Split on commas and multi-hot encode team membership before modelling.
- n
- 50
- nulls
- 4 (8.0%)
- unique
- 39
- top_value
- pain-au-chocolat
- top_rate
- 0.1087
- cardinality
- 39
- entropy
- 5.124
- entropy_ratio
- 0.9695
origins_hierarchy
unknown other skippedProfiling was skipped for this column, so saturn emitted no type, uniqueness, or value statistics beyond a row count of 50 and a null rate of 0.0. The name suggests a nested or path-like representation of origin categories (e.g. a taxonomy hierarchy), but without parsed values this is inference from the label only. Treat it as opaque until reprofiled. Treatment: Reprofile after parsing the hierarchy (split path or explode levels) before deciding on use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
packagings_complete
numeric featureThis is a binary 0/1 flag (n_unique=2, min=0, max=1) indicating whether packaging information is complete. The split is nearly even with a mean of 0.52 and zero_rate of 0.48, and 4% of rows are null. The strongly negative kurtosis (-1.99) is expected for a balanced binary variable. Treatment: Cast to boolean and impute the 4% nulls before modelling.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 2
- min
- 0
- max
- 1
- mean
- 0.5208
- median
- 1
- std
- 0.5049
- q1
- 0
- q3
- 1
- iqr
- 1
- skew
- -0.08341
- kurtosis
- -1.993
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0.4792
ingredients_text_with_allergens_nl
categorical free_text long_tail null_rate Dutch-language ingredient lists with inline HTML allergen markup, evidently the NL localisation of an Open Food Facts-style ingredients field. Coverage is poor: 78% null and only 9 distinct values across 50 rows. The top value among the 11 non-null rows is the empty string (3 occurrences, top_rate 0.27); the populated entries are mostly short cocoa/chocolate ingredient strings, while one row is clearly mis-parsed packaging footer text (Mondelez addresses, URLs). Entropy ratio 0.95 confirms the few present values are nearly all unique, so this is free text rather than a category. Treatment: Strip HTML allergen tags, then tokenize for NLP/allergen extraction; expect heavy missingness so do not use as a primary feature.
- n
- 50
- nulls
- 39 (78.0%)
- unique
- 9
- top_value
- top_rate
- 0.2727
- cardinality
- 9
- entropy
- 3.027
- entropy_ratio
- 0.955
created_t
numeric timestampValues are 10-digit integers ranging from 1,337,517,352 to 1,724,094,916 with all 50 rows unique and no nulls — consistent with Unix epoch seconds spanning roughly mid-2012 to mid-2024. The distribution is mildly right-skewed (skew 0.33) and platykurtic (kurtosis -0.81), with a median of 1,475,927,880.5 sitting near the mean, suggesting events are spread fairly evenly across the window rather than clustered. The name 'created_t' reinforces a creation-timestamp interpretation. Treatment: convert from Unix seconds to datetime and derive features (year, recency) before modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- min
- 1.338e+09
- max
- 1.724e+09
- mean
- 1.483e+09
- median
- 1.476e+09
- std
- 1.043e+08
- q1
- 1.386e+09
- q3
- 1.555e+09
- iqr
- 1.694e+08
- skew
- 0.3311
- kurtosis
- -0.8095
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
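A minimal sketch of the suggested conversion, assuming the values are Unix epoch seconds in UTC as the profile infers. The helper names and the choice of the latest sample timestamp as the recency reference are illustrative assumptions.

```python
from datetime import datetime, timezone

def to_utc(ts):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# Minimum and maximum values quoted in the column description.
first = to_utc(1_337_517_352)
last = to_utc(1_724_094_916)

def recency_days(ts, reference=last):
    """Whole days between a record's creation and a reference time.

    Using the latest timestamp in the sample as the reference is a
    hypothetical choice; a fixed extraction date would work equally well.
    """
    return (reference - to_utc(ts)).days

created_year = first.year  # example of a derived calendar feature
```

Keeping the datetimes timezone-aware avoids silent shifts when the code runs on machines with different local time zones.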
traces_hierarchy
unknown other skippedColumn 'traces_hierarchy' was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. Without kind inference or sample stats, the content remains unknown — the name hints at nested trace/hierarchy data (likely a complex or non-scalar structure), which is consistent with the profiler skipping it. Treatment: Inspect raw values manually and parse the nested structure before it can be profiled or modelled.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
generic_name_nb
categorical metadata null_rate imbalance This appears to be a Norwegian Bokmål generic name field, likely meant to hold localized generic product names. It is effectively empty: 96% of the 50 rows are null and the only non-null value observed is the empty string (2 occurrences), giving cardinality 1 and zero entropy. Treatment: Drop; the column carries no information.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_de
categorical free_text long_tail null_rate German-language ingredient lists with embedded HTML tags marking allergens like SOJA, WEIZEN, and HASELNÜSSE. Two-thirds of rows are null (null_rate 0.66), and the 17 non-null rows hold 16 distinct values (entropy_ratio 0.99), with the empty string appearing twice as the top value. Casing and punctuation are inconsistent across entries, and one row contains lowercase OCR-style text with stray newlines. Treatment: Strip HTML tags to extract allergen labels into a multi-hot feature, then drop or embed the residual text.
- n
- 50
- nulls
- 33 (66.0%)
- unique
- 16
- top_value
- top_rate
- 0.1176
- cardinality
- 16
- entropy
- 3.97
- entropy_ratio
- 0.9925
ingredients_text_with_allergens_es
categorical free_text long_tail null_rate Spanish-language ingredient lists with inline HTML markup highlighting allergens like trigo, soja, avellanas, and lactosa. 62% of the 50 rows are null and another 7 entries are empty strings, leaving only 12 populated free-text ingredient lists, each of them distinct (13 unique values including the empty string, entropy ratio 0.87). Treatment: Strip the allergen HTML tags, then tokenize/embed or parse into a structured allergen list before modelling.
- n
- 50
- nulls
- 31 (62.0%)
- unique
- 13
- top_value
- top_rate
- 0.3684
- cardinality
- 13
- entropy
- 3.214
- entropy_ratio
- 0.8684
product_name_fr
categorical free_text long_tailFrench-language product names from what appears to be a food/grocery catalogue (chocolate bars, mineral water, biscuits). With 47 unique values across 50 rows and entropy ratio 0.996, this is essentially a free-text label rather than a categorical feature — the top value 'Henry's' only appears twice (4%). One null and a long-tail alert are flagged. Treatment: Treat as free-text product label; tokenize and embed (or use as a join key to a product table) rather than one-hot encoding.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 47
- top_value
- Henry’s
- top_rate
- 0.04082
- cardinality
- 47
- entropy
- 5.533
- entropy_ratio
- 0.9961
stores
categorical feature long_tailComma-delimited list of retail chains where each product was observed (Lidl, Carrefour, Tesco, etc.), stored as a single string per row. Cardinality is high (31 unique across 50 rows, entropy_ratio 0.854) because most non-empty entries are bespoke multi-store concatenations appearing only once. The dominant value is an empty string at 29.17% top_rate plus a 4% null_rate, so roughly a third of rows carry no store information at all. Treatment: Split on comma and one-hot or multi-hot encode individual store names; treat empty string as missing.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 31
- top_value
- top_rate
- 0.2917
- cardinality
- 31
- entropy
- 4.233
- entropy_ratio
- 0.8543
_id
categorical identifier long_tailThis column is a unique record identifier — every one of the 50 rows has a distinct value (n_unique=50, top_rate=0.02, entropy_ratio=1.0). Values look like long numeric codes resembling EAN/GTIN barcodes (e.g., '6111242100992', '7622210578464'), with at least one shorter outlier ('20995553'). The long_tail alert simply reflects that each value occurs exactly once. Treatment: drop from modelling features; retain as a join key.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- 6111242100992
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
nutriments
unknown other skippedThe column 'nutriments' was skipped by the profiler, so no statistics, uniqueness, or value samples are available beyond a row count of 50 and a null rate of 0.0. The name suggests it likely holds nested nutritional data (e.g., a struct or JSON object per product), which is consistent with the profiler's inability to classify it as a standard kind. Without parsed contents we cannot describe its distribution or cardinality. Treatment: Parse/flatten the nested structure into typed sub-columns before profiling or modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
editors
unknown other skippedThe column is named "editors" and was skipped by the profiler, so its kind is unknown and no descriptive statistics were computed. Across 50 rows there are zero nulls, but uniqueness, type, and value distribution are all unreported. Without further evidence, the content (likely a list or nested structure of editor entries) cannot be characterised. Treatment: Inspect raw values manually to determine type before deciding whether to parse, explode, or drop.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
max_imgid
categorical identifier long_tail`max_imgid` holds 38 distinct integer-like strings across 50 rows with no nulls, suggesting it stores the maximum image identifier per record. Distribution is nearly uniform (entropy_ratio 0.98) with the top value '47' appearing only 3 times (top_rate 0.06), so it behaves like a high-cardinality numeric id mis-typed as categorical. The long_tail alert confirms most values occur once or twice. Treatment: Cast to integer and treat as a numeric id; do not one-hot encode.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 38
- top_value
- 47
- top_rate
- 0.06
- cardinality
- 38
- entropy
- 5.149
- entropy_ratio
- 0.9811
nutriscore_grade
categorical labelThis is the Nutri-Score grade, a categorical food-health rating with the expected letter levels a-e plus an 'unknown' bucket, giving 6 distinct values across 50 rows with no nulls. The distribution is heavily weighted toward the worst grade: 'e' alone accounts for 54% (27/50), while healthier grades 'a' and 'b' together cover only 6 rows. Entropy ratio of 0.74 confirms the imbalance, and the lone 'unknown' row signals a missing-data sentinel mixed in with the real grades. Treatment: Treat as ordered categorical (a
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 6
- top_value
- e
- top_rate
- 0.54
- cardinality
- 6
- entropy
- 1.913
- entropy_ratio
- 0.7399
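The entropy and entropy_ratio figures can be reproduced from the Nutri-Score counts in the summary table. The sketch assumes the profiler uses Shannon entropy in bits and normalizes by log2 of the cardinality; that is an assumption about the tool, although it does match the reported 1.913 and 0.7399 for this column.

```python
import math

# Grade counts from the Nutri-Score table in the summary section.
counts = {"e": 27, "d": 9, "c": 7, "a": 4, "b": 2, "unknown": 1}

def entropy_bits(counts):
    """Shannon entropy of a count table, in bits."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def entropy_ratio(counts):
    """Entropy divided by its maximum, log2(k), for k observed levels."""
    k = len(counts)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

h = entropy_bits(counts)
ratio = entropy_ratio(counts)
```

A ratio of 1.0 means all levels are equally frequent; values well below 1 indicate that a few levels dominate, as grade 'e' does here.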
product_quantity_unit
categorical metadata imbalanceUnit of measure for product quantities, taking only 'g' or 'ml'. The distribution is severely imbalanced: 'g' covers 44 of 45 non-null rows (top_rate 0.978) while 'ml' appears just once, and 10% of values are null. Entropy ratio of 0.154 confirms the column carries almost no information as-is. Treatment: Likely drop or collapse to a binary indicator; near-constant with one rare 'ml' case.
- n
- 50
- nulls
- 5 (10.0%)
- unique
- 2
- top_value
- g
- top_rate
- 0.9778
- cardinality
- 2
- entropy
- 0.1537
- entropy_ratio
- 0.1537
ingredients_text_with_allergens_fr
categorical free_text long_tail French-language ingredient lists with embedded HTML markup highlighting allergens. Near-unique: 47 distinct values across the 48 non-null rows (entropy ratio 0.998), with 4% nulls and the empty string, at 2 occurrences, as the modal value. Content varies wildly in length and formatting, mixing prose, percentages, and tagged allergen tokens. Treatment: Strip HTML tags, parse allergen spans into a structured list, then tokenize the remaining text for NLP.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 47
- top_value
- top_rate
- 0.04167
- cardinality
- 47
- entropy
- 5.543
- entropy_ratio
- 0.998
interface_version_modified
categorical metadataA categorical column capturing an interface version stamp, with only 2 distinct values across 50 rows and no nulls. The distribution is heavily skewed: '20150316.jqm2' covers 84% (42 rows) while '20190830' accounts for the remaining 8 rows. The mixed format (one value carries a '.jqm2' suffix, the other is a bare date) suggests a schema or convention change between releases. Treatment: Treat as a binary version flag; one-hot encode or collapse to pre/post-change indicator.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 2
- top_value
- 20150316.jqm2
- top_rate
- 0.84
- cardinality
- 2
- entropy
- 0.6343
- entropy_ratio
- 0.6343
ingredients_text_with_allergens_en
categorical free_text long_tailThis column holds English ingredient lists with embedded HTML markup highlighting allergens like wheat, milk, soy, and nuts. With 36 unique values across 50 rows (entropy ratio 0.95) and a 16% null rate, it's near-unique free text; the top 'value' is actually the empty string (7 occurrences), and one row is junk ('Hhhhh'). The HTML tags and inconsistent casing/punctuation mean it needs cleaning before any allergen extraction. Treatment: Strip HTML, normalize case, and parse allergen spans into a structured multi-label feature before modelling.
- n
- 50
- nulls
- 8 (16.0%)
- unique
- 36
- top_value
- top_rate
- 0.1667
- cardinality
- 36
- entropy
- 4.924
- entropy_ratio
- 0.9525
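The strip-and-parse treatment for the `ingredients_text_with_allergens_*` fields might look like the sketch below. The exact markup is not visible in this report, so the `<span class="allergen">` pattern is an assumption, and the sample string is made up for illustration.

```python
import re
from html import unescape

# Assumption: allergens are wrapped as <span class="allergen">...</span>.
# Adjust the pattern if the real markup differs.
ALLERGEN_SPAN = re.compile(
    r'<span class="allergen">(.*?)</span>', re.IGNORECASE | re.DOTALL
)
ANY_TAG = re.compile(r"<[^>]+>")

def parse_ingredients(text):
    """Return (plain_text, allergens); blank or null input yields (None, [])."""
    if not text or not text.strip():
        return None, []
    allergens = sorted(
        {unescape(m).strip().lower() for m in ALLERGEN_SPAN.findall(text)}
    )
    plain = unescape(ANY_TAG.sub("", text))       # drop every remaining tag
    plain = re.sub(r"\s+", " ", plain).strip()    # collapse stray whitespace
    return plain, allergens

# Made-up example, not a row from the sample.
sample = (
    'Sugar, <span class="allergen">WHEAT</span> flour, cocoa butter, '
    '<span class="allergen">Milk</span> powder, '
    '<span class="allergen">wheat</span> starch'
)
plain, allergens = parse_ingredients(sample)
```

Lower-casing and deduplicating the spans gives a per-row allergen set that can then be multi-hot encoded; junk rows such as 'Hhhhh' pass through as plain text with no allergens.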
code
categorical identifier long_tailThis column holds 50 unique numeric strings of varying length (8 to 13 digits), almost certainly product barcodes (EAN/UPC/GTIN). Every one of 50 rows is unique with no nulls, giving maximum entropy (entropy_ratio 1.0) and a top_rate of just 0.02 — it functions as a row identifier rather than a feature. The long_tail alert simply reflects that uniqueness. Treatment: Use as a join key; drop from any model as it carries no predictive signal.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 50
- top_value
- 6111242100992
- top_rate
- 0.02
- cardinality
- 50
- entropy
- 5.644
- entropy_ratio
- 1
correctors
unknown other skippedThe column 'correctors' was skipped by the profiler, so its kind, uniqueness, and value distribution are unknown. Only the row count (50) and a null rate of 0.0 are reported; no other statistics are available to characterize content. Treatment: Re-profile or inspect manually before deciding on use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
generic_name_ja
categorical metadata long_tail null_rate imbalanceLikely a Japanese generic-name field (generic_name_ja), but it carries essentially no information in this sample: 98% of 50 rows are null and the single non-null value is an empty string, giving cardinality 1 and entropy 0. Treatment: Drop from modelling; retain only if needed for display lookups.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_fr
categorical free_text long_tailFrench-language generic product names, almost certainly from an Open Food Facts-style food catalogue. Cardinality is high (34 unique across 50 rows, entropy ratio 0.87) and most values are one-off descriptors like 'Chocolat noir extra-fin traditionnel à 90% de cacao'. The dominant 'value' is actually the empty string at 29.8% of non-null rows, on top of a 6% null rate, so effectively over a third of records carry no usable label. Treatment: Treat empty strings as missing, then tokenize/embed for any modelling rather than one-hot encoding.
- n
- 50
- nulls
- 3 (6.0%)
- unique
- 34
- top_value
- top_rate
- 0.2979
- cardinality
- 34
- entropy
- 4.42
- entropy_ratio
- 0.8689
generic_name_pl
categorical metadata null_ratePolish-language generic product name field, populated for only 5 of 50 rows (90% null) and containing just 2 distinct values where 4 of the 5 non-nulls are empty strings. Effectively a single real value ('Wyśmienita czkolada gorzka 70% kakao'), making the column unusable as a feature. Treatment: Drop; null rate 0.9 and only one meaningful value.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 2
- top_value
- top_rate
- 0.8
- cardinality
- 2
- entropy
- 0.7219
- entropy_ratio
- 0.7219
ingredients_debug
unknown metadata skippedColumn 'ingredients_debug' was skipped by the profiler, so no type, uniqueness, or distribution stats are available. Only the row count (50) and null rate (0.0) are known; everything else is missing. The name suggests it is a debug/auxiliary field rather than a modelling input. Treatment: Drop from modelling; retain only if needed for debugging upstream pipelines.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
ingredients_text_with_allergens_ja
categorical free_text long_tail null_rate imbalanceJapanese-language ingredient text with allergen markup, almost entirely absent from this sample. 98% of the 50 rows are null, and the only non-null value observed is an empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; the column carries no information in this sample.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
last_modified_by
categorical metadata long_tailThis column records the user or app that last modified each record, dominated by the bot/account 'foodless' which accounts for 21 of 49 non-null entries (top_rate 0.43). With 24 unique values across 50 rows and entropy_ratio 0.77, there's a long tail of mostly singleton contributors alongside a handful of app-like editors (municorn-calorie-counter-app, macrofactor). Null rate is low at 0.02. Treatment: Keep as audit metadata; if used as a feature, collapse the long tail into 'other' and flag bot-vs-human editors.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 24
- top_value
- foodless
- top_rate
- 0.4286
- cardinality
- 24
- entropy
- 3.513
- entropy_ratio
- 0.7662
no_nutrition_data
categorical feature imbalance A flag column indicating products lacking nutrition data, but it carries no information here: the only observed value is the empty string, present in all 48 non-null rows (top_rate 1.0, cardinality 1, entropy 0.0). The remaining 4% of rows are null, so nothing distinguishes one record from another. Treatment: Drop; constant column with zero entropy.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
nutriscore
unknown other skippedThe column is named "nutriscore" but saturn skipped profiling it (kind="unknown"), so no type, uniqueness, or distribution stats are available. All 50 rows are non-null, but nothing else can be confirmed from the evidence. Treatment: Manually inspect and cast to a known type before any downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
origin_nb
categorical metadata null_rate imbalanceThe column 'origin_nb' is effectively empty: 96% of the 50 rows are null and the only non-null value observed is the empty string, which appears twice. Cardinality is 1 with zero entropy, so the field carries no information in this sample. Treatment: Drop; the column is 96% null with a single empty-string value.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origins
categorical free_text long_tail Free-text origin/provenance strings for ingredients or products, with 20 unique values across 50 rows and a 4% null rate. The dominant value is the empty string at 24 of the 48 non-null rows (top_rate 0.5), so half of the populated column is effectively blank rather than missing. The remainder is messy: language mix (France vs Maroc vs Morocco), comma-delimited multi-origin lists, and 'en:'-prefixed taxonomy tags like 'en:Madagarcar vanilla' (note the typo), so this is clearly not a clean categorical. Treatment: Treat empty strings as null, normalise synonyms (Maroc/Morocco), and split on comma into a multi-label set before any encoding.
- n
- 50
- nulls
- 2 (4.0%)
- unique
- 20
- top_value
- top_rate
- 0.5
- cardinality
- 20
- entropy
- 3.027
- entropy_ratio
- 0.7003
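The three-step treatment for `origins` (blank to null, synonym normalisation, comma split) can be sketched as below. The `SYNONYMS` table is hypothetical and covers only the variants named in the description; a real pipeline would need a fuller country lookup.

```python
# Hypothetical synonym table: only the variants named in the profile.
SYNONYMS = {"maroc": "Morocco", "morocco": "Morocco", "france": "France"}

def parse_origins(cell):
    """Return a sorted list of normalised origins, or None for null/blank cells."""
    if cell is None or not cell.strip():
        return None  # blank strings are treated as missing
    labels = set()
    for token in cell.split(","):
        token = token.strip()
        if token:
            # Unknown tokens (e.g. 'en:'-prefixed tags) pass through unchanged.
            labels.add(SYNONYMS.get(token.lower(), token))
    return sorted(labels)

# Made-up inputs for illustration.
examples = [parse_origins(c) for c in ["", "France, Maroc,Morocco", None]]
```

Returning `None` for blanks keeps "no information" distinct from an empty label set, which matters when half of the populated rows are blank.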
languages
unknown other skippedThe column is named 'languages' but saturn skipped profiling, so type and distribution are unknown. With 50 rows and no nulls, every record carries some value, yet n_unique and other stats are unavailable. The name suggests a list-like field (e.g., languages spoken or supported), which would explain why the dissector flagged it as unknown. Treatment: Inspect raw values and parse (likely explode list-typed entries) before further profiling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
lang
categorical featureThis is a language code column with 5 distinct values and no nulls across 50 rows. The distribution is heavily dominated by 'fr' at 70% (35/50), with 'en' a distant second at 10 occurrences and 'de', 'bg', 'ro' appearing only 1-3 times each. Entropy ratio of 0.56 confirms the imbalance, and the long tail of rare languages (bg, ro with single observations) may be unstable for any per-language modelling. Treatment: One-hot encode with rare languages (bg, ro, de) collapsed into an 'other' bucket.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 5
- top_value
- fr
- top_rate
- 0.7
- cardinality
- 5
- entropy
- 1.294
- entropy_ratio
- 0.5572
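Collapsing rare languages before one-hot encoding could look like the following sketch. The `min_count=5` threshold and the helper names are hypothetical; the 3/1/1 split for de/bg/ro is inferred from the description (single observations for bg and ro, 5 rows remaining after fr and en).

```python
from collections import Counter

# Counts per the profile: fr 35, en 10; de/bg/ro share the other 5 rows
# (3/1/1 is inferred from the description, not read from the data).
langs = ["fr"] * 35 + ["en"] * 10 + ["de"] * 3 + ["bg"] + ["ro"]

def collapse_rare(values, min_count=5, other="other"):
    """Replace levels seen fewer than min_count times with a shared bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def one_hot(values):
    """Return (levels, matrix) with one 0/1 column per level."""
    levels = sorted(set(values))
    return levels, [[int(v == level) for level in levels] for v in values]

collapsed = collapse_rare(langs)
levels, matrix = one_hot(collapsed)
```

With this threshold the encoding has three columns (en, fr, other) instead of five, so no column is driven by a single row.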
packaging_text_sv
categorical free_text null_rate imbalanceSwedish packaging text field that is effectively empty: 92% of the 50 rows are null and the remaining 4 non-null values are all the empty string, giving a single observed category with entropy 0. There is no usable signal here. Treatment: Drop; column is 92% null and the only non-null value is an empty string.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
photographers
unknown other skippedThis column 'photographers' was skipped by the profiler, so no type, uniqueness, or value statistics are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds photographer attributions, possibly as a list or nested structure that the dissector could not parse. Without unique counts or sample values, nothing further can be inferred. Treatment: Inspect raw values manually to determine structure before deciding whether to parse, explode, or drop.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
languages_codes
unknown other skippedThis column is named languages_codes but saturn skipped detailed profiling, leaving kind as unknown with no uniqueness or value statistics. The only confirmed signals are 50 rows and a 0.0 null rate. Without sample values or cardinality, the structure (single code, list, or delimited string) cannot be determined from the evidence. Treatment: Re-profile with parsing enabled to determine whether values are scalar codes or lists before use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
ecoscore_grade
categorical labelThis is the Eco-Score grade, a categorical environmental rating with letter tiers from a-plus through f plus sentinel values 'unknown' and 'not-applicable'. Distribution skews toward worse grades: 'e' leads at 12/50 (top_rate 0.24), followed by 'd' (9), while 'a' and 'a-plus' together account for only 5 rows. Six rows are 'unknown' and one 'not-applicable', so roughly 14% of values are non-informative sentinels that need handling. Treatment: Map sentinels ('unknown','not-applicable') to NA and treat the remaining tiers as an ordinal factor.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 9
- top_value
- e
- top_rate
- 0.24
- cardinality
- 9
- entropy
- 2.808
- entropy_ratio
- 0.8857
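A minimal sketch of the sentinel mapping and ordinal encoding, using the Eco-Score counts from the summary table. The rank order (a-plus best, f worst) follows the tier description above; the integer codes themselves are an arbitrary illustrative choice.

```python
# Tier order from best to worst, per the column description.
ORDER = ["a-plus", "a", "b", "c", "d", "e", "f"]
RANK = {grade: i for i, grade in enumerate(ORDER)}
SENTINELS = {"unknown", "not-applicable"}

def encode_grade(value):
    """Map a grade to its ordinal rank; sentinels and nulls become None."""
    if value is None or value in SENTINELS:
        return None
    return RANK[value]  # raises KeyError on unexpected labels, by design

# Counts from the Eco-Score table in the summary section.
counts = {"e": 12, "d": 9, "b": 8, "c": 8, "unknown": 6,
          "a": 3, "a-plus": 2, "not-applicable": 1, "f": 1}
grades = [g for g, n in counts.items() for _ in range(n)]

encoded = [encode_grade(g) for g in grades]
missing_share = encoded.count(None) / len(encoded)
```

Failing loudly on unknown labels is a deliberate choice here, so that a new sentinel does not get silently ranked as a grade.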
ingredients_n
numeric featureNumeric count of ingredients per record, ranging from 1 to 39 with a median of 9 and mean of 11.7. The distribution is right-skewed (skew 1.24, kurtosis 1.44) with a wide IQR of 11 and 2 outliers (4%) on the high end. No nulls or zeros, and 22 unique values across 50 rows suggest a discrete count variable. Treatment: Consider log or sqrt transform before modelling to tame the right skew.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 22
- min
- 1
- max
- 39
- mean
- 11.7
- median
- 9
- std
- 8.244
- q1
- 5
- q3
- 16
- iqr
- 11
- skew
- 1.237
- kurtosis
- 1.435
- n_outliers
- 2
- outlier_rate
- 0.04
- zero_rate
- 0
allergens
categorical featureCategorical allergen tags using an Open Food Facts-style 'en:' prefix, often combined as comma-separated lists (e.g., 'en:gluten,en:milk,en:soybeans'). The most common value is an empty string at 32% (16/50), suggesting missing or no-allergen records encoded as blanks rather than nulls. With 16 unique values across 50 rows and entropy ratio 0.84, the distribution is fairly spread; gluten, milk, and soybeans dominate the non-empty tags. Treatment: Split on comma and multi-hot encode allergen tags; treat empty string as missing.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 16
- top_value
- top_rate
- 0.32
- cardinality
- 16
- entropy
- 3.364
- entropy_ratio
- 0.8411
product_name
categorical free_text long_tailFree-text product name field with 49 unique values across 50 rows, near-maximal entropy ratio of 0.998 — effectively a per-row label. Values mix languages (French, English, Cyrillic) and formats (brand-only like 'Henry's' versus full descriptors like 'CRISTALINE Eau De Source 0.5L'), and one row is an empty string despite a reported null_rate of 0.0. The single repeat ('Henry's', 2) is the only signal preventing full uniqueness. Treatment: Normalize casing and empty strings, then tokenize/embed rather than one-hot encode.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- 49
- top_value
- Henry’s
- top_rate
- 0.04
- cardinality
- 49
- entropy
- 5.604
- entropy_ratio
- 0.9981
purchase_places
categorical free_text long_tail
Free-form purchase location strings, often listing multiple places per row separated by commas. France is the most common value at 9 of the 49 non-null rows (18.4%), followed by an empty string (6) and Maroc (5). With 32 unique values and an entropy ratio of 0.90, the long tail includes multi-country concatenations like 'Madrid,España,Montargis,France,Würzburg,Deutschland,...'. Mixed languages (Maroc vs Morocco, España vs Spain) and embedded postal codes signal inconsistent data entry rather than a clean categorical. Treatment: Split on commas and normalize each token to a canonical country before using as a multi-label feature.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 32
- top_value
- France
- top_rate
- 0.1837
- cardinality
- 32
- entropy
- 4.479
- entropy_ratio
- 0.8958
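One way to approach the split-and-normalize treatment is a small alias lookup. The `COUNTRY_ALIASES` table below is a hypothetical starting point built from the spellings mentioned in the description; a real pipeline would need a fuller list and a decision about city names and postal codes, which this sketch simply discards.

```python
# Hypothetical alias table; extend it with the spellings found in the data.
COUNTRY_ALIASES = {
    "france": "France",
    "maroc": "Morocco",
    "morocco": "Morocco",
    "españa": "Spain",
    "spain": "Spain",
    "deutschland": "Germany",
    "germany": "Germany",
}


def canonical_countries(raw):
    """Split on commas and keep only tokens that map to a known country."""
    if raw is None or not raw.strip():
        return None  # treat null and empty string alike as missing
    found = []
    for token in raw.split(","):
        country = COUNTRY_ALIASES.get(token.strip().casefold())
        if country and country not in found:
            found.append(country)
    return found
```

For example, `canonical_countries("Madrid,España,Montargis,France,Würzburg,Deutschland")` keeps the three country tokens and drops the city names.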
quantity
categorical feature long_tail
Product quantities recorded as free-text strings, dominated by gram weights but with no consistent format: '100 g', '100g', and '100 gram' all appear as separate top values. With 36 unique values across 50 rows and an entropy ratio of 0.959, the field is highly fragmented; the most common value, '100 g', covers only 12.2% of non-null rows. One row (2%) is null and 2 more are empty strings. The long_tail alert reflects this unit and spacing inconsistency rather than genuine variety. Treatment: Normalize units and parse into a numeric grams column before use.
- n
- 50
- nulls
- 1 (2.0%)
- unique
- 36
- top_value
- 100 g
- top_rate
- 0.1224
- cardinality
- 36
- entropy
- 4.956
- entropy_ratio
- 0.9587
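A minimal sketch of the parse-to-grams step follows. The regular expression and the `UNIT_TO_GRAMS` table are assumptions: only '100 g', '100g' and '100 gram' are confirmed spellings in this sample, and volume strings are deliberately left unparsed.

```python
import re

# Multipliers to grams; unit spellings beyond 'g' and 'gram' are assumptions.
UNIT_TO_GRAMS = {"g": 1.0, "gr": 1.0, "gram": 1.0, "grams": 1.0, "kg": 1000.0}

QUANTITY_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z]+)\s*$")


def parse_grams(raw):
    """Return the quantity in grams, or None if it is not a parseable weight."""
    if raw is None:
        return None
    match = QUANTITY_RE.match(raw)
    if not match:
        return None
    number, unit = match.groups()
    factor = UNIT_TO_GRAMS.get(unit.lower())
    if factor is None:
        return None  # volumes such as '0.5L' need a separate rule
    return float(number.replace(",", ".")) * factor
```

With this, the three spellings of 100 grams collapse to the same numeric value, while blanks and unrecognized units come back as `None` for later inspection.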
origin_uk
categorical feature long_tail null_rate imbalance
Going by the `_uk` language suffix used by the other localized columns in this sample, this is most likely the Ukrainian-language origin text field rather than a UK-origin flag. It carries virtually no signal: 98% of the 50 rows are null and the only non-null value observed is an empty string. With a cardinality of 1 and entropy of 0, the column has no discriminative power as it stands. Treatment: Drop; column is 98% null with a single empty-string value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
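Many columns in this report share the same "null or empty string only" pattern. A generic check for it might look like the sketch below; the dict-of-lists `table` layout and both function names are assumptions made to keep the example free of dependencies.

```python
def is_effectively_empty(values):
    """True when every value is None or a blank string."""
    return all(
        v is None or (isinstance(v, str) and v.strip() == "") for v in values
    )


def droppable_columns(table):
    """table maps column name -> list of values; return names with no content."""
    return [name for name, values in table.items() if is_effectively_empty(values)]


# Toy table shaped like the profile: 49 nulls plus one empty string.
table = {
    "origin_uk": [None] * 49 + [""],
    "product_name": ["Henry's"] + [None] * 49,
}
empty_columns = droppable_columns(table)
```

Running the check once over all columns avoids deciding the drop column by column.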
generic_name_ar
categorical metadata null_rateArabic generic-name field that is overwhelmingly empty: 80% of the 50 rows are null and of the 10 populated rows, 9 are blank strings and only 1 carries an actual Arabic value (الامير). With cardinality of just 2 and a top_rate of 0.9 on the empty string, this column carries almost no information as currently captured. Treatment: Drop or defer until source data is backfilled; not usable as-is.
- n
- 50
- nulls
- 40 (80.0%)
- unique
- 2
- top_value
- top_rate
- 0.9
- cardinality
- 2
- entropy
- 0.469
- entropy_ratio
- 0.469
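The entropy figures can be reproduced from the counts in the description. The sketch below assumes Shannon entropy in bits over non-null values, normalized by log2 of the cardinality; that normalization is an inference that matches the reported numbers, not a documented formula of the profiler.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (base 2) of the non-null values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_ratio(values):
    """Entropy divided by log2(cardinality); 0 when only one value exists."""
    cardinality = len({v for v in values if v is not None})
    if cardinality <= 1:
        return 0.0
    return entropy_bits(values) / math.log2(cardinality)


# 40 nulls, 9 blank strings, 1 real value, as described for generic_name_ar.
column = [None] * 40 + [""] * 9 + ["الامير"]
```

With two distinct values the normalizer is log2(2) = 1, which is why entropy and entropy_ratio coincide for this column.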
packaging_text_uk
categorical metadata long_tail null_rate imbalanceThis column appears to be Ukrainian packaging text, but it is effectively empty: 98% of the 50 rows are null and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so it carries no information. Treatment: Drop; the column has no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_ar
categorical free_text null_rateArabic-language ingredients text, populated for only 11 of 50 rows (null_rate 0.78) and with 10 of those 11 non-null entries being empty strings. Only one row carries an actual Arabic ingredient list, giving cardinality 2 and a top_rate of 0.91 on the empty string. Effectively unusable as a feature on this sample. Treatment: Drop for modelling; retain only if you specifically need Arabic ingredient parsing and can source more populated rows.
- n
- 50
- nulls
- 39 (78.0%)
- unique
- 2
- top_value
- top_rate
- 0.9091
- cardinality
- 2
- entropy
- 0.4395
- entropy_ratio
- 0.4395
ingredients_text_uk
categorical free_text long_tail null_rate imbalanceUkrainian-language ingredients text, almost entirely absent in this sample. 98% of rows are null and the single non-null value is an empty string, leaving zero usable content. Entropy is 0 and cardinality is 1, so the column carries no signal here. Treatment: Drop from this slice; revisit only if a Ukrainian-locale subset is loaded.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
checked
categorical feature null_rate imbalanceThis looks like a checkbox-style flag (likely from a web form), where the only observed value is "on" in 7 of 50 rows. The remaining 86% are null, and entropy is 0.0 because there is no variation among the non-null entries. With cardinality of 1, the column carries no discriminative signal as captured. Treatment: Convert to a boolean (on vs null) or drop, since it has only one observed value.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 1
- top_value
- on
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_ar
categorical free_text null_rate imbalanceThis appears to be Arabic-language packaging text, but it carries no information in this sample: 80% of the 50 rows are null and the remaining 10 values are all empty strings, giving cardinality 1 and entropy 0. There is nothing to model or join on here. Treatment: Drop; the column is effectively constant-empty with 80% nulls.
- n
- 50
- nulls
- 40 (80.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
carbon_footprint_percent_of_known_ingredients
numeric feature null_rate
Numeric coverage metric that appears to give the share of a product's known ingredients that have a carbon footprint estimate, ranging from 8.0 to 105.0 with a median of 70.0. The 62% null rate is the dominant signal: only 19 of 50 rows carry a value (all 19 distinct), so most records lack any coverage figure at all. The max of 105.0 is surprising for what reads like a percentage, and the distribution is slightly left-skewed (skew -0.45) with no flagged outliers. Treatment: Impute or add a missingness indicator before modelling, and verify whether values above 100 are valid.
- n
- 50
- nulls
- 31 (62.0%)
- unique
- 19
- min
- 8
- max
- 105
- mean
- 61.79
- median
- 70
- std
- 28.98
- q1
- 45.5
- q3
- 78.3
- iqr
- 32.8
- skew
- -0.4493
- kurtosis
- -0.8083
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
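A small sketch of the treatment: keep a missingness indicator next to the filled value and flag anything above 100 for review. Filling with the reported median of 70.0 is an assumption, and `prepare_percent` is a hypothetical helper name.

```python
def prepare_percent(value):
    """Return (filled_value, is_missing, is_over_100) for one record."""
    if value is None:
        # 70.0 is the reported median; using it as the fill value is an assumption.
        return 70.0, 1, 0
    return float(value), 0, int(value > 100)
```

The `is_over_100` flag only marks rows such as the observed maximum of 105 for checking; it does not decide whether those values are valid.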
last_checker
categorical metadata null_rateThis looks like the username of the last reviewer/checker on a record, with only 4 distinct values across 50 rows. The column is 86% null, so just 7 rows carry a value, and 'aleene' accounts for 3 of those (top_rate 0.43). Entropy ratio of 0.92 indicates the few present values are spread fairly evenly across the small handful of checkers. Treatment: Treat missingness as a 'never checked' category; too sparse to use as a model feature.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 4
- top_value
- aleene
- top_rate
- 0.4286
- cardinality
- 4
- entropy
- 1.842
- entropy_ratio
- 0.9212
product_name_uk
categorical metadata long_tail null_rate imbalanceThis appears to be a Ukrainian-language product name field that is effectively empty: 98% of the 50 rows are null and the single non-null value is itself an empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here whatsoever. Treatment: Drop the column; it carries no information.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_uk
categorical metadata long_tail null_rate imbalance
This appears to be the Ukrainian-language (`uk`) generic product name field, but it is effectively empty in this sample: 98% of the 50 rows are null and the only non-null value is an empty string. Cardinality is 1 with zero entropy, so the column carries no information here. Treatment: Drop; no usable signal at this null rate and cardinality.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_ar
categorical metadata long_tail null_rateArabic-language product name field that is mostly absent: 78% null and only 6 distinct values across 50 rows. The non-null entries are a language mix — one Arabic string (برنس) alongside Spanish and English names like 'Leche Y Almendras' and 'Chocolate Negro 92% Cacao' — suggesting the column is not consistently populated with Arabic translations. The most frequent observed value is an empty string (6 occurrences, 54.5% of non-nulls), indicating empties coexist with true nulls. Treatment: Drop or defer until translation coverage improves; normalise empty strings to null and validate language before use.
- n
- 50
- nulls
- 39 (78.0%)
- unique
- 6
- top_value
- top_rate
- 0.5455
- cardinality
- 6
- entropy
- 2.049
- entropy_ratio
- 0.7928
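The "normalise empty strings to null and validate language" step could be approximated with a script check. This sketch only tests for characters in the basic Arabic Unicode block (U+0600 to U+06FF), which is a coarse proxy for language; the function names are hypothetical.

```python
def has_arabic(text):
    """True if any character falls in the basic Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06ff" for ch in text)


def clean_arabic_name(raw):
    """Map blanks to None and reject names with no Arabic characters."""
    if raw is None or raw.strip() == "":
        return None
    return raw if has_arabic(raw) else None
```

Under this rule 'برنس' is kept, while 'Leche Y Almendras' and the blank strings all become `None`.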
carbon_footprint_from_known_ingredients_debug
categorical metadata long_tail null_rateDebug trace string showing the per-ingredient carbon footprint computation (percentage × emission factor = grams) for each product. Every one of the 14 non-null values is unique (entropy_ratio ≈ 1.0, top_rate 0.07), and 72% of rows are null, so it functions as a verbose audit log rather than a feature. Treatment: Drop from modelling; retain only for auditing the carbon calculation.
- n
- 50
- nulls
- 36 (72.0%)
- unique
- 14
- top_value
- en:cereal 50% x 0.3 = 15 g -
- top_rate
- 0.07143
- cardinality
- 14
- entropy
- 3.807
- entropy_ratio
- 1
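If the trace is kept for auditing, its steps can be pulled apart with a regular expression. The pattern below is inferred from the single example shown in the profile ('en:cereal 50% x 0.3 = 15 g -'), so other rows may need a looser pattern; `parse_trace` is a hypothetical helper.

```python
import re

# Pattern inferred from the single example shown in the profile.
STEP_RE = re.compile(
    r"(?P<tag>[a-z]{2}:[\w-]+)\s+(?P<pct>\d+(?:\.\d+)?)%\s*x\s*"
    r"(?P<factor>\d+(?:\.\d+)?)\s*=\s*(?P<grams>\d+(?:\.\d+)?)\s*g"
)


def parse_trace(trace):
    """Extract (tag, percent, factor, grams) tuples from a debug string."""
    return [
        (m["tag"], float(m["pct"]), float(m["factor"]), float(m["grams"]))
        for m in STEP_RE.finditer(trace)
    ]


steps = parse_trace("en:cereal 50% x 0.3 = 15 g -")
```

Each parsed step can then be checked against the arithmetic in the description: percent times factor should equal the grams figure.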
last_checked_t
numeric timestamp null_rateValues are Unix epoch seconds ranging from 1540933974 to 1730226344, consistent with 'last checked' timestamps spanning roughly late 2018 to late 2024. Severe sparsity dominates: null_rate is 0.86 and only 7 unique values populate the 50 rows, so this column is barely usable as-is. Distribution is mildly right-skewed (skew 0.81) with no outliers flagged. Treatment: Convert from epoch seconds to datetime and treat as mostly-missing; impute or drop before modelling.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- min
- 1.541e+09
- max
- 1.73e+09
- mean
- 1.607e+09
- median
- 1.565e+09
- std
- 7.772e+07
- q1
- 1.556e+09
- q3
- 1.652e+09
- iqr
- 9.601e+07
- skew
- 0.8106
- kurtosis
- -1.103
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
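The conversion from epoch seconds can be sketched with the standard library. Interpreting the values as UTC is an assumption; the two sample values are the min and max quoted in the description.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert epoch seconds to an aware UTC datetime; None stays None."""
    if seconds is None:
        return None
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


earliest = epoch_to_utc(1540933974)
latest = epoch_to_utc(1730226344)
```

Both endpoints fall in October, of 2018 and 2024 respectively, which agrees with the "late 2018 to late 2024" reading.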
ingredients_text_with_allergens_uk
categorical free_text long_tail null_rate imbalance
This appears to be the Ukrainian-language (`uk`) variant of the ingredients-with-allergens text field, but it is effectively empty in this sample. 98% of the 50 rows are null, and the single non-null value is itself an empty string, giving cardinality 1 and zero entropy. There is no usable signal here. Treatment: Drop; no observed values in this sample.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_ar
categorical free_text null_rateArabic-language ingredients text (with allergen markup) for food products. The column is 82% null across just 50 rows, and of the 9 non-null entries 8 are empty strings — only 1 row carries an actual ingredient list. Effectively no usable signal at this sample size. Treatment: Drop for now; revisit only if a larger Arabic-localized sample becomes available.
- n
- 50
- nulls
- 41 (82.0%)
- unique
- 2
- top_value
- top_rate
- 0.8889
- cardinality
- 2
- entropy
- 0.5033
- entropy_ratio
- 0.5033
origin_ar
categorical metadata null_rate imbalance
Most likely the Arabic-language origin text field. `origin_ar` carries a single observed value (an empty string) across the 10 non-null rows, while 80% of records are null. With cardinality 1 and entropy 0, the column conveys no information in this sample. Treatment: Drop; the column is 80% null and constant on the remainder.
- n
- 50
- nulls
- 40 (80.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
nutriments_estimated
unknown other skippedThe column `nutriments_estimated` was skipped by the profiler, so no type, uniqueness, or distribution stats are available. The only facts on record are that all 50 sampled rows are non-null and the kind is unknown. Without further evidence, the content and structure of this field cannot be characterised. Treatment: Re-profile with an appropriate parser before deciding on downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
nutrition_score_warning_no_fiber
numeric feature null_rate constant
This appears to be a binary warning flag indicating a missing-fiber condition in a nutrition score, encoded as 1 when triggered. Every one of the 15 non-null rows holds the value 1.0, and 70% of rows are null, consistent with a sparse flag that is only populated when the warning fires. With zero variance among the populated rows, it carries no discriminative signal as-is. Treatment: If nulls can be read as 'warning not raised', recode them to 0 to obtain a usable binary indicator (15 ones, 35 zeros); otherwise drop.
- n
- 50
- nulls
- 35 (70.0%)
- unique
- 1
- min
- 1
- max
- 1
- mean
- 1
- median
- 1
- std
- 0
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
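Recoding the sparse flag is a one-liner, sketched here on a toy column with the same shape as the profile. It rests on the assumption that a null means the warning did not fire, which the sample cannot confirm.

```python
def recode_flag(value):
    """Map null to 0 and any populated value to 1."""
    return 0 if value is None else 1


# 15 populated rows and 35 nulls, as in the profile.
column = [1.0] * 15 + [None] * 35
flags = [recode_flag(v) for v in column]
```

After recoding, the column has two levels and a 30% positive rate, so it is no longer constant.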
completed_t
numeric timestamp null_rateValues are 10-digit integers ranging from 1628199203 to 1763195431, consistent with Unix epoch seconds spanning roughly 2021 through 2025 — almost certainly a 'completed at' timestamp. The 68% null rate is the dominant signal, suggesting most records were never completed. Distribution across the non-null 16 unique values is near-symmetric (skew ~0.001) with no outliers. Treatment: Convert from epoch seconds to datetime and treat nulls as 'not yet completed' rather than imputing.
- n
- 50
- nulls
- 34 (68.0%)
- unique
- 16
- min
- 1.628e+09
- max
- 1.763e+09
- mean
- 1.7e+09
- median
- 1.703e+09
- std
- 4.07e+07
- q1
- 1.663e+09
- q3
- 1.74e+09
- iqr
- 7.618e+07
- skew
- 0.001247
- kurtosis
- -1.155
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
product_name_bg
categorical metadata long_tail null_rateThis is a Bulgarian-language product name field, with values like 'Шоколад 85% какаова маса' indicating localized chocolate/cocoa product labels. It is 94% null across 50 rows, leaving only 3 non-null entries that are all unique. With so little populated data, this column carries almost no analytical signal in its current state. Treatment: Drop or defer until Bulgarian localization coverage improves; too sparse to use.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- Шоколад 85% какаова маса
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
ingredients_text_et
categorical free_text long_tail null_rateFree-text ingredient lists ostensibly tagged as Estonian (et), but the three observed values mix Slovenian, German, and Estonian, suggesting mislabeled locale tagging. The field is 94% null with only 3 non-null entries out of 50, so any signal here is anecdotal at best. Treatment: Drop or defer; too sparse and language-inconsistent to model without a language-detection cleanup pass.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- kakavova masa, manjmasten kakavov prah, kakavovo maslo, sladkor, emulgator: lecitini (_sojin_ lecitin); ekstrakt vanilije.
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
origin_sl
categorical metadata long_tail null_rate imbalance
Most likely the Slovenian-language (`sl`) origin text field, but it is effectively empty in this sample. 98% of the 50 rows are null, and the single non-null value is itself a blank string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; the column carries no usable signal at this null rate.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_dz
categorical metadata long_tail null_rate imbalance
This appears to be a localized generic-name field. `dz` is the ISO 639-1 code for Dzongkha, though the suffix may have been chosen by mistake for Algeria, whose country code is DZ. Either way the column is effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no information. Treatment: Drop; no signal (98% null, single empty-string value).
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_sl
categorical free_text long_tail null_rate imbalance
Slovenian-language ingredients text, almost entirely empty: 98% null with only 1 non-null value across 50 rows. The single observed entry is a free-text ingredient declaration for a cocoa-based product rather than a value from a controlled vocabulary, so the categorical framing is misleading. Treatment: Drop for modelling; if needed, treat as free text and merge with other-language ingredient fields.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Kakavova masa, manjmasten kakavov prah, kakavovo maslo, sladkor, emulgator: lecitini (sojin lecitin); ekstrakt vanilije. Lahko vsebuje sledi oreškov (lešniki, mandlji, pistacija) in mleka. Uporabno najmanj do: glej odtis na zadnji strani embalaže.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_ca
categorical metadata null_rate imbalanceThis appears to be a Catalan-language generic product name field, but it is effectively empty: 96% of rows are null and the only non-null value observed is the empty string (2 occurrences). Cardinality is 1 with zero entropy, so the column carries no information in this sample. Treatment: Drop; the column is 96% null with a single empty-string value.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_dz
categorical free_text long_tail null_rate imbalanceThis appears to be a Dzongkha-language ingredients text field, likely a localized variant of a multilingual product description column. Out of 50 rows, 98% are null and the single non-null value is an empty string, leaving zero usable content. Treatment: Drop; effectively empty in this sample.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_ca
categorical metadata null_rate imbalanceThis appears to be a Catalan-language product name field, but it is effectively empty: 96% of the 50 rows are null and the only 2 non-null values are blank strings, giving a single observed category with entropy 0. There is no usable signal here. Treatment: Drop the column; it is 96% null with no distinct values.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_ca
categorical feature null_rate imbalance
Given the Catalan reading of the other `_ca` columns, this is most likely the Catalan-language origin text field rather than a Canadian-origin flag. It is effectively empty: 96% of the 50 rows are null, and the only 2 non-null values are both blank strings. With a cardinality of 1 and entropy of 0, the column carries no information. Treatment: Drop; the column is 96% null and the remaining values are empty strings.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_et
categorical metadata long_tail null_rateEstonian-localized product name field, but 94% of the 50 rows are null and only 3 distinct values appear among the remainder — including one empty string and two names that are actually French and English, not Estonian. With each surviving value occurring exactly once (entropy_ratio 1.0), this column carries almost no usable signal and shows a language-tagging mismatch. Treatment: Drop from modelling; revisit upstream localization pipeline since values aren't in Estonian.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- Chocolat noir - 85% cacao
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
ingredients_text_with_allergens_bg
categorical free_text long_tail null_rate
Bulgarian-language ingredient lists with inline HTML allergen markup, localised for the bg market. Coverage is extremely thin: 94% null and only 3 distinct values across 50 rows, one of which is an empty string. The two real entries are confectionery ingredient declarations mentioning soy, hazelnuts and milk allergens. Treatment: Strip the HTML tags and treat as free text; too sparse (94% null) to use as a feature without aggregation across locales.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- Какаова маса, нискомаслено какао на прах, какаово масло, захар, емулгатор: лецитин (соеви), екстракт от ванилия, Може да съдържа следи от ядки и мляко,
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
ingredients_text_with_allergens_et
categorical free_text long_tail null_rate
Estonian-localised ingredient text with allergen markup, but only 3 of 50 rows carry a value (null_rate 0.94). All three are unique, and only one of them is actually in Estonian: the others are Slovenian and German. The field is essentially empty, and the few populated rows show a language mix rather than the expected `et` locale, suggesting upstream localisation fallback or mislabelling. Treatment: Drop unless you specifically need allergen extraction; 94% nulls and inconsistent language make it unusable as-is.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- kakavova masa, manjmasten kakavov prah, kakavovo maslo, sladkor, emulgator: lecitini (sojin lecitin); ekstrakt vanilije.
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
origin_sk
categorical foreign_key long_tail null_rate imbalance
`origin_sk` is most likely the Slovak-language origin text field rather than a surrogate key: the `_sk` suffix matches the language codes on the other localized columns, so the foreign_key tag looks like a misclassification. It carries almost no information in this slice: 98% of the 50 rows are null and the single non-null value is an empty string. Cardinality is 1 and entropy is 0, so the column is effectively constant where populated. Treatment: Drop from modelling.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
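Because the summary describes the localized columns as per-language fields, a two-letter suffix such as `_sk`, `_bg` or `_ca` is worth checking against language codes before reading it as "surrogate key", "block group" or "Canada". A rough sketch follows; `LANGUAGE_CODES` is a small illustrative subset of ISO 639-1 covering the suffixes in this section, and `split_localized` is a hypothetical helper.

```python
# Small illustrative subset of ISO 639-1 codes seen in this section.
LANGUAGE_CODES = {
    "ar": "Arabic", "bg": "Bulgarian", "ca": "Catalan", "dz": "Dzongkha",
    "et": "Estonian", "pt": "Portuguese", "sk": "Slovak", "sl": "Slovenian",
    "uk": "Ukrainian",
}


def split_localized(column):
    """Return (base_field, language) if the name ends in a known code, else None."""
    base, sep, suffix = column.rpartition("_")
    if sep and base and suffix in LANGUAGE_CODES:
        return base, LANGUAGE_CODES[suffix]
    return None
```

Grouping columns by the returned base field also makes it easy to merge the sparse per-language variants into one text feature.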
origin_bg
categorical foreign_key null_rate imbalance
Most likely the Bulgarian-language (`bg`) origin text field rather than a block-group identifier or geographic foreign key. It is effectively unusable here: 94% of rows are null, and the only distinct value observed across the 50 rows is the empty string (3 occurrences), giving entropy 0.0. Treatment: Drop; no usable signal at this null rate and cardinality.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_sl
categorical free_text long_tail null_rate imbalanceThis appears to be a Slovenian-language packaging text field, but it is effectively empty: 98% of the 50 rows are null and the single non-null value is an empty string, giving cardinality 1 and entropy 0. There is no usable signal here. Treatment: Drop; the column is 98% null with only an empty-string value otherwise.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_sk
categorical foreign_key long_tail null_rate imbalance
Most likely the Slovak-language (`sk`) generic product name field rather than a surrogate key; nothing in this food dataset points to a drug-name dimension. It is effectively empty in this sample: 98% of rows are null and the single non-null value is the empty string, giving cardinality 1 and zero entropy. Treatment: Drop or defer until a non-empty sample is available; carries no signal here.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_sl
categorical free_text long_tail null_rate imbalance
This appears to be a Slovenian-language ingredients list with embedded allergen HTML markup, likely a localized product label field. The column is almost entirely empty with a null_rate of 0.98, leaving only 1 non-null row out of 50, and that single value is the only unique entry (cardinality 1, entropy 0.0). With essentially no signal and HTML mixed into the text, it carries no analytical value as-is. Treatment: Drop; 98% null and only one observed value make it unusable, or strip HTML and reserve for text extraction if more rows arrive.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Kakavova masa, manjmasten kakavov prah, kakavovo maslo, sladkor, emulgator: lecitini (sojin lecitin); ekstrakt vanilije. Lahko vsebuje sledi oreškov (lešniki, mandlji, pistacija) in mleka. Uporabno najmanj do: glej odtis na zadnji strani embalaže.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_ca
categorical free_text null_rate imbalanceCatalan-language ingredients text field, almost entirely absent from this sample. 96% of the 50 rows are null and the only 2 non-null values are empty strings, giving a single distinct value and zero entropy. Treatment: Drop; the column carries no usable signal in this sample.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_sl
categorical metadata long_tail null_rate imbalanceThis column appears to be a Slovenian-language generic name field that is effectively empty in this sample. With a 98% null rate and the only non-null value being an empty string, there is zero usable signal (entropy 0.0, cardinality 1). Treatment: Drop; no usable values present.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_dz
categorical metadata long_tail null_rate imbalanceThis appears to be a localized product name field (Dzongkha or similar locale suffix), but it is effectively empty: 98% of the 50 rows are null and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no usable signal in this sample. Treatment: Drop; the column is 98% null with a single empty-string value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_et
categorical metadata null_rate imbalance
`origin_et` is most likely the Estonian-language origin text field, but it carries almost no information here: 94% of the 50 rows are null and the only non-null value observed is the empty string, which accounts for all 3 populated rows. Cardinality is 1 and entropy is 0, so the column is effectively constant where present. Treatment: Drop; constant empty value with 94% nulls offers no signal.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_sk
categorical identifier long_tail null_rate imbalance
Most likely the Slovak-language (`sk`) ingredient text with allergen markup rather than a surrogate key, which would make the identifier tag a misclassification. It is effectively empty: 98% of 50 rows are null and the only observed value is the empty string. Cardinality is 1 with zero entropy, so there is no usable signal here. Treatment: Drop; the column is 98% null with a single empty value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_sk
categorical free_text long_tail null_rate imbalanceAlmost certainly a Slovak product name field that is effectively empty in this slice — 98% of the 50 rows are null and the single non-null value is itself an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here whatsoever. Treatment: Drop; the column is 98% null with only an empty string observed.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_pt
categorical free_text long_tail null_rate
Portuguese-language ingredient lists with embedded HTML allergen tags (…), likely scraped from a food product database. The column is sparsely populated with a 0.84 null rate, and among the 8 non-null rows 5 are empty strings, leaving only 3 genuine ingredient declarations. Each non-empty value is unique and contains raw HTML markup rather than cleaned text. Treatment: Strip HTML tags to extract allergen tokens, then treat as sparse free text; too null-heavy for direct modelling.
- n
- 50
- nulls
- 42 (84.0%)
- unique
- 4
- top_value
- top_rate
- 0.625
- cardinality
- 4
- entropy
- 1.549
- entropy_ratio
- 0.7744
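Stripping the markup while keeping the allergen tokens can be done with the standard-library HTML parser. The sketch below assumes allergens are wrapped as `<span class="allergen">…</span>`; the actual tag was lost in this report, so treat that shape as an assumption to confirm against the raw data. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AllergenExtractor(HTMLParser):
    """Collect plain text and the text inside <span class="allergen"> tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.allergens = []
        self._in_allergen = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "allergen") in attrs:
            self._in_allergen = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_allergen = False

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._in_allergen:
            self.allergens.append(data.strip())


def strip_allergen_markup(raw):
    """Return (plain_text, allergen_tokens) for one marked-up string."""
    parser = AllergenExtractor()
    parser.feed(raw)
    parser.close()
    return "".join(parser.text_parts), parser.allergens
```

For a made-up input such as `'cacau, <span class="allergen">leite</span> em pó'`, the function returns the plain text plus the single token `leite`. The same helper would apply to the other `ingredients_text_with_allergens_*` columns.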
ingredients_text_with_allergens_ca
categorical free_text long_tail null_rate imbalanceLocalized ingredients text with allergens for Catalan, but it's effectively empty in this sample: 98% null and the only non-null value observed is itself an empty string. With cardinality of 1 and entropy 0, this column carries no information here. Treatment: Drop; no usable signal in this slice.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_pt
categorical free_text long_tail null_ratePortuguese-language generic product name, present for only 20% of the 50 rows (null_rate 0.8) and otherwise dominated by an empty string (top_rate 0.8 on value ''). Among the 10 non-null entries only 2 distinct strings appear in the top values, both descriptive food labels like 'Chocolate extrafino com 70% de cacau'. Coverage is too thin and cardinality too low (n_unique 3 including the blank) to support modelling on its own. Treatment: Drop or retain only as a fallback display label; coverage is too sparse to feature-engineer.
- n
- 50
- nulls
- 40 (80.0%)
- unique
- 3
- top_value
- top_rate
- 0.8
- cardinality
- 3
- entropy
- 0.9219
- entropy_ratio
- 0.5817
packaging_text_pt
categorical free_text null_rate imbalanceThis appears to be a Portuguese packaging-text field, likely free-form descriptions of product packaging. It is effectively empty: 80% of the 50 rows are null, and the remaining 10 rows all hold the empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here. Treatment: Drop the column; it carries no information.
- n
- 50
- nulls
- 40 (80.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_pt
categorical free_text long_tail null_rate
Portuguese-language ingredient lists for food products, stored as free text. The column is mostly empty: 80% of rows are null, and 7 of the 10 non-null rows hold an empty string, leaving only 3 genuine ingredient lists (4 distinct values including the blank). The few populated entries are long, comma-separated ingredient declarations with allergen tokens in caps or underscores. Treatment: Treat as free text: normalize empty strings to null, then tokenize and parse ingredients before modelling.
- n
- 50
- nulls
- 40 (80.0%)
- unique
- 4
- top_value
- top_rate
- 0.7
- cardinality
- 4
- entropy
- 1.357
- entropy_ratio
- 0.6784
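The underscore convention for allergen tokens also shows up in the `ingredients_text_et` top value ('... (_sojin_ lecitin) ...'), which is reused here as sample input. This is a sketch that assumes allergens are always wrapped in a single pair of underscores; tokens marked only by capital letters are not handled.

```python
import re

ALLERGEN_RE = re.compile(r"_([^_]+)_")


def extract_marked_allergens(text):
    """Return (clean_text, allergens) for underscore-delimited allergen tokens."""
    allergens = ALLERGEN_RE.findall(text)
    clean = ALLERGEN_RE.sub(r"\1", text)
    return clean, allergens


sample = "emulgator: lecitini (_sojin_ lecitin); ekstrakt vanilije."
clean, allergens = extract_marked_allergens(sample)
```

The cleaned text can then be split on commas and semicolons for ingredient tokenization, with the extracted allergens stored as a separate multi-label field.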
origin_pt
categorical metadata null_rate imbalance
Most likely the Portuguese-language (`pt`) origin text field rather than an origin-point identifier; it carries no usable signal in this sample. 80% of rows are null and the remaining 10 rows all hold the same empty-string value, giving a single unique category and entropy of 0. Treatment: Drop; the column is 80% null and the rest is a constant empty string.
- n
- 50
- nulls
- 40 (80.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
nutrition_score_warning_nutriments_estimated
numeric feature null_rate constant
This appears to be a warning flag raised when the nutrition score was computed from estimated nutriment values, encoded as 1 when triggered. Of 50 rows, 96% are null and the remaining 2 rows both carry the value 1.0, making it effectively constant where present. With no variation and almost no coverage, it carries no usable signal. Treatment: Drop; constant-when-present and 96% null.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- min
- 1
- max
- 1
- mean
- 1
- median
- 1
- std
- 0
- q1
- 1
- q3
- 1
- iqr
- 0
- skew
- 0
- kurtosis
- 0
- n_outliers
- 0
- outlier_rate
- 0
- zero_rate
- 0
packaging_text_bg
categorical free_text null_rate imbalanceLikely Bulgarian-language packaging text from a product database. The column is effectively empty: 94% null and the only non-null value across 50 rows is the empty string itself (3 occurrences), giving cardinality 1 and zero entropy. Treatment: Drop; no usable signal at this sample size.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_et
categorical metadata null_rate imbalanceThis appears to be an Estonian-language generic name field, but it carries no usable signal in this sample: 94% of rows are null and the only non-null value observed is the empty string (3 occurrences), giving cardinality 1 and entropy 0. Effectively every record is missing or blank. Treatment: Drop; column is 94% null with a single empty-string value otherwise.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_ca
categorical metadata null_rate imbalance
Likely a Catalan-language packaging text field: the suffixes in this dataset are language codes, and `ca` is the ISO 639-1 code for Catalan, not a Canadian locale. It is effectively empty: 96% null and the only 2 non-null values are both blank strings, giving a single distinct value and zero entropy. Treatment: Drop; the column carries no signal at this sample size.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_sl
categorical metadata long_tail null_rate imbalanceLocalized Slovenian product name field that is effectively empty: 98% of 50 rows are null and the single populated row reads "ARRIBA 85% cacao". With cardinality 1 and entropy 0, this column carries no usable signal in the current sample. Treatment: Drop from modelling; revisit only if a fuller localized catalogue becomes available.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- ARRIBA 85% cacao
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_bg
categorical metadata null_rate imbalance
This appears to be a Bulgarian-language generic product name field (the generic description of a food product, not a drug name), but it is effectively empty: 94% of the 50 rows are null and the only non-null value observed is the empty string, repeated 3 times. Cardinality is 1 with zero entropy, so the column carries no information. Treatment: Drop; the column is 94% null with a single empty-string value.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_sk
categorical free_text long_tail null_rate imbalanceThis appears to be a Slovak-language ingredients text field (suffix _sk), but it is effectively empty in this sample: 98% of 50 rows are null and the single non-null value is an empty string, yielding cardinality 1 and entropy 0. Treatment: Drop; no usable signal at this sample size.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_bg
categorical free_text long_tail null_rateBulgarian-language ingredient lists for food products, stored as free text in Cyrillic. The column is almost entirely empty (null_rate 0.94) with only 3 non-null values across 50 rows, each unique — effectively unusable as a categorical feature. Despite the categorical kind label, the content is long-form ingredient prose, not a discrete category. Treatment: Drop unless doing multilingual text analysis; 94% null leaves too little signal.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- Какаова маса, нискомаслено какао на прах, какаово масло, захар, емулгатор: лецитин (соеви), екстракт от ванилия, Може да съдържа следи от ядки и мляко,
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
packaging_text_et
categorical free_text null_rate imbalanceEstonian packaging text field that is effectively empty: 94% of 50 rows are null, and the only non-null value observed is the empty string itself (3 occurrences). With cardinality of 1 and entropy of 0, this column carries no information. Treatment: Drop; no usable signal.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_sk
categorical foreign_key long_tail null_rate imbalance
Despite the foreign_key tag, this is most likely the Slovak-language packaging text field (suffix `_sk`), not a surrogate key. It is essentially empty: 98% of the 50 rows are null and the single non-null value is an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here. Treatment: Drop; 98% null and the only observed value is blank.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_pt
categorical free_text long_tail null_rate
This is a Portuguese-localized product name field, but it is mostly empty: 80% of the 50 rows are null, and the top value among the 10 non-null rows is the empty string (top_rate 0.4, i.e. 4 rows). The remaining 6 rows each hold a distinct name, for 7 distinct values in total. Those names are a language mix (Portuguese, Italian, French, English) rather than purely Portuguese, suggesting fallback to original-language labels when no translation exists. The entropy ratio of 0.90 reflects that, apart from the repeated empty string, every present value is unique. Treatment: Drop or treat as optional metadata; too sparse and language-inconsistent for direct modelling.
- n
- 50
- nulls
- 40 (80.0%)
- unique
- 7
- top_value
- top_rate
- 0.4
- cardinality
- 7
- entropy
- 2.522
- entropy_ratio
- 0.8983
abbreviated_product_name_fr
categorical label long_tail null_rateLikely a French abbreviated product name field (brand + descriptor + size), used as a display label for items. The column is mostly empty with a null_rate of 0.86, leaving only 7 unique values across 50 rows, each appearing once — entropy_ratio is 1.0, so among the populated rows every value is distinct. Sparsity makes it unusable as a categorical feature in its current state. Treatment: Drop or treat as free text; too sparse and unique to encode as a category.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- CRISTALINE Eau De Source 0.5L
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
obsolete_imported
categorical feature null_rate imbalanceThis appears to be a boolean-style flag indicating whether a record was imported as obsolete, but the signal is effectively absent: 86% of rows are null and the only observed value is "0" across all 7 non-null entries. Cardinality is 1 with zero entropy, so the column carries no discriminative information in this sample. Treatment: Drop; constant value with 86% nulls offers no modelling signal.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 1
- top_value
- 0
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
sources_fields
unknown other skippedThe column `sources_fields` was skipped by the profiler, so its kind, cardinality, and value statistics are all unavailable. The only confirmed signals are 50 rows present with a null rate of 0.0, meaning every row has some value, but nothing is known about what those values look like. Without further inspection this column cannot be characterised. Treatment: Re-profile or manually inspect a sample before deciding on downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
emb_code
categorical metadata long_tail null_rate imbalance
This appears to be a packager or traceability code rather than an embargo or embarkation code: values like "EMB 44068 A" follow the "EMB" (emballeur) format printed on French packaging. The column is almost entirely empty: 98% null across 50 rows, leaving a single non-null observation. With only one value present, entropy is 0 and no distributional inference is possible. Treatment: Drop; 98% null with only one observed value provides no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- EMB 44068 A
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
lang_imported
categorical metadata null_rate imbalanceLikely a language tag for imported records, but 86% of the 50 rows are null and the remaining 7 entries are all 'fr'. With only one observed value, entropy is 0 and the column carries no discriminative signal as captured. Treatment: Drop or hold aside until more non-null values arrive; constant 'fr' with 86% nulls is unusable as a feature.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 1
- top_value
- fr
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_zh
categorical metadata long_tail null_rate imbalanceThis appears to be a Chinese generic-name field, but it is effectively empty: 98% of the 50 rows are null and the single non-null observation is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no usable signal in this sample. Treatment: Drop; the column is 98% null with a single empty-string value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
conservation_conditions_fr_imported
categorical free_text long_tail null_rateThis column holds French-language storage instructions imported from an external source (e.g., 'A conserver de préférence à l'abri du soleil...'). Coverage is extremely sparse: 86% null and only 7 distinct phrasings across 7 non-null rows, each appearing exactly once. The values are free-text variants of the same advice rather than a controlled vocabulary, so entropy_ratio sits at 1.0. Treatment: Drop or normalise via keyword extraction; too sparse and too variable to use as a categorical feature.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- A conserver de préférence à l'abri du soleil, dans un endroit propre, frais et sans odeur.
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
origin_fr_imported
categorical free_text long_tail null_rate
This appears to be a French-language origin field imported from an external source, with values ranging from a single country name ("France") to a long description of cocoa paste sourcing across continents. Only 2 of 50 rows are populated (null_rate 0.96), and both populated values are unique, giving entropy_ratio 1.0 over a cardinality of 2. The mix of a clean country label with a long descriptive string suggests inconsistent data entry rather than a true categorical. Treatment: Drop or defer; 96% null and entries mix country names with prose, so not usable as a category without manual normalisation.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- France
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
owner
categorical metadata long_tail null_rateCategorical column listing the owning organization (food/beverage manufacturers like Barilla, Ferrero, Nestlé) for each record. The column is overwhelmingly empty: 86% null, leaving only 7 populated rows spread across 6 distinct owners, with Barilla appearing twice and the rest singletons. Entropy ratio of 0.98 confirms the non-null values are nearly uniform, so there is little signal beyond identifying who submitted the entry. Treatment: Drop or retain as provenance metadata only; too sparse for modelling.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 6
- top_value
- org-barilla-france-sa
- top_rate
- 0.2857
- cardinality
- 6
- entropy
- 2.522
- entropy_ratio
- 0.9755
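The entropy figures quoted throughout can be reproduced from value counts alone. The sketch below assumes that `entropy` is Shannon entropy in bits over the non-null values and that `entropy_ratio` divides it by log2 of the cardinality; that reading matches the numbers reported for `owner`, but the profiler's actual implementation is not shown here. The five singleton owner names are placeholders.

```python
import math
from collections import Counter

def entropy_stats(values):
    """Return (entropy in bits, entropy_ratio, top_rate) over non-null values."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0, 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    k = len(counts)
    # Normalise by the maximum possible entropy for k distinct values.
    ratio = entropy / math.log2(k) if k > 1 else 0.0
    top_rate = max(counts.values()) / total
    return entropy, ratio, top_rate

# owner: 43 nulls, one owner seen twice, five seen once (names are placeholders).
owner = [None] * 43 + ["org-barilla-france-sa"] * 2 + [f"org-placeholder-{i}" for i in range(5)]
entropy, ratio, top_rate = entropy_stats(owner)
print(round(entropy, 3), round(ratio, 4), round(top_rate, 4))
```

A ratio near 1 on a handful of rows therefore says only that the few present values rarely repeat, not that the column is informative.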
ingredients_text_fr_imported
categorical free_text long_tail null_rate
French-language ingredient declarations imported from an external source. Most non-null values are long free-text ingredient lists (allergens capitalised, percentages, additive codes), although the listed top value is the short "Eau de Source". The column is 86% null and the 7 present values are all unique, yielding maximum entropy (entropy_ratio 1.0) and a top_rate of just 0.14. This is unstructured product copy, not a category, despite being typed as categorical. Treatment: Treat as free text: parse ingredient lists or tokenize/embed for NLP rather than one-hot encoding.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- Eau de Source
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
product_name_zh
categorical metadata long_tail null_rate imbalanceThis appears to be a Chinese product name field that is effectively empty in this sample: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no usable signal here. Treatment: Drop from modelling; revisit only if a larger sample shows actual Chinese strings populated.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
nutrition_data_prepared_per_imported
categorical metadata null_rate imbalanceThis column appears to be metadata indicating the basis on which nutrition data was prepared, with the only observed value being '100g'. It is essentially a constant: 86% of rows are null and the remaining 7 entries all share the single value '100g', giving zero entropy. Treatment: Drop; constant column with no information.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 1
- top_value
- 100g
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
abbreviated_product_name_fr_imported
categorical metadata long_tail null_rateThis appears to be a French-language abbreviated product name field, likely imported from an external catalog. It is overwhelmingly empty with a null_rate of 0.86, leaving only 7 distinct values across 50 rows, each appearing once (top_rate 0.143, entropy_ratio 1.0). The few populated entries mix brand-led formats like "CRISTALINE Eau De Source 0.5L" and "NESTLE DESSERT Noir 205g" with locale tags such as "Authentique 275g, fr". Treatment: Drop or defer; too sparse (86% null) and unique to model directly.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- CRISTALINE Eau De Source 0.5L
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
customer_service_fr
categorical free_text long_tail null_rateThis column holds French-language customer service contact details (postal addresses or web contact URLs) for product manufacturers. It is overwhelmingly empty with an 86% null rate, leaving only 7 non-null values across 6 nearly-unique strings (entropy ratio 0.976), with the top entry — a Wasa contact URL — appearing just twice. The values are unstructured free text mixing URLs, company names, and postal addresses. Treatment: Drop or treat as sparse metadata; not usable as a categorical feature given 86% nulls and near-unique values.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 6
- top_value
- Service Consommateurs, : www.wasa.com/fr-fr/contact (depuis la France), www.wasa.com/fr-be/contact (depuis la Belgique)
- top_rate
- 0.2857
- cardinality
- 6
- entropy
- 2.522
- entropy_ratio
- 0.9755
customer_service_fr_imported
categorical metadata long_tail null_rateThis column holds French-language customer service contact details (postal addresses or web contact URLs) for product manufacturers, imported as free-form strings. It is 86% null with only 7 populated rows yielding 6 distinct values, so it functions more as sparse metadata than an analytical feature. Entries vary in format from full postal addresses (Nestlé, Ferrero, Cristaline) to URLs, indicating no normalization upstream. Treatment: Drop for modelling; retain only if needed as a manufacturer contact lookup.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 6
- top_value
- Service Consommateurs, : www.wasa.com/fr-fr/contact (depuis la France), www.wasa.com/fr-be/contact (depuis la Belgique)
- top_rate
- 0.2857
- cardinality
- 6
- entropy
- 2.522
- entropy_ratio
- 0.9755
product_name_fr_imported
categorical free_text long_tail null_rateFrench-language product names imported from an external source, judging by the suffix and the values like 'CRISTALINE Eau De Source 0.5L' and 'Biscuits Nutella x22 biscuits fourrés - 304g'. Only 7 of 50 rows carry a value (null_rate 0.86), and every populated value is unique (entropy_ratio 1.0, top_rate 0.143), so this behaves as free-text rather than a category. The extreme nullity combined with full uniqueness makes it unusable as a grouping key. Treatment: Treat as sparse free text—drop for modelling or tokenize/embed if product identification is needed.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- CRISTALINE Eau De Source 0.5L
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
brands_imported
categorical feature long_tail null_rateThis appears to be a free-text brand field listing imported product brands, with 6 distinct values across only 7 non-null rows out of 50 (null_rate 0.86). The top value 'Wasa' appears just twice (top_rate 0.286), and entropy_ratio 0.976 indicates the few present values are nearly uniformly distributed. One entry 'NESTLE DESSERT,Tablettes' looks like a comma-joined multi-value string, suggesting inconsistent encoding. Treatment: Split multi-value strings on comma and treat as low-coverage categorical; consider dropping given 86% nulls.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 6
- top_value
- Wasa
- top_rate
- 0.2857
- cardinality
- 6
- entropy
- 2.522
- entropy_ratio
- 0.9755
owner_imported
categorical foreign_key long_tail null_rateCategorical column holding organisation slugs (e.g. 'org-barilla-france-sa', 'org-nestle-france'), almost certainly a foreign key to an owning company. It is 88% null with only 6 non-null rows spread across 5 distinct owners, so entropy_ratio is 0.97 simply because nearly every present value is unique. The column is too sparse to support any aggregation or join in its current state. Treatment: Drop or defer: 88% null leaves too few rows to join or model on.
- n
- 50
- nulls
- 44 (88.0%)
- unique
- 5
- top_value
- org-barilla-france-sa
- top_rate
- 0.3333
- cardinality
- 5
- entropy
- 2.252
- entropy_ratio
- 0.9697
lc_imported
categorical metadata null_rateA categorical flag indicating the source language of imported records, with values 'fr' and 'es'. The column is dominated by missingness — 84% null across 50 rows — and among the 8 populated rows, 'fr' accounts for 7 (87.5%), leaving 'es' as a single observation. Cardinality is just 2, so this carries little signal in its current state. Treatment: Treat nulls as a category or drop; near-constant with severe missingness limits modelling value.
- n
- 50
- nulls
- 42 (84.0%)
- unique
- 2
- top_value
- fr
- top_rate
- 0.875
- cardinality
- 2
- entropy
- 0.5436
- entropy_ratio
- 0.5436
ingredients_text_zh
categorical free_text long_tail null_rate imbalanceThis appears to be a Chinese-language ingredients text field, likely from a localized product/food dataset. It is effectively empty: 98% of the 50 rows are null and the only non-null value observed is itself an empty string, giving a cardinality of 1 and entropy of 0. Treatment: Drop; no usable signal at this sample size.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
quantity_imported
categorical feature long_tail null_rateThis appears to be a free-form quantity/packaging size field mixing volume ('500 ml') and mass units ('304 g', '275 g'), stored as strings rather than parsed numerics. Coverage is extremely poor: 86% of the 50 rows are null, and among the 7 non-null values every one is unique (entropy_ratio 1.0, top_rate 0.14). With no repeated values and mixed units, it offers little categorical signal as-is. Treatment: Parse into a numeric magnitude plus a unit column before use; given 86% nulls, consider dropping or imputing.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- 500 ml
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
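The suggested split of `quantity_imported` into a magnitude and a unit could look like the sketch below. The regular expression and the `parse_quantity` helper are illustrative assumptions; they cover the simple "number unit" shapes quoted above and nothing more elaborate.

```python
import re

# A leading number (decimal point or decimal comma) followed by a unit token.
QUANTITY_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z]+)")

def parse_quantity(text):
    """Return (magnitude, unit) for strings like '500 ml', or None if unparseable."""
    if not text:
        return None
    match = QUANTITY_RE.match(text)
    if match is None:
        return None
    magnitude = float(match.group(1).replace(",", "."))
    unit = match.group(2).lower()
    return magnitude, unit

for raw in ["500 ml", "304 g", "275 g", None]:
    print(repr(raw), "->", parse_quantity(raw))
```

Volume and mass stay in separate units here; converting between them would need a density per product, which this sample does not provide.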
nutrition_data_per_imported
categorical metadata null_rate imbalanceLikely a metadata flag indicating the basis on which nutrition values were imported, with '100g' as the sole observed value across all 8 non-null rows. The column is 84% null and has only one unique value, giving zero entropy and no discriminative power. Treatment: Drop; constant value with 84% nulls carries no signal.
- n
- 50
- nulls
- 42 (84.0%)
- unique
- 1
- top_value
- 100g
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_fr_imported
categorical free_text long_tail null_rate
French generic product names imported from an upstream source (the `_imported` suffix suggests a producer or retailer feed), holding descriptors like "Eau De Source" and "Biscuit fourré à la pâte à tartiner aux noisettes et au cacao Nutella®". The column is 86% null and every one of the 7 observed values is unique (entropy_ratio 1.0), so it behaves as free text rather than a categorical feature. Values are in French with accented characters and brand marks, which will need normalisation if joined with other locales. Treatment: Treat as French free text: normalise accents and tokenize/embed if used; otherwise drop given 86% nulls.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- Eau De Source
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
owner_fields
unknown other skippedThe column `owner_fields` was skipped by the profiler, so its kind is unknown and no descriptive statistics, uniqueness, or value samples are available. The only signals are a row count of 50 and a null rate of 0.0, meaning every row is populated but the contents are opaque from this evidence alone. Without a sample or type inference, nothing can be said about what the field encodes. Treatment: Re-profile with parsing enabled (or inspect raw values) before deciding how to use this column.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
categories_imported
categorical metadata long_tail null_rate
Hierarchical product category paths (comma-separated taxonomy strings, mostly French with some en: prefixes) imported from an external source, likely a producer or retailer feed. The column is 88% null with only 6 non-null rows across 5 distinct values, so coverage is too sparse to be useful as-is. Entropy ratio of 0.97 confirms the few present values are nearly all distinct, and the top value appears just twice. Treatment: Split on comma into hierarchical levels and use only the top level as a feature, or drop given 88% nulls.
- n
- 50
- nulls
- 44 (88.0%)
- unique
- 5
- top_value
- Snacks, Snacks salés, Amuse-gueules, Chips et frites, Chips
- top_rate
- 0.3333
- cardinality
- 5
- entropy
- 2.252
- entropy_ratio
- 0.9697
conservation_conditions_fr
categorical free_text long_tail null_rateFrench-language storage instructions for products, written as free-form sentences (e.g. "A conserver dans un endroit sec à l'abri de la lumière."). Coverage is very thin: 86% null and only 7 distinct strings across 50 rows, each appearing exactly once, so entropy_ratio is 1.0. Despite semantic overlap (cool, dry, away from light), no two entries are phrased identically, making this unusable as a category without normalisation. Treatment: Treat as free text: normalise/cluster phrases or extract keywords (sec, frais, lumière) rather than one-hot encoding.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- A conserver de préférence à l'abri du soleil, dans un endroit propre, frais et sans odeur.
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
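Keyword extraction for these storage instructions can be as simple as token matching. A minimal sketch, assuming the three keywords named above (sec, frais, lumière) are the ones of interest; the `keyword_flags` helper and the keyword tuple are illustrative only:

```python
import re

# Keywords taken from the treatment note; extend as more phrasings appear.
KEYWORDS = ("sec", "frais", "lumière")

def keyword_flags(text):
    """Map each keyword to True/False depending on whether it occurs as a whole word."""
    tokens = set(re.findall(r"\w+", (text or "").lower()))
    return {kw: kw in tokens for kw in KEYWORDS}

examples = [
    "A conserver dans un endroit sec à l'abri de la lumière.",
    "A conserver de préférence à l'abri du soleil, dans un endroit propre, frais et sans odeur.",
]
for sentence in examples:
    print(keyword_flags(sentence))
```

Whole-word matching avoids false hits such as "sec" inside a longer word, at the cost of missing inflected forms like "fraîche"; a stemmer or a synonym list (for example mapping "soleil" to light exposure) would be a natural next step.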
conservation_conditions
categorical free_text long_tail null_rateFree-text French storage instructions for a product (e.g., "A conserver à l'abri du soleil..."), captured as a categorical field. With 86% nulls and only 7 distinct values across 50 rows — each appearing exactly once — this behaves like sparse free text rather than a controlled vocabulary. Maximum entropy ratio (1.0) confirms every observed value is unique. Treatment: Treat as free text; normalize/keyword-extract (e.g., 'sec', 'frais', 'abri') or drop given 86% nulls.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- A conserver de préférence à l'abri du soleil, dans un endroit propre, frais et sans odeur.
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
countries_imported
categorical metadata null_rate
Likely the imported counterpart of the `countries` field, which in Open Food Facts records where a product is sold rather than its country of origin. With 84% nulls, only 8 of 50 rows actually carry a value. Of those, 7 are 'France' and 1 is 'España', giving a top_rate of 0.875 and just 2 distinct categories. 'España' is spelled in Spanish, which hints that country names are not normalised to a single language. Treatment: Impute or flag missingness and normalise country names to a single language before any grouping.
- n
- 50
- nulls
- 42 (84.0%)
- unique
- 2
- top_value
- France
- top_rate
- 0.875
- cardinality
- 2
- entropy
- 0.5436
- entropy_ratio
- 0.5436
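Normalising country names to one language could start from a small lookup table. The mapping below is a hypothetical stub covering only the two values observed in this sample plus one likely French variant; a real pipeline would use a full country taxonomy.

```python
# Hypothetical lookup: lower-cased local spellings mapped to one English name.
COUNTRY_ALIASES = {
    "france": "France",
    "españa": "Spain",
    "espagne": "Spain",
}

def normalise_country(name):
    """Return a canonical English country name, None for missing, or the input if unmapped."""
    if name is None or not name.strip():
        return None
    cleaned = name.strip()
    return COUNTRY_ALIASES.get(cleaned.casefold(), cleaned)

for raw in ["France", "España", None]:
    print(repr(raw), "->", normalise_country(raw))
```

Unmapped values are passed through unchanged so they stay visible for review instead of being silently dropped.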
origins_fr
categorical metadata long_tail null_rate
This appears to be a French-language origins field listing geographic provenance and source names (towns, regions, water sources) as a comma-concatenated string. The column is almost entirely empty with a 96% null rate, leaving only 2 distinct values across 50 rows: one populated entry bundling 11 locations together and one blank string. The packed multi-value format suggests this was flattened from a list field rather than a clean categorical. Treatment: Split on commas and explode into a multi-label set before use; coverage is too sparse to model directly.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- Chambon-la-Forêt,France,Cairanne,Provence-Alpes-Côte d'Azur,Vaucluse,Italie,Source Sainte Cécile,Source Ofélia,Source Éléonore,Source Emma,Source Éléna
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
abbreviated_product_name
categorical free_text long_tail null_rateShort product label field, likely a shelf-name abbreviation including brand, variant and pack size (e.g. 'CRISTALINE Eau De Source 0.5L'). It is almost entirely empty with a null_rate of 0.86, and among the 7 populated rows every value is unique (entropy_ratio 1.0, top_rate ~0.143), so it carries no repeating categories. Treatment: Drop or treat as free text; too sparse and unique to use as a categorical feature.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 7
- top_value
- CRISTALINE Eau De Source 0.5L
- top_rate
- 0.1429
- cardinality
- 7
- entropy
- 2.807
- entropy_ratio
- 1
customer_service
categorical free_text long_tail null_rateFree-text customer service contact details (postal addresses or URLs) extracted from product packaging, mostly in French. The column is 86% null with only 7 populated rows across 6 near-unique values, and entries are long unstructured strings mixing brands like Wasa, Cristaline, Ferrero, Kellogg's, Nestlé and La Boulangère. Treatment: Drop or parse out brand/URL/address fields separately; too sparse and unstructured to model as-is.
- n
- 50
- nulls
- 43 (86.0%)
- unique
- 6
- top_value
- Service Consommateurs, : www.wasa.com/fr-fr/contact (depuis la France), www.wasa.com/fr-be/contact (depuis la Belgique)
- top_rate
- 0.2857
- cardinality
- 6
- entropy
- 2.522
- entropy_ratio
- 0.9755
data_sources_imported
categorical metadata long_tail null_rateConcatenated provenance trail listing the producers, databases, and apps that contributed to each record (e.g., 'Database - Equadis, Database - GDSN, Databases, Producers, Producer - nestle-france'). 84% of rows are null and the 8 non-null values are all unique, giving entropy_ratio 1.0 — every observed string is a bespoke composite rather than a clean category. Repeated tokens within a single value (e.g., 'Producers' appearing twice) suggest the field was assembled by concatenation without deduplication. Treatment: Split on commas and one-hot or multi-hot encode the underlying source tokens rather than using the raw string.
- n
- 50
- nulls
- 42 (84.0%)
- unique
- 8
- top_value
- Producers, Producer - gie-sources-alma, Database - Equadis, Database - GDSN, Databases, Producers, Producer - gie-sources-alma
- top_rate
- 0.125
- cardinality
- 8
- entropy
- 3
- entropy_ratio
- 1
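The split-and-encode treatment for `data_sources_imported` can be sketched as follows, using the two strings quoted above. The helper names are hypothetical; the only assumption about the data is that tokens are separated by commas.

```python
def split_tokens(raw):
    """Split a comma-joined provenance string into unique tokens, keeping first-seen order."""
    seen = []
    for token in (part.strip() for part in (raw or "").split(",")):
        if token and token not in seen:
            seen.append(token)
    return seen

def multi_hot(rows):
    """Return (vocabulary, matrix) where matrix[i][j] is 1 if row i contains vocabulary[j]."""
    token_lists = [split_tokens(r) for r in rows]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    matrix = [[1 if term in tokens else 0 for term in vocabulary] for tokens in token_lists]
    return vocabulary, matrix

rows = [
    "Producers, Producer - gie-sources-alma, Database - Equadis, Database - GDSN, "
    "Databases, Producers, Producer - gie-sources-alma",
    "Database - Equadis, Database - GDSN, Databases, Producers, Producer - nestle-france",
]
vocabulary, matrix = multi_hot(rows)
print(vocabulary)
print(matrix)
```

Deduplicating inside each row removes the repeated 'Producers' token, and the two rows then differ only in which producer contributed.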
nova_group_error
categorical metadata null_rate imbalanceThis appears to be an error/diagnostic flag explaining why a NOVA food classification group could not be assigned. It is null in 96% of 50 rows, and the only observed value across the 2 non-null cases is "too_many_unknown_ingredients" (top_rate 1.0, cardinality 1, entropy 0). With a single category present, the column carries no discriminative signal in this sample. Treatment: Drop or retain only as a boolean error-present flag; near-constant and overwhelmingly null.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- too_many_unknown_ingredients
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_de_ocr_1648897071_result
categorical free_text long_tail null_rate imbalanceThis appears to be the OCR result of a German ingredients list (ingredients_text_de_ocr) tied to a specific timestamped run. Of 50 rows, 98% are null and only a single non-null value exists — a detailed Nuss-Nougat-Creme ingredient declaration — giving cardinality 1 and entropy 0. Treatment: Drop; 98% null with a single observed value provides no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Nuss-Nougat-Creme 40% (Zucker, Palmöl, _Haselnüsse_ 13%, _Magermilchpulver_ 8,7%, fettarmer Kakao 7,4%, Emulgator Lecithine (_Soja_), Vanillin), _Weizenmehl_ 32,5%, pflanzliche Fette (Palm, Palmkern), Rohrzucker 8,5% (enthält _Weizen_), _Milchzucker_, _Weizenkleie_, _Vollmilchpulver_, _Gerstenmalz_ - und Maisextraktpulver, Honig, Backtriebmittel: Dinatriumdiphosphat, Natriumhydrogencarbonat, Ammoniumhydrogencarbonat; fettarmer Kakao, Salz, _Weizenstärke_, _Gerstenmalzmehl_, Emulgator Lecithine (_Soja_), Vanillin
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_ro
categorical free_text null_rate imbalanceRomanian-language packaging text field that is essentially empty: 96% of the 50 rows are null and the remaining 2 non-null values are both blank strings, yielding a single observed category and zero entropy. Treatment: Drop; no usable signal.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_ro
categorical metadata long_tail null_rateA Romanian-language product name field that is effectively empty: 96% of the 50 rows are null, leaving only 2 non-null values, one of which is an empty string and the other an English phrase ('Sour Cream & Onion'). With cardinality of 2 and no actual Romanian content observed, this column carries no usable signal in the sample. Treatment: Drop; null_rate 0.96 and no Romanian values present.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
producer_version_id
categorical metadata long_tail null_rateIdentifier-style categorical capturing a producer/version reference, but 92% of the 50 rows are null, leaving only 4 populated values. The non-null entries are inconsistent in shape — a small integer ('1'), an ISO timestamp, and an 8-digit number — suggesting the field is overloaded or improperly typed. With cardinality 3 and top_rate 0.5 over a tiny populated subset, no reliable signal can be drawn. Treatment: Drop or quarantine until the upstream schema is clarified; not usable as a feature at 92% null with mixed value types.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 3
- top_value
- 1
- top_rate
- 0.5
- cardinality
- 3
- entropy
- 1.5
- entropy_ratio
- 0.9464
serving_size_imported
categorical free_text long_tail null_rateFree-text serving size descriptors imported from an upstream source, mixing grams with French unit hints like 'tranche' and 'carrés'. 88% of the 50 rows are null and the 6 non-null values are all unique, so entropy_ratio is 1.0 and there is no modal serving. Format is inconsistent (e.g. '30 g' vs '25.6 g (5 carrés (25,6 g))'), making direct aggregation unsafe. Treatment: Parse the leading numeric grams into a numeric column and discard the free-text remainder.
- n
- 50
- nulls
- 44 (88.0%)
- unique
- 6
- top_value
- 13.8 g (1)
- top_rate
- 0.1667
- cardinality
- 6
- entropy
- 2.585
- entropy_ratio
- 1
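Extracting the leading gram value from `serving_size_imported` might look like the sketch below. It assumes the serving always starts with a number followed by "g", as in the values quoted above; `leading_grams` is a hypothetical helper, and the '1 tranche' input is an invented example of a string with no leading gram amount.

```python
import re

# Leading number (decimal point or comma) followed by a standalone "g".
GRAMS_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*g\b", re.IGNORECASE)

def leading_grams(text):
    """Return the leading serving size in grams as a float, or None if absent."""
    if not text:
        return None
    match = GRAMS_RE.match(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))

for raw in ["13.8 g (1)", "30 g", "25.6 g (5 carrés (25,6 g))", "1 tranche"]:
    print(repr(raw), "->", leading_grams(raw))
```

Strings without a leading gram amount return None instead of a guess, so they can be counted and reviewed separately.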
no_nutrition_data_imported
categorical metadata null_rate imbalanceA boolean-style flag indicating whether nutrition data was skipped during import. With a 0.92 null rate and only 4 non-null rows all reading "false" (top_rate 1.0, cardinality 1, entropy 0.0), the column carries no information in this sample. Treatment: Drop; constant value with overwhelming nulls.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 1
- top_value
- false
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_imported
categorical metadata null_rateCategorical column capturing imported packaging type as free-form French labels such as 'Enveloppe' or 'Boîte, Barquette'. It is effectively unusable as-is: 92% of the 50 rows are null, leaving only 4 observed values across 2 distinct categories, with 'Enveloppe' covering 3 of them. Treatment: Drop or set aside until more coverage is available; 92% nulls leave nothing to model.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 2
- top_value
- Enveloppe
- top_rate
- 0.75
- cardinality
- 2
- entropy
- 0.8113
- entropy_ratio
- 0.8113
ingredients_text_ro
categorical free_text null_rate imbalanceRomanian-language ingredients text, almost entirely absent in this sample. 96% of the 50 rows are null, and the only 2 non-null values are empty strings, giving cardinality 1 and entropy 0. There is no usable signal here. Treatment: Drop; effectively empty for this sample.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
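Several cards in this section report a null rate below 100% even though every non-null value is an empty string. A minimal sketch of a check that counts blanks as missing is shown below. The function name and the 0.9 drop threshold are assumptions for illustration, not settings taken from the profiler.

```python
def missing_rates(values):
    """Return (null_rate, effective_missing_rate).

    The effective rate also counts empty or whitespace-only strings.
    """
    n = len(values)
    if n == 0:
        return 0.0, 0.0
    nulls = sum(v is None for v in values)
    blanks = sum(isinstance(v, str) and not v.strip() for v in values)
    return nulls / n, (nulls + blanks) / n


DROP_THRESHOLD = 0.9  # hypothetical cut-off, tune per project

# Shape described for this column: 48 nulls and 2 empty strings.
values = [None] * 48 + ["", ""]
null_rate, effective_rate = missing_rates(values)
print(null_rate, effective_rate, effective_rate >= DROP_THRESHOLD)  # 0.96 1.0 True
```

Measured this way the column is fully empty, which supports the drop recommendation more directly than the 96% figure alone.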
producer_version_id_imported
categorical metadata long_tail null_rateThis appears to be a sparsely populated categorical field tracking some imported producer version identifier, with 92% null_rate leaving only 4 non-null values across 50 rows. The 3 distinct values are wildly inconsistent in format — '1', a timestamp '2021-01-25T13:53:49+01:00', and a numeric '44217063' — suggesting the column conflates multiple semantics or has been mis-mapped during import. With only 4 observations, top_rate of 0.5 and entropy_ratio of 0.95 are not meaningful signals. Treatment: Drop unless the import pipeline can be fixed to emit a single consistent value type.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 3
- top_value
- 1
- top_rate
- 0.5
- cardinality
- 3
- entropy
- 1.5
- entropy_ratio
- 0.9464
labels_imported
categorical metadata long_tail null_rateImported product labels (likely certifications or dietary tags) carried over from an external source, with values like 'Végétarien' and comma-separated certification strings ('Point Vert, Rainforest Alliance, Triman'). The column is 90% null, leaving only 5 populated rows across 3 distinct values, and the top value covers 60% of those. With such sparse coverage and multi-label strings packed into single cells, this field is barely usable as-is. Treatment: Split comma-separated tags and one-hot encode, but expect to drop given 90% nulls.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 3
- top_value
- Végétarien
- top_rate
- 0.6
- cardinality
- 3
- entropy
- 1.371
- entropy_ratio
- 0.865
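The split-and-encode step suggested above might look like the sketch below, using only the standard library. `one_hot_labels` is a hypothetical helper; it treats a null cell as a row with no tags, which is an assumption about how missing labels should be handled.

```python
def one_hot_labels(cells):
    """Split comma-separated label cells and one-hot encode the tags."""
    split_rows = [
        [] if cell is None else [t.strip() for t in cell.split(",") if t.strip()]
        for cell in cells
    ]
    vocabulary = sorted({tag for row in split_rows for tag in row})
    encoded = [{tag: int(tag in row) for tag in vocabulary} for row in split_rows]
    return vocabulary, encoded


# Two values quoted on the card, plus one null row.
cells = ["Végétarien", "Point Vert, Rainforest Alliance, Triman", None]
vocabulary, encoded = one_hot_labels(cells)
print(vocabulary)
print(encoded[1])
```

At 90% nulls most encoded rows would be all zeros, so the resulting indicator columns are likely to be dropped for the same sparsity reason as the original field.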
ingredients_text_de_ocr_1648990410_result
categorical free_text long_tail null_rate imbalanceThis appears to be the OCR-extracted German ingredients text from a timestamped scan (1648990410), capturing raw product label text. It is effectively empty: 98% null with only 1 non-null value out of 50 rows, a single German ingredients string for a hazelnut-nougat cookie product. With cardinality 1 and entropy 0, it carries no discriminative signal in this sample. Treatment: Drop; 98% null and only one observed value provides no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Kekse mit Nuss - Nugat - Creme - Füllung: Nuss-Nugat-Creme 40% (Zucker, Palmöl, HASELNÜSSE Magermilchpulver, fettarmer Kakao, Emulgator Lecithine (S0JA), Vanillin, Weizenmehl, pflanzliche Fette ( Palm, Palmkern), Rohrzucker, Milchzucker, Weizenkleie, VOLLMILCHPULVER, GERSTENMALZ-und Maisextraktpulver, Honig. Backtriebmittel: Dinatriumdiphosphat, Natriumhydrogencarbonat, Ammoniumhydrogencarbonat; fettarmer Kakao, Salz, Weizenstärke, Gerstenmalzmehl, Emulgator Lecithine (Soja), Vanillin
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
allergens_imported
categorical feature long_tail null_rateCategorical column listing imported allergen declarations, with 90% nulls leaving only 5 populated rows across 4 distinct values. Entries are comma-separated multi-allergen strings in French (e.g., 'Gluten, Graines de sésame', 'Œufs, Gluten'), and one value embeds what looks like a GS1 code ('Gs1:T4078:ML'), suggesting inconsistent encoding. 'Gluten' is the only repeated value (2 of 5), and entropy_ratio of 0.96 reflects the near-uniform spread across the tiny populated subset. Treatment: Split on commas into a multi-label allergen set and impute or flag the 90% missing before use.
- n
- 50
- nulls
- 45 (90.0%)
- unique
- 4
- top_value
- Gluten
- top_rate
- 0.4
- cardinality
- 4
- entropy
- 1.922
- entropy_ratio
- 0.961
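One way to carry out the treatment above is to turn each cell into a set of allergens and an explicit missing flag, with code-like tokens set aside. In this sketch `parse_allergens` is a hypothetical helper, and routing any token containing `:` to a separate `codes` set is a heuristic assumption prompted by the 'Gs1:T4078:ML' entry. The report does not show the full cell around that token, so it is used here on its own.

```python
def parse_allergens(cell):
    """Return (allergens, codes, is_missing) for one raw cell.

    Allergen names are case-folded; tokens containing ':' are treated as
    codes rather than allergen names (heuristic, not a confirmed rule).
    """
    if cell is None or not cell.strip():
        return frozenset(), frozenset(), True
    tokens = [t.strip() for t in cell.split(",") if t.strip()]
    codes = frozenset(t for t in tokens if ":" in t)
    allergens = frozenset(t.casefold() for t in tokens if ":" not in t)
    return allergens, codes, False


for cell in ["Gluten, Graines de sésame", "Œufs, Gluten", "Gs1:T4078:ML", None]:
    print(parse_allergens(cell))
```

Keeping `is_missing` separate from an empty allergen set preserves the difference between "no declaration recorded" and "declared as allergen-free", which the 90% null rate makes important.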
ingredients_text_de_ocr_1648990410
categorical free_text long_tail null_rate imbalanceThis appears to be an OCR-extracted German ingredients text field (timestamped 1648990410), likely from a food product database such as Open Food Facts. Out of 50 rows, 98% are null and only a single non-null value exists — a long ingredient string for a hazelnut-nougat cookie product. With cardinality 1 and entropy 0, the column carries no discriminative signal in this sample. Treatment: Drop; 98% null and only one unique OCR value provides no modelling signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Kekse mit Nuss - Nugat- Creme - Füllung: Nuss-Nugat-Creme 40% (Zucker, Palmöl, HASELNÜSSE Magermilchpulver, fettarmer Kakao, Emulgator Lecithine (S0JA), Vanillin, Weizenmehl, pflanzliche Fette ( Palm, Palmkern), Rohrzucker, Milchzucker, Weizenkleie, VOLLMILCHPULVER, GERSTENMALZ-und Maisextraktpulver, Honig. Backtriebmittel: Dinatriumdiphosphat, Natriumhydrogencarbonat, Ammoniumhydrogencarbonat; fettarmer Kakao, Salz, Weizenstärke, Gerstenmalzmehl, Emulgator Lecithine (Soja), Vanillin
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_de_ocr_1648897071
categorical free_text long_tail null_rate imbalanceThis appears to be a German-language OCR extraction of an ingredients list (timestamped 1648897071), likely from Open Food Facts product packaging. The column is essentially empty: 98% null across 50 rows, with only 1 non-null value present, so cardinality is 1 and entropy is 0. The single observed entry is a long free-text ingredients string for a hazelnut-nougat-cream product, with allergens marked by underscores. Treatment: Drop; the column is 98% null with only one unique value and carries no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Nuss-Nougat-Creme 40% (Zucker, Palmöl, _Haselnüsse_ 13%, _Magermilchpulver_ 8,7%, fettarmer Kakao 7,4%, Emulgator Lecithine (_Soja_), Vanillin), _Weizenmehl_ 32,5%, pflanzliche Fette (Palm, Palmkern), Rohrzucker 8,5% (enthält _Weizen_), _Milchzucker_, _Weizenkleie_, _Vollmilchpulver_, _Gerstenmalz_- und Maisextraktpulver, Honig, Backtriebmittel: Dinatriumdiphosphat, Natriumhydrogencarbonat, Ammoniumhydrogencarbonat; fettarmer Kakao, Salz, _Weizenstärke_, _Gerstenmalzmehl_, Emulgator Lecithine (_Soja_), Vanillin
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
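The underscore convention noted above makes the allergens in such a string easy to pull out. The sketch below assumes each allergen is wrapped in a single pair of underscores with no nesting; `marked_allergens` is an illustrative name, and the excerpt is the opening of the observed value.

```python
import re

# Text between one pair of underscores, e.g. "_Soja_".
_MARKED = re.compile(r"_([^_]+)_")


def marked_allergens(text):
    """Return underscore-marked allergens in first-seen order, deduplicated."""
    seen = []
    for raw in _MARKED.findall(text):
        name = raw.strip()
        if name and name not in seen:
            seen.append(name)
    return seen


excerpt = (
    "Nuss-Nougat-Creme 40% (Zucker, Palmöl, _Haselnüsse_ 13%, "
    "_Magermilchpulver_ 8,7%, fettarmer Kakao 7,4%, "
    "Emulgator Lecithine (_Soja_), Vanillin)"
)
print(marked_allergens(excerpt))  # ['Haselnüsse', 'Magermilchpulver', 'Soja']
```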
generic_name_ro
categorical metadata null_rate imbalanceThis appears to be a Romanian-language generic product name field, but it is effectively empty: 96% of the 50 rows are null, and the only 2 non-null entries are blank strings, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; the column carries no information.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_ro
categorical other null_rate imbalanceRomanian-language origin text field ('origin_ro') that is effectively empty: 96% of the 50 rows are null, and the only 2 non-null values are blank strings, giving a single observed category with zero entropy. There is no usable signal here. Treatment: Drop; column carries no information.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
abbreviated_product_name_imported
categorical metadata long_tail null_rateThis appears to be a shortened/imported product name field, but it's almost entirely empty: 94% null across 50 rows, leaving only 3 non-null values, each unique (e.g., 'Authentique 275g, fr', 'Fibres 230g, fr', 'DESSERT Noir 205g'). With cardinality equal to the populated count and maximal entropy_ratio of 1.0, there is no repetition to learn from. The mixed formatting and language hints (French abbreviations, weight suffixes) further suggest inconsistent upstream import. Treatment: Drop or defer — too sparse and near-unique to be useful as a feature.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- Authentique 275g, fr
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
traces_imported
categorical free_text long_tail null_rateThis column appears to record allergen trace declarations (in French) on food products, listing items like Lupin, Lait, Moutarde, Soja as comma-separated lists. It is almost entirely empty with a 92% null rate, leaving only 4 non-null values across 50 rows, each unique. With cardinality equal to the populated count, every observed value is its own category, making aggregation unreliable. Treatment: Split on commas into a multi-label allergen indicator set, but expect sparse signal given the 92% null rate.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 4
- top_value
- Lupin, Lait, Moutarde, Graines de sésame, Soja
- top_rate
- 0.25
- cardinality
- 4
- entropy
- 2
- entropy_ratio
- 1
specific_ingredients
unknown free_text skippedThe column 'specific_ingredients' was skipped by the profiler, so no type, uniqueness, or distribution stats are available beyond a row count of 50 and a null rate of 0.0. The name suggests it holds ingredient lists, likely free text or arrays, which is consistent with the profiler declining to summarise it. Without sample values or cardinality we cannot confirm structure or detect duplicates, language mix, or skew. Treatment: Inspect raw values and parse or tokenize before any modelling.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
product_name_ru
categorical metadata null_rateRussian-language product name field, almost entirely absent: 94% of the 50 rows are null and only 2 distinct non-null values appear, one of which is an empty string. The single real value observed is 'Эксeленс 99% какао', suggesting this column is a sparsely populated localization of a product name. Treatment: Drop or defer; too sparse (94% null) to use as a feature.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 2
- top_value
- top_rate
- 0.6667
- cardinality
- 2
- entropy
- 0.9183
- entropy_ratio
- 0.9183
origin_ru
categorical metadata null_rate imbalanceRussian-language origin text field (the `_ru` suffix is a language code, not a country of origin), but it is effectively empty: 94% of the 50 rows are null and the only non-null value observed is the empty string, repeated 3 times. Cardinality is 1 and entropy is 0, so this column carries no information as-is. Treatment: Drop; the column has a single empty value and 94% nulls.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_ru
categorical free_text null_rate imbalanceRussian-language ingredient text with allergen markup, almost entirely absent from this sample. 94% of rows are null and the remaining 6% are all empty strings, leaving a single unique value and zero entropy. Treatment: Drop; no usable signal in this sample.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_ru
categorical free_text null_rate imbalanceRussian-language packaging text field that is effectively empty: 94% of 50 rows are null and the remaining 3 non-null entries are all the empty string, giving a single observed value and zero entropy. There is no usable signal here. Treatment: Drop; column is 94% null with the only non-null value being an empty string.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_ru
categorical metadata null_rateRussian-language generic product name field, populated for only 3 of 50 rows (null_rate 0.94). Among the 3 non-null entries, 2 are empty strings and 1 is 'Плитка горького шоколада (99% какао)', so effectively only one real value is present. The column is unusable as a feature at this sample size. Treatment: Drop or hold aside until coverage improves; do not use for modelling.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 2
- top_value
- top_rate
- 0.6667
- cardinality
- 2
- entropy
- 0.9183
- entropy_ratio
- 0.9183
ingredients_text_ru
categorical free_text null_rate imbalanceRussian-language ingredient text from what appears to be a multilingual product catalog. The column is effectively empty: 94% null and the only 3 non-null entries are blank strings, giving cardinality 1 and zero entropy. Treatment: Drop; no usable signal at this sample size.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_da
categorical free_text long_tail null_rateDanish-language ingredients text for food products, evidently sourced from Open Food Facts-style multilingual labeling. The column is almost entirely empty (null_rate 0.96), with only 2 non-null values out of 50, one of which is a blank string. The single substantive entry is actually a mixed Swedish/Danish/Norwegian ingredient list with allergen tokens marked by underscores, suggesting the locale tagging is unreliable. Treatment: Drop unless modelling Danish text specifically; coverage is too sparse to be useful.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- _VETEMJÖL_/_HVEDEMEL_, palmolja/-olie, glukossirap, maltextrakt från _KORN_/_BYG_, bakpulver/hævemidler (ammoniumkarbonater, natriumkarbonater), salt, _ÄGG_/_ÆG_/_EGG_, arom, mjölbehandlingsmedel/melbehandlingsmiddel (_NATRIUMDISULFIT_).
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
ingredients_text_with_allergens_da
categorical free_text long_tail null_rateDanish-language ingredients text with HTML-tagged allergen spans, evidently sourced from a multilingual food product database. The column is 96% null with only 2 non-null values out of 50, one of which is an empty string, leaving effectively a single real entry that mixes Swedish and Danish/Norwegian terms. Treatment: Drop unless Danish-specific allergen extraction is required; coverage is too sparse to model.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- VETEMJÖL/HVEDEMEL, palmolja/-olie, glukossirap, maltextrakt från KORN/BYG, bakpulver/hævemidler (ammoniumkarbonater, natriumkarbonater), salt, ÄGG/ÆG/EGG, arom, mjölbehandlingsmedel/melbehandlingsmiddel (NATRIUMDISULFIT).
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
product_name_da
categorical metadata long_tail null_rateDanish product name field with only 2 non-null values out of 50 rows (null_rate 0.96), each appearing once. The two observed labels ("Original", "Alpine Milk") look like product variant descriptors rather than full product names. With 96% missingness the column carries almost no signal as-is. Treatment: Drop or defer until backfilled; 96% nulls make it unusable for modelling.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- Original
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
packaging_text_da
categorical free_text null_rate imbalanceDanish-language packaging text field that is effectively empty: 96% of the 50 rows are null, and the only 2 non-null values are empty strings, giving cardinality 1 and zero entropy. There is no usable signal here. Treatment: Drop; column is empty in this sample.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_da
categorical metadata long_tail null_rateDanish-language generic product name field, almost entirely empty: 96% null across 50 rows, leaving only 2 non-null observations. The two surviving values are 'Kiks' and an empty string, so there is essentially no usable signal here. Treatment: Drop; null rate too high to be useful.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- Kiks
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
forest_footprint_data
unknown other skippedThe column 'forest_footprint_data' was skipped by the profiler, so its kind is unknown and no descriptive statistics are available. The only confirmed signals are 50 rows with a 0.0 null rate; uniqueness, type, and value distribution are all missing. Treatment: Re-profile or manually inspect this column before any downstream use.
- n
- 50
- nulls
- 0 (0.0%)
- unique
- —
origin_da
categorical metadata null_rate imbalanceDanish-language origin text field ('origin_da', where `_da` is the language suffix rather than a date or destination), but it is effectively empty: 96% of the 50 rows are null and the only observed value is the empty string, which accounts for the remaining 2 entries. With cardinality of 1 and entropy of 0, it carries no information. Treatment: Drop; the column has no usable signal.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_sr
categorical metadata null_rate imbalanceSerbian-language origin text field ('origin_sr'), but it is effectively empty: 96% of the 50 rows are null and the only 2 non-null values are blank strings, giving a single observed category with entropy 0. There is no usable signal here. Treatment: Drop; column is 96% null with only blank values.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_nl_ocr_1675675383_result
categorical free_text long_tail null_rate imbalanceThis column appears to hold OCR-extracted Dutch ingredient text from product packaging, likely a per-image result field. It is 98% null with only one non-null value present ('Cacaomassa, suiker, cacaoboter, natuurlijk Bourbon vanille - stokje.'), giving cardinality 1 and zero entropy. With effectively no variance or coverage, it carries no analytical signal in this sample. Treatment: Drop; near-empty with a single observed value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Cacaomassa, suiker, cacaoboter, natuurlijk Bourbon vanille - stokje.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_cs
categorical free_text null_rateCzech-language ingredients text, almost entirely absent: 94% of the 50 rows are null and only 2 distinct non-null values exist, one of which is an empty string appearing twice. The single substantive entry is a Czech ingredients list ("Kakaová hmota, cukr, kakaové máslo, vanilka."), confirming this is a localized free-text field rather than a categorical feature. Treatment: Drop for modelling; retain only if Czech-localized text is specifically needed.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 2
- top_value
- top_rate
- 0.6667
- cardinality
- 2
- entropy
- 0.9183
- entropy_ratio
- 0.9183
product_name_cs
categorical metadata null_rateCzech-localised product name field (`product_name_cs`) that is almost entirely unpopulated: 94% of the 50 rows are null and only 2 distinct values appear, one of which is an empty string. The single real label observed is an English-language entry ("Excellence 70% Cocoa Intense Dark"), suggesting the Czech translation pipeline has not been applied. Treatment: Drop unless Czech localisation is required; the column is 94% null and effectively empty.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 2
- top_value
- top_rate
- 0.6667
- cardinality
- 2
- entropy
- 0.9183
- entropy_ratio
- 0.9183
origin_hu
categorical metadata null_rate imbalanceHungarian-language origin text field ('origin_hu') that is 92% null and, among the 4 non-null rows, contains only the empty string. Effective cardinality is 1 with zero entropy, so the column carries no information in this sample. Treatment: Drop; the column is constant and almost entirely null.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_hu
categorical metadata null_rate imbalanceHungarian packaging text field that is effectively empty: 92% null and the only non-null value across 50 rows is the empty string, occurring 4 times. Cardinality is 1 with entropy 0, so the column carries no information. Treatment: Drop; no signal (single empty value, 92% null).
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_cs
categorical metadata null_rate imbalanceCzech-language origin text field ('origin_cs', where `_cs` is the language suffix), but it is effectively empty: 96% of the 50 rows are null and the only non-null entries are blank strings. With cardinality of 1 and entropy of 0, the column carries no information. Treatment: Drop; the column is 96% null with a single blank value otherwise.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_hu
categorical free_text long_tail null_rateHungarian-language ingredient lists with inline HTML markup highlighting allergens, drawn from what looks like a food-product catalogue (Open Food Facts style). The column is almost entirely empty (null_rate 0.94) — only 3 of 50 rows are populated, and each of those values is unique. Notable surprise: at least one entry is multilingual, bundling Hungarian, Romanian and Bulgarian label text into a single cell. Treatment: Strip HTML tags and treat as optional free text; too sparse (94% null) to use as a feature without heavy imputation.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- Kakaómassza, cukor, kakaó - vaj, vanília.
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
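Stripping the inline markup, as the treatment suggests, can be done with Python's standard `html.parser`. The report does not show the raw markup, so the `<span class="allergen">` wrapper and the sample string below are assumptions for illustration, and `strip_tags` is a hypothetical helper.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an HTML fragment, tags removed."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


# Made-up sample: the tagged allergen and the wrapper are assumptions.
sample = 'Kakaómassza, cukor, <span class="allergen">tej</span>, vanília.'
print(strip_tags(sample))  # Kakaómassza, cukor, tej, vanília.
```

If the allergen spans are needed later, extracting them before stripping avoids re-deriving them from plain text.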
generic_name_cs
categorical metadata null_rate imbalanceThis appears to be a Czech-localized generic product name field, but it carries virtually no information in this sample. 94% of rows are null, and the single distinct non-null value is itself an empty string (3 occurrences), giving entropy of 0.0 and a top_rate of 1.0. Treatment: Drop; column is effectively empty (94% null and only blank values otherwise).
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_hu
categorical free_text long_tail null_rateHungarian-language ingredient declarations for food products, mirroring Open Food Facts' per-language ingredients_text fields. The column is 92% null with only 4 distinct values across 50 rows, and one of those is an empty string while another is a multi-language label dump (HU/RO/BG) rather than pure Hungarian. Top value frequency is just 0.25, so there is no real mode to lean on. Treatment: Treat as sparse free text; drop or language-filter before any NLP, and don't use as a feature without imputing the 92% missing.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 4
- top_value
- Kakaómassza, cukor, kakaó - vaj, vanília.
- top_rate
- 0.25
- cardinality
- 4
- entropy
- 2
- entropy_ratio
- 1
ingredients_text_sr
categorical free_text long_tail null_rateSerbian-language ingredient list field (likely a localized variant of an ingredients_text column). Out of 50 rows, only 2 are populated (null_rate 0.96), and one of those is an empty string, leaving exactly one substantive value. There is essentially no signal here at this sample size. Treatment: Drop unless analysis is restricted to Serbian-localized rows; otherwise too sparse to use.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- Šećer, kakao masa, kakao buter, vanile.
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
packaging_text_sr
categorical free_text null_rate imbalanceThis appears to be a Serbian-language packaging text field, but it is effectively empty in the sample: 96% of 50 rows are null and the only non-null entries are blank strings, giving cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop the column; it carries no information in this sample.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_nl_ocr_1675675383
categorical free_text long_tail null_rate imbalanceThis column appears to be Dutch-language OCR-extracted ingredient text from product packaging, likely a sparsely populated language variant of an ingredients field. Out of 50 rows, 98% are null and the single non-null value is a chocolate ingredients list ('Cacaomassa, suiker, cacaoboter, natuurlijk Bourbon vanille- stokje.'). With cardinality of 1 and entropy of 0, this column carries effectively no signal in this sample. Treatment: Drop; 98% null with a single observed value provides no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Cacaomassa, suiker, cacaoboter, natuurlijk Bourbon vanille- stokje.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_cs
categorical free_text long_tail null_rate imbalanceThis appears to be a Czech-language ingredient list with allergen annotations for food products. With a 98% null rate and only 1 non-null value across 50 rows ('Kakaová hmota, cukr, kakaové máslo, vanilka.'), the column carries virtually no information in this sample. Entropy is 0.0 and cardinality is 1, so it cannot discriminate between records as-is. Treatment: Drop for modelling; revisit only if a larger sample provides meaningful coverage.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Kakaová hmota, cukr, kakaové máslo, vanilka.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_sr
categorical metadata long_tail null_rateThis appears to be a Serbian-language generic product name field (generic_name_sr), likely a localized label in a multilingual product catalog. It is almost entirely empty: 96% null across 50 rows, with only 2 distinct values observed, one of which is itself a blank string. The single non-empty entry is 'Tamna čokolada sa 70% kakaa', so this column carries virtually no usable signal in this sample. Treatment: Drop or ignore for modelling; retain only if a Serbian-locale view is required.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- Tamna čokolada sa 70% kakaa
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
packaging_text_cs
categorical free_text null_rate imbalanceCzech-language packaging text field that is effectively empty: 94% null and the only non-null value observed (3 rows) is itself an empty string, giving cardinality 1 and entropy 0. There is no usable signal here in this sample. Treatment: Drop; column is constant-empty with 94% nulls.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 1
- top_value
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_sr
categorical metadata long_tail null_rateThis looks like a Serbian-localized product name field, but it is effectively empty: 96% of the 50 rows are null and only 2 distinct non-null values appear, one in English ('Excellence 70% Cocoa Intense Dark') and one in Cyrillic ('Течен Шоколад Нутела'). Neither value is clearly Serbian: the first is English, and the Cyrillic spelling 'Шоколад' differs from the Serbian 'чоколада', so it may come from another language. The language tagging of this column is therefore unreliable. Treatment: Drop or defer until coverage improves; with 96% nulls and 2 unique values it carries no modelling signal.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- Excellence 70% Cocoa Intense Dark
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
ingredients_text_hu_ocr_1571428260_result
categorical free_text long_tail null_rate imbalanceThis appears to be the OCR-extracted Hungarian ingredients text for a single product (likely chocolate: cocoa mass, sugar, cocoa butter, vanilla), captured from one timestamped scan. With null_rate 0.98 and only 1 non-null value across n=50, this column is effectively empty and carries no comparative signal. Because only one row is populated, cardinality is 1 and entropy is 0. Treatment: Drop; 98% null with a single OCR string offers no modelling value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- kakaómassza, cukor, kakaó - vaj, természetes bourbon vanília. Nyomokban egyéb dióféléket, tejet, szóját, szezámmagot es búzát tartalmazhat.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_hu_ocr_1571428260
categorical free_text long_tail null_rate imbalanceThis appears to be a Hungarian-language OCR extraction of an ingredients list (likely from a chocolate product, mentioning cocoa mass, sugar, cocoa butter, and bourbon vanilla). Of 50 rows, 98% are null and only 1 non-null value exists, so the column is effectively empty. The single populated entry is free-text in Hungarian, not a category, despite being typed as categorical. Treatment: Drop; 98% null with a single Hungarian free-text value carries no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- kakaómassza, cukor, kakaó- vaj, természetes bourbon vanília. Nyomokban egyéb dióféléket, tejet, szóját, szezámmagot es búzát tartalmazhat.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_hu
categorical metadata null_rateHungarian generic-name field that is almost entirely missing: 92% of the 50 rows are null, and of the 4 non-null entries, 3 are empty strings and only 1 carries a value ("Finom"). With cardinality of 2 and a top_rate of 0.75 on the empty string, this column carries virtually no usable signal in the sample. Treatment: Drop unless a fuller source can be joined in; current null rate makes it unusable.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 2
- top_value
- top_rate
- 0.75
- cardinality
- 2
- entropy
- 0.8113
- entropy_ratio
- 0.8113
product_name_hu
categorical metadata long_tail null_rateHungarian-language product name field that is effectively empty: 92% of the 50 rows are null, and the only 4 non-null entries collapse to 3 distinct strings (one of which is the empty string, appearing twice). The two actual values present ("Excellence 70% Cocoa Intense Dark", "Dark Chocolate 70% Cacao") are in English, not Hungarian, suggesting the localisation pipeline never populated this column. Treatment: Drop; null_rate 0.92 and no genuine Hungarian content make it unusable.
- n
- 50
- nulls
- 46 (92.0%)
- unique
- 3
- top_value
- top_rate
- 0.5
- cardinality
- 3
- entropy
- 1.5
- entropy_ratio
- 0.9464
ingredients_text_with_allergens_sr
categorical free_text long_tail null_rateSerbian-language ingredients list with allergen annotations, populated for only 2 of 50 rows (null_rate 0.96). Of the two non-null entries, one is an empty string and the other is a chocolate ingredient list, leaving effectively a single usable value. Coverage is too sparse to support any aggregate analysis. Treatment: Drop or defer; coverage is 4% and insufficient for modelling.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- Šećer, kakao masa, kakao buter, vanile.
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
ingredients_text_es_ocr_1548767061_result
categorical free_text long_tail null_rate imbalanceThis appears to be the result of an OCR pass extracting Spanish ingredient text, with the timestamped name suggesting it is one run among many. Of 50 rows, 98% are null and the single non-null value is a Spanish chocolate ingredients list (cocoa paste, sugar, cocoa butter, sunflower lecithin E-322, vanilla extract, 70% cocoa minimum). With cardinality of 1 and entropy 0, this column carries essentially no information across the dataset. Treatment: Drop; 98% null with only one distinct OCR result provides no modelling signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Pasta de cacao, azúcar, manteca de cacao, emulgente: lecitina de girasol (E-322), extracto de vainilla. Cacao: 70% mínimo.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
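Because the timestamped OCR columns come in pairs (a raw capture and a `_result` variant), it can help to group them by parsing the column name. The sketch below is a minimal illustration: the regular expression, the helper name `parse_ocr_column`, and the reading of the numeric suffix as a Unix epoch timestamp in seconds are all assumptions, not something the export documents.

```python
import re
from datetime import datetime, timezone

# Assumed naming pattern: <field>_<two-letter lang>_ocr_<digits>[_result]
OCR_COLUMN = re.compile(
    r"^(?P<field>.+)_(?P<lang>[a-z]{2})_ocr_(?P<ts>\d+)(?P<result>_result)?$"
)


def parse_ocr_column(name):
    """Split an OCR snapshot column name into its parts, or return None."""
    match = OCR_COLUMN.match(name)
    if match is None:
        return None
    return {
        "field": match.group("field"),
        "lang": match.group("lang"),
        # Assumption: the numeric suffix is seconds since the Unix epoch.
        "run_at": datetime.fromtimestamp(int(match.group("ts")), tz=timezone.utc),
        "is_result": match.group("result") is not None,
    }


info = parse_ocr_column("ingredients_text_es_ocr_1548767061_result")
print(info["field"], info["lang"], info["run_at"].isoformat(), info["is_result"])
```

Grouping on `(field, lang, ts)` would then pair each raw capture with its `_result` column so both can be dropped or kept together.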
product_name_xx
categorical metadata null_rate imbalanceA categorical field, likely a localized product name (suffix _xx suggests a translation/locale variant). It is effectively empty: 96% null and the only 2 non-null rows contain the empty string, giving cardinality 1 and zero entropy. Treatment: Drop; the column carries no signal.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
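The "Drop" treatment repeated across these columns follows a simple rule that can be sketched in plain Python. The function name `should_drop` and the 0.9 null-rate threshold are illustrative assumptions, not values taken from the report. Empty or whitespace-only strings are treated as missing, since several columns here contain nothing else.

```python
def should_drop(values, max_null_rate=0.9):
    """Flag a column that is mostly null or has at most one real value."""
    n = len(values)
    if n == 0:
        return True
    non_null = [v for v in values if v is not None]
    null_rate = 1 - len(non_null) / n
    # Empty or whitespace-only strings carry no content, so ignore them.
    meaningful = {v for v in non_null if str(v).strip()}
    return null_rate >= max_null_rate or len(meaningful) <= 1


# product_name_xx: 48 nulls and two empty strings.
product_name_xx = [None] * 48 + ["", ""]
print(should_drop(product_name_xx))
```

The second condition also flags fully populated constant columns, which matches the zero-entropy reasoning used in the descriptions.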
generic_name_xx
categorical metadata null_rate imbalanceThis appears to be a localized generic name field (suffix _xx suggests a translation/locale variant), but it is effectively empty: 96% of the 50 rows are null, and the only 2 non-null values are blank strings. Cardinality is 1 with zero entropy, so the column carries no information. Treatment: Drop; null_rate 0.96 and single empty value provide no signal.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_es_ocr_1548767061
categorical free_text long_tail null_rate imbalanceThis appears to be a Spanish-language OCR capture of an ingredients list (likely from a chocolate product label, given 'Pasta de cacao' and '70% mínimo'). It is effectively empty: 98% null across 50 rows, with the single non-null value being one verbatim ingredients string. There is no analytical signal here — entropy is 0 and cardinality is 1. Treatment: Drop; 98% null with only one observed value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Pasta de cacao, azúcar, manteca de cacao, emulgente: lecitina de girasol (E-322), extracto de vainilla. Cacao: 70% mínimo.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_xx
categorical free_text null_rate imbalanceThis appears to be a localized ingredients text field (suffix _xx suggests a placeholder or unknown locale variant). It is effectively empty: 96% of the 50 rows are null, and the only 2 non-null entries are both empty strings, giving cardinality 1 and entropy 0. Treatment: Drop; the column carries no information.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_xx
categorical other long_tail null_rate imbalanceThe column 'origin_xx' is effectively empty: 98% of its 50 rows are null, and the single non-null value is itself an empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop the column; it carries no information.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_xx
categorical metadata long_tail null_rate imbalance
This appears to be a localized packaging text field (`_xx` language suffix) that is essentially empty in this sample. 98% of rows are null and the single non-null value is itself an empty string, so the reported cardinality of 1 reflects no real content and entropy is 0. Treatment: Drop; the column carries no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_ur
categorical free_text long_tail null_rate imbalanceLikely an Urdu-language ingredient text field (ingredients_text_ur), almost entirely absent from this sample. 98% of rows are null, and the single non-null value is an empty string, leaving zero usable content. Cardinality is 1 with entropy 0.0, so the column carries no information here. Treatment: Drop from modelling; retain only if a fuller multilingual extract becomes available.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_ur
categorical metadata long_tail null_rate imbalanceThis appears to be an Urdu-language product name field that is essentially empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality collapses to 1 and entropy is 0, so the column carries no information as captured. Treatment: Drop; column is effectively empty with zero entropy.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_he
categorical metadata long_tail null_rate imbalanceColumn 'origin_he' is effectively empty: 98% of the 50 rows are null and the only observed value is the empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; column carries no information.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_he
categorical metadata long_tail null_rateHebrew product name field that is almost entirely empty — 96% null across 50 rows, leaving only 2 non-null values, both unique (נוטלה and תפוציפס שמנת בצל). With just two observations the entropy ratio of 1.0 is meaningless, and the column cannot support any analysis in its current state. Treatment: Drop or defer until backfilled; 96% null makes it unusable downstream.
- n
- 50
- nulls
- 48 (96.0%)
- unique
- 2
- top_value
- נוטלה
- top_rate
- 0.5
- cardinality
- 2
- entropy
- 1
- entropy_ratio
- 1
origin_ur
categorical metadata long_tail null_rate imbalance
The column 'origin_ur' is most likely the Urdu-language origin field (the `_ur` suffix matches `product_name_ur`, `ingredients_text_ur`, and the other Urdu-localized columns) rather than a truncated 'origin_url'. It is effectively empty: 98% of the 50 rows are null and the single non-null value is itself an empty string. Cardinality is 1 and entropy is 0, so the column carries no information as captured. Treatment: Drop; column has 98% nulls and a single empty-string value, providing no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_ur
categorical metadata long_tail null_rate imbalance
This appears to be an Urdu-language generic name field, likely a localized version of the food product's generic (common) name. It is effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string, giving cardinality 1 and entropy 0. There is no usable signal here. Treatment: Drop the column; it is 98% null with the only present value being blank.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_he
categorical free_text long_tail null_rate imbalanceHebrew packaging text field that is essentially empty: 98% of the 50 rows are null and the single non-null observation is itself an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here for any downstream task. Treatment: Drop the column; it carries no information.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_he
categorical free_text long_tail null_rate imbalanceHebrew-language ingredient text field, almost entirely absent from this sample. 98% of the 50 rows are null, and the single non-null value is an empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; no usable signal at this sample size.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_ur
categorical free_text long_tail null_rate imbalanceLikely an Urdu-language packaging text field, but it carries essentially no information here: 98% of rows are null and the only non-null value observed is an empty string. Cardinality is 1 with entropy 0.0, so the column is constant across the populated rows. Treatment: Drop; effectively empty with 98% nulls and a single empty-string value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_he
categorical metadata long_tail null_rate imbalance
Hebrew generic (common) product name field, but it is effectively empty: 98% null across 50 rows, leaving a single non-null value (a Hebrew product description) that occupies 100% of the observed entries. Cardinality is 1 and entropy is 0, so the column carries no discriminating signal in this sample. Treatment: Drop from modelling; revisit only if a fuller extract populates the field.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- ממרח אגוזי לוז עם קקאו
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_he
categorical free_text long_tail null_rate imbalanceHebrew-localized ingredients-with-allergens text, almost entirely absent: 98% null across 50 rows and the single non-null value is an empty string. Cardinality is 1 with zero entropy, so this column carries no information in this sample. Treatment: Drop; no usable signal in this sample.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
nutriscore_grade_producer
categorical feature long_tail null_rate
Producer-supplied Nutri-Score letter grade (a–e scale), captured here as three distinct values: 'c', 'e', and 'b', each appearing once. The column is essentially empty, with a 94% null rate leaving only 3 of 50 rows populated, so the maximal entropy (1.585 bits, entropy_ratio 1.0) is an artefact of the tiny sample size rather than evidence of a balanced distribution. No 'a' or 'd' grades appear in this sample. Treatment: Drop or defer; too sparse (94% null) to use until producer coverage improves.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- c
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
nutriscore_grade_producer_imported
categorical feature long_tail null_rateThis appears to be a Nutri-Score grade (a-e scale) imported from producer data, stored as a categorical letter grade. The column is almost entirely empty with a 94% null rate, leaving only 3 non-null values across 3 distinct grades (c, e, b) — too sparse to draw any distributional conclusions. Treatment: Drop or treat as missing-indicator only; too sparse (94% null) to use as a feature.
- n
- 50
- nulls
- 47 (94.0%)
- unique
- 3
- top_value
- c
- top_rate
- 0.3333
- cardinality
- 3
- entropy
- 1.585
- entropy_ratio
- 1
packaging_text_el
categorical free_text long_tail null_rate imbalanceThis appears to be Greek-language packaging text, but it's effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, meaning there is no usable signal here whatsoever. Treatment: Drop the column; it has no observed content.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_el
categorical free_text long_tail null_rate imbalanceGreek-language ingredients-with-allergens text field that is effectively empty in this sample: 98% null and the only non-null value observed is itself an empty string. With cardinality of 1 and entropy 0, this column carries no signal here. Treatment: Drop; no usable content in this sample.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_el
categorical free_text long_tail null_rate imbalanceThis is the Greek-language ingredients text field, presumably from a multilingual food product dataset. It is effectively empty: 98% null across 50 rows, and the single non-null value is itself an empty string, leaving cardinality at 1 and entropy at 0. Treatment: Drop; no usable signal at this sample size.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_el
categorical metadata long_tail null_rate imbalanceGreek-language generic name field that is effectively empty: 98% nulls and the only observed non-null value is itself an empty string, giving a single unique entry across 50 rows. Entropy is 0.0 and top_rate is 1.0, so the column carries no information in this sample. Treatment: Drop; the column has no usable signal here.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_el
categorical other long_tail null_rate imbalanceThe column 'origin_el' is nearly entirely empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string, leaving cardinality at 1 and entropy at 0. There is no usable signal here whatsoever. Treatment: Drop the column; it is effectively all null with no variance.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_el
categorical metadata long_tail null_rate imbalanceThis appears to be a Greek-language product name field that is effectively empty in this sample. 98% of rows are null and the single non-null value is itself an empty string, giving cardinality 1 and zero entropy. There is no usable signal here at n=50. Treatment: Drop from modelling; revisit only if a larger sample shows actual Greek text.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_th
categorical metadata long_tail null_rate imbalance
Thai-language generic (common) product name field that is effectively empty in this sample: 98% of 50 rows are null and the only non-null value observed is the empty string, giving a single distinct value and zero entropy. Treatment: Drop from modelling; the column carries no information in this slice.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_de_ocr_1559410715_result
categorical free_text long_tail null_rate imbalanceThis column appears to hold the OCR result of a German ingredients text extraction (likely from a product label image), with a timestamp embedded in the column name. Of 50 rows, 98% are null and only one non-null value exists — a single German cocoa-product ingredient list mentioning possible traces of nuts, milk, and soy. With cardinality 1 and entropy 0, the column carries effectively no signal at this sample size. Treatment: Drop; 98% null and only one distinct OCR string provides no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Kakaomasse, fettarmes Kakaopulver, Kakaobutter, Rohrzucker. Kann Schalenfrüchte, Milch und Soja enthalten.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_th
categorical free_text long_tail null_rate imbalanceThis appears to be a Thai-localized ingredients-with-allergens text field, but it is effectively empty: 98% of 50 rows are null and the single populated row contains an English-language ingredients string for a 99% cocoa product. The column carries zero entropy (entropy_ratio 0.0) and only one distinct value, so it provides no analytical signal. The language mismatch (English content in a _th column) is also notable. Treatment: Drop; 98% null and a single non-Thai value make it unusable.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Cocoa solids 99%, Cocoa paste, fat-reduced cocoa, cocoa butter, demerara sugar. May contain nuts, milk and soya.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
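The language mismatch noted above (English text in a `_th` column) can be checked mechanically for languages written in a distinctive script. The sketch below measures what share of a value's letters fall in the Unicode block expected for the column's language suffix. The helper name `script_share` and the small `SCRIPT_RANGES` table are illustrative assumptions, and the approach cannot separate languages that share the Latin script (for example English text in a Hungarian column).

```python
# Unicode block ranges for a few non-Latin scripts seen in this sample.
SCRIPT_RANGES = {
    "th": (0x0E00, 0x0E7F),  # Thai
    "he": (0x0590, 0x05FF),  # Hebrew
    "el": (0x0370, 0x03FF),  # Greek and Coptic
}


def script_share(text, lang):
    """Share of alphabetic characters inside the script expected for `lang`."""
    low, high = SCRIPT_RANGES[lang]
    # Digits, punctuation, and combining marks are ignored by isalpha().
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(1 for ch in letters if low <= ord(ch) <= high)
    return in_script / len(letters)


value = "Cocoa solids 99%, Cocoa paste, fat-reduced cocoa, cocoa butter, demerara sugar."
print(script_share(value, "th"))
```

A share near zero for a populated `_th`, `_he`, or `_el` value would flag it as probably mislocalized and worth a manual look.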
packaging_text_th
categorical free_text long_tail null_rate imbalanceThis appears to be a Thai-language packaging text field, but it is effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so the column carries no information. Treatment: Drop; no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_th
categorical metadata long_tail null_rate imbalanceThai-language product name field, almost entirely empty: 98% null across 50 rows with only a single non-null value (one Lindt dark chocolate entry). Cardinality is 1 and entropy is 0, so this column carries no discriminating signal in this sample. Treatment: Drop from modelling; revisit only if a fuller Thai-localised dump becomes available.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- ลินด์ เอ็กเซอร์แลนซ์ ดาร์ก 99% โกโก้ ดาร์ก แอปโซลูท ช็อกโกแลต
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_de_ocr_1548767354_result
categorical free_text long_tail null_rate imbalanceThis column appears to hold the OCR result of a German ingredients label (one specific dark chocolate product) tied to a timestamped run (1548767354). Of 50 rows, 98% are null and the single non-null value occupies the entire observed cardinality of 1, giving zero entropy. There is effectively no variation to learn from here. Treatment: Drop; 98% null and only one distinct OCR string.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Extra feine dunkle Schokolade. Schokolade enthält: Kakao: mind. 99%. Zutaten: Kakaomasse, fettarmes Kakaopulver, Kakaobutter, Rohrzucker. Kann Schalenfrüchte, Milch und Soja enthalten.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_th
categorical free_text long_tail null_rate imbalanceThis is a Thai-language ingredients text field (ingredients_text_th), but 98% of the 50 rows are null and the single non-null entry is actually English text describing cocoa-based product ingredients. With cardinality of 1 and entropy of 0, the column carries no usable signal and the one populated value appears to be in the wrong language for the field. Treatment: Drop; 98% null and the lone value is mislocalized.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Cocoa solids 99%, Cocoa paste, fat-reduced cocoa, cocoa butter, demerara sugar. May contain nuts, milk and soya.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_th
categorical other long_tail null_rate imbalanceColumn 'origin_th' is effectively empty: 98% of the 50 rows are null and the only observed non-null value is itself an empty string, giving a cardinality of 1 and entropy of 0. There is no signal here to model or join on. Treatment: Drop; the column is 98% null with a single empty-string value otherwise.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_de_ocr_1548767354
categorical free_text long_tail null_rate imbalanceThis appears to be German-language OCR text of product ingredient lists, likely from a food database (the sole observed value describes dark chocolate ingredients). The column is almost entirely empty with a 0.98 null rate, and only one non-null record exists across 50 rows, yielding cardinality 1 and entropy 0. With a single observation there is no usable variation for analysis. Treatment: Drop; 98% null with only one observed value provides no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Extra feine dunkle Schokolade. Schokolade enthält: Kakao: mind. 99%. Zutaten: Kakaomasse, fettarmes Kakaopulver, Kakaobutter, Rohrzucker. Kann Schalenfrüchte, Milch und Soja enthalten.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_de_ocr_1559410715
categorical free_text long_tail null_rate imbalanceThis appears to be a German-language OCR extraction of an ingredients list (likely from a chocolate product packaging), captured as free text. It is almost entirely empty with a 0.98 null rate, and the single non-null row contains one verbose ingredient declaration, giving cardinality 1 and entropy 0.0. With only one observed value out of 50 rows, this column carries no usable signal as-is. Treatment: Drop; effectively empty (98% null, single distinct OCR string).
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Extra feine dunkle Schokolade. Schokolade enthält: Kakao: mind. 99%. Zutaten: Kakaomasse, fettarmes Kakaopulver, Kakaobutter, Rohrzucker. Kann Schalenfrüchte, Milch und Soja enthalten.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_it_ocr_1559410715
categorical free_text long_tail null_rate imbalanceThis appears to be an Italian-language OCR extraction of an ingredients list, likely from a food product label (the lone value describes 99% dark chocolate). The column is effectively empty: 98% null across 50 rows, with only a single non-null value, giving cardinality 1 and entropy 0. There is no variation to learn from here. Treatment: Drop; 98% null with a single observed value carries no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Cioccolato amaro extra. Cacao: 99% minimo. Ingredienti: pasta di cacao, cacao magro, burro di cacao, zucchero grezzo di canna. Può contenere frutta a guscio, latte e soia.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_it_ocr_1559410715_result
categorical free_text long_tail null_rate imbalanceThis column appears to hold OCR-extracted Italian ingredient text (timestamp-suffixed name suggests a single OCR pass result). It is effectively empty: 98% null across n=50, with only one non-null value — a chocolate ingredient list — giving cardinality 1 and zero entropy. Treatment: Drop; a single non-null OCR string carries no modelling signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- pasta di cacao, cacao magro, burro di cacao, zucchero grezzo di canna. Può contenere frutta a guscio, latte e soia.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_fr_imported
categorical free_text long_tail null_rate imbalance
Likely a French-language packaging description brought in through a bulk import (the `_imported` suffix suggests an external feed, possibly from a producer), describing recycling instructions for packaging components. The column is 98% null and only a single non-null value appears across 50 rows, so it carries no analytical signal in this sample. Treatment: Drop; near-entirely null with no variance.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- 1 FEUILLE PAPIER À RECYCLER, 1 FEUILLE METAL À RECYCLER.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
preparation_fr_imported
categorical metadata long_tail null_rate imbalanceThis appears to be a French-language preparation/readiness label imported from an external source, indicating how a product is prepared. It is effectively unusable here: 98% of the 50 rows are null, and the single non-null value is "Produit prêt à consommer", giving cardinality 1 and entropy 0. Treatment: Drop; near-entirely null with a single constant value carries no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Produit prêt à consommer
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
preparation
categorical metadata long_tail null_rate imbalanceA categorical preparation field, likely indicating how a food product should be prepared before consumption. It is effectively empty: 98% of the 50 rows are null, leaving only a single observed value ("Produit prêt à consommer") with cardinality 1 and zero entropy. With no variation among the populated rows, the column carries no discriminative signal. Treatment: Drop; 98% null and only one observed value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Produit prêt à consommer
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
preparation_fr
categorical metadata long_tail null_rate imbalanceThis appears to be a French-language preparation instruction field, likely metadata describing how a product is consumed. It is essentially empty: 98% of the 50 rows are null, and the single non-null value is "Produit prêt à consommer", giving cardinality 1 and entropy 0. There is no variation to learn from in this sample. Treatment: Drop; 98% null with only one observed value provides no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Produit prêt à consommer
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_lc
categorical free_text long_tail null_rate imbalance
This is an ingredients text field whose `_lc` suffix is ambiguous: it may mark a lowercased variant or, by analogy with `countries_lc`, an unresolved language-code placeholder. Either way it is effectively empty: 98% of the 50 rows are null, and the single non-null value is itself an empty string, giving cardinality 1 and entropy 0. There is no usable signal here. Treatment: Drop; the column is 98% null and the only observed value is empty.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_lc
categorical feature long_tail null_rate imbalance
A product name field whose `_lc` suffix may denote lowercasing or, by analogy with `countries_lc`, an unresolved language-code placeholder. It is effectively empty: 98% of the 50 rows are null and the single non-null value is an empty string, giving a cardinality of 1 and entropy of 0. There is no usable signal here. Treatment: Drop; column is 98% null with the only observed value being an empty string.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_lc
categorical free_text long_tail null_rate imbalance
Possibly a normalised lowercase ingredients-with-allergens text field, although the `_lc` suffix could equally be an unresolved language-code placeholder (compare `countries_lc`). In this sample it carries no signal: 98% of rows are null and the single non-null value is an empty string. With cardinality 1 and entropy 0, there is nothing to learn from this column as-is. Treatment: Drop from the working set unless a larger sample shows real text content.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_lc
categorical feature long_tail null_rate imbalance
This appears to be a generic (common) product name field; the `_lc` suffix may denote lowercasing or, by analogy with `countries_lc`, an unresolved language-code placeholder. It is effectively empty in this sample: 98% null and the only non-null value among 50 rows is an empty string. With cardinality of 1 and entropy of 0, the column carries no information here. Treatment: Drop; no signal at this null rate and cardinality.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1561814324
categorical free_text long_tail null_rate imbalanceThis appears to be a French-language OCR capture of an ingredients list, timestamped in the column name (1561814324). With null_rate of 0.98, only a single non-null row exists out of 50, and that lone value is a full ingredients paragraph—cardinality is 1 and entropy is 0. There is effectively no signal here for analysis. Treatment: Drop; 98% null with a single observed value provides no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- 25 % cerneaux de noix, 25 % amandes décortiquées 25 % raisins secs sultanines (raisins secs,huile de tournesol. antioxydant: anhydride lfureux), 15% canneberges, 9,8% sucre, huile de tournesol. Traces éventuelles d'autres fruits à coque et d'arachides. Conditionné sous atmosphère protectrice.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1561814324_result
categorical free_text long_tail null_rate imbalanceThis appears to be the OCR result of a French ingredients label, captured at a specific timestamp (1561814324) suggesting it's one of many time-stamped OCR snapshot columns. Of 50 rows, 98% are null and only a single non-null value exists — a verbose French ingredients string for a nut-and-raisin mix. With cardinality 1 and entropy 0, the column carries essentially no analytical signal in this sample. Treatment: Drop; 98% null and only one distinct OCR string provides no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- 25 % cerneaux de noix, 25 % amandes décortiquées 25 % raisins secs sultanines (raisins secs,huile de tournesol. antioxydant: anhydride lfureux), 15% canneberges, 9,8% sucre, huile de tournesol. Traces éventuelles d'autres fruits à coque et d'arachides.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1624039072_result
categorical free_text long_tail null_rate imbalanceThis column appears to hold OCR results of French ingredient lists, likely from a timestamped extraction run (1624039072). It is effectively empty: 98% null with only 1 non-null value out of 50, that single entry being a cocoa/soy lecithin/vanilla ingredient string. With cardinality 1 and entropy 0, the column carries no usable signal in this sample. Treatment: Drop; 98% null with a single observed value provides no modelling signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Cacao, émulsifiant (lécithine de _soja_), vanille.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1624039072
categorical free_text long_tail null_rate imbalanceThis appears to be French OCR-extracted ingredient text from product packaging, with the timestamp suffix suggesting a dated extraction run. Out of 50 rows, 98% are null and only a single non-null value exists ('ingrédients : cacao, émulsifiant (lécithine de _soja_), vanille.'), giving cardinality 1 and zero entropy. The column is effectively empty and carries no discriminative signal. Treatment: Drop; 98% null with a single observed value provides no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- ingrédients : cacao, émulsifiant (lécithine de _soja_), vanille.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573108346
categorical free_text long_tail null_rate imbalanceThis appears to be a French-language OCR-extracted ingredients list (likely from a food product label, given mentions of flour, sugar, butter, eggs, and emulsifiers). Out of 50 rows, 98% are null and only a single non-null value exists, giving cardinality 1 and entropy 0. The column is effectively empty and carries no discriminative signal in this sample. Treatment: Drop; 98% null with a single observed value provides no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1566920858_result
categorical free_text long_tail null_rate imbalanceThis appears to be the OCR-extracted French ingredients text from a single product image (timestamped 1566920858), holding raw label transcriptions. The column is effectively empty: 98% null across n=50, with only one non-null value — a single French ingredient list for a butter/egg pastry product. Cardinality is 1 and entropy is 0, so it carries no discriminative signal in this sample. Treatment: Drop; 98% null and only one unique OCR string provides no modelling signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , oeufs entiers frais, crème fraîche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - lactylate de sodium, Esters et mono et diacétyltartriques des mono et diglycérides d'acides gras), protéines de lait, levure désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573107556
categorical free_text long_tail null_rate imbalanceThis is a French OCR-extracted ingredients list, almost certainly a timestamped snapshot column from an Open Food Facts-style export. The column is effectively empty: 98% null across 50 rows, with only a single non-null value present, so cardinality is 1 and entropy is 0. The lone observation is a long free-text French ingredients string, not a category, despite being typed as categorical. Treatment: Drop; a single non-null value at 98% null rate carries no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
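The column names in this group appear to follow a `<field>_<lang>_ocr_<timestamp>[_result]` pattern. The sketch below splits a name into those parts; the pattern itself, the helper name `parse_ocr_column`, and the reading of the 10-digit suffix as a Unix timestamp in seconds are assumptions inferred from the names in this report, not documented behaviour.

```python
import re
from datetime import datetime, timezone

# Assumed naming pattern: <field>_<lang>_ocr_<10-digit epoch seconds>[_result]
OCR_COLUMN = re.compile(
    r"^(?P<field>.+)_(?P<lang>[a-z]{2})_ocr_(?P<ts>\d{10})(?P<result>_result)?$"
)

def parse_ocr_column(name):
    """Split an OCR snapshot column name into its parts, or return None."""
    match = OCR_COLUMN.match(name)
    if match is None:
        return None
    return {
        "field": match.group("field"),
        "lang": match.group("lang"),
        # Interpreting the suffix as seconds since the Unix epoch (assumption).
        "captured_at": datetime.fromtimestamp(int(match.group("ts")), tz=timezone.utc),
        "is_result": match.group("result") is not None,
    }

info = parse_ocr_column("ingredients_text_fr_ocr_1573107556")
```

Grouping columns by `captured_at` would make it easier to see which snapshot and `_result` pairs belong to the same scan before deciding what to drop.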
ingredients_text_fr_ocr_1573108346_result
categorical free_text long_tail null_rate imbalance
This appears to be the OCR-extracted French ingredients text from a specific scan run (timestamped 1573108346), holding a raw label transcription: a bakery product's flour, sugar, and butter list. Of 50 rows, 98% are null and the single populated value is one long French ingredient string, giving cardinality 1 and entropy 0. The column is effectively empty for analytical purposes. Treatment: Drop; 98% null with only one populated OCR string offers no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573107560_result
categorical free_text long_tail null_rate imbalance
This column appears to hold the OCR-extracted French ingredients text from a product image (the timestamp 1573107560 in the name suggests a single OCR run). Of 50 rows, 98% are null and only one unique value exists: a French ingredient list for what looks like a butter/egg pastry. With cardinality 1 and entropy 0, this column carries no discriminative signal in this sample. Treatment: Drop or defer; 98% null and a single observed value provide no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573108349_result
categorical free_text long_tail null_rate imbalance
This column appears to hold the OCR-extracted French ingredients text for a product, tied to a specific OCR run timestamp (1573108349). It is essentially empty: 98% null across 50 rows, with only a single non-null value, a long French ingredients string for what looks like a butter/egg pastry. With cardinality 1 and entropy 0, it carries no discriminative signal in this sample. Treatment: Drop; 98% null and only one distinct OCR string provide no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573108360
categorical free_text long_tail null_rate imbalance
This appears to be a French-language OCR extraction of a product ingredients list, timestamped 1573108360 in the column name. Out of 50 rows, 98% are null and only a single non-null value exists (a single bakery product's ingredient declaration), giving cardinality 1 and entropy 0. The column is effectively empty and carries no discriminative signal. Treatment: Drop; 98% null with a single OCR string offers no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573109955_result
categorical free_text long_tail null_rate imbalance
This column appears to hold the OCR result of a French ingredients list (timestamped 1573109955), capturing the parsed text from a product label. Of 50 rows, 98% are null and only a single non-null value exists: a long French ingredients string for a butter/egg pastry product. With cardinality 1 and entropy 0, it carries essentially no information at this sample size. Treatment: Drop; 98% null with a single OCR string offers no modelling signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573108349
categorical free_text long_tail null_rate imbalance
This column appears to be a French OCR-extracted ingredients list (likely from food packaging), judging by the column name and the single observed value, which contains French ingredient text such as 'Farine de blé, sucre, beurre frais'. It is almost entirely empty: 98% null across n=50, with only one non-null record, so cardinality is 1 and entropy is 0. With a single observation it carries no analytical signal in this sample. Treatment: Drop; 98% null and only one unique value provide no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573109955
categorical free_text long_tail null_rate imbalance
This appears to be an OCR-extracted French ingredients list (timestamped 1573109955), likely from a product packaging scan. The column is almost entirely empty: 98% null across 50 rows, with only a single non-null value present, giving cardinality 1 and entropy 0. That lone value is a long, noisy free-text string typical of raw OCR output rather than a clean categorical label. Treatment: Drop; effectively empty with only one OCR string and no analytical signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573107556_result
categorical free_text long_tail null_rate imbalance
This appears to be the OCR result of a French ingredients label (timestamped 1573107556), capturing extracted text from a product image. Of 50 rows, 98% are null and only a single non-null value exists: one French ingredient list for a butter/egg/flour pastry. With cardinality 1 and entropy 0, the column carries essentially no analytical signal. Treatment: Drop; 98% null with only one observed OCR string.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1573108360_result
categorical free_text long_tail null_rate imbalance
This appears to be the OCR-extracted French ingredients text from a timestamped scan run (1573108360), holding a raw label transcription. With 98% nulls and only one non-null value across 50 rows, it is effectively empty; the single populated entry is a long French ingredient list for a butter/egg pastry product. Cardinality of 1 and entropy of 0 mean it carries no discriminative signal here. Treatment: Drop from modelling; if the text is needed, merge it with the sibling OCR columns into a single ingredients_text field before NLP.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2 - actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
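The merge suggested in the treatment above can be done as a simple coalesce: take the first non-empty string from an ordered list of sibling columns. This is a sketch under assumptions: the helper `coalesce_text`, the row-as-dict layout, and the preference order (including the presence of a curated `ingredients_text_fr` column) are illustrative and not taken from the report.

```python
def coalesce_text(row, columns):
    """Return the first non-blank string among `columns` in `row`, else None."""
    for col in columns:
        value = row.get(col)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None

# Hypothetical preference order: a curated field first, then OCR output.
PREFERRED = [
    "ingredients_text_fr",
    "ingredients_text_fr_ocr_1573108360_result",
    "ingredients_text_fr_ocr_1573108360",
]

# Illustrative row: the curated field is null, so the OCR result is used.
row = {
    "ingredients_text_fr": None,
    "ingredients_text_fr_ocr_1573108360_result": "  Farine de blé, sucre  ",
    "ingredients_text_fr_ocr_1573108360": "Farine de blé, sucre, beurre frais",
}
merged = coalesce_text(row, PREFERRED)
```

Treating blank strings as missing matters here, because several localized columns in this sample hold an empty string as their only non-null value.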
ingredients_text_fr_ocr_1573107560
categorical free_text long_tail null_rate imbalance
This appears to be an OCR-extracted French ingredients list (timestamped 1573107560), capturing the raw text from a product label. Out of 50 rows, 98% are null and only a single non-null value exists, an entry beginning 'Farine de blé, sucre, beurre frais 9,5%...'. With cardinality 1 and entropy 0, the column carries effectively no signal in this sample. Treatment: Drop; 98% null with a single OCR string offers no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , aeufs entiers frais, crème fraiche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- actylate de sodium, diacétyltartriques des mono et diglycérides d'acides désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque. Esters et mono gras), protéines de lait levure
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1566920858
categorical free_text long_tail null_rate imbalance
This is a French-language OCR-extracted ingredients list, timestamped in the column name (1566920858), of the kind found in Open Food Facts product dumps. Out of 50 rows, 98% are null and only a single non-null value exists, a verbose ingredients string for a butter/egg pastry. With cardinality 1 and entropy 0, the column carries effectively no signal in this sample. Treatment: Drop; 98% null and only one distinct OCR string provide no usable signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Farine de blé, sucre, beurre frais 9,5 % , oeufs entiers frais, crème fraîche 5,5% , levure, sel, arômes naturels (contient alcool), gluten de blé, poudre de lait écrémé, eau de vie, émulsifiants (Mono et diglycérides d'acides gras, Stéaroyl-2- lactylate de sodium, Esters et mono et diacétyltartriques des mono et diglycérides d'acides gras), protéines de lait, levure désactivée, colorant (béta carotène) Traces éventuelles de fruits à coque.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
generic_name_lt
categorical metadata long_tail null_rate imbalance
This appears to be a Lithuanian-locale generic name field, but it is effectively empty: 98% of the 50 rows are null and the single non-null value is the empty string. Cardinality is 1 and entropy is 0, so the column carries no information. Treatment: Drop; the column is 98% null with a single empty-string value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
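Because the only non-null value here is an empty string, the reported 98% null rate slightly overstates how much content the column has. A minimal sketch of normalising blanks to nulls before profiling follows; the helper names `normalise_blank` and `null_rate` are hypothetical, and whether whitespace-only strings should also count as missing is an assumption.

```python
def normalise_blank(values):
    """Replace empty or whitespace-only strings with None."""
    return [None if isinstance(v, str) and not v.strip() else v for v in values]

def null_rate(values):
    """Fraction of entries that are None (0.0 for an empty list)."""
    return sum(v is None for v in values) / len(values) if values else 0.0

# Mirrors the column above: 49 nulls and one empty string.
raw = [None] * 49 + [""]
clean = normalise_blank(raw)

rate_before = null_rate(raw)
rate_after = null_rate(clean)
```

After normalisation the column is entirely null, which makes the drop decision unambiguous and keeps the empty string from being counted as a distinct category.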
ingredients_text_with_allergens_ro
categorical free_text long_tail null_rate imbalance
A Romanian-language ingredients-with-allergens text field, almost entirely empty in this sample. 98% of rows are null and the only non-null value observed is an empty string, giving a single unique value across n=50 with zero entropy. Treatment: Drop; there is effectively no signal at this sample size.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
packaging_text_lt
categorical free_text long_tail null_rate imbalance
Lithuanian packaging text field that is effectively empty in this sample: 98% null and the single non-null value is itself an empty string, giving cardinality 1 and zero entropy. There is no usable signal here. Treatment: Drop; no observed content.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_lt
categorical free_text long_tail null_rate imbalance
This appears to be a Lithuanian-language ingredients text field, one of the per-language localized columns in this catalog. It is effectively empty: 98% null across 50 rows, and the single non-null value is itself an empty string, giving cardinality 1 and zero entropy. Treatment: Drop; no usable signal in this sample.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
origin_lt
categorical metadata long_tail null_rate imbalance
The column 'origin_lt' is almost entirely null (null rate 0.98 across 50 rows), and its single non-null observation is itself an empty string. With a cardinality of 1, entropy of 0, and top_rate of 1.0, there is no usable signal here. Treatment: Drop; effectively empty with no variance.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
product_name_lt
categorical metadata long_tail null_rate imbalance
This column appears to be a Lithuanian-localized product name field, but it is effectively empty: 98% of the 50 rows are null and the single non-null value is itself an empty string. Cardinality is 1 with zero entropy, so it carries no information. Treatment: Drop; the column is 98% null with a single empty-string value.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_with_allergens_lt
categorical free_text long_tail null_rate imbalance
This appears to be a Lithuanian-language ingredients text field with allergen annotations, but it is effectively empty in this sample. 98% of rows are null and the only non-null value observed is the empty string, giving cardinality 1 and zero entropy across n=50. Treatment: Drop from analysis; there is no non-null content to model.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- (empty string)
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1713713129
categorical free_text long_tail null_rate imbalance
This appears to be a French-language OCR extract of an ingredients list for a chocolate product, captured at the timestamp suggested by the column suffix (1713713129). Out of 50 rows, 98% are null and only one non-null value exists, making the column a single-record artifact rather than a usable feature. The lone value is free-form text listing cocoa paste, cocoa butter, sugar, milk powder, and allergen traces. Treatment: Drop; 98% null and only one observed value provide no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Ingrédients : Pâte de cacao, cacao en poudre dégraissé, beurre de cacao, sucre, lait en poudre, pâte de amandes et de noisettes, émulsifiants (lécithines (soja, toumesol)) et arôme. Cacao 92% minimum. Peut contenir des traces d'autres fruits à coque.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
ingredients_text_fr_ocr_1713713129_result
categorical free_text long_tail null_rate imbalance
This appears to be the OCR result of a French ingredients list, captured at a single timestamp (1713713129) and stored as raw text. With a null rate of 0.98, only 1 of 50 rows has a value, and that single observation is a chocolate ingredients statement (cocoa paste, cocoa powder, almonds, hazelnuts, soy lecithin). Cardinality is 1 and entropy is 0, so there is no variation to model from this column alone. Treatment: Drop; 98% null and only one distinct OCR string provide no signal.
- n
- 50
- nulls
- 49 (98.0%)
- unique
- 1
- top_value
- Pâte de cacao, cacao en poudre dégraissé, beurre de cacao, sucre, lait en poudre, pâte de amandes et de noisettes, émulsifiants (lécithines (soja, toumesol)) et arôme. Cacao 92% minimum. Peut contenir des traces d'autres fruits à coque.
- top_rate
- 1
- cardinality
- 1
- entropy
- 0
- entropy_ratio
- 0
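Every column in this group receives the same "Drop" treatment for the same two reasons: a very high null rate and a single distinct value. That rule can be applied mechanically, as in the sketch below. The function `columns_to_drop`, the column-name-to-list table layout, and both thresholds are illustrative assumptions; the report does not state the cut-offs it used. The `nutriscore_grade` counts are the ones given in the summary table at the top of the report.

```python
def columns_to_drop(table, max_null_rate=0.95, min_cardinality=2):
    """Return sorted names of columns that are too sparse or near-constant.

    `table` maps column name -> list of values, with None marking a null.
    Both thresholds are assumptions for illustration, not report values.
    """
    dropped = []
    for name, values in table.items():
        non_null = [v for v in values if v is not None]
        null_rate = 1 - len(non_null) / len(values) if values else 1.0
        cardinality = len(set(non_null))
        if null_rate > max_null_rate or cardinality < min_cardinality:
            dropped.append(name)
    return sorted(dropped)

sample = {
    # One OCR string in 50 rows (value shortened to a placeholder).
    "ingredients_text_fr_ocr_1713713129": [None] * 49 + ["Pâte de cacao (placeholder)"],
    # One empty string in 50 rows.
    "product_name_lt": [None] * 49 + [""],
    # A well-populated column, using the grade counts from the summary.
    "nutriscore_grade": ["e"] * 27 + ["d"] * 9 + ["c"] * 7 + ["a"] * 4 + ["b"] * 2 + ["unknown"],
}
to_drop = columns_to_drop(sample)
```

Applying the rule once over all 545 columns is less error-prone than acting on each card individually, and it keeps the small core of populated fields such as `nutriscore_grade`.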