This project is part of EPFL's Applied Data Analysis course and promotes data science in Switzerland. The concept is not tied to a particular region and can easily be generalized elsewhere.
Switzerland is well known for its rich heritage: incredible landscapes, watches, cheese, chocolate and diverse influences from its five neighboring countries. This project investigates how this heritage is reflected in food habits. We picked 2 Swiss and 3 French cities with high restaurant density to gain insights into local diets. We mapped restaurant meals to recipes, and recipe ingredients to products, to analyze the corresponding nutrients.
Is there any area-based nutrition bias?
Our infrastructure and datasets also allow us to explore other topics such as:
- food trends according to clichés (e.g. Rösti, Malakoff)
- food and nutrient variety per location (e.g. meals with more salt, lipids, etc.)

The datasets we collected comprise:
- 11k restaurants (e.g. LaFourchette)
- 35k meals (extracted from the restaurants' menus)
- 170k recipes (various websites, e.g. CuisineAZ)
- 1.3M ingredients (derived from the recipes)
- 5k products (e.g. FDDB, OpenFood)
- 40k nutrients (extracted from the products)
We assumed that:
- the restaurants listed on LaFourchette are representative enough of local food habits.
- we could associate recipes with meals and products with recipes well enough to derive the nutrition facts of a meal without suffering too much from variance or from central-limit-theorem averaging effects.
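
To make the second assumption concrete, here is a minimal sketch of how nutrition facts for a meal could be derived once recipes and products have been matched. It is an illustration only, not the project's code: the product values, recipe names, quantities and the functions `recipe_nutrients` and `meal_nutrients` are all hypothetical, and averaging over matched recipes is an assumed aggregation choice.

```python
from statistics import mean

# Hypothetical toy data, not taken from the project datasets:
# nutrients per 100 g of each product.
PRODUCTS = {
    "potato": {"energy_kcal": 77.0, "salt_g": 0.02},
    "butter": {"energy_kcal": 717.0, "salt_g": 0.03},
    "cheese": {"energy_kcal": 400.0, "salt_g": 1.5},
}

# Hypothetical recipes: product name -> quantity in grams.
RECIPES = {
    "gratin dauphinois": {"potato": 500.0, "butter": 50.0},
    "gratin au fromage": {"potato": 400.0, "butter": 40.0, "cheese": 100.0},
}


def recipe_nutrients(ingredients):
    """Return nutrients per 100 g of recipe, weighting each product by its mass."""
    total_mass = sum(ingredients.values())
    totals = {}
    for name, grams in ingredients.items():
        for nutrient, per_100g in PRODUCTS[name].items():
            totals[nutrient] = totals.get(nutrient, 0.0) + per_100g * grams / 100.0
    return {nutrient: value * 100.0 / total_mass for nutrient, value in totals.items()}


def meal_nutrients(matched_recipes):
    """Average the per-100 g nutrients over every recipe matched to one meal."""
    per_recipe = [recipe_nutrients(RECIPES[name]) for name in matched_recipes]
    return {nutrient: mean(p[nutrient] for p in per_recipe) for nutrient in per_recipe[0]}


estimate = meal_nutrients(["gratin dauphinois", "gratin au fromage"])
```

Each matching step adds noise, so an estimate like `estimate` is only as reliable as the recipes and products it averages over.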
Each folder in this repository is a step in the development of the project. Each folder contains a README describing what we did and why. We recommend reading them to get a better idea of the work we have done.
We implemented the following data pipeline:
We used this process to find matches:
Disadvantage | Advantage |
---|---|
Rare events, misspelled, grouped: "Pavé de boeuf aux morilles" / "Pavé de boeuf aux morilles simplissimes" | Order tolerance: "Tiramisu caramel speculos beurre salé" / "Tiramisu au caramel au beurre salé et spéculoos" |
Wide, personal meaning: "café gourmand" / "café gourmand à ma façon" | Exact match: "Salade d'orange au miel et à la cannelle" / "Salade d'orange au miel et à la cannelle" |
Principal component: "Rognons de lapins à la moutarde de Meaux" / "Fricassée de champignons à la moutarde de Meaux" | Limited difference: "Terrine de foie gras et confiture de pruneaux" / "Terrine de foie gras aux pruneaux et raisins secs" |
Unknown, language: "Tartare de boeuf minute, salade et potatoes" / "Twice baked potatoes au bacon" | Complex: "Cassolette de Saint-Jacques et crevettes" / "Ravioles, noix de Saint-Jacques et crevettes en cassolettes raffinées" |
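
The behaviours in the table (order tolerance, small spelling differences, exact matches) can be illustrated with a small name-matching sketch. This is not the matching process used in the project; it is a stand-in technique, assuming accent stripping, a hand-picked French stop-word list and `difflib.SequenceMatcher` over sorted tokens. The names `normalize`, `similarity`, `best_match` and the `0.8` threshold are hypothetical.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Assumed list of French function words to ignore; not taken from the project.
STOP_WORDS = {"a", "au", "aux", "d", "de", "des", "du", "en", "et", "l", "la", "le", "les"}


def normalize(name):
    """Lowercase, strip accents, drop stop words and sort the remaining tokens."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = [t for t in re.findall(r"[a-z]+", without_accents) if t not in STOP_WORDS]
    return " ".join(sorted(tokens))


def similarity(meal, recipe):
    """Character-level similarity in [0, 1] between the normalized names."""
    return SequenceMatcher(None, normalize(meal), normalize(recipe)).ratio()


def best_match(meal, recipes, threshold=0.8):
    """Return the most similar recipe name, or None if nothing reaches the threshold."""
    if not recipes:
        return None
    score, recipe = max((similarity(meal, r), r) for r in recipes)
    return recipe if score >= threshold else None


candidates = [
    "Tiramisu au caramel au beurre salé et spéculoos",
    "Twice baked potatoes au bacon",
    "café gourmand à ma façon",
]
match = best_match("Tiramisu caramel speculos beurre salé", candidates)
```

Sorting the tokens makes the score independent of word order, and the character-level ratio absorbs small spelling variations such as "speculos" versus "spéculoos". It does not address the "principal component" weakness: two dishes that share only a sauce can still score deceptively well.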
A few examples of food facts we can extract from the datasets with our infrastructure.
Per country | Per city |
---|---|
Figure: Energy (kcal) per country | Figure: Energy (kcal) per city |
Figure: Protein per country | Figure: Protein per city |
Figure: Carbohydrates per country | Figure: Carbohydrates per city |
Figure: Salt per country | Figure: Salt per city |
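
Per-country and per-city comparisons like these boil down to grouping meal-level estimates and summarizing each group. The sketch below shows one way to do that with the standard library. The rows are made-up toy values, not project results, and the city labels and the `summarize` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy rows: (country, city, energy in kcal per 100 g).
# These values are invented for illustration and are not project data.
MEALS = [
    ("CH", "city_a", 150.0),
    ("CH", "city_a", 250.0),
    ("CH", "city_b", 200.0),
    ("FR", "city_c", 180.0),
    ("FR", "city_c", 220.0),
    ("FR", "city_d", 100.0),
]


def summarize(rows, key_index):
    """Group rows by the chosen column and report count, mean and spread."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {
        key: {
            "n": len(values),
            "mean": mean(values),
            # The sample standard deviation needs at least two values.
            "stdev": stdev(values) if len(values) > 1 else 0.0,
        }
        for key, values in groups.items()
    }


per_country = summarize(MEALS, 0)
per_city = summarize(MEALS, 1)
```

Reporting the spread next to the mean matters here: when the standard deviation within a group is large compared with the difference between group means, an apparent regional difference may simply be noise.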
Here are a few visualization examples for cliché-meal searches.
Speciality | Different kind |
---|---|
Choucroute (red), Malakoff (blue) | Fondue Savoyarde (red), Fondue au fromage (blue) |
The food trends suggested by well-known clichés did show up in the data. The estimated nutrition facts are harder to interpret: the high variance and noise of the datasets, compounded by the matching process, greatly increase the difficulty of the analysis. We found no relevant area-based nutrition bias. The matching process and the pipeline can nonetheless serve as tools for further in-depth investigation.
Before starting the project, we expected the following points to be the most challenging:
- dataset collection: menu data can be difficult to gather
- sparsity and spatial homogeneity: depending on dataset quality, some regions might need to be ignored for lack of data
- content languages: textual information (including menus) can be named differently depending on the area, so standardization and translation might be needed
- data completeness: non-food data might need to be extracted from different sources for the analysis to be meaningful
After finishing the project, the actual challenges turned out to be:
- data mining and normalization (high variance, different sources, captchas)
- data organization (complex queries, centralized storage with ElasticSearch)
- French NLP (weird characters, hard modeling)
- matching (many candidates, heterogeneous units)
- computationally heavy (vectorization, visualization)
Regarding content languages, no data was available on LaFourchette for the German-speaking and Italian-speaking parts of Switzerland. Hence we focused our work on France and the French-speaking part of Switzerland.

Possible improvements include:
- formal statistical evaluation: because time was limited, the project yields few insights. More rigorous modelling and evaluation would improve this.
- deep recurrent neural network for matching: the efficiency of a neural network at matching meals to recipes should be evaluated.
- computational efficiency: matching currently takes 20 seconds per restaurant (centralized server); batching, parallelisation and a local server could reduce this.
- expanded visualization: more interactive and more diverse kinds of visualization.
- more and better data for Switzerland: data precision is still an issue. Using the restaurants' own websites, for example, could improve it.
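
The batching and parallelisation idea from the efficiency point could look roughly like the sketch below. It assumes the matching step mostly waits on a remote search server, which is when threads help. `match_batch` is a hypothetical stand-in for the real per-restaurant matching, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def match_batch(restaurant_ids):
    """Hypothetical stand-in: match the meals of a whole batch of restaurants.

    A real version would send one request per batch to the search server
    instead of one request per restaurant.
    """
    return {rid: f"matches-for-{rid}" for rid in restaurant_ids}


def match_all(restaurant_ids, batch_size=50, max_workers=8):
    """Run the batches concurrently and merge the per-batch results."""
    results = {}
    batches = chunks(list(restaurant_ids), batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(match_batch, batches):
            results.update(partial)
    return results


all_matches = match_all(range(120))
```

If the matching turned out to be CPU-bound rather than I/O-bound, a process pool would be the more suitable choice than threads.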
The project is available under the Apache 2.0 license. The data belong to their respective owners under their own licenses.