Add original corpus Spanish Dish Title. #43

Open
wants to merge 8 commits into
base: main
1 change: 1 addition & 0 deletions datasets.csv
@@ -17,3 +17,4 @@ Spanish Skip-Gram Word Embeddings in FastText,"modelado del lenguaje,FastText",g
TDX Thesis Spanish Corpus,modelado del lenguaje,academico,"catalán, español",España,https://doi.org/10.5281/zenodo.7313149,,,,David Arias
WikiCorpus,"modelado del lenguaje,POS (Part of Speech)",general,"catalán, español, inglés",Varios,https://www.cs.upc.edu/~nlp/wikicorpus/,,https://www.cs.upc.edu/~nlp/papers/reese10.pdf,wikicorpus,Albert Villanova @Hugging Face
eHealth-KD,reconocimiento de entidades nombradas (NER),clinico,es,España,https://knowledge-learning.github.io/ehealthkd-2020/,https://github.com/knowledge-learning/ehealthkd-2020,http://ceur-ws.org/Vol-2664/eHealth-KD_overview.pdf,ehealth_kd,María Grandury
Spanish Dish title,Imagen a texto,general,español,Varios,,,,,Fredy Orozco
@@ -0,0 +1,6 @@
{
Member comment: You don't need to include this file; add .ipynb_checkpoints to the .gitignore :)

"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
878 changes: 878 additions & 0 deletions datasets/spanish_dish_title/EDA.ipynb

Large diffs are not rendered by default.

42 changes: 42 additions & 0 deletions datasets/spanish_dish_title/README.md
@@ -0,0 +1,42 @@
# Platos de comida
Member comment: A proposal for the bias study: where do the recipes come from? Do they include recipes from different countries/continents?

## Descripción
El siguiente dataset son imagenes con platos de comidas y su titulo. El dataset se creó haciendo scrapy a la siguiente página web <a href="https://www.recetasgratis.net/">Recetas gratis</a>, la metodología es la siguiente:
Member suggested change:
- El siguiente dataset son imagenes con platos de comidas y su titulo. El dataset se creó haciendo scrapy a la siguiente página web <a href="https://www.recetasgratis.net/">Recetas gratis</a>, la metodología es la siguiente:
+ El siguiente dataset son imágenes con platos de comidas y su título. El dataset se creó haciendo scrapy a la siguiente página web <a href="https://www.recetasgratis.net/">Recetas gratis</a>, la metodología es la siguiente:

1. Se obtiene el link de la página principal de la categoría de comida.
2. Se obtiene el link de la página de cada receta.
3. Se obtiene el link de la imagen de la receta.
4. Se obtiene el titulo de la receta.
Member suggested change:
- 4. Se obtiene el titulo de la receta.
+ 4. Se obtiene el título de la receta.
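
The four steps in the README (category page, recipe page, image link, title) can be sketched roughly as follows. This is only an illustration: the PR says the data was collected with Scrapy, whereas this sketch swaps in Python's standard-library `html.parser` so it runs without extra dependencies. The `example.com` URLs, the sample markup, and the assumption that the title lives in an `<h1>` are all hypothetical and are not taken from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects <a href> links, <img src> links, and the text of the first <h1>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []
        self.title_parts = []
        self._in_h1 = False
        self._h1_done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "h1" and not self._h1_done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._h1_done = True

    def handle_data(self, data):
        if self._in_h1 and data.strip():
            self.title_parts.append(data.strip())

    @property
    def title(self):
        return " ".join(self.title_parts)


def parse_page(html, base_url):
    parser = PageParser(base_url)
    parser.feed(html)
    return parser


# Hypothetical markup standing in for a category page and a recipe page.
CATEGORY_HTML = '<a href="/receta-de-paella.html">Paella</a>'
RECIPE_HTML = '<h1>Receta de paella</h1><img src="/img/paella.jpg">'

# Steps 1-2: start from the category page and collect recipe links.
category = parse_page(CATEGORY_HTML, "https://example.com/")
# Steps 3-4: on each recipe page, take the image link and the title.
recipe = parse_page(RECIPE_HTML, category.links[0])
record = {"prompt": recipe.title, "source": recipe.images[0]}
```

In a real crawl the HTML strings would come from HTTP responses, and the selection logic would need to match the site's actual markup.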

## Imagenes
Member suggested change:
- ## Imagenes
+ ## Imágenes

Las imagenes tienen un tamaño de 300x300 pixeles y se encuentran en formato jpg.
Member suggested change:
- Las imagenes tienen un tamaño de 300x300 pixeles y se encuentran en formato jpg.
+ Las imágenes tienen un tamaño de 300x300 pixeles y se encuentran en formato jpg.

## Metadatos
Los metadatos que se encuentran en el dataset son los siguientes:
+ **prompt**: Titulo de la receta.
Member suggested change:
- + **prompt**: Titulo de la receta.
+ + **prompt**: Título de la receta.

+ **source**: path de la imagen.
Member suggested change:
- + **source**: path de la imagen.
+ + **source**: Path de la imagen.

+ **uuid**: Identificador único de la imagen.

Nota 1: El dataset se encuentra en formato csv.
Nota 2: El nombre de las imagenes tambien va el titulo
Member suggested change:
- Nota 2: El nombre de las imagenes tambien va el titulo
+ Nota 2: En el nombre de las imágenes tambien va el título.
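
As a minimal sketch of how the metadata described above could be loaded and sanity-checked, the following uses Python's `csv` module. It assumes that `dataset.csv` has a header row with the column names `prompt`, `source` and `uuid`; the sample rows and the helper names `load_rows` and `duplicated_uuids` are made up for illustration.

```python
import csv
import io

EXPECTED_COLUMNS = {"prompt", "source", "uuid"}


def load_rows(handle):
    """Reads the CSV and fails early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


def duplicated_uuids(rows):
    """Returns the uuid values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        if row["uuid"] in seen:
            dupes.add(row["uuid"])
        seen.add(row["uuid"])
    return dupes


# Invented sample rows; the real file would be opened with
# open("dataset.csv", newline="", encoding="utf-8").
SAMPLE = (
    "prompt,source,uuid\n"
    "Receta de paella,images/paella.jpg,0001\n"
    "Receta de flan,images/flan.jpg,0002\n"
)
rows = load_rows(io.StringIO(SAMPLE))
dupes = duplicated_uuids(rows)
```

Checking `uuid` for duplicates is a cheap way to confirm that it really is a unique identifier, as the README states.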


## Directorio
Member comment: Please list all the files, with an explanation of each.

Member comment: Please rename Untitled.ipynb so that its name states what the notebook does.

```bash
|-- README.md - Este archivo
|-- dataset.csv - Dataset
|-- images - Imagenes
|-- src - Código fuente, en especial el script de scrapy
```
## Análisis exploratorio de datos

El ánilisis exploratorio se centra en el texto, para las imagenes tocaría aplicar herramientas de visión por computador como clip, para crear ciertas clasificaciones.
Member comment: Also add a sentence saying that the notebook is available, with a link to the EDA.ipynb notebook.

Member suggested change:
- El ánilisis exploratorio se centra en el texto, para las imagenes tocaría aplicar herramientas de visión por computador como clip, para crear ciertas clasificaciones.
+ El ánilisis exploratorio se centra en el texto, para las imágenes tocaría aplicar herramientas de visión por computador como clip, para crear ciertas clasificaciones.


### Análisis de texto

<img src="nube_de_palabras.png">
En la imagen podemos ver las palabras más frecuentes para el texto, tambien podemos ver un boxplot del texto
<img src="box_plot.png">
Aquí podemos ver como existen palabras muy pequeñas y muy grandes, por lo que recomendamos al usario que se fije en el texto para ver si le sirve el tamaño del texto
<img src="distribution.png">
Member comment: In this case size_distribution.png might be a more specific name :)

En el siguiente histograma podemos ver la distribución de los tamaños de los textos, podemos ver que la mayoría de textos tienen un tamaño menor a 78 caracteres, el 75% del dataset tiene un tamaño de 31 caracteres.
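
The length figures quoted above (a maximum around 78 characters, a third quartile around 31) are the kind of summary that can be reproduced with a few lines of Python. The sketch below is an assumption about how such numbers might be computed, not the notebook's actual code; the sample titles are invented and `length_summary` is a hypothetical helper.

```python
import statistics


def length_summary(titles):
    """Returns min, quartiles and max of the title lengths in characters."""
    lengths = [len(title) for title in titles]
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "min": min(lengths),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(lengths),
    }


# Invented titles; in practice these would be the `prompt` column.
titles = [
    "Flan",
    "Receta de paella",
    "Tarta de queso al horno",
    "Sopa",
    "Gazpacho andaluz",
]
summary = length_summary(titles)
```

Reporting the quartiles alongside the histogram makes it easier for a user to decide whether the title lengths suit their use case.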

### Análisis de imagenes
Se recomienda analizar por medio de redes neuronales, para sacar más provecho y verificar la correspondecia entre el prompt y la imagen. (Una idea es hacer esto con CLIP)
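
One way the suggested prompt-image check could work is to embed each image and its title with a CLIP-style model and flag pairs whose cosine similarity is low. The sketch below only shows the scoring step and assumes the embeddings already exist; the toy vectors, the `0.5` threshold, and the helper names are illustrative assumptions. Since the titles are in Spanish, the chosen model would also need to handle Spanish text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flag_mismatches(pairs, threshold):
    """pairs: iterable of (uuid, image_embedding, text_embedding).

    Returns the uuids whose image and text embeddings are less similar
    than the threshold.
    """
    return [
        uuid
        for uuid, image_emb, text_emb in pairs
        if cosine_similarity(image_emb, text_emb) < threshold
    ]


# Toy 2-D embeddings; real CLIP-style embeddings have hundreds of dimensions.
pairs = [
    ("0001", [1.0, 0.0], [0.9, 0.1]),  # image and title roughly agree
    ("0002", [1.0, 0.0], [0.0, 1.0]),  # image and title are unrelated
]
suspicious = flag_mismatches(pairs, threshold=0.5)
```

A suitable threshold would have to be chosen empirically, for example by inspecting a sample of low-scoring pairs by hand.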

<img src="dishes_prompt.png">
935 changes: 935 additions & 0 deletions datasets/spanish_dish_title/Untitled.ipynb

Large diffs are not rendered by default.

Binary file added datasets/spanish_dish_title/box_plot.png
16,464 changes: 16,464 additions & 0 deletions datasets/spanish_dish_title/dataset.csv

Large diffs are not rendered by default.

Binary file added datasets/spanish_dish_title/dishes_prompt.png
Binary file added datasets/spanish_dish_title/distribution.png