Dataset creation

This document will guide you in creating a custom dataset to fine-tune the LLM, or to create a database for RAG models. To accomplish this, the following steps are required, each described more in depth in the corresponding section:

  1. downloading the contents from the external sources;
  2. processing such contents to create a suitable dataset;
  3. optionally, putting the converted dataset into a vector store if you want to implement a RAG application.

Note: this component is inspired by this blog-post on the Hugging Face blog, particularly by the code available here.

Requirements

To create a dataset, we rely on Hugging Face's datatrove library, particularly the datatrove[io] component. Alongside this, other required dependencies are tokenizers, regex, and spacy.

To install these dependencies, run the following command:

pip3 install datatrove[io] tokenizers regex spacy

Improvements

  • Adding support for Jupyter notebooks; for starters, some code is present here;

1. Content download

To download the contents from the external sources, you may use the Python script dataset/pull_content.py by calling

python3 -m dataset.pull_content

This script will download the contents into the download folder from the sources specified in the dataset_config.yaml file at the root of this repository. At the current stage, the sources that can be downloaded are:

  • GitHub repositories: they are specified as a list of <user>/<repository> entries within the github_repositories tag. For example, if the link of a repository is https://github.com/idra-lab/z1_ros2, then only idra-lab/z1_ros2 must be specified;
  • GitLab repositories: same as GitHub repositories, but the <user>/<repository> entry is downloaded from the GitLab servers. Entries are specified within the gitlab_repositories tag.

Configuration example

github_repositories:
  - ros-controls/ros2_control  # Fetches https://github.com/ros-controls/ros2_control

gitlab_repositories:
  - libeigen/eigen  # Fetches https://gitlab.com/libeigen/eigen
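
For reference, the following is a minimal sketch of what a download step based on this configuration could look like. It assumes the configuration is read with pyyaml (not listed among the project dependencies) and that repositories are cloned with the git command line; the actual implementation in dataset/pull_content.py may differ.

import subprocess
from pathlib import Path

import yaml  # requires pyyaml (assumption, not part of the listed dependencies)


def pull_repositories(config_file: str = "dataset_config.yaml", target: str = "download") -> None:
    # Read the repository lists from the configuration file
    config = yaml.safe_load(Path(config_file).read_text())
    sources = {
        "https://github.com": config.get("github_repositories", []),
        "https://gitlab.com": config.get("gitlab_repositories", []),
    }
    for base_url, repositories in sources.items():
        for repo in repositories:  # each entry is "<user>/<repository>"
            destination = Path(target) / repo.replace("/", "_")
            if not destination.exists():
                subprocess.run(
                    ["git", "clone", "--depth", "1", f"{base_url}/{repo}", str(destination)],
                    check=True,
                )


if __name__ == "__main__":
    pull_repositories()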

2. Content processing

To process the downloaded content, you may use the Python script dataset/process.py by calling

python3 -m dataset.process

This script will process the contents of the download folder. While doing so, intermediate and temporary results are stored in the tmp folder at the root of this repository; additionally, datatrove logs some information in the logs folder.

Processing pipeline

Within the processing script, we execute a pipeline inspired by Hugging Face's MinHash deduplication example. MinHash is an algorithm that uses hashing to approximate the Jaccard similarity, an index that measures the similarity of two files. Exploiting this method, we can automatically and easily remove files that appear to be duplicates within the downloaded contents (e.g., due to different versions of the same file that have not been deleted).
Finding near-duplicates with Jaccard similarity and MinHash is a good blog post that explains this concept more in depth; for a technical overview of MinHash algorithms, you may refer to this review.
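
To give an intuition of the idea, the following self-contained Python snippet (purely illustrative, not part of the repository) estimates the Jaccard similarity of two small documents from their MinHash signatures:

import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    # Break the text into overlapping word n-grams ("shingles")
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    # For each seeded hash function, keep the minimum hash value over all shingles
    return [
        min(int(hashlib.sha1(f"{seed}:{item}".encode()).hexdigest(), 16) for item in items)
        for seed in range(num_hashes)
    ]


a = shingles("the robot arm moves to the target pose and closes the gripper")
b = shingles("the robot arm moves to the goal pose and closes the gripper")

exact = len(a & b) / len(a | b)  # exact Jaccard similarity
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
approx = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)  # MinHash estimate

print(f"exact Jaccard: {exact:.2f}, MinHash estimate: {approx:.2f}")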

Stages of the pipeline are the following:

  • stage 1: loads the data and applies some filtering to the files to process (details on the filtering are provided later);
  • stage 2: creates the MinHash signature for each task;
  • stage 3: finds matches between signatures in each bucket;
  • stage 4: creates clusters of duplicates using the results from all buckets;
  • stage 5: constructs the final dataset by removing duplicate files based on the MinHash clustering.
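
For illustration only, the deduplication stages can be assembled with datatrove roughly as in the Hugging Face example this component is inspired by. The class and parameter names below follow that example and may differ between datatrove versions; the folder layout under tmp is an assumption, not the repository's actual configuration, and this snippet does not reproduce dataset/process.py.

from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import (
    MinhashConfig,
    MinhashDedupBuckets,
    MinhashDedupCluster,
    MinhashDedupFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

minhash_config = MinhashConfig()  # default hashing parameters
BASE = "tmp/minhash"              # assumed intermediate folder

# stage 2: one MinHash signature per task
stage_signatures = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("tmp/filtered"),  # output of the filtering stage (assumed path)
        MinhashDedupSignature(output_folder=f"{BASE}/signatures", config=minhash_config),
    ],
)

# stage 3: find matching signatures within each bucket
stage_buckets = LocalPipelineExecutor(
    pipeline=[
        MinhashDedupBuckets(input_folder=f"{BASE}/signatures", output_folder=f"{BASE}/buckets", config=minhash_config),
    ],
)

# stage 4: build clusters of duplicates from all buckets
stage_clusters = LocalPipelineExecutor(
    pipeline=[
        MinhashDedupCluster(input_folder=f"{BASE}/buckets", output_folder=f"{BASE}/remove_ids", config=minhash_config),
    ],
)

# stage 5: drop all but one document per duplicate cluster and write the final dataset
stage_filter = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("tmp/filtered"),
        MinhashDedupFilter(input_folder=f"{BASE}/remove_ids"),
        JsonlWriter("tmp/deduplicated"),
    ],
)

for stage in (stage_signatures, stage_buckets, stage_clusters, stage_filter):
    stage.run()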

Processing configuration

As for the content download, the processing is configured through the dataset_config.yaml file. The following subsections will guide you through the configuration options.

Filtering

When processing the contents, it is possible to filter out the files that shall not be considered during the training procedure (e.g., binaries, images, PDFs, etc.). Within the configuration file, you can specify in the filters tag some filters that will be applied to each file. Particularly, you may blacklist files based on their extensions, or based on the paths they contain.

An example is

filters:
  extensions:
    - png   # ignore images
    - jpg
    - jpeg
    - gif
  paths:
    - .git  # ignore git-related files

A default set of filters is provided at the bottom of this document.
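
As an illustration of how such filters might be applied, consider the following sketch; the helper below is hypothetical and does not correspond to the actual code in dataset/process.py, and it assumes pyyaml is available to read the configuration.

from pathlib import Path

import yaml  # requires pyyaml (assumption)


def is_filtered_out(file_path: str, filters: dict) -> bool:
    # Reject a file if its extension is blacklisted or a blacklisted fragment appears in its path
    extension = Path(file_path).suffix.lstrip(".")
    if extension in set(filters.get("extensions", [])):
        return True
    return any(fragment in file_path for fragment in filters.get("paths", []))


config = yaml.safe_load(Path("dataset_config.yaml").read_text())
print(is_filtered_out("download/ros2_control/.git/config", config["filters"]))   # True: under .git
print(is_filtered_out("download/ros2_control/doc/logo.png", config["filters"]))  # True: png extension
print(is_filtered_out("download/ros2_control/README.md", config["filters"]))     # False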

3. Creating a vector store with Chroma

Chroma is an open-source database with a focus on AI applications. To get it running, you can simply install it using Python's pip:

pip3 install chromadb

The dataset/rag_chroma_vectorstore.py Python file contains some helper functions to convert the processed dataset into a Chroma vector store that can be directly used by our RAG application.

For this to work, further Python dependencies must be installed (mostly based on langchain, which constitutes the backbone of the RAG implementation): langchain, langchain-chroma, langchain-huggingface, and langchain-ollama.
To install them, run the following command:

pip3 install langchain langchain-chroma langchain-huggingface langchain-ollama
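
As an orientation, a vector store could be built along the following lines. This is a generic sketch using the langchain APIs; the source folder, chunking parameters, and embedding model are assumptions, and it does not reproduce the actual helpers in dataset/rag_chroma_vectorstore.py.

from pathlib import Path

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the processed files as langchain documents (the source folder is an assumption)
docs = [
    Document(page_content=path.read_text(errors="ignore"), metadata={"source": str(path)})
    for path in Path("tmp/deduplicated").rglob("*")
    if path.is_file()
]

# Split long files into overlapping chunks so that they fit the embedding model
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and persist them into a local Chroma database
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db")

# Example query against the store
for doc in vectorstore.similarity_search("how to configure a ros2_control controller", k=3):
    print(doc.metadata["source"])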

Default filters

The following is a good starting point for file filtering that can be used in the dataset_config.yaml file:

filters:
  extensions:
    # Images
    - png
    - jpg
    - jpeg
    - gif  
    # Videos
    - mp4
    - jfif 
    # Documents
    - key
    - PDF
    - pdf
    - docx
    - xlsx
    - pptx
    - csv
    - tsv
    - txt 
    # Audio
    - flac
    - ogg
    - mid
    - webm
    - wav
    - mp3
    # Archives
    - jar
    - aar
    - gz
    - zip
    - bz2  
    # Models
    - onnx
    - pickle
    - model
    - neuron  
    # Others
    - ipynb
    - npy
    - index
    - inv
    - DS_Store
    - rdb
    - pack
    - idx
    - glb
    - gltf
    - len
    - otf
    - unitypackage
    - ttf
    - xz
    - pcm
    - opus
    - package-lock.json
    - yarn.lock
    - Cargo.lock
    - poetry.lock
    - lock
    - clang-tidy
    - clang-format
  paths:
    - .git
    - .github
    - .idea
    - .vscode
    - xcodeproj
    - __pycache__
    - venv
    - .venv