This document will guide you in creating a custom dataset to fine-tune the LLM, or to build a database for RAG models. To accomplish this, we go through the following steps, described more in depth in the corresponding sections:
- downloading the contents from the external sources;
- processing such contents to create a suitable dataset;
- optionally, put the converted dataset into a vector store if you want to implement a RAG application.
Note: this component is inspired by this blog post on the Hugging Face blog, particularly by the code available here.
To create a dataset, we rely on Hugging Face's datatrove library, particularly the datatrove[io] component.
Alongside this, the other required dependencies are tokenizers, regex, and spacy.
To install these dependencies, run the following command:
pip3 install datatrove[io] tokenizers regex spacy
- Adding support for Jupyter notebooks; for starters, some code is present here.
To download the contents from the external sources, you may use the Python script dataset/pull_content.py by calling
python3 -m dataset.pull_content
This script will download the contents into the download folder from the sources specified in the dataset_config.yaml file at the root of this repository.
At the current stage, the sources that can be downloaded are:
- GitHub repositories: they are specified as a list of <user>/<repository> entries within the github_repositories tag. For example, if the link of a repository is https://github.com/idra-lab/z1_ros2, then only idra-lab/z1_ros2 must be specified;
- GitLab repositories: same as GitHub repositories, but the <user>/<repository> entry is downloaded from the GitLab servers. Entries are specified within the gitlab_repositories tag.
For example:
github_repositories:
- ros-controls/ros2_control # Fetches https://github.com/ros-controls/ros2_control
gitlab_repositories:
- libeigen/eigen # Fetches https://gitlab.com/libeigen/eigen
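To make the download step more concrete, the following is a minimal sketch of what such a pull step could look like. It is not the actual dataset/pull_content.py; it assumes PyYAML and the git command-line tool are available, and uses the configuration file and download folder described above.

```python
import subprocess
from pathlib import Path

import yaml  # PyYAML

# Hypothetical re-implementation of the pull step, for illustration only.
config = yaml.safe_load(Path("dataset_config.yaml").read_text())

sources = {
    "https://github.com": config.get("github_repositories") or [],
    "https://gitlab.com": config.get("gitlab_repositories") or [],
}

download_dir = Path("download")
download_dir.mkdir(exist_ok=True)

for base_url, repositories in sources.items():
    for repo in repositories:  # entries look like "<user>/<repository>"
        target = download_dir / repo.replace("/", "__")
        if not target.exists():
            # Shallow clone to keep the download small.
            subprocess.run(
                ["git", "clone", "--depth", "1", f"{base_url}/{repo}", str(target)],
                check=True,
            )
```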
To process the downloaded content, you may use the Python script dataset/process.py by calling
python3 -m dataset.process
This script will process the contents in the download folder.
While performing its tasks, it will store intermediate and temporary results in the tmp folder at the root of this repository.
Additionally, datatrove will log some information in the logs folder.
Within the processing script, we execute a pipeline inspired by Hugging Face's MinHash deduplication example.
MinHash is an algorithm that uses hashing to approximate the Jaccard similarity, an index that measures how similar two files are.
Exploiting this method, we can automatically and easily remove files that appear to be duplicates within the downloaded contents (for example, different versions of the same file that have not been deleted).
Finding near-duplicates with Jaccard similarity and MinHash is a good blog post that explains this concept more in depth; for a technical overview of MinHash algorithms, you may refer to this review.
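To make the idea concrete, here is a small, self-contained example (not part of the repository's pipeline) that estimates the Jaccard similarity of two texts with MinHash; the shingle size and the number of hash functions are arbitrary illustrative choices.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Break a text into overlapping character k-grams (shingles)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """For each of num_hashes seeded hash functions, keep the minimum hash value."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions approximates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "The controller publishes the joint states on the /joint_states topic."
doc_b = "The controller publishes the joint states on a /joint_states topic!"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(f"Estimated Jaccard similarity: {estimated_jaccard(sig_a, sig_b):.2f}")
```

In practice, signatures are additionally split into buckets so that only files sharing at least one bucket need to be compared, which is what keeps the deduplication scalable to large datasets.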
Stages of the pipeline are the following (a hedged sketch of how they map onto datatrove building blocks is given after the list):
- stage 1: load the data and apply some filtering to the files to process. Details on the filtering are provided later;
- stage 2: create the MinHash signature for each file;
- stage 3: find matches between signatures in each bucket;
- stage 4: create clusters of duplicates using the results from all buckets;
- stage 5: construct the final dataset by removing duplicate files based on the MinHash clustering.
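For reference, below is a condensed, hedged sketch of how these stages can be expressed with datatrove, adapted from Hugging Face's MinHash deduplication example. The actual pipeline lives in dataset/process.py; the folder paths and task counts are placeholder assumptions, and block or parameter names may differ between datatrove versions.

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import (
    MinhashConfig,
    MinhashDedupBuckets,
    MinhashDedupCluster,
    MinhashDedupFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

minhash_config = MinhashConfig()  # default signature/bucket parameters

# Stage 2: compute a MinHash signature for each document.
stage_signatures = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("tmp/filtered"),  # output of stage 1 (assumed path)
        MinhashDedupSignature(output_folder="tmp/signatures", config=minhash_config),
    ],
    tasks=4,
    logging_dir="logs/signatures",
)

# Stage 3: find matches between signatures within each bucket.
stage_buckets = LocalPipelineExecutor(
    pipeline=[
        MinhashDedupBuckets(
            input_folder="tmp/signatures",
            output_folder="tmp/buckets",
            config=minhash_config,
        )
    ],
    tasks=minhash_config.num_buckets,
    logging_dir="logs/buckets",
)

# Stage 4: build clusters of duplicates from all buckets.
stage_clusters = LocalPipelineExecutor(
    pipeline=[
        MinhashDedupCluster(
            input_folder="tmp/buckets",
            output_folder="tmp/remove_ids",
            config=minhash_config,
        )
    ],
    tasks=1,
    logging_dir="logs/clusters",
)

# Stage 5: drop the clustered duplicates and write the final dataset.
stage_filter = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("tmp/filtered"),
        MinhashDedupFilter(input_folder="tmp/remove_ids"),
        JsonlWriter("dataset/output"),
    ],
    tasks=4,
    logging_dir="logs/filter",
)

for stage in (stage_signatures, stage_buckets, stage_clusters, stage_filter):
    stage.run()
```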
As for the content download, the processing is configured through the dataset_config.yaml file.
The following subsections will guide you through the configuration options.
When processing the contents, it is possible to filter out the files that shall not be considered during the training procedure (e.g., binaries, images, PDFs, etc.).
Within the configuration file, the filters tag lets you specify some filters that will be applied to each file.
In particular, you may blacklist files based on their extensions, or based on the paths they contain.
An example is
filters:
extensions:
- png # ignore images
- jpg
- jpeg
- gif
paths:
- .git # ignore git-related files
A default set of filters is provided at the bottom of this document.
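To clarify how the two filter families act, the snippet below shows one possible interpretation as a hypothetical helper (the actual logic inside dataset/process.py may differ): a file is dropped if its name ends with a blacklisted extension, or if its path contains a blacklisted fragment.

```python
from pathlib import Path

import yaml  # PyYAML

# Illustrative helper only; not the filtering code used by the processing script.
filters = yaml.safe_load(Path("dataset_config.yaml").read_text())["filters"]
blacklisted_suffixes = tuple(ext.lower() for ext in filters.get("extensions", []))
blacklisted_fragments = filters.get("paths", [])

def is_kept(file_path: Path) -> bool:
    """Keep a file only if neither blacklist matches it."""
    if file_path.name.lower().endswith(blacklisted_suffixes):
        return False  # blacklisted extension (or full file name, e.g. package-lock.json)
    return not any(fragment in str(file_path) for fragment in blacklisted_fragments)

kept_files = [p for p in Path("download").rglob("*") if p.is_file() and is_kept(p)]
print(f"{len(kept_files)} files kept after filtering")
```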
Chroma is an open-source database with a focus on AI applications.
To get it running, you can simply install it using Python's pip:
pip3 install chromadb
The dataset/rag_chroma_vectorstore.py Python file contains some helper functions to convert the processed dataset into a Chroma vector store that can be directly used by our RAG application.
For this to work, further Python dependencies must be installed (mostly based on langchain, which constitutes the backbone of the RAG implementation): langchain, langchain-chroma, langchain-huggingface, and langchain-ollama.
To install them, run the following command:
pip3 install langchain langchain-chroma langchain-huggingface langchain-ollama
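As an illustration of what such a conversion may look like (the helpers in dataset/rag_chroma_vectorstore.py can be organized differently), the sketch below wraps the processed files into langchain Documents, embeds them with a Hugging Face sentence-transformers model, and persists them in a Chroma store; the dataset folder, the collection name, and the embedding model are placeholder assumptions.

```python
from pathlib import Path

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings

# Wrap each processed text file into a langchain Document.
# "dataset/output" is an assumed location for the processed dataset.
documents = [
    Document(page_content=path.read_text(errors="ignore"), metadata={"source": str(path)})
    for path in Path("dataset/output").rglob("*")
    if path.is_file()
]

# Placeholder embedding model; any sentence-transformers model can be used.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build the Chroma vector store and persist it to disk.
vector_store = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="chroma_db",
    collection_name="dataset",
)

# Quick sanity check: retrieve the most similar documents for a query.
for doc in vector_store.similarity_search("How do I configure ros2_control?", k=3):
    print(doc.metadata["source"])
```

The persisted chroma_db folder can then be re-opened by the RAG application, provided the same embedding function is used.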
The following is a good starting point for file filtering that can be used in the dataset_config.yaml file:
filters:
extensions:
# Images
- png
- jpg
- jpeg
- gif
# Videos
- mp4
- jfif
# Documents
- key
- PDF
- pdf
- docx
- xlsx
- pptx
- csv
- tsv
- txt
# Audio
- flac
- ogg
- mid
- webm
- wav
- mp3
# Archives
- jar
- aar
- gz
- zip
- bz2
# Models
- onnx
- pickle
- model
- neuron
# Others
- ipynb
- npy
- index
- inv
- DS_Store
- rdb
- pack
- idx
- glb
- gltf
- len
- otf
- unitypackage
- ttf
- xz
- pcm
- opus
- package-lock.json
- yarn.lock
- Cargo.lock
- poetry.lock
- lock
- clang-tidy
- clang-format
paths:
- .git
- .github
- .idea
- .vscode
- xcodeproj
- __pycache__
- venv
- .venv