This repository contains the development of the project for the PhD course Mastering Foundation Models: Techniques and Applications for Advanced AI Tasks, held by Prof. Subhankar Roy in February 2025.
The goal of the project is to take a hands-on approach to LLMs by implementing a fine-tuning script for local models, as well as a RAG pipeline.
The aim of the project is to develop an improved and personalised Copilot, i.e., an LLM-based code assistant to enhance productivity and code quality in our work.
To achieve this, we first create a custom dataset based on repositories we are interested in (more details can be found in `dataset.md`); with such a dataset, we proceed to:
- fine-tune state-of-the-art, publicly available LLMs on some libraries of our choice, as described in `finetuning.md`;
- create a RAG (retrieval-augmented generation) agent that can provide grounded information in its answers; further details can be found in `rag.md`.
Since this project has lots of dependencies, it may be preferable to work with virtual environments, in order not to pollute the system dependencies.
To create a virtual environment:
- Make sure that the Python `venv` module is installed on your machine. On an Ubuntu system, this can be achieved by calling `sudo apt-get update && sudo apt-get install -y python3-venv`.
- To create a virtual environment in the `.venv` folder, call `python3 -m venv .venv`.
- Once the environment is created, you must activate it in each terminal where you plan to use it. This is achieved by calling `source .venv/bin/activate`.

Once the virtual environment is activated, you may want to install all the dependencies by calling

```bash
pip install -r requirements.txt --no-deps
```
Note: if you install dependencies from the provided `requirements.txt` file, make sure to add the `--no-deps` flag to the `pip install` call: the `langchain` and `datatrove` libraries have incompatibilities in the version of `numpy` they require, and this flag skips pip's version compatibility checking.
Even though this compatibility issue exists, the provided scripts work anyway.
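If you want to double-check that the `--no-deps` install left you with working packages, the following is a minimal sanity-check sketch (the package names are the ones mentioned in the note above):

```python
# Sanity check after `pip install --no-deps`: confirm the key packages
# import correctly and print their installed versions.
from importlib.metadata import version

for pkg in ("numpy", "langchain", "datatrove"):
    __import__(pkg)  # raises ImportError if the package is broken
    print(f"{pkg}: {version(pkg)}")
```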
Tip: if you are on a Linux system, as shown in this thread, you may add the following bash function to your `.bashrc` to automatically activate the virtual environment when you `cd` into the project folder:
```bash
function cd() {
    builtin cd "$@"
    if [[ -z "$VIRTUAL_ENV" ]] ; then
        # No environment active: activate ./.venv if the new directory has one.
        if [[ -d ./.venv ]] ; then
            source ./.venv/bin/activate
        fi
    else
        # An environment is active: deactivate it when leaving its parent tree.
        parentdir="$(dirname "$VIRTUAL_ENV")"
        if [[ "$PWD"/ != "$parentdir"/* ]] ; then
            deactivate
        fi
    fi
}
```
This project is primarily meant to work with local LLMs. In particular, we rely on ollama, a tool that lets you get up and running with local LLMs in a few commands. You may refer to the official website for installation instructions for your platform, but on Linux systems the software can be installed by simply calling

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
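Once ollama is installed and its server is running, you can pull a model and query it programmatically. Below is a minimal sketch using the optional `ollama` Python package (`pip install ollama`); the model name is only an example, not a project requirement:

```python
# Sketch: talk to a locally running ollama server from Python.
# Assumes `pip install ollama` and that the ollama service is running;
# "llama3.2" below is an example model name.
import ollama

ollama.pull("llama3.2")  # download the model if not already present

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a Python one-liner to reverse a string."}],
)
print(response["message"]["content"])
```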
Another tool we rely on for the project is Hugging Face, a machine learning (ML) and data science platform and community that helps users build, deploy, and train ML models.
We use the platform to pull embedding models, as well as whole pre-trained LLMs.
To do so, you must register (for free) on the Hugging Face hub and generate a personal token to download content from the Hugging Face repositories. Once logged in to the hub, go to ⚙️ `Settings` (left bar menu), then `Access Tokens`, and finally `+ Create new token` (top right corner); for our application, a `read`-only token is more than sufficient.
For further information, refer to the official guide.
If you are on a Linux system, we recommend exporting the token through the `HUGGINGFACEHUB_API_TOKEN` environment variable for a streamlined experience when working with Hugging Face.
This can be automated by adding the following line to your `~/.bashrc`:

```bash
export HUGGINGFACEHUB_API_TOKEN="<your-token-here>"
```
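As a quick sanity check, you can authenticate with the exported token from Python; this is a sketch assuming the `huggingface_hub` package is installed (it ships with most Hugging Face-based stacks):

```python
# Sketch: authenticate to the Hugging Face hub with the exported token.
# Assumes HUGGINGFACEHUB_API_TOKEN is set in the environment.
import os

from huggingface_hub import login, whoami

login(token=os.environ["HUGGINGFACEHUB_API_TOKEN"])
print(whoami()["name"])  # prints your hub username if the token is valid
```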
To ease common procedures, we provide a `Makefile` with the following targets:
- `help` (default target) shows the different target options on the terminal;
- `clean` deletes all content generated by the executables, such as the download folder and the processed dataset; it also deletes the virtual environment;
- `venv` initialises a virtual environment and installs the required dependencies;
- `dataset` pulls the content and processes the dataset;
- `rag` runs the RAG application.
Software components
- ollama: a front-end software component that makes it easy to download and execute local instances of free LLM models;
- Hugging Face: the platform where the machine learning community collaborates on models, datasets, and applications. It provides a wide range of utilities, tutorials, and libraries to work with LLM models;
- LangChain: a framework for developing applications powered by large language models, supporting every step of the LLM application lifecycle.
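As a small illustration of how these components fit together, here is a hedged sketch that queries a local ollama model through LangChain; it assumes the `langchain-ollama` integration package is installed and that the example model `llama3.2` has been pulled:

```python
# Sketch: query a local ollama model through LangChain.
# Assumes `pip install langchain-ollama`, a running ollama server,
# and the example model "llama3.2" pulled via `ollama pull`.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2", temperature=0)
answer = llm.invoke("What does a retrieval-augmented generation agent do?")
print(answer.content)
```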
Articles:
- Personal Copilot: Train Your Own Coding Assistant, a blog post on Hugging Face about the fine-tuning/full training of LLM models;
- Build a Retrieval Augmented Generation (RAG) App, a tutorial on LangChain about the implementation of a RAG agent;
- Advanced RAG on Hugging Face documentation using LangChain, an open-source AI cookbook recipe on Hugging Face;
- RAG documentation on Hugging Face;
- Code a simple RAG from scratch, a blog post on Hugging Face.