This repository contains the development of the project for the PhD course Mastering Foundation Models: Techniques and Applications for Advanced AI Tasks, held by Prof. Subhankar Roy in February 2025.
The goal of the project is to take a hands-on approach to LLMs by implementing a fine-tuning script for local models, as well as a RAG pipeline.
The aim of the project is to develop an improved and personalised Copilot, i.e., an LLM-based code assistant to enhance productivity and code quality in our work.
To achieve this, we first create a custom dataset based on repositories we are interested in (more details can be found in `dataset.md`); with such a dataset, we proceed to:
- fine-tune state-of-the-art, publicly available LLMs on some libraries of our choice, as described in `finetuning.md`;
- create a RAG (retrieval-augmented generation) agent that can provide grounded information in its answers; further details can be found in `rag.md`.
Since this project has lots of dependencies, it may be preferable to work with virtual environments, in order not to pollute the system dependencies.
To create a virtual environment:
- Make sure that the Python `venv` module is installed on your machine. On an Ubuntu system, this can be achieved by calling `sudo apt-get update && sudo apt-get install -y python3-venv`.
- To create a virtual environment in the `.venv` folder, call `python3 -m venv .venv`.
- Once the environment is created, you must activate it in each terminal where you plan to use it. This is achieved by calling `source .venv/bin/activate`.

Once the virtual environment is activated, you may want to install all the dependencies by calling

```bash
pip install -r requirements.txt --no-deps
```
Note: if you install dependencies from the provided `requirements.txt` file, make sure to add the `--no-deps` flag to the `pip install` call: the `langchain` and `datatrove` libraries have incompatibilities in the version of `numpy` they require, and this flag skips pip's version compatibility checking.
Even though this compatibility issue exists, the provided scripts work anyway.
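If you want to double-check that the `--no-deps` install left you with working packages, the following is a minimal sanity-check sketch (the package names are the ones mentioned in the note above):

```python
# Sanity check after `pip install --no-deps`: confirm the key packages
# import correctly and print their installed versions.
from importlib.metadata import version

for pkg in ("numpy", "langchain", "datatrove"):
    __import__(pkg)  # raises ImportError if the package is broken
    print(f"{pkg}: {version(pkg)}")
```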
Tip: if you are on a Linux system, as shown in this thread, you may add the following bash function to your `.bashrc` to automatically activate the virtual environment when you `cd` into the project folder:
```bash
function cd() {
    builtin cd "$@"
    if [[ -z "$VIRTUAL_ENV" ]] ; then
        # No environment active: activate ./.venv if the new directory has one.
        if [[ -d ./.venv ]] ; then
            source ./.venv/bin/activate
        fi
    else
        # An environment is active: deactivate it when leaving its parent tree.
        parentdir="$(dirname "$VIRTUAL_ENV")"
        if [[ "$PWD"/ != "$parentdir"/* ]] ; then
            deactivate
        fi
    fi
}
```
This project is primarily meant to work with local LLMs. In particular, we rely on ollama, a tool that lets you get up and running with local LLMs in a few commands. You may refer to the official website for installation instructions for your platform, but on Linux systems the software can be installed by simply calling

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
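Once ollama is installed and its server is running, you can pull a model and query it programmatically. Below is a minimal sketch using the optional `ollama` Python package (`pip install ollama`); the model name is only an example, not a project requirement:

```python
# Sketch: talk to a locally running ollama server from Python.
# Assumes `pip install ollama` and that the ollama service is running;
# "llama3.2" below is an example model name.
import ollama

ollama.pull("llama3.2")  # download the model if not already present

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a Python one-liner to reverse a string."}],
)
print(response["message"]["content"])
```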
Another tool we rely on for the project is Hugging Face, a machine learning (ML) and data science platform and community that helps users build, deploy, and train ML models.
We use the platform to pull embedding models, as well as whole pre-trained LLMs.
To do so, you must register (for free) on the Hugging Face hub and generate a personal token to download content from the Hugging Face repositories. Once logged in to the hub, go to ⚙️ `Settings` (left bar menu), then `Access Tokens`, and finally `+ Create new token` (top right corner); for our application, a `read`-only token is more than sufficient.
For further information, refer to the official guide.
If you are on a Linux system, we recommend exporting the token through the `HUGGINGFACEHUB_API_TOKEN` environment variable for a streamlined experience when working with Hugging Face.
This can be automated by adding the following line to your `~/.bashrc`:

```bash
export HUGGINGFACEHUB_API_TOKEN="<your-token-here>"
```
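As a quick sanity check, you can authenticate with the exported token from Python; this is a sketch assuming the `huggingface_hub` package is installed (it ships with most Hugging Face-based stacks):

```python
# Sketch: authenticate to the Hugging Face hub with the exported token.
# Assumes HUGGINGFACEHUB_API_TOKEN is set in the environment.
import os

from huggingface_hub import login, whoami

login(token=os.environ["HUGGINGFACEHUB_API_TOKEN"])
print(whoami()["name"])  # prints your hub username if the token is valid
```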
To ease common procedures, we provide a `Makefile` with the following targets:
- `help` (default target) shows the different target options on the terminal;
- `clean` deletes all content generated by the executables, such as the download folder and the processed dataset; it also deletes the virtual environment;
- `venv` initialises a virtual environment and installs the required dependencies;
- `dataset` pulls the content and processes the dataset;
- `rag` runs the RAG application.
Software components
- ollama: a front-end software component that makes it easy to download and execute local instances of free LLM models;
- Hugging Face: the platform where the machine learning community collaborates on models, datasets, and applications. It provides a wide range of utilities, tutorials, and libraries to work with LLM models;
- LangChain: a framework for developing applications powered by large language models, supporting every step of the LLM application lifecycle.
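As a small illustration of how these components fit together, here is a hedged sketch that queries a local ollama model through LangChain; it assumes the `langchain-ollama` integration package is installed and that the example model `llama3.2` has been pulled:

```python
# Sketch: query a local ollama model through LangChain.
# Assumes `pip install langchain-ollama`, a running ollama server,
# and the example model "llama3.2" pulled via `ollama pull`.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2", temperature=0)
answer = llm.invoke("What does a retrieval-augmented generation agent do?")
print(answer.content)
```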
Articles:
- Personal Copilot: Train Your Own Coding Assistant, a blog post on Hugging Face about the fine-tuning/full training of LLM models;
- Build a Retrieval Augmented Generation (RAG) App, a tutorial on LangChain about the implementation of a RAG agent;
- Advanced RAG on Hugging Face documentation using LangChain, an open-source AI cookbook recipe on Hugging Face;
- RAG documentation on Hugging Face;
- Code a simple RAG from scratch, a blog post on Hugging Face.