Fine-tuning and RAG for improved LLM-based code assistants

This repository contains the development of the project for the PhD course Mastering Foundation Models: Techniques and Applications for Advanced AI Tasks, held by Prof. Subhankar Roy in February 2025.

The goal of the project is to gain hands-on experience with LLMs: we implement a fine-tuning script for local models, as well as a RAG pipeline.

The aim of the project is to develop an improved and personalised Copilot, i.e., an LLM-based code assistant that enhances productivity and code quality in our work. To achieve this, we first create a custom dataset based on repositories we are interested in (more details can be found in dataset.md); with this dataset, we then proceed to:

  • fine-tune publicly available, state-of-the-art LLMs on some libraries of our choice. This is described in finetuning.md;
  • build a RAG (retrieval-augmented generation) agent that grounds its answers in retrieved context. Further details can be found in rag.md.

Virtual environments

Since this project has many dependencies, it is preferable to work within a virtual environment, in order not to pollute the system-wide packages.

To create a virtual environment:

  1. Make sure that the Python venv module is installed on your machine. On an Ubuntu system, this can be achieved by calling
    sudo apt-get update && sudo apt-get install -y python3-venv
  2. To create a virtual environment in the .venv folder, call
    python3 -m venv .venv
  3. Once the environment is created, you must activate it in each terminal in which you plan to use it. This is achieved by calling
    source .venv/bin/activate
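
As a quick sanity check, the active interpreter should now resolve inside the .venv folder (the path below is illustrative):

which python3
# expected: /path/to/project/.venv/bin/python3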

Once the virtual environment is activated, you may want to install all the dependencies by calling

pip install -r requirements.txt --no-deps

Note: if you install dependencies from the provided requirements.txt file, make sure to add the --no-deps flag to the pip install call: the langchain and datatrove libraries require incompatible versions of numpy, and this flag skips pip's dependency-version checks. Despite the declared incompatibility, the provided scripts work anyway.

Tip: if you are on a Linux system, as shown in this thread, you may add the following bash function to your .bashrc to automatically activate the virtual environment when you cd into the project folder:

function cd() {
  # Run the real cd first
  builtin cd "$@"

  if [[ -z "$VIRTUAL_ENV" ]] ; then
      # No environment active: activate .venv if the new directory has one
      if [[ -d ./.venv ]] ; then
        source ./.venv/bin/activate
      fi
  else
      # An environment is active: deactivate it when leaving its parent tree
      parentdir="$(dirname "$VIRTUAL_ENV")"
      if [[ "$PWD"/ != "$parentdir"/* ]] ; then
        deactivate
      fi
  fi
}

Prerequisites

This project is primarily meant to work with local LLMs. In particular, we rely on ollama, a tool that lets you download and run LLMs locally with a few commands. You may refer to the official website for installation instructions on your platform; on Linux systems, the software can be installed by simply calling

curl -fsSL https://ollama.com/install.sh | sh
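
Once installed, you can download a model and chat with it directly from the terminal; for instance (llama3.2 is just one example among the models available in the ollama library):

ollama pull llama3.2    # download the model weights
ollama run llama3.2     # start an interactive chat session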

The other tool we rely on for the project is Hugging Face, a machine learning (ML) and data science platform and community that helps users build, deploy, and train machine learning models. We use this platform to pull LLM embedding libraries, as well as whole pre-trained LLMs.
To do so, you must register (for free) on the Hugging Face hub and generate a personal token to download content from the Hugging Face repositories. Once logged in to the hub, go to ⚙️ Settings (left bar menu), then Access Tokens, and finally + Create new token (top right corner); for our application, a read-only token is more than sufficient. For further information, refer to the official guide. If you are on a Linux system, we recommend exporting the token in the HUGGINGFACEHUB_API_TOKEN environment variable for a streamlined experience when working with Hugging Face. This can be automated by adding the following line to your ~/.bashrc:

export HUGGINGFACEHUB_API_TOKEN="<your-token-here>"
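
To verify that the token is valid and visible to your shell, a simple sanity check (assuming curl is available) is to query the Hugging Face whoami endpoint:

curl -s -H "Authorization: Bearer $HUGGINGFACEHUB_API_TOKEN" https://huggingface.co/api/whoami-v2
# a JSON response containing your username means the token works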

Makefile

To simplify common procedures, we provide a Makefile with the following targets (a typical invocation is shown after the list):

  • help (default target) shows on the terminal the different target options;
  • clean deletes all content generated by the executables, such as the download folder and the processed dataset. It also deletes the virtual environment;
  • venv initialises a virtual environment and installs the required dependencies;
  • dataset pulls the content and processes the dataset;
  • rag runs the RAG application.
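
For instance, a full setup-to-demo run chains the targets as follows:

make venv       # create the environment and install the dependencies
make dataset    # download and process the dataset
make rag        # launch the RAG application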

Useful links and libraries

Software components

  • ollama: a front-end software component that makes it easy to download and execute local instances of free LLM models;
  • Hugging Face: the platform where the machine learning community collaborates on models, datasets, and applications. It provides a wide range of utilities, tutorials, and libraries to work with LLM models;
  • LangChain: a framework and developer platform covering every step of the LLM-powered application lifecycle: building, debugging, testing, and monitoring your LLM applications.
