
GweLLM

French version

Hint: "Gwell" = "Better" in Breton


Fine-Tuning a Breton speaking Chat Model

Experiments on adapting a pretrained conversational LLM to a new language, in this case Breton, as I live in sunny Brittany 😎😉

GweLLM's initial motivation was to build open-source, lightweight language models for Breton, allowing:

  • Local deployment and execution (even on CPU only)
  • Hassle-free use (no external API limitations)

Output models and datasets will be made available on my HuggingFace repo 🤗.

This is a Work in Progress...

Approach

Let's break down the problem:

  • The global idea is to fine-tune an existing multilingual LLM, ideally one that already saw some Breton during its tokenizer/model pre-training.
  • To proceed with such a fine-tuning, we need a Breton instruction dataset, which doesn't seem to exist out-of-the-box.
  • We can start from a French (or English) instruction dataset and translate it to Breton.
  • Finally, with that dataset, I can fine-tune the foundation LLM of my choice.

Here is a shortlist of the challenges I first identified:

  • Finding a good (and free) French -> Breton translation tool:
    • APIs?
      • Apertium seems like a very interesting project, but the only translation pair involving Breton is br->fr; the way back is not available yet 😕
    • LLMs?
      • Google's T5
        • Multilingual, but no Breton in the training corpus 👎
      • Meta's M2M100_418M
        • Multilingual, Breton included in the training corpus 👍
        • Raw Breton performance is not good, will need fine-tuning! (see the translation sketch right after this list)
  • Finding a Breton instruction dataset:
    • Not found yet, will have to build one myself 💪
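
To illustrate the M2M100 point, here is a minimal sketch of raw fr->br translation with the stock facebook/m2m100_418M checkpoint and HuggingFace's Transformers (the example sentence is arbitrary); this untuned baseline is what Gallek aims to improve on:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# translate French -> Breton: set the source language, then force the Breton BOS token at generation
tokenizer.src_lang = "fr"
encoded = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("br"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```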

So this project has 3 "by-products":

  • A French -> Breton Translation Model called Gallek (meaning "French" in Breton)
  • A Breton Instruction Dataset called Goulenn (meaning "Question" in Breton)
  • A Breton Conversational LLM called GweLLM ("Gwell" meaning "Better" in Breton)

All the code is based mainly on HuggingFace's Transformers library.

Building the Gallek fr->br translation model

For now:

What's inside the gallek subdirectory:

  • train_translation_model.py : used to fine-tune the m2m100 model on the aforementioned datasets, with a BLEU score evaluation at the end of training (see the sketch after this list)
  • test_translation_model.py : used to test the fine-tuned gallek model on a single French input text (also includes Apertium reverse translation)
  • test_translation_mode_gradio : used to test the fine-tuned gallek model using a Gradio UI
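
For reference, the gist of such a fine-tuning looks roughly like the sketch below. This is not the actual train_translation_model.py; the data file name and hyper-parameters are placeholders. It uses Seq2SeqTrainer with a sacreBLEU evaluation run at the end of training:

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (DataCollatorForSeq2Seq, M2M100ForConditionalGeneration,
                          M2M100Tokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name, src_lang="fr", tgt_lang="br")
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
# make evaluation generate Breton directly
model.generation_config.forced_bos_token_id = tokenizer.get_lang_id("br")

# hypothetical file holding {"fr": ..., "br": ...} sentence pairs
dataset = load_dataset("json", data_files="fr_br_pairs.json")["train"].train_test_split(test_size=0.1)

def preprocess(batch):
    return tokenizer(batch["fr"], text_target=batch["br"], truncation=True, max_length=128)

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return {"bleu": bleu.compute(predictions=decoded_preds,
                                 references=[[ref] for ref in decoded_labels])["score"]}

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="gallek-m2m100",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3,
                                  predict_with_generate=True),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports the BLEU score at the end of training
```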

TODOs:

  • Add new datasets to the training corpus (the initial one was ofis_publik)
  • Add some gguf conversion/quantization scripts using llama.cpp (spoiler alert: m2m100 seems unsupported 😱)
  • Reach a high-quality 50 BLEU score
  • Train a bidirectional version

Building the Goulenn Breton Instruct Dataset

For now:

What's inside the goulenn subdirectory:

  • dataset_translation.py : used to batch-translate the original French Alpaca instruction dataset into Breton
  • convert_dataset.py : used to convert the Arrow-formatted translated dataset to json and parquet
  • concatenate_datasets.py : used to concatenate two Arrow datasets, in case translation has been fragmented (see the sketch after this list)
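
The concatenation/conversion steps essentially boil down to a few calls to the datasets library; here is a minimal sketch (the paths are placeholders, not necessarily the ones the scripts use):

```python
from datasets import concatenate_datasets, load_from_disk

# Arrow datasets saved to disk by a (possibly fragmented) translation run
part_1 = load_from_disk("goulenn_part_1")
part_2 = load_from_disk("goulenn_part_2")

# concatenate the fragments, then export to json and parquet
full = concatenate_datasets([part_1, part_2])
full.to_json("goulenn_alpaca_br.json")
full.to_parquet("goulenn_alpaca_br.parquet")
```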

TODOs:

  • Translate 50k samples (available on HF🤗 here)
  • Translate the whole 110k (available on HF🤗 here)
  • Generate new instruction data using a "Magpie"-like synthesis approach (WIP in goulenn/magpie_instruct_dataset_generation.py, see the sketch after this list)
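
For context, the core trick of Magpie-style synthesis is to prompt an aligned chat model with only its user-turn template prefix, so that it "completes" the prompt by inventing an instruction, and then to answer that instruction with the same model. Below is a rough sketch of that idea, not the actual WIP script; the model choice and decoding settings are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# 1) feed only the user-turn prefix of the chat template, so the model generates an instruction
user_prefix = "<start_of_turn>user\n"
inputs = tokenizer(user_prefix, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=1.0)
instruction = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# 2) answer the generated instruction with the same model to obtain an (instruction, output) pair
chat = [{"role": "user", "content": instruction}]
prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
answer_ids = model.generate(prompt, max_new_tokens=256)
answer = tokenizer.decode(answer_ids[0][prompt.shape[1]:], skip_special_tokens=True)
print({"instruction": instruction, "output": answer})
```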

Fine-Tuning GweLLM

For now:

  • Based on the google/gemma-2-2b-it base model (seems to already know a bit of Breton)
  • Trained on the 50k-sample Goulenn dataset (see the training sketch below)
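
The fine-tuning itself is an adapter training on top of gemma-2-2b-it (LoRA is one natural choice here, but an assumption on my part); below is a minimal sketch with peft and Transformers, where the data file, prompt formatting and hyper-parameters are placeholders rather than what train_model_instruct.py actually does:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# wrap the base model with a LoRA adapter (rank and target modules are illustrative)
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)

# hypothetical file: Goulenn samples already rendered as chat-formatted strings in a "text" column
dataset = load_dataset("json", data_files="goulenn_chat_50k.json")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gwellm-adapter",
                           per_device_train_batch_size=2,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("gwellm-adapter")  # saves the adapter only, to be merged later
```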

What's inside the gwellm subdirectory:

  • train_model_instruct.py : used to fine-tune the Breton speaking instruct model
  • test_model_instruct : used to test the fine-tuned model (unmerged adapter)
  • merge_adapter.py : used to merge the fine-tuned adapter model into the base model (see the sketch after this list)
  • test_model_instruct_gradio.py : used to test the quantized gguf model using a Gradio chat UI
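
The merge step essentially loads the base model, applies the adapter, and saves the merged weights; here is a minimal sketch with peft (the adapter path is a placeholder, not necessarily what merge_adapter.py uses):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the base model, apply the trained adapter, then fold the adapter weights into the base
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")
merged = PeftModel.from_pretrained(base, "gwellm-adapter").merge_and_unload()

# save the standalone merged model (plus tokenizer) for later gguf conversion/quantization
merged.save_pretrained("gwellm-merged")
AutoTokenizer.from_pretrained("google/gemma-2-2b-it").save_pretrained("gwellm-merged")
```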

TODOs:

  • Release an initial beta version
  • Distribute as a llamafile
  • Hybrid fine-tuning (start with pre-training on a raw Breton text corpus)

TODO: FT strategy from [Instruction Pre-Training: Language Models are Supervised Multitask Learners]

TODO: sh scripts

Using GweLLM

Import in GPT4All

TODO

Additional Resources

Finding Breton Datasets

Here are the few resources I found after initial googling:

Publications

Misc

Troubleshooting

Installing llama-cpp-python

Installing llama-cpp-python can be a bit tricky, as I really struggled to install it on WSL2 (Ubuntu 22.04):

  • The classic pip install llama-cpp-python systematically failed, as described in this issue
  • The documented way of installing a prebuilt CPU-only wheel, pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu, also failed
  • I finally downloaded the llama_cpp_python-0.3.2-cp310-cp310-linux_x86_64.whl package from the wheel repository and installed it manually with pip install llama_cpp_python-0.3.2-cp310-cp310-linux_x86_64.whl
  • As I encountered issues related to the libc.musl dependency, I had to use this workaround
