Hint: "Gwell" = "Better" in Breton
Experiments on adapting a pretrained conversational LLM to a new language, in this case Breton, as I live in sunny Brittany 😎😉
GweLLM's initial motivation was to build open-source, lightweight language models for Breton, allowing:
- Local deployment and execution (even on CPU only)
- Hassle-free use (no external API limitations)
Output models and datasets will be made available on my HuggingFace repo 🤗.
This is a Work in Progress...
Let's break down the problem:
- The global idea is to fine-tune an existing multilingual LLM, ideally one that saw some Breton during its tokenizer/model pre-training.
- To perform such fine-tuning, we need a Breton instruction dataset, which doesn't seem to exist out of the box.
- We can start from a French (or English) instruction dataset and translate it to Breton.
- Finally, with that dataset, I can fine-tune the foundation LLM of my choice.
Here is a shortlist of the challenges I first identified:
- Finding a good (and free) French -> Breton translation tool:
  - APIs?
  - LLMs?
    - Google's T5
      - Multilingual, but no Breton in training corpus 👎
    - Meta's M2M100_418M
      - Multilingual, Breton included in training corpus 👍
      - Raw Breton performance is not good, will need fine-tuning (see the quick translation check after this list)!
- Finding a Breton instruction dataset:
  - Not found yet, will have to build one myself 💪
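As a quick sanity check of M2M100's out-of-the-box Breton, here is a minimal sketch using HuggingFace's Transformers; the checkpoint is the public facebook/m2m100_418M and the example sentence is arbitrary:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# load the public multilingual checkpoint (Breton has language code "br")
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# translate a French sentence into Breton
tokenizer.src_lang = "fr"
encoded = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("br"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```

The quality of this raw output is precisely what motivates the fine-tuning step below.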
So this project has 3 "by-products":
- A French -> Breton Translation Model called Gallek (meaning "French" in Breton)
- A Breton Instruction Dataset called Goulenn (meaning "Question" in Breton)
- A Breton Conversational LLM called GweLLM ("Gwell" meaning "Better" in Breton)
All code is mainly based on HuggingFace's Transformers library.
For now, the Gallek translation model is:
- Based on the facebook/m2m100_418M base model
- Fine-tuned on the Bretagne/ofis_publik_br-fr, Bretagne/OpenSubtitles_br_fr & Bretagne/Autogramm_Breton_translation datasets
- Fine-tuned monodirectionally (fr -> br)
- Reaches an honorable BLEU score of 40 on a 20% held-out test split of the dataset
What's inside the `gallek` subdirectory:
- `train_translation_model.py`: used to fine-tune the m2m100 model on the aforementioned datasets, with BLEU score evaluation at the end of training (a simplified sketch of this workflow follows the list)
- `test_translation_model.py`: used to test the fine-tuned gallek model on a single input French text (also includes Apertium reverse translation)
- `test_translation_mode_gradio`: used to test the fine-tuned gallek model using a Gradio UI
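This is not the actual `train_translation_model.py`, just a hedged sketch of what such a fine-tuning loop can look like with Transformers' Seq2SeqTrainer; the dataset column names ("fr" / "br"), hyperparameters and output path are assumptions:

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (DataCollatorForSeq2Seq, M2M100ForConditionalGeneration,
                          M2M100Tokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name, src_lang="fr", tgt_lang="br")
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# one of the parallel corpora, with a 20% held-out test split
# (the "fr" / "br" column names are an assumption, check the actual dataset card)
dataset = load_dataset("Bretagne/ofis_publik_br-fr")["train"].train_test_split(test_size=0.2)

def preprocess(batch):
    # tokenize French sources and Breton targets in one pass
    return tokenizer(batch["fr"], text_target=batch["br"], truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)  # undo the ignore index
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    score = bleu.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])
    return {"bleu": score["score"]}

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="gallek-m2m100-fr-br",   # placeholder output path
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        num_train_epochs=3,
        predict_with_generate=True,  # generate translations during evaluation for BLEU
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())  # BLEU score at the end of training
```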
TODOs:
- Add new datasets to the training corpus (the initial one was ofis_publik)
- Add some gguf conversion/quantization scripts using llama.cpp, spoiler alert: m2m100 seems unsupported 😱
- Reach a high-quality BLEU score of 50
- Train a bidirectional version
For now, the Goulenn dataset is:
- Based on the original jpacifico/French-Alpaca-dataset-Instruct-110K, thanks to the work of Jonathan Pacifico.
- Translated to Breton using the Gallek model
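For illustration, here is a rough sketch of how such a batch translation could be done with `datasets.map`; the Gallek checkpoint path and the Alpaca-style column names (instruction/input/output) are assumptions:

```python
from datasets import load_dataset
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# hypothetical local path (or hub id) of the fine-tuned Gallek checkpoint
gallek_path = "gallek-m2m100-fr-br"
tokenizer = M2M100Tokenizer.from_pretrained(gallek_path, src_lang="fr", tgt_lang="br")
model = M2M100ForConditionalGeneration.from_pretrained(gallek_path)

def translate_batch(texts):
    # translate a list of French strings into Breton
    encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("br"))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

alpaca = load_dataset("jpacifico/French-Alpaca-dataset-Instruct-110K")["train"]

def translate_example(batch):
    # translate each Alpaca field (empty "input" fields would ideally be skipped)
    return {col: translate_batch(batch[col]) for col in ("instruction", "input", "output")}

goulenn = alpaca.map(translate_example, batched=True, batch_size=16)
goulenn.save_to_disk("goulenn_breton_alpaca")  # saved in arrow format
```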
What's inside the `goulenn` subdirectory:
- `dataset_translation.py`: used to batch translate the original French Alpaca instructions dataset into Breton
- `convert_dataset.py`: used to convert the `arrow` formatted translated dataset to `json` and `parquet` (a short conversion sketch follows the list)
- `concatenate_datasets.py`: used to concatenate two `arrow` datasets, in case translation has been fragmented
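For reference, the arrow → json/parquet conversion boils down to a few `datasets` calls (the paths are placeholders):

```python
from datasets import load_from_disk

# load the arrow-formatted translated dataset saved by the translation step
dataset = load_from_disk("goulenn_breton_alpaca")

# export to json lines and parquet
dataset.to_json("goulenn_breton_alpaca.jsonl")
dataset.to_parquet("goulenn_breton_alpaca.parquet")
```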
TODOs:
- Translate 50k samples (available on HF🤗 here)
- Translate the whole 110k (available on HF🤗 here)
- Generate new instruction data using a "Magpie"-like synthesis approach (WIP in `goulenn/magpie_instruct_dataset_generation.py`)
For now, the GweLLM model is:
- Based on the google/gemma-2-2b-it base model (seems to already know a bit of Breton)
- Trained on the Goulenn 50k instruction dataset
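This is not the actual training script, but a minimal LoRA fine-tuning sketch using trl's SFTTrainer, assuming Alpaca-style columns in the Goulenn dataset; hyperparameters and paths are illustrative only:

```python
import torch
from datasets import load_from_disk
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base_model = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# the translated Goulenn instruction dataset (placeholder path)
dataset = load_from_disk("goulenn_breton_alpaca")

def to_chat(example):
    # map Alpaca-style fields onto gemma's chat template (column names are assumptions)
    prompt = example["instruction"] + ("\n" + example["input"] if example["input"] else "")
    messages = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": example["output"]}]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(to_chat)

# lightweight LoRA adapter instead of full fine-tuning
peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="gwellm-gemma2-2b-adapter",
                   num_train_epochs=1,
                   per_device_train_batch_size=2,
                   dataset_text_field="text"),
)
trainer.train()
```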
What's inside the `gwellm` subdirectory:
- `train_model_instruct.py`: used to fine-tune the Breton speaking instruct model
- `test_model_instruct`: used to test the fine-tuned model (unmerged adapter)
- `merge_adapter.py`: used to merge the fine-tuned adapter into the base model (a minimal sketch follows the list)
- `test_model_instruct_gradio.py`: used to test the quantized gguf model using a Gradio chat UI
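As an illustration of the adapter-merging step, a minimal sketch with `peft` (the adapter and output paths are placeholders):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "google/gemma-2-2b-it"
adapter_path = "gwellm-gemma2-2b-adapter"   # placeholder: output of the fine-tuning step

# load the base model, attach the LoRA adapter, then fold the adapter weights in
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, adapter_path)
merged = model.merge_and_unload()

# save the standalone merged model (plus tokenizer) for later gguf conversion
merged.save_pretrained("gwellm-merged")
AutoTokenizer.from_pretrained(base_model).save_pretrained("gwellm-merged")
```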
TODOs:
- Release an initial beta version
- Distribute as a Llamafile
- Hybrid fine-tuning (start with pre-training on a raw Breton text corpus)
TODO FT Strategy [Instruction Pre-Training: Language Models are Supervised Multitask Learners]
TODO sh scripts
TODO
Here are the few resources I found after initial googling:
- Text corpus at the French public office for the Breton language
- The "Bretagne" organization on Hugging Face 👍
- Soon after releasing the first Gallek translator model, I stumbled upon this French paper describing the same m2m100 Breton fine-tuning approach: Loïc Grobol, Mélanie Jouitteau. ARBRES Kenstur: a Breton-French Parallel Corpus Rooted in Field Linguistics. LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, ELRA Language Resources Association; International Committee on Computational Linguistics, May 2024, Torino, Italy. hal-04551941
Installing `llama-cpp-python` can be a bit tricky, as I really struggled to install it on WSL2 (Ubuntu 22.04):
- The classic `pip install llama-cpp-python` systematically failed, as described in this issue
- The documented way of installing a prebuilt cpu-only wheel `pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu` also failed
- I finally downloaded the `llama_cpp_python-0.3.2-cp310-cp310-linux_x86_64.whl` package from the wheel repository and installed it manually with `pip install llama_cpp_python-0.3.2-cp310-cp310-linux_x86_64.whl`
- As I encountered issues related to the `libc.musl` dependency, I had to use this workaround
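Once installed, a quantized gguf model can be loaded and queried in a few lines; the gguf filename below is a placeholder:

```python
from llama_cpp import Llama

# load the quantized gguf model (path/filename is a placeholder)
llm = Llama(model_path="gwellm-q4_k_m.gguf", n_ctx=2048)

# simple chat completion round-trip
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Demat! Penaos emañ kont?"}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```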