Hint: "Gwell" = "Better" in Breton
Experiments on adapting a pretrained conversational LLM to a new language, in this case Breton, as I live in sunny Brittany 😎😉
GweLLM's initial motivation was to build open-source, lightweight language models for Breton, allowing:
- Local deployment and execution (even on CPU only)
- Hassle-free use (no external API limitations)
Output models and datasets will be made available on my HuggingFace repo 🤗.
This is a Work in Progress...
Let's break down the problem:
- The global idea is to fine-tune an existing multilingual LLM, ideally one that saw some Breton during its tokenizer/model pre-training.
- To perform such fine-tuning, we need a Breton instruction dataset, which doesn't seem to exist out of the box.
- We can start from a French (or English) instruction dataset and translate it to Breton.
- Finally, with that dataset, I can fine-tune the foundation LLM of my choice.
Here is a shortlist of the challenges I first identified:
- Finding a good (and free) French -> Breton translation tool:
  - APIs?
  - LLMs?
    - Google's T5
      - Multilingual, but no Breton in training corpus 👎
    - Meta's M2M100_418M
      - Multilingual, Breton included in training corpus 👍
      - Raw Breton performance is not good, will need fine-tuning (see the quick translation check after this list)!
- Finding a Breton instruction dataset:
  - Not found yet, will have to build one myself 💪
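As a quick sanity check of M2M100's out-of-the-box Breton, here is a minimal sketch using HuggingFace's Transformers; the checkpoint is the public facebook/m2m100_418M and the example sentence is arbitrary:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# load the public multilingual checkpoint (Breton has language code "br")
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# translate a French sentence into Breton
tokenizer.src_lang = "fr"
encoded = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("br"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```

The quality of this raw output is precisely what motivates the fine-tuning step below.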
So this project has 3 "by-products":
- A French -> Breton Translation Model called Gallek (meaning "French" in Breton)
- A Breton Instruction Dataset called Goulenn (meaning "Question" in Breton)
- A Breton Conversational LLM called GweLLM ("Gwell" meaning "Better" in Breton)
All code is mainly based on HuggingFace's Transformers library.
For now, the Gallek translation model is:
- Based on the facebook/m2m100_418M base model
- Fine-tuned on the Bretagne/ofis_publik_br-fr, Bretagne/OpenSubtitles_br_fr & Bretagne/Autogramm_Breton_translation datasets
- Fine-tuned monodirectionally (fr -> br)
- Reaches an honorable BLEU score of 40 on a 20% held-out test split of the dataset
What's inside the `gallek` subdirectory:
- `train_translation_model.py`: used to fine-tune the m2m100 model on the aforementioned datasets, with BLEU score evaluation at the end of training (a simplified sketch of this workflow follows the list)
- `test_translation_model.py`: used to test the fine-tuned gallek model on a single input French text (also includes Apertium reverse translation)
- `test_translation_mode_gradio`: used to test the fine-tuned gallek model using a Gradio UI
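This is not the actual `train_translation_model.py`, just a hedged sketch of what such a fine-tuning loop can look like with Transformers' Seq2SeqTrainer; the dataset column names ("fr" / "br"), hyperparameters and output path are assumptions:

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (DataCollatorForSeq2Seq, M2M100ForConditionalGeneration,
                          M2M100Tokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name, src_lang="fr", tgt_lang="br")
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# one of the parallel corpora, with a 20% held-out test split
# (the "fr" / "br" column names are an assumption, check the actual dataset card)
dataset = load_dataset("Bretagne/ofis_publik_br-fr")["train"].train_test_split(test_size=0.2)

def preprocess(batch):
    # tokenize French sources and Breton targets in one pass
    return tokenizer(batch["fr"], text_target=batch["br"], truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)  # undo the ignore index
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    score = bleu.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])
    return {"bleu": score["score"]}

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="gallek-m2m100-fr-br",   # placeholder output path
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        num_train_epochs=3,
        predict_with_generate=True,  # generate translations during evaluation for BLEU
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())  # BLEU score at the end of training
```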
TODOs:
- Add new datasets to the training corpus (the initial one was ofis_publik)
- Add some gguf conversion/quantization scripts using llama.cpp, spoiler alert: m2m100 seems unsupported 😱
- Reach a high-quality BLEU score of 50
- Train a bidirectional version
For now, the Goulenn dataset is:
- Based on the original jpacifico/French-Alpaca-dataset-Instruct-110K, thanks to the work of Jonathan Pacifico.
- Translated to Breton using the Gallek model
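For illustration, here is a rough sketch of how such a batch translation could be done with `datasets.map`; the Gallek checkpoint path and the Alpaca-style column names (instruction/input/output) are assumptions:

```python
from datasets import load_dataset
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# hypothetical local path (or hub id) of the fine-tuned Gallek checkpoint
gallek_path = "gallek-m2m100-fr-br"
tokenizer = M2M100Tokenizer.from_pretrained(gallek_path, src_lang="fr", tgt_lang="br")
model = M2M100ForConditionalGeneration.from_pretrained(gallek_path)

def translate_batch(texts):
    # translate a list of French strings into Breton
    encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("br"))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

alpaca = load_dataset("jpacifico/French-Alpaca-dataset-Instruct-110K")["train"]

def translate_example(batch):
    # translate each Alpaca field (empty "input" fields would ideally be skipped)
    return {col: translate_batch(batch[col]) for col in ("instruction", "input", "output")}

goulenn = alpaca.map(translate_example, batched=True, batch_size=16)
goulenn.save_to_disk("goulenn_breton_alpaca")  # saved in arrow format
```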
What's inside the `goulenn` subdirectory:
- `dataset_translation.py`: used to batch translate the original French Alpaca instructions dataset into Breton
- `convert_dataset.py`: used to convert the `arrow` formatted translated dataset to `json` and `parquet` (a short conversion sketch follows the list)
- `concatenate_datasets.py`: used to concatenate two `arrow` datasets, in case translation has been fragmented
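For reference, the arrow → json/parquet conversion boils down to a few `datasets` calls (the paths are placeholders):

```python
from datasets import load_from_disk

# load the arrow-formatted translated dataset saved by the translation step
dataset = load_from_disk("goulenn_breton_alpaca")

# export to json lines and parquet
dataset.to_json("goulenn_breton_alpaca.jsonl")
dataset.to_parquet("goulenn_breton_alpaca.parquet")
```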
TODOs:
- Translate 50k samples (available on HF🤗 here)
- Translate the whole 110k (available on HF🤗 here)
- Generate new instruction data using a "Magpie"-like synthesis approach (WIP in `goulenn/magpie_instruct_dataset_generation.py`)
For now, the GweLLM model is:
- Based on the google/gemma-2-2b-it base model (seems to already know a bit of Breton)
- Trained on the Goulenn 50k instruction dataset
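This is not the actual training script, but a minimal LoRA fine-tuning sketch using trl's SFTTrainer, assuming Alpaca-style columns in the Goulenn dataset; hyperparameters and paths are illustrative only:

```python
import torch
from datasets import load_from_disk
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base_model = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# the translated Goulenn instruction dataset (placeholder path)
dataset = load_from_disk("goulenn_breton_alpaca")

def to_chat(example):
    # map Alpaca-style fields onto gemma's chat template (column names are assumptions)
    prompt = example["instruction"] + ("\n" + example["input"] if example["input"] else "")
    messages = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": example["output"]}]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(to_chat)

# lightweight LoRA adapter instead of full fine-tuning
peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="gwellm-gemma2-2b-adapter",
                   num_train_epochs=1,
                   per_device_train_batch_size=2,
                   dataset_text_field="text"),
)
trainer.train()
```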
What's inside the `gwellm` subdirectory:
- `train_model_instruct.py`: used to fine-tune the Breton speaking instruct model
- `test_model_instruct`: used to test the fine-tuned model (unmerged adapter)
- `merge_adapter.py`: used to merge the fine-tuned adapter into the base model (a minimal sketch follows the list)
- `test_model_instruct_gradio.py`: used to test the quantized gguf model using a Gradio chat UI
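As an illustration of the adapter-merging step, a minimal sketch with `peft` (the adapter and output paths are placeholders):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "google/gemma-2-2b-it"
adapter_path = "gwellm-gemma2-2b-adapter"   # placeholder: output of the fine-tuning step

# load the base model, attach the LoRA adapter, then fold the adapter weights in
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, adapter_path)
merged = model.merge_and_unload()

# save the standalone merged model (plus tokenizer) for later gguf conversion
merged.save_pretrained("gwellm-merged")
AutoTokenizer.from_pretrained(base_model).save_pretrained("gwellm-merged")
```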
TODOs:
- Release an initial beta version
- Distribute as a Llamafile
- Hybrid fine-tuning (start with pre-training on a raw Breton text corpus)
TODO FT Strategy [Instruction Pre-Training: Language Models are Supervised Multitask Learners]
TODO sh scripts
TODO
Here are the few resources I found after initial googling:
- Text corpus at the French public office for the Breton language
- The "Bretagne" organization on Hugging Face 👍
- Soon after releasing the first Gallek translator model, I stumbled upon this French paper describing the same m2m100 Breton fine-tuning approach: Loïc Grobol, Mélanie Jouitteau. ARBRES Kenstur: a Breton-French Parallel Corpus Rooted in Field Linguistics. LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, ELRA Language Resources Association; International Committee on Computational Linguistics, May 2024, Torino, Italy. hal-04551941
Installing `llama-cpp-python` can be a bit tricky, as I really struggled to install it on WSL2 (Ubuntu 22.04):
- The classic `pip install llama-cpp-python` systematically failed, as described in this issue
- The documented way of installing a prebuilt cpu-only wheel `pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu` also failed
- I finally downloaded the `llama_cpp_python-0.3.2-cp310-cp310-linux_x86_64.whl` package from the wheel repository and installed it manually with `pip install llama_cpp_python-0.3.2-cp310-cp310-linux_x86_64.whl`
- As I encountered issues related to the `libc.musl` dependency, I had to use this workaround
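Once installed, a quantized gguf model can be loaded and queried in a few lines; the gguf filename below is a placeholder:

```python
from llama_cpp import Llama

# load the quantized gguf model (path/filename is a placeholder)
llm = Llama(model_path="gwellm-q4_k_m.gguf", n_ctx=2048)

# simple chat completion round-trip
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Demat! Penaos emañ kont?"}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```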