Merge pull request #466 from superlinked/robertdhayanturner-patch-2
Update rag-application-communication-system.md
robertdhayanturner authored Aug 14, 2024
2 parents ec04b24 + 991cadf commit ac58bf2
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/articles/rag-application-communication-system.md
@@ -138,7 +138,7 @@ A good fine-tuning dataset, though it requires a significant amount of careful m

Preparation of the instruction dataset and base model improvement should be your main focus; these have the most impact on performance. I don't spend much time optimizing the training design beyond a few hyperparameters (learning rate, batch size, etc.). I've also generally stopped looking into preference fine-tuning (like DPO); the time spent was not worth the very few improvement points.

- While it's far less common, you can also apply this approach - fine-tuning your instruction dataset using RAG-generated synthetic data - [to embedding models](https://huggingface.co/blog/davanstrien/synthetic-similarity-datasets). Synthetic data makes it considerably easier to create an instruction dataset that maps the expected format of the similarity dataset (including queries and “hard negatives”). Fine-tuning your embedding models with synthetic data will confer the same benefits as LLM fine-tuning: cost savings (a much smaller model that demonstrates the same level of performance as a big one) and appropriateness, by bringing the “similarity” score closer to the expectations of your retrieval system.
+ While it's far less common, you can also apply this approach (i.e., fine-tuning your instruction dataset using RAG-generated synthetic data) [to embedding models](https://huggingface.co/blog/davanstrien/synthetic-similarity-datasets). Synthetic data makes it considerably easier to create an instruction dataset that maps the expected format of the similarity dataset (including queries and “hard negatives”). Fine-tuning your embedding models with synthetic data will confer the same benefits as LLM fine-tuning: cost savings (a much smaller model that demonstrates the same level of performance as a big one) and appropriateness, by bringing the “similarity” score closer to the expectations of your retrieval system.

### 4.2 Fine-tuning for robustness

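A minimal sketch of the "few hyperparameters" the context paragraph above says are worth touching when fine-tuning on an instruction dataset; it assumes the Hugging Face transformers library, and the model output path and values are illustrative rather than taken from the article.

```python
# Hedged sketch: only the handful of training-design knobs the article calls out
# (learning rate, batch size, etc.). All values here are illustrative assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out/instruct-ft",        # hypothetical output path
    learning_rate=2e-5,                  # the main knob worth tuning
    per_device_train_batch_size=8,       # batch size, the other main knob
    gradient_accumulation_steps=4,       # effective batch size = 8 * 4
    num_train_epochs=3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
)
# The arguments are then handed to whichever trainer wraps the instruction
# dataset (e.g., transformers.Trainer or TRL's SFTTrainer).
```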

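And a minimal sketch of what the changed paragraph describes: fine-tuning an embedding model on a synthetic similarity dataset of query / positive / hard-negative triples. It assumes the sentence-transformers library; the model name, example texts, and output path are illustrative assumptions, not taken from the article or the linked post.

```python
# Hedged sketch: fine-tune an embedding model on synthetic (query, positive,
# hard negative) triples, as the changed paragraph describes.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative base model

# Each row: a RAG-style query, the chunk that answers it, and a "hard negative"
# chunk that looks relevant but does not answer it (all synthetic in practice).
train_examples = [
    InputExample(texts=[
        "How do I rotate an API key?",
        "To rotate a key, create a new key in the dashboard and revoke the old one.",
        "API keys are used to authenticate requests to the service.",
    ]),
    # ... thousands more synthetic triples
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives + the provided hard negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("out/embeddings-ft")  # hypothetical output path
```

MultipleNegativesRankingLoss already treats every other positive in the batch as a negative; the synthetic hard negatives add the near-miss cases that random in-batch negatives would not cover, which is what pulls the "similarity" score toward the retrieval system's expectations.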