Add documentation to various parts of the scripts and pipeline (#298)
Showing 21 changed files with 382 additions and 180 deletions.
---
layout: default
title: Bicleaner
parent: Data cleaning
---

# Bicleaner

Bicleaner is a tool that detects noisy sentence pairs in a parallel corpus. The classifier scores sentence pairs from 0 to 1, where 0 means a very noisy translation and 1 means a good translation. In the pipeline, Bicleaner AI is used first if [the language is available][ai-releases]; otherwise the pipeline falls back to the original non-AI Bicleaner.

See:
* [https://github.com/bitextor/bicleaner-ai](https://github.com/bitextor/bicleaner-ai)
* [https://github.com/bitextor/bicleaner](https://github.com/bitextor/bicleaner)

For supported languages see:
* [Bicleaner AI Releases][ai-releases]
* [Bicleaner Releases][releases]

New language releases should be added to: `taskcluster/ci/fetch/bicleaner.yml`

## How to configure for training

The configuration specifies a default threshold and optional per-dataset thresholds. A sentence pair is kept if its score is **above** the given threshold.

- `0.5` should be a [good default value].
- Increase the threshold for noisier datasets.
- Set the threshold to `0` to skip cleaning entirely.

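The filtering rule above can be sketched in a few lines of Python. This is a hypothetical illustration, not the pipeline's actual code: the `filter_corpus` function, its parameters, and the sample scores are all assumptions made for the example.

```python
# Hypothetical sketch of Bicleaner-style threshold filtering.
# Not the pipeline's actual implementation.

def filter_corpus(pairs, scores, default_threshold=0.5,
                  dataset_thresholds=None, dataset=None):
    """Keep sentence pairs whose score is above the threshold.

    A threshold of 0 means cleaning is skipped for that dataset.
    """
    dataset_thresholds = dataset_thresholds or {}
    threshold = dataset_thresholds.get(dataset, default_threshold)
    if threshold == 0:
        return list(pairs)  # cleaning disabled for this dataset
    return [pair for pair, score in zip(pairs, scores) if score > threshold]

pairs = [("src a", "trg a"), ("src b", "trg b"), ("src c", "trg c")]
scores = [0.9, 0.4, 0.55]
print(filter_corpus(pairs, scores))  # default 0.5 keeps the 0.9 and 0.55 pairs
```

Note that with the default threshold of `0.5`, only the first and third pairs survive, while a per-dataset threshold of `0` would keep everything.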
## Recommendations for specific datasets

| Dataset       | Threshold | Reason |
| ------------- | --------- | ------ |
| OpenSubtitles | 0.8       | This is a noisier dataset |
| ParaCrawl    | 0         | This dataset has already been cleaned by Bicleaner. See [Bicleaner AI: Bicleaner Goes Neural], section 4.2.2 |

## Example config

```
bicleaner:
  default-threshold: 0.5
  dataset-thresholds:
    opus_CCAligned/v1: 0.7
    opus_OpenSubtitles/v2018: 0.8
    opus_ParaCrawl/v9: 0
  ...
```

[good default value]: https://github.com/bitextor/bicleaner-ai/wiki/How-to-train-your-Bicleaner-AI#bicleaning-a-corpus
[ai-releases]: https://github.com/bitextor/bicleaner-ai-data/releases
[releases]: https://github.com/bitextor/bicleaner-data/releases
[Bicleaner AI: Bicleaner Goes Neural]: https://aclanthology.org/2022.lrec-1.87.pdf
---
layout: default
title: Pipeline steps
nav_order: 3
has_children: true
---

# Pipeline steps

---
layout: default
title: Teacher Ensemble
parent: Pipeline steps
---

# Teacher Ensemble

Teacher models are larger, slower translation models that achieve higher BLEU scores. In the pipeline they are used to distill smaller, faster student models, at the cost of a somewhat lower BLEU score.

In the config files, you can specify how many teachers to train via the `experiment.teacher-ensemble` key. The teachers will be identical except that they are initialized with different random seeds. This has been shown to improve performance during student distillation, as the translation probabilities from all teachers are combined.

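In a training config this might look like the following sketch; only the `teacher-ensemble` key is described on this page, and the surrounding structure is illustrative:

```
experiment:
  teacher-ensemble: 2
```
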
While our current implementation only changes seeds, it is also possible to have ensembles that use different configurations or are trained on different datasets.

Recommendations from [Efficient machine translation](https://nbogoychev.com/efficient-machine-translation/#ensembling):

> One very easy way to improve translation quality of the teacher is to produce an ensemble of systems that produce translation together. This is done by training identical systems, initialising them with different random seed. The more systems, the better, although returns are diminishing.
>
> For example, if we want to have an ensemble of two systems, we need two separate configuration files for training, where the seed parameter is different. Configuration one would have seed: 1111, whereas configuration two would have seed: 2222.

We typically use two teacher models in our training.
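As a toy illustration of how an ensemble combines translation probabilities, the teachers' per-token distributions can be averaged. This is a hypothetical sketch for intuition only, not Marian's actual decoder logic; the function name and the simple arithmetic averaging are assumptions made for the example.

```python
# Toy sketch: combine two teachers by averaging their per-token
# probability distributions. Hypothetical; not Marian's implementation.

def ensemble_probs(distributions):
    """Average token probability distributions from several teachers."""
    vocab = set().union(*distributions)
    n = len(distributions)
    return {tok: sum(d.get(tok, 0.0) for d in distributions) / n
            for tok in vocab}

teacher1 = {"cat": 0.7, "dog": 0.3}
teacher2 = {"cat": 0.5, "dog": 0.5}
combined = ensemble_probs([teacher1, teacher2])
print(combined["cat"])  # roughly 0.6: the teachers' disagreement is smoothed out
```

The intuition is that where one teacher is overconfident on a single token, the averaged distribution is smoother, which gives the student richer training signal during distillation.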