Update HPLT to 2.0 (#1000)
* Update HPLT importer to 2.0

* Update training configuration

* Update docs

* Replace scores with floats

* Move packages required for PyICU to the base image

* Change default cleaning thresholds

* Update HPLT tests

* Fix naming issue

* Add more statistics and refactor

* Update and extract tests

* Create taskgraph directory if it doesn't exist

* Revert "Create taskgraph directory if it doesn't exist"

This reverts commit 094cde0.

* Reimplement closing

* Fix Korean locale

* Clarify monocleaner thresholds

* Add a link to HPLT reports

* Pin poetry version
eu9ene authored Jan 27, 2025
1 parent b08737d commit 5091a54
Showing 28 changed files with 484 additions and 315 deletions.
22 changes: 11 additions & 11 deletions docs/data.md
@@ -15,17 +15,17 @@ Example:
- mtdata_newstest2014_ruen
```

Data source | Prefix | Name examples | Type | Comments
--- | --- | --- | ---| ---
[MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
[OPUS](opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link.
[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | corpus | Official evaluation datasets available in SacreBLEU tool. Recommended to use in `datasets:test` config section. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages.
Custom parallel | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | corpus | A custom zst compressed parallel dataset, for instance uploaded to GCS. The language pairs should be split into two files. the `[LANG]` will be replaced with the `to` and `from` language codes.
[Paracrawl](https://paracrawl.eu/) | paracrawl-mono | paracrawl8 | mono | Datasets that are crawled from the web. Only [mono datasets](https://paracrawl.eu/index.php/moredata) are used in this importer. Parallel corpus is available using opus importer.
[News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Some news monolingual datasets from [WMT21](https://www.statmt.org/wmt21/translation-task.html)
[Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)
Custom mono | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst compressed monolingual dataset, for instance uploaded to GCS.
Data source | Prefix | Name examples | Type | Comments
--- | --- | --- | --- | ---
[MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
[OPUS](https://opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | corpus | Many open-source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which name and version are used in each link.
[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | corpus | Official evaluation datasets available in the SacreBLEU tool. Recommended for the `datasets:test` config section. Look up supported datasets and language pairs in the `sacrebleu.dataset` Python module.
[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages.
Custom parallel | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | corpus | A custom zst-compressed parallel dataset, for instance uploaded to GCS. The language pair should be split into two files; the `[LANG]` placeholder is replaced with the `to` and `from` language codes.
[News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Monolingual news datasets from [WMT](https://www.statmt.org/wmt21/translation-task.html)
[OPUS](https://opus.nlpl.eu/) | opus | tldr-pages/v2023-08-29 | mono | Monolingual datasets from OPUS.
[HPLT](https://hplt-project.org/datasets/v2.0) | hplt | mono/v2.0 | mono | HPLT monolingual corpus (mostly from Internet Archive, but also from Common Crawl).
Custom mono | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst-compressed monolingual dataset, for instance uploaded to GCS.

You can also use the [find-corpus](https://github.com/mozilla/translations/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and format them for use in the config.
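Putting the pieces together: each entry under `datasets` in the config is an importer prefix from the table above joined to a dataset name, e.g. `mtdata_newstest2014_ruen` or `hplt_mono/v2.0`. The sketch below is only an illustration of that naming convention; the pipeline's own `Dataset` class in `pipeline/common/datasets.py` is the real implementation and may parse keys differently.

```python
# Illustration only: how a dataset key such as "hplt_mono/v2.0" breaks down
# into an importer prefix and a dataset name. Not the pipeline's actual parser.

def split_dataset_key(key: str) -> tuple[str, str]:
    """Split '<importer>_<name>' on the first underscore."""
    importer, _, name = key.partition("_")
    return importer, name

assert split_dataset_key("mtdata_newstest2014_ruen") == ("mtdata", "newstest2014_ruen")
assert split_dataset_key("hplt_mono/v2.0") == ("hplt", "mono/v2.0")
```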

40 changes: 23 additions & 17 deletions pipeline/data/download-mono.py
@@ -26,7 +26,7 @@
from pathlib import Path
from typing import Optional

from importers.mono.hplt import download_hplt
from importers.mono.hplt import HpltDownloader

from pipeline.common.datasets import Dataset, shuffle_with_max_lines
from pipeline.common.downloads import (
@@ -37,9 +37,6 @@
from pipeline.common.logging import get_logger
from pipeline.data.cjk import ChineseConverter, ChineseType

# TODO(CJK) - Issue #424
MAX_WORDS_IN_SENTENCE = 100

CURRENT_FOLDER = os.path.dirname(os.path.abspath(__file__))
IMPORTERS_PATH = os.path.abspath(os.path.join(CURRENT_FOLDER, "mono"))

@@ -57,17 +54,22 @@ def main(args_list: Optional[list[str]] = None) -> None:
"--max_sentences", type=int, help="The maximum number of sentences to retain"
)
parser.add_argument(
"--hlpt_min_fluency",
"--hplt_min_doc_score",
type=float,
help="The minimum fluency score to filter datasets that include this metric",
default=0.8,
help="The minimum document score to filter datasets that include this metric",
default=5.0,
)
parser.add_argument(
"--hlpt_max_characters",
"--hplt_max_characters",
type=int,
help="The maximum number of characters to merge lines in a document before writing. "
"0 - preserve original lines of HPLT dataset",
default=0,
help="The maximum length of the output segments. ",
default=600,
)
parser.add_argument(
"--hplt_merge_lines",
type=bool,
help="Whether to accumulate lines of the same document in one output segment until `hplt_max_characters` is reached.",
default=False,
)
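Taken together, the new HPLT flags describe a filter-then-segment step: documents scored below `--hplt_min_doc_score` are dropped, and with `--hplt_merge_lines` enabled, consecutive lines of the same document are accumulated into one output segment of at most `--hplt_max_characters` characters. The sketch below only illustrates that intent as described by the help strings; the actual `HpltDownloader` logic, its handling of over-long lines, and the HPLT 2.0 record fields may differ.

```python
# Illustration only: not the pipeline's HpltDownloader. A "document" here is
# a (doc_score, lines) pair; real HPLT 2.0 records carry more fields.
from typing import Iterable, Iterator


def emit_segments(
    documents: Iterable[tuple[float, list[str]]],
    min_doc_score: float = 5.0,   # mirrors --hplt_min_doc_score
    max_characters: int = 600,    # mirrors --hplt_max_characters
    merge_lines: bool = False,    # mirrors --hplt_merge_lines
) -> Iterator[str]:
    for doc_score, lines in documents:
        if doc_score < min_doc_score:
            continue  # drop the whole document
        if not merge_lines:
            # Preserve the original lines, capped at the maximum segment length.
            yield from (line for line in lines if len(line) <= max_characters)
            continue
        # Accumulate lines of the same document until the next line would
        # push the segment past max_characters.
        segment = ""
        for line in lines:
            if segment and len(segment) + 1 + len(line) > max_characters:
                yield segment
                segment = line
            else:
                segment = f"{segment} {line}".strip()
        if segment:
            yield segment
```

For example, with `max_characters=600` and merging enabled, a ten-line document may come out as one or two long segments instead of ten short ones.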
parser.add_argument(
"--artifacts", type=Path, help="The location where the dataset will be saved"
@@ -80,22 +82,26 @@ def main(args_list: Optional[list[str]] = None) -> None:

logger.info(f"Dataset: {args.dataset}")
logger.info(f"Language: {args.language}")
logger.info(f"Max Sentences: {args.max_sentences}")
logger.info(f"Mininmum Fluency Threshold: {args.hlpt_min_fluency}")
logger.info(f"HPLT Max Sentences: {args.max_sentences}")
logger.info(f"HPLT Minimum Document Score Threshold: {args.hplt_min_doc_score}")
logger.info(f"HPLT Merge Lines: {args.hplt_merge_lines}")
logger.info(f"Artifacts: {args.artifacts}")
logger.info(f"File Destination: {file_destination}")

if not os.path.exists(args.artifacts):
os.makedirs(args.artifacts)

if dataset.importer == "hplt":
download_hplt(
if dataset.name != "mono/v2.0":
raise ValueError("Only HPLT v2.0 is supported")
HpltDownloader(
language=args.language,
hlpt_min_fluency=args.hlpt_min_fluency,
max_characters=args.hlpt_max_characters,
hplt_min_doc_score=args.hplt_min_doc_score,
max_characters=args.hplt_max_characters,
max_lines=args.max_sentences,
file_destination=file_destination,
)
merge_lines=args.hplt_merge_lines,
).download()

return

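The call site above implies roughly the following interface for `HpltDownloader`. This is a hedged reconstruction from the keyword arguments and the chained `.download()` call, not the actual class in `importers/mono/hplt.py`, which may take additional parameters and does the real work.

```python
# Hedged reconstruction of the interface implied by the call in main();
# the real implementation lives in importers/mono/hplt.py.
from dataclasses import dataclass
from pathlib import Path


@dataclass
class HpltDownloader:
    language: str              # language code of the HPLT 2.0 shard to fetch
    hplt_min_doc_score: float  # drop documents scored below this threshold
    max_characters: int        # maximum length of an output segment
    max_lines: int             # stop after this many output lines
    file_destination: Path     # where the compressed output is written
    merge_lines: bool = False  # accumulate document lines into one segment

    def download(self) -> None:
        """Stream, filter, and write the monolingual data."""
        raise NotImplementedError("interface sketch only")
```

Compared with the previous `download_hplt` function, a class keeps the thresholds, counters, and output handle together, which lines up with the "Add more statistics and refactor" and "Reimplement closing" items in the commit message.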