Update HPLT to 2.0 (#1000)
* Update HPLT importer to 2.0

* Update training configuration

* Update docs

* Replace scores with floats

* Move packages required for PyICU to the base image

* Change default cleaning thresholds

* Update HPLT tests

* Fix naming issue

* Add more statistics and refactor

* Update and extract tests

* Create taskgraph directory if it doesn't exist

* Revert "Create taskgraph directory if it doesn't exist"

This reverts commit 094cde0.

* Reimplement closing

* Fix Korean locale

* Clarify monocleaner thresholds

* Add a link to HPLT reports

* Pin poetry version
eu9ene authored Jan 27, 2025
1 parent b08737d commit 5091a54
Showing 28 changed files with 484 additions and 315 deletions.
22 changes: 11 additions & 11 deletions docs/data.md
@@ -15,17 +15,17 @@ Example:
- mtdata_newstest2014_ruen
```

Data source | Prefix | Name examples | Type | Comments
--- | --- | --- | ---| ---
[MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
[OPUS](opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link.
[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | corpus | Official evaluation datasets available in SacreBLEU tool. Recommended to use in `datasets:test` config section. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages.
Custom parallel | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | corpus | A custom zst compressed parallel dataset, for instance uploaded to GCS. The language pairs should be split into two files. the `[LANG]` will be replaced with the `to` and `from` language codes.
[Paracrawl](https://paracrawl.eu/) | paracrawl-mono | paracrawl8 | mono | Datasets that are crawled from the web. Only [mono datasets](https://paracrawl.eu/index.php/moredata) are used in this importer. Parallel corpus is available using opus importer.
[News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Some news monolingual datasets from [WMT21](https://www.statmt.org/wmt21/translation-task.html)
[Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)
Custom mono | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst compressed monolingual dataset, for instance uploaded to GCS.
Data source | Prefix | Name examples | Type | Comments
--- | --- | --- | --- | ---
[MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
[OPUS](https://opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | corpus | Many open-source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which name and version are used in each link.
[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | corpus | Official evaluation datasets available in the SacreBLEU tool. Recommended for the `datasets:test` config section. Look up supported datasets and language pairs in the `sacrebleu.dataset` Python module.
[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages.
Custom parallel | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | corpus | A custom zst-compressed parallel dataset, for instance uploaded to GCS. The language pair should be split into two files; the `[LANG]` placeholder is replaced with the `to` and `from` language codes.
[News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Monolingual news datasets from [WMT](https://www.statmt.org/wmt21/translation-task.html)
[OPUS](https://opus.nlpl.eu/) | opus | tldr-pages/v2023-08-29 | mono | Monolingual datasets from OPUS.
[HPLT](https://hplt-project.org/datasets/v2.0) | hplt | mono/v2.0 | mono | HPLT monolingual corpus (mostly from Internet Archive, but also from Common Crawl).
Custom mono | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst-compressed monolingual dataset, for instance uploaded to GCS.

You can also use the [find-corpus](https://github.com/mozilla/translations/tree/main/pipeline/utils/find-corpus.py) tool to find all datasets for an importer and format them for use in the config.
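Putting the pieces together: each entry under `datasets` in the config is an importer prefix from the table above joined to a dataset name, e.g. `mtdata_newstest2014_ruen` or `hplt_mono/v2.0`. The sketch below is only an illustration of that naming convention; the pipeline's own `Dataset` class in `pipeline/common/datasets.py` is the real implementation and may parse keys differently.

```python
# Illustration only: how a dataset key such as "hplt_mono/v2.0" breaks down
# into an importer prefix and a dataset name. Not the pipeline's actual parser.

def split_dataset_key(key: str) -> tuple[str, str]:
    """Split '<importer>_<name>' on the first underscore."""
    importer, _, name = key.partition("_")
    return importer, name

assert split_dataset_key("mtdata_newstest2014_ruen") == ("mtdata", "newstest2014_ruen")
assert split_dataset_key("hplt_mono/v2.0") == ("hplt", "mono/v2.0")
```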

40 changes: 23 additions & 17 deletions pipeline/data/download-mono.py
@@ -26,7 +26,7 @@
from pathlib import Path
from typing import Optional

from importers.mono.hplt import download_hplt
from importers.mono.hplt import HpltDownloader

from pipeline.common.datasets import Dataset, shuffle_with_max_lines
from pipeline.common.downloads import (
@@ -37,9 +37,6 @@
from pipeline.common.logging import get_logger
from pipeline.data.cjk import ChineseConverter, ChineseType

# TODO(CJK) - Issue #424
MAX_WORDS_IN_SENTENCE = 100

CURRENT_FOLDER = os.path.dirname(os.path.abspath(__file__))
IMPORTERS_PATH = os.path.abspath(os.path.join(CURRENT_FOLDER, "mono"))

@@ -57,17 +54,22 @@ def main(args_list: Optional[list[str]] = None) -> None:
"--max_sentences", type=int, help="The maximum number of sentences to retain"
)
parser.add_argument(
"--hlpt_min_fluency",
"--hplt_min_doc_score",
type=float,
help="The minimum fluency score to filter datasets that include this metric",
default=0.8,
help="The minimum document score to filter datasets that include this metric",
default=5.0,
)
parser.add_argument(
"--hlpt_max_characters",
"--hplt_max_characters",
type=int,
help="The maximum number of characters to merge lines in a document before writing. "
"0 - preserve original lines of HPLT dataset",
default=0,
help="The maximum length of the output segments. ",
default=600,
)
parser.add_argument(
"--hplt_merge_lines",
type=bool,
help="Whether to accumulate lines of the same document in one output segment until `hplt_max_characters` is reached.",
default=False,
)
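Taken together, the new HPLT flags describe a filter-then-segment step: documents scored below `--hplt_min_doc_score` are dropped, and with `--hplt_merge_lines` enabled, consecutive lines of the same document are accumulated into one output segment of at most `--hplt_max_characters` characters. The sketch below only illustrates that intent as described by the help strings; the actual `HpltDownloader` logic, its handling of over-long lines, and the HPLT 2.0 record fields may differ.

```python
# Illustration only: not the pipeline's HpltDownloader. A "document" here is
# a (doc_score, lines) pair; real HPLT 2.0 records carry more fields.
from typing import Iterable, Iterator


def emit_segments(
    documents: Iterable[tuple[float, list[str]]],
    min_doc_score: float = 5.0,   # mirrors --hplt_min_doc_score
    max_characters: int = 600,    # mirrors --hplt_max_characters
    merge_lines: bool = False,    # mirrors --hplt_merge_lines
) -> Iterator[str]:
    for doc_score, lines in documents:
        if doc_score < min_doc_score:
            continue  # drop the whole document
        if not merge_lines:
            # Preserve the original lines, capped at the maximum segment length.
            yield from (line for line in lines if len(line) <= max_characters)
            continue
        # Accumulate lines of the same document until the next line would
        # push the segment past max_characters.
        segment = ""
        for line in lines:
            if segment and len(segment) + 1 + len(line) > max_characters:
                yield segment
                segment = line
            else:
                segment = f"{segment} {line}".strip()
        if segment:
            yield segment
```

For example, with `max_characters=600` and merging enabled, a ten-line document may come out as one or two long segments instead of ten short ones.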
parser.add_argument(
"--artifacts", type=Path, help="The location where the dataset will be saved"
@@ -80,22 +82,26 @@ def main(args_list: Optional[list[str]] = None) -> None:

logger.info(f"Dataset: {args.dataset}")
logger.info(f"Language: {args.language}")
logger.info(f"Max Sentences: {args.max_sentences}")
logger.info(f"Mininmum Fluency Threshold: {args.hlpt_min_fluency}")
logger.info(f"HPLT Max Sentences: {args.max_sentences}")
logger.info(f"HPLT Minimum Document Score Threshold: {args.hplt_min_doc_score}")
logger.info(f"HPLT Merge Lines: {args.hplt_merge_lines}")
logger.info(f"Artifacts: {args.artifacts}")
logger.info(f"File Destination: {file_destination}")

if not os.path.exists(args.artifacts):
os.makedirs(args.artifacts)

if dataset.importer == "hplt":
download_hplt(
if dataset.name != "mono/v2.0":
raise ValueError("Only HPLT v2.0 is supported")
HpltDownloader(
language=args.language,
hlpt_min_fluency=args.hlpt_min_fluency,
max_characters=args.hlpt_max_characters,
hplt_min_doc_score=args.hplt_min_doc_score,
max_characters=args.hplt_max_characters,
max_lines=args.max_sentences,
file_destination=file_destination,
)
merge_lines=args.hplt_merge_lines,
).download()

return

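The call site above implies roughly the following interface for `HpltDownloader`. This is a hedged reconstruction from the keyword arguments and the chained `.download()` call, not the actual class in `importers/mono/hplt.py`, which may take additional parameters and does the real work.

```python
# Hedged reconstruction of the interface implied by the call in main();
# the real implementation lives in importers/mono/hplt.py.
from dataclasses import dataclass
from pathlib import Path


@dataclass
class HpltDownloader:
    language: str              # language code of the HPLT 2.0 shard to fetch
    hplt_min_doc_score: float  # drop documents scored below this threshold
    max_characters: int        # maximum length of an output segment
    max_lines: int             # stop after this many output lines
    file_destination: Path     # where the compressed output is written
    merge_lines: bool = False  # accumulate document lines into one segment

    def download(self) -> None:
        """Stream, filter, and write the monolingual data."""
        raise NotImplementedError("interface sketch only")
```

Compared with the previous `download_hplt` function, a class keeps the thresholds, counters, and output handle together, which lines up with the "Add more statistics and refactor" and "Reimplement closing" items in the commit message.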