Merge branch 'main' into add-helper-methods

ManyTheFish · web-flow · commit 5fe758b5a099 · 2023-06-29T13:08:29.000+02:00
diff --git a/charabia/Cargo.toml b/charabia/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "charabia"
-version = "0.7.2"
+version = "0.8.0"
 license = "MIT"
 authors = ["Many <many@meilisearch.com>"]
 edition = "2021"
diff --git a/charabia/README.md b/charabia/README.md
@@ -16,15 +16,15 @@ Charabia provides a simple API to segment, normalize, or tokenize (segment + nor
 
 |  Script / Language  |                           specialized segmentation                            | specialized normalization | Segmentation Performance level | Tokenization Performance level |
 |---------------------|-------------------------------------------------------------------------------|---------------------------|-------------------|---|
-| **Latin** | ✅ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) + CamelCase segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal          | 🟨 ~15MiB/sec    | 🟨 ~8MiB/sec    |
-| **Greek** | ❌ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + final sigma normalization         | 🟩 ~22MiB/sec    | 🟨 ~7MiB/sec    |
-| **Cyrillic** - **Georgian** | ❌ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase          | 🟨 ~15MiB/sec    | 🟨 ~8MiB/sec    |
-| **Chinese** **CMN** 🇨🇳 | ✅ [jieba](https://github.com/messense/jieba-rs) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~11MiB/sec    | 🟧 ~6MiB/sec    |
-| **Hebrew** 🇮🇱 | ❌ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal  | 🟩 ~28MiB/sec    | 🟨 ~11MiB/sec    |
-| **Arabic**  | ✅ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) + `ال` segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal + [Tatweel, Alef, Yeh, and Taa Marbuta normalization]  | 🟩 ~26MiB/sec    | 🟨 ~10MiB/sec    |
-| **Japanese** 🇯🇵 | ✅ [lindera](https://github.com/lindera-morphology/lindera) IPA-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~5MiB/sec    | 🟧 ~4MiB/sec    |
+| **Latin** | ✅ CamelCase segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal          | 🟩 ~23MiB/sec    | 🟨 ~9MiB/sec    |
+| **Greek** | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + final sigma normalization         | 🟩 ~27MiB/sec    | 🟨 ~8MiB/sec    |
+| **Cyrillic** - **Georgian** | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase          | 🟩 ~27MiB/sec    | 🟨 ~9MiB/sec    |
+| **Chinese** **CMN** 🇨🇳 | ✅ [jieba](https://github.com/messense/jieba-rs) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~10MiB/sec    | 🟧 ~5MiB/sec    |
+| **Hebrew** 🇮🇱 | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal  | 🟩 ~33MiB/sec    | 🟨 ~11MiB/sec    |
+| **Arabic**  | ✅ `ال` segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal + [Tatweel, Alef, Yeh, and Taa Marbuta normalization]  | 🟩 ~36MiB/sec    | 🟨 ~11MiB/sec    |
+| **Japanese** 🇯🇵 | ✅ [lindera](https://github.com/lindera-morphology/lindera) IPA-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~3MiB/sec    | 🟧 ~3MiB/sec    |
 | **Korean** 🇰🇷 | ✅ [lindera](https://github.com/lindera-morphology/lindera) KO-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟥 ~2MiB/sec    | 🟥 ~2MiB/sec    |
-| **Thai** 🇹🇭 | ✅ [dictionary based](https://github.com/PyThaiNLP/nlpo3) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~25MiB/sec    | 🟨 ~13MiB/sec    |
+| **Thai** 🇹🇭 | ✅ [dictionary based](https://github.com/PyThaiNLP/nlpo3) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~22MiB/sec    | 🟨 ~11MiB/sec    |
 
 We aim to provide global language support, and your feedback helps us [move closer to that goal](https://docs.meilisearch.com/learn/advanced/language.html#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/charabia/issues/new/choose).