Commit 5fe758b

Merge branch 'main' into add-helper-methods
2 parents 5e4cee7 + f96cfa4 commit 5fe758b

2 files changed: +9 -9 lines changed

charabia/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "charabia"
-version = "0.7.2"
+version = "0.8.0"
 license = "MIT"
 authors = ["Many <[email protected]>"]
 edition = "2021"

charabia/README.md

Lines changed: 8 additions & 8 deletions
@@ -16,15 +16,15 @@ Charabia provides a simple API to segment, normalize, or tokenize (segment + nor
 
 | Script / Language | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
 |---------------------|-------------------------------------------------------------------------------|---------------------------|-------------------|---|
-| **Latin** |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) + CamelCase segmentation |[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟨 ~15MiB/sec | 🟨 ~8MiB/sec |
-| **Greek** |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) |[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + final sigma normalization | 🟩 ~22MiB/sec | 🟨 ~7MiB/sec |
-| **Cyrillic** - **Georgian** |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) |[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase | 🟨 ~15MiB/sec | 🟨 ~8MiB/sec |
-| **Chinese** **CMN** 🇨🇳 |[jieba](https://github.com/messense/jieba-rs) |[compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~11MiB/sec | 🟧 ~6MiB/sec |
-| **Hebrew** 🇮🇱 |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) |[compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~28MiB/sec | 🟨 ~11MiB/sec |
-| **Arabic** |[unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) + `ال` segmentation |[compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal + [Tatweel, Alef, Yeh, and Taa Marbuta normalization] | 🟩 ~26MiB/sec | 🟨 ~10MiB/sec |
-| **Japanese** 🇯🇵 |[lindera](https://github.com/lindera-morphology/lindera) IPA-dict |[compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~5MiB/sec | 🟧 ~4MiB/sec |
+| **Latin** | ✅ CamelCase segmentation |[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~23MiB/sec | 🟨 ~9MiB/sec |
+| **Greek** ||[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + final sigma normalization | 🟩 ~27MiB/sec | 🟨 ~8MiB/sec |
+| **Cyrillic** - **Georgian** ||[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase | 🟩 ~27MiB/sec | 🟨 ~9MiB/sec |
+| **Chinese** **CMN** 🇨🇳 |[jieba](https://github.com/messense/jieba-rs) |[compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~10MiB/sec | 🟧 ~5MiB/sec |
+| **Hebrew** 🇮🇱 ||[compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~33MiB/sec | 🟨 ~11MiB/sec |
+| **Arabic** |`ال` segmentation |[compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal + [Tatweel, Alef, Yeh, and Taa Marbuta normalization] | 🟩 ~36MiB/sec | 🟨 ~11MiB/sec |
+| **Japanese** 🇯🇵 |[lindera](https://github.com/lindera-morphology/lindera) IPA-dict |[compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~3MiB/sec | 🟧 ~3MiB/sec |
 | **Korean** 🇰🇷 |[lindera](https://github.com/lindera-morphology/lindera) KO-dict |[compatibility decomposition](https://unicode.org/reports/tr15/) | 🟥 ~2MiB/sec | 🟥 ~2MiB/sec |
-| **Thai** 🇹🇭 |[dictionary based](https://github.com/PyThaiNLP/nlpo3) |[compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~25MiB/sec | 🟨 ~13MiB/sec |
+| **Thai** 🇹🇭 |[dictionary based](https://github.com/PyThaiNLP/nlpo3) |[compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~22MiB/sec | 🟨 ~11MiB/sec |
 
 We aim to provide global language support, and your feedback helps us [move closer to that goal](https://docs.meilisearch.com/learn/advanced/language.html#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/charabia/issues/new/choose).
 