From b1c1e679bfd89f65f2b747ed5b04408fd88e343d Mon Sep 17 00:00:00 2001
From: ManyTheFish
Date: Wed, 14 Dec 2022 15:17:41 +0100
Subject: [PATCH] Update readme with performances

---
 README.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 5f3ff7c6..a96a2024 100644
--- a/README.md
+++ b/README.md
@@ -14,13 +14,15 @@ Charabia provides a simple API to segment, normalize, or tokenize (segment + nor
 
 **Charabia is multilingual**, featuring optimized support for:
 
-| Script - Language | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
+| Script / Language | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
 |---------------------|-------------------------------------------------------------------------------|---------------------------|-------------------|---|
-| **Latin** - **Any** | ✅ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ lowercase + deunicode | 🟨 ~13MiB/sec | 🟧 ~5MiB/sec |
-| **Chinese** - **CMN** 🇨🇳 | ✅ [jieba](https://github.com/messense/jieba-rs) | ✅ traditional-to-simplified conversion | 🟨 ~9MiB/sec | 🟧 ~5MiB/sec |
-| **Hebrew** 🇮🇱 | ✅ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ diacritics removal | 🟩 ~21MiB/sec | 🟨 ~11MiB/sec |
-| **Japanese** 🇯🇵 | ✅ [lindera](https://github.com/lindera-morphology/lindera) | ✅ convert to Hiragana | 🟧 ~5MiB/sec | 🟧 ~4MiB/sec |
-| **Thai** 🇹🇭 | ✅ [dictionary based](https://github.com/PyThaiNLP/nlpo3) | ❌ | 🟩 ~23MiB/sec | 🟨 ~14MiB/sec |
+| **Latin** | ❌ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟨 ~14MiB/sec | 🟨 ~8MiB/sec |
+| **Cyrillic** - **Greek** - **Georgian** | ❌ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase | 🟨 ~14MiB/sec | 🟨 ~8MiB/sec |
+| **Chinese** **CMN** 🇨🇳 | ✅ [jieba](https://github.com/messense/jieba-rs) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~11MiB/sec | 🟧 ~6MiB/sec |
+| **Hebrew** 🇮🇱 - **Arabic** | ❌ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~22MiB/sec | 🟨 ~10MiB/sec |
+| **Japanese** 🇯🇵 | ✅ [lindera](https://github.com/lindera-morphology/lindera) IPA-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~5MiB/sec | 🟧 ~4MiB/sec |
+| **Korean** 🇰🇷 | ✅ [lindera](https://github.com/lindera-morphology/lindera) KO-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟥 ~2MiB/sec | 🟥 ~2MiB/sec |
+| **Thai** 🇹🇭 | ✅ [dictionary based](https://github.com/PyThaiNLP/nlpo3) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~26MiB/sec | 🟨 ~13MiB/sec |
 
 We aim to provide global language support, and your feedback helps us [move closer to that goal](https://docs.meilisearch.com/learn/advanced/language.html#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/charabia/issues/new/choose).
 