@@ -16,15 +16,15 @@ Charabia provides a simple API to segment, normalize, or tokenize (segment + nor
| Script / Language | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
| ---------------------| -------------------------------------------------------------------------------| ---------------------------| -------------------| ---|
- | **Latin** | ✅ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) + CamelCase segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟨 ~15MiB/sec | 🟨 ~8MiB/sec |
- | **Greek** | ❌ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + final sigma normalization | 🟩 ~22MiB/sec | 🟨 ~7MiB/sec |
- | **Cyrillic** - **Georgian** | ❌ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase | 🟨 ~15MiB/sec | 🟨 ~8MiB/sec |
- | **Chinese** **CMN** 🇨🇳 | ✅ [jieba](https://github.com/messense/jieba-rs) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~11MiB/sec | 🟧 ~6MiB/sec |
- | **Hebrew** 🇮🇱 | ❌ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~28MiB/sec | 🟨 ~11MiB/sec |
- | **Arabic** | ✅ [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) + `ال` segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal + [Tatweel, Alef, Yeh, and Taa Marbuta normalization] | 🟩 ~26MiB/sec | 🟨 ~10MiB/sec |
- | **Japanese** 🇯🇵 | ✅ [lindera](https://github.com/lindera-morphology/lindera) IPA-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~5MiB/sec | 🟧 ~4MiB/sec |
+ | **Latin** | ✅ CamelCase segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~23MiB/sec | 🟨 ~9MiB/sec |
+ | **Greek** | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + final sigma normalization | 🟩 ~27MiB/sec | 🟨 ~8MiB/sec |
+ | **Cyrillic** - **Georgian** | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase | 🟩 ~27MiB/sec | 🟨 ~9MiB/sec |
+ | **Chinese** **CMN** 🇨🇳 | ✅ [jieba](https://github.com/messense/jieba-rs) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~10MiB/sec | 🟧 ~5MiB/sec |
+ | **Hebrew** 🇮🇱 | ❌ | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~33MiB/sec | 🟨 ~11MiB/sec |
+ | **Arabic** | ✅ `ال` segmentation | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal + [Tatweel, Alef, Yeh, and Taa Marbuta normalization] | 🟩 ~36MiB/sec | 🟨 ~11MiB/sec |
+ | **Japanese** 🇯🇵 | ✅ [lindera](https://github.com/lindera-morphology/lindera) IPA-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟧 ~3MiB/sec | 🟧 ~3MiB/sec |
| **Korean** 🇰🇷 | ✅ [lindera](https://github.com/lindera-morphology/lindera) KO-dict | ❌ [compatibility decomposition](https://unicode.org/reports/tr15/) | 🟥 ~2MiB/sec | 🟥 ~2MiB/sec |
- | **Thai** 🇹🇭 | ✅ [dictionary based](https://github.com/PyThaiNLP/nlpo3) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~25MiB/sec | 🟨 ~13MiB/sec |
+ | **Thai** 🇹🇭 | ✅ [dictionary based](https://github.com/PyThaiNLP/nlpo3) | ✅ [compatibility decomposition](https://unicode.org/reports/tr15/) + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~22MiB/sec | 🟨 ~11MiB/sec |

We aim to provide global language support, and your feedback helps us [move closer to that goal](https://docs.meilisearch.com/learn/advanced/language.html#improving-our-language-support). If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our [GitHub repository](https://github.com/meilisearch/charabia/issues/new/choose).
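
For readers unfamiliar with the "CamelCase segmentation" listed in the Latin row, here is a minimal sketch of the general idea using only the Rust standard library. It is an illustration under stated assumptions, not Charabia's implementation: the function name `split_camel_case` is hypothetical, and the only rule applied is a split wherever a lowercase letter is followed by an uppercase one.

```rust
/// Split a word at lowercase-to-uppercase boundaries.
/// Hypothetical helper for illustration only; not part of Charabia's API.
fn split_camel_case(word: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0; // byte index where the current segment begins
    let mut prev_is_lower = false;

    for (idx, ch) in word.char_indices() {
        // A boundary is assumed wherever an uppercase letter follows a lowercase one.
        if prev_is_lower && ch.is_uppercase() {
            parts.push(&word[start..idx]);
            start = idx;
        }
        prev_is_lower = ch.is_lowercase();
    }

    // Push the trailing segment, if any.
    if start < word.len() {
        parts.push(&word[start..]);
    }
    parts
}

fn main() {
    // Prints the segments found in a sample identifier.
    println!("{:?}", split_camel_case("camelCaseWord"));
}
```

Under this simple rule a run of capitals such as `HTTPServer` stays in one piece; a real segmenter may well treat such cases differently.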