retriv provides several resources for multi-lingual text pre-processing, aiming to maximize its retrieval effectiveness.
Stemmers reduce words to their word stem, base or root form.
retriv supports the following stemmers:
- snowball (default)
The following languages are supported by Snowball Stemmer: Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.
To select your preferred language, simply use `<language>`.
- arlstem (Arabic)
- arlstem2 (Arabic)
- cistem (German)
- isri (Arabic)
- krovetz (English)
- lancaster (English)
- porter (English)
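
To make the idea of stemming concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is a toy written for illustration, not the code retriv or any of the stemmers listed above actually run. The `SUFFIXES` rule set, the `toy_stem` function, and the `min_stem_len` parameter are all assumptions made up for this example. Real stemmers such as Porter or Snowball apply much richer, language-specific rules.

```python
# Toy suffix-stripping stemmer, for illustration only.
# Hypothetical rule set: suffixes are tried in this order, first match wins.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applied: return the word unchanged.
    return word


words = ["searching", "searched", "searches", "search"]
stems = [toy_stem(w) for w in words]
print(stems)
```

All four inflected forms collapse to the same stem, which is what lets a query for one form match documents containing another.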
Tokenizers divide a string into smaller units, such as words.
retriv supports the following tokenizers:
retriv supports stop-word lists for the following languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, and Turkish.
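
The sketch below shows how tokenization and stop-word removal might fit together, using only the Python standard library. It does not reproduce retriv's internals. The function names and the tiny `STOPWORDS` set are hypothetical and chosen for this example; real stop-word lists contain many more entries.

```python
import re

# Tiny hand-picked English stop-word set (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def whitespace_tokenize(text: str) -> list:
    """Split on whitespace only; punctuation stays attached to tokens."""
    return text.split()


def word_tokenize(text: str) -> list:
    """Keep runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)


def remove_stopwords(tokens: list, stopwords: set = STOPWORDS) -> list:
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in stopwords]


text = "The art of searching, explained."
ws_tokens = whitespace_tokenize(text)
word_tokens = word_tokenize(text)
content_tokens = remove_stopwords(word_tokens)
print(ws_tokens)
print(word_tokens)
print(content_tokens)
```

Comparing the two tokenizers shows why the choice matters: whitespace splitting leaves `searching,` with its trailing comma, while the regex-based variant yields a clean `searching` that a stemmer or stop-word filter can match.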