Text Pre-Processing

retriv provides several resources for multilingual text pre-processing, which help maximize its retrieval effectiveness.

Stemmers

Stemmers reduce words to their word stem, base or root form.
retriv supports the following stemmers:

snowball (default)
The following languages are supported by Snowball Stemmer: Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.
To select your preferred language, simply use <language>, i.e. the name of the language, as the stemmer.
arlstem (Arabic)
arlstem2 (Arabic)
cistem (German)
isri (Arabic)
krovetz (English)
lancaster (English)
porter (English)
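
To illustrate why stemming helps retrieval, here is a minimal sketch of a toy suffix stripper. It is not the algorithm of any stemmer listed above (Snowball, Porter, and the others use far more careful rules); the function name `toy_stem` and its suffix list are assumptions made for this example only.

```python
def toy_stem(word: str) -> str:
    """Strip one common English suffix from a word.

    Toy illustration only: real stemmers such as Snowball or Porter
    apply many ordered rules with conditions on the remaining stem.
    """
    # Longer suffixes are checked first so "indexing" loses "ing", not just "g".
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words like "is" stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Different surface forms collapse to the same stem, so a query
# containing "indexing" can match a document containing "indexed".
variants = ["indexing", "indexed", "indexes"]
stems = {toy_stem(w) for w in variants}
```

After this runs, `stems` contains the single entry `"index"`, which is the effect a stemmer is meant to have on matching.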

Tokenizers

Tokenizers divide a string into smaller units, such as words.
retriv supports the following tokenizers:

whitespace
word
wordpunct
sent
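
The difference between these tokenizers is easiest to see on a concrete string. The sketch below uses only the Python standard library and approximates three of them with simple rules; it assumes the names carry their usual meaning (split on whitespace, split into alphanumeric and punctuation runs, split into sentences) and is not retriv's own implementation. The function names are hypothetical, and the word tokenizer is left out because its rules are too involved for a short example.

```python
import re


def whitespace_tokenize(text: str) -> list[str]:
    # Split on any run of whitespace; punctuation stays attached to words.
    return text.split()


def wordpunct_tokenize(text: str) -> list[str]:
    # Separate runs of word characters from runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)


def naive_sent_tokenize(text: str) -> list[str]:
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    # Real sentence tokenizers also handle abbreviations, quotes, etc.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "Don't stop. Keep going!"
by_whitespace = whitespace_tokenize(text)  # ["Don't", "stop.", "Keep", "going!"]
by_wordpunct = wordpunct_tokenize(text)    # ["Don", "'", "t", "stop", ".", "Keep", "going", "!"]
by_sentence = naive_sent_tokenize(text)    # ["Don't stop.", "Keep going!"]
```

Note how the whitespace variant keeps "stop." as one token, so it would not match a query term "stop" unless punctuation is removed in a separate step.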

Stop-word Lists

retriv supports stop-word lists for the following languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, and Turkish.
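
As a rough illustration of what a stop-word list does during pre-processing, the sketch below filters tokens against a tiny hand-written English sample. The set `SAMPLE_STOPWORDS` and the function `remove_stopwords` are assumptions for this example; the lists shipped with retriv are much larger and are not reproduced here.

```python
# A deliberately tiny sample; real stop-word lists contain far more entries.
SAMPLE_STOPWORDS = frozenset({"the", "a", "an", "of", "is", "and", "to", "in"})


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Drop tokens whose lowercase form appears in the stop-word set."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = ["The", "history", "of", "the", "web"]
content_tokens = remove_stopwords(tokens)  # ["history", "web"]
```

Removing such high-frequency words shrinks the index and keeps matching focused on the terms that carry meaning.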
