Text Pre-Processing

retriv provides several resources for multilingual text pre-processing, aimed at maximizing retrieval effectiveness.

Stemmers

Stemmers reduce words to their stem, base, or root form.
retriv supports the following stemmers:

  • snowball (default)
    The following languages are supported by Snowball Stemmer: Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.
    To select your preferred language, simply use <language>.
  • arlstem (Arabic)
  • arlstem2 (Arabic)
  • cistem (German)
  • isri (Arabic)
  • krovetz (English)
  • lancaster (English)
  • porter (English)
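
To make the idea of stemming concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is a toy written for illustration only: the `toy_stem` function and its `SUFFIXES` rule set are hypothetical and are not Snowball, Porter, or any other stemmer listed above, all of which apply far more careful, language-specific rules.

```python
# Toy suffix-stripping stemmer, for illustration only.
# It is NOT Snowball, Porter, or any stemmer shipped with retriv.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule set, checked in order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["Jumping", "jumped", "jumps", "jump"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['jump', 'jump', 'jump', 'jump']
```

Mapping all four inflected forms to the same stem is what lets a query for "jump" match documents that only contain "jumping" or "jumped".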

Tokenizers

Tokenizers divide a string into smaller units, such as words.
retriv supports the following tokenizers:
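
As a general illustration of what tokenization does, the sketch below contrasts two common strategies using only the Python standard library. The functions `whitespace_tokenize` and `word_punct_tokenize` are hypothetical helpers written for this example; they are not retriv's tokenizers and may behave differently from them.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def word_punct_tokenize(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


text = "Hello, world! It's 2024."
print(whitespace_tokenize(text))
# ['Hello,', 'world!', "It's", '2024.']
print(word_punct_tokenize(text))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']
```

The choice matters for retrieval: with whitespace splitting, "world!" and "world" are different tokens unless punctuation is removed in a separate step.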

Stop-word Lists

retriv supports stop-word lists for the following languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, and Turkish.
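
Stop-word removal itself is a simple filter over the tokens. The sketch below shows the idea with a deliberately tiny, hypothetical English list (`STOPWORDS`) and a hypothetical helper (`remove_stopwords`); the lists retriv actually uses are much longer and are not reproduced here.

```python
# Hypothetical, deliberately tiny stop-word list for illustration.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "is", "in"})


def remove_stopwords(tokens):
    """Drop tokens whose lowercase form appears in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = "The history of the Roman Empire".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['history', 'Roman', 'Empire']
```

Comparing the lowercase form means "The" and "the" are both removed, while the surviving tokens keep their original casing.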