Tokenizer while indexing #1262

Answered by lintool
chenyn66 asked this question in Q&A
Sep 12, 2022 · 1 comment · 1 reply
The current solution is to generate a new version of the corpus that has already been tokenized, and then pass the -pretokenized option during indexing. Note that queries must be pretokenized in the same way, with the same tokenizer, so that query terms match the indexed terms.

For a full example, see https://github.com/castorini/anserini/blob/master/docs/regressions-msmarco-passage-wp.md - this is MS MARCO with WordPiece tokenization.
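
As a rough illustration of the "generate a pretokenized corpus" step, here is a minimal Python sketch. It assumes the corpus is in JSON Lines form with "id" and "contents" fields; the helper names (simple_tokenize, pretokenize_record, pretokenize_jsonl, pretokenize_query) are made up for this example and are not part of Anserini. The simple_tokenize function is only a stand-in: in practice you would pass your own tokenizer (for example, a WordPiece tokenizer) as the tokenize argument.

```python
import json
import re
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], List[str]]


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (an assumption for this sketch): lowercase the
    text and keep runs of ASCII letters and digits. Swap in your own."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pretokenize_record(record: dict, tokenize: Tokenizer) -> dict:
    """Replace the 'contents' field with its tokens joined by single spaces."""
    tokens = tokenize(record["contents"])
    return {"id": record["id"], "contents": " ".join(tokens)}


def pretokenize_jsonl(lines: Iterable[str], tokenize: Tokenizer) -> List[str]:
    """Pretokenize every non-empty JSON line and return the new JSON lines."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        out.append(json.dumps(pretokenize_record(json.loads(line), tokenize)))
    return out


def pretokenize_query(query: str, tokenize: Tokenizer) -> str:
    """Queries go through the same tokenizer as the corpus."""
    return " ".join(tokenize(query))


if __name__ == "__main__":
    corpus = ['{"id": "d1", "contents": "Tokenizers, while indexing!"}']
    for line in pretokenize_jsonl(corpus, simple_tokenize):
        print(line)
    print(pretokenize_query("Indexing tokenizers?", simple_tokenize))
```

The rewritten lines can then be saved as the new corpus and indexed with -pretokenized; the linked regression page shows the actual commands.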

Answer selected by chenyn66
This discussion was converted from issue #1261 on September 12, 2022 01:07.