Hi,
Answered by lintool on Sep 12, 2022
Answer selected by
chenyn66
The current solution is to generate a new version of the corpus that has already been tokenized, and then use the `-pretokenized` option during indexing. Note that queries need to be pretokenized in the same way.

For a full example, see https://github.com/castorini/anserini/blob/master/docs/regressions-msmarco-passage-wp.md, which covers MS MARCO with WordPiece tokenization.
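As a rough sketch of the "generate a pretokenized corpus" step, the snippet below rewrites a JSON Lines corpus so that each document's text becomes a space-joined token string. It assumes documents have `id` and `contents` fields; the function names and `toy_tokenize` are hypothetical and only for illustration. `toy_tokenize` is a lowercase whitespace splitter standing in for a real tokenizer, so swap in your own callable (for WordPiece, something like the `tokenize` method of a BERT tokenizer) that returns a list of token strings.

```python
import json
from typing import Callable, Iterable, Iterator, List


def pretokenize_records(
    records: Iterable[dict],
    tokenize: Callable[[str], List[str]],
) -> Iterator[dict]:
    """Yield copies of each record with "contents" replaced by space-joined tokens."""
    for record in records:
        tokens = tokenize(record["contents"])
        # All other fields (e.g. "id") are carried over unchanged.
        yield {**record, "contents": " ".join(tokens)}


def pretokenize_jsonl(
    src_path: str,
    dst_path: str,
    tokenize: Callable[[str], List[str]],
) -> None:
    """Read one JSON object per line from src_path and write the pretokenized version."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        records = (json.loads(line) for line in src if line.strip())
        for record in pretokenize_records(records, tokenize):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): lowercase and split on whitespace, NOT WordPiece."""
    return text.lower().split()
```

The same tokenizer should be applied to the queries before searching, so that query terms match the tokens stored in the index. The rewritten corpus is then what gets indexed with `-pretokenized`.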