Commit 56b2ffc

japanese tok bugfix

1 parent: 35b901d

File tree: 1 file changed (+1, -1 lines)


src/datatrove/utils/word_tokenizers.py

Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@ def tokenizer(self):
 
     def _do_tokenize(self, text: str):
         # japanese has a max byte length
-        texts = [text] if self.language != "ja" else chunk_text_on_bytes(text, 49000)
+        texts = [text] if self.language != "ja" else chunk_text_on_bytes(text, 48050)
         self.tokenizer.max_length = len(text)
         return [self.tokenizer(t, disable=["parser", "tagger", "ner"]) for t in texts]
 
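Context for the change: spaCy's Japanese pipeline delegates tokenization to SudachiPy, which errors out once a single input exceeds 49149 bytes, so Japanese text has to be pre-split into byte-limited chunks before tokenizing. This commit lowers the chunk budget from 49000 to 48050 bytes, presumably to leave more headroom under that limit. The diff doesn't show `chunk_text_on_bytes` itself; below is a minimal sketch of what such a helper might look like, assuming it splits on character boundaries so no multi-byte UTF-8 character is cut in half (an illustration, not datatrove's actual implementation):

def chunk_text_on_bytes(text: str, max_bytes: int) -> list[str]:
    """Split `text` into chunks whose UTF-8 encoding is at most `max_bytes`.

    Hypothetical sketch of a byte-budget chunker, not datatrove's actual
    helper. Splits on character boundaries, so multi-byte characters
    (the norm in Japanese) are never cut in half.
    """
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        # Flush the current chunk before it would exceed the byte budget.
        if current and used + size > max_bytes:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += size
    if current:
        chunks.append("".join(current))
    return chunks

With a 48050-byte budget, each chunk's UTF-8 encoding stays well below SudachiPy's 49149-byte ceiling, even if the real helper measures chunk boundaries slightly differently.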
