Commit 06f5db2

jpn word_tokenize
1 parent 8eb6e16 commit 06f5db2

File tree

1 file changed (+1, -1)

src/datatrove/utils/word_tokenizers.py

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ def chunk_text_on_bytes(text: str, max_chunk_size: int = 1_000_000):
     def __utf8len(s: str):
         return len(s.encode("utf-8"))
 
-    factor = len(text) / __utf8len(text)
+    factor = len(text) / __utf8len(text) if __utf8len(text) > 0 else 1
     increase_by = int(max(min(max_chunk_size * 0.1, 10), 1))
     initial_size_guess = int(max(max_chunk_size * factor - 10, 1))
     final_list = []
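
The change guards `factor` against division by zero: for an empty string, `__utf8len(text)` returns 0, so the old expression raised ZeroDivisionError. `factor` is the ratio of character count to UTF-8 byte length (1.0 for pure ASCII, roughly 1/3 for Japanese text, where most characters encode to 3 bytes), which `chunk_text_on_bytes` uses to turn the byte budget `max_chunk_size` into an initial chunk-size guess in characters. A minimal standalone sketch of the patched expression (the helper name `char_to_byte_factor` is illustrative, not part of datatrove):

def char_to_byte_factor(text: str) -> float:
    """Ratio of character count to UTF-8 byte length.

    1.0 for pure ASCII; about 1/3 for Japanese, since most characters
    encode to 3 UTF-8 bytes. The `else 1` branch mirrors this commit's
    fix: an empty string has byte length 0, and the unguarded division
    raised ZeroDivisionError.
    """
    byte_len = len(text.encode("utf-8"))
    return len(text) / byte_len if byte_len > 0 else 1

print(char_to_byte_factor("hello"))   # 1.0  (1 byte per char)
print(char_to_byte_factor("日本語"))   # 0.333...  (3 bytes per char)
print(char_to_byte_factor(""))        # 1  (previously: ZeroDivisionError)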
