fix showing unknown token at gpt_tokenize#801
Open
katsu560 wants to merge 11 commits into
…to fixunknowntoken
ggerganov
approved these changes
Jun 16, 2024
Member
ggerganov
left a comment
Can be simplified - see comments
Comment on lines +343 to +344
auto unk = word.substr(i, 1).data();
unknown.push_back(*unk);
Member
Isn't this just:
Suggested change
- auto unk = word.substr(i, 1).data();
- unknown.push_back(*unk);
+ unknown.push_back(word[i]);
Comment on lines +351 to +352
std::string unkstr(unknown.begin(), unknown.end());
fprintf(stderr, "%s: unknown token '%s'\n", __func__, unkstr.data());
Member
Suggested change
- std::string unkstr(unknown.begin(), unknown.end());
- fprintf(stderr, "%s: unknown token '%s'\n", __func__, unkstr.data());
+ fprintf(stderr, "%s: unknown token '%s'\n", __func__, unknown.data());
Comment on lines +334 to +335
std::string unkstr(unknown.begin(), unknown.end());
fprintf(stderr, "%s: unknown token '%s'\n", __func__, unkstr.data());
Member
Suggested change
- std::string unkstr(unknown.begin(), unknown.end());
- fprintf(stderr, "%s: unknown token '%s'\n", __func__, unkstr.data());
+ fprintf(stderr, "%s: unknown token '%s'\n", __func__, unknown.data());
Comment on lines +323 to +325
// unknown token
std::vector<char> unknown;
unknown.clear();
Member
Suggested change
- // unknown token
- std::vector<char> unknown;
- unknown.clear();
+ // unknown token
+ std::vector<char> unknown;
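One detail to keep in mind when printing the buffer directly: the storage returned by `std::vector<char>::data()` is not guaranteed to end with a null byte, so a plain `%s` may read past the end unless a terminator is appended first. A minimal sketch of one alternative, passing the length explicitly through `%.*s`, follows. The helper name `format_unknown` and the fixed 256-byte buffer are assumptions made for illustration; they are not part of this PR.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not part of the PR): build the log message from a
// byte buffer that is NOT null-terminated by passing its length via "%.*s".
std::string format_unknown(const std::vector<char> & unknown) {
    char buf[256];
    // data() may be null for an empty vector, so substitute an empty string.
    const char * bytes = unknown.empty() ? "" : unknown.data();
    int n = std::snprintf(buf, sizeof(buf), "unknown token '%.*s'",
                          (int) unknown.size(), bytes);
    if (n < 0) {
        return std::string(); // encoding error reported by snprintf
    }
    if ((std::size_t) n >= sizeof(buf)) {
        n = (int) sizeof(buf) - 1; // output was truncated to fit buf
    }
    return std::string(buf, (std::size_t) n);
}
```

Building a `std::string` from the vector, as the original lines do, is also safe; the explicit-length form simply avoids the extra copy.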
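In the current implementation, gpt_tokenize() prints each byte of a multi-byte character as a separate unknown token when it hits an unknown token, as in the "original" output below.
I fixed this so that the bytes are no longer printed one at a time; the unknown token is printed once, as in the "fixed" output below.
Please confirm.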
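The idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual gpt_tokenize() code: the function name `collect_unknown_runs`, its parameters, and the greedy longest-match loop are hypothetical. Bytes that match no vocabulary entry are buffered, and each contiguous run is reported once instead of once per byte.

```cpp
#include <cstddef>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not the real gpt_tokenize(): greedy longest-match
// tokenization of one word. Bytes that match no token are buffered, and each
// contiguous run is logged and returned as a single string.
std::vector<std::string> collect_unknown_runs(
        const std::string & word,
        const std::map<std::string, int> & token_to_id,
        std::vector<int> & tokens) {
    std::vector<std::string> runs;
    std::string unknown; // bytes of the current unmatched run

    // Report the buffered run (if any) once, then reset the buffer.
    auto flush = [&]() {
        if (!unknown.empty()) {
            std::fprintf(stderr, "%s: unknown token '%s'\n",
                         "collect_unknown_runs", unknown.c_str());
            runs.push_back(unknown);
            unknown.clear();
        }
    };

    std::size_t i = 0;
    while (i < word.size()) {
        bool matched = false;
        // Try the longest candidate first, shrinking one byte at a time.
        for (std::size_t len = word.size() - i; len > 0; --len) {
            auto it = token_to_id.find(word.substr(i, len));
            if (it != token_to_id.end()) {
                flush();
                tokens.push_back(it->second);
                i += len;
                matched = true;
                break;
            }
        }
        if (!matched) {
            unknown.push_back(word[i]); // keep the byte, do not print yet
            ++i;
        }
    }
    flush();
    return runs;
}
```

With a vocabulary that lacks the full-width question mark (UTF-8 bytes EF BC 9F), the three bytes end up in one run and produce one log line rather than three.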
-- detail --
original:
$ ./240407up/gpt-neox.org --repeat-last-n 256 --repeat-penalty 1.2 -m models/cyberagent/ggml-calm-1b-q4_0.bin -s 7654321 -p "日本で一番高い山は何ですか?"
main: seed = 7654321
gpt_neox_model_load: loading model from 'models/cyberagent/ggml-calm-1b-q4_0.bin' - please wait ...
gpt_neox_model_load: n_vocab = 52096
gpt_neox_model_load: n_ctx = 2048
gpt_neox_model_load: n_embd = 2048
gpt_neox_model_load: n_head = 16
gpt_neox_model_load: n_layer = 24
gpt_neox_model_load: n_rot = 128
gpt_neox_model_load: par_res = 0
gpt_neox_model_load: ftype = 2002
gpt_neox_model_load: qntvr = 2
gpt_neox_model_load: ggml ctx size = 1917.12 MB
gpt_neox_model_load: memory_size = 384.00 MB, n_mem = 49152
gpt_neox_model_load: .................................... done
gpt_neox_model_load: model size = 764.92 MB / num tensors = 292
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token ' '
main: number of tokens in prompt = 6
main: token[0] = 5619, 日本で
main: token[1] = 3300, 一番
main: token[2] = 1737, 高い
main: token[3] = 14218, 山は
main: token[4] = 37814, 何で
main: token[5] = 24250, すか
日本で一番高い山は何ですか?」。そんな質問を何度か受けてきましたが、 ...
fixed:
$ ./240407up/gpt-neox.mod --repeat-last-n 256 --repeat-penalty 1.2 -m models/cyberagent/ggml-calm-1b-q4_0.bin -s 7654321 -p "日本で一番高い山は何ですか?"
main: seed = 7654321
gpt_neox_model_load: loading model from 'models/cyberagent/ggml-calm-1b-q4_0.bin' - please wait ...
gpt_neox_model_load: n_vocab = 52096
gpt_neox_model_load: n_ctx = 2048
gpt_neox_model_load: n_embd = 2048
gpt_neox_model_load: n_head = 16
gpt_neox_model_load: n_layer = 24
gpt_neox_model_load: n_rot = 128
gpt_neox_model_load: par_res = 0
gpt_neox_model_load: ftype = 2002
gpt_neox_model_load: qntvr = 2
gpt_neox_model_load: ggml ctx size = 1917.12 MB
gpt_neox_model_load: memory_size = 384.00 MB, n_mem = 49152
gpt_neox_model_load: .................................... done
gpt_neox_model_load: model size = 764.92 MB / num tensors = 292
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
gpt_tokenize: unknown token '?'
main: number of tokens in prompt = 6
main: token[0] = 5619, 日本で
main: token[1] = 3300, 一番
main: token[2] = 1737, 高い
main: token[3] = 14218, 山は
main: token[4] = 37814, 何で
main: token[5] = 24250, すか
日本で一番高い山は何ですか?」。そんな質問を何度か受けてきましたが、 ...