Replies: 1 comment
-
Issue #71 is looking into that.
If I'm getting it right, your suggestion is to reinitialize the model and use the last set of generated tokens as a prompt to resume the interaction. This approach might work, but it would consume a lot of computation time.
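To make the suggestion above concrete, here is a minimal sketch of the "reinitialize and re-prompt" idea. Everything here is hypothetical illustration, not this project's API: `resume_with_tail` and the `keep_ratio` parameter are made up for the example; a real implementation would feed the returned tokens back through a fresh model instance.

```python
# Hypothetical sketch: after resetting the model, keep only the most recent
# tokens of the conversation and use them as the new prompt.
# `resume_with_tail` and `keep_ratio` are illustrative names, not a real API.

def resume_with_tail(history, ctx_len, keep_ratio=0.5):
    """Return the tail of the token history to re-use as the new prompt."""
    keep = int(ctx_len * keep_ratio)
    return history[-keep:]

# Example: with a 2048-token context, keep the last 1024 tokens.
tail = resume_with_tail(list(range(3000)), ctx_len=2048)
```

The cost this thread is worried about comes from re-evaluating those kept tokens from scratch: the model has no stored hidden states for them after a reset, so the whole tail must go through a full forward pass before generation can resume.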
-
When the bot interaction reaches the context length, the program exits.
As far as I understand how the transformer architecture is used here, it stores the hidden states of the transformer as new tokens come in. So when the context length is reached, one would need to discard some tokens at the beginning and restart the inference. Can we implement something like this?
I understand this would mean a noticeable delay, since we would need to re-run inference over a good portion of the chat. Could we maybe start this inference in the background while the user is still chatting within the allowed context length?
If not, could we at least start the new inference at the point where user input is requested, so as to hide at least some of the delay from the user?
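The overlap idea in the last two paragraphs can be sketched with a worker thread: kick off the re-evaluation of the truncated context while the program is waiting for user input, and only join the worker when the next reply has to be generated. This is an illustrative sketch, assuming a stand-in `evaluate` callable in place of the real forward pass; none of these names come from the project.

```python
import threading

# Hypothetical sketch: hide (part of) the re-inference delay by running it
# concurrently with user input. `evaluate` stands in for a full forward pass
# that rebuilds the hidden states for the truncated token window.

def start_background_reinfer(tokens, evaluate):
    """Start re-evaluating the truncated context in a background thread."""
    worker = threading.Thread(target=evaluate, args=(tokens,))
    worker.start()
    return worker

# Simulate the forward pass by recording which tokens were evaluated.
evaluated = []
worker = start_background_reinfer([1, 2, 3], evaluated.extend)
worker.join()  # in practice: join only when the next reply must be generated
```

Whether this actually hides the delay depends on how long the user takes to type versus how long the re-evaluation runs; in the worst case the join still blocks, which matches the fallback in the last paragraph above.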