-
If the article content is placed before the question in the prompt, then the cache miss will be just for the question. I.e.:

# these prompts are slow
[question 1] [article]
[question 2] [article]
...

# these prompts are fast
[article] [question 1]
[article] [question 2]
...

You can also set the slot_id in the request explicitly and perform any kind of slot mapping that you need.
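A minimal sketch of the fast ordering against llama-server's `/completion` endpoint (assuming a server on localhost:8080; `cache_prompt` and `id_slot` are the field names in recent server builds, `slot_id` being the older spelling; the article file and questions are placeholders):

```python
import requests

SERVER = "http://localhost:8080"  # llama-server address (placeholder)

article = open("article.txt").read()  # the shared, expensive prefix (placeholder file)
questions = ["What is the main claim?", "Who is the author?"]

for q in questions:
    # [article][question]: the article prefix is identical across requests,
    # so with cache_prompt the server only re-evaluates the question tokens.
    r = requests.post(f"{SERVER}/completion", json={
        "prompt": article + "\n\n" + q,
        "cache_prompt": True,  # reuse the longest matching KV-cache prefix
        "id_slot": 0,          # pin to one slot so other traffic can't evict the prefix
        "n_predict": 128,
    })
    print(r.json()["content"])
```

Pinning the slot is what makes the prefix reuse deterministic when the server is handling other requests in parallel.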
-
My questions are dynamic (they are ordered by the dynamic part, but still) and longer than the article. If I have 2 slots and 10 users chatting, then AFAIK I would get a lot of cache misses, as each user's context will have a different beginning. I would think that if I could set a cache_prompt_name filled with a user id or something (basically a client-provided key), then as long as I stay within the context(/memory) length I would get an almost 100% cache-hit rate.
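Until something like cache_prompt_name exists, that user-to-cache mapping can be done client-side via the explicit slot id mentioned above. A sketch, assuming an LRU eviction policy (all names and the policy here are made up, not llama.cpp API):

```python
from collections import OrderedDict

N_SLOTS = 2
user_to_slot = OrderedDict()  # user_id -> slot id, kept in least-recently-used order

def slot_for(user_id: str) -> int:
    """Return the slot to pass as id_slot for this user's next request."""
    if user_id in user_to_slot:
        user_to_slot.move_to_end(user_id)  # mark as most recently used
        return user_to_slot[user_id]
    if len(user_to_slot) < N_SLOTS:
        slot = len(user_to_slot)                    # a slot is still free
    else:
        _, slot = user_to_slot.popitem(last=False)  # evict the LRU user
    user_to_slot[user_id] = slot
    return slot
```

An evicted user pays one full prompt re-evaluation on their next message; everyone still mapped keeps hitting their cached prefix.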
-
Ok, for example: 2 slots, 3 users.

User1 / slot 1: Hello
User1 / slot 1: Hello Fine How are you

Now imagine the context being 8k tokens, just being added onto at the end. Basically, picture conversation mode (from llama-cli) with 2 slots and more users than slots. My thought is to just create a KV-cache on disk with an added name that you swap in/out, thinking that saving/restoring would be faster than regenerating.
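llama-server does expose per-slot save/restore over HTTP when started with `--slot-save-path <dir>`, which is essentially this swap-in/out; a sketch, where the per-user filename scheme and the occupancy bookkeeping are my own assumptions:

```python
import requests

SERVER = "http://localhost:8080"     # llama-server started with --slot-save-path <dir>
occupant = {0: "user1", 1: "user2"}  # which user's context currently lives in each slot

def swap_in(user_id: str, slot: int) -> None:
    """Park the current occupant's KV-cache on disk, then load this user's."""
    # Filenames follow a made-up per-user scheme; the server reads/writes them
    # relative to the directory given by --slot-save-path.
    requests.post(f"{SERVER}/slots/{slot}?action=save",
                  json={"filename": f"{occupant[slot]}.bin"})
    requests.post(f"{SERVER}/slots/{slot}?action=restore",
                  json={"filename": f"{user_id}.bin"})
    occupant[slot] = user_id
```

Restoring a file that was never saved will fail, so a first-contact path that skips the restore (and just evaluates the prompt from scratch) is still needed.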
-
Preface: I have a high-throughput workflow that asks multiple questions / question types from one starting point. For example, I take one article/product from a database and then ask an LLM 10 questions about that article/product.
Currently you get a lot of KV-cache misses if you run this workflow one article/product at a time, as question 2 will start differently from question 1.
The best way to make this work with KV-caching is to take 1000 articles, ask q1 over all of them so you get the best use of the KV-cache, then run q2 over all 1000, etc.
Because you don't know up front which slot a question will go into, I can't save/restore the KV-cache by slot.
Now I was thinking: would it be possible to give slots names or something like that, so that I can categorise my q1/q2/q3 when calling the server, and it can restore/save a specific KV-cache regardless of which slot the question lands in?
I believe this could also be beneficial for chat apps, as you could simply keep a KV-cache per user id, for example.
Or perhaps it could be a cache_prompt_name option on the completion endpoint, so that you can name the KV-cache regardless of which slot it ends up in.
It would just be an option sitting between the slot save/restore endpoints and the completions endpoint, and I would think it would increase KV-cache reuse.
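For completeness, a sketch of the batched ordering described above, with question-first prompts grouped by question so that consecutive requests share the question prefix (server address, data, and parameters are placeholders):

```python
import requests

SERVER = "http://localhost:8080"  # llama-server address (placeholder)

# Placeholder data: in the real workflow these come from the database.
articles = ["article text 1", "article text 2", "article text 3"]
questions = ["Summarise this product.", "List the key specifications."]

for q in questions:           # outer loop: one question type at a time
    for article in articles:  # inner loop: all articles for that question
        # With [question][article] ordering, the question prefix is shared
        # across the inner loop, so the cache miss is limited to the article.
        r = requests.post(f"{SERVER}/completion", json={
            "prompt": q + "\n\n" + article,
            "cache_prompt": True,
            "id_slot": 0,  # pin the slot so the shared prefix stays resident
            "n_predict": 128,
        })
        print(r.json()["content"])
```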