-
If the article content is placed before the question in the prompt, then the cache miss will be just for the question. I.e.:

# these prompts are slow
[question 1] [article]
[question 2] [article]
...

# these prompts are fast
[article] [question 1]
[article] [question 2]
...

You can also set the slot_id in the request explicitly and perform any kind of slot mapping that you need.
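A minimal sketch of the fast ordering against llama-server's `/completion` endpoint (assuming a server on localhost:8080; `cache_prompt` and `id_slot` are the field names in recent server builds, `slot_id` being the older spelling; the article file and questions are placeholders):

```python
import requests

SERVER = "http://localhost:8080"  # llama-server address (placeholder)

article = open("article.txt").read()  # the shared, expensive prefix (placeholder file)
questions = ["What is the main claim?", "Who is the author?"]

for q in questions:
    # [article][question]: the article prefix is identical across requests,
    # so with cache_prompt the server only re-evaluates the question tokens.
    r = requests.post(f"{SERVER}/completion", json={
        "prompt": article + "\n\n" + q,
        "cache_prompt": True,  # reuse the longest matching KV-cache prefix
        "id_slot": 0,          # pin to one slot so other traffic can't evict the prefix
        "n_predict": 128,
    })
    print(r.json()["content"])
```

Pinning the slot is what makes the prefix reuse deterministic when the server is handling other requests in parallel.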
-
My questions are dynamic (they are ordered by the dynamic part, but still) and longer than the article. If I have 2 slots and 10 users chatting, then AFAIK I would get a lot of cache misses, as each user's context will have a different beginning. I would think that if I could set a cache_prompt_name filled with a user id or something (basically a client-provided key), then as long as I stay within the context(/memory) length I would get an almost 100% cache-hit rate.
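Until something like cache_prompt_name exists, that user-to-cache mapping can be done client-side via the explicit slot id mentioned above. A sketch, assuming an LRU eviction policy (all names and the policy here are made up, not llama.cpp API):

```python
from collections import OrderedDict

N_SLOTS = 2
user_to_slot = OrderedDict()  # user_id -> slot id, kept in least-recently-used order

def slot_for(user_id: str) -> int:
    """Return the slot to pass as id_slot for this user's next request."""
    if user_id in user_to_slot:
        user_to_slot.move_to_end(user_id)  # mark as most recently used
        return user_to_slot[user_id]
    if len(user_to_slot) < N_SLOTS:
        slot = len(user_to_slot)                    # a slot is still free
    else:
        _, slot = user_to_slot.popitem(last=False)  # evict the LRU user
    user_to_slot[user_id] = slot
    return slot
```

An evicted user pays one full prompt re-evaluation on their next message; everyone still mapped keeps hitting their cached prefix.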
-
Ok, for example: 2 slots, 3 users.

User1 / slot 1: Hello
User1 / slot 1: Hello Fine How are you

Now imagine the context being 8k tokens, just being added onto at the end. Basically, picture conversation mode (from llama-cli) with 2 slots and more users than slots. My thought is to just create a KV-cache on disk with an added name that you swap in/out, thinking that saving/restoring would be faster than regenerating.
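llama-server does expose per-slot save/restore over HTTP when started with `--slot-save-path <dir>`, which is essentially this swap-in/out; a sketch, where the per-user filename scheme and the occupancy bookkeeping are my own assumptions:

```python
import requests

SERVER = "http://localhost:8080"     # llama-server started with --slot-save-path <dir>
occupant = {0: "user1", 1: "user2"}  # which user's context currently lives in each slot

def swap_in(user_id: str, slot: int) -> None:
    """Park the current occupant's KV-cache on disk, then load this user's."""
    # Filenames follow a made-up per-user scheme; the server reads/writes them
    # relative to the directory given by --slot-save-path.
    requests.post(f"{SERVER}/slots/{slot}?action=save",
                  json={"filename": f"{occupant[slot]}.bin"})
    requests.post(f"{SERVER}/slots/{slot}?action=restore",
                  json={"filename": f"{user_id}.bin"})
    occupant[slot] = user_id
```

Restoring a file that was never saved will fail, so a first-contact path that skips the restore (and just evaluates the prompt from scratch) is still needed.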
-
Preface: I have a high-throughput workflow that asks multiple questions / question types from one starting point. For example, I take one article/product from a database and then ask an LLM 10 questions about that article/product.
Currently you get a lot of KV-cache misses if you run this workflow one article/product at a time, as question 2 will start differently from question 1.
The best way to make this work with KV-caching is to take 1000 articles, ask q1 over all of them so you get the best use of the KV-cache, then run q2 over all 1000, etc.
Because you don't know up front which slot a question will go into, I can't save/restore the KV-cache by slot.
Now I was thinking: would it be possible to give slots names or something like that, so that I can categorise my q1/q2/q3 when calling the server, and it can restore/save a specific KV-cache regardless of which slot the question lands in?
I believe this could also be beneficial for chat apps, as you could simply keep a KV-cache per user id, for example.
Or perhaps it could be a cache_prompt_name option on the completion endpoint, so that you can name the KV-cache regardless of which slot it ends up in.
It would just be an option sitting between the slot save/restore endpoints and the completions endpoint, and I would think it would increase KV-cache reuse.
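For completeness, a sketch of the batched ordering described above, with question-first prompts grouped by question so that consecutive requests share the question prefix (server address, data, and parameters are placeholders):

```python
import requests

SERVER = "http://localhost:8080"  # llama-server address (placeholder)

# Placeholder data: in the real workflow these come from the database.
articles = ["article text 1", "article text 2", "article text 3"]
questions = ["Summarise this product.", "List the key specifications."]

for q in questions:           # outer loop: one question type at a time
    for article in articles:  # inner loop: all articles for that question
        # With [question][article] ordering, the question prefix is shared
        # across the inner loop, so the cache miss is limited to the article.
        r = requests.post(f"{SERVER}/completion", json={
            "prompt": q + "\n\n" + article,
            "cache_prompt": True,
            "id_slot": 0,  # pin the slot so the shared prefix stays resident
            "n_predict": 128,
        })
        print(r.json()["content"])
```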