MacOS: mlock/wired memory/cold start #9029
Replies: 3 comments
-
I've seen this behaviour as well on my Mac and I don't know how to fix it. It seems like some internal caching mechanism.
-
Got it, thank you.
-
I couldn't figure it out either, but for single-user setups sending warmup/keepalive queries seems to work reasonably well (see the sketch below).

keepalive_dst.mp4

Here we keep sending the context + message as we type, even if the message hasn't changed, and (at least for reasonably short contexts) we can get rid of both the model loading lag and the context encoding latency.
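A minimal sketch of that keepalive idea, assuming a local llama-server and its `/completion` endpoint; the `prompt`, `n_predict`, and `cache_prompt` fields exist in llama.cpp's server API, but the URL, file name, and interval below are placeholders:

```sh
#!/usr/bin/env bash
# Keepalive sketch: periodically re-send the current context so the model and
# the KV cache for that prefix stay warm between real requests.
SERVER="http://localhost:8080"   # placeholder: wherever llama-server listens
CONTEXT_FILE="context.txt"       # placeholder: current context + partial message

while true; do
  PROMPT=$(cat "$CONTEXT_FILE")
  # n_predict=0 generates no new tokens, so the request only (re)encodes the
  # prompt; cache_prompt=true lets the server reuse the already-encoded prefix.
  jq -n --arg p "$PROMPT" '{prompt: $p, n_predict: 0, cache_prompt: true}' |
    curl -s "$SERVER/completion" -H 'Content-Type: application/json' -d @- > /dev/null
  sleep 5                        # placeholder interval
done
```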
-
Let's say we start the llama server like this:
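The exact command isn't preserved in this thread; a representative invocation (the model path, context size, and port are placeholders, while `-m`, `-c`, and `--port` are real llama-server flags) might look like:

```sh
# Illustrative only -- the original command from the post is not shown.
# -m      GGUF model file (a 120GB+ model in this case)
# -c      small context window, as described below
# --port  HTTP port for the server API
./llama-server -m /models/large-model.gguf -c 2048 --port 8080
```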
It is a large model (120GB+) but with very short context. Hardware is M2 Ultra 192GB.
Now let's query it 3 times with the same prompt and no prompt cache:
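The requests themselves aren't shown either; assuming the server's `/completion` endpoint with `cache_prompt: false` so the prompt is not reused between calls (prompt text and `n_predict` are placeholders), they would look roughly like:

```sh
# Three back-to-back requests with the same prompt and prompt caching disabled,
# so each call has to encode the prompt from scratch.
for i in 1 2 3; do
  time curl -s http://localhost:8080/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Hello, how are you?", "n_predict": 32, "cache_prompt": false}' \
    > /dev/null
done
```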
I see the following times:
These are the timings from the server log itself:
First prompt processing is much slower (cold start?). We don't use the cache; we can see that from `kv cache rm [p0, end)` where `p0 = 0`. If I add a sleep between the calls like this:
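The sleep interval used in the post isn't preserved; the shape of the experiment, with a placeholder 60-second pause, would be:

```sh
# Same requests as above, but with a pause between them.
for i in 1 2 3; do
  time curl -s http://localhost:8080/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Hello, how are you?", "n_predict": 32, "cache_prompt": false}' \
    > /dev/null
  sleep 60   # placeholder; the actual interval is not shown in the post
done
```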
all three calls become equally slow:
Is there a way to avoid this cold start problem? Is there any way to keep the model always loaded (other than sending mock keepalive queries)?
With `--mlock` I see a difference in the reported system metrics (memory stays `wired`; without mlock, wired goes down to 0), but there's no measurable difference in latency. I think `llama-cli` has the same behavior; it's just easier to reproduce with server queries. Thank you!
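For reference, one way to watch the wired-memory behaviour described above on macOS is via `vm_stat` (the `--mlock` flag is a real llama-server option; the model path and port are placeholders):

```sh
# Start the server with --mlock so the weights are locked (wired) in RAM...
./llama-server -m /models/large-model.gguf -c 2048 --port 8080 --mlock &

# ...then watch macOS's wired-page counter; multiply by the page size
# (16384 bytes on Apple Silicon) to get bytes.
while true; do
  vm_stat | grep "Pages wired down"
  sleep 5
done
```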