Server creates a CPU buffer no matter the VRAM usage for 72B models #11012
DrVonSinistro
started this conversation in General
Replies: 1 comment · 1 reply
- QWEN2.5 32B Q8 loads fully into the GPU and creates something called CUDA_Host, which holds only a few MB of something. There is no significant CPU usage during prompt processing or inference.
QWEN2.5 72B at Q6, Q5, Q4 and Q2 also loads fully into the GPU, but even when VRAM is only half full, it always creates and fills a CPU buffer with 600-800 MB of something.
Then a single CPU core works flat out on that buffer during prompt processing and inference. It's very annoying. I've tried everything. Please send help.
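For context, a minimal sketch of what a "CUDA_Host" buffer most likely is: as the name suggests (and as I understand ggml's CUDA backend), it is page-locked (pinned) host memory allocated through the CUDA runtime, so it consumes system RAM rather than VRAM but can be copied to the GPU with fast asynchronous transfers. The snippet below is plain CUDA runtime API, not llama.cpp code, and the ~700 MB size is only an illustration:

```cuda
// Sketch only: a "CUDA_Host"-style buffer, assuming it means page-locked
// (pinned) host memory allocated through the CUDA runtime.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t size = 700ull * 1024 * 1024;  // ~700 MB, illustrative only
    void *host_buf = nullptr;

    // cudaMallocHost returns page-locked host RAM: it lives in system memory
    // (not VRAM), but the GPU can stream from/to it with async copies.
    cudaError_t err = cudaMallocHost(&host_buf, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("allocated %zu bytes of pinned host memory (uses RAM, not VRAM)\n", size);

    cudaFreeHost(host_buf);
    return 0;
}
```

If that reading is right, the buffer showing up even with plenty of free VRAM does not by itself mean layers were pushed back to the CPU; it only means some data is staged on the host side.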
- This is CUDA waiting for the device to finish work: https://forums.developer.nvidia.com/t/100-cpu-usage-when-running-cuda-code/35920 It's normal.
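To illustrate what the linked NVIDIA thread describes, here is a small, self-contained sketch (plain CUDA, not llama.cpp code): by default the runtime typically spin-waits on the host while the GPU is busy, so one CPU core shows ~100% usage even though it is doing no useful work. cudaDeviceScheduleBlockingSync is a standard runtime flag that trades that spin for a blocking wait, at some latency cost:

```cuda
// Sketch of why one CPU core looks busy during GPU work: the default
// synchronization mode spin-waits. Plain CUDA runtime API, not llama.cpp code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int iters) {
    float x = data[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        x = x * 1.000001f + 0.000001f;  // dependent chain so the kernel runs a while
    data[threadIdx.x] = x;
}

int main() {
    // Uncomment to make the host thread block/yield instead of spin-waiting.
    // Must be called before the CUDA context is created to take effect.
    // cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    float *d_data = nullptr;
    cudaMalloc(&d_data, 256 * sizeof(float));   // error checks omitted for brevity
    cudaMemset(d_data, 0, 256 * sizeof(float));

    busy_kernel<<<1, 256>>>(d_data, 100000000);

    // With the default flags this call typically spin-waits: one CPU core sits
    // near 100% until the kernel finishes, even though it does no real work.
    cudaDeviceSynchronize();

    printf("kernel done\n");
    cudaFree(d_data);
    return 0;
}
```

Whether and how llama.cpp exposes control over this behaviour is version-dependent; the sketch only shows the mechanism the forum post refers to.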