Multi-GPU setup & parallel decoding: sharing compute, not just VRAM #9364
ExtReMLapin started this conversation in General
Hello,
From my understanding, simple non-parallel decoding can't use multi-GPU compute power efficiently, because a prompt passes sequentially from one layer to the next, and therefore from one GPU to the next: at any moment only one GPU is actually computing.
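For concreteness, this is the kind of launch I mean (a sketch only; `model.gguf` is a placeholder, and the flags are the ones listed by `llama-server --help` at the time of writing):

```sh
# Offload all layers and split them across the available GPUs.
# With a layer split, a single request still walks the layers in order,
# so the GPUs take turns instead of working at the same time.
llama-server -m model.gguf -ngl 99 --split-mode layer
```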
However, again from my understanding, on a model hosted on a single GPU we can queue prompts (parallel decoding) to increase the GPU's total throughput.
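Again as a sketch under the same assumptions, single-GPU parallel decoding would look something like this:

```sh
# 4 parallel slots on one GPU: requests are batched into the same forward
# pass, so the GPU serves several sequences at once and throughput goes up.
# Note the total context (-c) is divided among the slots (4096 each here).
llama-server -m model.gguf -ngl 99 -np 4 -c 16384
```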
Shouldn't multi-GPU coupled with parallel decoding allow us to share compute power, since the model (i.e. the layers) is split across multiple GPUs? In other words, while GPU B runs the later layers for one batch of prompts, GPU A should already be able to start the earlier layers for the next batch.
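A toy schedule of the overlap I'm imagining, with two GPUs each holding half of a 32-layer model and two batches B1/B2 decoded in parallel (purely illustrative, not a claim about current llama.cpp behaviour):

```
time:    t0          t1           t2           t3
GPU A:   B1 L0-15    B2 L0-15     B1 L0-15     B2 L0-15
GPU B:   (idle)      B1 L16-31    B2 L16-31    B1 L16-31
```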
I'm not thinking about duplicating the layers on GPUs A and B, because we can already do that ourselves by simply starting llama-server twice (see the sketch below).
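That baseline, for reference (again just a sketch; the ports and device IDs are arbitrary):

```sh
# Full model duplicated per GPU: two independent servers, one per device.
# This doubles throughput but also doubles VRAM use, which is what I'd
# like to avoid.
CUDA_VISIBLE_DEVICES=0 llama-server -m model.gguf -ngl 99 --port 8080 &
CUDA_VISIBLE_DEVICES=1 llama-server -m model.gguf -ngl 99 --port 8081 &
```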