Multi-GPU setup & parallel decoding: sharing compute, not just VRAM #9364
ExtReMLapin started this conversation in General
Hello,
From my understanding, simple non-parallel decoding can't use multi-GPU compute power efficiently, because a prompt passes sequentially from one layer to the next, and therefore from one GPU to the next: at any moment only one GPU is actually computing.
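For concreteness, this is the kind of launch I mean (a sketch only; `model.gguf` is a placeholder, and the flags are the ones listed by `llama-server --help` at the time of writing):

```sh
# Offload all layers and split them across the available GPUs.
# With a layer split, a single request still walks the layers in order,
# so the GPUs take turns instead of working at the same time.
llama-server -m model.gguf -ngl 99 --split-mode layer
```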
However, again from my understanding, on a model hosted on a single GPU we can queue prompts (parallel decoding) to increase the GPU's total throughput.
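Again as a sketch under the same assumptions, single-GPU parallel decoding would look something like this:

```sh
# 4 parallel slots on one GPU: requests are batched into the same forward
# pass, so the GPU serves several sequences at once and throughput goes up.
# Note the total context (-c) is divided among the slots (4096 each here).
llama-server -m model.gguf -ngl 99 -np 4 -c 16384
```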
Shouldn't multi-GPU coupled with parallel decoding allow us to share compute power, since the model (i.e. the layers) is split across multiple GPUs? In other words, while GPU B runs the later layers for one batch of prompts, GPU A should already be able to start the earlier layers for the next batch.
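A toy schedule of the overlap I'm imagining, with two GPUs each holding half of a 32-layer model and two batches B1/B2 decoded in parallel (purely illustrative, not a claim about current llama.cpp behaviour):

```
time:    t0          t1           t2           t3
GPU A:   B1 L0-15    B2 L0-15     B1 L0-15     B2 L0-15
GPU B:   (idle)      B1 L16-31    B2 L16-31    B1 L16-31
```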
I'm not thinking about duplicating the layers on GPUs A and B, because we can already do that ourselves by simply starting llama-server twice (see the sketch below).
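That baseline, for reference (again just a sketch; the ports and device IDs are arbitrary):

```sh
# Full model duplicated per GPU: two independent servers, one per device.
# This doubles throughput but also doubles VRAM use, which is what I'd
# like to avoid.
CUDA_VISIBLE_DEVICES=0 llama-server -m model.gguf -ngl 99 --port 8080 &
CUDA_VISIBLE_DEVICES=1 llama-server -m model.gguf -ngl 99 --port 8081 &
```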