RPC offloading uses a local model copy #9740

alfrentgen · 2024-10-04T11:27:14Z

Hello!
First of all, I would like to say that the RPC server is a great feature. Thank you.
However, it takes too long to offload a part of the model over a 100 Mbit connection. I am thinking of buying a 1 Gbit switch now )
But I think it would be better to have an option in the RPC server to specify a copy of the model file kept on local storage. The server could then load its part of the model from that storage, avoiding the network transfer at the init stage.
Another idea is to have a cache in the RPC server. The cache could also be stored locally, either in RAM or on disk. It looks more complicated than the first idea.
I can try to implement RPC offloading with the local model copy. But I would like to get some hints on where to start looking in the code.

slaren (Collaborator) · 2024-10-04T19:00:52Z

It may be tricky because the backend interface has no concept of files. You could try caching the calls to the set_tensor function of the buffer interface in some cases, e.g. if the amount of data to transfer is large enough and the buffer has the flag GGML_BACKEND_BUFFER_USAGE_WEIGHTS. In those cases you could send a hash to the server so that it can try to load the data from its local cache. Don't do that in other cases, to avoid the latency of the additional round trip.
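The hash-then-fallback flow described above can be sketched roughly as follows. This is an illustrative Python model, not llama.cpp code: the `Server` class, `client_set_tensor`, `set_tensor_hash`, the SHA-256 choice, and the 10 MiB threshold are all assumptions made for the example, and the `is_weights` argument stands in for checking the GGML_BACKEND_BUFFER_USAGE_WEIGHTS flag.

```python
import hashlib

# Hypothetical threshold: below this size the extra round trip is not worth it.
HASH_THRESHOLD_BYTES = 10 * 1024 * 1024


class Server:
    """Stand-in for the RPC server, with an in-memory cache keyed by content hash."""

    def __init__(self):
        self.cache = {}    # hex digest -> bytes (a real server might keep this on disk)
        self.tensors = {}  # tensor name -> bytes currently stored in the buffer

    def set_tensor(self, name, data):
        # Normal path: the full payload arrives over the network and is cached.
        self.tensors[name] = data
        self.cache[hashlib.sha256(data).hexdigest()] = data

    def set_tensor_hash(self, name, digest):
        # Cache path: only the digest arrives; report whether it was a hit.
        data = self.cache.get(digest)
        if data is None:
            return False
        self.tensors[name] = data
        return True


def client_set_tensor(server, name, data, is_weights):
    """Upload one tensor and return the number of payload bytes that were sent."""
    if is_weights and len(data) >= HASH_THRESHOLD_BYTES:
        digest = hashlib.sha256(data).hexdigest()
        if server.set_tensor_hash(name, digest):
            return 0  # cache hit: only the hash crossed the network
    # Small tensors, non-weight buffers, and cache misses use the normal path.
    server.set_tensor(name, data)
    return len(data)
```

In this sketch a cache miss costs one extra round trip before the full upload, which is why small or non-weight transfers skip the hash step entirely.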

I would suggest first making structs for all the RPC protocol commands to make the code more readable and avoid mistakes. It could also be a good way to introduce yourself to the code.
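To illustrate what "a struct per command" buys, here is a small Python analogue using the standard `struct` module. The actual RPC code is C++, and the command name and fields below (`SetTensorHashRequest`, `tensor_id`, `offset`, `data_hash`) are hypothetical; the point is only that one fixed layout is shared by the packing and unpacking sides instead of hand-computed offsets.

```python
import struct
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class SetTensorHashRequest:
    """Hypothetical fixed-size request: fill a remote tensor from cached data."""

    # Little-endian, three unsigned 64-bit fields, 24 bytes in total.
    LAYOUT: ClassVar[struct.Struct] = struct.Struct("<QQQ")

    tensor_id: int  # hypothetical handle of the destination tensor
    offset: int     # byte offset inside the tensor
    data_hash: int  # 64-bit hash of the payload

    def pack(self) -> bytes:
        # Serialize the fields in declaration order.
        return self.LAYOUT.pack(self.tensor_id, self.offset, self.data_hash)

    @classmethod
    def unpack(cls, raw: bytes) -> "SetTensorHashRequest":
        # Raises struct.error if the buffer is not exactly 24 bytes long.
        return cls(*cls.LAYOUT.unpack(raw))
```

Because the size and field order live in one place, a malformed message fails at `unpack` rather than being silently misread.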

alfrentgen (Author) · 2024-10-05T17:52:20Z

This is what is going on during initialization on the 1Gbit network:

It takes about 5 minutes to start llama-cli with 30 Gbytes offloaded over the network.
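As a rough sanity check on that figure, the lower bound on transfer time can be computed from the link speed alone. The sketch below assumes "30 Gbytes" means 30 × 10^9 bytes and ignores protocol overhead and latency; `transfer_seconds` is a helper written for this example.

```python
GB = 10**9  # assumption: decimal gigabytes


def transfer_seconds(size_bytes, link_bits_per_second):
    """Lower bound on transfer time; ignores protocol overhead and latency."""
    return size_bytes * 8 / link_bits_per_second


one_gbit = transfer_seconds(30 * GB, 10**9)      # 1 Gbit/s link
hundred_mbit = transfer_seconds(30 * GB, 10**8)  # 100 Mbit/s link
```

Under these assumptions, 30 GB needs at least 240 seconds (4 minutes) at 1 Gbit/s, which is in line with the roughly 5 minutes reported, and at least 2400 seconds (40 minutes) at 100 Mbit/s.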

1 reply

Abdulhanan535 Oct 30, 2024

For me it takes 2 hours.

Abdulhanan535 · 2024-10-30T12:57:55Z

I think the best approach would be to download the whole model on both sides and then load only the specific layers the network tells you to.

1 reply

alfrentgen Nov 11, 2024
Author

It may reduce the usefulness of RPC when you want to distribute computation across machines that do not have enough VRAM/RAM to fit the whole model.

lexasub · 2025-02-04T12:38:54Z

@alfrentgen any updates?

Abdulhanan535 · 2025-02-04T15:57:55Z

No one fking cares here :\

rgerganov (Collaborator) · 2025-02-04T16:12:11Z

I will try to implement @slaren's idea; you can follow #10095 for details.

2 replies

slaren Feb 9, 2025
Collaborator

At the time I wrote that, we didn't have a very good way of adding special functionality to a backend without littering the code with #ifdefs, but I think it would be ok to add a custom function obtained through ggml_backend_reg_get_proc_address to load a tensor from a file. The llama.cpp model loader could use this function, if it is available for the backend owning the buffer type where a tensor needs to be loaded, and the RPC backend would then forward the call to the server. Other backends like CUDA could also implement this function to support cuFile/DirectStorage.
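The lookup-with-fallback pattern described above might look roughly like this. It is a Python model of the control flow only: the `Backend` class is a stand-in for a backend registry, its `get_proc_address` method merely mimics the by-name lookup that ggml_backend_reg_get_proc_address provides, and the function name `load_tensor_from_file` is hypothetical rather than an existing ggml symbol.

```python
# Hypothetical name for the optional function; not an existing ggml symbol.
LOAD_FROM_FILE = "load_tensor_from_file"


class Backend:
    """Stand-in for a backend that exposes optional functions by name."""

    def __init__(self, procs=None):
        self._procs = dict(procs or {})
        self.loaded = {}  # tensor name -> (how it was loaded, detail)

    def get_proc_address(self, name):
        # Returns None when the backend does not provide the function.
        return self._procs.get(name)

    def set_tensor(self, name, data):
        # Generic path: the caller already holds the bytes.
        self.loaded[name] = ("set_tensor", len(data))


def load_tensor(backend, name, path, offset, size, read_file):
    """Load one tensor, preferring the backend's own file loader if present."""
    load_fn = backend.get_proc_address(LOAD_FROM_FILE)
    if load_fn is not None:
        # The backend reads the data itself (for RPC: forwarded to the server).
        load_fn(backend, name, path, offset, size)
        return "file"
    # Fallback: the loader reads the bytes and pushes them through set_tensor.
    backend.set_tensor(name, read_file(path, offset, size))
    return "set_tensor"
```

Backends that do not register the optional function keep working unchanged, which is what avoids the #ifdefs mentioned above.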

However, I am not completely convinced that this would be a better solution, because to be completely sure that the server and the client have the same model file, you would still need to do a hash, so it wouldn't be any faster than implementing this logic transparently in the set_tensor function. We would probably need to have hashes pre-stored in the GGUF file to be able to implement this in a reliable way.

rgerganov Feb 10, 2025
Collaborator

Adding a custom backend function for loading a tensor from a file may also allow using mmap on the server side. @ggerganov had some ideas for leveraging this with the Metal backend.
