
Help managing GPU resources #22

Open
ldmoser opened this issue Jan 27, 2021 · 0 comments

ldmoser commented Jan 27, 2021

I'm sure this problem can get very complicated and may require custom implementations, but I was wondering whether you have plans or ideas on how to manage the limited GPU resources across all MLClient nodes instantiated in a Nuke scene.

Nuke MLClient nodes could be talking to the same or different classes on the MLServer side, using any kind of backend (PyTorch, TensorFlow, ...).
This is somewhat related to issue #21, but it goes beyond that because it deals with all the classes used in a Nuke session.

Here's a broad idea that may be a good discussion starter (a rough code sketch follows the list):

  1. Each instance of MLServer creates an LRU cache object that holds pre-trained models. It could have some options, like how many models the cache can hold, or how much GPU RAM it wants to guarantee is available at any given time.
  2. The MLServer exposes this cache object as an API to the model classes, which use it to register their model-loading method, to be called only if the model is not already in the cache. This custom method should return a reference to the pre-trained model and also its location (e.g. "gpu0").
  3. The LRU cache calls the custom function and catches out-of-memory exceptions raised during construction of the model. Such an exception could trigger a purge of the least recently used items in the cache, after which the model-loading method is retried.
  4. The LRU cache holds a reference to the object returned by the custom model-loading method (be it PyTorch, TensorFlow, whatever).
  5. It would also check the remaining GPU memory (on the corresponding device) after loading the model, and prune more models to guarantee enough free memory according to the options in item 1.
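
To make the idea more concrete, here is a rough Python sketch of what such a cache could look like. None of these names (`ModelCache`, `register_loader`, `get`) exist in MLServer today; they are placeholders, and the backend-specific bits (how to detect an out-of-memory error, how to query free GPU memory) are passed in as callables so the cache stays framework-agnostic:

```python
# Hypothetical sketch of the model cache described in items 1-5 above.
from collections import OrderedDict


class ModelCache:
    def __init__(self, max_models=4, min_free_bytes=0,
                 free_memory_fn=None, oom_exceptions=(RuntimeError, MemoryError)):
        self._max_models = max_models          # option from item 1
        self._min_free_bytes = min_free_bytes  # GPU RAM to keep available (item 1)
        self._free_memory_fn = free_memory_fn  # e.g. a wrapper around the backend's free-memory query
        self._oom_exceptions = oom_exceptions  # backend-specific OOM error types
        self._loaders = {}                     # key -> model-loading method (item 2)
        self._entries = OrderedDict()          # key -> (model, device), most recent last

    def register_loader(self, key, loader):
        """Register a callable returning (model, device), e.g. (net, "gpu0")."""
        self._loaders[key] = loader

    def get(self, key):
        """Return the cached model, loading (and evicting if needed) on a miss."""
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key][0]
        return self._load(key)

    def _load(self, key):
        loader = self._loaders[key]            # KeyError if never registered
        while True:
            try:
                model, device = loader()       # item 3: construction may raise OOM
            except self._oom_exceptions:
                if not self._entries:
                    raise                      # nothing left to evict, give up
                self._evict_oldest()           # purge least recently used item
                continue                       # retry the loading method
            break
        self._entries[key] = (model, device)   # item 4: hold a reference to the model
        self._enforce_limits(device)           # item 5: prune to the configured limits
        return model

    def _enforce_limits(self, device):
        while len(self._entries) > self._max_models:
            self._evict_oldest()
        if self._free_memory_fn is not None:
            while (self._free_memory_fn(device) < self._min_free_bytes
                   and len(self._entries) > 1):
                self._evict_oldest()

    def _evict_oldest(self):
        _, (model, _) = self._entries.popitem(last=False)
        del model  # drop the reference; the backend frees GPU memory when it is collected
```

A real implementation would have to plug in the right OOM exception types and free-memory query per backend (for PyTorch, something along the lines of `torch.cuda.mem_get_info`), but the evict-and-retry loop is the core of items 3-5, and keeping those details injectable is what lets one cache serve PyTorch and TensorFlow models side by side.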

It feels like this more general approach would make issue #21 irrelevant, and it would handle complex scenarios, including multi-GPU setups.

Do you see a benefit in adding something like that to the MLServer API?
