Unload model after being not used for some time #72
I think I'll have this feature implemented by EOW. For now, you can use the
I would like this behind a flag so I can keep the model loaded for improved latency.
Has anyone already found a solution to this problem, similar to how Ollama handles it? Unfortunately, my GPU also constantly draws around 45-50 W at idle, which becomes quite expensive to run at local electricity prices, especially if the model goes unused for an extended period :S
FYI, I'm currently working on this.
The feature has been implemented in #92 and is available on master. I'll wait a couple of days before making a new release with the changes. Any feedback would be appreciated!
Thanks, that was a quick one! Model unloading works, but some GPU memory is still used by the process, which keeps the GPU in the P0 state and continues to consume more power than necessary.
@hgruber I'm aware of this, but unfortunately it's an upstream issue that doesn't seem to be resolved yet. See [this](https://github.com/SYSTRAN/faster-whisper/issues/992). If you find a workaround, please LMK and I'll implement it.
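One workaround pattern sometimes used for this class of leak (not something this project currently implements, and the worker below is a stand-in rather than real faster-whisper code) is to run the GPU work in a short-lived child process, so the CUDA context and any residual allocations are released when that process exits:

```python
import multiprocessing as mp


def _worker(x, q):
    # Stand-in for "load model, run inference, return result". In the
    # real scenario the model would be constructed here, so all GPU
    # memory (including the CUDA context) dies with this process.
    q.put(x * 2)


def run_isolated(x):
    # "fork" keeps this example self-contained on Linux; a real CUDA
    # workload would generally need "spawn", since CUDA contexts do
    # not survive fork() cleanly.
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    p = ctx.Process(target=_worker, args=(x, q))
    p.start()
    result = q.get()
    p.join()
    return result
```

The trade-off is the process startup and model load cost on every call, which is the opposite of what the latency-sensitive users above want, so it would only make sense behind a flag.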
In addition to #66, it's probably also a good idea to unload the model after e.g. 300 s of inactivity.
For the Ollama server this is the default behavior. My graphics card draws 40 W more power whenever a model is loaded.
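The requested behavior can be sketched as a small idle-timeout wrapper. This is only a minimal illustration, not the implementation that later landed in #92; the class name and `load_fn` parameter are hypothetical, and the 300 s default mirrors the timeout suggested above.

```python
import threading
import time


class IdleModelUnloader:
    """Load a model lazily and drop it after a period of inactivity."""

    def __init__(self, load_fn, idle_timeout=300.0):
        self._load_fn = load_fn          # factory that constructs the model
        self._idle_timeout = idle_timeout
        self._model = None
        self._last_used = 0.0
        self._lock = threading.Lock()
        self._timer = None

    def get(self):
        # Load on first use, then refresh the idle clock on every access.
        with self._lock:
            if self._model is None:
                self._model = self._load_fn()
            self._last_used = time.monotonic()
            self._schedule_check()
            return self._model

    def _schedule_check(self):
        # Restart the countdown; must be called with the lock held.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self._idle_timeout, self._maybe_unload)
        self._timer.daemon = True
        self._timer.start()

    def _maybe_unload(self):
        with self._lock:
            idle = time.monotonic() - self._last_used
            if self._model is not None and idle >= self._idle_timeout:
                # Drop the reference so the model can be garbage-collected;
                # a real implementation may also need a framework-specific
                # cache flush to actually return VRAM (see the comments above).
                self._model = None
            else:
                self._schedule_check()
```

Note that, as discussed above, simply releasing the Python reference may still leave some GPU memory held by the process.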