
Unload model after being not used for some time #72

Closed
hgruber opened this issue Sep 4, 2024 · 7 comments
hgruber commented Sep 4, 2024

Unlike #66, it's probably also a good idea to unload the model after e.g. 300 s of not being used.
For the Ollama server this is the default behavior. My graphics card draws 40 W more power whenever a model is loaded.

@hgruber hgruber changed the title unload model after being not used for some time Unload model after being not used for some time Sep 4, 2024
fedirz (Collaborator) commented Sep 5, 2024

I think I'll have this feature implemented by EOW. For now, you can use the DELETE /api/ps/{model_name:path} route to manually offload the model.
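
For anyone who wants to script that in the meantime, here is a minimal sketch of calling the route from Python; the base URL and model name are assumptions about a typical local deployment, not project defaults:

```python
# Minimal sketch: manually offload a model via DELETE /api/ps/{model_name:path}.
# BASE_URL and MODEL_NAME are assumptions; adjust them to your deployment.
import requests

BASE_URL = "http://localhost:8000"               # assumed server address
MODEL_NAME = "Systran/faster-whisper-large-v3"   # example model name

response = requests.delete(f"{BASE_URL}/api/ps/{MODEL_NAME}")
response.raise_for_status()
print(f"Offloaded {MODEL_NAME} (HTTP {response.status_code})")
```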

samos123 (Contributor) commented Sep 5, 2024

I would like this behind a flag so I can keep the model loaded for improved latency.

Foddy commented Sep 29, 2024

Has anyone found a solution to this problem yet, similar to how Ollama handles it? Unfortunately, my GPU also constantly draws around 45–50 W at idle, which gets quite expensive at local electricity prices, especially if the model goes unused for extended periods. :S

fedirz (Collaborator) commented Sep 30, 2024

> Has anyone found a solution to this problem yet, similar to how Ollama handles it? Unfortunately, my GPU also constantly draws around 45–50 W at idle, which gets quite expensive at local electricity prices, especially if the model goes unused for extended periods. :S

FYI, I'm currently working on this.

fedirz (Collaborator) commented Oct 1, 2024

The feature has been implemented in #92 and is available on master. I'll wait a couple of days before making a new release with the changes. Any feedback would be appreciated!
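
For readers curious about the general technique (not the actual #92 implementation, which lives in the linked PR), idle-based unloading can be sketched as a timer that is reset on every use; `load_model` and `unload_model` below are hypothetical stand-ins:

```python
# Sketch of idle-based unloading: reset a timer on each use, unload on expiry.
# This is NOT the code from #92; load_model/unload_model are hypothetical stubs.
import threading
from typing import Optional


def load_model():
    """Hypothetical stand-in for the real model loader."""
    return object()


def unload_model(model) -> None:
    """Hypothetical stand-in for freeing the model's GPU memory."""
    del model


class IdleUnloader:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds          # e.g. the 300 s suggested above
        self._model = None
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def get(self):
        """Return the model, loading it on demand and refreshing the idle timer."""
        with self._lock:
            if self._model is None:
                self._model = load_model()
            self._reset_timer()
            return self._model

    def _reset_timer(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.ttl, self._unload)
        self._timer.daemon = True
        self._timer.start()

    def _unload(self) -> None:
        with self._lock:
            if self._model is not None:
                unload_model(self._model)
                self._model = None
```

Making `ttl_seconds` configurable (with, say, a sentinel value to disable unloading entirely) would also cover the keep-loaded-for-latency use case samos123 asked for.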

@fedirz fedirz closed this as completed Oct 1, 2024
hgruber (Author) commented Oct 2, 2024

Thanks, that was a quick one! Model unloading works, but some GPU memory is still held by the process, which keeps the GPU in the P0 state and continues to draw more power than necessary:

```
+-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      On  | 00000000:01:00.0 Off |                    0 |
| N/A   40C    P0              51W / 250W |    152MiB / 23040MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    2441259      C   ...ter-whisper-server/.venv/bin/python     150MiB |
+---------------------------------------------------------------------------------------+
```
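
If you want to confirm the P-state and power draw programmatically rather than eyeballing nvidia-smi, a small check with the nvidia-ml-py (pynvml) bindings looks like this; GPU index 0 is assumed, matching the output above:

```python
# Check P-state and power draw via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # GPU 0, as above
pstate = pynvml.nvmlDeviceGetPerformanceState(handle)  # 0 means P0
power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)      # milliwatts
print(f"P-state: P{pstate}, power draw: {power_mw / 1000:.1f} W")
pynvml.nvmlShutdown()
```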

fedirz (Collaborator) commented Oct 3, 2024

@hgruber I'm aware of this, but unfortunately it's an upstream issue that doesn't seem to be resolved yet; see [this faster-whisper issue](https://github.com/SYSTRAN/faster-whisper/issues/992). If you find a workaround, please let me know and I'll implement it.
