Hi there,
Is there a way to make Comfy treat VRAM the way e.g. Ollama does?
I would like to disable offloading to RAM but keep partial loading for large models. Flux.dev, for example, doesn't quite fit into my 24 GB of VRAM, but with partial loading it's still about as fast as (or faster than) Q8 GGUF or fp8_*, and the output quality is a lot better, especially when rendering images with text.
Currently Comfy does the following: Load text encoders to VRAM -> use text encoders -> offload text encoders to RAM -> partially load Flux (about 95%, hardly any speed reduction) -> render -> offload Flux to RAM -> repeat.
At the same time my OS (Linux) also caches the model files in RAM (in the OS's buff/cache), so the models end up cached in RAM twice.
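A quick way to see the duplication (a rough sketch that only reads /proc/meminfo; the loading step is just a placeholder, not ComfyUI code):

```python
def cached_mib():
    # Size of the Linux page cache ("Cached" in /proc/meminfo), in MiB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Cached:"):
                return int(line.split()[1]) // 1024  # the value is reported in kB
    return 0

before = cached_mib()
# ... read a large checkpoint here, e.g. open(path, "rb").read() ...
after = cached_mib()
print(f"page cache grew by ~{after - before} MiB, on top of the copy the process holds in RAM")
```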
What I'd like to see instead is: Load text encoders to VRAM -> use text encoders -> discard the text encoders from VRAM -> partially load Flux -> render -> discard Flux from VRAM / shared VRAM -> repeat.
That would not only be faster but also more memory efficient.
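To make the difference concrete, here is a rough PyTorch-level sketch of what I mean by "offload" versus "discard" (stand-in models, not ComfyUI internals):

```python
import gc
import torch

# Two stand-ins for a real text encoder (hypothetical sizes, just for illustration).
enc_a = torch.nn.Linear(4096, 4096).cuda()
enc_b = torch.nn.Linear(4096, 4096).cuda()

# "Offload" (what Comfy does today): the weights stay alive as a second copy in system RAM.
enc_a = enc_a.to("cpu")

# "Discard" (what I'd like as an option): drop the weights entirely and hand the VRAM back,
# accepting that they have to be re-read from disk the next time they're needed, which is
# cheap anyway because the file usually still sits in the OS page cache.
del enc_b
gc.collect()
torch.cuda.empty_cache()
```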
When I use the --highvram or --gpu-only switch I run out of memory on the device.
When I use the --disable-smart-memory switch it unloads models immediately after using them, but it still offloads them to RAM instead of discarding them.
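In case it helps to state the requested behaviour as code, here is a toy sketch of the loading policy I'm after (plain Python with hypothetical names, not ComfyUI's actual model management):

```python
import gc
import torch

class DiscardingModelCache:
    """Toy policy: only the model needed right now lives on the GPU; evicted models are
    dropped outright instead of being copied to system RAM (the OS page cache makes the
    reload from disk cheap)."""

    def __init__(self):
        self.loaded = {}  # name -> model currently on the GPU

    def get(self, name, load_fn):
        # Evict everything else first, freeing VRAM without keeping CPU copies around.
        for other in list(self.loaded):
            if other != name:
                self.discard(other)
        if name not in self.loaded:
            self.loaded[name] = load_fn().cuda()  # load_fn reads the weights from disk
        return self.loaded[name]

    def discard(self, name):
        del self.loaded[name]
        gc.collect()
        torch.cuda.empty_cache()
```

With something like this, the text encoders would be discarded the moment Flux is requested, instead of piling up in RAM next to the page cache.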
Thanks in advance,
Peter