[Core] Free KV cache GPU memory on engine shutdown #28953
+101
−38
Related to #24885
Addresses the `enable_multiprocessing=False` TODO in `tests/v1/shutdown/test_delete.py::test_llm_delete`.

The trickiest part is the single-process case ("inproc" engine and "uniproc" executor), where we can't rely on shutting down child processes to release GPU memory. To address that, we:

- call `engine_core.shutdown()` from `LLMEngine.__del__()`
- release GPU memory in `GPUWorker.shutdown()`

Other changes include:

- the `wait_for_gpu_memory_to_clear()` timeout parameter, to get a nice `Memory of devices not free` error
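
For reviewers unfamiliar with the single-process path, here is a minimal sketch of the pattern described above. It is not the vLLM code: `ToyWorker`, `ToyEngine`, and `wait_until_clear` are hypothetical names, and plain `bytearray` buffers stand in for GPU tensors so the example runs anywhere. It shows an idempotent `shutdown()` that is also invoked from `__del__()`, plus a polling helper that fails with a readable error once a timeout expires.

```python
import time
from typing import Callable, List


class ToyWorker:
    """Hypothetical stand-in for a worker that owns large cache buffers."""

    def __init__(self, num_blocks: int, block_bytes: int = 1024) -> None:
        # Assumption: bytearrays play the role of GPU-resident KV cache tensors.
        self.kv_caches: List[bytearray] = [
            bytearray(block_bytes) for _ in range(num_blocks)
        ]

    def used_bytes(self) -> int:
        return sum(len(buf) for buf in self.kv_caches)

    def shutdown(self) -> None:
        # Drop the references so the memory can be reclaimed in-process.
        self.kv_caches.clear()


class ToyEngine:
    """Hypothetical engine that runs its worker in the same process."""

    def __init__(self, worker: ToyWorker) -> None:
        self.worker = worker
        self._closed = False

    def shutdown(self) -> None:
        # Idempotent; also safe if __init__ failed before _closed was set.
        if getattr(self, "_closed", True):
            return
        self._closed = True
        self.worker.shutdown()

    def __del__(self) -> None:
        # No child process to kill here, so clean up explicitly.
        self.shutdown()


def wait_until_clear(
    get_used_bytes: Callable[[], int],
    threshold_bytes: int,
    timeout_s: float = 5.0,
    poll_s: float = 0.01,
) -> None:
    """Poll until usage drops to the threshold, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        used = get_used_bytes()
        if used <= threshold_bytes:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"memory not freed after {timeout_s}s: {used} bytes still in use"
            )
        time.sleep(poll_s)
```

Because `__del__()` only runs once the last reference to the object goes away, the sketch keeps `shutdown()` public and idempotent so callers can also release memory deterministically.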