Added CPU offloading #3452
Conversation
Wasn't there supposed to be a bunch of logging?
@@ -684,12 +678,17 @@ def compile(
    )

    gm = exported_program.module()
    # Move the weights in the state_dict to CPU
This comment isn't relevant here.
@@ -833,6 +833,7 @@ def contains_metadata(gm: torch.fx.GraphModule) -> bool:
        str(name),
        str(submodule.graph),
    )
    submodule.to(torch.cuda.current_device())
Use to_torch_device(settings.device) here.
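A minimal sketch of that suggestion, assuming to_torch_device is importable from torch_tensorrt.dynamo.utils (the import path is an assumption, not confirmed by this diff):

from torch_tensorrt.dynamo.utils import to_torch_device

# Instead of hard-coding the current CUDA device:
#     submodule.to(torch.cuda.current_device())
# honor the device configured in the compilation settings:
submodule.to(to_torch_device(settings.device))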
"The model is offloaded to CPU during compilation. If you want to keep the model on GPU, set offload_module_to_cpu=False." | ||
) |
Consider this message: "The PyTorch model was moved to the CPU to allocate all GPU memory to TensorRT. To retain the model on the GPU, set offload_module_to_cpu=False."
Also, there's one more thing we discussed: throw a warning if we predict an OOM when offload_module_to_cpu=False. This could be achieved by measuring the size of the PyTorch module and the available GPU memory.
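A rough sketch of such a check; the helper name and structure are illustrative only and not part of the PR:

import logging

import torch

logger = logging.getLogger(__name__)


def warn_if_likely_oom(module: torch.nn.Module) -> None:
    # Approximate the module's GPU footprint from its parameters and buffers.
    module_bytes = sum(p.numel() * p.element_size() for p in module.parameters())
    module_bytes += sum(b.numel() * b.element_size() for b in module.buffers())

    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    if module_bytes > free_bytes:
        logger.warning(
            "offload_module_to_cpu=False, but the module (%.2f GiB) may exceed "
            "free GPU memory (%.2f GiB); compilation could run out of memory.",
            module_bytes / 2**30,
            free_bytes / 2**30,
        )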
We need to test this change across our entire test suite to ensure it is working as expected.
)
else:
    remaining_memory, total_memory = torch.cuda.mem_get_info()
    if remaining_memory < total_memory / 2:
total_memory // 2
@@ -49,6 +49,7 @@
TILING_OPTIMIZATION_LEVEL = "none"
L2_LIMIT_FOR_TILING = -1
USE_DISTRIBUTED_MODE_TRACE = False
OFFLOAD_MODULE_TO_CPU = False
Run the whole test suite with this set to True. I think we discussed making True the default here. Since that would be a breaking change, we should mention it in the release notes.
Description
Added CPU offloading. Compilation now requires no more than 1x the model's GPU memory: before engine compilation, the model and graph module are moved to the CPU.
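A usage sketch of the new option, assuming it is exposed as the offload_module_to_cpu keyword of torch_tensorrt.compile (the flag name comes from the review comments above; the model and inputs below are placeholders):

import torch
import torch_tensorrt

model = MyModel().eval().cuda()  # placeholder model
example_inputs = [torch.randn(1, 3, 224, 224, device="cuda")]

# With offload_module_to_cpu=True, the PyTorch weights are moved to the CPU
# before engine compilation so TensorRT can use the freed GPU memory.
trt_module = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=example_inputs,
    offload_module_to_cpu=True,  # flag introduced by this PR
)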
Fixes # (issue)
Type of change
Please delete options that are not relevant and/or add your own.
Checklist: