Added CPU offloading #3452

Open · wants to merge 5 commits into main

Conversation

cehongwang
Collaborator

Description

Adds CPU offloading so that compilation uses no more than 1x GPU memory: before engine compilation, the PyTorch model and the graph module are moved to the CPU.
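
For context, a minimal sketch of the idea (hypothetical helper; the exact placement inside compile() may differ, and the offload_module_to_cpu name follows this PR):

import torch

# Sketch only: move the source model and the traced graph module to CPU so that
# TensorRT engine building can use (nearly) all of the GPU memory.
def maybe_offload_to_cpu(model: torch.nn.Module, gm: torch.fx.GraphModule, offload_module_to_cpu: bool = True) -> None:
    if offload_module_to_cpu:
        model.to("cpu")
        gm.to("cpu")
        torch.cuda.empty_cache()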

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified


@narendasan narendasan left a comment


Wasn't there supposed to be a bunch of logging?

@@ -684,12 +678,17 @@ def compile(
)

gm = exported_program.module()
# Move the weights in the state_dict to CPU


This comment isn't relevant here.

@@ -833,6 +833,7 @@ def contains_metadata(gm: torch.fx.GraphModule) -> bool:
str(name),
str(submodule.graph),
)
submodule.to(torch.cuda.current_device())


Use to_torch_device(settings.device) here.
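
A sketch of the suggested change (assuming to_torch_device is importable from torch_tensorrt.dynamo.utils and a settings object is in scope at this point):

from torch_tensorrt.dynamo.utils import to_torch_device

# Respect the device requested in the compilation settings instead of
# hard-coding the current CUDA device.
submodule.to(to_torch_device(settings.device))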

Comment on lines 690 to 691
"The model is offloaded to CPU during compilation. If you want to keep the model on GPU, set offload_module_to_cpu=False."
)


Consider this message: "The PyTorch model was moved to the CPU to allocate all GPU memory to TensorRT. To retain the model on the GPU, set offload_module_to_cpu=False."



Also, there's one more thing we discussed: throw a warning if we predict an OOM when offload_module_to_cpu=False. This could be achieved by measuring the size of the PyTorch module and the available GPU memory.
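
A rough sketch of such a check (hypothetical helper; parameter and buffer sizes only approximate the true compilation footprint):

import logging
import torch

def warn_if_oom_likely(module: torch.nn.Module) -> None:
    # Estimate the module's GPU footprint from its parameters and buffers.
    module_bytes = sum(p.numel() * p.element_size() for p in module.parameters())
    module_bytes += sum(b.numel() * b.element_size() for b in module.buffers())
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    if module_bytes > free_bytes:
        logging.warning(
            "The model may not fit in the remaining GPU memory during engine compilation. "
            "Consider setting offload_module_to_cpu=True."
        )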


@peri044 peri044 left a comment


We need to test this change across our entire test suite to ensure it is working as expected.

)
else:
remaining_memory, total_memory = torch.cuda.mem_get_info()
if remaining_memory < total_memory / 2:


Use total_memory // 2 (integer division) here.

@@ -49,6 +49,7 @@
TILING_OPTIMIZATION_LEVEL = "none"
L2_LIMIT_FOR_TILING = -1
USE_DISTRIBUTED_MODE_TRACE = False
OFFLOAD_MODULE_TO_CPU = False


Run the whole test suite with this set to True. I think we discussed making the default True here. Since that would be a breaking change, we should mention it in the release notes.
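
If the default does become True, callers who want the current behavior would opt out explicitly, roughly like this (sketch; assumes the offload_module_to_cpu keyword added in this PR and a CUDA-resident model):

import torch
import torch_tensorrt

trt_module = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch.randn(1, 3, 224, 224).cuda()],
    offload_module_to_cpu=False,  # keep the PyTorch model on the GPU during compilation
)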

Labels
  • cla signed
  • component: api [Python] Issues re: Python API
  • component: conversion Issues re: Conversion stage
  • component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths