@tc-huang (Contributor) commented on Oct 10, 2025

1. What this does

1.1 Changes to make the ACT policy compile-compatible

1.1.1 Modifications to fix graph breaks

TorchDynamo emitted graph-break warnings caused by Tensor.item() calls inside ACTPolicy.forward(). Because .item() extracts a Python scalar on the host and interrupts graph capture, these conversions were removed from forward(). The model now returns loss tensors directly in loss_dict, and scalar extraction is deferred to the training script (lerobot_train.py); a minimal sketch of this pattern follows the code changes below.

The warnings observed were:

W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] Graph break from `Tensor.item()`, consider setting:
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0]     torch._dynamo.config.capture_scalar_outputs = True
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] or:
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0]     env TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] to include these operations in the captured graph.
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] 
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] Graph break: from user code at:
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0]   File "/home/tc-huang/Desktop/lerobot/src/lerobot/policies/act/modeling_act.py", line 147, in forward
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0]     loss_dict = {"l1_loss": l1_loss.item()}

Corresponding code changes:

  • ACTPolicy.forward in src/lerobot/policies/act/modeling_act.py
    -    loss_dict = {"l1_loss": l1_loss.item()}
    +    loss_dict = {"l1_loss": l1_loss}
         if self.config.use_vae:
              # Calculate Dₖₗ(latent_pdf || standard_normal). Note: After computing the KL-divergence for
              # each dimension independently, we sum over the latent dimension to get the total
              # KL-divergence per batch element, then take the mean over the batch.
              # (See App. B of https://huggingface.co/papers/1312.6114 for more details).
              mean_kld = (
                  (-0.5 * (1 + log_sigma_x2_hat - mu_hat.pow(2) - (log_sigma_x2_hat).exp())).sum(-1).mean()
              )
    -         loss_dict["kld_loss"] = mean_kld.item()
    +         loss_dict["kld_loss"] = mean_kld
              loss = l1_loss + mean_kld * self.config.kl_weight
          else:
              loss = l1_loss
    
          return loss, loss_dict
  • update_policy in src/lerobot/scripts/lerobot_train.py
      with accelerator.autocast():
          loss, output_dict = policy.forward(batch)
    +     output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()}
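For reference, here is a minimal, self-contained sketch of the pattern (the tiny module and all names are illustrative, not the actual lerobot code): forward() keeps tensors in the loss dictionary so the whole method can be captured by torch.compile, and the .item() conversion happens in the training loop, outside the compiled region.

```python
import torch
import torch.nn as nn


class TinyPolicy(nn.Module):
    """Illustrative stand-in for a policy whose forward() returns (loss, loss_dict)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(8, 8)

    def forward(self, batch):
        pred = self.net(batch)
        l1_loss = (pred - batch).abs().mean()
        # Keep tensors in the dict so torch.compile can capture the whole method.
        return l1_loss, {"l1_loss": l1_loss}


policy = TinyPolicy()
policy.forward = torch.compile(policy.forward)  # same idea as config.compile_model

loss, output_dict = policy.forward(torch.randn(4, 8))
# Scalar extraction happens in the training loop, outside the compiled graph.
output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()}
loss.backward()
```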

1.1.2 Add optional torch.compile support to ACTPolicy

This update introduces optional compilation support for ACTPolicy using PyTorch’s torch.compile. Two new arguments for compilation control were added, consistent with PI0 and PI0.5 policies:

  • --policy.compile_model: Enables or disables compilation of the policy model.
  • --policy.compile_mode: Specifies the Torch compile mode to use.

In ACTConfig (src/lerobot/policies/act/configuration_act.py), the following fields were added:

+    compile_model: bool = False  # Whether to use torch.compile for model optimization
+    compile_mode: str = "default"  # Torch compile mode

During initialization of ACTPolicy (src/lerobot/policies/act/modeling_act.py), compilation is applied conditionally based on the configuration:

+    if config.compile_model:
+        self.forward = torch.compile(self.forward, mode=config.compile_mode)
+        self.select_action = torch.compile(self.select_action, mode="default")

Note: The select_action method is always compiled with the "default" mode. Using "reduce-overhead" for select_action caused errors during training, and mitigating with torch.compiler.cudagraph_mark_step_begin (sketched after the error below) did not improve performance over the "default" mode. The observed error was:

File "/workspace/lerobot/src/lerobot/policies/act/modeling_act.py", line 124, in select_action
    self._action_queue.extend(actions.transpose(0, 1)). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.
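For context, this is roughly what the mitigation suggested by that error message looks like (a hedged sketch with an illustrative module, not the ACT code); in this PR it did not outperform the plain "default" mode, so select_action stays on "default".

```python
import torch

# Illustrative module, not the ACT code; requires a CUDA device.
model = torch.nn.Linear(8, 8).cuda()
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(2, 8, device="cuda")
for _ in range(3):
    # Signal the start of a new step so outputs from the previous invocation
    # may be overwritten by the CUDA-graph replay.
    torch.compiler.cudagraph_mark_step_begin()
    out = compiled(x).clone()  # clone() keeps a copy that outlives the replay
```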

1.2 Changes to the benchmark script

The benchmark script benchmarks/policies_compilation/benchmark_inference_compile_lerobot.py was adapted from the script provided in issue #2061, with the following changes:

  1. Remove the parts related to the deprecated pi0fast policy.
  2. Add command-line arguments to specify compile options (see the sketch after this list):
    • --compile-mode: ["default", "reduce-overhead"] Torch compile mode to use.
    • --fullgraph: If set, compile the entire model as a single graph and raise an error on any graph break.
    • --disable-dropout: If set, disable dropout layers by setting their dropout rate to 0.
    • --matmul-precision: ["highest", "high", "medium"] Set the float32 matmul precision (only applies when the device is CUDA).
    • --disable-cudnn-tf32: Disallow the use of TensorFloat-32 tensor cores in cuDNN convolutions (only applies when the device is CUDA).
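A hedged sketch of how these flags map onto PyTorch settings inside the benchmark (argument parsing, policy construction, and the helper name are illustrative):

```python
import torch


def apply_benchmark_options(args, cfg):
    """Illustrative helper: translate the new CLI flags into PyTorch/config settings."""
    if args.device == "cuda":
        # --matmul-precision: "highest" | "high" | "medium"
        torch.set_float32_matmul_precision(args.matmul_precision)
        if args.disable_cudnn_tf32:
            # --disable-cudnn-tf32: disallow TF32 tensor cores in cuDNN convolutions
            torch.backends.cudnn.allow_tf32 = False
    if args.disable_dropout and hasattr(cfg, "dropout"):
        # --disable-dropout: zero out dropout before the policy is built
        cfg.dropout = 0.0


# --compile-mode and --fullgraph are forwarded to torch.compile, e.g.:
# policy.forward = torch.compile(policy.forward, mode=args.compile_mode, fullgraph=args.fullgraph)
```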

2. How it was tested

2.1 Environment and testing command

The environment used for testing and benchmarking is as follows:

- lerobot version: 0.4.3
- Platform: Linux-6.8.0-87-generic-x86_64-with-glibc2.39
- Python version: 3.10.18
- Huggingface Hub version: 0.35.3
- Datasets version: 4.1.1
- Numpy version: 2.2.6
- PyTorch version: 2.7.1+cu126
- Is PyTorch built with CUDA support?: True
- Cuda version: 12.6
- GPU model: NVIDIA GeForce RTX 4090
- Using GPU in script?: Yes

Tests and benchmarks were executed using the following command with different combinations of command-line arguments:

python benchmarks/policies_compilation/benchmark_inference_compile_lerobot.py \
--policy act \
--device cuda \
--output benchmarks/policies_compilation/baseline_act_report.md \
--fullgraph \
# --compile-mode default or reduce-overhead \
# --disable-dropout \
# --matmul-precision highest or high \
# --disable-cudnn-tf32

Note: The --fullgraph flag ensures that any graph breaks raise errors during compilation.

The benchmark was performed using the following combinations of command-line arguments. Both default and reduce-overhead compile modes were tested separately:

| # | disable_dropout | matmul_precision | disable_cudnn_tf32 |
|---|-----------------|------------------|--------------------|
| 1 | TRUE | highest | TRUE |
| 2 | TRUE | highest | FALSE |
| 3 | TRUE | high | TRUE |
| 4 | TRUE | high | FALSE |
| 5 | FALSE | highest | TRUE |
| 6 | FALSE | highest | FALSE |
| 7 | FALSE | high | TRUE |
| 8 | FALSE | high | FALSE |

2.2 Baseline and final benchmark reports

Note: Differences and consistency values are rounded to 6 decimal places; speedup values are rounded to 2 decimal places.
✅ indicates a difference of less than 0.00001 or a speedup of 1.10× or higher; ❌ indicates otherwise.

2.2.1 Compile mode: default

| # | action diff | loss diff | Inference ms/iter (orig) | Inference ms/iter (compile) | Inference speedup | Training ms/iter (orig) | Training ms/iter (compile) | Training speedup | loss consistency | grad norm consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.000002✅ | 0.000000✅ | 21.87 | 19.01 | 1.15×✅ | 66.90 | 65.12 | 1.03×❌ | 0.001445 | 1.657770 |
| 2 | 0.001952❌ | 0.000111❌ | 21.38 | 17.09 | 1.25×✅ | 65.22 | 60.71 | 1.07×❌ | 0.001999 | 1.884838 |
| 3 | 0.000340❌ | 0.000005✅ | 19.67 | 16.50 | 1.19×✅ | 62.37 | 60.09 | 1.04×❌ | 0.002416 | 2.347739 |
| 4 | 0.001738❌ | 0.000110❌ | 19.01 | 14.60 | 1.30×✅ | 59.29 | 53.67 | 1.10×✅ | 0.002912 | 1.775447 |
| 5 | 0.000002✅ | 0.000170❌ | 21.91 | 18.99 | 1.15×✅ | 70.54 | 67.62 | 1.04×❌ | 0.002742 | 0.971044 |
| 6 | 0.001952❌ | 0.000089❌ | 21.45 | 17.07 | 1.26×✅ | 68.50 | 61.98 | 1.11×✅ | 0.002901 | 1.057907 |
| 7 | 0.000340❌ | 0.000171❌ | 19.81 | 16.50 | 1.20×✅ | 65.66 | 65.76 | 1.00×❌ | 0.002762 | 1.013415 |
| 8 | 0.001738❌ | 0.000090❌ | 19.00 | 14.63 | 1.30×✅ | 62.67 | 57.33 | 1.09×❌ | 0.002812 | 1.030260 |

2.2.2 Compile mode: reduce-overhead

Known issue: Training ACTPolicy with dropout enabled under torch.compile(mode="reduce-overhead") triggers segmentation faults (core dumps) without additional diagnostic output. As a result, benchmark cases 5–8 were excluded.

| # | action diff | loss diff | Inference ms/iter (orig) | Inference ms/iter (compile) | Inference speedup | Training ms/iter (orig) | Training ms/iter (compile) | Training speedup | loss consistency | grad norm consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.000002✅ | 0.000000✅ | 21.98 | 19.01 | 1.16×✅ | 67.04 | 64.19 | 1.04×❌ | 0.002255 | 2.028442 |
| 2 | 0.001952❌ | 0.000111❌ | 21.40 | 17.06 | 1.25×✅ | 65.99 | 59.45 | 1.11×✅ | 0.002189 | 1.789360 |
| 3 | 0.000340❌ | 0.000005✅ | 19.71 | 16.51 | 1.19×✅ | 62.65 | 59.36 | 1.06×❌ | 0.001331 | 1.414832 |
| 4 | 0.001738❌ | 0.000110❌ | 18.94 | 14.59 | 1.30×✅ | 59.60 | 52.57 | 1.13×✅ | 0.001032 | 0.668586 |

3. How to checkout & try (for the reviewer)

Testing and benchmarking can be performed using the following command with different combinations of command-line arguments:

python benchmarks/policies_compilation/benchmark_inference_compile_lerobot.py \
--policy act \
--device cuda \
--output benchmarks/policies_compilation/baseline_act_report.md \
--fullgraph \
# --compile-mode default or reduce-overhead \
# --disable-dropout \
# --matmul-precision highest or high \
# --disable-cudnn-tf32

…pile graph break

Removed .item() calls from loss_dict in forward() to avoid breaking the
torch.compile computation graph. The tensor-to-scalar conversion is now
handled in the training script instead.

In the inference benchmark in `benchmark_inference_compile_lerobot.py`,
the `test_correctness` method failed to properly compare compiled
`ACTPolicy` inference results. This was due to `policy_original`
and `policy_compiled` sharing the same `_action_queue` object.

Previously, the call order was:
1. `policy_original.reset()`
2. `policy_compiled.reset()`
3. `policy_original.select_action()`
4. `policy_compiled.select_action()`

Because the `_action_queue` is shared,
`policy_original.select_action()` would run inference
(`predict_action_chunk`) and extend the queue with `n_action_steps`
actions. `policy_compiled.select_action()` would then find a
non-empty queue and simply pop an action, bypassing its own
compiled inference logic.

This commit reorders the calls to:
1. `policy_original.reset()`
2. `policy_original.select_action()`
3. `policy_compiled.reset()`
4. `policy_compiled.select_action()`

This change ensures that `policy_compiled.reset()` clears the shared
queue *after* the original policy's action selection. Consequently,
`policy_compiled.select_action()` finds an empty queue and executes
its own compiled inference, allowing for a correct comparison.

With this fix, the compiled `ACTPolicy` inference check within
`test_correctness` now passes, validating that the compiled
inference output matches the original.
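A hedged sketch of the reordered check (the helper name and body are illustrative; only the call order and the method/attribute names come from the description above):

```python
import torch


def check_inference_consistency(policy_original, policy_compiled, batch):
    # Original policy first: reset, then select an action (this fills the shared queue).
    policy_original.reset()
    with torch.no_grad():
        action_orig = policy_original.select_action(batch)

    # Compiled policy second: resetting *after* the original ran empties the shared
    # _action_queue, so the compiled inference path actually executes.
    policy_compiled.reset()
    with torch.no_grad():
        action_comp = policy_compiled.select_action(batch)

    # Report the maximum absolute deviation between the two action outputs.
    return (action_orig - action_comp).abs().max().item()
```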
- Added import copy.
- Use copy.deepcopy(policy) before torch.compile.
- Introduced `self.fullgraph` attribute in `TorchCompileBenchmark`.
- Pass `fullgraph=self.fullgraph` when calling `torch.compile`.
- Added CLI argument `--fullgraph` to enable full-graph compilation, raising an error if a graph break occurs.
- Added `--matmul-precision` argument with choices: `highest`, `high`, `medium`.
- Applied only when CUDA device is selected.
- Allows benchmarking with different float32 matmul precision settings.
- Add `--disable-cudnn-tf32` CLI argument to disallow the use of
  TensorFloat-32 tensor cores in cuDNN convolutions (CUDA only).
- Apply `torch.backends.cudnn.allow_tf32 = False` when the argument is used.
- Add `--disable-dropout` CLI argument to set dropout rate to 0 in
  policies.
- Apply the argument by setting `cfg.dropout = 0.0` if the policy
  config has a dropout attribute.
Remove the conditional `if args.fullgraph` check and assign
`benchmark.fullgraph` directly from `args.fullgraph`. This ensures
the benchmark always reflects the CLI flag.
- Add `compile_mode` to `TorchCompileBenchmark` and expose it through the command-line
  argument `--compile-mode`, supporting both `default` and `reduce-overhead` modes.
- Update the benchmark compilation strategy by compiling `forward` and `select_action`
  individually instead of compiling the entire model, improving control over compilation
  behavior and inference performance.
- Extend `ACTConfig` with `compile_model` and `compile_mode` to support optional model
  compilation through configuration.
- Update `ACTPolicy` to conditionally compile `forward` and `select_action` during
  initialization when `compile_model` is enabled in the policy configuration.
@tc-huang changed the title from "[WIP] perf(policies): Make ACT policy compatible with torch.compile" to "perf(policies): Make ACT policy compatible with torch.compile" on Dec 14, 2025
Copilot AI left a comment

Pull request overview

This PR makes the ACT policy compatible with torch.compile by removing .item() calls from the forward method (which cause graph breaks) and delegating scalar extraction to the training script. It also adds optional torch.compile support via configuration flags and includes a comprehensive benchmark script.

Key Changes:

  • Removed .item() calls from ACT policy's forward() method to avoid graph breaks during compilation
  • Modified training script to handle tensor-to-scalar conversion for loss dictionaries
  • Added compile_model and compile_mode configuration options to ACTConfig
  • Introduced benchmark script for evaluating torch.compile performance

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| src/lerobot/policies/act/modeling_act.py | Removed .item() calls from loss dict; added conditional torch.compile of forward and select_action methods |
| src/lerobot/policies/act/configuration_act.py | Added compile_model and compile_mode configuration fields with documentation |
| src/lerobot/scripts/lerobot_train.py | Added dictionary comprehension to convert tensor values to scalars after policy forward pass |
| benchmarks/policies_compilation/benchmark_inference_compile_lerobot.py | New comprehensive benchmark script with compile options and performance testing |


In src/lerobot/scripts/lerobot_train.py:

    # Let accelerator handle mixed precision
    with accelerator.autocast():
        loss, output_dict = policy.forward(batch)
        output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()}
Copilot AI commented on Dec 14, 2025:


This line will crash when output_dict is None (e.g., for DiffusionPolicy which returns None). The code should check if output_dict is not None before attempting to call .items() on it. Consider: output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()} if output_dict is not None else {}

Suggested change:
-    output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()}
+    output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()} if output_dict is not None else {}

In the ACTConfig docstring (src/lerobot/policies/act/configuration_act.py):

    dropout: Dropout to use in the transformer layers (see code for details).
    kl_weight: The weight to use for the KL-divergence component of the loss if the variational objective
        is enabled. Loss is then calculated as: `reconstruction_loss + kl_weight * kld_loss`.
    compile_model: Enables compiling with `torch.compile` for faster policy training and inference.
Copilot AI commented on Dec 14, 2025:


The documentation states this parameter enables torch.compile for faster training and inference, but the implementation in modeling_act.py shows that it compiles both forward() and select_action() methods. The docstring should clarify which specific methods are compiled to avoid confusion.

Suggested change:
-    compile_model: Enables compiling with `torch.compile` for faster policy training and inference.
+    compile_model: Enables compiling with `torch.compile` for faster policy training and inference. This compiles both the `forward()` and `select_action()` methods.
