@tc-huang (Contributor) commented on Oct 10, 2025

1. What this does

1.1 Changes to make the ACT policy compile-compatible

1.1.1 Modifications to fix graph breaks

TorchDynamo emitted graph-break warnings caused by Tensor.item() calls inside ACTPolicy.forward(). Because .item() extracts a Python scalar on the host and interrupts graph capture, these conversions were removed from forward(). The model now returns loss tensors directly in loss_dict, and scalar extraction is deferred to the training script (lerobot_train.py); a minimal sketch of this pattern follows the code changes below.

The warnings observed were:

W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] Graph break from `Tensor.item()`, consider setting:
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0]     torch._dynamo.config.capture_scalar_outputs = True
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] or:
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0]     env TORCHDYNAMO_CAPTURE_SCALAR_OUTPUTS=1
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] to include these operations in the captured graph.
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] 
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0] Graph break: from user code at:
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0]   File "/home/tc-huang/Desktop/lerobot/src/lerobot/policies/act/modeling_act.py", line 147, in forward
W1018 23:02:25.828000 103342 torch/_dynamo/variables/tensor.py:913] [0/0]     loss_dict = {"l1_loss": l1_loss.item()}

Corresponding code changes:

  • ACTPolicy.forward in src/lerobot/policies/act/modeling_act.py
    -    loss_dict = {"l1_loss": l1_loss.item()}
    +    loss_dict = {"l1_loss": l1_loss}
         if self.config.use_vae:
              # Calculate Dₖₗ(latent_pdf || standard_normal). Note: After computing the KL-divergence for
              # each dimension independently, we sum over the latent dimension to get the total
              # KL-divergence per batch element, then take the mean over the batch.
              # (See App. B of https://huggingface.co/papers/1312.6114 for more details).
              mean_kld = (
                  (-0.5 * (1 + log_sigma_x2_hat - mu_hat.pow(2) - (log_sigma_x2_hat).exp())).sum(-1).mean()
              )
    -         loss_dict["kld_loss"] = mean_kld.item()
    +         loss_dict["kld_loss"] = mean_kld
              loss = l1_loss + mean_kld * self.config.kl_weight
          else:
              loss = l1_loss
    
          return loss, loss_dict
  • update_policy in src/lerobot/scripts/lerobot_train.py
      with accelerator.autocast():
          loss, output_dict = policy.forward(batch)
    +     output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()}
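For reference, here is a minimal, self-contained sketch of the pattern (the tiny module and all names are illustrative, not the actual lerobot code): forward() keeps tensors in the loss dictionary so the whole method can be captured by torch.compile, and the .item() conversion happens in the training loop, outside the compiled region.

```python
import torch
import torch.nn as nn


class TinyPolicy(nn.Module):
    """Illustrative stand-in for a policy whose forward() returns (loss, loss_dict)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(8, 8)

    def forward(self, batch):
        pred = self.net(batch)
        l1_loss = (pred - batch).abs().mean()
        # Keep tensors in the dict so torch.compile can capture the whole method.
        return l1_loss, {"l1_loss": l1_loss}


policy = TinyPolicy()
policy.forward = torch.compile(policy.forward)  # same idea as config.compile_model

loss, output_dict = policy.forward(torch.randn(4, 8))
# Scalar extraction happens in the training loop, outside the compiled graph.
output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()}
loss.backward()
```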

1.1.2 Add optional torch.compile support to ACTPolicy

This update introduces optional compilation support for ACTPolicy using PyTorch’s torch.compile. Two new arguments for compilation control were added, consistent with PI0 and PI0.5 policies:

  • --policy.compile_model: Enables or disables compilation of the policy model.
  • --policy.compile_mode: Specifies the Torch compile mode to use.

In ACTConfig (src/lerobot/policies/act/configuration_act.py), the following fields were added:

+    compile_model: bool = False  # Whether to use torch.compile for model optimization
+    compile_mode: str = "default"  # Torch compile mode

During initialization of ACTPolicy (src/lerobot/policies/act/modeling_act.py), compilation is applied conditionally based on the configuration:

+    if config.compile_model:
+        self.forward = torch.compile(self.forward, mode=config.compile_mode)
+        self.select_action = torch.compile(self.select_action, mode="default")

Note: The select_action method is always compiled with the "default" mode. Using "reduce-overhead" for select_action caused errors during training, and mitigating with torch.compiler.cudagraph_mark_step_begin (sketched after the error below) did not improve performance over the "default" mode. The observed error was:

File "/workspace/lerobot/src/lerobot/policies/act/modeling_act.py", line 124, in select_action
    self._action_queue.extend(actions.transpose(0, 1)). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.
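For context, this is roughly what the mitigation suggested by that error message looks like (a hedged sketch with an illustrative module, not the ACT code); in this PR it did not outperform the plain "default" mode, so select_action stays on "default".

```python
import torch

# Illustrative module, not the ACT code; requires a CUDA device.
model = torch.nn.Linear(8, 8).cuda()
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(2, 8, device="cuda")
for _ in range(3):
    # Signal the start of a new step so outputs from the previous invocation
    # may be overwritten by the CUDA-graph replay.
    torch.compiler.cudagraph_mark_step_begin()
    out = compiled(x).clone()  # clone() keeps a copy that outlives the replay
```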

1.2 Changes to the benchmark script

The benchmark script benchmarks/policies_compilation/benchmark_inference_compile_lerobot.py was adapted from the script provided in issue #2061, with the following changes:

  1. Remove the parts related to the deprecated pi0fast policy.
  2. Add command-line arguments to specify compile options (see the sketch after this list):
    • --compile-mode: ["default", "reduce-overhead"] Torch compile mode to use.
    • --fullgraph: If set, compile the entire model as a single graph and raise an error on any graph break.
    • --disable-dropout: If set, disable dropout layers by setting their dropout rate to 0.
    • --matmul-precision: ["highest", "high", "medium"] Set the float32 matmul precision (only applies when the device is CUDA).
    • --disable-cudnn-tf32: Disallow the use of TensorFloat-32 tensor cores in cuDNN convolutions (only applies when the device is CUDA).
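A hedged sketch of how these flags map onto PyTorch settings inside the benchmark (argument parsing, policy construction, and the helper name are illustrative):

```python
import torch


def apply_benchmark_options(args, cfg):
    """Illustrative helper: translate the new CLI flags into PyTorch/config settings."""
    if args.device == "cuda":
        # --matmul-precision: "highest" | "high" | "medium"
        torch.set_float32_matmul_precision(args.matmul_precision)
        if args.disable_cudnn_tf32:
            # --disable-cudnn-tf32: disallow TF32 tensor cores in cuDNN convolutions
            torch.backends.cudnn.allow_tf32 = False
    if args.disable_dropout and hasattr(cfg, "dropout"):
        # --disable-dropout: zero out dropout before the policy is built
        cfg.dropout = 0.0


# --compile-mode and --fullgraph are forwarded to torch.compile, e.g.:
# policy.forward = torch.compile(policy.forward, mode=args.compile_mode, fullgraph=args.fullgraph)
```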

2. How it was tested

2.1 Environment and testing command

The environment used for testing and benchmarking is as follows:

- lerobot version: 0.4.3
- Platform: Linux-6.8.0-87-generic-x86_64-with-glibc2.39
- Python version: 3.10.18
- Huggingface Hub version: 0.35.3
- Datasets version: 4.1.1
- Numpy version: 2.2.6
- PyTorch version: 2.7.1+cu126
- Is PyTorch built with CUDA support?: True
- Cuda version: 12.6
- GPU model: NVIDIA GeForce RTX 4090
- Using GPU in script?: Yes

Tests and benchmarks were executed using the following command with different combinations of command-line arguments:

python benchmarks/policies_compilation/benchmark_inference_compile_lerobot.py \
--policy act \
--device cuda \
--output benchmarks/policies_compilation/baseline_act_report.md \
--fullgraph \
# --compile-mode default or reduce-overhead \
# --disable-dropout \
# --matmul-precision highest or high \
# --disable-cudnn-tf32

Note: The --fullgraph flag ensures that any graph breaks raise errors during compilation.

The benchmark was performed using the following combinations of command-line arguments. Both default and reduce-overhead compile modes were tested separately:

| # | disable_dropout | matmul_precision | disable_cudnn_tf32 |
|---|-----------------|------------------|--------------------|
| 1 | TRUE | highest | TRUE |
| 2 | TRUE | highest | FALSE |
| 3 | TRUE | high | TRUE |
| 4 | TRUE | high | FALSE |
| 5 | FALSE | highest | TRUE |
| 6 | FALSE | highest | FALSE |
| 7 | FALSE | high | TRUE |
| 8 | FALSE | high | FALSE |

2.2 Baseline and final benchmark reports

Note: Differences and consistency values are rounded to 6 decimal places; speedup values are rounded to 2 decimal places.
✅ indicates a difference of less than 0.00001 or a speedup of 1.10× or higher; ❌ indicates otherwise.

2.2.1 Compile mode: default

| # | action diff | loss diff | Inference ms/iter (orig) | Inference ms/iter (compile) | Inference speedup | Training ms/iter (orig) | Training ms/iter (compile) | Training speedup | loss consistency | grad norm consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.000002✅ | 0.000000✅ | 21.87 | 19.01 | 1.15×✅ | 66.90 | 65.12 | 1.03×❌ | 0.001445 | 1.657770 |
| 2 | 0.001952❌ | 0.000111❌ | 21.38 | 17.09 | 1.25×✅ | 65.22 | 60.71 | 1.07×❌ | 0.001999 | 1.884838 |
| 3 | 0.000340❌ | 0.000005✅ | 19.67 | 16.50 | 1.19×✅ | 62.37 | 60.09 | 1.04×❌ | 0.002416 | 2.347739 |
| 4 | 0.001738❌ | 0.000110❌ | 19.01 | 14.60 | 1.30×✅ | 59.29 | 53.67 | 1.10×✅ | 0.002912 | 1.775447 |
| 5 | 0.000002✅ | 0.000170❌ | 21.91 | 18.99 | 1.15×✅ | 70.54 | 67.62 | 1.04×❌ | 0.002742 | 0.971044 |
| 6 | 0.001952❌ | 0.000089❌ | 21.45 | 17.07 | 1.26×✅ | 68.50 | 61.98 | 1.11×✅ | 0.002901 | 1.057907 |
| 7 | 0.000340❌ | 0.000171❌ | 19.81 | 16.50 | 1.20×✅ | 65.66 | 65.76 | 1.00×❌ | 0.002762 | 1.013415 |
| 8 | 0.001738❌ | 0.000090❌ | 19.00 | 14.63 | 1.30×✅ | 62.67 | 57.33 | 1.09×❌ | 0.002812 | 1.030260 |

2.2.2 Compile mode: reduce-overhead

Known issue: Training ACTPolicy with dropout enabled under torch.compile(mode="reduce-overhead") triggers segmentation faults (core dumps) without additional diagnostic output. As a result, benchmark cases 5–8 were excluded.

| # | action diff | loss diff | Inference ms/iter (orig) | Inference ms/iter (compile) | Inference speedup | Training ms/iter (orig) | Training ms/iter (compile) | Training speedup | loss consistency | grad norm consistency |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.000002✅ | 0.000000✅ | 21.98 | 19.01 | 1.16×✅ | 67.04 | 64.19 | 1.04×❌ | 0.002255 | 2.028442 |
| 2 | 0.001952❌ | 0.000111❌ | 21.40 | 17.06 | 1.25×✅ | 65.99 | 59.45 | 1.11×✅ | 0.002189 | 1.789360 |
| 3 | 0.000340❌ | 0.000005✅ | 19.71 | 16.51 | 1.19×✅ | 62.65 | 59.36 | 1.06×❌ | 0.001331 | 1.414832 |
| 4 | 0.001738❌ | 0.000110❌ | 18.94 | 14.59 | 1.30×✅ | 59.60 | 52.57 | 1.13×✅ | 0.001032 | 0.668586 |

3. How to checkout & try (for the reviewer)

Testing and benchmarking can be performed using the following command with different combinations of command-line arguments:

python benchmarks/policies_compilation/benchmark_inference_compile_lerobot.py \
--policy act \
--device cuda \
--output benchmarks/policies_compilation/baseline_act_report.md \
--fullgraph \
# --compile-mode default or reduce-overhead \
# --disable-dropout \
# --matmul-precision highest or high \
# --disable-cudnn-tf32

…pile graph break

Removed .item() calls from loss_dict in forward() to avoid breaking the
torch.compile computation graph. The tensor-to-scalar conversion is now
handled in the training script instead.

In the inference benchmark in `benchmark_inference_compile_lerobot.py`,
the `test_correctness` method failed to properly compare compiled
`ACTPolicy` inference results. This was due to `policy_original`
and `policy_compiled` sharing the same `_action_queue` object.

Previously, the call order was:
1. `policy_original.reset()`
2. `policy_compiled.reset()`
3. `policy_original.select_action()`
4. `policy_compiled.select_action()`

Because the `_action_queue` is shared,
`policy_original.select_action()` would run inference
(`predict_action_chunk`) and extend the queue with `n_action_steps`
actions. `policy_compiled.select_action()` would then find a
non-empty queue and simply pop an action, bypassing its own
compiled inference logic.

This commit reorders the calls to:
1. `policy_original.reset()`
2. `policy_original.select_action()`
3. `policy_compiled.reset()`
4. `policy_compiled.select_action()`

This change ensures that `policy_compiled.reset()` clears the shared
queue *after* the original policy's action selection. Consequently,
`policy_compiled.select_action()` finds an empty queue and executes
its own compiled inference, allowing for a correct comparison.

With this fix, the compiled `ACTPolicy` inference check within
`test_correctness` now passes, validating that the compiled
inference output matches the original.
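A hedged sketch of the reordered check (the helper name and body are illustrative; only the call order and the method/attribute names come from the description above):

```python
import torch


def check_inference_consistency(policy_original, policy_compiled, batch):
    # Original policy first: reset, then select an action (this fills the shared queue).
    policy_original.reset()
    with torch.no_grad():
        action_orig = policy_original.select_action(batch)

    # Compiled policy second: resetting *after* the original ran empties the shared
    # _action_queue, so the compiled inference path actually executes.
    policy_compiled.reset()
    with torch.no_grad():
        action_comp = policy_compiled.select_action(batch)

    # Report the maximum absolute deviation between the two action outputs.
    return (action_orig - action_comp).abs().max().item()
```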
- Added import copy.
- Use copy.deepcopy(policy) before torch.compile.
- Introduced `self.fullgraph` attribute in `TorchCompileBenchmark`.
- Pass `fullgraph=self.fullgraph` when calling `torch.compile`.
- Added CLI argument `--fullgraph` to enable full-graph compilation, raising an error if a graph break occurs.
- Added `--matmul-precision` argument with choices: `highest`, `high`, `medium`.
- Applied only when CUDA device is selected.
- Allows benchmarking with different float32 matmul precision settings.
- Add `--disable-cudnn-tf32` CLI argument to disallow the use of
  TensorFloat-32 tensor cores in cuDNN convolutions (CUDA only).
- Apply `torch.backends.cudnn.allow_tf32 = False` when the argument is used.
- Add `--disable-dropout` CLI argument to set dropout rate to 0 in
  policies.
- Apply the argument by setting `cfg.dropout = 0.0` if the policy
  config has a dropout attribute.
Remove the conditional `if args.fullgraph` check and assign
`benchmark.fullgraph` directly from `args.fullgraph`. This ensures
the benchmark always reflects the CLI flag.
- Add `compile_mode` to `TorchCompileBenchmark` and expose it through the command-line
  argument `--compile-mode`, supporting both `default` and `reduce-overhead` modes.
- Update the benchmark compilation strategy by compiling `forward` and `select_action`
  individually instead of compiling the entire model, improving control over compilation
  behavior and inference performance.
- Extend `ACTConfig` with `compile_model` and `compile_mode` to support optional model
  compilation through configuration.
- Update `ACTPolicy` to conditionally compile `forward` and `select_action` during
  initialization when `compile_model` is enabled in the policy configuration.
@tc-huang changed the title from "[WIP] perf(policies): Make ACT policy compatible with torch.compile" to "perf(policies): Make ACT policy compatible with torch.compile" on Dec 14, 2025
Copilot AI left a comment

Pull request overview

This PR makes the ACT policy compatible with torch.compile by removing .item() calls from the forward method (which cause graph breaks) and delegating scalar extraction to the training script. It also adds optional torch.compile support via configuration flags and includes a comprehensive benchmark script.

Key Changes:

  • Removed .item() calls from ACT policy's forward() method to avoid graph breaks during compilation
  • Modified training script to handle tensor-to-scalar conversion for loss dictionaries
  • Added compile_model and compile_mode configuration options to ACTConfig
  • Introduced benchmark script for evaluating torch.compile performance

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| src/lerobot/policies/act/modeling_act.py | Removed .item() calls from loss dict; added conditional torch.compile of forward and select_action methods |
| src/lerobot/policies/act/configuration_act.py | Added compile_model and compile_mode configuration fields with documentation |
| src/lerobot/scripts/lerobot_train.py | Added dictionary comprehension to convert tensor values to scalars after policy forward pass |
| benchmarks/policies_compilation/benchmark_inference_compile_lerobot.py | New comprehensive benchmark script with compile options and performance testing |


In src/lerobot/scripts/lerobot_train.py:

    # Let accelerator handle mixed precision
    with accelerator.autocast():
        loss, output_dict = policy.forward(batch)
        output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()}
Copilot AI commented on Dec 14, 2025:


This line will crash when output_dict is None (e.g., for DiffusionPolicy which returns None). The code should check if output_dict is not None before attempting to call .items() on it. Consider: output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()} if output_dict is not None else {}

Suggested change:
-    output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()}
+    output_dict = {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in output_dict.items()} if output_dict is not None else {}

In the ACTConfig docstring (src/lerobot/policies/act/configuration_act.py):

    dropout: Dropout to use in the transformer layers (see code for details).
    kl_weight: The weight to use for the KL-divergence component of the loss if the variational objective
        is enabled. Loss is then calculated as: `reconstruction_loss + kl_weight * kld_loss`.
    compile_model: Enables compiling with `torch.compile` for faster policy training and inference.
Copilot AI commented on Dec 14, 2025:


The documentation states this parameter enables torch.compile for faster training and inference, but the implementation in modeling_act.py shows that it compiles both forward() and select_action() methods. The docstring should clarify which specific methods are compiled to avoid confusion.

Suggested change:
-    compile_model: Enables compiling with `torch.compile` for faster policy training and inference.
+    compile_model: Enables compiling with `torch.compile` for faster policy training and inference. This compiles both the `forward()` and `select_action()` methods.
