After applying commit 6a3a9da, the training job runs out of GPU memory (OOM) #1855

### Bug description

My training job (Flux-dev model) on AMD GPUs runs out of memory (OOM) after merging the following commit from the main branch:

- Commit ID: 6a3a9da9564d82a1120c7639ef6236bb4cffa049, [Refactor attention and make attention mask an argument to the model](https://github.com/pytorch/torchtitan/commit/6a3a9da9564d82a1120c7639ef6236bb4cffa049)
- Related PR: [Refactor attention and make attention mask an argument to the model](https://github.com/pytorch/torchtitan/pull/1776)

<img width="3300" height="106" alt="Image" src="https://github.com/user-attachments/assets/e50f8ca7-0094-46ae-9f24-4d8310728666" />


Reverting this commit resolves the problem. Is it possible that this change introduces additional GPU memory usage?
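One way to check whether the commit raises memory usage would be to record peak allocated memory for a single training step on each revision and compare the two numbers. The following is only a rough sketch: the helper names (`peak_memory_gib`, `reset_peak_memory`, `memory_delta_gib`) are hypothetical and not part of torchtitan, and it assumes the `torch.cuda` memory statistics are available on the ROCm build, which exposes AMD GPUs through the `torch.cuda` namespace.

```python
import importlib.util


def _cuda_ready():
    """True only if torch is importable and a GPU (CUDA or ROCm) is visible."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


def reset_peak_memory(device=None):
    """Hypothetical helper: clear peak-memory counters before a measured step."""
    if _cuda_ready():
        import torch
        torch.cuda.reset_peak_memory_stats(device)


def peak_memory_gib(device=None):
    """Hypothetical helper: peak allocated memory in GiB, or None without a GPU."""
    if not _cuda_ready():
        return None
    import torch
    return torch.cuda.max_memory_allocated(device) / 2**30


def memory_delta_gib(before, after):
    """Difference between two peak readings (after - before), or None if missing."""
    if before is None or after is None:
        return None
    return round(after - before, 3)


# Intended use (run once per revision, around one training step):
#   reset_peak_memory()
#   ... run one training step here ...
#   print("peak GiB:", peak_memory_gib())
```

Running this with the commit reverted and again with it applied, using the same config and batch size, would give two peak readings; `memory_delta_gib(peak_reverted, peak_applied)` then shows how much extra memory the change appears to cost.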

### Versions

- torchtitan commit ID: 6a3a9da9564d82a1120c7639ef6236bb4cffa049
- torch: 2.10.0.dev20250914+rocm6.4

