After applying commit 6a3a9da, the training job runs out of GPU memory (OOM). #1855

@limou102

Bug description

My training job (Flux-dev model) on AMD GPUs runs out of GPU memory (OOM) after merging the following commit from the main branch:
Commit ID: 6a3a9da
Commit title: Refactor attention and make attention mask an argument to the model
Related PR: Refactor attention and make attention mask an argument to the model

Image

Reverting this commit resolves the problem. Is it possible that this change introduces additional GPU memory usage?
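To put numbers on the regression, peak GPU memory could be recorded for one training step with and without the commit. The following is only a minimal sketch: `measure_peak_memory` and `step_fn` are hypothetical names, not part of torchtitan, and `step_fn` stands in for whatever callable runs a single training step. It relies on the `torch.cuda` memory statistics, which ROCm builds of PyTorch also expose under the same namespace.

```python
from typing import Callable, Optional

try:
    import torch
except ImportError:  # allow importing this helper on machines without PyTorch
    torch = None


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / (1024 ** 3)


def measure_peak_memory(step_fn: Callable[[], None]) -> Optional[dict]:
    """Run step_fn once and report peak GPU memory on the current device.

    step_fn is a hypothetical callable that executes one training step.
    Returns None when PyTorch or a GPU is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.synchronize()
    # Clear the peak counters so the result reflects only this step.
    torch.cuda.reset_peak_memory_stats()
    step_fn()
    torch.cuda.synchronize()
    return {
        "allocated_gib": bytes_to_gib(torch.cuda.max_memory_allocated()),
        "reserved_gib": bytes_to_gib(torch.cuda.max_memory_reserved()),
    }
```

Running this once on 6a3a9da and once with the commit reverted, using the same config and batch size, would show whether the peak allocation actually grows and by how much.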

Versions

torchtitan commit ID: 6a3a9da
torch: 2.10.0.dev20250914+rocm6.4
