Bug description
My training job (Flux-dev model) on AMD GPUs encounters an out-of-memory (OOM) issue after merging the following commit from the main branch:
commit id: 6a3a9da
Refactor attention and make attention mask an argument to the model
related PR: Refactor attention and make attention mask an argument to the model
Reverting this commit resolves the problem, so it's possible that this change introduces additional GPU memory usage.
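To help narrow this down, here is a minimal sketch of how I compared peak memory across the two commits: wrap one training step and read `torch.cuda.max_memory_allocated` (on ROCm builds the `torch.cuda.*` APIs map to HIP). The helper name and structure are my own, not part of torchtitan.

```python
import torch

def report_peak_memory(step_fn, device="cuda"):
    """Run step_fn once and return peak allocated memory in MiB.

    Returns None on machines without a GPU (hypothetical fallback so the
    sketch stays runnable anywhere). On ROCm, torch.cuda.* is HIP-backed.
    """
    if not torch.cuda.is_available():
        step_fn()
        return None
    # Clear the running peak so we measure only this step.
    torch.cuda.reset_peak_memory_stats(device)
    step_fn()
    # Make sure all kernels have finished before reading the counter.
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)
```

Running this around a single Flux-dev training step at each of the two commits should show whether the peak allocation actually grows after 6a3a9da (e.g. from materializing an attention mask tensor).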
Versions
torchtitan commit id: 6a3a9da
torch: 2.10.0.dev20250914+rocm6.4