Bug description
Async TP-related CI has been failing since Sep 22, 2025. However, even after rolling back the nightly PyTorch to 0919, the tests still fail.
```
python -m pip install --force-reinstall torch==2.10.0.dev20250917+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126
```
This is not an async TP issue but a symmetric memory issue. The following single line is enough to trigger the failure on the CI machine/Docker image:
```
symm_mem = get_symm_mem_workspace(torch.distributed.group.WORLD.group_name, min_size=1024*1024*64)
```
We cannot reproduce this issue on any local machine/environment.
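For anyone who wants to try this outside the test suite, here is a rough sketch of a standalone repro built around the line above. It is not the actual CI test. The import path `torch.distributed._symmetric_memory`, the NCCL backend, the `torchrun` launch, and the file name `repro.py` are all assumptions on my side, and depending on the PyTorch build, symmetric memory may need extra setup for the group before this call succeeds.

```python
import os

# 64 MiB, the same min_size as in the failing call above.
WORKSPACE_BYTES = 1024 * 1024 * 64


def main() -> None:
    # Imported lazily so this file can be loaded on a machine without PyTorch.
    import torch
    import torch.distributed as dist

    # Assumption: the private module path below is where the helper lives.
    from torch.distributed._symmetric_memory import get_symm_mem_workspace

    # torchrun sets LOCAL_RANK for every worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    try:
        # The single call that fails on the CI machine/Docker image.
        symm_mem = get_symm_mem_workspace(
            dist.group.WORLD.group_name, min_size=WORKSPACE_BYTES
        )
        print(f"rank {dist.get_rank()}: got workspace: {symm_mem is not None}")
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    if "LOCAL_RANK" in os.environ:
        main()
    else:
        # repro.py is a hypothetical file name for this script.
        print("Launch with torchrun, e.g. torchrun --nproc_per_node=2 repro.py")
```

Running it with two or more GPUs inside the CI Docker image and again on a local machine should show whether the allocation alone is enough to trigger the failure.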
Versions
nightly