Async TP CI failing #1757

@fegin

Bug description

The async TP-related CI tests have been failing since Sep 22, 2025. However, even after rolling the PyTorch nightly back to 0919, the tests still fail.

python -m pip install --force-reinstall torch==2.10.0.dev20250917+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126

This is not an async TP issue but a symmetric memory issue. The following single line is enough to cause issues on the CI machine/docker.

symm_mem = get_symm_mem_workspace(torch.distributed.group.WORLD.group_name, min_size=1024*1024*64)

We have not been able to reproduce this issue on any local machine/environment.
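
For anyone who wants to try this on a CI-like machine, a minimal standalone sketch wrapping the line above might look like the following. It assumes the helper is imported from `torch.distributed._symmetric_memory` (a private module; the issue does not state the import path), that the NCCL backend is used, and that the script is launched with `torchrun`; the `repro` function name and the `WORKSPACE_BYTES` constant are made up for illustration.

```python
import os

# 64 MiB, the same min_size as in the failing line above.
WORKSPACE_BYTES = 1024 * 1024 * 64


def repro() -> None:
    # torch is imported lazily so this file can be imported without it installed.
    import torch
    import torch.distributed as dist
    # Assumption: private module path, not stated in the issue.
    from torch.distributed._symmetric_memory import get_symm_mem_workspace

    # torchrun sets LOCAL_RANK for each worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    try:
        # The single call reported to fail on the CI machine/docker.
        symm_mem = get_symm_mem_workspace(
            dist.group.WORLD.group_name, min_size=WORKSPACE_BYTES
        )
        print(f"rank {dist.get_rank()}: workspace allocated")
    finally:
        dist.destroy_process_group()


# Only run under torchrun, which is what sets LOCAL_RANK.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    repro()
```

Saved under a hypothetical name such as `repro_symm_mem.py`, it could be launched with `torchrun --nproc_per_node=2 repro_symm_mem.py`. If the script fails inside the CI docker image but passes locally, that would be consistent with the problem being in the environment rather than in async TP.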

Versions

nightly
