Segfault at exit #15

Open
zasdfgbnm opened this issue Jul 1, 2021 · 0 comments

I am seeing a segmentation fault at exit. It can be reproduced with:

bash ./test/start_test.sh ./test/torch_allreduce_test.py --backend=ucc

See also the test results of #14.

I looked into it in depth, and there appear to be two root causes:

Issue 1: ProcessGroupUCC is destroyed before ~CommUCC runs. ~CommUCC calls ucc_context_destroy to destroy the UCC context, and that call uses a c10d::Store object as the out-of-band communicator for allgather. By the time ucc_context_destroy is called, the c10d::Store object has already been destroyed, so the call accesses an invalid pointer. I have a fix for this issue in #13.
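To illustrate the ordering problem in issue 1, here is a minimal sketch of one possible way to keep the store alive until the communicator is torn down: the communicator holds its own shared_ptr to the store. FakeStore, FakeComm, and teardown_order are hypothetical stand-ins, not the real c10d or UCC types, and this is not necessarily what #13 does.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for c10d::Store; records events in a shared log.
struct FakeStore {
  std::vector<std::string>* log;
  explicit FakeStore(std::vector<std::string>* l) : log(l) {}
  ~FakeStore() { log->push_back("store destroyed"); }
  void allgather() { log->push_back("oob allgather"); }
};

// Hypothetical stand-in for CommUCC; its teardown still needs the store.
struct FakeComm {
  std::shared_ptr<FakeStore> store;  // shared ownership keeps the store alive
  explicit FakeComm(std::shared_ptr<FakeStore> s) : store(std::move(s)) {}
  ~FakeComm() { store->allgather(); }  // models the OOB allgather at destroy
};

// Drops the owner's store reference first, then the comm, and returns
// the order in which the events were observed.
std::vector<std::string> teardown_order() {
  std::vector<std::string> log;
  {
    auto store = std::make_shared<FakeStore>(&log);
    auto comm = std::make_shared<FakeComm>(store);
    store.reset();  // the "process group" releases its reference first
    comm.reset();   // comm teardown runs while the store is still valid
  }
  return log;
}
```

With shared ownership, the allgather in the comm's destructor runs before the store is destroyed, regardless of which owner lets go first.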

Issue 2: ~ProcessGroupUCC calls ucc_destroy_team(team), but at that point the progress_loop thread can still be running ucc_context_progress, which triggers a segfault.
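For issue 2, a common pattern is to signal the progress thread to stop and join it before releasing anything it touches. The sketch below shows that pattern under stated assumptions: FakeTeam, FakeProcessGroup, and shutdown_is_clean are hypothetical names, and the "team" is just a flag rather than a real UCC team handle.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a UCC team; "alive" models a valid handle.
struct FakeTeam {
  std::atomic<bool> alive{true};
};

class FakeProcessGroup {
 public:
  FakeProcessGroup() : worker_([this] { progress_loop(); }) {}
  ~FakeProcessGroup() { shutdown(); }

  // Stop and join the progress thread first, then release the team.
  void shutdown() {
    stop_.store(true);
    if (worker_.joinable()) worker_.join();
    team_.alive.store(false);  // models destroying the team
  }

  bool used_dead_team() const { return used_dead_team_.load(); }

 private:
  void progress_loop() {
    while (!stop_.load()) {
      // Models progressing the context; flags any use after teardown.
      if (!team_.alive.load()) used_dead_team_.store(true);
      std::this_thread::yield();
    }
  }

  FakeTeam team_;
  std::atomic<bool> stop_{false};
  std::atomic<bool> used_dead_team_{false};
  std::thread worker_;  // declared last so the other members exist before it starts
};

// Runs the loop briefly, shuts down, and reports whether the progress
// thread ever observed a destroyed team.
bool shutdown_is_clean() {
  FakeProcessGroup pg;
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  pg.shutdown();
  return !pg.used_dead_team();
}
```

Because the join completes before the team is marked destroyed, the progress loop can never run against a released team in this sketch.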

In my test, I changed the beginning of start_test.sh to set size=2 instead of size=4. With that change, the rank=0 process fails due to issue 2 and the rank=1 process fails due to issue 1.

cc: @Sergei-Lebedev
