I looked into it in more depth, and there seem to be two root causes:
Issue 1: The destruction of `ProcessGroupUCC` happens earlier than `~CommUCC`. The latter invokes `ucc_context_destroy` to destroy UCC's context, which uses a `c10d::Store` object as an out-of-band communicator for allgather. However, by the time `ucc_context_destroy` is called, the `c10d::Store` object has already been destroyed, so the call accesses an invalid pointer. I have a fix for this issue in #13.
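To make the ordering problem in issue 1 concrete, here is a minimal standalone sketch in plain C++. It does not use UCC or c10d at all: `MockStore`, `MockComm`, and `run_teardown` are hypothetical stand-ins, and shared ownership is only one possible way to keep the store alive until the communicator has been torn down. It is not necessarily what #13 does.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for c10d::Store; logs its own destruction.
struct MockStore {
  std::vector<std::string>* log;
  explicit MockStore(std::vector<std::string>* l) : log(l) {}
  ~MockStore() { log->push_back("store destroyed"); }
  void allgather() { log->push_back("oob allgather"); }
};

// Hypothetical stand-in for CommUCC. It co-owns the store, so the store
// cannot be destroyed before this destructor has finished using it.
struct MockComm {
  std::shared_ptr<MockStore> store;
  explicit MockComm(std::shared_ptr<MockStore> s) : store(std::move(s)) {}
  ~MockComm() {
    assert(store != nullptr);
    store->allgather();  // models the out-of-band allgather during context teardown
  }
};

// Models the teardown order from the report: the process group goes away
// first and drops its store reference, then the communicator is destroyed.
std::vector<std::string> run_teardown() {
  std::vector<std::string> log;
  {
    auto store = std::make_shared<MockStore>(&log);
    auto comm = std::make_shared<MockComm>(store);
    store.reset();  // the process group releases its reference first
    log.push_back("process group destroyed");
    comm.reset();   // the communicator still finds a live store
  }
  return log;
}
```

With a raw pointer or a non-owning reference in place of the `std::shared_ptr` member, the allgather in `~MockComm` would touch an already destroyed object, which is the failure described above.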
Issue 2: `~ProcessGroupUCC` calls `ucc_destroy_team(team)`, but when this happens, the `progress_loop` thread can still be running `ucc_context_progress`, which triggers a segfault.
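One common way to avoid the kind of race described in issue 2 is to signal the progress thread to stop and join it before destroying anything it touches. The sketch below shows that pattern in plain C++ under stated assumptions: `MockTeam`, `MockProcessGroup`, and `count_use_after_destroy` are hypothetical names, no UCC calls are made, and this is not a claim about how the actual fix should look.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the UCC team/context that the progress thread drives.
struct MockTeam {
  std::atomic<bool> alive{true};
  std::atomic<long> progress_calls{0};
  std::atomic<long> use_after_destroy{0};
  void progress() {
    if (!alive.load()) use_after_destroy++;  // would be the segfault in real code
    progress_calls++;
  }
  void destroy() { alive.store(false); }
};

// Hypothetical stand-in for ProcessGroupUCC with a background progress loop.
class MockProcessGroup {
 public:
  explicit MockProcessGroup(MockTeam* team)
      : team_(team), stop_(false), worker_([this] { loop(); }) {}

  ~MockProcessGroup() {
    stop_.store(true);                        // 1. ask the progress loop to exit
    if (worker_.joinable()) worker_.join();   // 2. wait until it has exited
    assert(!worker_.joinable());
    team_->destroy();                         // 3. only now destroy the team
  }

 private:
  void loop() {
    while (!stop_.load()) {
      team_->progress();
      std::this_thread::yield();
    }
  }

  // Declaration order matters: worker_ must be initialized last so the
  // thread never sees uninitialized members.
  MockTeam* team_;
  std::atomic<bool> stop_;
  std::thread worker_;
};

// Runs one create/destroy cycle and reports how often progress() ran
// on a destroyed team.
long count_use_after_destroy() {
  MockTeam team;
  {
    MockProcessGroup pg(&team);
    while (team.progress_calls.load() == 0) std::this_thread::yield();
  }
  return team.use_after_destroy.load();
}
```

If step 3 ran before steps 1 and 2, the loop could still call `progress()` on a destroyed team, which mirrors the reported crash.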
In my test, I changed `size=4` to `size=2` at the beginning of `start_test.sh`. The rank=0 process then fails due to issue 2, and the rank=1 process fails due to issue 1.
I am seeing a segmentation fault at exit. It can be reproduced with
See also the test results of #14.
cc: @Sergei-Lebedev