Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client/Service Deadlock #484

Open
YuanYuYuan opened this issue Feb 27, 2025 · 3 comments
Open

Client/Service Deadlock #484

YuanYuYuan opened this issue Feb 27, 2025 · 3 comments

Comments

@YuanYuYuan
Copy link
Contributor

YuanYuYuan commented Feb 27, 2025

After investigating the test failure nav2_behavior_tree/test/plugins/action/test_bt_action_node in navigation2, we found that a deadlock could occur in the following way.

send_request lock (client.r_mutex, C)
send_request fetched
add_new_query lock (service.mutex, C) <-- session-local, the callback add_new_query registed in service's queryable is directly executed on Thread C
add_new_query lock fetched
add_new_query lock released
send_request released
take_request lock (service.mutex, S)
take_request lock fetched
take_request lock released

send_request lock (client.r_mutex, C)
send_request fetched
send_response lock (service.mutex, S) <-- triggered by take_request
send_response lock fetched
add_new_query lock (service.mutex, C) <-- session-local, Thread C stucks
ClientData::is_shutdown lock (client.r_mutex, S)  <-- triggered by send_response, client.r_mutex is still held by Thread C

The add_new_query operation must occur before or after send_response.

@evshary
Copy link
Contributor

evshary commented Mar 3, 2025

A diagram for further analysis:

sequenceDiagram
participant Service as Thread 1 (Service)
participant mutex_s
participant mutex_c
participant Client as Thread 2 (Client)

Client->>mutex_c: send_request (lock)
Client->>mutex_s: add_new_query (lock)
mutex_s->>Client: add_new_query (release)
mutex_c->>Client: send_request (release)

Service->>mutex_s: take_request (lock)
mutex_s->>Service: take_request (release)

Client->>mutex_c: send_request (lock)
Service->>mutex_s: send_response (lock)

Client->>mutex_s: add_new_query (waiting)
Service->>mutex_c: query.reply (waiting)
Loading

@YuanYuYuan
Copy link
Contributor Author

YuanYuYuan commented Mar 3, 2025

To clarify the deadlock issue and the relationship between session-local mechanism in zenoh, I try to plot the whole mechanism as below.

sequenceDiagram
  participant mutex_s as mutex_s
  participant T2 as Service API
  actor P2 as Background Thread
  participant Z as Zenoh <br> Session
  participant mutex_c as mutex_c
  participant T1 as Client API
  actor P1 as Main Thread

  P1 ->> T1: send_request
  T1 ->> mutex_c: acquire
  mutex_c ->> T1: lock
  activate mutex_c
  T1 ->> Z: 
  Z ->> Z: z_get
  Z ->> T2: add_new_query
  T2 ->> mutex_s: acquire
  mutex_s ->> T2: lock
  activate mutex_s
  T2 ->> P2: notify
  T2 ->> mutex_s: release
  deactivate mutex_s
  T2 ->> Z: 
  Z ->> T1: 
  T1 ->> mutex_c: release
  deactivate mutex_c
  P2 ->> T2: take_request
  P1 ->> T1: send_request
  T2 ->> mutex_s: acquire
  mutex_s ->> T2: lock
  activate mutex_s
  T2 ->> P2: queue request
  T2 ->> mutex_s: release
  deactivate mutex_s
  T1 ->> mutex_c: acquire
  mutex_c ->> T1: lock
  activate mutex_c
  T1 ->> Z: 
  P2 ->> T2: send_response
  Z ->> Z: z_get
  T2 ->> mutex_s: acquire
  mutex_s ->> T2: lock
  activate mutex_s
  Z ->> T2: add_new_query
  T2 -x mutex_s: acquire
  T2 ->> Z: query.reply
  Z ->> T1: add_new_reply
  T1 -x mutex_c: acquire
  deactivate mutex_c
  deactivate mutex_s
Loading
  1. This AB/BA lock happens when add_new_query interrupts the query.reply.
  2. If the service and the client don't share the same zenoh session, z_get would return early and schedule the following task on another thread.

@YuanYuYuan
Copy link
Contributor Author

YuanYuYuan commented Mar 6, 2025

A proposed fix be like.

diff --git a/rmw_zenoh_cpp/src/detail/rmw_client_data.cpp b/rmw_zenoh_cpp/src/detail/rmw_client_data.cpp
index 6446c71..889065d 100644
--- a/rmw_zenoh_cpp/src/detail/rmw_client_data.cpp
+++ b/rmw_zenoh_cpp/src/detail/rmw_client_data.cpp
@@ -394,6 +394,7 @@ rmw_ret_t ClientData::send_request(
   std::weak_ptr<rmw_zenoh_cpp::ClientData> client_data = shared_from_this();
   zenoh::ZResult result;
   std::string parameters;
+  mutex_.unlock();
   context_impl->session()->get(
     keyexpr_.value(),
     parameters,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants