Reconnect to the last endpoint without restarting tasks by steils · Pull Request #1220 · eclipse-zenoh/zenoh-pico

steils · 2026-05-12T07:40:16Z

Avoid restarting transport tasks during client reconnect.
Reuse the existing client transport and try the last successful endpoint first.
If that fails, continue with the configured connect locators or scouted locators. Restore the connection without replacing the transport object.

Closes: #1005
Closes: #1053

🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: enhancement

✨ Enhancement Requirements

Since this PR enhances existing functionality:

Enhancement scope documented - Clear description of what is being improved
Minimum necessary code - Implementation is as simple as possible, doesn't overcomplicate the system
Backwards compatible - Existing code/APIs still work unchanged
No new APIs added - Only improving existing functionality
Tests updated - Existing tests pass, new test cases added if needed
Performance improvement measured - If applicable, before/after metrics provided
Documentation updated - Existing docs updated to reflect improvements
User impact documented - How users benefit from this enhancement

Remember: Enhancements should not introduce new APIs or breaking changes.

Instructions:

Check off items as you complete them (change - [ ] to - [x])
The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

 LIVELINESS_SUB_CLIENT_COMMAND = STDBUF_CMD + [f'{DIR_EXAMPLES}/z_sub_liveliness', '-h', '-e', f'tcp/127.0.0.1:{ZENOH_PORT}']

+SINGLE_THREAD_ZENOH_PORT = "7448"
+SINGLE_THREAD_ROUTER_ARGS = ['-l', f'tcp/0.0.0.0:{SINGLE_THREAD_ZENOH_PORT}', '--no-multicast-scouting']


github-advanced-security

Cppcheck (reported by Codacy) found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

gmartin82 · 2026-05-14T09:36:27Z

        run: |
          sudo apt install -y ninja-build
-          CMAKE_GENERATOR=Ninja ASAN=ON make BUILD_TYPE=Debug
+          CMAKE_GENERATOR=Ninja ASAN=ON ZENOH_DEBUG=3 make BUILD_TYPE=Debug


Was this an intentional change or part of debugging?

gmartin82 · 2026-05-14T09:38:44Z

+          CMAKE_GENERATOR=Ninja ASAN=ON ZENOH_DEBUG=3 make BUILD_TYPE=Debug
          ninja -C build/ test
          python3 ./build/tests/single_thread.py
+          python3 -u ./build/tests/connection_restore.py ./zenoh-standalone/zenohd --single-thread


Since we have a dedicated connection_restore_test CI job, would it be better to have a single-threaded version of that test instead?

gmartin82 · 2026-05-14T10:05:01Z

 DIR_EXAMPLES = "build/examples"

+
+def filter_debug_logs(output):


None of the changes in this file would be needed if a single-threaded version of the connection restore test CI job were added instead.

gmartin82 · 2026-05-14T10:08:35Z

 z_result_t _z_transport_start_batching(_z_transport_t *zt) {
    _z_transport_common_t *ztc = _z_transport_get_common(zt);
-    if (ztc == NULL) {
+    if ((ztc == NULL) || (ztc->_state != _Z_TRANSPORT_STATE_OPEN) || (ztc->_link == NULL)) {


This _link == NULL check looks unnecessary to me.

My understanding is that _state should be the authoritative transport availability state. If a transport is _Z_TRANSPORT_STATE_OPEN, then it should have a valid link. If the link is cleared/freed/moved during reconnect or close, the state should first be changed to RECONNECTING or CLOSED.

If OPEN && _link == NULL is possible, that seems like a lifecycle bug rather than a normal condition the send path should need to handle. Could we rely on _state here and make sure the reconnect/clear/replace paths maintain that invariant?
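One way to picture the invariant described above is the minimal sketch below. All names (`transport_t`, `tr_state_t`, `transport_begin_reconnect`, and so on) are hypothetical stand-ins rather than the zenoh-pico types, and locking is left out; a multi-threaded build would presumably need to make these transitions under the transport mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration, not the zenoh-pico types. */
typedef enum { TR_CLOSED, TR_RECONNECTING, TR_OPEN } tr_state_t;

typedef struct {
    tr_state_t state;
    void *link; /* expected to be non-NULL whenever state == TR_OPEN */
} transport_t;

/* Invariant: an OPEN transport always owns a link. */
static bool transport_invariant_holds(const transport_t *t) {
    return (t->state != TR_OPEN) || (t->link != NULL);
}

/* Leave OPEN first, then drop the link, so that the combination
 * OPEN with a NULL link is never produced by this path. */
static void transport_begin_reconnect(transport_t *t) {
    t->state = TR_RECONNECTING;
    t->link = NULL;
    assert(transport_invariant_holds(t));
}

/* Install the new link first, then publish OPEN. */
static void transport_finish_reconnect(transport_t *t, void *new_link) {
    assert(new_link != NULL);
    t->link = new_link;
    t->state = TR_OPEN;
    assert(transport_invariant_holds(t));
}

/* With the invariant maintained, the send path can consult the state alone. */
static bool transport_is_available(const transport_t *t) {
    return t->state == TR_OPEN;
}
```

If every path that clears or replaces the link goes through helpers like these, the extra `_link == NULL` test in the send path becomes redundant.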

gmartin82 · 2026-05-14T10:09:09Z

        _Z_ERROR_RETURN(_Z_ERR_TRANSPORT_NOT_AVAILABLE);
    }

+    bool is_available = (ztc->_state == _Z_TRANSPORT_STATE_OPEN) && (ztc->_link != NULL);


See previous comment about _state and _link.

gmartin82 · 2026-05-14T10:14:06Z


+    bool is_available = (ztc->_state == _Z_TRANSPORT_STATE_OPEN) && (ztc->_link != NULL);
+    if (ztc->_batch_state != _Z_BATCHING_ACTIVE) {
+        return is_available ? _Z_RES_OK : _Z_ERR_TRANSPORT_NOT_AVAILABLE;


Was this change necessary?

stop_batching() looks more like cleanup/unwind logic than a transport availability check. If start_batching() succeeded, I would expect stop_batching() to reliably release the batching state/mutexes even if the transport has since moved to RECONNECTING or CLOSED.

I can see the value in giving feedback that something has gone wrong with the transport, but is this the right place to do that?
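To illustrate the split suggested above, here is a small sketch in which availability is checked only when batching starts, while stopping is pure unwind logic. The types and the `held_locks` counter are assumptions made for the example; they only stand in for the real batching state and mutexes, and nested batching is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration, not the zenoh-pico types. */
typedef enum { BATCH_IDLE, BATCH_ACTIVE } batch_state_t;
typedef enum { RES_OK = 0, ERR_NOT_AVAILABLE = -1 } result_t;

typedef struct {
    bool open;           /* stand-in for the transport state check */
    batch_state_t batch; /* current batching state */
    int held_locks;      /* stand-in for mutexes taken while batching */
} transport_t;

/* Availability is checked once, on entry. */
static result_t start_batching(transport_t *t) {
    if (!t->open) {
        return ERR_NOT_AVAILABLE;
    }
    assert(t->batch == BATCH_IDLE); /* nested batching is not modelled here */
    t->held_locks++;
    t->batch = BATCH_ACTIVE;
    return RES_OK;
}

/* Cleanup only: release whatever start_batching() took, regardless of
 * whether the transport is still open. */
static result_t stop_batching(transport_t *t) {
    if (t->batch == BATCH_ACTIVE) {
        t->held_locks--;
        t->batch = BATCH_IDLE;
    }
    return RES_OK;
}
```

In this shape a caller that needs to know the transport went away would learn it from the send or flush call, not from the unwind step.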

gmartin82 · 2026-05-14T10:14:16Z

 #endif
    ztc->_batch_state = _Z_BATCHING_IDLE;
-    return _Z_RES_OK;
+    return is_available ? _Z_RES_OK : _Z_ERR_TRANSPORT_NOT_AVAILABLE;


See previous comment.

gmartin82 · 2026-05-14T10:21:15Z

+
+static z_result_t _z_client_reopen_unicast(_z_session_t *s) {
+    _z_string_svec_t candidates = _z_string_svec_null();
+    _Z_RETURN_IF_ERR(_z_client_reconnect_candidates(s, &candidates));


I’m going to pause detailed review here because I have a broader concern about the approach.

I appreciate this change was introduced while you were already working on this PR, but the client connect path was recently refactored so that configured connect locators are handled consistently as a list of alternatives. That gives us one place for locator ordering, retryable vs non-retryable errors, timeout behaviour, configured locators, and scouting fallback semantics.

This PR appears to introduce a separate client reconnection path, with its own locator candidate construction and connection establishment logic. I’m concerned that initial connect and reconnect could drift semantically over time.

I think reconnect should build on the existing client connect/open path rather than adding a second implementation. The “try the last successful locator first” behaviour feels like an extension of the existing locator ordering logic, not a separate connection establishment model.

Could we adapt the existing client connect/open path to accept an optional preferred reconnect locator, try that first while avoiding duplicates, and then fall back to the normal configured/scouted alternatives? The reconnect-specific part should ideally be limited to what happens after a successful connection: for client unicast reconnect, install the new connection into the existing transport object rather than replacing/reinitialising the whole transport.

Given that this affects the structure of the change, I think it would be better to resolve this design question before reviewing the rest of the implementation in detail.
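As a rough illustration of the "optional preferred locator" idea, the ordering step could look something like the sketch below. `build_candidates` and its parameters are hypothetical and use plain C strings; the real code would presumably operate on the existing locator list type and keep the retry and timeout handling in the shared connect path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: build the ordered candidate list.
 * The optional preferred locator goes first, followed by the configured or
 * scouted alternatives in their original order, skipping duplicates.
 * Returns the number of entries written to out (at most out_cap). */
static size_t build_candidates(const char *preferred, const char *const *alternatives, size_t n_alt,
                               const char **out, size_t out_cap) {
    assert(out != NULL);
    size_t n = 0;
    if ((preferred != NULL) && (n < out_cap)) {
        out[n++] = preferred;
    }
    for (size_t i = 0; (i < n_alt) && (n < out_cap); i++) {
        int seen = 0;
        for (size_t j = 0; j < n; j++) {
            if (strcmp(out[j], alternatives[i]) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen) {
            out[n++] = alternatives[i];
        }
    }
    return n;
}
```

Passing `NULL` as the preferred locator gives the initial-connect ordering, so both connect and reconnect would share one code path and differ only in that argument.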

gmartin82 · 2026-05-14T10:44:19Z

-    z_result_t ret = _z_open(&zs, &s->_config, &s->_local_zid);
-    _z_session_transport_mutex_unlock(s);
+    z_result_t ret =
+        s->_mode == Z_WHATAMI_CLIENT ? _z_client_reopen_unicast(s) : _z_open(&zs, &s->_config, &s->_local_zid);


Minor point: you're tying client reconnect to unicast here.

That may be fine for the currently supported behaviour, since _z_multicast_open_client() returns _Z_ERR_CONFIG_UNSUPPORTED_CLIENT_MULTICAST, but _z_new_transport_client() still has explicit handling for Z_LINK_CAP_TRANSPORT_RAWETH and Z_LINK_CAP_TRANSPORT_MULTICAST.

Could we either make the unicast-only assumption explicit here, or ensure the reconnect path will not become a trap if client multicast/raweth support is implemented later?
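One possible way to make the unicast-only assumption explicit is a small dispatch step that rejects unsupported combinations up front. The enums and `choose_reconnect_path` below are invented for this sketch and do not correspond to existing zenoh-pico symbols; returning an error for client multicast/raweth is just one option, and falling back to the full open path would be another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration, not the zenoh-pico types. */
typedef enum { SESSION_CLIENT, SESSION_PEER } session_mode_t;
typedef enum { LINK_UNICAST, LINK_MULTICAST, LINK_RAWETH } link_kind_t;
typedef enum { RES_OK = 0, ERR_UNSUPPORTED = -1 } result_t;
typedef enum { PATH_NONE, PATH_REOPEN_IN_PLACE, PATH_FULL_OPEN } reconnect_path_t;

/* Decide which reconnect path applies. The in-place reopen is only chosen
 * for client + unicast; other client link kinds fail loudly so the
 * assumption cannot be violated silently later on. */
static result_t choose_reconnect_path(session_mode_t mode, link_kind_t kind, reconnect_path_t *path) {
    assert(path != NULL);
    *path = PATH_NONE;
    if (mode != SESSION_CLIENT) {
        *path = PATH_FULL_OPEN;
        return RES_OK;
    }
    switch (kind) {
        case LINK_UNICAST:
            *path = PATH_REOPEN_IN_PLACE;
            return RES_OK;
        case LINK_MULTICAST:
        case LINK_RAWETH:
        default:
            return ERR_UNSUPPORTED;
    }
}
```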

Vuk-SFL · 2026-05-15T12:17:27Z

Does this change make #1205 unnecessary, since that PR is about addressing tx resource usage while reconnection is active?

steils · 2026-05-19T01:33:10Z

@Vuk-SFL

Does this change make #1205 unnecessary, since that PR is about addressing tx resource usage while reconnection is active?

For the reconnect case covered in this PR, I think yes, it's no longer needed. But as far as I can see, #1205 also touches the more general clear paths, so it should still be relevant after this PR.

github-advanced-security AI found potential problems May 12, 2026

steils changed the title ~~Improve reconnection mechanism to try first reconnect latest endpoint without thread restart~~ Reconnect to the latest endpoint without restarting tasks May 12, 2026

steils changed the title ~~Reconnect to the latest endpoint without restarting tasks~~ Reconnect to the last endpoint without restarting tasks May 12, 2026

steils force-pushed the reconnection branch from 9cbc4bf to 110d38a May 12, 2026 23:57

steils added the enhancement (Existing things could work better) label May 12, 2026

steils force-pushed the reconnection branch from 110d38a to 18771ed May 13, 2026 00:20

github-advanced-security AI found potential problems May 13, 2026

Comment thread tests/single_thread.py Fixed

steils force-pushed the reconnection branch 2 times, most recently from 631df0d to 3da94d4 May 13, 2026 15:30

steils marked this pull request as ready for review May 13, 2026 16:05

steils requested review from DenisBiryukov91 and gmartin82 May 13, 2026 16:07

gmartin82 suggested changes May 14, 2026

gmartin82 reviewed May 14, 2026

steils marked this pull request as draft May 15, 2026 20:32

steils force-pushed the reconnection branch from 3da94d4 to 9c17d1a May 19, 2026 01:19

steils force-pushed the reconnection branch from 9c17d1a to bbaca93 May 19, 2026 02:06

steils marked this pull request as ready for review May 19, 2026 02:08

steils requested a review from gmartin82 May 19, 2026 02:08

steils force-pushed the reconnection branch from bbaca93 to 0aee406 May 19, 2026 02:23

Reconnect to the last endpoint without restarting tasks #1220
steils wants to merge 1 commit into eclipse-zenoh:main from ZettaScaleLabs:reconnection


4 participants
