Skip to content

Reconnect to the last endpoint without restarting tasks#1220

Open
steils wants to merge 1 commit into
eclipse-zenoh:mainfrom
ZettaScaleLabs:reconnection
Open

Reconnect to the last endpoint without restarting tasks#1220
steils wants to merge 1 commit into
eclipse-zenoh:mainfrom
ZettaScaleLabs:reconnection

Conversation

@steils
Copy link
Copy Markdown
Member

@steils steils commented May 12, 2026

Avoid restarting transport tasks during client reconnect.
Reuse the existing client transport and try the last successful endpoint first.
If that fails, continue with the configured connect locators or scouted locators. Restore the connection without replacing the transport object.

Closes: #1005
Closes: #1053


🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: enhancement

✨ Enhancement Requirements

Since this PR enhances existing functionality:

  • Enhancement scope documented - Clear description of what is being improved
  • Minimum necessary code - Implementation is as simple as possible, doesn't overcomplicate the system
  • Backwards compatible - Existing code/APIs still work unchanged
  • No new APIs added - Only improving existing functionality
  • Tests updated - Existing tests pass, new test cases added if needed
  • Performance improvement measured - If applicable, before/after metrics provided
  • Documentation updated - Existing docs updated to reflect improvements
  • User impact documented - How users benefit from this enhancement

Remember: Enhancements should not introduce new APIs or breaking changes.

Instructions:

  1. Check off items as you complete them (change - [ ] to - [x])
  2. The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

LIVELINESS_SUB_CLIENT_COMMAND = STDBUF_CMD + [f'{DIR_EXAMPLES}/z_sub_liveliness', '-h', '-e', f'tcp/127.0.0.1:{ZENOH_PORT}']

SINGLE_THREAD_ZENOH_PORT = "7448"
SINGLE_THREAD_ROUTER_ARGS = ['-l', f'tcp/0.0.0.0:{SINGLE_THREAD_ZENOH_PORT}', '--no-multicast-scouting']
Copy link
Copy Markdown

@github-advanced-security github-advanced-security AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cppcheck (reported by Codacy) found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@steils steils changed the title Improve reconnection mechanism to try first reconnect latest endpoint without thread restart Reconnect to the latest endpoint without restarting tasks May 12, 2026
@steils steils changed the title Reconnect to the latest endpoint without restarting tasks Reconnect to the last endpoint without restarting tasks May 12, 2026
@steils steils added the enhancement Existing things could work better label May 12, 2026
Comment thread tests/single_thread.py Fixed
@steils steils force-pushed the reconnection branch 2 times, most recently from 631df0d to 3da94d4 Compare May 13, 2026 15:30
@steils steils marked this pull request as ready for review May 13, 2026 16:05
Comment thread .github/workflows/ci.yml Outdated
run: |
sudo apt install -y ninja-build
CMAKE_GENERATOR=Ninja ASAN=ON make BUILD_TYPE=Debug
CMAKE_GENERATOR=Ninja ASAN=ON ZENOH_DEBUG=3 make BUILD_TYPE=Debug
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this an intentional change or part of debugging?

Comment thread .github/workflows/ci.yml Outdated
CMAKE_GENERATOR=Ninja ASAN=ON ZENOH_DEBUG=3 make BUILD_TYPE=Debug
ninja -C build/ test
python3 ./build/tests/single_thread.py
python3 -u ./build/tests/connection_restore.py ./zenoh-standalone/zenohd --single-thread
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we have a dedicated connection_restore_test CI job would it be better having a single thread version of that test instead?

Comment thread tests/single_thread.py Outdated
DIR_EXAMPLES = "build/examples"


def filter_debug_logs(output):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes in this file would not be needed if a single threaded version of the connection restore test CI job was added instead instead.

Comment thread src/transport/transport.c Outdated
z_result_t _z_transport_start_batching(_z_transport_t *zt) {
_z_transport_common_t *ztc = _z_transport_get_common(zt);
if (ztc == NULL) {
if ((ztc == NULL) || (ztc->_state != _Z_TRANSPORT_STATE_OPEN) || (ztc->_link == NULL)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This _link == NULL check looks unnecessary to me.

My understanding is that _state should be the authoritative transport availability state. If a transport is _Z_TRANSPORT_STATE_OPEN, then it should have a valid link. If the link is cleared/freed/moved during reconnect or close, the state should first be changed to RECONNECTING or CLOSED.

If OPEN && _link == NULL is possible, that seems like a lifecycle bug rather than a normal condition the send path should need to handle. Could we rely on _state here and make sure the reconnect/clear/replace paths maintain that invariant?

Comment thread src/transport/transport.c Outdated
_Z_ERROR_RETURN(_Z_ERR_TRANSPORT_NOT_AVAILABLE);
}

bool is_available = (ztc->_state == _Z_TRANSPORT_STATE_OPEN) && (ztc->_link != NULL);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment about _state and _link.

Comment thread src/transport/transport.c Outdated

bool is_available = (ztc->_state == _Z_TRANSPORT_STATE_OPEN) && (ztc->_link != NULL);
if (ztc->_batch_state != _Z_BATCHING_ACTIVE) {
return is_available ? _Z_RES_OK : _Z_ERR_TRANSPORT_NOT_AVAILABLE;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this change necessary?

stop_batching() looks more like cleanup/unwind logic than a transport availability check. If start_batching() succeeded, I would expect stop_batching() to reliably release the batching state/mutexes even if the transport has since moved to RECONNECTING or CLOSED.

I can see the value in giving feedback that something has gone wrong with the transport, but is this the right place to do that?

Comment thread src/transport/transport.c Outdated
#endif
ztc->_batch_state = _Z_BATCHING_IDLE;
return _Z_RES_OK;
return is_available ? _Z_RES_OK : _Z_ERR_TRANSPORT_NOT_AVAILABLE;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

Comment thread src/net/session.c Outdated

static z_result_t _z_client_reopen_unicast(_z_session_t *s) {
_z_string_svec_t candidates = _z_string_svec_null();
_Z_RETURN_IF_ERR(_z_client_reconnect_candidates(s, &candidates));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m going to pause detailed review here because I have a broader concern about the approach.

I appreciate this change was introduced while you were already working on this PR, but the client connect path was recently refactored so that configured connect locators are handled consistently as a list of alternatives. That gives us one place for locator ordering, retryable vs non-retryable errors, timeout behaviour, configured locators, and scouting fallback semantics.

This PR appears to introduce a separate client reconnection path, with its own locator candidate construction and connection establishment logic. I’m concerned that initial connect and reconnect could drift semantically over time.

I think reconnect should build on the existing client connect/open path rather than adding a second implementation. The “try the last successful locator first” behaviour feels like an extension of the existing locator ordering logic, not a separate connection establishment model.

Could we adapt the existing client connect/open path to accept an optional preferred reconnect locator, try that first while avoiding duplicates, and then fall back to the normal configured/scouted alternatives? The reconnect-specific part should ideally be limited to what happens after a successful connection: for client unicast reconnect, install the new connection into the existing transport object rather than replacing/reinitialising the whole transport.

Given that this affects the structure of the change, I think it would be better to resolve this design question before reviewing the rest of the implementation in detail.

Comment thread src/net/session.c Outdated
z_result_t ret = _z_open(&zs, &s->_config, &s->_local_zid);
_z_session_transport_mutex_unlock(s);
z_result_t ret =
s->_mode == Z_WHATAMI_CLIENT ? _z_client_reopen_unicast(s) : _z_open(&zs, &s->_config, &s->_local_zid);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point: you're tying client reconnect to unicast here.

That may be fine for the currently supported behaviour, since _z_multicast_open_client() returns _Z_ERR_CONFIG_UNSUPPORTED_CLIENT_MULTICAST, but _z_new_transport_client() still has explicit handling for Z_LINK_CAP_TRANSPORT_RAWETH and Z_LINK_CAP_TRANSPORT_MULTICAST.

Could we either make the unicast-only assumption explicit here, or ensure the reconnect path will not become a trap if client multicast/raweth support is implemented later?

@Vuk-SFL
Copy link
Copy Markdown
Contributor

Vuk-SFL commented May 15, 2026

Does this change make #1205 unnecessary since that PR is about adressing tx resource usage while reconenction is active?

@steils steils marked this pull request as draft May 15, 2026 20:32
@steils
Copy link
Copy Markdown
Member Author

steils commented May 19, 2026

@Vuk-SFL

Does this change make #1205 unnecessary since that PR is about adressing tx resource usage while reconenction is active?

For the reconnect case is covered in this PR, I think yes, it's no longer needed. But as I see #1205 touches more general clear paths, so that still should be relevant after this PR.

@steils steils marked this pull request as ready for review May 19, 2026 02:08
@steils steils requested a review from gmartin82 May 19, 2026 02:08
Avoid restarting transport tasks during client reconnect.
Reuse the existing client transport and try the last successful endpoint
first.
If that fails, continue with the configured connect locators or scouted
locators. Restore the connection without replacing the transport object.

Closes: eclipse-zenoh#1005
Closes: eclipse-zenoh#1053
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Existing things could work better

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Reconnection on ThreadX causes crash due to memset() The rmw_zenoh on Arduino on ESP32 in unicast of client mode does not reconnect

4 participants