Reconnect to the last endpoint without restarting tasks#1220
Conversation
| LIVELINESS_SUB_CLIENT_COMMAND = STDBUF_CMD + [f'{DIR_EXAMPLES}/z_sub_liveliness', '-h', '-e', f'tcp/127.0.0.1:{ZENOH_PORT}'] | ||
|
|
||
| SINGLE_THREAD_ZENOH_PORT = "7448" | ||
| SINGLE_THREAD_ROUTER_ARGS = ['-l', f'tcp/0.0.0.0:{SINGLE_THREAD_ZENOH_PORT}', '--no-multicast-scouting'] |
There was a problem hiding this comment.
Cppcheck (reported by Codacy) found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
631df0d to
3da94d4
Compare
| run: | | ||
| sudo apt install -y ninja-build | ||
| CMAKE_GENERATOR=Ninja ASAN=ON make BUILD_TYPE=Debug | ||
| CMAKE_GENERATOR=Ninja ASAN=ON ZENOH_DEBUG=3 make BUILD_TYPE=Debug |
There was a problem hiding this comment.
Was this an intentional change or part of debugging?
| CMAKE_GENERATOR=Ninja ASAN=ON ZENOH_DEBUG=3 make BUILD_TYPE=Debug | ||
| ninja -C build/ test | ||
| python3 ./build/tests/single_thread.py | ||
| python3 -u ./build/tests/connection_restore.py ./zenoh-standalone/zenohd --single-thread |
There was a problem hiding this comment.
As we have a dedicated connection_restore_test CI job would it be better having a single thread version of that test instead?
| DIR_EXAMPLES = "build/examples" | ||
|
|
||
|
|
||
| def filter_debug_logs(output): |
There was a problem hiding this comment.
All changes in this file would not be needed if a single threaded version of the connection restore test CI job was added instead instead.
| z_result_t _z_transport_start_batching(_z_transport_t *zt) { | ||
| _z_transport_common_t *ztc = _z_transport_get_common(zt); | ||
| if (ztc == NULL) { | ||
| if ((ztc == NULL) || (ztc->_state != _Z_TRANSPORT_STATE_OPEN) || (ztc->_link == NULL)) { |
There was a problem hiding this comment.
This _link == NULL check looks unnecessary to me.
My understanding is that _state should be the authoritative transport availability state. If a transport is _Z_TRANSPORT_STATE_OPEN, then it should have a valid link. If the link is cleared/freed/moved during reconnect or close, the state should first be changed to RECONNECTING or CLOSED.
If OPEN && _link == NULL is possible, that seems like a lifecycle bug rather than a normal condition the send path should need to handle. Could we rely on _state here and make sure the reconnect/clear/replace paths maintain that invariant?
| _Z_ERROR_RETURN(_Z_ERR_TRANSPORT_NOT_AVAILABLE); | ||
| } | ||
|
|
||
| bool is_available = (ztc->_state == _Z_TRANSPORT_STATE_OPEN) && (ztc->_link != NULL); |
There was a problem hiding this comment.
See previous comment about _state and _link.
|
|
||
| bool is_available = (ztc->_state == _Z_TRANSPORT_STATE_OPEN) && (ztc->_link != NULL); | ||
| if (ztc->_batch_state != _Z_BATCHING_ACTIVE) { | ||
| return is_available ? _Z_RES_OK : _Z_ERR_TRANSPORT_NOT_AVAILABLE; |
There was a problem hiding this comment.
Was this change necessary?
stop_batching() looks more like cleanup/unwind logic than a transport availability check. If start_batching() succeeded, I would expect stop_batching() to reliably release the batching state/mutexes even if the transport has since moved to RECONNECTING or CLOSED.
I can see the value in giving feedback that something has gone wrong with the transport, but is this the right place to do that?
| #endif | ||
| ztc->_batch_state = _Z_BATCHING_IDLE; | ||
| return _Z_RES_OK; | ||
| return is_available ? _Z_RES_OK : _Z_ERR_TRANSPORT_NOT_AVAILABLE; |
|
|
||
| static z_result_t _z_client_reopen_unicast(_z_session_t *s) { | ||
| _z_string_svec_t candidates = _z_string_svec_null(); | ||
| _Z_RETURN_IF_ERR(_z_client_reconnect_candidates(s, &candidates)); |
There was a problem hiding this comment.
I’m going to pause detailed review here because I have a broader concern about the approach.
I appreciate this change was introduced while you were already working on this PR, but the client connect path was recently refactored so that configured connect locators are handled consistently as a list of alternatives. That gives us one place for locator ordering, retryable vs non-retryable errors, timeout behaviour, configured locators, and scouting fallback semantics.
This PR appears to introduce a separate client reconnection path, with its own locator candidate construction and connection establishment logic. I’m concerned that initial connect and reconnect could drift semantically over time.
I think reconnect should build on the existing client connect/open path rather than adding a second implementation. The “try the last successful locator first” behaviour feels like an extension of the existing locator ordering logic, not a separate connection establishment model.
Could we adapt the existing client connect/open path to accept an optional preferred reconnect locator, try that first while avoiding duplicates, and then fall back to the normal configured/scouted alternatives? The reconnect-specific part should ideally be limited to what happens after a successful connection: for client unicast reconnect, install the new connection into the existing transport object rather than replacing/reinitialising the whole transport.
Given that this affects the structure of the change, I think it would be better to resolve this design question before reviewing the rest of the implementation in detail.
| z_result_t ret = _z_open(&zs, &s->_config, &s->_local_zid); | ||
| _z_session_transport_mutex_unlock(s); | ||
| z_result_t ret = | ||
| s->_mode == Z_WHATAMI_CLIENT ? _z_client_reopen_unicast(s) : _z_open(&zs, &s->_config, &s->_local_zid); |
There was a problem hiding this comment.
Minor point: you're tying client reconnect to unicast here.
That may be fine for the currently supported behaviour, since _z_multicast_open_client() returns _Z_ERR_CONFIG_UNSUPPORTED_CLIENT_MULTICAST, but _z_new_transport_client() still has explicit handling for Z_LINK_CAP_TRANSPORT_RAWETH and Z_LINK_CAP_TRANSPORT_MULTICAST.
Could we either make the unicast-only assumption explicit here, or ensure the reconnect path will not become a trap if client multicast/raweth support is implemented later?
|
Does this change make #1205 unnecessary since that PR is about adressing tx resource usage while reconenction is active? |
For the reconnect case is covered in this PR, I think yes, it's no longer needed. But as I see #1205 touches more general clear paths, so that still should be relevant after this PR. |
Avoid restarting transport tasks during client reconnect. Reuse the existing client transport and try the last successful endpoint first. If that fails, continue with the configured connect locators or scouted locators. Restore the connection without replacing the transport object. Closes: eclipse-zenoh#1005 Closes: eclipse-zenoh#1053
Avoid restarting transport tasks during client reconnect.
Reuse the existing client transport and try the last successful endpoint first.
If that fails, continue with the configured connect locators or scouted locators. Restore the connection without replacing the transport object.
Closes: #1005
Closes: #1053
🏷️ Label-Based Checklist
Based on the labels applied to this PR, please complete these additional requirements:
Labels:
enhancement✨ Enhancement Requirements
Since this PR enhances existing functionality:
Remember: Enhancements should not introduce new APIs or breaking changes.
Instructions:
- [ ]to- [x])This checklist updates automatically when labels change, but preserves your checked boxes.