
replicaset: do not close initial conn during retryable error#638

Open
mrForza wants to merge 4 commits into tarantool:master from
mrForza:mrforza/gh-632-connection-is-closed-during-init-connection

Conversation

Contributor

@mrForza mrForza commented Jan 28, 2026

Before this patch the replicaset closed the initial netbox connection
during the name check (vconnect stage) when a retryable error happened on
the storage side (STORAGE_IS_DISABLED, NO_SUCH_PROC, AccessDeniedError).
It happened because the conn_vconnect_check_or_close and
conn_vconnect_wait_or_close functions didn't handle retryable errors.
This bug led to alerts on the router and, as a result, to problems in
high-level products built on top of vshard (e.g. TDG).

To fix this bug we introduce new replicaset helper functions -
conn_vconnect_is_retryable_error and conn_vconnect_try_to_restart.
They help us determine whether the error is retryable and restart
vconnect. We use them in conn_vconnect_check_or_close and
conn_vconnect_wait_or_close.

Also we delete one vconnect test - test_vconnect_no_result - because it
tests behavior that is changed by this patch.

Closes #632

NO_DOC=bugfix
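A rough sketch of the retryable-error check described in the commit message above (the error set comes from the message; the table layout and return contract are assumptions, not vshard's actual code):

```lua
-- Sketch only: the error names come from the commit message above; the
-- table layout and field access are assumptions, not vshard's code.
local RETRYABLE_ERRORS = {
    STORAGE_IS_DISABLED = true,
    NO_SUCH_PROC = true,
    AccessDeniedError = true,
}

local function conn_vconnect_is_retryable_error(err)
    if type(err) ~= 'table' then
        return false
    end
    -- Vshard errors carry a symbolic `name`; box errors carry a `type`.
    return RETRYABLE_ERRORS[err.name] == true or
        RETRYABLE_ERRORS[err.type] == true
end
```

On a retryable error the caller would restart vconnect instead of closing the connection; any other error keeps the old close-on-error behavior.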

Before this patch the logic of discarding and nullifying vconnect's
future object lived only in the `netbox_on_disconnect` trigger. Since
the next patches need this logic in other places, we introduce a new
replicaset function - `conn_vconnect_reset`.

Needed for tarantool#632

NO_TEST=refactoring
NO_DOC=refactoring
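A minimal sketch of what such a reset helper could look like (the field names `vconnect` and `future` are assumptions for illustration, not vshard's actual layout):

```lua
-- Sketch of the reset helper described above: discard the pending
-- vconnect future and nil it out. Field names are assumptions here.
local function conn_vconnect_reset(conn)
    local vconn = conn.vconnect
    if vconn ~= nil and vconn.future ~= nil then
        vconn.future:discard()
        vconn.future = nil
    end
end
```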
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch 3 times, most recently from 445e990 to e9f3a13 on January 28, 2026 16:07
@mrForza mrForza requested a review from Serpentian January 28, 2026 16:26
Collaborator

@Serpentian Serpentian left a comment


Thank you for the patchset! Left some comments)

@Serpentian Serpentian assigned mrForza and unassigned Serpentian Jan 29, 2026
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch from e9f3a13 to a1129a1 on February 2, 2026 14:07
@mrForza mrForza requested a review from Serpentian February 3, 2026 12:36
@mrForza mrForza assigned Serpentian and unassigned mrForza Feb 3, 2026
Before this patch
1) the logic of determining whether a storage error is retryable was
   located only in `can_backoff_after_error`;
2) the logic of extracting the vshard error from the `LuajitError` or
   `ClientError` wrapper was located only in `replica_call`.

Since the next patches need this logic in other places, we introduce two
new helper functions - `is_vshard_not_ready` and
`unwrap_vshard_error`.

Needed for tarantool#632

NO_TEST=refactoring
NO_DOC=refactoring
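The unwrapping logic can be sketched like this (the real helper decodes the payload stored in the wrapper's message via `lerror.from_string`; here the decoder is passed in as a parameter so the sketch stays self-contained):

```lua
-- Sketch: extract a vshard error from a LuajitError/ClientError
-- wrapper. `from_string` stands in for lerror.from_string so the
-- example does not depend on vshard internals.
local function unwrap_vshard_error(err, from_string)
    if type(err) == 'table' and
       (err.type == 'LuajitError' or err.type == 'ClientError') then
        -- The wrapper's message may hold a serialized vshard error.
        return from_string(err.message) or err
    end
    return err
end
```

If decoding fails, the original wrapped error is returned unchanged.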
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch from a1129a1 to 50d39a6 on February 4, 2026 12:30
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch from 50d39a6 to cfa27aa on February 5, 2026 09:04
Collaborator

@Serpentian Serpentian left a comment


Well done, I have only nit comments. Great tests!

@Serpentian Serpentian assigned mrForza and unassigned Serpentian Feb 5, 2026
Before this patch the replicaset closed the initial netbox connection
during the name check (`vconnect` stage) when a retryable error happened
on the storage side (`STORAGE_IS_DISABLED`, `NO_SUCH_PROC`,
`AccessDeniedError`). It happened because the
`conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close`
functions didn't handle retryable errors. This bug led to alerts on the
router and, as a result, to problems in high-level products built on top
of vshard.

To fix this bug we use new error/replicaset helper functions -
`is_vshard_not_ready`, `unwrap_vshard_error` and `conn_vconnect_restart`.
They help us determine whether the error is retryable, extract the
vshard error and restart vconnect. We use them in
`conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close`.

Also we delete one vconnect test - `test_vconnect_no_result` - because
it tests behavior that is changed by this patch.

Part of tarantool#632

NO_DOC=bugfix
This patch adds logging that the connection was closed due to a
non-retryable error in the `conn_vconnect_check_or_close` and
`conn_vconnect_wait_or_close` functions.

In this patch we modify the `test_vconnect_check_no_future` test and turn
it into a template function in order to test `netbox` connection closing
on a non-retryable error with the `sync / async` option. Also we add a
check for the new vconnect logs about closing the connection.

Closes tarantool#632

NO_DOC=bugfix
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch from cfa27aa to 5f06a38 on February 6, 2026 11:12
@mrForza mrForza requested a review from Serpentian February 6, 2026 11:33
@mrForza mrForza assigned Serpentian and unassigned mrForza Feb 6, 2026
Collaborator

@Serpentian Serpentian left a comment


Thank you for the patchset! Looks clean, IMHO. Just one more question we should discuss.

err.type == 'ClientError') then
err = lerror.from_string(err.message) or err
end
local err = lerror.unwrap_vshard_error(storage_status)
Collaborator


Hmm, that breaks backward compatibility a little bit. We have users which explicitly access the replicaset's API and call their own functions via it, and those are not vshard ones. If they throw vshard errors, the behavior before and after the patch differs:

Diff, for example:
diff --git a/example/storage.lua b/example/storage.lua
index 05e5ef3..b8847f8 100755
--- a/example/storage.lua
+++ b/example/storage.lua
@@ -65,6 +65,8 @@ box.once("testapp:schema:1", function()
     box.schema.role.grant('public', 'execute', 'function', 'raise_luajit_error')
     box.schema.func.create('raise_client_error')
     box.schema.role.grant('public', 'execute', 'function', 'raise_client_error')
+    box.schema.func.create('raise_vshard_error')
+    box.schema.role.grant('public', 'execute', 'function', 'raise_vshard_error')
 end)
 
 function customer_add(customer)
@@ -122,6 +124,11 @@ function raise_luajit_error()
     assert(1 == 2)
 end
 
+function raise_vshard_error(bid)
+    local err = require('vshard.error')
+    error(err.vshard(err.code.STORAGE_CFG_IS_IN_PROGRESS))
+end
+
 function raise_client_error()
     box.error(box.error.UNKNOWN)
 end

Before the patch:

unix/:./data/router_1.control> vshard.router.internal.static_router.replicasets['cbf06940-0790-498b-948d-042b62cf3d29']:callrw('raise_vshard_error')
---
- null
- code: 32
  base_type: LuajitError
  type: LuajitError
  message: '{"type":"ShardingError","name":"STORAGE_CFG_IS_IN_PROGRESS","message":"Configuration
    of the storage is in progress","code":37}'
  trace:
  - file: ./src/lua/utils.c
    line: 700
- null
...

After the patch:

unix/:./data/router_1.control> vshard.router.internal.static_router.replicasets['cbf06940-0790-498b-948d-042b62cf3d29']:callrw('raise_vshard_error')
---
- null
- type: ShardingError
  code: 37
  name: STORAGE_CFG_IS_IN_PROGRESS
  message: Configuration of the storage is in progress
- null
...

I agree that after the patch it looks better, but I'd rather not break users' code when it's possible, and here it costs us nothing not to break it: we can just use err = lerror.from_string(err.message) or err in the vconnect functions and that's it. @mrForza and @Gerold103, what do you think? I'm on the side of backward compatibility here, but I'm not so sure.

Contributor Author

@mrForza mrForza Feb 6, 2026


I think that if users have fewer problems after the patch, this is the best option for us. These are the solutions I see:

  1. Leave everything as it was before the patch (or rather, revert commit f2d791453099cd3b7ff819792b239b29d5453044). I don't think this is a good solution because that commit makes our code base cleaner.

  2. Warn all users (in the changelog maybe) about the changes in the errors returned by the replicasets' methods. There may be some pitfalls here because some users may not pay attention to it.

  3. Use unwrapped vshard errors only for logging and for the lerror.is_vshard_not_ready function. The error returned from the replicasets' methods should be the original one. See my diff:

    diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
    index 6e5c43f..ddbc528 100644
    --- a/vshard/replicaset.lua
    +++ b/vshard/replicaset.lua
    @@ -217,7 +217,6 @@ local function conn_vconnect_check(conn)
         -- Critical errors. Connection should be closed after these ones.
         local result, err = vconn.future:result()
         if not result then
    -        err = lerror.unwrap_vshard_error(err)
             -- Failed to get response. E.g. access error.
             return nil, lerror.make(err)
         end
    @@ -239,14 +238,15 @@ local function conn_vconnect_check_or_close(conn)
             return ok, err
         end
         assert(err)
    -    if lerror.is_vshard_not_ready(err) then
    +    local unwrapped_error = lerror.unwrap_vshard_error(err)
    +    if lerror.is_vshard_not_ready(unwrapped_error) then
             conn_vconnect_restart(conn)
         elseif not (err.type == 'ShardingError' and
                     err.code == lerror.code.VHANDSHAKE_NOT_COMPLETE) then
             -- Close the connection, if error happened, but it is not retryable
             -- and not VSHANDSHAKE_NOT_COMPLETE.
             log.warn('Closing the connection to %s due to failed initial ' ..
    -                 'handshake: %s', conn.replica, err)
    +                 'handshake: %s', conn.replica, unwrapped_error)
             conn:close()
         end
         return ok, err
    @@ -276,7 +276,6 @@ local function conn_vconnect_wait(conn, timeout)
         -- Wait for async call to return.
         local res, err = future_wait(vconn.future, timeout)
         if res == nil then
    -        err = lerror.unwrap_vshard_error(err)
             -- Either timeout error or any other. If it's not a timeout error,
             -- then conn must be recteated, handshake must be retried.
             return nil, lerror.make(err)
    @@ -290,11 +289,12 @@ local function conn_vconnect_wait_or_close(conn, timeout)
             return ok, err
         end
         assert(err)
    -    if lerror.is_vshard_not_ready(err) then
    +    local unwrapped_error = lerror.unwrap_vshard_error(err)
    +    if lerror.is_vshard_not_ready(unwrapped_error) then
             conn_vconnect_restart(conn)
         elseif not lerror.is_timeout(err) then
             log.warn('Closing the connection to %s after waiting due to failed ' ..
    -                 'initial handshake: %s', conn.replica, err)
    +                 'initial handshake: %s', conn.replica, unwrapped_error)
             conn:close()
         end
         return ok, err
    @@ -716,10 +716,11 @@ local function replica_call(replica, func, args, opts)
             if opts.timeout >= replica.net_timeout then
                 replica_on_failed_request(replica)
             end
    -        local err = lerror.unwrap_vshard_error(storage_status)
    -        replica.limiter:log_error(err,
    -            "Exception during calling '%s' on '%s': %s", func, replica, err)
    -        return false, nil, lerror.make(err)
    +        local unwrapped_err = lerror.unwrap_vshard_error(storage_status)
    +        replica.limiter:log_error(unwrapped_err,
    +            "Exception during calling '%s' on '%s': %s", func, replica,
    +            unwrapped_err)
    +        return false, nil, lerror.make(storage_status)
         else
             replica_on_success_request(replica)
         end

I prefer the third solution because it restores backward compatibility with minimal changes to our codebase.

Collaborator


Use unwrapped vshard errors only for logging and for the lerror.is_vshard_not_ready function. The error returned from the replicasets' methods should be the original one. See my diff:

But before your patchset, different results were returned depending on the error: a vshard error raised while calling a vshard function was unwrapped; in all other cases the error was returned as is. After your patches: a vshard error from any function is unwrapped, otherwise it is returned as is. Now you propose to always return the error as is, which is also a breakage of backward compatibility.

I'd probably prefer the solution where we change nothing in replica_call and just add err = lerror.from_string(err.message) or err to the vconnect functions.

Contributor Author


My bad, see my updated code:

diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 6e5c43f..605a828 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -217,7 +217,6 @@ local function conn_vconnect_check(conn)
     -- Critical errors. Connection should be closed after these ones.
     local result, err = vconn.future:result()
     if not result then
-        err = lerror.unwrap_vshard_error(err)
         -- Failed to get response. E.g. access error.
         return nil, lerror.make(err)
     end
@@ -239,14 +238,15 @@ local function conn_vconnect_check_or_close(conn)
         return ok, err
     end
     assert(err)
-    if lerror.is_vshard_not_ready(err) then
+    local unwrapped_error = lerror.unwrap_vshard_error(err)
+    if lerror.is_vshard_not_ready(unwrapped_error) then
         conn_vconnect_restart(conn)
     elseif not (err.type == 'ShardingError' and
                 err.code == lerror.code.VHANDSHAKE_NOT_COMPLETE) then
         -- Close the connection, if error happened, but it is not retryable
         -- and not VSHANDSHAKE_NOT_COMPLETE.
         log.warn('Closing the connection to %s due to failed initial ' ..
-                 'handshake: %s', conn.replica, err)
+                 'handshake: %s', conn.replica, unwrapped_error)
         conn:close()
     end
     return ok, err
@@ -276,7 +276,6 @@ local function conn_vconnect_wait(conn, timeout)
     -- Wait for async call to return.
     local res, err = future_wait(vconn.future, timeout)
     if res == nil then
-        err = lerror.unwrap_vshard_error(err)
         -- Either timeout error or any other. If it's not a timeout error,
         -- then conn must be recteated, handshake must be retried.
         return nil, lerror.make(err)
@@ -290,11 +289,12 @@ local function conn_vconnect_wait_or_close(conn, timeout)
         return ok, err
     end
     assert(err)
-    if lerror.is_vshard_not_ready(err) then
+    local unwrapped_error = lerror.unwrap_vshard_error(err)
+    if lerror.is_vshard_not_ready(unwrapped_error) then
         conn_vconnect_restart(conn)
     elseif not lerror.is_timeout(err) then
         log.warn('Closing the connection to %s after waiting due to failed ' ..
-                 'initial handshake: %s', conn.replica, err)
+                 'initial handshake: %s', conn.replica, unwrapped_error)
         conn:close()
     end
     return ok, err
@@ -716,7 +716,8 @@ local function replica_call(replica, func, args, opts)
         if opts.timeout >= replica.net_timeout then
             replica_on_failed_request(replica)
         end
-        local err = lerror.unwrap_vshard_error(storage_status)
+        local err = func:startswith('vshard.') and
+            lerror.unwrap_vshard_error(storage_status) or storage_status
         replica.limiter:log_error(err,
             "Exception during calling '%s' on '%s': %s", func, replica, err)
         return false, nil, lerror.make(err)

I hope 🙏 it will preserve backward compatibility! replica_call has 4 exits:

  • "return" after conn_vconnect_wait_or_close during initial handshake (bad way)
  • "return" after conn_vconnect_check_or_close during initial handshake (bad way)
  • "return" after storage call (bad way)
  • "return" in the end of the function (good way)

The 1st and 2nd "return" give out the original error without unwrapping. This behavior is equivalent to the before-patch code. The 3rd "return" gives out an unwrapped error only if we call a vshard function; in other cases - the original error. The 4th "return" is the same as before the patch.

Collaborator


As long as keeping backward compatibility costs nothing, I think we should preserve it.

@Serpentian Serpentian removed their assignment Feb 6, 2026
@Serpentian Serpentian requested a review from Gerold103 February 6, 2026 14:20
Collaborator

@Gerold103 Gerold103 left a comment


Thanks! Nicely sliced patches by the way ✍️.

I think I have just one question here from the start.

Comment on lines -231 to -234
-- Close the connection, if error happened, but it is not
-- VSHANDSHAKE_NOT_COMPLETE.
if not ok and err and not (err.type == 'ShardingError' and
err.code == lerror.code.VHANDSHAKE_NOT_COMPLETE) then
Collaborator


What was the problem with letting it close the connection and then reconnect and retry automatically? Or was it closed forever? I think the replicaset object re-creates the connections when we try to call them and they are noticed to be closed. No?

If that isn't so, then I would suggest we simply allow reconnects regardless of error code. This would be super simple, and even makes sense, because this thing exists for name checks and name can be changed. And then we would need to reconnect somehow anyway.

It looks shaky that stuff falls unless we cherry-pick some magic set of error codes which are retryable. (I also don't like that in applier, but this is a different story.)

Contributor Author


What was a problem to let it close the connection and let it reconnect and retry automatically? Or was it closed forever?

Firstly, the router can't recreate a closed netbox connection to some replica on its own. The user (or high-level products) should manually invoke some of the replicasets' methods (call, callrw, callro, etc.) in order to recreate a recently failed netbox connection to a replica. Otherwise the connection stays closed.

Secondly, we can't allow the situation where our router has alerts right after cluster bootstrap, because high-level products depend on the absence of alerts. See this Jira ticket.

Also we have a bug which can prevent the router's connection from being recreated: the router can't recreate a netbox connection when the user manually invokes the replicaset:callro method. GitHub ticket.


I would suggest we simply allow reconnects regardless of error code

In this solution our router will constantly send network requests to those storages which throw errors. This will continue until the error is fixed on the storages, and it can seriously load the network the vshard cluster runs on. E.g. with k routers and n "broken" storages, the total number of network requests will be k * n. Since some of our clients have giant clusters (dozens of routers and hundreds of storages), this solution can lead to additional network load (thousands of RPS)!


It looks shaky that stuff falls unless we cherry-pick some magic set of error codes which are retryable

I think we should carefully determine which storage errors can resolve themselves over time (not manually by the user). Only for those errors is it necessary to retry our vconnect; otherwise we can get problems with network load. And I don't think that handling retryable errors will seriously mess up our codebase, because:

  1. The number of retryable errors should be many times smaller than that of the other ones. (This is my assumption; it should be checked.)
  2. If there are too many retryable errors, we can simply define a Lua table which stores checkers for retryable errors and then iterate over it in is_vshard_not_ready. But I think we won't get to that point.
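The checker-table idea from point 2 could look roughly like this (names and checkers are illustrative, not actual vshard code):

```lua
-- Illustrative sketch of the checker-table idea: each entry is a
-- predicate, and is_vshard_not_ready() simply iterates over them.
local retryable_checkers = {
    function(err) return err.name == 'STORAGE_IS_DISABLED' end,
    function(err) return err.name == 'NO_SUCH_PROC' end,
    function(err) return err.type == 'AccessDeniedError' end,
}

local function is_vshard_not_ready(err)
    if type(err) ~= 'table' then
        return false
    end
    for _, check in ipairs(retryable_checkers) do
        if check(err) then
            return true
        end
    end
    return false
end
```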

Collaborator


I see your points. All entirely valid. This gave me a bit more things to think about 🤔.

Firstly, the router can't recreate a closed netbox connection to some replica by its own

But you see? - it does right now. When connectivity is lost without an explicit close(), we reconnect automatically. I don't see a big difference between these 2 reasons for connection loss, and I think it makes sense to repair them in some common way.

Secondly we can't allow the situation to happen when just right after cluster bootstrap our router has alerts, because high-level products depend on the lack of alerts.

This is why I think we need to introduce auto-reconnect even for "critical" errors, with a backoff timeout. This ticket is an example of an error which we thought was critical and it actually was not. This will happen again.

In this solution our router will constantly send network requests to those storages which throw error.

Unfortunately, this will happen with your current solution as well. I didn't see it at first, but you brought up a very good point here. In your current solution we will wait on vconnect, discover a "not ready" error and immediately recreate it, because in conn_vconnect_check_or_close() you call conn_vconnect_restart() without any backoff. So if a high number of calls comes to the routers, they will hit this code path all the time and will restart vconnect quite frequently until the error goes away. Each vconnect sends a request to the storage.

this solution can lead to additional load of network (thousands of RPS)!

It doesn't have to be retried once per second. For bad errors it is totally fine to retry every, for instance, 30 seconds, or even more rarely (it could be made a consts value that power users could modify).
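The backoff idea could be sketched like this (VCONNECT_BACKOFF_INTERVAL and the function name are hypothetical, following the suggestion above; they are not existing vshard consts):

```lua
-- Sketch of vconnect restart with a backoff: restart only if enough
-- time has passed since the last attempt. The interval value and the
-- names here are hypothetical, per the suggestion above.
local VCONNECT_BACKOFF_INTERVAL = 30 -- seconds; could live in consts

local function may_restart_vconnect(last_attempt_ts, now)
    return now - last_attempt_ts >= VCONNECT_BACKOFF_INTERVAL
end
```

The check would gate conn_vconnect_restart() so that frequent router calls do not translate into frequent handshake requests to a broken storage.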


Development

Successfully merging this pull request may close these issues.

Connection is closed during name check on initial connection, when retryable error happens

4 participants