
replicaset: do not close initial conn during retryable error#638

Open
mrForza wants to merge 4 commits into tarantool:master from
mrForza:mrforza/gh-632-connection-is-closed-during-init-connection

Conversation

Contributor

@mrForza mrForza commented Jan 28, 2026

Before this patch the replicaset closed the initial netbox connection
during the name check (vconnect stage) when a retryable error happened on
the storage side (STORAGE_IS_DISABLED, NO_SUCH_PROC, AccessDeniedError).
It happened because the conn_vconnect_check_or_close and
conn_vconnect_wait_or_close functions didn't handle retryable errors.
This bug led to alerts on the router and, as a result, to problems in
high-level products built on top of vshard (e.g. TDG).

To fix this bug we introduce new replicaset helper functions -
conn_vconnect_is_retryable_error and conn_vconnect_try_to_restart.
They help us determine whether the error is retryable and restart
vconnect. We use them in conn_vconnect_check_or_close and
conn_vconnect_wait_or_close.

Also we delete one vconnect test - test_vconnect_no_result - because it
tests behavior that is changed by this patch.

Closes #632

NO_DOC=bugfix
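A rough sketch of the retryable-error check described in the commit message above (the error set comes from the message; the table layout and return contract are assumptions, not vshard's actual code):

```lua
-- Sketch only: the error names come from the commit message above; the
-- table layout and field access are assumptions, not vshard's code.
local RETRYABLE_ERRORS = {
    STORAGE_IS_DISABLED = true,
    NO_SUCH_PROC = true,
    AccessDeniedError = true,
}

local function conn_vconnect_is_retryable_error(err)
    if type(err) ~= 'table' then
        return false
    end
    -- Vshard errors carry a symbolic `name`; box errors carry a `type`.
    return RETRYABLE_ERRORS[err.name] == true or
        RETRYABLE_ERRORS[err.type] == true
end
```

On a retryable error the caller would restart vconnect instead of closing the connection; any other error keeps the old close-on-error behavior.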

Before this patch the logic of discarding and nullifying vconnect's
future object lived only in the `netbox_on_disconnect` trigger. Since
the next patches need this logic in other places, we introduce a new
replicaset function - `conn_vconnect_reset`.

Needed for tarantool#632

NO_TEST=refactoring
NO_DOC=refactoring
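A minimal sketch of what such a reset helper could look like (the field names `vconnect` and `future` are assumptions for illustration, not vshard's actual layout):

```lua
-- Sketch of the reset helper described above: discard the pending
-- vconnect future and nil it out. Field names are assumptions here.
local function conn_vconnect_reset(conn)
    local vconn = conn.vconnect
    if vconn ~= nil and vconn.future ~= nil then
        vconn.future:discard()
        vconn.future = nil
    end
end
```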
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch 3 times, most recently from 445e990 to e9f3a13 on January 28, 2026 16:07
@mrForza mrForza requested a review from Serpentian January 28, 2026 16:26
Collaborator

@Serpentian Serpentian left a comment


Thank you for the patchset! Left some comments)

@Serpentian Serpentian assigned mrForza and unassigned Serpentian Jan 29, 2026
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch from e9f3a13 to a1129a1 on February 2, 2026 14:07
@mrForza mrForza requested a review from Serpentian February 3, 2026 12:36
@mrForza mrForza assigned Serpentian and unassigned mrForza Feb 3, 2026
Before this patch
1) the logic of determining whether a storage error is retryable was
   located only in `can_backoff_after_error`;
2) the logic of extracting the vshard error from the `LuajitError` or
   `ClientError` wrapper was located only in `replica_call`.

Since the next patches need this logic in other places, we introduce two
new helper functions - `is_vshard_not_ready` and
`unwrap_vshard_error`.

Needed for tarantool#632

NO_TEST=refactoring
NO_DOC=refactoring
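The unwrapping logic can be sketched like this (the real helper decodes the payload stored in the wrapper's message via `lerror.from_string`; here the decoder is passed in as a parameter so the sketch stays self-contained):

```lua
-- Sketch: extract a vshard error from a LuajitError/ClientError
-- wrapper. `from_string` stands in for lerror.from_string so the
-- example does not depend on vshard internals.
local function unwrap_vshard_error(err, from_string)
    if type(err) == 'table' and
       (err.type == 'LuajitError' or err.type == 'ClientError') then
        -- The wrapper's message may hold a serialized vshard error.
        return from_string(err.message) or err
    end
    return err
end
```

If decoding fails, the original wrapped error is returned unchanged.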
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch from a1129a1 to 50d39a6 on February 4, 2026 12:30
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch from 50d39a6 to cfa27aa on February 5, 2026 09:04
Collaborator

@Serpentian Serpentian left a comment


Well done, I have only nit comments. Great tests!

@Serpentian Serpentian assigned mrForza and unassigned Serpentian Feb 5, 2026
Before this patch the replicaset closed the initial netbox connection
during the name check (`vconnect` stage) when a retryable error happened
on the storage side (`STORAGE_IS_DISABLED`, `NO_SUCH_PROC`,
`AccessDeniedError`). It happened because the
`conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close`
functions didn't handle retryable errors. This bug led to alerts on the
router and, as a result, to problems in high-level products built on top
of vshard.

To fix this bug we use new error/replicaset helper functions -
`is_vshard_not_ready`, `unwrap_vshard_error` and `conn_vconnect_restart`.
They help us determine whether the error is retryable, extract the
vshard error and restart vconnect. We use them in
`conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close`.

Also we delete one vconnect test - `test_vconnect_no_result` - because
it tests behavior that is changed by this patch.

Part of tarantool#632

NO_DOC=bugfix
This patch adds logging that the connection was closed due to a
non-retryable error in the `conn_vconnect_check_or_close` and
`conn_vconnect_wait_or_close` functions.

In this patch we modify the `test_vconnect_check_no_future` test and turn
it into a template function in order to test `netbox` connection closing
on a non-retryable error with the `sync / async` option. Also we add a
check for the new vconnect logs about closing the connection.

Closes tarantool#632

NO_DOC=bugfix
@mrForza mrForza force-pushed the mrforza/gh-632-connection-is-closed-during-init-connection branch from cfa27aa to 5f06a38 on February 6, 2026 11:12
@mrForza mrForza requested a review from Serpentian February 6, 2026 11:33
@mrForza mrForza assigned Serpentian and unassigned mrForza Feb 6, 2026
Collaborator

@Serpentian Serpentian left a comment


Thank you for the patchset! Looks clean, IMHO. Just one more question we should discuss.

err.type == 'ClientError') then
err = lerror.from_string(err.message) or err
end
local err = lerror.unwrap_vshard_error(storage_status)
Collaborator


Hmm, that breaks backward compatibility a little bit. We have users which explicitly access the replicaset's API and call their own functions via it, and those are not vshard ones. If they throw vshard errors, the behavior before and after the patch differs:

Diff, for example:
diff --git a/example/storage.lua b/example/storage.lua
index 05e5ef3..b8847f8 100755
--- a/example/storage.lua
+++ b/example/storage.lua
@@ -65,6 +65,8 @@ box.once("testapp:schema:1", function()
     box.schema.role.grant('public', 'execute', 'function', 'raise_luajit_error')
     box.schema.func.create('raise_client_error')
     box.schema.role.grant('public', 'execute', 'function', 'raise_client_error')
+    box.schema.func.create('raise_vshard_error')
+    box.schema.role.grant('public', 'execute', 'function', 'raise_vshard_error')
 end)
 
 function customer_add(customer)
@@ -122,6 +124,11 @@ function raise_luajit_error()
     assert(1 == 2)
 end
 
+function raise_vshard_error(bid)
+    local err = require('vshard.error')
+    error(err.vshard(err.code.STORAGE_CFG_IS_IN_PROGRESS))
+end
+
 function raise_client_error()
     box.error(box.error.UNKNOWN)
 end

Before the patch:

unix/:./data/router_1.control> vshard.router.internal.static_router.replicasets['cbf06940-0790-498b-948d-042b62cf3d29']:callrw('raise_vshard_error')
---
- null
- code: 32
  base_type: LuajitError
  type: LuajitError
  message: '{"type":"ShardingError","name":"STORAGE_CFG_IS_IN_PROGRESS","message":"Configuration
    of the storage is in progress","code":37}'
  trace:
  - file: ./src/lua/utils.c
    line: 700
- null
...

After the patch:

unix/:./data/router_1.control> vshard.router.internal.static_router.replicasets['cbf06940-0790-498b-948d-042b62cf3d29']:callrw('raise_vshard_error')
---
- null
- type: ShardingError
  code: 37
  name: STORAGE_CFG_IS_IN_PROGRESS
  message: Configuration of the storage is in progress
- null
...

I agree that after the patch it looks better, but I'd rather not break users' code when it's possible, and here it costs us nothing not to break it: we can just use err = lerror.from_string(err.message) or err in the vconnect functions and that's it. @mrForza and @Gerold103, what do you think? I'm on the side of backward compatibility here, but I'm not so sure.

Contributor Author

@mrForza mrForza Feb 6, 2026


I think that if users have fewer problems after the patch, this is the best option for us. These are the solutions I see:

  1. Leave everything as it was before the patch (or rather, revert commit f2d791453099cd3b7ff819792b239b29d5453044). I don't think this is a good solution because that commit makes our code base cleaner.

  2. Warn all users (in the changelog maybe) about the changes in the errors returned by the replicasets' methods. There may be some pitfalls here because some users may not pay attention to it.

  3. Use unwrapped vshard errors only for logging and for the lerror.is_vshard_not_ready function. The error returned from the replicasets' methods should be the original one. See my diff:

    diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
    index 6e5c43f..ddbc528 100644
    --- a/vshard/replicaset.lua
    +++ b/vshard/replicaset.lua
    @@ -217,7 +217,6 @@ local function conn_vconnect_check(conn)
         -- Critical errors. Connection should be closed after these ones.
         local result, err = vconn.future:result()
         if not result then
    -        err = lerror.unwrap_vshard_error(err)
             -- Failed to get response. E.g. access error.
             return nil, lerror.make(err)
         end
    @@ -239,14 +238,15 @@ local function conn_vconnect_check_or_close(conn)
             return ok, err
         end
         assert(err)
    -    if lerror.is_vshard_not_ready(err) then
    +    local unwrapped_error = lerror.unwrap_vshard_error(err)
    +    if lerror.is_vshard_not_ready(unwrapped_error) then
             conn_vconnect_restart(conn)
         elseif not (err.type == 'ShardingError' and
                     err.code == lerror.code.VHANDSHAKE_NOT_COMPLETE) then
             -- Close the connection, if error happened, but it is not retryable
             -- and not VSHANDSHAKE_NOT_COMPLETE.
             log.warn('Closing the connection to %s due to failed initial ' ..
    -                 'handshake: %s', conn.replica, err)
    +                 'handshake: %s', conn.replica, unwrapped_error)
             conn:close()
         end
         return ok, err
    @@ -276,7 +276,6 @@ local function conn_vconnect_wait(conn, timeout)
         -- Wait for async call to return.
         local res, err = future_wait(vconn.future, timeout)
         if res == nil then
    -        err = lerror.unwrap_vshard_error(err)
             -- Either timeout error or any other. If it's not a timeout error,
             -- then conn must be recteated, handshake must be retried.
             return nil, lerror.make(err)
    @@ -290,11 +289,12 @@ local function conn_vconnect_wait_or_close(conn, timeout)
             return ok, err
         end
         assert(err)
    -    if lerror.is_vshard_not_ready(err) then
    +    local unwrapped_error = lerror.unwrap_vshard_error(err)
    +    if lerror.is_vshard_not_ready(unwrapped_error) then
             conn_vconnect_restart(conn)
         elseif not lerror.is_timeout(err) then
             log.warn('Closing the connection to %s after waiting due to failed ' ..
    -                 'initial handshake: %s', conn.replica, err)
    +                 'initial handshake: %s', conn.replica, unwrapped_error)
             conn:close()
         end
         return ok, err
    @@ -716,10 +716,11 @@ local function replica_call(replica, func, args, opts)
             if opts.timeout >= replica.net_timeout then
                 replica_on_failed_request(replica)
             end
    -        local err = lerror.unwrap_vshard_error(storage_status)
    -        replica.limiter:log_error(err,
    -            "Exception during calling '%s' on '%s': %s", func, replica, err)
    -        return false, nil, lerror.make(err)
    +        local unwrapped_err = lerror.unwrap_vshard_error(storage_status)
    +        replica.limiter:log_error(unwrapped_err,
    +            "Exception during calling '%s' on '%s': %s", func, replica,
    +            unwrapped_err)
    +        return false, nil, lerror.make(storage_status)
         else
             replica_on_success_request(replica)
         end

I prefer the third solution because it restores backward compatibility with minimal changes to our codebase.

Collaborator


Use unwrapped vshard errors only for logging and for the lerror.is_vshard_not_ready function. The error returned from the replicasets' methods should be the original one. See my diff:

But before your patchset, different results were returned depending on the error: a vshard error raised while calling a vshard function was unwrapped; in all other cases the error was returned as is. After your patches: a vshard error from any function is unwrapped, otherwise it is returned as is. Now you propose to always return the error as is, which is also a breakage of backward compatibility.

I'd probably prefer the solution where we change nothing in replica_call and just add err = lerror.from_string(err.message) or err to the vconnect functions.

Contributor Author


My bad, see my updated code:

diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 6e5c43f..605a828 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -217,7 +217,6 @@ local function conn_vconnect_check(conn)
     -- Critical errors. Connection should be closed after these ones.
     local result, err = vconn.future:result()
     if not result then
-        err = lerror.unwrap_vshard_error(err)
         -- Failed to get response. E.g. access error.
         return nil, lerror.make(err)
     end
@@ -239,14 +238,15 @@ local function conn_vconnect_check_or_close(conn)
         return ok, err
     end
     assert(err)
-    if lerror.is_vshard_not_ready(err) then
+    local unwrapped_error = lerror.unwrap_vshard_error(err)
+    if lerror.is_vshard_not_ready(unwrapped_error) then
         conn_vconnect_restart(conn)
     elseif not (err.type == 'ShardingError' and
                 err.code == lerror.code.VHANDSHAKE_NOT_COMPLETE) then
         -- Close the connection, if error happened, but it is not retryable
         -- and not VSHANDSHAKE_NOT_COMPLETE.
         log.warn('Closing the connection to %s due to failed initial ' ..
-                 'handshake: %s', conn.replica, err)
+                 'handshake: %s', conn.replica, unwrapped_error)
         conn:close()
     end
     return ok, err
@@ -276,7 +276,6 @@ local function conn_vconnect_wait(conn, timeout)
     -- Wait for async call to return.
     local res, err = future_wait(vconn.future, timeout)
     if res == nil then
-        err = lerror.unwrap_vshard_error(err)
         -- Either timeout error or any other. If it's not a timeout error,
         -- then conn must be recteated, handshake must be retried.
         return nil, lerror.make(err)
@@ -290,11 +289,12 @@ local function conn_vconnect_wait_or_close(conn, timeout)
         return ok, err
     end
     assert(err)
-    if lerror.is_vshard_not_ready(err) then
+    local unwrapped_error = lerror.unwrap_vshard_error(err)
+    if lerror.is_vshard_not_ready(unwrapped_error) then
         conn_vconnect_restart(conn)
     elseif not lerror.is_timeout(err) then
         log.warn('Closing the connection to %s after waiting due to failed ' ..
-                 'initial handshake: %s', conn.replica, err)
+                 'initial handshake: %s', conn.replica, unwrapped_error)
         conn:close()
     end
     return ok, err
@@ -716,7 +716,8 @@ local function replica_call(replica, func, args, opts)
         if opts.timeout >= replica.net_timeout then
             replica_on_failed_request(replica)
         end
-        local err = lerror.unwrap_vshard_error(storage_status)
+        local err = func:startswith('vshard.') and
+            lerror.unwrap_vshard_error(storage_status) or storage_status
         replica.limiter:log_error(err,
             "Exception during calling '%s' on '%s': %s", func, replica, err)
         return false, nil, lerror.make(err)

I hope 🙏 it will preserve backward compatibility! replica_call has 4 exits:

  • "return" after conn_vconnect_wait_or_close during initial handshake (bad way)
  • "return" after conn_vconnect_check_or_close during initial handshake (bad way)
  • "return" after storage call (bad way)
  • "return" in the end of the function (good way)

The 1st and 2nd "return" give out the original error without unwrapping. This behavior is equivalent to the before-patch code. The 3rd "return" gives out an unwrapped error only if we call a vshard function; in other cases - the original error. The 4th "return" is the same as before the patch.

Collaborator


As long as keeping backward compatibility costs nothing, I think we should preserve it.

@Serpentian Serpentian removed their assignment Feb 6, 2026
@Serpentian Serpentian requested a review from Gerold103 February 6, 2026 14:20
Collaborator

@Gerold103 Gerold103 left a comment


Thanks! Nicely sliced patches by the way ✍️.

I think I have just one question here from the start.

Comment on lines -231 to -234
-- Close the connection, if error happened, but it is not
-- VSHANDSHAKE_NOT_COMPLETE.
if not ok and err and not (err.type == 'ShardingError' and
err.code == lerror.code.VHANDSHAKE_NOT_COMPLETE) then
Collaborator


What was the problem with letting it close the connection and then reconnect and retry automatically? Or was it closed forever? I think the replicaset object re-creates the connections when we try to call them and they are noticed to be closed. No?

If that isn't so, then I would suggest we simply allow reconnects regardless of error code. This would be super simple, and even makes sense, because this thing exists for name checks and name can be changed. And then we would need to reconnect somehow anyway.

It looks shaky that stuff falls unless we cherry-pick some magic set of error codes which are retryable. (I also don't like that in applier, but this is a different story.)

Contributor Author


What was a problem to let it close the connection and let it reconnect and retry automatically? Or was it closed forever?

Firstly, the router can't recreate a closed netbox connection to some replica on its own. The user (or high-level products) should manually invoke some of the replicasets' methods (call, callrw, callro, etc.) in order to recreate a recently failed netbox connection to a replica. Otherwise the connection stays closed.

Secondly, we can't allow the situation where our router has alerts right after cluster bootstrap, because high-level products depend on the absence of alerts. See this Jira ticket.

Also we have a bug which can prevent the router's connection from being recreated: the router can't recreate a netbox connection when the user manually invokes the replicaset:callro method. GitHub ticket.


I would suggest we simply allow reconnects regardless of error code

In this solution our router will constantly send network requests to those storages which throw errors. This will continue until the error is fixed on the storages, and it can seriously load the network the vshard cluster runs on. E.g. with k routers and n "broken" storages, the total number of network requests will be k * n. Since some of our clients have giant clusters (dozens of routers and hundreds of storages), this solution can lead to additional network load (thousands of RPS)!


It looks shaky that stuff falls unless we cherry-pick some magic set of error codes which are retryable

I think we should carefully determine which storage errors can resolve themselves over time (not manually by the user). Only for those errors is it necessary to retry our vconnect; otherwise we can get problems with network load. And I don't think that handling retryable errors will seriously mess up our codebase, because:

  1. The number of retryable errors should be many times smaller than that of the other ones. (This is my assumption; it should be checked.)
  2. If there are too many retryable errors, we can simply define a Lua table which stores checkers for retryable errors and then iterate over it in is_vshard_not_ready. But I think we won't get to that point.
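The checker-table idea from point 2 could look roughly like this (names and checkers are illustrative, not actual vshard code):

```lua
-- Illustrative sketch of the checker-table idea: each entry is a
-- predicate, and is_vshard_not_ready() simply iterates over them.
local retryable_checkers = {
    function(err) return err.name == 'STORAGE_IS_DISABLED' end,
    function(err) return err.name == 'NO_SUCH_PROC' end,
    function(err) return err.type == 'AccessDeniedError' end,
}

local function is_vshard_not_ready(err)
    if type(err) ~= 'table' then
        return false
    end
    for _, check in ipairs(retryable_checkers) do
        if check(err) then
            return true
        end
    end
    return false
end
```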

Collaborator


I see your points. All entirely valid. This gave me a bit more things to think about 🤔.

Firstly, the router can't recreate a closed netbox connection to some replica by its own

But you see? - it does right now. When connectivity is lost without an explicit close(), we reconnect automatically. I don't see a big difference between these 2 reasons for connection loss, and I think it makes sense to repair them in some common way.

Secondly we can't allow the situation to happen when just right after cluster bootstrap our router has alerts, because high-level products depend on the lack of alerts.

This is why I think we need to introduce auto-reconnect even for "critical" errors, with a backoff timeout. This ticket is an example of an error which we thought was critical and it actually was not. This will happen again.

In this solution our router will constantly send network requests to those storages which throw error.

Unfortunately, this will happen with your current solution as well. I didn't see it at first, but you brought up a very good point here. In your current solution we will wait on vconnect, discover a "not ready" error and immediately recreate it, because in conn_vconnect_check_or_close() you call conn_vconnect_restart() without any backoff. So if a high number of calls comes to the routers, they will hit this code path all the time and will restart vconnect quite frequently until the error goes away. Each vconnect sends a request to the storage.

this solution can lead to additional load of network (thousands of RPS)!

It doesn't have to be retried once per second. For bad errors it is totally fine to retry every, for instance, 30 seconds, or even more rarely (it could be made a consts value that power users could modify).
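The backoff idea could be sketched like this (VCONNECT_BACKOFF_INTERVAL and the function name are hypothetical, following the suggestion above; they are not existing vshard consts):

```lua
-- Sketch of vconnect restart with a backoff: restart only if enough
-- time has passed since the last attempt. The interval value and the
-- names here are hypothetical, per the suggestion above.
local VCONNECT_BACKOFF_INTERVAL = 30 -- seconds; could live in consts

local function may_restart_vconnect(last_attempt_ts, now)
    return now - last_attempt_ts >= VCONNECT_BACKOFF_INTERVAL
end
```

The check would gate conn_vconnect_restart() so that frequent router calls do not translate into frequent handshake requests to a broken storage.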


Development

Successfully merging this pull request may close these issues.

Connection is closed during name check on initial connection, when retryable error happens

4 participants