replicaset: do not close initial conn during retryable error #638
mrForza wants to merge 4 commits into tarantool:master
Conversation
Before this patch, the logic for discarding and nullifying vconnect's future object was located only in the `netbox_on_disconnect` trigger. Since the next patches need this logic in other places, we introduce a new replicaset function, `conn_vconnect_reset`. Needed for tarantool#632 NO_TEST=refactoring NO_DOC=refactoring
445e990 to e9f3a13 (Compare)
Serpentian
left a comment
Thank you for the patchset! Left some comments)
e9f3a13 to a1129a1 (Compare)
Before this patch, 1) the logic for determining whether a storage error is retryable was located only in `can_backoff_after_error`; 2) the logic for extracting the vshard error from the `LuajitError` or `ClientError` wrapper was located only in `replica_call`. Since the next patches need this logic in other places, we introduce new helper functions, `is_vshard_not_ready` and `unwrap_vshard_error`. Needed for tarantool#632 NO_TEST=refactoring NO_DOC=refactoring
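The retryability check described above can be sketched roughly as follows. This is a hypothetical approximation, not the actual vshard code: the function name comes from the commit message, but the table of retryable error names and the exact matching rules are assumptions.

```lua
-- Hypothetical sketch of `is_vshard_not_ready`. The set of retryable
-- errors is assumed from the discussion in this PR, not taken from the
-- real vshard source.
local RETRYABLE_NAMES = {
    STORAGE_IS_DISABLED = true, -- storage is temporarily disabled
    NO_SUCH_PROC = true,        -- vshard functions are not defined yet
}

local function is_vshard_not_ready(err)
    if err == nil then
        return false
    end
    -- Access errors are treated as retryable too: grants may simply not
    -- have been applied on the storage yet.
    if err.type == 'AccessDeniedError' then
        return true
    end
    return RETRYABLE_NAMES[err.name] == true
end
```

The point of isolating the check in one helper is that both `conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close` can reuse it instead of each cherry-picking error codes.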
a1129a1 to 50d39a6 (Compare)
50d39a6 to cfa27aa (Compare)
Serpentian
left a comment
Well done, I have only nit comments. Great tests!
Before this patch the replicaset closed an initial netbox connection during name check (`vconnect` stage) when a retryable error happened on the storage side (`STORAGE_IS_DISABLED`, `NO_SUCH_PROC`, `AccessDeniedError`). It happened because the `conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close` functions had no handling for retryable errors. This bug led to alerts on the router and, as a result, to problems in high-level products built on vshard. To fix this bug we use new error/replicaset helper functions: `is_vshard_not_ready`, `unwrap_vshard_error` and `conn_vconnect_restart`. They help us determine whether an error is retryable, extract the vshard error and restart vconnect. We use them in `conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close`. Also we delete one vconnect test, `test_vconnect_no_result`, because it tests behavior changed by this patch. Part of tarantool#632 NO_DOC=bugfix
This patch adds logging to the `conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close` functions indicating that the connection was closed due to a non-retryable error. We also modify the `test_vconnect_check_no_future` test and turn it into a template function in order to test `netbox` connection closing on a non-retryable error with the `sync / async` option, and add a check for the new vconnect logs about closing the connection. Closes tarantool#632 NO_DOC=bugfix
cfa27aa to 5f06a38 (Compare)
Serpentian
left a comment
Thank you for the patchset! Looks clean, IMHO. Just one more question we should discuss.
err.type == 'ClientError') then
    err = lerror.from_string(err.message) or err
end
local err = lerror.unwrap_vshard_error(storage_status)
Hmm, that breaks backward compatibility a little bit. We have users which explicitly access the replicaset's API and call functions via it, and those functions are not vshard ones. If they throw vshard errors, the behavior before and after the patch differs:
Diff for example
diff --git a/example/storage.lua b/example/storage.lua
index 05e5ef3..b8847f8 100755
--- a/example/storage.lua
+++ b/example/storage.lua
@@ -65,6 +65,8 @@ box.once("testapp:schema:1", function()
box.schema.role.grant('public', 'execute', 'function', 'raise_luajit_error')
box.schema.func.create('raise_client_error')
box.schema.role.grant('public', 'execute', 'function', 'raise_client_error')
+ box.schema.func.create('raise_vshard_error')
+ box.schema.role.grant('public', 'execute', 'function', 'raise_vshard_error')
end)
function customer_add(customer)
@@ -122,6 +124,11 @@ function raise_luajit_error()
assert(1 == 2)
end
+function raise_vshard_error(bid)
+ local err = require('vshard.error')
+ error(err.vshard(err.code.STORAGE_CFG_IS_IN_PROGRESS))
+end
+
function raise_client_error()
box.error(box.error.UNKNOWN)
end
Before the patch:
unix/:./data/router_1.control> vshard.router.internal.static_router.replicasets['cbf06940-0790-498b-948d-042b62cf3d29']:callrw('raise_vshard_error')
---
- null
- code: 32
base_type: LuajitError
type: LuajitError
message: '{"type":"ShardingError","name":"STORAGE_CFG_IS_IN_PROGRESS","message":"Configuration
of the storage is in progress","code":37}'
trace:
- file: ./src/lua/utils.c
line: 700
- null
...
After the patch:
unix/:./data/router_1.control> vshard.router.internal.static_router.replicasets['cbf06940-0790-498b-948d-042b62cf3d29']:callrw('raise_vshard_error')
---
- null
- type: ShardingError
code: 37
name: STORAGE_CFG_IS_IN_PROGRESS
message: Configuration of the storage is in progress
- null
...
I agree that after the patch it looks better, but I'd rather not break users' code when it's possible, and here it costs us nothing not to break it: we can just use `err = lerror.from_string(err.message) or err` in the vconnect functions and that's it. @mrForza and @Gerold103, what do you think? I'm on the side of backward compatibility here, but I'm not so sure.
I think that if users have fewer problems after the patch, this is the best option for us. These are the solutions I see:
- Leave everything as it was before the patch (i.e. revert commit f2d791453099cd3b7ff819792b239b29d5453044). I don't think this is a good solution because that commit makes our code base less dirty.
- Warn all users (in the changelog, maybe) about changes in the errors returned by the replicaset methods. There may be some pitfalls here because some users may not pay attention to it.
- Use unwrapped vshard errors only for logging and for the `lerror.is_vshard_not_ready` function. The error returned from the replicaset methods should stay original. See my diff:
diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 6e5c43f..ddbc528 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -217,7 +217,6 @@ local function conn_vconnect_check(conn)
-- Critical errors. Connection should be closed after these ones.
local result, err = vconn.future:result()
if not result then
- err = lerror.unwrap_vshard_error(err)
-- Failed to get response. E.g. access error.
return nil, lerror.make(err)
end
@@ -239,14 +238,15 @@ local function conn_vconnect_check_or_close(conn)
return ok, err
end
assert(err)
- if lerror.is_vshard_not_ready(err) then
+ local unwrapped_error = lerror.unwrap_vshard_error(err)
+ if lerror.is_vshard_not_ready(unwrapped_error) then
conn_vconnect_restart(conn)
elseif not (err.type == 'ShardingError' and
err.code == lerror.code.VHANDSHAKE_NOT_COMPLETE) then
-- Close the connection, if error happened, but it is not retryable
-- and not VSHANDSHAKE_NOT_COMPLETE.
log.warn('Closing the connection to %s due to failed initial ' ..
- 'handshake: %s', conn.replica, err)
+ 'handshake: %s', conn.replica, unwrapped_error)
conn:close()
end
return ok, err
@@ -276,7 +276,6 @@ local function conn_vconnect_wait(conn, timeout)
-- Wait for async call to return.
local res, err = future_wait(vconn.future, timeout)
if res == nil then
- err = lerror.unwrap_vshard_error(err)
-- Either timeout error or any other. If it's not a timeout error,
-- then conn must be recteated, handshake must be retried.
return nil, lerror.make(err)
end
@@ -290,11 +289,12 @@ local function conn_vconnect_wait_or_close(conn, timeout)
return ok, err
end
assert(err)
- if lerror.is_vshard_not_ready(err) then
+ local unwrapped_error = lerror.unwrap_vshard_error(err)
+ if lerror.is_vshard_not_ready(unwrapped_error) then
conn_vconnect_restart(conn)
elseif not lerror.is_timeout(err) then
log.warn('Closing the connection to %s after waiting due to failed ' ..
- 'initial handshake: %s', conn.replica, err)
+ 'initial handshake: %s', conn.replica, unwrapped_error)
conn:close()
end
return ok, err
@@ -716,10 +716,11 @@ local function replica_call(replica, func, args, opts)
if opts.timeout >= replica.net_timeout then
replica_on_failed_request(replica)
end
- local err = lerror.unwrap_vshard_error(storage_status)
+ local unwrapped_err = lerror.unwrap_vshard_error(storage_status)
replica.limiter:log_error(err,
- "Exception during calling '%s' on '%s': %s", func, replica, err)
- return false, nil, lerror.make(err)
+ "Exception during calling '%s' on '%s': %s", func, replica,
+ unwrapped_err)
+ return false, nil, lerror.make(storage_status)
else
replica_on_success_request(replica)
end
I prefer the third solution because it repairs backward compatibility with minimal changes in our codebase.
Use unwrapped vshard errors only for logging and for the `lerror.is_vshard_not_ready` function. The returned error from replicasets' methods should be original. See my diff:

But before your patchset, different results were returned depending on the error: a vshard error raised while calling a vshard function was unwrapped; in all other cases the error was returned as is. After your patches: a vshard error raised while calling any function is unwrapped, everything else is returned as is. Now you propose to return the error as is, which also breaks backward compatibility.

I'd probably prefer the solution where we change nothing in `replica_call` and just add `err = lerror.from_string(err.message) or err` to the vconnect functions.
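The unwrap-only-in-vconnect variant could look roughly like this. This is a sketch under assumptions: `from_string` below is a dependency-free stand-in for vshard's `lerror.from_string` (the real helper JSON-decodes the message; here we only pattern-match the fields the sketch needs), and `vconnect_unwrap` is a hypothetical name for what the vconnect helpers would do before deciding to close or retry.

```lua
-- Stand-in for lerror.from_string: recover a vshard error serialized into
-- an error message, or return nil when the message is not such an error.
local function from_string(msg)
    local etype = string.match(msg, '"type":"(%w+)"')
    local name = string.match(msg, '"name":"(%w+)"')
    if etype == nil then
        return nil -- not a serialized vshard error
    end
    return {type = etype, name = name}
end

-- Hypothetical helper: unwrap only inside the vconnect code path, leaving
-- replica_call and its returned errors untouched.
local function vconnect_unwrap(err)
    if err ~= nil and (err.type == 'LuajitError' or
                       err.type == 'ClientError') then
        return from_string(err.message) or err
    end
    return err
end
```

This mirrors the pre-patch `replica_call` pattern quoted earlier in the review, so user-visible errors keep their old shape while the vconnect retry logic still sees the real vshard error.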
My bad, see my updated code:
diff --git a/vshard/replicaset.lua b/vshard/replicaset.lua
index 6e5c43f..605a828 100644
--- a/vshard/replicaset.lua
+++ b/vshard/replicaset.lua
@@ -217,7 +217,6 @@ local function conn_vconnect_check(conn)
-- Critical errors. Connection should be closed after these ones.
local result, err = vconn.future:result()
if not result then
- err = lerror.unwrap_vshard_error(err)
-- Failed to get response. E.g. access error.
return nil, lerror.make(err)
end
@@ -239,14 +238,15 @@ local function conn_vconnect_check_or_close(conn)
return ok, err
end
assert(err)
- if lerror.is_vshard_not_ready(err) then
+ local unwrapped_error = lerror.unwrap_vshard_error(err)
+ if lerror.is_vshard_not_ready(unwrapped_error) then
conn_vconnect_restart(conn)
elseif not (err.type == 'ShardingError' and
err.code == lerror.code.VHANDSHAKE_NOT_COMPLETE) then
-- Close the connection, if error happened, but it is not retryable
-- and not VSHANDSHAKE_NOT_COMPLETE.
log.warn('Closing the connection to %s due to failed initial ' ..
- 'handshake: %s', conn.replica, err)
+ 'handshake: %s', conn.replica, unwrapped_error)
conn:close()
end
return ok, err
@@ -276,7 +276,6 @@ local function conn_vconnect_wait(conn, timeout)
-- Wait for async call to return.
local res, err = future_wait(vconn.future, timeout)
if res == nil then
- err = lerror.unwrap_vshard_error(err)
-- Either timeout error or any other. If it's not a timeout error,
-- then conn must be recteated, handshake must be retried.
return nil, lerror.make(err)
@@ -290,11 +289,12 @@ local function conn_vconnect_wait_or_close(conn, timeout)
return ok, err
end
assert(err)
- if lerror.is_vshard_not_ready(err) then
+ local unwrapped_error = lerror.unwrap_vshard_error(err)
+ if lerror.is_vshard_not_ready(unwrapped_error) then
conn_vconnect_restart(conn)
elseif not lerror.is_timeout(err) then
log.warn('Closing the connection to %s after waiting due to failed ' ..
- 'initial handshake: %s', conn.replica, err)
+ 'initial handshake: %s', conn.replica, unwrapped_error)
conn:close()
end
return ok, err
@@ -716,7 +716,8 @@ local function replica_call(replica, func, args, opts)
if opts.timeout >= replica.net_timeout then
replica_on_failed_request(replica)
end
- local err = lerror.unwrap_vshard_error(storage_status)
+ local err = func:startswith('vshard.') and
+ lerror.unwrap_vshard_error(storage_status) or storage_status
replica.limiter:log_error(err,
"Exception during calling '%s' on '%s': %s", func, replica, err)
return false, nil, lerror.make(err)

I hope 🙏 it will fix the backward compatibility! `replica_call` has 4 exits:
- "return" after `conn_vconnect_wait_or_close` during the initial handshake (bad way)
- "return" after `conn_vconnect_check_or_close` during the initial handshake (bad way)
- "return" after the storage call (bad way)
- "return" at the end of the function (good way)

The 1st and 2nd "return" give out the original error without unwrapping. This behavior is equivalent to the before-patch code. The 3rd "return" gives out an unwrapped error only if we call a vshard function; in other cases it is the original error. The 4th "return" is the same as before the patch.
As long as keeping backward compatibility costs nothing, I think we should preserve it.
Gerold103
left a comment
Thanks! Nicely sliced patches by the way ✍️.
I think I have just one question here from the start.
-- Close the connection, if error happened, but it is not
-- VSHANDSHAKE_NOT_COMPLETE.
if not ok and err and not (err.type == 'ShardingError' and
err.code == lerror.code.VHANDSHAKE_NOT_COMPLETE) then
What was a problem to let it close the connection and let it reconnect and retry automatically? Or was it closed forever? I think the replicaset object re-creates the connections when we try to call them and they are noticed to be closed. No?
If that isn't so, then I would suggest we simply allow reconnects regardless of error code. This would be super simple, and even makes sense, because this thing exists for name checks and name can be changed. And then we would need to reconnect somehow anyway.
It looks shaky that stuff falls unless we cherry-pick some magic set of error codes which are retryable. (I also don't like that in applier, but this is a different story.)
What was a problem to let it close the connection and let it reconnect and retry automatically? Or was it closed forever?
Firstly, the router can't recreate a closed netbox connection to a replica on its own. The user (or high-level products) must manually invoke one of the replicaset methods (`call`, `callrw`, `callro`, etc.) to recreate a recently failed netbox connection to the replica. Otherwise the connection stays closed.
Secondly, we can't allow a situation where the router has alerts right after cluster bootstrap, because high-level products depend on the absence of alerts. See this Jira ticket.
Also we have a bug which can prevent the router's connection from being recreated: the router can't recreate a netbox connection when the user manually invokes the `replicaset:callro` method. GitHub ticket.
I would suggest we simply allow reconnects regardless of error code
In this solution our router will constantly send network requests to the storages that throw errors. This will continue until the error is fixed on the storages, and it can seriously load the network the vshard cluster runs on. E.g. if we have k routers and n "broken" storages, the total number of network requests per retry round is k * n. Since some of our clients have giant clusters (dozens of routers and hundreds of storages), this solution can add significant network load (thousands of RPS)!
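The k * n estimate above can be made concrete. The cluster sizes here are illustrative assumptions ("dozens of routers and hundreds of storages"), not numbers from the discussion:

```lua
-- Illustrative arithmetic for the retry-load argument: every router
-- retries vconnect against every broken storage once per interval.
local routers = 24            -- assumed: "dozens of routers"
local broken_storages = 200   -- assumed: "hundreds of storages"
local retries_per_second = 1  -- naive retry on every failed check

local extra_rps = routers * broken_storages * retries_per_second
print(extra_rps) -- 4800 extra requests per second cluster-wide
```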
It looks shaky that stuff falls unless we cherry-pick some magic set of error codes which are retryable
I think we should carefully determine which storage errors can be resolved by itself over time (not manually by user). Only with that errors it is necessary to retry our vconnect, otherwise we can get problems with netwrok load. And I don't think that handling retryable errors can seriously mess our codebase, because:
- The amount of retryable errors should be many times less than other ones. (This is my assumption, it should be checked).
- If there are too many retryable errors, we can simply define a lua table which will store checkers for retryable errors and then in
is_vshard_not_readywe will iterate over it. But I think we won't get to that point.
I see your points. All entirely valid. This gave me a bit more things to think about 🤔.
Firstly, the router can't recreate a closed netbox connection to some replica by its own
But you see, it does right now. When connectivity is lost without an explicit close(), we reconnect automatically. I don't see a big difference between these two reasons for connection loss, and I think it makes sense to repair them in some common way.
Secondly we can't allow the situation to happen when just right after cluster bootstrap our router has alerts, because high-level products depend on the lack of alerts.
This is why I think we need to introduce auto-reconnect even for "critical" errors, with a backoff timeout. This ticket is an example of an error which we thought was critical but actually wasn't. This will happen again.
In this solution our router will constantly send network requests to those storages which throw error.
Unfortunately, this will happen in your current solution as well. I didn't see it at first, but you brought up a very good point here. In your current solution we wait on vconnect, discover a "not ready" error and immediately recreate it, because in conn_vconnect_check_or_close() you call conn_vconnect_restart() without any backoff. So if a high number of calls comes to the routers, they will hit this path all the time and will restart vconnect quite frequently until the error goes away. Each vconnect sends a request to the storage.
this solution can lead to additional load of network (thousands of RPS)!
It doesn't have to be retried once per second. For bad errors it is totally fine to retry every 30 seconds, for instance, or even more rarely (it could be a consts value that power users could modify).
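A backoff-limited restart like the one suggested above could be sketched as follows. The constant and function names are hypothetical, not vshard API; the real code would use Tarantool's fiber.clock() where this sketch uses os.clock() to stay dependency-free.

```lua
-- Sketch: restart vconnect at most once per interval, so a high call rate
-- on the router does not translate into a handshake storm on the storage.
local clock = os.clock -- stand-in for fiber.clock()

local VCONNECT_RETRY_INTERVAL = 30 -- seconds; could live in vshard.consts

local function conn_vconnect_try_to_restart(conn)
    local now = clock()
    if conn.last_restart ~= nil and
       now - conn.last_restart < VCONNECT_RETRY_INTERVAL then
        -- Too early: skip the restart instead of hammering the storage.
        return false
    end
    conn.last_restart = now
    -- The real code would call conn_vconnect_restart(conn) here.
    return true
end
```

With this shape, frequent calls merely observe the pending backoff; only the first call after the interval actually re-sends the handshake request.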
Before this patch the replicaset closed an initial netbox connection during name check (`vconnect` stage) when a retryable error happened on the storage side (`STORAGE_IS_DISABLED`, `NO_SUCH_PROC`, `AccessDeniedError`). It happened because the `conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close` functions had no handling for retryable errors. This bug led to alerts on the router and, as a result, to problems in high-level products built on vshard (e.g. TDG).

To fix this bug we introduce new replicaset helper functions, `conn_vconnect_is_retryable_error` and `conn_vconnect_try_to_restart`. They help us determine whether the error is retryable and restart vconnect. We use them in `conn_vconnect_check_or_close` and `conn_vconnect_wait_or_close`.

Also we delete one vconnect test, `test_vconnect_no_result`, because it tests behavior changed by this patch.
Closes #632
NO_DOC=bugfix