Add remove node command and DisableNodeChecker #2806
base: main
Conversation
This commit introduces the remove node command which uses the DisableNodeChecker to verify that it is safe to disable/remove this node from the cluster.
Force-pushed from c68a4d5 to 069d10a
// It's important that we never operate on an outdated Logs wrt the used NodesConfiguration.
// Otherwise, we risk that we don't see all segments this node is part of.
self.metadata
    .sync(MetadataKind::Logs, TargetVersion::Latest)
    .await?;
self.logs = self.metadata.logs_snapshot();
I think this is not sufficient for ensuring that a node that the checker classifies as safe to disable won't be used in any future Logs. There are two problems:

(a) An in-flight Logs update that was made before the given node was marked as read-only.
(b) A cluster controller operating on an outdated NodesConfiguration in which a read-only node is not yet marked as read-only.

Both cases can lead to the inclusion of a node in new segments after it has been marked as read-only or after disabling/removing it :-(

To solve (a), we would probably need to also update Logs when updating NodesConfiguration to mark the node as read-only, or at the latest when removing it. That way we would fence off other in-flight Logs updates. To solve (b), we probably need to make sure that after a Logs update fails because of a failed precondition, the updating node syncs the latest NodesConfiguration before continuing.

For (a) it might be good enough to bump the version of Logs before writing the updated NodesConfiguration with the removed node. Doing the same for transitioning a node to read-only won't be sufficient, though; there we would need some kind of transactional support from the metadata store to update two keys atomically.

The underlying problem seems to be that we impose a certain relationship between NodesConfiguration and Logs once we see nodes in a specific state, e.g. when a node is in read-only mode it should no longer be included in newer nodesets. By versioning NodesConfiguration and Logs separately, we can't enforce linearizable operations across both types.
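A minimal sketch of the version-fencing idea for (a), under the assumption of a metadata store that supports version preconditions; the types and method names below are simplified stand-ins, not the actual restate APIs. Bumping Logs first means any in-flight Logs update based on the old version fails its precondition and has to re-read both keys before retrying:

// Illustrative sketch only: types and the metadata-store API are simplified stand-ins.
struct Version(u64);

struct Logs { version: Version /* plus chains, ... */ }
struct NodesConfiguration { version: Version /* plus nodes, ... */ }

enum Precondition { MatchesVersion(Version) }

#[derive(Debug)]
enum MetadataError { PreconditionFailed, Other(String) }

trait MetadataStore {
    fn put_logs(&self, logs: Logs, precondition: Precondition) -> Result<(), MetadataError>;
    fn put_nodes_config(
        &self,
        config: NodesConfiguration,
        precondition: Precondition,
    ) -> Result<(), MetadataError>;
}

// Fence off in-flight Logs updates before removing a node:
// 1. bump the Logs version (a no-op content-wise) with a precondition on the version the
//    safety check was based on,
// 2. only then write the NodesConfiguration without the node.
// A concurrent Logs update that still references the old version now fails its own
// precondition and must re-read both Logs and NodesConfiguration before retrying.
fn remove_node_with_fencing(
    store: &dyn MetadataStore,
    mut logs: Logs,
    mut nodes_config: NodesConfiguration,
) -> Result<(), MetadataError> {
    let observed_logs_version = Version(logs.version.0);
    logs.version = Version(observed_logs_version.0 + 1);
    store.put_logs(logs, Precondition::MatchesVersion(observed_logs_version))?;

    let observed_config_version = Version(nodes_config.version.0);
    nodes_config.version = Version(observed_config_version.0 + 1);
    store.put_nodes_config(
        nodes_config,
        Precondition::MatchesVersion(observed_config_version),
    )
}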
(a) is true, and this will always be the case (it's a side-effect of decoupling node membership from log configuration, and it's intentional). That said, in practice just leaving enough time between marking the node for drain (read-only) and disabling it covers all practical cases. The average time for nodes to sync the nodes configuration is <3s, and the CC should not retry (insist on) applying the new nodeset before re-evaluating (that's why retry loops on writes are not great). If we allow 15-30 seconds between the drain and the attempt to disable nodes, it should be practically sufficient.
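A small sketch of how such a drain grace period could be enforced, assuming a timestamp is recorded when the node is marked read-only; the timestamp source and the 30-second value are assumptions, not something the PR adds:

// Illustrative only: require a grace period between drain (read-only) and disable.
use std::time::{Duration, SystemTime};

const DRAIN_GRACE_PERIOD: Duration = Duration::from_secs(30);

// `read_only_since` is a hypothetical timestamp taken when the node was marked read-only.
fn drain_grace_period_elapsed(read_only_since: SystemTime) -> bool {
    SystemTime::now()
        .duration_since(read_only_since)
        .map(|elapsed| elapsed >= DRAIN_GRACE_PERIOD)
        .unwrap_or(false)
}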
if node_config.roles.contains(Role::LogServer) {
    self.safe_to_disable_log_server(node_id, node_config.log_server_config.storage_state)
        .await?
}
I think we should check regardless of the current role; the storage-state is the real indicator of membership status.
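A sketch of what that could look like, reusing the names from the diff above (exact signatures are assumptions): drop the role gate and let the checker decide based on the storage-state alone.

// Sketch of the suggestion: always consult the storage-state instead of gating on the
// LogServer role; the storage-state is what actually reflects nodeset membership.
self.safe_to_disable_log_server(node_id, node_config.log_server_config.storage_state)
    .await?;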
    .await?
}

if node_config.roles.contains(Role::MetadataServer) {
Same here
    Ok(())
}

/// Checks whether it is safe to disable the given log server identified by the node_id. It is
Since we have chain trimming, we can make this more strict by requiring that the node doesn't appear in any historical nodeset.
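A sketch of the stricter variant with simplified stand-in types (the real Logs/Chain structures differ): the node must not appear in the nodeset of any segment of any log, historical segments included.

// Illustrative sketch only; the types below are simplified stand-ins for the real
// Logs/Chain/NodeSet structures.
use std::collections::{BTreeMap, BTreeSet};

type LogId = u64;
type PlainNodeId = u32;

struct Segment {
    nodeset: BTreeSet<PlainNodeId>,
}

struct Chain {
    segments: Vec<Segment>,
}

struct Logs {
    chains: BTreeMap<LogId, Chain>,
}

// Stricter check: rely on chain trimming to eventually drop old segments, and until then
// refuse to disable a node that any segment (not just the tail) still references.
fn appears_in_any_nodeset(logs: &Logs, node_id: PlainNodeId) -> bool {
    logs.chains.values().any(|chain| {
        chain
            .segments
            .iter()
            .any(|segment| segment.nodeset.contains(&node_id))
    })
}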
| StorageState::DataLoss) => {
    return Err(DisableNodeError::NotReadOnly(storage_state));
}
I don't believe we should block the transition from DataLoss or Provisioning. Nodes in DataLoss cannot transition into read-only, but they can be passively drained if they are not a member of any nodeset (same for Provisioning).
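A sketch of the relaxed rule with simplified stand-ins (variant and error names follow the discussion but are assumptions): only ReadWrite blocks outright; Provisioning, Disabled, ReadOnly, and DataLoss can be disabled once no nodeset references the node, using a check like the appears_in_any_nodeset sketched above.

// Illustrative sketch; StorageState variants and DisableNodeError are simplified stand-ins.
#[derive(Clone, Copy, Debug)]
enum StorageState { Provisioning, Disabled, ReadOnly, ReadWrite, DataLoss }

#[derive(Debug)]
enum DisableNodeError {
    NotReadOnly(StorageState),
    StillInNodeset,
}

fn safe_to_disable(
    storage_state: StorageState,
    in_any_nodeset: bool,
) -> Result<(), DisableNodeError> {
    match storage_state {
        // writable nodes must be drained (transitioned to read-only) first
        StorageState::ReadWrite => Err(DisableNodeError::NotReadOnly(storage_state)),
        // everything else can be passively drained: safe once no nodeset references the node
        StorageState::Provisioning
        | StorageState::Disabled
        | StorageState::ReadOnly
        | StorageState::DataLoss => {
            if in_any_nodeset {
                Err(DisableNodeError::StillInNodeset)
            } else {
                Ok(())
            }
        }
    }
}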
let mut required_trim_points: HashMap<_, _> = HashMap::default();

// find for every log the latest segment in which the node is part of the nodeset
I believe this should be for all segments.
future::try_join_all(required_trim_points.into_iter().map(
    |(log_id, required_trim_point)| async move {
        let actual_trim_point = self
            .bifrost
            .clone()
            .get_trim_point(log_id)
            .await
            .map_err(|err| DisableNodeError::FindTrimPoint { log_id, err })?;
I think we can remove the check for trim point altogether.
async fn remove_node(
    &self,
    request: Request<RemoveNodeRequest>,
) -> Result<Response<()>, Status> {
Just to be sure: we are assuming that if the segments are trimmed, then partitions must have snapshotted up to a point that excludes the trimmed segments, and that's the basis for not checking the partitions' last applied position. Is that accurate?