Add remove node command and DisableNodeChecker #2806

Draft · wants to merge 1 commit into base: main

Conversation

tillrohrmann (Contributor):

This commit introduces the remove node command which uses the DisableNodeChecker to verify that it is safe to disable/remove this node from the cluster.

Comment on lines +88 to +93
// It's important that we never operate on an outdated Logs wrt the used NodesConfiguration.
// Otherwise, we risk that we don't see all segments this node is part of.
self.metadata
.sync(MetadataKind::Logs, TargetVersion::Latest)
.await?;
self.logs = self.metadata.logs_snapshot();
@tillrohrmann (Contributor, Author) commented on Feb 28, 2025:

I think this is not sufficient for ensuring that a node which the checker classifies as safe to disable won't be used in any future Logs. There are two problems:

(a) An in-flight Logs update that was made before the given node was marked as read-only
(b) A cluster controller operating on an outdated NodesConfiguration in which the node is not yet marked as read-only

Both cases can lead to a node being included in new segments after it has been marked as read-only or even after it has been disabled/removed :-(

To solve (a), we would probably need to also update Logs when updating the NodesConfiguration to mark the node as read-only, or at the latest when removing it. That way we would fence off other in-flight Logs updates. To solve (b), we probably need to make sure that after a Logs update fails because of a failed precondition, the updating node syncs the latest NodesConfiguration before continuing.

For (a), it might be good enough to bump the version of Logs before writing the updated NodesConfiguration with the removed node. Doing the same for transitioning a node to read-only won't be sufficient, though; there we would need some kind of transactional support from the metadata store to update two keys atomically.

The underlying problem seems to be that we impose a certain relationship between NodesConfiguration and Logs once we see nodes in a specific state, e.g. once a node is in read-only mode, it should no longer be included in newer nodesets. Since NodesConfiguration and Logs are versioned separately, we can't enforce linearizable operations across both types.

Contributor:

(a) is true and will always be the case (it's a side-effect of decoupling node membership from log configuration, and it's intentional). That said, in practice, just leaving enough time between marking the node for drain (read-only) and disabling it covers all practical cases. The average time for nodes to sync the nodes configuration is under 3 seconds, and the CC should not retry (insist) on applying the new nodeset before re-evaluating (that's why retry loops on writes are not great). If we allow 15-30 seconds between drain and the attempt to disable nodes, it should be practically sufficient.
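
A minimal sketch of the version-bump fencing idea from tillrohrmann's comment above, assuming a versioned key-value metadata store with conditional writes. The VersionedStore trait, the key names, and the Precondition type are hypothetical placeholders rather than Restate's actual metadata-store API; the point is only that rewriting Logs first invalidates concurrent Logs updates that were prepared against the older version.

use anyhow::Result;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version(u64);

pub enum Precondition {
    /// The write only succeeds if the key is still at exactly this version.
    MatchesVersion(Version),
}

pub trait VersionedStore {
    fn current_version(&self, key: &str) -> Result<Version>;
    /// Conditional write; fails if the precondition no longer holds.
    fn put(&self, key: &str, value: Vec<u8>, precondition: Precondition) -> Result<()>;
}

/// Fencing sketch: bump Logs first so that any in-flight Logs update prepared against
/// the old Logs version fails its version precondition and has to re-read the (by then
/// updated) NodesConfiguration before retrying.
pub fn remove_node_with_fencing(
    store: &dyn VersionedStore,
    logs_bytes: Vec<u8>,
    nodes_config_without_node: Vec<u8>,
) -> Result<()> {
    // Step 1: rewrite Logs, even if only its version changes.
    let logs_version = store.current_version("logs")?;
    store.put("logs", logs_bytes, Precondition::MatchesVersion(logs_version))?;

    // Step 2: only now persist the NodesConfiguration without the removed node.
    let nodes_version = store.current_version("nodes_configuration")?;
    store.put(
        "nodes_configuration",
        nodes_config_without_node,
        Precondition::MatchesVersion(nodes_version),
    )?;

    Ok(())
}

As the reply notes, this only matters for updates that race with the removal; leaving enough time between drain and disable already covers most practical cases.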

Comment on lines +110 to +113
if node_config.roles.contains(Role::LogServer) {
self.safe_to_disable_log_server(node_id, node_config.log_server_config.storage_state)
.await?
}
Contributor:

I think we should run this check regardless of the current role; the storage-state is the real indicator of membership status.
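
A small sketch of that suggestion, using a simplified StorageState (the ReadOnly, DataLoss, and Provisioning variants come from this review; the remaining variants and the function itself are assumptions, not the real DisableNodeChecker API): the decision to run the log-server safety check is keyed off the storage state alone.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StorageState {
    Provisioning,
    ReadWrite,
    ReadOnly,
    DataLoss,
    Disabled,
}

/// Run the log-server safety checks whenever the storage state says the node may still
/// hold or be referenced for log data, regardless of which roles are currently listed
/// in its NodesConfiguration entry.
pub fn needs_log_server_check(storage_state: StorageState) -> bool {
    !matches!(storage_state, StorageState::Disabled)
}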

.await?
}

if node_config.roles.contains(Role::MetadataServer) {
Contributor:

Same here

Ok(())
}

/// Checks whether it is safe to disable the given log server identified by the node_id. It is
Contributor:

Since we have chain trimming, we can make this more strict by requiring that the node doesn't appear in any historical nodeset.
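
A sketch of that stricter rule, with simplified stand-ins for the Logs, Chain, and Segment types used in this PR (field and type names are assumptions): the node must be absent from the nodeset of every segment of every log's chain, historical segments included, not just the tail segment.

use std::collections::BTreeMap;

pub type LogId = u64;
pub type NodeId = u32;

pub struct Segment {
    pub nodeset: Vec<NodeId>,
}

pub struct Chain {
    /// All segments that have not been trimmed away yet, historical and current.
    pub segments: Vec<Segment>,
}

pub struct Logs {
    pub chains: BTreeMap<LogId, Chain>,
}

/// Strict membership check: safe to disable only if no remaining segment lists the node.
pub fn appears_in_any_nodeset(logs: &Logs, node_id: NodeId) -> bool {
    logs.chains
        .values()
        .flat_map(|chain| chain.segments.iter())
        .any(|segment| segment.nodeset.contains(&node_id))
}

Chain trimming eventually removes fully obsolete segments, which is what makes this stricter requirement practical.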

Comment on lines +135 to +137
| StorageState::DataLoss) => {
return Err(DisableNodeError::NotReadOnly(storage_state));
}
Contributor:

I don't believe we should block the transition from DataLoss or Provisioning. Nodes in DataLoss cannot transition into read-only, but they can be passively drained if they are not a member of any nodeset (the same applies to Provisioning).
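
A sketch of that relaxed rule, again with a simplified StorageState and an assumed in_any_nodeset flag standing in for the real membership check; this is not the DisableNodeChecker API, just the decision table described above.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StorageState {
    Provisioning,
    ReadWrite,
    ReadOnly,
    DataLoss,
    Disabled,
}

/// ReadWrite nodes must be drained (made read-only) first; every other state is
/// acceptable as long as no nodeset still references the node.
pub fn safe_to_disable(storage_state: StorageState, in_any_nodeset: bool) -> bool {
    match storage_state {
        // Already out of the picture.
        StorageState::Disabled => true,
        // A drained node is safe once no nodeset references it.
        StorageState::ReadOnly => !in_any_nodeset,
        // Passive drain: these states cannot transition to read-only, but if no nodeset
        // references the node there is nothing left to move off it.
        StorageState::DataLoss | StorageState::Provisioning => !in_any_nodeset,
        // Still accepting writes; has to be transitioned to read-only first.
        StorageState::ReadWrite => false,
    }
}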


let mut required_trim_points: HashMap<_, _> = HashMap::default();

// find for every log the latest segment in which the node is part of the nodeset
Contributor:

I believe this should be done for all segments, not just the latest one.

Comment on lines +184 to +191
future::try_join_all(required_trim_points.into_iter().map(
|(log_id, required_trim_point)| async move {
let actual_trim_point = self
.bifrost
.clone()
.get_trim_point(log_id)
.await
.map_err(|err| DisableNodeError::FindTrimPoint { log_id, err })?;
Contributor:

I think we can remove the check for trim point altogether.

Comment on lines +295 to +298
async fn remove_node(
&self,
request: Request<RemoveNodeRequest>,
) -> Result<Response<()>, Status> {
Contributor:

Just to be sure: we are assuming that if the segments are trimmed, then the partitions must have snapshotted up to a point that excludes the trimmed segments, and that's the basis for not checking the partition's last applied. Is that accurate?
