Add remove node command and DisableNodeChecker #2806

Draft · wants to merge 1 commit into base: main

Conversation

tillrohrmann (Contributor):

This commit introduces the remove node command which uses the DisableNodeChecker to verify that it is safe to disable/remove this node from the cluster.

Comment on lines +88 to +93
// It's important that we never operate on an outdated Logs wrt the used NodesConfiguration.
// Otherwise, we risk that we don't see all segments this node is part of.
self.metadata
.sync(MetadataKind::Logs, TargetVersion::Latest)
.await?;
self.logs = self.metadata.logs_snapshot();
@tillrohrmann (Contributor, Author) commented on Feb 28, 2025:

I think this is not sufficient for ensuring that a node which the checker classifies as safe to disable won't be used in any future Logs. There are two problems:

(a) An in-flight Logs update that was made before the given node was marked as read-only
(b) A cluster controller operating on an outdated NodesConfiguration in which the node is not yet marked as read-only

Both cases can lead to a node being included in new segments after it has been marked as read-only or even after it has been disabled/removed :-(

To solve (a), we would probably need to also update Logs when updating the NodesConfiguration to mark the node as read-only, or at the latest when removing it. That way we would fence off other in-flight Logs updates. To solve (b), we probably need to make sure that after a Logs update fails because of a failed precondition, the updating node syncs the latest NodesConfiguration before continuing.

For (a), it might be good enough to bump the version of Logs before writing the updated NodesConfiguration with the removed node. Doing the same for transitioning a node to read-only won't be sufficient, though; there we would need some kind of transactional support from the metadata store to update two keys atomically.

The underlying problem seems to be that we impose a certain relationship between NodesConfiguration and Logs once we see nodes in a specific state, e.g. once a node is in read-only mode, it should no longer be included in newer nodesets. Since NodesConfiguration and Logs are versioned separately, we can't enforce linearizable operations across both types.

Contributor:

(a) is true and will always be the case (it's a side-effect of decoupling node membership from log configuration, and it's intentional). That said, in practice, just leaving enough time between marking the node for drain (read-only) and disabling it covers all practical cases. The average time for nodes to sync the nodes configuration is under 3 seconds, and the CC should not retry (insist) on applying the new nodeset before re-evaluating (that's why retry loops on writes are not great). If we allow 15-30 seconds between drain and the attempt to disable nodes, it should be practically sufficient.
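
A minimal sketch of the version-bump fencing idea from tillrohrmann's comment above, assuming a versioned key-value metadata store with conditional writes. The VersionedStore trait, the key names, and the Precondition type are hypothetical placeholders rather than Restate's actual metadata-store API; the point is only that rewriting Logs first invalidates concurrent Logs updates that were prepared against the older version.

use anyhow::Result;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version(u64);

pub enum Precondition {
    /// The write only succeeds if the key is still at exactly this version.
    MatchesVersion(Version),
}

pub trait VersionedStore {
    fn current_version(&self, key: &str) -> Result<Version>;
    /// Conditional write; fails if the precondition no longer holds.
    fn put(&self, key: &str, value: Vec<u8>, precondition: Precondition) -> Result<()>;
}

/// Fencing sketch: bump Logs first so that any in-flight Logs update prepared against
/// the old Logs version fails its version precondition and has to re-read the (by then
/// updated) NodesConfiguration before retrying.
pub fn remove_node_with_fencing(
    store: &dyn VersionedStore,
    logs_bytes: Vec<u8>,
    nodes_config_without_node: Vec<u8>,
) -> Result<()> {
    // Step 1: rewrite Logs, even if only its version changes.
    let logs_version = store.current_version("logs")?;
    store.put("logs", logs_bytes, Precondition::MatchesVersion(logs_version))?;

    // Step 2: only now persist the NodesConfiguration without the removed node.
    let nodes_version = store.current_version("nodes_configuration")?;
    store.put(
        "nodes_configuration",
        nodes_config_without_node,
        Precondition::MatchesVersion(nodes_version),
    )?;

    Ok(())
}

As the reply notes, this only matters for updates that race with the removal; leaving enough time between drain and disable already covers most practical cases.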

Comment on lines +110 to +113
if node_config.roles.contains(Role::LogServer) {
self.safe_to_disable_log_server(node_id, node_config.log_server_config.storage_state)
.await?
}
Contributor:

I think we should run this check regardless of the current role; the storage-state is the real indicator of membership status.
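
A small sketch of that suggestion, using a simplified StorageState (the ReadOnly, DataLoss, and Provisioning variants come from this review; the remaining variants and the function itself are assumptions, not the real DisableNodeChecker API): the decision to run the log-server safety check is keyed off the storage state alone.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StorageState {
    Provisioning,
    ReadWrite,
    ReadOnly,
    DataLoss,
    Disabled,
}

/// Run the log-server safety checks whenever the storage state says the node may still
/// hold or be referenced for log data, regardless of which roles are currently listed
/// in its NodesConfiguration entry.
pub fn needs_log_server_check(storage_state: StorageState) -> bool {
    !matches!(storage_state, StorageState::Disabled)
}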

.await?
}

if node_config.roles.contains(Role::MetadataServer) {
Contributor:

Same here

Ok(())
}

/// Checks whether it is safe to disable the given log server identified by the node_id. It is
Contributor:

Since we have chain trimming, we can make this more strict by requiring that the node doesn't appear in any historical nodeset.
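
A sketch of that stricter rule, with simplified stand-ins for the Logs, Chain, and Segment types used in this PR (field and type names are assumptions): the node must be absent from the nodeset of every segment of every log's chain, historical segments included, not just the tail segment.

use std::collections::BTreeMap;

pub type LogId = u64;
pub type NodeId = u32;

pub struct Segment {
    pub nodeset: Vec<NodeId>,
}

pub struct Chain {
    /// All segments that have not been trimmed away yet, historical and current.
    pub segments: Vec<Segment>,
}

pub struct Logs {
    pub chains: BTreeMap<LogId, Chain>,
}

/// Strict membership check: safe to disable only if no remaining segment lists the node.
pub fn appears_in_any_nodeset(logs: &Logs, node_id: NodeId) -> bool {
    logs.chains
        .values()
        .flat_map(|chain| chain.segments.iter())
        .any(|segment| segment.nodeset.contains(&node_id))
}

Chain trimming eventually removes fully obsolete segments, which is what makes this stricter requirement practical.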

Comment on lines +135 to +137
| StorageState::DataLoss) => {
return Err(DisableNodeError::NotReadOnly(storage_state));
}
Contributor:

I don't believe we should block the transition from DataLoss or Provisioning. Nodes in DataLoss cannot transition into read-only, but they can be passively drained if they are not a member of any nodeset (the same applies to Provisioning).
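
A sketch of that relaxed rule, again with a simplified StorageState and an assumed in_any_nodeset flag standing in for the real membership check; this is not the DisableNodeChecker API, just the decision table described above.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StorageState {
    Provisioning,
    ReadWrite,
    ReadOnly,
    DataLoss,
    Disabled,
}

/// ReadWrite nodes must be drained (made read-only) first; every other state is
/// acceptable as long as no nodeset still references the node.
pub fn safe_to_disable(storage_state: StorageState, in_any_nodeset: bool) -> bool {
    match storage_state {
        // Already out of the picture.
        StorageState::Disabled => true,
        // A drained node is safe once no nodeset references it.
        StorageState::ReadOnly => !in_any_nodeset,
        // Passive drain: these states cannot transition to read-only, but if no nodeset
        // references the node there is nothing left to move off it.
        StorageState::DataLoss | StorageState::Provisioning => !in_any_nodeset,
        // Still accepting writes; has to be transitioned to read-only first.
        StorageState::ReadWrite => false,
    }
}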


let mut required_trim_points: HashMap<_, _> = HashMap::default();

// find for every log the latest segment in which the node is part of the nodeset
Contributor:

I believe this should be done for all segments, not just the latest one.

Comment on lines +184 to +191
future::try_join_all(required_trim_points.into_iter().map(
|(log_id, required_trim_point)| async move {
let actual_trim_point = self
.bifrost
.clone()
.get_trim_point(log_id)
.await
.map_err(|err| DisableNodeError::FindTrimPoint { log_id, err })?;
Contributor:

I think we can remove the check for trim point altogether.

Comment on lines +295 to +298
async fn remove_node(
&self,
request: Request<RemoveNodeRequest>,
) -> Result<Response<()>, Status> {
Contributor:

Just to be sure: we are assuming that if the segments are trimmed, then the partitions must have snapshotted up to a point that excludes the trimmed segments, and that's the basis for not checking the partition's last applied. Is that accurate?
