Description
When a ControlConnection reconnects to a new node during or shortly after a node decommission, there is a race condition that can cause the decommissioned node to permanently remain in the Metadata.Hosts collection.
Root cause
During reconnection the ControlConnection:
1. Opens a connection to a new node
2. Queries system.peers to build the node list
3. Registers for server events (TOPOLOGY_CHANGE, STATUS_CHANGE, SCHEMA_CHANGE)
If a node is being decommissioned concurrently:
- Step 2 may return stale data (the decommissioned node is still in system.peers)
- The TOPOLOGY_CHANGE REMOVED_NODE event may have already been broadcast by the server before step 3 completes
Since there is no periodic node list refresh, the driver has no further trigger to re-query system.peers, and the decommissioned node stays in the Hosts collection indefinitely. The HostConnectionPool keeps attempting to reconnect to the dead node every 5 seconds forever.
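The race can be illustrated with a small, language-agnostic toy model. This is only a sketch: `ToyServer`, `ToyControlConnection`, and the `between_query_and_register` hook are hypothetical names invented for illustration and are not driver APIs. The model assumes the server pushes events only to clients that are already registered at the moment of the decommission.

```python
class ToyServer:
    """Hypothetical stand-in for a cluster node; not a driver API."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.subscribers = []

    def query_peers(self):
        # Models the system.peers query (step 2).
        return set(self.peers)

    def register(self, callback):
        # Models registering for server events (step 3).
        self.subscribers.append(callback)

    def decommission(self, node):
        # The event reaches only clients registered at this moment.
        self.peers.discard(node)
        for callback in self.subscribers:
            callback("REMOVED_NODE", node)


class ToyControlConnection:
    """Hypothetical model of the reconnection sequence."""

    def __init__(self):
        self.hosts = set()

    def on_event(self, kind, node):
        if kind == "REMOVED_NODE":
            self.hosts.discard(node)

    def reconnect(self, server, between_query_and_register=None):
        self.hosts = server.query_peers()        # step 2
        if between_query_and_register is not None:
            between_query_and_register()         # the race window
        server.register(self.on_event)           # step 3


server = ToyServer({"node1", "node2", "node3"})
cc = ToyControlConnection()

# node3 is decommissioned after the peers query but before event registration.
cc.reconnect(server, between_query_and_register=lambda: server.decommission("node3"))

print(sorted(cc.hosts))             # client view still contains node3
print(sorted(server.query_peers())) # server view no longer contains node3
```

In this model nothing ever prompts the client to query the peers again, so the stale entry survives until some other event triggers a refresh.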
Impact
- Metadata.Hosts.Count reports incorrect host count after decommission
- GetReplicas() may return stale replica sets
- Token map is not rebuilt to reflect the reduced cluster
- Unnecessary reconnection attempts to the decommissioned node
Reproduction
The issue is observed in TokenMap_Should_RebuildTokenMap_When_NodeIsDecommissioned when two independent Cluster objects are connected to the same CCM cluster and one node is decommissioned. One cluster detects the decommission (it received the event), the other does not.
Proposed fix
Schedule a delayed node list refresh via the event debouncer after every successful ControlConnection reconnection. This re-queries system.peers ~1 second later, catching any topology changes that were missed during the reconnection window.
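The idea behind the proposed fix can be sketched as a debounced, delayed refresh. This is a toy model under stated assumptions, not the driver's event debouncer: `ToyDebouncer`, `refresh_node_list`, and the `peers` and `hosts` sets are hypothetical, and the delay is shortened from the roughly 1 second described above so the example finishes quickly.

```python
import threading


class ToyDebouncer:
    """Hypothetical debouncer: runs `action` once, `delay` seconds after the last request."""

    def __init__(self, delay, action):
        self._delay = delay
        self._action = action
        self._timer = None
        self._lock = threading.Lock()

    def schedule(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # a newer request replaces the pending one
            self._timer = threading.Timer(self._delay, self._action)
            self._timer.daemon = True
            self._timer.start()
            return self._timer


peers = {"node1", "node2"}            # peers table after the decommission
hosts = {"node1", "node2", "node3"}   # stale view left by the racy reconnection
refresh_count = 0


def refresh_node_list():
    # Models re-querying the peers table and rebuilding the host list.
    global refresh_count
    refresh_count += 1
    hosts.clear()
    hosts.update(peers)


debouncer = ToyDebouncer(0.05, refresh_node_list)  # assumed short delay for the demo

# Called after a successful reconnection; two rapid requests collapse into one refresh.
debouncer.schedule()
timer = debouncer.schedule()
timer.join()  # wait for the delayed refresh to run

print(sorted(hosts), refresh_count)
```

Routing the refresh through a debouncer means several reconnections or topology events in quick succession result in a single re-query rather than one per trigger.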