feat(orchestrator): do not ignore subnet assignment on upgrade loop errors #7868
+244
−149
The upgrade loop in the orchestrator is responsible both for executing upgrades and for determining the subnet ID of the node, which is used to provision SSH keys and rotate IDKG keys. However, there are multiple code flows where the orchestrator determines the subnet ID but then hits an error later in the loop. In that case the function returns an error and the caller does not apply the subnet ID. This prevents SSH keys from being provisioned even though the subnet ID was correctly identified.
An example of such a code flow is when the local CUP is not deserializable but the NiDkgId is, which allows the subnet ID to be correctly determined (i.e. we hit here). But since the CUP is not deserializable and currently has the highest height compared to a recovery CUP or a peer's CUP (imagine we are at the very start of a recovery, before applying SSH keys, so there is no recovery CUP yet), we return an error here: the subnet ID is not updated and SSH keys are not provisioned. If the local CUP does not have the highest height (i.e. there is a recovery CUP), then we can use the latter, which explains why we can still recover.
Note: the existing system test sr_app_no_upgrade_with_chain_keys_test checks that we can recover a subnet in exactly that case (the CUP is not deserializable but the NiDkgId is). As explained, nodes can see the Recovery CUP, but we do not apply readonly keys even though we could. In a parallel PR, I distinguished the cases where the NiDkgId is corrupted or not. If it is corrupted, then there is indeed no way of provisioning SSH keys, but there is also no way of seeing the Recovery CUP, so failover nodes must be used. If it is not corrupted, then we should be able to provision SSH keys. When the second case runs on the current implementation, it fails because we cannot provision SSH keys. With this branch merged into it, the test succeeds, which is a positive sign for the added value of this change.
Another example is when we detect that we need to leave the subnet but removing the state fails (i.e. hit here). Then we would return an error again and fail to remove the SSH keys of the subnet.
This PR is not supposed to bring any functional change to the upgrade logic. Instead, it modifies the return type of the loop so that the subnet assignment is also returned on errors, whenever it could be determined.
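To illustrate the idea, here is a minimal sketch of an error type that carries the best-known subnet assignment. It is not the code in this PR: `SubnetId`, `SubnetAssignment`, `UpgradeLoopError`, `NodeInputs`, `check_for_upgrade` and `assignment_to_apply` are hypothetical names and simplified types chosen for the example, and the real orchestrator types differ.

```rust
// Hypothetical, simplified stand-ins for the orchestrator's types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SubnetId(pub u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubnetAssignment {
    /// The assignment could not be determined (e.g. the NiDkgId is corrupted).
    Unknown,
    /// The node does not belong to any subnet.
    Unassigned,
    /// The node belongs to the given subnet.
    Assigned(SubnetId),
}

/// Error that still carries whatever assignment was determined before the failure.
#[derive(Debug, PartialEq, Eq)]
pub struct UpgradeLoopError {
    pub message: String,
    pub subnet_assignment: SubnetAssignment,
}

/// Hypothetical inputs for one loop iteration.
pub struct NodeInputs {
    /// Subnet ID read from the NiDkgId, if it could be deserialized.
    pub nidkg_subnet_id: Option<SubnetId>,
    /// Whether the local CUP could be deserialized.
    pub local_cup_deserializable: bool,
}

/// One iteration of a (much simplified) upgrade loop.
pub fn check_for_upgrade(inputs: &NodeInputs) -> Result<SubnetAssignment, UpgradeLoopError> {
    // Step 1: determine the assignment as early as possible.
    let assignment = match inputs.nidkg_subnet_id {
        Some(id) => SubnetAssignment::Assigned(id),
        None => SubnetAssignment::Unknown,
    };

    // Step 2: a later step fails, but the error keeps the assignment from step 1.
    if !inputs.local_cup_deserializable {
        return Err(UpgradeLoopError {
            message: "local CUP is not deserializable".to_string(),
            subnet_assignment: assignment,
        });
    }

    Ok(assignment)
}

/// What the caller would apply (e.g. for SSH key provisioning) on success or failure.
pub fn assignment_to_apply(
    result: &Result<SubnetAssignment, UpgradeLoopError>,
) -> SubnetAssignment {
    match result {
        Ok(assignment) => *assignment,
        Err(err) => err.subnet_assignment,
    }
}
```

With this shape, a caller can keep treating the iteration as failed while still using `Assigned(..)` for key provisioning. When the sketch reports `Unknown`, the caller would presumably keep whatever assignment it had before.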
PS: The PR also uses the same registry version for the entire loop, instead of determining the latest registry version multiple times (in functions prepare_upgrade_if_scheduled, check_for_upgrade_as_unassigned, should_node_become_unassigned), in order to have more consistent and predictable behaviour.
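A rough sketch of the registry-version point, under heavy assumptions: `MockRegistry`, `RegistryVersion` and `run_iteration` are hypothetical, and the three functions below only borrow the names mentioned above, with made-up signatures that simply echo the version they were given. The mock advances on every read to mimic a registry that changes mid-iteration.

```rust
use std::cell::Cell;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegistryVersion(pub u64);

/// Hypothetical registry whose latest version advances on every read.
pub struct MockRegistry {
    latest: Cell<u64>,
}

impl MockRegistry {
    pub fn new(start: u64) -> Self {
        MockRegistry { latest: Cell::new(start) }
    }

    /// Returns the current version, then bumps it to simulate a concurrent update.
    pub fn get_latest_version(&self) -> RegistryVersion {
        let version = self.latest.get();
        self.latest.set(version + 1);
        RegistryVersion(version)
    }
}

// Stand-ins with assumed signatures: each step takes the version as a
// parameter instead of querying the registry itself.
fn prepare_upgrade_if_scheduled(version: RegistryVersion) -> RegistryVersion {
    version
}

fn check_for_upgrade_as_unassigned(version: RegistryVersion) -> RegistryVersion {
    version
}

fn should_node_become_unassigned(version: RegistryVersion) -> RegistryVersion {
    version
}

/// Reads the registry version once and threads it through every step,
/// returning the version each step observed.
pub fn run_iteration(registry: &MockRegistry) -> [RegistryVersion; 3] {
    let version = registry.get_latest_version();
    [
        prepare_upgrade_if_scheduled(version),
        check_for_upgrade_as_unassigned(version),
        should_node_become_unassigned(version),
    ]
}
```

Because the version is read once, all three steps see the same registry snapshot even if the registry advances in the meantime.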