Skip to content

Commit

Permalink
fix bug due to idempotency
Browse files Browse the repository at this point in the history
The sled-agent instance-ensure-unregistered API is idempotent, which is
right and proper...but it introduces an issue in the case of a
forgotten VMM after a sled-agent restart. The
instance-ensure-unregistered call will return `None` because sled-agent
doesn't know about that instance, but if this is the first time we have
discovered that sled-agent doesn't know about the instance, we will need
to move it to `Failed`. This commit fixes that.
  • Loading branch information
hawkw committed Oct 7, 2024
1 parent 360c9cd commit 34c3058
Show file tree
Hide file tree
Showing 2 changed files with 32 additions and 13 deletions.
37 changes: 31 additions & 6 deletions nexus/src/app/instance.rs
Original file line number Diff line number Diff line change
Expand Up @@ -827,7 +827,21 @@ impl super::Nexus {
self.notify_vmm_updated(opctx, propolis_id, &state).await?;
}
// If the returned state from sled-agent is `None`, then the instance
// was already unregistered --- so this should succeed.
// was already unregistered. This may have been from a prior
// instance-ensure-unregistered call, but, since we observed an
// active VMM above, the current observed VMM generation doesn't
// know that the VMM is gone, so it is possible that the sled-agent
// has misplaced this instance. Therefore, we will attempt to mark
// the VMM as `Failed` at the generation after which we observed the
// VMM. This is safe to do here, because if the instance has been
// unregistered due to a race with another
// instance-ensure-unregistered request (rather than a sled-agent
// failure), that other call will have advanced the state
// generation, and our attempt to write the failed state will not
// succeed, which is fine.
//
// Either way, the caller should not observe a returned instance
// state that believes itself to be running.
Ok(None) => {
info!(
opctx.log,
Expand All @@ -837,6 +851,14 @@ impl super::Nexus {
"vmm_id" => %propolis_id,
"sled_id" => %sled_id,
);
let _ = self
.mark_vmm_failed(
&opctx,
authz_instance.clone(),
&vmm,
&"instance already unregistered",
)
.await;
}
// If the error indicates that the VMM wasn't there to terminate,
// mark it as Failed instead.
Expand Down Expand Up @@ -1487,19 +1509,22 @@ impl super::Nexus {
/// execute, as this may just mean that another saga is already updating the
/// instance. The update will be performed eventually even if this method
/// could not update the instance.
pub(crate) async fn mark_vmm_failed(
pub(crate) async fn mark_vmm_failed<R>(
&self,
opctx: &OpContext,
authz_instance: authz::Instance,
vmm: &db::model::Vmm,
reason: &SledAgentInstanceError,
) -> Result<(), Error> {
reason: &R,
) -> Result<(), Error>
where
R: std::fmt::Display,
{
let instance_id = InstanceUuid::from_untyped_uuid(authz_instance.id());
let vmm_id = PropolisUuid::from_untyped_uuid(vmm.id);
error!(self.log, "marking VMM failed due to sled agent API error";
"instance_id" => %instance_id,
"vmm_id" => %vmm_id,
"error" => ?reason);
"error" => %reason);

let new_runtime = VmmRuntimeState {
state: db::model::VmmState::Failed,
Expand All @@ -1513,7 +1538,7 @@ impl super::Nexus {
info!(self.log, "marked VMM as Failed, preparing update saga";
"instance_id" => %instance_id,
"vmm_id" => %vmm_id,
"reason" => ?reason,
"reason" => %reason,
);
let saga = instance_update::SagaInstanceUpdate::prepare(
&instance_update::Params {
Expand Down
8 changes: 1 addition & 7 deletions nexus/tests/integration_tests/instances.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6069,15 +6069,9 @@ async fn test_instance_force_terminate(cptestctx: &ControlPlaneTestContext) {
instance_post(&client, &already_gone_name, InstanceOp::ForceTerminate)
.await
);
assert_eq!(instance.runtime.run_state, InstanceState::Stopping);
// This time, the instance will go to `Failed` rather than `Stopped` since
// sled-agent is no longer aware of it.
instance_wait_for_state(
client,
InstanceUuid::from_untyped_uuid(instance.identity.id),
InstanceState::Failed,
)
.await;
assert_eq!(instance.runtime.run_state, InstanceState::Failed);
}

async fn instance_get(
Expand Down

0 comments on commit 34c3058

Please sign in to comment.