Skip to content

SingleSp::set_power_state should probably be a bit more idempotent #270

@hawkw

Description

@hawkw

At present, if one attempts to set a SP to a power state that it's already in, one receives a 503 Service Unavailable. For instance, attempting to update Sled 14 on madrid failed when wicketd attempted to put it in A2, because the system was already in A2:

[         mupdate] [sled 14 00:00:00]   Running  7/19) Setting host power state to A2
[         mupdate] [sled 14 00:00:00]    Failed  7/19) Setting host power state to A2: after 2.38ms: updating power state failed
[         mupdate] [sled 14 00:00:00]             Caused by:
[         mupdate] [sled 14 00:00:00]             - Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "f64e6790-ccfa-43e0-a629-f1ee71d32c5b", "content-length": "233", "date": "Sun, 28 Dec 1986 00:05:40 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 14 }: Error response from SP: power state error (code 1))", request_id: "f64e6790-ccfa-43e0-a629-f1ee71d32c5b" }
[         mupdate] [sled 14 00:00:00]  Terminal Execution failed

This appears to be because the gimlet-seq task in Hubris returns an error for A2 -> A2 transitions. See this match:
https://github.com/oxidecomputer/hubris/blob/2f24dd4419634ca95742de901d18e52ff14b912b/drv/gimlet-seq-server/src/main.rs#L696-L918

gateway-sp-comms' SingleSp::set_power_state method just tries to perform the RPC, bubbling up the error from the SP without checking whether it's already in the desired state:

/// Set the current power state.
pub async fn set_power_state(&self, power_state: PowerState) -> Result<()> {
self.rpc(MgsRequest::SetPowerState(power_state))
.await
.and_then(expect_set_power_state_ack)
}

aaaaaaand finally, over in Omicron, the MGS HTTP API just bubbles up whatever error SingleSp::set_power_state returns:
https://github.com/oxidecomputer/omicron/blob/d96ea7ca8945b8ad78a53fd083850ea39789e5f0/gateway/src/http_entrypoints.rs#L623-L637

We should probably make this method succeed when the host is already in the desired state, instead. Although we could make the sequencer task in Hubris responsible for this, it seems to me that it might make more sense to do it here in MGS?

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions