(This came up from asking Claude to review a branch I have working toward #9801, and it found this problem with my branch, but we also shipped it in R18.)
R18 revved the sled-agent write_network_bootstore_config() API. Following what we usually do for API changes, the "old" handler converts the incoming request into the latest version and delegates to the new handler:
omicron/sled-agent/api/src/lib.rs, lines 823 to 829 in 7c56d21:

    async fn write_network_bootstore_config_v1(
        rqctx: RequestContext<Self::Context>,
        body: TypedBody<v1::early_networking::EarlyNetworkConfig>,
    ) -> Result<HttpResponseUpdatedNoContent, HttpError> {
        Self::write_network_bootstore_config(rqctx, body.map(|x| x.into()))
            .await
    }

// 1. One scrimlet is updated; its sled agent is now running the new
// version.
// 2. Nexus (still running the old version) sends a
// `write_network_bootstore_config_vN()` request to the updated scrimlet.
// 3. The scrimlet converts the from-old-Nexus `vN` request to the latest
// bootstore format and tells the bootstore to replicate it.
// 4. Other sleds, which have NOT YET been updated, will now see the new
// version and be unable to deserialize it.
We would need a combination of things to happen for this to be a problem:
- Nexus would need to tell sled-agent to write new bootstore contents during an update, after one or both scrimlets have had host OS updates but before all other sleds have completed their host OS updates.
- After the scrimlet(s) propagate the bootstore changes to all the other sleds, one or more of the sleds still on the old OS would need to reboot. (This can happen routinely during an update; e.g., to update a sled's SP prior to updating its OS.)
- sled-agent would be unable to read the new bootstore version.
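The sequence above can be sketched in a few lines of std-only Rust. Everything here is hypothetical and for illustration only: `ConfigV1`, `ConfigV2`, `StoredConfig`, the `schema_version` field, and the function names are stand-ins, not omicron's real types or the bootstore's actual on-disk format.

```rust
// Hypothetical, std-only sketch; none of these types are omicron's real ones.

/// Stand-in for the serialized config replicated through the bootstore.
#[derive(Debug, Clone, PartialEq)]
struct StoredConfig {
    schema_version: u32,
    payload: String,
}

/// Stand-in for the old (v1) request body sent by a not-yet-updated Nexus.
struct ConfigV1 {
    payload: String,
}

/// Stand-in for the latest request body.
struct ConfigV2 {
    payload: String,
}

impl From<ConfigV1> for ConfigV2 {
    fn from(old: ConfigV1) -> Self {
        ConfigV2 { payload: old.payload }
    }
}

/// "New" handler on an updated scrimlet: always persists the latest schema.
fn write_config_latest(body: ConfigV2) -> StoredConfig {
    StoredConfig { schema_version: 2, payload: body.payload }
}

/// "Old" handler on the same scrimlet: upconvert, then delegate.
fn write_config_v1(body: ConfigV1) -> StoredConfig {
    write_config_latest(body.into())
}

/// Reader on a sled that has not been updated: it only understands schema 1.
fn read_config_old_sled(stored: &StoredConfig) -> Result<String, String> {
    match stored.schema_version {
        1 => Ok(stored.payload.clone()),
        v => Err(format!("unsupported bootstore schema version {v}")),
    }
}

fn main() {
    // Old Nexus sends a v1 request; the updated scrimlet stores it as v2.
    let stored = write_config_v1(ConfigV1 { payload: "uplink config".to_string() });

    // A not-yet-updated sled reboots and tries to read the replicated config.
    match read_config_old_sled(&stored) {
        Ok(cfg) => println!("old sled read config: {cfg}"),
        Err(e) => println!("old sled cannot read config: {e}"),
    }
}
```

The point of the sketch is that the request arrived in the old format, yet what gets replicated is in the new one, so the sender's version says nothing about what the other sleds can read.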
I think if we hit this, we'd see sled-agent spin while trying to start up, continuously logging that it was unable to get the network config from the bootstore:
omicron/sled-agent/src/sled_agent.rs, lines 620 to 633 in ccede3c:

    let rack_network_config: Option<RackNetworkConfig> =
        retry_notify::<_, String, _, _, _, _>(
            retry_policy_internal_service_aggressive(),
            get_network_config,
            |error, delay| {
                warn!(
                    log,
                    "failed to get network config from bootstore";
                    "error" => ?error,
                    "retry_after" => ?delay,
                );
            },
        )
        .await

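To show why a permanent deserialization error would make this loop spin, here is a minimal std-only sketch of the retry-and-notify shape. It is an assumption-laden illustration: `retry_notify_bounded` is a hypothetical helper, it is bounded so the example terminates, the delays are invented, and it does not reproduce the real `retry_notify` or its retry policy.

```rust
use std::time::Duration;

/// Hypothetical bounded retry helper: runs `op` up to `max_attempts` times
/// and calls `notify` with the error and the next delay after each failure.
fn retry_notify_bounded<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut notify: impl FnMut(&E, Duration),
) -> Option<T> {
    // Illustrative backoff values, not the real policy.
    let mut delay = Duration::from_millis(100);
    for _ in 0..max_attempts {
        match op() {
            Ok(value) => return Some(value),
            Err(error) => {
                notify(&error, delay);
                // A real loop would sleep for `delay` here; omitted so the
                // sketch runs instantly.
                delay = (delay * 2).min(Duration::from_secs(5));
            }
        }
    }
    None
}

fn main() {
    let mut warnings = 0;
    // The operation fails the same way every time, as it would if the stored
    // config were in a format this sled cannot parse.
    let result: Option<String> = retry_notify_bounded(
        5,
        || Err::<String, String>("unable to deserialize network config".to_string()),
        |error, delay| {
            warnings += 1;
            println!(
                "failed to get network config from bootstore: {error} (retry after {delay:?})"
            );
        },
    );
    println!("result = {result:?}, warnings = {warnings}");
}
```

Retrying helps with transient failures, but an error caused by an unreadable format is the same on every attempt, so each retry only produces another warning.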
We should advise customers not to change upstream network configuration during updates to R18. I've got a branch in progress that will fix this for R19 (and will fix it for updating to R19 as well, since the root cause here is the newer software pushing out information that isn't readable by not-yet-updated prior versions).