Network changes during an update to R18 could replicate unreadable bootstore contents #9943

(This came up when I asked Claude to review a branch I'm working on toward #9801. It found this problem in my branch, but we also shipped the same problem in R18.)

R18 revved the sled-agent `write_network_bootstore_config()` API. Following what we usually do for API changes, the "old" handler converts the incoming request into the latest version and delegates to the new handler: https://github.com/oxidecomputer/omicron/blob/7c56d21f5fb8127a619cb8557cbcc286e8ae0af7/sled-agent/api/src/lib.rs#L823-L829. However, in the case of bootstore, this is fundamentally incorrect, because of this possible sequence:

```
    // 1. One scrimlet is updated; its sled agent is now running the new
    //    version.
    // 2. Nexus (still running the old version) sends a
    //    `write_network_bootstore_config_vN()` request to the updated scrimlet.
    // 3. The scrimlet converts the from-old-Nexus `vN` request to the latest
    //    bootstore format and tells the bootstore to replicate it.
    // 4. Other sleds, which have NOT YET been updated, will now see the new
    //    version and be unable to deserialize it.
```
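The sequence above can be modeled in a few lines. This is only a minimal sketch: `StoredConfig`, `SledAgent`, `handle_write`, `read`, and `replay_sequence` are hypothetical names made up for illustration, and the numeric schema versions stand in for the real bootstore formats, which are not modeled here.

```rust
// Hypothetical stand-in for replicated bootstore contents: a schema version
// tag plus an opaque body. The real on-disk format is NOT modeled here.
#[derive(Debug, Clone, PartialEq)]
struct StoredConfig {
    schema_version: u32,
    body: String,
}

// Hypothetical sled agent that understands every schema version up to and
// including `max_known_version`.
struct SledAgent {
    max_known_version: u32,
}

impl SledAgent {
    // Mimics the "convert to latest, then delegate" handler pattern: whatever
    // version the request arrived in, the contents handed to replication are
    // in this agent's newest format.
    fn handle_write(&self, _request_version: u32, body: &str) -> StoredConfig {
        StoredConfig {
            schema_version: self.max_known_version,
            body: body.to_string(),
        }
    }

    // Reading fails when the replicated contents are newer than this agent
    // knows how to deserialize.
    fn read(&self, stored: &StoredConfig) -> Result<String, String> {
        if stored.schema_version > self.max_known_version {
            Err(format!(
                "cannot deserialize schema version {} (newest known: {})",
                stored.schema_version, self.max_known_version
            ))
        } else {
            Ok(stored.body.clone())
        }
    }
}

// Replays steps 1-4: returns what the updated scrimlet reads and what a
// not-yet-updated sled reads from the same replicated contents.
fn replay_sequence() -> (Result<String, String>, Result<String, String>) {
    let updated_scrimlet = SledAgent { max_known_version: 2 }; // step 1
    let old_sled = SledAgent { max_known_version: 1 };

    // Steps 2-3: old Nexus sends a v1 request; the scrimlet stores it as v2.
    let replicated = updated_scrimlet.handle_write(1, "uplink config");

    // Step 4: every sled sees the replicated v2 contents.
    (updated_scrimlet.read(&replicated), old_sled.read(&replicated))
}
```

In this model the updated scrimlet reads the contents back without trouble, while the sled still on the old version gets an error, even though the request that triggered the write was in a format both of them understand.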

We would need a combination of things to happen for this to be a problem:

1. Nexus would need to tell sled-agent to write new bootstore contents during an update, after one or both scrimlets have had host OS updates but before all other sleds have completed their host OS updates.
2. After the scrimlet(s) propagate the bootstore changes to all the other sleds, one or more of the sleds still on the old OS would need to reboot. (This can happen routinely during an update; e.g., to update a sled's SP prior to updating its OS.)
3. When it comes back up, the rebooted sled's sled-agent (still the old version) would be unable to deserialize the new-format bootstore contents.

I think if we hit this, we'd see sled-agent spin while trying to start up, continuously logging that it was unable to get the network config from the bootstore: https://github.com/oxidecomputer/omicron/blob/ccede3cb5287e447b0f30916000c7943b1596689/sled-agent/src/sled_agent.rs#L620-L633
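To illustrate why retrying would not help here, the sketch below uses a bounded loop in place of the real retry policy. `get_network_config_sketch` and `retry_bounded` are hypothetical helpers, not the sled-agent code, and the attempt cap exists only so the example terminates. The point is that a deserialization failure is permanent, so every attempt produces the same warning.

```rust
// Hypothetical permanent failure: the stored contents use a schema version
// newer than this sled can parse. Nothing changes between attempts.
fn get_network_config_sketch(
    stored_version: u32,
    max_known_version: u32,
) -> Result<&'static str, String> {
    if stored_version > max_known_version {
        Err(format!("unknown bootstore schema version {}", stored_version))
    } else {
        Ok("network config")
    }
}

// Bounded stand-in for a retry loop (the cap is an assumption made so this
// example finishes). Returns the warnings logged and the config, if any.
fn retry_bounded(
    max_attempts: u32,
    stored_version: u32,
    max_known_version: u32,
) -> (Vec<String>, Option<&'static str>) {
    let mut warnings = Vec::new();
    for attempt in 1..=max_attempts {
        match get_network_config_sketch(stored_version, max_known_version) {
            Ok(config) => return (warnings, Some(config)),
            Err(error) => warnings.push(format!(
                "attempt {}: failed to get network config from bootstore: {}",
                attempt, error
            )),
        }
    }
    (warnings, None)
}
```

With a stored version of 2 and a reader that only knows version 1, `retry_bounded(5, 2, 1)` logs five warnings and never yields a config. With matching versions, the first attempt succeeds and nothing is logged.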

We should advise customers not to change upstream network configuration during updates to R18. I've got a branch in progress that will fix this for R19 (and will fix it for updating _to_ R19 as well, since the root cause here is the _newer_ software pushing out information that isn't readable by not-yet-updated prior versions).

	async fn write_network_bootstore_config_v1(
	    rqctx: RequestContext<Self::Context>,
	    body: TypedBody<v1::early_networking::EarlyNetworkConfig>,
	) -> Result<HttpResponseUpdatedNoContent, HttpError> {
	    Self::write_network_bootstore_config(rqctx, body.map(|x| x.into()))
	        .await
	}

	let rack_network_config: Option<RackNetworkConfig> =
	    retry_notify::<_, String, _, _, _, _>(
	        retry_policy_internal_service_aggressive(),
	        get_network_config,
	        |error, delay| {
	            warn!(
	                log,
	                "failed to get network config from bootstore";
	                "error" => ?error,
	                "retry_after" => ?delay,
	            );
	        },
	    )
	    .await
