Network changes during an update to R18 could replicate unreadable bootstore contents #9943

@jgallagher

(This came up from asking Claude to review a branch I have working toward #9801, and it found this problem with my branch, but we also shipped it in R18.)

R18 revved the sled-agent write_network_bootstore_config() API. Following what we usually do for API changes, the "old" handler converts the incoming request into the latest version and delegates to the new handler:

    async fn write_network_bootstore_config_v1(
        rqctx: RequestContext<Self::Context>,
        body: TypedBody<v1::early_networking::EarlyNetworkConfig>,
    ) -> Result<HttpResponseUpdatedNoContent, HttpError> {
        Self::write_network_bootstore_config(rqctx, body.map(|x| x.into()))
            .await
    }

However, in the case of bootstore, this is fundamentally incorrect, because of this possible sequence:

    // 1. One scrimlet is updated; its sled agent is now running the new
    //    version.
    // 2. Nexus (still running the old version) sends a
    //    `write_network_bootstore_config_vN()` request to the updated scrimlet.
    // 3. The scrimlet converts the from-old-Nexus `vN` request to the latest
    //    bootstore format and tells the bootstore to replicate it.
    // 4. Other sleds, which have NOT YET been updated, will now see the new
    //    version and be unable to deserialize it.
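
To make the sequence concrete, here is a toy model of it. Everything in it (`SledSoftware`, `StoredConfig`, the integer schema versions) is made up for illustration and is not the actual sled-agent or bootstore code; it only captures the idea that a writer that always upconverts to its newest schema can produce contents that an older reader rejects.

```rust
// Hypothetical toy model; none of these types exist in the real codebase.

/// The software on one sled, described only by the newest bootstore schema
/// version it knows how to read and write.
#[derive(Clone, Copy)]
struct SledSoftware {
    max_known_schema: u32,
}

/// A replicated bootstore payload, tagged with the schema it was written in.
#[derive(Clone, Debug, PartialEq)]
struct StoredConfig {
    schema_version: u32,
    body: String,
}

impl SledSoftware {
    /// Mimics the "old handler converts to latest and delegates" pattern:
    /// regardless of the version the request arrived in, the payload that
    /// gets replicated uses this sled's newest schema.
    fn handle_write(&self, _request_version: u32, body: &str) -> StoredConfig {
        StoredConfig {
            schema_version: self.max_known_schema,
            body: body.to_string(),
        }
    }

    /// Reading fails if the payload was written in a schema newer than
    /// anything this software knows about.
    fn read(&self, stored: &StoredConfig) -> Result<String, String> {
        if stored.schema_version > self.max_known_schema {
            Err(format!(
                "unknown schema version {} (newest known: {})",
                stored.schema_version, self.max_known_schema
            ))
        } else {
            Ok(stored.body.clone())
        }
    }
}
```

With an updated scrimlet at `max_known_schema: 2` and a not-yet-updated sled at `max_known_schema: 1`, a version-1 request from old Nexus passed through `handle_write` comes out tagged as version 2, and `read` on the old sled returns `Err`.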

We would need a combination of things to happen for this to be a problem:

  1. Nexus would need to tell sled-agent to write new bootstore contents during an update, after one or both scrimlets have had host OS updates but before all other sleds have completed their host OS updates.
  2. After the scrimlet(s) propagate the bootstore changes to all the other sleds, one or more of the sleds still on the old OS would need to reboot. (This can happen routinely during an update; e.g., to update a sled's SP prior to updating its OS.)
  3. sled-agent would be unable to read the new bootstore version.

I think if we hit this, we'd see sled-agent spin while trying to start up, continuously logging that it was unable to get the network config from the bootstore:

    let rack_network_config: Option<RackNetworkConfig> =
        retry_notify::<_, String, _, _, _, _>(
            retry_policy_internal_service_aggressive(),
            get_network_config,
            |error, delay| {
                warn!(
                    log,
                    "failed to get network config from bootstore";
                    "error" => ?error,
                    "retry_after" => ?delay,
                );
            },
        )
        .await
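
As a rough illustration of why this would spin rather than recover: the error here is permanent (the stored contents never become readable by the old software), so retrying cannot help. The sketch below is a standalone stand-in, not the real `retry_notify` or retry policy; `retry_with_backoff`, its starting delay, and the attempt cap are all assumptions, and the cap exists only so the example terminates.

```rust
use std::time::Duration;

/// Hypothetical stand-in for reading the network config from the bootstore.
/// It always fails, modeling contents this software cannot deserialize.
fn get_network_config() -> Result<String, String> {
    Err("unable to deserialize bootstore contents".to_string())
}

/// Calls `op` up to `max_attempts` times, invoking `notify` with the error
/// and the next delay after each failure. The delay doubles each time but is
/// only reported, never slept, to keep the sketch fast. Returns the number of
/// failed attempts if every attempt fails.
fn retry_with_backoff<F>(
    mut op: F,
    max_attempts: u32,
    mut notify: impl FnMut(&str, Duration),
) -> Result<String, u32>
where
    F: FnMut() -> Result<String, String>,
{
    let mut delay = Duration::from_millis(250);
    for _ in 0..max_attempts {
        match op() {
            Ok(value) => return Ok(value),
            Err(error) => {
                // This is where the real code would emit the warn! log line.
                notify(&error, delay);
                delay = delay * 2;
            }
        }
    }
    Err(max_attempts)
}
```

Running `retry_with_backoff(get_network_config, 5, ...)` produces five notifications and no success; with a policy that never gives up, that would be an unbounded stream of the same warning.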

We should advise customers not to change upstream network configuration during updates to R18. I've got a branch in progress that will fix this for R19 (and will fix it for updating to R19 as well, since the root cause here is the newer software pushing out information that isn't readable by not-yet-updated prior versions).
