
any form of rolling upgrade is impossible with Flux #6609

Open
garlick opened this issue Feb 5, 2025 · 6 comments

Comments

garlick commented Feb 5, 2025

Problem: it's not possible to have compute nodes reboot between jobs and update to a new Flux version because flux broker versions must match exactly in a given Flux instance.

On a machine the size of El Capitan, rebooting everything at once stresses other parts of the system, so a rolling reboot could potentially reduce the length of system downtime if Flux allowed at least consecutive releases to interoperate, e.g. for now, MAJOR.MINOR.PATCH and MAJOR.MINOR.(PATCH+1).

It's easy enough to relax the checks that occur during broker wireup. It will be a little more challenging to figure out how to check this in CI.
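For illustration only, here is a rough Python sketch of what such a relaxed rule could look like (not the broker's actual C wire-up code; the function name, tuple representation, and version numbers are hypothetical):

```python
# Hypothetical sketch of the proposed rule: same MAJOR.MINOR, and PATCH
# levels differing by at most one, are allowed to interoperate.
def versions_compatible(local, remote):
    """local and remote are (major, minor, patch) broker version tuples."""
    lmaj, lmin, lpatch = local
    rmaj, rmin, rpatch = remote
    return lmaj == rmaj and lmin == rmin and abs(lpatch - rpatch) <= 1

# e.g. consecutive patch releases interoperate, a minor bump does not:
assert versions_compatible((0, 61, 2), (0, 61, 3))
assert not versions_compatible((0, 61, 2), (0, 62, 0))
```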

Related:

wihobbs commented Feb 6, 2025

> It's easy enough to relax the checks that occur during broker wireup. It will be a little more challenging to figure out how to check this in CI.

flux-test-collective has gone stale and hasn't run for a while (which is on me), but perhaps we could leverage it to do this. In the past, binaries from previous days' runs have been saved in a workspace folder, so we could develop a script that boots yesterday's version of flux with today's brokers. We could also check whether the brokers can wire up under whatever version is in /usr/bin on the system in question. Just a thought.
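As a rough Python sketch of that idea (the workspace paths are hypothetical placeholders, and the parsing assumes `flux version` prints a `commands:` line), a CI step could first decide whether the two builds are consecutive releases before attempting a mixed-version wire-up test:

```python
# Rough sketch: check whether yesterday's saved build and today's build are
# consecutive releases before attempting a mixed-version broker wire-up test.
# The workspace paths below are hypothetical placeholders.
import re
import subprocess

def flux_version(bindir):
    """Parse MAJOR.MINOR.PATCH from `flux version` output of a given install."""
    out = subprocess.run([f"{bindir}/flux", "version"],
                         capture_output=True, text=True, check=True).stdout
    major, minor, patch = re.search(r"commands:\s+(\d+)\.(\d+)\.(\d+)", out).groups()
    return int(major), int(minor), int(patch)

old = flux_version("/workspace/yesterday/bin")  # hypothetical path
new = flux_version("/workspace/today/bin")      # hypothetical path

# Same rule as sketched above: identical MAJOR.MINOR, PATCH within one.
if old[:2] == new[:2] and abs(old[2] - new[2]) <= 1:
    print("consecutive releases: attempt the mixed-version wire-up test")
else:
    print("not consecutive releases: skip the mixed-version test")
```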

garlick commented Feb 6, 2025

It's not quite on target IMHO to test "today" vs "yesterday" or "today" vs "host install" if we need to guarantee that consecutive releases interoperate. We might have to do something like side-install the last tagged version and use that in combination with the version built in the development tree.

It hurts my brain, though, trying to think how to test that. Maybe flux start could grow some options for starting different brokers on different ranks, and then we could change test_under_flux to conditionally use that and run the whole test suite through with different configurations. Somehow.

grondo commented Feb 6, 2025

We could reduce the test surface by only allowing non-rank-0 nodes to have a version greater than rank 0, and not the other way around.

Then we'd get a lot of mileage out of being able to use the system broker on rank 0 and the built flux everywhere else. We'd want to ensure the test commands were using the built flux, and perhaps even run them on a non-rank-0 broker (simulating the typical use of a login node, though I'm not sure login nodes are part of the rolling upgrade plan).
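A tiny Python sketch of that asymmetric rule (illustration only; names hypothetical): a joining non-rank-0 broker may match rank 0 or be one patch level ahead, but never behind.

```python
# Hypothetical sketch of the asymmetric rule: a non-rank-0 broker may be at
# the same version as rank 0 or one patch level ahead, but never behind.
def downstream_allowed(rank0, joining):
    """rank0 and joining are (major, minor, patch) broker version tuples."""
    if rank0[:2] != joining[:2]:          # MAJOR.MINOR must match exactly
        return False
    return 0 <= joining[2] - rank0[2] <= 1

assert downstream_allowed((0, 61, 2), (0, 61, 3))      # upgraded compute node: ok
assert not downstream_allowed((0, 61, 3), (0, 61, 2))  # rank 0 newer: rejected
```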

garlick commented Feb 7, 2025

> We could reduce the test surface by only allowing non-rank-0 nodes to have a version greater than rank 0, and not the other way around.

One wrinkle with that: if you're running a batch/alloc job in an instance that has some upgraded compute nodes, the job might end up with the newer release on its rank 0 and the older one on other ranks. Also, if the TBON isn't flat, we can still end up with a wire-up like old - new - old - new.

grondo commented Feb 7, 2025

Hm, good point. It would be best if a rolling upgrade could be coordinated so that nodes are upgraded as they become free and are not available for scheduling until the update is complete. This would ensure that new jobs started after the rolling update begins always use the newest version. This might be difficult to enforce in practice, though.

grondo commented Feb 7, 2025

Or, if we can get dynamic property management working, perhaps the broker version could be a property assigned to each node, and on systems where rolling upgrades are enabled, the scheduler could (somehow) ensure that jobs are only assigned a set of brokers with matching versions. (Note: the current property constraint matching can't accomplish this, unfortunately.)
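For reference, a hedged sketch of why the existing property constraints fall short here: a job can be pinned to nodes carrying one specific, named property, but there is no way to express "any set of nodes whose version properties all agree with each other". (Python, illustration only; the property name is hypothetical.)

```python
# Sketch of a property constraint pinning a job to nodes carrying a
# hypothetical per-node property that records the broker version. The
# submitter must name one specific version up front; the constraint
# language cannot say "all assigned nodes must simply match each other".
import json

constraint = {"properties": ["flux-core-0.61.3"]}   # hypothetical property name
print(json.dumps(constraint))
# On the command line this would roughly correspond to something like
#   flux run --requires=flux-core-0.61.3 ...
# assuming the property name is acceptable to --requires.
```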
