any form of rolling upgrade is impossible with Flux #6609
Comments
It's gone stale and hasn't run for a while, which is on me, but perhaps we could leverage flux-test-collective to do this. In the past, binaries from previous days' runs have been saved in a workspace folder, so we could develop a script that would boot yesterday's version of flux with today's brokers? We could also check whether the brokers can wire up under whatever version is in …
It's not quite on target IMHO to test "today" vs "yesterday" or "today" vs "host install" if we need to guarantee that consecutive releases interoperate. We might have to do something like side-install the last tagged version and use that in combination with the version built in the development tree. It hurts my brain, though, trying to think how to test that. Maybe …
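One hedged way to automate "the last tagged version" in CI: list the repository's release tags, pick the newest one, side-install it, and run interop tests against the development build. A minimal Python sketch of just the tag-selection step (the `vMAJOR.MINOR.PATCH` tag format is an assumption):

```python
import re

def latest_release_tag(tags):
    """Pick the newest vMAJOR.MINOR.PATCH tag from a list of git tags."""
    versions = []
    for tag in tags:
        m = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag.strip())
        if m:
            versions.append((tuple(int(x) for x in m.groups()), tag.strip()))
    return max(versions)[1] if versions else None

print(latest_release_tag(["v0.60.0", "v0.61.0", "v0.61.1", "some-other-tag"]))
# -> v0.61.1
```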
We could reduce the test surface by only allowing non-rank-0 nodes to have a version greater than rank 0's, and not the other way around. Then we'd get a lot of mileage out of being able to use the system broker on rank 0 and the built flux everywhere else. We'd want to ensure the test commands were using the built flux, perhaps even run them on a non-rank-0 broker (simulating the typical use of a login node, though I'm not sure login nodes are part of the rolling upgrade plan).
One wrinkle with that: if you're running a batch/alloc job in an instance that has some upgraded compute nodes, the job might end up with the newer release on its rank 0 and the older one on other ranks. Also, if the TBON isn't flat, we can still end up with a wire-up like old - new - old - new.
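For the non-flat TBON case, what ultimately has to hold is pairwise compatibility on every parent-child edge, whatever the old/new mix looks like. A minimal Python sketch of that check (the topology and version maps are made-up inputs for illustration, not an existing Flux API):

```python
def mixed_wireup_ok(parent_of, version_of, compatible):
    """True iff every child broker's version is compatible with its TBON
    parent's version (parent_of maps child rank -> parent rank)."""
    return all(compatible(version_of[child], version_of[parent])
               for child, parent in parent_of.items())

# A depth-2 binary TBON where upgraded and not-yet-upgraded nodes alternate:
parent_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
version_of = {rank: "0.99.1" if rank % 2 else "0.99.0" for rank in range(7)}

# Placeholder rule: versions are compatible if they share MAJOR.MINOR.
same_series = lambda a, b: a.rsplit(".", 1)[0] == b.rsplit(".", 1)[0]
print(mixed_wireup_ok(parent_of, version_of, same_series))  # True
```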
Hm, good point. It would be best if a rolling upgrade could be coordinated so that nodes are upgraded as they become free and are not available for scheduling until the update is complete. This would ensure that new jobs started after the rolling update begins always use the newest version. That might be difficult to enforce in practice, though.
Or, if we can get dynamic property management working, perhaps the broker version could be a property assigned to a node, and on systems where rolling upgrades are enabled, the scheduler could (somehow) ensure jobs are only assigned a set of brokers with matching versions. (Note: the current property constraint matching can't accomplish this, unfortunately.)
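As a purely illustrative sketch of what that scheduler-side rule would amount to (the property name and this interface are hypothetical, not an existing flux-core or Fluxion feature): partition free nodes by a broker-version property and fill a job only from a single partition.

```python
from collections import defaultdict

def version_consistent_candidates(free_nodes, nnodes):
    """Group free nodes by a per-node 'broker version' property and keep only
    groups that can satisfy the requested node count on their own."""
    by_version = defaultdict(list)
    for node, version in free_nodes.items():
        by_version[version].append(node)
    return {v: nodes for v, nodes in by_version.items() if len(nodes) >= nnodes}

free = {"node1": "0.61.2", "node2": "0.61.3", "node3": "0.61.3", "node4": "0.61.2"}
print(version_consistent_candidates(free, 2))
# {'0.61.2': ['node1', 'node4'], '0.61.3': ['node2', 'node3']}
```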
Problem: it's not possible to have compute nodes reboot between jobs and update to a new Flux version because flux broker versions must match exactly in a given Flux instance.
On a machine the size of El Capitan, rebooting everything at once stresses other parts of the system, so allowing at least consecutive releases to interoperate, e.g. for now MAJOR.MINOR.PATCH and MAJOR.MINOR.(PATCH+1), could potentially reduce the length of system downtime.
It's easy enough to relax the checks that occur during broker wireup. It will be a little more challenging to figure out how to check this in CI.
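A minimal sketch of the relaxed rule as stated above (Python for illustration only; the actual check happens during broker wireup inside flux-core): treat two brokers as compatible when MAJOR and MINOR match and PATCH differs by at most one.

```python
def parse_version(v):
    """Parse a MAJOR.MINOR.PATCH string, ignoring any pre-release suffix."""
    major, minor, patch = v.split("-")[0].split(".")[:3]
    return int(major), int(minor), int(patch)

def wireup_compatible(a, b):
    """Relaxed rule: same MAJOR.MINOR, PATCH may differ by at most one."""
    amaj, amin, apat = parse_version(a)
    bmaj, bmin, bpat = parse_version(b)
    return (amaj, amin) == (bmaj, bmin) and abs(apat - bpat) <= 1

assert wireup_compatible("0.61.2", "0.61.3")      # consecutive patch releases
assert not wireup_compatible("0.61.3", "0.62.0")  # different release series
```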
Related: