avoid scheduling jobs on compute nodes that are not cleaned up #6616
base: master
Conversation
Problem: a comment has an extra "to" that makes the sentence incorrect. Drop the extra word.
Problem: the timer used by sdbus_connect() is hard to modify because of the embedded error handling. Extract a function for building the user bus path for the error log. Now the timer is a bit simpler.
Problem: the sdbus module is hardwired to connect to a systemd user instance, but Flux now has "work" running in the systemd system instance as well (prolog, epilog, housekeeping). Add a "system" module option which directs sdbus to connect to the systemd system instance instead. Future commits will allow a second instance of the sdbus module to be loaded with this option so access to both systemd instances can be handled concurrently.
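For illustration only, a hedged sketch of what using the new option might look like (the option name "system" comes from this commit; passing it as a free argument on the module load line is an assumption):

  flux module load sdbus system   # connect to the systemd system instance instead of the user instance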
Problem: the sdbus system option has no coverage. Amend the 2407-sdbus.t test with a simple test of "system mode".
Problem: the sdbus module can only be loaded once because it uses an explicit service name. Drop the outdated MOD_NAME() symbol declaration. Register methods in a way that lets the default service name change. Update the self-contacting "subscribe" composite RPC to determine the topic string to contact programmatically. Now the module can be loaded as many times as we like using e.g. flux module load --name NEWNAME sdbus
Problem: there are no tests for loading sdbus under a different name. Modify the system test to load sdbus under the name "sdbus-sys" in system mode instead of reloading the module. Show that it works for listing units in the system instance.
Problem: when the system is configured to use systemd, sdbus is only loaded for the systemd user instance. Load sdbus-sys as well.
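Putting these pieces together, a rough sketch of how both instances might be loaded by hand (the module and option names are taken from the commits above; the exact command forms are assumptions, not the rc script contents):

  flux module load sdbus                          # talks to the systemd user instance
  flux module load --name sdbus-sys sdbus system  # talks to the systemd system instance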
Problem: some libsdexec RPCs can now be directed to different services to reach the systemd system or user instance. Add a service parameter to the following functions:
- sdexec_list_units()
- sdexec_property_get()
- sdexec_property_get_all()
- sdexec_property_changed()
Update sdexec. Update tests.
Wow, nice! I don't have any qualms with using an sdmon.idle broker group for the current implementation. It seems like eventually we'd want various subsystems to be able to recapture their state from what sdmon has found (for example, the job execution system could reconnect to running jobs after restart, or terminate jobs that are not supposed to be running, and the job manager could do something similar for prolog/epilog and housekeeping). Any thoughts on how that might work? I realize bringing that up is a bit premature, but it could inform the solution here as a stepping stone. (I guess one thought is that as state is able to be recaptured, this would reduce the list of things that prevent a broker from joining.) Also, since the
Maybe
Problem: there is no mechanism to track systemd units across a broker restart.

Add a broker module that creates and maintains a list of running flux systemd units. This monitors two instances of systemd:
- the user one, running as user flux (where jobs are run)
- the system one (where housekeeping, prolog, epilog run)

A list of units matching flux unit globs is requested at initialization, and a subscription to property updates on those globs is obtained. After the initial list, monitoring is driven solely by property updates.

Join the sdmon.online broker group once the node is demonstrably idle. This lets the resource module on rank 0 notice compute nodes that need cleanup at restart and withhold them from the scheduler. Once the group is joined, sdmon does not explicitly leave it. It implicitly leaves the group if sdmon is unloaded or the node goes offline/lost.

If there are running units at startup, log this information at LOG_ERR level, and again when the units are cleaned up, e.g.

  [email protected] needs cleanup - resources are offline
  cleanup complete - resources are online

In the future, this module's role could be expanded to support tooling for listing running work and obtaining runtime information such as pids and cgroup resource parameters. It could also play a role in informing other flux components about work that should be re-attached after a full or partial restart, when support for that is added.
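As a rough illustration (these are not commands from this PR, just standard flux commands one might use to observe the behavior described above): sdmon's LOG_ERR messages end up in the broker log, and a node that still needs cleanup is withheld from scheduling:

  flux dmesg | grep sdmon    # look for the "needs cleanup" / "cleanup complete" messages
  flux resource status       # the affected node remains offline until cleanup completes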
Problem: the sdmon module is not loaded by default. Load it if systemd.enable = true in the configuration.
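For reference, a minimal sketch of the configuration fragment this refers to, assuming only the [systemd] table and enable key named in this and the following commit:

  [systemd]
  enable = true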
Problem: the monitor subsystem of the resource module needs to know whether the "sdmon.online" broker group will be populated. Parse the enable key from [systemd]. Pass the whole resource_config struct to the monitor subsystem instead of just monitor_force_up.
Problem: nodes are not checked for untracked running work when a Flux instance starts up. This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system
- the broker exits without proper shutdown

When systemd is enabled, the new sdmon module joins the 'sdmon.online' broker group on startup. However, if there are any running flux units, this is delayed until those units are no longer running. Change the resource module so that it monitors sdmon.online instead of broker.online when systemd is enabled. This will withhold "busy" nodes from the scheduler until they become idle.

Fixes flux-framework#6590
Problem: there is no test coverage for the sdmon module. Add a new sharness script.
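A toy sketch of the general sharness shape such a script takes (names and commands here are illustrative assumptions, not the contents of the actual test; in practice sdmon also needs sdbus loaded and a running systemd, which this sketch glosses over):

  test_description='illustrative sdmon test shape'
  . $(dirname $0)/sharness.sh
  test_under_flux 1
  test_expect_success 'sdmon can be loaded and listed' '
      flux module load sdmon &&
      flux module list | grep sdmon
  '
  test_done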
Renamed the group and fixed a spelling error in a test caught in CI. This still needs a test for the resource module portion of the proposed change, so I'll leave it WIP for the moment.
Problem: there is no test coverage for the resource module's behavior when systemd is configured and sdmon is providing sdmon.online. Add a sharness script for that.
I added the missing test, so I'll drop the WIP. One thing I should do before we merge this though is make sure the systemd shipped with RHEL 8 allows sdbus to authenticate to it with flux credentials. I'll try to test that on fluke.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6616 +/- ##
==========================================
+ Coverage 79.50% 79.52% +0.02%
==========================================
Files 531 532 +1
Lines 88363 88597 +234
==========================================
+ Hits 70251 70456 +205
- Misses 18112 18141 +29
Yep, that worked.
Right, I like that way of thinking about it. Hmm, we should also be trying to capture the state of any units that have completed but weren't reaped, and put that in a lost+found or something. I need to refresh my memory on what happens to that state for the cases we're discussing here (the templated system units and transient user units). That could be a follow-on PR. But anyway, yeah, if a running unit can be reclaimed, we could let the node join
Problem: after a broker restart, "stuck" housekeeping, epilog, prolog, or job shell tasks might still be running, but flux is unaware and new work may be scheduled there even though there might be a problem, or those tasks might be holding on to resources.
When those things are run under systemd, we have the machinery for finding them and tracking them readily at hand.
This PR does the following:
- refactors sdbus so it can be loaded twice, once for user and once for system systemd instances
- adds an sdmon monitoring module that tracks running flux units on both busses
- has sdmon join a sdmon.idle broker group at startup, after it has verified that no units are running
- has the resource module monitor sdmon.idle instead of broker.online when configured to use systemd

The net effect is that nodes that require cleanup remain offline until the node is cleaned up.
sdmon also logs the systemd units it has found. Here's an example where I kill -9 a broker while housekeeping is running, then start it back up again.

Before cleanup is complete, flux resource status reports the node as offline.

Seems like this does the bare minimum to resolve #6590.
This does seem a bit thin in the tooling department. The possibilities are pretty broad, so for now I wanted to get this posted and get feedback on the way the resource module is tied into sdmon using broker groups.