
RFC 28: add shrink key to resource acquisition response #447

Merged 2 commits from rfc28-shrink into flux-framework:master on Feb 18, 2025

Conversation

grondo (Contributor) commented Feb 18, 2025

This PR adds the shrink key to RFC 28 as suggested in flux-framework/flux-core#6641. This key contains an idset of execution targets that have been removed from the instance and so should no longer be considered for scheduling.
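A rough, non-normative sketch of what a decoded acquisition response carrying the new key might look like (the neighboring key and the idset values are illustrative assumptions, not quotes from the RFC):

```python
# Illustrative only: a decoded resource acquisition response payload
# after this change.  The "down" key already exists in RFC 28; the
# specific target ids here are made up for the example.
response = {
    "down": "4-5",     # execution targets that are currently down
    "shrink": "6-7",   # targets removed from the instance for good;
                       # drop them from scheduling entirely
}
```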

Problem: An example in RFC 28 contains invalid JSON.

Fix the invalid JSON.
grondo (Contributor, Author) commented Feb 18, 2025

One thought here: I wonder if, for better backwards compatibility, it should be OK for an execution target to appear in both the down and shrink idsets, with shrink taking precedence when it is supported. This would allow a scheduler that doesn't yet support shrink to continue working with the updated protocol (since, in this PR at least, the protocol version hasn't been updated).

garlick (Member) commented Feb 18, 2025

This would allow a scheduler that doesn't yet support shrink to continue to work with the updated protocol

Great idea!

grondo (Contributor, Author) commented Feb 18, 2025

Done. I've added:

For backwards compatibility, targets in the shrink key SHALL also appear in the down key of the same response. If a scheduler supports shrink then the shrink key SHALL take precedence.
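A minimal sketch of that precedence rule, assuming a decoded response payload like the one shown earlier and hypothetical scheduler methods (`remove`, `mark_down`):

```python
def decode_idset(s):
    """Hypothetical helper: expand a simple idset string such as
    "4-7" or "2,4-5" into a set of integer execution targets."""
    targets = set()
    if not s:
        return targets
    for part in s.split(","):
        lo, _, hi = part.partition("-")
        targets.update(range(int(lo), int(hi or lo) + 1))
    return targets


def handle_acquire_update(payload, scheduler):
    """Sketch of the backwards-compatibility rule quoted above.
    Targets in "shrink" also appear in "down", so a scheduler that
    ignores "shrink" still marks them down and keeps working; one
    that understands "shrink" lets it take precedence.  The
    scheduler methods are assumed, not part of any real API."""
    down = decode_idset(payload.get("down", ""))
    shrink = decode_idset(payload.get("shrink", ""))

    for target in shrink:           # shrink takes precedence:
        scheduler.remove(target)    # gone for good, never schedule again
    for target in down - shrink:    # the rest are merely offline
        scheduler.mark_down(target)
```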

garlick (Member) left a comment

LGTM, thanks! One comment below - feel free to ignore.

spec_28.rst (outdated), comment on lines 180 to 183:
If removed (``shrink``) or down resources are assigned to a job, the
scheduler SHALL NOT raise an exception on the job. The execution system
takes the active role in handling failures in this case. Eventually the
scheduler will receive a ``sched.free`` request for the offline resources.
garlick (Member) commented:

Perhaps we should clarify that we'll send a sched.free for resources even when they have been removed? This reads as a bit ambiguous to me in that regard, but it should be required for backwards compatibility.

grondo (Contributor, Author) commented:

Agree. Would this be sufficient:

Suggested change:

Before:
If removed (``shrink``) or down resources are assigned to a job, the scheduler SHALL NOT raise an exception on the job. The execution system takes the active role in handling failures in this case. Eventually the scheduler will receive a ``sched.free`` request for the offline resources.

After:
If removed (``shrink``) or down resources are assigned to a job, the scheduler SHALL NOT raise an exception on the job. The execution system takes the active role in handling failures in this case. Eventually the scheduler will receive a ``sched.free`` request for the offline or removed resources.
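To illustrate the intent of that wording (all names here are hypothetical, not taken from the RFC): the free request still arrives for removed resources, so a scheduler completes the release as usual and simply never returns removed targets to its pool.

```python
def handle_free(freed_targets, removed_targets, free_pool):
    """Hypothetical sketch: sched.free is still delivered for
    resources removed via "shrink", so the release always completes;
    only targets that still exist rejoin the schedulable pool."""
    for target in freed_targets:
        if target not in removed_targets:
            free_pool.add(target)
    # Removed targets are simply never offered to future jobs.


# Example: targets 6-7 were shrunk away, so only 4-5 come back.
pool = set()
handle_free({4, 5, 6, 7}, removed_targets={6, 7}, free_pool=pool)
assert pool == {4, 5}
```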

grondo (Contributor, Author) commented:

I made this change and will set MWP. Thanks!

Problem: There is no way for Flux to notify a scheduler that resources that have gone down are not coming back.

Add a `shrink` key to the RFC 28 resource acquisition response which tells the scheduler to remove a set of resources, identified by execution target, from consideration for scheduling.
mergify bot merged commit 7ef4e74 into flux-framework:master on Feb 18, 2025 (7 checks passed).
grondo deleted the rfc28-shrink branch on February 21, 2025 at 04:20.