support `shrink` key added to the RFC 28 scheduler acquisition protocol #1344
A WIP PR is now posted to flux-core (flux-framework/flux-core#6652) which adds support to the resource acquisition protocol for the `shrink` key. Before moving on to testing and merging that PR before the next release, I'd like to know if it is possible to get similar functionality in Fluxion quickly. I'm willing to poke at it if someone could give me a pointer on where to start looking. The functionality we'll need is to remove, or otherwise mark "permanently unavailable", the execution targets (ranks) provided in the `shrink` key.
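For context, the `shrink` value is an RFC 22 idset string (e.g. "3" or "2-4,7"). Below is a minimal sketch of turning such a string into a set of ranks with flux-core's libidset, roughly what a Fluxion-side helper would need to do; the helper name and error handling are illustrative, not taken from the PR that follows.

```cpp
#include <flux/idset.h>
#include <cstdint>
#include <set>

// Illustrative helper: decode an RFC 22 idset string such as "2-4,7" into a
// set of integer execution-target ranks.  Returns -1 on parse failure.
static int decode_rankset_example (const char *ids, std::set<int64_t> &ranks)
{
    struct idset *idset = idset_decode (ids);
    if (!idset)
        return -1;
    for (unsigned int id = idset_first (idset); id != IDSET_INVALID_ID;
         id = idset_next (idset, id))
        ranks.insert (static_cast<int64_t> (id));
    idset_destroy (idset);
    return 0;
}
```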
This is my first stab at support for the `shrink` key:

```diff
diff --git a/resource/modules/resource_match.cpp b/resource/modules/resource_match.cpp
index 9edda2d8..5e607b1e 100644
--- a/resource/modules/resource_match.cpp
+++ b/resource/modules/resource_match.cpp
@@ -1258,23 +1258,54 @@ done:
return rc;
}
+static int shrink_resources (std::shared_ptr<resource_ctx_t> &ctx,
+ const char *ids)
+{
+ int rc = -1;
+ std::set<int64_t> ranks;
+
+ if (!ids) {
+ errno = EINVAL;
+ goto done;
+ }
+ if ((rc = decode_rankset (ctx, ids, ranks)) < 0)
+ goto done;
+ if ((rc = ctx->traverser->shrink (ranks))) {
+ flux_log (ctx->h,
+ LOG_ERR,
+ "shrink %s failed: %s",
+ ids,
+ ctx->traverser->err_message ().c_str ());
+ goto done;
+ }
+ flux_log (ctx->h,
+ LOG_DEBUG,
+ "removed ranks %s from resource set",
+ ids);
+done:
+ return rc;
+}
+
static void update_resource (flux_future_t *f, void *arg)
{
int rc = -1;
const char *up = NULL;
const char *down = NULL;
+ const char *shrink = NULL;
double expiration = -1.;
json_t *resources = NULL;
std::shared_ptr<resource_ctx_t> ctx = getctx ((flux_t *)arg);
if ((rc = flux_rpc_get_unpack (f,
- "{s?:o s?:s s?:s s?:F}",
+ "{s?:o s?:s s?:s s?:s s?:F}",
"resources",
&resources,
"up",
&up,
"down",
&down,
+ "shrink",
+ &shrink,
"expiration",
&expiration))
< 0) {
@@ -1286,6 +1317,8 @@ static void update_resource (flux_future_t *f, void *arg)
flux_log_error (ctx->h, "%s: update_resource_db", __FUNCTION__);
goto done;
}
+ if (shrink && shrink_resources (ctx, shrink) < 0)
+ goto done;
if (expiration >= 0.) {
/* Update graph duration:
*/
diff --git a/resource/traversers/dfu.cpp b/resource/traversers/dfu.cpp
index 9241a3f1..8a60f1f4 100644
--- a/resource/traversers/dfu.cpp
+++ b/resource/traversers/dfu.cpp
@@ -509,6 +509,12 @@ int dfu_traverser_t::mark (std::set<int64_t> &ranks, resource_pool_t::status_t s
return detail::dfu_impl_t::mark (ranks, status);
}
+int dfu_traverser_t::shrink (std::set<int64_t> &ranks)
+{
+ clear_err_message ();
+ return detail::dfu_impl_t::shrink (ranks);
+}
+
/*
* vi:tabstop=4 shiftwidth=4 expandtab
*/
diff --git a/resource/traversers/dfu.hpp b/resource/traversers/dfu.hpp
index 1db95343..4cca16b0 100644
--- a/resource/traversers/dfu.hpp
+++ b/resource/traversers/dfu.hpp
@@ -199,6 +199,9 @@ class dfu_traverser_t : protected detail::dfu_impl_t {
*/
int mark (std::set<int64_t> &ranks, resource_pool_t::status_t status);
+
+ int shrink (std::set<int64_t> &ranks);
+
private:
int is_satisfiable (Jobspec::Jobspec &jobspec,
detail::jobmeta_t &meta,
diff --git a/resource/traversers/dfu_impl.hpp b/resource/traversers/dfu_impl.hpp
index 8369fa09..aff94e8e 100644
--- a/resource/traversers/dfu_impl.hpp
+++ b/resource/traversers/dfu_impl.hpp
@@ -324,6 +324,8 @@ class dfu_impl_t {
*/
int mark (std::set<int64_t> &ranks, resource_pool_t::status_t status);
+ int shrink (std::set<int64_t> &ranks);
+
private:
/************************************************************************
* *
diff --git a/resource/traversers/dfu_impl_update.cpp b/resource/traversers/dfu_impl_update.cpp
index 9cbf9466..5bffb023 100644
--- a/resource/traversers/dfu_impl_update.cpp
+++ b/resource/traversers/dfu_impl_update.cpp
@@ -15,6 +15,7 @@ extern "C" {
}
#include "resource/traversers/dfu_impl.hpp"
+#include <readers/resource_reader_factory.hpp>
using namespace Flux::Jobspec;
using namespace Flux::resource_model;
@@ -908,6 +909,47 @@ int dfu_impl_t::mark (std::set<int64_t> &ranks, resource_pool_t::status_t status
return 0;
}
+int dfu_impl_t::shrink (std::set<int64_t> &ranks)
+{
+ std::shared_ptr<resource_reader_base_t> rd;
+ if ((rd = create_resource_reader ("jgf")) == nullptr)
+ return -1;
+
+ try {
+ std::map<int64_t, std::vector<vtx_t>>::iterator vit;
+ std::string subtree_path = "", tmp_path = "";
+ subsystem_t dom = m_match->dom_subsystem ();
+ vtx_t subtree_root;
+
+ int total = 0;
+ for (auto &rank : ranks) {
+ // Now iterate through subgraphs keyed by rank and
+ // set status appropriately
+ vit = m_graph_db->metadata.by_rank.find (rank);
+ if (vit == m_graph_db->metadata.by_rank.end ())
+ continue;
+
+ subtree_root = vit->second.front ();
+ subtree_path = (*m_graph)[subtree_root].paths.at (dom);
+ for (vtx_t v : vit->second) {
+ // The shortest path string is the subtree root.
+ tmp_path = (*m_graph)[v].paths.at (dom);
+ if (tmp_path.length () < subtree_path.length ()) {
+ subtree_path = tmp_path;
+ subtree_root = v;
+ }
+ }
+ rd->remove_subgraph (*m_graph, m_graph_db->metadata, subtree_path);
+ // TODO reinit traverser?
+ ++total;
+ }
+ } catch (std::out_of_range &) {
+ errno = ENOENT;
+ return -1;
+ }
+ return 0;
+}
+
/*
* vi:tabstop=4 shiftwidth=4 expandtab
*/
```
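To make the subtree-root selection in the patch above concrete: for a failed rank, `by_rank` holds every vertex belonging to that rank (node, sockets, cores, ...). If the dominant-subsystem paths were, say, `/cluster0/node3`, `/cluster0/node3/socket0`, and `/cluster0/node3/socket0/core0`, the shortest path (`/cluster0/node3`) identifies the rank's topmost vertex, so `remove_subgraph ()` drops the whole node in one call. The path names here only illustrate the usual Fluxion resource-path layout; they are not output from the patch.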
Discussed in the fluxion meeting: make a new status for nodes that are never going to become usable, and prune on it.
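As a sketch of what that might look like: the existing `UP`/`DOWN` values are Fluxion's, `LOST` is the proposed addition, and the enum shown here is abridged rather than copied from the source.

```cpp
// Abridged sketch of the resource vertex status enum with the proposed value.
// Only the LOST addition is new; the real definition's other details are elided.
enum class status_t : int {
    UP,
    DOWN,
    LOST,  // execution target has been shrunk away and will never return
};

// Ranks named in a "shrink" response could then reuse the existing mark path:
//   ctx->traverser->mark (ranks, resource_pool_t::status_t::LOST);
```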
Thanks @trws and @milroy: I added a `LOST` resource status as suggested. However, the bit in resource/traversers/dfu.cpp, lines 74 to 75 (at da6156a) — the early `return 0` in `request_feasible ()` when there is no constraint and `target_nodes <= nodes_up` — isn't doing what I expected.
This seems to be because of how `request_feasible` behaves here (details below). The test case I'm using is this simple script:

```sh
#!/bin/sh
flux module remove sched-simple
flux module load resource/modules/sched-fluxion-resource.so
flux module load qmanager/modules/sched-fluxion-qmanager.so
flux run -N4 hostname
flux overlay disconnect 3
flux run -N4 hostname
```

The script is run under a 4-broker test instance. Running it with my printf debugging enabled, the second job hangs instead of getting an exception. Here's the diff including the debugging:

```diff
diff --git a/resource/traversers/dfu.cpp b/resource/traversers/dfu.cpp
index 9241a3f1..25658478 100644
--- a/resource/traversers/dfu.cpp
+++ b/resource/traversers/dfu.cpp
@@ -71,6 +71,16 @@ int dfu_traverser_t::request_feasible (detail::jobmeta_t const &meta,
const bool checking_satisfiability =
op == match_op_t::MATCH_ALLOCATE_W_SATISFIABILITY || op == match_op_t::MATCH_SATISFIABILITY;
+ std::cerr << "request_feasible"
+ << " target_nodes=" << target_nodes
+ << " nodes_up=" << get_graph_db ()->metadata.nodes_up
+ << std::endl;
+
+ for (const auto& pair : dfv) {
+ std::cerr << "{" << pair.first << ": " << pair.second << "} ";
+ }
+ std::cout << std::endl;
+
if ((!meta.constraint) && (target_nodes <= get_graph_db ()->metadata.nodes_up))
return 0;
@@ -88,6 +98,8 @@ int dfu_traverser_t::request_feasible (detail::jobmeta_t const
```

I admit I'm a bit confused, and did not expect the behavior I'm seeing from `request_feasible`.
I did figure this part out.
Ok, this was my fault — somehow I missed modifying `dfu_impl_t::prune ()`:

```diff
diff --git a/resource/traversers/dfu_impl.cpp b/resource/traversers/dfu_impl.cpp
index 4103a2fb..5f6f69e8 100644
--- a/resource/traversers/dfu_impl.cpp
+++ b/resource/traversers/dfu_impl.cpp
@@ -213,6 +213,12 @@ int dfu_impl_t::prune (const jobmeta_t &meta,
const std::vector<Jobspec::Resource> &resources)
{
int rc = 0;
+
+ // Prune LOST resources
+ if ((*m_graph)[u].status == resource_pool_t::status_t::LOST) {
+ rc = -1;
+ goto done;
+ }
// Prune by the visiting resource vertex's availability
// If resource is not UP, no reason to descend further.
     if (meta.alloc_type != jobmeta_t::alloc_type_t::AT_SATISFIABILITY
```

I'm still a bit confused about the `request_feasible` behavior, though (more below).
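Presumably the new check sits above the existing availability check because that check is skipped for satisfiability queries (`AT_SATISFIABILITY`), so a merely DOWN node still counts toward satisfiability; a LOST node, by contrast, should prune the walk unconditionally, for allocation and satisfiability alike.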
Problem: RFC 28 specifies an optional `shrink` key in an RFC 28 resource acquisition protocol response, but this is not handled by Fluxion. Mark resources in any `shrink` key in a resource acquisition response as LOST. This treats the resources as permanently unavailable for scheduling without removing them from the resource graph. Fixes flux-framework#1344
Thanks @trws, once I propose a PR I'll open a separate issue on what I observed above.
If you want to remove the subgraph, you'll need to partial cancel the ranks in the `shrink` idset first. After partial cancel and shrink, you can update the total counts by reinitializing the traverser, e.g. `ctx->traverser->initialize ();`. You could also do the reinitialization in the traverser itself. Partial cancel currently only supports rv1exec and JGF, but it would be straightforward to add support for an idset. Let me know if that's needed and I'll get to work on it.
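Putting the two suggestions together, here is a hedged sketch (not the actual PR) of how the module-side helper from the first diff might evolve. It assumes the includes and `resource_ctx_t` context of resource_match.cpp; only `decode_rankset`, `shrink`, and `initialize` are taken from the code and comments above, and the partial-cancel step is left as a comment because its exact entry point depends on the reader format used.

```cpp
// Hedged sketch: shrink with a follow-up reinitialization so cached totals
// (e.g. nodes_up and the per-type counts consulted by request_feasible)
// reflect the smaller resource set.
static int shrink_resources (std::shared_ptr<resource_ctx_t> &ctx, const char *ids)
{
    std::set<int64_t> ranks;

    if (!ids) {
        errno = EINVAL;
        return -1;
    }
    if (decode_rankset (ctx, ids, ranks) < 0)
        return -1;
    // 1) partial cancel any allocations on these ranks (rv1exec/JGF today,
    //    possibly an idset-based variant later)
    // 2) remove the per-rank subgraphs, as in the traverser shrink () above
    if (ctx->traverser->shrink (ranks) < 0)
        return -1;
    // 3) reinitialize the traverser to refresh its aggregate counts
    if (ctx->traverser->initialize () < 0)
        return -1;
    return 0;
}
```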
Thanks @milroy! After the meeting I dropped the code above in favor of the suggested approach of a new vertex state (implemented in #1345). Let me know if you think we should lean towards subgraph removal instead. Addition of the LOST state was trivial, but it does seem like true subgraph removal would be superior. What will happen, though, when the job manager sends a free response which contains removed resources? (We do not shrink the resources assigned to jobs when a node fails, so when the job releases its resources, the release will include the resources corresponding to the removed ranks.)
target_nodes == 0 seems to be the case in my testing because the `dfv` aggregates that `request_feasible` inspects have no `node` entry.
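A self-contained illustration of that effect (names and numbers are hypothetical, not Fluxion code): with only a core aggregate tracked, a node lookup falls back to 0 and the `target_nodes <= nodes_up` feasibility test trivially passes.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main ()
{
    // Default-like configuration: per-type aggregates track cores, not nodes.
    std::map<std::string, int64_t> by_type = {{"core", 16}};  // no "node" entry
    int64_t target_nodes = by_type.count ("node") ? by_type.at ("node") : 0;
    int64_t nodes_up = 3;
    std::cout << "target_nodes=" << target_nodes
              << " feasible=" << (target_nodes <= nodes_up) << std::endl;
    return 0;
}
```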
Currently that would result in an error here: resource/traversers/dfu_impl_update.cpp, lines 821 to 824 (at da6156a).
It would be easy to change the behavior to ignore the missing ranks.
It sounds like it's working as designed. You can try setting the pruning filter configuration to be "ALL:core,ALL:node" and see if that fixes it.
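For reference, the pruning filter is the sched-fluxion-resource `prune-filters` setting, so this would be something along the lines of `flux module load sched-fluxion-resource prune-filters=ALL:core,ALL:node`, or the equivalent `prune-filters` key in the `[sched-fluxion-resource]` TOML table; treat the exact option spelling here as an assumption rather than something confirmed in this thread.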
Yes, that worked. For my own edification, can you explain why "ALL:core" is the default and not "ALL:core,ALL:node"?
That's a good question! After searching a bit, it looks like the "ALL:core" default was set in PR #401 for expediency. It may have been the intention to revisit the default settings. I don't see why we can't make the default "ALL:core,ALL:node". The disadvantage is increased memory footprint, but it's a small increase relative to the current default.
We probably should; there are certain things that don't work as expected otherwise. @milroy would you be willing to put together a PR for that?
Yeah, I can do that this afternoon.
Sounds like we want to at least try this approach. I can try to implement the partial cancel by idset, but if you have time to throw it together @milroy feel free! Thanks.
On second thought, we don't need a separate implementation. You could pack the idset into the format the existing rv1exec partial cancel already handles: resource/readers/resource_reader_rv1exec.cpp, lines 951 to 970 (at da6156a).
Not sure how elegant that is, but it will work.
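As a hedged sketch of that idea: the snippet below packs a rank idset into a minimal RV1-style R document with jansson, which the rv1exec partial-cancel path could then consume. The exact fields Fluxion's rv1exec reader requires are an assumption here; in particular the "children" core range is a placeholder and a "nodelist" entry may also be needed.

```cpp
#include <jansson.h>
#include <cstdlib>
#include <string>

// Build a minimal RV1-style R string whose R_lite covers only the given ranks.
// Returns an empty string on failure.
static std::string ranks_to_rv1 (const char *ranks /* e.g. "3" or "2-3" */)
{
    std::string out;
    json_t *o = json_pack ("{s:i s:{s:[{s:s s:{s:s}}] s:i s:f}}",
                           "version", 1,
                           "execution",
                           "R_lite",
                           "rank", ranks,
                           "children",
                           "core", "0-3",   // placeholder core range
                           "starttime", 0,
                           "expiration", 0.0);
    if (o) {
        char *s = json_dumps (o, JSON_COMPACT);
        if (s) {
            out = s;
            free (s);
        }
        json_decref (o);
    }
    return out;
}
```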
flux-framework/rfc#447 adds a new `shrink` key in the RFC 28 resource acquisition protocol. The key contains an idset of execution targets which have been removed from the instance and should no longer be considered for scheduling, feasibility, etc. This is one step in solving flux-framework/flux-core#6641, in which nodes lost during a resilient batch job are still considered schedulable at some point in the future, when in reality they will never come back online, so a job can block in the pending state indefinitely.

What would it take to support the `shrink` key in Fluxion @trws @milroy? If we can get this enabled by the next release that would be ideal.