-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added some support for tracking allocations #1904
Conversation
Overall, this is impressive! Many thanks for doing it! I see a couple of places where things need some adjusting, but the overall approach is sound. I'm working on job cleanup right now as there is a problem with the current code being a little overly aggressive and removing the contact files and session directories of other tools sharing the node (e.g., the Committing that work will generate some conflicts here, but nothing too horrible. I'll pick this branch up and resolve those, plus address the adjustments, and then push it back up here. I believe we do have an attribute to indicate that a job should persist beyond its parent:
Although originally intended for a tool calling spawn, you can see that it was extended to apply to anyone calling spawn. I thought I had included support for that in PRRTE, but I'd have to look. |
I am working my way thru this change and found a few things that need addressing. In order to figure it all out, I've started adding documentation describing the structures being added/modified and how they are used, along with a "flow of control" for the entire scheduler interaction sequence. I'd like to push the changes up here, but I had to rebase the branch to bring it up to date. I'll try to force push it and see if I can get the branch updated. |
Signed-off-by: HawkmoonEternal <[email protected]>
Signed-off-by: HawkmoonEternal <[email protected]>
79aaf80
to
808170c
Compare
Bring the branch up to date with PRRTE master. Add documentation on scheduler integration, including added and modified structures plus flow-of-control for scheduler operations. Fix a few places that had some errors. Fill in where some session identification logic was missing. Signed-off-by: Ralph Castain <[email protected]>
def43ed
to
3dd9b1d
Compare
Release of prte_server_req_t does not free the info array unless the "copy" flag has been set to true. Signed-off-by: Ralph Castain <[email protected]>
3dd9b1d
to
35f3d56
Compare
this PR appears to have broken singleton support, at least in some cases. being investigated. |
@hppritcha I suspect you need something like the following: diff --git a/src/tools/prte/prte.c b/src/tools/prte/prte.c
index 2bb9f571d5..ce69cc00ee 100644
--- a/src/tools/prte/prte.c
+++ b/src/tools/prte/prte.c
@@ -1401,6 +1401,7 @@ static int prep_singleton(const char *name)
jdata = PMIX_NEW(prte_job_t);
PMIX_LOAD_NSPACE(jdata->nspace, ptr);
free(ptr);
+ jdata->session = prte_default_session;
rc = prte_set_job_data_object(jdata);
if (PRTE_SUCCESS != rc) { |
yep and there's more! |
Based on the discussions with @rhc54 , @sonjahapp, and Isaias in the Dynamic Workflows WG and email threads I started implementing some support for tracking allocations in PRRTE. The required changes turned out to be more extensive than I expected and more work will be required also for Session_control. I'm a bit tied up with other work in the coming weeks so I thought I'd just share what I have so far.
As Ralph suggested I created a prte_session_t struct which tracks the nodes, jobs, and child sessions associated with a session.
So far I have added:
Some important assumptions:
I did some testing with the following scripts:
https://github.com/HawkmoonEternal/allocation_tracking_test
The allocation tracking seems to work for these test cases but I think some more extensive testing is required.
I also noticed that the default behavior of PRRTE is to abort all child jobs if the parent job is completed. I was wondering if we should maybe introduce an attribute in the PMIx_Spawn call to indicate that a job should persist. I could imagine situations such as postprocessing, insitu tasks etc. where the child jobs are expected to outlive the parent job.
Looking forward to your feedback and discussing this in the next Working Group meeting.