PU-level scheduling in resource & differing behavior from sched-simple #624

Closed
SteVwonder opened this issue Mar 18, 2020 · 12 comments
@SteVwonder
Member

When running with sched-simple, it appears that specifying -c1 to flux mini runs the process on a PU:

❯ FLUX_QMANAGER_RC_NOOP=t FLUX_RESOURCE_RC_NOOP=t ./bin/flux start

❯ flux module list
Module                   Size Digest  Idle  S Service
userdb                1122616 E537E35    9  S 
aggregator            1141360 4319017    9  S 
cron                  1202976 AC1B9B5    0  S 
kvs                   1558376 D2EDB0A    0  S 
job-exec              1276224 93DC36A    9  S 
connector-local       1110920 A097C9C    0  R 
job-manager           1332792 8B529A1    9  S 
sched-simple          1241920 55B0BE9    9  S sched
kvs-watch             1299296 2D970AF    9  S 
barrier               1124544 C1742F5    9  S 
job-info              1357552 5B9B170    9  S 
job-ingest            1219136 4C12AA0    9  S 
content-sqlite        1130384 DFA6333    9  S content-backing

❯ flux hwloc info
1 Machine, 36 Cores, 72 PUs

❯ ~/Repositories/flux-framework/flux-core/t/ingest/submitbench -r 72 <(flux mini submit -c1 -n1 --dry-run sleep 100)
<snip>

❯ flux jobs -o '{name} {state}' | uniq -c
      1 NAME STATE
     72 sleep RUN

That does not appear to be the case with resource:

❯ flux start

❯ flux module list
Module                   Size Digest  Idle  S Service
kvs-watch             1299296 2D970AF    6  S 
job-manager           1332792 8B529A1    6  S 
aggregator            1141360 4319017   30  S 
kvs                   1558376 D2EDB0A    0  S 
content-sqlite        1130384 DFA6333    6  S content-backing
userdb                1122616 E537E35   30  S 
job-exec              1276224 93DC36A    6  S 
resource             18210776 3BF88C6    6  S 
cron                  1202976 AC1B9B5    0  S 
connector-local       1110920 A097C9C    0  R 
qmanager              1088552 73E97AA    6  S sched
job-ingest            1219136 4C12AA0   25  S 
barrier               1124544 C1742F5    0  S 
job-info              1357552 5B9B170    6  S 


❯ flux hwloc info
1 Machine, 36 Cores, 72 PUs

❯ ~/Repositories/flux-framework/flux-core/t/ingest/submitbench -r 72 <(flux mini submit -c1 -n1 --dry-run sleep 100)
<snip>

❯ flux jobs -o '{name} {state}' | uniq -c
      1 NAME STATE
     36 sleep SCHED
     36 sleep RUN

It seems wrong to me that the same jobspec behaves so differently under the two schedulers. Do we have a way in flux-sched to enable PU-level scheduling? I'm also wondering if this is something that needs to be handled at the flux-mini level. Ultimately, I'm not sure I have many intelligent thoughts on this right now, but I wanted to at least document it.

@dongahn
Member

dongahn commented Mar 18, 2020

It seems a bit counterintuitive that -c doesn't specify the number of cores but PUs.

Untested, but I think you can specialize resource to do PU-level scheduling if you load it with

export FLUX_RESOURCE_OPTIONS="load-whitelist=node,pu,gpu prune-filters=ALL:pu"

In theory this should populate only nodes, PUs, and GPUs into the resource graph store. The jobspec, however, would have to request "pu" rather than "core" as the resource. Presumably flux mini submit emits "core" yet sched-simple matches it against PUs?
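
A minimal sketch of how that might look (untested, as noted; it just applies the suggested options to a fresh test instance):

# Untested sketch: restrict the resource graph to node/pu/gpu and prune by PU.
❯ export FLUX_RESOURCE_OPTIONS="load-whitelist=node,pu,gpu prune-filters=ALL:pu"
❯ flux start
# A jobspec requesting "pu" (rather than "core") would then be needed for
# PU-level matching.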

@grondo
Contributor

grondo commented Mar 18, 2020

I think sched-simple just treats core as "pu" (an hwloc term I'm not sure we should adopt into our jobspec, but that is a different issue) for testing, since jobspec v1 doesn't provide any way to specify threads in addition to cores. It is convenient to be able to run with up to as many processes as the system reports CPUs.

i.e. there was no "sinister master plan" for sched-simple to behave differently. 😉

I agree eventually we should fix this, though maybe we wait until a jobspec v2?

@grondo
Contributor

grondo commented Mar 18, 2020

Though maybe resource.hwloc.by_rank should be enhanced to include the PU information. Right now it just stores the hwloc-reported "allowed cpuset" on each rank, then treats each bit as an available "cpu" (and assumes equivalency to a "core"). I agree this is the wrong approach long-term, so perhaps the JSON representation should soon be enhanced to include more topological information.
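
For reference, the by_rank data lives in the KVS (if I recall correctly), so what's stored today can be inspected with something like the following (exact key path and output format may vary by flux-core version):

❯ flux kvs get resource.hwloc.by_rank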

@dongahn
Member

dongahn commented Mar 18, 2020

I think sched-simple just treats core as "pu" (an hwloc term I'm not sure we should adopt into our jobspec, but that is a different issue)

I agree "pu" is probably a bad resource name. resource doesn't have to use this name; as long as we have agreement on the name used by the schedulers and the jobspec, it can map hwloc's "pu" onto that name. Maybe "ht" or "hw_thread"?

I agree eventually we should fix this, though maybe we wait until a jobspec v2?

Sure. We probably want to do this before the initial system instance deployment though.

@SteVwonder
Member Author

It is convenient to be able to run with up to as many processes as the system reports CPUs.

I agree! I think that is the behavior that I personally want 95% of the time.

i.e. there was no "sinister master plan" for sched-simple to behave differently. 😉

Darn! I was hoping for a juicy conspiracy here.....guess I'll just go back to spreading pandemic-related conspiracies with my free time.

I agree eventually we should fix this, though maybe we wait until a jobspec v2?

WIP Jobspec V2 is already up, so we can discuss the relevant jobspec-specifics there: flux-framework/rfc#229

Maybe "ht" or "hw_thread"?

"hw_thread" sounds good to me for the resource name. FWIW, it looks like python's argparse supports multi-character short args, so -ht would be a valid CLI option if we wanted:

❯ python3
Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import argparse
>>> parser = argparse.ArgumentParser()
>>> parser.add_argument('-ht', '--hardware-threads', type=int)
_StoreAction(option_strings=['-ht', '--hardware-threads'], dest='hardware_threads', nargs=None, const=None, default=None, type=<class 'int'>, choices=None, help=None, metavar=None)
>>> parser.print_help()
usage: [-h] [-ht HARDWARE_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -ht HARDWARE_THREADS, --hardware-threads HARDWARE_THREADS

@dongahn
Member

dongahn commented Mar 18, 2020

I agree "pu" is probably a bad resource name. resource doesn't have to use this name; as long as we have agreement on the name used by the schedulers and the jobspec, it can map hwloc's "pu" onto that name. Maybe "ht" or "hw_thread"?

BTW, if threads, ht, or hw_threads is used as a resource name, it would be good to add one more switch to the flux mini interface to specify the number of hardware threads per task. Initially I was thinking of -t or --nthreads, but realized -t is already taken for the wall time limit.

Looking at the srun option set, it introduces an extra option, --threads-per-core=<num hardware threads>, in addition to -c. Perhaps we can just use that, since a goal of our mini interface is to facilitate porting users' existing scripts.
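
For illustration only, a hypothetical invocation if flux mini adopted the srun-style spelling (flux mini does not have this option today):

# Hypothetical option mirroring srun --threads-per-core; not implemented:
❯ flux mini submit -n1 -c2 --threads-per-core=2 ./a.out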

@grondo
Contributor

grondo commented Mar 18, 2020

FWIW, it looks like python's argparse supports multi-character short args, so -ht would be a valid CLI option if we wanted:

Ack! It seems dangerous to support grouping and multi-character short args at the same time!

Could you support multiple long args --ht, --hw-threads-per-core=N instead?

@SteVwonder
Member Author

Ack! It seems dangerous to support grouping and multi-character short args at the same time!
Could you support multiple long args --ht, --hw-threads-per-core=N instead?

Good call! Yeah, that does seem possible:

❯ python3
Python 3.7.5 (default, Nov 20 2019, 09:21:52) 
[GCC 9.2.1 20191008] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import argparse
>>> parser = argparse.ArgumentParser()
>>> parser.add_argument('--ht', '--hardware-threads', type=int)
_StoreAction(option_strings=['--ht', '--hardware-threads'], dest='ht', nargs=None, const=None, default=None, type=<class 'int'>, choices=None, help=None, metavar=None)
>>> parser.print_help()
usage: [-h] [--ht HT]

optional arguments:
  -h, --help            show this help message and exit
  --ht HT, --hardware-threads HT
>>> parser.parse_args(['--ht', '5'])
Namespace(ht=5)

it would be good to add one more switch to the flux mini interface to specify the number of hardware threads per task

Opened a flux-core issue to track that: flux-framework/flux-core#2857

@grondo
Contributor

grondo commented Jun 3, 2020

Coming back around to this due to a recent issue in flux-core: flux-framework/flux-core#2968.

This became a real problem once cpu affinity support was added to the shell, at which point treating PUs as cores became an error. Unfortunately, I had forgotten that sched-simple operated this way, which ultimately caused the bugs.

I'll fix the sched-simple issue (actually a flux hwloc reload issue), and possibly add a way to fall back to the old behavior for testing. Just wanted to add a note here, since the sched-simple behavior described in this issue has changed.

@dongahn
Member

dongahn commented Jun 3, 2020

Thanks @grondo.

BTW, will there be a case where users want to specialize their scheduling and schedule at the PU level (hyperthreading)? Just FYI, if this specialization is needed, I think we can do it at the Fluxion level (by setting the hwloc whitelist to include the PU resource type). However, we need to understand how the R information is used at the job shell level... (we may still hit the same issue you are working on, though).

The job shell would understand a resource type named PU, right? What API are you using for cpu affinity?

@grondo
Contributor

grondo commented Jun 3, 2020

The job shell would understand a resource type named PU, right? What API are you using for cpu affinity?

I think we'll need R v2 for this to work. Currently R only contains a "core" id list for each rank's children, and the job shell does not know about a resource type "PU", since only "core" and "gpu" are allowed types in the RFC 20 execution section.
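
For context, an abridged and purely illustrative R (version 1) document per RFC 20; the rank children carry only "core" (and optionally "gpu") id sets, with nowhere to express PUs:

{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0-35"}}]}}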

The shell affinity plugin currently uses hwloc for cpu affinity. It walks the current topology, takes the union cpuset of all assigned cores, and sets its own affinity to that set. If per-task affinity is requested, it then uses hwloc_distrib to distribute individual tasks across all available PUs.
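
A rough command-line analogue of that flow using hwloc's own utilities (a sketch only, not the plugin's actual code; core ids are illustrative):

# Union cpuset of the assigned cores (say cores 0 and 1):
❯ hwloc-calc core:0 core:1
# Per-task case: distribute 4 tasks, one PU each (the plugin restricts this to
# the assigned cpuset rather than the whole machine):
❯ hwloc-distrib --single 4
# Bind a command to the union cpuset:
❯ hwloc-bind core:0 core:1 -- ./a.out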

@dongahn
Member

dongahn commented Mar 10, 2022

I think flux-core also treats core as core, not as PU, so I will close this. If this needs to be reopened, please let me know.

@dongahn dongahn closed this as completed Mar 10, 2022