Verify that an MPIJob resource is created when current solution gets run. #13
Comments
Currently the job returns an "error from server: Forbidden". Pat commented that this looks like an access control issue. I will look into creating a new role binding, which will hopefully resolve it.
The access control issue that was blocking this work was resolved and documented in #17. Deploying an MPIJob now returns further errors. Here's a printout of the error messages:
riskpaths-launcher /bin/sh: 1: orted: not found
Summarizing the potential issues listed in the errors:
(1) Not finding required binaries or libraries. Verify that PATH and LD_LIBRARY_PATH contain entries for mpirun (and possibly other executables such as orted) and its link libraries.
(2) Lack of authority to execute commands on one or more pods. Check which users the commands are being run as, and check user and group permissions on the containers.
(3) Inability to write start-up files to /tmp in the container (when the worker pods are being initialized).
(4) Inability to create a connection back to mpirun. Please check network connectivity, including firewalls.
(5) Possible internal programming error that breaks some aspect of message routing. The suggested work-around is to supply the MCA parameter routed=direct to mpirun.
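Several of the checks above can be scripted and run inside the launcher or a worker container. The following is only a minimal sketch, assuming Python is available in the image; the helper names are hypothetical and not part of the project. It covers items (1), (2), (3) and (5): looking up `mpirun` and `orted` on PATH, printing the effective user and group, testing that /tmp is writable, and building an `mpirun` argument list with `--mca routed direct`.

```python
import os
import shutil
import tempfile


def check_binaries(names=("mpirun", "orted")):
    """Map each executable name to its resolved path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


def check_tmp_writable(directory="/tmp"):
    """Return True if a temporary file can be created in the directory."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def with_direct_routing(command):
    """Return a copy of an mpirun argument list with '--mca routed direct' added."""
    return [command[0], "--mca", "routed", "direct", *command[1:]]


if __name__ == "__main__":
    # Item (2): which user and group the commands run as.
    print("uid:", os.getuid(), "gid:", os.getgid())
    # Item (1): library path and required executables.
    print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    for name, path in check_binaries().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
    # Item (3): start-up files need a writable /tmp.
    print("/tmp writable:", check_tmp_writable())
```

For the "orted: not found" error in particular, a `None` result for `orted` on a worker pod would point to item (1). Network connectivity, item (4), is not covered by this sketch.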
Summary of the latest effort to get OpenM++ models running as MPIJobs, with logs and results available in the UI afterwards: (1) Added a label entry (to obtain access to blob storage) to the worker pod specification in the template used for creating MPIJob manifests.
Three changes were made to the openmpp web service start-up script. These changes are all in the kubeflow-containers repo, with a pending pull request: StatCan/aaw-kubeflow-containers#553
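To illustrate the label entry added to the worker pod specification, here is a rough sketch of how a manifest template might attach labels to worker pods only. It is not the actual template from the repo. The label key, image name, container name, and function names are all hypothetical, and the `kubeflow.org/v1` API version is an assumption that depends on the installed MPI operator.

```python
def pod_template(image, labels=None):
    """Build a minimal pod template; labels are attached only when given."""
    template = {"spec": {"containers": [{"name": "model", "image": image}]}}
    if labels:
        template["metadata"] = {"labels": dict(labels)}
    return template


def build_mpijob(name, image, command, workers, worker_labels):
    """Build an MPIJob manifest as a dict, with labels on the worker pods."""
    launcher = pod_template(image)
    launcher["spec"]["containers"][0]["command"] = list(command)
    return {
        "apiVersion": "kubeflow.org/v1",  # assumption: depends on operator version
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "mpiReplicaSpecs": {
                "Launcher": {"replicas": 1, "template": launcher},
                "Worker": {
                    "replicas": workers,
                    "template": pod_template(image, worker_labels),
                },
            }
        },
    }


manifest = build_mpijob(
    name="riskpaths",
    image="example.invalid/openmpp-model:latest",  # hypothetical image
    command=["mpirun", "-np", "2", "./model"],     # hypothetical command
    workers=2,
    worker_labels={"example.invalid/blob-access": "true"},  # hypothetical label
)
```

The resulting dict can be serialized to JSON or YAML before being applied to the cluster.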
We enabled running openmpp models compiled for MPI execution using Kubeflow's MPIJob functionality.
The solution appears to parse model run requests and create valid manifests.
It seems to be applying the manifests, but I don't yet know how to confirm that this is actually happening.
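One possible way to check whether an MPIJob resource was actually created is to list the MPIJob resources in the namespace after a model run and look for the expected name. The sketch below is only an illustration. It assumes `kubectl` is on PATH, that the MPIJob custom resource definition is installed so that `kubectl get mpijobs` resolves, and that the caller has permission to list these resources. The function names, namespace, and job name are hypothetical.

```python
import json
import subprocess


def fetch_mpijobs_json(namespace):
    """Return the JSON output of 'kubectl get mpijobs' for a namespace."""
    result = subprocess.run(
        ["kubectl", "get", "mpijobs", "-n", namespace, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if kubectl fails (e.g. Forbidden)
    )
    return result.stdout


def mpijob_names(json_text):
    """Extract resource names from a kubectl JSON list."""
    items = json.loads(json_text).get("items", [])
    return [item["metadata"]["name"] for item in items]


def mpijob_exists(json_text, name):
    """Return True if an MPIJob with the given name appears in the list."""
    return name in mpijob_names(json_text)


if __name__ == "__main__":
    # Hypothetical namespace and job name; replace with real values.
    listing = fetch_mpijobs_json("my-namespace")
    print("MPIJobs found:", mpijob_names(listing))
    print("riskpaths present:", mpijob_exists(listing, "riskpaths"))
```

Parsing is kept separate from the `kubectl` call so that the name check can be exercised against saved output without cluster access.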