Verify that an MPIJob resource is created when current solution gets run. #13

jacek-dudek · 2023-10-16T16:57:39Z

The solution appears to parse model run requests and create valid manifests.
It also seems to be applying the manifests, but I don't yet know how to confirm that an MPIJob resource is actually being created.
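One way to confirm this might be to ask the cluster directly for MPIJob resources after a model run is submitted. The following is a minimal sketch, assuming `kubectl` is on the PATH, the Kubeflow MPIJob custom resource definition is installed, and the caller is allowed to list MPIJobs in the namespace. The function names and the sample job name are hypothetical.

```python
import json
import subprocess


def mpijob_names(kubectl_json: str) -> list[str]:
    """Return the names of all MPIJob objects found in `kubectl get -o json` output."""
    items = json.loads(kubectl_json).get("items", [])
    return [item["metadata"]["name"] for item in items if item.get("kind") == "MPIJob"]


def list_mpijobs(namespace: str) -> list[str]:
    """Query the cluster for MPIJob resources in one namespace.

    Raises subprocess.CalledProcessError if kubectl fails, for example
    when the request is rejected by access control.
    """
    result = subprocess.run(
        ["kubectl", "get", "mpijobs", "-n", namespace, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return mpijob_names(result.stdout)
```

Calling `list_mpijobs("<namespace>")` before and after submitting a model run should show whether a new MPIJob appeared.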

jacek-dudek · 2023-10-18T14:25:14Z

Currently the job returns an "error from server: Forbidden". Pat commented that this looks like an access control issue. I will look into creating a new role binding, which will hopefully resolve it.
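For illustration, a role and role binding that allow a service account to manage MPIJobs could look roughly like the sketch below. This is not necessarily the fix that was eventually applied. The role name, the verb list, and the service account name are assumptions. Only the RBAC object layout and the `kubeflow.org` API group of the MPIJob resource are standard.

```python
import json


def mpijob_role(namespace: str, name: str = "mpijob-editor") -> dict:
    """Role granting basic access to MPIJob resources (name and verbs are hypothetical)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [{
            "apiGroups": ["kubeflow.org"],
            "resources": ["mpijobs"],
            "verbs": ["create", "get", "list", "watch", "delete"],
        }],
    }


def mpijob_role_binding(namespace: str, service_account: str,
                        role_name: str = "mpijob-editor") -> dict:
    """RoleBinding attaching the role above to one service account."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [{
            "kind": "ServiceAccount",
            "name": service_account,
            "namespace": namespace,
        }],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be piped to `kubectl apply -f -`.
    for obj in (mpijob_role("my-namespace"),
                mpijob_role_binding("my-namespace", "default-editor")):
        print(json.dumps(obj, indent=2))
```

Whether the binding took effect can be checked with `kubectl auth can-i create mpijobs.kubeflow.org -n <namespace>`, run as the account in question.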

jacek-dudek · 2023-10-18T14:26:34Z

Epic: #2
Relates to: #8

jacek-dudek · 2023-11-14T19:40:37Z

The access control problem that was blocking this issue has been resolved and documented in #17.

Deploying an MPIJob now fails with a different set of errors. Here is a printout of the error messages:

riskpaths-launcher /bin/sh: 1: orted: not found
riskpaths-launcher command terminated with exit code 127
riskpaths-launcher --------------------------------------------------------------------------
riskpaths-launcher ORTE was unable to reliably start one or more daemons.
riskpaths-launcher This usually is caused by:
riskpaths-launcher * not finding the required libraries and/or binaries on
riskpaths-launcher one or more nodes. Please check your PATH and LD_LIBRARY_PATH
riskpaths-launcher settings, or configure OMPI with --enable-orterun-prefix-by-default
riskpaths-launcher * lack of authority to execute on one or more specified nodes.
riskpaths-launcher Please verify your allocation and authorities.
riskpaths-launcher * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
riskpaths-launcher Please check with your sys admin to determine the correct location to use.
riskpaths-launcher * compilation of the orted with dynamic libraries when static are required
riskpaths-launcher (e.g., on Cray). Please check your configure cmd line and consider using
riskpaths-launcher one of the contrib/platform definitions for your system type.
riskpaths-launcher * an inability to create a connection back to mpirun due to a
riskpaths-launcher lack of common network interfaces and/or no route found between
riskpaths-launcher them. Please check network connectivity (including firewalls
riskpaths-launcher and network routing requirements).
riskpaths-launcher --------------------------------------------------------------------------
riskpaths-launcher --------------------------------------------------------------------------
riskpaths-launcher ORTE does not know how to route a message to the specified daemon
riskpaths-launcher located on the indicated node:
riskpaths-launcher my node: riskpaths-launcher
riskpaths-launcher target node: riskpaths-worker-0
riskpaths-launcher This is usually an internal programming error that should be
riskpaths-launcher reported to the developers. In the meantime, a workaround may
riskpaths-launcher be to set the MCA param routed=direct on the command line or
riskpaths-launcher in your environment. We apologize for the problem.
riskpaths-launcher --------------------------------------------------------------------------
riskpaths-launcher [riskpaths-launcher:00001] 2 more processes have sent help message help-errmgr-base.txt / no-path
riskpaths-launcher [riskpaths-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
riskpaths-launcher /bin/sh: 1: orted: not found
riskpaths-launcher command terminated with exit code 127

jacek-dudek · 2023-11-14T19:47:27Z

Summarizing the potential issues listed in the errors:

(1) Not finding required binaries or libraries. Verify that PATH and LD_LIBRARY_PATH contain entries for mpirun and its link libraries, and for other executables such as orted, which the log reports as not found.

(2) Lack of authority to execute commands on one or more pods. Check what users the commands are being run as and check user and group permissions on the containers.

(3) Inability to write start-up files to /tmp in the container (when the worker pods are being initialized).

(4) Inability to create a connection back to mpirun. Please check network connectivity including firewalls.
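Points (1) and (3) can be checked from inside a launcher or worker container with a small script. This is a sketch only. It assumes Python is available in the container image, and the helper name `missing_binaries` is hypothetical.

```python
import os
import shutil


def missing_binaries(names, path=None):
    """Return the entries of `names` that are not executables on `path` (defaults to $PATH)."""
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # Show the search paths the error message asks us to check.
    print("PATH =", os.environ.get("PATH", ""))
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", ""))
    # mpirun and orted are the two executables named in the log above.
    print("missing:", missing_binaries(["mpirun", "orted"]))
    # Point (3): the start-up files are written under /tmp.
    print("/tmp writable:", os.access("/tmp", os.W_OK))
```

Running it in both the launcher and a worker pod would show whether the two containers see the same Open MPI installation.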

jacek-dudek · 2023-11-14T20:22:38Z

(5) Possible internal programming error that breaks some aspect of message routing. The suggested workaround is to supply the MCA parameter routed=direct to mpirun.
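As a sketch of the workaround in (5), the launcher command could be assembled with the MCA parameter passed through mpirun's `--mca <name> <value>` option. The function name and the model binary name are hypothetical, and since the log also reports `orted: not found`, this workaround may not be sufficient on its own.

```python
def mpirun_command(model_cmd, num_procs, routed_direct=False):
    """Build an mpirun argument list, optionally forcing direct message routing."""
    cmd = ["mpirun", "-n", str(num_procs)]
    if routed_direct:
        # Same effect as exporting OMPI_MCA_routed=direct in the environment.
        cmd += ["--mca", "routed", "direct"]
    return cmd + list(model_cmd)


if __name__ == "__main__":
    # "./model_mpi" is a placeholder for the actual model executable.
    print(" ".join(mpirun_command(["./model_mpi"], 2, routed_direct=True)))
```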

jacek-dudek · 2023-11-15T21:55:33Z

Summary of the latest effort to get OpenM++ models running as MPIJobs, with logs and results available in the UI afterwards:

(1) Added label entry (to obtain access to blob storage) to worker pod specification in the template used for creating MPIJob manifests.
(2) Moved the model binary being tested from ~/models/ to ~/buckets/aaw-unclassified.
(3) Updated parseCommand.py and MPIJobTemplate.yaml to reflect the change of working directory to ~/buckets/aaw-unclassified.
(4) Set OpenM.LogToFile option true in MPIJob manifests.
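The manifest changes in (1), (3) and (4) can be sketched as a function that patches an MPIJob manifest held as a Python dictionary. This is not the code in parseCommand.py. The function name, the label key, and the assumption that the model options are carried in the launcher container's `args` are all hypothetical. The `mpiReplicaSpecs` layout and the `workingDir` container field are standard. Note that Kubernetes does not expand `~`, so the working directory would need to be given as an absolute path.

```python
import copy


def patch_mpijob(manifest: dict, workdir: str, worker_labels: dict) -> dict:
    """Return a copy of an MPIJob manifest with labels, working directory and log option set."""
    patched = copy.deepcopy(manifest)
    specs = patched["spec"]["mpiReplicaSpecs"]

    # (1) Add labels to the worker pod template.
    worker_meta = specs["Worker"]["template"].setdefault("metadata", {})
    worker_meta.setdefault("labels", {}).update(worker_labels)

    # (3) Run every launcher and worker container from the same working directory.
    for role in ("Launcher", "Worker"):
        for container in specs[role]["template"]["spec"]["containers"]:
            container["workingDir"] = workdir

    # (4) Ask the model to write its log to a file (assumes options live in args).
    launcher = specs["Launcher"]["template"]["spec"]["containers"][0]
    launcher.setdefault("args", []).extend(["-OpenM.LogToFile", "true"])
    return patched
```

For example, `patch_mpijob(manifest, "/absolute/path/to/buckets/aaw-unclassified", {"example.org/blob-access": "true"})` leaves the original manifest untouched and returns the patched copy.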

jacek-dudek · 2023-11-17T21:48:52Z

Three changes to the openmpp web service start-up script:
(1) Updated the script to specify a model log directory and to choose both the model and model log directories in a consistent way.
(2) Added a command line option to the web service start-up command so that it uses the preferred model log directory when running.
(3) The scripts and templates needed to run MPI jobs via Kubeflow are now downloaded and placed where needed in the openmpp installation (two binaries and two template files, into openmpp's bin and etc directories respectively).

These changes are all in the kubeflow-containers repo with a pending pull request: StatCan/aaw-kubeflow-containers#553

jacek-dudek · 2023-11-23T19:32:32Z

We have enabled running openmpp models compiled for MPI execution using Kubeflow's MPIJob functionality.
Changes were made to the aaw-kubeflow-containers and openmpp repositories and submitted in these pull requests:
StatCan/aaw-kubeflow-containers#553
#22

jacek-dudek self-assigned this Oct 18, 2023

chuckbelisle assigned KrisWilliamson Nov 15, 2023

jacek-dudek linked a pull request Nov 17, 2023 that will close this issue

OMPP: Moved copy file directives from start-custom.sh to start-oms.sh StatCan/aaw-kubeflow-containers#553

Merged

12 tasks

Souheil-Yazji closed this as completed Nov 28, 2023

KrisWilliamson mentioned this issue Nov 29, 2023

[Epic] Implement OpenM++ MPI job controller using GO #24

Open

8 tasks
