Verify that an MPIJob resource is created when current solution gets run. #13

Closed
jacek-dudek opened this issue Oct 16, 2023 · 8 comments · Fixed by StatCan/aaw-kubeflow-containers#553
@jacek-dudek
Collaborator

The solution appears to parse model run requests and create valid manifests.
It also seems to apply the manifests, but I don't yet know how to verify that an MPIJob resource actually gets created.
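
One possible way to verify this, as a minimal sketch: read the MPIJob back from the API server with `kubectl get mpijob <name> -o json` and look at its status conditions. The function names, the job name, and the namespace below are hypothetical; only the parsing step is exercised without a cluster.

```python
import json
import subprocess


def summarize_mpijob(resource_json: str) -> dict:
    """Extract kind, name, and the condition types whose status is "True"."""
    obj = json.loads(resource_json)
    conditions = obj.get("status", {}).get("conditions", [])
    return {
        "kind": obj.get("kind"),
        "name": obj.get("metadata", {}).get("name"),
        "active_conditions": [
            c["type"] for c in conditions if c.get("status") == "True"
        ],
    }


def fetch_mpijob(name: str, namespace: str) -> dict:
    """Query the cluster for one MPIJob.

    Raises subprocess.CalledProcessError if kubectl fails, for example
    because the resource does not exist or access is forbidden.
    """
    result = subprocess.run(
        ["kubectl", "get", "mpijob", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return summarize_mpijob(result.stdout)
```

Calling `fetch_mpijob("riskpaths", "my-namespace")` (both names are placeholders) right after the manifest is applied should return a summary if the resource exists, and raise otherwise.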

jacek-dudek self-assigned this on Oct 18, 2023
@jacek-dudek
Collaborator Author

Currently the job returns an "error from server: Forbidden". Pat commented that this looks like an access control issue. I will look into creating a new role binding, which should hopefully resolve it.
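
For illustration only, a role binding of the kind mentioned above could look like the following sketch. It builds a namespaced Role that allows managing `mpijobs` in the `kubeflow.org` API group and binds it to a service account. The role name, namespace, service account, and verb list are all assumptions, not the configuration that was eventually applied.

```python
def mpijob_rbac(namespace: str, service_account: str,
                role_name: str = "mpijob-editor") -> list:
    """Return a Role and a RoleBinding (as dicts) granting access to MPIJobs."""
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [{
            "apiGroups": ["kubeflow.org"],   # API group of the MPIJob CRD
            "resources": ["mpijobs"],
            "verbs": ["create", "get", "list", "watch", "delete"],  # assumed set
        }],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": role_name + "-binding", "namespace": namespace},
        "subjects": [{
            "kind": "ServiceAccount",
            "name": service_account,
            "namespace": namespace,
        }],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }
    return [role, binding]
```

Since `kubectl apply -f` accepts JSON as well as YAML, the two dicts can be dumped with `json.dumps` and applied directly.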

@jacek-dudek
Collaborator Author

Epic: #2
Relates to: #8

@jacek-dudek
Collaborator Author

The access control issue that was blocking this work was resolved and documented in: #17

Deploying an MPIJob now fails with a new set of errors. Here's a printout of the error messages:

riskpaths-launcher /bin/sh: 1: orted: not found
riskpaths-launcher command terminated with exit code 127
riskpaths-launcher --------------------------------------------------------------------------
riskpaths-launcher ORTE was unable to reliably start one or more daemons.
riskpaths-launcher This usually is caused by:
riskpaths-launcher * not finding the required libraries and/or binaries on
riskpaths-launcher one or more nodes. Please check your PATH and LD_LIBRARY_PATH
riskpaths-launcher settings, or configure OMPI with --enable-orterun-prefix-by-default
riskpaths-launcher * lack of authority to execute on one or more specified nodes.
riskpaths-launcher Please verify your allocation and authorities.
riskpaths-launcher * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
riskpaths-launcher Please check with your sys admin to determine the correct location to use.
riskpaths-launcher * compilation of the orted with dynamic libraries when static are required
riskpaths-launcher (e.g., on Cray). Please check your configure cmd line and consider using
riskpaths-launcher one of the contrib/platform definitions for your system type.
riskpaths-launcher * an inability to create a connection back to mpirun due to a
riskpaths-launcher lack of common network interfaces and/or no route found between
riskpaths-launcher them. Please check network connectivity (including firewalls
riskpaths-launcher and network routing requirements).
riskpaths-launcher --------------------------------------------------------------------------
riskpaths-launcher --------------------------------------------------------------------------
riskpaths-launcher ORTE does not know how to route a message to the specified daemon
riskpaths-launcher located on the indicated node:
riskpaths-launcher my node: riskpaths-launcher
riskpaths-launcher target node: riskpaths-worker-0
riskpaths-launcher This is usually an internal programming error that should be
riskpaths-launcher reported to the developers. In the meantime, a workaround may
riskpaths-launcher be to set the MCA param routed=direct on the command line or
riskpaths-launcher in your environment. We apologize for the problem.
riskpaths-launcher --------------------------------------------------------------------------
riskpaths-launcher [riskpaths-launcher:00001] 2 more processes have sent help message help-errmgr-base.txt / no-path
riskpaths-launcher [riskpaths-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
riskpaths-launcher /bin/sh: 1: orted: not found
riskpaths-launcher command terminated with exit code 127

@jacek-dudek
Collaborator Author

Summarizing the potential issues listed in the errors:

(1) Not finding required binaries or libraries. Verify that PATH and LD_LIBRARY_PATH include the directories for mpirun, for orted (the executable the log reports as not found), and for their shared libraries.

(2) Lack of authority to execute commands on one or more pods. Check what users the commands are being run as and check user and group permissions on the containers.

(3) Inability to write start-up files to /tmp in the container (when the worker pods are being initialized).

(4) Inability to create a connection back to mpirun. Check network connectivity, including firewalls and routing.
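
Items (1) and (3) can be checked from inside a pod with a few lines of Python. This is a rough sketch, not part of the current solution: the function name and report layout are made up here. For context, exit code 127 in the log is the shell's "command not found" status, which is consistent with orted missing from PATH.

```python
import os
import shutil


def check_mpi_prereqs(path_env: str, tmp_dir: str = "/tmp",
                      binaries=("mpirun", "orted")) -> dict:
    """Report where each MPI binary resolves on PATH and whether tmp_dir is writable.

    A binary maps to None when it cannot be found in path_env.
    """
    return {
        "binaries": {name: shutil.which(name, path=path_env) for name in binaries},
        "tmp_writable": os.path.isdir(tmp_dir) and os.access(tmp_dir, os.W_OK),
    }
```

Running `check_mpi_prereqs(os.environ.get("PATH", ""))` in both the launcher and a worker container would show whether the two images agree on where orted lives.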

@jacek-dudek
Collaborator Author

jacek-dudek commented Nov 14, 2023

(5) Possible internal programming error that breaks some aspect of message routing. The suggested workaround is to supply the MCA parameter routed=direct to mpirun.
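
As a sketch of what that workaround could look like when the launcher command is assembled in Python: Open MPI accepts MCA parameters either as `--mca <name> <value>` on the mpirun command line or as `OMPI_MCA_<name>` environment variables. The helper names and the model binary name below are hypothetical.

```python
def mpirun_command(model_cmd, num_procs, mca=None):
    """Build an mpirun argv, passing each MCA parameter as --mca <name> <value>."""
    argv = ["mpirun", "-n", str(num_procs)]
    for name, value in (mca or {}).items():
        argv += ["--mca", name, str(value)]
    return argv + list(model_cmd)


def mca_env(mca):
    """Equivalent environment-variable form: OMPI_MCA_<name>=<value>."""
    return {"OMPI_MCA_" + name: str(value) for name, value in mca.items()}
```

Note that this only addresses the routing message. The `orted: not found` error in the same log would still need to be fixed separately.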

@jacek-dudek
Collaborator Author

Summary of the latest effort to get OpenM models running as MPIJobs and to have logs and results available in the UI afterwards:

(1) Added label entry (to obtain access to blob storage) to worker pod specification in the template used for creating MPIJob manifests.
(2) Moved the model binary being tested from ~/models/ to ~/buckets/aaw-unclassified.
(3) Updated parseCommand.py and MPIJobTemplate.yaml to reflect change in working directory to ~/buckets/aaw-unclassified
(4) Set OpenM.LogToFile option true in MPIJob manifests.
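
To make the manifest changes above concrete, here is a hedged sketch of the same edits applied to an MPIJob manifest held as a Python dict. It is not the actual contents of parseCommand.py or MPIJobTemplate.yaml. The label key and value, the absolute working directory path, and the assumption that model options sit in the launcher container's `args` are all placeholders.

```python
import copy


def patch_mpijob(manifest: dict, worker_labels: dict, working_dir: str) -> dict:
    """Return a copy of an MPIJob manifest with the changes from steps (1), (3), (4)."""
    patched = copy.deepcopy(manifest)
    replicas = patched["spec"]["mpiReplicaSpecs"]

    # (1) Label the worker pods (e.g. to request blob storage access).
    worker_meta = replicas["Worker"]["template"].setdefault("metadata", {})
    worker_meta.setdefault("labels", {}).update(worker_labels)

    # (3) Point every container at the new working directory.
    for role in ("Launcher", "Worker"):
        for container in replicas[role]["template"]["spec"]["containers"]:
            container["workingDir"] = working_dir

    # (4) Ask the model to write its log to a file (assumes options live in args).
    launcher = replicas["Launcher"]["template"]["spec"]["containers"][0]
    args = launcher.setdefault("args", [])
    if "-OpenM.LogToFile" not in args:
        args += ["-OpenM.LogToFile", "true"]
    return patched
```

The input manifest is left untouched, so the template can be reused for each model run request.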

@jacek-dudek
Collaborator Author

Three changes to the openmpp web service start-up script:
(1) Updated the script to specify a model log directory and to choose both the model and model log directories in a consistent way.
(2) Added a command line option to the web service start-up command so that it uses the preferred model log directory when running.
(3) The scripts and templates needed to run MPI jobs via Kubeflow are now downloaded and placed where needed in the openmpp installation (two binaries and two template files, into openmpp's bin and etc directories respectively).

These changes are all in the kubeflow-containers repo with a pending pull request: StatCan/aaw-kubeflow-containers#553
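
A minimal sketch of the "consistent directories" idea from the first two changes: derive both directories from one root and pass them to the web service together. The option names `-oms.ModelDir` and `-oms.ModelLogDir` are assumptions based on the oms documentation and are not confirmed anywhere in this thread. The directory layout and the function name are hypothetical, and the real start-up script in the pull request may differ.

```python
import posixpath


def oms_command(models_root: str, oms_bin: str = "bin/oms") -> list:
    """Build an oms start-up argv with model and log directories under one root."""
    model_dir = posixpath.join(models_root, "bin")   # hypothetical layout
    log_dir = posixpath.join(models_root, "log")     # hypothetical layout
    return [
        oms_bin,
        "-oms.ModelDir", model_dir,       # assumed option name
        "-oms.ModelLogDir", log_dir,      # assumed option name
    ]
```

Deriving both paths from a single argument means the web service and the MPIJob launcher cannot drift apart on where model logs are expected.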

@jacek-dudek
Collaborator Author

jacek-dudek commented Nov 23, 2023

We enabled openmpp models compiled for MPI execution to run through Kubeflow's MPIJob functionality.
The changes were made in the aaw-kubeflow-containers and openmpp repositories and submitted in these pull requests:
StatCan/aaw-kubeflow-containers#553
#22

3 participants