This patch introduces acceleration code in Code_Saturne for NVIDIA GPUs. This is a partial port in the sense that only a limited set of test cases is supported.
This has been tested on OpenPOWER platforms, but it should work on other platforms that support CUDA as well. We tested it on both Power8 + P100 and Power9 + V100 machines. On the former, you should expect over 2x speedup at scale if the number of cells per GPU is above 100k. On the latter, the speedup goes up to at least 3x while providing better strong scaling; we ran the code successfully on the Summit supercomputer at Oak Ridge National Laboratory on up to 512 nodes.
The overall idea is to reduce the effect of latencies in the different vector and matrix-vector operations. We employ a template packing technique to statically bundle multiple operations into the same CUDA kernel, and we create data environments to keep data resident on the GPU for longer.
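As a rough illustration of the idea (a sketch with hypothetical names, not the actual Code_Saturne kernels), compile-time template parameters select which operations a single kernel launch performs, so launch latency is paid once per bundle rather than once per operation:

```cuda
// Sketch of template packing: boolean template parameters statically
// bundle vector operations, so the compiler emits one fused kernel per
// combination and dead branches are eliminated. Names are hypothetical.
template <bool DO_AXPY, bool DO_SCALE>
__global__ void packed_vector_ops(int n, double a, double b,
                                  const double *x, double *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    if (DO_AXPY)  y[i] += a * x[i];  // y <- y + a*x (compiled in only if requested)
    if (DO_SCALE) y[i] *= b;         // y <- b*y
  }
}

// One launch performs both operations instead of two separate launches:
//   packed_vector_ops<true, true><<<nblocks, 256>>>(n, a, b, d_x, d_y);
```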
The implementation introduces the GPU acceleration port in `/src/cuda`, and its entry points are invoked from all around the code. The code is prepared to be launched with NVIDIA Multi-Process Service (MPS) so that multiple ranks can share the same GPU; I tested this successfully with up to 5 ranks per GPU. For this to work, CUDA device visibility has to be set so that each rank only sees the GPU it is meant to use.
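A minimal per-rank wrapper along these lines achieves that (a sketch: the GPU count of 4 is illustrative, and `OMPI_COMM_WORLD_LOCAL_RANK` is the local-rank variable exported by OpenMPI-compatible launchers):

```sh
#!/bin/bash
# Sketch: restrict each rank to a single GPU before the solver starts.
# 4 GPUs per node is illustrative; adjust to the actual node topology.
export CUDA_VISIBLE_DEVICES=$(( OMPI_COMM_WORLD_LOCAL_RANK % 4 ))
exec ./cs_solver "$@"
```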
The patch introduces a way to assess the number of local ranks, which expects an OpenMPI-compatible environment, e.g. IBM Spectrum MPI.
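For reference, that assessment can be made from the environment alone (a sketch, not the exact patch code; `OMPI_COMM_WORLD_LOCAL_SIZE` is the per-node rank count exported by OpenMPI-compatible launchers):

```c
#include <stdlib.h>

/* Sketch: query the per-node rank count from an OpenMPI-compatible
 * environment (e.g. IBM Spectrum MPI). Not the exact patch code. */
static int
get_n_local_ranks(void)
{
  const char *s = getenv("OMPI_COMM_WORLD_LOCAL_SIZE");
  return (s != NULL) ? atoi(s) : 1;  /* assume 1 local rank if unset */
}
```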
The patch also introduces changes to the build system so that the code can easily be built with GPU support. Building without GPU support is equivalent to running Code_Saturne in its current, CPU-only form.
To build the code you should use a C/C++ compiler that supports C++11, as the CUDA code requires it. Here's an example of how to build the code (note the `--enable-cuda-offload` flag):
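(The compilers and install prefix below are illustrative; only the `--enable-cuda-offload` flag is the one added by this patch.)

```sh
./configure CC=gcc CXX=g++ FC=gfortran \
    --enable-cuda-offload \
    --prefix=$HOME/opt/code_saturne-gpu
make -j 8
make install
```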
To run with MPS support, there are multiple ways; we used both plain IBM Spectrum LSF and LSF+CSM. Here is an example of an LSF script to submit a job:
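(A sketch: the job geometry, wall time, and launcher invocation are illustrative; the essential part is launching the solver through the `../../cs_solver_gpu` proxy described below.)

```sh
#!/bin/bash
#BSUB -J cs_cavity          # job name (illustrative)
#BSUB -n 20                 # total MPI ranks (illustrative)
#BSUB -W 60                 # wall time in minutes
#BSUB -o cs_cavity.%J.out
#BSUB -e cs_cavity.%J.err

# Launch the solver through the MPS proxy script with an
# OpenMPI-compatible launcher (e.g. IBM Spectrum MPI's mpirun).
mpirun -np 20 ../../cs_solver_gpu
```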
Here, `../../cs_solver_gpu` is a proxy script that starts the MPS servers (one per GPU) and launches the `cs_solver` application.
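Its contents are along these lines (a sketch assuming the standard `nvidia-cuda-mps-control` daemon; the GPU count, pipe/log paths, and rank variable are illustrative):

```sh
#!/bin/bash
# Proxy launcher: ensure an MPS server is up for this rank's GPU,
# then exec the real solver. Sketch only; paths are illustrative.

NGPUS=4                                        # GPUs per node (illustrative)
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}    # OpenMPI-compatible env
GPU=$(( LOCAL_RANK % NGPUS ))

# Per-GPU pipe/log directories keep the MPS servers independent.
export CUDA_VISIBLE_DEVICES=$GPU
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe-$GPU
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log-$GPU
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"

# Start the MPS control daemon for this GPU if not already running.
if [ ! -e "$CUDA_MPS_PIPE_DIRECTORY/control" ]; then
    nvidia-cuda-mps-control -d
fi

# Hand over to the solver; it reaches the server via the pipe directory.
exec ./cs_solver "$@"
```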
One MPS server per GPU may be overkill; 2 per GPU is in most cases sufficient.
We tested the code with a cavity flow case. Here is an example using a 13M-cell mesh:
https://ibm.box.com/s/2rhbavxqgxhvrfi4ws98w36h74i7aqat
To run it, download the test case from this link and then launch the job from `cs_test/SRC` as in the LSF script above.