
Slingshot (CXI) Build Instructions

Mark F. Brown edited this page Feb 5, 2026 · 3 revisions

Building and Installing Sandia OpenSHMEM (SOS) on HPE Slingshot Fabric:

If you have a libfabric installation that includes the CXI provider, it can be used to configure and run SOS on an HPE Slingshot network.

Building SOS to use the libfabric CXI providers

Use the following configure options to build SOS to use the libfabric CXI provider:

$ ./autogen.sh
$ ./configure --prefix=<SOS_install_dir> --with-ofi=<libfabric_install_dir> --enable-pmi-simple --enable-ofi-mr=basic --enable-mr-endpoint --enable-ofi-manual-progress
$ make
$ make install

where <SOS_install_dir> is the desired installation path for SOS and <libfabric_install_dir> is the path to a libfabric installation with the CXI provider enabled. Basic memory registration for OFI is enabled via the --enable-ofi-mr=basic flag, and --enable-mr-endpoint enables MR-ENDPOINT memory registration in OFI. For this configuration, --enable-ofi-manual-progress is also required to select FI_MANUAL_PROGRESS as the data progress option.

Compiling and running OpenSHMEM programs

The SOS build should be added to your path. If you have used the build instructions posted here, this is done by running the following command:

$ export PATH=<SOS_install_dir>/bin:$PATH

Once SOS is in your path, you can use the compiler wrapper oshcc to compile your application and the launcher wrapper oshrun to run it. Please ensure you have a compatible launcher, such as Hydra 3.2 or newer.

To run SOS programs over CXI, SOS can be instructed to use the provider by setting the following environment variable:

SHMEM_OFI_PROVIDER="cxi"
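The steps above can be collected into a job-script sketch. The install prefix, PE count, program name, and the build_launch_cmd helper below are illustrative assumptions, not values from this page:

```shell
# Sketch: set up the environment and assemble a launch command for a CXI run.
# SOS_INSTALL_DIR, the PE count, and ./hello.x are placeholders.
SOS_INSTALL_DIR="${SOS_INSTALL_DIR:-$HOME/sos-install}"
export PATH="$SOS_INSTALL_DIR/bin:$PATH"

# Tell SOS to select the libfabric CXI provider.
export SHMEM_OFI_PROVIDER="cxi"

# Build the oshrun invocation: $1 = number of PEs, $2 = program binary.
build_launch_cmd() {
  echo "oshrun -n $1 $2"
}

build_launch_cmd 4 ./hello.x   # prints: oshrun -n 4 ./hello.x
```

In a real job script, the last line would be executed directly (e.g. oshrun -n 4 ./hello.x) after compiling the application with oshcc.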

Performance Tuning

Shared Memory Intranode Communication

XPMEM can be leveraged to route intranode PE communication through a shared memory interface. It is enabled with the configure option --with-xpmem, e.g. --with-xpmem=/usr/lib.

Mapping PEs and NICs

At startup, intranode PEs automatically use the NICs located physically closest to the CPUs they are mapped to. However, unless PEs are pinned, the OS scheduler may migrate them to other CPU cores, so pinning is critical to maximize locality and reduce cache misses.

Unfortunately, job launchers accomplish this in different ways; e.g., MPICH supports the --cpu-bind parameter.

For performance-sensitive workloads, use hwloc to examine the CPU socket/core-to-NIC mappings of the target host and build a core list. For example, suppose a system has two NUMA domains, domain 1 with cores 0-63 and domain 2 with cores 64-127, with a NIC attached to each. An optimal pinning map for an 8 PE/node job would be 0,1,2,3,64,65,66,67; with MPICH this is accomplished with the parameter --cpu-bind list:0:1:2:3:64:65:66:67
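The mapping above can be sketched as a small helper that builds a bind list by taking the first N/2 cores from each domain. The function name and the domain starting cores are illustrative assumptions, not part of MPICH or SOS:

```shell
# Sketch: derive an MPICH-style --cpu-bind list for a two-domain node,
# taking the first N/2 cores from each domain.
make_bind_list() {
  pes=$1; d1_start=$2; d2_start=$3
  half=$((pes / 2))
  list=""
  for i in $(seq 0 $((half - 1))); do
    list="$list:$((d1_start + i))"
  done
  for i in $(seq 0 $((half - 1))); do
    list="$list:$((d2_start + i))"
  done
  echo "list$list"
}

# Two domains: cores 0-63 and 64-127, 8 PEs per node.
make_bind_list 8 0 64   # prints: list:0:1:2:3:64:65:66:67
```

The resulting string can be passed directly to the launcher, e.g. --cpu-bind $(make_bind_list 8 0 64).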

Tuning Memory Registration (MR) Cache

Memory registration, i.e. making virtual memory pages accessible to the NIC hardware via its translation tables, can be an expensive operation. Libfabric therefore keeps a cache of previously registered memory regions that can be reused. Hardware translation tables are a limited resource, so the cache should be tuned to the specific workload for optimal performance.

Useful environment variable defaults for Slingshot:

FI_MR_CACHE_MAX_COUNT=32768 
FI_MR_CACHE_MAX_SIZE=-1 
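One way to apply these defaults in a job script without clobbering values the user has already exported (the `${VAR:-default}` pattern is a sketch, not an SOS requirement):

```shell
# Apply the Slingshot MR-cache defaults only when the variables are unset.
export FI_MR_CACHE_MAX_COUNT="${FI_MR_CACHE_MAX_COUNT:-32768}"
export FI_MR_CACHE_MAX_SIZE="${FI_MR_CACHE_MAX_SIZE:--1}"   # -1: no aggregate size limit
```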

Bounce Buffers

SOS utilizes bounce buffers to provide buffering for put operations. A bounce buffer is an intermediary buffer: using one allows SOS to return the source buffer to the application for reuse while the intermediary buffer is still in use by the NIC. This reduces put latency at the cost of an additional memory copy. If the bounce buffer is small enough, that copy is negligible compared to the time it takes to transfer the buffer to the NIC's memory; bounce buffers should therefore be limited to smaller transfer sizes.

The maximum put size eligible for bounce buffering is controlled with SHMEM_BOUNCE_SIZE, a positive integer that must be a power of 2, e.g. SHMEM_BOUNCE_SIZE=8192.
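Since the value must be a positive power of two, a job script can validate a proposed setting before exporting it; the is_pow2 helper below is illustrative, not part of SOS:

```shell
# Sketch: validate that a proposed SHMEM_BOUNCE_SIZE is a positive power of two.
# A power of two has exactly one bit set, so n & (n - 1) is zero.
is_pow2() {
  [ "$1" -gt 0 ] && [ $(($1 & ($1 - 1))) -eq 0 ]
}

if is_pow2 8192; then
  export SHMEM_BOUNCE_SIZE=8192
fi
```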

The maximum number of bounce buffers can be limited with SHMEM_MAX_BOUNCE_BUFFERS. The current implementation forces puts to wait if all bounce buffers are in flight.

Limiting or disabling bounce buffers may be useful on single-NIC systems. If the NIC hardware is overwhelmed by incoming buffers, you can incur both the overhead of waiting for a bounce buffer to become available and of waiting for NIC hardware queue availability. This situation is known as bufferbloat and should be avoided.

Please refer to the Troubleshooting wiki page if you encounter any issues. If your particular problem is not covered in the wiki, please submit an issue on GitHub.
