Accessing the HWLOC topology tree
There are several mechanisms by which OMPI may obtain an HWLOC topology, depending on the environment within which the application is executing and the method by which the application was started. As PMIx continues to roll out across the environment, the variations in how OMPI deals with the topology will hopefully simplify. In the interim, however, OMPI must deal with a variety of use-cases. This document attempts to capture those situations and explain how OMPI interacts with the topology.
Note: this document pertains to version 5.0 and above. While elements of the following discussion can be found in earlier OMPI versions, there may exist nuances that modify their application to those situations. In v5.0 and above, PRRTE (the PMIx Reference RunTime Environment) is used as the OMPI RTE, and PRRTE is built with PMIx as its core foundation. Key to the discussion, therefore, is that OMPI v5.0 and above requires PRRTE 2.0 or above, which in turn requires PMIx v4.0.1 or above.
It is important to note that it is PMIx (and not PRRTE itself) that is often providing the HWLOC topology to the application. This is definitely the case for mpirun launch, and other environments have (so far) followed that model. If PMIx provides the topology, it will come in several forms:
- if HWLOC 2.x or above is used, then the primary form will be via HWLOC's shmem feature. The shmem rendezvous information is provided in a set of three PMIx keys (PMIX_HWLOC_SHMEM_FILE, PMIX_HWLOC_SHMEM_ADDR, and PMIX_HWLOC_SHMEM_SIZE)
- if HWLOC 2.x or above is used, then PMIx will also provide the topology as an HWLOC v2 XML string. Although one could argue this duplicates information, it is provided by default to support environments where shmem may not be available or authorized between the server and client processes (more on that below)
- regardless of HWLOC version, PMIx also provides the topology as a HWLOC v1 XML string to support client applications that are linked against an older HWLOC version
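As a rough illustration, a PMIx client could retrieve the delivered topology along the following lines. This is a sketch, not OMPI's actual code: it assumes PMIx v4 (pmix.h) and HWLOC 2.x built with shmem support (hwloc/shmem.h), the helper name is hypothetical, and error handling plus PMIX_VALUE_RELEASE calls are trimmed for brevity.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <pmix.h>
#include <hwloc.h>
#include <hwloc/shmem.h>

static hwloc_topology_t fetch_topology(const char *nspace)
{
    hwloc_topology_t topo = NULL;
    pmix_proc_t wildcard;
    pmix_value_t *vfile, *vaddr, *vsize, *vxml;

    PMIX_LOAD_PROCID(&wildcard, nspace, PMIX_RANK_WILDCARD);

    /* Preferred path: adopt the server's shared-memory topology (zero copy) */
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_HWLOC_SHMEM_FILE, NULL, 0, &vfile) &&
        PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_HWLOC_SHMEM_ADDR, NULL, 0, &vaddr) &&
        PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_HWLOC_SHMEM_SIZE, NULL, 0, &vsize)) {
        int fd = open(vfile->data.string, O_RDONLY);
        if (0 <= fd &&
            0 == hwloc_shmem_topology_adopt(&topo, fd, 0,
                                            (void *)(uintptr_t)vaddr->data.size,
                                            vsize->data.size, 0)) {
            return topo;   /* topology is shared read-only with the server */
        }
    }

    /* Fallback: instantiate a private copy from the v2 XML string */
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_HWLOC_XML_V2, NULL, 0, &vxml)) {
        hwloc_topology_init(&topo);
        hwloc_topology_set_xmlbuffer(topo, vxml->data.string,
                                     strlen(vxml->data.string) + 1);
        hwloc_topology_load(topo);
    }
    return topo;
}
```

The shmem path is attractive at scale because every process on the node adopts the same read-only mapping rather than instantiating its own copy.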
Should none of those be available, or if the user has specified a topology file that is to be used in place of whatever the environment provides, then OMPI will either read the topology from the file or perform its own local discovery. The latter is highly discouraged as it leads to significant scaling issues (both in terms of startup time and memory footprint) on complex machines with many cores and multiple layers in their memory hierarchy.
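Those two last-resort paths amount to roughly the following sketch, assuming HWLOC 2.x; the file-path parameter is hypothetical.

```c
#include <hwloc.h>

static hwloc_topology_t topology_fallback(const char *topo_file)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    if (NULL != topo_file) {
        /* Read a previously exported XML description instead of probing */
        hwloc_topology_set_xml(topo, topo_file);
    }
    /* Without the XML source, this performs full local discovery -- the
     * expensive, discouraged path on large machines */
    hwloc_topology_load(topo);
    return topo;
}
```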
Once the topology has been obtained, the next question one must consider is: what does that topology represent? Is it the topology assigned to the application itself (e.g., via cgroup)? Or is it the overall topology as seen by the base OS? OMPI is designed to utilize the former - i.e., it expects to see the topology assigned to the application, and thus considers any resources present in the topology to be available for its use. It is therefore important to be able to identify the scope of the topology, and to appropriately filter it when necessary.
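Where filtering is needed, one possible approach (a sketch, assuming HWLOC 2.x and a topology loaded with the HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED flag so that the allowed cpuset differs from the complete one) is to restrict the tree to the allowed cpuset:

```c
#include <hwloc.h>

static int restrict_to_allowed(hwloc_topology_t topo)
{
    /* The allowed cpuset reflects cgroup/cpuset constraints on this node */
    hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);
    /* Remove every object that has no intersection with the allowed set */
    return hwloc_topology_restrict(topo, allowed, 0);
}
```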
Unfortunately, the question of when to filter depends upon the method of launch, and (in the case of direct launch) on the architecture of the host environment. Let's consider the various scenarios:
- mpirun launch: mpirun is always started at the user level. Therefore, both mpirun and its compute node daemons only "see" the topology made available to them by the base OS - i.e., whatever cgroup is being applied to the application has already been reflected in the HWLOC topology discovered by mpirun or the local compute node daemon. Thus, the topology provided by mpirun (regardless of the delivery mechanism) contains a full description of the resources available to that application.
However, users can launch multiple mpirun applications in parallel within that same allocation, using an appropriate command-line option to assign specific subsets of the overall allocation to each invocation. In this case (a soft resource assignment), the topology will not have been filtered to reflect the subdivision of resources.
- DVM launch
- direct launch: In the case of direct launch (i.e., launch by a host-provided launcher such as srun for Slurm), the behavior depends on where the host environment places its PMIx server:
- Per-job (step) daemon hosting the PMIx server. This is the most common case.
- System daemon hosting the PMIx server. This is less common in practice due to the security mismatch - the system daemon must operate at a privileged level, while the application is operating at a user level. However, there are scenarios where this may be permissible or even required. In such cases, the system daemon will expose an unfiltered view of the local topology to all applications executing on that node. This is essentially equivalent to the DVM launch mode described above, except that there is no guarantee that the host environment will provide all the information required by OMPI. Thus, it may be necessary to filter the topology in such cases.
- singleton: By definition, singletons execute without the support of any RTE. While technically they could connect to a system-level PMIx server, OMPI initializes application processes as PMIx "clients" and not "tools". Thus, the PMIx client library does not support discovery of, and connection to, an arbitrary PMIx server - it requires that either the server identify itself via environment variables or that the application provide the necessary rendezvous information. Singletons, therefore, must discover the topology for themselves. If operating under external constraints (e.g., cgroups), the discovery will yield an appropriately constrained set of resources.
While there are flags to indicate if OMPI has been launched by its own RTE (whether mpirun or the DVM), this in itself is not sufficient information to determine whether the topology reflects the resources assigned to the application. The best method, therefore, is to:
a. attempt to access the desired information directly from PMIx. In most cases, all OMPI-required information will have been provided. This includes relative process locality and device (NIC and GPU) distances between each process and its local devices. If found, this information accurately reflects the actual resource utilization/availability for the application, thereby removing the need to directly access the topology itself. This is the recommended practice.
b. if the desired information is not available from PMIx, then one must turn to the topology for the answers.
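Step (a) might look roughly like the sketch below, assuming PMIx v4. PMIX_LOCALITY_STRING and PMIX_DEVICE_DISTANCES are standard PMIx keys; the function name is hypothetical and error handling is trimmed.

```c
#include <stdio.h>
#include <pmix.h>

static void query_pmix_info(const pmix_proc_t *peer, const pmix_proc_t *me)
{
    pmix_value_t *val;

    /* Relative locality of a peer process (e.g., shared package/NUMA) */
    if (PMIX_SUCCESS == PMIx_Get(peer, PMIX_LOCALITY_STRING, NULL, 0, &val)) {
        printf("peer locality: %s\n", val->data.string);
        PMIX_VALUE_RELEASE(val);
    }

    /* Distances from this process to its local devices (NICs, GPUs, ...) */
    if (PMIX_SUCCESS == PMIx_Get(me, PMIX_DEVICE_DISTANCES, NULL, 0, &val) &&
        PMIX_DATA_ARRAY == val->type) {
        pmix_device_distance_t *d =
            (pmix_device_distance_t *)val->data.darray->array;
        for (size_t n = 0; n < val->data.darray->size; n++) {
            printf("device %s: mindist %u maxdist %u\n",
                   d[n].uuid, d[n].mindist, d[n].maxdist);
        }
        PMIX_VALUE_RELEASE(val);
    }
}
```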