
MPI and OpenMP Example


This example introduces APEX with an application that uses both MPI and OpenMP.

Source Code

The example uses the (now long outdated) LULESH proxy application from LLNL.

Running the MPI example

Running the example is straightforward: invoke your equivalent of mpirun (srun, jsrun, mpiexec, etc.) followed by the executable name:

[khuck@gilgamesh apex-tutorial]$ mpirun -np 8 ./build/bin/lulesh_MPI_OpenMP_2.0
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Running problem size 30^3 per domain until completion
Num processors: 8
Num threads: 4
Total number of elements: 216000

To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options

Run completed:
   Problem size        =  30
   MPI tasks           =  8
   Iteration count     =  2031
   Final Origin Energy = 7.130703e+05
   Testing Plane 0 of Energy Array on rank 0:
        MaxAbsDiff   = 1.236913e-10
        TotalAbsDiff = 4.731078e-09
        MaxRelDiff   = 3.858431e-14


Elapsed time         =      28.96 (s)
Grind time (us/z/c)  = 0.52806123 (per dom)  (0.066007654 overall)
FOM                  =  15149.758 (z/s)

Because the example is also used for tuning, it has APEX linked in directly, which is why the APEX banner is displayed even without launching through apex_exec.
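
For reference, linking APEX directly means initializing and finalizing it around the application's work. The following is a minimal sketch, assuming the apex::init(name, rank, size) / apex::finalize() C++ API from apex_api.hpp; it is illustrative only, not the actual LULESH integration:

// Sketch only: not the actual LULESH/APEX integration.
#include <mpi.h>
#include <apex_api.hpp>  // APEX C++ API (assumed header name)

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    apex::init("lulesh sketch", rank, size);  // APEX startup; the banner above is printed at initialization

    // ... timestep loop with OpenMP regions and MPI exchanges ...

    apex::finalize();  // flush timers and counters before shutdown
    MPI_Finalize();
    return 0;
}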

Running the example with APEX and OpenMP support

For specifics on OpenMP support, please see the OpenMP Examples page.
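
The OpenMP timer names in the report below come straight from the compiler: the body of each parallel construct is compiled into an outlined function (e.g. ".omp_outlined..34"), and APEX labels the region with that name plus its code address. A minimal sketch of the kind of construct that produces such a label (illustrative, not actual LULESH source):

// Illustrative only. The compiler outlines this loop body into a function
// such as ".omp_outlined..34"; the OMPT interface reports that name and its
// code address (e.g. 0x20e316), which is how the "OpenMP Parallel Region"
// and "OpenMP Work Loop" timers below get their labels.
void Scale(double* x, double a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        x[i] *= a;
    }
}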

[khuck@gilgamesh apex-tutorial]$ mpirun -np 8 apex_exec --apex:ompt --apex:tasktree ./build/bin/lulesh_MPI_OpenMP_2.0
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
(the tool registration message above is printed once per MPI rank, 8 times in total)
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Running problem size 30^3 per domain until completion
Num processors: 8
Num threads: 4
Total number of elements: 216000

To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options

Run completed:
   Problem size        =  30
   MPI tasks           =  8
   Iteration count     =  2031
   Final Origin Energy = 7.130703e+05
   Testing Plane 0 of Energy Array on rank 0:
        MaxAbsDiff   = 1.236913e-10
        TotalAbsDiff = 4.731078e-09
        MaxRelDiff   = 3.858431e-14


Elapsed time         =      26.34 (s)
Grind time (us/z/c)  = 0.48032695 (per dom)  (0.060040869 overall)
FOM                  =  16655.322 (z/s)


Start Date/Time: 26/02/2023 13:36:51
Elapsed time: 26.5126 seconds
Total processes detected: 8
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 7
Available CPU time on rank 0: 185.588 seconds
Available CPU time on all ranks: 1484.71 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :    216    15.08    19.00
                       Bytes : MPI_Allreduce recvbuf :  16240    64.00    64.00
                       Bytes : MPI_Allreduce sendbuf :  16240     8.00     8.00
                                   Bytes : MPI_Irecv : 219404 1.54e+04 4.61e+04
                                   Bytes : MPI_Isend : 219404 1.54e+04 4.61e+04
                          Bytes : MPI_Reduce recvbuf :      8    64.00    64.00
                          Bytes : MPI_Reduce sendbuf :      8     8.00     8.00
                                         CPU Guest % :    208     0.00     0.00
                                      CPU I/O Wait % :    208     0.57     1.12
                                           CPU IRQ % :    208     0.26     0.29
                                          CPU Idle % :    208    61.92    67.68
                                          CPU Nice % :    208     0.24     1.02
                                         CPU Steal % :    208     0.00     0.00
                                        CPU System % :    208     4.82     5.49
                                          CPU User % :    208    32.10    38.07
                                      CPU soft IRQ % :    208     0.09     0.13
                                         DRAM Energy :    208     2.92     4.00
                                    Package-0 Energy :    208   145.59   154.00
                                      status:Threads :    216     8.55    10.00
                                    status:VmData kB :    216 5.06e+05 5.22e+05
                                     status:VmExe kB :    216   136.00   136.00
                                     status:VmHWM kB :    216 7.90e+04 8.22e+04
                                     status:VmLck kB :    216     0.00     0.00
                                     status:VmLib kB :    216 1.43e+05 1.43e+05
                                     status:VmPTE kB :    216   611.39   664.00
                                    status:VmPeak kB :    216 1.32e+06 1.46e+06
                                     status:VmPin kB :    216     0.00     0.00
                                     status:VmRSS kB :    216 7.59e+04 8.15e+04
                                    status:VmSize kB :    216 1.28e+06 1.40e+06
                                     status:VmStk kB :    216   140.00   140.00
                                    status:VmSwap kB :    216     0.00     0.00
                   status:nonvoluntary_ctxt_switches :    216    63.74    92.00
                      status:voluntary_ctxt_switches :    216 2.58e+04 1.08e+05
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1    26.51    26.51
        int apex_preload_main(int, char **, char **) :      8    26.45   211.63
   EvalEOSForElems(Domain&, double*, int, int*, int) : 178728     0.00    89.70
        OpenMP Work Loop: .omp_outlined..34:0x20e316 :  64992     0.00    85.67
           OpenMP Work Loop: .omp_outlined.:0x2060e7 :  64992     0.00    47.10
        OpenMP Work Loop: .omp_outlined..33:0x20ce5a :  64992     0.00    33.45
        OpenMP Work Loop: .omp_outlined..31:0x20b962 :  64992     0.00    33.32
               OpenMP Parallel Region: main:0x207dee :  16248     0.00    22.41
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00    19.36
        OpenMP Work Loop: .omp_outlined..42:0x2105ab :  64992     0.00    18.36
        OpenMP Work Loop: .omp_outlined..18:0x2097b3 : 1.95e+06     0.00    15.96
               OpenMP Parallel Region: main:0x20824d :  16248     0.00    12.47
        OpenMP Work Loop: .omp_outlined..43:0x211273 : 714912     0.00    11.05
        OpenMP Work Loop: .omp_outlined..23:0x20a870 : 2.27e+06     0.00    10.67
        OpenMP Work Loop: .omp_outlined..26:0x20b0ec : 6.82e+06     0.00    10.30
        OpenMP Work Loop: .omp_outlined..21:0x20a3aa : 2.27e+06     0.00     9.55
        OpenMP Work Loop: .omp_outlined..25:0x20adc6 : 6.82e+06     0.00     9.45
               OpenMP Parallel Region: main:0x207bd6 :  16248     0.00     9.25
               OpenMP Parallel Region: main:0x207a77 :  16248     0.00     9.12
   int MPI_Waitall(int, MPI_Request *, MPI_Status *) :  48752     0.00     8.42
        OpenMP Work Loop: .omp_outlined..35:0x20fa83 :  64992     0.00     7.28
        OpenMP Work Loop: .omp_outlined..18:0x20994f : 1.95e+06     0.00     6.96
        OpenMP Work Loop: .omp_outlined..32:0x20cc43 :  64992     0.00     6.84
        OpenMP Work Loop: .omp_outlined..24:0x20ab83 : 2.27e+06     0.00     6.35
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     6.27
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     6.24
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     6.20
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     5.18
               OpenMP Parallel Region: main:0x20836c :  16248     0.00     4.85
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     4.58
               OpenMP Parallel Region: main:0x2084a2 : 178728     0.00     4.55
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     4.38
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     4.38
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     4.37
        OpenMP Work Loop: .omp_outlined..20:0x20a046 : 2.27e+06     0.00     4.28
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     4.25
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     4.12
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 568680     0.00     4.08
        OpenMP Work Loop: .omp_outlined..19:0x209e96 : 714912     0.00     2.82
        OpenMP Work Loop: .omp_outlined..48:0x21214b : 714912     0.00     2.74
        OpenMP Work Loop: .omp_outlined..30:0x20b7c0 :  64992     0.00     2.56
        OpenMP Work Loop: .omp_outlined..22:0x20a66c : 2.27e+06     0.00     2.45
               OpenMP Parallel Region: main:0x207e28 :  16248     0.00     2.25
               OpenMP Parallel Region: main:0x207ab4 :  16248     0.00     2.16
        OpenMP Work Loop: .omp_outlined..36:0x20fcac :  64992     0.00     2.09
        OpenMP Work Loop: .omp_outlined..40:0x210430 :  64992     0.00     2.08
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 178728     0.00     2.02
        OpenMP Work Loop: .omp_outlined..18:0x209cbb : 1.95e+06     0.00     2.01
               OpenMP Parallel Region: main:0x208831 : 178728     0.00     1.98
        OpenMP Work Loop: .omp_outlined..38:0x210163 :  64992     0.00     1.95
        OpenMP Work Loop: .omp_outlined..27:0x20b262 : 714912     0.00     1.69
        OpenMP Work Loop: .omp_outlined..39:0x2102ef :  64992     0.00     1.63
OpenMP Parallel Region: EvalEOSForElems(Domain&, do… : 178728     0.00     1.59
               OpenMP Parallel Region: main:0x2089cf : 178728     0.00     1.45
        OpenMP Work Loop: .omp_outlined..18:0x209ac7 : 1.95e+06     0.00     1.28
        OpenMP Work Loop: .omp_outlined..18:0x209bc2 : 1.95e+06     0.00     1.26
               OpenMP Parallel Region: main:0x20797b :  16248     0.00     0.84
        OpenMP Work Loop: .omp_outlined..28:0x20b430 :  64992     0.00     0.82
               OpenMP Parallel Region: main:0x208275 :  16248     0.00     0.74
               OpenMP Parallel Region: main:0x2080c6 :  16248     0.00     0.73
               OpenMP Parallel Region: main:0x208046 :  16248     0.00     0.71
        OpenMP Work Loop: .omp_outlined..49:0x212365 : 714912     0.00     0.64
               OpenMP Parallel Region: main:0x2080fd :  16248     0.00     0.64
           int MPI_Wait(MPI_Request *, MPI_Status *) : 219404     0.00     0.63
        OpenMP Work Loop: .omp_outlined..46:0x211cad :  64992     0.00     0.56
               OpenMP Parallel Region: main:0x20861d :  16248     0.00     0.54
               OpenMP Parallel Region: main:0x2078cc :  16248     0.00     0.36
int MPI_Isend(const void *, int, MPI_Datatype, int,… : 219404     0.00     0.29
int MPI_Allreduce(const void *, void *, int, MPI_Da… :  16240     0.00     0.28
        OpenMP Work Loop: .omp_outlined..29:0x20b6d5 :  64992     0.00     0.27
        OpenMP Work Loop: .omp_outlined..47:0x211e90 :  64992     0.00     0.23
               OpenMP Parallel Region: main:0x207b15 :  16248     0.00     0.22
        OpenMP Work Loop: .omp_outlined..46:0x211b0c :  64992     0.00     0.19
               OpenMP Parallel Region: main:0x208077 :  16248     0.00     0.18
                                 MPI Collective Sync :  16256     0.00     0.18
               OpenMP Parallel Region: main:0x20871c :  16248     0.00     0.18
        OpenMP Work Loop: .omp_outlined..46:0x211bf0 :  64992     0.00     0.16
int MPI_Irecv(void *, int, MPI_Datatype, int, int, … : 219404     0.00     0.10
        OpenMP Work Loop: .omp_outlined..37:0x210037 :  32496     0.00     0.03
        OpenMP Work Loop: .omp_outlined..37:0x20fe04 :  32496     0.00     0.03
        OpenMP Work Loop: .omp_outlined..37:0x20ff17 :  32496     0.00     0.02
int MPI_Reduce(const void *, void *, int, MPI_Datat… :      8     0.00     0.00
                           int MPI_Barrier(MPI_Comm) :      8     0.00     0.00
                                                     :      1     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 48510953
Writing: .//apex_tasktree.csv...done.
[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 700 rows
Found 7 ranks, with max graph node index of 88 and depth of 4
building common tree...
Rank 7 ...
8-> 26.513 - 100.000% [1] {min=26.512, max=26.513, mean=26.513, threads=8} APEX MAIN
8 |-> 26.453 - 99.776% [1] {min=26.401, max=26.506, mean=26.453, threads=8} int apex_preload_main(int, char **, char **)
8 | |-> 11.212 - 42.291% [22341] {min=10.118, max=13.009, mean=0.001, threads=8} EvalEOSForElems(Domain&, double*, int, int*, int)
8 | | |-> 2.420 - 9.126% [71085] {min=1.925, max=3.316, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x208ec6
8 | | | |-> 1.995 - 7.525% [243372] {min=1.153, max=3.530, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x2097b3
8 | | | |-> 0.870 - 3.280% [243372] {min=0.474, max=1.561, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x20994f
8 | | | |-> 0.252 - 0.950% [243372] {min=0.164, max=0.398, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x209cbb
8 | | | |-> 0.160 - 0.605% [243372] {min=0.113, max=0.244, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x209ac7
8 | | | |-> 0.158 - 0.596% [243372] {min=0.111, max=0.244, mean=0.000, threads=48} OpenMP Work Loop: .omp_outlined..18:0x209bc2
8 | | |-> 0.783 - 2.954% [71085] {min=0.628, max=0.990, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209327
8 | | | |-> 1.334 - 5.030% [284340] {min=0.760, max=2.167, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..23:0x20a870
8 | | |-> 0.781 - 2.944% [71085] {min=0.688, max=0.843, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209042
8 | | | |-> 0.535 - 2.018% [284340] {min=0.357, max=0.830, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..20:0x20a046
8 | | |-> 0.775 - 2.925% [71085] {min=0.634, max=0.935, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209182
8 | | | |-> 1.194 - 4.504% [284340] {min=0.735, max=1.892, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..21:0x20a3aa
8 | | |-> 0.648 - 2.444% [71085] {min=0.565, max=0.765, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209465
8 | | | |-> 0.794 - 2.996% [284340] {min=0.508, max=1.140, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..24:0x20ab83
8 | | |-> 0.572 - 2.159% [71085] {min=0.539, max=0.603, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2090af
8 | | | |-> 0.451 - 1.700% [284340] {min=0.329, max=0.629, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..25:0x20adc6
8 | | |-> 0.547 - 2.064% [71085] {min=0.489, max=0.620, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2090fe
8 | | | |-> 0.438 - 1.650% [284340] {min=0.300, max=0.655, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..26:0x20b0ec
8 | | |-> 0.547 - 2.064% [71085] {min=0.519, max=0.586, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209244
8 | | | |-> 0.385 - 1.451% [284340] {min=0.262, max=0.547, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..25:0x20adc6
8 | | |-> 0.546 - 2.060% [71085] {min=0.500, max=0.593, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x20928b
8 | | | |-> 0.444 - 1.676% [284340] {min=0.300, max=0.652, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..26:0x20b0ec
8 | | |-> 0.531 - 2.004% [71085] {min=0.476, max=0.571, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2093ad
8 | | | |-> 0.346 - 1.307% [284340] {min=0.236, max=0.489, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..25:0x20adc6
8 | | |-> 0.516 - 1.945% [71085] {min=0.463, max=0.587, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2093f1
8 | | | |-> 0.406 - 1.532% [284340] {min=0.267, max=0.631, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..26:0x20b0ec
8 | | |-> 0.510 - 1.923% [71085] {min=0.448, max=0.543, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2091be
8 | | | |-> 0.306 - 1.153% [284340] {min=0.206, max=0.433, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..22:0x20a66c
8 | | |-> 0.252 - 0.952% [22341] {min=0.224, max=0.270, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x2094e4
8 | | | |-> 0.353 - 1.330% [89364] {min=0.312, max=0.377, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..19:0x209e96
8 | | |-> 0.199 - 0.751% [22341] {min=0.174, max=0.225, mean=0.000, threads=8} OpenMP Parallel Region: EvalEOSForElems(Domain&, double*, int, int*, int):0x209590
8 | | | |-> 0.211 - 0.795% [89364] {min=0.184, max=0.271, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..27:0x20b262
8 | |-> 2.802 - 10.567% [2031] {min=2.709, max=2.902, mean=0.001, threads=8} OpenMP Parallel Region: main:0x207dee
8 | | |-> 10.709 - 40.392% [8124] {min=10.421, max=11.032, mean=0.001, threads=32} OpenMP Work Loop: .omp_outlined..34:0x20e316
8 | |-> 1.558 - 5.878% [2031] {min=1.537, max=1.596, mean=0.001, threads=8} OpenMP Parallel Region: main:0x20824d
8 | | |-> 5.887 - 22.204% [8124] {min=5.790, max=5.985, mean=0.001, threads=32} OpenMP Work Loop: .omp_outlined.:0x2060e7
8 | |-> 1.156 - 4.360% [2031] {min=1.067, max=1.273, mean=0.001, threads=8} OpenMP Parallel Region: main:0x207bd6
8 | | |-> 4.181 - 15.771% [8124] {min=3.852, max=4.820, mean=0.001, threads=32} OpenMP Work Loop: .omp_outlined..33:0x20ce5a
8 | |-> 1.140 - 4.301% [2031] {min=1.038, max=1.435, mean=0.001, threads=8} OpenMP Parallel Region: main:0x207a77
8 | | |-> 4.165 - 15.708% [8124] {min=4.005, max=4.571, mean=0.001, threads=32} OpenMP Work Loop: .omp_outlined..31:0x20b962
8 | |-> 1.053 - 3.972% [6094] {min=0.779, max=1.294, mean=0.000, threads=8} int MPI_Waitall(int, MPI_Request *, MPI_Status *)
8 | |-> 0.606 - 2.288% [2031] {min=0.595, max=0.616, mean=0.000, threads=8} OpenMP Parallel Region: main:0x20836c
8 | | |-> 2.295 - 8.656% [8124] {min=2.279, max=2.310, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..42:0x2105ab
8 | |-> 0.569 - 2.146% [22341] {min=0.510, max=0.598, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2084a2
8 | | |-> 1.382 - 5.211% [89364] {min=1.285, max=1.463, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..43:0x211273
8 | |-> 0.281 - 1.059% [2031] {min=0.225, max=0.307, mean=0.000, threads=8} OpenMP Parallel Region: main:0x207e28
8 | | |-> 0.910 - 3.432% [8124] {min=0.769, max=1.015, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..35:0x20fa83
8 | |-> 0.270 - 1.019% [2031] {min=0.204, max=0.314, mean=0.000, threads=8} OpenMP Parallel Region: main:0x207ab4
8 | | |-> 0.855 - 3.226% [8124] {min=0.653, max=0.972, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..32:0x20cc43
8 | |-> 0.248 - 0.935% [22341] {min=0.223, max=0.268, mean=0.000, threads=8} OpenMP Parallel Region: main:0x208831
8 | | |-> 0.342 - 1.291% [89364] {min=0.321, max=0.369, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..48:0x21214b
8 | |-> 0.181 - 0.683% [22341] {min=0.163, max=0.194, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2089cf
8 | | |-> 0.080 - 0.301% [89364] {min=0.068, max=0.106, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..49:0x212365
8 | |-> 0.105 - 0.395% [2031] {min=0.097, max=0.112, mean=0.000, threads=8} OpenMP Parallel Region: main:0x20797b
8 | | |-> 0.320 - 1.208% [8124] {min=0.282, max=0.358, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..30:0x20b7c0
8 | |-> 0.092 - 0.348% [2031] {min=0.087, max=0.099, mean=0.000, threads=8} OpenMP Parallel Region: main:0x208275
8 | | |-> 0.260 - 0.980% [8124] {min=0.235, max=0.280, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..40:0x210430
8 | |-> 0.091 - 0.345% [2031] {min=0.067, max=0.119, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2080c6
8 | | |-> 0.244 - 0.919% [8124] {min=0.180, max=0.335, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..38:0x210163
8 | |-> 0.089 - 0.334% [2031] {min=0.085, max=0.093, mean=0.000, threads=8} OpenMP Parallel Region: main:0x208046
8 | | |-> 0.262 - 0.987% [8124] {min=0.257, max=0.269, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..36:0x20fcac
8 | |-> 0.079 - 0.300% [2031] {min=0.057, max=0.109, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2080fd
8 | | |-> 0.204 - 0.769% [8124] {min=0.147, max=0.292, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..39:0x2102ef
8 | |-> 0.078 - 0.295% [27425] {min=0.022, max=0.199, mean=0.000, threads=8} int MPI_Wait(MPI_Request *, MPI_Status *)
8 | |-> 0.067 - 0.253% [2031] {min=0.062, max=0.071, mean=0.000, threads=8} OpenMP Parallel Region: main:0x20861d
8 | | |-> 0.070 - 0.263% [8124] {min=0.069, max=0.071, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..46:0x211cad
8 | | |-> 0.024 - 0.089% [8124] {min=0.023, max=0.024, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..46:0x211b0c
8 | | |-> 0.020 - 0.077% [8124] {min=0.020, max=0.021, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..46:0x211bf0
8 | |-> 0.046 - 0.172% [2031] {min=0.041, max=0.051, mean=0.000, threads=8} OpenMP Parallel Region: main:0x2078cc
8 | | |-> 0.103 - 0.389% [8124] {min=0.090, max=0.120, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..28:0x20b430
8 | |-> 0.037 - 0.138% [27425] {min=0.027, max=0.045, mean=0.000, threads=8} int MPI_Isend(const void *, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request *)
8 | |-> 0.035 - 0.133% [2030] {min=0.030, max=0.040, mean=0.000, threads=8} int MPI_Allreduce(const void *, void *, int, MPI_Datatype, MPI_Op, MPI_Comm)
8 | | |-> 0.022 - 0.083% [2030] {min=0.016, max=0.027, mean=0.000, threads=8} MPI Collective Sync
8 | |-> 0.027 - 0.101% [2031] {min=0.023, max=0.033, mean=0.000, threads=8} OpenMP Parallel Region: main:0x207b15
8 | | |-> 0.033 - 0.125% [8124] {min=0.030, max=0.037, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..29:0x20b6d5
8 | |-> 0.022 - 0.084% [2031] {min=0.005, max=0.036, mean=0.000, threads=8} OpenMP Parallel Region: main:0x208077
4 | | |-> 0.007 - 0.027% [8124] {min=0.006, max=0.008, mean=0.000, threads=16} OpenMP Work Loop: .omp_outlined..37:0x210037
4 | | |-> 0.007 - 0.026% [8124] {min=0.006, max=0.008, mean=0.000, threads=16} OpenMP Work Loop: .omp_outlined..37:0x20fe04
4 | | |-> 0.006 - 0.022% [8124] {min=0.006, max=0.006, mean=0.000, threads=16} OpenMP Work Loop: .omp_outlined..37:0x20ff17
8 | |-> 0.022 - 0.083% [2031] {min=0.021, max=0.023, mean=0.000, threads=8} OpenMP Parallel Region: main:0x20871c
8 | | |-> 0.029 - 0.108% [8124] {min=0.026, max=0.032, mean=0.000, threads=32} OpenMP Work Loop: .omp_outlined..47:0x211e90
8 | |-> 0.012 - 0.046% [27425] {min=0.010, max=0.015, mean=0.000, threads=8} int MPI_Irecv(void *, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request *)
8 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=8} int MPI_Reduce(const void *, void *, int, MPI_Datatype, MPI_Op, int, MPI_Comm)
8 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=8} MPI Collective Sync
8 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=8} int MPI_Barrier(MPI_Comm)
8 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=8} MPI Collective Sync
90 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.

[Figure: DOT task graph of the MPI example]

As shown in the output, APEX wraps and measures the MPI communication routines, capturing both the time spent in each call and the total bytes transferred. In the DOT figure, boxes are shaded by either time (blue) or bytes transferred (red).
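
The byte counts come from the wrapped call arguments themselves: for each send or receive, the message size is presumably derived from the count and datatype. A minimal sketch of the kind of nonblocking exchange being measured (illustrative; not LULESH's actual communication code):

#include <mpi.h>

// Illustrative halo exchange. APEX's wrappers time each MPI call and record
// the message size, which feeds the "Bytes : MPI_Isend" / "Bytes : MPI_Irecv"
// counters shown earlier; time blocked in MPI_Waitall appears under its own timer.
void Exchange(double* sendbuf, double* recvbuf, int n, int neighbor) {
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}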