OpenMP Examples

Kevin Huck edited this page Feb 26, 2023 · 10 revisions

The following example will introduce APEX using the OpenMP programming model.

Source Code

Any of the tutorial's OpenMP examples can be used with the instructions below. The walkthrough on this page uses the ompt_reduction.c example.

Running the OpenMP example

To run one of the examples, set the number of OpenMP threads (recommended) and then launch the binary:

kehuck1@uan01:~/src/apex-tutorial> export OMP_NUM_THREADS=4
kehuck1@uan01:~/src/apex-tutorial> ./build/bin/ompt_reduction
Final result= 656700.000000

Running the OpenMP example with APEX and OpenMP support

As described in the C POSIX Pthreads and Standard C++ threads tutorials, APEX provides the apex_exec wrapper script to preload the APEX measurement library and set the appropriate environment variables. Three of its options are relevant to OpenMP support:

    --apex:ompt                   enable OpenMP profiling (requires runtime support)
    --apex:ompt_simple            only enable OpenMP Tools required events
    --apex:ompt_details           enable all OpenMP Tools events

As the help message notes, OpenMP Tools (OMPT) support requires an OpenMP runtime that implements the OMPT interface defined in the OpenMP 5.0 specification. APEX provides the tool side of that interface. Compilers with known OMPT support include:

  • LLVM-based compilers, like Clang, Cray, AMD Clang, others
  • Intel OneAPI compilers
  • NVHPC version 22+ (special flags required at link time)
  • IBM XL compilers

You'll notice that GCC is not on this list; currently there is no known effort to provide OMPT support in the GCC compilers.

NOTE: Not all compilers/runtimes support the full set of OMPT events (some events are optional), and not all runtimes implement them correctly yet. The following examples use the AMD Clang/Clang++ compilers from ROCm 5.0.2 (reported in the output as AMD Clang 14.0.0).

Enabling OMPT support in APEX requires one of the three flags specified above. For example, basic support is provided with the --apex:ompt flag (for information about the other APEX flags, see the C POSIX Pthreads and Standard C++ threads tutorials):

kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:ompt --apex:tasktree ./build/bin/ompt_reduction
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
Final result= 656700.000000

Start Date/Time: 26/02/2023 22:33:21
Elapsed time: 0.0148564 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 4
Available CPU time on rank 0: 0.0594254 seconds
Available CPU time on all ranks: 0.0594254 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      1     2.31     2.31
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 4.16e+05 4.16e+05
                                     status:VmExe kB :      1     8.00     8.00
                                     status:VmHWM kB :      1 4.34e+04 4.34e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 1.81e+05 1.81e+05
                                     status:VmPTE kB :      1   484.00   484.00
                                    status:VmPeak kB :      1 8.56e+05 8.56e+05
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 4.34e+04 4.34e+04
                                    status:VmSize kB :      1 7.90e+05 7.90e+05
                                     status:VmStk kB :      1   160.00   160.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     0.00     0.00
                      status:voluntary_ctxt_switches :      1     6.00     6.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.01     0.01
        int apex_preload_main(int, char **, char **) :      1     0.01     0.01
               OpenMP Parallel Region: main:0x201a3d :      1     0.00     0.00
           OpenMP Work Loop: .omp_outlined.:0x201b58 :      4     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 6
Writing: .//apex_tasktree.csv
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii --dot
Reading tasktree...
Read 4 rows
Found 0 ranks, with max graph node index of 3 and depth of 3
building common tree...
Rank 0 ...
1-> 0.015 - 100.000% [1] {min=0.015, max=0.015, mean=0.015, threads=1} APEX MAIN
1 |-> 0.015 - 99.397% [1] {min=0.015, max=0.015, mean=0.015, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.001 - 3.773% [1] {min=0.001, max=0.001, mean=0.001, threads=1} OpenMP Parallel Region: main:0x201a3d
1 | | |-> 0.000 - 0.199% [4] {min=0.000, max=0.000, mean=0.000, threads=4} OpenMP Work Loop: .omp_outlined.:0x201b58
5 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.

DOT task graph of ompt_reduction example

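As shown in the output, APEX captures the outer parallel region as well as the work loop executed by each of the 4 threads (note the call count of 4 and threads=4 on the work loop node in the task tree).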

Running the OpenMP example with APEX and OMPT details

There are additional events in the OpenMP runtime that provide performance data, but they are not enabled by default because they can introduce significant overhead in tight loops. To enable these events, use the --apex:ompt_details flag:

kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:ompt --apex:ompt_details --apex:tasktree ./build/bin/ompt_reduction
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
Final result= 656700.000000

Start Date/Time: 26/02/2023 22:39:33
Elapsed time: 0.0168116 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 4
Available CPU time on rank 0: 0.0672464 seconds
Available CPU time on all ranks: 0.0672464 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      1     1.81     1.81
Iterations: OpenMP Work Loop: .omp_outlined.:0x201b… :      4   100.00   100.00
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 4.16e+05 4.16e+05
                                     status:VmExe kB :      1     8.00     8.00
                                     status:VmHWM kB :      1 4.52e+04 4.52e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 1.81e+05 1.81e+05
                                     status:VmPTE kB :      1   492.00   492.00
                                    status:VmPeak kB :      1 8.56e+05 8.56e+05
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 4.52e+04 4.52e+04
                                    status:VmSize kB :      1 7.90e+05 7.90e+05
                                     status:VmStk kB :      1   160.00   160.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     1.00     1.00
                      status:voluntary_ctxt_switches :      1     7.00     7.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.02     0.02
        int apex_preload_main(int, char **, char **) :      1     0.02     0.02
               OpenMP Parallel Region: main:0x201a3d :      1     0.00     0.00
                 OpenMP Implicit Task: main:0x201a3d :      1     0.00     0.00
           OpenMP Work Loop: .omp_outlined.:0x201b58 :      4     0.00     0.00
              OpenMP Implicit Barrier: main:0x201a3d :      1     0.00     0.00
         OpenMP Implicit Barrier Wait: main:0x201a3d :      1     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 9
Writing: .//apex_tasktree.csv
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii --dot
Reading tasktree...
Read 7 rows
Found 0 ranks, with max graph node index of 6 and depth of 5
building common tree...
Rank 0 ...
1-> 0.017 - 100.000% [1] {min=0.017, max=0.017, mean=0.017, threads=1} APEX MAIN
1 |-> 0.017 - 99.531% [1] {min=0.017, max=0.017, mean=0.017, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.001 - 5.114% [1] {min=0.001, max=0.001, mean=0.001, threads=1} OpenMP Parallel Region: main:0x201a3d
1 | | |-> 0.000 - 0.884% [1] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Implicit Task: main:0x201a3d
1 | | | |-> 0.000 - 0.651% [4] {min=0.000, max=0.000, mean=0.000, threads=4} OpenMP Work Loop: .omp_outlined.:0x201b58
1 | | | |-> 0.000 - 0.315% [1] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Implicit Barrier: main:0x201a3d
1 | | | | |-> 0.000 - 0.239% [1] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Implicit Barrier Wait: main:0x201a3d
8 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.

DOT task graph of ompt_reduction ompt_details example

As shown in the output, APEX now records the implicit barrier at the end of the parallel loop, and the time spent waiting in that barrier is captured as a separate event (OpenMP Implicit Barrier Wait). The details also include the implicit task events generated by the runtime, and a new counter reports the iteration count of the work loop.

Running the OpenMP example with APEX and OMPT simple

With the --apex:ompt_simple flag, APEX enables only the required OpenMP Tools events:

kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:ompt --apex:ompt_simple --apex:tasktree ./build/bin/ompt_reduction
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
Final result= 656700.000000

Start Date/Time: 26/02/2023 22:42:41
Elapsed time: 0.0145053 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 4
Available CPU time on rank 0: 0.0580211 seconds
Available CPU time on all ranks: 0.0580211 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      1     1.68     1.68
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 4.16e+05 4.16e+05
                                     status:VmExe kB :      1     8.00     8.00
                                     status:VmHWM kB :      1 4.30e+04 4.30e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 1.81e+05 1.81e+05
                                     status:VmPTE kB :      1   492.00   492.00
                                    status:VmPeak kB :      1 8.56e+05 8.56e+05
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 4.30e+04 4.30e+04
                                    status:VmSize kB :      1 7.90e+05 7.90e+05
                                     status:VmStk kB :      1   160.00   160.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     1.00     1.00
                      status:voluntary_ctxt_switches :      1     8.00     8.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.01     0.01
        int apex_preload_main(int, char **, char **) :      1     0.01     0.01
               OpenMP Parallel Region: main:0x201a3d :      1     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 2
Writing: .//apex_tasktree.csv
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii --dot
Reading tasktree...
Read 3 rows
Found 0 ranks, with max graph node index of 2 and depth of 2
building common tree...
Rank 0 ...
1-> 0.015 - 100.000% [1] {min=0.015, max=0.015, mean=0.015, threads=1} APEX MAIN
1 |-> 0.014 - 99.360% [1] {min=0.014, max=0.014, mean=0.014, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.001 - 4.757% [1] {min=0.001, max=0.001, mean=0.001, threads=1} OpenMP Parallel Region: main:0x201a3d
4 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.

DOT task graph of ompt_reduction ompt_simple example

As shown in the output, only the parallel region event is captured. That region encloses the loop, so no separate work loop timer appears.