OpenMP Examples
This section introduces APEX using the OpenMP programming model.
The instructions below apply to any of the following examples:
- ompt_reduction.c
- ompt_sync_region_wait.c
- ompt_thread.c
- ompt_master.c
- ompt_sections.c
- ompt_parallel_region.c
- ompt_single.c
- ompt_task.c
The rest of this section uses the ompt_reduction.c example.
To run one of the examples (we recommend setting the number of threads explicitly):
kehuck1@uan01:~/src/apex-tutorial> export OMP_NUM_THREADS=4
kehuck1@uan01:~/src/apex-tutorial> ./build/bin/ompt_reduction
Final result= 656700.000000
As described in the C POSIX Pthreads and Standard C++ threads tutorials, APEX provides the apex_exec wrapper script to preload the APEX measurement library and set the appropriate environment variables. Three options are relevant to OpenMP support:
--apex:ompt enable OpenMP profiling (requires runtime support)
--apex:ompt_simple only enable OpenMP Tools required events
--apex:ompt_details enable all OpenMP Tools events
As mentioned in the help message, OpenMP Tools (OMPT) support requires an OpenMP runtime that implements the OMPT interface defined in the OpenMP 5.0 specification. APEX provides the tool implementation that matches up with the runtime support. Compiler vendors with known OMPT support include:
- LLVM-based compilers, such as Clang, Cray, AMD Clang, and others
- Intel OneAPI compilers
- NVHPC version 22+ (special flags required at link time)
- IBM XL compilers
You'll notice that GCC is not on this list. Currently there is no known effort to provide OMPT support in the GCC compilers.
NOTE: Not all compilers/runtimes support the full set of OMPT events (some events are optional), and not all runtimes implement them correctly yet. The following examples use the AMD Clang/Clang++ compilers from ROCm 5.0.2 (AMD Clang 14.0.0).
Enabling OMPT support in APEX requires one of the three flags specified above. For example, the --apex:ompt flag provides basic support (for information about the other APEX flags, see the C POSIX Pthreads and Standard C++ threads tutorials):
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:ompt --apex:tasktree ./build/bin/ompt_reduction
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
Final result= 656700.000000
Start Date/Time: 26/02/2023 22:33:21
Elapsed time: 0.0148564 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 4
Available CPU time on rank 0: 0.0594254 seconds
Available CPU time on all ranks: 0.0594254 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
1 Minute Load average : 1 2.31 2.31
status:Threads : 1 2.00 2.00
status:VmData kB : 1 4.16e+05 4.16e+05
status:VmExe kB : 1 8.00 8.00
status:VmHWM kB : 1 4.34e+04 4.34e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 1.81e+05 1.81e+05
status:VmPTE kB : 1 484.00 484.00
status:VmPeak kB : 1 8.56e+05 8.56e+05
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 4.34e+04 4.34e+04
status:VmSize kB : 1 7.90e+05 7.90e+05
status:VmStk kB : 1 160.00 160.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 0.00 0.00
status:voluntary_ctxt_switches : 1 6.00 6.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.01 0.01
int apex_preload_main(int, char **, char **) : 1 0.01 0.01
OpenMP Parallel Region: main:0x201a3d : 1 0.00 0.00
OpenMP Work Loop: .omp_outlined.:0x201b58 : 4 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 6
Writing: .//apex_tasktree.csv
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii --dot
Reading tasktree...
Read 4 rows
Found 0 ranks, with max graph node index of 3 and depth of 3
building common tree...
Rank 0 ...
1-> 0.015 - 100.000% [1] {min=0.015, max=0.015, mean=0.015, threads=1} APEX MAIN
1 |-> 0.015 - 99.397% [1] {min=0.015, max=0.015, mean=0.015, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.001 - 3.773% [1] {min=0.001, max=0.001, mean=0.001, threads=1} OpenMP Parallel Region: main:0x201a3d
1 | | |-> 0.000 - 0.199% [4] {min=0.000, max=0.000, mean=0.000, threads=4} OpenMP Work Loop: .omp_outlined.:0x201b58
5 total graph nodes
Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
As shown in the output, APEX captures the outer parallel region as well as the work loop executed by each of the 4 threads.
There are additional events in the OpenMP runtime that provide performance data, but they are not enabled by default because they can introduce significant overhead in tight loops. To enable these events, use the --apex:ompt_details flag:
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:ompt --apex:ompt_details --apex:tasktree ./build/bin/ompt_reduction
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
Final result= 656700.000000
Start Date/Time: 26/02/2023 22:39:33
Elapsed time: 0.0168116 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 4
Available CPU time on rank 0: 0.0672464 seconds
Available CPU time on all ranks: 0.0672464 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
1 Minute Load average : 1 1.81 1.81
Iterations: OpenMP Work Loop: .omp_outlined.:0x201b… : 4 100.00 100.00
status:Threads : 1 2.00 2.00
status:VmData kB : 1 4.16e+05 4.16e+05
status:VmExe kB : 1 8.00 8.00
status:VmHWM kB : 1 4.52e+04 4.52e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 1.81e+05 1.81e+05
status:VmPTE kB : 1 492.00 492.00
status:VmPeak kB : 1 8.56e+05 8.56e+05
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 4.52e+04 4.52e+04
status:VmSize kB : 1 7.90e+05 7.90e+05
status:VmStk kB : 1 160.00 160.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 1.00 1.00
status:voluntary_ctxt_switches : 1 7.00 7.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.02 0.02
int apex_preload_main(int, char **, char **) : 1 0.02 0.02
OpenMP Parallel Region: main:0x201a3d : 1 0.00 0.00
OpenMP Implicit Task: main:0x201a3d : 1 0.00 0.00
OpenMP Work Loop: .omp_outlined.:0x201b58 : 4 0.00 0.00
OpenMP Implicit Barrier: main:0x201a3d : 1 0.00 0.00
OpenMP Implicit Barrier Wait: main:0x201a3d : 1 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 9
Writing: .//apex_tasktree.csv
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii --dot
Reading tasktree...
Read 7 rows
Found 0 ranks, with max graph node index of 6 and depth of 5
building common tree...
Rank 0 ...
1-> 0.017 - 100.000% [1] {min=0.017, max=0.017, mean=0.017, threads=1} APEX MAIN
1 |-> 0.017 - 99.531% [1] {min=0.017, max=0.017, mean=0.017, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.001 - 5.114% [1] {min=0.001, max=0.001, mean=0.001, threads=1} OpenMP Parallel Region: main:0x201a3d
1 | | |-> 0.000 - 0.884% [1] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Implicit Task: main:0x201a3d
1 | | | |-> 0.000 - 0.651% [4] {min=0.000, max=0.000, mean=0.000, threads=4} OpenMP Work Loop: .omp_outlined.:0x201b58
1 | | | |-> 0.000 - 0.315% [1] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Implicit Barrier: main:0x201a3d
1 | | | | |-> 0.000 - 0.239% [1] {min=0.000, max=0.000, mean=0.000, threads=1} OpenMP Implicit Barrier Wait: main:0x201a3d
8 total graph nodes
Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
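As shown in the output, APEX now records the implicit barrier associated with the parallel loop, and the time spent waiting for synchronization is captured as a separate event (OpenMP Implicit Barrier Wait). The details also include the implicit task events generated by the runtime.
To record only the events that the OpenMP Tools interface requires, use the --apex:ompt_simple flag: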
kehuck1@uan01:~/src/apex-tutorial> apex_exec --apex:ompt --apex:ompt_simple --apex:tasktree ./build/bin/ompt_reduction
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-7e20f7fa-develop
Built on: 18:48:48 Feb 23 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
----- START LOGGING OF TOOL REGISTRATION -----
Search for OMP tool in current address space... Success.
Tool was started and is using the OMPT interface.
----- END LOGGING OF TOOL REGISTRATION -----
Final result= 656700.000000
Start Date/Time: 26/02/2023 22:42:41
Elapsed time: 0.0145053 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 4
Available CPU time on rank 0: 0.0580211 seconds
Available CPU time on all ranks: 0.0580211 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
1 Minute Load average : 1 1.68 1.68
status:Threads : 1 2.00 2.00
status:VmData kB : 1 4.16e+05 4.16e+05
status:VmExe kB : 1 8.00 8.00
status:VmHWM kB : 1 4.30e+04 4.30e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 1.81e+05 1.81e+05
status:VmPTE kB : 1 492.00 492.00
status:VmPeak kB : 1 8.56e+05 8.56e+05
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 4.30e+04 4.30e+04
status:VmSize kB : 1 7.90e+05 7.90e+05
status:VmStk kB : 1 160.00 160.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 1.00 1.00
status:voluntary_ctxt_switches : 1 8.00 8.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.01 0.01
int apex_preload_main(int, char **, char **) : 1 0.01 0.01
OpenMP Parallel Region: main:0x201a3d : 1 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 2
Writing: .//apex_tasktree.csv
kehuck1@uan01:~/src/apex-tutorial> apex-treesummary.py --ascii --dot
Reading tasktree...
Read 3 rows
Found 0 ranks, with max graph node index of 2 and depth of 2
building common tree...
Rank 0 ...
1-> 0.015 - 100.000% [1] {min=0.015, max=0.015, mean=0.015, threads=1} APEX MAIN
1 |-> 0.014 - 99.360% [1] {min=0.014, max=0.014, mean=0.014, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.001 - 4.757% [1] {min=0.001, max=0.001, mean=0.001, threads=1} OpenMP Parallel Region: main:0x201a3d
4 total graph nodes
Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
As shown in the output, only the parallel region event, which lies outside the loop itself, is captured.
APEX tutorial, © Copyright 2023, University of Oregon. For more information on APEX, see https://github.com/UO-OACISS/apex