# `nstreams_device_selection` Sample

The `nstreams_device_selection` sample demonstrates how to use the Intel® oneAPI Base Toolkit (Base Kit) and the Intel® oneAPI DPC++ Library (oneDPL) found in the Base Kit to apply device selection policies in a simple application based on nstreams.

For comprehensive instructions, see the [Intel® oneAPI Programming Guide](https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/current/overview.html) and search based on relevant terms noted in the comments.

| Property | Description |
|:--- |:--- |
| What you will learn | How to offload computation to specific devices and use policies to apply different dynamic offload strategies. |
| Time to complete | 30 minutes |

## Purpose

This sample performs a simple element-wise parallel computation on three vectors: `A`, `B`, and `C`. For each element `i`, it computes `A[i] += B[i] + scalar * C[i]`. More information is available on the [Optimizing Memory Bandwidth on Stream Triad](https://www.intel.com/content/www/us/en/developer/articles/technical/optimizing-memory-bandwidth-on-stream-triad.html) page. The first version of the code is a simple implementation of device offload using SYCL*. The second version shows how to introduce Dynamic Device Selection and uses device-specific policies that you choose by supplying different arguments when you invoke the application.
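
A serial reference of the same update is handy for checking device results. The sketch below is not code from the sample: the element type `float` and the function names `nstream_reference` and `expected_after` are assumptions for illustration. Because each iteration adds `B[i] + scalar * C[i]` to `A[i]`, starting from `A[i] = a0` with constant `B` and `C`, every element equals `a0 + k * (B + scalar * C)` after `k` iterations, which gives a closed-form value to validate against.

```cpp
#include <cstddef>
#include <vector>

// One nstream iteration on the host: A[i] += B[i] + scalar * C[i].
// Hypothetical helper; assumes A, B and C have the same length.
void nstream_reference(std::vector<float>& A, const std::vector<float>& B,
                       const std::vector<float>& C, float scalar) {
  for (std::size_t i = 0; i < A.size(); ++i) {
    A[i] += B[i] + scalar * C[i];
  }
}

// Hypothetical helper: closed-form value of every A[i] after `iterations`
// updates, when A starts at a0 and B, C hold the constants b and c.
double expected_after(int iterations, double a0, double b, double c,
                      double scalar) {
  return a0 + iterations * (b + scalar * c);
}
```

For example, with `A = 0`, `B = C = 2`, and `scalar = 3`, three iterations leave every element at `3 * (2 + 3 * 2) = 24`.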

The sample includes two versions of the nstreams project:
1. `1_nstreams_sycl.cpp`: a basic SYCL implementation that creates a kernel targeting the system's CPU.
2. `2_nstreams_policies.cpp`: a version of the sample that supports five policies:
   1. Fixed Resource Policy (CPU)
   2. Fixed Resource Policy (GPU)
   3. Round Robin Policy (CPU/GPU)
   4. Dynamic Load Policy (CPU/GPU)
   5. Auto Tune Policy (CPU/GPU)

The policies are useful in different situations:
1. **Fixed Resource (CPU):** The simplest implementation, reported as "Static Policy (CPU)" in the program output. Starting with a fixed CPU policy can help because debugging and troubleshooting are considerably easier.
2. **Fixed Resource (GPU):** An incremental step that simply designates the GPU for the offload kernel, isolating functionality to help triage any problems that arise when targeting the GPU.
3. **Round Robin:** Assigns each submission to the next device in the "universe" (the set of candidate devices), cycling through them in order. This is particularly beneficial on *multi-GPU systems*: performance benefits may not be realized on single-GPU platforms, but can scale on multi-GPU systems.
4. **Dynamic Load:** Selects the device that has the most available capacity at that moment, based on the number of unfinished submissions. This can be useful for offloading kernels of varying cost to devices of varying performance.
5. **Auto Tune:** Samples the run-time performance of the kernel on the available devices before selecting a final device to use. Because the choice is based on run-time performance history, this policy is useful only for kernels whose performance is stable.
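
The fixed, round robin, and auto-tune strategies can be modeled in a few lines of standard C++. This is a simplified, hypothetical sketch: `FixedSelector`, `RoundRobinSelector`, and `AutoTuneSelector` are illustrative names, not oneDPL types; devices are plain indices into the universe; and the auto-tune model assumes the caller reports each run's measured time. The oneDPL policies differ in detail, so treat this only as an illustration of the selection logic described above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a fixed policy: always selects the same device.
struct FixedSelector {
  std::size_t offset;  // index of the device in the universe
  std::size_t select() const { return offset; }
};

// Hypothetical model of round robin: cycles 0, 1, ..., n-1, 0, 1, ...
// Requires num_devices > 0.
class RoundRobinSelector {
 public:
  explicit RoundRobinSelector(std::size_t num_devices) : n_(num_devices) {}
  std::size_t select() {
    std::size_t device = next_;
    next_ = (next_ + 1) % n_;
    return device;
  }

 private:
  std::size_t n_;
  std::size_t next_ = 0;
};

// Hypothetical model of auto-tune: tries each device `samples_per_device`
// times while the caller reports measured run times, then keeps choosing the
// device with the lowest average time. Requires num_devices > 0.
class AutoTuneSelector {
 public:
  AutoTuneSelector(std::size_t num_devices, std::size_t samples_per_device)
      : total_(num_devices, 0.0),
        count_(num_devices, 0),
        samples_(samples_per_device == 0 ? 1 : samples_per_device) {}

  std::size_t select() const {
    // Profiling phase: pick the first device that still needs samples.
    for (std::size_t d = 0; d < count_.size(); ++d)
      if (count_[d] < samples_) return d;
    // Tuned phase: pick the device with the lowest average run time.
    std::size_t best = 0;
    for (std::size_t d = 1; d < count_.size(); ++d)
      if (total_[d] / count_[d] < total_[best] / count_[best]) best = d;
    return best;
  }

  // Records the measured run time of one submission on `device`.
  void report(std::size_t device, double seconds) {
    total_[device] += seconds;
    ++count_[device];
  }

 private:
  std::vector<double> total_;
  std::vector<std::size_t> count_;
  std::size_t samples_;
};
```

With two devices, `RoundRobinSelector` returns 0, 1, 0, 1, and so on, while `AutoTuneSelector(2, 1)` returns device 0, then device 1, and from then on whichever device reported the lower average time.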

Detailed descriptions of the policies are available in the [Intel® oneAPI DPC++ Library Developer Guide and Reference](https://www.intel.com/content/www/us/en/docs/onedpl/developer-guide/current/policies.html).

> **Note**: Given the simplicity of this example, you may not see performance benefits, depending on the available devices.

Dynamic Device Selection supports customization, allowing frameworks and application developers to define custom logic for making device selections. Complete reference documentation is available in the [oneAPI DPC++ Library Developer Guide](https://www.intel.com/content/www/us/en/docs/onedpl/developer-guide/2022-2/overview.html).

## Key Implementation Details

The code demonstrates the following:
- Fixed resource policies (CPU and GPU).
- Dynamic policies: Round Robin, Dynamic Load, and Auto Tune.
- The basic structure of Dynamic Device Selection code: include the header, use the namespace, define the universe of devices, set up the policies, wrap the kernel in a function, and return an event from that function. **Note: a returned event is required for all Dynamic Device Selection usage.**
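
The event requirement follows from how dynamic policies work: a policy such as Dynamic Load can only count unfinished submissions if it can observe when each one completes. The plain-C++ analogue below illustrates this; it is hypothetical and not the oneDPL API. `Event` stands in for `sycl::event` (here just a shared completion flag), devices are indices, and `LeastLoadedSelector` is a made-up name.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for sycl::event: a shared completion flag that the
// "device" sets when the work finishes.
struct Event {
  std::shared_ptr<bool> done = std::make_shared<bool>(false);
  bool complete() const { return *done; }
};

// Hypothetical selector that picks the device with the fewest unfinished
// submissions. It can only count them because every submitted function
// returns an Event.
class LeastLoadedSelector {
 public:
  explicit LeastLoadedSelector(std::size_t num_devices)
      : pending_(num_devices) {}

  // Selects a device, runs `kernel` on it, and keeps the returned Event.
  Event submit(const std::function<Event(std::size_t device)>& kernel) {
    std::size_t best = 0;
    for (std::size_t d = 1; d < pending_.size(); ++d)
      if (unfinished(d) < unfinished(best)) best = d;
    Event ev = kernel(best);
    pending_[best].push_back(ev);
    last_ = best;
    return ev;
  }

  std::size_t last_device() const { return last_; }

  // Forgets completed Events for `device` and returns how many remain.
  std::size_t unfinished(std::size_t device) {
    auto& evs = pending_[device];
    evs.erase(std::remove_if(evs.begin(), evs.end(),
                             [](const Event& e) { return e.complete(); }),
              evs.end());
    return evs.size();
  }

 private:
  std::vector<std::vector<Event>> pending_;
  std::size_t last_ = 0;
};
```

In a SYCL program, the role of `Event` would be played by the `sycl::event` returned from `queue.submit` or `queue.parallel_for` inside the wrapped kernel function.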

## Building the `nstreams_device_selection` Program for CPU and GPU

> **Note**: If you have not already done so, set up your CLI
> environment by sourcing the `setvars` script located in
> the root of your oneAPI installation.
>
> Linux:
> - For system-wide installations: `. /opt/intel/oneapi/setvars.sh`
> - For private installations: `. ~/intel/oneapi/setvars.sh`
>
> Windows:
> - `C:\Program Files (x86)\Intel\oneAPI\setvars.bat`
>
> For more information on environment variables, see Use the setvars Script for [Linux or macOS](https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2023-1/use-the-setvars-script-with-linux-or-macos.html#GUID-D01C791A-E72A-4EA5-A45A-AEF22F1E8506), or [Windows](https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2023-1/use-the-setvars-script-with-windows.html#GUID-A76C1E1B-5235-4A16-9AA3-F5BD35F8C7F1).

### Using Visual Studio Code* (Optional)

You can use Visual Studio Code (VS Code) extensions to set your environment, create launch configurations, and browse and download samples.

The basic steps to build and run a sample using VS Code include:
- Download a sample using the extension **Code Sample Browser for Intel® oneAPI Toolkits**.
- Configure the oneAPI environment with the extension **Environment Configurator for Intel® oneAPI Toolkits**.
- Open a terminal in VS Code (**Terminal > New Terminal**).
- Run the sample in the VS Code terminal using the instructions below.

To learn more about the extensions and how to configure the oneAPI environment, see the
[Using Visual Studio Code with Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/docs/oneapi/user-guide-vs-code/current/overview.html).

### On Linux*
Perform the following steps:
1. Build the program using the following `cmake` commands.
   ```
   $ mkdir build
   $ cd build
   $ cmake ..
   $ make
   ```

2. Run the program.
   ```
   $ make run_all
   ```
   > **Note**: By default, `make run_all` runs on CPU devices only. Use `sycl-ls` to see the devices available on your target system.

   To invoke an application manually, supply a vector length. The examples below use 1000.

   For the basic SYCL implementation:
   ```
   $ ./1_nstreams_sycl 1000
   ```

   For Dynamic Device Selection, also supply a policy number: `./2_nstreams_policies <vector length> <policy>`. For example, to use the Fixed Resource Policy (CPU):
   ```
   $ ./2_nstreams_policies 1000 1
   ```

   | Arg | Dynamic Device Selection Policy |
   |:--- |:--- |
   | 1 | Fixed Resource Policy (CPU) |
   | 2 | Fixed Resource Policy (GPU) |
   | 3 | Round Robin Policy |
   | 4 | Dynamic Load Policy |
   | 5 | Auto Tune Policy |

3. Clean the program (optional).
   ```
   $ make clean
   ```

If an error occurs, you can get more details by running `make` with the `VERBOSE=1` argument:
```
make VERBOSE=1
```

### Troubleshooting
If you receive an error message, troubleshoot the problem using the Diagnostics Utility for Intel® oneAPI Toolkits, which provides system checks to find missing dependencies and permissions errors. See the [Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html).

### On Windows* Using Visual Studio* Version 2019 or Newer

- Build the program using VS2019 or VS2022:
  - Right-click the solution file and open it using the VS2019 or VS2022 IDE.
  - Right-click the project in Solution Explorer and select **Set as Startup Project**.
  - Select the correct configuration from the drop-down list in the top menu.
  - Right-click the project in Solution Explorer and select **Rebuild**.
  - From the top menu, select **Debug > Start without Debugging**.

  > **Note**: Use Release mode for better performance.

- Build the program using MSBuild:
  - Open "x64 Native Tools Command Prompt for VS2019" or "x64 Native Tools Command Prompt for VS2022".
  - Run the following command: `MSBuild "nstreams_device_selection.sln" /t:Rebuild /p:Configuration="Release"`

### Application Parameters

You can run the individual nstreams executables and modify parameters from the command line:

`./<executable_name> <vector length> <policy>`

For example:

```
$ ./2_nstreams_policies 1000 2
```

Where:

- `vector length`: The size of the `A`, `B`, and `C` vectors.
- `policy`: The Dynamic Device Selection policy to use (valid only for `2_nstreams_policies`).
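
The two arguments map directly onto the policy table. The following hypothetical sketch shows one way to validate them in standard C++; `Options`, `parse_options`, and `policy_name` are illustrative names, and the sample's own argument handling may differ.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical parsed form of `<vector length> <policy>`.
struct Options {
  std::size_t length;  // size of the A, B and C vectors
  int policy;          // 1-5, as in the policy table
};

// Returns the policy name for argument values 1-5, or "" for anything else.
std::string policy_name(int arg) {
  switch (arg) {
    case 1: return "Fixed Resource Policy (CPU)";
    case 2: return "Fixed Resource Policy (GPU)";
    case 3: return "Round Robin Policy";
    case 4: return "Dynamic Load Policy";
    case 5: return "Auto Tune Policy";
    default: return "";
  }
}

// Parses the two command-line strings; throws std::invalid_argument on
// non-numeric input, a non-positive length, or an unknown policy.
Options parse_options(const std::string& length_arg,
                      const std::string& policy_arg) {
  std::size_t pos = 0;
  long long len = std::stoll(length_arg, &pos);
  if (pos != length_arg.size() || len <= 0)
    throw std::invalid_argument("vector length must be a positive integer");
  int policy = std::stoi(policy_arg, &pos);
  if (pos != policy_arg.size() || policy_name(policy).empty())
    throw std::invalid_argument("policy must be a number from 1 to 5");
  return {static_cast<std::size_t>(len), policy};
}
```

For example, `parse_options("1000", "2")` yields a length of 1000 and policy 2 (Fixed Resource Policy (GPU)), while a policy of `7` is rejected.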

## Example Output

```
Using Static Policy (CPU) to iterate on CPU device with vector length: 10000
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
...
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

 Rate: larger better (MB/s): 120.042
 Avg time: lower better (ns): 1.33287e+06
```
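
The two reported figures are related by rate (MB/s) = 10^-6 × bytes / seconds. The numbers above are consistent with 16 bytes moved per element per iteration (10,000 × 16 = 160,000 bytes in 1.33287 ms gives about 120.04 MB/s), for example four 4-byte accesses (read `A`, `B`, `C`, write `A`). That byte count is an inference from this output, not something stated in the sample source. The hypothetical helper below reproduces the arithmetic.

```cpp
#include <cstddef>

// Hypothetical helper: bandwidth in MB/s (1 MB = 1e6 bytes) from a vector
// length, an ASSUMED number of bytes moved per element, and an average
// iteration time in nanoseconds.
double rate_mb_per_s(std::size_t length, std::size_t bytes_per_element,
                     double avg_time_ns) {
  double bytes = static_cast<double>(length) * bytes_per_element;
  double seconds = avg_time_ns * 1e-9;
  return 1e-6 * bytes / seconds;
}
```

Calling `rate_mb_per_s(10000, 16, 1.33287e6)` gives roughly 120.04, matching the reported rate.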

## License

Code samples are licensed under the MIT license. See
[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details.

Third-party program licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt).