HIP Example

Kevin Huck edited this page Feb 26, 2023 · 3 revisions

This example introduces APEX using the HIP programming model.

APEX is integrated with the Roctracer, Rocprofiler, RocTX and ROCm-SMI libraries for HIP measurement support.

Source Code

The example is a matrix transpose program written in HIP.

It makes typical HIP API calls and also has RocTX instrumentation added.

Running the HIP example

The apex_exec wrapper script has several options for supporting HIP programs:

    --apex:hip                    enable HIP/ROCTracer measurement (default: off)
    --apex:hip_metrics            enable HIP/ROCProfiler metric support (default: off)
    --apex:hip_driver             enable HIP/ROCTracer KSA driver API callbacks (default: off)
    --apex:hip_details            enable per-kernel statistics where available (default: off)
    --apex:monitor_gpu            enable GPU monitoring services (CUDA NVML, ROCm SMI)
    --apex:gpu_memory             enable GPU memory wrapper support
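
The options can be combined on one command line, as the runs on this page do. As a minimal sketch, the helper below assembles such a command from the option names listed above. The function name `apex_command` and the `KNOWN_OPTIONS` set are hypothetical, introduced only for illustration, and are not part of APEX.

```python
import shlex

# Option names copied from the apex_exec help text above, plus "tasktree",
# which appears in the first run on this page. This set is illustrative only.
KNOWN_OPTIONS = {
    "hip", "hip_metrics", "hip_driver", "hip_details",
    "monitor_gpu", "gpu_memory", "tasktree",
}

def apex_command(program, options):
    """Return the argv list for running `program` under apex_exec (hypothetical helper)."""
    unknown = [o for o in options if o not in KNOWN_OPTIONS]
    if unknown:
        raise ValueError(f"unknown apex_exec option(s): {unknown}")
    return ["apex_exec", *[f"--apex:{o}" for o in options], program]

cmd = apex_command("./build/bin/MatrixTranspose", ["hip", "hip_details"])
print(shlex.join(cmd))
# prints: apex_exec --apex:hip --apex:hip_details ./build/bin/MatrixTranspose
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where `apex_exec` is on the `PATH`.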

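To enable basic HIP support, use the --apex:hip flag. The run below also passes --apex:tasktree, which writes apex_tasktree.csv; apex-treesummary.py then renders that file as an ASCII tree and a dot file: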

[khuck@gilgamesh apex-tutorial]$ apex_exec --apex:hip --apex:tasktree ./build/bin/MatrixTranspose
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Device name 
 System major 9
 System minor 0
## Iteration (9) #################
PASSED!
## Iteration (8) #################
PASSED!
## Iteration (7) #################
PASSED!
## Iteration (6) #################
PASSED!
## Iteration (5) #################
PASSED!
## Iteration (4) #################
PASSED!
## Iteration (3) #################
PASSED!
## Iteration (2) #################
PASSED!
## Iteration (1) #################
PASSED!
## Iteration (0) #################
PASSED!

Start Date/Time: 26/02/2023 13:52:39
Elapsed time: 1.56751 seconds
Total processes detected: 1
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 2
Available CPU time on rank 0: 3.13502 seconds
Available CPU time on all ranks: 3.13502 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      2     2.39     2.39 
                                         CPU Guest % :      1     0.00     0.00 
                                      CPU I/O Wait % :      1     0.00     0.00 
                                           CPU IRQ % :      1     0.04     0.04 
                                          CPU Idle % :      1    96.75    96.75 
                                          CPU Nice % :      1     0.00     0.00 
                                         CPU Steal % :      1     0.00     0.00 
                                        CPU System % :      1     0.79     0.79 
                                          CPU User % :      1     2.40     2.40 
                                      CPU soft IRQ % :      1     0.02     0.02 
                                         DRAM Energy :      1     1.00     1.00 
                     GPU: Bytes Allocated: hipMalloc :      2 2.68e+08 2.68e+08 
                           GPU: Bytes Freed: hipFree :      2 2.68e+08 2.68e+08 
                         GPU: CopyDeviceToHost Bytes :     10     0.00     0.00 
                         GPU: CopyHostToDevice Bytes :     10     0.00     0.00 
                 GPU: Total Bytes Occupied on Device :      4 2.68e+08 5.37e+08 
                                    Package-0 Energy :      1    81.00    81.00 
                                      status:Threads :      2     3.00     4.00 
                                    status:VmData kB :      2 8.88e+05 1.36e+06 
                                     status:VmExe kB :      2    32.00    32.00 
                                     status:VmHWM kB :      2 3.98e+05 7.56e+05 
                                     status:VmLck kB :      2     0.00     0.00 
                                     status:VmLib kB :      2 1.37e+05 1.37e+05 
                                     status:VmPTE kB :      2  1070.00  1780.00 
                                    status:VmPeak kB :      2 5.74e+06 1.07e+07 
                                     status:VmPin kB :      2     0.00     0.00 
                                     status:VmRSS kB :      2 3.98e+05 7.56e+05 
                                    status:VmSize kB :      2 5.65e+06 1.06e+07 
                                     status:VmStk kB :      2   136.00   136.00 
                                    status:VmSwap kB :      2     0.00     0.00 
                   status:nonvoluntary_ctxt_switches :      2     7.00    12.00 
                      status:voluntary_ctxt_switches :      2    43.00    56.00 
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total
--------------------------------------------------------------------------------
           GPU: matrixTranspose(float*, float*, int) :     10     0.01     0.05 
                               GPU: CopyDeviceToHost :     10     0.00     0.01 
                               GPU: CopyHostToDevice :     10     0.00     0.01 
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     1.57     1.57 
        int apex_preload_main(int, char **, char **) :      1     1.52     1.52 
                                      Initialization :      1     0.79     0.79 
                         matrixTransposeCPUReference :      1     0.76     0.76 
                                    While Loop range :     10     0.07     0.72 
                                           hipMemcpy :     20     0.02     0.48 
                                     Validation Step :     10     0.02     0.19 
                                      Memcpy wrapper :     10     0.02     0.17 
                                LaunchKernel wrapper :     10     0.01     0.05 
                                hipDeviceSynchronize :     31     0.00     0.05 
                                         Memory Free :      1     0.01     0.01 
                                     hipLaunchKernel :     10     0.00     0.00 
                                             hipFree :      2     0.00     0.00 
                                           hipMalloc :      2     0.00     0.00 
                              hipGetDeviceProperties :      1     0.00     0.00 
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 140
Writing: .//apex_tasktree.csv
[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 22 rows
Found 0 ranks, with max graph node index of 21 and depth of 5
building common tree...
Rank 0 ...
1-> 1.568 - 100.000% [1] {min=1.568, max=1.568, mean=1.568, threads=1} APEX MAIN
1 |-> 1.519 - 96.898% [1] {min=1.519, max=1.519, mean=1.519, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.788 - 50.250% [1] {min=0.788, max=0.788, mean=0.788, threads=1} Initialization
1 | | |-> 0.763 - 48.660% [1] {min=0.763, max=0.763, mean=0.763, threads=1} matrixTransposeCPUReference
1 | | |-> 0.000 - 0.011% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipMalloc
1 | |-> 0.723 - 46.105% [10] {min=0.723, max=0.723, mean=0.072, threads=1} While Loop range
1 | | |-> 0.308 - 19.657% [10] {min=0.308, max=0.308, mean=0.031, threads=1} hipMemcpy
1 | | | |-> 0.012 - 0.756% [10] {min=0.012, max=0.012, mean=0.001, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.189 - 12.050% [10] {min=0.189, max=0.189, mean=0.019, threads=1} Validation Step
1 | | |-> 0.173 - 11.048% [10] {min=0.173, max=0.173, mean=0.017, threads=1} Memcpy wrapper
1 | | | |-> 0.173 - 11.040% [10] {min=0.173, max=0.173, mean=0.017, threads=1} hipMemcpy
1 | | | | |-> 0.012 - 0.763% [10] {min=0.012, max=0.012, mean=0.001, threads=1} GPU: CopyDeviceToHost
1 | | | |-> 0.000 - 0.001% [10] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | | |-> 0.052 - 3.307% [10] {min=0.052, max=0.052, mean=0.005, threads=1} LaunchKernel wrapper
1 | | | |-> 0.051 - 3.245% [10] {min=0.051, max=0.051, mean=0.005, threads=1} hipDeviceSynchronize
1 | | | |-> 0.001 - 0.049% [10] {min=0.001, max=0.001, mean=0.000, threads=1} hipLaunchKernel
1 | | | | |-> 0.051 - 3.230% [10] {min=0.051, max=0.051, mean=0.005, threads=1} GPU: matrixTranspose(float*, float*, int)
1 | | |-> 0.000 - 0.003% [10] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.007 - 0.426% [1] {min=0.007, max=0.007, mean=0.007, threads=1} Memory Free
1 | | |-> 0.000 - 0.013% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipFree
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipGetDeviceProperties
23 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
[khuck@gilgamesh apex-tutorial]$ dot -Tsvg -O tasktree.dot 

HIP base support task tree
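
A few of the reported numbers can be cross-checked by hand. The sketch below uses values copied from the run above. It assumes that "Available CPU time" is the elapsed time multiplied by the number of observed worker threads, which is consistent with every run on this page but is inferred from the output, not taken from the APEX documentation.

```python
# Values copied from the first run's summary and task tree above.
elapsed_s = 1.56751      # "Elapsed time"
worker_threads = 2       # "Worker Threads observed on rank 0"

# Assumption: available CPU time = elapsed wall time * observed worker threads.
available_cpu_s = elapsed_s * worker_threads
print(f"{available_cpu_s:.5f}")   # prints 3.13502

# Task tree row: "While Loop range" ran 10 times for 0.723 s in total,
# under a root ("APEX MAIN") of 1.568 s.
root_total_s = 1.568
loop_total_s, loop_calls = 0.723, 10

loop_mean_s = loop_total_s / loop_calls            # per-call mean
loop_share = 100.0 * loop_total_s / root_total_s   # percent of the root timer
print(f"{loop_mean_s:.3f} {loop_share:.1f}")       # prints 0.072 46.1
```

The tree prints 46.105% for this row. The small difference comes from using the rounded three-decimal times shown in the listing.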

HIP Details

Adding the --apex:hip_details flag captures launch details for each kernel invocation, reported as extra counters for the block dimensions, grid dimensions, and shared-memory size:

[khuck@gilgamesh apex-tutorial]$ apex_exec --apex:hip --apex:hip_details ./build/bin/MatrixTranspose
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Device name 
 System major 9
 System minor 0
## Iteration (9) #################
PASSED!
## Iteration (8) #################
PASSED!
## Iteration (7) #################
PASSED!
## Iteration (6) #################
PASSED!
## Iteration (5) #################
PASSED!
## Iteration (4) #################
PASSED!
## Iteration (3) #################
PASSED!
## Iteration (2) #################
PASSED!
## Iteration (1) #################
PASSED!
## Iteration (0) #################
PASSED!

Start Date/Time: 26/02/2023 13:56:41
Elapsed time: 1.68719 seconds
Total processes detected: 1
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 2
Available CPU time on rank 0: 3.37438 seconds
Available CPU time on all ranks: 3.37438 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      2     2.20     2.20 
                                         CPU Guest % :      1     0.00     0.00 
                                      CPU I/O Wait % :      1     0.00     0.00 
                                           CPU IRQ % :      1     0.02     0.02 
                                          CPU Idle % :      1    96.93    96.93 
                                          CPU Nice % :      1     0.00     0.00 
                                         CPU Steal % :      1     0.00     0.00 
                                        CPU System % :      1     0.46     0.46 
                                          CPU User % :      1     2.57     2.57 
                                      CPU soft IRQ % :      1     0.02     0.02 
                                         DRAM Energy :      1     0.00     0.00 
                     GPU: Bytes Allocated: hipMalloc :      2 2.68e+08 2.68e+08 
                           GPU: Bytes Freed: hipFree :      2 2.68e+08 2.68e+08 
                         GPU: CopyDeviceToHost Bytes :     10     0.00     0.00 
                         GPU: CopyHostToDevice Bytes :     10     0.00     0.00 
                 GPU: Total Bytes Occupied on Device :      4 2.68e+08 5.37e+08 
GPU: dimBlocks.X: matrixTranspose(float*, float*, i… :     10     4.00     4.00 
GPU: dimBlocks.Y: matrixTranspose(float*, float*, i… :     10     4.00     4.00 
GPU: dimBlocks.Z: matrixTranspose(float*, float*, i… :     10     1.00     1.00 
GPU: numBlocks.X: matrixTranspose(float*, float*, i… :     10  2048.00  2048.00 
GPU: numBlocks.Y: matrixTranspose(float*, float*, i… :     10  2048.00  2048.00 
GPU: numBlocks.Z: matrixTranspose(float*, float*, i… :     10     1.00     1.00 
GPU: sharedMemBytes: matrixTranspose(float*, float*… :     10     0.00     0.00 
                                    Package-0 Energy :      1    83.00    83.00 
                                      status:Threads :      2     3.00     4.00 
                                    status:VmData kB :      2 8.88e+05 1.36e+06 
                                     status:VmExe kB :      2    32.00    32.00 
                                     status:VmHWM kB :      2 3.99e+05 7.57e+05 
                                     status:VmLck kB :      2     0.00     0.00 
                                     status:VmLib kB :      2 1.37e+05 1.37e+05 
                                     status:VmPTE kB :      2  1078.00  1792.00 
                                    status:VmPeak kB :      2 5.74e+06 1.07e+07 
                                     status:VmPin kB :      2     0.00     0.00 
                                     status:VmRSS kB :      2 3.99e+05 7.57e+05 
                                    status:VmSize kB :      2 5.65e+06 1.06e+07 
                                     status:VmStk kB :      2   136.00   136.00 
                                    status:VmSwap kB :      2     0.00     0.00 
                   status:nonvoluntary_ctxt_switches :      2     4.00     7.00 
                      status:voluntary_ctxt_switches :      2    42.50    57.00 
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total
--------------------------------------------------------------------------------
           GPU: matrixTranspose(float*, float*, int) :     10     0.01     0.05 
                               GPU: CopyDeviceToHost :     10     0.00     0.02 
                               GPU: CopyHostToDevice :     10     0.00     0.01 
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     1.69     1.69 
        int apex_preload_main(int, char **, char **) :      1     1.64     1.64 
                                    While Loop range :     10     0.09     0.85 
                                      Initialization :      1     0.77     0.77 
                         matrixTransposeCPUReference :      1     0.75     0.75 
                                           hipMemcpy :     20     0.02     0.48 
                                     Validation Step :     10     0.03     0.32 
                                      Memcpy wrapper :     10     0.02     0.17 
                                LaunchKernel wrapper :     10     0.01     0.05 
                                hipDeviceSynchronize :     31     0.00     0.05 
                                         Memory Free :      1     0.01     0.01 
                                     hipLaunchKernel :     10     0.00     0.00 
                                             hipFree :      2     0.00     0.00 
                                           hipMalloc :      2     0.00     0.00 
                              hipGetDeviceProperties :      1     0.00     0.00 
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 140
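
The launch-geometry counters can be related to the memory counters in the same output. The sketch below assumes the kernel uses one thread per matrix element and that each matrix is square and holds 4-byte floats. The `float*` kernel arguments suggest the element type, but the one-thread-per-element mapping is an assumption, since the source code is not shown here.

```python
# Counter values copied from the --apex:hip_details output above.
dim_blocks = (4, 4, 1)         # threads per block: dimBlocks.X/Y/Z
num_blocks = (2048, 2048, 1)   # blocks per grid:   numBlocks.X/Y/Z

# Assumption: one thread per element, so the grid covers the whole matrix.
width = dim_blocks[0] * num_blocks[0]
height = dim_blocks[1] * num_blocks[1]
bytes_per_matrix = width * height * 4   # 4 bytes per float

print(width, height)                    # prints 8192 8192
print(f"{bytes_per_matrix:.2e}")        # prints 2.68e+08
print(f"{2 * bytes_per_matrix:.2e}")    # prints 5.37e+08
```

Under these assumptions, one matrix matches the 2.68e+08 mean of "GPU: Bytes Allocated: hipMalloc", and the two allocations together match the 5.37e+08 maximum of "GPU: Total Bytes Occupied on Device".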

GPU monitoring with ROCm-SMI

Adding the --apex:monitor_gpu flag enables ROCm-SMI support, which adds sampled per-device counters such as utilization, memory use, power, temperature, and voltage:

[khuck@gilgamesh apex-tutorial]$ apex_exec --apex:hip --apex:monitor_gpu ./build/bin/MatrixTranspose
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Device name 
 System major 9
 System minor 0
## Iteration (9) #################
PASSED!
## Iteration (8) #################
PASSED!
## Iteration (7) #################
PASSED!
## Iteration (6) #################
PASSED!
## Iteration (5) #################
PASSED!
## Iteration (4) #################
PASSED!
## Iteration (3) #################
PASSED!
## Iteration (2) #################
PASSED!
## Iteration (1) #################
PASSED!
## Iteration (0) #################
PASSED!

Start Date/Time: 26/02/2023 14:01:49
Elapsed time: 2.87411 seconds
Total processes detected: 1
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 2
Available CPU time on rank 0: 5.74821 seconds
Available CPU time on all ranks: 5.74821 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      3    79.69    79.69 
                                         CPU Guest % :      2     0.00     0.00 
                                      CPU I/O Wait % :      2     0.00     0.00 
                                           CPU IRQ % :      2     0.53     0.53 
                                          CPU Idle % :      2     2.10     2.33 
                                          CPU Nice % :      2     0.00     0.00 
                                         CPU Steal % :      2     0.00     0.00 
                                        CPU System % :      2     0.76     0.88 
                                          CPU User % :      2    96.59    96.94 
                                      CPU soft IRQ % :      2     0.04     0.04 
                                         DRAM Energy :      2     4.00     4.00 
                     GPU: Bytes Allocated: hipMalloc :      2 2.68e+08 2.68e+08 
                           GPU: Bytes Freed: hipFree :      2 2.68e+08 2.68e+08 
                         GPU: CopyDeviceToHost Bytes :     10     0.00     0.00 
                         GPU: CopyHostToDevice Bytes :     10     0.00     0.00 
                       GPU: Device 0 Device Busy (%) :      3     0.00     0.00 
                       GPU: Device 0 Memory Busy (%) :      3     0.00     0.00 
                 GPU: Device 0 Memory Reserved Pages :      3     0.00     0.00 
                 GPU: Device 0 Memory Used, GTT (GB) :      3     0.01     0.01 
                GPU: Device 0 Memory Used, VRAM (GB) :      3     0.19     0.55 
           GPU: Device 0 Memory Used, Vis. VRAM (GB) :      3     0.19     0.55 
                             GPU: Device 0 Power (W) :      3    43.67    47.00 
                       GPU: Device 0 Temperature (C) :      3    33.00    33.00 
                           GPU: Device 0 Voltage (V) :      3     0.79     0.79 
                 GPU: Total Bytes Occupied on Device :      4 2.68e+08 5.37e+08 
                                    Package-0 Energy :      2   170.00   171.00 
                                      status:Threads :      3     3.33     4.00 
                                    status:VmData kB :      3 1.00e+06 1.36e+06 
                                     status:VmExe kB :      3    32.00    32.00 
                                     status:VmHWM kB :      3 5.46e+05 1.02e+06 
                                     status:VmLck kB :      3     0.00     0.00 
                                     status:VmLib kB :      3 1.37e+05 1.37e+05 
                                     status:VmPTE kB :      3  1366.67  2300.00 
                                    status:VmPeak kB :      3 7.23e+06 1.07e+07 
                                     status:VmPin kB :      3     0.00     0.00 
                                     status:VmRSS kB :      3 5.46e+05 1.02e+06 
                                    status:VmSize kB :      3 7.08e+06 1.06e+07 
                                     status:VmStk kB :      3   141.33   152.00 
                                    status:VmSwap kB :      3     0.00     0.00 
                   status:nonvoluntary_ctxt_switches :      3    35.67    71.00 
                      status:voluntary_ctxt_switches :      3    57.33   111.00 
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total
--------------------------------------------------------------------------------
           GPU: matrixTranspose(float*, float*, int) :     10     0.01     0.05 
                               GPU: CopyDeviceToHost :     10     0.00     0.01 
                               GPU: CopyHostToDevice :     10     0.00     0.01 
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     2.87     2.87 
        int apex_preload_main(int, char **, char **) :      1     2.79     2.79 
                                    While Loop range :     10     0.17     1.68 
                                           hipMemcpy :     20     0.06     1.21 
                                      Initialization :      1     1.10     1.10 
                         matrixTransposeCPUReference :      1     1.06     1.06 
                                      Memcpy wrapper :     10     0.04     0.41 
                                     Validation Step :     10     0.04     0.39 
                                LaunchKernel wrapper :     10     0.01     0.07 
                                hipDeviceSynchronize :     31     0.00     0.06 
                                         Memory Free :      1     0.02     0.02 
                                     hipLaunchKernel :     10     0.00     0.01 
                                           hipMalloc :      2     0.00     0.00 
                                             hipFree :      2     0.00     0.00 
                              hipGetDeviceProperties :      1     0.00     0.00 
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 140

GPU Memory tracking support

Adding the --apex:gpu_memory flag enables memory consumption and leak tracking for all hipMalloc* calls. At the end of execution, any memory leaks are reported in a text file (memory_report.0.txt in the run below).

[khuck@gilgamesh apex-tutorial]$ apex_exec --apex:hip --apex:gpu_memory ./build/bin/MatrixTranspose
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Device name 
 System major 9
 System minor 0
## Iteration (9) #################
PASSED!
## Iteration (8) #################
PASSED!
## Iteration (7) #################
PASSED!
## Iteration (6) #################
PASSED!
## Iteration (5) #################
PASSED!
## Iteration (4) #################
PASSED!
## Iteration (3) #################
PASSED!
## Iteration (2) #################
PASSED!
## Iteration (1) #################
PASSED!
## Iteration (0) #################
PASSED!

Start Date/Time: 26/02/2023 14:04:52
Elapsed time: 1.52486 seconds
Total processes detected: 1
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 2
Available CPU time on rank 0: 3.04971 seconds
Available CPU time on all ranks: 3.04971 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      2    17.12    17.12 
                                         CPU Guest % :      1     0.00     0.00 
                                      CPU I/O Wait % :      1     0.00     0.00 
                                           CPU IRQ % :      1     0.02     0.02 
                                          CPU Idle % :      1    98.51    98.51 
                                          CPU Nice % :      1     0.07     0.07 
                                         CPU Steal % :      1     0.00     0.00 
                                        CPU System % :      1     0.39     0.39 
                                          CPU User % :      1     1.00     1.00 
                                      CPU soft IRQ % :      1     0.01     0.01 
                                         DRAM Energy :      1     0.00     0.00 
                     GPU: Bytes Allocated: hipMalloc :      2 2.68e+08 2.68e+08 
                           GPU: Bytes Freed: hipFree :      2 2.68e+08 2.68e+08 
                         GPU: CopyDeviceToHost Bytes :     10     0.00     0.00 
                         GPU: CopyHostToDevice Bytes :     10     0.00     0.00 
                 GPU: Total Bytes Occupied on Device :      4 2.68e+08 5.37e+08 
                                    Package-0 Energy :      1    68.00    68.00 
                                      status:Threads :      2     3.00     4.00 
                                    status:VmData kB :      2 8.88e+05 1.36e+06 
                                     status:VmExe kB :      2    32.00    32.00 
                                     status:VmHWM kB :      2 3.99e+05 7.57e+05 
                                     status:VmLck kB :      2     0.00     0.00 
                                     status:VmLib kB :      2 1.37e+05 1.37e+05 
                                     status:VmPTE kB :      2  1074.00  1784.00 
                                    status:VmPeak kB :      2 5.74e+06 1.07e+07 
                                     status:VmPin kB :      2     0.00     0.00 
                                     status:VmRSS kB :      2 3.99e+05 7.57e+05 
                                    status:VmSize kB :      2 5.65e+06 1.06e+07 
                                     status:VmStk kB :      2   136.00   136.00 
                                    status:VmSwap kB :      2     0.00     0.00 
                   status:nonvoluntary_ctxt_switches :      2     5.50     9.00 
                      status:voluntary_ctxt_switches :      2    44.50    59.00 
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total|  allocs |  (bytes) |    frees |   (bytes) 
---------------------------------------------------------------------------------------------------------------------
GPU: matrixTranspose(float*, float*, int) [{/proc/s… :     10     0.01     0.05 
                               GPU: CopyDeviceToHost :     10     0.00     0.01 
                               GPU: CopyHostToDevice :     10     0.00     0.01 
---------------------------------------------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total| allocs| (bytes)|  frees | (bytes) 
---------------------------------------------------------------------------------------------------------------------
                                           APEX MAIN :      1     1.52     1.52       0      0      0      0
        int apex_preload_main(int, char **, char **) :      1     1.48     1.48       0      0      0      0
                                      Initialization :      1     0.80     0.80       2 5.37e+08      0      0
                         matrixTransposeCPUReference :      1     0.77     0.77       0      0      0      0
                                    While Loop range :     10     0.07     0.68       0      0      0      0
                                           hipMemcpy :     20     0.02     0.45       0      0      0      0
                                     Validation Step :     10     0.02     0.18       0      0      0      0
                                      Memcpy wrapper :     10     0.01     0.14       0      0      0      0
                                LaunchKernel wrapper :     10     0.01     0.05       0      0      0      0
                                hipDeviceSynchronize :     31     0.00     0.05       0      0      0      0
                                     hipLaunchKernel :     10     0.00     0.00       0      0      0      0
                                         Memory Free :      1     0.00     0.00       0      0      0      0
                                           hipMalloc :      2     0.00     0.00       0      0      0      0
                                             hipFree :      2     0.00     0.00       0      0      2 5.37e+08
                              hipGetDeviceProperties :      1     0.00     0.00       0      0      0      0
---------------------------------------------------------------------------------------------------------------------


---------------------------------------------------------------------------------------------------------------------
                                        Total timers : 140
APEX Memory Report: (see memory_report.0.txt)
sorting 0 leaks by size...
Aggregating leaks by task and writing report...
Reported 0 'actual' leaks.
Expect false positives if memory was freed after exit.
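
The extra allocs and frees columns in the CPU Timers table can be read as a simple balance. In this run the two allocations are listed on the Initialization row and the two frees on the hipFree row. The sketch below is an illustrative check of that balance using the rounded values from the table. It is not how APEX computes its memory report.

```python
# (timer, allocs, alloc_bytes, frees, free_bytes), copied from the rows of the
# CPU Timers table above that have non-zero memory columns.
rows = [
    ("Initialization", 2, 5.37e8, 0, 0.0),
    ("hipFree",        0, 0.0,    2, 5.37e8),
]

total_allocs = sum(r[1] for r in rows)
total_frees = sum(r[3] for r in rows)
outstanding_bytes = sum(r[2] for r in rows) - sum(r[4] for r in rows)

print(total_allocs, total_frees, outstanding_bytes)   # prints 2 2 0.0
```

Every allocated byte is matched by a free, which agrees with the report of 0 leaks.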