CUDA Example

Kevin Huck edited this page Mar 6, 2023 · 1 revision

This example introduces APEX using the CUDA programming model.

APEX is integrated with the CUPTI, NVTX and NVML libraries for CUDA measurement support.

Source Code

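The example is a vector addition and vector subtraction program written in CUDA.

It uses typical CUDA runtime API calls, such as cudaMalloc, cudaMemcpyAsync, kernel launches and cudaFree, all of which appear in the profiles below.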

Running the CUDA example

The apex_exec wrapper script has several options for supporting CUDA programs:

    --apex:cuda                   enable CUDA/CUPTI measurement (default: off)
    --apex:cuda_counters          enable CUDA/CUPTI counter support (default: off)
    --apex:cuda_driver            enable CUDA driver API callbacks (default: off)
    --apex:cuda_details           enable per-kernel statistics where available (default: off)
    --apex:monitor_gpu            enable GPU monitoring services (CUDA NVML, ROCm SMI)
    --apex:gpu_memory             enable GPU memory wrapper support

To enable basic CUDA support, use the --apex:cuda flag:

[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree ./build/bin/apex_vector_cu
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar  6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB

Start Date/Time: 06/03/2023 21:11:06
Elapsed time: 0.519433 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.519433 seconds
Available CPU time on all ranks: 0.519433 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      1     1.29     1.29
                                GPU: Bytes Allocated :      6 2.00e+05 2.00e+05
                                    GPU: Bytes Freed :      6 2.00e+05 2.00e+05
                 GPU: Total Bytes Occupied on Device :     12 3.00e+05 6.00e+05
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 1.09e+06 1.09e+06
                                     status:VmExe kB :      1   460.00   460.00
                                     status:VmHWM kB :      1 1.39e+04 1.39e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 7.76e+04 7.76e+04
                                     status:VmPTE kB :      1   196.00   196.00
                                    status:VmPeak kB :      1 1.36e+06 1.36e+06
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 1.39e+04 1.39e+04
                                    status:VmSize kB :      1 1.30e+06 1.30e+06
                                     status:VmStk kB :      1   144.00   144.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     0.00     0.00
                      status:voluntary_ctxt_switches :      1    33.00    33.00
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total
--------------------------------------------------------------------------------
                                    GPU: Memcpy HtoD :      4     0.00     0.00
                                    GPU: Memcpy DtoH :      2     0.00     0.00
      GPU: VecAdd(int const*, int const*, int*, int) :      2     0.00     0.00
      GPU: VecSub(int const*, int const*, int*, int) :      2     0.00     0.00
                            GPU: Context Synchronize :      2     0.00     0.00
                             GPU: Stream Synchronize :      1     0.00     0.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.52     0.52
          int apex_preload_main(int, char**, char**) :      1     0.51     0.51
                                          cudaMalloc :      6     0.04     0.21
                                     cudaDeviceReset :      1     0.11     0.11
                                     cudaMemcpyAsync :      6     0.00     0.00
                                            cudaFree :      6     0.00     0.00
                                    cudaLaunchKernel :      4     0.00     0.00
                                    cudaStreamCreate :      1     0.00     0.00
                                  cudaGetDeviceCount :      1     0.00     0.00
                               cudaDeviceSynchronize :      2     0.00     0.00
                               cudaStreamSynchronize :      1     0.00     0.00
                                       cudaSetDevice :      1     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 43
Writing: .//apex_tasktree.csv
[kehuck1@mahti-login11 apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 18 rows
Found 0 ranks, with max graph node index of 17 and depth of 3
building common tree...
Rank 0 ...
1-> 0.519 - 100.000% [1] {min=0.519, max=0.519, mean=0.519, threads=1} APEX MAIN
1 |-> 0.505 - 97.312% [1] {min=0.505, max=0.505, mean=0.505, threads=1} int apex_preload_main(int, char**, char**)
1 | |-> 0.213 - 40.951% [6] {min=0.213, max=0.213, mean=0.035, threads=1} cudaMalloc
1 | |-> 0.106 - 20.470% [1] {min=0.106, max=0.106, mean=0.106, threads=1} cudaDeviceReset
1 | |-> 0.001 - 0.116% [6] {min=0.001, max=0.001, mean=0.000, threads=1} cudaMemcpyAsync
1 | | |-> 0.000 - 0.011% [4] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Memcpy HtoD
1 | | |-> 0.000 - 0.005% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Memcpy DtoH
1 | |-> 0.000 - 0.054% [6] {min=0.000, max=0.000, mean=0.000, threads=1} cudaFree
1 | |-> 0.000 - 0.015% [4] {min=0.000, max=0.000, mean=0.000, threads=1} cudaLaunchKernel
1 | | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: VecAdd(int const*, int const*, int*, int)
1 | | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: VecSub(int const*, int const*, int*, int)
1 | |-> 0.000 - 0.005% [1] {min=0.000, max=0.000, mean=0.000, threads=1} cudaStreamCreate
1 | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} cudaGetDeviceCount
1 | |-> 0.000 - 0.003% [2] {min=0.000, max=0.000, mean=0.000, threads=1} cudaDeviceSynchronize
1 | | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Context Synchronize
1 | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} cudaStreamSynchronize
1 | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Stream Synchronize
1 | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} cudaSetDevice
19 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
[kehuck1@mahti-login11 apex-tutorial]$ dot -Tsvg -O tasktree.dot

Figure: CUDA base support task tree
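The ASCII tree printed above can also be post-processed. The following is a minimal sketch, not part of APEX, that parses lines in that format and ranks the calls made directly from main by inclusive time. The regular expression and the helper name `parse_line` are assumptions based only on the output shown here, and the sample is a subset of the lines above.

```python
import re

# Assumed line format, taken from the apex-treesummary.py output shown above:
#   <prefix>-> <seconds> - <percent>% [<calls>] {<stats>} <timer name>
LINE_RE = re.compile(
    r"^(?P<prefix>[\d |]*)-> (?P<secs>[\d.]+) - (?P<pct>[\d.]+)% "
    r"\[(?P<calls>\d+)\] \{[^}]*\} (?P<name>.+)$"
)

def parse_line(line):
    """Return (depth, seconds, percent, calls, name), or None if no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    depth = m.group("prefix").count("|")  # one '|' per nesting level
    return (depth, float(m.group("secs")), float(m.group("pct")),
            int(m.group("calls")), m.group("name"))

sample = """\
1-> 0.519 - 100.000% [1] {min=0.519, max=0.519, mean=0.519, threads=1} APEX MAIN
1 |-> 0.505 - 97.312% [1] {min=0.505, max=0.505, mean=0.505, threads=1} int apex_preload_main(int, char**, char**)
1 | |-> 0.213 - 40.951% [6] {min=0.213, max=0.213, mean=0.035, threads=1} cudaMalloc
1 | |-> 0.106 - 20.470% [1] {min=0.106, max=0.106, mean=0.106, threads=1} cudaDeviceReset
1 | | |-> 0.000 - 0.011% [4] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Memcpy HtoD
"""

rows = [r for r in map(parse_line, sample.splitlines()) if r is not None]
# Depth-2 entries are the CUDA runtime calls made directly from main.
top = sorted((r for r in rows if r[0] == 2), key=lambda r: r[1], reverse=True)
for depth, secs, pct, calls, name in top:
    print(f"{name}: {secs:.3f} s ({pct:.3f}%) over {calls} call(s)")
```

In this run, cudaMalloc and cudaDeviceReset together account for roughly 61% of the wall-clock time, while the kernels and copies themselves are too short to register at this precision.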

CUDA Counters

Adding the --apex:cuda_counters flag captures additional counters for kernel invocations and memory copies, such as copy bandwidth, registers per thread and shared memory use:

[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree --apex:cuda_counters ./build/bin/apex_vector_cu
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar  6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB

Start Date/Time: 06/03/2023 21:20:24
Elapsed time: 0.491052 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.491052 seconds
Available CPU time on all ranks: 0.491052 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      1     2.20     2.20
                  GPU: Bandwidth (GB/s): Memcpy DtoH :      2    18.11    18.77
                  GPU: Bandwidth (GB/s): Memcpy HtoD :      4    14.60    14.71
                                GPU: Bytes Allocated :      6 2.00e+05 2.00e+05
                                    GPU: Bytes Freed :      6 2.00e+05 2.00e+05
                             GPU: Bytes: Memcpy DtoH :      2 2.00e+05 2.00e+05
                             GPU: Bytes: Memcpy HtoD :      4 2.00e+05 2.00e+05
                      GPU: Dynamic Shared Memory (B) :      4     0.00     0.00
                    GPU: Local Memory Per Thread (B) :      4     0.00     0.00
                         GPU: Local Memory Total (B) :      4 1.27e+08 1.27e+08
                           GPU: Registers Per Thread :      4    16.00    16.00
                         GPU: Shared Memory Size (B) :      4 3.28e+04 3.28e+04
                       GPU: Static Shared Memory (B) :      4     0.00     0.00
                 GPU: Total Bytes Occupied on Device :     12 3.00e+05 6.00e+05
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 1.09e+06 1.09e+06
                                     status:VmExe kB :      1   460.00   460.00
                                     status:VmHWM kB :      1 1.39e+04 1.39e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 7.76e+04 7.76e+04
                                     status:VmPTE kB :      1   196.00   196.00
                                    status:VmPeak kB :      1 1.36e+06 1.36e+06
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 1.39e+04 1.39e+04
                                    status:VmSize kB :      1 1.30e+06 1.30e+06
                                     status:VmStk kB :      1   148.00   148.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     1.00     1.00
                      status:voluntary_ctxt_switches :      1    49.00    49.00
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total
--------------------------------------------------------------------------------
                                    GPU: Memcpy HtoD :      4     0.00     0.00
                                    GPU: Memcpy DtoH :      2     0.00     0.00
      GPU: VecAdd(int const*, int const*, int*, int) :      2     0.00     0.00
                            GPU: Context Synchronize :      2     0.00     0.00
      GPU: VecSub(int const*, int const*, int*, int) :      2     0.00     0.00
                             GPU: Stream Synchronize :      1     0.00     0.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.49     0.49
          int apex_preload_main(int, char**, char**) :      1     0.48     0.48
                                          cudaMalloc :      6     0.03     0.20
                                     cudaDeviceReset :      1     0.11     0.11
                                     cudaMemcpyAsync :      6     0.00     0.00
                                            cudaFree :      6     0.00     0.00
                                    cudaLaunchKernel :      4     0.00     0.00
                                    cudaStreamCreate :      1     0.00     0.00
                                  cudaGetDeviceCount :      1     0.00     0.00
                               cudaDeviceSynchronize :      2     0.00     0.00
                               cudaStreamSynchronize :      1     0.00     0.00
                                       cudaSetDevice :      1     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 43
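The GPU timers above all print as 0.00, while the bandwidth counters show that the copies did take measurable time. A rough estimate of the copy duration can be recovered from the bytes and bandwidth counters. This is a sketch under two assumptions: that the timer columns are in seconds (consistent with APEX MAIN matching the elapsed time) and that GB/s means 1e9 bytes per second. The function name `copy_time_us` is made up for illustration.

```python
# Values copied from the counter report above (means over the samples).
BYTES_PER_COPY = 2.00e5          # "GPU: Bytes: Memcpy HtoD" / "Memcpy DtoH"
BANDWIDTH_GB_S = {"Memcpy HtoD": 14.60, "Memcpy DtoH": 18.11}
GB = 1e9                         # assumption: decimal gigabytes

def copy_time_us(nbytes, gb_per_s):
    """Estimated duration in microseconds of one copy at the given bandwidth."""
    return nbytes / (gb_per_s * GB) * 1e6

for name, bw in BANDWIDTH_GB_S.items():
    print(f"{name}: about {copy_time_us(BYTES_PER_COPY, bw):.1f} us per copy")
```

Each copy takes on the order of 10 microseconds, which is why it rounds to 0.00 in a column printed with two decimal places.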

CUDA Details

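Adding the --apex:cuda_details flag captures detailed statistics for each kernel invocation. In the run below, counters are reported separately per kernel, along with grid and block dimensions and queue and submit delays: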

[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree --apex:cuda_counters --apex:cuda_details ./build/bin/apex_vector_cu
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar  6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB

Start Date/Time: 06/03/2023 21:21:57
Elapsed time: 0.485327 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.485327 seconds
Available CPU time on all ranks: 0.485327 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                               1 Minute Load average :      1     1.39     1.39
                  GPU: Bandwidth (GB/s): Memcpy DtoH :      2    17.09    17.51
                  GPU: Bandwidth (GB/s): Memcpy HtoD :      4    14.43    14.57
                    GPU: Bytes Allocated: cudaMalloc :      6 2.00e+05 2.00e+05
                          GPU: Bytes Freed: cudaFree :      6 2.00e+05 2.00e+05
                             GPU: Bytes: Memcpy DtoH :      2 2.00e+05 2.00e+05
                             GPU: Bytes: Memcpy HtoD :      4 2.00e+05 2.00e+05
GPU: Dynamic Shared Memory (B): VecAdd(int const*, … :      2     0.00     0.00
GPU: Dynamic Shared Memory (B): VecSub(int const*, … :      2     0.00     0.00
GPU: Local Memory Per Thread (B): VecAdd(int const*… :      2     0.00     0.00
GPU: Local Memory Per Thread (B): VecSub(int const*… :      2     0.00     0.00
GPU: Local Memory Total (B): VecAdd(int const*, int… :      2 1.27e+08 1.27e+08
GPU: Local Memory Total (B): VecSub(int const*, int… :      2 1.27e+08 1.27e+08
GPU: Registers Per Thread: VecAdd(int const*, int c… :      2    16.00    16.00
GPU: Registers Per Thread: VecSub(int const*, int c… :      2    16.00    16.00
GPU: Shared Memory Size (B): VecAdd(int const*, int… :      2 3.28e+04 3.28e+04
GPU: Shared Memory Size (B): VecSub(int const*, int… :      2 3.28e+04 3.28e+04
GPU: Static Shared Memory (B): VecAdd(int const*, i… :      2     0.00     0.00
GPU: Static Shared Memory (B): VecSub(int const*, i… :      2     0.00     0.00
       GPU: Total Bytes Occupied on Device: cudaFree :      6 2.00e+05 4.00e+05
     GPU: Total Bytes Occupied on Device: cudaMalloc :      6 4.00e+05 6.00e+05
GPU: blockX: VecAdd(int const*, int const*, int*, i… :      2   256.00   256.00
GPU: blockX: VecSub(int const*, int const*, int*, i… :      2   256.00   256.00
GPU: blockY: VecAdd(int const*, int const*, int*, i… :      2     1.00     1.00
GPU: blockY: VecSub(int const*, int const*, int*, i… :      2     1.00     1.00
GPU: blockZ: VecAdd(int const*, int const*, int*, i… :      2     1.00     1.00
GPU: blockZ: VecSub(int const*, int const*, int*, i… :      2     1.00     1.00
GPU: gridX: VecAdd(int const*, int const*, int*, in… :      2   196.00   196.00
GPU: gridX: VecSub(int const*, int const*, int*, in… :      2   196.00   196.00
GPU: gridY: VecAdd(int const*, int const*, int*, in… :      2     1.00     1.00
GPU: gridY: VecSub(int const*, int const*, int*, in… :      2     1.00     1.00
GPU: gridZ: VecAdd(int const*, int const*, int*, in… :      2     1.00     1.00
GPU: gridZ: VecSub(int const*, int const*, int*, in… :      2     1.00     1.00
GPU: queue delay (us): VecAdd(int const*, int const… :      2    11.55    15.45
GPU: queue delay (us): VecSub(int const*, int const… :      2    10.45    14.00
GPU: submit delay (us): VecAdd(int const*, int cons… :      2     0.90     1.18
GPU: submit delay (us): VecSub(int const*, int cons… :      2     0.62     0.64
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 1.09e+06 1.09e+06
                                     status:VmExe kB :      1   460.00   460.00
                                     status:VmHWM kB :      1 1.39e+04 1.39e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 7.76e+04 7.76e+04
                                     status:VmPTE kB :      1   196.00   196.00
                                    status:VmPeak kB :      1 1.36e+06 1.36e+06
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 1.39e+04 1.39e+04
                                    status:VmSize kB :      1 1.30e+06 1.30e+06
                                     status:VmStk kB :      1   148.00   148.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     0.00     0.00
                      status:voluntary_ctxt_switches :      1    50.00    50.00
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total
--------------------------------------------------------------------------------
                                    GPU: Memcpy HtoD :      4     0.00     0.00
                                    GPU: Memcpy DtoH :      2     0.00     0.00
      GPU: VecAdd(int const*, int const*, int*, int) :      2     0.00     0.00
      GPU: VecSub(int const*, int const*, int*, int) :      2     0.00     0.00
                            GPU: Context Synchronize :      2     0.00     0.00
                             GPU: Stream Synchronize :      1     0.00     0.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.49     0.49
          int apex_preload_main(int, char**, char**) :      1     0.47     0.47
                                          cudaMalloc :      6     0.03     0.20
                                     cudaDeviceReset :      1     0.11     0.11
                                     cudaMemcpyAsync :      6     0.00     0.00
                                            cudaFree :      6     0.00     0.00
                                    cudaLaunchKernel :      4     0.00     0.00
                                    cudaStreamCreate :      1     0.00     0.00
                                  cudaGetDeviceCount :      1     0.00     0.00
                               cudaDeviceSynchronize :      2     0.00     0.00
                               cudaStreamSynchronize :      1     0.00     0.00
                                       cudaSetDevice :      1     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 43
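The gridX and blockX counters above can be cross-checked against the buffer size. The sketch below assumes a 4-byte int and treats the rounded 2.00e+05 bytes as exactly 200,000. The element count is inferred from the report, not taken from the program source.

```python
BYTES_PER_BUFFER = 200_000   # "GPU: Bytes Allocated: cudaMalloc", rounded in the report
SIZEOF_INT = 4               # assumption: 4-byte int
BLOCK_X = 256                # "GPU: blockX" from the report

n_elements = BYTES_PER_BUFFER // SIZEOF_INT
# Ceiling division: enough blocks to cover every element.
grid_x = (n_elements + BLOCK_X - 1) // BLOCK_X
total_threads = grid_x * BLOCK_X
idle_threads = total_threads - n_elements

print(f"elements={n_elements} gridX={grid_x} "
      f"threads={total_threads} idle={idle_threads}")
```

Under these assumptions the computed grid size agrees with the reported gridX of 196. The threads past the end of the array are the reason such kernels usually take the element count as an argument, as the trailing int parameter of VecAdd and VecSub suggests.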

GPU monitoring with NVML

Adding the --apex:monitor_gpu flag enables NVML support, which samples device-level metrics such as clock rates, memory use, power, temperature and utilization:

[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree --apex:monitor_gpu ./build/bin/apex_vector_cu
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar  6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB

Start Date/Time: 06/03/2023 21:22:57
Elapsed time: 0.500163 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.500163 seconds
Available CPU time on all ranks: 0.500163 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                                GPU: Bytes Allocated :      6 2.00e+05 2.00e+05
                                    GPU: Bytes Freed :      6 2.00e+05 2.00e+05
                    GPU: Device 0 Clock Memory (MHz) :      1  1215.00  1215.00
                        GPU: Device 0 Clock SM (MHz) :      1   210.00   210.00
                      GPU: Device 0 Memory Free (GB) :      1    42.33    42.33
                     GPU: Device 0 Memory Total (GB) :      1    42.95    42.95
                      GPU: Device 0 Memory Used (GB) :      1     0.62     0.62
                  GPU: Device 0 Memory Utilization % :      1     0.00     0.00
                     GPU: Device 0 NvLink Link Count :      1    12.00    12.00
                   GPU: Device 0 NvLink Speed (GB/s) :      1    25.00    25.00
             GPU: Device 0 NvLink Throughput Data RX :      1 1.57e+09 1.57e+09
             GPU: Device 0 NvLink Throughput Data TX :      1 1.57e+09 1.57e+09
              GPU: Device 0 NvLink Throughput Raw RX :      1 2.50e+09 2.50e+09
              GPU: Device 0 NvLink Throughput Raw TX :      1 2.50e+09 2.50e+09
             GPU: Device 0 PCIe RX Throughput (MB/s) :      1    15.00    15.00
             GPU: Device 0 PCIe TX Throughput (MB/s) :      1    11.00    11.00
                             GPU: Device 0 Power (W) :      1    53.81    53.81
                       GPU: Device 0 Temperature (C) :      1    38.00    38.00
                         GPU: Device 0 Utilization % :      1     0.00     0.00
                 GPU: Total Bytes Occupied on Device :     12 3.00e+05 6.00e+05
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 1.09e+06 1.09e+06
                                     status:VmExe kB :      1   460.00   460.00
                                     status:VmHWM kB :      1 1.39e+04 1.39e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 7.76e+04 7.76e+04
                                     status:VmPTE kB :      1   196.00   196.00
                                    status:VmPeak kB :      1 1.36e+06 1.36e+06
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 1.39e+04 1.39e+04
                                    status:VmSize kB :      1 1.30e+06 1.30e+06
                                     status:VmStk kB :      1   148.00   148.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     0.00     0.00
                      status:voluntary_ctxt_switches :      1    51.00    51.00
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total
--------------------------------------------------------------------------------
                                    GPU: Memcpy HtoD :      4     0.00     0.00
                                    GPU: Memcpy DtoH :      2     0.00     0.00
      GPU: VecAdd(int const*, int const*, int*, int) :      2     0.00     0.00
      GPU: VecSub(int const*, int const*, int*, int) :      2     0.00     0.00
                            GPU: Context Synchronize :      2     0.00     0.00
                             GPU: Stream Synchronize :      1     0.00     0.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.50     0.50
          int apex_preload_main(int, char**, char**) :      1     0.49     0.49
                                          cudaMalloc :      6     0.03     0.20
                                     cudaDeviceReset :      1     0.11     0.11
                                     cudaMemcpyAsync :      6     0.00     0.00
                                            cudaFree :      6     0.00     0.00
                                    cudaLaunchKernel :      4     0.00     0.00
                                    cudaStreamCreate :      1     0.00     0.00
                               cudaDeviceSynchronize :      2     0.00     0.00
                                  cudaGetDeviceCount :      1     0.00     0.00
                                       cudaSetDevice :      1     0.00     0.00
                               cudaStreamSynchronize :      1     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 43
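The memory counters above can be sanity-checked with a little arithmetic. The sketch below assumes the "GB" in these counters means 1e9 bytes, which would explain why a device named 40GB reports a total of 42.95.

```python
GIB = 2**30   # binary gibibyte
GB = 1e9      # decimal gigabyte (assumed unit of the counters above)

free_gb, used_gb, total_gb = 42.33, 0.62, 42.95   # from the report above

# Free plus used should reproduce the total, up to rounding in the report.
difference = abs((free_gb + used_gb) - total_gb)
print(f"free + used differs from total by {difference:.2f} GB")

# A 40 GiB device expressed in decimal gigabytes.
capacity_gb = 40 * GIB / GB
print(f"40 GiB = {capacity_gb:.2f} GB")
```

The reported total matches 40 GiB expressed in decimal gigabytes, so the counters appear to use decimal units.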

GPU Memory tracking support

Adding the --apex:gpu_memory flag enables memory consumption and leak tracking for all cudaMalloc* calls. At the end of execution, any memory leaks are reported in a text file (memory_report.0.txt in the run below).

[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree --apex:gpu_memory ./build/bin/apex_vector_cu
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar  6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB

Start Date/Time: 06/03/2023 21:24:37
Elapsed time: 0.46681 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.46681 seconds
Available CPU time on all ranks: 0.46681 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                                GPU: Bytes Allocated :      6 2.00e+05 2.00e+05
                                    GPU: Bytes Freed :      6 2.00e+05 2.00e+05
                 GPU: Total Bytes Occupied on Device :     12 3.00e+05 6.00e+05
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 1.09e+06 1.09e+06
                                     status:VmExe kB :      1   460.00   460.00
                                     status:VmHWM kB :      1 1.39e+04 1.39e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 7.76e+04 7.76e+04
                                     status:VmPTE kB :      1   196.00   196.00
                                    status:VmPeak kB :      1 1.36e+06 1.36e+06
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 1.39e+04 1.39e+04
                                    status:VmSize kB :      1 1.30e+06 1.30e+06
                                     status:VmStk kB :      1   148.00   148.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     0.00     0.00
                      status:voluntary_ctxt_switches :      1    51.00    51.00
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total|  allocs |  (bytes) |    frees |   (bytes)
---------------------------------------------------------------------------------------------------------------------
                                    GPU: Memcpy HtoD :      4     0.00     0.00
                                    GPU: Memcpy DtoH :      2     0.00     0.00
                            GPU: Context Synchronize :      2     0.00     0.00
      GPU: VecAdd(int const*, int const*, int*, int) :      2     0.00     0.00
      GPU: VecSub(int const*, int const*, int*, int) :      2     0.00     0.00
                             GPU: Stream Synchronize :      1     0.00     0.00
---------------------------------------------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total| allocs| (bytes)|  frees | (bytes)
---------------------------------------------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.47     0.47       0      0      0      0
          int apex_preload_main(int, char**, char**) :      1     0.45     0.45       0      0      0      0
                                          cudaMalloc :      6     0.03     0.20       6 1.20e+06      0      0
                                     cudaDeviceReset :      1     0.11     0.11       0      0      0      0
                                     cudaMemcpyAsync :      6     0.00     0.00       0      0      0      0
                                            cudaFree :      6     0.00     0.00       0      0      6 1.20e+06
                                    cudaLaunchKernel :      4     0.00     0.00       0      0      0      0
                                    cudaStreamCreate :      1     0.00     0.00       0      0      0      0
                                  cudaGetDeviceCount :      1     0.00     0.00       0      0      0      0
                               cudaDeviceSynchronize :      2     0.00     0.00       0      0      0      0
                                       cudaSetDevice :      1     0.00     0.00       0      0      0      0
                               cudaStreamSynchronize :      1     0.00     0.00       0      0      0      0
---------------------------------------------------------------------------------------------------------------------


---------------------------------------------------------------------------------------------------------------------
                                        Total timers : 43
Writing: .//apex_tasktree.csv
APEX Memory Report: (see memory_report.0.txt)
sorting 0 leaks by size...
Aggregating leaks by task and writing report...
Ignoring known leaks in CUDA/CUPTI...
Reported 0 'actual' leaks.
Expect false positives if memory was freed after exit.
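The allocation counters can be reproduced by replaying a plausible sequence of calls. The sketch below assumes the program runs two rounds, each allocating three 200,000-byte buffers and then freeing them. That ordering is a guess that happens to match the reported statistics. The program source is not shown here, and other orderings may match as well.

```python
BUF = 200_000  # bytes per buffer, from "GPU: Bytes Allocated" (2.00e+05)

# Assumed order of events (hypothetical): two rounds of
# three allocations followed by three frees.
events = (["malloc"] * 3 + ["free"] * 3) * 2

occupied = 0
samples = {"malloc": [], "free": []}
for ev in events:
    occupied += BUF if ev == "malloc" else -BUF
    samples[ev].append(occupied)  # bytes on the device after this call

def mean(xs):
    return sum(xs) / len(xs)

all_samples = samples["malloc"] + samples["free"]
print("all   :", len(all_samples), mean(all_samples), max(all_samples))
print("malloc:", len(samples["malloc"]), mean(samples["malloc"]), max(samples["malloc"]))
print("free  :", len(samples["free"]), mean(samples["free"]), max(samples["free"]))
print("total allocated:", 6 * BUF, "bytes; still occupied at exit:", occupied)
```

The replay gives 12 samples with a mean of 3.00e+05 and a maximum of 6.00e+05 bytes, matching "GPU: Total Bytes Occupied on Device". The per-call means (4.00e+05 after cudaMalloc, 2.00e+05 after cudaFree) match the breakdown in the CUDA Details section. The totals of 1.20e+06 bytes allocated and freed match the allocs and frees columns above, with nothing left on the device at exit, consistent with the report of 0 leaks.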