CUDA Example
This example introduces APEX using the CUDA programming model.
APEX integrates with the CUPTI, NVTX, and NVML libraries for CUDA measurement support.
The example program performs a vector addition and a vector subtraction in CUDA, using typical CUDA API calls.
The apex_exec wrapper script has several options for supporting CUDA programs:
--apex:cuda enable CUDA/CUPTI measurement (default: off)
--apex:cuda_counters enable CUDA/CUPTI counter support (default: off)
--apex:cuda_driver enable CUDA driver API callbacks (default: off)
--apex:cuda_details enable per-kernel statistics where available (default: off)
--apex:monitor_gpu enable GPU monitoring services (CUDA NVML, ROCm SMI)
--apex:gpu_memory enable GPU memory wrapper support
To enable basic CUDA support, use the --apex:cuda flag:
[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree ./build/bin/apex_vector_cu
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar 6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB
Start Date/Time: 06/03/2023 21:11:06
Elapsed time: 0.519433 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.519433 seconds
Available CPU time on all ranks: 0.519433 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
1 Minute Load average : 1 1.29 1.29
GPU: Bytes Allocated : 6 2.00e+05 2.00e+05
GPU: Bytes Freed : 6 2.00e+05 2.00e+05
GPU: Total Bytes Occupied on Device : 12 3.00e+05 6.00e+05
status:Threads : 1 2.00 2.00
status:VmData kB : 1 1.09e+06 1.09e+06
status:VmExe kB : 1 460.00 460.00
status:VmHWM kB : 1 1.39e+04 1.39e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 7.76e+04 7.76e+04
status:VmPTE kB : 1 196.00 196.00
status:VmPeak kB : 1 1.36e+06 1.36e+06
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 1.39e+04 1.39e+04
status:VmSize kB : 1 1.30e+06 1.30e+06
status:VmStk kB : 1 144.00 144.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 0.00 0.00
status:voluntary_ctxt_switches : 1 33.00 33.00
--------------------------------------------------------------------------------
GPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
GPU: Memcpy HtoD : 4 0.00 0.00
GPU: Memcpy DtoH : 2 0.00 0.00
GPU: VecAdd(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: VecSub(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: Context Synchronize : 2 0.00 0.00
GPU: Stream Synchronize : 1 0.00 0.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.52 0.52
int apex_preload_main(int, char**, char**) : 1 0.51 0.51
cudaMalloc : 6 0.04 0.21
cudaDeviceReset : 1 0.11 0.11
cudaMemcpyAsync : 6 0.00 0.00
cudaFree : 6 0.00 0.00
cudaLaunchKernel : 4 0.00 0.00
cudaStreamCreate : 1 0.00 0.00
cudaGetDeviceCount : 1 0.00 0.00
cudaDeviceSynchronize : 2 0.00 0.00
cudaStreamSynchronize : 1 0.00 0.00
cudaSetDevice : 1 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 43
Writing: .//apex_tasktree.csv
With --apex:tasktree, APEX writes the task tree to apex_tasktree.csv. The apex-treesummary.py script prints the tree as text (--ascii) and writes a Graphviz file (--dot), which dot can then render:
[kehuck1@mahti-login11 apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 18 rows
Found 0 ranks, with max graph node index of 17 and depth of 3
building common tree...
Rank 0 ...
1-> 0.519 - 100.000% [1] {min=0.519, max=0.519, mean=0.519, threads=1} APEX MAIN
1 |-> 0.505 - 97.312% [1] {min=0.505, max=0.505, mean=0.505, threads=1} int apex_preload_main(int, char**, char**)
1 | |-> 0.213 - 40.951% [6] {min=0.213, max=0.213, mean=0.035, threads=1} cudaMalloc
1 | |-> 0.106 - 20.470% [1] {min=0.106, max=0.106, mean=0.106, threads=1} cudaDeviceReset
1 | |-> 0.001 - 0.116% [6] {min=0.001, max=0.001, mean=0.000, threads=1} cudaMemcpyAsync
1 | | |-> 0.000 - 0.011% [4] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Memcpy HtoD
1 | | |-> 0.000 - 0.005% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Memcpy DtoH
1 | |-> 0.000 - 0.054% [6] {min=0.000, max=0.000, mean=0.000, threads=1} cudaFree
1 | |-> 0.000 - 0.015% [4] {min=0.000, max=0.000, mean=0.000, threads=1} cudaLaunchKernel
1 | | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: VecAdd(int const*, int const*, int*, int)
1 | | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: VecSub(int const*, int const*, int*, int)
1 | |-> 0.000 - 0.005% [1] {min=0.000, max=0.000, mean=0.000, threads=1} cudaStreamCreate
1 | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} cudaGetDeviceCount
1 | |-> 0.000 - 0.003% [2] {min=0.000, max=0.000, mean=0.000, threads=1} cudaDeviceSynchronize
1 | | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Context Synchronize
1 | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} cudaStreamSynchronize
1 | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Stream Synchronize
1 | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} cudaSetDevice
19 total graph nodes
Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
[kehuck1@mahti-login11 apex-tutorial]$ dot -Tsvg -O tasktree.dot
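The ASCII tree is straightforward to post-process. As a minimal sketch, assuming lines shaped like the ones printed above, the following Python ranks the CUDA API calls by inclusive time. The regular expression and the parse_tree_line helper are illustrative and not part of APEX.

```python
import re

# Matches lines such as:
#   1 | |-> 0.213 - 40.951% [6] {min=0.213, ...} cudaMalloc
# The pattern is an assumption based on the sample output, not an APEX API.
LINE = re.compile(
    r"^\s*\d+\s*(?P<bars>(?:\|\s*)*)->\s*"
    r"(?P<time>[\d.]+)\s*-\s*(?P<pct>[\d.]+)%\s*"
    r"\[(?P<calls>\d+)\]\s*\{[^}]*\}\s*(?P<name>.+)$"
)


def parse_tree_line(line):
    """Return (depth, seconds, percent, calls, name), or None if no match."""
    m = LINE.match(line)
    if m is None:
        return None
    depth = m.group("bars").count("|")  # one bar per nesting level
    return (depth, float(m.group("time")), float(m.group("pct")),
            int(m.group("calls")), m.group("name").strip())


sample = """\
1-> 0.519 - 100.000% [1] {min=0.519, max=0.519, mean=0.519, threads=1} APEX MAIN
1 |-> 0.505 - 97.312% [1] {min=0.505, max=0.505, mean=0.505, threads=1} int apex_preload_main(int, char**, char**)
1 | |-> 0.213 - 40.951% [6] {min=0.213, max=0.213, mean=0.035, threads=1} cudaMalloc
1 | |-> 0.106 - 20.470% [1] {min=0.106, max=0.106, mean=0.106, threads=1} cudaDeviceReset
1 | | |-> 0.000 - 0.011% [4] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Memcpy HtoD
"""

rows = [r for r in map(parse_tree_line, sample.splitlines()) if r]
# Keep only the CUDA API calls (depth 2 in this run) and sort by time.
api_calls = sorted((r for r in rows if r[0] == 2), key=lambda r: -r[1])
for depth, secs, pct, calls, name in api_calls:
    print(f"{name:<20} {secs:.3f} s  {pct:6.3f}%  calls={calls}")
```

In this run, cudaMalloc and cudaDeviceReset together account for most of the time under apex_preload_main, while the GPU activity itself is negligible.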
Adding the --apex:cuda_counters flag enables capturing additional counters for all kernel invocations:
[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree --apex:cuda_counters ./build/bin/apex_vector_cu
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar 6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB
Start Date/Time: 06/03/2023 21:20:24
Elapsed time: 0.491052 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.491052 seconds
Available CPU time on all ranks: 0.491052 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
1 Minute Load average : 1 2.20 2.20
GPU: Bandwidth (GB/s): Memcpy DtoH : 2 18.11 18.77
GPU: Bandwidth (GB/s): Memcpy HtoD : 4 14.60 14.71
GPU: Bytes Allocated : 6 2.00e+05 2.00e+05
GPU: Bytes Freed : 6 2.00e+05 2.00e+05
GPU: Bytes: Memcpy DtoH : 2 2.00e+05 2.00e+05
GPU: Bytes: Memcpy HtoD : 4 2.00e+05 2.00e+05
GPU: Dynamic Shared Memory (B) : 4 0.00 0.00
GPU: Local Memory Per Thread (B) : 4 0.00 0.00
GPU: Local Memory Total (B) : 4 1.27e+08 1.27e+08
GPU: Registers Per Thread : 4 16.00 16.00
GPU: Shared Memory Size (B) : 4 3.28e+04 3.28e+04
GPU: Static Shared Memory (B) : 4 0.00 0.00
GPU: Total Bytes Occupied on Device : 12 3.00e+05 6.00e+05
status:Threads : 1 2.00 2.00
status:VmData kB : 1 1.09e+06 1.09e+06
status:VmExe kB : 1 460.00 460.00
status:VmHWM kB : 1 1.39e+04 1.39e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 7.76e+04 7.76e+04
status:VmPTE kB : 1 196.00 196.00
status:VmPeak kB : 1 1.36e+06 1.36e+06
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 1.39e+04 1.39e+04
status:VmSize kB : 1 1.30e+06 1.30e+06
status:VmStk kB : 1 148.00 148.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 1.00 1.00
status:voluntary_ctxt_switches : 1 49.00 49.00
--------------------------------------------------------------------------------
GPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
GPU: Memcpy HtoD : 4 0.00 0.00
GPU: Memcpy DtoH : 2 0.00 0.00
GPU: VecAdd(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: Context Synchronize : 2 0.00 0.00
GPU: VecSub(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: Stream Synchronize : 1 0.00 0.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.49 0.49
int apex_preload_main(int, char**, char**) : 1 0.48 0.48
cudaMalloc : 6 0.03 0.20
cudaDeviceReset : 1 0.11 0.11
cudaMemcpyAsync : 6 0.00 0.00
cudaFree : 6 0.00 0.00
cudaLaunchKernel : 4 0.00 0.00
cudaStreamCreate : 1 0.00 0.00
cudaGetDeviceCount : 1 0.00 0.00
cudaDeviceSynchronize : 2 0.00 0.00
cudaStreamSynchronize : 1 0.00 0.00
cudaSetDevice : 1 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 43
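The bandwidth counters also explain why the GPU timers print as 0.00. As a rough worked example in Python, assuming GB/s means 10^9 bytes per second and that the timer columns are in seconds (the helper name is illustrative):

```python
def transfer_time_us(num_bytes, bandwidth_gb_per_s):
    """Estimate copy duration in microseconds from size and bandwidth.

    Assumes 1 GB = 1e9 bytes; this is a back-of-the-envelope check,
    not how APEX computes its timers.
    """
    seconds = num_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1e6


# Mean values taken from the counter table above.
htod_us = transfer_time_us(2.00e5, 14.60)
dtoh_us = transfer_time_us(2.00e5, 18.11)
print(f"HtoD: {htod_us:.1f} us, DtoH: {dtoh_us:.1f} us")
```

Both copies take on the order of ten microseconds, far below the 0.01 s resolution of the two-decimal timer columns.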
Adding the --apex:cuda_details flag enables capturing detailed statistics about each kernel invocation:
[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree --apex:cuda_counters --apex:cuda_details ./build/bin/apex_vector_cu
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar 6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB
Start Date/Time: 06/03/2023 21:21:57
Elapsed time: 0.485327 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.485327 seconds
Available CPU time on all ranks: 0.485327 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
1 Minute Load average : 1 1.39 1.39
GPU: Bandwidth (GB/s): Memcpy DtoH : 2 17.09 17.51
GPU: Bandwidth (GB/s): Memcpy HtoD : 4 14.43 14.57
GPU: Bytes Allocated: cudaMalloc : 6 2.00e+05 2.00e+05
GPU: Bytes Freed: cudaFree : 6 2.00e+05 2.00e+05
GPU: Bytes: Memcpy DtoH : 2 2.00e+05 2.00e+05
GPU: Bytes: Memcpy HtoD : 4 2.00e+05 2.00e+05
GPU: Dynamic Shared Memory (B): VecAdd(int const*, … : 2 0.00 0.00
GPU: Dynamic Shared Memory (B): VecSub(int const*, … : 2 0.00 0.00
GPU: Local Memory Per Thread (B): VecAdd(int const*… : 2 0.00 0.00
GPU: Local Memory Per Thread (B): VecSub(int const*… : 2 0.00 0.00
GPU: Local Memory Total (B): VecAdd(int const*, int… : 2 1.27e+08 1.27e+08
GPU: Local Memory Total (B): VecSub(int const*, int… : 2 1.27e+08 1.27e+08
GPU: Registers Per Thread: VecAdd(int const*, int c… : 2 16.00 16.00
GPU: Registers Per Thread: VecSub(int const*, int c… : 2 16.00 16.00
GPU: Shared Memory Size (B): VecAdd(int const*, int… : 2 3.28e+04 3.28e+04
GPU: Shared Memory Size (B): VecSub(int const*, int… : 2 3.28e+04 3.28e+04
GPU: Static Shared Memory (B): VecAdd(int const*, i… : 2 0.00 0.00
GPU: Static Shared Memory (B): VecSub(int const*, i… : 2 0.00 0.00
GPU: Total Bytes Occupied on Device: cudaFree : 6 2.00e+05 4.00e+05
GPU: Total Bytes Occupied on Device: cudaMalloc : 6 4.00e+05 6.00e+05
GPU: blockX: VecAdd(int const*, int const*, int*, i… : 2 256.00 256.00
GPU: blockX: VecSub(int const*, int const*, int*, i… : 2 256.00 256.00
GPU: blockY: VecAdd(int const*, int const*, int*, i… : 2 1.00 1.00
GPU: blockY: VecSub(int const*, int const*, int*, i… : 2 1.00 1.00
GPU: blockZ: VecAdd(int const*, int const*, int*, i… : 2 1.00 1.00
GPU: blockZ: VecSub(int const*, int const*, int*, i… : 2 1.00 1.00
GPU: gridX: VecAdd(int const*, int const*, int*, in… : 2 196.00 196.00
GPU: gridX: VecSub(int const*, int const*, int*, in… : 2 196.00 196.00
GPU: gridY: VecAdd(int const*, int const*, int*, in… : 2 1.00 1.00
GPU: gridY: VecSub(int const*, int const*, int*, in… : 2 1.00 1.00
GPU: gridZ: VecAdd(int const*, int const*, int*, in… : 2 1.00 1.00
GPU: gridZ: VecSub(int const*, int const*, int*, in… : 2 1.00 1.00
GPU: queue delay (us): VecAdd(int const*, int const… : 2 11.55 15.45
GPU: queue delay (us): VecSub(int const*, int const… : 2 10.45 14.00
GPU: submit delay (us): VecAdd(int const*, int cons… : 2 0.90 1.18
GPU: submit delay (us): VecSub(int const*, int cons… : 2 0.62 0.64
status:Threads : 1 2.00 2.00
status:VmData kB : 1 1.09e+06 1.09e+06
status:VmExe kB : 1 460.00 460.00
status:VmHWM kB : 1 1.39e+04 1.39e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 7.76e+04 7.76e+04
status:VmPTE kB : 1 196.00 196.00
status:VmPeak kB : 1 1.36e+06 1.36e+06
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 1.39e+04 1.39e+04
status:VmSize kB : 1 1.30e+06 1.30e+06
status:VmStk kB : 1 148.00 148.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 0.00 0.00
status:voluntary_ctxt_switches : 1 50.00 50.00
--------------------------------------------------------------------------------
GPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
GPU: Memcpy HtoD : 4 0.00 0.00
GPU: Memcpy DtoH : 2 0.00 0.00
GPU: VecAdd(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: VecSub(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: Context Synchronize : 2 0.00 0.00
GPU: Stream Synchronize : 1 0.00 0.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.49 0.49
int apex_preload_main(int, char**, char**) : 1 0.47 0.47
cudaMalloc : 6 0.03 0.20
cudaDeviceReset : 1 0.11 0.11
cudaMemcpyAsync : 6 0.00 0.00
cudaFree : 6 0.00 0.00
cudaLaunchKernel : 4 0.00 0.00
cudaStreamCreate : 1 0.00 0.00
cudaGetDeviceCount : 1 0.00 0.00
cudaDeviceSynchronize : 2 0.00 0.00
cudaStreamSynchronize : 1 0.00 0.00
cudaSetDevice : 1 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 43
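The per-kernel counters are consistent with a one-dimensional launch over the vector. The short Python check below assumes each 2.00e+05-byte buffer holds 4-byte int elements. The kernel signatures take int pointers, but the element count itself is inferred and not shown in the output.

```python
BYTES_PER_BUFFER = 200_000   # "GPU: Bytes Allocated" mean, 2.00e+05
SIZEOF_INT = 4               # assumption: 4-byte int
BLOCK_X = 256                # "GPU: blockX" from the counters

n = BYTES_PER_BUFFER // SIZEOF_INT          # inferred element count
grid_x = (n + BLOCK_X - 1) // BLOCK_X       # ceiling division
launched = grid_x * BLOCK_X                 # threads actually launched
idle = launched - n                         # threads past the end of the vector

print(n, grid_x, launched, idle)
```

Under these assumptions grid_x comes out to 196, matching the gridX counter, with a small number of threads in the last block left without an element to process.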
Adding the --apex:monitor_gpu flag enables GPU monitoring through NVML:
[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree --apex:monitor_gpu ./build/bin/apex_vector_cu
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar 6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB
Start Date/Time: 06/03/2023 21:22:57
Elapsed time: 0.500163 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.500163 seconds
Available CPU time on all ranks: 0.500163 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
GPU: Bytes Allocated : 6 2.00e+05 2.00e+05
GPU: Bytes Freed : 6 2.00e+05 2.00e+05
GPU: Device 0 Clock Memory (MHz) : 1 1215.00 1215.00
GPU: Device 0 Clock SM (MHz) : 1 210.00 210.00
GPU: Device 0 Memory Free (GB) : 1 42.33 42.33
GPU: Device 0 Memory Total (GB) : 1 42.95 42.95
GPU: Device 0 Memory Used (GB) : 1 0.62 0.62
GPU: Device 0 Memory Utilization % : 1 0.00 0.00
GPU: Device 0 NvLink Link Count : 1 12.00 12.00
GPU: Device 0 NvLink Speed (GB/s) : 1 25.00 25.00
GPU: Device 0 NvLink Throughput Data RX : 1 1.57e+09 1.57e+09
GPU: Device 0 NvLink Throughput Data TX : 1 1.57e+09 1.57e+09
GPU: Device 0 NvLink Throughput Raw RX : 1 2.50e+09 2.50e+09
GPU: Device 0 NvLink Throughput Raw TX : 1 2.50e+09 2.50e+09
GPU: Device 0 PCIe RX Throughput (MB/s) : 1 15.00 15.00
GPU: Device 0 PCIe TX Throughput (MB/s) : 1 11.00 11.00
GPU: Device 0 Power (W) : 1 53.81 53.81
GPU: Device 0 Temperature (C) : 1 38.00 38.00
GPU: Device 0 Utilization % : 1 0.00 0.00
GPU: Total Bytes Occupied on Device : 12 3.00e+05 6.00e+05
status:Threads : 1 2.00 2.00
status:VmData kB : 1 1.09e+06 1.09e+06
status:VmExe kB : 1 460.00 460.00
status:VmHWM kB : 1 1.39e+04 1.39e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 7.76e+04 7.76e+04
status:VmPTE kB : 1 196.00 196.00
status:VmPeak kB : 1 1.36e+06 1.36e+06
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 1.39e+04 1.39e+04
status:VmSize kB : 1 1.30e+06 1.30e+06
status:VmStk kB : 1 148.00 148.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 0.00 0.00
status:voluntary_ctxt_switches : 1 51.00 51.00
--------------------------------------------------------------------------------
GPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
GPU: Memcpy HtoD : 4 0.00 0.00
GPU: Memcpy DtoH : 2 0.00 0.00
GPU: VecAdd(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: VecSub(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: Context Synchronize : 2 0.00 0.00
GPU: Stream Synchronize : 1 0.00 0.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.50 0.50
int apex_preload_main(int, char**, char**) : 1 0.49 0.49
cudaMalloc : 6 0.03 0.20
cudaDeviceReset : 1 0.11 0.11
cudaMemcpyAsync : 6 0.00 0.00
cudaFree : 6 0.00 0.00
cudaLaunchKernel : 4 0.00 0.00
cudaStreamCreate : 1 0.00 0.00
cudaDeviceSynchronize : 2 0.00 0.00
cudaGetDeviceCount : 1 0.00 0.00
cudaSetDevice : 1 0.00 0.00
cudaStreamSynchronize : 1 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 43
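The device memory counters can be sanity-checked by hand. A small Python sketch, assuming the 40 in the device name means 40 GiB and that the report expresses GB as 10^9 bytes (both assumptions are inferred from the numbers above):

```python
GIB = 2**30

total_gb = 40 * GIB / 1e9        # 40 GiB expressed in units of 1e9 bytes
free_gb = 42.33                  # "Memory Free (GB)" from the table
used_gb = total_gb - free_gb     # should match "Memory Used (GB)"

print(f"total={total_gb:.2f} GB, used={used_gb:.2f} GB")
```

The results agree with the reported Memory Total and Memory Used values to two decimal places.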
Adding the --apex:gpu_memory flag enables memory consumption and leak tracking for all cudaMalloc* calls. At the end of execution, any memory leaks are reported to the user in a text file (memory_report.0.txt in the run below).
[kehuck1@mahti-login11 apex-tutorial]$ srun apex_exec --apex:cuda --apex:tasktree --apex:gpu_memory ./build/bin/apex_vector_cu
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: -62a039a-develop
Built on: 15:44:00 Mar 6 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
GCC Compiler version : 11.2.0
Device Name: NVIDIA A100-SXM4-40GB
Start Date/Time: 06/03/2023 21:24:37
Elapsed time: 0.46681 seconds
Total processes detected: 1
HW Threads detected on rank 0: 256
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.46681 seconds
Available CPU time on all ranks: 0.46681 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
GPU: Bytes Allocated : 6 2.00e+05 2.00e+05
GPU: Bytes Freed : 6 2.00e+05 2.00e+05
GPU: Total Bytes Occupied on Device : 12 3.00e+05 6.00e+05
status:Threads : 1 2.00 2.00
status:VmData kB : 1 1.09e+06 1.09e+06
status:VmExe kB : 1 460.00 460.00
status:VmHWM kB : 1 1.39e+04 1.39e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 7.76e+04 7.76e+04
status:VmPTE kB : 1 196.00 196.00
status:VmPeak kB : 1 1.36e+06 1.36e+06
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 1.39e+04 1.39e+04
status:VmSize kB : 1 1.30e+06 1.30e+06
status:VmStk kB : 1 148.00 148.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 0.00 0.00
status:voluntary_ctxt_switches : 1 51.00 51.00
--------------------------------------------------------------------------------
GPU Timers : #calls| mean | total| allocs | (bytes) | frees | (bytes)
---------------------------------------------------------------------------------------------------------------------
GPU: Memcpy HtoD : 4 0.00 0.00
GPU: Memcpy DtoH : 2 0.00 0.00
GPU: Context Synchronize : 2 0.00 0.00
GPU: VecAdd(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: VecSub(int const*, int const*, int*, int) : 2 0.00 0.00
GPU: Stream Synchronize : 1 0.00 0.00
---------------------------------------------------------------------------------------------------------------------
CPU Timers : #calls| mean | total| allocs| (bytes)| frees | (bytes)
---------------------------------------------------------------------------------------------------------------------
APEX MAIN : 1 0.47 0.47 0 0 0 0
int apex_preload_main(int, char**, char**) : 1 0.45 0.45 0 0 0 0
cudaMalloc : 6 0.03 0.20 6 1.20e+06 0 0
cudaDeviceReset : 1 0.11 0.11 0 0 0 0
cudaMemcpyAsync : 6 0.00 0.00 0 0 0 0
cudaFree : 6 0.00 0.00 0 0 6 1.20e+06
cudaLaunchKernel : 4 0.00 0.00 0 0 0 0
cudaStreamCreate : 1 0.00 0.00 0 0 0 0
cudaGetDeviceCount : 1 0.00 0.00 0 0 0 0
cudaDeviceSynchronize : 2 0.00 0.00 0 0 0 0
cudaSetDevice : 1 0.00 0.00 0 0 0 0
cudaStreamSynchronize : 1 0.00 0.00 0 0 0 0
---------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------
Total timers : 43
Writing: .//apex_tasktree.csv
APEX Memory Report: (see memory_report.0.txt)
sorting 0 leaks by size...
Aggregating leaks by task and writing report...
Ignoring known leaks in CUDA/CUPTI...
Reported 0 'actual' leaks.
Expect false positives if memory was freed after exit.
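The allocation counters in these reports can be reproduced with a simple replay. The Python sketch below assumes an allocation order that is consistent with the counters: three 2.00e+05-byte buffers allocated and then freed, done twice. The source of apex_vector_cu is not shown here, so this order is an inference.

```python
BUF = 200_000  # bytes per cudaMalloc, from "GPU: Bytes Allocated"

# Assumed event order: 3 allocations then 3 frees, done twice.
events = (["malloc"] * 3 + ["free"] * 3) * 2

occupied = 0
samples = []          # device occupancy after every malloc/free
total_allocated = 0
total_freed = 0
for ev in events:
    if ev == "malloc":
        occupied += BUF
        total_allocated += BUF
    else:
        occupied -= BUF
        total_freed += BUF
    samples.append(occupied)

mean = sum(samples) / len(samples)
print(len(samples), mean, max(samples))        # samples, mean, max occupancy
print(total_allocated, total_freed, occupied)  # bytes allocated, freed, leaked
```

The replay yields 12 occupancy samples with a mean of 3.00e+05 and a maximum of 6.00e+05 bytes, 1.20e+06 bytes allocated and freed in total, and nothing left allocated at exit, which lines up with the zero leaks in the memory report.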
APEX tutorial, © Copyright 2023, University of Oregon. For more information on APEX, see https://github.com/UO-OACISS/apex