Using CUDA Graphs in PyTorch C++ API
====================================

**Translation**: `hyoyoung <https://github.com/hyoyoung>`_

.. note::
   |edit| View and edit this tutorial in `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs.rst>`__. The full source code is available on `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs>`__.

Prerequisites:

- `Using the PyTorch C++ Frontend <../advanced_source/cpp_frontend.html>`__
- `CUDA semantics <https://pytorch.org/docs/master/notes/cuda.html>`__
- Pytorch 2.0 or later
- CUDA 11 or later

NVIDIA's CUDA Graphs have been a part of the CUDA Toolkit library since the
release of `version 10 <https://developer.nvidia.com/blog/cuda-graphs/>`_.
They are capable of greatly reducing the CPU overhead, increasing the
performance of applications.

In this tutorial, we will be focusing on using CUDA Graphs for the `C++
frontend of PyTorch <https://tutorials.pytorch.kr/advanced/cpp_frontend.html>`_.
The C++ frontend is mostly utilized in production and deployment applications,
which are important parts of PyTorch use cases. Since `their first appearance
<https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/>`_,
CUDA Graphs have won users' and developers' hearts for being a very performant
and at the same time simple-to-use tool. In fact, CUDA Graphs are used by default
in ``torch.compile`` of PyTorch 2.0 to boost the productivity of training and inference.

We would like to demonstrate CUDA Graphs usage on PyTorch's `MNIST
example <https://github.com/pytorch/examples/tree/main/cpp/mnist>`_.
The usage of CUDA Graphs in LibTorch (C++ Frontend) is very similar to its
`Python counterpart <https://pytorch.org/docs/main/notes/cuda.html#cuda-graphs>`_
but with some differences in syntax and functionality.

Getting Started
---------------

The main training loop consists of several steps, as depicted in the
following code chunk:

.. code-block:: cpp

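  // A hedged, illustrative sketch of the loop body: the data loader, model,
  // optimizer, and NLL loss below are assumptions, not the verbatim example.
  for (const auto& batch : *data_loader) {
    auto data = batch.data.to(torch::kCUDA);
    auto targets = batch.target.to(torch::kCUDA);
    optimizer.zero_grad();
    auto output = model.forward(data);
    auto loss = torch::nll_loss(output, targets);
    loss.backward();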
    optimizer.step();
  }

The example above includes a forward pass, a backward pass, and weight updates.

In this tutorial, we will be applying CUDA Graph on all the compute steps through the whole-network
graph capture. But before doing so, we need to slightly modify the source code. What we need
to do is preallocate tensors for reusing them in the main training loop. Here is an example
implementation:

.. code-block:: cpp

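  // Hedged sketch: preallocate static CUDA tensors once and reuse them every
  // iteration. The dtypes, shapes, and kBatchSize below are illustrative
  // assumptions for an MNIST-style input.
  auto opts_float = torch::TensorOptions().device(torch::kCUDA).dtype(torch::kFloat);
  auto opts_long = torch::TensorOptions().device(torch::kCUDA).dtype(torch::kLong);

  torch::Tensor data = torch::zeros({kBatchSize, 1, 28, 28}, opts_float);
  torch::Tensor targets = torch::zeros({kBatchSize}, opts_long);
  torch::Tensor output = torch::zeros({1}, opts_float);
  torch::Tensor loss = torch::zeros({1}, opts_float);

  for (const auto& batch : *data_loader) {
    // Copy each incoming batch into the preallocated tensors in place.
    data.copy_(batch.data);
    targets.copy_(batch.target);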
    training_step(model, optimizer, data, targets, output, loss);
  }

Where ``training_step`` simply consists of forward and backward passes with corresponding optimizer calls:

.. code-block:: cpp

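  // A hedged sketch of the helper: the exact signature, the Net type, and the
  // NLL loss are assumptions based on an MNIST-style classifier.
  void training_step(
      Net& model,
      torch::optim::Optimizer& optimizer,
      torch::Tensor& data,
      torch::Tensor& targets,
      torch::Tensor& output,
      torch::Tensor& loss) {
    optimizer.zero_grad();
    output = model.forward(data);
    loss = torch::nll_loss(output, targets);
    loss.backward();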
    optimizer.step();
  }

PyTorch's CUDA Graphs API relies on stream capture, which in our case would be used like this:

.. code-block:: cpp

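  // Hedged sketch: capture one training step into a CUDA graph. The stream
  // setup below is illustrative; capture must not run on the default stream.
  at::cuda::CUDAGraph graph;
  auto capture_stream = c10::cuda::getStreamFromPool();
  c10::cuda::setCurrentCUDAStream(capture_stream);

  graph.capture_begin();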
  training_step(model, optimizer, data, targets, output, loss);
  graph.capture_end();

Before the actual graph capture, it is important to run several warm-up iterations on a side stream to
prepare the CUDA cache as well as the CUDA libraries (like CUBLAS and CUDNN) that will be used during
training:

.. code-block:: cpp

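  // A hedged sketch: run a handful of warm-up iterations before capture_begin(),
  // assuming the current stream is already the side stream set up above.
  // The iteration count is an illustrative choice.
  constexpr int kNumWarmupIters = 7;
  for (int i = 0; i < kNumWarmupIters; ++i) {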
    training_step(model, optimizer, data, targets, output, loss);
  }

After a successful graph capture, we can replace the ``training_step(model, optimizer, data, targets, output, loss);``
call with ``graph.replay();`` to perform the training step.
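
With illustrative names, and assuming the graph captured above, the resulting main loop
might look roughly like this:

.. code-block:: cpp

  for (const auto& batch : *data_loader) {
    // Refresh the static input tensors in place, then replay the captured graph,
    // which re-executes the recorded forward/backward/optimizer work on the GPU.
    data.copy_(batch.data);
    targets.copy_(batch.target);
    graph.replay();
  }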

Training Results
----------------

Taking the code for a spin, we can see the following output from ordinary non-graphed training:

.. code-block:: shell

  user    0m44.018s
  sys     0m1.116s

While the training with the CUDA Graph produces the following output:

.. code-block:: shell

  user    0m7.048s
  sys     0m0.619s

Conclusion
----------

As we can see, just by applying a CUDA Graph on the `MNIST example
<https://github.com/pytorch/examples/tree/main/cpp/mnist>`_ we were able to improve training
performance by more than six times. This kind of large performance improvement was achievable due to
the small model size. For larger models with heavy GPU usage, the CPU overhead is less impactful,
so the improvement will be smaller. Nevertheless, it is always advantageous to use CUDA Graphs to
gain the performance of GPUs.