
Commit 5fc8a7d

advanced_source/cpp_cuda_graphs.rst translation (#961)
* translate beginner_source/torchtext_custom_dataset_tutorial.py
1 parent 34644b5 commit 5fc8a7d


advanced_source/cpp_cuda_graphs.rst

Lines changed: 53 additions & 54 deletions
@@ -1,41 +1,41 @@
-Using CUDA Graphs in PyTorch C++ API
-====================================
+Using CUDA Graphs in the PyTorch C++ API
+===========================================
+
+**Translation**: `μž₯효영 <https://github.com/hyoyoung>`_
 
 .. note::
-   |edit| View and edit this tutorial in `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs.rst>`__. The full source code is available on `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs>`__.
+   |edit| View and edit this tutorial on `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs.rst>`__. The full source code is available on `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs>`__.
 
-Prerequisites:
+Prerequisites:
 
-- `Using the PyTorch C++ Frontend <../advanced_source/cpp_frontend.html>`__
+- `Using the PyTorch C++ Frontend <../advanced_source/cpp_frontend.html>`__
 - `CUDA semantics <https://pytorch.org/docs/master/notes/cuda.html>`__
-- Pytorch 2.0 or later
-- CUDA 11 or later
-
-
-NVIDIA’s CUDA Graphs have been a part of CUDA Toolkit library since the
-release of `version 10 <https://developer.nvidia.com/blog/cuda-graphs/>`_.
-They are capable of greatly reducing the CPU overhead increasing the
-performance of applications.
-
-In this tutorial, we will be focusing on using CUDA Graphs for `C++
-frontend of PyTorch <https://tutorials.pytorch.kr/advanced/cpp_frontend.html>`_.
-The C++ frontend is mostly utilized in production and deployment applications which
-are important parts of PyTorch use cases. Since `the first appearance
-<https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/>`_
-the CUDA Graphs won users’ and developer’s hearts for being a very performant
-and at the same time simple-to-use tool. In fact, CUDA Graphs are used by default
-in ``torch.compile`` of PyTorch 2.0 to boost the productivity of training and inference.
-
-We would like to demonstrate CUDA Graphs usage on PyTorch’s `MNIST
-example <https://github.com/pytorch/examples/tree/main/cpp/mnist>`_.
-The usage of CUDA Graphs in LibTorch (C++ Frontend) is very similar to its
-`Python counterpart <https://pytorch.org/docs/main/notes/cuda.html#cuda-graphs>`_
-but with some differences in syntax and functionality.
-
-Getting Started
+- PyTorch 2.0 or later
+- CUDA 11 or later
+
+
+NVIDIA’s CUDA Graphs have been part of the CUDA Toolkit library since the release of
+`version 10 <https://developer.nvidia.com/blog/cuda-graphs/>`_.
+They greatly reduce CPU overhead and thereby improve application performance.
+
+In this tutorial we focus on using CUDA Graphs with the
+`PyTorch C++ frontend <https://tutorials.pytorch.kr/advanced/cpp_frontend.html>`_.
+The C++ frontend is an important part of PyTorch use cases and is mostly utilized in production and deployment applications.
+Since `their first appearance <https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/>`_,
+CUDA Graphs have won the hearts of users and developers for being very performant and, at the same time, easy to use.
+In fact, CUDA Graphs are used by default in ``torch.compile`` of PyTorch 2.0,
+boosting productivity in training and inference.
+
+We would like to demonstrate CUDA Graphs usage on PyTorch’s `MNIST
+example <https://github.com/pytorch/examples/tree/main/cpp/mnist>`_.
+The usage of CUDA Graphs in LibTorch (the C++ frontend) is very similar to its
+`Python counterpart <https://pytorch.org/docs/main/notes/cuda.html#cuda-graphs>`_,
+but with some differences in syntax and functionality.
+
+Getting Started
 ---------------
 
-The main training loop consists of the several steps and depicted in the
-following code chunk:
+The main training loop consists of several steps, as depicted in the
+following code chunk:
 
 .. code-block:: cpp
 
@@ -49,12 +49,12 @@ following code chunk:
     optimizer.step();
   }
 
-The example above includes a forward pass, a backward pass, and weight updates.
+The example above includes a forward pass, a backward pass, and weight updates.
 
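The body of that code chunk is elided in this diff. As a rough sketch only (not part of this commit), one iteration of such a non-graphed loop in LibTorch might look like the following, assuming an MNIST-style ``model``, an ``optimizer``, a ``device``, and a ``train_loader`` yielding ``data``/``target`` batches; these names are assumptions:

.. code-block:: cpp

   // Rough sketch only -- the real loop lives in the tutorial's full source.
   for (auto& batch : *train_loader) {
     auto data = batch.data.to(device);        // assumed MNIST batch layout
     auto targets = batch.target.to(device);

     optimizer.zero_grad();
     auto output = model.forward(data);            // forward pass
     auto loss = torch::nll_loss(output, targets); // classification loss
     loss.backward();                              // backward pass
     optimizer.step();                             // weight update
   }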
-In this tutorial, we will be applying CUDA Graph on all the compute steps through the whole-network
-graph capture. But before doing so, we need to slightly modify the source code. What we need
-to do is preallocate tensors for reusing them in the main training loop. Here is an example
-implementation:
+In this tutorial, we apply CUDA Graphs to all the compute steps through whole-network graph capture.
+But before doing so, we need to slightly modify the source code. What we need to do is
+preallocate tensors so that they can be reused in the main training loop.
+Here is an example implementation:
 
 .. code-block:: cpp
 
@@ -74,7 +74,7 @@ implementation:
     training_step(model, optimizer, data, targets, output, loss);
   }
 
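The preallocation block itself is elided in the hunk above. A minimal sketch of what the paragraph describes might look as follows; ``kBatchSize``, ``device``, ``train_loader``, and the tensor shapes are assumptions, not taken from the commit:

.. code-block:: cpp

   // Rough sketch: allocate input/target/output/loss tensors once, outside the
   // loop, so the same GPU memory can be reused on every iteration (and later
   // by the captured graph). Shapes below assume the MNIST example.
   auto data = torch::zeros({kBatchSize, 1, 28, 28}, torch::kFloat32).to(device);
   auto targets = torch::zeros({kBatchSize}, torch::kInt64).to(device);
   auto output = torch::zeros({1}, torch::kFloat32).to(device);
   auto loss = torch::zeros({1}, torch::kFloat32).to(device);

   for (auto& batch : *train_loader) {
     data.copy_(batch.data);      // copy the fresh batch into the preallocated input
     targets.copy_(batch.target); // copy the fresh labels into the preallocated target
     training_step(model, optimizer, data, targets, output, loss);
   }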
-Where ``training_step`` simply consists of forward and backward passes with corresponding optimizer calls:
+Here ``training_step`` simply consists of forward and backward passes with the corresponding optimizer calls:
 
 .. code-block:: cpp
 
@@ -92,7 +92,7 @@ Where ``training_step`` simply consists of forward and backward passes with corr
     optimizer.step();
   }
 
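The body of ``training_step`` is likewise elided here. A plausible sketch consistent with the call sites shown above (the ``Net`` type and the exact signature are assumptions) is:

.. code-block:: cpp

   // Rough sketch of training_step; the full implementation is in the tutorial's
   // source. Results are written into the preallocated output/loss tensors.
   void training_step(
       Net& model,                         // assumed model type
       torch::optim::Optimizer& optimizer,
       torch::Tensor& data,
       torch::Tensor& targets,
       torch::Tensor& output,
       torch::Tensor& loss) {
     optimizer.zero_grad();
     output = model.forward(data);             // forward pass
     loss = torch::nll_loss(output, targets);  // compute the loss
     loss.backward();                          // backward pass
     optimizer.step();                         // optimizer update
   }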
-PyTorch’s CUDA Graphs API is relying on Stream Capture which in our case would be used like this:
+PyTorch’s CUDA Graphs API relies on stream capture, which in our case is used like this:
 
 .. code-block:: cpp
 
@@ -104,9 +104,9 @@ PyTorch’s CUDA Graphs API is relying on Stream Capture which in our case would
   training_step(model, optimizer, data, targets, output, loss);
   graph.capture_end();
 
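Only the tail of the capture block is visible in this hunk. A hedged sketch of the full sequence, assuming the ``at::cuda::CUDAGraph`` and CUDA stream APIs of the C++ frontend, might be:

.. code-block:: cpp

   // Rough sketch of the full capture sequence (headers: <ATen/cuda/CUDAGraph.h>,
   // <c10/cuda/CUDAStream.h>). Capture must run on a non-default stream.
   at::cuda::CUDAGraph graph;

   c10::cuda::CUDAStream captureStream = c10::cuda::getStreamFromPool();
   c10::cuda::setCurrentCUDAStream(captureStream);

   graph.capture_begin();
   training_step(model, optimizer, data, targets, output, loss);
   graph.capture_end();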
-Before the actual graph capture, it is important to run several warm-up iterations on side stream to
-prepare CUDA cache as well as CUDA libraries (like CUBLAS and CUDNN) that will be used during
-the training:
+Before the actual graph capture, it is important to run several warm-up iterations on a side stream
+to prepare not only the CUDA cache but also the CUDA libraries (such as CUBLAS and CUDNN)
+that will be used during training:
 
 .. code-block:: cpp
 
@@ -116,13 +116,13 @@ the training:
     training_step(model, optimizer, data, targets, output, loss);
   }
 
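The warm-up loop itself is not shown in the diff. A rough sketch, with ``kNumWarmupIters`` as an assumed parameter, could look like:

.. code-block:: cpp

   // Rough sketch of the warm-up: a few eager iterations on a side stream so that
   // CUBLAS/CUDNN handles and CUDA caches are initialized before capture.
   c10::cuda::CUDAStream warmupStream = c10::cuda::getStreamFromPool();
   c10::cuda::setCurrentCUDAStream(warmupStream);

   constexpr int kNumWarmupIters = 7;  // assumed value, not taken from the commit
   for (int i = 0; i < kNumWarmupIters; ++i) {
     training_step(model, optimizer, data, targets, output, loss);
   }
   torch::cuda::synchronize();  // make sure the warm-up work has finished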
-After the successful graph capture, we can replace ``training_step(model, optimizer, data, targets, output, loss);``
-call via ``graph.replay();`` to do the training step.
+Once the graph capture succeeds, we can replace the ``training_step(model, optimizer, data, targets, output, loss);``
+call with ``graph.replay();`` to perform the training step.
 
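With the graph captured, the main loop conceptually becomes the following sketch (again assuming the preallocated tensors and ``train_loader`` from above):

.. code-block:: cpp

   // Rough sketch: after capture, each iteration refreshes the preallocated
   // tensors in place and replays the recorded kernels instead of re-launching them.
   for (auto& batch : *train_loader) {
     data.copy_(batch.data);
     targets.copy_(batch.target);
     graph.replay();  // runs the captured forward/backward/optimizer step
   }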
-Training Results
+Training Results
 ----------------
 
-Taking the code for a spin we can see the following output from ordinary non-graphed training:
+Taking the code for a spin, we can see the following output from ordinary non-graphed training:
 
 .. code-block:: shell
 
@@ -152,7 +152,7 @@ Taking the code for a spin we can see the following output from ordinary non-gra
   user    0m44.018s
   sys     0m1.116s
 
-While the training with the CUDA Graph produces the following output:
+Training with the CUDA Graph produces the following output:
 
 .. code-block:: shell
 
@@ -182,12 +182,11 @@ While the training with the CUDA Graph produces the following output:
   user    0m7.048s
   sys     0m0.619s
 
-Conclusion
+Conclusion
 ----------
-
-As we can see, just by applying a CUDA Graph on the `MNIST example
-<https://github.com/pytorch/examples/tree/main/cpp/mnist>`_ we were able to gain the performance
-by more than six times for training. This kind of large performance improvement was achievable due to
-the small model size. In case of larger models with heavy GPU usage, the CPU overhead is less impactful
-so the improvement will be smaller. Nevertheless, it is always advantageous to use CUDA Graphs to
-gain the performance of GPUs.
+As we can see from the example above, just by applying a CUDA Graph to the `MNIST example
+<https://github.com/pytorch/examples/tree/main/cpp/mnist>`_ we were able to improve
+training performance by more than six times.
+Such a large performance improvement was possible because the model was small.
+For larger models with heavy GPU usage, the CPU overhead has less impact, so the improvement may be smaller.
+Even in that case, it is always advantageous to use CUDA Graphs to get the best performance out of GPUs.
