docs/index.rst (+2 −3)
@@ -8,7 +8,7 @@ Welcome to Intel® Extension for PyTorch* documentation!
 
 Intel® Extension for PyTorch* extends PyTorch with optimizations for extra performance boost on Intel hardware. Most of the optimizations will be included in stock PyTorch releases eventually, and the intention of the extension is to deliver up-to-date features and optimizations for PyTorch on Intel hardware, examples include AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX).
 
-Intel® Extension for PyTorch* is structured as the following figure. It is a runtime extension. Users can enable it dynamically in script by importing `intel_extension_for_pytorch`. It covers optimizations for both imperative mode and graph mode. Optimized operators and kernels are registered through PyTorch dispatching mechanism. These operators and kernels are accelerated from native vectorization feature and matrix calculation feature of Intel hardware. During execution, Intel® Extension for PyTorch* intercepts invocation of ATen operators, and replace the original ones with these optimized ones. In graph mode, further operator fusions are applied manually by Intel engineers or through a tool named *oneDNN Graph* to reduce operator/kernel invocation overheads, and thus increase performance.
+Intel® Extension for PyTorch* is structured as shown in the following figure. It is loaded as a Python module for Python programs or linked as a C++ library for C++ programs. Users can enable it dynamically in a script by importing `intel_extension_for_pytorch`. It covers optimizations for both imperative mode and graph mode. Optimized operators and kernels are registered through the PyTorch dispatching mechanism. These operators and kernels are accelerated by the native vectorization and matrix calculation features of Intel hardware. During execution, Intel® Extension for PyTorch* intercepts invocations of ATen operators and replaces the original ones with the optimized ones. In graph mode, further operator fusions are applied manually by Intel engineers or through a tool named *oneDNN Graph* to reduce operator/kernel invocation overheads and thus increase performance.
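The enable-by-import flow described in the added paragraph looks like the minimal sketch below. It only uses the `ipex.optimize` API that the examples.md changes later in this diff already rely on; the ResNet-50 model and input shape are illustrative assumptions, not part of the file.

```
import torch
import torchvision
import intel_extension_for_pytorch as ipex  # importing the extension registers its optimized kernels

model = torchvision.models.resnet50()
model.eval()
model = model.to(memory_format=torch.channels_last)  # channels_last is optional but often faster for 4D inputs
model = ipex.optimize(model)                          # apply the extension's operator optimizations

data = torch.rand(1, 3, 224, 224).to(memory_format=torch.channels_last)
with torch.no_grad():
    output = model(data)
```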
docs/tutorials/blogs_publications.md (+2)
@@ -3,4 +3,6 @@ Blogs & Publications
 * [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/intel-facebook-boost-bfloat16.html)
 * [Accelerate PyTorch with the extension and oneDNN using Intel BF16 Technology](https://medium.com/pytorch/accelerate-pytorch-with-ipex-and-onednn-using-intel-bf16-technology-dca5b8e6b58f)
+  *Note*: APIs mentioned in it are deprecated.
 * [Scaling up BERT-like model Inference on modern CPU - Part 1 by the launcher of the extension](https://huggingface.co/blog/bert-cpu-scaling-part-1)
+* [KT Optimizes Performance for Personalized Text-to-Speech](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/KT-Optimizes-Performance-for-Personalized-Text-to-Speech/post/1337757)

 * [torch_ipex/csrc](https://github.com/intel/intel-extension-for-pytorch/tree/master/torch_ipex/csrc) - C++ library for Intel® Extension for PyTorch\*
-* [intel_extension_for_pytorch](https://github.com/intel/intel-extension-for-pytorch/tree/master/intel_extension_for_pytorch) - The actual Intel® Extension for PyTorch\* library. Everything that is not in [csrc](https://github.com/intel/intel-extension-for-pytorch/tree/master/torch_ipex/csrc) is a Python module, following the PyTorch Python frontend module structure.
 * [tests](https://github.com/intel/intel-extension-for-pytorch/tree/master/tests) - Python unit tests for Intel® Extension for PyTorch\* Python frontend.
 * [cpp](https://github.com/intel/intel-extension-for-pytorch/tree/master/tests/cpu/cpp) - C++ unit tests for Intel® Extension for PyTorch\* C++ frontend.
docs/tutorials/examples.md (+143 −23)
@@ -30,7 +30,6 @@ output = model(data)
 
 #### Complete - Float32
 
-
 ```
 import torch
 import torchvision
@@ -128,7 +127,69 @@ torch.save({
 
 ### Distributed Training
 
-Distributed training with PyTorch DDP is accelerated by oneAPI Collective Communications Library Bindings for Pytorch\* (oneCCL Bindings for Pytorch\*). More detailed information and examples are available at its [Github repo](https://github.com/intel/torch-ccl).
+Distributed training with PyTorch DDP is accelerated by oneAPI Collective Communications Library Bindings for Pytorch\* (oneCCL Bindings for Pytorch\*). The extension supports FP32 and BF16 data types. More detailed information and examples are available at its [Github repo](https://github.com/intel/torch-ccl).
+
+**Note:** When performing distributed training with the BF16 data type, please use oneCCL Bindings for Pytorch\*. Due to a PyTorch limitation, Intel® Extension for PyTorch\* does not support distributed training with the BF16 data type.
+
+model = torch.nn.parallel.DistributedDataParallel(model)
+
+for batch_idx, (data, target) in enumerate(train_loader):
+    optimizer.zero_grad()
+    # Setting memory_format to torch.channels_last could improve performance with 4D input data. This is optional.
+    data = data.to(memory_format=torch.channels_last)
+    output = model(data)
+    loss = criterion(output, target)
+    loss.backward()
+    optimizer.step()
+    print('batch_id: {}'.format(batch_idx))
+torch.save({
+    'model_state_dict': model.state_dict(),
+    'optimizer_state_dict': optimizer.state_dict(),
+}, 'checkpoint.pth')
+```
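The imports and process-group setup that precede the loop in the added example are not visible in this capture. A minimal sketch of a typical oneCCL-backed setup is given below; the module name (`oneccl_bindings_for_pytorch`, older releases use `torch_ccl`), the environment variables, and the ResNet-50/SGD choices are assumptions to check against the torch-ccl repo, not the file's verbatim content.

```
import os
import torch
import torch.distributed as dist
import torchvision
import intel_extension_for_pytorch as ipex
import oneccl_bindings_for_pytorch  # registers the 'ccl' backend; older releases: import torch_ccl

# Rendezvous settings are assumptions; adapt them to your launcher (e.g. mpirun or torchrun).
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group(
    backend='ccl',
    init_method='env://',
    world_size=int(os.environ.get('WORLD_SIZE', 1)),
    rank=int(os.environ.get('RANK', 0)),
)

model = torchvision.models.resnet50()
model.train()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
model, optimizer = ipex.optimize(model, optimizer=optimizer)
# ...after this point, wrap the model with DistributedDataParallel and run the loop shown above.
```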
 
 ## Inference
@@ -148,7 +209,7 @@ data = torch.rand(1, 3, 224, 224)
 
 import intel_extension_for_pytorch as ipex
 model = model.to(memory_format=torch.channels_last)
-model = ipex.optimize(model, dtype=torch.float32, level='O1')
+model = ipex.optimize(model)
 data = data.to(memory_format=torch.channels_last)
 
 with torch.no_grad():
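The simplification applied in this hunk and repeated in the hunks below relies on the defaults of `ipex.optimize`: to my knowledge `level='O1'` is the default optimization level and omitting `dtype` keeps the model in FP32, so the two calls in the sketch below should behave the same; this is worth verifying against the installed version's API documentation.

```
import torch
import torchvision
import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50().eval()

# Explicit form used before this change...
model_explicit = ipex.optimize(model, dtype=torch.float32, level='O1')
# ...and the simplified form used after it; both rely on the documented defaults.
model_default = ipex.optimize(model)
```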
@@ -170,7 +231,7 @@ seq_length = 512
 data = torch.randint(vocab_size, size=[batch_size, seq_length])
 
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.float32, level="O1")
+model = ipex.optimize(model)
 
 with torch.no_grad():
     model(data)
@@ -190,7 +251,7 @@ data = torch.rand(1, 3, 224, 224)
 
 import intel_extension_for_pytorch as ipex
 model = model.to(memory_format=torch.channels_last)
-model = ipex.optimize(model, dtype=torch.float32, level='O1')
+model = ipex.optimize(model)
 data = data.to(memory_format=torch.channels_last)
 
 with torch.no_grad():
@@ -216,7 +277,7 @@ seq_length = 512
 data = torch.randint(vocab_size, size=[batch_size, seq_length])
 
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.float32, level="O1")
+model = ipex.optimize(model)
 
 with torch.no_grad():
     d = torch.randint(vocab_size, size=[batch_size, seq_length])
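This hunk sits inside a TorchScript (graph-mode) example, and the tracing step that follows the context shown above is not captured here. A typical trace-and-freeze sequence looks like the sketch below; the Hugging Face `BertModel`, the checkpoint name, and the `check_trace=False, strict=False` flags (commonly needed for models that return dicts) are assumptions, not the file's verbatim content.

```
import torch
import intel_extension_for_pytorch as ipex
from transformers import BertModel  # assumption: the example's BERT comes from Hugging Face transformers

model = BertModel.from_pretrained("bert-base-uncased")
model.eval()
model = ipex.optimize(model)

vocab_size = model.config.vocab_size
batch_size = 1
seq_length = 512
data = torch.randint(vocab_size, size=[batch_size, seq_length])

with torch.no_grad():
    d = torch.randint(vocab_size, size=[batch_size, seq_length])
    # Trace with example inputs, then freeze so JIT/oneDNN Graph fusion passes apply to a self-contained graph.
    model = torch.jit.trace(model, d, check_trace=False, strict=False)
    model = torch.jit.freeze(model)
    model(data)
```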
@@ -242,7 +303,7 @@ data = torch.rand(1, 3, 224, 224)
 
 import intel_extension_for_pytorch as ipex
 model = model.to(memory_format=torch.channels_last)
-model = ipex.optimize(model, dtype=torch.bfloat16, level='O1')
+model = ipex.optimize(model, dtype=torch.bfloat16)
 data = data.to(memory_format=torch.channels_last)
 
 with torch.no_grad():
@@ -265,7 +326,7 @@ seq_length = 512
 data = torch.randint(vocab_size, size=[batch_size, seq_length])
 
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.bfloat16, level="O1")
+model = ipex.optimize(model, dtype=torch.bfloat16)
 
 with torch.no_grad():
     with torch.cpu.amp.autocast():
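For reference, a complete BF16 imperative-mode inference flow combining the two pieces shown in these hunks (the `dtype=torch.bfloat16` optimization call and the `torch.cpu.amp.autocast()` context) might look as follows; the ResNet-50 model and input shape are illustrative assumptions.

```
import torch
import torchvision
import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50()
model.eval()
model = model.to(memory_format=torch.channels_last)
model = ipex.optimize(model, dtype=torch.bfloat16)  # prepares weights/kernels for BF16

data = torch.rand(1, 3, 224, 224).to(memory_format=torch.channels_last)
with torch.no_grad():
    with torch.cpu.amp.autocast():  # runs eligible ops in BF16 via CPU autocast
        output = model(data)
```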
@@ -286,7 +347,7 @@ data = torch.rand(1, 3, 224, 224)
 
 import intel_extension_for_pytorch as ipex
 model = model.to(memory_format=torch.channels_last)
-model = ipex.optimize(model, dtype=torch.bfloat16, level='O1')
+model = ipex.optimize(model, dtype=torch.bfloat16)
 data = data.to(memory_format=torch.channels_last)
 
 with torch.no_grad():
@@ -312,7 +373,7 @@ seq_length = 512
 data = torch.randint(vocab_size, size=[batch_size, seq_length])
 
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.bfloat16, level="O1")
+model = ipex.optimize(model, dtype=torch.bfloat16)
 
 with torch.no_grad():
     with torch.cpu.amp.autocast():
@@ -349,13 +410,12 @@ for d in calibration_data_loader():
     model(d)
 conf.save('int8_conf.json', default_recipe=True)
 model = ipex.quantization.convert(model, conf, torch.rand(<shape>))
-
-with torch.no_grad():
-    model(data)
 ```
 
 #### Deployment
 
+##### Imperative Mode
+
 ```
 import torch
@@ -371,15 +431,31 @@ with torch.no_grad():
     model(data)
 ```
 
+##### Graph Mode
+
+```
+import torch
+import intel_extension_for_pytorch as ipex
+
+model = torch.jit.load('<INT8 model file>')
+model.eval()
+data = torch.rand(<shape>)
+
+with torch.no_grad():
+    model(data)
+```
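The `<INT8 model file>` loaded in the graph-mode snippet is presumably the TorchScript module produced by the calibration and conversion step earlier in this file. A hedged sketch of saving it is below; it assumes `ipex.quantization.convert` returns a TorchScript module (as the `torch.jit.load` call above implies), and the file name and input shape are illustrative only.

```
import torch
import intel_extension_for_pytorch as ipex

# `model` and `conf` come from the calibration step shown earlier in this file.
converted = ipex.quantization.convert(model, conf, torch.rand(1, 3, 224, 224))
torch.jit.save(converted, 'int8_model.pt')

# Later, at deployment time:
loaded = torch.jit.load('int8_model.pt')
```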
+
 ## C++
 
 To work with libtorch, C++ library of PyTorch, Intel® Extension for PyTorch\* provides its C++ dynamic library as well. The C++ library is supposed to handle inference workload only, such as service deployment. For regular development, please use Python interface. Comparing to usage of libtorch, no specific code changes are required, except for converting input data into channels last data format. Compilation follows the recommended methodology with CMake. Detailed instructions can be found in [PyTorch tutorial](https://pytorch.org/tutorials/advanced/cpp_export.html#depending-on-libtorch-and-building-the-application).
 
 During compilation, Intel optimizations will be activated automatically once C++ dynamic library of Intel® Extension for PyTorch\* is linked.
**Note:** Since Intel® Extension for PyTorch\* is still under development, the name of the C++ dynamic library in the master branch may differ from *libintel-ext-pt-cpu.so* shown above. Please check the name in the installation folder. The .so file name starts with *libintel-*.

Use cases that had already been optimized by Intel engineers are available at [Model Zoo for Intel® Architecture](https://github.com/IntelAI/models/tree/pytorch-r1.10-models). A bunch of PyTorch use cases for benchmarking are also available on the [Github page](https://github.com/IntelAI/models/tree/pytorch-r1.10-models/benchmarks#pytorch-use-cases). You can get performance benefits out-of-box by simply running scripts in the Model Zoo.