docs/index.rst (+2 −3)
@@ -8,7 +8,7 @@ Welcome to Intel® Extension for PyTorch* documentation!
 
 Intel® Extension for PyTorch* extends PyTorch with optimizations for extra performance boost on Intel hardware. Most of the optimizations will be included in stock PyTorch releases eventually, and the intention of the extension is to deliver up-to-date features and optimizations for PyTorch on Intel hardware, examples include AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX).
 
-Intel® Extension for PyTorch* is structured as the following figure. It is a runtime extension. Users can enable it dynamically in script by importing `intel_extension_for_pytorch`. It covers optimizations for both imperative mode and graph mode. Optimized operators and kernels are registered through PyTorch dispatching mechanism. These operators and kernels are accelerated from native vectorization feature and matrix calculation feature of Intel hardware. During execution, Intel® Extension for PyTorch* intercepts invocation of ATen operators, and replace the original ones with these optimized ones. In graph mode, further operator fusions are applied manually by Intel engineers or through a tool named *oneDNN Graph* to reduce operator/kernel invocation overheads, and thus increase performance.
+Intel® Extension for PyTorch* is structured as shown in the following figure. It is loaded as a Python module for Python programs or linked as a C++ library for C++ programs. Users can enable it dynamically in a script by importing `intel_extension_for_pytorch`. It covers optimizations for both imperative mode and graph mode. Optimized operators and kernels are registered through the PyTorch dispatching mechanism. These operators and kernels are accelerated by the native vectorization and matrix calculation features of Intel hardware. During execution, Intel® Extension for PyTorch* intercepts invocations of ATen operators and replaces the original ones with the optimized ones. In graph mode, further operator fusions are applied manually by Intel engineers or through a tool named *oneDNN Graph* to reduce operator/kernel invocation overheads and thus increase performance.
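The enable-by-import flow described in the added paragraph looks like the minimal sketch below. It only uses the `ipex.optimize` API that the examples.md changes later in this diff already rely on; the ResNet-50 model and input shape are illustrative assumptions, not part of the file.

```
import torch
import torchvision
import intel_extension_for_pytorch as ipex  # importing the extension registers its optimized kernels

model = torchvision.models.resnet50()
model.eval()
model = model.to(memory_format=torch.channels_last)  # channels_last is optional but often faster for 4D inputs
model = ipex.optimize(model)                          # apply the extension's operator optimizations

data = torch.rand(1, 3, 224, 224).to(memory_format=torch.channels_last)
with torch.no_grad():
    output = model(data)
```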
docs/tutorials/blogs_publications.md (+2)
@@ -3,4 +3,6 @@ Blogs & Publications
 * [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/intel-facebook-boost-bfloat16.html)
 * [Accelerate PyTorch with the extension and oneDNN using Intel BF16 Technology](https://medium.com/pytorch/accelerate-pytorch-with-ipex-and-onednn-using-intel-bf16-technology-dca5b8e6b58f)
+  *Note*: APIs mentioned in it are deprecated.
 * [Scaling up BERT-like model Inference on modern CPU - Part 1 by the launcher of the extension](https://huggingface.co/blog/bert-cpu-scaling-part-1)
+* [KT Optimizes Performance for Personalized Text-to-Speech](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/KT-Optimizes-Performance-for-Personalized-Text-to-Speech/post/1337757)

 * [torch_ipex/csrc](https://github.com/intel/intel-extension-for-pytorch/tree/master/torch_ipex/csrc) - C++ library for Intel® Extension for PyTorch\*
-* [intel_extension_for_pytorch](https://github.com/intel/intel-extension-for-pytorch/tree/master/intel_extension_for_pytorch) - The actual Intel® Extension for PyTorch\* library. Everything that is not in [csrc](https://github.com/intel/intel-extension-for-pytorch/tree/master/torch_ipex/csrc) is a Python module, following the PyTorch Python frontend module structure.
 * [tests](https://github.com/intel/intel-extension-for-pytorch/tree/master/tests) - Python unit tests for Intel® Extension for PyTorch\* Python frontend.
 * [cpp](https://github.com/intel/intel-extension-for-pytorch/tree/master/tests/cpu/cpp) - C++ unit tests for Intel® Extension for PyTorch\* C++ frontend.
docs/tutorials/examples.md (+143 −23)
@@ -30,7 +30,6 @@ output = model(data)
 
 #### Complete - Float32
 
-
 ```
 import torch
 import torchvision
@@ -128,7 +127,69 @@ torch.save({
 
 ### Distributed Training
 
-Distributed training with PyTorch DDP is accelerated by oneAPI Collective Communications Library Bindings for Pytorch\* (oneCCL Bindings for Pytorch\*). More detailed information and examples are available at its [Github repo](https://github.com/intel/torch-ccl).
+Distributed training with PyTorch DDP is accelerated by oneAPI Collective Communications Library Bindings for Pytorch\* (oneCCL Bindings for Pytorch\*). The extension supports FP32 and BF16 data types. More detailed information and examples are available at its [Github repo](https://github.com/intel/torch-ccl).
+
+**Note:** When performing distributed training with the BF16 data type, please use oneCCL Bindings for Pytorch\*. Due to a PyTorch limitation, Intel® Extension for PyTorch\* does not support distributed training with the BF16 data type.
+
+model = torch.nn.parallel.DistributedDataParallel(model)
+
+for batch_idx, (data, target) in enumerate(train_loader):
+    optimizer.zero_grad()
+    # Setting memory_format to torch.channels_last could improve performance with 4D input data. This is optional.
+    data = data.to(memory_format=torch.channels_last)
+    output = model(data)
+    loss = criterion(output, target)
+    loss.backward()
+    optimizer.step()
+    print('batch_id: {}'.format(batch_idx))
+torch.save({
+    'model_state_dict': model.state_dict(),
+    'optimizer_state_dict': optimizer.state_dict(),
+}, 'checkpoint.pth')
+```
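The imports and process-group setup that precede the loop in the added example are not visible in this capture. A minimal sketch of a typical oneCCL-backed setup is given below; the module name (`oneccl_bindings_for_pytorch`, older releases use `torch_ccl`), the environment variables, and the ResNet-50/SGD choices are assumptions to check against the torch-ccl repo, not the file's verbatim content.

```
import os
import torch
import torch.distributed as dist
import torchvision
import intel_extension_for_pytorch as ipex
import oneccl_bindings_for_pytorch  # registers the 'ccl' backend; older releases: import torch_ccl

# Rendezvous settings are assumptions; adapt them to your launcher (e.g. mpirun or torchrun).
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group(
    backend='ccl',
    init_method='env://',
    world_size=int(os.environ.get('WORLD_SIZE', 1)),
    rank=int(os.environ.get('RANK', 0)),
)

model = torchvision.models.resnet50()
model.train()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
model, optimizer = ipex.optimize(model, optimizer=optimizer)
# ...after this point, wrap the model with DistributedDataParallel and run the loop shown above.
```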
 
 ## Inference
@@ -148,7 +209,7 @@ data = torch.rand(1, 3, 224, 224)
 
 import intel_extension_for_pytorch as ipex
 model = model.to(memory_format=torch.channels_last)
-model = ipex.optimize(model, dtype=torch.float32, level='O1')
+model = ipex.optimize(model)
 data = data.to(memory_format=torch.channels_last)
 
 with torch.no_grad():
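The simplification applied in this hunk and repeated in the hunks below relies on the defaults of `ipex.optimize`: to my knowledge `level='O1'` is the default optimization level and omitting `dtype` keeps the model in FP32, so the two calls in the sketch below should behave the same; this is worth verifying against the installed version's API documentation.

```
import torch
import torchvision
import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50().eval()

# Explicit form used before this change...
model_explicit = ipex.optimize(model, dtype=torch.float32, level='O1')
# ...and the simplified form used after it; both rely on the documented defaults.
model_default = ipex.optimize(model)
```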
@@ -170,7 +231,7 @@ seq_length = 512
 data = torch.randint(vocab_size, size=[batch_size, seq_length])
 
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.float32, level="O1")
+model = ipex.optimize(model)
 
 with torch.no_grad():
     model(data)
@@ -190,7 +251,7 @@ data = torch.rand(1, 3, 224, 224)
 
 import intel_extension_for_pytorch as ipex
 model = model.to(memory_format=torch.channels_last)
-model = ipex.optimize(model, dtype=torch.float32, level='O1')
+model = ipex.optimize(model)
 data = data.to(memory_format=torch.channels_last)
 
 with torch.no_grad():
@@ -216,7 +277,7 @@ seq_length = 512
 data = torch.randint(vocab_size, size=[batch_size, seq_length])
 
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.float32, level="O1")
+model = ipex.optimize(model)
 
 with torch.no_grad():
     d = torch.randint(vocab_size, size=[batch_size, seq_length])
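This hunk sits inside a TorchScript (graph-mode) example, and the tracing step that follows the context shown above is not captured here. A typical trace-and-freeze sequence looks like the sketch below; the Hugging Face `BertModel`, the checkpoint name, and the `check_trace=False, strict=False` flags (commonly needed for models that return dicts) are assumptions, not the file's verbatim content.

```
import torch
import intel_extension_for_pytorch as ipex
from transformers import BertModel  # assumption: the example's BERT comes from Hugging Face transformers

model = BertModel.from_pretrained("bert-base-uncased")
model.eval()
model = ipex.optimize(model)

vocab_size = model.config.vocab_size
batch_size = 1
seq_length = 512
data = torch.randint(vocab_size, size=[batch_size, seq_length])

with torch.no_grad():
    d = torch.randint(vocab_size, size=[batch_size, seq_length])
    # Trace with example inputs, then freeze so JIT/oneDNN Graph fusion passes apply to a self-contained graph.
    model = torch.jit.trace(model, d, check_trace=False, strict=False)
    model = torch.jit.freeze(model)
    model(data)
```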
@@ -242,7 +303,7 @@ data = torch.rand(1, 3, 224, 224)
 
 import intel_extension_for_pytorch as ipex
 model = model.to(memory_format=torch.channels_last)
-model = ipex.optimize(model, dtype=torch.bfloat16, level='O1')
+model = ipex.optimize(model, dtype=torch.bfloat16)
 data = data.to(memory_format=torch.channels_last)
 
 with torch.no_grad():
@@ -265,7 +326,7 @@ seq_length = 512
 data = torch.randint(vocab_size, size=[batch_size, seq_length])
 
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.bfloat16, level="O1")
+model = ipex.optimize(model, dtype=torch.bfloat16)
 
 with torch.no_grad():
     with torch.cpu.amp.autocast():
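For reference, a complete BF16 imperative-mode inference flow combining the two pieces shown in these hunks (the `dtype=torch.bfloat16` optimization call and the `torch.cpu.amp.autocast()` context) might look as follows; the ResNet-50 model and input shape are illustrative assumptions.

```
import torch
import torchvision
import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50()
model.eval()
model = model.to(memory_format=torch.channels_last)
model = ipex.optimize(model, dtype=torch.bfloat16)  # prepares weights/kernels for BF16

data = torch.rand(1, 3, 224, 224).to(memory_format=torch.channels_last)
with torch.no_grad():
    with torch.cpu.amp.autocast():  # runs eligible ops in BF16 via CPU autocast
        output = model(data)
```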
@@ -286,7 +347,7 @@ data = torch.rand(1, 3, 224, 224)
 
 import intel_extension_for_pytorch as ipex
 model = model.to(memory_format=torch.channels_last)
-model = ipex.optimize(model, dtype=torch.bfloat16, level='O1')
+model = ipex.optimize(model, dtype=torch.bfloat16)
 data = data.to(memory_format=torch.channels_last)
 
 with torch.no_grad():
@@ -312,7 +373,7 @@ seq_length = 512
 data = torch.randint(vocab_size, size=[batch_size, seq_length])
 
 import intel_extension_for_pytorch as ipex
-model = ipex.optimize(model, dtype=torch.bfloat16, level="O1")
+model = ipex.optimize(model, dtype=torch.bfloat16)
 
 with torch.no_grad():
     with torch.cpu.amp.autocast():
@@ -349,13 +410,12 @@ for d in calibration_data_loader():
     model(d)
 conf.save('int8_conf.json', default_recipe=True)
 model = ipex.quantization.convert(model, conf, torch.rand(<shape>))
-
-with torch.no_grad():
-    model(data)
 ```
 
 #### Deployment
 
+##### Imperative Mode
+
 ```
 import torch
@@ -371,15 +431,31 @@ with torch.no_grad():
     model(data)
 ```
 
+##### Graph Mode
+
+```
+import torch
+import intel_extension_for_pytorch as ipex
+
+model = torch.jit.load('<INT8 model file>')
+model.eval()
+data = torch.rand(<shape>)
+
+with torch.no_grad():
+    model(data)
+```
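The `<INT8 model file>` loaded in the graph-mode snippet is presumably the TorchScript module produced by the calibration and conversion step earlier in this file. A hedged sketch of saving it is below; it assumes `ipex.quantization.convert` returns a TorchScript module (as the `torch.jit.load` call above implies), and the file name and input shape are illustrative only.

```
import torch
import intel_extension_for_pytorch as ipex

# `model` and `conf` come from the calibration step shown earlier in this file.
converted = ipex.quantization.convert(model, conf, torch.rand(1, 3, 224, 224))
torch.jit.save(converted, 'int8_model.pt')

# Later, at deployment time:
loaded = torch.jit.load('int8_model.pt')
```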
+
 ## C++
 
 To work with libtorch, C++ library of PyTorch, Intel® Extension for PyTorch\* provides its C++ dynamic library as well. The C++ library is supposed to handle inference workload only, such as service deployment. For regular development, please use Python interface. Comparing to usage of libtorch, no specific code changes are required, except for converting input data into channels last data format. Compilation follows the recommended methodology with CMake. Detailed instructions can be found in [PyTorch tutorial](https://pytorch.org/tutorials/advanced/cpp_export.html#depending-on-libtorch-and-building-the-application).
 
 During compilation, Intel optimizations will be activated automatically once C++ dynamic library of Intel® Extension for PyTorch\* is linked.
**Note:** Since Intel® Extension for PyTorch\* is still under development, the name of the C++ dynamic library in the master branch may differ from *libintel-ext-pt-cpu.so* shown above. Please check the name in the installation folder. The .so file name starts with *libintel-*.

Use cases that had already been optimized by Intel engineers are available at [Model Zoo for Intel® Architecture](https://github.com/IntelAI/models/tree/pytorch-r1.10-models). A bunch of PyTorch use cases for benchmarking are also available on the [Github page](https://github.com/IntelAI/models/tree/pytorch-r1.10-models/benchmarks#pytorch-use-cases). You can get performance benefits out-of-box by simply running scripts in the Model Zoo.