[Intel GPU] Docs of XPUInductorQuantizer #3293

ZhiweiYan-96 · 2025-03-18T05:45:07Z

Description

Add a tutorial for XPUInductorQuantizer, which serves as the INT8 quantization backend for Intel GPU in PT2E.

cc @gujinghui @EikanWang @fengyuan14 @guangyey

pytorch-bot · 2025-03-18T05:45:11Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3293

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[Infra] Jobs got intermittently cancelled/fail midway checkout

This comment was automatically generated by Dr. CI and updates every 15 minutes.


CuiYifeng · 2025-03-18T08:12:37Z

prototype_source/pt2e_quant_xpu_inductor.rst

+::
+    quantizer = XPUInductorQuantizer()
+    quantizer.set_global(get_xpu_inductor_symm_quantization_config())
+


The code format has not taken effect.

Thanks for the reminder; I added the fix.

riverliuintel · 2025-03-23T23:13:18Z

prototype_source/prototype_index.rst

   :tags: Quantization

+.. customcarditem::
+   :header: PyTorch 2 Export Quantization with Intel GPU Backend through Inductor


At an earlier stage, when we uploaded the RFCs, we recommended using "GPU" instead of "XPU" for user readability. Has this naming decision changed?

riverliuintel · 2025-03-23T23:13:46Z

prototype_source/pt2e_quant_xpu_inductor.rst

@@ -0,0 +1,234 @@
+PyTorch 2 Export Quantization with Intel GPU Backend through Inductor


Suggested change

PyTorch 2 Export Quantization with Intel GPU Backend through Inductor

Export Quantization with Intel GPU Backend through Inductor

riverliuintel · 2025-03-23T23:16:09Z

prototype_source/pt2e_quant_xpu_inductor.rst

+utilizes PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.
+
+The pytorch 2 export quantization flow uses the torch.export to capture the model into a graph and perform quantization transformations on top of the ATen graph.
+This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.


This approach is expected to have significantly higher model coverage with better programmability and a simplified user experience.

Thanks for the suggestions; modified.

riverliuintel · 2025-03-23T23:18:04Z

prototype_source/pt2e_quant_xpu_inductor.rst

+The quantization flow mainly includes three steps:
+
+- Step 1: Capture the FX Graph from the eager Model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
+- Step 2: Apply the Quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,


Apply the quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,

Thanks for the suggestion; I've changed the description here.
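For readers new to Step 2, the sketch below shows in plain Python what an observer inserted by the prepare step conceptually does during calibration: it only records the running min/max, and the quantization parameters are derived afterwards. `MinMaxObserverSketch` is a hypothetical helper, not a PyTorch class; the real flow uses `prepare_pt2e`, a calibration loop, and `convert_pt2e`. The symmetric scale formula assumes signed int8 (`quant_min=-128`, `quant_max=127`) and, to our understanding, mirrors the one used by PyTorch's min/max observers.

```python
class MinMaxObserverSketch:
    """Hypothetical observer: records the running min/max of what it sees."""

    def __init__(self, quant_min=-128, quant_max=127, eps=1e-12):
        self.quant_min = quant_min
        self.quant_max = quant_max
        self.eps = eps
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def __call__(self, values):
        # During calibration only statistics are collected;
        # the data passes through unchanged.
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))
        return values

    def calculate_qparams(self):
        # Symmetric scheme: the zero point is 0 and the scale is chosen so
        # that the largest observed magnitude maps to the edge of the range.
        max_abs = max(-min(self.min_val, 0.0), max(self.max_val, 0.0))
        scale = max(max_abs / ((self.quant_max - self.quant_min) / 2), self.eps)
        return scale, 0


# "Calibration": feed a few representative batches through the observer.
observer = MinMaxObserverSketch()
for batch in ([0.5, -1.2, 2.0], [1.0, -2.55, 0.3]):
    observer(batch)

scale, zero_point = observer.calculate_qparams()
print(f"scale={scale:.4f}, zero_point={zero_point}")
```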

riverliuintel · 2025-03-23T23:23:01Z

prototype_source/pt2e_quant_xpu_inductor.rst

+  performing the prepared model's calibration, and converting the prepared model into the quantized model.
+- Step 3: Lower the quantized model into inductor with the API ``torch.compile``. 
+
+During Step 3, the inductor would decide which kernels are dispatched into. There are two kinds of kernels the Intel GPU would obtain benefits, oneDNN kernels and triton kernels. `Intel oneAPI Deep Neural Network Library (oneDNN) <https://github.com/uxlfoundation/oneDNN>`_ contains 


If this is end-user documentation, I think we could focus on PyTorch itself and remove this explanation.

Thanks for the suggestion. I removed the lengthy description of oneDNN and Triton and added a brief mention in Step 3 above instead.

riverliuintel · 2025-03-23T23:24:14Z

prototype_source/pt2e_quant_xpu_inductor.rst

+Post Training Quantization
+----------------------------
+
+Static quantization is the only method we support currently. QAT and dynamic quantization will be available in later versions.


Remove the forward-looking statement from the current introduction: "QAT and dynamic quantization will be available in later versions."

Thanks for the suggestion; removed.
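Because only static post-training quantization is covered, activation scales are fixed once calibration ends. The plain-Python sketch below, with hypothetical helper names and an assumed calibrated scale of 0.02, shows the consequence: the same scale is reused for every inference input, and values outside the calibrated range saturate at the int8 limits.

```python
def quantize(x, scale, zero_point=0, quant_min=-128, quant_max=127):
    # Round to the nearest quantization step, then clamp into the int8 range.
    q = round(x / scale) + zero_point
    return max(quant_min, min(quant_max, q))


def dequantize(q, scale, zero_point=0):
    return (q - zero_point) * scale


# Assumed result of calibration: frozen and reused for all later inputs.
STATIC_SCALE = 0.02

for x in (1.0, -1.5, 4.0, -4.0):
    q = quantize(x, STATIC_SCALE)
    print(f"x={x:+.2f} -> q={q:+4d} -> x_hat={dequantize(q, STATIC_SCALE):+.2f}")
```

Python's `round` breaks ties to the nearest even integer, which is also the tie-breaking rule of `torch.round`.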

riverliuintel · 2025-03-23T23:26:12Z

prototype_source/pt2e_quant_xpu_inductor.rst

+
+::
+
+    pip install torchvision pytorch-triton-xpu --index-url https://download.pytorch.org/whl/nightly/xpu


Let's use the standard "pip install torch torchvision torchaudio" rather than separate commands that highlight internal dependencies.

We may need to keep using our own channel: torchvision is customized for XPU, and users need to be able to run the examples in this doc successfully. The standard channel would produce a runtime error. After syncing with @jingxu10, I changed the command to pip3 install torch torchvision torchaudio pytorch-triton-xpu --index-url https://download.pytorch.org/whl/xpu instead of using the nightly wheel.

CuiYifeng · 2025-03-24T07:14:40Z

prototype_source/pt2e_quant_xpu_inductor.rst

+
+The high-level architecture of this flow could look like this:
+
+.. image:: ../_static/img/pt2e_quant_xpu_inductor.png


Please note that Float Model, Example Input, and XPUInductorQuantizer are invisible in dark mode.

Thanks for the reminder; the picture has been modified.

CuiYifeng · 2025-03-24T07:17:05Z

prototype_source/pt2e_quant_xpu_inductor.rst

+PyTorch 2 Export Quantization with Intel GPU Backend through Inductor
+==================================================================
+
+**Author**: `Yan Zhiwei <https://github.com/ZhiweiYan-96>`_, `Wang Eikan <https://github.com/EikanWang>`_, `Zhang, Liangang <https://github.com/liangan1>`_, `Liu River <https://github.com/riverliuintel>`_, `Cui Yifeng <https://github.com/CuiYifeng>`_


Please unify the style of names.

thanks, modified

CuiYifeng · 2025-03-24T07:37:58Z

prototype_source/pt2e_quant_xpu_inductor.rst

+            quant_min=-128,
+            quant_max=127,
+            qscheme=torch.per_tensor_symmetric,


Please consider whether we need more detailed annotations here to explain the meaning of these key parameters to users.

thanks, explanation is added.
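To illustrate the parameters quoted above (`dtype=torch.int8`, `quant_min=-128`, `quant_max=127`, `qscheme=torch.per_tensor_symmetric`), here is a plain-Python round trip: one scale covers the whole tensor, the zero point is fixed at 0, and in-range values come back with an error of at most half a quantization step. The helper names are hypothetical, and lists stand in for tensors.

```python
def per_tensor_symmetric_qparams(values, quant_min=-128, quant_max=127, eps=1e-12):
    # A single scale for the entire tensor; symmetric means zero_point == 0.
    max_abs = max(abs(v) for v in values)
    return max(max_abs / ((quant_max - quant_min) / 2), eps), 0


def fake_quantize(values, scale, zero_point, quant_min=-128, quant_max=127):
    # quantize -> clamp to [quant_min, quant_max] -> dequantize
    out = []
    for v in values:
        q = min(quant_max, max(quant_min, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


activation = [0.5, -1.2, 2.55, 0.0, -0.013]
scale, zero_point = per_tensor_symmetric_qparams(activation)
recovered = fake_quantize(activation, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(activation, recovered))
print(f"scale={scale:.4f}, max round-trip error={max_err:.4f}")
```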

CuiYifeng · 2025-03-24T07:38:55Z

prototype_source/pt2e_quant_xpu_inductor.rst

+            dtype=torch.int8,
+            quant_min=-128,
+            quant_max=127,
+            qscheme=torch.per_channel_symmetric,


thanks, explanation is added.
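For the `torch.per_channel_symmetric` configuration, each output channel of the weight gets its own scale while the zero points stay 0. The plain-Python sketch below, with hypothetical helper names and made-up weights, compares it with a single per-tensor scale: a channel with small weights uses most of the int8 range when it has its own scale, but collapses to a handful of levels when it shares a scale with a channel of large weights.

```python
def per_channel_symmetric_qparams(weight_rows, quant_min=-128, quant_max=127, eps=1e-12):
    # weight_rows[c] holds every weight of output channel c (channel axis 0).
    half_range = (quant_max - quant_min) / 2
    scales = [max(max(abs(w) for w in row) / half_range, eps) for row in weight_rows]
    return scales, [0] * len(weight_rows)


def quantize_row(row, scale, quant_min=-128, quant_max=127):
    return [min(quant_max, max(quant_min, round(w / scale))) for w in row]


weight = [
    [0.01, -0.02, 0.015],  # channel 0: small magnitudes
    [1.5, -2.55, 0.8],     # channel 1: large magnitudes
]

scales, zero_points = per_channel_symmetric_qparams(weight)
q_per_channel = [quantize_row(row, s) for row, s in zip(weight, scales)]

# Comparison: one per-tensor scale, dictated by the largest weight overall.
tensor_scale = max(abs(w) for row in weight for w in row) / 127.5
q_per_tensor = [quantize_row(row, tensor_scale) for row in weight]

print("per-channel, channel 0:", q_per_channel[0])
print("per-tensor,  channel 0:", q_per_tensor[0])
```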


alexsin368 · 2025-03-28T23:14:51Z

prototype_source/pt2e_quant_xpu_inductor.rst

+--------------
+
+This tutorial introduces XPUInductorQuantizer aiming for serving the quantized model inference on Intel GPUs. The tutorial will cover how it 
+utilizes PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.


What are you trying to say in this phrase: "lowers the quantized model into the inductor"?

It's the terminology in torch.compile

CuiYifeng · 2025-04-01T07:43:39Z

prototype_source/pt2e_quant_xpu_inductor.rst

+        optimized_model(*example_inputs)
+
+In a more advanced scenario, int8-mixed-bf16 quantization comes into play. In this instance,
+a convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence


Suggested change

a convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence

a Convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence

or

Suggested change

a convolution or GEMM operator produces the output in BFloat16 instead of Float32 in the absence

a Conv or GEMM operator produces the output in BFloat16 instead of Float32 in the absence

Thanks for the suggestion. We may keep it as is, since it is used as a common noun here.
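To make the int8-mixed-bf16 case more concrete: when a convolution or GEMM is not followed by a quantization node, its dequantized output can be emitted in BFloat16 rather than Float32, which halves the storage per element. The plain-Python sketch below is an illustration under assumptions, not the Inductor or oneDNN kernel: it accumulates int8 products in a wide integer, dequantizes with the product of the input and weight scales, and then rounds the result to bfloat16 with round-to-nearest-even (NaN/Inf handling omitted). All names and values are hypothetical.

```python
import struct


def to_float32(x):
    # Round a Python float (float64) to the nearest float32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bfloat16(x):
    # Pack as float32, then round to bfloat16 (round-to-nearest-even)
    # by keeping only the upper 16 bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def int8_linear(x_q, w_q, x_scale, w_scale, bias, out_dtype="fp32"):
    # int8 x int8 products accumulated in a wide integer (int32 in real kernels).
    acc = sum(a * b for a, b in zip(x_q, w_q))
    # Dequantize once using the product of the two scales, then add the bias.
    y = acc * (x_scale * w_scale) + bias
    return to_bfloat16(y) if out_dtype == "bf16" else to_float32(y)


x_q = [50, -60, 127]   # quantized activation, scale 0.02
w_q = [100, 25, -3]    # quantized weight row, scale 0.01
y_fp32 = int8_linear(x_q, w_q, 0.02, 0.01, bias=0.1)
y_bf16 = int8_linear(x_q, w_q, 0.02, 0.01, bias=0.1, out_dtype="bf16")
print(y_fp32, y_bf16)
```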

CuiYifeng · 2025-04-01T07:49:41Z

prototype_source/pt2e_quant_xpu_inductor.rst

+--------------
+
+This tutorial introduces XPUInductorQuantizer, which aims to serve quantized models for inference on Intel GPUs.
+It utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.


Can we standardize capitalization of Inductor?

Thanks for the reminder; the style is aligned now.

ZhiweiYan-96 · 2025-04-02T05:05:56Z

Hi @svekars @AlannaBurke, could you please help review our documentation? This PR adds a tutorial for PT2E INT8 quantization on the Intel GPU backend. We appreciate your feedback and suggestions.

svekars

A few editorial suggestions.


svekars · 2025-04-04T21:30:20Z

prototype_source/pt2e_quant_xpu_inductor.rst

@@ -0,0 +1,234 @@
+PyTorch 2 Export Quantization with Intel GPU Backend through Inductor


Suggested change

PyTorch 2 Export Quantization with Intel GPU Backend through Inductor

Export Quantization with Intel GPU Backend through Inductor


svekars · 2025-04-04T21:32:37Z

prototype_source/pt2e_quant_xpu_inductor.rst

+This tutorial introduces XPUInductorQuantizer, which aims to serve quantized models for inference on Intel GPUs.
+It utilizes the PyTorch 2 Export Quantization flow and lowers the quantized model into the inductor.
+
+The Pytorch 2 Export Quantization flow uses `torch.export` to capture the model into a graph and perform quantization transformations on top of the ATen graph.


Do we need to call it "PyTorch 2 Export Quantization flow" or can it be just "Export Quantization flow"?

Suggested change

The Pytorch 2 Export Quantization flow uses `torch.export` to capture the model into a graph and perform quantization transformations on top of the ATen graph.

The PyTorch 2 Export Quantization flow uses ``torch.export`` to capture the model into a graph and perform quantization transformations on top of the ATen graph.

Hi @svekars, "PyTorch 2 Export" here is the full name of pt2e, as used in APIs like prepare_pt2e and convert_pt2e. Could we keep it the same as in the X86InductorQuantizer tutorial here: https://pytorch.org/tutorials/prototype/pt2e_quant_x86_inductor.html?

Sounds good!


AlannaBurke

Update with @svekars's suggestions and then I think this will be good. Also requested a review from @HamidShojanazeri.

Co-authored-by: Svetlana Karslioglu <[email protected]>

Co-authored-by: alexsin368 <[email protected]>

ZhiweiYan-96 · 2025-04-09T05:04:24Z

Hi @AlannaBurke @svekars @HamidShojanazeri, I've applied the suggestions in the latest commits. Could you please review it again and approve it if there are no further issues in this tutorial? Many thanks for your advice.

jingxu10 · 2025-04-15T07:09:14Z

Hi @svekars , any updates?

[Intel GPU] Docs of XPUInductorQuantizer

19a568a

facebook-github-bot added the cla signed label Mar 18, 2025

ZhiweiYan-96 added 3 commits March 18, 2025 06:38

refine

6bfa4d7

syntax

a8e8d8a

refine

63a63cc

CuiYifeng suggested changes Mar 18, 2025


xiaolil1 added 4 commits March 18, 2025 10:16

refine

1b3cf01

Add img

6a9640a

fix path

57985bf

picture layout

4c4069a

ZhiweiYan-96 marked this pull request as draft March 19, 2025 05:48

svekars added the 2.7 label Mar 19, 2025

word spelling

5a6663a

svekars requested a review from AlannaBurke March 21, 2025 16:14

Merge branch 'main' into zhiwei/xpu_quant

79d56de

riverliuintel suggested changes Mar 23, 2025


description, code change

6a4f748

ZhiweiYan-96 requested a review from riverliuintel March 24, 2025 06:07

ZhiweiYan-96 added 2 commits March 24, 2025 06:16

Mov back to prototype_source

e7e2275

fix bug

d741cf9

ZhiweiYan-96 requested a review from CuiYifeng March 24, 2025 06:52

CuiYifeng suggested changes Mar 24, 2025


style, image

9a0f0c7

ZhiweiYan-96 requested a review from CuiYifeng March 24, 2025 08:10

CuiYifeng approved these changes Mar 24, 2025


Merge branch 'main' into zhiwei/xpu_quant

aae7c51

alexsin368 reviewed Mar 28, 2025

prototype_source/pt2e_quant_xpu_inductor.rst (outdated, resolved)

alexsin368 reviewed Mar 28, 2025

prototype_source/pt2e_quant_xpu_inductor.rst (outdated, resolved)

alexsin368 reviewed Mar 28, 2025

refine

6a43ce0

CuiYifeng reviewed Apr 1, 2025


refine

a6516c9

ZhiweiYan-96 marked this pull request as ready for review April 1, 2025 08:01

ZhiweiYan-96 requested a review from alexsin368 April 1, 2025 13:30

svekars reviewed Apr 4, 2025


AlannaBurke assigned HamidShojanazeri Apr 7, 2025

AlannaBurke requested review from HamidShojanazeri and removed request for alexsin368 and riverliuintel April 7, 2025 19:30

AlannaBurke added the module: xpu XPU related issues label Apr 7, 2025

AlannaBurke requested changes Apr 7, 2025


ZhiweiYan-96 and others added 6 commits April 8, 2025 13:48

Apply suggestions from code review

d04de57

Co-authored-by: Svetlana Karslioglu <[email protected]>

Update prototype_source/pt2e_quant_xpu_inductor.rst

8e0293a

Co-authored-by: alexsin368 <[email protected]>

fix syntax

980079c

syntax

640fa94

refine

17ad15a

Merge branch 'main' into zhiwei/xpu_quant

41fc5b6

ZhiweiYan-96 requested review from AlannaBurke, alexsin368, riverliuintel and svekars April 15, 2025 15:09

svekars added 2 commits April 15, 2025 13:37

Merge branch 'main' into zhiwei/xpu_quant

93f764b

Merge branch 'main' into zhiwei/xpu_quant

bd34671

svekars approved these changes Apr 16, 2025


Merge branch 'main' into zhiwei/xpu_quant

b22e049

svekars merged commit 459084a into pytorch:main Apr 18, 2025
16 of 17 checks passed
