Enable Intel GPU #753


Closed · wants to merge 1 commit

Conversation

@dbyoung18 (Contributor) commented on Aug 27, 2024

This PR is migrated from gpt-fast #79. We would like to add initial support for Intel GPU in torch-ao via the device option "xpu" (i.e., --device "xpu"). Currently, both BF16 and INT8 are functionally supported under eager mode and compile mode. INT4 support and further performance improvements are WIP.

Here are the steps to run Llama2-7b and Llama3-8b generation on Intel GPU with torch-ao (a quick XPU availability check is sketched after the commands). We will update the tutorial later with improved performance.

Launch

  1. Command for BF16
    python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --precision torch.bfloat16
  2. Command for INT8 dynamic quantization
    python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --quantization int8dq
  3. Command for INT8 weight-only quantization
    python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --quantization int8wo
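
Before launching, a minimal sketch (not part of the PR) to confirm that the XPU backend is visible to PyTorch; it assumes a PyTorch build with Intel GPU support:

    import torch

    # Fall back to CPU if this build has no XPU backend or no Intel GPU is visible.
    device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
    print(f"selected device: {device}")
    if device == "xpu":
        # Report the first Intel GPU's name, useful when verifying the environment.
        print(torch.xpu.get_device_name(0))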

pytorch-bot (bot) commented on Aug 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/753

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor) commented

Hi @dbyoung18!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@msaroufim (Member) commented on Aug 27, 2024

Thanks for your PR @dbyoung18. My preference here would be to land generic accelerator memory APIs in core and then use those. That way we wouldn't need to ask people who are trying to use Intel GPUs to change their code; it'd be something like torch.get_accelerator().max_memory_reserved() or torch.accelerator.max_memory_reserved().

@guangyey is doing some work on this at Intel and can share more information on the current plan of record.
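
As a rough illustration of that idea (a hypothetical helper, not an existing torch API), the call sites in generate.py could route through a single shim instead of per-device branches:

    import torch

    def max_memory_reserved_bytes(device: str) -> int:
        # Hypothetical shim: dispatch on the device string until a generic
        # torch.accelerator memory API is available in core.
        if "cuda" in device:   # ROCm/HIP builds also expose torch.cuda
            return torch.cuda.max_memory_reserved()
        if "xpu" in device:    # assumes the XPU memory APIs from #129919
            return torch.xpu.max_memory_reserved()
        return 0               # CPU or other backends without this statistic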

@facebook-github-bot added the CLA Signed label on Aug 28, 2024
@facebook-github-bot (Contributor) commented

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@guangyey commented

Hi @msaroufim and @dbyoung18, let me explain the plan.
In the long term, we have a proposal to provide device-agnostic APIs for each accelerator. We would like to start with the runtime device and stream components, and then gradually cover the allocator memory APIs. Our RFC is [RFC] A device-agnostic Python runtime API design for stream-based accelerators.
In the short term, for XPU, we plan to provide those memory APIs first so that customer usage is not blocked. We have prepared a series of PRs to implement them. You can refer to #129919; it will land soon if everything goes well.

@dbyoung18 (Contributor, Author) commented on Aug 28, 2024

Converting to draft first, pending #129919 being ready.

@EikanWang commented

@dbyoung18, may I know why the change is in torchao/_models/llama/generate.py only?

@dbyoung18 (Contributor, Author) commented

> @dbyoung18, may I know why the change is in torchao/_models/llama/generate.py only?

Hi @EikanWang. We have a plan to gradually support torch-ao on Intel GPU with different models (Llama2, Llama3, SAM, etc.) and different features (BF16/INT8/INT4/FP8, etc.). As the first step, we chose Llama2 and Llama3 BF16 as the starting point. With this PR, Llama2-7b and Llama3-8b can run BF16 on Intel GPU under both eager mode and compile mode by passing --device xpu to the launch commands, and INT8 can be launched with intel/intel-xpu-backend-for-triton under compile mode. We are also working to upstream INT8/INT4/FP8 support on Intel GPU with oneDNN to PyTorch core. Once that upstream work is available in stock PyTorch, we will continue our contributions to torch-ao to make the library more broadly available and powerful on different platforms.

@dbyoung18 marked this pull request as ready for review on September 30, 2024 at 13:17
@malfet left a comment

I'll leave it up to the repo maintainers, but IMO one needs to think a bit more about a unified device approach rather than migrating long strings of elifs from repo to repo.

@@ -369,7 +381,8 @@ def callback(x):

tokpersec = torch.mean(torch.tensor(aggregate_metrics['tokens_per_sec'])).item()
bandwidth = model_size * tokpersec
mem = torch.cuda.max_memory_reserved() /1e9
max_memory_reserved = torch.cuda.max_memory_reserved() if "cuda" in device else torch.xpu.max_memory_reserved()

This feels wrong, as it will dispatch to xpu for HIP devices as well, wouldn't it?

Comment on lines 325 to +326
else:
torch.profiler._utils._init_for_cuda_graphs()
prof = torch.profiler.profile()
if "cuda" in device:

Please dismantle the pyramid of doom and use elif "cuda" in device: rather than else: with a nested if "cuda" in device:
And see my comment below again about "hip"
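
For illustration, a hedged sketch of the flattened structure being asked for; the outer condition and placeholder values are assumptions, not the exact code in generate.py:

    import contextlib
    import torch

    profile, device = False, "xpu"   # placeholder values for this sketch

    # Flatten the nested branch: use elif instead of else: plus an inner if.
    if not profile:                  # simplified stand-in for the real condition
        prof = contextlib.nullcontext()
    elif "cuda" in device:
        torch.profiler._utils._init_for_cuda_graphs()
        prof = torch.profiler.profile()
    else:
        prof = torch.profiler.profile()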

@@ -288,7 +290,10 @@ def main(

for i in range(start, num_samples):
if i==0:
torch.cuda.reset_peak_memory_stats()
if "cuda" in device:

Same as below: can you please check whether torch.cuda.reset_peak_memory_stats also needs to be applied for HIP?
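
For reference, a hedged sketch of a device-aware reset; on ROCm builds torch.cuda.* maps to HIP, so the "cuda" branch would also cover HIP devices, and the xpu call assumes the memory APIs tracked in #129919:

    import torch

    def reset_peak_memory_stats(device: str) -> None:
        # Sketch only: reset peak-memory statistics for the active backend.
        if "cuda" in device:    # also covers ROCm/HIP, where torch.cuda maps to HIP
            torch.cuda.reset_peak_memory_stats()
        elif "xpu" in device:   # assumes torch.xpu memory APIs (#129919)
            torch.xpu.reset_peak_memory_stats()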

@msaroufim (Member) commented

I'm already planning on writing an RFC on how we'll support more hardware architectures. Right now ao is very much NVIDIA-centric, but a lot of recent issues have been about supporting more hardware architectures on more operating systems. We need to think about generalizing devices, CI/testing, and performance carefully.

@EikanWang commented

We are working on device-agnostic runtime APIs for accelerators. They may help ao support more hardware architectures.

@malfet , @msaroufim FYI - https://dev-discuss.pytorch.org/t/python-c-api-rules-for-device-generic-apis/2511

@msaroufim (Member) commented

@EikanWang are there any GitHub runners for Intel GPUs to ensure our test suite works? We don't have to run the code per commit, but at least a nightly check would be helpful to make sure we understand what works and what doesn't.

@EikanWang commented

@msaroufim , may I know if two runners for Intel GPUs are good enough now for the ao nightly?

@msaroufim (Member) commented

> @msaroufim , may I know if two runners for Intel GPUs are good enough now for the ao nightly?

Yup that should be fine! We won't be running on Intel runners per commit for now. cc @atalman @seemethere as well

@EikanWang commented

Sounds good! We will add two runners to the Intel GPU CI/CD resource pool and reserve those two runners for the ao nightly.

cc @riverliuintel, @chuanqi129

@chuanqi129 (Contributor) commented

> Sounds good! We will add two runners to the Intel GPU CI/CD resource pool and reserve those two runners for the ao nightly.
>
> cc @riverliuintel, @chuanqi129

Currently, we have 16 PyTorch organization-level XPU runners with the label "linux.idc.xpu" used for PyTorch CI/CD. I think the torchao repo can use them directly.

@EikanWang commented

@chuanqi129, @riverliuintel, any update?

@msaroufim (Member) commented

They'd need to be hooked up to the Nova workflows as well; see #999, which ran into some issues too.

@EikanWang commented

@msaroufim, may I know what "Nova workflows" means? Is it an ao-specific workflow?

@msaroufim (Member) commented

@EikanWang we leverage some reusable GitHub workflows (https://github.com/pytorch/ao/blob/main/.github/workflows/regression_test.yml#L68) produced by pytorch/test-infra; this lets us easily build and test ao on multiple architectures and devices.

We could potentially do a one-off run of our test suite to see what works in ao out of the box today, but it will be hard to track progress without the CI integration.

As for how to integrate with the Nova workflows, your best bet is to reach out to @seemethere and @atalman on the Intel Slack channel. Feel free to tag me there as well so we can move faster.

@msaroufim mentioned this pull request on Oct 15, 2024
@mingfeima commented

@dbyoung18 does this one support int4 woq?

@dbyoung18 (Contributor, Author) commented

> @dbyoung18 does this one support int4 woq?

Currently, it doesn't support int4 woq on Intel GPU. We are in the process of upstreaming INT4 XPU backend support to PyTorch (targeting v2.5). Once that upstream work is ready, we will continue adding support on the ao side.

@dbyoung18 (Contributor, Author) commented

Closing as a duplicate of PR ao#1259. Thanks for the review comments above.

@dbyoung18 closed this on Nov 30, 2024
Labels: CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), multibackend

8 participants