Enable Intel GPU #753


Closed · wants to merge 1 commit

Conversation

@dbyoung18 (Contributor) commented on Aug 27, 2024

This PR is migrated from gpt-fast #79. We would like to add initial support for Intel GPU in torch-ao via the device option "xpu" (i.e., --device "xpu"). Currently, both BF16 and INT8 are functionally supported under eager mode and compile mode. INT4 support and further performance improvements are WIP.

Here are the steps to run Llama2-7b and Llama3-8b generation on Intel GPU with torch-ao (a quick XPU availability check is sketched after the commands). We will update the tutorial later with improved performance.

Launch

  1. Command for BF16
    python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --precision torch.bfloat16
  2. Command for INT8 dynamic quantization
    python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --quantization int8dq
  3. Command for INT8 weight-only quantization
    python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --quantization int8wo
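
Before launching, a minimal sketch (not part of the PR) to confirm that the XPU backend is visible to PyTorch; it assumes a PyTorch build with Intel GPU support:

    import torch

    # Fall back to CPU if this build has no XPU backend or no Intel GPU is visible.
    device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
    print(f"selected device: {device}")
    if device == "xpu":
        # Report the first Intel GPU's name, useful when verifying the environment.
        print(torch.xpu.get_device_name(0))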

pytorch-bot (bot) commented on Aug 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/753

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor) commented

Hi @dbyoung18!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@msaroufim (Member) commented on Aug 27, 2024

Thanks for your PR @dbyoung18. My preference here would be to land generic accelerator memory APIs in core and then use those. That way we wouldn't need to ask people who are trying to use Intel GPUs to change their code; it'd be something like torch.get_accelerator().max_memory_reserved() or torch.accelerator.max_memory_reserved().

@guangyey is doing some work on this at Intel and can share more information on the current plan of record.
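
As a rough illustration of that idea (a hypothetical helper, not an existing torch API), the call sites in generate.py could route through a single shim instead of per-device branches:

    import torch

    def max_memory_reserved_bytes(device: str) -> int:
        # Hypothetical shim: dispatch on the device string until a generic
        # torch.accelerator memory API is available in core.
        if "cuda" in device:   # ROCm/HIP builds also expose torch.cuda
            return torch.cuda.max_memory_reserved()
        if "xpu" in device:    # assumes the XPU memory APIs from #129919
            return torch.xpu.max_memory_reserved()
        return 0               # CPU or other backends without this statistic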

@facebook-github-bot added the CLA Signed label on Aug 28, 2024
@facebook-github-bot (Contributor) commented

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@guangyey commented

Hi @msaroufim and @dbyoung18, let me explain the plan.
In the long term, we have a proposal to provide device-agnostic APIs for each accelerator. We would like to start with the runtime device and stream components, and then gradually cover the allocator memory APIs. Our RFC is [RFC] A device-agnostic Python runtime API design for stream-based accelerators.
In the short term, for XPU, we plan to provide those memory APIs first so that customer usage is not blocked. We have prepared a series of PRs to implement them. You can refer to #129919; it will land soon if everything goes well.

@dbyoung18 (Contributor, Author) commented on Aug 28, 2024

Converting to draft first, pending #129919 being ready.

@EikanWang commented

@dbyoung18, may I know why the change is in torchao/_models/llama/generate.py only?

@dbyoung18 (Contributor, Author) commented

> @dbyoung18, may I know why the change is in torchao/_models/llama/generate.py only?

Hi @EikanWang. We have a plan to gradually support torch-ao on Intel GPU with different models (Llama2, Llama3, SAM, etc.) and different features (BF16/INT8/INT4/FP8, etc.). As the first step, we chose Llama2 and Llama3 BF16 as the starting point. With this PR, Llama2-7b and Llama3-8b can run BF16 on Intel GPU under both eager mode and compile mode by passing --device xpu to the launch commands, and INT8 can be launched with intel/intel-xpu-backend-for-triton under compile mode. We are also working to upstream INT8/INT4/FP8 support on Intel GPU with oneDNN to PyTorch core. Once that upstream work is available in stock PyTorch, we will continue our contributions to torch-ao to make the library more broadly available and powerful on different platforms.

@dbyoung18 marked this pull request as ready for review on September 30, 2024 at 13:17
@malfet left a comment

I'll leave it up to the repo maintainers, but IMO one needs to think a bit more about a unified device approach rather than migrating long strings of elifs from repo to repo.

@@ -369,7 +381,8 @@ def callback(x):

tokpersec = torch.mean(torch.tensor(aggregate_metrics['tokens_per_sec'])).item()
bandwidth = model_size * tokpersec
mem = torch.cuda.max_memory_reserved() /1e9
max_memory_reserved = torch.cuda.max_memory_reserved() if "cuda" in device else torch.xpu.max_memory_reserved()

This feels wrong, as it will dispatch to xpu for HIP devices as well, wouldn't it?

Comment on lines 325 to +326
else:
torch.profiler._utils._init_for_cuda_graphs()
prof = torch.profiler.profile()
if "cuda" in device:

Please dismantle the pyramid of doom and use elif "cuda" in device: rather than else: with a nested if "cuda" in device:
And see my comment below again about "hip"
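
For illustration, a hedged sketch of the flattened structure being asked for; the outer condition and placeholder values are assumptions, not the exact code in generate.py:

    import contextlib
    import torch

    profile, device = False, "xpu"   # placeholder values for this sketch

    # Flatten the nested branch: use elif instead of else: plus an inner if.
    if not profile:                  # simplified stand-in for the real condition
        prof = contextlib.nullcontext()
    elif "cuda" in device:
        torch.profiler._utils._init_for_cuda_graphs()
        prof = torch.profiler.profile()
    else:
        prof = torch.profiler.profile()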

@@ -288,7 +290,10 @@ def main(

for i in range(start, num_samples):
if i==0:
torch.cuda.reset_peak_memory_stats()
if "cuda" in device:

Same as below: can you please check whether torch.cuda.reset_peak_memory_stats also needs to be applied for HIP?
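
For reference, a hedged sketch of a device-aware reset; on ROCm builds torch.cuda.* maps to HIP, so the "cuda" branch would also cover HIP devices, and the xpu call assumes the memory APIs tracked in #129919:

    import torch

    def reset_peak_memory_stats(device: str) -> None:
        # Sketch only: reset peak-memory statistics for the active backend.
        if "cuda" in device:    # also covers ROCm/HIP, where torch.cuda maps to HIP
            torch.cuda.reset_peak_memory_stats()
        elif "xpu" in device:   # assumes torch.xpu memory APIs (#129919)
            torch.xpu.reset_peak_memory_stats()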

@msaroufim (Member) commented

I'm already planning on writing an RFC on how we'll support more hardware architectures. Right now ao is very much NVIDIA-centric, but a lot of recent issues have been about supporting more hardware architectures on more operating systems. We need to think about generalizing devices, CI/testing, and performance carefully.

@EikanWang commented

We are working on device-agnostic runtime APIs for accelerators. They may help ao support more hardware architectures.

@malfet , @msaroufim FYI - https://dev-discuss.pytorch.org/t/python-c-api-rules-for-device-generic-apis/2511

@msaroufim (Member) commented

@EikanWang are there any GitHub runners for Intel GPUs to ensure our test suite works? We don't have to run the code per commit, but at least a nightly check would be helpful to make sure we understand what works and what doesn't.

@EikanWang commented

@msaroufim , may I know if two runners for Intel GPUs are good enough now for the ao nightly?

@msaroufim (Member) commented

> @msaroufim , may I know if two runners for Intel GPUs are good enough now for the ao nightly?

Yup that should be fine! We won't be running on Intel runners per commit for now. cc @atalman @seemethere as well

@EikanWang commented

Sounds good! We will add two runners to the Intel GPU CI/CD resource pool and reserve those two runners for the ao nightly.

cc @riverliuintel, @chuanqi129

@chuanqi129 (Contributor) commented

> Sounds good! We will add two runners to the Intel GPU CI/CD resource pool and reserve those two runners for the ao nightly.
>
> cc @riverliuintel, @chuanqi129

Currently, we have 16 PyTorch organization-level XPU runners with the label "linux.idc.xpu" used for PyTorch CI/CD. I think the torchao repo can use them directly.

@EikanWang commented

@chuanqi129, @riverliuintel, any update?

@msaroufim (Member) commented

They'd need to be hooked up to the Nova workflows as well; see #999, which ran into some issues too.

@EikanWang commented

@msaroufim, may I know what "Nova workflows" means? Is it an ao-specific workflow?

@msaroufim (Member) commented

@EikanWang we leverage some reusable GitHub workflows (https://github.com/pytorch/ao/blob/main/.github/workflows/regression_test.yml#L68) produced by pytorch/test-infra; this lets us easily build and test ao on multiple architectures and devices.

We could potentially do a one-off run of our test suite to see what works in ao out of the box today, but it will be hard to track progress without the CI integration.

As for how to integrate with the Nova workflows, your best bet is to reach out to @seemethere and @atalman on the Intel Slack channel. Feel free to tag me there as well so we can move faster.

@msaroufim mentioned this pull request on Oct 15, 2024
@mingfeima commented

@dbyoung18 does this one support int4 woq?

@dbyoung18 (Contributor, Author) commented

> @dbyoung18 does this one support int4 woq?

Currently, it doesn't support int4 woq on Intel GPU. We are in the process of upstreaming INT4 XPU backend support to PyTorch (targeting v2.5). Once that upstream work is ready, we will continue adding support on the ao side.

@dbyoung18 (Contributor, Author) commented

Closing as a duplicate of PR ao#1259. Thanks for the review comments above.

@dbyoung18 closed this on Nov 30, 2024
Labels: CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), multibackend

8 participants