Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu #3113

sywangyi · 2025-03-14T08:42:56Z

Signed-off-by: Wang, Yi A <[email protected]>

baptistecolle · 2025-03-17T11:24:32Z

Hey @sywangyi Thanks for contributing to the new upstreamed Gaudi backend 🙌

Here are some guidelines about the new backend that I think could be useful to you.

Guidelines for contributing to Gaudi on TGI: All changes should be made within the backends/gaudi folder. In general, you should avoid modifying the router, launcher, or benchmark to accommodate Gaudi hardware, as all Gaudi-specific logic should be contained within the backends/gaudi folder.

Source: text-generation-inference/docs/source/backends/gaudi.mdx, line 294 in 48df618

launcher/src/main.rs

router/src/usage_stats.rs

baptistecolle · 2025-03-17T11:29:46Z

FYI @regisss

When modifying anything outside of strictly Gaudi-related files (e.g., backends/gaudi, Dockerfile_gaudi, or the Gaudi documentation), we should also tag @Narsil for review. This helps ensure that we don’t unintentionally modify or bloat core TGI components, and it lets the core maintainers review the changes.
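As a rough illustration of the review rule above, the sketch below sorts a PR's changed paths into Gaudi-specific files and files that would need a core maintainer's review. The allow-list and the function names `is_gaudi_specific` and `files_needing_core_review` are hypothetical, assembled from the examples in this thread; this is not an existing TGI tool or CI check.

```python
from pathlib import PurePosixPath

# Hypothetical allow-list built from the examples named in the comment above.
GAUDI_PREFIXES = ("backends/gaudi/",)
GAUDI_FILES = ("Dockerfile_gaudi", "docs/source/backends/gaudi.mdx")


def is_gaudi_specific(path: str) -> bool:
    """Return True if the path is one of the Gaudi-only files or folders."""
    normalized = PurePosixPath(path).as_posix()  # drops a leading "./"
    return normalized in GAUDI_FILES or normalized.startswith(GAUDI_PREFIXES)


def files_needing_core_review(changed_paths):
    """Return the changed paths that fall outside the Gaudi-specific area."""
    return [p for p in changed_paths if not is_gaudi_specific(p)]


changed = [
    "backends/gaudi/server/text_generation_server/cli.py",
    "Dockerfile_gaudi",
    "launcher/src/main.rs",
    "router/src/usage_stats.rs",
]
print(files_needing_core_review(changed))
# → ['launcher/src/main.rs', 'router/src/usage_stats.rs']
```

A non-empty result would be the cue to tag a core maintainer; an empty one means the change stays inside the Gaudi backend.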

router/src/usage_stats.rs

baptistecolle · 2025-03-18T08:55:23Z

Could you please rebase your image on the latest main branch of TGI?
Commit: 8c2c348f3c4d1101e862af493b44e4986119b557

This update integrates the latest changes from the old tgi-gaudi fork, allowing us to confirm that your refactoring is fully compatible with the latest version of the Gaudi backend.

Thanks!

baptistecolle

Thanks for the huge effort!

I’ve added @regisss and @Narsil as reviewers. @Narsil, could you check the modifications to the launcher and router and confirm that everything aligns with the main TGI team? Since this is a big PR, it would also be great for @regisss to take a look.

Really appreciate the refactoring and code cleanup!

One quick question: did you happen to benchmark the performance after this refactoring? Specifically, I’m curious about the impact of using the vllm-hpu-extension.

Dockerfile_gaudi

backends/gaudi/server/tests/utils/test_weights.py

backends/gaudi/server/text_generation_server/cli.py

sywangyi · 2025-04-03T07:56:59Z

This PR is for functional enablement; I am also working on performance optimization. We have compared tgi-gaudi (OH backend) with vllm-gaudi (HPU page path), and vllm-gaudi has better throughput than tgi-gaudi. This refactor helps align the performance with vllm-gaudi.

sywangyi · 2025-04-11T02:42:38Z

@baptistecolle I have rebased the PR onto main. Please help review it.

sywangyi · 2025-04-14T00:50:18Z

@regisss I have rebased the PR onto main. Please help review and merge it.

regisss

LGTM!

Gently pinging @Narsil for final approval.

sywangyi added 5 commits March 14, 2025 01:25

clean cuda/rocm code in hpu backend, enable flat_hpu

201dc62

Signed-off-by: Wang, Yi A <[email protected]>

fix TP in pageattn

b7fea6f

Signed-off-by: Wang, Yi A <[email protected]>

adjust block table in hpu to improve performance

5d36539

Signed-off-by: Wang, Yi A <[email protected]>

enable all the models. not tested yet

a07e743

Signed-off-by: Wang, Yi A <[email protected]>

use tensor cache in hpu graph to avoid replay issue

6bbe24d

Signed-off-by: Wang, Yi A <[email protected]>

baptistecolle reviewed Mar 17, 2025

launcher/src/main.rs (outdated, resolved)

baptistecolle reviewed Mar 17, 2025

router/src/usage_stats.rs (resolved)

baptistecolle reviewed Mar 17, 2025

router/src/usage_stats.rs (resolved)

add moe support, fix qwen/mistral/mixtral crash

5cd1c93

Signed-off-by: Wang, Yi A <[email protected]>

sywangyi added 18 commits March 18, 2025 23:11

fix phimoe issue

073f793

Signed-off-by: Wang, Yi A <[email protected]>

gpt_bigcode could also go pageattn

2cde30d

Signed-off-by: Wang, Yi A <[email protected]>

enable dbrx remove some unused code

2074d05

Signed-off-by: Wang, Yi A <[email protected]>

Merge branch 'main' into gaudi_backend_pa

d5b78ba

multi-modality initial PR

f95aa42

Signed-off-by: Wang, Yi A <[email protected]>

adjust warmup and enable vlm

36b6612

Signed-off-by: Wang, Yi A <[email protected]>

fix incorrect output in qwen2 idefics if hpu graph is used

fdf0733

Signed-off-by: Wang, Yi A <[email protected]>

remove unused quantization code and enable awq/gptq int4

9914ffe

Signed-off-by: Wang, Yi A <[email protected]>

fix gptq issue

8d221b7

Signed-off-by: Wang, Yi A <[email protected]>

enable fp8

6977376

Signed-off-by: Wang, Yi A <[email protected]>

warmup prefill

fd70ad7

remove model where pageattn is not used, set block table to None since it's not used

Signed-off-by: Wang, Yi A <[email protected]>

add warmup_decode

ba7a131

Signed-off-by: Wang, Yi A <[email protected]>

warmup decode

7900be5

Signed-off-by: Wang, Yi A <[email protected]>

remove block_tables and prefill_cache_indices which will lead to dynamic shape

1508ee8

Signed-off-by: Wang, Yi A <[email protected]>

Merge branch 'main' into gaudi_backend_pa

7914e98

Signed-off-by: Wang, Yi A <[email protected]>

fix comment

787dbe9

Signed-off-by: Wang, Yi A <[email protected]>

missing gptj change...

376e050

Signed-off-by: Wang, Yi A <[email protected]>

fix some issue

f0e5fae

Signed-off-by: Wang, Yi A <[email protected]>

sywangyi marked this pull request as ready for review March 28, 2025 14:03

remove torch.where to fix incorrect output in hpu graph model

c55a8ca

Signed-off-by: Wang, Yi A <[email protected]>

baptistecolle reviewed Apr 2, 2025

Dockerfile_gaudi (resolved)

backends/gaudi/server/tests/utils/test_weights.py (outdated, resolved)

backends/gaudi/server/text_generation_server/cli.py (resolved)

baptistecolle requested review from Narsil and regisss April 2, 2025 09:31

baptistecolle added the gaudi label (Issues related to Intel Gaudi hardware) Apr 2, 2025

baptistecolle changed the title ~~clean cuda/rocm code in hpu backend, enable flat_hpu~~ Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu Apr 2, 2025

sywangyi added 2 commits April 10, 2025 18:20

Merge branch 'main' into gaudi_backend_pa

610dd20

Signed-off-by: Wang, Yi A <[email protected]>

match the latest vllm_extension ops

4cdc34e

Signed-off-by: Wang, Yi A <[email protected]>

regisss approved these changes Apr 14, 2025

Narsil merged commit d62c941 into huggingface:main Apr 14, 2025
