
Conversation

@divakar-amd
Contributor

This PR cherry-picks this PR from the aiter main branch to update the custom_op logic in aiter. It resolves the following error in DeepSeek-R1, which was caused by improper handling of a quant op:

Run Cmd:

VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=0 vllm bench latency --model /data/DeepSeek-R1-dontUseDebugOnly/ --dtype auto --batch-size 32 --input-len 128 --output-len 32 -tp 8 --compilation-config='{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"cudagraph_mode":"FULL","custom_ops":["+rms_norm","+silu_and_mul","+quant_fp8"],"splitting_ops":[]}' --trust-remote-code --max-model-len=32768 --block-size=1 --num-iters-warmup 1 --num-iters 3 --load-format dummy

Error

RuntimeError: Worker failed with error 'Attempted to call function marked as skipped

File "/Projects/VLLM_DIR/vllm_upstream/vllm/model_executor/layers/mla.py", line 126, in forward_native
qkv_lora = self.fused_qkv_a_proj(hidden_states)[0]
File "/Projects/VLLM_DIR/vllm_upstream/vllm/model_executor/layers/linear.py", line 565, in forward
output_parallel = self.quant_method.apply(self, input_, bias)
File "/Projects/VLLM_DIR/vllm_upstream/vllm/model_executor/layers/quantization/fp8.py", line 666, in apply
return self.w8a8_block_fp8_linear.apply(
File "/Projects/VLLM_DIR/vllm_upstream/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 296, in apply
output = self.w8a8_blockscale_op(input_2d, weight, weight_scale)
File "/Projects/VLLM_DIR/vllm_upstream/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 355, in_run_aiter
q_input, input_scale = aiter_per1x128_quant(
File "/Projects/VLLM_DIR/aiter_ssh/aiter/ops/quant.py", line 236, in per_group_quant_hip
dynamic_per_token_scaled_quant(
File "/Projects/VLLM_DIR/aiter_ssh/aiter/jit/core.py", line 513, in wrapper
module = get_module(md)
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/polyfills/init.py", line 259, in getattr_and_trace
return fn(*args[2:], **kwargs)
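
The traceback ends in Dynamo's getattr_and_trace polyfill: with "+quant_fp8" kept as a custom op, the compiled graph reaches aiter's JIT wrapper (aiter/jit/core.py), whose module-loading code sits on Dynamo's skip list, hence "Attempted to call function marked as skipped". The cherry-picked aiter change registers the kernels through the torch custom-op machinery so the compiler sees an opaque, schema-annotated op instead of the skipped Python wrapper. As a rough illustration only (aiter_sketch::quant_fp8 and its body are made up for this sketch, not the real aiter registration), the pattern looks like this:

```python
# Minimal sketch, not the real aiter code: wrap a quant kernel as a registered
# torch custom op so torch.compile treats it as an opaque call instead of
# tracing (and skipping) the Python JIT wrapper.
import torch


@torch.library.custom_op("aiter_sketch::quant_fp8", mutates_args=())
def quant_fp8(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Eager reference implementation; a real op would dispatch to the HIP kernel.
    return (x / scale).to(torch.float8_e4m3fn)


@torch.library.register_fake("aiter_sketch::quant_fp8")
def _(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Fake (meta) implementation: shapes and dtypes only, so Dynamo/Inductor
    # can trace through the op without running the kernel.
    return torch.empty_like(x, dtype=torch.float8_e4m3fn)


@torch.compile(fullgraph=True)
def f(x):
    # The registered op is callable from inside a compiled region.
    return torch.ops.aiter_sketch.quant_fp8(x, 448.0)
```

Commits included in the cherry-pick: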

* first commit

* revert some conflict

* revert some conflict2

* support custom op define schema for some ops

* support some of op return None value

* support gemm return None

* support other op for custom

* commit on mha, gemm, moe

* fix pa test

* commit for enable op

* add mha op multi return support

* support reduce

* support mha fwd

* add support mha fwd and mha_v3

* support mhd bwd and reformat files

* fix ci error and support mha

* rewrite ops

* reformat

* fix ci

* fix ci

* skip three ops in custom

* add cpu backend

* support rms_norm op

* support hipb_mm and moe gate

* fix bug

* fix bug with comment

* support mha_v3_varlen

* use common func to reduce code in mha

* reformat

* fix some bug in ci

* fix some bug in ci

* fix rms norm bug

* fix ci

* fix ci

* fix moe bug
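
Several of the commits above ("support custom op define schema for some ops", "support some of op return None value", "add mha op multi return support") concern giving each kernel an explicit operator schema so that multi-output ops and void/in-place ops register cleanly. A hedged sketch of what such schema definitions can look like with torch.library; the op names and signatures below are illustrative, not the real aiter schemas:

```python
# Illustrative only: explicit schemas for a multi-return op and an in-place op
# that returns None, registered via torch.library. Not the real aiter ops.
import torch

# Multi-return op, e.g. an attention forward that yields output and LSE.
torch.library.define(
    "aiter_sketch::mha_fwd", "(Tensor q, Tensor k, Tensor v) -> (Tensor, Tensor)"
)

# Void op: writes into a preallocated buffer and returns nothing; the
# Tensor(a!) annotation marks the mutated argument.
torch.library.define(
    "aiter_sketch::rms_norm_inplace",
    "(Tensor(a!) out, Tensor x, Tensor weight, float eps) -> ()",
)


@torch.library.impl("aiter_sketch::mha_fwd", "CompositeExplicitAutograd")
def mha_fwd(q, k, v):
    # Reference math standing in for the HIP kernel.
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v, torch.logsumexp(scores, dim=-1)


@torch.library.impl("aiter_sketch::rms_norm_inplace", "CompositeExplicitAutograd")
def rms_norm_inplace(out, x, weight, eps):
    # Writes the normalized result into `out` and returns nothing.
    out.copy_(x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight)


@torch.library.register_fake("aiter_sketch::mha_fwd")
def _(q, k, v):
    return torch.empty_like(q), q.new_empty(q.shape[:-1])


@torch.library.register_fake("aiter_sketch::rms_norm_inplace")
def _(out, x, weight, eps):
    return None
```

With explicit schemas like these, ops that return several tensors or nothing at all can still be registered and called through torch.ops inside a compiled region.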
@divakar-amd
Contributor Author

@gshtras
