[Feat] Remove Redundant Variables after Integrate FIA operator in mla_cp._forward_decode #5659
Conversation
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (88 commits)
  [1/N] Refactor nightly test structure (vllm-project#5479)
  Docs: Remove deprecated --task parameter for embedding models (vllm-project#5257)
  Revert "moe_gating_top_k" (vllm-project#5512)
  [Doc] Fix issue link for 0.12.0 (vllm-project#5500)
  [CI]update triton ascend version (vllm-project#5392)
  moe_gating_top_k (vllm-project#5271)
  [refactor] refactor model runner capture model (vllm-project#5230)
  Update corresponding vllm commit ID to 12 29 (vllm-project#5475)
  [Kernel]update csrc cmakelist for open-source cann (vllm-project#5458)
  [OP] add custom op aclnnMoeInitRoutingCustom (vllm-project#5251)
  [Refactor][EAGLE] 1/N delete __init__ in mtp_proposer (vllm-project#5176)
  [Refactor][Triton] Move reject sample triton kernels into ops/triton (vllm-project#5324)
  [Feature] support eager mode in model runner v2 (vllm-project#5210)
  [feature] fia support sliding windows (vllm-project#5239)
  Optimize some rejectsampler functions to make npu op launch non-blocking (vllm-project#4587)
  [Feature] Support to use fullgraph with eagle (vllm-project#5118)
  [EPLB][refactor] Modification of the initialization logic for expert_map and log2phy(depend on pr5285) (vllm-project#5311)
  [Refactor]6/N Extract common code of class AscendMLAImpl (vllm-project#5314)
  [Refactor] cache cos/sin in mla & remove parameter model in builder. (vllm-project#5277)
  update vllm pin to 12.27 (vllm-project#5412)
  ...
Signed-off-by: 白永斌 <[email protected]>
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [feature] mooncake support pcp/dcp in common conditions (vllm-project#5224)
  [Bugfix] Fix mm_merge (vllm-project#5249)
  [Main2Main] Upgrade vllm commit to 1230 (vllm-project#5495)
  [Feature] Refactor PCP &DCP related code (vllm-project#5214)
  [main][test] Refactor the mtp and eagle test case (vllm-project#5326)
  [smoke][bugfix] moe_init_routing_v2 active_expert_range use int type (vllm-project#5521)
  [2/N] Upgrade nightly doc (vllm-project#5534)
  [Doc] Add new contributors. (vllm-project#5537)
  [3/N][Nightly] Move ops tests to nightly (vllm-project#5538)
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (58 commits)
  [Main2Main] Upgrade vllm commit to 0106 (vllm-project#5617)
  [CI]update bisheng version (vllm-project#5621)
  [UT][PCP&DCP] UT for block_table.py (vllm-project#5032)
  [Main2Main] Upgrade vllm commit to 0105 (vllm-project#5595)
  [CI] mv ops to correct path (vllm-project#5615)
  [BugFix] Fix Smoke Testing Bug for DSR1 longseq (vllm-project#5613)
  Revert "[Feat] enable hierarchical mc2 ops on A2 by default (vllm-project#5545)" (vllm-project#5611)
  [TRITON][TEST]Add nightly test for triton split_qkv_rmsnorm_rope (vllm-project#5267)
  [perf] Fix MLAPO weight disposal for KV-consumer MLA in PD-mix deploy... (vllm-project#5192)
  [docs] Correct image about prefill phase of PCP (vllm-project#5598)
  [CI] update triton-ascend version (vllm-project#5584)
  [P/D]Remove mooncake kvpool unused parameter `local_hostname` (vllm-project#5574)
  [Bugfix] record cos and sin cache in AscendRotaryEmbedding (vllm-project#5516)
  [bugfix] fix test_camem failed with triton-ascend (vllm-project#5492)
  [UT]add triton ops ut : test_fused_qkvzba_split_reshape_cat (vllm-project#5474)
  [CI] Download models from ms (vllm-project#5405)
  Docs: Add A3 Docker image guidance for Atlas A3 machines (vllm-project#5256)
  [Doc] Add NNAL installation guide and requirements (vllm-project#5235)
  Add the requirement of arctic-inference which speculative decoding with suffix_decode (vllm-project#5045)
  [BugFix][Fusion] Fix graph fusion failure problem (vllm-project#5253)
  ...
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: daishixun <[email protected]>
Code Review
This pull request refactors the attention mechanism code by removing the batch_seq_mask variable and its associated logic. This cleanup is done following the integration of an updated FIA (Fused Infer Attention) operator, which now presumably handles masking for zero-length sequences internally. The changes are consistent across the implementation and test files, leading to cleaner and more maintainable code. The removal of explicit masking logic in functions like _process_attn_out_lse and _compute_prefill_context correctly relies on the improved capabilities of the underlying NPU operators. The updates to tests to reflect these changes are also correctly implemented. Overall, this is a good refactoring that simplifies the codebase.
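For readers unfamiliar with the removed code path, here is a minimal, hypothetical sketch of the kind of explicit zero-length-sequence masking this refactor drops; the function name, tensor shapes, and the `batch_seq_mask` handling shown are illustrative assumptions, not the actual vllm-ascend implementation:

```python
import torch

def mask_empty_sequences(attn_out: torch.Tensor,   # [batch, heads, head_dim]
                         lse: torch.Tensor,        # [batch, heads]
                         seq_lens: torch.Tensor):  # [batch], KV tokens held on this rank
    """Zero the output and neutralize the LSE of sequences that hold no KV on
    this rank, so stale values cannot leak into the cross-rank combine step.
    Hypothetical stand-in for the removed batch_seq_mask logic."""
    batch_seq_mask = seq_lens > 0                      # True where the rank owns KV
    attn_out = attn_out * batch_seq_mask.view(-1, 1, 1)
    # An LSE of -inf contributes zero weight when partial results are merged.
    lse = torch.where(batch_seq_mask.view(-1, 1), lse,
                      torch.full_like(lse, float("-inf")))
    return attn_out, lse
```

If the updated operator handles empty sequences internally, as the review assumes, this kind of post-processing becomes unnecessary, which is what the diff removes.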
Signed-off-by: daishixun <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: daishixun <[email protected]>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Bai Yongbin <[email protected]>
Signed-off-by: daishixun <[email protected]>
Signed-off-by: daishixun <[email protected]>
Signed-off-by: daishixun <[email protected]>
Signed-off-by: daishixun <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
…into FIA_rebase

* 'FIA_rebase' of https://github.com/845473182/vllm-ascend:
  Update vllm_ascend/attention/context_parallel/mla_cp.py
Signed-off-by: 白永斌 <[email protected]>
This PR should be merged after #5641.
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: Bai Yongbin <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
…into FIA_rebase

* 'FIA_rebase' of https://github.com/845473182/vllm-ascend: (39 commits)
  [CI] Drop outdated cases (vllm-project#5709)
  [EPLB][CI] EPLB add aclgraph and redundant expert ci (vllm-project#5625)
  [CI] fix image build tag (vllm-project#5703)
  Optimize the print info format when deprecated code is used in vllm-ascend (vllm-project#5696)
  [Feature] add the magicmtp speculative decoding acceleration algorithm (vllm-project#5542)
  [bugfix] adapt to new implemented get_kv_cache_spec in cpuoffload connector (vllm-project#4311)
  [refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (vllm-project#5181)
  [BugFix][P/D] Fix pre-create link parameter error (vllm-project#5694)
  [Kernel] Add moe_gating_top_k operator support for Ascend NPU (vllm-project#5579)
  [1/N][CI] Refactor accuracy test (vllm-project#5400)
  [BugFix][Fusion] Fix graph fusion failure problem (vllm-project#5676)
  [Tests] Add qwen3-8b nightly test (vllm-project#5597)
  [Refactor] Import global var form vllm instead of overwirte it (vllm-project#5469)
  [Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (vllm-project#4870)
  [CI] move image and wheel job to schedule way (vllm-project#5685)
  [Bugfix] Fix the graph capture failure issue in the eagle3+full scenario. (vllm-project#5553)
  [Bugfix] fix resource are insufficient when pcp and piecewise (vllm-project#5377)
  [CI] Add workflow to cancel running workflows on PR close (vllm-project#5646)
  [CI] Bump lm-eval version to v0.4.9.2 (vllm-project#5655)
  [CI] cleanup single/multi-card test (vllm-project#5623)
  ...
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: tongyuzhou <[email protected]>
Signed-off-by: dsxsteven <[email protected]>
What this PR does / why we need it?
PCP/DCP splits the kv-cache across different cards. After introducing the parameter `cp-kv-cache-interleave-size`, the first `size` tokens are cached on Card 0, the next `size` tokens on Card 1, and so on. However, if there are too few tokens, some cards end up storing no key-value pairs at all, which leads to zero or corrupted values and precision issues. Currently, additional operations are introduced to avoid this precision problem.
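To make the failure mode concrete, here is a small illustrative sketch; the helper name and the example numbers are invented for illustration and are not taken from the codebase:

```python
def tokens_per_rank(num_tokens: int, cp_size: int, interleave_size: int) -> list[int]:
    """Count how many KV tokens land on each context-parallel rank when the
    cache is interleaved in blocks of `interleave_size` tokens."""
    counts = [0] * cp_size
    for token_idx in range(num_tokens):
        rank = (token_idx // interleave_size) % cp_size
        counts[rank] += 1
    return counts

# With 4 CP ranks, an interleave size of 128, and only 100 cached tokens,
# ranks 1-3 hold no KV at all, so their local attention has nothing valid to read.
print(tokens_per_rank(100, cp_size=4, interleave_size=128))  # [100, 0, 0, 0]
```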
Now that the FIA operator is integrated in `mla_cp._forward_decode`, these additional operations can be removed.
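For reference, the reason empty ranks needed extra handling at all is that per-rank partial attention results are merged by their log-sum-exp values, so a rank with no KV must end up with zero weight. Below is a generic sketch of such a merge, not the PR's actual implementation:

```python
import torch

def merge_partial_attention(outs: list[torch.Tensor],   # each [batch, heads, head_dim]
                            lses: list[torch.Tensor]):  # each [batch, heads]
    """Combine per-rank partial attention outputs using their log-sum-exp values.
    A rank with no KV should contribute lse = -inf (zero weight); otherwise its
    garbage output corrupts the merged result, causing the precision issue above."""
    lse = torch.stack(lses)                      # [ranks, batch, heads]
    out = torch.stack(outs)                      # [ranks, batch, heads, head_dim]
    global_lse = torch.logsumexp(lse, dim=0)     # [batch, heads]
    weights = torch.exp(lse - global_lse)        # per-rank softmax weights
    merged = (out * weights.unsqueeze(-1)).sum(dim=0)
    return merged, global_lse
```

If the FIA operator handles zero-length sequences internally, as this integration assumes, the merge stays numerically safe without any extra masking.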
Does this PR introduce any user-facing change?
No
How was this patch tested?