[Feat] Remove Redundant Variables after Integrate FIA operator in mla_cp._forward_decode #5659
Conversation
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (88 commits)
  [1/N] Refactor nightly test structure (vllm-project#5479)
  Docs: Remove deprecated --task parameter for embedding models (vllm-project#5257)
  Revert "moe_gating_top_k" (vllm-project#5512)
  [Doc] Fix issue link for 0.12.0 (vllm-project#5500)
  [CI]update triton ascend version (vllm-project#5392)
  moe_gating_top_k (vllm-project#5271)
  [refactor] refactor model runner capture model (vllm-project#5230)
  Update corresponding vllm commit ID to 12 29 (vllm-project#5475)
  [Kernel]update csrc cmakelist for open-source cann (vllm-project#5458)
  [OP] add custom op aclnnMoeInitRoutingCustom (vllm-project#5251)
  [Refactor][EAGLE] 1/N delete __init__ in mtp_proposer (vllm-project#5176)
  [Refactor][Triton] Move reject sample triton kernels into ops/triton (vllm-project#5324)
  [Feature] support eager mode in model runner v2 (vllm-project#5210)
  [feature] fia support sliding windows (vllm-project#5239)
  Optimize some rejectsampler functions to make npu op launch non-blocking (vllm-project#4587)
  [Feature] Support to use fullgraph with eagle (vllm-project#5118)
  [EPLB][refactor] Modification of the initialization logic for expert_map and log2phy(depend on pr5285) (vllm-project#5311)
  [Refactor]6/N Extract common code of class AscendMLAImpl (vllm-project#5314)
  [Refactor] cache cos/sin in mla & remove parameter model in builder. (vllm-project#5277)
  update vllm pin to 12.27 (vllm-project#5412)
  ...
Signed-off-by: 白永斌 <[email protected]>
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [feature] mooncake support pcp/dcp in common conditions (vllm-project#5224)
  [Bugfix] Fix mm_merge (vllm-project#5249)
  [Main2Main] Upgrade vllm commit to 1230 (vllm-project#5495)
  [Feature] Refactor PCP &DCP related code (vllm-project#5214)
  [main][test] Refactor the mtp and eagle test case (vllm-project#5326)
  [smoke][bugfix] moe_init_routing_v2 active_expert_range use int type (vllm-project#5521)
  [2/N] Upgrade nightly doc (vllm-project#5534)
  [Doc] Add new contributors. (vllm-project#5537)
  [3/N][Nightly] Move ops tests to nightly (vllm-project#5538)
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (58 commits)
  [Main2Main] Upgrade vllm commit to 0106 (vllm-project#5617)
  [CI]update bisheng version (vllm-project#5621)
  [UT][PCP&DCP] UT for block_table.py (vllm-project#5032)
  [Main2Main] Upgrade vllm commit to 0105 (vllm-project#5595)
  [CI] mv ops to correct path (vllm-project#5615)
  [BugFix] Fix Smoke Testing Bug for DSR1 longseq (vllm-project#5613)
  Revert "[Feat] enable hierarchical mc2 ops on A2 by default (vllm-project#5545)" (vllm-project#5611)
  [TRITON][TEST]Add nightly test for triton split_qkv_rmsnorm_rope (vllm-project#5267)
  [perf] Fix MLAPO weight disposal for KV-consumer MLA in PD-mix deploy... (vllm-project#5192)
  [docs] Correct image about prefill phase of PCP (vllm-project#5598)
  [CI] update triton-ascend version (vllm-project#5584)
  [P/D]Remove mooncake kvpool unused parameter `local_hostname` (vllm-project#5574)
  [Bugfix] record cos and sin cache in AscendRotaryEmbedding (vllm-project#5516)
  [bugfix] fix test_camem failed with triton-ascend (vllm-project#5492)
  [UT]add triton ops ut : test_fused_qkvzba_split_reshape_cat (vllm-project#5474)
  [CI] Download models from ms (vllm-project#5405)
  Docs: Add A3 Docker image guidance for Atlas A3 machines (vllm-project#5256)
  [Doc] Add NNAL installation guide and requirements (vllm-project#5235)
  Add the requirement of arctic-inference which speculative decoding with suffix_decode (vllm-project#5045)
  [BugFix][Fusion] Fix graph fusion failure problem (vllm-project#5253)
  ...
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: daishixun <[email protected]>
Code Review
This pull request refactors the attention mechanism code by removing the batch_seq_mask variable and its associated logic. This cleanup is done following the integration of an updated FIA (Fused Infer Attention) operator, which now presumably handles masking for zero-length sequences internally. The changes are consistent across the implementation and test files, leading to cleaner and more maintainable code. The removal of explicit masking logic in functions like _process_attn_out_lse and _compute_prefill_context correctly relies on the improved capabilities of the underlying NPU operators. The updates to tests to reflect these changes are also correctly implemented. Overall, this is a good refactoring that simplifies the codebase.
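For readers unfamiliar with the removed code path, here is a minimal, hypothetical sketch of the kind of explicit zero-length-sequence masking this refactor drops; the function name, tensor shapes, and the `batch_seq_mask` handling shown are illustrative assumptions, not the actual vllm-ascend implementation:

```python
import torch

def mask_empty_sequences(attn_out: torch.Tensor,   # [batch, heads, head_dim]
                         lse: torch.Tensor,        # [batch, heads]
                         seq_lens: torch.Tensor):  # [batch], KV tokens held on this rank
    """Zero the output and neutralize the LSE of sequences that hold no KV on
    this rank, so stale values cannot leak into the cross-rank combine step.
    Hypothetical stand-in for the removed batch_seq_mask logic."""
    batch_seq_mask = seq_lens > 0                      # True where the rank owns KV
    attn_out = attn_out * batch_seq_mask.view(-1, 1, 1)
    # An LSE of -inf contributes zero weight when partial results are merged.
    lse = torch.where(batch_seq_mask.view(-1, 1), lse,
                      torch.full_like(lse, float("-inf")))
    return attn_out, lse
```

If the updated operator handles empty sequences internally, as the review assumes, this kind of post-processing becomes unnecessary, which is what the diff removes.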
Signed-off-by: daishixun <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: daishixun <[email protected]>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Bai Yongbin <[email protected]>
Signed-off-by: daishixun <[email protected]>
Signed-off-by: daishixun <[email protected]>
Signed-off-by: daishixun <[email protected]>
Signed-off-by: daishixun <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
…into FIA_rebase

* 'FIA_rebase' of https://github.com/845473182/vllm-ascend:
  Update vllm_ascend/attention/context_parallel/mla_cp.py
Signed-off-by: 白永斌 <[email protected]>
This PR should be merged after #5641.
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: Bai Yongbin <[email protected]>
Signed-off-by: 白永斌 <[email protected]>
…into FIA_rebase

* 'FIA_rebase' of https://github.com/845473182/vllm-ascend: (39 commits)
  [CI] Drop outdated cases (vllm-project#5709)
  [EPLB][CI] EPLB add aclgraph and redundant expert ci (vllm-project#5625)
  [CI] fix image build tag (vllm-project#5703)
  Optimize the print info format when deprecated code is used in vllm-ascend (vllm-project#5696)
  [Feature] add the magicmtp speculative decoding acceleration algorithm (vllm-project#5542)
  [bugfix] adapt to new implemented get_kv_cache_spec in cpuoffload connector (vllm-project#4311)
  [refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (vllm-project#5181)
  [BugFix][P/D] Fix pre-create link parameter error (vllm-project#5694)
  [Kernel] Add moe_gating_top_k operator support for Ascend NPU (vllm-project#5579)
  [1/N][CI] Refactor accuracy test (vllm-project#5400)
  [BugFix][Fusion] Fix graph fusion failure problem (vllm-project#5676)
  [Tests] Add qwen3-8b nightly test (vllm-project#5597)
  [Refactor] Import global var form vllm instead of overwirte it (vllm-project#5469)
  [Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (vllm-project#4870)
  [CI] move image and wheel job to schedule way (vllm-project#5685)
  [Bugfix] Fix the graph capture failure issue in the eagle3+full scenario. (vllm-project#5553)
  [Bugfix] fix resource are insufficient when pcp and piecewise (vllm-project#5377)
  [CI] Add workflow to cancel running workflows on PR close (vllm-project#5646)
  [CI] Bump lm-eval version to v0.4.9.2 (vllm-project#5655)
  [CI] cleanup single/multi-card test (vllm-project#5623)
  ...
Signed-off-by: 白永斌 <[email protected]>
Signed-off-by: tongyuzhou <[email protected]>
Signed-off-by: dsxsteven <[email protected]>
What this PR does / why we need it?
PCP/DCP splits the kv-cache across different cards. After introducing the parameter `cp-kv-cache-interleave-size`, the first `size` tokens are cached on Card 0, the next `size` tokens on Card 1, and so on. However, if there are too few tokens, some cards end up storing no key-value pairs at all, which leads to zero or corrupted values and precision issues. Currently, additional operations are introduced to avoid this precision problem.
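To make the failure mode concrete, here is a small illustrative sketch; the helper name and the example numbers are invented for illustration and are not taken from the codebase:

```python
def tokens_per_rank(num_tokens: int, cp_size: int, interleave_size: int) -> list[int]:
    """Count how many KV tokens land on each context-parallel rank when the
    cache is interleaved in blocks of `interleave_size` tokens."""
    counts = [0] * cp_size
    for token_idx in range(num_tokens):
        rank = (token_idx // interleave_size) % cp_size
        counts[rank] += 1
    return counts

# With 4 CP ranks, an interleave size of 128, and only 100 cached tokens,
# ranks 1-3 hold no KV at all, so their local attention has nothing valid to read.
print(tokens_per_rank(100, cp_size=4, interleave_size=128))  # [100, 0, 0, 0]
```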
Now that the FIA operator is integrated in `mla_cp._forward_decode`, these additional operations can be removed.
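For reference, the reason empty ranks needed extra handling at all is that per-rank partial attention results are merged by their log-sum-exp values, so a rank with no KV must end up with zero weight. Below is a generic sketch of such a merge, not the PR's actual implementation:

```python
import torch

def merge_partial_attention(outs: list[torch.Tensor],   # each [batch, heads, head_dim]
                            lses: list[torch.Tensor]):  # each [batch, heads]
    """Combine per-rank partial attention outputs using their log-sum-exp values.
    A rank with no KV should contribute lse = -inf (zero weight); otherwise its
    garbage output corrupts the merged result, causing the precision issue above."""
    lse = torch.stack(lses)                      # [ranks, batch, heads]
    out = torch.stack(outs)                      # [ranks, batch, heads, head_dim]
    global_lse = torch.logsumexp(lse, dim=0)     # [batch, heads]
    weights = torch.exp(lse - global_lse)        # per-rank softmax weights
    merged = (out * weights.unsqueeze(-1)).sum(dim=0)
    return merged, global_lse
```

If the FIA operator handles zero-length sequences internally, as this integration assumes, the merge stays numerically safe without any extra masking.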
Does this PR introduce any user-facing change?
No
How was this patch tested?