
spec-decode: eliminate replay pass via fast-rollback #390

Draft
howard0su wants to merge 1 commit into Luce-Org:main from howard0su:verify

@howard0su howard0su commented Jun 15, 2026

Contributor

Replace the expensive snapshot→verify→restore→replay cycle with:

  • Pure-attention targets (Gemma4): KV truncation (no replay needed)
  • Hybrid SSM targets (Qwen35): fast-rollback using per-step SSM intermediate states captured during verify (GPU memcpy, no recompute)
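The difference between the two cases can be illustrated with a toy model. This is a minimal sketch, not code from the PR: `ToyKvCache` and `ToyRecurrentState` are hypothetical names, and the recurrence `0.5 * state + x` merely stands in for an SSM update. A per-position cache can drop rejected positions directly, while a recurrent state that is overwritten at every step can only be restored if each step's value was captured during verify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy pure-attention cache: one entry per position, so rejected draft
// positions can simply be dropped from the end.
struct ToyKvCache {
    std::vector<float> entries;

    void append(float kv) { entries.push_back(kv); }

    // Keep only the first n_keep positions (no recompute required).
    void truncate(std::size_t n_keep) {
        if (n_keep < entries.size()) entries.resize(n_keep);
    }
};

// Toy recurrent state: a single value overwritten at every step, so the
// value after an earlier step is lost unless it was captured.
struct ToyRecurrentState {
    float state = 0.0f;
    std::vector<float> captures;  // state after each verify step

    void begin_verify() { captures.clear(); }

    void step(float x) {
        state = 0.5f * state + x;   // stand-in for the real SSM update
        captures.push_back(state);  // per-step capture taken during verify
    }

    // Restore the state as it was after n_accepted verify steps
    // (n_accepted >= 1). This is a copy, not a recomputation.
    void rollback_to(std::size_t n_accepted) {
        state = captures.at(n_accepted - 1);
    }
};
```

In the actual change the captured states live on the GPU and the restore is a device copy; the toy only shows why captures are needed for the recurrent case and not for the attention case.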

Key changes:

  • DFlashTarget interface: add truncate_kv(), has_recurrent_state(), supports_fast_rollback(), rollback_to()
  • Qwen35DFlashTarget: implement rollback via delta_captures (SSM dequant + conv cudaMemcpy2D), enable capture_delta_intermediate during verify
  • dflash_spec_decode.cpp: three-way branch (fast-rollback / legacy replay / pure-attention truncate)
  • qwen35_backend.cpp: daemon inline loop supports fast-rollback
  • server_main.cpp: add --fast-rollback CLI flag
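A rough sketch of how the new interface methods and the three-way branch might fit together follows. Only the four method names come from the description above; the signatures, the default implementations, the `DFlashTargetSketch` and `RollbackPath` names, and the exact order of the checks are assumptions made for illustration, and the real code in dflash_spec_decode.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shape of the extended target interface. Method names are
// from the PR description; signatures and defaults are assumptions.
struct DFlashTargetSketch {
    virtual ~DFlashTargetSketch() = default;
    virtual bool has_recurrent_state() const = 0;
    virtual bool supports_fast_rollback() const { return false; }
    virtual void truncate_kv(std::size_t n_keep) = 0;
    virtual void rollback_to(std::size_t n_accepted) { (void)n_accepted; }
};

enum class RollbackPath { FastRollback, LegacyReplay, TruncateOnly };

// Assumed selection logic: pure-attention targets only truncate; targets
// with recurrent state use fast-rollback when it is both requested and
// supported, and otherwise fall back to the legacy replay path.
RollbackPath choose_path(const DFlashTargetSketch& target,
                         bool fast_rollback_enabled) {
    if (!target.has_recurrent_state()) return RollbackPath::TruncateOnly;
    if (fast_rollback_enabled && target.supports_fast_rollback())
        return RollbackPath::FastRollback;
    return RollbackPath::LegacyReplay;
}

// Toy pure-attention target (stands in for the Gemma4 case).
struct ToyAttentionTarget : DFlashTargetSketch {
    std::size_t kv_len = 0;
    bool has_recurrent_state() const override { return false; }
    void truncate_kv(std::size_t n_keep) override {
        if (n_keep < kv_len) kv_len = n_keep;
    }
};

// Toy hybrid target (stands in for the Qwen35 case).
struct ToyHybridTarget : DFlashTargetSketch {
    std::size_t kv_len = 0;
    std::size_t restored_step = 0;
    bool has_recurrent_state() const override { return true; }
    bool supports_fast_rollback() const override { return true; }
    void truncate_kv(std::size_t n_keep) override {
        if (n_keep < kv_len) kv_len = n_keep;
    }
    void rollback_to(std::size_t n_accepted) override {
        restored_step = n_accepted;  // real code would restore captured state
    }
};
```

Here `fast_rollback_enabled` plays the role of the `--fast-rollback` CLI flag; how the flag is actually threaded through to the decode loop is not stated in the description.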

Fast-rollback uses the implicit bonus approach: target_tok[accept_n-1] seeds the next draft step and is guaranteed accepted as draft_tok[0], producing identical output with zero extra forward passes.
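The implicit bonus idea can be checked on a toy model. The sketch below is an illustration under stated assumptions, not the PR's code: `target_next` and `draft_next` are made-up deterministic stand-ins for the target and draft models, and `target_tok[i]` is taken to be the target's prediction after consuming `draft_tok[i]`. Each loop iteration performs one verify pass, and the token `target_tok[accept_n - 1]` is carried over as the next block's `draft_tok[0]` instead of being committed through a separate pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy deterministic "target model": the next token depends only on the
// previous token. Purely illustrative.
int target_next(int tok) { return (tok * 3 + 1) % 7; }

// Toy draft model: agrees with the target except after token 4, where it
// deliberately mispredicts so that some drafts get rejected.
int draft_next(int tok) { return tok == 4 ? 0 : target_next(tok); }

// Reference: plain greedy decoding with the target, one token at a time.
std::vector<int> greedy_decode(int seed, std::size_t n_new) {
    std::vector<int> out{seed};
    while (out.size() < n_new + 1) out.push_back(target_next(out.back()));
    return out;
}

// Speculative decoding with the implicit bonus (block must be >= 1).
std::vector<int> spec_decode_implicit_bonus(int seed, std::size_t n_new,
                                            std::size_t block) {
    std::vector<int> out;  // committed tokens
    int next_seed = seed;
    while (out.size() < n_new + 1) {
        // Draft a block; draft_tok[0] is the seed carried from last step.
        std::vector<int> draft_tok{next_seed};
        while (draft_tok.size() < block)
            draft_tok.push_back(draft_next(draft_tok.back()));

        // Single verify pass: target prediction after each draft token.
        std::vector<int> target_tok;
        for (int t : draft_tok) target_tok.push_back(target_next(t));

        // draft_tok[0] is always accepted; extend while the draft matches.
        std::size_t accept_n = 1;
        while (accept_n < block &&
               draft_tok[accept_n] == target_tok[accept_n - 1])
            ++accept_n;

        // Commit the accepted prefix (state would be rolled back to here).
        out.insert(out.end(), draft_tok.begin(),
                   draft_tok.begin() + static_cast<std::ptrdiff_t>(accept_n));

        // Implicit bonus: the target's own prediction seeds the next block,
        // where it is accepted by construction as draft_tok[0].
        next_seed = target_tok[accept_n - 1];
    }
    out.resize(n_new + 1);
    return out;
}
```

For any seed and block size in this toy, `spec_decode_implicit_bonus` returns the same sequence as `greedy_decode`, which mirrors the "identical output" claim above.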

Performance (Qwen3.6-27B, 256 tokens, RTX 2080 Ti):
Baseline (replay): 36.5 tok/s
Fast-rollback: 47.7 tok/s (+30%)

