spec-decode: eliminate replay pass via fast-rollback by howard0su · Pull Request #390 · Luce-Org/lucebox-hub

howard0su · 2026-06-15T23:23:48Z

Replace the expensive snapshot→verify→restore→replay cycle with:

- Pure-attention targets (Gemma4): KV truncation (no replay needed)
- Hybrid SSM targets (Qwen35): fast-rollback using per-step SSM intermediate states captured during verify (GPU memcpy, no recompute)
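The two rollback strategies above can be illustrated with a small host-side sketch. This is not the PR's code: `ToyKvCache` and `ToyRecurrentState` are hypothetical stand-ins that keep their data in ordinary memory, whereas the PR describes GPU-resident state restored via memcpy. The point is only that a KV cache can drop rejected positions by shrinking, while a recurrent state has to be restored from a snapshot taken after each verified token.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an attention KV cache: one entry per position.
struct ToyKvCache {
    std::vector<int> positions;

    // Rejected positions are simply dropped; nothing has to be recomputed.
    void truncate(std::size_t n_keep) {
        assert(n_keep <= positions.size());
        positions.resize(n_keep);
    }
};

// Hypothetical stand-in for a recurrent (SSM + conv) state. Each token
// overwrites the state, so it cannot be "truncated" after the fact.
struct ToyRecurrentState {
    std::uint64_t state = 0;
    std::uint64_t base = 0;               // state before the verify pass
    std::vector<std::uint64_t> per_step;  // state after each verified token

    void begin_verify() {
        base = state;
        per_step.clear();
    }

    // Consume one token and record the resulting state.
    void step(int tok) {
        state = state * 31u + static_cast<std::uint64_t>(tok);
        per_step.push_back(state);
    }

    // Restore the state as it was after `n_accepted` verified tokens.
    // This is a copy, not a recompute.
    void rollback_to(std::size_t n_accepted) {
        assert(n_accepted <= per_step.size());
        state = (n_accepted == 0) ? base : per_step[n_accepted - 1];
    }
};
```

Rolling back to `n_accepted` tokens leaves the recurrent state identical to one that only ever processed the accepted tokens, which is what makes the replay pass unnecessary in this toy model.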

Key changes:

- DFlashTarget interface: add truncate_kv(), has_recurrent_state(), supports_fast_rollback(), rollback_to()
- Qwen35DFlashTarget: implement rollback via delta_captures (SSM dequant + conv cudaMemcpy2D), enable capture_delta_intermediate during verify
- dflash_spec_decode.cpp: three-way branch (fast-rollback / legacy replay / pure-attention truncate)
- qwen35_backend.cpp: daemon inline loop supports fast-rollback
- server_main.cpp: add --fast-rollback CLI flag
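As a rough illustration of how the new interface methods could drive the three-way branch, here is a minimal sketch. Only the four method names come from the description above; the parameter lists, the `DFlashTargetSketch` and `FakeTarget` types, the `RollbackPath` enum, and `choose_path` are assumptions made for this example, and the way the `--fast-rollback` flag gates the branch is a guess rather than the PR's actual logic.

```cpp
#include <cassert>

// Assumed shape of the extended interface; signatures are illustrative.
struct DFlashTargetSketch {
    virtual ~DFlashTargetSketch() = default;
    virtual void truncate_kv(int n_keep) = 0;
    virtual bool has_recurrent_state() const = 0;
    virtual bool supports_fast_rollback() const = 0;
    virtual void rollback_to(int n_accepted) = 0;
};

// Hypothetical labels for the three branches named in the description.
enum class RollbackPath { Truncate, FastRollback, LegacyReplay };

// Pick a rollback strategy after a verify pass (assumed decision order).
RollbackPath choose_path(const DFlashTargetSketch& target, bool fast_rollback_flag) {
    if (!target.has_recurrent_state()) {
        return RollbackPath::Truncate;      // pure attention: drop rejected KV
    }
    if (fast_rollback_flag && target.supports_fast_rollback()) {
        return RollbackPath::FastRollback;  // restore captured per-step state
    }
    return RollbackPath::LegacyReplay;      // snapshot, restore, replay
}

// Minimal fake target used only to exercise choose_path.
struct FakeTarget : DFlashTargetSketch {
    bool recurrent;
    bool fast;
    int kv_len = 0;
    int rolled_to = -1;

    FakeTarget(bool r, bool f) : recurrent(r), fast(f) {}
    void truncate_kv(int n_keep) override { kv_len = n_keep; }
    bool has_recurrent_state() const override { return recurrent; }
    bool supports_fast_rollback() const override { return fast; }
    void rollback_to(int n_accepted) override { rolled_to = n_accepted; }
};
```

Under these assumptions, a hybrid target that does not implement fast rollback, or a run without the flag, falls back to the legacy replay path.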

Fast-rollback uses the implicit bonus approach: target_tok[accept_n-1] seeds the next draft step and is guaranteed accepted as draft_tok[0], producing identical output with zero extra forward passes.
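The implicit bonus idea can be sketched with a toy greedy decoder. Everything here is hypothetical: `target_next` and `draft_next` are made-up models, and the indexing convention (`draft_tok[0]` is the seed, `target_tok[i]` is the target's choice after consuming `draft_tok[i]`) is one reading of the description, not taken from the PR's code. The sketch only shows why seeding the next block with `target_tok[accept_n-1]` reproduces plain greedy output without an extra target pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy greedy target: the next token depends only on the previous one.
int target_next(int tok) { return (tok * 7 + 3) % 50; }

// Toy draft: agrees with the target except after even tokens.
int draft_next(int tok) {
    int t = target_next(tok);
    return (tok % 2 == 0) ? (t + 1) % 50 : t;
}

// Reference: one target step per emitted token.
std::vector<int> greedy_decode(int seed, std::size_t n) {
    std::vector<int> out;
    int tok = seed;
    while (out.size() < n) {
        tok = target_next(tok);
        out.push_back(tok);
    }
    return out;
}

// Speculative decoding with the implicit bonus token.
std::vector<int> spec_decode(int seed, std::size_t n, std::size_t block) {
    assert(block >= 1);
    std::vector<int> out;
    int head = seed;  // token that seeds the next draft block
    while (out.size() < n) {
        // Draft a block; draft_tok[0] is the seed itself.
        std::vector<int> draft_tok(block);
        draft_tok[0] = head;
        for (std::size_t i = 1; i < block; ++i) {
            draft_tok[i] = draft_next(draft_tok[i - 1]);
        }
        // Verify: one target prediction per block position.
        std::vector<int> target_tok(block);
        for (std::size_t i = 0; i < block; ++i) {
            target_tok[i] = target_next(draft_tok[i]);
        }
        // The seed is always accepted; extend while the draft matches.
        std::size_t accept_n = 1;
        while (accept_n < block && draft_tok[accept_n] == target_tok[accept_n - 1]) {
            ++accept_n;
        }
        // Emit the accepted predictions; the last one is the bonus token.
        for (std::size_t i = 0; i < accept_n && out.size() < n; ++i) {
            out.push_back(target_tok[i]);
        }
        // The bonus becomes draft_tok[0] of the next block, so it is
        // consumed by the next verify pass instead of a separate replay.
        head = target_tok[accept_n - 1];
    }
    return out;
}
```

In this toy setup the target state after each step would cover exactly `draft_tok[0..accept_n-1]`, which is the point a rollback would restore to; the bonus token has been emitted but not yet consumed.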

Performance (Qwen3.6-27B, 256 tokens, RTX 2080 Ti):
Baseline (replay): 36.5 tok/s
Fast-rollback: 47.7 tok/s (+30%)


howard0su force-pushed the verify branch from 6eabd12 to 7e15832 on June 15, 2026 23:24

howard0su force-pushed the verify branch from 7e15832 to 5553bfc on June 15, 2026 23:31

howard0su wants to merge 1 commit into Luce-Org:main from howard0su:verify
