Add switchable draft LoRA variants for Laguna #460

Draft
davide221 wants to merge 1 commit into main from draft-lora-variants

Conversation

davide221 (Contributor) commented Jun 29, 2026

Summary

  • add GGUF LoRA adapter loading for DFlash draft tensors and pre-merge adapters into draft weights at load time
  • add --draft-lora <path|name=path> and request-side extra_body.draft_lora selection
  • keep base as the unadapted draft and load named LoRA variants as resident switchable variants
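
As a rough illustration of what pre-merging an adapter into the draft weights means, the sketch below folds a low-rank update into a dense weight and keeps the unadapted copy under the name `base`. The `alpha / rank` scale is the usual LoRA convention and is an assumption here, as are the function names and the tuple layout of `adapters`; the actual loader works on GGUF tensors and is not reproduced.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / rank) * (lora_b @ lora_a) without mutating inputs.

    Shapes: weight is out x in, lora_a is rank x in, lora_b is out x rank.
    The alpha / rank scaling is an assumed convention, not taken from the PR.
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def build_variants(base_weight, adapters):
    """Keep "base" unadapted and add one pre-merged copy per named adapter."""
    variants = {"base": base_weight}
    for name, (lora_a, lora_b, alpha) in adapters.items():
        variants[name] = merge_lora(base_weight, lora_a, lora_b, alpha)
    return variants
```

Because each variant is merged once at load time, switching variants per request is a lookup rather than extra matmuls in the draft forward pass, at the cost of holding one weight copy per resident variant.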

Validation

  • git diff --cached --check before commit
  • remote CUDA build: cmake --build build-lora-cuda --target dflash_server -j$(nproc)
  • smoke request on lucebox with personal LoRA selected: adapter loaded, base + personal variants loaded, request selected personal

Experiment note

A small rank-16 personal LoRA trained from session-derived records did not improve holdout speculative metrics versus the normal drafter, so this PR only adds serving support and does not enable any adapter by default.

davide221 added a commit that referenced this pull request Jul 3, 2026
Remove the switchable draft LoRA adapter support that was folded in from
#460: the --draft-lora flag, the per-request extra_body.draft_lora
selector, the GGUF LoRA adapter loader/merger, and the DraftLoraSpec
plumbing through the factory and backends. #460 stays open for a future
pass; this PR ships only the DSpark Markov head and the spec-verify
stack.

The draft-variant container in LagunaBackend stays (single "base"
variant) so a future variant mechanism can slot back in without
re-plumbing the loader.

Verified on the 3090: output hashes byte-identical to the previous
commit, 282 tok/s at w6, DSpark head active, --draft-lora now rejected
as an unknown option, ctest 12/12.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
davide221 added a commit that referenced this pull request Jul 3, 2026
…/s (RTX 3090) (#482)

* deps: ggml spec-decode verify kernel stack for sm_86

- default the plain-mul_mat MMVQ ncols ceiling to 3: measured crossover on
  RTX 3090 (Q4_K_M/Q6_K dense GEMVs) - MMVQ wins at ncols<=3, MMQ at 4-8.
  Laguna w6 chain 199->237 tok/s, Qwen3.6 chain 127->137 tok/s. Env-overridable
  via LUCE_MMVQ_MAX_NCOLS for other architectures.
- extend the MUL_MAT_ID MoE fast path from 8 to 16 tokens (host+device caps,
  launch_bounds, dispatch); keeps CUDA-graph capture for 9-16 token
  spec-verify batches. ne2<=8 behavior byte-identical.
- LUCE_MMQ_DP_MAX_NE1 tuning env (default off; documented negative result
  on sm_86).
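
The ncols ceiling described in the first bullet can be pictured as a small dispatch rule. This is a sketch only: the real decision lives in the ggml CUDA backend, and the function names and the handling of unparsable values are assumptions. Only the default of 3 and the `LUCE_MMVQ_MAX_NCOLS` variable come from the commit message.

```python
import os

# Default ceiling from the commit message (measured crossover on RTX 3090).
DEFAULT_MMVQ_MAX_NCOLS = 3


def mmvq_max_ncols(env=None):
    """Read the ceiling from LUCE_MMVQ_MAX_NCOLS, falling back to the default.

    Falling back on unparsable values is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    raw = env.get("LUCE_MMVQ_MAX_NCOLS")
    if raw is None:
        return DEFAULT_MMVQ_MAX_NCOLS
    try:
        return max(0, int(raw))
    except ValueError:
        return DEFAULT_MMVQ_MAX_NCOLS


def pick_kernel(ncols, env=None):
    """Choose the GEMV-style path (MMVQ) up to the ceiling, MMQ above it."""
    return "mmvq" if ncols <= mmvq_max_ncols(env) else "mmq"
```

With the default, a 3-column batch takes MMVQ and a 4-column batch takes MMQ; raising the variable moves the crossover for architectures where the measurement differs.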

* draft: DSpark Markov/confidence head loading + LoRA draft variants

Incorporates #460 and #461 rebased onto main: converter embeds the optional
dflash.dspark.* tensors, the draft loader binds them (old GGUFs unaffected),
and the Laguna greedy-chain decode applies the low-rank previous-token Markov
correction. LoRA loader ported to the GgufMmap API introduced on main.

* laguna: fused DSpark Markov head for chain drafting and DDTree candidates

One graph on the draft stream (base logits via a single lm_head matmul +
unrolled Markov argmax->get_rows chain, async readback, one sync) replaces
the per-candidate graph rebuild and host logits round-trip. The same builder
serves DDTree candidate generation (markov-corrected top-K, same output
contract as project_hidden_to_topk). Kill switches:
DFLASH_LAGUNA_FUSED_DSPARK=0, DFLASH_LAGUNA_DSPARK_TREE=0.
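
A host-side sketch of the chain logic may help when reading the description above. It assumes the Markov correction is an additive low-rank bias selected by the previous token (`left[prev] @ right`); that form, and every name below, is an assumption for illustration. The real implementation runs as one graph on the draft stream rather than as a Python loop.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def markov_row(prev_token, left, right):
    """Low-rank bias for one step: left[prev_token] (length r) times right (r x vocab)."""
    coeffs = left[prev_token]
    vocab = len(right[0])
    return [sum(c * right[k][v] for k, c in enumerate(coeffs)) for v in range(vocab)]


def draft_chain(base_logits, first_prev, left, right):
    """Greedy chain: each position's logits are corrected by the previous pick."""
    tokens = []
    prev = first_prev
    for logits in base_logits:          # base logits come from one lm_head matmul
        bias = markov_row(prev, left, right)
        corrected = [l + b for l, b in zip(logits, bias)]
        prev = argmax(corrected)        # argmax feeds the next row lookup
        tokens.append(prev)
    return tokens
```

The dependency of each step on the previous argmax is what the unrolled argmax->get_rows chain expresses inside the graph, so no logits need to travel back to the host between steps.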

Measured (laguna-xs2 Q4_K_M + v24 drafter, RTX 3090, HumanEval): chain
accept +2.2pts at w3; DDTree commit 3.83->4.43 at budget 12; verify-width
sweep moves the chain optimum to w6.

* laguna: persistent verify step-graph

Within a (width, mask-window) key the verify graph is structurally identical
every step - kv position enters only through input data - so cache the built
graph and skip the per-step host rebuild + allocator pass. Outputs are
byte-identical (same graph, same data); DFLASH_LAGUNA_PERSIST_VERIFY=0
restores the rebuild-every-step path.
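
The caching idea can be sketched as a keyed memo around the graph builder. The class and callback names are hypothetical, and the real slot holds a ggml graph plus its allocator state rather than an arbitrary Python object.

```python
class VerifyGraphCache:
    """Reuse a built verify graph while the (width, mask_window) key is unchanged."""

    def __init__(self, build_graph, persist=True):
        self._build = build_graph   # hypothetical callback: (width, mask_window) -> graph
        self._persist = persist
        self._graphs = {}
        self.builds = 0             # counts how often the builder actually ran

    def get(self, width, mask_window):
        key = (width, mask_window)
        if self._persist and key in self._graphs:
            return self._graphs[key]
        graph = self._build(width, mask_window)
        self.builds += 1
        if self._persist:
            self._graphs[key] = graph
        return graph


def persist_enabled(env):
    """DFLASH_LAGUNA_PERSIST_VERIFY=0 switches back to rebuilding every step."""
    return env.get("DFLASH_LAGUNA_PERSIST_VERIFY", "1") != "0"
```

Reuse is only safe because the KV position enters through input data; anything baked into the graph structure would have to become part of the key or force a rebuild.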

* scripts: DFlash draft requantize tool (q8_0 / q4-mix)

q4-mix quantizes the drafter backbone to q4_0 and keeps the dflash.* heads
at q8_0 (the Markov/projection bias precision is what near-tie corrections
depend on). Measured acceptance-neutral: f16 236.6 -> q8_0 241.3 -> q4-mix
249.6 tok/s (Laguna-XS.2 + v24 drafter, RTX 3090, HumanEval, verify width 6).
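
The q4-mix rule amounts to a per-tensor type choice keyed on the name prefix. The sketch below shows only that choice; the function name and the example tensor names are illustrative, and any tensors the real script leaves unquantized are not modeled.

```python
def target_type(tensor_name, mode):
    """Pick the quantization type for one tensor under the given mode."""
    if mode == "q8_0":
        return "q8_0"
    if mode == "q4-mix":
        # Heads under dflash.* stay at q8_0; the drafter backbone goes to q4_0.
        return "q8_0" if tensor_name.startswith("dflash.") else "q4_0"
    raise ValueError(f"unknown mode: {mode}")


def plan(tensor_names, mode):
    """Map every tensor name to its target type."""
    return {name: target_type(name, mode) for name in tensor_names}
```

For example, `plan(["blk.0.attn_q.weight", "dflash.dspark.example"], "q4-mix")` would send the first tensor to q4_0 and keep the second at q8_0; both names are placeholders.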

* fix: address cubic review findings

- draft LoRA loader: overflow-safe gguf_tensor_in_file() bounds check with
  the gguf_bounds_error() diagnostic; validate tensor type and byte size
  before resizing the host buffer so corrupt metadata cannot force a large
  allocation
- laguna verify: key the persistent step-graph slot on backend/weights/cache
  identity, and restrict reuse to the kv_idx input-data path (NO_KVPAD /
  PAD_CPY fallbacks bake kv_start into the graph, so they rebuild per step)
- server: reject --draft-lora for non-laguna targets and for layer-split
  placement instead of silently ignoring it
- laguna backend: drop the dead active_draft_lora_ member
- quantize_dflash_draft.py: np.asarray() the metadata array parts before
  .item() (robust across gguf-py part representations; output verified
  byte-identical to the shipped q4-mix draft)
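
To show why the bounds check in the first bullet has to be overflow-safe, the sketch below emulates 64-bit unsigned arithmetic in Python. It is not the body of `gguf_tensor_in_file()`; the function names here are illustrative and only the subtraction-based comparison is the point.

```python
U64_MASK = (1 << 64) - 1


def naive_in_file(offset, size, file_size):
    """Buggy form: offset + size can wrap around in uint64 and look small."""
    end = (offset + size) & U64_MASK    # emulates C unsigned wraparound
    return end <= file_size


def safe_in_file(offset, size, file_size):
    """Overflow-safe form: never adds, only compares and subtracts."""
    return offset <= file_size and size <= file_size - offset
```

With a corrupt offset of `2**64 - 8` and a size of 16, the naive sum wraps to 8 and passes against a 1024-byte file, while the safe form rejects it because the offset alone already exceeds the file size.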

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(laguna): keep qwen KV-cache env shorthands from overriding laguna KV types

Since #455, resolve_laguna_kv_types() routed laguna's KV dtype through
resolve_kv_types() whenever any DFLASH27B_KV_* env var was present.
dflash_server auto-sets DFLASH27B_KV_TQ3=1 for max_ctx > 6144 (a VRAM
heuristic for the qwen 27B family), so every laguna server run at the
default max_ctx=8192 silently got a TQ3_0 ternary KV cache instead of
Q8_0 and produced degenerate output on every decode path (accept 16.7%,
avg_commit 1.00).

Laguna now honors only the explicit per-axis DFLASH27B_KV_K / _KV_V
overrides (--cache-type-k/v), logs when they take effect, and ignores
the legacy F16/Q4/TQ3 shorthands.
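
The resolution rule after this fix can be summarized in a few lines. This is a sketch with an illustrative function name, not the body of `resolve_laguna_kv_types()`; the environment is passed in as a plain mapping so the behavior is easy to check.

```python
def resolve_laguna_kv_sketch(env, default="q8_0"):
    """Honor only the explicit per-axis overrides; ignore every other KV variable."""
    k_type = env.get("DFLASH27B_KV_K", default)
    v_type = env.get("DFLASH27B_KV_V", default)
    overridden = "DFLASH27B_KV_K" in env or "DFLASH27B_KV_V" in env
    return k_type, v_type, overridden
```

Under this rule the server's auto-set `DFLASH27B_KV_TQ3=1` no longer changes anything for laguna: `resolve_laguna_kv_sketch({"DFLASH27B_KV_TQ3": "1"})` stays on the Q8_0 default, and the third return value is what a caller could use to decide whether to log the override.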

Verified on RTX 3090 (lucebox): default-env laguna-xs2 + v24 q4mix w6
chain goes from deterministic garbage to coherent 272 tok/s at 71.1%
accept / avg_commit 4.27; output hash identical with
DFLASH_LAGUNA_PERSIST_VERIFY=0; explicit q4_0 override honored and
logged; ctest 12/12.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(laguna): validate the (K, V) pair after explicit KV-type overrides

Extract the supported-pair check from resolve_kv_types() into
dflash::validate_kv_pair_or_abort() and call it from the laguna per-axis
override path, so an unsupported combination (e.g. DFLASH27B_KV_K=tq3_0
DFLASH27B_KV_V=q4_1) aborts at startup with the supported-pairs listing
instead of failing later in a CUDA kernel.
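
A fail-fast pair check might look like the following. The supported set here contains only the two pairs this PR reports as working (q8_0/q8_0 and q4_0/q4_0); the real supported-pairs list is not shown in the commit message and is presumably longer. Raising an exception stands in for the abort, and the function name is illustrative.

```python
# Only the pairs this PR reports as verified; the real list is not reproduced here.
KNOWN_GOOD_PAIRS = frozenset({("q8_0", "q8_0"), ("q4_0", "q4_0")})


def validate_kv_pair(k_type, v_type, supported=KNOWN_GOOD_PAIRS):
    """Reject an unsupported (K, V) combination at startup with a readable listing."""
    if (k_type, v_type) not in supported:
        listing = ", ".join(f"{k}/{v}" for k, v in sorted(supported))
        raise ValueError(
            f"unsupported KV pair {k_type}/{v_type}; supported pairs: {listing}"
        )
    return k_type, v_type
```

Checking the pair right after the overrides are read keeps the failure next to its cause, instead of surfacing later as a kernel error.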

Verified on lucebox 3090: invalid pair aborts with the listing; valid
q4_0/q4_0 override still honored, logged, and coherent; default q8_0
path untouched; ctest 12/12.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* perf(laguna): pinned async input staging + per-step phase profiler

Stage the per-step graph inputs (embed, positions, kv_idx, masks, feat
rows) in a slot-owned pinned host buffer and upload them with
ggml_backend_tensor_set_async on the backend stream. A plain tensor_set
from pageable memory costs a staged DMA plus a stream synchronize per
call (~6 pairs per decode/verify step); the single sync inside
ggml_backend_graph_compute() now covers ordering. Output hashes are
byte-identical; worth ~1-2 tok/s at w6 and grows with context length
(the mask upload scales with kv_pad).

Add DFLASH_LAGUNA_STEP_PROF=1: per-step wall breakdown of the spec chain
(draft forward / DSpark heads / verify / other), printed with the spec
summary. Measured on RTX 3090 at w6: verify 11.8 ms (11.6 ms pure GPU,
98% busy), draft 2.2 ms, heads 0.75 ms, other 0.4 ms of a 15.2 ms step -
the basis for the current kernel-optimization roadmap.
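
A per-phase wall-clock accumulator in the spirit of `DFLASH_LAGUNA_STEP_PROF=1` could be sketched as below. The class and phase names are illustrative, and timing GPU work this way is only meaningful if the phase ends after the device has been synchronized.

```python
import time
from contextlib import contextmanager


class StepProfiler:
    """Accumulate wall time per named phase and report each phase's share."""

    def __init__(self):
        self.totals_ms = {}

    def add(self, name, ms):
        self.totals_ms[name] = self.totals_ms.get(name, 0.0) + ms

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.add(name, (time.perf_counter() - start) * 1e3)

    def shares(self):
        total = sum(self.totals_ms.values())
        if total == 0:
            return {}
        return {name: ms / total for name, ms in self.totals_ms.items()}
```

Feeding in the figures quoted above (verify 11.8 ms, draft 2.2 ms, heads 0.75 ms, other 0.4 ms) puts verify at a little over three quarters of the step, which is consistent with verify being the target of the kernel work.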

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(draft): defer LoRA draft variants, keep the DSpark head

Remove the switchable draft LoRA adapter support that was folded in from
#460: the --draft-lora flag, the per-request extra_body.draft_lora
selector, the GGUF LoRA adapter loader/merger, and the DraftLoraSpec
plumbing through the factory and backends. #460 stays open for a future
pass; this PR ships only the DSpark Markov head and the spec-verify
stack.

The draft-variant container in LagunaBackend stays (single "base"
variant) so a future variant mechanism can slot back in without
re-plumbing the loader.

Verified on the 3090: output hashes byte-identical to the previous
commit, 282 tok/s at w6, DSpark head active, --draft-lora now rejected
as an unknown option, ctest 12/12.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: mrciffa <davide@cifarelli.tech>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>