Add switchable draft LoRA variants for Laguna by davide221 · Pull Request #460 · Luce-Org/lucebox-hub

davide221 · 2026-06-29T00:24:16Z

Summary

add GGUF LoRA adapter loading for DFlash draft tensors and pre-merge adapters into draft weights at load time
add --draft-lora <path|name=path> and request-side extra_body.draft_lora selection
keep base as the unadapted draft and load named LoRA variants as resident switchable variants
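As an illustration of the "pre-merge at load time" idea in the summary, the sketch below folds a low-rank update into a dense weight using the standard LoRA formula W' = W + (alpha / r) * B A. It is not the loader from this PR: the function names, the list-of-lists matrix layout, and the `alpha` scaling convention are assumptions made for the example, and the real code operates on GGUF tensors.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; a is (m x k), b is (k x n)."""
    inner = len(b)
    if inner == 0 or any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A as a new matrix.

    weight is (out x in), lora_a is (r x in), lora_b is (out x r).
    The inputs are left untouched so the unadapted base stays available.
    """
    rank = len(lora_a)
    if rank == 0:
        raise ValueError("adapter rank must be positive")
    delta = matmul(lora_b, lora_a)
    if len(delta) != len(weight) or len(delta[0]) != len(weight[0]):
        raise ValueError("adapter shape does not match the base weight")
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Hypothetical rank-1 adapter on a 2x2 weight.
base_weight: Matrix = [[1.0, 0.0], [0.0, 1.0]]
adapter_a: Matrix = [[1.0, 2.0]]        # (r x in)
adapter_b: Matrix = [[1.0], [3.0]]      # (out x r)

# Both variants stay resident; "base" is the unadapted weight.
variants: Dict[str, Matrix] = {
    "base": base_weight,
    "personal": merge_lora(base_weight, adapter_a, adapter_b, alpha=2.0),
}
```

Merging once at load time means a request that selects a variant pays no extra matmul per token; the cost is one additional resident copy of each adapted tensor per variant.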

Validation

git diff --cached --check before commit
remote CUDA build: cmake --build build-lora-cuda --target dflash_server -j$(nproc)
smoke request on lucebox with personal LoRA selected: adapter loaded, base + personal variants loaded, request selected personal
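The smoke test above exercises two steps: registering a named adapter from `--draft-lora <path|name=path>` and picking a variant per request. A minimal sketch of that flow follows. The helper names are hypothetical, and two behaviors are assumptions rather than facts from this PR: that a bare path takes its variant name from the file stem, and that an unknown name is rejected instead of falling back to `base`.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple


def parse_draft_lora(arg: str) -> Tuple[str, str]:
    """Split a `path` or `name=path` value into (name, path)."""
    name, sep, path = arg.partition("=")
    if sep:
        if not name or not path:
            raise ValueError(f"malformed --draft-lora value: {arg!r}")
        return name, path
    # Assumption: a bare path is named after its file stem.
    return Path(arg).stem, arg


def select_variant(loaded: Iterable[str], requested: Optional[str]) -> str:
    """Pick the draft variant for one request; no selection means `base`."""
    if requested is None:
        return "base"
    if requested not in set(loaded):
        # Assumption: unknown names are an error, not a silent fallback.
        raise KeyError(f"unknown draft variant: {requested!r}")
    return requested


# Mirrors the smoke test: base + personal loaded, request selects personal.
name, path = parse_draft_lora("personal=/models/personal-lora.gguf")
loaded_variants = ["base", name]
chosen = select_variant(loaded_variants, "personal")
```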

Experiment note

A small rank-16 personal LoRA trained from session-derived records did not improve holdout speculative metrics versus the normal drafter, so this PR only adds serving support and does not enable any adapter by default.
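The note does not define its "holdout speculative metrics". The sketch below assumes the two figures that appear in the related commit messages, acceptance rate and average commit length, and assumes accept = committed tokens / (steps x verify width). That definition is an inference: it is consistent with the reported "accept 16.7%, avg_commit 1.00" at width 6, but it is not confirmed by the source. The function name and input format are hypothetical.

```python
from typing import Sequence, Tuple


def spec_metrics(committed_per_step: Sequence[int], width: int) -> Tuple[float, float]:
    """Return (accept, avg_commit) for one speculative-decoding run.

    committed_per_step holds the number of tokens committed at each
    verify step; width is the verify width (for example 6 for "w6").
    Assumed definitions, not taken from the PR:
      avg_commit = total committed / number of steps
      accept     = total committed / (number of steps * width)
    """
    steps = len(committed_per_step)
    if steps == 0 or width <= 0:
        raise ValueError("need at least one step and a positive width")
    total = sum(committed_per_step)
    return total / (steps * width), total / steps


# Degenerate run: every step commits exactly one token at width 6.
accept, avg_commit = spec_metrics([1, 1, 1, 1], width=6)
```

Comparing a LoRA variant against the normal drafter would then mean computing both figures for each on the same holdout prompts and checking whether the variant's values are higher.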

Remove the switchable draft LoRA adapter support that was folded in from #460: the --draft-lora flag, the per-request extra_body.draft_lora selector, the GGUF LoRA adapter loader/merger, and the DraftLoraSpec plumbing through the factory and backends. #460 stays open for a future pass; this PR ships only the DSpark Markov head and the spec-verify stack. The draft-variant container in LagunaBackend stays (single "base" variant) so a future variant mechanism can slot back in without re-plumbing the loader.

Verified on the 3090: output hashes byte-identical to the previous commit, 282 tok/s at w6, DSpark head active, --draft-lora now rejected as an unknown option, ctest 12/12.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…/s (RTX 3090) (#482)

* deps: ggml spec-decode verify kernel stack for sm_86
- default the plain-mul_mat MMVQ ncols ceiling to 3: measured crossover on RTX 3090 (Q4_K_M/Q6_K dense GEMVs) - MMVQ wins at ncols<=3, MMQ at 4-8. Laguna w6 chain 199->237 tok/s, Qwen3.6 chain 127->137 tok/s. Env-overridable via LUCE_MMVQ_MAX_NCOLS for other architectures.
- extend the MUL_MAT_ID MoE fast path from 8 to 16 tokens (host+device caps, launch_bounds, dispatch); keeps CUDA-graph capture for 9-16 token spec-verify batches. ne2<=8 behavior byte-identical.
- LUCE_MMQ_DP_MAX_NE1 tuning env (default off; documented negative result on sm_86).

* draft: DSpark Markov/confidence head loading + LoRA draft variants
Incorporates #460 and #461 rebased onto main: converter embeds the optional dflash.dspark.* tensors, the draft loader binds them (old GGUFs unaffected), and the Laguna greedy-chain decode applies the low-rank previous-token Markov correction. LoRA loader ported to the GgufMmap API introduced on main.

* laguna: fused DSpark Markov head for chain drafting and DDTree candidates
One graph on the draft stream (base logits via a single lm_head matmul + unrolled Markov argmax->get_rows chain, async readback, one sync) replaces the per-candidate graph rebuild and host logits round-trip. The same builder serves DDTree candidate generation (markov-corrected top-K, same output contract as project_hidden_to_topk). Kill switches: DFLASH_LAGUNA_FUSED_DSPARK=0, DFLASH_LAGUNA_DSPARK_TREE=0. Measured (laguna-xs2 Q4_K_M + v24 drafter, RTX 3090, HumanEval): chain accept +2.2pts at w3; DDTree commit 3.83->4.43 at budget 12; verify-width sweep moves the chain optimum to w6.

* laguna: persistent verify step-graph
Within a (width, mask-window) key the verify graph is structurally identical every step - kv position enters only through input data - so cache the built graph and skip the per-step host rebuild + allocator pass. Outputs are byte-identical (same graph, same data); DFLASH_LAGUNA_PERSIST_VERIFY=0 restores the rebuild-every-step path.

* scripts: DFlash draft requantize tool (q8_0 / q4-mix)
q4-mix quantizes the drafter backbone to q4_0 and keeps the dflash.* heads at q8_0 (the Markov/projection bias precision is what near-tie corrections depend on). Measured acceptance-neutral: f16 236.6 -> q8_0 241.3 -> q4-mix 249.6 tok/s (Laguna-XS.2 + v24 drafter, RTX 3090, HumanEval, verify width 6).

* fix: address cubic review findings
- draft LoRA loader: overflow-safe gguf_tensor_in_file() bounds check with the gguf_bounds_error() diagnostic; validate tensor type and byte size before resizing the host buffer so corrupt metadata cannot force a large allocation
- laguna verify: key the persistent step-graph slot on backend/weights/cache identity, and restrict reuse to the kv_idx input-data path (NO_KVPAD / PAD_CPY fallbacks bake kv_start into the graph, so they rebuild per step)
- server: reject --draft-lora for non-laguna targets and for layer-split placement instead of silently ignoring it
- laguna backend: drop the dead active_draft_lora_ member
- quantize_dflash_draft.py: np.asarray() the metadata array parts before .item() (robust across gguf-py part representations; output verified byte-identical to the shipped q4-mix draft)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(laguna): keep qwen KV-cache env shorthands from overriding laguna KV types
Since #455, resolve_laguna_kv_types() routed laguna's KV dtype through resolve_kv_types() whenever any DFLASH27B_KV_* env var was present. dflash_server auto-sets DFLASH27B_KV_TQ3=1 for max_ctx > 6144 (a VRAM heuristic for the qwen 27B family), so every laguna server run at the default max_ctx=8192 silently got a TQ3_0 ternary KV cache instead of Q8_0 and produced degenerate output on every decode path (accept 16.7%, avg_commit 1.00). Laguna now honors only the explicit per-axis DFLASH27B_KV_K / _KV_V overrides (--cache-type-k/v), logs when they take effect, and ignores the legacy F16/Q4/TQ3 shorthands. Verified on RTX 3090 (lucebox): default-env laguna-xs2 + v24 q4mix w6 chain goes from deterministic garbage to coherent 272 tok/s at 71.1% accept / avg_commit 4.27; output hash identical with DFLASH_LAGUNA_PERSIST_VERIFY=0; explicit q4_0 override honored and logged; ctest 12/12.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(laguna): validate the (K, V) pair after explicit KV-type overrides
Extract the supported-pair check from resolve_kv_types() into dflash::validate_kv_pair_or_abort() and call it from the laguna per-axis override path, so an unsupported combination (e.g. DFLASH27B_KV_K=tq3_0 DFLASH27B_KV_V=q4_1) aborts at startup with the supported-pairs listing instead of failing later in a CUDA kernel. Verified on lucebox 3090: invalid pair aborts with the listing; valid q4_0/q4_0 override still honored, logged, and coherent; default q8_0 path untouched; ctest 12/12.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* perf(laguna): pinned async input staging + per-step phase profiler
Stage the per-step graph inputs (embed, positions, kv_idx, masks, feat rows) in a slot-owned pinned host buffer and upload them with ggml_backend_tensor_set_async on the backend stream. A plain tensor_set from pageable memory costs a staged DMA plus a stream synchronize per call (~6 pairs per decode/verify step); the single sync inside ggml_backend_graph_compute() now covers ordering. Output hashes are byte-identical; worth ~1-2 tok/s at w6 and grows with context length (the mask upload scales with kv_pad). Add DFLASH_LAGUNA_STEP_PROF=1: per-step wall breakdown of the spec chain (draft forward / DSpark heads / verify / other), printed with the spec summary. Measured on RTX 3090 at w6: verify 11.8 ms (11.6 ms pure GPU, 98% busy), draft 2.2 ms, heads 0.75 ms, other 0.4 ms of a 15.2 ms step - the basis for the current kernel-optimization roadmap.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(draft): defer LoRA draft variants, keep the DSpark head
Remove the switchable draft LoRA adapter support that was folded in from #460: the --draft-lora flag, the per-request extra_body.draft_lora selector, the GGUF LoRA adapter loader/merger, and the DraftLoraSpec plumbing through the factory and backends. #460 stays open for a future pass; this PR ships only the DSpark Markov head and the spec-verify stack. The draft-variant container in LagunaBackend stays (single "base" variant) so a future variant mechanism can slot back in without re-plumbing the loader. Verified on the 3090: output hashes byte-identical to the previous commit, 282 tok/s at w6, DSpark head active, --draft-lora now rejected as an unknown option, ctest 12/12.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------
Co-authored-by: mrciffa <davide@cifarelli.tech>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
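The "persistent verify step-graph" entry in the commit message above describes caching a built graph per (width, mask-window) key, with DFLASH_LAGUNA_PERSIST_VERIFY=0 restoring the rebuild-every-step path. The sketch below shows that caching pattern in isolation. The class, the `build` callback, and the way the environment variable is parsed are all hypothetical; per the later review fix, the real slot is also keyed on backend/weights/cache identity, which is omitted here.

```python
import os
from typing import Callable, Dict, Mapping, Optional, Tuple

GraphKey = Tuple[int, int]  # (verify width, mask window)


def persist_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumed parsing: caching is on unless the variable is exactly "0"."""
    env = os.environ if env is None else env
    return env.get("DFLASH_LAGUNA_PERSIST_VERIFY", "1") != "0"


class StepGraphCache:
    """Reuse a built graph while its structural key stays the same."""

    def __init__(self, build: Callable[[int, int], object], enabled: bool = True) -> None:
        self._build = build
        self._enabled = enabled
        self._graphs: Dict[GraphKey, object] = {}
        self.builds = 0  # number of times the builder actually ran

    def get(self, width: int, mask_window: int) -> object:
        key = (width, mask_window)
        if self._enabled and key in self._graphs:
            # Same structure as last step: only the input data changes.
            return self._graphs[key]
        graph = self._build(width, mask_window)
        self.builds += 1
        if self._enabled:
            self._graphs[key] = graph
        return graph


# Hypothetical builder standing in for graph construction + allocation.
cache = StepGraphCache(lambda w, m: ("verify-graph", w, m))
for _ in range(3):
    cache.get(6, 256)   # built once, reused twice
```

Because the key captures everything structural and the KV position arrives only as input data, reuse does not change results, which matches the byte-identical output the commit reports.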

Add draft LoRA variants for Laguna

d2e687b

This was referenced Jun 29, 2026

Add DSpark Markov draft-head support for Laguna #461 (Closed)

perf(laguna): DSpark Markov head + spec-verify stack - 206 to 249 tok/s (RTX 3090) #482 (Merged)


davide221 wants to merge 1 commit into main from draft-lora-variants
