Add switchable draft LoRA variants for Laguna #460
Draft
davide221 wants to merge 1 commit into
This was referenced Jun 29, 2026
davide221 added a commit that referenced this pull request Jul 3, 2026
Remove the switchable draft LoRA adapter support that was folded in from #460: the --draft-lora flag, the per-request extra_body.draft_lora selector, the GGUF LoRA adapter loader/merger, and the DraftLoraSpec plumbing through the factory and backends. #460 stays open for a future pass; this PR ships only the DSpark Markov head and the spec-verify stack. The draft-variant container in LagunaBackend stays (single "base" variant) so a future variant mechanism can slot back in without re-plumbing the loader. Verified on the 3090: output hashes byte-identical to the previous commit, 282 tok/s at w6, DSpark head active, --draft-lora now rejected as an unknown option, ctest 12/12. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
davide221 added a commit that referenced this pull request Jul 3, 2026
…/s (RTX 3090) (#482)

* deps: ggml spec-decode verify kernel stack for sm_86

- default the plain-mul_mat MMVQ ncols ceiling to 3: measured crossover on RTX 3090 (Q4_K_M/Q6_K dense GEMVs) - MMVQ wins at ncols<=3, MMQ at 4-8. Laguna w6 chain 199->237 tok/s, Qwen3.6 chain 127->137 tok/s. Env-overridable via LUCE_MMVQ_MAX_NCOLS for other architectures.
- extend the MUL_MAT_ID MoE fast path from 8 to 16 tokens (host+device caps, launch_bounds, dispatch); keeps CUDA-graph capture for 9-16 token spec-verify batches. ne2<=8 behavior byte-identical.
- LUCE_MMQ_DP_MAX_NE1 tuning env (default off; documented negative result on sm_86).

* draft: DSpark Markov/confidence head loading + LoRA draft variants

Incorporates #460 and #461 rebased onto main: converter embeds the optional dflash.dspark.* tensors, the draft loader binds them (old GGUFs unaffected), and the Laguna greedy-chain decode applies the low-rank previous-token Markov correction. LoRA loader ported to the GgufMmap API introduced on main.

* laguna: fused DSpark Markov head for chain drafting and DDTree candidates

One graph on the draft stream (base logits via a single lm_head matmul + unrolled Markov argmax->get_rows chain, async readback, one sync) replaces the per-candidate graph rebuild and host logits round-trip. The same builder serves DDTree candidate generation (markov-corrected top-K, same output contract as project_hidden_to_topk). Kill switches: DFLASH_LAGUNA_FUSED_DSPARK=0, DFLASH_LAGUNA_DSPARK_TREE=0.

Measured (laguna-xs2 Q4_K_M + v24 drafter, RTX 3090, HumanEval): chain accept +2.2pts at w3; DDTree commit 3.83->4.43 at budget 12; verify-width sweep moves the chain optimum to w6.

* laguna: persistent verify step-graph

Within a (width, mask-window) key the verify graph is structurally identical every step - kv position enters only through input data - so cache the built graph and skip the per-step host rebuild + allocator pass. Outputs are byte-identical (same graph, same data); DFLASH_LAGUNA_PERSIST_VERIFY=0 restores the rebuild-every-step path.

* scripts: DFlash draft requantize tool (q8_0 / q4-mix)

q4-mix quantizes the drafter backbone to q4_0 and keeps the dflash.* heads at q8_0 (the Markov/projection bias precision is what near-tie corrections depend on). Measured acceptance-neutral: f16 236.6 -> q8_0 241.3 -> q4-mix 249.6 tok/s (Laguna-XS.2 + v24 drafter, RTX 3090, HumanEval, verify width 6).

* fix: address cubic review findings

- draft LoRA loader: overflow-safe gguf_tensor_in_file() bounds check with the gguf_bounds_error() diagnostic; validate tensor type and byte size before resizing the host buffer so corrupt metadata cannot force a large allocation
- laguna verify: key the persistent step-graph slot on backend/weights/cache identity, and restrict reuse to the kv_idx input-data path (NO_KVPAD / PAD_CPY fallbacks bake kv_start into the graph, so they rebuild per step)
- server: reject --draft-lora for non-laguna targets and for layer-split placement instead of silently ignoring it
- laguna backend: drop the dead active_draft_lora_ member
- quantize_dflash_draft.py: np.asarray() the metadata array parts before .item() (robust across gguf-py part representations; output verified byte-identical to the shipped q4-mix draft)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(laguna): keep qwen KV-cache env shorthands from overriding laguna KV types

Since #455, resolve_laguna_kv_types() routed laguna's KV dtype through resolve_kv_types() whenever any DFLASH27B_KV_* env var was present. dflash_server auto-sets DFLASH27B_KV_TQ3=1 for max_ctx > 6144 (a VRAM heuristic for the qwen 27B family), so every laguna server run at the default max_ctx=8192 silently got a TQ3_0 ternary KV cache instead of Q8_0 and produced degenerate output on every decode path (accept 16.7%, avg_commit 1.00).

Laguna now honors only the explicit per-axis DFLASH27B_KV_K / _KV_V overrides (--cache-type-k/v), logs when they take effect, and ignores the legacy F16/Q4/TQ3 shorthands.

Verified on RTX 3090 (lucebox): default-env laguna-xs2 + v24 q4mix w6 chain goes from deterministic garbage to coherent 272 tok/s at 71.1% accept / avg_commit 4.27; output hash identical with DFLASH_LAGUNA_PERSIST_VERIFY=0; explicit q4_0 override honored and logged; ctest 12/12.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(laguna): validate the (K, V) pair after explicit KV-type overrides

Extract the supported-pair check from resolve_kv_types() into dflash::validate_kv_pair_or_abort() and call it from the laguna per-axis override path, so an unsupported combination (e.g. DFLASH27B_KV_K=tq3_0 DFLASH27B_KV_V=q4_1) aborts at startup with the supported-pairs listing instead of failing later in a CUDA kernel.

Verified on lucebox 3090: invalid pair aborts with the listing; valid q4_0/q4_0 override still honored, logged, and coherent; default q8_0 path untouched; ctest 12/12.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* perf(laguna): pinned async input staging + per-step phase profiler

Stage the per-step graph inputs (embed, positions, kv_idx, masks, feat rows) in a slot-owned pinned host buffer and upload them with ggml_backend_tensor_set_async on the backend stream. A plain tensor_set from pageable memory costs a staged DMA plus a stream synchronize per call (~6 pairs per decode/verify step); the single sync inside ggml_backend_graph_compute() now covers ordering. Output hashes are byte-identical; worth ~1-2 tok/s at w6 and grows with context length (the mask upload scales with kv_pad).

Add DFLASH_LAGUNA_STEP_PROF=1: per-step wall breakdown of the spec chain (draft forward / DSpark heads / verify / other), printed with the spec summary. Measured on RTX 3090 at w6: verify 11.8 ms (11.6 ms pure GPU, 98% busy), draft 2.2 ms, heads 0.75 ms, other 0.4 ms of a 15.2 ms step - the basis for the current kernel-optimization roadmap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(draft): defer LoRA draft variants, keep the DSpark head

Remove the switchable draft LoRA adapter support that was folded in from #460: the --draft-lora flag, the per-request extra_body.draft_lora selector, the GGUF LoRA adapter loader/merger, and the DraftLoraSpec plumbing through the factory and backends. #460 stays open for a future pass; this PR ships only the DSpark Markov head and the spec-verify stack. The draft-variant container in LagunaBackend stays (single "base" variant) so a future variant mechanism can slot back in without re-plumbing the loader.

Verified on the 3090: output hashes byte-identical to the previous commit, 282 tok/s at w6, DSpark head active, --draft-lora now rejected as an unknown option, ctest 12/12.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: mrciffa <davide@cifarelli.tech>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
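The Laguna KV-type fix described in the commit message above can be illustrated with a minimal Python sketch. This is not the C++ code from the repository: the function name `sketch_resolve_laguna_kv` and the normalisation of values are assumptions made for illustration. Only the env var names DFLASH27B_KV_K, DFLASH27B_KV_V and DFLASH27B_KV_TQ3, and the q8_0 default, come from the commit text. The supported-pair validation is not modelled here.

```python
DEFAULT_KV_TYPE = "q8_0"  # Laguna default named in the commit message


def sketch_resolve_laguna_kv(env):
    """Hypothetical sketch: resolve Laguna's (K, V) cache types from an env mapping.

    Only the explicit per-axis overrides are honoured. Any other
    DFLASH27B_KV_* variable (for example the auto-set DFLASH27B_KV_TQ3=1)
    is deliberately never read.
    """
    k = env.get("DFLASH27B_KV_K", "").strip().lower() or DEFAULT_KV_TYPE
    v = env.get("DFLASH27B_KV_V", "").strip().lower() or DEFAULT_KV_TYPE
    return k, v


# The server's qwen-oriented shorthand no longer changes Laguna's cache type.
print(sketch_resolve_laguna_kv({"DFLASH27B_KV_TQ3": "1"}))
# An explicit per-axis override is still honoured.
print(sketch_resolve_laguna_kv({"DFLASH27B_KV_K": "q4_0", "DFLASH27B_KV_V": "q4_0"}))
```

Reading only the two per-axis keys is what keeps a heuristic meant for another model family from leaking into Laguna's configuration; how the real code treats a single-axis override is not stated in the source, so the fallback to q8_0 for the unset axis is an assumption.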
Summary
- Add `--draft-lora <path|name=path>` and request-side `extra_body.draft_lora` selection
- Keep `base` as the unadapted draft and load named LoRA variants as resident switchable variants

Validation
- `git diff --cached --check` before commit
- `cmake --build build-lora-cuda --target dflash_server -j$(nproc)`
- `personal` LoRA selected: adapter loaded, base + personal variants loaded, request selected personal

Experiment note
A small rank-16 personal LoRA trained from session-derived records did not improve holdout speculative metrics versus the normal drafter, so this PR only adds serving support and does not enable any adapter by default.
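
As a rough illustration of the selection model in the Summary, the following Python sketch parses a `--draft-lora` value of the form `<path|name=path>` and picks a variant from a request's `extra_body`. It is not the server's C++ implementation. The helper names, the choice to derive a variant name from the file stem when only a path is given, and the error behaviour for unknown names are all assumptions; only the flag syntax, the `draft_lora` field, and the `base` variant come from this PR.

```python
from pathlib import Path

BASE_VARIANT = "base"  # the unadapted draft, per the Summary


def parse_draft_lora(spec):
    """Hypothetical: split a --draft-lora value into (variant_name, adapter_path)."""
    name, sep, path = spec.partition("=")
    if not sep:
        # Bare path form. Assumption: the variant is named after the file stem.
        path = spec
        name = Path(spec).stem
    if not name or not path:
        raise ValueError(f"malformed --draft-lora value: {spec!r}")
    if name == BASE_VARIANT:
        raise ValueError("'base' is reserved for the unadapted draft")
    return name, path


def select_variant(extra_body, loaded):
    """Hypothetical: choose the draft variant for one request.

    Falls back to 'base' when the request names no variant; raising on an
    unknown name is an assumption, not documented behaviour.
    """
    requested = extra_body.get("draft_lora", BASE_VARIANT)
    if requested not in loaded:
        raise KeyError(f"draft variant not loaded: {requested!r}")
    return requested


name, path = parse_draft_lora("personal=/models/personal.gguf")
loaded = {BASE_VARIANT, name}
print(select_variant({"draft_lora": "personal"}, loaded))
print(select_variant({}, loaded))
```

Note that splitting on the first `=` would misread a bare path that itself contains `=`; a real parser would need a rule for that case.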