Use case
Phase 2 of a small mmap zero-copy refactor in a downstream toy
project lets the CPU backend keep weight tensors as direct views
into an mmap'd GGUF (ggml_backend_cpu_buffer_from_ptr + tensors
allocated via ggml_backend_tensor_alloc(buf, t, mmap_base + offset)). Wins on Qwen2.5-7B-Q8 (gx10 CPU):
| Path |
Load wall |
Peak RSS |
| Dequant + persistent buffer |
33s |
~30 GB |
| mmap, copy into persistent buffer |
15s |
~18 GB |
| BYO-pointer (CPU, mmap IS the buffer) |
30 ms |
7.4 GB |
On GB10 / DGX Spark / Jetson (unified memory), the same trick
should work for the CUDA backend — host pointers are
device-addressable via UVA after cudaHostRegister(... HostRegisterMapped). But ggml-cuda has no equivalent of
buffer_from_ptr. The closest is
ggml_backend_cuda_register_host_buffer, which only pins host
pages for cudaMemcpy DMA; the device-side buffer is still a separate
allocation.
This issue is a feature request for a CUDA-side
ggml_backend_cuda_buffer_from_ptr.
What the patch does
(Attached as a working downstream patch — see
oripekelman/toy_ruby_neural_network commit
e302d32,
and the vendored ggml commit 5f3bee4 in the same project.)
GGML_BACKEND_API ggml_backend_buffer_t
ggml_backend_cuda_buffer_from_ptr(void *host_ptr, size_t size, int device);
Implementation outline:
cudaHostRegister(host_ptr, size, Portable | Mapped [| ReadOnly])
to register the mmap'd region (fall back to no-ReadOnly for older
runtimes).
cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0) — returns
host_ptr on UVA SKUs.
- Wrap in a backend buffer with a small read-only interface
(set_tensor aborts, get_tensor is a host memcpy, free_buffer
cudaHostUnregisters but doesn't free the underlying memory).
- Backed by a dedicated
buffer_type_t whose get_alloc_size is
NULL (defaults to ggml_nbytes) — no MATRIX_ROW_PADDING
expansion past the tensor end, since the mmap'd file doesn't have
padding after each tensor's bytes.
Smoke test result (GB10)
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124546 MiB):
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
[1] mmap'd qwen25-1.5b-native-q8.gguf: 2326177120 bytes
[2] CUDA backend initialised
[3] cuda_buffer_from_ptr OK; buffer = 0xc38d2441e050
[4] buffer base = 0xecc52a600000 (host = 0xecc52a600000; UVA (host==dev))
[5] buffer size = 2326177120 bytes (file = 2326177120)
[6] first 4 bytes via dev_ptr: 'GGUF' (expect 'GGUF')
[7] buffer freed
[8] OK
Known limitations
-
MATRIX_ROW_PADDING for quantized matmul: ggml-cuda's
quantized matmul kernels read the last row up to a
512-element-aligned boundary. The default
get_alloc_size rounds quantized tensors up. The mmap'd file
doesn't have that padding past each tensor. For tensors whose
ne0 % 512 != 0 (e.g. Qwen2.5-0.5B, d_model=896), the kernel
would read into adjacent tensor bytes (not crash; the masked-out
positions don't affect output, but it's an undocumented invariant
to rely on).
For d_model ∈ {1536, 2048, 3584} this is irrelevant; matmul
reads stay within tensor bounds. Worth either (a) documenting
the requirement, or (b) probing per-tensor and falling back to
the regular allocated buffer when padding would be needed.
-
Discrete GPUs: On a non-UVA card,
cudaHostGetDevicePointer returns a different address than
host_ptr and kernel reads pull bytes over PCIe per access —
functionally correct, performance terrible. The patch makes no
attempt to detect this; could add a cudaDeviceProp:: pageableMemoryAccess probe + warn.
-
ReadOnly flag: cudaHostRegisterReadOnly requires CUDA
runtime ≥ 11.4. The patch falls back gracefully.
Why "from_ptr" and not "register" + alloc-tensor-at-pointer?
Both APIs would work. from_ptr matches the CPU side's existing
shape and feels like the natural counterpart. The alternative — pin
externally, then ggml_backend_tensor_alloc(some_cuda_buffer, tensor, ptr) — would require relaxing the
ggml_backend_cuda_buffer_context::dev_ptr ownership model in the
allocated buffer's destructor. from_ptr keeps the surgery local
to one new buffer type.
Asks
- Sanity check on the approach.
- Permission to upstream the patch (happy to clean up + submit as a
PR with a test).
- Guidance on the MATRIX_ROW_PADDING question — should
buffer_from_ptr probe per-tensor and reject quantized weights
with insufficient row padding, or document the caller's
responsibility?
Acknowledgement
Downstream patch was developed against ggml @ commit 5725fee on
gx10 (GB10 unified memory). Happy to share the smoke test + the
Phase 2 design doc if useful.
Use case
Phase 2 of a small mmap zero-copy refactor in a downstream toy
project lets the CPU backend keep weight tensors as direct views
into an mmap'd GGUF (
ggml_backend_cpu_buffer_from_ptr+ tensorsallocated via
ggml_backend_tensor_alloc(buf, t, mmap_base + offset)). Wins on Qwen2.5-7B-Q8 (gx10 CPU):On GB10 / DGX Spark / Jetson (unified memory), the same trick
should work for the CUDA backend — host pointers are
device-addressable via UVA after
cudaHostRegister(... HostRegisterMapped). But ggml-cuda has no equivalent ofbuffer_from_ptr. The closest isggml_backend_cuda_register_host_buffer, which only pins hostpages for cudaMemcpy DMA; the device-side buffer is still a separate
allocation.
This issue is a feature request for a CUDA-side
ggml_backend_cuda_buffer_from_ptr.What the patch does
(Attached as a working downstream patch — see
oripekelman/toy_ruby_neural_network commit
e302d32,and the vendored ggml commit
5f3bee4in the same project.)Implementation outline:
cudaHostRegister(host_ptr, size, Portable | Mapped [| ReadOnly])to register the mmap'd region (fall back to no-ReadOnly for older
runtimes).
cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0)— returnshost_ptron UVA SKUs.(
set_tensoraborts,get_tensoris a host memcpy,free_buffercudaHostUnregisters but doesn't free the underlying memory).
buffer_type_twhoseget_alloc_sizeisNULL(defaults toggml_nbytes) — noMATRIX_ROW_PADDINGexpansion past the tensor end, since the mmap'd file doesn't have
padding after each tensor's bytes.
Smoke test result (GB10)
Known limitations
MATRIX_ROW_PADDING for quantized matmul: ggml-cuda's
quantized matmul kernels read the last row up to a
512-element-aligned boundary. The default
get_alloc_sizerounds quantized tensors up. The mmap'd filedoesn't have that padding past each tensor. For tensors whose
ne0 % 512 != 0(e.g. Qwen2.5-0.5B, d_model=896), the kernelwould read into adjacent tensor bytes (not crash; the masked-out
positions don't affect output, but it's an undocumented invariant
to rely on).
For d_model ∈ {1536, 2048, 3584} this is irrelevant; matmul
reads stay within tensor bounds. Worth either (a) documenting
the requirement, or (b) probing per-tensor and falling back to
the regular allocated buffer when padding would be needed.
Discrete GPUs: On a non-UVA card,
cudaHostGetDevicePointerreturns a different address thanhost_ptrand kernel reads pull bytes over PCIe per access —functionally correct, performance terrible. The patch makes no
attempt to detect this; could add a
cudaDeviceProp:: pageableMemoryAccessprobe + warn.ReadOnly flag:
cudaHostRegisterReadOnlyrequires CUDAruntime ≥ 11.4. The patch falls back gracefully.
Why "from_ptr" and not "register" + alloc-tensor-at-pointer?
Both APIs would work.
from_ptrmatches the CPU side's existingshape and feels like the natural counterpart. The alternative — pin
externally, then
ggml_backend_tensor_alloc(some_cuda_buffer, tensor, ptr)— would require relaxing theggml_backend_cuda_buffer_context::dev_ptrownership model in theallocated buffer's destructor.
from_ptrkeeps the surgery localto one new buffer type.
Asks
PR with a test).
buffer_from_ptr probe per-tensor and reject quantized weights
with insufficient row padding, or document the caller's
responsibility?
Acknowledgement
Downstream patch was developed against ggml @ commit
5725feeongx10 (GB10 unified memory). Happy to share the smoke test + the
Phase 2 design doc if useful.