
Quantum memory management within CUDA-Q kernels does not work #4407

@renezander90

Required prerequisites

  • Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

When sequentially executing a sub-kernel that allocates a local cudaq.qvector, the quantum memory management fails to actually free and reuse the allocated qubits after the sub-kernel returns. This causes total qubit usage to scale linearly with the number of calls, eventually causing a crash due to qubit exhaustion/OOM.

Crucially, inspecting the MLIR confirms that the frontend correctly inserts the quake.dealloc instruction. The issue appears to be that the execution runtime or simulator backend is ignoring this instruction or failing to return the qubits to the available pool.

Note that many quantum algorithms also need memory management within a single kernel, e.g., allocating ancillary qubits for an mcx gate and deallocating them afterwards (see https://www.iccs-meeting.org/archive/iccs2022/papers/133530169.pdf). That, however, may be better treated as a feature request.

Steps to reproduce the bug

import cudaq

@cudaq.kernel
def test_kernel() -> int:
    q = cudaq.qvector(10)
    h(q[0])
    return mz(q[0])

@cudaq.kernel
def main_kernel() -> int:
    # Each call allocates a local 10-qubit register inside test_kernel.
    a = test_kernel()
    b = test_kernel()
    c = test_kernel()
    return a + b + c

# print(test_kernel)
cudaq.run(main_kernel, shots_count=10)

This crashes due to excessive memory usage. (It works with 1 or 2 calls to test_kernel.)
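
To illustrate why the third call is the tipping point, here is a rough back-of-the-envelope sketch. It assumes a dense state-vector simulator that stores one complex128 amplitude (16 bytes) per basis state, and that no qubits are reused between calls; both are assumptions for illustration, not confirmed details of the backend used here. The helper names are hypothetical.

```python
# Assumption: dense state-vector simulation, one complex128 amplitude
# (16 bytes) per basis state. The actual backend may differ.
BYTES_PER_AMPLITUDE = 16


def statevector_bytes(num_qubits: int) -> int:
    """Memory needed for a dense state vector over num_qubits qubits."""
    return (2 ** num_qubits) * BYTES_PER_AMPLITUDE


def qubits_without_reuse(calls: int, qubits_per_call: int = 10) -> int:
    """Total qubits if every call keeps its own register alive."""
    return calls * qubits_per_call


for calls in (1, 2, 3):
    n = qubits_without_reuse(calls)
    mib = statevector_bytes(n) / 2**20
    print(f"{calls} call(s): {n} qubits, {mib:,.3f} MiB")
```

Under these assumptions, 10 qubits need 16 KiB and 20 qubits need 16 MiB, while 30 qubits need 16 GiB, which would be consistent with 1 or 2 calls working and 3 calls crashing.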

Expected behavior

The quantum resources go out of scope at the end of each sub-kernel. The program contains a quake.dealloc statement:

    %0 = quake.alloca !quake.veq<10>
    %1 = quake.extract_ref %0[0] : (!quake.veq<10>) -> !quake.ref
    quake.h %1 : (!quake.ref) -> ()
    %2 = quake.extract_ref %0[0] : (!quake.veq<10>) -> !quake.ref
    %measOut = quake.mz %2 : (!quake.ref) -> !quake.measure
    %3 = quake.discriminate %measOut : (!quake.measure) -> i1
    %4 = cc.cast unsigned %3 : (i1) -> i64
    quake.dealloc %0 : !quake.veq<10>
    return %4 : i64

The expected behavior is that the maximum number of qubits in use never exceeds 10. Instead, simulation time and memory usage grow drastically with the number of calls to test_kernel inside main_kernel.
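
The difference between the expected and the observed behavior can be modeled with a toy allocator. This is only a sketch of the idea of returning deallocated qubits to a pool; it is not how the CUDA-Q runtime or any simulator backend is implemented, and QubitPool and peak_qubits are hypothetical names.

```python
class QubitPool:
    """Toy allocator that hands out qubit indices (illustration only)."""

    def __init__(self, reuse: bool = True):
        self.reuse = reuse    # whether freed indices may be handed out again
        self.free = []        # indices returned via dealloc
        self.next_index = 0   # next never-used index
        self.peak = 0         # number of distinct qubits ever created

    def alloc(self, n: int) -> list:
        ids = []
        for _ in range(n):
            if self.reuse and self.free:
                ids.append(self.free.pop())   # recycle a freed qubit
            else:
                ids.append(self.next_index)   # grow the register
                self.next_index += 1
        self.peak = max(self.peak, self.next_index)
        return ids

    def dealloc(self, ids: list) -> None:
        self.free.extend(ids)


def peak_qubits(calls: int, width: int = 10, reuse: bool = True) -> int:
    """Simulate sequential sub-kernel calls that alloc and dealloc a register."""
    pool = QubitPool(reuse=reuse)
    for _ in range(calls):
        q = pool.alloc(width)   # models quake.alloca !quake.veq<10>
        pool.dealloc(q)         # models quake.dealloc at the end of the call
    return pool.peak


print("with reuse:   ", peak_qubits(3, reuse=True))
print("without reuse:", peak_qubits(3, reuse=False))
```

With reuse, the peak stays at the width of a single register regardless of the number of calls; without it, the peak grows linearly with the number of calls, which matches the scaling described in this report.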

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA-Q version: 0.14.0
  • Python version: 3.11.14
  • Operating system: macOS 15.7.4

Suggestions

No response
