Required prerequisites
Describe the bug
When sequentially executing a sub-kernel that allocates a local cudaq.qvector, the quantum memory management fails to actually free and reuse the allocated qubits after the sub-kernel returns. This causes total qubit usage to scale linearly with the number of calls, eventually causing a crash due to qubit exhaustion/OOM.
Crucially, inspecting the MLIR confirms that the frontend correctly inserts the quake.dealloc instruction. The issue appears to be that the execution runtime or simulator backend is ignoring this instruction or failing to return the qubits to the available pool.
Note that many quantum algorithms also need memory management within a single kernel, e.g., allocating ancillary qubits for an mcx gate and deallocating them afterwards (https://www.iccs-meeting.org/archive/iccs2022/papers/133530169.pdf). That, however, may be more of a feature request than part of this bug.
Steps to reproduce the bug
import cudaq

@cudaq.kernel
def test_kernel() -> int:
    q = cudaq.qvector(10)
    h(q[0])
    return mz(q[0])

@cudaq.kernel
def main_kernel() -> int:
    a = test_kernel()
    b = test_kernel()
    c = test_kernel()
    return a + b + c

# print(test_kernel)
cudaq.run(main_kernel, shots_count=10)
This will cause a crash due to excessive memory usage. (It will work for 1 or 2 calls of test_kernel).
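A back-of-the-envelope estimate is consistent with the "works for 1 or 2 calls, crashes at 3" observation. The sketch below assumes a dense state-vector simulator that stores one complex amplitude per basis state and never shrinks the state. The helper `statevector_bytes` and the 16-byte (complex128) amplitude size are assumptions made for illustration, not values taken from CUDA-Q.

```python
def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense state vector over `num_qubits` qubits.

    Assumption: one complex amplitude per basis state. 16 bytes matches
    complex128; a single-precision backend would use 8 bytes instead.
    """
    return (2 ** num_qubits) * bytes_per_amplitude


QUBITS_PER_CALL = 10  # size of the qvector allocated in test_kernel

for calls in (1, 2, 3):
    # Without reuse, every call adds another 10 qubits to the simulated state.
    live_qubits = calls * QUBITS_PER_CALL
    mib = statevector_bytes(live_qubits) / 2**20
    print(f"{calls} call(s): {live_qubits} qubits, {mib:,.3f} MiB")
```

Under these assumptions, 20 qubits need 16 MiB while 30 qubits need 16 GiB, so each additional call multiplies the memory footprint by 2^10.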
Expected behavior
The quantum resources go out of scope at the end of each sub-kernel. The program contains a quake.dealloc statement:
%0 = quake.alloca !quake.veq<10>
%1 = quake.extract_ref %0[0] : (!quake.veq<10>) -> !quake.ref
quake.h %1 : (!quake.ref) -> ()
%2 = quake.extract_ref %0[0] : (!quake.veq<10>) -> !quake.ref
%measOut = quake.mz %2 : (!quake.ref) -> !quake.measure
%3 = quake.discriminate %measOut : (!quake.measure) -> i1
%4 = cc.cast unsigned %3 : (i1) -> i64
quake.dealloc %0 : !quake.veq<10>
return %4 : i64
The expected behavior is that the maximum number of qubits in use never exceeds 10. Instead, simulation time and memory usage grow drastically with the number of calls to test_kernel inside main_kernel.
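The difference between the expected and the observed behavior can be illustrated with a toy allocator. This is only a sketch of the idea, not CUDA-Q's runtime: `QubitPool`, `reuse_freed`, and `run_calls` are hypothetical names, and the model simply counts how many distinct qubit indices a simulator would have to hold.

```python
class QubitPool:
    """Toy qubit allocator (hypothetical, not part of CUDA-Q)."""

    def __init__(self, reuse_freed: bool) -> None:
        self.reuse_freed = reuse_freed
        self.free: list[int] = []  # indices handed back by dealloc
        self.width = 0             # distinct qubit indices ever handed out

    def alloc(self, n: int) -> list[int]:
        qubits = []
        for _ in range(n):
            if self.reuse_freed and self.free:
                # Expected behavior: recycle a previously freed qubit.
                qubits.append(self.free.pop())
            else:
                # Otherwise the register grows by one fresh qubit.
                qubits.append(self.width)
                self.width += 1
        return qubits

    def dealloc(self, qubits: list[int]) -> None:
        self.free.extend(qubits)


def run_calls(pool: QubitPool, calls: int, qubits_per_call: int = 10) -> int:
    """Mimic `calls` sequential sub-kernel invocations; return the register width."""
    for _ in range(calls):
        q = pool.alloc(qubits_per_call)
        pool.dealloc(q)  # corresponds to quake.dealloc at the end of the sub-kernel
    return pool.width


print("with reuse:   ", run_calls(QubitPool(reuse_freed=True), 3))
print("without reuse:", run_calls(QubitPool(reuse_freed=False), 3))
```

With reuse enabled the register width stays at 10 regardless of the number of calls; with freed qubits ignored it grows by 10 per call, which matches the linear scaling described in this report.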
Is this a regression? If it is, put the last known working version (or commit) here.
Not a regression
Environment
- CUDA-Q version: 0.14.0
- Python version: 3.11.14
- Operating system: macOS 15.7.4
Suggestions
No response