Quantum memory management within CUDA-Q kernels does not work

### Required prerequisites

- [x] Consult the [security policy](https://github.com/NVIDIA/cuda-quantum/security/policy). If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
- [x] Make sure you've read the [documentation](https://nvidia.github.io/cuda-quantum/latest). Your issue may be addressed there.
- [x] Search the [issue tracker](https://github.com/NVIDIA/cuda-quantum/issues) to verify that this hasn't already been reported. +1 or comment there if it has.
- [x] If possible, make a PR with a failing test to give us a starting point to work on!

### Describe the bug

When **sequentially executing a sub-kernel** that allocates a local `cudaq.qvector`, the quantum memory management does not free and reuse the allocated qubits after the sub-kernel returns. Total qubit usage therefore scales linearly with the number of calls, and the program eventually crashes due to qubit exhaustion or running out of memory (OOM).

Crucially, inspecting the MLIR confirms that the frontend correctly inserts the `quake.dealloc` instruction. The execution runtime or simulator backend appears either to ignore this instruction or to fail to return the qubits to the available pool.

Please note that many quantum algorithms also require memory management **within a single kernel**, e.g., allocating ancillary qubits for an mcx gate and deallocating them afterwards (see https://www.iccs-meeting.org/archive/iccs2022/papers/133530169.pdf). That case, however, may be better treated as a feature request.

### Steps to reproduce the bug

```python
import cudaq

@cudaq.kernel
def test_kernel() -> int:
    q = cudaq.qvector(10)
    h(q[0])
    return mz(q[0])

@cudaq.kernel
def main_kernel() -> int:

    a = test_kernel()
    b = test_kernel()
    c = test_kernel()
    return a + b + c

# Uncomment to print the generated Quake MLIR for the sub-kernel.
# print(test_kernel)
cudaq.run(main_kernel, shots_count=10)
```

This crashes due to excessive memory usage. (It works with 1 or 2 calls of ``test_kernel``.)

### Expected behavior

The quantum resources go out of scope at the end of each sub-kernel, and the generated MLIR for ``test_kernel`` contains a ``quake.dealloc`` statement:

```
    %0 = quake.alloca !quake.veq<10>
    %1 = quake.extract_ref %0[0] : (!quake.veq<10>) -> !quake.ref
    quake.h %1 : (!quake.ref) -> ()
    %2 = quake.extract_ref %0[0] : (!quake.veq<10>) -> !quake.ref
    %measOut = quake.mz %2 : (!quake.ref) -> !quake.measure
    %3 = quake.discriminate %measOut : (!quake.measure) -> i1
    %4 = cc.cast unsigned %3 : (i1) -> i64
    quake.dealloc %0 : !quake.veq<10>
    return %4 : i64
```

The expected behavior is that the maximum number of qubits in use never exceeds 10. Instead, simulation time and memory usage grow drastically with the number of calls of ``test_kernel`` inside ``main_kernel``.
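As a rough illustration of the scaling, here is a back-of-the-envelope sketch. It assumes a dense state-vector simulator that stores one complex128 amplitude (16 bytes) per basis state; the actual backend and precision are not stated in this report, and the helper names `peak_qubits` and `statevector_bytes` are hypothetical, not part of CUDA-Q.

```python
def peak_qubits(calls: int, qubits_per_call: int, reuse: bool) -> int:
    """Peak number of live qubits over `calls` sequential sub-kernel calls."""
    # With working deallocation, each call reuses the same qubits.
    # Without it, every call adds its qubits to the live set.
    return qubits_per_call if reuse else calls * qubits_per_call


def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense state vector of 2**n amplitudes.

    ASSUMPTION: 16 bytes per amplitude (complex128); a single-precision
    backend would use 8 bytes instead.
    """
    return (2 ** num_qubits) * bytes_per_amplitude


for calls in (1, 2, 3):
    leaked = peak_qubits(calls, 10, reuse=False)
    reused = peak_qubits(calls, 10, reuse=True)
    print(
        f"{calls} call(s): without reuse {leaked} qubits "
        f"({statevector_bytes(leaked) / 2**20:.3f} MiB), "
        f"with reuse {reused} qubits "
        f"({statevector_bytes(reused) / 2**10:.0f} KiB)"
    )
```

Under these assumptions, 10 qubits need 16 KiB, 20 qubits need 16 MiB, and 30 qubits need 16 GiB, which would be consistent with the reproducer working for 1 or 2 calls and running out of memory on the third.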

### Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

### Environment

- **CUDA-Q version**: 0.14.0
- **Python version**: 3.11.14
- **Operating system**: macOS 15.7.4


### Suggestions

_No response_
