Performance Regression for Kernel Launching 13.0->14.0 #4507

### Required prerequisites

- [x] Consult the [security policy](https://github.com/NVIDIA/cuda-quantum/security/policy). If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
- [x] Make sure you've read the [documentation](https://nvidia.github.io/cuda-quantum/latest). Your issue may be addressed there.
- [x] Search the [issue tracker](https://github.com/NVIDIA/cuda-quantum/issues) to verify that this hasn't already been reported. +1 or comment there if it has.
- [x] If possible, make a PR with a failing test to give us a starting point to work on!

### Describe the bug

There seems to be a slowdown in kernel launches between CUDA-Q 13.0 and 14.0 (also seen in 14.2).

The code below produced the following timings:

<img width="644" height="290" alt="Image" src="https://github.com/user-attachments/assets/4515a067-e112-42a7-b074-ef3cdfddb0a7" />



### Steps to reproduce the bug

```python
import time
import cudaq
from cudaq import spin

cudaq.set_target("nvidia", option="fp64")

@cudaq.kernel
def trivial(n: int):
    q = cudaq.qvector(n)
    for i in range(n):
        h(q[i])

H = spin.z(0)  # the simplest possible observable

# ---- Warm-up: trigger JIT for both code paths ----
for _ in range(5):
    cudaq.get_state(trivial, 4)
    cudaq.observe(trivial, H, 4).expectation()

N = 1000

# ---- Measure get_state ----
t0 = time.perf_counter()  # monotonic, high-resolution timer
for _ in range(N):
    cudaq.get_state(trivial, 4)
t_state_total = time.perf_counter() - t0
t_state_per = t_state_total / N

# ---- Measure observe ----
t0 = time.perf_counter()
for _ in range(N):
    cudaq.observe(trivial, H, 4).expectation()
t_obs_total = time.perf_counter() - t0
t_obs_per = t_obs_total / N

print(f"cudaq:    {cudaq.__version__.split()[2]}")
print(f"target:   nvidia fp64")
print(f"kernel:   trivial 4-qubit Hadamard")
print(f"calls:    {N} per op")
print()
print(f"  get_state  total: {t_state_total:7.2f} s   per call: {t_state_per*1000:7.3f} ms")
print(f"  observe    total: {t_obs_total:7.2f} s   per call: {t_obs_per*1000:7.3f} ms")

```
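
Since the per-call cost is what matters here, it may help to repeat the measurement several times and report the minimum and median per-call latency, which are less sensitive to one-off stalls than a single total. The sketch below is a minimal, hypothetical helper (`time_calls` is not part of CUDA-Q). It times any zero-argument callable, so the `cudaq.get_state` and `cudaq.observe` calls above could be wrapped in a `lambda`. The stand-in workload is only there so the snippet runs without CUDA-Q installed.

```python
import statistics
import time


def time_calls(fn, calls=1000, repeats=5, warmup=5):
    """Return per-call latency stats (in ms) for a zero-argument callable.

    Hypothetical helper for this report, not a CUDA-Q API.
    """
    for _ in range(warmup):  # warm-up, e.g. to trigger JIT compilation
        fn()

    per_call_ms = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        for _ in range(calls):
            fn()
        elapsed = time.perf_counter() - t0
        per_call_ms.append(elapsed / calls * 1000.0)

    return {
        "min_ms": min(per_call_ms),
        "median_ms": statistics.median(per_call_ms),
        "max_ms": max(per_call_ms),
    }


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere; with CUDA-Q installed,
    # replace it with e.g. `lambda: cudaq.get_state(trivial, 4)`.
    stats = time_calls(lambda: sum(range(100)), calls=1000, repeats=5)
    print(f"per call: min {stats['min_ms']:.4f} ms, "
          f"median {stats['median_ms']:.4f} ms, max {stats['max_ms']:.4f} ms")
```

Running the same helper under each installed version gives directly comparable per-call numbers for the two code paths.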

### Expected behavior

Per-call times for `get_state` and `observe` in 14.0 and 14.2 should be comparable to those in 13.0.

### Is this a regression? If it is, put the last known working version (or commit) here.

13.0

### Environment

- **CUDA-Q version**:  13.0 compared to 14.2 (and 14.0)
- **Python version**:  3.11
- **Operating system**:  WSL
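
To make the environment easier to compare across the two installs, the fields above could be collected programmatically. This is a small sketch using only the standard library plus an optional `cudaq` import; `collect_environment` is a hypothetical name, and the exact operating system string depends on the platform (on WSL the kernel release reported by `platform` typically mentions `microsoft`, though that is an assumption about the host).

```python
import platform


def collect_environment():
    """Gather the fields requested in the Environment section.

    Hypothetical helper for this report, not a CUDA-Q API.
    """
    try:
        import cudaq  # only available where CUDA-Q is installed
        cudaq_version = cudaq.__version__
    except ImportError:
        cudaq_version = "not installed"

    return {
        "CUDA-Q version": cudaq_version,
        "Python version": platform.python_version(),
        "Operating system": platform.platform(),
    }


if __name__ == "__main__":
    # Print in the same bullet format as the issue template.
    for key, value in collect_environment().items():
        print(f"- **{key}**: {value}")
```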


### Suggestions

_No response_
