[CUDAQ-790] cc.device_call lowering to realtime dispatch for local simulators#4565

Open
1tnguyen wants to merge 16 commits into
NVIDIA:mainfrom
1tnguyen:tnguyen/device-call-realtime

@1tnguyen 1tnguyen commented May 21, 2026

Support lowering of cc.device_call to realtime-based ring-buffer dispatch.

cc.device_call is lowered to three ABI calls that operate on a transport-neutral frame lease, that is, an RX/TX slot pair that the runtime owns for the duration of one RPC:

  __cudaq_device_call_acquire_realtime_frame  // lease an in-flight frame
  __cudaq_device_call_dispatch_realtime_frame // publish + wait
  __cudaq_device_call_safely_release_realtime_frame

The compiler writes function arguments directly into the leased RX payload and reads results directly from the leased TX payload.
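
As a rough illustration of the acquire / dispatch / release sequence, here is a minimal host-only sketch. Everything in it is hypothetical: `MockFrameLease`, `mockAcquire`, `mockDispatch`, `mockRelease`, the 64-byte slot size, and the synchronous in-process "dispatch" are stand-ins chosen for the example and are not the signatures of the `__cudaq_device_call_*` entry points, which this description does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Hypothetical stand-in for a leased RX/TX slot pair; not the runtime's type.
struct MockFrameLease {
  alignas(8) std::uint8_t rx[64]; // request payload, written by the caller
  alignas(8) std::uint8_t tx[64]; // response payload, written by the callee
  bool leased = false;
};

// Hypothetical callee shape: reads the request from rx, writes the reply to tx.
using MockHandler = std::function<void(const std::uint8_t *, std::uint8_t *)>;

// Step 1: lease a frame for the duration of one RPC.
inline MockFrameLease *mockAcquire(MockFrameLease &slot) {
  assert(!slot.leased && "slot is already in flight");
  slot.leased = true;
  return &slot;
}

// Step 2: publish the request and wait for the response. In this sketch the
// callee simply runs synchronously in the same process.
inline void mockDispatch(MockFrameLease *lease, const MockHandler &callee) {
  callee(lease->rx, lease->tx);
}

// Step 3: hand the slot pair back.
inline void mockRelease(MockFrameLease *lease) { lease->leased = false; }

// What lowered code for `int add(int, int)` could look like in this toy model.
inline std::int32_t callAdd(MockFrameLease &slot, std::int32_t a,
                            std::int32_t b) {
  MockFrameLease *lease = mockAcquire(slot);
  std::memcpy(lease->rx, &a, sizeof a); // arguments written in place
  std::memcpy(lease->rx + sizeof a, &b, sizeof b);
  mockDispatch(lease, [](const std::uint8_t *rx, std::uint8_t *tx) {
    std::int32_t x, y;
    std::memcpy(&x, rx, sizeof x);
    std::memcpy(&y, rx + sizeof x, sizeof y);
    std::int32_t sum = x + y;
    std::memcpy(tx, &sum, sizeof sum);
  });
  std::int32_t result;
  std::memcpy(&result, lease->tx, sizeof result); // result read in place
  mockRelease(lease);
  return result;
}
```

The point of the sketch is only the ordering: the caller touches the RX and TX payloads solely between acquire and release, so no intermediate argument buffer is needed.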

  • Scalars are stored at aligned offsets in the request slot.

  • std::vector<T> args use a length prefix followed by element bytes packed in place.

  • Output std::vector<T>& arguments (for CUDA-QX interop) are signalled via an attribute on cc.device_call and read back through the same zero-copy response slot.
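
The layout rules in the list above can be sketched as follows. This is an illustration under stated assumptions rather than the PR's encoder: the 64-bit length prefix, natural (`alignof`) alignment, and the helper names `alignUp`, `packScalar`, `packVector`, and `unpackVector` are all hypothetical, and the sketch omits the bounds checks a real slot writer would need.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Round `offset` up to the next multiple of `align` (a power of two).
inline std::size_t alignUp(std::size_t offset, std::size_t align) {
  return (offset + align - 1) & ~(align - 1);
}

// Store a scalar at the next offset aligned for T; returns the offset past it.
template <typename T>
std::size_t packScalar(std::uint8_t *slot, std::size_t offset, T value) {
  offset = alignUp(offset, alignof(T));
  std::memcpy(slot + offset, &value, sizeof(T));
  return offset + sizeof(T);
}

// Store a vector as [u64 length][elements...]. The prefix width is an
// assumption made for this sketch.
template <typename T>
std::size_t packVector(std::uint8_t *slot, std::size_t offset,
                       const std::vector<T> &v) {
  offset = packScalar<std::uint64_t>(slot, offset, v.size());
  offset = alignUp(offset, alignof(T));
  if (!v.empty())
    std::memcpy(slot + offset, v.data(), v.size() * sizeof(T));
  return offset + v.size() * sizeof(T);
}

// Read back a vector written by packVector starting at `offset`.
template <typename T>
std::vector<T> unpackVector(const std::uint8_t *slot, std::size_t offset) {
  offset = alignUp(offset, alignof(std::uint64_t));
  std::uint64_t n = 0;
  std::memcpy(&n, slot + offset, sizeof n);
  offset = alignUp(offset + sizeof n, alignof(T));
  std::vector<T> out(n);
  if (n != 0)
    std::memcpy(out.data(), slot + offset, n * sizeof(T));
  return out;
}
```

For example, packing a `uint8_t`, then an `int32_t`, then a two-element `std::vector<double>` into an 8-byte-aligned slot places the scalars at offsets 0 and 4, the length prefix at 8, and the element bytes at 16 through 31.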

The runtime side (runtime/internal/device_call/) wraps cudaq_ringbuffer_t from CUDA-Q Realtime. A singleton driver routes per-device sessions to a DeviceCallChannel. Two shared-memory channels are built in: device_dispatch (persistent GPU dispatch kernel) and host_dispatch (graph launch with a pinned mailbox).
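
To make the driver-to-channel routing concrete, here is a small sketch of a singleton that maps a device id to a channel object. The class names `MockChannel`, `MockDeviceDispatch`, `MockHostDispatch`, and `MockDriver`, and the `openSession` / `channelFor` methods, are hypothetical; the actual DeviceCallChannel interface and driver API are not shown in this description, and the sketch does no ring-buffer or GPU work.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical channel interface, standing in for DeviceCallChannel.
struct MockChannel {
  virtual ~MockChannel() = default;
  virtual std::string name() const = 0;
};

// Placeholder for a channel backed by a persistent GPU dispatch kernel.
struct MockDeviceDispatch : MockChannel {
  std::string name() const override { return "device_dispatch"; }
};

// Placeholder for a channel that launches work from the host.
struct MockHostDispatch : MockChannel {
  std::string name() const override { return "host_dispatch"; }
};

// Hypothetical singleton driver: one session (channel) per device id.
class MockDriver {
public:
  static MockDriver &instance() {
    static MockDriver driver; // initialized once, on first use
    return driver;
  }

  // Create or replace the session for `deviceId` with the requested channel.
  void openSession(int deviceId, const std::string &kind) {
    if (kind == "device_dispatch")
      sessions_[deviceId] = std::make_unique<MockDeviceDispatch>();
    else if (kind == "host_dispatch")
      sessions_[deviceId] = std::make_unique<MockHostDispatch>();
    else
      throw std::invalid_argument("unknown channel kind: " + kind);
  }

  // Look up the channel that serves `deviceId`.
  MockChannel &channelFor(int deviceId) {
    auto it = sessions_.find(deviceId);
    if (it == sessions_.end())
      throw std::out_of_range("no session for this device");
    return *it->second;
  }

private:
  MockDriver() = default;
  std::map<int, std::unique_ptr<MockChannel>> sessions_;
};
```

Keeping the channel behind an interface is what would let a later transport (for instance the TCP/IP one mentioned for the follow-up branch) plug in without changing the lowered code.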

Notes:

  • This PR handles local shared-memory transport only. Process separation and RDMA transport are not part of this branch and are deferred to a follow-up.

  • CUDA-Q builds without realtime by default; the path is opt-in via CUDAQ_REALTIME_DIR at configure time (points to a realtime installation) and

  • CI here is build-only: realtime_integration_ci.yml compiles a realtime-enabled CUDA-Q against the realtime installation.

Tested by:

  • DeviceCallDispatchTester.cu: tests runtime ring-buffer and channel behavior.
  • End-to-end NVQPP/device_call_realtime_{scalar,array}.cpp tests: exercise the shared-memory and host-dispatch channels with different argument and return types, with simulators in the loop.

1tnguyen and others added 2 commits May 21, 2026 03:43
Lower CUDA-Q device_call operations to realtime RPC buffers and dispatch through service-backed shared-memory transports.

Add runtime channels, GPU and host dispatch support, compiler lowering, nvq++ integration, and focused tests for the core realtime device_call path.

Keep helper APIs internal and use generic flat-array arguments. TCP/IP transport support is split to follow-up branch tnguyen/device-call-realtime-tcp.

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
github-actions Bot commented May 21, 2026

CI Summary (push) — ✅ passed

Run #26483004932 · ✅ 6 · ⏩ 7 · ❌ 0 · ⛔ 0

Top-level jobs (13)
Job Result
binaries ⏩ skipped
build_and_test ✅ success
config_devdeps ✅ success
config_source_build ⏩ skipped
config_wheeldeps ✅ success
devdeps ✅ success
docker_image ⏩ skipped
gen_code_coverage ⏩ skipped
metadata ✅ success
python_metapackages ⏩ skipped
python_wheels ⏩ skipped
source_build ⏩ skipped
wheeldeps ✅ success
⏩ Skipped jobs (7) — intentionally skipped on PR builds; run on merge_group / workflow_dispatch
Job
binaries
config_source_build
docker_image
gen_code_coverage
python_metapackages
python_wheels
source_build
All sub-jobs (42) — every matrix leg, with links
Job Status Link
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python) ✅ success view
Build and test (amd64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, llvm, openmpi) / Dev environment (Python) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Python) ✅ success view
CI Summary ❔ in_progress view
Configure build (devdeps) ✅ success view
Configure build (source_build) ⏩ skipped view
Configure build (wheeldeps) ✅ success view
Create CUDA Quantum installer ⏩ skipped view
Create Docker images ⏩ skipped view
Create Python metapackages ⏩ skipped view
Create Python wheels ⏩ skipped view
Gen code coverage ⏩ skipped view
Load dependencies (amd64, gcc12) / Caching ✅ success view
Load dependencies (amd64, gcc12) / Finalize ✅ success view
Load dependencies (amd64, gcc12) / Metadata ✅ success view
Load dependencies (amd64, llvm) / Caching ✅ success view
Load dependencies (amd64, llvm) / Finalize ✅ success view
Load dependencies (amd64, llvm) / Metadata ✅ success view
Load dependencies (arm64, gcc12) / Caching ✅ success view
Load dependencies (arm64, gcc12) / Finalize ✅ success view
Load dependencies (arm64, gcc12) / Metadata ✅ success view
Load dependencies (arm64, llvm) / Caching ✅ success view
Load dependencies (arm64, llvm) / Finalize ✅ success view
Load dependencies (arm64, llvm) / Metadata ✅ success view
Load source build cache ⏩ skipped view
Load wheel dependencies (amd64, 12.6) / Caching ✅ success view
Load wheel dependencies (amd64, 12.6) / Finalize ✅ success view
Load wheel dependencies (amd64, 12.6) / Metadata ✅ success view
Load wheel dependencies (amd64, 13.0) / Caching ✅ success view
Load wheel dependencies (amd64, 13.0) / Finalize ✅ success view
Load wheel dependencies (amd64, 13.0) / Metadata ✅ success view
Load wheel dependencies (arm64, 12.6) / Caching ✅ success view
Load wheel dependencies (arm64, 12.6) / Finalize ✅ success view
Load wheel dependencies (arm64, 12.6) / Metadata ✅ success view
Load wheel dependencies (arm64, 13.0) / Caching ✅ success view
Load wheel dependencies (arm64, 13.0) / Finalize ✅ success view
Load wheel dependencies (arm64, 13.0) / Metadata ✅ success view
Prepare cache clean-up ❔ in_progress view
Retrieve PR info ✅ success view
✅ Required checks (6/6) — declared in .github/required-checks.yml for push
Required check Status Link
Build and test (amd64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, llvm, openmpi) / Dev environment (Python) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Python) ✅ success view
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python) ✅ success view

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
@1tnguyen 1tnguyen force-pushed the tnguyen/device-call-realtime branch 3 times, most recently from f80c892 to f2baa1c Compare May 22, 2026 03:33
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
@1tnguyen 1tnguyen force-pushed the tnguyen/device-call-realtime branch from f2baa1c to f193f4c Compare May 22, 2026 04:00
1tnguyen and others added 11 commits May 24, 2026 23:23
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
@1tnguyen 1tnguyen changed the title [WIP] cc.device_call lowering to realtime dispatch for local simulators [CUDAQ-790] cc.device_call lowering to realtime dispatch for local simulators May 26, 2026
@1tnguyen 1tnguyen marked this pull request as ready for review May 26, 2026 18:34
Comment thread realtime/CMakeLists.txt Outdated
Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>