feat(ipc): optional GPU isolation for target-shard daemon (same-backend splits) by davide221 · Pull Request #483 · Luce-Org/lucebox-hub

davide221 · 2026-07-03T01:27:18Z

Summary

The target-shard IPC daemon is launched with fork() + execv() and inherits the parent's environment with no GPU-visibility pinning: backend_ipc.cpp builds argv and execs, and the daemon uses the raw --target-gpus indices. As a result, the main process and the daemon both enumerate every GPU.

Cross-backend split (CUDA main + HIP daemon) — harmless: the two runtimes see disjoint device sets.
Same-backend split (e.g. HIP main + HIP daemon across two AMD GPUs) — two ROCr runtimes each initialize the other process's device at startup, which on some ROCm/gfx setups hard-faults the host.

This blocks same-backend multi-GPU target-shard splits for any model that uses the IPC path (qwen35 / gemma4 / laguna / deepseek4).

Change

Opt-in, backward-compatible:

BackendIpcLaunchConfig gains child_env — extra env vars setenv'd in the child before exec (parent env untouched).
When DFLASH_TARGET_SHARD_ISOLATE_GPUS is set, TargetShardIpcSession pins the daemon to only its assigned GPUs via {CUDA,ROCR}_VISIBLE_DEVICES and remaps --target-gpus to the 0-based pinned view.
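A minimal sketch of the child_env mechanism, assuming a POSIX host. `LaunchConfigSketch` and `launch_and_wait` are hypothetical names used for illustration; they are not the actual BackendIpcLaunchConfig or the code in backend_ipc.cpp, and the real launcher presumably keeps the daemon running rather than waiting for it to exit.

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the launch config described above.
struct LaunchConfigSketch {
    std::vector<std::string> argv;  // argv[0] is the executable path
    std::vector<std::pair<std::string, std::string>> child_env;  // extra vars, child only
};

// Forks, applies child_env in the child, then execs. Returns the child's exit
// code, or -1 if fork/wait fails or the child did not exit normally.
int launch_and_wait(const LaunchConfigSketch& cfg) {
    if (cfg.argv.empty()) return -1;

    // Build the null-terminated argv array before forking.
    std::vector<char*> args;
    for (const std::string& a : cfg.argv) args.push_back(const_cast<char*>(a.c_str()));
    args.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: setenv only touches this process, so the parent env is unchanged.
        for (const auto& kv : cfg.child_env) {
            if (setenv(kv.first.c_str(), kv.second.c_str(), 1) != 0) _exit(126);
        }
        execv(args[0], args.data());
        _exit(127);  // only reached if execv failed
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, launching `/bin/sh -c 'test "$ROCR_VISIBLE_DEVICES" = 1'` with `child_env = {{"ROCR_VISIBLE_DEVICES", "1"}}` should exit 0, while `getenv("ROCR_VISIBLE_DEVICES")` in the parent is unaffected by the launch.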

With the env var unset, behavior is identical to today: child_env stays empty and the GPU list is unchanged. Existing CUDA+HIP splits are therefore unaffected, and nothing changes unless you opt in.
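The opt-in and default branches can be sketched as a pure function. This is an illustration under assumptions, not the code in TargetShardIpcSession: `IsolationPlan`, `join_indices`, and `plan_isolation` are hypothetical names, and the `isolate` flag stands in for however the session decides that DFLASH_TARGET_SHARD_ISOLATE_GPUS is set.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: what the parent would hand to the daemon launch.
struct IsolationPlan {
    std::vector<std::pair<std::string, std::string>> child_env;  // empty when isolation is off
    std::vector<int> daemon_gpus;  // indices to pass as --target-gpus
};

// Joins indices as "a,b,c", the comma-separated form used by *_VISIBLE_DEVICES.
std::string join_indices(const std::vector<int>& gpus) {
    std::string out;
    for (std::size_t i = 0; i < gpus.size(); ++i) {
        if (i != 0) out += ',';
        out += std::to_string(gpus[i]);
    }
    return out;
}

IsolationPlan plan_isolation(const std::vector<int>& assigned_gpus, bool isolate) {
    IsolationPlan plan;
    if (!isolate) {
        // Default path: no extra env, raw indices passed through unchanged.
        plan.daemon_gpus = assigned_gpus;
        return plan;
    }
    // Opt-in path: expose only the assigned GPUs to the child...
    const std::string visible = join_indices(assigned_gpus);
    plan.child_env.emplace_back("CUDA_VISIBLE_DEVICES", visible);
    plan.child_env.emplace_back("ROCR_VISIBLE_DEVICES", visible);
    // ...and renumber them 0..n-1, which is how the pinned child sees them.
    for (std::size_t i = 0; i < assigned_gpus.size(); ++i) {
        plan.daemon_gpus.push_back(static_cast<int>(i));
    }
    return plan;
}
```

With assigned GPUs {2, 3}, the default branch yields an empty child_env and --target-gpus 2,3; the opt-in branch yields both variables set to "2,3" and --target-gpus 0,1.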

Why both `CUDA_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES`

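Both are set so the isolation is backend-agnostic: a HIP daemon honors ROCR_VISIBLE_DEVICES, a CUDA daemon honors CUDA_VISIBLE_DEVICES, and the CUDA runtime ignores ROCR_VISIBLE_DEVICES. One caveat: per the ROCm documentation, HIP on AMD also reads CUDA_VISIBLE_DEVICES (with the same effect as HIP_VISIBLE_DEVICES), so in a HIP daemon that variable is not simply ignored and the two filters can stack.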

Validation

Validated in concept on a Strix Halo (gfx1151, iGPU) + external Radeon AI PRO R9700 (gfx1201, dGPU) HIP+HIP split. Applying the exact same pinning via an exec wrapper (ROCR_VISIBLE_DEVICES=<gpu>; exec backend_ipc_daemon) took the dual-GPU DeepSeek V4 Flash split from an instant host hard-fault to a stable, coherent generation across both GPUs (~8.4 tok/s, both GPUs computing). This PR moves that proven workaround into the launch mechanism.

Draft — testing checklist before merge

Build (Linux HIP + Linux CUDA).
In-code path on the HIP+HIP box (replace the exec wrapper with DFLASH_TARGET_SHARD_ISOLATE_GPUS=1).
Regression: a CUDA+HIP split with the env unset (expect no change) and set (expect equivalent behavior + the daemon pinned to its GPU).
Multi-GPU remote daemon (>1 remote GPU) index remap.
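For the multi-GPU remap item, the expected mapping can be written down as a small lookup. This is a sketch under the assumption that the runtime enumerates visible devices in the order they are listed in the variable; `pinned_index` is a hypothetical helper, not a function from the PR.

```cpp
#include <cstddef>
#include <vector>

// Returns the 0-based index the daemon would use for `physical_gpu` once only
// `assigned_gpus` are visible, or -1 if that GPU is not assigned to the daemon.
int pinned_index(const std::vector<int>& assigned_gpus, int physical_gpu) {
    for (std::size_t i = 0; i < assigned_gpus.size(); ++i) {
        if (assigned_gpus[i] == physical_gpu) return static_cast<int>(i);
    }
    return -1;
}
```

With assigned GPUs {1, 3}, physical GPU 1 maps to pinned index 0 and physical GPU 3 to pinned index 1; a GPU outside the list, such as 2, is not visible to the daemon at all.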

…nd splits)

The target-shard IPC daemon is fork+exec'd and inherits the parent's environment, with no GPU-visibility pinning. Both the main process and the daemon therefore enumerate every GPU. For a cross-backend split (CUDA main + HIP daemon) that is harmless. But for a same-backend split (e.g. HIP main + HIP daemon across two AMD GPUs), two ROCr runtimes each initializing the other process's device can hard-fault the host at init.

Add an opt-in isolation path: BackendIpcLaunchConfig gains a child_env list applied (setenv) in the child before exec; when DFLASH_TARGET_SHARD_ISOLATE_GPUS is set, TargetShardIpcSession pins the daemon to only its assigned GPUs via {CUDA,ROCR}_VISIBLE_DEVICES and remaps --target-gpus to the 0-based pinned view.

Default (env unset) is byte-identical to prior behavior: child_env stays empty and the gpu list is unchanged, so existing CUDA+HIP splits are unaffected.

Validated in concept on a Strix Halo (gfx1151) + external R9700 (gfx1201) HIP+HIP split, where the same pinning applied via an exec wrapper took the dual-GPU split from an instant host hard-fault to a stable coherent run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

davide221 wants to merge 1 commit into Luce-Org:main from davide221:feat/ipc-target-shard-gpu-isolation
