feat(ipc): optional GPU isolation for target-shard daemon (same-backend splits) #483

Draft
davide221 wants to merge 1 commit into Luce-Org:main from davide221:feat/ipc-target-shard-gpu-isolation

davide221 (Contributor) commented Jul 3, 2026

Summary

The target-shard IPC daemon is launched with fork() + execv() and inherits the parent's environment with no GPU-visibility pinning: backend_ipc.cpp builds argv and execs, and the daemon uses the raw --target-gpus indices. As a result, the main process and the daemon both enumerate every GPU.

  • Cross-backend split (CUDA main + HIP daemon) — harmless: the two runtimes see disjoint device sets.
  • Same-backend split (e.g. HIP main + HIP daemon across two AMD GPUs) — two ROCr runtimes each initialize the other process's device at startup, which on some ROCm/gfx setups hard-faults the host.

This blocks same-backend multi-GPU target-shard splits for any model that uses the IPC path (qwen35 / gemma4 / laguna / deepseek4).

Change

Opt-in, backward-compatible:

  • BackendIpcLaunchConfig gains child_env — extra env vars setenv'd in the child before exec (parent env untouched).
  • When DFLASH_TARGET_SHARD_ISOLATE_GPUS is set, TargetShardIpcSession pins the daemon to only its assigned GPUs via {CUDA,ROCR}_VISIBLE_DEVICES and remaps --target-gpus to the 0-based pinned view.
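The child_env mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the code in backend_ipc.cpp: the names `EnvList`, `spawn_with_child_env`, and `wait_exit_code` are hypothetical, and error handling is reduced to the bare minimum. It shows only the ordering that matters, namely that setenv runs after fork() and before execv(), so the parent's environment is never modified.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the child_env field: (name, value) pairs.
using EnvList = std::vector<std::pair<std::string, std::string>>;

// Fork, apply child_env in the child only, then exec argv[0].
// Returns the child's pid in the parent, or -1 if fork() failed.
inline pid_t spawn_with_child_env(const std::vector<std::string>& argv,
                                  const EnvList& child_env) {
    assert(!argv.empty());  // argv[0] is the executable path

    // Build the char* array before forking so the child does minimal work.
    std::vector<char*> raw;
    for (const std::string& arg : argv) {
        raw.push_back(const_cast<char*>(arg.c_str()));
    }
    raw.push_back(nullptr);

    pid_t pid = fork();
    if (pid == 0) {
        // Child: these variables exist only in this process image.
        for (const auto& kv : child_env) {
            setenv(kv.first.c_str(), kv.second.c_str(), 1 /* overwrite */);
        }
        execv(raw[0], raw.data());
        _exit(127);  // reached only if execv failed
    }
    return pid;  // parent (or -1 on fork failure); parent env untouched
}

// Convenience helper for the sketch: wait for the child and return its
// exit code, or -1 if it did not exit normally.
inline int wait_exit_code(pid_t pid) {
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With an empty `child_env` the loop does nothing, so the launch reduces to a plain fork() + execv().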

Default (env unset) is byte-identical to today: child_env stays empty and the GPU list is unchanged. Existing CUDA+HIP splits are therefore unaffected; there is no behavior change unless you opt in.

Why both CUDA_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES

Both are set so that the isolation is backend-agnostic: a HIP daemon honors ROCR_VISIBLE_DEVICES, a CUDA daemon honors CUDA_VISIBLE_DEVICES, and each ignores the other's variable.
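As a rough sketch of how the pinning and the 0-based remap could fit together (not the actual TargetShardIpcSession code): the type `PinnedGpuView` and the function `pin_target_gpus` are hypothetical, and the sketch assumes the daemon's assigned GPUs are given as physical indices in the same numbering that the visibility variables use. Reading DFLASH_TARGET_SHARD_ISOLATE_GPUS is left out; the `isolate` flag stands in for it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type: the env vars to hand to the child, plus the
// indices that --target-gpus would carry after the remap.
struct PinnedGpuView {
    std::vector<std::pair<std::string, std::string>> child_env;
    std::vector<int> remapped_gpus;
};

// With isolate == false this is a pass-through (empty child_env, same list).
// With isolate == true the daemon sees only its assigned GPUs, renumbered
// from 0 in the order they appear in the mask.
inline PinnedGpuView pin_target_gpus(const std::vector<int>& physical_gpus,
                                     bool isolate) {
    PinnedGpuView out;
    if (!isolate) {
        out.remapped_gpus = physical_gpus;
        return out;
    }

    std::string mask;  // e.g. "1,3" for physical GPUs 1 and 3
    for (std::size_t i = 0; i < physical_gpus.size(); ++i) {
        assert(physical_gpus[i] >= 0);
        if (i > 0) mask += ',';
        mask += std::to_string(physical_gpus[i]);
        // Position i in the mask becomes device i inside the daemon.
        out.remapped_gpus.push_back(static_cast<int>(i));
    }

    // Set both variables: each runtime reads its own and ignores the other.
    out.child_env.emplace_back("CUDA_VISIBLE_DEVICES", mask);
    out.child_env.emplace_back("ROCR_VISIBLE_DEVICES", mask);
    return out;
}
```

For example, a daemon assigned physical GPUs 1 and 3 would receive the mask `1,3` in both variables and address those devices as 0 and 1. If the parent itself already runs under a visibility mask, the indices would need translating first; the sketch does not handle that case.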

Validation

Validated in concept on a Strix Halo (gfx1151, iGPU) + external Radeon AI PRO R9700 (gfx1201, dGPU) HIP+HIP split. Applying the exact same pinning via an exec wrapper (ROCR_VISIBLE_DEVICES=<gpu>; exec backend_ipc_daemon) took the dual-GPU DeepSeek V4 Flash split from an instant host hard-fault to a stable, coherent generation across both GPUs (~8.4 tok/s, both GPUs computing). This PR moves that proven workaround into the launch mechanism.

Draft — testing checklist before merge

  • Build (Linux HIP + Linux CUDA).
  • In-code path on the HIP+HIP box (replace the exec wrapper with DFLASH_TARGET_SHARD_ISOLATE_GPUS=1).
  • Regression: a CUDA+HIP split with the env unset (expect no change) and set (expect equivalent behavior + the daemon pinned to its GPU).
  • Multi-GPU remote daemon (>1 remote GPU) index remap.

…nd splits)

The target-shard IPC daemon is fork+exec'd and inherits the parent's
environment, with no GPU-visibility pinning. Both the main process and the
daemon therefore enumerate every GPU. For a cross-backend split (CUDA main +
HIP daemon) that is harmless. But for a same-backend split (e.g. HIP main +
HIP daemon across two AMD GPUs), two ROCr runtimes each initializing the
other process's device can hard-fault the host at init.

Add an opt-in isolation path: BackendIpcLaunchConfig gains a child_env list
applied (setenv) in the child before exec; when DFLASH_TARGET_SHARD_ISOLATE_GPUS
is set, TargetShardIpcSession pins the daemon to only its assigned GPUs via
{CUDA,ROCR}_VISIBLE_DEVICES and remaps --target-gpus to the 0-based pinned view.

Default (env unset) is byte-identical to prior behavior: child_env stays empty
and the gpu list is unchanged, so existing CUDA+HIP splits are unaffected.

Validated in concept on a Strix Halo (gfx1151) + external R9700 (gfx1201)
HIP+HIP split, where the same pinning applied via an exec wrapper took the
dual-GPU split from an instant host hard-fault to a stable coherent run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>