feat(ipc): optional GPU isolation for target-shard daemon (same-backend splits)#483
Draft
davide221 wants to merge 1 commit into
…nd splits)
The target-shard IPC daemon is fork+exec'd and inherits the parent's
environment, with no GPU-visibility pinning. Both the main process and the
daemon therefore enumerate every GPU. For a cross-backend split (CUDA main +
HIP daemon) that is harmless. But for a same-backend split (e.g. HIP main +
HIP daemon across two AMD GPUs), two ROCr runtimes each initializing the
other process's device can hard-fault the host at init.
Add an opt-in isolation path: BackendIpcLaunchConfig gains a child_env list
applied (setenv) in the child before exec; when DFLASH_TARGET_SHARD_ISOLATE_GPUS
is set, TargetShardIpcSession pins the daemon to only its assigned GPUs via
{CUDA,ROCR}_VISIBLE_DEVICES and remaps --target-gpus to the 0-based pinned view.
Default (env unset) is byte-identical to prior behavior: child_env stays empty
and the gpu list is unchanged, so existing CUDA+HIP splits are unaffected.
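The pin-and-remap step described above can be sketched as a pure function. This is a minimal illustration, not the PR's code: `LaunchPlan`, `join_indices`, `plan_daemon_launch`, and `isolation_requested` are hypothetical names, and treating "set" as "present in the environment with any value" is an assumption.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the launch inputs; only the child_env name comes from the PR text.
struct LaunchPlan {
    std::vector<std::pair<std::string, std::string>> child_env;  // applied in the child only
    std::vector<int> target_gpus;                                // value passed as --target-gpus
};

// Join indices as "a,b,c", the comma-separated form the *_VISIBLE_DEVICES variables take.
static std::string join_indices(const std::vector<int>& ids) {
    std::string out;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i != 0) out += ',';
        out += std::to_string(ids[i]);
    }
    return out;
}

// Build the plan for a daemon assigned the physical GPUs in `assigned`.
// `isolate` mirrors "DFLASH_TARGET_SHARD_ISOLATE_GPUS is set".
LaunchPlan plan_daemon_launch(const std::vector<int>& assigned, bool isolate) {
    LaunchPlan plan;
    if (!isolate) {
        // Default path: no extra env, indices passed through unchanged.
        plan.target_gpus = assigned;
        return plan;
    }
    const std::string visible = join_indices(assigned);
    // Set both variables; each runtime reads its own and ignores the other.
    plan.child_env.emplace_back("CUDA_VISIBLE_DEVICES", visible);
    plan.child_env.emplace_back("ROCR_VISIBLE_DEVICES", visible);
    // Inside the pinned view the daemon sees its GPUs renumbered from 0.
    for (std::size_t i = 0; i < assigned.size(); ++i) {
        plan.target_gpus.push_back(static_cast<int>(i));
    }
    return plan;
}

// Assumption: any value, including an empty one, counts as "set".
bool isolation_requested() {
    return std::getenv("DFLASH_TARGET_SHARD_ISOLATE_GPUS") != nullptr;
}
```

For example, a daemon assigned physical GPUs 1 and 3 would get both variables set to `1,3` and `--target-gpus` remapped to `0,1`; with isolation off, the list stays `1,3` and `child_env` stays empty.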
Validated in concept on a Strix Halo (gfx1151) + external R9700 (gfx1201)
HIP+HIP split, where the same pinning applied via an exec wrapper took the
dual-GPU split from an instant host hard-fault to a stable coherent run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
The target-shard IPC daemon is fork()+execv()'d and inherits the parent's environment with no GPU-visibility pinning (backend_ipc.cpp builds argv and execs; the daemon uses the raw --target-gpus indices). So the main process and the daemon both enumerate every GPU.
This blocks same-backend multi-GPU target-shard splits for any model that uses the IPC path (qwen35 / gemma4 / laguna / deepseek4).
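To make the inheritance concrete, here is a minimal sketch of a fork()+execv() launch in which the child receives the parent's environment plus a list of child-only overrides. It is not the code in backend_ipc.cpp: `EnvList` and `run_child` are hypothetical names, and waiting for the child to exit is done here only so the sketch can report a result.

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

using EnvList = std::vector<std::pair<std::string, std::string>>;

// Run args[0] (an absolute path) with the parent's environment plus `child_env`.
// Returns the child's exit code, or -1 on failure or abnormal termination.
int run_child(const std::vector<std::string>& args, const EnvList& child_env) {
    if (args.empty()) return -1;

    // Build a null-terminated argv that points into `args`.
    std::vector<char*> argv;
    for (const auto& a : args) argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    const pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: it starts with a copy of the parent's environment.
        // Overrides applied here are visible only to the exec'd program.
        for (const auto& kv : child_env) {
            setenv(kv.first.c_str(), kv.second.c_str(), 1);
        }
        execv(argv[0], argv.data());
        _exit(127);  // reached only if execv failed
    }

    // Parent: its own environment is untouched by the child's setenv calls.
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With an empty `child_env` the child sees exactly what the parent sees, which is the situation the Summary describes. One caveat on the design: setenv() is not async-signal-safe, so in a multithreaded parent it may be safer to build an explicit environment array before fork() and pass it to execve() instead.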
Change
Opt-in, backward-compatible:
- BackendIpcLaunchConfig gains child_env: extra env vars setenv'd in the child before exec (parent env untouched).
- When DFLASH_TARGET_SHARD_ISOLATE_GPUS is set, TargetShardIpcSession pins the daemon to only its assigned GPUs via {CUDA,ROCR}_VISIBLE_DEVICES and remaps --target-gpus to the 0-based pinned view.
Default (env unset) is byte-identical to today: child_env stays empty and the gpu list is unchanged. So existing CUDA+HIP splits are unaffected; there's no behavior change unless you opt in.
Why both CUDA_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES
Set both so the isolation is backend-agnostic: a HIP daemon honors ROCR_VISIBLE_DEVICES, a CUDA daemon honors CUDA_VISIBLE_DEVICES, and each ignores the other's var.
Validation
Validated in concept on a Strix Halo (gfx1151, iGPU) + external Radeon AI PRO R9700 (gfx1201, dGPU) HIP+HIP split. Applying the exact same pinning via an exec wrapper (ROCR_VISIBLE_DEVICES=<gpu>; exec backend_ipc_daemon) took the dual-GPU DeepSeek V4 Flash split from an instant host hard-fault to a stable, coherent generation across both GPUs (~8.4 tok/s, both GPUs computing). This PR moves that proven workaround into the launch mechanism.
Draft: testing checklist before merge
DFLASH_TARGET_SHARD_ISOLATE_GPUS=1).