[do not merge] add opt-in NUMA binding for ci testing #4630

Draft
cuichenx wants to merge 2 commits into r0.5.0 from cuichenx/moonlight-k8s-numa-affinity-r050

cuichenx commented on Jul 2, 2026

Contributor

What does this PR do?

Adds opt-in GPU-local NUMA binding for Kubeflow jobs launched through torchrun.

NCCL 2.30 restores the launcher's original CPU affinity after communicator initialization. Kubeflow training jobs currently launch ranks without an explicit CPU or memory binding, which can expose cross-NUMA host scheduling overhead for TP+EP MoE workloads such as Moonlight.

When NEMO_KUBEFLOW_NUMA_BINDING=1 is set, each torchrun worker now:

  1. maps LOCAL_RANK to its GPU PCI bus ID;
  2. reads the GPU-local NUMA node from sysfs;
  3. logs the rank, PCI bus, and NUMA mapping;
  4. hard-fails if the mapping is missing or invalid; and
  5. execs the unchanged training command through numactl --cpunodebind and --membind.
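
The lookup and wrapping steps above can be sketched roughly as follows. This is a minimal illustration and not the PR's wrapper: the function names are hypothetical, the LOCAL_RANK to PCI bus ID mapping is left out (the bus ID is passed in already in sysfs form), and the sketch assumes the standard Linux `/sys/bus/pci/devices/<bus id>/numa_node` file, where the kernel reports `-1` when a device has no NUMA affinity.

```python
import os
from pathlib import Path
from typing import Mapping, Sequence


def numa_binding_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Opt-in check; this sketch assumes only the exact value "1" enables binding."""
    return env.get("NEMO_KUBEFLOW_NUMA_BINDING") == "1"


def read_numa_node(pci_bus_id: str, sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """Return the NUMA node sysfs reports for a PCI device, failing hard otherwise."""
    numa_file = Path(sysfs_root) / pci_bus_id / "numa_node"
    try:
        node = int(numa_file.read_text().strip())
    except (OSError, ValueError) as exc:
        raise RuntimeError(f"cannot read a NUMA node from {numa_file}") from exc
    if node < 0:
        # The kernel writes -1 when it has no NUMA affinity for the device.
        raise RuntimeError(f"{numa_file} reports no NUMA affinity ({node})")
    return node


def numactl_argv(numa_node: int, command: Sequence[str]) -> list[str]:
    """Prefix the unchanged training command with CPU and memory binding."""
    return [
        "numactl",
        f"--cpunodebind={numa_node}",
        f"--membind={numa_node}",
        *command,
    ]
```

A worker following this shape would log the rank, bus ID, and node, then replace itself with the wrapped command, for example via `os.execvp(argv[0], argv)`, so that the training process inherits the binding.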

The implementation wraps the run.Script task before it is passed to the stock NeMo-Run KubeflowExecutor and Torchrun. The default Kubeflow launcher path and exact executor class are unchanged.

Changelog

  • Add an opt-in rank-local task wrapper for Kubeflow/torchrun workers.
  • Normalize eight-digit NVIDIA PCI domains for Linux sysfs lookup.
  • Add focused tests for flag handling and generated worker commands.
  • Retain the stock NeMo-Run executor class required by its executor mapping.
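
The PCI domain normalization mentioned in the changelog could look something like the sketch below. It is an assumption about the rule, not the PR's code: NVIDIA tools typically print bus IDs with an eight-digit domain and uppercase hex (for example `00000000:3B:00.0`), while Linux names sysfs device directories with lowercase hex and a domain padded to at least four digits. The function name `normalize_pci_bus_id` is hypothetical.

```python
import re

# Accepts a 4- or 8-hex-digit domain followed by bus:device.function.
_PCI_BUS_ID = re.compile(
    r"([0-9a-fA-F]{4}|[0-9a-fA-F]{8}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)


def normalize_pci_bus_id(bus_id: str) -> str:
    """Convert an NVIDIA-style PCI bus ID to the form used for sysfs directory names."""
    match = _PCI_BUS_ID.fullmatch(bus_id.strip())
    if match is None:
        raise ValueError(f"unrecognized PCI bus ID: {bus_id!r}")
    domain, bus, device, function = match.groups()
    # Re-render the domain as lowercase hex padded to at least four digits,
    # so "00000008" becomes "0008" and larger domains keep their extra digits.
    domain = f"{int(domain, 16):04x}"
    return f"{domain}:{bus.lower()}:{device.lower()}.{function}"
```

For instance, `normalize_pci_bus_id("00000000:3B:00.0")` yields `0000:3b:00.0`, which can then be joined under `/sys/bus/pci/devices/` to find the device's `numa_node` file.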

GitHub Actions CI

Validation performed:

  • uv run --no-project --with pytest --with 'nemo-run==0.10.0' python -m pytest --confcutdir=tests/unit_tests/scripts/performance tests/unit_tests/scripts/performance/test_executors.py -q — 4 passed.
  • NeMo-Run 0.10.0 packaging smoke test — role remains stock kubeflow + torchrun, and generated /nemo_run/scripts/moonlight.sh contains the NUMA wrapper.
  • uv run --no-project --with pre-commit pre-commit run --all-files — passed.

uv sync --group dev cannot complete on the local glibc 2.31 host because the pinned nvidia-resiliency-ext==0.6.0 provides wheels built for glibc 2.39. The checks above were therefore run in isolated uv environments with the pinned NeMo-Run version.

The first runtime attempt, job 353648344, exposed that subclassing KubeflowExecutor is incompatible with NeMo-Run's exact-class executor mapping. It failed before TrainJob creation with KeyError: <class 'utils.executors._TransformingKubeflowExecutor'>; the implementation was then changed to the stock-executor task wrapper above.

Corrected runtime validation is queued as NeMo-CI job 353773441 using nvcr.io/nvidian/nemo:26.06.01.rc0 on 64 GCP GB200 GPUs. Its owned GPULease is waiting for capacity with a 24-hour acquisition window. No TrainJob or W&B run exists yet, so live NUMA mapping and performance validation remain pending.

Before your PR is "Ready for review"

Pre checks:

  • Read and followed the contributor guidelines.
  • Added focused tests.
  • Complete the queued 64-GPU runtime validation and attach results.
  • No optional dependency behavior is changed.

Additional Information

This draft targets r0.5.0 because the runtime experiment uses the 26.06.01 release stack and the change is based directly on release commit 78540b21.

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot commented Jul 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cuichenx changed the title from "perf(kubeflow): add opt-in NUMA binding" to "[do not merge] add opt-in NUMA binding for ci testing" on Jul 2, 2026
cuichenx added the dummy-pr label ("Filed this PR to run tests, not going to merge") on Jul 2, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>