[do not merge] add opt-in NUMA binding for ci testing by cuichenx · Pull Request #4630 · NVIDIA-NeMo/Megatron-Bridge

cuichenx · 2026-07-02T18:12:28Z

What does this PR do ?

Adds opt-in GPU-local NUMA binding for Kubeflow jobs launched through torchrun.

NCCL 2.30 restores the launcher's original CPU affinity after communicator initialization. Kubeflow training jobs currently launch ranks without an explicit CPU or memory binding, which can expose cross-NUMA host scheduling overhead for TP+EP MoE workloads such as Moonlight.

When NEMO_KUBEFLOW_NUMA_BINDING=1 is set, each torchrun worker now:

maps LOCAL_RANK to its GPU PCI bus ID;
reads the GPU-local NUMA node from sysfs;
logs the rank, PCI bus, and NUMA mapping;
hard-fails if the mapping is missing or invalid; and
execs the unchanged training command through numactl --cpunodebind and --membind.
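
The per-worker steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR. The function names are hypothetical, and querying nvidia-smi is only one assumed way to obtain the bus ID. LOCAL_RANK is assumed to equal the GPU index, and the bus ID passed to read_numa_node is assumed to already be in sysfs form (for example 0000:3b:00.0).

```python
import subprocess
from pathlib import Path


def gpu_pci_bus_id(local_rank: int) -> str:
    """Ask nvidia-smi for the PCI bus ID of the GPU at index local_rank.

    Assumption: LOCAL_RANK maps one-to-one onto the nvidia-smi GPU index.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader",
         "-i", str(local_rank)],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    if not out:
        raise RuntimeError(f"no PCI bus ID reported for LOCAL_RANK={local_rank}")
    return out


def read_numa_node(sysfs_bus_id: str,
                   sysfs_root: Path = Path("/sys/bus/pci/devices")) -> int:
    """Read the GPU-local NUMA node from sysfs; fail hard if missing or invalid."""
    path = sysfs_root / sysfs_bus_id / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError) as exc:
        raise RuntimeError(f"cannot read NUMA node from {path}") from exc
    if node < 0:  # the kernel reports -1 when no NUMA affinity is known
        raise RuntimeError(f"{path} reports invalid NUMA node {node}")
    return node


def numactl_argv(node: int, command: list[str]) -> list[str]:
    """Prefix the unchanged training command with CPU and memory binding."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", "--", *command]
```

A worker would then replace itself with the bound command, for example with os.execvp(argv[0], argv), so the training command itself stays unchanged.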

The implementation wraps the run.Script task before it is passed to the stock NeMo-Run KubeflowExecutor and Torchrun. The default Kubeflow launcher path and exact executor class are unchanged.
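
The opt-in behavior can be pictured with the sketch below. It is an illustration under stated assumptions rather than the PR's code: prepare_task and numa_binding_enabled are hypothetical names, the task is treated as an opaque object instead of a real run.Script, and any value other than "1" is assumed to leave the feature off.

```python
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")
FLAG = "NEMO_KUBEFLOW_NUMA_BINDING"


def numa_binding_enabled(env: Mapping[str, str]) -> bool:
    # Assumption: only the exact value "1" opts in; unset or other values do not.
    return env.get(FLAG) == "1"


def prepare_task(task: T, env: Mapping[str, str], wrap: Callable[[T], T]) -> T:
    """Return the wrapped task when opted in, otherwise the original object."""
    return wrap(task) if numa_binding_enabled(env) else task
```

Because the wrapping happens on the task, whatever executor receives the result needs no changes, and the default path hands over the very same task object.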

Changelog

Add an opt-in rank-local task wrapper for Kubeflow/torchrun workers.
Normalize eight-digit NVIDIA PCI domains for Linux sysfs lookup.
Add focused tests for flag handling and generated worker commands.
Retain the stock NeMo-Run executor class required by its executor mapping.
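
For context on the PCI domain normalization: NVIDIA tools print bus IDs with an eight-digit domain (for example 00000000:3B:00.0), while Linux sysfs names devices with a zero-padded, at-least-four-digit lowercase domain (0000:3b:00.0). A minimal sketch of such a conversion, with a hypothetical function name and not taken from this PR:

```python
import re

# domain (4 or 8 hex digits) : bus : device . function
_PCI_RE = re.compile(
    r"^([0-9A-Fa-f]{4}|[0-9A-Fa-f]{8}):([0-9A-Fa-f]{2}):([0-9A-Fa-f]{2})\.([0-7])$"
)


def normalize_pci_bus_id(bus_id: str) -> str:
    """Convert an NVIDIA-style PCI bus ID into the Linux sysfs device name."""
    match = _PCI_RE.match(bus_id.strip())
    if match is None:
        raise ValueError(f"unrecognized PCI bus ID: {bus_id!r}")
    domain, bus, device, function = match.groups()
    # sysfs formats the domain as %04x, so re-render it from its numeric value.
    return f"{int(domain, 16):04x}:{bus.lower()}:{device.lower()}.{function}"
```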

GitHub Actions CI

Validation performed:

uv run --no-project --with pytest --with 'nemo-run==0.10.0' python -m pytest --confcutdir=tests/unit_tests/scripts/performance tests/unit_tests/scripts/performance/test_executors.py -q: 4 passed.
NeMo-Run 0.10.0 packaging smoke test: the role remains stock kubeflow + torchrun, and the generated /nemo_run/scripts/moonlight.sh contains the NUMA wrapper.
uv run --no-project --with pre-commit pre-commit run --all-files: passed.

uv sync --group dev cannot complete on the local glibc 2.31 host because the pinned nvidia-resiliency-ext==0.6.0 ships wheels that require glibc 2.39. The checks above were therefore run in isolated uv environments with the pinned NeMo-Run version.

The first runtime attempt, job 353648344, exposed that subclassing KubeflowExecutor is incompatible with NeMo-Run's exact-class executor mapping. It failed before TrainJob creation with KeyError: <class 'utils.executors._TransformingKubeflowExecutor'>; the implementation was then changed to the stock-executor task wrapper above.

Corrected runtime validation is queued as NeMo-CI job 353773441 using nvcr.io/nvidian/nemo:26.06.01.rc0 on 64 GCP GB200 GPUs. Its owned GPULease is waiting for capacity with a 24-hour acquisition window. No TrainJob or W&B run exists yet, so live NUMA mapping and performance validation remain pending.

Before your PR is "Ready for review"

Pre checks:

Read and followed the contributor guidelines.
Added focused tests.
Complete the queued 64-GPU runtime validation and attach results.
No optional dependency behavior is changed.

Additional Information

This draft targets r0.5.0 because the runtime experiment uses the 26.06.01 release stack and the change is based directly on release commit 78540b21.

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot · 2026-07-02T18:12:33Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Chen Cui <chcui@nvidia.com>

perf(kubeflow): add opt-in NUMA binding

93352ac

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx changed the title ~~perf(kubeflow): add opt-in NUMA binding~~ [do not merge] add opt-in NUMA binding for ci testing Jul 2, 2026

cuichenx added the dummy-pr label (Filed this PR to run tests, not going to merge) Jul 2, 2026

fix(kubeflow): wrap NUMA task without executor subclass

1d8f974

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx wants to merge 2 commits into r0.5.0 from cuichenx/moonlight-k8s-numa-affinity-r050
