Guard nvrx __version__ + degrade async support gracefully for older nvidia-resiliency-ext #5605

Draft
yeyu-nvidia wants to merge 1 commit into NVIDIA:main from yeyu-nvidia:yeyu/nvrx-version-guard

@yeyu-nvidia yeyu-nvidia commented Jul 1, 2026

Contributor

What

has_nvrx_async_support() reads nvrx.__version__ unconditionally and asserts a minimum nvidia-resiliency-ext version. Some environments ship an older nvidia-resiliency-ext that lacks __version__, so importing megatron.core.dist_checkpointing raises AttributeError: module 'nvidia_resiliency_ext' has no attribute '__version__'. Separately, the hard assert turns a missing or old nvrx into a crash rather than simply disabling the optional nvrx async-checkpoint path.

Fix

  • nvrx_version = str(getattr(nvrx, "__version__", "0.0.0")) instead of nvrx.__version__.
  • has_nvrx_async_support() returns False when the version is below minimum instead of asserting.

Backward compatible; behavior is unchanged when a recent nvrx is present.

…vidia-resiliency-ext

has_nvrx_async_support() asserted a minimum nvidia-resiliency-ext version and read
nvrx.__version__ unconditionally. Older nvrx builds (e.g. in some NeMo containers)
lack __version__, raising AttributeError at import of dist_checkpointing, and the
hard assert turns a missing/old nvrx into a crash instead of simply disabling the
optional nvrx async-checkpoint path. Use getattr(nvrx, '__version__', '0.0.0') and
return False (async unsupported) instead of asserting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
@yeyu-nvidia yeyu-nvidia requested review from a team as code owners July 1, 2026 19:04
@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft July 1, 2026 19:08
github-actions Bot commented Jul 1, 2026

Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

copy-pr-bot Bot commented Jul 1, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.
