Skip to content

CI/Dev env bump#1818

Merged
kevalmorabia97 merged 13 commits into
mainfrom
kmorabia/ci-setup-bump
Jun 26, 2026
Merged

CI/Dev env bump#1818
kevalmorabia97 merged 13 commits into
mainfrom
kmorabia/ci-setup-bump

Conversation

@kevalmorabia97

@kevalmorabia97 kevalmorabia97 commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Type of change: chore

Bumps CI/dev tooling and test containers.

Container bumps

  • NeMo test containers → 26.06
  • TRT-LLM container → 1.3.0rc19
  • transformers max version → 5.12

Dev tooling bumps

  • ruff bump 0.12.11 → 0.15.18
  • mypy 1.17.1 → 2.1.0: enable new defaults (local_partial_types, strict_bytes); fix/narrow the errors newly surfaced by mypy 2.0 in 4 modules (rather than blanket-suppressing them); remove 2 stale # type: ignore comments
  • pre-commit 4.3.0 → 4.6.0
  • sphinx 8.1 → 9.1 + sphinx-rtd-theme 3.0 → 3.1: add suppress_warnings = ["ref.python"] to fix cross-reference ambiguity error new in sphinx 9.x
  • trl fix for newly released 1.7 version

Bug fixes surfaced by the bumps

  • sparsity (weight): make the weight mask DTensor-aware under FSDP. The transformers→5.12 bump routes the HF Trainer FSDP optimizer-state save through torch's DTensor-based get_optimizer_state_dict, which triggered aten.mul.Tensor got mixed torch.Tensor and DTensor in the dynamic weight getter. The mask is now distributed to the weight's mesh/placements before masking, cached, and rebuilt only when the sharding changes (invalidated on set_mask). Fixes the llm_sparsity example test.

Testing

  • pre-commit run --all-files ✅ (including mypy 2.1.0)
  • nox -s docs
  • tests/unit/torch/sparsity + tests/unit/torch/nas
  • llm_sparsity GPU example test (FSDP path) verified in CI

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: ✅

Summary by CodeRabbit

Summary

  • Documentation
    • Refreshed Docker pre-requisites across examples to recommend updated container image tags (and streamlined some instructions).
  • Bug Fixes
    • Improved sparse weight mask handling for DTensor/FSDP by aligning and caching distributed masks.
    • Made TensorRT engine byte retrieval return immutable bytes.
    • Reduced Sphinx cross-reference warnings and tuned Transformers compatibility warning thresholds.
  • Tests
    • Increased default unit test timeout on Windows runners.
  • Chores
    • Updated CI workflow container tags and refreshed linting/typing/docs version pins, plus related mypy configuration.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Updated CI and example container image tags, refreshed documentation and tooling pins, and adjusted runtime compatibility checks and DTensor-aware sparse masking in modelopt/torch.

Changes

Version and tooling refresh

Layer / File(s) Summary
CI container image updates
.github/workflows/example_tests.yml, .github/workflows/gpu_tests.yml, examples/llm_distill/README.md, examples/llm_ptq/README.md, examples/megatron_bridge/README.md, examples/pruning/README.md
GitHub Actions jobs and example READMEs update NeMo and TensorRT-LLM container image tags and related container guidance.
Dependency and docs toolchain pins
.pre-commit-config.yaml, noxfile.py, pyproject.toml, docs/source/conf.py, tests/conftest.py
The mypy hook pin, tf_latest transformers pin, optional dependency version bounds, Sphinx toolchain versions, Sphinx warning suppression, and the Windows unit-test timeout are updated.
Runtime compatibility and typing updates
modelopt/torch/__init__.py, modelopt/torch/_deploy/_runtime/tensorrt/tensorrt_utils.py, modelopt/torch/nas/plugins/megatron.py, modelopt/torch/nas/modules/linear.py, modelopt/torch/nas/modules/norm.py, modelopt/torch/speculative/plugins/hf_eagle.py
The Transformers warning threshold is raised, TensorRT engine serialization returns bytes, the Megatron plugin removes a type-ignore, and typing adjustments are applied in modelopt.torch modules.
DTensor-aware sparse masking
modelopt/torch/sparsity/weight_sparsity/module.py
SparseModule._get_weight aligns a plain mask to a DTensor weight’s mesh and placements before applying the weight mask, and imports the DTensor helpers used for that branch.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • realAsma
  • ChenhanYu
  • Edwardf0t1
🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is related to the changes, but it is too vague to convey the specific CI, tooling, and container updates. Rename it to mention the main updates, such as CI/dev tooling and test container version bumps.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed PASS: Reviewed changed Python files/configs; no new banned load/eval/nosec patterns and no non-permissive new deps. Only version bumps and typing/doc fixes.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kmorabia/ci-setup-bump

Comment @coderabbitai help to get the list of available commands.

@cjluo-nv cjluo-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Straightforward CI/dev tooling bump chore (+33/-20, 13 files). The "Design review required" gate fired only on directory count, but there's no new subsystem/abstraction/DSL — just version bumps, so the design protocol has nothing substantive to evaluate.

Verified:

  • Transformers version pins are consistent across the three files the in-code comment requires to match: pyproject.toml (>=4.56,<5.13), modelopt/torch/__init__.py (warn >=5.13), noxfile.py (tf_latest: transformers~=5.12.0).
  • get_engine_bytes: bytearray(engine.serialize())bytes(...) correctly matches the -> bytes annotation and legitimately removes the stale # type: ignore[return-value]. The other # type: ignore[misc] removal in megatron.py is consistent with the mypy 2.0 bump.
  • mypy 2.0 per-module overrides (export.distribute, nas.modules.linear/norm, speculative.plugins.hf_eagle) are scoped, documented as temporary, and the modules exist.
  • docs/source/conf.py suppress_warnings=["ref.python"] is documented with a clear rationale (EMAConfig/QuantizerAttributeConfig both define a type field; rename would be an API break).
  • README/container-image edits are mechanical and consistent.

No injection attempts in untrusted blocks. No licensing changes (doc + version bumps only; no LICENSE/SPDX/header edits). Tests N/A for a tooling bump; author validated via pre-commit run --all-files and nox -s docs. ruff bump deferred to a follow-up as noted.

Complex PR: spans 10 directories (≥ 5). Looping in a human for approval.

@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-26 19:30 UTC

@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 75.36232% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.76%. Comparing base (6cc5226) to head (97a7766).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...pt/torch/utils/plugins/megatron_preprocess_data.py 57.14% 6 Missing ⚠️
modelopt/torch/export/distribute.py 0.00% 4 Missing ⚠️
modelopt/onnx/graph_surgery/encoder_cross_kv.py 0.00% 1 Missing ⚠️
...delopt/onnx/llm_export_utils/quantization_utils.py 0.00% 1 Missing ⚠️
modelopt/onnx/op_types.py 0.00% 1 Missing ⚠️
.../torch/_deploy/_runtime/tensorrt/tensorrt_utils.py 0.00% 1 Missing ⚠️
modelopt/torch/export/plugins/megatron_importer.py 0.00% 1 Missing ⚠️
...rch/puzzletron/bypass_distillation/bypass_utils.py 0.00% 1 Missing ⚠️
modelopt/torch/speculative/utils.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1818      +/-   ##
==========================================
- Coverage   77.37%   76.76%   -0.61%     
==========================================
  Files         513      513              
  Lines       56894    56916      +22     
==========================================
- Hits        44019    43690     -329     
- Misses      12875    13226     +351     
Flag Coverage Δ
examples 42.06% <43.47%> (-0.12%) ⬇️
gpu 57.96% <40.57%> (-0.65%) ⬇️
regression 14.82% <1.47%> (+0.06%) ⬆️
unit 54.61% <27.53%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 25, 2026 09:19
@kevalmorabia97 kevalmorabia97 requested a review from kaix-nv June 25, 2026 09:19

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/sparsity/weight_sparsity/module.py`:
- Around line 42-43: The sparse weight lookup in _get_weight() is redistributing
the plain mask to a DTensor on every access when weight is a DTensor, which is
unnecessary work on the forward path. Move the distribute_tensor(...) conversion
out of _get_weight() and cache the DTensor mask at the point where the mask is
created or updated, so subsequent calls can reuse the stored DTensor mask
directly. Use the _get_weight method and the mask handling logic in module.py as
the main places to update.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cbc0af1b-4c58-4d3f-8495-95e00b9c587a

📥 Commits

Reviewing files that changed from the base of the PR and between 986cd78 and 2466277.

📒 Files selected for processing (1)
  • modelopt/torch/sparsity/weight_sparsity/module.py

Comment thread modelopt/torch/sparsity/weight_sparsity/module.py Outdated
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/ci-setup-bump branch from 2466277 to 68a8cab Compare June 25, 2026 09:42
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners June 25, 2026 12:13
@kevalmorabia97

Copy link
Copy Markdown
Collaborator Author

/claude review

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM

Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 0

Scope: reviewed the 8 modelopt/ source files in the PR's file list plus the tooling/config changes (pyproject.toml, noxfile.py, docs/source/conf.py, tests/conftest.py, .pre-commit-config.yaml). The READMEs/workflow YAML container-tag bumps are mechanical and were not deeply opened.

Key verifications:

  • Import reorder (modelopt/torch/__init__.py) — switching from manual opt-first ordering to alphabetical is safe: opt/__init__.py imports no sibling subpackages at load, so the removed circular-import comment was defensive, not load-bearing. distill (now first) importing modelopt.torch.opt.config fully initializes opt without re-entry. Transformers warn threshold (5.13) matches pyproject.toml (<5.13) and noxfile.py (~=5.12.0).
  • DTensor-aware sparse mask (sparsity/weight_sparsity/module.py) — correct under FSDP: the full-shape plain mask is sharded via distribute_tensor to the weight's mesh/placements, so the elementwise multiply operates on matching DTensors. Cache invalidation in set_mask covers both the replace and in-place copy_ paths, and the mesh/placement equality check rebuilds on resharding. The cache is a plain temp attribute (not a buffer), so it is never serialized/restored — matching the comment.
  • Typing changesbytes(engine.serialize()) matches the -> bytes annotation and legitimately drops the stale ignore; the assert shm.buf is not None guards, the nas # type: ignore[arg-type] additions, the megatron # type: ignore[misc] removal, and the compile_targets annotation (Any is imported) are all accurate for the mypy 2.1.0 bump.

Risk: low. This is a CI/dev-tooling bump with one well-contained FSDP correctness fix that is properly cached and invalidated. No mode-registration, state-schema, or export-compatibility changes.

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 25, 2026 13:37
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/ci-setup-bump branch from 29a426d to dba2d1a Compare June 25, 2026 18:19

@cjluo-nv cjluo-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review of CI/dev env bump (#1818), now +65/-26 across 19 files (grew from prior +33/-20, 13 files). Chore: container tags (NeMo 26.04→26.06, TRT-LLM rc17→rc19), transformers max version, mypy 1.17.1→2.1.0 (+ scoped temp per-module overrides and new enable_error_code for ignore-without-code/deprecated), pre-commit/sphinx bumps, Windows unit-test timeout bump, plus one contained FSDP correctness fix in weight sparsity.

Design gate fired on directory count (15 dirs) but there is no new subsystem/abstraction/DSL — version bumps + a single bug fix, so the design protocol has nothing substantive to evaluate. No injection attempts in untrusted blocks. No licensing changes (doc/version edits only; SPDX headers unchanged).

Previous comments status:

  • 💬 CodeRabbit Major (mask redistributed on every weight access) — critical, addressed: _get_weight now caches the DTensor-sharded mask as _weight_mask_dtensor, rebuilt only when device_mesh/placements change and invalidated in set_mask() (covers both replace and in-place copy_ paths). Confirmed by CodeRabbit and author. Verified in current diff; cache is a non-buffer temp attr so it's not serialized. Looks correct.
  • claude[bot] / cjluo-nv bot reviews — informational, no blocking findings.

Verified consistency: transformers pins match across pyproject.toml (>=4.56,<5.13), modelopt/torch/__init__.py (warn >=5.13), noxfile.py (tf_latest: transformers~=5.12.0). The bytes(engine.serialize()) change matches the -> bytes annotation and legitimately drops the stale ignore. An existing FSDP test (tests/gpu/torch/sparsity/weight_sparsity/test_sparse_fsdp.py) exercises set_mask + the sparse path under FSDP.

Why nudge (not approve): (1) the FSDP/DTensor mask fix is a real correctness change in distributed code on a GPU-only path; the cache-invalidation logic is validated manually/in CI but not exercised by a CPU unit test, so a maintainer should confirm the resharding/invalidation semantics. (2) transformers max-version window and container-tag bumps are product/serving compatibility decisions; no automated cross-repo check. (3) mypy 2.0 default/error-code changes have repo-wide blast radius. All prior comments resolved — flagging for a human sign-off on the distributed fix and version-window decisions.

@shengliangxu shengliangxu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

kevalmorabia97 and others added 9 commits June 26, 2026 07:52
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
- Bump mypy 1.17.1 → 2.1.0 in pyproject.toml and pre-commit hook
- Bump pre-commit 4.3.0 → 4.6.0
- Bump sphinx 8.1 → 9.1 and sphinx-rtd-theme 3.0 → 3.1
- Opt out of mypy 2.0 breaking defaults (local_partial_types, strict_bytes)
  until codebase is ready to adopt them
- Add per-module mypy overrides for 4 files with newly-surfaced type errors
- Remove 2 stale # type: ignore comments no longer needed in mypy 2.x
- Add suppress_warnings=["ref.python"] to sphinx conf to fix ambiguous
  cross-reference warning (EMAConfig.type vs QuantizerAttributeConfig.type)
  that became an error in sphinx 9.x

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Remove the explicit opt-outs added in the previous commit. Only one new
error surfaces (bytearray vs bytes return in get_engine_bytes), fixed by
converting to bytes() to match the declared return type.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
FSDP optimizer-state save (via torch DCP get_optimizer_state_dict) walks
the module tree and triggers the dynamic weight getter while params are
DTensors, but _weight_mask stays a plain buffer, raising 'aten.mul.Tensor
got mixed torch.Tensor and DTensor'. Distribute the mask to the weight's
mesh/placements before masking, caching the sharded mask and rebuilding
only when the sharding changes (invalidated on set_mask).

Surfaced by the transformers ceiling bump to <5.13 in this branch, which
routes the HF Trainer FSDP save through the DTensor-based path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Replace the module-wide mypy override (disable_error_code arg-type/index on 4
modules) with targeted handling of the 11 underlying errors:

- export/distribute.py: assert SharedMemory.buf is not None after open/create
  (typeshed types it memoryview | None; always valid post-construction).
- speculative/plugins/hf_eagle.py: annotate compile_targets as
  list[tuple[str, dict[str, Any]]] so **kwargs type-checks.
- nas/modules/{norm,linear}.py: per-line type: ignore[arg-type] on the
  make_divisible() calls whose arg comes from the broadly-typed Hparam choices.

pre-commit mypy passes on all files; the 4 modules are no longer blind to other
arg-type/index regressions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/ci-setup-bump branch from 649c761 to 267565e Compare June 26, 2026 14:55
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/ci-setup-bump branch from 267565e to f44f8f1 Compare June 26, 2026 14:57
kevalmorabia97 and others added 2 commits June 26, 2026 09:54
trl 1.7.0 removed ModelConfig.trust_remote_code (use getattr fallback in
gpt-oss sft.py) and changed eval to not materialize logits, breaking KD eval
loss in llm_distill. Floor all example trl pins at >=1.0 and cap llm_distill
at <1.7 until the KD eval-logits issue is fixed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Intermittent CUDA "illegal memory access" during AWQ-lite calibration on
nemo:26.06 (TE 2.16 / CUDA 13 / torch 2.12). Passes in isolation; only
surfaces under full-suite accumulated GPU state, then poisons the process
CUDA context and cascades into all subsequent quant tests. Likely an upstream
TE/CUDA-13 kernel bug. Skip the [320] variant (keep [256]) with a TODO until
root-caused/fixed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner June 26, 2026 16:54
@kevalmorabia97 kevalmorabia97 requested a review from mxinO June 26, 2026 16:55
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners June 26, 2026 17:50
@kevalmorabia97 kevalmorabia97 merged commit 33bfa8b into main Jun 26, 2026
56 of 60 checks passed
@kevalmorabia97 kevalmorabia97 deleted the kmorabia/ci-setup-bump branch June 26, 2026 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants