Releases: NVIDIA-NeMo/Megatron-Bridge
NVIDIA Megatron-Bridge 0.5.0
Changelog Details
Model Collection Support
LLM / VLM
- Qwen3.5 text bridges (dense + MoE) (PR#3769, community @HowardZorn)
- DeepSeek V4 bridge and DeepSeek-V4-Flash pretraining recipes (PR#3562, PR#3893)
- Ernie 4.5 text-only MoE and VL bridges (PR#3263, community @bo-ke)
- GLM-5 / GLM-5.1 (MoE + MLA + DSA) bridge and provider (PR#2913, PR#3635)
- GLM-4.7 / GLM-4.7-Flash support (PR#2983)
- StepFun Step-3.5-Flash (PR#3525) and Step-3.7-Flash (PR#4043)
- MiMo-V2-Flash support (PR#3163, community @beccohov)
- Gemma 4 (26B-A4B and 31B dense, LLM + VLM), MoE and Dense models (PR#3148, PR#3885, community @pavelgein)
- Falcon H1 hybrid Transformer + Mamba support (PR#1462, community @dhiaEddineRhaiem)
- Ling MoE V2 support (PR#2028, community @ccclyu)
Multimodal
- Nemotron-3 Nano Omni support, including model, recipe, and examples (PR#3760)
- Qwen3-Omni-MoE training support (PR#3317, community @hbhflw2000)
- Qwen3-ASR support (PR#2836, PR#3273)
- Nemotron Diffusion (Nemotron-Labs-Diffusion) model support (PR#3105)
Training & Functionality
- MegatronMIMO (Multimodel-In-Multimodel-Out) is a new feature for training multimodal models with heterogeneous parallelism (e.g. different model parallelism for the image encoder and the text decoder). NeMo 26.06 supports non-colocated training, in which the encoder and decoder are placed on different ranks (PR#2004, PR#2007, PR#2869, PR#2870), and MegatronMIMO model conversion (PR#3905), with a focus on dense models. Colocated training (encoder and decoder on the same rank) and MoE models will be supported in the next release.
- Energon v7 support, including metadata and stateless cookers (PR#4090)
- Energon updates for video and multi-image (PR#3691)
- Eval-time context parallelism via decentralized process-group rebinding (PR#3755)
- Deterministic training support for performance recipes (PR#3543)
- Evaluator backend integration (SFT + inference + evaluation, demonstrated on GPT-OSS) (PR#2990)
- LoRA support for not sharing expert adapters (PR#3408)
- Configurable async checkpoint strategy (PR#3153); MSC support for FSDP DTensors (PR#3300)
- Fast dataloading configs and documentation (PR#3351)
Low-Precision Bridge & Checkpoint Conversion
- Quantize-then-gather weight export (FP8 / MXFP4) for faster RL trainer→rollout weight sync (PR#2737, community @hy2826)
- DeepSeek V4 quantization-scale emission during HF export (PR#3969)
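As a rough illustration of why quantizing before the gather can speed up weight sync, the sketch below simulates ranks as plain Python lists and compares the bytes that would cross the wire. It is not the Bridge implementation: the function names are hypothetical, and int8 absmax quantization with one scale per shard stands in for the FP8 / MXFP4 formats named above.

```python
from typing import List, Tuple


def quantize_int8(shard: List[float]) -> Tuple[List[int], float]:
    """Absmax-quantize one shard to int8 codes plus a single float scale."""
    absmax = max((abs(x) for x in shard), default=0.0)
    scale = absmax / 127.0 or 1.0  # fall back to 1.0 for empty or all-zero shards
    return [round(x / scale) for x in shard], scale


def quantize_then_gather(shards: List[List[float]]) -> Tuple[List[float], int]:
    """Each simulated rank quantizes locally; only codes and scales are exchanged."""
    payloads = [quantize_int8(s) for s in shards]
    # Wire cost: 1 byte per int8 code plus one 4-byte scale per shard.
    wire_bytes = sum(len(codes) + 4 for codes, _ in payloads)
    # The receiver dequantizes after the (simulated) gather.
    gathered = [code * scale for codes, scale in payloads for code in codes]
    return gathered, wire_bytes


def gather_then_quantize_bytes(shards: List[List[float]]) -> int:
    """Baseline wire cost: gather 2-byte (BF16-sized) weights, quantize afterwards."""
    return sum(len(s) * 2 for s in shards)


shards = [[(i - 32) / 16 for i in range(64)], [(i - 10) / 8 for i in range(64)]]
values, low_precision_bytes = quantize_then_gather(shards)
print(low_precision_bytes, gather_then_quantize_bytes(shards))
```

In this toy setup the quantized payload is roughly half the size of the 2-byte baseline, and the gap widens with lower-bit formats.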
Performance
- `fp4_param_gather` enabled in `MixedPrecisionConfig` (PR#3364)
- Qwen3-Next 80B GB200/GB300 parallel mappings (PR#3168)
- CUDA graph support for Qwen3-VL LLM and vision-encoder submodules (PR#2334); full-iteration CUDA graph for GPT-OSS recipes (PR#4140)
Megatron-LM ↔ Megatron-Bridge Unification
- Megatron Inference integrated into Bridge: MCore Inference Engine examples, model wrappers, pure-LLM inference CLI, and `inference_optimized` path (PR#3897)
- Tokenizer unification: MCore tokenizer config promoted as the shared surface (Bridge side: PR#3451; MCore side: MCore PR#4406)
- Training-loop upstreaming (in progress) — Bridge's config + builder patterns moving into Megatron-LM: ConfigContainer (MCore PR#4227), serialization base (MCore PR#4309), Mamba config + builder (MCore PR#4550), GPT config + builder (MCore PR#4741), supporting utils (MCore PR#4872)
Developer Experience & Compatibility
- RL API refactoring — model creation, config override, training loop, export, and LoRA for RL (PR#3813)
- `AGENTS.md` and AI-coding-agent skills updated (recipe-recommender, NeMo-RL & verl E2E testing) (PR#3256, PR#3277, PR#3831)
Examples & Tutorials
- MegatronMIMO Qwen3.5-VL non-colocated SFT tutorial + LLaVA tutorial (PR#4239)
- Qwen3-0.6B 128K long-context SFT recipe with YaRN RoPE scaling (PR#3316)
- HuggingFace ↔ Megatron-FSDP weight conversion (PR#3512); online HF load/save for Megatron-FSDP (PR#1910)
ModelOpt
- LoRA × ModelOpt × DeepSeek architecture support (PR#3612)
Community Contributions
A big thank you to our community contributors for their valuable support!
Known issues:
- Step-3.7-Flash forward-pass outputs have not been fully verified.
- Some examples/ scripts have known minor issues: MiniMax M2 (conversion/export saving), GLM-4.5V (exported tokenizer artifacts), FLUX (tokenizer setup), and WAN (inference setup/dependencies).
- Some MoE training configurations that combine tensor parallelism and expert parallelism may run slower in 26.06 after upgrading from NCCL 2.29 to NCCL 2.30.
- Root cause: NCCL 2.30 fixed a CPU-affinity leak and now correctly restores the launcher's original CPU affinity after communicator initialization. Earlier NCCL versions could inadvertently leave application threads bound to CPUs local to each GPU. Training launchers without explicit CPU and memory binding may therefore expose cross-NUMA scheduling overhead after the upgrade.
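- Workaround: Bind each training rank and its host-memory allocations to the NUMA node local to its assigned GPU: `numactl --cpunodebind=<NUMA_NODE> --membind=<NUMA_NODE> <training command>`. The GPU-local NUMA node can be determined programmatically from the GPU's PCI bus ID. For Slurm or torchrun launchers, the training command can be wrapped as follows:

    # Local rank: Slurm sets SLURM_LOCALID; torchrun sets LOCAL_RANK.
    LID=${SLURM_LOCALID:-$LOCAL_RANK}
    # nvidia-smi usually prints an 8-digit PCI domain while sysfs uses 4 digits, so drop the extra leading zeros if present.
    PCI_BUS=$(nvidia-smi -i "$LID" --query-gpu=pci.bus_id --format=csv,noheader 2>/dev/null | head -1 | tr '[:upper:]' '[:lower:]' | sed 's/^0000\([0-9a-f]\{4\}:\)/\1/')
    NUMA_NODE=$(cat "/sys/bus/pci/devices/$PCI_BUS/numa_node" 2>/dev/null || echo -1)
    echo "[numactl_local] rank=$LID gpu_pci=$PCI_BUS numa=$NUMA_NODE"
    # A negative or missing node means the NUMA node is unknown; run unbound instead of passing numactl an invalid node.
    [ "$NUMA_NODE" -ge 0 ] 2>/dev/null || exec "$@"
    exec numactl --cpunodebind="$NUMA_NODE" --membind="$NUMA_NODE" "$@"

NVIDIA Megatron-Bridge 0.4.2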
Highlights
- Expanded performance configs for DeepSeek V3, Qwen, GPT-OSS, and WAN
- Support for the `fp4_param_gather` mixed-precision config
- Enhanced security in dataset and checkpoint deserialization and in URL loading, with safer `trust_remote_code` handling
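Several of the deserialization fixes in this release replace plain `pickle.load` / `pickle.loads` calls with a restricted unpickler. The sketch below shows the general pattern using only the Python standard library; the allowlist contents and the `restricted_loads` helper name are assumptions for illustration, not the code that shipped in Megatron-Bridge.

```python
import io
import pickle

# Hypothetical allowlist: the only (module, name) globals this sketch will resolve.
ALLOWED_GLOBALS = {
    ("builtins", "set"),
    ("builtins", "frozenset"),
    ("collections", "OrderedDict"),
}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import anything outside ALLOWED_GLOBALS."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def restricted_loads(data: bytes):
    """Analogue of pickle.loads that applies the allowlist above."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers and scalars never trigger find_class, so ordinary metadata still loads.
print(restricted_loads(pickle.dumps({"offsets": [0, 128, 512]})))
```

A payload that references any other global, such as `os.system`, fails with `pickle.UnpicklingError` before anything is imported or called.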
Performance
- NVFP4 with 4-bit parameter AllGather in DP communications (PR#3364, PR#4005)
- DSV3 B300 recipe tuning (PR#3549)
- DSV3 B200 recipe tuning (PR#3368)
- Qwen3 235B A22B B300 recipe tuning (PR#3490)
- NT3 super B300 recipe tuning (PR#3579)
- GPT-OSS B200 regression fix (PR#3614)
Software Component
Known issues
- There is a known issue with Evaluator when installing nvidia-vlmeval inside /opt/NeMo-FW. Please use the /opt/Megatron-Bridge directory to install the package:
cd /opt/Megatron-Bridge
uv pip install nvidia-vlmeval
Changelog Details
- beep boop 🤖: Bumping megatron.bridge to v0.4.1 by @nemo-automation-bot[bot] :: PR: #3363
- cp: `[perf] fix: guard cuda_graph_scope validation against None (3249)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3262
- cp: `fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283)` into `r0.4.0` by @yaoyu-33 :: PR: #3305
- cp: `Enable nemo-ci tests (short runs - perf and non-perf) for Wan + Updating recipes names (3179)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3324
- cp: `Perf script utility to lock gpu frequency. (2977)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3326
- cp: `fix(gemma3-vl): force right-padding in VLM collate to prevent token loss (3331)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3332
- cp: `fix(perf): read baseline values from golden values when using new format (3334)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3338
- [docs] chore: bump versions1.json to 0.4.0 (latest) by @ko3n1g :: PR: #3376
- b200 DSv3 better cfg (#3368), mxfp8 to fp8_cs for h100 gpt-oss #3378 by @malay-nagda :: PR: #3420
- 2604 perf summary (#3377) by @malay-nagda :: PR: #3405
- cp: `docs(releases): add 26.04 software component versions (3421)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3430
- cp: `b200 DSv3 better cfg (3368)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3401
- cp: `[training] fix: report memory on 2nd iteration to better reflect actual peak (3169)` into `r0.4.0` by @dingqingy-nv :: PR: #3367
- cp: `Update Qwen3-VL pretrain perf configs for 30B and 235B (3327)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3342
- cp: `docs: Add container version to docs version picker (3434)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3435
- cp: [docs] Add Megatron Bridge 0.4.0 release notes (#3419) by @chtruong814 :: PR: #3439
- cp: fix(test): clone mmap-backed tensors before overwriting safetensors file (#3335) by @yaoyu-33 :: PR: #3441
- cp: `[test] refactor: move diffusion tests to test_groups directory (3275)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3442
- remove archival data from main page by @malay-nagda :: PR: #3448
- cp: `fix: set 644 permissions on COPY'd files to match cloned repos (3431)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3450
- cp: `[perf] fix: use direct assignment for NCCL env vars when nccl_ub enabled (3350)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3453
- cp: `[training] feat: enable fp4_param_gather in MixedPrecisionConfig (3364)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3454
- cp: `fix(docker): replace rdma-core source build with system package install (3429)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3457
- cp: `[training] fix: record CUDA memory history before snapshot so dumps are non-empty (#3487)` into `r0.4.0` by @dingqingy-nv :: PR: #3508
- cp: `[vulnops][misc] fix: Add allowlist validation for _target_ instantiation (3142)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3540
- cp: `[vulnops][data] fix: Replace unsafe pickle.loads with restricted unpickler in Qwen VL pipeline (3139)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3541
- cp: `[vulnops][ckpt] fix: Use weights_only=True in ModelOpt checkpoint loading (3138)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3542
- cp: `[vulnops][ckpt] fix: Use weights_only=True in TrainState checkpoint loading (3506)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3557
- cp: `[vulnops][data] fix: Replace unsafe pickle.load with restricted unpickler for index metadata (3140)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3558
- cp: `[vulnops] fix: _contains_code_references allowlist bypass leads to RCE (3379)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3559
- fix: Add security warning for trust_remote_code and remove hardcoded True by @chtruong814 :: PR: #3539
- cp: `Cleanup TE cuda graphs with the right api (3459)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3476
- cp: `Update DeepSeek-V3 configs for B300 (3549)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3565
- cp: `log repo status manual (3570)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3572
- cp: `ci: post merge comment with SHA after successful CI run (3567)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3573
- cp: `[perf] update: switch GPT-OSS GB200 V2 dispatcher default to alltoall (3561)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3577
- cp: `no fp4 param gather (3578)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3580
- cp: `fix(evaluate): skip non-dict golden value entries such as job_id (3581)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3582
- cp: `[vulnops][data] fix: Validate URLs in VLM video loader to prevent SSRF (3482)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3588
- fix(docker): suppress lightning from uv resolution in fw_pyproject by @ko3n1g :: PR: #3602
- cp: `[vulnops][data] fix: Remove unnecessary allow_pickle=True and add security warnings (3141)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3615
- cp: `[vulnops][data] fix: Replace allow_pickle=True with restricted unpickler in packed dataset loading (3616)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3629
- cp: `add VP for LoRA Lm3 70B (3547)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3596
- cp: `num_layers_fix- qwen vl 235b_a22b on B200 (3589)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3603
- cp: `fix(docker): resolve lightning not found on PyPI by providing local stub (3604)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3606
- cp: `70b_lora_gb200_bf16_fix (3623)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3627
- cp: `[vulnops] fix: Add SSRF protection to image-loading utilities (3630)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3632
- chore(beep boop 🤖): Bump `uv.lock` (r0.4.0, mcore-core_r0.17.0) (2026-04-30) by @svcnvidia-nemo-ci :: PR: #3591
- cp: `[vulnops] fix: Add SSRF protection to audio URL loading (3633)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3636
- cp: `fix(perf): keep PCT binding for deepseek_v3 large_scale on b300 (3656)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3657
- fix: apply vllm PR 36192 patch and bump pillow to 12.20 by @ko3n1g :: PR: #3671
- cp: `Add previously removed NemotronHBridge SequentialMLP mappings (3628)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3701
- Use HybridEP flex dispatcher for Qwen3 235B B300 perf configs (#3490) by @rhmukundan :: PR: #3675
- [build] chore: bump package version to 0.4.2 by @ko3n1g :: PR: #3721
- [model, ckpt, docs] fix: support HF→Megatron conversion under decentralized PGs (r0.4.0) by @cuichenx :: PR: #3674
- cp: `Fix Gemma3 example folder (3724)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3728
- cp: `Reorganize ModelOpt docs (3715)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3751
- [model, ckpt] fix: align GPT-OSS BF16 down_proj orientation on import (r0.4.0) by @cuichenx :: PR: #3753
- perf(qwen3-next): set expandable_segments on GB300 BF16/FP8_MX to fix OOM by @ko3n1g :: PR: #3767
- cp: `llama31 405b gb200 nvfp4 no pg overlap (3713)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3773
- cp: `[perf] update: switch GPT-OSS B200 V2 dispatcher default to alltoall (3614)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3682
- nt3 super nvfp4; lm3.1 405B nvfp4; lm3 70B mxfp8- expandable_segments by @malay-nagda :: PR: #3780
- cp: `[config] Update micro_batch_size to 2 for gemma3 recipe (3815)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3828
- chore: Bump TE to latest 2.14 and MCore to latest 0.17.0 by @chtruong814 :: PR: #3806
- qwen3 next env var fix by @malay-nagda :: PR: #3845
- chore: Bump and remove packages to address CVEs (#3841) by @chtruong814 :: PR: #3855
- Bump MCore to 2edffa by @chtruong814 :: PR: #3857
- cp: `chore: Bump deps to address CVEs (3919)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3925
- cp: `2604_patch_perf_summary (3818)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3861
- cp: `26.04.01_perf_summary (3997)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #3998
- cp: `docs: note 26.04 drops PyAV by default and document runtime install (4020)` into `r0.4.0` by @svcnvidia-nemo-ci :: PR: #4021
- cp: `[perf] fix: guard cuda_graph_scope validation against None (3249)` into `r0.4.0` (#3262) by @svcnvidia-nemo-ci
- cp: `fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283)` into `r0.4.0` (#3305) by @yaoyu-33
- cp: `Enable nemo-ci tests (short runs - perf and non-perf) for Wan + Updating recipes names (3179)` into `r0.4.0` (#3324) by @svcnvidia-nemo-ci
- cp: `Perf script utility to lock gpu frequency. (2977)` into `r0.4.0` (#3326) by @svcnvidia-nemo-ci
- cp: `fix(gemma3-vl): force right-padding in VLM collate t...
26.04-alpha.rc2
[MXFP8 param gather]Update param buffer before copy to model weights …
NVIDIA Megatron-Bridge 0.4.1
- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.
26.04-alpha.rc1
Merge branch 'PR2411' into 26.04-alpha
NVIDIA Megatron-Bridge 0.4.0
Highlights
Model Collection Support
- MiniMax M2 / M2.5 support (PR#2602)
- Kimi 2.5 support, including GB300 MXFP8 recipe and HF config updates (PR#2743)
- Nemotron 3 Super model support (PR#2912)
- Sarvam support (PR#1814)
- Qwen 3.5 VL Bridge with recipes and LoRA bridge / merge support (PR#2530, PR#2654, PR#2736)
- Qwen 2.5 Omni support (PR#2634)
- Qwen2-Audio support (PR#2324)
- Xiaomi MiMo dense MTP model bridge support (PR#2387, by HollowMan6)
Diffusion Collection
- Diffusion model support for DFM-to-Bridge migration (PR#2534, PR#2645)
- FLUX and WAN diffusion submodule improvements (PR#2822, PR#2849)
Training & Functionality
- Parquet support for sequence-packing preprocessing, improving handling of larger datasets (PR#2395)
- Energon integration for sequence packing with WebDataset workflows (PR#2440)
- Default packed sequences across finetune recipes (PR#2284)
- More modern finetuning datasets, including OpenMathInstruct V2 and GSM8K (PR#2264)
- Unified dataset configuration in `run_recipe.py` (PR#2826)
- NCCL flight recorder configuration support (PR#2891)
- Comet ML experiment tracking integration (PR#2910)
- Refactored SFT and PEFT recipes for VLM workflows (PR#2614)
- Added the `on_checkpoint_save` callback event for training workflows (PR#2905)
- Added MoE LoRA rank normalization for expert layers (PR#3006)
- Direct export of block-wise FP8 weights and scaling factors (PR#1994)
- Accelerated first-fit packing with a segment tree for much faster packing on large datasets (PR#2953)
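To make the last item above concrete, here is a minimal sketch of first-fit packing backed by a max segment tree. It illustrates the general technique only and is not the code from PR#2953; the function name and the choice to return sequence indices per bin are assumptions. The tree stores the largest free space in each subtree, so the leftmost bin that fits a sequence is found in O(log n) rather than by scanning every open bin.

```python
from typing import List


def first_fit_pack(lengths: List[int], capacity: int) -> List[List[int]]:
    """Assign each sequence index to the lowest-numbered bin with enough free space."""
    n = len(lengths)
    if n == 0:
        return []
    size = 1
    while size < n:
        size *= 2
    # tree[i] is the maximum free space under node i; at most n bins are ever needed,
    # so n leaves start at full capacity and the padding leaves stay at 0.
    tree = [0] * (2 * size)
    for i in range(n):
        tree[size + i] = capacity
    for i in range(size - 1, 0, -1):
        tree[i] = max(tree[2 * i], tree[2 * i + 1])

    bins: List[List[int]] = [[] for _ in range(n)]
    for idx, length in enumerate(lengths):
        if length > capacity:
            raise ValueError(f"sequence {idx} has length {length} > capacity {capacity}")
        # Walk down to the leftmost leaf whose free space can hold this sequence.
        node = 1
        while node < size:
            node = 2 * node if tree[2 * node] >= length else 2 * node + 1
        bins[node - size].append(idx)
        # Shrink that bin's free space and refresh the maxima on the path to the root.
        tree[node] -= length
        node //= 2
        while node >= 1:
            tree[node] = max(tree[2 * node], tree[2 * node + 1])
            node //= 2
    return [b for b in bins if b]


print(first_fit_pack([6, 5, 4, 3, 2], capacity=10))  # [[0, 2], [1, 3, 4]]
```

Treating unopened bins as leaves with full capacity means "open a new bin" needs no special case: the next empty bin is simply the leftmost leaf that still fits.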
Model Optimization
- Pruning support and documentation (PR#2244)
- Post-training quantization support for Nano, Super, and Ultra model families (PR#2303)
- Distillation quantization support in NeMo 2 (PR#2591)
Performance
- Nemotron 3 Super perf config, including GB200 improvements and BF16 / NVFP4 functional support via module recompute (PR#3208)
Developer Experience & Compatibility
- ModelConfig and ModelBuilder refactor integrated into the training loop (PR#2798, PR#2671)
- Dev branch support and documentation updates (PR#2497)
- Python 3.12 migration announcement (PR#2773)
- Transformers 5.0 through 5.3 compatibility (PR#2068, PR#2781)
- PEFT Bridge offline mode support (PR#2574)
- LoRA merge on CPU (PR#2194)
- Self-contained Megatron-to-HF export with auto-config synthesis (PR#2778)
- Scripts and documentation for Megatron-LM and Megatron Bridge correlation
Examples & Tutorials
- Resiliency examples (PR#2115)
- Qwen3 VL sequence packing examples (PR#2380)
- Distillation example cleanup (PR#2865, PR#2860)
Community Contributions
- @HollowMan6 (Aalto University): Xiaomi MiMo dense MTP bridge support, Qwen 3.5 VL LoRA bridge and merge, and additional export / PEFT fixes (PR#2387, PR#2736, PR#2384, PR#2799)
- @shaltielshmid: packed-sequence improvements for large datasets and safer model loading defaults (PR#2395, PR#2766)
- @jaeminh: accelerated first-fit packing with a segment tree (PR#2953)
- @pavelgein: added the `on_checkpoint_save` callback event (PR#2905)
- @ShiftyBlock (UC Berkeley): added auto-config for self-contained Megatron-to-HF export (PR#2778)
- @erictang000 (Anyscale): added LoRA rank normalization for MoE expert layers (PR#3006)
- @eternally-z: added direct export support for block-wise FP8 weights and scaling factors (PR#1994)
- @Hayak3: fixed the unsupported normalization argument for Qwen3-VL (PR#1970)
- @mohit-sarvam (Sarvam AI): added Sarvam MoE support (PR#1814)
A big thank you to our community contributors for their valuable support!
Changelog Details
- docs: Update callback code snippets to include all imports needed for example by @ananthsub :: PR: #2283
- M4 leftover for QWen3-VL with MCore vision encoder by @shifangx :: PR: #2370
- Update Qwen3 235B B300 Configs to match Qwen3 B200 Configs by @rhmukundan :: PR: #2669
- [bridge] Fix off-by-one in sliding window size for Gemma2, Gemma3, Mistral, and GPT-OSS by @cuichenx :: PR: #2656
- fix: Write intermediate results to tmp by @ko3n1g :: PR: #2726
- Perf recipe dataloader num_workers interface fix by @dingqingy-nv :: PR: #2710
- Suppress noisy _extra_state warnings during checkpoint loading by @cuichenx :: PR: #2689
- [model, recipe] Add Qwen 3.5 recipes by @cuichenx :: PR: #2654
- [ci] chore: add nightly dev commit bump workflow by @ko3n1g :: PR: #2729
- ci(fix): Unique naming for dev branch by @ko3n1g :: PR: #2747
- [ci] Refactor Gemma3-VL launch script to run finetune and packed tests separately by @cuichenx :: PR: #2730
- add qwen2_5_omni by @yuekaizhang :: PR: #2634
- build: Bump TE 2.13 by @ko3n1g :: PR: #2753
- [docs, ci] chore: add governance issue forms and triage guide by @yaoyu-33 :: PR: #2716
- [test] fix: temporarily disable qwen2.5 omni unit tests by @yaoyu-33 :: PR: #2759
- add nemotron3 super docs by @liding-nv :: PR: #2757
- ci: Fix stopiteration for Mbridge by @ko3n1g :: PR: #2760
- GPT-OSS Blackwell MXFP8 recipes by @weijiac0619 :: PR: #2633
- feat(mimo): phase 2 - model provider, DDP wrapping, process groups by @aroshanghias-nvd :: PR: #2004
- [build] feat: add OSS NeMo FW dockerfiles by @thomasdhc :: PR: #2722
- Lm3 70B GB200 FP8_CS SFT cfg update by @malay-nagda :: PR: #2748
- [docs] chore: use uv run in test file docstring run instructions by @cuichenx :: PR: #2728
- build: Bump NVRX by @ko3n1g :: PR: #2775
- NVFP4 memory spike fix compared to M-LM by @sanandaraj5597 :: PR: #2764
- [doc] feat: Document adapter merge verification in stream_adapter_weights example by @yaoyu-33 :: PR: #2042
- [doc] chore: Add needs-review to PR state labels guidance by @yaoyu-33 :: PR: #2758
- [ckpt] fix: broaden exception handling in save_artifacts dynamic module loading by @yaoyu-33 :: PR: #2765
- [test] fix: use toy configs in qwen2.5 omni unit tests by @yaoyu-33 :: PR: #2761
- [model] Refactor Qwen3-VL and Ministral3 fine-tuning scripts by @kamran-nvidia :: PR: #2735
- docs - Update user manual with new MoE features and Megatron FSDP by @onel :: PR: #2529
- remove encoder_and_decoder usage by @dimapihtar :: PR: #2512
- Fix attention_mask mismatch in compare.py by @mohsinm-dev :: PR: #2476
- [model, test] fix: guard hybrid layer count across MCore branches by @yaoyu-33 :: PR: #2776
- [data] fix: guard eval_interval division to prevent ZeroDivisionError by @yaoyu-33 :: PR: #2732
- [sync][training] fix: log loss values of exactly 0.0 in training_log() by @mehraakash :: PR: #2740
- [model] feat: support Qwen 3.5 MTP c...
NVIDIA Megatron-Bridge 0.3.1
Changelog Details
Performance & Model Configs
- CP SFT performance improvements (#2527)
- Nemotron 3 Nano perf config updates (#2560, #2681)
- Onboard LLaMA3 70B LoRA to B300 and B200 chips (#2588)
- Update Qwen3 235B B300 configs to match B200 configs (#2706, #2720)
- Update DeepSeek-V3 B300 config (#2723)
- DeepSeek-V3: set `no_non_det_algo` for deterministic training (#2673)
- Add MoE Sequential MLP mappings in HF Bridges (#2589)
Bug Fixes
- [training] Cap `lr_warmup_steps` to be strictly less than `lr_decay_steps` (#2858)
- [training] Fix `DistillationProvider.to_cfg_dict` to save missing keys in run_config (#2594)
- [training] Fix `StopIteration` error in MBridge (#2762)
- [checkpoint] Fix local checkpoint integration (#2709)
- [checkpoint] Log warning when HuggingFace Hub download fails silently (#2493)
- [checkpoint] Low-memory save: use `AutoBridge` directly in `distill_llama32_3b-1b` to load HF weights (#2860)
- [inference] Use `config.hidden_size` directly for Qwen3VL inference wrapper (#2855)
- [misc] Improve `compare.py` robustness for multi-GPU and vocab-padded models (#2647)
- [misc] Fix BOS token mismatch in `compare_text_generation` (#2889)
- [misc] Guard `eod_id` access in `compare_text_generation` for HF tokenizers (#2853)
- [misc] Guard missing kubernetes deps (#2871)
- [example] Fix example scripts and recipe names in release branch (#2862, #2863)
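The `lr_warmup_steps` fix listed above amounts to a clamp. A minimal sketch of that rule, assuming a standalone helper (the function name is hypothetical and this is not the code from #2858):

```python
def cap_warmup_steps(lr_warmup_steps: int, lr_decay_steps: int) -> int:
    """Return a warmup length that is strictly less than the decay length."""
    if lr_decay_steps < 1:
        raise ValueError("lr_decay_steps must be at least 1")
    # The largest admissible warmup is lr_decay_steps - 1; never go below zero.
    return max(0, min(lr_warmup_steps, lr_decay_steps - 1))


print(cap_warmup_steps(100, 50))  # 49
```

A warmup that already satisfies the constraint passes through unchanged.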
Documentation
- Add ModelOpt pruning docs (#2629)
NVIDIA Megatron-Bridge 0.3.0
Highlights
- Model Collection Support
- Performance
  - NVFP4 support for Llama3 models.
  - HybridEP support for NVL8 systems (PR#494)
  - MLA performance improvement with cuDNN layernorm and cuDNN 9.18
  - LN+MXFP8 quantization fusion with TE.sequence and cuDNN backend
  - Supports FSDP for MoE models with MXFP8 (PR#2135, PR#2239)
  - Support Muon Optimizer (PR#683)
  - NVFP4 Llama Playbook (PR#1409)
- Training & Functionality
  - LoRA Bridge (initial): RL LoRA support for VeRL / nemo-rl (PR#1766)
  - Multi-token prediction (MTP): Qwen3 dense examples (PR#2138)
  - Decentralized parallel group (M4) end-to-end support and examples (PR#2011, examples)
  - Context Parallelism (CP) with sequence packing in LLMs (PR#1867)
  - Context Parallelism (CP) with sequence packing in VLMs (PR#1997)
  - Callbacks integration (PR#2063)
  - Low-memory save for model importing from HF (fixes DeepSeek V3 and Kimi-K2 import) (PR#1949)
- Community Contributions
  - @HollowMan6: MoE router weight adapter wrapper (PR#1834), temporary disable adapter support (PR#1811), flexible LoRA target_modules (PR#1799), separate layernorm mappings (PR#1808), shared_experts MoE fix (PR#1800), LoRA split QKV with GQA fix (PR#1818), Moonlight/Kimi rotary_emb export fix (PR#1838), configurable use_arbitrary_attention_mask (PR#1807)
  - @Hayak3: Fix Qwen3-VL unsupported normalization arg (PR#1970)
  - @shaltielshmid: Disable FP8 during CPU initialization for export (PR#1815)
  - @therealnaveenkamal: MLFlow integration (PR#2112)
  - @kannankumar: Fill-in-the-Middle (FIM) dataset support (PR#2066)
  - A big thank you to our community contributors for their valuable support!
Changelog Details
- concise naming | weak scaling | save cfg to file by @malay-nagda :: PR: #1246
- cg_scope valid list and default none by @malay-nagda :: PR: #1264
- chore: Merge fp8 args by @ko3n1g :: PR: #1279
- cg and nan grad norm fix by @malay-nagda :: PR: #1309
- feat: Support PEFT weight mapping and merge LoRA adapters when export to hf by @HollowMan6 :: PR: #1310
- Add Nemotron nano v2 vl by @cuichenx :: PR: #1136
- Replay "Ko3n1g/ci/cleanup recipe evaluator (#1349)" by @ko3n1g :: PR: #1377
- Gemma3 VL LoRA Recipe + Documentations by @suiyoubi :: PR: #1388
- Add GLM4.5 FT Recipe by @suiyoubi :: PR: #1382
- Adding FLA as dependency for Qwen3-Next by @adityavavreNVDA :: PR: #1359
- fix: default to `nccl` comm overlap bootstrap backend by @ananthsub :: PR: #1395
- Add Qwen2/2.5 FT recipes by @ananthsub :: PR: #1385
- [PEFT/LoRA] fix: using ETP instead of TP for expert layers by @HollowMan6 :: PR: #1380
- Llama3 PEFT- 8B, 70B by @malay-nagda :: PR: #1381
- Add option for LoRA with Transformer Engine op fuser by @michal2409 :: PR: #1324
- [OMNIML-2937] Support Megatron Bridge quantized checkpoint export to HF unified checkpoint by @yueshen2016 :: PR: #1302
- HybridEP support by @erhoo82 :: PR: #1367
- expose option to dump config to file during end to end tests by @ananthsub :: PR: #1400
- [OMNIML-2935] PTQ support of MOE model (Qwen-3) on Megatron-Bridge by @yueshen2016 :: PR: #1405
- Revert "feat: Dependabot automerge if successful (#1051)" by @pablo-garay :: PR: #1428
- Update perf docs by @gautham-kollu :: PR: #1426
- Add Qwen3VL support (dense and moe) by @yashaswikarnati :: PR: #1174
- Fix llama3-8b NVFP4 recipe by @adityavavreNVDA :: PR: #1347
- fix GPT-OSS perf scripts by @erhoo82 :: PR: #1438
- Add functional test for finetuning with sequence packing by @ananthsub :: PR: #861
- feat: Pass custom srun args into Run by @ko3n1g :: PR: #1440
- Fix typo in dataclass from `callable` => `typing.Callable` in `nemotron_h_provider.py` by @shaltielshmid :: PR: #1442
- pass the support of deepep for B200 and B300 GPUs by @erhoo82 :: PR: #1436
- cuda graph fine grained scope | hybridEP | a2a overlap by @malay-nagda :: PR: #1348
- nvfp4 for dense models by @sanandaraj5597 :: PR: #1453
- Added Qwen 3 next perf scripts by @sanandaraj5597 :: PR: #1451
- reset gradient_accumulation_fusion with megatron fsdp by @ananthsub :: PR: #1386
- guard trust_remote_code by @dimapihtar :: PR: #1291
- fix lint checks on main by @ananthsub :: PR: #1463
- DSv3- gb200 base cfg fix | b200 no a2a overlap by @malay-nagda :: PR: #1476
- sequence_length -> seq_length by @dimapihtar :: PR: #1023
- feat: Add whitelist support for mismatched params in load_hf_weights by @yaoyu-33 :: PR: #1447
- [docs] Update readme with supported models/recipes by @ananthsub :: PR: #1455
- Add Gemma2 recipes by @ananthsub :: PR: #1383
- [docs] Add release section for changelog and software component versions by @ananthsub :: PR: #1490
- [docs] Add 0.2.0 version picker by @ananthsub :: PR: #1488
- Reduced precision (BF16, FP8, MXFP8, NVFP4) training tutorial using Megatron-Bridge by @sergiopperez :: PR: #1409
- Update conversion compare script and add accelerate dependency by @yaoyu-33 :: PR: #1344
- [main] Fix functional conftest to handle optional `nvdlfw-inspect` dependency by @ananthsub :: PR: #1496
- [docs] Update supported model docs by @ananthsub :: PR: #1503
- fix: Escape user inputs in data tutorials by @ananthsub :: PR: #1465
- Bridge instantiate_utils: drop unexpected config keys with warning by @yaoyu-33 :: PR: #1203
- Make container image point to last known release container by @gautham-kollu :: PR: #1443
- Revamp recipe tutorials by @ananthsub :: PR: #1308
- [docs] 25.11 release notes by @ananthsub :: PR: #1504
- Add generic scripts for training by @ananthsub :: PR: #1390
- Nemotron nano v2 finetune by @cuichenx :: PR: #1391
- Replay: M4 Remove parallel state usage in train loops, train steps and utils #1175 + Bug fix by @yaoyu-33 :: PR: #1445
- track dtype in scatter to tp ranks by @ananthsub :: PR: #1509
- Update performance scripts to align with llmb requirements by @scsudhakaran :: PR: #1416
- fix qwen3_vl by changing sequence_length to seq_length by @shifangx :: PR: #1511
- Update GPT-OSS pretrain config parameters by @cuichenx :: PR: #1375
- feat: mcore trigger mbridge by @pablo-garay :: PR: #1441
- fix: cleanup by @pablo-garay :: PR: #1540
- Revert strong-scaling support for DeepSeek-V3 by @scsudhakaran :: PR: #1548
- Add fallback for shared embedding flag by @yaoyu-33 :: PR: #1521
- Wan Bridge (checkpoints conversion) by @huvunvidia :: PR: #1550
- feat: defer flop calculation to model_provider "get_num_floating_point_operations" if provided by @yaoyu-33 :: PR: #1446
- refactor: Unify launchers by @ko3n1g :: PR: #1519
- bug fixes- unify launchers by @malay-nagda :: PR: #1573
- ci: Bump MCore and ModelOpt by @chtruong814 :: PR: #1551
- docs: Update documentation.md to include install submodules command by @chenopis :: PR: #1576
- fix: Fix load failure when `load_megatron_model` from a model trained with uneven pp by @yaoyu-33 :: PR: #1579
- Added 25.11 starter pack by @sanandaraj5597 :: PR: #1596
- fix: Wandb mocking by @ko3n1g :: PR: #1587
- fix: Use model seq length as default if no CLI is provided by @ko3n1g :: PR: #1600
- scripts: Update help string of args.detach by @ko3n1g :: PR: #1589
- ci: Add DGXC executor by @ko3n1g :: PR: #1584
- fix: Fix model parallel initialization ordering by @yaoyu-33 :: PR: #1574
- fix: Missing return of parse_additional_slurm_params by @ko3n1g :: PR: #1619
- Add fix for users who want to provide a path on disk to a custom HF tokenizer by @jstjohn :: PR: #1594
- fix: wandb exp name in recipe path by @ko3n1g :: PR: #1623
- Rename TensorRT Model Optimizer to Model Optimizer by @AAnoosheh :: PR: #1484
- Cleanup partial CG objects by @gautham-kollu :: PR: #1615
- [Canonical LoRA] fix: use correct q_out_features for `linear_q` by @HollowMan6 :: PR: #1627
- [Canonical LoRA] fix: forward under expert layers by @HollowMan6 :: PR: #1628
- qwen3 235b config update by @malay-nagda :: PR: #1613
- chore: Update codeowners of performance scripts by @ko3n1g :: PR: #1641
- Re-use higher-level config override util in tutorials by @ananthsub :: PR: #1524
- docs: add wayfinder readme.md files for each docs directory by @chenopis :: PR: #1617
- ci: Fix DGXC env vars by @ko3n1g :: PR: #1629
- Support strong scaling ...
NVIDIA Megatron-Bridge 0.2.2
- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.
NVIDIA Megatron-Bridge 0.2.1
- Performance
  - Activation offloading to host memory support with pipelining
    - Supports the high activation memory needs of MoE models training with dynamic shapes
  - Fixed Nemotron FLOPS calculation model
- Model Collection Support
  - Ministral 3
- Enhanced LoRA support
  - LoRA support for Mamba layers (for Nemotron Nano V2 and NemotronH finetuning)