Releases · NVIDIA-NeMo/Megatron-Bridge

NVIDIA Megatron-Bridge 0.5.0 Latest

nemo-automation-bot released this 22 Jun 23:23

v0.5.0

fcbb603

Changelog Details

Model Collection Support

LLM / VLM

Qwen3.5 text bridges (dense + MoE) (PR#3769, community @HowardZorn)
DeepSeek V4 bridge and DeepSeek-V4-Flash pretraining recipes (PR#3562, PR#3893)
Ernie 4.5 text-only MoE and VL bridges (PR#3263, community @bo-ke)
GLM-5 / GLM-5.1 (MoE + MLA + DSA) bridge and provider (PR#2913, PR#3635)
GLM-4.7 / GLM-4.7-Flash support (PR#2983)
StepFun Step-3.5-Flash (PR#3525) and Step-3.7-Flash (PR#4043)
MiMo-V2-Flash support (PR#3163, community @beccohov)
Gemma 4 MoE (26B-A4B) and dense (31B) models, LLM + VLM (PR#3148, PR#3885, community @pavelgein)
Falcon H1 hybrid Transformer + Mamba support (PR#1462, community @dhiaEddineRhaiem)
Ling MoE V2 support (PR#2028, community @ccclyu)

Multimodal

Nemotron-3 Nano Omni support, including model, recipe, and examples (PR#3760)
Qwen3-Omni-MoE training support (PR#3317, community @hbhflw2000)
Qwen3-ASR support (PR#2836, PR#3273)
Nemotron Diffusion (Nemotron-Labs-Diffusion) model support (PR#3105)

Training & Functionality

MegatronMIMO (Multimodel-In-Multimodel-Out) is a new feature for training multimodal models with heterogeneous parallelism (e.g. different model parallelism for the image encoder and the text decoder). NeMo 26.06 supports non-colocated training, i.e. encoder and decoder placed on different ranks (PR#2004, PR#2007, PR#2869, PR#2870), and MegatronMIMO model conversion (PR#3905), with a focus on dense models. Colocated training (i.e. encoder and decoder on the same rank) and MoE models will be supported in the next release.
Energon v7 support, including metadata and stateless cookers (PR#4090)
Energon updates for video and multi-image (PR#3691)
Eval-time context parallelism via decentralized process-group rebinding (PR#3755)
Deterministic training support for performance recipes (PR#3543)
Evaluator backend integration (SFT + inference + evaluation, demonstrated on GPT-OSS) (PR#2990)
LoRA support for non-shared expert adapters (PR#3408)
Configurable async checkpoint strategy (PR#3153); MSC support for FSDP DTensors (PR#3300)
Fast dataloading configs and documentation (PR#3351)

Low-Precision Bridge & Checkpoint Conversion

Quantize-then-gather weight export (FP8 / MXFP4) for faster RL trainer→rollout weight sync (PR#2737, community @hy2826)
DeepSeek V4 quantization-scale emission during HF export (PR#3969)

Performance

fp4_param_gather enabled in MixedPrecisionConfig (PR#3364)
Qwen3-Next 80B GB200/GB300 parallel mappings (PR#3168)
CUDA graph support for Qwen3-VL LLM and vision-encoder submodules (PR#2334); full-iteration CUDA graph for GPT-OSS recipes (PR#4140)

Megatron-LM ↔ Megatron-Bridge Unification

Megatron Inference integrated into Bridge — MCore Inference Engine examples, model wrappers, pure-LLM inference CLI, and inference_optimized path (PR#3897)
Tokenizer unification — MCore tokenizer config promoted as the shared surface (Bridge side: PR#3451; MCore side: MCore PR#4406)
Training-loop upstreaming (in progress) — Bridge's config + builder patterns moving into Megatron-LM: ConfigContainer (MCore PR#4227), serialization base (MCore PR#4309), Mamba config + builder (MCore PR#4550), GPT config + builder (MCore PR#4741), supporting utils (MCore PR#4872)

Developer Experience & Compatibility

RL API refactoring — model creation, config override, training loop, export, and LoRA for RL (PR#3813)
AGENTS.md and AI-coding-agent skills updated (recipe-recommender, NeMo-RL & verl E2E testing) (PR#3256, PR#3277, PR#3831)

Examples & Tutorials

MegatronMIMO Qwen3.5-VL non-colocated SFT tutorial + LLaVA tutorial (PR#4239)
Qwen3-0.6B 128K long-context SFT recipe with YaRN RoPE scaling (PR#3316)
HuggingFace ↔ Megatron-FSDP weight conversion (PR#3512); online HF load/save for Megatron-FSDP (PR#1910)

ModelOpt

LoRA × ModelOpt × DeepSeek architecture support (PR#3612)

Community Contributions

A big thank you to our community contributors for their valuable support!

Known issues:

Step-3.7-Flash forward-pass outputs have not been fully verified.
Some examples/ scripts have known minor issues: MiniMax M2 (conversion/export saving), GLM-4.5V (exported tokenizer artifacts), FLUX (tokenizer setup), and WAN (inference setup/dependencies).
Some MoE training configurations that combine tensor parallelism and expert parallelism may run slower in 26.06 after upgrading from NCCL 2.29 to NCCL 2.30.
- Root cause: NCCL 2.30 fixed a CPU-affinity leak and now correctly restores the launcher's original CPU affinity after communicator initialization. Earlier NCCL versions could inadvertently leave application threads bound to CPUs local to each GPU. Training launchers without explicit CPU and memory binding may therefore expose cross-NUMA scheduling overhead after the upgrade.
- Workaround: Bind each training rank and its host-memory allocations to the NUMA node local to its assigned GPU: numactl --cpunodebind=<NUMA_NODE> --membind=<NUMA_NODE> <training command>. The GPU-local NUMA node can be determined programmatically from the GPU's PCI bus ID. For Slurm or torchrun launchers, the training command can be wrapped in a script such as the following:

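# Local rank: SLURM_LOCALID under Slurm, LOCAL_RANK under torchrun.
LID=${SLURM_LOCALID:-${LOCAL_RANK:-0}}
PCI_BUS=$(nvidia-smi -i "$LID" --query-gpu=pci.bus_id --format=csv,noheader 2>/dev/null | head -1 | tr '[:upper:]' '[:lower:]')
# nvidia-smi may print an 8-digit PCI domain, while sysfs uses 4 digits; keep the last 12 characters.
[ "${#PCI_BUS}" -gt 12 ] && PCI_BUS=${PCI_BUS: -12}
NUMA_NODE=$(cat "/sys/bus/pci/devices/$PCI_BUS/numa_node" 2>/dev/null || echo -1)
echo "[numactl_local] rank=$LID gpu_pci=$PCI_BUS numa=$NUMA_NODE"
# A negative value means the NUMA node is unknown; run unbound rather than fail in numactl.
[ "$NUMA_NODE" -lt 0 ] && exec "$@"
exec numactl --cpunodebind="$NUMA_NODE" --membind="$NUMA_NODE" "$@"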
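
To check whether a rank is actually bound after applying the workaround, it can help to print the CPU and memory-node affinity the process ended up with. The following is a minimal sketch that assumes a Linux host with /proc mounted; show_affinity is a hypothetical helper name and is not part of Megatron-Bridge or NCCL.

```shell
#!/usr/bin/env bash
# Sketch only: show_affinity is a hypothetical helper, not a Megatron-Bridge API.
# It prints the CPUs and memory nodes a process is allowed to use, as reported
# by the Linux kernel in /proc/<pid>/status.

show_affinity() {
  # Default to the current shell's PID when no PID is given.
  local pid=${1:-$$}
  grep -E '^(Cpus|Mems)_allowed_list:' "/proc/$pid/status"
}

show_affinity "$@"
```

Running this in place of the training command, once with and once without the numactl wrapper, should show whether the allowed CPU and memory lists are narrowed to a single NUMA node. An unbound rank typically lists every CPU on the host.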

Contributors

ccclyu, pavelgein, and 7 other contributors

NVIDIA Megatron-Bridge 0.4.2

nemo-automation-bot released this 28 May 21:18

v0.4.2

c810129

Highlights

Expanded performance configs for DeepSeek V3, Qwen, GPT-OSS, and WAN
Added support for the fp4_param_gather mixed-precision config
Enhanced security in dataset and checkpoint deserialization and in URL loading, plus safer trust_remote_code handling

Performance

NVFP4 with 4-bit parameter AllGather in DP communications (PR#3364, PR#4005)
DSV3 B300 recipe tuning (PR#3549)
DSV3 B200 recipe tuning (PR#3368)
Qwen3 235B A22B B300 recipe tuning (PR#3490)
NT3 super B300 recipe tuning (PR#3579)
GPT-OSS B200 regression fix (PR#3614)

Software Component

Upgraded NVIDIA Resiliency Extension (NVRX) to v0.6.0

Known issues

There is a known issue with Evaluator when installing nvidia-vlmeval inside /opt/NeMo-FW. Please use the /opt/Megatron-Bridge directory to install the package:

cd /opt/Megatron-Bridge
uv pip install nvidia-vlmeval

Changelog Details

beep boop 🤖: Bumping megatron.bridge to v0.4.1 by @nemo-automation-bot[bot] :: PR: #3363
cp: [perf] fix: guard cuda_graph_scope validation against None (3249) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3262
cp: fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283) into r0.4.0 by @yaoyu-33 :: PR: #3305
cp: Enable nemo-ci tests (short runs - perf and non-perf) for Wan + Updating recipes names (3179) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3324
cp: Perf script utility to lock gpu frequency. (2977) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3326
cp: fix(gemma3-vl): force right-padding in VLM collate to prevent token loss (3331) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3332
cp: fix(perf): read baseline values from golden values when using new format (3334) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3338
[docs] chore: bump versions1.json to 0.4.0 (latest) by @ko3n1g :: PR: #3376
b200 DSv3 better cfg (#3368), mxfp8 to fp8_cs for h100 gpt-oss #3378 by @malay-nagda :: PR: #3420
2604 perf summary (#3377) by @malay-nagda :: PR: #3405
cp: docs(releases): add 26.04 software component versions (3421) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3430
cp: b200 DSv3 better cfg (3368) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3401
cp: [training] fix: report memory on 2nd iteration to better reflect actual peak (3169) into r0.4.0 by @dingqingy-nv :: PR: #3367
cp: Update Qwen3-VL pretrain perf configs for 30B and 235B (3327) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3342
cp: docs: Add container version to docs version picker (3434) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3435
cp: [docs] Add Megatron Bridge 0.4.0 release notes (#3419) by @chtruong814 :: PR: #3439
cp: fix(test): clone mmap-backed tensors before overwriting safetensors file (#3335) by @yaoyu-33 :: PR: #3441
cp: [test] refactor: move diffusion tests to test_groups directory (3275) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3442
remove archival data from main page by @malay-nagda :: PR: #3448
cp: fix: set 644 permissions on COPY'd files to match cloned repos (3431) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3450
cp: [perf] fix: use direct assignment for NCCL env vars when nccl_ub enabled (3350) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3453
cp: [training] feat: enable fp4_param_gather in MixedPrecisionConfig (3364) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3454
cp: fix(docker): replace rdma-core source build with system package install (3429) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3457
cp: [training] fix: record CUDA memory history before snapshot so dumps are non-empty (#3487) into r0.4.0 by @dingqingy-nv :: PR: #3508
cp: [vulnops][misc] fix: Add allowlist validation for _target_ instantiation (3142) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3540
cp: [vulnops][data] fix: Replace unsafe pickle.loads with restricted unpickler in Qwen VL pipeline (3139) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3541
cp: [vulnops][ckpt] fix: Use weights_only=True in ModelOpt checkpoint loading (3138) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3542
cp: [vulnops][ckpt] fix: Use weights_only=True in TrainState checkpoint loading (3506) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3557
cp: [vulnops][data] fix: Replace unsafe pickle.load with restricted unpickler for index metadata (3140) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3558
cp: [vulnops] fix: _contains_code_references allowlist bypass leads to RCE (3379) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3559
fix: Add security warning for trust_remote_code and remove hardcoded True by @chtruong814 :: PR: #3539
cp: Cleanup TE cuda graphs with the right api (3459) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3476
cp: Update DeepSeek-V3 configs for B300 (3549) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3565
cp: log repo status manual (3570) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3572
cp: ci: post merge comment with SHA after successful CI run (3567) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3573
cp: [perf] update: switch GPT-OSS GB200 V2 dispatcher default to alltoall (3561) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3577
cp: no fp4 param gather (3578) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3580
cp: fix(evaluate): skip non-dict golden value entries such as job_id (3581) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3582
cp: [vulnops][data] fix: Validate URLs in VLM video loader to prevent SSRF (3482) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3588
fix(docker): suppress lightning from uv resolution in fw_pyproject by @ko3n1g :: PR: #3602
cp: [vulnops][data] fix: Remove unnecessary allow_pickle=True and add security warnings (3141) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3615
cp: [vulnops][data] fix: Replace allow_pickle=True with restricted unpickler in packed dataset loading (3616) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3629
cp: add VP for LoRA Lm3 70B (3547) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3596
cp: num_layers_fix- qwen vl 235b_a22b on B200 (3589) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3603
cp: fix(docker): resolve lightning not found on PyPI by providing local stub (3604) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3606
cp: 70b_lora_gb200_bf16_fix (3623) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3627
cp: [vulnops] fix: Add SSRF protection to image-loading utilities (3630) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3632
chore(beep boop 🤖): Bump uv.lock (r0.4.0, mcore-core_r0.17.0) (2026-04-30) by @svcnvidia-nemo-ci :: PR: #3591
cp: [vulnops] fix: Add SSRF protection to audio URL loading (3633) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3636
cp: fix(perf): keep PCT binding for deepseek_v3 large_scale on b300 (3656) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3657
fix: apply vllm PR 36192 patch and bump pillow to 12.20 by @ko3n1g :: PR: #3671
cp: Add previously removed NemotronHBridge SequentialMLP mappings (3628) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3701
Use HybridEP flex dispatcher for Qwen3 235B B300 perf configs (#3490) by @rhmukundan :: PR: #3675
[build] chore: bump package version to 0.4.2 by @ko3n1g :: PR: #3721
[model, ckpt, docs] fix: support HF→Megatron conversion under decentralized PGs (r0.4.0) by @cuichenx :: PR: #3674
cp: Fix Gemma3 example folder (3724) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3728
cp: Reorganize ModelOpt docs (3715) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3751
[model, ckpt] fix: align GPT-OSS BF16 down_proj orientation on import (r0.4.0) by @cuichenx :: PR: #3753
perf(qwen3-next): set expandable_segments on GB300 BF16/FP8_MX to fix OOM by @ko3n1g :: PR: #3767
cp: llama31 405b gb200 nvfp4 no pg overlap (3713) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3773
cp: [perf] update: switch GPT-OSS B200 V2 dispatcher default to alltoall (3614) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3682
nt3 super nvfp4; lm3.1 405B nvfp4; lm3 70B mxfp8- expandable_segments by @malay-nagda :: PR: #3780
cp: [config] Update micro_batch_size to 2 for gemma3 recipe (3815) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3828
chore: Bump TE to latest 2.14 and MCore to latest 0.17.0 by @chtruong814 :: PR: #3806
qwen3 next env var fix by @malay-nagda :: PR: #3845
chore: Bump and remove packages to address CVEs (#3841) by @chtruong814 :: PR: #3855
Bump MCore to 2edffa by @chtruong814 :: PR: #3857
cp: chore: Bump deps to address CVEs (3919) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3925
cp: 2604_patch_perf_summary (3818) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3861
cp: 26.04.01_perf_summary (3997) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3998
cp: docs: note 26.04 drops PyAV by default and document runtime install (4020) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #4021

Contributors

ko3n1g, cuichenx, and 6 other contributors

26.04-alpha.rc2

mmarcinkiewicz released this 07 May 07:08

26.04-alpha.rc2

fd5c473

[MXFP8 param gather]Update param buffer before copy to model weights …

NVIDIA Megatron-Bridge 0.4.1

nemo-automation-bot released this 06 May 21:49

v0.4.1

f9b6319

This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.

26.04-alpha.rc1

mmarcinkiewicz released this 23 Apr 09:32

26.04-alpha.rc1

68d8bcc

Merge branch 'PR2411' into 26.04-alpha

NVIDIA Megatron-Bridge 0.4.0

svcnvidia-nemo-ci released this 16 Apr 22:46

v0.4.0

0fbfe7d

Highlights

Model Collection Support

MiniMax M2 / M2.5 support (PR#2602)
Kimi 2.5 support, including GB300 MXFP8 recipe and HF config updates (PR#2743)
Nemotron 3 Super model support (PR#2912)
Sarvam support (PR#1814)
Qwen 3.5 VL Bridge with recipes and LoRA bridge / merge support (PR#2530, PR#2654, PR#2736)
Qwen 2.5 Omni support (PR#2634)
Qwen2-Audio support (PR#2324)
Xiaomi MiMo dense MTP model bridge support (PR#2387, by HollowMan6)

Diffusion Collection

Diffusion model support for DFM-to-Bridge migration (PR#2534, PR#2645)
FLUX and WAN diffusion submodule improvements (PR#2822, PR#2849)

Training & Functionality

Parquet support for sequence-packing preprocessing, improving handling of larger datasets (PR#2395)
Energon integration for sequence packing with WebDataset workflows (PR#2440)
Default packed sequences across finetune recipes (PR#2284)
More modern finetuning datasets, including OpenMathInstruct V2 and GSM8K (PR#2264)
Unified dataset configuration in run_recipe.py (PR#2826)
NCCL flight recorder configuration support (PR#2891)
Comet ML experiment tracking integration (PR#2910)
Refactored SFT and PEFT recipes for VLM workflows (PR#2614)
Added the on_checkpoint_save callback event for training workflows (PR#2905)
Added MoE LoRA rank normalization for expert layers (PR#3006)
Direct export of block-wise FP8 weights and scaling factors (PR#1994)
Accelerated first-fit packing with a segment tree for much faster packing on large datasets (PR#2953)

Model Optimization

Pruning support and documentation (PR#2244)
Post-training quantization support for Nano, Super, and Ultra model families (PR#2303)
Distillation quantization support in NeMo 2 (PR#2591)

Performance

Nemotron 3 Super perf config, including GB200 improvements and BF16 / NVFP4 functional support via module recompute (PR#3208)

Developer Experience & Compatibility

ModelConfig and ModelBuilder refactor integrated into the training loop (PR#2798, PR#2671)
Dev branch support and documentation updates (PR#2497)
Python 3.12 migration announcement (PR#2773)
Transformers 5.0 through 5.3 compatibility (PR#2068, PR#2781)
PEFT Bridge offline mode support (PR#2574)
LoRA merge on CPU (PR#2194)
Self-contained Megatron-to-HF export with auto-config synthesis (PR#2778)
Scripts and documentation for Megatron-LM and Megatron Bridge correlation

Examples & Tutorials

Resiliency examples (PR#2115)
Qwen3 VL sequence packing examples (PR#2380)
Distillation example cleanup (PR#2865, PR#2860)

Community Contributions

@HollowMan6 (Aalto University): Xiaomi MiMo dense MTP bridge support, Qwen 3.5 VL LoRA bridge and merge, and additional export / PEFT fixes (PR#2387, PR#2736, PR#2384, PR#2799)
@shaltielshmid: packed-sequence improvements for large datasets and safer model loading defaults (PR#2395, PR#2766)
@jaeminh: accelerated first-fit packing with a segment tree (PR#2953)
@pavelgein: added the on_checkpoint_save callback event (PR#2905)
@ShiftyBlock (UC Berkeley): added auto-config for self-contained Megatron-to-HF export (PR#2778)
@erictang000 (Anyscale): added LoRA rank normalization for MoE expert layers (PR#3006)
@eternally-z: added direct export support for block-wise FP8 weights and scaling factors (PR#1994)
@Hayak3: fixed the unsupported normalization argument for Qwen3-VL (PR#1970)
@mohit-sarvam (Sarvam AI): added Sarvam MoE support (PR#1814)

A big thank you to our community contributors for their valuable support!

Changelog Details

docs: Update callback code snippets to include all imports needed for example by @ananthsub :: PR: #2283
M4 leftover for QWen3-VL with MCore vision encoder by @shifangx :: PR: #2370
Update Qwen3 235B B300 Configs to match Qwen3 B200 Configs by @rhmukundan :: PR: #2669
[bridge] Fix off-by-one in sliding window size for Gemma2, Gemma3, Mistral, and GPT-OSS by @cuichenx :: PR: #2656
fix: Write intermediate results to tmp by @ko3n1g :: PR: #2726
Perf recipe dataloader num_workers interface fix by @dingqingy-nv :: PR: #2710
Suppress noisy _extra_state warnings during checkpoint loading by @cuichenx :: PR: #2689
[model, recipe] Add Qwen 3.5 recipes by @cuichenx :: PR: #2654
[ci] chore: add nightly dev commit bump workflow by @ko3n1g :: PR: #2729
ci(fix): Unique naming for dev branch by @ko3n1g :: PR: #2747
[ci] Refactor Gemma3-VL launch script to run finetune and packed tests separately by @cuichenx :: PR: #2730
add qwen2_5_omni by @yuekaizhang :: PR: #2634
build: Bump TE 2.13 by @ko3n1g :: PR: #2753
[docs, ci] chore: add governance issue forms and triage guide by @yaoyu-33 :: PR: #2716
[test] fix: temporarily disable qwen2.5 omni unit tests by @yaoyu-33 :: PR: #2759
add nemotron3 super docs by @liding-nv :: PR: #2757
ci: Fix stopiteration for Mbridge by @ko3n1g :: PR: #2760
GPT-OSS Blackwell MXFP8 recipes by @weijiac0619 :: PR: #2633
feat(mimo): phase 2 - model provider, DDP wrapping, process groups by @aroshanghias-nvd :: PR: #2004
[build] feat: add OSS NeMo FW dockerfiles by @thomasdhc :: PR: #2722
Lm3 70B GB200 FP8_CS SFT cfg update by @malay-nagda :: PR: #2748
[docs] chore: use uv run in test file docstring run instructions by @cuichenx :: PR: #2728
build: Bump NVRX by @ko3n1g :: PR: #2775
NVFP4 memory spike fix compared to M-LM by @sanandaraj5597 :: PR: #2764
[doc] feat: Document adapter merge verification in stream_adapter_weights example by @yaoyu-33 :: PR: #2042
[doc] chore: Add needs-review to PR state labels guidance by @yaoyu-33 :: PR: #2758
[ckpt] fix: broaden exception handling in save_artifacts dynamic module loading by @yaoyu-33 :: PR: #2765
[test] fix: use toy configs in qwen2.5 omni unit tests by @yaoyu-33 :: PR: #2761
[model] Refactor Qwen3-VL and Ministral3 fine-tuning scripts by @kamran-nvidia :: PR: #2735
docs - Update user manual with new MoE features and Megatron FSDP by @onel :: PR: #2529
remove encoder_and_decoder usage by @dimapihtar :: PR: #2512
Fix attention_mask mismatch in compare.py by @mohsinm-dev :: PR: #2476
[model, test] fix: guard hybrid layer count across MCore branches by @yaoyu-33 :: PR: #2776
[data] fix: guard eval_interval division to prevent ZeroDivisionError by @yaoyu-33 :: PR: #2732
[sync][training] fix: log loss values of exactly 0.0 in training_log() by @mehraakash :: PR: #2740
[model] feat: support Qwen 3.5 MTP c...

Contributors

onel, sudostock, and 42 other contributors

NVIDIA Megatron-Bridge 0.3.1

svcnvidia-nemo-ci released this 20 Mar 22:35

v0.3.1

9c9dd84

Changelog Details

Performance & Model Configs

CP SFT performance improvements (#2527)
Nemotron 3 Nano perf config updates (#2560, #2681)
Onboard LLaMA3 70B LoRA to B300 and B200 chips (#2588)
Update Qwen3 235B B300 configs to match B200 configs (#2706, #2720)
Update DeepSeek-V3 B300 config (#2723)
DeepSeek-V3: set no_non_det_algo for deterministic training (#2673)
Add MoE Sequential MLP mappings in HF Bridges (#2589)

Bug Fixes

[training] Cap lr_warmup_steps to be strictly less than lr_decay_steps (#2858)
[training] Fix DistillationProvider.to_cfg_dict to save missing keys in run_config (#2594)
[training] Fix StopIteration error in MBridge (#2762)
[checkpoint] Fix local checkpoint integration (#2709)
[checkpoint] Log warning when HuggingFace Hub download fails silently (#2493)
[checkpoint] Low-memory save: use AutoBridge directly in distill_llama32_3b-1b to load HF weights (#2860)
[inference] Use config.hidden_size directly for Qwen3VL inference wrapper (#2855)
[misc] Improve compare.py robustness for multi-GPU and vocab-padded models (#2647)
[misc] Fix BOS token mismatch in compare_text_generation (#2889)
[misc] Guard eod_id access in compare_text_generation for HF tokenizers (#2853)
[misc] Guard missing kubernetes deps (#2871)
[example] Fix example scripts and recipe names in release branch (#2862, #2863)

Documentation

Add ModelOpt pruning docs (#2629)

NVIDIA Megatron-Bridge 0.3.0

svcnvidia-nemo-ci released this 26 Feb 03:51

v0.3.0

21b02e0

Highlights

Model Collection Support
- Nano v3 (PR#1858)
- GLM 4.5v (PR#1798)
- Ministral 3 (PR#1580)
Performance
- NVFP4 support for Llama3 models
- HybridEP support for NVL8 systems (PR#494)
- MLA performance improvement with cuDNN LayerNorm and cuDNN 9.18
- LN+MXFP8 quantization fusion with TE.sequence and the cuDNN backend
- Supports FSDP for MoE models with MXFP8 (PR#2135, PR#2239)
- Support Muon Optimizer (PR#683)
- NVFP4 Llama Playbook (PR#1409)
Training & Functionality
- LoRA Bridge (initial): RL LoRA support for VeRL / nemo-rl (PR#1766)
- Multi-token prediction (MTP): Qwen3 dense examples (PR#2138)
- Decentralized parallel group (M4) end to end support and examples (PR#2011, examples)
- Context Parallelism (CP) with sequence packing in LLMs (PR#1867)
- Context Parallelism (CP) with sequence packing in VLMs (PR#1997)
- Callbacks integration (PR#2063)
- Low memory save for model importing from HF (fix Deepseek V3 and Kimi-K2 import) (PR#1949)
Community Contributions
- @HollowMan6: MoE router weight adapter wrapper (PR#1834), temporary disable adapter support (PR#1811), flexible LoRA target_modules (PR#1799), separate layernorm mappings (PR#1808), shared_experts MoE fix (PR#1800), LoRA split QKV with GQA fix (PR#1818), Moonlight/Kimi rotary_emb export fix (PR#1838), configurable use_arbitrary_attention_mask (PR#1807)
- @Hayak3: Fix Qwen3-VL unsupported normalization arg (PR#1970)
- @shaltielshmid: Disable FP8 during CPU initialization for export (PR#1815)
- @therealnaveenkamal: MLFlow integration (PR#2112)
- @kannankumar: Fill-in-the-Middle (FIM) dataset support (PR#2066)
- A big thank you to our community contributors for their valuable support!

Changelog Details

concise naming | weak scaling | save cfg to file by @malay-nagda :: PR: #1246
cg_scope valid list and default none by @malay-nagda :: PR: #1264
chore: Merge fp8 args by @ko3n1g :: PR: #1279
cg and nan grad norm fix by @malay-nagda :: PR: #1309
feat: Support PEFT weight mapping and merge LoRA adapters when export to hf by @HollowMan6 :: PR: #1310
Add Nemotron nano v2 vl by @cuichenx :: PR: #1136
Replay "Ko3n1g/ci/cleanup recipe evaluator (#1349)" by @ko3n1g :: PR: #1377
Gemma3 VL LoRA Recipe + Documentations by @suiyoubi :: PR: #1388
Add GLM4.5 FT Recipe by @suiyoubi :: PR: #1382
Adding FLA as dependency for Qwen3-Next by @adityavavreNVDA :: PR: #1359
fix: default to nccl comm overlap bootstrap backend by @ananthsub :: PR: #1395
Add Qwen2/2.5 FT recipes by @ananthsub :: PR: #1385
[PEFT/LoRA] fix: using ETP instead of TP for expert layers by @HollowMan6 :: PR: #1380
Llama3 PEFT- 8B, 70B by @malay-nagda :: PR: #1381
Add option for LoRA with Transformer Engine op fuser by @michal2409 :: PR: #1324
[OMNIML-2937] Support Megatron Bridge quantized checkpoint export to HF unified checkpoint by @yueshen2016 :: PR: #1302
HybridEP support by @erhoo82 :: PR: #1367
expose option to dump config to file during end to end tests by @ananthsub :: PR: #1400
[OMNIML-2935] PTQ support of MOE model (Qwen-3) on Megatron-Bridge by @yueshen2016 :: PR: #1405
Revert "feat: Dependabot automerge if successful (#1051)" by @pablo-garay :: PR: #1428
Update perf docs by @gautham-kollu :: PR: #1426
Add Qwen3VL support (dense and moe) by @yashaswikarnati :: PR: #1174
Fix llama3-8b NVFP4 recipe by @adityavavreNVDA :: PR: #1347
fix GPT-OSS perf scripts by @erhoo82 :: PR: #1438
Add functional test for finetuning with sequence packing by @ananthsub :: PR: #861
feat: Pass custom srun args into Run by @ko3n1g :: PR: #1440
Fix typo in dataclass from callable => typing.Callable in nemotron_h_provider.py by @shaltielshmid :: PR: #1442
pass the support of deepep for B200 and B300 GPUs by @erhoo82 :: PR: #1436
cuda graph fine grained scope | hybridEP | a2a overlap by @malay-nagda :: PR: #1348
nvfp4 for dense models by @sanandaraj5597 :: PR: #1453
Added Qwen 3 next perf scripts by @sanandaraj5597 :: PR: #1451
reset gradient_accumulation_fusion with megatron fsdp by @ananthsub :: PR: #1386
guard trust_remote_code by @dimapihtar :: PR: #1291
fix lint checks on main by @ananthsub :: PR: #1463
DSv3- gb200 base cfg fix | b200 no a2a overlap by @malay-nagda :: PR: #1476
sequence_length -> seq_length by @dimapihtar :: PR: #1023
feat: Add whitelist support for mismatched params in load_hf_weights by @yaoyu-33 :: PR: #1447
[docs] Update readme with supported models/recipes by @ananthsub :: PR: #1455
Add Gemma2 recipes by @ananthsub :: PR: #1383
[docs] Add release section for changelog and software component versions by @ananthsub :: PR: #1490
[docs] Add 0.2.0 version picker by @ananthsub :: PR: #1488
Reduced precision (BF16, FP8, MXFP8, NVFP4) training tutorial using Megatron-Bridge by @sergiopperez :: PR: #1409
Update conversion compare script and add accelerate dependency by @yaoyu-33 :: PR: #1344
[main] Fix functional conftest to handle optional nvdlfw-inspect dependency by @ananthsub :: PR: #1496
[docs] Update supported model docs by @ananthsub :: PR: #1503
fix: Escape user inputs in data tutorials by @ananthsub :: PR: #1465
Bridge instantiate_utils: drop unexpected config keys with warning by @yaoyu-33 :: PR: #1203
Make container image point to last known release container by @gautham-kollu :: PR: #1443
Revamp recipe tutorials by @ananthsub :: PR: #1308
[docs] 25.11 release notes by @ananthsub :: PR: #1504
Add generic scripts for training by @ananthsub :: PR: #1390
Nemotron nano v2 finetune by @cuichenx :: PR: #1391
Replay: M4 Remove parallel state usage in train loops, train steps and utils #1175 + Bug fix by @yaoyu-33 :: PR: #1445
track dtype in scatter to tp ranks by @ananthsub :: PR: #1509
Update performance scripts to align with llmb requirements by @scsudhakaran :: PR: #1416
fix qwen3_vl by changing sequence_length to seq_length by @shifangx :: PR: #1511
Update GPT-OSS pretrain config parameters by @cuichenx :: PR: #1375
feat: mcore trigger mbridge by @pablo-garay :: PR: #1441
fix: cleanup by @pablo-garay :: PR: #1540
Revert strong-scaling support for DeepSeek-V3 by @scsudhakaran :: PR: #1548
Add fallback for shared embedding flag by @yaoyu-33 :: PR: #1521
Wan Bridge (checkpoints conversion) by @huvunvidia :: PR: #1550
feat: defer flop calculation to model_provider "get_num_floating_point_operations" if provided by @yaoyu-33 :: PR: #1446
refactor: Unify launchers by @ko3n1g :: PR: #1519
bug fixes- unify launchers by @malay-nagda :: PR: #1573
ci: Bump MCore and ModelOpt by @chtruong814 :: PR: #1551
docs: Update documentation.md to include install submodules command by @chenopis :: PR: #1576
fix: Fix load failure when load_megatron_model from a model trained with uneven pp by @yaoyu-33 :: PR: #1579
Added 25.11 starter pack by @sanandaraj5597 :: PR: #1596
fix: Wandb mocking by @ko3n1g :: PR: #1587
fix: Use model seq length as default if no CLI is provided by @ko3n1g :: PR: #1600
scripts: Update help string of args.detach by @ko3n1g :: PR: #1589
ci: Add DGXC executor by @ko3n1g :: PR: #1584
fix: Fix model parallel initialization ordering by @yaoyu-33 :: PR: #1574
fix: Missing return of parse_additional_slurm_params by @ko3n1g :: PR: #1619
Add fix for users who want to provide a path on disk to a custom HF tokenizer by @jstjohn :: PR: #1594
fix: wandb exp name in recipe path by @ko3n1g :: PR: #1623
Rename TensorRT Model Optimizer to Model Optimizer by @AAnoosheh :: PR: #1484
Cleanup partial CG objects by @gautham-kollu :: PR: #1615
[Canonical LoRA] fix: use correct q_out_features for linear_q by @HollowMan6 :: PR: #1627
[Canonical LoRA] fix: forward under expert layers by @HollowMan6 :: PR: #1628
qwen3 235b config update by @malay-nagda :: PR: #1613
chore: Update codeowners of performance scripts by @ko3n1g :: PR: #1641
Re-use higher-level config override util in tutorials by @ananthsub :: PR: #1524
docs: add wayfinder readme.md files for each docs directory by @chenopis :: PR: #1617
ci: Fix DGXC env vars by @ko3n1g :: PR: #1629
Support strong scaling ...

Contributors

jstjohn, yfw, and 46 other contributors

NVIDIA Megatron-Bridge 0.2.2

chtruong814 released this 09 Jan 18:14

v0.2.2

0465189

This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, visit https://www.nvidia.com/en-us/security/. For acknowledgement, please reach out to the NVIDIA PSIRT team at PSIRT@nvidia.com.

NVIDIA Megatron-Bridge 0.2.1

ko3n1g released this 18 Dec 00:04

v0.2.1

1c43b39

Performance
- Support for activation offloading to host memory, with pipelining
  - Supports the high activation-memory needs of training MoE models with dynamic shapes
- Fixed the Nemotron FLOPS calculation model
Model Collection Support
- Ministral 3
Enhanced LoRA support
- LoRA support for Mamba layers (for Nemotron Nano V2 and NemotronH finetuning)
