Releases: NVIDIA-NeMo/Megatron-Bridge

NVIDIA Megatron-Bridge 0.5.0

@nemo-automation-bot nemo-automation-bot released this 22 Jun 23:23
fcbb603
Changelog Details

Model Collection Support

LLM / VLM

Multimodal

  • Nemotron-3 Nano Omni support, including model, recipe, and examples (PR#3760)
  • Qwen3-Omni-MoE training support (PR#3317, community @hbhflw2000)
  • Qwen3-ASR support (PR#2836, PR#3273)
  • Nemotron Diffusion (Nemotron-Labs-Diffusion) model support (PR#3105)

Training & Functionality

  • MegatronMIMO (Multimodel-In-Multimodel-Out) is a new feature for training multimodal models with heterogeneous parallelism (e.g. different model parallelism for the image encoder and the text decoder). NeMo 26.06 supports non-colocated training, in which the encoder and decoder are placed on different ranks (PR#2004, PR#2007, PR#2869, PR#2870), and MegatronMIMO model conversion (PR#3905), with a focus on dense models. Colocated training (encoder and decoder on the same rank) and MoE models will be supported in the next release.
  • Energon v7 support, including metadata and stateless cookers (PR#4090)
  • Energon updates for video and multi-image (PR#3691)
  • Eval-time context parallelism via decentralized process-group rebinding (PR#3755)
  • Deterministic training support for performance recipes (PR#3543)
  • Evaluator backend integration (SFT + inference + evaluation, demonstrated on GPT-OSS) (PR#2990)
  • LoRA support for non-shared expert adapters (PR#3408)
  • Configurable async checkpoint strategy (PR#3153); MSC support for FSDP DTensors (PR#3300)
  • Fast dataloading configs and documentation (PR#3351)

Low-Precision Bridge & Checkpoint Conversion

  • Quantize-then-gather weight export (FP8 / MXFP4) for faster RL trainer→rollout weight sync (PR#2737, community @hy2826)
  • DeepSeek V4 quantization-scale emission during HF export (PR#3969)

Performance

  • fp4_param_gather enabled in MixedPrecisionConfig (PR#3364)
  • Qwen3-Next 80B GB200/GB300 parallel mappings (PR#3168)
  • CUDA graph support for Qwen3-VL LLM and vision-encoder submodules (PR#2334); full-iteration CUDA graph for GPT-OSS recipes (PR#4140)

Megatron-LM ↔ Megatron-Bridge Unification

  • Megatron Inference integrated into Bridge — MCore Inference Engine examples, model wrappers, pure-LLM inference CLI, and inference_optimized path (PR#3897)
  • Tokenizer unification — MCore tokenizer config promoted as the shared surface (Bridge side: PR#3451; MCore side: MCore PR#4406)
  • Training-loop upstreaming (in progress) — Bridge's config + builder patterns moving into Megatron-LM: ConfigContainer (MCore PR#4227), serialization base (MCore PR#4309), Mamba config + builder (MCore PR#4550), GPT config + builder (MCore PR#4741), supporting utils (MCore PR#4872)

Developer Experience & Compatibility

  • RL API refactoring — model creation, config override, training loop, export, and LoRA for RL (PR#3813)
  • AGENTS.md and AI-coding-agent skills updated (recipe-recommender, NeMo-RL & verl E2E testing) (PR#3256, PR#3277, PR#3831)

Examples & Tutorials

  • MegatronMIMO Qwen3.5-VL non-colocated SFT tutorial + LLaVA tutorial (PR#4239)
  • Qwen3-0.6B 128K long-context SFT recipe with YaRN RoPE scaling (PR#3316)
  • HuggingFace ↔ Megatron-FSDP weight conversion (PR#3512); online HF load/save for Megatron-FSDP (PR#1910)

ModelOpt

  • LoRA × ModelOpt × DeepSeek architecture support (PR#3612)

Community Contributions

A big thank you to our community contributors for their valuable support!

Known issues:

  • Step-3.7-Flash forward-pass outputs have not been fully verified.
  • Some examples/ scripts have known minor issues: MiniMax M2 (conversion/export saving), GLM-4.5V (exported tokenizer artifacts), FLUX (tokenizer setup), and WAN (inference setup/dependencies).
  • Some MoE training configurations that combine tensor parallelism and expert parallelism may run slower in 26.06 after upgrading from NCCL 2.29 to NCCL 2.30.
    • Root cause: NCCL 2.30 fixed a CPU-affinity leak and now correctly restores the launcher's original CPU affinity after communicator initialization. Earlier NCCL versions could inadvertently leave application threads bound to CPUs local to each GPU. Training launchers without explicit CPU and memory binding may therefore expose cross-NUMA scheduling overhead after the upgrade.
    • Workaround: Bind each training rank and its host-memory allocations to the NUMA node local to its assigned GPU: numactl --cpunodebind=<NUMA_NODE> --membind=<NUMA_NODE> <training command>. The GPU-local NUMA node can be determined programmatically from the GPU's PCI bus ID. For Slurm or torchrun launchers, the training command can be wrapped as follows:
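# Local rank: SLURM_LOCALID under Slurm, LOCAL_RANK under torchrun.
LID=${SLURM_LOCALID:-${LOCAL_RANK:-0}}
# Lowercase the bus ID and trim an 8-digit PCI domain to the 4 digits used in sysfs.
PCI_BUS=$(nvidia-smi -i "$LID" --query-gpu=pci.bus_id --format=csv,noheader 2>/dev/null | head -1 | tr '[:upper:]' '[:lower:]' | sed -E 's/^0000([0-9a-f]{4}:)/\1/')
NUMA_NODE=$(cat "/sys/bus/pci/devices/$PCI_BUS/numa_node" 2>/dev/null || echo -1)
echo "[numactl_local] rank=$LID gpu_pci=$PCI_BUS numa=$NUMA_NODE"
# A negative value means the node is unknown; run unbound instead of passing -1 to numactl.
if [ "$NUMA_NODE" -lt 0 ]; then exec "$@"; fi
exec numactl --cpunodebind=$NUMA_NODE --membind=$NUMA_NODE "$@"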
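
The lookup of the GPU-local NUMA node from a PCI bus ID can also be sketched in Python, for launchers that prefer to resolve the binding from a Python entry point. This is a minimal sketch and not part of Megatron-Bridge: the names sysfs_pci_address and gpu_numa_node are hypothetical, and it assumes a Linux sysfs layout and that nvidia-smi reports an 8-digit PCI domain where sysfs uses 4 digits.

```python
import re
from pathlib import Path


def sysfs_pci_address(bus_id: str) -> str:
    """Convert an nvidia-smi style bus ID to the form used under /sys/bus/pci."""
    addr = bus_id.strip().lower()
    # Assumption: nvidia-smi prints an 8-digit PCI domain (00000000:3B:00.0),
    # while sysfs entries use 4 digits (0000:3b:00.0). Trim only in that case.
    match = re.fullmatch(
        r"0000([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])", addr
    )
    return match.group(1) if match else addr


def gpu_numa_node(bus_id: str, sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """Return the NUMA node of a PCI device, or -1 if it cannot be determined."""
    node_file = Path(sysfs_root) / sysfs_pci_address(bus_id) / "numa_node"
    try:
        return int(node_file.read_text().strip())
    except (OSError, ValueError):
        # Missing device directory, unreadable file, or non-numeric content.
        return -1
```

A return value of -1 means the node could not be determined, in which case the rank is better left unbound than passed to numactl.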

NVIDIA Megatron-Bridge 0.4.2

@nemo-automation-bot nemo-automation-bot released this 28 May 21:18
c810129

Highlights

  • Expanded performance configs for DeepSeek V3, Qwen, GPT-OSS, and WAN
  • Support for the fp4_param_gather mixed-precision config
  • Hardened dataset and checkpoint deserialization and URL loading; safer trust_remote_code handling
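
The deserialization hardening noted above follows a common pattern that the changelog entries below call a restricted unpickler. The sketch here shows the general technique from the Python standard library, not the code shipped in Megatron-Bridge: the ALLOWED_GLOBALS allowlist and the restricted_loads helper are hypothetical names chosen for illustration.

```python
import io
import pickle

# Hypothetical allowlist for illustration; a real one would name only the
# types that the serialized metadata is actually expected to contain.
ALLOWED_GLOBALS = {
    ("builtins", "set"),
    ("builtins", "frozenset"),
    ("collections", "OrderedDict"),
}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that resolves only allowlisted globals."""

    def find_class(self, module, name):
        # find_class is called whenever the pickle stream references a global
        # (a class or function); rejecting unknown ones blocks code execution
        # through crafted payloads.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Stand-in for pickle.loads that enforces the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers such as dicts, lists, strings, and numbers load unchanged because they are encoded with dedicated opcodes rather than global references; anything that needs an import outside the allowlist raises pickle.UnpicklingError.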

Performance

  • NVFP4 with 4-bit parameter AllGather in DP communications (PR#3364, PR#4005)
  • DeepSeek V3 B300 recipe tuning (PR#3549)
  • DeepSeek V3 B200 recipe tuning (PR#3368)
  • Qwen3 235B A22B B300 recipe tuning (PR#3490)
  • Nemotron 3 Super B300 recipe tuning (PR#3579)
  • GPT-OSS B200 regression fix (PR#3614)

Software Component

Known issues

  • There is a known issue with Evaluator when installing nvidia-vlmeval inside /opt/NeMo-FW. Please use the /opt/Megatron-Bridge directory to install the package:
cd /opt/Megatron-Bridge
uv pip install nvidia-vlmeval
Changelog Details
  • beep boop 🤖: Bumping megatron.bridge to v0.4.1 by @nemo-automation-bot[bot] :: PR: #3363
  • cp: [perf] fix: guard cuda_graph_scope validation against None (3249) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3262
  • cp: fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283) into r0.4.0 by @yaoyu-33 :: PR: #3305
  • cp: Enable nemo-ci tests (short runs - perf and non-perf) for Wan + Updating recipes names (3179) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3324
  • cp: Perf script utility to lock gpu frequency. (2977) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3326
  • cp: fix(gemma3-vl): force right-padding in VLM collate to prevent token loss (3331) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3332
  • cp: fix(perf): read baseline values from golden values when using new format (3334) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3338
  • [docs] chore: bump versions1.json to 0.4.0 (latest) by @ko3n1g :: PR: #3376
  • b200 DSv3 better cfg (#3368), mxfp8 to fp8_cs for h100 gpt-oss #3378 by @malay-nagda :: PR: #3420
  • 2604 perf summary (#3377) by @malay-nagda :: PR: #3405
  • cp: docs(releases): add 26.04 software component versions (3421) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3430
  • cp: b200 DSv3 better cfg (3368) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3401
  • cp: [training] fix: report memory on 2nd iteration to better reflect actual peak (3169) into r0.4.0 by @dingqingy-nv :: PR: #3367
  • cp: Update Qwen3-VL pretrain perf configs for 30B and 235B (3327) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3342
  • cp: docs: Add container version to docs version picker (3434) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3435
  • cp: [docs] Add Megatron Bridge 0.4.0 release notes (#3419) by @chtruong814 :: PR: #3439
  • cp: fix(test): clone mmap-backed tensors before overwriting safetensors file (#3335) by @yaoyu-33 :: PR: #3441
  • cp: [test] refactor: move diffusion tests to test_groups directory (3275) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3442
  • remove archival data from main page by @malay-nagda :: PR: #3448
  • cp: fix: set 644 permissions on COPY'd files to match cloned repos (3431) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3450
  • cp: [perf] fix: use direct assignment for NCCL env vars when nccl_ub enabled (3350) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3453
  • cp: [training] feat: enable fp4_param_gather in MixedPrecisionConfig (3364) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3454
  • cp: fix(docker): replace rdma-core source build with system package install (3429) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3457
  • cp: [training] fix: record CUDA memory history before snapshot so dumps are non-empty (#3487) into r0.4.0 by @dingqingy-nv :: PR: #3508
  • cp: [vulnops][misc] fix: Add allowlist validation for _target_ instantiation (3142) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3540
  • cp: [vulnops][data] fix: Replace unsafe pickle.loads with restricted unpickler in Qwen VL pipeline (3139) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3541
  • cp: [vulnops][ckpt] fix: Use weights_only=True in ModelOpt checkpoint loading (3138) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3542
  • cp: [vulnops][ckpt] fix: Use weights_only=True in TrainState checkpoint loading (3506) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3557
  • cp: [vulnops][data] fix: Replace unsafe pickle.load with restricted unpickler for index metadata (3140) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3558
  • cp: [vulnops] fix: _contains_code_references allowlist bypass leads to RCE (3379) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3559
  • fix: Add security warning for trust_remote_code and remove hardcoded True by @chtruong814 :: PR: #3539
  • cp: Cleanup TE cuda graphs with the right api (3459) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3476
  • cp: Update DeepSeek-V3 configs for B300 (3549) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3565
  • cp: log repo status manual (3570) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3572
  • cp: ci: post merge comment with SHA after successful CI run (3567) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3573
  • cp: [perf] update: switch GPT-OSS GB200 V2 dispatcher default to alltoall (3561) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3577
  • cp: no fp4 param gather (3578) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3580
  • cp: fix(evaluate): skip non-dict golden value entries such as job_id (3581) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3582
  • cp: [vulnops][data] fix: Validate URLs in VLM video loader to prevent SSRF (3482) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3588
  • fix(docker): suppress lightning from uv resolution in fw_pyproject by @ko3n1g :: PR: #3602
  • cp: [vulnops][data] fix: Remove unnecessary allow_pickle=True and add security warnings (3141) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3615
  • cp: [vulnops][data] fix: Replace allow_pickle=True with restricted unpickler in packed dataset loading (3616) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3629
  • cp: add VP for LoRA Lm3 70B (3547) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3596
  • cp: num_layers_fix- qwen vl 235b_a22b on B200 (3589) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3603
  • cp: fix(docker): resolve lightning not found on PyPI by providing local stub (3604) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3606
  • cp: 70b_lora_gb200_bf16_fix (3623) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3627
  • cp: [vulnops] fix: Add SSRF protection to image-loading utilities (3630) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3632
  • chore(beep boop 🤖): Bump uv.lock (r0.4.0, mcore-core_r0.17.0) (2026-04-30) by @svcnvidia-nemo-ci :: PR: #3591
  • cp: [vulnops] fix: Add SSRF protection to audio URL loading (3633) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3636
  • cp: fix(perf): keep PCT binding for deepseek_v3 large_scale on b300 (3656) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3657
  • fix: apply vllm PR 36192 patch and bump pillow to 12.20 by @ko3n1g :: PR: #3671
  • cp: Add previously removed NemotronHBridge SequentialMLP mappings (3628) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3701
  • Use HybridEP flex dispatcher for Qwen3 235B B300 perf configs (#3490) by @rhmukundan :: PR: #3675
  • [build] chore: bump package version to 0.4.2 by @ko3n1g :: PR: #3721
  • [model, ckpt, docs] fix: support HF→Megatron conversion under decentralized PGs (r0.4.0) by @cuichenx :: PR: #3674
  • cp: Fix Gemma3 example folder (3724) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3728
  • cp: Reorganize ModelOpt docs (3715) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3751
  • [model, ckpt] fix: align GPT-OSS BF16 down_proj orientation on import (r0.4.0) by @cuichenx :: PR: #3753
  • perf(qwen3-next): set expandable_segments on GB300 BF16/FP8_MX to fix OOM by @ko3n1g :: PR: #3767
  • cp: llama31 405b gb200 nvfp4 no pg overlap (3713) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3773
  • cp: [perf] update: switch GPT-OSS B200 V2 dispatcher default to alltoall (3614) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3682
  • nt3 super nvfp4; lm3.1 405B nvfp4; lm3 70B mxfp8- expandable_segments by @malay-nagda :: PR: #3780
  • cp: [config] Update micro_batch_size to 2 for gemma3 recipe (3815) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3828
  • chore: Bump TE to latest 2.14 and MCore to latest 0.17.0 by @chtruong814 :: PR: #3806
  • qwen3 next env var fix by @malay-nagda :: PR: #3845
  • chore: Bump and remove packages to address CVEs (#3841) by @chtruong814 :: PR: #3855
  • Bump MCore to 2edffa by @chtruong814 :: PR: #3857
  • cp: chore: Bump deps to address CVEs (3919) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3925
  • cp: 2604_patch_perf_summary (3818) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3861
  • cp: 26.04.01_perf_summary (3997) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #3998
  • cp: docs: note 26.04 drops PyAV by default and document runtime install (4020) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #4021

26.04-alpha.rc2

@mmarcinkiewicz mmarcinkiewicz released this 07 May 07:08
fd5c473
[MXFP8 param gather]Update param buffer before copy to model weights …

NVIDIA Megatron-Bridge 0.4.1

@nemo-automation-bot nemo-automation-bot released this 06 May 21:49
f9b6319

26.04-alpha.rc1

@mmarcinkiewicz mmarcinkiewicz released this 23 Apr 09:32
Merge branch 'PR2411' into 26.04-alpha

NVIDIA Megatron-Bridge 0.4.0

@svcnvidia-nemo-ci svcnvidia-nemo-ci released this 16 Apr 22:46
0fbfe7d

Highlights

Model Collection Support

  • MiniMax M2 / M2.5 support (PR#2602)
  • Kimi 2.5 support, including GB300 MXFP8 recipe and HF config updates (PR#2743)
  • Nemotron 3 Super model support (PR#2912)
  • Sarvam support (PR#1814)
  • Qwen 3.5 VL Bridge with recipes and LoRA bridge / merge support (PR#2530, PR#2654, PR#2736)
  • Qwen 2.5 Omni support (PR#2634)
  • Qwen2-Audio support (PR#2324)
  • Xiaomi MiMo dense MTP model bridge support (PR#2387, community @HollowMan6)

Diffusion Collection

Training & Functionality

  • Parquet support for sequence-packing preprocessing, improving handling of larger datasets (PR#2395)
  • Energon integration for sequence packing with WebDataset workflows (PR#2440)
  • Default packed sequences across finetune recipes (PR#2284)
  • More modern finetuning datasets, including OpenMathInstruct V2 and GSM8K (PR#2264)
  • Unified dataset configuration in run_recipe.py (PR#2826)
  • NCCL flight recorder configuration support (PR#2891)
  • Comet ML experiment tracking integration (PR#2910)
  • Refactored SFT and PEFT recipes for VLM workflows (PR#2614)
  • Added the on_checkpoint_save callback event for training workflows (PR#2905)
  • Added MoE LoRA rank normalization for expert layers (PR#3006)
  • Direct export of block-wise FP8 weights and scaling factors (PR#1994)
  • Accelerated first-fit packing with a segment tree for much faster packing on large datasets (PR#2953)
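
The segment-tree speedup for first-fit packing (PR#2953) can be illustrated with a small sketch. This is not the code from the PR: first_fit_pack is a hypothetical function that shows the general idea, assuming each sequence is described only by its length and every bin has the same capacity. A max segment tree over the bins' remaining capacities finds the lowest-numbered bin that fits a sequence in O(log n) instead of scanning all open bins.

```python
def first_fit_pack(lengths, capacity):
    """Assign each sequence to the first bin with enough room.

    Returns a list of bins, each a list of indices into `lengths`.
    """
    n = len(lengths)
    if n == 0:
        return []
    if max(lengths) > capacity:
        raise ValueError("a sequence is longer than the bin capacity")

    # Round the leaf count up to a power of two so the tree is complete.
    size = 1
    while size < n:
        size *= 2
    # tree[i] holds the largest remaining capacity in the subtree rooted at i.
    # Leaves live at tree[size:]; padding leaves stay 0 so they are never chosen.
    tree = [0] * (2 * size)
    for i in range(n):
        tree[size + i] = capacity
    for i in range(size - 1, 0, -1):
        tree[i] = max(tree[2 * i], tree[2 * i + 1])

    bins = [[] for _ in range(n)]
    for idx, length in enumerate(lengths):
        # Descend from the root, preferring the left child, to reach the
        # lowest-numbered bin that can still hold this sequence.
        node = 1
        while node < size:
            node = 2 * node if tree[2 * node] >= length else 2 * node + 1
        bins[node - size].append(idx)
        # Shrink the leaf, then recompute maxima on the path back to the root.
        tree[node] -= length
        node //= 2
        while node >= 1:
            tree[node] = max(tree[2 * node], tree[2 * node + 1])
            node //= 2
    return [b for b in bins if b]
```

With n sequences, at most n bins are ever needed, so the tree is sized once up front and unopened bins simply appear as leaves with full capacity.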

Model Optimization

  • Pruning support and documentation (PR#2244)
  • Post-training quantization support for Nano, Super, and Ultra model families (PR#2303)
  • Distillation quantization support in NeMo 2 (PR#2591)

Performance

  • Nemotron 3 Super perf config, including GB200 improvements and BF16 / NVFP4 functional support via module recompute (PR#3208)

Developer Experience & Compatibility

  • ModelConfig and ModelBuilder refactor integrated into the training loop (PR#2798, PR#2671)
  • Dev branch support and documentation updates (PR#2497)
  • Python 3.12 migration announcement (PR#2773)
  • Transformers 5.0 through 5.3 compatibility (PR#2068, PR#2781)
  • PEFT Bridge offline mode support (PR#2574)
  • LoRA merge on CPU (PR#2194)
  • Self-contained Megatron-to-HF export with auto-config synthesis (PR#2778)
  • Scripts and documentation for Megatron-LM and Megatron Bridge correlation

Examples & Tutorials

Community Contributions

  • @HollowMan6 (Aalto University): Xiaomi MiMo dense MTP bridge support, Qwen 3.5 VL LoRA bridge and merge, and additional export / PEFT fixes (PR#2387, PR#2736, PR#2384, PR#2799)
  • @shaltielshmid: packed-sequence improvements for large datasets and safer model loading defaults (PR#2395, PR#2766)
  • @jaeminh: accelerated first-fit packing with a segment tree (PR#2953)
  • @pavelgein: added the on_checkpoint_save callback event (PR#2905)
  • @ShiftyBlock (UC Berkeley): added auto-config for self-contained Megatron-to-HF export (PR#2778)
  • @erictang000 (Anyscale): added LoRA rank normalization for MoE expert layers (PR#3006)
  • @eternally-z: added direct export support for block-wise FP8 weights and scaling factors (PR#1994)
  • @Hayak3: fixed the unsupported normalization argument for Qwen3-VL (PR#1970)
  • @mohit-sarvam (Sarvam AI): added Sarvam MoE support (PR#1814)

A big thank you to our community contributors for their valuable support!

Changelog Details
  • docs: Update callback code snippets to include all imports needed for example by @ananthsub :: PR: #2283
  • M4 leftover for QWen3-VL with MCore vision encoder by @shifangx :: PR: #2370
  • Update Qwen3 235B B300 Configs to match Qwen3 B200 Configs by @rhmukundan :: PR: #2669
  • [bridge] Fix off-by-one in sliding window size for Gemma2, Gemma3, Mistral, and GPT-OSS by @cuichenx :: PR: #2656
  • fix: Write intermediate results to tmp by @ko3n1g :: PR: #2726
  • Perf recipe dataloader num_workers interface fix by @dingqingy-nv :: PR: #2710
  • Suppress noisy _extra_state warnings during checkpoint loading by @cuichenx :: PR: #2689
  • [model, recipe] Add Qwen 3.5 recipes by @cuichenx :: PR: #2654
  • [ci] chore: add nightly dev commit bump workflow by @ko3n1g :: PR: #2729
  • ci(fix): Unique naming for dev branch by @ko3n1g :: PR: #2747
  • [ci] Refactor Gemma3-VL launch script to run finetune and packed tests separately by @cuichenx :: PR: #2730
  • add qwen2_5_omni by @yuekaizhang :: PR: #2634
  • build: Bump TE 2.13 by @ko3n1g :: PR: #2753
  • [docs, ci] chore: add governance issue forms and triage guide by @yaoyu-33 :: PR: #2716
  • [test] fix: temporarily disable qwen2.5 omni unit tests by @yaoyu-33 :: PR: #2759
  • add nemotron3 super docs by @liding-nv :: PR: #2757
  • ci: Fix stopiteration for Mbridge by @ko3n1g :: PR: #2760
  • GPT-OSS Blackwell MXFP8 recipes by @weijiac0619 :: PR: #2633
  • feat(mimo): phase 2 - model provider, DDP wrapping, process groups by @aroshanghias-nvd :: PR: #2004
  • [build] feat: add OSS NeMo FW dockerfiles by @thomasdhc :: PR: #2722
  • Lm3 70B GB200 FP8_CS SFT cfg update by @malay-nagda :: PR: #2748
  • [docs] chore: use uv run in test file docstring run instructions by @cuichenx :: PR: #2728
  • build: Bump NVRX by @ko3n1g :: PR: #2775
  • NVFP4 memory spike fix compared to M-LM by @sanandaraj5597 :: PR: #2764
  • [doc] feat: Document adapter merge verification in stream_adapter_weights example by @yaoyu-33 :: PR: #2042
  • [doc] chore: Add needs-review to PR state labels guidance by @yaoyu-33 :: PR: #2758
  • [ckpt] fix: broaden exception handling in save_artifacts dynamic module loading by @yaoyu-33 :: PR: #2765
  • [test] fix: use toy configs in qwen2.5 omni unit tests by @yaoyu-33 :: PR: #2761
  • [model] Refactor Qwen3-VL and Ministral3 fine-tuning scripts by @kamran-nvidia :: PR: #2735
  • docs - Update user manual with new MoE features and Megatron FSDP by @onel :: PR: #2529
  • remove encoder_and_decoder usage by @dimapihtar :: PR: #2512
  • Fix attention_mask mismatch in compare.py by @mohsinm-dev :: PR: #2476
  • [model, test] fix: guard hybrid layer count across MCore branches by @yaoyu-33 :: PR: #2776
  • [data] fix: guard eval_interval division to prevent ZeroDivisionError by @yaoyu-33 :: PR: #2732
  • [sync][training] fix: log loss values of exactly 0.0 in training_log() by @mehraakash :: PR: #2740
  • [model] feat: support Qwen 3.5 MTP c...

NVIDIA Megatron-Bridge 0.3.1

@svcnvidia-nemo-ci svcnvidia-nemo-ci released this 20 Mar 22:35
Changelog Details

Performance & Model Configs

  • CP SFT performance improvements (#2527)
  • Nemotron 3 Nano perf config updates (#2560, #2681)
  • Onboard LLaMA3 70B LoRA to B300 and B200 chips (#2588)
  • Update Qwen3 235B B300 configs to match B200 configs (#2706, #2720)
  • Update DeepSeek-V3 B300 config (#2723)
  • DeepSeek-V3: set no_non_det_algo for deterministic training (#2673)
  • Add MoE Sequential MLP mappings in HF Bridges (#2589)

Bug Fixes

  • [training] Cap lr_warmup_steps to be strictly less than lr_decay_steps (#2858)
  • [training] Fix DistillationProvider.to_cfg_dict to save missing keys in run_config (#2594)
  • [training] Fix StopIteration error in MBridge (#2762)
  • [checkpoint] Fix local checkpoint integration (#2709)
  • [checkpoint] Log warning when HuggingFace Hub download fails silently (#2493)
  • [checkpoint] Low-memory save: use AutoBridge directly in distill_llama32_3b-1b to load HF weights (#2860)
  • [inference] Use config.hidden_size directly for Qwen3VL inference wrapper (#2855)
  • [misc] Improve compare.py robustness for multi-GPU and vocab-padded models (#2647)
  • [misc] Fix BOS token mismatch in compare_text_generation (#2889)
  • [misc] Guard eod_id access in compare_text_generation for HF tokenizers (#2853)
  • [misc] Guard missing kubernetes deps (#2871)
  • [example] Fix example scripts and recipe names in release branch (#2862, #2863)
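
The warmup cap listed under the training fixes (#2858) can be read as a simple clamp. The helper below is a hypothetical sketch of that reading, not the code from the fix: cap_warmup_steps is an invented name, and the only behavior taken from the release note is that the warmup length ends up strictly less than the decay length.

```python
def cap_warmup_steps(lr_warmup_steps: int, lr_decay_steps: int) -> int:
    """Return a warmup length that is strictly less than the decay length."""
    if lr_decay_steps < 1:
        raise ValueError("lr_decay_steps must be at least 1")
    # Clamp so that at least one step of the schedule remains after warmup.
    return min(lr_warmup_steps, lr_decay_steps - 1)
```

For example, a configuration with 1000 warmup steps and 1000 decay steps would be adjusted to 999 warmup steps, while a 100-step warmup is left unchanged.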

Documentation

  • Add ModelOpt pruning docs (#2629)

NVIDIA Megatron-Bridge 0.3.0

@svcnvidia-nemo-ci svcnvidia-nemo-ci released this 26 Feb 03:51
21b02e0

Highlights

  • Model Collection Support
  • Performance
    • NVFP4 support for Llama3 models
    • HybridEP support for NVL8 systems (PR#494)
    • MLA performance improvement with cuDNN LayerNorm and cuDNN 9.18
    • LN+MXFP8 quantization fusion with TE.sequence and the cuDNN backend
    • FSDP support for MoE models with MXFP8 (PR#2135, PR#2239)
    • Muon optimizer support (PR#683)
    • NVFP4 Llama Playbook (PR#1409)
  • Training & Functionality
    • LoRA Bridge (initial): RL LoRA support for VeRL / nemo-rl (PR#1766)
    • Multi-token prediction (MTP): Qwen3 dense examples (PR#2138)
    • Decentralized parallel group (M4) end to end support and examples (PR#2011, examples)
    • Context Parallelism (CP) with sequence packing in LLMs (PR#1867)
    • Context Parallelism (CP) with sequence packing in VLMs (PR#1997)
    • Callbacks integration (PR#2063)
    • Low-memory save for model import from HF (fixes DeepSeek V3 and Kimi-K2 import) (PR#1949)
  • Community Contributions
Changelog Details

NVIDIA Megatron-Bridge 0.2.2

@chtruong814 chtruong814 released this 09 Jan 18:14
0465189

NVIDIA Megatron-Bridge 0.2.1

@ko3n1g ko3n1g released this 18 Dec 00:04
v0.2.1
1c43b39
  • Performance
    • Activation offloading to host memory support with pipelining
      • Supports the high activation-memory needs of MoE model training with dynamic shapes
      • Fixed Nemotron FLOPS calculation model
  • Model Collection Support
    • Ministral 3
  • Enhanced LoRA support
    • LoRA support for Mamba layers (for Nemotron Nano V2 and NemotronH finetuning)