feat: support Mixed Preference Optimization (MPO) #1609
LarryLeeee wants to merge 1 commit into InternLM:main
Conversation
Thanks for contributing!

@claude review
```python
import os
import gc
from copy import deepcopy
from datetime import datetime
```
Critical: Missing module xtuner.v1.rl.dpo
This import will fail at runtime. The module xtuner/v1/rl/dpo/ does not exist in the repository and is not created by this PR. The classes DPOLossConfig, DPOLossContext, and DPOLossContextInputItem are imported but never defined anywhere in the diff.
This means the entire DPO trainer feature is non-functional -- it will crash on import.
```
- hinge, ipo, robust: Other DPO variants

For MPO (Mixed Preference Optimization), use loss_types=["sigmoid", "bco_pair", "sft"]
with appropriate loss_weights.
```
Critical: Missing imports -- qwen3_vl_dpo_collator and DPOColateItem do not exist
These symbols are not defined in xtuner/v1/datasets/__init__.py nor in collator.py (checked the diff). The Qwen3VLDPOTokenizeFnConfig used in the example configs is also not defined anywhere in this PR. This will cause an ImportError at runtime.
```python
)
self.ref_engine.from_hf(str(ref_load_from))

# Freeze reference model
```
Critical: Reference model creates a full optimizer unnecessarily, wasting GPU memory
_init_reference_model creates a TrainEngine (or VisionComposeTrainEngine) for the reference model. These engines call build_optimizer() internally, allocating optimizer state (Adam moments, etc.) for all parameters -- which is then immediately wasted since the reference model is frozen (requires_grad = False) and never updated.
For a typical 8B model, this wastes ~32GB of GPU memory (2x fp32 copies of all parameters for Adam states).
The reference model should be loaded without an optimizer. Consider either:
- adding a `build_inference_only()` path that skips optimizer creation, or
- creating the FSDP-wrapped model directly, without the engine abstraction, for inference-only use.
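The first option can be sketched as follows; `build_inference_only` is a hypothetical helper name taken from the suggestion above, not an existing engine API. The point is only that weights are frozen and no `build_optimizer()` call is ever made, so no Adam moments are allocated:

```python
import torch.nn as nn


def build_inference_only(model: nn.Module) -> nn.Module:
    # Hypothetical sketch: prepare a frozen reference model for DPO.
    # No optimizer is ever constructed, so no optimizer state is allocated.
    for p in model.parameters():
        p.requires_grad_(False)
    model.eval()
    return model


# Toy usage with a stand-in module instead of a real engine-built model.
ref_model = build_inference_only(nn.Linear(8, 8))
```

In the real trainer this would wrap the model after `from_hf()` loading; only the freezing pattern is shown here.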
```python
# Average loss over batch
total_loss = total_loss / len(batch)

# Backward pass
```
Bug: _train_step calls self.train_engine.model(...) directly but does not manage gradient accumulation correctly
The method calls total_loss.backward() on every invocation, while optimizer.step() and zero_grad() are only called every gradient_accumulation_steps steps. Accumulating gradients via .backward() is correct, but the value logged as metrics["loss"] reflects only the current micro-batch, not the accumulated loss over the whole window. This is misleading for monitoring.
More importantly, zero_grad() is only called after optimizer.step(), so nothing explicitly clears gradients before the first accumulation window. The first window happens to be correct only because freshly initialized gradients are zero. The pattern is technically fine but fragile -- consider calling optimizer.zero_grad() once at the start of training in fit() to make the intent explicit.
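The suggested fix can be sketched with a toy loop (stand-in model and data, not the actual `_train_step` code): zero once up front, scale each micro-batch loss, step every `grad_accum_steps` calls, and log the accumulated value rather than the per-micro-batch one.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
grad_accum_steps = 4

optimizer.zero_grad()  # explicit: start the first window from cleared grads

running_loss = 0.0
for step in range(8):
    x = torch.randn(2, 4)
    loss = model(x).pow(2).mean() / grad_accum_steps  # scale so grads average
    loss.backward()                                   # accumulates into .grad
    running_loss += loss.item()
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        # log running_loss here: it covers the whole accumulation window
        running_loss = 0.0
```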
```python
# Forward chosen and rejected separately (policy)
chosen_output = self.train_engine.model(seq_ctx=chosen_seq_ctx, loss_ctx=None)
rejected_output = self.train_engine.model(seq_ctx=rejected_seq_ctx, loss_ctx=None)

chosen_logits = _get_field(chosen_output, "logits")
```
Bug: _train_step does two full forward passes through the policy model (chosen_output and rejected_output) with loss_ctx=None, but the model is in training mode with gradients enabled.
The forward passes compute chosen_output and rejected_output but their intermediate activations (including the full logits tensors) are held in memory simultaneously. For a typical 8B model with 8K sequence length, this means ~2x the activation memory. This is very likely to OOM.
Consider:
- computing the chosen and rejected forward passes sequentially, extracting logprobs immediately and deleting the logits before the next forward pass, or
- using `torch.no_grad()` for the logits computation and only enabling gradients for the loss backward (which requires custom autograd).
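The first option might look like this toy sketch (stand-in model, hypothetical helper names): each `[B, L, V]` logits tensor is reduced to gathered per-token logprobs before the next forward pass, so the two full logits tensors never coexist as live Python references. Note that with gradients enabled, autograd still retains whatever backward needs, so the real saving depends on how the loss is backpropagated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def token_logprobs(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Reduce [B, L, V] logits to [B, L] gathered logprobs immediately.
    logps = F.log_softmax(logits, dim=-1)
    return logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)


def forward_logprobs(model, ids, labels):
    logits = model(ids)                   # [B, L, V]
    out = token_logprobs(logits, labels)  # [B, L]
    del logits                            # drop the large reference before the next pass
    return out


vocab = 16
model = nn.Sequential(nn.Embedding(vocab, 8), nn.Linear(8, vocab))
ids = torch.randint(0, vocab, (2, 5))
labels = torch.randint(0, vocab, (2, 5))
chosen_logps = forward_logprobs(model, ids, labels)
rejected_logps = forward_logprobs(model, ids, labels)
```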
```python
# Warmup function - linear warmup from 0 to base_lr (same as SFT Trainer)
def warmup_fn(x):
    return x / warmup_steps if x < warmup_steps else 1
```
Warning: LR warmup lambda returns 0 at step 0; the warmup_steps == 0 edge case deserves an explicit guard
When x = 0, warmup_fn returns 0 / warmup_steps = 0, so the learning rate at step 0 is exactly 0 -- a common pattern, but worth noting. If warmup_steps = 0 (e.g. warmup_ratio = 0), the condition x < warmup_steps is never true, so the function always returns 1; the division by zero is never actually reached, but the code avoids it only by accident.
Add an explicit guard to make the intent clear: `if warmup_steps == 0: warmup_fn = lambda x: 1`.
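The guard could be wrapped in a small factory (hypothetical name `make_warmup_fn`) so the special case is explicit rather than incidental:

```python
def make_warmup_fn(warmup_steps: int):
    # Explicit guard: with no warmup, the LR multiplier is a constant 1.
    if warmup_steps == 0:
        return lambda x: 1.0
    return lambda x: x / warmup_steps if x < warmup_steps else 1.0


assert make_warmup_fn(0)(0) == 1.0  # no warmup: multiplier is 1 everywhere
assert make_warmup_fn(4)(2) == 0.5  # linear ramp mid-warmup
assert make_warmup_fn(4)(9) == 1.0  # past warmup: multiplier saturates at 1
```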
```python
ref_rejected_logits = _get_field(ref_rejected_output, "logits")

# Compute log probs
# NOTE: _gather_logprobs returns token-level logprobs [B, L]
```
Warning: Return type annotation says tuple[Tensor, Tensor] but returns None, None
When self.ref_engine is None, the method returns (None, None), which violates the declared return type tuple[torch.Tensor, torch.Tensor]. The type hint should be tuple[torch.Tensor | None, torch.Tensor | None].
```python
def _gather_logprobs(
    self, logits: torch.Tensor, labels: torch.Tensor
) -> torch.Tensor:
```
Numerical stability: _gather_logprobs computes log_softmax over the full vocabulary without any numerical guard
While F.log_softmax is numerically stable itself, the gathered log probabilities for padding tokens (label = -100, clipped to 0) will return the log probability of token 0, which is meaningless noise. This is handled downstream by masking, but it means the returned tensor contains uninitialized/garbage values in masked positions.
More importantly, this method duplicates the existing gather_logprobs from xtuner/v1/rl/utils.py (which is also imported later in _train_step at line 622). This dual implementation is confusing. Consider using the utility function consistently.
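A single consolidated helper might look like the sketch below (the real `xtuner.v1.rl.utils.gather_logprobs` is not shown in the diff, so this is illustrative only): masked positions are zeroed explicitly so no garbage log probabilities survive into downstream code.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100


def gather_logprobs(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Token-level logprobs [B, L]; positions with label == IGNORE_INDEX are exactly 0.
    mask = labels != IGNORE_INDEX
    safe_labels = labels.clamp(min=0)  # avoid gathering at index -100
    logps = F.log_softmax(logits.float(), dim=-1)
    gathered = logps.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    return gathered * mask  # masked positions zeroed, not left as garbage


logits = torch.randn(1, 4, 10)
labels = torch.tensor([[3, -100, 7, -100]])
out = gather_logprobs(logits, labels)
```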
```
For MPO (Mixed Preference Optimization), use:
    loss_types=["sigmoid", "bco_pair", "sft"]
    loss_weights=[0.8, 0.2, 1.0]
```
Critical: Hardcoded internal storage paths
This config file contains hardcoded paths to internal shared storage (/mnt/shared-storage-user/lisongze/...). These should use environment variables like the pattern in the original rl_qwen3_vl_8B_grpo.py (which this PR also breaks -- see separate comment).
Suggested change:

```diff
-loss_weights=[0.8, 0.2, 1.0]
+ceph_config = os.environ.get("CEPH_CONFIG", "")
+meta_data_path = os.environ["META_DATA_PATH"]
+model_path = os.environ["MODEL_PATH"]
+work_dir = os.environ["WORK_DIR"]
+tokenizer_cache_dir = os.environ.get("TOKENIZER_CACHE_DIR", os.path.join(work_dir, "tokenizer_cache"))
```
```python
data_path = os.environ["DATA_PATH"]
eval_data_path = os.environ["EVAL_DATA_PATH"]
work_dir = '/mnt/shared-storage-user/yanziang/test_xtuner/105xtuner/xtuner/examples/v1/config/rl_qwen3_vl_8B_grpo.py'
model_path = "/mnt/shared-storage-user/yanziang/xtuner/Qwen3-VL-8B-Instruct"
```
Critical: Reverted change -- environment variables replaced with hardcoded internal paths
The existing config used os.environ["WORK_DIR"], os.environ["MODEL_PATH"], etc. This PR replaces them with hardcoded paths to a developer's internal shared storage. This is clearly a debugging artifact that was accidentally committed. This change should be reverted entirely.
```python
# Print lmdeploy version
import lmdeploy
print(f"lmdeploy version: {lmdeploy.__version__}")
# breakpoint()
```
Warning: Debug artifacts left in production code
This adds a lmdeploy version print and a commented-out breakpoint(). These are clearly debugging artifacts that should not be committed. Please remove.
```python
from pathlib import Path
```
Warning: Globally disabling torch._dynamo is a heavy-handed side effect
torch._dynamo.config.disable = True is set at module import time, meaning any code that imports this CLI module will have dynamo disabled globally. This conflicts with the torch_compile=True setting in the FSDP configs of the example configs. If dynamo is truly needed to be disabled for DPO, it should be documented why, and scoped more narrowly (e.g., inside main()).
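One way to scope the side effect, sketched here with a stand-in config object rather than the real `torch._dynamo.config` (to keep the example dependency-free): toggle the flag inside `main()` and restore it on exit, so merely importing the CLI module has no global effect.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for torch._dynamo.config, keeping the sketch dependency-free.
dynamo_config = SimpleNamespace(disable=False)


@contextmanager
def dynamo_disabled(config=dynamo_config):
    prev = config.disable
    config.disable = True
    try:
        yield
    finally:
        config.disable = prev  # restore, so other importers are unaffected


def main():
    with dynamo_disabled():
        pass  # run the DPO training loop with dynamo off


main()
```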
```python
    f"Total trainable parameters: {num_total_requires_grad // 1e6}M, total parameters: {num_total // 1e6}M"
)
logger.info(f"Untrainable parameters names: {untrainable_names}")
def build(self, params):
```
Critical: Breaking change -- MuonConfig removed from optim.py
This PR removes the entire MuonConfig class and the Muon optimizer support. This is a breaking change to the existing codebase that is unrelated to the MPO feature. Anyone using MuonConfig in their training configs will break. The commit message says "feat: support MPO" but this deletion is a separate breaking change.
Per the PR standards: "One logical change per PR. Do not mix bug fixes with features or refactors."
The MuonConfig removal and the AdamWConfig.build() signature change should be in a separate PR, or this PR should not touch these files.
```python
from .chunk_loss import ChunkLoss
from .moe_loss import BalancingLoss, ZLoss
from .rl_loss import LogProbConfig, LogProbContext
```
Critical: Breaking change -- LogProbConfig and LogProbContext removed from public API
These classes are removed from the __init__.py exports. Any external code doing `from xtuner.v1.loss import LogProbConfig` will break. This removal is tied to the forward_only signature change in TrainEngine, which also removes the loss_ctx parameter. This is a significant API-breaking refactor that goes beyond "feat: support MPO".
```python
loss_kwargs_list = DPOLossContext.build_batches_loss_kwargs(
    [dpo_input], self.config.loss_cfg
)
```
Bug: _eval_step accesses chosen_output.hidden_states and rejected_output.hidden_states as attributes, but elsewhere in the trainer _get_field is used because outputs may be dicts
In _train_step, the code uses a helper _get_field(out, key) that handles both dict and attribute access patterns. But in _eval_step, the code directly accesses .hidden_states as an attribute. If the model returns a dict (which is the case when loss_ctx=None is passed to compose models), this will raise AttributeError.
Additionally, _eval_step uses self.train_engine.forward_only(...) which wraps the call in @torch.no_grad(), but then tries to compute loss with loss_ctx.loss_fn(hidden_states, head_weight, head_bias, loss_kwargs). This pattern assumes DPOLossContext.loss_fn exists and has this signature, but since xtuner.v1.rl.dpo is missing, we cannot verify this.
```python
@property
def cur_epoch(self) -> int:
    return self._cur_epoch
# [XTuner][2026-01-12 07:36:21][WARNING] Failed to process inputs: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'images', using text-only
```

(No newline at end of file.)
Nit: Debug log message left at end of file
This line contains a leftover debug/warning message as a comment at the end of the file. Also, the file is missing a trailing newline.
```python
loss_ctx = DPOLossContext(self.config.loss_cfg, loss_kwargs)
loss = torch.tensor(0.0, device=logits.device, dtype=logits.dtype)
extra_info: dict[str, Any] = {}

for loss_type, weight in zip(self.config.loss_cfg.loss_types, self.config.loss_cfg.loss_weights):
    if loss_type == "sigmoid":
```
Warning: Calling private methods _dpo_loss_sigmoid, _bco_pair_loss, etc. with type: ignore annotations
The loss dispatch in _train_step calls private methods on DPOLossContext (e.g., loss_ctx._dpo_loss_sigmoid(...)) with # type: ignore[attr-defined] annotations. Since the DPOLossContext class doesn't exist in the diff (missing xtuner.v1.rl.dpo module), we cannot verify these methods exist or have the expected signatures.
This pattern of calling private methods and silencing type errors is fragile. Consider adding a proper public dispatch method to DPOLossContext (e.g., compute_loss(loss_type, ...)) that encapsulates the loss selection logic.
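A public dispatch method along the suggested lines might look like this sketch (`DPOLossContext` and its private loss methods are absent from the diff, so the stand-in implementations below are purely illustrative):

```python
class DPOLossContext:
    """Illustrative stand-in; the real class is not included in the PR's diff."""

    def compute_loss(self, loss_type: str, *args, **kwargs):
        # Public dispatch over the supported loss variants, so callers never
        # need to reach into private methods with `# type: ignore`.
        dispatch = {
            "sigmoid": self._dpo_loss_sigmoid,
            "bco_pair": self._bco_pair_loss,
            "sft": self._sft_loss,
        }
        fn = dispatch.get(loss_type)
        if fn is None:
            raise ValueError(f"Unknown loss_type: {loss_type!r}")
        return fn(*args, **kwargs)

    # Toy stand-ins for the real private loss implementations.
    def _dpo_loss_sigmoid(self, margin):
        return 0.5 * margin

    def _bco_pair_loss(self, margin):
        return 0.2 * margin

    def _sft_loss(self, margin):
        return margin


ctx = DPOLossContext()
loss = ctx.compute_loss("sigmoid", 2.0)  # -> 1.0 with the toy stand-in
```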
```python
from typing import List, cast

import torch
import torch.distributed as dist
```
Warning: VisionComposeTrainEngine lacks a docstring for the class and its public methods
Per CLAUDE.md: "Only provide docstrings for public methods. Private methods do not require docstrings." The class itself and public methods like build_model, from_hf, save_hf, train_step, and maybe_precompute_float8_dynamic_scale_for_fsdp all lack Google-style docstrings with type-hinted parameters.
```bash
export NCCL_TIMEOUT=10800
export TORCH_DISTRIBUTED_TIMEOUT=10800
export XTUNER_USE_FA3=1
export PYTHONPATH="$(pwd)"
```
Warning: Hardcoded absolute paths in shell scripts
Both config_file (line 12) and the torchrun target (line 24) use hardcoded absolute paths to a developer's workspace. These should use relative paths or environment variables, following the pattern of other scripts in the repo.
```python
        loss = loss + _l
    elif loss_type == "sft":
        # SFT loss only on chosen part
        _l = loss_ctx._sft_loss(  # type: ignore[attr-defined]
```
Warning: SFT loss slicing logic is fragile and likely incorrect with sequence parallelism
The SFT loss branch slices logits[:, : chosen_mask.shape[1]] to extract the "chosen" portion. But at this point in the code, logits is the concatenation of chosen_logits and rejected_logits along dim=1 (line 614). The chosen_mask comes from loss_kwargs.chosen_mask which was built by DPOLossContext.build_batches_loss_kwargs.
With sequence parallelism enabled, chosen_logits and rejected_logits are already SP-split (each rank has only a portion of the sequence). The concatenation torch.cat([chosen_logits, rejected_logits], dim=1) concatenates these partial sequences. But chosen_mask.shape[1] would be the SP-split chosen length, so logits[:, :chosen_mask.shape[1]] would correctly slice the chosen portion from the concatenated tensor. This is OK only if loss_kwargs.chosen_mask was also SP-split by build_batches_loss_kwargs -- but since that module is missing, this cannot be verified.
Also, the all_reduce of _l (SFT loss) on line 716 sums across SP ranks. If the SFT loss was already normalized per-token within _sft_loss, then summing gives the total loss, not the average. The other loss types (sigmoid, bco_pair, etc.) do NOT have this all_reduce, meaning they operate on SP-local logprobs that were already globally aggregated (lines 638-651). This inconsistency between how different loss types handle SP is a potential correctness bug.
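The normalization concern reduces to simple arithmetic (a toy stand-in for the SUM `all_reduce`, not xtuner code): if each SP rank already holds a per-token mean over its local shard, summing across ranks inflates the value by the SP world size.

```python
# Two SP ranks, each holding a per-token mean loss over its local shard.
# (Assumes equal shard sizes; unequal shards would need token-count weighting.)
rank_local_means = [0.5, 1.5]
sp_world_size = len(rank_local_means)

all_reduce_sum = sum(rank_local_means)        # what an unscaled SUM all_reduce yields
global_mean = all_reduce_sum / sp_world_size  # rescaling recovers the true mean

assert all_reduce_sum == 2.0  # inflated by the SP world size
assert global_mean == 1.0
```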
```diff
@@ -8,8 +8,6 @@
     DatasetConfigList,
     DatasetConfigListAdatper,
 )
```
Warning: Breaking change -- CustomPackDataset, CustomSampler, LongTextPretrainTokenizeFunction, and LongTextPretrainTokenizeFunctionConfig removed from public API
These classes are removed from the datasets __init__.py. Any downstream code importing these will break. This removal is unrelated to the MPO feature and should be in a separate refactoring PR.
```diff
@@ -51,22 +48,11 @@ class BaseLossKwargs(BaseModel):
     model_config = ConfigDict(title="loss keyword arguments", extra="forbid", arbitrary_types_allowed=True)
```
Warning: Breaking change -- BaseLossKwargs.sp_split() and .to() methods removed
The sp_split and to methods are removed from BaseLossKwargs. These were used by BaseRLLossConfig.build() (also removed) and potentially by downstream code. The GRPOLossKwargs and OrealLossKwargs subclasses now define their own fields inline rather than inheriting from the base.
While the refactor toward immutable construction in sp_split (returning new objects via type(self)(...) in RLLossContextInputItem) is an improvement, removing the base class methods is a breaking change for any code that relied on the base BaseLossKwargs.to() pattern.
```diff
@@ -144,7 +150,6 @@ def loss_fn(
     self.loss_cfg.policy_loss_cfg,
```
Warning: Safety assertion removed
The assertion assert old_logprobs is not None was removed here. While the code will still crash with a clear error if old_logprobs is None (since .detach() would fail on None), the explicit assertion provided a more informative error message. Consider keeping defensive checks on critical invariants.
Summary
This PR adds Mixed Preference Optimization (MPO) / DPO support via a new DPOTrainer, CLI entry point, and VisionComposeTrainEngine. However, it also includes a large number of unrelated refactors to the loss system, optimizer configs, RL worker, and dataset APIs that introduce breaking changes.
Issues
Critical
- [xtuner/v1/train/dpo_trainer.py:18] Missing module `xtuner.v1.rl.dpo` -- The entire DPO trainer imports `DPOLossConfig`, `DPOLossContext`, and `DPOLossContextInputItem` from `xtuner.v1.rl.dpo`, but this module does not exist and is not created by the PR. The feature is completely non-functional.
- [xtuner/v1/train/dpo_trainer.py:12] Missing `qwen3_vl_dpo_collator`, `DPOColateItem`, `Qwen3VLDPOTokenizeFnConfig`, `VLMPreferenceJsonlDataset` -- These symbols are imported/referenced but not defined anywhere in the diff or existing codebase.
- [examples/v1/config/rl_qwen3_vl_8B_grpo.py:25] Existing config broken -- Environment variable lookups (`os.environ["WORK_DIR"]`, etc.) replaced with hardcoded developer-specific paths. This breaks the existing RL example config for all users.
- [xtuner/v1/config/optim.py:29] `MuonConfig` removed -- The entire Muon optimizer support is deleted, breaking any existing configs that use it. This is unrelated to MPO.
- [xtuner/v1/loss/__init__.py:5] `LogProbConfig` / `LogProbContext` removed from public API -- Breaking change to the loss module's public interface, unrelated to MPO.
- [xtuner/v1/train/dpo_trainer.py:463] Reference model wastes ~32GB GPU memory -- The reference engine creates a full optimizer with Adam states for a model that is immediately frozen. Should use inference-only loading.
Warning
- [xtuner/v1/train/dpo_trainer.py:605-610] High OOM risk -- Two full forward passes (chosen + rejected) through the policy model are kept in memory simultaneously before backward. For 8B models this roughly doubles activation memory.
- [xtuner/v1/train/dpo_trainer.py:413] LR warmup edge case -- When `warmup_ratio=0`, `warmup_steps=0`; `warmup_fn` then always returns 1 because `x < 0` is never true, so the 0/0 division is avoided only by accident. Add an explicit guard.
- [xtuner/v1/train/dpo_trainer.py:709] Inconsistent SP handling across loss types -- SFT loss does an explicit `all_reduce` across SP ranks, but sigmoid/bco_pair/etc. do not. This inconsistency may produce incorrect loss values under sequence parallelism.
- [xtuner/v1/train/cli/dpo.py:11-12] `torch._dynamo` globally disabled -- Conflicts with `torch_compile=True` in example FSDP configs.
- [xtuner/v1/train/cli/rl.py:43-46] Debug artifacts -- `lmdeploy` version print and commented-out `breakpoint()` committed to production code.
- [examples/] Hardcoded paths -- All new config files and shell scripts contain hardcoded paths to internal storage (`/mnt/shared-storage-user/lisongze/...`).
- [xtuner/v1/datasets/__init__.py, xtuner/v1/loss/base_loss_ctx.py, xtuner/v1/rl/base/loss.py] -- Multiple breaking API removals (`CustomPackDataset`, `CustomSampler`, `LongTextPretrainTokenizeFunction`, `BaseLossKwargs.sp_split()`, `BaseLossKwargs.to()`, `BaseRLLossContext`, `BaseRLLossKwargs`, `compute_kl_loss_weight`) bundled into a feature PR.
Nit
- [xtuner/v1/train/dpo_trainer.py:1059] -- Leftover debug comment at end of file; missing trailing newline.
- [xtuner/v1/train/dpo_trainer.py:497] -- Return type annotation `tuple[Tensor, Tensor]` is wrong when the method returns `(None, None)`.
- [xtuner/v1/train/dpo_trainer.py:530] -- Duplicate `_gather_logprobs` implementation; `xtuner.v1.rl.utils.gather_logprobs` already exists and is imported later in the same file.
- [xtuner/v1/engine/vision_compose_train_engine.py] -- Missing docstrings on class and public methods (CLAUDE.md compliance).
- Code style: single quotes used in several places (e.g., `_data['annotation']`); Chinese comments without English equivalents in public example configs.
Verdict
REQUEST_CHANGES
The PR is not mergeable in its current state:
1. The core feature does not work -- the `xtuner.v1.rl.dpo` module (containing `DPOLossConfig`, `DPOLossContext`, `DPOLossContextInputItem`) is missing from the PR. Without it, every import fails.
2. Too many unrelated breaking changes -- the PR removes `MuonConfig`, `LogProbConfig`/`LogProbContext`, `BaseRLLossContext`, `CustomPackDataset`, and others. Per project standards: "One logical change per PR."
3. Hardcoded paths throughout -- example configs and scripts use developer-specific absolute paths instead of environment variables.
Recommended action: split this into (a) the loss system refactor as a separate PR, (b) the MPO feature with all required modules included, and (c) ensure no existing configs/APIs are broken.
```bash
export AWS_ACCESS_KEY_ID=0giydv1f6acxmwwsvm54
export AWS_SECRET_ACCESS_KEY=gi3l1nedpaurw6606p2g4pykhfom1zhkeumhldzc
```
Claude: Critical: Leaked credentials in a public repository.
AWS access key ID and secret access key are hardcoded here. These credentials must be revoked immediately and removed from this PR. Use environment variables or a secrets manager instead — credentials should never appear in source code.
Even after removal from the PR, they will remain in git history. Revocation is mandatory.
```python
self.q_lora_rank = q_lora_rank
self.qk_rope_head_dim = qk_rope_head_dim
self.qk_nope_head_dim = qk_nope_head_dim
self.qk_head_dim = qk_rope_head_dim + qk_nope_head_dim
```
Claude: Bug: Trailing commas on dataclass/config field defaults create tuples, not scalars.
In Python, kv_lora_rank: int = 896, assigns (896,) (a tuple), not 896. Same for dropout: float = 0.0,, qkv_bias: bool = True,, etc. This will cause type validation errors or unexpected behavior at runtime.
Remove the trailing commas from all these field defaults.
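The pitfall is easy to reproduce in isolation (toy dataclass, not the actual model config): the trailing comma makes the right-hand side a one-element tuple.

```python
from dataclasses import dataclass


@dataclass
class GoodConfig:
    kv_lora_rank: int = 896     # scalar, as intended


@dataclass
class BadConfig:
    kv_lora_rank: int = 896,    # trailing comma: the default is the tuple (896,)


assert GoodConfig().kv_lora_rank == 896
assert BadConfig().kv_lora_rank == (896,)  # annotation says int, value is a tuple
```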
```
in xtuner v1 framework, following the same pattern as RL configs.

Supported loss types:
- sigmoid: Standard DPO loss for preference learning
- bco_pair: Binary Classifier Optimization for absolute quality
- sft: Supervised Fine-Tuning loss to maintain generation quality
```
Claude: Critical: The DPO feature is non-functional — missing module xtuner.v1.rl.dpo.
This config imports DPOLossConfig from xtuner.v1.rl.dpo, and the trainer imports DPOLossContext, DPOColateItem, etc., but the module defining these classes is not included in this PR. Additionally:
- `Qwen3VLDPOTokenizeFnConfig` is not defined anywhere
- `VLMPreferenceJsonlDataset` is not defined anywhere
- `qwen3_vl_dpo_collator` is not defined anywhere, and is also absent from the `Literal` type for `DataloaderConfig.collator`

The MPO/DPO feature cannot work as submitted.
```python
# ============================================================================
# Path Configuration
# ============================================================================
ceph_config = "/mnt/shared-storage-user/lisongze/iv3/xtuner/config/petreloss.conf"
meta_data_path = "/mnt/shared-storage-user/lisongze/iv3/xtuner/dpo_config/MMPR.json"
```
Claude: Critical: Hardcoded internal paths in example config.
Paths like /mnt/shared-storage-user/lisongze/iv3/xtuner/... are developer-specific. Example configs should use environment variables (e.g., os.environ["MODEL_PATH"]) or clearly documented placeholders. Same issue in mpo_qwen3vl_8B_video.py and the shell scripts.
```python
if sequence_parallel_mesh is not None and sequence_parallel_mesh.size() > 1:
    input_ids_list = [torch.empty_like(input_ids) for _ in range(sequence_parallel_mesh.size())]
    dist.all_gather(input_ids_list, input_ids, group=sequence_parallel_mesh.get_group())
    input_ids = torch.cat(input_ids_list, dim=1)  # type: ignore
```
Claude: Warning: Silently swallowing RuntimeError in forward pass masks real bugs.
Catching RuntimeError and continuing with visual_embeds.sum() * 0.0 is dangerous — it hides real bugs (dimension mismatches, OOM, etc.) and produces silently incorrect training. The visual_embeds.sum() * 0.0 line is also duplicated. Additionally, the fallback path below that creates dummy pixel_values with torch.zeros and runs the full vision pipeline just to add 0.0 is wasteful.
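A safer pattern, sketched with toy names rather than the actual vision code: log the failing input's shape for context and re-raise, so dimension mismatches fail fast instead of being trained over with zeroed embeddings.

```python
import logging

logger = logging.getLogger(__name__)


def encode_visual(encoder, pixel_values):
    try:
        return encoder(pixel_values)
    except RuntimeError:
        # Attach context, then re-raise: never substitute a zeroed tensor.
        logger.error("vision forward failed for input shape %s",
                     getattr(pixel_values, "shape", None))
        raise


# Toy usage: a failing "encoder" propagates instead of being swallowed.
def broken_encoder(x):
    raise RuntimeError("shape mismatch")


caught = ""
try:
    encode_visual(broken_encoder, None)
except RuntimeError as exc:
    caught = str(exc)
```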
Claude: ## Review Summary -- PR #1609 "feat: support Mixed Preference Optimization (MPO)"

This PR adds MPO/DPO support (new trainer, loss, configs, qwen3_vl_mla module) but also performs extensive refactoring across 110 files (~5400 additions, ~4100 deletions) that removes existing features and introduces several bugs. As the maintainer already noted, many recent XTuner updates appear to have been reverted.

### Critical Issues

1. Leaked AWS credentials (`examples/v1/scripts/run_rl.sh`) -- A hardcoded AWS access key ID and secret access key appear in the script. These must be revoked immediately; even after removal from the PR, they persist in git history.
2. DPO feature is non-functional -- missing core module. The config imports `DPOLossConfig` from the absent `xtuner.v1.rl.dpo` module, so the feature cannot run.
3. MLA module corruption -- breaks all existing MLA models. Changes to the shared MLA module alter behavior globally; these should be gated behind config flags, not changed globally.
4. Hardcoded internal paths replace env-var-based paths.
5. Trailing commas create tuples in config field defaults.
No description provided.