@SpenserCai SpenserCai commented Dec 16, 2025

Summary

This PR adds support for the Mistral3 (Mistral-Small-3.x) vision-language model to candle-transformers. Mistral3 combines the Pixtral vision encoder with the Mistral language model, enabling multimodal image-text understanding.

Note: This PR is a preparatory step for the upcoming Flux2 model migration, as Flux2 shares similar multimodal architecture patterns with Mistral3.

Changes

New files in candle-transformers/src/models/mistral3/:

  • mod.rs - Module exports and documentation
  • config.rs - Mistral3Config with vision, text, and projector settings
  • model.rs - Mistral3Model and Mistral3ForConditionalGeneration
  • patch_merger.rs - PatchMerger for reducing image tokens
  • projector.rs - MultiModalProjector (RMSNorm + PatchMerger + MLP)

Modified files:

  • candle-transformers/src/models/mod.rs - Added mistral3 module export
  • candle-transformers/src/models/pixtral/vision_model.rs - Added forward_with_hidden_states() and VisionModelOutput struct
  • candle-transformers/src/models/mistral.rs - Added forward_embeds_hidden() for multimodal integration

Architecture

```
Mistral3ForConditionalGeneration
├── Mistral3Model
│   ├── vision_tower (Pixtral Vision Model, 24 layers)
│   ├── multi_modal_projector
│   │   ├── norm (RMSNorm)
│   │   ├── patch_merger (spatial_merge_size=2, reduces tokens by 4x)
│   │   ├── linear_1
│   │   ├── act (GELU)
│   │   └── linear_2
│   └── language_model (Mistral, 40 layers)
└── lm_head
```
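The projector's per-token data flow (RMSNorm → linear_1 → GELU → linear_2) can be sketched in plain Rust. This is an illustrative single-vector version, not the PR's code: the real `MultiModalProjector` operates on candle Tensors and interposes the PatchMerger between the norm and `linear_1`; the no-bias linears and the tanh-approximate GELU here are assumptions for the sketch.

```rust
// Sketch of the MultiModalProjector data flow from the tree above:
// RMSNorm -> linear_1 -> GELU -> linear_2, on a single patch embedding.

fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let ms = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (ms + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv * w).collect()
}

fn gelu(x: f32) -> f32 {
    // tanh approximation of GELU
    0.5 * x * (1.0 + ((2.0 / std::f32::consts::PI).sqrt() * (x + 0.044715 * x.powi(3))).tanh())
}

fn linear(x: &[f32], w: &[Vec<f32>]) -> Vec<f32> {
    // Rows of `w` are output neurons; no bias term (assumed for this sketch).
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn project(patch: &[f32], norm_w: &[f32], w1: &[Vec<f32>], w2: &[Vec<f32>]) -> Vec<f32> {
    let h = rms_norm(patch, norm_w, 1e-5);
    let h: Vec<f32> = linear(&h, w1).into_iter().map(gelu).collect();
    linear(&h, w2)
}

fn main() {
    // Tiny hypothetical weights: 2-dim patch projected to a 1-dim output.
    let patch = [1.0f32, -2.0];
    let norm_w = [1.0f32, 1.0];
    let w1 = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let w2 = vec![vec![1.0, 1.0]];
    let out = project(&patch, &norm_w, &w1, &w2);
    assert_eq!(out.len(), 1);
    println!("{:?}", out);
}
```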

Key Implementation Details

  1. PatchMerger: Uses reshape + permute to implement PyTorch's unfold operation (kernel_size == stride, no overlap), merging 2x2 patches into one.

  2. Image Token Replacement: Implements replace_image_tokens() as Candle equivalent of PyTorch's masked_scatter.

  3. Vision Tower Integration: Uses forward_with_hidden_states() to obtain output with the batch dimension preserved, matching the behavior of PyTorch Transformers.
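The unfold-as-reshape trick in point 1 can be illustrated with plain-Rust index arithmetic (the real code uses candle `reshape`/`permute` on Tensors, but the element mapping is the same). The function name and row-major layout are assumptions of this sketch:

```rust
// Sketch of PyTorch's `unfold` with kernel_size == stride == 2 (no
// overlap), as used by the PatchMerger: merge non-overlapping 2x2
// blocks of an (h, w) grid of d-dim patch embeddings into h/2 * w/2
// merged tokens of dimension 4*d, giving the 4x token reduction.

fn merge_2x2(grid: &[f32], h: usize, w: usize, d: usize) -> Vec<f32> {
    assert!(h % 2 == 0 && w % 2 == 0);
    let mut out = Vec::with_capacity(h / 2 * w / 2 * 4 * d);
    for by in 0..h / 2 {
        for bx in 0..w / 2 {
            // Concatenate the 4 patches of this block along the feature dim.
            for (dy, dx) in [(0, 0), (0, 1), (1, 0), (1, 1)] {
                let (y, x) = (2 * by + dy, 2 * bx + dx);
                let start = (y * w + x) * d;
                out.extend_from_slice(&grid[start..start + d]);
            }
        }
    }
    out
}

fn main() {
    // 4x4 grid of 1-dim "embeddings", each holding its own flat index.
    let grid: Vec<f32> = (0..16).map(|i| i as f32).collect();
    let merged = merge_2x2(&grid, 4, 4, 1);
    // 16 patch tokens -> 4 merged tokens (4x reduction), each of dim 4.
    assert_eq!(merged.len(), 16);
    // First merged token = top-left 2x2 block: flat indices 0, 1, 4, 5.
    assert_eq!(&merged[..4], &[0.0, 1.0, 4.0, 5.0]);
}
```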

Supported Models

Differences from Pixtral LLaVA

| Feature               | Pixtral LLaVA | Mistral3                   |
|-----------------------|---------------|----------------------------|
| PatchMerger           | ❌            | ✅ (spatial_merge_size=2)  |
| Projector RMSNorm     | ❌            | ✅                         |
| Projector bias        | ✅            | ❌                         |
| Image token reduction | 1x            | 4x                         |

Usage

```rust
use candle_transformers::models::mistral3::{Mistral3Config, Mistral3ForConditionalGeneration};

let config: Mistral3Config = serde_json::from_str(&config_str)?;
let model = Mistral3ForConditionalGeneration::new(&config, vb)?;
let logits = model.forward(&input_ids, Some(&pixel_values), Some(&image_sizes), 0)?;
```

Verification

The implementation has been verified against the PyTorch Transformers reference implementation:

  • Vision Tower: avg_diff = 2.29e-4
  • MultiModal Projector: avg_diff = 3.61e-8
  • Full Forward Pass: Predicted token matches (token ID: 1784 "The")
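For readers reproducing the parity check, a minimal sketch of the kind of metric reported above is shown below. The exact metric the author used is not stated in the PR; this assumes avg_diff means the mean absolute element-wise difference between the candle output and a flattened PyTorch reference dump:

```rust
// Hypothetical helper for a candle-vs-PyTorch parity check: mean
// absolute element-wise difference between two flattened f32 outputs.

fn avg_diff(ours: &[f32], reference: &[f32]) -> f32 {
    assert_eq!(ours.len(), reference.len());
    let sum: f32 = ours.iter().zip(reference).map(|(a, b)| (a - b).abs()).sum();
    sum / ours.len() as f32
}

fn main() {
    let ours = [1.0f32, 2.0, 3.0];
    let reference = [1.0f32, 2.5, 3.0];
    // Differences are 0.0, 0.5, 0.0 -> mean 0.5 / 3.
    assert!((avg_diff(&ours, &reference) - 0.5 / 3.0).abs() < 1e-6);
}
```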

Checklist

  • New model implementation follows existing patterns in candle-transformers
  • Configuration uses serde for JSON deserialization
  • Reuses existing components (Pixtral vision, Mistral language model)
  • Documentation comments included
  • Verified against PyTorch reference implementation

@SpenserCai SpenserCai (Author) commented

mistral3 examples added!

@SpenserCai SpenserCai (Author) commented

Fixed clippy and fmt.
