Add Z-Image Text-to-Image Generation Support #3261

SpenserCai · 2025-12-24T03:26:14Z

Summary

This PR introduces support for Z-Image, Alibaba's ~6B-parameter text-to-image model based on Flow Matching. The implementation follows Candle's architecture conventions and includes the full inference pipeline: text encoder, transformer, scheduler, and VAE.

Model Overview

Z-Image is composed of the following components:

Transformer: ~6B-parameter DiT with 30 main layers, plus 2 noise-refiner and 2 context-refiner layers
Text Encoder: Qwen3-based encoder (outputs second-to-last hidden states)
VAE: AutoEncoderKL with diffusers format weights
Scheduler: FlowMatchEulerDiscreteScheduler with dynamic timestep shifting
Position Encoding: 3D RoPE (Frame/Height/Width axes)
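For orientation, the core update of a flow-matching Euler sampler can be sketched as below. This is a minimal std-only Rust illustration, not the PR's `scheduler.rs`. `euler_step` and `linear_sigmas` are hypothetical names, the model output is assumed to be the velocity in the scheduler's sign convention, and the sigma schedule is a simplified linear one before any timestep shift is applied.

```rust
/// One Euler step of a flow-matching sampler (hypothetical helper).
/// Assumes the latent follows x_sigma = (1 - sigma) * x0 + sigma * noise and
/// that `v` is the predicted velocity (noise - x0), so:
///     x_next = x + (sigma_next - sigma) * v
fn euler_step(x: &[f32], v: &[f32], sigma: f32, sigma_next: f32) -> Vec<f32> {
    assert_eq!(x.len(), v.len(), "latent and velocity must have the same length");
    let dt = sigma_next - sigma; // negative while denoising (sigma decreases)
    x.iter().zip(v).map(|(xi, vi)| xi + dt * vi).collect()
}

/// Linearly spaced sigmas from 1.0 (pure noise) down to 0.0 (clean latent).
/// Returns `num_steps + 1` values; a real scheduler would also shift them.
fn linear_sigmas(num_steps: usize) -> Vec<f32> {
    (0..=num_steps)
        .map(|i| 1.0 - i as f32 / num_steps as f32)
        .collect()
}
```

A sampling loop would call `euler_step` once per consecutive pair of sigmas, running the transformer at each step to obtain `v`.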

Model Links:

🔧 Usage Examples

Basic Usage (CUDA)

```bash
cargo run --features cuda --example z_image --release -- \
    --model-path weights/Z-Image-Turbo \
    --prompt "A beautiful landscape with mountains and a lake" \
    --width 1024 --height 768 \
    --num-steps 8
```

Using Metal (macOS)

```bash
cargo run --features metal --example z_image --release -- \
    --model-path weights/Z-Image-Turbo \
    --prompt "A futuristic city at night with neon lights" \
    --width 1024 --height 1024 \
    --num-steps 9
```

Files Changed

New Files

| File | Lines | Description |
|---|---|---|
| `candle-transformers/src/models/z_image/mod.rs` | 34 | Module exports |
| `candle-transformers/src/models/z_image/transformer.rs` | 940 | Core Transformer (Config, TimestepEmbedder, RopeEmbedder, ZImageAttention, ZImageTransformerBlock, FinalLayer, ZImageTransformer2DModel) |
| `candle-transformers/src/models/z_image/text_encoder.rs` | 453 | Qwen3-based Text Encoder |
| `candle-transformers/src/models/z_image/vae.rs` | 684 | AutoEncoderKL (diffusers format) |
| `candle-transformers/src/models/z_image/scheduler.rs` | 237 | FlowMatchEulerDiscreteScheduler |
| `candle-transformers/src/models/z_image/sampling.rs` | 133 | Sampling utilities (noise generation, shift calculation) |
| `candle-transformers/src/models/z_image/preprocess.rs` | 169 | Input preprocessing and image postprocessing |
| `candle-examples/examples/z_image/main.rs` | 393 | Complete inference example |
| `candle-examples/examples/z_image/README.md` | 128 | Example documentation |

Modified Files

| File | Change |
|---|---|
| `candle-transformers/src/models/mod.rs` | Added `pub mod z_image;` |

Implementation Highlights

1. Optimized Patchify/Unpatchify

For the single-frame (F=1) case, patchify/unpatchify drop the frame axis and use 6D tensor operations instead of the 7D reshape/permute in the Python reference, avoiding Candle's limitations with 7D+ tensors:

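```rust
// Patchify: (B, C, 1, H, W) → (B, num_patches, patch_dim)
// Matches Python: permute(1, 3, 5, 2, 4, 6, 0), with the frame axes dropped since F = 1
// Surrounding reshapes sketched for context (b, c, h, w = input dims, p = patch size)
let x = x.reshape((b, c, h / p, p, w / p, p))?; // (B, C, H_t, pH, W_t, pW)
let x = x.permute((0, 2, 4, 3, 5, 1))?; // (B, H_t, W_t, pH, pW, C)
let x = x.reshape((b, (h / p) * (w / p), p * p * c))?; // (B, num_patches, patch_dim)
```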
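The index arithmetic behind this patchify can be checked without Candle. The sketch below is a hypothetical std-only version operating on one flat, row-major (C, H, W) frame. It writes each patch in (pH, pW, C) order, the same element order as the (B, H_t, W_t, pH, pW, C) layout produced by the permute, and `unpatchify` inverts it exactly. The function names are illustrative, not the PR's.

```rust
/// Calls `f(src, dst)` for every element: `src` indexes a row-major (C, H, W)
/// frame, `dst` indexes the patchified (num_patches, p * p * C) buffer with
/// patches in row-major (H_t, W_t) order and elements in (pH, pW, C) order.
fn for_each_patch_index(c: usize, h: usize, w: usize, p: usize, mut f: impl FnMut(usize, usize)) {
    assert!(h % p == 0 && w % p == 0, "H and W must be divisible by the patch size");
    let (ht, wt) = (h / p, w / p);
    for hi in 0..ht {
        for wi in 0..wt {
            for ph in 0..p {
                for pw in 0..p {
                    for ci in 0..c {
                        let src = ci * h * w + (hi * p + ph) * w + (wi * p + pw);
                        let dst = (((hi * wt + wi) * p + ph) * p + pw) * c + ci;
                        f(src, dst);
                    }
                }
            }
        }
    }
}

/// (C, H, W) -> (num_patches, p * p * C), the same layout as the permute above.
fn patchify(x: &[f32], c: usize, h: usize, w: usize, p: usize) -> Vec<f32> {
    assert_eq!(x.len(), c * h * w);
    let mut out = vec![0.0; x.len()];
    for_each_patch_index(c, h, w, p, |src, dst| out[dst] = x[src]);
    out
}

/// Exact inverse of `patchify`: (num_patches, p * p * C) -> (C, H, W).
fn unpatchify(x: &[f32], c: usize, h: usize, w: usize, p: usize) -> Vec<f32> {
    assert_eq!(x.len(), c * h * w);
    let mut out = vec![0.0; x.len()];
    for_each_patch_index(c, h, w, p, |src, dst| out[src] = x[dst]);
    out
}
```

Sharing one index visitor between both directions guarantees the round trip is lossless, which is a cheap way to test a Candle implementation against.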

2. 3D RoPE Position Encoding

Implements 3D Rotary Position Embeddings with pre-computed sin/cos caches:

```rust
pub struct RopeEmbedder {
    axes_dims: Vec<usize>,  // [32, 48, 48] for Frame/H/W
    axes_lens: Vec<usize>,  // [1536, 512, 512] max positions
    cos_cached: Vec<Tensor>,
    sin_cached: Vec<Tensor>,
}
```
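A std-only sketch of how per-axis sin/cos caches combine into a 3D RoPE is below. Assumptions: `AxisRope` and `apply_rope_3d` are hypothetical names, each axis rotates its own contiguous slice of the head dimension, pairs are interleaved (x[2k], x[2k+1]) as in complex-multiplication RoPE, and `theta` is a free parameter because the PR does not state Z-Image's value.

```rust
/// Precomputed cos/sin tables for one RoPE axis (hypothetical type).
/// Row `pos` holds the angles pos * theta^(-2k / dim) for k in 0..dim/2.
struct AxisRope {
    half: usize,
    cos: Vec<f32>,
    sin: Vec<f32>,
}

impl AxisRope {
    fn new(dim: usize, max_len: usize, theta: f32) -> Self {
        assert!(dim % 2 == 0, "each axis dim must be even");
        let half = dim / 2;
        let mut cos = Vec::with_capacity(max_len * half);
        let mut sin = Vec::with_capacity(max_len * half);
        for pos in 0..max_len {
            for k in 0..half {
                let inv_freq = 1.0 / theta.powf((2 * k) as f32 / dim as f32);
                let angle = pos as f32 * inv_freq;
                cos.push(angle.cos());
                sin.push(angle.sin());
            }
        }
        Self { half, cos, sin }
    }
}

/// Rotates one head vector in place for a token at `pos` = [frame, row, col].
/// Axis i owns the next 2 * half_i entries of `x`; each pair is rotated by
/// the angle cached for that axis at that position.
fn apply_rope_3d(x: &mut [f32], axes: &[AxisRope], pos: &[usize]) {
    assert_eq!(axes.len(), pos.len(), "one position index per axis");
    let total: usize = axes.iter().map(|a| 2 * a.half).sum();
    assert_eq!(total, x.len(), "axis dims must sum to the head dim");
    let mut offset = 0;
    for (axis, &p) in axes.iter().zip(pos) {
        let row = p * axis.half; // start of this position's row in the cache
        for k in 0..axis.half {
            let (c, s) = (axis.cos[row + k], axis.sin[row + k]);
            let (a, b) = (x[offset + 2 * k], x[offset + 2 * k + 1]);
            x[offset + 2 * k] = a * c - b * s;
            x[offset + 2 * k + 1] = a * s + b * c;
        }
        offset += 2 * axis.half;
    }
}
```

With the configuration above, the three axes would be built with dims 32, 48, 48 and lengths 1536, 512, 512, and each token rotated by its (frame, row, column) index.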

3. AdaLN Modulation with Tanh Gate

```rust
// Z-Image specific: tanh gate instead of sigmoid
let gate_msa = gate_msa.tanh()?;
let gate_mlp = gate_mlp.tanh()?;
```
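The effect of the tanh gate on a residual update can be sketched as follows. `gated_residual` is a hypothetical std-only helper, and it assumes the gate multiplies the attention or MLP branch output before that output is added back to the residual stream, as in adaLN-style DiT blocks.

```rust
/// Residual update with a tanh-squashed adaLN gate (hypothetical helper).
/// `branch` is the attention or MLP output; `gate` comes from the timestep
/// conditioning. Each channel adds at most |branch| to the residual stream.
fn gated_residual(x: &[f32], branch: &[f32], gate: &[f32]) -> Vec<f32> {
    assert!(x.len() == branch.len() && x.len() == gate.len(), "length mismatch");
    x.iter()
        .zip(branch)
        .zip(gate)
        .map(|((xi, bi), gi)| xi + gi.tanh() * bi)
        .collect()
}
```

Because tanh(0) = 0, a zero gate turns the block into an identity, and the gate's magnitude is bounded by 1 in either sign; a sigmoid gate, by contrast, is 0.5 at zero input and can never be negative.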

4. Dynamic Timestep Shifting

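```rust
/// Linearly interpolates the timestep-shift parameter `mu` from the image token count.
pub fn calculate_shift(seq_len: usize, base_seq: usize, max_seq: usize, base_shift: f64, max_shift: f64) -> f64 {
    // Cast before subtracting: `seq_len - base_seq` on usize underflows when seq_len < base_seq.
    let m = (max_shift - base_shift) / (max_seq as f64 - base_seq as f64);
    base_shift + m * (seq_len as f64 - base_seq as f64)
}
```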
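To show where `mu` goes, the sketch below pairs an underflow-safe `calculate_shift` with the exponential time shift used by diffusers' flow-matching schedulers, sigma' = e^mu / (e^mu + (1/sigma - 1)). `time_shift` is a hypothetical name, and the constants in the usage note (256, 4096, 0.5, 1.15) are the diffusers scheduler defaults, used here as an assumption; Z-Image's scheduler config may use different values.

```rust
/// Linear interpolation of mu from the token count (f64 math avoids usize underflow).
fn calculate_shift(seq_len: usize, base_seq: usize, max_seq: usize, base_shift: f64, max_shift: f64) -> f64 {
    let m = (max_shift - base_shift) / (max_seq as f64 - base_seq as f64);
    base_shift + m * (seq_len as f64 - base_seq as f64)
}

/// Exponential time shift: sigma' = e^mu / (e^mu + (1/sigma - 1)).
/// For mu > 0, every sigma in (0, 1) is pushed toward 1 (more noise);
/// the endpoints 0 and 1 are left unchanged.
fn time_shift(mu: f64, sigma: f64) -> f64 {
    if sigma <= 0.0 {
        return 0.0;
    }
    let e = mu.exp();
    e / (e + (1.0 / sigma - 1.0))
}
```

For a 1024×1024 image (a 128×128 latent split into 2×2 patches, i.e. 4096 tokens), `calculate_shift(4096, 256, 4096, 0.5, 1.15)` returns approximately 1.15, the maximum shift, so larger images spend more of the schedule at high noise levels.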

Image Size Requirements

Image dimensions must be divisible by 16, since the VAE downsamples by 8 and the transformer then groups the latent into 2×2 patches:

✅ 1024×1024, 1024×768, 768×1024, 512×512, 1280×720
❌ 1920×1080 (1080 is not divisible by 16)

Latent size formula: latent = 2 × (image_size ÷ 16) = image_size ÷ 8 (e.g. 1024 → 128)
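The size rule can be checked up front, before loading any weights. A minimal sketch, assuming a VAE downsampling factor of 8 and a 2×2 patch size (consistent with the divisibility-by-16 rule and the latent formula above); `latent_dims` is a hypothetical helper, not part of the example's CLI.

```rust
/// Assumed VAE spatial downsampling factor and transformer patch size.
const VAE_SCALE: usize = 8;
const PATCH: usize = 2;

/// Returns (latent_h, latent_w, num_image_tokens), or an error for invalid sizes.
fn latent_dims(height: usize, width: usize) -> Result<(usize, usize, usize), String> {
    let align = VAE_SCALE * PATCH; // 16
    if height == 0 || width == 0 || height % align != 0 || width % align != 0 {
        return Err(format!("{width}x{height}: both sides must be positive multiples of {align}"));
    }
    let (lh, lw) = (height / VAE_SCALE, width / VAE_SCALE);
    Ok((lh, lw, (lh / PATCH) * (lw / PATCH)))
}
```

Note the argument order is (height, width): `latent_dims(768, 1024)` gives a 96×128 latent and 3072 image tokens for the 1024×768 example above, while 1920×1080 is rejected.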

📝 Testing Status

| Test | Status |
|---|---|
| `cargo check --features metal` | ✅ Pass |
| `cargo clippy --workspace --tests --examples --benches -- -D warnings` | ✅ Pass |
| `cargo fmt --all -- --check` | ✅ Pass |
| Inference test (1024×768, Metal) | ✅ Pass |
| Inference test (1024×1024, Metal) | ✅ Pass |

Sample Output

[Sample images: Metal, CUDA]

Checklist

Code compiles without errors
Passes `cargo clippy --workspace --tests --examples --benches -- -D warnings`
Passes `cargo fmt --all -- --check`
Example runs successfully
README documentation added
Follows Candle architecture conventions
Weight mapping matches original implementation

References

Z-Image
Diffusers

Additional Fix: Clippy Warning in `candle-nn`

While adding SDPA support for Z-Image, I found a minor clippy warning in `candle-nn/src/ops.rs:1040` introduced by PR #3196 (cc @EricLBuehler).

Issue: `clippy::nonminimal_bool` warning

```rust
// Before
let supports_sdpa_full_mask = !self.mask.is_some() || q_seq <= k_seq;

// After
let supports_sdpa_full_mask = self.mask.is_none() || q_seq <= k_seq;
```
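The two expressions agree for every input, which is why the change is behavior-preserving. A small std-only check, with hypothetical function names and a generic `Option<T>` standing in for the real mask type:

```rust
/// Hypothetical stand-ins for the `supports_sdpa_full_mask` expression.
#[allow(clippy::nonminimal_bool)] // the "before" form is exactly what clippy flags
fn full_mask_before<T>(mask: &Option<T>, q_seq: usize, k_seq: usize) -> bool {
    !mask.is_some() || q_seq <= k_seq
}

fn full_mask_after<T>(mask: &Option<T>, q_seq: usize, k_seq: usize) -> bool {
    mask.is_none() || q_seq <= k_seq
}
```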

SpenserCai added 5 commits December 24, 2025 10:12

01de593 init z-image
e7071ae fixed patchify, unpatchify and latent
3f2781d update z_image examples readme
7a40c96 fixed clippy and rustfmt
c5db88d fixed z_image example readme links

SpenserCai mentioned this pull request in Model Wishlist #1177 (open) on Dec 24, 2025

c2e9336 support sdpa and flash-attn in Z-Image and fixed sdpa clippy warning
