@SpenserCai SpenserCai commented Dec 24, 2025

Summary

This PR adds support for Z-Image, Alibaba's ~6B-parameter text-to-image generation model based on Flow Matching. The implementation follows Candle's architecture conventions and includes the full inference pipeline.

Model Overview

Z-Image is a text-to-image model built from the following components:

  • Transformer: ~6B-parameter DiT with 30 main layers + 2 noise refiner layers + 2 context refiner layers
  • Text Encoder: Qwen3-based encoder (outputs second-to-last hidden states)
  • VAE: AutoEncoderKL with diffusers format weights
  • Scheduler: FlowMatchEulerDiscreteScheduler with dynamic timestep shifting
  • Position Encoding: 3D RoPE (Frame/Height/Width axes)
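
As a rough illustration of the scheduler bullet above, the following sketch shows a single Euler update for a flow-matching sampler. It works on plain slices rather than Candle tensors. The names `euler_step` and `linear_sigmas` are hypothetical and not taken from this PR, and the unshifted linear schedule is an assumption made only for illustration.

```rust
/// One Euler step of a flow-matching sampler:
/// x_next = x + (sigma_next - sigma) * v,
/// where `velocity` is the model's predicted velocity at noise level `sigma`.
/// (Hypothetical helper; the PR's scheduler operates on Candle tensors.)
fn euler_step(sample: &[f32], velocity: &[f32], sigma: f32, sigma_next: f32) -> Vec<f32> {
    assert_eq!(sample.len(), velocity.len());
    let dt = sigma_next - sigma;
    sample.iter().zip(velocity).map(|(x, v)| x + dt * v).collect()
}

/// Linearly spaced noise levels from 1.0 down to 0.0 (num_steps + 1 values).
/// Assumption: no timestep shifting is applied here; num_steps must be > 0.
fn linear_sigmas(num_steps: usize) -> Vec<f32> {
    assert!(num_steps > 0);
    (0..=num_steps)
        .map(|i| 1.0 - i as f32 / num_steps as f32)
        .collect()
}
```

A sampling loop would call `euler_step` once per adjacent pair of sigmas, feeding the model's output for the current sample in as `velocity`.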

Model Links:

🔧 Usage Examples

Basic Usage (CUDA)

cargo run --features cuda --example z_image --release -- \
    --model-path weights/Z-Image-Turbo \
    --prompt "A beautiful landscape with mountains and a lake" \
    --width 1024 --height 768 \
    --num-steps 8

Using Metal (macOS)

cargo run --features metal --example z_image --release -- \
    --model-path weights/Z-Image-Turbo \
    --prompt "A futuristic city at night with neon lights" \
    --width 1024 --height 1024 \
    --num-steps 9

Files Changed

New Files

| File | Lines | Description |
| --- | --- | --- |
| candle-transformers/src/models/z_image/mod.rs | 34 | Module exports |
| candle-transformers/src/models/z_image/transformer.rs | 940 | Core Transformer (Config, TimestepEmbedder, RopeEmbedder, ZImageAttention, ZImageTransformerBlock, FinalLayer, ZImageTransformer2DModel) |
| candle-transformers/src/models/z_image/text_encoder.rs | 453 | Qwen3-based Text Encoder |
| candle-transformers/src/models/z_image/vae.rs | 684 | AutoEncoderKL (diffusers format) |
| candle-transformers/src/models/z_image/scheduler.rs | 237 | FlowMatchEulerDiscreteScheduler |
| candle-transformers/src/models/z_image/sampling.rs | 133 | Sampling utilities (noise generation, shift calculation) |
| candle-transformers/src/models/z_image/preprocess.rs | 169 | Input preprocessing (image postprocessing) |
| candle-examples/examples/z_image/main.rs | 393 | Complete inference example |
| candle-examples/examples/z_image/README.md | 128 | Example documentation |

Modified Files

| File | Change |
| --- | --- |
| candle-transformers/src/models/mod.rs | Added pub mod z_image; |

Implementation Highlights

1. Optimized Patchify/Unpatchify

The implementation uses 6D tensor operations specialized for the F=1 (single-frame) case, which avoids Candle's limitations on tensors with 7 or more dimensions:

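// Patchify: (B, C, 1, H, W) → (B, num_patches, patch_dim)
// The Python reference uses a 7D permute(1, 3, 5, 2, 4, 6, 0); with F=1 this reduces to the 6D permute below.
// Input here is already reshaped to (B, C, H_t, pH, W_t, pW).
let x = x.permute((0, 2, 4, 3, 5, 1))?;  // (B, H_t, W_t, pH, pW, C)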
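
To make the target layout of the permute above concrete, here is a minimal sketch that performs the same reordering with explicit index arithmetic on a flat buffer, without Candle. The function name `patchify`, the single-image `(C, H, W)` input layout, and the patch size parameter `p` are assumptions for illustration only.

```rust
/// Patchify one single-frame image stored row-major as (C, H, W) into
/// (H/p) * (W/p) tokens, each of length p * p * C.
/// Tokens are ordered by (H_t, W_t); values inside a token by (pH, pW, C),
/// matching the (B, H_t, W_t, pH, pW, C) layout for one batch element.
fn patchify(x: &[f32], c: usize, h: usize, w: usize, p: usize) -> Vec<Vec<f32>> {
    assert_eq!(x.len(), c * h * w);
    assert!(p > 0 && h % p == 0 && w % p == 0, "H and W must be multiples of p");
    let (h_t, w_t) = (h / p, w / p);
    let mut tokens = Vec::with_capacity(h_t * w_t);
    for ht in 0..h_t {
        for wt in 0..w_t {
            let mut token = Vec::with_capacity(p * p * c);
            for ph in 0..p {
                for pw in 0..p {
                    for ch in 0..c {
                        let row = ht * p + ph;
                        let col = wt * p + pw;
                        token.push(x[ch * h * w + row * w + col]);
                    }
                }
            }
            tokens.push(token);
        }
    }
    tokens
}
```

The channel index varies fastest inside each token, which is what moving `C` to the last axis in the permute achieves.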

2. 3D RoPE Position Encoding

Implements 3D Rotary Position Embeddings with pre-computed sin/cos caches:

pub struct RopeEmbedder {
    axes_dims: Vec<usize>,  // [32, 48, 48] for Frame/H/W
    axes_lens: Vec<usize>,  // [1536, 512, 512] max positions
    cos_cached: Vec<Tensor>,
    sin_cached: Vec<Tensor>,
}
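
The following is a minimal sketch of how such per-axis caches could be pre-computed and looked up, written with plain vectors instead of Candle tensors. The names `AxisTable`, `axis_table`, and `gather_cos` are hypothetical, and the `theta` base is a caller-supplied assumption; the value used by the PR is not stated here.

```rust
/// (cos, sin) tables for one axis, each indexed as [position][frequency].
type AxisTable = (Vec<Vec<f64>>, Vec<Vec<f64>>);

/// Build the table for one RoPE axis with `dim` channels (dim / 2 frequencies).
/// The angle for position `pos` and frequency `j` is pos * theta^(-2j / dim).
fn axis_table(dim: usize, max_len: usize, theta: f64) -> AxisTable {
    assert!(dim % 2 == 0, "each axis needs an even number of channels");
    let freqs: Vec<f64> = (0..dim / 2)
        .map(|j| theta.powf(-(2.0 * j as f64) / dim as f64))
        .collect();
    let mut cos = Vec::with_capacity(max_len);
    let mut sin = Vec::with_capacity(max_len);
    for pos in 0..max_len {
        cos.push(freqs.iter().map(|f| (pos as f64 * f).cos()).collect());
        sin.push(freqs.iter().map(|f| (pos as f64 * f).sin()).collect());
    }
    (cos, sin)
}

/// Concatenate the cached cos rows for a token at (frame, row, col).
/// With axes_dims = [32, 48, 48] this yields 16 + 24 + 24 = 64 values.
fn gather_cos(tables: &[AxisTable], pos: [usize; 3]) -> Vec<f64> {
    tables
        .iter()
        .zip(pos)
        .flat_map(|((cos, _), p)| cos[p].clone())
        .collect()
}
```

Pre-computing one table per axis keeps the cache small: its size grows with the sum of the axis lengths rather than their product.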

3. AdaLN Modulation with Tanh Gate

// Z-Image specific: tanh gate instead of sigmoid
let gate_msa = gate_msa.tanh()?;
let gate_mlp = gate_mlp.tanh()?;
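
As a sketch of what the tanh gate does to a residual branch, the helper below applies `x + tanh(gate) * branch` elementwise on plain slices. The name `gated_residual` is hypothetical, and where exactly the gate sits relative to the normalization layers in the PR's block is not shown in this description, so treat the placement as an assumption.

```rust
/// Gated residual update: out = x + tanh(gate) * branch, elementwise.
/// tanh keeps the effective gate in (-1, 1), so the branch can be
/// attenuated or sign-flipped but never amplified.
fn gated_residual(x: &[f32], gate: &[f32], branch: &[f32]) -> Vec<f32> {
    assert!(x.len() == gate.len() && x.len() == branch.len());
    x.iter()
        .zip(gate)
        .zip(branch)
        .map(|((x, g), b)| x + g.tanh() * b)
        .collect()
}
```

A gate of zero leaves the input unchanged, which makes the block behave as an identity for that channel.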

4. Dynamic Timestep Shifting

pub fn calculate_shift(seq_len: usize, base_seq: usize, max_seq: usize, base_shift: f64, max_shift: f64) -> f64 {
    // Cast to f64 before subtracting so that seq_len < base_seq cannot underflow usize.
    let m = (max_shift - base_shift) / (max_seq as f64 - base_seq as f64);
    base_shift + m * (seq_len as f64 - base_seq as f64)
}
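
The sketch below works through the linear interpolation above and shows one common way the resulting value is applied to a noise level. The helper names `shift_mu` and `time_shift` are hypothetical. The exponential shift formula and the constants used in the usage note (base_seq = 256, max_seq = 4096, base_shift = 0.5, max_shift = 1.15) are assumptions borrowed from typical flow-matching pipelines, not values confirmed by this PR.

```rust
/// Linear interpolation of the shift parameter over the sequence length.
/// All arithmetic is done in f64, so seq_len < base_seq is safe.
fn shift_mu(seq_len: usize, base_seq: usize, max_seq: usize, base_shift: f64, max_shift: f64) -> f64 {
    let m = (max_shift - base_shift) / (max_seq as f64 - base_seq as f64);
    base_shift + m * (seq_len as f64 - base_seq as f64)
}

/// Exponential time shift (assumed form):
/// sigma' = e^mu / (e^mu + (1 / sigma - 1)).
/// sigma = 1 stays at 1, sigma = 0 maps to 0, and mu = 0 is the identity.
fn time_shift(mu: f64, sigma: f64) -> f64 {
    let e = mu.exp();
    e / (e + (1.0 / sigma - 1.0))
}
```

With the assumed constants, a sequence of 256 tokens gives the base shift 0.5 and a sequence of 4096 tokens gives the maximum shift 1.15; lengths in between are interpolated linearly. A larger shift pushes intermediate sigmas toward 1, so more of the step budget is spent at high noise levels.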

Image Size Requirements

Image dimensions must be divisible by 16:

  • ✅ 1024×1024, 1024×768, 768×1024, 512×512, 1280×720
  • ❌ 1920×1080 (1080 is not divisible by 16)

Latent size formula: latent = 2 × (image_size ÷ 16)
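
The divisibility rule and the latent size formula above can be checked with a small helper like the following sketch. The function name `latent_size` and the error message are hypothetical and not part of the PR.

```rust
/// Validate one image dimension and return the matching latent dimension,
/// using latent = 2 * (image_size / 16) from the formula above.
fn latent_size(image_size: usize) -> Result<usize, String> {
    if image_size == 0 || image_size % 16 != 0 {
        return Err(format!(
            "image size {image_size} is not a positive multiple of 16"
        ));
    }
    Ok(2 * (image_size / 16))
}
```

For example, 1024 maps to a latent dimension of 128 and 720 maps to 90, while 1080 is rejected because 1080 ÷ 16 = 67.5.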

📝 Testing Status

| Test | Status |
| --- | --- |
| cargo check --features metal | ✅ Pass |
| cargo clippy --workspace --tests --examples --benches -- -D warnings | ✅ Pass |
| cargo fmt --all -- --check | ✅ Pass |
| Inference test (1024×768, Metal) | ✅ Pass |
| Inference test (1024×1024, Metal) | ✅ Pass |

Sample Output

Metal

[Sample image generated on Metal]

CUDA

[Sample image generated on CUDA]

Checklist

  • Code compiles without errors
  • Passes cargo clippy --workspace --tests --examples --benches -- -D warnings
  • Passes cargo fmt --all -- --check
  • Example runs successfully
  • README documentation added
  • Follows Candle architecture conventions
  • Weight mapping matches original implementation

References

Z-Image
Diffusers

Additional Fix: Clippy Warning in candle-nn

While implementing SDPA support for Z-Image, I discovered a minor clippy warning in candle-nn/src/ops.rs:1040 introduced by PR #3196. @EricLBuehler

Issue: clippy::nonminimal_bool warning

// Before
let supports_sdpa_full_mask = !self.mask.is_some() || q_seq <= k_seq;

// After
let supports_sdpa_full_mask = self.mask.is_none() || q_seq <= k_seq;

@SpenserCai SpenserCai mentioned this pull request Dec 24, 2025