Commit 3a0d1cb

Add Z-Image Text-to-Image Generation Support (#3261)
* init z-image
* fixed patchify, unpatchify and latent
* update z_image examples readme
* fixed clippy and rustfmt
* fixed z_image example readme links
* support sdpa and flash-attn in Z-Image and fixed sdpa clippy warning
* fix some readme
* Update candle-transformers/src/models/z_image/transformer.rs

  Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* support --model in example

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
1 parent d8fb848 commit 3a0d1cb

11 files changed, +3412 -1 lines changed

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
# candle-z-image: Text-to-Image Generation with Flow Matching

Z-Image is a ~6B parameter text-to-image generation model developed by Alibaba
that uses flow matching for high-quality image synthesis. Weights are available on
[ModelScope](https://www.modelscope.cn/models/Tongyi-MAI/Z-Image-Turbo) and
[HuggingFace](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo).

## Model Architecture

- **Transformer**: 6B parameter DiT with 30 main layers + 2 noise refiner layers + 2 context refiner layers
- **Text Encoder**: Qwen3-based encoder (outputs second-to-last hidden states)
- **VAE**: AutoEncoderKL with diffusers format weights
- **Scheduler**: FlowMatchEulerDiscreteScheduler with dynamic shifting

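
As a rough illustration of what a flow-matching Euler scheduler does at each denoising step, the sketch below moves a sample along the predicted velocity by the change in noise level. The function name `euler_step` and the flat `f32` slices are assumptions made for illustration; the actual example works on candle tensors, and its sign conventions may differ.

```rust
/// Hypothetical helper (not from the example's source): one Euler step of a
/// flow-matching sampler, x_next = x + (sigma_next - sigma) * v.
///
/// `sample` is the current noisy latent, `velocity` is the model prediction,
/// and `sigma` / `sigma_next` are the current and next noise levels.
fn euler_step(sample: &[f32], velocity: &[f32], sigma: f32, sigma_next: f32) -> Vec<f32> {
    assert_eq!(sample.len(), velocity.len(), "shape mismatch");
    // dt is negative while denoising, because sigma decreases toward 0.
    let dt = sigma_next - sigma;
    sample
        .iter()
        .zip(velocity)
        .map(|(x, v)| x + dt * v)
        .collect()
}
```

Calling `euler_step(&[1.0, 2.0], &[0.5, -1.0], 1.0, 0.5)` uses `dt = -0.5`, so each element moves by half of its velocity in the opposite direction.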
## Running the Model

### Basic Usage (Auto-download from HuggingFace)

```bash
cargo run --features cuda --example z_image --release -- \
    --model turbo \
    --prompt "A beautiful landscape with mountains and a lake" \
    --width 1024 --height 768 \
    --num-steps 8
```

### Using Metal (macOS)

```bash
cargo run --features metal --example z_image --release -- \
    --model turbo \
    --prompt "A futuristic city at night with neon lights" \
    --width 1024 --height 1024 \
    --num-steps 9
```

### Using Local Weights

If you prefer to use locally downloaded weights:

```bash
# Download weights first
hf download Tongyi-MAI/Z-Image-Turbo --local-dir weights/Z-Image-Turbo

# Run with local path
cargo run --features cuda --example z_image --release -- \
    --model turbo \
    --model-path weights/Z-Image-Turbo \
    --prompt "A beautiful landscape with mountains and a lake"
```

### Command-line Flags

| Flag | Description | Default |
|------|-------------|---------|
| `--model` | Model variant to use (`turbo`) | `turbo` |
| `--model-path` | Override path to local weights (optional) | Auto-download |
| `--prompt` | The text prompt for image generation | Required |
| `--negative-prompt` | Negative prompt for CFG guidance | `""` |
| `--width` | Width of the generated image (must be divisible by 16) | `1024` |
| `--height` | Height of the generated image (must be divisible by 16) | `1024` |
| `--num-steps` | Number of denoising steps | Model default (9 for turbo) |
| `--guidance-scale` | Classifier-free guidance scale | `5.0` |
| `--seed` | Random seed for reproducibility | Random |
| `--output` | Output image filename | `z_image_output.png` |
| `--cpu` | Use CPU instead of GPU | `false` |

## Image Size Requirements

Image dimensions **must be divisible by 16**. Valid sizes include:

- ✅ 1024×1024, 1024×768, 768×1024, 512×512, 1280×720, 1920×1088
- ❌ 1920×1080 (1080 is not divisible by 16)

If an invalid size is provided, the program will suggest valid alternatives.

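
The divisibility check and the suggestion of alternatives could look roughly like the sketch below. The names `is_valid_dim` and `suggest_dims` are hypothetical and not taken from the example's source, and the sketch assumes the suggested alternatives are the nearest multiples of 16 on either side of the requested value.

```rust
/// Image sides must be positive multiples of this value.
const SIZE_MULTIPLE: usize = 16;

/// Hypothetical helper: true if `dim` is a usable image width or height.
fn is_valid_dim(dim: usize) -> bool {
    dim > 0 && dim % SIZE_MULTIPLE == 0
}

/// Hypothetical helper: nearest valid sizes at or below and above `dim`.
/// The lower bound is clamped so it is never smaller than one multiple.
fn suggest_dims(dim: usize) -> (usize, usize) {
    let lower = (dim / SIZE_MULTIPLE) * SIZE_MULTIPLE;
    let upper = lower + SIZE_MULTIPLE;
    (lower.max(SIZE_MULTIPLE), upper)
}
```

For the invalid height 1080, `suggest_dims(1080)` returns 1072 and 1088, which matches the 1920×1088 entry in the list of valid sizes.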
## Performance Notes

- **Turbo Version**: Z-Image-Turbo is optimized for fast inference and needs only 8-9 denoising steps
- **Memory Usage**: The 6B model requires significant GPU memory. If you hit out-of-memory (OOM) errors, reduce the image dimensions

## Example Outputs

```bash
# Landscape (16:9)
cargo run --features metal --example z_image -r -- \
    --model turbo \
    --prompt "A serene mountain lake at sunset, photorealistic, 4k" \
    --width 1280 --height 720 --num-steps 8

# Portrait (3:4)
cargo run --features metal --example z_image -r -- \
    --model turbo \
    --prompt "A portrait of a wise elderly scholar, oil painting style" \
    --width 768 --height 1024 --num-steps 9

# Square (1:1)
cargo run --features metal --example z_image -r -- \
    --model turbo \
    --prompt "A cute robot holding a candle, digital art" \
    --width 1024 --height 1024 --num-steps 8
```

## Technical Details

### Latent Space

The VAE has an 8× spatial scale factor, so each latent dimension is one eighth of the corresponding image dimension. Because image dimensions are multiples of 16, the latent dimensions are calculated as:

```
latent_height = 2 × (image_height ÷ 16)
latent_width  = 2 × (image_width ÷ 16)
```

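
The same calculation as a small Rust sketch. The function name `latent_dims` is hypothetical; it returns `None` for sizes that are not positive multiples of 16, where the integer division would otherwise silently round down.

```rust
/// Hypothetical helper: latent (height, width) for a given image size,
/// following latent = 2 * (image / 16).
fn latent_dims(image_height: usize, image_width: usize) -> Option<(usize, usize)> {
    let valid = |d: usize| d > 0 && d % 16 == 0;
    if !valid(image_height) || !valid(image_width) {
        return None;
    }
    Some((2 * (image_height / 16), 2 * (image_width / 16)))
}
```

A 768-pixel-high, 1024-pixel-wide image gives `latent_dims(768, 1024) == Some((96, 128))`, which is one eighth of each side.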
### 3D RoPE Position Encoding

Z-Image uses 3D Rotary Position Embeddings (RoPE) with three axes:

- Frame (temporal): 32 dims, max 1536 positions
- Height (spatial): 48 dims, max 512 positions
- Width (spatial): 48 dims, max 512 positions

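
To make the axis split concrete, the sketch below computes rotation angles for one token position using the standard RoPE frequency formula, applied independently per axis and concatenated. The three axes contribute 16 + 24 + 24 = 64 angles, one per rotated pair of a 128-dimensional vector. The function name `rope_angles` and the base `ROPE_THETA = 256.0` are assumptions for illustration, not values confirmed by this README.

```rust
/// Per-axis rotary dimensions and maximum positions, as listed above
/// (frame, height, width).
const AXES_DIMS: [usize; 3] = [32, 48, 48];
const AXES_LENS: [usize; 3] = [1536, 512, 512];
/// Assumed frequency base; check the model config for the real value.
const ROPE_THETA: f64 = 256.0;

/// Hypothetical helper: rotation angles for a (frame, height, width) position.
/// Returns `None` if any coordinate exceeds its axis maximum.
fn rope_angles(pos: [usize; 3]) -> Option<Vec<f64>> {
    let mut angles = Vec::with_capacity(AXES_DIMS.iter().sum::<usize>() / 2);
    for ((&p, &dim), &len) in pos.iter().zip(AXES_DIMS.iter()).zip(AXES_LENS.iter()) {
        if p >= len {
            return None;
        }
        // Each pair i of the axis rotates with frequency theta^(-2i / dim).
        for i in 0..dim / 2 {
            let freq = ROPE_THETA.powf(-(2.0 * i as f64) / dim as f64);
            angles.push(p as f64 * freq);
        }
    }
    Some(angles)
}
```

Each angle would then feed the usual cosine and sine rotation of its query/key pair; that step is omitted here.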
### Dynamic Timestep Shifting

The scheduler computes its shift parameter `mu` by linear interpolation on the image sequence length:

```
mu = BASE_SHIFT + (image_seq_len - BASE_SEQ_LEN) / (MAX_SEQ_LEN - BASE_SEQ_LEN) × (MAX_SHIFT - BASE_SHIFT)
```

where `BASE_SHIFT=0.5`, `MAX_SHIFT=1.15`, `BASE_SEQ_LEN=256`, `MAX_SEQ_LEN=4096`.
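
A minimal Rust sketch of this interpolation, using the constants given above. The function name `calculate_shift` is an assumption, and the sketch does not clamp, so sequence lengths outside the 256 to 4096 range extrapolate linearly.

```rust
const BASE_SHIFT: f64 = 0.5;
const MAX_SHIFT: f64 = 1.15;
const BASE_SEQ_LEN: f64 = 256.0;
const MAX_SEQ_LEN: f64 = 4096.0;

/// Hypothetical helper: shift `mu` for a given image sequence length.
fn calculate_shift(image_seq_len: usize) -> f64 {
    // Fraction of the way from BASE_SEQ_LEN to MAX_SEQ_LEN.
    let t = (image_seq_len as f64 - BASE_SEQ_LEN) / (MAX_SEQ_LEN - BASE_SEQ_LEN);
    BASE_SHIFT + t * (MAX_SHIFT - BASE_SHIFT)
}
```

A sequence length of 256 yields `mu = 0.5`, 4096 yields `mu = 1.15`, and the midpoint 2176 yields `mu = 0.825`.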
