# candle-z-image: Text-to-Image Generation with Flow Matching

Z-Image is a ~6B parameter text-to-image generation model developed by Alibaba
that uses flow matching for high-quality image synthesis.
Weights: [ModelScope](https://www.modelscope.cn/models/Tongyi-MAI/Z-Image-Turbo),
[HuggingFace](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo).

## Model Architecture

- **Transformer**: ~6B parameter DiT with 30 main layers + 2 noise refiner layers + 2 context refiner layers
- **Text Encoder**: Qwen3-based encoder (outputs second-to-last hidden states)
- **VAE**: AutoEncoderKL with diffusers-format weights
- **Scheduler**: FlowMatchEulerDiscreteScheduler with dynamic shifting

## Running the Model

### Basic Usage (Auto-download from HuggingFace)

```bash
cargo run --features cuda --example z_image --release -- \
    --model turbo \
    --prompt "A beautiful landscape with mountains and a lake" \
    --width 1024 --height 768 \
    --num-steps 8
```

### Using Metal (macOS)

```bash
cargo run --features metal --example z_image --release -- \
    --model turbo \
    --prompt "A futuristic city at night with neon lights" \
    --width 1024 --height 1024 \
    --num-steps 9
```

### Using Local Weights

If you prefer to use locally downloaded weights:

```bash
# Download weights first
hf download Tongyi-MAI/Z-Image-Turbo --local-dir weights/Z-Image-Turbo

# Run with local path
cargo run --features cuda --example z_image --release -- \
    --model turbo \
    --model-path weights/Z-Image-Turbo \
    --prompt "A beautiful landscape with mountains and a lake"
```

### Command-line Flags

| Flag | Description | Default |
|------|-------------|---------|
| `--model` | Model variant to use (`turbo`) | `turbo` |
| `--model-path` | Override path to local weights (optional) | Auto-download |
| `--prompt` | The text prompt for image generation | Required |
| `--negative-prompt` | Negative prompt for CFG guidance | `""` |
| `--width` | Width of the generated image (must be divisible by 16) | `1024` |
| `--height` | Height of the generated image (must be divisible by 16) | `1024` |
| `--num-steps` | Number of denoising steps | Model default (9 for turbo) |
| `--guidance-scale` | Classifier-free guidance scale | `5.0` |
| `--seed` | Random seed for reproducibility | Random |
| `--output` | Output image filename | `z_image_output.png` |
| `--cpu` | Use CPU instead of GPU | `false` |

## Image Size Requirements

Both the width and the height **must be divisible by 16**. For example:

- ✅ 1024×1024, 1024×768, 768×1024, 512×512, 1280×720, 1920×1088
- ❌ 1920×1080 (1080 is not divisible by 16)

If an invalid size is provided, the program will suggest valid alternatives.
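
The check itself is simple to sketch. The snippet below is a minimal illustration, assuming that the nearest multiples of 16 (rounded down and up) are reasonable suggestions. `check_dims` is a hypothetical helper, not the example's actual code, so the real program's messages and suggestions may differ.

```rust
/// Hypothetical helper: returns Ok(()) when both dimensions are multiples of 16,
/// otherwise an error message suggesting the nearest multiples of 16.
fn check_dims(width: usize, height: usize) -> Result<(), String> {
    const MULTIPLE: usize = 16;
    if width % MULTIPLE == 0 && height % MULTIPLE == 0 {
        return Ok(());
    }
    // Round down (but never below 16) and round up to the next multiple.
    let down = |v: usize| (v / MULTIPLE).max(1) * MULTIPLE;
    let up = |v: usize| ((v + MULTIPLE - 1) / MULTIPLE) * MULTIPLE;
    Err(format!(
        "{}x{} is not divisible by {}; try {}x{} or {}x{}",
        width,
        height,
        MULTIPLE,
        down(width),
        down(height),
        up(width),
        up(height)
    ))
}

fn main() {
    for (w, h) in [(1024, 768), (1920, 1080)] {
        match check_dims(w, h) {
            Ok(()) => println!("{}x{}: ok", w, h),
            Err(msg) => println!("{}", msg),
        }
    }
}
```

For 1920×1080 this sketch proposes 1920×1072 and 1920×1088, the two closest valid sizes.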

## Performance Notes

- **Turbo Version**: Z-Image-Turbo is optimized for fast inference and needs only 8-9 denoising steps
- **Memory Usage**: The ~6B model requires significant GPU memory. Reduce the image dimensions if you encounter out-of-memory (OOM) errors

## Example Outputs

```bash
# Landscape (16:9)
cargo run --features metal --example z_image -r -- \
    --model turbo \
    --prompt "A serene mountain lake at sunset, photorealistic, 4k" \
    --width 1280 --height 720 --num-steps 8

# Portrait (3:4)
cargo run --features metal --example z_image -r -- \
    --model turbo \
    --prompt "A portrait of a wise elderly scholar, oil painting style" \
    --width 768 --height 1024 --num-steps 9

# Square (1:1)
cargo run --features metal --example z_image -r -- \
    --model turbo \
    --prompt "A cute robot holding a candle, digital art" \
    --width 1024 --height 1024 --num-steps 8
```

## Technical Details

### Latent Space

The VAE uses an 8× spatial scaling factor between latent and image space (8× upsampling on decode). Latent dimensions are calculated as:

```
latent_height = 2 × (image_height ÷ 16)
latent_width  = 2 × (image_width ÷ 16)
```

For valid sizes this equals `image_height ÷ 8` and `image_width ÷ 8`, and both latent dimensions are always even.
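
As a quick check of the arithmetic, the formula can be written as a small Rust function. This is only a sketch: `latent_dims` is an illustrative name, not part of the candle API.

```rust
/// Illustrative helper: latent (height, width) for a given image size.
/// Returns None when either dimension is not a multiple of 16.
fn latent_dims(image_height: usize, image_width: usize) -> Option<(usize, usize)> {
    if image_height % 16 != 0 || image_width % 16 != 0 {
        return None;
    }
    Some((2 * (image_height / 16), 2 * (image_width / 16)))
}

fn main() {
    // A 1280x720 image (height 720, width 1280).
    println!("{:?}", latent_dims(720, 1280));
    // 1080 is not a multiple of 16, so no latent size is returned.
    println!("{:?}", latent_dims(1080, 1920));
}
```

A 1024×1024 image therefore maps to a 128×128 latent, and 1280×720 maps to a latent of height 90 and width 160.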

### 3D RoPE Position Encoding

Z-Image uses 3D Rotary Position Embeddings with three axes (32 + 48 + 48 = 128 dims in total):

- Frame (temporal): 32 dims, max 1536 positions
- Height (spatial): 48 dims, max 512 positions
- Width (spatial): 48 dims, max 512 positions
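
To make the axis split concrete, here is a rough sketch of how per-axis rotation angles could be computed for a single token. It assumes the standard RoPE frequency schedule `theta^(-2i/d)` applied independently to each axis and concatenated in frame, height, width order. The function name `rope_angles`, the `theta` value passed in `main`, and the concatenation order are all assumptions made for illustration and are not taken from the candle implementation.

```rust
// Axis sizes from the list above: (frame, height, width).
const AXES_DIMS: [usize; 3] = [32, 48, 48];
const AXES_LENS: [usize; 3] = [1536, 512, 512];

/// Hypothetical helper: rotation angles for one token at `pos` = (frame, row, col).
/// Each axis with `d` dims contributes d/2 angles, so the result has 64 entries.
/// Returns None if a position exceeds the maximum for its axis.
fn rope_angles(pos: [usize; 3], theta: f64) -> Option<Vec<f64>> {
    let mut angles = Vec::with_capacity(AXES_DIMS.iter().sum::<usize>() / 2);
    for ((&p, &d), &max_len) in pos.iter().zip(AXES_DIMS.iter()).zip(AXES_LENS.iter()) {
        if p >= max_len {
            return None;
        }
        for i in 0..d / 2 {
            // ASSUMPTION: standard RoPE frequencies, theta^(-2i/d).
            let freq = theta.powf(-(2.0 * i as f64) / d as f64);
            angles.push(p as f64 * freq);
        }
    }
    Some(angles)
}

fn main() {
    // theta = 256.0 is a placeholder value chosen for this sketch.
    let angles = rope_angles([0, 3, 5], 256.0).expect("position in range");
    println!("{} angles, first height angle = {}", angles.len(), angles[16]);
}
```

Each pair of channels in an attention head would then be rotated by the corresponding angle, which is how the three position indices end up encoded in a single 128-dim vector.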

### Dynamic Timestep Shifting

The scheduler uses dynamic shifting based on image sequence length:

```
mu = BASE_SHIFT + (image_seq_len - BASE_SEQ_LEN) / (MAX_SEQ_LEN - BASE_SEQ_LEN) × (MAX_SHIFT - BASE_SHIFT)
```

where `BASE_SHIFT=0.5`, `MAX_SHIFT=1.15`, `BASE_SEQ_LEN=256`, `MAX_SEQ_LEN=4096`.
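
The interpolation can be sketched in a few lines of Rust using the constants above. `calculate_mu` and `image_seq_len` are illustrative names. The sequence-length helper additionally assumes one token per 2×2 latent patch, that is, one token per 16×16 pixel block, which is an assumption of this sketch and not something stated in this README.

```rust
const BASE_SHIFT: f64 = 0.5;
const MAX_SHIFT: f64 = 1.15;
const BASE_SEQ_LEN: f64 = 256.0;
const MAX_SEQ_LEN: f64 = 4096.0;

/// Linear interpolation of the shift `mu` from the image sequence length.
fn calculate_mu(image_seq_len: usize) -> f64 {
    let slope = (MAX_SHIFT - BASE_SHIFT) / (MAX_SEQ_LEN - BASE_SEQ_LEN);
    BASE_SHIFT + (image_seq_len as f64 - BASE_SEQ_LEN) * slope
}

/// ASSUMPTION: one token per 16x16 pixel block (2x2 latent patch).
fn image_seq_len(image_height: usize, image_width: usize) -> usize {
    (image_height / 16) * (image_width / 16)
}

fn main() {
    for (h, w) in [(512, 512), (768, 1024), (1024, 1024)] {
        let seq_len = image_seq_len(h, w);
        println!("{}x{}: seq_len = {}, mu = {:.4}", w, h, seq_len, calculate_mu(seq_len));
    }
}
```

Under that assumption a 1024×1024 image yields 4096 tokens and reaches `MAX_SHIFT`, while smaller images receive a proportionally smaller shift.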