
Commit d95b993

[docs] T2I (huggingface#7623)
* refactor t2i
* add code snippets
1 parent 1d48029 commit d95b993


3 files changed: +241 -232 lines changed


Diff for: docs/source/en/_toctree.yml (+3 -1)
@@ -86,6 +86,8 @@
    title: Kandinsky
  - local: using-diffusers/controlnet
    title: ControlNet
+ - local: using-diffusers/t2i_adapter
+   title: T2I-Adapter
  - local: using-diffusers/shap-e
    title: Shap-E
  - local: using-diffusers/diffedit
@@ -358,7 +360,7 @@
  - local: api/pipelines/stable_diffusion/ldm3d_diffusion
    title: LDM3D Text-to-(RGB, Depth), Text-to-(RGB-pano, Depth-pano), LDM3D Upscaler
  - local: api/pipelines/stable_diffusion/adapter
-   title: Stable Diffusion T2I-Adapter
+   title: T2I-Adapter
  - local: api/pipelines/stable_diffusion/gligen
    title: GLIGEN (Grounded Language-to-Image Generation)
  title: Stable Diffusion

Diff for: docs/source/en/api/pipelines/stable_diffusion/adapter.md (+19 -231)
@@ -10,9 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
  specific language governing permissions and limitations under the License.
  -->

- # Text-to-Image Generation with Adapter Conditioning
-
- ## Overview
+ # T2I-Adapter

  [T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.08453) by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie.

@@ -24,236 +22,26 @@ The abstract of the paper is the following:

  This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❤️ .

- ## Available Pipelines:
-
- | Pipeline | Tasks | Demo
- |---|---|:---:|
- | [StableDiffusionAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning* | -
- | [StableDiffusionXLAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_xl_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning on StableDiffusion-XL* | -
-
- ## Usage example with the base model of StableDiffusion-1.4/1.5
-
- In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
- All adapters use the same pipeline.
-
- 1. Images are first converted into the appropriate *control image* format.
- 2. The *control image* and *prompt* are passed to the [`StableDiffusionAdapterPipeline`].
-
- Let's have a look at a simple example using the [Color Adapter](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1).
-
- ```python
- from diffusers.utils import load_image, make_image_grid
-
- image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png")
- ```
-
- ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png)
-
-
- Then we can create our color palette by simply resizing it to 8 by 8 pixels and then scaling it back to the original size.
-
- ```python
- from PIL import Image
-
- color_palette = image.resize((8, 8))
- color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)
- ```
-
- Let's take a look at the processed image.
-
- ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_palette.png)
-
-
- Next, create the adapter pipeline
-
- ```py
- import torch
- from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
-
- adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
- pipe = StableDiffusionAdapterPipeline.from_pretrained(
-     "CompVis/stable-diffusion-v1-4",
-     adapter=adapter,
-     torch_dtype=torch.float16,
- )
- pipe.to("cuda")
- ```
-
- Finally, pass the prompt and control image to the pipeline
-
- ```py
- # fix the random seed, so you will get the same result as the example
- generator = torch.Generator("cuda").manual_seed(7)
-
- out_image = pipe(
-     "At night, glowing cubes in front of the beach",
-     image=color_palette,
-     generator=generator,
- ).images[0]
- make_image_grid([image, color_palette, out_image], rows=1, cols=3)
- ```
-
- ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_output.png)
-
- ## Usage example with the base model of StableDiffusion-XL
-
- In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
- All adapters use the same pipeline.
-
- 1. Images are first converted into the appropriate *control image* format.
- 2. The *control image* and *prompt* are passed to the [`StableDiffusionXLAdapterPipeline`].
-
- Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0).
-
- ```python
- from diffusers.utils import load_image, make_image_grid
-
- sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")
- ```
-
- ![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png)
-
- Then, create the adapter pipeline
-
- ```py
- import torch
- from diffusers import (
-     T2IAdapter,
-     StableDiffusionXLAdapterPipeline,
-     DDPMScheduler
- )
-
- model_id = "stabilityai/stable-diffusion-xl-base-1.0"
- adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0", torch_dtype=torch.float16, adapter_type="full_adapter_xl")
- scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
-
- pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
-     model_id, adapter=adapter, safety_checker=None, torch_dtype=torch.float16, variant="fp16", scheduler=scheduler
- )
-
- pipe.to("cuda")
- ```
-
- Finally, pass the prompt and control image to the pipeline
-
- ```py
- # fix the random seed, so you will get the same result as the example
- generator = torch.Generator().manual_seed(42)
-
- sketch_image_out = pipe(
-     prompt="a photo of a dog in real world, high quality",
-     negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
-     image=sketch_image,
-     generator=generator,
-     guidance_scale=7.5
- ).images[0]
- make_image_grid([sketch_image, sketch_image_out], rows=1, cols=2)
- ```
-
- ![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch_output.png)
-
- ## Available checkpoints
-
- Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://huggingface.co/TencentARC/T2I-Adapter/tree/main/models).
-
- ### T2I-Adapter with Stable Diffusion 1.4
-
- | Model Name | Control Image Overview | Control Image Example | Generated Image Example |
- |---|---|---|---|
- |[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1)<br/> *Trained with spatial color palette* | An image with 8x8 color palette.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_sample_output.png"/></a>|
- |[TencentARC/t2iadapter_canny_sd14v1](https://huggingface.co/TencentARC/t2iadapter_canny_sd14v1)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/canny_sample_output.png"/></a>|
- |[TencentARC/t2iadapter_sketch_sd14v1](https://huggingface.co/TencentARC/t2iadapter_sketch_sd14v1)<br/> *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/sketch_sample_output.png"/></a>|
- |[TencentARC/t2iadapter_depth_sd14v1](https://huggingface.co/TencentARC/t2iadapter_depth_sd14v1)<br/> *Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_output.png"/></a>|
- |[TencentARC/t2iadapter_openpose_sd14v1](https://huggingface.co/TencentARC/t2iadapter_openpose_sd14v1)<br/> *Trained with OpenPose bone image* | An [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/openpose_sample_output.png"/></a>|
- |[TencentARC/t2iadapter_keypose_sd14v1](https://huggingface.co/TencentARC/t2iadapter_keypose_sd14v1)<br/> *Trained with mmpose skeleton image* | A [mmpose skeleton](https://github.com/open-mmlab/mmpose) image.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_output.png"/></a>|
- |[TencentARC/t2iadapter_seg_sd14v1](https://huggingface.co/TencentARC/t2iadapter_seg_sd14v1)<br/>*Trained with semantic segmentation* | A [custom](https://github.com/TencentARC/T2I-Adapter/discussions/25) segmentation protocol image.|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_input.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_input.png"/></a>|<a href="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_output.png"><img width="64" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/seg_sample_output.png"/></a> |
- |[TencentARC/t2iadapter_canny_sd15v2](https://huggingface.co/TencentARC/t2iadapter_canny_sd15v2)||
- |[TencentARC/t2iadapter_depth_sd15v2](https://huggingface.co/TencentARC/t2iadapter_depth_sd15v2)||
- |[TencentARC/t2iadapter_sketch_sd15v2](https://huggingface.co/TencentARC/t2iadapter_sketch_sd15v2)||
- |[TencentARC/t2iadapter_zoedepth_sd15v1](https://huggingface.co/TencentARC/t2iadapter_zoedepth_sd15v1)||
- |[Adapter/t2iadapter, subfolder='sketch_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0)||
- |[Adapter/t2iadapter, subfolder='canny_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/canny_sdxl_1.0)||
- |[Adapter/t2iadapter, subfolder='openpose_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/openpose_sdxl_1.0)||
-
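The canny adapters in the table above expect the monochrome "white edges on a black background" control format. Here is a minimal sketch of preparing such a control image with OpenCV; the example image URL and the 100/200 thresholds are placeholder choices, not values taken from the table.

```py
import cv2
import numpy as np
from PIL import Image
from diffusers.utils import load_image

# Any RGB photo works as a starting point; this URL is just an example image.
image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png")

# Canny edge detection yields a single-channel image with white edges on black,
# which matches the control format described for the canny adapters.
edges = cv2.Canny(np.array(image), 100, 200)
canny_image = Image.fromarray(edges)  # pass this as the control image to the adapter pipeline
```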
- ## Combining multiple adapters
-
- [`MultiAdapter`] can be used for applying multiple conditionings at once.
-
- Here we use the keypose adapter for the character posture and the depth adapter for creating the scene.
-
- ```py
- from diffusers.utils import load_image, make_image_grid
-
- cond_keypose = load_image(
-     "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"
- )
- cond_depth = load_image(
-     "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"
- )
- cond = [cond_keypose, cond_depth]
-
- prompt = ["A man walking in an office room with a nice view"]
- ```
-
- The two control images look like this:
-
- ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png)
- ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png)
-
-
- `MultiAdapter` combines keypose and depth adapters.
-
- `adapter_conditioning_scale` balances the relative influence of the different adapters.
-
- ```py
- import torch
- from diffusers import StableDiffusionAdapterPipeline, MultiAdapter, T2IAdapter
-
- adapters = MultiAdapter(
-     [
-         T2IAdapter.from_pretrained("TencentARC/t2iadapter_keypose_sd14v1"),
-         T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd14v1"),
-     ]
- )
- adapters = adapters.to(torch.float16)
-
- pipe = StableDiffusionAdapterPipeline.from_pretrained(
-     "CompVis/stable-diffusion-v1-4",
-     torch_dtype=torch.float16,
-     adapter=adapters,
- ).to("cuda")
-
- image = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8]).images[0]
- make_image_grid([cond_keypose, cond_depth, image], rows=1, cols=3)
- ```
-
- ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_depth_sample_output.png)
-
-
- ## T2I-Adapter vs ControlNet
-
- T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet).
- T2I-Adapter uses a smaller auxiliary network which is only run once for the entire diffusion process.
- However, T2I-Adapter performs slightly worse than ControlNet.
-
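To make the "run once" point concrete, here is a minimal sketch using the color adapter from the earlier example; the random tensor is only a stand-in for a preprocessed control image, and the shapes are illustrative. The adapter's conditioning features are computed in a single forward pass before denoising begins, whereas a ControlNet also consumes the evolving latents and timestep and therefore has to run at every denoising step.

```py
import torch
from diffusers import T2IAdapter

# Same color adapter as in the example above.
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1")

# Stand-in control image as a (batch, channels, height, width) tensor in [0, 1];
# in the pipeline this would come from preprocessing the PIL control image.
control = torch.rand(1, 3, 512, 512)

with torch.no_grad():
    adapter_states = adapter(control)  # one forward pass, before the denoising loop

# A list of multi-scale feature maps, computed once; these fixed residuals are added
# to the UNet features at each denoising step without re-running the adapter.
print([tuple(state.shape) for state in adapter_states])
```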
  ## StableDiffusionAdapterPipeline
+
  [[autodoc]] StableDiffusionAdapterPipeline
- - all
- - __call__
- - enable_attention_slicing
- - disable_attention_slicing
- - enable_vae_slicing
- - disable_vae_slicing
- - enable_xformers_memory_efficient_attention
- - disable_xformers_memory_efficient_attention
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention

  ## StableDiffusionXLAdapterPipeline
+
  [[autodoc]] StableDiffusionXLAdapterPipeline
- - all
- - __call__
- - enable_attention_slicing
- - disable_attention_slicing
- - enable_vae_slicing
- - disable_vae_slicing
- - enable_xformers_memory_efficient_attention
- - disable_xformers_memory_efficient_attention
+ - all
+ - __call__
+ - enable_attention_slicing
+ - disable_attention_slicing
+ - enable_vae_slicing
+ - disable_vae_slicing
+ - enable_xformers_memory_efficient_attention
+ - disable_xformers_memory_efficient_attention
