The Modular Diffusers #9672
base: main
Conversation
Very cool! |
Hi, this is very interesting! I'm making a Python pipeline-flow visual scripting tool that can auto-convert functions to visual nodes for fast and modular UI block demos. It's a pip package: https://pypi.org/project/nozyio/ I wanted to integrate diffusers with my flow-nodes UI project but found it's not very modular. This PR may change that! Looking forward to seeing how this evolves. GitHub: https://github.com/oozzy77/nozyio Happy to connect! |
@oozzy77 thanks! Do you want to join a Slack channel with me? If you want to experiment with building something on top of this PR, I'm eager to hear your feedback and iterate based on that. |
Hi, super willing to join a Slack channel with you! What's the workspace/channel I should join? Or you can invite me: ***@***.***
|
@oozzy77 I sent an invite! |
This is great, thanks @yiyixuxu!
My first comments are regarding pipeline functions like `encode_prompt`, `encode_image`, `prepare_ip_adapter_image_embeds` and related modules. We can remove everything related to `num_images_per_prompt` as it's handled by `StableDiffusionXLInputStep`, and I think we could make these functions work with a single input, then call them separately with the positive and negative prompt/image from the module.
For example, with `do_classifier_free_guidance`, `prepare_ip_adapter_image_embeds` returns a list of concatenated embeds that we chunk in `StableDiffusionXLIPAdapterStep`, but in `encode_image` we just use `zeros_like` for the unconditional (`zeros_like` through `image_encoder` when `output_hidden_states`). Instead of having code in `encode_image` and `prepare_ip_adapter_image_embeds` to handle this, we can pass `zeros_like` to `prepare_ip_adapter_image_embeds` from `StableDiffusionXLIPAdapterStep`, and we can allow experimentation with actual negative IP-Adapter image embeds; a custom module for that would currently be possible but unintuitive, as we'd need to pass a negative IP-Adapter image yet take the positive embeds output as the negative embeds.
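Roughly, the block-level logic could look something like this (a sketch only; the `encode_image`/`prepare_ip_adapter_image_embeds` signatures here are hypothetical, just to illustrate the idea):
# inside StableDiffusionXLIPAdapterStep.__call__ (hypothetical signatures, sketch only; torch imported as usual)
image_embeds = pipeline.encode_image(data.ip_adapter_image, output_hidden_states=True)
if data.do_classifier_free_guidance:
    if getattr(data, "negative_ip_adapter_image", None) is not None:
        # allow experimenting with a real negative IP-Adapter image
        negative_image_embeds = pipeline.encode_image(data.negative_ip_adapter_image, output_hidden_states=True)
    else:
        # current default behaviour: zeros for the unconditional branch
        negative_image_embeds = torch.zeros_like(image_embeds)
    image_embeds = torch.cat([negative_image_embeds, image_embeds])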
Totally agree, I was thinking about that too! Do you want to take a stab at that? We need to refactor these functions from the regular pipelines too. |
@yiyixuxu Yes I'll work on that |
Super cool @yiyixuxu @asomoza @hlky! Not reviewing the PR yet since I'm getting a feel for how a developer would be interacting with the library, but I personally found it very intuitive to get started from the examples. Here's my first try at making a modular diffusers workflow for naive latent upscaling with SDXL:

Code

import torch
import torch.nn.functional as F
from diffusers import ModularPipeline, StableDiffusionXLAutoPipeline
from diffusers.pipelines.components_manager import ComponentsManager
# Load models
components = ComponentsManager()
components.add_from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
components.enable_auto_cpu_offload(device="cuda:0")
# Create pipeline
pipe = ModularPipeline.from_block(StableDiffusionXLAutoPipeline())
pipe.update_states(**components.components)
pipe.to("cuda")
# Run inference
prompt = "A majestic lion jumping from a big stone at night"
height = 1024
width = 1024
output = pipe(prompt=prompt, height=height, width=width, num_inference_steps=30)
images = output.intermediates.get("images").images
latents = output.intermediates.get("latents")
images[0].save("output.png")
# Latent upscale
# Note that only naive upscaling is done here. Alternatively, a latent upscaler
# model could be used
batch_size, num_channels, latent_height, latent_width = latents.shape
scale_factor = 1.5
upscaled_height, upscaled_width = int(height * scale_factor), int(width * scale_factor)
upscaled_latent_height, upscaled_latent_width = int(latent_height * scale_factor), int(latent_width * scale_factor)
upscaled_latents = F.interpolate(latents, size=(upscaled_latent_height, upscaled_latent_width), mode="nearest-exact")
# Run inference with upscaled latents
strength = 0.5
upscaled_output = pipe(prompt=prompt, image_latents=upscaled_latents, height=upscaled_height, width=upscaled_width, num_inference_steps=40, strength=strength)
images = upscaled_output.intermediates.get("images").images
images[0].save("output_upscaled.png")
On my first try, I passed I wonder if things like this may cause some friction in getting started with modular diffusers workflows. In this case, do you think renaming |
Question: Let's say we implemented a Flux/SD3 equivalent of the SDXL modular blocks. Now I want to do the same latent upscale thing in the above comment. To make it possible to upscale latents with every supported model, I would like to create a general purpose node/block, with different possible init configurations, that takes a How would I go about inserting my custom blocks into the pipeline execution flow? Or, what would the plan of action on the developers' end look like if they want to inject some code before/after each atomic pipeline step that we currently have (vae encode/decode, latent prep, denoise step, ...)? |
@@ -46,6 +46,7 @@
    "AutoPipelineForInpainting",
    "AutoPipelineForText2Image",
]
_import_structure["modular_pipeline"] = ["ModularPipeline"]
Need to add `components_manager`, and at the parent level too, because in the example we are using:
from diffusers import ComponentsManager
@a-r-r-o-w latent is one example
there is also in your case for upscaling, I think it should be
open to suggestions/discussions |
If it is an upscaler that takes latents as input, I think it is most convenient to use it on its own (like in a UI, it would be its own node/pipeline). Maybe make a map like this so it can be used to create different presets?

AUTO_UPSCALE_BLOCKS = OrderedDict([
("text_encoder", StableDiffusionXLTextEncoderStep),
("ip_adapter", StableDiffusionXLAutoIPAdapterStep),
("image_encoder", StableDiffusionXLAutoVaeEncoderStep),
("before_denoise", StableDiffusionXLAutoBeforeDenoiseStep),
("upscale", AutoUpscaleStep),
("denoise", StableDiffusionXLAutoDenoiseStep),
("decode", StableDiffusionXLAutoDecodeStep)
])

Make a preset for the end-to-end pipeline:

class SDXLAutoUpscaleBlocks(SequentialPipelineBlocks):
block_classes = list(AUTO_UPSCALE_BLOCKS.values())
block_names = list(AUTO_UPSCALE_BLOCKS.keys())
auto_pipe_upscaled = ModularPipeline.from_block(SDXLAutoUpscaleBlocks())

Just the upscaler node, used stand-alone:

upscaler_block = AUTO_UPSCALE_BLOCKS["upscale"]()
upscaler_node = ModularPipeline.from_block(upscaler_block) |
Did a pass on the examples and the info shared instead of looking through the code too much (following @a-r-r-o-w's philosophy). Some comments first.
What if the user combines the inputs that are supported? How do we infer for such situations? For example, what if I provide a
This is very convenient! However, I wonder if the user could restrict the level of info they want to see. I got a bit lost after the args started appearing. Maybe something to consider in the later iterations. Misc:
Now, I tried to use the SDXL refiner:

Code

import torch
from diffusers import ModularPipeline, StableDiffusionXLAutoPipeline
from diffusers.pipelines.components_manager import ComponentsManager
# Load models
components = ComponentsManager()
components.add_from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
# Create pipeline
pipe = ModularPipeline.from_block(StableDiffusionXLAutoPipeline())
pipe.update_states(**components.components)
pipe.to("cuda")
# Run inference
prompt = "A majestic lion jumping from a big stone at night"
height = 1024
width = 1024
output = pipe(prompt=prompt, height=height, width=width, num_inference_steps=30)
images = output.intermediates.get("images").images
latents = output.intermediates.get("latents")
print(f"{latents.shape=}")
images[0].save("output_modular.png")
# Clear things
del components, pipe
torch.cuda.empty_cache()
# Load refiner
components = ComponentsManager()
components.add_from_pretrained("stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16)
# Create pipeline
pipe = ModularPipeline.from_block(StableDiffusionXLAutoPipeline())
pipe.update_states(**components.components)
pipe.to("cuda")
pipe.register_to_config(requires_aesthetics_score=False)
# Refine outputs.
output = pipe(prompt=prompt, image_latents=latents, num_inference_steps=30)
images = output.intermediates.get("images").images
images[0].save("output_refiner_modular.png") It leads to: ValueError: Model expects an added time embedding vector of length 2560, but a vector of 2816 was created. Please make sure to disable `requires_aesthetics_score` with `pipe.register_to_config(requires_aesthetics_score=False)` to make sure `target_size` (1024, 1024) is correctly used by the model. Questions:
|
@sayakpaul these are really good pieces of feedback, thank you! For the refiner, you have to do:

refiner_pipeline.update_states(**components.get(["text_encoder_2","tokenizer_2", "vae", "scheduler"]), unet=components.get("refiner_unet"), force_zeros_for_empty_prompt=True, requires_aesthetics_score=True)

It is a bit verbose, as you can see, and it's the case in general on how we load the
open to better API, but probably not components because we also update config with it
open to suggestions on how to do better here, currently each
These are pretty important! We don't have to wait to improve in later iterations. Let's make it better now if it's possible.
Maybe we don't have to print out the docstring (the args etc.); we can direct users to use |
@sayakpaul

# Loading Models
components = ComponentsManager()
components.add_from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
# load just the refiner UNet (reuse the text_encoders that's already in components)
+ refiner_unet = UNet2DConditionModel.from_pretrained(
+ "stabilityai/stable-diffusion-xl-refiner-1.0",
+ subfolder="unet",
+ torch_dtype=torch.float16
+ )
+ components.add("refiner_unet", refiner_unet)
# this makes sure all models stay on CPU until their forward pass is invoked, and they may be moved back to CPU when more GPU memory is needed
+ components.enable_auto_cpu_offload()
# I think we don't need to do this:
# 1. pipe's states are managed by `components`; if we want to delete everything, deleting them from the components manager is enough
# 2. GPU memory is already managed by `components`, i.e. if we need more memory to run the refiner pipeline,
# the other unet from the base repo will be offloaded to CPU.
# We can also add methods to unload/delete models if more explicit control is needed but overall I think we don't need to
# delete a model unless we are certain we do not need them anymore
# 3. in this particular use case, we still need the text_encoders so don't recommend deleting them and reloading again here
- # Clear components and free CUDA memory before loading refiner
- del components, pipe
- torch.cuda.empty_cache()
-
- # Load complete refiner pipeline
- components = ComponentsManager()
- components.add_from_pretrained(
- "stabilityai/stable-diffusion-xl-refiner-1.0",
- torch_dtype=torch.float16
- )
# Refiner Pipeline Setup
refiner_pipeline = ModularPipeline.from_block(StableDiffusionXLAutoPipeline())
refiner_pipeline.update_states(
**components.get(["text_encoder_2", "tokenizer_2", "vae", "scheduler"]),
+ unet=components.get("refiner_unet"), # Using explicitly loaded UNet
- unet=components.get("unet"), # Using UNet from complete pipeline
force_zeros_for_empty_prompt=True,
requires_aesthetics_score=True
)

Click to expand the code

import torch
from diffusers import ModularPipeline, StableDiffusionXLAutoPipeline, UNet2DConditionModel
from diffusers.pipelines.components_manager import ComponentsManager
# Load models
components = ComponentsManager()
components.add_from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
refiner_unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-xl-refiner-1.0", subfolder="unet", torch_dtype=torch.float16)
components.add("refiner_unet", refiner_unet)
components.enable_auto_cpu_offload()
# Create pipeline
pipe = ModularPipeline.from_block(StableDiffusionXLAutoPipeline())
pipe.update_states(**components.components)
pipe.to("cuda")
# Run inference
prompt = "A majestic lion jumping from a big stone at night"
height = 1024
width = 1024
output = pipe(prompt=prompt, height=height, width=width, num_inference_steps=30)
images = output.intermediates.get("images").images
latents = output.intermediates.get("latents")
print(f"{latents.shape=}")
images[0].save("output_modular.png")
# Create pipeline
refiner_pipeline = ModularPipeline.from_block(StableDiffusionXLAutoPipeline())
refiner_pipeline.update_states(
**components.get(["text_encoder_2", "tokenizer_2", "vae", "scheduler"]),
unet=components.get("refiner_unet"),
force_zeros_for_empty_prompt=True,
requires_aesthetics_score=True
)
refiner_pipeline.to("cuda")
# Refine outputs.
output = refiner_pipeline(prompt=prompt, image_latents=latents, num_inference_steps=30)
images = output.intermediates.get("images").images
images[0].save("output_refiner_modular.png") can you help me:
|
@sayakpaul
could be just
happy to explore this too, if you can share a POC that'd be great! |
I think this is a valid assumption except for the situations where we don't have enough CPU RAM (48GBs might be low).
I think we could cover the refiner use case (and the like) under the theme of "reusing components between workflows". We could make it clear that, to make the most out of reusing, it's recommended to first load all the components needed for the workflows users want to try out and keep them on CPU. Users will always have the option to load any ad-hoc component they may have forgotten in the beginning. If we can make this clear in the docs with examples, I think that should be enough. WDYT?
Yeah
Sure, happy to do that. I will branch off of this PR and try to open a PR. Would that work? |
I finished testing and doing a PoC with the callbacks so I can update the step progress inside a UI. So, discussing here a question about the implementation, since we now have the
So I did this for the PoC to match the current implementation:

if data.callback_on_step_end is not None:
callback_kwargs = {}
for k in data.callback_on_step_end_tensor_inputs:
callback_kwargs[k] = getattr(data, k)
callback_outputs = data.callback_on_step_end(self, i, t, callback_kwargs)
data.latents = callback_outputs.pop("latents", data.latents)
data.prompt_embeds = callback_outputs.pop("prompt_embeds", data.prompt_embeds)
data.added_cond_kwargs["text_embeds"] = callback_outputs.pop("text_embeds", data.added_cond_kwargs["text_embeds"])
data.added_cond_kwargs["time_ids"] = callback_outputs.pop("time_ids", data.added_cond_kwargs["time_ids"]) but it could be something like this which is better to me: if data.callback_on_step_end is not None:
data.callback_on_step_end(self, i, t, data)

What are your thoughts on this @yiyixuxu? |
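With that simpler signature, a user callback would just read or mutate fields on the denoise block state directly; for example (a rough sketch, attribute names assumed from the snippet above):
def ui_progress_callback(pipe, step_index, timestep, data):
    # report progress to the UI; `data` is the denoise block state, so tensors such as
    # data.latents or data.prompt_embeds can be inspected or modified in place
    progress = (step_index + 1) / data.num_inference_steps  # assumes num_inference_steps is on the state
    print(f"{progress:.0%} done (t={int(timestep)}), latents: {tuple(data.latents.shape)}")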
@asomoza

if data.callback_on_step_end is not None:
data.callback_on_step_end(self, i, t, data) |
@sayakpaul
|
It's looking really nice. Obviously there are a lot of intricacies here that I might not have picked up, so in my initial pass I just tried to focus on parts that felt a little unclear to me. I tried to break it down by the major components in Modular Diffusers.

Components Manager

My understanding here is that Components Manager is responsible for loading all models, schedulers, etc. into the Modular Pipeline and performing memory management for the loaded components. Where it felt a bit unintuitive was trying to determine which model repos can be used with
For example, this snippet will load all the components of the base SDXL pipelines into Components Manager:

# Load models
components = ComponentsManager()
components.add_from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

But if I want to load a ControlNet model via a model repo I cannot. I have to create the object and add it to Components Manager via the

components.add_from_pretrained("xinsir/controlnet-union-sdxl-1.0", torch_dtype=torch.float16)

Since I'm familiar with the library, I realise that this is following our existing pipeline loading logic. But I think it might make sense to support adding individual model components through

PipelineBlock

My understanding here is that a
The
Let's say I want to add a PipelineBlock that has a model associated with the step. In the example below I want to create a block that automatically extracts a depth map from an image so that I can use it with a ControlNet. Can I add the depth model to the

class DepthBlock(PipelineBlock):
@property
def inputs(self) -> List[InputParam]:
control_image = InputParam(
name="control_image",
required=True,
)
return [control_image]
def __init__(self) -> None:
super().__init__()
# If I load a model in a pipeline block, is it possible to move it to the components manager?
self.depth_preprocessor = DepthPreprocessor.from_pretrained("depth-anything/Depth-Anything-V2-Large-hf")
def __call__(self, pipeline, state: PipelineState) -> PipelineState:
data = self.get_block_state(state)
control_image = data.control_image
depth_image = self.depth_preprocessor(control_image)
data.control_image = depth_image
self.add_block_state(data, state)
return pipeline, state

When initializing

class StableDiffusionXLDecodeLatentsStep(PipelineBlock):
expected_components = ["vae"]
model_name = "stable-diffusion-xl" And then in the def __init__(self):
super().__init__()
self.components["vae"] = None
self.auxiliaries["image_processor"] = VaeImageProcessor(vae_scale_factor=8) I found it a bit confusing as to why we are setting Are the class attributes at the top of the block needed? As far as I can tell from skimming the code, we operate on block instances everywhere? Can we define PipelineBlocks in such a way? IMO a bit more Pythonic and makes the Blocks feel a bit more like mini-Pipelines. You can also add type enforcement check on the components too. LMK if I'm missing something here. class StableDiffusionXLTextEncoderStep(PipelineBlock):
def __init__(
self,
text_encoder=None,
text_encoder_2=None,
tokenizer=None,
tokenizer_2=None,
force_zeros_for_empty_prompt=True,
):
super().__init__()
# this would set expected_configs
self.register_to_config(force_zeros_for_empty_prompt=force_zeros_for_empty_prompt)
# this would set expected_components
self.register_component(
text_encoder=text_encoder,
text_encoder_2=text_encoder_2,
tokenizer=tokenizer,
tokenizer_2=tokenizer_2
)

Another thing I wasn't quite able to figure out is the exact scope of
Here, let's say we are encoding a prompt. In the example

(
data.prompt_embeds,
data.negative_prompt_embeds,
data.pooled_prompt_embeds,
data.negative_pooled_prompt_embeds,
) = pipeline.encode_prompt(
data.prompt,
data.prompt_2,
data.device,
1,
data.do_classifier_free_guidance,
data.negative_prompt,
data.negative_prompt_2,
prompt_embeds=None,
negative_prompt_embeds=None,
pooled_prompt_embeds=None,
negative_pooled_prompt_embeds=None,
lora_scale=data.text_encoder_lora_scale,
clip_skip=data.clip_skip,
) The Can my_modular_pipe.pipeline_block['text_encoder_step'].encode_prompt() I think Modular actually supports this workflow already. Is it also considered bad practice to set components as attributes in the blocks as use them that way? Something like? @torch.no_grad()
def __call__(self, pipeline, state: PipelineState) -> PipelineState:
# Get inputs and intermediates
data = self.get_block_state(state)
self.check_inputs(pipeline, data)
prompt_embeds = self.text_encoder(data.prompt) Regarding Auxillaries, Is there a strong reason to not have these objects just be considered components as well? Auto WorkflowI am a little apprehensive about introducing Auto workflows in V1. IMO it's better to let users get Modular Pipeline, Block State, Pipeline StateI like these a lot and I'm pretty much aligned on how they work. One small nit that is unrelated to the actual functionality (just putting out here for consideration) @torch.no_grad()
def __call__(self, pipeline, state: PipelineState) -> PipelineState:
# Get inputs and intermediates
data = self.get_block_state(state) Obviously the work here is very extensive and I'm still playing around with it. LMK if I've misunderstood some concepts or if I should open PRs to try and clarify any of these points. |
Thanks! This is super nice feedback! I'll address all of it, but I want to focus on PipelineBlock first because I think it is where most confusion comes from, and it indicates to me that this is where most work needs to be done to improve it! I just had enough time to think about these 2 aspects you mentioned: (1) the design choice on making pipeline blocks stateless and (2) the class attribute vs

1. Stateless Design Choice

Yes, in the current design, PipelineBlocks (
I like to think there are two stages in Modular diffusers:
# Define the depth block you were working on
class DepthBlock(PipelineBlock):
...
# another one for canny images
class CannyBlock(PipelineBlock):
...
# Combine these two into one with conditional logic
class AutoControlInputBlock(AutoPipelineBlocks):
block_classes = [DepthBlock, CannyBlock]
block_names = ["depth", "canny"]
block_trigger_inputs = ["depth_image", "canny_image"]
# combine in sequential orders
class CompleteControlNetPipeline(SequentialPipelineBlocks):
block_classes = [AutoControlInputBlock, PrepareLatentBlock, DenoiseBlock, DecodeBlock]
block_names = ["control_input", "prepare", "denoise", "decode"] you can keep composing for as long as you want, but once you're done and you want to use it now, we enter the "Runtime Stage" and that's when the pipeline blocks become stateful
# Create Modularpipeline with the block you just made
controlnet_node = ModularPipeline.from_block(CompleteControlNetPipeline())
# Load models and components
controlnet_node.update_states(**components.components)
# Run inference
image = controlnet_node(control_image=my_image, prompt="a cat", output="images") I made pipeline blocks stateless since model loading isn't needed during composition - it's only required at runtime. The design you proposed here will make pipeline block stateful. That means each pipeline block will need to manage model components themselves, and you will have to load models into each pipeline blocks and then compose them somehow. It is a possible alternative design, but I think it might need a different system to support it and it is more complex. class StableDiffusionXLTextEncoderStep(PipelineBlock):
def __init__(
self,
text_encoder=None,
text_encoder_2=None,
tokenizer=None,
tokenizer_2=None,
force_zeros_for_empty_prompt=True,
):
super().__init__()
# this would set expected_configs
self.register_to_config(force_zeros_for_empty_prompt=force_zeros_for_empty_prompt)
# this would set expected_components
self.register_component({
text_encoder=text_encoder,
text_encoder_2=text_encoder_2,
tokenizer=tokenizer,
tokenizer_2=tokenizer_2
}) 2. Component Initialization and Class AttributesAbout your comment on Component Initialization here:
I totally agree that it is very confusing that we have both The class attributes
I like to think these class attributes However, I don't think we need both the class attribute I think it might be better to remove the class DepthBlock(PipelineBlock):
expected_components = [
ComponentSpec(
name="depth_processor",
class_name=["depth_anything", "DepthPreprocessor"],
default_repo="depth-anything/Depth-Anything-V2-Large-hf"
)
]
@property
def inputs(self) -> List[InputParam]:
return [InputParam(
name="control_image",
required=True,
)]
def __call__(self, pipeline, state: PipelineState) -> PipelineState:
data = self.get_block_state(state)
depth_image = pipeline.depth_processor(data.control_image)
data.control_image = depth_image
self.add_block_state(data, state)
return pipeline, state This way, we would also be able to support the use case you described here:
currently, indeed, you would always have to add the models to What do you think? |
@DN6

Scope of Pipeline Block Methods

Regarding your questions about PipelineBlock scope and global pipeline methods, here:
Yes, you can define methods at the pipeline block level. Currently, we have two places where methods can live:
Components as attributes in blocks

Regarding this question:
Yes, with the current design it would be bad practice, since pipeline blocks are stateless and all the model components should be managed at the global pipeline level and passed to each pipeline block at run time through the
If you think a stateful pipeline block design is more intuitive, I'd be happy to explore that with you too :) A few things to keep in mind if we want to explore the alternative stateful design:
|
Now for Auto Workflow: I agree it is probably not that important for our current diffusers users, but I consider it crucial for the UI use case. Since one of our goals is to eliminate the barrier between us and the UI community/professionals, I think it makes sense for us to release with it. Let me explain a bit!
Auto Workflow fits really well with how workflows are developed. Alvaro's guides (for example, https://huggingface.co/blog/OzzyGT/outpainting-differential-diffusion) give a pretty good sense of the process. It is usually iterative: the user does not necessarily know exactly what's needed in the beginning, so they start with something basic and gradually add/remove features and modify parts of the workflow until they get satisfactory results. Without auto workflows, they'd have to rebuild their workflow each time they want to try something different, which is not a very nice experience. With Auto Workflows (a node built with an auto workflow), they can pretty much stick to the same node and just change the input nodes as they need.
Also, there is the number of nodes. Comfy currently faces the challenge that there are too many nodes and it's a bit overwhelming for users. Without auto workflows, we'd have the same issue. With auto workflows, we currently have about 5 nodes: prompt_encode / image_encode / decode / denoise / ip-adapter, so it is very manageable.
I think maybe we can have different guides targeting different user groups and only talk about auto workflows in the ones targeting UI/professionals. |
I'm better understanding some of the things here after working with it for a bit. I'll try to provide some general thoughts and introduce some ideas I had when you were initially starting the modular diffusers development:
- No strong opinions on whether the PipelineBlocks should be stateful or not. We could ideally support both cases, similar to what's done in the Diffusers Hooks.
- Each PipelineBlock IMO should only contain a minimal implementation, i.e. bare-minimum single functionality, and not handle too many overlapping cases. For example, `pipeline.encode_prompt` and similar methods that operate on both unconditional and conditional branches should probably just support one option, `prompt`. The pipeline block can then invoke this method twice - once for positive, once for negative.
- If methods like `encode_prompt` could have a functional equivalent that can be invoked from outside a pipeline/pipeline-blocks, I think it would be super helpful for re-use in trainers instead of rolling our own minified implementation.
- We should consider batching vs non-batching inference. Currently, with existing pipelines, we always batch negative and positive prompt embeds. This increases the memory required for intermediate activation states by 2x. For a low-VRAM mode, this might be an important consideration. (It's not very important. We can always add a BatchedInferenceHook or something to the `model::forward` to split the args/kwargs along the batch dimension.)
- Currently, the invocation mode is eager. Something like:
  I_AM_AT_BLOCK_X -> DO_I_HAVE_THE_INPUTS_I_REQUIRE? ---> YES ---> PERFORM_COMPUTATION_AND_PROCEED_TO_NEXT_BLOCK |--> NO ---> RAISE_ERROR
  If we're somewhere deep inside the execution stage and then error out (maybe due to a missing input), all computation done till now is lost for a silly error. This is very frustrating (I've personally faced it multiple times during model integrations). IMO we have an opportunity to improve this (perhaps some time in the near future, if not for now). Since we already know that each block requires a set of inputs and outputs, regardless of what the other blocks do, we can topologically traverse the graph of blocks in reverse to determine if the inputs/outputs mapping is correct. If not, we can error out early and let the user know. If yes, we can proceed with computation.
  Note that this won't help identify issues in cases where we simply forgot to pass an input to a model or something, but it'll be helpful in block-development cases; we're simply doing a static analysis to make sure that the invocation graph makes sense on a high level for the pipeline one creates (see the sketch after this list).
- Regarding `_execution_device` and `dtype` on the pipeline, I think we should remove them and instead infer device/dtype from the module that is going to do the processing next. For example, if my text encoder is in float16 but the transformer is in bfloat16, `dtype` on the pipeline will return float16. So, prompt embeds will be in a different dtype, leading to an error on the transformer unless we explicitly write some logic to handle this in the pipeline. Writing it per model block is prone to errors and can introduce lossy conversions, so it might be nice to keep the pipeline as a simple container holding modules, remove any notion of module state from it, and handle these device/dtype changes more centrally (like `pipeline.prepare_inputs_for_model(model, inputs)`) (just my thoughts and not really at issue here).
- Have we thought about how a pipeline created by a user can be shared via an exported file, say on the Hub, for ease of distribution?
return noise_cfg
...
class CFGGuider:
TLDR; let's try to separate algorithms from the modeling/pipeline implementations as much as possible. If we can decouple CFG nicely, I believe we would have a lot more composability and options for testing. Let's try to write these in a manner that works with existing pipelines too if we invoke `__call__` with a guider object.
I like this design. For some time now, I've wanted to add support for different guidance techniques (STG, Perturbed Attention Guidance, energy-based CFG, skip layer guidance, etc.) to all existing models/pipelines where applicable. As I'm working on something similar, I'll share some thoughts.
These techniques are independent of the model/pipeline, so it makes sense to me that we should not tie that logic too strongly to the pipelines. At the moment, our pipelines only accept parameters like `guidance_scale`, `guidance_rescale`, `true_cfg_scale`, and similar. This is not really scalable if we want composability while supporting the latest research techniques. So, this design of being able to initialize "guiders" is super cool, since we can parameterize them however we want and since it's decoupled from the pipeline's `__call__` and the model `forward` itself.
To provide some more details of what I've been trying, this is some pseudo-code:
from diffusers.hooks import HookRegistry, PerturbedAttentionGuidanceHook
class GuidanceMixin:
def register_modules(self, denoiser: torch.nn.Module, ...) -> None:
...
def unregister_modules(self, denoiser: torch.nn.Module, ...) -> None:
...
def prepare_inputs(self, **kwargs) -> Any:
parameters = inspect.signature(self._prepare_inputs).parameters
ignored_kwargs = {k for k in kwargs.keys() if k not in parameters}
input_kwargs = {k: v for k, v in kwargs.items() if k in parameters}
return self._prepare_inputs(**input_kwargs)
def __call__(self, **kwargs) -> Any:
parameters = inspect.signature(self.forward).parameters
ignored_kwargs = {k for k in kwargs.keys() if k not in parameters}
input_kwargs = {k: v for k, v in kwargs.items() if k in parameters}
return self.forward(**input_kwargs)
def _prepare_inputs(self, **kwargs) -> Any:
raise NotImplementedError
class ClassifierFreeGuidance(GuidanceMixin):
def __init__(self, scale: float) -> None:
self.scale = scale
def _prepare_inputs(self, latents: torch.Tensor, prompt_embeds: torch.Tensor, negative_prompt_embeds: Optional[torch.Tensor] = None, generator: Optional[torch.Generator] = None) -> torch.Tensor:
if self.scale > 1.0:
latents = torch.cat([latents, torch.zeros_like(latents).normal_(generator=generator)])
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
return {"latents": latents, "prompt_embeds": prompt_embeds}
def forward(self, x_uncond: torch.Tensor, x_cond: torch.Tensor) -> torch.Tensor:
return x_uncond + self.scale * (x_cond - x_uncond)
class PerturbedAttentionGuidance(GuidanceMixin):
def __init__(self, scale: float, cfg_scale: float, layers: Union[str, List[str]]) -> None:
self.scale = scale
self.cfg_scale = cfg_scale
self.layers = [layers] if isinstance(layers, str) else layers
def register_modules(self, denoiser: torch.nn.Module, ...) -> None:
for name, submodule in denoiser.named_modules():
if any(regex_match(name, layer_name) for layer_name in self.layers):
registry = HookRegistry.check_if_exists_or_initialize(submodule)
hook = PerturbedAttentionGuidanceHook()
registry.register_hook(hook)
def prepare_inputs(self, latents: torch.Tensor, prompt_embeds: torch.Tensor, negative_prompt_embeds: Optional[torch.Tensor] = None, generator: Optional[torch.Generator] = None) -> torch.Tensor:
num_additional_latents = (self.scale > 1.0) + (self.cfg_scale > 1.0)
if num_additional_latents > 0:
additional_latents = [torch.zeros_like(latents).normal_(generator=generator) for _ in range(num_additional_latents)]
latents = torch.cat([latents, *additional_latents])
... # Similarly handle prompt embeddings
return ...
def forward(self, x_uncond: torch.Tensor, x_cond: torch.Tensor) -> torch.Tensor:
...
from diffusers import FluxPipeline
from diffusers.guidance import ClassifierFreeGuidance, PerturbedAttentionGuidance
pipe = FluxPipeline.from_pretrained(...)
pipe.to("cuda")
cfg = ClassifierFreeGuidance(scale=7.0)
pag = PerturbedAttentionGuidance(scale=5.0, layers=["transformer_blocks\.(20|24)"])
cfg_output = pipe(..., guidance=cfg)
pag_output = pipe(..., guidance=pag)
In the existing pipelines, we will invoke the `prepare_inputs` and `__call__` methods in a non-backwards-breaking manner. For the new modular diffusers, we can customize as required. As the guidance objects are lightweight to create, one can modify them on the fly, which would be super useful for UI cases and experimentation.
A pet peeve I have is needing to write additional attention processors for a method like PAG. Per-model processors are hard to maintain for all kinds of techniques available, with all kinds of permutations possible. This introduces limitations. Since we know that most modeling implementations use our `Attention` class, or at least follow similar naming conventions, one way of making this technique generally applicable is utilizing some sort of pre/post-forward hook that can perform the attention-branch shortcut required in PAG. This would be a single addition to address all models at once, because we follow certain strict naming conventions for layers.
As guiders can be stateful (for example, disabling guidance after a certain number of steps should remove the unconditional latent/prompt embeddings, or the guidance scale could be adaptive to the amount of low-frequency/high-frequency noise in the latent), I really like that we can do `reset_guider`. IMO, we should mark this as stateful/un-stateful using a flag like `_is_stateful = True` (similar to
_is_stateful = True |
@a-r-r-o-w feel free to take over the guider and refactor it :)
thanks @a-r-r-o-w! insightful as always:)
Not sure I understand what this means: "Each PipelineBlock IMO should only contain minimal implementation"; but based on the example you provided I think we are aligned. @hlky is working on a refactor of some of the pipeline methods to do just what you described. We are also considering making them class methods so they can be invoked outside of pipelines/pipeline blocks. Please take a look there and share your thoughts! #10726
The hook approach sounds good, or a special optimized denoising block. Feel free to explore; it can be part of the offloading strategy we offer in the components manager, e.g. if the user does not have enough memory, we automatically run non-batched inference.
actually, we are already doing that. when combine a few pipeline blocks in a sequential order, we loop through the blocks to find out the overall
basically say we have 3 blocks we want to combine in sequential order
each block has we look through the blocks,
once we have this I think we can add a
Happy to work a bit more on this with you! I think it is a very important feature. I can start to add some test cases for the things we already cover, and you can help to see if we miss any use cases. What do you think?
Agree, feel free to help refactor later!
I think we should share via the Hub but haven't explored that yet - feel free to take a stab at it! |
@yiyixuxu
I think this is fine.
I do prefer that. One case I can think of is if I try to replace a step in a pipeline, e.g. the encode prompt step, and then I try
I'm cool with having the components be managed at the global level. I agree it would get complicated if the components are attached to blocks. I think I was trying to convey that

class MyPipelineBlock:
def __init__(self, vae):
self.register_component(vae=vae)
vae = AutoencoderKL.from_pretrained("..")
pipe = ModularPipeline.from_block(MyPipelineBlock(vae=vae))
# these would point to the same object
pipe.vae == pipe.blocks("vae_step").vae But the ComponentSpec solution also works for this case 👍🏽 And the other points such as not being able to use different VAE's at different steps makes sense.
I think nothing in the current design prevents stateful blocks though. I think we need a bit more clarity on how to create/manage them correctly, e.g. in the DepthBlock solution you proposed:

class DepthBlock(PipelineBlock):
expected_components = [
ComponentSpec(
name="depth_processor",
class_name=["depth_anything", "DepthPreprocessor"],
default_repo="depth-anything/Depth-Anything-V2-Large-hf"
)
]
@property
def inputs(self) -> List[InputParam]:
return [InputParam(
name="control_image",
required=True,
)]
def __call__(self, pipeline, state: PipelineState) -> PipelineState:
data = self.get_block_state(state)
depth_image = pipeline.depth_processor(data.control_image)
data.control_image = depth_image
self.add_block_state(data, state)
return pipeline, state

If we're creating the
Another thought I had was: suppose a user has created a custom Pipeline Block with a model component (config and weights) and custom code, and is hosting both on the Hub. Would we allow something like |
Let me try to move all the global pipeline methods to the block level first - If I'm able to do that, I think maybe we won't need global pipeline methods at all, so things would be easier
I was thinking something similar to model_index, so we would re-use the same approach (and potentially code) to handle it. So a different library should not be a problem (like we do for text_encoders from transformers); and if it's defined in the same file, maybe we can do something similar to what we do for these diffusers modules that we cannot import from the top level, like here. I haven't really thought it through, though; if you have good suggestions, let me know!!
I'm not sure how custom code would work for now (like how we share the code on the Hub and load it), let me know if you have good ideas! But yes, I think we should add a loading method to pipeline blocks! We should also allow attaching a components manager to the pipeline blocks so that things loaded from a pipeline block will be registered to the components manager. Adding a loading method on pipeline blocks will also be able to support the use case you described earlier in #9672 (comment)
Instead of this (we could still support this in the future after we have the AutoModel class):

components.add_from_pretrained("xinsir/controlnet-union-sdxl-1.0", torch_dtype=torch.float16)

we could already do something like this:

control_block.add_from_pretrained(repo, components_manager=components)

because we will add class info for |
Hi. I've been watching this project unfold for some time now and I've attempted some orthogonal modular diffusers projects in the past. I'm deeply interested in reviewing, researching, responding, reciprocating, etc. during the creation process if possible, especially with regards to integrating Auto Workflow elements,
Getting Started with Modular Diffusers
With Modular Diffusers, we introduce a unified pipeline system that simplifies how you work with diffusion models. Instead of creating separate pipelines for each task, Modular Diffusers lets you:
Write Only What's New: You won't need to rewrite the entire pipeline from scratch. You can create pipeline blocks just for your new workflow's unique aspects and reuse existing blocks for existing functionalities.
Assemble Like LEGO®: You can mix and match blocks in flexible ways. This allows you to write dedicated blocks for specific workflows and then assemble different blocks into a pipeline that can be used more conveniently for multiple workflows. Here we will walk you through how to use a pipeline like this that we built with Modular Diffusers! In later sections, we will also go over how to assemble and build new pipelines!
Quick Start with
StableDiffusionXLAutoPipeline
Auto Workflow Selection
The pipeline automatically adapts to your inputs:
prompt
image input
image and mask_image
control_image
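For example, re-using the quick-start setup from the thread above, the same pipeline object covers all of these cases (a sketch; `init_image`, `mask`, and `control_image` are user-provided images, and argument names follow the examples in this thread):
# same `pipe` as in the quick start
out_t2i = pipe(prompt=prompt)                                       # text-to-image
out_i2i = pipe(prompt=prompt, image=init_image, strength=0.6)       # image-to-image
out_inp = pipe(prompt=prompt, image=init_image, mask_image=mask)    # inpainting
out_ctrl = pipe(prompt=prompt, control_image=control_image)         # controlnet
images = out_ctrl.intermediates.get("images").images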
Auto Documentations
We care a great deal about documentation here at Diffusers, and Modular Diffusers carries this mission forward. All our pipeline blocks come with complete docstrings that automatically compose as you build your pipelines. This means you can:
inspect your pipeline
see an example of output
use get_execution_blocks to see which blocks will run for your inputs/workflow; for example, if you want to run a text-to-image ControlNet workflow, you can do this (a sketch is shown below)
see the docstring relevant to your inputs/workflow
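Something along these lines (a sketch only; the exact `get_execution_blocks` signature and where it is exposed may differ from the final API):
# ask which blocks would actually run for a text-to-image + ControlNet call,
# i.e. when `prompt` and `control_image` are passed
text2img_control_blocks = pipe.get_execution_blocks("prompt", "control_image")
print(text2img_control_blocks)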
Advanced Workflows
Once you've created the auto pipeline, you can use it for different features as long as you add the required components and pass the required inputs.
Here is an example you can run for a more complex workflow using controlnet/IP-Adapter/Lora/PAG
check out more usage examples here
test1: complete testing script for `StableDiffusionXLAutoPipeline`
Modular Setup
StableDiffusionXLAutoPipeline is a very convenient preset; just like the LEGO sets, you can break it down, reassemble, and rearrange the pipeline blocks however you want. A more modular setup would look like the sketch below: with this setup, you precompute embeddings and reuse them across different denoise backends or with different inference parameters such as guidance_scale or num_inference_steps, or use different schedulers. You can modify your workflow by simply adding/removing/swapping blocks without recomputing the entire pipeline over and over again.
check out the full example script here
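A sketch of what such a setup could look like, reusing block classes mentioned elsewhere in this PR; the exact output-to-input wiring between nodes is an assumption, not the final API:
class SDXLCoreDenoiseBlocks(SequentialPipelineBlocks):
    # everything between text encoding and decoding
    block_classes = [StableDiffusionXLInputStep, StableDiffusionXLAutoBeforeDenoiseStep, StableDiffusionXLAutoDenoiseStep]
    block_names = ["input", "before_denoise", "denoise"]

text_node = ModularPipeline.from_block(StableDiffusionXLTextEncoderStep())
denoise_node = ModularPipeline.from_block(SDXLCoreDenoiseBlocks())
decode_node = ModularPipeline.from_block(StableDiffusionXLAutoDecodeStep())
for node in (text_node, denoise_node, decode_node):
    node.update_states(**components.components)

# precompute the text embeddings once ...
text_state = text_node(prompt="A majestic lion jumping from a big stone at night")

# ... then reuse them with different inference parameters or schedulers
for guidance_scale in (5.0, 7.5):
    latents = denoise_node(**text_state.intermediates, guidance_scale=guidance_scale, num_inference_steps=30, output="latents")
    images = decode_node(latents=latents, output="images")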
test2: modular setup
This is the full testing script I used for more configuration, including inpainting/refiner/union controlnet/APGtest3: modular setup with IPAdapter
Developer Guide: Building with Modular Diffusers
Core Components Overview
The Modular Diffusers architecture consists of four main components:
ModularPipeline
The main interface for creating and running modular pipelines. Unlike traditional pipelines, you don't write it from scratch - it builds itself from pipeline blocks! Example usage:
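A sketch based on the quick-start snippet and the output="images" usage shown earlier in this thread:
pipe = ModularPipeline.from_block(StableDiffusionXLAutoPipeline())
pipe.update_states(**components.components)
image = pipe(prompt="A majestic lion jumping from a big stone at night", output="images")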
PipelineBlock
The fundamental building block, similar to a mellon/comfy node. Each block:
__call__(pipeline, state) -> (pipeline, state)
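A minimal custom block might look like this (a sketch following the DepthBlock example discussed above; the InputParam fields shown are only the ones that appear in that discussion, and imports follow the earlier examples):
import torch.nn.functional as F

class UpscaleLatentsBlock(PipelineBlock):
    @property
    def inputs(self) -> List[InputParam]:
        return [InputParam(name="latents", required=True)]

    def __call__(self, pipeline, state: PipelineState) -> PipelineState:
        data = self.get_block_state(state)
        # naive nearest-neighbour upscale of the latents, as in the upscaling example earlier in this thread
        data.latents = F.interpolate(data.latents, scale_factor=1.5, mode="nearest-exact")
        self.add_block_state(data, state)
        return pipeline, state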
MultiPipelineBlocks
Combines multiple blocks into a bigger one! These combined blocks behave just like single blocks - with their own inputs, outputs, and components, but they are able to handle more complex workflows!
We have two types of MultiPipelineBlocks available, you can use them to combine individual blocks into ready-to-use sets (Like LEGO® presets!)
SequentialPipelineBlocks
AutoPipelineBlocks
AutoPipelineBlocks makes the complex if.. else.. logic in your code disappear! With this, you can write blocks for specific use cases to keep your code paths clean, and use AutoPipelineBlocks to combine blocks into convenient presets that provide a better user experience :) For example, the ControlNetDenoiseStep step will be dispatched when "control_image" is passed by the user; otherwise, it will run the default DenoiseStep (see the sketch below).
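Schematically, mirroring the AutoPipelineBlocks pattern shown earlier in this thread (class names are illustrative, and whether `None` marks the default branch is an assumption):
class SDXLAutoDenoiseStep(AutoPipelineBlocks):
    block_classes = [ControlNetDenoiseStep, DenoiseStep]
    block_names = ["controlnet_denoise", "denoise"]
    # ControlNetDenoiseStep runs when `control_image` is passed; otherwise the default DenoiseStep runs
    block_trigger_inputs = ["control_image", None]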
PipelineState and BlockStates
PipelineState and BlockStates manage dataflow between/inside blocks; they make debugging really easy! Feel free to print them out at any time to get an overview of all the shapes/types/values of your pipeline/block states.
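For example (attribute names based on the examples earlier in this thread; the exact printed repr may differ):
output = pipe(prompt=prompt, num_inference_steps=30)
print(output)                        # overview of the whole pipeline state
print(output.intermediates.keys())   # e.g. "latents", "images", ...
latents = output.intermediates.get("latents")
print(latents.shape, latents.dtype)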
Differential Diffusion Example
Here we'll show you a new way to build with Modular Diffusers. Let's look at implementing a Differential Diffusion pipeline (https://differential-diffusion.github.io/) as an example. It is, in a sense, an image-to-image workflow, so we can start with the preset of pipeline blocks we used to build our current img2img pipeline (
IMAGE2IMAGE_BLOCKS) and see how we can build this new pipeline with them!
It seems like we can reuse the "text_encoder", "ip_adapter", "image_encoder", "input", "prepare_add_cond" and "decode" steps from the img2img workflow out of the box. The "set_timesteps" step in Differential Diffusion is the same as the one we use for text-to-image (i.e. it does not take a strength parameter), so we just use StableDiffusionXLSetTimestepsStep. It uses a different denoising method, so we will need to write a new "denoise" step, and the "prepare_latents" step is also a little bit different, so we will write a new one too.
Here are the changes needed to create the Differential Diffusion version of these blocks:
StableDiffusionXLImg2ImgPrepareLatentsStep:
StableDiffusionXLDenoiseStep step: we remove the inpaint-related logic and add diff-diff specific logic.
That's all there is to it! Once you've made these 2 diff-diff blocks, you can create a preset (a pre-assembled set of blocks) and then build your pipeline from it.
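For example, something along these lines (a sketch; the two diff-diff block class names are hypothetical placeholders for the blocks described above):
DIFFDIFF_BLOCKS = IMAGE2IMAGE_BLOCKS.copy()
DIFFDIFF_BLOCKS["set_timesteps"] = StableDiffusionXLSetTimestepsStep   # text-to-image timesteps (no strength)
DIFFDIFF_BLOCKS["prepare_latents"] = SDXLDiffDiffPrepareLatentsStep    # hypothetical new block
DIFFDIFF_BLOCKS["denoise"] = SDXLDiffDiffDenoiseStep                   # hypothetical new block

class SDXLDiffDiffBlocks(SequentialPipelineBlocks):
    block_classes = list(DIFFDIFF_BLOCKS.values())
    block_names = list(DIFFDIFF_BLOCKS.keys())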
To use it:
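A minimal usage sketch (the change-map argument name is an assumption, not necessarily the final API):
dd_pipe = ModularPipeline.from_block(SDXLDiffDiffBlocks())
dd_pipe.update_states(**components.components)
image = dd_pipe(prompt=prompt, image=init_image, diffdiff_map=change_map, output="images")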
Complete Example: Implementing Differential Diffusion Pipeline
Diffusers as seen in nodes
coming up soon....
Next Steps