Are the text padding tokens masked in SD3? #11072

va1bhavagrawal · 2025-03-16T18:28:57Z

SD3 takes its text embeddings from the CLIP and T5 text encoders. These are padded up to a fixed sequence length (77 for CLIP, max_sequence_length for T5).
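To make the padding concrete, here is a minimal sketch of padding a prompt to a fixed length while keeping track of which positions are real. The helper `pad_with_mask`, the `pad_id` value, and the token ids are all hypothetical and only for illustration; the actual ids and pad token depend on the tokenizer.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Pad (or truncate) token_ids to max_length and return (ids, mask).

    mask[i] is 1 for a real token and 0 for a padding position.
    pad_id=0 is an assumption; real tokenizers define their own pad token.
    """
    kept = list(token_ids[:max_length])
    n_pad = max_length - len(kept)
    ids = kept + [pad_id] * n_pad
    mask = [1] * len(kept) + [0] * n_pad
    return ids, mask


# Made-up token ids for a short prompt (not produced by a real tokenizer).
prompt_ids = [101, 7, 42, 9, 102]

# 77 is the CLIP sequence length mentioned above.
clip_ids, clip_mask = pad_with_mask(prompt_ids, max_length=77)
print(len(clip_ids), sum(clip_mask))  # 77 positions, 5 of them real
```

A mask like `clip_mask` is what I would expect to be forwarded to the transformer if padding were being masked.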

How are the padding tokens masked? Do the image tokens attend directly to the padding tokens? I do not see any attention_mask being passed to the transformer. My intuition is that the image tokens should not be allowed to attend to the padding tokens.
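Here is a small sketch of what I mean by masking, in plain Python rather than the actual SD3 code. It assumes a joint attention where one image query attends over image and text keys together; the function names, vectors, and mask are made up for illustration. Setting the padding scores to negative infinity before the softmax drives their weights to exactly zero, while leaving them untouched gives them a nonzero share.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, key_mask=None):
    """Scaled dot-product attention weights of one query over all keys.

    key_mask[j] == 0 marks key j as padding; its score is set to -inf so
    it receives zero weight. At least one key must remain unmasked.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    if key_mask is not None:
        scores = [s if m else float("-inf") for s, m in zip(scores, key_mask)]
    return softmax(scores)


# Toy data: 2 image keys followed by 3 text keys, the last 2 being padding.
image_query = [1.0, 0.0]
image_keys = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[0.5, 0.5], [0.2, 0.1], [0.2, 0.1]]
keys = image_keys + text_keys
mask = [1, 1] + [1, 0, 0]

unmasked = attention_weights(image_query, keys)
masked = attention_weights(image_query, keys, mask)

print(sum(unmasked[3:]))  # nonzero: padding receives attention
print(sum(masked[3:]))    # 0.0: padding is excluded
```

Without the mask, the image query spends part of its attention budget on positions that carry no prompt content, which is the behavior I am asking about.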

Any ideas how this is handled?

va1bhavagrawal · 2025-03-17T16:02:59Z

I tried visualizing some image->text attention maps in SD 3.5 Medium for randomly chosen padding tokens. The attention maps appear to be nonzero. I have visualized some of them here:

Is there any explanation for why a mask is not used for padding tokens?
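For anyone who wants to quantify this rather than eyeball the maps, a minimal sketch of one possible measurement follows: the fraction of each image token's text attention that lands on padding positions. The helper name and the numbers are made up; they are not values extracted from SD 3.5.

```python
def padding_attention_mass(attn_rows, text_mask):
    """Return, per query row, the fraction of text attention on padding.

    attn_rows: list of rows, each a list of attention weights over text tokens.
    text_mask: list with 1 for real text tokens and 0 for padding tokens.
    """
    fractions = []
    for row in attn_rows:
        total = sum(row)
        on_padding = sum(w for w, m in zip(row, text_mask) if m == 0)
        fractions.append(on_padding / total if total > 0 else 0.0)
    return fractions


# Made-up image->text attention weights for two image tokens.
attn = [
    [0.5, 0.3, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
text_mask = [1, 1, 0, 0]  # last two text positions are padding

fractions = padding_attention_mass(attn, text_mask)
print(fractions)  # roughly 0.2 and 0.5 for this toy input
```

If padding were masked, every entry of `fractions` would be zero.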
