Is there a reason why slot attention is placed in the middle of the visual encoder? Why is it not located at the end or every layer?