Hi, thanks for the wonderful work! I am really impressed with the proposed 'MetaFormer' concept and the experimental results you have provided. While reading the paper, I came up with a few questions about PoolFormer and the MetaFormer concept that I would like to share with you.
- As far as I understand, MetaFormer basically consists of 'input embedding + repeated blocks of [norm - token mixer - residual connection - norm - channel mixer - residual connection]'. Does this mean that MetaFormer makes no assumption about whether the patches are non-overlapping, or whether the input is a sequence of flattened patches? If so, is 'MetaFormer' essentially the combination of a token mixer and a channel mixer with the other components, regardless of whether the network is hierarchical or what shape the inputs have?
- PoolFormer uses non-parametric 2D pooling as the token mixer, which is extremely simple compared to previous token mixers. However, the patch embedding inserted between the blocks seems to perform implicit token mixing, since it is a convolution whose stride is smaller than its kernel size and therefore yields overlapping patches. With overlapping patches, I believe neighboring patches share information from the same spatial locations.
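
To make the second point concrete, here is a minimal 1D sketch of what I mean by overlapping patches. It is not taken from the repository: the helper names `patch_windows` and `shared_positions` are hypothetical, and the kernel, stride, and padding values are illustrative assumptions rather than a statement of the exact PoolFormer configuration. It only lists which input positions each output patch of a strided convolution can see.

```python
# Toy 1D illustration; helper names and sizes are hypothetical assumptions.

def patch_windows(length, kernel, stride, padding=0):
    """Return, for each output patch of a 1D strided convolution,
    the list of valid input indices that fall inside its kernel window."""
    n_out = (length + 2 * padding - kernel) // stride + 1
    windows = []
    for i in range(n_out):
        start = i * stride - padding
        # Drop indices that land in the zero padding.
        windows.append([j for j in range(start, start + kernel) if 0 <= j < length])
    return windows


def shared_positions(windows):
    """Input indices seen by both patches of each adjacent pair."""
    return [sorted(set(a) & set(b)) for a, b in zip(windows, windows[1:])]


# Stride smaller than kernel size: adjacent patches overlap.
overlapping = patch_windows(length=8, kernel=3, stride=2, padding=1)
# Stride equal to kernel size: ViT-style non-overlapping patches.
disjoint = patch_windows(length=8, kernel=2, stride=2)

print(overlapping)                    # [[0, 1], [1, 2, 3], [3, 4, 5], [5, 6, 7]]
print(shared_positions(overlapping))  # [[1], [3], [5]]
print(shared_positions(disjoint))     # [[], [], []]
```

With stride smaller than kernel size, every adjacent pair of patches reads at least one common input position, which is the implicit mixing I am asking about. With stride equal to kernel size, no position is shared.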
Thanks!