Is there a specific reason you chose NVIDIA Apex AMP over PyTorch's native AMP (`torch.cuda.amp`)? Have you tried the native implementation?
I tried training poolformer_s12 and poolformer_s24 with solo-learn: with native fp16 the loss goes to NaN after a few epochs, while with fp32 training is stable. Did you observe similar behavior?
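Just to pin down what I mean by "native AMP", here is a minimal sketch of a `torch.cuda.amp` training loop with a stand-in model and random data; my actual runs go through solo-learn's trainer, so the model, data, and hyperparameters below are placeholders, not my real setup:

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda"
# Stand-in model for poolformer_s12 (placeholder, not the real architecture)
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 100),
).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler()  # dynamic loss scaling

for step in range(10):
    images = torch.randn(8, 3, 224, 224, device=device)   # dummy batch
    targets = torch.randint(0, 100, (8,), device=device)
    optimizer.zero_grad(set_to_none=True)
    with autocast():                       # fp16 autocast region
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.step(optimizer)                 # unscales grads, skips step on inf/NaN
    scaler.update()
```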
On a side note, can you provide the implementation and the hyperparameters for the hybrid stage [Pool, Pool, Attention, Attention]? It seems very interesting!
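To clarify what I mean, below is my rough guess at the structure: pooling token mixers in the first two stages and plain self-attention in the last two, wrapped in standard MetaFormer blocks. This is only a sketch under my own assumptions (patch embedding / downsampling between stages omitted, dims and depths are placeholders), not your actual implementation, which is exactly what I'd love to see:

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """PoolFormer-style token mixer: average pooling minus identity."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2, count_include_pad=False)
    def forward(self, x):                  # x: [B, C, H, W]
        return self.pool(x) - x

class Attention(nn.Module):
    """Multi-head self-attention on flattened tokens (my assumption for the attention stages)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
    def forward(self, x):                  # x: [B, C, H, W]
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # [B, HW, C]
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)

class Block(nn.Module):
    """MetaFormer block: norm -> token mixer -> residual, norm -> MLP -> residual."""
    def __init__(self, dim, mixer):
        super().__init__()
        self.norm1, self.norm2 = nn.GroupNorm(1, dim), nn.GroupNorm(1, dim)
        self.mixer = mixer
        self.mlp = nn.Sequential(nn.Conv2d(dim, dim * 4, 1), nn.GELU(),
                                 nn.Conv2d(dim * 4, dim, 1))
    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))

# Hypothetical stage spec: [Pool, Pool, Attention, Attention]
# (dims and depths are placeholders, not your hyperparameters)
dims, depths, mixers = [64, 128, 320, 512], [2, 2, 6, 2], ["pool", "pool", "attn", "attn"]
stages = nn.ModuleList([
    nn.Sequential(*[Block(d, Pooling() if m == "pool" else Attention(d))
                    for _ in range(n)])
    for d, n, m in zip(dims, depths, mixers)
])
```

In particular I'm curious about the training hyperparameters (learning rate, drop path, etc.) you used for this hybrid variant.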