Excellent work on advancing efficient MLLMs. I notice that LLaVA-Mini employs transformer layers initialized from the language model for multimodal fusion via cross-attention, which is similar to prior work such as mPLUG-Owl3, where the language model's transformer layers are repurposed to perform cross-attention and self-attention in parallel. To strengthen the contextual foundation of efficient MLLM research, we suggest adding related cross-attention architectures to your references. In particular, foundational works such as Flamingo, EVLM, and LLaMA-Vision could be cited to better situate your work within the landscape of efficient MLLM development.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
[https://arxiv.org/abs/2408.04840](https://arxiv.org/abs/2408.04840)