
Suggestion to cite prior work mPLUG-Owl3 and other related cross-attention-based efficient MLLMs #18

Open
LukeForeverYoung opened this issue Feb 8, 2025 · 4 comments

Comments

@LukeForeverYoung

Excellent work on advancing efficient MLLMs. I observe that LLaVA-Mini employs transformer layers initialized from the language model for multimodal fusion through cross-attention mechanisms, which shares similarities with prior works such as mPLUG-Owl3, which repurposes the transformer layers within the language model to execute cross-attention and self-attention operations in parallel. To strengthen the contextual foundation of efficient MLLM research, we suggest adding related cross-attention architectures to your references. Specifically, foundational works like Flamingo, EVLM, and LLaMA-Vision could be cited to better situate your work within the landscape of efficient MLLM development.

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
[https://arxiv.org/abs/2408.04840](https://arxiv.org/abs/2408.04840)
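
For concreteness, here is a minimal PyTorch sketch of the parallel self-/cross-attention fusion pattern I am referring to. This is only an illustration under my own assumptions; the module names, the gating, and the dimensions are made up and are not taken from the mPLUG-Owl3 or LLaVA-Mini code.

```python
import torch
import torch.nn as nn


class ParallelFusionLayer(nn.Module):
    """Toy decoder layer: self-attention over text tokens and cross-attention
    to visual features computed in parallel, then merged through a gate.
    (Causal masking omitted for brevity.)"""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Self-attention over the language-model hidden states.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention from text queries to visual keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Per-token gate controlling how much visual context is mixed in.
        self.gate = nn.Linear(d_model, 1)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        h = self.norm(text)
        self_out, _ = self.self_attn(h, h, h, need_weights=False)
        cross_out, _ = self.cross_attn(h, visual, visual, need_weights=False)
        g = torch.sigmoid(self.gate(h))          # (B, T_text, 1)
        return text + self_out + g * cross_out   # residual + parallel fusion


# Toy usage: 1 sample, 16 text tokens, 64 visual tokens, hidden size 512.
layer = ParallelFusionLayer()
text = torch.randn(1, 16, 512)
visual = torch.randn(1, 64, 512)
print(layer(text, visual).shape)  # torch.Size([1, 16, 512])
```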

@MonolithFoundation

Many works use this kind of design, but they add a lot of parameters. With that much additional computation, I'm not sure whether using fewer tokens is still meaningful.

@MiloQ

MiloQ commented Feb 12, 2025

@MonolithFoundation Do you mean LLaVA-Mini or mPLUG-Owl3?

@MonolithFoundation

Anything with a Resampler

@MiloQ

MiloQ commented Feb 13, 2025

@MonolithFoundation Then how do you explain the speed gains reported in these papers, e.g., in terms of FLOPs?
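
To make the question concrete, here is the kind of rough per-layer FLOPs comparison I have in mind. It is only a back-of-envelope sketch with hypothetical sizes, not numbers taken from either paper, and it assumes the cross-attention path is added in every layer.

```python
# Compare per-layer FLOPs of (a) a decoder that prepends all visual tokens to
# the context versus (b) one that keeps a short context but adds a
# cross-attention path to the full visual features.
d = 4096          # hidden size (hypothetical)
n_text = 128      # text tokens
n_vis_full = 576  # visual tokens if injected directly into the context
n_vis_comp = 1    # compressed visual tokens kept in the context

def self_attn_flops(n_ctx, d):
    # QKV + output projections (~4*n*d^2) plus the attention matmuls (~2*n^2*d).
    return 4 * n_ctx * d**2 + 2 * n_ctx**2 * d

def cross_attn_flops(n_q, n_kv, d):
    # Query projection, key/value projections over visual features,
    # output projection, plus the two attention matmuls.
    return 2 * n_q * d**2 + 2 * n_kv * d**2 + 2 * n_q * n_kv * d

def ffn_flops(n_ctx, d, expansion=4):
    return 2 * n_ctx * d * (expansion * d)

# (a) Baseline: all visual tokens go through self-attention + FFN.
baseline = self_attn_flops(n_text + n_vis_full, d) + ffn_flops(n_text + n_vis_full, d)

# (b) Cross-attention variant: short context plus cross-attention to visual features.
variant = (self_attn_flops(n_text + n_vis_comp, d)
           + cross_attn_flops(n_text + n_vis_comp, n_vis_full, d)
           + ffn_flops(n_text + n_vis_comp, d))

print(f"baseline per-layer FLOPs  : {baseline:.3e}")
print(f"cross-attn per-layer FLOPs: {variant:.3e}")
print(f"ratio (variant / baseline): {variant / baseline:.2f}")
```

Under these made-up settings the variant still comes out well below the baseline, because the FFN and self-attention costs scale with the context length, but I'd like to understand how the papers account for the added cross-attention parameters.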
