Replies: 5 comments 12 replies
-
If you have the time and motivation, check #13114. Thanks for all <3
-
Another interesting one is MiniCPM-o (2.6), which you have already added vision support for in MTMD. It seems competitive with Ultravox in benchmarks. The benefit of these two (Qwen2.5-Omni & MiniCPM-o) is that you can have one model loaded (with two different mmproj files?) for both image and audio, so you can spare some VRAM. MiniCPM-o has an advantage in image tokenization speed (much faster than Qwen2.5-VL-7B), so it is also usable on CPUs. Maybe that's true of their audio processing as well (no idea). The license is more restrictive, though. Needless to say, your work on image & audio support is much appreciated.
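For illustration, the "one model, shared weights" idea would look roughly like this with `llama-mtmd-cli` (a sketch only: the GGUF/mmproj file names are placeholders, the `--image`/`--audio` flags are assumed from the PRs above, and whether a single omni projector or separate per-modality projectors are needed depends on how the model was converted):

```sh
# Same text-model weights, reused for both modalities.
# File names below are placeholders, not actual release artifacts.
./llama-mtmd-cli -m qwen2.5-omni-7b.gguf --mmproj mmproj-qwen2.5-omni.gguf \
    --image photo.jpg -p "Describe the image."

./llama-mtmd-cli -m qwen2.5-omni-7b.gguf --mmproj mmproj-qwen2.5-omni.gguf \
    --audio clip.wav -p "Summarize what is said in the audio."
```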
-
The model is censorious, ultra-PC, and refuses to transcribe if "triggered", in about half of my attempts (sic). See the details (the original MP3's content) and my musings about it all here; it is darkly hilarious: https://huggingface.co/ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF/discussions/1
-
Just FYI, I also tested Ultravox with DeepSeek-R1-Distill-Llama-8B and it works great.
As expected, the thinking mode helps with certain types of questions.
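In case someone wants to reproduce it, the combination is roughly the following (a sketch only: file names are placeholders, the `--audio` flag is assumed from the PRs above, and the projector has to come from the Ultravox build whose embedding size matches the Llama-3.1-8B base of the distill):

```sh
# Reuse the Ultravox audio projector with a different Llama-3-based text model.
# File names are placeholders, not actual release artifacts.
./llama-mtmd-cli \
    -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
    --mmproj mmproj-ultravox-llama-3_1-8b.gguf \
    --audio question.wav \
    -p "Answer the question asked in the audio."
```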
-
Thanks for your hard work and standout contributions. Your MTMD API is well designed and excellent: no code changes were needed in my downstream MTMD-API-based code when I upgraded the entire llama.cpp and MTMD source in my llama.cpp-based mobile application.
-
Audio input support was first introduced in #13623, and support in `llama-server` was added in #13714. This feature is highly experimental, so I am creating this discussion to discuss it and to keep track of the support.
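For anyone who wants to try the server path, the flow is roughly as follows (a sketch only: the file names are placeholders, and the request body assumes an OpenAI-style `input_audio` content part; check #13714 for the exact format `llama-server` accepts):

```sh
# Start the server with a text model plus an audio-capable projector
# (file names are placeholders).
./llama-server -m model.gguf --mmproj mmproj-audio.gguf --port 8080

# Send a chat completion with an audio clip attached
# (assumes an OpenAI-style "input_audio" content part; base64 -w0 is GNU coreutils).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"text\", \"text\": \"What is said in this recording?\"},
        {\"type\": \"input_audio\",
         \"input_audio\": {\"data\": \"$(base64 -w0 sample.wav)\", \"format\": \"wav\"}}
      ]
    }]
  }"
```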
Currently, we support these families of models:
Ultravox
Original model can be found here: https://huggingface.co/fixie-ai
Added in PR: #13623
This model uses "vanilla" text models, for example Llama and Nemo, meaning the text model keeps its original weights. The audio encoder is trained to match the "latent" input of the text model, so the text model can understand audio input without further fine-tuning or training.
Known problems:
Demo
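A minimal way to try it from the command line (a sketch: it assumes `llama-mtmd-cli` can fetch the model and its matching projector via `-hf`, and that `--audio` accepts a local file, per the PRs above; the prompt and audio file are illustrative):

```sh
# Download the Ultravox GGUF (and its projector) from the HF repo and transcribe a clip.
./llama-mtmd-cli -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF \
    --audio sample.mp3 \
    -p "Transcribe this audio clip."
```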
Qwen2-Audio and SeaLLM-Audio
IMPORTANT: The text model hallucinates a lot, making it unusable even with text-only input
See more details in the PR: #13760
Qwen2.5-Omni
Support was added via #13784
Modalities: