diff --git a/README.md b/README.md index 7a0d04e..cadf3ea 100644 --- a/README.md +++ b/README.md @@ -13,14 +13,14 @@ We thank the following contributors for their valuable contributions! and You! [![Star History Chart](https://api.star-history.com/svg?repos=AudioLLMs/Awesome-Audio-LLM&type=Date)](https://star-history.com/#AudioLLMs/Awesome-Audio-LLM&Date) ## Table of Contents +- [Multimodal](#multimodal) +- [Dataset Resource](#dataset-resource) - [Model and Methods](#model-and-methods) - [Benchmark](#benchmark) -- [Dataset Resource](#dataset-resource) -- [Safety](#safety) -- [Multimodal](#multimodal) - [Survey](#survey) -- [Study](#study) - [Chatbot](#chatbot) +- [Safety](#safety) +- [Study](#study) Timeline Visualization @@ -28,77 +28,110 @@ and You! - [PAL](https://arxiv.org/abs/2506.10423) - [CMI-Bench](https://arxiv.org/abs/2506.12285) - [MMAR](https://arxiv.org/abs/2505.13032) +- [Audio-FLAN](https://arxiv.org/abs/2502.16584) - [Step-Audio](https://arxiv.org/abs/2502.11946) - [OSUM](https://arxiv.org/pdf/2501.13306) -- [Audio-FLAN](https://arxiv.org/abs/2502.16584) +- [Sayna](https://github.com/SaynaAI/sayna) - [Audio-CoT](https://arxiv.org/abs/2501.07246) -- [UltraEval-Audio](https://github.com/OpenBMB/UltraEval-Audio) - [MinMo](https://arxiv.org/abs/2501.06282) - [LUCY](https://arxiv.org/abs/2501.16327) +- [UltraEval-Audio](https://github.com/OpenBMB/UltraEval-Audio) - [Typhoon2-Audio](https://arxiv.org/abs/2412.13702) - [ADU-Bench](https://arxiv.org/abs/2412.05167) - [TalkArena](https://talkarena.org/) - [MERaLiON-AudioLLM](https://arxiv.org/abs/2412.09818) -- [ADU-Bench](https://arxiv.org/abs/2412.05167) -- [WavChat-Survey](https://arxiv.org/abs/2411.13577) - [Dynamic-SUPERB Phase-2](https://arxiv.org/pdf/2411.05361) +- [WavChat-Survey](https://arxiv.org/abs/2411.13577) - [Taiwanese AudioLLM](https://arxiv.org/pdf/2411.07111) -- [SPIRIT LM](https://arxiv.org/abs/2402.05755) -- [VoiceBench](https://arxiv.org/pdf/2410.17196) -- 
[DiVA](https://arxiv.org/pdf/2410.02678) - [SpeechLM-Survey](https://arxiv.org/pdf/2410.03751) - [SpeechEmotionLlama](https://arxiv.org/pdf/2410.01162) -- [MMAU](https://arxiv.org/pdf/2410.19168) -- [SpeechLLM-Survey](https://arxiv.org/pdf/2410.18908v2) - [SPIRIT LM](https://arxiv.org/pdf/2402.05755) +- [VoiceBench](https://arxiv.org/pdf/2410.17196) +- [SpeechLLM-Survey](https://arxiv.org/pdf/2410.18908v2) +- [DiVA](https://arxiv.org/pdf/2410.02678) +- [SPIRIT LM](https://arxiv.org/abs/2402.05755) +- [MMAU](https://arxiv.org/pdf/2410.19168) - [DeSTA2](https://arxiv.org/pdf/2409.20007) -- [Moshi](https://arxiv.org/pdf/2410.00037) +- [SALMon](https://arxiv.org/abs/2409.07437) +- [ASRCompare](https://arxiv.org/pdf/2409.00800v1) - [Ultravox](https://github.com/fixie-ai/ultravox) +- [AudioBERT](https://arxiv.org/pdf/2409.08199) +- [Moshi](https://arxiv.org/pdf/2410.00037) - [EMOVA](https://arxiv.org/pdf/2409.18042) - [LLaMA-Omni](https://arxiv.org/pdf/2409.06666v1) - [MoWE-Audio](https://arxiv.org/pdf/2409.06635) -- [ASRCompare](https://arxiv.org/pdf/2409.00800v1) -- [AudioBERT](https://arxiv.org/pdf/2409.08199) -- [SALMon](https://arxiv.org/abs/2409.07437) +- [MooER](https://arxiv.org/pdf/2408.05101) +- [Mini-Omni](https://arxiv.org/pdf/2408.16725) - [Typhoon-Audio](https://arxiv.org/abs/2409.10999) - [MuChoMusic](https://arxiv.org/abs/2408.01337) -- [Mini-Omni](https://arxiv.org/pdf/2408.16725) -- [MooER](https://arxiv.org/pdf/2408.05101) -- [FunAudioLLM](https://arxiv.org/pdf/2407.04051v3) +- [AudioEntailment](https://arxiv.org/pdf/2407.18062) - [LLaST](https://arxiv.org/pdf/2407.15415) +- [FunAudioLLM](https://arxiv.org/pdf/2407.04051v3) +- [Decoder-only LLMs for STT](https://arxiv.org/pdf/2407.03169) - [GAMA](https://arxiv.org/abs/2406.11768) -- [AudioEntailment](https://arxiv.org/pdf/2407.18062) -- [CompA](https://arxiv.org/abs/2310.08753) - [Qwen2-Audio](https://arxiv.org/pdf/2407.10759) -- [Decoder-only LLMs for STT](https://arxiv.org/pdf/2407.03169) -- 
[AudioBench](https://arxiv.org/abs/2406.16020) +- [CompA](https://arxiv.org/abs/2310.08753) - [DeSTA](https://arxiv.org/abs/2406.18871) -- [Audio Hallucination](https://arxiv.org/pdf/2406.08402) -- [SD-Eval](https://arxiv.org/pdf/2406.13340) -- [CodecFake](https://arxiv.org/abs/2406.07237) - [Speech ReaLLM](https://arxiv.org/pdf/2406.09569) - [MusiLingo](https://arxiv.org/pdf/2309.08730) -- [VoiceJailbreak](https://arxiv.org/pdf/2405.19103) -- [Audio Flamingo](https://arxiv.org/abs/2402.01831) +- [SD-Eval](https://arxiv.org/pdf/2406.13340) +- [AudioBench](https://arxiv.org/abs/2406.16020) +- [CodecFake](https://arxiv.org/abs/2406.07237) +- [Audio Hallucination](https://arxiv.org/pdf/2406.08402) - [AIR-Bench](https://aclanthology.org/2024.acl-long.109/) -- [LibriSQA](https://arxiv.org/abs/2308.10390) +- [Audio Flamingo](https://arxiv.org/abs/2402.01831) +- [VoiceJailbreak](https://arxiv.org/pdf/2405.19103) - [SALMONN](https://arxiv.org/pdf/2310.13289.pdf) -- [SpokenWOZ](https://arxiv.org/abs/2305.13040) +- [LibriSQA](https://arxiv.org/abs/2308.10390) - [WavLLM](https://arxiv.org/pdf/2404.00656) -- [SLAM-LLM](https://arxiv.org/pdf/2402.08846) +- [SpokenWOZ](https://arxiv.org/abs/2305.13040) - [AudioLM-Survey](https://arxiv.org/abs/2402.13236) +- [SLAM-LLM](https://arxiv.org/pdf/2402.08846) - [Pengi](https://arxiv.org/pdf/2305.11834.pdf) - [Qwen-Audio](https://arxiv.org/pdf/2311.07919.pdf) - [CoDi-2](https://arxiv.org/pdf/2311.18775) - [UniAudio](https://arxiv.org/abs/2310.00704) - [Segment-level Q-Former](https://arxiv.org/pdf/2309.13963) -- [Dynamic-SUPERB](https://arxiv.org/abs/2309.09510) - [LLaSM](https://arxiv.org/pdf/2308.15930.pdf) +- [Dynamic-SUPERB](https://arxiv.org/abs/2309.09510) - [Prompting LLMs with Speech Recognition](https://arxiv.org/pdf/2307.11795) - [Macaw-LLM](https://arxiv.org/pdf/2306.09093) - [SpeechGPT](https://arxiv.org/pdf/2305.11000.pdf) - [AudioGPT](https://arxiv.org/pdf/2304.12995.pdf) +## Multimodal + +- 
`【2024-09】-【EMOVA】-【HKUST】-【Type: Model】` + - **EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions** + - **Author(s):** Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu + - [Paper](https://arxiv.org/pdf/2409.18042) / [Demo](https://emova-ollm.github.io/) + +- `【2023-11】-【CoDi-2】-【UC Berkeley】-【Type: Model】` + - **CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation** + - **Author(s):** Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal + - [![GitHub stars](https://img.shields.io/github/stars/microsoft/i-Code?style=social)](https://github.com/microsoft/i-Code/tree/main/CoDi-2) + - [Paper](https://arxiv.org/pdf/2311.18775) / [Demo](https://codi-2.github.io/) + +- `【2023-06】-【Macaw-LLM】-【Tencent】-【Type: Model】` + - **Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration** + - **Author(s):** Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu + - [![GitHub stars](https://img.shields.io/github/stars/lyuchenyang/Macaw-LLM?style=social)](https://github.com/lyuchenyang/Macaw-LLM) + - [Paper](https://arxiv.org/pdf/2306.09093) + +## Dataset Resource + +- `【2025-02】-【Audio-FLAN】-【The Hong Kong University of Science and Technology】-【Type: Dataset Resource】` + - **Audio-FLAN: A Preliminary Release** + - **Author(s):** Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue + - [![GitHub 
stars](https://img.shields.io/github/stars/lmxue/Audio-FLAN?style=social)](https://github.com/lmxue/Audio-FLAN) + - [Paper](https://arxiv.org/abs/2502.16584) / [Hugging Face Model](https://huggingface.co/datasets/HKUSTAudio/Audio-FLAN-Dataset) + +- `【2024-04】-【LibriSQA】-【Shanghai Jiao Tong University】-【Type: Dataset Resource】` + - **LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models** + - **Author(s):** Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang + - [![GitHub stars](https://img.shields.io/github/stars/ZihanZhaoSJTU/LibriSQA?style=social)](https://github.com/ZihanZhaoSJTU/LibriSQA) + - [Paper](https://arxiv.org/abs/2308.10390) + ## Model and Methods - `【2025-06】-【PAL】-【CVSSP,PAI@University of Surrey UK, MBZUAI Abu Dhabi】-【Type: Model】` @@ -119,6 +153,12 @@ and You! - [![GitHub stars](https://img.shields.io/github/stars/ASLP-lab/OSUM?style=social)](https://github.com/ASLP-lab/OSUM) - [Paper](https://arxiv.org/pdf/2501.13306) / [Hugging Face Model](https://huggingface.co/spaces/ASLP-lab/OSUM) +- `【2025-01】-【Sayna】-【SaynaAI】-【Type: Infrastructure】` + - **Sayna: Voice Infrastructure for Audio LLM Applications** + - **Author(s):** + - [![GitHub stars](https://img.shields.io/github/stars/SaynaAI/sayna?style=social)](https://github.com/SaynaAI/sayna) + - [Other Link](https://docs.sayna.ai/) + - `【2025-01】-【Audio-CoT】-【Nanyang Technological University, Singapore】-【Type: Model】` - **Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model** - **Author(s):** Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen @@ -146,11 +186,16 @@ and You! 
- **Author(s):** Chih-Kai Yang, Yu-Kuan Fu, Chen-An Li, Yi-Cheng Lin, Yu-Xiang Lin, Wei-Chih Chen, Ho Lam Chung, Chun-Yi Kuan, Wei-Ping Huang, Ke-Han Lu, Tzu-Quan Lin, Hsiu-Hsuan Wang, En-Pei Hu, Chan-Jan Hsu, Liang-Hsuan Tseng, I-Hsiang Chiu, Ulin Sanga, Xuanjun Chen, Po-chun Hsu, Shu-wen Yang, Hung-yi Lee - [Paper](https://arxiv.org/pdf/2411.07111) +- `【2024-10】-【SpeechEmotionLlama】-【MIT, Meta】-【Type: Model】` + - **Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech** + - **Author(s):** Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli + - [Paper](https://arxiv.org/pdf/2410.01162) + - `【2024-10】-【SPIRIT LM】-【Meta】-【Type: Model】` - **SPIRIT LM: Interleaved Spoken and Written Language Model** - - **Author(s):** Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux + - **Author(s):** Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoît Sagot, Emmanuel Dupoux - [![GitHub stars](https://img.shields.io/github/stars/facebookresearch/spiritlm?style=social)](https://github.com/facebookresearch/spiritlm) - - [Paper](https://arxiv.org/abs/2402.05755) / [Other Link](https://speechbot.github.io/spiritlm/) + - [Paper](https://arxiv.org/pdf/2402.05755) / [Demo](https://speechbot.github.io/spiritlm/) - `【2024-10】-【DiVA】-【Georgia Tech, Stanford】-【Type: Model】` - **Distilling an End-to-End Voice Assistant Without Instruction Training Data** @@ -158,16 +203,11 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/github.com/diva-audio?style=social)](https://github.com/diva-audio) - [Paper](https://arxiv.org/pdf/2410.02678) / [Demo](https://diva-audio.github.io/) -- `【2024-10】-【SpeechEmotionLlama】-【MIT, Meta】-【Type: Model】` - - **Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech** - - **Author(s):** Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli - - [Paper](https://arxiv.org/pdf/2410.01162) - - `【2024-10】-【SPIRIT LM】-【Meta】-【Type: Model】` - **SPIRIT LM: Interleaved Spoken and Written Language Model** - - **Author(s):** Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoît Sagot, Emmanuel Dupoux + - **Author(s):** Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux - [![GitHub stars](https://img.shields.io/github/stars/facebookresearch/spiritlm?style=social)](https://github.com/facebookresearch/spiritlm) - - [Paper](https://arxiv.org/pdf/2402.05755) / [Demo](https://speechbot.github.io/spiritlm/) + - [Paper](https://arxiv.org/abs/2402.05755) / [Other Link](https://speechbot.github.io/spiritlm/) - `【2024-09】-【DeSTA2】-【National Taiwan University, NVIDIA】-【Type: Model】` - **Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data** @@ -175,17 +215,29 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/kehanlu/DeSTA2?style=social)](https://github.com/kehanlu/DeSTA2) - [Paper](https://arxiv.org/pdf/2409.20007) -- `【2024-09】-【Moshi】-【Kyutai】-【Type: Model】` - - **Moshi: a speech-text foundation model for real-time dialogue** - - **Author(s):** Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour - - [![GitHub stars](https://img.shields.io/github/stars/kyutai-labs/moshi?style=social)](https://github.com/kyutai-labs/moshi) - - [Paper](https://arxiv.org/pdf/2410.00037) +- `【2024-09】-【ASRCompare】-【Tsinghua University, Tencent AI Lab】-【Type: Model】` + - **Comparing Discrete and Continuous Space LLMs for Speech Recognition** + - **Author(s):** Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu + - [![GitHub stars](https://img.shields.io/github/stars/xuyaoxun/ASRCompare?style=social)](https://github.com/xuyaoxun/ASRCompare) + - [Paper](https://arxiv.org/pdf/2409.00800v1) - `【2024-09】-【Ultravox】-【Fixie.ai】-【Type: Model】` - **Ultravox: A Fast Multimodal LLM for Real-Time Voice** - **Author(s):** - [![GitHub stars](https://img.shields.io/github/stars/fixie-ai/ultravox?style=social)](https://github.com/fixie-ai/ultravox) +- `【2024-09】-【AudioBERT】-【POSTECH, Inha University】-【Type: Model】` + - **AudioBERT: Audio Knowledge Augmented Language Model** + - **Author(s):** Hyunjong Ok, Suho Yoo, Jaeho Lee + - [![GitHub stars](https://img.shields.io/github/stars/HJ-Ok/AudioBERT?style=social)](https://github.com/HJ-Ok/AudioBERT) + - [Paper](https://arxiv.org/pdf/2409.08199) + +- `【2024-09】-【Moshi】-【Kyutai】-【Type: Model】` + - **Moshi: a speech-text foundation model for real-time dialogue** + - **Author(s):** Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour + - [![GitHub stars](https://img.shields.io/github/stars/kyutai-labs/moshi?style=social)](https://github.com/kyutai-labs/moshi) + - 
[Paper](https://arxiv.org/pdf/2410.00037) + - `【2024-09】-【LLaMA-Omni】-【Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)】-【Type: Model】` - **LLaMA-Omni: Seamless Speech Interaction with Large Language Models** - **Author(s):** Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng @@ -197,22 +249,11 @@ and You! - **Author(s):** Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw - [Paper](https://arxiv.org/pdf/2409.06635) -- `【2024-09】-【ASRCompare】-【Tsinghua University, Tencent AI Lab】-【Type: Model】` - - **Comparing Discrete and Continuous Space LLMs for Speech Recognition** - - **Author(s):** Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu - - [![GitHub stars](https://img.shields.io/github/stars/xuyaoxun/ASRCompare?style=social)](https://github.com/xuyaoxun/ASRCompare) - - [Paper](https://arxiv.org/pdf/2409.00800v1) - -- `【2024-09】-【AudioBERT】-【POSTECH, Inha University】-【Type: Model】` - - **AudioBERT: Audio Knowledge Augmented Language Model** - - **Author(s):** Hyunjong Ok, Suho Yoo, Jaeho Lee - - [![GitHub stars](https://img.shields.io/github/stars/HJ-Ok/AudioBERT?style=social)](https://github.com/HJ-Ok/AudioBERT) - - [Paper](https://arxiv.org/pdf/2409.08199) - -- `【2024-08】-【Typhoon-Audio】-【SCB 10X】-【Type: Multimodal Language Model】` - - **Typhoon-Audio: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models** - - **Author(s):** Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul - - [Paper](https://arxiv.org/abs/2409.10999) / [Hugging Face Model](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-audio-preview) +- `【2024-08】-【MooER】-【Moore Threads】-【Type: Model】` + - **MooER: LLM-based Speech Recognition and Translation Models from Moore Threads** + - **Author(s):** Zhenlin Liang, Junhao Xu, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang + - [![GitHub 
stars](https://img.shields.io/github/stars/MooreThreads/MooER?style=social)](https://github.com/MooreThreads/MooER) + - [Paper](https://arxiv.org/pdf/2408.05101) - `【2024-08】-【Mini-Omni】-【Tsinghua University】-【Type: Model】` - **Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming** @@ -220,11 +261,16 @@ and You! - [![GitHub stars](https://img.shields.io/github/stars/gpt-omni/mini-omni?style=social)](https://github.com/gpt-omni/mini-omni) - [Paper](https://arxiv.org/pdf/2408.16725) -- `【2024-08】-【MooER】-【Moore Threads】-【Type: Model】` - - **MooER: LLM-based Speech Recognition and Translation Models from Moore Threads** - - **Author(s):** Zhenlin Liang, Junhao Xu, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang - - [![GitHub stars](https://img.shields.io/github/stars/MooreThreads/MooER?style=social)](https://github.com/MooreThreads/MooER) - - [Paper](https://arxiv.org/pdf/2408.05101) +- `【2024-08】-【Typhoon-Audio】-【SCB 10X】-【Type: Multimodal Language Model】` + - **Typhoon-Audio: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models** + - **Author(s):** Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul + - [Paper](https://arxiv.org/abs/2409.10999) / [Hugging Face Model](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-audio-preview) + +- `【2024-07】-【LLaST】-【The Chinese University of Hong Kong, Shenzhen; Shanghai AI Laboratory; Nara Institute of Science and Technology, Japan】-【Type: Model】` + - **LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models** + - **Author(s):** Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura + - [![GitHub stars](https://img.shields.io/github/stars/openaudiolab/LLaST?style=social)](https://github.com/openaudiolab/LLaST) + - [Paper](https://arxiv.org/pdf/2407.15415) - `【2024-07】-【FunAudioLLM】-【Alibaba】-【Type: Model】` - **FunAudioLLM: Voice Understanding and Generation 
Foundation Models for Natural Interaction Between Humans and LLMs** @@ -232,11 +278,10 @@ and You! - [![GitHub stars](https://img.shields.io/github/stars/github.com/FunAudioLLM?style=social)](https://github.com/FunAudioLLM) - [Paper](https://arxiv.org/pdf/2407.04051v3) / [Demo](https://fun-audio-llm.github.io/) -- `【2024-07】-【LLaST】-【The Chinese University of Hong Kong, Shenzhen; Shanghai AI Laboratory; Nara Institute of Science and Technology, Japan】-【Type: Model】` - - **LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models** - - **Author(s):** Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura - - [![GitHub stars](https://img.shields.io/github/stars/openaudiolab/LLaST?style=social)](https://github.com/openaudiolab/LLaST) - - [Paper](https://arxiv.org/pdf/2407.15415) +- `【2024-07】-【Decoder-only LLMs for STT】-【NTU-Taiwan, Meta】-【Type: Research】` + - **Investigating Decoder-only Large Language Models for Speech-to-text Translation** + - **Author(s):** Authors not specified in the provided information + - [Paper](https://arxiv.org/pdf/2407.03169) - `【2024-07】-【GAMA】-【University of Maryland, College Park】-【Type: Model】` - **GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities** @@ -244,22 +289,17 @@ and You! - [![GitHub stars](https://img.shields.io/github/stars/Sreyan88/GAMA?style=social)](https://github.com/Sreyan88/GAMA) - [Paper](https://arxiv.org/abs/2406.11768) / [Demo](https://sreyan88.github.io/gamaaudio/) -- `【2024-07】-【CompA】-【University of Maryland, College Park; Adobe, USA; NVIDIA, Bangalore, India】-【Type: Model】` - - **CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models** - - **Author(s):** Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. 
Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha - - [![GitHub stars](https://img.shields.io/github/stars/Sreyan88/CompA?style=social)](https://github.com/Sreyan88/CompA) - - [Paper](https://arxiv.org/abs/2310.08753) / [Demo](https://sreyan88.github.io/compa_iclr/) - - `【2024-07】-【Qwen2-Audio】-【Alibaba Group】-【Type: Model】` - **Qwen2-Audio Technical Report** - **Author(s):** Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou - [![GitHub stars](https://img.shields.io/github/stars/QwenLM/Qwen2-Audio?style=social)](https://github.com/QwenLM/Qwen2-Audio) - [Paper](https://arxiv.org/pdf/2407.10759) -- `【2024-07】-【Decoder-only LLMs for STT】-【NTU-Taiwan, Meta】-【Type: Research】` - - **Investigating Decoder-only Large Language Models for Speech-to-text Translation** - - **Author(s):** Authors not specified in the provided information - - [Paper](https://arxiv.org/pdf/2407.03169) +- `【2024-07】-【CompA】-【University of Maryland, College Park; Adobe, USA; NVIDIA, Bangalore, India】-【Type: Model】` + - **CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models** + - **Author(s):** Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha + - [![GitHub stars](https://img.shields.io/github/stars/Sreyan88/CompA?style=social)](https://github.com/Sreyan88/CompA) + - [Paper](https://arxiv.org/abs/2310.08753) / [Demo](https://sreyan88.github.io/compa_iclr/) - `【2024-06】-【DeSTA】-【NTU-Taiwan, Nvidia】-【Type: Model】` - **DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment** @@ -418,18 +458,18 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/microsoft/AudioEntailment?style=social)](https://github.com/microsoft/AudioEntailment) - [Paper](https://arxiv.org/pdf/2407.18062) -- `【2024-06】-【AudioBench】-【A*STAR, Singapore】-【Type: Benchmark】` - - **AudioBench: A Universal Benchmark for Audio Large Language Models** - - **Author(s):** Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen - - [![GitHub stars](https://img.shields.io/github/stars/AudioLLMs/AudioBench?style=social)](https://github.com/AudioLLMs/AudioBench) - - [Paper](https://arxiv.org/abs/2406.16020) / [Demo](https://huggingface.co/spaces/AudioLLMs/AudioBench-Leaderboard) - - `【2024-06】-【SD-Eval】-【CUHK, Bytedance】-【Type: Benchmark】` - **SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words** - **Author(s):** Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu - [![GitHub stars](https://img.shields.io/github/stars/amphionspace/SD-Eval?style=social)](https://github.com/amphionspace/SD-Eval) - [Paper](https://arxiv.org/pdf/2406.13340) +- `【2024-06】-【AudioBench】-【A*STAR, Singapore】-【Type: Benchmark】` + - **AudioBench: A Universal Benchmark for Audio Large Language Models** + - **Author(s):** Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen + - [![GitHub stars](https://img.shields.io/github/stars/AudioLLMs/AudioBench?style=social)](https://github.com/AudioLLMs/AudioBench) + - [Paper](https://arxiv.org/abs/2406.16020) / [Demo](https://huggingface.co/spaces/AudioLLMs/AudioBench-Leaderboard) + - `【2024-05】-【AIR-Bench】-【ZJU, Alibaba】-【Type: Benchmark】` - **AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension** - **Author(s):** Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou @@ -448,53 +488,6 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/dynamic-superb/dynamic-superb?style=social)](https://github.com/dynamic-superb/dynamic-superb) - [Paper](https://arxiv.org/abs/2309.09510) -## Dataset Resource - -- `【2025-02】-【Audio-FLAN】-【The Hong Kong University of Science and Technology】-【Type: Dataset Resource】` - - **Audio-FLAN: A Preliminary Release** - - **Author(s):** Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue - - [![GitHub stars](https://img.shields.io/github/stars/lmxue/Audio-FLAN?style=social)](https://github.com/lmxue/Audio-FLAN) - - [Paper](https://arxiv.org/abs/2502.16584) / [Hugging Face Model](https://huggingface.co/datasets/HKUSTAudio/Audio-FLAN-Dataset) - -- `【2024-04】-【LibriSQA】-【Shanghai Jiao Tong University】-【Type: Dataset Resource】` - - **LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models** - - **Author(s):** Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang - - [![GitHub stars](https://img.shields.io/github/stars/ZihanZhaoSJTU/LibriSQA?style=social)](https://github.com/ZihanZhaoSJTU/LibriSQA) - - [Paper](https://arxiv.org/abs/2308.10390) - -## Safety - -- `【2024-06】-【CodecFake】-【National Taiwan University】-【Type: Safety】` - - **CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems** - - **Author(s):** Haibin Wu, Yuan Tseng, Hung-yi Lee - - [![GitHub stars](https://img.shields.io/github/stars/roger-tseng/CodecFake?style=social)](https://github.com/roger-tseng/CodecFake) - - [Paper](https://arxiv.org/abs/2406.07237) / [Other Link](https://codecfake.github.io/) - -- `【2024-05】-【VoiceJailbreak】-【CISPA】-【Type: Method】` - - **Voice Jailbreak Attacks Against GPT-4o** - - **Author(s):** Xinyue Shen, Yixin Wu, 
Michael Backes, Yang Zhang - - [![GitHub stars](https://img.shields.io/github/stars/TrustAIRLab/VoiceJailbreakAttack?style=social)](https://github.com/TrustAIRLab/VoiceJailbreakAttack) - - [Paper](https://arxiv.org/pdf/2405.19103) - -## Multimodal - -- `【2024-09】-【EMOVA】-【HKUST】-【Type: Model】` - - **EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions** - - **Author(s):** Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu - - [Paper](https://arxiv.org/pdf/2409.18042) / [Demo](https://emova-ollm.github.io/) - -- `【2023-11】-【CoDi-2】-【UC Berkeley】-【Type: Model】` - - **CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation** - - **Author(s):** Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal - - [![GitHub stars](https://img.shields.io/github/stars/main/CoDi-2?style=social)](https://github.com/microsoft/i-Code/tree/main/CoDi-2) - - [Paper](https://arxiv.org/pdf/2311.18775) / [Demo](https://codi-2.github.io/) - -- `【2023-06】-【Macaw-LLM】-【Tencent】-【Type: Model】` - - **Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration** - - **Author(s):** Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu - - [![GitHub stars](https://img.shields.io/github/stars/lyuchenyang/Macaw-LLM?style=social)](https://github.com/lyuchenyang/Macaw-LLM) - - [Paper](https://arxiv.org/pdf/2306.09093) - ## Survey - `【2024-11】-【WavChat-Survey】-【Zhejiang University】-【Type: Survey】` @@ -517,6 +510,27 @@ and You! - **Author(s):** Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H. 
Liu, Hung-yi Lee - [Paper](https://arxiv.org/abs/2402.13236) +## Chatbot + +- `【2025-01】-【MinMo】-【FunAudioLLM Team, Tongyi Lab, Alibaba Group】-【Type: Multimodal Large Language Model】` + - **MinMo: A Multimodal Large Language Model for Seamless Voice Interaction** + - **Author(s):** Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou + - [Paper](https://arxiv.org/abs/2501.06282) / [Other Link](https://funaudiollm.github.io/minmo) + +## Safety + +- `【2024-06】-【CodecFake】-【National Taiwan University】-【Type: Safety】` + - **CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems** + - **Author(s):** Haibin Wu, Yuan Tseng, Hung-yi Lee + - [![GitHub stars](https://img.shields.io/github/stars/roger-tseng/CodecFake?style=social)](https://github.com/roger-tseng/CodecFake) + - [Paper](https://arxiv.org/abs/2406.07237) / [Other Link](https://codecfake.github.io/) + +- `【2024-05】-【VoiceJailbreak】-【CISPA】-【Type: Method】` + - **Voice Jailbreak Attacks Against GPT-4o** + - **Author(s):** Xinyue Shen, Yixin Wu, Michael Backes, Yang Zhang + - [![GitHub stars](https://img.shields.io/github/stars/TrustAIRLab/VoiceJailbreakAttack?style=social)](https://github.com/TrustAIRLab/VoiceJailbreakAttack) + - [Paper](https://arxiv.org/pdf/2405.19103) + ## Study - `【2024-06】-【Audio Hallucination】-【NTU-Taiwan】-【Type: Research】` @@ -524,10 +538,3 @@ and You! 
- **Author(s):** Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee - [![GitHub stars](https://img.shields.io/github/stars/kuan2jiu99/audio-hallucination?style=social)](https://github.com/kuan2jiu99/audio-hallucination) - [Paper](https://arxiv.org/pdf/2406.08402) - -## Chatbot - -- `【2025-01】-【MinMo】-【FunAudioLLM Team, Tongyi Lab, Alibaba Group】-【Type: Multimodal Large Language Model】` - - **MinMo: A Multimodal Large Language Model for Seamless Voice Interaction** - - **Author(s):** Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou - - [Paper](https://arxiv.org/abs/2501.06282) / [Other Link](https://funaudiollm.github.io/minmo) diff --git a/items/Sayna.json b/items/Sayna.json new file mode 100644 index 0000000..6427607 --- /dev/null +++ b/items/Sayna.json @@ -0,0 +1,18 @@ +{ + "Category": "Model and Methods", + "Type": "Infrastructure", + "Abbreviation": "Sayna", + "Title": "Sayna: Voice Infrastructure for Audio LLM Applications", + "Time": "2025-01", + "Affiliation": "SaynaAI", + "Author": "", + "GitHub_Link": "https://github.com/SaynaAI/sayna", + "Paper_Link": "", + "HF_Link": "", + "Demo_Link": "", + "Other_Link": "https://docs.sayna.ai/", + "Audio_Input": "Yes", + "Audio_Output": "Yes", + "Language": "Multilingual", + "Description": "Sayna is a real-time voice infrastructure platform for building production voice-enabled LLM agents. It provides a unified API layer for STT/TTS with real-time streaming, multi-provider support, VAD, and voice analytics. Built with Rust and LiveKit, it offers low-latency WebSocket connections and REST endpoints for seamless voice-first experiences. 
Self-hostable with Docker and Kubernetes support." +} diff --git a/model_release_timeline_vertical_listed.png b/model_release_timeline_vertical_listed.png index 627b0f5..cd1059a 100644 Binary files a/model_release_timeline_vertical_listed.png and b/model_release_timeline_vertical_listed.png differ