diff --git a/README.md b/README.md
index 7a0d04e..3d20db1 100644
--- a/README.md
+++ b/README.md
@@ -13,74 +13,75 @@ We thank the following contributors for their valuable contributions!
and You!

[![Star History Chart](https://api.star-history.com/svg?repos=AudioLLMs/Awesome-Audio-LLM&type=Date)](https://star-history.com/#AudioLLMs/Awesome-Audio-LLM&Date)

## Table of Contents
+- [Dataset Resource](#dataset-resource)
- [Model and Methods](#model-and-methods)
- [Benchmark](#benchmark)
-- [Dataset Resource](#dataset-resource)
-- [Safety](#safety)
-- [Multimodal](#multimodal)
- [Survey](#survey)
+- [Multimodal](#multimodal)
- [Study](#study)
+- [Safety](#safety)
- [Chatbot](#chatbot)

Timeline Visualization

### Abbreviations with Links
-- [PAL](https://arxiv.org/abs/2506.10423)
+- [ACORN](https://arxiv.org/abs/2506.08524)
- [CMI-Bench](https://arxiv.org/abs/2506.12285)
+- [PAL](https://arxiv.org/abs/2506.10423)
- [MMAR](https://arxiv.org/abs/2505.13032)
-- [Step-Audio](https://arxiv.org/abs/2502.11946)
-- [OSUM](https://arxiv.org/pdf/2501.13306)
- [Audio-FLAN](https://arxiv.org/abs/2502.16584)
+- [OSUM](https://arxiv.org/pdf/2501.13306)
+- [Step-Audio](https://arxiv.org/abs/2502.11946)
- [Audio-CoT](https://arxiv.org/abs/2501.07246)
- [UltraEval-Audio](https://github.com/OpenBMB/UltraEval-Audio)
-- [MinMo](https://arxiv.org/abs/2501.06282)
- [LUCY](https://arxiv.org/abs/2501.16327)
-- [Typhoon2-Audio](https://arxiv.org/abs/2412.13702)
+- [MinMo](https://arxiv.org/abs/2501.06282)
- [ADU-Bench](https://arxiv.org/abs/2412.05167)
- [TalkArena](https://talkarena.org/)
+- [Typhoon2-Audio](https://arxiv.org/abs/2412.13702)
- [MERaLiON-AudioLLM](https://arxiv.org/abs/2412.09818)
-- [ADU-Bench](https://arxiv.org/abs/2412.05167)
+- [Taiwanese AudioLLM](https://arxiv.org/pdf/2411.07111)
- [WavChat-Survey](https://arxiv.org/abs/2411.13577)
- [Dynamic-SUPERB Phase-2](https://arxiv.org/pdf/2411.05361)
-- [Taiwanese AudioLLM](https://arxiv.org/pdf/2411.07111)
-- [SPIRIT LM](https://arxiv.org/abs/2402.05755)
- [VoiceBench](https://arxiv.org/pdf/2410.17196)
-- [DiVA](https://arxiv.org/pdf/2410.02678)
-- [SpeechLM-Survey](https://arxiv.org/pdf/2410.03751)
-- [SpeechEmotionLlama](https://arxiv.org/pdf/2410.01162)
- [MMAU](https://arxiv.org/pdf/2410.19168)
+- [SPIRIT LM](https://arxiv.org/abs/2402.05755)
- [SpeechLLM-Survey](https://arxiv.org/pdf/2410.18908v2)
+- [SpeechEmotionLlama](https://arxiv.org/pdf/2410.01162)
-- [SPIRIT LM](https://arxiv.org/pdf/2402.05755)
-- [DeSTA2](https://arxiv.org/pdf/2409.20007)
-- [Moshi](https://arxiv.org/pdf/2410.00037)
+- [SpeechLM-Survey](https://arxiv.org/pdf/2410.03751)
+- [DiVA](https://arxiv.org/pdf/2410.02678)
+- [AudioBERT](https://arxiv.org/pdf/2409.08199)
- [Ultravox](https://github.com/fixie-ai/ultravox)
-- [EMOVA](https://arxiv.org/pdf/2409.18042)
- [LLaMA-Omni](https://arxiv.org/pdf/2409.06666v1)
-- [MoWE-Audio](https://arxiv.org/pdf/2409.06635)
-- [ASRCompare](https://arxiv.org/pdf/2409.00800v1)
-- [AudioBERT](https://arxiv.org/pdf/2409.08199)
- [SALMon](https://arxiv.org/abs/2409.07437)
-- [Typhoon-Audio](https://arxiv.org/abs/2409.10999)
+- [DeSTA2](https://arxiv.org/pdf/2409.20007)
+- [ASRCompare](https://arxiv.org/pdf/2409.00800v1)
+- [MoWE-Audio](https://arxiv.org/pdf/2409.06635)
+- [Moshi](https://arxiv.org/pdf/2410.00037)
+- [EMOVA](https://arxiv.org/pdf/2409.18042)
- [MuChoMusic](https://arxiv.org/abs/2408.01337)
- [Mini-Omni](https://arxiv.org/pdf/2408.16725)
- [MooER](https://arxiv.org/pdf/2408.05101)
-- [FunAudioLLM](https://arxiv.org/pdf/2407.04051v3)
+- 
[Typhoon-Audio](https://arxiv.org/abs/2409.10999) +- [Qwen2-Audio](https://arxiv.org/pdf/2407.10759) - [LLaST](https://arxiv.org/pdf/2407.15415) -- [GAMA](https://arxiv.org/abs/2406.11768) +- [Decoder-only LLMs for STT](https://arxiv.org/pdf/2407.03169) - [AudioEntailment](https://arxiv.org/pdf/2407.18062) +- [GAMA](https://arxiv.org/abs/2406.11768) +- [FunAudioLLM](https://arxiv.org/pdf/2407.04051v3) - [CompA](https://arxiv.org/abs/2310.08753) -- [Qwen2-Audio](https://arxiv.org/pdf/2407.10759) -- [Decoder-only LLMs for STT](https://arxiv.org/pdf/2407.03169) +- [Speech ReaLLM](https://arxiv.org/pdf/2406.09569) +- [Audio Hallucination](https://arxiv.org/pdf/2406.08402) - [AudioBench](https://arxiv.org/abs/2406.16020) - [DeSTA](https://arxiv.org/abs/2406.18871) -- [Audio Hallucination](https://arxiv.org/pdf/2406.08402) -- [SD-Eval](https://arxiv.org/pdf/2406.13340) - [CodecFake](https://arxiv.org/abs/2406.07237) -- [Speech ReaLLM](https://arxiv.org/pdf/2406.09569) +- [SD-Eval](https://arxiv.org/pdf/2406.13340) - [MusiLingo](https://arxiv.org/pdf/2309.08730) -- [VoiceJailbreak](https://arxiv.org/pdf/2405.19103) -- [Audio Flamingo](https://arxiv.org/abs/2402.01831) - [AIR-Bench](https://aclanthology.org/2024.acl-long.109/) +- [Audio Flamingo](https://arxiv.org/abs/2402.01831) +- [VoiceJailbreak](https://arxiv.org/pdf/2405.19103) - [LibriSQA](https://arxiv.org/abs/2308.10390) - [SALMONN](https://arxiv.org/pdf/2310.13289.pdf) - [SpokenWOZ](https://arxiv.org/abs/2305.13040) @@ -91,34 +92,53 @@ and You! - [Qwen-Audio](https://arxiv.org/pdf/2311.07919.pdf) - [CoDi-2](https://arxiv.org/pdf/2311.18775) - [UniAudio](https://arxiv.org/abs/2310.00704) -- [Segment-level Q-Former](https://arxiv.org/pdf/2309.13963) - [Dynamic-SUPERB](https://arxiv.org/abs/2309.09510) - [LLaSM](https://arxiv.org/pdf/2308.15930.pdf) +- [Segment-level Q-Former](https://arxiv.org/pdf/2309.13963) - [Prompting LLMs with Speech Recognition](https://arxiv.org/pdf/2307.11795) - [Macaw-LLM](https://arxiv.org/pdf/2306.09093) - [SpeechGPT](https://arxiv.org/pdf/2305.11000.pdf) - [AudioGPT](https://arxiv.org/pdf/2304.12995.pdf) +## Dataset Resource + +- `【2025-02】-【Audio-FLAN】-【The Hong Kong University of Science and Technology】-【Type: Dataset Resource】` + - **Audio-FLAN: A Preliminary Release** + - **Author(s):** Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue + - [![GitHub stars](https://img.shields.io/github/stars/lmxue/Audio-FLAN?style=social)](https://github.com/lmxue/Audio-FLAN) + - [Paper](https://arxiv.org/abs/2502.16584) / [Hugging Face Model](https://huggingface.co/datasets/HKUSTAudio/Audio-FLAN-Dataset) + +- `【2024-04】-【LibriSQA】-【Shanghai Jiao Tong University】-【Type: Dataset Resource】` + - **LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models** + - **Author(s):** Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang + - [![GitHub stars](https://img.shields.io/github/stars/ZihanZhaoSJTU/LibriSQA?style=social)](https://github.com/ZihanZhaoSJTU/LibriSQA) + - [Paper](https://arxiv.org/abs/2308.10390) + ## Model and Methods +- `【2025-07】-【ACORN】-【NIO】-【Type: Model】` + - **Teaching Physical Awareness to LLMs through Sounds** + - **Author(s):** Weiguo Wang, Andy Nie, Wenrui Zhou, Yi Kai, Chengchen Hu + - [Paper](https://arxiv.org/abs/2506.08524) / [Other 
Link](https://icml.cc/virtual/2025/poster/46139) + - `【2025-06】-【PAL】-【CVSSP,PAI@University of Surrey UK, MBZUAI Abu Dhabi】-【Type: Model】` - **PAL: Probing Audio Encoders via LLMs - A Study of Information Transfer from Audio Encoders to LLMs** - **Author(s):** Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson, Imran Razzak, Muhammad Awais - [![GitHub stars](https://img.shields.io/github/stars/ta012/PAL-AudioLLM?style=social)](https://github.com/ta012/PAL-AudioLLM) - [Paper](https://arxiv.org/abs/2506.10423) / [Other Link](https://ta012.github.io/PAL/) -- `【2025-02】-【Step-Audio】-【Step-Audio Team, StepFun】-【Type: Model】` - - **Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction** - - **Author(s):** Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, Jianchang Wu, Jiangjie Zhen, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Hongyuan Wang, Kang An, Wei Ji, Wen Li, Xuan Wen, Xiangwen Kong, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Junjing Guo, Jiashuai Liu, Jiahong Liu, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Liang Zhao, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingliang Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Ran Sun, Shuai Shuai, Shaoliang Pang, Shiliang Yang, Shuli Gao, Shanshan Yuan, Siqi Liu, Shihong Deng, Shilei Jiang, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wuxun Xie, Weipeng Ming, Wenqing He , Wen Sun, Xin Han, Xin Huang, Xiaomin Deng, Xiaojia Liu, Xin Wu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaoyu Wang, Yaqiang Shi, Yilei Wang, Yizhuang Zhou, Yinmin Zhong, Yang Zhang, Yaoben Wei, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuchu Luo, Yuanhao Ding, Yuting Yan, Yaqi Dai, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zhisheng Guan, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu - - [![GitHub stars](https://img.shields.io/github/stars/stepfun-ai/Step-Audio?style=social)](https://github.com/stepfun-ai/Step-Audio) - - [Paper](https://arxiv.org/abs/2502.11946) / [Hugging Face Model](https://huggingface.co/stepfun-ai/Step-Audio-Chat) - - `【2025-02】-【OSUM】-【ASLP@NPU】-【Type: Model】` - **OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia** - **Author(s):** Xuelong Geng, Kun Wei, Qijie Shao, Shuiyun Liu, Zhennan Lin, Zhixian Zhao, Guojian Li, Wenjie Tian, Peikun Chen, Yangze Li, Pengcheng Guo, Mingchen Shao, Shuiyuan Wang, Yuang Cao, Chengyou Wang, Tianyi Xu, Yuhang Dai, Xinfa Zhu, Yue Li, Li Zhang, Lei Xie - [![GitHub stars](https://img.shields.io/github/stars/ASLP-lab/OSUM?style=social)](https://github.com/ASLP-lab/OSUM) - [Paper](https://arxiv.org/pdf/2501.13306) / [Hugging Face Model](https://huggingface.co/spaces/ASLP-lab/OSUM) +- `【2025-02】-【Step-Audio】-【Step-Audio Team, StepFun】-【Type: Model】` + - **Step-Audio: Unified Understanding and Generation in 
Intelligent Speech Interaction** + - **Author(s):** Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, Jianchang Wu, Jiangjie Zhen, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Hongyuan Wang, Kang An, Wei Ji, Wen Li, Xuan Wen, Xiangwen Kong, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Junjing Guo, Jiashuai Liu, Jiahong Liu, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Liang Zhao, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingliang Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Ran Sun, Shuai Shuai, Shaoliang Pang, Shiliang Yang, Shuli Gao, Shanshan Yuan, Siqi Liu, Shihong Deng, Shilei Jiang, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wuxun Xie, Weipeng Ming, Wenqing He , Wen Sun, Xin Han, Xin Huang, Xiaomin Deng, Xiaojia Liu, Xin Wu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaoyu Wang, Yaqiang Shi, Yilei Wang, Yizhuang Zhou, Yinmin Zhong, Yang Zhang, Yaoben Wei, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuchu Luo, Yuanhao Ding, Yuting Yan, Yaqi Dai, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zhisheng Guan, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu + - [![GitHub stars](https://img.shields.io/github/stars/stepfun-ai/Step-Audio?style=social)](https://github.com/stepfun-ai/Step-Audio) + - [Paper](https://arxiv.org/abs/2502.11946) / [Hugging Face Model](https://huggingface.co/stepfun-ai/Step-Audio-Chat) + - `【2025-01】-【Audio-CoT】-【Nanyang Technological University, Singapore】-【Type: Model】` - **Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model** - **Author(s):** Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen @@ -152,12 +172,6 @@ and You! - [![GitHub stars](https://img.shields.io/github/stars/facebookresearch/spiritlm?style=social)](https://github.com/facebookresearch/spiritlm) - [Paper](https://arxiv.org/abs/2402.05755) / [Other Link](https://speechbot.github.io/spiritlm/) -- `【2024-10】-【DiVA】-【Georgia Tech, Stanford】-【Type: Model】` - - **Distilling an End-to-End Voice Assistant Without Instruction Training Data** - - **Author(s):** William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang - - [![GitHub stars](https://img.shields.io/github/stars/github.com/diva-audio?style=social)](https://github.com/diva-audio) - - [Paper](https://arxiv.org/pdf/2410.02678) / [Demo](https://diva-audio.github.io/) - - `【2024-10】-【SpeechEmotionLlama】-【MIT, Meta】-【Type: Model】` - **Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech** - **Author(s):** Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli @@ -169,17 +183,17 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/facebookresearch/spiritlm?style=social)](https://github.com/facebookresearch/spiritlm) - [Paper](https://arxiv.org/pdf/2402.05755) / [Demo](https://speechbot.github.io/spiritlm/) -- `【2024-09】-【DeSTA2】-【National Taiwan University, NVIDIA】-【Type: Model】` - - **Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data** - - **Author(s):** Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee - - [![GitHub stars](https://img.shields.io/github/stars/kehanlu/DeSTA2?style=social)](https://github.com/kehanlu/DeSTA2) - - [Paper](https://arxiv.org/pdf/2409.20007) +- `【2024-10】-【DiVA】-【Georgia Tech, Stanford】-【Type: Model】` + - **Distilling an End-to-End Voice Assistant Without Instruction Training Data** + - **Author(s):** William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang + - [![GitHub stars](https://img.shields.io/github/stars/github.com/diva-audio?style=social)](https://github.com/diva-audio) + - [Paper](https://arxiv.org/pdf/2410.02678) / [Demo](https://diva-audio.github.io/) -- `【2024-09】-【Moshi】-【Kyutai】-【Type: Model】` - - **Moshi: a speech-text foundation model for real-time dialogue** - - **Author(s):** Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour - - [![GitHub stars](https://img.shields.io/github/stars/kyutai-labs/moshi?style=social)](https://github.com/kyutai-labs/moshi) - - [Paper](https://arxiv.org/pdf/2410.00037) +- `【2024-09】-【AudioBERT】-【POSTECH, Inha University】-【Type: Model】` + - **AudioBERT: Audio Knowledge Augmented Language Model** + - **Author(s):** Hyunjong Ok, Suho Yoo, Jaeho Lee + - [![GitHub stars](https://img.shields.io/github/stars/HJ-Ok/AudioBERT?style=social)](https://github.com/HJ-Ok/AudioBERT) + - [Paper](https://arxiv.org/pdf/2409.08199) - `【2024-09】-【Ultravox】-【Fixie.ai】-【Type: Model】` - **Ultravox: A Fast Multimodal LLM for Real-Time Voice** @@ -192,10 +206,11 @@ and You! - [![GitHub stars](https://img.shields.io/github/stars/ictnlp/llama-omni?style=social)](https://github.com/ictnlp/llama-omni) - [Paper](https://arxiv.org/pdf/2409.06666v1) -- `【2024-09】-【MoWE-Audio】-【A*STAR】-【Type: Model】` - - **MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders** - - **Author(s):** Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw - - [Paper](https://arxiv.org/pdf/2409.06635) +- `【2024-09】-【DeSTA2】-【National Taiwan University, NVIDIA】-【Type: Model】` + - **Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data** + - **Author(s):** Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee + - [![GitHub stars](https://img.shields.io/github/stars/kehanlu/DeSTA2?style=social)](https://github.com/kehanlu/DeSTA2) + - [Paper](https://arxiv.org/pdf/2409.20007) - `【2024-09】-【ASRCompare】-【Tsinghua University, Tencent AI Lab】-【Type: Model】` - **Comparing Discrete and Continuous Space LLMs for Speech Recognition** @@ -203,16 +218,16 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/xuyaoxun/ASRCompare?style=social)](https://github.com/xuyaoxun/ASRCompare) - [Paper](https://arxiv.org/pdf/2409.00800v1) -- `【2024-09】-【AudioBERT】-【POSTECH, Inha University】-【Type: Model】` - - **AudioBERT: Audio Knowledge Augmented Language Model** - - **Author(s):** Hyunjong Ok, Suho Yoo, Jaeho Lee - - [![GitHub stars](https://img.shields.io/github/stars/HJ-Ok/AudioBERT?style=social)](https://github.com/HJ-Ok/AudioBERT) - - [Paper](https://arxiv.org/pdf/2409.08199) +- `【2024-09】-【MoWE-Audio】-【A*STAR】-【Type: Model】` + - **MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders** + - **Author(s):** Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw + - [Paper](https://arxiv.org/pdf/2409.06635) -- `【2024-08】-【Typhoon-Audio】-【SCB 10X】-【Type: Multimodal Language Model】` - - **Typhoon-Audio: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models** - - **Author(s):** Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul - - [Paper](https://arxiv.org/abs/2409.10999) / [Hugging Face Model](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-audio-preview) +- `【2024-09】-【Moshi】-【Kyutai】-【Type: Model】` + - **Moshi: a speech-text foundation model for real-time dialogue** + - **Author(s):** Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour + - [![GitHub stars](https://img.shields.io/github/stars/kyutai-labs/moshi?style=social)](https://github.com/kyutai-labs/moshi) + - [Paper](https://arxiv.org/pdf/2410.00037) - `【2024-08】-【Mini-Omni】-【Tsinghua University】-【Type: Model】` - **Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming** @@ -226,11 +241,16 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/MooreThreads/MooER?style=social)](https://github.com/MooreThreads/MooER) - [Paper](https://arxiv.org/pdf/2408.05101) -- `【2024-07】-【FunAudioLLM】-【Alibaba】-【Type: Model】` - - **FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs** - - **Author(s):** Authors not specified in the provided information - - [![GitHub stars](https://img.shields.io/github/stars/github.com/FunAudioLLM?style=social)](https://github.com/FunAudioLLM) - - [Paper](https://arxiv.org/pdf/2407.04051v3) / [Demo](https://fun-audio-llm.github.io/) +- `【2024-08】-【Typhoon-Audio】-【SCB 10X】-【Type: Multimodal Language Model】` + - **Typhoon-Audio: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models** + - **Author(s):** Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul + - [Paper](https://arxiv.org/abs/2409.10999) / [Hugging Face Model](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-audio-preview) + +- `【2024-07】-【Qwen2-Audio】-【Alibaba Group】-【Type: Model】` + - **Qwen2-Audio Technical Report** + - **Author(s):** Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou + - [![GitHub stars](https://img.shields.io/github/stars/QwenLM/Qwen2-Audio?style=social)](https://github.com/QwenLM/Qwen2-Audio) + - [Paper](https://arxiv.org/pdf/2407.10759) - `【2024-07】-【LLaST】-【The Chinese University of Hong Kong, Shenzhen; Shanghai AI Laboratory; Nara Institute of Science and Technology, Japan】-【Type: Model】` - **LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models** @@ -238,28 +258,33 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/openaudiolab/LLaST?style=social)](https://github.com/openaudiolab/LLaST) - [Paper](https://arxiv.org/pdf/2407.15415) +- `【2024-07】-【Decoder-only LLMs for STT】-【NTU-Taiwan, Meta】-【Type: Research】` + - **Investigating Decoder-only Large Language Models for Speech-to-text Translation** + - **Author(s):** Authors not specified in the provided information + - [Paper](https://arxiv.org/pdf/2407.03169) + - `【2024-07】-【GAMA】-【University of Maryland, College Park】-【Type: Model】` - **GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities** - **Author(s):** Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha - [![GitHub stars](https://img.shields.io/github/stars/Sreyan88/GAMA?style=social)](https://github.com/Sreyan88/GAMA) - [Paper](https://arxiv.org/abs/2406.11768) / [Demo](https://sreyan88.github.io/gamaaudio/) +- `【2024-07】-【FunAudioLLM】-【Alibaba】-【Type: Model】` + - **FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs** + - **Author(s):** Authors not specified in the provided information + - [![GitHub stars](https://img.shields.io/github/stars/github.com/FunAudioLLM?style=social)](https://github.com/FunAudioLLM) + - [Paper](https://arxiv.org/pdf/2407.04051v3) / [Demo](https://fun-audio-llm.github.io/) + - `【2024-07】-【CompA】-【University of Maryland, College Park; Adobe, USA; NVIDIA, Bangalore, India】-【Type: Model】` - **CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models** - **Author(s):** Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha - [![GitHub stars](https://img.shields.io/github/stars/Sreyan88/CompA?style=social)](https://github.com/Sreyan88/CompA) - [Paper](https://arxiv.org/abs/2310.08753) / [Demo](https://sreyan88.github.io/compa_iclr/) -- `【2024-07】-【Qwen2-Audio】-【Alibaba Group】-【Type: Model】` - - **Qwen2-Audio Technical Report** - - **Author(s):** Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou - - [![GitHub stars](https://img.shields.io/github/stars/QwenLM/Qwen2-Audio?style=social)](https://github.com/QwenLM/Qwen2-Audio) - - [Paper](https://arxiv.org/pdf/2407.10759) - -- `【2024-07】-【Decoder-only LLMs for STT】-【NTU-Taiwan, Meta】-【Type: Research】` - - **Investigating Decoder-only Large Language Models for Speech-to-text Translation** +- `【2024-06】-【Speech ReaLLM】-【Meta】-【Type: Model】` + - **Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time** - **Author(s):** Authors not specified in the provided information - - [Paper](https://arxiv.org/pdf/2407.03169) + - [Paper](https://arxiv.org/pdf/2406.09569) - `【2024-06】-【DeSTA】-【NTU-Taiwan, Nvidia】-【Type: Model】` - **DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment** @@ -267,11 +292,6 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/multimodal/DeSTA?style=social)](https://github.com/kehanlu/Nemo/tree/desta/examples/multimodal/DeSTA) - [Paper](https://arxiv.org/abs/2406.18871) -- `【2024-06】-【Speech ReaLLM】-【Meta】-【Type: Model】` - - **Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time** - - **Author(s):** Authors not specified in the provided information - - [Paper](https://arxiv.org/pdf/2406.09569) - - `【2024-06】-【MusiLingo】-【University of Pennsylvania】-【Type: Model】` - **MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response** - **Author(s):** Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos @@ -320,17 +340,17 @@ and You! - [![GitHub stars](https://img.shields.io/github/stars/yangdongchao/UniAudio?style=social)](https://github.com/yangdongchao/UniAudio) - [Paper](https://arxiv.org/abs/2310.00704) / [Demo](https://dongchaoyang.top/UniAudio_demo/) -- `【2023-09】-【Segment-level Q-Former】-【Tsinghua University, ByteDance】-【Type: Model】` - - **Connecting Speech Encoder and Large Language Model for ASR** - - **Author(s):** Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang - - [Paper](https://arxiv.org/pdf/2309.13963) - - `【2023-09】-【LLaSM】-【LinkSoul.AI】-【Type: Model】` - **LLaSM: Large Language and Speech Model** - **Author(s):** Authors not specified in the provided information - [![GitHub stars](https://img.shields.io/github/stars/LinkSoul-AI/LLaSM?style=social)](https://github.com/LinkSoul-AI/LLaSM) - [Paper](https://arxiv.org/pdf/2308.15930.pdf) +- `【2023-09】-【Segment-level Q-Former】-【Tsinghua University, ByteDance】-【Type: Model】` + - **Connecting Speech Encoder and Large Language Model for ASR** + - **Author(s):** Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang + - [Paper](https://arxiv.org/pdf/2309.13963) + - `【2023-07】-【Prompting LLMs with Speech Recognition】-【Meta】-【Type: Model】` - **Prompting Large Language Models with Speech Recognition Abilities** - **Author(s):** Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer @@ -448,33 +468,27 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/dynamic-superb/dynamic-superb?style=social)](https://github.com/dynamic-superb/dynamic-superb) - [Paper](https://arxiv.org/abs/2309.09510) -## Dataset Resource - -- `【2025-02】-【Audio-FLAN】-【The Hong Kong University of Science and Technology】-【Type: Dataset Resource】` - - **Audio-FLAN: A Preliminary Release** - - **Author(s):** Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue - - [![GitHub stars](https://img.shields.io/github/stars/lmxue/Audio-FLAN?style=social)](https://github.com/lmxue/Audio-FLAN) - - [Paper](https://arxiv.org/abs/2502.16584) / [Hugging Face Model](https://huggingface.co/datasets/HKUSTAudio/Audio-FLAN-Dataset) +## Survey -- `【2024-04】-【LibriSQA】-【Shanghai Jiao Tong University】-【Type: Dataset Resource】` - - **LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models** - - **Author(s):** Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang - - [![GitHub stars](https://img.shields.io/github/stars/ZihanZhaoSJTU/LibriSQA?style=social)](https://github.com/ZihanZhaoSJTU/LibriSQA) - - [Paper](https://arxiv.org/abs/2308.10390) +- `【2024-11】-【WavChat-Survey】-【Zhejiang University】-【Type: Survey】` + - **WavChat: A Survey of Spoken Dialogue Models** + - **Author(s):** Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao + - [Paper](https://arxiv.org/abs/2411.13577) -## Safety +- `【2024-10】-【SpeechLLM-Survey】-【SJTU, AISpeech】-【Type: Survey】` + - **A Survey on Speech Large Language Models** + - **Author(s):** Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu + - [Paper](https://arxiv.org/pdf/2410.18908v2) -- `【2024-06】-【CodecFake】-【National Taiwan University】-【Type: Safety】` - - **CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems** - - **Author(s):** Haibin Wu, Yuan Tseng, Hung-yi Lee - - [![GitHub stars](https://img.shields.io/github/stars/roger-tseng/CodecFake?style=social)](https://github.com/roger-tseng/CodecFake) - - [Paper](https://arxiv.org/abs/2406.07237) / [Other Link](https://codecfake.github.io/) +- `【2024-10】-【SpeechLM-Survey】-【CUHK, Tencent】-【Type: Survey】` + - **Recent Advances in Speech Language Models: A Survey** + - **Author(s):** Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King + - [Paper](https://arxiv.org/pdf/2410.03751) -- `【2024-05】-【VoiceJailbreak】-【CISPA】-【Type: Method】` - - **Voice Jailbreak Attacks Against GPT-4o** - - **Author(s):** Xinyue Shen, Yixin Wu, Michael Backes, Yang Zhang - - [![GitHub stars](https://img.shields.io/github/stars/TrustAIRLab/VoiceJailbreakAttack?style=social)](https://github.com/TrustAIRLab/VoiceJailbreakAttack) - - [Paper](https://arxiv.org/pdf/2405.19103) +- `【2024-02】-【AudioLM-Survey】-【National Taiwan University, MIT】-【Type: Survey】` + - **Towards audio language modeling -- an overview** + - **Author(s):** Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-yi Lee + - [Paper](https://arxiv.org/abs/2402.13236) ## Multimodal @@ -495,28 +509,6 @@ and You! 
- [![GitHub stars](https://img.shields.io/github/stars/lyuchenyang/Macaw-LLM?style=social)](https://github.com/lyuchenyang/Macaw-LLM) - [Paper](https://arxiv.org/pdf/2306.09093) -## Survey - -- `【2024-11】-【WavChat-Survey】-【Zhejiang University】-【Type: Survey】` - - **WavChat: A Survey of Spoken Dialogue Models** - - **Author(s):** Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao - - [Paper](https://arxiv.org/abs/2411.13577) - -- `【2024-10】-【SpeechLM-Survey】-【CUHK, Tencent】-【Type: Survey】` - - **Recent Advances in Speech Language Models: A Survey** - - **Author(s):** Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King - - [Paper](https://arxiv.org/pdf/2410.03751) - -- `【2024-10】-【SpeechLLM-Survey】-【SJTU, AISpeech】-【Type: Survey】` - - **A Survey on Speech Large Language Models** - - **Author(s):** Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu - - [Paper](https://arxiv.org/pdf/2410.18908v2) - -- `【2024-02】-【AudioLM-Survey】-【National Taiwan University, MIT】-【Type: Survey】` - - **Towards audio language modeling -- an overview** - - **Author(s):** Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-yi Lee - - [Paper](https://arxiv.org/abs/2402.13236) - ## Study - `【2024-06】-【Audio Hallucination】-【NTU-Taiwan】-【Type: Research】` @@ -525,6 +517,20 @@ and You! - [![GitHub stars](https://img.shields.io/github/stars/kuan2jiu99/audio-hallucination?style=social)](https://github.com/kuan2jiu99/audio-hallucination) - [Paper](https://arxiv.org/pdf/2406.08402) +## Safety + +- `【2024-06】-【CodecFake】-【National Taiwan University】-【Type: Safety】` + - **CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems** + - **Author(s):** Haibin Wu, Yuan Tseng, Hung-yi Lee + - [![GitHub stars](https://img.shields.io/github/stars/roger-tseng/CodecFake?style=social)](https://github.com/roger-tseng/CodecFake) + - [Paper](https://arxiv.org/abs/2406.07237) / [Other Link](https://codecfake.github.io/) + +- `【2024-05】-【VoiceJailbreak】-【CISPA】-【Type: Method】` + - **Voice Jailbreak Attacks Against GPT-4o** + - **Author(s):** Xinyue Shen, Yixin Wu, Michael Backes, Yang Zhang + - [![GitHub stars](https://img.shields.io/github/stars/TrustAIRLab/VoiceJailbreakAttack?style=social)](https://github.com/TrustAIRLab/VoiceJailbreakAttack) + - [Paper](https://arxiv.org/pdf/2405.19103) + ## Chatbot - `【2025-01】-【MinMo】-【FunAudioLLM Team, Tongyi Lab, Alibaba Group】-【Type: Multimodal Large Language Model】` diff --git a/items/ACORN.json b/items/ACORN.json new file mode 100644 index 0000000..a2eb1e6 --- /dev/null +++ b/items/ACORN.json @@ -0,0 +1,19 @@ +{ + "Category": "Model and Methods", + "Type": "Model", + "Abbreviation": "ACORN", + "Title": "Teaching Physical Awareness to LLMs through Sounds", + "Time": "2025-07", + "Affiliation": "NIO", + "Author": "Weiguo Wang, Andy Nie, Wenrui Zhou, Yi Kai, Chengchen Hu", + "GitHub_Link": "", + "Paper_Link": "https://arxiv.org/abs/2506.08524", + "HF_Link": "", + "Demo_Link": "", + "Other_Link": "https://icml.cc/virtual/2025/poster/46139", + "Audio_Input": "Yes", + "Audio_Output": "No", + "Language": "", + "Description": "ACORN explores and validates the feasibility of teaching LLMs to understand the physical world through sounds." 
+}
+
diff --git a/model_release_timeline_vertical_listed.png b/model_release_timeline_vertical_listed.png
index 627b0f5..691bdfe 100644
Binary files a/model_release_timeline_vertical_listed.png and b/model_release_timeline_vertical_listed.png differ
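The new `items/ACORN.json` record carries the same fields that the rendered ACORN bullet displays (`Time`, `Abbreviation`, `Affiliation`, `Type`, `Title`, `Author`, plus the link fields). As a review aid, below is a minimal sketch of how such a record could be turned back into the list's `【...】` bullet format. It assumes the README entries are generated from `items/*.json`; the script name, the `render` helper, and the required-key set are illustrative assumptions, not tooling that ships with this repository.

```python
# render_item.py -- illustrative sketch only (assumed, not part of this PR).
# Assumes README bullets are generated from items/*.json; the repository's
# real tooling, if any, may differ.
import json
from pathlib import Path

# Assumed minimal schema, inferred from items/ACORN.json above.
REQUIRED_KEYS = {"Category", "Type", "Abbreviation", "Title", "Time",
                 "Affiliation", "Author", "Paper_Link"}


def render(item: dict) -> str:
    """Render one item record as a README bullet in the list's format."""
    missing = REQUIRED_KEYS - item.keys()
    if missing:
        raise ValueError(f"item is missing required keys: {sorted(missing)}")

    lines = [
        f"- `【{item['Time']}】-【{item['Abbreviation']}】"
        f"-【{item['Affiliation']}】-【Type: {item['Type']}】`",
        f"  - **{item['Title']}**",
        f"  - **Author(s):** {item['Author']}",
    ]

    # Optional GitHub stars badge, matching the shields.io pattern used in
    # the README. removeprefix() requires Python 3.9+.
    if item.get("GitHub_Link"):
        repo = item["GitHub_Link"].removeprefix("https://github.com/")
        lines.append(
            f"  - [![GitHub stars](https://img.shields.io/github/stars/"
            f"{repo}?style=social)]({item['GitHub_Link']})"
        )

    # Paper first, then any extra links, joined with " / " as in the README.
    links = [f"[Paper]({item['Paper_Link']})"]
    for label, key in [("Hugging Face Model", "HF_Link"),
                       ("Demo", "Demo_Link"),
                       ("Other Link", "Other_Link")]:
        if item.get(key):
            links.append(f"[{label}]({item[key]})")
    lines.append("  - " + " / ".join(links))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(json.loads(Path("items/ACORN.json").read_text())))
```

Run against `items/ACORN.json`, the sketch reproduces the ACORN bullet added under `## Model and Methods`, which makes it easy to check that a record and its README entry stay in sync during review.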