Initial commit: Add task audio-text-to-text #1212
Closed
Changes from all commits (13 commits):
- `0d56ce2` Initial commit: Add task audio-text-to-text
- `65ffff2` Merge branch 'main' into audio-text-to-text (Vaibhavs10)
- `717c0d0` Update packages/tasks/src/tasks/audio-text-to-text/about.md (Deep-unlearning)
- `365870d` Merge branch 'main' into audio-text-to-text (Deep-unlearning)
- `8d6bd6f` added useful resources
- `88b96a9` Merge branch 'main' into audio-text-to-text (Vaibhavs10)
- `77a7cfc` add resources for ichigo (Deep-unlearning)
- `ca03aef` more resources (Deep-unlearning)
- `1197c90` add examples use cases (Deep-unlearning)
- `ba73b86` add demo example (Deep-unlearning)
- `1118efe` nit (Deep-unlearning)
- `54ea88a` added more datasets (Deep-unlearning)
- `a946894` added audio flamingo model (Deep-unlearning)

File: `packages/tasks/src/tasks/audio-text-to-text/about.md` (new file, 110 lines):
## Different Types of Audio-Text-to-Text Models

Audio-text-to-text models can be categorized into two main types:

- **Base:**
  Pre-trained models that extract rich audio features using techniques such as Wav2Vec, HuBERT, or Whisper. These models serve as the backbone for various downstream tasks. An example is [Qwen2-Audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B), which can be further fine-tuned.

- **Instruction:**
  Base models fine-tuned on specialized audio instruction datasets to better handle task-specific queries and conversations. For instance, [Ichigo-llama3.1-s-instruct-v0.4](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.4) has been optimized to follow detailed audio-related commands.

A minimal loading sketch for both flavors follows.
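It uses the Qwen2-Audio checkpoints mentioned above as stand-ins; the base and instruction-tuned variants load the same way, and only the checkpoint you pick and how you prompt it differ. This is a sketch, not an official recipe; in practice you would load only one of the two.

```python
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Base checkpoint: a general audio-language backbone, typically the starting
# point for further fine-tuning.
base_model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B", device_map="auto"
)

# Instruction-tuned checkpoint: the same architecture, fine-tuned to follow
# chat-style prompts such as transcription requests or audio questions.
instruct_model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

# The processor handles both audio feature extraction and text tokenization.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
```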
### Use Cases

- **Multimodal Audio Dialogue:**
  These models can engage in real-time, multi-turn conversations by processing audio inputs and generating text responses. They are the backbone of advanced voice assistants and interactive dialogue systems.
  You can try this with the [Audio Flamingo](https://huggingface.co/spaces/nvidia/audio-flamingo-3) demo, a large audio-language model that unifies speech, sound, and music understanding with long-context reasoning, multi-turn dialogue, and voice-to-voice interaction.

- **Speech Transcription and Analysis:**
  Beyond converting spoken words to text, these models capture prosody, emotion, and speaker characteristics. This enriched transcription can be used for applications such as sentiment analysis and speaker profiling.

You can transcribe audio with [Voxtral Mini](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) using this code snippet:

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor and the model in bfloat16 on the GPU
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# Build a transcription request for an English audio file hosted on the Hub
inputs = processor.apply_transcription_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

# Generate the transcription and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```

- **Audio Question Answering:**
  By directly processing audio inputs, the models can answer questions about the content of an audio clip, whether it's a podcast excerpt or a recorded conversation.

You can try this in the [Qwen2-Audio-Instruct-Demo](https://huggingface.co/spaces/Qwen/Qwen2-Audio-Instruct-Demo) Space, or run [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) locally with this code snippet:

```python
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

# Multi-turn conversation mixing audio and text turns
conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Download and resample every audio turn in the conversation
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()),
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

# Generate a response and strip the prompt tokens before decoding
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
```
### Useful Resources

Here are some useful resources:

- [Audio Flamingo, a large audio-language model that unifies speech, sound, and music understanding with long-context reasoning, multi-turn dialogue, and voice-to-voice interaction.](https://huggingface.co/nvidia/audio-flamingo-3)
- [Ultravox, a fast multimodal large language model designed for real-time voice interactions.](https://github.com/fixie-ai/ultravox)
- [Ichigo, an audio-text-to-text model for audio-related tasks.](https://github.com/menloresearch/ichigo)
- [Qwen2-Audio, an open-source large-scale audio-language model by Alibaba Cloud supporting voice chat and audio analysis in multiple languages.](https://github.com/QwenLM/Qwen2-Audio)
- [WhisperSpeech, a compact, open-source speech tokenizer enhancing multilingual performance with minimal impact on English capabilities.](https://github.com/janhq/WhisperSpeech)
- [PhiCookBook, a guide to Microsoft's open-source Phi models, offering capable and cost-effective small language models.](https://github.com/microsoft/PhiCookBook)
- [FastRTC, which turns any Python function into a real-time audio and video stream over WebRTC or WebSockets.](https://huggingface.co/fastrtc)

A rough FastRTC usage sketch follows.
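It is based on FastRTC's echo quickstart; treat the exact API (`Stream`, `ReplyOnPause`, `stream.ui.launch()`) as an assumption to verify against the FastRTC documentation, and note that a real assistant would replace the echo handler with a call to one of the models above.

```python
import numpy as np
from fastrtc import Stream, ReplyOnPause


def echo(audio: tuple[int, np.ndarray]):
    # `audio` is a (sample_rate, samples) tuple captured from the user's mic.
    # A real audio-text-to-text assistant would run a model here and stream
    # back a reply; this sketch simply echoes the input audio.
    yield audio


# Fire the handler when the speaker pauses, then serve the stream over WebRTC
# with the ready-made Gradio UI.
stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()
```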

File: task data definition for the `audio-text-to-text` task (TypeScript, new file, 72 lines):
```ts
import type { TaskDataCustom } from "../index.js";

const taskData: TaskDataCustom = {
	datasets: [
		{
			description: "Instructions composed of audio and text.",
			id: "homebrewltd/instruction-speech-encodec-v1.5",
		},
		{
			description: "A large-scale long audio question-answering (AQA) dataset.",
			id: "nvidia/LongAudio",
		},
		{
			description: "An audio-text dataset for chain-of-thought (CoT) reasoning.",
			id: "nvidia/AF-Think",
		},
	],
	demo: {
		inputs: [
			{
				filename: "sample1.flac",
				type: "audio",
			},
			{
				label: "Text Prompt",
				content: "Transcribe this audio.",
				type: "text",
			},
		],
		outputs: [
			{
				label: "Answer",
				content: "Going along slushy country roads and speaking to damp audiences in...",
				type: "text",
			},
		],
	},
	metrics: [],
	models: [
		{
			description: "A large audio-language model that unifies speech, sound, and music understanding with long-context reasoning, multi-turn dialogue, and voice-to-voice interaction.",
			id: "nvidia/audio-flamingo-3",
		},
		{
			description: "Small yet powerful audio language model.",
			id: "fixie-ai/ultravox-v0_5-llama-3_2-1b",
		},
		{
			description: "Audio language model based on Llama 3.1 8B.",
			id: "homebrewltd/Ichigo-llama3.1-s-instruct-v0.4",
		},
		{
			description: "Strong audio language model.",
			id: "Qwen/Qwen2-Audio-7B",
		},
	],
	spaces: [
		{
			description: "Powerful audio-language model assistant.",
			id: "Qwen/Qwen2-Audio-Instruct-Demo",
		},
		{
			description: "Real-time audio-text-to-text model.",
			id: "Steveeeeeeen/talk-to-ultravox-0.5",
		},
	],
	summary:
		"Audio-text-to-text models extend multimodal AI into the speech domain. Much like their visual counterparts, these models are designed to understand and generate text based on audio inputs. Recent research in spoken dialogue systems and Speech Large Language Models (LLMs) highlights how such models are evolving, leveraging both semantic and acoustic representations extracted from speech signals.",
	widgetModels: [],
};

export default taskData;
```
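Once the task definition ships, the `audio-text-to-text` ID used above also serves as a `pipeline_tag` on the Hub. As a small sketch (assuming the tag is already live on the Hub), you could list models registered under it with `huggingface_hub`:

```python
from huggingface_hub import HfApi

api = HfApi()

# List a handful of Hub models tagged with the new task.
for model in api.list_models(pipeline_tag="audio-text-to-text", limit=5):
    print(model.id)
```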