
[Feature] DSPy Audio/Video Support Tracking #7847


Open
2 tasks
isaacbmiller opened this issue Feb 24, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@isaacbmiller
Collaborator

isaacbmiller commented Feb 24, 2025

What feature would you like to see?

We have received a number of requests for Audio and Video input support over the last few months (#2037, #7844, etc.)

I implemented DSPy.Image, and I'm looking for someone to help create similar or better implementations for audio and/or video inputs. I would be surprised if good prompting and few-shot support for audio didn't greatly help in some use cases, along with being able to script with audio in the same way that you can with text inputs.

For someone to implement this, there are a few required steps for the implementation I am imagining:

  1. Create a class similar to Image (see adapters/image_utils.py).
  2. Edit chat_adapter and json_adapter to add a try_expand_audio_tags method that searches messages and expands multimodal inputs.
  3. Write tests similar to tests/signatures/test_adapter_image.py to make sure it works with a variety of signature types and input methods.
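Step 1 above might look something like the following minimal sketch, assuming a design that mirrors dspy.Image. The class name, fields, and from_file helper are illustrative assumptions, not existing DSPy API:

```python
import base64

# Hypothetical Audio type mirroring dspy.Image (step 1 above).
# Names and fields are assumptions for illustration only.
class Audio:
    def __init__(self, data: str, format: str = "wav"):
        self.data = data      # base64-encoded audio bytes
        self.format = format  # e.g. "wav" or "mp3"

    @classmethod
    def from_file(cls, path: str) -> "Audio":
        """Read an audio file and base64-encode its bytes."""
        with open(path, "rb") as f:
            data = base64.b64encode(f.read()).decode("utf-8")
        fmt = path.rsplit(".", 1)[-1].lower()
        return cls(data=data, format=fmt)

    def __repr__(self) -> str:
        return f"Audio(format={self.format!r}, data=<{len(self.data)} b64 chars>)"
```

The adapter-side try_expand_audio_tags (step 2) would then serialize instances of this class into the provider's expected content-list entries.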

I don't know enough about the audio input APIs to say what the speed bumps on this implementation will be.

As a first step, I would choose either the OpenAI API or Gemini, get it working for that provider with whatever hacky code is needed, and then expand and abstract from there.

Feel free to @ me on Discord (username: ibmiller) if you need help.

Would you like to contribute?

  • Yes, I'd like to help implement this.
  • No, I just want to request it.

Additional Context

No response

@isaacbmiller isaacbmiller added the enhancement New feature or request label Feb 24, 2025
@pretbc

pretbc commented Feb 25, 2025

Yes, I'd like to help implement this.

@ramisbahi

Yes, I'd like to help implement this: specifically, support for video input. Additionally, I'm looking into benchmarking video understanding performance to evaluate how well DSPy can process video-based inputs.

@glesperance
Contributor

glesperance commented Feb 28, 2025

This PR should work for audio (and more) #7872

After some review: the PR takes us closer to supporting audio, but there's still work to be done.
For reference, here's the LiteLLM doc on using audio models [1].

[1] https://docs.litellm.ai/docs/completion/audio#audio-input-to-a-model
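Based on that doc, the audio-input payload can be sketched as follows. build_audio_message is a hypothetical helper for illustration (not LiteLLM or DSPy API); the model name in the comment is taken from the linked page:

```python
import base64

def build_audio_message(path: str, prompt: str) -> dict:
    """Build an OpenAI-style chat message carrying base64-encoded audio,
    in the shape shown by the LiteLLM audio-input doc linked above."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    fmt = path.rsplit(".", 1)[-1].lower()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "input_audio",
             "input_audio": {"data": encoded, "format": fmt}},
        ],
    }

# The message would then be passed to litellm, e.g.:
# litellm.completion(model="gpt-4o-audio-preview",
#                    messages=[build_audio_message("speech.wav", "Transcribe this.")])
```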

@pretbc

pretbc commented Mar 26, 2025

Hello,

For Azure OpenAI we have to change expand_image_tags in the Image module:

import re
from typing import Any, Dict, List, Union


def expand_image_tags(text: str) -> Union[str, List[Dict[str, Any]]]:
    """Expand image tags in the text. If there are any image tags,
    turn the content string into a content list of texts and audio payloads.

    Args:
        text: The text content that may contain image tags

    Returns:
        Either the original string if no image tags, or a list of content dicts
        with text and input_audio entries
    """
    image_tag_regex = r'"?<DSPY_IMAGE_START>(.*?)<DSPY_IMAGE_END>"?'

    # If no image tags, return original text
    if not re.search(image_tag_regex, text):
        return text

    final_list = []
    remaining_text = text

    while remaining_text:
        match = re.search(image_tag_regex, remaining_text)
        if not match:
            if remaining_text.strip():
                final_list.append({"type": "text", "text": remaining_text.strip()})
            break

        # Get text before the tag
        prefix = remaining_text[:match.start()].strip()
        if prefix:
            final_list.append({"type": "text", "text": prefix})

        # Add the audio payload instead of an image_url entry
        image_url = match.group(1)
        data = image_url.split(",", 1)[1]  # strip the data-URI prefix, keep the base64 data
        final_list.append({
            "type": "input_audio",
            "input_audio": {
                "data": data,
                "format": "wav",
            },
        })

        # Update remaining text
        remaining_text = remaining_text[match.end():].strip()

    return final_list

This is because Azure uses the following request format for audio models:

completion = client.chat.completions.create( 
    model="gpt-4o-mini-audio-preview", 
    modalities=["text", "audio"], 
    audio={"voice": "alloy", "format": "wav"}, 
    messages=[ 
        { 
            "role": "user", 
            "content": [ 
                {  
                    "type": "text", 
                    "text": "Describe in detail the spoken audio input." 
                }, 
                { 
                    "type": "input_audio", 
                    "input_audio": { 
                        "data": encoded_string, 
                        "format": "wav" 
                    } 
                } 
            ] 
        }, 
    ] 
) 

MIME-type-style format strings like 'audio/x-wav' cause errors:

litellm.BadRequestError: AzureException BadRequestError - Invalid value: 'audio/x-wav'. Supported values are: 'wav' and 'mp3'.

I have no idea at which level we should handle this. Best would be to check the provider type and create a sub-pipeline to process the final_list entries.
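One possible shape for that sub-pipeline, sketched with an assumed provider check and a hand-written MIME mapping (not existing DSPy code; the supported Azure values come from the error message above):

```python
# Hypothetical normalization step: map MIME-style format strings to the
# bare values Azure accepts ("wav" and "mp3", per the BadRequestError above).
MIME_TO_AZURE_FORMAT = {
    "audio/wav": "wav",
    "audio/x-wav": "wav",
    "audio/wave": "wav",
    "audio/mpeg": "mp3",
    "audio/mp3": "mp3",
}

def normalize_audio_format(fmt: str, provider: str) -> str:
    """Normalize an audio format string for a given provider.
    Provider names and the mapping table are illustrative assumptions."""
    if provider.startswith("azure"):
        if fmt in MIME_TO_AZURE_FORMAT:
            return MIME_TO_AZURE_FORMAT[fmt]
        if fmt not in ("wav", "mp3"):
            raise ValueError(f"Azure only supports 'wav' and 'mp3', got {fmt!r}")
    return fmt
```

The adapter could call this just before appending the input_audio dict, leaving other providers' format strings untouched.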
