Commit 5c04bbe

.Net: Add support for audio and binary tags to chat prompt parser (#11919)
### Motivation and Context

#### Why is this change required?

This allows template parsers like the YAML parser to embed content types other than just text and images for LLMs that support additional content types, such as PDFs for OpenAI and DOCX files for Claude. Without this capability, functions whose prompts carry attachments would have to build their chat history manually in code.

#### What problem does it solve?

See above.

#### What scenario does it contribute to?

Using additional content types beyond visuals and audio for user messages.

#### Open Issues Addressed

- Fixes #11044

### Description

#### Chat Prompt Parser

To preserve backward compatibility, rather than consolidating binary content types, I chose to add additional content types so that LLM chat service providers can opt in to new content types. This also reduces the chances of breaking existing code. Three new content types are created:

* `PdfContent` for PDF files. Uses the tag `<pdf>`. Allows Base64 data URIs or standard URIs, similar to `ImageContent`.
* `DocContent` for MS Word .doc files. Uses the tag `<doc>`. Allows Base64 data URIs or standard URIs, similar to `ImageContent`.
* `DocxContent` for MS Word .docx files. Uses the tag `<docx>`. Allows Base64 data URIs or standard URIs, similar to `ImageContent`.

(**NOTE**: `DocContent` and `DocxContent` are mainly separate because they have different MIME types and different content formats, though they could easily be consolidated into a single tag, letting the LLM provider distinguish between ".doc" and ".docx" files. Alternately, I could also see the case for dropping ".doc" support and requiring the caller to use only ".docx".)

In addition, the following two content types are now parsed from the XML:

* `AudioContent` - Parses the tag `<audio>` with either Base64 data URIs or standard URIs, similar to `ImageContent`.
* `BinaryContent` - Parses the tag `<file>` with either Base64 data URIs or standard URIs, similar to `ImageContent`.
Here is a sample:

```xml
<message role='user'>
    This part will be discarded upon parsing
    <text>Make sense of this random assortment of stuff.</text>
    <image>https://fake-link-to-image/</image>
    <audio>data:audio/wav;base64,UklGRiQAAABXQVZFZm10IBAAAAABAAEAIlYAAACABAAZGF0YVgAAAAA</audio>
    <pdf>data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMyAwIG9iago8PC9UeXBlL1hSZWYvUGFnZXMgNiAwIFIKL1R5cGUvUGFnZS9NZWRpYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9GMiA8PC9GMyA8PC9GNCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GNSA8PC9GNiA8PC9GNyBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GOCAvPj4KZW5kb2JqCjEwIDAgb2JqCjw8L1R5cGUvUGFnZS9NYWRlYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9GMiA8PC9GMyA8PC9GNCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GNSA8PC9GNiA8PC9GNyBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GOCAvPj4KZW5kb2JqCjEwIDAgb2JqCjw8L1R5cGUvUGFnZS9NYWRlYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9G</pdf>
    <pdf>https://fake-link-to-pdf/</pdf>
    <doc>data:application/msword;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</doc>
    <doc>https://fake-link-to-doc/</doc>
    <docx>data:application/vnd.openxmlformats-officedocument.wordprocessingml.document;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</docx>
    <docx>https://fake-link-to-docx/</docx>
    <file>data:application/octet-stream;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</file>
    <file>https://fake-link-to-binary/</file>
    This part will also be discarded upon parsing
</message>
```

#### Amazon Bedrock

Modified the `Converse` API request generator to handle the subset of binary content supported by Amazon Bedrock (PDF, DOC, DOCX, and Image), as documented [here](https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/BedrockRuntime/TContentBlock.html).

#### OpenAI

Modified the client to handle PDF content, audio content, and file references when generating a request to an OpenAI (or OpenAI compatible) client.

### Contribution Checklist

- [X] The code builds clean without any errors or warnings
- [X] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [X] All unit tests pass, and I have added new tests where possible
- [X] I didn't break anyone 😄

---------

Co-authored-by: Roger Barreto <[email protected]>
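As a rough, hypothetical illustration (not part of the commit itself), the sample user message above corresponds to a chat history along the lines of the following, built by hand with the existing Semantic Kernel content types; the data URIs are truncated placeholders:

```csharp
// Minimal sketch (illustrative only): the chat history a message like the sample
// above would roughly parse into, constructed manually with Semantic Kernel's
// existing content types. The data URIs are truncated placeholders.
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

var history = new ChatHistory();

var items = new ChatMessageContentItemCollection
{
    new TextContent("Make sense of this random assortment of stuff."),
    new ImageContent(new Uri("https://fake-link-to-image/")),                      // <image>
    new AudioContent { DataUri = "data:audio/wav;base64,UklGRiQAAABXQVZF..." },    // <audio>
    new BinaryContent { DataUri = "data:application/pdf;base64,JVBERi0xLjQK..." }, // <file> / binary attachment
};

history.AddUserMessage(items);
```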
1 parent da1ad23 commit 5c04bbe

File tree

5 files changed, +345 −27 lines changed
Lines changed: 86 additions & 0 deletions
```csharp
// Copyright (c) Microsoft. All rights reserved.

using Microsoft.SemanticKernel;
using Resources;

namespace PromptTemplates;

/// <summary>
/// This example demonstrates how to use ChatPrompt XML format with Audio content types.
/// The new ChatPrompt parser supports &lt;audio&gt; tags for various audio formats like WAV, MP3, etc.
/// </summary>
public class OpenAI_ChatPromptWithAudio(ITestOutputHelper output) : BaseTest(output)
{
    /// <summary>
    /// Demonstrates using audio content in ChatPrompt XML format with data URI.
    /// </summary>
    [Fact]
    public async Task ChatPromptWithAudioContentDataUri()
    {
        // Load an audio file and convert to base64 data URI
        var audioBytes = await EmbeddedResource.ReadAllAsync("test_audio.wav");
        var audioBase64 = Convert.ToBase64String(audioBytes.ToArray());
        var dataUri = $"data:audio/wav;base64,{audioBase64}";

        var chatPrompt = $"""
            <message role="system">You are a helpful assistant that can analyze audio content.</message>
            <message role="user">
                <text>Please transcribe and analyze this audio file.</text>
                <audio>{dataUri}</audio>
            </message>
            """;

        var kernel = Kernel.CreateBuilder()
            .AddOpenAIChatCompletion(
                modelId: "gpt-4o-audio-preview", // Use audio-capable model
                apiKey: TestConfiguration.OpenAI.ApiKey)
            .Build();

        var chatFunction = kernel.CreateFunctionFromPrompt(chatPrompt);
        var result = await kernel.InvokeAsync(chatFunction);

        Console.WriteLine("=== ChatPrompt with Audio Content (Data URI) ===");
        Console.WriteLine("Prompt:");
        Console.WriteLine(chatPrompt);
        Console.WriteLine("\nResult:");
        Console.WriteLine(result);
    }

    /// <summary>
    /// Demonstrates a conversation flow using ChatPrompt with audio content across multiple messages.
    /// </summary>
    [Fact]
    public async Task ChatPromptConversationWithAudioContent()
    {
        var audioBytes = await EmbeddedResource.ReadAllAsync("test_audio.wav");
        var audioBase64 = Convert.ToBase64String(audioBytes.ToArray());
        var audioDataUri = $"data:audio/wav;base64,{audioBase64}";

        var chatPrompt = $"""
            <message role="system">You are a helpful assistant that specializes in audio analysis and transcription.</message>
            <message role="user">
                <text>I have an audio recording that I need help with. Can you analyze it?</text>
                <audio>{audioDataUri}</audio>
            </message>
            <message role="assistant">I can help you analyze this audio recording. Let me transcribe and examine its content for you. What specific information are you looking for from this audio?</message>
            <message role="user">
                <text>Can you provide a full transcription and also identify any background sounds or audio quality issues?</text>
            </message>
            """;

        var kernel = Kernel.CreateBuilder()
            .AddOpenAIChatCompletion(
                modelId: "gpt-4o-audio-preview", // Use audio-capable model
                apiKey: TestConfiguration.OpenAI.ApiKey)
            .Build();

        var chatFunction = kernel.CreateFunctionFromPrompt(chatPrompt);
        var result = await kernel.InvokeAsync(chatFunction);

        Console.WriteLine("=== ChatPrompt Conversation with Audio Content ===");
        Console.WriteLine("Prompt (showing conversation flow):");
        Console.WriteLine(chatPrompt[..Math.Min(800, chatPrompt.Length)] + "...");
        Console.WriteLine("\nResult:");
        Console.WriteLine(result);
    }
}
```
Lines changed: 86 additions & 0 deletions
```csharp
// Copyright (c) Microsoft. All rights reserved.

using Microsoft.SemanticKernel;
using Resources;

namespace PromptTemplates;

/// <summary>
/// This example demonstrates how to use ChatPrompt XML format with Binary content types.
/// The new ChatPrompt parser supports &lt;binary&gt; tags for various document formats like PDF, Word, CSV, etc.
/// </summary>
public class ChatPromptWithBinary(ITestOutputHelper output) : BaseTest(output)
{
    /// <summary>
    /// Demonstrates using binary content (PDF file) in ChatPrompt XML format with data URI.
    /// </summary>
    [Fact]
    public async Task ChatPromptWithBinaryContentDataUri()
    {
        // Load a PDF file and convert to base64 data URI
        var fileBytes = await EmbeddedResource.ReadAllAsync("employees.pdf");
        var fileBase64 = Convert.ToBase64String(fileBytes.ToArray());
        var dataUri = $"data:application/pdf;base64,{fileBase64}";

        var chatPrompt = $"""
            <message role="system">You are a helpful assistant that can analyze documents.</message>
            <message role="user">
                <text>Please analyze this PDF document and provide a summary of its contents.</text>
                <binary>{dataUri}</binary>
            </message>
            """;

        var kernel = Kernel.CreateBuilder()
            .AddOpenAIChatCompletion(
                modelId: TestConfiguration.OpenAI.ChatModelId,
                apiKey: TestConfiguration.OpenAI.ApiKey)
            .Build();

        var chatFunction = kernel.CreateFunctionFromPrompt(chatPrompt);
        var result = await kernel.InvokeAsync(chatFunction);

        Console.WriteLine("=== ChatPrompt with Binary Content (Data URI) ===");
        Console.WriteLine("Prompt:");
        Console.WriteLine(chatPrompt);
        Console.WriteLine("\nResult:");
        Console.WriteLine(result);
    }

    /// <summary>
    /// Demonstrates a conversation flow using ChatPrompt with binary content across multiple messages.
    /// </summary>
    [Fact]
    public async Task ChatPromptConversationWithBinaryContent()
    {
        var pdfBytes = await EmbeddedResource.ReadAllAsync("employees.pdf");
        var pdfBase64 = Convert.ToBase64String(pdfBytes.ToArray());
        var pdfDataUri = $"data:application/pdf;base64,{pdfBase64}";

        var chatPrompt = $"""
            <message role="system">You are a helpful assistant that can analyze documents and provide insights.</message>
            <message role="user">
                <text>I have a document that I need help understanding. Can you analyze it?</text>
                <binary>{pdfDataUri}</binary>
            </message>
            <message role="assistant">I can see this is a PDF document about employees. Let me analyze its contents for you. The document appears to contain employee information and organizational data. What specific aspects would you like me to focus on?</message>
            <message role="user">
                <text>Can you extract the key information and create a summary? Also, what format would be best for sharing this information with my team?</text>
            </message>
            """;

        var kernel = Kernel.CreateBuilder()
            .AddOpenAIChatCompletion(
                modelId: TestConfiguration.OpenAI.ChatModelId,
                apiKey: TestConfiguration.OpenAI.ApiKey)
            .Build();

        var chatFunction = kernel.CreateFunctionFromPrompt(chatPrompt);
        var result = await kernel.InvokeAsync(chatFunction);

        Console.WriteLine("=== ChatPrompt Conversation with Binary Content ===");
        Console.WriteLine("Prompt (showing conversation flow):");
        Console.WriteLine(chatPrompt[..Math.Min(800, chatPrompt.Length)] + "...");
        Console.WriteLine("\nResult:");
        Console.WriteLine(result);
    }
}
```

dotnet/samples/Concepts/README.md

Lines changed: 5 additions & 3 deletions
```diff
@@ -192,16 +192,18 @@ dotnet test -l "console;verbosity=detailed" --filter "FullyQualifiedName=ChatCom
 ### PromptTemplates - Using [`Templates`](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/src/SemanticKernel.Abstractions/PromptTemplate/IPromptTemplate.cs) with parametrization for `Prompt` rendering

 - [ChatCompletionPrompts](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/ChatCompletionPrompts.cs)
+- [ChatLoopWithPrompt](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/ChatLoopWithPrompt.cs)
+- [ChatPromptWithAudio](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/ChatPromptWithAudio.cs)
+- [ChatPromptWithBinary](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/ChatPromptWithBinary.cs)
 - [ChatWithPrompts](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/ChatWithPrompts.cs)
 - [HandlebarsPrompts](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/HandlebarsPrompts.cs)
+- [HandlebarsVisionPrompts](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/HandlebarsVisionPrompts.cs)
 - [LiquidPrompts](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/LiquidPrompts.cs)
 - [MultiplePromptTemplates](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/MultiplePromptTemplates.cs)
 - [PromptFunctionsWithChatGPT](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/PromptFunctionsWithChatGPT.cs)
-- [TemplateLanguage](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/TemplateLanguage.cs)
 - [PromptyFunction](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/PromptyFunction.cs)
-- [HandlebarsVisionPrompts](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/HandlebarsVisionPrompts.cs)
 - [SafeChatPrompts](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/SafeChatPrompts.cs)
-- [ChatLoopWithPrompt](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/ChatLoopWithPrompt.cs)
+- [TemplateLanguage](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/Concepts/PromptTemplates/TemplateLanguage.cs)

 ### RAG - Retrieval-Augmented Generation

```
dotnet/src/SemanticKernel.Abstractions/AI/ChatCompletion/ChatPromptParser.cs

Lines changed: 42 additions & 23 deletions
```diff
@@ -12,11 +12,6 @@ namespace Microsoft.SemanticKernel.ChatCompletion;
 /// </summary>
 internal static class ChatPromptParser
 {
-    private const string MessageTagName = "message";
-    private const string RoleAttributeName = "role";
-    private const string ImageTagName = "image";
-    private const string TextTagName = "text";
-
     /// <summary>
     /// Parses a prompt for an XML representation of a <see cref="ChatHistory"/>.
     /// </summary>
@@ -73,20 +68,10 @@ private static ChatMessageContent ParseChatNode(PromptNode node)
         ChatMessageContentItemCollection items = [];
         foreach (var childNode in node.ChildNodes.Where(childNode => childNode.Content is not null))
         {
-            if (childNode.TagName.Equals(ImageTagName, StringComparison.OrdinalIgnoreCase))
-            {
-                if (childNode.Content!.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
-                {
-                    items.Add(new ImageContent(childNode.Content));
-                }
-                else
-                {
-                    items.Add(new ImageContent(new Uri(childNode.Content!)));
-                }
-            }
-            else if (childNode.TagName.Equals(TextTagName, StringComparison.OrdinalIgnoreCase))
+            if (s_contentFactoryMapping.TryGetValue(childNode.TagName.ToUpperInvariant(), out var createBinaryContent))
             {
-                items.Add(new TextContent(childNode.Content));
+                childNode.Attributes.TryGetValue("mimetype", out var mimeType);
+                items.Add(createBinaryContent(childNode.Content!, mimeType));
             }
         }

@@ -103,21 +88,55 @@ private static ChatMessageContent ParseChatNode(PromptNode node)
             : new ChatMessageContent(authorRole, node.Content);
     }

+    /// <summary>
+    /// Creates a new instance of <typeparamref name="T"/> from a data URI.
+    /// </summary>
+    /// <typeparam name="T">Type of <see cref="BinaryContent"/> to create.</typeparam>
+    /// <param name="content">Base64 encoded content or URI.</param>
+    /// <param name="mimeType">Optional MIME type of the content.</param>
+    /// <returns>A new instance of <typeparamref name="T"/> with <paramref name="content"/></returns>
+    private static T CreateBinaryContent<T>(string content, string? mimeType) where T : BinaryContent, new()
+    {
+        return (content.StartsWith("data:", StringComparison.OrdinalIgnoreCase)) ? new T { DataUri = content } : new T { Uri = new Uri(content), MimeType = mimeType };
+    }
+
+    /// <summary>
+    /// Factory for creating a <see cref="KernelContent"/> instance based on the tag name.
+    /// </summary>
+    private static readonly Dictionary<string, Func<string, string?, KernelContent>> s_contentFactoryMapping = new()
+    {
+        { TextTagName, (content, _) => new TextContent(content) },
+        { ImageTagName, CreateBinaryContent<ImageContent> },
+        { AudioTagName, CreateBinaryContent<AudioContent> },
+        { BinaryTagName, CreateBinaryContent<BinaryContent> }
+    };
+
     /// <summary>
     /// Checks if <see cref="PromptNode"/> is valid chat message.
     /// </summary>
     /// <param name="node">Instance of <see cref="PromptNode"/>.</param>
     /// <remarks>
-    /// A valid chat message is a node with the following structure:<br/>
-    /// TagName = "message"<br/>
-    /// Attributes = { "role" : "..." }<br/>
-    /// optional one or more child nodes <image>...</image><br/>
-    /// optional one or more child nodes <text>...</text>
+    /// A valid chat message is a node with the following structure:
+    /// <list type="bullet">
+    /// <item><description>TagName = "message"</description></item>
+    /// <item><description>Attributes = { "role" : "..." }</description></item>
+    /// <item><description>optional one or more child nodes &lt;image&gt;...&lt;/image&gt;</description></item>
+    /// <item><description>optional one or more child nodes &lt;text&gt;...&lt;/text&gt;</description></item>
+    /// <item><description>optional one or more child nodes &lt;audio&gt;...&lt;/audio&gt;</description></item>
+    /// <item><description>optional one or more child nodes &lt;binary&gt;...&lt;/binary&gt;</description></item>
+    /// </list>
     /// </remarks>
     private static bool IsValidChatMessage(PromptNode node)
    {
         return
             node.TagName.Equals(MessageTagName, StringComparison.OrdinalIgnoreCase) &&
             node.Attributes.ContainsKey(RoleAttributeName);
     }
+
+    private const string MessageTagName = "message";
+    private const string RoleAttributeName = "role";
+    private const string ImageTagName = "IMAGE";
+    private const string TextTagName = "TEXT";
+    private const string AudioTagName = "AUDIO";
+    private const string BinaryTagName = "BINARY";
 }
```
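One detail the new samples above do not exercise: when a tag's content is a standard URI rather than a data URI, the parser also reads an optional `mimetype` attribute and assigns it to the created content. Below is a minimal, hypothetical sketch of a prompt using that attribute; the URL is a placeholder, and `kernel` is assumed to be configured as in the samples above.

```csharp
// Illustrative sketch only: attach a document by URL and supply the MIME type
// through the optional "mimetype" attribute, which the parser forwards to the
// resulting BinaryContent when the content is not a data URI.
var chatPrompt = """
    <message role="user">
        <text>Summarize the attached report.</text>
        <binary mimetype="application/pdf">https://example.com/files/report.pdf</binary>
    </message>
    """;

var summarize = kernel.CreateFunctionFromPrompt(chatPrompt);
var result = await kernel.InvokeAsync(summarize);
```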
