.Net: Add support for audio and binary tags to chat prompt parser (#11919)
### Motivation and Context
#### Why is this change required?
This change allows template parsers like the YAML parser to embed content
types other than just text and images for LLMs that support additional
content types, like PDFs for OpenAI and DOCXs for Claude. Without this
capability, functions with prompts that have attachments would have to
manually build their chat history in code.
#### What problem does it solve?
See above
#### What scenario does it contribute to?
Using additional content types beyond visuals and audio in user
messages
#### Open Issues Addressed
- Fixes #11044
### Description
#### Chat Prompt Parser
To preserve backward compatibility, rather than consolidating binary
content types, I chose to go with adding additional content types so
that LLM chat service providers could opt-in to new content types. It
also reduces the chances of breaking existing code.
3 new content types are created:
* `PdfContent` for PDF files. Uses the tag `<pdf>`. Allows for
Base64 data URIs or standard URIs, similar to `ImageContent`.
* `DocContent` for MS Word .doc files. Uses the tag `<doc>`.
Allows for Base64 data URIs or standard URIs, similar to `ImageContent`.
* `DocxContent` for MS Word .docx files. Uses the tag `<docx>`.
Allows for Base64 data URIs or standard URIs, similar to `ImageContent`.
(**NOTE**: `DocContent` and `DocxContent` are mainly separate because
they have different MIME types and different content formats, though
they could easily be consolidated into a single tag and just let the LLM
provider handle distinguishing between "doc" and "docx" files.
Alternately, I could also see the case for dropping ".doc" support and
requiring the caller to only use ".docx".)
In addition, the following two content types are now parsed from the XML:
* `AudioContent` - Parses the tag `<audio>` with either Base64
data URIs or standard URIs, similar to `ImageContent`.
* `BinaryContent` - Parses the tag `<file>` with either Base64
data URIs or standard URIs, similar to `ImageContent`.
Here is a sample:
```xml
<message role='user'>
This part will be discarded upon parsing
<text>Make sense of this random assortment of stuff.</text>
<image>https://fake-link-to-image/</image>
<audio>data:audio/wav;base64,UklGRiQAAABXQVZFZm10IBAAAAABAAEAIlYAAACABAAZGF0YVgAAAAA</audio>
<pdf>data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMyAwIG9iago8PC9UeXBlL1hSZWYvUGFnZXMgNiAwIFIKL1R5cGUvUGFnZS9NZWRpYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9GMiA8PC9GMyA8PC9GNCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GNSA8PC9GNiA8PC9GNyBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GOCAvPj4KZW5kb2JqCjEwIDAgb2JqCjw8L1R5cGUvUGFnZS9NYWRlYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9GMiA8PC9GMyA8PC9GNCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GNSA8PC9GNiA8PC9GNyBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GOCAvPj4KZW5kb2JqCjEwIDAgb2JqCjw8L1R5cGUvUGFnZS9NYWRlYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9G</pdf>
<pdf>https://fake-link-to-pdf/</pdf>
<doc>data:application/msword;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</doc>
<doc>https://fake-link-to-doc/</doc>
<docx>data:application/vnd.openxmlformats-officedocument.wordprocessingml.document;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</docx>
<docx>https://fake-link-to-docx/</docx>
<file>data:application/octet-stream;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</file>
<file>https://fake-link-to-binary/</file>
This part will also be discarded upon parsing
</message>
```
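To make the sample above more concrete, here is a minimal sketch of how the children of a `<message>` element could be walked and classified as text, a Base64 data URI, or a standard URI. It is not the Semantic Kernel parser: `DescribeParts` is a hypothetical helper written only for illustration, it uses nothing beyond `System.Xml.Linq`, and it assumes that a payload starting with `data:` is a data URI and anything else should parse as an absolute URI.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper (an assumption, not part of Semantic Kernel):
// reports how each child tag of one <message> element would be read.
List<(string Tag, string Kind, string Detail)> DescribeParts(string messageXml)
{
    var parts = new List<(string Tag, string Kind, string Detail)>();
    XElement message = XElement.Parse(messageXml);

    // Elements() skips bare text nodes, which mirrors the sample's
    // "this part will be discarded" behavior when child tags exist.
    foreach (XElement child in message.Elements())
    {
        string tag = child.Name.LocalName;
        string value = child.Value.Trim();

        if (tag == "text")
        {
            parts.Add((tag, "text", value));
        }
        else if (value.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
        {
            // data:<mime>;base64,<payload> -> keep only the MIME type.
            int end = value.IndexOfAny(new[] { ';', ',' });
            string mime = end > 5 ? value.Substring(5, end - 5) : "";
            parts.Add((tag, "data-uri", mime));
        }
        else if (Uri.TryCreate(value, UriKind.Absolute, out var uri))
        {
            parts.Add((tag, "uri", uri.AbsoluteUri));
        }
        else
        {
            parts.Add((tag, "unrecognized", value));
        }
    }

    return parts;
}

// Usage: a shortened version of the sample message.
string xml = @"<message role='user'>
    stray text that is ignored
    <text>Summarize these.</text>
    <audio>data:audio/wav;base64,UklGRg==</audio>
    <file>https://fake-link-to-binary/</file>
</message>";

foreach (var part in DescribeParts(xml))
{
    Console.WriteLine($"{part.Tag}: {part.Kind} -> {part.Detail}");
}
```

A real provider would go on to decode the Base64 payload or fetch the URI. The sketch stops at classification because that is the only behavior the description above states.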
#### Amazon Bedrock
Modified the `Converse` API request generator to handle the subset of
binary content supported by Amazon Bedrock (PDF, DOC, DOCX, and Image),
as documented
[here](https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/BedrockRuntime/TContentBlock.html).
#### OpenAI
Modified the client to handle PDF content, audio content, and file
references when generating a request to an OpenAI (or OpenAI-compatible)
client.
<!-- Describe your changes, the overall approach, the underlying design.
These notes will help understanding how your code works. Thanks! -->
### Contribution Checklist
<!-- Before submitting this PR, please make sure: -->
- [X] The code builds clean without any errors or warnings
- [X] The PR follows the [SK Contribution
Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
and the [pre-submission formatting
script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts)
raises no violations
- [X] All unit tests pass, and I have added new tests where possible
- [X] I didn't break anyone 😄
---------
Co-authored-by: Roger Barreto <[email protected]>
<message role="system">You are a helpful assistant that specializes in audio analysis and transcription.</message>
<message role="user">
    <text>I have an audio recording that I need help with. Can you analyze it?</text>
    <audio>{audioDataUri}</audio>
</message>
<message role="assistant">I can help you analyze this audio recording. Let me transcribe and examine its content for you. What specific information are you looking for from this audio?</message>
<message role="user">
    <text>Can you provide a full transcription and also identify any background sounds or audio quality issues?</text>
</message>
""";

var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion(
        modelId: "gpt-4o-audio-preview", // Use audio-capable model

<message role="system">You are a helpful assistant that can analyze documents and provide insights.</message>
<message role="user">
    <text>I have a document that I need help understanding. Can you analyze it?</text>
    <binary>{pdfDataUri}</binary>
</message>
<message role="assistant">I can see this is a PDF document about employees. Let me analyze its contents for you. The document appears to contain employee information and organizational data. What specific aspects would you like me to focus on?</message>
<message role="user">
    <text>Can you extract the key information and create a summary? Also, what format would be best for sharing this information with my team?</text>
Changed file: dotnet/samples/Concepts/README.md (5 additions, 3 deletions)
### PromptTemplates - Using [`Templates`](https://github.com/microsoft/semantic-kernel/blob/main/dotnet/src/SemanticKernel.Abstractions/PromptTemplate/IPromptTemplate.cs) with parametrization for `Prompt` rendering