Whisper Redesigned Solution #23549
Merged: kunal-vaishnavi merged 67 commits into microsoft:main from kunal-vaishnavi:kvaishnavi/whisper-separate-export on Mar 15, 2025
Conversation
tianleiwu reviewed Mar 14, 2025
tianleiwu approved these changes Mar 15, 2025
Description
This PR re-designs how Whisper is created and supported in ONNX Runtime. The new solution leverages previous optimization work, and it is designed to be used in conjunction with this work in ONNX Runtime GenAI.
Some of the added changes include:
- Exporting Whisper without the `WhisperBeamSearch` op
  - The previous solution used the `WhisperBeamSearch` op to chain the encoder and decoder subgraphs
  - The previous export with the `WhisperBeamSearch` op created an encoder-decoder-init model and a decoder-with-past model, so the decoder was duplicated, with one copy in each
- Adding `DUMP_STRING` to enable easy logging of intermediate information when running in debug mode to debug a problem. This info is not printed in release mode, so it will not impact performance.
- Integrating `DecoderMaskedMultiHeadAttention` into `MultiHeadAttention` (illustrated in the sketch after this list)
  - Running it through the `MultiHeadAttention` op for improved performance
  - Adding `cache_indirection` and `past_sequence_length` as new optional inputs to `MultiHeadAttention`
  - Adding `output_qk` as a new optional output of `MultiHeadAttention`
- Calculating the `output_qk` tensor with FP16 or FP32 precision, regardless of the model's precision

The existing solutions are still available if desired.
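To make the extended `MultiHeadAttention` interface above concrete, here is a minimal, hypothetical sketch of how a graph-building script might emit the contrib op with the new optional `past_sequence_length` and `cache_indirection` inputs and the optional `output_qk` output. The tensor names, attribute values, and exact input/output slot ordering are illustrative assumptions rather than details taken from this PR; empty strings follow the usual ONNX convention for skipping optional inputs.

```python
# Hypothetical sketch: emitting a com.microsoft MultiHeadAttention node with the
# new optional inputs/outputs described in this PR. Slot ordering and tensor
# names are assumptions for illustration; check the ONNX Runtime contrib-op
# documentation for the authoritative schema.
from onnx import helper

mha_node = helper.make_node(
    "MultiHeadAttention",
    inputs=[
        "query", "key", "value",
        "",                       # bias (optional, skipped)
        "",                       # key padding mask (optional, skipped)
        "",                       # attention bias (optional, skipped)
        "past_key", "past_value",
        "past_sequence_length",   # new optional input
        "cache_indirection",      # new optional input
    ],
    outputs=[
        "attn_output",
        "present_key", "present_value",
        "output_qk",              # new optional output
    ],
    name="decoder_mha_0",         # hypothetical node name
    domain="com.microsoft",
    num_heads=8,                  # example attribute value
)
print(mha_node)
```

In a real export, a node like this would be wired into the decoder graph so that a beam-search driver (for example, ONNX Runtime GenAI) could supply `cache_indirection` on each iteration and consume `output_qk` when needed.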
Known Issues
- Using the `WhisperBeamSearch` op with output QK is currently disabled, because ONNX Runtime doesn't currently support output QK kernels on CPU, only on CUDA.
- The `DecoderMaskedMultiHeadAttention` CPU kernel has a parity mismatch with the `DecoderMaskedMultiHeadAttention` CUDA kernel.
- Using `DecoderMaskedMultiHeadAttention` for the FP32 CPU model is not enabled; it currently uses `MultiHeadAttention` to avoid the parity mismatch issue.
Motivation and Context
Using the beam search op has made it more difficult to debug and fix errors as they are encountered. The new approach is more flexible and more customizable for users (e.g., by running with ONNX Runtime GenAI). It also helps with this issue.