Add NVDEC reconfiguration capability to improve NVDEC cache hit rate and reduce initialization overhead. #1176
Conversation
NicolasHug left a comment:
Thanks a lot for the great PR @IteratorandIterator! This is definitely something we'll want to include if we can verify the performance improvements.
Just checking my understanding first: before, we would never reconfigure a decoder, so a cache hit required all of codecType, width, height, chromaFormat, bitDepthLumaMinus8, and numDecodeSurfaces to match exactly. Now, we still require codecType, chromaFormat, and bitDepthLumaMinus8 to match exactly to re-use a decoder, but we allow width, height, and numDecodeSurfaces to differ, as long as width and height are no larger than the cache entry's, in which case we can reconfigure.
Is that understanding correct?
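For illustration, here is a rough C++ sketch of the two matching rules being described. The struct and helper names are hypothetical (they are not taken from the actual NVDECCache code), while the `CUVIDEOFORMAT` fields are real CUVID API fields; the numDecodeSurfaces condition discussed further down is omitted here:

```cpp
#include <nvcuvid.h>

// Hypothetical cache-entry parameters, mirroring the fields listed above.
struct CachedDecoderParams {
  cudaVideoCodec codecType;
  unsigned int ulMaxWidth;   // set from the first video's coded_width
  unsigned int ulMaxHeight;  // set from the first video's coded_height
  cudaVideoChromaFormat chromaFormat;
  unsigned char bitDepthLumaMinus8;
};

// Before: every field had to match exactly for a cache hit.
bool exactMatch(const CachedDecoderParams& c, const CUVIDEOFORMAT* f) {
  return c.codecType == f->codec && c.chromaFormat == f->chroma_format &&
      c.bitDepthLumaMinus8 == f->bit_depth_luma_minus8 &&
      c.ulMaxWidth == f->coded_width && c.ulMaxHeight == f->coded_height;
}

// After this PR: codec, chroma format, and bit depth must still match, but a
// smaller-or-equal resolution is enough, because the cached decoder can be
// reconfigured down to the new dimensions.
bool reconfigurable(const CachedDecoderParams& c, const CUVIDEOFORMAT* f) {
  return c.codecType == f->codec && c.chromaFormat == f->chroma_format &&
      c.bitDepthLumaMinus8 == f->bit_depth_luma_minus8 &&
      c.ulMaxWidth >= f->coded_width && c.ulMaxHeight >= f->coded_height;
}
```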
Regarding the performance improvement, would you be able to share a quick reproducible benchmark showing the performance gain? I assume we should observe the most gain when decoding multiple videos sequentially with decreasing height and width?
Let me know if that's something you have the bandwidth for, and thanks a lot for the PR!
src/torchcodec/_core/NVDECCache.cpp (outdated)

```cpp
context.second.ulMaxWidth >= videoFormat->coded_width &&
    context.second.ulMaxHeight >= videoFormat->coded_height
```
Quick question: does there need to be any condition on numDecodeSurfaces for the decoder to be eligible for re-configuration?
The numDecodeSurfaces value returned by each reconfiguration should be less than the ulMaxNumDecodeSurfaces specified when creating the video parser. Since the current code fixes ulMaxNumDecodeSurfaces to 8 when creating the video parser, if the tests pass under this setting, it indicates that for most videos the required ulNumDecodeSurfaces is less than or equal to 8. Therefore, I did not add an explicit check here, but it is indeed necessary and important to enforce this constraint.
I will add this constraint soon~
Hi @NicolasHug! When creating a decoder for the first time, it is usually necessary to specify ulMaxHeight and ulMaxWidth. During subsequent reconfiguration, the only requirement is that the new bitstream's coded_height and coded_width are less than or equal to the ulMaxHeight and ulMaxWidth specified at creation time. In other words, if relatively large ulMaxHeight and ulMaxWidth values (for example, 4K) are specified manually when creating the decoder, the cache hit rate will be higher. In this PR, I set these values to the resolution of the first video that uses the decoder in order to stay consistent with the existing code logic, and I believe this approach is also reasonable.

Regarding ulNumDecodeSurfaces: based on my reading of the official NVIDIA documentation as well as my own experimental results, the value set when creating the decoder usually does not affect the outcome, because the pfnSequenceCallback function returns the actual ulNumDecodeSurfaces required by the current bitstream and updates the value dynamically. However, the ulNumDecodeSurfaces value returned during reconfiguration must not exceed the ulMaxNumDecodeSurfaces specified when creating the video parser; otherwise, the decoding result will be incorrect.

Oh, I just realized that I did not add a constraint to ensure that the current bitstream's ulNumDecodeSurfaces is less than or equal to ulMaxNumDecodeSurfaces. That said, the current code fixes ulMaxNumDecodeSurfaces to 8, so the test videos likely all use a ulNumDecodeSurfaces value of 8 or less by default.

I would be very happy to provide a reproducible benchmark. I also noticed that some tests did not pass, and I am not sure whether that is due to a logic issue in my code or edge cases I did not consider. If you could let me know how to inspect the source code of the tests being run, I would be glad to fix these bugs and help merge the reconfiguration functionality into torchcodec.
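For reference, this is roughly what a reconfiguration call looks like with the CUVID API. The `reconfigure` helper is hypothetical, but `cuvidReconfigureDecoder` and `CUVIDRECONFIGUREDECODERINFO` are the actual NVDEC entry points; error handling is elided:

```cpp
#include <cstring>
#include <nvcuvid.h>

// Sketch: reconfigure an existing decoder for a new sequence. The decoder
// must have been created with ulMaxWidth/ulMaxHeight at least as large as
// the new stream's coded dimensions.
CUresult reconfigure(CUvideodecoder decoder, const CUVIDEOFORMAT* format) {
  CUVIDRECONFIGUREDECODERINFO info;
  std::memset(&info, 0, sizeof(info));
  info.ulWidth = format->coded_width;    // must be <= create-time ulMaxWidth
  info.ulHeight = format->coded_height;  // must be <= create-time ulMaxHeight
  info.ulTargetWidth = format->coded_width;
  info.ulTargetHeight = format->coded_height;
  // Must not exceed the surface count the decoder was created with.
  info.ulNumDecodeSurfaces = format->min_num_decode_surfaces;
  info.display_area.left = static_cast<short>(format->display_area.left);
  info.display_area.top = static_cast<short>(format->display_area.top);
  info.display_area.right = static_cast<short>(format->display_area.right);
  info.display_area.bottom = static_cast<short>(format->display_area.bottom);
  return cuvidReconfigureDecoder(decoder, &info);
}
```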
Thanks for the details! Regarding the test failures, if you click on one of those links you should then have access to the logs, like this one: https://github.com/meta-pytorch/torchcodec/actions/runs/21201605282/job/61015703549?pr=1176. The logs aren't always easy to read because stdout and stderr are interleaved. Here, I think the relevant errors are:
… ulMaxNumDecodeSurfaces.
Thanks a lot! I've already fixed the bug. If this code passes testing, I'll start preparing the reproducible benchmark.
Sorry @NicolasHug, I explained the ulNumDecodeSurfaces part incorrectly yesterday. It seems that ulMaxNumDecodeSurfaces, specified when creating the parser, is actually the one that can be dynamically changed, and the number of surfaces used during reconfiguration must not exceed the ulNumDecodeSurfaces set when creating the decoder. If this passes testing, that conclusion should be correct.
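Under that corrected reading, the eligibility guard would look roughly like the following sketch; `canReconfigureSurfaces` is a hypothetical helper name, and `createInfo` stands for the `CUVIDDECODECREATEINFO` the cached decoder was created with:

```cpp
#include <nvcuvid.h>

// Sketch of the guard implied by the corrected conclusion above: the surface
// count needed by the new sequence must fit within the ulNumDecodeSurfaces
// that the cached decoder was originally created with.
bool canReconfigureSurfaces(
    const CUVIDDECODECREATEINFO& createInfo,
    const CUVIDEOFORMAT* newFormat) {
  return newFormat->min_num_decode_surfaces <= createInfo.ulNumDecodeSurfaces;
}
```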

Added NVDEC decoder reconfiguration capability on top of the NVDEC cache, which improves the cache hit rate. Creating a completely new decoder takes approximately 50–70 milliseconds to initialize, whereas reconfiguring an existing decoder reduces initialization time to about 5–6 milliseconds, a roughly 10× reduction in initialization latency (tested on an H100 GPU).
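A micro-benchmark of the shape behind those numbers might look like the following sketch; the timing harness is hypothetical, and `reconfigure` refers to the helper sketched earlier, while `cuvidCreateDecoder` is the real CUVID creation call:

```cpp
#include <chrono>

// Time a single call in milliseconds using a steady clock.
template <typename F>
double timeMs(F&& fn) {
  auto t0 = std::chrono::steady_clock::now();
  fn();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage (sketch; decoder, createInfo, and newFormat assumed set up):
//   double createMs = timeMs([&] { cuvidCreateDecoder(&decoder, &createInfo); });
//   double reconfMs = timeMs([&] { reconfigure(decoder, newFormat); });
// Per the PR, expect roughly 50-70 ms for creation vs 5-6 ms for
// reconfiguration on an H100.
```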