
Conversation

@IteratorandIterator

Added NVDEC decoder reconfiguration capability on top of the NVDEC cache, which improves the cache hit rate. Creating a new decoder from scratch takes approximately 50–70 milliseconds to initialize, whereas reconfiguring an existing decoder reduces initialization time to about 5–6 milliseconds, roughly a 10× reduction in initialization latency (tested on an H100 GPU).
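For context, a minimal sketch of what reconfiguring an existing NVDEC decoder for a new stream involves, using the NVDEC API directly (an illustrative helper, not this PR's actual implementation):

#include <nvcuvid.h> // NVDEC decoder/parser API (cuvidReconfigureDecoder, CUVIDRECONFIGUREDECODERINFO)

// Reuse an existing decoder for a new stream instead of destroying and re-creating it.
// Assumes the new coded dimensions fit within the ulMaxWidth/ulMaxHeight the decoder
// was created with, which is what makes reconfiguration legal.
static bool tryReconfigureDecoder(CUvideodecoder decoder, const CUVIDEOFORMAT* videoFormat) {
  CUVIDRECONFIGUREDECODERINFO reconfigParams = {}; // value-initialize every field to zero
  reconfigParams.ulWidth = videoFormat->coded_width;
  reconfigParams.ulHeight = videoFormat->coded_height;
  reconfigParams.ulTargetWidth = videoFormat->coded_width;
  reconfigParams.ulTargetHeight = videoFormat->coded_height;
  reconfigParams.ulNumDecodeSurfaces = videoFormat->min_num_decode_surfaces;
  // display_area and target_rect can also be filled in when cropping or scaling is used.
  return cuvidReconfigureDecoder(decoder, &reconfigParams) == CUDA_SUCCESS;
}

Reconfiguring skips the expensive decoder teardown and re-creation, which is where the 50–70 ms versus 5–6 ms difference comes from.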

Lingxiao Zhao added 2 commits January 21, 2026 11:35

meta-cla bot commented Jan 21, 2026

Hi @IteratorandIterator!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!


meta-cla bot commented Jan 21, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

meta-cla bot added the CLA Signed label on Jan 21, 2026
@NicolasHug (Contributor) left a comment


Thanks a lot for the great PR @IteratorandIterator ! This is definitely something we'll want to include if we can verify the performance improvements.

Just checking my understanding first: before, we would never reconfigure a decoder, so to have a cache hit we expected all of codecType, width, height, chromaFormat, bitDepthLumaMinus8, and numDecodeSurfaces to match exactly. Now, codecType, chromaFormat, and bitDepthLumaMinus8 still have to match exactly to re-use a decoder, but width, height, and numDecodeSurfaces are allowed to differ as long as width and height are no larger than the cache entry's maximums, in which case we can reconfigure.

Is that understanding correct?
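In pseudocode, that rule would look roughly like this (the cache-entry struct and helper name are hypothetical, not the PR's actual code; the CUVIDEOFORMAT fields are from the NVDEC API):

#include <nvcuvid.h>

// Hypothetical shape of a cached decoder entry.
struct CachedDecoderEntry {
  cudaVideoCodec codecType;
  cudaVideoChromaFormat chromaFormat;
  unsigned char bitDepthLumaMinus8;
  unsigned long ulMaxWidth;
  unsigned long ulMaxHeight;
};

// A cached decoder is reusable (via reconfiguration) for a new stream when the
// codec properties match exactly and the new coded size fits within the maximums.
static bool canReuse(const CachedDecoderEntry& cached, const CUVIDEOFORMAT* fmt) {
  bool exactMatch = cached.codecType == fmt->codec &&
                    cached.chromaFormat == fmt->chroma_format &&
                    cached.bitDepthLumaMinus8 == fmt->bit_depth_luma_minus8;
  bool fitsWithinMax = cached.ulMaxWidth >= fmt->coded_width &&
                       cached.ulMaxHeight >= fmt->coded_height;
  return exactMatch && fitsWithinMax;
}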

Regarding the performance improvement, would you be able to share a quick reproducible benchmark showing the performance gain? I assume we should observe the most gain when decoding multiple videos sequentially with decreasing height and width?

Let me know if that's something you have the bandwidth for, and thanks a lot for the PR!

Comment on lines 95 to 96
context.second.ulMaxWidth >= videoFormat->coded_width &&
context.second.ulMaxHeight >= videoFormat->coded_height
@NicolasHug (Contributor)


Quick question: does there need to be any condition on numDecodeSurfaces for the decoder to be eligible for re-configuration?

@IteratorandIterator (Author)

The numDecodeSurfaces value returned by each reconfiguration should be less than the ulMaxNumDecodeSurfaces specified when creating the video parser. Since the current code fixes ulMaxNumDecodeSurfaces to 8 when creating the video parser, if the tests pass under this setting, it indicates that for most videos the required ulNumDecodeSurfaces is less than or equal to 8. Therefore, I did not add an explicit check here, but it is indeed necessary and important to enforce this constraint.
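A sketch of what that explicit check could look like (CUVIDEOFORMAT::min_num_decode_surfaces is the surface count the new stream reports needing; the helper name and the cap parameter are illustrative):

// The reconfiguration candidate is only eligible if the surfaces the new stream
// needs stay within the cap that was in effect at creation time (currently fixed to 8).
static bool surfacesWithinCap(const CUVIDEOFORMAT* videoFormat, unsigned int maxNumDecodeSurfaces) {
  return videoFormat->min_num_decode_surfaces <= maxNumDecodeSurfaces;
}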


I will add this constraint soon~

@IteratorandIterator (Author)

> Thanks a lot for the great PR @IteratorandIterator ! This is definitely something we'll want to include if we can verify the performance improvements.
>
> Just checking my understanding first: before, we would never reconfigure a decoder, so to have a cache hit we expected all of codecType, width, height, chromaFormat, bitDepthLumaMinus8, and numDecodeSurfaces to match exactly. Now, codecType, chromaFormat, and bitDepthLumaMinus8 still have to match exactly to re-use a decoder, but width, height, and numDecodeSurfaces are allowed to differ as long as width and height are no larger than the cache entry's maximums, in which case we can reconfigure.
>
> Is that understanding correct?
>
> Regarding the performance improvement, would you be able to share a quick reproducible benchmark showing the performance gain? I assume we should observe the most gain when decoding multiple videos sequentially with decreasing height and width?
>
> Let me know if that's something you have the bandwidth for, and thanks a lot for the PR!

Hi @NicolasHug!

When creating a decoder for the first time, it is usually necessary to specify ulMaxHeight and ulMaxWidth. During subsequent reconfiguration, it is only required that the new bitstream's coded_height and coded_width are less than or equal to the ulMaxHeight and ulMaxWidth specified at creation time. In other words, if relatively large ulMaxHeight and ulMaxWidth values (for example, 4K) are specified manually when creating the decoder, the cache hit rate will be higher. In this PR, I set these values to the resolution of the first video that uses the decoder in order to remain consistent with the existing code logic, and I believe this approach is also reasonable.
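For illustration, the creation-time parameters that bound later reconfiguration look roughly like this (a sketch with illustrative values, not the PR's exact code):

#include <nvcuvid.h>

// Create a decoder whose ulMaxWidth/ulMaxHeight determine which later streams
// can be handled by reconfiguration instead of re-creation.
static CUvideodecoder createDecoder(const CUVIDEOFORMAT* videoFormat) {
  CUVIDDECODECREATEINFO createInfo = {};
  createInfo.CodecType = videoFormat->codec;
  createInfo.ChromaFormat = videoFormat->chroma_format;
  createInfo.bitDepthMinus8 = videoFormat->bit_depth_luma_minus8;
  createInfo.ulWidth = videoFormat->coded_width;
  createInfo.ulHeight = videoFormat->coded_height;
  createInfo.ulTargetWidth = videoFormat->coded_width;
  createInfo.ulTargetHeight = videoFormat->coded_height;
  createInfo.ulNumDecodeSurfaces = videoFormat->min_num_decode_surfaces;
  createInfo.ulNumOutputSurfaces = 1;
  createInfo.OutputFormat = cudaVideoSurfaceFormat_NV12;
  createInfo.DeinterlaceMode = cudaVideoDeinterlaceMode_Weave;
  // Reconfiguration later only requires coded_width/coded_height <= these maximums.
  // This PR sets them to the first stream's resolution; choosing larger values
  // (e.g. 3840x2160) would make more subsequent streams eligible for reuse.
  createInfo.ulMaxWidth = videoFormat->coded_width;
  createInfo.ulMaxHeight = videoFormat->coded_height;
  CUvideodecoder decoder = nullptr;
  cuvidCreateDecoder(&decoder, &createInfo);
  return decoder;
}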

Regarding ulNumDecodeSurfaces, based on my reading of the official NVIDIA documentation as well as my own experimental results, the value set when creating the decoder usually does not affect the outcome. This is because the pfnSequenceCallback function returns the actual ulNumDecodeSurfaces required by the current bitstream and updates the value dynamically. However, the ulNumDecodeSurfaces value returned during reconfiguration must not exceed the ulMaxNumDecodeSurfaces specified when creating the video parser; otherwise, the decoding result will be incorrect.
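For reference, the general shape of that parser setup and sequence callback in the NVDEC API (a sketch, not the PR's code):

#include <nvcuvid.h>

// The parser calls this when a new sequence header is parsed. Returning a value
// greater than 1 tells the parser how many decode surfaces the sequence actually
// needs; this is how the count gets updated dynamically per stream.
static int CUDAAPI handleVideoSequence(void* /*userData*/, CUVIDEOFORMAT* videoFormat) {
  // ... create or reconfigure the decoder for videoFormat here ...
  return videoFormat->min_num_decode_surfaces;
}

// ulMaxNumDecodeSurfaces only seeds the parser's initial assumption (fixed to 8 in
// the current code); the callback's return value then overrides it.
static CUvideoparser createParser() {
  CUVIDPARSERPARAMS parserParams = {};
  parserParams.CodecType = cudaVideoCodec_H264; // example codec
  parserParams.ulMaxNumDecodeSurfaces = 8;
  parserParams.pfnSequenceCallback = handleVideoSequence;
  // pfnDecodePicture and pfnDisplayPicture callbacks omitted for brevity.
  CUvideoparser parser = nullptr;
  cuvidCreateVideoParser(&parser, &parserParams);
  return parser;
}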

Oh, I just realized that I did not add a constraint to ensure that the current bitstream's ulNumDecodeSurfaces is less than or equal to ulMaxNumDecodeSurfaces. That said, the current code fixes ulMaxNumDecodeSurfaces to 8, so the test videos likely all use a ulNumDecodeSurfaces value that is less than or equal to 8 by default.

I would be very happy to provide a reproducible benchmark. I also noticed that some test results did not pass, and I am not sure whether this is due to a logic issue in the code I provided or some edge cases that I did not consider. If you could let me know how to inspect the source code of the tests being run, I would be glad to fix these bugs and help merge the reconfiguration functionality into torchcodec.

@NicolasHug (Contributor)

Thanks for the details!

Regarding the test failures, if you click on one of those links

[screenshot of the failing CI check links]

you should then have access to the logs like this one: https://github.com/meta-pytorch/torchcodec/actions/runs/21201605282/job/61015703549?pr=1176

The logs aren't always easy to read because stdout and stderr are interleaved. Here, I think the relevant errors are:

/__w/torchcodec/torchcodec/meta-pytorch/torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp: In member function ‘int facebook::torchcodec::BetaCudaDeviceInterface::reconfigureNVDECDecoder(CUVIDEOFORMAT*)’:
/__w/torchcodec/torchcodec/meta-pytorch/torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp:392:42: error: missing initializer for member ‘_CUVIDRECONFIGUREDECODERINFO::ulHeight’ [-Werror=missing-field-initializers]
  392 |   CUVIDRECONFIGUREDECODERINFO info = { 0 };
      |                                          ^
/__w/torchcodec/torchcodec/meta-pytorch/torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp:392:42: error: missing initializer for member ‘_CUVIDRECONFIGUREDECODERINFO::ulTargetWidth’ [-Werror=missing-field-initializers]
/__w/torchcodec/torchcodec/meta-pytorch/torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp:392:42: error: missing initializer for member ‘_CUVIDRECONFIGUREDECODERINFO::ulTargetHeight’ [-Werror=missing-field-initializers]
/__w/torchcodec/torchcodec/meta-pytorch/torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp:392:42: error: missing initializer for member ‘_CUVIDRECONFIGUREDECODERINFO::ulNumDecodeSurfaces’ [-Werror=missing-field-initializers]
/__w/torchcodec/torchcodec/meta-pytorch/torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp:392:42: error: missing initializer for member ‘_CUVIDRECONFIGUREDECODERINFO::reserved1’ [-Werror=missing-field-initializers]
/__w/torchcodec/torchcodec/meta-pytorch/torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp:392:42: error: missing initializer for member ‘_CUVIDRECONFIGUREDECODERINFO::display_area’ [-Werror=missing-field-initializers]
/__w/torchcodec/torchcodec/meta-pytorch/torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp:392:42: error: missing initializer for member ‘_CUVIDRECONFIGUREDECODERINFO::target_rect’ [-Werror=missing-field-initializers]
/__w/torchcodec/torchcodec/meta-pytorch/torchcodec/src/torchcodec/_core/BetaCudaDeviceInterface.cpp:392:42: error: missing initializer for member ‘_CUVIDRECONFIGUREDECODERINFO::reserved2’ [-Werror=missing-field-initializers]
cc1plus: all warnings being treated as errors
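For reference, that error comes from -Werror=missing-field-initializers: writing = { 0 } explicitly initializes only the first member, and GCC then warns about every remaining member. A sketch of the usual fix, which keeps every field zeroed without naming them (the fields assigned afterwards are just whatever the reconfiguration path needs):

// Value-initialization with empty braces zeroes all members and does not trigger
// -Wmissing-field-initializers, unlike { 0 }.
CUVIDRECONFIGUREDECODERINFO info = {};
// (std::memset(&info, 0, sizeof(info)) would work as well.)
info.ulWidth = videoFormat->coded_width;
info.ulHeight = videoFormat->coded_height;
info.ulTargetWidth = videoFormat->coded_width;
info.ulTargetHeight = videoFormat->coded_height;
info.ulNumDecodeSurfaces = videoFormat->min_num_decode_surfaces;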

@IteratorandIterator (Author)

Thanks a lot!

I've already fixed the bug. If this code passes testing, I'll start preparing the reproducible benchmark.

@IteratorandIterator (Author)

Sorry @NicolasHug, I explained the part about ulNumDecodeSurfaces incorrectly yesterday. It seems that ulMaxNumDecodeSurfaces, specified when creating the parser, is actually the value that can be changed dynamically, and the number of surfaces used during reconfiguration should not exceed the ulNumDecodeSurfaces set when creating the decoder. If this passes testing, that conclusion should be correct.
