
VPF Performance analysis


Basic idea

VPF provides Python bindings to HW-accelerated video processing features such as decoding / encoding and some CUDA-accelerated features like color conversion.
How do you know whether your Python video processing program shows optimal performance or not? By using performance profiling and measurement tools.
You may think of VPF as just another CUDA and Video Codec SDK C++ library, so all existing CUDA tools apply to VPF as well.

nvidia-smi

A small yet extremely useful CLI tool which shows a whole lot of information such as Nvdec / Nvenc / CUDA core load levels, GPU clocks and more. Run this utility in parallel with your program to see which HW components are used.
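For example, you can start nvidia-smi dmon in a second terminal, or launch it alongside your script. Below is a minimal Python sketch of the latter; the sample script name and its arguments are placeholders, adjust them to your setup:

    import subprocess

    # Start nvidia-smi dmon in the background; "-s u" restricts the output
    # to the utilization counters (sm / mem / enc / dec columns).
    monitor = subprocess.Popen(["nvidia-smi", "dmon", "-s", "u"])

    try:
        # Run the workload under inspection while the monitor keeps printing.
        # Script name and arguments here are placeholders.
        subprocess.run(["python", "SampleDemuxDecode.py", "0", "input.mp4", "out.yuv"],
                       check=True)
    finally:
        # Stop the monitor once the workload is finished.
        monitor.terminate()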

Let's run it to get information about the HW, CUDA version and more:

nvidia-smi CLI utility output

Below is an example of SampleDemuxDecode.py, which utilizes Nvdec for 1080p H.264 video decoding and CUDA cores for NV12 -> YUV420 color conversion:

nvidia-smi dmon CLI utility launched in parallel with SampleDemuxDecode.py

What do these numbers mean?

Neither Nvdec nor the CUDA cores (the dec and sm columns) are maxed out; their usage is between 15% and 23%. Hence something is slowing down the program.
The usual reasons are:

  • Network or disk IO speed
  • Memory copies between RAM and vRAM
  • CPU code which takes a long time to run
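To narrow down which of these is the culprit, it often helps to time the individual stages of the processing loop. Below is a minimal sketch; it assumes the same nvDec, nvCvt, nvDwn, decFile and related objects as in SampleDemuxDecode.py, and the helper and stage names are illustrative:

    import time
    from collections import defaultdict

    # Wall-clock time accumulated per stage of the processing loop.
    timings = defaultdict(float)

    def timed(stage, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        timings[stage] += time.perf_counter() - start
        return result

    # Inside the loop the calls become, for example:
    #   surface_nv12 = timed("decode", nvDec.DecodeSurfaceFromPacket, packet)
    #   surface_yuv420 = timed("convert", nvCvt.Execute, surface_nv12, cc_ctx)
    #   ok = timed("download", nvDwn.DownloadSingleSurface, surface_yuv420, rawFrame)
    #   timed("disk write", decFile.write, bytearray(rawFrame))

    # After the loop, print the stages sorted by total time spent:
    for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{stage}: {seconds:.2f} s")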

Let's check what's happening in our SampleDemuxDecode.py code:

    while True:
        # Demuxer has sync design, it returns packet every time it's called.
        # If demuxer can't return packet it usually means EOF.
        if not nvDmx.DemuxSinglePacket(packet):
            break


        # Decoder is async by design.
        # As it consumes packets from the demuxer one at a time, it may not
        # return a decoded surface every time the decoding function is called.
        surface_nv12 = nvDec.DecodeSurfaceFromPacket(packet)
        if not surface_nv12.Empty():
            # CUDA-accelerated NV12 -> YUV420 color conversion.
            surface_yuv420 = nvCvt.Execute(surface_nv12, cc_ctx)
            if surface_yuv420.Empty():
                break
            # DtoH memcpy: download the converted surface from vRAM to RAM.
            if not nvDwn.DownloadSingleSurface(surface_yuv420, rawFrame):
                break
            # Serialize the raw frame and dump it to disk.
            bits = bytearray(rawFrame)
            decFile.write(bits)

We can easily see that decoded surfaces are copied from vRAM to RAM and later saved to disk. First, let's comment out the portion of the code which dumps the frames to disk:

            #bits = bytearray(rawFrame)
            #decFile.write(bits)

With the frames no longer written to disk, let's look at the nvidia-smi dmon output again:

nvidia-smi dmon output after frames are no longer stored on disk

One can easily notice the performance improvement. Now let's also remove the copy between vRAM and RAM, which is done here:

            #if not nvDwn.DownloadSingleSurface(surface_yuv420, rawFrame):
            #    break

And repeat the analysis one more time:

nvidia-smi dmon output after DtoH memcpy is eliminated

We see a decline in CUDA core load and a slightly more stable Nvdec usage, which now sits at 33%. It doesn't go any higher than this simply because our Quadro RTX 3000 GPU has 3 Nvdec units, so a single video stream can only occupy 1/3 of its decoding capacity, which is ~33%.

Now our modified SampleDemuxDecode.py script is clearly limited by Nvdec performance and can't be further optimized.
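If you prefer to confirm such numbers from Python rather than by watching nvidia-smi dmon, the same counters are exposed through NVML. Below is a minimal sketch assuming the pynvml package (nvidia-ml-py) is installed; sample it periodically while your script is running:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # CUDA core (SM) and memory controller utilization in percent.
    rates = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # Nvdec utilization in percent; the second value is the sampling period.
    dec_util, _ = pynvml.nvmlDeviceGetDecoderUtilization(handle)

    print(f"sm: {rates.gpu}%, mem: {rates.memory}%, dec: {dec_util}%")
    pynvml.nvmlShutdown()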

More content to be added soon.
