This repository was archived by the owner on Jun 10, 2024. It is now read-only.

Decoding speed is not as good as expected #570

@brucechin

Description

I want to leverage this framework to accelerate an existing CPU FFmpeg workflow, so I ran some benchmarks.

The FFmpeg pipeline (built with ffmpeg-python) is:

    import ffmpeg  # ffmpeg-python bindings

    # Keep every 3rd frame, scale the shorter side to 360 px, and pipe
    # the frames out as MJPEG.
    command = (
        ffmpeg.input(file_path)
        .filter("select", "not(mod(n, 3))")
        .filter("scale", w="if(gt(iw,ih),-1,360)", h="if(gt(iw,ih),360,-1)")
        .output(
            "pipe:1",
            format="image2pipe",
            vcodec="mjpeg",
            vsync="vfr",
            qscale=2,
            threads=4,
        )
        .compile()  # returns the argv list to run via subprocess
    )
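For completeness, this is roughly how I execute the compiled command and count the piped MJPEG frames (a minimal sketch; counting on the JPEG start-of-image marker is my own heuristic, and it assumes the ffmpeg binary is on PATH):

    import subprocess

    # Run the compiled argv list; the MJPEG frames arrive concatenated
    # on stdout because the output target is "pipe:1".
    proc = subprocess.run(command, capture_output=True, check=True)

    # Each JPEG image begins with the SOI marker FF D8 FF, so counting
    # its occurrences gives the number of frames in the pipe.
    frame_count = proc.stdout.count(b"\xff\xd8\xff")
    print(f"ffmpeg decoded {frame_count} frames")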

For simplicity, on the GPU side I only did the decode:

    import numpy as np
    import PyNvCodec as nvc

    decoder = nvc.PyNvDecoder(video_file, gpu_id)
    width = decoder.Width()
    height = decoder.Height()
    frame_count = 0
    # Reusable host buffer for the decoded frame
    raw_frame = np.zeros((height, width, 3), np.uint8)
    while True:
        # Decode the next frame into the host buffer
        success = decoder.DecodeSingleFrame(raw_frame)
        if not success:
            break
        frame_count += 1
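The timing numbers below were measured by wrapping this loop with a simple wall-clock timer, roughly along these lines (sketch only):

    import time

    start = time.perf_counter()
    # ... run the decode loop above ...
    elapsed = time.perf_counter() - start
    print(f"Decoded {frame_count} frames in {elapsed:.2f} s "
          f"({frame_count / elapsed:.0f} fps)")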

For a ~10,000-frame 1920×1080 video, I ran the code on a machine with 64 CPU cores (Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz) and an NVIDIA T4 GPU, and obtained the following results:

Total videos decoded: 1
Total frames decoded: 9712
Total NV Framework decoding time: 15.64 seconds
Total ffmpeg CPU decoding time: 14.27 seconds

That works out to roughly 620 fps on the T4 versus roughly 680 fps for a single ffmpeg process.

According to nvidia-smi dmon, the T4's dec unit utilization is quite high throughout the decoding run:

gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
Idx     W     C     C     %     %     %     %   MHz   MHz
    0    52    57     -    48    20     0    74  5000  1590
    0    55    57     -    35    18     0    84  5000  1590
    0    61    58     -    30    19     0   100  5000  1590
    0    52    57     -    21    17     0    99  5000  1590
    0    53    58     -    17    15     0    93  5000  1590
    0    48    58     -    46    24     0   100  5000  1590
    0    60    60     -    37    19     0    90  5000  1590
    0    51    59     -    52    23     0    96  5000  1590
    0    56    59     -    36    18     0    83  5000  1590
    0    55    59     -    67    28     0   100  5000  1590
    0    60    60     -    46    23     0    97  5000  1590
    0    53    59     -    25    17     0    96  5000  1590
    0    56    59     -    63    25     0   100  5000  1590

Please note that a single ffmpeg process cannot fully leverage the 64 cores; if I create a thread pool and use it to process many videos in parallel, the CPU side could be ~7-8x faster. May I ask if this is expected? On the T4 side I only called DecodeSingleFrame and did not implement the full video transformation logic, yet the speed is still not as good as I expected. I thought the T4 could be 10x faster than the current number; otherwise, switching from the CPU FFmpeg workflow to GPU decoding does not bring much benefit.

cc @RomanArzumanyan @gedoensmax. If this single-frame GPU decode performance is expected, do you have any other GPU acceleration suggestions for me? I thought there could be something like batch processing to improve the overall decoding throughput, but I have not found it; a sketch of what I have in mind is below. Really appreciate your help!
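By batch processing I mean keeping several decode sessions in flight at once, one PyNvDecoder per worker. A sketch only: the video paths and worker count are placeholders, and it assumes DecodeSingleFrame releases the GIL so threads can overlap; if it does not, a process pool would be needed instead.

    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    import PyNvCodec as nvc

    def decode_video(video_file, gpu_id=0):
        """Decode one video on the GPU and return its frame count."""
        decoder = nvc.PyNvDecoder(video_file, gpu_id)
        raw_frame = np.zeros((decoder.Height(), decoder.Width(), 3), np.uint8)
        frame_count = 0
        while decoder.DecodeSingleFrame(raw_frame):
            frame_count += 1
        return frame_count

    # Keeping several sessions in flight should hide per-video demux and
    # host-side overhead and keep the NVDEC unit saturated.
    video_files = ["a.mp4", "b.mp4", "c.mp4"]  # placeholder paths
    with ThreadPoolExecutor(max_workers=4) as pool:
        counts = list(pool.map(decode_video, video_files))
    print(f"Decoded {sum(counts)} frames across {len(counts)} videos")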
