Decoding speed is not as good as expected #570
Description
I want to leverage this framework to accelerate an old CPU FFmpeg workflow, so I ran a benchmark.
The FFmpeg execution flow is:
import ffmpeg  # ffmpeg-python

command = (
    ffmpeg.input(file_path)
    .filter("select", "not(mod(n, 3))")
    .filter("scale", w="if(gt(iw,ih),-1,360)", h="if(gt(iw,ih),360,-1)")
    .output(
        "pipe:1",
        format="image2pipe",
        vcodec="mjpeg",
        vsync="vfr",
        qscale=2,
        threads=4,
    )
    .compile()
)
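The compiled command is then executed through a pipe and the MJPEG stream is consumed from stdout; a minimal sketch of that part (the JPEG splitting and the downstream processing are omitted here):

import subprocess

# Run the compiled ffmpeg command and drain the MJPEG stream from stdout
# (sketch only: the real workflow splits the stream into individual JPEGs).
proc = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
while True:
    chunk = proc.stdout.read(1 << 20)  # read up to 1 MiB at a time
    if not chunk:
        break
    # ... split chunk into JPEG frames and hand them to the rest of the pipeline ...
proc.wait()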
For simplicity, on the GPU side I only did the following:

import numpy as np
import PyNvCodec as nvc

decoder = nvc.PyNvDecoder(video_file, gpu_id)
width = decoder.Width()
height = decoder.Height()

frame_count = 0
raw_frame = np.zeros((height, width, 3), np.uint8)
while True:
    # Decode the next frame into the preallocated buffer
    success = decoder.DecodeSingleFrame(raw_frame)
    if not success:
        break
    frame_count += 1
For a ~10,000-frame 1920x1080 video, I ran the code on 64 CPU cores (Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz) and an NVIDIA T4 GPU, and obtained the following results:
Total videos decoded: 1
Total frames decoded: 9712
Total NV Framework decoding time: 15.64 seconds
Total ffmpeg CPU decoding time: 14.27 seconds
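In other words, in frames per second (plain arithmetic over the numbers above):

frames = 9712
print(f"T4 / VPF decode:   {frames / 15.64:.0f} fps")  # ~621 fps
print(f"CPU ffmpeg decode: {frames / 14.27:.0f} fps")  # ~681 fps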
According to nvidia-smi dmon, the T4's dec unit utilization is quite high throughout decoding:
gpu pwr gtemp mtemp sm mem enc dec mclk pclk
Idx W C C % % % % MHz MHz
0 52 57 - 48 20 0 74 5000 1590
0 55 57 - 35 18 0 84 5000 1590
0 61 58 - 30 19 0 100 5000 1590
0 52 57 - 21 17 0 99 5000 1590
0 53 58 - 17 15 0 93 5000 1590
0 48 58 - 46 24 0 100 5000 1590
0 60 60 - 37 19 0 90 5000 1590
0 51 59 - 52 23 0 96 5000 1590
0 56 59 - 36 18 0 83 5000 1590
0 55 59 - 67 28 0 100 5000 1590
0 60 60 - 46 23 0 97 5000 1590
0 53 59 - 25 17 0 96 5000 1590
0 56 59 - 63 25 0 100 5000 1590
Please note that on the ffmpeg side a single process cannot fully leverage the 64 cores; if I create a thread pool and use it to process many videos in parallel (roughly as sketched below), the overall throughput is ~7-8x higher. May I ask if this is expected? On the T4 side I only called DecodeSingleFrame and did not implement the full video transformation logic, yet the speed is not as good as I expected. I thought the T4 could be ~10x faster than the current number; otherwise, switching from the CPU ffmpeg workflow to GPU decoding does not bring much benefit.
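For reference, the CPU-side scaling I mentioned is just a plain thread pool over many input files, along these lines (a sketch only: decode_one_video_cpu is a hypothetical wrapper, the file names are illustrative, and the output is simply discarded to keep it short):

import subprocess
from concurrent.futures import ThreadPoolExecutor

import ffmpeg  # ffmpeg-python


def decode_one_video_cpu(file_path):
    # Hypothetical wrapper: same pipeline as above, output discarded for brevity.
    command = (
        ffmpeg.input(file_path)
        .filter("select", "not(mod(n, 3))")
        .filter("scale", w="if(gt(iw,ih),-1,360)", h="if(gt(iw,ih),360,-1)")
        .output("pipe:1", format="image2pipe", vcodec="mjpeg",
                vsync="vfr", qscale=2, threads=4)
        .compile()
    )
    subprocess.run(command, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=True)


video_files = ["video_0.mp4", "video_1.mp4"]  # illustrative file names
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(decode_one_video_cpu, video_files))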
cc @RomanArzumanyan @gedoensmax. If this single-frame GPU decode performance is expected, do you have any other GPU acceleration suggestions for me? I thought there could be something like batch processing to improve the overall decoding throughput, e.g. along the lines of the sketch below, but I have not found such an API. Really appreciate your help!
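To make the question concrete, what I had in mind is roughly the following (a sketch only, assuming several PyNvDecoder instances can share the same GPU; GPU_ID, decode_one_video_gpu and the file names are my own illustrative names):

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import PyNvCodec as nvc

GPU_ID = 0


def decode_one_video_gpu(video_file):
    # One PyNvDecoder instance per video, all on the same T4 (assumption).
    decoder = nvc.PyNvDecoder(video_file, GPU_ID)
    raw_frame = np.zeros((decoder.Height(), decoder.Width(), 3), np.uint8)
    frames = 0
    while decoder.DecodeSingleFrame(raw_frame):
        frames += 1
    return frames


video_files = ["video_0.mp4", "video_1.mp4"]  # illustrative file names
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(decode_one_video_gpu, video_files))
print(f"Total frames decoded: {total}")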