Decoding speed is not as good as expected #570
Description
I want to leverage this framework to accelerate an old CPU FFmpeg workflow, so I ran a benchmark.
The FFmpeg execution flow is:
import ffmpeg  # ffmpeg-python

command = (
    ffmpeg.input(file_path)
    .filter("select", "not(mod(n, 3))")
    .filter("scale", w="if(gt(iw,ih),-1,360)", h="if(gt(iw,ih),360,-1)")
    .output(
        "pipe:1",
        format="image2pipe",
        vcodec="mjpeg",
        vsync="vfr",
        qscale=2,
        threads=4,
    )
    .compile()
)
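The compiled command is then executed through a pipe and the MJPEG stream is consumed from stdout; a minimal sketch of that part (the JPEG splitting and the downstream processing are omitted here):

import subprocess

# Run the compiled ffmpeg command and drain the MJPEG stream from stdout
# (sketch only: the real workflow splits the stream into individual JPEGs).
proc = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
while True:
    chunk = proc.stdout.read(1 << 20)  # read up to 1 MiB at a time
    if not chunk:
        break
    # ... split chunk into JPEG frames and hand them to the rest of the pipeline ...
proc.wait()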
For simplicity, on the GPU side I only did the following:

import numpy as np
import PyNvCodec as nvc

decoder = nvc.PyNvDecoder(video_file, gpu_id)
width = decoder.Width()
height = decoder.Height()

frame_count = 0
raw_frame = np.zeros((height, width, 3), np.uint8)
while True:
    # Decode the next frame into the preallocated buffer
    success = decoder.DecodeSingleFrame(raw_frame)
    if not success:
        break
    frame_count += 1
For a ~10,000-frame 1920x1080 video, I ran the code on 64 CPU cores (Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz) and an NVIDIA T4 GPU, and obtained the following results:
Total videos decoded: 1
Total frames decoded: 9712
Total NV Framework decoding time: 15.64 seconds
Total ffmpeg CPU decoding time: 14.27 seconds
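In other words, in frames per second (plain arithmetic over the numbers above):

frames = 9712
print(f"T4 / VPF decode:   {frames / 15.64:.0f} fps")  # ~621 fps
print(f"CPU ffmpeg decode: {frames / 14.27:.0f} fps")  # ~681 fps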
According to nvidia-smi dmon, the T4's dec unit utilization is quite high throughout decoding:
gpu pwr gtemp mtemp sm mem enc dec mclk pclk
Idx W C C % % % % MHz MHz
0 52 57 - 48 20 0 74 5000 1590
0 55 57 - 35 18 0 84 5000 1590
0 61 58 - 30 19 0 100 5000 1590
0 52 57 - 21 17 0 99 5000 1590
0 53 58 - 17 15 0 93 5000 1590
0 48 58 - 46 24 0 100 5000 1590
0 60 60 - 37 19 0 90 5000 1590
0 51 59 - 52 23 0 96 5000 1590
0 56 59 - 36 18 0 83 5000 1590
0 55 59 - 67 28 0 100 5000 1590
0 60 60 - 46 23 0 97 5000 1590
0 53 59 - 25 17 0 96 5000 1590
0 56 59 - 63 25 0 100 5000 1590
Please note that on the ffmpeg side a single process cannot fully leverage the 64 cores; if I create a thread pool and use it to process many videos in parallel (roughly as sketched below), the overall throughput is ~7-8x higher. May I ask if this is expected? On the T4 side I only called DecodeSingleFrame and did not implement the full video transformation logic, yet the speed is not as good as I expected. I thought the T4 could be ~10x faster than the current number; otherwise, switching from the CPU ffmpeg workflow to GPU decoding does not bring much benefit.
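For reference, the CPU-side scaling I mentioned is just a plain thread pool over many input files, along these lines (a sketch only: decode_one_video_cpu is a hypothetical wrapper, the file names are illustrative, and the output is simply discarded to keep it short):

import subprocess
from concurrent.futures import ThreadPoolExecutor

import ffmpeg  # ffmpeg-python


def decode_one_video_cpu(file_path):
    # Hypothetical wrapper: same pipeline as above, output discarded for brevity.
    command = (
        ffmpeg.input(file_path)
        .filter("select", "not(mod(n, 3))")
        .filter("scale", w="if(gt(iw,ih),-1,360)", h="if(gt(iw,ih),360,-1)")
        .output("pipe:1", format="image2pipe", vcodec="mjpeg",
                vsync="vfr", qscale=2, threads=4)
        .compile()
    )
    subprocess.run(command, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=True)


video_files = ["video_0.mp4", "video_1.mp4"]  # illustrative file names
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(decode_one_video_cpu, video_files))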
cc @RomanArzumanyan @gedoensmax. If this single-frame GPU decode performance is expected, do you have any other GPU acceleration suggestions for me? I thought there could be something like batch processing to improve the overall decoding throughput, e.g. along the lines of the sketch below, but I have not found such an API. Really appreciate your help!
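To make the question concrete, what I had in mind is roughly the following (a sketch only, assuming several PyNvDecoder instances can share the same GPU; GPU_ID, decode_one_video_gpu and the file names are my own illustrative names):

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import PyNvCodec as nvc

GPU_ID = 0


def decode_one_video_gpu(video_file):
    # One PyNvDecoder instance per video, all on the same T4 (assumption).
    decoder = nvc.PyNvDecoder(video_file, GPU_ID)
    raw_frame = np.zeros((decoder.Height(), decoder.Width(), 3), np.uint8)
    frames = 0
    while decoder.DecodeSingleFrame(raw_frame):
        frames += 1
    return frames


video_files = ["video_0.mp4", "video_1.mp4"]  # illustrative file names
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(decode_one_video_gpu, video_files))
print(f"Total frames decoded: {total}")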