nvidia-drm: handle -EDEADLK in nv_drm_reset_input_colorspace #1031
jopamo wants to merge 1 commit into NVIDIA:main
Conversation
drm_atomic_get_plane_state() and drm_atomic_commit() can return -EDEADLK when ww-mutex deadlock avoidance triggers. The current nv_drm_reset_input_colorspace() path drops locks and returns without running the required modeset backoff/retry flow.

Rework the function to retry the atomic sequence with drm_modeset_backoff(&ctx), rebuilding atomic state on each retry, and only finish once the sequence succeeds or another error is returned.

Signed-off-by: Paul Moses <p@1g4.org>
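For reference, the retry shape this follows is the standard DRM backoff loop; below is a minimal sketch (function name and the property write are illustrative, not the literal diff):

```c
#include <linux/err.h>
#include <drm/drm_atomic.h>
#include <drm/drm_device.h>
#include <drm/drm_modeset_lock.h>
#include <drm/drm_plane.h>

/* Sketch of the canonical backoff/retry sequence; the actual
 * nv_drm_reset_input_colorspace() rework follows this shape but touches
 * the driver's own colorspace property on the plane state. */
static int reset_plane_state_sketch(struct drm_device *dev,
                                    struct drm_plane *plane)
{
	struct drm_modeset_acquire_ctx ctx;
	struct drm_atomic_state *state;
	struct drm_plane_state *plane_state;
	int ret;

	drm_modeset_acquire_init(&ctx, 0);

	state = drm_atomic_state_alloc(dev);
	if (!state) {
		ret = -ENOMEM;
		goto fini;
	}
	state->acquire_ctx = &ctx;

retry:
	plane_state = drm_atomic_get_plane_state(state, plane);
	if (IS_ERR(plane_state)) {
		ret = PTR_ERR(plane_state);
		goto out;
	}

	/* ... reset the input colorspace on plane_state here ... */

	ret = drm_atomic_commit(state);

out:
	if (ret == -EDEADLK) {
		/* Lost a ww-mutex ordering race: rebuild the atomic state
		 * and rerun the whole sequence after backing off. */
		drm_atomic_state_clear(state);
		drm_modeset_backoff(&ctx);
		goto retry;
	}

	drm_atomic_state_put(state);
fini:
	drm_modeset_drop_locks(&ctx);
	drm_modeset_acquire_fini(&ctx);
	return ret;
}
```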
Thanks for the patch and backtrace. What are the steps you're using, or particular configuration, to trigger the problem?
GPU/driver: 5060 Ti, 590.48.01. This happened during boot while I was debugging an unrelated kernel module (act_gate). I wasn't actively exercising the DRM stack; the warning showed up as part of early system bring-up with heavy lock debugging enabled. From what I can tell, the relevant locking contract in the kernel source is very explicit about how -EDEADLK has to be handled.

Given the above, it looks like the atomic path needs to follow the standard ww-mutex backoff/retry sequence when -EDEADLK is returned. I haven't had stability issues in conjunction with this, but it doesn't appear to be a false positive based on the kernel docs.
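To make that concrete, here is a condensed sketch of the ww_mutex contract (illustrative only, not nvidia-drm code): on -EDEADLK you drop everything held in the acquire context, take the contended lock via the slow path, and retry the whole sequence. drm_modeset_backoff() performs exactly this dance for the modeset locks.

```c
#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(demo_ww_class);

/* Illustrative two-lock example of the ww_mutex backoff contract. */
static int lock_pair(struct ww_mutex *a, struct ww_mutex *b)
{
	struct ww_acquire_ctx ctx;
	struct ww_mutex *contended = NULL;
	int ret;

	ww_acquire_init(&ctx, &demo_ww_class);
retry:
	if (contended)
		ww_mutex_lock_slow(contended, &ctx); /* sleeps until free */

	if (contended != a) {
		ret = ww_mutex_lock(a, &ctx);
		if (ret == -EDEADLK) {
			if (contended)
				ww_mutex_unlock(contended);
			contended = a;
			goto retry;
		}
	}
	if (contended != b) {
		ret = ww_mutex_lock(b, &ctx);
		if (ret == -EDEADLK) {
			ww_mutex_unlock(a);
			contended = b;
			goto retry;
		}
	}
	ww_acquire_done(&ctx);

	/* ... critical section ... */

	ww_mutex_unlock(a);
	ww_mutex_unlock(b);
	ww_acquire_fini(&ctx);
	return 0;
}
```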
I think we would need to explore Wound/Wait Deadlock Prevention conceptually more before being able to approach any sort of fix. My take is that the issue only reproduces with kernel lock debugging enabled, and we are not aware of any live occurrences of it. nvidia-drm does not make use of TTM or any of the upstream GPU resource managers, so it could be that the design difference is falsely triggering the deadlock detector.
The fact that this only reproduces with lock debugging enabled is expected: lockdep is designed to expose latent ordering bugs that depend on timing, and not seeing a production deadlock does not establish correctness. The atomic helpers assume drivers implement the documented backoff pattern. As for TTM, its absence is not directly relevant; the ww_mutex and modeset locking rules apply to any driver using the DRM atomic helpers regardless of memory manager. The expectations come from DRM core, not TTM. If anything, diverging from common upstream patterns makes strict adherence more important. This is not a lockdep false positive due to design differences; it is a missing retry path in a ww_mutex context, and lock debugging simply makes it visible.
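For anyone trying to reproduce, the relevant lock-debugging bits are roughly the following; I'm listing them as an assumed minimal set rather than my exact config. CONFIG_DEBUG_WW_MUTEX_SLOWPATH injects spurious -EDEADLK returns, so a missing backoff path gets exercised even without real contention:

```
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
```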
Took a closer look. The GSP-RM path still leads to GSP RPC timeouts (Xid 119). In my testing the DRM-side error path is effectively masked because GSP wedges first. I can reproduce a GPU hang/reset by racing DRM atomic commits with DROP_MASTER/SET_MASTER. The failure manifests as a GSP RPC timeout: fn 76 (GSP_RM_CONTROL) data0=0x20800a6a data1=0x0, followed by Xid 62/109/119 and “GPU reset required”. Adding a small delay between DROP_MASTER and SET_MASTER reduces the reproduction rate, which suggests a tight timing window.
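The master-cycling side of my reproducer is roughly the following (a hypothetical simplification; the concurrent atomic commits come from a compositor or a second thread and aren't shown). It has to run as the current DRM master, otherwise drmSetMaster() fails:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>

int main(void)
{
	/* Assumes the first card node; adjust for multi-GPU systems. */
	int fd = open("/dev/dri/card0", O_RDWR | O_CLOEXEC);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Cycle master status as fast as possible while something else
	 * issues atomic commits; a usleep() between the two calls widens
	 * or narrows the timing window. */
	for (int i = 0; i < 100000; i++) {
		if (drmDropMaster(fd))
			perror("drmDropMaster");
		if (drmSetMaster(fd))
			perror("drmSetMaster");
	}

	close(fd);
	return 0;
}
```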
Never mind, I can hit both with lock debugging off.