Fix use-after-free races in memory pool shrinker and DRM fence destruction #1004
Open
neoyubi wants to merge 1 commit into NVIDIA:main from
During memory pressure, kswapd invokes shrinker callbacks via shrink_slab. A race condition exists where nv_mem_pool_destroy() can free the shrinker while kswapd is still iterating, causing the kernel to call corrupted function pointers and crash.

Changes:
- Move nv_mem_pool_shrinker_free() to execute FIRST in destroy sequence
- Add synchronize_rcu() after shrinker unregistration to ensure all RCU readers have completed before continuing destruction
- Set shrinker pointer to NULL after free to prevent dangling reference
- Split DRM fence context destruction into prepare + final phases to signal fences before drm_gem_object_release()

Tested on RTX 5090 with kernel 6.18.5 - system stable after fix.
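The teardown order listed in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct pool`, `struct shrinker_model`, `registered`, `wait_for_readers()`, `reclaim_scan()`, `pool_create()` and `pool_destroy()` are hypothetical stand-ins and are not taken from nv-vm.c. `wait_for_readers()` marks the place where the patch calls synchronize_rcu(); in this single-threaded model it has nothing to wait for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these names come from the driver. */
struct pool;

struct shrinker_model {
    struct pool *pool;              /* pool that the reclaim callback scans */
};

struct pool {
    int *pages;                     /* stands in for the pool's page lists */
    size_t nr_pages;
    struct shrinker_model *shrinker;
};

/* One-slot registry standing in for the kernel's list of shrinkers. */
static struct shrinker_model *registered;

/* Placeholder for synchronize_rcu(): this model has no concurrent readers. */
static void wait_for_readers(void)
{
}

static struct pool *pool_create(size_t nr_pages)
{
    struct pool *p = calloc(1, sizeof(*p));
    assert(p != NULL);
    p->pages = calloc(nr_pages, sizeof(*p->pages));
    assert(p->pages != NULL);
    p->nr_pages = nr_pages;
    p->shrinker = calloc(1, sizeof(*p->shrinker));
    assert(p->shrinker != NULL);
    p->shrinker->pool = p;
    registered = p->shrinker;       /* callback becomes reachable by reclaim */
    return p;
}

/* Models reclaim walking the registry; returns 0 when nothing is registered. */
static size_t reclaim_scan(void)
{
    if (registered == NULL)
        return 0;
    /* A reachable callback must never observe freed page lists. */
    assert(registered->pool->pages != NULL);
    return registered->pool->nr_pages;
}

static void pool_destroy(struct pool *p)
{
    /* 1. Make the callback unreachable before touching anything it reads. */
    registered = NULL;
    /* 2. Wait for callbacks that started before step 1 to finish. */
    wait_for_readers();
    /* 3. Free the shrinker and clear the pointer to avoid a dangling reference. */
    free(p->shrinker);
    p->shrinker = NULL;
    /* 4. Only now release the page lists and the pool itself. */
    free(p->pages);
    p->pages = NULL;
    p->nr_pages = 0;
    free(p);
}
```

In this model, `reclaim_scan()` returns the page count while the pool is alive and 0 once `pool_destroy()` has run, because the callback is unreachable before any pool memory is released.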
Collaborator
Reading through this code, I believe the refactors to
Fix use-after-free races in memory pool shrinker and DRM fence destruction
Summary
This patch fixes two related use-after-free race conditions that cause kernel crashes under memory pressure:
- kswapd can invoke shrinker callbacks while nv_mem_pool_destroy() is freeing pool resources
- The kernel can access dma_resv while fence contexts are being destroyed

Both issues stem from the same root cause: cleanup callbacks not being stopped before the resources they access are released.
Issue 1: Memory Pool Shrinker Race
Problem
The shrinker is unregistered after freeing the pool's page lists:
Race Scenario
Fix
- Move nv_mem_pool_shrinker_free() to the start of destruction
- Add synchronize_rcu() after unregistration to ensure no callbacks are in-flight (kernel iterates shrinkers under RCU)

Issue 2: DRM Fence Context Destruction Race
Problem
When a GEM object with an associated fence context is destroyed, the current code:
- Calls drm_gem_object_release() (releases dma_resv)

This allows the kernel's drm_exec/shrinker infrastructure to access dma_resv while fences are still active.
Race Scenario
Fix
Introduce two-phase destruction for fence contexts:
- prepare_release/prepare_destroy: Stop callbacks, timers, and signal all pending fences before drm_gem_object_release()
- free/destroy: Release NVKMS resources and free memory after the GEM object is fully released

This ensures fences are detached from dma_resv before the kernel can no longer safely access them.
Changes
nv-vm.c
- Move shrinker unregistration to the start of nv_mem_pool_destroy()
- Add synchronize_rcu() after shrinker_free()/unregister_shrinker()

nvidia-drm-gem.h/c
- Add prepare_release callback to nv_drm_gem_object_funcs
- Call prepare_release before drm_gem_object_release() in nv_drm_gem_free()

nvidia-drm-fence.c
- Add prepare_destroy callback to nv_drm_fence_context_ops
- Split __nv_drm_prime_fence_context_destroy() into prepare/destroy phases
- Split __nv_drm_semsurf_fence_ctx_destroy() into prepare/destroy phases
- Add __nv_drm_fence_context_gem_prepare_release() to call prepare phase

Testing
- kswapd path through nvidia shrinker/fence callbacks

Impact
These bugs affect all users of nvidia-open kernel modules under memory pressure. Symptoms include:
- Kernel crashes in shrink_slab() or drm_exec paths

The fixes follow established kernel conventions: unregister/stop callbacks before freeing the resources they access.
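The two-phase fence context destruction from Issue 2 follows the same convention and can be sketched as a userspace model. Everything here is an assumption made for illustration: `struct fence_ctx`, `struct gem_object`, `gem_create()`, `gem_release()`, `gem_free()`, `fence_ctx_prepare_destroy()` and `fence_ctx_destroy()` are hypothetical names, not the driver's functions. `resv_valid` stands in for the lifetime of dma_resv and `hw_resources` for the NVKMS resources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a fence context attached to a GEM object. */
struct fence_ctx {
    bool callbacks_active;      /* callbacks/timers that may still fire */
    int pending_fences;         /* fences not yet signaled */
    void *hw_resources;         /* stands in for NVKMS resources */
};

struct gem_object {
    struct fence_ctx *ctx;
    bool resv_valid;            /* stands in for the lifetime of dma_resv */
};

static struct gem_object *gem_create(int pending_fences)
{
    struct gem_object *obj = calloc(1, sizeof(*obj));
    assert(obj != NULL);
    obj->ctx = calloc(1, sizeof(*obj->ctx));
    assert(obj->ctx != NULL);
    obj->ctx->callbacks_active = true;
    obj->ctx->pending_fences = pending_fences;
    obj->ctx->hw_resources = malloc(16);
    assert(obj->ctx->hw_resources != NULL);
    obj->resv_valid = true;
    return obj;
}

/* Phase 1: stop callbacks and signal every pending fence. Idempotent. */
static void fence_ctx_prepare_destroy(struct fence_ctx *ctx)
{
    ctx->callbacks_active = false;
    ctx->pending_fences = 0;
}

/* Phase 2: release resources; only legal after phase 1 has run. */
static void fence_ctx_destroy(struct fence_ctx *ctx)
{
    assert(!ctx->callbacks_active && ctx->pending_fences == 0);
    free(ctx->hw_resources);
    free(ctx);
}

/* Models the release step: no unsignaled fence may outlive resv. */
static void gem_release(struct gem_object *obj)
{
    assert(obj->ctx->pending_fences == 0);
    obj->resv_valid = false;
}

static void gem_free(struct gem_object *obj)
{
    fence_ctx_prepare_destroy(obj->ctx);   /* prepare: quiesce and signal */
    gem_release(obj);                      /* resv goes away with no live fences */
    fence_ctx_destroy(obj->ctx);           /* final: free remaining resources */
    obj->ctx = NULL;
    free(obj);
}
```

The assertion in `gem_release()` encodes the ordering requirement: calling it before the prepare phase would trip the check in this model, which corresponds to the window the patch closes.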
References
- include/linux/shrinker.h
- drivers/gpu/drm/drm_gem.c
- kernel-open/nvidia/nv-vm.c
- kernel-open/nvidia-drm/nvidia-drm-gem.c
- kernel-open/nvidia-drm/nvidia-drm-gem.h
- kernel-open/nvidia-drm/nvidia-drm-fence.c