Conversation
- add perf test that compares baseline frees vs frees after ThreadAlloc teardown - cover atexit/static destruction scenario to repro post-finalisation slowdown
The teardown code can be a performance bottleneck when a lot of deallocations occur after a thread begins to be torn down. This code tracks how many operations have occured after the thread has been torndown, and will stop performing the tidying with an exponential back-off. This takes the overhead on the microbenchmark from around 50x to 10x.
75c1fc7 to
bbd7edc
Compare
This improves performance for most cases. The feature is problematic on glibc as there is potential for it to work incorrectly. This disables the feature by default, and can be enabled if someone really cares about it.
The in_use flag on pooled types is used to check for incorrect pool usage. This is not required for correct code. So moving to a Debug feature.
There was a problem hiding this comment.
Pull request overview
This PR is an experiment addressing #809 by reducing overhead in “post-teardown” allocator usage (e.g., static/global destruction), along with a few related performance-oriented tweaks and a new optional fork-protection build flag.
Changes:
- Add a new perf micro-benchmark to measure deallocation cost before vs. after
ThreadAlloc::teardown(). - Reduce teardown/flush overhead in post-teardown scenarios (amortized flushing) and avoid expensive teardown-time validation work unless a mitigation is enabled.
- Make fork protection (
PreventFork) optional via a new CMake option and add small spin-lock micro-optimizations.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/test/perf/post_teardown/post-teardown.cc | Adds a perf test reproducing #809-like “free after teardown” behavior. |
| src/snmalloc/mem/pooled.h | Lowers overhead of pooled-object “in_use” tracking (debug vs release behavior). |
| src/snmalloc/mem/corealloc.h | Optimizes flush() teardown paths (skip validation unless mitigation; early-return in slab release). |
| src/snmalloc/global/threadalloc.h | Amortizes flush() calls across repeated teardown() invocations. |
| src/snmalloc/ds_aal/prevent_fork.h | Makes PreventFork a no-op unless SNMALLOC_PTHREAD_FORK_PROTECTION is enabled. |
| src/snmalloc/ds_aal/flaglock.h | Adds branch prediction hints to spin-lock acquisition loops. |
| CMakeLists.txt | Adds SNMALLOC_PTHREAD_FORK_PROTECTION option and enforces atfork availability when enabled. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
CMakeLists.txt
Outdated
| add_as_define(SNMALLOC_PTHREAD_FORK_PROTECTION) | ||
| add_as_define(SNMALLOC_PLATFORM_HAS_GETENTROPY) | ||
| add_as_define(SNMALLOC_PTHREAD_ATFORK_WORKS) | ||
| #add_as_define(SNMALLOC_PTHREAD_ATFORK_WORKS) |
There was a problem hiding this comment.
SNMALLOC_PTHREAD_ATFORK_WORKS is still computed via CHECK_CXX_SOURCE_COMPILES, but the compile definition export is commented out. As a result, the C++ code’s #ifdef SNMALLOC_PTHREAD_ATFORK_WORKS branches (e.g., prevent_fork.h’s pthread_atfork registration and src/test/func/protect_fork/protect_fork.cc) will behave as if atfork is unsupported even when CMake detected it. Please restore exporting this macro (or otherwise plumb the result into the preprocessor), especially since SNMALLOC_PTHREAD_FORK_PROTECTION depends on it.
| #add_as_define(SNMALLOC_PTHREAD_ATFORK_WORKS) | |
| add_as_define(SNMALLOC_PTHREAD_ATFORK_WORKS) |
| static void ensure_init() | ||
| { | ||
| #ifdef SNMALLOC_PTHREAD_ATFORK_WORKS | ||
| # ifdef SNMALLOC_PTHREAD_ATFORK_WORKS | ||
| static stl::Atomic<bool> initialised{false}; | ||
|
|
||
| if (initialised.load(stl::memory_order_acquire)) | ||
| return; | ||
|
|
||
| pthread_atfork(prefork, postfork_parent, postfork_child); | ||
| initialised.store(true, stl::memory_order_release); | ||
| #endif | ||
| # endif |
There was a problem hiding this comment.
When SNMALLOC_PTHREAD_FORK_PROTECTION is enabled, ensure_init() only calls pthread_atfork under #ifdef SNMALLOC_PTHREAD_ATFORK_WORKS. If that macro is not consistently defined by the build (and it currently isn’t exported from CMake), fork protection silently becomes a no-op even though the option is ON. Consider making the pthread_atfork path depend solely on SNMALLOC_PTHREAD_FORK_PROTECTION (given the CMake fatal-error guard), or ensure the SNMALLOC_PTHREAD_ATFORK_WORKS define is always provided when detected.
src/snmalloc/ds_aal/prevent_fork.h
Outdated
| } | ||
| }; | ||
| #else | ||
| // The fork protection can cost a lot and it generally not required. |
There was a problem hiding this comment.
Grammar: "it generally not required" should be "it is generally not required".
| // The fork protection can cost a lot and it generally not required. | |
| // The fork protection can cost a lot and it is generally not required. |
| times_teardown_called++; | ||
| if (bits::is_pow2(times_teardown_called) || times_teardown_called < 128) | ||
| alloc->flush(); | ||
| AllocPool<Config>::release(alloc); | ||
| alloc = const_cast<Alloc*>(&default_alloc); |
There was a problem hiding this comment.
ThreadAlloc::teardown() now sometimes releases an allocator back to AllocPool without calling alloc->flush(). Since flush() is what drains fast free lists, processes the message queue, and posts the remote-dealloc cache, skipping it can leave remote frees unposted and memory retained in an allocator that may sit idle in the pool for a long time. If the intent is to amortize teardown cost, consider introducing a cheaper “post-teardown flush” that at least processes/publishes remote deallocations (and/or document why it’s safe to skip a full flush before pooling).
This change improves the performance of post-teardown operations. (See #809)
The main changes are: