
Conversation

@sithhell
Contributor

@sithhell sithhell commented Jan 8, 2026

The changes introduced in #1683 (changing run_loop from a condition variable to atomics) introduced a stack-use-after-free when task completion happens on a different thread: `run_loop::finish` might cause `sync_wait_t::apply_sender` to return before the noop task has been scheduled, ending the lifetime of the queue before the noop task push could be finalized:

  • Thread 1 sets __finishing_ to true.
  • Thread 2 observes __finishing_ is true while __noop_task hasn't been pushed yet, finishes successfully and returns.
  • Thread 1 tries to push __noop_task to the queue, which is now out of scope.

To avoid this situation, the sync_wait state is additionally synchronized on `finish` returning successfully, avoiding the aforementioned race condition.
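
To make the intended ordering concrete, here is a minimal sketch of the scheme (hypothetical names such as __sync_wait_state, __complete and __drive; an illustration, not the actual stdexec code): the completing thread publishes __done_ only after run_loop::finish() has returned, and the waiting thread lets the loop and state go out of scope only after observing __done_.

```cpp
#include <atomic>

// Hypothetical state owned by the thread calling sync_wait (lives on its stack).
struct __sync_wait_state {
  std::atomic<bool> __done_{false};
};

// Runs on the thread that completes the operation.
template <class _RunLoop>
void __complete(_RunLoop& __loop, __sync_wait_state& __state) noexcept {
  __loop.finish();  // may still push the noop task into the queue
  // Only after finish() has fully returned do we allow the waiter to proceed.
  __state.__done_.store(true, std::memory_order_release);
  __state.__done_.notify_one();
}

// Runs on the thread calling sync_wait.
template <class _RunLoop>
void __drive(_RunLoop& __loop, __sync_wait_state& __state) noexcept {
  __loop.run();  // drains the queue until finish() has been requested
  // Block until the completing thread is done touching the queue.
  __state.__done_.wait(false, std::memory_order_acquire);
}  // only now may the caller end the lifetime of __loop and __state
```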

@copy-pr-bot

copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.



```cpp
    void wait() noexcept {
      // Account for spurious wakeups
      while (!__done_.load(stdexec::__std::memory_order_acquire)) {
```
Contributor

There's still a use-after-free bug: a background thread can store true to __done_, and then, before it calls notify, the waiter thread may see __done_ is true and break (say, after a spurious wakeup, or because the wait never reached the first iteration of the loop); sync_wait then returns and the state is destroyed.
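
A minimal standalone sketch of that window (hypothetical code, not the stdexec implementation), with the problematic interleaving spelled out in comments:

```cpp
#include <atomic>
#include <thread>

struct __state {
  std::atomic<bool> __done_{false};

  void complete() noexcept {
    __done_.store(true, std::memory_order_release);
    // If the waiter observes __done_ == true between the store above and the
    // notify below (spurious wakeup, or it never blocked at all), it may have
    // already destroyed *this -- the notify then touches freed stack memory.
    __done_.notify_one();
  }

  void wait() noexcept {
    __done_.wait(false, std::memory_order_acquire);
  }
};

int main() {
  std::thread worker;
  {
    __state st;  // on the waiting thread's stack, like the sync_wait state
    worker = std::thread([&st] { st.complete(); });
    st.wait();
  }  // st is destroyed here; worker may still be inside notify_one()
  worker.join();
}
```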

Btw, I don't think you need to loop here .. std::atomic::wait only returns when the value has changed, accounting for spurious wakeups (it has a loop internally).

I was able to detect this with Relacy, which I am in the process of updating to work with stdexec. The test is https://github.com/NVIDIA/stdexec/compare/main...ccotter:stdexec:sync-wait-rrd-bug?expand=1

To run the test, follow these instructions, except use this branch instead of relacy's main branch: dvyukov/relacy#33

Contributor

I'm actually not sure how this can be addressed. In code like

```cpp
exec::static_thread_pool pool{1};
{
  auto s = ex::schedule(pool.get_scheduler()) | ex::then([] { return 0; });
  ex::sync_wait(s);
}
```

when the background thread notifies the foreground thread to wake up and unblock, the communication channel (condition variable, atomic, or something else) may be destroyed once the sync_wait objects on the stack created by the foreground thread go out of scope. Whether the notifier uses done.store(true); done.notify_one(); or done = true; cv.notify_one();, I'm not sure how to ensure the foreground thread will wake up and continue, but only after the background thread has finished doing the notification...

Contributor

@lewissbaker - wondering if you might be able to shed some light on this .. I seem to have hit a mental dead end in not understanding how it's possible to have a sync_wait() without allocating the internal run_loop on the heap and giving a strong reference to the receiver to guarantee lifetime.

Contributor Author

@ccotter The receiver is the one allocating the shared state (currently on the stack).
In fact, the code you mentioned above is very similar to the code where I detected the bug this fix is for (see here: https://github.com/olympus-robotics/hephaestus/blob/main/modules/concurrency/tests/context_tests.cpp#L96).
The solution I came up with should actually be safe. I don't have a full proof, but I ran my test case around a million times.

Let me try to explain why the situation you are describing should be safe:

  • Both the state and the receiver are owned by the thread calling sync_wait.
  • The execution context on a different thread merely holds a reference to the shared state.

The execution context signals completion by performing the following steps (in a strict "happens before" order):
1.1. Set the finishing flag of the run loop to true.
1.2. (optional) Signal the run loop.
1.3. Set done to true.
1.4. Signal done.

The thread waiting on completion (driving the run loop) performs the following steps:
2.1. If finishing is true, go to 2.5.
2.2. Wait for a new task.
2.3. Execute the task.
2.4. Go to 2.1.
2.5. Wait for done to be set to true.
2.6. Return (ends the lifetime of the state).

Without the 'done' steps, we run into the situation that step 1.2 might access memory that is already invalid, because we reached step 2.6 too early.
Previously, this was ensured by holding the mutex while signalling the condition variable, which had the same effect.
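
A rough sketch of that earlier condition-variable scheme (assumed names, not the removed stdexec code): because the flag is set and the notification issued while the mutex is held, the waiter cannot observe the flag and tear down the state until the notifier has released the lock.

```cpp
#include <condition_variable>
#include <mutex>

struct __cv_state {
  std::mutex __mtx_;
  std::condition_variable __cv_;
  bool __done_ = false;

  // Completing thread: flag and notification both happen under the lock.
  void __complete() {
    std::unique_lock __lk{__mtx_};
    __done_ = true;
    __cv_.notify_one();
  }

  // Waiting thread: it can only see __done_ == true after acquiring the
  // mutex, i.e. after __complete() has finished touching the shared state.
  void __wait() {
    std::unique_lock __lk{__mtx_};
    __cv_.wait(__lk, [this] { return __done_; });
  }
};
```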

I can see how there is still a tiny window which can cause a use after free as you described (even after removing the loop).

I think the only way this can be truly fixed is to introduce a count of 'tasks in flight' for the run_loop (rough sketch below):

  • Each run_loop::__push_back call increments this count.
  • Each run_loop::__execute_all decrements this count by the amount of tasks it handled.
  • run_loop::finish increments the count before setting the finishing flag and decrements it before returning.
  • run_loop::run keeps looping until finishing is set to true and the count has dropped to zero.

I wanted to avoid this, since it sounded expensive and the solution I propose here at least seems to work...
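
A rough sketch of that counter (assumed member names; the elided parts are marked): run() may only return once finish() has been requested and no concurrent __push_back()/finish() call is still touching the loop.

```cpp
#include <atomic>
#include <cstddef>

class __run_loop_base {
  std::atomic<std::size_t> __in_flight_{0};
  std::atomic<bool> __finishing_{false};

 public:
  void __push_back(/* __task* __tsk */) {
    __in_flight_.fetch_add(1, std::memory_order_relaxed);
    // ... enqueue the task and wake up run() ...
  }

  void finish() {
    __in_flight_.fetch_add(1, std::memory_order_relaxed);
    __finishing_.store(true, std::memory_order_release);
    // ... wake up run() ...
    __in_flight_.fetch_sub(1, std::memory_order_release);
  }

  void run() {
    for (;;) {
      // ... pop and execute all queued tasks, decrementing __in_flight_
      //     once per task handled (__execute_all above) ...
      if (__finishing_.load(std::memory_order_acquire)
          && __in_flight_.load(std::memory_order_acquire) == 0) {
        return;  // no push_back()/finish() call is still in flight
      }
      // ... block until new work arrives or finish() is called ...
    }
  }
};
```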

@ccotter ccotter mentioned this pull request Jan 9, 2026
@sithhell
Contributor Author

sithhell commented Jan 9, 2026

> Btw, I don't think you need to loop here .. std::atomic::wait only returns when the value has changed, accounting for spurious wakeups (it has a loop internally).

Thanks for pointing this out!

sithhell added a commit to sithhell/stdexec that referenced this pull request Jan 9, 2026
We need to synchronize returning from `__run_loop_base::run` with
potentially concurrent calls to `__run_loop_base::finish`. This is done
by introducing a counter, ensuring proper completion of all tasks in
flight.

Also see NVIDIA#1742 for additional
information.
@sithhell
Contributor Author

Closing in favor of #1746

@sithhell sithhell closed this Jan 10, 2026
sithhell added a commit to sithhell/stdexec that referenced this pull request Jan 10, 2026
We need to synchronize returning from `__run_loop_base::run` with
potentially concurrent calls to `__run_loop_base::finish`. This is done
by introducing a counter, ensuring proper completion of all tasks in
flight.

Also see NVIDIA#1742 for additional
information.
@sithhell sithhell deleted the fix/sync_wait_stack_use_after_free branch January 11, 2026 08:34