
Conversation

@sithhell
Contributor

@sithhell sithhell commented Jan 9, 2026

We need to synchronize returning from __run_loop_base::run with potentially concurrent calls to __run_loop_base::finish. This is done by introducing a counter, ensuring proper completion of all tasks in flight.

Also see #1742 for additional information.
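Roughly, the idea looks like the following. This is a simplified sketch with hypothetical names and a hypothetical task/queue interface, not the actual diff (the real code lives in __run_loop_base and __atomic_intrusive_queue): producers bump a counter before a task becomes visible, run() decrements it once a task has been fully processed, and run() only returns once the loop has been told to finish and the counter has dropped back to zero.

struct counted_run_loop {
  some_intrusive_queue queue_;               // placeholder for the real task queue
  std::atomic<std::size_t> task_count_{0};   // tasks pushed but not yet fully processed
  bool finishing_ = false;                   // only touched from the run() thread

  void push(task* t) {                       // used by schedulers and by finish()
    task_count_.fetch_add(1, std::memory_order_relaxed);
    queue_.push(t);                          // the task becomes visible to run() here
  }

  bool execute_all() {                       // true while run() should keep going
    while (task* t = queue_.pop()) {
      if (!t->execute())                     // the no-op task pushed by finish() reports "stop"
        finishing_ = true;
      task_count_.fetch_sub(1, std::memory_order_release);
    }
    return !finishing_;
  }

  void run() {
    // only return once finish() has been processed *and* no pushed task is still in flight
    while (execute_all() || task_count_.load(std::memory_order_acquire) > 0)
      ;
  }
};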


@sithhell
Contributor Author

sithhell commented Jan 9, 2026

@ccotter This should fix your additional concerns regarding lifetime. Would you mind actually checking it with Relacy for me?

@sithhell
Contributor Author

sithhell commented Jan 9, 2026

After thinking about it, I would actually prefer this over #1742, since it fixes all users of stdexec::run_loop and not just stdexec::sync_wait.

@ccotter, @ericniebler what do you think?

@ccotter
Contributor

ccotter commented Jan 9, 2026

This still ends up being problematic. One race (verified with Relacy, as I described in #1742 (comment)) is that a background thread can push the no-op task at https://github.com/NVIDIA/stdexec/blob/6302fbe/include/stdexec/__detail/__atomic_intrusive_queue.hpp#L44, and then, before it reaches line 49, the sync_wait thread can observe the no-op item, process it, decrement __task_count_ to 0, exit run(), and destroy the state.
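To make the window concrete, here is a heavily simplified illustration (hypothetical code, not the real __atomic_intrusive_queue; in the real code the window sits inside push, between the two lines referenced above):

void finish_like_operation(loop_state* state) {   // hypothetical names throughout
  state->queue.push(&state->noop_task);  // the publish (line 44 above): run() can now pop the
                                         // task, drop __task_count_ to 0, return, and the caller
                                         // can destroy *state...
  state->touch_remaining_state();        // ...before this later access (line 49 above) runs,
                                         // so it becomes a use-after-free
}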

I'm nearly convinced this can't be done with atomics, and that we need to move back to the mutex/cv. It may not be worth dealing with atomics in run_loop. I'm not sure where else it'll be used, but from the sync_wait perspective, it will be used sparingly in programs (in many cases, exactly once in main()).

@sithhell
Contributor Author

sithhell commented Jan 10, 2026 via email

@sithhell
Contributor Author

The latest commit addresses the one remaining use-after-free bug. I ran the Relacy tests myself, which now execute cleanly. I also added the sync_wait test and accounted for the missing support of atomic_ref in Relacy.

@ccotter thanks for the hints!

We need to synchronize returning from `__run_loop_base::run` with
potentially concurrent calls to `__run_loop_base::finish`. This is done
by introducing a counter, ensuring proper completion of all tasks in
flight.

Also see NVIDIA#1742 for additional
information.
- Properly synchronizing `finish` with `run` via task counts
- Adding a sync_wait Relacy test for verification
- Adapting the atomic wrappers to account for missing std::atomic_ref in Relacy
@sithhell sithhell force-pushed the fix/sync_wait_stack_use_after_free_alternative branch from 113e104 to 8baf34b on January 10, 2026 at 13:45
@ccotter
Contributor

ccotter commented Jan 11, 2026

First off, I want to say I admire the exploration of different ways of using atomics to safely implement a race-, deadlock-, and memory-bug-free sync_wait/run_loop. Thanks for adding the Relacy test; I think it would have helped detect the bug when it was introduced in #1683, and I'm glad it was able to help prove the correctness of this latest version. I added #1749 to see if we should add these tests to a more standard workflow, like CI.

Regarding sync_wait usage, I slightly disagree; there are cases where you want to call it in multiple places in the program. It's the primary way to 'return' from the async world. Basically one way to switch from red to blue functions and sometimes essential to interface with other libraries.

True. There are many programs that, I think, will have exactly one sync_wait in main. There will be some programs that need to transfer from red to blue functions, although I don't know whether sync_wait's implementation choice (atomics vs mutex) would dominate those programs' performance - and if it did, I think it would be time for those programs to re-evaluate why they have so many sync_waits.

And yes, I did ignore the fact that run_loop is not just an implementation detail of sync_wait: it's used in static_thread_pool, and it's also a standalone C++26 component which may be used outside of sync_wait/static_thread_pool! We need to consider all of this in how we address the existing bug.

In the most recent version of the fix, I think what can happen now is that run() can end up spinning, waiting for the other thread, in some (rare) cases:

while (__execute_all() || __task_count_.load(__std::memory_order_acquire) > 0)
        ; // After finish() pushes the noop task, but before __task_count_ is decremented,
          // this turns into a spin loop. If the finish() thread is preempted before __task_count_
          // is decremented, the spin will be rather large until the OS reschedules that thread.

If I can try to summarize, taking a myopic view of only addressing sync_wait for the sake of technical discussion, the core of what we are trying to solve simplifies to creating a class with an interface like the one below (I know the actual stdexec::sync_wait is split across __sync_wait/__run_loop/__atomic_intrusive_queue.hpp, with multiple atomics spread across these components):

Goal

struct sync_object {

  // Unblocks the thread waiting on wait(). If no thread is waiting yet, whenever a
  // thread eventually calls wait(), the call returns immediately.
  void signal();

  // Blocks until some other thread invokes signal().
  void wait();

  // For a given sync_object, signal() and wait() should be called 0 or 1 times, but never more than that.
};

void example_usage() {
  sync_object o;
  some_thread_pool.enqueue([&] { /* do some interesting work */ o.signal(); });

  o.wait();
}

There are a few ways we can try to implement this.

Using atomics with wait/notify

struct sync_object {
  std::atomic<bool> is_done{false};

  void signal() { is_done.store(true); /* UAF can occur in the next call */ is_done.notify_one(); }
  void wait() { is_done.wait(false); }
};

The problem with this is use-after-free may occur if the thread pool thread sets is_done to true, then wait sees is_done is true and exits, leading to the sync_object being destroyed, all before the thread pool thread can call is_done.notify_one().
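Spelled out, one interleaving that triggers it (just the scenario above, written step by step):

// 1. waiter:   constructs sync_object o on its stack, enqueues the work, calls o.wait()
// 2. signaler: is_done.store(true)
// 3. waiter:   is_done.wait(false) returns; example_usage() returns; o (and is_done) is destroyed
// 4. signaler: is_done.notify_one()   <-- touches the destroyed atomic: use-after-free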

Using atomics with spin loop (no wait/notify)

struct sync_object {
  std::atomic<bool> is_done{false};

  void signal() { is_done.store(true); }
  void wait() { while(!is_done.load()) /* spin */; }
};

This doesn't have any use-after-free bugs, but relies on spinning in some cases.

Heap allocated state with atomic wait/notify

struct sync_object {
  std::shared_ptr<std::atomic<bool>> is_done = std::make_shared<std::atomic<bool>>(false);

  void signal() { auto local_is_done = is_done; local_is_done->store(true); local_is_done->notify_one(); }
  void wait() { is_done->wait(false); }
};

This is safe because even if the sync_object goes out of scope on the waiting thread, the signaling thread still has access to the atomic communication variable for as long as it needs it. However, this incurs the cost of a heap allocation plus shared_ptr's reference-counting atomics.

Using mutex/cv

struct sync_object {
  std::mutex mtx;
  std::condition_variable cv;
  bool is_done{false};

  void signal() { std::unique_lock lock{mtx}; is_done = true; cv.notify_one(); }
  void wait() { std::unique_lock lock{mtx}; cv.wait(lock, [&] { return is_done; }); }
};

I believe this is safe in terms of races/deadlocks and memory, with no extra spinning.

Summary

I'm not sure if this explores the full design space, but I think it shows some of the tradeoffs. We know the first version (atomic with wait/notify) has use-after-free bugs that I don't think we can get around, given the inherent nature of store() + notify_one() not being atomic.

The spin loop version is memory safe, but has the unfortunate spinning. Then we get to the heap solution, which is safe at the cost of the heap allocation plus the shared_ptr's atomics.

For some cases, like sync_wait using run_loop, a mutex/cv is probably no worse than atomics, since there will be almost no contention (and on some platforms, no additional kernel calls compared to atomics). But for longer-lived / contended use cases, like static_thread_pool (or others, given run_loop is a public C++26 class), I wonder (without providing much more to go on) whether we could make finish() specifically use a mutex/cv, without the core push/pop operations - which may be numerous and contended, as in static_thread_pool - taking the mutex while the run_loop is not shutting down...
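To make that concrete, here is a rough sketch of what such a hybrid might look like (purely hypothetical, not a proposed patch: the queue type and task interface are placeholders, the usual <mutex>/<condition_variable> headers are assumed, and the point is only that the mutex/cv confine themselves to the sleep/shutdown handshake while push/pop stay lock-free on the hot path):

struct hybrid_run_loop {
  lockfree_task_queue queue_;   // hypothetical lock-free MPSC queue; push() returns true
                                // when it flipped the queue from empty to non-empty
  std::mutex mtx_;
  std::condition_variable cv_;
  bool finishing_ = false;      // protected by mtx_

  void push(task* t) {          // hot path: no mutex unless run() may be sleeping
    if (queue_.push(t)) {
      std::lock_guard lock{mtx_};   // empty -> non-empty: run() may be about to sleep
      cv_.notify_one();             // notify under the lock so run() cannot return and
    }                               // destroy the cv between our push and this call
  }

  void finish() {
    std::lock_guard lock{mtx_};
    finishing_ = true;
    cv_.notify_one();               // also under the lock, for the same reason
  }

  void run() {
    for (;;) {
      while (task* t = queue_.pop()) t->execute();
      std::unique_lock lock{mtx_};
      if (!queue_.empty()) continue;   // re-check under the lock before sleeping
      if (finishing_) return;          // drained and shutting down
      cv_.wait(lock);                  // sleep until new work or finish() arrives
    }
  }
};

The detail that matters is notifying while holding the lock in push() and finish(): that is what avoids the store()/notify_one() use-after-free from the first variant, at the price of taking the mutex on the empty-to-non-empty transition and at shutdown.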

Anyway, this has been an interesting deep dive!
