
fix: Atomic job reservation to prevent race condition #1399

Merged
dimitri-yatsenko merged 2 commits into master from
fix/job-reserve-race-1398
Feb 17, 2026

@dimitri-yatsenko dimitri-yatsenko commented Feb 17, 2026

Summary

  • Fixes #1398 (Bug: multiple workers reserving the same key in Job.reserve() in 2.1): multiple workers could reserve the same job key simultaneously
  • Replaces the non-atomic SELECT → UPDATE sequence in Job.reserve() with a single atomic UPDATE ... WHERE status='pending', checking cursor.rowcount to determine success
  • Reduces the database round-trips per reservation from three to one
  • Works on both MySQL and PostgreSQL backends

Root cause

The previous reserve() performed a SELECT to check status='pending', then a separate UPDATE matching only on primary key. When concurrent workers (e.g., SLURM array jobs) hit this simultaneously, both SELECTs see status='pending', both UPDATEs succeed (since the WHERE matched only on PK), and both workers proceed to call make() on the same key.
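The interleaving described above can be replayed deterministically. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for the MySQL/PostgreSQL backends, and the `jobs` table, its columns, and the helper names are hypothetical, not DataJoint's actual schema or code.

```python
import sqlite3

# Hypothetical, simplified jobs table (not the real DataJoint schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_key TEXT PRIMARY KEY, status TEXT NOT NULL, owner TEXT)"
)
conn.execute("INSERT INTO jobs VALUES ('k1', 'pending', NULL)")
conn.commit()


def check_pending(conn, job_key):
    """Step 1 of the old pattern: SELECT to see whether the job is pending."""
    row = conn.execute(
        "SELECT status FROM jobs WHERE job_key = ?", (job_key,)
    ).fetchone()
    return row is not None and row[0] == "pending"


def update_by_pk_only(conn, job_key, owner):
    """Step 2 of the old pattern: UPDATE whose WHERE matches only the primary key."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'reserved', owner = ? WHERE job_key = ?",
        (owner, job_key),
    )
    conn.commit()
    return cur.rowcount == 1


# Replay the problematic interleaving: both workers check before either updates.
a_sees_pending = check_pending(conn, "k1")
b_sees_pending = check_pending(conn, "k1")
a_reserved = a_sees_pending and update_by_pk_only(conn, "k1", "worker-a")
b_reserved = b_sees_pending and update_by_pk_only(conn, "k1", "worker-b")

print(a_reserved, b_reserved)  # both workers believe they hold the job
```

Both calls report success because nothing in the second UPDATE's WHERE clause notices that the status has already changed.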

Fix

A single atomic UPDATE includes AND status='pending' AND scheduled_time <= CURRENT_TIMESTAMP(3) in the WHERE clause. Row-level locking serializes concurrent UPDATEs on the same row, so only one of them can still match status='pending'; all others get rowcount=0 and reserve() returns False.
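A minimal sketch of the atomic pattern follows, again using sqlite3 as a stand-in backend. The `jobs` table, its columns, and this `reserve()` helper are hypothetical illustrations; the real Job.reserve() in the PR may differ in its SQL and signature. SQLite's CURRENT_TIMESTAMP takes no precision argument, so the `(3)` from the PR is omitted here.

```python
import sqlite3

# Hypothetical, simplified jobs table (not the real DataJoint schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "job_key TEXT PRIMARY KEY, status TEXT NOT NULL, "
    "scheduled_time TEXT NOT NULL, owner TEXT)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, 'pending', ?, NULL)",
    [("due", "2000-01-01 00:00:00"), ("future", "9999-01-01 00:00:00")],
)
conn.commit()


def reserve(conn, job_key, owner):
    """Try to reserve a job with one UPDATE; return True only if a row changed."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'reserved', owner = ? "
        "WHERE job_key = ? AND status = 'pending' "
        "AND scheduled_time <= CURRENT_TIMESTAMP",
        (owner, job_key),
    )
    conn.commit()
    # rowcount is 1 if this call flipped the row, 0 if the WHERE matched nothing.
    return cur.rowcount == 1


first = reserve(conn, "due", "worker-a")       # row is pending and due
second = reserve(conn, "due", "worker-b")      # row is already reserved
too_early = reserve(conn, "future", "worker-a")  # scheduled in the future
owner = conn.execute(
    "SELECT owner FROM jobs WHERE job_key = 'due'"
).fetchone()[0]
print(first, second, too_early, owner)
```

The check and the write happen in the same statement, so there is no window between them for another worker to slip through, and the caller needs only the row count to know the outcome.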

Test plan

  • Verify existing job reservation tests pass
  • Test with concurrent workers (SLURM or multiprocessing) to confirm only one worker reserves each key
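The concurrent-worker check could be sketched along these lines. This is an assumption-laden illustration rather than the project's test suite: it uses threads and a temporary SQLite file in place of SLURM or multiprocessing workers against MySQL/PostgreSQL, and the table, `reserve()` helper, and worker names are hypothetical. The scheduled_time condition is left out for brevity.

```python
import os
import sqlite3
import tempfile
import threading

N_WORKERS = 8  # arbitrary number of simulated workers


def reserve(conn, job_key, owner):
    """Hypothetical atomic reservation: one UPDATE, success decided by rowcount."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'reserved', owner = ? "
        "WHERE job_key = ? AND status = 'pending'",
        (owner, job_key),
    )
    conn.commit()
    return cur.rowcount == 1


def worker(db_path, job_key, owner, barrier, results):
    # Each worker opens its own connection, as separate processes would.
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        barrier.wait()  # release all workers at roughly the same moment
        results[owner] = reserve(conn, job_key, owner)
    finally:
        conn.close()


def run_trial():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "jobs.db")
        setup = sqlite3.connect(db_path)
        setup.execute(
            "CREATE TABLE jobs "
            "(job_key TEXT PRIMARY KEY, status TEXT NOT NULL, owner TEXT)"
        )
        setup.execute("INSERT INTO jobs VALUES ('k1', 'pending', NULL)")
        setup.commit()
        setup.close()

        barrier = threading.Barrier(N_WORKERS)
        results = {}
        threads = [
            threading.Thread(
                target=worker,
                args=(db_path, "k1", f"worker-{i}", barrier, results),
            )
            for i in range(N_WORKERS)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results


results = run_trial()
winners = [name for name, ok in results.items() if ok]
print(len(winners))  # expected: exactly one worker holds the reservation
```

Which worker wins varies from run to run; the property under test is only that exactly one of them does.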

🤖 Generated with Claude Code

Replace the non-atomic SELECT-then-UPDATE pattern in Job.reserve()
with a single atomic UPDATE that includes status='pending' in the
WHERE clause. Check cursor.rowcount to determine if the reservation
succeeded. This eliminates the race window where multiple workers
could simultaneously reserve the same job.

The previous implementation allowed concurrent workers to both read
status='pending' and then both successfully UPDATE (since the WHERE
matched only on primary key). Now only the first UPDATE succeeds;
all others see rowcount=0 and return False.

Also reduces three database round-trips to one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimitri-yatsenko dimitri-yatsenko merged commit 4a7e1e8 into master Feb 17, 2026
7 checks passed
@dimitri-yatsenko dimitri-yatsenko deleted the fix/job-reserve-race-1398 branch February 17, 2026 19:52