
fix: job retry mechanism not triggering #4961

Merged
npalm merged 20 commits into main from stu/fix_job_retry
Feb 3, 2026

Conversation

@stuartp44
Contributor

This pull request adds comprehensive tests for the retry mechanism in the scaleUp functionality and reintroduces the publishRetryMessage call to the scale-up process. The tests ensure that the retry logic works correctly under various scenarios, such as when jobs are queued, when the maximum number of runners is reached, and when queue checks are disabled.

Testing and Retry Mechanism Enhancements:

  • Added a new test suite "Retry mechanism tests" in scale-up.test.ts to cover scenarios where publishRetryMessage should be called, including: when jobs are queued, when maximum runners are reached, with correct message structure, and when job queue checks are disabled.
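
Purely for illustration, a self-contained vitest-style sketch of the behavior those tests target; maybePublishRetry, the message shape, and the helper signatures are stand-ins invented for this sketch, not the test code added in this PR:

```typescript
// Self-contained sketch (not the PR's actual test file): a simplified stand-in
// for the behavior under test, i.e. a retry message is only published for jobs
// that are still queued. Names and shapes are illustrative assumptions.
import { describe, expect, it, vi } from 'vitest';

type Message = { id: string };

// Simplified stand-in for the scale-up step that decides whether to retry.
async function maybePublishRetry(
  message: Message,
  isJobQueued: (m: Message) => Promise<boolean>,
  publishRetryMessage: (m: Message) => Promise<void>,
): Promise<void> {
  if (await isJobQueued(message)) {
    await publishRetryMessage(message);
  }
}

describe('Retry mechanism (sketch)', () => {
  it('publishes a retry message when the job is still queued', async () => {
    const publishRetryMessage = vi.fn().mockResolvedValue(undefined);

    await maybePublishRetry({ id: 'job-1' }, async () => true, publishRetryMessage);

    expect(publishRetryMessage).toHaveBeenCalledTimes(1);
    expect(publishRetryMessage).toHaveBeenCalledWith({ id: 'job-1' });
  });

  it('does not publish a retry message when the job is no longer queued', async () => {
    const publishRetryMessage = vi.fn().mockResolvedValue(undefined);

    await maybePublishRetry({ id: 'job-2' }, async () => false, publishRetryMessage);

    expect(publishRetryMessage).not.toHaveBeenCalled();
  });
});
```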

Other Code Updates:

  • Fixed the logic to skip runner creation when no new runners are needed by checking newRunners <= 0 instead of comparing counts, improving clarity and correctness (a short sketch follows the example scenarios below).
Example scenarios for the above bug

Scenario 1

  • Admin sets RUNNERS_MAXIMUM_COUNT=20
  • System scales up to 15 active runners
  • Admin reduces RUNNERS_MAXIMUM_COUNT=10 (cost control, policy change)
  • Before those 15 runners terminate, new jobs arrive
  • Bug triggers: newRunners = Math.min(scaleUp, 10-15) = -5
  • Code tries to call createRunners({numberOfRunners: -5}) and fails

Scenario 2

  • RUNNERS_MAXIMUM_COUNT=5
  • Someone manually launches 8 EC2 instances with runner tags
  • New jobs arrive
  • Bug triggers: newRunners = Math.min(2, 5-8) = -3
  • Code tries to call createRunners({numberOfRunners: -3}) and fails
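
A minimal sketch of the arithmetic behind these scenarios and of the guard described above; plannedNewRunners and the inline numbers are illustrative, not the actual function in scale-up.ts:

```typescript
// Illustrative only: how newRunners can go negative and why the fixed guard skips creation.
function plannedNewRunners(scaleUp: number, maximumCount: number, currentCount: number): number {
  // Scenario 1: Math.min(scaleUp, 10 - 15) === -5 when the maximum was lowered
  // below the number of runners that are still alive.
  return Math.min(scaleUp, maximumCount - currentCount);
}

const newRunners = plannedNewRunners(5, 10, 15); // -5

if (newRunners <= 0) {
  // Fixed behavior: nothing to create, so runner creation is skipped entirely.
  console.log('skip runner creation');
} else {
  // Previously a negative value could reach something like
  // createRunners({ numberOfRunners: newRunners }) and fail.
  console.log(`create ${newRunners} runners`);
}
```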


We tested this in our staging environment and verified it's working.

Closes #4960

@stuartp44 stuartp44 requested a review from a team as a code owner December 18, 2025 10:09
@github-actions
Contributor

github-actions bot commented Dec 18, 2025

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Co-authored-by: Brend Smits <brend.smits@philips.com>
@stuartp44 stuartp44 added the bug Something isn't working label Dec 18, 2025
@npalm npalm requested a review from Copilot December 18, 2025 20:20
Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes a critical bug where the job retry mechanism was not being triggered during the scale-up process. The fix re-introduces the publishRetryMessage call and corrects the logic for skipping runner creation when the maximum runner count is exceeded or when newRunners would be negative.

Key Changes

  • Re-introduced the publishRetryMessage call in the scale-up loop to ensure retry messages are published for queued jobs
  • Fixed the condition for skipping runner creation from missingInstanceCount === scaleUp to newRunners <= 0, preventing attempts to create negative numbers of runners (both key changes are sketched below)
  • Added comprehensive test coverage for the retry mechanism with 7 new test cases covering various scenarios
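
A rough, self-contained sketch of that control flow; processBatch, isJobQueued, createRunners, queuedMessages, rejectedMessageIds, and all signatures are assumptions for illustration that echo naming discussed later in this PR, not the actual module API:

```typescript
// Sketch only: the overall shape of the scale-up loop after this PR.
// All types, names and signatures are illustrative assumptions.
interface ActionRequestMessage {
  id: string;
}

interface Deps {
  isJobQueued: (m: ActionRequestMessage) => Promise<boolean>;
  createRunners: (opts: { numberOfRunners: number }) => Promise<void>;
  publishRetryMessage: (m: ActionRequestMessage) => Promise<void>;
  maximumRunners: number;
  currentRunners: number;
}

export async function processBatch(messages: ActionRequestMessage[], deps: Deps): Promise<string[]> {
  const queuedMessages: ActionRequestMessage[] = [];
  const rejectedMessageIds = new Set<string>();

  for (const message of messages) {
    // Only jobs that are still queued become retry candidates.
    if (await deps.isJobQueued(message)) {
      queuedMessages.push(message);
    }
  }

  // Fixed skip condition: never request a zero or negative number of runners.
  const newRunners = Math.min(queuedMessages.length, deps.maximumRunners - deps.currentRunners);
  if (newRunners > 0) {
    try {
      await deps.createRunners({ numberOfRunners: newRunners });
    } catch {
      // Creation failed: mark the messages as rejected so SQS redelivers them;
      // no extra retry message is published for rejected messages.
      queuedMessages.forEach((m) => rejectedMessageIds.add(m.id));
    }
  }

  // Retry messages are published AFTER runner creation and only for
  // messages that were not marked as rejected.
  for (const message of queuedMessages) {
    if (!rejectedMessageIds.has(message.id)) {
      await deps.publishRetryMessage(message);
    }
  }

  return [...rejectedMessageIds];
}
```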

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File descriptions:

  • lambdas/functions/control-plane/src/scale-runners/scale-up.ts: imports publishRetryMessage, calls it for each queued message, and fixes the skip condition to handle negative newRunners values
  • lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts: adds mock setup for publishRetryMessage and a new test suite with 7 tests covering retry mechanism behavior in various scenarios


Contributor

Copilot AI commented Dec 18, 2025

@npalm I've opened a new pull request, #4966, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 4 commits December 18, 2025 20:54
Per reviewer feedback, publishRetryMessage is now called at the end of the
processing loop for each message that is NOT marked as invalid. This prevents
duplicate retry messages for the same event, since invalid messages already go
back to the SQS queue for retry.

Key changes:
- Track validMessagesForRetry separately from all messages
- Only messages that pass job queued check are added to validMessagesForRetry
- publishRetryMessage is called after runner creation, not before
- Messages marked as invalid (e.g., max runners reached, creation failed) are
  excluded from retry message publishing

Tests updated to reflect new behavior:
- publishRetryMessage is called AFTER runner creation
- Messages marked invalid do not get retry messages published
- All test scenarios updated with proper mock runner creation

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
Address code review feedback:
- Use Set for invalidMessageIds to improve lookup performance from O(n) to O(1)
- Extract duplicate retry message publishing logic into helper function
- Maintain both invalidMessages array (for return value) and Set (for fast lookups)

This improves performance when processing large message batches and makes the
code more maintainable by eliminating duplication.

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
Simplified the retry message logic based on reviewer feedback:
- Renamed `validMessagesForRetry` to `queuedMessages` (clearer intent)
- Renamed `invalidMessageIds` to `rejectedMessageIds` (more accurate)
- Removed `invalidMessages` array duplication - now using only Set
- Removed `publishRetryMessagesForValid` helper function (inline instead)
- Convert Set to array only at return statement

Result: clearer naming, less complexity, same functionality.
Net reduction of 6 lines while maintaining all tests passing.

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
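
A tiny sketch of the design choice described in the two commits above; the values are made up and this is not the PR's literal diff:

```typescript
// Sketch: why rejected message ids are tracked in a Set (illustrative, not the exact code).
const rejectedMessageIds = new Set<string>();
rejectedMessageIds.add('msg-2');

// O(1) membership check per message, instead of Array.prototype.includes (O(n))
// on every iteration of the publishing loop.
const shouldRetry = !rejectedMessageIds.has('msg-1'); // true

// Callers still receive a plain string[]: the Set is converted only at the return statement.
const rejected: string[] = [...rejectedMessageIds];
console.log(shouldRetry, rejected); // true [ 'msg-2' ]
```
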
Remove branch restriction from lambda.yml workflow to support stacked PRs
and feature branch workflows. The workflow will now run on any PR that
modifies lambda code or the workflow itself.

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
Contributor

@iainlane iainlane left a comment


Cheers - right now I just had one request!

Contributor

Copilot AI commented Jan 5, 2026

@stuartp44 I've opened a new pull request, #4980, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
@stuartp44 stuartp44 requested a review from a team as a code owner January 5, 2026 13:09
stuartp44 and others added 4 commits January 5, 2026 14:21
Fix formatting issues identified by prettier:
- Remove trailing whitespace in scale-up.test.ts
- Format .prettierrc and eslint.config.js
- Format various tsconfig.json and other config files

Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
The merge from stu/fix_job_retry incorrectly reverted the core change
where publishRetryMessage was moved to the end of the loop. This fix
restores the correct behavior:

- Restore queuedMessages.push(message) instead of await publishRetryMessage(message)
- Remove conflicting test assertions from merge (lines expecting both
  publishRetryMessage NOT to be called AND to be called 2 times)
- Reorder test assertions to check listEC2Runners before publishRetryMessage

All tests now pass with the correct behavior where publishRetryMessage
is called AFTER runner creation and only for non-rejected messages.

Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
Per reviewer feedback to keep changes minimal and focused on the
original objective. Reverting:

- Formatting changes to config files (.prettierrc, eslint.config.js,
  tsconfig.json files, nx.json, vitest.base.config.ts)
- Formatting changes to test resource JSON files
- Workflow change to lambda.yml (restoring branches: main restriction)

Only keeping the essential changes:
- lambdas/functions/control-plane/src/scale-runners/scale-up.ts
- lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts

These are the only files needed to fix the original issue of moving
publishRetryMessage to the end of the processing loop to avoid
duplicate retries.

Co-authored-by: Brend-Smits <15904543+Brend-Smits@users.noreply.github.com>
@Brend-Smits Brend-Smits requested a review from npalm January 12, 2026 09:37
Co-authored-by: Brend Smits <brend.smits@philips.com>
stuartp44 and others added 2 commits January 12, 2026 10:37
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
@Brend-Smits
Contributor

Rebased the PR and resolved open conversations. This is now ready to get merged after getting approvals.

@Brend-Smits Brend-Smits requested a review from iainlane January 12, 2026 09:38
@Brend-Smits Brend-Smits self-assigned this Jan 16, 2026
@andrecastro

This is a critical issue for me. After upgrading our infrastructure to v7.3.0, we noticed that jobs are no longer being retried, which is resulting in a growing number of jobs stuck waiting indefinitely.

I understand this is an experimental feature, but it's unclear why this behavior changed. It appears to be a regression and likely a bug.

Could one of the maintainers please review and approve this?

@npalm
Member

npalm commented Jan 30, 2026

I will give this PR priority early next week

Member

@npalm npalm left a comment


@npalm npalm merged commit 5039ae5 into main Feb 3, 2026
11 checks passed
@npalm npalm deleted the stu/fix_job_retry branch February 3, 2026 18:30
npalm pushed a commit that referenced this pull request Feb 4, 2026
🤖 I have created a release *beep* *boop*
---


## [7.4.0](v7.3.0...v7.4.0) (2026-02-04)


### Features

* **control-plane:** tag control plane created SSM Parameters ([#4833](#4833)) ([#4834](#4834)) ([7e1a0a1](7e1a0a1)) @wadherv
* use prefix variable for POWERTOOLS_SERVICE_NAME in Lambda functions ([#4948](#4948)) ([8bd61d2](8bd61d2)) @alexalbu001


### Bug Fixes

* add SSM AMI parameter permissions and environment-based naming ([#5016](#5016)) ([1a7158b](1a7158b))
* job retry mechanism not triggering ([#4961](#4961)) ([5039ae5](5039ae5))
* **lambda:** bump diff from 4.0.2 to 4.0.4 in /lambdas ([#5004](#5004)) ([cd86fe6](cd86fe6))
* **lambda:** bump lodash-es from 4.17.21 to 4.17.23 in /lambdas ([#5006](#5006)) ([c638e38](c638e38))
* **lambda:** bump the aws group in /lambdas with 7 updates ([#4998](#4998)) ([d373bcc](d373bcc))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: runners-releaser[bot] <194412594+runners-releaser[bot]@users.noreply.github.com>

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

job-retry mechanism broken in version 7.0.0

6 participants