
fix: job retry mechanism not triggering #4961

Merged
npalm merged 20 commits into main from stu/fix_job_retry
Feb 3, 2026

Conversation

@stuartp44
Contributor

This pull request adds comprehensive tests for the retry mechanism in the scaleUp functionality and reintroduces the publishRetryMessage call to the scale-up process. The tests ensure that the retry logic works correctly under various scenarios, such as when jobs are queued, when the maximum number of runners is reached, and when queue checks are disabled.

Testing and Retry Mechanism Enhancements:

  • Added a new test suite "Retry mechanism tests" in scale-up.test.ts to cover scenarios where publishRetryMessage should be called, including: when jobs are queued, when maximum runners are reached, with correct message structure, and when job queue checks are disabled.
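
Purely for illustration, a self-contained vitest-style sketch of the behavior those tests target; maybePublishRetry, the message shape, and the helper signatures are stand-ins invented for this sketch, not the test code added in this PR:

```typescript
// Self-contained sketch (not the PR's actual test file): a simplified stand-in
// for the behavior under test, i.e. a retry message is only published for jobs
// that are still queued. Names and shapes are illustrative assumptions.
import { describe, expect, it, vi } from 'vitest';

type Message = { id: string };

// Simplified stand-in for the scale-up step that decides whether to retry.
async function maybePublishRetry(
  message: Message,
  isJobQueued: (m: Message) => Promise<boolean>,
  publishRetryMessage: (m: Message) => Promise<void>,
): Promise<void> {
  if (await isJobQueued(message)) {
    await publishRetryMessage(message);
  }
}

describe('Retry mechanism (sketch)', () => {
  it('publishes a retry message when the job is still queued', async () => {
    const publishRetryMessage = vi.fn().mockResolvedValue(undefined);

    await maybePublishRetry({ id: 'job-1' }, async () => true, publishRetryMessage);

    expect(publishRetryMessage).toHaveBeenCalledTimes(1);
    expect(publishRetryMessage).toHaveBeenCalledWith({ id: 'job-1' });
  });

  it('does not publish a retry message when the job is no longer queued', async () => {
    const publishRetryMessage = vi.fn().mockResolvedValue(undefined);

    await maybePublishRetry({ id: 'job-2' }, async () => false, publishRetryMessage);

    expect(publishRetryMessage).not.toHaveBeenCalled();
  });
});
```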

Other Code Updates:

  • Fixed the logic to skip runner creation when no new runners are needed by checking newRunners <= 0 instead of comparing counts, improving clarity and correctness (a short sketch follows the example scenarios below).
Example scenarios for the above bug

Scenario 1

  • Admin sets RUNNERS_MAXIMUM_COUNT=20
  • System scales up to 15 active runners
  • Admin reduces RUNNERS_MAXIMUM_COUNT=10 (cost control, policy change)
  • Before those 15 runners terminate, new jobs arrive
  • Bug triggers: newRunners = Math.min(scaleUp, 10-15) = -5
  • Code tries to call createRunners({numberOfRunners: -5}) and fails

Scenario 2

  • RUNNERS_MAXIMUM_COUNT=5
  • Someone manually launches 8 EC2 instances with runner tags
  • New jobs arrive
  • Bug triggers: newRunners = Math.min(2, 5-8) = -3
  • Code tries to call createRunners({numberOfRunners: -3}) and fails
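
A minimal sketch of the arithmetic behind these scenarios and of the guard described above; plannedNewRunners and the inline numbers are illustrative, not the actual function in scale-up.ts:

```typescript
// Illustrative only: how newRunners can go negative and why the fixed guard skips creation.
function plannedNewRunners(scaleUp: number, maximumCount: number, currentCount: number): number {
  // Scenario 1: Math.min(scaleUp, 10 - 15) === -5 when the maximum was lowered
  // below the number of runners that are still alive.
  return Math.min(scaleUp, maximumCount - currentCount);
}

const newRunners = plannedNewRunners(5, 10, 15); // -5

if (newRunners <= 0) {
  // Fixed behavior: nothing to create, so runner creation is skipped entirely.
  console.log('skip runner creation');
} else {
  // Previously a negative value could reach something like
  // createRunners({ numberOfRunners: newRunners }) and fail.
  console.log(`create ${newRunners} runners`);
}
```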


We tested this in our staging environment and verified it's working.

Closes #4960

@stuartp44 stuartp44 requested a review from a team as a code owner December 18, 2025 10:09
@github-actions
Contributor

github-actions bot commented Dec 18, 2025

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Co-authored-by: Brend Smits <brend.smits@philips.com>
@stuartp44 stuartp44 added the bug Something isn't working label Dec 18, 2025
@npalm npalm requested a review from Copilot December 18, 2025 20:20
Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes a critical bug where the job retry mechanism was not being triggered during the scale-up process. The fix re-introduces the publishRetryMessage call and corrects the logic for skipping runner creation when the maximum runner count is exceeded or when newRunners would be negative.

Key Changes

  • Re-introduced the publishRetryMessage call in the scale-up loop to ensure retry messages are published for queued jobs
  • Fixed the condition for skipping runner creation from missingInstanceCount === scaleUp to newRunners <= 0, preventing attempts to create negative numbers of runners (both key changes are sketched below)
  • Added comprehensive test coverage for the retry mechanism with 7 new test cases covering various scenarios
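
A rough, self-contained sketch of that control flow; processBatch, isJobQueued, createRunners, queuedMessages, rejectedMessageIds, and all signatures are assumptions for illustration that echo naming discussed later in this PR, not the actual module API:

```typescript
// Sketch only: the overall shape of the scale-up loop after this PR.
// All types, names and signatures are illustrative assumptions.
interface ActionRequestMessage {
  id: string;
}

interface Deps {
  isJobQueued: (m: ActionRequestMessage) => Promise<boolean>;
  createRunners: (opts: { numberOfRunners: number }) => Promise<void>;
  publishRetryMessage: (m: ActionRequestMessage) => Promise<void>;
  maximumRunners: number;
  currentRunners: number;
}

export async function processBatch(messages: ActionRequestMessage[], deps: Deps): Promise<string[]> {
  const queuedMessages: ActionRequestMessage[] = [];
  const rejectedMessageIds = new Set<string>();

  for (const message of messages) {
    // Only jobs that are still queued become retry candidates.
    if (await deps.isJobQueued(message)) {
      queuedMessages.push(message);
    }
  }

  // Fixed skip condition: never request a zero or negative number of runners.
  const newRunners = Math.min(queuedMessages.length, deps.maximumRunners - deps.currentRunners);
  if (newRunners > 0) {
    try {
      await deps.createRunners({ numberOfRunners: newRunners });
    } catch {
      // Creation failed: mark the messages as rejected so SQS redelivers them;
      // no extra retry message is published for rejected messages.
      queuedMessages.forEach((m) => rejectedMessageIds.add(m.id));
    }
  }

  // Retry messages are published AFTER runner creation and only for
  // messages that were not marked as rejected.
  for (const message of queuedMessages) {
    if (!rejectedMessageIds.has(message.id)) {
      await deps.publishRetryMessage(message);
    }
  }

  return [...rejectedMessageIds];
}
```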

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File descriptions:

  • lambdas/functions/control-plane/src/scale-runners/scale-up.ts: imports publishRetryMessage, calls it for each queued message, and fixes the skip condition to handle negative newRunners values
  • lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts: adds mock setup for publishRetryMessage and a new test suite with 7 tests covering retry mechanism behavior in various scenarios


Contributor

Copilot AI commented Dec 18, 2025

@npalm I've opened a new pull request, #4966, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 4 commits December 18, 2025 20:54
Per reviewer feedback, publishRetryMessage is now called at the end of the
processing loop for each message that is NOT marked as invalid. This prevents
duplicate retry messages for the same event, since invalid messages already go
back to the SQS queue for retry.

Key changes:
- Track validMessagesForRetry separately from all messages
- Only messages that pass job queued check are added to validMessagesForRetry
- publishRetryMessage is called after runner creation, not before
- Messages marked as invalid (e.g., max runners reached, creation failed) are
  excluded from retry message publishing

Tests updated to reflect new behavior:
- publishRetryMessage is called AFTER runner creation
- Messages marked invalid do not get retry messages published
- All test scenarios updated with proper mock runner creation

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
Address code review feedback:
- Use Set for invalidMessageIds to improve lookup performance from O(n) to O(1)
- Extract duplicate retry message publishing logic into helper function
- Maintain both invalidMessages array (for return value) and Set (for fast lookups)

This improves performance when processing large message batches and makes the
code more maintainable by eliminating duplication.

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
Simplified the retry message logic based on reviewer feedback:
- Renamed `validMessagesForRetry` to `queuedMessages` (clearer intent)
- Renamed `invalidMessageIds` to `rejectedMessageIds` (more accurate)
- Removed `invalidMessages` array duplication - now using only Set
- Removed `publishRetryMessagesForValid` helper function (inline instead)
- Convert Set to array only at return statement

Result: clearer naming, less complexity, same functionality.
Net reduction of 6 lines while maintaining all tests passing.

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
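
A tiny sketch of the design choice described in the two commits above; the values are made up and this is not the PR's literal diff:

```typescript
// Sketch: why rejected message ids are tracked in a Set (illustrative, not the exact code).
const rejectedMessageIds = new Set<string>();
rejectedMessageIds.add('msg-2');

// O(1) membership check per message, instead of Array.prototype.includes (O(n))
// on every iteration of the publishing loop.
const shouldRetry = !rejectedMessageIds.has('msg-1'); // true

// Callers still receive a plain string[]: the Set is converted only at the return statement.
const rejected: string[] = [...rejectedMessageIds];
console.log(shouldRetry, rejected); // true [ 'msg-2' ]
```
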
Remove branch restriction from lambda.yml workflow to support stacked PRs
and feature branch workflows. The workflow will now run on any PR that
modifies lambda code or the workflow itself.

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
Contributor

@iainlane iainlane left a comment


Cheers - right now I just had one request!

Contributor

Copilot AI commented Jan 5, 2026

@stuartp44 I've opened a new pull request, #4980, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
@stuartp44 stuartp44 requested a review from a team as a code owner January 5, 2026 13:09
stuartp44 and others added 4 commits January 5, 2026 14:21
Fix formatting issues identified by prettier:
- Remove trailing whitespace in scale-up.test.ts
- Format .prettierrc and eslint.config.js
- Format various tsconfig.json and other config files

Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
The merge from stu/fix_job_retry incorrectly reverted the core change
where publishRetryMessage was moved to the end of the loop. This fix
restores the correct behavior:

- Restore queuedMessages.push(message) instead of await publishRetryMessage(message)
- Remove conflicting test assertions from merge (lines expecting both
  publishRetryMessage NOT to be called AND to be called 2 times)
- Reorder test assertions to check listEC2Runners before publishRetryMessage

All tests now pass with the correct behavior where publishRetryMessage
is called AFTER runner creation and only for non-rejected messages.

Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
Per reviewer feedback to keep changes minimal and focused on the
original objective. Reverting:

- Formatting changes to config files (.prettierrc, eslint.config.js,
  tsconfig.json files, nx.json, vitest.base.config.ts)
- Formatting changes to test resource JSON files
- Workflow change to lambda.yml (restoring branches: main restriction)

Only keeping the essential changes:
- lambdas/functions/control-plane/src/scale-runners/scale-up.ts
- lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts

These are the only files needed to fix the original issue of moving
publishRetryMessage to the end of the processing loop to avoid
duplicate retries.

Co-authored-by: Brend-Smits <15904543+Brend-Smits@users.noreply.github.com>
@Brend-Smits Brend-Smits requested a review from npalm January 12, 2026 09:37
Co-authored-by: Brend Smits <brend.smits@philips.com>
stuartp44 and others added 2 commits January 12, 2026 10:37
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
@Brend-Smits
Contributor

Rebased the PR and resolved open conversations. This is now ready to get merged after getting approvals.

@Brend-Smits Brend-Smits requested a review from iainlane January 12, 2026 09:38
@Brend-Smits Brend-Smits self-assigned this Jan 16, 2026
@andrecastro

This is a critical issue for me. After upgrading our infrastructure to v7.3.0, we noticed that jobs are no longer being retried, which is resulting in a growing number of jobs stuck waiting indefinitely.

I understand this is an experimental feature, but it's unclear why this behavior changed. It appears to be a regression and likely a bug.

Could one of the maintainers please review and approve this?

@npalm
Member

npalm commented Jan 30, 2026

I will give this PR priority early next week

Member

@npalm npalm left a comment


@npalm npalm merged commit 5039ae5 into main Feb 3, 2026
11 checks passed
@npalm npalm deleted the stu/fix_job_retry branch February 3, 2026 18:30
npalm pushed a commit that referenced this pull request Feb 4, 2026
🤖 I have created a release *beep* *boop*
---


## [7.4.0](v7.3.0...v7.4.0) (2026-02-04)


### Features

* **control-plane:** tag control plane created SSM Parameters ([#4833](#4833)) ([#4834](#4834)) ([7e1a0a1](7e1a0a1)) @wadherv
* use prefix variable for POWERTOOLS_SERVICE_NAME in Lambda functions ([#4948](#4948)) ([8bd61d2](8bd61d2)) @alexalbu001


### Bug Fixes

* add SSM AMI parameter permissions and environment-based naming ([#5016](#5016)) ([1a7158b](1a7158b))
* job retry mechanism not triggering ([#4961](#4961)) ([5039ae5](5039ae5))
* **lambda:** bump diff from 4.0.2 to 4.0.4 in /lambdas ([#5004](#5004)) ([cd86fe6](cd86fe6))
* **lambda:** bump lodash-es from 4.17.21 to 4.17.23 in /lambdas ([#5006](#5006)) ([c638e38](c638e38))
* **lambda:** bump the aws group in /lambdas with 7 updates ([#4998](#4998)) ([d373bcc](d373bcc))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: runners-releaser[bot] <194412594+runners-releaser[bot]@users.noreply.github.com>

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

job-retry mechanism broken in version 7.0.0

6 participants