Conversation
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
Co-authored-by: Brend Smits <brend.smits@philips.com>
Pull request overview
This pull request fixes a critical bug where the job retry mechanism was not being triggered during the scale-up process. The fix re-introduces the publishRetryMessage call and corrects the logic for skipping runner creation when the maximum runner count is exceeded or when newRunners would be negative.
Key Changes
- Re-introduced the `publishRetryMessage` call in the scale-up loop to ensure retry messages are published for queued jobs
- Fixed the condition for skipping runner creation from `missingInstanceCount === scaleUp` to `newRunners <= 0`, preventing attempts to create a negative number of runners (see the sketch below)
- Added comprehensive test coverage for the retry mechanism with 7 new test cases covering various scenarios
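To make the second change concrete, here is a minimal sketch of the corrected guard. Everything except the `newRunners <= 0` check is an illustrative stand-in, not the actual scale-up.ts code:

```typescript
// Illustrative sketch only: how many runners may still be created, and when to skip creation.
function computeNewRunners(currentRunnerCount: number, maximumRunners: number, requested: number): number {
  // Can go negative when the pool is already over the configured maximum.
  return Math.min(requested, maximumRunners - currentRunnerCount);
}

function shouldSkipRunnerCreation(newRunners: number): boolean {
  // The old check compared counts (missingInstanceCount === scaleUp) and missed the case
  // where newRunners is negative; <= 0 covers both "nothing to do" and "over capacity".
  return newRunners <= 0;
}

// Example: 9 active runners, maximum 10, 3 requested  -> newRunners = 1  -> create 1.
//          12 active runners, maximum 10, 3 requested -> newRunners = -2 -> skip creation.
```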
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Imports `publishRetryMessage`, calls it for each queued message, and fixes the skip condition to handle negative `newRunners` values |
| lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts | Adds mock setup for `publishRetryMessage` and a new test suite with 7 tests covering retry mechanism behavior in various scenarios |
Per reviewer feedback, publishRetryMessage is now called at the end of the processing loop for each message that is NOT marked as invalid. This prevents duplicate retry messages for the same event, since invalid messages already go back to the SQS queue for retry.

Key changes:
- Track validMessagesForRetry separately from all messages
- Only messages that pass the job queued check are added to validMessagesForRetry
- publishRetryMessage is called after runner creation, not before
- Messages marked as invalid (e.g., max runners reached, creation failed) are excluded from retry message publishing

Tests updated to reflect the new behavior:
- publishRetryMessage is called AFTER runner creation
- Messages marked invalid do not get retry messages published
- All test scenarios updated with proper mock runner creation

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
Address code review feedback:
- Use a Set for invalidMessageIds to improve lookup performance from O(n) to O(1)
- Extract the duplicated retry message publishing logic into a helper function
- Maintain both the invalidMessages array (for the return value) and the Set (for fast lookups)

This improves performance when processing large message batches and makes the code more maintainable by eliminating duplication.

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
Simplified the retry message logic based on reviewer feedback:
- Renamed `validMessagesForRetry` to `queuedMessages` (clearer intent)
- Renamed `invalidMessageIds` to `rejectedMessageIds` (more accurate)
- Removed the `invalidMessages` array duplication; now using only the Set
- Removed the `publishRetryMessagesForValid` helper function (inlined instead)
- Convert the Set to an array only at the return statement

Result: clearer naming, less complexity, same functionality. Net reduction of 6 lines while keeping all tests passing.

Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
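Putting the commits above together, a rough sketch of the resulting loop shape might look like the following. The message type and helpers are simplified placeholders (the real ones live in scale-up.ts and its imports); only the ordering and the `queuedMessages` / `rejectedMessageIds` naming reflect the change being described:

```typescript
// Simplified placeholders; the real implementations query GitHub, launch EC2
// instances, and publish to SQS.
interface ActionRequestMessage {
  id: string;
}

async function isJobQueued(_message: ActionRequestMessage): Promise<boolean> {
  return true; // placeholder
}

async function createRunners(count: number): Promise<void> {
  console.log(`creating ${count} runner(s)`); // placeholder
}

async function publishRetryMessage(message: ActionRequestMessage): Promise<void> {
  console.log(`publishing retry for ${message.id}`); // placeholder
}

async function processBatch(messages: ActionRequestMessage[], newRunners: number): Promise<string[]> {
  const queuedMessages: ActionRequestMessage[] = [];
  const rejectedMessageIds = new Set<string>();

  // Only messages whose job is still queued are candidates for a retry message.
  for (const message of messages) {
    if (await isJobQueued(message)) {
      queuedMessages.push(message);
    } else {
      rejectedMessageIds.add(message.id);
    }
  }

  if (newRunners > 0) {
    try {
      await createRunners(newRunners);
    } catch {
      // Creation failed: mark the affected messages as rejected so no retry
      // message is published here; they go back to the SQS queue instead.
      for (const message of queuedMessages) {
        rejectedMessageIds.add(message.id);
      }
    }
  } else {
    // Maximum runners reached (or nothing to create): same treatment.
    for (const message of queuedMessages) {
      rejectedMessageIds.add(message.id);
    }
  }

  // publishRetryMessage runs after runner creation and only for messages that
  // were not rejected, which avoids duplicate retries for the same event.
  for (const message of queuedMessages) {
    if (!rejectedMessageIds.has(message.id)) {
      await publishRetryMessage(message);
    }
  }

  // The Set is converted to an array only at the return statement.
  return Array.from(rejectedMessageIds);
}
```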
Remove branch restriction from lambda.yml workflow to support stacked PRs and feature branch workflows. The workflow will now run on any PR that modifies lambda code or the workflow itself. Co-authored-by: npalm <11609620+npalm@users.noreply.github.com>
iainlane left a comment:
Cheers - right now I just had one request!
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@stuartp44 I've opened a new pull request, #4980, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
Fix formatting issues identified by prettier:
- Remove trailing whitespace in scale-up.test.ts
- Format .prettierrc and eslint.config.js
- Format various tsconfig.json and other config files

Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
The merge from stu/fix_job_retry incorrectly reverted the core change where publishRetryMessage was moved to the end of the loop. This fix restores the correct behavior:
- Restore queuedMessages.push(message) instead of await publishRetryMessage(message)
- Remove conflicting test assertions from the merge (lines expecting publishRetryMessage both NOT to be called and to be called 2 times)
- Reorder test assertions to check listEC2Runners before publishRetryMessage

All tests now pass with the correct behavior, where publishRetryMessage is called AFTER runner creation and only for non-rejected messages.

Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
Per reviewer feedback to keep changes minimal and focused on the original objective, reverting:
- Formatting changes to config files (.prettierrc, eslint.config.js, tsconfig.json files, nx.json, vitest.base.config.ts)
- Formatting changes to test resource JSON files
- The workflow change to lambda.yml (restoring the branches: main restriction)

Only keeping the essential changes:
- lambdas/functions/control-plane/src/scale-runners/scale-up.ts
- lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts

These are the only files needed to fix the original issue of moving publishRetryMessage to the end of the processing loop to avoid duplicate retries.

Co-authored-by: Brend-Smits <15904543+Brend-Smits@users.noreply.github.com>
Co-authored-by: Brend Smits <brend.smits@philips.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
Rebased the PR and resolved the open conversations. This is now ready to be merged once approved.
This is a critical issue for me. After upgrading our infrastructure to v7.3.0, we noticed that jobs are no longer being retried, which is resulting in a growing number of jobs stuck waiting indefinitely. I understand this is an experimental feature, but it’s unclear why this behavior changed. It appears to be a regression and likely a bug. Could one of the maintainers please review and approve this?
I will give this PR priority early next week |
🤖 I have created a release *beep* *boop*

---

## [7.4.0](v7.3.0...v7.4.0) (2026-02-04)

### Features

* **control-plane:** tag control plane created SSM Parameters ([#4833](#4833)) ([#4834](#4834)) ([7e1a0a1](7e1a0a1)) @wadherv
* use prefix variable for POWERTOOLS_SERVICE_NAME in Lambda functions ([#4948](#4948)) ([8bd61d2](8bd61d2)) @alexalbu001

### Bug Fixes

* add SSM AMI parameter permissions and environment-based naming ([#5016](#5016)) ([1a7158b](1a7158b))
* job retry mechanism not triggering ([#4961](#4961)) ([5039ae5](5039ae5))
* **lambda:** bump diff from 4.0.2 to 4.0.4 in /lambdas ([#5004](#5004)) ([cd86fe6](cd86fe6))
* **lambda:** bump lodash-es from 4.17.21 to 4.17.23 in /lambdas ([#5006](#5006)) ([c638e38](c638e38))
* **lambda:** bump the aws group in /lambdas with 7 updates ([#4998](#4998)) ([d373bcc](d373bcc))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: runners-releaser[bot] <194412594+runners-releaser[bot]@users.noreply.github.com>
This pull request adds comprehensive tests for the retry mechanism in the `scaleUp` functionality and reintroduces the `publishRetryMessage` call to the scale-up process. The tests ensure that the retry logic works correctly under various scenarios, such as when jobs are queued, when the maximum number of runners is reached, and when queue checks are disabled.

Testing and Retry Mechanism Enhancements:
- New tests in `scale-up.test.ts` cover scenarios where `publishRetryMessage` should be called, including: when jobs are queued, when maximum runners are reached, with the correct message structure, and when job queue checks are disabled.

Other code updates:
- Runner creation is now skipped when `newRunners <= 0` instead of comparing counts, improving clarity and correctness.

Example scenarios for the above bug
Scenario 1
Scenario 2
Scenario 3
We tested this in our staging environment and verified it's working.
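As an illustration of the test behavior described above, here is a stripped-down vitest sketch of one assertion; the mock wiring and the `scaleUp` stand-in are placeholders rather than the actual fixtures from scale-up.test.ts:

```typescript
import { describe, expect, it, vi } from 'vitest';

// Placeholder mock; the real suite mocks the module that exports publishRetryMessage.
const publishRetryMessage = vi.fn(async (_message: unknown) => {});

// Placeholder for the function under test; the real suite imports scaleUp and
// mocks the GitHub and EC2 interactions around it.
async function scaleUp(message: { id: string; eventType: string }): Promise<void> {
  // ...runner creation would happen here...
  await publishRetryMessage(message);
}

describe('retry mechanism (sketch)', () => {
  it('publishes a retry message after runner creation for a queued job', async () => {
    const message = { id: '1', eventType: 'workflow_job' };

    await scaleUp(message);

    // The behavior under test: publishRetryMessage is called once, after
    // creation, with the original message.
    expect(publishRetryMessage).toHaveBeenCalledTimes(1);
    expect(publishRetryMessage).toHaveBeenCalledWith(message);
  });
});
```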
Closes #4960