🌱 Remove ephemeral aspects of network errors #2507
openshift-merge-bot[bot] merged 1 commit into operator-framework:main
Conversation
Pull request overview
This PR introduces a network-error sanitization utility to remove ephemeral details (notably changing source ports) from error messages so status condition messages remain stable across retries, reducing unnecessary downstream reconciliations.
Changes:
- Added SanitizeNetworkError helper for producing stable, sanitized network error messages (a rough sketch of such a helper follows this list).
- Updated ClusterExtension and ClusterCatalog Progressing condition messages to use the sanitizer.
- Added unit tests for sanitization behavior and controller condition-message expectations.
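For context, here is a minimal sketch of what a sanitizer along these lines could look like. This is not the PR's actual implementation; the function name comes from the change list above, but the field handling and message formats are assumptions:

```go
package errorutil

import (
	"errors"
	"fmt"
	"net"
)

// SanitizeNetworkError rebuilds DNS and socket errors from their stable
// fields so that ephemeral details (such as the local source port chosen
// for each retry) do not leak into status condition messages.
// Sketch only: the real helper in the PR may differ.
func SanitizeNetworkError(err error) error {
	if err == nil {
		return nil
	}

	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		// Keep only the operation and the name being resolved; drop the
		// wrapped socket detail, which carries the changing source port.
		return fmt.Errorf("lookup %s: %s", dnsErr.Name, dnsReason(dnsErr))
	}

	var opErr *net.OpError
	if errors.As(err, &opErr) {
		// Report the remote address only; opErr.Source holds the ephemeral
		// local address and is intentionally omitted.
		if opErr.Addr != nil {
			return fmt.Errorf("%s %s %s: %v", opErr.Op, opErr.Net, opErr.Addr, opErr.Err)
		}
		return fmt.Errorf("%s %s: %v", opErr.Op, opErr.Net, opErr.Err)
	}

	return err
}

// dnsReason maps a DNSError onto a stable, human-readable cause.
func dnsReason(err *net.DNSError) string {
	switch {
	case err.IsNotFound:
		return "no such host"
	case err.IsTimeout:
		return "i/o timeout"
	default:
		return "temporary failure"
	}
}
```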
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/shared/util/error/sanitize.go | New helper to sanitize net.DNSError / net.OpError messages for retry stability. |
| internal/shared/util/error/sanitize_test.go | New unit tests covering sanitizer behavior and consistency across varying ephemeral ports. |
| internal/operator-controller/controllers/common_controller.go | Uses sanitized messages for ClusterExtension Progressing condition (after unwrapping terminal errors). |
| internal/operator-controller/controllers/common_controller_test.go | Adds cases asserting sanitized network/DNS error messages in Progressing condition. |
| internal/catalogd/controllers/core/clustercatalog_controller.go | Uses sanitized messages for ClusterCatalog Progressing condition. |
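The two controller rows above amount to swapping the raw error message for the sanitized one when the Progressing condition is written. A hypothetical sketch of that wiring (the helper name, reason strings, and import path are illustrative assumptions, not the PR's actual code):

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	errorutil "example.com/placeholder/internal/shared/util/error" // import path assumed
)

// setProgressingCondition is an illustrative helper, not the PR's code.
func setProgressingCondition(conditions *[]metav1.Condition, generation int64, err error) {
	cond := metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Succeeded",
		Message:            "desired state reached",
		ObservedGeneration: generation,
	}
	if err != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "Retrying"
		// Sanitizing keeps the message byte-for-byte identical across
		// retries, so the condition only changes when something real changed.
		cond.Message = errorutil.SanitizeNetworkError(err).Error()
	}
	meta.SetStatusCondition(conditions, cond)
}
```

With a stable message, the resulting condition is identical between retries, the subsequent status update becomes a no-op, and watchers of the resource are not re-triggered.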
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #2507 +/- ##
==========================================
- Coverage 73.23% 73.20% -0.04%
==========================================
Files 102 103 +1
Lines 8505 8509 +4
==========================================
Hits 6229 6229
+ Misses 1801 1800 -1
- Partials 475 480 +5
Force-pushed 2890347 to b7dbac8
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
internal/operator-controller/controllers/common_controller_test.go (outdated, resolved)
internal/operator-controller/controllers/common_controller_test.go (outdated, resolved)
Force-pushed b7dbac8 to ea4e5b4
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
I would suggest that we do not put ClusterExtensions that have reached ProgressDeadlineTimeout back on the reconcile queue when there is a ClusterCatalog change.
I think this is a separate issue. The main issue here is that we are updating the cluster catalog even though there's no real change in state, and this is triggering unnecessary reconciliation on all CEs (independently of their underlying state). On the ProgressionDeadlineTimeout side, we need to think about the different options for getting the CE unstuck. I was thinking over ShiftWeek that we may want to move the progression deadline check to the CE controller, since there are other errors that we might want to stop retrying after a while that aren't to do with revisions, e.g. bundle download/unpack issues, resolution issues, etc. In some cases it may make sense to let an exhausted CE reconcile on catalog changes - maybe something like a typo in the bundle image URL that gets fixed in a subsequent catalog update. Although, I'd say that if that doesn't solve the problem, you'd still be in the ProgressionDeadlineTimeout state.
I would say this PR is 🌱 not a bug fix - we are dropping unnecessary reconciliations, not correcting system behavior - so it is a nice improvement.
Signed-off-by: Per Goncalves da Silva <pegoncal@redhat.com>
Force-pushed ea4e5b4 to 914cdce
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: camilamacedo86.
/label qe-approved
Merged 1fd37fa into operator-framework:main
Description
When there's a network error, e.g. the registry is not found, successive error messages can change because the source port changes between retries. This causes unnecessary updates to the resource and triggers reconciliation of downstream resources.
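To make the failure mode concrete, here is a small, self-contained illustration; the hostname, DNS server address, and port numbers are made up, and only the 10.244.0.8 pod IP is taken from the observation below:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Simulate two consecutive retries of the same failing lookup. Only the
	// ephemeral local port inside the wrapped socket detail differs.
	mkErr := func(localPort int) error {
		return &net.DNSError{
			Name:      "registry.example.com",
			IsTimeout: true,
			Err: fmt.Sprintf("read udp 10.244.0.8:%d->10.96.0.10:53: i/o timeout",
				localPort),
		}
	}
	first, second := mkErr(47321), mkErr(52904)

	// The raw messages differ, so copying them into the Progressing condition
	// forces a status update on every retry even though nothing meaningful
	// changed.
	fmt.Println(first.Error() == second.Error()) // false

	// A sanitizer that keeps only the stable fields (operation + name) yields
	// the same message both times, so the condition is written only once.
}
```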
I noticed this error when observing that the ClusterExtension controller was being triggered every 10 seconds even though it had reached a ProgressionDeadlineTimeout terminal state. The issue was that one of the ClusterCatalogs was pointing to a registry that did not exist and the periodic change to the error message triggered reconciliation on the ClusterExtensions.
I would see a Progressing message on the catalog, and every 10 seconds I would see the same message again but with 10.244.0.8 using a different port.
Reviewer Checklist