Skip to content

Comments

🌱 Remove ephemeral aspects of network errors#2507

Merged
openshift-merge-bot[bot] merged 1 commit intooperator-framework:mainfrom
perdasilva:bug-catalog-retrigger
Feb 19, 2026
Merged

🌱 Remove ephemeral aspects of network errors#2507
openshift-merge-bot[bot] merged 1 commit intooperator-framework:mainfrom
perdasilva:bug-catalog-retrigger

Conversation

@perdasilva
Copy link
Contributor

Description

When there's a network error, e.g. the registry is not found, successive error message can change due to the source port changing between retries. This causes unnecessary updates to the resource and triggering reconciliation of downstream resources.

I noticed this error when observing that the ClusterExtension controller was being triggered every 10 seconds even though it had reached a ProgressionDeadlineTimeout terminal state. The issue was that one of the ClusterCatalogs was pointing to a registry that did not exist and the periodic change to the error message triggered reconciliation on the ClusterExtensions.

I would see a Progressing message on the catalog like:

lookup docker-registry.operator-controller.svc on 10.96.0.10:53: read udp 10.244.0.8:46753->10.96.0.10:53: i/o timeout

and every 10 seconds I would see the same message but with 10.244.0.8 having a different port

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

Copilot AI review requested due to automatic review settings February 13, 2026 13:41
@netlify
Copy link

netlify bot commented Feb 13, 2026

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 914cdce
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/6996d88e947de30008bdfc59
😎 Deploy Preview https://deploy-preview-2507--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a network-error sanitization utility to remove ephemeral details (notably changing source ports) from error messages so status condition messages remain stable across retries, reducing unnecessary downstream reconciliations.

Changes:

  • Added SanitizeNetworkError helper for producing stable, sanitized network error messages.
  • Updated ClusterExtension and ClusterCatalog Progressing condition messages to use the sanitizer.
  • Added unit tests for sanitization behavior and controller condition-message expectations.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
internal/shared/util/error/sanitize.go New helper to sanitize net.DNSError / net.OpError messages for retry stability.
internal/shared/util/error/sanitize_test.go New unit tests covering sanitizer behavior and consistency across varying ephemeral ports.
internal/operator-controller/controllers/common_controller.go Uses sanitized messages for ClusterExtension Progressing condition (after unwrapping terminal errors).
internal/operator-controller/controllers/common_controller_test.go Adds cases asserting sanitized network/DNS error messages in Progressing condition.
internal/catalogd/controllers/core/clustercatalog_controller.go Uses sanitized messages for ClusterCatalog Progressing condition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link

codecov bot commented Feb 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.20%. Comparing base (fb28936) to head (914cdce).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2507      +/-   ##
==========================================
- Coverage   73.23%   73.20%   -0.04%     
==========================================
  Files         102      103       +1     
  Lines        8505     8509       +4     
==========================================
  Hits         6229     6229              
+ Misses       1801     1800       -1     
- Partials      475      480       +5     
Flag Coverage Δ
e2e 45.56% <50.00%> (-0.33%) ⬇️
experimental-e2e 53.58% <50.00%> (+0.16%) ⬆️
unit 57.91% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@perdasilva perdasilva marked this pull request as draft February 16, 2026 10:55
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 16, 2026
@perdasilva perdasilva force-pushed the bug-catalog-retrigger branch 2 times, most recently from 2890347 to b7dbac8 Compare February 16, 2026 16:05
Copilot AI review requested due to automatic review settings February 16, 2026 16:05
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@perdasilva perdasilva force-pushed the bug-catalog-retrigger branch from b7dbac8 to ea4e5b4 Compare February 17, 2026 07:50
@perdasilva perdasilva marked this pull request as ready for review February 17, 2026 07:50
Copilot AI review requested due to automatic review settings February 17, 2026 07:50
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 17, 2026
@openshift-ci openshift-ci bot requested review from bentito and oceanc80 February 17, 2026 07:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@pedjak
Copy link
Contributor

pedjak commented Feb 17, 2026

I noticed this error when observing that the ClusterExtension controller was being triggered every 10 seconds even though it had reached a ProgressionDeadlineTimeout terminal state.

I would suggest that we do not put on the reconcile queue clusterextension that have reached ProgressDeadlineTimeout when ClusterCatalog change is there.

@perdasilva
Copy link
Contributor Author

perdasilva commented Feb 17, 2026

I noticed this error when observing that the ClusterExtension controller was being triggered every 10 seconds even though it had reached a ProgressionDeadlineTimeout terminal state.

I would suggest that we do not put on the reconcile queue clusterextension that have reached ProgressDeadlineTimeout when ClusterCatalog change is there.

I think this is a separate issue. The main issue here is that we are updating the cluster catalog even though there's no real change in state, and this is triggering unnecessary reconciliation on all CEs (independently of their underlying state).

On ProgressionDeadlineTimeout side, we need to think about the different options for getting the CE unstuck. I was thinking over ShiftWeek that we may want to move the progression deadline check to the CE controller since there are other errors that we might want to stop retrying for after a while that aren't to do with revision, e.g. bundle download / unpack issues, resolution issues, etc.

In some cases it may make sense to let an exhausted CE reconcile on catalog changes - maybe something like there's a type in the bundle image url that gets fixed in a subsequent. Although, I'd say that if it doesn't solve the problem, you'd still be in the ProgressionDeadlineTimeout state.

Copy link
Contributor

@pedjak pedjak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 18, 2026
@pedjak
Copy link
Contributor

pedjak commented Feb 18, 2026

I would say this PR is 🌱 not a bug fix - we are dropping unnecessary reconciliations, not correcting system behavior.. so it is a nice improvement.

Signed-off-by: Per Goncalves da Silva <pegoncal@redhat.com>
@perdasilva perdasilva force-pushed the bug-catalog-retrigger branch from ea4e5b4 to 914cdce Compare February 19, 2026 09:31
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2026
@perdasilva perdasilva changed the title 🐛 Remove ephemeral aspects of network errors 🌱 Remove ephemeral aspects of network errors Feb 19, 2026
@pedjak
Copy link
Contributor

pedjak commented Feb 19, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2026
Copy link
Contributor

@camilamacedo86 camilamacedo86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci
Copy link

openshift-ci bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: camilamacedo86

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2026
@bandrade
Copy link

/label qe-approved
/verified by @bandrade

@openshift-ci
Copy link

openshift-ci bot commented Feb 19, 2026

@bandrade: The label(s) qe-approved cannot be applied, because the repository doesn't have them.

Details

In response to this:

/label qe-approved
/verified by @bandrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 1fd37fa into operator-framework:main Feb 19, 2026
36 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants