Skip to content

Comments

abandon evaluation of any new catalogsource image which is pathological#3766

Merged
openshift-merge-bot[bot] merged 2 commits intooperator-framework:masterfrom
grokspawn:fix-wedged-newcat-eval
Feb 19, 2026
Merged

abandon evaluation of any new catalogsource image which is pathological#3766
openshift-merge-bot[bot] merged 2 commits intooperator-framework:masterfrom
grokspawn:fix-wedged-newcat-eval

Conversation

@grokspawn
Copy link
Contributor

@grokspawn grokspawn commented Feb 11, 2026

Description of the change:

This PR adds a pathological status check for when container status indicates crashloopbackoff.
If that status matches, OLM will abandon catalog evaluation for that image, dispose of the pod, and pull a new image when the pollInterval comes up again.

This will now cause the catalog operator to emit a message of the form

time="2026-02-11T22:30:17Z" level=info msg="catalog polling result: update pod poison-pill-9qvmp failed to start" UpdatePod=poison-pill-9qvmp

and the offending pod will be deleted.

Motivation for the change:

In the case that a catalogsource defines .spec.grpcPodConfig.extractContent it is possible for OLMv0 to get trapped in an evaluation loop if the catalogsource is not compatible with the on-cluster catalogsource service.

This is because the on-cluster catalog services which use extractContent define two initContainers and a service container. When those initContainers succeed, the pod status progresses to RUNNING regardless of the success/failure of the service container.

If the service container fails, it will halt, and the pod will start being rebooted by kube when it fails readiness/liveness probes. It will remain in RUNNING status, so OLM will requeue its evaluation without end.

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

…y restrts

Signed-off-by: grokspawn <jordan@nimblewidget.com>
Signed-off-by: grokspawn <jordan@nimblewidget.com>
@grokspawn grokspawn force-pushed the fix-wedged-newcat-eval branch from 9985929 to 1098551 Compare February 12, 2026 15:56
@grokspawn grokspawn changed the title abandon evaluation of any new catalogsource image which pathologically restarts abandon evaluation of any new catalogsource image which is pathological Feb 13, 2026
@grokspawn grokspawn requested a review from joelanford February 18, 2026 20:02
Copy link
Member

@joelanford joelanford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Just the one (nit) question about the crashloopbackoff constant.

ServiceHashLabelKey = "olm.service-spec-hash"
CatalogPollingRequeuePeriod = 30 * time.Second
// containerReasonCrashLoopBackOff is the kubelet Waiting reason when a container is backing off after repeated crashes.
containerReasonCrashLoopBackOff = "CrashLoopBackOff"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking there isn't already a constant defined for this in corev1 of k8s.io/api?

return true
}
}
// TODO: currently no ephemeral containers in a catalogsource, should we add checks anyway?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ephemeral containers should never be part of the equation I don't think.

See: https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2026
@openshift-ci
Copy link

openshift-ci bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2026
@openshift-merge-bot openshift-merge-bot bot merged commit feecd01 into operator-framework:master Feb 19, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants