abandon evaluation of any new catalogsource image which is pathological #3766
Conversation
…y restrts
Signed-off-by: grokspawn <jordan@nimblewidget.com>

Force-pushed from 9985929 to 1098551
joelanford left a comment:
/approve
Just the one (nit) question about the crashloopbackoff constant.
ServiceHashLabelKey = "olm.service-spec-hash"
CatalogPollingRequeuePeriod = 30 * time.Second
// containerReasonCrashLoopBackOff is the kubelet Waiting reason when a container is backing off after repeated crashes.
containerReasonCrashLoopBackOff = "CrashLoopBackOff"
Just checking there isn't already a constant defined for this in corev1 of k8s.io/api?
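For context, the constant is compared against the Waiting reason the kubelet reports on a container status. A minimal sketch of that comparison, assuming a helper name that is not in the diff shown:

```go
package catalog

import corev1 "k8s.io/api/core/v1"

// containerReasonCrashLoopBackOff is the kubelet Waiting reason when a
// container is backing off after repeated crashes (as in the diff above).
const containerReasonCrashLoopBackOff = "CrashLoopBackOff"

// statusIsCrashLooping reports whether a single container status is currently
// waiting in CrashLoopBackOff. Hypothetical helper name, for illustration only.
func statusIsCrashLooping(cs corev1.ContainerStatus) bool {
	return cs.State.Waiting != nil && cs.State.Waiting.Reason == containerReasonCrashLoopBackOff
}
```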
return true
}
}
// TODO: currently no ephemeral containers in a catalogsource, should we add checks anyway?
I don't think ephemeral containers should ever be part of the equation.
See: https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/
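For reference, a sketch of how the check could walk the pod's status lists, covering init and regular containers while skipping ephemeral containers entirely (per the linked docs, they are debug-only and never automatically restarted, so they cannot back off). It builds on the statusIsCrashLooping sketch above; the names are illustrative, not the PR's actual code:

```go
// isCrashLooping reports whether any init or regular container on the pod is
// currently in CrashLoopBackOff. EphemeralContainerStatuses is intentionally
// not inspected, following the discussion above.
func isCrashLooping(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.InitContainerStatuses {
		if statusIsCrashLooping(cs) {
			return true
		}
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if statusIsCrashLooping(cs) {
			return true
		}
	}
	return false
}
```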
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Merged commit feecd01 into operator-framework:master
Description of the change:
This PR adds a pathological-status check for when a container status indicates CrashLoopBackOff.

If that status matches, OLM will abandon catalog evaluation for that image, dispose of the pod, and pull a new image when the pollInterval comes up again. The catalog operator will now log a message noting the pathological behavior, and the offending pod will be deleted.
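A rough sketch of that abandon-and-retry flow, reusing the isCrashLooping sketch from the review thread; function and field names here are assumptions, not the PR's exact code:

```go
package catalog

import (
	"context"

	"github.com/sirupsen/logrus"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// abandonIfPathological sketches the behavior described above: if the catalog
// pod is crash-looping, log it and delete the pod. There is no immediate
// requeue; the next pull happens when the catalog's pollInterval comes up.
func abandonIfPathological(ctx context.Context, client kubernetes.Interface, catalogName string, pod *corev1.Pod) (deleted bool, err error) {
	if !isCrashLooping(pod) {
		return false, nil
	}
	logrus.WithField("catalogsource", catalogName).
		Info("registry pod is crash-looping; abandoning evaluation and deleting pod")
	if err := client.CoreV1().Pods(pod.GetNamespace()).Delete(ctx, pod.GetName(), metav1.DeleteOptions{}); err != nil {
		return false, err
	}
	return true, nil
}
```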
Motivation for the change:
In the case that a catalogsource defines .spec.grpcPodConfig.extractContent, it is possible for OLMv0 to get trapped in an evaluation loop if the catalogsource is not compatible with the on-cluster catalogsource service.

This is because the on-cluster catalog services which use extractContent define two initContainers and a service container. When those initContainers succeed, the pod status progresses to Running regardless of the success or failure of the service container. If the service container fails, it will halt, and the pod will start being restarted by kube when it fails readiness/liveness probes. It will remain in Running status, so OLM will requeue its evaluation without end.
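To illustrate the failure mode, this is roughly what the stuck pod's status looks like; the container names and values are assumptions for illustration, not captured from a real cluster:

```go
package catalog

import corev1 "k8s.io/api/core/v1"

// stuckPodStatus illustrates the loop described above: both extract
// initContainers have completed, so the pod phase is Running, but the serving
// container never comes up. A check on Phase alone sees Running and requeues
// forever; a container-status check sees CrashLoopBackOff and can abandon
// the pod instead.
var stuckPodStatus = corev1.PodStatus{
	Phase: corev1.PodRunning,
	InitContainerStatuses: []corev1.ContainerStatus{
		{Name: "extract-utilities", State: corev1.ContainerState{Terminated: &corev1.ContainerStateTerminated{ExitCode: 0}}},
		{Name: "extract-content", State: corev1.ContainerState{Terminated: &corev1.ContainerStateTerminated{ExitCode: 0}}},
	},
	ContainerStatuses: []corev1.ContainerStatus{
		{Name: "registry-server", State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"}}},
	},
}
```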
Architectural changes:
Testing remarks:
Reviewer Checklist
- Docs updated or added to /doc
- Tests marked as [FLAKE] are truly flaky and have an issue