
Conversation

@black-dragon74 (Member) commented Dec 3, 2025

This patch adds the functionality to retry connecting to the sidecar
for a maximum of `maxRetries` attempts.

If the connection attempts are not successful, the object is considered
obsolete and is deleted.

The retry count is tracked in an annotation (`connRetryAnnotation`) and also
reflected in the object's status.

These transient artifacts are cleaned up once a connection is
established.
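The annotation-based retry tracking described above can be sketched in Go: a counter kept in the object's annotations is read, incremented, and compared against the maximum. This is a minimal, stdlib-only sketch; the annotation key and the `maxRetries` value are assumptions for illustration, not the patch's actual names.

```go
package main

import (
	"fmt"
	"strconv"
)

// connRetryAnnotation is an assumed annotation key for this sketch;
// the real key used by the patch may differ.
const connRetryAnnotation = "csiaddons.openshift.io/connection-retries"

// maxRetries is assumed to be 3, matching the test logs in this PR.
const maxRetries = 3

// recordRetry increments the retry counter stored in the object's
// annotations and reports whether maxRetries has been reached, i.e.
// whether the object should be deleted as obsolete.
func recordRetry(annotations map[string]string) (int, bool) {
	n, _ := strconv.Atoi(annotations[connRetryAnnotation]) // missing/invalid -> 0
	n++
	annotations[connRetryAnnotation] = strconv.Itoa(n)
	return n, n >= maxRetries
}

func main() {
	ann := map[string]string{}
	for i := 0; i < 3; i++ {
		n, giveUp := recordRetry(ann)
		fmt.Printf("attempt %d, give up: %v\n", n, giveUp)
	}
}
```

Because the counter lives on the object rather than in controller memory, it survives controller restarts, and it can be dropped along with the annotation once a connection succeeds.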

@black-dragon74 black-dragon74 force-pushed the fix-extra-addonsnodeconn branch from b5acabf to 714a22d Compare December 9, 2025 13:33
@mergify mergify bot added the api Change to the API, requires extra care label Dec 9, 2025
@mergify mergify bot requested a review from ShyamsundarR December 9, 2025 13:33
@black-dragon74 black-dragon74 force-pushed the fix-extra-addonsnodeconn branch from 714a22d to 31bdf4c Compare December 9, 2025 13:36
@black-dragon74 black-dragon74 added the DNM Do Not Merge label Dec 9, 2025
@black-dragon74 black-dragon74 changed the title csiaddonsnode: delete the object after max connection retries csiaddonsnode: Add retry with exponential backoff for connections Dec 9, 2025
@nixpanic (Collaborator) commented Dec 9, 2025

What is the process to get a deleted CSIAddonsNode back in case of a longer network interruption? Can that be automated too?

@black-dragon74 (Member, Author) commented

> What is the process to get a deleted CSIAddonsNode back in case of a longer network interruption? Can that be automated too?

#765 takes care of such cases: if a CSIAddonsNode should exist, it will be ensured that it exists.

@Madhu-1 (Member) left a comment

Why do we need to extract or store the details in the annotation? Can't we make use of the status/message?

@black-dragon74 (Member, Author) commented

> Why do we need to extract or store the details in the annotation? Can't we make use of the status/message?

We could, but a lot of manual parsing would be required without API changes (status messages should be human-readable). Are there any downsides to having a transient annotation? It keeps the changes simple.

@nixpanic (Collaborator) commented

> Why do we need to extract or store the details in the annotation? Can't we make use of the status/message?

Hmm, yes, I agree that `.Status.Conditions[]` is cleaner for this.

@black-dragon74 (Member, Author) commented

>> Why do we need to extract or store the details in the annotation? Can't we make use of the status/message?
>
> Hmm, yes, I agree that `.Status.Conditions[]` is cleaner for this.

I'm not that keen on API changes. What about assuming that if the status is `retrying`, the message holds some form of info about the retry? A bit of manual parsing, but it can be done. WDYT?
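The manual parsing mentioned here could look like the following sketch: pull the attempt number out of a human-readable status message. The message format (`"Retrying connection (attempt 2 of 3)"`) is hypothetical, chosen only to illustrate the trade-off against a dedicated annotation or condition.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// retriesFromMessage extracts the attempt count from a human-readable
// status message of an assumed form like
// "Retrying connection (attempt 2 of 3)". This format is hypothetical;
// it only illustrates the manual parsing the comment refers to.
func retriesFromMessage(msg string) (int, bool) {
	const marker = "attempt "
	i := strings.Index(msg, marker)
	if i < 0 {
		return 0, false
	}
	rest := msg[i+len(marker):]
	// Take the leading run of digits after the marker.
	end := strings.IndexFunc(rest, func(r rune) bool { return r < '0' || r > '9' })
	if end == -1 {
		end = len(rest)
	}
	n, err := strconv.Atoi(rest[:end])
	return n, err == nil && end > 0
}

func main() {
	n, ok := retriesFromMessage("Retrying connection (attempt 2 of 3)")
	fmt.Println(n, ok)
}
```

The fragility is visible: any rewording of the message breaks the parser, which is the usual argument for a structured field (annotation or `.Status.Conditions[]`) instead.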

This patch adds the functionality to retry connecting to the sidecar
for a maximum of `maxRetries` attempts.

If the connection attempts are not successful, the object is considered
obsolete and is deleted.

The retry count is tracked in an annotation (`connRetryAnnotation`) and also
reflected in the object's status.

These transient artifacts are cleaned up once a connection is
established.

Signed-off-by: Niraj Yadav <niryadav@redhat.com>
@black-dragon74 black-dragon74 force-pushed the fix-extra-addonsnodeconn branch from 31bdf4c to f8eb32a Compare December 15, 2025 10:09
@black-dragon74 (Member, Author) commented

Test results

2025-12-15T10:39:27.331Z        INFO    Adding finalizer        {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "b6a2c4df-fef2-46f8-bd73-5e8f81b94460", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:27.368Z        INFO    Connecting to sidecar   {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "b6a2c4df-fef2-46f8-bd73-5e8f81b94460", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:27.379Z        ERROR   Failed to establish connection with sidecar     {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "b6a2c4df-fef2-46f8-bd73-5e8f81b94460", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": "", "attempt": 1, "error": "failed to exit idle mode: delegating_resolver: invalid target address \"\": missing address"}
...
2025-12-15T10:39:27.405Z        INFO    Requeuing request for attempting the connection again   {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "b6a2c4df-fef2-46f8-bd73-5e8f81b94460", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": "", "backoff": "2s"}
....
2025-12-15T10:39:41.655Z        INFO    Failed to establish connection with sidecar after 3 attempts, deleting the object       {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "18c5fdde-df4c-4d95-93c8-767c928928ec", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:41.696Z        INFO    successfully deleted CSIAddonsNode object due to reaching max reconnection attempts     {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "18c5fdde-df4c-4d95-93c8-767c928928ec", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:41.715Z        INFO    Deleting connection     {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "257ae5a8-4401-4d30-8141-a6d45029d74a", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": "", "Key": "rook-ceph/csi-rbdplugin-noent"}
2025-12-15T10:39:41.716Z        INFO    Removing finalizer      {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "257ae5a8-4401-4d30-8141-a6d45029d74a", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:41.749Z        INFO    CSIAddonsNode resource not found        {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "0d7228cb-6da9-4395-81e7-e758d89f402c"}

@Madhu-1 (Member) left a comment

I see only one problem: if the controller restarts multiple times, we will delete the CSIAddonsNode object immediately, right?

@black-dragon74 (Member, Author) commented

> I see only one problem: if the controller restarts multiple times, we will delete the CSIAddonsNode object immediately, right?

Even if we do that, the object will be recreated by the sidecar almost immediately. But we only delete and update the status if the actual connection attempt is unsuccessful.

Upon testing further, I am going to revert the change that returns an empty pod name when a pod is not found. If we keep it, the reconciler will keep retrying again and again and we will never reach the backoff section.

Better to let it try and fail than to keep retrying infinitely? WDYT?

@iPraveenParihar @Rakshith-R ^^

@Rakshith-R (Member) commented

>> I see only one problem: if the controller restarts multiple times, we will delete the CSIAddonsNode object immediately, right?
>
> Even if we do that, the object will be recreated by the sidecar almost immediately. But we only delete and update the status if the actual connection attempt is unsuccessful.
>
> Upon testing further, I am going to revert the change that returns an empty pod name when a pod is not found. If we keep it, the reconciler will keep retrying again and again and we will never reach the backoff section.
>
> Better to let it try and fail than to keep retrying infinitely? WDYT?
>
> @iPraveenParihar @Rakshith-R ^^

Let's move the entire retry logic to a helper method and call it at:

  • the current spot
  • when the podname is empty too?

@black-dragon74 (Member, Author) commented

> Let's move the entire retry logic to a helper method and call it at:
>
>   • the current spot
>   • when the podname is empty too?

I was mid-refactor and then...

It's not that simple, and it would complicate things: we need additional logic to decide between stopping the reconcile, requeueing after a backoff, or returning an error. And we would need to do that in multiple places (at least two).

A pod uniquely identifies a CSIAddonsNode object, and if that pod is not found (stale), it is not going to come back under normal circumstances. If it does come back, the sidecar will always create a CSIAddonsNode for it.

Let's keep it simple; we have a finite number of worker nodes (of which only a select few will be in this state and require cleanup), and we retry only 3 times.

From my POV, it is worth retrying only when the connection fails (the network can be flaky), not when something is out of the ordinary. If we want to be optimistic and assume that the pod WILL come back, let's keep the current code, which will use its own retry backoff and keep delaying the reconcile due to the missing pod.
