
Conversation

@black-dragon74 (Member) commented Dec 3, 2025

This patch adds the functionality to retry connecting to the sidecar
for a maximum of `maxRetries` attempts.

If the connection attempts are not successful, the object is considered
obsolete and is deleted.

The retry count is tracked in an annotation (`connRetryAnnotation`) and also
reflected in the object's status.

These transient artifacts are cleaned up once a connection is
established.
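The annotation-based retry tracking described above can be sketched in Go: a counter kept in the object's annotations is read, incremented, and compared against the maximum. This is a minimal, stdlib-only sketch; the annotation key and the `maxRetries` value are assumptions for illustration, not the patch's actual names.

```go
package main

import (
	"fmt"
	"strconv"
)

// connRetryAnnotation is an assumed annotation key for this sketch;
// the real key used by the patch may differ.
const connRetryAnnotation = "csiaddons.openshift.io/connection-retries"

// maxRetries is assumed to be 3, matching the test logs in this PR.
const maxRetries = 3

// recordRetry increments the retry counter stored in the object's
// annotations and reports whether maxRetries has been reached, i.e.
// whether the object should be deleted as obsolete.
func recordRetry(annotations map[string]string) (int, bool) {
	n, _ := strconv.Atoi(annotations[connRetryAnnotation]) // missing/invalid -> 0
	n++
	annotations[connRetryAnnotation] = strconv.Itoa(n)
	return n, n >= maxRetries
}

func main() {
	ann := map[string]string{}
	for i := 0; i < 3; i++ {
		n, giveUp := recordRetry(ann)
		fmt.Printf("attempt %d, give up: %v\n", n, giveUp)
	}
}
```

Because the counter lives on the object rather than in controller memory, it survives controller restarts, and it can be dropped along with the annotation once a connection succeeds.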

@black-dragon74 black-dragon74 force-pushed the fix-extra-addonsnodeconn branch from b5acabf to 714a22d Compare December 9, 2025 13:33
@mergify mergify bot added the api Change to the API, requires extra care label Dec 9, 2025
@mergify mergify bot requested a review from ShyamsundarR December 9, 2025 13:33
@black-dragon74 black-dragon74 force-pushed the fix-extra-addonsnodeconn branch from 714a22d to 31bdf4c Compare December 9, 2025 13:36
@black-dragon74 black-dragon74 added the DNM Do Not Merge label Dec 9, 2025
@black-dragon74 black-dragon74 changed the title csiaddonsnode: delete the object after max connection retries csiaddonsnode: Add retry with exponential backoff for connections Dec 9, 2025
@nixpanic (Collaborator) commented Dec 9, 2025

What is the process to get a deleted CSIAddonsNode back in case of a longer network interruption? Can that be automated too?

@black-dragon74 (Member, Author) commented

> What is the process to get a deleted CSIAddonsNode back in case of a longer network interruption? Can that be automated too?

#765 takes care of such cases: if a CSIAddonsNode should exist, it will be ensured that it exists.

@Madhu-1 (Member) left a comment

Why do we need to extract or store the details in the annotation? Can't we make use of the status/message?

@black-dragon74 (Member, Author) commented

> Why do we need to extract or store the details in the annotation? Can't we make use of the status/message?

We could, but a lot of manual parsing would be required without API changes (status messages should be human-readable). Are there any downsides to having a transient annotation? It keeps the changes simple.

@nixpanic (Collaborator) commented

> Why do we need to extract or store the details in the annotation? Can't we make use of the status/message?

Hmm, yes, I agree that `.Status.Conditions[]` is cleaner for this.

@black-dragon74 (Member, Author) commented

>> Why do we need to extract or store the details in the annotation? Can't we make use of the status/message?
>
> Hmm, yes, I agree that `.Status.Conditions[]` is cleaner for this.

I'm not that keen on API changes. What about assuming that if the status is `retrying`, the message holds some form of info about the retry? A bit of manual parsing, but it can be done. WDYT?
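The manual parsing mentioned here could look like the following sketch: pull the attempt number out of a human-readable status message. The message format (`"Retrying connection (attempt 2 of 3)"`) is hypothetical, chosen only to illustrate the trade-off against a dedicated annotation or condition.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// retriesFromMessage extracts the attempt count from a human-readable
// status message of an assumed form like
// "Retrying connection (attempt 2 of 3)". This format is hypothetical;
// it only illustrates the manual parsing the comment refers to.
func retriesFromMessage(msg string) (int, bool) {
	const marker = "attempt "
	i := strings.Index(msg, marker)
	if i < 0 {
		return 0, false
	}
	rest := msg[i+len(marker):]
	// Take the leading run of digits after the marker.
	end := strings.IndexFunc(rest, func(r rune) bool { return r < '0' || r > '9' })
	if end == -1 {
		end = len(rest)
	}
	n, err := strconv.Atoi(rest[:end])
	return n, err == nil && end > 0
}

func main() {
	n, ok := retriesFromMessage("Retrying connection (attempt 2 of 3)")
	fmt.Println(n, ok)
}
```

The fragility is visible: any rewording of the message breaks the parser, which is the usual argument for a structured field (annotation or `.Status.Conditions[]`) instead.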

This patch adds the functionality to retry connecting to the sidecar
for a maximum of `maxRetries` attempts.

If the connection attempts are not successful, the object is considered
obsolete and is deleted.

The retry count is tracked in an annotation (`connRetryAnnotation`) and also
reflected in the object's status.

These transient artifacts are cleaned up once a connection is
established.

Signed-off-by: Niraj Yadav <niryadav@redhat.com>
@black-dragon74 black-dragon74 force-pushed the fix-extra-addonsnodeconn branch from 31bdf4c to f8eb32a Compare December 15, 2025 10:09
@black-dragon74 (Member, Author) commented

Test results

2025-12-15T10:39:27.331Z        INFO    Adding finalizer        {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "b6a2c4df-fef2-46f8-bd73-5e8f81b94460", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:27.368Z        INFO    Connecting to sidecar   {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "b6a2c4df-fef2-46f8-bd73-5e8f81b94460", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:27.379Z        ERROR   Failed to establish connection with sidecar     {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "b6a2c4df-fef2-46f8-bd73-5e8f81b94460", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": "", "attempt": 1, "error": "failed to exit idle mode: delegating_resolver: invalid target address \"\": missing address"}
...
2025-12-15T10:39:27.405Z        INFO    Requeuing request for attempting the connection again   {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "b6a2c4df-fef2-46f8-bd73-5e8f81b94460", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": "", "backoff": "2s"}
....
2025-12-15T10:39:41.655Z        INFO    Failed to establish connection with sidecar after 3 attempts, deleting the object       {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "18c5fdde-df4c-4d95-93c8-767c928928ec", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:41.696Z        INFO    successfully deleted CSIAddonsNode object due to reaching max reconnection attempts     {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "18c5fdde-df4c-4d95-93c8-767c928928ec", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:41.715Z        INFO    Deleting connection     {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "257ae5a8-4401-4d30-8141-a6d45029d74a", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": "", "Key": "rook-ceph/csi-rbdplugin-noent"}
2025-12-15T10:39:41.716Z        INFO    Removing finalizer      {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "257ae5a8-4401-4d30-8141-a6d45029d74a", "NodeID": "minikube", "DriverName": "rook-ceph.rbd.csi.ceph.com", "EndPoint": ""}
2025-12-15T10:39:41.749Z        INFO    CSIAddonsNode resource not found        {"controller": "csiaddonsnode", "controllerGroup": "csiaddons.openshift.io", "controllerKind": "CSIAddonsNode", "CSIAddonsNode": {"name":"test-it-out","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "test-it-out", "reconcileID": "0d7228cb-6da9-4395-81e7-e758d89f402c"}

@Madhu-1 (Member) left a comment

I see only one problem: if the controller restarts multiple times, we will delete the CSIAddonsNode object immediately, right?

@black-dragon74 (Member, Author) commented

> I see only one problem: if the controller restarts multiple times, we will delete the CSIAddonsNode object immediately, right?

Even if we do that, the object will be recreated by the sidecar almost immediately. But we only delete and update the status if the actual connection attempt is unsuccessful.

Upon testing further, I am going to revert the change that returns an empty pod name when a pod is not found. If we keep it, the reconciler will keep retrying again and again and we will never reach the backoff section.

Better to let it try and fail than to keep retrying infinitely? WDYT?

@iPraveenParihar @Rakshith-R ^^

@Rakshith-R (Member) commented

>> I see only one problem: if the controller restarts multiple times, we will delete the CSIAddonsNode object immediately, right?
>
> Even if we do that, the object will be recreated by the sidecar almost immediately. But we only delete and update the status if the actual connection attempt is unsuccessful.
>
> Upon testing further, I am going to revert the change that returns an empty pod name when a pod is not found. If we keep it, the reconciler will keep retrying again and again and we will never reach the backoff section.
>
> Better to let it try and fail than to keep retrying infinitely? WDYT?
>
> @iPraveenParihar @Rakshith-R ^^

Let's move the entire retry logic to a helper method and call it at:

  • the current spot
  • when the podname is empty too?

@black-dragon74 (Member, Author) commented

> Let's move the entire retry logic to a helper method and call it at:
>
>   • the current spot
>   • when the podname is empty too?

I was mid-refactor and then...

It's not that simple, and it would complicate things: we need additional logic to decide between stopping the reconcile, requeueing after a backoff, or returning an error. And we would need to do that in multiple places (at least two).

A pod uniquely identifies a CSIAddonsNode object, and if that pod is not found (stale), it is not going to come back under normal circumstances. If it does come back, the sidecar will always create a CSIAddonsNode for it.

Let's keep it simple; we have a finite number of worker nodes (of which only a select few will be in this state and require cleanup), and we retry only 3 times.

From my POV, it is worth retrying only when the connection fails (the network can be flaky), not when something is out of the ordinary. If we want to be optimistic and assume that the pod WILL come back, let's keep the current code, which will use its own retry backoff and keep delaying the reconcile due to the missing pod.
