-
Notifications
You must be signed in to change notification settings - Fork 42
csiaddonsnode: Add retry with exponential backoff for connections #924
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
csiaddonsnode: Add retry with exponential backoff for connections #924
Conversation
b007a75 to
b5acabf
Compare
b5acabf to
714a22d
Compare
714a22d to
31bdf4c
Compare
|
What is the process to get a deleted CSIAddonsNode back in case of a longer network interruption? Can that be automated too? |
#765 takes care of such cases, if a |
Madhu-1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to extract or store the details in the annotation, cant we make use of the status/message?
We could. A lot of manual parsing would be required without API changes (status messages should be human readable). Are there any downsides of having a transient annotation (keeps the changes simple)? |
Hmm, yes, I agree that |
Not that willing for API changes. What about the assumption that if status == retrying, message holds some form of info about the retry. A bit of manual parsing but can be done. WDYT? |
This patch adds the functionality to retry for a maximum of `maxRetries` to connect to the sidecar. If the connection attempt is not successful, the object is considered obsolete and is deleted. The retry is tracked inside an annotation(`connRetryAnnotation`) and also reflected in object's status. These transient artifacts are cleaned up once a connection is established. Signed-off-by: Niraj Yadav <niryadav@redhat.com>
31bdf4c to
f8eb32a
Compare
Test results |
Madhu-1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see only one problem, if the controller restarts multiple times we will delete the csiaddons node object immediately right?
Even if we do that, the object will be recreated by the sidecar almost immediately. But we only delete and update the status if the actual connection attempt is unsuccessful. Upon testing further, I am going to revert the change that returns an empty podname in case a pod is not found. If we keep it so, the reconciler will keep on retrying again and again and we will never reach the backoff section. Better to let it try and fail than to keep retrying infinitely? WDYT? |
Let's move the entire retry logic to a helper method and call it at
|
I was mid refactor and then.... It's not that simple and would complicate things because, we need to add additional logic to decide between stopping the reconcile, requeueing after a backoff or returning an error. And we would need to do that at multiple places (two at-least). A pod uniquely identifies a csiaddonsnode object and if that pod is not found (stale), it is not going to come back under normal circumstances. If it does come back, the sidecar will always create a csiaddonsnode for it. Let's keep it simple; we have quantifiable number of worker nodes (out of which a select few will be in this state requiring cleanup) and we retry for 3 times only. From my POV, it is worth retrying only when the connection fails (network can be flaky), not when something is out of the ordinary. If we want to be optimistic and assume that the pod WILL come back, let's keep the current code which will use its own retry-backoff and keep delaying the reconcile due to missing pod. |
This patch adds the functionality to retry for a maximum of
maxRetriesto connect to the sidecar.If the connection attempt is not successful, the object is considered
obsolete and is deleted.
The retry is tracked inside an annotation(
connRetryAnnotation) and alsoreflected in object's status.
These transient artifacts are cleaned up once a connection is
established.