
Add retry logic to updateRecoveryWindow to handle concurrent ObjectStore status updates #758

@gabrielmouallem

Description


Problem

When running scheduled backups with retention policies, we observe transient errors:

{"level":"error","msg":"Error while updating the recovery window in the ObjectStore status stanza. Skipping.","error":"Operation cannot be fulfilled on objectstores.barmancloud.cnpg.io \"cluster-name-backup\": the object has been modified; please apply your changes to the latest version and try again"}

{"level":"error","msg":"Retention policy enforcement failed","error":"Operation cannot be fulfilled on objectstores.barmancloud.cnpg.io \"cluster-name-backup\": the object has been modified; please apply your changes to the latest version and try again"}

Root Cause Analysis

After investigating the plugin source code, we identified that the updateRecoveryWindow function in internal/cnpgi/instance/recovery_window.go performs a direct status update without retry logic:

// recovery_window.go:40
func updateRecoveryWindow(...) error {
    // ... builds status ...
    return c.Status().Update(ctx, objectStore)  // No retry on conflict
}

This function is called from two places that can run concurrently:

  1. backup.go:169 - After a backup completes successfully
  2. retention.go:66 - During periodic retention policy enforcement (default every 5 minutes)

When both operations happen close together, Kubernetes' optimistic concurrency control rejects whichever update arrives second, because the object's resourceVersion changed between that writer's read and its write.
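
For illustration, this kind of conflict surfaces as a 409 that apierrors.IsConflict (from k8s.io/apimachinery/pkg/api/errors) recognizes. The helper below is hypothetical and only sketches where the error appears; the barmancloudv1 import path is assumed from the plugin's layout:

// Hypothetical helper, not part of the plugin: shows where the 409 Conflict
// raised by optimistic concurrency control surfaces.
func tryStatusUpdate(
    ctx context.Context,
    c client.Client,
    objectStore *barmancloudv1.ObjectStore,
) error {
    if err := c.Status().Update(ctx, objectStore); err != nil {
        if apierrors.IsConflict(err) {
            // Another writer (backup completion or retention enforcement)
            // bumped resourceVersion between our Get and this Update.
        }
        return err
    }
    return nil
}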

Evidence

The same file already has a function that correctly handles this scenario:

// recovery_window.go:65 - setLastFailedBackupTime
func setLastFailedBackupTime(...) error {
    return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        var objectStore barmancloudv1.ObjectStore
        if err := c.Get(ctx, objectStoreKey, &objectStore); err != nil {
            return err
        }
        // ... update status ...
        return c.Status().Update(ctx, &objectStore)
    })
}

The setLastFailedBackupTime function uses retry.RetryOnConflict which:

  1. Gets a fresh copy of the resource before updating
  2. Retries on conflict with exponential backoff (the retry.DefaultBackoff parameters are sketched below)
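
For reference, our reading of retry.DefaultBackoff in k8s.io/client-go/util/retry is roughly the following (verify against the client-go version the plugin vendors):

// From k8s.io/client-go/util/retry; wait is k8s.io/apimachinery/pkg/util/wait.
var DefaultBackoff = wait.Backoff{
    Steps:    4,                     // up to 4 attempts
    Duration: 10 * time.Millisecond, // initial delay
    Factor:   5.0,                   // delay grows 10ms -> 50ms -> 250ms
    Jitter:   0.1,                   // randomize each delay by up to +10%
}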

Impact

  • Severity: Low - backups complete successfully, status eventually updates
  • User experience: Confusing error messages in logs
  • Frequency: Depends on backup/retention timing overlap (we see ~2 errors per 24h)

Proposed Fix

Apply the same retry pattern to updateRecoveryWindow:

func updateRecoveryWindow(
    ctx context.Context,
    c client.Client,
    backupList *catalog.Catalog,
    objectStore *barmancloudv1.ObjectStore,
    serverName string,
) error {
    return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        // Get a fresh copy so the update applies on top of the latest
        // resourceVersion
        var freshObjectStore barmancloudv1.ObjectStore
        if err := c.Get(ctx, client.ObjectKeyFromObject(objectStore), &freshObjectStore); err != nil {
            return err
        }

        // Build the recovery window
        convertTime := func(t *time.Time) *metav1.Time {
            if t == nil {
                return nil
            }
            return ptr.To(metav1.NewTime(*t))
        }

        if freshObjectStore.Status.ServerRecoveryWindow == nil {
            freshObjectStore.Status.ServerRecoveryWindow = make(map[string]barmancloudv1.RecoveryWindow)
        }

        recoveryWindow := freshObjectStore.Status.ServerRecoveryWindow[serverName]
        recoveryWindow.FirstRecoverabilityPoint = convertTime(backupList.GetFirstRecoverabilityPoint())
        recoveryWindow.LastSuccessfulBackupTime = convertTime(backupList.GetLastSuccessfulBackupTime())
        freshObjectStore.Status.ServerRecoveryWindow[serverName] = recoveryWindow

        return c.Status().Update(ctx, &freshObjectStore)
    })
}
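
To our understanding, retry.RetryOnConflict returns the last conflict error once the backoff is exhausted, so the existing error handling at both call sites keeps working unchanged. A hypothetical call site (the actual code at backup.go:169 and retention.go:66 may differ, and contextLogger is illustrative):

// Hypothetical call site: updateRecoveryWindow now absorbs transient
// conflicts internally, so this log fires only when every retry failed
// or a non-conflict error occurred.
if err := updateRecoveryWindow(ctx, c, backupList, objectStore, serverName); err != nil {
    contextLogger.Error(err, "Error while updating the recovery window in the ObjectStore status stanza. Skipping.")
}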

Environment

  • Plugin version: 0.10.0
  • CNPG Operator: 1.26+
  • Kubernetes: 1.29+
  • Object storage: AWS S3

We're happy to submit a PR if this approach looks correct.
