
Add retry logic to updateRecoveryWindow to handle concurrent ObjectStore status updates #758

@gabrielmouallem

Description


Problem

When running scheduled backups with retention policies, we observe transient errors:

{"level":"error","msg":"Error while updating the recovery window in the ObjectStore status stanza. Skipping.","error":"Operation cannot be fulfilled on objectstores.barmancloud.cnpg.io \"cluster-name-backup\": the object has been modified; please apply your changes to the latest version and try again"}

{"level":"error","msg":"Retention policy enforcement failed","error":"Operation cannot be fulfilled on objectstores.barmancloud.cnpg.io \"cluster-name-backup\": the object has been modified; please apply your changes to the latest version and try again"}

Root Cause Analysis

After investigating the plugin source code, we identified that the updateRecoveryWindow function in internal/cnpgi/instance/recovery_window.go performs a direct status update without retry logic:

// recovery_window.go:40
func updateRecoveryWindow(...) error {
    // ... builds status ...
    return c.Status().Update(ctx, objectStore)  // No retry on conflict
}

This function is called from two places that can run concurrently:

  1. backup.go:169 - After a backup completes successfully
  2. retention.go:66 - During periodic retention policy enforcement (default every 5 minutes)

When both operations happen close together, Kubernetes' optimistic concurrency control rejects whichever update arrives second, because the object's resourceVersion changed between that writer's read and its write.
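
For illustration, this kind of conflict surfaces as a 409 that apierrors.IsConflict (from k8s.io/apimachinery/pkg/api/errors) recognizes. The helper below is hypothetical and only sketches where the error appears; the barmancloudv1 import path is assumed from the plugin's layout:

// Hypothetical helper, not part of the plugin: shows where the 409 Conflict
// raised by optimistic concurrency control surfaces.
func tryStatusUpdate(
    ctx context.Context,
    c client.Client,
    objectStore *barmancloudv1.ObjectStore,
) error {
    if err := c.Status().Update(ctx, objectStore); err != nil {
        if apierrors.IsConflict(err) {
            // Another writer (backup completion or retention enforcement)
            // bumped resourceVersion between our Get and this Update.
        }
        return err
    }
    return nil
}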

Evidence

The same file already has a function that correctly handles this scenario:

// recovery_window.go:65 - setLastFailedBackupTime
func setLastFailedBackupTime(...) error {
    return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        var objectStore barmancloudv1.ObjectStore
        if err := c.Get(ctx, objectStoreKey, &objectStore); err != nil {
            return err
        }
        // ... update status ...
        return c.Status().Update(ctx, &objectStore)
    })
}

The setLastFailedBackupTime function uses retry.RetryOnConflict which:

  1. Gets a fresh copy of the resource before updating
  2. Retries on conflict with exponential backoff (the retry.DefaultBackoff parameters are sketched below)
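
For reference, our reading of retry.DefaultBackoff in k8s.io/client-go/util/retry is roughly the following (verify against the client-go version the plugin vendors):

// From k8s.io/client-go/util/retry; wait is k8s.io/apimachinery/pkg/util/wait.
var DefaultBackoff = wait.Backoff{
    Steps:    4,                     // up to 4 attempts
    Duration: 10 * time.Millisecond, // initial delay
    Factor:   5.0,                   // delay grows 10ms -> 50ms -> 250ms
    Jitter:   0.1,                   // randomize each delay by up to +10%
}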

Impact

  • Severity: Low - backups complete successfully, status eventually updates
  • User experience: Confusing error messages in logs
  • Frequency: Depends on backup/retention timing overlap (we see ~2 errors per 24h)

Proposed Fix

Apply the same retry pattern to updateRecoveryWindow:

func updateRecoveryWindow(
    ctx context.Context,
    c client.Client,
    backupList *catalog.Catalog,
    objectStore *barmancloudv1.ObjectStore,
    serverName string,
) error {
    return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
        // Get a fresh copy so the update applies on top of the latest
        // resourceVersion
        var freshObjectStore barmancloudv1.ObjectStore
        if err := c.Get(ctx, client.ObjectKeyFromObject(objectStore), &freshObjectStore); err != nil {
            return err
        }

        // Build the recovery window
        convertTime := func(t *time.Time) *metav1.Time {
            if t == nil {
                return nil
            }
            return ptr.To(metav1.NewTime(*t))
        }

        if freshObjectStore.Status.ServerRecoveryWindow == nil {
            freshObjectStore.Status.ServerRecoveryWindow = make(map[string]barmancloudv1.RecoveryWindow)
        }

        recoveryWindow := freshObjectStore.Status.ServerRecoveryWindow[serverName]
        recoveryWindow.FirstRecoverabilityPoint = convertTime(backupList.GetFirstRecoverabilityPoint())
        recoveryWindow.LastSuccessfulBackupTime = convertTime(backupList.GetLastSuccessfulBackupTime())
        freshObjectStore.Status.ServerRecoveryWindow[serverName] = recoveryWindow

        return c.Status().Update(ctx, &freshObjectStore)
    })
}
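
To our understanding, retry.RetryOnConflict returns the last conflict error once the backoff is exhausted, so the existing error handling at both call sites keeps working unchanged. A hypothetical call site (the actual code at backup.go:169 and retention.go:66 may differ, and contextLogger is illustrative):

// Hypothetical call site: updateRecoveryWindow now absorbs transient
// conflicts internally, so this log fires only when every retry failed
// or a non-conflict error occurred.
if err := updateRecoveryWindow(ctx, c, backupList, objectStore, serverName); err != nil {
    contextLogger.Error(err, "Error while updating the recovery window in the ObjectStore status stanza. Skipping.")
}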

Environment

  • Plugin version: 0.10.0
  • CNPG Operator: 1.26+
  • Kubernetes: 1.29+
  • Object storage: AWS S3

We're happy to submit a PR if this approach looks correct.
