@vadikko2
Owner

@vadikko2 vadikko2 commented Jan 28, 2026

Saga recovery attempts

  • ISagaStorage.get_sagas_for_recovery() — returns saga IDs that need recovery (status RUNNING, COMPENSATING, or FAILED) with optional filters:
    • limit — maximum number of IDs to return
    • max_recovery_attempts (default: 5) — only sagas with recovery_attempts strictly less than this value; excludes repeatedly failing sagas from retry
    • stale_after_seconds (optional) — only sagas whose updated_at is older than now - stale_after_seconds; avoids picking sagas currently being executed by another worker
  • ISagaStorage.increment_recovery_attempts() — atomically increments recovery_attempts and optionally updates saga status (e.g. to FAILED). Intended for use after a failed recovery; recover_saga() calls it automatically on exception, so callers do not need to call it manually.
  • recovery_attempts field in saga storage — each saga execution now has a counter of failed recovery attempts. Used by get_sagas_for_recovery() to limit retries and by increment_recovery_attempts() on recovery failure.

Implemented in both MemorySagaStorage and SqlAlchemySagaStorage.

Changed

  • recover_saga() — on recovery failure (any exception during resume), the storage's increment_recovery_attempts(saga_id, new_status=SagaStatus.FAILED) is invoked automatically. The saga stays eligible for retry while its recovery_attempts is below max_recovery_attempts; once the counter reaches that value, get_sagas_for_recovery(max_recovery_attempts=...) excludes it from future recovery runs.
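The failure path described above can be sketched as a small wrapper. The function recover_with_attempt_tracking and its resume callback are hypothetical names used only for illustration, SagaStatus is a stand-in enum, and re-raising the exception after recording the attempt is an assumption: the description only states that the counter is incremented.

```python
import enum


class SagaStatus(enum.Enum):  # hypothetical stand-in for the library's enum
    FAILED = "failed"


def recover_with_attempt_tracking(saga_id, storage, resume):
    """Run resume(saga_id); on any exception, record a failed recovery attempt.

    storage only needs an increment_recovery_attempts(saga_id, new_status=None)
    method, matching the signature listed in the upgrade notes.
    """
    try:
        return resume(saga_id)
    except Exception:
        # Count the failed attempt and mark the saga FAILED, as described above.
        storage.increment_recovery_attempts(saga_id, new_status=SagaStatus.FAILED)
        # Assumption: the error is propagated so the caller can log it.
        raise
```

Because the increment happens inside the wrapper, a caller that catches the exception does not need to touch the counter itself.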

Documentation

  • Recovery and storage docs now describe recovery attempts, get_sagas_for_recovery(), and increment_recovery_attempts().
  • Example saga_recovery_scheduler.py demonstrates a recovery loop using get_sagas_for_recovery(limit, max_recovery_attempts, stale_after_seconds) and recover_saga() without manual increment_recovery_attempts calls.
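A recovery pass of the kind the example script is said to demonstrate might look roughly like the sketch below. It is not the contents of saga_recovery_scheduler.py: run_recovery_pass and the recover callback are hypothetical, the calls are synchronous for brevity (the real API may be async), and the default argument values are arbitrary.

```python
import logging
import uuid
from typing import Callable, List, Optional, Protocol, Tuple

logger = logging.getLogger("saga_recovery_sketch")


class RecoveryStorage(Protocol):
    """Only the part of the storage interface that this loop needs."""

    def get_sagas_for_recovery(
        self,
        limit: int,
        max_recovery_attempts: int = 5,
        stale_after_seconds: Optional[int] = None,
    ) -> List[uuid.UUID]: ...


def run_recovery_pass(
    storage: RecoveryStorage,
    recover: Callable[[uuid.UUID], None],  # hypothetical: wraps recover_saga()
    *,
    limit: int = 100,
    max_recovery_attempts: int = 5,
    stale_after_seconds: Optional[int] = 60,
) -> Tuple[List[uuid.UUID], List[uuid.UUID]]:
    """Run one pass and return (recovered_ids, failed_ids)."""
    saga_ids = storage.get_sagas_for_recovery(
        limit=limit,
        max_recovery_attempts=max_recovery_attempts,
        stale_after_seconds=stale_after_seconds,
    )
    recovered, failed = [], []
    for saga_id in saga_ids:
        try:
            recover(saga_id)
        except Exception:
            # No manual increment_recovery_attempts() call here: per the
            # description, recover_saga() already records the failed attempt.
            logger.exception("Recovery failed for saga %s", saga_id)
            failed.append(saga_id)
        else:
            recovered.append(saga_id)
    return recovered, failed
```

A scheduler would call run_recovery_pass periodically. One failing saga does not stop the pass, and a saga that keeps failing drops out of later passes once its counter reaches max_recovery_attempts.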

Upgrade notes

  • Storage interface: If you implement a custom ISagaStorage, you must add:
    • get_sagas_for_recovery(limit, max_recovery_attempts=5, stale_after_seconds=None) -> list[uuid.UUID]
    • increment_recovery_attempts(saga_id, new_status=None) -> None
  • SqlAlchemy: The saga_executions table gains a new column recovery_attempts (INTEGER, default 0). For existing databases, add the column and backfill if needed, for example:
    ALTER TABLE saga_executions ADD COLUMN recovery_attempts INTEGER NOT NULL DEFAULT 0;
  • Recovery jobs: Prefer storage.get_sagas_for_recovery(limit=..., max_recovery_attempts=..., stale_after_seconds=...) instead of custom queries to select sagas for recovery.
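The column migration from the SqlAlchemy note can be tried against a throwaway SQLite database before running it on a real one. In this sketch only the table name, the column name, and the ALTER TABLE statement come from the notes above; the other columns, the has_column helper, and the idempotency check are illustrative assumptions, and other database engines may need different syntax.

```python
import sqlite3

# Statement taken from the upgrade notes.
MIGRATION = (
    "ALTER TABLE saga_executions "
    "ADD COLUMN recovery_attempts INTEGER NOT NULL DEFAULT 0"
)


def has_column(conn, table, column):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)


def add_recovery_attempts_column(conn):
    """Apply the migration once; return True if the column was added."""
    if has_column(conn, "saga_executions", "recovery_attempts"):
        return False
    conn.execute(MIGRATION)
    conn.commit()
    return True


# Hypothetical pre-upgrade schema with one existing row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE saga_executions (id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO saga_executions (id, status) VALUES ('saga-1', 'RUNNING')"
)
conn.commit()

applied_first = add_recovery_attempts_column(conn)   # adds the column
applied_second = add_recovery_attempts_column(conn)  # no-op on the second run
existing_attempts = conn.execute(
    "SELECT recovery_attempts FROM saga_executions WHERE id = 'saga-1'"
).fetchone()[0]
```

In SQLite, the NOT NULL DEFAULT 0 clause gives existing rows the value 0, so no separate backfill is needed there.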

@codspeed-hq
Contributor

codspeed-hq bot commented Jan 28, 2026

CodSpeed Performance Report

Merging this PR will not alter performance

Comparing feature-add-saga-attemptes (5bbc044) with master (69e0c9b)

Summary

✅ 11 untouched benchmarks

@lukashuk-da
Collaborator

Approve

@vadikko2 vadikko2 merged commit a1427f3 into master Jan 28, 2026
8 checks passed
3 participants