Conversation

@yuanjieding-db
Collaborator

What changes are proposed in this pull request?

WHAT

  • Extend the retry function with a new max_attempts parameter, allowing clients to stop retrying and fail after a set number of attempts
  • Remove 500 from the FilesExt retryable status codes
  • Add a new config option to set the retry attempt limit for FilesExt
  • Update the FilesExt retry logic to fail after that many attempts

WHY

  • 500 errors shouldn't be retried
  • FilesExt should always prioritize fallback over retries to avoid regressions
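
The combined limit logic can be sketched as follows. This is illustrative only: the decorator shape, backoff parameters, and `is_retryable` hook are assumptions rather than the SDK's actual implementation; only the `max_attempts` parameter and the `TimeoutError` behavior come from this PR.

```python
import functools
import random
import time


def retried(*, timeout=300.0, max_attempts=None, is_retryable=None):
    """Sketch of a retry decorator that stops at whichever limit is hit
    first: the wall-clock timeout or the max_attempts budget."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout
            attempt = 0
            last_err = None
            while time.monotonic() < deadline:
                attempt += 1
                try:
                    return func(*args, **kwargs)
                except Exception as err:
                    if is_retryable is not None and not is_retryable(err):
                        raise  # non-retryable errors propagate immediately
                    last_err = err
                # Stop once the attempt budget is exhausted.
                if max_attempts is not None and attempt >= max_attempts:
                    raise TimeoutError(
                        f"Exceeded max retry attempts ({max_attempts})"
                    ) from last_err
                # Exponential backoff with jitter, capped at 10 seconds
                # (illustrative values).
                time.sleep(min(10.0, 2.0 ** (attempt - 1)) + random.random())
            raise TimeoutError(f"Timed out after {timeout}s") from last_err

        return wrapper

    return decorator
```

With `max_attempts=None` the decorator behaves as before (timeout only), which is why the new parameter is backward compatible.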

How is this tested?

Unit tests were updated to reflect the change.

Comment on lines 236 to 239
# Maximum number of retry attempts for FilesExt cloud API operations.
# This works in conjunction with retry_timeout_seconds - whichever limit
# is hit first will stop the retry loop.
files_ext_cloud_api_max_retries: int = 3
Contributor

Maybe we can use this as a temporary parameter, since our end goal is not to fall back.

Suggested change
# Maximum number of retry attempts for FilesExt cloud API operations.
# This works in conjunction with retry_timeout_seconds - whichever limit
# is hit first will stop the retry loop.
files_ext_cloud_api_max_retries: int = 3
# Maximum number of retry attempts for FilesExt cloud API operations.
# This works in conjunction with retry_timeout_seconds - whichever limit
# is hit first will stop the retry loop.
experimental_files_ext_cloud_api_max_retries: int = 3

Collaborator Author

Updated


# Determine which limit was hit
if max_attempts is not None and attempt > max_attempts:
    raise TimeoutError(f"Exceeded max retry attempts ({max_attempts})") from last_err
Contributor

Do we have a better error to represent this error? TimeoutError feels a bit odd for this case, as the function is not actually timed out.

Collaborator Author

Ideally we would use a custom type RetryError, with TimeoutError and MaxRetryExceededError as its derived types, so that users who don't care why retries were exhausted can catch RetryError while the detail is preserved.
However, since we have been using the built-in TimeoutError and users may already depend on this behavior, it is risky to change it to a different error.
If we introduced a new error type for the max-retry-exceeded case, it would be harder for the upper layers to handle retry errors: they would need to catch both errors manually.

I don't see a better solution here, unless we rewrite the retry logic completely or make FilesExt use a different retry library.
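
For reference, the hierarchy discussed here could look like the sketch below (hypothetical, not part of this PR). Deriving the timeout variant from the built-in TimeoutError would even keep existing `except TimeoutError:` handlers working:

```python
class RetryError(Exception):
    """Base class for retry exhaustion: catch this if you don't care
    why retries stopped."""


class RetryTimeoutError(RetryError, TimeoutError):
    """Retry loop exceeded the wall-clock timeout.

    Also derives from the built-in TimeoutError, so existing callers
    that catch TimeoutError keep working unchanged.
    """


class MaxRetryExceededError(RetryError):
    """Retry loop exceeded the max_attempts budget."""
```

The trade-off remains as described above: callers that want to distinguish the two cases must still know both concrete types.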

Comment on lines 334 to 337
def failing_function():
    nonlocal call_count
    call_count += 1
    raise ValueError("test error")
Contributor

Can we make this failing function sleep for a second? I am a bit worried that this test will become flaky, because it is possible to retry 100 times in 2 seconds.

Collaborator Author

It wouldn't because of the backoff logic, right?
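
As a sanity check on that claim: assuming exponential backoff that starts at 1 s and doubles per attempt (illustrative parameters; the SDK's actual backoff may differ), the minimum total sleep grows fast enough that 100 attempts in 2 seconds is impossible:

```python
def min_total_sleep(attempts, cap=10.0):
    """Minimum seconds slept before the given attempt count is reached,
    assuming delays of 1, 2, 4, ... seconds capped at `cap`."""
    return sum(min(cap, 2.0 ** i) for i in range(attempts - 1))


print(min_total_sleep(3))    # 3.0   (1s + 2s)
print(min_total_sleep(5))    # 15.0  (1 + 2 + 4 + 8)
print(min_total_sleep(100))  # 965.0 once the 10s cap dominates
```

Even five attempts already require at least 15 s of sleep under these assumptions, so the 2-second scenario cannot occur with backoff in place.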

Comment on lines 372 to 373
def test_max_attempts_none_preserves_backward_compatibility():
    """Test that max_attempts=None only uses timeout (backward compatibility)."""
Contributor

Suggested change
def test_max_attempts_none_preserves_backward_compatibility():
    """Test that max_attempts=None only uses timeout (backward compatibility)."""
def test_max_attempts_none():
    """Test that max_attempts=None only uses timeout."""

I think we don't need to mention that this test is for backward compatibility because, almost always, a unit test is for regression catching.

Collaborator Author

Updated

    assert call_count == attempts


def test_max_attempts_respected():
Contributor

Can we make these a table test to simplify the tests?

with pytest.raises(TimeoutError) as exc_info:
    failing_function()

# Should have attempted 3 times (initial + 2 retries)
Contributor

Suggested change
# Should have attempted 3 times (initial + 2 retries)
# Should have attempted 3 times (initial + 2 retries).

Period after a sentence, ditto all.

Collaborator Author

Updated.

@github-actions

github-actions bot commented Feb 2, 2026

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-py

Inputs:

  • PR number: 1211
  • Commit SHA: 2cfa627d0cb5ba41bb17e33fb23985d1dda4cd01

Checks will be approved automatically on success.
