Feat/centralized retry #3227
base: staging
Conversation
@basfroman, please, I would love to get your review on this.

All CI fixes have been applied. @basfroman, could you please re-review?
The code seems decent, but I'm not sure this is a good thing. Generally speaking, an SDK should not be opinionated. One of the big things we got away from in SDKv10 was error handling within the SDK. We decided it's best to leave it up to users how to handle exceptions/errors/retrying. This very much seems like a step back toward those opinionated versions of old.
Force-pushed from 1adebf1 to 5e3c65f (compare)
This PR has been updated to remove all default retry behavior from core SDK components, while **preserving the retry utility as an optional helper**.

**What Changed**

1. **Retry Utility Remains Available**

   The retry module is now a user-level helper, intended for developers who want a shared, consistent retry mechanism without re-implementing it themselves.

2. **Retry Removed from Core SDK Paths**

   The SDK itself no longer applies retries anywhere. This ensures:

3. **Defensive Bug Fix Retained**

   The fix to

   This change is a bug fix, not retry logic, and is independent of the retry discussion.

**Resulting Behavior**
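In practice, the resulting behavior is that the SDK performs no retries on its own; a user who wants them opts in explicitly. A rough usage sketch follows (the keyword names and exact signature are assumptions for illustration, not necessarily the final API of `bittensor/utils/retry.py`):

```python
import os

# Hypothetical opt-in: the helper is assumed to check this flag at call time.
os.environ["BT_RETRY_ENABLED"] = "1"

from bittensor.utils.retry import retry_call  # module added by this PR


def fetch_hyperparameter():
    # Any callable that may fail transiently, e.g. a subtensor query.
    ...


# Assumed keyword names, for illustration only.
result = retry_call(
    fetch_hyperparameter,
    max_attempts=3,   # give up after three tries
    base_delay=0.5,   # seconds before the first backoff
    max_delay=8.0,    # cap on the exponential delay
)
```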
Hi @Dairus01, I'll review this PR when I get the time. But what @thewhaleking said in this comment is 100% in line with our plan.
Thanks, I'm looking forward to your review. After @thewhaleking's comment, I made the retry optional: instead of a centralized retry, it is now an opt-in retry that you can enable if you want it, with a configurable maximum number of retry attempts, a base delay, a maximum delay, and other options. I would be happy to improve anything required.
Sounds good to me. I'll get back to this PR next Monday.
I'll be looking forward to it.
basfroman left a comment
The idea is good as a standalone feature, but not as an integrated SDK solution.
You still need to fix what I mentioned to move forward.
```python
if result is None:
    return None

if hasattr(result, "value"):
    return result.value

return result
```
revert it
```python
return self.substrate.runtime_call(
    api=runtime_api,
    method=method,
    params=params,
    block_hash=block_hash,
).value
```
Good, but apply the same changes to `query_runtime_api` in `async_subtensor`. Sync and async subtensors have to be consistent.
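For reference, the matching async change might look roughly like the fragment below. This is only a sketch mirroring the quoted sync code; it assumes the async substrate's `runtime_call` is awaitable and that the surrounding method takes the same parameters.

```python
# Sketch: mirroring the sync refactor inside async_subtensor's query_runtime_api.
result = await self.substrate.runtime_call(
    api=runtime_api,
    method=method,
    params=params,
    block_hash=block_hash,
)
return result.value
```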
bittensor/utils/retry.py (Outdated)
```python
import logging
from typing import Type, Tuple, Optional, Callable, Any, Union

logger = logging.getLogger(__name__)
```
Suggested change:

```diff
-logger = logging.getLogger(__name__)
+logger = logging.getLogger("bittensor.utils.retry")
```
bittensor/core/dendrite.py (Outdated)
```python
async def _make_stream_request():
    async with (await self.session).post(
        url,
        headers=synapse.to_headers(),
        json=synapse.model_dump(),
        timeout=aiohttp.ClientTimeout(total=timeout),
    ) as response:
        # Use synapse subclass' process_streaming_response method to yield the response chunks
        async for chunk in synapse.process_streaming_response(response):  # type: ignore
            yield chunk  # Yield each chunk as it's processed
        json_response = synapse.extract_response_json(response)

        # Process the server response
        self.process_server_response(response, json_response, synapse)
```
Looks like an unfinished refactoring. Remove this from the current PR. Create a separate PR for this logic. Cover it with tests that actually show that you benefit from reading large responses.
```python
    return None  # Should not be reached


async def retry_async(
```
What if `func` is not an async one?
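One way to handle this (a sketch only, not the PR's implementation) is to detect plain callables up front and either reject them or offload them to a thread:

```python
import asyncio
from typing import Any, Callable


async def _run_possibly_sync(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Sketch: await `func` if it is a coroutine function, otherwise run it in a thread."""
    if asyncio.iscoroutinefunction(func):
        return await func(*args, **kwargs)
    # Plain callables are offloaded so they do not block the event loop.
    return await asyncio.to_thread(func, *args, **kwargs)
```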
```python
    return delay * (0.5 + random.random())


def retry_call(
```
What if `func` is not a sync one?
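A simple guard for the sync wrapper could look like this (a sketch, with a hypothetical helper name; the actual check, if any, is up to the PR author):

```python
import asyncio
from typing import Any, Callable


def _require_sync(func: Callable[..., Any]) -> None:
    """Sketch: fail fast if a coroutine function is passed to the sync wrapper."""
    if asyncio.iscoroutinefunction(func):
        raise TypeError(
            "retry_call expects a synchronous callable; use retry_async for coroutines."
        )
```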
```python
    if last_exception:
        raise last_exception
    return None  # Should not be reached
```
Replace with `assert False, "Unreachable code"` or remove it. The same below.
bittensor/utils/retry.py (Outdated)
| """Calculates backoff time with exponential backoff and jitter.""" | ||
| delay = min(max_delay, base_delay * (_RETRY_BACKOFF_FACTOR**attempt)) | ||
| # Add jitter: random value between 0 and delay | ||
| return delay * (0.5 + random.random()) |
This is a gamble, but not a stable solution.
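For context, a widely used alternative is "full jitter", which spreads the sleep uniformly over the whole backoff window instead of the fixed 0.5x-1.5x band above. A minimal sketch, with `_RETRY_BACKOFF_FACTOR` assumed to be 2.0 for illustration:

```python
import random

_RETRY_BACKOFF_FACTOR = 2.0  # assumed value, for illustration only


def _full_jitter_backoff(attempt: int, base_delay: float, max_delay: float) -> float:
    """Sketch of the 'full jitter' strategy: sleep a uniform amount in [0, capped delay]."""
    capped = min(max_delay, base_delay * (_RETRY_BACKOFF_FACTOR ** attempt))
    return random.uniform(0.0, capped)
```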
bittensor/utils/retry.py (Outdated)
| """ | ||
| Synchronous retry wrapper. | ||
| If BT_RETRY_ENABLED is False, executes the function exactly once. |
Describe the docstring properly. The same for the async version.
Pull request overview
This PR introduces a centralized retry utility with synchronous and asynchronous support for handling transient network failures. The retry mechanism is opt-in (disabled by default via the BT_RETRY_ENABLED environment variable) and provides configurable exponential backoff with jitter. The PR also includes refactoring improvements to subtensor.py for more defensive null checking and named parameter usage, and updates the GitHub workflow to simplify label reading.

Key changes:
- Adds a new retry utility module (`bittensor/utils/retry.py`) with `retry_call` and `retry_async` functions
- Includes comprehensive unit tests covering retry success, exhaustion, and disabled behavior
- Refactors `get_hyperparameter` and `query_runtime_api` in `subtensor.py` for better null safety and code clarity
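As an illustration of the opt-in switch described above, here is a minimal sketch of how such an environment flag is commonly read; the actual parsing in `bittensor/utils/retry.py` may differ:

```python
import os


def _retry_enabled() -> bool:
    """Sketch: retries stay off unless BT_RETRY_ENABLED is explicitly set to a truthy value."""
    return os.getenv("BT_RETRY_ENABLED", "0").strip().lower() in ("1", "true", "yes")
```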
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
Summary per file:
| File | Description |
|---|---|
| bittensor/utils/retry.py | New retry utility with sync/async support, exponential backoff with jitter, and environment-based configuration |
| tests/unit_tests/utils/test_retry.py | Comprehensive unit tests for retry functionality covering success, failure, and disabled scenarios |
| bittensor/core/subtensor.py | Refactored for defensive null checking and explicit attribute access with named parameters |
| bittensor/core/dendrite.py | Added (but unused) stream request function - appears to be incomplete integration |
| .github/workflows/e2e-subtensor-tests.yaml | Simplified label reading by using jq instead of GitHub CLI |
bittensor/core/dendrite.py (Outdated)
```python
async def _make_stream_request():
    async with (await self.session).post(
        url,
        headers=synapse.to_headers(),
        json=synapse.model_dump(),
        timeout=aiohttp.ClientTimeout(total=timeout),
    ) as response:
        # Use synapse subclass' process_streaming_response method to yield the response chunks
        async for chunk in synapse.process_streaming_response(response):  # type: ignore
            yield chunk  # Yield each chunk as it's processed
        json_response = synapse.extract_response_json(response)

        # Process the server response
        self.process_server_response(response, json_response, synapse)
```
Copilot AI (Dec 30, 2025)
This internal function _make_stream_request is defined but never called, making it unreachable dead code. The original streaming request logic (lines 669-681) remains active and is duplicated here. Either this function should be called to replace the existing code block, or it should be removed if it was added accidentally.
Suggested change (removes the unused block):

```diff
-async def _make_stream_request():
-    async with (await self.session).post(
-        url,
-        headers=synapse.to_headers(),
-        json=synapse.model_dump(),
-        timeout=aiohttp.ClientTimeout(total=timeout),
-    ) as response:
-        # Use synapse subclass' process_streaming_response method to yield the response chunks
-        async for chunk in synapse.process_streaming_response(response):  # type: ignore
-            yield chunk  # Yield each chunk as it's processed
-        json_response = synapse.extract_response_json(response)
-        # Process the server response
-        self.process_server_response(response, json_response, synapse)
```
```python
    if last_exception:
        raise last_exception
    return None  # Should not be reached
```
Copilot AI (Dec 30, 2025)
The lines 91-93 create unreachable code. After the loop completes naturally (without breaking or returning), line 91 checks if last_exception: and raises it, but this can only happen when _max_attempts is 0 or less. In all normal cases, the function will either return successfully within the loop (line 76) or raise an exception on the last attempt (line 83). Consider removing these unreachable lines or documenting why they exist.
```python
    if last_exception:
        raise last_exception
    return None  # Should not be reached
```
Copilot AI (Dec 30, 2025)
The lines 140-142 create unreachable code. After the loop completes naturally (without breaking or returning), line 140 checks if last_exception: and raises it, but this can only happen when _max_attempts is 0 or less. In all normal cases, the function will either return successfully within the loop (line 125) or raise an exception on the last attempt (line 132). Consider removing these unreachable lines or documenting why they exist.
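To make the observation concrete, here is a minimal sketch of the loop shape being discussed (names and structure are assumptions; the actual code in retry.py may differ). With `max_attempts >= 1`, the last iteration either returns or re-raises, so the tail after the loop only runs when `max_attempts <= 0`:

```python
from typing import Any, Callable


def retry_call_sketch(func: Callable[..., Any], *args: Any, max_attempts: int = 3, **kwargs: Any) -> Any:
    """Sketch only: illustrates why the post-loop raise/return is normally unreachable."""
    last_exception = None
    for attempt in range(max_attempts):
        try:
            return func(*args, **kwargs)        # success: returns inside the loop
        except Exception as exc:                # noqa: BLE001 - sketch only
            last_exception = exc
            if attempt == max_attempts - 1:
                raise                           # last attempt: re-raises inside the loop
    # Only reachable when max_attempts <= 0.
    if last_exception:
        raise last_exception
    return None
```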
```
@@ -0,0 +1,112 @@
import pytest
import time
```
Copilot AI (Dec 30, 2025)
Import of 'time' is not used.
Suggested change:

```diff
-import time
```
```
@@ -0,0 +1,112 @@
import pytest
import time
import asyncio
```
Copilot AI (Dec 30, 2025)
Import of 'asyncio' is not used.
Suggested change:

```diff
-import asyncio
```
Removed the internal function for making HTTP POST requests and processing responses.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR introduces a centralized, opt-in retry and backoff mechanism for network operations in the Bittensor SDK.

Key points:
- Applies to network operations in the core components (Dendrite and Subtensor)
- Unit tests cover retry success, exhaustion, disabled behavior, and default exception safety

This change improves SDK reliability under transient network failures without affecting protocol logic or consensus behavior.
Contribution by Gittensor, learn more at https://gittensor.io/