
Conversation

@Dairus01 commented on Dec 20, 2025

This PR introduces a centralized, opt-in retry and backoff
mechanism for network operations in the Bittensor SDK.

Key points:

  • Adds a shared retry utility with sync and async support
  • Integrates retries at outbound network boundaries only
    (Dendrite and Subtensor)
  • Retries are disabled by default to preserve existing behavior
  • Axon logic is intentionally untouched
  • Async streaming calls are excluded by design
  • Includes unit tests covering retry success, exhaustion,
    disabled behavior, and default exception safety

This change improves SDK reliability under transient network
failures without affecting protocol logic or consensus behavior.
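
For concreteness, here is a minimal sketch of the opt-in mechanism described above. The module path (bittensor/utils/retry.py), the helper names retry_call/retry_async, and the BT_RETRY_ENABLED flag come from this PR; the exact signature, defaults, and exception tuple shown here are assumptions reconstructed from the snippets quoted later in the review.

import os
import random
import time
from typing import Any, Callable, Tuple, Type

_RETRY_BACKOFF_FACTOR = 2.0  # constant name taken from the quoted snippets; value assumed


def _backoff_delay(attempt: int, base_delay: float, max_delay: float) -> float:
    # Exponential backoff capped at max_delay, with jitter in [0.5, 1.5) of the delay.
    delay = min(max_delay, base_delay * (_RETRY_BACKOFF_FACTOR**attempt))
    return delay * (0.5 + random.random())


def retry_call(
    func: Callable[..., Any],
    *args: Any,
    exceptions: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 10.0,
    **kwargs: Any,
) -> Any:
    # Retries are opt-in: without BT_RETRY_ENABLED, this is a single plain call.
    if os.environ.get("BT_RETRY_ENABLED", "").lower() not in ("1", "true"):
        return func(*args, **kwargs)
    for attempt in range(max_attempts):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: propagate the last exception unchanged
            time.sleep(_backoff_delay(attempt, base_delay, max_delay))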

Contribution by Gittensor, learn more at https://gittensor.io/

@Dairus01 (Author)

@basfroman, I would love to get your review on this.

@Dairus01 (Author)

All CI fixes have been applied:

  • Ruff formatting corrected
  • Subtensor unit test failure fixed via defensive handling of substrate.query
  • E2E Subtensor workflow fixed to avoid unauthenticated gh usage

Please let me know if you’d prefer these commits squashed.

@basfroman, could you please re-review?

@thewhaleking (Contributor)

The code seems decent, but I'm not sure this is a good thing.

Generally speaking, an SDK should not be opinionated. One of the big things we got away from in SDKv10 was error handling within the SDK. We decided it's best to leave it up to users how to handle exceptions/errors/retrying. This very much seems like a step back toward the opinionated versions of old.

@Dairus01 force-pushed the feat/centralized-retry branch from 1adebf1 to 5e3c65f on December 21, 2025 at 21:13
@Dairus01 (Author)

> The code seems decent, but I'm not sure this is a good thing.
>
> Generally speaking, an SDK should not be opinionated. One of the big things we got away from in SDKv10 was error handling within the SDK. We decided it's best to leave it up to users how to handle exceptions/errors/retrying. This very much seems like a step back toward the opinionated versions of old.

This PR has been updated to remove all default retry behavior from core SDK components, while preserving the retry utility as an optional helper.


What Changed

1. Retry Utility Remains Available

  • bittensor/utils/retry.py is kept intact
  • Provides:
    • retry_call (sync)
    • retry_async (async)
  • Configuration remains explicit and opt-in
  • This utility is not used internally by the SDK

The retry module is now a user-level helper, intended for developers who want a shared, consistent retry mechanism without re-implementing it themselves.
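
A short usage sketch, assuming the shape sketched earlier; the target function and the keyword names (exceptions, max_attempts, base_delay) are illustrative placeholders, not a confirmed SDK surface:

from bittensor.utils.retry import retry_call

def fetch_difficulty() -> int:
    # Placeholder for any transient-failure-prone call, e.g. a subtensor query.
    return 10_000

# Opting in is explicit: the SDK never calls retry_call on the user's behalf.
difficulty = retry_call(
    fetch_difficulty,
    exceptions=(ConnectionError, TimeoutError),
    max_attempts=5,
    base_delay=0.5,
)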


2. Retry Removed from Core SDK Paths

The SDK itself no longer applies retries anywhere:

  • Dendrite
    • All retry wrapping has been removed
    • Network calls behave exactly as before
  • Subtensor
    • All retry wrapping has been removed
    • Direct calls to substrate.query are restored

This ensures:

  • No automatic retries
  • No implicit error handling
  • Exceptions always propagate exactly as they did prior to this PR

3. Defensive Bug Fix Retained

The fix to Subtensor.get_hyperparameter remains:

  • Guards against substrate.query returning None
  • Guards against results without a .value attribute
  • This was required to satisfy existing unit tests and prevents an AttributeError

This change is a bug fix, not retry logic, and is independent of the retry discussion.
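
In code, the retained guard has the following shape; the query call itself is elided here, and the guard lines are quoted verbatim in the review below:

result = self.substrate.query(...)  # may legitimately return None

if result is None:
    return None

if hasattr(result, "value"):
    return result.value

return result  # fall back to the raw result when it has no .value attribute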


Resulting Behavior

  • SDK behavior is fully non-opinionated
  • No retry, backoff, or error policy is enforced
  • Users who want retries must explicitly opt in by using the helper
  • Existing applications observe no behavioral change

@basfroman (Collaborator)

Hi @Dairus01, I'll review this PR when I get the time. But what @thewhaleking said in this comment is 100% in line with our plan.

@Dairus01 (Author)

> Hi @Dairus01, I'll review this PR when I get the time. But what @thewhaleking said in this comment is 100% in line with our plan.

Thanks, I'm looking forward to your review. After @thewhaleking's comment, I made the retry optional.

Instead of a centralized retry, it is now an optional helper that you enable explicitly, with a configurable maximum number of retry attempts, a base delay, a maximum delay, and other options.

I would be happy to improve anything required.

@basfroman (Collaborator)

> Thanks, I'm looking forward to your review. After @thewhaleking's comment, I made the retry optional.
>
> Instead of a centralized retry, it is now an optional helper that you enable explicitly, with a configurable maximum number of retry attempts, a base delay, a maximum delay, and other options.
>
> I would be happy to improve anything required.

Sounds good to me. I'll get back to this PR next Monday.

@Dairus01 (Author)

> Sounds good to me. I'll get back to this PR next Monday.

I'll be looking forward to it.

@basfroman (Collaborator) left a comment

The idea is good as a standalone feature, but not as an integrated SDK solution. You need to fix what I mentioned to move forward.

Comment on lines +574 to +580
if result is None:
    return None

if hasattr(result, "value"):
    return result.value

return result
@basfroman (Collaborator):

revert it

Comment on lines +827 to +832
return self.substrate.runtime_call(
    api=runtime_api,
    method=method,
    params=params,
    block_hash=block_hash,
).value
@basfroman (Collaborator):

Good, but apply the same changes to query_runtime_api in async_subtensor. Sync and async subtensors have to be consistent.
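
A hedged sketch of the matching async change, assuming async_subtensor's substrate interface exposes an awaitable runtime_call with the same named parameters:

async def query_runtime_api(self, runtime_api, method, params, block_hash=None):
    # Mirrors the sync version: named parameters, then .value on the awaited result.
    result = await self.substrate.runtime_call(
        api=runtime_api,
        method=method,
        params=params,
        block_hash=block_hash,
    )
    return result.value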

import logging
from typing import Type, Tuple, Optional, Callable, Any, Union

logger = logging.getLogger(__name__)
@basfroman (Collaborator):

Suggested change:
-logger = logging.getLogger(__name__)
+logger = logging.getLogger("bittensor.utils.retry")

Comment on lines 654 to 668
async def _make_stream_request():
    async with (await self.session).post(
        url,
        headers=synapse.to_headers(),
        json=synapse.model_dump(),
        timeout=aiohttp.ClientTimeout(total=timeout),
    ) as response:
        # Use synapse subclass' process_streaming_response method to yield the response chunks
        async for chunk in synapse.process_streaming_response(response):  # type: ignore
            yield chunk  # Yield each chunk as it's processed
        json_response = synapse.extract_response_json(response)

        # Process the server response
        self.process_server_response(response, json_response, synapse)

@basfroman (Collaborator):

Looks like an unfinished refactoring. Remove this from the current PR. Create a separate PR for this logic. Cover it with tests that actually show that you benefit from reading large responses.

    return None  # Should not be reached


async def retry_async(
@basfroman (Collaborator):

What if func is not an async one?
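
One possible answer, as a sketch: detect the mismatch up front with inspect.iscoroutinefunction and fail fast, rather than awaiting something that is not awaitable:

import inspect

async def retry_async(func, *args, **kwargs):
    if not inspect.iscoroutinefunction(func):
        raise TypeError(f"retry_async expects an async callable, got {func!r}")
    return await func(*args, **kwargs)  # retry loop elided for brevity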

    return delay * (0.5 + random.random())


def retry_call(
@basfroman (Collaborator):

What if func is not a sync one?
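
And the mirror-image guard for the sync wrapper, also a sketch: reject coroutine functions, since calling one returns a coroutine object that the retry loop would mistake for a successful result:

import inspect

def retry_call(func, *args, **kwargs):
    if inspect.iscoroutinefunction(func):
        raise TypeError(
            f"retry_call expects a sync callable, got {func!r}; use retry_async instead"
        )
    return func(*args, **kwargs)  # retry loop elided for brevity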


if last_exception:
    raise last_exception
return None  # Should not be reached
@basfroman (Collaborator):

Replace with assert False, "Unreachable code" or remove. The same below.
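
Applied to the quoted lines, the suggestion would read roughly:

if last_exception:
    raise last_exception
assert False, "Unreachable code"  # replaces the silent return None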

"""Calculates backoff time with exponential backoff and jitter."""
delay = min(max_delay, base_delay * (_RETRY_BACKOFF_FACTOR**attempt))
# Add jitter: random value between 0 and delay
return delay * (0.5 + random.random())
@basfroman (Collaborator):

This is a gamble, but not a stable solution.
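
For reference, a commonly cited alternative is "full jitter" (popularized by an AWS architecture blog post): sample uniformly from [0, capped delay]. Unlike delay * (0.5 + random()), which can overshoot the cap by up to 50%, the result never exceeds max_delay:

import random

def full_jitter_delay(attempt: int, base_delay: float, max_delay: float) -> float:
    # Uniform in [0, min(max_delay, base_delay * 2**attempt)]; never exceeds the cap.
    capped = min(max_delay, base_delay * (2.0**attempt))
    return random.uniform(0.0, capped)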

Comment on lines 59 to 62
"""
Synchronous retry wrapper.
If BT_RETRY_ENABLED is False, executes the function exactly once.
@basfroman (Collaborator):

Describe the docstring properly. The same for the async version.

Copilot AI review requested due to automatic review settings December 30, 2025 09:00
Copilot AI left a comment

Pull request overview

This PR introduces a centralized retry utility with synchronous and asynchronous support for handling transient network failures. The retry mechanism is opt-in (disabled by default via the BT_RETRY_ENABLED environment variable) and provides configurable exponential backoff with jitter. The PR also includes refactoring improvements to subtensor.py for more defensive null checking and named-parameter usage, and updates the GitHub workflow to simplify label reading.

Key changes:

  • Adds a new retry utility module (bittensor/utils/retry.py) with retry_call and retry_async functions
  • Includes comprehensive unit tests covering retry success, exhaustion, and disabled behavior
  • Refactors get_hyperparameter and query_runtime_api in subtensor.py for better null safety and code clarity

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

Summary per file:

  • bittensor/utils/retry.py: New retry utility with sync/async support, exponential backoff with jitter, and environment-based configuration
  • tests/unit_tests/utils/test_retry.py: Comprehensive unit tests for retry functionality covering success, failure, and disabled scenarios
  • bittensor/core/subtensor.py: Refactored for defensive null checking and explicit attribute access with named parameters
  • bittensor/core/dendrite.py: Added (but unused) stream request function; appears to be incomplete integration
  • .github/workflows/e2e-subtensor-tests.yaml: Simplified label reading by using jq instead of GitHub CLI


Comment on lines 654 to 668
async def _make_stream_request():
    async with (await self.session).post(
        url,
        headers=synapse.to_headers(),
        json=synapse.model_dump(),
        timeout=aiohttp.ClientTimeout(total=timeout),
    ) as response:
        # Use synapse subclass' process_streaming_response method to yield the response chunks
        async for chunk in synapse.process_streaming_response(response):  # type: ignore
            yield chunk  # Yield each chunk as it's processed
        json_response = synapse.extract_response_json(response)

        # Process the server response
        self.process_server_response(response, json_response, synapse)

Copilot AI, Dec 30, 2025:

This internal function _make_stream_request is defined but never called, making it unreachable dead code. The original streaming request logic (lines 669-681) remains active and is duplicated here. Either this function should be called to replace the existing code block, or it should be removed if it was added accidentally.

Suggested change:
-async def _make_stream_request():
-    async with (await self.session).post(
-        url,
-        headers=synapse.to_headers(),
-        json=synapse.model_dump(),
-        timeout=aiohttp.ClientTimeout(total=timeout),
-    ) as response:
-        # Use synapse subclass' process_streaming_response method to yield the response chunks
-        async for chunk in synapse.process_streaming_response(response):  # type: ignore
-            yield chunk  # Yield each chunk as it's processed
-        json_response = synapse.extract_response_json(response)
-        # Process the server response
-        self.process_server_response(response, json_response, synapse)

Comment on lines +91 to +93
if last_exception:
    raise last_exception
return None  # Should not be reached
Copilot AI, Dec 30, 2025:

The lines 91-93 create unreachable code. After the loop completes naturally (without breaking or returning), line 91 checks if last_exception: and raises it, but this can only happen when _max_attempts is 0 or less. In all normal cases, the function will either return successfully within the loop (line 76) or raise an exception on the last attempt (line 83). Consider removing these unreachable lines or documenting why they exist.

Comment on lines +140 to +142
if last_exception:
    raise last_exception
return None  # Should not be reached
Copilot AI, Dec 30, 2025:

The lines 140-142 create unreachable code. After the loop completes naturally (without breaking or returning), line 140 checks if last_exception: and raises it, but this can only happen when _max_attempts is 0 or less. In all normal cases, the function will either return successfully within the loop (line 125) or raise an exception on the last attempt (line 132). Consider removing these unreachable lines or documenting why they exist.

@@ -0,0 +1,112 @@
import pytest
import time
Copilot AI, Dec 30, 2025:

Import of 'time' is not used.

Suggested change:
-import time

@@ -0,0 +1,112 @@
import pytest
import time
import asyncio
Copilot AI, Dec 30, 2025:

Import of 'asyncio' is not used.

Suggested change:
-import asyncio

Dairus01 and others added 9 commits on December 30, 2025 at 01:05

Removed the internal function for making HTTP POST requests and processing responses.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>