SonicRetry combined PR [CMSSW_15_1_0_pre6] #24
kakwok wants to merge 23 commits into fastmachinelearning:master from
Conversation
…r method in TritonClient. Update BuildFile.xml and fix formatting in header files.
…tructor for TritonClient, and update BuildFile.xml to include Catch2 for testing.
…tests; remove old cfg
…lection; remove unused parameters and improve documentation.
```cpp
if (client_) {
  client_->evaluate();
} else {
  edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";
```
This should be an exception rather than a LogError. (It may actually need to be a return false or similar, because the call chain is client->finish() -> action->retry() -> action->eval(), and only finish() should actually emit an exception.)
This comment has not been addressed yet
It should be just return, similar to what client->evaluate does, if we want only finish() to throw an exception.
But we need to let finish() know that the eval() failed somehow... I guess it could directly call finish(false) as is done in evaluate()/handle_exception(). We may want to rethink this pattern in general, though, as I am becoming concerned that it will lead to excessive stack depth.
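The pattern discussed above can be sketched with stand-in types. FakeClient and RetryActionSketch are hypothetical names, not the real CMSSW classes: eval() reports failure through its return value, retry() just forwards it, and only finish() decides how to surface the error, keeping the stack shallow.

```cpp
#include <cassert>

// Hypothetical sketch; FakeClient and RetryActionSketch are stand-ins.
struct FakeClient {
  bool finished = false;
  bool success = false;
  void finish(bool ok) {
    finished = true;
    success = ok;
  }
};

struct RetryActionSketch {
  FakeClient* client_ = nullptr;

  // true only if there is a client to evaluate; no exception, no log-and-continue
  bool eval() const { return client_ != nullptr; }

  // retry() forwards eval()'s status; the caller feeds it to finish()
  bool retry() const { return eval(); }
};
```

In the real call chain this would let something like client->finish(action->retry()) collapse the failure signal without adding stack frames.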
HeterogeneousCore/SonicTriton/test/retry_action_diff_log_test.sh (outdated; resolved)
```cpp
if (client_) {
  client_->evaluate();
} else {
  edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";
```
This comment has not been addressed yet
```diff
-  //todo: use some algorithm to select server rather than just picking arbitrarily
-  const auto& server(servers_.find(serverName)->second);
-  return server;
+  const auto serverPair = servers_.find(serverName);
```
for clarity, this should be serverPairIter or similar (to indicate this is an iterator pointing to the pair<name,server> object, not the object itself)
```cpp
std::unique_ptr<tc::InferenceServerGrpcClient> client;
TRITON_THROW_IF_ERROR(
    tc::InferenceServerGrpcClient::Create(&client, server.url, false, server.useSsl, server.sslOptions),
    "TritonService(): unable to create inference context for " + serverName + " (" + server.url + ")",
    false);
```
since these exact lines now occur twice in this file, there should probably be a dedicated helper function.
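The suggested helper might look like the following minimal sketch. Every name here (Server, GrpcClientStub, createClient, makeServerClient) is an illustrative stand-in; the real version would wrap tc::InferenceServerGrpcClient::Create plus TRITON_THROW_IF_ERROR.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Stand-in server description (hypothetical).
struct Server {
  std::string url;
  bool useSsl = false;
};

// Stand-in for the gRPC client type (hypothetical).
struct GrpcClientStub {
  std::string url;
};

// Stand-in for the Triton client factory; returns false on failure.
inline bool createClient(std::unique_ptr<GrpcClientStub>& out, const Server& s) {
  if (s.url.empty())
    return false;
  out.reset(new GrpcClientStub{s.url});
  return true;
}

// The single place that creates a client or throws, replacing the duplicated lines.
inline std::unique_ptr<GrpcClientStub> makeServerClient(const std::string& name, const Server& server) {
  std::unique_ptr<GrpcClientStub> client;
  if (!createClient(client, server))
    throw std::runtime_error("TritonService(): unable to create inference context for " + name + " (" + server.url + ")");
  return client;
}
```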
```cpp
client->IsServerLive(&live);
client->IsServerReady(&ready);
```
if these remain false, will the subsequent server calls below fail?
```cpp
ServerHealth& health = acc->second;
health.live = live;
health.ready = ready;
health.failureCount = failures;
health.avgQueueTimeMs = avgQueueTimeMs / queue_count;
health.avgInferTimeMs = avgInferTimeMs / infer_count;
```
there may need to be additional logic behind these updates. right now, everything is just overwritten with the latest status of the server. but we may actually want to track some of these from the client's perspective: e.g. how many times in the last [time interval] did this server return live = false? how many times did calls fail from within this job? etc. (this may imply more quantities to track)
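One way to track health from the client's perspective, as suggested, is a time-windowed sample buffer instead of overwriting with the latest probe. This is a hypothetical sketch (HealthHistory is not part of the PR); it answers questions like "how many times in the last minute was this server not live?".

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <utility>

// Hypothetical client-side health history with a sliding time window.
struct HealthHistory {
  using Clock = std::chrono::steady_clock;
  std::deque<std::pair<Clock::time_point, bool>> liveSamples;
  std::chrono::seconds window{60};

  void record(bool live, Clock::time_point now = Clock::now()) {
    liveSamples.emplace_back(now, live);
    // drop samples older than the window
    while (!liveSamples.empty() && now - liveSamples.front().first > window)
      liveSamples.pop_front();
  }

  // how many probes in the window reported live == false
  int notLiveCount() const {
    int n = 0;
    for (const auto& s : liveSamples)
      if (!s.second)
        ++n;
    return n;
  }
};
```

The same windowed structure could hold per-job call failures or queue-time samples, which would address the "more quantities to track" concern.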
```cpp
} catch (const TritonException& e) {
  // mark existing entry unhealthy if present
  tbb::concurrent_hash_map<std::string, ServerHealth>::accessor acc;
```
typedef: using ServerHealthIter = tbb::concurrent_hash_map<std::string, ServerHealth>::accessor or similar
```cpp
  health.ready = false;
  }
} catch (const std::exception& e) {
  // fallback for other exceptions
```
if this does the same thing as the previous catch, they could be combined in one block (unless we eventually expect them to diverge)
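Merging the handlers works if TritonException ultimately derives from std::exception (cms::Exception does in CMSSW). A self-contained illustration with a stand-in exception type:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// TritonExceptionStub stands in for the real TritonException.
struct TritonExceptionStub : std::runtime_error {
  using std::runtime_error::runtime_error;
};

inline std::string probeHealth(bool throwTriton) {
  try {
    if (throwTriton)
      throw TritonExceptionStub("server unreachable");
    throw std::runtime_error("generic failure");
  } catch (const std::exception& e) {
    // single handler: mark the server unhealthy either way
    return std::string("unhealthy: ") + e.what();
  }
}
```

Keeping the blocks separate only pays off once the two paths are expected to diverge (e.g. Triton errors increment a dedicated counter).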
```cpp
if (!health.live || !health.ready)
  continue;  // skip unhealthy

// Select server according to rules:
// 1) lowest failureCount
// 2) tie-breaker: lowest avgQueueTimeMs
if (!bestServerName || health.failureCount < bestHealth.failureCount ||
    (health.failureCount == bestHealth.failureCount && health.avgQueueTimeMs < bestHealth.avgQueueTimeMs)) {
  bestServerName = serverName;
  bestHealth = health;
}
```
this logic should be encapsulated into a function, or even a plugin, so that different schemes can be specified
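Pulling the rule into a standalone function could look like the sketch below. ServerHealth, betterThan, and selectBest are simplified stand-ins for the real types; a plugin interface could later replace betterThan to allow different schemes.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Simplified stand-in for the real ServerHealth.
struct ServerHealth {
  bool live = true;
  bool ready = true;
  int failureCount = 0;
  double avgQueueTimeMs = 0.;
};

// Rule 1: lowest failureCount; rule 2: tie-break on lowest avgQueueTimeMs.
inline bool betterThan(const ServerHealth& a, const ServerHealth& b) {
  if (a.failureCount != b.failureCount)
    return a.failureCount < b.failureCount;
  return a.avgQueueTimeMs < b.avgQueueTimeMs;
}

inline std::optional<std::string> selectBest(const std::map<std::string, ServerHealth>& servers) {
  std::optional<std::string> best;
  for (const auto& [name, health] : servers) {
    if (!health.live || !health.ready)
      continue;  // skip unhealthy
    if (!best || betterThan(health, servers.at(*best)))
      best = name;
  }
  return best;
}
```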
```cpp
  }
}
if (verbose_ && bestServerName) {
  edm::LogInfo("Chosen server for model '" + modelName + "': " + *bestServerName +
```
should be:

```cpp
edm::LogInfo("TritonService") << [text];
```

```cpp
edm::LogWarning("RetryActionDiffServer")
    << "No alternative server found for model " << tritonClient->modelName();
```
this should be more than a warning; execution should stop in this case.
```cpp
std::optional<std::string> TritonService::getBestServer(const std::string& modelName,
                                                        const std::string& IgnoreServer) {
  std::optional<std::string> bestServerName;
```
I'm not sure about the use of optional here. If no remote server is found, the fallback should be used. If for some reason the fallback cannot be found or used, then execution should stop here.
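The resolution order described here can be sketched as a small pure function. resolveServer is hypothetical; in the real service the fallback would be the TritonService fallback server, and the final throw would be a framework exception.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical sketch: best remote server if available, otherwise the
// fallback, otherwise stop execution.
inline std::string resolveServer(const std::optional<std::string>& best,
                                 const std::optional<std::string>& fallback) {
  if (best)
    return *best;
  if (fallback)
    return *fallback;  // e.g. the TritonService fallback server
  throw std::runtime_error("no remote server available and no fallback configured");
}
```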
…in the review (fastmachinelearning#24) Address review comments + fixes for TDR regionizer (not used in default CMSSW configuration)
```cpp
edm::LogInfo("SonicClientBase") << "SonicCallFailed: call failed, no retry actions available after "
                                << totalTries_ << " tries.";
edm::Exception ex(edm::errors::ExternalFailure);
ex << "SonicCallFailed: call failed, no retry actions available after " << totalTries_ << " tries.";
```
This is repetitive. better:

```cpp
#include <sstream>
std::stringstream failmsg;
failmsg << "SonicCallFailed: call failed, no retry actions available after " << totalTries_ << " tries.";
edm::LogInfo("SonicClientBase") << failmsg.str();
edm::Exception ex(edm::errors::ExternalFailure);
ex << failmsg.str();
```
Rebased #23 to CMSSW_15_1_0_pre6.
PR description:
- RetryActionDiffServer for SonicTriton using TritonService's server registry; remove per-action alternate server parameters.
- TritonClient::updateServer(TritonService::Server::fallbackName) to switch servers on retry, per review guidance.
- HeterogeneousCore/SonicTriton/test/test_RetryActionDiffServer.cc (arms → updateServer(fallback) → no-op on second retry → exception path is caught).
- HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py with --retryAction {same,diff} and a verbose confirmation line.
- TestHeterogeneousCoreSonicTritonRetryActionDiff_Log to assert the selected retry policy.

PR validation:
- Built and ran unit/integration tests in CMSSW_15_1_0_pre6 area: scram b -j 8
- scram b runtests TEST=HeterogeneousCore/SonicTriton (passes)

TODO/To verify:
- Client.Retry is explicitly configured.