[fix](cloud) add fe to ms rpc metrics#60574
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This pull request adds comprehensive metrics tracking for FE (Frontend) to MS (Meta Service) RPC calls in cloud mode. The changes introduce per-method and aggregate RPC metrics including total calls, failures, retries, latency, and requests-per-second calculations.
Changes:
- Added new metrics infrastructure in CloudMetrics for tracking meta-service RPC calls (per-method and aggregate)
- Instrumented all RPC methods in MetaServiceProxy with metrics tracking using a new executeWithMetrics helper
- Added metric calculation logic in MetricCalculator for computing RPS rates from counter deltas
- Updated test signatures to match the new executeRequest method signature that includes methodName
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/metric/CloudMetrics.java | Defines per-method and aggregate meta-service RPC metrics (total, failed, retry, latency, RPS) |
| fe/fe-core/src/main/java/org/apache/doris/metric/MetricCalculator.java | Adds initialization and interval-based calculation logic for meta-service RPC rate metrics |
| fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java | Adds updateMetaServiceRpcPerSecond helper method for updating per-method RPS gauges |
| fe/fe-core/src/main/java/org/apache/doris/cloud/rpc/MetaServiceProxy.java | Instruments all RPC methods with metrics tracking, adds executeWithMetrics helper, updates executeRequest signature |
| fe/fe-core/src/test/java/org/apache/doris/cloud/rpc/MetaServiceProxyTest.java | Updates test calls to match new executeRequest signature with methodName parameter |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public Cloud.GetInstanceResponse getInstance(Cloud.GetInstanceRequest request) | ||
| throws RpcException { | ||
| long startTime = System.currentTimeMillis(); | ||
| String methodName = "getInstance"; | ||
| if (MetricRepo.isInit && Config.isCloudMode()) { | ||
| CloudMetrics.META_SERVICE_RPC_ALL_TOTAL.increase(1L); | ||
| CloudMetrics.META_SERVICE_RPC_TOTAL.getOrAdd(methodName).increase(1L); | ||
| } | ||
|
|
||
| try { | ||
| final MetaServiceClient client = getProxy(); | ||
| return client.getInstance(request); | ||
| Cloud.GetInstanceResponse response = client.getInstance(request); | ||
| if (MetricRepo.isInit && Config.isCloudMode()) { | ||
| CloudMetrics.META_SERVICE_RPC_LATENCY.getOrAdd(methodName) | ||
| .update(System.currentTimeMillis() - startTime); | ||
| } | ||
| return response; | ||
| } catch (Exception e) { | ||
| if (MetricRepo.isInit && Config.isCloudMode()) { | ||
| CloudMetrics.META_SERVICE_RPC_ALL_FAILED.increase(1L); | ||
| CloudMetrics.META_SERVICE_RPC_FAILED.getOrAdd(methodName).increase(1L); | ||
| CloudMetrics.META_SERVICE_RPC_LATENCY.getOrAdd(methodName) | ||
| .update(System.currentTimeMillis() - startTime); | ||
| } | ||
| throw new RpcException("", e.getMessage(), e); | ||
| } | ||
| } |
There was a problem hiding this comment.
The getInstance method duplicates the metrics tracking logic that exists in executeWithMetrics. This method should use executeWithMetrics like other RPC methods to avoid code duplication and ensure consistent metrics tracking.
Additionally, getInstance bypasses the retry logic in MetaServiceClientWrapper.executeRequest by directly calling getProxy() and client.getInstance(request), while all other methods benefit from the retry mechanism. This inconsistency could lead to different reliability characteristics for this method compared to others.
| @@ -242,10 +305,20 @@ public Future<Cloud.GetVersionResponse> getVisibleVersionAsync(Cloud.GetVersionR | |||
| new com.google.common.util.concurrent.FutureCallback<Cloud.GetVersionResponse>() { | |||
| @Override | |||
| public void onSuccess(Cloud.GetVersionResponse result) { | |||
| if (MetricRepo.isInit && Config.isCloudMode()) { | |||
| CloudMetrics.META_SERVICE_RPC_LATENCY.getOrAdd(methodName) | |||
| .update(System.currentTimeMillis() - startTime); | |||
| } | |||
| } | |||
|
|
|||
| @Override | |||
| public void onFailure(Throwable t) { | |||
| if (MetricRepo.isInit && Config.isCloudMode()) { | |||
| CloudMetrics.META_SERVICE_RPC_ALL_FAILED.increase(1L); | |||
| CloudMetrics.META_SERVICE_RPC_FAILED.getOrAdd(methodName).increase(1L); | |||
| CloudMetrics.META_SERVICE_RPC_LATENCY.getOrAdd(methodName) | |||
| .update(System.currentTimeMillis() - startTime); | |||
| } | |||
| if (finalClient != null) { | |||
| finalClient.shutdown(true); | |||
| } | |||
| @@ -254,6 +327,12 @@ public void onFailure(Throwable t) { | |||
| } | |||
| return future; | |||
| } catch (Exception e) { | |||
| if (MetricRepo.isInit && Config.isCloudMode()) { | |||
| CloudMetrics.META_SERVICE_RPC_ALL_FAILED.increase(1L); | |||
| CloudMetrics.META_SERVICE_RPC_FAILED.getOrAdd(methodName).increase(1L); | |||
| CloudMetrics.META_SERVICE_RPC_LATENCY.getOrAdd(methodName) | |||
| .update(System.currentTimeMillis() - startTime); | |||
| } | |||
| if (client != null) { | |||
| client.shutdown(true); | |||
| } | |||
| @@ -262,162 +341,164 @@ public void onFailure(Throwable t) { | |||
| } | |||
There was a problem hiding this comment.
The getVisibleVersionAsync method also duplicates the metrics tracking logic instead of using a consistent pattern. Unlike the other synchronous RPC methods that now use executeWithMetrics, this async method manually tracks metrics in multiple places (before the call, in onSuccess, in onFailure, and in the catch block).
Consider creating an async-specific metrics wrapper method similar to executeWithMetrics to reduce code duplication and ensure consistent metrics tracking across both synchronous and asynchronous RPC calls.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)