
<fix>[zbs]: reduce mds connect timeout and enable tryNext for volume clients#3331

Open
zstack-robot-2 wants to merge 1 commit into 5.5.6 from sync/ye.zou/fix/ZSTAC-80595

Conversation

@zstack-robot-2
Collaborator

Summary

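  • ZSTAC-80595: the CBD anti-split-brain check can select a disconnected MDS, and the call then waits through the connect timeout 5 times × 1 minute
  • Replace syncHttpCall with HttpCaller, reduce the timeout from the default to 30s, and enable tryNext so that the next MDS is tried after a failure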

Files Changed

  • ZbsStorageController.java — HttpCaller with 30s timeout + setTryNext(true)

Resolves: ZSTAC-80595

sync from gitlab !9153

@coderabbitai

coderabbitai bot commented Feb 12, 2026

Walkthrough

Adds a setTryNext(boolean) method to the HttpCaller inner class in ZbsStorageController. When the CBD protocol is used, getActiveClients now calls GetVolumeClients through the new HttpCaller path, with a 30-second timeout and tryNext=true, to allow fast MDS failover.
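
As a rough illustration of the chained style that the review's diff uses (`.setTryNext(true).syncCall()`), a fluent setter can return the caller itself. This is a minimal sketch, not the actual HttpCaller class: `CallerSketch` and `describe()` are hypothetical names introduced only for this example.

```java
// Hypothetical miniature of a fluent caller; NOT the real HttpCaller from ZbsStorageController.
final class CallerSketch {
    private final long timeoutSeconds;
    private boolean tryNext = false; // off unless explicitly enabled

    CallerSketch(long timeoutSeconds) {
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Returns this so the option can be chained before the call is issued. */
    CallerSketch setTryNext(boolean tryNext) {
        this.tryNext = tryNext;
        return this;
    }

    /** Hypothetical helper that only reports the configured options. */
    String describe() {
        return "timeout=" + timeoutSeconds + "s, tryNext=" + tryNext;
    }

    public static void main(String[] args) {
        System.out.println(new CallerSketch(30).setTryNext(true).describe());
    }
}
```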

Changes

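| Cohort / File(s) | Summary |
|---|---|
| HttpCaller failover optimization (plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java) | Adds the HttpCaller<T>.setTryNext(boolean) method. When getActiveClients uses the CBD protocol, it now calls GetVolumeClients through an HttpCaller with a 30s timeout and sets tryNext=true before the call, so that the next MDS can be tried when the HTTP call fails. |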

Sequence Diagram(s)

(This change adjusts the call strategy within a single file; no sequence diagram is generated.)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

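🐰 Hopping lightly through the code forest,
a new switch guards against failure,
wait thirty seconds, then try again,
the MDS paths keep turning,
and the rabbit smiles once more.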


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❌ Error | PR title exceeds the 72-character limit at 76 characters, violating the specified format requirement. | Reduce title length to 72 characters or less while maintaining the [scope]: format, e.g., '[zbs]: reduce mds timeout and enable tryNext for clients'. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | PR description is related to the changeset, detailing the issue (ZSTAC-80595), the solution (HttpCaller with 30s timeout and tryNext), and affected files. |

No actionable comments were generated in the recent review. 🎉




@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java (1)

1457-1510: ⚠️ Potential issue | 🟠 Major

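tryNext currently does not cover the case where syncCall throws an exception directly, so the fast-failover goal is not achieved

doSyncCall() enters the tryNext branch only when base.syncCall(...) returns with ret.isSuccess()==false, but restf.syncJsonPost() throws an exception directly on network timeouts, connection errors, and similar failures (for example OperationFailureException or ResourceAccessException) rather than returning a failed response. As a result:

  • When an unreachable MDS is selected, the exception propagates directly and the remaining MDS nodes are never tried
  • setTryNext(true) has no effect
  • The PR's goal of fast failover is not achieved

doSyncCall() needs to catch exceptions so that the exception case is also covered by the tryNext retry logic. Rewriting the recursion as a while loop is recommended for efficiency.

Suggested implementation direction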
 private T doSyncCall() {
-    if (!it.hasNext()) {
-        throw new OperationFailureException(operr(ORG_ZSTACK_STORAGE_ZBS_10029, errorCodes, "all MDS cannot execute http call[%s]", path));
-    }
-
-    ZbsPrimaryStorageMdsBase base = it.next();
-    cmd.setAddr(base.getSelf().getAddr());
-
-    T ret = base.syncCall(path, cmd, retClass, unit, timeout);
-    if (!ret.isSuccess()) {
-        logger.warn(String.format("failed to execute http call[%s] on MDS[%s], error is: %s",
-                path, base.getSelf().getAddr(), JSONObjectUtil.toJsonString(ret.getError())));
-        errorCodes.getCauses().add(operr(ORG_ZSTACK_STORAGE_ZBS_10030, ret.getError()));
-        if (tryNext) {
-            return doSyncCall();
-        } else {
-            throw new OperationFailureException(operr(ORG_ZSTACK_STORAGE_ZBS_10031, errorCodes, "all MDS cannot execute http call[%s]", path));
-        }
-    }
-
-    return ret;
+    while (it.hasNext()) {
+        ZbsPrimaryStorageMdsBase base = it.next();
+        cmd.setAddr(base.getSelf().getAddr());
+        try {
+            T ret = base.syncCall(path, cmd, retClass, unit, timeout);
+            if (ret != null && ret.isSuccess()) {
+                return ret;
+            }
+
+            logger.warn(String.format("failed to execute http call[%s] on MDS[%s], error is: %s",
+                    path, base.getSelf().getAddr(), ret == null ? "null response" : JSONObjectUtil.toJsonString(ret.getError())));
+            if (ret != null) {
+                errorCodes.getCauses().add(operr(ORG_ZSTACK_STORAGE_ZBS_10030, ret.getError()));
+            }
+        } catch (Exception e) {
+            logger.warn(String.format("exception on http call[%s] on MDS[%s]: %s",
+                    path, base.getSelf().getAddr(), e.getMessage()), e);
+        }
+
+        if (!tryNext) {
+            break;
+        }
+    }
+
+    throw new OperationFailureException(operr(ORG_ZSTACK_STORAGE_ZBS_10031, errorCodes, "all MDS cannot execute http call[%s]", path));
 }
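
The behaviour described in this comment can be reproduced outside ZStack. The following is a minimal, self-contained sketch under stated assumptions: `FailoverSketch`, `Reply`, and `callWithFailover` are hypothetical stand-ins for the MDS loop, a plain `RuntimeException` models the timeout, and no ZStack or Spring API is used.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the MDS failover loop; none of these names are ZStack APIs.
final class FailoverSketch {
    /** Result of one call; success == false models ret.isSuccess() == false. */
    static final class Reply {
        final boolean success;
        final String body;

        Reply(boolean success, String body) {
            this.success = success;
            this.body = body;
        }
    }

    /**
     * Tries each endpoint in order. A failed reply and a thrown exception both
     * count as a failure of the current endpoint. When tryNext is false, only
     * the first endpoint is tried.
     */
    static Reply callWithFailover(List<String> endpoints, Function<String, Reply> call,
                                  boolean tryNext, List<String> errors) {
        Iterator<String> it = endpoints.iterator();
        while (it.hasNext()) {
            String addr = it.next();
            try {
                Reply ret = call.apply(addr);
                if (ret != null && ret.success) {
                    return ret;
                }
                errors.add(addr + ": " + (ret == null ? "null response" : ret.body));
            } catch (RuntimeException e) {
                // A timeout or connection error surfaces here instead of as a failed reply.
                errors.add(addr + ": " + e.getMessage());
            }
            if (!tryNext) {
                break;
            }
        }
        throw new IllegalStateException("no endpoint could execute the call: " + errors);
    }

    public static void main(String[] args) {
        List<String> mds = List.of("mds-1", "mds-2");
        // mds-1 is unreachable and throws; mds-2 answers normally.
        Function<String, Reply> call = addr -> {
            if (addr.equals("mds-1")) {
                throw new RuntimeException("connect timed out");
            }
            return new Reply(true, "clients from " + addr);
        };
        List<String> errors = new ArrayList<>();
        Reply reply = callWithFailover(mds, call, true, errors);
        System.out.println(reply.body + ", errors=" + errors);
    }
}
```

Unlike the suggested diff, which only logs the caught exception, this sketch also records its message in the error list, so that the final failure can report why each endpoint was skipped. Whether the real code should do the same is a design decision for the authors.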
🧹 Nitpick comments (1)
plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java (1)

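178-203: Consider extracting the 30s timeout for GET_VOLUME_CLIENTS into a constant (to avoid a magic value), and confirm that the 30s is not multiplied by lower-level retries, which would leave the call slow

Line 183 currently hard-codes 30, which will be hard to track if other call sites later need to be adjusted or aligned. In addition, if HttpCaller/ZbsPrimaryStorageMdsBase still retries internally, the actual worst-case duration may exceed expectations (for example, 30s × N retries × number of MDS nodes).

Suggested minimal change (extract a constant)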
 public class ZbsStorageController implements PrimaryStorageControllerSvc, PrimaryStorageNodeSvc {
+    private static final long GET_VOLUME_CLIENTS_TIMEOUT_SECONDS = 30;
 ...
     public List<ActiveVolumeClient> getActiveClients(String installPath, String protocol) {
         if (VolumeProtocol.CBD.toString().equals(protocol)) {
             GetVolumeClientsCmd cmd = new GetVolumeClientsCmd();
             cmd.setPath(installPath);
             // Optimize anti-split-brain check: 30s timeout + tryNext for faster mds failover
-            GetVolumeClientsRsp rsp = new HttpCaller<>(GET_VOLUME_CLIENTS_PATH, cmd, GetVolumeClientsRsp.class, null, TimeUnit.SECONDS, 30, true)
+            GetVolumeClientsRsp rsp = new HttpCaller<>(GET_VOLUME_CLIENTS_PATH, cmd, GetVolumeClientsRsp.class, null, TimeUnit.SECONDS, GET_VOLUME_CLIENTS_TIMEOUT_SECONDS, true)
                     .setTryNext(true)
                     .syncCall();

As per coding guidelines: "Avoid magic values".
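
The worst-case estimate raised in this nitpick (timeout × retries × number of MDS nodes) is simple arithmetic. As a hedged sketch, `WorstCaseWait` and `worstCaseSeconds` below are hypothetical names, and the MDS count of 3 and the retry counts in the second and third calls are assumptions for illustration only. Only the "5 times × 1 minute" figure comes from the PR summary.

```java
// Hypothetical helper; none of these names come from ZbsStorageController.
final class WorstCaseWait {
    /** Upper bound in seconds when every attempt on every MDS runs into the timeout. */
    static long worstCaseSeconds(long timeoutSeconds, int attemptsPerMds, int mdsCount) {
        return timeoutSeconds * attemptsPerMds * mdsCount;
    }

    public static void main(String[] args) {
        // Before the change, per the PR summary: 5 attempts x 1 minute on the selected MDS.
        System.out.println(worstCaseSeconds(60, 5, 1)); // 300
        // After the change, ASSUMING 3 MDS nodes and no lower-level retries.
        System.out.println(worstCaseSeconds(30, 1, 3)); // 90
        // ASSUMING a lower layer still retried 5 times per MDS, the bound grows again.
        System.out.println(worstCaseSeconds(30, 5, 3)); // 450
    }
}
```

The third case is the scenario the comment asks the author to rule out: a shorter per-call timeout does not help much if it is multiplied by retries further down the stack.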

When anti-split-brain check selects a disconnected MDS node, the HTTP
call now times out after 30s instead of 5+ minutes, and automatically
retries the next available MDS via tryNext mechanism.

Resolves: ZSTAC-80595

Change-Id: I1be80f1b70cad1606eb38d1f0078c8f2781e6941
@MatheMatrix MatheMatrix force-pushed the sync/ye.zou/fix/ZSTAC-80595 branch from 45353bf to 3b5bda3 Compare February 12, 2026 05:56
2 participants