<fix>[zbs]: reduce mds connect timeout and enable tryNext for volume clients#3331
<fix>[zbs]: reduce mds connect timeout and enable tryNext for volume clients#3331zstack-robot-2 wants to merge 1 commit into5.5.6from
Conversation
Walkthrough在 ZbsStorageController 中为 HttpCaller 内部类添加了 Changes
Sequence Diagram(s)(此变更为单文件内的调用策略调整,不生成序列图) Estimated code review effort🎯 2 (简单) | ⏱️ ~10 分钟 诗
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java (1)
1457-1510:⚠️ Potential issue | 🟠 Major
tryNext目前无法覆盖"syncCall 直接抛异常"的场景,导致快速 failover 目标无法达成
doSyncCall()只在base.syncCall(...)返回ret.isSuccess()==false时进入tryNext分支,但restf.syncJsonPost()在网络超时、连接错误等异常情况下会直接抛出异常(如OperationFailureException、ResourceAccessException等),而非返回失败响应。这导致:
- 当选中不可达 MDS 时,异常直接传播,后续 MDS 无法被尝试
setTryNext(true)形同虚设- PR 目标"快速 failover"无法达成
需在
doSyncCall()中添加异常捕获,将异常情况也纳入tryNext的重试逻辑。建议将递归改写为while循环以提升效率。建议实现方向
private T doSyncCall() { - if (!it.hasNext()) { - throw new OperationFailureException(operr(ORG_ZSTACK_STORAGE_ZBS_10029, errorCodes, "all MDS cannot execute http call[%s]", path)); - } - - ZbsPrimaryStorageMdsBase base = it.next(); - cmd.setAddr(base.getSelf().getAddr()); - - T ret = base.syncCall(path, cmd, retClass, unit, timeout); - if (!ret.isSuccess()) { - logger.warn(String.format("failed to execute http call[%s] on MDS[%s], error is: %s", - path, base.getSelf().getAddr(), JSONObjectUtil.toJsonString(ret.getError()))); - errorCodes.getCauses().add(operr(ORG_ZSTACK_STORAGE_ZBS_10030, ret.getError())); - if (tryNext) { - return doSyncCall(); - } else { - throw new OperationFailureException(operr(ORG_ZSTACK_STORAGE_ZBS_10031, errorCodes, "all MDS cannot execute http call[%s]", path)); - } - } - - return ret; + while (it.hasNext()) { + ZbsPrimaryStorageMdsBase base = it.next(); + cmd.setAddr(base.getSelf().getAddr()); + try { + T ret = base.syncCall(path, cmd, retClass, unit, timeout); + if (ret != null && ret.isSuccess()) { + return ret; + } + + logger.warn(String.format("failed to execute http call[%s] on MDS[%s], error is: %s", + path, base.getSelf().getAddr(), ret == null ? "null response" : JSONObjectUtil.toJsonString(ret.getError()))); + if (ret != null) { + errorCodes.getCauses().add(operr(ORG_ZSTACK_STORAGE_ZBS_10030, ret.getError())); + } + } catch (Exception e) { + logger.warn(String.format("exception on http call[%s] on MDS[%s]: %s", + path, base.getSelf().getAddr(), e.getMessage()), e); + } + + if (!tryNext) { + break; + } + } + + throw new OperationFailureException(operr(ORG_ZSTACK_STORAGE_ZBS_10031, errorCodes, "all MDS cannot execute http call[%s]", path)); }
🧹 Nitpick comments (1)
plugin/zbs/src/main/java/org/zstack/storage/zbs/ZbsStorageController.java (1)
178-203:GET_VOLUME_CLIENTS的 30s 超时建议抽常量(避免魔法值),并确认“30s”不会叠加底层重试导致仍然很慢当前 Line 183 直接写死
30,后续如果别处也要调整/对齐会比较难追踪;另外HttpCaller/ZbsPrimaryStorageMdsBase内部如果还有重试,实际最坏耗时可能仍然超预期(例如 30s × N 次重试 × MDS 数)。建议的最小改动(抽常量)
public class ZbsStorageController implements PrimaryStorageControllerSvc, PrimaryStorageNodeSvc { + private static final long GET_VOLUME_CLIENTS_TIMEOUT_SECONDS = 30; ... public List<ActiveVolumeClient> getActiveClients(String installPath, String protocol) { if (VolumeProtocol.CBD.toString().equals(protocol)) { GetVolumeClientsCmd cmd = new GetVolumeClientsCmd(); cmd.setPath(installPath); // Optimize anti-split-brain check: 30s timeout + tryNext for faster mds failover - GetVolumeClientsRsp rsp = new HttpCaller<>(GET_VOLUME_CLIENTS_PATH, cmd, GetVolumeClientsRsp.class, null, TimeUnit.SECONDS, 30, true) + GetVolumeClientsRsp rsp = new HttpCaller<>(GET_VOLUME_CLIENTS_PATH, cmd, GetVolumeClientsRsp.class, null, TimeUnit.SECONDS, GET_VOLUME_CLIENTS_TIMEOUT_SECONDS, true) .setTryNext(true) .syncCall();As per coding guidelines “避免使用魔法值(Magic Value)”。
When anti-split-brain check selects a disconnected MDS node, the HTTP call now times out after 30s instead of 5+ minutes, and automatically retries the next available MDS via tryNext mechanism. Resolves: ZSTAC-80595 Change-Id: I1be80f1b70cad1606eb38d1f0078c8f2781e6941
45353bf to
3b5bda3
Compare
Summary
Files Changed
ZbsStorageController.java— HttpCaller with 30s timeout + setTryNext(true)Resolves: ZSTAC-80595
sync from gitlab !9153