Docs: add disaster recovery guides for partial and complete failure #268
moizpgedge merged 17 commits into main from docs/PLAT-313-314-etcd-swarm-disaster-recovery
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds comprehensive disaster recovery documentation for the Control Plane: an index plus three guides (partial recovery with quorum intact, complete/no-quorum recovery, and a combined disaster-recovery overview). Documents phased workflows, procedures for etcd and Docker Swarm recovery, prerequisites, verification, and troubleshooting. No code or public API changes.
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (4)
docs/disaster-recovery/full-recovery.md (2)
**288-295: Add a language specifier to the fenced code block.** Same as in partial-recovery.md — the `docker node ls` example output block should use ` ```text ` or ` ```console `.

Proposed fix:

````diff
-```
+```text
 ID   HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 288-295: update the fenced code block showing the `docker node ls` example output to include a language specifier (use ` ```text ` or ` ```console ` instead of plain ` ``` `) so syntax-aware renderers treat it as plain text; locate the fenced block containing the sample output for `docker node ls` (the block starting with the ID/HOSTNAME/STATUS header) and replace the opening fence with ` ```text ` (or ` ```console `) accordingly.
**204-210: Phase 1B backup step only preserves etcd directory, unlike Phase 1A.** Phase 1A Step 1A.3 backs up etcd data, certificates, and `generated.config.json`. Phase 1B Step 1B.1 only backs up the etcd directory. Since the surviving host in 1B still has valid certificates and config, this is likely intentional — but a brief note explaining the difference would help operators avoid second-guessing during a high-stress recovery.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 204-210: Phase 1B Step 1B.1 only backs up the etcd directory whereas Phase 1A Step 1A.3 also backs up certificates and generated.config.json; add a brief explanatory note under "Step 1B.1: Backup Existing etcd Data" clarifying that in Phase 1B the surviving host is expected to already have valid certificates and configuration so only the etcd data is preserved, and reference Step 1A.3 for the full backup behavior to avoid operator confusion.

docs/disaster-recovery/partial-recovery.md (1)
**134-140: Add a language specifier to the fenced code block.** The example `docker node ls` output block on line 135 lacks a language identifier. Use ` ```text ` or ` ```console ` for plain output blocks to satisfy the MD040 lint rule and improve rendering consistency.

Proposed fix:

````diff
-```
+```text
 ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
 959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
 l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/partial-recovery.md` around lines 134-140: the fenced code block that shows the docker node ls output (the block starting with "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS" in the partial-recovery example) is missing a language specifier and triggers MD040; update that block by adding a plain-output language tag such as `text` or `console` immediately after the opening backticks so the snippet becomes a labelled code fence and satisfies linting and rendering.

mkdocs.yml (1)

**67-69: Consider adding `disaster-recovery/index.md` as the section index page.** Other nav sections (e.g., "Prerequisites", "Installing Control Plane") include their `index.md` as the first entry, making the section header clickable. Currently, "Disaster Recovery" is just a label and `index.md` isn't reachable from the nav, even though it exists and contains useful context (the backup warning).

Proposed fix:

```diff
 - Disaster Recovery:
+    - disaster-recovery/index.md
     - Partial Failure Recovery (Quorum Intact): disaster-recovery/partial-recovery.md
     - Complete Failure Recovery (No Quorum): disaster-recovery/full-recovery.md
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@mkdocs.yml` around lines 67-69: the "Disaster Recovery" nav entry should include its section index so the header becomes clickable; update the mkdocs.yml nav under the "Disaster Recovery" section to insert "disaster-recovery/index.md" as the first entry before "disaster-recovery/partial-recovery.md" and "disaster-recovery/full-recovery.md" (so the section reads index.md, partial-recovery.md, full-recovery.md).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 385-397: The docs show a port mismatch: the registry is published
on port 5000 but the control plane service sets
PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST to 127.0.0.1:5001/pgedge; update the
environment variable in the control-plane service creation
(PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST in the
control-plane_${RECOVERY_HOST_ID} service) to 127.0.0.1:5000/pgedge to match the
registry, or explicitly document why 5001 is used (e.g., mirror/proxy) if that
is intentional.

In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 76-82: The doc currently shows monitoring the host-removal task
via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was
force-removed this endpoint may be unavailable; update the instructions to use a
cluster-level or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID>
or /v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after
deletion—replace the existing reference to
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative
endpoints.
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around lines 288-295: Update the fenced code block showing the `docker node ls` example output to include a language specifier (e.g., use ` ```text ` or ` ```console ` instead of plain ` ``` `) so syntax-aware renderers treat it as plain text; locate the fenced block containing the sample output for `docker node ls` (the block starting with the ID/HOSTNAME/STATUS header) and replace the opening fence with ` ```text ` (or ` ```console `) accordingly.
- Around lines 204-210: Phase 1B Step 1B.1 only backs up the etcd directory
whereas Phase 1A Step 1A.3 also backs up certificates and generated.config.json;
add a brief explanatory note under "Step 1B.1: Backup Existing etcd Data"
clarifying that in Phase 1B the surviving host is expected to already have valid
certificates and configuration so only the etcd data is preserved, and reference
Step 1A.3 for the full backup behavior to avoid operator confusion.

In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 134-140: The fenced code block that shows the docker node ls
output (the block starting with "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS"
in the partial-recovery example) is missing a language specifier and triggers
MD040; update that block by adding a plain-output language tag such as `text` or `console` immediately after the opening backticks so the snippet becomes a labelled code fence (e.g., change ` ``` ` to ` ```text `) to satisfy linting and rendering.

In `@mkdocs.yml`:
- Around line 67-69: The "Disaster Recovery" nav entry should include its
section index so the header becomes clickable; update the mkdocs.yml nav under
the "Disaster Recovery" section to insert "disaster-recovery/index.md" as the
first entry before "disaster-recovery/partial-recovery.md" and
"disaster-recovery/full-recovery.md" (so the section reads index.md,
partial-recovery.md, full-recovery.md).</details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
```sh
# Monitor host removal task
curl http://<HEALTHY_HOST>:3000/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>

# Monitor database update task logs
curl http://<HEALTHY_HOST>:3000/v1/databases/<DB>/tasks/<TASK_ID>/log
```
Task monitoring endpoint references the removed host.
Line 78 monitors the host removal task via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but the host was just force-removed in Step 1.1. This endpoint may not be available after deletion. Consider using a cluster-level or recovery-host-scoped task endpoint instead, or clarify the expected behavior (e.g., whether the task endpoint remains accessible for a removed host).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/partial-recovery.md` around lines 76 - 82, The doc
currently shows monitoring the host-removal task via
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was force-removed
this endpoint may be unavailable; update the instructions to use a cluster-level
or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID> or
/v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after
deletion—replace the existing reference to
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative
endpoints.
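To make the suggested alternatives concrete, here is a minimal sketch of checking a removal task from a node that still exists — both endpoint shapes below are the review's proposals, not verified against the API:

```bash
# Cluster-level task lookup (proposed endpoint, unverified)
curl "http://<HEALTHY_HOST>:3000/v1/tasks/<TASK_ID>"

# Or scope the lookup to a host that still exists (also a proposal)
curl "http://<HEALTHY_HOST>:3000/v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>"
```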
jason-lynch left a comment
This is getting closer! There are still a lot of repeated steps, and a few places where we're doing the same things in different ways. I think we can do more to deduplicate and simplify these documents. Could this just be one document?:
- Disaster recovery
    - Restoring Docker Swarm
        - Reinitializing the Swarm
        - Joining hosts to the new Swarm
        - Removing old Swarm nodes
    - Restoring the Control Plane
        - Reinitializing the Control Plane Etcd cluster
        - Updating databases to remove old hosts
        - Removing old hosts
        - Re-adding hosts
        - Updating databases to re-add hosts
mkdocs.yml (Outdated)

```yaml
- Disaster Recovery:
    - Partial Failure Recovery (Quorum Intact): disaster-recovery/partial-recovery.md
    - Complete Failure Recovery (No Quorum): disaster-recovery/full-recovery.md
```
Could you please remove this for now? I think @mmols had suggested that we go through QA and documentation review before we publish this to the site.
1. **[Total Quorum Loss](#phase-1a-total-quorum-loss)** — All server-mode hosts are offline (100% loss). Docker Swarm is still functional.
2. **[Majority Quorum Loss](#phase-1b-majority-quorum-loss)** — More than 50% of server-mode hosts are offline, but at least one remains online. Docker Swarm is still functional.
3. **[etcd and Docker Swarm Quorum Loss](#phase-1c-etcd-and-docker-swarm-quorum-loss)** — Both etcd and Docker Swarm have lost quorum (majority of hosts destroyed). Requires Swarm re-initialization, registry recreation, and image rebuild before etcd recovery.
"Registry recreation" and "image rebuild" are exclusive to our development environment, so we should not document them in our user-facing documentation.
You may see "cannot elect leader" errors when stopping services. These are expected. If Docker Swarm commands fail, stop containers directly:

```bash
docker ps --filter label=com.docker.swarm.service.name=control-plane_<host-id> --format "{{.ID}}" | xargs docker stop
```
What are the conditions where it becomes necessary to do this?
#### Step 1A.2: Restore Data Volume

Restore the Control Plane data volume from your backup.
We can't write specific instructions for this step because it's very dependent on the user's environment. You can simulate it locally using cp or Lima's snapshot tool, but in a real environment it will look quite different. Here are some examples of different procedures that users might follow:
- https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/restore.html
- https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/container-storage-plugin/3-0/getting-started-with-vmware-vsphere-container-storage-plug-in-3-0/using-vsphere-container-storage-plug-in/volume-snapshot-and-restore.html
- https://learn.microsoft.com/en-us/azure/backup/backup-azure-arm-restore-vms
- https://docs.cloud.google.com/compute/docs/disks/restore-snapshot
We can't document every possible way of restoring a volume or an instance from a backup, so I wouldn't get more specific than this first sentence.
```bash
ARCH=$(uname -m)
if [ "$ARCH" = "x86_64" ]; then ARCH="amd64"; elif [ "$ARCH" = "aarch64" ]; then ARCH="arm64"; fi
curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl
```
This will look better split onto two lines:

```diff
-curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl
+curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz \
+    | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl
```
```bash
curl -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/<DEAD_HOST_2>?force=true"
```

!!! important "Remove Hosts in Order"
This is not important. The order should not matter.
```bash
rm -f "${PGEDGE_DATA_DIR}/generated.config.json"
```

**For client-mode hosts:**
Healthy client-mode hosts do not need to be removed at all, so users should never be in this position.
#### Step 3B.1: Create New Host

Provision the replacement host. For Lima-based environments:
Please remove these Lima-specific instructions. This is something that's specific to our development environment, so these instructions are not meaningful to users.
If Swarm no longer has the service definition:

```bash
docker stack deploy -c <path-to-stack-yaml> control-plane
```
This is the first time we mention the stack definition in this document. The stack definition should be listed as a pre-requisite so that users know to prepare it ahead of time.
Also - keep in mind that doing this deploy will recreate and scale up all services in the stack. I would just do this one time after I've gotten all of my hosts ready to deploy again.
---

## Phase 1: Remove the Failed Host
Let's be consistent with the full recovery document and say that users should update the database to remove the lost host first, and then remove the host without the force parameter. As I mentioned in a comment on the other doc, I'm leaning towards removing the force parameter altogether from that remove host endpoint because it can lead to problems.
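Sketched as API calls, using the endpoint shapes that appear elsewhere in these guides (the spec body is elided on purpose):

```bash
# 1. Update each affected database so its spec no longer references the lost host
curl -X POST "http://<HEALTHY_HOST>:3000/v1/databases/<DB_ID>?remove_host=<LOST_HOST_ID>" \
  -H "Content-Type: application/json" \
  -d '<updated-database-spec>'

# 2. Then remove the host record, without force=true
curl -X DELETE "http://<HEALTHY_HOST>:3000/v1/hosts/<LOST_HOST_ID>"
```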
Actionable comments posted: 6
🧹 Nitpick comments (2)
docs/disaster-recovery/full-recovery.md (2)
**272-277: Fenced code block at line 272 is missing a language specifier (MD040).**

♻️ Proposed fix

````diff
-```
+```text
 ID   HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
 ...
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 272 - 277, The fenced code block starting with triple-backticks that contains the table header "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS" is missing a language specifier; update the opening fence to include a language (e.g., change ``` to ```text or ```console) so the block complies with MD040 and renders/linters correctly, leaving the block contents unchanged.
**77-80: Indented code blocks inside admonitions should be fenced (MD046).** markdownlint flags five locations (lines 79, 151, 184, 254, 551) where 4-space indented blocks are used inside `!!!` admonitions. These may not render as code blocks in all Markdown renderers; fenced blocks are the reliable and consistent approach.

Example fix for the pattern (apply similarly at all five locations):

♻️ Proposed fix pattern

```diff
 !!! important "Reset Cluster Membership for Multi-Node Clusters"
-    If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore` to reset cluster membership. Simply copying the etcd directory will not work.
+    If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore`
+    to reset cluster membership. Simply copying the etcd directory will not work.
```

For any nested code snippets inside the admonition, use a fenced block indented with 4 spaces:

```text
!!! note
    If Docker Swarm is still functional (only etcd lost quorum), use [Phase 1B](#phase-1b-majority-quorum-loss) instead.
```

Also applies to: 149-152, 182-185, 252-255, 549-552
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 77 - 80, Replace the 4-space indented code blocks inside the "!!! important" admonitions with fenced code blocks so they render consistently; locate the admonition starting with '!!! important "Reset Cluster Membership for Multi-Node Clusters"' and the other admonitions at the flagged locations (around lines referenced in the review: 79, 151, 184, 254, 551) and change any nested indented blocks to triple-backtick fenced blocks (```), preserving language hints if present and maintaining the same indentation level for the admonition content.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 200-213: Phase 1B currently contradicts itself: "Option A"
instructs running etcdctl snapshot save against a live endpoint (the etcdctl
snapshot save command) even though Phase 1B assumes majority quorum loss and
etcd is inaccessible; either delete the entire "Option A: Create Snapshot from
Existing Data" subsection or rewrite it to clearly define a distinct
sub-scenario (e.g., "Partial Quorum: single surviving node with etcd running")
that is explicitly separate from the majority-quorum-lost path, and in that
rewrite explain preconditions, when to run the etcdctl snapshot save command,
and add a cross-reference to "Option B" (pre-existing snapshot) for the true
quorum-loss recovery path so readers are not misled.
- Line 51: The document contains literal draft comments using C-style "// To Do"
text (e.g. "// TO DO link the installation.md file section here", "// To Do A
snapshot file is optional...", "// To Do The title of this section, "Majority
Quorum Loss"...", "// To Do a snapshot is optional...", "// To Do use etcdutl
snapshot restore", "// To Do we will keep one section in final doc") which
render verbatim and trigger MD046; remove or resolve each by either deleting
them, converting them to proper HTML comments (<!-- TODO: ... -->) or replacing
them with real content or a Markdown TODO/task list or an actual link (e.g.
install section link) so the draft notes no longer appear in the published
output and the lint warning is eliminated.
- Around line 315-346: Phase 1C is incomplete: add concrete Step 1C.4 and 1C.5
and replace the Step 1C.6 TODO with the etcd ForceNewCluster restore steps, then
update the dangling cross-reference and summary table. Specifically, create Step
1C.4 "Recreate container registry on surviving host" that documents rebuilding
or restoring the registry (recreate registry container, load required
images/tags, push images back into registry) and reference the surviving
host/node names (e.g., host-2/n2); add Step 1C.5 "Prepare etcd snapshot and
prerequisites" describing how to fetch/transfer the snapshot, validate it, stop
etcd on recovery nodes and prepare data dirs; replace Step 1C.6 with a full
"ForceNewCluster etcdutl snapshot restore" sequence (use etcdutl snapshot
restore with --force-new-cluster and correct --name/--initial-cluster settings,
restore to expected data-dir and systemd restart steps) and ensure Step 1C.7
remains the verification curl. Finally, update the Common Issues cross-reference
that mentions "Phase 1C, Step 1C.4" and the Summary table range/actions to
include the rebuilt registry and ForceNewCluster restore so documentation is
consistent.
- Around line 367-369: Update the Step 2.1 curl example so the URL uses the same
port variable as the rest of the document: replace the hardcoded ":3000" in the
curl command URL (the line starting with curl -X POST
"http://${RECOVERY_HOST_IP}:3000/...) with the template variable ":${API_PORT}"
to match other API calls and maintain consistency.
- Around line 37-42: The Set Variables block is missing RECOVERY_HOST_ID and
API_PORT which are referenced throughout the guide; update the initial variables
list to include RECOVERY_HOST_ID (the identifier/name of the recovery host used
in commands and SSH/etcd membership steps) and API_PORT (the control-plane API
port used by the recovery host), and ensure the examples show sensible
placeholders like RECOVERY_HOST_ID="<recovery-host-id>" and API_PORT=<api-port>
so all subsequent snippets (references to RECOVERY_HOST_ID and API_PORT) work
without reader guesswork; verify mentions in functions/sections that reference
RECOVERY_HOST_ID and API_PORT still match the variable names exactly.
- Around line 130-133: Remove the trailing space after the backslash in the curl
line so the shell line continuation works (ensure the backslash is the last
character before the newline in the "curl -L
https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz
\ | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl"
snippet) or collapse into a single line; also update the hardcoded release tag
from v3.6.5 to v3.6.8 or replace it with a VERSION variable so the etcd download
URL and the etcd-v... names stay consistent and current, and verify the
subsequent sudo mv/ chmod lines still reference the same etcdutl filename.
---
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 272-277: The fenced code block starting with triple-backticks that
contains the table header "ID HOSTNAME STATUS
AVAILABILITY MANAGER STATUS" is missing a language specifier; update the
opening fence to include a language (e.g., change ``` to ```text or ```console)
so the block complies with MD040 and renders/linters correctly, leaving the
block contents unchanged.
- Around line 77-80: Replace the 4-space indented code blocks inside the "!!!
important" admonitions with fenced code blocks so they render consistently;
locate the admonition starting with '!!! important "Reset Cluster Membership for
Multi-Node Clusters"' and the other admonitions at the flagged locations (around
lines referenced in the review: 79, 151, 184, 254, 551) and change any nested
indented blocks to triple-backtick fenced blocks (```), preserving language
hints if present and maintaining the same indentation level for the admonition
content.
Actionable comments posted: 4
🧹 Nitpick comments (2)
docs/disaster-recovery/full-recovery.md (2)
**330-332: Step 1C.7 lacks concrete commands for registry recreation — the hardest step in Phase 1C.** This is the only recovery path for operators who have lost both etcd and Swarm quorum. The step instructs them to "recreate the registry and build/push the Control Plane image" but provides no commands, and the `ARCHIVE_VERSION` variable defined in Set Variables (line 50) is clearly intended for exactly these commands. An operator in a production incident cannot act on prose alone.

Consider adding a code block with the docker registry/image commands (e.g., `docker service create` for the registry, `docker pull`/`docker tag`/`docker push` using `${ARCHIVE_VERSION}`), or at minimum link to a dedicated registry setup reference.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 330 - 332, Step 1C.7 is missing the concrete commands to recreate a container registry and push the Control Plane image using the ARCHIVE_VERSION variable; update the prose to include an explicit sequence showing how to create the registry service and then pull/tag/push the Control Plane image using ${ARCHIVE_VERSION} (or, if you prefer not to inline commands, add a clear link to a registry setup reference), and ensure you reference the ARCHIVE_VERSION variable defined earlier so operators can copy-paste the exact commands for "recreate registry" and "build/push Control Plane image" before starting the Control Plane service in Step 1C.7.
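For reference, such a block might look like the following sketch — the service name, registry port, and image repository path are assumptions, not the project's actual values:

```bash
# Recreate a local registry as a Swarm service (port 5000 assumed; see the
# port-mismatch discussion above before choosing 5000 vs 5001)
docker service create --name registry --publish published=5000,target=5000 registry:2

# Re-tag and push the Control Plane image using ${ARCHIVE_VERSION}
# (the source image name below is hypothetical)
docker pull pgedge/control-plane:${ARCHIVE_VERSION}
docker tag pgedge/control-plane:${ARCHIVE_VERSION} 127.0.0.1:5000/pgedge/control-plane:${ARCHIVE_VERSION}
docker push 127.0.0.1:5000/pgedge/control-plane:${ARCHIVE_VERSION}
```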
**461-469: Step 3A.2 doesn't distinguish server-mode from client-mode hosts, unlike the partial-recovery guide.** partial-recovery.md Step 3A.2 has separate blocks for server-mode (clear `etcd` + `certificates` + `generated.config.json`) and client-mode (clear only `generated.config.json`). This guide applies the server-mode commands to all hosts unconditionally. While `rm -rf` on absent directories is harmless, aligning with the partial-recovery guide's pattern avoids operator confusion when working across both docs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 461 - 469, Step 3A.2 currently applies the server-mode cleanup (removing ${PGEDGE_DATA_DIR}/etcd, /certificates, and generated.config.json) to all hosts; update this section to mirror partial-recovery.md by distinguishing server-mode vs client-mode hosts: for server-mode instruct removal of etcd, certificates, and generated.config.json, and for client-mode instruct removal of only generated.config.json (use the same labels "server-mode" and "client-mode" and reference the PGEDGE_DATA_DIR variable and filenames to make the distinction clear).
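A sketch of the two per-mode blocks, using the paths this guide already defines under `PGEDGE_DATA_DIR`:

```bash
# Server-mode hosts: clear etcd state, certificates, and the generated config
rm -rf "${PGEDGE_DATA_DIR}/etcd"
rm -rf "${PGEDGE_DATA_DIR}/certificates"
rm -f "${PGEDGE_DATA_DIR}/generated.config.json"

# Client-mode hosts: only the generated config needs to be cleared
rm -f "${PGEDGE_DATA_DIR}/generated.config.json"
```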
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 48-51: The variable ARCHIVE_VERSION is declared but unused; either
remove the ARCHIVE_VERSION="<control-plane-version>" line from the env block or
add the registry-rebuild commands that consume it (e.g., the
registry-rebuild/registry restore commands referenced in Step 1C.7) so the guide
actually uses ${ARCHIVE_VERSION}; update the examples and any subsequent
commands to reference ARCHIVE_VERSION in place of a hard-coded version to make
the variable meaningful.
- Around line 256-261: The fenced code block showing the `docker node ls` output
is missing a language specifier which triggers MD040; update the triple-backtick
fence for that block in full-recovery.md to include a language tag such as text
(e.g., change ``` to ```text) so the block is explicitly marked and linting
passes; locate the block containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER
STATUS` table and add the language specifier to the opening fence.
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 105-110: Add a language specifier to the fenced code block
containing the `docker node ls` example (the block showing ID, HOSTNAME, STATUS,
AVAILABILITY, MANAGER STATUS) to silence MD040 and improve rendering; update the
opening fence from ``` to ```text (or ```console) so the block becomes a labeled
code block.
- Line 403: The step label is incorrect: remove the "(force remove)" note from
Step 1.1 and add it to the Step 1.2 text so the host deletion action is
documented under Step 1.2 (leave Step 1.1 describing only database updates);
update the sentence for Step 1.2 to read something like "Step 1.2 — force remove
(host record deletion) and wait for the task to finish before rejoining" so the
force-remove action is clearly associated with Step 1.2.
---
Duplicate comments:
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 88-92: The doc uses the per-host task endpoint
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> to poll a removal task on a host that
was just deleted; since that host may become unavailable, update the example to
query a stable API host instead—either call the cluster-level endpoint
/v1/tasks/<TASK_ID> (preferred) or replace <LOST_HOST_ID> with <HEALTHY_HOST> so
the curl targets a healthy node (/v1/hosts/<HEALTHY_HOST>/tasks/<TASK_ID>), and
update the curl example accordingly.
---
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 330-332: Step 1C.7 is missing the concrete commands to recreate a
container registry and push the Control Plane image using the ARCHIVE_VERSION
variable; update the prose to include an explicit sequence showing how to create
the registry service and then pull/tag/push the Control Plane image using
${ARCHIVE_VERSION} (or, if you prefer not to inline commands, add a clear link
to a registry setup reference), and ensure you reference the ARCHIVE_VERSION
variable defined earlier so operators can copy-paste the exact commands for
"recreate registry" and "build/push Control Plane image" before starting the
Control Plane service in Step 1C.7.
- Around line 461-469: Step 3A.2 currently applies the server-mode cleanup
(removing ${PGEDGE_DATA_DIR}/etcd, /certificates, and generated.config.json) to
all hosts; update this section to mirror partial-recovery.md by distinguishing
server-mode vs client-mode hosts: for server-mode instruct removal of etcd,
certificates, and generated.config.json, and for client-mode instruct removal of
only generated.config.json (use the same labels "server-mode" and "client-mode"
and reference the PGEDGE_DATA_DIR variable and filenames to make the distinction
clear).
```bash
RECOVERY_HOST_EXTERNAL_IP="<recovery-host-external-ip>"  # e.g., 192.168.105.4
ARCHIVE_VERSION="<control-plane-version>"                # e.g., 0.6.2
```
ARCHIVE_VERSION is defined but never referenced in the document.
No code block anywhere in the guide uses ${ARCHIVE_VERSION}. Either add the concrete registry-rebuild commands that consume it (see also Step 1C.7 below), or remove it from the variable block to avoid confusing operators.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/full-recovery.md` around lines 48 - 51, The variable
ARCHIVE_VERSION is declared but unused; either remove the
ARCHIVE_VERSION="<control-plane-version>" line from the env block or add the
registry-rebuild commands that consume it (e.g., the registry-rebuild/registry
restore commands referenced in Step 1C.7) so the guide actually uses
${ARCHIVE_VERSION}; update the examples and any subsequent commands to reference
ARCHIVE_VERSION in place of a hard-coded version to make the variable
meaningful.
```
ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
4aoqjp3q8jcny4kec5nadcn6x   lima-host-1   Down    Active        Unreachable
959g9937i62judknmr40kcw9r * lima-host-2   Ready   Active        Leader
l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
```
Add a language specifier to the fenced code block.
Same MD040 issue as in partial-recovery.md — the docker node ls output block lacks a language tag.
✏️ Proposed fix

````diff
-```
+```text
 ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x   lima-host-1   Down    Active        Unreachable
 959g9937i62judknmr40kcw9r * lima-host-2   Ready   Active        Leader
 l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
````

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 256-256: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/full-recovery.md` around lines 256 - 261, The fenced
code block showing the `docker node ls` output is missing a language specifier
which triggers MD040; update the triple-backtick fence for that block in
full-recovery.md to include a language tag such as text (e.g., change ``` to
```text) so the block is explicitly marked and linting passes; locate the block
containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS` table and add
the language specifier to the opening fence.
```
ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
```
Add a language specifier to the fenced code block.
The docker node ls example output uses a plain fence, which triggers MD040. Use text or console to silence the warning and improve rendering.
✏️ Proposed fix

````diff
-```
+```text
 ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
 959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
 l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
````
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/partial-recovery.md` around lines 105 - 110, Add a
language specifier to the fenced code block containing the `docker node ls`
example (the block showing ID, HOSTNAME, STATUS, AVAILABILITY, MANAGER STATUS)
to silence MD040 and improve rendering; update the opening fence from ``` to
```text (or ```console) so the block becomes a labeled code block.
Actionable comments posted: 3
🧹 Nitpick comments (1)
docs/disaster-recovery/full-recovery.md (1)
**706-714: Unquoted `$PGEDGE_DATA_DIR` in `rm` commands — inconsistent quoting.** The Common Issues `rm` snippet uses unquoted variables. Every other `rm`/`mv` block in the guide (Steps 1A.3, 1C.5, 3A.2) quotes `"${PGEDGE_DATA_DIR}/..."`. An unquoted path breaks if `PGEDGE_DATA_DIR` contains spaces.

✏️ Proposed fix

```diff
-rm -rf ${PGEDGE_DATA_DIR}/etcd
-rm -rf ${PGEDGE_DATA_DIR}/certificates
-rm -f ${PGEDGE_DATA_DIR}/generated.config.json
+rm -rf "${PGEDGE_DATA_DIR}/etcd"
+rm -rf "${PGEDGE_DATA_DIR}/certificates"
+rm -f "${PGEDGE_DATA_DIR}/generated.config.json"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 706 - 714, The three rm lines use an unquoted environment variable and will break for paths with spaces; update the commands that remove etcd, certificates, and generated.config.json to use the PGEDGE_DATA_DIR variable wrapped in double quotes (i.e., quote the "${PGEDGE_DATA_DIR}/...") so the paths are treated as single arguments and match the quoting style used in other steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/disaster-recovery.md`:
- Around line 219-226: The per-host task monitor URL shown after the DELETE call
(DELETE /v1/hosts/<LOST_HOST_ID>) may 404 because the host resource was removed;
update the docs to use the global task endpoint instead (e.g.,
/v1/tasks/<TASK_ID>) for monitoring or explicitly note that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> must remain accessible post-deletion;
adjust the example and the explanatory text around the DELETE
/v1/hosts/<LOST_HOST_ID> and the monitoring line so readers use
/v1/tasks/<TASK_ID> or understand the host-scoped URL behavior.
- Around line 178-183: The phrase "(e.g. after Path 1A/1B)" is undefined in this
document; replace that parenthetical in the Control Plane start instructions
with a contextual description or a cross-link — for example change the text to
"(e.g. after completing the Swarm and etcd restore steps above)" or add an
explicit link to full-recovery.md — so readers of disaster-recovery.md have a
meaningful anchor instead of "Path 1A/1B".
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 370-398: Update the two example JSON database specs in
full-recovery.md to include the missing "port": 5432 field so they match other
guides: add "port": 5432 inside the top-level "spec" object in the Step 2.1
(remove_host) example (the curl example that posts an updated-database-spec for
storefront) and in the Phase 4 (add nodes) example's "spec" object; keep all
other fields unchanged.
---
Duplicate comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 48-51: ARCHIVE_VERSION is declared but never used in the guide;
either remove the ARCHIVE_VERSION variable or update Phase 1C (specifically Step
1C.7) to use it by adding the registry-rebuild / image-tagging commands that
consume ${ARCHIVE_VERSION} (i.e., where you rebuild or retag control-plane
images to the archived version). Locate the variable block and the Phase 1C Step
1C.7 text, then either delete the ARCHIVE_VERSION line or insert the appropriate
commands that reference ${ARCHIVE_VERSION} so the variable is actually consumed.
- Around line 256-261: The fenced code block showing the output of `docker node
ls` is missing a language specifier which triggers MD040; update the fence that
contains the lines starting with "ID HOSTNAME
STATUS..." so it becomes a triple-backtick fence with a language specifier
(e.g., use "text") instead of a plain fence to suppress the lint warning and
ensure proper rendering.
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 88-92: The monitor URL references a per-host endpoint that may no
longer resolve after force-removing the host; change the example curl to use a
cluster-level task endpoint (`/v1/tasks/<TASK_ID>`) or add a note that
`/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>` remains accessible
post-deletion—update the example command and text to use `curl
http://<HEALTHY_HOST>:3000/v1/tasks/<TASK_ID>` or explicitly state the
host-scoped endpoint behavior so operators can reliably check task status.
- Around line 105-110: The fenced code block showing the output of `docker node
ls` is missing a language specifier which triggers MD040; update the opening
fence for that block (the triple backticks before the lines starting "ID
HOSTNAME STATUS...") to include a language tag such as text (i.e., change ```
to ```text) so the block is explicitly marked and renders correctly.
---
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 706-714: The three rm lines use an unquoted environment variable
and will break for paths with spaces; update the commands that remove etcd,
certificates, and generated.config.json to use the PGEDGE_DATA_DIR variable
wrapped in double quotes (i.e., quote the "${PGEDGE_DATA_DIR}/...") so the paths
are treated as single arguments and match the quoting style used in other steps.
4. **Start the Control Plane** and verify:

    - If Control Plane is already deployed as Swarm services (e.g. after Path 1A/1B):
      `docker service scale control-plane_${RECOVERY_HOST_ID}=1`
    - If you completed [Restoring Docker Swarm](#restoring-docker-swarm) and deploy via stack:
      `docker stack deploy -c <path-to-stack-yaml> control-plane` (do not set `PGEDGE_ETCD_SERVER__FORCE_NEW_CLUSTER`).
"Path 1A/1B" labels are undefined in this document.
Line 180 reads: (e.g. after Path 1A/1B). Those path labels are internal to full-recovery.md and don't appear anywhere in this guide. Readers working from this document only will have no anchor for "1A/1B". Replace with a description meaningful in this document's context (e.g., "after the Swarm and etcd restore steps above") or cross-link to full-recovery.md.
✏️ Proposed fix

```diff
- - If Control Plane is already deployed as Swarm services (e.g. after Path 1A/1B):
+ - If Control Plane is already deployed as Swarm services (i.e. Swarm is functional and services were only scaled down):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/disaster-recovery.md` around lines 178 - 183, The
phrase "(e.g. after Path 1A/1B)" is undefined in this document; replace that
parenthetical in the Control Plane start instructions with a contextual
description or a cross-link — for example change the text to "(e.g. after
completing the Swarm and etcd restore steps above)" or add an explicit link to
full-recovery.md — so readers of disaster-recovery.md have a meaningful anchor
instead of "Path 1A/1B".
Thanks for the suggestion. Will be fixed in a new commit.
@moizpgedge, sounds good! Looking forward to seeing the update in the next commit. 👍
After all affected databases have been updated, remove each lost host from the Control Plane (one at a time; wait for each removal task to complete):

```sh
curl -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/<LOST_HOST_ID>"
```

Monitor: `curl http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>`
Task monitor URL scoped to a deleted host may not resolve.
After the DELETE /v1/hosts/<LOST_HOST_ID> call, the monitoring line uses:
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>
If the API returns 404 for task endpoints on removed hosts, this instruction will silently fail during recovery. This is the same pattern flagged (unresolved) in partial-recovery.md. Consider using a generic task endpoint or adding a note that the per-host task URL remains accessible post-deletion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/disaster-recovery.md` around lines 219 - 226, The
per-host task monitor URL shown after the DELETE call (DELETE
/v1/hosts/<LOST_HOST_ID>) may 404 because the host resource was removed; update
the docs to use the global task endpoint instead (e.g., /v1/tasks/<TASK_ID>) for
monitoring or explicitly note that /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> must
remain accessible post-deletion; adjust the example and the explanatory text
around the DELETE /v1/hosts/<LOST_HOST_ID> and the monitoring line so readers
use /v1/tasks/<TASK_ID> or understand the host-scoped URL behavior.
Use the `remove_host` query parameter to remove instances from destroyed hosts:

```sh
curl -X POST "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/<DB_ID>?remove_host=<DEAD_HOST_1>&remove_host=<DEAD_HOST_2>" \
  -H "Content-Type: application/json" \
  -d '<updated-database-spec>'
```

Example:

```sh
curl -X POST "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/storefront?remove_host=host-1&remove_host=host-3" \
  -H "Content-Type: application/json" \
  -d '{
    "spec": {
      "database_name": "storefront",
      "database_users": [
        {
          "username": "admin",
          "db_owner": true,
          "attributes": ["SUPERUSER", "LOGIN"]
        }
      ],
      "nodes": [
        { "name": "n2", "host_ids": ["host-2"] }
      ]
    }
  }'
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Find all instances of "port" in disaster-recovery guides
rg -n '"port"' docs/disaster-recovery/
```

Repository: pgEdge/control-plane
Length of output: 356
🏁 Script executed:

```bash
#!/bin/bash
# Let's examine the specific files mentioned - look at full-recovery.md around both line ranges
echo "=== Lines 370-398 in full-recovery.md ==="
sed -n '370,398p' docs/disaster-recovery/full-recovery.md
echo -e "\n=== Lines 584-604 in full-recovery.md ==="
sed -n '584,604p' docs/disaster-recovery/full-recovery.md
```

Repository: pgEdge/control-plane
Length of output: 1673
🏁 Script executed:

```bash
#!/bin/bash
# Check partial-recovery.md for port field in Steps 1.1 and 5.1 examples
echo "=== Searching for Step 1.1 and Step 5.1 in partial-recovery.md ==="
rg -n "Step 1.1|Step 5.1" docs/disaster-recovery/partial-recovery.md -A 30 | head -100
```

Repository: pgEdge/control-plane
Length of output: 3965
🏁 Script executed:

```bash
#!/bin/bash
# Check disaster-recovery.md around port field instances
echo "=== Context around line 206 in disaster-recovery.md ==="
sed -n '190,220p' docs/disaster-recovery/disaster-recovery.md
echo -e "\n=== Context around line 285 in disaster-recovery.md ==="
sed -n '270,300p' docs/disaster-recovery/disaster-recovery.md
```

Repository: pgEdge/control-plane
Length of output: 2826
🏁 Script executed:

```bash
#!/bin/bash
# Check if there's any documentation about port being required/optional
rg -n "port.*required|required.*port" docs/disaster-recovery/ -i
```

Repository: pgEdge/control-plane
Length of output: 46
Add "port": 5432 to database spec examples in Step 2.1 (remove_host) and Phase 4 (add nodes) to match other guides.
Both examples in full-recovery.md omit the "port" field from the spec body, while equivalent examples in partial-recovery.md (Steps 1.1 and 5.1) and disaster-recovery.md consistently include "port": 5432. This creates an inconsistency that could cause API errors during recovery operations.
✏️ Proposed fix (Step 2.1 example and Phase 4 example)

```diff
   "database_users": [
     {
       "username": "admin",
       "db_owner": true,
       "attributes": ["SUPERUSER", "LOGIN"]
     }
   ],
+  "port": 5432,
   "nodes": [
```

Also applies to: 584-604
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/full-recovery.md` around lines 370 - 398, Update the
two example JSON database specs in full-recovery.md to include the missing
"port": 5432 field so they match other guides: add "port": 5432 inside the
top-level "spec" object in the Step 2.1 (remove_host) example (the curl example
that posts an updated-database-spec for storefront) and in the Phase 4 (add
nodes) example's "spec" object; keep all other fields unchanged.
jason-lynch left a comment
Nice! This is looking a lot closer. I've left a few suggestions. Could you also please remove the other documents from this PR so that we don't forget?
- Host ID(s) of the lost host(s)
- SSH access to remaining cluster hosts (for Docker Swarm and host operations)
- The Control Plane stack definition file (YAML) from your initial deployment
- If etcd quorum was lost: a backup of the Control Plane data volume and (optionally) an etcd snapshot file
We need to be clearer here: you only need a backup if you've lost every host that was serving Etcd. Losing quorum just means you've lost more than 50% of the Control Plane servers that were serving Etcd. You can recover without a backup if you still have at least one server-mode host remaining.
Restoring the Control Plane data volume from your backup is environment-specific; we cannot document every possible procedure. For examples, see your provider's documentation:

- [AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/restore.html)
- [VMware vSphere](https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/container-storage-plugin/3-0/getting-started-with-vmware-vsphere-container-storage-plug-in-3-0/using-vsphere-container-storage-plug-in/volume-snapshot-and-restore.html)
- [Azure](https://learn.microsoft.com/en-us/azure/backup/backup-azure-arm-restore-vms)
- [Google Cloud](https://docs.cloud.google.com/compute/docs/disks/restore-snapshot)
I sent this list to give you an idea of what this process would look like in a production environment. I don't think it's very helpful in our documentation. Here's how I would phrase this section:

> You will need to restore from a previously created backup if you've lost 100% of your Control Plane servers configured to serve Etcd. This could be a snapshot of the data volume or any other type of backup that includes the Control Plane data directory for one of the lost servers. Only one Control Plane server backup is needed to restore the cluster.
>
> If you've lost 100% of your database instances, you'll need the data directory from at least one instance from each database. Your data volume backup will also include this data if you're restoring a host that was running an instance of each database. If not, you'll also need to restore at least one more host that does have this instance data.
>
> If you do not have any data volume backups that include your database instances, we recommend creating a new Control Plane cluster and restoring your databases from pgBackRest backups instead. See [Creating a New Database from a Backup](../using/backup-restore/#creating-a-new-database-from-a-backup) for more information.
On a surviving manager:

```bash
sudo docker swarm init --force-new-cluster --advertise-addr ${RECOVERY_HOST_IP}
```
This document alternates between using and not using sudo for docker commands. We should be consistent, and I prefer showing it without, since it aligns with Docker's own documentation, and it's typical to configure a non-root user with Docker access.

```diff
-sudo docker swarm init --force-new-cluster --advertise-addr ${RECOVERY_HOST_IP}
+docker swarm init --force-new-cluster --advertise-addr ${RECOVERY_HOST_IP}
```
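For context, after a `--force-new-cluster` init the other surviving or replacement managers have to rejoin; a minimal sketch using standard Docker commands (the token and address are placeholders):

```bash
# On the re-initialized manager: print the manager join command and token
docker swarm join-token manager

# On each host that should rejoin as a manager (token/IP are placeholders)
docker swarm join --token <manager-join-token> ${RECOVERY_HOST_IP}:2377
```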
Verify:

```bash
sudo docker node ls
```
```diff
-sudo docker node ls
+docker node ls
```
```bash
sudo docker service rm control-plane_<LOST_HOST_ID_1> control-plane_<LOST_HOST_ID_2>
sudo docker service ls
sudo docker service rm <orphaned-postgres-service-1> <orphaned-postgres-service-2>
```
```diff
-sudo docker service rm control-plane_<LOST_HOST_ID_1> control-plane_<LOST_HOST_ID_2>
-sudo docker service ls
-sudo docker service rm <orphaned-postgres-service-1> <orphaned-postgres-service-2>
+docker service rm control-plane_<LOST_HOST_ID_1> control-plane_<LOST_HOST_ID_2>
+docker service ls
+docker service rm <orphaned-postgres-service-1> <orphaned-postgres-service-2>
```
```bash
sudo mv /tmp/etcdutl /usr/local/bin/ && sudo chmod +x /usr/local/bin/etcdutl
```

3. **Restore etcd** from the backup directory (step 1 sets `ETCD_BACKUP_DIR`). If you have no existing etcd directory and are using a snapshot file instead, use that file path in place of `"${ETCD_BACKUP_DIR}/member/snap/db"`:
I feel like we need to make it clear why this step exists so that people don't feel tempted to skip it since they already have a data directory.

```diff
-3. **Restore etcd** from the backup directory (step 1 sets `ETCD_BACKUP_DIR`). If you have no existing etcd directory and are using a snapshot file instead, use that file path in place of `"${ETCD_BACKUP_DIR}/member/snap/db"`:
+3. **Restore etcd** from the backup directory (step 1 sets `ETCD_BACKUP_DIR`). If you have no existing etcd directory and are using a snapshot file instead, use that file path in place of `"${ETCD_BACKUP_DIR}/member/snap/db"`. This will restore quorum by reinitializing Etcd with a single cluster member:
```
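A minimal sketch of that restore step, assuming the guide's `ETCD_BACKUP_DIR` variable and an etcd data directory under `${PGEDGE_DATA_DIR}/etcd` (both assumptions taken from the surrounding steps):

```bash
# Restoring from a copied data directory resets membership to a single member;
# --skip-hash-check is needed because a copied db file lacks a snapshot integrity hash.
etcdutl snapshot restore "${ETCD_BACKUP_DIR}/member/snap/db" \
  --skip-hash-check \
  --data-dir "${PGEDGE_DATA_DIR}/etcd"
```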
When both etcd and Docker Swarm quorum were lost (you will complete [Restoring Docker Swarm](#restoring-docker-swarm) first):

```bash
RECOVERY_HOST_EXTERNAL_IP="<recovery-host-external-ip>"
ARCHIVE_VERSION="<control-plane-version>"
```
I worry that the internal/external IP stuff is confusing since it's environment-specific. We mention SSH a few times, so let's assume the user is SSH'ing in and working from one of the machines.

It also looks like ARCHIVE_VERSION should have been removed before.

````diff
-When both etcd and Docker Swarm quorum were lost (you will complete [Restoring Docker Swarm](#restoring-docker-swarm) first):
-
-```bash
-RECOVERY_HOST_EXTERNAL_IP="<recovery-host-external-ip>"
-ARCHIVE_VERSION="<control-plane-version>"
-```
````
```bash
docker service ps control-plane_${RECOVERY_HOST_ID} --no-trunc
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts"
# or, if using RECOVERY_HOST_EXTERNAL_IP: curl "http://${RECOVERY_HOST_EXTERNAL_IP}:${API_PORT}/v1/databases"
```
To match the suggestion above about removing the "external IP" concept:

```diff
-# or, if using RECOVERY_HOST_EXTERNAL_IP: curl "http://${RECOVERY_HOST_EXTERNAL_IP}:${API_PORT}/v1/databases"
```
| Docker Swarm commands hang | Swarm has lost quorum | Run [Reinitializing the Swarm](#reinitializing-the-swarm) on a surviving manager. |
| "service already exists" when deploying stack | Manually created service conflicts with stack | Run `docker service rm <service-name>`, then redeploy the stack. |
| Control Plane API hangs after etcd restore | etcd auth not fully re-enabled after restore | Check `docker service logs control-plane_<HOST_ID>`. Restart the service if necessary. |
| Image pull fails on new hosts | Registry was on a lost host | Recreate the registry on a surviving host and ensure new hosts can reach it. |
Looks like this should have been removed before:

```diff
-| Image pull fails on new hosts | Registry was on a lost host | Recreate the registry on a surviving host and ensure new hosts can reach it. |
```
| Quorum not restored | Too few server-mode hosts rejoined | Rejoin enough server-mode hosts to reach quorum (e.g. 2 of 3 for a 3-node cluster). |
| Docker Swarm commands hang | Swarm has lost quorum | Run [Reinitializing the Swarm](#reinitializing-the-swarm) on a surviving manager. |
| "service already exists" when deploying stack | Manually created service conflicts with stack | Run `docker service rm <service-name>`, then redeploy the stack. |
| Control Plane API hangs after etcd restore | etcd auth not fully re-enabled after restore | Check `docker service logs control-plane_<HOST_ID>`. Restart the service if necessary. |
This specific issue shouldn't happen now that you've removed the advice to run etcd by itself:

```diff
-| Control Plane API hangs after etcd restore | etcd auth not fully re-enabled after restore | Check `docker service logs control-plane_<HOST_ID>`. Restart the service if necessary. |
```
jason-lynch left a comment
Awesome! Thank you for doing this.
Summary

Adds disaster recovery documentation: an index plus two guides—Partial Failure Recovery (Quorum Intact) and Complete Failure Recovery (No Quorum)—so operators can recover the Control Plane when hosts or quorum are lost.

Changes

Testing

Checklist

- Changelog entry added (changes/unreleased/Added-disaster-recovery-docs.yaml)

Notes for Reviewers

- Branch docs/PLAT-313-314-etcd-swarm-disaster-recovery covers PLAT-313 and PLAT-314.
- The recovery flow relies on PGEDGE_ETCD_SERVER__FORCE_NEW_CLUSTER=true; worth a careful read.

ChangeLog

Entry added in changes/unreleased/Added-disaster-recovery-docs.yaml for the disaster recovery documentation.