Docs: add disaster recovery guides for partial and complete failure #268
moizpgedge merged 17 commits into main from docs/PLAT-313-314-etcd-swarm-disaster-recovery
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds comprehensive disaster recovery documentation for the Control Plane: an index plus three guides (partial recovery with quorum intact, complete/no-quorum recovery, and a combined disaster-recovery overview). Documents phased workflows, procedures for etcd and Docker Swarm recovery, prerequisites, verification, and troubleshooting. No code or public API changes.
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (4)
docs/disaster-recovery/full-recovery.md (2)
**288-295: Add a language specifier to the fenced code block.** Same as in partial-recovery.md — the `docker node ls` example output block should use ` ```text ` or ` ```console `.

Proposed fix:

````diff
-```
+```text
 ID   HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 288-295: update the fenced code block showing the `docker node ls` example output to include a language specifier (use ` ```text ` or ` ```console ` instead of plain ` ``` `) so syntax-aware renderers treat it as plain text; locate the fenced block containing the sample output for `docker node ls` (the block starting with the ID/HOSTNAME/STATUS header) and replace the opening fence with ` ```text ` (or ` ```console `) accordingly.
**204-210: Phase 1B backup step only preserves etcd directory, unlike Phase 1A.** Phase 1A Step 1A.3 backs up etcd data, certificates, and `generated.config.json`. Phase 1B Step 1B.1 only backs up the etcd directory. Since the surviving host in 1B still has valid certificates and config, this is likely intentional — but a brief note explaining the difference would help operators avoid second-guessing during a high-stress recovery.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 204-210: Phase 1B Step 1B.1 only backs up the etcd directory whereas Phase 1A Step 1A.3 also backs up certificates and generated.config.json; add a brief explanatory note under "Step 1B.1: Backup Existing etcd Data" clarifying that in Phase 1B the surviving host is expected to already have valid certificates and configuration so only the etcd data is preserved, and reference Step 1A.3 for the full backup behavior to avoid operator confusion.

docs/disaster-recovery/partial-recovery.md (1)
**134-140: Add a language specifier to the fenced code block.** The example `docker node ls` output block on line 135 lacks a language identifier. Use ` ```text ` or ` ```console ` for plain output blocks to satisfy the MD040 lint rule and improve rendering consistency.

Proposed fix:

````diff
-```
+```text
 ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
 959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
 l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/partial-recovery.md` around lines 134-140: the fenced code block that shows the docker node ls output (the block starting with "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS" in the partial-recovery example) is missing a language specifier and triggers MD040; update that block by adding a plain-output language tag such as `text` or `console` immediately after the opening backticks so the snippet becomes a labelled code fence and satisfies linting and rendering.

mkdocs.yml (1)

**67-69: Consider adding `disaster-recovery/index.md` as the section index page.** Other nav sections (e.g., "Prerequisites", "Installing Control Plane") include their `index.md` as the first entry, making the section header clickable. Currently, "Disaster Recovery" is just a label and `index.md` isn't reachable from the nav, even though it exists and contains useful context (the backup warning).

Proposed fix:

```diff
 - Disaster Recovery:
+    - disaster-recovery/index.md
     - Partial Failure Recovery (Quorum Intact): disaster-recovery/partial-recovery.md
     - Complete Failure Recovery (No Quorum): disaster-recovery/full-recovery.md
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@mkdocs.yml` around lines 67-69: the "Disaster Recovery" nav entry should include its section index so the header becomes clickable; update the mkdocs.yml nav under the "Disaster Recovery" section to insert "disaster-recovery/index.md" as the first entry before "disaster-recovery/partial-recovery.md" and "disaster-recovery/full-recovery.md" (so the section reads index.md, partial-recovery.md, full-recovery.md).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 385-397: The docs show a port mismatch: the registry is published
on port 5000 but the control plane service sets
PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST to 127.0.0.1:5001/pgedge; update the
environment variable in the control-plane service creation
(PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST in the
control-plane_${RECOVERY_HOST_ID} service) to 127.0.0.1:5000/pgedge to match the
registry, or explicitly document why 5001 is used (e.g., mirror/proxy) if that
is intentional.

In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 76-82: The doc currently shows monitoring the host-removal task
via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was
force-removed this endpoint may be unavailable; update the instructions to use a
cluster-level or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID>
or /v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after
deletion—replace the existing reference to
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative
endpoints.
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around lines 288-295: Update the fenced code block showing the `docker node ls` example output to include a language specifier (e.g., use ` ```text ` or ` ```console ` instead of plain ` ``` `) so syntax-aware renderers treat it as plain text; locate the fenced block containing the sample output for `docker node ls` (the block starting with the ID/HOSTNAME/STATUS header) and replace the opening fence with ` ```text ` (or ` ```console `) accordingly.
- Around lines 204-210: Phase 1B Step 1B.1 only backs up the etcd directory
whereas Phase 1A Step 1A.3 also backs up certificates and generated.config.json;
add a brief explanatory note under "Step 1B.1: Backup Existing etcd Data"
clarifying that in Phase 1B the surviving host is expected to already have valid
certificates and configuration so only the etcd data is preserved, and reference
Step 1A.3 for the full backup behavior to avoid operator confusion.

In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 134-140: The fenced code block that shows the docker node ls
output (the block starting with "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS"
in the partial-recovery example) is missing a language specifier and triggers
MD040; update that block by adding a plain-output language tag such as `text` or `console` immediately after the opening backticks so the snippet becomes a labelled code fence (e.g., change ` ``` ` to ` ```text `) to satisfy linting and rendering.

In `@mkdocs.yml`:
- Around line 67-69: The "Disaster Recovery" nav entry should include its
section index so the header becomes clickable; update the mkdocs.yml nav under
the "Disaster Recovery" section to insert "disaster-recovery/index.md" as the
first entry before "disaster-recovery/partial-recovery.md" and
"disaster-recovery/full-recovery.md" (so the section reads index.md,
partial-recovery.md, full-recovery.md).</details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
```sh
# Monitor host removal task
curl http://<HEALTHY_HOST>:3000/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>

# Monitor database update task logs
curl http://<HEALTHY_HOST>:3000/v1/databases/<DB>/tasks/<TASK_ID>/log
```
Task monitoring endpoint references the removed host.
Line 78 monitors the host removal task via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but the host was just force-removed in Step 1.1. This endpoint may not be available after deletion. Consider using a cluster-level or recovery-host-scoped task endpoint instead, or clarify the expected behavior (e.g., whether the task endpoint remains accessible for a removed host).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/partial-recovery.md` around lines 76 - 82, The doc
currently shows monitoring the host-removal task via
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was force-removed
this endpoint may be unavailable; update the instructions to use a cluster-level
or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID> or
/v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after
deletion—replace the existing reference to
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative
endpoints.
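To make the suggested alternatives concrete, here is a minimal sketch of checking a removal task from a node that still exists — both endpoint shapes below are the review's proposals, not verified against the API:

```bash
# Cluster-level task lookup (proposed endpoint, unverified)
curl "http://<HEALTHY_HOST>:3000/v1/tasks/<TASK_ID>"

# Or scope the lookup to a host that still exists (also a proposal)
curl "http://<HEALTHY_HOST>:3000/v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>"
```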
jason-lynch left a comment
This is getting closer! There are still a lot of repeated steps, and a few places where we're doing the same things in different ways. I think we can do more to deduplicate and simplify these documents. Could this just be one document?:
- Disaster recovery
    - Restoring Docker Swarm
        - Reinitializing the Swarm
        - Joining hosts to the new Swarm
        - Removing old Swarm nodes
    - Restoring the Control Plane
        - Reinitializing the Control Plane Etcd cluster
        - Updating databases to remove old hosts
        - Removing old hosts
        - Re-adding hosts
        - Updating databases to re-add hosts
mkdocs.yml (Outdated)

```yaml
- Disaster Recovery:
    - Partial Failure Recovery (Quorum Intact): disaster-recovery/partial-recovery.md
    - Complete Failure Recovery (No Quorum): disaster-recovery/full-recovery.md
```
Could you please remove this for now? I think @mmols had suggested that we go through QA and documentation review before we publish this to the site.
1. **[Total Quorum Loss](#phase-1a-total-quorum-loss)** — All server-mode hosts are offline (100% loss). Docker Swarm is still functional.
2. **[Majority Quorum Loss](#phase-1b-majority-quorum-loss)** — More than 50% of server-mode hosts are offline, but at least one remains online. Docker Swarm is still functional.
3. **[etcd and Docker Swarm Quorum Loss](#phase-1c-etcd-and-docker-swarm-quorum-loss)** — Both etcd and Docker Swarm have lost quorum (majority of hosts destroyed). Requires Swarm re-initialization, registry recreation, and image rebuild before etcd recovery.
"Registry recreation" and "image rebuild" are exclusive to our development environment, so we should not document them in our user-facing documentation.
You may see "cannot elect leader" errors when stopping services. These are expected. If Docker Swarm commands fail, stop containers directly:

```bash
docker ps --filter label=com.docker.swarm.service.name=control-plane_<host-id> --format "{{.ID}}" | xargs docker stop
```
What are the conditions where it becomes necessary to do this?
#### Step 1A.2: Restore Data Volume

Restore the Control Plane data volume from your backup.
We can't write specific instructions for this step because it's very dependent on the user's environment. You can simulate it locally using cp or Lima's snapshot tool, but in a real environment it will look quite different. Here are some examples of different procedures that users might follow:
- https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/restore.html
- https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/container-storage-plugin/3-0/getting-started-with-vmware-vsphere-container-storage-plug-in-3-0/using-vsphere-container-storage-plug-in/volume-snapshot-and-restore.html
- https://learn.microsoft.com/en-us/azure/backup/backup-azure-arm-restore-vms
- https://docs.cloud.google.com/compute/docs/disks/restore-snapshot
We can't document every possible way of restoring a volume or an instance from a backup, so I wouldn't get more specific than this first sentence.
```bash
ARCH=$(uname -m)
if [ "$ARCH" = "x86_64" ]; then ARCH="amd64"; elif [ "$ARCH" = "aarch64" ]; then ARCH="arm64"; fi
curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl
```
This will look better split onto two lines:

```diff
-curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl
+curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz \
+    | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl
```
```bash
curl -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/<DEAD_HOST_2>?force=true"
```

!!! important "Remove Hosts in Order"
This is not important. The order should not matter.
```bash
rm -f "${PGEDGE_DATA_DIR}/generated.config.json"
```

**For client-mode hosts:**
Healthy client-mode hosts do not need to be removed at all, so users should never be in this position.
#### Step 3B.1: Create New Host

Provision the replacement host. For Lima-based environments:
Please remove these Lima-specific instructions. This is something that's specific to our development environment, so these instructions are not meaningful to users.
If Swarm no longer has the service definition:

```bash
docker stack deploy -c <path-to-stack-yaml> control-plane
```
This is the first time we mention the stack definition in this document. The stack definition should be listed as a pre-requisite so that users know to prepare it ahead of time.
Also - keep in mind that doing this deploy will recreate and scale up all services in the stack. I would just do this one time after I've gotten all of my hosts ready to deploy again.
---

## Phase 1: Remove the Failed Host
Let's be consistent with the full recovery document and say that users should update the database to remove the lost host first, and then remove the host without the force parameter. As I mentioned in a comment on the other doc, I'm leaning towards removing the force parameter altogether from that remove host endpoint because it can lead to problems.
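Sketched as API calls, using the endpoint shapes that appear elsewhere in these guides (the spec body is elided on purpose):

```bash
# 1. Update each affected database so its spec no longer references the lost host
curl -X POST "http://<HEALTHY_HOST>:3000/v1/databases/<DB_ID>?remove_host=<LOST_HOST_ID>" \
  -H "Content-Type: application/json" \
  -d '<updated-database-spec>'

# 2. Then remove the host record, without force=true
curl -X DELETE "http://<HEALTHY_HOST>:3000/v1/hosts/<LOST_HOST_ID>"
```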
Actionable comments posted: 6
🧹 Nitpick comments (2)
docs/disaster-recovery/full-recovery.md (2)
**272-277: Fenced code block at line 272 is missing a language specifier (MD040).**

♻️ Proposed fix

````diff
-```
+```text
 ID   HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
 ...
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 272 - 277, The fenced code block starting with triple-backticks that contains the table header "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS" is missing a language specifier; update the opening fence to include a language (e.g., change ``` to ```text or ```console) so the block complies with MD040 and renders/linters correctly, leaving the block contents unchanged.
**77-80: Indented code blocks inside admonitions should be fenced (MD046).** markdownlint flags five locations (lines 79, 151, 184, 254, 551) where 4-space indented blocks are used inside `!!!` admonitions. These may not render as code blocks in all Markdown renderers; fenced blocks are the reliable and consistent approach.

Example fix for the pattern (apply similarly at all five locations):

♻️ Proposed fix pattern

```diff
 !!! important "Reset Cluster Membership for Multi-Node Clusters"
-    If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore` to reset cluster membership. Simply copying the etcd directory will not work.
+    If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore`
+    to reset cluster membership. Simply copying the etcd directory will not work.
```

For any nested code snippets inside the admonition, use a fenced block indented with 4 spaces:

```text
!!! note
    If Docker Swarm is still functional (only etcd lost quorum), use [Phase 1B](#phase-1b-majority-quorum-loss) instead.
```

Also applies to: 149-152, 182-185, 252-255, 549-552
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 77 - 80, Replace the 4-space indented code blocks inside the "!!! important" admonitions with fenced code blocks so they render consistently; locate the admonition starting with '!!! important "Reset Cluster Membership for Multi-Node Clusters"' and the other admonitions at the flagged locations (around lines referenced in the review: 79, 151, 184, 254, 551) and change any nested indented blocks to triple-backtick fenced blocks (```), preserving language hints if present and maintaining the same indentation level for the admonition content.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 200-213: Phase 1B currently contradicts itself: "Option A"
instructs running etcdctl snapshot save against a live endpoint (the etcdctl
snapshot save command) even though Phase 1B assumes majority quorum loss and
etcd is inaccessible; either delete the entire "Option A: Create Snapshot from
Existing Data" subsection or rewrite it to clearly define a distinct
sub-scenario (e.g., "Partial Quorum: single surviving node with etcd running")
that is explicitly separate from the majority-quorum-lost path, and in that
rewrite explain preconditions, when to run the etcdctl snapshot save command,
and add a cross-reference to "Option B" (pre-existing snapshot) for the true
quorum-loss recovery path so readers are not misled.
- Line 51: The document contains literal draft comments using C-style "// To Do"
text (e.g. "// TO DO link the installation.md file section here", "// To Do A
snapshot file is optional...", "// To Do The title of this section, "Majority
Quorum Loss"...", "// To Do a snapshot is optional...", "// To Do use etcdutl
snapshot restore", "// To Do we will keep one section in final doc") which
render verbatim and trigger MD046; remove or resolve each by either deleting
them, converting them to proper HTML comments (<!-- TODO: ... -->) or replacing
them with real content or a Markdown TODO/task list or an actual link (e.g.
install section link) so the draft notes no longer appear in the published
output and the lint warning is eliminated.
- Around line 315-346: Phase 1C is incomplete: add concrete Step 1C.4 and 1C.5
and replace the Step 1C.6 TODO with the etcd ForceNewCluster restore steps, then
update the dangling cross-reference and summary table. Specifically, create Step
1C.4 "Recreate container registry on surviving host" that documents rebuilding
or restoring the registry (recreate registry container, load required
images/tags, push images back into registry) and reference the surviving
host/node names (e.g., host-2/n2); add Step 1C.5 "Prepare etcd snapshot and
prerequisites" describing how to fetch/transfer the snapshot, validate it, stop
etcd on recovery nodes and prepare data dirs; replace Step 1C.6 with a full
"ForceNewCluster etcdutl snapshot restore" sequence (use etcdutl snapshot
restore with --force-new-cluster and correct --name/--initial-cluster settings,
restore to expected data-dir and systemd restart steps) and ensure Step 1C.7
remains the verification curl. Finally, update the Common Issues cross-reference
that mentions "Phase 1C, Step 1C.4" and the Summary table range/actions to
include the rebuilt registry and ForceNewCluster restore so documentation is
consistent.
- Around line 367-369: Update the Step 2.1 curl example so the URL uses the same
port variable as the rest of the document: replace the hardcoded ":3000" in the
curl command URL (the line starting with curl -X POST
"http://${RECOVERY_HOST_IP}:3000/...) with the template variable ":${API_PORT}"
to match other API calls and maintain consistency.
- Around line 37-42: The Set Variables block is missing RECOVERY_HOST_ID and
API_PORT which are referenced throughout the guide; update the initial variables
list to include RECOVERY_HOST_ID (the identifier/name of the recovery host used
in commands and SSH/etcd membership steps) and API_PORT (the control-plane API
port used by the recovery host), and ensure the examples show sensible
placeholders like RECOVERY_HOST_ID="<recovery-host-id>" and API_PORT=<api-port>
so all subsequent snippets (references to RECOVERY_HOST_ID and API_PORT) work
without reader guesswork; verify mentions in functions/sections that reference
RECOVERY_HOST_ID and API_PORT still match the variable names exactly.
- Around line 130-133: Remove the trailing space after the backslash in the curl
line so the shell line continuation works (ensure the backslash is the last
character before the newline in the "curl -L
https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz
\ | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl"
snippet) or collapse into a single line; also update the hardcoded release tag
from v3.6.5 to v3.6.8 or replace it with a VERSION variable so the etcd download
URL and the etcd-v... names stay consistent and current, and verify the
subsequent sudo mv/ chmod lines still reference the same etcdutl filename.
---
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 272-277: The fenced code block starting with triple-backticks that
contains the table header "ID HOSTNAME STATUS
AVAILABILITY MANAGER STATUS" is missing a language specifier; update the
opening fence to include a language (e.g., change ``` to ```text or ```console)
so the block complies with MD040 and renders/linters correctly, leaving the
block contents unchanged.
- Around line 77-80: Replace the 4-space indented code blocks inside the "!!!
important" admonitions with fenced code blocks so they render consistently;
locate the admonition starting with '!!! important "Reset Cluster Membership for
Multi-Node Clusters"' and the other admonitions at the flagged locations (around
lines referenced in the review: 79, 151, 184, 254, 551) and change any nested
indented blocks to triple-backtick fenced blocks (```), preserving language
hints if present and maintaining the same indentation level for the admonition
content.
Actionable comments posted: 4
🧹 Nitpick comments (2)
docs/disaster-recovery/full-recovery.md (2)
**330-332: Step 1C.7 lacks concrete commands for registry recreation — the hardest step in Phase 1C.** This is the only recovery path for operators who have lost both etcd and Swarm quorum. The step instructs them to "recreate the registry and build/push the Control Plane image" but provides no commands, and the `ARCHIVE_VERSION` variable defined in Set Variables (line 50) is clearly intended for exactly these commands. An operator in a production incident cannot act on prose alone.

Consider adding a code block with the docker registry/image commands (e.g., `docker service create` for the registry, `docker pull`/`docker tag`/`docker push` using `${ARCHIVE_VERSION}`), or at minimum link to a dedicated registry setup reference.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 330 - 332, Step 1C.7 is missing the concrete commands to recreate a container registry and push the Control Plane image using the ARCHIVE_VERSION variable; update the prose to include an explicit sequence showing how to create the registry service and then pull/tag/push the Control Plane image using ${ARCHIVE_VERSION} (or, if you prefer not to inline commands, add a clear link to a registry setup reference), and ensure you reference the ARCHIVE_VERSION variable defined earlier so operators can copy-paste the exact commands for "recreate registry" and "build/push Control Plane image" before starting the Control Plane service in Step 1C.7.
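For reference, such a block might look like the following sketch — the service name, registry port, and image repository path are assumptions, not the project's actual values:

```bash
# Recreate a local registry as a Swarm service (port 5000 assumed; see the
# port-mismatch discussion above before choosing 5000 vs 5001)
docker service create --name registry --publish published=5000,target=5000 registry:2

# Re-tag and push the Control Plane image using ${ARCHIVE_VERSION}
# (the source image name below is hypothetical)
docker pull pgedge/control-plane:${ARCHIVE_VERSION}
docker tag pgedge/control-plane:${ARCHIVE_VERSION} 127.0.0.1:5000/pgedge/control-plane:${ARCHIVE_VERSION}
docker push 127.0.0.1:5000/pgedge/control-plane:${ARCHIVE_VERSION}
```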
**461-469: Step 3A.2 doesn't distinguish server-mode from client-mode hosts, unlike the partial-recovery guide.** partial-recovery.md Step 3A.2 has separate blocks for server-mode (clear `etcd` + `certificates` + `generated.config.json`) and client-mode (clear only `generated.config.json`). This guide applies the server-mode commands to all hosts unconditionally. While `rm -rf` on absent directories is harmless, aligning with the partial-recovery guide's pattern avoids operator confusion when working across both docs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 461 - 469, Step 3A.2 currently applies the server-mode cleanup (removing ${PGEDGE_DATA_DIR}/etcd, /certificates, and generated.config.json) to all hosts; update this section to mirror partial-recovery.md by distinguishing server-mode vs client-mode hosts: for server-mode instruct removal of etcd, certificates, and generated.config.json, and for client-mode instruct removal of only generated.config.json (use the same labels "server-mode" and "client-mode" and reference the PGEDGE_DATA_DIR variable and filenames to make the distinction clear).
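A sketch of the two per-mode blocks, using the paths this guide already defines under `PGEDGE_DATA_DIR`:

```bash
# Server-mode hosts: clear etcd state, certificates, and the generated config
rm -rf "${PGEDGE_DATA_DIR}/etcd"
rm -rf "${PGEDGE_DATA_DIR}/certificates"
rm -f "${PGEDGE_DATA_DIR}/generated.config.json"

# Client-mode hosts: only the generated config needs to be cleared
rm -f "${PGEDGE_DATA_DIR}/generated.config.json"
```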
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 48-51: The variable ARCHIVE_VERSION is declared but unused; either
remove the ARCHIVE_VERSION="<control-plane-version>" line from the env block or
add the registry-rebuild commands that consume it (e.g., the
registry-rebuild/registry restore commands referenced in Step 1C.7) so the guide
actually uses ${ARCHIVE_VERSION}; update the examples and any subsequent
commands to reference ARCHIVE_VERSION in place of a hard-coded version to make
the variable meaningful.
- Around line 256-261: The fenced code block showing the `docker node ls` output
is missing a language specifier which triggers MD040; update the triple-backtick
fence for that block in full-recovery.md to include a language tag such as text
(e.g., change ``` to ```text) so the block is explicitly marked and linting
passes; locate the block containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER
STATUS` table and add the language specifier to the opening fence.
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 105-110: Add a language specifier to the fenced code block
containing the `docker node ls` example (the block showing ID, HOSTNAME, STATUS,
AVAILABILITY, MANAGER STATUS) to silence MD040 and improve rendering; update the
opening fence from ``` to ```text (or ```console) so the block becomes a labeled
code block.
- Line 403: The step label is incorrect: remove the "(force remove)" note from
Step 1.1 and add it to the Step 1.2 text so the host deletion action is
documented under Step 1.2 (leave Step 1.1 describing only database updates);
update the sentence for Step 1.2 to read something like "Step 1.2 — force remove
(host record deletion) and wait for the task to finish before rejoining" so the
force-remove action is clearly associated with Step 1.2.
---
Duplicate comments:
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 88-92: The doc uses the per-host task endpoint
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> to poll a removal task on a host that
was just deleted; since that host may become unavailable, update the example to
query a stable API host instead—either call the cluster-level endpoint
/v1/tasks/<TASK_ID> (preferred) or replace <LOST_HOST_ID> with <HEALTHY_HOST> so
the curl targets a healthy node (/v1/hosts/<HEALTHY_HOST>/tasks/<TASK_ID>), and
update the curl example accordingly.
---
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 330-332: Step 1C.7 is missing the concrete commands to recreate a
container registry and push the Control Plane image using the ARCHIVE_VERSION
variable; update the prose to include an explicit sequence showing how to create
the registry service and then pull/tag/push the Control Plane image using
${ARCHIVE_VERSION} (or, if you prefer not to inline commands, add a clear link
to a registry setup reference), and ensure you reference the ARCHIVE_VERSION
variable defined earlier so operators can copy-paste the exact commands for
"recreate registry" and "build/push Control Plane image" before starting the
Control Plane service in Step 1C.7.
- Around line 461-469: Step 3A.2 currently applies the server-mode cleanup
(removing ${PGEDGE_DATA_DIR}/etcd, /certificates, and generated.config.json) to
all hosts; update this section to mirror partial-recovery.md by distinguishing
server-mode vs client-mode hosts: for server-mode instruct removal of etcd,
certificates, and generated.config.json, and for client-mode instruct removal of
only generated.config.json (use the same labels "server-mode" and "client-mode"
and reference the PGEDGE_DATA_DIR variable and filenames to make the distinction
clear).
```bash
RECOVERY_HOST_EXTERNAL_IP="<recovery-host-external-ip>"  # e.g., 192.168.105.4
ARCHIVE_VERSION="<control-plane-version>"                # e.g., 0.6.2
```
ARCHIVE_VERSION is defined but never referenced in the document.
No code block anywhere in the guide uses ${ARCHIVE_VERSION}. Either add the concrete registry-rebuild commands that consume it (see also Step 1C.7 below), or remove it from the variable block to avoid confusing operators.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/full-recovery.md` around lines 48 - 51, The variable
ARCHIVE_VERSION is declared but unused; either remove the
ARCHIVE_VERSION="<control-plane-version>" line from the env block or add the
registry-rebuild commands that consume it (e.g., the registry-rebuild/registry
restore commands referenced in Step 1C.7) so the guide actually uses
${ARCHIVE_VERSION}; update the examples and any subsequent commands to reference
ARCHIVE_VERSION in place of a hard-coded version to make the variable
meaningful.
```
ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
4aoqjp3q8jcny4kec5nadcn6x   lima-host-1   Down    Active        Unreachable
959g9937i62judknmr40kcw9r * lima-host-2   Ready   Active        Leader
l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
```
Add a language specifier to the fenced code block.
Same MD040 issue as in partial-recovery.md — the docker node ls output block lacks a language tag.
✏️ Proposed fix

````diff
-```
+```text
 ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x   lima-host-1   Down    Active        Unreachable
 959g9937i62judknmr40kcw9r * lima-host-2   Ready   Active        Leader
 l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
````

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 256-256: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/full-recovery.md` around lines 256 - 261, The fenced
code block showing the `docker node ls` output is missing a language specifier
which triggers MD040; update the triple-backtick fence for that block in
full-recovery.md to include a language tag such as text (e.g., change ``` to
```text) so the block is explicitly marked and linting passes; locate the block
containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS` table and add
the language specifier to the opening fence.
```
ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
```
Add a language specifier to the fenced code block.
The docker node ls example output uses a plain fence, which triggers MD040. Use text or console to silence the warning and improve rendering.
✏️ Proposed fix

````diff
-```
+```text
 ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
 959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
 l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
````
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/partial-recovery.md` around lines 105 - 110, Add a
language specifier to the fenced code block containing the `docker node ls`
example (the block showing ID, HOSTNAME, STATUS, AVAILABILITY, MANAGER STATUS)
to silence MD040 and improve rendering; update the opening fence from ``` to
```text (or ```console) so the block becomes a labeled code block.
Actionable comments posted: 3
🧹 Nitpick comments (1)
docs/disaster-recovery/full-recovery.md (1)
**706-714: Unquoted `$PGEDGE_DATA_DIR` in `rm` commands — inconsistent quoting.** The Common Issues `rm` snippet uses unquoted variables. Every other `rm`/`mv` block in the guide (Steps 1A.3, 1C.5, 3A.2) quotes `"${PGEDGE_DATA_DIR}/..."`. An unquoted path breaks if `PGEDGE_DATA_DIR` contains spaces.

✏️ Proposed fix

```diff
-rm -rf ${PGEDGE_DATA_DIR}/etcd
-rm -rf ${PGEDGE_DATA_DIR}/certificates
-rm -f ${PGEDGE_DATA_DIR}/generated.config.json
+rm -rf "${PGEDGE_DATA_DIR}/etcd"
+rm -rf "${PGEDGE_DATA_DIR}/certificates"
+rm -f "${PGEDGE_DATA_DIR}/generated.config.json"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 706 - 714, The three rm lines use an unquoted environment variable and will break for paths with spaces; update the commands that remove etcd, certificates, and generated.config.json to use the PGEDGE_DATA_DIR variable wrapped in double quotes (i.e., quote the "${PGEDGE_DATA_DIR}/...") so the paths are treated as single arguments and match the quoting style used in other steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/disaster-recovery.md`:
- Around line 219-226: The per-host task monitor URL shown after the DELETE call
(DELETE /v1/hosts/<LOST_HOST_ID>) may 404 because the host resource was removed;
update the docs to use the global task endpoint instead (e.g.,
/v1/tasks/<TASK_ID>) for monitoring or explicitly note that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> must remain accessible post-deletion;
adjust the example and the explanatory text around the DELETE
/v1/hosts/<LOST_HOST_ID> and the monitoring line so readers use
/v1/tasks/<TASK_ID> or understand the host-scoped URL behavior.
- Around line 178-183: The phrase "(e.g. after Path 1A/1B)" is undefined in this
document; replace that parenthetical in the Control Plane start instructions
with a contextual description or a cross-link — for example change the text to
"(e.g. after completing the Swarm and etcd restore steps above)" or add an
explicit link to full-recovery.md — so readers of disaster-recovery.md have a
meaningful anchor instead of "Path 1A/1B".
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 370-398: Update the two example JSON database specs in
full-recovery.md to include the missing "port": 5432 field so they match other
guides: add "port": 5432 inside the top-level "spec" object in the Step 2.1
(remove_host) example (the curl example that posts an updated-database-spec for
storefront) and in the Phase 4 (add nodes) example's "spec" object; keep all
other fields unchanged.
---
Duplicate comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 48-51: ARCHIVE_VERSION is declared but never used in the guide;
either remove the ARCHIVE_VERSION variable or update Phase 1C (specifically Step
1C.7) to use it by adding the registry-rebuild / image-tagging commands that
consume ${ARCHIVE_VERSION} (i.e., where you rebuild or retag control-plane
images to the archived version). Locate the variable block and the Phase 1C Step
1C.7 text, then either delete the ARCHIVE_VERSION line or insert the appropriate
commands that reference ${ARCHIVE_VERSION} so the variable is actually consumed.
- Around line 256-261: The fenced code block showing the output of `docker node
ls` is missing a language specifier which triggers MD040; update the fence that
contains the lines starting with "ID HOSTNAME
STATUS..." so it becomes a triple-backtick fence with a language specifier
(e.g., use "text") instead of a plain fence to suppress the lint warning and
ensure proper rendering.
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 88-92: The monitor URL references a per-host endpoint that may no
longer resolve after force-removing the host; change the example curl to use a
cluster-level task endpoint (`/v1/tasks/<TASK_ID>`) or add a note that
`/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>` remains accessible
post-deletion—update the example command and text to use `curl
http://<HEALTHY_HOST>:3000/v1/tasks/<TASK_ID>` or explicitly state the
host-scoped endpoint behavior so operators can reliably check task status.
- Around line 105-110: The fenced code block showing the output of `docker node
ls` is missing a language specifier which triggers MD040; update the opening
fence for that block (the triple backticks before the lines starting "ID
HOSTNAME STATUS...") to include a language tag such as text (i.e., change ```
to ```text) so the block is explicitly marked and renders correctly.
---
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 706-714: The three rm lines use an unquoted environment variable
and will break for paths with spaces; update the commands that remove etcd,
certificates, and generated.config.json to use the PGEDGE_DATA_DIR variable
wrapped in double quotes (i.e., quote the "${PGEDGE_DATA_DIR}/...") so the paths
are treated as single arguments and match the quoting style used in other steps.
4. **Start the Control Plane** and verify:

    - If Control Plane is already deployed as Swarm services (e.g. after Path 1A/1B):
      `docker service scale control-plane_${RECOVERY_HOST_ID}=1`
    - If you completed [Restoring Docker Swarm](#restoring-docker-swarm) and deploy via stack:
      `docker stack deploy -c <path-to-stack-yaml> control-plane` (do not set `PGEDGE_ETCD_SERVER__FORCE_NEW_CLUSTER`).
"Path 1A/1B" labels are undefined in this document.
Line 180 reads: (e.g. after Path 1A/1B). Those path labels are internal to full-recovery.md and don't appear anywhere in this guide. Readers working from this document only will have no anchor for "1A/1B". Replace with a description meaningful in this document's context (e.g., "after the Swarm and etcd restore steps above") or cross-link to full-recovery.md.
✏️ Proposed fix

```diff
- - If Control Plane is already deployed as Swarm services (e.g. after Path 1A/1B):
+ - If Control Plane is already deployed as Swarm services (i.e. Swarm is functional and services were only scaled down):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/disaster-recovery.md` around lines 178 - 183, The
phrase "(e.g. after Path 1A/1B)" is undefined in this document; replace that
parenthetical in the Control Plane start instructions with a contextual
description or a cross-link — for example change the text to "(e.g. after
completing the Swarm and etcd restore steps above)" or add an explicit link to
full-recovery.md — so readers of disaster-recovery.md have a meaningful anchor
instead of "Path 1A/1B".
Thanks for the suggestion. Will be fixed in a new commit.
@moizpgedge, sounds good! Looking forward to seeing the update in the next commit. 👍
After all affected databases have been updated, remove each lost host from the Control Plane (one at a time; wait for each removal task to complete):

```sh
curl -X DELETE "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/<LOST_HOST_ID>"
```

Monitor: `curl http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>`
Task monitor URL scoped to a deleted host may not resolve.
After the DELETE /v1/hosts/<LOST_HOST_ID> call, the monitoring line uses:
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>
If the API returns 404 for task endpoints on removed hosts, this instruction will silently fail during recovery. This is the same pattern flagged (unresolved) in partial-recovery.md. Consider using a generic task endpoint or adding a note that the per-host task URL remains accessible post-deletion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/disaster-recovery.md` around lines 219 - 226, The
per-host task monitor URL shown after the DELETE call (DELETE
/v1/hosts/<LOST_HOST_ID>) may 404 because the host resource was removed; update
the docs to use the global task endpoint instead (e.g., /v1/tasks/<TASK_ID>) for
monitoring or explicitly note that /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> must
remain accessible post-deletion; adjust the example and the explanatory text
around the DELETE /v1/hosts/<LOST_HOST_ID> and the monitoring line so readers
use /v1/tasks/<TASK_ID> or understand the host-scoped URL behavior.
Use the `remove_host` query parameter to remove instances from destroyed hosts:

```sh
curl -X POST "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/<DB_ID>?remove_host=<DEAD_HOST_1>&remove_host=<DEAD_HOST_2>" \
  -H "Content-Type: application/json" \
  -d '<updated-database-spec>'
```

Example:

```sh
curl -X POST "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/storefront?remove_host=host-1&remove_host=host-3" \
  -H "Content-Type: application/json" \
  -d '{
    "spec": {
      "database_name": "storefront",
      "database_users": [
        {
          "username": "admin",
          "db_owner": true,
          "attributes": ["SUPERUSER", "LOGIN"]
        }
      ],
      "nodes": [
        { "name": "n2", "host_ids": ["host-2"] }
      ]
    }
  }'
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Find all instances of "port" in disaster-recovery guides
rg -n '"port"' docs/disaster-recovery/
```

Repository: pgEdge/control-plane
Length of output: 356
🏁 Script executed:

```bash
#!/bin/bash
# Let's examine the specific files mentioned - look at full-recovery.md around both line ranges
echo "=== Lines 370-398 in full-recovery.md ==="
sed -n '370,398p' docs/disaster-recovery/full-recovery.md
echo -e "\n=== Lines 584-604 in full-recovery.md ==="
sed -n '584,604p' docs/disaster-recovery/full-recovery.md
```

Repository: pgEdge/control-plane
Length of output: 1673
🏁 Script executed:

```bash
#!/bin/bash
# Check partial-recovery.md for port field in Steps 1.1 and 5.1 examples
echo "=== Searching for Step 1.1 and Step 5.1 in partial-recovery.md ==="
rg -n "Step 1.1|Step 5.1" docs/disaster-recovery/partial-recovery.md -A 30 | head -100
```

Repository: pgEdge/control-plane
Length of output: 3965
🏁 Script executed:

```bash
#!/bin/bash
# Check disaster-recovery.md around port field instances
echo "=== Context around line 206 in disaster-recovery.md ==="
sed -n '190,220p' docs/disaster-recovery/disaster-recovery.md
echo -e "\n=== Context around line 285 in disaster-recovery.md ==="
sed -n '270,300p' docs/disaster-recovery/disaster-recovery.md
```

Repository: pgEdge/control-plane
Length of output: 2826
🏁 Script executed:

```bash
#!/bin/bash
# Check if there's any documentation about port being required/optional
rg -n "port.*required|required.*port" docs/disaster-recovery/ -i
```

Repository: pgEdge/control-plane
Length of output: 46
Add "port": 5432 to database spec examples in Step 2.1 (remove_host) and Phase 4 (add nodes) to match other guides.
Both examples in full-recovery.md omit the "port" field from the spec body, while equivalent examples in partial-recovery.md (Steps 1.1 and 5.1) and disaster-recovery.md consistently include "port": 5432. This creates an inconsistency that could cause API errors during recovery operations.
✏️ Proposed fix (Step 2.1 example and Phase 4 example)

```diff
   "database_users": [
     {
       "username": "admin",
       "db_owner": true,
       "attributes": ["SUPERUSER", "LOGIN"]
     }
   ],
+  "port": 5432,
   "nodes": [
```

Also applies to: 584-604
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/full-recovery.md` around lines 370 - 398, Update the
two example JSON database specs in full-recovery.md to include the missing
"port": 5432 field so they match other guides: add "port": 5432 inside the
top-level "spec" object in the Step 2.1 (remove_host) example (the curl example
that posts an updated-database-spec for storefront) and in the Phase 4 (add
nodes) example's "spec" object; keep all other fields unchanged.
jason-lynch left a comment
Nice! This is looking a lot closer. I've left a few suggestions. Could you also please remove the other documents from this PR so that we don't forget?
- Host ID(s) of the lost host(s)
- SSH access to remaining cluster hosts (for Docker Swarm and host operations)
- The Control Plane stack definition file (YAML) from your initial deployment
- If etcd quorum was lost: a backup of the Control Plane data volume and (optionally) an etcd snapshot file
We need to be clearer here: you only need a backup if you've lost every host that was serving Etcd. Losing quorum just means you've lost more than 50% of the Control Plane servers that were serving Etcd. You can recover without a backup if you still have at least one server-mode host remaining.
Restoring the Control Plane data volume from your backup is environment-specific; we cannot document every possible procedure. For examples, see your provider's documentation:

- [AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/restore.html)
- [VMware vSphere](https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/container-storage-plugin/3-0/getting-started-with-vmware-vsphere-container-storage-plug-in-3-0/using-vsphere-container-storage-plug-in/volume-snapshot-and-restore.html)
- [Azure](https://learn.microsoft.com/en-us/azure/backup/backup-azure-arm-restore-vms)
- [Google Cloud](https://docs.cloud.google.com/compute/docs/disks/restore-snapshot)
I sent this list to give you an idea of what this process would look like in a production environment. I don't think it's very helpful in our documentation. Here's how I would phrase this section:

> You will need to restore from a previously created backup if you've lost 100% of your Control Plane servers configured to serve Etcd. This could be a snapshot of the data volume or any other type of backup that includes the Control Plane data directory for one of the lost servers. Only one Control Plane server backup is needed to restore the cluster.
>
> If you've lost 100% of your database instances, you'll need the data directory from at least one instance from each database. Your data volume backup will also include this data if you're restoring a host that was running an instance of each database. If not, you'll also need to restore at least one more host that does have this instance data.
>
> If you do not have any data volume backups that include your database instances, we recommend creating a new Control Plane cluster and restoring your databases from pgBackRest backups instead. See [Creating a New Database from a Backup](../using/backup-restore/#creating-a-new-database-from-a-backup) for more information.
On a surviving manager:

```bash
sudo docker swarm init --force-new-cluster --advertise-addr ${RECOVERY_HOST_IP}
```
This document alternates between using and not using sudo for docker commands. We should be consistent, and I prefer showing it without, since it aligns with Docker's own documentation, and it's typical to configure a non-root user with Docker access.

```diff
-sudo docker swarm init --force-new-cluster --advertise-addr ${RECOVERY_HOST_IP}
+docker swarm init --force-new-cluster --advertise-addr ${RECOVERY_HOST_IP}
```
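For context, after a `--force-new-cluster` init the other surviving or replacement managers have to rejoin; a minimal sketch using standard Docker commands (the token and address are placeholders):

```bash
# On the re-initialized manager: print the manager join command and token
docker swarm join-token manager

# On each host that should rejoin as a manager (token/IP are placeholders)
docker swarm join --token <manager-join-token> ${RECOVERY_HOST_IP}:2377
```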
Verify:

```bash
sudo docker node ls
```
```diff
-sudo docker node ls
+docker node ls
```
```bash
sudo docker service rm control-plane_<LOST_HOST_ID_1> control-plane_<LOST_HOST_ID_2>
sudo docker service ls
sudo docker service rm <orphaned-postgres-service-1> <orphaned-postgres-service-2>
```
```diff
-sudo docker service rm control-plane_<LOST_HOST_ID_1> control-plane_<LOST_HOST_ID_2>
-sudo docker service ls
-sudo docker service rm <orphaned-postgres-service-1> <orphaned-postgres-service-2>
+docker service rm control-plane_<LOST_HOST_ID_1> control-plane_<LOST_HOST_ID_2>
+docker service ls
+docker service rm <orphaned-postgres-service-1> <orphaned-postgres-service-2>
```
```bash
sudo mv /tmp/etcdutl /usr/local/bin/ && sudo chmod +x /usr/local/bin/etcdutl
```

3. **Restore etcd** from the backup directory (step 1 sets `ETCD_BACKUP_DIR`). If you have no existing etcd directory and are using a snapshot file instead, use that file path in place of `"${ETCD_BACKUP_DIR}/member/snap/db"`:
I feel like we need to make it clear why this step exists so that people don't feel tempted to skip it since they already have a data directory.

```diff
-3. **Restore etcd** from the backup directory (step 1 sets `ETCD_BACKUP_DIR`). If you have no existing etcd directory and are using a snapshot file instead, use that file path in place of `"${ETCD_BACKUP_DIR}/member/snap/db"`:
+3. **Restore etcd** from the backup directory (step 1 sets `ETCD_BACKUP_DIR`). If you have no existing etcd directory and are using a snapshot file instead, use that file path in place of `"${ETCD_BACKUP_DIR}/member/snap/db"`. This will restore quorum by reinitializing Etcd with a single cluster member:
```
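A minimal sketch of that restore step, assuming the guide's `ETCD_BACKUP_DIR` variable and an etcd data directory under `${PGEDGE_DATA_DIR}/etcd` (both assumptions taken from the surrounding steps):

```bash
# Restoring from a copied data directory resets membership to a single member;
# --skip-hash-check is needed because a copied db file lacks a snapshot integrity hash.
etcdutl snapshot restore "${ETCD_BACKUP_DIR}/member/snap/db" \
  --skip-hash-check \
  --data-dir "${PGEDGE_DATA_DIR}/etcd"
```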
When both etcd and Docker Swarm quorum were lost (you will complete [Restoring Docker Swarm](#restoring-docker-swarm) first):

```bash
RECOVERY_HOST_EXTERNAL_IP="<recovery-host-external-ip>"
ARCHIVE_VERSION="<control-plane-version>"
```
I worry that the internal/external IP stuff is confusing since it's environment-specific. We mention SSH a few times, so let's assume the user is SSH'ing in and working from one of the machines.

It also looks like ARCHIVE_VERSION should have been removed before.

````diff
-When both etcd and Docker Swarm quorum were lost (you will complete [Restoring Docker Swarm](#restoring-docker-swarm) first):
-
-```bash
-RECOVERY_HOST_EXTERNAL_IP="<recovery-host-external-ip>"
-ARCHIVE_VERSION="<control-plane-version>"
-```
````
```bash
docker service ps control-plane_${RECOVERY_HOST_ID} --no-trunc
curl -sS "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/hosts"
# or, if using RECOVERY_HOST_EXTERNAL_IP: curl "http://${RECOVERY_HOST_EXTERNAL_IP}:${API_PORT}/v1/databases"
```
To match the suggestion above about removing the "external IP" concept:

```diff
-# or, if using RECOVERY_HOST_EXTERNAL_IP: curl "http://${RECOVERY_HOST_EXTERNAL_IP}:${API_PORT}/v1/databases"
```
| Docker Swarm commands hang | Swarm has lost quorum | Run [Reinitializing the Swarm](#reinitializing-the-swarm) on a surviving manager. |
| "service already exists" when deploying stack | Manually created service conflicts with stack | Run `docker service rm <service-name>`, then redeploy the stack. |
| Control Plane API hangs after etcd restore | etcd auth not fully re-enabled after restore | Check `docker service logs control-plane_<HOST_ID>`. Restart the service if necessary. |
| Image pull fails on new hosts | Registry was on a lost host | Recreate the registry on a surviving host and ensure new hosts can reach it. |
Looks like this should have been removed before:

```diff
-| Image pull fails on new hosts | Registry was on a lost host | Recreate the registry on a surviving host and ensure new hosts can reach it. |
```
| Quorum not restored | Too few server-mode hosts rejoined | Rejoin enough server-mode hosts to reach quorum (e.g. 2 of 3 for a 3-node cluster). |
| Docker Swarm commands hang | Swarm has lost quorum | Run [Reinitializing the Swarm](#reinitializing-the-swarm) on a surviving manager. |
| "service already exists" when deploying stack | Manually created service conflicts with stack | Run `docker service rm <service-name>`, then redeploy the stack. |
| Control Plane API hangs after etcd restore | etcd auth not fully re-enabled after restore | Check `docker service logs control-plane_<HOST_ID>`. Restart the service if necessary. |
This specific issue shouldn't happen now that you've removed the advice to run etcd by itself:

```diff
-| Control Plane API hangs after etcd restore | etcd auth not fully re-enabled after restore | Check `docker service logs control-plane_<HOST_ID>`. Restart the service if necessary. |
```
jason-lynch left a comment
Awesome! Thank you for doing this.
Summary

Adds disaster recovery documentation: an index plus two guides—Partial Failure Recovery (Quorum Intact) and Complete Failure Recovery (No Quorum)—so operators can recover the Control Plane when hosts or quorum are lost.

Changes

Testing

Checklist

- Changelog entry added (changes/unreleased/Added-disaster-recovery-docs.yaml)

Notes for Reviewers

- Branch docs/PLAT-313-314-etcd-swarm-disaster-recovery covers PLAT-313 and PLAT-314.
- The recovery flow relies on PGEDGE_ETCD_SERVER__FORCE_NEW_CLUSTER=true; worth a careful read.

ChangeLog

Entry added in changes/unreleased/Added-disaster-recovery-docs.yaml for the disaster recovery documentation.