Commit 7241a02

Allow Streaming Replication
Clusters can now be configured to automatically enable streaming replication from a remote primary.

- The `spec.standby` section of the postgrescluster spec allows users to define a `host` and `port` that point to a remote primary.
- The `repoName` field is now optional.
- Certificate authentication is required when connecting to the primary. Users must configure custom TLS certificates on the standby that allow this authentication method.
- The replication user will be the default `_crunchyrepl` user.
- A cluster will not be created if the standby spec is invalid.
- kuttl: deploy two clusters, a primary and a standby, in a single namespace. Ensure that the standby cluster has replicated the primary data and that the walreceiver process is running.

1 parent 5e69e97 commit 7241a02
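
As a rough sketch of the configuration this commit enables, the new `spec.standby` fields might be used like this; the host value and port are placeholders, and only `enabled`, `host`, `port`, and `repoName` come from the CRD change shown below:

```
spec:
  standby:
    enabled: true
    host: primary.example.com   # placeholder network address of the remote primary
    port: 5432                  # must be >= 1024 per the CRD schema
    # repoName is now optional; set it (e.g. repo1) to also replay WAL files
    # from a pgBackRest repository while streaming
```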

File tree

22 files changed: +564 additions, −141 deletions

config/crd/bases/postgres-operator.crunchydata.com_postgresclusters.yaml

Lines changed: 12 additions & 4 deletions
````diff
@@ -10656,16 +10656,24 @@ spec:
             enabled:
               default: true
               description: Whether or not the PostgreSQL cluster should be read-only.
-                When this is true, WAL files are applied from the pgBackRest
-                repository.
+                When this is true, WAL files are applied from a pgBackRest repository
+                or another PostgreSQL server.
               type: boolean
+            host:
+              description: Network address of the PostgreSQL server to follow
+                via streaming replication.
+              type: string
+            port:
+              description: Network port of the PostgreSQL server to follow via
+                streaming replication.
+              format: int32
+              minimum: 1024
+              type: integer
             repoName:
               description: The name of the pgBackRest repository to follow for
                 WAL files.
               pattern: ^repo[1-4]
               type: string
-          required:
-          - repoName
           type: object
         supplementalGroups:
           description: 'A list of group IDs applied to the process of a container.
````

docs/content/architecture/disaster-recovery.md

Lines changed: 67 additions & 65 deletions
````diff
@@ -5,115 +5,117 @@ draft: false
 weight: 140
 ---
 
-![PostgreSQL Operator High-Availability Overview](/images/postgresql-ha-multi-data-center.png)
-
-Advanced [high-availability]({{< relref "architecture/high-availability.md" >}})
-and [backup management]({{< relref "architecture/backups.md" >}})
-strategies involve spreading your database clusters across multiple data centers
-to help maximize uptime. In Kubernetes, this technique is known as "[federation](https://en.wikipedia.org/wiki/Federation_(information_technology))".
-Federated Kubernetes clusters are able to communicate with each other,
+Advanced high-availability and disaster recovery strategies involve spreading
+your database clusters across multiple data centers to help maximize uptime.
+In Kubernetes, this technique is known as "[federation](https://en.wikipedia.org/wiki/Federation_(information_technology))".
+Federated Kubernetes clusters can communicate with each other,
 coordinate changes, and provide resiliency for applications that have high
 uptime requirements.
 
 As of this writing, federation in Kubernetes is still in ongoing development
 and is something we monitor with intense interest. As Kubernetes federation
 continues to mature, we wanted to provide a way to deploy PostgreSQL clusters
 managed by the [PostgreSQL Operator](https://www.crunchydata.com/developers/download-postgres/containers/postgres-operator)
-that can span multiple Kubernetes clusters. This can be accomplished with a
-few environmental setups:
-
-- Two Kubernetes clusters
-- An external storage system, using one of the following:
-  - S3, or an external storage system that uses the S3 protocol
-  - GCS
-  - Azure Blob Storage
-- A Kubernetes storage system that can span multiple clusters
+that can span multiple Kubernetes clusters.
 
 At a high-level, the PostgreSQL Operator follows the "active-standby" data
 center deployment model for managing the PostgreSQL clusters across Kubernetes
-clusters. In one Kubernetes cluster, the PostgreSQL Operator deploy PostgreSQL as an
+clusters. In one Kubernetes cluster, the PostgreSQL Operator deploys PostgreSQL as an
 "active" PostgreSQL cluster, which means it has one primary and one-or-more
 replicas. In another Kubernetes cluster, the PostgreSQL cluster is deployed as
 a "standby" cluster: every PostgreSQL instance is a replica.
 
 A side-effect of this is that in each of the Kubernetes clusters, the PostgreSQL
 Operator can be used to deploy both active and standby PostgreSQL clusters,
-allowing you to mix and match! While the mixing and matching may not ideal for
+allowing you to mix and match! While the mixing and matching may not be ideal for
 how you deploy your PostgreSQL clusters, it does allow you to perform online
 moves of your PostgreSQL data to different Kubernetes clusters as well as manual
 online upgrades.
 
 Lastly, while this feature does extend high-availability, promoting a standby
 cluster to an active cluster is **not** automatic. While the PostgreSQL clusters
-within a Kubernetes cluster do support self-managed high-availability, a
-cross-cluster deployment requires someone to specifically promote the cluster
+within a Kubernetes cluster support self-managed high-availability, a
+cross-cluster deployment requires someone to promote the cluster
 from standby to active.
 
 ## Standby Cluster Overview
 
-Standby PostgreSQL clusters are managed just like any other PostgreSQL cluster
-that is managed by the PostgreSQL Operator. For example, adding replicas to a
-standby cluster is identical as adding them to a primary cluster.
+Standby PostgreSQL clusters are managed like any other PostgreSQL cluster that the PostgreSQL
+Operator manages. For example, adding replicas to a standby cluster is identical to adding them to a
+primary cluster.
 
-As the architecture diagram above shows, the main difference is that there is
-no primary instance: one PostgreSQL instance is reading in the database changes
-from the backup repository, while the other replicas are replicas of that instance.
-This is known as [cascading replication](https://www.postgresql.org/docs/current/warm-standby.html#CASCADING-REPLICATION).
-replicas are cascading replicas, i.e. replicas replicating from a database server that itself is replicating from another database server.
+The main difference between a primary and standby cluster is that there is no primary instance on
+the standby: one PostgreSQL instance is reading in the database changes from either the backup
+repository or via streaming replication, while other instances are replicas of it.
+
+Any replicas created in the standby cluster are known as cascading replicas, i.e., replicas
+replicating from a database server that itself is replicating from another database server. More
+information about [cascading replication](https://www.postgresql.org/docs/current/warm-standby.html#CASCADING-REPLICATION)
+can be found in the PostgreSQL documentation.
 
 Because standby clusters are effectively read-only, certain functionality
-that involves making changes to a database, e.g. PostgreSQL user changes, is
-blocked while a cluster is in standby mode. Additionally, backups and restores
-are blocked as well. While [pgBackRest](https://pgbackrest.org/) does support
+that involves making changes to a database, e.g., PostgreSQL user changes, is
+blocked while a cluster is in standby mode. Additionally, backups and restores
+are blocked as well. While [pgBackRest](https://pgbackrest.org/) supports
 backups from standbys, this requires direct access to the primary database,
 which cannot be done until the PostgreSQL Operator supports Kubernetes
 federation.
 
-## Creating a Standby PostgreSQL Cluster
+### Types of Standby Clusters
+There are three ways to deploy a standby cluster with the Postgres Operator.
 
-For creating a standby Postgres cluster with PGO, please see the [disaster recovery tutorial]({{< relref "tutorial/disaster-recovery.md" >}}#standby-cluster)
+#### Repo-based Standby
+
+A repo-based standby will connect to a pgBackRest repo stored in an external storage system
+(S3, GCS, Azure Blob Storage, or any other Kubernetes storage system that can span multiple
+clusters). The standby cluster will receive WAL files from the repo and will apply those to the
+database.
 
-## Promoting a Standby Cluster
+![PostgreSQL Operator High-Availability Overview](/images/postgresql-ha-multi-data-center.png)
+
+#### Streaming Standby
+
+A streaming standby relies on an authenticated connection to the primary over the network. The
+standby will receive WAL records directly from the primary as they are generated.
 
-There comes a time where a standby cluster needs to be promoted to an active
-cluster. Promoting a standby cluster means that a PostgreSQL instance within
-it will become a primary and start accepting both reads and writes. This has the
-net effect of pushing WAL (transaction archives) to the pgBackRest repository,
-so we need to take a few steps first to ensure we don't accidentally create a
-split-brain scenario.
+<!-- ![PostgreSQL Operator High-Availability Overview](/images/postgresql-ha-multi-data-center.png) -->
 
-First, if this is not a disaster scenario, you will want to "shutdown" the
-active PostgreSQL cluster. This can be done by setting:
+#### Streaming Standby with an External Repo
 
-```
-spec:
-  shutdown: true
-```
+You can also configure the operator to create a cluster that takes advantage of both methods. The
+standby cluster will bootstrap from the pgBackRest repo and continue to receive WAL files as they
+are pushed to the repo. The cluster will also directly connect to primary and receive WAL records
+as they are generated. Using a repo while also streaming ensures that your cluster will still be up
+to date with the pgBackRest repo if streaming falls behind.
+
+<!-- ![PostgreSQL Operator High-Availability Overview](/images/postgresql-ha-multi-data-center.png) -->
+
+For creating a standby Postgres cluster with PGO, please see the [disaster recovery tutorial]({{< relref "tutorial/disaster-recovery.md" >}}#standby-cluster)
 
-The effect of this is that all the Kubernetes Statefulsets and Deployments for this cluster are
-scaled to 0.
+### Promoting a Standby Cluster
 
-We can then promote the standby cluster using the following:
+There comes a time when a standby cluster needs to be promoted to an active cluster. Promoting a
+standby cluster means that the standby leader PostgreSQL instance will become a primary and start
+accepting both reads and writes. This has the net effect of pushing WAL (transaction archives) to
+the pgBackRest repository. Before doing this, we need to ensure we don't accidentally create a split-brain
+scenario.
 
-```
-spec:
-  standby:
-    enabled: false
-```
+If you are promoting the standby while the primary is still running, i.e., if this is not a disaster
+scenario, you will want to [shutdown the active PostgreSQL cluster]({{< relref "tutorial/administrative-tasks.md" >}}#shutdown).
 
-This command essentially removes the standby configuration from the Kubernetes
-cluster’s DCS, which triggers the promotion of the current standby leader to a
-primary PostgreSQL instance. You can view this promotion in the PostgreSQL
-standby leader's (soon to be active leader's) logs:
+The standby can be promoted once the primary is inactive, e.g., is either `shutdown` or failing.
+This process essentially removes the standby configuration from the Kubernetes cluster’s DCS, which
+triggers the promotion of the current standby leader to a primary PostgreSQL instance. You can view
+this promotion in the PostgreSQL standby leader's (soon to be active leader's) logs.
 
-With the standby cluster now promoted, the cluster with the original active
-PostgreSQL cluster can now be turned into a standby PostgreSQL cluster. This is
-done by deleting and recreating all PVCs for the cluster and re-initializing it
-as a standby using the backup repository. Being that this is a destructive action
-(i.e. data will only be retained if any Storage Classes and/or Persistent
+Once the standby cluster is promoted, the cluster with the original active
+PostgreSQL cluster can now be turned into a standby PostgreSQL cluster. This is
+done by deleting and recreating all PVCs for the cluster and reinitializing it
+as a standby using the backup repository. Being that this is a destructive action
+(i.e., data will only be retained if any Storage Classes and/or Persistent
 Volumes have the appropriate reclaim policy configured) a warning is shown
 when attempting to enable standby.
 
 The cluster will reinitialize from scratch as a standby, just
-like the original standby that was created above. Therefore any transactions
-written to the original standby, should now replicate back to this cluster.
+like the original standby created above. Therefore any transactions
+written to the original standby should now replicate back to this cluster.
````
docs/content/references/crd.md

Lines changed: 15 additions & 5 deletions
Some generated files are not rendered by default.

docs/content/tutorial/administrative-tasks.md

Lines changed: 15 additions & 1 deletion
````diff
@@ -27,7 +27,21 @@ kubectl patch postgrescluster/hippo -n postgres-operator --type merge \
   --patch '{"spec":{"shutdown": true}}'
 ```
 
-Shutting down a cluster will terminate all of the active Pods. Any Statefulsets or Deployments are scaled to `0`.
+The effect of this is that all the Kubernetes workloads for this cluster are
+scaled to 0. You can verify this with the following command:
+
+```
+kubectl get deploy,sts,cronjob --selector=postgres-operator.crunchydata.com/cluster=hippo
+
+NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
+deployment.apps/hippo-pgbouncer   0/0     0            0           1h
+
+NAME                             READY   AGE
+statefulset.apps/hippo-00-lwgx   0/0     1h
+
+NAME                               SCHEDULE   SUSPEND   ACTIVE
+cronjob.batch/hippo-repo1-full     @daily     True      0
+```
 
 To turn a Postgres cluster that is shut down back on, you can set `spec.shutdown` to `false`.
 
````
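
Following the same `kubectl patch` pattern shown in the hunk above, turning the `hippo` cluster back on would look roughly like this (a sketch, assuming the same tutorial cluster name and namespace):

```
kubectl patch postgrescluster/hippo -n postgres-operator --type merge \
  --patch '{"spec":{"shutdown": false}}'
```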

0 commit comments
