| High availability and disaster recovery (HA/DR) for Postgres databases

By Alexey Shishkin

This is a summary of Postgres Professional’s more than six years of field experience providing HA/DR solutions to our customers; it covers both PostgreSQL and Postgres Pro databases.

While most of our customers are making progress on their journeys to the cloud, many of the HA/DR solutions they use from the on-premises world can be used in the cloud as well.

Postgres adoption rates are rapidly increasing around the world, with more and more customers building new applications on Postgres from the start, while many others migrate to Postgres from other databases. All of this means that customers trust Postgres to keep their invaluable data safe.


Usually, customers start small and then progress quickly from simple applications to large, critical ones, mainly because they find Postgres a very robust database. That is when business teams come into play with various SLAs. Most customers classify their business applications from the least critical to the most critical, and each criticality level has its own SLA. Compliance with each SLA requirement can be achieved by combining various technologies.

When we talk about SLAs for business applications, we usually rely on two magic terms: Recovery Time Objective (RTO) and Recovery Point Objective (RPO), which indicate, respectively, how long an application can be unavailable to business users and how much business data can be lost.

A sample customer mapping of RTO/RPO to application classes (with RTO expressed as an annual availability target) can look like this:
- mission critical application: availability 99.99%, RPO 0
- business critical application: availability 99.9%, RPO 0
- business operational application: availability 98%, RPO 1 hour
- office operational application: availability 90%, RPO 12 hours
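To make these percentages concrete, here is a quick back-of-the-envelope conversion of an annual availability target into a yearly downtime budget (a minimal sketch; the function name is ours, not part of any SLA standard):

```python
# Convert an availability SLA percentage into the maximum allowed downtime per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # = 525960 minutes

def max_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime budget implied by an annual availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.99, 99.9, 98.0, 90.0):
    print(f"{pct}% availability -> {max_downtime_minutes(pct):,.0f} min/year")
```

So 99.99% allows roughly 53 minutes of downtime per year, while 90% allows more than a month — which is why the two classes call for very different HA technologies.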

Below are the HA/DR technologies our customers most commonly use to achieve the desired RTO/RPO numbers:
a) HA cluster or manual switchover/failover. (Don’t confuse this with a Postgres database cluster, which is a collection of databases managed by a single instance of a running database server.)
b) Replication -- which can be logical (database transaction), streaming (database block), file system (block device) or disk/LUN (raw device), and each of them can be synchronous or asynchronous.
c) Backup -- which is a must (online or offline, full or incremental, physical or logical, data files or WAL files, via database tools or disk-array snapshots).


You can find more information about these and other HA/DR technologies here:
https://www.postgresql.org/docs/14/high-availability.html
https://www.postgresql.org/docs/14/backup.html

HA cluster
Although HA clusters are widely used for Postgres databases, some customers prefer not to use them, for the following reasons:
- the fewer moving parts in the technology stack the better
- monitoring and alerting systems work fine to inform the Operations team about the database issues
- switchover/failover scenarios are well documented, and both day and night shifts of the Operations team have adequate expertise to cope with database availability issues
- RTO is not very strict (up to 5 minutes for switchover/failover tasks)
- the number of databases is relatively small (up to 50)
But what if the required switchover/failover time is within a minute? Or what if there are hundreds or thousands of databases? What if the night-shift Operations team is less experienced than the day-shift team? In such cases, customers prefer using HA clusters.

The list of HA clusters (sorted by popularity among our customers):
- Patroni - https://github.com/zalando/patroni
- Corosync/Pacemaker - https://github.com/ClusterLabs
- Stolon - https://github.com/sorintlab/stolon
- Postgres Pro Multimaster - https://github.com/postgrespro/mmts
- Veritas - https://www.veritas.com/availability/infoscale
The first four of them are open-source, and the first three items are free of charge.
Patroni and Stolon are similar in functionality. Both depend on a DCS (Distributed Configuration Store) and require Postgres streaming replication. Both are suitable for physical servers and virtual machines (VMs).
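As an illustration, a minimal Patroni node configuration might look roughly like this (host names, paths, and credentials are hypothetical; an etcd DCS is assumed; see the Patroni documentation for the full option set):

```yaml
scope: demo-cluster            # cluster name, shared by all nodes
name: node1                    # unique name of this node
restapi:
  listen: 0.0.0.0:8008
  connect_address: node1.example.com:8008
etcd3:                         # the DCS; Patroni also supports Consul and ZooKeeper
  hosts: etcd1:2379,etcd2:2379,etcd3:2379
postgresql:
  data_dir: /var/lib/postgresql/14/main
  listen: 0.0.0.0:5432
  connect_address: node1.example.com:5432
  authentication:
    replication:
      username: replicator
      password: secret
    superuser:
      username: postgres
      password: secret
```

Each node runs its own copy of this file (with its own `name` and addresses); Patroni then manages streaming replication and leader election through the DCS.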
The other group of similar HA clusters includes Corosync/Pacemaker and Veritas. They use a VIP (Virtual IP) address to connect to the database and can be used with streaming replication and in shared disk configurations. Corosync/Pacemaker and Veritas are mostly used with physical servers and can be used to build geo-clusters (two separate HA clusters with replication between them over a 50+ km distance).
Postgres Pro Multimaster differs from the other four: it provides an HA cluster in which every node processes read-write requests, and it delivers the minimal possible switchover/failover time (single-digit seconds).

Replication
Customers mainly use replication for disaster recovery, for example, when the main site is lost forever and the RPO requirement is strict.
If the required RPO is strictly zero, synchronous replication has to be used between main and DR sites. Otherwise, asynchronous replication is enough.
Postgres streaming replication is the most popular replication option among our customers. It is included in both PostgreSQL and Postgres Pro, is well integrated with HA clusters (Patroni, Corosync/Pacemaker, Stolon), and uses a master/standby configuration, where the master serves read/write requests while standbys can serve only read-only requests.
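For reference, the master-side settings that control synchronous streaming replication look roughly like this (standby names are hypothetical; the exact value depends on your topology):

```
# postgresql.conf on the master (illustrative values)
synchronous_commit = on
# wait for the first available standby from the list, in priority order:
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
# quorum-style alternative: any one of the listed standbys may confirm a commit:
# synchronous_standby_names = 'ANY 1 (standby1, standby2)'
```

Leaving `synchronous_standby_names` empty makes all replication asynchronous, which trades a non-zero RPO for commit latency that does not depend on the standby.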
Postgres logical replication is slower by design than streaming replication; however, logical replication is more flexible. Postgres Pro Multimaster uses logical replication to set up an HA cluster where all HA cluster nodes can handle read/write requests.
Disk/LUN replication offers maximum performance for write-intensive load profiles; however, the disk arrays on both sites have to support such replication. Of the HA clusters above, only Veritas currently integrates with disk/LUN replication.

Backup
The most popular tool for database backup among our customers is ‘pg_probackup’:
https://github.com/postgrespro/pg_probackup
It supports almost any kind of backup and restore of a Postgres database, including incremental backup and restore.
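A typical pg_probackup workflow looks roughly like this (the catalog path, data directory, and instance name are hypothetical; check `pg_probackup --help` for the options available in your version):

```shell
# one-time setup: initialize the backup catalog and register the instance
pg_probackup init -B /backup
pg_probackup add-instance -B /backup -D /var/lib/postgresql/14/main --instance main

# take a full backup, then incremental (delta) backups
pg_probackup backup -B /backup --instance main -b FULL
pg_probackup backup -B /backup --instance main -b DELTA

# restore the latest backup into the data directory
pg_probackup restore -B /backup --instance main -D /var/lib/postgresql/14/main
```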
The other popular strategy for backup/restore employs disk-array snapshots, which allows very fast backup/restore of the database (seconds to single-digit minutes), regardless of its size.
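A snapshot-based backup is usually bracketed by the Postgres backup API so that the snapshot is consistent (PostgreSQL 14 syntax shown; in 15+ these functions were renamed to pg_backup_start/pg_backup_stop):

```sql
-- on the master: start a non-exclusive backup (label, fast checkpoint, non-exclusive)
SELECT pg_start_backup('array_snapshot', true, false);
-- ... take the disk-array snapshot of the data volumes here ...
-- stop the backup and wait for the required WAL to be archived
SELECT * FROM pg_stop_backup(false, true);
```

Because the array copies blocks rather than reading them through Postgres, the elapsed time depends on the array, not on the database size — hence the seconds-to-minutes figures above.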

Having all options at their disposal, customers choose the following most popular HA/DR architectures:

1) Patroni-based HA cluster with Postgres streaming replication
- local (within one site) or stretched (between two or three sites, up to 30 km from each other)
- master, synchronous standby and asynchronous standby (3-node configuration) or master, synchronous standby and quorum device (2+1 node configuration)
- cluster nodes are either physical servers or VMs
- TCP-proxy to connect to master or standby(s)
- RPO is 0 in the case of a stretched HA cluster
- RPO can be non-zero in the case of a local HA cluster (for example, the main site is lost forever)
- switchover time - up to 15 seconds, failover time - up to 30 seconds
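One common way to implement the TCP proxy in this architecture is HAProxy health-checking Patroni's REST API, so that client connections always land on the current master (an illustrative fragment; host names and ports are assumptions):

```
# haproxy.cfg fragment: route writes to whichever node reports itself as primary
listen postgres_primary
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 check port 8008
    server node2 node2:5432 check port 8008
    server node3 node3:5432 check port 8008
```

Patroni returns HTTP 200 on `/primary` only on the current master, so after a failover HAProxy redirects traffic automatically, without clients needing to know the topology.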

2) Corosync/Pacemaker-based HA cluster with Postgres streaming replication
- local (within one site) or stretched (between two or three sites, up to 30 km from each other)
- master, synchronous standby and asynchronous standby (3-node configuration) or master, synchronous standby and quorum device (2+1 node configuration)
- cluster nodes are physical servers
- VIP(s) for master and standbys
- RPO is 0 in the case of a stretched HA cluster
- RPO can be non-zero in the case of a local HA cluster (for example, the main site is lost forever)
- switchover time - up to 15 seconds, failover time - up to 30 seconds

3) Corosync/Pacemaker-based geo HA cluster with Postgres streaming replication
- main and DR regions with 50+ km distance between them
- Postgres streaming replication between regions (cascade asynchronous streaming replication between main and DR regions)
- each region has a local (within one site) or stretched (between two or three sites, up to 30 km from each other) Corosync/Pacemaker-based HA cluster with streaming replication
- master, synchronous standby and asynchronous standby (3-node configuration) or master, synchronous standby and quorum device (2+1 node configuration)
- cluster nodes are physical servers
- VIP(s) for master and standbys
- RPO can be non-zero in case of main region total failure
- switchover time - up to 1 minute, failover time - up to several minutes (depending on the failure type)

4) Veritas-based stretched HA cluster with disk/LUN replication
- stretched (between two sites, up to 30 km from each other)
- main and reserve nodes (2 node configuration)
- cluster nodes are physical servers
- VIP for the main node
- RPO is 0
- switchover time - up to 30 seconds, failover time - up to several minutes (depending on the failure type)
 
5) Manual switchover/failover (no HA cluster) with Postgres streaming replication
- local (within one site) or stretched (between two or three sites, up to 30 km from each other)
- master, synchronous standby and asynchronous standby (3-node configuration) or master, synchronous standby and quorum device (2+1 node configuration)
- database nodes are either physical servers or VMs
- TCP-proxy to connect to master or standby(s)
- RPO is 0 in case of stretched replication
- RPO can be non-zero in case of local replication (for example, the main site is lost forever)
- switchover/failover time depends on the Operations team experience

6) Corosync/Pacemaker or Veritas shared disk HA cluster (no replication)
- local (within one site)
- main and reserve nodes (2 node configuration)
- cluster nodes are physical servers
- VIP for the main node
- RPO is determined by the time of the latest backup
- switchover time - up to 30 seconds, failover time - up to several minutes (depending on the failure type)

Anything specific from Postgres Pro regarding HA/DR enhancements?

The functionality in Postgres Pro Enterprise includes:

I) Support for relaxed synchronous replication restrictions, which allows the master server to continue running while some of the standbys are temporarily unavailable:
https://postgrespro.com/docs/enterprise/13/runtime-config-replication#GUC-SYNCHRONOUS-STANDBY-GAP

II) Automatic page repair via streaming replication from standby in case of data corruption:
https://postgrespro.com/docs/enterprise/13/warm-standby#REPAIR-PAGE-FROM-STANDBY

III) Corrupted WAL data restore from in-memory WAL buffers:
https://postgrespro.com/docs/enterprise/13/wal-restoration

IV) Support for database minor version upgrades without a database instance restart:
https://postgrespro.com/docs/enterprise/13/release-proee-13-2-1

V) Compressed file system (CFS) provides database compression at the database block level:
https://postgrespro.com/docs/enterprise/13/cfs