C.7. Disaster Recovery Cluster Requirements #
- C.7.1. Terms and Abbreviations
- C.7.2. High-level Description of the DRC
- C.7.3. Replication Topology
- C.7.4. Hardware and Network Requirements
- C.7.5. Replication Mechanisms
- C.7.6. Monitoring and Management
- C.7.7. Security
- C.7.8. QA and Rollback
- C.7.9. Backup in a Geographically Distributed System
- C.7.10. Documentation and Regulations
Note: The underlying functionality is under development. For production usage, contact Support.
C.7.1. Terms and Abbreviations #
DB — Database.
DBMS — Database management system.
DC — Data center.
MDC — Main data center.
BDC — Backup data center.
HAC — High availability cluster.
DRC — Disaster recovery cluster.
C.7.2. High-level Description of the DRC #
The MDC hosts the main cluster shards and the etcd cluster. Shards are high-availability clusters, each consisting of two nodes with Postgres Pro DBMS instances: one primary node and one synchronous standby. Every shard runs the shardmand service, which checks the Postgres Pro DBMS instances and exchanges information with the etcd cluster, thus providing Shardman clustering. The etcd cluster consists of three nodes, which ensures a quorum.
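The quorum can be verified with the standard etcd client. A minimal sketch, where the endpoint names are assumptions to be replaced with the actual etcd cluster addresses:

    # Check that all three etcd nodes are healthy and a quorum is available;
    # the endpoints below are placeholders.
    etcdctl --endpoints=etcd1:2379,etcd2:2379,etcd3:2379 endpoint health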
To ensure disaster recovery, the customer's BDC must host an identical cluster with the identical configuration and set of components. By default, the standby Shardman cluster nodes are disabled. Continuous log delivery from the MDC to the BDC is asynchronous and uses the physical replication mechanisms. It is based on the standard pg_receivewal utility, which writes WALs to the default instance directory $PGDATA/pg_wal. This utility is managed by the cluster software. When a syncpoint is detected in the standby etcd cluster, the standby Shardman cluster nodes are started by shardmand, and the WALs are applied up to the LSN received with the syncpoint. The etcd clusters in different DCs are isolated, therefore, to distribute the syncpoint information, a script periodically copies it from the MDC etcd to the BDC etcd.
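The WAL delivery step can be pictured as a plain pg_receivewal invocation on a BDC node. This is a minimal sketch for illustration only: the host, user, and slot names are assumptions, and in a real DRC the utility is started and supervised by the cluster software:

    # Stream WALs from an MDC shard node into the local WAL directory;
    # the host, user, and slot names are placeholders.
    pg_receivewal -h mdc-shard1.example.com -U replication_user \
        --slot=drc_shard1_slot -D "$PGDATA/pg_wal"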
C.7.3. Replication Topology #
Streaming physical replication is provided:
Between the Postgres Pro DBMS shard nodes within the MDC (synchronous)
Between the Postgres Pro DBMS shard nodes within the BDC (synchronous)
Between the Postgres Pro DBMS shard nodes across DCs (asynchronous)
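On a shard primary, the synchronous and asynchronous links can be distinguished via the standard pg_stat_replication view. A minimal sketch (standby names vary per deployment):

    # sync_state is 'sync' for the intra-DC standby and 'async' for the
    # cross-DC WAL receiver.
    psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"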
C.7.4. Hardware and Network Requirements #
MDC and BDC hardware must have identical system resources and configuration for all the DRC components.
DCs must be connected with a fiber optic network with a capacity of at least 20 Gbit/s. A backup channel is also required.
C.7.5. Replication Mechanisms #
To provide high-availability and disaster recovery clusters, Shardman uses the built-in Postgres Pro streaming physical replication mechanism; replication to the BDC is asynchronous.
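To illustrate how the two modes coexist, here is a hypothetical postgresql.conf fragment for a shard primary; in practice Shardman manages these settings itself, and the standby name is an assumption:

    # The intra-DC standby is listed as synchronous; connections not
    # listed here (such as the cross-DC WAL receiver) stay asynchronous.
    synchronous_standby_names = 'FIRST 1 (shard1_standby)'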
Automatic recovery of a high-availability Shardman cluster is ensured by the cluster software.
DRC recovery is only provided in semi-automatic mode and must be initiated manually.
C.7.6. Monitoring and Management #
Shardman cluster monitoring and management is provided within one DC with the shardmanctl utility.
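A minimal monitoring sketch using the documented utility (run within the DC being monitored; connection settings come from the shardmanctl configuration):

    # Show the overall status of the shards and their replication state.
    shardmanctl status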
C.7.7. Security #
C.7.7.1. Encrypting Data Across A Network (TLS/SSL) #
A secure channel between DCs is required.
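For illustration, a hypothetical configuration fragment that enforces TLS for cross-DC replication traffic; the certificate paths, subnet, and role name are assumptions to be adapted to the site security policy:

    # postgresql.conf: enable server-side TLS.
    ssl = on
    ssl_cert_file = 'server.crt'
    ssl_key_file = 'server.key'

    # pg_hba.conf: accept replication connections from the remote DC
    # only over TLS (the subnet and role are placeholders).
    hostssl  replication  replication_user  10.20.0.0/16  scram-sha-256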
C.7.7.2. Inter-node Authentication and Authorization #
Inter-node authentication and authorization is ensured by the built-in Postgres Pro DBMS tools.
C.7.7.3. Protection from Unauthorized Access to Standby Servers #
Protection from unauthorized access to standby servers is provided by the operating system and network tools.
C.7.8. QA and Rollback #
It is recommended to perform periodic switchovers.
C.7.8.1. Data Integrity Check After Failover #
Data integrity check after a failover is provided by the shardmanctl probackup backup utility.
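A minimal validation sketch; the validate subcommand and its behavior are modeled on pg_probackup conventions, and any additional flags depend on the backup configuration:

    # Verify that the backups in the configured repository are not corrupted.
    shardmanctl probackup validate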
C.7.8.2. Switchover to BDC #
Should the MDC fail, the administrator must make sure that it is indeed unavailable and initiate the promotion of the standby nodes. The standby cluster upgrades its state from standby to master. This process is initiated and managed only by the shardmanctl utility; no other procedures are required.
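After the promotion completes, each BDC shard node should report that it is no longer in recovery. A minimal verification sketch, assuming hypothetical host names:

    # pg_is_in_recovery() returns 'f' on a promoted (primary) node.
    for host in bdc-shard1 bdc-shard2; do
        psql -h "$host" -Atc 'SELECT pg_is_in_recovery();'
    done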
C.7.8.3. MDC Recovery #
To recover remote nodes to the MDC, create a backup of the main cluster and restore it on these nodes. The backup can be created either as a cold backup or with the pg_probackup repository. Both options require restoring the backup to the MDC. Once the DB is restored from the backup, run pg_receivewal: it connects to a special replication slot of the primary or standby shard in the BDC, receives WAL segments asynchronously, and writes them to the $PGDATA/pg_wal directory of the main node.
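For the pg_probackup option, the restore step on a recovering MDC node might look as follows; the repository path and instance name are assumptions:

    # Restore the node's data directory from the pg_probackup repository.
    pg_probackup restore -B /mnt/backups --instance shard1 -D "$PGDATA"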
In the BDC cluster, a script creates a syncpoint at a specified interval. The syncpoint is written to the BDC etcd and sent to the MDC etcd. Once a syncpoint is in etcd, the MDC standby cluster nodes check whether a WAL with this record has been received. If it has been received by all the MDC standby cluster nodes, the cluster software starts the DBMS server in WAL recovery mode up to the syncpoint. Once the syncpoint is reached, no more WALs are applied. If all nodes successfully applied the WAL records, the DBMS server is stopped, followed by another cycle of receiving WALs, checking the syncpoint, and recovery.
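The syncpoint propagation between the isolated etcd clusters can be sketched with the standard etcd client; the key name and endpoint variables are assumptions, and the documented mechanism is a periodically run script:

    # Copy the latest syncpoint from the BDC etcd to the MDC etcd.
    SYNCPOINT=$(etcdctl --endpoints="$BDC_ETCD" get /shardman/syncpoint --print-value-only)
    etcdctl --endpoints="$MDC_ETCD" put /shardman/syncpoint "$SYNCPOINT"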
C.7.8.4. Switching Back to MDC #
To switch back to the MDC, create a cluster backup in the BDC, transfer it to the MDC, and run the MDC nodes in standby mode. Once the missing WALs are received, the BDC cluster nodes are stopped, and the MDC cluster nodes are promoted.
C.7.9. Backup in a Geographically Distributed System #
Within the GDS (geographically distributed system), the BDC must have backup storage identical to that of the MDC. Regular syncing between the main and backup storage is also required.
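Regular syncing can be implemented with any file-transfer tool; a minimal sketch using rsync, where the paths and host are assumptions:

    # Mirror the MDC backup storage to the BDC on a schedule (e.g., cron).
    rsync -a --delete /mnt/backups/ backup@bdc-storage.example.com:/mnt/backups/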
C.7.9.1. Storing Backups in Geographically Distributed Storages #
The backup retention period is defined by the backup policy.
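For a pg_probackup repository, such a policy can be sketched with the standard retention flags; the repository path, instance name, and the 7-day window are assumptions:

    # Remove backups that fall outside the retention window.
    pg_probackup delete -B /mnt/backups --instance shard1 \
        --delete-expired --retention-window=7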
C.7.10. Documentation and Regulations #
For detailed instructions on disaster failover and normal switchover back to the MDC, contact Postgres Pro Support.